Running code on multiple machines
Overview
Teaching: 120 min
Exercises: 60 min
Questions
What is a locale?
Objectives
First objective.
So far we have been working with single-locale Chapel codes that may run on one or many cores on a single compute node, making use of the shared memory space and accelerating computations by launching concurrent tasks on individual cores in parallel. Chapel codes can also run on multiple nodes on a compute cluster. In Chapel this is referred to as multi-locale execution.
If you work inside a Chapel Docker container, e.g., chapel/chapel-gasnet, the container environment simulates a multi-locale cluster, so you can compile and launch multi-locale Chapel codes directly, specifying the number of locales with the -nl flag:
$ chpl --fast mycode.chpl -o mybinary
$ ./mybinary -nl 4
Inside the Docker container, your code will not run any faster on multiple locales than on a single locale, since you are emulating a virtual cluster and all tasks run on the same physical node. To achieve actual speedup, you need to run your parallel multi-locale Chapel code on a real physical cluster, which we hope you have access to for this session.
On a real HPC cluster you would need to submit either an interactive or a batch job asking for several nodes and then run a multi-locale Chapel code inside that job. In practice, the exact commands depend on how the multi-locale Chapel was built on the cluster.
When you compile a Chapel code with the multi-locale Chapel compiler, two binaries will be produced. One is called mybinary and is a launcher binary used to submit the real executable mybinary_real. If the Chapel environment is configured properly with the launcher for the cluster’s physical interconnect (which might not always be possible due to a number of factors), then you would simply compile the code and use the launcher binary mybinary to submit the job to the queue:
$ chpl --fast mycode.chpl -o mybinary
$ ./mybinary -nl 2
The exact parameters of the job, such as the maximum runtime and the requested memory, can be specified with Chapel environment variables. One possible drawback of this launching method is that, depending on your cluster setup, Chapel might have access to all physical cores on each node participating in the run. This will present problems if you are scheduling jobs by core and not by node, since part of each node may then be allocated to someone else’s job.
Note that on Compute Canada clusters this launching method works without problems. On these clusters multi-locale Chapel is provided by the chapel-ofi module (for the OmniPath interconnect on Cedar) and the chapel-ucx module (for the InfiniBand interconnect on Graham, Béluga, Narval), so, depending on the cluster, you will load Chapel using one of the two lines below:
$ module load gcc chapel-ofi # for the OmniPath interconnect on Cedar cluster
$ module load gcc chapel-ucx # for the InfiniBand interconnect on Graham, Béluga, Narval clusters
We can also launch multi-locale Chapel codes using the real executable mybinary_real. For example, for an interactive job you would type:
$ salloc --time=0:30:0 --nodes=4 --cpus-per-task=3 --mem-per-cpu=1000 --account=def-guest
$ chpl --fast mycode.chpl -o mybinary
$ srun ./mybinary_real -nl 4 # will run on four locales with max 3 cores per locale
Production jobs would be launched with the sbatch command and a Slurm launch script as usual.
For the rest of this class we assume that you have a working multi-locale Chapel environment, whether provided by a Docker container or by multi-locale Chapel on a physical HPC cluster. We will run all examples on four nodes with three cores per node.
Intro to multi-locale code
Let us test our multi-locale Chapel environment by launching the following code:
writeln(Locales);
This code will print the built-in global array Locales. Running it on four locales will produce
LOCALE0 LOCALE1 LOCALE2 LOCALE3
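As a small extension of this test, the sketch below also prints the number of locales and walks through the Locales array by index. It assumes only the built-in constant numLocales and the id field of a locale; the message wording is ours.

```chapel
// numLocales is a built-in constant whose value is set by the -nl flag at launch
writeln("running on ", numLocales, " locales");

// Locales is indexed from 0 to numLocales-1, and each element carries its own id
for i in 0..#numLocales do
  writeln("Locales[", i, "] has id ", Locales[i].id);
```

When launched with -nl 4, this should report four locales with ids 0 through 3.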
We want to run some code on each locale (node). For that, we can cycle through locales:
for loc in Locales do   // this is still a serial program
  on loc do             // run the next line on locale `loc`
    writeln("this locale is named ", here.name);
This will produce
this locale is named cdr544
this locale is named cdr552
this locale is named cdr556
this locale is named cdr692
Here the built-in variable here refers to the locale on which the code is currently running, and here.name is its hostname. We started a serial for loop cycling through all locales, and on each locale we printed its name, i.e., the hostname of each node. This program ran in serial, starting a task on each locale only after completing the same task on the previous locale. Note the order in which locales were listed.
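To make the role of the on statement more concrete, here is a minimal sketch that prints here.id before, inside, and after the loop. It relies on the fact that a Chapel program starts executing on locale 0; the message text is only illustrative.

```chapel
// the program starts on locale 0
writeln("before the loop: running on locale ", here.id);

for loc in Locales do
  on loc do   // execution moves to locale `loc` for this one statement
    writeln("inside the on statement: running on locale ", here.id);

// once each on statement completes, control returns to the original locale
writeln("after the loop: back on locale ", here.id);
```

The first and last lines should both report locale 0, while the lines in between report each locale in turn.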
To run this code in parallel, so that the locales are visited by concurrent tasks rather than one after another, we simply need to replace for with forall:
forall loc in Locales do   // now this is a parallel loop
  on loc do
    writeln("this locale is named ", here.name);
This runs the loop iterations in parallel, so the order in which the print statements execute depends on runtime conditions and can change from run to run:
this locale is named cdr544
this locale is named cdr692
this locale is named cdr556
this locale is named cdr552
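With forall, the number of tasks actually created is left to the runtime. If you want a guarantee of exactly one task per locale, one alternative (not used in the text above) is Chapel's coforall loop, which creates a separate task for every iteration. A minimal sketch:

```chapel
coforall loc in Locales do   // create exactly one task per loop iteration
  on loc do                  // move that task to locale `loc`
    writeln("this locale is named ", here.name);
```

As with forall, the output order is not deterministic and can differ between runs.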
We can print a few other attributes of each locale. Here it is actually useful to revert to the serial for loop so that the print statements appear in order:
use Memory.Diagnostics;   // provides MemUnits; in newer Chapel releases this module is named MemDiagnostics
for loc in Locales do
  on loc {
    writeln("locale #", here.id, "...");
    writeln(" ...is named: ", here.name);
    writeln(" ...has ", here.numPUs(), " processor cores");
    writeln(" ...has ", here.physicalMemory(unit=MemUnits.GB, retType=real), " GB of memory");
    writeln(" ...has ", here.maxTaskPar, " maximum parallelism");
  }
locale #0...
...is named: cdr544
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
locale #1...
...is named: cdr552
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
locale #2...
...is named: cdr556
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
locale #3...
...is named: cdr692
...has 3 processor cores
...has 125.804 GB of memory
...has 3 maximum parallelism
Note that Chapel correctly determines the number of cores available inside our job on each node, as well as the maximum parallelism (which is the same as the number of cores available!). However, the memory it lists is the total physical memory on each node available to all running jobs, which is not the same as the memory per node allocated to our job.
Key Points
A locale in Chapel is a shared-memory node on a cluster.
We can cycle in serial or parallel through all locales.