Inria autumn school "High Performance Numerical Simulation": StarPU Tutorial - Bordeaux, November 7th 2019
This tutorial is part of the Inria autumn school "High Performance Numerical Simulation" taking place in Bordeaux on November 4th-8th, 2019. The slides are available as PDF.
Connection to the platform
The work will be done
on PlaFRIM.
We'll rely on ssh to connect to the platform. On Thursday October 31,
you received an email from PlaFRIM support with instructions to
connect to the machine, together with a password. The connection steps
look like the following lines, where <myname> is to be changed
according to your login instructions.
$ ssh hpcs-<myname>@formation.plafrim.fr
$ ssh plafrim
These two steps can be gathered into a single one. To do so, add the
following lines to your .ssh/config file, where <myname> is to be
changed according to your login instructions.
Host plafrim-hpcs
    ForwardAgent yes
    ForwardX11 yes
    User hpcs-<myname>
    ProxyCommand ssh -T -q -o "ForwardAgent yes" -l hpcs-<myname> formation.plafrim.fr 'ssh-add -t 1 && nc plafrim 22'
Then, to log on to the platform, you can use ssh -X plafrim-hpcs or
ssh -Y plafrim-hpcs. In some cases, ssh -Y may create problems by
asking for a key; if so, change to ssh -X.
$ ssh -X plafrim-hpcs
$ ssh -Y plafrim-hpcs % can create problems with some environments
All the files needed in the following sections are available in this archive. You can get the file directly from the platform using the following command.
$ wget https://starpu.gitlabpages.inria.fr/tutorials/2019-11-HPNS-Inria/material.tgz
Setup of the environment with Guix
One of the objectives of this school is to help you master your environment. We will use Guix to handle that. Guix is an advanced distribution of the GNU operating system developed by the GNU Project, which respects the freedom of computer users. It is a transactional package manager, with support for per-user package installations. Users can install their own packages without interfering with each other, yet without unnecessarily increasing disk usage or rebuilding every package.
Thanks to the joint effort of the Guix development and PlaFRIM teams, Guix is readily available on PlaFRIM, as detailed here.
The guix-hpc initiative is a solid foundation for reproducible HPC science. The software environments created with Guix are fully reproducible: a package built from a specific Guix commit on your laptop will be exactly the same as the one built on the HPC cluster you deploy it to -- PlaFRIM in our case -- usually bit-for-bit.
Guix and its package collection are updated by running guix
pull
(see Invoking guix pull). By default, guix pull
downloads
and deploys Guix itself from the official GNU Guix repository.
This can be customized by
defining channels
in the ~/.config/guix/channels.scm
file. A channel specifies
a URL and branch of a Git repository to be deployed, and guix pull
can be instructed to pull from one or more channels. In other words,
channels can be used to customize and to extend Guix. We propose to
set up your channels as follows.
Once connected to the platform, create a
file $HOME/.config/guix/channels.scm
with the following contents
(list (channel
        (name 'guix-hpc)
        (url "https://gitlab.inria.fr/guix-hpc/guix-hpc.git")
        (commit "446507e4ee8ec9ca6335679c8bb96bfb7d929538"))
      (channel
        (name 'guix-hpc-non-free)
        (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git")
        (commit "e058192f39e427c9fac8c31f9fcb27b0f671e43f"))
      (channel
        (name 'guix)
        (url "https://git.savannah.gnu.org/git/guix.git")
        (commit "bbad38f4d8e6b6ecc15c476b973094cdf96cdeae")))
You then need to call
$ guix build hello
to make sure to initialize your Guix environment, then
$ guix pull
to get the proper package definitions.
You can then go to a compute node
$ srun -p hpc -N 1 --pty bash -i
srun will not allow you to launch X applications from the compute
node. To do so, you should use salloc
$ salloc -p hpc -N 1
$ squeue -u $USER
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    427522       hpc     bash hpcs-fur  R    INVALID      1 miriel002
$ ssh -X miriel002
where miriel002
should be replaced by the name of the
machine which has been allocated to you.
Installing StarPU on your system
To be able to exercise accelerator support without having real GPU cards, we will use the simulation version of StarPU, built on top of SimGrid. The following Guix command will put you in a dedicated StarPU SimGrid environment
$ guix environment --pure starpu-simgrid --ad-hoc starpu-simgrid grep coreutils emacs vim less openssh inetutils gv -- /bin/bash --norc
You also need to load the environment from the init.sh shell script from the archive file.
$ . ./init.sh
Then you can see that StarPU detects the simulated platform with
$ starpu_machine_display
Session Part 1: Task-based Programming Model
Application Example: Vector Scaling
This example is at the root of the archive file.
Making it and Running it
A typical Makefile
for
applications using StarPU is the following:
CFLAGS += $(shell pkg-config --cflags starpu-1.3)
LDLIBS += $(shell pkg-config --libs starpu-1.3)

%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o # vector_scal_cuda.o vector_scal_opencl.o
If you have CUDA or OpenCL available on your system, you can uncomment the corresponding object files on the last line, and uncomment the corresponding link flags. Here are the source files for the application, available in the material tarball:
- The main application
- The CPU implementation of the codelet
- The CUDA implementation of the codelet
- The OpenCL host implementation of the codelet
- The OpenCL device implementation of the codelet
Run make vector_scal_task_insert, and run the resulting
vector_scal_task_insert executable using the given script
vector_scal.sh. It should work: it simply scales a given vector by a
given factor.
$ make vector_scal_task_insert
$ ./vector_scal_task_insert
Note that if you are using the simulation version of StarPU, the computation
will not be performed, and thus the final value will be equal to the initial
value, but the timing provided by starpu_timing_now()
will correspond
to the correct execution time.
Computation Kernels
Examine the source code, starting from vector_scal_cpu.c
: this is
the actual computation code, which is wrapped into a vector_scal_cpu
function which takes a series of DSM interfaces and a non-DSM parameter. The
code simply gets the factor value from the non-DSM parameter,
an actual pointer from the first DSM interface,
and performs the vector scaling.
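For reference, the shape of such a CPU codelet function looks like the following sketch (hedged: names match the tutorial's vector_scal_cpu, but the authoritative code is vector_scal_cpu.c in the tarball):

#include <starpu.h>

/* CPU codelet: buffers[] holds the DSM interfaces given to the task,
 * cl_arg carries the non-DSM parameters packed at insertion time. */
void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    /* First (and only) DSM interface: the registered vector. */
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    /* Non-DSM parameter: the scaling factor. */
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);

    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}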
The GPU implementation, in vector_scal_cuda.cu
, is basically
the same, with the host part (vector_scal_cuda
) which extracts the
actual CUDA pointer from the DSM interface, and passes it to the device part
(vector_mult_cuda
) which performs the actual computation.
The OpenCL implementation in vector_scal_opencl.c
and
vector_scal_opencl_kernel.cl
is more hairy due to the low-level aspect
of the OpenCL standard, but the principle remains the same.
You can set the environment variable STARPU_WORKER_STATS to 1 when running your application to see the number of tasks executed by each device. You can see the whole list of environment variables here.
$ STARPU_WORKER_STATS=1 ./vector_scal_task_insert

# to force the implementation on a GPU device, by default, it will enable CUDA
$ STARPU_WORKER_STATS=1 STARPU_NCPU=0 ./vector_scal_task_insert

# to force the implementation on an OpenCL device
$ STARPU_WORKER_STATS=1 STARPU_NCPU=0 STARPU_NCUDA=0 ./vector_scal_task_insert
Main Code
Now examine vector_scal_task_insert.c
: the cl
(codelet) structure simply gathers pointers on the functions
mentioned above.
The main function:
- Allocates a vector application buffer and fills it.
- Registers it to StarPU, and gets back a DSM handle. From now on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, the main-memory copy then being outdated.
- Submits an (asynchronous) task to StarPU.
- Waits for task completion.
- Unregisters the vector from StarPU, which brings back the modified version to main memory.
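Put together, a minimal sketch of this structure could look as follows (assuming the vector_scal_cpu function above and a vector of NX elements; vector_scal_task_insert.c is the reference):

#include <starpu.h>
#include <stdint.h>

#define NX 2048

extern void vector_scal_cpu(void *buffers[], void *cl_arg);

/* The codelet gathers the pointers to the implementations. */
static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    /* .cuda_funcs / .opencl_funcs can be added when available */
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;
    unsigned i;

    starpu_init(NULL);
    for (i = 0; i < NX; i++)
        vector[i] = 1.0f;

    /* Register the buffer; from now on, use the handle, not the pointer. */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    /* Asynchronous task submission. */
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, handle,
                       0);

    starpu_task_wait_for_all();

    /* Bring the up-to-date content back to main memory. */
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}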
Data Partitioning
In the previous section, we submitted only one task. We here discuss how to partition data so as to submit multiple tasks which can be executed in parallel by the various CPUs and GPUs.
Let's examine mult.c.
- The computation kernel, cpu_mult, is a trivial matrix multiplication kernel, which operates on 3 given DSM interfaces. These will actually not be whole matrices, but only small parts of matrices.
- init_problem_data initializes the whole A, B and C matrices.
- partition_mult_data does the actual registration and partitioning. Matrices are first registered completely, then two partitioning filters are declared. The first one, vert, is used to split B and C vertically. The second one, horiz, is used to split A and C horizontally. We thus end up with a grid of pieces of C to be computed from stripes of A and B.
- launch_tasks submits the actual tasks: for each piece of C, take the appropriate piece of A and B to produce the piece of C.
- The access mode is interesting: A and B just need to be read from, and C will only be written to. This means that StarPU will make copies of the pieces of A and B across the machine where tasks need them, and will give the tasks uninitialized buffers for the pieces of C, since they will not be read from.
- The main code initializes StarPU and data, launches tasks, unpartitions data, and unregisters it. Unpartitioning is an interesting step: until then, the pieces of C reside on the various GPUs where they have been computed. Unpartitioning will collect all the pieces of C into main memory to form the whole C result matrix.
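As a hedged sketch of what partition_mult_data and launch_tasks do (A_handle, B_handle, C_handle, nslicesx, nslicesy and cl are placeholders; mult.c is the reference):

/* Two filters: one splitting a matrix into vertical blocks, one into
 * horizontal blocks. C gets both, yielding a 2D grid of pieces. */
struct starpu_data_filter vert =
{
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = nslicesx,
};
struct starpu_data_filter horiz =
{
    .filter_func = starpu_matrix_filter_block,
    .nchildren = nslicesy,
};

starpu_data_partition(B_handle, &vert);
starpu_data_partition(A_handle, &horiz);
starpu_data_map_filters(C_handle, 2, &vert, &horiz);

/* One task per piece of C: stripes of A and B are read, C is written. */
for (unsigned x = 0; x < nslicesx; x++)
    for (unsigned y = 0; y < nslicesy; y++)
        starpu_task_insert(&cl,
                           STARPU_R, starpu_data_get_sub_data(A_handle, 1, y),
                           STARPU_R, starpu_data_get_sub_data(B_handle, 1, x),
                           STARPU_W, starpu_data_get_sub_data(C_handle, 2, x, y),
                           0);

/* Collect the pieces of C back into the whole result matrix. */
starpu_data_unpartition(C_handle, STARPU_MAIN_RAM);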
Run the application with the script mult.sh, enabling some statistics:
#!/bin/bash
make mult
STARPU_WORKER_STATS=1 ./mult
The figures show how the computation was distributed on the various processing units.
Another Example
gemm/xgemm.c is a very similar matrix-matrix product example, but
which makes use of BLAS kernels for much better performance. The
mult_kernel_common function shows how we call DGEMM (CPUs) or
cublasDgemm (GPUs) on the DSM interface.
Let's execute it.
#!/bin/bash
make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm
Exercise
Take the vector example again, and add partitioning support to it, using the
matrix-matrix multiplication as an example. Here we will use the
starpu_vector_filter_block()
filter function. You can see the list of
predefined filters provided by
StarPU here.
When using the SimGrid version of StarPU, running a partitioned
version of vector_scal_task_insert may produce the following error:

[starpu][_starpu_simgrid_submit_job][assert failure] Codelet vector_scal does not have a perfmodel, \
or is not calibrated enough, please re-run in non-simgrid mode until it is calibrated

This is because the performance model we provide for this tutorial is
only calibrated for vectors with 2048 elements. To avoid the issue,
just multiply the number of elements (NX) by the number of sub-data
you defined in struct starpu_data_filter, so that each sub-data is a
vector of 2048 elements.
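A minimal sketch of one possible approach (hedged: NB_PARTS is a hypothetical sub-data count; handle, cl and factor are those of the vector example):

#define NB_PARTS 4  /* hypothetical; remember to scale NX accordingly */

/* Split the vector into NB_PARTS equal sub-vectors. */
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = NB_PARTS,
};
starpu_data_partition(handle, &f);

/* One task per sub-vector; each piece may run on a different worker. */
for (unsigned i = 0; i < NB_PARTS; i++)
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, starpu_data_get_sub_data(handle, 1, i),
                       0);

starpu_task_wait_for_all();
starpu_data_unpartition(handle, STARPU_MAIN_RAM);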
We provide a solution for the exercise here.
Session Part 2: Optimizations
This is based on StarPU's documentation optimization chapter.
Data Management
Task Submission
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave as follows: submit the whole graph of tasks, and wait for termination.
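Schematically (a sketch; ntasks, handles and cl are placeholders for the application's own data):

/* Submit the whole task graph up front: starpu_task_insert() returns
 * as soon as the task is queued, without waiting for its execution. */
for (unsigned i = 0; i < ntasks; i++)
    starpu_task_insert(&cl, STARPU_RW, handles[i], 0);

/* Synchronize only once, at the very end. */
starpu_task_wait_for_all();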
Performance Model Calibration
Performance prediction is essential for proper scheduling decisions; the
performance models thus have to be calibrated. This is done automatically by
StarPU when a codelet is executed for the first time. Once this is done, the
result is saved to a file in $STARPU_PERF_MODEL_DIR
for later re-use. The
starpu_perfmodel_display
tool can be used to check the resulting
performance model.
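For reference, a codelet opts into this mechanism by declaring a performance model; a minimal sketch (the symbol gives the file name used under the sampling directory):

/* History-based model: StarPU records timings per data size and reuses
 * them to predict the duration of future tasks. */
static struct starpu_perfmodel vector_scal_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "vector_scal",
};

static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    .nbuffers = 1,
    .modes = { STARPU_RW },
    .model = &vector_scal_model,
};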
STARPU_PERF_MODEL_DIR
specifies the main directory in which
StarPU stores its performance model files. The default is
$STARPU_HOME/.starpu/sampling
.
STARPU_HOME
specifies the main directory in which StarPU
stores its configuration files. The default is $HOME
on Unix
environments, and $USERPROFILE
on Windows environments.
In this tutorial, which uses the SimGrid version of StarPU, we are
setting STARPU_PERF_MODEL_DIR
to a specific
directory perfmodels
available in the archive.
$ starpu_perfmodel_display -l
file: <vector_scal.conan>
file: <mult_perf_model.conan>
file: <starpu_dgemm_gemm.conan>
file: <starpu_sgemm_gemm.conan>
$ starpu_perfmodel_display -s vector_scal
# performance model for cuda0_impl0 (Comb0)
	Regression : #sample = 132
	Linear: y = alpha size ^ beta
		alpha = 7.040874e-01
		beta = 3.326125e-01
	Non-Linear: y = a size ^b + c
		a = 6.207150e-05
		b = 9.503886e-01
		c = 1.887639e+01
# hash		size		flops		mean (us)	stddev (us)	n
a3d3725e	4096		0.000000e+00	1.902150e+01	1.639952e+00	10
870a30aa	8192		0.000000e+00	1.971540e+01	1.115123e+00	10
48e988e9	16384		0.000000e+00	1.934910e+01	8.406537e-01	10
...
09be3ca9	1048576		0.000000e+00	5.483990e+01	7.629412e-01	10
...
# performance model for cuda1_impl0 (Comb1)
...
09be3ca9	1048576		0.000000e+00	5.389290e+01	8.083156e-01	10
...
# performance model for cuda2_impl0 (Comb2)
...
09be3ca9	1048576		0.000000e+00	5.431150e+01	4.599005e-01	10
...
# performance model for cpu0_impl0 (Comb3)
...
a3d3725e	4096		0.000000e+00	5.149621e+00	7.096558e-02	66
...
09be3ca9	1048576		0.000000e+00	1.218595e+03	4.823102e+00	66
...
This shows that for the vector_scal kernel with a 4KB size, the average
execution time on CPUs was about 5.1µs, with a 0.07µs standard deviation, over
66 samples, while it took about 19µs on CUDA GPUs, with a 1.6µs standard
deviation. With a 1MB size, execution time on CPUs is 1.2ms, while it is only
54µs on the CUDA GPU.
It is a good idea to check the variation before doing actual performance
measurements. If the kernel has varying performance, it may be a good idea to
force StarPU to continue calibrating the performance model, by using

export STARPU_CALIBRATE=1
The performance model can also be drawn by using starpu_perfmodel_plot
,
which will emit a gnuplot file in the current directory:
$ starpu_perfmodel_plot -s vector_scal ... [starpu][main] Gnuplot file <starpu_vector_scal.gp> generated $ gnuplot starpu_vector_scal.gp $ gv starpu_vector_scal.eps
The gv
command will not work if you have not
specified -X
when you ran ssh
.
If the code of a computation kernel is modified, its performance changes, and
the performance model thus has to be recalibrated from scratch. To do so, use

export STARPU_CALIBRATE=2
Task Scheduling Policy
By default, StarPU uses the lws
simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models: it uses a single central queue, from which workers draw
tasks to work on. This however does not permit prefetching data, since the
scheduling decision is taken late.
If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to performance models, and issue data prefetch requests, to overlap data transfers and computations.
For instance, compare the lws
(default) and dmda
scheduling
policies:
$ STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -xy $((256*4)) -nblocks 4
with:
$ STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -xy $((256*4)) -nblocks 4
You can see that most (if not all) of the computation has been done on GPUs, leading to better performance.
Try other schedulers, use STARPU_SCHED=help
to get the
list.
Also try with various sizes (keeping a 256 tile size, i.e. increase both occurrences of 4 above) and draw curves.
You can also try the double-precision version, dgemm, and notice that
the performance gain on GPUs is less impressive.
Session Part 3: MPI Support
StarPU provides support for MPI communications. It does so in two ways. Either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.
We will use here the non-simulated version of StarPU, by calling the following Guix command
$ guix environment --pure starpu --ad-hoc starpu grep coreutils emacs vim less openssh inetutils -- /bin/bash --norc
Manual MPI transfers
Basically, StarPU provides
equivalents of MPI_*
functions, but which operate on DSM handles
instead of void*
buffers. The difference is that the source data may be
residing on a GPU where it just got computed. StarPU will automatically handle
copying it back to main memory before submitting it to MPI.
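For instance, the detached (asynchronous) send and receive counterparts operate directly on handles; a hedged sketch for the ring example (token_handle, prev_rank, next_rank, tag and increment_cl are placeholders):

/* Receive the token from the previous neighbour into the handle... */
starpu_mpi_irecv_detached(token_handle, prev_rank, tag,
                          MPI_COMM_WORLD, NULL, NULL);

/* ...submit the task incrementing it (it will run once the receive
 * has landed, thanks to data dependencies on the handle)... */
starpu_task_insert(&increment_cl, STARPU_RW, token_handle, 0);

/* ...and forward it to the next neighbour. If the data was just
 * computed on a GPU, StarPU fetches it back to main memory first. */
starpu_mpi_isend_detached(token_handle, next_rank, tag,
                          MPI_COMM_WORLD, NULL, NULL);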
In the mpi/ subdirectory,
ring_async_implicit.c
shows an example of mixing MPI communications and task submission. It
is a classical ring MPI ping-pong, but the token which is being passed
on from neighbour to neighbour is incremented by a StarPU task at each
step.
This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.
$ cd mpi
$ make ring_async_implicit
$ mpirun -np 2 $PWD/ring_async_implicit
starpu_mpi_insert_task
A stencil application shows a basic MPI
task model application. The data distribution over MPI
nodes is decided by the my_distrib
function, and can thus be changed
trivially.
It also shows how data can be migrated to a
new distribution.
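Schematically (a hedged sketch, not the exact stencil5 code; data_handles, tag_of, stencil_cl and the neighbour indexing are placeholders; starpu_mpi_task_insert is the current name of starpu_mpi_insert_task), each handle is given an owner rank once, and the submission loop is then identical on every node:

/* Declare which MPI rank owns this piece of data, as decided by the
 * application's distribution function. */
starpu_mpi_data_register(data_handles[x][y], tag_of(x, y),
                         my_distrib(x, y, size));

/* Every node runs the same loop: StarPU executes the task on the rank
 * owning the written data, and automatically generates the MPI
 * transfers needed for the pieces it reads. */
starpu_mpi_task_insert(MPI_COMM_WORLD, &stencil_cl,
                       STARPU_RW, data_handles[x][y],
                       STARPU_R,  data_handles[x-1][y],
                       STARPU_R,  data_handles[x+1][y],
                       STARPU_R,  data_handles[x][y-1],
                       STARPU_R,  data_handles[x][y+1],
                       0);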
$ make stencil5
$ mpirun -np 2 $PWD/stencil5 -display
Session Part 4: OpenMP Support
The Klang-Omp OpenMP Compiler
The Klang-Omp OpenMP compiler converts C/C++ source code annotated with OpenMP 4 directives into StarPU-enabled code. Klang-Omp is a source-to-source compiler based on the LLVM/Clang compiler framework.
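As an illustration, input of the following shape (plain OpenMP 4 dependent tasks, nothing StarPU-specific; scale and axpy are hypothetical functions) is what Klang-Omp turns into StarPU task submissions:

/* Each task becomes a StarPU task; each depend clause becomes a data
 * dependency between tasks. */
#pragma omp task depend(inout: a)
scale(&a);

#pragma omp task depend(in: a) depend(out: b)
axpy(a, &b);

/* Wait for all tasks, as starpu_task_wait_for_all() would. */
#pragma omp taskwait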
The following shell sequence shows an example of an OpenMP version of the Cholesky decomposition compiled into StarPU code.
cd
source /gpfslocal/pub/training/runtime_june2016/openmp/environment
cp -r /gpfslocal/pub/training/runtime_june2016/openmp/Cholesky .
cd Cholesky
make
./cholesky_omp4.starpu
Homepage of the Klang-Omp OpenMP compiler: Klang-Omp
More Performance Optimizations
The StarPU performance feedback chapter provides more optimization tips for further reading after this tutorial.
FxT Tracing Support
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
A Guix StarPU-FxT package is available. You need to call the following Guix command to use it.
$ guix environment --pure starpu-fxt --ad-hoc starpu-fxt grep coreutils emacs vim less gv \ --with-commit=starpu-fxt=3cc33c3cc85eae5dbc0f4b0ddc9291e3409287b2 -- /bin/bash --norc
The Guix package starpu-fxt is currently based on version 1.3.3 of
StarPU, which has a bug leading to the creation of huge trace files.
This has been fixed but not released yet; we hence use the Guix
parameter --with-commit to indicate a specific commit for StarPU. This
will no longer be needed after the next StarPU release and the upgrade
of the Guix package.
The trace file is stored in /tmp
by default. To tell StarPU to store
output traces in the home directory, one can set:
$ export STARPU_FXT_PREFIX=$HOME/
The application should be run again, for instance:
$ make clean
$ make mult
$ ./mult
This time a prof_file_XX_YY
trace file will be generated in your home directory. This can be converted to
several formats by using:
$ starpu_fxt_tool -i ~/prof_file_*
This will create
- a paje.trace file, which can be opened by using the ViTE tool. This shows a Gantt diagram of the tasks which executed, and thus the activity and idleness of tasks, as well as dependencies, data transfers, etc. You may have to zoom in to actually focus on the computation part, and not the lengthy CUDA initialization.
- a dag.dot file, which contains the graph of all the tasks submitted by the application. It can be opened by using Graphviz.
- an activity.data file, which records the activity of all processing units over time.
Contact
For any questions regarding StarPU, please contact the StarPU developers mailing list starpu-devel@inria.fr