StarPU Docker Tutorial
(last updated on 2024-08-20)
The tutorial slides are available as PDF.
Setup
Using the StarPU docker image
docker run --rm -it registry.gitlab.inria.fr/starpu/starpu-docker/starpu:1.4.7
If you have GPU nodes on your machine, use instead the following command
docker run --rm -it --gpus all registry.gitlab.inria.fr/starpu/starpu-docker/starpu:1.4.7
StarPU is installed in /usr/local, and StarPU-simgrid is installed in /usr/local/starpu-simgrid.
Testing the installation
Make sure that StarPU finds your hardware with:
starpu_machine_display
Note that the first time starpu_machine_display is executed, it calibrates the performance model of the bus; the results are then stored in different files in the directory $HOME/.starpu/sampling/bus.
Using the simulation version of StarPU
STARPU_PATH=/usr/local/starpu-simgrid
export PATH=/usr/local/starpu-simgrid/bin:$PATH
export PKG_CONFIG_PATH=/usr/local/starpu-simgrid/lib/pkgconfig:$PKG_CONFIG_PATH
You can for instance run some examples provided by StarPU:
export STARPU_HOSTNAME=attila
STARPU_WORKER_STATS=1 $STARPU_PATH/lib/starpu/examples/cholesky_implicit -size $((960*20)) -nblocks 20
and notice that a lot of the tasks went to the virtual GPU.
Example
To get the simulation version working, you need to set the path to the performance models:
cd $HOME/tutorial/C
export STARPU_PERF_MODEL_DIR=$PWD/../perfmodels
export STARPU_HOSTNAME=conan
starpu_machine_display
make
./vector_scal_task_insert
Session Part 1: Task-based Programming Model
Application Example: Vector Scaling
A vector scaling example is available both in the $HOME/tutorial/C
folder and in the $HOME/tutorial/fortran
folder of the docker image.
Base version
The original non-StarPU versions, in C (vector_scal0.c, also available remotely) and in Fortran (vector_scal0.f90, also available remotely), show the basic example that we will be using to illustrate how to use StarPU. It simply allocates a vector and calls a scaling function over it.
(The C and Fortran listings are shown side by side on the tutorial page.)
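For reference, here is a minimal sketch of what such a non-StarPU base version looks like (the vector size, factor and output are illustrative; the real vector_scal0.c may differ in these details):
#include <stdio.h>
#include <stdlib.h>

#define NX 2048

/* plain sequential kernel: scale every element of the vector */
static void vector_scal(float *val, unsigned n, float factor)
{
    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}

int main(void)
{
    float *vector = malloc(NX * sizeof(*vector));
    for (unsigned i = 0; i < NX; i++)
        vector[i] = 1.0f;

    vector_scal(vector, NX, 3.14f);

    printf("vector[0] = %f\n", vector[0]);
    free(vector);
    return 0;
}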
StarPU version
The StarPU version of the scaling example is available in the directory $HOME/tutorial/C/:
- vector_scal_task_insert.c: the main application
- vector_scal_cpu.c: the CPU implementation of the codelet
- vector_scal_cuda.cu: the CUDA implementation of the codelet
- vector_scal_opencl.c: the OpenCL host implementation of the codelet
- vector_scal_opencl_kernel.cl: the OpenCL device implementation of the codelet
The StarPU Fortran version of the scaling example is available in the directory $HOME/tutorial/fortran/:
- vector_scal.f90: the main application
- vector_scal_cl.f90: the CPU implementation of the codelet
- vector_scal_cuda.cu: the CUDA implementation of the codelet
Computation Kernels
Examine the source code, starting from vector_scal_cpu.c and vector_scal_cl.f90: the vector_scal_cpu computation code is the same in both.
(The C and Fortran listings of the kernel are shown side by side on the tutorial page.)
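As a rough illustration, here is a minimal sketch of such a CPU kernel, using the standard StarPU vector interface (the actual vector_scal_cpu.c may differ slightly, e.g. in how the factor is passed):
#include <starpu.h>

void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    /* the registered vector is the first (and only) buffer */
    struct starpu_vector_interface *vector = buffers[0];
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
    unsigned n = STARPU_VECTOR_GET_NX(vector);

    /* the scaling factor is passed as an inline codelet argument */
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);

    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}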
The GPU implementation, in vector_scal_cuda.cu (the same file is used from Fortran), follows the same structure: the host part (vector_scal_cuda) extracts the actual CUDA pointer from the DSM interface and passes it to the device part (vector_mult_cuda), which performs the actual computation.
The OpenCL implementation in vector_scal_opencl.c and vector_scal_opencl_kernel.cl is more involved due to the low-level nature of the OpenCL standard, but the principle remains the same. (Note that it is not available in the Fortran code.)
Main Code
Now let’s examine the main code, in C and in Fortran.
(The C and Fortran listings of the main code are shown side by side on the tutorial page.)
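The overall flow of the main code, as a hedged sketch (register the vector, insert one task, wait, unregister; the real vector_scal_task_insert.c also declares the CUDA/OpenCL implementations and checks return codes):
#include <starpu.h>

#define NX 2048

extern void vector_scal_cpu(void *buffers[], void *cl_arg);

static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    for (unsigned i = 0; i < NX; i++)
        vector[i] = 1.0f;

    starpu_init(NULL);

    /* register the vector so StarPU can manage its transfers */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    /* submit one task; STARPU_VALUE passes the factor by value to the codelet */
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, handle,
                       0);

    starpu_task_wait_for_all();

    /* bring the data back to main memory and release it */
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}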
Compilation and execution
Building
Let us look at how this should be built. Typical Makefiles for applications using StarPU are available in C and in Fortran.
(The C and Fortran Makefiles are shown side by side on the tutorial page.)
The provided Makefiles additionally detect whether CUDA or OpenCL are available in StarPU, and add the corresponding files and link flags.
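For reference, such Makefiles typically rely on pkg-config to obtain the compilation and link flags. Assuming the starpu-1.4 pkg-config package shipped in the docker image, building the C version by hand would look roughly like:
gcc vector_scal_task_insert.c vector_scal_cpu.c \
    $(pkg-config --cflags starpu-1.4) $(pkg-config --libs starpu-1.4) \
    -o vector_scal_task_insert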
Simulation
If your system does not have a CUDA or OpenCL GPU, you can use the simulation version of StarPU by setting some environment variables by running in your shell:
. ./../simu.sh
If you ever want to get back to the non-simulated version of StarPU, you can run in your shell:
. ./../native.sh
Note that after switching between the simulated and the non-simulated versions of StarPU, you have to rebuild completely:
make clean
make
Running
make clean vector_scal_task_insert
./vector_scal_task_insert
Note that if you are using the simulation version of StarPU, the computation
will not be performed, and thus the final value will be equal to the initial
value, but the timing provided by starpu_timing_now()
or
fstarpu_timing_now()
will correspond to the correct execution time.
You can set the environment variable STARPU_WORKER_STATS
to 1 when
running your application to see the number of tasks executed by each
device. You can see the whole list of environment
variables here.
STARPU_WORKER_STATS=1 ./vector_scal_task_insert
To make the program use the OpenCL implementation rather than CUDA on the GPU devices, one can set STARPU_NCUDA to 0 (the command below also sets STARPU_NCPU to 0, so that the CPU workers do not take the tasks):
# to force the implementation on an OpenCL device
STARPU_WORKER_STATS=1 STARPU_NCPU=0 STARPU_NCUDA=0 ./vector_scal_task_insert
Data Partitioning
In the previous section, we submitted only one task. We here discuss how to partition data so as to submit multiple tasks which can be executed in parallel by the various CPUs and GPUs.
Let’s examine mult.c or mult.f90:
- The computation kernel, cpu_mult, is a trivial matrix multiplication kernel which operates on 3 given DSM interfaces. These will actually not be whole matrices, but only small parts of matrices.
- init_problem_data initializes the whole A, B and C matrices.
- partition_mult_data does the actual registration and partitioning. Matrices are first registered completely, then two partitioning filters are declared. The first one, vert, is used to split B and C vertically. The second one, horiz, is used to split A and C horizontally. We thus end up with a grid of pieces of C to be computed from stripes of A and B (see the sketch after this list).
- launch_tasks submits the actual tasks: for each piece of C, take the appropriate pieces of A and B to produce that piece of C.
- The access mode is interesting: A and B just need to be read from, and C will only be written to. This means that StarPU will copy the pieces of A and B across the machine, wherever tasks need them, and will give the tasks uninitialized buffers for the pieces of C, since they will not be read from.
- The main code initializes StarPU and data, launches tasks, unpartitions data, and unregisters it. Unpartitioning is an interesting step: until then, the pieces of C reside on the various GPUs where they have been computed. Unpartitioning collects all the pieces of C into main memory to form the whole C result matrix.
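A rough sketch of what partition_mult_data does (the handle and slice-count names are illustrative; see mult.c for the real code):
/* declare the two partitioning filters */
struct starpu_data_filter vert =
{
    .filter_func = starpu_matrix_filter_vertical_block,
    .nchildren = nslicesx,
};
struct starpu_data_filter horiz =
{
    .filter_func = starpu_matrix_filter_block,
    .nchildren = nslicesy,
};

starpu_data_partition(B_handle, &vert);              /* split B into vertical stripes   */
starpu_data_partition(A_handle, &horiz);             /* split A into horizontal stripes */
starpu_data_map_filters(C_handle, 2, &vert, &horiz); /* split C into a 2D grid of blocks */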
Run the application, enabling some statistics:
make -C partition clean all
STARPU_WORKER_STATS=1 ./partition/mult
The figures show how the computation was distributed over the various processing units.
Other example
gemm/xgemm.c
and
gemm/xgemm.f90
are very similar matrix-matrix product examples, which make use of
BLAS kernels for much better performance. The mult_kernel_common
functions show how we call DGEMM
(CPUs) or cublasDgemm
(GPUs) on
the DSM interface. Also note the presence of the
starpu_cublas_init()
call in the main function so as to more
efficiently connect cublas with StarPU.
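For reference, this call simply brackets the StarPU session (a sketch):
starpu_init(NULL);
starpu_cublas_init();      /* set up cuBLAS handles on the CUDA workers */
/* ... register data and submit the gemm tasks ... */
starpu_cublas_shutdown();
starpu_shutdown();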
Let’s execute it.
make -C gemm clean all
STARPU_WORKER_STATS=1 ./gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/dgemm
Exercise
Take the vector example again, and add partitioning support to it,
using the matrix-matrix multiplication as an example. Here we will use
the
starpu_vector_filter_block()
filter function or its equivalent in Fortran
fstarpu_df_alloc_vector_filter_block()
.
You can see the list of predefined filters provided by StarPU here.
We provide a solution for the exercise in C or in Fortran.
Session Part 2: Optimizations
This is based on StarPU’s documentation optimization chapter.
Data Management
We have explained how StarPU can overlap computation and data transfers
thanks to DMAs. This is however only possible when CUDA has control over the
application buffers. The application should thus use
starpu_malloc()
when allocating its buffer, to permit asynchronous DMAs from and to
it.
In Fortran, the application should thus use fstarpu_memory_pin()
after allocating its buffer for the same effect.
Take the vector example again, and fix the allocation to make it use starpu_malloc(). In Fortran, after the allocation you need to call fstarpu_memory_pin().
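In C, this boils down to replacing the plain allocation with something along these lines (a sketch):
float *vector;
starpu_malloc((void **)&vector, NX * sizeof(*vector)); /* pinned, DMA-friendly allocation */
/* ... register the vector, submit tasks, unregister ... */
starpu_free(vector);                                   /* release with the matching StarPU call */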
Task Submission
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave as follows: submit the whole graph of tasks, and wait for termination.
The CUDA and OpenCL kernel executions themselves should also be submitted asynchronously, so as to let kernel computation and data transfers proceed independently:
- In
vector_scal_cuda.cu,
one should actually remove the
cudaStreamSynchronize(starpu_cuda_get_local_stream());
call, and add this flag to the codelet structure, so as to just submit the CUDA kernel, and let StarPU test for its termination:
.cuda_flags = {STARPU_CUDA_ASYNC},
call fstarpu_codelet_add_cuda_func(scal_cl, C_FUNLOC(cl_cuda_func_vector_scal))
- Similarly, in
vector_scal_opencl.c,
one should actually remove the
clFinish(queue);
call, and add this flag to the codelet structure:
.opencl_flags = {STARPU_OPENCL_ASYNC},
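Putting both flags together, the codelet structure then looks roughly like this (a sketch; the function names are illustrative):
static struct starpu_codelet cl =
{
    .cpu_funcs    = { vector_scal_cpu },
    .cuda_funcs   = { vector_scal_cuda },
    .cuda_flags   = { STARPU_CUDA_ASYNC },
    .opencl_funcs = { vector_scal_opencl },
    .opencl_flags = { STARPU_OPENCL_ASYNC },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};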
Performance Model Calibration
Inspection
Performance prediction is essential for proper scheduling decisions; the performance models thus have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $STARPU_PERF_MODEL_DIR for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.
STARPU_PERF_MODEL_DIR
specifies the main directory in which
StarPU stores its performance model files. The default is
$STARPU_HOME/.starpu/sampling
.
STARPU_HOME
specifies the main directory in which StarPU
stores its configuration files. The default is $HOME
on Unix
environments, and $USERPROFILE
on Windows environments.
In this tutorial we provide some pre-calibrated performance models with the Simgrid version of StarPU. You can run
source $HOME/tutorial/perfmodels.sh
to enable using them (it sets STARPU_PERF_MODEL_DIR to a specific directory, perfmodels, available in the docker image). Note that the following curves are based on the C version of the code, but the experiments can also be made with the Fortran version; the numbers will just be different.
Then you can use starpu_perfmodel_display to get the performance model details:
$ starpu_perfmodel_display -l # Show the list of codelets that have a performance model
file: <mult_energy_model.conan>
file: <mult_perf_model.conan>
file: <starpu_dgemm_gemm.conan>
file: <starpu_sgemm_gemm.conan>
file: <vector_scal.conan>
$ starpu_perfmodel_display -s vector_scal.conan # Show the details for one codelet
# performance model for cuda0_impl0 (Comb0)
Regression : #sample = 132
Linear: y = alpha size ^ beta
alpha = 7.040874e-01
beta = 3.326125e-01
Non-Linear: y = a size ^b + c
a = 6.207150e-05
b = 9.503886e-01
c = 1.887639e+01
# hash size flops mean (us) stddev (us) n
a3d3725e 4096 0.000000e+00 1.902150e+01 1.639952e+00 10
870a30aa 8192 0.000000e+00 1.971540e+01 1.115123e+00 10
48e988e9 16384 0.000000e+00 1.934910e+01 8.406537e-01 10
...
09be3ca9 1048576 0.000000e+00 5.483990e+01 7.629412e-01 10
...
# performance model for cuda1_impl0 (Comb1)
...
09be3ca9 1048576 0.000000e+00 5.389290e+01 8.083156e-01 10
...
# performance model for cuda2_impl0 (Comb2)
...
09be3ca9 1048576 0.000000e+00 5.431150e+01 4.599005e-01 10
...
# performance model for cpu0_impl0 (Comb3)
...
a3d3725e 4096 0.000000e+00 5.149621e+00 7.096558e-02 66
...
09be3ca9 1048576 0.000000e+00 1.218595e+03 4.823102e+00 66
...
This shows that for the vector_scal
kernel with a 4KB size, the average
execution time on CPUs was about 5.1µs, with a 0.07µs standard deviation, over
66 samples, while it took about 19µs on GPU CUDA0, with a 1.6µs standard
deviation. With a 1MB size, execution time on CPUs is 1.2ms, while it is only
54µs on GPU CUDA0.
The performance model can also be drawn by using starpu_perfmodel_plot
,
which will emit a gnuplot file in the current directory:
$ starpu_perfmodel_plot -s vector_scal.conan
...
[starpu][main] Gnuplot file <starpu_vector_scal.conan.gp> generated
$ gnuplot starpu_vector_scal.conan.gp
The measurements were made on CPUs, but also on GPUs that support both OpenCL and CUDA. The graph shows that GPUs become more efficient for vector sizes beyond 20000 bytes.
We have also measured the performance of the mult
kernel example, which can be
drawn with
starpu_perfmodel_plot -s mult_perf_model.conan
gnuplot starpu_mult_perf_model.conan.gp
We can see a slight bump after 2MB.
The task submission included the number of flops per task; this allows drawing GFlop/s instead of just time:
starpu_perfmodel_plot -f -s mult_perf_model.conan
gnuplot starpu_gflops_mult_perf_model.conan.gp
We indeed notice a performance drop after 2MB, which corresponds to the cache size.
Here is an example of how to set the number of flops per task in Fortran code.
real(KIND=C_DOUBLE), target :: flops
flops = 2 * (X / X_parts) * (Y / Y_parts) * Z
call fstarpu_task_insert((/ cl_mult, &
FSTARPU_R, sub_handleA, &
FSTARPU_R, sub_handleB, &
FSTARPU_W, sub_handleC, &
FSTARPU_FLOPS, c_loc(flops), &
C_NULL_PTR /))
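For comparison, the C equivalent would look roughly like this (names mirror the Fortran snippet above; in C, STARPU_FLOPS is followed by a double passed by value):
double flops = 2.0 * (X / X_parts) * (Y / Y_parts) * Z;
starpu_task_insert(&cl_mult,
                   STARPU_R, sub_handleA,
                   STARPU_R, sub_handleB,
                   STARPU_W, sub_handleC,
                   STARPU_FLOPS, flops,
                   0);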
We can also draw the energy used by tasks:
starpu_perfmodel_plot -e -s mult_energy_model.conan
gnuplot starpu_mult_energy_model.conan.gp
We can again notice the bump after 2MB.
Again, instead of the energy, one can observe the computation efficiency thanks to the flops information:
starpu_perfmodel_plot -f -e -s mult_energy_model.conan
gnuplot starpu_gflops_mult_energy_model.conan.gp
This is much more interesting! We do indeed notice the efficiency drop after 2MB, but we can also notice an efficiency maximum around 100KB.
Measurement
To calibrate the performance models, one needs to re-compile StarPU without simgrid support and re-compile the application against it, to enable the actual execution of the kernels.
It is a good idea to check the variation before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using:
export STARPU_CALIBRATE=1
If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use:
export STARPU_CALIBRATE=2
Energy measurement
CPUs can report their energy usage through performance counters, and NVIDIA devices can report it through the CUDA interface. StarPU provides an interface to abstract the measurement for the application. The available measurement precision is however quite coarse. The principle is thus that the application should submit a series of tasks of the same kind, and put measurement calls before and after the series, so StarPU can compute an average over the whole set.
mult_bench.c achieves this: it prepares matrices so as to generate a fair number of tasks according to the number of CPUs, so that the measurement is long enough.
Unfortunately, within docker the performance counters cannot be read due to administrative permissions. Running the benchmark on the raw system (possibly requiring root access) would allow performing the measurement.
Task Scheduling Policy
By default, StarPU uses the lws
simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models: it uses a single central queue, from which workers draw
tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken late.
If the application codelets have performance models, the scheduler should be changed to take benefit from that. StarPU will then really take scheduling decision in advance according to performance models, and issue data prefetch requests, to overlap data transfers and computations.
For instance, compare the lws
(default) and dmdar
scheduling
policies:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=lws gemm/sgemm -xy $((256*4)) -nblocks 4
with:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmdar gemm/sgemm -xy $((256*4)) -nblocks 4
You can see that most (if not all) of the computation has been done on GPUs, leading to better performance.
Try other schedulers; use STARPU_SCHED=help to get the list.
Also try with various sizes (keeping a 256 tile size, i.e. increase both occurrences of 4 above) and draw curves.
You can also try the double-precision version, dgemm, and notice that GPUs achieve lower performance there.
Session Part 3: MPI Support
StarPU provides support for MPI communications. It does so in two ways. Either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.
The simulation way is a bit tricky (setting up a virtual cluster, etc.), so it is preferable to re-compile StarPU without simgrid support.
Manual MPI transfers
Basically, StarPU provides
equivalents of MPI_*
functions, but which operate on DSM handles
instead of void*
buffers. The difference is that the source data may be
residing on a GPU where it just got computed. StarPU will automatically handle
copying it back to main memory before submitting it to MPI.
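As a hedged sketch of what this looks like in the ring example (the token, rank and tag variables are illustrative):
starpu_data_handle_t token_handle;
starpu_variable_data_register(&token_handle, STARPU_MAIN_RAM,
                              (uintptr_t)&token, sizeof(token));

if (rank == 0)
    starpu_mpi_send(token_handle, 1, tag, MPI_COMM_WORLD);           /* send the handle's data */
else
    starpu_mpi_recv(token_handle, rank - 1, tag, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);                               /* receive into the handle */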
Both the C example and the Fortran example show how to mix MPI communications and task submission. It is a classical MPI ring ping-pong, but the token passed from neighbour to neighbour is incremented by a StarPU task at each step.
This is written very naturally by simply submitting all MPI communication requests and tasks asynchronously in a sequential-looking loop, and eventually waiting for all of them to complete.
cd $HOME/tutorial/C/mpi
make clean ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
cd $HOME/tutorial/fortran/mpi
make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
starpu_mpi_task_insert
A stencil application (C version or Fortran version)
shows a basic MPI task model application. The data distribution over
MPI nodes is decided by the my_distrib
function, and can thus be
changed trivially. It also shows how data can be migrated to a new
distribution.
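The core idea, as a rough sketch (the stencil_cl codelet and handle names are illustrative): each node declares which MPI rank owns each piece of data, then every node submits the same task graph, and StarPU-MPI executes each task on the node owning its written data, generating the necessary transfers automatically.
/* declare the owner of this block, as computed by my_distrib() */
starpu_mpi_data_register(data_handle, tag, my_distrib(x, y, world_size));

/* submitted on every node; only the owner of data_handle actually runs it */
starpu_mpi_task_insert(MPI_COMM_WORLD, &stencil_cl,
                       STARPU_RW, data_handle,
                       0);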
cd $HOME/tutorial/C/mpi
make clean all
mpirun -np 2 $PWD/stencil5 -display
cd $HOME/tutorial/fortran/mpi
make clean all
mpirun -np 2 $PWD/stencil5 -display
More Performance Optimizations
The StarPU performance feedback chapter provides more optimization tips for further reading after this tutorial.
FxT Tracing Support
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
The StarPU docker image has been compiled with FxT support. Trace generation is disabled by default; enable it by setting the variable STARPU_FXT_TRACE.
export STARPU_FXT_TRACE=1
The trace file is stored in /tmp
by default. To tell StarPU to store
output traces in the home directory, one can set:
export STARPU_FXT_PREFIX=$HOME/
The application should be run again, for instance:
./mult
This time a prof_file_XX_YY
trace file will be generated in your home directory. This can be converted to
several formats by using:
starpu_fxt_tool -i ~/prof_file_*
This will create:
- a paje.trace file, which can be opened with the ViTE tool. This shows a Gantt diagram of the tasks which executed, and thus the activity and idleness of tasks, as well as dependencies, data transfers, etc. You may have to zoom in to actually focus on the computation part, and not the lengthy CUDA initialization.
- a dag.dot file, which contains the graph of all the tasks submitted by the application. It can be opened with Graphviz.
- an activity.data file, which records the activity of all processing units over time.
One can also automatically generate all the different formats by
setting the variable STARPU_GENERATE_TRACE
before running the
application.
export STARPU_GENERATE_TRACE=1
./mult
Contact
For any questions regarding StarPU, please contact the StarPU developers mailing list starpu-devel@inria.fr