StarPU: A Unified Runtime System for Heterogeneous Multicore Architectures
Table of Contents
- Overview
- Documentation
- Latest News
- News
- Discover StarPU
- Features
- Give it a try!
- Get involved
- Contact
- Contributors
- Software using StarPU
- Help
- Publications
- Internships
- Download
- Task Graph Market
- Tutorials
- Benchmarks
- Intranet
- Storm Team
Overview
StarPU is a task programming library for hybrid architectures
- The application provides algorithms and constraints
- CPU/GPU implementations of tasks
- A graph of tasks, using either StarPU's rich C/C++/Fortran/Python API, or OpenMP pragmas.
- StarPU internally deals with the following aspects:
- Task dependencies
- Optimized heterogeneous scheduling
- Optimized data transfers and replication between main memory and discrete memories
- Optimized cluster communications
Rather than handling low-level issues, programmers can concentrate on algorithmic aspects!
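To give a concrete idea of the C API, below is a minimal sketch of a complete StarPU program, loosely modeled on the vector scaling example shipped with StarPU. It is not taken verbatim from the distribution: the scal_cpu_func kernel, the vector size and the scaling factor are illustrative, and field names assume a recent 1.x release.

#include <stdint.h>
#include <starpu.h>

#define NX 1024

/* CPU implementation of the codelet: scale a vector in place. */
void scal_cpu_func(void *buffers[], void *cl_arg)
{
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}

static struct starpu_codelet scal_cl =
{
    .cpu_funcs = { scal_cpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

int main(void)
{
    float vector[NX], factor = 3.14f;
    starpu_data_handle_t handle;
    unsigned i;

    for (i = 0; i < NX; i++)
        vector[i] = 1.0f;

    if (starpu_init(NULL) != 0)
        return 1;

    /* Register the vector so that StarPU manages its transfers. */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vector, NX, sizeof(vector[0]));

    /* Submit an asynchronous task; dependencies are inferred from data accesses. */
    starpu_task_insert(&scal_cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, handle,
                       0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}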
Documentation
Latest release
The StarPU documentation is available in PDF and in HTML. Several parts are available, depending on the level of detail desired:
- An introduction (HTML PDF) is available to get a grasp on the principles in a couple of pages.
- Installation (HTML PDF) should be read if you need to install StarPU yourself.
- Basics (HTML PDF) is the must-read document that covers in a few dozen pages the basics of StarPU.
- Applications (HTML PDF) covers an application example.
- Performance (HTML PDF) covers performance optimization.
- The FAQ (HTML PDF) covers the frequently-asked questions and is also a must-check.
- Languages (HTML PDF) covers the bindings for Fortran, Java, Python, and OpenMP.
- Extensions (HTML PDF) covers all the various non-basic features of StarPU.
Please note that these documents are up-to-date with the latest official release of StarPU.
Master
The documentation for the master version is available here.
Latest News
Get the latest StarPU news by subscribing to the starpu-announce mailing list. See also the full news.
November 2023
The release 1.4.2 is now available! The 1.4 release series brings, among other functionalities, a Python interface for StarPU, a driver for HIP-based GPUs, the possibility to store performance model files in multiple directories, OpenMP LLVM support, and a new sched_tasks.rec trace file which monitors task scheduling push/pop actions.
June 2023
The release 1.3.11 of StarPU is now available! The 1.3 release series brings among other functionalities MPI master-slave support, a tool to replay executions through SimGrid, an HDF5 implementation of the out-of-core support, a new implementation of StarPU-MPI on top of NewMadeleine, implicit support for asynchronous partition planning, a resource management module to share processor cores and accelerator devices with other parallel runtime systems, ...
June 2022
StarPU is now available on https://github.com/starpu-runtime/starpu with its own CI system.
May 2022
StarPU is now available in https://github.com/spack/spack/
Discover StarPU
- A video recording (26') of a presentation at the XDC2014 conference gives an overview of StarPU (slides).
- The latest tutorial material for StarPU is composed of two parts.
- A set of slides is also available to get an overview of StarPU.
Features
Portability
Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named codelet. Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. In case a codelet can run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine, including multiple GPUs. One can even specify several functions for each architecture (new in v1.0) as well as parallel implementations (e.g. in OpenMP), and StarPU will automatically determine which version is best for each input size (new in v0.9). StarPU can execute them concurrently, e.g. one per socket, provided that the task implementations support it (which is the case for MKL, but unfortunately most often not for OpenMP).
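As an illustration, here is a hedged sketch of what a multi-implementation codelet can look like; scal_cpu_func, scal_sse_func and scal_cuda_func are hypothetical kernels that the application would provide:

#include <starpu.h>

/* Hypothetical kernels provided by the application. */
extern void scal_cpu_func(void *buffers[], void *cl_arg);
extern void scal_sse_func(void *buffers[], void *cl_arg);
#ifdef STARPU_USE_CUDA
extern void scal_cuda_func(void *buffers[], void *cl_arg);
#endif

static struct starpu_codelet scal_cl =
{
    /* Several CPU implementations: StarPU benchmarks them and selects
     * the most efficient one for each input size. */
    .cpu_funcs  = { scal_cpu_func, scal_sse_func },
#ifdef STARPU_USE_CUDA
    /* CUDA implementation, used when the scheduler maps the task onto a GPU. */
    .cuda_funcs = { scal_cuda_func },
#endif
    .nbuffers = 1,
    .modes = { STARPU_RW },
};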
Genericity
The StarPU programming interface is very generic. For instance, various data structures are supported mainline (vectors, dense matrices, CSR/BCSR/COO sparse matrices, ...), but application-specific data structures can also be supported, provided that the application describes how data is to be transferred (e.g. a series of contiguous blocks). That was for instance used for hierarchically-compressed matrices (h-matrices).
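For instance, here is a hedged sketch of how an application might register a vector and a dense matrix with two of the built-in data interfaces (the function name and the column-major layout assumption are illustrative):

#include <stdint.h>
#include <starpu.h>

/* Register an n-element vector and an nx x ny column-major matrix
 * so that StarPU can manage their transfers and replication. */
void register_data(float *v, unsigned n,
                   float *A, unsigned nx, unsigned ny,
                   starpu_data_handle_t *vec_handle,
                   starpu_data_handle_t *mat_handle)
{
    starpu_vector_data_register(vec_handle, STARPU_MAIN_RAM,
                                (uintptr_t)v, n, sizeof(v[0]));

    /* The argument right after the pointer (here nx) is the leading
     * dimension, i.e. the number of elements between two consecutive columns. */
    starpu_matrix_data_register(mat_handle, STARPU_MAIN_RAM,
                                (uintptr_t)A, nx, nx, ny, sizeof(A[0]));
}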
Data Transfers
To relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are automatically made available on the compute resource. Data are also kept on e.g. GPUs as long as they are needed for further tasks. When a device runs out of memory, StarPU uses an LRU strategy to evict unused data. StarPU also takes care of automatically prefetching data, which permits overlapping data transfers with computations (including GPU-GPU direct transfers) to get the most out of the architecture. Support is included for matrices, vectors, n-dimension tensors, compressed matrices (CSR, BCSR, COO), and the application can define its own data interface for arbitrary data representations (which can even be e.g. a C++ or Python object).
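When the application itself needs a coherent view of a piece of data (for instance to check a result after asynchronous tasks have been submitted), it can use the acquire/release pair; the sketch below is illustrative:

#include <starpu.h>

/* Wait for pending tasks on 'handle' and bring the data back to main
 * memory (e.g. from a GPU) so the application can read it directly. */
void inspect_result(starpu_data_handle_t handle)
{
    starpu_data_acquire(handle, STARPU_R);

    /* ... read the data through the original application pointer ... */

    /* Hand the data back to StarPU so that further tasks can proceed. */
    starpu_data_release(handle);
}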
Dependencies
Dependencies between tasks can be expressed in any of several ways, to give the programmer maximum flexibility:
- implicitly from RAW, WAW, and WAR data dependencies,
- explicitly through tags which act as rendez-vous points between tasks (thus including tasks which have not been created yet),
- explicitly between pairs of tasks.
These dependencies are computed in a completely decentralized way, and can be introduced completely dynamically as tasks get submitted by the application while tasks previously submitted are being executed.
StarPU also supports an OpenMP-like reduction access mode (new in v0.9).
It also supports a commute access mode to allow data access commutativity (new in v1.2).
It also supports transparent dependency tracking between hierarchical sub-pieces of data through asynchronous partitioning, allowing seamless concurrent read access to different partitioning levels (new in v1.3).
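The sketch below illustrates the implicit mode (and, in a comment, the tag-based explicit mode); produce_cl and consume_cl are hypothetical codelets:

#include <starpu.h>

/* Hypothetical codelets: one writes 'handle', the other reads it. */
extern struct starpu_codelet produce_cl, consume_cl;

void submit_pipeline(starpu_data_handle_t handle)
{
    /* Writes the data: subsequent readers get a RAW dependency on this task. */
    starpu_task_insert(&produce_cl, STARPU_W, handle, 0);

    /* Reads the data: StarPU automatically makes it wait for the producer. */
    starpu_task_insert(&consume_cl, STARPU_R, handle, 0);

    /* Explicit alternative with tags (rendez-vous points):
     *   starpu_tag_declare_deps((starpu_tag_t)2, 1, (starpu_tag_t)1);
     * makes any task carrying tag 2 wait for the task carrying tag 1,
     * even if that task has not been created yet. */
}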
Heterogeneous Scheduling
StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models. These determine the relative performance achieved by the different processing units for the various kinds of tasks, and thus allow processing units to automatically execute the tasks they are best at. Various strategies and variants are available. Some of them are centralized, but most of them are completely distributed to properly scale with large numbers of cores: dmdas (a data-locality-aware MCT strategy, thus similar to HEFT but which starts executing tasks before the whole task graph has been submitted, thus allowing dynamic task submission and a decentralized scheduler, as well as an energy-optimizing extension), eager (a simple centralized queue), lws (distributed locality-aware work-stealing), ... The overhead per task is typically on the order of a microsecond. Tasks should thus be a few orders of magnitude bigger, such as 100 microseconds or 1 millisecond, for the overhead to be negligible. The application can also provide its own scheduler, to better fit its own needs.
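Performance models are attached to codelets; here is a hedged sketch of a history-based model (the symbol name and kernel are illustrative):

#include <starpu.h>

extern void scal_cpu_func(void *buffers[], void *cl_arg); /* hypothetical kernel */

/* Auto-tuned, history-based model: StarPU records execution times per
 * input size and per processing unit, and reuses them across runs. */
static struct starpu_perfmodel scal_model =
{
    .type = STARPU_HISTORY_BASED,
    .symbol = "scal",
};

static struct starpu_codelet scal_cl =
{
    .cpu_funcs = { scal_cpu_func },
    .nbuffers = 1,
    .modes = { STARPU_RW },
    /* Schedulers such as dmdas query this model to predict task durations. */
    .model = &scal_model,
};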
Clusters
To deal with clusters, StarPU can nicely integrate with MPI, through explicit or implicit support, according to the application's preference.
- Explicit network communication requests can be emitted, which will then be automatically combined and overlapped with the intra-node data transfers and computation,
- The application can also just provide the whole task graph and a data distribution over MPI nodes, and StarPU will automatically determine which MPI node should execute which task, and automatically generate all required MPI communications accordingly (new in v0.9; see the sketch after this list). We have obtained excellent scaling on a 256-node cluster with GPUs; we have not yet had the opportunity to test on an even larger cluster. We have however measured that with naive task submission it should scale to a thousand nodes, and with pruning-tuned task submission it should scale to about a million nodes.
- Starting with v1.3, the application can also just provide the whole task graph, and let StarPU decide the data distribution and task distribution, thanks to a master-slave mechanism. This will however by nature have a more limited scalability than the fully distributed paradigm mentioned above.
- Starting with v1.4 and in the NewMadeleine network support case, StarPU automatically detects data broadcasts performed by the application, and automatically optimizes the broadcast.
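As an illustration of the implicit support, here is a hedged sketch using StarPU-MPI; the codelet, the data tag, and the owner rank are illustrative:

#include <mpi.h>
#include <starpu.h>
#include <starpu_mpi.h>

extern struct starpu_codelet cl; /* hypothetical codelet */

/* Every node calls this with the same arguments: the task graph is
 * submitted everywhere, but each task only runs on the node owning the
 * data it writes, and StarPU-MPI posts the matching sends/receives. */
void submit(starpu_data_handle_t handle, int owner_rank)
{
    /* Declare which MPI rank owns the data and which tag identifies it. */
    starpu_mpi_data_register(handle, 42, owner_rank);

    starpu_mpi_task_insert(MPI_COMM_WORLD, &cl, STARPU_RW, handle, 0);
}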
Out of Core
When main memory is not large enough for the working set, one may have to resort to disks. StarPU makes this seamless thanks to its out-of-core support (new in v1.2). StarPU will automatically evict data from main memory in advance, and prefetch back required data before it is needed by tasks.
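A hedged sketch of enabling the out-of-core support, assuming the /tmp/starpu-ooc directory exists and the unistd disk driver is used (path and size are illustrative):

#include <stdio.h>
#include <starpu.h>

/* Register 10 GB of disk space as an additional memory node; StarPU can
 * then evict data to it and prefetch data back before tasks need it. */
void setup_out_of_core(void)
{
    int disk_node = starpu_disk_register(&starpu_disk_unistd_ops,
                                         (void *)"/tmp/starpu-ooc",
                                         (starpu_ssize_t)10*1024*1024*1024);
    if (disk_node < 0)
        fprintf(stderr, "could not register the disk node\n");
}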
Fortran Interface
StarPU comes with native Fortran bindings and examples.
OpenMP 4-compatible Interface
K'Star provides an OpenMP 4-compatible interface on top of StarPU. OpenMP applications can simply be rebuilt with the K'Star source-to-source compiler and then compiled with the usual compiler; the result will use the StarPU runtime.
K'Star also provides some extensions to the OpenMP 4 standard, to let the StarPU runtime perform online optimizations.
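For illustration, here is a hedged sketch of a plain OpenMP 4 task region of the kind K'Star can translate to StarPU tasks; scale and copy are hypothetical kernels:

void scale(double *a, int n);             /* hypothetical kernels */
void copy(double *src, double *dst, int n);

void update(double *a, double *b, int n)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: a[0:n])
        scale(a, n);

        /* Runs only after the first task, because of the dependency on a. */
        #pragma omp task depend(in: a[0:n]) depend(out: b[0:n])
        copy(a, b, n);

        #pragma omp taskwait
    }
}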
OpenCL-compatible Interface
StarPU provides an OpenCL-compatible interface, SOCL, which makes it possible to simply run OpenCL applications on top of StarPU (new in v1.0).
Simulation Support
StarPU can very accurately simulate an application execution and measure the resulting performance thanks to the SimGrid simulator (new in v1.1). This makes it possible to quickly experiment with various scheduling heuristics, various application algorithms, and even various platforms (available GPUs and CPUs, available bandwidth)!
All in All
All that means that the following sequential source code of a tiled version of the classical Cholesky factorization algorithm using BLAS is also (almost) valid StarPU code, possibly running on all the CPUs and GPUs, and, given a data distribution over MPI nodes, it is even a distributed version!
for (k = 0; k < tiles; k++) {
    potrf(A[k,k])
    for (m = k+1; m < tiles; m++)
        trsm(A[k,k], A[m,k])
    for (m = k+1; m < tiles; m++)
        syrk(A[m,k], A[m,m])
    for (m = k+1; m < tiles; m++)
        for (n = k+1; n < m; n++)
            gemm(A[m,k], A[n,k], A[m,n])
}
The complete example can be found inside the StarPU source, examples/cholesky/cholesky_implicit.c for single-node, and mpi/examples/matrix_decomposition/mpi_cholesky_codelets.c for distributed execution.
And Even More
StarPU provides a Python interface (new in v1.4), a Java interface (new in v1.4), and an Eclipse plugin (new in v1.4).
Technical Details
Supported Architectures
Arbitrary mixtures of the following architectures are supported:
- SMP/Multicore Processors (x86, PPC, ARM, ... all Debian architectures have been tested)
- CUDA NVIDIA GPUs (e.g. heterogeneous multi-GPU), with pipelined and concurrent kernel execution support (new in v1.2), GPU-GPU direct transfers (new in v1.1), and GPU-NIC direct transfers (new in v1.4)
- HIP AMD and NVIDIA GPUs (new in v1.4)
- OpenCL devices
- Maxeler FPGA (new in v1.4)
- Intel SCC (experimental, new in v1.2, removed in v1.3)
- Intel MIC / Xeon Phi (new in v1.2, removed in v1.4)
Supported Operating Systems
- GNU/Linux
- Mac OS X
- Windows
Stability
StarPU is checked every night with
- Valgrind / Helgrind
- gcc's Address/Leak/Thread/Undefined sanitizers
- cppcheck
- Coverity
Performance Analysis Tools
ViTE
In order to understand the performance obtained by StarPU, it is helpful to visualize the actual behavior of the applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces that can be visualized thanks to the ViTE (Visual Trace Explorer) open source tool.
Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: it appears that with this scheduling policy, the algorithm suffers from a certain lack of parallelism. Measured speed: 175.32 GFlop/s
This second trace depicts the behavior of the same application using a scheduling strategy (dmda) trying to minimize load imbalance thanks to auto-tuned performance models and to keep data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s
Temanejo
Temanejo can be used to debug the task graph, as shown below (new in v1.1).
Give it a try!
Using the StarPU sources
You can easily try StarPU on the Cholesky factorization, for instance. Make sure to have pkg-config, hwloc for proper CPU control, as well as BLAS kernels installed and configured in your environment for your computation units (e.g. MKL for CPUs and CUBLAS for GPUs).
$ someversion=1.3.9 # for example
$ wget http://files.inria.fr/starpu/starpu-$someversion/starpu-${someversion}.tar.gz
$ tar xf starpu-${someversion}.tar.gz
$ cd starpu-${someversion}
$ mkdir build && cd build
$ ../configure
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ STARPU_SCHED=dmdas mpirun -np 4 -machinefile mymachines ./mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((960*40*4)) -nblocks $((40*4))
Note that the dmdas scheduler uses performance models, and thus needs calibration executions before exhibiting optimized performance (until the "model something is not calibrated enough" messages go away).
To get a glimpse at what happened, you can get an execution trace by installing FxT and ViTE, and enabling traces:
$ ../configure --with-fxt
$ make -j 12
$ STARPU_FXT_TRACE=1 STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ ./tools/starpu_fxt_tool -i /tmp/prof_file_${USER}_0
$ vite paje.trace
Starting with StarPU 1.1, it is also possible to reproduce the performance that we show in our articles on our machines, by installing SimGrid and then running StarPU in simulation mode with the performance models of our machines:
$ ../configure --enable-simgrid
$ make -j 12
$ STARPU_PERF_MODEL_DIR=$PWD/../tools/perfmodels/sampling STARPU_HOSTNAME=mirage STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
# size    ms      GFlops
38400     9915    1903.7
(MPI simulation is not supported yet)
Using the StarPU docker image
Docker images for StarPU are available from registry.gitlab.inria.fr/starpu/docker/starpu.
Note these images are minimal, and do not provide GPU support.
Here is an example of how to use these images. By default, the latest master version of StarPU is built within the image. Other images are also available. The list of tags is available at https://gitlab.inria.fr/starpu/starpu-docker/container_registry/2057
docker run -it registry.gitlab.inria.fr/starpu/starpu-docker/starpu:latest
...
gitlab@42c20306e808:~$ starpu_machine_display
[starpu][42c20306e808][check_bus_config_file] No performance model for the bus, calibrating...
[starpu][42c20306e808][check_bus_config_file] ... done
Real hostname: 42c20306e808 (StarPU hostname: 42c20306e808)
StarPU has found :
...
gitlab@42c20306e808:~$ cd /usr/local/lib/starpu/
gitlab@42c20306e808:/usr/local/lib/starpu/examples$ ./vector_scal
[BEFORE] 1-th element    : 2.00
[BEFORE] (NX-1)th element: 204800.00
[AFTER] 1-th element     : 6.28 (should be 6.28)
[AFTER] (NX-1)-th element: 643072.00 (should be 643072.00)
[AFTER] Computation is correct
gitlab@42c20306e808:/usr/local/lib/starpu/examples$
Contact
- For any questions regarding StarPU, please contact the StarPU developers mailing list starpu-devel@inria.fr. You can also look over its archives.
- To submit an issue, if you have an account on Inria's GitLab, you can go to the following page. Without an account, send an email to starpu-devel@inria.fr. StarPU also has a repository on GitHub, where you can also submit issues. More information is available on this page.
- Details of the StarPU team people are also available.
Software using StarPU
Some software is known to be able to use StarPU to tackle heterogeneous architectures; here is a non-exhaustive list (feel free to ask to be added to the list!):
- AL4SAN, dense linear algebra library
- Chameleon, dense linear algebra library
- Exa2pro, Enhancing Programmability and boosting Performance Portability for Exascale Computing Systems
- ExaGeoStat, Machine learning framework for Climate/Weather prediction applications
- FLUSEPA, Navier-Stokes Solver for Unsteady Problems with Bodies in Relative Motion
- HiCMA, Low-rank general linear algebra library
- hmat, hierarchical matrix C/C++ library
- HPSM, a C++ API for parallel loops programs supporting multi-CPUs and multi-GPUs
- K'Star, OpenMP 4-compatible interface on top of StarPU.
- KSVD, dense SVD on distributed-memory manycore systems
- MAGMA, dense linear algebra library, starting from version 1.1
- MaPHyS, Massively Parallel Hybrid Solver
- MASA-StarPU, Parallel Sequence Comparison
- MOAO, HPC framework for computational astronomy, servicing the European Extremely Large Telescope and the Japanese Subaru Telescope
- PaStiX, sparse linear algebra library, starting from version 5.2.1
- PEPPHER, Performance Portability and Programmability for Heterogeneous Many-core Architectures
- QDWH, QR-based Dynamically Weighted Halley
- QMCkl, Quantum Monte Carlo Kernel Library
- qr_mumps, sparse linear algebra library
- ScalFMM, N-body interaction simulation using the Fast Multipole Method.
- SCHNAPS, Solver for Conservative Hyperbolic Non-linear systems Applied to PlasmaS.
- SignalPU, a Dataflow-Graph-specific programming model.
- SkePU, a skeleton programming framework.
- StarNEig, a dense nonsymmetric (generalized) eigenvalue solving library.
- STARS-H, HPC low-rank matrix market
- XcalableMP, Directive-based language eXtension for Scalable and performance-aware Parallel Programming
You can find here the list of publications related to applications using StarPU.
Help
For any questions regarding StarPU, you can:
- Read the latest release documentation, and more specifically the FAQ,
- Read the latest master documentation, and more specifically the FAQ,
- Read the tutorials,
- Browse the archive of the StarPU developers mailing list (archive list also available locally),
- Contact the StarPU developers mailing list, starpu-devel@inria.fr.
- Browse the archive of the StarPU announcement mailing list (archive list also available locally),