StarPU: A Unified Runtime System for Heterogeneous Multicore Architectures

StarPU is a task programming library for hybrid architectures

  • The application provides algorithms and constraints
    • CPU/GPU implementations of tasks
    • A graph of tasks, using either StarPU's rich C/C++ API, or OpenMP pragmas.
  • StarPU internally deals with the following aspects:
    • Task dependencies
    • Optimized heterogeneous scheduling
    • Optimized data transfers and replication between main memory and discrete memories
    • Optimized cluster communications

Rather than handling low-level issues, programmers can concentrate on algorithmic aspects!

The StarPU documentation is available in PDF and in HTML. Please note that these documents are up-to-date with the latest official release of StarPU. The documentation for the master version is available here.

Latest News

Get the latest StarPU news by subscribing to the starpu-announce mailing list. See also the full news.

October 2021

An ETP4HPC White Paper, 'Task-Based Performance Portability in HPC', has been published and can be read here or directly downloaded here.

October 2021

The release 1.3.9 of StarPU is now available! Among other functionality, the 1.3 release series brings MPI master-slave support, a tool to replay executions through SimGrid, an HDF5 implementation of the out-of-core support, a new implementation of StarPU-MPI on top of NewMadeleine, implicit support for asynchronous partition planning, and a resource management module to share processor cores and accelerator devices with other parallel runtime systems.

Discover StarPU




Portability is obtained by means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named codelet. Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. In case a codelet can run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine, including multiple GPUs. One can even specify several functions for each architecture (new in v1.0), as well as parallel implementations (e.g. in OpenMP), and StarPU will automatically determine which version is best for each input size (new in v0.9). StarPU can execute them concurrently, e.g. one per socket, provided that the task implementations support it (which is the case for MKL, but unfortunately most often not for OpenMP).
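
As a rough illustration, here is a minimal sketch loosely following the vector-scaling example from the StarPU documentation (the kernel and codelet names are ours): a codelet simply groups the per-architecture implementations of one kernel.

#include <starpu.h>

/* CPU implementation: scale a vector in place. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
  struct starpu_vector_interface *v = buffers[0];
  unsigned n = STARPU_VECTOR_GET_NX(v);
  float *val = (float *) STARPU_VECTOR_GET_PTR(v);
  float factor;
  starpu_codelet_unpack_args(cl_arg, &factor);
  for (unsigned i = 0; i < n; i++)
    val[i] *= factor;
}

/* A CUDA implementation (e.g. scal_cuda) would be written similarly
 * and listed in .cuda_funcs below. */
static struct starpu_codelet scal_cl =
{
  .cpu_funcs = { scal_cpu },
  /* .cuda_funcs = { scal_cuda }, */
  .nbuffers = 1,
  .modes = { STARPU_RW },
};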


The StarPU programming interface is very generic. For instance, various data structures are supported mainline (vectors, dense matrices, CSR/BCSR/COO sparse matrices, ...), but application-specific data structures can also be supported, provided that the application describes how the data is to be transferred (e.g. as a series of contiguous blocks). That was for instance used for hierarchically-compressed matrices (h-matrices).
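
For instance, here is a hedged sketch of how an application hands its data over to StarPU through handles, using the mainline vector and matrix interfaces (the array names and sizes are placeholders, starpu_init() is assumed to have been called, and scal_cl is the codelet sketched above):

/* x is a vector of n floats, A an n-by-n matrix with leading dimension ld. */
starpu_data_handle_t x_handle, A_handle;
starpu_vector_data_register(&x_handle, STARPU_MAIN_RAM,
                            (uintptr_t) x, n, sizeof(float));
starpu_matrix_data_register(&A_handle, STARPU_MAIN_RAM,
                            (uintptr_t) A, ld, n, n, sizeof(float));

/* Tasks only refer to handles; StarPU moves the data where the tasks run. */
float factor = 2.0f;
starpu_task_insert(&scal_cl,
                   STARPU_RW, x_handle,
                   STARPU_VALUE, &factor, sizeof(factor),
                   0);

starpu_task_wait_for_all();
starpu_data_unregister(x_handle);
starpu_data_unregister(A_handle);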

Data transfers

To relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are automatically made available on the compute resource. Data are also kept on e.g. GPUs as long as they are needed for further tasks. When a device runs out of memory, StarPU uses an LRU strategy to evict unused data. StarPU also takes care of automatically prefetching data, which makes it possible to overlap data transfers with computations (including GPU-GPU direct transfers) and get the most out of the architecture.


Dependencies between tasks can be expressed in any of several ways, to give the programmer the greatest flexibility:

  • implicitly, from RAW, WAW, and WAR data dependencies,
  • explicitly, through tags which act as rendez-vous points between tasks (thus including tasks which have not been created yet),
  • explicitly, between pairs of tasks.

These dependencies are computed in a completely decentralized way, and can be introduced completely dynamically as tasks get submitted by the application while tasks previously submitted are being executed.
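
With the implicit mode, for instance, merely submitting two tasks that respectively write and read the same handle is enough for StarPU to infer the ordering (a small fragment; produce_cl and consume_cl are hypothetical codelets):

/* consume_cl only starts once produce_cl has written h (RAW dependency). */
starpu_task_insert(&produce_cl, STARPU_W, h, 0);
starpu_task_insert(&consume_cl, STARPU_R, h, 0);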

StarPU also supports an OpenMP-like reduction access mode (new in v0.9).

It also supports a commute access mode to allow data access commutativity (new in v1.2).

It also supports transparent dependency tracking between hierarchical subpieces of data through asynchronous partitioning, allowing seamless concurrent read access to different partitioning levels (new in v1.3).

Heterogeneous Scheduling

StarPU obtains portable performance by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models. These determine the relative performance achieved by the different processing units for the various kinds of tasks, and thus make it possible to automatically let each processing unit execute the tasks it is best at. Various strategies and variants are available; some of them are centralized, but most of them are completely distributed so as to scale properly with large numbers of cores: dmdas (a data-locality-aware MCT strategy, similar to HEFT but which starts executing tasks before the whole task graph is submitted, thus allowing dynamic task submission and a decentralized scheduler, as well as an energy-optimizing extension), eager (a simple centralized queue), lws (distributed locality-aware work stealing), ... The overhead per task is typically on the order of a microsecond. Tasks should thus be a few orders of magnitude bigger, such as 100 microseconds or 1 millisecond, to make the overhead negligible.
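
Performance models are attached per codelet. Below is a minimal sketch of a history-based model, extending the codelet sketched earlier (the symbol name is arbitrary):

static struct starpu_perfmodel scal_model =
{
  .type = STARPU_HISTORY_BASED,
  .symbol = "scal",           /* key under which measured timings are stored */
};

static struct starpu_codelet scal_cl =
{
  .cpu_funcs = { scal_cpu },
  .nbuffers = 1,
  .modes = { STARPU_RW },
  .model = &scal_model,       /* lets dmdas and friends predict execution times */
};

The scheduling strategy itself is selected at run time through the STARPU_SCHED environment variable (my_application is a placeholder name):

$ STARPU_SCHED=dmdas ./my_application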


To deal with clusters, StarPU can nicely integrate with MPI, through explicit or implicit support, according to the application's preference.

  • Explicit network communication requests can be emitted, which will then be automatically combined and overlapped with the intra-node data transfers and computation,
  • The application can also just provide the whole task graph and a data distribution over MPI nodes, and StarPU will automatically determine which MPI node should execute which task and automatically generate all required MPI communications accordingly (new in v0.9); see the sketch after this list. We have obtained excellent scaling on a 256-node cluster with GPUs, but have not yet had the opportunity to test on a larger cluster. We have however measured that with naive task submission it should scale to a thousand nodes, and with pruning-tuned task submission it should scale to about a million nodes.
  • Starting with v1.3, the application can also just provide the whole task graph, and let StarPU decide the data distribution and task distribution, thanks to a master-slave mechanism. This will however by nature have a more limited scalability than the fully distributed paradigm mentioned above.
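
Below is a hedged sketch of the implicit MPI support (assuming starpu_mpi_init() has been called and tile_handle is a registered handle; the tag and rank values are placeholders): each handle is assigned an owner rank and an MPI tag, and starpu_mpi_task_insert() then decides on which node the task runs and generates the needed transfers.

#include <starpu_mpi.h>

/* Declare that rank 1 owns this piece of data (the tag must be unique). */
starpu_mpi_data_register(tile_handle, /* tag */ 42, /* owner */ 1);

/* Every node submits the same task graph; StarPU-MPI runs each task on
 * the node owning its written data and posts the required MPI transfers. */
starpu_mpi_task_insert(MPI_COMM_WORLD, &scal_cl,
                       STARPU_RW, tile_handle,
                       0);
starpu_task_wait_for_all();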

Out of core

When memory is not big enough for the working set, one may have to resort to using disks. StarPU makes this seamless thanks to its out-of-core support (new in v1.2): StarPU automatically evicts data from main memory in advance, and prefetches required data back before it is needed by tasks.
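
A hedged sketch of enabling this (the path and size are placeholders; plain unistd I/O is used here, other backends such as HDF5 being registered the same way):

/* Allow StarPU to spill data to files under /tmp/starpu-ooc,
 * using up to 1 GB of disk space. */
starpu_disk_register(&starpu_disk_unistd_ops, (void *) "/tmp/starpu-ooc",
                     (starpu_ssize_t) 1024*1024*1024);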

Fortran interface

StarPU comes with native Fortran bindings and examples.

OpenMP 4-compatible interface

K'Star provides an OpenMP 4-compatible interface on top of StarPU. This makes it possible to simply rebuild OpenMP applications with the K'Star source-to-source compiler, then build them with the usual compiler, and the result will use the StarPU runtime.

K'Star also provides some extensions to the OpenMP 4 standard, to let the StarPU runtime perform online optimizations.
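
For instance, a standard OpenMP 4 task with data dependencies, which K'Star can map onto StarPU tasks and data access modes (a generic OpenMP sketch, not K'Star-specific syntax; update() is a placeholder kernel):

#pragma omp task depend(in: a) depend(inout: b)
{
  /* The depend clauses play the same role as StarPU's R/RW access modes:
   * the runtime infers the task dependencies from them. */
  update(&b, a);
}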

OpenCL-compatible interface

StarPU provides an OpenCL-compatible interface, SOCL, which makes it possible to simply run OpenCL applications on top of StarPU (new in v1.0).

Simulation support

StarPU can very accurately simulate an application's execution and measure the resulting performance thanks to the SimGrid simulator (new in v1.1). This makes it possible to quickly experiment with various scheduling heuristics, various application algorithms, and even various platforms (available GPUs and CPUs, available bandwidth)!

All in all

All that means that the following sequential source code of a tiled version of the classical Cholesky factorization algorithm using BLAS is also (almost) valid StarPU code, possibly running on all the CPUs and GPUs, and given a data distribution over MPI nodes, it is even a distributed version!

for (k = 0; k < tiles; k++) {
  potrf(A[k,k])
  for (m = k+1; m < tiles; m++)
    trsm(A[k,k], A[m,k])
  for (m = k+1; m < tiles; m++)
    syrk(A[m,k], A[m,m])
  for (m = k+1; m < tiles; m++)
    for (n = k+1; n < m; n++)
      gemm(A[m,k], A[n,k], A[m,n])
}
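
With StarPU, each of these kernel calls becomes a task insertion on the corresponding tile handles; for instance the gemm call could be submitted as follows (a hedged fragment, assuming a gemm_cl codelet and a 2D array of tile handles A_h):

starpu_task_insert(&gemm_cl,
                   STARPU_R,  A_h[m][k],
                   STARPU_R,  A_h[n][k],
                   STARPU_RW, A_h[m][n],
                   0);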

Supported Architectures

  • SMP/Multicore Processors (x86, PPC, ARM, ...; all Debian architectures have been tested)
  • NVIDIA GPUs (e.g. heterogeneous multi-GPU), with pipelined and concurrent kernel execution support (new in v1.2) and GPU-GPU direct transfers (new in v1.1)
  • OpenCL devices
  • Intel SCC (experimental, new in v1.2)
  • Intel MIC / Xeon Phi (new in v1.2)

Supported Operating Systems

  • GNU/Linux
  • Mac OS X
  • Windows


StarPU is checked every night with

  • Valgrind / Helgrind
  • gcc's Address/Leak/Thread/Undefined Sanitizers
  • cppcheck
  • Coverity

Performance analysis tools

In order to understand the performance obtained by StarPU, it is helpful to visualize the actual behaviour of the applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces that can be visualized thanks to the ViTE (Visual Trace Explorer) open source tool.

Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: it appears that with this scheduling policy, the algorithm suffers from a certain lack of parallelism. Measured speed: 175.32 GFlop/s


This second trace depicts the behaviour of the same application using a scheduling strategy (dmda) trying to minimize load imbalance thanks to auto-tuned performance models and to keep data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s


Temanejo can be used to debug the task graph, as shown below (new in v1.1).


StarVZ (paper) can be used to investigate performance, as shown below


Software using StarPU

Some software is known to be able to use StarPU to tackle heterogeneous architectures; here is a non-exhaustive list (feel free to ask to be added to it!):

  • AL4SAN, dense linear algebra library
  • Chameleon, dense linear algebra library
  • Exa2pro, Enhancing Programmability and boosting Performance Portability for Exascale Computing Systems
  • ExaGeoStat, Machine learning framework for Climate/Weather prediction applications
  • FLUSEPA, Navier-Stokes Solver for Unsteady Problems with Bodies in Relative Motion
  • HiCMA, Low-rank general linear algebra library
  • hmat, hierarchical matrix C/C++ library
  • HPSM, a C++ API for parallel loop programs supporting multi-CPUs and multi-GPUs
  • K'Star, OpenMP 4-compatible interface on top of StarPU.
  • KSVD, dense SVD on distributed-memory manycore systems
  • MAGMA, dense linear algebra library, starting from version 1.1
  • MaPHyS, Massively Parallel Hybrid Solver
  • MASA-StarPU, Parallel Sequence Comparison
  • MOAO, HPC framework for computational astronomy, servicing the European Extremely Large Telescope and the Japanese Subaru Telescope
  • PaStiX, sparse linear algebra library, starting from version 5.2.1
  • PEPPHER, Performance Portability and Programmability for Heterogeneous Many-core Architectures
  • QDWH, QR-based Dynamically Weighted Halley
  • QMCkl, Quantum Monte Carlo Kernel Library
  • qr_mumps, sparse linear algebra library
  • ScalFMM, N-body interaction simulation using the Fast Multipole Method.
  • SCHNAPS, Solver for Conservative Hyperbolic Non-linear systems Applied to PlasmaS.
  • SignalPU, a Dataflow-Graph-specific programming model.
  • SkePU, a skeleton programming framework.
  • StarNEig, a dense nonsymmetric (generalized) eigenvalue solving library.
  • STARS-H, HPC low-rank matrix market
  • XcalableMP, Directive-based language eXtension for Scalable and performance-aware Parallel Programming

You can find here the list of publications related to applications using StarPU.

Give it a try!

You can easily try the performance on the Cholesky factorization, for instance. Make sure to have the pkg-config and hwloc software installed for proper CPU control, as well as BLAS kernels for your computation units (e.g. MKL for CPUs and CUBLAS for GPUs) configured in your environment.

$ wget
$ tar xf starpu-someversion.tar.gz
$ cd starpu-someversion
$ ./configure
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ STARPU_SCHED=dmdas mpirun -np 4 -machinefile mymachines ./mpi/examples/matrix_decomposition/mpi_cholesky_distributed -size $((960*40*4)) -nblocks $((40*4))

Note that the dmdas scheduler uses performance models, and thus needs calibration runs before exhibiting optimized performance (until the "model something is not calibrated enough" messages go away).

To get a glimpse at what happened, you can get an execution trace by installing FxT and ViTE, and enabling traces:

$ ./configure --with-fxt
$ make -j 12
$ STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
$ ./tools/starpu_fxt_tool -i /tmp/prof_file_${USER}_0
$ vite paje.trace

Starting with StarPU 1.1, it is also possible to reproduce the performance that we show in our articles on our machines, by installing SimGrid and then using the simulation mode of StarPU with the performance models of our machines:

$ ./configure --enable-simgrid
$ make -j 12
$ STARPU_PERF_MODEL_DIR=$PWD/tools/perfmodels/sampling STARPU_HOSTNAME=mirage STARPU_SCHED=dmdas ./examples/cholesky/cholesky_implicit -size $((960*40)) -nblocks 40
# size	ms	GFlops
38400	9915	1903.7

(MPI simulation is not supported yet)


For any questions regarding StarPU, you can:
