GPU Bootcamps and Hackathons 2022

GPU acceleration is a key feature of many modern supercomputers. Therefore, a series of GPU hackathons and bootcamps, many of which are organized by partner institutions of European Centres of Excellence for High-Performance Computing and National Competence Centres, will take place throughout Europe again this year. The following (and many more) hackathons and bootcamps will be hosted in a hybrid or purely digital format:

The hackathons and bootcamps are multi-day intensive hands-on events designed to help scientists, researchers, and developers to accelerate and optimize their applications for GPUs using libraries, OpenACC, CUDA and other tools by pairing participants with dedicated mentors experienced in GPU programming and development. 

Representing distinguished scholars and pre-eminent institutions around the world, these teams of mentors and attendees work together to realize performance gains and speed-ups by taking advantage of parallel programming on GPUs.

Applications can be submitted at

Image of person typing on laptop


Darshan, a scalable HPC I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with minimum overhead. The name is taken from a Sanskrit word for “sight” or “vision”.

Darshan can be used to investigate and tune the I/O behavior of complex HPC applications. In addition, Darshan’s lightweight design makes it suitable for full time deployment for workload characterization of large systems. We hope that such studies will help the storage research community to better serve the needs of scientific computing. Darshan was originally developed on the IBM Blue Gene series of computers deployed at the Argonne Leadership Computing Facility, but it is portable across a wide variety of platforms include the Cray XE6, Cray XC30, and Linux clusters. Darshan routinely instruments jobs using up to 786,432 compute cores on the Mira system at ALCF.



SimGrid is a framework for developing simulators of distributed applications that executed on distributed platforms, which can in turn be used to prototype, evaluate and compare relevant platform configurations, system designs, and algorithmic approaches.

SimGrid provides ready to use models and APIs to simulate popular distributed computing platforms (commodity clusters, wide-area and local-area networks, peers over DSL connections, data centers, etc.) As a result, SimGrid has served as the foundational technology for developing simulators and obtaining experimental results for a wide range of distributed computing domains: Grid computing, P2P computing, Cloud computing, Fog computing, Volunteer computing, HPC with MPI, MapReduce.

SimGrid is accurate, scalable, and usable

  • Accurate: SimGrid’s simulation models have been theoretically and experimentally evaluated and validated
  • Scalable: SimGrid’s simulation models and their implementations are fast and have low memory footprint, making is possible to run SimGrid simulations quickly on a single machine
  • Usable: SimGrid is free software (LGPL license) available on Linux / Mac OS X / Windows, and allows users to write simulators in C++, C, Python, or Java.



PyPOP package is designed to make it easy to perform application performance analyses based on the POP methodology. The primary goals of PyPOP are:

  • Easy calculation of POP metrics
  • High quality figure generation
  • Easy access to underlying data and statistics (using Pandas)
  • Flexible and extensible design



MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a performance analysis and optimization framework operating at binary level with a focus on core performance. Its main goal of is to guide application developpers along the optimization process through synthetic reports and hints.

MAQAO mixes both dynamic and static analyses based on its ability to reconstruct high level structures such as functions and loops from an application binary.

Since MAQAO operates at binary level, it is agnostic with regard to the language used in the source code and does not require recompiling the application to perform analyses. MAQAO has also been designed to concurrently support multiple architectures. Currently the Intel64 and Xeon Phi architectures are implemented.

The main modules of MAQAO are LProf, a sampling-based lightweight profiler offering results at both function and loop levels, CQA, a static analyser assessing the quality of the code generated by the compiler, and ONE View, a supervising module responsible for invoking the others and aggregating their results.

Other modules, currently in beta version, allow performing value profiling (VProf) and decremental analysis (DECAN).



TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, UPC, Java, Python.

TAU (Tuning and Analysis Utilities) is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements as well as event-based sampling. All C++ language features are supported including templates and namespaces. The API also provides selection of profiling groups for organizing and controlling instrumentation. The instrumentation can be inserted in the source code using an automatic instrumentor tool based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime in the Java Virtual Machine, or manually using the instrumentation API.

TAU’s profile visualization tool, paraprof, provides graphical displays of all the performance analysis results, in aggregate and single node/context/thread forms. The user can quickly identify sources of performance bottlenecks in the application using the graphical interface. In addition, TAU can generate event traces that can be displayed with the Vampir, Paraver or JumpShot trace visualization tools.



Score-P – Scalable Performance Measurement Infrastructure for Parallel Codes- The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. It has been created in the German BMBF project SILC and the US DOE project PRIMA and will be maintained and enhanced in a number of follow-up projects such as LMAC, Score-E, and HOPSA. Score-P is developed under a BSD 3-Clause License and governed by a meritocratic governance model.



Dimemas is a performance analysis tool for message-passing programs. It enables the user to develop and tune parallel applications on a workstation, while providing an accurate prediction of their performance on the parallel target machine. The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters. Thus, performance experiments can be done easily. The supported target architecture classes include networks of workstations, single and clustered SMPs, distributed memory parallel computers, and even heterogeneous systems.

For communication, a linear performance model is used, but some non-linear effects such as network conflicts are taken into account. The simulator allows specifying different task to node mappings.

Dimemas generates trace files that are suitable for Paraver enabling the user to conveniently examine any performance problems indicated by a simulator run. The analysis module performs critical path analysis reporting the total CPU usage of different code blocks, as well as their importance for the program execution time. Based on a statistical evaluation of synthetically perturbed traces and architectural parameters, the importance of different performance parameters and the benefits of particular code optimizations can be analyzed.



Cube, which is used as performance report explorer for Scalasca and Score-P, is a generic tool for displaying a multi-dimensional performance space consisting of the dimensions (i) performance metric, (ii) call path, and (iii) system resource. Each dimension can be represented as a tree, where non-leaf nodes of the tree can be collapsed or expanded to achieve the desired level of granularity. In addition, Cube can display multi-dimensional Cartesian process topologies.