COSMA

COSMA is a parallel, high-performance, GPU-accelerated, matrix-matrix multiplication algorithm and library implementation that is communication-optimal for all combinations of matrix dimensions, number of processors and memory sizes, without the need for any parameter tuning. COSMA is written in C++11 with MPI, OpenMP and CUDA/ROCm programming models. The library is open-source (BSD 3-clause licence) and is freely available.

http://www.max-centre.eu/software/libraries#5

CoE: MaX