With processor clock speeds stagnating, parallel computing architectures such as GPUs have made a breakthrough in recent years. Modern clusters are often designed heterogeneously, offering different processors for different applications.
To avoid wasting the available compute power, highly efficient programs are mandatory. The talk covers the development of fast algorithms and their implementations on modern CPUs and GPUs, the maximum efficiency achievable with respect to peak performance and to power consumption, and the feasibility and limits of programs for CPUs, GPUs, and heterogeneous systems. Three entirely different applications from distinct fields are presented.
- The ALICE experiment at the LHC studies heavy-ion collisions at rates of several hundred Hz, where each collision produces thousands of particles whose trajectories must be reconstructed in real time by the ALICE High Level Trigger (HLT).
For this purpose, the ALICE HLT TPC track reconstruction and track merging have been adapted for GPUs, outperforming the fastest available CPUs by about a factor of three. Since the beginning of 2012, the tracker has been running in continuous operation on 64 nodes of the ALICE HLT, providing full real-time track reconstruction.
- The Linpack benchmark employs matrix multiplication (DGEMM) to solve a dense system of linear equations and is the standard tool for ranking compute clusters. Heterogeneous multi-GPU versions of DGEMM and Linpack have been developed, supporting CAL, CUDA, and OpenCL as backends. Using this implementation, the LOEWE-CSC cluster reached rank 22 in the November 2010 Top500 list of the fastest supercomputers, and the Sanam cluster took second place in the November 2012 Green500 list of the most power-efficient supercomputers.
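To make the accelerated operation concrete: DGEMM computes C = alpha*A*B + beta*C for dense matrices. A minimal, unoptimized sketch of that definition is shown below; the tuned implementations discussed in the talk instead use blocked, vectorized kernels and GPU backends, so this is only an illustration of the arithmetic being offloaded, not of their code.

```cpp
#include <cstddef>
#include <vector>

// Naive DGEMM sketch: C = alpha * A * B + beta * C for row-major
// n x n matrices. Real Linpack/DGEMM implementations use cache-blocked,
// vectorized (or GPU) kernels; this only illustrates the operation.
void dgemm_naive(std::size_t n, double alpha,
                 const std::vector<double>& A,
                 const std::vector<double>& B,
                 double beta, std::vector<double>& C) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

Because the inner loop performs 2n floating-point operations per row/column pair, DGEMM is compute-bound for large n, which is why it is well suited for measuring peak performance.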
- Erasure coding enables failure-tolerant data storage and is an absolute necessity for present-day computing infrastructure. Fast encoding implementations are presented that use exclusively either integer or logical vector instructions. Depending on the parameters, they hit different hard limits of the hardware: the maximum attainable memory bandwidth, the peak instruction throughput, or the PCI Express bandwidth when GPUs or FPGAs are employed.
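The simplest instance of such a code is single parity: the parity block is the XOR of all data blocks, computed with wide integer or vector operations, and any one lost block can be recovered by XOR-ing the parity with the survivors. The codes in the talk are more general, but this toy sketch (hypothetical names, illustrative block size) shows why the kernel is memory-bandwidth-bound: one XOR per word moved.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy single-parity erasure code: parity = XOR of k data blocks,
// computed with 64-bit integer operations. In practice the same loop
// would be vectorized (SSE/AVX logical instructions) or run on a GPU.
constexpr std::size_t WORDS = 4;  // block size in 64-bit words (toy value)
using Block = std::array<std::uint64_t, WORDS>;

Block xor_parity(const Block* data, std::size_t k) {
    Block p{};  // zero-initialized accumulator
    for (std::size_t b = 0; b < k; ++b)
        for (std::size_t w = 0; w < WORDS; ++w)
            p[w] ^= data[b][w];  // one logical op per word moved
    return p;
}
```

Since each data word is read once and combined with a single XOR, the arithmetic is trivial and throughput is dictated by how fast memory (or the PCI Express link, for accelerators) can deliver the blocks.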
The examples demonstrate that in most cases GPU implementations can be as efficient as their CPU counterparts with respect to the available peak performance; with respect to cost and power consumption, they are much more efficient. To achieve this, complex tasks must be split into serial and parallel parts such that multithreaded pipelines and asynchronous DMA transfers can hide CPU-bound tasks, ensuring continuous GPU kernel execution. A few cases are identified where this is not possible due to PCI Express limitations, or not reasonable because practical GPU languages are missing.
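The pipelining idea above can be sketched without GPU code: while one chunk is being processed (in the real systems, a GPU kernel fed by asynchronous DMA), the host already prepares and dispatches the next chunk, so serial host work overlaps with the parallel stage. The following CPU-only sketch uses std::async as a stand-in for kernel launches; the function name and chunked-sum workload are illustrative, not from the talk.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Double-buffered pipeline sketch: one chunk is "in flight"
// (stand-in for an asynchronous GPU kernel) while the host handles
// the previous result and prepares the next launch.
long pipelined_sum(const std::vector<std::vector<int>>& chunks) {
    long total = 0;
    std::future<long> inflight;  // the chunk currently being processed
    for (const auto& chunk : chunks) {
        // Launch processing of this chunk asynchronously.
        auto next = std::async(std::launch::async, [&chunk] {
            return std::accumulate(chunk.begin(), chunk.end(), 0L);
        });
        // Meanwhile, consume the previous chunk's result on the host
        // (real pipelines would also stage the next DMA transfer here).
        if (inflight.valid()) total += inflight.get();
        inflight = std::move(next);
    }
    if (inflight.valid()) total += inflight.get();  // drain the pipeline
    return total;
}
```

The key property is that the host never waits idle while a chunk is in flight; when the per-chunk host work fits under the kernel runtime, the parallel stage runs back-to-back, which is exactly the "continuous GPU kernel execution" goal stated above.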