Interdisciplinary Cluster Workshop on GPUs

Europe/Berlin
TUM/IGSSE

Andreas Müller (Excellence Cluster Universe), Elisa Resconi (TUM)
Description
The Interdisciplinary Cluster Workshop on GPUs is dedicated in particular to fostering new projects and collaborations in the field of GPU computing.

The goal of this technical workshop is to
- bring together communities from particle physics, astrophysics and computer science
- exchange experiences
- summarize the status of existing libraries and tools
- summarize the status of sampling and minimizer tools with parallel options

Local scientists present their work in talks over these two days, and selected external scientists join the workshop. Invited external speakers:

-Alfio Lazzaro (CERN/Cray), expert on likelihood on GPUs and on parallel coding
-Matthew Shepherd (Indiana University), author of AmpTools
-David Boersma (Uppsala University), Member of the IceCube Collaboration
-David Rohr (FIAS), GPU hardware
-Claudio Gheller (ETHZ), GPU visualization 

No registration. No fee.  

Scientific Organising Committee:
Frederik Beaujean (LMU, C2PAP)
Nicolay Hammer (LRZ)
Anupam Karmakar (LRZ)
Andreas Müller (TUM), Co-Chair
Stefan Recksiegel (TUM)
Elisa Resconi (TUM), Chair

Schedule: Please see the timetable for the current status.

This event is organised and funded by the Excellence Cluster Universe.
Abstract Booklet
Schedule
    • Overview of GPU hardware and CUDA
    • 1
      Scripting CUDA
      Scripting languages like Python are very popular tools for rapid prototyping of applications because of the instant gratification of interpreted languages. In this talk scripting facilities are presented for the use of CUDA-compatible GPGPUs. Examples are given for MATLAB, Python and the statistical programming language R. It is also shown how R can be used as a glue language to build larger applications from simple CUDA kernels with the PGI OpenACC Fortran compiler.
      Speaker: Ferdinand Jamitzky (LRZ)
      Slides
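As a flavour of the scripting pattern described in the abstract, here is a minimal pure-Python stand-in for what PyCUDA-style wrappers do: the kernel is a source string compiled at runtime and returned as a callable. The function names and the toy kernel are illustrative, not taken from the talk.

```python
def make_elementwise_kernel(name, expr):
    """Compile an elementwise "kernel" from a source expression string."""
    src = f"def {name}(x):\n    return [{expr} for v in x]\n"
    namespace = {}
    exec(src, namespace)  # runtime "compilation", playing the device compiler's role
    return namespace[name]

# Build a kernel that squares and shifts each element, then apply it.
square_shift = make_elementwise_kernel("square_shift", "2.0 * v * v + 1.0")
print(square_shift([0.0, 1.0, 2.0]))  # -> [1.0, 3.0, 9.0]
```

In the real scripting workflow the string would be CUDA C compiled by the driver; the glue-language idea is the same.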
    • 2
      Intel Xeon Phi – Intel’s way to go many-core
      Programming accelerators like GPGPUs using languages like OpenCL or CUDA is quite cumbersome and error-prone. Trying to overcome these difficulties, Intel developed their own Many Integrated Core (MIC) architecture. Offering 512-bit wide SIMD vector registers and a large number of threads, the new Intel Xeon Phi coprocessor is designed to retain the programmability and flexibility of the standard x86 architecture on the many-core level. We discuss the basics of the hardware architecture, the main programming models and the new Intel Xeon Phi based cluster “SuperMIC“ at LRZ.
      Speaker: Volker Weinberg (LRZ)
    • 3
      Discussion
    • 12:15
      Lunch break
    • 4
      AmpTools: A GPU-accelerated toolkit for amplitude analysis
      Amplitude analysis, commonly used in hadron spectroscopy to study resonance properties, necessarily involves an unbinned maximum likelihood fit. With modern experiments it is not unusual to attempt to fit on the order of a million events to a model with tens of free parameters. The repeated likelihood evaluation for each event that is necessary at each fit iteration can be dramatically accelerated by performing the calculations in parallel. We will discuss the general computational problem of amplitude analysis and the performance attained by parallelization with both a single GPU and a cluster of GPUs.
      Speaker: Matthew Shepherd (U Indiana)
      Slides
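The computational structure behind this talk can be sketched in a few lines: the per-event log terms of an unbinned likelihood are independent, which is exactly the map a GPU evaluates in parallel before reducing to a single sum. The toy Gaussian model below is illustrative, not the AmpTools API.

```python
import math

def intensity(x, mean, width):
    """Toy Gaussian model density (stand-in for the physics model)."""
    norm = width * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((x - mean) / width) ** 2) / norm

def nll(events, mean, width):
    # The per-event terms are independent: a GPU evaluates this map for all
    # events at once and reduces the results to a single sum.
    return sum(-math.log(intensity(x, mean, width)) for x in events)

events = [0.1, -0.2, 0.05, 0.3]
print(nll(events, mean=0.0, width=1.0))
```

A minimizer calls `nll` at every iteration with new parameters, which is why accelerating this one function dominates the fit time.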
    • Maximum likelihood methods on GPUs
    • 5
      Log-likelihood based event reconstruction in IceCube
      An overview of the currently used CPU-based algorithms to determine the type and parameters of each recorded event in IceCube. Many of these algorithms are implemented in a framework called "Gulliver". Whether or not this framework is suitable for GPU solutions probably depends on which strategy for parallelization is chosen.
      Speaker: David Boersma (U Uppsala)
      Slides
    • 16:00
      Coffee break
    • 6
      Parallelization of maximum likelihood fits on CPU and GPU: Algorithms and Technologies
      Data analyses based on maximum likelihood fits are commonly used for fitting statistical models to data samples. With large data samples and complex likelihood models, evaluating the likelihood can be very time-consuming, so speeding up its evaluation becomes particularly important. In this presentation Alfio will describe an algorithm which benefits from data vectorization and parallelization on CPU and GPU, and will then discuss the implementation technologies for porting the application to both devices.
      Speaker: Alfio Lazzaro (CERN)
      Slides
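One way to picture the parallel reduction step behind such fits is the classic pairwise tree reduction: partial sums are combined level by level in log2(N) steps instead of N sequential additions. This is a plain-Python sketch of the pattern only; real GPU implementations reduce within thread blocks in shared memory.

```python
import math

def tree_reduce(values):
    """Pairwise (tree) summation, the reduction pattern used on GPUs."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:  # pad odd-length levels with the additive identity
            vals.append(0.0)
        # Each "thread" i adds one adjacent pair; all pairs in a level are
        # independent and would run simultaneously on the device.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

terms = [math.log(x) for x in (1.0, 2.0, 4.0, 8.0)]
print(tree_reduce(terms))  # equals sum(terms) up to rounding
```

Pairwise summation also tends to accumulate less rounding error than a naive sequential sum, a welcome side effect for large event samples.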
    • 7
      GPU-accelerated Partial-Wave Analysis in Hadron Spectroscopy
      The study of the excitation spectra of baryons and mesons is an active field of hadron physics, where one wants to learn more about how hadrons are composed of quarks and gluons and how exactly confinement works. In typical experiments many of these excited hadrons are produced at the same time. They appear as short-lived resonances that decay into final-state particles measured by the detector. Partial-wave analysis methods are used to disentangle the observed mixture of the contributing resonances, by decomposing the measured kinematic distributions of the decay particles into partial waves with well-defined spin and parity quantum numbers. This spin-parity decomposition is an expansion into a well-defined set of functions, not unlike a Fourier analysis. The coefficients of the expansion describe the strengths and phases of each partial wave and are determined by a multi-dimensional likelihood fit to the measured kinematic distributions which typically contains about 100 free parameters. Due to the high dimensionality of the problem (up to 11 dimensions), the large number of fit parameters (order of 200), and the large amount of data (up to 10^8 events), the computation of the likelihood function is expensive. Furthermore, in order to assess systematic errors, fits have to be repeated with a variety of models and on a number of data sets. Therefore speed is an issue. GPGPU computing can help to alleviate this bottleneck and also enables new analysis schemes that would be prohibitively expensive on CPU-based systems.
      Speaker: Boris Grube (TUM)
      Slides
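The spin-parity decomposition described above can be sketched as a coherent sum of complex amplitudes: the model intensity is |Σ_w c_w A_w(x)|², where the complex coefficients c_w (strength and phase of each wave) are the fit parameters. The "waves" and coefficients below are invented placeholders, not a physics model.

```python
import cmath

def pw_intensity(x, coeffs, amplitudes):
    """Coherent sum over partial waves, evaluated for one event x."""
    total = sum(c * amp(x) for c, amp in zip(coeffs, amplitudes))
    return abs(total) ** 2

# Two dummy "waves" with different angular dependence.
waves = [lambda x: cmath.exp(1j * x), lambda x: cmath.exp(2j * x)]
coeffs = [1.0 + 0.0j, 0.5 * cmath.exp(1j * 0.3)]  # strengths and relative phase

# Each event is evaluated independently, which is what makes a likelihood
# fit over up to 10^8 events a good match for GPUs.
print(pw_intensity(0.7, coeffs, waves))
```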
    • 8
      Discussion
    • 18:15
      Dinner: Bavarian Brotzeit
  • Tuesday 15 April
    • 9
      Fast vectorized and GPU-accelerated Algorithms for multiple applications
      With processor clock speeds having stagnated, parallel computing architectures like GPUs have achieved a breakthrough in recent years. Modern clusters are often designed heterogeneously and offer different processors for different applications. In order not to waste the available compute power, highly efficient programs are mandatory. The talk is about the development of fast algorithms and their implementations on modern CPUs and GPUs, about the maximum achievable efficiency with respect to peak performance and to power consumption respectively, and about the feasibility and limits of programs for CPUs, GPUs, and heterogeneous systems. Three totally different applications from distinct fields are presented:
      - The ALICE experiment at the LHC studies heavy-ion collisions at high rates of several hundred Hz, while every collision produces thousands of particles whose trajectories must be reconstructed in real time by the ALICE High Level Trigger (HLT). For this purpose, ALICE HLT TPC track reconstruction and track merging have been adapted for GPUs, outperforming the fastest available CPUs by about a factor of three. Since the beginning of 2012, the tracker has been running in nonstop operation on 64 nodes of the ALICE HLT, providing full real-time track reconstruction.
      - The Linpack benchmark employs matrix multiplication (DGEMM) to solve a dense system of linear equations and is the standard tool for ranking compute clusters. Heterogeneous multi-GPU-enabled versions of DGEMM and Linpack have been developed supporting CAL, CUDA, and OpenCL as backends. Employing this implementation, the LOEWE-CSC cluster ranked 22nd in the November 2010 Top500 list of the fastest supercomputers, and the Sanam cluster achieved second place in the November 2012 Green500 list of the most power-efficient supercomputers.
      - Erasure coding enables failure-tolerant data storage and is an absolute necessity for present-day computer infrastructure. Fast encoding implementations are presented which use exclusively either integer or logical vector instructions. Depending on certain parameters, they hit different hard limits of the hardware: either the maximum attainable memory bandwidth, or the peak instruction throughput, or the PCI Express bandwidth limit when GPUs or FPGAs are employed.
      The examples demonstrate that in most cases GPU implementations can be as efficient as their CPU counterparts with respect to the available peak performance; with respect to costs and power consumption, they are much more efficient. For this purpose, complex tasks must be split into serial and parallel parts such that multithreaded pipelines and asynchronous DMA transfers can hide CPU-bound tasks, ensuring continuous GPU kernel execution. A few cases are identified where this is not possible due to PCI Express limitations, or not reasonable because practical GPU languages are missing.
      Speaker: David Rohr (FIAS)
      Slides
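The erasure-coding application above admits a tiny worked example: a single-parity code where the parity block is the bytewise XOR of the data blocks, so any one lost block can be rebuilt. This is a minimal sketch of the idea, not the presented implementation; on real hardware the XOR maps directly onto wide vector instructions.

```python
def encode(blocks):
    """Compute the parity block as the bytewise XOR of all data blocks."""
    parity = bytes(len(blocks[0]))  # all-zero start (XOR identity)
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return parity

def recover(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and parity."""
    return encode(list(surviving_blocks) + [parity])

data = [b"abcd", b"wxyz", b"1234"]
parity = encode(data)
# Lose block 1 and rebuild it from the other blocks plus parity:
print(recover([data[0], data[2]], parity))  # -> b'wxyz'
```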
    • Selected topics on GPUs
    • 10
      GPU Optimization of Pseudo Random Number Generators for Random Ordinary Differential Equations
      Solving differential equations with stochastic terms involves a massive use of pseudo random numbers. The inherent potential for vectorization of such an application is used to its full extent on GPU accelerator hardware. A representative set of pseudo random number generators for uniformly and normally distributed pseudo random numbers has been implemented, optimized, and benchmarked. The resulting optimized variants outperform standard library implementations on GPUs.
      Speaker: Christoph Riesinger (TUM)
      Slides
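One standard way to give each GPU thread its own substream of a pseudo random number generator is leapfrogging: thread t of T consumes elements t, t+T, t+2T, ... of the base sequence. Below is a sketch using the classic MINSTD LCG; the thread layout is illustrative, and optimized generators jump ahead analytically instead of generating the full sequence as done here.

```python
M, A = 2**31 - 1, 16807  # MINSTD linear congruential generator constants

def lcg_sequence(seed, n):
    """The first n outputs of the sequential base generator."""
    out, x = [], seed
    for _ in range(n):
        x = (A * x) % M
        out.append(x)
    return out

def leapfrog_stream(seed, thread_id, num_threads, n):
    """The n values thread `thread_id` of `num_threads` would generate."""
    full = lcg_sequence(seed, n * num_threads)
    return full[thread_id::num_threads]

# Interleaving all substreams reproduces the sequential sequence exactly,
# so parallel results stay bit-identical to a serial reference run.
threads = [leapfrog_stream(42, t, 4, 3) for t in range(4)]
merged = [threads[i % 4][i // 4] for i in range(12)]
print(merged == lcg_sequence(42, 12))  # -> True
```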
    • Minimizers and sampling methods with/without GPUs
    • 11
      Discussion
    • 10:45
      Coffee break
    • 12
      A versatile tomographic forward and back projection approach on Multi-GPUs
      Iterative tomographic reconstruction is attracting more and more interest for X-ray computed tomography as parallel high-performance computing finds its way into compact and affordable computing systems in the form of GPU devices. However, when it comes to high-resolution X-ray computed tomography, e.g. measured at synchrotron facilities, the limited memory and bandwidth of such devices are soon stretched to their limits. Keeping the core part of tomographic reconstruction, the projectors, both versatile and fast for large datasets is especially challenging. We therefore demonstrate a multi-GPU accelerated forward and back projector based on projection matrices, taking advantage of two concepts to distribute large datasets into smaller units. A novel ultrafast precalculation kernel prevents unnecessary data transfers for cone-beam geometries. Contributors: Tobias Lasser, Peter B. Noël and Franz Pfeiffer
      Speaker: Andreas Fehringer (TUM)
      Slides
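Matrix-based forward and back projection as described here reduces to y = A x and x' = A^T y with a sparse system matrix A. A minimal sketch, with a tiny 4-voxel "image" and invented ray weights purely for illustration:

```python
def forward_project(rows, x):
    """y = A x, with A given as a list of sparse rows {voxel_index: weight}."""
    return [sum(w * x[j] for j, w in row.items()) for row in rows]

def back_project(rows, y, n_voxels):
    """x' = A^T y: scatter each ray value back along its voxel weights."""
    x = [0.0] * n_voxels
    for row, yi in zip(rows, y):
        for j, w in row.items():
            x[j] += w * yi
    return x

# Two rays through a 4-voxel image, each touching two voxels.
A = [{0: 1.0, 3: 1.0}, {1: 1.0, 2: 1.0}]
image = [1.0, 2.0, 3.0, 4.0]
sino = forward_project(A, image)
print(sino)                      # -> [5.0, 5.0]
print(back_project(A, sino, 4))  # -> [5.0, 5.0, 5.0, 5.0]
```

In an iterative scheme these two operations dominate the runtime, which is why splitting the sparse rows across multiple GPUs pays off for large volumes.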
    • Image processing
    • 12:15
      Lunch break
    • 13
      High Performance visualization of astrophysical data
      The talk will focus on the exploitation of emerging GPU-enabled, hybrid HPC architectures for scientific data visualization. We employ Splotch, a rendering algorithm that allows the production of high-quality imagery and supports very large-scale datasets. Splotch can exploit heterogeneous supercomputing architectures by means of an effective combination of MPI for distributed computing, OpenMP on shared-memory systems, and CUDA to use GPUs. We specifically focus on the CUDA implementation of Splotch, referring to the underlying performance model for data transfers, computations and memory access, and we present the results obtained in a number of tests and benchmarks which measure the performance and scalability of our algorithm.
      Speaker: Claudio Gheller (ETHZ)
      Slides
    • 14
      Current application of CUDA/OpenCL in IceCube simulation and analysis
      GPUs are already used extensively (and successfully) in IceCube, e.g. for simulating the propagation of light in the ice and for analyzing features in a sky map.
      Speaker: David Boersma (U Uppsala)
    • 15
      GPU experiences at the RZG
      The current supercomputer of the Max Planck Society, located at the Garching Computing Centre, has a significant fraction of its nodes accelerated with NVIDIA Kepler GPUs. This infrastructure at the RZG will be presented. To make use of this computing power, two codes from astrophysics and plasma physics have been investigated for porting to this new architecture. The presentation will show the experiences and results from this effort.
      Speaker: Tilman Dannert (RZG)
      Slides
    • Local activities and extreme applications
    • 16
      Training of the Neural Network z-Vertex Trigger for the Belle II Experiment
      Separating signal from background at the first level, the trigger is a crucial component of a particle physics experiment. For the Belle II experiment, which is currently under construction at the SuperKEKB B factory in Tsukuba (Japan), the z-coordinate of the vertex position is an especially suitable criterion for rejecting charged background tracks. Classical triggering methods lack accuracy for this parameter, but simulation studies with multilayer perceptrons have demonstrated the significant improvement that can be achieved using machine learning. When parallel hardware (e.g. FPGAs) is used, the inherently parallel structure of neural networks is well suited to the tight timing constraints of the trigger system. Extensive parallelization is also possible in various stages of the training procedure. Different parallelization approaches will be presented, including pattern-parallel training based on OpenMP and the node-parallel execution currently planned for the hardware realization.
      Speaker: Sara Neuhaus (MPP)
      Slides
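The pattern-parallel training mentioned above can be sketched as gradient accumulation over data chunks: each worker processes its share of the training patterns, and the partial gradients are summed before the weight update. The single-neuron model and the data below are invented placeholders, not the Belle II network.

```python
def gradient(w, patterns):
    """Summed gradient of 0.5*(w*x - t)^2 over a chunk of (x, t) patterns."""
    return sum((w * x - t) * x for x, t in patterns)

def parallel_step(w, patterns, num_workers, lr):
    # Split the patterns into chunks; with OpenMP each chunk would be
    # processed by its own thread, and the partial gradients reduced.
    chunks = [patterns[i::num_workers] for i in range(num_workers)]
    total_grad = sum(gradient(w, chunk) for chunk in chunks)
    return w - lr * total_grad

patterns = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # target weight is 2.0
w = 0.0
for _ in range(50):
    w = parallel_step(w, patterns, num_workers=2, lr=0.05)
print(round(w, 6))  # converges towards 2.0
```

Because the chunked gradient equals the full-batch gradient, the parallel training reproduces the serial result exactly, independent of the number of workers.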