Agenda
The workshop takes place on Sunday, November 11, 2018.
SC18 feedback form: https://submissions.supercomputing.org/eval.html
| Time | Program |
| --- | --- |
| 9:00-9:05 | Opening Remarks – Sunita Chandrasekaran & Sandra Wienke |
| 9:05-9:35 | Keynote: Jack Wells (Oak Ridge National Laboratory, USA) – Experiences in Using Directive-based Programming for Accelerated Computing Architectures |
| | Session 1: Porting Scientific Applications using Directives (Chair: Sandra Wienke, RWTH Aachen University, Germany) |
| 9:35-10:00 | Wenlu Zhang (Chinese Academy of Sciences, Beijing) – Heterogeneous Programming and Optimization of Gyrokinetic Toroidal Code Using Directives |
| 10:00-10:30 | Coffee break |
| 10:30-10:50 | Ada Sedova (Oak Ridge National Laboratory, USA) – Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation |
| | Session 2: Using OpenMP (Chair: Jeff Larkin, NVIDIA) |
| 10:50-11:15 | Artem Chikin (University of Alberta, Canada) – OpenMP Target Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries |
| 11:15-11:35 | Rahulkumar Gayatri (Lawrence Berkeley National Laboratory, USA) – A Case Study for Performance Portability using OpenMP 4.5 |
| | Session 3: Using OpenACC (Chair: Randy Allen, Mentor Graphics) – Best Paper Award will be announced |
| 11:35-12:00 | Aniket Shivam (University of California, Irvine, USA) – OpenACC Routine Directive Propagation using Interprocedural Analysis |
| 12:00-12:25 | Anmol Paudel (Marquette University, USA) – OpenACC Based GPU Parallelization of Plane Sweep Algorithm for Geometric Intersection |
| 12:25-12:30 | Closing Remarks |
KEYNOTE: Experiences in Using Directive-based Programming for Accelerated Computing Architectures
Dr. Jack Wells, Director of Science for the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory, Tennessee, USA, will give the keynote at the workshop.
Bio
Dr. Jack Wells is the Director of Science for the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science national user facility, and for its Titan supercomputer, located at Oak Ridge National Laboratory (ORNL). Wells is responsible for the scientific outcomes of the OLCF’s user programs.
Wells has previously led both ORNL’s Computational Materials Sciences group in the Computer Science and Mathematics Division and the Nanomaterials Theory Institute in the Center for Nanophase Materials Sciences. Prior to joining ORNL as a Wigner Fellow in 1997, Wells was a postdoctoral fellow within the Institute for Theoretical Atomic and Molecular Physics at the Harvard-Smithsonian Center for Astrophysics.
Wells has a Ph.D. in physics from Vanderbilt University, has authored or co-authored over 80 scientific papers, and has edited one book, spanning nanoscience, materials science and engineering, nuclear and atomic physics, computational science, applied mathematics, and text-based data analytics.
Abstract
Accelerated computing architectures have grown in their application within scientific computing since their introduction approximately ten years ago. From the earliest days, there has been a focus on the programmability of these systems. A variety of desired outcomes have driven the development of directive-based programming approaches for accelerated computing, including improvements in developer productivity and application portability, and APIs that are non-proprietary, vendor-neutral, and supportive of incremental acceleration of application codes. The first specification, OpenACC 1.0, was introduced in November 2011. With major enhancements, OpenACC has evolved to version 2.5 and is providing constructive input to the OpenMP specification. In this talk, we discuss how the use of compiler directives has evolved over time and their implementation status on Titan and Summit. The talk will also discuss which applications on Titan are using directives and how their usage has been changing over time. To end, we will discuss the challenges that need to be solved and how emerging frameworks are changing the way C++ applications use directives (e.g., as backends for Kokkos).
Heterogeneous Programming and Optimization of Gyrokinetic Toroidal Code Using Directives
Wenlu Zhang (Institute of Physics, Chinese Academy of Sciences) is presenting this paper. He is a professor of physics and the director of Fusion Plasma Physics at the Institute of Physics, Chinese Academy of Sciences, Beijing, China. He is devoted to theoretical and simulation research on turbulence and transport in fusion plasmas, the crucial next step in the quest for clean and abundant fusion energy. He is also the core developer of GTC, a well-benchmarked, massively parallel particle-in-cell code for integrated simulations of the confinement properties of burning plasmas. He also leads the multi-institutional collaborative optimization work for the Tianhe-1, Tianhe-2, and upcoming Tianhe-3 supercomputers of China, and for the Summit supercomputer of the US.
Abstract
The latest production version of the fusion particle simulation code, the Gyrokinetic Toroidal Code (GTC), has been ported to and optimized for the next-generation exascale GPU supercomputing platform. Heterogeneous programming using directives has been utilized to fuse, and thus balance, the continuously implemented physical capabilities with rapidly evolving software/hardware systems. The original code has been refactored into a set of unified functions/calls to enable acceleration for all species of particles. Binning and GPU texture-caching techniques have also been used to boost the performance of the particle push and shift operations. In order to identify the hotspots, the GPU version of the GTC code was first benchmarked on up to 8,000 nodes of the Titan supercomputer, showing an overall speedup of about 2–3 times when comparing NVIDIA M2050 GPUs to Intel Xeon X5670 CPUs. This Phase I optimization was followed by further optimizations in Phase II, where single-node tests show an overall speedup of about 34 times on SummitDev and 7.9 times on Titan. Real physics tests on the Summit machine showed impressive scaling properties, reaching roughly 50% efficiency on 928 nodes of Summit. The GPU+CPU speedup over the CPU-only version is more than 20 times, leading to unprecedented speed.
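For readers unfamiliar with the directive-based approach the abstract refers to, a minimal sketch of an offloaded particle push loop is shown below. It is not taken from the GTC source; the function name, arrays, data clauses, and the update itself are illustrative assumptions.

```c
/* A minimal sketch, not taken from GTC, of a particle push loop offloaded
 * with an OpenACC directive; arrays and the update are illustrative. */
#include <stddef.h>

void push_particles(size_t n, double dt,
                    double *restrict x, double *restrict v,
                    const double *restrict efield)
{
    /* In a production code the arrays would stay resident on the device
     * across time steps (e.g., via enclosing data regions) rather than
     * being copied on every call. */
    #pragma acc parallel loop copyin(efield[0:n]) copy(x[0:n], v[0:n])
    for (size_t i = 0; i < n; ++i) {
        v[i] += dt * efield[i];   /* acceleration from the local field */
        x[i] += dt * v[i];        /* advance the particle position     */
    }
}
```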
Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation
Ada Sedova is presenting this paper by Ada Sedova, Andreas F. Tillack and Arnold Tharrington (all from Oak Ridge National Laboratory, USA).
Ada Sedova is a CSEEN Postdoctoral Research Associate in the Scientific Computing Group at the National Center for Computational Sciences (NCCS), Oak Ridge National Laboratory. She is working on high-performance scientific computing programs in computational biophysics, with a focus on software portability into the exascale era. In particular, she is studying best practices for creating portable libraries for bottleneck calculations in molecular dynamics simulation programs, including an exploration of how choices in programming language, the creation of standard high-level interfaces, and algorithm design affect code portability. Ada has a background in biophysical chemistry and biomolecular spectroscopy, as well as mathematics. She is also currently working on ab initio molecular dynamics simulations and experimental vibrational neutron spectroscopy of biomolecules, in addition to the concurrent testing and development of methods to simulate experimental spectra using both classical molecular dynamics and computational quantum chemistry methods.
Abstract
Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important initiative due to large differences in emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether the use of the OpenACC directive-based programming model, when applied to these kernels, can result in performance within an acceptable range for these types of programs in the HPC setting. We find that for easily parallelizable kernels, performance on the GPU remains within this range. On the CPU, OpenACC-parallelized pairwise distance kernels would not meet the required performance standards when using AMD Opteron “Interlagos” processors, but with IBM Power9 processors, performance remains within an acceptable range for small batch sizes. These kernels provide a test for achieving performance portability with compiler directives for problems with memory-intensive components, as are often found in scientific applications.
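As an illustration of the kind of kernel studied here, a minimal OpenACC pairwise-distance sketch follows; the function name, data layout, and clauses are assumptions for illustration, not the authors' code.

```c
/* A minimal sketch of an OpenACC pairwise-distance kernel; the data layout
 * and clauses are illustrative assumptions. */
#include <math.h>
#include <stddef.h>

void pairwise_distances(size_t n,
                        const float *restrict x, const float *restrict y,
                        const float *restrict z, float *restrict d)
{
    /* d is an n-by-n distance matrix stored in row-major order. */
    #pragma acc parallel loop collapse(2) \
            copyin(x[0:n], y[0:n], z[0:n]) copyout(d[0:n*n])
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float dx = x[i] - x[j];
            float dy = y[i] - y[j];
            float dz = z[i] - z[j];
            d[i * n + j] = sqrtf(dx * dx + dy * dy + dz * dz);
        }
    }
}
```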
OpenMP Target Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries
Artem Chikin is presenting this paper by Artem Chikin, Tyler Gobran and Jose N. Amaral (all from the University of Alberta, Canada).
Artem Chikin is an MSc student working with Prof. José Nelson Amaral in the University of Alberta Systems Group. His work focuses on compiler-driven performance of heterogeneous computing systems.
Abstract
This paper presents three ideas that focus on improving the execution of high-level parallel code on GPUs. The first addresses programs that include multiple parallel blocks within a single region of GPU code. A proposed compiler transformation can split such regions into multiple regions, leading to the launch of multiple kernels, one for each parallel region. Advantages include the opportunity to tailor the grid geometry of each kernel to the parallel region it executes and the elimination of the overheads imposed by a code-generation scheme meant to handle multiple nested parallel regions. The second is a code transformation that sets up a pipeline of kernel execution and asynchronous data transfer, enabling the overlap of communication and computation. The intricate technical details required for this transformation are described. The third idea is that the selection of a grid geometry for the execution of a parallel region must balance GPU occupancy against the potential saturation of memory throughput in the GPU. Adding this additional parameter to the geometry-selection heuristic can often yield better performance at lower occupancy levels.
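To make the pipelining idea concrete, the sketch below overlaps chunked host-device transfers with computation using OpenMP 4.5 target tasks ordered by depend clauses. The function name, chunking scheme, and kernel are illustrative assumptions, not the compiler transformation described in the paper.

```c
/* A minimal sketch (not the paper's transformation) of pipelined OpenMP 4.5
 * target offloading: each chunk's transfers and kernel are issued as
 * asynchronous target tasks, so the transfer of one chunk can overlap with
 * the computation on another. */
#include <stddef.h>

void scale_in_chunks(size_t n, size_t chunk, float *a, float s)
{
    for (size_t off = 0; off < n; off += chunk) {
        size_t len = (off + chunk < n) ? chunk : n - off;

        /* Asynchronous host-to-device copy of this chunk. */
        #pragma omp target enter data map(to: a[off:len]) nowait depend(out: a[off])

        /* Asynchronous kernel on the same chunk, ordered after its copy-in. */
        #pragma omp target teams distribute parallel for nowait \
                depend(inout: a[off]) map(alloc: a[off:len])
        for (size_t i = off; i < off + len; ++i)
            a[i] *= s;

        /* Asynchronous device-to-host copy back, ordered after the kernel. */
        #pragma omp target exit data map(from: a[off:len]) nowait depend(in: a[off])
    }
    /* Wait for all outstanding chunk tasks to complete. */
    #pragma omp taskwait
}
```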
A Case Study for Performance Portability using OpenMP 4.5
Rahulkumar Gayatri is presenting this paper by Rahulkumar Gayatri, Charlene Yang, Thorsten Kurth and Jack Deslippe (all from Lawrence Berkeley National Laboratory, USA).
Rahulkumar is a postdoc at NERSC, LBL. He works in the Application Performance Group, where he ports applications onto multi-core architectures using widely used programming frameworks such as OpenMP, OpenACC and Kokkos, and assesses their ability to create a “performance portable” implementation. Prior to this, he graduated from the Barcelona Supercomputing Center in March 2015, where he worked in the OmpSs programming models group.
Abstract
In recent years, the HPC landscape has shifted away from traditional CPU-based systems to energy-efficient architectures that rely on many-core CPUs or accelerators to achieve high performance. The goal of performance portability is to enable developers to rapidly produce applications that run efficiently on a variety of these architectures and require little to no architecture-specific code adaptation. Directive-based programming models (OpenMP and OpenACC) are attractive in this regard, as they do not require major code restructuring and they support incremental portability.
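As a minimal example of the single-source, directive-based style the paper evaluates, an OpenMP 4.5 offloaded loop might look like the sketch below; the AXPY kernel is a generic placeholder, not one of the paper's case-study kernels.

```c
/* A minimal sketch of OpenMP 4.5 offloading: the same annotated loop can be
 * offloaded to a GPU or compiled for the host, illustrating the incremental,
 * single-source portability discussed above. */
#include <stddef.h>

void axpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```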
OpenACC Routine Directive Propagation using Interprocedural Analysis
Aniket Shivam is presenting this paper by Aniket Shivam (University of California, Irvine, USA) and Michael Wolfe (NVIDIA, USA).
Aniket Shivam is a PhD candidate in the Donald Bren School of Information and Computer Sciences at University of California, Irvine (UCI).
His research interests involve Compilers and High-Performance Computing.
Abstract
Accelerator programming today requires the programmer to specify what data to place in device memory and what code to run on the accelerator device. When programming with OpenACC, directives and clauses are used to tell the compiler what data to copy to and from the device, and what code to compile for and run on the device. In particular, the programmer inserts directives around code regions, typically loops, to identify compute constructs to be compiled for and run on the device. If the compute construct calls a procedure, that procedure also needs to be marked for device compilation, as does any routine called in that procedure, and so on transitively. In addition, the marking needs to include the kind of parallelism that is exploited within the procedure, or within routines called by the procedure. When using separate compilation, the marking at the procedure's definition must be replicated in any file where the procedure is called. This causes much frustration when first porting existing programs to GPU programming using OpenACC.
This paper presents an approach to partially automate this process. The approach relies on interprocedural analysis (IPA) to analyze OpenACC regions and procedure definitions, and to propagate the necessary information forward and backward across procedure calls spanning all the linked files, generating the required accelerator code through recompilation at link time. This approach can also perform correctness checks to prevent compilation or runtime errors. This method is implemented in the PGI OpenACC compiler.
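For readers new to OpenACC, the sketch below shows the kind of manual routine annotation that this IPA approach is designed to infer automatically; the code itself is an illustrative example, not from the paper.

```c
/* A sketch of the manual markings the paper aims to automate: a procedure
 * called from an OpenACC compute construct must carry a routine directive
 * (transitively, and with a matching level of parallelism). */
#pragma acc routine seq
static double squared(double v) { return v * v; }

void sum_of_squares(int n, const double *restrict a, double *restrict out)
{
    double s = 0.0;
    /* Without the "acc routine seq" above, the call to squared() inside
     * this compute construct would not compile for the device. */
    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (int i = 0; i < n; ++i)
        s += squared(a[i]);
    *out = s;
}
```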
OpenACC Based GPU Parallelization of Plane Sweep Algorithm for Geometric Intersection
Anmol Paudel is presenting this paper by Anmol Paudel and Satish Puri (both from Marquette University, USA).
Anmol is currently pursuing his graduate studies in the field of Computational Sciences. His research interests are mainly in the domain of parallel computing and high-performance computing, and his work as an RA in the Parallel Computing Lab at Marquette University is geared towards the same. He devotes most of his time to speeding up algorithms and computational methods in scientific computing and data science in a scalable fashion. Besides work, he likes to hang out with his friends, travel, and explore new cultures and cuisines.
Abstract
Line segment intersection is one of the elementary operations in computational geometry. Complex problems in Geographic Information Systems (GIS), like finding map overlays or spatial joins using polygonal data, require solving segment intersections. The plane sweep paradigm is used for finding geometric intersections in an efficient manner. However, it is difficult to parallelize due to its in-order processing of spatial events. We present a new fine-grained parallel algorithm for geometric intersection and its CPU and GPU implementations using OpenMP and OpenACC. To the best of our knowledge, this is the first work demonstrating an effective parallelization of plane sweep on GPUs.
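For illustration only, the sketch below offloads the elementary all-pairs segment intersection test with OpenACC directives. It is not the paper's fine-grained parallel plane sweep algorithm, which is considerably more involved; the types and function names are assumptions.

```c
/* Illustrative sketch: brute-force all-pairs segment intersection counting
 * parallelized with OpenACC. Shows only the elementary orientation-based
 * intersection test offloaded with directives, not a plane sweep. */
#include <stddef.h>

typedef struct { double x1, y1, x2, y2; } seg_t;

#pragma acc routine seq
static int orient(double ax, double ay, double bx, double by,
                  double cx, double cy)
{
    double v = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
    return (v > 0.0) - (v < 0.0);   /* sign of the cross product */
}

#pragma acc routine seq
static int intersects(seg_t s, seg_t t)
{
    /* Proper intersection: the endpoints of each segment straddle the other
     * (collinear/touching cases are ignored for brevity). */
    return orient(s.x1, s.y1, s.x2, s.y2, t.x1, t.y1) !=
           orient(s.x1, s.y1, s.x2, s.y2, t.x2, t.y2) &&
           orient(t.x1, t.y1, t.x2, t.y2, s.x1, s.y1) !=
           orient(t.x1, t.y1, t.x2, t.y2, s.x2, s.y2);
}

size_t count_intersections(size_t n, const seg_t *restrict segs)
{
    size_t count = 0;
    #pragma acc parallel loop reduction(+:count) copyin(segs[0:n])
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            count += intersects(segs[i], segs[j]);
    return count;
}
```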