The workshop takes place on Monday, Nov 18, 2019 from 9:00 a.m.- 5:30 p.m. in room 702. Details about all talks and the speakers can be found by clicking on a title in the agenda or below the agenda.

09:00-09:10Opening Remarks – Sandra Wienke & Sridutt Bhalachandra
09:10-10:00Keynote: Nicholas James Wright (Lawrence Berkeley National Laboratory, USA) – Perlmutter – A 2020 Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application Performance Optimization
10:00-10:30WACCPD Morning Break
Topic: Porting Scientific Applications to Heterogeneous Architectures Using Directives
10:30-11:00Takuma Yamaguchi (University of Tokyo, Japan) – GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC
11:00-11:30Noriyuki Kushida (Comprehensive Nuclear-Test-Ban Treaty Organization) – Acceleration in Acoustic Wave Propagation Modelling using OpenACC/OpenMP and its hybrid for the Global Monitoring System
11:30-12:00Zhengji Zhao (NERSC, Lawrence Berkeley National Laboratory, USA) – Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC
12:00-12:30Fazlay Rabbi (Michigan State University, USA) – Evaluation of Directive-based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
12:30-14:00WACCPD Lunch Break
14:00-14:30Invited Talk: Robert Henschel (Indiana University, USA) – The SPEC ACCEL Benchmark – Results and Lessons Learned
Topic: Directive-Based Programming for Math Libraries
14:30-15:00JaeHyuk Kwack (Argonne National Laboratory, USA) – Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries
15:00-15:30WACCPD Afternoon Break
Topic: Performance Portability for Heterogeneous Architectures
15:30-16:00Yuuichi Asahi (National Institute for Quantum and Radiological Science and Technology, Japan) – Performance portable implementation of a kinetic plasma simulation mini-app
16:00-16:30Damodar Sahasrabudhe (University of Utah, USA) – A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures
16:30-16:35WACCPD Best Paper Award
16:35-17:25Panel: Fernanda Foertter (NVIDIA, USA) – Convergence, Divergence, or New Approaches? – The Future of Software-Based Abstractions for Heterogeneous Supercomputing
Panelists: Jeff R. Hammond, Jack Deslippe, Christian Robert Trott, Michael Wolfe, Johannes Doerfert
17:25-17:30WACCPD Closing Remarks

KEYNOTE: Perlmutter- A 2020 Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application Performance Optimization

Dr. Nicholas (Nick) James Wright, the advanced technologies group lead and the NERSC chief architect, will be giving the Keynote at the workshop.


Nicholas J. Wright

Nicholas J. Wright is the Perlmutter chief architect and the advanced technologies group lead at the National Energy Research Scientific Computing (NERSC) center. He led the effort to optimize the architecture of the Perlmutter machine, the first NERSC platform designed to meet the needs of both large scale simulation and data analysis from experimental facilities. Nicholas has a Ph.D. from the University of Durham in computational chemistry and has been with NERSC since 2009.


In 2020 NERSC will take delivery of its next-generation supercomputer, Perlmutter. In this talk we will describe the architecture of the machine and how it was optimized to meet the performance and usability goals of NERSC’s more than 7000 users. We will discuss the current usage of different programming models at NERSC and our plans for supporting them on Perlmutter, and on future machines.

Invited Talk: The SPEC ACCEL Benchmark – Results and Lessons Learned

Robert Henschel

Robert Henschel is Director of Research Software and Solutions at Indiana University. He is responsible for providing advanced scientific applications to researchers at Indiana University and national partners as well as providing support for computational research to the IU school of medicine. Henschel serves as the chair of the Standard Performance Evaluation Corporation (SPEC) High-Performance Group and in this role leads the development of production quality benchmarks for HPC systems. He also serves as the treasurer of the OpenACC organization. Henschel has a deep background in High-Performance Computing and his research interests focus on performance analysis of parallel applications.


The High-Performance Group (HPG) of the Standard Performance Evaluation Corporation (SPEC) is a forum for discussing and developing benchmark methodologies for High-Performance Computing (HPC) systems. The group released the SPEC ACCEL benchmark in 2014, containing OpenCL and OpenACC components. In 2017, an OpenMP 4.5 target offload component was added by porting the OpenACC applications to OpenMP 4.5. This talk will introduce the benchmark, show results and talk about the lessons learned from developing and maintaining this directive based benchmark. In addition, current challenges of creating a follow on suite are discussed.

GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC

Takuma Yamaguchi

Takuma Yamaguchi is a Ph.D. student in the Department of Civil Engineering at the University of Tokyo and he has B.E. and M.E., from the University of Tokyo. His research is high-performance computing targeting at earthquake simulation. More specifically, his work performs an implicit low-order finite element solver enhanced by GPUs.


Accelerating applications with portability and maintainability is one of the big challenges in science and engineering. Previously, we have developed a fast implicit low-order three-dimensional finite element solver, which has a complicated algorithm including artificial intelligence and transprecision computing. In addition, all possible tunings for the target architecture were implemented; accordingly, the solver has inferior portability and maintainability. In this paper, we apply OpenACC to the solver. The directive-based implementation of OpenACC enables GPU computation to be introduced with a smaller developmental cost even for complex codes. In performance measurements on AI Bridging Cloud Infrastructure (ABCI), we evaluated that a reasonable speedup was attained on GPUs, given that the elapsed time of the entire solver was reduced to 1/14 of that on CPUs based on the original CPU implementation. Our proposed template to use transprecision computing with our custom FP21 data type is available to the public; therefore, it can provide a successful example for other scientific computing applications.

Acceleration in Acoustic Wave Propagation Modeling using OpenACC/OpenMP and its hybrid for the Global Monitoring System

Noriyuki Kushida

Noriyuki Kushida is currently working for the Comprehensive Nuclear-Test-Ban Treaty Organization as a software engineer. He has been engaged in research and development of large scale computer simulation methods as well as linear equation solver algorithms. One of his current interest is introducing HPC/Supercomputing technologies to the disarmament fields to contribute to world peace by exploiting his background. In the research point of view, global acoustic modelings interest him a lot.


CTBTO is operating and maintaining the international monitoring system of Seismic, Infrasound, Hydroacoustic and Airborne radionuclide to detect a nuclear explosion over the globe. The monitoring network of CTBTO, especially with regard to infrasound and hydroacoustic, is quite unique because the network covers over the globe, and the data is opened to scientific use. CTBTO has been developing and improving the methodologies to analyze observed signals intensively. In this context, hydroacoustic modeling software, especially which that solves the partial differential equation directly, is of interest. As seen in the analysis of the Argentinian submarine accident, the horizontal reflection can play an important role in identifying the location of an underwater event, and as such, accurate modeling software may help analysts find relevant waves efficiently. Thus, CTBTO has been testing a parabolic equation based model (3D-SSFPE) and building a finite difference time domain (FDTD) model. At the same time, using such accurate models require larger computer resources than simplified methods such as ray-tracing. Thus we accelerated them using OpenMP and OpenACC, or the hybrid of those. As a result, in the best case scenarios, (1) 3D-SSFPE was accelerated by approximately 19 times to the original Octave code, employing the GPU-enabled Octfile technology, and (2) FDTD was accelerated by approximately 160 times to the original Fortran code using the OpenMP/OpenACC hybrid technology, on our DGX—Station with V100 GPUs.

Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC

Zhengji Zhao

Zhengji Zhao is an HPC consultant at the National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory. She specializes in supporting materials science and chemistry applications and users at NERSC. She was part of the NERSC7 (Edison, a Cray XC30) procurement, co-leading its implementation team. Additionally, she worked on developing or extending the capability of workloads analysis tools, such as the system performance monitoring with the NERSC SSP benchmarks, the library tracking (ALTD), and the application usage analysis automation. She is also a member of the NERSC application readiness team, helping users port their applications to new platforms. Most recently she has worked on bringing the checkpoint/restart capability to the NERSC workloads, and has also worked (co-PI) on the Berkeley Lab Directed Research and Development project that is designed to demonstrate performance potential of purpose-built architectures as potential future for HPC applications in absence of Moore’s Law. She has (co)athored more than 30 publications, including the work of developing the reduced density matrix (RDM) method for electronic structure calculations, a highly accurate alternative to wavefunction-based computational chemistry methods, and the award winning development work of the linear scaling 3D fragment (LS3DF) method for large-scale electronic structure calculations (best poster in SC07, and a Gordon Bell award in SC08). She served in the organizing committee for several HPC conference series, such as CUG, SC, IXPUG, etc. She received her Ph.D. in computational physics, and an M.S. in computer science from New York University.


Using GPUs to accelerate the performance of HPC applications has recently gained great momentum. Energy Exascale Earth System Model (E3SM) is a state-of-the-science earth system model development and simulation project and has gained national recognition. It has a large code base with over a million lines of code. How to make effective use of GPUs remains a challenge. In this paper, we use the modal aerosol module (MAM) of E3SM as a driving example to investigate how to effectively offload computational tasks to GPUs, using the OpenACC directives. In particular, we are interested in the performance advantage of using GPUs and understanding the limiting factors from both the application characteristics and the GPU or OpenACC sides.

Evaluation of Directive-based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Fazlay Rabbi

Fazlay Rabbi is a PhD student in Computer Science Department at Michigan State University. He is working under Dr. Hasan Metin Aktulga. His research interests are in the area of parallel algorithms, high performance computing and data- intensive computing. Especially, he is interested in expressing large sparse matrix computations as directed acyclic data- flow graph (DAG) to accelerate those computations on modern deep memory architectures by minimizing data movement between memory layers and overlapping computations with data movement. As a Summer Intern at Lawrence Berkeley National Laboratory (LBNL) in summer 2019, he studied the performance of OpenMP-4.0+ features designed to offload compute kernels to accelerators. He received his M.S. degree in Electrical Engineering from Michigan State University in 2016. He obtained his B.S. degree in Computer Science and Engineering from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh in 2011.


Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOPBCG GPU implementation achieves a 2.8x – 4.3x speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9x and 48.2x speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.

Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries

JaeHyuk Kwack

JaeHyuk Kwack works at performance engineering group at Argonne Leadership Computing Facility. He received his B.S. and M.S. in engineering from Seoul National University, South Korea, and a Ph.D. and post-doctoral training in computational mechanics for computational fluid dynamics (CFD) and fluid solid interaction (FSI) from University of Illinois at Urbana-Champaign, USA. Before joining Argonne, he had worked for Blue Waters supercomputing project at National Center for Supercomputing Applications. At Argonne since 2018, he has been working on OpenMP offloading model, performance tools and math libraries for the coming US DOE exa-scale system, Aurora at Argonne in 2021


The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale in 2020, and three exascale supercomputers in 2021/2022. All of the systems are GPU- enabled systems, and they plan to provide optimized vendor-promoted programming models for their GPUs such as CUDA, HIP and SYCL. However, due to their limited functional portability, it is challenging for HPC application developers to maintain their applications in an efficient and effective way with good productivity across all US DOE pre- exascale/exascale systems. Directive-based programming models for accelerators can be one of the solutions for HPC applications on the DOE supercomputers. In this study, we employ OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare and evaluate the performance of the re-structured offloading kernels with the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the Nvidia V100 GPU in the RI-MP2 kernel. Using the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU becomes more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI + directive-based offloading implementations of the RI-MP2 kernel perform more than 40 times faster than a MPI + OpenMP threading implementation on the same number of Summit nodes. This study demonstrates how directive- based offloading implementations can perform near what we expect based on machine peak ratios.

Performance Portable Implementation of a Kinetic Plasma Simulation Mini-app

Yuuichi Asahi

Yuuichi Asahi is a post-doctoral research at National institute for quantum and radiological science and technology, Japan. He focused on the nonlinear plasma turbulence by means of gyrokinetic simulations. His current interest includes accelerating simulation codes for more complicated physical simulations.


Performance portability is considered to be an inevitable requirement in the exascale era. We explore a performance portable approach for fusion plasma turbulence simulation code employing kinetic model, namely GYSELA code. For this purpose, we extract the key features of GYSELA such as high dimensionality and Semi-Lagrangian scheme, and encapsulate them into a mini-application which solves the similar but simplified Vlasov-Poisson system. We implement the mini-app with a mixed OpenACC/OpenMP and Kokkos implementation, where we suppress unnecessary duplications of code lines. For a reference case with the problem size of 128 to the 4, the Skylake (Kokkos), Nvidia Tesla P100 (OpenACC), and P100 (Kokkos) versions achieve an acceleration of 1.45, 12.95, and 17.83, respectively, with respect to the baseline OpenMP version on Intel Skylake. In addition to the performance portability, we discuss the code readability and productivity of each implementation. Based on our experience, Kokkos can offer a readable and productive code at the cost of initial porting efforts, which would be enormous for a large scale simulation code like GYSELA.

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

Damodar Sahasrabudhe

Damodar Sahasrabudhe is a student pursuing a doctoral degree in computer science at the University of Utah. His research interests include parallel computing, GPGPU programming, portability, among others.


As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (SIMD) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across CPUs and GPUs without any performance degradation being observed experimentally.


Theme by HermesThemes

Copyright © 2019 WACCPD 2019. All Rights Reserved