The workshop takes place on Monday, Nov 18, 2019 from 9:00 a.m. to 5:30 p.m. in room 702. Details about all talks and speakers can be found below the agenda.

Morning Session 1 (Session Chair: Sandra Wienke, RWTH Aachen University, Germany)
09:00-09:10 Opening Remarks – Sandra Wienke & Sridutt Bhalachandra
09:10-10:00 Keynote: Nicholas James Wright (Lawrence Berkeley National Laboratory, USA) – Perlmutter – A 2020 Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application Performance Optimization
10:00-10:30 WACCPD Morning Break
Morning Session 2: Porting Scientific Applications to Heterogeneous Architectures Using Directives (Session Chair: Christian Iwainsky, Technische Universität Darmstadt, Germany)
10:30-11:00 Takuma Yamaguchi (University of Tokyo, Japan) – GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC
11:00-11:30 Noriyuki Kushida (Comprehensive Nuclear-Test-Ban Treaty Organization) – Acceleration in Acoustic Wave Propagation Modelling using OpenACC/OpenMP and its hybrid for the Global Monitoring System
11:30-12:00 Zhengji Zhao (NERSC, Lawrence Berkeley National Laboratory, USA) – Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC
12:00-12:30 Fazlay Rabbi (Michigan State University, USA) – Evaluation of Directive-based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
12:30-14:00 WACCPD Lunch Break
14:00-14:30 Invited Talk: Robert Henschel (Indiana University, USA) – The SPEC ACCEL Benchmark – Results and Lessons Learned
Afternoon Session 1: Directive-Based Programming for Math Libraries (Session Chair: Sridutt Bhalachandra, Lawrence Berkeley National Laboratory, USA)
14:30-15:00 JaeHyuk Kwack (Argonne National Laboratory, USA) – Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries
15:00-15:30 WACCPD Afternoon Break
Afternoon Session 2 (partly): Performance Portability for Heterogeneous Architectures (Session Chair: Julia Levites, Nvidia, USA)
15:30-16:00 Yuuichi Asahi (National Institute for Quantum and Radiological Science and Technology, Japan) – Performance portable implementation of a kinetic plasma simulation mini-app
16:00-16:30 Damodar Sahasrabudhe (University of Utah, USA) – A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures
16:30-16:35 WACCPD Best Paper Award
16:35-17:25 Panel: Fernanda Foertter (NVIDIA, USA) – Convergence, Divergence, or New Approaches? – The Future of Software-Based Abstractions for Heterogeneous Supercomputing
Moderator: Fernanda Foertter
Panelists: Jack Deslippe, Johannes Doerfert, Jeff Hammond, Christian Trott, Michael Wolfe.
*Please find the panelists’ statements on the panel’s topic below.
17:25-17:30 WACCPD Closing Remarks

Keynote: Perlmutter – A 2020 Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application Performance Optimization

Dr. Nicholas (Nick) James Wright, the advanced technologies group lead and NERSC chief architect, will give the keynote at the workshop.


Nicholas J. Wright

The keynote will be presented by Nicholas J. Wright, the Perlmutter chief architect and the advanced technologies group lead at the National Energy Research Scientific Computing Center (NERSC). He led the effort to optimize the architecture of the Perlmutter machine, the first NERSC platform designed to meet the needs of both large-scale simulation and data analysis from experimental facilities. Nicholas has a Ph.D. in computational chemistry from the University of Durham and has been with NERSC since 2009.


In 2020 NERSC will take delivery of its next-generation supercomputer, Perlmutter. In this talk we will describe the architecture of the machine and how it was optimized to meet the performance and usability goals of NERSC’s more than 7000 users. We will discuss the current usage of different programming models at NERSC and our plans for supporting them on Perlmutter, and on future machines.

Invited Talk: The SPEC ACCEL Benchmark – Results and Lessons Learned

Robert Henschel

The invited talk will be presented by Robert Henschel, director of Research Software and Solutions at Indiana University. He is responsible for providing advanced scientific applications to researchers at Indiana University and national partners, as well as providing support for computational research to the IU School of Medicine. Henschel serves as the chair of the Standard Performance Evaluation Corporation (SPEC) High-Performance Group and in this role leads the development of production-quality benchmarks for HPC systems. He also serves as the treasurer of the OpenACC organization. Henschel has a deep background in High-Performance Computing, and his research interests focus on performance analysis of parallel applications.


The High-Performance Group (HPG) of the Standard Performance Evaluation Corporation (SPEC) is a forum for discussing and developing benchmark methodologies for High-Performance Computing (HPC) systems. The group released the SPEC ACCEL benchmark in 2014, containing OpenCL and OpenACC components. In 2017, an OpenMP 4.5 target offload component was added by porting the OpenACC applications to OpenMP 4.5. This talk will introduce the benchmark, show results, and discuss the lessons learned from developing and maintaining this directive-based benchmark. In addition, current challenges of creating a follow-on suite are discussed.

Panel: Convergence, Divergence, or New Approaches? – The Future of Software-Based Abstractions for Heterogeneous Supercomputing

Moderator: Fernanda Foertter (NVIDIA, USA)
Panelists: Jack Deslippe, Johannes Doerfert, Jeff Hammond, Christian Trott, Michael Wolfe

With an ongoing shift towards accelerator-based architectures given the promise of improved performance per watt, the complexity of parallelizing and tuning applications is also increasing. Today’s high-performance computing (HPC) developers face a serious dilemma in deciding on a processor architecture and parallel programming paradigm. Software-based abstractions for accelerators give hope of lifting this burden but also leave developers spoilt for choice – from open standards over domain-specific languages (DSLs) to proprietary approaches, and from open-source to closed-source solutions. The uncertainty surrounding the future usability, portability, and interoperability of these abstractions is unsettling for developers. It is imperative that we try to resolve these shortcomings to encourage greater adoption. This panel hopes to work towards this resolution by bringing together standards committee members, supporters, and users of these abstractions, e.g., from OpenACC, OpenMP, Kokkos or SYCL. The panelists will share their insights on the future of these abstractions – Will they converge, diverge, or will new approaches be needed? What makes a good accelerator programming model? Is there a measure for this “goodness”? The audience is also encouraged to challenge the panelists with their questions or share their insights.

Statements of Panelists

Jack Deslippe

There is no doubt a battle brewing for mindshare in heterogeneous programming languages, libraries, and frameworks with attacks flying left and right. “Directive X is Dead”, “Directive Y is bloated”, “Framework Z needs a standards body”, “Language Q is proprietary” etc. The problem with this battle is that it misses the important point: none of these approaches is going to let you write performant code (let alone performance portable code) if you don’t understand what your code is doing – where your critical regions are and how and why you rely on data locality. I posit that the challenge in productively coding for a heterogeneous future is for developers and communities to determine where and why to create abstractions rather than the choice of how to implement the abstraction. At NERSC, we’ve been contemplating messaging for our 7000 users around preparing for the upcoming GPU-powered Perlmutter system. It is very easy for users to get bogged down in the weeds when trying to prepare their apps for a new system. There are multiple novel hardware features to target and multiple directives, libraries, and frameworks to go after things with. Rather than walking away from a training session ready to throw directives into their code willy-nilly, we’d like users to take a data- and knowledge-driven approach to transforming their application. In our case studies, with the exception of poor compiler auto-vectorization, the choice of implementation was never quite as important as understanding the potential parallelism and data locality of a code. Promoting directives to lower the barrier to entry for heterogeneous systems is a noble cause, but we need to make performance analysis and modeling more accessible and actionable as well.

Johannes Doerfert

Software solutions are always evolving, so it is fairly likely this will be true for software-based abstractions for heterogeneous supercomputing as well. More specifically, there are different use cases for which, at least in the short and middle term, the landscape will evolve differently. Close-to-peak performance will for the foreseeable future be reserved for specialized languages that have strong manufacturer backing. Portable abstractions make up the second use case with more contenders including OpenMP, OpenACC (to some degree), SYCL, HIP, Chapel, etc. How this space evolves is hard to predict, but to me there is a surplus of alternatives. While that might improve diversity and innovation, it will fragment research, development, and the user base, ultimately causing a worse experience for most. I hope that this space will converge on one or two abstractions soon but that lessons learned from the others are not lost, e.g., if OpenACC functionality is eventually subsumed by OpenMP. Finally, there are high-level solutions which provide abstractions and patterns that can be lowered portably with good performance. Here we will continue to see different alternatives which balance the amount and complexity of the provided abstractions differently. To measure and compare these abstractions we will eventually have to augment classical performance tests with usability information. While some users are driven purely by performance, others need to prioritize maintainability and portability of some kind. Generally, I do not think there is a correct abstraction (level). I think system complexity is steadily increasing, e.g., through heterogeneity, which is good for performance and power reasons, but it will make the struggle for users even harder, allowing only the most user-friendly solutions to survive. What these should look like is one of the questions we (as all of the people working on such solutions) try to answer.

Jeff Hammond

Developers should use what makes them productive and supports the platforms they use. For Fortran applications, OpenMP supports CPUs and multiple accelerator types with a significant amount of code reuse; for example, the NWChem GPU code I’m working with right now is a direct descendant of code written for Knights Landing, Knights Corner and Blue Gene/P. For modern C++ applications, Khronos SYCL and Kokkos look promising. In general, I’m opposed to programming model wars – these often result from non-technical forces or the failure of application programmers to design good abstractions. For my research in the Parallel Research Kernels, I’ve evaluated essentially all the heterogeneous programming models used in HPC, and what differentiates the good ones from the bad is (1) availability of multiple compilers, especially support within LLVM, and (2) a robust library ecosystem, including high performance implementations of things like linear algebra, which are essential in a wide range of HPC applications. There will always be new approaches, but convergence to a small number of industry standard approaches is inevitable. System operators and application programmers will not support a different programming environment for every platform. We are already seeing this in the DOE exascale program, but it isn’t a new idea. Opposition to vertical integration and proprietary system software was one of the forces that brought about the death of mainframes and the transition of the HPC community to Linux clusters based on de facto standard architectures like x86.

Christian Trott

Software projects targeting the diversifying HPC architecture landscape fall today into three categories. For small enough projects, or codes which have only a few performance-critical sections, simply writing a new version for every architecture can be very attractive. It provides the most flexibility, and possibly the highest performance. For Fortran and C based applications, OpenMP pragmas work pretty well already. For C++, pragmas are a poor fit, and software-based abstractions provide arguably the best compromise of programmability, performance and future proofing. As the lead of the Kokkos C++ Performance Portability Programming Model, I am responsible for delivering one such abstraction to our users. Over the last few years it has found rapid adoption across a number of DOE laboratories, universities and other institutions which target HPC platforms. With the expansion of the Kokkos team to include developers at 5 of the leading DOE labs, we are also well positioned to support users at many of the leading supercomputing platforms. The question is: will we see many more such abstractions pop up? I believe that the space of successful models will stay limited – as in other areas there is strength in numbers. The more people use a model, the easier it is for new developers to adopt it. There will be more examples available, more people to ask questions, and more funding to make the library robust. In the long term (2030) though, I believe much of what Kokkos does today will be part of the C++ standard. My team is hard at work helping to shape that future in a way that works for the HPC community. With proposals making their way into C++23 which cover heterogeneous execution, data abstractions suitable for HPC, and BLAS in C++, we are well on track to achieve this goal.

Michael Wolfe

Today’s HPC accelerators are mostly GPUs. Other accelerators (DSPs, FPGAs, neural network chips) will have a hard time competing with GPUs in performance, cost, programmability, and generality. GPUs differ from multicore CPUs in the amount and structure of parallelism required to achieve high performance. Our problem is providing a software programming abstraction that captures and encourages such parallelism for applications that can take advantage of it. When a new problem arises, a number of solutions are designed and implemented, often with similarities. For an important problem, particularly when multiple platforms exhibit similar problems, we push to standardize. Early message-passing clusters and supercomputers used a variety of message-passing libraries (NX, Express, PVM, others). All were eclipsed by the universal adoption of MPI. Standardization improves portability, but limits innovation. For GPUs and similar accelerators, we are in an innovation stage of software abstraction development. Good ideas migrate between different abstractions. Some will be abandoned, others will thrive. We are unlikely to converge to a single MPI-like solution to this problem. Instead, there will be several abstractions in use, even within a single large application. Interoperability between components written with different abstractions, or different levels of abstraction, is important. Different users or different parts of the application have different requirements, such as access to low-level features for some programmers or components, and a higher degree of performance portability with few or no source code changes for others. Our relevant experience is developing OpenMP, OpenACC, and CUDA Fortran implementations, and supporting hundreds of production applications on today’s largest supercomputers.
Our goal is to automate optimization and mapping of standard C++ and Fortran parallel constructs for CPUs and GPUs, augment with de facto standard directives where needed, and work to maximize interoperability with target-specific explicit models.

GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC

Takuma Yamaguchi

Takuma Yamaguchi is presenting this paper, co-authored with Kohei Fujita (University of Tokyo), Tsuyoshi Ichimura (University of Tokyo), Akira Naruse (NVIDIA), Maddegedara Lalith (University of Tokyo), and Muneo Hori (Japan Agency for Marine-Earth Science and Technology).

Takuma Yamaguchi is a Ph.D. student in the Department of Civil Engineering at the University of Tokyo, where he also received his B.E. and M.E. degrees. His research focuses on high-performance computing for earthquake simulation. More specifically, he works on an implicit low-order finite element solver accelerated by GPUs.


Accelerating applications with portability and maintainability is one of the big challenges in science and engineering. Previously, we have developed a fast implicit low-order three-dimensional finite element solver, which has a complicated algorithm including artificial intelligence and transprecision computing. In addition, all possible tunings for the target architecture were implemented; accordingly, the solver has inferior portability and maintainability. In this paper, we apply OpenACC to the solver. The directive-based implementation of OpenACC enables GPU computation to be introduced with a smaller developmental cost even for complex codes. In performance measurements on AI Bridging Cloud Infrastructure (ABCI), we evaluated that a reasonable speedup was attained on GPUs, given that the elapsed time of the entire solver was reduced to 1/14 of that on CPUs based on the original CPU implementation. Our proposed template to use transprecision computing with our custom FP21 data type is available to the public; therefore, it can provide a successful example for other scientific computing applications.

Acceleration in Acoustic Wave Propagation Modelling using OpenACC/OpenMP and its hybrid for the Global Monitoring System

Noriyuki Kushida

Noriyuki Kushida is presenting this paper, co-authored with Ying-Tsong Lin (Woods Hole Oceanographic Institution), Peter Nielsen (The Comprehensive Nuclear-Test-Ban Treaty Organization), and Ronan Le Bras (The Comprehensive Nuclear-Test-Ban Treaty Organization).

Noriyuki Kushida currently works for the Comprehensive Nuclear-Test-Ban Treaty Organization as a software engineer. He has been engaged in research and development of large-scale computer simulation methods as well as linear equation solver algorithms. One of his current interests is introducing HPC/supercomputing technologies to the disarmament field, drawing on his background to contribute to world peace. From a research point of view, he is particularly interested in global acoustic modelling.


CTBTO is operating and maintaining the international monitoring system of Seismic, Infrasound, Hydroacoustic and Airborne radionuclide to detect a nuclear explosion over the globe. The monitoring network of CTBTO, especially with regard to infrasound and hydroacoustic, is quite unique because the network covers the globe, and the data is open to scientific use. CTBTO has been intensively developing and improving the methodologies to analyze observed signals. In this context, hydroacoustic modeling software, especially software that solves the partial differential equation directly, is of interest. As seen in the analysis of the Argentinian submarine accident, the horizontal reflection can play an important role in identifying the location of an underwater event, and as such, accurate modeling software may help analysts find relevant waves efficiently. Thus, CTBTO has been testing a parabolic equation based model (3D-SSFPE) and building a finite difference time domain (FDTD) model. At the same time, using such accurate models requires larger computer resources than simplified methods such as ray-tracing. Thus we accelerated them using OpenMP and OpenACC, or the hybrid of those. As a result, in the best case scenarios, (1) 3D-SSFPE was accelerated by approximately 19 times relative to the original Octave code, employing the GPU-enabled Octfile technology, and (2) FDTD was accelerated by approximately 160 times relative to the original Fortran code using the OpenMP/OpenACC hybrid technology, on our DGX Station with V100 GPUs.

Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC

Zhengji Zhao

Zhengji Zhao is presenting this paper, co-authored with Hongzhang Shan (Lawrence Berkeley National Laboratory) and Marcus Wagner (Cray Inc.).

Zhengji Zhao is an HPC consultant at the National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory. She specializes in supporting materials science and chemistry applications and users at NERSC. She was part of the NERSC7 (Edison, a Cray XC30) procurement, co-leading its implementation team. Additionally, she worked on developing or extending the capability of workload analysis tools, such as the system performance monitoring with the NERSC SSP benchmarks, the library tracking (ALTD), and the application usage analysis automation. She is also a member of the NERSC application readiness team, helping users port their applications to new platforms. Most recently she has worked on bringing the checkpoint/restart capability to the NERSC workloads, and has also worked (co-PI) on the Berkeley Lab Directed Research and Development project that is designed to demonstrate the performance potential of purpose-built architectures as a potential future for HPC applications in the absence of Moore’s Law. She has (co)authored more than 30 publications, including the work of developing the reduced density matrix (RDM) method for electronic structure calculations, a highly accurate alternative to wavefunction-based computational chemistry methods, and the award-winning development work of the linear scaling 3D fragment (LS3DF) method for large-scale electronic structure calculations (best poster in SC07, and a Gordon Bell award in SC08). She has served on the organizing committees of several HPC conference series, such as CUG, SC, IXPUG, etc. She received her Ph.D. in computational physics, and an M.S. in computer science from New York University.


Using GPUs to accelerate the performance of HPC applications has recently gained great momentum. Energy Exascale Earth System Model (E3SM) is a state-of-the-science earth system model development and simulation project and has gained national recognition. It has a large code base with over a million lines of code. How to make effective use of GPUs remains a challenge. In this paper, we use the modal aerosol module (MAM) of E3SM as a driving example to investigate how to effectively offload computational tasks to GPUs, using the OpenACC directives. In particular, we are interested in the performance advantage of using GPUs and understanding the limiting factors from both the application characteristics and the GPU or OpenACC sides.

Evaluation of Directive-based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Fazlay Rabbi

Fazlay Rabbi is presenting this paper, co-authored with Christopher Steven Daley (Lawrence Berkeley National Laboratory), Hasan Metin Aktulga (Michigan State University), and Nicholas James Wright (Lawrence Berkeley National Laboratory).

Fazlay Rabbi is a PhD student in the Computer Science Department at Michigan State University. He is working under Dr. Hasan Metin Aktulga. His research interests are in the area of parallel algorithms, high performance computing and data-intensive computing. In particular, he is interested in expressing large sparse matrix computations as directed acyclic data-flow graphs (DAGs) to accelerate those computations on modern deep memory architectures by minimizing data movement between memory layers and overlapping computations with data movement. As a Summer Intern at Lawrence Berkeley National Laboratory (LBNL) in summer 2019, he studied the performance of OpenMP-4.0+ features designed to offload compute kernels to accelerators. He received his M.S. degree in Electrical Engineering from Michigan State University in 2016. He obtained his B.S. degree in Computer Science and Engineering from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh in 2011.


Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8x – 4.3x speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9x and 48.2x speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.

Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries

JaeHyuk Kwack

JaeHyuk Kwack is presenting this paper, co-authored with Colleen Bertoni (Argonne National Laboratory), Buu Pham (Iowa State University), and Jeff Larkin (NVIDIA).

JaeHyuk Kwack works in the performance engineering group at the Argonne Leadership Computing Facility. He received his B.S. and M.S. in engineering from Seoul National University, South Korea, and a Ph.D. and post-doctoral training in computational mechanics for computational fluid dynamics (CFD) and fluid-solid interaction (FSI) from the University of Illinois at Urbana-Champaign, USA. Before joining Argonne, he worked for the Blue Waters supercomputing project at the National Center for Supercomputing Applications. At Argonne since 2018, he has been working on the OpenMP offloading model, performance tools, and math libraries for the coming US DOE exascale system, Aurora, at Argonne in 2021.


The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale system in 2020, and three exascale supercomputers in 2021/2022. All of the systems are GPU-enabled systems, and they plan to provide optimized vendor-promoted programming models for their GPUs such as CUDA, HIP and SYCL. However, due to their limited functional portability, it is challenging for HPC application developers to maintain their applications in an efficient and effective way with good productivity across all US DOE pre-exascale/exascale systems. Directive-based programming models for accelerators can be one of the solutions for HPC applications on the DOE supercomputers. In this study, we employ OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare and evaluate the performance of the re-structured offloading kernels with the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the Nvidia V100 GPU in the RI-MP2 kernel. Using the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU becomes more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI + directive-based offloading implementations of the RI-MP2 kernel perform more than 40 times faster than an MPI + OpenMP threading implementation on the same number of Summit nodes. This study demonstrates how directive-based offloading implementations can perform near what we expect based on machine peak ratios.

Performance Portable Implementation of a Kinetic Plasma Simulation Mini-app

Yuuichi Asahi

Yuuichi Asahi is presenting this paper, co-authored with Guillaume Latu (CEA, IRFM), Virginie Grandgirard (CEA, IRFM), and Julien Bigot (Maison de la Simulation, CEA, CNRS).

Yuuichi Asahi is a postdoctoral researcher at the National Institute for Quantum and Radiological Science and Technology, Japan. He has focused on nonlinear plasma turbulence by means of gyrokinetic simulations. His current interests include accelerating simulation codes for more complicated physical simulations.


Performance portability is considered to be an inevitable requirement in the exascale era. We explore a performance portable approach for a fusion plasma turbulence simulation code employing a kinetic model, namely the GYSELA code. For this purpose, we extract the key features of GYSELA, such as high dimensionality and the Semi-Lagrangian scheme, and encapsulate them into a mini-application which solves a similar but simplified Vlasov-Poisson system. We implement the mini-app with a mixed OpenACC/OpenMP and Kokkos implementation, where we suppress unnecessary duplications of code lines. For a reference case with a problem size of 128^4, the Skylake (Kokkos), Nvidia Tesla P100 (OpenACC), and P100 (Kokkos) versions achieve an acceleration of 1.45, 12.95, and 17.83, respectively, with respect to the baseline OpenMP version on Intel Skylake. In addition to the performance portability, we discuss the code readability and productivity of each implementation. Based on our experience, Kokkos can offer a readable and productive code at the cost of initial porting efforts, which would be enormous for a large-scale simulation code like GYSELA.

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

Damodar Sahasrabudhe

Damodar Sahasrabudhe is presenting this paper, co-authored with Eric T. Phipps (Sandia National Laboratories), Sivasankaran Rajamanickam (Sandia National Laboratories), and Martin Berzins (University of Utah, Scientific Computing and Imaging Institute).

Damodar Sahasrabudhe is a student pursuing a doctoral degree in computer science at the University of Utah. His research interests include parallel computing, GPGPU programming, portability, among others.


As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (SIMD) primitive proposed in this work allows intrinsics-based vectorization on CPUs and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against a hard-to-auto-vectorize baseline, to measure the overhead against an efficiently vectorized baseline, and to evaluate the new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across CPUs and GPUs without any performance degradation being observed experimentally.

Panelist Bio

Jack Deslippe

Jack Deslippe is the application performance group lead at NERSC. Jack and his group are partnering with DOE application teams to evaluate and improve the performance of applications on HPC systems at NERSC, as well as exploring and influencing performance portability strategies. Jack is one of the lead developers of the BerkeleyGW package for computing the excited state properties of materials and a PI in C2SEPEM – a BES materials software center. He received a Ph.D. from UC Berkeley in physics in 2011, with research centered on computational materials physics and nano-science, including the development and scaling of electronic-structure codes. Jack has been at NERSC since 2011, where he has acted as a consultant for materials science applications, served as the MyNERSC architect, and been a principal investigator in SciDAC projects and collaborations with light sources. He currently leads the NERSC Exascale Science Applications Program (NESAP).

Johannes Doerfert

Johannes Doerfert is a researcher in the Argonne Leadership Computing Facility at Argonne National Laboratory (ANL). As an active member of the LLVM and OpenMP communities, Johannes is working towards *portable* high-performance computing, from high-level language design to low-level details such as device code generation and runtimes. His designs for language and (LLVM) compiler enhancements enable optimization of parallel programs written, for example, in C/C++ and Fortran. This is a major part of several ongoing efforts to make compiler software ready for exascale computing. His projects range from parallelism-specific transformations in LLVM to the integration of abstractions that allow existing “parallelism unaware” transformations to optimize parallel programs. Early results show that this effort can completely recover performance lost due to the lack of optimizations as a result of parallelism abstractions.

Jeff Hammond

Jeff Hammond is a Senior System Architect at Intel. He works on a wide range of high-performance computing projects, including the development of exascale supercomputers and software that makes parallel programming less painful. Prior to joining Intel, he worked at the Argonne Leadership Computing Facility as a computational scientist. He received his PhD in chemistry from the University of Chicago, where he was a Department of Energy Computational Science Graduate Fellow. For more information, please see

Christian Trott

Christian Trott is a high performance computing expert with extensive experience designing and implementing software for modern HPC systems. He is a principal member of staff at Sandia National Laboratories, where he leads the Kokkos core team developing the performance portability programming model for C++ and heads Sandia’s delegation to the ISO C++ standards committee. He also serves as adviser to numerous application teams, helping them redesign their codes using Kokkos and achieve performance portability for the next generation of supercomputers. Christian is a regular contributor to numerous scientific software projects including LAMMPS and Trilinos. He earned a doctorate from the University of Technology Ilmenau in theoretical physics with a focus on computational material research.

Michael Wolfe

Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates, Inc. (since acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with the Oregon Health & Science University), and worked on High Performance Fortran at PGI (since acquired by STMicroelectronics, and more recently by NVIDIA). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for NVIDIA GPU accelerators.


