The workshop takes place on Sunday, Nov 14, 2021.

When: 9:00 a.m.-5:30 p.m. US Central Time.

SC21 Workshop Evaluation Form – https://submissions.supercomputing.org/eval.html

  Session 1 
09:00-09:10 Opening Remarks
09:10-10:00 Invited Talk I: New Frontiers for Directives – Barbara Chapman
10:00-10:30 Coffee Break
  Topic: Directive case studies
10:30-11:00 GPU porting of scalable implicit solver with Green’s function-based neural networks by OpenACC – Kohei Fujita, Yuma Kikuchi, Tsuyoshi Ichimura, Muneo Hori, Lalith Maddegedara, and Naonori Ueda
11:00-11:30 Challenges Porting a C++ Template-Metaprogramming Abstraction Layer to Directive-based Offloading – Jeffrey Kelling, Sergei Bastrakov, Alexander Debus, Thomas Kluge, Matt Leinhauser, Richard Pausch, Klaus Steiniger, Jan Stephan, Rene Widera, Jeff Young, Michael Bussmann, Sunita Chandrasekaran, and Guido Juckeland
11:30-12:00 Accelerating quantum many-body configuration interaction with directives – Brandon Cook, Patrick J. Fasano, Pieter Maris, Chao Yang, and Dossay Oryspayev
12:00-12:30 GPU offloading of a large-scale gyrokinetic particle-in-cell Fortran code: From OpenACC to OpenMP – Qiheng Cai, Junyi Cheng, Yang Chen, Marcus Wagner, Christopher Daley, Dossay Oryspayev, Stefan Tirkas, Sophie Redd, and Scott Parker
12:30-14:00 Lunch Break
14:00-14:30 Invited Talk II: Introducing SPEChpc 2021 – Mathew Colgrove and Sunita Chandrasekaran
  Topic: Directive extensions
14:30-15:00 Extending OpenMP for Machine Learning-Driven Adaptation – Chunhua Liao, Anjia Wang, Giorgis Georgakoudis, Bronis R. de Supinski, Yonghong Yan, David Beckingsale, and Todd Gamblin
15:00-15:30 Coffee Break
  Topic: Directive alternatives
15:30-16:00 Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability – Vincent R. Pascuzzi and Mehdi Goli
16:00-16:30 Can Fortran’s ‘do concurrent’ Replace Directives for Accelerated Computing? – Miko M. Stulajter, Ronald M. Caplan, and Jon A. Linker
16:30-16:40 Best Paper Award
16:40-17:25 Panel: Publicly-available Directive test suites for heterogeneous architectures
Moderator: Christopher Daley
Panelists: Swaroop Pophale, Michael Kruse, Brandon Cook, Rahulkumar Gayatri, Mathew Colgrove
17:25-17:30 Closing Remarks

Invited Talk I: New Frontiers for Directives [slides]

Barbara Chapman

Barbara Chapman has been a Professor of Computer Science for over 20 years, performing research on parallel programming interfaces and their implementation. She recently joined Hewlett Packard Enterprise (HPE), where she is defining future directions for the HPE Cray Programming Environment. Dr. Chapman remains affiliated with the Department of Computer Science and the Institute for Advanced Computational Science at Stony Brook University, where her team is engaged in efforts to develop community standards for parallel programming, including OpenMP, OpenACC and OpenSHMEM.

Abstract

As our computing platforms grow in size and complexity, the importance of directives is also increasing. New users, new systems and new applications raise the demand for approaches to parallel programming that offer the balance of performance, productivity and portability that directives represent. We discuss the status of the most popular directives for parallel programming in HPC, review recent advances in features and implementation, and consider some potential new uses.


GPU porting of scalable implicit solver with Green’s function-based neural networks by OpenACC [slides]

Yuma Kikuchi

Yuma Kikuchi is a master’s student in the Department of Civil Engineering at the University of Tokyo. His research interest is the enhancement of earthquake simulation using GPU computing and data-driven learning.

Kohei Fujita

Kohei Fujita is an associate professor at the Earthquake Research Institute at the University of Tokyo. He received his Dr. Eng. from the Department of Civil Engineering, the University of Tokyo in 2014. His research interest is the development of high-performance computing methods for earthquake engineering problems.

Abstract

With the development of diverse computer architectures and diverse HPC applications, it is desirable to build performance-portable applications that run on multiple architectures at relatively low development cost. Directive-based programming models such as OpenACC have been developed for this purpose and have been used successfully to port many equation-based HPC applications. As an example of porting a class of HPC applications comprising both data-analytics methods and equation-based methods, we port an implicit solver with a neural network (NN)-type preconditioner for solving large-scale partial differential equation (PDE)-based problems. The scalable preconditioner is based on Green’s functions reflecting properties of the target PDE, which improves the accuracy and efficiency of using NNs for solving PDE-based problems. Through kernel algorithm design suited to the computer architecture and the use of OpenACC, we achieved high performance on recent GPUs at relatively low development cost. Here, 64.4% of FP64 peak was obtained on ABCI’s compute nodes equipped with NVIDIA A100 GPUs, leading to a 2.54-fold speedup over a highly tuned GPU implementation of a widely used PDE solver algorithm and a 38.9-fold speedup over an OpenMP-based CPU implementation running on the same system. Furthermore, 83.4% weak scalability was obtained from 8 to 256 A100 GPUs on the ABCI system, enabling the solution of large-scale problems of up to 25.7 billion degrees of freedom with high performance.
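
For readers unfamiliar with the approach, the sketch below (ours, not the authors’ code) shows the kind of directive-based porting the abstract describes: the core kernel of an implicit iterative solver, a CRS sparse matrix-vector product, offloaded with a single OpenACC directive. It assumes the arrays were placed on the device beforehand, e.g. with `#pragma acc enter data`.

```cpp
// Illustrative sketch only: a CRS sparse matrix-vector product, the core
// kernel of an implicit iterative solver, offloaded with OpenACC.
// default(present) assumes the arrays are already on the device.
void spmv(int n, const int* rowptr, const int* col, const double* val,
          const double* x, double* y) {
  #pragma acc parallel loop default(present)
  for (int i = 0; i < n; ++i) {
    double sum = 0.0;
    for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
      sum += val[j] * x[col[j]];
    y[i] = sum;
  }
}
```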


Challenges Porting a C++ Template-Metaprogramming Abstraction Layer to Directive-based Offloading [slides]

Jeffrey Kelling

Jeffrey Kelling obtained his Ph.D. in physics on massively parallel lattice Monte-Carlo simulations on GPUs. He is a scientist in the computational science group at the Helmholtz-Zentrum Dresden – Rossendorf, concerned with high performance computing and deep learning applications in science.

Abstract

HPC systems employ a growing variety of compute accelerators with different architectures and from different vendors. Large scientific applications are required to run efficiently across these systems, yet they need to retain a single code base so as not to stifle development. Directive-based offloading programming models set out to provide the required portability, but, to existing codes, they themselves represent yet another API to port to. Here, we present our approach to porting the GPU-accelerated particle-in-cell code PIConGPU to OpenACC and OpenMP target by adding two new backends to its existing C++ template-metaprogramming-based offloading abstraction layer, alpaka, while avoiding other modifications to the application code. We describe our approach in the face of conflicts between requirements and available features in the standards, as well as the practical hurdles posed by immature compiler support.
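
To illustrate the idea of an offloading abstraction layer with a directive-based backend, here is a minimal sketch of our own (far simpler than alpaka, and not the paper’s code): the same kernel functor runs serially or under OpenMP target offload depending on a backend tag type.

```cpp
#include <cstddef>
#include <cstdio>

struct SerialBackend {};
struct OmpTargetBackend {};

// Serial reference backend.
template <class Kernel>
void parallelFor(SerialBackend, std::size_t n, Kernel kernel) {
  for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// Directive-based backend: the same functor is offloaded via OpenMP target.
template <class Kernel>
void parallelFor(OmpTargetBackend, std::size_t n, Kernel kernel) {
  #pragma omp target teams distribute parallel for
  for (std::size_t i = 0; i < n; ++i) kernel(i);
}

int main() {
  const std::size_t n = 1 << 20;
  double* x = new double[n];
  // The lambda captures a raw pointer, so the data must be mapped explicitly.
  #pragma omp target data map(tofrom: x[0:n])
  parallelFor(OmpTargetBackend{}, n, [=](std::size_t i) { x[i] = 2.0 * i; });
  std::printf("x[1] = %f\n", x[1]);
  delete[] x;
}
```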


Accelerating quantum many-body configuration interaction with directives [slides]

Brandon Cook

Brandon leads the simulations area of NERSC’s application readiness program (NESAP) and works on understanding and analyzing performance on a system and application level, developing future benchmark suites, analyzing future architectures, developing tools to help NERSC users/staff be more productive, engaging users through consulting, acting as NERSC liaison for several NESAP teams, and exploring future programming models.

Brandon received his Ph.D. in physics from Vanderbilt University in 2012, where he studied ab initio methods for quantum transport in nanomaterials. Before joining NERSC he was a postdoc at Oak Ridge National Laboratory where he developed and applied electronic structure methods to problems in material science.

Abstract

Many-Fermion Dynamics-nuclear, or MFDn, is a configuration interaction (CI) code for nuclear structure calculations. It is a platform-independent Fortran 90 code using a hybrid MPI+X programming model. For CPU platforms, the application has a robust and optimized OpenMP implementation for shared-memory parallelism. As part of the NESAP application readiness program for NERSC’s latest Perlmutter system, MFDn has been updated to take advantage of accelerators. The current mainline GPU port is based on OpenACC. In this work we describe some of the key challenges of creating an efficient GPU implementation. We compare the support of OpenMP and OpenACC on AMD and NVIDIA GPUs.
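
As a flavor of the comparison (a generic sketch under our own assumptions, not MFDn code), the same reduction loop can be expressed in both models:

```cpp
// The same dot product in OpenACC and in OpenMP target offload.
double dot_acc(int n, const double* a, const double* b) {
  double sum = 0.0;
  #pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n])
  for (int i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}

double dot_omp(int n, const double* a, const double* b) {
  double sum = 0.0;
  #pragma omp target teams distribute parallel for reduction(+:sum) \
      map(to: a[0:n], b[0:n]) map(tofrom: sum)
  for (int i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}
```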


GPU offloading of a large-scale gyrokinetic particle-in-cell Fortran code: From OpenACC to OpenMP [slides]

Qiheng Cai

Qiheng Cai is currently working as a postdoc at the University of Colorado Boulder after graduating from Pennsylvania State University. His research focuses on computational plasma physics, with experience investigating lightning initiation and Inertial Electrostatic Confinement (IEC). He is also interested in computer science and is investigating the use of OpenMP GPU offloading to optimize GEM, a gyrokinetic PIC code for tokamak plasma simulations.

Abstract

The GPU offloading of a large-scale gyrokinetic particle-in-cell Fortran code is converted from OpenACC to OpenMP. Particle pushing and deposition are completely offloaded to the GPU. Performance is compared between CPU and GPU, and between OpenACC and OpenMP. Good weak scaling (increasing the particle number with a fixed grid number) is obtained. Issues encountered when porting to OpenMP GPU offloading are discussed.
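
For illustration (our sketch, not the GEM code, and in C++ rather than Fortran), a particle-deposition scatter of the kind the abstract mentions needs atomics when offloaded with OpenMP:

```cpp
// Illustrative particle deposition under OpenMP target offload: concurrent
// updates to the grid from different particles require atomic updates.
void deposit(int np, int ng, const int* cell, const double* weight,
             double* grid) {
  #pragma omp target teams distribute parallel for \
      map(to: cell[0:np], weight[0:np]) map(tofrom: grid[0:ng])
  for (int p = 0; p < np; ++p) {
    #pragma omp atomic update
    grid[cell[p]] += weight[p];
  }
}
```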


Invited Talk II: Introducing SPEChpc 2021 [slides]

Mathew Colgrove

Mathew Colgrove is an NVIDIA Dev Tech working with the NVHPC compiler team. Mat’s primary focus is on training, customer support and programming advice on using OpenACC and OpenMP. Mat is also NVIDIA’s representative on SPEC’s CPU and HPG benchmarking committees. In addition to serving on SPEC’s Board of Directors, Mat holds several officer positions, including Release Manager for SPEC HPG and SPEC’s Vice-President of Operations.

Sunita Chandrasekaran

Sunita Chandrasekaran is an Associate Professor with the Department of Computer and Information Sciences at the University of Delaware, USA. She is also a computational scientist with Brookhaven National Laboratory. She received her Ph.D. in 2012 on Tools and Algorithms for High-Level Algorithm Mapping to FPGAs from the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Her research spans High Performance Computing, exascale computing, parallel programming, benchmarking and data science. Applications of interest include scientific domains such as plasma physics, biophysics, solar physics and bioinformatics. She is a recipient of the 2016 IEEE-CS TCHPC Award for Excellence for Early Career Researchers in High Performance Computing. She has been involved with SC, ISC, IPDPS, IEEE Cluster, CCGrid, WACCPD, AsHES and P3MA in different capacities.

Abstract

The increased complexity of modern HPC systems, built with innovative system architectures, poses challenges for the performance portability and performance evaluation of scientific applications. The Standard Performance Evaluation Corporation (SPEC) has a long history of producing industry-standard benchmarks for modern computer systems. SPEC’s newly released SPEChpc 2021 benchmark suites, developed by the High Performance Group, are a bold attempt to provide a fair and objective benchmarking tool designed for state-of-the-art HPC systems. With the support of multiple host and accelerator programming models, the suites are portable across both homogeneous and heterogeneous architectures. Different workloads are developed to fit system sizes ranging from a few compute nodes to a few hundred compute nodes. In this talk, we take a first glance at these benchmark suites and evaluate their portability and basic performance characteristics on various popular and emerging HPC architectures, including x86 CPUs, NVIDIA GPUs, and AMD GPUs. We share first-hand experience of executing the SPEChpc 2021 suites at scale on production HPC systems, discuss real-world use cases, and offer an initial guideline for using the benchmark suites.


Extending OpenMP for Machine Learning-Driven Adaptation [slides]

Chunhua “Leo” Liao

Dr. Chunhua “Leo” Liao is a senior computer scientist in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory. His research focus has been on software techniques to improve the performance and correctness of parallel programs. His research interests encompass parallel languages, especially OpenMP, optimizing compilers, runtime systems, and programming tools. He is the lead author of the ROSE compiler’s AST translation API, OpenMP implementations targeting CPUs and GPUs, and a range of source-to-source tools.

Abstract

OpenMP 5.0 introduced the metadirective directive to support compile-time selection from a set of directive variants based on the OpenMP context. OpenMP 5.1 extended the context information to include user-defined conditions that enable user-guided runtime adaptation. However, defining conditions that capture the complex interactions between applications and hardware platforms to select an optimized variant is challenging for programmers. This paper explores a novel approach to automating runtime adaptation through machine learning. In particular, we design a new omp declare adaptation directive and its associated clauses to describe semantics for model-driven adaptation, and we develop a prototype source-to-source transformation tool to evaluate our runtime adaptation approach. Leveraging an existing runtime library for tuning, we design a small set of API functions to support our source-to-source compiler transformations. Our evaluation, using the Smith-Waterman algorithm as a use case, demonstrates that the proposed adaptive OpenMP extension automatically chooses the code variants that deliver the best performance on heterogeneous platforms consisting of CPU and GPU processing capabilities. Using decision tree models for tuning achieves an accuracy of up to 93.1% in selecting the optimal variant, with negligible runtime overhead.
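
For context, here is a minimal sketch of the mechanism the paper builds on (our example, not the paper’s proposed syntax): an OpenMP 5.1 metadirective whose user-defined condition selects a variant at run time. The proposed omp declare adaptation extension would let a learned model drive such a choice instead of a hand-written condition.

```cpp
// OpenMP 5.1 metadirective with a user-defined condition: at run time, either
// a GPU-offloaded variant or a host parallel-for variant is selected.
void saxpy(int n, float a, const float* x, float* y, bool on_gpu) {
  #pragma omp metadirective \
      when(user={condition(on_gpu)}: target teams distribute parallel for \
                                     map(to: x[0:n]) map(tofrom: y[0:n])) \
      default(parallel for)
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}
```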


Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability [slides]

Vincent R. Pascuzzi

Dr. Pascuzzi works in the Computation and Data-Driven Discovery (C3D) and High Performance Computing (HPC) groups at BNL’s Computational Science Initiative. He holds Ph.D. and M.Sc. degrees in Physics from the University of Toronto and a B.Sc. in Physics and Computer Science from Brock University. Prior to joining BNL in September 2021, he was a Postdoctoral Research Scholar at Lawrence Berkeley National Laboratory, where he focused on HPC applications for high-energy physics and quantum information science.

Abstract

High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. The growing number of new HPC systems coming online are being furnished with various hardware components, engineered by competing industry entities, each with its own architectures and platforms to be supported. While the increasing availability of these resources is in many cases pivotal to successful science, even the largest collaborations lack the computational expertise required to fully exploit current hardware capabilities. The need to maintain multiple platform-specific codebases further complicates matters, potentially adding a constraint on the number of machines that can be utilized. Fortunately, numerous programming models are under development that aim to facilitate software solutions for heterogeneous computing. One such model is SYCL, an open-standard, C++-based single-source programming paradigm. Among SYCL’s features is interoperability, a mechanism through which applications and third-party libraries coordinate sharing data and execute collaboratively. In this paper, we leverage the SYCL programming model to demonstrate cross-platform performance portability across heterogeneous resources. We detail our NVIDIA and AMD random number generator extensions to the oneMKL open-source interfaces library. Performance portability is measured relative to platform-specific baseline applications executed on four major hardware platforms using two different compilers supporting SYCL. The utility of our extensions is exemplified in a real-world setting via a high-energy physics simulation application. We show that the performance of implementations that capitalize on SYCL interoperability is on par with that of native implementations, attesting to the cross-platform performance portability of a SYCL-based approach to scientific codes.
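
As background on the mechanism (a hedged sketch of the SYCL 2020 interop API, not the authors’ oneMKL extension code; the `ext_oneapi_cuda` backend tag is DPC++-specific): a host_task receives an interop_handle, from which native backend objects can be recovered and handed to a vendor library such as cuRAND.

```cpp
#include <sycl/sycl.hpp>

// Sketch: recover the native CUDA stream and device memory behind a SYCL
// queue and accessor, so a native RNG library can fill the buffer.
void fill_random(sycl::queue& q, sycl::buffer<float, 1>& buf) {
  q.submit([&](sycl::handler& cgh) {
    sycl::accessor out{buf, cgh, sycl::write_only};
    cgh.host_task([=](sycl::interop_handle ih) {
      auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
      auto* ptr = reinterpret_cast<float*>(
          ih.get_native_mem<sycl::backend::ext_oneapi_cuda>(out));
      // ... e.g. curandSetStream(gen, stream);
      //          curandGenerateUniform(gen, ptr, n); ...
      (void)stream; (void)ptr;
    });
  });
}
```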


Can Fortran’s ‘do concurrent’ Replace Directives for Accelerated Computing? [slides]

Miko M. Stulajter

Miko is currently pursuing his Ph.D. in Computational Science in the joint doctoral program at San Diego State University and the University of California, Irvine. He has an M.S.E. in Materials Science and Engineering from the University of Pennsylvania and a B.S. in Mathematics & Computer Science and Physics from DePaul University. His current research interests are in the areas of high-performance computing and machine learning. He is currently working as a Research Assistant at Predictive Science Inc. with his research focusing on high-performance computing and algorithm development.

Abstract

Recently, there has been growing interest in using standard language constructs (e.g. C++’s Parallel Algorithms and Fortran’s `do concurrent`) for accelerated computing as an alternative to directive-based APIs (e.g. OpenMP and OpenACC). These constructs have the potential to be more portable, and some compilers already support (or plan to support) such standards. Here, we look at the current capabilities, portability, and performance of replacing directives with Fortran’s `do concurrent` using a mini-app that currently implements OpenACC for GPU acceleration and OpenMP for multi-core CPU parallelism. We replace as many directives as possible with `do concurrent`, testing various configurations and compiler options within three major compilers: GNU’s gfortran, NVIDIA’s nvfortran, and Intel’s ifort. We find that with the right compiler versions and flags, many directives can be replaced without loss of performance or portability, and, in the case of nvfortran, they can all be replaced. We discuss limitations that may apply to more complicated codes and future language additions that may mitigate them. Singularity containers are publicly provided to allow the results to be reproduced.
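
The C++ analog the abstract alludes to, as a quick sketch of our own: with a parallel-STL-capable compiler such as nvc++ with -stdpar, a standard algorithm plus an execution policy replaces the directive entirely, much as `do concurrent` does in Fortran.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> x(1 << 20);
  std::iota(x.begin(), x.end(), 0.0);
  // Standard-language construct instead of an OpenACC/OpenMP directive;
  // nvc++ -stdpar can offload this loop to a GPU.
  std::for_each(std::execution::par_unseq, x.begin(), x.end(),
                [](double& v) { v = 2.0 * v + 1.0; });
}
```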

Panel: Publicly-available Directive test suites for heterogeneous architectures [slides]

Moderator: Christopher Daley
Panelists: Swaroop Pophale, Michael Kruse, Brandon Cook, Rahulkumar Gayatri, Mathew Colgrove

Bio

Christopher Daley

Christopher Daley is an HPC Performance Engineer working in the Advanced Technology Group (ATG) at NERSC. He is the NERSC technical representative for the NERSC/NVIDIA OpenMP target offload Non-Recurring Engineering (NRE) contract. He is involved in understanding OpenMP target offload application requirements at NERSC and has contributed to the OpenMP target offload implementations in the HPGMG, SU3, Batoid, XGC, BerkeleyGW, and GEM applications. His research interests include performance analysis of HPC applications and how portable programming methods, such as OpenMP, can be used to achieve high performance on current and next-generation supercomputers. He has been with NERSC since 2013. Previously he was a scientific programmer at the Flash Center for Computational Science at the University of Chicago.

Swaroop Pophale

Dr. Swaroop Pophale graduated from the University of Houston with a PhD in High Performance Programming in 2014. Her key focus areas there were PGAS programming models, benchmark development, and compiler research. After working in industry for a year, she joined ORNL as a postdoctoral researcher in 2015 and was appointed staff Computer Scientist in 2017. At ORNL, her research focuses on shared and distributed memory programming models like OpenMP and OpenSHMEM. She is a member of the SOLLVE ECP effort and is one of the primary contributors to the OpenMP 4.5 validation and verification suite developed for ECP to test device offloading with OpenMP. Dr. Pophale continues to contribute to OpenSHMEM library research, benchmarks and specification development.

Michael Kruse

Michael Kruse graduated with a Master’s degree in computer science from the University of Paderborn (Germany) and a PhD from the University Paris-Sud 11. After a postdoc at the École Normale Supérieure, he now works at Argonne National Laboratory. His research topics are optimizing compilers for high-performance computing applications, such as Lattice QCD, with an emphasis on loop optimizations. He contributes to the LLVM compiler infrastructure, including Polly, LLVM’s polyhedral optimizer. Currently, he works on a compiler intermediate representation for loop optimizations, standardizing loop transformations in OpenMP, and implementing them in Clang and Flang.

Brandon Cook

Brandon Cook leads the simulations area of NERSC’s application readiness program (NESAP) and works on understanding and analyzing performance on a system, workflow and application level, developing future benchmark suites, analyzing future architectures, developing tools to help NERSC users/staff be more productive, acting as NERSC liaison for several NESAP teams, and exploring future programming models. Brandon received his Ph.D. in physics from Vanderbilt University in 2012, where he studied ab initio methods for quantum transport in nanomaterials. Before joining NERSC he was a postdoc at Oak Ridge National Laboratory where he developed and applied electronic structure methods to problems in material science.

Rahulkumar Gayatri

Rahulkumar Gayatri is an Application Performance Specialist working in the Application Performance Group (APG) at NERSC. His current projects include the EXAALT ECP project, and he is also involved with the Kokkos development team. In his time as a NERSC staff member, Rahul optimized the SNAP potential in the LAMMPS MD package for next-generation architectures. He is currently working on the OpenMPTarget backend of the Kokkos programming model.


Mathew Colgrove

Mathew Colgrove is an NVIDIA Dev Tech working with the NVHPC compiler team. Mat’s primary focus is on training, customer support and programming advice on using OpenACC and OpenMP. Mat is also NVIDIA’s representative on SPEC’s CPU and HPG benchmarking committees. As well as serving on SPEC’s Board of Directors, Mat holds several officer positions including Release Manager for SPEC HPG and SPEC’s Vice-President of Operations.

