The workshop takes place on Friday, November 13, 2020.
When: 10AM – 2PM US Eastern Time
SC20 Workshop Evaluation Form – https://submissions.supercomputing.org/eval.html
KEYNOTE: Achieving Performance Portability for Extreme Heterogeneity
Mary Hall, Director of the School of Computing at the University of Utah, will give the keynote at the workshop. [slides]
Mary Hall is the Director of the School of Computing at the University of Utah. Her research focus brings together compiler optimizations and performance tuning targeting current and future high-performance architectures on real-world applications. Professor Hall is an IEEE Fellow, an ACM Distinguished Scientist, and a member of the Computing Research Association Board of Directors. She actively participates in mentoring and outreach programs to encourage the participation of groups underrepresented in computer science.
Current and future high-performance architectures employ hardware specialization to attain orders of magnitude of performance and energy efficiency, leading to next-generation systems that will incorporate different underlying processing technology. A programmer’s challenge is to map software to these diverse architectures. In this talk, we discuss two research directions towards developing the software stack for extreme heterogeneity. First, we consider how to use autotuning of pragma-based mapping guidance to replace descriptive pragmas with more prescriptive ones aimed at portability across CPUs and accelerators. Second, we discuss the role of domain-specific compiler technology in mapping high-level concepts of parallelism and data movement to concrete architecture-specific implementations.
Invited Talk: Enabling Portable Directive-Based Programming at Exascale
The invited talk will be presented by Nicholas Malaya, AMD’s technical lead for the Frontier and El Capitan Centers of Excellence. [slides]
Nicholas Malaya is a computational scientist at AMD Research, and is AMD’s technical lead for the Frontier and El Capitan Centers of Excellence (COEs). These COEs are focused on close collaborations between AMD, DOE, and HPE to ensure application readiness, so that key workloads can run on the computers from Day-1 of machine deployment. Nick’s research interests include Exascale Computing, CFD, Bayesian Inference, and Machine Learning.
Due to large concurrency and heterogeneous architectures, application performance at Exascale will be a considerable challenge. This challenge is compounded by the requirement for performant code portability across multiple computer architectures. This talk will begin with an introduction to some of the anticipated challenges of concurrency and performance on future exascale machines. We next address AMD’s planned technologies for directive-based accelerator programming and demonstrate capabilities available today. The talk ends with a discussion of recommendations and identified opportunities for the directive community to ensure high performance using directive-based codes at unprecedented scale.
ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems
Vinh Quang Dang (Sandia National Laboratories) is presenting this paper. Other authors include Joseph D. Kotulski (Sandia National Laboratories) and Sivasankaran Rajamanickam (Sandia National Laboratories). [slides] [paper]
Vinh Q. Dang received the B.Eng. degree from the Posts and Telecommunications Institute of Technology, Vietnam, the M.S. degree from University of Technology, Vietnam, and the Ph.D. degree from the Catholic University of America, USA, all in Electrical Engineering, in 2003, 2006, and 2015, respectively.
Prior to 2010, he was a lecturer in the School of Electrical Engineering, International University, Vietnam. From 2010 to 2015, he was a Research Assistant in the Electromagnetic Wave Propagation and Remote Sensing Laboratory at the Catholic University of America, USA. He was a Research Associate in the Center for Automata Processing at the University of Virginia, USA from 2015 to 2018. He joined Sandia National Laboratories, Albuquerque, NM, USA as a Postdoctoral Appointee in 2018, and became a Senior Member of Technical Staff in 2019.
His research interests include high performance computing, computational electromagnetics, automata processing, data mining, compressive sensing, radar imaging, and medical image processing.
Solving dense systems of linear equations is essential in applications encountered in physics, mathematics, and engineering. This paper describes our current efforts toward the development of the ADELUS package for current and next-generation distributed, accelerator-based, high-performance computing platforms. The package solves dense linear systems using partial-pivoting LU factorization on distributed-memory systems with CPUs/GPUs. The matrix is block-mapped onto distributed memory on CPUs/GPUs and is solved as if it were torus-wrapped for an optimal balance of computation and communication. A permutation operation is performed to restore the results so the torus-wrap distribution is transparent to the user. This package targets performance portability by leveraging the abstractions provided in the Kokkos and Kokkos Kernels libraries. A comparison of the performance gains versus the state-of-the-art SLATE and DPLASMA GESV functionalities on the Summit supercomputer is provided. Preliminary performance results from large-scale electromagnetic simulations using ADELUS are also presented. The solver achieves 7.7 Petaflops on 7600 GPUs of the Sierra supercomputer, translating to 16.9% efficiency.
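To illustrate why a torus-wrap distribution helps balance the work in LU factorization, the following is a minimal sketch, not the ADELUS implementation (which builds on Kokkos and MPI). It assumes the textbook form of the torus-wrap mapping, in which entry (i, j) belongs to process (i mod Pr, j mod Pc) on a Pr x Pc process grid, and compares it with a contiguous block mapping. All function names and sizes are hypothetical. The sketch counts how many entries of the trailing submatrix each process still holds after k elimination steps.

```python
from collections import Counter


def block_owner(i, j, n, pr, pc):
    """Process (row, col) owning entry (i, j) under a contiguous block mapping."""
    rows_per = -(-n // pr)  # ceiling division: rows per process row
    cols_per = -(-n // pc)  # ceiling division: columns per process column
    return (i // rows_per, j // cols_per)


def torus_wrap_owner(i, j, n, pr, pc):
    """Process (row, col) owning entry (i, j) under a torus-wrap (cyclic) mapping."""
    return (i % pr, j % pc)


def trailing_load(owner, n, k, pr, pc):
    """Count entries of the trailing submatrix (rows and columns >= k) per process."""
    load = Counter({(r, c): 0 for r in range(pr) for c in range(pc)})
    for i in range(k, n):
        for j in range(k, n):
            load[owner(i, j, n, pr, pc)] += 1
    return load


if __name__ == "__main__":
    # Hypothetical sizes: an 8 x 8 matrix on a 2 x 2 grid, halfway through elimination.
    n, pr, pc, k = 8, 2, 2, 4
    print("block:     ", dict(trailing_load(block_owner, n, k, pr, pc)))
    print("torus-wrap:", dict(trailing_load(torus_wrap_owner, n, k, pr, pc)))
```

In this toy case the block mapping leaves all remaining work on a single process, while the torus-wrap mapping keeps every process equally loaded as the active submatrix shrinks.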
GPU acceleration of the FINE/FR CFD solver in a heterogeneous environment with OpenACC directives
Xiaomeng Zhai (Numeca-USA) is presenting this paper. Other authors include David Gutzwiller (Numeca-USA), Kunal Puri (Numeca-International), and Charles Hirsch (Numeca-International). [slides] [paper]
Dr. Xiaomeng ‘Shine’ Zhai is an HPC development engineer at NUMECA USA. Shine joined NUMECA USA in 2019 after completing his PhD in Aerospace Engineering at the Georgia Institute of Technology in 2018. His PhD focused on large-scale direct numerical simulations of isotropic turbulence and magnetohydrodynamic turbulence. Through his PhD training, Shine has gained extensive experience in developing highly scalable programs that take advantage of leadership computing resources at NCSA and TACC, using up to 262,144 cores.
OpenACC has been highly successful in adapting legacy CPU-only applications for modern heterogeneous computing environments equipped with GPUs, as demonstrated by many projects as well as our previous experience. In this work, OpenACC is leveraged to transform another high-order Computational Fluid Dynamics (CFD) solver, FINE/FR, to be GPU-eligible. On the Summit supercomputer, impressive GPU speedups ranging from 6X to 80X have been achieved using up to 12,288 GPUs. Techniques critical to achieving good speedup include aggressive reduction of data transfers between CPUs and GPUs, and optimizations targeted at improving the parallelism exposed to GPUs. We have demonstrated that OpenACC offers an efficient, portable, and easily maintainable approach to achieve fast turnaround time for high-fidelity industrial simulations.
Performance and Portability of a Linear Solver Across Emerging Architectures
Aaron Walden (NASA Langley Research Center) is presenting this paper. Other authors include Mohammad Zubair (Old Dominion University) and Eric J. Nielsen (NASA Langley Research Center). [slides] [paper]
Aaron Walden is a Computer Scientist with the Computational AeroSciences Branch at NASA Langley Research Center. He specializes in optimizing the performance of scientific simulation software for HPC systems.
A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult in the face of varying vector lengths offered by each. A similar challenge occurs on GPU architectures, where it is essential to have coalesced memory accesses to utilize memory bandwidth effectively. In this work, we demonstrate that restructuring a computation, and possibly data layout, with regard to architecture is essential to achieve optimal performance by establishing a performance benchmark for each target architecture in a low-level language such as vector intrinsics or CUDA. In doing so, we demonstrate how a linear solver kernel can be mapped to Intel Xeon and Xeon Phi, Marvell ThunderX2, NEC SX-Aurora TSUBASA Vector Engine, and NVIDIA and AMD GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA, and Intel OneAPI/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
Evaluating Performance Portability of OpenMP for SNAP on NVIDIA, Intel, and AMD GPUs using the Roofline Methodology
Neil Mehta (Lawrence Berkeley National Laboratory) is presenting this paper. Other authors include Rahulkumar Gayatri (Lawrence Berkeley National Laboratory), Yasaman Ghadar (Argonne National Laboratory), Christopher Knight (Argonne National Laboratory), and Jack Deslippe (Lawrence Berkeley National Laboratory). [slides] [paper]
Neil Mehta is a NESAP postdoctoral researcher working with the Exaalt MD ECP team. He graduated with a PhD in Aerospace Engineering in December 2019 from the University of Illinois Urbana-Champaign. His research is primarily in the area of HPC targeted towards particle-based methods, such as molecular dynamics and kinetic Monte Carlo.
In this paper, we show that an OpenMP 4.5-based implementation of TestSNAP, a proxy-app for the Spectral Neighbor Analysis Potential (SNAP) in LAMMPS, can be ported across NVIDIA, Intel, and AMD GPUs. Roofline analysis is employed to assess the performance of TestSNAP on each of the architectures. The main contributions of this paper are two-fold: 1) present OpenMP as a viable option for application portability across multiple GPU architectures, and 2) provide a methodology based on the roofline analysis to determine the performance portability of OpenMP implementations on the target architectures. The GPUs used for this work are Intel Gen9, AMD Radeon Instinct MI60, and NVIDIA Volta V100.
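As background for the roofline analysis mentioned above, here is a minimal sketch of the basic roofline bound: attainable performance is the smaller of the device's compute peak and the product of the kernel's arithmetic intensity and the memory bandwidth. The function names and all numbers are hypothetical placeholders, not measurements from the paper, and the abstract does not state how the authors turn the bound into a portability metric. Reporting measured performance as a fraction of the bound is one plausible reading.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s: min(compute peak, arithmetic intensity * bandwidth).

    arithmetic_intensity is in FLOP/byte, bandwidth_gbs in GB/s.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def fraction_of_roofline(measured_gflops, arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Measured performance as a fraction of the roofline bound for the kernel."""
    bound = roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs)
    return measured_gflops / bound


if __name__ == "__main__":
    # Placeholder device and kernel figures (assumptions for illustration only).
    peak, bandwidth = 1000.0, 100.0   # GFLOP/s, GB/s
    intensity, measured = 2.0, 150.0  # FLOP/byte, GFLOP/s
    print("bound (GFLOP/s):", roofline_gflops(intensity, peak, bandwidth))
    print("fraction of roofline:", fraction_of_roofline(measured, intensity, peak, bandwidth))
```

Comparing this fraction across devices, rather than raw GFLOP/s, separates how well the code uses each architecture from how fast that architecture is.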
Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs
Joshua Davis (University of Delaware) is presenting this paper. Other authors include Christopher Daley (Lawrence Berkeley National Laboratory), Swaroop Pophale (Oak Ridge National Laboratory), Thomas Huber (University of Delaware), Sunita Chandrasekaran (University of Delaware), and Nicholas J. Wright (Lawrence Berkeley National Laboratory). [slides] [paper]
Joshua Hoke Davis is an undergraduate at the University of Delaware majoring in Computer Science and Philosophy. Winner of the 2nd Place award in the SC18 ACM Student Research Competition, Josh has been an active SC participant ever since, serving as a Student Volunteer at SC19. Josh plans to continue his HPC career in a PhD program starting Fall 2021. His research interests include directive-based programming models, heterogeneous architectures and GPUs, deep learning at scale, and formal verification in HPC. Bringing along his training in philosophy, he hopes to be an advocate for ethical and equitable science throughout his career.
Heterogeneous systems are becoming increasingly prevalent. To exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today’s systems to tomorrow’s. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges on emerging systems. This work focuses on applying and demonstrating OpenMP offloading directives on five proxy applications. We observe that performance varies widely from one compiler to another; a crucial aspect of our work is reporting best practices to application developers who use OpenMP offloading compilers. While some issues can be worked around by the developer, other issues must be reported to the compiler vendors. By restructuring OpenMP offloading directives, we gain an 18x speedup for the su3 proxy application on NERSC’s Cori system when using the Clang compiler, and a 15.7x speedup by switching max reductions to add reductions in the laplace mini-app when using the Cray-llvm compiler on Cori.