Perlmutter - A 2020 Pre-Exascale GPU-accelerated System for NERSC - Architecture and Application Performance Optimization

Nicholas J. Wright, Perlmutter Chief Architect

Sixth Workshop on Accelerator Programming Using Directives

**18 November 2019** 

### NERSC is the mission High Performance Computing facility for the DOE SC









Simulations at scale



Data analysis support for DOE's experimental and observational facilities

Photo Credit: CAMERA



## NERSC has a dual mission to advance science and the state-of-the-art in supercomputing

- We collaborate with computer companies years before a system's delivery to deploy advanced systems with new capabilities at large scale
- We provide a highly customized software and programming environment for science applications
- We are tightly coupled with the workflows of DOE's experimental and observational facilities – ingesting tens of terabytes of data each day
- Our staff provide advanced application and system performance expertise to users







### NERSC's Users Demonstrate Groundbreaking Science Capability





Large Scale Particle in Cell Plasma Simulations

Stellar Merger Simulations with Task Based Programming



Largest Ever Quantum Circuit Simulation



Largest Ever Defect Calculation from Many Body Perturbation Theory > 10PF





Deep Learning at 15PF (SP) for Climate and HEP



Celeste: 1<sup>st</sup> Julia app to achieve 1 PF



Galactos: Solved 3-pt correlation analysis for Cosmology @9.8PF



# NERSC also supports a large number of users and projects from DOE SC's experimental and observational facilities



Palomar Transient Factory Supernova



Planck Satellite Cosmic Microwave Background Radiation

Alice Large Hadron Collider

Atlas Large Hadron Collider



Star Particle Physics



LZ



DESI



Daya Bay Neutrinos



ALS Light Source



LCLS Light Source



Joint Genome Institute Bioinformatics



Cryo-EM

NCEM



LSST-DESC



### **NERSC Systems Roadmap**





### Perlmutter is a Pre-Exascale System

| Class        | Year      | System       | Site     | Vendor / architecture |
|--------------|-----------|--------------|----------|-----------------------|
| Pre-Exascale | 2013      | Mira         | Argonne  | IBM BG/Q              |
| Pre-Exascale | 2013      | Titan        |          |                       |
| Pre-Exascale | 2013      | Sequoia      | LLNL     | IBM BG/Q              |
| Pre-Exascale | 2016      | Theta        | Argonne  | Intel/Cray KNL        |
| Pre-Exascale | 2016      | Cori         | LBNL     | Cray/Intel Xeon/KNL   |
| Pre-Exascale | 2016      |              |          | Cray/Intel Xeon/KNL   |
| Pre-Exascale | 2018      | Summit       | ORNL     | IBM/NVIDIA P9/Volta   |
| Pre-Exascale | 2018      | Sierra       |          | IBM/NVIDIA P9/Volta   |
| Pre-Exascale | 2020      | Perlmutter   | LBNL     | Cray/NVIDIA/AMD       |
| Exascale     | 2021      | A21 (Aurora) | Argonne  | Intel/Cray            |
| Exascale     | 2021-2023 | Frontier     | ORNL     | Cray/AMD              |
| Exascale     | 2021-2023 |              | LANL/SNL | TBD                   |
| Exascale     | 2021-2023 | El Capitan   |          | Cray/?                |



### **Perlmutter: A System Optimized for Science**



- GPU-accelerated and CPU-only nodes meet the needs of large scale simulation and data analysis from experimental facilities
- Cray "Slingshot" high-performance, scalable, low-latency Ethernet-compatible network
- Single-tier All-Flash Lustre based HPC file system, >6x Cori's bandwidth
- Dedicated login and high memory nodes to support complex workflows
- Delivery in early FY21







## AMD CPU nodes

>= Rome specs

AMD "Milan" CPU

- ~64 cores
- "ZEN 3" cores 7nm+
- AVX2 SIMD (256 bit)

8 channels DDR memory

- >= 256 GiB total per node
- 1 Slingshot connection
  - 1x25 GB/s

~ 1x Cori











4x NVIDIA "Volta-next" GPU

Volta specs

- > 7 TF
- > 32 GiB, HBM-2
- NVLINK
- 1x AMD CPU
- **4** Slingshot connections
  - 4x25 GB/s
- GPU direct, Unified Virtual Memory (UVM)
- 2-3x Cori







## **Slingshot Network**



### High Performance scalable interconnect

- Low latency, high-bandwidth, MPI performance enhancements
- 3 hops between any pair of nodes
- Sophisticated congestion control and adaptive routing to minimize tail latency
- Ethernet compatible

Office of Science

- Blurs the line between the inside and the outside of the machine
- Allows for seamless external communication
- Direct interface to storage





## Perlmutter has an All-Flash Filesystem

- <u>Fast</u> across many dimensions
  - 4 TB/s sustained bandwidth
  - 7,000,000 IOPS
  - 3,200,000 file creates/sec
- <u>Usable</u> for NERSC users
  - 30 PB usable capacity
  - Familiar Lustre interfaces
  - New data movement capabilities
- **Optimized** for NERSC data workloads
  - NEW small-file I/O improvements
  - NEW features for high IOPS, non-sequential I/O



### **Perlmutter: A System Optimized for Science**



- GPU-accelerated and CPU-only nodes meet the needs of large scale simulation and data analysis from experimental facilities
- Cray "Slingshot" high-performance, scalable, low-latency Ethernet-compatible network
- How do we optimize the size of each partition?
- Dedicated login and high memory nodes to support complex workflows







## NERSC System Utilization (Aug'17 - Jul'18)



- 3 codes > 25% of the workload
- 10 codes > 50% of the workload
- 35 codes > 75% of the workload
- Over 600 codes comprise the remaining 25% of the workload.

## GPU Readiness Among NERSC Codes (Aug'17 - Jul'18)



| GPU Status | Description                                     | Fraction |
|------------|-------------------------------------------------|----------|
| Enabled    | Most features are ported and performant         | 37%      |
| Kernels    | Ports of some kernels have been documented.     | 10%      |
| Proxy      | Kernels in related codes have been ported       | 20%      |
| Unlikely   | A GPU port would require major effort.          | 13%      |
| Unknown    | GPU readiness cannot be assessed at this time.  | 20%      |


A number of applications in the NERSC workload are already GPU-enabled.

### How many GPU nodes to buy - Benchmark Suite Construction & Scalable System Improvement

Select codes to represent the anticipated workload

- Include key applications from the current workload.
- Add apps that are expected to contribute significantly to the future workload.

| Application             | Description                                                  |
|-------------------------|--------------------------------------------------------------|
|                         | Materials code using DFT                                     |
| MILC                    | QCD code using staggered quarks                              |
| StarLord                | Compressible radiation hydrodynamics                         |
| DeepCAM                 | Weather/Community<br>Atmospheric Model 5                     |
| GTC                     | Fusion PIC code                                              |
| "CPU Only"<br>(3 Total) | Representative of applications that cannot be ported to GPUs |

Scalable System Improvement (SSI)

- Measures aggregate performance of an HPC machine
- How many more copies of the benchmark can be run relative to the reference machine
- Performance relative to reference machine

$$
\mathrm{SSI} = \left\langle \frac{\#\mathrm{Nodes} \times \mathrm{Jobsize} \times \mathrm{Perf\_per\_node}}{\#\mathrm{Nodes}_{\mathrm{Ref}} \times \mathrm{Jobsize}_{\mathrm{Ref}} \times \mathrm{Perf\_per\_node}_{\mathrm{Ref}}} \right\rangle
$$
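As a rough illustration of the SSI formula above, the following Python sketch evaluates the metric for a made-up two-application suite. The slide only shows an average over applications, so the geometric mean used here is an assumption. The names `BenchmarkRun` and `ssi` and all node counts and performance values are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class BenchmarkRun:
    """The three factors of the SSI formula for one application on one machine."""
    nodes: float          # "#Nodes" factor
    jobsize: float        # "Jobsize" factor
    perf_per_node: float  # "Perf_per_node" factor

    def capability(self) -> float:
        # Product that appears in the numerator (or denominator) of the formula.
        return self.nodes * self.jobsize * self.perf_per_node


def ssi(new_runs, ref_runs):
    """Average the per-application ratios (geometric mean is an assumption)."""
    if not new_runs or len(new_runs) != len(ref_runs):
        raise ValueError("need one reference run per application")
    ratios = [n.capability() / r.capability() for n, r in zip(new_runs, ref_runs)]
    return prod(ratios) ** (1.0 / len(ratios))


# Hypothetical suite: one code runs 4x faster per node on the proposed
# machine, the other is unchanged.
reference = [BenchmarkRun(100, 10, 1.0), BenchmarkRun(100, 10, 1.0)]
proposed = [BenchmarkRun(100, 10, 4.0), BenchmarkRun(100, 10, 1.0)]
print(f"SSI = {ssi(proposed, reference):.2f}")  # prints "SSI = 2.00"
```

With a geometric mean, a 4x gain in one of two codes yields an SSI of 2, which shows how codes that do not speed up hold the aggregate metric down.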



B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J. Wright, "A Metric for Evaluating Supercomputer Performance in the Era of Extreme Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.

### Hetero system design & price sensitivity: Budget for GPUs increases as GPU price drops





Stars: Optimal system configuration.

# Application readiness efforts justify larger GPU partitions.

Explore an isocost design space

- Assume 8:1 GPU/CPU node cost ratio.
- Vary the budget allocated to GPUs
- Examine GPU / CPU performance gains such as those obtained by software optimization & tuning. 5 of 8 codes have 10x, 20x, or 30x speedup.

| GPU / CPU<br>perf. per node | SSI increase<br>vs. CPU-Only<br>(@ budget %) | Notes                                                 |
|-----------------------------|----------------------------------------------|-------------------------------------------------------|
| 10x                         | None                                         | No justification for GPUs                             |
| 20x                         | 1.15x @ 45%                                  | Compare to 1.23x<br>for 10x at 4:1 GPU/CPU cost ratio |
| 30x                         | 1.40x @ 60%                                  | Compare to 3x<br>from NESAP for KNL                   |
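The isocost exploration can be sketched with a deliberately simplified toy model. It keeps the 8:1 GPU/CPU node cost ratio and the 5-of-8 GPU-capable codes from the slide, but everything else is an assumption: GPU-capable codes are taken to use both partitions, CPU-only codes gain nothing from GPU nodes, and the per-code ratios are combined with a geometric mean. The function names `relative_ssi` and `best_split` are hypothetical. The sketch is not the model from the cited paper and does not reproduce the table's numbers exactly, although it agrees that a 10x per-node speedup does not justify GPUs at this cost ratio.

```python
def relative_ssi(gpu_budget_fraction, gpu_speedup, cost_ratio=8.0,
                 gpu_codes=5, cpu_only_codes=3):
    """SSI of a mixed system relative to spending the whole budget on CPU nodes."""
    if not 0.0 <= gpu_budget_fraction <= 1.0:
        raise ValueError("gpu_budget_fraction must be in [0, 1]")
    # Node counts in units of the CPU-only system's node count.
    cpu_nodes = 1.0 - gpu_budget_fraction
    gpu_nodes = gpu_budget_fraction / cost_ratio
    # Assumption: GPU-capable codes can use both partitions, while
    # CPU-only codes get nothing from the GPU nodes.
    gpu_code_ratio = cpu_nodes + gpu_nodes * gpu_speedup
    cpu_code_ratio = cpu_nodes
    total_codes = gpu_codes + cpu_only_codes
    product = gpu_code_ratio ** gpu_codes * cpu_code_ratio ** cpu_only_codes
    return product ** (1.0 / total_codes)  # geometric mean (assumed)


def best_split(gpu_speedup, step_percent=5):
    """Sweep the GPU budget share and return (best_fraction, best_ssi)."""
    fractions = [p / 100.0 for p in range(0, 101, step_percent)]
    return max(((f, relative_ssi(f, gpu_speedup)) for f in fractions),
               key=lambda pair: pair[1])


for speedup in (10, 20, 30):
    fraction, value = best_split(speedup)
    print(f"{speedup}x per node: best GPU budget {fraction:.0%}, SSI {value:.2f}x")
```

Changing `cost_ratio` to 4.0 gives a quick feel for the price sensitivity discussed on the previous slide, again only within this toy model.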






Circles: 50% CPU nodes + 50% GPU nodes. Stars: Optimal system configuration.

### How to Enable NERSC's diverse community of 7,000 users, 750 projects, and 700 codes to run on advanced architectures like Perlmutter and beyond?

- NERSC Exascale Science Application Program (NESAP)
- Engage ~25 Applications
- up to 17 postdoctoral fellows
- Deep partnerships with every SC Office area
- Leverage vendor expertise and community hack-a-thons
- Knowledge transfer through documentation and training for all users
- Optimize codes with improvements relevant to multiple architectures







### **GPU Transition Path for Apps**

### NESAP for Perlmutter will extend activities from NESAP

1. Identifying and exploiting on-node parallelism
2. Understanding and improving data-locality within the memory hierarchy

### What's New for NERSC Users?

1. Heterogeneous compute elements
2. Identification and exploitation of even more parallelism
3. Emphasis on performance-portable programming approach:
   - Continuity from Cori through future NERSC systems and other DOE platforms

**NESAP For Cori Speedups** 







## OpenMP is the most popular non-MPI parallel programming technique





## OpenMP meets the needs of the NERSC workload



- Supports C, C++ and Fortran
  - The NERSC workload consists of ~700 applications with a relatively equal mix of C, C++ and Fortran
- Provides portability to different architectures at other DOE labs
- Works well with MPI: hybrid MPI+OpenMP approach successfully used in many NERSC apps
- Recent release of the OpenMP 5.0 specification, the third version providing features for accelerators
  - Many refinements over this five year period





## **NRE** partnership with PGI/NVIDIA






### NERSC, NVIDIA to Partner on Compiler Development for Perlmutter System



#### MARCH 21, 2019

The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (Berkeley Lab) has signed a contract with NVIDIA to enhance GPU compiler capabilities for Berkeley Lab's next-generation Perlmutter supercomputer.

In October 2018, the U.S. Department of Energy (DOE) announced that NERSC had signed a contract with Cray for a pre-exascale supercomputer named "Perlmutter," in honor of Berkeley Lab's Nobel Prize-winning astrophysicist Saul Perlmutter. The Cray Shasta machine, slated to be delivered in 2020, will be a heterogeneous system





## **OpenMP NRE – Status & Future Plans**



### **Contract items completed**

- Agreed on the subset of OpenMP target offload features to be included in the PGI compiler
- Created an OpenMP test suite containing micro-benchmarks, mini-apps, and the ECP SOLLVE V&V suite to evaluate correctness and performance
- Selected 5 NESAP application teams to partner with NVIDIA/PGI to add OpenMP target offload directives to the applications

### **Next contract items**

- Evaluate the Alpha compiler on Cori-GPU
- Evaluate upcoming compiler releases: Apr 2020 and Oct 2020
  - More NESAP and NERSC users will get access with each compiler release









## **Engaging around Performance Portability**





NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration

NERSC Hosted Past C++ Summit and ISO C++ meeting on HPC.

NERSC is a member of OpenACC (Directives for Accelerators)



NERSC is leading development of performanceportability.org

Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting

## NERSC-9 will be named after Saul Perlmutter

- Winner of 2011 Nobel Prize in Physics for discovery of the accelerating expansion of the universe.
- Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large scale simulations with experimental data analysis
- Login "saul.nersc.gov"





### **Perlmutter: A System Optimized for Science**

- Cray Shasta System providing 3-4x capability of Cori system
- First NERSC system designed to meet needs of both large scale simulation and data analysis from experimental facilities
  - Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
  - Cray Slingshot high-performance network will support Terabit rate connections to system
  - Optimized data software stack enabling analytics and ML at scale
  - All-Flash filesystem for I/O acceleration
- Robust readiness program for simulation, data and learning applications and complex workflows
- Delivery in early FY 2021





### **NERSC Systems Roadmap**









### Will GPUs work for everybody?

- Will 100% of the NERSC workload be able to utilize GPUs by 2024?
  - Yes, they just need to modify their code
  - No, their algorithm needs changing
  - No, their physics is fundamentally not amenable to data parallelism
  - No, they just don't have time or need to







## Next-Next generation Process Nodes have been announced





### **Specialization: End Game for Moore's Law**





NVIDIA builds deep learning appliance with Tesla V100s



RISC-V is an open hardware platform




Intel buys deep learning startup, Nervana



## FPGAs offer configurable specialization



Google designs its own Tensor Processing Unit (TPU)





## Potential 2024 Node





- Vendors converging to a mixture of energy-efficient Thin Cores/Accelerators and Fat Cores
- Potentially with DRAM/NVRAM
- (Hopefully) leads to less focus on data motion and more on identifying parallelism



J. A. Ang et al., "Abstract Machine Models and Proxy Architectures for Exascale Computing," 2014 Hardware-Software Co-Design for High Performance Computing, New Orleans, LA, 2014, pp. 25-32. doi: 10.1109/Co-HPC.2014.4 http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7017960&isnumber=7017953



### Potential NERSC-10 system #1





Storage





### Potential NERSC-10 system #2





### Exploring Workflow Accelerators for SC Applications with NERSC-9 and Slingshot network

- What accelerators map to existing SC workloads? And what SC challenges could be solved with emerging accelerators?
- Key areas of investigation
  - Identify common algorithms, kernels, motifs that run well on emerging accelerators.
  - Determine feasibility of configurable processing technologies, e.g. FPGAs?
  - Analyze changing workload requirements, e.g. ML.

### **Neural Network Processors**



### Emerging Technologies



### Quantum Computing



Next Generation GPUs



### Programmable Arrays



### Spectrum of approaches to optimizing an application for accelerators - today

Approaches placed along an axis running from Easier to Harder:

- Add directives to identify parallelism
- Refactor data structures for each accelerator
- **Control data motion**
- **Change algorithm**

Worst case is that a different algorithm is the optimal one for different architectures!

### Spectrum of approaches to optimizing an application for accelerators

- Add directives to identify parallelism
- Refactor data structures for each accelerator

Worst case is that a different algorithm is the optimal one for different architectures!



### Spectrum of approaches to optimizing an application for accelerators

- Add directives to identify parallelism
- **Refactor data structures** *once for each accelerator*

Worst case is that a different algorithm is the optimal one for different architectures!







- Hardware trends should reduce some of the burden on programmers today
- Software developments that separate or abstract away the details of the hardware should similarly help
  - Programmer (or library expert) specializes the code for the hardware, e.g. Kokkos, Raja, OpenMP-5 declare variant and metadirective directives
  - Programmer specifies that parallel transformations are safe and allows compiler to specialize for the hardware, e.g. OpenMP-5 loop directive, Fortran do concurrent
- Unrealistic to expect performance and portability while hardware has not converged







- Will there be a Workshop on Accelerator Programming Using Directives (WACCPD) in 2024?
  - **1.** Yes
  - 2. No
- Will we still need directives in 5 yrs? 10 yrs?
  - **1.** Yes
  - 2. No







- Will there be a Workshop on Accelerator Programming Using Directives (WACCPD) in 2024?
  - **1.** Yes
  - **2.** No. I hope the conversation will have moved on by then!
- Will we need directives in 5 yrs? 10 yrs?
  - **1.** Yes, for some other reason...
  - 2. No







## View from AMD - can we exploit this to benefit NERSC users?




## OPTIMIZING SYSTEM PERFORMANCE WITH HETEROGENEOUS COMPUTING



