

# **Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices**

Jonas Hahnfeld<sup>1</sup>, <u>Christian Terboven</u><sup>1</sup>, James Price<sup>2</sup>, Hans Joachim Pflug<sup>1</sup>, Matthias S. Müller<sup>1</sup>

1: RWTH Aachen University, Germany, email: {hahnfeld,terboven,pflug,mueller}@itc.rwth-aachen.de

2: University of Bristol, UK, email: j.price@bristol.ac.uk

WACCPD 2017: Fourth Workshop on Accelerator Programming Using Directives, Nov. 13th, 2017



#### Asynchronous Offloading: are high-level models inferior to low-level APIs?

- CUDA, OpenCL: APIs
  - low-level
  - full control
- OpenMP, OpenACC: based on pragmas
  - ease of use
  - some abstractions

### Agenda of this talk

1. Asynchronous Offloading Capabilities of Accelerator Models
2. Kernel used for evaluation: Conjugate Gradient Method
3. Findings on NVIDIA GPU
4. Findings on Intel Xeon Phi Coprocessor
5. Summary






### **Asynchronous Offloading Capabilities of Accelerator Models**



|                              | CUDA | OpenCL | OpenACC | OpenMP |
|------------------------------|------|--------|---------|--------|
| Asynchronicity               | Streams: actions in different streams can execute concurrently | Command queues: operations in different queues can execute concurrently | Clause: async with optional argument to select a queue; synchronization via acc wait construct | Clause: nowait because the target construct is a task; synchronization via taskwait construct |
| Unstructured data movement   | Yes: implicitly | Yes: implicitly | acc enter/exit data construct | target enter/exit data construct |
| Asynchronous memory transfer | API: cudaMemcpyAsync | Argument: blocking_write | Clause: async | Clause: nowait |
| Page-locked memory           | API: cudaMallocHost | No, but shared virtual memory | - | - |






#### **Performance Projection without Overlapping**

- Total runtime consists of computation and communication:

$$t_{exec} = t_{comp} + t_{comm}$$

- Communication time: prediction for data volume d with bandwidth B:

$$t_{comm} = \frac{d}{B} + t_{overhead}$$

- t<sub>overhead</sub> accounts for preparatory tasks
  - may be significant for the overall communication time t<sub>comm</sub>
  - may depend on the data volume d
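As a minimal sketch of this model, the following Python snippet evaluates t<sub>comm</sub> and t<sub>exec</sub>. The function names and the example numbers (1 GB of data, 10 GB/s bandwidth, 5 ms overhead, 0.2 s of computation) are illustrative assumptions, not measurements from the talk.

```python
def comm_time(data_bytes, bandwidth_bytes_per_s, overhead_s=0.0):
    """t_comm = d / B + t_overhead (all times in seconds)."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_bytes / bandwidth_bytes_per_s + overhead_s


def exec_time(comp_s, comm_s):
    """t_exec = t_comp + t_comm: computation and communication do not overlap."""
    return comp_s + comm_s


# Hypothetical example values, chosen only for illustration.
t_comm = comm_time(1e9, 10e9, overhead_s=0.005)
t_exec = exec_time(0.2, t_comm)
print(f"t_comm = {t_comm:.3f} s, t_exec = {t_exec:.3f} s")
```

A constant overhead is the simplest choice here; since the overhead may depend on the data volume, a real projection would have to measure it per transfer size.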

Hahnfeld, Cramer, Klemm, Terboven, Müller: A Pattern for Overlapping Communication and Computation with OpenMP Target Directives. IWOMP 2017.



#### **Offloading to multiple devices (2/2)**

### **Performance Projection with Pipelining Pattern**

- Optimal runtime when overlapping computation and communication:

$$t_{pipelined} = \max(t_{comp}, t_{comm}) \tag{3}$$

- Maximum optimization over the runtime without overlapping:

$$o_{max} = \frac{\min(t_{comp}, t_{comm})}{t_{exec}}$$

- Approaches a performance increase of o<sub>max</sub> = 0.5
  - if communication and computation times are perfectly balanced

Hahnfeld, Cramer, Klemm, Terboven, Müller: A Pattern for Overlapping Communication and Computation with OpenMP Target Directives. IWOMP 2017.
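The two projections above can be checked with a few lines of Python. This is a sketch under the model's own assumptions (perfect overlap, no pipelining overhead); the function names and the sample timings are hypothetical.

```python
def pipelined_time(comp_s, comm_s):
    """Ideal runtime with full overlap: the longer of the two phases."""
    return max(comp_s, comm_s)


def max_optimization(comp_s, comm_s):
    """Fraction of the non-overlapped runtime that overlapping can hide."""
    return min(comp_s, comm_s) / (comp_s + comm_s)


# Hypothetical (t_comp, t_comm) pairs in seconds.
for comp_s, comm_s in [(1.0, 1.0), (1.0, 0.5), (1.0, 0.1)]:
    print(
        f"t_comp={comp_s}, t_comm={comm_s}: "
        f"t_pipelined={pipelined_time(comp_s, comm_s):.2f} s, "
        f"o_max={max_optimization(comp_s, comm_s):.3f}"
    )
```

Because the hidden time equals the shorter phase, the gain shrinks quickly once one phase dominates: the model only reaches 0.5 when both phases take equally long.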

#### **Pipelining Concept for Overlapping Communication**

#### Implementation with OpenMP 4.5

- Using standalone directives from OpenMP 4.5
  - omp target enter data
  - omp target exit data
- Asynchronous tasks (nowait) with specified dependencies (depend)
- Black lines: data dependency
- Red and blue lines: mutual exclusion
  - of enter and compute tasks
  - avoid oversubscription
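The effect of this task graph on the runtime can be illustrated with a small timing model in Python. This is not the OpenMP implementation from the talk: it only assumes that the data is split into equally sized chunks, that transfers run one at a time, that computations run one at a time, and that each computation waits for its own chunk. The function name and the sample timings are hypothetical.

```python
def simulate_pipeline(comp_s, comm_s, chunks):
    """Finish time of a two-stage pipeline (transfer, then compute) per chunk."""
    if chunks < 1:
        raise ValueError("need at least one chunk")
    comm_chunk = comm_s / chunks
    comp_chunk = comp_s / chunks
    transfer_done = 0.0
    compute_done = 0.0
    for _ in range(chunks):
        # Mutual exclusion: the next transfer starts after the previous one.
        transfer_done += comm_chunk
        # Data dependency: compute needs its chunk and the previous compute.
        compute_done = max(compute_done, transfer_done) + comp_chunk
    return compute_done


# Hypothetical timings: 1.0 s of computation, 0.5 s of communication.
for chunks in (1, 2, 4, 16):
    print(f"{chunks:2d} chunks: {simulate_pipeline(1.0, 0.5, chunks):.4f} s")
```

With one chunk the model reproduces the non-overlapped runtime; with more chunks it approaches the longer of the two phases, because only the first transfer (or the last computation) remains exposed.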







### Kernel used for evaluation: Conjugate Gradient Method



#### **Evaluation kernel: CG Method (1/3)**

#### **Conjugate Gradients Method on Multiple Devices**

- Iterative solver for linear equation systems:
  - A \* x = k
  - widely used for PDE-based problems
- Sparse matrix with regular sparsity pattern: Serena from the SuiteSparse Matrix Collection
  - Matrix is symmetric positive definite (spd)
  - about 1.4 million rows and columns
  - about 780 MB memory consumption in CRS format on host and Xeon Phi
  - about 6.14 GB memory consumption in ELLPACK-R format on GPU
- Division of the matrix and each vector into partitions
- Matrix-vector multiplication requires data exchange
  - Start computation with local data
  - Apply pipelining concept for data transfer
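A minimal, pure-Python sketch of the CG iteration with a row-partitioned matrix-vector product follows. It is not the authors' implementation: the dense 2x2 test system, the function names, and the sequential loop over partitions (standing in for devices) are illustrative assumptions.

```python
def matvec_partitioned(matrix, x, parts):
    """y = A * x, computed one block of rows (one 'device') at a time."""
    n = len(matrix)
    y = [0.0] * n
    bounds = [n * p // parts for p in range(parts + 1)]
    for p in range(parts):
        for i in range(bounds[p], bounds[p + 1]):
            # Each row needs all of x: this is where devices exchange data.
            y[i] = sum(a * xj for a, xj in zip(matrix[i], x))
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matrix, rhs, parts=2, tol=1e-10, max_iter=1000):
    """Solve A * x = k for a symmetric positive definite matrix A."""
    x = [0.0] * len(rhs)
    r = list(rhs)  # residual k - A * x for the start vector x = 0
    p = list(r)
    rr = dot(r, r)
    for iteration in range(max_iter):
        if rr ** 0.5 < tol:
            return x, iteration
        q = matvec_partitioned(matrix, p, parts)
        alpha = rr / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# Small spd system used only for illustration.
A = [[4.0, 1.0], [1.0, 3.0]]
k = [1.0, 2.0]
x, iterations = conjugate_gradient(A, k)
print(x, iterations)
```

In this sketch every partition reads the complete vector directly. On real devices each partition holds only its own slice, so the remote slices have to be transferred before the corresponding part of the product can be computed, which is what the pipelining concept overlaps with the local computation.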






**Concept for executing kernels on multiple devices** 








#### Dependencies for overlapping the communication with two devices





### **Findings on NVIDIA GPU**



#### **Technical Specification**

#### NEC GPU server system

- 2 NVIDIA Tesla P100, each:
  - 5.3 TFLOP/s dp performance
  - 549 GB/s Triad bandwidth to HBM2 measured with BabelStream
  - 13.2 GB/s achievable transfer rate to host via PCIe
- NVLink between the GPUs
  - 37 GB/s achievable transfer rate between GPUs
- 2 Intel Westmere-EP 12-core processors at 2.2 GHz
  - 120 GB/s Triad bandwidth to DDR4 measured with Stream

| Programming model | Software stack                           |
|-------------------|------------------------------------------|
| CUDA              | GCC 4.8.5 + CUDA 8.0.44                  |
| OpenCL            | GCC 4.8.5 + CUDA 8.0.44                  |
| pocl              | LLVM 4.0.1 + development version of pocl |
| OpenACC           | PGI Accelerator Compiler 17.4            |





#### **Basis for performance model**

DMA: Direct Memory Access

- Only possible with memory that is "page-locked" or "pinned"
  - Required for data transfers to and from the device
  - By default, CUDA (et al.) transparently copies the data to a page-locked buffer
  - In addition, special allocation methods are available





#### **Basis for evaluation**

- Initial evaluation on the host and on a single NVIDIA Tesla P100
  - Host: Intel C++ 17.0.4 compiler w/ OpenMP

|         | matvec time (GFLOP/s) | Vector dot products | Iterations | Total runtime |
|---------|-----------------------|---------------------|------------|---------------|
| host    | 7.11s (17.91)         | 0.24s               | 985        | 9.17s         |
| CUDA    | 1.89s (67.44)         | 0.20s               | 987        | 5.15s         |
| OpenCL  | 2.23s (57.23)         | 0.31s               | 986        | 5.72s         |
| pocl    | 2.27s (56.24)         | 0.32s               | 986        | 5.78s         |
| OpenACC | 2.23s (57.22)         | 0.29s               | 989        | 5.69s         |

- Offloading to the GPU pays off (timings include data transfer)
- Number of iterations varies because of the reduction operation in the vector dot product (the floating-point summation order differs between implementations)
- CUDA is fastest
  - Compiler generates better code with added pragma unroll 1
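The table can be cross-checked with a short Python snippet: it computes each model's speedup over the host and the matvec work implied by time multiplied by GFLOP/s. The numbers are copied from the table; the dictionary layout and function names are assumptions made for this sketch.

```python
results = {
    # model: (matvec time in s, matvec GFLOP/s, total runtime in s)
    "host": (7.11, 17.91, 9.17),
    "CUDA": (1.89, 67.44, 5.15),
    "OpenCL": (2.23, 57.23, 5.72),
    "pocl": (2.27, 56.24, 5.78),
    "OpenACC": (2.23, 57.22, 5.69),
}


def speedup_over_host(model):
    """Ratio of the host's total runtime to the model's total runtime."""
    return results["host"][2] / results[model][2]


def matvec_work_gflop(model):
    """Total matvec work implied by the reported time and rate."""
    seconds, gflop_per_s, _ = results[model]
    return seconds * gflop_per_s


for model in results:
    print(
        f"{model:8s} speedup {speedup_over_host(model):.2f}x, "
        f"matvec work {matvec_work_gflop(model):.1f} GFLOP"
    )
```

The implied work is nearly identical for all rows, which is consistent with every implementation performing almost the same number of matvec operations despite the slightly different iteration counts.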



#### **Options to implement data exchange between devices**

- Via host: utilization of PCIe in two successive transfers
  - Plus a temporary buffer on the host
- Between devices: direct communication between devices
  - cudaMemcpyAsync with kind cudaMemcpyDeviceToDevice
  - Will use PCIe by default
  - Runtime may employ double buffering, or other optimizations
- Peer to peer: utilization of the NVLink
  - cudaDeviceEnablePeerAccess must be called on both devices because it grants unidirectional access only
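A rough feel for the difference between the options comes from the transfer rates in the technical specification (13.2 GB/s via PCIe, 37 GB/s via NVLink). The Python sketch below is an estimate under stated assumptions: overheads are ignored, the two hops via the host are not overlapped, and the 1 GB payload and function names are hypothetical.

```python
PCIE_GB_PER_S = 13.2    # host <-> GPU rate from the technical specification
NVLINK_GB_PER_S = 37.0  # GPU <-> GPU rate from the technical specification


def transfer_time(gigabytes, hop_bandwidths_gb_per_s):
    """Seconds for successive, non-overlapped hops; overheads are ignored."""
    return sum(gigabytes / bandwidth for bandwidth in hop_bandwidths_gb_per_s)


def via_host(gigabytes):
    """Device -> host -> device: two successive PCIe transfers."""
    return transfer_time(gigabytes, [PCIE_GB_PER_S, PCIE_GB_PER_S])


def peer_to_peer(gigabytes):
    """Direct device -> device transfer over NVLink."""
    return transfer_time(gigabytes, [NVLINK_GB_PER_S])


payload_gb = 1.0  # hypothetical amount of exchanged vector data
print(f"via host:     {via_host(payload_gb):.4f} s")
print(f"peer to peer: {peer_to_peer(payload_gb):.4f} s")
print(f"ratio:        {via_host(payload_gb) / peer_to_peer(payload_gb):.2f}x")
```

The direct device-to-device copy over PCIe is left out because the source gives no rate for it and the runtime may apply double buffering or other optimizations.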



#### **CUDA Results (matvec only)**



#### matvec

- Performance model prediction: up to 46.91 % optimization
  - Requires two CUDA streams per device: computation & communication
- Recall (3): the optimization effect depends on the maximum of t<sub>comp</sub> and t<sub>comm</sub>
  - Utilizing NVLink reduces the communication time
- Smaller improvement for the whole CG as partitioning takes extra time: 5.15s to 4.26s in the best case
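The reported numbers can be put into relation with a short Python calculation. It assumes that the 46.91 % prediction is the model's maximum optimization (shorter phase divided by the sum of both phases); the function names are hypothetical.

```python
def relative_improvement(before_s, after_s):
    """Fraction of the original runtime that was saved."""
    return (before_s - after_s) / before_s


def balance_from_o_max(o_max):
    """Ratio shorter/longer phase implied by o_max = min / (min + max)."""
    if not 0.0 <= o_max <= 0.5:
        raise ValueError("o_max must lie between 0 and 0.5")
    return o_max / (1.0 - o_max)


whole_cg = relative_improvement(5.15, 4.26)
balance = balance_from_o_max(0.4691)
print(f"whole CG improvement: {100 * whole_cg:.1f} %")
print(f"implied ratio of shorter to longer phase: {balance:.3f}")
```

Under that reading, communication and computation are predicted to be close to balanced for the matvec, while the whole CG gains considerably less because the partitioning overhead and the remaining operations are not overlapped.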



#### **Quality of the implementation**

- NVIDIA's own OpenCL ...
  - ... does not support page-locked memory
  - ... has a performance bug in the device to device copy
- Therefore we switched to pocl (OpenCL 2.0 implementation)
  - ... and made some improvements which will become available with the next release



#### **OpenCL Results (matvec only)**




#### matvec

- Performance model prediction: up to 44.34 % optimization
  - Requires two OpenCL command queues per device: computation & communication





#### **OpenACC Results (matvec only) (1/2)**



- Performance model prediction: up to 44.05 % optimization
- OpenACC currently does not allow transfers between two devices without involving the host
- Data transfer time increases
  - PGI's implementation cannot transfer the matrix from pageable memory asynchronously
  - Issues with the runtime prevented us from using threads to parallelize the data transfer





Not successful considering the total runtime





### **Findings on the Intel Xeon Phi Coprocessor**



#### **Technical Specification**

- Bull server system
  - 2 Intel Xeon Phi 5110P coprocessors, each:
    - approx. 1 TFLOP/s dp performance
    - 117 GB/s Triad bandwidth to GDDR5 measured with Stream
    - 6.5 GB/s achievable transfer rate to host via PCIe (gen2)
  - 2 Intel SandyBridge-EP 8-core processors at 2.0 GHz
    - 65 GB/s Triad bandwidth to DDR3 measured with Stream

- Software
  - Intel 17.0.2 compilers
    - 17.0.4 contains a performance bug
  - Intel MPSS 3.8
  - Intel OpenCL SDK 14.2




#### Not successful considering the total runtime



- Very high overhead for launching the kernels
- Intel 16.0.x compilers show better performance ...
- ... but do not provide asynchronous offloading



## Summary



#### **Evaluation of asynchronous offloading**

- Asynchronous offloading to multiple devices can deliver the expected performance improvements
- CUDA is perfectly up to the task
- OpenCL 2.0 provides all the necessary ingredients
- OpenACC can be successful
  - Currently, device to device support is missing
  - Issues with the quality of the implementation
- Intel Xeon Phi
  - Implementation issues lead to bad results for OpenMP
  - GCC 7, IBM xlc and LLVM/Clang compilers will soon fully support OpenMP on GPUs
- Programming is challenging: asynchronous offload pattern may ease implementation work
- Code is available at: https://rwth-aachen.sciebo.de/index.php/s/EdjjkEdCIHLizyE



## Thank you for your attention

