Acceleration of Element-by-Element Kernel in Unstructured Implicit Low-order Finite-element Earthquake Simulation using OpenACC on Pascal GPUs
Takuma Yamaguchi and Kohei Fujita are presenting this paper from Kohei Fujita, Takuma Yamaguchi, Tsuyoshi Ichimura (University of Tokyo, Japan), Muneo Hori (University of Tokyo and RIKEN, Japan) and Lalith Maddegedara (University of Tokyo, Japan).
Takuma Yamaguchi is a Master’s student in the Department of Civil Engineering at the University of Tokyo and he has a B.E., from the University of Tokyo. His research is high-performance computing targeting at earthquake simulation. More specifically, his work performs fast crustal deformation computation for multiple computation enhanced by GPUs.
Kohei Fujita is a postdoctoral researcher at Advanced Institute for Computational Science, RIKEN. He received his Dr. Eng. from the Department of Civil Engineering, University of Tokyo in 2014. His research interest is development of high-performance computing methods for earthquake engineering problems. He is a coauthor of SC14 and SC15 Gordon Bell Prize Finalist Papers on large-scale implicit unstructured finite-element earthquake simulations.
The element-by-element computation used in matrix-vector multiplications is the key kernel for attaining high-performance in unstructured implicit low-order finite-element earthquake simulations. We accelerate this CPU-based element-by-element kernel by developing suitable algorithms for GPUs and porting to a GPU-CPU heterogeneous compute environment by OpenACC. Other parts of the earthquake simulation code are ported by directly inserting OpenACC directives into the CPU code. This porting approach enables high performance with relatively low development costs. When comparing eight K computer nodes and eight NVIDIA Pascal P100 GPUs, we achieve 23.1 times speedup for the element-by-element kernel, which leads to 16.7 times speedup for the 3 x 3 block Jacobi preconditioned conjugate gradient finite-element solver. We show the effectiveness of the proposed method through many-case crust-deformation simulations on a GPU cluster.
The authors were awarded with an NVIDIA Pascal GPU.
Acceleration of the FINE/Turbo CFD solver in a heterogeneous environment with OpenACC directives
Authors: David Gutzwiller (NUMECA USA, San Francisco, California), Ravi Srinivasan (Seattle Technology Center, Bellevue, Washington), Alain Demeulenaere NUMECA USA, San Francisco, California
Adapting legacy applications for use in a modern heterogeneous environment is a serious challenge for an industrial software vendor (ISV). The adaptation of the NUMECA FINE/Turbo computational fluid dynamics (CFD) solver for accelerated CPU/GPU execution is presented. An incremental instrumentation with OpenACC directives has been used to obtain a global solver acceleration greater than 2X on the OLCF Titan supercomputer. The implementation principals and procedures presented in this paper constitute one successful path towards obtaining meaningful heterogeneous performance with a legacy application. The presented approach minimizes risk and developer-hour cost, making it particularly attractive for ISVs.
The authors were awarded with a Quadro M6000, 12GB.
Achieving portability and performance through OpenACC
Authors: J. A. Herdman, W. P. Gaudin, O. Perks (High Performance Computing, AWE plc, Aldermaston, UK), D. A. Beckingsale, A. C. Mallinson, S. A. Jarvis (University of Warwick, UK)
OpenACC is a directive-based programming model designed to allow easy access to emerging advanced architecture systems for existing production codes based on Fortran, C and C++. It also provides an approach to coding contemporary technologies without the need to learn complex vendor-specific languages, or understand the hardware at the deepest level. Portability and performance are the key features of this programming model, which are essential to productivity in real scientific applications.
OpenACC support is provided by a number of vendors and is defined by an open standard. However the standard is relatively new, and the implementations are relatively immature. This paper experimentally evaluates the currently available compilers by assessing two approaches to the OpenACC programming model: the “parallel” and “kernels” constructs. The implementation of both of these construct is compared, for each vendor, showing performance differences of up to 84%. Additionally, we observe performance differences of up to 13% between the best vendor implementations. OpenACC features which appear to cause performance issues in certain compilers are identified and linked to differing default vector length clauses between vendors. These studies are carried out over a range of hardware including GPU, APU, Xeon and Xeon Phi based architectures. Finally, OpenACC performance, and productivity, are compared against the alternative native programming approaches on each targeted platform, including CUDA, OpenCL, OpenMP 4.0 and Intel Offload, in addition to MPI and OpenMP.
The authors were awarded with a QUADRO P5000.