Agenda
The workshop takes place on Monday, Nov 14, 2016, from 2:00 to 5:40 p.m. in room 251-C. Details about all talks and the speakers can be found below the agenda.
Dear attendees: Please take a moment to provide feedback about your experience of the WACCPD workshop!
Keynote: The Broader Picture of Using Accelerator Directives in Your Code

Matt Norman
The keynote of the workshop will be given by Matt Norman, Computational Climate Scientist at Oak Ridge National Laboratory, USA.
Matt Norman is a computational climate scientist at Oak Ridge National Laboratory’s Center for Computational Science. Norman acts as a liaison to enable more productive computational science outcomes for teams awarded under the INCITE program to run on Oak Ridge Leadership Computing Facility’s (OLCF’s) Titan supercomputer. Norman is leading the GPU refactoring effort for DOE’s Accelerated Model for Climate and Energy (ACME) under funding from ACME and the OLCF Center for Accelerated Application Readiness (CAAR). Norman also develops novel spatial and temporal PDE integration algorithms better suited for emerging supercomputers on which all levels of data movement increasingly come at a premium.
Norman obtained B.S. degrees in Meteorology and Computer Science at North Carolina State University (NCSU), and also obtained his Ph.D. in Atmospheric Sciences at NCSU under the DOE Computational Science Graduate Fellowship (CSGF).
Abstract
Often, when talking about transitioning a codebase to use accelerators, the word “port” is used, and this can lead to false ideas about what the process really looks like on the ground level. In the real world, codes typically have to be heavily refactored in order to use accelerators effectively, and the process is rarely straightforward. Typically, it is an evolving give and take between the developers, the standards committees, and the compiler developers implementing those standards. Every code has a different look and feel, using different language features in a different manner, and this often means there are improvements to be made in directive standards and implementation for each code. The more vocal and interactive code developers are, the more effectively the whole community moves forward. Also, there is always a balance to be kept between performance and maintainable software development practices, including ideas of performance portability. Covering these topics, the goal of this talk is to help developers gain a broader understanding of how to achieve an efficient and maintainable accelerated code using directives.
Identifying and Scheduling Loop Chains Using Directives

Ian Bertolacci
Ian Bertolacci from the University of Arizona is presenting this paper from Ian Bertolacci, Michelle Strout (University of Arizona), Stephen Guzik (Colorado State University), Jordan Riley (Colorado State University) and Catherine Olschanowsky (Boise State University).
Ian J. Bertolacci is a first-year Computer Science Ph.D. student at the University of Arizona. This May, he graduated from Colorado State University with bachelor’s degrees in Computer Science, Psychology, and Applied Computing Technology. His goal is to give all scientific applications access to modern and future supercomputing power through the use of compiler and programming language technologies.
Abstract
Exposing opportunities for parallelization while explicitly managing data locality is the primary challenge to porting and optimizing existing computational science simulation codes to improve performance and accuracy. OpenMP provides many mechanisms for expressing parallelism, but it primarily remains the programmer’s responsibility to group computations to improve data locality. The loop chain abstraction, where data access patterns are included with the specification of parallel loops, provides compilers with sufficient information to automate the parallelism versus data locality tradeoff. In this paper, we present a loop chain pragma and an extension to the omp for construct to enable the specification of loop chains and high-level specifications of schedules on loop chains. We show example usage of the extensions, describe their implementation, and show preliminary performance results for some simple examples.
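As a rough illustration of the abstraction, consider two stencil loops where the second reads what the first writes. The directive spellings below are hypothetical sketches (the paper defines the actual pragma syntax), so they appear only as comments over plain C:

```c
/* Hypothetical sketch only: annotating the data access pattern of each
 * loop in a chain would let the compiler fuse or tile the loops for
 * locality instead of running them as independent parallel regions. */
#include <stddef.h>

void chain_example(const double *a, double *b, double *c, size_t n) {
    /* #pragma loopchain schedule(fuse)                      -- hypothetical */

    /* #pragma omp for domain(1:n-2) with (a: read, b: write) -- hypothetical */
    for (size_t i = 1; i < n - 1; ++i)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    /* #pragma omp for domain(1:n-2) with (b: read, c: write) -- hypothetical */
    for (size_t i = 1; i < n - 1; ++i)
        c[i] = 0.5 * (b[i - 1] + b[i + 1]);
}
```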
A Modern Memory Management System for OpenMP

John Pennycook
John Pennycook is presenting this paper from Jason Sewall, John Pennycook, Alejandro Duran, Xinmin Tian and Ravi Narayanaswamy (all from the Intel Corporation, USA).
John is an HPC Application Engineer at Intel, focused on enabling developers to fully utilize the parallelism available in the current generation of Intel Xeon Phi processors.
Abstract
Modern computers with multi-/many-core processors and accelerators feature a sophisticated and deep memory hierarchy, potentially including main memory, high-bandwidth memory, texture memory, and scratchpad memory. The performance characteristics of these memories vary, and studies have demonstrated the importance of using them effectively.
In this paper, we propose an extension of the OpenMP API to address the needs of programmers to efficiently optimize their applications to use new memory technologies in a platform-agnostic and portable fashion. Our proposal separately exposes the characteristics of memory resources (such as kind) and the characteristics of allocations (such as alignment), and is fully compatible with existing OpenMP constructs.
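Work in this direction later fed into the allocator interface standardized in OpenMP 5.0. The sketch below uses that 5.0 API, not necessarily the paper's proposed syntax, to illustrate the separation: the memory space expresses the kind of resource (high-bandwidth memory), while traits express per-allocation properties (64-byte alignment):

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Per-allocation trait: request 64-byte alignment. */
    omp_alloctrait_t traits[] = { { omp_atk_alignment, 64 } };

    /* Memory kind: an allocator backed by high-bandwidth memory. */
    omp_allocator_handle_t hbw =
        omp_init_allocator(omp_high_bw_mem_space, 1, traits);

    double *x = omp_alloc(1024 * sizeof *x, hbw);
    if (x) {
        x[0] = 42.0;                 /* use the HBM-backed buffer */
        printf("x[0] = %f\n", x[0]);
        omp_free(x, hbw);
    }
    omp_destroy_allocator(hbw);
    return 0;
}
```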
An Extension of OpenACC Directives for Out-of-Core Stencil Computation with Temporal Blocking

Nobuhiro Miki
Nobuhiro Miki is presenting this paper from Nobuhiro Miki, Fumihiko Ino and Kenichi Hagihara (all from Osaka University, Japan).
Nobuhiro Miki received the B.E. degree in information and computer sciences from Osaka University, Osaka, Japan, in 2015. He is currently working toward the M.E. degree at Osaka University. His research interests include high performance computing systems and software.
Abstract
In this paper, aiming at realizing directive-based temporal blocking for out-of-core stencil computation, we present an extension of OpenACC directives and a source-to-source translator capable of accelerating out-of-core stencil computation on a graphics processing unit (GPU). Out-of-core stencil computation here deals with data too large to be stored entirely in GPU memory. Given an OpenACC-like code, the proposed translator generates an OpenACC code that decomposes the large data into smaller chunks, which are then processed in a pipelined manner to hide the data transfer overhead of exchanging chunks between GPU memory and CPU memory. Furthermore, the generated code is optimized with a temporal blocking technique to minimize the amount of CPU-GPU data transfer. In experiments, we apply the proposed translator to three stencil computation codes. The out-of-core performance on a Tesla K40 GPU reaches 73.4 GFLOPS, which is only 13% lower than the in-core performance. We therefore believe that our directive-based approach is useful for facilitating out-of-core stencil computation on a GPU.
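The translator's generated code is more elaborate (pipelined asynchronous transfers plus temporal blocking), but the core chunking idea can be sketched by hand in OpenACC. This minimal sketch assumes a 1-D three-point stencil, an illustrative chunk size that fits in GPU memory, and a one-element halo per chunk:

```c
#include <stddef.h>

#define CHUNK (1 << 20)  /* illustrative chunk size, in elements */

/* Process an array too large for GPU memory chunk by chunk; each chunk
 * is copied in with a one-element halo, computed on the GPU, and the
 * result copied back. (Pipelining of transfers is omitted here.) */
void stencil_out_of_core(const double *in, double *out, size_t n) {
    for (size_t lo = 1; lo < n - 1; lo += CHUNK) {
        size_t hi = lo + CHUNK < n - 1 ? lo + CHUNK : n - 1;
        #pragma acc parallel loop copyin(in[lo-1:hi-lo+2]) copyout(out[lo:hi-lo])
        for (size_t i = lo; i < hi; ++i)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}
```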
A Portable, High-Level Graph Analytics Framework Targeting Distributed, Heterogeneous Systems

Robert Searles
Robert Searles and Stephen Herbein are presenting this paper from Robert Searles, Stephen Herbein and Sunita Chandrasekaran (all from the University of Delaware).
Robert Searles is a PhD student at the University of Delaware working under Dr. John Cavazos. His research interests include program acceleration and optimization using GPUs and other parallel architectures, machine learning, and graph analysis. His prior research involved auto-tuning high-level languages targeting GPUs, optimizing GPU performance using high-level languages, and using machine learning techniques, in conjunction with code running on parallel architectures, to characterize binary applications based on their code structure. As an intern at AMD Research, Robert implemented a template library providing a high-level programming abstraction for an emerging processing-in-memory architecture. Robert earned his B.S. and M.S. in Computer and Information Sciences at the University of Delaware.

Stephen Herbein
Stephen Herbein is a PhD student of Dr. Michela Taufer at the University of Delaware. He received his Bachelor’s and Master’s degrees in Computer Science at the University of Delaware. His current research focuses on next-generation batch scheduling of HPC clusters, including IO-aware and hierarchical scheduling. His other research interests include auto-tuning of IO and big data analytics.
Abstract
As the HPC and Big Data communities continue to converge, heterogeneous and distributed systems are becoming commonplace. In order to take advantage of the immense computing power of these systems, distributing data efficiently and leveraging specialized hardware (e.g. accelerators) is critical. MapReduce is a popular paradigm that provides automatic data distribution to the programmer. CUDA and OpenCL are some of the most popular frameworks for leveraging accelerators (specifically GPUs) on heterogeneous systems.
In this paper, we develop a portable, high-level framework using a popular MapReduce framework, Apache Spark, in conjunction with CUDA and OpenCL in order to simultaneously take advantage of automatic data distribution and the specialized hardware present on each node of our HPC systems. Using our framework, we accelerated two real-world, compute- and data-intensive graph analytics applications: a function call graph similarity application and a triangle enumeration subroutine. We demonstrate linear scalability on the call graph similarity application, as well as an exploration of the triangle enumeration parameter space. We show that our method yields a portable solution that can be used to leverage almost any legacy, current, or next-generation HPC or cloud-based system.
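The paper pairs Spark's data distribution with CUDA and OpenCL kernels. Purely as an illustration of the per-partition work each worker offloads, here is a stand-alone triangle-counting routine in C for a dense 0/1 adjacency matrix, parallelized with an OpenACC directive for brevity rather than the paper's CUDA/OpenCL:

```c
/* Count triangles i < j < k in a dense n x n 0/1 adjacency matrix.
 * In the paper's framework, work like this runs on the GPU of each
 * node, with Spark distributing the graph partitions. */
long count_triangles(const unsigned char *adj, long n) {
    long count = 0;
    #pragma acc parallel loop collapse(2) reduction(+:count) copyin(adj[0:n*n])
    for (long i = 0; i < n; ++i)
        for (long j = 0; j < n; ++j)
            if (j > i && adj[i * n + j])
                for (long k = j + 1; k < n; ++k)
                    if (adj[i * n + k] && adj[j * n + k])
                        ++count;
    return count;
}
```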
OpenACC cache Directive: Opportunities and Optimizations
Ali Shafiee is presenting this paper on behalf of Ahmad Lashgar and Amirali Baniasadi (both from the University of Victoria, Canada).
Ali Shafiee is a fifth-year PhD student in the Computer Science department at the University of Utah. His research is in the field of computer architecture; he has published papers on machine learning accelerators, memory architecture, hardware security, and networks-on-chip. Before joining the University of Utah, he earned B.S. and M.S. degrees in computer engineering at Sharif University of Technology.
Abstract
OpenACC’s programming model presents a simple interface to programmers, offering a trade-off between performance and development effort. OpenACC relies on compiler technologies to generate efficient code and optimize for performance. Among the directives that are difficult to implement is the cache directive, which allows the programmer to utilize the accelerator’s hardware- or software-managed caches by passing hints to the compiler. In this paper, we investigate the implementation of the cache directive on NVIDIA-like GPUs and propose optimizations for the CUDA backend, using CUDA’s shared memory as the software-managed cache space. We first show that a straightforward implementation can be very inefficient and can even degrade performance. We investigate the differences between this implementation and hand-written CUDA alternatives to find the essential optimizations to be carried out by the compiler. Our detailed study results in the following necessary optimizations: i) improving occupancy by sharing the cache among several parallel threads, and ii) optimizing cache fetch and write routines via parallelization and minimized control flow. We present compiler passes to apply these optimizations. Investigating three test cases, we show that the best cache directive implementation can perform very close to hand-written CUDA equivalents and improves performance by up to 2.18X (compared to the baseline OpenACC).
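For reference, the directive under study looks like this in OpenACC C code: placed at the top of a loop body, it hints that a small sub-array is heavily reused and worth staging in on-chip memory (CUDA shared memory in the paper's backend). A minimal stencil sketch:

```c
void smooth(const float *restrict in, float *restrict out, int n) {
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[1:n-2])
    for (int i = 1; i < n - 1; ++i) {
        /* Hint: in[i-1..i+1] is reused by neighboring iterations. */
        #pragma acc cache(in[i-1:3])
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}
```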
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator Model on a POWER8+GPU Platform

Akihiro Hayashi
Akihiro Hayashi is presenting this paper from Akihiro Hayashi, Jun Shirako (Rice University), Ettore Tiotto, Robert Ho (both IBM Canada) and Vivek Sarkar (Rice University).
Dr. Hayashi is a research scientist at Rice University. His research interests include automatic parallelization, programming languages, and compiler optimizations for parallel computer systems.
Abstract
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP’s high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU architectures.
However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code with low-level programming models. To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
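A SAXPY kernel illustrates the abstraction level being evaluated: in the OpenMP 4.x accelerator model, the directives request offload, parallelization, and data mapping, and the compiler and runtime handle the low-level GPU orchestration:

```c
/* Offload a SAXPY to the device: map x in, y both ways, and let the
 * compiler distribute iterations across teams and threads. */
void saxpy(float a, const float *restrict x, float *restrict y, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```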
Towards Achieving Performance Portability using Directives for Accelerators

Graham Lopez
Graham Lopez is presenting this paper from M. Graham Lopez, Veronica Vergara Larrea, Wayne Joubert, Oscar Hernandez, Azzam Haidar, Stanimire Tomov and Jack Dongarra.
Graham Lopez is a researcher in the Computer Science and Mathematics Division at Oak Ridge National Laboratory, where he works on programming environment preparation with the application readiness teams for the DOE CORAL and Exascale Computing projects. Graham has published research in the areas of computational materials science, application acceleration and benchmarking on heterogeneous systems, low-level communication APIs, and programming models. He earned his M.S. in Computer Science and Ph.D. in Physics from Wake Forest University. Prior to joining ORNL, he was a research scientist at the Georgia Institute of Technology, where he worked on application and numerical algorithm optimizations for accelerators.
Abstract
In this paper we explore the performance portability of the directives provided by OpenMP 4 and OpenACC for programming various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required, and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms, including x86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi (Knights Landing), and an x86_64 host system with Intel Xeon Phi coprocessors. Finally, we explain the factors that affected performance portability, such as picking the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target multiple platforms.
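As a minimal example of the kind of side-by-side comparison involved, the same vector-scaling kernel can be written in OpenACC and in OpenMP 4.5 offload form; how each variant performs, and how much retuning it needs, depends on the platform and compiler:

```c
/* OpenACC version: offload and parallelize the loop. */
void scale_acc(double *restrict v, double s, int n) {
    #pragma acc parallel loop copy(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] *= s;
}

/* OpenMP 4.5 offload version of the same kernel. */
void scale_omp(double *restrict v, double s, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] *= s;
}
```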
Best Paper: Acceleration of Element-by-Element Kernel in Unstructured Implicit Low-order Finite-element Earthquake Simulation using OpenACC on Pascal GPUs

Takuma Yamaguchi
Takuma Yamaguchi and Kohei Fujita are presenting this paper from Kohei Fujita, Takuma Yamaguchi, Tsuyoshi Ichimura (University of Tokyo, Japan), Muneo Hori (University of Tokyo and RIKEN, Japan) and Lalith Maddegedara (University of Tokyo, Japan).
Takuma Yamaguchi is a Master’s student in the Department of Civil Engineering at the University of Tokyo, from which he also holds a B.E. His research is in high-performance computing targeted at earthquake simulation. More specifically, his work performs fast crustal deformation computation for many cases, enhanced by GPUs.

Kohei Fujita
Kohei Fujita is a postdoctoral researcher at the Advanced Institute for Computational Science, RIKEN. He received his Dr. Eng. from the Department of Civil Engineering, University of Tokyo, in 2014. His research interest is the development of high-performance computing methods for earthquake engineering problems. He is a coauthor of SC14 and SC15 Gordon Bell Prize finalist papers on large-scale implicit unstructured finite-element earthquake simulations.
Abstract
The element-by-element computation used in matrix-vector multiplications is the key kernel for attaining high performance in unstructured implicit low-order finite-element earthquake simulations. We accelerate this CPU-based element-by-element kernel by developing suitable algorithms for GPUs and porting to a GPU-CPU heterogeneous compute environment using OpenACC. Other parts of the earthquake simulation code are ported by directly inserting OpenACC directives into the CPU code. This porting approach enables high performance with relatively low development cost. When comparing eight K computer nodes and eight NVIDIA Pascal P100 GPUs, we achieve a 23.1-fold speedup for the element-by-element kernel, which leads to a 16.7-fold speedup for the 3 x 3 block Jacobi preconditioned conjugate gradient finite-element solver. We show the effectiveness of the proposed method through many-case crust-deformation simulations on a GPU cluster.
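As a simplified sketch only (the paper's actual kernel is a 3-D unstructured finite-element operator with further algorithmic restructuring), an element-by-element matrix-vector product computes per-element contributions independently and scatters them to shared nodes, resolving races here with atomics:

```c
/* Simplified element-by-element matvec for 4-node elements: Ke holds
 * per-element stiffness matrices, conn the element-to-node
 * connectivity, u the input vector, f the accumulated output. */
void ebe_matvec(int nelem, int nnode, const int (*conn)[4],
                const double (*Ke)[4][4], const double *u, double *f) {
    #pragma acc parallel loop copyin(conn[0:nelem], Ke[0:nelem], u[0:nnode]) \
                              copy(f[0:nnode])
    for (int e = 0; e < nelem; ++e) {
        for (int a = 0; a < 4; ++a) {
            double sum = 0.0;
            for (int b = 0; b < 4; ++b)
                sum += Ke[e][a][b] * u[conn[e][b]];
            /* Shared nodes receive contributions from many elements. */
            #pragma acc atomic update
            f[conn[e][a]] += sum;
        }
    }
}
```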