

## Challenges Porting a C++ Template-Metaprogramming Abstraction Layer to Directive-based Offloading Porting PIConGPU to OpenMP target and OpenACC

Jeffrey Kelling<sup>1</sup>, Sergei Bastrakov<sup>2</sup>, Alexander Debus<sup>2</sup>, Thomas Klug<sup>2</sup>, Matt Leinhauser<sup>3,4</sup>, Richard Pausch<sup>2</sup>, Klaus Steiniger<sup>2</sup>, Jan Stephan<sup>4</sup>, René Widera<sup>2</sup>, Jeff Young<sup>5</sup>, Michael Bussmann<sup>4</sup>, Sunita Chandrasekaran<sup>3</sup>, Guido Juckeland<sup>1</sup>

<sup>1</sup>Department of Information Services and Computing, Helmholtz-Zentrum Dresden-Rossendorf (HZDR) mailto:j.kelling@hzdr.de, https://www.hzdr.de

> <sup>2</sup>Insitute of Radiation Physics, Helmholtz-Zentrum Dresden-Rossendorf (HZDR) <sup>3</sup>Deptartment of CIS, University of Delaware <sup>4</sup>Center for Advance Systems Understanding (CASUS) <sup>5</sup>Georgia Tech, School of Computer Science

> > October 15, 2021



github.com/ComputationalRadiationPhysics/picongpu

picongpu.readthedocs.io

- Open source, fully relativistic, 3D3V, manycore, performance portable PIC code with a single code base for relativistic plasma physics
- Implements various numerical schemes, e.g.:
  - > Villasenor-Buneman, Esirkepov and ZigZag current deposition
  - > NGP (0th) to P4S (4th) macro particle shape orders
  - > Boris and Vay particle pusher
  - > Yee, Lehe and AO-FDTD field solver
- Available self-consistent additions to the PIC cycle, e.g.:
  - > QED synchrotron radiation and Bremsstrahlung (photon emission)
  - > Thomas-Fermi collisional ionization
  - > ADK and BSI field ionization
  - > In-situ calculation of coherent and incoherent far field radiation
  - > Classical radiation reaction
- Tools and diagnostics, e.g.:
  - > Extensible selection of plugins for online analysis of particle and field data
  - > Scalable I/O for restarts and full output in openPMD using parallel HDF5 and ADIOS2



















# Weak Scaling FOM Case on Summit

PICon ( is a Frontier CAAR code.



- Runtime:  $\sim 10 \min \sim 0.6 \, \mathrm{s}$  per iteration
- FOM Science case
- Scaling:
  - **27**  $\rightarrow$  4600 nodes
  - $162 \rightarrow 27\,600 \text{ GPUs}$
  - 96–98 % GPU utilization



concer

## **PIConGPU Full Software Stack**



Huebl, Axel, et al. (2018) Zero Overhead Modern C++ for Mapping to Any Programming Model. Software Stack updated by René Widera (2020)



1 Abstraction Layers and Accelerated Computing in C++

2 OpenACC and OpenMP target

3 Porting Alpaka

4 Issues and Results

5 Outlook



5/24 Porting a C++ Abstraction Layer to Directive-based Offloading • 2021-10-15

•

# **Offloading Models**

- Vendor Specific, low-level: CUDA, HIP, ...
- Open, low-level: OpenCL, SYCL, ...
- Open, directive-based: OpenMP target, OpenACC



- RAJA
- Kokkos
- al/saka



- RAJA
- Kokkos
- al/saka

Why?



- RAJA
- Kokkos
- ∎ al<mark>/</mark>}aka

### Why?

- Dilemma of choice:
  - Which API to use?
  - Which will be supported throughout the lifetime of the code?
- A future hardware architechture may come with a new programming model...



- RAJA
- Kokkos
- ∎ al<mark>r</mark>}aka

### Why?

- Dilemma of choice:
  - Which API to use?
  - Which will be supported throughout the lifetime of the code?
- A future hardware architechture may come with a new programming model...
- $\Rightarrow\,$  Important to keep large applications independent of offloading API
- Dependence on abstraction layer less problematic: comparatively lightweight, can be maintained by primary application's team



# **OpenMP** target and **OpenACC**

#### **OpenMP** target

- Extension of OpenMP for accelerator offloading
- Added in verison 4.0
- Aims to provide fine-grained control
- Explicit parallelism
- #pragma omp target



# **OpenMP** target and **OpenACC**

#### **OpenMP** target

- Extension of OpenMP for accelerator offloading
- Added in verison 4.0
- Aims to provide fine-grained control
- Explicit parallelism
- #pragma omp target

#### OpenACC

- Newly developed parallel model specifically for accelerators
- Aims to be descriptive rather than prescriptive
- Intentionally only pure data parallelism on device
- #pragma acc parallel



### **Accelerator Execution Hierarchy**

| CUDA   | Alpaka  | OpenMP 5.0 | OpenACC 3.0 | execution   |
|--------|---------|------------|-------------|-------------|
| grid   | grid    | (target)   | (parallel)  | task        |
| block  | block   | team       | gang        | undefined   |
| thread | thread  | thread     | worker      | lock-step   |
|        | element | simd       | (vector)    | vector/seq. |



### **Accelerator Execution Hierarchy**

| CUDA   | Alpaka  | OpenMP 5.0 | OpenACC 3.0 | execution   |
|--------|---------|------------|-------------|-------------|
| grid   | grid    | (target)   | (parallel)  | task        |
| block  | block   | team       | gang        | undefined   |
| thread | thread  | thread     | worker      | undefined   |
|        | element | simd       | (vector)    | vector/seq. |





- Header-only C++14 abstraction library for accelerator development
- Accelerator type passed to device kernels as backend handle
- 1 template<typename TAcc>
- void kernel(const TAcc& acc, ...);
- $\Rightarrow\,$  no conditional compilation required for backend selection



- Header-only C++14 abstraction library for accelerator development
- Accelerator type passed to device kernels as backend handle
- 1 template<typename TAcc>
- void kernel(const TAcc& acc, ...);
- $\Rightarrow\,$  no conditional compilation required for backend selection
- API and feature set modelled after CUDA

host devices, queues, events, memory management, ... device atomics, block-shared memory, block-sync, ...

lib math random

lib math, random, ...

 supported backends include: sequential, OpenMP, TBB, CUDA, HIP, ...



۹ 🗆

# alsaka: Queues and Tasks

Compute and memory task objects are placed in queues executing in order



# alsaka: Queues and Tasks

Compute and memory task objects are placed in queues executing in order

```
template<class Functor, class... Args>
 1
    struct TaskKernel
 3
      TaskKernel(
 4
        WorkDiv workDiv, // grid size
 5
        Functor functor, // user functor
 6
        Args ... args ); // user arguments
 7
 8
      void operator() (const DevType& dev) const;
9
10
     private:
11
      WorkDiv m workDiv;
12
      Functor m_functor;
13
      tuple< decay t<Args...> > m args;
14
15 };
 11/24 Porting a C++ Abstraction Laver to Directive-based Offloading • 2021-10-15
```

• 🗆

# alsaka: Kernel

#### **OpenMP** target

```
1 // TaskKernel Omp5::operator() (...) {
2 // copy members to local scope, e.q.:
3 auto args = m_args;
4 omp_set_num_threads( workdiv.threads );
5 #pragma omp target
   ł
6
7
   # praqma omp teams distribute
     for ( int b = 0; b < workDiv.blocks; ++b )</pre>
8
     Ł
9
       // OpenMP backend handle:
10
       AccOmp5 ctx (workdiv, b):
11
12 #
       praqma omp parallel
13
       ſ
14
15
         apply([&ctx](auto ...args){
16
             functor ( ctx, args... );
17
           }, margs);
18
19 } } }
```

• 🗆

# alsaka: Kernel

#### **OpenMP** target

```
1 // TaskKernel Omp5::operator() (...) {
2 // copy members to local scope, e.q.:
   auto args = m_args;
3
   omp_set_num_threads( workdiv.threads );
4
   #pragma omp target
5
   ł
6
7
   # praqma omp teams distribute
     for ( int b = 0; b < workDiv.blocks; ++b )</pre>
8
9
       // OpenMP backend handle:
10
       AccOmp5 ctx (workdiv, b):
11
12 #
       praama omp parallel
13
       ſ
14
15
         apply([&ctx](auto ...args){
16
             functor ( ctx, args... );
17
           }, margs);
18
  19
```

#### OpenACC

```
1 // TaskKernel Oacc::operator() (...) {
2 // copy members to local scope, e.q.:
   auto args = m_args;
3
4
   #praqma acc parallel
5
6
7
   # pragma acc loop gang
     for ( int b = 0; b < workDiv.blocks; ++b )</pre>
8
9
        // OpenACC block context:
10
        CtxBlockOacc ctxBlock (workdiv, b);
11
   #
        pragma acc loop worker
12
        for ( int t = 0; t < workdiv.threads; ++t )</pre>
13
14
          AccOacc ctx ( ctxBlock, t );
15
          apply([&ctx](auto ...args){
16
              functor ( ctx, args... );
17
            }, margs);
18
   } } }
19
```

DRESDEN

concep



# alsaka: Kernel Environment

#### OpenMP target

- Block-thread index
- 1 template<>

}

}:

```
2 class GetThreadIdx< AccOmp5 > {
3 size_t get ( const AccOmp5& ) {
```

```
return omp_get_thread_num();
```

#### OpenACC

```
1 template<>
2 class GetThreadIdx< AccOacc > {
3 size_t get ( const AccOacc& ctx ) {
4 return ctx.m_threadIdx;
5 }
6 };
```

4

5

6

# alsaka: Kernel Environment

#### OpenMP target

Block-thread index

```
1 template<>
2 class GetThreadIdx< AccOmp5 > {
3 size_t get ( const AccOmp5& ) {
4 return omp_get_thread_num();
5 }
6 };

    Block-level barrier
1 template<>
2 class SyncBlockThreads< AccOmp5 > {
3 yoid sync ( const AccOmp5& ) {
```

```
4 # pragma omp barrier
```

5 } };

### OpenACC

```
template<>
1
  class GetThreadIdx< AccOacc > {
2
    size_t get ( const AccOacc& ctx ) {
3
      return ctx.m_threadIdx;
4
   }
5
6
  }:
  template<>
  class SyncBlockThreads< AccOacc > {
2
    void svnc ( const AccOacc& acc ) {
3
  // atomics and spin waits
5 } };
```

• 🗆

# alsaka: Kernel Environment

#### OpenMP target

Block-thread index

```
template<>
  class GetThreadIdx< AccOmp5 > {
     size_t get ( const AccOmp5& ) {
3
       return omp_get_thread_num();
    3
5
  }:
6
   Block-level barrier
  template<>
  class SyncBlockThreads< AccOmp5 > {
    void svnc ( const AccOmp5& ) {
3
   # pragma omp barrier
  } };
\mathbf{5}
```

#### OpenACC

```
template<>
1
  class GetThreadIdx< AccOacc > {
    size_t get ( const AccOacc& ctx ) {
3
      return ctx.m_threadIdx;
4
    }
5
  }:
6
  template<>
  class SyncBlockThreads< AccOacc > {
2
    void svnc ( const AccOacc& acc ) {
3
    // atomics and spin waits
  } };
5
```

Block-shared memory

**block** context contains small-object allocator (~ 30 kB buffer, configurable) <sup>13/24</sup> Porting a C++ Abstraction Layer to Directive-based Offloading • 2021-10-15



• 🗆

# alsaka: Memory

- Device (and host) memory are managed via RAII buffer API
- Explicit operations of buffer creation and copy
- $\blacksquare$  No linking between host and device memory  $\Rightarrow$  no use for data directives

| Alpaka           | CUDA       | OpenMP 5.0                 | OpenACC 3.0                                                         |
|------------------|------------|----------------------------|---------------------------------------------------------------------|
| alpaka::allocBuf | cudaMalloc | omp_target_alloc           | acc_malloc                                                          |
| alpaka::memcpy   | cudaMemcpy | omp_target_memcpy          | acc_memcpy_to_device<br>acc_memcpy_from_device<br>acc_memcpy_device |
| ~Buf             | cudaFree   | <pre>omp_target_free</pre> | acc_free                                                            |



# PICon GED : Globals and Constants

Alpaka does not provide and abstraction for global variables.

- PIConGPU uses one global variable, requiring directives in the code:
- uint64\_t nextId;
- 2 #pragma acc declare device\_resident(nextId)
- 3 #pragma omp declare target(nextId)



# PICon GED : Globals and Constants

Alpaka does not provide and abstraction for global variables.

- PIConGPU uses one global variable, requiring directives in the code:
- uint64\_t nextId;
- 2 #pragma acc declare device\_resident(nextId)
- 3 #pragma omp declare target(nextId)
- PIConGPU's simulation definition is fixed at compile time using constexpr.
  - If a constant needs an address at run-time it must be explicitly mapped to the device
  - e.g. for runtime-indexed array, object of which a non-static member function is called in device code
  - 1 constexpr uint64\_t constant[] = { 1, 2 }
  - 2 #pragma acc declare copyin(constant)
  - 3 #pragma omp declare target(constant)



• types containing static constexpr data members were not mappable (OpenMP target (< 5.0))

 $\blacksquare$  probably result of a ban on static with no regard to const in C++



- types containing static constexpr data members were not mappable (OpenMP target (< 5.0) )
  - $\blacksquare$  probably result of a ban on static with no regard to const in C++
- mapping of constexpr variables with static lifetime (compile-time constants) not implicit (OpenMP target / OpenACC)
  - $\blacksquare$  compiler knows which constants are used and there is no abiguity about sequence of copy  $\Rightarrow$  should be implicit

- types containing static constexpr data members were not mappable (OpenMP target (< 5.0) )
  - $\blacksquare$  probably result of a ban on static with no regard to const in C++
- mapping of constexpr variables with static lifetime (compile-time constants) not implicit (OpenMP target / OpenACC)
  - $\blacksquare$  compiler knows which constants are used and there is no abiguity about sequence of copy  $\Rightarrow$  should be implicit

#### missing gang-level barrier

(OpenACC)

- a barrier would not agree with pure data-parallel philosophy
- $\blacksquare$  no explicit control over number of workers  $\Rightarrow$  makeshift barrier can dead-lock or not work



- types containing static constexpr data members were not mappable (OpenMP target (< 5.0) )
  - $\blacksquare$  probably result of a ban on static with no regard to const in C++
- mapping of constexpr variables with static lifetime (compile-time constants) not implicit (OpenMP target / OpenACC)
  - $\blacksquare$  compiler knows which constants are used and there is no abiguity about sequence of copy  $\Rightarrow$  should be implicit

#### missing gang-level barrier

(OpenACC)

- a barrier would not agree with pure data-parallel philosophy
- $\blacksquare$  no explicit control over number of workers  $\Rightarrow$  makeshift barrier can dead-lock or not work
- std::tuple implementations are not required to be trivially copyable if all component types are

(C++)

 $\Rightarrow$  no std::tuple is formally mappable



## **Tested Compilers**

|                             | OpenMP  target |     | OpenACC |     |       |
|-----------------------------|----------------|-----|---------|-----|-------|
| target:                     | ×86            | hsa | nvptx   | ×86 | nvptx |
| $GCC \geq 9$                |                |     |         |     |       |
| $Clang \geq 10$             |                |     |         |     |       |
| $AOMP\approx 0.7$           |                |     |         |     |       |
| ROC Clang = $4.3.0$         |                |     |         |     |       |
| $IBM\ XL = 16.1.1\text{-}5$ |                |     |         |     |       |
| $NVHPC \geq 19.3$           |                |     |         |     |       |

- All listed compilers showed major roadblocks in initial tests.
- Followed only updates of two compilers with fastest development speed:
  - Clang (git main) for OpenMP target
  - NVHPC (releases) for OpenACC

17/24 Porting a C++ Abstraction Layer to Directive-based Offloading • 2021-10-15

- OpenMP 5.0 / OpenACC 3.0 not fully supported anywhere. E.g:
  - GCC types with static constexpr not mappable (very strict interpretation of OpenMP 4.5)  $\Rightarrow$  porting PIConGPU impossible



- OpenMP 5.0 / OpenACC 3.0 not fully supported anywhere. E.g:
  - GCC types with static constexpr not mappable (very strict interpretation of OpenMP 4.5)  $\Rightarrow$  porting PIConGPU impossible
- Internal Compiler Errors (ICE) happen when directives meet C++



• OpenMP 5.0 / OpenACC 3.0 not fully supported anywhere. E.g:

GCC types with static constexpr not mappable (very strict interpretation of OpenMP 4.5)  $\Rightarrow$  porting PIConGPU impossible

- Internal Compiler Errors (ICE) happen when directives meet C++
- Invalid use or not-implemented features can trigger ICE instead of compiler error



• OpenMP 5.0 / OpenACC 3.0 not fully supported anywhere. E.g:

GCC types with static constexpr not mappable (very strict interpretation of OpenMP 4.5)  $\Rightarrow$  porting PIConGPU impossible

- Internal Compiler Errors (ICE) happen when directives meet C++
- Invalid use or not-implemented features can trigger ICE instead of compiler error
- Runtime errors, like incorrect data sharing, atomics not doing what they should



# **Compiler Issues II**

- Focussed main development and testing on Alpaka's test suite and examples, rather than PIConGPU
- $\Leftarrow$  Smaller applications with limited scope may not get stuck on the same bugs



# **Compiler Issues II**

- Focussed main development and testing on Alpaka's test suite and examples, rather than PIConGPU
- Smaller applications with limited scope may not get stuck on the same bugs
  - When code compiles but does not work due to compiler bug correctness of code must be shown hard when no second compiler compiles the code

# **Compiler Issues II**

- Focussed main development and testing on Alpaka's test suite and examples, rather than PIConGPU
- Smaller applications with limited scope may not get stuck on the same bugs
  - When code compiles but does not work due to compiler bug correctness of code must be shown hard when no second compiler compiles the code
- ⇒ Sometimes needed compiler developers to run our complete code through their compiler to debug issues without small reproducer

## Results: alsaka VectorAdd

```
auto bufHostA(alpaka::allocBuf<uint32_t, Idx>(devHost, extent)); //... bufHostB(...), bufHostC(...);
   // init bufHost* ...
2
   auto bufAccA(alpaka::allocBuf<uint32 t, Idx>(devAcc, extent)); //... bufAccB(...), bufAccC(...);
3
   alpaka::memcpy(queue, bufAccA, bufHostA, extent); // ...
\mathbf{5}
   auto const taskKernel = alpaka::createTaskKernel<Acc>(workDiv,
6
      [] (const auto& acc, const uint32_t* A, const uint32_t* B, uint32_t* C, size_t N) {
7
8
          for(TIdx i(threadFirstElemIdx); i < threadLastElemIdxClipped; ++i)</pre>
9
              C[i] = A[i] + B[i];
10
     }, alpaka::getPtrNative(bufAccA), alpaka::getPtrNative(bufAccB), alpaka::getPtrNative(bufAccC), N);
11
12
   alpaka::enqueue(queue, taskKernel);
13
   alpaka::memcpv(queue, bufHostC, bufAccC, extent);
14
   alpaka::wait(queue); // check result against host computation
15
```

• 🗆

## Results: alsaka VectorAdd

```
auto bufHostA(alpaka::allocBuf<uint32_t, Idx>(devHost, extent)); //... bufHostB(...), bufHostC(...);
   // init bufHost* ...
2
   auto bufAccA(alpaka::allocBuf<uint32 t, Idx>(devAcc, extent)); //... bufAccB(...), bufAccC(...);
з
   alpaka::memcpy(queue, bufAccA, bufHostA, extent); // ...
\mathbf{5}
   auto const taskKernel = alpaka::createTaskKernel<Acc>(workDiv,
6
      [] (const auto& acc, const uint32_t* A, const uint32_t* B, uint32_t* C, size_t N) {
7
8
          for(TIdx i(threadFirstElemIdx); i < threadLastElemIdxClipped; ++i)</pre>
9
              C[i] = A[i] + B[i];
10
     }, alpaka::getPtrNative(bufAccA), alpaka::getPtrNative(bufAccB), alpaka::getPtrNative(bufAccC), N);
11
12
   alpaka::enqueue(queue, taskKernel);
13
   alpaka::memcpv(queue, bufHostC, bufAccC, extent);
14
   alpaka::wait(queue); // check result against host computation
15
```

|                    |                            | Clang Main   |              | ROC Clang    | NVHPC 21.7   |              |  |
|--------------------|----------------------------|--------------|--------------|--------------|--------------|--------------|--|
|                    |                            | ×86          | hsa          | hsa          | ×86          | nvptx        |  |
|                    | compile                    | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |  |
| + Abstraction Laye | run<br>r to Directive-base | d Offloading | memory error | $\checkmark$ | $\checkmark$ | $\checkmark$ |  |

20/24 Porting a C++

DRESDEN

# Results: alsaka Test Suite

- Suite of tests also used in alpaka's CI
- Battery of test cases for each aspect of a backend: kernels, memory, atomics, ...
- Using Catch2  $\Rightarrow$  more TMP, harder for compilers to succeed.

# Results: alsaka Test Suite

- Suite of tests also used in alpaka's CI
- Battery of test cases for each aspect of a backend: kernels, memory, atomics, ...
- Using Catch2  $\Rightarrow$  more TMP, harder for compilers to succeed.

|         | Clang Main   |              | ROC Clang          | NVHPC 21.7   |                | GCC 11 |
|---------|--------------|--------------|--------------------|--------------|----------------|--------|
|         | ×86          | hsa          | hsa                | ×86          | nvptx          | ×86    |
| compile | $\checkmark$ | most         | slow, linker hangs | $\checkmark$ | $\checkmark^1$ | most   |
| run     | $\checkmark$ | memory error |                    | $\checkmark$ | $\checkmark$   | ×      |



21/24 Porting a C++ Abstraction Layer to Directive-based Offloading • 2021-10-15

🗇 <sup>1</sup>only local installation, nvlink error : Duplicate weak parameter bank for ... when using NVIDIA docker image in CI

# Results: PICon CD

|         | Clang Main   | NVHPC 21.7   |              |
|---------|--------------|--------------|--------------|
|         | ×86          | ×86          | nvptx        |
| compile | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| run     | $\checkmark$ | $\checkmark$ | ×            |





- $\blacksquare$  OpenMP target and OpenACC compiler ecosystems still rather unstable when it comes to C++
- OpenACC is too strict about data parallelism to port existing codes which do not adhere to this pattern





- $\blacksquare$  OpenMP target and OpenACC compiler ecosystems still rather unstable when it comes to C++
- OpenACC is too strict about data parallelism to port existing codes which do not adhere to this pattern
- Our OpenMP target and OpenACC backends are, to our knowledge, complete, though we cannot actually test and debug them completely
- Will follow and try to push future compiler development



#### **Acknowledgments**

- Mathew Colgrove (NVIDIA) and NVHPC for helping to debug compiler and code issues
- Ron Liberman (AMD) and SPEC High Performance group for advice and testing PIConGPU

# Thank You.

## **OpenMP** target and **OpenACC**: **Directives**

|           | OpenMP target        | OpenACC         |
|-----------|----------------------|-----------------|
| execution | omp target           | acc parallel    |
|           | omp teams distribute | acc loop gang   |
|           | omp parallel for     | acc loop worker |



## **OpenMP** target and **OpenACC**: **Directives**

|           | OpenMP target          | OpenACC         |  |  |
|-----------|------------------------|-----------------|--|--|
| execution | omp target             | acc parallel    |  |  |
|           | omp teams distribute   | acc loop gang   |  |  |
|           | omp parallel for       | acc loop worker |  |  |
| memory    | omp target data map () | acc data copy   |  |  |
|           | omp declare target ()  | acc declare ()  |  |  |



## **OpenMP** target and **OpenACC**: **Directives**

|              | OpenMP target          | OpenACC         |
|--------------|------------------------|-----------------|
| execution    | omp target             | acc parallel    |
|              | omp teams distribute   | acc loop gang   |
|              | omp parallel for       | acc loop worker |
| memory       | omp target data map () | acc data copy   |
|              | omp declare target ()  | acc declare ()  |
| atomics      | omp atomic             | acc atomic      |
| lock         | omp critical           | —               |
| sync threads | omp barrier            | —               |



# Results: alsaka HelloWorld



# Results: alsaka HelloWorld

```
1 alpaka::exec<Acc>( queue, workDiv,
2 [] (const auto& acc) {
3      const auto gidx = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc);
4      const auto gext = alpaka::getWorkDiv<alpaka::Grid, alpaka::Threads>(acc);
5      
6      const auto lgidx = alpaka::mapIdx<1u>( gidx, gext );
7      
8      printf("[z:%u, y:%u, x:%u][linear:%u] Hello World\n", gidx[0u], gidx[1u], gidx[2u], lgidx[0u] );
9      });
10      alpaka::wait(queue);
```

|         | Clang Main   |          | ROC Clang    | NVHPC 21.7   |              |
|---------|--------------|----------|--------------|--------------|--------------|
|         | ×86          | hsa      | hsa          | ×86          | nvptx        |
| compile | $\checkmark$ | no c-lib | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| run     | $\checkmark$ |          | $\checkmark$ | $\checkmark$ | $\checkmark$ |



• 🗆