Summary of GPU First - Execution of Legacy CPU Codes on GPUs

Source: arxiv.org (PDF document, 10,592 words)

One Line

The "GPU First" compilation scheme enables running CPU codes on GPUs without changing source code, facilitating acceleration identification and quick code modification testing.

Slides

Slide Presentation (11 slides)

GPU First - Execution of Legacy CPU Codes on GPUs

Introduction

• The “GPU First” compilation scheme allows for the execution of legacy CPU codes on GPUs without modifying the source code.

• Transparent porting of legacy CPU codes to GPUs is possible and GPU performance exploration is feasible for non-experts.

Simplified Acceleration Identification

• The scheme simplifies the identification of code regions suitable for acceleration.

• Rapid testing of code modifications on actual GPUs is enabled.

• Automatically generated RPC calls let library functions that are unavailable on the GPU execute on the host.

Replacing Variadic Library Calls

• Variadic library calls such as fscanf are replaced with RPC calls on the device.

• A wrapper invoked by the RPC server thread on the host unpacks the arguments passed from the device and performs the original call.

• The generated device code is divided into call site specific code and call site independent code.

Handling Different Types of Arguments

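• Different kinds of arguments are handled when library calls are forwarded from the GPU to the host.

• The first kind is a value (integer, floating point value, or pointer to an opaque type) that is passed unchanged.

• The second kind is a pointer to a statically identified object, which is registered with the object's size and the pointer's offset so the underlying memory can be transferred.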

Multiple Teams and Custom Allocators

• The use of multiple teams in parallel kernels is mentioned.

• Configurable custom allocators are needed due to variations in GPU heap allocation support among vendors.

• Tracking memory allocations is necessary.

Technical Challenges

• The GPU First methodology faces technical challenges.

• Limitations exist in moving more than one level of memory when accessing objects through indirection.

• Annotated library headers could help overcome challenges.

Performance Benchmarks

• Benchmarks conducted using CUDA 11.8.0 and -O3 optimization flag.

• AMGmk and page-rank benchmarks show good performance for relax kernel measurements and propagation step measurements, respectively.

• SPEC OMP benchmarks showcase different results.

Related Research Papers and Benchmarks

• References to research papers and benchmarks related to legacy CPU codes execution on GPUs.

• Co-designing OpenMP GPU runtime and optimizations for near-zero overhead execution are discussed.

• Efficient GPU-centric communication on NVIDIA GPU clusters is explored.

Conclusion

• The “GPU First” compilation scheme simplifies the identification of code regions suitable for acceleration.

• Rapid testing of code modifications on actual GPUs is facilitated.

• GPUs perform well for certain benchmarks' relax kernel measurements and propagation step measurements.

Key Takeaways

• The “GPU First” compilation scheme allows for the execution of legacy CPU codes on GPUs without modifying the source code.

• Transparent porting of legacy CPU codes to GPUs is possible and GPU performance exploration is feasible for non-experts.

• Benchmarks show that GPUs perform well for relax kernel measurements and propagation step measurements in certain benchmarks.

Key Points

The "GPU First" compilation scheme allows for the execution of legacy CPU codes on GPUs without modifying the source code.
The scheme simplifies the identification of code regions suitable for acceleration and enables rapid testing of code modifications on actual GPUs.
Transparent porting of legacy CPU codes to GPUs is possible and GPU performance exploration is feasible for non-experts.
The document discusses replacing variadic library calls with RPC calls issued on the device, so that library functions unavailable on the GPU run on the host.
The GPU First methodology faces technical challenges, such as limitations in memory access and the need for annotated library headers.
Benchmarks show that GPUs perform well for relax kernel measurements and propagation step measurements in certain benchmarks.

Summaries

26 word summary

The "GPU First" compilation scheme allows for executing legacy CPU codes on GPUs without modifying source code, simplifying acceleration identification and enabling rapid code modification testing.

46 word summary

The paper introduces a compilation scheme called "GPU First" that allows for the execution of legacy CPU codes on GPUs without modifying the source code. The scheme simplifies the task of identifying code regions suitable for acceleration and enables rapid testing of code modifications on actual GPU hardware.

529 word summary

Transparent porting of legacy CPU codes to GPUs is possible and GPU performance exploration is feasible for non-experts. By using the GPU First methodology, parallel loops can achieve performance similar to manually offloaded kernels, with up to a 14.36× speedup on the GPU.

The methodology described in this document focuses on the execution of legacy CPU codes on GPUs. The authors propose a compilation and execution path that allows the user application to be compiled for the GPU without modifying the source code. They also introduce automatic generation of RPC calls

The text discusses the replacement of the variadic library call to fscanf with an RPC call on the device. A wrapper invoked by the RPC server thread on the host unpacks the arguments passed from the device and performs the original call. The device code is divided into call site specific code and call site independent code.

This text excerpt discusses the execution of legacy CPU codes on GPUs. It describes how different kinds of arguments are handled when library calls are forwarded to the host. The first kind of argument is a value that can be passed unchanged. The second kind is a pointer to a statically identified object, whose underlying memory must be transferred as well.

The text excerpt discusses the execution of legacy CPU codes on GPUs. It mentions the use of multiple teams in parallel kernels and the need for configurable custom allocators. The reasons for this include variations in GPU heap allocation support among vendors and the need to track

The GPU First methodology has potential but faces technical challenges. One challenge is the inability to move more than one level of memory when accessing objects through indirection, which can result in accessing device memory instead of host memory. Annotated library headers could help overcome

The experiments were conducted using CUDA 11.8.0 and benchmarks were compiled with the -O3 optimization flag. The prototype version used in the

The excerpt discusses the performance of various benchmarks when executed on GPUs using the GPU First scheme. The AMGmk and page-rank benchmarks show that GPUs perform well for relax kernel measurements and propagation step measurements, respectively. However, the SPEC OMP benchmarks

The "GPU First" compilation scheme allows for the automatic compilation of legacy CPU applications directly for GPUs without the need for modification to the application source. This approach simplifies the identification of code regions that can benefit from acceleration and facilitates rapid testing of code modifications

This excerpt contains a list of references to various research papers and benchmarks related to the execution of legacy CPU codes on GPUs. Some of the key points include the co-designing of an OpenMP GPU runtime and optimizations for near-zero overhead execution, the efficient

This excerpt includes a list of references to various research papers and conference proceedings related to GPU computing, OpenMP, performance optimization, and parallel programming. The papers cover topics such as executing legacy CPU codes on GPUs, GPU-centric communication on NVIDIA GPU clusters,

Raw indexed text (67,647 chars / 10,592 words / 1,298 lines)

GPU First — Execution of Legacy CPU Codes on GPUs

Shilei Tian
[email protected]
Stony Brook University
Stony Brook, NY, USA

Tom Scogland
[email protected]
Lawrence Livermore National Laboratory
Livermore, CA, USA

Barbara Chapman
[email protected]
Stony Brook University
Stony Brook, NY, USA

Johannes Doerfert
[email protected]
Lawrence Livermore National Laboratory
Livermore, CA, USA

ABSTRACT

Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be directly executable on the device. This complexity makes it challenging for non-experts to leverage GPUs effectively, or even to start offloading parts of a large legacy application.

In this paper, we propose a novel compilation scheme called “GPU First” that automatically compiles legacy CPU applications directly for GPUs without any modification of the application source. Library calls inside the application are either resolved through our partial libc GPU implementation or via automatically generated remote procedure calls to the host. Our approach simplifies the task of identifying code regions amenable to acceleration and enables rapid testing of code modifications on actual GPU hardware in order to guide porting efforts.

Our evaluation on two HPC proxy applications with OpenMP CPU and GPU parallelism, four micro benchmarks with originally GPU only parallelism, as well as three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism showcases the simplicity of porting host applications to the GPU. For existing parallel loops, we often match the performance of corresponding manually offloaded kernels, with up to 14.36× speedup on the GPU, validating that our GPU First methodology can effectively guide porting efforts of large legacy applications.

KEYWORDS

GPU, compiler, OpenMP, code transformation

CCS CONCEPTS

• Software and its engineering → Compilers; Parallel programming languages.

Conference’17, July 2017, Washington, DC, USA

ACM Reference Format:

Shilei Tian, Tom Scogland, Barbara Chapman, and Johannes Doerfert. 2023.

GPU First — Execution of Legacy CPU Codes on GPUs. In Proceedings

of ACM Conference (Conference’17). ACM, New York, NY, USA, 12 pages.

https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

In today’s era of high-performance computing, GPUs have emerged as the most popular solution for accelerating compute-intensive workloads due to their massive parallelism and high memory bandwidth. However, harnessing the full potential of GPUs can be challenging, especially for legacy CPU applications that were not designed with GPU acceleration in mind. Porting such applications to the GPU can be time-consuming and error-prone, and usually requires significant development effort. One needs to identify code regions amenable to acceleration, manage distinct memories, synchronize host and device execution, and handle library functions that are not executable on the device. These tasks increase the complexity of porting and may require significant re-architecting efforts, making it difficult for non-experts to leverage GPUs for performance gains or even to initiate offloading any part of a large legacy application.

To overcome these challenges, we propose a novel compilation scheme we call “GPU First”, which puts a legacy application on the GPU before any manual porting effort has been started. Our approach, sketched in Figure 1, leverages the portability of LLVM’s OpenMP offloading to directly compile and run a host application for a GPU, without any source modification. Instead, the application is compiled for the GPU architecture and started using a provided GPU loader. Library calls inside the application are either resolved through our partial libc GPU implementation or via automatically generated remote procedure calls (RPCs) to the host. By adopting the GPU First approach, users can seamlessly test and profile their application directly on GPU hardware. While the performance of full applications running on (current) GPUs is generally not better than CPU execution, it allows developers to easily identify how well existing parallel regions map to the GPU. Furthermore, it enables rapid testing of code modifications, e.g., data layout transformations or iteration order modifications, on real GPU hardware. We believe this significant simplification will facilitate the adoption of GPU acceleration in various domains and help developers harness the full potential of modern hardware.

To evaluate the effectiveness of our approach, we performed experiments on two HPC proxy applications with OpenMP CPU and GPU parallelism, four micro benchmarks with originally GPU only parallelism, as well as three benchmarks from the SPEC OMP 2012 suite featuring hand-optimized OpenMP CPU parallelism. Our results demonstrate that transparent porting is possible and exploration of GPU performance is feasible for non-experts. For existing parallel loops, we can closely match the performance of corresponding manually offloaded kernels, achieving up to 14.36× speedup on the GPU compared to the CPU implementation of the HPC proxy application. This validates our assumptions that the GPU First methodology can effectively guide porting efforts, e.g., by identifying parallel regions that require reorganization to achieve good scaling behavior on the GPU, and by allowing fast comparison of different algorithmic and implementation choices.

The main contributions of this paper are summarized here while limitations are explained in detail in Section 4.

(1) A novel compilation scheme that makes it possible to target GPUs for a large set of legacy CPU applications by automatically enabling host-only library calls via a generated RPC interface that translates arguments and migrates underlying memory for use in dedicated memory environments.

(2) A parallelism expansion scheme that maps OpenMP parallel directives (and parallel loops) from a single thread block (also known as a work group), which is the natural OpenMP offload mapping, to the entire GPU for realistic performance studies.

(3) A GPU-optimized partial libc implementation that allows fast execution of runtime calls that do not require operating system support, including a GPU-optimized allocator.

(4) An evaluation of the GPU First approach on SPEC OMP 2012 benchmarks as well as HPC proxy applications. The former shows the applicability of our scheme to large codes while the latter highlights how parallel loops of the CPU version perform very similarly to the manually offloaded GPU kernels. Thus, GPU First is well suited to guide porting efforts for legacy applications that already use OpenMP parallelism on the CPU.

The rest of the paper is organized as follows. In Section 2, we provide background information on OpenMP target offloading and the RPC mechanism. In Section 3, we describe the design and conceptual implementation of our GPU First method. Section 4 discusses the limitations of our approach and potential directions for future work. In Section 5, we present our evaluation results to demonstrate the effectiveness of our approach to guide porting efforts for legacy applications. We discuss related work in Section 6 before we conclude in Section 7 with insights on the potential impact of our work on the field of parallel computing.

2 BACKGROUND

With the introduction of the target construct in OpenMP 4.0, it became possible to execute a code region on a target device like a GPU [3] or FPGA [15]. In this section, we will provide an overview of the LLVM/OpenMP execution model and explain the compilation and execution path of the GPU First methodology. Additionally, we will introduce the basics of host remote procedure calls (RPCs).

2.1 OpenMP Execution Model

An OpenMP program begins with a single initial thread executing sequentially. When any thread encounters a parallel construct, it creates a new team with zero or more additional threads, and each of these threads executes the associated code.

The execution of a target region is similar to the program start. A single initial thread executes the device code sequentially. The parallel construct again works similarly, though, due to the synchronization requirements of the OpenMP standard, compilers will only use threads of the same thread block / workgroup for the new team.

[Figure 1 diagram; component labels: loader, app exec., system libraries, RPC server, main wrapper, user wrapper, partial GPU libc]

Figure 1: Bird’s-eye view of the GPU First methodology. All components on a grid background are provided or generated by the approach. The loader is the entry point for the operating system and responsible for setting up the environment on the device. The application executable (top right) is produced from the unmodified legacy source code but runs on the GPU. A partial libc GPU implementation provides relatively fast device side runtime calls while other library calls are translated into remote procedure calls (RPCs). Our RPC scheme will also orchestrate memory movement for arguments and underlying objects, forward the calls to existing system libraries, and return the result to the application thread waiting on the GPU.

[Figure 2 diagram; component labels: legacy CPU app. source, Clang with custom link-time-optimizations, exec., partial libc, offload lib., compile time / runtime, extended LLVM parts, GPU, RPC thread]

Figure 2: Overview of the compilation and execution path of the direct GPU compilation framework introduced by Tian et al. [26]. In our work the source of the application still remains unchanged. The source wrapper and user wrapper files are taken from the original direct GPU compilation paper. The libc implementation has been extended for this work, and the compiler is now augmented to automatically generate RPC calls and expand source parallelism to the entire GPU device. The figure was adapted from Figure 1 in Tian et al. [27]. The highlighted parts are our contribution.

To utilize the entire device, OpenMP introduced the teams construct, which starts a league of teams, each with an independent initial thread. To distribute work across the league and a team, one can use the distribute and for constructs or, alternatively, the global coordinates for manual work-sharing.
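As a minimal sketch of the constructs named above, the following loop combines target, teams, distribute, and parallel for in one directive. The function names saxpy and saxpy_vec are hypothetical and not taken from the paper. The sketch assumes a compiler with OpenMP offloading support; without it the pragma is ignored and the loop simply runs sequentially on the host.

```cpp
#include <cassert>
#include <vector>

// Scale-and-add over n elements: y[i] = a * x[i] + y[i].
// With OpenMP offloading enabled, `teams` starts a league of teams,
// `distribute` splits the iterations across the teams, and `parallel for`
// splits each team's share across the threads of that team.
void saxpy(float a, const float *x, float *y, int n) {
#pragma omp target teams distribute parallel for map(to : x[0:n]) map(tofrom : y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

// Convenience wrapper for std::vector inputs (hypothetical helper).
// Takes y by value so the caller's data is left untouched.
std::vector<float> saxpy_vec(float a, const std::vector<float> &x,
                             std::vector<float> y) {
  assert(x.size() == y.size());
  saxpy(a, x.data(), y.data(), static_cast<int>(x.size()));
  return y;
}
```

Dropping `teams distribute` from the directive would confine the loop to the threads of a single team, which is the situation the paper's parallelism expansion is meant to avoid.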

2.2 Compilation and Execution Path

Our methodology is based on the compilation and execution path proposed by Tian et al. [26] as part of their direct GPU compilation framework. Similar to their approach, our methodology compiles the user application for the GPU without modifying the application source code. However, we augment the compiler in two distinct ways to achieve portability and performance. First, we automatically generate RPC calls and, second, we expand eligible source parallelism to the entire GPU. In addition, we enhance their wrapper scripts and partial libc implementation as discussed in Section 3.4.

Figure 2, adapted from Tian et al. [26], provides an overview of our augmented compilation and execution path. The main and user wrapper fulfill the same functionality as described by Tian et al. [26], namely to compile the application code for the GPU architecture, load the environment, e.g., command line options, onto the device, and finally transfer control to the user provided main function on the GPU. The partial libc implementation is embedded into the application during compile time. The LLVM/OpenMP offloading runtime orchestrates the offloading and provides necessary functionality to the RPC host server when it communicates with the GPU threads via “shared”, in our case, managed, memory.

2.3 Host Remote Procedure Call

One of the key challenges in executing a CPU program on a GPU is external functions that are only defined in system or third-party libraries. We employ remote procedure calls (RPCs) to utilize the existing host functions during the device execution. In principle, this allows GPU code to call any function on the host almost as if it were local. A crucial requirement is the coordination between the host and GPU, which can be achieved through a synchronous, stateless client-server protocol [16, 23, 26]. In this protocol, the GPU (client) sends requests to the host (server) and waits for the host to acknowledge the completion of the requested function. Data transfer, e.g., for the arguments and return value, is handled either explicitly [26] or implicitly [16].
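The request, acknowledge, and wait steps of such a protocol can be sketched as a small state machine over a shared mailbox. This is an illustration under simplifying assumptions, not the paper's implementation: all names (RpcSlot, client_post, server_poll, client_try_complete, host_dispatch) are hypothetical, there is a single argument, and client and server are stepped explicitly in one thread instead of running concurrently on GPU and host with atomics over managed memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mailbox shared by the client (GPU thread) and server (host).
enum class SlotState { Empty, Requested, Completed };

struct RpcSlot {
  SlotState state = SlotState::Empty;
  int callee = 0;          // identifies which host function to run
  std::int64_t arg = 0;    // single opaque argument
  std::int64_t result = 0; // return value written by the server
};

// Host functions reachable through the RPC, selected by callee id.
// Both are stand-ins for host-only library calls.
inline std::int64_t host_dispatch(int callee, std::int64_t arg) {
  switch (callee) {
  case 0: return arg + 1;
  case 1: return arg * arg;
  default: return -1;
  }
}

// Client side: publish a request. The slot must currently be free.
inline void client_post(RpcSlot &slot, int callee, std::int64_t arg) {
  assert(slot.state == SlotState::Empty);
  slot.callee = callee;
  slot.arg = arg;
  slot.state = SlotState::Requested;
}

// Server side: handle at most one pending request and acknowledge it.
// Returns true if a request was served. No state is kept between requests.
inline bool server_poll(RpcSlot &slot) {
  if (slot.state != SlotState::Requested) return false;
  slot.result = host_dispatch(slot.callee, slot.arg);
  slot.state = SlotState::Completed;
  return true;
}

// Client side: check for completion. On success, fetch the result and
// release the slot for the next request.
inline bool client_try_complete(RpcSlot &slot, std::int64_t &out) {
  if (slot.state != SlotState::Completed) return false;
  out = slot.result;
  slot.state = SlotState::Empty;
  return true;
}
```

A blocking client call corresponds to client_post followed by repeating client_try_complete until it returns true, while the server thread loops over server_poll.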

3 DESIGN AND IMPLEMENTATION

At the core of the GPU First methodology is a novel compilation technique that automatically generates a GPU executable from a CPU application with (almost) arbitrary library calls, e.g., to a system or third-party library. The most complex component of this scheme is the automatic generation of RPCs to transparently execute library functions on the host since they are not available on the GPU. During compile time, external function calls are replaced by RPC code to perform the call remotely. When this happens, the call arguments are automatically transferred to the host and, in case of pointers, the underlying memory is usually migrated as well. While not infallible, our proof-of-concept scheme automatically handles the most common cases. Limitations are discussed in more detail in Section 4. By minimizing, often even eliminating, the manual work required to run applications on the GPU, our approach aims to facilitate the adoption of GPU acceleration and help non-experts leverage the potential of modern hardware.

3.1 GPU Code Generation and Loading

In contrast to classical offloading, we directly target the GPU with the entire application code. For this task we utilize an extended version of the direct GPU compilation framework by Tian et al. [26]. In a nutshell, the application code is transparently enclosed in a #pragma omp begin/end declare target device_type(nohost) scope to utilize existing OpenMP offloading support in LLVM/Clang to generate GPU code. In the final executable, the application main function is invoked via the OpenMP offload mechanism after command line arguments have been mapped to the device and the host RPC server has been set up. For more information on the code generation and loading we refer to the original paper [26].

3.2 Generating Remote Procedure Calls

To identify and efficiently replace library calls in the application by RPCs we provide a dedicated link time optimization (LTO) pass. The benefit over per translation unit reasoning is the complete world view, which includes all the functions defined by the application as well as the call sites and contexts in which library functions, or simply non-defined functions, are called. It is important to note that, in contrast to existing LLVM-IR passes, our RPC generation emits host code while the device code is analyzed and transformed.

In the following we will walk through the compile time generation of RPC calls using Figure 3 as a guide. The top part, Figure 3a, shows a manufactured example of user code that exhibits most complexities and optimization opportunities we support. The variadic library call to fscanf is replaced with an RPC call on the device, and a wrapper that is invoked by the RPC server thread on the host. The latter, shown in Figure 3b, unpacks the arguments passed from the device and performs the original call on the host. Arguments are stored in an opaque fashion inside the RPCInfo object that is used for communication and (bit-)cast to their respective types. For variadic callees, the host wrapper function name uses the variadic argument types to provide different entry points for variadic call sites that do not agree on the number of arguments or their type. Said differently, for variadic function calls we effectively generate a non-variadic landing-pad on the host for each combination of call site argument types we encounter. For non-variadic function calls there is a unique combination of argument types, corresponding to a single host function.
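The idea of one landing-pad per combination of call site argument types can be illustrated by how such a wrapper name might be assembled. The exact encoding is not specified in the text, so the type codes used here ("ip" for int*, "fp" for float*) and the function landing_pad_name are assumptions made for this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build a host wrapper name from the callee name and short codes for the
// argument types seen at one call site. The codes are an assumed encoding
// chosen for illustration only.
std::string landing_pad_name(const std::string &callee,
                             const std::vector<std::string> &type_codes) {
  assert(!callee.empty());
  std::string name = "__" + callee;
  for (const std::string &code : type_codes)
    name += "_" + code;
  return name;
}
```

Two call sites that pass the same sequence of types map to the same name and can share one host wrapper, while call sites that differ in the number or type of arguments get distinct entry points.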

The device code that replaces the original library call, shown in

Figure 3c, is divided in call site specific code and call site indepen-

dent code. The latter is the __fscanf_ip_fp_ip function that issues

the RPC call and waits for the result. Information about arguments

is provided in the RPCArgInfo object. The callee is identified through

a compile time generated enum value representing the function

issuing the RPC. The call site specific code records information

about the arguments to orchestrate data transfer of memory, as it

is potentially required for the library call. Dedicated call site in-

formation allows for more efficient code if call sites disagree what

(dynamic) argument value is passed to a library function. In general,

there are three different kinds of arguments. The simplest are valueConference’17, July 2017, Washington, DC, USA

S. Tian, T. Scogland, et al.

struct S { int a, b; float f; };

void use(struct S* s, int r, int i) { ... }

void example(struct S s, int *p) {

int i;

int r = fscanf(fd, "%f %i %i", &s.f, S.a ? &i : &S.b, p);

use(&s, r, i);

}

(a) Variadic function call that will read and write memory passed in

via pointers. The underlying memory is located on the stack ( i and

s ), inside of a larger object ( s.b , s.f inside s ), or part of a statically

unknown object (via p ).

int __fscanf_ip_fp_ip(RPCInfo &RI) {

auto fd = (FILE*)RI.getArg(0);

auto fm = (const char*)RI.getArg(1);

auto f2 = (float*)RI.getArg(2);

auto i3 = (int*)RI.getArg(3);

auto i4 = (int*)RI.getArg(4);

return fscanf(fd, fm, f2, i3, i4);

}

(b) Host code generated during compilation. The the variadic func-

tion call is embedded in a non-variadic function named based on the

call argument types.

int __fscanf_ip_fp_ip(RPCArgInfo &RAI) {

RPCInfo RI;

RI.setCallee(/* compile time */ enum(__fscanf_ip_fp_ip));

RI.setArgInfo(&RAI);

return RI.issueBlockingCall();

}

void example(struct S s, int *p) {

int i;

RPCArgInfo CallSiteRAI(/* num args */ 5);

// Opaque value, treated as byte sequence:

CallSiteRAI.addValArg(fd);

// Statically identified objects:

CallSiteRAI.addRefArg("%f %i %i", read,

sizeof("%f %i %i"), /* offset */ 0);

CallSiteRAI.addRefArg(&s.f, readwrite,

sizeof(s), offsetof(struct S, f));

if ((s.a ? &i : &s.b) == &i)

CallSiteRAI.addRefArg(&i, write, sizeof(i), 0);

else /* ((s.a ? &i : &s.b) == &s.b) */

CallSiteRAI.addRefArg(&s.b, readwrite,

sizeof(s), offsetof(struct S, b));

// Statically unknown object requires dynamic lookup:

int p_offset, p_size;

if (_FindObj(p, &p_offset, &p_size, PotentialObjs))

CallSiteRAI.addRefArg(p, readwrite, p_size, p_offset);

else

CallSiteRAI.addValArg(p);

int r = __fscanf_ip_fp_ip(CallSiteRAI);

use(&s, r, i);

}

ple shown in Figure 3a, where the argument information at the call

site is both statically and dynamically embedded into the RPCArgInfo

object. This information, along with the call site independent infor-

mation, guides the RPC system to determine which function needs

to be called on the host, what memory needs to be copied, and how

arguments should be translated.

Figure 3: Example illustrating the host and device code gen-

erated during compile time to substitute a variadic library

function call site in the user code with RPC and memory

management code.

arguments, which include integers, floating point values, as well as

pointers to opaque types. The first call argument in our example ( fd )

is of the latter type; namely a FILE * . Since we do not know what a

FILE object looks like, we assume pointers to them are only acces-

sible on the host and the associated memory is never moved to the

device. Said differently, we assume the pointer is pointing to host

memory already and consequently does not need translation for

the RPC. Thus, the value of fd will be exactly the same on the host

as it is on the device. The second kind of arguments are pointers to

statically identified objects by applying inter-procedural analysis

built on top of the LLVM’s Attributor framework. Those objects can

(conceptually) reside in stack, global, or constant memory. The next

three call arguments in our example ( "%f %i %i" , &s.f , (s.a ? &i :

&s.b) ) fall into this category. The format string is a known compile

time constant but we still need to make it available on the host such

that fscanf can read it. To this end, we register the pointer together

with the size of the underlying object, here sizeof("%f %i %i") ,

and the offset of the pointer into the object, here 0 . Since the object

is constant, we mark it as read only which ensure the memory is

only copied to the host and not back. After the memory has been

transferred, we will use a pointer with the same offset into the host

version of the object for the host call site. The second argument

is pointing to a statically identified object is &s.f . The handling is

similar as before but the offset into the object is not trivial this time.

Further, the memory might be read and written by fscanf which

means we need to copy the object to the host and back. For the

next call site argument we can not statically determine a unique

underlying object but we can enumerate all possibilities statically

and they are all statically identified objects. Since offset and size

are different we need to generate code that identifies the object

at runtime based on the pointer value, as shown in lines 35-39 in

Figure 3c. Note that &i is marked write only since the underlying

object is known to have an unspecified value on the device at the

time of the call. The last category of arguments are pointers for

which we can not statically enumerate all potential underlying ob-

jects. This might be because the pointer is a result of a malloc -like

call, which can be executed multiple times in a loop resulting in

different objects, a stack allocation inside a potentially recursive

functions, which also allows for multiple runtime instantiations, or

a pointer with an unknown origin. In all cases we will attempt to

identify the underlying object at runtime in order to determine the

size and offset of the pointer. To accomplish this, we rely on our

allocator, which we use to implement malloc -like calls, to maintain

a record of allocated objects, as discussed in more detail as part of

Section 3.4. In case we are unable to determine the object, we will

treat the pointer as a value, assuming that it is either not accessed or

already points to host memory.
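The runtime lookup described above can be sketched as follows. This is a minimal illustration, assuming the allocator keeps a flat array of records; the names alloc_record, object_info, and find_underlying_object are hypothetical and are not taken from the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation record; the prototype's metadata layout is not shown. */
typedef struct {
    void  *base;  /* start of the allocated object */
    size_t size;  /* size of the object in bytes */
} alloc_record;

/* Result of a lookup: the enclosing object and the pointer's offset into it. */
typedef struct {
    int    found;   /* 1 if an enclosing object was identified */
    void  *base;
    size_t size;
    size_t offset;  /* byte offset of the pointer into the object */
} object_info;

/* Linear search over all tracked allocations for the one containing ptr. */
static object_info find_underlying_object(const alloc_record *records,
                                          size_t count, const void *ptr) {
    object_info info = {0, NULL, 0, 0};
    uintptr_t p = (uintptr_t)ptr;
    assert(records != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        uintptr_t begin = (uintptr_t)records[i].base;
        if (p >= begin && p - begin < records[i].size) {
            info.found = 1;
            info.base = records[i].base;
            info.size = records[i].size;
            info.offset = (size_t)(p - begin);
            return info;
        }
    }
    /* Not tracked: the caller would pass the pointer through as a plain value. */
    return info;
}
```

With the size and offset known, the whole object can be copied to the host and the host call site can use a pointer at the same offset into the copy.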

3.3 Multi-Team Execution and Kernel Split

The original direct GPU offloading work was unable to predict per-

formance of offloaded versions accurately because the offloaded

host version was run with a single team (or thread block) [26].

Workload distribution in a single team, either explicitly coded with

omp_get_thread_num and omp_get_num_threads or automatically ap-

plied via omp for , will only utilize threads of the team, which is

insufficient for realistic scaling studies on a GPU. However, the

workload of many parallel regions can be executed by multiple

teams without violating the program semantics. To address this, we

implemented a compiler transformation that can identify and con-

vert amenable parallel regions into kernels executed by multiple

teams (or thread blocks). For each such parallel region, the sched-

ule strategy of each automatic work-sharing construct ( omp for ) is

changed to distribute the work across threads in all teams. This is

similar to rewriting an omp for into an omp distribute (parallel)

for . For manual worksharing we replace the query calls for thread

id and number of threads with versions that take the threads of all

blocks into account. In both cases the rewrite ensures the workload

is distributed among all the teams and all the threads in each team.

Similarly, omp barrier constructs need to be replaced by a version

that synchronizes across all teams (or thread blocks). While that is

not allowed by the OpenMP standard, modern GPUs provide means

to achieve this in practice, e.g., via global atomic counters.

For the sequential part of the original application we still utilize a

single team as there is no need for more threads and any additional

team would require special handling to guard against side effects

caused by its initial thread. Whenever the initial thread encounters

a parallel region that has been converted to a multi-team kernel,

we issue an RPC call to launch the kernel from the host with

the same arguments the parallel region would have been given. The

basic idea is to replace an omp parallel with an omp teams , followed

by an omp parallel , and appropriate changes to the parallel region,

especially the worksharing parts and synchronization.
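The manual worksharing rewrite can be illustrated with a small sketch. It assumes that the replaced query calls return a thread id and a thread count computed across all teams; exec_ctx, global_thread_num, global_num_threads, and the sequential emulation of a launch are hypothetical names used only for illustration and do not reflect the prototype's runtime interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical execution context of one GPU thread. */
typedef struct {
    int team_id;           /* index of the team (thread block) */
    int num_teams;         /* number of teams in the kernel */
    int thread_id;         /* thread index inside its team */
    int threads_per_team;  /* threads in each team */
} exec_ctx;

/* Stand-in for the replaced thread id query: ids are contiguous across teams. */
static int global_thread_num(exec_ctx c) {
    assert(c.thread_id >= 0 && c.thread_id < c.threads_per_team);
    assert(c.team_id >= 0 && c.team_id < c.num_teams);
    return c.team_id * c.threads_per_team + c.thread_id;
}

/* Stand-in for the replaced thread count query: counts threads of all teams. */
static int global_num_threads(exec_ctx c) {
    return c.num_teams * c.threads_per_team;
}

/* Manual worksharing as legacy code might express it: a strided loop. */
static void scale_slice(double *data, size_t n, double factor, exec_ctx c) {
    size_t tid = (size_t)global_thread_num(c);
    size_t nthreads = (size_t)global_num_threads(c);
    for (size_t i = tid; i < n; i += nthreads)
        data[i] *= factor;
}

/* Sequential emulation of a multi-team launch, for illustration only. */
static void emulate_multi_team_launch(double *data, size_t n, double factor,
                                      int num_teams, int threads_per_team) {
    for (int team = 0; team < num_teams; ++team)
        for (int t = 0; t < threads_per_team; ++t) {
            exec_ctx c = {team, num_teams, t, threads_per_team};
            scale_slice(data, n, factor, c);
        }
}
```

Because the loop body only sees the global id and count, the same source distributes its iterations over every thread of every team, and each element is processed exactly once.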

Figure 4 illustrates the concepts of multi-team execution and

kernel splitting. On the left-hand side, a single team executes the

main kernel, consisting of one main thread and four worker threads.

Initially, only the main thread runs while the worker threads are

waiting in a state machine, as described in [10]. When a parallel

region is encountered, the worker threads start executing, and the

main thread waits for the region to complete. Once the parallel

region is finished, the main thread resumes the serial part of the

program, while the worker threads wait for their next task. On the

right-hand side, multi-team execution is used. The main kernel has

only one team with one thread. When a parallel region is encoun-

tered, a host RPC call is issued with all the necessary information to

launch a parallel kernel with multiple teams. The main thread waits

for the host RPC to finish before proceeding. In this example, four

teams are launched, each with four threads. These teams are not

taken as separate entities; instead, they are bulked together as one

large team, ensuring that all the threads have continuous thread

IDs, rather than starting from 0 in each team. Once the parallel

kernel is completed, the host RPC call is finished, and the main

thread in the main kernel proceeds with the execution of the serial

part of the program.

3.4 Heap Allocators and Allocation Tracking

We have extended the partial GPU libc implementation by Tian

et al. [26] to provide more functions that can run natively on GPUs

without the need for RPCs. The extensions were guided by bench-

marks and include functions such as strtod , rand , and realloc .

Additionally, we provide our own implementations of malloc that

the user can choose from via the compile time flag:

-fopenmp-target-allocator={generic,balanced[N,M]}

Conference’17, July 2017, Washington, DC, USA

Figure 4: Illustration of multi-team execution and kernel

splitting. The left side shows single team execution with one

main thread and four worker threads. The right side shows

multi-team execution with one main thread in the main

kernel and four teams of four threads each in the parallel

kernel. A host RPC call (① and ③) is used to launch the

parallel kernel (②) with multiple teams.

There are two main reasons for configurable custom allocators.

Firstly, the support for heap allocation on GPUs varies among

vendors, and any one implementation can not handle all situations

optimally. To improve the end-to-end execution time the user needs

to choose an allocator implementation based on their particular

use case. As an example, one of our benchmarks has massively

parallel heap allocations and deallocations at the beginning and

end, respectively, of a parallel region, where serialization can cause sig-

nificant delays. Other benchmarks require large amounts of heap

memory but only use the initial thread to allocate it. This means

the heap space can not be divided into exclusive parts (per thread

and/or team). Secondly, we need to track allocated memory regions

to support the runtime lookup for underlying objects in case they

can not be determined at compile time, as mentioned in Section 3.2.

We provide two types of allocators: a single-thread generic allo-

cator and a “balanced” allocator. The single-thread generic allocator

tracks all allocations in two linked lists: an allocation list and a free

list. Each thread can use the entire heap space if necessary, but

access to the lists has to be mutually exclusive, which can become

a performance bottleneck for applications that allocate heap mem-

ory concurrently. The balanced allocator is designed to mitigate

this limitation. It divides the heap space into 𝑁 × 𝑀 chunks, with

each thread calculating its chunk based on its thread and team id

modulo 𝑁 and 𝑀, respectively. We use a lock per chunk to ensure

consistency, but different chunks are independent. Since the heap

memory each thread can allocate is relatively small, it is possible

to run out of memory in only a specific chunk while others are

mostly empty. As it is common to allocate large heap areas in the

serial execution part of a program, the first chunk of the 𝑁 is larger

than the rest (with a configurable ratio). Since the initial thread is

always the first thread of a warp / wavefront, it has consequently

more heap available when it executes the sequential program parts.
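The chunk selection can be sketched as below. The text only states that the thread id is taken modulo N and the team id modulo M; the row-major flattening into a single index and the name chunk_index are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Maps a (thread id, team id) pair to one of n * m chunks.
 * Assumed layout: chunks of the same team slot are stored contiguously. */
static size_t chunk_index(unsigned thread_id, unsigned team_id,
                          unsigned n, unsigned m) {
    assert(n > 0 && m > 0);
    return (size_t)(team_id % m) * n + (thread_id % n);
}
```

With 32 thread slots and 16 team slots, thread 33 of team 17 shares a chunk with thread 1 of team 1, which is why a lock per chunk is still required.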

The balanced allocator differs from the generic allocator in that

it embeds indices/pointers into the allocation metadata rather than

using explicit linked lists. Figure 5 illustrates this concept for a sin-

gle chunk. The top row displays three allocations that are currently

in use, indicated by the grid pattern over the user data. When the

S. Tian, T. Scogland, et al.

Figure 5: Visualization of one chunk in the balanced allocator.

Top: Encoding of three user allocated entries all currently in

use, as indicated by the missing grid pattern on the user data.

Middle: Situation after the second entry has been deallocated.

The encoding is not changed to speed up deallocation. Alloca-

tion remains fast as long as sufficient heap space is available

to add an entry on top. Bottom: Situation after the former top

entry has been deallocated. As long as the top entry is unused

the space is reclaimed, which makes the scheme especially

suitable for balanced allocations and deallocations.

middle entry is deallocated, it is marked as such but the now-free

memory is kept in place and not referenced elsewhere. To reuse pre-

viously deallocated memory regions that have not been reclaimed,

we need to traverse the list until a suitable entry is found, which can

be costly in practice. Thus, we avoid reusing allocations until we

exhaust the heap space in this chunk. We reclaim the top allocation

by moving the watermark pointer to the end of the previous entry

whenever the top allocation is no longer in use. The newly formed

top may also be reclaimed if it was previously deallocated. The

third row of the figure illustrates the situation after the top entry

was deallocated and the two top entries have been reclaimed.

While this scheme is not a replacement for a more generic al-

locator, it is well-suited for applications with balanced allocations

and deallocations in terms of their lifetime since we can reclaim

memory with minimal overhead during the allocation and dealloca-

tion calls. In the worst case, interleaved allocations with different

lifetimes cause holes and costly linear traversals as soon as the heap

space runs out. However, if we do not run out of heap space, there is

likely only minor performance degradation due to fragmentation.
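The watermark scheme for one chunk can be sketched as follows. This is a simplified illustration under stated assumptions: the header layout (entry_header), the fixed capacity, the 16-byte alignment, and all function names are hypothetical, locking is omitted, and the costly traversal that reuses holes once the chunk is exhausted is not shown (the sketch returns NULL instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_CAPACITY 1024u
#define ALIGNMENT 16u
#define NO_ENTRY SIZE_MAX

/* Metadata embedded in front of every allocation (assumed layout). */
typedef struct {
    size_t prev;    /* offset of the previous entry header, or NO_ENTRY */
    size_t size;    /* payload size in bytes, rounded up */
    int    in_use;  /* cleared on deallocation, entry stays in place */
} entry_header;

typedef struct {
    _Alignas(max_align_t) unsigned char heap[CHUNK_CAPACITY];
    size_t watermark;  /* offset of the first unused byte */
    size_t top_entry;  /* offset of the top entry header, or NO_ENTRY */
} chunk;

static size_t round_up(size_t v) {
    return (v + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
}

static entry_header load_header(const chunk *c, size_t off) {
    entry_header e;
    memcpy(&e, c->heap + off, sizeof e);
    return e;
}

static void store_header(chunk *c, size_t off, entry_header e) {
    memcpy(c->heap + off, &e, sizeof e);
}

static void chunk_init(chunk *c) {
    c->watermark = 0;
    c->top_entry = NO_ENTRY;
}

/* Fast path only: place a new entry on top of the watermark. */
static void *chunk_alloc(chunk *c, size_t size) {
    size_t header = round_up(sizeof(entry_header));
    if (size > CHUNK_CAPACITY) return NULL;
    size_t need = header + round_up(size);
    if (need > CHUNK_CAPACITY - c->watermark) return NULL;
    entry_header e = {c->top_entry, round_up(size), 1};
    store_header(c, c->watermark, e);
    c->top_entry = c->watermark;
    c->watermark += need;
    return c->heap + c->top_entry + header;
}

/* Mark the entry unused, then reclaim every unused entry at the top. */
static void chunk_free(chunk *c, void *ptr) {
    size_t header = round_up(sizeof(entry_header));
    size_t off = (size_t)((unsigned char *)ptr - c->heap) - header;
    entry_header e = load_header(c, off);
    assert(e.in_use);
    e.in_use = 0;
    store_header(c, off, e);
    while (c->top_entry != NO_ENTRY) {
        entry_header top = load_header(c, c->top_entry);
        if (top.in_use) break;
        c->watermark = c->top_entry;  /* move the watermark back */
        c->top_entry = top.prev;
    }
}
```

Freeing a middle entry leaves the watermark untouched, while freeing the top entry afterwards reclaims both entries in one pass, which matches the three rows described for the figure.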

4 LIMITATIONS AND FUTURE WORKS

GPU First is a proof-of-concept implementation that showcases how

advancements in GPUs, coupled with modern compiler technology,

can simplify the approach to GPU programming significantly. The

conventional restrictions, such as the absence of recursion, lack of

atomic accesses, and so on, that have led to the current design of

offload languages should be reconsidered, and alternative methods,

such as GPU First, should be explored. Despite its potential, there

are various technical challenges that GPU First needs to overcome,

which we will discuss briefly.

4.1 Multiple Levels of Indirection

In Section 3.2 we described that we move underlying objects to

and from the host when a pointer to them is used in an RPC call.

This allows the library function to access the object, e.g., to write

the result from a file I/O operation into that memory. However, we

do not yet try to move more than one level of memory which pre-

vents the host function from accessing objects through indirection.

As an example, a host function might be passed an int** and we

would migrate the object the outer pointer points to automatically.

Thus, when the initial pointer (after translation to the host value)

is dereferenced, the migrated memory is accessed. If, however, the

resulting int* is accessed by the host, the value will likely point to

device memory. While it is possible to move and update pointers

for multiple levels, the precision and efficiency of the approach will

depend on the availability of domain knowledge about accesses.

Annotated library headers, as generated by the “HTO” [17], would

likely make this feasible in practice. A system with unified shared

memory would not encounter this problem at all.
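The single-level limitation can be illustrated with ordinary host memory standing in for device memory. The helper migrate_one_level is hypothetical and only mimics the effect of moving the object the outer pointer refers to; it is not the prototype's migration code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies only the object `outer` points to, here an array of `count`
 * pointers. The pointers stored inside are copied bit for bit and are
 * not translated, so they still refer to the original memory. */
static int **migrate_one_level(int *const *outer, size_t count) {
    int **copy = malloc(count * sizeof *copy);
    assert(copy != NULL);
    memcpy(copy, outer, count * sizeof *copy);
    return copy;
}
```

Dereferencing the returned pointer once reads migrated memory, but each stored int* still holds the original address, which on a system without unified shared memory would be a device address the host cannot access.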

4.2 Reverse Offloading of Code

In our prototype we only execute host landing-pad functions gener-

ated during compilation. However, we plan to extend this capability

in the future to allow for the execution of other host code, such

as when a function pointer is passed to the host via an RPC or

when an object method is invoked as part of an RPC. The first step

would be to generate potentially host executed code for the host

as well, which could be as simple as generating all code for the

host and the GPU. In the second step we need to translate func-

tion pointers from the device to the host value when objects are

moved from the device to the host, or alternatively when a fault

is caused while trying to execute code through a device function

pointer. If objects or function pointers are created on the host as

part of an RPC, the reverse procedure is required. This shortcoming

is less severe in legacy C code but shows when C++ objects are

created on the device and used on the host as their virtual table

contains both an additional level of indirection and pointers to

device-only code. That said, C++ objects that are only used on the

device are already supported, including virtual function calls and

other inheritance-related features.

While the above limits the applicability of the GPU First method-

ology, there are related opportunities to improve performance in

the presence of code regions that should be executed on the host.

So far, only single library calls are issued on the host; however,

entire code regions in the original application could benefit from

execution on the host. The applicability limitations discussed so far

notwithstanding, we could outline the region and treat the code as if

it was originally in an external library. As such, our RPC generation

would take care of the call and (single-level) memory movement.

4.3 Multi-Team Execution with Communication

Another limitation of our work is that we only rewrite certain parts

of the code to support multi-team execution in our prototype. For

example, we change the work-sharing schedule and make sure the

user-observed thread IDs are continuous across the threads in the

different teams (see Figure 4). While we do not yet rewrite inter-

thread communication, such as reduction clauses, most common

cases could be handled through additional engineering effort.

4.4 Single-Threaded RPC Handling

Our prototype features single-threaded RPC handling which can, in

parallel regions, significantly impact performance. However, since

multi-threaded RPC schemes can be implemented, this is not a

conceptual limitation and will also not influence our benchmarks

that do not issue RPC calls from inside parallel regions.

5 EVALUATION

To evaluate the performance of our approach, we used a system

comprising an NVIDIA A100 Tensor Core GPU (40GB) with AMD

EPYC 7532 processors (32 cores and hyper-threading disabled) and

256 GB DDR4 RAM. We performed all experiments using CUDA

11.8.0 and compiled all benchmarks using the -O3 optimization flag.

Our prototype version is based on LLVM trunk ( 50f1476a ).

5.1 Allocator Performance

We believe an application should pick an allocator based on its

specific needs. Some of the evaluated SPEC OMP benchmarks con-

currently allocate many memory regions at the beginning of a

parallel region, and deallocate them again at the end of the region.

To best support this scheme we introduced the balanced allocator

described in Section 3.4. Figure 6 shows how it performs on a syn-

thetic benchmark in which all threads in all teams allocate memory

at the beginning of the kernel, use it briefly, and then deallocate it

again. This design is an exaggeration of the SPEC OMP benchmarks

to stress test allocators. On our test platform the domain-specific

balanced allocator is between 3.3× (1 thread, 1 team) and 30× (32

threads, 256 teams) faster than the default NVIDIA-provided malloc .

Figure 6: Comparison of the performance between the

NVIDIA-provided malloc and our domain-specific balanced

allocator with 32 thread slots and 16 team slots (refer to Sec-

tion 3.4). The benchmark is an exaggeration of the allocation

scheme in some SPEC OMP benchmarks, where memory is

allocated and deallocated at the beginning and end of a par-

allel region with all threads. The x axis is the number of threads

per team; the legend lists 1, 32, 64, 128, and 256 teams for the

balanced allocator and for NVIDIA malloc.

5.2 RPC Performance

To measure the overhead of an RPC call, we conducted a profiling

experiment where we called the fprintf function 1000 times with

the following arguments: fprintf(stderr, "fread reads: %s.\n",

buffer) . In this example, buffer points to a 128 byte array that has

to be copied back and forth as the read/write behavior of fprintf

arguments is unknown (without inspecting the format string).

Figure 7: Visualization of the time (avg. total 975 microsec-

onds) spent in different stages while resolving an fprintf RPC.

The average device time spent per RPC was 975 microseconds.

The distribution of this time is visualized in Figure 7. From

left to right, the following stages are traversed on the device (top

part): 1) 0.1% of the overall time is spent on initializing the RPC

argument information ( RPCArgInfo in Figure 3c). 2) 9.1% of the

time is spent on identifying the underlying objects of the three pointer

arguments, which includes copying the format string and buffer to

an RPC buffer where the host can access them (managed memory,

as described in Section 2). 3) The device thread spends 89% of the

time waiting for the host to act on the request and acknowledge

that it has been performed. 4) 1.8% of the time is spent copying the

data from the RPC buffer back to the buffer . On the host, the time

is spent like this, again from left to right: 1) 2% of the overall time is

spent on copying the RPC information ( RPCInfo in Figure 3b) to the

host. 2) 3.5% of the time is spent invoking the host wrapper, which

in turn calls the actual fprintf function and sets up the return

value. 3) 5.4% of the time is spent on copying the RPCInfo object back

to the device and notifying completion. This notification is done

by setting an integer value to 0 that is in managed memory and is

also accessible to the device. 4) The remaining 89.1% of the time is

the gap between when the host notifies completion and when the

device receives the notification. This gap occurs because threads in

a GPU kernel that is already running are not guaranteed to see the

updates to memory done by the CPU or other devices in a specific

order and within a specific time interval [6, 21].
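The reported shares can be turned into absolute times with a few lines of arithmetic. The sketch below uses only the percentages and the 975 microsecond average quoted above, reading the host-side value of 5.4 as 5.4%; the array and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Stage shares in percent, in the order listed in the text. */
static const double device_percent[] = {0.1, 9.1, 89.0, 1.8};
static const double host_percent[]   = {2.0, 3.5, 5.4, 89.1};

/* Converts a share of the average RPC time into microseconds. */
static double stage_time_us(double total_us, double percent) {
    assert(percent >= 0.0 && percent <= 100.0);
    return total_us * percent / 100.0;
}

/* Sums the shares of one side, as a consistency check. */
static double sum_percent(const double *percent, size_t count) {
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i)
        sum += percent[i];
    return sum;
}
```

Both sides add up to 100%, and the device-side wait of 89% corresponds to roughly 868 of the 975 microseconds, so the notification latency dominates the cost of a single RPC.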

5.3 Parallel Region Modeling

In the following we compare the performance obtained via GPU

First compilation of CPU code against manual offload versions of

various parallel regions to determine if the proposed methodology

is suitable to guide porting efforts.

5.3.1 XSBench and RSBench. The two OpenMC proxy applications

XSBench [29] (v20) and RSBench [28] (v13) are implemented in dif-

ferent parallel programming models, including OpenMP threading

for the CPU, and OpenMP offload for the GPU. In the former, two al-

ternative methods are available to perform the cross-section lookup

as part of the neutron transport simulation: event-based lookup

and history-based lookup.

Figure 8: Performance of different GPU versions of the OpenMC proxy applications XSBench and RSBench compared to their

respective CPU counterpart. (a) Performance of the compute kernel of XSBench relative to

the CPU version. (b) Performance of the compute kernel of RSBench relative to

the CPU version. Legend: GPU First (history, small), GPU First (history, large),

GPU First (event, small), GPU First (event, large), manual offload (event, small),

manual offload (event, large).

In the offloading version history-based

mode was not implemented but we can test it out with the GPU First

methodology using the CPU implementation. The results for both

benchmarks and two different input sizes are shown in Figure 8a

and Figure 8b, respectively. For the small input size, history mode

is actually outperforming the event mode on the GPU. However,

with the large input size event mode has caught up (RSBench), or

even surpassed (XSBench), history mode. These results validate the

choice of event-based mode for the offloading implementation. The

second insight from the evaluation can be derived by comparing the

event-based results obtained via the manually offloaded version and

the GPU First version. For the small input the GPU First versions are

likely to benefit from cache re-use as the data initialization is also

performed on the GPU. However, with the large input the two re-

sults are a close match. Thus, performance predictions obtained via

GPU First and the original CPU-only version would have provided

accurate guidance for a potential manual port to the GPU.

5.3.2 HeCBench: Interleaved. The HeCBench [12] “interleaved” mi-

cro benchmark originated from Cook [4] and shows how different

memory access patterns behave on the CPU and GPU. We timed the

parallel region with interleaved memory accesses (array-of-struct

inputs) as well as the one with non-interleaved accesses (struct-

of-array inputs) on the CPU and GPU. The results, expressed as

speedups and slowdowns of the GPU version, are shown in Fig-

ure 9a. While the GPU First version shows the same tendency as

the manually offloaded version, we needed to explicitly match the

number of teams to perfectly match the result with our automati-

cally offloaded parallel regions.

5.3.3 HeCBench: Hypterm. The HeCBench “hypterm” micro bench-

mark is a complex stencil operation that originated from the Ex-

pCNS Compressible Navier-Stokes mini-application [24] and was

extracted by Rawat et al. [22]. The GPU version in HeCBench

contains three kernels which we transformed into three parallel

regions for the CPU. The results of the GPU version and the GPU

First version relative to the CPU are shown in Figure 9b. While the

original GPU version is slightly slower, the overall performance

behavior matches the GPU First prediction.

5.3.4 HeCBench: AMGmk and Page-Rank. Figure 9c shows the re-

sults obtained for the AMGmk and page-rank benchmarks. The first

measures only the relax kernel of the original AMGmk proxy ap-

plication [13]. The second is an implementation of the page-rank

algorithm for graphs in which the propagation step is measured.

5.3.5 SPEC OMP: 358.botsalgn and 359.botsspar. These are two

task-based benchmarks [9] from the SPEC OMP 2012 suite [18].

The former performs sequence alignment while the latter is a sparse

LU decomposition. They are parallelized with different OpenMP

tasking strategies. Figures 10a and 10b show the performance

of GPU First relative to the CPU. Since LLVM/OpenMP does not

support tasking on GPUs, tasks are executed immediately by the

encountering thread. This limitation severely affects the GPU per-

formance of these benchmarks.

In the case of 358.botsalgn, sequences are distributed across

multiple threads through an outer parallel region. Each thread

spawns several tasks that perform the alignment. Since the num-

ber of sequences is smaller than the number of CPU cores, threads

not involved in the work sharing can execute the spawned tasks

concurrently. However, on the GPU only a small number of threads

(equal to the number of sequences) are executing concurrently.

Similarly, in 359.botsspar , one thread creates tasks while the

other threads in the parallel region execute them. This pattern of

execution is equivalent to serial execution in our approach. To en-

able parallelism for this benchmark, we rewrote the task regions

by removing the task construct and adding a parallel for con-

struct on the outer parallel region. The results shown in Figure 10b

represent the threaded parallelism version of the benchmark. The

observed slowdown can be attributed to the lack of sufficient se-

quences to fully exploit the massive parallelism that GPUs offer,

similar to the issue observed in benchmark 358.botsalgn. Nev-

ertheless, our GPU First scheme allows application developers to

explore different parallelism on GPUs without much burden.

Figure 9: Comparison of micro benchmarks performance

results for a parallel region compiled with GPU First to the

GPU, and the manually offloaded counterpart, relative to

the corresponding CPU parallel region. The matching teams

column for GPU First uses the same number of teams as the

manually offloaded version. The legend at the top of the

figure is shared among all plots. Legend: GPU First (matching teams),

GPU First (1024 teams), manual offload.

(a) Relative performance of the two parallel regions in the inter-

leaved benchmark (from HeCBench) when executed on the GPU

instead of the CPU. The first (non-interleaved) uses a struct-of-array

layout while the second (interleaved) uses an array-of-struct layout.

(b) Relative performance of the three parallel regions (PR1, PR2, PR3)

in the hypterm micro benchmark (from HeCBench) when executed

on the GPU instead of the CPU.

(c) AMGmk and page-rank micro benchmark (from HeCBench) when

executed on the GPU instead of the CPU.

Figure 10: Relative performance results for the end-to-end

execution and timed parallel regions in the three SPEC OMP

2012 benchmarks when executed on the GPU instead of the

CPU. The legend at the top is shared among all plots. Legend:

parallel region, end-2-end.

(a) 358.botsalgn. The x axis is the number of input sequences. This

benchmark distributes sequences across multiple threads through

an outer parallel region, where each thread spawns several OpenMP

tasks to execute the pair alignment algorithm.

(b) 359.botsspar uses one thread that creates tasks while the other threads

in the parallel region execute them. The x axis is the size of matrix

and submatrix used in the benchmark.

(c) 372.smithwa. [...] load is manually distributed among multiple threads and

threads communicate with each other using a producer-

consumer model via shared variables followed by barriers.

It is important to note that the lack of tasking support is not a

limitation of our proposed scheme, but rather a limitation of the

current LLVM/OpenMP implementation for GPUs. If tasking is

properly supported on the GPU, and there are a sufficient number

of sequences, the massive parallelism of a GPU has the potential to

make up for the performance difference between a CPU and a GPU

thread. While this means advancements in GPU tasking support

could in the future improve performance of these codes on the GPU,

the current results clearly indicate that a GPU port would require a

different parallelization strategy.

5.3.6 SPEC OMP: 372.smithwa. 372.smithwa implements the Smith-

Waterman algorithm for sequence alignment and is characterized

by a large number of nested loops with indirect memory accesses.

In this benchmark, the workload is first distributed to multiple

threads, which maps well to the GPU. However, the threads com-

municate with each other using a producer-consumer model via

shared variables followed by barriers. This form of communication

is conceptually inefficient on GPUs. Figure 10c shows the perfor-

mance of the GPU First approach relative to the CPU. As the input

size is increased the relative performance is at first stable, indicating

good scalability on the GPU. However, when the sequence length

hits 26 we can observe exponentially growing slowdown compared

to the CPU execution. Consequently, this benchmark is another

example of an algorithm that is inefficient on the GPU, requiring a

rewrite as part of the porting effort. It is worth noting that without

the balanced allocator the performance is dominated by the mas-

sively parallel allocations and deallocations at the beginning and

end of the parallel region, respectively.

6 RELATED WORKS

6.1 GPU Execution of CPU Programs

Several prior works have explored the execution of host programs

on GPUs, including Silberstein et al. [23], who proposed direct

access to the host’s file system from GPU code and implemented an

RPC protocol to manage data transfers between the CPU and GPU.

Damschen et al. [5] investigated transparent acceleration of binary

applications using heterogeneous computing resources, without the

need for manual porting or developer-provided hints. Meanwhile,

Matsumura et al. [14] proposed an automated stencil framework

that can automatically transform and optimize stencil patterns in

a given C source code, and generate corresponding CUDA code.

These works mainly focused on identifying and/or generating parts

of the host program to run on GPUs.

Mikushin et al. [16] introduced a parallelization framework that

detects parallelism and generates target code for both X86 CPUs

and NVIDIA GPUs. To support functions that can not be natively

executed on GPUs, they replaced function calls in LLVM with an

interface that ultimately results in the host executing the requested

function using a foreign function interface (FFI). However, our

approach differs in two ways. First, instead of relying on FFI, our

compiler transformation generates the host wrapper, which restores

the call site on the host. Second, in their framework, GPU addresses

are used directly on the host, which leads to segmentation faults

when the host tries to access an address. A signal handler for seg-

mentation faults maps the GPU memory pages into CPU tables and

copies input data. However, this memory management subsystem

does not work if the memory buffer is on the stack, such as when

a local variable is used in host RPCs. Jablin et al. [11] proposed a

system for managing and optimizing CPU-GPU communication

that is fully automatic. Their system includes a run-time library and

a set of compiler transformations that work together to manage and

optimize communication between the CPU and GPU. Unlike other

approaches, this system does not rely on the strength of static compile-

time analyses or on programmer-supplied annotations. Our pointer

argument analysis shares a similar design to their work.

Tian et al. [26] were the first to attempt to run the entire host pro-

gram on a GPU. They proposed using OpenMP target offloading to

leverage the portability of compiling and running host applications

on a GPU. However, their approach requires application developers

to provide the wrapper function on both the host and device side, ei-

ther manually or through scripts. They also do not support variadic

functions, and their paper shows severe performance regression

due to single-team execution limitations.

6.2 OpenMP Target Offloading

In recent years, researchers have explored compiler and runtime op-

timization for OpenMP since OpenMP 4.0 introduced target offload-

ing. Bertolli et al. presented two works [2, 3] that enabled OpenMP

offloading to GPUs in LLVM. Flang, the PGI Fortran front-end, also

supports OpenMP offloading via the LLVM OpenMP runtime [19].

Antão et al. [1] introduced front-end-based optimizations for NVIDIA

GPUs that can reduce register usage and avoid idle threads. Doerfert

et al. [7] presented the TRegion interface, which supports more ker-

nels to execute in SPMD mode. Tian et al. [25] introduced runtime

support for concurrent execution of OpenMP target tasks. Yviquel

et al. [30] presented a framework for using the OpenMP program-

ming model in distributed memory environments. It provides a way

to program clusters of shared memory machines with a hybrid ap-

proach that combines OpenMP directives and MPI communication.

Huber et al. [10] presented OpenMP-aware program analyses and

optimizations that allow efficient execution of CPU-centric paral-

lelism on GPUs. Ozen and Wolfe [20] introduced a fully descriptive

model and demonstrated its benefits with an implementation of the

loop directive on NVIDIA GPUs. Doerfert et al. [8] presented a co-

design methodology for optimizing applications using an OpenMP

GPU runtime with near-zero overhead, on top of which our device

side host RPC support and partial libc implementation were built.

CONCLUSION

In this paper, we introduced a novel compilation scheme called

“GPU First” that enables automatic compilation of legacy CPU ap-

plications directly for GPUs without requiring any modification to

the application source. Our approach simplifies the task of identify-

ing code regions that can benefit from acceleration and facilitates

rapid testing of code modifications on real GPU hardware, thereby

making GPUs easily accessible to non-experts. We evaluated our

approach on two proxy applications, four micro benchmarks, and

three SPEC OMP 2012 codes with CPU parallelism to demonstrate

the simplicity of porting host applications to the GPU. Our approach

closely matched the performance of corresponding manually offloaded

kernels, with up to 14.36× speedup on the GPU.

Our evaluation further validates that the GPU First methodology

can effectively guide porting efforts and identify parallel regions

that require reorganization to achieve good scaling behavior on the

GPU. Overall, our proposed approach offers a simple and efficient

solution for porting legacy CPU applications to GPUs, enabling

non-experts to access GPUs and facilitating faster development

and porting, which will lead to more efficient use of resources.

ACKNOWLEDGEMENT

This research was supported by the Exascale Computing Project (17-

SC-20-SC), a collaborative effort of two U.S. Department of Energy

organizations (Office of Science and the National Nuclear Security

Administration) responsible for the planning and preparation of a

capable exascale ecosystem, including software, applications, hard-

ware, advanced system engineering, and early testbed platforms, in

support of the nation’s exascale computing imperative. The views

and opinions of the authors do not necessarily reflect those of the

U.S. government or Lawrence Livermore National Security, LLC

neither of whom nor any of their employees make any endorse-

ments, express or implied warranties or representations or assume

any legal liability or responsibility for the accuracy, completeness,

or usefulness of the information contained herein. This work was

prepared in part by Lawrence Livermore National Laboratory under

Contract DE-AC52-07NA27344 (LLNL-CONF-827970). We also

gratefully acknowledge the computing resources provided and op-

erated by the Joint Laboratory for System Evaluation at Argonne

National Laboratory.

REFERENCES

[1] Samuel F. Antão, Alexey Bataev, Arpith C. Jacob, Gheorghe-Teodor Bercea,

Alexandre E. Eichenberger, Georgios Rokos, Matt Martineau, Tian Jin, Guray

Ozen, Zehra Sura, Tong Chen, Hyojin Sung, Carlo Bertolli, and Kevin O’Brien.

2016. Offloading Support for OpenMP in Clang and LLVM. In Workshop on the

LLVM Compiler Infrastructure in HPC (LLVM-HPC@SC), November 14, 2016. IEEE

Computer Society, Salt Lake City, UT, USA, 1–11. https://doi.org/10.1109/LLVM-

HPC.2016.006

[2] Carlo Bertolli, Samuel Antão, Gheorghe-Teodor Bercea, Arpith C. Jacob, Alexan-

dre E. Eichenberger, Tong Chen, Zehra Sura, Hyojin Sung, Georgios Rokos, David

Appelhans, and Kevin O’Brien. 2015. Integrating GPU support for OpenMP of-

floading directives into Clang. In Workshop on the LLVM Compiler Infrastructure

in HPC (LLVM-HPC@SC), November 15, 2015. ACM, Austin, Texas, USA, 5:1–5:11.

https://doi.org/10.1145/2833157.2833161

[3] Carlo Bertolli, Samuel Antão, Alexandre E. Eichenberger, Kevin O’Brien, Zehra

Sura, Arpith C. Jacob, Tong Chen, and Olivier Sallenave. 2014. Coordinating GPU

threads for OpenMP 4.0 in LLVM. In Workshop on the LLVM Compiler Infrastruc-

ture in HPC (LLVM-HPC@SC), November 17, 2014. IEEE Computer Society, New

Orleans, LA, USA, 12–21. https://doi.org/10.1109/LLVM-HPC.2014.10

[4] Shane Cook. 2012. CUDA Programming: A Developer’s Guide to Parallel Computing

with GPUs. Newnes, San Francisco, CA, USA.

[5] Marvin Damschen, Heinrich Riebler, Gavin Vaz, and Christian Plessl. 2015. Trans-

parent offloading of computational hotspots from binary code to Xeon Phi. In De-

sign, Automation & Test in Europe Conference & Exhibition (DATE) March 9-13, 2015.

ACM, Grenoble, France, 1078–1083. http://dl.acm.org/citation.cfm?id=2757063

[6] Feras Daoud, Amir Wated, and Mark Silberstein. 2016. GPURDMA: GPU-Side

Library for High Performance Networking from GPU Kernels. In International

Workshop on Runtime and Operating Systems for Supercomputers (ROSS), June 1,

2016. ACM, Kyoto, Japan, 6:1–6:8. https://doi.org/10.1145/2931088.2931091

[7] Johannes Doerfert, Jose Manuel Monsalve Diaz, and Hal Finkel. 2019. The

TRegion Interface and Compiler Optimizations for OpenMP Target Regions. In

International Workshop on OpenMP (IWOMP), September 11-13, 2019, Vol. 11718.

Springer, Auckland, New Zealand, 153–167. https://doi.org/10.1007/978-3-030-

28596-8_11

[8] Johannes Doerfert, Atmn Patel, Joseph Huber, Shilei Tian, Jose Manuel Monsalve

Diaz, Barbara M. Chapman, and Giorgis Georgakoudis. 2022. Co-Designing an

OpenMP GPU Runtime and Optimizations for Near-Zero Overhead Execution.

In International Parallel and Distributed Processing Symposium (IPDPS), May 30 - June 3, 2022. IEEE, Lyon, France, 504–514. https://doi.org/10.1109/IPDPS53621.2022.00055

[9] Alejandro Duran, Xavier Teruel, Roger Ferrer, Xavier Martorell, and Eduard Ayguadé. 2009. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In International Conference on Parallel Processing (ICPP), 22-25 September 2009. IEEE Computer Society, Vienna, Austria, 124–131. https://doi.org/10.1109/ICPP.2009.64

[10] Joseph Huber, Melanie Cornelius, Giorgis Georgakoudis, Shilei Tian, Jose Manuel Monsalve Diaz, Kuter Dinel, Barbara M. Chapman, and Johannes Doerfert. 2022. Efficient Execution of OpenMP on GPUs. In International Symposium on Code Generation and Optimization (CGO), April 2-6, 2022. IEEE, Seoul, Republic of Korea, 41–52. https://doi.org/10.1109/CGO53902.2022.9741290

[11] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August. 2011. Automatic CPU-GPU communication management and optimization. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 4-8, 2011. ACM, San Jose, CA, USA, 142–151. https://doi.org/10.1145/1993498.1993516

[12] Zheming Jin. 2023. HeCBench. https://github.com/zjin-lcf/HeCBench

[13] Lawrence Livermore National Laboratory. 2023. CORAL Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/

[14] Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, and Satoshi Matsuoka. 2020. AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs. In International Symposium on Code Generation and Optimization (CGO), February, 2020. ACM, San Diego, CA, USA, 199–211. https://doi.org/10.1145/3368826.3377904

[15] Florian Mayer, Marius Knaust, and Michael Philippsen. 2019. OpenMP on FPGAs - A Survey. In International Workshop on OpenMP (IWOMP), September 11-13, 2019, Vol. 11718. Springer, Auckland, New Zealand, 94–108. https://doi.org/10.1007/978-3-030-28596-8_7

[16] Dmitry Mikushin, Nikolay Likhogrud, Eddy Z. Zhang, and Christopher Bergstrom. 2014. KernelGen - The Design and Implementation of a Next Generation Compiler Platform for Accelerating Numerical Models on GPUs. In International Parallel & Distributed Processing Symposium Workshops (IPDPSW), May 19-23, 2014. IEEE Computer Society, Phoenix, AZ, USA, 1011–1020. https://doi.org/10.1109/IPDPSW.2014.115

[17] William S. Moses and Johannes Doerfert. 2020. “Header Time Optimization”: Cross-Translation Unit Optimization via Annotated Headers. In LLVM Performance Workshop at CGO, Feb 23, 2020. San Diego, CA, USA, 1–1. https://c.wsmoses.com/posters/HTO_Poster.pdf

[18] Matthias S. Müller, John Baron, William C. Brantley, Huiyu Feng, Daniel Hackenberg, Robert Henschel, Gabriele Jost, Daniel Molka, Chris Parrott, Joe Robichaux, Pavel Shelepugin, G. Matthijs van Waveren, Brian Whitney, and Kalyan Kumaran. 2012. SPEC OMP2012 - An Application Benchmark Suite for Parallel Systems Using OpenMP. In International Workshop on OpenMP (IWOMP), June 11-13, 2012, Vol. 7312. Springer, Rome, Italy, 223–236. https://doi.org/10.1007/978-3-642-30961-8_17

[19] Güray Özen, Simone Atzeni, Michael Wolfe, Annemarie Southwell, and Gary Klimowicz. 2018. OpenMP GPU Offload in Flang and LLVM. In Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC@SC), November 13, 2018. IEEE, Dallas, TX, USA, 1–9. https://doi.org/10.1109/LLVM-HPC.2018.8639434

[20] Guray Ozen and Michael Wolfe. 2022. Performant Portable OpenMP. In ACM SIGPLAN International Conference on Compiler Construction (CC), April 2 - 3, 2022. ACM, Seoul, South Korea, 156–168. https://doi.org/10.1145/3497776.3517780

[21] Sreeram Potluri, Anshuman Goswami, Davide Rossetti, Chris J. Newburn, Manjunath Gorentla Venkata, and Neena Imam. 2017. GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. In International Conference on High Performance Computing (HiPC), December 18-21, 2017. IEEE Computer Society, Jaipur, India, 253–262. https://doi.org/10.1109/HiPC.2017.00037

[22] Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2018. Register optimizations for stencils on GPUs. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 24-28, 2018. ACM, Vienna, Austria, 168–182. https://doi.org/10.1145/3178487.3178500

[23] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. 2013. GPUfs: Integrating A File System with GPUs. In Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 16-20, 2013. ACM, Houston, TX, USA, 485–498. https://doi.org/10.1145/2451116.2451169

[24] ExaCT team. 2013. ExaCT: Center for exascale simulation of combustion in turbulence. http://exactcodesign.org

[25] Shilei Tian, Johannes Doerfert, and Barbara M. Chapman. 2020. Concurrent Execution of Deferred OpenMP Target Tasks with Hidden Helper Threads. In Languages and Compilers for Parallel Computing (LCPC), October 14-16, 2020, Vol. 13149. Springer, Stony Brook, NY, USA, 41–56. https://doi.org/10.1007/978-3-030-95953-1_4

[26] Shilei Tian, Joseph Huber, Konstantinos Parasyris, Barbara M. Chapman, and Johannes Doerfert. 2022. Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism. In Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC@SC), November 13-18, 2022. IEEE, Dallas, TX, USA, 43–51. https://doi.org/10.1109/LLVM-HPC56686.2022.00010

[27] Shilei Tian, Joseph Huber, John R. Tramm, Barbara M. Chapman, and Johannes

Doerfert. 2022. Just-in-Time Compilation and Link-Time Optimization for

OpenMP Target Offloading. In International Workshop on OpenMP (IWOMP),

September 27-30, 2022, Vol. 13527. Springer, Chattanooga, TN, USA, 145–158.

https://doi.org/10.1007/978-3-031-15922-0_10

[28] John R. Tramm, Andrew R. Siegel, Benoit Forget, and Colin Josey. 2014. Per-

formance Analysis of a Reduced Data Movement Algorithm for Neutron Cross

Section Data in Monte Carlo Simulations. In International Conference on Exascale

Applications and Software (EASC), April 2-3, 2014, Vol. 8759. Springer, Stockholm,

Sweden, 39–56. https://doi.org/10.1007/978-3-319-15976-8_3

[29] John R Tramm, Andrew R Siegel, Tanzima Islam, and Martin Schulz. 2014.

XSBench - The Development and Verification of a Performance Abstraction

for Monte Carlo Reactor Analysis. In International Conference on Physics of

Reactors (PHYSOR), September 28 - October 3, 2014. JAEA, Kyoto, Japan, 1–12.

http://dx.doi.org/10.11484/jaea-conf-2014-003

[30] Hervé Yviquel, Marcio Pereira, Emilio Francesquini, Guilherme Valarini, Gustavo

Leite, Pedro Henrique Di Francia Rosso, Rodrigo Ceccato, Carla Cusihualpa,

Vitoria Dias, Sandro Rigo, Alan Souza, and Guido Araujo. 2022. The OpenMP

Cluster Programming Model. In Workshop of the International Conference on

Parallel Processing (ICPP), 29 August 2022 - 1 September 2022. ACM, Bordeaux,

France, 17:1–17:11. https://doi.org/10.1145/3547276.3548444