DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime
Alberto Parravicini, Arnaud Delamare, Marco Arnaboldi, Marco D. Santambrogio
Alberto Parravicini
Politecnico di Milano
Milan, [email protected]
Arnaud Delamare
Oracle Labs
Zürich, [email protected]
Marco Arnaboldi
Oracle Labs
Zürich, [email protected]
Marco D. Santambrogio
Politecnico di Milano
Milan, [email protected]
Abstract—GPUs are readily available in cloud computing and personal devices, but their use for data processing acceleration has been slowed down by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming. In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space-sharing, and transfer-computation overlap without requiring any information about the program dependency structure in advance. We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task-parallelism and show an average of 44% speedup against synchronous execution, with no execution time slowdown compared to hand-optimized host code written using the C++ CUDA Graphs API.
Index Terms—GPU, Scheduling, Software Runtime, Hardware Acceleration
I. INTRODUCTION
Graphics Processing Units (GPUs) are often heralded as the optimal solution to achieve extremely high throughputs in domains such as Deep Learning, financial simulations, and graph analytics. Even though GPUs are made available by most cloud providers and are commonly found in personal devices, using GPUs for data processing acceleration has been hampered by their limited integration with high-level programming languages such as Python or Java. Fully exploiting the hardware resources of GPUs requires a deep understanding of their architecture and creates a steep learning curve that takes a long time for programmers to overcome. As a consequence, the adoption of GPUs is often limited to specific domains for which libraries or Domain-Specific Languages (DSLs) that abstract and mask the GPU computation are available.
On the other hand, APIs offering complete control over the GPU still require effort to unleash the full hardware potential, e.g. to overlap and synchronize multiple computations, or to overlap computation with data transfer from and to the GPU. In this work, we present a novel GPU runtime scheduler that transparently provides all these optimizations without requiring any information about the structure of the computation in advance. We leverage the Graal polyglot Virtual Machine (VM) [1]–[3] and the GrCUDA environment [4] to run GPU kernels from languages such as Python, Java, and Ruby, and to obtain full control over the GPU runtime to optimize scheduling, data transfer, and execution. We target the CUDA platform, but similar considerations would apply to OpenCL [5], given the availability of a managed execution environment.
Fig. 1: Achievable speedup in C++ CUDA with hand-tuned GPU data transfer and execution overlap. Fine-tuned space-sharing and execution-transfer overlap can accelerate GPU computations by more than 50%.
A. Motivation
Modern GPUs allow multiple computations to run asynchronously and concurrently to leverage task-level parallelism: experienced programmers accelerate their programs by prefetching and overlapping data transfers with computations on different data, or even take advantage of hardware space-sharing by overlapping multiple independent computations. Figure 1 shows how much speedup can be extracted in different CUDA benchmarks (presented in section V-B) by hand-crafting these optimizations. Computations with opportunities for task-level parallelism are common: even the simple machine learning pipeline in Figure 2 has two independent branches whose results are combined at the end. Instead of executing computations sequentially, a skilled programmer schedules independent tasks on separate execution streams. However, achieving full utilization of the GPU often requires extensive debugging and profiling, even by experienced users [6]; performance may also depend on the data available at run time, and it cannot be perfectly fine-tuned by the programmer. Our work aims to provide a low-profile runtime that can automatically leverage untapped GPU resources in multi-task computations to provide speedups identical to what a skilled programmer can achieve by hand, lowering the barrier of access to GPUs with no performance compromises.

B. Contributions

In this work, we present a novel low-profile runtime scheduler for multi-task and asynchronous GPU computations. Our scheduler automatically infers data dependencies between GPU kernels, models them using a Directed Acyclic Graph (DAG), and enables asynchronous CPU and GPU execution without users having to define any synchronization event or dependency manually. More importantly, dependencies and scheduling are computed entirely at run time, without defining the computation structure in advance, and without constraints on the host language control flow.
This work is implemented as an extension of GrCUDA, a polyglot CUDA API based on GraalVM [1]. GrCUDA is implemented as a Truffle DSL [2] and provides access to GPU acceleration to languages supported by GraalVM, such as Java, Scala, JavaScript, R, and Python. Our scheduler is available in all these languages; moreover, any new feature or optimization to our scheduler will be available without language-specific modifications. Our scheduler enables GrCUDA to become a valid solution for general-purpose GPU acceleration, focusing on fast prototyping and integration with high-level languages that do not currently have strong GPU support.
We evaluate our scheduler on 6 benchmarks from different domains that exhibit opportunities for task-level parallelism and show an average of 44% speedup against the serial GrCUDA scheduler and no significant slowdown against hand-optimized scheduling written using the C++ CUDA Graphs API; finally, we analyze hardware utilization to understand how well each benchmark can exploit data transfer-computation overlap and untapped GPU resources. The source code for our scheduler and benchmarks is openly available at github.com/AlbertoParravicini/grcuda.
In summary, we make the following contributions:
• A runtime scheduler based on GrCUDA that automatically infers dependencies between GPU computations and dynamically schedules them to maximize transfer-computation overlap and space-sharing (section IV).
• A suite of 6 benchmarks to evaluate GPU asynchronous computations, task-parallelism, and space-sharing hardware utilization (sections V-B and V-F).
• An evaluation of how our scheduler provides an average of 44% speedup against the serial GrCUDA scheduler and no slowdown against hand-optimized scheduling based on the CUDA Graphs API (sections V-C and V-D).
II. RELATED WORK
Expressing a multi-task computation as a DAG that can be used to estimate an effective scheduling and to optimize execution is a concept that has already been explored with a great degree of success: well-known work that covers GPU computations includes Nvidia's CUDA Graphs [7] and Google's TensorFlow [8]; academic research has also shown interest in domains such as distributed and heterogeneous computing [9]–[11], and presented valuable theoretical results [12], [13].
Fig. 2: The two branches in this computation are independent and can be scheduled and executed in parallel. Edges are labeled with the argument that causes a data dependency.

CUDA Graphs are a programming model recently released by Nvidia, used to define a DAG of inter-dependent computations and execute them asynchronously. Computations between dependencies must be specified manually using CUDA events, or with a fairly complex custom API [7]. CUDA Graphs also present initialization overheads due to graph creation [14]. Instead, TensorFlow allows users to express a DAG of computations through a DSL embedded in languages such as Python. TensorFlow is mostly intended for Deep Learning (DL); expressing custom kernels for other domains, while supported, is not straightforward and requires significant manual integration effort. We do not deem a direct comparison with our work to be meaningful, as TensorFlow presents design choices specifically targeted towards DL.
Indeed, the most common approach to DAG-based scheduling, as seen in CUDA Graphs and TensorFlow, is to specify the program flow in advance: this choice simplifies the computation of an optimal scheduling and amortizes overheads in case of repeated computations, and it is certainly suitable for specific domains such as DL. On the other hand, the approach presented in this work is to capture GPU computations through a low-profile runtime, without the need to specify dependencies manually. As such, we do not impose any limitation on the host program control flow (e.g. conditional statements, function calls, recursion, library calls), so that the program flow can change over time. In the simplest scenario, a programmer might use multiple kernel implementations optimized for different input sizes, or use different pre-processing procedures based on the input data language; selecting the appropriate kernel is done simply through conditional statements in the host language (e.g. a switch-case in Python), without requiring custom APIs or defining multiple DAGs in advance.
Computing dependencies at runtime on heterogeneous architectures has been explored by the XKaapi runtime [15]. Contrary to our approach, XKaapi uses a work-stealing strategy that computes dependencies every time an idle thread looks for a task to execute. It handles GPU memory as a queue of blocks instead of leveraging the fine-grained flexibility of Unified Memory (UM), and its complex API is limited to C++.
Many techniques to virtualize GPU usage have emerged recently, although space-sharing is still considered an open challenge [16]. Ravi et al. [17] consolidate kernels from different VMs to increase utilization. TornadoVM [18], which is also compatible with GraalVM, translates annotated Java code for heterogeneous hardware; it profiles the code to understand the most suitable backend, and automatically infers data dependencies and data transfers through a task graph obtained from the computations specified in advance by the user [19].
Today, most programming languages have ways to achieve GPU acceleration, including Python (PyCUDA, PyOpenCL [20]), Java (JCuda) and JavaScript (GPU.js). However, none of these libraries can automatically handle asynchronous computations and space-sharing.
Moreover, each library has different APIs, supported features, and update cycles, greatly limiting code portability and interoperability; instead, the unified GrCUDA runtime and API immediately provides each new feature (such as our scheduler) to all languages supported by GraalVM without any change in the host code.
Recent research on GPU space-sharing includes the work of Wen et al. [21], who showed how concurrent execution of some GPU kernels without data dependencies delivers up to 1.5x speedup; Qiao et al. [22] use concurrent kernel execution to achieve 2.5x speedup over serial execution on multiresolution image filters. Baymax [23] focuses on multi-application space-sharing. DCUDA [24] provides scheduling for different applications on multiple GPUs, and shows how CUDA UM causes an average slowdown below 1%. We focus on single-application space-sharing, i.e. applications composed of many kernels that can run in parallel, but considerations on hardware utilization are valid in both single and multi-application space-sharing. As GrCUDA seamlessly integrates with CUDA, our work directly benefits from improvements to the CUDA API and drivers. For example, we could leverage alternative UM implementations optimized to overlap data transfer with computations, such as HUM [25].

III. BACKGROUND
GPUs allow the execution of multiple kernels at the same time. If enough resources are available (e.g. free Streaming Multiprocessors (SMs), the main GPU computation units), the GPU will perform space-sharing and run kernels in parallel. Modern GPUs often have enough resources to perform space-sharing without significantly degrading the performance of individual kernels [22]: space-sharing can improve the occupancy of SMs and exploit under-utilized hardware resources. Kernels run in parallel and asynchronously thanks to CUDA streams: kernels are executed in issue-order on a stream, but different streams proceed independently. Space-sharing is key for better GPU resource usage, but requires even greater care to ensure that programs run correctly.
Fig. 3: Dependency computations with read-only arguments. Updates to the DAG and to dependency sets are highlighted.

Synchronizing a single stream through the cudaStreamSynchronize function blocks the host execution and is generally acceptable only if the host requires the output of a GPU kernel. A more flexible approach is the use of CUDA events (docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html), which allow streams to synchronize with each other without blocking the host execution. Using CUDA events to efficiently synchronize multiple complex streams by hand can be cumbersome: instead, our solution masks CUDA events and greatly simplifies the optimal scheduling of asynchronous computations (section IV).
In CUDA, computations are divided in blocks, each composed of an equal number of threads (from 32 to 1024). Bigger blocks imply fewer blocks running concurrently, but more possibilities of sharing data through fast block-wide shared memory. Users can choose the number of threads per block; depending on the kernel implementation, users can also choose the number of blocks. While a small number of blocks increases space-sharing, achieving a performance improvement depends on the characteristics of the kernels. These considerations also apply to OpenCL: streams are called command queues, threads are work items, and the event model is similar to CUDA. Streams can also improve performance by overlapping asynchronous data transfer (from CPU to GPU or vice-versa) and computations: in Figure 2, the transfer of array r1 (required by kernel NB) can be overlapped with the execution of kernel NO. This approach is fruitful with repeated execution of simple kernels on different data batches, as the data transfer takes a significant amount of the total execution time. When using UM on GPUs that offer page migration, it is beneficial to prefetch data instead of relying on migrations caused by page faults: our scheduler can prefetch data automatically, reducing the burden of GPU optimization.
We leverage GraalVM, a Java VM able to run and combine languages that compile to Java bytecode (e.g. Scala) and custom implementations of other languages such as JavaScript, R, and Python [1]. This interoperability is possible thanks to the Truffle Abstract Syntax Tree interpreter [2], which guarantees high performance through partial evaluation of repeated portions of code. Our work is an extension of GrCUDA, a CUDA language binding implemented as a Truffle DSL. GrCUDA can be seen as a polyglot API, as it provides GPU acceleration to all languages supported by GraalVM.
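To make the CUDA stream and event mechanics described in this section concrete, the following minimal sketch launches two dependent operations on separate streams and synchronizes them with an event. It is an illustration only and is not taken from the paper: it uses CuPy (an assumption, chosen because it exposes CUDA streams and events from Python) rather than GrCUDA or the raw CUDA C API.

    import cupy as cp

    s1 = cp.cuda.Stream(non_blocking=True)   # two independent CUDA streams
    s2 = cp.cuda.Stream(non_blocking=True)
    done = cp.cuda.Event()

    x = cp.random.rand(1 << 20, dtype=cp.float32)
    with s1:
        y = x * x                 # issued asynchronously on stream 1
        done.record()             # mark the point that stream 2 must wait for
    with s2:
        s2.wait_event(done)       # cross-stream synchronization, host is not blocked
        z = y + 1.0               # safe: runs only after y is ready
    s2.synchronize()              # block the host only when the result is needed
    print(float(z.sum()))

Writing such event plumbing by hand for many streams is exactly the burden that the scheduler described in section IV removes.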
IV. SCHEDULER DESIGN METHODOLOGY
This section details the design and implementation of our scheduler. We provide a definition of our computation DAG (section IV-A). Then, we define our scheduler's architecture and its integration with the CUDA runtime (sections IV-B and IV-C). Finally, we show how to leverage the language design of GrCUDA to provide asynchronous GPU computation without any detriment to accessibility (section IV-D).
A. Computation DAG and Dependency Sets
The cornerstone of our scheduler is a Computation DAG that represents relationships between computations that involve the GPU. Vertices of the DAG are computational elements: GPU kernels, memory accesses by the CPU host program to GrCUDA UM-backed arrays, and pre-registered or user-defined library functions such as RAPIDS (developer.nvidia.com/rapids). Using GrCUDA and GraalVM, each element can be encapsulated through an object to keep track of its state. In the case of kernels, the object tracks its configuration (e.g. the number of blocks), its input arguments, and whether the computation is active.
Edges of the DAG are data dependencies between computational elements. Dependencies are inferred automatically instead of being manually specified by the user through handles or other APIs. Inferring data dependencies is possible as GrCUDA uses a managed execution environment that allows object encapsulation of inputs, removing the risk of pointer aliasing typical of native languages (e.g. having multiple pointers referring to the same memory area). The scheduler employs the data dependencies modeled through the DAG to associate computational elements to CUDA streams and introduces synchronization events if required. To compute dependencies, we associate with each computational element a dependency set. This set initially contains all arguments of the computational element. An argument in the set is removed when a subsequent computation uses and modifies the same argument, defining a data dependency on it; once a set is empty, the corresponding computational element can no longer introduce dependencies. Read-only kernel arguments (specified as described in section IV-D) can be treated with special rules to avoid adding unnecessary dependencies to the DAG. If possible, they are ignored in the dependency computation: for example, if two kernels use the same read-only input array, they will be executed concurrently on different streams. Figure 3 shows a kernel that modifies an argument and is followed by another kernel that uses the same argument as read-only (A). If a third kernel with the same input is added, it will depend on the second kernel if it modifies the argument (a write-after-read anti-dependency) (B), and it will depend on the first kernel if it uses the argument as read-only (C); it will not, however, depend on both kernels. In case C, the read-only argument adds a new dependency through X, but the dependency set of the parent kernel K1 is not updated: if a new kernel requires X as a read-only argument, it will depend on K1; otherwise, it will depend on both K2 and K3, and all dependency sets will be updated.
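As an illustration of the dependency-set rules just described, the following simplified Python sketch infers dependencies for a new computational element against the active frontier. It is a didactic approximation, not the actual GrCUDA implementation: class and function names are invented, and stream assignment, CUDA events, and frontier pruning are omitted.

    class Computation:
        def __init__(self, name, args):
            # args maps argument name -> True if the argument is read-only (const).
            self.name = name
            self.args = dict(args)
            self.dep_set = set(self.args)   # arguments that can still generate dependencies
            self.parents = []

    def add_computation(frontier, new):
        """Register a new computation and infer its data dependencies."""
        for parent in frontier:
            shared = set()
            for arg, read_only in new.args.items():
                if arg not in parent.dep_set:
                    continue
                if read_only and parent.args.get(arg, False):
                    continue                     # two read-only uses can run concurrently
                shared.add(arg)
                if not read_only:
                    parent.dep_set.discard(arg)  # a write supersedes the parent for this argument
            if shared:
                new.parents.append(parent)       # DAG edge labeled with the shared arguments
        frontier.append(new)

    # Example mirroring Figure 4: K1 writes X, K1 writes Y, K2 reads X and Y and writes Z.
    frontier = []
    add_computation(frontier, Computation("K1a", {"X": False}))
    add_computation(frontier, Computation("K1b", {"Y": False}))
    k2 = Computation("K2", {"X": True, "Y": True, "Z": False})
    add_computation(frontier, k2)
    print([p.name for p in k2.parents])          # -> ['K1a', 'K1b']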
Fig. 4: Example of scheduling for the VEC benchmark. We show both the GrCUDA code and the resulting DAG. The GrCUDA host code (Python) shown in the figure is:

    K1 = build_kernel(K1_CODE, "square", "ptr, sint32")
    K2 = build_kernel(K2_CODE, "sum", "const ptr, const ptr, ptr, sint32")
    X = polyglot.eval("grcuda", "float[{}]".format(N))
    Y = polyglot.eval("grcuda", "float[{}]".format(N))
    Z = polyglot.eval("grcuda", "float[1]")
    [init arrays...]
    K1(NUM_BLOCKS, NUM_THREADS)(X, N)
    K1(NUM_BLOCKS, NUM_THREADS)(Y, N)
    K2(NUM_BLOCKS, NUM_THREADS)(X, Y, Z, N)
    res = Z[0]
VEC benchmark. Weshow both the GrCUDA code and the resulting DAG.on K1 , otherwise it will depend on both K2 and K3 , and all dependency sets will be updated.The DAG is built at run time, not at compile-time or eagerly.Users do not have to worry about their host program’s controlflow, as we dynamically add and schedule new computationsas users provide them. Our scheduler is unaware of the full logical DAG structure of a given program, although we usuallyshow complete DAGs for clarity (e.g. Figures 2 and 6).Instead, the scheduler updates the current graph frontier, i.e.the computations that are currently active. This choice is keyto enable the dynamic creation of the DAG and does notintroduce limitations on the optimizations that our schedulercould perform. We track each kernel’s historical performanceand scheduling to allow the creation of heuristics that guidefuture scheduling of the same kernel.In GrCUDA, arrays are backed by UM, which simplifiesdata movement and accesses from the CPU without largeperformance penalties [24]. As the CPU can schedule accessesto these arrays at any point (even while the GPU is running),we model these accesses as computational elements . If theaccess introduces a data dependency on a GPU computation,the scheduler ensures that the CPU waits for that computationto end. To keep overheads as low as possible, array accessesthat do not introduce data dependencies with respect to GPUkernels are executed immediately, without modeling them asDAG elements: this is the case of consecutive accesses oraccesses performed while no GPU computation is active. Pre-registered libraries can also take advantage of our scheduler ifthey expose the choice of execution stream in their API. If not,they are scheduled synchronously to guarantee correctness.Figure 4 shows the GrCUDA code (with Python as host) ofthe
Figure 4 shows the GrCUDA code (with Python as host) of the VEC benchmark (section V-B). For each kernel invocation in the host, the scheduler adds a computational element to the DAG, updates the dependency sets of active computations, and provides a CUDA stream for execution. Executing K2 (Figure 4, C) requires a CUDA event to ensure that K1 is completed. Accessing Z on the CPU ensures that all computations are completed: they will no longer contribute to new dependencies. The host code does not need to care about the scheduling, and it can be written as if it were run sequentially, with no explicit mention of synchronization points or streams.

B. System Architecture
Figure 5 shows the main components of our scheduler and their integration with the existing GrCUDA architecture.
The GPU execution context tracks declarations and invocations of GPU computational elements (1). When a new computation is created or called, it notifies the execution context (2) so that it updates the DAG with the data dependencies of the new computation (3). The GPU execution context uses the DAG to understand if the new computation can start immediately or if it must wait for other computations to finish. Computations are overlapped using different CUDA streams, assigned by the stream manager based on dependencies and free resources (section IV-C) (4). The stream manager accesses the CUDA stream API (5) and schedules the computation for execution (6). The execution context also tracks computations that are finished (their dependency set is empty) and empty streams.
C. Scheduling Policies and Stream Management
A scheduler is serial if computations are executed one after the other in the order defined by the user, and their execution or data transfer do not overlap. A parallel scheduler allows overlaps to happen, and data dependencies determine the order in which computations are executed. In a synchronous scheduler the CPU host program waits for GPU computations to finish, while an asynchronous scheduler allows the CPU to perform other computations while the GPU is active. The original GrCUDA scheduler is serial and synchronous, while our scheduler is parallel and asynchronous.
CUDA streams are key to enable parallel and asynchronous computation. In our scheduler, the allocation and management of streams are performed transparently by a stream manager. The stream manager also tracks which computations are currently active in each stream and handles the events used for synchronization. Users can specify different policies to create new streams and to associate them with computations. Existing streams are managed in FIFO order, and new streams are created only if no currently empty stream is available to schedule a given computation.
Fig. 5: Our scheduler integrates with the existing GrCUDA architecture. New components are highlighted in red.

If a computation has multiple children (i.e. computations that depend on it), the first child is scheduled on the parent's stream to minimize synchronization events, while the following children are scheduled on other streams to guarantee concurrency. Simpler policies (e.g. scheduling all children on a single stream) further reduce the scheduling costs; that said, our experiments always use the more general policies and show negligible scheduling overheads (section V-D). The stream manager is architecture-aware: on GPU architectures older than Pascal, the CPU cannot access UM if a kernel is active on the GPU. The stream manager restricts each array's visibility to the stream where it is used until the CPU needs to access the array. The CPU is made temporarily unaware of the existence of arrays being used by the GPU, and can access currently unused arrays even if the GPU is active. While this optimization is not required on architectures since Pascal (thanks to their page fault mechanism), our scheduler can automatically prefetch data to optimize transfers.
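The following Python sketch illustrates the FIFO stream-retrieval and first-child policies described above. It is a simplified model with invented names, not the GrCUDA stream manager itself: event creation, multi-parent synchronization, and the architecture-specific UM visibility handling are left out.

    from collections import deque

    class StreamManager:
        def __init__(self, create_stream):
            self.free_streams = deque()          # idle streams, reused in FIFO order
            self.create_stream = create_stream   # callback that creates a new CUDA stream

        def _get_stream(self):
            return self.free_streams.popleft() if self.free_streams else self.create_stream()

        def assign(self, comp):
            if not comp.parents:
                # Independent computation: reuse an idle stream or create a new one.
                comp.stream = self._get_stream()
            else:
                parent = comp.parents[0]
                if parent.children.index(comp) == 0:
                    # First child inherits the parent's stream: no sync event needed.
                    comp.stream = parent.stream
                else:
                    # Later children run on other streams so they can proceed concurrently;
                    # a CUDA event on the parent's stream would enforce the dependency.
                    comp.stream = self._get_stream()
            return comp.stream

        def stream_emptied(self, stream):
            # Simplification: a stream returns to the FIFO pool only once
            # no computation is active on it anymore.
            self.free_streams.append(stream)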
D. Language Design and Integration
Our scheduler leverages the GrCUDA language's existing features and does not introduce any user-facing modification to the language. Kernel signatures are specified using the Native Interface Definition Language (NIDL) or the Truffle Native Function Interface (NFI), simple typing systems that support basic data types and pointers. Optional argument annotations such as input, output or const are used by the scheduler to optimize computations that contain read-only arguments (input or const). Arguments without annotations are treated by the scheduler as modifiable by the kernel; not specifying arguments as read-only does not affect correctness, but might prevent the scheduler from performing further optimizations.
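For instance, the signature string of K2 in Figure 4 marks its two input arrays as const, which is what lets the scheduler keep the two squaring kernels independent. The short sketch below follows the same pattern for a hypothetical element-wise kernel; build_kernel, polyglot, NUM_BLOCKS, NUM_THREADS and N are assumed to be available exactly as in Figure 4, and the kernel itself is invented for illustration.

    # Hypothetical kernel: z[i] = x[i] * y[i] + z[i], with x and y never written.
    AXPY_CODE = """
    extern "C" __global__ void axpy(const float *x, const float *y, float *z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = x[i] * y[i] + z[i];
    }
    """
    # "const ptr" marks read-only arguments; unannotated pointers are assumed writable.
    AXPY = build_kernel(AXPY_CODE, "axpy", "const ptr, const ptr, ptr, sint32")
    X = polyglot.eval("grcuda", "float[{}]".format(N))
    Y = polyglot.eval("grcuda", "float[{}]".format(N))
    Z = polyglot.eval("grcuda", "float[{}]".format(N))
    AXPY(NUM_BLOCKS, NUM_THREADS)(X, Y, Z, N)

Because X and Y are declared const, other kernels that only read them can be scheduled concurrently on different streams, and transfers of unrelated arrays can overlap with this kernel's execution.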
Fig. 6: Computation structure of each benchmark, expressed as a DAG of GPU kernel computations (denoted as circles). Colors denote CUDA streams, and kernels whose incoming arrows have different colors will require a synchronization event.
V. EXPERIMENTAL EVALUATION
In this section, we evaluate the performance of our GrCUDA GPU scheduler on multiple benchmarks that exhibit task-level parallelism and opportunities to leverage space-sharing and transfer-computation overlap to achieve lower execution time than a serial scheduler (section V-B). First, we compare against the serial GrCUDA scheduler (section V-C) and show how our scheduler can exploit untapped GPU resources to deliver better performance than a naïve serial scheduler. Then, we compare the performance of GrCUDA against the C++ CUDA API, measure how the GrCUDA scheduling is identical to the best hand-tuned scheduling possible, and how GrCUDA does not add any significant slowdown to the benchmark execution times (section V-D).
Finally, we investigate, for each benchmark, the nature of the achieved speedup. First, we measure the amount of resource contention introduced by hardware space-sharing (section V-E); second, we measure which overlaps are present (transfer-computation or space-sharing), and we analyze how well each benchmark uses hardware resources such as device memory and L2 cache to understand which workloads are more suitable for asynchronous execution (section V-F).
A. Evaluation Setup
Tests are performed on 3 different Nvidia GPUs with different architectures: a Tesla P100 (Pascal, 12 GB of device memory), a GTX 1660 Super (Turing, 6 GB), and a GTX 960 (Maxwell, 2 GB). GPUs are connected to their respective host machines through PCI Express (PCIe) 3.0. Testing consumer-grade GPUs such as the GTX 1660 Super and the GTX 960 shows how our scheduler does not demand high-end data-center hardware to provide benefits, and is useful for quick prototyping on commodity GPUs and to accelerate desktop applications. Benchmarks are executed 30 times on random data. We select the input sizes in each benchmark to use between 10% and 90% of the available memory on each GPU, up to the largest size that fits in device memory. Execution time is the total amount of time spent by GPU execution, from the first kernel scheduling until the end of execution. Parameters (e.g. the number of blocks) are optimized for best performance in serial execution to provide a worst-case comparison. The x-axes in Figures 7 to 9 report the benchmark scale, a value proportional to the benchmark's memory footprint (e.g. the number of pixels in each input image).

TABLE I: Amount of device memory used for different input sizes in each benchmark. GPUs are tested with different input sizes up to the largest size that fits in GPU memory. (Rows: Vector Squares (VEC), Black & Scholes (B&S), Images (IMG), ML Ensemble (ML), HITS, Deep Learning (DL), GPU device memory; columns: memory footprint in GB on the GTX 960, GTX 1660 Super, and Tesla P100.)
B. Benchmark Suite
We tested our scheduler on 6 benchmarks and a total of 33 different kernels representing common GPU workloads (image processing, machine learning, etc.) and containing opportunities for task-level parallelism through space-sharing and computation-transfer overlap. To the best of our knowledge, no existing GPU benchmark suite has the goal of evaluating intra-application task-level parallelism, as most benchmark suites (e.g. Rodinia [26]) focus on single kernels and sequential execution. Still, we take or derive the CUDA kernels in our benchmarks from open-source implementations.
We scale the input size linearly to visualize more clearly whether any hardware bottleneck impacts performance as the input size exceeds a threshold. For instance, we change the number of rows for the matrix multiplications in the ML benchmark, but keep fixed the number of features and output classes.
Figure 6 presents each benchmark's task dependency structure and highlights the optimal stream assignment for each kernel. Table I summarizes each benchmark. The chosen input sizes guarantee that the memory footprint covers both small and large computations compared to the total memory of each GPU. For each benchmark, we present a brief description.
• Vector Squares (VEC): a simple benchmark that measures a basic case of task-level parallelism and computes the sum of differences of 2 squared vectors. Each iteration has new input data, simulating a streaming computation that requires transfer from CPU to GPU. Inspired by [27].
• Black & Scholes (B&S): the Black & Scholes equation for European call options, for 10 underlying stocks, and 10 vectors of prices. Adapted from [28] to simulate a computationally intensive streaming benchmark with double-precision arithmetic and many independent kernels that can be overlapped with no dependencies.
• Image Processing (IMG): an image processing pipeline that combines a sharpened picture with copies blurred at low and medium frequencies [29], to sharpen the edges, soften everything else, and enhance the subject. The benchmark has complex dependencies on 4 streams.
• Machine Learning Ensemble (ML): an ML pipeline that combines Categorical Naïve Bayes and Ridge Regression classifiers by applying softmax normalization and averaging scores. The input matrix has 200 features. This benchmark contains branch imbalance (the Naïve Bayes classifier takes longer) and read-only arguments.
• HITS: it computes the HITS algorithm on a graph [30] using repeated sparse matrix-vector multiplication (SpMV) on a matrix and its transpose, and is implemented with LightSpMV [31]. It contains complex cross-synchronizations and multiple iterations.
• Deep Learning (DL): a convolutional neural network that projects 2 input images to low-dimensional embeddings and combines the embeddings using a dense layer. Similar neural networks can be used, for example, to classify whether 2 images contain the same subject.
C. Performance against Serial GrCUDA Scheduling
We compare the performance of our parallel scheduler against the GrCUDA serial scheduler (Figure 7). Automatic data prefetching is enabled on the Tesla P100 and the GTX 1660 Super, while on the GTX 960 data is necessarily transferred ahead of the computation, as it does not have a page fault mechanism. We always deliver better performance than the serial scheduler, with a geomean speedup of 44% on the 3 GPUs. The GTX 960 is 25% faster, while the P100 performs the best, with a geomean speedup of 61%. More hardware resources, together with automatic prefetching, result in better parallelization, and we show how our approach works out-of-the-box on data-center GPUs. While still faster than the serial baseline, disabling automatic prefetching is not recommended: concurrent kernel execution turns the page fault controller into the main bottleneck, limiting the benefits of overlapping data transfer with computation. When using serial scheduling, GrCUDA does not compute dependencies, making overheads even smaller. We average results for block sizes (the number of threads in each 1D block) from 32 to 1024. In benchmarks with 2D and 3D blocks (e.g. IMG and DL), we keep 2D blocks of size 8x8 and 3D blocks of size 4x4x4, as bigger blocks resulted in longer execution times in every case.
Fig. 7: Parallel scheduler speedup over the serial scheduler. Our parallel scheduler provides a geomean speedup of 44% over the original GrCUDA serial scheduler. We highlight block sizes giving the best/worst speedup (when significant).

Speedups are mostly independent of the input data size, as we sweep through inputs with size from less than 10% to almost 100% of the available GPU memory. Even if kernels fill the GPU resources, it is still possible to achieve a speedup by overlapping data transfer with other kernels' execution. DAG scheduling appears to be more robust to different kernel configurations: in many cases (such as VEC and HITS), using block_size=32 results in a higher speedup, but similar execution time as with a larger block size. With serial scheduling, small blocks result in under-utilization of GPU resources such as shared memory, while DAG scheduling provides better utilization by having multiple kernels run in parallel. Thanks to DAG scheduling, programmers have to spend less time profiling their code to find the optimal kernel configuration.
D. Performance against CUDA Graphs
To understand how our GrCUDA scheduler performs against existing solutions, we re-implemented our benchmarks using the C++ CUDA Graphs API. The kernel code and the setup (e.g. input and block sizes) are the same as in GrCUDA, but the host code is written using the C++ CUDA Graphs API. We test CUDA Graphs using stream capture to wrap hand-optimized multi-stream scheduling synchronized with CUDA events, and using the Graph API to specify dependencies between computations manually. These CUDA Graphs are built only once per execution, and overheads are completely amortized over many iterations. Finally, we provide a hand-optimized implementation purely based on CUDA events to have full control over data movement and simulate CUDA Graphs' performance if it supported data prefetching.
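As a rough illustration of the stream-capture approach used for this baseline (the actual baseline is C++ host code), the sketch below records a short sequence of kernels into a graph and replays it. It assumes a recent CuPy build that exposes CUDA graph capture through Stream.begin_capture / end_capture; the measurements in this section use the C++ CUDA Graphs API directly.

    import cupy as cp

    s = cp.cuda.Stream(non_blocking=True)
    x = cp.random.rand(1 << 20, dtype=cp.float32)
    y = cp.empty_like(x)            # pre-allocated so no allocation happens during capture

    with s:
        s.begin_capture()           # start recording instead of executing
        cp.multiply(x, x, out=y)    # kernels are added to the graph, not run
        cp.add(y, 1.0, out=y)
        graph = s.end_capture()     # instantiated, replayable CUDA graph

    for _ in range(100):            # replaying amortizes the graph-creation overhead
        graph.launch(stream=s)
    s.synchronize()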
Fig. 8: Speedup of our GrCUDA scheduling against hand-optimized CUDA Graphs (higher is better). We are always faster, and the GrCUDA runtime overheads are negligible for computations lasting more than a few milliseconds.

Our GrCUDA scheduler, in addition to being fully automated, is never significantly slower than any of the CUDA Graphs baselines and is often faster (Figure 8). The large performance gaps compared to CUDA Graphs seen on the GTX 1660 Super and the P100 are mostly explained by our automatic prefetching, which the CUDA Graphs API seems unable to perform. Even when enabling prefetching in the CUDA baseline, our parallel scheduler achieves performance equal to the hand-optimized baseline. Execution time speedups are hardly affected by input size, with minor differences only in computations lasting a few milliseconds and containing many dependencies (such as IMG).
E. Impact of Space-Sharing Resource Contention
By looking at dependencies between kernels and measuring their execution time with serial scheduling, so that each kernel has full access to the GPU resources, we estimate the resource contention on the GPU hardware and the PCIe bandwidth introduced by space-sharing. Figure 9 shows how far each benchmark is from its theoretical contention-free peak performance. Results are mostly consistent between GPUs, with a relative execution time that is often around 70% of the contention-free performance bound; while resource contention is present, it is small enough to make space-sharing worthwhile. Unsurprisingly, B&S, which is composed of 10 independent computations, achieves around 15-20% of its contention-free peak performance due to limitations on PCIe bandwidth and the availability of double-precision arithmetic units.
Fig. 9: Slowdown compared to execution without hardware resource contention. Space-sharing introduces a performance loss of around 30-40% due to hardware resource contention.
F. Analysis of Hardware Utilization
It is interesting to understand, for each benchmark, the nature of the speedup achieved through resource sharing, i.e., whether the speedup is caused by overlapping computation with data transfer, or whether it is explained by higher utilization of GPU resources such as device memory bandwidth through space-sharing. First, we measure how much overlap is present in each benchmark. We measure 4 different types of overlap:
• CT, computation against transfer: percentage of GPU kernel computation that overlaps with any data transfer.
• TC, transfer against computation: percentage of data transfer that overlaps with any kernel computation(s).
• CC, percentage of GPU computation overlapped with other GPU computation.
• TOT, any type of overlap: here we consider any type of overlap between data transfers and/or computations. If a computation/data transfer overlaps more than one computation/data transfer, the overlap is counted only once (we consider the union of the overlap intervals); a sketch of this computation is shown after Figure 10.
Figure 10 shows how we compute these overlaps starting from the execution timeline. Although the TOT overlap can be a good proxy of the achieved speedup, it is sometimes inflated by a high CC overlap, as overlapping computations do not always translate to faster execution. In VEC, the speedup comes only from transfer and computation overlap, while the speedup in IMG is explained by the overlap of kernels that would leave a large amount of shared memory unused if executed serially.
Fig. 10: Example of a possible execution timeline for the ML benchmark. We highlight different types of overlap between transfer and computation, as defined in section V-F.
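The sketch below shows one way to compute the TOT metric from a recorded timeline, using the union-of-intervals rule stated above. It is illustrative Python, not the measurement scripts used for the paper: interval lists are assumed to be (start, end) pairs extracted from the profiler timeline, and normalizing by the total busy time is an assumption made for the example.

    def union_length(intervals):
        """Total length covered by a set of (start, end) intervals."""
        total, cur_start, cur_end = 0.0, None, None
        for start, end in sorted(intervals):
            if cur_end is None or start > cur_end:        # disjoint: close the current run
                if cur_end is not None:
                    total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:                                          # overlapping: extend the current run
                cur_end = max(cur_end, end)
        if cur_end is not None:
            total += cur_end - cur_start
        return total

    def tot_overlap(computations, transfers):
        """TOT: fraction of total busy time where at least two activities overlap."""
        events = computations + transfers
        overlaps = []
        for i, (s1, e1) in enumerate(events):
            for s2, e2 in events[i + 1:]:
                lo, hi = max(s1, s2), min(e1, e2)
                if lo < hi:                                # each overlap interval counted once
                    overlaps.append((lo, hi))
        return union_length(overlaps) / union_length(events)

    # Example: one transfer overlapped with two kernels -> TOT = 1.0 / 3.0.
    print(tot_overlap(computations=[(0.0, 2.0), (2.0, 3.0)], transfers=[(1.5, 2.5)]))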
(Fig. 11 panel data — the speedup reported below each plot, in the order VEC, B&S, IMG, ML, HITS, DL. GTX 960: 1.17x, 1.33x, 1.55x, 1.22x, 1.13x, 1.34x. GTX 1660 Super: 2.68x, 1.83x, 1.34x, 1.28x, 1.38x, 1.19x. Tesla P100: 2.55x, 2.79x, 1.49x, 1.39x, 1.33x, 1.17x.)
Fig. 11: Amount of transfer and computation overlap for each benchmark, for serial and parallel scheduling. We report below each plot the speedup obtained in the benchmark.

Computation time and transfer time do not increase with the same proportionality factor as the input size; small input data do not use the PCIe bandwidth fully, and a 10x increase in data size might translate into a transfer time increase below 10x, up to the available bandwidth. On faster GPUs the computation time is lower, while the transfer time is roughly identical to less powerful hardware (assuming the same transfer interface): more computation is overlapped with data transfer, leading to better speedups. For example, in the B&S benchmark, the CT overlap increases on faster GPUs, and so does the speedup.
We then analyze how our parallel scheduler affects hardware-level metrics such as device memory throughput, L2 cache throughput, instructions per cycle (IPC), and GFLOPS. We use nvprof and ncu (https://docs.nvidia.com/cuda/profiler-users-guide/index.html, https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html) to measure the number of bytes read/written by each kernel from/to device memory and L2 cache, and the total number of instructions executed. As collecting these metrics introduces a high execution time overhead and prevents the execution of concurrent kernels, we combine the execution timeline without metric collection with hardware metrics collected in separate runs (the variance in these metrics between different runs is insignificant). Metrics are collected only on the GTX 1660 Super, as we did not have root-level access to the Tesla P100.
Fig. 11: Amount of transfer and computation overlap for eachbenchmark, for serial and parallel scheduling. We report beloweach plot the speedup obtained in the benchmark.Computation time and transfer time do not increase with thesame proportionality factor as the input size; small input datado not use the PCIe bandwidth fully, and a 10x increase in datasize might translate into a transfer time increase below 10x, upto the available bandwidth. On faster GPUs the computationtime is lower, while the transfer time is roughly identical toless powerful hardware (assuming the same transfer interface):more computation is overlapped to data transfer, leading tobetter speedups. For example, in the B&S benchmark, the CToverlap increases on faster GPUs, and so does the speedup.We then analyze how our parallel scheduler affectshardware-level metrics such as device memory throughput, L2cache throughput, Instructions per cycle (IPC), and GFLOPS.We use nvprof and ncu to measure the number of bytesread/written by each kernel from/to device memory and L2cache, and the total number of instructions executed. As col-lecting these metrics introduces high execution time overheadand prevents the execution of concurrent kernels, we combinethe execution timeline without metric collection with hardwaremetrics collected in separate runs (variance in these metricsbetween different runs is insignificant). Metrics are collectedonly on the GTX 1660 Super as we did not have root-level https://docs.nvidia.com/cuda/profiler-users-guide/index.html, https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html VEC B&S IMG ML HITS DL0 GB/s20 GB/s40 GB/s60 GB/s80 GB/s100 GB/s120 GB/s
Device memory throughput
Serial throughput (GB/s):104 7 39 27 89 14
VEC B&S IMG ML HITS DL0 GB/s30 GB/s60 GB/s90 GB/s120 GB/s150 GB/s
L2 cache throughput
Serial throughput (GB/s):102 6 56 79 69 23
VEC B&S IMG ML HITS DL0.000.200.400.600.801.00
IPC
Serial IPC:0.11 0.07 0.37 0.04 0.30 0.38
VEC B&S IMG ML HITS DL020406080100120
GFLOPS32/64
GFLOPS32/64:18 78 69 19 22 32
Serial Scheduler Parallel SchedulerSerial Scheduler Parallel SchedulerSerial Scheduler Parallel SchedulerSerial Scheduler Parallel Scheduler
Fig. 12: Hardware metrics for each benchmark and execution policy, as measured on the GTX 1660 Super. All benchmarks in which different kernels overlap their execution show an increase in hardware utilization.

The amount of bytes read/written and the total number of instructions executed by each kernel mostly depend on the kernel itself and are not significantly impacted by space-sharing; as such, this evaluation is useful to estimate the global GPU behavior when space-sharing is performed. GFLOPS is estimated from the total number of floating-point operations (single and double precision).
Figure 12 shows how, in kernels with computation overlap (e.g. ML and IMG), the increase in memory throughput is significant and in line with the total speedup observed for these benchmarks. VEC does not have any increase in memory throughput, as its speedup comes exclusively from transfer overlap. Benchmarks that operate on dense matrices make heavier use of the L2 cache, whose throughput increases with parallel scheduling. The low IPC in ML is caused by a slow kernel that operates on tall matrices and does not use the GPU parallelism to its full extent: running multiple kernels in parallel hides its latency and provides the speedup in Figure 7.
From these analyses, we understand what limits performance in each benchmark. For example, B&S performs complex mathematical operations on independent values, with a very high GFLOPS count (Figure 12) and almost no cache utilization. On the GTX 1660, B&S has high TC and low CT (Figure 11); the computation lasts longer than the data transfer, and part of the computation is not overlapped. On the other hand, the Tesla P100, which has 20x higher double-precision performance than the 1660, completely masks the computation with transfer (high CT), and indeed we observe a better speedup. Improving performance even further requires lowering the transfer time, for example through PCIe 4.0.
VI. CONCLUSION AND FUTURE WORK
We presented a novel scheduler for GPU computations that can automatically infer data dependencies to build a computation DAG at run time. The scheduler allows computations to execute in parallel through GPU space-sharing and overlaps data transfer and execution whenever possible.
We validate our scheduler on 6 benchmarks and a total of 33 GPU kernels. Our scheduler provides a geomean speedup of 44% (up to 270%) over serial synchronous scheduling, and is always faster. It automatically achieves the same scheduling as hand-optimized CUDA Graphs code, without any slowdown.
Our scheduler seamlessly integrates with the GrCUDA environment, a polyglot CUDA API based on GraalVM that provides easy access to GPU acceleration to languages such as Java, Python, JavaScript, and R. Users can leverage our work without knowledge of the underlying scheduler, and without changing their code. It is not required to specify the code structure or control flow in advance, as our scheduler dynamically executes computations as they are provided by the host program. Our work simplifies GPU code prototyping and acceleration by giving easy access to untapped GPU resources at no cost, while masking the complexity of asynchronous GPU computations.
As future work, we plan to extend our technique to multiple GPUs: the problem is significantly harder, as it requires computing data location and migration costs at run time to identify the optimal scheduling. We will also leverage run-time information for additional optimizations, such as estimating the ideal block size based on data size and previous executions.
ACKNOWLEDGMENTS
We thank Oracle Labs for its support and contributions to this work. The authors from Politecnico di Milano are funded in part by a research grant from Oracle. We also thank Rene Mueller and Lukas Stadler, the original authors of GrCUDA, for their valuable feedback and opinions. Oracle and Java are registered trademarks of Oracle and/or its affiliates.
REFERENCES
[1] T. Würthinger, C. Wimmer, A. Wöß, L. Stadler, G. Duboscq, C. Humer, G. Richards, D. Simon, and M. Wolczko, "One VM to rule them all," in Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, 2013.
[2] C. Wimmer and T. Würthinger, "Truffle: a self-optimizing runtime system," in Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. ACM, 2012.
[3] G. Duboscq, L. Stadler, T. Würthinger, D. Simon, C. Wimmer, and H. Mössenböck, "Graal IR: An extensible declarative intermediate representation," in Proceedings of the Asia-Pacific Programming Languages and Compilers Workshop, 2013.
[4] R. Mueller and L. Stadler, "GrCUDA," github.com/NVIDIA/grcuda.
[5] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, p. 66, 2010.
[6] J. Luitjens, "CUDA streams: Best practices and common pitfalls," in GPU Technology Conference, 2015.
[7] "CUDA Graphs," https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html, retrieved on 2020-10-12.
[8] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
[9] L. Marchal, H. Nagy, B. Simon, and F. Vivien, "Parallel scheduling of DAGs under memory constraints," IEEE, 2018.
[10] Y. Xu, L. Liu, and Z. Ding, "DAG-aware joint task scheduling and cache management in Spark clusters," IEEE, 2020, pp. 378–387.
[11] M. Y. Özkaya, A. Benoit, B. Uçar, J. Herrmann, and Ü. V. Çatalyürek, "A scalable clustering-based task scheduler for homogeneous processors using DAG partitioning," IEEE, 2019, pp. 155–165.
[12] R. Mayer, C. Mayer, and L. Laich, "The TensorFlow partitioning and scheduling problem: it's the critical path!" in Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning, 2017.
[13] A. Marchetti-Spaccamela, N. Megow, J. Schlöter, M. Skutella, and L. Stougie, "On the complexity of conditional DAG scheduling in multiprocessor systems," IEEE, 2020, pp. 1061–1070.
[14] A. Gray, "Getting started with CUDA Graphs," developer.nvidia.com/blog/cuda-graphs, 2019-09-05, retrieved on 2020-10-12.
[15] T. Gautier, J. V. Lima, N. Maillard, and B. Raffin, "XKaapi: A runtime system for data-flow task programming on heterogeneous architectures," IEEE, 2013, pp. 1299–1308.
[16] C.-H. Hong, I. Spence, and D. S. Nikolopoulos, "GPU virtualization and scheduling methods: A comprehensive survey," ACM Computing Surveys (CSUR), vol. 50, no. 3, pp. 1–37, 2017.
[17] V. T. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar, "Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework," in Proceedings of the 20th International Symposium on High Performance Distributed Computing, 2011, pp. 217–228.
[18] J. Fumero, M. Papadimitriou, F. S. Zakkak, M. Xekalaki, J. Clarkson, and C. Kotselidis, "Dynamic application reconfiguration on heterogeneous hardware," in Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2019.
[19] J. Clarkson, J. Fumero, M. Papadimitriou, F. S. Zakkak, M. Xekalaki, C. Kotselidis, and M. Luján, "Exploiting high-performance heterogeneous hardware for Java programs using Graal," in Proceedings of the 15th International Conference on Managed Languages & Runtimes, 2018, pp. 1–13.
[20] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih, "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation," Parallel Computing, vol. 38, 2012.
[21] Y. Wen and M. F. O'Boyle, "Merge or separate? Multi-job scheduling for OpenCL kernels on CPU/GPU platforms," in Proceedings of the General Purpose GPUs, 2017, pp. 22–31.
[22] B. Qiao, O. Reiche, J. Teich, and F. Hannig, "Unveiling kernel concurrency in multiresolution filters on GPUs with an image processing DSL," in Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2020, pp. 11–20.
[23] Q. Chen, H. Yang, J. Mars, and L. Tang, "Baymax: QoS awareness and increased utilization for non-preemptive accelerators in warehouse scale computers," ACM SIGPLAN Notices, vol. 51, no. 4, pp. 681–696, 2016.
[24] F. Guo, Y. Li, J. C. Lui, and Y. Xu, "DCUDA: Dynamic GPU scheduling with live migration support," in Proceedings of the ACM Symposium on Cloud Computing, 2019, pp. 114–125.
[25] J. Jung, D. Park, Y. Do, J. Park, and J. Lee, "Overlapping host-to-device copy and computation using hidden unified memory," in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020, pp. 321–335.
[26] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," IEEE, 2009, pp. 44–54.
[27] J. Luitjens, "Faster parallel reductions on Kepler," developer.nvidia.com/blog/faster-parallel-reductions-kepler, 2014-02-14.
[28] "Black & Scholes option pricing," docs.nvidia.com/cuda/cuda-samples/index.html.
[30] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM (JACM), vol. 46, no. 5, pp. 604–632, 1999.
[31] Y. Liu and B. Schmidt, "LightSpMV: Faster CSR-based sparse matrix-vector multiplication on CUDA-enabled GPUs," in 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).