GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python
Jaemin Choi∗, Zane Fink∗, Sam White∗, Nitin Bhat†, David F. Richards‡, Laxmikant V. Kale∗†
∗Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
†Charmworks, Inc., Urbana, Illinois, USA
‡Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California, USA
Email: {jchoi157,zanef2,white67,kale}@illinois.edu, [email protected], [email protected]
Abstract—As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task, as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models developed out of the Charm++ ecosystem, including MPI and Python: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.2x, 11.7x, and 17.4x in Charm++, AMPI, and Charm4py, respectively. We also observe increases in bandwidth of up to 9.6x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating the weak and strong scaling performance of a proxy application that performs the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.
Index Terms—GPU communication, UCX, Charm++, AMPI, CUDA-aware MPI, Python, Charm4py
I. INTRODUCTION
The parallel processing power of GPUs has become central to the performance of today's High Performance Computing (HPC) systems, with seven of the top ten supercomputers in the world equipped with GPUs [1]. With multiple GPUs per compute node becoming the norm in modern HPC platforms, applications often store the bulk of their data on GPU memory, rendering data transfers between GPUs a critical component of performance.

While GPU vendors provide APIs for programming applications to execute on their hardware and transfer data between host and device memory, such as CUDA on NVIDIA GPUs, their limited functionality makes it challenging to implement a general communication backend for parallel programming models on distributed-memory machines. Inter-process communication, when implemented using the CUDA IPC feature, requires rigorous optimizations such as handle caches and pre-allocated device buffers, as discussed in [2]. Direct inter-node transfers of GPU data cannot be implemented solely with CUDA, requiring other hardware and software support as described in [3]. Adding support for GPUs from a different vendor is another matter, where the development and optimization efforts would have to be repeated to achieve competitive performance.

There have been a number of software frameworks aimed at providing a unified communication layer abstracting the various types of networking hardware, such as GASNet [4], libfabric [5], and UCX [6]. While they have been successfully adopted in many parallel programming models, including MPI and PGAS, to support communication involving host memory, UCX is arguably the first framework to support production-grade, high-performance inter-GPU communication on various types of modern GPUs and interconnects. In this work, we utilize the capability of UCX to perform direct GPU-GPU transfers to support GPU-aware communication in multiple parallel programming models from the Charm++ ecosystem, including MPI and Python: Charm++, AMPI, and Charm4py. We extend the UCX machine layer in the Charm++ runtime system to enable the transfer of GPU buffers and expose this functionality to the parallel programming models, with model-specific implementations to support their user applications. Our tests on a leadership-class system show that this approach substantially improves the performance of GPU communication for all models.

The major contributions of this work are the following:
• We present designs and implementation details to enable GPU-aware communication using UCX as a common abstraction layer in multiple parallel programming models: Charm++, AMPI, and Charm4py.
• We discuss design considerations to support message-driven execution and task-based runtime systems by performing a metadata exchange between communication endpoints.
• We demonstrate the performance impact of our mechanisms using a set of microbenchmarks and a proxy application representative of a scientific workload.

II. BACKGROUND
A. GPU-aware Communication
GPU-aware communication has developed out of the need to rectify productivity and performance issues with data transfers involving GPU buffers. Without GPU-awareness, additional code is required to explicitly move data between GPU memory and host memory, which can substantially increase latency and reduce bandwidth.

The GPUDirect [7] family of technologies has been leading the effort to resolve such performance issues on NVIDIA GPUs. Version 1.0 allows Network Interface Controllers (NICs) to have shared access to pinned system memory with the GPU and avoid unnecessary memory copies, and version 2.0 (GPUDirect P2P) enables direct memory access and data transfers between GPU devices on the same PCIe bus. GPUDirect RDMA [8] utilizes Remote Direct Memory Access (RDMA) technology to allow the NIC to directly access memory on the GPU. Based on GPUDirect RDMA, the GDRCopy library [9] provides an efficient low-latency transport for small messages. The Inter-Process Communication (IPC) feature introduced in CUDA 4.1 enables direct transfers between GPU data mapped to different processes, enhancing inter-process communication performance [2].

MPI is one of the first parallel programming models and communication standards to adopt these technologies and support GPUs, in the form of CUDA-aware MPI, which is available in most MPI implementations. Other parallel programming models have either built direct GPU-GPU communication mechanisms natively using GPUDirect and CUDA IPC, or made use of a GPU-aware communication framework such as UCX.
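To make the contrast concrete, the following sketch (our own illustration, not code from any particular MPI library) shows a host-staged send next to a CUDA-aware send of the same device buffer; MPI_Send and cudaMemcpy are standard MPI/CUDA calls, while the buffer names d_buf and h_buf are hypothetical.

#include <mpi.h>
#include <cuda_runtime.h>

// Host-staging: the application copies device data to a host buffer
// before handing it to MPI.
void send_host_staged(const double* d_buf, double* h_buf, int count, int dest) {
  cudaMemcpy(h_buf, d_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Send(h_buf, count, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);
}

// GPU-aware: the device pointer is passed directly to MPI, letting the
// library use GPUDirect or IPC transfer paths internally.
void send_gpu_aware(const double* d_buf, int count, int dest) {
  MPI_Send(d_buf, count, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);
}

Besides removing the explicit copies, the GPU-aware variant frees the application from managing the intermediate host buffer altogether.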
B. UCX

Unified Communication X (UCX) [6] is an open-source, high-performance communication framework that provides abstractions over various networking hardware and drivers, including TCP, OpenFabrics Alliance (OFA) verbs, Intel Omni-Path, and Cray uGNI. It is currently being developed at a fast pace, with contributions from multiple hardware vendors as well as the open-source community.

With support for tag-matched send/receive, stream-oriented send/receive, Remote Memory Access (RMA), and remote atomic operations, UCX provides a high-level API for parallel programming models to implement a performance-portable communication layer. Projects using UCX include Dask, OpenMPI, MPICH, and Charm++. GPU-aware communication is also supported on NVIDIA and AMD GPUs through its tagged and stream APIs. When provided with GPU memory, these APIs utilize the respective CUDA or ROCm libraries to perform efficient GPU-GPU transfers.
C. Charm++
Charm++ [10] is a parallel programming system based on the C++ language, developed around the concept of migratable objects. A Charm++ program is decomposed into objects called chares that execute in parallel on the Processing Elements (PEs, typically CPU cores), scheduled by the runtime system. This object-centric approach enables overdecomposition, where the problem domain can be decomposed into any number of chares determined by the programmer, separate from the number of available PEs in a job. This empowers the runtime system to control the mapping and scheduling of chares onto PEs, allowing it to achieve computation-communication overlap and perform dynamic load balancing.

The execution of a Charm++ program is driven by messages that are exchanged between chare objects executing in parallel. Each message encapsulates information about the work that should be invoked (an entry method in Charm++) and data for the receiver chare. Arriving messages are stored in a message queue associated with each PE, from which they are picked up by the scheduler. This message-driven execution and the associated communication are asynchronous by default: a message send call does not block the sender chare, and the corresponding message is asynchronously stored and eventually picked up by the scheduler for the receiver chare. Communication operations initiated by chare objects pass through various layers in the Charm++ runtime system until they eventually reach the machine layer. Charm++ supports various low-level transports with different machine layer implementations, including TCP/IP, Mellanox InfiniBand, Cray uGNI, IBM PAMI, and UCX.

This work enables GPU-aware communication in Charm++ by extending the UCX machine layer, seeking to improve the communication performance of GPU-accelerated applications developed with the various parallel programming models built on top of the Charm++ runtime system.
D. Adaptive MPI
Adaptive MPI (AMPI) [11] is an MPI library implementation developed on top of the Charm++ runtime system. AMPI virtualizes the concept of an MPI rank: whereas a traditional MPI library equates ranks with operating system processes, AMPI supports execution with multiple ranks per process. This empowers AMPI to co-schedule ranks that are located on the same PE based on the delivery of messages. Users can tune the number of ranks they run with based on performance. AMPI ranks are also migratable at runtime for the purposes of dynamic load balancing or checkpoint/restart-based fault tolerance.

Communication in AMPI is handled through Charm++ and its optimized networking layers. Each AMPI rank is associated with a chare array element. AMPI optimizes communication based on the locality of the recipient rank as well as the size and datatype of the message buffer. Small buffers are packed inside a regular Charm++ message in an eager fashion, and the Zero Copy API [12] is used to implement a rendezvous protocol for larger buffers. The underlying runtime optimizes message transmission based on locality: over user-space shared memory or Cross Memory Attach (CMA) within a node, or RDMA across nodes. This work extends such optimizations to the context of multi-GPU nodes connected by a high performance network programmable with the UCX API.
E. Charm4Py
Charm4py [13] is a parallel programming model and framework based on the Python language, developed on top of the Charm++ runtime system. It seeks to provide an easily-accessible parallel programming environment with improved programmer productivity through Python, while maintaining the high scalability and performance of the adaptive C++-based runtime. Being based on Python, Charm4py can also readily take advantage of many widely-used software libraries such as NumPy, SciPy, and pandas.

Fig. 1. Software stack of the Charm++ family of parallel programming models: on each PE, applications (Charm++, Adaptive MPI, Charm4py) run over the Charm++ core, Converse, and the machine layer, with a scheduler and message queue, on top of the interconnect.

Chare objects in Charm4py communicate with each other by asynchronously invoking entry methods, as in Charm++. The parameters are serialized and packed into a message that is handled by the underlying Charm++ runtime system. This allows our extension of the UCX machine layer to support GPU-aware communication to also benefit Charm4py.

Aside from Charm++-like communication through entry method invocations, Charm4py also provides a way to establish streamed connections between chares, called channels [14]. Channels provide explicit send/receive semantics to exchange messages, but retain asynchrony by suspending the caller object until the respective communication is complete. We extend the channels feature to support GPU-aware communication, which we discuss in detail in Section III-D.

III. DESIGN AND IMPLEMENTATION
To accelerate communication of GPU-resident data, we utilize the capability of UCX to directly send and receive GPU data through its tagged APIs. UCX is supported as a machine layer in Charm++, positioned at the lowest level of the software stack, directly interfacing the interconnect, as illustrated in Figure 1. As AMPI and Charm4py are built on top of the Charm++ runtime system, all host-side communication travels through the Charm++ core and Converse layers, where layer-specific headers are added or extracted, with the actual communication primitives executed by the machine layer.

The main idea of enabling GPU-aware communication in the Charm++ family of parallel programming models is to retain this route for metadata and host-side data, but supply GPU data directly to the UCX machine layer. The metadata is necessary due to the message-driven execution model in Charm++, as shown in Figure 2. The sender object provides the data it wants to send to the entry method invocation, but the receiver does not post an explicit receive function. Instead, the sender's message arrives in the message queue of the PE that currently owns the receiver object. When the message is picked up by the scheduler, the receiver object and target entry method are resolved using the metadata contained in the message. Any host-resident data destined for the receiving chare is unpacked from the message and delivered to the receiver's entry method.

// Sender object's method
void Sender::foo() {
  // Send some data to the receiver object,
  // actuating the 'bar' entry method
  receiverObject.bar(my_val1, my_val2);
}

// Receiver object's entry method,
// executed once the sender's message arrives
// and is picked up by the scheduler
void Receiver::bar(int val1, double val2) {
  // val1 and val2 are available
  ...
}

Fig. 2. Message-driven execution in Charm++.

Fig. 3. Tag generation for GPU communication in the UCX machine layer: a 64-bit tag composed of MSG_BITS (4 bits), PE_BITS (default: 32 bits), and CNT_BITS (default: 28 bits).

With our GPU-aware communication scheme, the sender object's GPU buffers are not included as part of the message. Only metadata containing information about the GPU data transfer initiated by the sender and the sender's data on host memory are contained in the message. Source GPU buffers are directly provided to the UCX machine layer to be sent, and a receive for the incoming GPU data is posted once the host-side message arrives on the receiver. While the UCX machine layer provides the fundamental capability to transfer buffers directly between GPUs, additional implementations are necessary to support the different parallel programming models, as described in the following subsections.
A. UCX Machine Layer
Originally contributed by Mellanox, the UCX machine layer in Charm++ is designed to handle low-level communication using the UCP tagged API, providing a portable implementation over all the networking hardware supported by UCX. To support GPU-aware communication, we extend the UCX machine layer to provide an interface for sending and receiving GPU data with the UCP tagged API. As this path of communication is separate from host-based messaging, a tag generation scheme specific to GPU-GPU transfers is necessary. (Note that the user does not provide any explicit tag to the Charm++ runtime system, except for reference numbers, which become part of the message itself.) As illustrated in Figure 3, the first four bits (MSG_BITS) of the 64-bit tag are used to set the type of the message, where the new UCX_MSG_TAG_DEVICE type is added to differentiate GPU communication. The remainder of the tag is split into the source PE index (PE_BITS, 32 by default) and the value obtained from a counter maintained by the source PE (CNT_BITS, 28 by default). This can be changed by the user to allocate more bits to one side or the other to accommodate different scaling configurations.

// Charm++ Interface (CI) file
// Exposes chare objects and entry methods
chare MyChare {
  entry MyChare();
  entry void recv(nocopydevice char data[size], size_t size);
};

// C++ source file
// (1) Sender chare
void MyChare::send() {
  peer.recv(CkDeviceBuffer(send_gpu_data), size);
}

// (2) Receiver's post entry method
void MyChare::recv(char*& data, size_t& size) {
  // Set the destination GPU buffer
  // Receive size is optional
  data = recv_gpu_data;
}

// (3) Receiver's regular entry method
void MyChare::recv(char* data, size_t size) {
  // Receive complete, GPU data is available
  ...
}

Fig. 4. GPU-aware communication interface in Charm++.

// Converse layer metadata
struct CmiDeviceBuffer {
  const void* ptr;  // Source GPU buffer address
  size_t size;
  uint64_t tag;     // Set in the UCX machine layer
  ...
};

// Charm++ core layer metadata
struct CkDeviceBuffer : CmiDeviceBuffer {
  CkCallback cb;    // Support Charm++ callbacks
  ...
};

Fig. 5. Metadata object used for GPU communication in Charm++.

The core functionalities of GPU communication in the UCX machine layer are exposed as the following functions:

void LrtsSendDevice(int dest_pe, const void*& ptr, size_t size, uint64_t& tag);
void LrtsRecvDevice(DeviceRdmaOp* op, DeviceRecvType type);

LrtsSendDevice provides the functionality to send GPU data, using the information provided by the calling layer, including the destination PE, address of the source GPU buffer, size of the data, and a reference to the 64-bit tag to be set. The tag is generated within this function by incrementing the tag counter of the source PE, and included as metadata by the caller to be sent separately along with any host-side data. Once the destination UCP endpoint is determined, the source GPU buffer is sent with ucp_tag_send_nb using the generated tag.

Fig. 6. Sender-side logic of GPU-aware communication in Charm++: (1) the CkDeviceBuffer metadata (ptr, size, tag, cb) passes from the user through the Charm++ core and Converse layers (CmiSendDevice) to the UCX machine layer (LrtsSendDevice), (2) which generates and stores the tag and sends the GPU data, while (3) the metadata is packed with the host-side data and sent over the network.

A receive for incoming GPU data is posted in LrtsRecvDevice, which is executed once metadata arrives on the destination PE. The DeviceRdmaOp struct passed by the calling layer contains the metadata necessary to post the receive with ucp_tag_recv_nb, such as the address of the destination GPU buffer, size of the data, and the tag set by the sender. DeviceRecvType denotes which parallel programming model has posted the receive, so that the appropriate receive handler can be invoked once the GPU data has been received. The following sections describe in detail how the UCX machine layer is used in common by the different parallel programming models to perform GPU-aware communication.
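Before turning to the individual models, a minimal sketch of the tag layout of Figure 3 may be helpful. The helper name makeDeviceTag below is ours and not part of the Charm++ source; the field widths and the UCX_MSG_TAG_DEVICE type follow the description above.

#include <cstdint>

// Field widths of the 64-bit UCP tag (defaults described above).
constexpr unsigned MSG_BITS = 4;   // message type, e.g. UCX_MSG_TAG_DEVICE
constexpr unsigned PE_BITS  = 32;  // source PE index
constexpr unsigned CNT_BITS = 28;  // per-PE message counter

// Illustrative composition: [ type | source PE | counter ], from most to
// least significant bits. The resulting tag is what LrtsSendDevice would
// pass to ucp_tag_send_nb and store in the metadata sent to the receiver.
inline uint64_t makeDeviceTag(uint64_t msgType, uint64_t srcPe, uint64_t counter) {
  return (msgType << (PE_BITS + CNT_BITS)) |
         ((srcPe & ((1ULL << PE_BITS) - 1)) << CNT_BITS) |
         (counter & ((1ULL << CNT_BITS) - 1));
}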
B. Charm++
Communication in Charm++ occurs between chare objects that may be executing on different PEs. It should be noted that multiple parameters can be passed to a single entry method invocation, as in Figure 2. We provide an additional attribute in the Charm++ Interface (CI) file, nocopydevice, to annotate parameters on GPU memory. Figure 4 illustrates this extension as well as the usage of a CkDeviceBuffer object, which wraps the address of a source GPU buffer and is used by the runtime system to store metadata regarding the GPU-GPU transfer. The structure of CkDeviceBuffer is presented in Figure 5.
1) Send:
An entry method invocation such as peer.recv() in Figure 4 executes a generated code block that prepares a message containing the data on host memory to be sent to the receiver object. We modify the code generation to send GPU buffers in tandem, using the CkDeviceBuffer objects (one per buffer) provided by the user. These objects hold the information necessary for the UCX machine layer to send the GPU buffers with LrtsSendDevice. The tags set by the machine layer are stored in the CkDeviceBuffer objects, which are packed with the host-side data as well as other metadata needed by the Converse and Charm++ core layers. This packed message is sent separately, also using the UCX machine layer. Figure 6 illustrates this process.
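As a rough sketch of this sender-side flow (our own simplification of Figure 6: the struct below omits the callback member of Figure 5, and Envelope, packDeviceBuffers, and sendHostMessage are hypothetical stand-ins for the runtime's actual message and packing code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified metadata object and machine-layer entry point (Section III-A).
struct CkDeviceBuffer { const void* ptr; size_t size; uint64_t tag; };
void LrtsSendDevice(int dest_pe, const void*& ptr, size_t size, uint64_t& tag);

struct Envelope;  // hypothetical host-side message
void packDeviceBuffers(Envelope* msg, const std::vector<CkDeviceBuffer>& bufs);
void sendHostMessage(int dest_pe, Envelope* msg);

// Each GPU buffer is handed to the machine layer, which generates a tag and
// sends the device data; the CkDeviceBuffer objects (now carrying tags) are
// then packed with the host-side message and sent separately.
void sendWithDeviceBuffers(int dest_pe, std::vector<CkDeviceBuffer>& gpu_bufs,
                           Envelope* host_msg) {
  for (CkDeviceBuffer& buf : gpu_bufs) {
    LrtsSendDevice(dest_pe, buf.ptr, buf.size, buf.tag);  // send GPU data, record tag
  }
  packDeviceBuffers(host_msg, gpu_bufs);  // pack metadata with host-side data
  sendHostMessage(dest_pe, host_msg);     // send the regular Charm++ message
}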
2) Receive:
To receive the incoming GPU data directly into the user's destination buffers and avoid extra copies, we provide a mechanism for the user to specify the addresses of the destination GPU buffers by extending the Zero Copy API [12] in Charm++. The user can provide this information to the runtime system in the post entry method of the receiver object, which is executed by the runtime system before the actual target entry method, i.e., the regular entry method. As can be seen in Figure 4, the post entry method has a similar function signature to the regular entry method, with parameters passed as references so that they can be set by the user.

Fig. 7. Sender-side logic of GPU-aware communication in AMPI: the CkDeviceBuffer metadata (ptr, size, tag, and a callback created to notify the sender rank of transfer completion) and the MPI tag are packed with additional metadata and sent through the Charm++ runtime system, while the GPU data is sent via the UCX machine layer (CmiSendDevice/LrtsSendDevice).

When the message containing the host-side data and metadata (including CkDeviceBuffer objects) arrives, the post entry method of the receiver chare is first executed. Using the information about destination GPU buffers provided by the user in the post entry method and the source GPU buffers in the CkDeviceBuffer objects, the receiver instructs the UCX machine layer to post receives for the incoming GPU data with LrtsRecvDevice. Once all the GPU buffers have arrived, the regular entry method is invoked, completing the communication.
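As a usage note, the destination buffer set in the post entry method is an ordinary device allocation owned by the receiver. A minimal sketch following the interface of Figure 4 (the member d_recv_buf, the size bound, and the commented-out base class are our assumptions, not part of the actual application):

#include <cuda_runtime.h>
#include <cstddef>

class MyChare /* : public CBase_MyChare */ {
  char* d_recv_buf = nullptr;      // hypothetical member: destination GPU buffer
  size_t max_bytes = 1 << 20;      // assumed upper bound on incoming data

public:
  MyChare() { cudaMalloc(reinterpret_cast<void**>(&d_recv_buf), max_bytes); }

  // Post entry method: executed before the GPU data is received, so the
  // runtime knows where to place it (see Figure 4).
  void recv(char*& data, size_t& size) { data = d_recv_buf; }

  // Regular entry method: executed after the GPU-GPU transfer completes;
  // d_recv_buf now holds the sender's data and can be used in a kernel.
  void recv(char* data, size_t size) { /* consume data on the GPU */ }
};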
C. Adaptive MPI
Each AMPI rank is implemented as a chare object on top of the Charm++ runtime system, to enable virtualization and adaptive runtime features such as load balancing. Communication between AMPI ranks occurs through an exchange of AMPI messages between the respective chare objects. An AMPI message adds AMPI-specific data, such as the MPI communicator and the user-provided tag, to a Charm++ message, and we modify how it is created to support GPU-aware communication with the CkDeviceBuffer metadata object. This change is transparent to the user, and GPU buffers can be directly provided to AMPI communication primitives such as MPI_Send and MPI_Recv, like any CUDA-aware MPI implementation.
1) Send:
The user application can send GPU data by invoking an MPI send call with parameters including the address of the source buffer, the number of elements and their datatype, the destination rank, tag, and MPI communicator. The chare object that manages the destination rank is first determined, and the source buffer's address is checked to see if it is located in GPU memory. A software cache containing addresses known to be on the GPU is maintained on each PE to optimize this process. Figure 7 illustrates the mechanism that is adopted when the source buffer is found to be on the GPU, where a CkDeviceBuffer object is first created in the AMPI runtime to store the information provided by the user. A Charm++ callback object is also created and stored as metadata, which is used by AMPI to notify the sender rank when the communication is complete. The source GPU buffer is sent in an identical manner to Charm++, through the UCX machine layer with LrtsSendDevice. The tag that is needed by the receiver rank to post a receive for the incoming GPU data is also generated and stored inside the CkDeviceBuffer object. Note that this tag is separate from the MPI tag provided by the user, which is used in matching the host-side send and receive.
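A hedged sketch of such a device-pointer check with a per-PE cache follows; cudaPointerGetAttributes is the standard CUDA runtime query, while the cache itself and its name are our illustration, not AMPI's actual data structure.

#include <cuda_runtime.h>
#include <unordered_set>

// Per-PE cache of addresses already known to reside in GPU memory.
static std::unordered_set<const void*> gpu_ptr_cache;

bool isDevicePointer(const void* ptr) {
  if (gpu_ptr_cache.count(ptr)) return true;   // fast path: seen before

  cudaPointerAttributes attr;
  if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
    cudaGetLastError();  // clear the error raised for plain host memory
    return false;
  }
  // attr.type is cudaMemoryTypeDevice for device allocations
  // (the field is named memoryType in older CUDA versions).
  bool on_gpu = (attr.type == cudaMemoryTypeDevice);
  if (on_gpu) gpu_ptr_cache.insert(ptr);
  return on_gpu;
}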
2) Receive:
Because there are explicit receive calls in the MPI model, in contrast to Charm++, there are two possible scenarios regarding the host-side message that contains the metadata: the message arrives before the receive is posted, or vice versa. If the message arrives first, it is stored in an unexpected message queue, which is searched for a match when the receive is posted later. If the receive is posted first, it is stored in a request queue to be matched when the message arrives. The receive for the incoming GPU data is posted after this match of the host-side message, with LrtsRecvDevice in the UCX machine layer. Another Charm++ callback is created for the purpose of notifying the destination rank, which is invoked by the machine layer when the GPU data arrives.
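The two scenarios can be summarized with a small matching sketch. This is a simplification in our own terms: the actual AMPI queues carry more state (communicators, datatypes, wildcard tags), and posting the GPU receive is only indicated by a comment here.

#include <deque>

struct HostMessage { int src_rank, mpi_tag; /* + CkDeviceBuffer metadata */ };
struct RecvRequest { int src_rank, mpi_tag; void* dest_gpu_buf; };

static std::deque<HostMessage> unexpected_msgs;  // arrived before a matching receive
static std::deque<RecvRequest> posted_recvs;     // posted before the message arrived

static bool matches(const HostMessage& m, const RecvRequest& r) {
  return m.src_rank == r.src_rank && m.mpi_tag == r.mpi_tag;
}

// Called when the metadata-carrying host-side message arrives.
void onMessageArrival(const HostMessage& m) {
  for (auto it = posted_recvs.begin(); it != posted_recvs.end(); ++it) {
    if (matches(m, *it)) {
      // Match found: post the GPU receive (LrtsRecvDevice) using the tag
      // stored in the message's CkDeviceBuffer metadata.
      posted_recvs.erase(it);
      return;
    }
  }
  unexpected_msgs.push_back(m);  // no receive posted yet
}

// Called when the application posts MPI_Recv/MPI_Irecv for a GPU buffer.
void onReceivePosted(const RecvRequest& r) {
  for (auto it = unexpected_msgs.begin(); it != unexpected_msgs.end(); ++it) {
    if (matches(*it, r)) {
      unexpected_msgs.erase(it);  // match found: post the GPU receive
      return;
    }
  }
  posted_recvs.push_back(r);  // message has not arrived yet
}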
D. Charm4py
GPU-aware communication in Charm4py is built around the channel API, which provides functionality for the user to provide the address of the destination GPU buffer. While the API itself is in Python, its core functionalities are implemented with Cython [15] and the underlying Charm++ runtime system, which is written in C++. Cython generates C extension modules that allow C constructs and types to be used with Python for interoperability and performance, and is used extensively in the Charm4py runtime. The Cython layer is also used to interface with the Charm++ runtime, which performs the bulk of the work for GPU-aware communication with the UCX machine layer. Note that the Python interface for UCX, UCX-Py [16], is not used in this work, as Charm4py can directly utilize the UCX functionalities in C/C++ through the Charm++ runtime system.

Fig. 9. Sender-side logic of GPU-aware communication in Charm4py: the buffer address and size pass from the user through the Charm4py runtime (Python), the channels API, and the Cython layer into the Charm++ runtime system, which creates the CkDeviceBuffer, sends the GPU data (CmiSendDevice/LrtsSendDevice), and packs the metadata with the host-side data for the network.

Figure 8 compares our GPU-aware communication support against the host-staging mechanism in a ping-pong exchange of GPU data. The two chares involved establish a channel, where either data on the host is sent to the peer (host-staging) or GPU data is directly provided (GPU-aware). The host-staging version needs to explicitly move data between host and device memory using the CUDA API, adding complexity for the programmer and degrading performance. Note that the channel send and receive calls are asynchronous; the coroutine posting the receive is suspended until the message arrives. Such asynchronous communication is implemented with futures [17], a key component of Charm4py.

Fig. 8. Channel-based communication in Charm4py. CUDA functions are included in the Charm++ library as C++ functions and exposed through Charm4py's Cython layer.
1) Send:
As described in Figure 8, addresses of the source and destination GPU buffers can be directly provided to Charm4py's channel API. The address and size of the buffer are propagated to the Charm++ runtime system through the Cython layer, where they are used to construct the CkDeviceBuffer metadata object. The steps after this point are similar to Charm++ and AMPI: the metadata is used by the UCX machine layer to send the source GPU buffer, and the metadata itself is packed together with the host-side data and Charm4py-specific information to be sent separately to the receiver object. This process is illustrated in Figure 9.
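To indicate where the Python and C++ worlds meet, the following is a hypothetical C-callable bridge of the kind the Cython layer could invoke; the real Charm4py/Charm++ interface differs in names and signatures, and the struct below is the simplified Converse-layer metadata from Figure 5.

#include <cstddef>
#include <cstdint>

// Declarations taken from the listings above (Figure 5 and Section III-A).
struct CmiDeviceBuffer { const void* ptr; size_t size; uint64_t tag; };
void LrtsSendDevice(int dest_pe, const void*& ptr, size_t size, uint64_t& tag);

// Hypothetical entry point reachable from Cython: wrap the GPU buffer as
// metadata, let the machine layer send the device data and generate the tag;
// the metadata is then packed with the host-side message (Figure 9).
extern "C" void charm4py_send_device(int dest_pe, const void* gpu_ptr, size_t size) {
  CmiDeviceBuffer buf{gpu_ptr, size, 0};
  LrtsSendDevice(dest_pe, buf.ptr, buf.size, buf.tag);
}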
2) Receive:
When the host-side message containing the metadata about the GPU-GPU transfer arrives, it is used to post the receives for the incoming GPU data in the UCX machine layer. A Charm++ callback is created and tied to the LrtsRecvDevice function, so that it can be invoked when the GPU-GPU transfer is complete. This callback invocation fulfills the future that has suspended the channel receive call, allowing the user application (coroutine) to continue.

IV. PERFORMANCE EVALUATION
In this section, we describe the hardware platform and software configurations, as well as the set of micro-benchmarks and the proxy application used to evaluate the performance of our GPU-aware communication designs.
A. Experimental Setup
The Summit supercomputer at Oak Ridge National Laboratory is used to evaluate the performance of the GPU-aware communication mechanisms implemented in Charm++, AMPI and Charm4py. The experiments are scaled up to 256 nodes of Summit, where each IBM AC922 node contains two IBM Power9 CPUs and six NVIDIA Tesla V100 GPUs. Each CPU is connected to three GPUs, which are interconnected via NVLink with a theoretical peak bandwidth of 50 GB/s. For a GPU to communicate with another GPU connected to the other CPU, data needs to travel through the X-Bus that connects the CPUs, with a bandwidth of 64 GB/s. The network interconnect is based on Enhanced Data Rate (EDR) Mellanox InfiniBand, providing up to 12.5 GB/s of bandwidth.

Charm++, AMPI and Charm4py are configured to use the non-SMP build, assigning one CPU core as the PE for each process and mapping one process to each GPU device. On a single node of Summit, for example, six PEs (and processes) execute in parallel using all six available GPUs. To focus on the impact of GPU-awareness on communication performance by separating the time spent on computation and communication, the problem domain is decomposed into as many chare objects as the number of PEs/GPUs (no overdecomposition in Charm++/Charm4py, no virtualization in AMPI).

For reference, the performance of OpenMPI, which also maps one process to each GPU, is provided along with the AMPI results. Since both AMPI and OpenMPI utilize UCX to transfer GPU data, this comparison isolates the performance differential incurred by the layers above UCX. Note that AMPI delivers messages through the Charm++ runtime system to enable its adaptive runtime features, in contrast to OpenMPI, which can directly utilize UCX for communication.
B. Micro-benchmarks
To evaluate the performance of point-to-point communication primitives involving GPU memory, we adapt the widely used OSU micro-benchmark suite [18] to Charm++ and Charm4py. We also add an option to use the host-staging mechanism, which stages the GPU buffer on host memory before performing communication, to measure the performance impact of our implementations enabling GPU-aware communication. This option is added to the original MPI versions of the benchmarks as well, for AMPI and OpenMPI. Performance results are presented with both axes in log scale, comparing the GPU-aware version of each benchmark (suffixed with D) against the host-staging version (suffixed with H).
1) Latency:
The OSU latency benchmark repeats ping-pong iterations for different message sizes, where the sender sends a message to the receiver and waits for a reply. Once the message arrives, the receiver sends a message of the same size back to the sender, completing the iteration. GPU-aware communication allows the message buffers to be supplied directly to the communication primitives, whereas the host-staging version requires additional cudaMemcpy operations to move data between host and device memory.

Fig. 10. Comparison of intra-node latency between host-staging and direct GPU-GPU mechanisms: (a) Charm++, (b) AMPI and OpenMPI, (c) Charm4py.

Fig. 11. Comparison of inter-node latency between host-staging and direct GPU-GPU mechanisms: (a) Charm++, (b) AMPI and OpenMPI, (c) Charm4py.

Figures 10 and 11 illustrate the improvements in intra-node and inter-node latency with GPU-awareness in Charm++, AMPI and Charm4py. The range of performance improvements in the latency benchmark is summarized in Table I, with the speedups achieved for small messages using the eager protocol denoted in a separate row. The observed improvement in latency increases with message size for large messages in all three programming models, as the host-staging mechanism suffers significant slowdowns caused by host memory copies in the Charm++ runtime system.

Although the performance of AMPI improves substantially with GPU-aware communication, it does not quite match the latency of CUDA-aware OpenMPI. To further investigate this issue, we isolate the time taken in UCX by taking advantage of the modular property of the UCX machine layer. We can easily disable the CmiSend/RecvDevice calls in the Converse layer and invoke receive handlers directly, allowing us to determine the time taken outside of UCX. This turns out to be about 8 µs, which tells us that the GPU-GPU transfer itself with UCX has a latency of less than 2 µs, similar to OpenMPI. Thus most of the overhead is AMPI-specific, which has multiple factors: message packing and unpacking, the additional host-side message that contains metadata, Charm++ callback invocations, and the fact that the receiver rank cannot post a receive until the metadata message is received. There are also a couple of heap memory allocations that are used to keep metadata for the UCX machine layer. We plan to further analyze and optimize the code to bring AMPI's performance as close to OpenMPI as possible.

It should be noted that the detection of the GDRCopy library by UCX is essential in order to achieve low latencies with small messages; this library is not included in the default library search path on Summit. With the rendezvous protocol, UCX switches to the CUDA IPC transport for intra-node transfers, and to a pipelined host-staging mechanism that stages GPU data on host memory in chunks for inter-node communication.
2) Bandwidth:
In the OSU bandwidth benchmark, the sender performs a number of back-to-back non-blocking sends designated by the window size for each message size, then waits for a reply from the receiver. The receiver performs the reverse, posting multiple non-blocking receives followed by a send. The increase in bandwidth achieved by our GPU-aware communication mechanisms is illustrated in Figures 12 and 13, with the range of improvement detailed in Table I. Charm++ and AMPI achieve close to the maximum attainable bandwidth (50 GB/s for intra-node, 12.5 GB/s for inter-node), with Charm++ demonstrating up to 44.7 GB/s and 10 GB/s, and AMPI up to 45.4 GB/s and 10 GB/s for intra-node and inter-node, respectively. It is worth noting that the host-staging version of AMPI (AMPI-H) suffers a degradation in bandwidth at 128 KB due to a sudden increase in latency, an issue that is being investigated. Charm4py's bandwidth only reaches 35.5 GB/s for intra-node and 6.0 GB/s for inter-node in the given range of message sizes, but we observe that it keeps increasing as messages become larger than 4 MB.
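For reference, the window-based measurement described above follows the usual OSU pattern. A simplified sketch of one sender-side window is shown below; the MPI calls are standard, while WINDOW_SIZE, the tags, and the device buffer name are our own stand-ins rather than the benchmark's actual parameters.

#include <mpi.h>

const int WINDOW_SIZE = 64;  // assumed window size

// One bandwidth measurement for a given message size: WINDOW_SIZE
// back-to-back non-blocking sends of a GPU buffer, followed by a short
// reply from the receiver that marks the end of the window.
double send_window(const char* d_buf, int bytes, int receiver) {
  MPI_Request reqs[WINDOW_SIZE];
  double start = MPI_Wtime();
  for (int i = 0; i < WINDOW_SIZE; i++) {
    MPI_Isend(d_buf, bytes, MPI_CHAR, receiver, 100, MPI_COMM_WORLD, &reqs[i]);
  }
  MPI_Waitall(WINDOW_SIZE, reqs, MPI_STATUSES_IGNORE);
  char ack;
  MPI_Recv(&ack, 1, MPI_CHAR, receiver, 101, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  double elapsed = MPI_Wtime() - start;
  return (double)bytes * WINDOW_SIZE / 1e6 / elapsed;  // bandwidth in MB/s
}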
Fig. 12. Comparison of intra-node bandwidth between host-staging and direct GPU-GPU mechanisms: (a) Charm++, (b) AMPI and OpenMPI, (c) Charm4py.

Fig. 13. Comparison of inter-node bandwidth between host-staging and direct GPU-GPU mechanisms: (a) Charm++, (b) AMPI and OpenMPI, (c) Charm4py.

TABLE I
IMPROVEMENT IN LATENCY AND BANDWIDTH WITH GPU-AWARE COMMUNICATION

                    |            Intra-node              |            Inter-node
                    | Charm++     AMPI        Charm4py   | Charm++    AMPI       Charm4py
Latency (Range)     | 2.1x–10.2x  1.9x–11.7x  1.8x–17.4x | 1.2x–4.1x  1.8x–3.5x  1.5x–3.4x
Latency (Eager)     | 4.4x        3.6x        1.9x       | 4.1x       3.4x       1.8x
Bandwidth (Range)   | 1.4x–9.6x   1.3x–10.0x  1.3x–10.5x | 1.2x–2.7x  1.3x–2.6x  1.0x–1.5x

C. Proxy Application: Jacobi3D

To assess the impact of GPU-aware communication on application performance, we implement a proxy application, Jacobi3D, on all three parallel programming models: Charm++, AMPI, and Charm4py. Jacobi3D performs the Jacobi iterative method in a three-dimensional space, using CUDA kernels to perform stencil computations on the GPU. The problem domain is decomposed into equal-size cuboid blocks, minimizing surface area. Each block exchanges its halo data on the GPU with up to six neighbors, which is either provided directly to the communication primitives (GPU-aware) or staged through host memory. Note that Jacobi3D is configured to run for a set number of iterations without convergence checks, to evaluate the performance of point-to-point communication.

We evaluate both weak and strong scaling performance of Jacobi3D using up to 256 nodes (1,536 GPUs) of Summit, comparing the overall time and communication time per iteration of the host-staging and GPU-aware communication mechanisms. Jacobi3D is weak scaled from a base domain size of double values, with each dimension doubled in x, y, z order. Strong scaling experiments executed on eight to 256 nodes maintain the same domain size of doubles.
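To show the per-iteration structure of the halo exchange described above, the following is our own simplified sketch in the MPI-style variant; the kernels (packing, unpacking, stencil update) are only indicated by comments, and the neighbor lists, buffer arrays, and sizes are assumed application state.

#include <mpi.h>
#include <cuda_runtime.h>

// One Jacobi3D iteration (simplified): exchange GPU halo buffers with up to
// six neighbors, then run the stencil kernel on the block.
void jacobi_iteration(int num_neighbors, const int* neighbor,
                      char** d_send, char** d_recv, const int* halo_bytes) {
  MPI_Request reqs[12];
  int r = 0;
  for (int i = 0; i < num_neighbors; i++) {
    MPI_Irecv(d_recv[i], halo_bytes[i], MPI_CHAR, neighbor[i], 0,
              MPI_COMM_WORLD, &reqs[r++]);
  }
  // packHalos(d_send);            // CUDA kernel filling the send buffers (assumed)
  cudaDeviceSynchronize();         // ensure packed halos are ready before sending
  for (int i = 0; i < num_neighbors; i++) {
    MPI_Isend(d_send[i], halo_bytes[i], MPI_CHAR, neighbor[i], 0,
              MPI_COMM_WORLD, &reqs[r++]);
  }
  MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);
  // unpackHalos(d_recv);          // CUDA kernel writing halos into the block (assumed)
  // jacobiKernel(...);            // stencil update on the GPU (assumed)
}

In the host-staging configuration, the same structure additionally copies each halo to and from host memory around the MPI calls, which is exactly the overhead the GPU-aware path removes.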
1) Charm++:
Figure 14 shows the weak and strong scaling performance of the Charm++ version of Jacobi3D. With weak scaling, the GPU-aware version (Charm++-D) demonstrates a speedup between 1.1x and 12.4x in communication performance, with the largest speedup obtained on a single node. This is an expected result, as the improvements in latency and bandwidth are more significant for intra-node communication. The improved communication performance translates into reductions in overall iteration time of between 5% and 37%. The relative speedup obtained with GPU-aware communication decreases as the number of nodes increases, as slower inter-node communication starts to dominate intra-node communication. With strong scaling, the improvement in communication performance ranges between 12% and 82%, and the improvement in overall iteration time between 9% and 27%, with the largest speedup obtained on a single node.
2) AMPI:
Figure 15 illustrates the weak and strong scaling performance of the AMPI version of Jacobi3D, with the performance of OpenMPI provided as a reference. With weak scaling, GPU-awareness improves the communication performance by factors between 1.3x and 12.8x, accelerating the overall performance by up to 41%. The GPU-aware communication performance in AMPI is similar to that of OpenMPI up to 16 nodes, but starts to fall behind at larger scales. We suspect that this is due to the additional metadata exchange performed in AMPI, whose performance impact becomes more pronounced at large node counts; we plan to look into this issue in more detail. With strong scaling, AMPI achieves a speedup between 1.9x and 2.6x in communication performance and an improvement in overall iteration time between 27% and 74%.

Fig. 14. Comparison of Charm++ Jacobi3D performance between host-staging and direct GPU-GPU mechanisms: (a) weak scaling, overall time; (b) weak scaling, communication time; (c) strong scaling, overall time; (d) strong scaling, communication time.

Fig. 15. Comparison of AMPI Jacobi3D performance between host-staging and direct GPU-GPU mechanisms: (a) weak scaling, overall time; (b) weak scaling, communication time; (c) strong scaling, overall time; (d) strong scaling, communication time.

Fig. 16. Comparison of Charm4py Jacobi3D performance between host-staging and direct GPU-GPU mechanisms: (a) weak scaling, overall time; (b) weak scaling, communication time; (c) strong scaling, overall time; (d) strong scaling, communication time.
3) Charm4py:
The weak and strong scaling performance of Charm4py is depicted in Figure 16. As the support for GPU-aware communication in Charm4py significantly improves performance, especially for large messages as seen in Figures 10c and 11c, communication performance is improved by factors between 1.9x and 19.7x with weak scaling. Because communication performance has a greater impact on the overall performance in Charm4py than in the other parallel programming models, we observe speedups in overall execution time between 1.9x and 7.3x. With strong scaling, the improvement in communication performance ranges between 1.4x and 3.0x, resulting in speedups between 1.5x and 2.7x in overall iteration time.

V. RELATED WORK
There have been many publications on supporting GPU-aware communication in the context of parallel programming models. Works from the MVAPICH group [2], [3], [19] utilize CUDA and GPUDirect technologies to optimize inter-GPU communication in MPI. Hanford et al. [20] highlight shortcomings of current GPU communication benchmarks and share experiences with tuning different MPI implementations. Khorassani et al. [21] evaluate the performance of various MPI implementations on OpenPOWER systems. Chen et al. [22] propose compiler extensions to support GPU communication in the UPC programming model. This work is distinguished from other related studies in demonstrating designs for GPU-aware communication, and their performance, in multiple parallel programming models built on a common abstraction layer based on UCX.

VI. CONCLUSION
In this work, we have discussed the importance of GPU-aware communication on today's GPU-accelerated supercomputers, and the associated technologies involved in supporting direct GPU-GPU transfers for several parallel programming models: Charm++, AMPI, and Charm4py. We leverage the capability of the UCX framework to seamlessly support inter-GPU communication through a set of high-performance APIs, implementing an extension to the UCX machine layer in the Charm++ runtime system to provide a performance-portable communication layer for the Charm++ family of parallel programming models. With designs that utilize the UCX machine layer for GPU-aware communication while retaining the semantics of message-driven execution, we demonstrate substantial improvements in performance using latency and bandwidth benchmarks adapted from the OSU benchmark suite, as well as a proxy application representing a widely used stencil algorithm.

With GPU-aware communication support in place for the Charm++ ecosystem, we plan to incorporate computation-communication overlap with overdecomposition [23] to minimize communication overheads on modern GPU systems. We also plan to support collective communication of GPU data, using this work as the basis to translate collective communication primitives into point-to-point calls.

While UCX proves to be an effective framework for universally accelerating GPU communication, there is still room for performance improvement, as indicated by the differential between AMPI and OpenMPI. One potential area of improvement is GPU support in the active messages API of UCX, which could better fit the message-driven execution model of Charm++. Another is supporting user-provided tags in the Charm++ runtime system, which would eliminate the need to delay the posting of the receive for GPU data until the arrival of the metadata message.

ACKNOWLEDGMENT
We thank the UCX developer team, including Akshay Venkatesh, Devendar Bureddy, and Yossi Itigin, for their assistance with technical issues on the Summit supercomputer. This work was performed under the auspices of the U.S. Department of Energy (DOE) by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-819099). This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. DOE Office of Science and the National Nuclear Security Administration. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC05-00OR22725.

REFERENCES

[2] 2012, pp. 1848–1857.
[3] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. K. Panda, "Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs," 2013, pp. 80–89.
[4] D. Bonachea and P. H. Hargrove, "GASNet-EX: A high-performance, portable communication library for exascale," in Languages and Compilers for Parallel Computing, M. Hall and H. Sundar, Eds. Cham: Springer International Publishing, 2019, pp. 138–158.
[5] P. Grun, S. Hefty, S. Sur, D. Goodell, R. D. Russell, H. Pritchard, and J. M. Squyres, "A brief introduction to the OpenFabrics interfaces - a new network API for maximizing high performance application efficiency," 2015, pp. 34–39.
[6] P. Shamis, M. G. Venkata, M. G. Lopez, M. B. Baker, O. Hernandez, Y. Itigin, M. Dubman, G. Shainer, R. L. Graham, L. Liss, Y. Shahar, S. Potluri, D. Rossetti, D. Becker, D. Poole, C. Lamb, S. Kumar, C. Stunkel, G. Bosilca, and A. Bouteiller, "UCX: An open source framework for HPC network APIs and beyond," 2015, pp. 40–43.
[7] G. Shainer, A. Ayoub, P. Lui, T. Liu, M. Kagan, C. R. Trott, G. Scantlen, and P. S. Crozier, "The development of Mellanox/NVIDIA GPUDirect over InfiniBand - a new model for GPU to GPU communications," Comput. Sci., vol. 26, no. 3–4, pp. 267–273, Jun. 2011. [Online]. Available: https://doi.org/10.1007/s00450-011-0157-1
[8] (2021) GPUDirect RDMA :: CUDA Toolkit documentation. [Online]. Available: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
[9] R. Shi, S. Potluri, K. Hamidouche, J. Perkins, M. Li, D. Rossetti, and D. K. Panda, "Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters," 2014, pp. 1–10.
[10] B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon, E. Mikida, X. Ni, M. Robson, Y. Sun, E. Totoni, L. Wesolowski, and L. Kale, "Parallel programming with migratable objects: Charm++ in practice," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. IEEE Press, 2014, pp. 647–658. [Online]. Available: https://doi.org/10.1109/SC.2014.58
[11] C. Huang, O. Lawlor, and L. V. Kalé, "Adaptive MPI," in Languages and Compilers for Parallel Computing, L. Rauchwerger, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 306–322.
[12] (2021) Charm++ Zero Copy Messaging API. [Online]. Available: https://charm.readthedocs.io/en/v6.10.2/charm++/manual.html
[13] 2018, pp. 423–433.
[14] (2021) Charm4py Channels API. [Online]. Available: https://charm4py.readthedocs.io/en/latest/introduction.html
[15] Computing in Science & Engineering, vol. 13, no. 2, pp. 31–39, 2011.
[16] (2021) UCX-Py. [Online]. Available: https://github.com/rapidsai/ucx-py
[17] (2021) Charm4py Futures API. [Online]. Available: https://charm4py.readthedocs.io/en/latest/introduction.html
[18] in Recent Advances in the Message Passing Interface, J. L. Träff, S. Benkner, and J. J. Dongarra, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 110–120.
[19] H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU communication for InfiniBand clusters," Comput. Sci., vol. 26, no. 3–4, pp. 257–266, Jun. 2011. [Online]. Available: https://doi.org/10.1007/s00450-011-0171-3
[20] N. Hanford, R. Pankajakshan, E. A. León, and I. Karlin, "Challenges of GPU-aware communication in MPI," 2020, pp. 1–10.
[21] K. S. Khorassani, C.-H. Chu, H. Subramoni, and D. K. Panda, "Performance evaluation of MPI libraries on GPU-enabled OpenPOWER architectures: Early experiences," in High Performance Computing, M. Weiland, G. Juckeland, S. Alam, and H. Jagode, Eds. Cham: Springer International Publishing, 2019, pp. 361–378.
[22] L. Chen, L. Liu, S. Tang, L. Huang, Z. Jing, S. Xu, D. Zhang, and B. Shou, "Unified Parallel C for GPU clusters: Language extensions and compiler implementation," in Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing, ser. LCPC '10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 151–165.
[23] J. Choi, D. F. Richards, and L. V. Kale, "Achieving computation-communication overlap with overdecomposition on GPU systems," in 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), 2020.