[PDF] A Newcomer In The PGAS World -- UPC++ vs UPC: A Comparative Study

Abstract

A newcomer in the Partitioned Global Address Space (PGAS) 'world' has arrived in its version 1.0: Unified Parallel C++ (UPC++). UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, futures and promises. UPC++ API for moving non-contiguous data and handling memories with different optimal access methods resemble those used in modern C++. In this study we provide two kernels implemented in UPC++: a sparse-matrix vector multiplication (SpMV) as part of a Partial-Differential Equation solver, and an implementation of the Heat Equation on a 2D-domain. Code listings of these two kernels are available in the article in order to show the differences in programming style between UPC and UPC++. We provide a performance comparison between UPC and UPC++ using single-node, multi-node hardware and many-core hardware (Intel Xeon Phi Knight's Landing).

Full PDF

AA Newcomer In The PGAS World - UPC++ vs UPC: AComparative Study

J´er´emie Lagravi`ere Johannes Langguth Martina Prugger Phuong H. Ha Xing Cai , Simula Research Laboratory, P.O. Box 134, NO-1325 Lysaker, Norway Lopez Laboratory, Vanderbilt University , Nashville, Tennessee, USA. The Arctic University of Norway,NO-9037 Tromsø, Norway University of Oslo, NO-0316 Oslo, Norway

Correspondence should be addressed to J´er´emie Lagravi`ere; [email protected]

Abstract

A newcomer in the Partitioned Global Address Space (PGAS) ’world’ has arrived in its version1.0: Uniﬁed Parallel C++ (UPC++). UPC++ targets distributed data structures where communicationis irregular or ﬁne-grained. The key abstractions are global pointers, asynchronous programming viaRPC, futures and promises. UPC++ API for moving non-contiguous data and handling memories withdifferent optimal access methods resemble those used in modern C++. In this study we provide twokernels implemented in UPC++: a sparse-matrix vector multiplication (SpMV) as part of a Partial-Differential Equation solver, and an implementation of the Heat Equation on a 2D-domain. Codelistings of these two kernels are available in the article in order to show the differences in programmingstyle between UPC and UPC++. We provide a performance comparison between UPC and UPC++using single-node, multi-node hardware and many-core hardware (Intel Xeon Phi Knight’s Landing).

Keywords : PGAS; APGAS;UPC++; UPC programming language; Fine-grained irregular communi-cation; Sparse matrix-vector multiplication; Performance optimization

In distributed memory parallel systems, MPI has been the de-facto standard for a long time. It reliablyprovides high communication performance, but programming MPI applications is very complex, and thiscomplexity of parallel programming is a key challenge that the HPC research and industry communitiesface. One of the most prominent approaches to alleviate this problem is the use of Partitioned GlobalAddress Space (PGAS) systems. For more than 20 years [6, 18], the dominating PGAS implementationshave been UPC, Coarray Fortran, and SHMEM. Other implementations have also been proposed such asChapel [11], Titanium, and recently UPC++ [24], whose version 1.0 was released in September 2019.The PGAS programming model constitutes an alternative [16] to using MPI or MPI with OpenMP. In thisstudy we focus on UPC++. 1 a r X i v : . [ c s . D C ] F e b e have multiple goals for this study: we want to extend our previous studies [13, 14] and focus onwhat could be seen as the future of PGAS by studying the new Version 1.0 of UPC++ PGAS language[24]), released in September 2019. To this end, we compare the performance and programability of UPCand UPC++.PGAS [2, 8, 10, 19] is a programming model that aims to achieve both good programmer productivityand high computational performance. These goals are usually conﬂicting in the context of developing par-allel code for scientiﬁc computations. The main aspect of PGAS is a global address space that is sharedamong concurrent processes, running on different nodes of a supercomputer that jointly execute a par-allel program. Data exchange between the processes is typically performed transparently by a low-levelnetwork layer such as GASNet [5], without explicit involvement from the programmer, thus providinggood productivity. The shared global address space is logically partitioned such that each partition hasafﬁnity to a designated owner process. This awareness of data locality is essential for achieving goodperformance of parallel programs written in the PGAS model, because the globally shared address spacemay encompass many physically distributed memory sub-systems.As mentioned before, UPC and UPC++ are both implementations of the PGAS programming model.UPC stands for Uniﬁed Parallel C and UPC++ stands for

Uniﬁed Parallel C++ . In both cases data is either shared or private , where shared data is accessible (read or write) by all threads and private data is onlyaccessible by its owner thread. UPC is based on the idea of ”communication simpliﬁcation” in the sensethat there is no way for the programmer to distinguish intra-node memory operations (between threadsrunning on the same hardware node) from their inter-node counterparts. This brings a very efﬁcient andsimple way of creating programs either in UPC at the very possible expense of performance. By default,UPC++ follows the same pattern, but with the recent addition of teams of threads, it offers a native wayto distinguish between intra- and inter-node communication.However, the main difference between UPC and UPC++, on a conceptual level, is that UPC++ im-plements a sub-category of PGAS, which is called APGAS: ”Asynchronous Partitioned Global AddressSpace”. The APGAS model extends the PGAS with ideas from task-based asynchronous execution model,which describes the semantics of an application as a hierarchy of tasks that are dynamically created andscheduled during the execution time [17]. Another language that implements the APGAS paradigm isX10 [26]. Furthermore, for software maintenance reasons, UPC++ is distributed as a library rather than alanguage, although this has little effect on the actual PGAS programming.To study UPC++ we have developed two applications based on well-known kernels: the heat equationin 2D and a sparse-matrix vector multiplication using the ELLPACK format. It is designed to simulatediffusion processes over unstructured 3D mesh representing the human cardiac ventricle. In this paperwe compare UPC++ to its ancestor UPC on both multi-node and many-core architectures using multi-ple criteria such as: programmability, performance predictability, scalability, performance (GFLOPS orexecution time). After this comparison we discuss the obtained results.Previous studies [3, 27], made in part by the UPC++ design team, have included comparative aspectsabout UPC++. Our study focuses on comparing UPC++ to UPC by using the SpMV kernel which, due toits irregularity and communication requirements, poses a challenging benchmark for distributed memorysystems. 2 Implementations

In this study we have focused our efforts on the implementations of two kernels in UPC and UPC++. Theﬁrst kernel is rather simple and is the implementation of the Heat Equation in two dimensions domain,presented in detail in Section 2.1.1. The second kernel is the implementation of a Sparse matrix-vectormultiplication, solving a 3D diffusion equation that is posed on an irregular domain modeling the leftcardiac ventricle of a healthy male human. This 3D diffusion solver was designed as an integral part of acardiac electrophysiology simulator and is presented in detail in Section 2.1.2. This section aims also atshowing the difference between the programming ’style’ of UPC and UPC++.

The heat equation is one of the most fundamental kernels in scientiﬁc computing. Due to its simplicity,it is useful as a benchmark to assess the performance of a parallel system. For this experiment, we solvethe 2D heat diffusion equation: ∂φ∂ t = ∂ φ∂ x + ∂ φ∂ y on a uniform mesh. We employ the ﬁnite difference discretization given by: u ( x , y ) = u − h h We use an existing UPC code that was tested in previous work [13]. The UPC++ implementation isderived from this UPC implementation. Both the UPC and UPC++ codes implement a 2D heat equationsolver using halo-exchange with a single layer of ghost cells to communicate data between threads. Weuse X and Y to denote the number of cells in each dimension of the domain. Each processing elementsreceives a rectangular sub-domain of approximately equal size. In order to ensure the communication ofdata to adjacent subdomain(s) we use the well-known halo exchange technique [9]. In both our implemen-tations of the 2D Heat Equation the domains receive their initial values before the computation starts. Theheat diffusion is then computed by performing N time steps. For each time step, each processing-elementperforms two tasks: computing the diffusion by updating each cell in its subdomain, and communicatingthe newly computed results on the domain boundary to adjacent subdomains. All time steps perform thesame amount of computation, but due to cache effects and the unpredictability of network communicationtheir execution time can vary. For benchmarking purposes it is important to obtain the average time pertime step. Typically, running for N = A general matrix-vector multiplication is compactly and mathematically deﬁned by the following formula: y = Mx . In this study, without loss of generality, we assume that the matrix M is square, having n rows and n columns. Both the input vector and the result vector, respectively denoted by x and y , are of length n . The code was kindly provided by Dr. Rolf Rabenseifner at HLRS, in connection with a course on PGAS programming [20] i of the result vector y we apply the following general formula (usingzero-based indices): y ( i ) = ∑ ≤ j < n M ( i , j ) x ( j ) . (1) M is called a sparse matrix, if most of the M ( i , j ) values are zero. In this case the above formulabecomes unnecessarily expensive from a computational point of view. A more economic formula forcomputing y ( i ) in a sparse matrix-vector multiplication (SpMV) is thus: y ( i ) = ∑ M ( i , j ) (cid:54) = M ( i , j ) x ( j ) , (2)This formula takes into account only the nonzero values of matrix M on each row. Because only thenonzero values are used, it is then memory-wise unnecessarily expensive to store all the n values of asparse matrix. As a result various compact storage formats for sparse matrices have been adopted, suchas: • the coordinate format (COO);• the compressed sparse row format (CSR);• the compressed sparse column format (CSC);• and the EllPack format [12].For sparse matrices that have a homogeneous number of nonzero values per row, it is usual to use theEllPack storage format. The EllPack storage format usually uses two 2D tables. These two 2D tables areof the same size: having n rows and the number of columns equaling the maximum number of non-zerosper row. The ﬁrst 2D table contains all the nonzero values of the sparse matrix. Whereas the second2D table contains the corresponding column indices of the non-zeros. Explicitly stored zeros are usedfor padding when the rows that have fewer than the maximum number of non-zeros. Additionally, if weassume that all the values on the main diagonal of a sparse matrix M are nonzero, which is applicable tomost scientiﬁc applications, it is beneﬁcial to split M so that: M = D + A (3)In this formula, D is the main diagonal of M , while A contains the off-diagonal part of M . Then, wecan use a modiﬁed EllPack format where the main diagonal D is stored as a 1D array of length n . Thereis no need to store the column indices of these nonzero diagonal values, because their column indicesequal the row indices by deﬁnition. We suppose that r nz now denotes the maximum number of non-zerooff-diagonal values per row. Thus, for storing the values in the off-diagonal part A , it is usual to use two1D arrays both of length n · r nz (instead of two n × r nz

2D tables). Therefore, one 1D array contains theoff-diagonal values consecutively row by row, whereas the other 1D array contains the correspondinginteger column indices [13].The goal of this chapter is to present a detailed view of the our implementations using UPC++. Inaddition, for comparison purposes, we also show the corresponding UPC code implementing the samekernel as in UPC++. We want to emphasize the differences in the UPC and UPC++ programming modelsand their consequences in terms of code. 4 .2 Deﬁnitions

In our UPC++ implementations of SpMV and the Heat Equation we use repeatedly global pointers andcommunication function such as rget or rput . In this section we present deﬁnitions for these terms andfunctionalities. A UPC++ program can allocate global memory in shared segments, which are accessible by all processes.A global pointer points at storage within the global memory, and is declared as follows: upcxx::global ptr gptr = upcxx::new (upcxx::rank me());

The call to upcxx::new allocates a new integer in the calling process’s shared segment, andreturns a global pointer ( upcxx::global ptr ) to the allocated memory. Each process has its own privatepointer (gptr) to an integer in its local shared segment. By contrast, a conventional C++ dynamic allocation(int *mine = new int) will be in private local memory. Note that we use the integer type in this paragraphas an example, but any type T can be allocated using the upcxx::new () function call [23].In our implementation of both SpMV and the Heat Equation we use shared arrays and global pointers.The declaration and allocation of shared arrays is a multiple step process, as described in Listing 6 (page9) with a buffer called share receive buffer . Both UPC++ SpMV codes, i.e. SpMV using block-wise data transfer and SpMV using message condens-ing and consolidation use the UPC++ function upcxx::rput . This function is used to send data fromone thread to another asynchronously by default. In our implementation of the UPC++ Heat Equation2D we use the UPC++ function upcxx::rget . Both functions are part of the one-sided communicationmodel that is implemented in UPC++. These operations initiate transfer of the value object to (put) orfrom (get) the remote process; no coordination is needed with the remote process since it is a one-sidedcommunication. Like many asynchronous communication operations, rget and rput default to returning afuture object [25] that becomes ready when the transfer is complete [23].

In our study we have implemented multiple versions of the Sparse Matrix-Vector Multiplication (SpMV)both in UPC and UPC++. The difference between these versions lies essentially in the communicationmethodology. Thus, in this section we describe 4 implementations of SpMV using a modiﬁed ELLPackformat, as described in Section 2.1.2.These 4 versions are:• UPC SpMV using Block-wise data transfer: Section 2.3.2, Listing 3, based on our previous study[13];• UPC++ SpMV using Block-wise data transfer: Section 2.3.2, Listing 4;• UPC SpMV using message condensing and consolidation: Section 2.3.3, Listing 5, based on ourprevious study [13];• UPC++ SpMV using message condensing and consolidation: Section2.3.3, Listings 7, 6.5ote that in UPC upcxx::rank me corresponds to

MYTHREAD , meaning ’id number of calling thread’and upcxx::rank n corresponds to THREADS, meaning ’total number of threads used for the currentrun’.For both UPC and UPC++ the code can be summarized as follows: ﬁrst, we read and distribute thedata in memory, then we prepare the data (particularly preparing data exchange buffers), and after that theactual SpMV computation occurs.

In both UPC and UPC++ codes presented in the following sections, we have used as much as possible astrategy that we call explicit thread privatization . The goal of this technique is to ensure that each threadaccesses (read and write) its own local data. Doing so means that no implicit communication is triggeredand no-illegal memory access is performed, thus accessing the data thanks to explicit thread privatization delivers the best possible performance. To achieve such a goal, we use the well-known technique ofcasting pointers-to-shared to pointers-to-local . To ensure that we create pointers that point to data withafﬁnity to the calling thread ( MYTHREAD in UPC, upcxx:rank me in UPC++), we use the

BLOCKSIZE topoint to the local data in the computation loop. This technique is illustrated in the Listing 1 for UPC andin Listing 2 for UPC++. for ( int mb =0; mb < mythread_nblks ; mb ++) { int offset = (mb*THREADS+MYTHREAD)* BLOCKSIZE ; /* casting shared pointers to local pointers */ double *loc_y = ( double *) (y+offset); //... } Listing 1: Explicit thread privatization in UPC (Complete code in Listing 3) for (mb =0; mb < mythread_nblks ; mb ++) {offset = (mb*upcxx :: rank_n ()S+upcxx :: rank_me ())* BLOCKSIZE ; /* casting shared pointers to local pointers */ double *loc_y = ( double *) (y[upcxx :: rank_me ()]+ offset); //... } Listing 2: Explicit thread privatization in UPC++

In this section, we present a summarized view of UPC and UPC++ SpMV implementing block-wise datatransfer, as seen in Listings 3 and 4. The UPC SpMV using block-wise data transfer is the same as theone we used in our previous study [13]. The UPC++ SpMV using block-wise data transfer is derivedfrom this aforementioned UPC version, in the sense that it implements the same logic of computationand communication. To the exception that in UPC we used upc memget , which gets data from anotherthread’s shared memory , for communication and for UPC++ we used upcxx::rput , which sends data toanother thread’s shared memory.In these versions of UPC and UPC++ SpMV, the communication of data is done block-wise. Thismeans that for UPC if

Thread A requires one element X to Thread B , a block of data of size

BLOCKSIZE containing element X will be sent from Thread B to Thread A . In other words, each needed block istransported in its entirety , independent of the actual number of values needed in that block. For UPC++the operation is initiated from the sending thread as we have used a one-sided put function whereas inUPC we have used a one-sided get function. 6 /Reading the data set and data distribution over THREADS then prepare the data for computation double * mythread_x_copy = ( double *) malloc(n* sizeof ( double )); /* Prep-work: check for each block of x whether it has values needed by MYTHREAD; make a private booleanarray ’block_is_needed’ of length nblks */// ..../* Transport the needed blocks of x into mythread_x_copy */ for ( int b=0; b futureComms = upcxx :: make_future (); /* SpMV: each thread only goes through its designated blocks */ int mythread_nblks = nblks/upcxx :: rank_n ()+( upcxx :: rank_me () <(nblks%upcxx :: rank_n ())?1:0); for ( int mb =0; mb < mythread_nblks ; mb ++) { int offset = (mb*upcxx :: rank_n ()+upcxx :: rank_me ())* BLOCKSIZE ; /* casting shared pointers to local pointers */ double *loc_y = ( double *) (y[upcxx :: rank_me ()]+ offset); double *loc_D = ( double *) (D[upcxx :: rank_me ()]+ offset); double *loc_A = ( double *) (A[upcxx :: rank_me ()]+ offset*r_nz); int * loc_J = ( int *)I[upcxx :: rank_me ()]. local ()+offset*r_nz; /* computation per block */ for ( int k=0; k.wait , which implement one-sided, asynchronous, bulk communi-cation. And in UPC we have used, one-sided, bulk communication using upc memget . By default, allcommunications in UPC++ are asynchronous, which, in terms of code corresponds to use of at least oneinstruction for sending or receiving data and one instruction to wait for the completion of the commu-nication. In our UPC++ implementation, we have chosen to distinguish the sending of data performedby upcxx::rput and the waiting for completion performed by futureComms.wait . This means thatmultiple communications are launched in a non-blocking manner without having to wait for one-anotherin order to start transmitting data. This is done at the expense and with a measured risk of bandwidth(between CPU and RAM) and network (The test system in our experiments uses Inﬁniband , as describedin section 3).

In this versions of UPC and UPC++ SpMV, we implement a different communication technique. Contraryto the block-wise data transfer presented in section 2.3.2, where each communication request is of a fullblock size (

BLOCKSIZE ), here, we employ a communication strategy where each thread communicates thestrict amount of required data to other threads. In other words, the length of a message from thread A tothread B equals the number of unique values in the k blocks owned by A that are needed by B , times thenumber of bytes required to represent each value (8 byte double in our experiments). All the between-thread messages are thus condensed and consolidated.Listing 6 (page 9) and Listing 7 (page 10) present respectively the preparation step and the computa-tion step of the UPC++ version of SpMV using message condensing and consolidation. Listing 5 (page 8)presents the preparation and computation steps of the UPC version of SpMV using message condensingand consolidation.This communication patterns and the use of the UPC++ function upcxx::rput leads to the necessityof a Shared Receive Buffer (SRB). This SRB will be accessed by each thread that needs to communicatedata, and then the receiving thread will get the data from this buffer after the all threads are done com-municating. In both UPC and UPC++ implementations we have the necessity for the SRB. In UPC++the implementation of the SRB is different and slightly more complicated than in UPC. SRBs in UPC++require the use of a two-dimensional vector and the corresponding declarations and function calls to pop-ulate it and broadcast the information across all threads ( upcxx::broadcast ). /* Allocation of the five shared arrays x, y, D, A, J */ / .../* Allocation of an additional private x array per thread */ double * mythread_x_copy = ( double *) malloc(n* sizeof ( double )); /* Preparation step: create and fill the thread-private arrays of int *mythread_num_send_values, int *mythread_num_recv_values, int **mythread_send_value_list, int **mythread_recv_value_list, double **mythread_send_buffers. Also, shared_recv_buffers is prepared. */// ..../* Communication procedure starts */ int T,k,mb ,offset; double * local_x_ptr = ( double *)(x+MYTHREAD* BLOCKSIZE ); for (T=0; T0) /* pack outgoing messages */ for (k=0; k< mythread_num_send_values [T]; k++)mythread_send_buffers [T][k] = local_x_ptr [ mythread_send_value_list [T][k]]; for (T=0; T0) /* send out messages */ upc_memput ( shared_recv_buffers [T*THREADS+ MYTHREAD], mythread_send_buffers [T],mythread_num_send_values [T]* sizeof ( double ));upc_barrier ; int mythread_nblks = nblks/THREADS +( MYTHREAD <( nblks%THREADS)?1:0); for (mb =0; mb < mythread_nblks ; mb ++) { /* copy own x-blocks */ offset = (mb*THREADS+MYTHREAD)* BLOCKSIZE ;memcpy (& mythread_x_copy [offset], ( double *)(x+offset), min(BLOCKSIZE ,n-offset)* sizeof ( double ));} for (T=0; T0) { /* unpack incoming messages */ double * local_buffer_ptr = ( double *) shared_recv_buffers [MYTHREAD*THREADS+T]; for (k=0; k< mythread_num_recv_values [T]; k++)mythread_x_copy [ mythread_recv_value_list [T][k]] = local_buffer_ptr [k];} /* Communication procedure ends *//* SpMV: each thread only goes through its designated blocks */ for (mb =0; mb < mythread_nblks ; mb ++) {offset = (mb*THREADS+MYTHREAD)* BLOCKSIZE ; /* casting shared pointers to local pointers */ double *loc_y = ( double *) (y+offset); double *loc_D = ( double *) (D+offset); double *loc_A = ( double *) (A+offset*r_nz); int *loc_J = ( int *) (J+offset*r_nz); /* computation per block */ for (k=0; k > >shared_receive_buffer (upcxx :: rank_n ()); td :: vector > temp(upcxx :: rank_n ()); for ( int i = 0; i < upcxx :: rank_n (); i++)temp[i] = upcxx :: new_array < double >( localReceiveCount [i]);shared_receive_buffer [upcxx :: rank_me ()] = temp; for ( int i = 0; i futureComms = upcxx :: make_future (); Listing 6: An improved UPC++ implementation of SpMV by message condensing and consolidation:Preparation step //computation starts/* SpMV: each thread only goes through its designated blocks */ for (mb =0; mb < mythread_nblks ; mb ++) {offset = (mb*upcxx :: rank_n ()S+upcxx :: rank_me ())* BLOCKSIZE ; /* casting shared pointers to local pointers */ double *loc_y = ( double *) (y[upcxx :: rank_me ()]+ offset); double *loc_D = ( double *) (D[upcxx :: rank_me ()]+ offset); double *loc_A = ( double *) (A[upcxx :: rank_me ()]+ offset*r_nz); int * loc_J = ( int *)I[upcxx :: rank_me ()]. local ()+offset*r_nz; /* computation per block */ for (k=0; k0) /* pack outgoing messages */ for (k=0; k< mythread_num_send_values [T]; k++)mythreadsend_buffers [T][k] = local_x_ptr [ mythreadsend_value_list [T][k]]; for (T=0; T0) /* send out messages */ futureComms .when_all(futureComms , upcxx :: rput( mythread_send_buffers [T], shared_recv_buffers [T][ upcxx ::rank_me ()], mythread_num_send_values [T]));} //waiting for all communication to be completed futureComms .wait ();upc_barrier ; //Receiving data, unpacking it into mythread_x_copy for (T=0; T0) { /* unpack incoming messages */ double * local_buffer_ptr = ( double *) shared_recv_buffers [upcxx :: rank_me ()*upcxx :: rank_n ()S+T]; for (k=0; k< mythread_num_recv_values [T]; k++)mythread_x_copy [ mythread_recv_value_list [T][k]] = local_buffer_ptr [T][k];} /* Communication procedure ends */ Listing 7: An improved UPC++ implementation of SpMV by message condensing and consolidation:Computation step

The UPC code used in this section is identical to the one used in our previous study [13]. The UPCcode was kindly provided by Dr. Rolf Rabenseifner at HLRS, in connection with a short course on PGAS10rogramming [20]. The UPC++ code for Heat Equation 2D is inspired from the UPC code in that it usesidentical policies for data distribution, computation and data communication through halo data exchange.Thus, in this section we propose the following implementations of a solver for the Heat Equation on a2-dimensional domain:• UPC implementation is presented in the following listings: – ”Scratch” arrays for data exchange in Listing 8 (page 11) – Halo data exchange in Listing 10 (page 12)• UPC++ implementation is presented in the following listings: – ”Scratch” arrays for data exchange in Listing 9 (page 11) – Halo data exchange in Listing 11 (page 12)In our implementations of UPC and UPC++ Heat Equation, the global 2D solution domain is rect-angular, so the UPC and UPC++ threads are arranged as a 2D processing grid, with mprocs rows and nprocs columns. (Note upcxx::rank n equals mprocs*nprocs .) Each thread is thus identiﬁed by an in-dex pair (iproc,kproc) , where iproc = upcxx::rank me / nprocs and kproc = upcxx::rank me% nprocs . The global 2D domain, of dimension M × N , is evenly divided among the threads. Each threadis responsible for a 2D sub-domain of dimension m × n , which includes a surrounding halo layer neededfor communication with the neighboring threads. Reminder: In UPC++ upcxx::rank me corresponds to MYTHREAD in UPC and, in UPC++, upcxx::rank n corresponds to

THREADS , in UPC.In both UPC and UPC++ implementations we use a halo data exchange, which is needed betweenthe neighboring threads. The halo data exchange performs communication, where each thread calls upc memget for UPC or upcxx::rget for UPC++, on each of the four sides of its subdomain (if aneighboring thread exists). In the vertical direction, the values to be transferred from the upper and lowerneighbors already lie contiguously in the memory of the owner threads. There is thus no need to explicitlypack the messages. In the horizontal direction, however, message packing is needed before upc memget can be invoked towards the left and right neighbors.Listing 8 presents additional data structure which is needed with packing and unpacking the horizontalmessages. Listing 9, presents the same idea implemented in UPC++. This short example of code alsoshows the additional instructions required by the use of UPC++: the declaration of shared arrays and theirallocation in memory requires more code than in UPC. shared [] double * shared xphivec_coord1first [THREADS ];shared [] double * shared xphivec_coord1last [THREADS ]; double * halovec_coord1first , * halovec_coord1last ;xphivec_coord1first [MYTHREAD] =(shared [] double *) upc_alloc ((m -2)* sizeof ( double ));xphivec_coord1last [MYTHREAD] =(shared [] double *) upc_alloc ((m -2)* sizeof ( double ));halovec_coord1first = ( double *) malloc ((m -2)* sizeof ( double ));halovec_coord1last = ( double *) malloc ((m -2)* sizeof ( double )); Listing 8: Scratch arrays for UPC halo exchange of non-contiguous data [13] std :: vector > xphivec_coord1first ;std :: vector > xphivec_coord1last ;xphivec_coord1first .resize(upcxx :: rank_n ());xphivec_coord1last .resize(upcxx :: rank_n ());xphivec_coord1first [upcxx :: rank_me ()] = upcxx :: new_array < double >(m -2); phivec_coord1last [upcxx :: rank_me ()] = upcxx :: new_array < double >(m -2); for ( int i = 0; i < upcxx :: rank_n (); i++) {xphivec_coord1first [i] = upcxx :: broadcast ( xphivec_coord1first [i], i).wait ();xphivec_coord1last [i] = upcxx :: broadcast ( xphivec_coord1last [i], i).wait ();}phivec_coord1first = ( double *) xphivec_coord1first [upcxx :: rank_me ()]. local ();phivec_coord1last = ( double *) xphivec_coord1last [upcxx :: rank_me ()]. local (); double * halovec_coord1first = ( double *) calloc ((m -2) , sizeof ( double )); double * halovec_coord1last = ( double *) calloc ((m -2) , sizeof ( double )); Listing 9: Scratch arrays for UPC++ halo exchange of non-contiguous dataThe communication implementation through halo exchange is different in UPC and UPC++. List-ing 10 presents the communication performed in UPC using upc memget , and performing a packingof the horizontal data beforehand. Listing 11 presents the implementation of halo data exchange using upcxx::rget and performing a packing of the horizontal data beforehand.In the UPC++ version it is important to notice that we store futures upcxx::future in a std::vector called getRequests , this vector is then used in the function waitForGetRequestsToEnd . This functioncalls the UPC++ wait function on each upcxx::future contained in the vector getRequests . This isdifferent technique to wait for all communication to complete compared to the one used in Listing 7 (page10), where we used the combination of upcxx::when all and upcxx::wait .The communication technique used in UPC and UPC++ relies on the same halo data exchange model,however, the amount of code needed in UPC++ is greater than the one needed in UPC. As explained earlierin this paper, this difference in the amount of code is directly related to the way shared arrays are declaredin UPC and UPC++ and the fact that communication in UPC++ is done in two steps: upcxx::rget andthe wait statement whereas in UPC a single function call to upc memget is needed. idx(i,k) ((i)*n+(k)) rank(ip ,kp) ((ip)*nprocs +(kp)) void halo_exchange_intrinsic () { double * phi = ( double *) xphi[MYTHREAD ]; int i,k; /* packing messages for the horizontal direction */ if (kproc >0) { double * phivec_coord1first = ( double *) xphivec_coord1first [MYTHREAD ]; for (i=0; i0) {upc_memget ( halovec_coord1first , xphivec_coord1last [rank(iproc ,kproc -1)], (m -2)* sizeof ( double )); for (i=1; i0)upc_memget (& phi[idx (0 ,1)], &( xphi[rank(iproc -1, kproc)][ idx(m-2 ,1)]), (n -2)* sizeof ( double )); if (iproc > getRequests (4) oid halo_exchange_intrinsic () { int i,k; /* packing messages for the horizontal direction */ if (kproc >0) for (i=0;i0) {getRequests [0] = upcxx :: rget(xphi[getRank(iproc -1, kproc)]+ idx(m-2 ,1), &phi[idx (0 ,1)], n -2); } if (iproc 0) {getRequests [2] = upcxx :: rget( xphivec_coord1last [getRank(iproc ,kproc -1)], halovec_coord1first , m -2); for (i=1;i0) { getRequests [0]. wait (); } if (iproc 0) { getRequests [2]. wait (); } if (kproc

Our performance measurements focus on many-core and multi-node hardware architectures.For the many-core experiments we run our UPC and UPC++ codes on a machine equipped with oneIntel Xeon Phi 7250 (Knights Landing, or KNL) processor, which is equipped with 16GB of high speedMCDRAM. Due to the hardware structure of the KNL processor we can select the way MCDRAM isseen (addressed and accessed) by the operating system and the programs. For our experiments we chosethe ﬂat mode . When the Knights Landing processor is booted in ﬂat mode , the entirety of the MCDRAMis used as addressable memory. MCDRAM as addressable memory shares the physical address spacewith DDR4, and is also cached by the L2 cache. With respect to Non Uniform Memory Access (NUMA)architecture, the MCDRAM portion of the addressable memory is exposed as a separate NUMA nodewithout cores, with another NUMA node containing the DDR4 memory [7]. We use numactl to explicitlyplace data in the MCDRAM NUMA node. KNL processors can alternatively be booted in cache mode ,which uses the MCDRAM as cache which is thus transparent for the operating system, and in hybridmode which used 8GB as cache and 8GB as addressable memory. In previous work we found that forproblems which ﬁt entirely within the 16GB of MCDRAM, which is the case for our test instances, theperformance of cache mode is only marginally lower than that of ﬂat mode [15]. On this machine we usedIntel Compiler version 17.0.0 to compile UPC and UPC++ compilers and runtimes environments, as wellas our programs.The Abel computer cluster [1] was used to run all the UPC and UPC++ codes on multi-node andmeasure their time usage. Each compute node on Abel is equipped with two Intel Xeon E5-2670 2.6 GHz8-core CPUs and 64 GB of RAM. The interconnect between the nodes is FDR InﬁniBand (56 Gbits/s).13n our previous study [13], we measured the memory bandwidth per node on Abel and obtained 75GB/s; we also measured the inter-node communication bandwidth and obtained about 6 GB/s. On thissupercomputer we have used nodes providing access to 16 physical cores, and for all our runs we alwayshave used the maximum amount of physical cores per node. In the following, when speaking of cores wemean physical core . We have not used any kind of purely logical cores offered by Intel’s HyperThreadingtechnology.The Berkeley UPC [4] version 2.24.2 was used for compiling and running all our UPC implementa-tions of SpMV and Heat Equation. The compilation procedure involved ﬁrst a behind-the-scene transla-tion from UPC to C done remotely at Berkeley via HTTP, with the translated C code being then compiledlocally on Abel using Intel’s icc compiler version 15.0.1. The compilation options are -O3 -std=gnu99 .The Berkeley UPC++ [22] version 1.0 (2019.9.0) was used for compiling and running all our UPC++implementation of SpMV and Heat Equation. UPC++ relies on a local compiler and MPI installation(i.e. located on the supercomputer). MPI is used to spawn UPC++ process on each node. GNU G++version 7.2.0 was used to compile C++ code, and OpenMPI version 3.1.2 was used for process spawning.Both UPC and UPC++ use GASNet(-EX) as an under-layer to ensure communication between threads,processes or cores and nodes. This also means, that UPC and UPC++ are APIs for GASNet. GAS-Net is a language-independent networking middleware layer that provides network-independent, high-performance communication primitives including Remote Memory Access (RMA) and Active Messages(AM). It has been used to implement parallel programming models and libraries such as UPC, UPC++,Co-Array Fortran, Legion, Chapel, and many others. The interface is primarily intended as a compila-tion target and for use by runtime library writers (as opposed to end users), and the primary goals arehigh performance, interface portability, and expressiveness. GASNet stands for ”Global-Address SpaceNetworking” [5].In terms of technology, having GASNet as an under-layer means that UPC and UPC++ have to staysynchronized with the GASNet project in order to beneﬁt from the latest version of it. It also means forthe user that a GASNet version change requires recompiling of both the UPC or UPC++ compiler andruntime environment, as well as all the programs. The same process is required when new features areimplemented in UPC or UPC++: a full recompiling of the whole tool-chain and all the programs.It is important to specify that UPC++ relies on a newer version of GASNet called GASNet-EX. Also,newest versions of UPC, not used in this paper, rely on GASNet-EX. In [5], the authors give a detailedpresentation of GASNet-EX as well as a short deﬁnition of GASNet-EX’s ”philosophy”: ”GASNet-EXis the next generation of the GASNet-1 communication system, continuing our commitment to provideportable, high-performance, production-quality, open-source software. The GASNet-EX upgrade is beingdone over the next several years as part of the U.S. Department of Energy’s Exascale Computing Program(ECP). The GASNet interfaces are being redesigned to accommodate the emerging needs of exascalesupercomputing, providing communication services to a variety of programming models on current andfuture HPC architectures. This work builds on ﬁfteen years of lessons learned with GASNet-1, and isinformed and motivated by the evolving needs of distributed runtime systems.” A set of improvementsbetween GASNet and GASNet-EX are also presented in [5]:• Retains GASNet-1’s wide portability (laptops to supercomputers)• Provides backwards compatibility for the dozens of GASNet-1 clients,• including multiple UPC and CAF/Fortran08 compilers• Focus remains on one-sided RMA and Active Messages14 Reduces CPU and memory overheads• Improves many-core and multi-threading support• “Immediate mode” injection to avoid stalls due to back-pressure• Explicit handling of local-completion (source buffer lifetime)• New AM interfaces, e.g. to reduce buffer copies between layers• Vector-Index-Strided for non-contiguous point-to-point RMA• Remote Atomics, implemented with NIC ofﬂoad where available• Subset teams and non-blocking collectives

In this chapter, we focus on presenting the results obtained with our implementations of UPC++ SpMV(see Section 2.3, page 5) and UPC++ Heat Equation 2D (see Section 2.4, page 10). First we will showthat it is possible to get reproducible performance with UPC++. We then discuss the importance of the

BLOCKSIZE on the data distribution and the obtained performance. Then, we will focus on comparingperformance obtained in UPC++ with that of UPC. The UPC results come from our previous study [13].In this section we use named implementations of UPC++, the correspondence with the implementa-tions presented in section 2 is as follows:• ”UPC++SpMV Version 1” corresponds to ”UPC++ SpMV with Block-Wise Data Transfer betweenThreads” presented in Section 2.3.2 (see page 6)• ”UPC++SpMV Version 2” corresponds to ”UPC++ SpMV with Message condensing and consoli-dation” presented in Section 2.3.3 (see page 8)• ”UPC Heat Equation” corresponds to the implementation of UPC the Heat Equation in 2D presentedin Section 2.4 (see page 10)• ”UPC++ Heat Equation” corresponds to the implementation of the UPC++ Heat Equation in 2Dpresented in Section 2.4 (see page 10)

We focus on the reproducibility of the obtained performance with UPC++ since it is a crucial feature ofa new language such as UPC++, both for verifying its usefulness in high performance computing and forensuring the validity of our results. In addition, being able to expect repeatable performance from UPC++is a prerequisite for designing a performance model capable of predicting obtainable performance.In Figure 1, we show results measured on the Abel supercomputer using UPC++ SpMV version 2running on two different counts of threads: 64 (4 nodes) and 128 threads (8 nodes), processing an instancerepresenting a healthy human heart composed of 6.8 million tetrahedrons. For each thread count we haveused different values of

BLOCKSIZE in order to investigate two aspects: ﬁrst whether the

BLOCKSIZE has astrong impact on performance, and second that when running the same UPC++ SpMV program multiple15igure 1: Performance stability and repeatability and importance of the

BLOCKSIZE on the obtained per-formance. Using human heart 3D representation, using 6.8 millions tetrahedrons ( n = Number of threads - BLOCKSIZE we obtain stable and repeatableperformance.Figure 1 shows both that we obtain performance stability and that the

BLOCKSIZE has a strong im-pact on performance. This impact in performance is related to the fact that the chosen

BLOCKSIZE inour implementations affects directly the data distribution. Consequently, for this particular data set (6.8millions tetrahedrons), the observed performance varies signiﬁcantly depending on the chosen value of

BLOCKSIZE : for 128 threads (8 nodes) performance go from slightly more than 50 GFLOPS down toslightly less than 10 GFLOPS when using either 65536 or 524288 as

BLOCKSIZE . For 64 threads, goingfrom a block size of 1024 to 8192 helps to improve the performance, however, when the block size isbigger than 8192 we observe a slight drop in the obtained performance.16igure 2: Multi-node performance for SpMV using Abel Supercomputer (abel.uio.no). Using 1 to 64nodes, 16 to 1024 threads. Performance are expressed in GFLOPS (higher is better). Using human heart3D representation, using 6.8 millions tetrahedrons ( n = Concerning multi-node results for UPC SpMV and UPC++ SpMV: Table I and Figure 2 present theobtained performance for UPC and UPC++ SpMV versions 1 and 2 using again the Abel supercomputer.Additionally, we present the obtained speedups for UPC and UPC++ SpMV versions 1 and 2 in Table II.We obtain speed-ups strictly superior to 1 for all versions, and all amounts of threads (except for 1024threads (64 nodes) for UPC SpMV version 1 and 2 and UPC++ version 1). Also, the obtained speed-upon the Abel supercomputer for all versions of UPC and UPC++ SpMV is always strictly inferior to 2.00.Based on these results we can say that we achieve strong scaling in UPC SpMV and UPC++ SpMVversion 1 and 2 using the Abel supercomputer for the multi-node experiments (except for UPC version1 and 2 and UPC++ version 1 for 1024 threads (64 nodes)). The obtained speedups can be called strongscaling in the sense that for the same total amount of data (ﬁxed problem size) we get higher performancewhen using more processing elements from 16 to 512 cores (1 to 32 nodes), when using 1024 threads onlyUPC++ version 2 continues to scale up: 1.83 speed-up (see Table II, page 18) between UPC++ version2 using 1024 threads (64 nodes) compared to performance obtained with UPC++ version 2 using 512threads (32 nodes).In Figure 2, it is clear that message condensing and consolidation (which corresponds to UPC SpMVversion 2 and UPC++ SpMV version 2) delivers far better performance than block wise data transfer(which corresponds to UPC version 1 and UPC++ version 1).This is to be expected as the versions 2 of UPC and UPC++ SpMV use a far more sophisticatedcommunication pattern, leading to messages sent between threads/cores/nodes of a size that correspondsexactly to the amount of data we need to transfer. Whereas in the block-wise data transfer versions (UPC17 hreads (node) UPC SpMVversion 1 UPC SpMVversion 2 UPC++ SpMVversion 1 UPC++SpMVversion 216 (1)

32 (2)

64 (4)

128 (8)

256 (16)

512 (32) n = Threads (node) UPC SpMVversion 1 UPC SpMVversion 2 UPC++ SpMVversion 1 UPC++SpMVversion 216 (1) - - - -

32 / 16

64 / 32

128 / 64

256 / 128

512 / 256 n = BLOCKSIZE even if only 1 element(C-type ”double”, 8 bytes on 64 bits architecture) needs to be sent. Thus, communication containing onlyone element that is needed will be of size 1 × ( bytes ) in UPC and UPC++ SpMV versions 2, and of size1 × BLOCKSIZE × ( bytes ) in UPC and UPC++ SpMV versions 1. As the communication pattern is theonly main difference in UPC and UPC++ versions 1 and 2, it is possible to say that the communicationpattern is the main source of difference in performance. In addition, we have shown in our previous studythat this performance can be predicted using our performance model [13], and we show in this study thaton multi-node the heaviest factor inﬂuencing performance is the inter-node communication through theInﬁniband network.Table III (Page 19) and Figure 3 show the obtained performance for UPC SpMV and UPC++ SpMVversions 1 and 2 on the Intel Xeon Phi Knight’s Landing (as presented in Section 3, page 13). Globally,UPC and UPC++ SpMV in versions 2 perform better than UPC and UPC++ SpMV versions 1. As ex-plained for the multi-node results, the main difference between these two versions is the communicationpattern. In this single-node context running on a many-core architecture, the difference in the commu-nication pattern has a huge inﬂuence on performance. As shown earlier in this section and in section18igure 3: Single-node performance for SpMV using Intel Xeon Phi Knight’s Landing (described in section3, see page 13). Using human heart 3D representation, using 6.8 millions tetrahedrons ( n = HyperThreading ,i.e. each of the 68 physical cores appears to the operating system as four logical cores. Since the logicalcores do not add computational resources, we cannot expect linear scaling from using them, as observedin [21]. This explains why when using 136 cores we only get a slight increase in performance. Whenusing 272 cores we observe a performance drop for all versions of UPC and UPC++ SpMV.Table IV and Figure 4, present the obtained results for our implementations of UPC and UPC++ Heat

Threads UPC SpMVversion 1 UPC SpMVversion 2 UPC++ SpMVversion 1 UPC++ SpMVversion 216 n = hreads (node) UPC Heat Equation UPC++ Heat Equation16 (1)

32 (2)

64 (4)

128 (8)

256 (16)

512 (32) , 1000 iterations. Performance obtained on Abel super-computer (described insection 3 see page 13), computing on 16 to 1024 threads (1 to 64 nodes). For graphical representation seeFigure 4.Figure 4: Multi-node performance for Heat Equation 2D, domain size 20000x20000, 1000 iterations usingAbel Supercomputer (abel.uio.no). Using 1 to 64 nodes, 16 to 1024 threads. Performance are expressed inseconds (execution time) (lower is better). Table IV contains numerical results represented in this graph.Equation using a 20000x20000 domain running on Abel supercomputer using 16 to 1024 cores (1 to 64nodes). Both the UPC and UPC++ Heat Equation versions adopt a similar way of communicating dataand a similar computation kernel, as presented in Section 2.4, page 2.4. With that in mind we observethat UPC++ performs as well as UPC. In this case, the data per processor is ﬁxed independently of theamount of cores that is used, thus, as for our implementations of SpMV, we can observe that we achieveweak scaling in both UPC and UPC++, at least from 1 to 32 nodes.The obtained performance is extremely close between the UPC and UPC++ implementations of theHeat Equation, from 1 to 32 nodes. However when using 64 nodes (1024 cores) UPC++ has a strongadvantage over UPC. We observed this behavior of UPC having trouble to perform with more than 512threads before [13, 14]. 20

Discussion

In Section 2 (page 3), we have presented our UPC and UPC++ implementations of SpMV and the HeatEquation in 2D. It is important to note the signiﬁcant differences in both programming model and conse-quences in terms of code between UPC and UPC++. For instance, the way the shared arrays are accessedin UPC is done relatively transparently in the sense that a shared array of size N will be accessible withindices ranging from to N-1 . However, in UPC++ the same array will be accessible in a 2-dimensionalway, the ﬁrst dimension explicitly points at the thread ID, the second dimension corresponds to a positionin the array for the portion allocated to the chosen thread.In other words, UPC++ does not try to maintain the same ease of programming at the expense ofperformance as UPC. As shown in previous work [13], accessing shared arrays in UPC can be extremelycostly as it generates hidden communication: hidden in two senses, ﬁrst the programmer does not explic-itly implement the communication and the developer has no way to know whether this communicationwill be between cores, sockets or nodes. In other words, in UPC communication can be implicit and yieldextremely bad performance whereas in UPC++ all communication is done explicitly. However, UPC++performs also communication without having the developer know whether the communication is betweencores, sockets or nodes (unless teams are used). This simply means that the UPC and/or UPC++ program-mer has to pay attention to both data distribution and the use of communication whether it is implicit orexplicit.This leads to an increase in the effort in programming that UPC++ developers have to deploy inorder to get satisfying performance on single-node, multi-node or many-core hardware architectures withUPC++ it is no longer possible to claim that this implementation of the PGAS programming model meansa better ease of programming compared to the de-facto standard MPI+OpenMP. However, by offeringa lot of features related to one-sided asynchronous communication, the use of C++ and

Promises and

Futures in order to enforce the use of asynchronous programming, UPC++ represents an alternative toMPI+OpeMP. Moreover, UPC++ offers these functionalities and yields performance as we explain belowand as we have shown in this paper.Our goal was also to verify and show whether UPC++ can run and yield good performance on multi-node and many-core architectures and whether this performance is comparable to previously obtainedperformance in UPC. Thus, in Section 4 (page 15), we have presented the obtained performance for ourimplementations of UPC and UPC++ SpMV on both multi-node and many-core hardware architectureand we have presented the results for our implementations of UPC and UPC++ Heat Equation on multi-node. Globally UPC++ perform as well as UPC for both considered kernels. This is encouraging as abasis for future work using more advanced UPC++’s features such as asynchronous remote procedurecalls and implementation of communication overlapping with computation.21

Conclusion

In this study, we have investigated and provided insights into the UPC++ programming model and itsperformance. We have measured the computational performance of two kernels (SpMV and Heat Equa-tion 2D) both on a multi-node supercomputer and a many-core processor. In addition, we have comparedUPC++ to UPC by running identical kernels. We have shown that both on multi-node and many-core ar-chitecture UPC++ can compete with UPC and yields better performance on the largest amounts of coresand nodes that we have used (1024 cores, 64 nodes).We have provided a detailed view of the difference of programming styles between UPC and UPC++and the consequences in terms of code. UPC and UPC++, despite their similar names, are two extremelydifferent implementations of the PGAS programming model, and in that sense, switching from UPC toUPC++ should not be considered lightly.For future studies, there are many tracks to investigate, such as using more UPC++ features such asasynchronous remote procedure calls, or using UPC++ in conjunction with CUDA, and using UPC++to implement different kind of kernels as we have focused our studies mainly on memory-bound andcommunication-bound problems. 22 eferences [1] The Abel computer cluster. , 2018.[2] G. Almasi. PGAS (partitioned global address space) languages. In D. Padua, editor,

Encyclopediaof Parallel Computing , pages 1539–1545. Springer, 2011.[3] J. Bachan, S. B. Baden, S. Hofmeyr, M. Jacquelin, A. Kamil, D. Bonachea, P. H. Hargrove, andH. Ahmed. UPC++”: A High-Performance Communication Framework for Asynchronous Compu-tation. In , pages963–973. IEEE, 2019.[4] Berkeley UPC – Uniﬁed Parallel C. upc.lbl.gov , 2018.[5] D. Bonachea and P. H. Hargrove. GASNet-EX: A High-Performance, Portable Communication Li-brary for Exascale. In

Proceedings of Languages and Compilers for Parallel Computing (LCPC’18) ,volume 11882 of

Lecture Notes in Computer Science . Springer International Publishing, October2018.[6] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction toUPC and language speciﬁcation. Technical report, Technical Report CCS-TR-99-157, IDA Centerfor Computing Sciences, 1999.[7] Colfax Research. MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors:Developer’s Guide. https://colfaxresearch.com/knl-mcdram/ , 2017. [Online; accessed 20-November-2019].[8] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, andK. Yelick. Parallel programming in Split-C. In

Supercomputing ’93. Proceedings , pages 262–273,Nov 1993.[9] A. Davies and J. Mushtaq. The domain decomposition boundary element method on a network oftransputers.

WIT Transactions on Modelling and Simulation , 15, 1970.[10] M. de Wael, S. Marr, B. de Fraine, T. van Cutsem, and W. de Meuter. Partitioned global addressspace languages.

ACM Computing Surveys , 47(4), 2015.[11] S. J. Deitz, B. L. Chamberlain, and M. B. Hribar. Chapel: Cascade High-Productivity Language AnOverview of the Chapel Parallel Programming Model.

Cray User Group , 2006.[12] R. Grimes, D. Kincaid, and D. Young. ITPACK 2.0 User’s Guide. Technical Report CNA-150,Center for Numerical Analysis, University of Texas, 1979.[13] J. Lagravi`ere, J. Langguth, M. Prugger, L. Einkemmer, P. H. Ha, and X. Cai. Performance Opti-mization and Modeling of Fine-Grained Irregular Communication in UPC.

Scientiﬁc Programming ,Article ID 6825728, 2019.[14] J. Lagraviere, J. Langguth, M. Sourouri, P. H. Ha, and X. Cai. On the performance and energy efﬁ-ciency of the pgas programming model on multicore architectures. In , pages 800–807. IEEE, 2016.2315] J. Langguth, C. Jarvis, and X. Cai. Porting tissue-scale cardiac simulations to the knights landingplatform. In

International Conference on High Performance Computing , pages 376–388. Springer,2017.[16] D. A. Mall´on, G. L. Taboada, C. Teijeiro, J. Touri˜no, B. B. Fraguela, A. G ´omez, R. Doallo, andJ. C. Mouri˜no. Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures. InM. Ropo, J. Westerholm, and J. Dongarra, editors,

Recent Advances in Parallel Virtual Machine andMessage Passing Interface , pages 174–184, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.[17] S. M. Martin.

MATE, a Uniﬁed Model for Communication-Tolerant Scientiﬁc Applications . PhDthesis, UC San Diego, 2018.[18] R. W. Numrich and J. Reid. Co-Array Fortran for parallel programming. In

ACM Sigplan FortranForum , volume 17, pages 1–31. ACM, 1998.[19] PGAS - Partitioned Global Address Space. , 2016.[20] R. Rabenseifner. Introduction to Uniﬁed Parallel C (UPC) and Co-array Fortran (CAF), 2015. Shortcourse at HLRS, University of Stuttgart, April 23-April 24.[21] R. A. Tau Leng, J. Hsieh, V. Mashayekhi, and R. Rooholamini. An empirical study of hyper-threading in high performance computing clusters.

Linux HPC Revolution , 45, 2002.[22] UPC++. UPC++ Ofﬁcial Website. bitbucket.org/berkeleylab/upcxx/ . Online; accessed 28-October-2019.[23] UPC++. UPC++ Programmer’s Guide, Revision 2019.9.0. https://bitbucket.org/berkeleylab/upcxx/downloads/upcxx-guide-2019.9.0.pdf . Online; accessed 28-October-2019.[24] UPC++. UPC++ Version Changelog. bitbucket.org/berkeleylab/upcxx/wiki/ChangeLog .Online; accessed 28-October-2019.[25] Wikipedia. Futures and promises — Wikipedia, The Free Encyclopedia. en.wikipedia.org/wiki/Futures_and_promises , 2019. [Online; accessed 28-October-2019].[26] X10. Performance and Productivity at Scale. x10-lang.org/ , 2019. [Online; accessed 11-December-2019].[27] Y. Zheng, A. Kamil, M. B. Driscoll, H. Shan, and K. Yelick. UPC++: A PGAS Extension for C++.In , pages 1105–1114,May 2014.

Conﬂicts of interest : The authors declare that there is no conﬂict of interest regarding the publicationof this paper.