MGPU-TSM: A Multi-GPU System with Truly Shared Memory

Saiful A. Mojumder, Yifan Sun, Leila Delshadtehrani, Yenai Ma, Trinayan Baruah, José L. Abellán, John Kim, David Kaeli, Ajay Joshi

ECE Department, Boston University; ECE Department, Northeastern University; CS Department, UCAM; School of EE, KAIST
{msam, delshad, yenai, joshi}@bu.edu, {yifansun, tbaruah, kaeli}@ece.neu.edu, [email protected], [email protected]

Abstract—The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count because of the overhead of data movement across multiple GPUs. Moreover, a lack of hardware support for coherency exacerbates the problem, because a programmer must either replicate the data across GPUs or fetch the remote data using high-overhead off-chip links. To address these problems, we propose a multi-GPU system with truly shared memory (MGPU-TSM), where the main memory is physically shared across all the GPUs. An MGPU-TSM system eliminates remote accesses and avoids data replication, which simplifies the memory hierarchy. Our preliminary analysis shows that MGPU-TSM with 4 GPUs performs, on average, 3.9× better than the current best-performing multi-GPU configuration for standard application benchmarks.

Index Terms—Multi-GPU, Shared Memory, RDMA
1 INTRODUCTION
Graphics processing units (GPUs) have become the system of choice for accelerating a variety of workloads, including deep learning, graph applications, data mining, and big data processing. The size of these applications is growing continuously, and they are exhausting the compute and memory resources in single-GPU systems. Hence, the community is actively migrating towards using multi-GPU (MGPU) systems to accelerate the above-mentioned workloads. To enable inter-GPU communication, GPU vendors have proposed a number of mechanisms (see Table 1). However, achieving near-ideal speedup (w.r.t. a single GPU) when using multiple GPUs is challenging because of inefficiencies in MGPU system design and the associated programming model.
Inefficiency 1: In existing discrete MGPU systems, each GPU has its own local main memory (MM), as shown in Figure 1(left). Each GPU in the MGPU system can access the other GPUs' MM through low-bandwidth, high-latency links. These off-chip links have 5× to 10× lower bandwidth (BW) for transferring data between GPUs, and between CPU and GPU, than the BW for accessing a GPU's local MM. Thus, accessing a remote GPU's MM increases the application execution time. Moreover, we observe non-uniform memory access (NUMA) effects when accessing remote memory, resulting in under-utilization of GPU compute resources and therefore sub-optimal performance.
Inefficiency 2: Today's MGPU programming model requires a programmer to manually maintain coherency by replicating data and/or accessing non-cached data from a remote memory using the expensive off-chip links. As a result, there is additional traffic traversing the off-chip links. In addition, the existing weak data-race-free (DRF) consistency model for GPUs requires additional effort from the programmer to avoid data races by providing explicit barriers.

As a result of these inefficiencies, we cannot leverage the full potential of MGPU systems. We provide more details about these two inefficiencies, with experimental evaluation, in Section 2. Researchers have proposed various solutions to address the aforementioned inefficiencies in MGPU systems. In particular, the solution with objectives identical to ours was by Arunkumar et al. [1], who proposed a package-level integration of a multi-chip-module GPU (MCM-GPU) (see Figure 1(left)), where each GPU module has its own local DRAM. Here, local accesses have low latency, but remote accesses have very high latency. In parallel, other hardware and software optimizations, such as L1.5$ [1], CARVE [16], and HMG [11], have been proposed to address the two inefficiencies mentioned earlier.

To simplify programming, reduce data transfer latency, and increase memory utilization efficiency, we propose a multi-GPU system with truly shared memory (MGPU-TSM). Unlike the MCM-GPU (see Figure 1), an MGPU-TSM system allows all GPUs to directly access the entire physical main memory of the system, thus eliminating the NUMA effects observed in traditional MGPU systems. In addition, an MGPU-TSM does not require an L1.5$ to reduce remote access overhead. Moreover, MGPU-TSM paves the way for a low-overhead coherence protocol as well as a simpler consistency model for MGPU systems. In this work, we compare the performance of an MGPU-TSM design with state-of-the-art RDMA- and unified memory (UM)-based MGPU designs using MGPUSim [13] to demonstrate the benefits of MGPU-TSM systems.

TABLE 1: Comparison of different communication mechanisms available in existing MGPUs vs. the communication scheme in MGPU-TSM. We compare the programmability and memory usage of each mechanism w.r.t. P2P Memcpy. Latency and BW are compared w.r.t. local MM access latency and BW. '✗', '✓', and '✓✓' indicate 'no', 'fair', and 'good', respectively.
Method | Definition | MM Access Latency | MM Access BW | Data Duplication | Improves Programmability | Improves GPU Mem. Usage
P2P Memcpy | Data copy from one GPU MM to another GPU MM | High | Low | Yes | – | –
P2P Direct | Data is accessed directly from the remote GPU memory and cached in the requesting GPU's L1$ | High | Low | Partial | ✓ | ✓
Zerocopy | Data is directly accessed from CPU memory by all GPUs without copying the data into GPU memory or GPU cache | Extremely high | Low | No | ✓✓ | ✗
Unified Memory | Data is either transferred or accessed directly from the current owner, based on how the runtime decides to serve a page fault | Extremely high | Low | No | ✓✓ | ✓
MGPU-TSM | All CPUs and GPUs can access the physically shared main memory seamlessly using a low-latency network | Low | High | No | ✓✓ | ✓✓
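To make the mechanisms in Table 1 concrete, the following host-code sketch shows how each existing scheme is typically set up in CUDA. This is a minimal illustration under assumed device IDs and a placeholder buffer size; MGPU-TSM itself needs none of these steps, since every GPU addresses the same physical MM directly.

#include <cuda_runtime.h>

int main() {
  const size_t bytes = (1 << 20) * sizeof(float);  // placeholder buffer size

  // P2P Memcpy: explicit replication of data from GPU0's MM into GPU1's MM.
  float *a0, *a1;
  cudaSetDevice(0); cudaMalloc(&a0, bytes);
  cudaSetDevice(1); cudaMalloc(&a1, bytes);
  cudaMemcpyPeer(a1, 1, a0, 0, bytes);  // duplicates data: extra GPU memory use

  // P2P Direct: GPU1 dereferences GPU0's pointers over the off-chip link and
  // caches them in its own L1$; no copy, but high remote-access latency.
  cudaDeviceEnablePeerAccess(0, 0);     // called while device 1 is current

  // Zerocopy: all GPUs read pinned CPU memory over PCIe on every access.
  float *h;
  cudaHostAlloc(&h, bytes, cudaHostAllocMapped);

  // Unified Memory: a single pointer valid everywhere; the runtime serves
  // page faults by migrating pages to (or mapping them for) the toucher.
  float *u;
  cudaMallocManaged(&u, bytes);

  cudaFree(a0); cudaFree(a1); cudaFreeHost(h); cudaFree(u);
  return 0;
}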
Fig. 1: MCM-GPU system (left) and proposed MGPU-TSM system (right). In the MCM-GPU, each GPU module (GPM) has CUs with L1$s, an L1.5$, and its own local DRAM, so DRAM accesses are either local or remote; in MGPU-TSM, the L2$s of all GPUs reach all DRAM through a central switch.
2 CHALLENGES IN EXISTING MGPU SYSTEMS
In this section, using the data access latency metric, we present the motivation for providing shared main memory in an MGPU system. We run the commonly-used matrix multiplication kernel SGEMM from NVIDIA's cuBLAS library [9] on an MGPU system with V100 GPUs (compute capability 7.0). We use two GPUs connected through NVLink 2.0 (50GB/s bidirectional bandwidth). The conclusions of our analysis should be broadly applicable to systems with more than 2 GPUs that use GPU-GPU RDMA.

The computations in the SGEMM kernel involve three matrices A, B, and C. In our experiment, we distribute the matrices across the memories of two GPUs (GPU0 and GPU1) and examine the performance degradation caused by different degrees of remote access (using P2P direct access as an example) when SGEMM is executed on GPU0. We use the aL-bR format to represent a% local access and b% remote access for GPU0, where a and b are integers. We evaluate the following four matrix distributions across memory:
1) Matrices A, B, and C are in GPU0's memory. This leads to 100% local access for GPU0 (100L-0R).
2) Matrices A and B are in GPU0's memory, and C is in GPU1's memory (67L-33R).
3) Matrix A is in GPU0's memory, and matrices B and C are in GPU1's memory (33L-67R).
4) Matrices A, B, and C are in GPU1's memory. This leads to 100% remote access for GPU0 (0L-100R).

Figure 2 shows the runtime of the SGEMM kernel for different matrix sizes under the above four matrix distributions. For smaller matrix sizes, accessing remote memory is very expensive because of the fixed remote access overhead: the runtime of SGEMM for the 0L-100R distribution on a 4k×4k matrix is 27× longer than that of the 100L-0R distribution. On the other hand, for the 32k×32k matrix, where the fixed remote access overhead gets amortized, the runtime of SGEMM for the 0L-100R distribution is still 12.2× longer than that of the 100L-0R distribution. From these experiments, we can see the significant impact of remote accesses on performance and, in turn, argue that to improve application performance we need to avoid remote accesses as much as possible.

Fig. 2: Runtime of the SGEMM kernel from the cuBLAS library for different matrix sizes (4k×4k, 8k×8k, 16k×16k, 32k×32k). Each bar corresponds to a different distribution of local and remote memory accesses.
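For concreteness, the sketch below shows how one such distribution (67L-33R: A and B local to GPU0, C remote on GPU1) could be set up with cuBLAS and P2P direct access. The matrix size is a placeholder, and data initialization and error checking are omitted; this is an illustrative reconstruction, not our exact measurement harness.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
  const int n = 4096;                        // placeholder: one of the tested sizes
  const size_t bytes = (size_t)n * n * sizeof(float);
  float *A, *B, *C;

  cudaSetDevice(1); cudaMalloc(&C, bytes);   // C lives in GPU1's MM (remote)
  cudaSetDevice(0); cudaMalloc(&A, bytes); cudaMalloc(&B, bytes);
  cudaDeviceEnablePeerAccess(1, 0);          // let GPU0 dereference C directly

  cublasHandle_t handle;
  cublasCreate(&handle);                     // handle bound to GPU0
  const float alpha = 1.0f, beta = 0.0f;
  // Accesses to C now traverse the off-chip link: the remote third of traffic.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, A, n, B, n, &beta, C, n);

  cudaDeviceSynchronize();
  cublasDestroy(handle);
  return 0;
}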
32k matrixis 12.2 × longer than that of the distribution. Here,the fixed remote access overhead gets amortized. From theseexperiments, we can see the significant impact of remoteaccesses on performance, and in turn, argue that to improvethe performance of applications, we need to avoid remoteaccesses as much as possible. Data sharing across multiple GPUs during kernel execu-tion leads to programming challenges, as the programmermust choose between programmability and performance.In this section, we examine the DNN training process onMGPU systems, when leveraging different data-parallelismschemes. We highlight how different mechanisms trade-off programmability for performance. The three stages ofDNN training include forward propagation (FP), backwardpropagation (BP), and weight update (WU). During the FPand BP stages, different GPUs calculate their local stochasticgradient descents (SGDs) that are later used to update thevalues of weights used for the next iteration. In Algorithms 1, 2 and 3, we consider three differentways a programmer can perform the WU stage. We willassume a 2-GPU MGPU system here. Algorithm 1 showsthat when using memcpy, the programmer must maintaincoherence explicitly by periodically copying data to GPU1’smemory. Thus, there is an additional copy of data i.e. SGD( gGPU1 ) in GPU0’s memory, leading to additional mem-ory usage. Nonetheless, this mechanism can be efficientin terms of kernel runtime because P2P memcpy can runasynchronously. Algorithm 2 shows how P2P direct access
Algorithm 2 shows how P2P direct access with RDMA can eliminate the data copy step, but at the expense of accessing data over the off-chip links; the programmer must still transfer the data from the CPU to the GPUs. Algorithm 3 illustrates that a shared main memory could ease programmability and eliminate explicit GPU-to-GPU and CPU-to-GPU data transfers. Note that the UM and Zerocopy solutions use Algorithm 3. UM, as proposed by NVIDIA, eases programming through a software abstraction, but suffers from performance degradation due to inefficient page-fault support and expensive remote accesses [2]. A Zerocopy solution does not use GPU memory at all; the GPUs access pinned CPU memory using the off-chip (PCIe) links [8]. We argue that we need a solution that does not trade off programmability for performance. A programmer can use Algorithm 3 on our envisioned MGPU-TSM and enjoy both ease of programming and high performance.
Algorithm 1: Using Memcpy∗
  Initialization: weights in CPU;
  Copy weights from CPU to GPU0 → wGPU0;
  Copy weights from CPU to GPU1 → wGPU1;
  FP+BP on GPU0 using wGPU0 → gGPU0;
  FP+BP on GPU1 using wGPU1 → gGPU1;
  Copy gGPU1 from GPU1 to GPU0 → gGPU0Copy;
  WU on GPU0 using (gGPU0, gGPU0Copy) → wGPU0;
  Copy wGPU0 from GPU0 to GPU1 → wGPU1;

Algorithm 2: Using P2P direct access∗
  Initialization: weights in CPU;
  Copy weights from CPU to GPU0 → wGPU;
  FP+BP on GPU0 using wGPU → gGPU0;
  FP+BP on GPU1 using wGPU → gGPU1;
  WU on GPU0 using (gGPU0, gGPU1) → wGPU;

Algorithm 3: Using shared main memory∗
  Initialization: weights in CPU;
  FP+BP on GPU0 using weights → g0;
  FP+BP on GPU1 using weights → g1;
  WU on GPU0 using (g0, g1) → weights;

∗ In the pseudocode, the right arrows point to the destination variables of an operation.
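To ground the pseudocode, the following sketch maps the WU step of Algorithms 1 and 2 onto CUDA host code for a 2-GPU system. The wu_kernel, its update rule, and the buffer size are hypothetical placeholders; Algorithm 3 has no direct CUDA equivalent on today's discrete MGPUs, which is precisely the gap MGPU-TSM targets.

#include <cuda_runtime.h>

// Hypothetical WU kernel: folds two gradient buffers into the weights.
__global__ void wu_kernel(float *w, const float *g0, const float *g1, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) w[i] -= 0.01f * (g0[i] + g1[i]);  // placeholder update rule
}

int main() {
  const int n = 1 << 20;
  float *wGPU0, *gGPU0, *gGPU0Copy, *gGPU1;
  cudaSetDevice(1); cudaMalloc(&gGPU1, n * sizeof(float));
  cudaSetDevice(0);
  cudaMalloc(&wGPU0, n * sizeof(float));
  cudaMalloc(&gGPU0, n * sizeof(float));

  // Algorithm 1 (memcpy): replicate GPU1's gradients into GPU0's MM first.
  // gGPU0Copy is the extra copy that costs additional GPU memory.
  cudaMalloc(&gGPU0Copy, n * sizeof(float));
  cudaMemcpyPeer(gGPU0Copy, 0, gGPU1, 1, n * sizeof(float));
  wu_kernel<<<(n + 255) / 256, 256>>>(wGPU0, gGPU0, gGPU0Copy, n);

  // Algorithm 2 (P2P direct): GPU0 dereferences gGPU1 in place; no extra
  // copy, but every access to gGPU1 traverses the off-chip link.
  cudaDeviceEnablePeerAccess(1, 0);
  wu_kernel<<<(n + 255) / 256, 256>>>(wGPU0, gGPU0, gGPU1, n);

  cudaDeviceSynchronize();
  return 0;
}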
3 MGPU-TSM SYSTEM

To explain and evaluate our envisioned MGPU-TSM architecture, we consider an MGPU-TSM system consisting of 4 GPUs, 1 CPU, and 4 HBM stacks that provide a total of 32GB of MM (we use a 32GB capacity as an example to explain the MGPU-TSM architecture; our MGPU-TSM system works with larger memory capacities as well). The specifications of the GPU, CPU, and HBM stacks are provided in Table 2.
Figure 1(right) shows the logical view of our proposed MGPU-TSM system. We leverage the current common design for compute units (CUs), where each CU has a dedicated write-through L1$. All the L1$s are connected to the L2$s using a crossbar network. For our proposed MGPU-TSM system, we make changes to the memory hierarchy, starting from the L2$ down to the MM.

GPUs typically have distributed L2$ banks, where each L2$ bank serves one memory controller (MC). In our envisioned MGPU-TSM system, we have 8 L2$ banks per GPU and 4 HBM stacks that provide a total of 32GB of MM. Thus, for each GPU, an L2 MC controls 4GB of memory. Each 8GB HBM stack is further distributed into 16 banks, where each bank has a 512MB capacity.

TABLE 2: Specification of MGPU-TSM components.
Component | Name | Tech. Node (nm) | Area (mm²) | Power (W)
GPU | RX 5700 | 7 | 151 | 180
CPU | Ryzen 9 3950X | 7 | 144 | ∗
∗ Determined using technology scaling rules.

Each L2 bank, as well as each DRAM bank, is connected to a centralized switch through a dedicated 32GB/s bidirectional link. Thus, each GPU has a total of 256GB/s of bidirectional BW between the L2$ and MM; with 4 GPUs, the total BW is 1TB/s. This also implies that each memory access requires a two-hop communication: from the L2$ to the switch, then from the switch to the MM, and vice versa. Recently, NVIDIA introduced NVSwitch [5], which provides 18 ports and 928GB/s of bidirectional BW and supports RDMA connectivity across multiple GPUs. Hence, our assumed 32-port switch with 1TB/s aggregate bidirectional BW is realistic.

The key advantage of our TSM lies in the physically-unified MM, which provides uniform memory access (UMA) across the system. This physically-unified design completely removes the need for remote accesses. In addition, having a centralized location for data accesses from multiple GPUs provides the opportunity to coalesce accesses at the MM level, and makes it easier to provide support for coherency, given the lower communication overhead. Moreover, having more memory banks helps improve throughput through efficient allocation of data, i.e., allocating consecutive pages to neighboring DRAM banks in a round-robin manner.
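As a minimal sketch of the round-robin page placement just described (assuming the example parameters above: 4KB pages and 4 GPUs × 16 banks = 64 DRAM banks, with addresses taken relative to the 32GB shared MM), the page-to-bank mapping reduces to:

#include <cstdint>
#include <cstdio>

// Round-robin page interleaving across the shared MM, assuming the example
// system above: 64 DRAM banks of 512MB each and a 4KB page size.
constexpr uint64_t kPageSize = 4096;
constexpr uint64_t kNumBanks = 64;

uint64_t bank_of(uint64_t paddr) {
  // Consecutive pages land in neighboring banks, spreading accesses out.
  return (paddr / kPageSize) % kNumBanks;
}

int main() {
  for (uint64_t page = 0; page < 4; ++page)
    printf("page %llu -> bank %llu\n",
           (unsigned long long)page,
           (unsigned long long)bank_of(page * kPageSize));
  return 0;
}

With this placement, a streaming access pattern touches all 64 banks before revisiting one, which is how the dedicated 32GB/s per-link BW aggregates toward the 1TB/s system total.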
4 PRELIMINARY EVALUATION

In this section, we discuss the potential performance benefits of an MGPU-TSM system over existing MGPU system configurations, i.e., MGPU systems that use RDMA P2P direct access (referred to as RDMA) and MGPU systems that use unified memory (referred to as UM). Table 3 shows the configuration of each GPU in our evaluation; we allocate memory by interleaving pages across all the memory modules in the MGPU system. For a fair comparison, we use the same GPU specifications, i.e., CU count, L1$ and L2$ sizes, and total DRAM bank count (16 per GPU), for the RDMA, UM, and TSM configurations, and we use a page size of 4KB. For the RDMA configuration, we use PCIe 4.0 links that provide 32GB/s bidirectional BW for remote accesses. UM provides a unified view of the total memory to the programmer by virtually combining the CPU and GPU memories, and uses a first-touch policy for page placement. To evaluate our design, we use the MGPUSim simulator [13], which is designed specifically to support MGPU simulation. We use 12 standard benchmarks from the Hetero-Mark [14], PolyBench [10], SHOC [3], and DNNMark [4] benchmark suites for our preliminary evaluation.

TABLE 3: GPU architecture.
Component | Configuration | Count per GPU
CU | 1.0GHz | 32
L1 Vector $ | 16KB, 4-way | 32
L1 Scalar $ | 16KB, 4-way | 8
L1I$ | 32KB, 4-way | 8
L2$ | 256KB, 16-way | 8
DRAM | 512MB HBM | 16
L1 TLB | 1 set, 32-way | 48
L2 TLB | 32 sets, 16-way | 1

Fig. 3: Speedup of the proposed TSM and of UM w.r.t. RDMA for the evaluated benchmarks (aes, atax, bfs, bicg, bs, conv, fir, fws, mm, mp, pr, relu) and their geometric mean.

Figure 3 shows a comparison of TSM, RDMA, and UM. TSM is, on average, 3.9× and 8.2× faster than RDMA and UM, respectively. TSM is faster than RDMA because RDMA requires data copy operations between the CPU and GPUs, and because, during kernel execution, all GPUs must use RDMA to access data residing in the other GPUs' memories. UM suffers from an expensive page-fault service mechanism and page migration through the off-chip links.

5 SYSTEM DESIGN CHALLENGES
Our preliminary comparison of TSM with RDMA and UM shows that TSM is quite promising, but it also comes with several challenges. Here we discuss these challenges and our future research directions to address them.
In the MGPU-TSM system, different CUs within and across GPUs can access the same memory location. Hence, we need a low-overhead, scalable cache coherence and memory consistency model, such as HALCONE [7], to maintain correctness. Traditional snooping-based or directory-based coherence protocols, such as MESI and MOESI, can lead to large inter-GPU and intra-GPU communication latencies [12]. Timestamp-based coherence [15], which allows auto-invalidation of cache blocks and reduces traffic overhead, can be suitable for an MGPU-TSM system. A wide range of consistency models, including sequential consistency, weak consistency, and release consistency, has been proposed for single-GPU systems. We need to design consistency models suitable for an MGPU-TSM system executing thousands of threads.
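As a rough illustration of why timestamp-based coherence fits, the sketch below captures the lease idea behind G-TSC-style schemes [15]: a cache block self-invalidates when its lease expires, so no invalidation messages cross the intra- or inter-GPU network. The structure, field names, and fixed lease length are illustrative assumptions, not the actual G-TSC or HALCONE designs.

#include <cstdint>

// Illustrative lease-based (timestamp) coherence state for one cache block.
struct CacheBlock {
  uint64_t tag = 0;
  uint64_t lease_expiry = 0;  // logical time at which the block self-invalidates
  bool     valid = false;
};

// A read hits only while the lease holds; expiry replaces invalidation traffic.
bool read_hit(const CacheBlock &b, uint64_t tag, uint64_t now) {
  return b.valid && b.tag == tag && now < b.lease_expiry;
}

// On a miss, the MM-side controller grants a fresh lease along with the data.
// A writer must stall until all outstanding leases on the block have expired.
void fill(CacheBlock &b, uint64_t tag, uint64_t now, uint64_t lease_len) {
  b.tag = tag;
  b.lease_expiry = now + lease_len;
  b.valid = true;
}

int main() {
  CacheBlock b;
  fill(b, /*tag=*/42, /*now=*/100, /*lease_len=*/50);
  bool hit_before = read_hit(b, 42, 120);  // true: lease still valid
  bool hit_after  = read_hit(b, 42, 160);  // false: lease expired, self-invalidated
  return (hit_before && !hit_after) ? 0 : 1;
}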
The L2-to-MM network plays a critical role in the overall performance of an MGPU-TSM system. In our example system, we used direct links between the L2$s and the switch and between the switch and the MM. As we scale the number of GPUs, the radix of the switch grows proportionally. A high-radix switch leads to lower performance, and at the same time its area and power become problematic. In our future work, we will explore different high-BW, low-latency networks that scale well with GPU count.
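A back-of-the-envelope sketch of the radix concern, counting only the L2-side links (8 per GPU in the example system; DRAM-side ports add further to the total):

#include <cstdio>

// The central switch's L2-side port count grows linearly with GPU count
// (8 L2 links per GPU in the example system above).
int main() {
  const int l2_links_per_gpu = 8;
  for (int gpus = 4; gpus <= 64; gpus *= 2)
    printf("%2d GPUs -> at least %3d switch ports\n",
           gpus, gpus * l2_links_per_gpu);
  return 0;
}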
CPUs are typically latency-sensitive, while GPUs are BW-sensitive. Since the MGPU-TSM system provides the same physical memory to both CPUs and GPUs, it is imperative to design a network protocol that provides low-latency data access for the CPU and high-BW data access for the GPUs.
To design a scaled-up MGPU-TSM system, we envision using 2.5D integration technology with multiple interposers. Each interposer will carry multiple GPU chiplets, a CPU chiplet, and multiple HBM stacks. For intra-interposer communication, we can use electrical links, while for long-distance inter-interposer communication, we can use photonic links. To design such a multi-interposer system, we need to develop a cross-layer design automation technique that jointly optimizes the system architecture, circuit design, and physical design.
6 CONCLUSION
In this work, we showed that the performance of MGPU systems is limited by expensive remote data accesses through off-chip links. At the same time, programming MGPU systems is difficult due to the lack of hardware support for coherency. To address these issues, we propose an MGPU-TSM architecture that eliminates remote data accesses, improves memory utilization, and reduces programmer burden. We also highlight the major challenges we need to overcome to make MGPU-TSM viable.

REFERENCES

[1] A. Arunkumar et al. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. Proc. ISCA, pp. 320–332, 2017.
[2] T. Baruah et al. Griffin: Hardware-software support for efficient page migration in multi-GPU systems. Proc. HPCA, pp. 596–609, 2020.
[3] A. Danalis et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite. Proc. GPGPU-3, pp. 63–74, 2010.
[4] S. Dong and D. Kaeli. DNNMark: A deep neural network benchmark suite for GPUs. Proc. GPGPU, pp. 63–72, 2017.
[5] A. Ishii et al. NVSwitch and DGX-2: NVLink-switching chip and scale-up compute server. Proc. HCS, pp. 1–30, 2018.
[6] S. A. Mojumder et al. Profiling DNN workloads on a Volta-based DGX-1 system. Proc. IISWC, pp. 122–133, 2018.
[7] S. A. Mojumder et al. HALCONE: A hardware-level timestamp-based cache coherence scheme for multi-GPU systems. arXiv preprint arXiv:2007.04292, 2020.
[8] D. Negrut et al. Unified memory in CUDA 6.0: A brief overview of related data access and transfer issues. SBEL, Madison, WI, USA, Tech. Rep. TR-2014-09, 2014.
[9] C. Nvidia. cuBLAS library. NVIDIA Corporation, Santa Clara, California, 15(27):31, 2008.
[10] L. N. Pouchet. PolyBench: The polyhedral benchmark suite. 2012.
[11] X. Ren et al. HMG: Extending cache coherence protocols across modern hierarchical multi-GPU systems. Proc. HPCA, pp. 585–595, 2020.
[12] I. Singh et al. Cache coherence for GPU architectures. Proc. HPCA, pp. 578–590, 2013.
[13] Y. Sun et al. MGPUSim: Enabling multi-GPU performance modeling and optimization. Proc. ISCA, pp. 197–209, 2019.
[14] Y. Sun et al. Hetero-Mark: A benchmark suite for CPU-GPU collaborative computing. Proc. IISWC, pp. 1–10, 2016.
[15] A. Tabbakh et al. G-TSC: Timestamp based coherence for GPUs. Proc. HPCA, pp. 403–415, 2018.
[16] V. Young et al. Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems. Proc. MICRO, 2018.