OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework For Future NUMA-Based Multi-GPU Systems
Chenhao Xie
[email protected]
Pacific Northwest National Lab (PNNL) and University of Houston

Xin Fu
[email protected]
ECE Department, University of Houston

Mingsong Chen
[email protected]
School of Computer Science and Software Engineering, East China Normal University

Shuaiwen Leon Song
[email protected]
Pacific Northwest National Lab (PNNL) and The University of Sydney
ABSTRACT
With their strong computation capability, NUMA-based multi-GPU systems are a promising candidate to provide sustainable and scalable performance for Virtual Reality (VR) applications and to deliver an excellent user experience. However, the entire multi-GPU system is viewed as a single GPU under the single programming model, which ignores the data locality among VR rendering tasks during workload distribution and leads to tremendous remote memory accesses among GPU models (GPMs). The limited inter-GPM link bandwidth (e.g., 64GB/s for NVLink) becomes the major obstacle when executing VR applications on multi-GPU systems. By conducting comprehensive characterizations of different kinds of parallel rendering frameworks, we observe that distributing each rendering object along with its required data to a GPM can reduce inter-GPM memory accesses. However, such object-level rendering still faces two major challenges on NUMA-based multi-GPU systems: (1) the large data locality between the left and right views of the same object and the data sharing among different objects, and (2) the unbalanced workloads induced by the software-level distribution and composition mechanisms.

To tackle these challenges, we propose an object-oriented VR rendering framework (OO-VR) that conducts software-hardware co-optimization to provide a NUMA-friendly solution for VR multi-view rendering on NUMA-based multi-GPU systems. We first propose an object-oriented VR programming model to exploit the data sharing between the two views of the same object and to group objects into batches based on their texture sharing levels. We then design an object-aware runtime batch distribution engine and a distributed hardware composition unit to balance the workloads among GPMs and further improve VR rendering performance. Finally, evaluations on our VR-featured simulator show that OO-VR provides a 1.58x overall performance improvement and a 76% inter-GPM memory traffic reduction over state-of-the-art multi-GPU systems. In addition, OO-VR provides NUMA-friendly performance scalability for future larger multi-GPU scenarios with ever-increasing asymmetric bandwidth between local and remote memory.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ISCA '19, June 22–26, 2019, Phoenix, AZ, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6669-4/19/06...$15.00
https://doi.org/10.1145/3307650.3322247
ACM Reference Format:
Chenhao Xie, Xin Fu, Mingsong Chen, and Shuaiwen Leon Song. 2019. OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework For Future NUMA-Based Multi-GPU Systems. In The 46th Annual International Symposium on Computer Architecture (ISCA '19), June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3307650.3322247
1 INTRODUCTION
With the vast improvements in graphics technology, Virtual Reality (VR) is becoming a potentially popular product for major high-tech companies such as Facebook [30], Google [16] and NVIDIA [29]. Different from normal PC or mobile graphics applications, VR promises a fully immersive experience by displaying images directly in front of users' eyes. Due to the dramatic experience revolution VR brings to users, the global VR market is expected to grow exponentially and generate $30 billion in annual revenue by 2022 [24, 34].

Despite the growing market penetration, achieving true immersion in VR applications still faces severe performance challenges [20]. First, the displayed image must have a high pixel density as well as a broad field of view, which requires a high display resolution. Meanwhile, the high-resolution image must be delivered at extremely short latency so that users preserve the continuous illusion of reality. However, state-of-the-art graphics hardware, the Graphics Processing Unit (GPU) in particular, cannot meet these strict performance requirements [20]. Historically, GPUs gained performance by integrating more transistors and scaling up the chip size, but such optimizations on a single-GPU system can barely satisfy VR users due to the limited performance boost [33]. A multi-GPU system, with much stronger computation capability, is a promising candidate to provide sustainable and scalable performance for VR applications [21, 33].

In recent years, major GPU vendors have combined multiple small GPU models (GPMs) to build future multi-GPU systems under a single programming model that provides scalable computing resources. They employ high-speed inter-GPM links such as NVLink [28] and AMD CrossFire [3] to achieve fast data transmission among GPMs. The memory system and address mapping of such a multi-GPU system are designed as a Non-Uniform Memory Access (NUMA) architecture, achieving 4x the storage capacity of a single-GPU system. The NUMA-based multi-GPU system employs a shared memory space to avoid data duplication and synchronization overheads across the distributed memories [5]. In this study, we target this future multi-GPU system because it serves VR applications more energy-efficiently than a distributed multi-GPU system with separate memory spaces, and it is becoming a good candidate for future mobile VR applications. Since the entire system is viewed as a single GPU under the single programming model, VR rendering workloads are sequentially launched and distributed to different GPMs without specific scheduling. This naive single programming model greatly hurts the data locality among rendering workloads and incurs huge inter-GPM memory traffic, which significantly constrains the performance of multi-GPU systems on VR applications due to the bandwidth asymmetry between the local DRAM and the inter-GPM links. There have been many studies [5, 21, 25, 43] improving the performance of NUMA-based multi-GPU systems by minimizing remote accesses.
However, these solutions are still based on the single programming model and do not consider the special data redundancy of VR rendering; hence, they cannot efficiently remove the performance bottleneck for VR applications.

To reduce inter-GPM memory accesses, a straightforward method is to employ parallel rendering frameworks [7, 13, 14, 19] that split rendering tasks into multiple parallel sub-tasks under a specific software policy before assigning them to the multi-GPU system. Since these frameworks were originally designed for distributed multi-GPU systems, a knowledge gap remains on how to leverage parallel rendering programming models to efficiently execute VR applications on NUMA-based multi-GPU systems. To bridge this gap, we first investigate three different parallel rendering frameworks (i.e., frame-level, tile-level and object-level). Through comprehensive experiments on our VR-featured simulator, we find that object-level rendering, which distributes each rendering object along with its required data to a GPM, can convert some remote accesses into local memory accesses. However, object-level rendering still faces two major challenges on NUMA-based multi-GPU systems: (1) a large number of inter-GPM memory accesses, because it fails to capture the data locality between the left and right views of the same object as well as the data sharing among different objects; and (2) serious workload imbalance among GPMs due to the inefficient software-level distribution and composition mechanisms.

To overcome these challenges, we propose the object-oriented VR rendering framework (OO-VR), which reduces inter-GPM memory traffic by exploiting the data locality among objects. Our OO-VR framework conducts software-hardware co-optimization to provide a NUMA-friendly solution for VR multi-view rendering. First, we propose an object-oriented VR programming model that provides a simple software interface for VR applications to exploit the data sharing between the left and right views of the same object; the programming model also automatically groups objects into batches based on their data sharing levels. Then, to overcome the limitations of software-level workload distribution and composition, we design an object-aware runtime batch
distribution engine at the hardware level to balance the rendering workloads among GPMs. We predict the execution time of each batch so that its required data can be pre-allocated in local memory, hiding the long data-copy latency. We further design a distributed composition unit at the hardware level to fully utilize the rendering output units across all GPMs for the best pixel throughput. To summarize, this paper makes the following contributions:
• We investigate the performance of future NUMA-based multi-GPU systems on VR applications and find that inter-GPM memory accesses are the major performance bottleneck.
• We conduct comprehensive characterizations of the major parallel rendering frameworks and observe that the data locality among rendering objects can significantly reduce inter-GPM memory accesses, but state-of-the-art frameworks and multi-GPU systems fail to capture this feature.
• We propose a software-hardware co-designed Object-Oriented VR (OO-VR) rendering framework that leverages this data locality to convert remote inter-GPM memory accesses into local memory accesses.
• We build a VR-featured simulator to evaluate the proposed design by rendering VR-enabled real-world games at different resolutions. The results show that OO-VR achieves a 1.58x performance improvement and a 76% inter-GPM memory traffic reduction over the state-of-the-art multi-GPU system. Thanks to its NUMA-friendly nature, OO-VR exhibits strong performance scalability and can potentially benefit future larger multi-GPU scenarios with ever-increasing asymmetric bandwidth between local and remote memory.
2 BACKGROUND
In contrast to other traditional graphics applications, state-of-the-art VR applications employ a
Head-Mounted Display (HMD), or VR helmet, to present visuals directly to users' eyes. To display 3D objects in VR, a pair of frames (i.e., stereoscopic frames) is generated, one for each eye, by projecting the scene onto two 2D planar images.
Figure 2: Multi-view VR rendering process with simultaneous multi-projection (SMP) enabled: (a) the output of each rendering step; (b) the overview of the 4-step rendering process; and (c) GPU architecture for SMP-enabled VR rendering. Modern GPUs employ a unified shader model that processes all shaders in programmable Streaming Multiprocessors (SMs). They also feature a new fixed-function unit inside each Polymorph Engine (PME) to support SMP.
This process is referred to as stereo rendering in computer graphics. Figure 1 shows an example of such VR projection: the green and yellow boxes represent the rendering processes for the left and right views, respectively, creating two display images for the HMD. Stereo rendering requires two concurrent rendering processes, one per eye, doubling the workload of the VR pipeline. Observing that some objects in the scene (e.g., the robot in Figure 1) are shared by the two eyes, mainstream graphics vendors and engines such as NVIDIA and Unity employ simultaneous multi-projection (SMP) to generate the left and right frames simultaneously through a single rendering process [8, 9, 27, 35]. This significantly reduces workload redundancy and achieves substantial performance gains.

Figure 1: Rendering the VR world into left and right views. (Frame captured from The Lab [31].)

Based on the conventional three-step rendering process (i.e., geometry process, rasterization and fragment process) defined by modern graphics application programming interfaces (APIs) [1, 2], VR rendering inserts a multi-projection step after the geometry process and prior to rasterization, as shown in Figure 2(a). Thus, when SMP is enabled, the VR rendering process is composed of four steps, detailed in Figure 2(b). VR rendering begins by reading application-issued vertices from GPU memory. During the geometry process (1), the vertex shader calculates the 3D coordinates of the vertices, which are assembled into primitives (i.e., the triangles in Figure 2(a)-(1)). The generated triangles then pass through the geometry-related shaders, which perform clipping, face culling and tessellation to generate extra triangles and remove non-visible ones. The SMP step (2) is responsible for generating multiple projections from a single geometry stream; in other words, the GPU executes the geometry process only once but produces two positions for each triangle (Figure 2(a)-(2)). These triangles are then streamed into the rasterization stage (3) to generate fragments (Figure 2(a)-(3)), each of which corresponds to a pixel in a 2D image. Finally, the fragment process (4) generates pixels by computing the corresponding fragment attributes to determine their color and texture (Figure 2(a)-(4)). The output pixels are written to the framebuffer in GPU memory for display.
Traditionally, GPUs are designed as special-purpose graphics processors for performing modern rendering tasks. Figure 2(c) shows an SMP-capable GPU architecture modeled after recent NVIDIA Pascal GPUs [27]. It consists of several programmable streaming multiprocessors (SMs) (1), fixed-function units such as the GigaThread Engine (2), the Raster Engine (3) and the Polymorph Engines (PMEs) (4), and render output units (ROPs) (5). Each SM (1) is composed of a unified texture/L1 cache (TX/L1 $), several texture units (TXUs) and hundreds of shader cores that execute a variety of graphics shaders (e.g., the functions of both the geometry and fragment processes). The GigaThread Engine (2) distributes rendering workloads among PMEs when there are adequate computing resources. The Raster Engine (3) is a hardware accelerator for the rasterization process. Each PME (4) conducts input assembly, vertex fetching and attribute setup. To support multi-view rendering, the NVIDIA Pascal architecture integrates an SMP engine into each PME. The SMP engine can process geometry for two different viewports, which are the projection centers of the left and right views; in other words, it duplicates the geometry from the left view to the right view by changing the projection center instead of executing the geometry process twice. Finally, the ROPs (5) perform anti-aliasing, pixel compression and color output.
Table 1: Differences Between PC Gaming and VR

                        Gaming PC           Stereo VR
Display                 2D LCD panel        Stereo HMD
Field of View (FoV)     24-30" diagonal     120° horizontally, 135° vertically
Number of Pixels        2-4 Mpixels         58.32x2 Mpixels
Frame latency           16-33 ms            5-10 ms

Figure 3: The overview of the multi-GPU architecture. Distributing rendering tasks for the same object causes significant remote memory accesses and data duplication.
As Figure 2(c) illustrates, all the SMs and ROPs share an L2 cache and read/write data in off-chip memory through the memory controllers, each of which is paired with one memory channel. Prior to rendering, GPU memory contents such as the framebuffer and texture data are pre-allocated in the GPU's off-chip memory. Programmers can manually manage the memory allocation using graphics APIs such as OpenGL and Direct3D [1, 2].

Although recent generations of GPUs can deliver good gaming experiences and have gradually evolved to support SMP, it is still difficult for them to satisfy the extremely high rendering throughput demanded by immersive VR applications. The human vision system has both a wide field of view (FoV) and an incredibly high resolution when perceiving the surrounding world, so the requirements for an immersive VR experience are much more stringent than those of PC gaming. Table 1 lists the major differences between PC gaming and stereo VR [20]. As it demonstrates, stereo VR requires the GPU to deliver 116 (58.32 x 2) Mpixels within 5 ms.
Missing the rendering deadline causes frame drops, which significantly damage VR quality. Although VR vendors today employ frame re-projection technologies such as Asynchronous Time Warp (ATW) [15, 36] to artificially fill in dropped frames, these cannot fundamentally solve the problem of missed rendering deadlines because they take little account of users' perception and interaction. Thus, improving overall rendering efficiency remains the highest design priority for modern VR-oriented GPUs [6].
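To make the gap concrete, here is a back-of-the-envelope comparison using Table 1's numbers (our arithmetic, not a figure from the cited work):

```latex
\text{stereo VR: } \frac{58.32 \times 2~\text{Mpixels}}{5~\text{ms}} \approx 23.3~\text{Gpixels/s},
\qquad
\text{PC gaming: } \frac{4~\text{Mpixels}}{16~\text{ms}} \approx 0.25~\text{Gpixels/s}
```

That is, stereo VR demands roughly two orders of magnitude (about 93x) more pixel throughput than a demanding PC game.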
In recent years, major GPU vendors such as NVIDIA have proposed to integrate multiple easy-to-manufacture GPU chips at the package level (i.e., multi-chip design) [5] or at the system level [25, 43] using high-bandwidth interconnect technologies such as Ground-Referenced Signaling (GRS) [32] or NVLink [28], in order to address the looming saturation of chip density. Figure 3 shows the overview of such a multi-GPU architecture, which consists of four GPU models (GPMs). In terms of compute capability, each GPM is configured to resemble the latest NVIDIA GPU architecture (e.g., Figure 2(c)). Inside each GPM, SMs are connected via an XBAR to the GPM's local memory hierarchy, which includes a local memory-side L2 cache and off-chip DRAM. In the overall multi-chip design (MCM-GPU), the XBARs are interconnected through high-speed links such as NVLink to support communication among GPMs. This multi-GPU system generally acts as one large GPU; its memory system and address mapping are designed as a Non-Uniform Memory Access (NUMA) architecture. This design also reduces programming complexity (e.g., a unified programming model similar to CUDA) for GPU developers.
Future Multi-GPU System Bottleneck for VR Workloads.
As previous works [5, 25] have indicated, data movement among GPMs will become the major obstacle to continued performance scaling in these future NUMA-based multi-GPU systems. This situation is further exacerbated when executing VR applications because of the large amount of data shared among GPMs. Due to the view redundancy inherent in VR applications, the left and right views may include the same object (e.g., the rabbit in Figure 3), which requires the same texture data. However, to effectively utilize the computing resources of all GPMs in such multi-GPM platforms, the rendering tasks for the left and right views will be distributed to different groups, or islands, of GPMs in a balanced fashion; each view is then further broken into smaller pieces distributed to the individual GPMs of that group. This naive strategy can greatly hurt data locality under the SMP model. For example, if the basic texture data describing the rabbit in Figure 3 is stored in the local memory of GPM0, the other GPMs must issue remote memory accesses to acquire it. Given the asymmetric bandwidth between the local DRAM (e.g., 1TB/s) and the inter-GPM NVLink (e.g., 64GB/s), remote memory accesses will likely become one of the major performance bottlenecks in such multi-GPU designs. More sophisticated parallel rendering frameworks, such as OpenGL Multipipe SDK [7], Chromium [19] and Equalizer [13, 14], are designed for distributed environments with separate memory spaces, where memory data must be duplicated in each memory; this greatly limits the storage capacity of our NUMA-based multi-GPU system. Employing them on this architecture therefore requires further investigation and characterization, which we present in Section 4.

Figure 4 presents the performance of a 4-GPM multi-GPU system as the bandwidth of the inter-GPM links decreases from 1TB/s to 32GB/s (see Section 3 for the experimental methodology). We observe that rendering performance is significantly limited by this bandwidth: on average, 128GB/s, 64GB/s and 32GB/s inter-GPM bandwidth results in 22%, 42% and 65% performance degradation, respectively, compared to the 1TB/s baseline. Although raising the inter-GPM bandwidth is a straightforward way to tackle the problem, it has proven difficult to achieve due to additional silicon cost and power overhead [5].
Figure 4: Normalized performance sensitivity to inter-GPM link bandwidth for a 4-GPM multi-GPU system.

Table 2: Baseline Configuration

GPU frequency                    1GHz
Number of GPMs                   4
Number of SMs                    32 (8 per GPM)
SM configuration                 64 shader cores per SM,
                                 128KB unified L1 cache,
                                 4 texture units
Texture filtering configuration  16x anisotropic filtering
Raster Engine                    16x16 tiled rasterization
Number of ROPs                   32 (8 per GPM)
L2 cache                         4MB in total, 16-way
Inter-GPU interconnect           64GB/s NVLink, uni-directional
Local DRAM bandwidth             1TB/s
This motivates us to provide software-hardware co-design strategies that enable a truly immersive VR experience for future users by significantly reducing inter-GPM traffic and alleviating the performance bottleneck of executing VR workloads on future multi-GPU platforms. We believe this is the first attempt to co-design at the system architecture level toward eventually realizing future planet-scale VR.
3 EXPERIMENTAL METHODOLOGY
We investigate the performance impact of multi-GPU systems on virtual reality by extending ATTILA-sim [10], a cycle-level rasterization-based GPU simulator that covers a wide spectrum of graphics features of modern GPUs. ATTILA-sim is designed around boxes (modules of the functional pipeline) and signals (simulating the interconnect between components). Because the current ATTILA-sim models an AMD TeraScale2 architecture [17], it is difficult to configure it with the same number of SMs as NVIDIA Pascal-like architectures [27]. To evaluate the design impact fairly, we accordingly scale down other parameters such as the number of ROPs and the L2 cache size; similar strategies have been used to study modern graphics architectures in previous works [40-42]. The GPM memory system consists of a two-level cache hierarchy and a local DRAM; the L1 cache is private to each SM, while the L2 cache is shared by all SMs. Table 2 shows the simulation parameters of our baseline multi-GPU system.

To support multi-view VR rendering, we implement an SMP engine in ATTILA-sim based on the state-of-the-art SMP technology [8, 9], which re-projects the triangles of the left view to the right view using an updated viewport.
Figure 5: The original rendering frame (left) and the result with SMP enabled (right).

Table 3: Benchmarks

Abbr.  Name                 Library      Resolution                       Draw commands
DM3    Doom 3               OpenGL [2]   1600x1200, 1280x1024, 640x480    191
HL2    Half-Life 2          DirectX [1]  1600x1200, 1280x1024, 640x480    328
NFS    Need For Speed       DirectX      1280x1024                        1267
UT3    Unreal Tournament 3  DirectX      1280x1024                        876
WE     Wolfenstein          DirectX      640x480                          1697
Figure 5 shows the rendering example of Half-Life 2 after enabling SMP in ATTILA-sim. Our SMP engine first gathers the X coordinate range of the display frame, which spans from -W to +W, where W is a coordinate offset parameter. Then, it duplicates each triangle generated by the geometry process. After that, the SMP engine shifts the viewport of the rendering object by half of W, left or right depending on the eye. The SMP engine can also re-project triangles based on user-defined viewports for the left and right views. Finally, we modify the triangle clipping to prevent spill-over into the opposite eye. We validated the implementation of the SMP engine in ATTILA-sim by comparing the triangle count, fragment count and performance improvement with those from executing VR benchmarks (e.g., Sponza and San Miguel from NVIDIA VRWorks [29]) on a state-of-the-art GPU (an NVIDIA GTX 1080 Ti). Specifically, we observe that the added SMP rendering in ATTILA-sim provides a 27% speedup over sequentially rendering the two views.

We also model the inter-GPM interconnect as high-bandwidth point-to-point NVLinks with 64GB/s bandwidth (one direction). We assume each GPM has 6 ports and each pair of ports connects two GPMs, so that the communication between two GPMs is not interfered with by other GPMs. Based on the sensitivity study shown in Figure 4, we configure the inter-GPM link bandwidth as 64GB/s. Following common GPU design, each ROP in our simulation outputs 4 pixels per cycle to the framebuffer. To further alleviate the remote memory access latency of the NUMA-based baseline architecture, we employ the state-of-the-art First-Touch (FT) page placement policy and remote cache scheme [5] to create a fair baseline evaluation environment.

Table 3 lists the graphics benchmarks employed to evaluate our design. The set includes five well-known 3D games covering different rendering libraries and 3D engines. We also list the original rendering resolution and the number of draw commands for each benchmark.
Two benchmarks (Doom 3 and Half-Life 2) from the table are rendered at a range of resolutions (1600x1200, 1280x1024 and 640x480).
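To make the SMP engine's re-projection step described above concrete, the following is a minimal sketch (hypothetical types and names; the actual ATTILA-sim hooks differ):

```cpp
// Each post-geometry triangle is duplicated and re-centered by +/- W/2 so
// that the two copies land in the left- and right-eye viewports; the
// display X range is [-W, +W], with W the coordinate offset parameter.
struct Triangle {
  float x[3], y[3];  // screen-space vertex coordinates (z, w omitted)
};

void smpDuplicateAndShift(const Triangle& in, float W,
                          Triangle* left, Triangle* right) {
  *left = in;
  *right = in;
  for (int v = 0; v < 3; ++v) {
    left->x[v]  -= 0.5f * W;  // shift into the left-eye viewport
    right->x[v] += 0.5f * W;  // shift into the right-eye viewport
  }
  // A modified clipping stage (not shown) culls the parts of each copy
  // that would spill over into the opposite eye's half of the frame.
}
```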
Figure 6: Three types of parallel rendering schemes for parallel VR applied on future NUMA-based multi-GPU systems: (a) AFR; (b) tile-level SFR (V); (c) tile-level SFR (H); (d) object-level SFR.

Figure 7: Normalized performance improvement (left) and single frame latency (right) across different benchmarks.

4 PARALLEL RENDERING ON NUMA-BASED MULTI-GPU SYSTEMS
Aiming to reduce the NUMA-induced bottlenecks, a straightforward method is to employ parallel rendering schemes in VR applications to distribute a domain of graphics workloads across targeted computing resources. While parallel rendering frameworks such as Equalizer [13, 14] and OpenGL Multipipe SDK [7] have been used in many cluster-based PC games, NUMA-based multi-GPU systems face different challenges when performing parallel rendering. In this section, we perform a detailed analysis using three state-of-the-art parallel rendering schemes (frame-level, tile-level and object-level) for VR applications running on future NUMA-based multi-GPU architectures, to further understand the design challenges.
Alternate Frame Rendering (AFR), also known as frame-level parallel rendering, executes one complete rendering process on each GPM in a multi-GPU environment. As Figure 6(a) demonstrates, AFR distributes a sequence of rendering frames, along with their required data, across different GPMs. AFR is often considered a better fit for distributed memory environments, since separate memory spaces make the concurrent rendering of different frames easier to implement [14]. To separate our NUMA memory system into distinct memory spaces, we leverage software-level segmented memory allocation to reserve distributed memory segments for each frame. We also employ a simple task scheduler to map the rendering workloads of a frame to a specific GPM. The benefit of this design is that it eliminates inter-GPM communication.

Figure 7 shows the performance improvement and single frame latency under the AFR scheme. The results are normalized to the baseline NUMA-based multi-GPU setup (with 64GB/s NVLink), in which the entire system is viewed as a single GPU under the programming model and rendering workloads are launched directly without specific parallel rendering scheduling. On average, AFR improves the performance (i.e., overall frame rate) by 1.67x compared to the baseline. AFR not only eliminates the performance degradation caused by the low-bandwidth inter-GPM links, but also increases the rendering throughput by leveraging the SMP feature of each GPM. However, Figure 7 (right) also shows that AFR increases the single frame latency by 59%, as each frame is processed by only one GPM. This increased single-frame latency can cause significant motion anomalies, including judder, lag and sickness [42, 44], because it determines whether the display on the VR head-gear can stay in sync with the actual motion. Additionally, we observe that AFR near-linearly increases the memory bandwidth and capacity requirements, owing to the memory space pre-allocated for each frame. This decreases the maximum usable system memory capacity, which directly limits the rendering resolution, texture detail and perceived quality of VR applications.
In contrast to AFR, split frame rendering (SFR) reduces single-frame latency by splitting each frame into smaller rendering workloads, with each GPM responsible for one group of workloads. Figure 6(b) shows tile-level SFR, which splits the rendering frame into pixel tiles in screen space and distributes these sets of pixels across GPMs. This basic method is widely used in cluster-based PC gaming because it requires very little software effort [37]. To employ tile-level SFR, we simply leverage the sort-first algorithm to define the tile-window size before the rendering tasks are processed by the GPMs. Although this design effectively reduces single-frame latency, its vertical pixel stripping [37] distributes the left and right views to different GPMs, ignoring the redundancy between the two views. Thus, to enable SMP under tile-level SFR, an alternative is a horizontal culling method, shown in Figure 6(c).
Figure 8: Normalized performance after enabling SFR across different benchmarks.
Figure 9: Total inter-GPM memory traffic across different benchmarks.

It groups the left and right views into one large pixel tile so that rendering workloads in the left view can be re-projected to the right via the SMP engine, reducing redundant geometry processing and improving data sharing.
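The difference between the two tile mappings can be sketched as follows (a simplification assuming a 4-GPM system and a stereo frame of width w and height h, with the left eye occupying x < w/2; the helper names are ours):

```cpp
// Vertical strips (Figure 6(b)): the two eyes land on different GPMs,
// so the SMP redundancy between the views cannot be exploited.
int gpmForPixelVertical(int x, int w) { return (x * 4) / w; }

// Horizontal strips (Figure 6(c)): every strip spans both eyes, so the
// left view can be re-projected to the right within the same GPM via SMP.
int gpmForPixelHorizontal(int y, int h) { return (y * 4) / h; }
```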
Distributing objects among processing units is a specific type of split frame rendering. Figure 6(d) shows an example of object-level SFR, often referred to as sort-last rendering [13]. In contrast to traditional vertical and horizontal tile-level SFR, distribution under object-level SFR begins after the GPU starts the rendering process. During object-level SFR, a root node is selected (e.g., GPM0 in this example) to distribute the rendering objects to the other working units (e.g., GPM1, GPM2 and GPM3). Once a worker completes its assigned object, the output color in its local DRAM is sent to the root node to composite the final frame. In this study, we first profile the entire rendering process to obtain the total number of rendering objects, and then issue them to different GPMs in a round-robin fashion. Note that only one object is executed on each GPM at a time, for better data locality. Although object distribution can also occur during the rendering process (e.g., between rasterization and fragment processing [21]), that typically requires additional inter-GPM synchronization, which may increase inter-GPM traffic and degrade performance. Thus, we distribute objects only at the beginning of the rendering pipeline in our experiments.

Figure 10: The best-to-worst performance ratio among GPMs in object-level SFR across different workloads.
Figures 8 and 9 illustrate the performance (i.e., overall frame rate) impact and the inter-GPM memory traffic of the different SFR scenarios, normalized to the baseline setup. We make the following observations:

(i) The tile-level SFR schemes only slightly improve rendering performance over the baseline, by an average of 28% and 3% for tile-level (V) and tile-level (H), respectively. Although processing a small set of pixels via tile-level SFR can improve data locality within one GPM, the tile-level schemes increase inter-GPM memory traffic by an average of 50% for vertical culling (V) and 44% for horizontal culling (H), due to objects overlapping across tiles. While horizontal culling (H) fails to capture the data sharing of large objects (e.g., the bridge on the right side of Figure 6(c)), vertical culling (V) ignores the redundancy between the left and right views. Since the GPMs do not render the left and right views simultaneously when applying SMP-based VR rendering, the large texture data must be moved frequently across GPMs.

(ii) Object-level SFR outperforms the tile-level schemes, achieving an average of 60%, 32% and 57% performance improvement over the baseline, tile-level (V) and tile-level (H), respectively. The speedups come mainly from the inter-GPM traffic reduction shown in Figure 9: by placing the required data of rendered objects in the local DRAM, object-level SFR reduces inter-GPM traffic by approximately 40% compared to the baseline. However, state-of-the-art object-level SFR cannot fully address the NUMA-induced performance bottlenecks of VR execution, because it still executes the objects of the left and right views separately. In other words, it ignores the multi-view redundancy of VR applications, which limits its rendering efficiency.

(iii) Additionally, object-level SFR is challenged by load imbalance and high composition overhead. Figure 10 shows the ratio between the best and worst performance among GPMs under the round-robin object scheduling policy. Since each object has different graphical properties (e.g., the total number of triangles, the level of detail, the viewport window size, etc.), the processing time typically differs per object. If one GPM is assigned more complex objects than the others, it takes more time to complete its rendering tasks. Since the overall performance of a multi-GPU system is determined by the worst-case GPM, poor load balance significantly degrades overall performance. Meanwhile, the high composition overhead (i.e., assembling all the rendering outputs from different
GPMs into a frame) also contributes to the unbalanced execution time. As mentioned previously, only the root node distributes and composites rendering tasks in current object-level SFR. Extra workloads are therefore issued to the root node while the ROP units of the other GPMs cannot be fully utilized during the color output stage, causing poor composition scalability [7]. Therefore, we propose software-hardware support to efficiently handle these challenges facing state-of-the-art object-level SFR in a NUMA-based multi-GPU environment.
5 OO-VR DESIGN
To address the performance issues of object-level SFR on future NUMA-based multi-GPU systems, we propose the object-oriented VR rendering framework (OO-VR); its overall design is shown in Figure 11. OO-VR consists of several novel components. First, we propose an object-oriented VR programming model at the software layer to support multi-view rendering under object-level SFR; it also provides an interface connecting VR applications to the underlying multi-GPU platform. Second, we propose an object-aware runtime distribution engine at the hardware layer to balance the rendering workloads across GPMs. This engine predicts the rendering time of each batch before it is distributed, and it replaces the master-slave structure among GPMs so that the actual distribution is determined only by the rendering process itself. Finally, we design a distributed hardware composition unit that uses the ROPs of all GPMs to assemble and update the final frame in the framebuffer. Owing to the NUMA organization, the framebuffer is distributed across all GPMs instead of residing in a single DRAM partition, providing 4x the output bandwidth of the baseline scenario. We detail each of these components below.

Figure 11: Our proposed object-oriented VR rendering framework (OO-VR).
The object-oriented VR programming model extends the conventional object-level SFR introduced in Section 4 and uses a software structure similar to today's Equalizer [13, 14] and OpenGL Multipipe SDK (MPK) [7]. Figure 12 explains our programming model with a simplified working-flow diagram. We propose two major components that drive the OO-VR programming model: the Object-Oriented Application (OO_Application), which drives VR multi-view rendering for each object, and the Object-Oriented Middleware (OO_Middleware), which reorders objects and regroups those sharing similar features into large batches that act as the smallest scheduling units on the NUMA-based multi-GPU system.
The OO_Application provides a software interface (the dark blue boxes in Figure 12) for developers to merge the left and right views of the same object into a single rendering task. It is designed by extending the conventional object-level SFR: for each object, we replace the original viewport, normally set during rendering initialization, with two new parameters, viewportL and viewportR, each of which points to one view of the object. To enable rendering both views at the same time, we apply the built-in OpenGL extension GL_OVR_multiview, which
automatically renders the left and right views to their own positions using the same texture data. We also design an auto mode that extends the conventional object-level SFR to multi-view rendering by generating two fixed viewports for each object, shifting the original viewport along the X coordinate. In this case, only one rendering process needs to be set up per object. In contrast to the single-pass stereo rendering enabled in modern VR SDKs [13, 29], our OO_Application does not decompose the left and right views during rendering initialization, so it still follows the execution model of object-level SFR.
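A minimal sketch of the per-object record behind this interface (hypothetical field names, not the actual OO_Application data structure):

```cpp
#include <vector>

struct Viewport { int x, y, width, height; };

// One rendering task carries both eyes' viewports, so geometry is
// processed once and SMP re-projects it to each view using the shared
// texture set.
struct VrObject {
  Viewport viewportL;           // replaces the single viewport of
  Viewport viewportR;           // conventional object-level SFR
  std::vector<int> textureIds;  // textures shared by both views
};

// Auto mode: derive the two viewports from the original one by shifting
// along the X coordinate.
VrObject makeStereoObject(Viewport vp, int halfShift) {
  VrObject o{vp, vp, {}};
  o.viewportL.x -= halfShift;   // left-eye projection center
  o.viewportR.x += halfShift;   // right-eye projection center
  return o;
}
```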
OO_Middleware is the software-hardware bridge connecting the OO_Application and the multi-GPU system. It is automatically executed during the application initialization stage to issue groups of objects to the rendering queue of the multi-GPU system. In conventional object-level SFR, objects are scheduled in a master-slave fashion following the programmer-defined order; as a result, different objects that share common texture data are not necessarily rendered on the same GPM. As Figure 12 illustrates, both "pillar1" and "pillar2" share the common "stone" texture; if they are rendered on different GPMs, the "stone" texture may need to be replicated, increasing remote GPM accesses. In OO-VR, we leverage OO_Middleware to group objects based on their texture sharing level (TSL) to exploit the data locality across different objects. To implement this, OO_Middleware first picks an object from the head of the queue as the root. It then finds the next independent object and computes the TSL between the two using Equation (1):
$TSL = \sum_{t}^{T} \left( P_r(t) \cdot P_n(t) \right) \Big/ \sum_{t}^{T} P_r(t)$   (1)

where t ranges over the textures shared between the two objects (T in total), and P_r(t) and P_n(t) represent the percentage of t among all required textures of the root and of the target object, respectively.
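As an illustration with made-up numbers (not from our benchmarks): suppose the root's texture footprint is 70% "stone" and 30% "wood", while the candidate's is 60% "stone" and 40% "cloth". Only "stone" is shared, so

```latex
TSL = \frac{P_r(\text{stone}) \cdot P_n(\text{stone})}{P_r(\text{stone}) + P_r(\text{wood})}
    = \frac{0.7 \times 0.6}{0.7 + 0.3} = 0.42,
```

which falls below the 0.5 grouping threshold described below, so the two objects would not be batched together.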
Figure 12: Simplified working flow of an object-oriented application, middleware and multi-GPU rendering pipeline.
TSL represents how much texture data will be shared if we group the target object with the root. If the TSL is greater than 0.5, we group the two together as a batch, and this batch becomes the new root, containing all textures of the previous root and the target object. After grouping, OO_Middleware removes the target object from the queue and continues searching for the next object until the total number of triangles in the batch exceeds 4096 or all objects in the queue have been examined. The triangle limit prevents load imbalance caused by an inflated batch. The batch is then marked ready and issued to a GPM for rendering. Finally, OO_Middleware repeats this grouping process until no object remains in the queue for the frame. Note that objects that depend on any object in a batch are directly merged into that batch (raising its triangle limit) so that they follow the programmer-defined rendering order.
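The grouping pass can be summarized by the following sketch (a hypothetical data layout: per-object triangle counts plus per-texture usage fractions; the dependency-merging path is omitted):

```cpp
#include <deque>
#include <map>
#include <vector>

struct Object {
  int triangles = 0;
  std::map<int, float> texShare;  // texture id -> P(t), the share of that
                                  // texture in the object's footprint
};

// Equation (1): sum of Pr(t)*Pn(t) over shared textures t, normalized by
// the root's total texture weight.
float tsl(const Object& root, const Object& next) {
  float shared = 0.0f, total = 0.0f;
  for (const auto& [t, pr] : root.texShare) {
    total += pr;
    auto it = next.texShare.find(t);
    if (it != next.texShare.end()) shared += pr * it->second;
  }
  return total > 0.0f ? shared / total : 0.0f;
}

// Greedy pass of OO_Middleware: grow a batch while TSL > 0.5, capped at
// 4096 triangles to limit load imbalance.
std::vector<Object> nextBatch(std::deque<Object>& queue) {
  std::vector<Object> batch;
  if (queue.empty()) return batch;
  Object root = queue.front();
  queue.pop_front();
  batch.push_back(root);
  for (auto it = queue.begin(); it != queue.end() && root.triangles <= 4096;) {
    if (tsl(root, *it) > 0.5f) {
      root.triangles += it->triangles;
      for (const auto& [t, p] : it->texShare) root.texShare[t] += p;
      batch.push_back(*it);
      it = queue.erase(it);
    } else {
      ++it;
    }
  }
  return batch;
}
```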
After receiving the batches from OO_Middleware, the multi-GPU system needs to distribute them across GPMs for multi-view rendering. For workload balancing, we propose an object-aware runtime distribution engine at the hardware layer instead of the software-based, master-slave distribution used in conventional object-level SFR. Compared to a software-level solution, which must split the tasks before assigning them to the multi-GPU system, the hardware engine provides efficient runtime workload distribution by collecting rendering information. Figure 13 illustrates the architecture of the proposed distribution engine. It is implemented as a micro-controller for the multi-GPU system that is responsible for predicting the rendering time of each batch, allocating batches to the earliest-available GPM, and pre-allocating memory data using the pre-allocation units (PA units) in each GPM.
Figure 13: The architecture design of the object-aware runtime distribution engine.
Recall from Section 3 that our evaluation baseline employs the First-Touch (FT) memory mapping policy [5] to allocate required data in local DRAM. Although FT helps reduce inter-GPM memory accesses, it can also cause performance degradation if the required data is not ready during the rendering process. We therefore pre-allocate data before objects are distributed across GPMs. In this case, OO-VR needs the runtime information of each GPM to determine which GPM is likely to become idle first.

To obtain this information, we need to predict approximately when the current batch will complete. Equation (2) shows a basic way to estimate the rendering time of the current task X, introduced by [39]:

$t(X) = RT(g_x, c_x, HW, ST)$   (2)

where $g_x$ and $c_x$ are the geometry and texture properties of the object X, HW is the hardware information, and ST is the current rendering step (i.e., geometry process, multi-view projection, rasterization or fragment process) of the object X.

While a more complex model can increase the estimation accuracy, it also requires more comprehensive data and increases hardware design complexity and computation overhead. Because the objective of our prediction model is to identify the earliest-available GPM rather than to accurately predict each batch's rendering time, we propose a simple linear memorization-based model to estimate the rendering time, shown in Equation (3):

$t(X) = c_1 \cdot triangle_x = c_2 \cdot tv_x + c_3 \cdot pixel_x$   (3)

where $triangle_x$, $tv_x$ and $pixel_x$ represent the triangle count, the number of transformed vertices and the number of rendered pixels of the current batch, respectively, and $c_1$, $c_2$ and $c_3$ represent the triangle, vertex and pixel rates of the GPM.

After building this estimation model, we split the prediction into two phases: total rendering time estimation and elapsed rendering time prediction, with two counters per GPM recording the total and elapsed rendering times. First, we leverage $triangle_x$ (which can be directly acquired from the OO_Application) to predict the total rendering time. During rendering, the distribution engine tracks $tv_x$ and $pixel_x$ from the GPMs to calculate the elapsed rendering time: whenever $tv_x$ or $pixel_x$ increases by 1, the elapsed counter increases by $c_2$ or $c_3$, respectively. Finally, by comparing the distance between the two counters of each GPM, we can predict which GPM will become available first.

At the beginning of rendering, the distribution engine uses the first 8 batches to initialize $c_1$, $c_2$ and $c_3$: these batches are distributed across GPMs under the round-robin object scheduling policy, with the baseline FT memory mapping scheme used to allocate the rendering data. After the GPMs complete this round of 8 batches, the measured total rendering times are sent back to the distribution engine to calculate $c_1$, $c_2$ and $c_3$. Then, starting from the 9th batch, the rendering time predictor is enabled to find the earliest-available GPM. After that, the PA unit pre-allocates the required data to the selected GPM, and the rendering time predictor updates the predicted total rendering time with the new triangle count. Note that we limit the maximum size of the batch queue to 4 objects to reduce the memory space requirement. Multiple batches can be distributed to one GPM at the same time; in this case, its PA unit fetches the data sequentially in batch-ID order.
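The two-counter scheme behind Equation (3) can be sketched as follows (hypothetical field names; c1, c2 and c3 are the triangle, vertex and pixel rates calibrated from the first 8 batches):

```cpp
struct GpmCounters {
  double total = 0.0;    // predicted total rendering time of issued batches
  double elapsed = 0.0;  // accumulated from runtime vertex/pixel events
};

struct RenderTimePredictor {
  double c1, c2, c3;     // cost per triangle / transformed vertex / pixel
  GpmCounters gpm[4];    // one pair of counters per GPM (4-GPM baseline)

  // On batch dispatch, the total time grows by t(X) = c1 * triangles(X),
  // using the static triangle count from the OO_Application.
  void onDispatch(int g, long triangles) { gpm[g].total += c1 * triangles; }

  // Runtime events reported by the GPMs advance the elapsed counter.
  void onVertexTransformed(int g) { gpm[g].elapsed += c2; }
  void onPixelRendered(int g)     { gpm[g].elapsed += c3; }

  // The earliest-available GPM is the one with the smallest remaining
  // distance between its two counters.
  int earliestGpm() const {
    int best = 0;
    double bestLeft = gpm[0].total - gpm[0].elapsed;
    for (int g = 1; g < 4; ++g) {
      double left = gpm[g].total - gpm[g].elapsed;
      if (left < bestLeft) { bestLeft = left; best = g; }
    }
    return best;
  }
};
```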
We further observe that even though the distribution engine can effectively balance the rendering tasks, some large batches may still become the performance bottleneck once all other batches have completed. To fully utilize the computing and memory resources of idle GPMs, we employ a simple fine-grained task mapping mechanism that fairly distributes the remaining processing units (e.g., triangles in the geometry process and fragments in the fragment process) to idle GPMs based on their IDs. Meanwhile, the PA units duplicate the required data to the corresponding unused DRAMs to eliminate inter-GPM accesses for these left-over fine-grained tasks.

In conventional object-level SFR, the entire framebuffer (FB) is mapped in the DRAM of the master node, and all rendering outputs are transmitted to the master node for the final composition. Although the color outputs can be executed asynchronously with the shader process, the small number of ROPs in a single GPM limits the pixel rate, which impacts overall rendering performance. Since the NUMA-based multi-GPU system can be considered one large GPU, we distribute the composition tasks across all GPMs, which is currently not supported due to the lack of relevant communication mechanisms and hardware.

Figure 14: The distributed framebuffer for hardware composition.

As shown in Figure 14, we first split the entire FB into 4 partitions using the screen-space coordinates of the final frame, employing the same memory mapping policy as vertical tile-level SFR (V). On top of this, we propose the distributed hardware composition unit (DHC) to determine which part of the framebuffer stores which color outputs of the final frame. This design is based on the observation that the color output of the final frame incurs only a small number of memory accesses compared to the main rendering phase, so the small amount of remote communication in this phase will not become a performance bottleneck for NUMA-based multi-GPU systems.
This is also why the vertical split shown in Figure 14 performs well as the last stage of VR rendering (i.e., after the object-aware runtime distribution of the main rendering phase): the inter-GPM bandwidth can be effectively utilized by the distributed hardware composition.
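The ownership test performed by each DHC unit reduces to a screen-space partition lookup, sketched below (our helper names; partition id equals GPM id under the vertical split):

```cpp
// The final frame is split into four vertical screen-space partitions,
// matching tile-level SFR (V); only color writes that fall into a remote
// partition cross an inter-GPM link.
int owningGpm(int pixelX, int frameWidth) {
  return (pixelX * 4) / frameWidth;
}

bool isRemoteWrite(int pixelX, int frameWidth, int localGpm) {
  return owningGpm(pixelX, frameWidth) != localGpm;
}
```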
The major hardware component added to the existing multi-GPU system is the object-aware runtime distribution engine, which consists of a rendering time predictor, per-GPM counters and a batch queue. For the baseline multi-GPU architecture modeled in this work (Table 2), we allocate 64 bits for each counter and 16 bits for each batch ID to store the predicted rendering time. Additionally, to predict the total and elapsed rendering times, twelve 32-bit registers track the triangle counts, the number of transformed vertices and the number of rendered pixels of the current batches. In total, we require only 960 bits of storage (4 GPMs x 2 x 64-bit counters = 512 bits, 4 x 16-bit batch IDs = 64 bits, and 12 x 32-bit registers = 384 bits) plus several small logic units. We use McPAT [22] to evaluate the area and energy overhead of the added storage and logic units of the distribution engine. The area overhead is 0.59 mm^2 at a 24nm technology node, which is 0.18% of a modern GPU (e.g., the GTX 1080); the power overhead is 0.3W, which is 0.16% of the GTX 1080's TDP.

6 EVALUATION
We model the object-oriented VR rendering framework (OO-VR) by extending ATTILA-sim [10]. To obtain the graphical properties of objects (e.g., viewports, triangle counts and texture data), we profile the rendering traces of the real-game benchmarks listed in Table 3. In ATTILA-sim, we implement the OO-VR programming model in the GPUDriver module, the object distribution engine in the command processor, and the distributed hardware composition in the color-write stage. To evaluate the effectiveness of our proposed OO-VR design, we compare it with several design scenarios:
(i) Baseline - the baseline multi-GPU system with the single programming model (Section 2);
(ii) 1TB/s-BW - the baseline system with 1TB/s inter-GPM link bandwidth;
(iii) Object-level - object-level SFR, which distributes objects among GPMs (Section 4);
(iv) Frame-level - AFR, which renders an entire frame within each GPM; and
(v) OO_APP - the proposed object-oriented programming model alone (Section 5).
We provide results and detailed analysis of the proposed design on performance, inter-GPM memory traffic, sensitivity to inter-GPM link bandwidth, and performance scalability with the number of GPMs.
Figure 15: Normalized speedup of overall VR rendering for a single frame under different design scenarios.
Figure 15 shows the performance results with respect to single-frame latency under the five design scenarios. We gather the rendering cycles from the beginning to the end of each frame and normalize the speedup to the baseline. We report single-frame speedup because it is critical for avoiding motion sickness in VR. From the figure, we make several observations.

First, without hardware modifications, OO_APP improves performance by about 99%, 39% and 28% on average compared to Baseline, Object-level SFR and 1TB/s-BW, respectively. It combines the two views of the same object and enables multi-view rendering to share texture data. In addition, by grouping objects into large batches, it further increases data locality within a GPM to reduce inter-GPM memory traffic. However, it still suffers from serious workload imbalance. For instance, Object-level SFR slightly outperforms OO_APP on DM3-1280 and DM3-1600 because some batches in these two benchmarks require much longer rendering times than others, and the software scheduling policy of OO_APP alone cannot balance execution time across GPMs without runtime information. Second, we observe that, on average, OO-VR outperforms Baseline, Object-level SFR and OO_APP by 1.58x, 99% and 59%, respectively. With the software-hardware co-design, OO-VR distributes batches based on predicted rendering time and provides better workload balance than OO_APP; it also increases the pixel rate by fully utilizing the ROPs of all GPMs.

We also observe that OO-VR achieves performance similar to frame-level parallelism, which can be considered ideal with respect to overall rendering cycles across all frames (Figure 7, left). However, in terms of single-frame latency, frame-level parallelism suffers a 40% slowdown while OO-VR significantly improves performance.
Reducing inter-GPM memory traffic is another important criterion for the effectiveness of OO-VR. Figure 16 shows the impact of OO-VR on inter-GPM memory traffic. Baseline and 1TB/s-BW have the same inter-GPM memory traffic, and Frame-level processes each frame within one GPM and thus has near-zero inter-GPM traffic. Moreover, since the traffic reduction comes mainly from our software-level design, the inter-GPM traffic is identical under OO_APP and OO-VR. Therefore, Figure 16 only shows results for Baseline, Object-level and OO-VR, and we mainly investigate these three techniques in the following subsections. From the figure, we observe that OO-VR saves 76% and 36% of inter-GPM memory accesses compared to Baseline and Object-level SFR, respectively, because OO-VR allocates the required rendering data in the local DRAM of each GPM. The majority of the remaining inter-GPM memory accesses come from the distributed hardware composition, command transmission and the Z-test during fragment processing; we observe that the delay caused by these accesses can be fully hidden by executing thousands of threads simultaneously across the numerous shader cores. In addition, data transfer over the inter-GPM links incurs high power dissipation (e.g., 10pJ/bit for boards or 250pJ/bit for nodes, depending on the integration technology [5]). By reducing inter-GPM memory traffic, OO-VR therefore also achieves significant energy and cost savings.
Inter-GPM link bandwidth is one of the most important factors in multi-GPU systems. Previous works [5, 25] have shown that increasing inter-processor link bandwidth is difficult and incurs high fabric cost. To understand how inter-GPM link bandwidth impacts the design choices, we examine the performance gain of OO-VR under a variety of link bandwidths. Figure 17 shows the speedup under different link bandwidths for Baseline, Object-level SFR and our proposed OO-VR, normalized to Baseline with a 64GB/s inter-GPM link. We observe that the link bandwidth strongly affects Baseline and Object-level SFR, because these two designs cannot capture the data locality within a GPM to minimize inter-GPM memory accesses during rendering; the large amount of data shared across GPMs significantly stalls rendering performance. In contrast, OO-VR fairly distributes the rendering workloads across GPMs and converts numerous remote accesses into local accesses. By doing so, it fully utilizes the high-speed local memory bandwidth and is insensitive to the inter-GPM link bandwidth, even though inter-GPM memory accesses are not entirely eliminated. As local memory bandwidth scales in future GPU designs (e.g., High-Bandwidth Memory (HBM) [11]), the performance of future multi-GPU systems is more likely to be constrained by inter-GPM bandwidth; we therefore expect OO-VR to benefit such future scenarios by reducing inter-GPM memory traffic.
Figure 16: Normalized inter-GPM memory traffic under different design scenarios.

Figure 17: Normalized speedup of the proposed OO-VR, Object-level SFR and the Baseline under different inter-GPM link bandwidths.

Figure 18: Normalized speedup of the proposed OO-VR, Object-level SFR and the Baseline over a single GPU under different numbers of GPMs.

Fig.18 shows the average speedup of the Baseline, Object-level SFR and OO-VR as the number of GPMs increases. The results are normalized to a single-GPU system. As the figure shows, the Baseline and Object-level SFR suffer limited performance scalability due to the NUMA bottleneck: with 8 GPMs, they improve overall performance by only 2.08x and 3.47x on average over single-GPU processing. On the other hand, OO-VR provides scalable performance improvement by distributing independent rendering tasks to each GPM. Hence, with 4 and 8 GPMs, it achieves 3.64x and 6.27x speedup over single-GPU processing, respectively.
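One way to read these scalability numbers is through a simple Amdahl-style model that treats the NUMA-bound share of a frame as an unscalable fraction f; the f values below are chosen by us to illustrate the trend, not parameters fitted from our data.

def speedup(n_gpms, f_numa):
    # Amdahl-style: the NUMA-bound fraction f does not scale with GPM count.
    return 1.0 / ((1.0 - f_numa) / n_gpms + f_numa)

for name, f in (("Baseline", 0.40), ("Object-level SFR", 0.20), ("OO-VR", 0.04)):
    print(name, [round(speedup(n, f), 2) for n in (2, 4, 8)])

With these assumed fractions, the model caps the Baseline near 2.1x and Object-level SFR near 3.3x at 8 GPMs while leaving the mostly-local design close to linear, consistent with the measured trend.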
RELATED WORK
Architecture Approaches for NUMA-Based Multi-GPU Systems.
There have been many works [5, 21, 25] improving the performance of NUMA-based multi-GPU systems. Some of them [5, 25, 43] introduce architectural optimizations to reduce inter-GPM memory traffic for GPGPU applications, while Kim et al. [21] redistribute primitives across GPMs to improve performance scalability. However, none of them exploits the data-sharing characteristics of VR applications. Our approach exploits the data locality of VR rendering to reduce inter-GPM memory traffic and achieve scalable performance on multi-GPU systems.
Parallel Rendering.
Currently, PC clusters are broadly used to render highly interactive graphics applications. To drive such clusters, software-level parallel rendering frameworks such as OpenGL Multipipe SDK [7], Chromium [19] and Equalizer [13, 14] have been developed. They provide application programming interfaces (APIs) for developing parallel graphics applications on a wide range of platforms, and they split the rendering tasks during application development under different configurations. In contrast, we propose a software and hardware co-designed object-oriented VR rendering framework for parallel rendering on NUMA-based multi-GPU systems.
Performance Improvement For Rendering.
To balance rendering workloads among multiple GPUs, some studies [12, 18, 26] propose software-level solutions that employ the CPU to predict execution time before rendering and adaptively determine workload sizes. However, such software-level methods take a long time to acquire hardware runtime information from the GPUs, which causes performance overhead. Our hardware-level scheduler quickly collects the runtime information and conducts real-time object distribution to balance the workloads. Other works [23, 38] design NUMA-aware algorithms for fast image composition. Instead of implementing a composition kernel in software, we resort to a hardware-level solution that leverages hardware components to distribute composition tasks across the multi-GPU system and enhance pixel throughput. Meanwhile, many architectural approaches [4, 40, 41] have been proposed to reduce memory traffic during rendering. Our work focuses on multi-view VR rendering in multi-GPU systems, which is orthogonal to these architectural techniques.
CONCLUSION
In modern NUMA-based multi-GPU systems, the low bandwidth of inter-GPM links significantly limits performance due to the intensive remote data accesses during multi-view VR rendering. In this paper, we propose the object-oriented VR rendering framework (OO-VR), which converts remote inter-GPM memory accesses into local memory accesses by exploiting the data locality among objects. We first characterize the impact of several parallel rendering frameworks on performance and memory traffic in NUMA-based multi-GPU systems, and observe high data sharing among rendering objects that state-of-the-art rendering frameworks and multi-GPU systems fail to capture. We then propose an object-oriented VR programming model that combines the two views of the same object and groups objects into large batches based on their texture sharing levels. Finally, we design an object-aware runtime batch distribution engine and a distributed hardware composition unit to balance the workloads among GPMs and further improve VR rendering performance. We evaluate the proposed design on a VR-featured simulator. The results show that OO-VR improves overall performance by 1.58x on average and reduces inter-GPM memory traffic by 76% over the baseline. In addition, our sensitivity study shows that OO-VR can potentially benefit future, larger multi-GPU scenarios with ever-increasing asymmetric bandwidth between local and remote memory.
ACKNOWLEDGMENT
This research is supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under the CENATE project (award No. 466150). The Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830. This research is also partially supported by National Science Foundation grants CCF-1619243, CCF-1537085 (CAREER) and CCF-1537062.
REFERENCES
[4] In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA '14). IEEE Press, Piscataway, NJ, USA, 529–540. http://dl.acm.org/citation.cfm?id=2665671.2665748
[5] Akhil Arunkumar, Evgeny Bolotin, Benjamin Cho, Ugljesa Milic, Eiman Ebrahimi, Oreste Villa, Aamer Jaleel, Carole-Jean Wu, and David Nellans. 2017. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. ISCA 2017, 45, 2 (2017), 320–332.
[6] Dean Beeler and Anuj Gosalia. 2016. Asynchronous Time Warp On Oculus Rift. https://developer.oculus.com/blog/asynchronous-timewarp-on-oculus-rift/
[7] Praveen Bhaniramka, P. C. D. Robert, and S. Eilemann. 2005. OpenGL multipipe SDK: a toolkit for scalable parallel rendering. In VIS 05. IEEE Visualization, 2005.
[8] In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 1165–1168.
[9] François De Sorbier, Vincent Nozick, and Hideo Saito. 2010. GPU-Based multi-view rendering. In Computer Games, Multimedia and Allied Technology. 7–13.
[10] V. M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and E. Espasa. 2006. ATTILA: a cycle-level execution-driven simulator for modern GPU architectures. In ISPASS.
[12] In Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association.
[13] Stefan Eilemann, Maxim Makhinya, and Renato Pajarola. 2009. Equalizer: A scalable parallel rendering framework. IEEE Transactions on Visualization and Computer Graphics 15, 3 (2009), 436–452.
[14] Stefan Eilemann, David Steiner, and Renato Pajarola. 2018. Equalizer 2.0 - Convergence of a Parallel Rendering Framework. arXiv preprint arXiv:1802.08022 (2018).
[15] Daniel Evangelakos and Michael Mara. 2016. Extended TimeWarp Latency Compensation for Virtual Reality. In Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D '16). ACM, New York, NY, USA, 193–194. https://doi.org/10.1145/2856400.2876015
[16] Google. 2018. Google VR. https://vr.google.com/
[17] Mike Houston. 2008. Anatomy of AMD TeraScale Graphics Engine. In SIGGRAPH.
[18] Chang Hui, Lei Xiaoyong, and Dai Shuling. 2009. A dynamic load balancing algorithm for sort-first rendering clusters. In Computer Science and Information Technology, 2009. ICCSIT 2009. 2nd IEEE International Conference on. IEEE, 515–519.
[19] Greg Humphreys, Mike Houston, Ren Ng, Randall Frank, Sean Ahern, Peter D. Kirchner, and James T. Klosowski. 2002. Chromium: a stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics (TOG) 21, 3 (2002), 693–702.
[20] David Kanter. 2015. Graphics processing requirements for enabling immersive VR. AMD White Paper.
[21] Youngsok Kim, Jae-Eon Jo, Hanhwi Jang, Minsoo Rhu, Hanjun Kim, and Jangwoo Kim. 2017. GPUpd: A Fast and Scalable multi-GPU Architecture Using Cooperative Projection and Distribution. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 574–586. https://doi.org/10.1145/3123939.3123968
[22] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). New York, NY, USA.
[23] M. Makhinya, S. Eilemann, and R. Pajarola. 2010. Fast Compositing for Cluster-parallel Rendering. In Proceedings of the 10th Eurographics Conference on Parallel Graphics and Visualization (EG PGV'10).
[25] In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
In Virtual Reality Conference, 2006.
[36] In Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology. ACM, 37–46.
[37] Alex Vlachos. 2016. Advanced VR rendering performance. In Game Developer Conference.
[38] Pan Wang, Zhiquan Cheng, Ralph Martin, Huahai Liu, Xun Cai, and Sikun Li. 2013. NUMA-aware image compositing on multi-GPU platform. The Visual Computer 29, 6-8 (2013), 639–649.
[39] Michael Wimmer and Peter Wonka. 2003. Rendering Time Estimation for Real-time Rendering. In Proceedings of the 14th Eurographics Workshop on Rendering (EGRW '03). Eurographics Association, Aire-la-Ville, Switzerland, 118–129. http://dl.acm.org/citation.cfm?id=882404.882422
[40] Chenhao Xie, Xin Fu, and Shuaiwen Song. 2018. Perception-Oriented 3D Rendering Approximation for Modern Graphics Processors. In IEEE International Symposium on High Performance Computer Architecture (HPCA 2018), Vienna, Austria, February 24-28, 2018. 362–374.
[41] Chenhao Xie, Shuaiwen Leon Song, Jing Wang, Weigong Zhang, and Xin Fu. 2017. Processing-in-Memory Enabled Graphics Processors for 3D Rendering. In HPCA 2017. https://doi.org/10.1109/HPCA.2017.37
[42] Chenhao Xie, Xingyao Zhang, Ang Li, Xin Fu, and Shuaiwen Leon Song. 2019. PIM-VR: Erasing Motion Anomalies In Highly-Interactive Virtual Reality World With Customized Memory Cube.
[43] Vinson Young, Aamer Jaleel, Evgeny Bolotin, Eiman Ebrahimi, David Nellans, and Oreste Villa. 2018. Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems.
[44] D. Zhang and Y. Luo. 2012. Single-trial ERPs elicited by visual stimuli at two contrast levels: Analysis of ongoing EEG and latency/amplitude jitters. In 2012 IEEE Symposium on Robotics and Applications (ISRA).