Q-VR: System-Level Design for Future Mobile Collaborative Virtual Reality
Chenhao Xie, Xie Li, Yang Hu, Huwan Peng, Michael Taylor, Shuaiwen Leon Song
Chenhao Xie, Pacific Northwest National Laboratory, USA
Xie Li, University of Sydney, Australia
Yang Hu, University of Texas at Dallas, USA
Huwan Peng, University of Washington, USA
Michael Taylor, University of Washington, USA
Shuaiwen Leon Song, University of Sydney, Australia
ABSTRACT
High-quality mobile Virtual Reality (VR) is what the incoming graphics technology era demands: users around the world, regardless of their hardware and network conditions, should all be able to enjoy an immersive virtual experience. However, state-of-the-art software-based mobile VR designs cannot fully satisfy the realtime performance requirements due to the highly interactive nature of users' actions and complex environmental constraints during VR execution. Inspired by the unique effects of the human visual system and the strong correlation between VR motion features and realtime hardware-level information, we propose Q-VR, a novel dynamic collaborative rendering solution via software-hardware co-design for enabling future low-latency high-quality mobile VR. At the software level, Q-VR provides a flexible high-level tuning interface to reduce network latency while maintaining user perception. At the hardware level, Q-VR accommodates a wide spectrum of hardware and network conditions across users by effectively leveraging the computing capability of increasingly powerful VR hardware. Extensive evaluation on real-world games demonstrates that Q-VR achieves an average end-to-end performance speedup of 3.4x (up to 6.7x) over the traditional local rendering design in commercial VR devices, and a 4.1x frame rate improvement over state-of-the-art static collaborative rendering.
CCS CONCEPTS
• Computing methodologies → Virtual reality; Sequential decision making; • Computer systems organization → Client-server architectures; System on a chip.
KEYWORDS
Virtual Reality, Mobile System, System-on-Chip, Realtime Learning, Planet-Scale System Design
ACM Reference Format:
Chenhao Xie, Xie Li, Yang Hu, Huwan Peng, Michael Taylor, and Shuaiwen Leon Song. 2021. Q-VR: System-Level Design for Future Mobile Collaborative Virtual Reality. In Proceedings of the 26th ACM International Conference
on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19-23, 2021, Virtual, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3445814.3446715
Since the release of the movie Ready Player One, consumers have been longing for a commercial product that can one day levitate them to a fantasy alternate dimension: a truly immersive experience without mobility restrictions and periodic motion anomalies. In other words, users require exceptional visual quality from an untethered, mobile-rendered head-mounted display (HMD) that is equivalent to what high-end tethered VR systems (e.g., Oculus Rift [44] and HTC Vive [22]) provide. Although the processing capability of current mobile hardware has improved significantly [3, 48], it still cannot fully process heavy VR workloads under the stringent runtime latency constraints. With the development of high-performance server technology, server-based realtime rendering of Computer Graphics (CG) has been introduced by several major cloud vendors such as Nvidia GeForce Now [41] and Google Cloud for Games [17]. However, under current network conditions, remote servers alone cannot provide realtime low-latency high-quality VR rendering due to the dominating communication latency. Thus, neither local-only rendering nor remote-only rendering can satisfy the latency requirements for high-quality mobile VR: there is a clear mismatch between the hardware's raw computing power and the desired rendering complexity.

To address the latency and bandwidth challenges of today's dominant mobile rendering models, it seems reasonable to utilize mobile VR hardware's computing power to handle part of the rendering workload near the display HMD in exchange for reduced network communication, while letting the remote system handle the rest of the workload. But how to design such VR systems to reach the latency and perception objectives is still an open problem. Recent studies [7, 31, 35-37] proposed a static collaborative software framework that renders the foreground interactive objects locally while offloading the background environment to the remote server, based on the observation that interactive objects are often more lightweight than the background environment. However, after a thorough qualitative investigation into the current mobile VR's architecture-level rendering pipeline and a quantitative latency bottleneck analysis, we observe that this naive rendering scheme faces several challenges.
Figure 1: Processing diagram of our software-hardware co-designed Q-VR.
First, interactive objects have to be narrowly defined by programmers on each hardware platform to satisfy the "worst case" scenario during VR application development, which significantly limits the design possibilities for high-quality interactive VR environments and burdens programmers with accommodating all the realtime constraints during development; this is labor intensive and impractical. Second, it cannot fundamentally reduce the communication latency because the remote rendering workload remains unreduced. Third, it loses the flexibility to dynamically maintain the balance of local-remote rendering latency under realtime uncertainties: unpredictable user inputs (e.g., interaction, movements, etc.) and environment (e.g., hardware and network) changes. Finally, it suffers from high composition overhead by requiring more complex collision detection and embedding methods [7, 31], directly contributing to resource contention on mobile GPU(s).

In this paper, we propose a novel software-hardware co-design solution, named Q-VR, for enabling low-latency high-quality collaborative mobile VR rendering by effectively leveraging the processing capability of both local and remote rendering hardware. Fig. 1 illustrates the processing diagram of Q-VR. At the software layer, we propose a vision-perception inspired collaborative rendering design (Fig. 1-(1)) for Q-VR to provide a flexible tuning interface and programming model for enabling network latency reduction while maintaining user perception. The basic idea is that the different acuity-level requirements of the human visual system naturally generate a new workload partitioning mechanism for collaborative VR rendering (Section 3). We leverage and extend this "foveation effect" [20, 51, 58-60] in Q-VR's software design to transform this complex global collaborative rendering problem into a workable framework. At the hardware level, we design two novel architecture components, the Lightweight Interaction-Aware Workload Controller (LIWC, Fig. 1-(2)) and Unified Composition and ATW (UCA, Fig. 1-(3)), to seamlessly interface with Q-VR's software layer and achieve two general objectives: (1) quickly reaching the local-remote latency balance for each frame for optimal rendering efficiency; and (2) forming a low-latency collaborative rendering pipeline that reduces realtime resource contention and improves architecture-level parallelism. These hardware designs are based on two key insights: there is a strong
correlation among motion, scene complexity, and hardware-level intermediate data (Section 4.1); and there is an algorithmic-level similarity between VR composition and reprojection (Section 4.2).

To summarize, this paper makes the following contributions:
• We design the first software-hardware co-designed collaborative rendering architecture to tackle the mismatch between VR hardware processing capability and desired rendering complexity from a cross-layer systematic perspective;
• We identify the fundamental limitations of the state-of-the-art collaborative rendering design and quantify the major bottleneck factors via detailed workload characterization and VR execution pipeline analysis;
• By leveraging the foveation features of the human visual system, we explore software-level flexibility to reduce the network limitation via a fine-grained dynamic tuning space for workload control while maintaining user perception;
• Based on our key observations on VR motion correlations and execution similarity, we design two novel hardware components to support software-layer interfacing and deeper pipeline-level optimizations;
• Extensive evaluation on real-world games demonstrates that the Q-VR design achieves an average end-to-end speedup of 3.4x (up to 6.7x) over the traditional local rendering design in today's commercial VR devices, and a 4.1x frame rate improvement over the state-of-the-art static collaborative rendering solution.

Figure 2: An example of a modern VR graphics pipeline. The end-to-end latency spans the VR sensor, VR graphics, and display refresh stages, and FPS = 1/max(T_s, T_g, T_D).
Different from traditional graphics applications, modern VR systems retrieve real-time user information to present a pair of stereo scenes in front of the user's eyes. Fig. 2 shows an example of a typical modern VR graphics pipeline. The VR system first gathers head-/eye-tracking data at the beginning of a frame through plugin motion and eye sensors, which typically run at their own frequencies [13, 20, 53]. It then relies on the VR runtime to process user inputs and eye-tracking information, and on the rendering engine to generate the pair of frames for both eyes. Before the rendered frame pair is displayed on the Head-Mounted Display (HMD), the VR system applies asynchronous time warp (ATW) to reproject the 2D image plane based on lens distortion [5, 57]. To create the perception that users are physically present in a non-physical world (i.e., the concept of immersion [14, 27, 42, 57]), the rendering task becomes very heavy: generating a pair of high-quality images along with sound and other stimuli that cater to an engrossing total environment.

Figure 3: System latency and FPS when running high-end VR applications on two current mobile VR system designs: (a) local-only rendering and (b) remote-only rendering.

Meanwhile, because the human visual system is very latency sensitive for close views, any noticeable performance degradation in realtime VR execution can cause motion anomalies such as judder, sickness and disorientation [7, 31]. To achieve a robust real-time user experience, commercial VR applications are required to meet several performance requirements, e.g., an end-to-end latency (i.e., Motion-to-Photon latency or MTP) below 25 ms and a frame rate above 90 Hz [27] (about 11 ms per frame), as Fig. 2 demonstrates. To deliver high image quality simultaneously with low system latency, high-quality VR applications are typically designed for a tethered setup (e.g., HTC Vive Pro [22] and Oculus Rift [44]). The tethered setup connects the VR HMD with a high-end rendering engine (e.g., a standalone GPU or GPUs) to provide the desired rendering performance. However, bounded by the connection cable between the VR HMD and the renderer, tethered VR systems significantly limit users' mobility, which is one of the core requirements of an immersive user experience. With the advancement of mobile device design and System-on-Chip (SoC) performance, we have observed a design-focus shift from low-mobility rendering to mobile-centric local rendering, e.g., Google Daydream [18], Oculus Quest [44], and Gear VR [50]. However, these rendering schemes cannot effectively support low-latency high-quality VR rendering tasks due to the weak raw processing power of mobile hardware compared to its tethered counterparts. As a result, state-of-the-art mobile VR designs are limited to delivering VR videos instead of enabling real-time interactive VR graphics [32, 36].
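To make the frame-budget arithmetic concrete, a minimal sketch follows; the stage latencies are hypothetical example values, and the helper names are ours rather than part of any VR SDK.

```python
# Illustrative frame-budget check for the pipeline of Fig. 2.
# All stage latencies (ms) are hypothetical example values.

def frame_rate(t_sensor_ms, t_graphics_ms, t_display_ms):
    """FPS is bounded by the slowest pipeline stage: FPS = 1 / max(Ts, Tg, Td)."""
    return 1000.0 / max(t_sensor_ms, t_graphics_ms, t_display_ms)

def meets_vr_targets(t_sensor_ms, t_graphics_ms, t_display_ms,
                     mtp_budget_ms=25.0, target_fps=90.0):
    """Commercial targets: end-to-end (MTP) latency < 25 ms and FPS > 90 Hz."""
    mtp = t_sensor_ms + t_graphics_ms + t_display_ms   # simplified end-to-end sum
    fps = frame_rate(t_sensor_ms, t_graphics_ms, t_display_ms)
    return mtp < mtp_budget_ms and fps > target_fps

# A 13 ms graphics stage already blows the ~11 ms per-frame budget needed for 90 Hz.
print(round(frame_rate(2.0, 13.0, 5.0), 1), meets_vr_targets(2.0, 13.0, 5.0))   # 76.9 False
print(round(frame_rate(2.0, 9.0, 5.0), 1), meets_vr_targets(2.0, 9.0, 5.0))     # 111.1 True
```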
With the development of wireless technology, the concept of cloud-based real-time rendering of Computer Graphics (CG) is being introduced by major cloud service vendors [2, 17, 41]. It opens up opportunities to stream VR games or other virtual content from cloud servers, enabling high-quality VR scene rendering on high-performance computing clusters [64]. There are two main rendering schemes proposed to support next-generation mobile VR rendering:

(I) Remote Rendering. A straightforward approach to overcome the performance limitation of mobile systems is to offload the compute-intensive rendering tasks to a powerful server or remote high-end GPUs by leveraging cloud-based real-time rendering technologies. However, under current network conditions, a naive cloud VR design via streaming is infeasible for real-time high-quality VR rendering due to the requirements of high resolution and low end-to-end latency. Previous work [13] suggests leveraging compression techniques to reduce the transmission latency. However, even with highly effective compression strategies and parallel decoding, such an approach cannot meet the performance requirements of high-quality VR applications [31]. Fig. 3 shows the breakdown of the end-to-end latency (i.e., from tracking to display) when executing several high-quality VR applications under two commercial mobile VR designs: local-only rendering and remote-only rendering. The detailed experimental setup is discussed in Sec. 2.3. The blue lines represent the frame rate (FPS) achieved on the VR HMD, while the red dashed lines illustrate the mobile VR system latency restriction (i.e., the commercial standard of 25 ms). The figure shows that the raw processing power of the integrated GPU is the key bottleneck for local-only rendering, while the transmission latency in remote-only rendering contributes approximately 63% of the overall system latency. Although VR vendors today employ frame re-projection technologies such as Asynchronous TimeWarp (ATW) [5] to artificially fill in dropped frames, they cannot fundamentally reduce the MTP latency or increase the FPS because they take little account of realtime inputs such as users' movements and interaction. Thus, improving the overall system performance is still one of the highest design priorities for future mobile VR systems.

(II) State-of-the-Art: Static Collaborative VR Rendering.
Recent works [7, 31, 35, 37] have proposed a collaborative rendering scheme which applies mobile VR hardware's computing power to handle a portion of the time-critical rendering workload near the HMD display while letting the remote system handle the rest. Specifically, the fundamental principle of this collaborative scheme is based on the observation that the pre-defined interactive objects are often more lightweight than the background environment, suggesting rendering the foreground interactive objects locally while offloading the background environment to the remote server. To further hide the network latency and improve bandwidth utilization, they also enable pre-rendering and prefetching for the background environment. However, this general scheme ignores several key factors, including (1) different mobile VR hardware's realtime processing capability, (2) the ever-changing rendering workload due to realtime user inputs, and (3) the different network conditions available to users. These factors result in significant performance, programmability and portability challenges for low-latency high-quality mobile VR design. We discuss this in detail next.
Rendering Execution Pipeline Analysis.
We first qualitatively analyze the general collaborative rendering and its limitations from the perspective of the execution pipeline. Fig. 4 (top) describes a general collaborative rendering execution pipeline based on today's mobile VR design prototypes [7, 31, 35, 37]. A collaborative VR rendering workload can be interpreted as several functional kernels launched onto multiple accelerators [24] (shown with the same color in Fig. 4), each of which is responsible for executing a set of tasks. Specifically, for every frame, the CPU utilizes the VR input signal to process the VR application logic (CL). After that, it sets up the local rendering tasks and issues remote frame fetching to the network (LS). The frame generation is then split into two parallel processes: the mobile GPU processes the interactive objects via local rendering (LR), while the network module offloads the background rendering to the remote server (RR). The remote server returns the rendered background as encoded streaming network packets, which are later decoded by the video processing unit and stored in the framebuffer (VD). When both the interactive objects and the background image are ready, the GPU composites them based on the depth information to generate the final frame (C). Since this output frame is still in 2D, the GPU further maps it into 3D coordinates via ATW (lens distortion and reprojection) and delivers it to the HMD.
Figure 4: Execution pipeline of static collaborative rendering and our proposed Q-VR. Q-VR's software and hardware optimizations are reflected in the pipeline. Rendering tasks are conceptually mapped to different hardware components, among which LIWC and UCA are newly designed in this work. Intra-frame tasks may overlap in realtime (e.g., RR, network and VD) due to multi-accelerator parallelism. CL: software control logic; LS: local setup; LR: local rendering; C: composition; RR: remote rendering; VD: video decoding; LIWC: lightweight interaction-aware workload controller; UCA: unified composition and ATW.

Table 1: Performance of Static Collaborative VR Rendering Across Different High-Quality VR Applications (90 Hz)

Apps            | Resolution | Triangles | Interactive Objects | f Range    | Avg. T_local | Min. T_local | Max. T_local | Back. Size | T_remote
Foveated3D [20] | 1920x2160  | 231K      | 9 Chess             | 16% - 52%  | 43 ms        | 18 ms        | 75 ms        | 646 KB     | 38 ms
Viking [56]     | 1920x2160  | 2.8M      | 1 Carriage          | 10% - 13%  | 13 ms        | 12 ms        | 16 ms        | 530 KB     | 31 ms
Nature [55]     | 1920x2160  | 1.4M      | 1 Tree              | 10% - 24%  | 16 ms        | 12 ms        | 26 ms        | 482 KB     | 28 ms
Sponza [42]     | 1920x2160  | 282K      | Lion Shield         | 0.1% - 20% | 5.8 ms       | 0.5 ms       | 12 ms        | 537 KB     | 31 ms
San Miguel [42] | 1920x2160  | 4.2M      | 4 Chairs, 1 Table   | 6% - 15%   | 11 ms        | 5.4 ms       | 14 ms        | 572 KB     | 33 ms
To achieve the highest rendering performance, both software-level parallelism (between different kernels) and hardware-level parallelism (between different hardware resources) need to be well coordinated. We identify two general insights for forming a low-latency collaborative rendering pipeline. (1) Within each frame, the local and remote rendering need to reach a balance point to achieve the highest resource utilization. A significant slowdown of either component will result in unsatisfactory execution, causing motion anomalies and a low frame rate. For example, the stall at Fig. 4-(2) is caused by misestimating the hardware's realtime processing capability and the changing workload during execution. (2) Across frames, eliminating realtime GPU resource contention among the different essential tasks can significantly improve the frame rate. As illustrated by Fig. 4-(3), several essential tasks including local rendering, composition and ATW all compete for GPU resources. Any elongated occupation of GPU cores by composition and ATW can interrupt the normal local rendering process and cause bursts of frame-rate drops. This phenomenon has been observed by previous studies [5, 32, 65].
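The balance-point insight (1) can be seen with a toy timing model: because the local and remote paths run in parallel, each frame is gated by the slower of the two, so an imbalanced split wastes whichever side finishes early. The numbers below are hypothetical.

```python
# Toy model of insight (1): local rendering (LR) and the remote path
# (RR + network + VD) proceed in parallel, so a frame is gated by the
# slower of the two plus the final composition. Values are hypothetical.

def frame_latency_ms(t_local, t_remote, t_compose=1.0):
    return max(t_local, t_remote) + t_compose

print(frame_latency_ms(t_local=5.0, t_remote=30.0))   # 31.0 ms: local side idles
print(frame_latency_ms(t_local=12.0, t_remote=12.0))  # 13.0 ms: balanced split
```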
Challenges Facing Static Collaborative Rendering. Now we investigate the design efficiency of the current static collaborative rendering. To provide quantitative analysis, we build a physical experimental platform for this evaluation. We execute several Windows-based open-source high-quality VR apps on a Gen 9 Intel mobile processor equipped with an Intel Core i7 CPU and a mobile GPU. We also calibrate the rendering performance of this local rendering platform against an Apple A10 Fusion SoC equipped by the iPhone X [3] by executing a range of mobile VR apps. For remote rendering, a high-performance gaming system equipped with an NVIDIA Pascal GPU is used as the rendering engine. Additionally, Netcat [16] is applied for network communication and the lossless H.264 protocol is leveraged for video compression.

Table 1 lists the tested high-quality VR applications and their performance characterization. These applications were originally designed for tethered VR devices and present photorealistic VR scenes. For each application, we first identify the draw batch commands for every object and then extract the foreground dynamic objects for local rendering and the background for remote rendering, as previous works [7, 31, 35, 37] suggest.

Figure 5: The realtime user inputs (e.g., interaction) directly affect the rendering latency, even within the same scene. The closer the user is to the tree in Nature [55], the more details need to be rendered: (a) 12 ms, (b) 15 ms, (c) 26 ms.
The workload partition parameter, f, represents the percentage of the normalized latency to render the interactive objects relative to the entire frame rendering time. We also collect the latencies for local rendering (T_local) and remote frame fetching (T_remote), which should be smaller than 11 ms to satisfy 90 Hz FPS. Since the remote rendering, network transmission and video codec can be streamed in parallel [31, 34], we only count the highest-latency portion from the remote side, which is the network transmission in our case. From Table 1, we identify two major challenges for static collaboration:

Challenge I: Design Inflexibility and Poor Programmability.
The state-of-the-art design is a "one-fit-for-all" solution: it assumes the processing of the pre-defined interactive objects will always meet VR's realtime latency requirements. However, the VR scene complexity and animation of interactive objects are often random and determined by users' actions at realtime, which may cause significant workload changes from frame to frame. Fig. 5 and Table 1 demonstrate that the rendering latency for a single interactive object (the tree in the Nature app) can change from 12 ms to 26 ms (i.e., 10% - 24% of the rendering workload) depending on how users interact with the tree, and the maximum T_local of all benchmarks exceeds the frame-time requirement (11 ms, or 90 Hz). As a result, in this static collaborative design, programmers are burdened to accommodate all the realtime constraints and reduce the interactive content during development to avoid VR latency issues, which is extremely difficult, labor intensive and impractical. Additionally, this design loses the flexibility to control runtime kernel execution (e.g., in Fig. 4-(2), the transmission latency is long) to help local and remote rendering reach a balance point for optimal rendering and resource utilization.

Challenge II: Costly Remote Data Transmission.
Table 1 also shows that the static design incurs high network latency (about 30 ms over Wi-Fi) to download the compressed background image, which significantly increases the end-to-end latency (demonstrated at Fig. 4-(1)). Under this design, not only the rendered frames but also the depth maps of the VR scenes need to be sent back for composition [7, 31, 35, 37]. Although static collaborative rendering enables caching and prefetching techniques [7, 31] that attempt to hide the network latency under some circumstances, they incur large storage overhead. Meanwhile, to prefetch the background in time, mobile VR systems need to predict random users' motion inputs more than 30 ms ahead (about 3 frames), which may significantly reduce the prediction accuracy. Furthermore, failing to predict users' behaviors will trigger even higher end-to-end VR latency, resulting in motion sickness from the position mismatch between the interactive objects and the background [29, 46].

To tackle the challenges above, we propose a novel software-hardware co-design solution for low-latency high-quality collaborative VR rendering, named
Q-VR. Its general pipeline is shown in Fig. 4 (bottom). Based on the insights from this subsection, Q-VR has the following high-level design objectives: (a) reducing T_remote to weaken the impact of remote rendering and network latency; (b) dynamically balancing local and remote rendering based on realtime constraints (e.g., hardware, network and user inputs) for optimal resource utilization and rendering efficiency; and (c) eliminating realtime hardware contention on the execution pipeline to improve FPS. We break Q-VR's design down into a new software framework (Sec. 3) and novel hardware support (Sec. 4).

In this section, we propose a vision-perception inspired software-layer design for Q-VR that provides a flexible interface for enabling T_remote reduction while maintaining user perception. It also provides high-level support for the fine-grained dynamic rendering tuning capability enabled by our hardware design optimizations (Sec. 4), which effectively accommodates rendering workload variation across frames and helps reach local-remote latency balance. Instead of predefining the workload partition during VR application development, we extend the concept of foveated rendering [20, 28, 43, 60] to redesign the rendering workload for mobile VR systems. Previous research has documented how human visual acuity falls off from the centre (called the fovea) to the periphery [49, 52]. Although human eyes can see a broad field of view (135 degrees vertically and 160 degrees horizontally), only the small central fovea region requires fine details. For the periphery areas, the acuity requirement falls off rapidly as eccentricity increases. Based on this feature, foveated rendering can reduce the rendering workload by greatly reducing the image resolution in the periphery areas and is able to maintain user perception as long as the foveated constraints are satisfied between layers [9, 20, 38, 47, 51].

Basic idea.
The basic idea is that the varying spatial resolution requirements of the human visual system (e.g., fovea versus peripheral vision) naturally generate an efficient workload partitioning. We can leverage this to significantly reduce the transmitted data size over the network by adopting lower resolutions for the periphery-area video streaming from the remote server, while still rendering the most critical visual perception area locally at the highest resolution without any approximation.
Traditional foveated rendering decomposes the frame into three layers: (1) the foveal layer (with radius e1), centered at the eye-tracking point, which is the most critical perception area and is rendered at the highest resolution; (2) the middle layer (with radius e2), which employs a gradient resolution to smooth the acuity fall-off; and (3) the outer layer, which renders the periphery area at low resolution for speedup. Many past user perception surveys [1, 20, 38] have demonstrated that foveated rendering determines the resolutions following a well-selected MAR (minimum angle of resolution) model to achieve the same perceived visual quality as non-foveated rendering.

Figure 6: Average foveal layer rendering latency under increasing eccentricity when running Foveated3D on an Intel Gen 9 mobile processor (scene complexities: 400 objects with 4K triangles/object, 800 objects with 4K triangles/object, and 400 objects with 8K triangles/object). When the eccentricity is ≤ 15 degrees, all types of scene complexities can be handled within the target latency requirement (≤ 11 ms).

To estimate the local SoC's computing capability, we evaluate the end-to-end rendering latency as a function of the foveal layer radius by executing the Foveated3D app on the collaboration setup of a state-of-the-art Intel Gen 9 mobile processor and a remote server (Sec. 2.3). Here we reorganize the three layers into two: local fovea rendering for the center (e1) and remote periphery rendering for the middle and outer layers (*e2). We also adapt the second eccentricity (*e2) and calculate the *Periphery Quality via Eq. (1) to further reduce the communication overhead:

*e2 = min(P_Middle + P_Outer),   s_i = ω_i / ω* = (m · e_i + ω_0) / ω*,   i ∈ {M, O}    (1)

where we directly employ the vision parameters (e.g., MAR slope m, fovea MAR ω_0) from previous user studies [1, 20, 38] to maintain user perception within the foveated constraints.
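The snippet below gives a numeric reading of the MAR model behind Eq. (1); the slope m and fovea MAR ω_0 are placeholder values rather than the calibrated vision parameters the paper takes from [1, 20, 38].

```python
# Minimal numeric sketch of the linear MAR model used in Eq. (1).
# m (MAR slope) and omega_0 (fovea MAR) are placeholder values.

def mar(e_deg, m=0.022, omega_0=1.0 / 48):
    """MAR at eccentricity e (degrees): omega(e) = m * e + omega_0."""
    return m * e_deg + omega_0

def layer_scale(e_deg, omega_star=1.0 / 48):
    """Per-layer factor s_i = omega_i / omega* from Eq. (1), i in {Middle, Outer}."""
    return mar(e_deg) / omega_star

# Layers farther from the fovea tolerate a larger MAR, i.e., their streamed
# resolution (and hence transmitted data) can be reduced more aggressively.
for e in (15, 30, 60):
    print(e, round(layer_scale(e), 1))
```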
Fig. 6 demonstrates that the local rendering performance highly depends on the size of the foveal layer. We observe that if the eccentricity is ≤ 15 degrees, all types of scene complexities in Foveated3D can be handled within the target latency requirement (≤ 11 ms). This suggests that modern mobile VR SoCs are capable of dynamically rendering a range of workloads (or fovea sizes) with fine details and high resolution beyond the traditionally defined 5-degree central fovea, determined by realtime constraints such as scene complexity and hardware capability. This finding provides a flexible tuning knob for enabling dynamic workload control in Q-VR and helps further deprioritize network latency and remote rendering.

Finally, we conduct an image quality survey following the evaluation principles from [20, 29] to evaluate the impact of our eccentricity selection method. Specifically, we survey 50 candidates to estimate the image quality effects after adopting our adaptive foveated rendering scheme. First, we apply different VR streams of images under a specific display resolution (e.g., 1920x2160) with different fovea areas (i.e., eccentricities from 40 degrees down to 5 degrees) and their corresponding periphery resolutions. We then let the candidates focus on the center of the images and switch images based on the degrading central foveal eccentricity. Each image is displayed for 5 s. We then ask them whether they experience any image quality difference and let them score each image during the survey. Similar to what is reflected in Fig. 6 from different snapshots of the chessboard, participants observe no visible image quality difference between the different eccentricity selections when the target MAR is satisfied, which helps Q-VR maintain user perception.
Figure 7: An example of software-level setup and configuration in our vision-perception inspired Q-VR, its programming model, and how it interfaces with hardware. The local client configures a "Fovea" window (viewport [Fovea(X,Y), e1]) with a "fovea" channel; the remote server configures a "Periphery" window with "mid" and "out" channels (viewports [Fovea(X,Y), *e2] and the outer layer); a "Display" channel takes the "fovea", "mid" and "out" input frames and writes the composited output to the framebuffer.
We then introduce the software-layer support for enabling this foveation-oriented collaborative rendering for future mobile VR. Different from the original foveated rendering, which focuses on image-resolution approximation with pre-calculated eccentricities and resolutions, the design goal of the new software framework is to enable a dynamic partition by leveraging the key observation that the central fovea size depends on the real-time hardware rendering capability. To achieve this, we create a new distributed rendering programming model supported by lower-level graphics libraries.

Fig. 7 shows the overall software-layer design of our proposed Q-VR to support collaborative foveated rendering. First, we split the VR graphics into a local client version (the yellow boxes) and a remote server version (the green boxes) to process different visual layers in parallel. Instead of directly collecting the foveated rendering parameters, such as the central fovea coordinate Fovea(X,Y) and the partition eccentricities (e1, e2), from the eye tracker, we add a software tuning knob for fine-grained fovea control and software interfaces to the graphics stack to acquire these parameters from our hardware partition engine, which is integrated into the workload controller described in Section 4.1. For the client version, we gather Fovea(X,Y) and e1 to set up the rendering viewports via the VR SDK, and the local rendering process remains normal VR rendering for the two eyes in high resolution. For the server version, we extend the state-of-the-art parallel VR rendering pipeline [6, 11, 12, 43, 64] to set up multiple rendering channels for the middle and outer layers with the calculated eccentricities (e2, *e2).

Since Q-VR requires no additional composition on the remote server (supported by our UCA design in Sec. 4.2), we use separate framebuffers to store the rendering results of the periphery layers. Each framebuffer has an adjustable size according to its corresponding layer's resolution or periphery quality. By doing this, the server only needs to send the lower-quality middle and outer layers (still under the foveated visual constraints) back to the local client instead of the entire framebuffer at full resolution, reducing the transmitted data size. Using the separate framebuffers, we apply parallel streaming to transmit the data of the middle and outer layers for each eye (Fig. 7) and overlap the rendering and data transmission to further reduce the transmission latency. Finally, we perform the foveated composition to simply overlap the three layers' inputs. To erase the artificial effects generated by the resolution gradient between layers, the composition also applies multi-sample anti-aliasing (MSAA) at the edges [20] of the layers. We discuss our novel hardware support next.

By proposing the software-layer design, we enable the possibility of realtime tuning of the rendering workload via adaptive foveal sizing. However, the actual eccentricity selection for each frame requires high fidelity and should ideally have minimal latency, which a software-only control mechanism cannot provide. As shown in Fig. 4, to dynamically predict the proper fovea size, the software control logic (CL) has to wait until the previous rendering completes, which may delay it by more than one frame, e.g., Frame N+2's prediction is based on Frame N's rendering output. This not only causes low prediction accuracy but also may significantly extend the overall execution pipeline. This motivates us to explore hardware-level
This motivates us to explore hardware-levelopportunities for deeper pipeline-level optimizations.
A straightforward method to dynamically select the best eccentricity would be to statically and exhaustively profile various parameters (e.g., hardware and network conditions, FPS and MTP, user actions, etc.) for each frame's eccentricity set (e1, e2) in a large sampling space, and to build a model to predict e1 for each frame. In reality, however, correctly predicting such a mapping is very difficult because a large number of samples is required even for a single scene [7], and the result is not portable to other VR applications. Recent approaches [40, 61] have used deep learning models to train certain dynamic relationships, but they are too power hungry to be integrated into mobile VR. Thus, we propose a lightweight design that can largely describe scene complexity changes and help dynamically build a strong mapping between environment conditions and e1.

Figure 8: The head motions and fovea tracking can help determine the scene complexity change trend across frames.
Figure 9: Architecture diagram of our proposed LIWC.

Key Design Insights.
To build such a mapping, we leverage two key observations. (i) The scene complexity change of the local foveated rendering across continuous frames is highly related to the user's head and eye motions. Fig. 8 shows an example: the center focus moves with the user's head and eyes to the left and right, which changes the rendering workload in the fovea area (the purple box) accordingly. This indicates that it is possible to use this built-up interaction experience to correlate the change trend of scene complexity with fovea-area movements. (ii) The local rendering latency is sensitive to the scene complexity and realtime hardware processing capability (e.g., it can be estimated from the triangle count; triangles are the basic intermediate units in the computer graphics pipeline for creating more complex models and objects), while the remote latency is dominated by the resolution and network bandwidth. To respond to environmental changes as soon as possible with minimal latency impact on the overall execution pipeline, we can predict the local and remote latency by directly leveraging intermediate hardware information.

Architecture design.
Based on these two insights, we propose a Lightweight Interaction-Aware Workload Controller (LIWC), shown in Fig. 9, to determine the best balanced eccentricity, which is indexed by the user's inputs and the runtime latency. It includes four major components: (1) an SRAM storing the motion-to-eccentricity mapping table, which records the latency gradient offset for all pairs of motion information and eccentricity; (2) a latency predictor to predict the current latency of the local and remote rendering; (3) a motion codec to translate the motion information into table entry addresses; and (4) a runtime updater to update the mapping table and the latency prediction parameters.
As a single accelerator separated from the CPU and GPU, LIWC can bypass the CPU to directly monitor the number of triangles during the rendering setup process for assessing the local rendering latency, and to monitor the network's ACK packets for assessing the remote latency. Leveraging these two pieces of hardware-level intermediate data, the local and remote latencies are estimated with a lightweight performance model, described as Eq. (2). As Fig. 4-(B) illustrates, the LIWC design avoids the overheads that software approaches introduce, e.g., waiting for the rendering to complete, in-out memory activities, and kernel issuing.

T_local = (Triangles × %fovea) / P(GPU_m),   T_remote = DataSize(M+O) / Throughput    (2)
To gather the user's inputs, LIWC indexes the motion information by the changes in user motion between two frames (i.e., 6 bits for the degrees-of-freedom changes of the HMD and 4 bits for the fovea center movement) through the motion codec. This strictly controls the parameter space size for both motion and eccentricity coordinates, since the motion-to-eccentricity mapping has an infinite parameter space when the problem scales up. Similarly, LIWC indexes the eccentricity with a set of integer delta tags (-5 to 5 degrees) for each motion entry. During the eccentricity selection, LIWC looks up the table entry with the closest latency gradient offset in the motion-to-eccentricity mapping table, based on the motion index and the estimated latency difference between local and remote rendering. After the selected delta eccentricity is applied, the runtime updater monitors the realtime measured latency and the change of FPS to update the latency gradient offset online with a reward function (gradient = (1 − α) · gradient′ + α · Δlatency, where α represents the reward parameter and gradient′ represents the previous latency gradient). It also updates the network throughput and GPU performance for further latency prediction. The table and parameter update phase is executed in parallel with composition and display to minimize the overall rendering latency.
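The behavioral sketch below strings the pieces together: the Eq. (2) latency estimates from intermediate hardware counters, the motion-indexed lookup of a delta eccentricity, and the reward update of the stored latency gradient. Class and field names, table sizes, and the constants are illustrative assumptions, not the hardware implementation.

```python
# Behavioral sketch of LIWC (illustrative, not the RTL).

class LIWC:
    def __init__(self, alpha=0.3):
        self.alpha = alpha              # reward parameter of the gradient update
        self.table = {}                 # (motion_index, delta_e) -> latency gradient offset
        self.gpu_rate = 50e6            # triangles/s the local GPU sustains (placeholder)
        self.throughput = 200e6 / 8     # bytes/s, e.g., a 200 Mbps Wi-Fi downlink

    def predict(self, triangles, fovea_fraction, periphery_bytes):
        """Eq. (2): T_local from the triangle count in the fovea, T_remote from the data size."""
        t_local = triangles * fovea_fraction / self.gpu_rate
        t_remote = periphery_bytes / self.throughput
        return t_local, t_remote

    def select_delta_e(self, motion_index, t_local, t_remote):
        """Pick the delta eccentricity whose stored gradient is closest to the current imbalance."""
        gap = t_remote - t_local
        candidates = [(d, self.table.get((motion_index, d), 0.0)) for d in range(-5, 6)]
        return min(candidates, key=lambda c: abs(c[1] - gap))[0]

    def update(self, motion_index, delta_e, measured_delta_latency):
        """Reward update: gradient = (1 - alpha) * gradient' + alpha * delta_latency."""
        key = (motion_index, delta_e)
        old = self.table.get(key, 0.0)
        self.table[key] = (1 - self.alpha) * old + self.alpha * measured_delta_latency
```

Restricting the controller to a small set of integer deltas per motion index is what keeps the mapping table compact enough to sit in a dedicated SRAM.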
As discussed for Fig. 4-(3) in Sec. 2.3, resource contention between rendering and composition/ATW on realtime GPU resources across frames delays the critical rendering process and causes significant FPS drops. One challenge is how to conduct parallel rendering of complex scenes in Q-VR with efficient composition and ATW execution to form a low-latency collaborative rendering pipeline.

Key Design Insight: Algorithmic-level Similarity.
Fig. 10 (top) shows the traditional sequential execution of composition and ATW. To smooth the resolution gaps between layers, the original foveated rendering performs anti-aliasing by combining the pixel colors from the rendered frames of the two layers. It calculates the average pixel color using Eq. (3) (left), and then ATW fetches the composited frame from the framebuffer in GPU memory as a texture. After this, the frame is mapped into a sphere plane based on the HMD lens distortion (2D to 3D) and the coordinate reprojection map (updated to the latest motion position). During ATW, the plane frame is cut into small tiles (32x32) for SIMD execution and then fed into a specialized texture filter for bilinear filtering (Eq. (3) (right)). From the two equations in Eq. (3), we observe a key design insight: if ATW is first processed for multiple vision layers and then fed into composition (i.e., the reordering in Eq. (4)), these two filtering phases can be combined into a unified filtering process which only samples the input once. In computer graphics, the unified filtering process can be operated as trilinear filtering.
Figure 10: Comparison between the baseline sequential execution and Unified Composition and ATW (UCA).

The advantages of using a unified process include: (1) it bypasses the CPU and avoids the software overhead between kernels; (2) it breaks the fixed software execution sequence so that ATW can start processing the non-overlapping tiles (i.e., tiles that require no composition) earlier; and (3) it can be executed in parallel with the GPU for better parallelism.
Composition: X = (1/M) Σ_{i=1..M} S_i,    ATW: Y = (1/N) Σ_{i=1..N} w_i · X_i    (3)

Y = (1/N) Σ_{i=1..N} w_i · ((1/M) Σ_{j=1..M} S_ij) = (1/(MN)) Σ_{j=1..M} Σ_{i=1..N} w_i · S_ij    (4)
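A quick numeric check of the Eq. (3) to Eq. (4) reordering confirms that compositing first and then filtering gives the same result as filtering each layer first and then compositing, which is what allows UCA to fuse both passes into one trilinear-style filter (random data, illustrative only).

```python
import numpy as np

# Numeric check of the reordering that UCA exploits (Eq. 3 vs. Eq. 4).
rng = np.random.default_rng(0)
M, N = 2, 4                    # M layers to composite, N taps in the ATW filter
S = rng.random((N, M))         # S[i, j]: sample of layer j at filter tap i
w = rng.random(N)              # ATW filter weights

# Eq. (3): composite first (average over layers), then apply the ATW filter.
X = S.mean(axis=1)                       # X_i = (1/M) * sum_j S_ij
y_sequential = (w * X).sum() / N

# Eq. (4): filter every layer's samples first, then average over layers.
y_unified = (w[:, None] * S).sum() / (M * N)

print(np.isclose(y_sequential, y_unified))   # True
```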
Architecture design. Due to the algorithmic-level similarity between ATW and composition, we propose to use a single Unified Composition and ATW (UCA) process to replace the two independent computation paths by combining ATW with the unique fovea composition, and by asynchronously executing them across frame tiles prior to rendering completion (Fig. 4-(C)). Fig. 10 (bottom) shows the execution pipeline of the proposed architecture. Unlike the original VR pipeline, which separates the frame composition and re-projection, the new unified kernel reorders the filtering stages (i.e., first processing ATW for multiple vision layers and then feeding them into composition) and combines them into a trilinear filter with the same inputs as the original foveated composition. UCA can also leverage the previous frame's layers to artificially reconstruct an updated frame at the new position, as the original ATW outputs. This helps fill in dropped frames and avoid coordination errors between the two layers.

At the hardware level, we implement UCA as a separate hardware unit on the SoC to eliminate the possible large and bursty latency caused by GPU resource contention. We reuse some of the logic units from the state-of-the-art ATW designs [5, 32, 65] for lens distortion translation, coordinate mapping and filtering. The UCA unit mainly consists of two microarchitecture components: 4 MULs for lens distortion and 8 SIMD4 FPUs for coordinate mapping and filtering. Fig. 11 shows the architecture diagram of the proposed UCA. By monitoring the video stream and framebuffer signals, UCA can detect whether the data is ready in DRAM. When the data is ready, UCA acquires the motion information from the HMD sensors and processes lens distortion and coordinate mapping as in the normal ATW procedure. Then, it checks whether the block belongs to the border between the two layers. For border tiles, UCA performs a single trilinear filtering pass as in Eq. (4) and sends the results back to the framebuffer. For non-overlapping tiles, UCA directly processes them via bilinear filtering to generate the final pixel color.
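The per-tile control flow can be summarized with the sketch below (hypothetical helper names and simplified filters; the real unit operates on the texture-filtering hardware): tiles on a layer border take the fused trilinear path of Eq. (4), while non-overlapping tiles only need the plain bilinear ATW path.

```python
import numpy as np

# Simplified per-tile decision in UCA (illustrative stand-ins for the texture unit).

def bilinear(samples, w):
    """ATW-only path for non-overlapping tiles: filter a single layer's samples."""
    return (w[:, None, None] * samples).sum(axis=0) / len(w)

def trilinear(layer_samples, w):
    """Fused composition + ATW path (Eq. 4) for tiles on the border of two layers."""
    filtered = [bilinear(s, w) for s in layer_samples]
    return sum(filtered) / len(filtered)

def uca_tile(layer_samples, w, on_layer_border):
    if on_layer_border:
        return trilinear(layer_samples, w)     # needs both layers' samples
    return bilinear(layer_samples[0], w)       # tile covered by a single layer

w = np.full(4, 0.25)                                     # 4 filter taps
layers = [np.random.rand(4, 32, 32) for _ in range(2)]   # two vision layers, one 32x32 tile
print(uca_tile(layers, w, on_layer_border=True).shape)   # (32, 32)
print(uca_tile(layers, w, on_layer_border=False).shape)  # (32, 32)
```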
Figure 11: Architecture diagram of UCA.

Table 2: Baseline Configuration
Mobile VR System
  GPU frequency: 500 MHz
  Number of Unified Shaders: 8
  Shader Configuration: 8 SIMD4-scale ALUs, 16 KB unified L1 cache, 1 texture unit
  Texture Filtering Configuration: 4x Anisotropic Filtering
  Raster Engine: 16x16 tiled rasterization
  L2 Cache: 256 KB in total, 8-way
  DRAM Bandwidth: 16 bytes/cycle, 8 channels
Unified Composition and ATW Unit
  Frequency: 500 MHz
  Count: 2
Remote GPU
  GPU Configuration: Multi-GPU system as in [64]
Network Throughput (Download Speed)
  Wi-Fi: 200 Mbps
  4G LTE: 100 Mbps
  Early 5G: 500 Mbps
We use McPAT to evaluate the area and power overhead of our proposed architecture. For LIWC, the SRAM table dominates its area and power cost. Due to our cost-effective design, its memory depth can be as small as 2^15 entries. We use a 16-bit half-precision floating-point number to represent the latency gradient offset, so the size of the table is estimated at approximately 64 KB, which incurs 0.66 mm² of area overhead and a maximum of 25 mW of power overhead under the default 500 MHz core frequency and 45 nm technology node in McPAT [33]. For UCA, we reference previous works [32, 65] to map the logic units to the hardware architecture. The McPAT results show that a single UCA occupies an area of 1.6 mm² and consumes 94 mW of runtime power at 500 MHz. For the latency overhead, since we formulate our eccentricity selection as a lightweight table lookup, the computation in the latency prediction and parameter updating is quite simple. We estimate the latency per frame can be as low as the nanosecond level; thus, LIWC's latency overhead can be completely hidden. Additionally, we implement UCA as a texture mapping unit on a cycle-level mobile GPU simulator. Under the default configuration (Sec. 5), the latency to process one 32x32-pixel block in UCA can be as low as 532 cycles. With 2 UCAs operating at 500 MHz, we are able to achieve sufficient performance for realtime VR.
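For reference, the table-size arithmetic above works out as follows (assuming the stated 16-bit entries).

```python
# Back-of-the-envelope check of the LIWC mapping-table size quoted above.
entry_bits = 16                        # half-precision latency gradient offset
table_bytes = 64 * 1024                # ~64 KB SRAM
print(table_bytes * 8 // entry_bits)   # 32768 = 2**15 motion/eccentricity entries
```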
Table 3: Benchmarks

Names    | Library      | Resolution |
Doom3-H  | OpenGL [45]  | 1920x2160  | 382
Doom3-L  | OpenGL       | 1280x1600  | 382
HL2-H    | DirectX [39] | 1920x2160  | 656
HL2-L    | DirectX      | 1280x1600  | 656
GRID     | DirectX      | 1920x2160  | 3680
UT3      | DirectX      | 1920x2160  | 1752
Wolf     | DirectX      | 1920x2160  | 3394
Evaluation Environment.
To model the proposed Q-VR software layer and hardware design, we use validation methods similar to previous work [62-64] on a modified ATTILA-sim [4], a cycle-accurate rasterization-based GPU rendering simulator which covers a wide spectrum of modern graphics features. Specifically, for the rendering pipeline, we implement a simultaneous multi-projection engine in ATTILA-sim to support two-eye VR rendering and reconfigure it by referencing the ARM Mali-G76 [10], a state-of-the-art high-end mobile GPU. Following the design details from Section 3, we separately implement the client and server versions of our Q-VR framework in ATTILA-sim by modifying the GPU driver and the command processor. The added architecture blocks, including LIWC and UCA, are implemented as a rendering task dispatcher and a post-processor, respectively, and are integrated into the rendering pipeline in ATTILA-sim. We also investigated other detailed hardware latencies (e.g., eye tracking, screen display, etc.) and integrate them into our model for an enhanced end-to-end simulation. For instance, since the eye-tracking latency is not on the critical path of the graphics pipeline (Section 7), we count 2 ms to transmit the sensed data to the rendering engine and 5 ms to display the frame on the HMD [13, 20] in the end-to-end latency.

For evaluating the network transmission latency, we leverage ffmpeg [15] to compress the output frames from the remote server and then use them to estimate the network latency under different downloading speeds. The network latency is calculated by dividing the compressed frame size by the network bandwidth. Furthermore, we insert white noise into our network channel with a 20 dB SNR (Signal-to-Noise Ratio) to better reflect reality. We validate our model against netcat [40], which is widely used in Linux backends to build communication channels, and found that our network model is able to reflect real communication channels to a great extent. For the remote server side, we implement a future chiplet-based multi-GPU design that can scale up to 8 MCM GPUs (similar to that in [64]) to enable high-performance parallel rendering. Table 2 lists the simulation configuration and network throughput used in our evaluation. We choose 500 MHz and Wi-Fi as our default GPU core frequency and network condition, respectively.
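As a concrete instance of that latency calculation, the snippet below divides a hypothetical compressed periphery frame by the three downlink speeds of Table 2; the 600 KB frame size is an assumed example, not a measured value.

```python
# Network latency = compressed frame size / downlink throughput.
LINKS_MBPS = {"Wi-Fi": 200, "4G LTE": 100, "Early 5G": 500}   # Table 2 download speeds

def transmit_latency_ms(frame_bytes, link_mbps):
    return frame_bytes * 8 / (link_mbps * 1e6) * 1000

for name, mbps in LINKS_MBPS.items():
    print(name, round(transmit_latency_ms(600 * 1024, mbps), 1), "ms")
# Wi-Fi 24.6 ms, 4G LTE 49.2 ms, Early 5G 9.8 ms
```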
Figure 12: The normalized performance improvement of different designs under the default hardware and network conditions. The results are normalized to the traditional local rendering design in today's mobile VR devices.
Benchmarks:
Table 3 lists the set of gaming benchmarks employed to evaluate Q-VR. We employ five well-known 3D games from ATTILA-sim, which are well supported by our simulator. The benchmark set covers different rendering libraries and 3D gaming engines [39, 45]. Although the graphics API traces can be directly used for evaluation, we adjust the per-eye frame resolution to match the setting of our VR HMD. To better understand the effectiveness of Q-VR, two benchmarks (Doom3 and Half-Life 2) are rendered at both low and high resolutions (1280x1600 and 1920x2160), while for the others (UT3, GRID and Wolf), 1920x2160 is adopted as the baseline resolution.

We first estimate the performance improvement of Q-VR by comparing it with several design choices under the default hardware and network condition: (i) Baseline - traditional local rendering in a commercial VR device. (ii) Static - static collaborative VR rendering which leverages the mobile GPU to render the interactive objects of frame n while prefetching the background of frame n+1 from the remote GPUs; we identify the interactive object in ATTILA-sim by comparing the depths of all rendering batches and choosing the one closest to the viewport. (iii) Fixed Foveated Rendering (FFR) - collaborative foveated rendering with a static eccentricity based on the classic MAR model (i.e., e1 = 5, the traditional fovea size), discussed in Section 3. (iv) Dynamic Foveated Rendering (DFR) - collaborative foveated rendering with only LIWC enabled. (v) Q-VR - our proposed collaborative VR rendering.
End-to-End System Latency.
Fig. 12 shows the normalized speedups of different design choices over the Baseline case, the traditional local rendering in a commercial VR device. We calculate the average end-to-end system latency of each design and normalize it to the pure local rendering case. From the figure, we make the following observations. First, naively partitioning the fovea and periphery areas in the FFR design achieves approximately 1.75x and 52% performance improvement on average, and up to 5.6x and 1.4x, over the Baseline and Static designs, respectively.
This is because, even with a fixed fovea area, Q-VR's software framework is able to reduce a certain amount of the data transmitted back from the server via resolution approximation. Second, the speedup of FFR can be limited by the network latency: we observe that for most of the benchmarks, the network latency is much higher than the local rendering latency under the FFR design. In other words, the latency balance is not reached. Third, by leveraging our LIWC design, DFR is able to reach a more balanced state: it achieves an average of 1.1x speedup over FFR. Finally, by leveraging UCA to further extend the accelerator-level parallelism over FFR, Q-VR outperforms the others and achieves an average of 3.4x speedup (up to 6.7x) over Baseline.

Figure 13: The normalized transmitted data size and resolution reduction of different designs under the default hardware and network conditions. The results are normalized to the remote rendering design in commercial cloud servers.
Frame Rate.
We also compare the frame rate (FPS) achieved by a pure software design and by our proposed software-hardware co-design. We build the pure software implementation of Q-VR by selecting the eccentricity based on the previous frame's local and remote rendering latencies instead of using the intermediate hardware data (e.g., FPS = min(1/T_GPU, 1/T_network)). The results demonstrate that Q-VR outperforms the static collaboration design and the software implementation by 4.1x and 2.8x, respectively. First, Q-VR achieves better latency balancing than the pure software design by leveraging the intermediate hardware data to quickly and accurately predict the best eccentricity. Additionally, by detaching the ATW and composition processes from GPU core execution, Q-VR increases GPU utilization for rendering and better exploits multi-accelerator-level parallelism.

Network Transmission.
Fig. 13 shows the normalized transmitted data size and resolution reduction of different designs under the default hardware and network conditions. The results are normalized to the remote rendering design in commercial cloud servers. From the figure, we observe that the static approach does not reduce the actual transmitted data size; instead, it prefetches the backgrounds to hide the network latency. Compared to the static collaborative rendering, Q-VR achieves an average transmitted data reduction of 85% by adopting optimal foveal sizes at runtime and reducing the periphery-area resolutions. Regarding the overall resolution reduction, Q-VR achieves an average of 41% reduction over the original frame. We want to emphasize that the transmitted data size reduction does not only originate from resolution reduction; it also comes from correctly adjusting the central fovea workload on the local hardware based on different realtime constraints. For example, Q-VR reduces the transmitted data size of Doom3-L by 96% with only a 7% resolution reduction. Since Doom3-L is the lightest workload in our experiments, most of its rendering work is executed locally.
Figure 14: The Latency Ratios and FPS across 300 Frames.Table 4: Best Eccentricity Under Different Configurations
Freq. Net. Benchmarks
D3H D3L H2H H2L GD NFS WF500MHz Wi-Fi 46.4 85.3 27.4 33.2 9.9 27.2 15.34G LTE 74.5 90 42.2 44.3 22.1 39.1 25.7Early 5G 22.4 45.2 11.3 14.3 5 10.9 8.6400MHz Wi-Fi 34.5 77.3 23.1 26.1 7.8 22.5 13.24G LTE 64.3 90 34.5 39.2 15.5 32.4 18.5Early 5G 15.3 30.2 7.8 11.5 5 7.4 6.1300MHz Wi-Fi 27.5 65.4 16.4 24.5 6.5 14.3 11.34G LTE 43.2 90 30.2 35.1 12.4 27.2 16.4Early 5G 13.1 27.1 6.9 8.3 5 6.1 5 size reduction does not only originate from resolution reduction;it also comes from correctly adjusting the central fovea workloadon the local hardware based on different realtime constraints. Forexample, Q-VR reduces 96% transmitted data size for Doom3-L with7% resolution reduction. Since Doom3-L is the lightest workload inour experiments, most of the rendering work is executed locally.
To evaluate whether Q-VR can help the rendering pipeline quickly reach the balanced local-remote latency state under different user inputs and environmental constraints, we calculate the latency ratio ($T_{remote}/T_{local}$) for each frame during a game execution, as shown in Fig. 14-(a). We initiate Q-VR with a small initial eccentricity under the default hardware and network condition. From the figure, we observe that the latency ratios are quite high during the first several frames. This is because the relatively small eccentricity lets the local hardware render quite fast, so the network latency becomes the primary bottleneck and causes local-remote latency imbalance. The figure also demonstrates that Q-VR can gradually locate the balanced eccentricity and reach the best rendering efficiency after a very short period of time. Finally, Fig. 14-(b) shows that Q-VR maintains a very high FPS for all benchmarks, satisfying the high-quality VR requirement (>90 Hz).
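The convergence behavior described above can be pictured with a minimal sketch of a latency-ratio feedback loop, assuming a simple linear latency model and a fixed adjustment step; this illustrates only the balancing idea, not the actual LIWC policy.

```python
# Hedged sketch of the latency-balancing behavior in Fig. 14-(a): grow the
# eccentricity e when the remote path is the bottleneck (ratio > 1) and shrink
# it otherwise, until T_remote / T_local approaches 1. The linear latency model
# and step size are illustrative assumptions, not the LIWC design.

def simulate_balance(e=5.0, frames=30, step=2.0):
    for frame in range(frames):
        t_local = 2.0 + 0.25 * e          # local latency grows with fovea size (assumed)
        t_remote = 16.0 - 0.30 * e        # remote path shrinks as more work stays local (assumed)
        ratio = t_remote / t_local
        print(f"frame {frame:2d}: e={e:5.1f}  ratio={ratio:4.2f}")
        e += step if ratio > 1.0 else -step   # move e toward the balanced point
        e = max(1.0, min(e, 45.0))            # clamp to a plausible eccentricity range

simulate_balance()
```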
Eccentricity Selection Under Different Configurations.
Table 4 shows the average eccentricity (i.e., the $e$ radius value) selected by Q-VR across different applications and hardware/network conditions. We start recording the eccentricity of each frame after Q-VR reaches a steady state and then calculate the average. Note that scene complexity can dynamically change from frame to frame.
From the table, we observe that the average eccentricity can be quite different under different configurations. For example, under the default GPU frequency and Wi-Fi, Doom3-L has a much bigger $e$ than GRID. This is because GRID has more complex scenes than Doom3-L and requires longer rendering time; thus, Q-VR keeps the eccentricity small and gives more workload to the remote GPUs to balance local-remote latency for the best rendering performance. A similar situation occurs when increasing the network throughput or reducing the GPU frequency. Note that the underlined parameters indicate combinations that cannot reach the desired FPS. The table also indicates that Q-VR can accommodate a wide range of hardware, network, and scene-complexity conditions.

Figure 15: The normalized energy efficiency of Q-VR under different hardware and network conditions.

System Energy Sensitivity.
Fig. 15 shows the normalized energy efficiency of Q-VR under different hardware and network conditions, focusing on the GPU as the predominant energy consumer on mobile systems. We estimate the network module power by referring to previous works [23, 25], and we also count the energy consumption of UCA and LIWC in the total energy consumption of Q-VR. We then estimate the energy efficiency of Q-VR by normalizing its energy consumption to that of traditional local rendering in a commercial VR device. The figure shows that Q-VR achieves an average of 73% energy reduction over purely local rendering even though collaborative rendering incurs network overhead. This is because the local mobile hardware in Q-VR only processes the most critical fovea area at high resolution instead of the entire frame. We also observe that, in general, increasing the network throughput improves the energy efficiency of Q-VR: Q-VR achieves better performance under high network bandwidth, while the power consumption of the network module is typically less critical than that of the local GPU. Additionally, reducing the GPU frequency does not always increase the energy benefit due to the GPU's larger dynamic energy consumption.
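For clarity, the normalization used in Fig. 15 can be summarized as below, assuming the total Q-VR energy is the sum of the local GPU, network module, UCA, and LIWC contributions; all numeric values here are hypothetical placeholders, not measurements.

```python
# Hedged sketch of the energy normalization in Fig. 15: Q-VR's total energy
# (local GPU + network module + UCA + LIWC) divided by the energy of the
# local-only baseline. All component values are hypothetical placeholders.

def normalized_energy(e_gpu_mj, e_network_mj, e_uca_mj, e_liwc_mj, e_baseline_mj):
    return (e_gpu_mj + e_network_mj + e_uca_mj + e_liwc_mj) / e_baseline_mj

# Example: fovea-only local rendering plus network/accelerator overheads,
# compared against rendering the full frame locally.
print(normalized_energy(e_gpu_mj=30.0, e_network_mj=8.0,
                        e_uca_mj=1.5, e_liwc_mj=0.5, e_baseline_mj=150.0))  # ~0.27
```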
Eye-Tracking Performance and Accuracy.
In our work, we estimate the performance of the eye-tracking system based on publicly available specifications of state-of-the-art eye trackers, as implemented in the HTC VIVE Pro Eye [53] and the Tobii Pico Neo 2 Eye [54]. The latest eye-tracking systems reach a refresh rate of 120 Hz and a detection accuracy of under 1 degree. As mentioned in Fig. 2 in Sec. 2.1, like motion sensors, state-of-the-art eye trackers operate in parallel with the graphics pipeline at their own frequencies [13, 20, 53]. Thus, the actual eye-tracking latency is not on the critical path of the graphics pipeline, which is our focus. In this work, we count a sensor-data transmission latency (i.e., around 2 ms [13, 20]) in the end-to-end latency discussion. Due to the proprietary nature of eye-tracking chip designs, it is hard to estimate their standalone energy. Since eye tracking has been widely integrated into current mobile VR SoCs such as Snapdragon [53], we believe its energy consumption is acceptable for modern VR applications.
Design Choice of LIWC.
Since dynamic fovea selection is on the critical path of the Q-VR pipeline, it requires low latency as well as online learning capability to quickly identify the balanced point under different realtime constraints. To make this design choice, we investigated several research-based and commercial DNN accelerators [8, 19, 40, 61]. We found that some of them [40, 61] are too power hungry for mobile VR systems, while the others, e.g., the Google Coral Edge TPU [19] and Eyeriss [8], cannot provide the required performance. For example, Google Coral Edge TPUs need 10-20 ms to process a DNN inference, and the training process has to rely on high-end GPUs. To this end, we propose a lightweight realtime Q-Learning based approach, LIWC, which maps user inputs to scene complexity using an online-updated lookup table. To match the design goals, we drastically simplify the fine-grained tuning space of the original Q-learning by indexing the motion information and eccentricity as limited delta tags, greatly reducing its latency, power, and design complexity.
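To make the idea concrete, the sketch below shows a plain tabular Q-learning update over coarse (motion, eccentricity) indices in the spirit of LIWC's online-updated lookup table; the state discretization, reward definition, and hyperparameters are illustrative assumptions rather than the actual LIWC hardware design.

```python
# Hedged sketch of a lightweight tabular Q-learning update, in the spirit of
# LIWC's online-updated lookup table indexed by coarse motion and eccentricity
# tags. The discretization, reward, and hyperparameters are illustrative
# assumptions, not the actual LIWC hardware design.
import random
from collections import defaultdict

ACTIONS = (-5, 0, +5)                      # assumed eccentricity-delta actions (degrees)
q_table = defaultdict(lambda: [0.0] * len(ACTIONS))
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1      # learning rate, discount, exploration rate

def choose_action(state):
    if random.random() < EPSILON:          # occasional exploration
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_table[state][a])

def update(state, action, reward, next_state):
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (reward + GAMMA * best_next - q_table[state][action])

def reward_from_latency(t_local_ms, t_remote_ms):
    # Reward balanced local/remote latency (ratio close to 1).
    return -abs(t_remote_ms / t_local_ms - 1.0)

# One illustrative step: the state is a coarse (motion tag, eccentricity) pair.
state = ("fast_motion", 20)
action = choose_action(state)
update(state, action, reward_from_latency(6.0, 9.0), ("fast_motion", 20 + ACTIONS[action]))
```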
Studies On Foveation Effects.
Since the foveation effect of the human visual system can provide significant workload reduction without affecting user experience, it has been studied in various aspects such as foveated compression [58, 59] and foveated rendering [1, 20, 38, 43, 51, 60]. Several works [1, 20, 38] also conduct user surveys on user perception of foveated rendering. Our work follows their suggestions to constrain resolution manipulation for the periphery layers to guarantee user perception in the Q-VR design. Recent work also employs the foveation effect to reconstruct low-resolution images for VR/AR displays using neural networks [24, 28]. Compared with them, our work coordinates mobile hardware and network resources to improve the performance of the mobile VR system. By exploiting accelerator-level parallelism, the expanded high-quality fovea area is rendered promptly while the resolution of the periphery area is reduced to save overall data transmission.
Collaborative Computing.
There have been several works [7, 21, 26, 30, 31, 64] that improve system performance by allocating part of the workload to multiple accelerators. In the computer graphics domain, prior works [7, 31, 36] have either enabled a caching mechanism to store pre-rendered scenes [7] or employed static collaborative rendering between dynamic objects and the background [31]. We provide a detailed discussion of their issues with complex modern VR applications in Section 2.1. In contrast, Q-VR provides desirable Quality of Experience (QoE) for a wide range of VR applications, hardware, and network conditions by effectively leveraging the computing capability of the increasingly powerful hardware of both mobile systems and cloud servers. In the general-purpose application domain, Neurosurgeon [26] profiles the computing latency and data size of DNN layers and uses this information to identify the best static partition point. Gables [21] refines the roofline model to estimate collaborative computing performance among multiple accelerators on a mobile SoC.
CONCLUSION
Looking into the future, state-of-the-art mobile VR rendering strategies will find it increasingly difficult to satisfy the realtime constraints of processing high-quality VR applications. In this work, we provide a novel software-hardware co-design solution, named Q-VR, to enable future low-latency high-quality mobile VR systems. Specifically, the software-level design of Q-VR leverages human visual effects to translate a difficult global collaborative rendering problem into a workable scope, while the hardware design enables a low-cost local-remote latency balancing mechanism and deeper pipeline optimizations. Evaluation results show that Q-VR achieves an average end-to-end performance speedup of 2.2X (up to 3.1X) and a frame rate improvement over the state-of-the-art static collaborative VR designs.
ACKNOWLEDGMENT
This research is partially supported by University of Sydney faculty startup funding, Australia Research Council (ARC) Discovery Project DP210101984, and a Facebook Faculty Award. This research is also partially supported by the U.S. DOE Office of Science, Office of Advanced Scientific Computing Research, under the CENATE project (award No. 66150); the Pacific Northwest National Laboratory is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.
REFERENCES
[1] Rachel Albert, Anjul Patney, David Luebke, and Joohwan Kim. 2017. Latency requirements for foveated rendering in virtual reality. ACM Transactions on Applied Perception (TAP), 231–241.
[5] Dean Beeler and Anuj Gosalia. 2016. Asynchronous Time Warp On Oculus Rift. https://developer.oculus.com/blog/asynchronous-timewarp-on-oculus-rift/.
[6] Praveen Bhaniramka, Philippe CD Robert, and Stefan Eilemann. 2005. OpenGL Multipipe SDK: A toolkit for scalable parallel rendering. In VIS 05. IEEE Visualization, 2005. IEEE, 119–126.
[7] Kevin Boos, David Chu, and Eduardo Cuervo. 2016. FlashBack: Immersive Virtual Reality on Mobile Devices via Rendering Memoization. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (Singapore, Singapore) (MobiSys '16). ACM, New York, NY, USA, 291–304. https://doi.org/10.1145/2906388.2906418
[8] Y. Chen, J. Emer, and V. Sze. 2016. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. IEEE Computer Society, Los Alamitos, CA, USA, 367–379. https://doi.org/10.1109/ISCA.2016.40
[9] Eduardo Cuervo and David Chu. 2016. Poster: mobile virtual reality for head-mounted displays with interactive streaming video and likelihood-based foveation. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services Companion. ACM, 130–130.
[10] ARM Developer. 2017. Mali-G76 High Performance GPU. https://developer.arm.com/ip-products/graphics-and-multimedia/mali-gpus/mali-g76-gpu.
[11] Stefan Eilemann, Maxim Makhinya, and Renato Pajarola. 2009. Equalizer: A scalable parallel rendering framework. IEEE Transactions on Visualization and Computer Graphics 15, 3, 436–452.
[12] S. Eilemann, D. Steiner, and R. Pajarola. 2020. Equalizer 2–Convergence of a Parallel Rendering Framework. IEEE Transactions on Visualization and Computer Graphics 26, 2, 1292–1307.
[13] M. S. Elbamby, C. Perfecto, M. Bennis, and K. Doppler. 2018. Toward Low-Latency and Ultra-Reliable Virtual Reality. IEEE Network 32, 2, 78–84.
[14] Daniel Evangelakos and Michael Mara. 2016. Extended TimeWarp Latency Compensation for Virtual Reality. In Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (Redmond, Washington) (I3D '16).
[20] Brian Guenter, Mark Finch, Steven Drucker, Desney Tan, and John Snyder. 2012. Foveated 3D graphics. ACM Trans. Graph. 31, 6, Article 164, 10 pages. https://doi.org/10.1145/2366145.2366183
[21] Mark Hill and Vijay Janapa Reddi. 2019. Gables: A roofline model for mobile SoCs. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[24] Haomiao Jiang, Rohit Rao Padebettu, Kazuki Sakamoto, and Behnam Bastani. 2019. Architecture of Integrated Machine Learning in Low Latency Mobile VR Graphics Pipeline. In SIGGRAPH Asia 2019 Technical Briefs (Brisbane, QLD, Australia) (SA '19). Association for Computing Machinery, New York, NY, USA, 41–44. https://doi.org/10.1145/3355088.3365154
[25] Tianxing Jin, Songtao He, and Yunxin Liu. 2015. Towards accurate GPU power modeling for smartphones. In Proceedings of the 2nd Workshop on Mobile Gaming. 7–11.
[26] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In ACM SIGARCH Computer Architecture News, Vol. 45. ACM, 615–629.
[27] David Kanter. 2015. Graphics processing requirements for enabling immersive VR. In AMD White Paper.
[28] Anton S Kaplanyan, Anton Sochenov, Thomas Leimkühler, Mikhail Okunev, Todd Goodall, and Gizem Rufo. 2019. DeepFovea: neural reconstruction for foveated rendering and video compression using learned statistics of natural videos. ACM Transactions on Graphics (TOG) 38, 6, 212.
[29] Juno Kim, Matthew Moroz, Benjamin Arcioni, and Stephen Palmisano. 2018. Effects of Head-Display Lag on Presence in the Oculus Rift. In Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology (Tokyo, Japan) (VRST '18). Association for Computing Machinery, New York, NY, USA, Article 83, 2 pages. https://doi.org/10.1145/3281505.3281607
[30] Youngsok Kim, Jae-Eon Jo, Hanhwi Jang, Minsoo Rhu, Hanjun Kim, and Jangwoo Kim. 2017. GPUpd: a fast and scalable multi-GPU architecture using cooperative projection and distribution. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 574–586.
[31] Zeqi Lai, Y. Charlie Hu, Yong Cui, Linhui Sun, and Ningwei Dai. 2017. Furion: Engineering High-Quality Immersive Virtual Reality on Today's Mobile Devices. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking (Snowbird, Utah, USA) (MobiCom '17). ACM, New York, NY, USA, 409–421. https://doi.org/10.1145/3117811.3117815
[32] Yue Leng, Chi-Chun Chen, Qiuyue Sun, Jian Huang, and Yuhao Zhu. 2019. Energy-efficient video processing for virtual reality. In Proceedings of the 46th International Symposium on Computer Architecture. 91–103.
[33] Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and Norman P Jouppi. 2009. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 469–480.
[34] Luyang Liu, Ruiguang Zhong, Wuyang Zhang, Yunxin Liu, Jiansong Zhang, Lintao Zhang, and Marco Gruteser. 2018. Cutting the Cord: Designing a High-Quality Untethered VR System with Low Latency Remote Rendering. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services (Munich, Germany) (MobiSys '18). Association for Computing Machinery, New York, NY, USA, 68–80. https://doi.org/10.1145/3210240.3210313
[35] Xing Liu, Christina Vlachou, Feng Qian, Chendong Wang, and Kyu-Han Kim. 2020. Firefly: Untethered Multi-user VR for Commodity Mobile Devices.
[36] In Proceedings of the Workshop on Virtual Reality and Augmented Reality Network (Los Angeles, CA, USA) (VR/AR Network '17). ACM, New York, NY, USA, 30–35. https://doi.org/10.1145/3097895.3097901
[37] Jiayi Meng, Sibendu Paul, and Y. Charlie Hu. 2020. Coterie: Exploiting Frame Similarity to Enable High-Quality Multiplayer VR on Commodity Mobile Devices. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 923–937. https://doi.org/10.1145/3373376.3378516
[38] Xiaoxu Meng, Ruofei Du, Matthias Zwicker, and Amitabh Varshney. 2018. Kernel foveated rendering. Proceedings of the ACM on Computer Graphics and Interactive Techniques 1, 1, 5.
[39] Microsoft. 2017. Direct3D. https://msdn.microsoft.com/en-us/library/windows/desktop/bb219837(v=vs.85).aspx.
[40] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR.
[47] Anjul Patney, Marco Salvi, Joohwan Kim, Anton Kaplanyan, Chris Wyman, Nir Benty, David Luebke, and Aaron Lefohn. 2016. Towards foveated rendering for gaze-tracked virtual reality. ACM Transactions on Graphics (TOG).
[51] Computer Graphics Forum, Vol. 35. Wiley Online Library, 129–139.
[52] Hans Strasburger, Ingo Rentschler, and Martin Jüttner. 2011. Peripheral vision and pattern recognition: A review. Journal of Vision 11, 5, 13–13.
[53] tobii. 2018. HTC VIVE Pro Eye. https://vr.tobii.com/sdk/products/htc-vive-pro-eye/.
[54] tobii. 2018. tobii Pico Neo 2. https://vr.tobii.com/sdk/develop/unity/getting-started/pico-neo-2-eye/.
[55] Unity. 2018. Nature. https://assetstore.unity.com/publishers/13640.
[56] Unity. 2018. Viking Village. https://assetstore.unity.com/packages/essentials/tutorial-projects/viking-village-29140.
[57] JMP Van Waveren. 2016. The asynchronous time warp for virtual reality on consumer hardware. In Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology. ACM, 37–46.
[58] Zhou Wang, Alan C Bovik, and Ligang Lu. 2001. Wavelet-based foveated image quality measurement for region of interest image coding. In Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), Vol. 2. IEEE, 89–92.
[59] Zhou Wang, Alan Conrad Bovik, Ligang Lu, and Jack L Kouloheris. 2001. Foveated wavelet image quality index. In Applications of Digital Image Processing XXIV, Vol. 4472. International Society for Optics and Photonics, 42–52.
[60] Martin Weier, Thorsten Roth, Ernst Kruijff, André Hinkenjann, Arsène Pérard-Gayot, Philipp Slusallek, and Yongmin Li. 2016. Foveated real-time ray tracing for head-mounted displays. In Computer Graphics Forum, Vol. 35. Wiley Online Library, 289–298.
[61] Lei Xiao, Anton Kaplanyan, Alexander Fix, Matt Chapman, and Douglas Lanman. 2018. DeepFocus: Learned Image Synthesis for Computational Display. In ACM SIGGRAPH 2018 Talks (Vancouver, British Columbia, Canada) (SIGGRAPH '18). ACM, New York, NY, USA, Article 4, 2 pages. https://doi.org/10.1145/3214745.3214769
[62] Chenhao Xie, Xin Fu, and Shuaiwen Song. 2018. Perception-Oriented 3D Rendering Approximation for Modern Graphics Processors. 362–374. https://doi.org/10.1109/HPCA.2018.00039
[63] Chenhao Xie, Shuaiwen Leon Song, Jing Wang, Weigong Zhang, and Xin Fu. 2017. Processing-in-Memory Enabled Graphics Processors for 3D Rendering. 637–648. https://doi.org/10.1109/HPCA.2017.37
[64] Chenhao Xie, Fu Xin, Mingsong Chen, and Shuaiwen Leon Song. 2019. OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework for Future NUMA-based multi-GPU Systems. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA '19). ACM, New York, NY, USA, 53–65. https://doi.org/10.1145/3307650.3322247
[65] Chenhao Xie, Xingyao Zhang, Ang Li, Xin Fu, and Shuaiwen Leon Song. 2019. PIM-VR: Erasing Motion Anomalies In Highly-Interactive Virtual Reality World With Customized Memory Cube. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).