A Survey on 360-Degree Video: Coding, Quality of Experience and Streaming
Federico Chiariotti∗
Aalborg University, Fredrik Bajers Vej 7C, 9220 Aalborg, Denmark
∗ Corresponding author. Email address: [email protected]
Abstract
The commercialization of Virtual Reality (VR) headsets has made immersive and 360° video streaming the subject of intense interest in the industry and research communities. While the basic principles of video streaming are the same, immersive video presents a set of specific challenges that need to be addressed. In this survey, we present the latest developments in the relevant literature on four of the most important ones: (i) omnidirectional video coding and compression, (ii) subjective and objective Quality of Experience (QoE) and the factors that can affect it, (iii) saliency measurement and Field of View (FoV) prediction, and (iv) the adaptive streaming of immersive 360° videos. The final objective of the survey is to provide an overview of the research on all the elements of an immersive video streaming system, giving the reader an understanding of their interplay and performance.

Keywords: Video streaming, Virtual Reality, Quality of Experience
1. Introduction
Over the past few years, the commercialization of Virtual Reality (VR) headsets and cheaper systems using smartphones as viewports [1] have fueled a strong research interest in 360° immersive videos, and the technology is currently undergoing standardization [2]. Commercial Head-Mounted Displays (HMDs) are currently being sold by multiple companies, and the artistic potential of the new medium is being explored for both gaming and movies.

This kind of technology has the potential to make video a more intense experience, with a stronger emotional impact [3], thanks to the wider Field of View (FoV) and the direct user control of the viewing direction. 360° videos also have a huge potential for storytelling, as multiple story lines can be developed in parallel [4]. Immersive video also has the potential to enhance empathy and participation in news stories [5], although evidence regarding its use shows mixed results [6]. Psychological factors such as perception of embodiment [7] also affect immersiveness [8], particularly when an avatar is animated in the VR simulation [9].

Immersive video streaming presents some unique challenges [10], especially for live streaming: since the full omnidirectional view is wider than traditional video, it requires far more bandwidth to be streamed with a comparable quality. In order to reduce the throughput of 360° streams [11], tile-based solutions have become a standard: the sphere is divided into several tiles, according to a pre-defined projection scheme, and each tile can be downloaded as a separate object. In this way, clients can concentrate most of their resources on the tiles that are in the user's FoV, i.e., the ones that will actually be displayed with the highest probability, resulting in the same Quality of Experience (QoE) even if tiles outside the viewport have a very low resolution or are not downloaded at all. Naturally, this kind of solution requires an accurate prediction of where the user's gaze will fall, which is in itself a complex research topic. The design of the tiling scheme is also a significant factor in both the compression efficiency of the video coding scheme and the final QoE of the user.

Additionally, the geometric distortion [12] generated by the projection of spherical omnidirectional video onto a flat surface reduces both the accuracy of traditional QoE metrics and the efficiency of 2D video codecs. Since traditional QoE metrics are designed for planar images and videos, their direct use does not correctly represent the human perception of the video and is only loosely correlated with actual QoE. The design of projective corrections for legacy metrics and of 360-specific ones is an active area of research. Cybersickness [13] is another major problem for immersive video streaming, requiring both a more precise metric to evaluate quality variations and better streaming techniques to reduce stalling.

The distortion issue also affects automatic saliency estimation, which can help predict the FoV; even feature extraction and Convolutional Neural Networks (CNNs) [14] are affected by it, requiring ad hoc corrective methods [15].

This survey aims at providing readers with a broad overview of the state of the art in all the major research directions on omnidirectional video. We give a full perspective on the building blocks of an omnidirectional streaming system:

• In Sec. 2, we examine coding methods, discussing different standards and projections and how they can introduce different kinds of distortion and enable more efficient compression;
• In Sec. 3, we describe subjective and objective metrics to evaluate the QoE of omnidirectional videos, and why this is a complex challenge;

• In Sec. 4, we examine the question of saliency and FoV prediction. We review empirical approaches based on user behavior, analytical ones based on image features, and joint ones that consider both past viewport directions and the current image;

• In Sec. 5, we present the state of the art on omnidirectional video streaming techniques, focusing on tiling-based approaches. We also review some recent network-level innovations to provide support to omnidirectional streaming;

• In Sec. 6, we present a summarized version of the lessons learned on each topic and conclude the paper with a discussion of the open research challenges in the field.

Each section of the paper includes a discussion of the key challenges and open problems in the field.

A number of recent surveys, whose contributions are summarized in Table 1, have examined the state of the art on different topics in the field. One work [16] focuses on projection, explaining several state-of-the-art methods in detail and evaluating them on a public dataset with known quality metrics. The authors explore viewport-adaptive coding as a possible solution to the demanding bandwidth requirements of omnidirectional video, and briefly mention the possible sources of coding distortion, which are examined in detail in [17]: this work considers the steps of the encoding chain, examining how each one introduces different kinds of local and global image distortions. A more recent work [18] takes a broader view, examining the existing QoE evaluation metrics along with viewer attention models for eye and head movements, while the networking aspects of streaming, from resource allocation to caching, are reviewed in [19]. Finally, a survey focusing on system design and implementation [20] examines some of the existing systems, protocols and standards for the acquisition, compression, transmission, and display of omnidirectional videos.

These recent works only present a limited review of FoV-adaptive streaming, while our Sec. 5 reviews the existing literature more extensively. Furthermore, while all of these works concern themselves with QoE, this work is the first to provide an analysis of the existing comparisons between objective metrics, resulting in insights for further research and implementation. Finally, none of the existing surveys has a complete perspective that unifies evaluation and streaming: since the efficiency of adaptation techniques strongly depends on both the encoding techniques and the FoV, presenting them in a unified manner is important to get a full picture of the design requirements. The discussion developed in this survey takes a unified perspective, linking the later sections to the earlier ones and proposing some ideas for a holistic development of 360° streaming systems.

Table 1: Summary of the existing surveys on omnidirectional video

Survey | Year | Topic
Recent advances in omnidirectional video coding for virtual reality: Projection and evaluation [16] | 2018 | Projection
Visual Distortions in 360-degree Videos [17] | 2019 | Visual distortion
State-of-the-Art in 360° Video/Image Processing: Perception, Assessment and Compression [18] | 2020 | Saliency, QoE
Network Support for AR/VR and Immersive Video Application: A Survey [19] | 2018 | System implementation
A Survey on 360° Video Streaming: Acquisition, Transmission, and Display [20] | 2019 | Protocol design
2. Coding, compression and distortion
The efficient encoding of omnidirectional video has all the well-known issues of 2D video encoding, with an additional degree of complexity: since filters and coding tools are often based on 2D images, the spherical content needs to be projected to a flat surface to be processed and encoded. In this section, we discuss the different factors that should be considered when encoding omnidirectional video, presenting the main projection schemes and coding solutions, both in the spatial and temporal domains.
The geometric distortion issue in 360° video is the same that cartographers have faced for thousands of years when drawing maps of the Earth [21]: projecting a sphere onto a planar surface inevitably leads to some form of distortion. However, projection is not the only source of distortion, as the omnidirectional video processing pipeline can introduce it at every step [17]. The first one is the acquisition of the image: omnidirectional images and videos are usually stitched from multiple cameras [22], and this can introduce several kinds of issues at the edges. These can range from missing information and misalignment of the edges to differences in exposure and "ghosting", and are often particularly strong at the poles, which most camera systems cannot capture and which are often reconstructed in post-processing. Video can also have temporal discontinuities, such as objects appearing and disappearing or warping as they move close to the stitching areas [23]. In order to avoid smoothness issues and increase the coding efficiency, appropriate motion models that explicitly use rotation need to be used [24].

After the omnidirectional image has been acquired, it needs to be converted to a planar representation for encoding and storage. It can then be divided into tiles to allow tile-based streaming, which we will discuss in detail in Sec. 5. The warping patterns generated by the combination of the map projection and the tile edges will then interact. Consequently, the form and severity of the geometric distortion effects depend strongly on the projection and tiling scheme, which is crucial for efficient compression of omnidirectional video.

The Equirectangular Projection (ERP) [25] is the oldest, simplest, and most common projection for omnidirectional video: it is similar to the plate carrée geographic projection, as it divides the sphere of view into a number of rectangles with the same solid angle. Distortion at the poles makes the projection wasteful, as it encodes the poles with more pixels than the equator: as viewers usually focus their attention close to the equator, the poles are often outside the FoV.

The dyadic projection [26] tries to solve the pole oversampling issue by reducing the sampling for vertical angles above π from the equator, while the barrel projection [27] encodes the top and bottom quarters of the ERP as circles, reducing the number of pixels used for the two caps. The polar square projection [28, 29] is another adaptation that works like the barrel projection, but maps the poles to two squares. There are other techniques to compensate for the pole oversampling issue: the equal-area cylindrical projection [30] reduces the height of the tiles with the latitude, while the latitude adaptive approach [31] adapts the number of tiles to the latitude. The result is also known as Rhombic Mapping (RBM) [32], since the tiles are arranged in a rhombic shape, which can then be rearranged onto a rectangle. The octagonal projection [33] does the same with a rough latitude quantization, resulting in its namesake shape. Nested Polygonal Chain Mapping (NPCM) is another downsampling technique [34], which starts from the ERP output and linearly approximates the optimal sampling density.

The Cubic Mapping Projection (CMP) is the other projection to be widely adopted. It constructs a cube around the sphere [35], then projects rays outward from the center. Each ray intersects with a single point on the surfaces of both solids, resulting in the projection mapping. The CMP [36] is more efficient than the ERP in terms of compression [37], and is currently used by Facebook for omnidirectional videos [38]. A comparison between the ERP and CMP projections is shown in Fig. 1. It is easy to see that distortion at the poles is far lower, while objects close to the edges and corners of a face are more distorted. This should be intuitive, as the cube mapping approximates a sphere better close to the center of each face: this effect can be mitigated by applying equiangular mapping to the cube faces [39], or in general by adjusting the sampling to privilege the center of each face [40].

Solids with a larger number of faces, such as octahedrons [41], rhombic dodecahedrons [42], or icosahedrons [43], can reduce the effect of edges by having a lower stretch and area distortion, like the Sinusoidal Projection (SP) [47], which is an equal-area projection. However, there is a trade-off when choosing the number of faces: polyhedrons with more faces have a lower projection distortion, but a higher number of discontinuous boundaries. An example of octahedral projection is shown in Fig. 2. Other less regular projection shapes are also possible, with tiles of variable size and shape [44]. The Rotated Sphere Projection (RSP) [45] unfolds the sphere under two rotation angles and stitches the two halves like a baseball; it can be obtained from the ERP, and it can increase coding efficiency.

Finally, a more advanced approach to projection integrates content and viewer behavior in the design [48]: areas that have salient content and are often watched will be sampled at a higher rate. ClusTile [46] is another projection that uses past viewer behavior, designing a set of tiles that minimizes bandwidth requirements for past views. A framework evaluating the projections presented above was described in [14], and some results comparing the basic projections' compression efficiency and distortion with the H.264 and H.265 codecs are presented in [49], finding that the equal-area cylindrical projection outperforms both the ERP and CMP. The main projection methods presented in this section are summarized in Table 2.

Table 2: Summary of state-of-the-art projections

Projection | Geometry | Main advantages and issues
Equirectangular [25] | Each rectangle has the same solid angle | Oversampling at the poles
Dyadic [26] | Equirectangular with reduced polar sampling | Distortion at the poles
Barrel [27] | The sphere is mapped to a cylinder | Distortion at the edges
Polar square [28] | Barrel-like, mapping the poles to squares | Distortion at the poles
Equal-area cylindrical [30] | Equirectangular with latitude-dependent tile height | Reduced polar oversampling
Latitude adaptive [31] | Equirectangular with latitude-dependent number of tiles | Reduced polar oversampling
Rhombic mapping [32] | Similar to latitude adaptive, arranging tiles in a rhombus | Efficient retiling
Nested polygonal chain [34] | Downsampling from equirectangular | Reduced polar oversampling
Cubic mapping [35] | Projection from sphere to cube | Higher efficiency, lower polar distortion, edge distortion
Equiangular cubic mapping [39] | Equiangular mapping on cube faces | Reduced face edge distortion
Other solids [41, 42, 43] | Projection on solids with more faces | Lower projection distortion, higher edge distortion
Variable tile shape [44] | Tiles can be adapted to the content | Low distortion, complex encoding and decoding
Rotated sphere [45] | Baseball-like unfolding | Increased coding efficiency, low edge distortion
ClusTile [46] | Viewer behavior-based adaptive sampling | Low distortion, complex encoding and decoding

Offset projection is a concept meant to save bandwidth and exploit the available knowledge of the user's viewing direction: offset projections use more pixels to encode regions close to the predicted gaze direction, while regions at wide angles from it have a higher compression. The Truncated Square Pyramid (TSP) [50] projection constructs a truncated pyramid around the sphere, with the bottom facing the same way as the viewer. The projection is then constructed like the CMP. The construction of the solid is shown in Fig. 3, in which two truncated pyramids with different settings are shown: the one on the right has a smaller upper base, giving more relative importance and more pixels to the region facing the viewport directly. When the pyramid's upper base is very small, regions at wide angles from the user's expected gaze are encoded by very few pixels [51], with extreme compression gains. The Offset Cubic Projection (OCP) [52] adopts another way to perform offset projection.

Figure 1: Equirectangular and cubemap projection comparison. The figure was adapted from the Facebook video engineering blog: https://engineering.fb.com/video-engineering/under-the-hood-building-360-video/

Figure 2: Equirectangular and octahedral projection of the same scene. Image credits: Omar Shehata, https://omarshehata.me/
Figure 3: Truncated square pyramid projection with different settings.
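As a concrete reference for the ERP geometry discussed above, the following minimal Python sketch (our own illustration; the function and variable names are not taken from any cited work) maps a viewing direction, given as longitude and latitude, to pixel coordinates in an ERP frame and back, and computes the per-row solid-angle weight that makes the polar oversampling explicit: rows near the poles cover far less of the sphere than rows near the equator, even though they contain the same number of pixels.

import math

def sphere_to_erp(lon, lat, width, height):
    """Map a direction (longitude in [-pi, pi], latitude in [-pi/2, pi/2])
    to (column, row) pixel coordinates in an equirectangular frame."""
    u = (lon / (2 * math.pi) + 0.5) * width      # longitude maps linearly to columns
    v = (0.5 - lat / math.pi) * height           # latitude maps linearly to rows
    return u, v

def erp_to_sphere(u, v, width, height):
    """Inverse mapping: pixel coordinates back to a direction on the sphere."""
    lon = (u / width - 0.5) * 2 * math.pi
    lat = (0.5 - v / height) * math.pi
    return lon, lat

def row_weight(v, height):
    """Relative solid angle covered by a pixel in row v: the cosine of its
    latitude. This is the same weighting used, e.g., by WS-PSNR for the ERP."""
    _, lat = erp_to_sphere(0, v + 0.5, 1, height)
    return math.cos(lat)

# Rows near the equator carry most of the spherical area, the poles almost none.
height = 1080
print(row_weight(height // 2, height))  # ~1.0 at the equator
print(row_weight(0, height))            # ~0.0 at the pole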
There are a number of competing video encoding standards being developed [54]: the most popular are High Efficiency Video Coding (HEVC) [55], or H.265, and AOMedia Video 1 (AV1) [56], but the older Advanced Video Coding (AVC) [57], or H.264, is still widely used. Additionally, Versatile Video Coding (VVC) [58], the future H.266, promises to add new capabilities to the existing standards. The 2D encoding techniques in the standards are highly optimized and close to ubiquitous, and most omnidirectional streaming systems reuse the 2D coding pipelines [59]. However, all the distortion issues discussed in Sec. 2.1 do not just impact the QoE of the projected and encoded video, but also the coding efficiency. Furthermore, the resampling and interpolation steps of the encoding pipeline often cause aliasing and blurring, and if these steps are not managed carefully [60] they can also introduce visible seams and combine with the projection scheme to create distortion. While older works can get good results using custom techniques on the spherical image, often without projection [61], most of the recent literature follows the standard approach, with all its advantages and pitfalls. The decision on the representations that need to be encoded and stored [62] in a streaming system can affect the requirements on bandwidth support, server storage space and distortion.

Naturally, coding efficiency depends on the projection used, and it is possible to optimize coding for a certain projection, reducing its downsides and increasing compression performance. Since the ERP oversamples the polar regions, it is possible to use smoothing [63] or to reduce the accuracy of motion vectors and the coding block resolution [64] as a function of the latitude, increasing the coding efficiency with minimal QoE impact. Another way to compensate for this distortion is to adaptively set the Quantization Parameters (QPs), using the Weighted to Spherically Uniform PSNR (WS-PSNR) weights: regions that are less important in the metric will be encoded with a rougher compression [12]. The same optimization can be performed for other metrics, such as Sphere-based PSNR (S-PSNR) [65]. A more advanced way to set the QPs is to combine the geometric information with the saliency [66], privileging the salient areas which will be watched more often.

The ERP latitude-adaptive quantization technique is adopted in [67], combined with some steps to terminate the coding unit partitioning early in these areas, speeding up the encoding process. Early coding unit termination can also be performed in a content-dependent way [68], computing the local texture complexity. Another optimization for the ERP concerns the edges of the image: since the left and right edges are actually continuous, the coding unit parameters need to be set to avoid visible seams [69]. In [70], the region-adaptive quantization scheme is combined with an adaptive mechanism that reduces the frame rate to increase picture quality if the motion in the content is not too fast. An alternative strategy is rotation: since regions close to the equator have less distortion, interesting regions of the image with high motion and fine-grained textures can be rotated to the equator, while the less interesting regions are rotated to the poles and have more distortion [71].
This approach is extended in [72], using a CNN to predict the orientation that maximizes the achievable compression over the Group of Pictures (GoP), as both content and motion vector discontinuities can affect the compressibility.

Filters are another important concern in omnidirectional coding, as their effectiveness relies on proper adaptation to the projection. In [73], the Sample Adaptive Offset (SAO) filter, which can improve coding quality for sharp edges, is adapted to the ERP, reducing the coding complexity by up to 80% with no QoE impact. A correction to the standard HEVC deblocking filter can reduce the CMP edge distortion [74] by aligning the face edges with the filter edges, filtering only the left and top borders to maintain rotational symmetry, and using the correct pixels in the 3D representation for the filter decision-making. A similar approach is used in [39], limiting the coding unit splits at the face edges and adapting the HEVC filter to the equiangular CMP by enforcing the face boundaries and using the correct pixels. The authors also adapt a CNN denoising filter to the projection. Coding unit depths can also be adapted to the content and to the CMP geometry, reducing coding time significantly [75].

In projections with more irregular face shapes, the inactive samples that are used to pad the 2D projected frame to a rectangular shape can be ignored in the rate-distortion optimization, resulting in further compression benefits [76]. A full coding system using a sampling-adjusted CMP is presented in [77], including padding and other techniques to limit face boundary discontinuities, such as packing, i.e., reshuffling the cube faces in the representation so that contiguous objects on the 3D sphere are close in the projected image.
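The latitude-adaptive quantization idea discussed above can be sketched in a few lines. The following Python fragment is our own illustration, not the algorithm of any cited work; the clamping range and the logarithmic scaling constant are arbitrary assumptions. It derives a per-row QP for an ERP frame from the WS-PSNR-style cosine weights, so that heavily oversampled polar rows are quantized more coarsely.

import math

def erp_qp_offsets(height, base_qp=32, max_offset=8):
    """Per-row QPs for an ERP frame: rows whose pixels cover a small solid
    angle (near the poles) get a larger positive offset, i.e. coarser
    quantization. max_offset and the log scaling are illustrative choices."""
    qps = []
    for row in range(height):
        lat = (0.5 - (row + 0.5) / height) * math.pi   # latitude of the row center
        weight = max(math.cos(lat), 1e-3)              # WS-PSNR-style area weight
        # Fewer bits where the weight is small; clamp to a sane range.
        offset = min(max_offset, round(-3 * math.log2(weight)))
        qps.append(base_qp + offset)
    return qps

qps = erp_qp_offsets(1080)
print(qps[540], qps[100], qps[0])   # equator, mid-latitude, pole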
The temporal element is critical when encoding omnidirectional video: since the content is dynamic and encoded in GoPs, considering the motion in subsequent frames significantly increases the compression efficiency. The first example is downsampling: performing the operation on each frame statically does not achieve the same compression efficiency as considering the quality of the dependent B and P frames [78] when downsampling the independent I frames they are tied to. It is also possible to reduce the number of independent frames by adopting the Shared Coded Picture (SCP) technique, which introduces P-coded pictures that are the same across all representations. This enables longer GoPs, increasing the efficiency of the code, but also the encoding and decoding complexity [79].

Motion estimation is inextricably tied to saliency, which we will discuss in Sec. 4: the content that is most important to viewers, and on which their gaze usually fixates, is often also the fastest-moving one. This has important consequences for streaming systems which use prediction of the future FoV to optimize bandwidth utilization, as these systems require accurate predictions and efficient coding. As the use of offset projection, temporal coding, and FoV-oriented predictive streaming all aim at improving compression while maintaining an accurate representation of moving content, the interplay of these subsystems must be considered when designing a streaming system.

The effects of projection also complicate motion modeling in omnidirectional video: since projection is a non-linear transformation, a simple translational motion of all the projected pixels in a local block (as in the HEVC standard) will not be able to capture the actual motion of the content. This distortion can become catastrophic if the motion crosses face boundaries, causing texture discontinuities that seriously impair QoE.

A possible solution is to reproject the motion vector: if the motion on the sphere is translational (i.e., the movement is on the surface of the sphere), the motion vector on the projected video is converted to the spherical motion vector, which is then interpolated [80]. In this way, the coding efficiency and the QoE increase; the same can be done for purely rotational motion. This technique was proposed for the CMP [35, 39] and the ERP [24], integrating it with standard HEVC motion modeling schemes. In [81], a general model is tested for the ERP, CMP and octahedron projections. The spherical coordinate transform can be used to further improve performance and extend the possible motions to the whole 3D space [82], working in spherical coordinates and using relative depth to convert between the ERP and the 3D space. It is also possible to assign different motion vectors to pixels in the same block, correcting the motion vector distortion [83]. A less efficient but less computationally demanding way to correct motion vectors in the ERP is to exploit the WS-PSNR [84] weight map to calculate a scaling factor for the motion vectors [85].

Another technique, which deals with distortion due to motion compensation failures at face boundaries, extends a face by linearly projecting the pixels of the other faces [86] to preserve texture continuity [87]. This operation can be performed more efficiently using polytope geometry [88].
Another work [81] considers the angle of the block on the sphere in the ERP projection when computing the padding.

Deep learning is a new alternative to traditional motion estimation: in [89], CNNs are used to reconstruct future cubemap frames, combining the encoded P or B frame with the last received I frame. This scheme can improve the Peak Signal-to-Noise Ratio (PSNR) without increasing the required bandwidth.
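To make the motion reprojection idea more concrete, the sketch below is a simplified illustration of one spherical (rotational) motion model, written by us; it is not the exact algorithm of [80] or any other cited work. It lifts a block anchor and its displaced position from the ERP plane to unit vectors on the sphere, derives the rotation that maps one onto the other, and applies the same rotation to any other pixel of the block, so that the motion field follows the spherical geometry instead of being a constant 2D translation.

import math

def erp_to_dir(u, v, w, h):
    """ERP pixel -> unit direction vector on the sphere."""
    lon = (u / w - 0.5) * 2 * math.pi
    lat = (0.5 - v / h) * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def dir_to_erp(d, w, h):
    lon = math.atan2(d[1], d[0])
    lat = math.asin(max(-1.0, min(1.0, d[2])))
    return ((lon / (2 * math.pi) + 0.5) * w, (0.5 - lat / math.pi) * h)

def rotate(d, axis, angle):
    """Rodrigues' rotation of vector d around a unit axis."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(axis, d))
    cross = (axis[1] * d[2] - axis[2] * d[1],
             axis[2] * d[0] - axis[0] * d[2],
             axis[0] * d[1] - axis[1] * d[0])
    return tuple(d[i] * c + cross[i] * s + axis[i] * dot * (1 - c) for i in range(3))

def spherical_mv(anchor, mv, pixel, w, h):
    """Reproject a block-level ERP motion vector onto another pixel of the block,
    modeling the block motion as a rotation of the sphere."""
    p0 = erp_to_dir(*anchor, w, h)
    p1 = erp_to_dir(anchor[0] + mv[0], anchor[1] + mv[1], w, h)
    axis = (p0[1] * p1[2] - p0[2] * p1[1],
            p0[2] * p1[0] - p0[0] * p1[2],
            p0[0] * p1[1] - p0[1] * p1[0])
    norm = math.sqrt(sum(a * a for a in axis)) or 1e-12
    axis = tuple(a / norm for a in axis)
    angle = math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(p0, p1)))))
    q = rotate(erp_to_dir(*pixel, w, h), axis, angle)
    u2, v2 = dir_to_erp(q, w, h)
    return u2 - pixel[0], v2 - pixel[1]

# Example: reproject an 8-pixel horizontal MV estimated at the block anchor
# onto a nearby pixel of the same block.
print(spherical_mv((960, 100), (8, 0), (970, 110), 1920, 1080))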
3. Quality of Experience in Immersive Videos
QoE is the ultimate measure of performance for both standard and panoramic video streaming. However, its subjective nature makes finding a general metric to measure it extremely difficult [90]. Although most of the research on standard video is still applicable, 360° video presents some unique challenges [91]: an important factor in the perceived quality of panoramic video is the geometric distortion given by the projection of the spherical image on a planar display [92], which is more pronounced with wide FoVs. It is possible to assess these distortions objectively [93], but not their impact on QoE. For a more comprehensive survey on the possible sources of distortion in 360° videos, we refer the reader to [17]. Another important factor in the quality of omnidirectional video is the mosaic technique, which can generate distortion in dynamic scenes [94].

In this section, we consider subjective and objective methods to measure omnidirectional video QoE, and present the wide body of literature on the evaluation of these metrics. We conclude the section with a discussion of dynamic effects on omnidirectional video QoE.

3.1. Measuring QoE: subjective methods

QoE is a complex concept, as it involves the human interaction with the content, and its automatic assessment is a challenging problem [95]. Since a direct measure of QoE requires human subjects, the assessments need to be performed in controlled and replicable conditions. The standard methodologies for conducting these assessments are specified by the International Telecommunication Union (ITU) in [96], and distinguish between Absolute Category Rating (ACR) and Degradation Category Rating (DCR) scoring. The standard methodologies were developed for 2D video, and they often have to be adapted for omnidirectional video: in [97], an example of a new ACR methodology for omnidirectional video that does not require users to take off their HMD is presented. The standard testing conditions specified by the Joint Video Exploration Team (JVET) [98] are also often used, although they differ slightly from the ITU recommendations.

The gold standard for ACR quality assessment is the Mean Opinion Score (MOS): the content is shown in controlled conditions to a large number of human subjects, who then rate it on a scale from 1 to 5. When evaluating compression schemes, the Differential Mean Opinion Score (DMOS) is often used as a DCR metric, evaluating the difference between the quality of the compressed content and the original's: this is a fundamental step of the evaluation of new coding schemes, for both standard and omnidirectional content [99]. Omnidirectional video content is even more challenging, as static image quality is not the only component that influences QoE, and even subjective studies need to consider FoV changes and how the different encoding of foreground and background affects the experience [100]. A testing methodology that considers the dynamic aspect of QoE, accounting for delays between user motion and the high-quality rendering of the video in the new direction, is presented in [101].

The Double Stimulus Impairment Scale (DSIS) is another way to measure the quality impairment of compressed sequences specified in [96]: instead of rating the content QoE on an absolute scale, and possibly comparing it with the unimpaired version's score, this assessment method asks users to rate the degradation directly, after being shown the original and impaired sequences one after the other.

Table 3: Available subjective QoE assessment datasets
Reference | Type | Subjects | Videos or images | Total sequences
[109] | Video | 221 | 60 | 600
[110] | Video | 88 | 6 | 48
[97] | Video | 30 | 6 | 60
[111] | Video | 30 | 13 | 364
[112] | Video | 30 | 10 | 60
[113] | Static images | 20 | 16 | 320
[114] | Video | 21 | 5 | 75
[100] | Video | 12 | 3 | 24
[115] | Video | 27 | 2 | 10
[116] | Video | 13 | 10 | 150
[117] | Video | 23 | 16 | 384
[118] | Video | 340 | 30 | 1608
[119] | Stereoscopic video | 30 | 13 | 364
However, this method may cause cybersickness more often [102] when used for omnidirectional video. A more complete comparison between various assessment methods is presented in [103].

Immersiveness is another factor that needs to be considered in omnidirectional video QoE assessment, as the quality of the video can significantly improve the sense of presence in a VR environment. In order to do so, more factors than just picture quality need to be considered, as audio quality and spatial features can have a strong impact on the sense of presence, as well as the proprioceptive matching between the user's movements and the video displayed on the HMD [104]. Multi-sensory environments [105] that include haptic feedback or even smells present yet more challenges: in [106], immersiveness is evaluated when an external sensory stimulus is combined with the omnidirectional video, finding that this kind of addition can improve immersiveness and enrich the user experience.

Finally, an interesting development that straddles the line between subjective and objective metrics is the creation of metrics based on objective physiological data from the user, collected by smart watches and other simple sensors [107]. In [108], the authors develop a QoE metric based on the combined electroencephalographic, electrocardiographic and electromyographic signals, achieving high correlation with MOS.

Several QoE studies have published their datasets, providing a common base for future research on QoE assessment. One of the largest datasets is the one presented in [109], with 221 total subjects watching 60 video sequences, following the methodology described in [110], which also presents a public dataset with a total of 88 subjects watching 48 video sequences extracted from 6 videos. The dataset presented in [97] contains data from 30 users watching 60 sequences, and it was obtained using different methodologies, so it can be used to compare them. In [111], 13 videos are processed into 364 sequences, watched by 30 subjects. In [112], 10 omnidirectional videos of 10 seconds each are evaluated by 30 non-expert subjects. The dataset in [120] uses static images, having 20 subjects evaluate 528 compressed versions of 16 base images, as does the one in [113], with 320 compressed versions of 16 images watched by 20 subjects. The authors of [114] also released their dataset, with 21 participants watching 75 impaired video sequences with different resolution and compression levels. There are other small-scale datasets associated with other measurement studies [100, 115], while two more large datasets, with 13 subjects watching 150 videos and 23 subjects watching 384, were presented in [116] and [117], respectively. To the best of our knowledge, the largest available dataset was presented in [118], and is divided into 5 scenarios with an approximately uniform division of samples. Finally, there is a large-scale dataset for stereoscopic omnidirectional video, which was presented in [119]. The datasets above are summarized in Table 3.
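As a small illustration of how the ACR and DCR scores above are aggregated, the following sketch (our own minimal example; the rating arrays are made up) computes the MOS as the mean of the 1-to-5 ratings for a processed sequence, and the DMOS, under one common convention, as the mean per-subject difference with respect to the reference, which is the basic arithmetic behind the subjective studies listed in Table 3.

def mos(ratings):
    """Mean Opinion Score: average of the ACR ratings (1-5) for one sequence."""
    return sum(ratings) / len(ratings)

def dmos(ref_ratings, test_ratings):
    """Differential MOS under one common convention: the mean per-subject
    difference between the reference and the processed sequence, so higher
    values mean a larger perceived degradation."""
    diffs = [r - t for r, t in zip(ref_ratings, test_ratings)]
    return sum(diffs) / len(diffs)

# Made-up ratings from five subjects for a reference clip and a compressed version.
reference = [5, 5, 4, 5, 4]
compressed = [3, 4, 2, 3, 3]
print(round(mos(compressed), 2))              # 3.0
print(round(dmos(reference, compressed), 2))  # 1.6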
The easiest method to objectively measure the QoE of an omnidirectional image is to directly use a classic 2D metric such as PSNR, the Structural Similarity Index (SSIM) [121], Multiscale SSIM (MS-SSIM) [122], Visual Information Fidelity in the Pixel Domain (VIFP) [123], or the Feature Similarity Index (FSIM) [124]. However, these metrics do not take the geometric distortion caused by the projection of the spherical image into account; indeed, most objective QoE metrics for omnidirectional images and videos are adaptations of these metrics, with some corrections for the geometric distortion resulting from the projection of spherical images onto a plane.

S-PSNR [98] is an adaptation of PSNR that takes a number of uniformly distributed sampling points on a spherical surface, then reprojects them on the reference and distorted omnidirectional images and computes the PSNR. Points that fall between sampling positions in the 2D plane are mapped to the nearest neighbor. WS-PSNR [125] takes the opposite approach, computing the PSNR on each pixel of the projected image, then weighting the results proportionally to the area occupied by the pixel on the sphere. PSNR for the Craster Parabolic Projection (CPP-PSNR) [126] is a projection-independent adaptation of PSNR; it applies a Craster parabolic projection that preserves areas in the spherical domain, then calculates the PSNR on the resulting image. By virtue of being independent of the projection used in the image, it allows the comparison of different projection methods. Finally, Spherical SSIM (S-SSIM) [127] and Weighted to Spherically Uniform SSIM (WS-SSIM) [128] are adaptations of SSIM to the spherical domain: the structural similarity is adjusted to compensate for the geometric distortion using a weighting function similar to the one used by WS-PSNR. In [114], the sphere is divided into patches using a Voronoi diagram, and the 2D algorithms are applied on the patches, reducing the distortion.

The content itself can be the basis of the weighting system, as in [99]: Content Preference PSNR (CP-PSNR) and Content Preference SSIM (CP-SSIM) are adaptations of the two metrics that take the viewport direction and content saliency into account, using a predictive model to gauge the future viewing direction. However, saliency and eye movement models are not always perfect, and using the center of the viewport as a proxy for gaze direction is still very imprecise [129].

More complex metrics take several factors into account, often combining the objective metrics mentioned above: in [130], a non-linear Perceptual Video Quality (PVQ) model is derived, starting from SSIM and other metrics and matching them to a predicted MOS. The same operation is performed by the Normalized Quality versus Quality factor (NQQ) model in [131], which computes QoE as a non-linear function of a combination of coding parameters, such as the spatial resolution and the quantization factor, whose parameters are derived from the spatial activity in the image and the low-order moments of the luminance distribution.

Learning tools can also be used to estimate these models: in [132], Back Propagation (BP) is applied on inputs at multiple scales, considering single pixels, regional superpixels, salient objects, and the complete projection, resulting in the Quality Assessment in VR systems (QAVR) metric. Generative Adversarial Networks (GANs) are another learning tool that can be used to train neural networks to estimate QoE, and the Deep VR Image Quality Assessment (DeepVR-IQA) [133] metric is based on them.
GANs involve two neural networks in opposition to each other: as one network is trained to estimate the QoE, the other's objective is to generate examples that trick the first into estimating an incorrect quality. This improves training convergence and can increase the overall correlation with subjective test scores. The metric in [109] includes head and eye movement data in the learning process, concatenating patch-level CNNs with a fully connected network to obtain the QoE score. CNNs can also be used to determine 3D omnidirectional video quality [113], with additional preprocessing. The Viewport-based CNN (V-CNN) model combines viewport prediction with a CNN [134]: the QoE for different viewports is computed by the CNN, while another spherical CNN predicts possible future viewports' viewing probability and determines the weights of their contribution to the expected QoE. Table 4 presents a summary of the main full-reference QoE metrics presented in this section, along with the references of the comparison studies they appear in.

No-reference metrics can measure QoE in a different context, in which no uncompressed image is available. Metrics such as the Natural Image Quality Evaluator (NIQE) [140], based on natural image statistics, and the Six-Step Blind Metric (SISBLIM) [141], which is the combination of six different distortion measurements, have good performance on 2D images and videos, but the only study to check their effectiveness for immersive video [111] has found that their performance is significantly affected by the geometric distortion, making them only weakly correlated with subjectively perceived quality.

Table 4: Summary of the main presented objective QoE metrics
Metric | Description | Comparison studies
PSNR | Pixel-level Mean Square Error (MSE) over the whole image (2D) | [84, 98, 126, 135, 133, 99, 136, 132, 137, 124, 131, 111, 127, 114, 112, 120, 138]
SSIM [121] | Structural similarity on a small scale (2D) | [130, 99, 133, 124, 131, 114, 111, 132, 127, 112, 120, 138]
MS-SSIM [122] | Structural similarity on multiple scales (2D) | [137, 133, 124, 131, 111, 114, 120, 138]
VIFP [123] | Shannon model measuring shared information (2D) | [137, 133, 131, 120]
FSIM [124] | Feature-based model | [124, 120]
S-PSNR [98] | PSNR on sampling points from a sphere, remapped on the 2D projection | [98, 99, 139, 135, 133, 132, 114, 136, 137, 111, 127, 112, 109, 138]
WS-PSNR [84] | PSNR weighted proportionally to pixel area on the sphere | [84, 99, 139, 114, 135, 133, 136, 137, 131, 111, 127, 112, 109, 138]
CPP-PSNR [126] | Compares quality across projection methods with an equal-area projection | [126, 139, 99, 114, 135, 133, 136, 137, 111, 127, 112, 109, 138]
S-SSIM [127] | SSIM with corrections for projective distortion in the spherical domain | [127]
WS-SSIM [128] | SSIM weighted proportionally to pixel area on the sphere | [128]
Voronoi [114] | SSIM and PSNR on Voronoi patches | [114]
CP-PSNR [99] | Saliency- and viewport-weighted PSNR | [99]
CP-SSIM [99] | Saliency- and viewport-weighted SSIM | [99]
PVQ [130] | Non-linear function of SSIM | [130]
NQQ [131] | Non-linear function of the coding parameters | [131]
QAVR [132] | Learning-based model based on features at multiple scales | [132]
DeepVR-IQA [133] | Adversarial generative model to learn QoE | [133]
Model in [109] | Learning-based metric with head and eye movement input | [109]
V-CNN [134] | CNN on viewports weighted by viewing probability | [134]
The Multi-Channel 360° Image Quality Assessment (MC360IQA) metric [142] is a no-reference metric using a multi-channel CNN on the six faces of a cube, trained on the dataset in [111]: the metric outperforms even 2D full-reference metrics on that dataset.
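To make the weighting idea behind the spherical PSNR variants concrete, the following sketch (our own NumPy illustration of the WS-PSNR principle, not the reference implementation of [125]) computes a weighted PSNR for a pair of ERP frames, scaling each row's squared error by the cosine of its latitude so that oversampled polar pixels contribute in proportion to their area on the sphere.

import numpy as np

def ws_psnr_erp(ref, dist, max_val=255.0):
    """WS-PSNR for single-channel equirectangular frames: the squared error of
    each pixel is weighted by cos(latitude), i.e. by the solid angle it covers."""
    ref = ref.astype(np.float64)
    dist = dist.astype(np.float64)
    h, w = ref.shape
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi        # latitude of each row center
    weights = np.repeat(np.cos(lat)[:, None], w, axis=1)  # same weight along a row
    wmse = np.sum(weights * (ref - dist) ** 2) / np.sum(weights)
    return 10 * np.log10(max_val ** 2 / wmse)

# Toy example: a flat gray frame and a noisy copy of it.
rng = np.random.default_rng(0)
ref = np.full((180, 360), 128, dtype=np.uint8)
dist = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(round(ws_psnr_erp(ref, dist), 2))   # roughly 34 dB for sigma = 5 noise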
The conditions for testing QoE metrics in immersive video are specified by the JVET in [98]; a wider discussion of the framework [139] also provides some reference experiments, with objective and subjective quality metrics, and introduces the evil viewport problem. Evil viewports correspond to FoVs in which the discontinuous edge caused by the stitching of images from different cameras is clearly visible; it is important to consider evil viewports as a separate case, as QoE metrics that take the whole sphere into account might underestimate their impact on QoE because of the relatively small area of the stitching edge. Furthermore, another study [143] argues that short videos should not be used for QoE evaluation in VR, as users' attention takes longer to focus in this kind of environment. A detailed evaluation of the JVET database, with subjective experiments, is presented in [112].

In recent years, several studies have compared objective quality metrics to measure their correlation with actual subjective QoE. Due to the strong dependence of the correlation between objective metrics and MOS on the actual content of the images, tests performed on different datasets often have contradictory results, and the wide variation across videos of the same dataset confirms that the effect is fundamental and not due to experimental design. The subjective experiments in [136], for example, show no advantages of the 360-specific PSNR-based metrics over the baseline 2D metric; however, this contradicts the results in [135, 112], which both find that CPP-PSNR has better performance than the other metrics, and that S-PSNR and WS-PSNR also outperform standard PSNR. All of the works above [135, 136] confirm that MOS decreases sharply if the resolution is lower than 1920p; since only part of the video is inside the viewport at any time, even 1080p video has a low perceived resolution. All later studies confirm that standard PSNR is worse than any other quality metric, but they often include other metrics, such as SSIM [121] and VIFP [123]. In [137, 120, 131], VIFP significantly outperforms SSIM, MS-SSIM and WS-PSNR, which achieve a similar performance, while PSNR does even worse. Similar results are reported in [127], which includes S-SSIM but not VIFP or MS-SSIM; the 360-specific SSIM variant outperforms both its 2D ancestor and the PSNR-based metrics. The most complete study, which includes several less common 2D QoE metrics and SSIM flavors, finds that SSIM outperforms both MS-SSIM and the various PSNR-based metrics. The results of the various experimental studies are summarized in Table 5, which compares all the algorithms that are present in at least two of the works presented in this section. The table should be read horizontally: in each row, the corresponding metric is compared to the others (one in each column), and a qualitative summary of the comparison is given in each cell. The row corresponding to VIFP, for example, contains only favorable comparisons, showing that it does better than any other metric in the studies in which it is examined, while PSNR's row contains only unfavorable ones. An interesting case is presented by the comparison between SSIM and MS-SSIM, whose relative performance is similar, but with a very high variance: MS-SSIM performs better on some datasets [137], but worse in others [111], and neither is clearly better in yet others [131]. Another work [138] compares the basic metrics' performance and complexity, and finds that most are well correlated with MOS in the studied scenarios. The experiments by the respective authors show that the more complex methods in [130, 131, 109, 132, 133] have a higher performance than traditional metrics, but these results have not been corroborated by independent studies yet. The results of the analyses and comparisons are summarized in Table 5, giving the reader a first-glance impression of the metrics.

Table 5: Performance of the main presented objective QoE metrics. The table should be read horizontally: the metric in each row is compared to the one in each column. Metrics whose rows contain more favorable comparisons are more closely correlated with subjective MOS.

Metric | PSNR | SSIM | MS-SSIM | VIFP | WS-PSNR | S-PSNR | CPP-PSNR
PSNR | — | Worse | Worse | Worse | Worse | Worse | Worse
SSIM | Better | — | Similar | Worse | Better | Better | Slightly better
MS-SSIM | Better | Similar | — | Worse | Better | Better | Slightly better
VIFP | Better | Better | Better | — | Better | Better | Better
WS-PSNR | Better | Worse | Worse | Worse | — | Slightly worse | Slightly worse
S-PSNR | Better | Worse | Worse | Worse | Slightly better | — | Slightly worse
CPP-PSNR | Better | Slightly worse | Slightly worse | Worse | Slightly better | Slightly better | —

3.4. Dynamic factors in video QoE
The dynamic nature of video is also a major factor in QoE that should be taken into account: as in 2D video streaming, stalling events [144] can significantly affect both the perceived quality of 360° videos [145] and the sense of presence of the experience [115]. Since omnidirectional video is more bandwidth-intensive than standard video of the same quality, and buffering is limited by the accuracy of FoV prediction, as we will discuss in detail in Sec. 5, avoiding rebuffering events is likely to be a major issue in bitrate adaptation algorithm design.

Quality fluctuations also have an impact on QoE, and omnidirectional video can have two sources of picture quality variation: as in all adaptive video streaming systems, the bitrate adaptation algorithm can change the quality to adapt to the connection, either decreasing it if the available bandwidth does not support the current quality level or increasing it if there is unused capacity. The second cause of quality fluctuations is specific to omnidirectional video: as we will discuss in Sec. 5, streaming systems transmit regions outside the predicted viewport at a lower quality to save bandwidth, which causes sharp decreases in QoE when the user turns and the lower-quality content is displayed.

The impact of quality variations due to FoV changes in adaptive systems is modeled in [146], using quotients between exponential functions of the quality variation rate to approximate the subjective quality when fluctuations are present. This model is extended in [118], which considers a more complete model for several different possible scenarios and tests it on a large-scale subjective evaluation dataset. Naturally, a more precise model of the trajectory of the user's gaze could improve the accuracy of these QoE models, tying quality evaluation, encoding, and FoV tracking inextricably together.

Another study [147] investigates the impact of head turn movements on subjective QoE, finding that these movements can have a strong impact on perceived quality. However, the effect of user movements on the QoE of omnidirectional video is still largely unexplored, and should be investigated further. Another interesting issue, which is explored in [148], is the impact of audio degradation on omnidirectional video QoE: the authors use a neural network to combine the effects of video and audio impairments, training it on a subjective assessment dataset.

Immersive videos with fast camera motion are also subject to cybersickness [149], which is caused by a mismatch between perceived motion and visual input. Cybersickness symptoms often include oculomotor disturbances, nausea, and disorientation, and they are strongly dependent on the content [13]: immersive scenarios with strong pitch motion, such as rollercoaster rides or parachute dives, can induce far stronger symptoms than more horizontal scenes. The technical challenges of designing immersive systems are explored in more detail in [150, 105].

Gaming is another important application of VR, and the definition of QoE can be slightly different in this context, as both enjoyment and performance need to be taken into account. Immersive gaming is affected both by the quality of the video and by other factors such as the control scheme [151], which should include the headset movement input: measurement studies have been performed in different contexts, such as driving simulators [152], first-person shooters [153], sport simulators [154], and even training simulators [155].
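The dynamic effects described above are often folded into a single session-level score in the adaptive streaming literature. The sketch below is a generic linear aggregate of that kind, not the specific models of [146] or [118], and its penalty weights are made-up illustrative values: it averages the per-segment viewport quality and subtracts penalties for quality switches and for rebuffering time.

def session_qoe(viewport_quality, rebuffer_s, switch_penalty=0.5, stall_penalty=3.0):
    """Toy session-level QoE: mean displayed quality, minus penalties for
    quality switches between consecutive segments and for stalling time.
    The penalty weights are illustrative, not fitted to any subjective study."""
    if not viewport_quality:
        return 0.0
    mean_q = sum(viewport_quality) / len(viewport_quality)
    switches = sum(abs(a - b) for a, b in zip(viewport_quality, viewport_quality[1:]))
    return mean_q - switch_penalty * switches / len(viewport_quality) \
                  - stall_penalty * rebuffer_s

# A session where the user turned into a low-quality region around segment 4
# and the player stalled for half a second.
print(session_qoe([4, 4, 4, 1, 3, 4], rebuffer_s=0.5))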
4. Saliency and FoV tracking
Saliency is the quality that makes part of an image or video stand out and capture the viewers' attention [156]. In this section, we discuss how to evaluate saliency in omnidirectional videos, then apply the concepts to FoV tracking, which represents not just the importance of parts of images but the trajectory that users' gazes follow over the whole duration of the video.

While saliency estimation and FoV tracking are not, in and of themselves, optimizations that improve the QoE of 360° video streaming, they are closely intertwined with all the other components that we discuss in this survey. The most effective projection methods take user behavior into account [48], as prioritizing the content that is watched most often will usually lead to a higher compression efficiency. The same reasoning applies to QoE estimation: while we can look at the quality of a 360° frame from all possible angles, the actual experience of users will always entail a single trajectory through the video, as their eyes can only look in one direction at a time. Naturally, different users might follow different paths during the video, looking at different points at different times, and even the same user might focus on different content when rewatching an omnidirectional video, but this makes extensive studies of saliency all the more important.

Finally, FoV tracking is a key component of streaming systems, as we will discuss in detail in Sec. 5: since QoE only depends on the parts of the video that the user is currently watching, buffer-aided streaming systems can improve their efficiency by predicting in which direction the user will look and prefetching the correct parts of the video, or adjusting the projection to improve quality in that direction. Precise, long-term FoV tracking can then enable the streaming client to make more foresighted choices.

While there is a wide body of literature on 2D saliency evaluation [157], omnidirectional video saliency is still a recent field. The Boolean Map Saliency (BMS) and Graph-Based Visual Saliency (GBVS) 2D saliency metrics were adapted to omnidirectional images and videos in [158], applying them directly on the omnidirectional images by using the ERP and automatically compensating for the distortion in the CIELAB color space [159]. Another attempt to adapt saliency metrics to panoramic video was made in [160], using similar tools to compensate for the equirectangular distortion. A later work [161] considers multiple projections, taking into account the bias towards looking at the center of the panorama [162], i.e., keeping close to the equator of the video sphere [163], and combining it with 2D metrics. Other saliency metrics, taking center bias and multi-object confusion into account, are proposed in [164] and [165]; the latter also includes a movement tracking framework. A metric considering a linear combination of low-level features and high-level ones, such as faces and people, was proposed in [166], obtaining good results for images containing humans. It is also possible to apply 2D techniques such as weakly supervised CNNs directly by using the appropriate projection and adjustments [167], or by using CNNs to correct the distortion, combining the output of the traditional saliency map of each patch with its spherical coordinates [168].
Spherical CNNs can also be used directly [169].

In [170], a superpixel decomposition is applied to the image, which is then converted to the CIELAB color space; the difference in contrast and color is then used to train an unsupervised learner to determine saliency, according to the boundary connectivity measure [171]. A similar approach is taken in [172], in which the authors derive sparse color features and apply a model of human perception, biased towards the equator, to derive saliency. It is also possible to combine 2D saliency maps on different projections with spherical domain optimization to generate a hybrid metric [173], or to include illumination normalization [174] to compensate for lighting variations in the omnidirectional images. GANs [175] are another supervised learning tool that can be used to infer saliency; unsupervised learning from bottom-up features has also been applied successfully [176]. An experimental comparison of several standard and omnidirectional state-of-the-art saliency detection techniques is presented in [177].

Scanpaths [178] are a natural extension of the saliency metric, adding the time dimension to the static map; image metrics can often be straightforwardly extended to the video domain, both for standard and omnidirectional video [179]. Scanpaths can also act as predictors of future gaze directions when used as the training model for learning agents such as deep networks [180] or GANs [181]. However, scanpath models often have the same issues as static saliency models: since saliency is extremely content-dependent, different models can have higher performance on different datasets. For this reason, standard evaluation datasets and metrics have been proposed [182, 183]. In [184], an approximate saliency metric is derived by clustering multiple users' head movements, but the training is video-specific and does not generalize to other content. A more general model based on user movement statistics is derived in [185] by combining Fused Saliency Maps (FSMs) [186] with head movement data and applying an equator bias.

Table 6: Summary of the main presented saliency and FoV prediction methods
Reference | Type | Basic principle
[181] | Content- and popularity-based | GAN
[184] | Popularity-based | Clustering
[187] | History-based | Dead reckoning
[188] | History-based | Polynomial regression
[189, 190] | History-based | Kalman filtering
[191, 192] | History- and popularity-based | Gaussian filtering
[193] | History- and popularity-based | Clustering
[194] | History- and popularity-based | CNN
[195] | History- and popularity-based | Recurrent Neural Network (RNN)
[195, 196, 197, 198] | Content-, history- and popularity-based | Long Short-Term Memory (LSTM)
[199] | Content-, history- and popularity-based | Convolutional LSTM
[200] | Content- and history-based | Attention-based encoder-decoder network

In general, saliency evaluation is more related to coding and compression than to streaming, as streaming systems have the benefit of knowing the current trajectory of the user, which can lead to the more effective FoV tracking tools discussed below. On the other hand, the compression and coding phase must be performed once, so saliency and most-frequent-scanpath estimation are the only available tools to use content information during it. As in other fields, the development of machine learning tools to combine content features and user experience is one of the major research challenges: the field is rapidly developing, and a one-step network that can automatically learn to extract saliency and encode the video at the same time is just around the corner.

A task related to saliency and scanpath estimation is automatic navigation, i.e., moving through a panoramic video to catch the most important parts of the action. A simple optimization is performed in [201], while another work [202] proposes a combination of object recognition and reinforcement learning, implementing the policy gradient technique to track interesting objects in sports videos. A similar approach can be applied to explore a space by rewarding an agent when it examines unexplored portions of its environment [203].
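The clustering- and statistics-based approaches above all start from the same raw material: recorded viewport centers. The sketch below (our own minimal example, not the method of [184] or [185]) accumulates the viewport centers of several users into an ERP-shaped histogram and normalizes it, which is the simplest form of empirical, popularity-based saliency map.

import numpy as np

def empirical_saliency(traces, width=64, height=32):
    """Accumulate viewport centers (longitude, latitude in radians) from many
    users into a normalized equirectangular heat map."""
    heat = np.zeros((height, width))
    for trace in traces:
        for lon, lat in trace:
            col = int((lon / (2 * np.pi) + 0.5) * width) % width
            row = min(height - 1, int((0.5 - lat / np.pi) * height))
            heat[row, col] += 1.0
    total = heat.sum()
    return heat / total if total > 0 else heat

# Two made-up users who mostly look slightly left and right of the front direction.
user_a = [(-0.2, 0.0), (-0.1, 0.05), (0.0, 0.0)]
user_b = [(0.3, -0.05), (0.2, 0.0), (0.1, 0.0)]
saliency = empirical_saliency([user_a, user_b])
print(saliency.max(), saliency.sum())   # peak probability and total mass (1.0)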
4.2. Field of View prediction

As discussed in Sec. 3, the viewport direction is a fundamental factor in assessing the QoE of immersive video, and it needs to be considered proactively both in the coding phase and when performing adaptive streaming. In particular, the difficulty of predicting the future viewport orientation leads to diminishing returns on capacity, limiting the amount of prefetching [204] and exposing users to the risk of annoying stalling events [115].

The prediction of gaze direction has been studied since the '90s by using simple analytical tools, and it parallels the work on motion prediction: the first studies used dead reckoning [187] and polynomial regression [188], and several streaming systems that exploit FoV prediction still apply simple linear regression on historical data [205]. However, these models are often too simplistic, not capturing the complexity of viewer behavior: an early frequency-domain analysis [206] highlights the difficulty of predicting long-term trends using these strategies. Kalman filtering approaches use similar underlying models, but they can deal with imprecise measurements of the orientation [189, 190].

Recently, more complex statistical tools such as Gaussian filtering [191, 192] and clustering [193] have been used with good results, modeling the viewer gaze direction as a random variable whose distribution is determined by their own history as well as by past users' behavior. Another study on the correlation in the behavior of users [207] concentrates on the caching implications of predicting the FoV.

More recently, deep learning has also been applied to the problem, as FoV prediction is a classical regression problem: both CNNs [194] and RNNs [195] have shown good performance on standard datasets [208]. Three other works [196, 197, 198] introduce LSTMs, including content-related metrics such as saliency maps and scanpaths along with the motion information. In [209], ladder convolution is used before the LSTM to extract contextual information from the encoded image and correct for the projection. Naturally, a richer state with more information from different sources can improve the quality of the prediction, which is further enhanced in [200] by the use of an encoder-decoder network with an attention mechanism that can maintain high tracking accuracy over multiple seconds. However, these methods have not been tested on large datasets yet, and their significant computational complexity poses a challenge for real-time mobile applications. The search for an efficient FoV tracking algorithm that can allow Dynamic Adaptive Streaming over HTTP (DASH) clients to achieve buffer levels similar to traditional planar video is still open, and as these works are all from the past three years, the state of the field is rapidly changing and improving.

Prediction on even longer timescales is possible by leveraging the watching history of other users and identifying similarities [199], maintaining a viewport hit rate over 75% even at a distance of 10 seconds. For additional accuracy, users can be clustered by similarity [210, 211], identifying common patterns within clusters more effectively. This approach can also be combined with deep reinforcement learning [212] to reduce training costs. It is also possible to combine saliency metrics and head movement with more precise gaze tracking, obtaining a higher precision in the prediction [213].
FoV prediction can also be tested on public datasets, often used by existing saliency estimation [177] and prediction methods [212]; the latter provides a dataset with the head movements of 58 users across 76 video sequences. The datasets used for QoE measurement often include both the ratings and head movements of the viewers, so they can also be used for this purpose. A dataset with the head movements of 59 users watching 7 YouTube immersive videos was presented in [214], while another dataset with partly overlapping videos and 50 different subjects was presented in [215]. Another dataset includes the head trajectories of 48 users watching 18 videos [216], and yet another [217] contains the FoV trajectories and saliency maps of 48 users on 24 videos. The dataset presented in [218] includes both head movements and the results of a cybersickness questionnaire for 20 subjects watching 48 video sequences. The same kinds of data are available in [219], with 60 subjects watching 28 videos, and in [220], with 20 subjects watching 5 videos created and edited by professional filmmakers. Another dataset [221] provides eye tracking data, which is more precise than head movements, for 98 static images, observed by 63 subjects for 25 seconds each.
Table 7: Available FoV tracking datasets

Reference | Type | Subjects | Videos
[212] | Head movements | 58 | 76
[214] | Head movements | 59 | 7
[215] | Head movements | 50 | 10
[216] | Head movements | 48 | 18
[217] | Head movements (with saliency maps) | 48 | 24
[218] | Head movements (with cybersickness questionnaire) | 20 | 48
[219] | Head movements (with cybersickness questionnaire) | 60 | 28
[220] | Head movements (with cybersickness questionnaire) | 20 | 5
[221] | Eye movements (static images) | 63 | 98
[222] | Eye movements (desktop platform) | 50 | 12

Viewer gaze direction is usually analyzed on VR headsets, but there is a public dataset [222] of immersive video FoVs on a desktop platform. The datasets on FoV prediction and tracking are summarized in Table 7, while the main methods of FoV prediction we presented in this section are summarized in Table 6.
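These traces make it possible to benchmark predictors directly. A minimal sketch of the viewport hit-rate evaluation used by several of the works above is given below, assuming (purely for illustration) an equirectangular 6x4 tile grid and a square 100° approximation of the FoV.

```python
def tiles_in_fov(yaw, pitch, fov_deg=100.0, grid=(6, 4)):
    """Return the ERP tile indices whose centers fall inside a square
    fov_deg window around the gaze direction (a deliberately coarse test)."""
    cols, rows = grid
    visible = set()
    for r in range(rows):
        for c in range(cols):
            center_yaw = -180.0 + (c + 0.5) * 360.0 / cols
            center_pitch = 90.0 - (r + 0.5) * 180.0 / rows
            d_yaw = (center_yaw - yaw + 180.0) % 360.0 - 180.0   # wrap-around distance
            if abs(d_yaw) <= fov_deg / 2 and abs(center_pitch - pitch) <= fov_deg / 2:
                visible.add((r, c))
    return visible

def viewport_hit_rate(predicted, actual):
    """Fraction of the tiles the user actually watched that were also predicted."""
    watched = tiles_in_fov(*actual)
    fetched = tiles_in_fov(*predicted)
    return len(watched & fetched) / max(len(watched), 1)

# One trace sample: predicted gaze at (35, 0) degrees, actual gaze at (40, 5).
print(viewport_hit_rate(predicted=(35.0, 0.0), actual=(40.0, 5.0)))
```

Averaging this value over all samples of a trace gives the hit rates reported in the FoV prediction literature, although the exact tiling and FoV model differ from work to work.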
5. Streaming
Serving omnidirectional video content over the Internet is a complex prob-lem of its own: a naive approach sending the whole sphere at the highest qualitywill be extremely inefficient, and an intelligent way to adapt to network con-ditions and user behavior needs to be devised. In this section, we discuss thestandardization work on omnidirectional video streaming and the solutions tooptimize bitrate adaptation by considering spatiotemporal elements such as FoVprediction. Finally, we present some of the work on network support of omnidi-rectional video in the context of VR, which is one of the key applications thatwill be enabled by 5G networks.
Today, the DASH streaming standard is almost universally used for 2D video streaming over the Internet: it divides videos into short segments, which are encoded independently and at several different qualities by the server. The streaming client can then choose the quality level for each segment, depending on the bitrate its connection can support, by requesting the appropriate HTTP resource. The low computational load on the server and the transparency to middleboxes make DASH highly compatible with the existing Internet infrastructure, and the possibility of implementing different adaptation algorithms makes it adaptable to a wide range of network conditions. In the early 2010s, the standard was extended to enable the transmission of omnidirectional, zoomable and 3D content: the Spatial Relationship Description (SRD) extension [223] specifies spatial information on each segment, allowing servers to present spatially diverse content. The standard only specifies the spatiotemporal coordinates of each segment, and the choice of which ones to download and show to the user is still client-side, in accordance with the client-based DASH paradigm.
The Omnidirectional Media Format (OMAF) standard [224] is another specification that can extend DASH or other streaming systems by specifying the spatial nature of video segments. Furthermore, OMAF also specifies some requirements for players, taking another step towards a complete standard specification for omnidirectional streaming. In fact, OMAF-based players have already been implemented and demonstrated [225]. The standard specifies a viewport-independent video profile using the HEVC coding standard, as well as two viewport-dependent profiles using HEVC or the older AVC, supporting the ERP and CMP projections and tile-based streaming. OMAF further defines a viewport-dependent projection approach, in which the client chooses the projection with the highest quality for its current viewport, as well as three different tile-based streaming approaches: in the simplest one, the viewport region is downloaded at a high quality, along with an additional low-quality version of the whole sphere. The other two allow a freer choice by the client, which can download a set of tiles with either mixed encoding qualities or mixed resolutions, privileging the viewport area in both cases.
A DASH SRD- or OMAF-compliant server can allow clients to stream omnidirectional video, presenting either segments with different viewport-dependent projections or separate tiles for the client to choose from. The client can download the appropriate projected content, potentially discarding or downloading low-quality versions of tiles with a low viewing probability and saving bandwidth. It is also possible to exploit the features of HEVC to enable fast FoV switching or to give users the option to zoom into certain areas of the sphere [226], as high-quality chunks can be requested at any moment if the user moves their head [227], seamlessly integrating these functions with minimal server-side changes. The techniques for streaming content at the highest possible quality by exploiting viewport information are described in detail in the following.
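To make the SRD mechanism concrete, the sketch below parses the comma-separated value string carried by an SRD property (source identifier, tile position and size, and total frame size) into a normalized tile rectangle that a client could match against its predicted viewport. The tiling grid and the example value string are hypothetical, and the optional fields of the descriptor are omitted.

```python
from dataclasses import dataclass

@dataclass
class TileRegion:
    source_id: int
    x: float  # left edge, as a fraction of the full panorama width
    y: float  # top edge, as a fraction of the full panorama height
    w: float  # tile width, as a fraction of the full panorama width
    h: float  # tile height, as a fraction of the full panorama height

def parse_srd(value: str) -> TileRegion:
    """Parse an SRD value of the form
    'source_id, object_x, object_y, object_width, object_height, total_width, total_height'."""
    sid, x, y, w, h, total_w, total_h = (int(v) for v in value.split(","))
    return TileRegion(sid, x / total_w, y / total_h, w / total_w, h / total_h)

# Hypothetical tile from a 6x4 grid over a 5760x3240 equirectangular frame:
# the second column of the top row of tiles.
print(parse_srd("1, 960, 0, 960, 810, 5760, 3240"))
```

The client is then free to decide, per segment, which of the described tiles to fetch and at which quality, exactly as the client-based DASH paradigm prescribes.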
Omnidirectional streaming has all the complexity of traditional streaming, with buffer concerns and dynamic quality considerations, but it has an additional degree of freedom: since the viewer only sees the portion of the sphere in their FoV, quality is strongly dependent on the direction of their gaze [228]. Only the parts of the sphere inside the FoV are visualized by the user, and their attention focuses on an even narrower foveal cone [229]. Naturally, adaptive streaming systems try to exploit this by maximizing the quality of the predicted FoV at the expense of unwatched regions, which do not contribute to the QoE. This approach is not without pitfalls: standard DASH buffered streaming often prefetches segments several seconds in advance with no performance loss, but prefetching an unwatched region at a high quality does not lead to any QoE improvement [230], so the advantages of prefetching in adaptive 360° video are closely tied to the quality of the viewport prediction [204]. The paradigm can also react to dynamic viewpoint changes [231].
Transmission factors can significantly affect the quality of the image [115]: viewport-agnostic streaming, which transmits the whole omnidirectional video at the same quality, does not introduce additional distortion, but it is extremely bandwidth-inefficient. There are two main viewport-dependent approaches to adapting omnidirectional streaming systems to the FoV. The first, and most common, approach is tile-based streaming, which divides the omnidirectional video into independent rectangular tiles [232]. In this case, the bitrate adaptation becomes multi-dimensional [233]: each tile can be streamed independently at a different quality level, and the client reconstructs the whole sequence. It is also possible to exploit the HTTP/2 weight parameter to control the tile interleaving and prioritization [234]. The main downsides of tile-based approaches are the frequent spatial quality fluctuations [235] and the artifacts close to tile borders.
The second approach is viewport-dependent projection, which uses offset projection [236] or differentiated QP assignment [237] to improve the quality of the FoV [238]. This approach avoids obvious seams between tiles at different qualities. However, it can suffer from temporal quality fluctuations as the projection changes when the user moves their head, and it is rarely used in the literature because of the server-side memory requirements of storing several different projections with different encoding parameters. A third, even less common, solution in wireless channels is to transmit the video directly, using analog modulation after applying the Discrete Cosine Transform (DCT) [239]. This leads to a more graceful quality decrease than the sharp fall caused by digital transmission, but it is not without its disadvantages, as the transmitter and receiver hardware need to be designed ad hoc.
In the following, we concentrate on tile-based streaming methods, as they are by far the most common, although they involve a higher computational cost due to the necessity of stitching [183]. While the simplicity in the design of tile-based systems is attractive, we remark that they might not be optimal in terms of encoding efficiency, and a more holistic approach that takes both encoding efficiency and streaming factors into account might provide an even better solution in the future.
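As a toy illustration of the prioritization idea in [234], the sketch below maps per-tile viewing probabilities onto HTTP/2 stream weights (which range from 1 to 256), so that the tiles most likely to be watched are delivered first. The probability values are made up for the example, and the simple linear mapping is an assumption of this sketch rather than the scheme actually used in that work.

```python
def http2_weights(view_prob, w_min=1, w_max=256):
    """Linearly map per-tile viewing probabilities to HTTP/2 stream weights."""
    top = max(view_prob.values())
    return {tile: w_min + round((w_max - w_min) * p / top)
            for tile, p in view_prob.items()}

# Made-up viewing probabilities for a 2x2 tiling.
probs = {"tile_0_0": 0.55, "tile_0_1": 0.30, "tile_1_0": 0.10, "tile_1_1": 0.05}
print(http2_weights(probs))   # the most probable tile gets weight 256, the least ~24
```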
As we discussed in the previous sections, the design of projection and encoding methods is inextricably linked to the expected scanpath of the user's gaze, while the streaming adaptation strategy strongly relies on FoV prediction. As some users might behave in an atypical manner and follow uncommon scanpaths, the encoding and streaming systems need to guarantee a minimum QoE in all cases, while optimizing the QoE for as many users as possible. These conflicting objectives present an interesting trade-off, which is mostly unexplored in the current literature and deserves further investigation.
An accurate prediction of the FoV can improve the efficiency of omnidirectional streaming significantly: since the only area that the viewer sees is the one in the viewport, other parts of the video sphere can be streamed with a much higher compression, or even discarded, without affecting the QoE. Several authors have proposed streaming algorithms exploiting this prediction, often using it in one of three ways:
• The viewport-based approach maximizes the quality of the predicted FoV, or of a slightly wider region to account for inaccuracies in the prediction, and streams the rest of the sphere at the lowest quality.
• The probabilistic approach weights the tiles by their viewing probability, then optimizes the expected quality.
• The reinforcement learning approach implicitly optimizes the expected long-term QoE by applying its namesake learning paradigm.
Naturally, the capacity of the connection is the constraint that limits the QoE, and various capacity prediction methods can be employed. Since there is no correlation between the capacity of the channel and the viewport orientation, the two predictions can be performed separately with different methods, and the use that the streaming adaptation algorithm makes of the results is usually not constrained by the prediction method. An interesting way to improve the prediction and the streaming quality is to design the content in a way that implicitly or explicitly leads users to direct their attention in certain directions [240].
The viewport-based approach is simpler, as it does not require solving a complex optimization problem: there are only two regions, the one around the viewport and the rest of the sphere, and the second one is usually either not streamed at all or streamed at the maximum possible compression [241]. Naturally, the approach is optimal if the predictor is perfect. In [205], both a linear regression and a neural network-based prediction are tested with a simple algorithm that transmits a circular portion of the omnidirectional video, consisting of the circle inscribing the predicted viewport with an additional safety margin. The authors assume that an efficient projection method is used and that the capacity is constant. It is also possible to adapt the safety margin to the estimated prediction error variance [242], increasing the area in case of quick head movements or highly unreliable predictions. Naturally, linear regression is not the only possible model: a second-degree model with constant acceleration is proposed in [243], and Support Vector Regression (SVR) with eye tracking data is used in [244]. The latter distinguishes a small attention area of about 10° close to the gaze direction, while the rest of the FoV is a larger sub-attention area. The two areas have different weights in the optimization, and a third area (non-attention) completes the sphere with the unwatched portions.
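A heavily simplified sketch of this attention-weighted idea follows, replacing the actual optimization in [244] with a proportional bitrate split; the weights, area fractions, and total budget are illustrative assumptions.

```python
def split_budget(total_kbps, regions):
    """Split a bitrate budget across regions proportionally to
    (importance weight) x (fraction of the sphere covered)."""
    shares = {name: weight * fraction for name, (weight, fraction) in regions.items()}
    norm = sum(shares.values())
    return {name: round(total_kbps * s / norm) for name, s in shares.items()}

# Illustrative weights and area fractions for the three regions.
regions = {
    "attention":     (10.0, 0.02),   # small cone around the gaze direction
    "sub_attention": (3.0, 0.15),    # rest of the FoV
    "non_attention": (0.5, 0.83),    # unwatched remainder of the sphere
}
print(split_budget(20_000, regions))
```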
This kind of three-tier optimization is a first step towards the probabilistic approach.
It is also possible to mix a popularity-based approach with linear regression: the scheme presented in [245] uses the two at the same time, weighting the regression outputs by the popularity and fetching the predicted viewport tiles, with some margin for errors, at the highest quality supported by the connection. A more refined server-side approach is adopted in [246], which uses a neural network to estimate the future viewport of multiple users. The algorithm then sends the data for the predicted viewport to each user at the highest possible quality, while sending the invisible parts of the sphere at the lowest one to save bandwidth. Another work [194] takes the same approach, replacing the fully connected neural network with a CNN. Object tracking is another kind of information that can be used for the prediction: this semantic information [247] is often correlated with users' viewing patterns, as their gaze tends to follow one of the objects across the panoramic video.
The probabilistic streaming approach weights the quality of each tile by its viewing probability and optimizes the expected quality. This scheme has been combined with linear and ridge regression for the equirectangular [248], triangular [249], and truncated pyramid [250] tiling schemes. In all three cases, the capacity of the connection is assumed to be constant. In [251], the linear regression is combined with a buffer-based streaming approach to maintain playback smoothness, adapting the estimate of the total bitrate to control the buffer level. Bas-360 [252] is another scheme which combines spatial adaptation with a temporal factor, optimizing a sequence of multiple future frames together and using stream prioritization and termination to correct bandwidth and FoV prediction errors. A similar method [253] includes both temporal and spatial quality smoothness in the optimization, considering a sequence of future segments. The Optimal Probabilistic Viewport (OPV) scheme [254] tackles prediction errors from a different angle, correcting its decisions by streaming higher-quality tiles for already buffered segments if necessary. This allows the client to keep a long buffer and avoid stalling without having to lower the quality.
As for the viewport-based approach, popularity can be used to perform the prediction: a proposed scheme [255] tries to maximize the overall expected QoE, considering only the popularity of each tile, corrected for the equirectangular tiling (if the viewport is closer to the poles, more tiles will be part of the FoV). The algorithm uses the rate-distortion curve of each tile, weighted by its corrected navigation probability. In this case, the capacity is assumed to be constant. This approach can also exploit the popularity of tiles and linear regression jointly: in [256], a transition threshold between the two methods is set, and the popularity-based model is used if the measured capacity of the connection is insufficient to support the other one. The concept behind this scheme is that regression incurs a higher risk of rebuffering events in low-bandwidth scenarios, and switching to a more conservative scheme is desirable in this context. Another work [257] mixing the two prediction methods uses a linear combination of the two outputs, considering the trade-off between the flexibility of the adaptation and the coding efficiency, which decreases as the number of tiles grows.
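The core of most probabilistic schemes is an expected-quality maximization under a bandwidth budget. The following sketch solves it greedily by always spending the next kilobit on the tile upgrade with the highest probability-weighted gain; the quality ladder, probabilities, and budget are illustrative assumptions, and the works cited above use more sophisticated formulations.

```python
def allocate(tiles, budget_kbps):
    """Greedy expected-quality maximization: repeatedly upgrade the tile whose next
    quality step yields the largest probability-weighted gain per extra kbps.

    tiles: dict tile_id -> (viewing_probability, ladder), where ladder is a list of
    (bitrate_kbps, quality_score) pairs sorted from lowest to highest.
    """
    level = {t: 0 for t in tiles}                          # every tile starts at the bottom rung
    spent = sum(ladder[0][0] for _, ladder in tiles.values())
    while True:
        best, best_gain, best_cost = None, 0.0, 0
        for t, (prob, ladder) in tiles.items():
            if level[t] + 1 >= len(ladder):
                continue                                   # already at the top rung
            d_rate = ladder[level[t] + 1][0] - ladder[level[t]][0]
            d_qual = ladder[level[t] + 1][1] - ladder[level[t]][1]
            gain = prob * d_qual / d_rate
            if spent + d_rate <= budget_kbps and gain > best_gain:
                best, best_gain, best_cost = t, gain, d_rate
        if best is None:
            return level                                   # budget exhausted or all tiles maxed out
        level[best] += 1
        spent += best_cost

# Illustrative three-rung ladder shared by four tiles with decreasing viewing probability.
ladder = [(200, 60.0), (600, 75.0), (1500, 85.0)]          # (kbps, quality score)
tiles = {i: (p, ladder) for i, p in enumerate([0.6, 0.25, 0.1, 0.05])}
print(allocate(tiles, budget_kbps=3000))                   # e.g. {0: 2, 1: 1, 2: 1, 3: 0}
```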
A k-Nearest Neighbors (k-NN) model was exploited in [258] to make use of previous users' data, finding similar scanpaths and assigning a larger probability to the future FoVs of those users.
A more sophisticated approach, presented in [196], combines saliency and motion information with the FoV scanpath using an LSTM. The predicted viewing probability for each equirectangular tile can then be used in the usual probability-weighted quality optimization. The same technique was compared to a 3D-CNN approach in [259]: both prediction methods had extremely good performance, but the latter had a slight advantage.

Table 8: Summary of the main presented FoV prediction-based streaming schemes

Ref. | Projection | Optimization | Prediction method
[205] | Ideal | Circular region around the viewport | Linear regression and neural networks
[242] | ERP | Adaptable region around the viewport | Linear regression
[243] | CMP | Highest quality for predicted viewport | Second-degree regression
[244] | ERP | Attention-based weights | SVR with eye tracking
[245] | ERP | Highest quality for predicted viewport | Popularity-weighted linear regression
[246] | ERP | Highest quality for predicted viewport | Neural network with motion history
[194] | ERP | Highest quality for predicted viewport | CNN with motion history
[247] | Direct | Highest quality for predicted viewport | Semantic object tracking
[248] | ERP | Expected quality | Linear regression
[249] | Triangular | Expected quality | Linear and ridge regression
[250] | TSP | Expected quality | Linear regression
[251] | ERP | Expected quality with buffer control | Linear regression
[252] | ERP | Expected quality over multiple future steps | Unspecified
[253] | ERP | Expected quality over multiple future steps | Unspecified
[254] | ERP | Expected quality, past action fixes | Unspecified
[255] | ERP | Expected quality | Popularity-based model
[256] | ERP | Expected quality | Popularity/linear regression switching
[257] | ERP | Expected quality | Popularity/linear regression linear combination
[258] | SP | Expected quality | k-NN with other users' patterns
[196] | ERP | Expected quality | LSTM with saliency, motion, and FoV info
[259] | ERP | Expected quality | 3D-CNN with saliency, motion, and FoV info
[260] | ERP | Minimum visible quality, stalling avoidance | Unspecified
[261] | Unspecified | Reinforcement learning | Unspecified
[262] | ERP | Reinforcement learning | Neural network from [263]
[264] | ERP | Reinforcement learning | LSTM
[265] | ERP | Reinforcement learning | LSTM
[266] | ERP | Reinforcement learning | Implicit in the solution
[267, 268] | Adap. ERP | Expected quality | Known FoV
[263] | Adap. ERP | Expected quality | Popularity-based model
[269] | Adap. | Expected quality | Popularity-based model

A complete streaming algorithm, which considers stalling and a more sophisticated capacity prediction method based on the harmonic mean of past samples, is presented in [260]. The authors derive an efficient heuristic that can maintain a high quality even when the FoV is uncertain, optimizing the quality of the worst tile in the viewport to guarantee a minimum QoE while limiting stalling. However, they do not present a specific FoV prediction method, but analyze performance as a function of the prediction error.
The third way to achieve the same objective without explicitly optimizing the expected QoE is to use Deep Reinforcement Learning (DRL): the sequential approach reduces the multi-dimensional tile quality decision to a sequence of decisions for each single tile [261]. Another DRL solution [262] models the problem as a Markov Decision Process (MDP), optimizing a complex function considering the FoV picture quality, quality variations, and stalling events.
The work assumes that FoV prediction is performed by a neural network, as in [246], and includes the prediction in the model state, along with the capacity and buffer history. Plato [264] is another system that assumes an external prediction as input to a DRL system, in this case performed by an LSTM. A similar solution was presented in [265], modeling buffer overflows explicitly. Another work using DRL [266] performs the FoV prediction implicitly, using an LSTM to keep track of the historical trends in capacity and viewport orientation.
It is also possible to adaptively change the projection: in [267, 268], the compression or size of the tiles of an ERP can be changed according to the user's expected behavior and the expected quality resulting from each scheme. While the authors assume that the future FoV is known in advance, which is obviously unrealistic, this kind of scheme adds a degree of freedom to the streaming optimization. The adaptive projection can also be combined with popularity-based prediction, as in [263]. In [269], the popularity-based prediction is used to derive an adaptive projection with an irregular shape. The trade-off between changing the compression of the tiles at the same resolution and lowering the resolution to increase the bandwidth efficiency has also been explored [100], and the results show that the viewport-based approach has a higher QoE with the same compression.
Techniques based on packet-level coding or Scalable Video Coding (SVC) [270, 271] are also possible: a scheme that protects immersive video data with fountain codes, increasing the redundancy for areas in the FoV while leaving unwatched areas of the sphere unprotected, has been proposed in [272]. In a multipath wireless scenario in which multiple links with fast-varying capacity are available, it is possible to use one wireless path to transmit the video's base layer and another to transmit the enhancement layers, improving the quality of live VR streaming while maintaining full reliability [273].
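The unequal-protection idea behind [272] can be sketched as follows: the fountain-code encoding itself is out of scope here, so the snippet only decides how many repair symbols each tile's data receives, proportionally to its viewing probability. The overhead budget and per-tile figures are illustrative assumptions, not values from the cited work.

```python
def repair_symbols(tiles, overhead=0.2):
    """Distribute a fixed redundancy budget (a fraction of the total number of
    source symbols) across tiles, proportionally to their viewing probability."""
    total_source = sum(k for k, _ in tiles.values())
    budget = overhead * total_source
    norm = sum(p for _, p in tiles.values())
    return {name: round(budget * p / norm) for name, (k, p) in tiles.items()}

# (source symbols per tile, viewing probability): tiles far from the FoV get almost no repair.
tiles = {"front": (100, 0.7), "side": (100, 0.25), "back": (100, 0.05)}
print(repair_symbols(tiles))   # {'front': 42, 'side': 15, 'back': 3}
```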
The DASH paradigm is entirely end-to-end, and does not require any network support. However, several studies have explored the possibility of implementing explicit network support for video streaming: the network can either explicitly communicate with the client and help it make decisions, or provision resources and indirectly improve the conditions perceived by the client, which will then improve the video quality autonomously. Since immersive streaming requires more resources from the network, implicit or explicit support is even more helpful in this scenario.
The most basic form of network support for immersive video is at the design level: the lower layer protocols and their interplay can negatively affect the 360° stream, and design adjustments based on an analysis of these effects can significantly improve performance. Such a study was performed for the LTE network [274], finding several simple solutions that can be implemented without changing the network architecture. The standardization of the 5G requirements and solutions for immersive and VR video streaming is ongoing [275].
Caching is another form of basic network support that can be implemented simply, and is often already in place thanks to Content Delivery Networks (CDNs). Explicitly considering the nature of immersive video can significantly enhance the efficiency of edge caching strategies [276, 277]: by caching the most common fields of view closest to the network edge [278], it is possible to increase the cache hit rate and, consequently, the average QoE. Caching can be combined with edge computing strategies to improve the QoE of Augmented Reality (AR) [279], rendering the virtual content in the user's FoV without the latency that cloud processing entails. It is also possible to extend these techniques, along with a measure of user popularity at any given moment, to optimize multicast immersive streaming in mobile networks [280].
More explicit approaches aim at resource allocation when multiple Radio Access Technologies (RATs) are available [281], exploiting FoV prediction to pair users with access points and effectively use wireless resources. The same optimization can be performed for multiple users on the same network, maximizing the overall QoE by cooperatively downloading different SVC layers [282]. FoV prediction can also be used in multicast scenarios, clustering users with similar points of view and exploiting mmWave multicast [283] to serve them together. With the gradual adoption of 5G technology, it is also possible to combine cellular resource scheduling optimization with encoding tile rate selection [284] to provide low-delay upload of VR content.
Live streaming of AR and VR content is another issue, which is complicated by the limited delay tolerance: experimental studies [285, 286] show that any delay over 10 ms can be perceived by users as annoying, although higher latencies can be tolerated [287]. The issue becomes even more complex when viewport-adaptive schemes are taken into account, as the adaptation scheme needs to react fast enough to changes in the FoV to avoid quality drops [237]. Future networks need to be able to guarantee reliable end-to-end communication below this latency, requiring innovation from the physical layer [288] to the transport layer [289] to enable these applications.
However, network support is not limited to communication: in the case of rendered VR, the network can also help with computation tasks.
Most VR platforms are tethered, using a desktop computer to render the environment in real time: current smartphones do not have the computing and battery power to provide a high-quality VR experience without offloading some of the computational load [290]. Several works have tried to mitigate the latency problems caused by remote rendering, either by reducing the required throughput using compression [291] or by using servers close to the network edge [292]. The Furion platform [293] tries to solve this issue by using FoV prediction techniques to prefetch rendered background content from a remote server, rendering only the foreground objects locally. The use of Mobile Edge Computing (MEC) to provide rendering support to multiple VR users at the same time has also been investigated [294]. The various components of latency in a VR application were analyzed in [295]: the trade-off between network and computation delay, as cloud servers are more powerful but farther away, is a critical design choice for future systems.
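A back-of-the-envelope illustration of this trade-off, using the 10 ms figure mentioned above as the motion-to-photon target, is given below; the individual delay components are invented for the example and do not come from the cited measurements.

```python
def within_budget(components_ms, budget_ms=10.0):
    """Sum the per-stage delays and compare them against the motion-to-photon budget."""
    total = sum(components_ms.values())
    return total, total <= budget_ms

edge_server = {"tracking": 1.0, "network_rtt": 2.0, "rendering": 4.0, "display": 2.0}
cloud_server = {"tracking": 1.0, "network_rtt": 25.0, "rendering": 2.0, "display": 2.0}
print(within_budget(edge_server))    # (9.0, True): a nearby edge server can fit the budget
print(within_budget(cloud_server))   # (30.0, False): a distant cloud server cannot
```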
6. Conclusions and open challenges
Omnidirectional video has gained significant traction, both in the research community and in the industry, and the first commercial HMDs are now several years old. This kind of video presents challenges that call for a redesign of the whole video coding, streaming and evaluation pipeline, taking into account two critical aspects specific to 360° video: the geometric distortion due to the mapping of a spherical surface to 2D planes, and the fact that viewers only experience a limited FoV.
In this survey, we analyzed all aspects of omnidirectional video coding and streaming. First, we reviewed projection methods and the geometric distortion that they can cause, with a description of their effects on video encoders and their compression efficiency. The choice of a projection scheme is often a trade-off between different types of distortion: while approaches based on solids with a larger number of faces approximate the spherical nature of the image better, they also increase the amount of edge distortion, and thus the possibility of visible errors at the seams. The same is true for offset projection: dedicating more pixels to the most probable view increases the average QoE, but sharply reduces it if the user turns around unexpectedly. The subsequent encoding parameters also affect the image quality, and they should be optimized jointly with the projection settings.
The projection and encoding of omnidirectional videos is a critical procedure, as it determines the rate-distortion efficiency of the video streaming system. The research on the subject has evolved far beyond the first examples, which used simple projection schemes and the standard 2D encoding pipeline, but some fundamental trade-offs limit the achievable performance. In particular, the choice of projection affects the rest of the encoding pipeline significantly, and ad hoc region-adaptive quantization schemes need to be devised. Motion models and inter-frame compression also need to be carefully tuned, as no projection can avoid geometric distortion and discontinuities caused by objects crossing face boundaries at the same time.
We then focused on QoE in omnidirectional video: as several subjective studies prove, 2D quality metrics are inaccurate in this scenario, and more intelligent ones that take geometric distortion and viewer attention into account are needed. The dynamic factor also plays a role, as quality variations between segments and tiles can affect QoE in unpredictable ways. In general, measuring QoE in omnidirectional video is a complex problem, and will probably require the use of content-aware learning tools. We then discussed automatic saliency estimation and FoV prediction techniques, which have a critical role in QoE estimation and video streaming: being able to predict the FoV, both for the average user and for the current viewing session, can help compress video better, by allocating more pixels to regions that contain more important content and are viewed more often, but it can also increase the efficiency of tile-based streaming and the accuracy of QoE metrics.
The strong dependence between video content and the effectiveness of different metrics, along with the lack of a single large-scale database of experimental results to use, can result in contradictory evidence, and multiple studies often have different outcomes. However, there are a few guidelines for future research: the inadequacy of 2D metrics such as PSNR in the omnidirectional video domain is evident from most studies, even when corrected and weighted to account for the different geometry.
VIFP seems to be a promising base to develop better omnidirectional QoE metrics, but the hot topic in the field is machine learning: a few learning-based metrics have already been proposed, but they have not been tested on a wider scale or released publicly. Whether the significant performance improvements that machine learning achieved in other applications can be replicated in QoE measurement of omnidirectional video is arguably the biggest open question. Another important, and often overlooked, factor is the dynamic nature of video, which can be crucial in omnidirectional video due to the cybersickness issue: the study of dynamic metrics for omnidirectional video that take into account stalling events and the quality fluctuations caused by adaptive streaming and by the user's head movements is still limited to a few works.
Streaming itself is another active research topic: we examined the three most common approaches to tile-based streaming, and gave a brief overview of viewport-dependent streaming. In particular, schemes that weight the tiles by their viewing probability and importance in the projected FoV and maximize the overall expected QoE, often including dynamic factors such as stalling and quality variations in the optimization, obtain the best performance. However, better FoV prediction is not the only way to improve streaming systems: additional options such as adaptive tiling schemes and SVC are also being investigated, as they can increase bandwidth efficiency and robustness in mobile streaming scenarios. Reinforcement learning-based schemes have recently been under the spotlight, as they can seamlessly integrate data from different sources in their prediction and optimize even complex QoE functions in difficult scenarios with little design effort. Learning-based solutions provide higher accuracy and allow prediction horizons of up to 10 seconds, a critical requirement to avoid stalling in buffer-based streaming systems.
Finally, network-level optimization to support omnidirectional streaming and VR is another subject that is beginning to attract interest: the promises of 5G with regard to resource allocation and optimization, higher capacity, and edge and fog computing provide new interesting scenarios to simplify streaming systems and enable VR over simple devices with limited battery and computing power.
Streaming techniques, along with all other aspects of omnidirectional video coding and evaluation, are rapidly converging towards machine learning as a general solution: the complexity of omnidirectional videos requires a level of context-awareness that traditional analytical techniques cannot provide. Furthermore, the trend in the field is towards joint optimization, not considering each step of the process separately but optimizing them all at once, from projection and coding to streaming and quality evaluation. The first fully integrated models, incorporating historical data from other users, spatial and temporal features of the content, and past history for the specific user, are beginning to appear in the literature, although larger datasets with a varied population of viewers for proper evaluation are not available yet.
Gaze tracking, which is more precise than head orientation tracking, is another possibility that is still largely unexplored due to the cost and complexity of the required experimental setup.
However, the research related to several of the topics presented in this survey is still ongoing, and, given the fast update rate of communication technologies and the rapid growth of deep learning, we can expect the interest in the topic not to fade. In particular, VR is central to the 5G paradigm, and innovations in each of the subjects we considered are needed to meet the high expectations.

References

[1] A. Amin, D. Gromala, X. Tong, C. Shaw, Immersion in cardboard VR compared to a traditional head-mounted display, in: International Conference on Virtual, Augmented and Mixed Reality, Springer, 2016, pp. 269–276.
[2] R. Skupin, Y. Sanchez, Y.-K. Wang, M. M. Hannuksela, J. Boyce, M. Wien, Standardization status of 360 degree video coding and delivery, in: International Conference on Visual Communications and Image Processing (VCIP), IEEE, 2017, pp. 1–4.
[3] V. T. Visch, E. S. Tan, D. Molenaar, The emotional and cognitive effect of immersion in film viewing, Cognition and Emotion 24 (8) (2010) 1439–1445.
[4] L. Lescop, Narrative grammar in 360°, in: International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), IEEE, 2017, pp. 254–257.
[5] N. De la Peña, P. Weil, J. Llobera, E. Giannopoulos, A. Pomés, B. Spanlang, D. Friedman, M. V. Sanchez-Vives, M. Slater, Immersive journalism: immersive virtual reality for the first-person experience of news, Presence: Teleoperators and Virtual Environments 19 (4) (2010) 291–301.
[6] G. Wang, W. Gu, A. Suh, The effects of 360-degree VR videos on audience engagement: Evidence from the New York Times, in: International Conference on HCI in Business, Government, and Organizations, Springer, 2018, pp. 217–235.
[7] U. Schultze, Embodiment and presence in virtual worlds: a review, Journal of Information Technology 25 (4) (2010) 434–449.
[8] A. Steed, S. Friston, M. M. Lopez, J. Drummond, Y. Pan, D. Swapp, An 'in the wild' experiment on presence and embodiment using consumer Virtual Reality equipment, IEEE Transactions on Visualization and Computer Graphics 22 (4) (2016) 1406–1414.
[9] Q. Lin, J. J. Rieser, B. Bodenheimer, Stepping off a ledge in an HMD-based immersive virtual environment, in: Symposium on Applied Perception, ACM, 2013, pp. 107–110.
[10] M. Zink, R. Sitaraman, K. Nahrstedt, Scalable 360° video stream delivery: Challenges, solutions, and opportunities, Proceedings of the IEEE 107 (4) (2019) 639–650.
[11] S. Afzal, J. Chen, K. Ramakrishnan, Characterization of 360-degree videos, in: Workshop on Virtual Reality and Augmented Reality Network, ACM, 2017, pp. 1–6.
[12] Y. Li, J. Xu, Z. Chen, Spherical domain rate-distortion optimization for 360-degree video coding, in: International Conference on Multimedia and Expo (ICME), IEEE, 2017, pp. 709–714.
[13] H. G. Kim, H. Lim, S. Lee, Y. M. Ro, VRSA Net: VR sickness assessment considering exceptional motion for 360° VR video, IEEE Transactions on Image Processing 28 (4) (2019) 1646–1660.
[14] M. Yu, H. Lakshman, B. Girod, A framework to evaluate omnidirectional video coding schemes, in: International Symposium on Mixed and Augmented Reality, IEEE, 2015, pp. 31–36.
[15] Y.-C. Su, K. Grauman, Learning spherical convolution for fast features from 360 imagery, in: Advances in Neural Information Processing Systems, 2017, pp. 529–539.
[16] Z. Chen, Y. Li, Y.
Zhang, Recent advances in omnidirectional video codingfor virtual reality: Projection and evaluation, Signal Processing 146 (2018)66–78.[17] R. Azevedo, N. Birkbeck, F. Simone, I. Janatra, B. Adsumilli, P. Frossard,Visual distortions in 360-degree videos, IEEE Transactions on Circuits andSystems for Video Technology 30 (8) (2020) 2524–2537.[18] M. Xu, C. Li, S. Zhang, P. Le Callet, State-of-the-art in 360 video/imageprocessing: Perception, assessment and compression, IEEE Journal ofSelected Topics in Signal Processing 14 (1) (2020) 5–26.[19] D. He, C. Westphal, J. Garcia-Luna-Aceves, Network support for AR/VRand immersive video application: A survey., in: 14th International Con-ference on Signal Processing and Multimedia Applications (SIGMAP),ICETE, 2018, pp. 525–535.[20] C.-L. Fan, W.-C. Lo, Y.-T. Pai, C.-H. Hsu, A survey on 360 ◦ video stream-ing: Acquisition, transmission, and display, ACM Computing Surveys(CSUR) 52 (4) (2019) 71.[21] J. P. Snyder, Flattening the Earth: two thousand years of map projections,University of Chicago Press, 1997.4622] R. Szeliski, et al., Image alignment and stitching: A tutorial, Foundationsand Trends in Computer Graphics and Vision 2 (1) (2007) 1–104.[23] W. Jiang, J. Gu, Video stitching with spatial-temporal content-preservingwarping, in: Conference on Computer Vision and Pattern Recognition(CVPR) Workshops, IEEE, 2015, pp. 42–48.[24] B. Vishwanath, T. Nanjundaswamy, K. Rose, Rotational motion model fortemporal prediction in 360 video coding, in: 19th International Workshopon Multimedia Signal Processing (MMSP), IEEE, 2017, pp. 1–6.[25] D. Salomon, Transformations and projections in computer graphics,Springer Science & Business Media, 2007.[26] H. Benko, A. D. Wilson, F. Zannier, Dyadic projected spatial augmentedreality, in: 27th Annual Symposium on User Interface Software and Tech-nology, ACM, 2014, pp. 645–655.[27] R. G. Youvalari, A. Aminlou, M. M. Hannuksela, M. Gabbouj, Efficientcoding of 360-degree pseudo-cylindrical panoramic video for virtual real-ity applications, in: 2016 IEEE International Symposium on Multimedia(ISM), IEEE, 2016, pp. 525–528.[28] Y. Wang, R. Wang, Z. Wang, K. Fan, Y. Deng, S. Syu, M.-J. J. Shenzhen,Polar square projection for panoramic video, in: International Conferenceon Visual Communications and Image Processing (VCIP), IEEE, 2017,pp. 1–4.[29] A. Jallouli, F. Kammoun, N. Masmoudi, Equatorial part segmentationmodel for 360-deg video projection, Journal of Electronic Imaging 28 (1)(2019) 013019.[30] A. Safari, A. Ardalan, New cylindrical equal area and conformal mapprojections of the reference ellipsoid for local applications, Survey Review39 (304) (2007) 132–144. 4731] S.-H. Lee, S.-T. Kim, E. Yip, B.-D. Choi, J. Song, S.-J. Ko, Omnidi-rectional video coding using latitude adaptive down-sampling and pixelrearrangement, Electronics Letters 53 (10) (2017) 655–657.[32] C. Wu, H. Zhao, X. Shang, Rhombic mapping scheme for panoramic videoencoding, in: International Forum on Digital TV and Wireless MultimediaCommunications, Springer, 2017, pp. 443–453.[33] W. Chengjia, Z. Haiwu, S. Xiwu, Octagonal mapping scheme forpanoramic video encoding, IEEE Transactions on Circuits and Systemsfor Video Technology 28 (9) (2018) 2402–2406.[34] K. Kammachi-Sreedhar, M. M. Hannuksela, Nested polygonal chain map-ping of omnidirectional video, in: International Conference on Image Pro-cessing (ICIP), IEEE, 2017, pp. 2169–2173.[35] L. Li, Z. Li, M. Budagavi, H. 
Li, Projection based advanced motion modelfor cubic mapping for 360-degree video, in: International Conference onImage Processing (ICIP), IEEE, 2017, pp. 1427–1431.[36] D. G´omez, J. A. N´u˜nez, I. Fraile, M. Montagud, S. Fern´andez, TiCMP: Alightweight and efficient tiled cubemap projection strategy for immersivevideos in web-based players, in: 28th Workshop on Network and OperatingSystems Support for Digital Audio and Video (NOSSDAV), ACM, 2018,pp. 1–6.[37] E. Alshina, J. Boyce, A. Abbas, Y. Ye, AHG8: a study on compressionefficiency of cube projection, Tech. Rep. D0022, JVET (Oct. 2017).[38] C. Zhou, Z. Li, Y. Liu, A measurement study of oculus 360 degree videostreaming, in: 8th Conference on Multimedia Systems (MmSys), ACM,2017, pp. 27–37.[39] J.-L. Lin, Y.-H. Lee, C.-H. Shih, S.-Y. Lin, H.-C. Lin, S.-K. Chang,P. Wang, L. Liu, C.-C. Ju, Efficient projection and coding tools for 360 ◦ ◦ video coding technol-ogy in responses to the joint call for proposals on video compression withcapability beyond HEVC, IEEE Transactions on Circuits and Systems forVideo Technology 30 (5) (2020) 1226–1240.[59] A. Zare, A. Aminlou, M. M. Hannuksela, M. Gabbouj, HEVC-complianttile-based streaming of panoramic video for virtual reality applications, in:24th International Conference on Multimedia, ACM, 2016, pp. 601–605.[60] L. Bagnato, P. Frossard, P. Vandergheynst, Plenoptic spherical sampling,in: 19th International Conference on Image Processing (ICIP), IEEE,2012, pp. 357–360.[61] I. Tosic, P. Frossard, Low bit-rate compression of omnidirectional images,in: Picture Coding Symposium, IEEE, 2009, pp. 1–4.[62] C. Ozcinar, A. De Abreu, S. Knorr, A. Smolic, Estimation of optimalencoding ladders for tiled 360 ◦ VR video in adaptive streaming systems,in: International Symposium on Multimedia (ISM), IEEE, 2017, pp. 45–52.[63] M. Budagavi, J. Furton, G. Jin, A. Saxena, J. Wilkinson, A. Dickerson,360 degrees video coding using region adaptive smoothing, in: Interna-tional Conference on Image Processing (ICIP), IEEE, 2015, pp. 750–754.[64] B. Ray, J. Jung, M.-C. Larabi, A low-complexity video encoder forequirectangular projected 360 video content, in: International Conferenceon Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp.1723–1727. 5165] Y. Liu, L. Yang, M. Xu, Z. Wang, Rate control schemes for panoramicvideo coding, Journal of Visual Communication and Image Representation53 (2018) 76–85.[66] G. Luz, J. Ascenso, C. Brites, F. Pereira, Saliency-driven omnidirectionalimaging adaptive coding: Modeling and assessment, in: 19th InternationalWorkshop on Multimedia Signal Processing (MMSP), IEEE, 2017, pp. 1–6.[67] M. Zhang, J. Zhang, Z. Liu, C. An, An efficient coding algorithm for 360-degree video based on improved adaptive QP compensation and early CUpartition termination, Multimedia Tools and Applications 78 (1) (2019)1081–1101.[68] M. Zhang, X. Dong, Z. Liu, F. Mao, W. Yue, Fast intra algorithm basedon texture characteristics for 360 videos, EURASIP Journal on Image andVideo Processing 2019 (1) (2019) 53.[69] N. Li, S. Wan, F. Yang, Reference samples padding for intra-frame codingof omnidirectional video, in: Asia-Pacific Signal and Information Process-ing Association Annual Summit and Conference (APSIPA ASC), IEEE,2018, pp. 1987–1990.[70] M. Tang, Y. Zhang, J. Wen, S. Yang, Optimized video coding for omni-directional videos, in: International Conference on Multimedia and Expo(ICME), IEEE, 2017, pp. 799–804.[71] J. Boyce, Q. 
Xu, Spherical rotation orientation indication for hevc andjem coding of 360 degree video, in: Applications of Digital Image Pro-cessing, Vol. 10396, International Society for Optics and Photonics, 2017,p. 103960I.[72] Y.-C. Su, K. Grauman, Learning compressible 360 ◦ video isomers, in:Conference on Computer Vision and Pattern Recognition (CVPR), IEEE,2018, pp. 7824–7833. 5273] Y. Zhou, Z. Chen, S. Liu, Fast sample adaptive offset algorithm for 360-degree video coding, Signal Processing: Image Communication 80 (2020)115634.[74] J. Sauer, M. Wien, J. Schneider, M. Bl¨aser, Geometry-corrected deblock-ing filter for 360 video coding using cube representation, in: Picture Cod-ing Symposium (PCS), IEEE, 2018, pp. 66–70.[75] X. Guan, C. Xu, M. Zhang, Z. Liu, W. Yue, F. Mao, A fast intra modeselection algorithm based on CU size for virtual reality 360 ◦ video, Inter-national Journal of Pattern Recognition and Artificial Intelligence (2019)2055001.[76] C. Herglotz, M. Jamali, S. Coulombe, C. Vazquez, A. Vakili, Efficientcoding of 360 ◦ videos exploiting inactive regions in projection formats, in:International Conference on Image Processing (ICIP), IEEE, 2019, pp.1104–1108.[77] P. Hanhart, X. Xiu, Y. He, Y. Ye, 360 ◦ video coding based on projectionformat adaptation and spherical neighboring relationship, IEEE Journalon Emerging and Selected Topics in Circuits and Systems 9 (1) (2018)71–83.[78] R. G. Youvalari, A. Aminlou, M. M. Hannuksela, Analysis of regionaldown-sampling methods for coding of omnidirectional video, in: PictureCoding Symposium (PCS), IEEE, 2016, pp. 1–5.[79] R. G. Youvalari, A. Zare, A. Aminlou, M. M. Hannuksela, M. Gabbouj,Shared Coded Picture technique for tile-based viewport-adaptive stream-ing of omnidirectional video, IEEE Transactions on Circuits and Systemsfor Video Technology 29 (10) (2018) 3106–3120.[80] F. De Simone, P. Frossard, N. Birkbeck, B. Adsumilli, Deformable block-based motion estimation in omnidirectional image sequences, in: 19th53nternational Workshop on Multimedia Signal Processing (MMSP), IEEE,2017, pp. 1–6.[81] L. Li, Z. Li, X. Ma, H. Yang, H. Li, Advanced spherical motion model andlocal padding for 360 ◦ video compression, IEEE Transactions on ImageProcessing 28 (5) (2018) 2342–2356.[82] Y. Wang, D. Liu, S. Ma, F. Wu, W. Gao, Spherical coordinates transform-based motion model for panoramic video coding, IEEE Journal on Emerg-ing and Selected Topics in Circuits and Systems 9 (1) (2019) 98–109.[83] J. Zheng, Y. Shen, Y. Zhang, G. Ni, Adaptive selection of motion modelsfor panoramic video coding, in: International Conference on Multimediaand Expo, IEEE, 2007, pp. 1319–1322.[84] Y. Sun, A. Lu, L. Yu, AHG8: WS-PSNR for 360 video objective qualityevaluation, Tech. Rep. D0040, JVET (Oct. 2016).[85] R. G. Youvalari, A. Aminlou, Geometry-based motion vector scaling foromnidirectional video coding, in: International Symposium on Multimedia(ISM), IEEE, 2018, pp. 127–130.[86] Y. He, Y. Ye, P. Hanhart, et al., Geometry padding for 360 video coding,Tech. Rep. D0075, JVET (Oct. 2016).[87] X. Ma, H. Yang, Z. Zhao, L. Li, H. Li, Coprojection-plane based motioncompensated prediction for cubic format VR content, Tech. Rep. D0061,JVET (Oct. 2016).[88] J. Sauer, J. Schneider, M. Wien, Improved motion compensation for 360 ◦ video projected to polytopes, in: International Conference on Multimediaand Expo (ICME), IEEE, 2017, pp. 61–66.[89] Y. Li, L. Yu, C. Lin, Y. Zhao, M. 
Gabbouj, Convolutional neural networkbased inter-frame enhancement for 360-degree video streaming, in: PacificRim Conference on Multimedia, Springer, 2018, pp. 57–66.5490] L. Skorin-Kapov, M. Varela, T. Hoßfeld, K.-T. Chen, A survey of emerg-ing concepts and challenges for qoe management of multimedia services,ACM Transactions on Multimedia Computing, Communications, and Ap-plications (TOMM) 14 (2s) (2018) 29.[91] A.-F. Perrin, C. Bist, R. Cozot, T. Ebrahimi, Measuring quality of omni-directional high dynamic range content, in: Applications of Digital ImageProcessing, Vol. 10396, International Society for Optics and Photonics,2017.[92] F. Jabar, J. Ascenso, M. P. Queluz, Perceptual analysis of perspectiveprojection for viewport rendering in 360 ◦ images, in: International Sym-posium on Multimedia (ISM), IEEE, 2017, pp. 53–60.[93] F. Jabar, M. P. Queluz, J. Ascenso, Objective assessment of line distor-tions in viewport rendering of 360 º images, in: International Conferenceon Artificial Intelligence and Virtual Reality (AIVR), IEEE, 2018, pp.68–75.[94] E. D. Luis E. Gurrieri, Acquisition of omnidirectional stereoscopic imagesand videos of dynamic scenes: a review, Journal of Electronic Imaging22 (3) (2013) 1–22.[95] Z. Akhtar, K. Siddique, A. Rattani, S. L. Lutfi, T. H. Falk, Why is mul-timedia Quality of Experience assessment a challenging problem?, IEEEAccess 7 (2019) 117897–117915.[96] I.-T. S. G. 12”, Subjective video quality assessment methods for multime-dia applications, Tech. Rep. P.910, ITU (Sep. 1999).[97] A. Singla, S. Fremerey, W. Robitza, P. Lebreton, A. Raake, Comparison ofsubjective quality evaluation for HEVC encoded omnidirectional videos atdifferent bit-rates for UHD and FHD resolution, in: Thematic Workshopsof the International Multimedia Conference, ACM, 2017, pp. 511–519.5598] E. Alshina, J. Boyce, A. Abbas, Y. Ye, JVET common test conditionsand evaluation procedures for 360 degree video, Tech. Rep. G1030, JVET(Jul. 2017).[99] M. Xu, C. Li, Z. Chen, Z. Wang, Z. Guan, Assessing visual quality ofomnidirectional videos, IEEE Transactions on Circuits and Systems forVideo Technology 29 (12) (2018) 3516–3530.[100] I. D. Curcio, H. Toukomaa, D. Naik, Bandwidth reduction of omnidirec-tional viewport-dependent video streaming via subjective quality assess-ment, in: 2nd International Workshop on Multimedia Alternate Realities,ACM, 2017, pp. 9–14.[101] A. Singla, S. G¨oring, A. Raake, B. Meixner, R. Koenen, T. Buchholz,Subjective quality evaluation of tile-based streaming for omnidirectionalvideos, in: 10th Multimedia Systems Conference (MMSys), ACM, 2019,pp. 232–242.[102] A. Singla, W. Robitza, A. Raake, Comparison of subjective quality eval-uation methods for omnidirectional videos with dsis and modified acr,Electronic Imaging 2018 (14) (2018) 1–6.[103] A. Singla, W. Robitza, A. Raake, Comparison of subjective quality testmethods for omnidirectional video quality evaluation, in: 21st Interna-tional Workshop on Multimedia Signal Processing (MMSP), IEEE, 2019,pp. 1–6.[104] W. Zou, F. Yang, W. Zhang, Y. Li, H. Yu, A framework for assessingspatial presence of omnidirectional video on virtual reality device, IEEEAccess 6 (2018) 44676–44684.[105] V. Wanick, G. Xavier, E. Ekmekcioglu, Virtual transcendence experiences:Exploring technical and design challenges in multi-sensory environments,in: 10th International Workshop on Immersive Mixed and Virtual Envi-ronment Systems, ACM, 2018, pp. 7–12.56106] ´A. L. Guedes, G. d. A. Roberto, P. Frossard, S. Colcher, S. D. J. 
Glossary
ACR Absolute Category Rating.
AR Augmented Reality.
AV1 AOMedia Video 1.
AVC Advanced Video Coding.
BMS Boolean Map Saliency.
BP Back Propagation.
CDN Content Delivery Network.
CMP Cubic Mapping Projection.
CNN Convolutional Neural Network.
CP-PSNR Content Preference PSNR.
CP-SSIM Content Preference SSIM.
CPP-PSNR PSNR for Craster Parabolic Projection.
DASH Dynamic Adaptive Streaming over HTTP.
DCR Degradation Category Rating.
DCT Discrete Cosine Transform.
DeepVR-IQA Deep VR Image Quality Assessment.
DMOS Differential Mean Opinion Score.
DRL Deep Reinforcement Learning.
DSIS Double Stimulus Impairment Scale.
ERP Equirectangular Projection.
FoV Field of View.
FSIM Feature Similarity Index.
FSM Fused Saliency Map.
GAN Generative Adversarial Network.
GBVS Graph-Based Visual Saliency.
GoP Group of Pictures.
HEVC High Efficiency Video Coding.
HMD Head-Mounted Display.
ITU International Telecommunication Union.
JVET Joint Video Exploration Team.
k-NN k-Nearest Neighbors.
LSTM Long Short-Term Memory.
MC360IQA Multi-Channel 360° Image Quality Assessment.
MDP Markov Decision Problem.
MEC Mobile Edge Computing.
MOS Mean Opinion Score.
MS-SSIM Multiscale SSIM.
MSE Mean Square Error.
NIQE Natural Image Quality Evaluator.
NPCM Nested Polygonal Chain Mapping.
NQQ Normalized Quality versus Quality factor.
OCP Offset Cubic Projection.
OMAF Omnidirectional Media Format.
OPV Optimal Probabilistic Viewport.
PSNR Peak Signal to Noise Ratio.
PVQ Perceptual Video Quality.
QAVR Quality Assessment in VR systems.
QoE Quality of Experience.
QP Quantization Parameter.
RAT Radio Access Technology.
RBM Rhombic Mapping.
RNN Recurrent Neural Network.
RSP Rotated Sphere Projection.
S-PSNR Sphere-based PSNR.
S-SSIM Spherical SSIM.
SAO Sample Adaptive Offset.
SCP Shared Coded Picture.
SISBLIM Six-Step Blind Metric.
SP Sinusoidal Projection.
SRD Spatial Relationship Description.
SSIM Structural Similarity Index.
SVC Scalable Video Coding.
SVR Support Vector Regression.
TSP Truncated Square Pyramid.
V-CNN Viewport-based CNN.
VIFP Visual Information Fidelity in Pixel Domain.
VR Virtual Reality.
VVC Versatile Video Coding.
WS-PSNR Weighted-to-Spherically-Uniform PSNR.