A Taxonomy and Dataset for 360° Videos

Afshin Taghavi Nasrabadi, Aliehsan Samiei, Anahita Mahzari, Ryan P. McMahan, Ravi Prakash
The University of Texas at Dallas, U.S.A.
{afshin,aliehsan.samiei,anahita.mahzari,rymcmaha,ravip}@utdallas.edu

Mylène C.Q. Farias, Marcelo M. Carvalho
Department of Electrical Engineering, University of Brasilia (UnB)
{mylene,mmcarvalho}@ene.unb.br

ABSTRACT
In this paper, we propose a taxonomy for 360° videos that categorizes videos based on moving objects and camera motion. We gathered and produced 28 videos based on the taxonomy, and recorded viewport traces from 60 participants watching the videos. In addition to the viewport traces, we provide the viewers' feedback on their experience watching the videos, and we analyze the viewport patterns in each category.

KEYWORDS

360° video, Dataset, Viewport, Virtual Reality

Omnidirectional or 360° video is one of the many Virtual Reality (VR) technologies with growing popularity. 360° video applications range from entertainment to education. These videos are usually watched through Head Mounted Displays (HMDs) that enable viewers to explore a scene and look in any direction from a specific point in the scene. However, this new medium poses new challenges for content producers and service providers. For example, 360° videos should have a high spatial resolution (4K or above) to provide an acceptable level of Quality of Experience (QoE) for viewers. Therefore, processing and streaming this type of content is very demanding.

Several solutions have been proposed to stream and render 360° videos based on users' real-time viewport [3, 12]. They take advantage of the fact that users, at any point in time, view only a limited portion of the video. To provide high quality video inside a user's viewport, these methods need to know the user's viewport beforehand, which is typically done using viewport prediction methods. Since the accuracy and duration of existing viewport prediction methods are limited, the viewport cannot be accurately predicted for time intervals longer than one second [13]. This limits the usefulness of viewport prediction under fluctuating network conditions, as the video client has to buffer a long duration of the video to cope with network variations. Any mismatch between the predicted and actual user viewport can be detrimental to QoE.
Another interesting challenge is storytelling, which, so far, has not been well defined for this type of media. Unlike traditional videos, viewers are not limited to watching only a specific part of the scene determined by producers. Also, the effects of camera motion and scene cuts are very different, which requires that producers know how to guide user attention. Currently, there are ongoing efforts to study the effect of different editing techniques on the QoE of 360° videos [16].

Solving the above-mentioned challenges requires study and analysis of user behavior while watching 360° videos. Publicly available viewport datasets facilitate these studies for several reasons. First, the study of viewport traces enables content producers to understand which aspects of a 360° video are important for users and how their attention can be guided. Second, datasets can help to develop and test viewport prediction and saliency detection methods. Moreover, these traces can be used by other researchers for running experiments related to 360° video streaming, as well as salience and visual attention modeling. Since users' viewport patterns are highly influenced by the video content, it is important to have various types of 360° videos in a dataset. Moreover, a taxonomy of videos could be developed and videos could be classified on the basis of a set of attributes. If there is a high correlation between viewing patterns for videos in the same category, and significant differences between videos in different categories, then taxonomy information could be leveraged for viewport prediction. Today, there are several published datasets of users' viewport traces [5][8][2][11][17]. However, they do not provide a comprehensive taxonomy of videos.

We propose a taxonomy for 360° videos that classifies videos based on camera motion and the number of moving targets in a video scene. We gathered and produced 28 scene-cut-free videos based on the proposed taxonomy. We designed a subjective experiment in which 60 viewers watched a subset of the videos. In addition to providing the viewport traces for each viewing session, our dataset includes the viewers' feedback about their experience after watching each video. Viewers described what they focused on and rated their perceived presence and level of discomfort. These responses can be very helpful for studying the viewport traces. We also present a preliminary analysis of the user data that could be helpful in designing viewport prediction algorithms. The dataset is publicly available at https://github.com/acmmmsys/2019-360dataset.

Several datasets provide head movement traces of users watching 360° videos. Corbillon et al. [2] gathered a dataset of viewport traces for five videos with 59 participants. Lo et al. provided a dataset captured with 50 subjects watching 10 videos [11]. Saliency and motion maps of the videos are also available in that dataset. The authors designed a viewport prediction method based on their dataset in [6]. Wu et al. [17] provided a dataset with 18 videos watched by 48 participants. They classified the videos based on their genre, such as sports, documentary, etc. This classification is very general and does not characterize the intrinsic properties of a scene. Their test procedure included two types of experiments. In the first experiment, subjects just watched the videos. In the second experiment, after each video, subjects were asked specific questions about the video content.
This type of experiment forces viewers to pay attention to the content of the videos. As a consequence, viewport samples are more scattered in the first experiment than in the second one. Duanmu et al. [5] provided a dataset of viewport traces in which videos were watched on a computer monitor. The authors compared the similarities and differences of viewing patterns between HMD and computer-based viewing sessions. Fremerey et al. [8] provided a dataset of head movements from 48 subjects watching twenty different videos. Participants also filled out a Simulator Sickness (SS) questionnaire after each set of five sequences. Although the overall discomfort was not very high, female participants experienced a higher discomfort level, which increased over the course of different videos. David et al. provided a dataset of head and eye movements [4] for nineteen videos watched by 57 subjects. The dataset includes head+eye and head-only saliency maps and scan-paths. Interestingly, their results show that there are some differences between head-only and head+eye saliency maps, which are not highly correlated. According to the authors, this is caused by a loss of information in head-only maps.

Generally, users' behavior depends on the content of the video. However, users' viewports are biased towards the center of the video. According to Fremerey et al. [8], most of the time users watch the areas closer to the center of the videos, with 90% of the time spent within a ±◦ deviation from the equator, and 50% of the time within the same interval from the center in the horizontal direction. But, in the horizontal direction, the viewports are more distributed, and nearly 50% of users watched 330° of horizontal view in all video sequences. In [1], thirty 360° videos were shown to 32 participants. The videos were classified into 5 categories based on motion, but no distinction was made between the motion of object(s) in the scene versus camera motion. Their analysis shows that viewport patterns change for different categories. Users' viewport distribution along the yaw axis tends to be more uniform for videos without moving objects. In another study [14], a dataset was created based on 6 videos with a duration of 10 seconds, which is much shorter than previous studies. 17 participants watched the videos, and each video was randomly repeated. They found that most of the fixation points are around moving objects. Moreover, videos with high motion complexity have fewer fixation points. Xu et al. [18] created their own dataset for the purpose of viewport prediction. 58 subjects watched 76 panoramic videos. Analysis of their dataset shows that there is a center bias, and that there is similarity in the magnitude and direction of viewport changes when viewers are co-located.

Our goal is to design a taxonomy of 360° videos that puts videos with similar user viewing patterns in the same category. Users' head movements can be triggered by different features of the content. Several studies on viewport dynamics suggest that user attention is guided by moving targets in the scene [14]. Therefore, the existence of moving objects plays an important role in a taxonomy. Additionally, we believe camera motion can affect viewer attention. In regular videos, camera motion dictates what users see in a scene.
Although in 360° videos users are free to look in any direction, camera motion can alter user behavior and transform the motion of moving objects. So, we would also like to study the effect of camera motion on the user viewport.

Therefore, we propose a two-dimensional taxonomy for 360° videos based on the type of camera motion and the number of moving objects. We classify videos into five categories based on camera motion: 1- Fixed, 2- Horizontal, 3- Vertical, 4- Rotational, 5- Mixture of the previous camera motions. Regarding moving objects in a scene, we study the effect of the number of moving targets on viewport changes. So we have three categories: 1- No moving object, 2- Single moving object, 3- Multiple moving objects. By comparing videos from these categories, we can study the effect of moving objects on viewport patterns. This taxonomy results in a total of fifteen categories, shown in Table 1, with each category corresponding to a <camera movement, number of moving targets> combination (a simple programmatic encoding of this mapping is sketched after Table 1).

We considered two videos for each category in order to examine categorical similarities in viewport patterns. Each video has a duration of one minute with no scene cuts, so there is no discontinuity during each video. In Table 1, each video has a numerical ID from 1 to 30, referred to as the videoID. Each cell contains two video IDs from the same category. For each video, the resolution and frame rate are specified. Most of the videos were chosen from YouTube, but we also produced several videos for categories such as rotational camera movements (videos 4, 19, 20, 22, 23, 24). For the YouTube videos, the links and the start times used (as many videos are longer than one minute) are given inside brackets. Videos 10, 17, 27, and 28 were rotated 265, 180, 63, and 81 degrees to the right, respectively, to re-orient them during playback.

To study users' behavior under the proposed taxonomy, we designed a subjective experiment where participants watch a set of 360° videos using an HMD and answer a set of questions after each video.

Table 1: Taxonomy and video links
Each cell lists two videos in the form: videoID) resolution, frame rate, [YouTube ID (start time)] where available.

Camera Motion: Fixed
  No moving target:        1) 3840x1920 25fps [ESRz3-btZIA (0:40)];  2) 3840x2160 29fps [30cSb3wTc7U (0:00)]
  Single moving target:    3) 3840x2048 29fps [ULixPLH-WA4 (0:07)];  4) 3840x1920 29fps
  Multiple moving targets: 5) 3840x2048 30fps [7IWp875pCxQ (0:18)];  6) 3840x2048 29fps [ze w7Lh97Co (0:05)]

Camera Motion: Horizontal
  No moving target:        7) 3840x2160 25fps [9XR2CZi3V5k (0:01)];  8) 3840x2048 29fps [6TlW1ClEBLY (0:45)]
  Single moving target:    9) 2560x1440 29fps [tVsw0DvAWdE (0:15)];  10) 3840x1920 29fps [cNlQrTkXkOQ (0:15)]
  Multiple moving targets: 11) 3840x2160 24fps [jMyDqZe0z7M (0:00)]; 12) 3840x2160 30fps [2Lq86MKesG4 (0:12)]

Camera Motion: Vertical
  No moving target:        13) 3840x1920 29fps [DgxmQvWEGBU (0:04)]; 14) 3840x1920 29fps [elhdcvKhgbA (0:14)]
  Single moving target:    15) no video; 16) no video
  Multiple moving targets: 17) 3840x2048 30fps [jau-Ric7kls (1:11)]; 18) 3840x2160 29fps [905 oiaJN 0 (0:15)]

Camera Motion: Rotational
  No moving target:        19) 3840x1920 29fps; 20) 3840x1920 29fps
  Single moving target:    21) 3840x2160 29fps [ZRFIdczxxkY (0:04)]; 22) 3840x1920 29fps
  Multiple moving targets: 23) 3840x1920 29fps; 24) 3840x1920 29fps

Camera Motion: Mixed
  No moving target:        25) 3840x2048 25fps [HiRS 6BCyG8 (0:26)]; 26) 3840x2160 29fps [L tqK4eqelA (5:30)]
  Single moving target:    27) 3840x1920 30fps [AX4hWfyHr5g (0:00)]; 28) 3840x1920 29fps [VGY4ksezNkY (2:11)]
  Multiple moving targets: 29) 3840x2160 25fps [p9h3ZqJa1iA (0:00)]; 30) 3840x1906 29fps [H6SsB3JYqQg (1:00)]
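Since each video ID maps deterministically to a taxonomy cell, Table 1 can also be encoded programmatically, which is convenient when grouping trace files by category. The helper below is hypothetical (it is not part of the released Scripts folder) and simply transcribes Table 1; videos 15 and 16 do not exist in the dataset.

```python
CAMERA_MOTION = ["Fixed", "Horizontal", "Vertical", "Rotational", "Mixed"]
MOVING_TARGETS = ["None", "Single", "Multiple"]

# videoID -> (camera motion, number of moving targets), following Table 1.
# IDs 1-6 are Fixed, 7-12 Horizontal, ..., and within each row the two videos
# per cell appear in the order None, Single, Multiple.
TAXONOMY = {
    vid: (CAMERA_MOTION[(vid - 1) // 6], MOVING_TARGETS[((vid - 1) % 6) // 2])
    for vid in range(1, 31)
    if vid not in (15, 16)   # the Vertical / Single cell is empty
}
```

For example, TAXONOMY[9] evaluates to ("Horizontal", "Single"), matching the cell of video 9 in Table 1.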
Figure 1: Spatial and Temporal complexity of dataset videos

The experiment has three main parts: 1) Training, where participants answer an entry questionnaire and watch an introductory video (https://youtu.be/mlOiXMvMaZo starting at 0:30); 2) Main session, where participants watch one video from each of the taxonomy categories and answer a questionnaire after each video; 3) Exit survey, where participants answer a final questionnaire about their overall experience.
Our experiment has been approved by the University Institutional Review Board.

The entry questionnaire is a background survey, which asks the participant's gender, age, and level of experience with VR technology, including how many times the person has watched 360° videos. Then, the subject watches an introductory video to become familiar with the experiment and adjust the HMD. During the experiment, participants sit on a swivel chair and are free to look in any direction. We used the Oculus Go HMD for this study, which is an all-in-one, cable-free mobile HMD, so participants can rotate their heads without cable interference. Since the videos do not have audio, users wear headphones that play white noise to eliminate auditory distractions.

In the main session, each participant watches the videos in a shuffled order of categories to compensate for temporal effects. The shuffled list has one (of the two) videos in each category. When playback finishes for each video, a gray screen is shown. Then, the subject takes off the HMD and is asked the following questions:

Q1) Please describe what you saw while watching the video.
Q2) What did you focus on while watching the video?

Then, participants rate their presence level. We used the following four questions from the self-location portion of the Spatial Presence Experience Scale [9]:

Q3) I felt like I was actually there in the environment of the video.
Q4) It seemed as though I actually took part in the action of the video.
Q5) It was as though my true location had shifted into the environment of the video.
Q6) I felt as though I was physically present in the environment of the video.

Each question can be answered on a 5-point scale, from 1 (do not agree at all) to 5 (fully agree). At the end of this questionnaire, we examine users' discomfort using a discomfort score [7]. The participant chooses a discomfort score in the range from 0 to 10, where 10 is the highest discomfort level and 0 is the lowest. The experiment is terminated if a subject chooses the maximum score at any moment. At the end of the experiment, an exit survey is conducted that asks participants to choose their three favorite videos. All questionnaires are answered on a desktop computer.
We developed a video player in Unity using the Pixvana SPIN Play SDK (https://pixvana.com/spin-sdk/) that plays back the videos and records users' head orientation samples at the HMD's screen refresh rate, which is 60 Hz. The player on the HMD is connected to a server on a PC. The server controls video playback on the HMD and collects the recorded traces from the HMD.

Before presenting the format of the recorded viewport dataset, we explain the coordinate system and the video projection format that were used. All videos are in equirectangular projection. Figure 2 depicts how an equirectangular video is shown to a viewer, along with the coordinate system. During playback, the video is mapped to a sphere. The coordinate system is defined such that the Z axis always points to the center of the equirectangular video. If we assume that a viewer looks in the direction of Z, then the Y axis points up and the X axis points right. At the beginning of video playback, the sphere is rotated along the Y axis to bring the center of the video to the front of the viewer. Note that the image on the surface of the sphere is mirrored, because the image is viewed from the inside of the sphere.

Figure 2: Left: Equirectangular frame. Right: The frame mapped to the sphere and the coordinate system.

For each frame rendered on the HMD screen, we record the timestamp of the sample and the head rotation quaternion relative to the Z axis. We use quaternions because they represent rotations with higher accuracy than Euler angles. We get rotation samples from the UnityEngine.XR.InputTracking class in Unity. The GetLocalRotation(XRNode.CenterEye) function provides the quaternion rotation of the VR HMD. We record the samples and the playback timestamp (with millisecond accuracy) inside the Update() function loop that runs once per rendered frame. In addition to the quaternion rotation, we also include the Cartesian form of the vector that points to the center of the user's viewport. Therefore, the format of the recorded samples is the following:

{timestamp, Qx, Qy, Qz, Qw, Vx, Vy, Vz}

where {Qx, Qy, Qz, Qw} denotes the components of the viewport rotation quaternion and {Vx, Vy, Vz} specifies the vector from the sphere center to the viewport center. We store all samples in a CSV file.
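As an illustration of how these traces can be consumed, the following is a minimal sketch (not part of the released scripts) that loads one trace CSV, assuming the column order above and no header row, and converts each viewport-center vector to yaw and pitch angles under the coordinate convention described above (+Z toward the video center, +Y up, +X right):

```python
import csv
import math

def load_trace(path):
    """Read one viewport trace: rows of
    timestamp, Qx, Qy, Qz, Qw, Vx, Vy, Vz (column order assumed, no header)."""
    samples = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            t = float(row[0])
            quat = tuple(float(x) for x in row[1:5])   # (Qx, Qy, Qz, Qw)
            vec = tuple(float(x) for x in row[5:8])    # viewport-center vector
            samples.append((t, quat, vec))
    return samples

def yaw_pitch(vec):
    """Yaw/pitch (degrees) of a viewport-center vector, assuming
    +Z points to the equirectangular center, +Y up, +X right."""
    vx, vy, vz = vec
    norm = math.sqrt(vx * vx + vy * vy + vz * vz)
    yaw = math.degrees(math.atan2(vx, vz))                            # offset from video center
    pitch = math.degrees(math.asin(max(-1.0, min(1.0, vy / norm))))   # offset from equator
    return yaw, pitch
```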
Our dataset has six main data folders: Traces, Questionnaires, ViewportHeatmap, SampleVideos, Scripts, and Histograms. The Traces folder contains the viewport traces of all participants. Each participant is assigned a 6-character number: the SubjectID. The Traces folder contains one sub-folder for each participant, named according to this SubjectID. It contains a CSV trace file for each video watched by this participant; the CSV filename is formed from the SubjectID and the VideoID (SubjectID_VideoID.csv). The Questionnaires folder contains the participants' responses to all questions. BackgroundQuestionnaire.csv contains the responses to the background survey, and PerVideoQuestionnaire.csv contains the answers to the questionnaire completed after each video. The header of the questionnaire files contains the questions.

We created a heatmap of viewport traces for each video. All heatmaps are included in the ViewportHeatmap folder. We represent each subject's viewport center using a 10° Gaussian kernel and created a heatmap for each frame by applying the kernel for all thirty viewers of a video (a sketch of this computation is given below). The SampleVideos folder contains a subset of the videos in the taxonomy (only the videos that we created ourselves). The video sources are in the Source folder, and viewport-overlaid versions are in the ViewportOverlays folder. The heatmap of each video was merged with its original video to create each viewport-overlaid video; the overlaid versions of all videos in the taxonomy can be found at https://bit.ly/2P1DRR7. These videos provide useful visual representations of how the videos were watched by viewers. Figure 4 shows example frames of a viewport-overlaid video. The Scripts folder includes MATLAB scripts to generate overlay videos from the viewport traces and the source videos. The code for running clustering on the viewport traces is also included in the Scripts folder. Finally, the Histograms folder contains histograms of the yaw and pitch angles for each video.
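The following is a minimal sketch of the kernel-based heatmap computation described above, under the assumption that the 10° parameter is the standard deviation of the Gaussian kernel and that the heatmap is accumulated on an equirectangular grid (the released MATLAB scripts are the authoritative implementation):

```python
import numpy as np

def frame_heatmap(viewport_vectors, height=180, width=360, sigma_deg=10.0):
    """Accumulate a heatmap over an equirectangular grid from the
    viewport-center vectors of all viewers at one frame."""
    # Direction of each pixel center on the unit sphere
    # (+Z toward the video center, +Y up, +X right).
    pitch = np.deg2rad(np.linspace(89.5, -89.5, height))
    yaw = np.deg2rad(np.linspace(-179.5, 179.5, width))
    pitch, yaw = np.meshgrid(pitch, yaw, indexing="ij")
    dirs = np.stack([np.cos(pitch) * np.sin(yaw),   # X
                     np.sin(pitch),                 # Y
                     np.cos(pitch) * np.cos(yaw)],  # Z
                    axis=-1)
    heat = np.zeros((height, width))
    sigma = np.deg2rad(sigma_deg)
    for v in viewport_vectors:
        v = np.asarray(v, dtype=float)
        v /= np.linalg.norm(v)
        # Great-circle distance from each pixel direction to the viewport center.
        ang = np.arccos(np.clip(dirs @ v, -1.0, 1.0))
        heat += np.exp(-0.5 * (ang / sigma) ** 2)
    return heat
```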
A total of 60 people participated in our study, of which 28.3% were female. Each subject watched 14 videos plus the introductory video, with each video being watched by 30 viewers. Table 2 shows the age distribution of the experiment participants, along with their experience with VR technologies.
Table 2: Subjects distribution and VR experience

Gender: 17 Female, 43 Male
Age: 18-21: 19; 22-25: 16; 26-29: 15; 30-33: 7; >40: 3
Mobile VR Exp.: Never: 17; 1-5 times: 31; 6-10 times: 4; 11-20 times: 3; >20 times: 5
Room-Scale VR Exp.: Never: 37; 1-5 times: 15; 6-10 times: 5; >10 times: 3
360° Video Exp.: Never: 21; 1-5 videos: 28; 6-10 videos: 6; >10 videos: 5
Analysis of the distributions of yaw and pitch angles over the whole duration of the videos shows that the pitch angle is biased toward the equatorial section of the video. However, for videos captured from higher altitudes, the samples are more spread out, e.g., videos 17, 18, and 27. For the yaw angle, the distribution depends on the content and the location of regions of interest. For example, videos 23 and 24, which contain rotational camera motion with multiple moving objects, have an almost uniform yaw distribution (see the Histograms folder).

To analyze users' viewport patterns, we use the clustering algorithm proposed by Rossi et al. [15], which clusters viewports based on their overlap in the spherical domain. We use an angle threshold of π/5. The clustering algorithm divides a video into 3-second chunks, and viewports that are less than π/5 apart are grouped into the same cluster.
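A greatly simplified sketch of this idea is shown below: greedy clustering of viewport-center vectors by great-circle distance against the π/5 threshold. The actual algorithm of Rossi et al. operates on viewport trajectories over 3-second chunks, so this is only an illustration of the distance criterion, not a reimplementation of the released clustering scripts:

```python
import numpy as np

def cluster_viewports(vectors, threshold=np.pi / 5):
    """Greedily group viewport-center vectors whose great-circle distance
    to a cluster's mean direction is below `threshold` (radians)."""
    unit = [np.asarray(v, dtype=float) / np.linalg.norm(v) for v in vectors]
    clusters = []                              # each cluster is a list of indices
    for i, v in enumerate(unit):
        for members in clusters:
            center = np.mean([unit[j] for j in members], axis=0)
            center /= np.linalg.norm(center)
            if np.arccos(np.clip(np.dot(v, center), -1.0, 1.0)) < threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```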
Figure 4 shows two viewport-overlaid frames of video 3, from frame a to frame b, which is one second later. Notice from this figure that a group of observers seems to be following the moving target (in this case, the man in the frames).

For the category corresponding to vertical camera movement and multiple moving targets, viewports are more dispersed compared to the no-moving-target videos (see the green bar in the third group of Figure 3). One possible reason for this behavior is that, for most of the duration of these videos, the camera is located at a high distance from the ground level and viewers have a landscape view. The viewport-overlaid versions of these videos show that viewers were more interested in the landscape view and did not focus on any specific area of the video.

Figure 3: Barplot of the average number of clusters per category, along with the error bars.
Figure 4: Two frames from video 3 that show viewers tracking a moving target (walking man).

Figure 5 shows the clustering results separated per video (and ordered by category), and Figure 6 depicts the corresponding number of viewers in the largest cluster for each of these videos. Notice that in some categorical pairs, such as (9,10), (13,14), and (27,28), the number of clusters for the two videos is very different. For example, video 9 shows a view from a racing car that chases another car, while video 10 shows a woman walking in the woods. Both of these videos were classified as having horizontal camera movement and a single moving target. Although for both videos the viewport-overlaid versions show a high concentration of viewports on the moving target, the camera speed in video 9 is much higher and there are few objects (other than the two cars) in the video. Most likely, for single-moving-target videos, when the camera movement direction is aligned with the moving target, viewers are influenced to look at the target, and the number of clusters is smaller for this type of video.

Videos 13 and 14 have a view from inside a glass elevator, with vertical camera motion and no moving targets. One of the main differences between these two videos is that in video 14 (at time 0:22) the elevator stops and the door opens, attracting a lot of the viewers' attention and, therefore, reducing the number of clusters. Figure 7 shows how the number of clusters changes over time for videos 13 and 14, where a drop in the number of clusters can be seen in the curve for video 14 after 22s. Looking more closely, Figure 8 shows the viewport-overlaid views of the frames at instants 14s and 26s of video 14. Videos 17, 18, and 27 are also shot from a high altitude and, based on viewers' feedback and the viewport-overlaid videos, also had a landscape view that viewers found more interesting.

Comparing categories with one moving target, videos with camera motion can have fewer clusters if the camera follows the target. For example, in videos 8, 9, and 21, the camera moves according to the moving target, and the number of clusters is smaller compared to the fixed-camera category. In video 21, the moving target is always at the center of the video, and this video has more viewers in one cluster compared to similar videos without camera motion, e.g., videos 3 and 4. Video 22 has camera rotation, but it does not follow the moving target, and viewers were more dispersed compared to video 21. However, for mixed camera motion, the pattern is not the same: the video scenery and camera elevation affect the viewers.
Figure 5: Barplot of the average number of clusters per video, along with the corresponding error bars.
Figure 6: Barplot of the average number of viewers per video in the most populated cluster, along with the error bars.
Figure 7: Number of clusters for videos 13 and 14.
Figure 8: Two viewport-overlaid frames of video 14.
Figure 9: Subjects' head-movement speed heatmap.
Figure 10: Boxplot of discomfort score per video.
We also observed that viewers have different head movement and angular speed patterns. Some viewers tend to remain still and rotate their heads only occasionally compared to others. Figure 9 shows a heatmap of the average angular speed of viewers for each video. We measured the head movement speed over time intervals of 1 second, i.e., the great-circle distance between two head orientation samples one second apart, divided by the interval duration. In this figure, each row corresponds to a category in the taxonomy and each column to a viewer. Viewers are sorted according to their average angular speed over all videos. It can be observed that viewers with a higher average speed watched all videos with a higher speed, which is represented by bright yellow colors in the map in Figure 9. On the other hand, barring a few exceptions, slower viewers had smaller average speeds for all videos, represented by darker blue colors in the map. This suggests that users could possibly be classified on the basis of their head movement speed, so the head movement speed of a specific user in previous viewing sessions could be used to predict his/her future viewport.
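A small sketch of this speed measure is given below, under the assumption that the two orientation samples bounding each 1-second window are compared via the great-circle distance between their viewport-center vectors (timestamps are assumed to be converted to seconds):

```python
import numpy as np

def angular_speeds(samples, window=1.0):
    """Head-movement speed per window, in degrees per second.
    `samples` is a list of (timestamp_in_seconds, (Vx, Vy, Vz)) pairs,
    sorted by timestamp, taken from the viewport-center columns of a trace."""
    speeds = []
    i, j, n = 0, 0, len(samples)
    while i < n:
        # Find the first sample at least `window` seconds after sample i.
        while j < n and samples[j][0] - samples[i][0] < window:
            j += 1
        if j >= n:
            break
        (t0, v0), (t1, v1) = samples[i], samples[j]
        u0 = np.asarray(v0, dtype=float); u0 /= np.linalg.norm(u0)
        u1 = np.asarray(v1, dtype=float); u1 /= np.linalg.norm(u1)
        angle = np.degrees(np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0)))
        speeds.append(angle / (t1 - t0))
        i = j                                  # advance to the next window
    return speeds
```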
After watching each video, viewers chose their discomfort level. Figure 10 shows a box plot of the users' responses. Although the range of scores is from 0 to 10, for most videos the median score was close to 0. More specifically, videos with a fixed camera received low discomfort scores, while videos with camera motion received slightly higher discomfort scores. For the presence questions, Q3 to Q6, the average score is around 3, and the detailed responses are available in the dataset.
In this paper, we presented a taxonomy and dataset for 360° videos. We analyzed the viewport traces according to the number of viewer clusters in each video. Generally, videos with moving targets have fewer clusters, and a preliminary investigation of the viewport overlays on the videos suggests that users tend to look at moving targets. However, there are some exceptions depending on camera location and video scenery. For example, we observed that videos captured from higher altitudes have a more dispersed viewport distribution irrespective of the number of moving objects. Some viewers tend to explore the scene aggressively while others tend to be more passive, regardless of the nature of the video. In future work, we will study the effect of moving objects on viewport patterns in more detail.

REFERENCES

[1] Mathias Almquist, Viktor Almquist, Vengatanathan Krishnamoorthi, Niklas Carlsson, and Derek Eager. 2018. The Prefetch Aggressiveness Tradeoff in 360 Video Streaming. In Proceedings of the ACM Multimedia Systems Conference, Amsterdam, Netherlands.
[2] Xavier Corbillon, Francesca De Simone, and Gwendal Simon. 2017. 360-degree video head movement dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 199–204.
[3] Xavier Corbillon, Gwendal Simon, Alisa Devlic, and Jacob Chakareski. 2017. Viewport-adaptive navigable 360-degree video delivery. In Communications (ICC), 2017 IEEE International Conference on. IEEE, 1–7.
[4] Erwan J David, Jesús Gutiérrez, Antoine Coutrot, Matthieu Perreira Da Silva, and Patrick Le Callet. 2018. A dataset of head and eye movements for 360° videos. In Proceedings of the 9th ACM Multimedia Systems Conference. ACM, 432–437.
[5] Fanyi Duanmu, Yixiang Mao, Shuai Liu, Sumanth Srinivasan, and Yao Wang. 2018. A Subjective Study of Viewer Navigation Behaviors When Watching 360-Degree Videos on Computers. IEEE, 1–6.
[6] Ching-Ling Fan, Jean Lee, Wen-Chih Lo, Chun-Ying Huang, Kuan-Ta Chen, and Cheng-Hsin Hsu. 2017. Fixation prediction for 360° video streaming in head-mounted virtual reality. In Proceedings of the 27th Workshop on Network and Operating Systems Support for Digital Audio and Video. ACM, 67–72.
[7] Ajoy S Fernandes and Steven K Feiner. 2016. Combating VR sickness through subtle dynamic field-of-view modification. In 3D User Interfaces (3DUI), 2016 IEEE Symposium on. IEEE, 201–210.
[8] Stephan Fremerey, Ashutosh Singla, Kay Meseberg, and Alexander Raake. 2018. AVtrack360: an open dataset and software recording people's head rotations watching 360 videos on an HMD. In Proceedings of the 9th ACM Multimedia Systems Conference. ACM, 403–408.
[9] Tilo Hartmann, Werner Wirth, Holger Schramm, Christoph Klimmt, Peter Vorderer, André Gysbers, Saskia Böcking, Niklas Ravaja, Jari Laarni, Timo Saari, and others. 2015. The spatial presence experience scale (SPES). Journal of Media Psychology (2015).
[10] ITU-T Recommendation P.910. 2008. Subjective video quality assessment methods for multimedia applications. (April 2008).
[11] Wen-Chih Lo, Ching-Ling Fan, Jean Lee, Chun-Ying Huang, Kuan-Ta Chen, and Cheng-Hsin Hsu. 2017. 360° video viewing dataset in head-mounted virtual reality. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 211–216.
[12] Afshin Taghavi Nasrabadi, Anahita Mahzari, Joseph D Beshay, and Ravi Prakash. 2017. Adaptive 360-degree video streaming using scalable video coding. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1689–1697.
[13] Anh Nguyen, Zhisheng Yan, and Klara Nahrstedt. 2018. Your Attention is Unique: Detecting 360-Degree Video Saliency in Head-Mounted Display for Head Movement Prediction. ACM, 1190–1198.
[14] Cagri Ozcinar and Aljosa Smolic. 2018. Visual attention in omnidirectional video for virtual reality applications. IEEE, 1–6.
[15] Silvia Rossi, Francesca De Simone, Pascal Frossard, and Laura Toni. 2018. Spherical clustering of users navigating 360° content. arXiv preprint arXiv:1811.05185 (2018).
[16] Ana Serrano, Vincent Sitzmann, Jaime Ruiz-Borau, Gordon Wetzstein, Diego Gutierrez, and Belen Masia. 2017. Movie editing and cognitive event segmentation in virtual reality video. ACM Transactions on Graphics (TOG) 36, 4 (2017), 47.
[17] Chenglei Wu, Zhihao Tan, Zhi Wang, and Shiqiang Yang. 2017. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference. ACM, 193–198.
[18] Mai Xu, Yuhang Song, Jianyi Wang, MingLang Qiao, Liangyu Huo, and Zulin Wang. 2018. Predicting head movement in panoramic video: A deep reinforcement learning approach.