Sequential Reinforced 360-Degree Video Adaptive Streaming with Cross-user Attentive Network
IEEE TRANSACTIONS ON BROADCASTING SUBMISSION 1
Jun Fu, Zhibo Chen, Senior Member, IEEE, Xiaoming Chen, and Weiping Li, Fellow, IEEE
Abstract—In tile-based 360-degree video streaming, predicting the user's future viewpoints and developing adaptive bitrate (ABR) algorithms are essential for optimizing the user's quality of experience (QoE). Traditional single-user viewpoint prediction methods fail to achieve good performance in long-term prediction, and the recently proposed reinforcement learning (RL) based ABR schemes designed for traditional video streaming cannot be directly applied to tile-based 360-degree video streaming due to the exponential action space. Therefore, we propose a sequential reinforced 360-degree video streaming scheme with a cross-user attentive network. First, considering that different users may have similar viewing preferences on the same video, we propose a cross-user attentive network (CUAN) that boosts the performance of long-term viewpoint prediction by selectively utilizing cross-user information. Second, we propose a sequential RL-based ABR approach (360SRL) that reduces the action space of each decision step from exponential to linear by introducing a sequential decision structure. We evaluate the proposed CUAN and 360SRL using trace-driven experiments, and the results demonstrate that CUAN and 360SRL outperform existing viewpoint prediction and ABR approaches by a noticeable margin.
Index Terms—viewpoint prediction, cross-user, sequential decision structure
I. INTRODUCTION

After years of development, Virtual Reality (VR) has reached a new level of technological maturity and has been increasingly penetrating diverse application areas including entertainment, retail, real estate, education, and healthcare. 360-degree (or panoramic) video is a key component of these emerging VR applications. In particular, 360-degree video allows a user to freely navigate the captured video scene in a panoramic manner by changing his/her desired viewpoints, offering an immersive experience and thus significantly enhancing the user's sense of presence. However, the huge data size of 360-degree video imposes unprecedented challenges on both storage and transmission. For instance, a premium-quality 360-degree video with 120 frames per second and 24K resolution can easily consume a bandwidth of multiple gigabits per second (Gbps) [1]. Also, for smooth rendering, the 360-degree video has to be streamed consistently and reliably at a
high rate [1]. Thus, it is critically significant and imperative to develop effective and efficient methods for 360-degree video transmission. In fact, at any given time a user can only watch a portion of a 360-degree video scene within a field of vision (FoV) centered at a certain direction with limited horizontal and vertical spans. However, video service providers, including YouTube [2], currently deploy 360-degree video streaming in a straightforward manner in which the entire high-quality 360-degree video is delivered to users. This inevitably leads to a waste of bandwidth.

To reduce the transmission bandwidth, many viewpoint-aware panoramic video streaming schemes have been proposed based on dynamic adaptive streaming over HTTP (DASH) [3], such as asymmetric panoramic streaming [4], [5], multi-tier methods [6], [7], and the tile-based panoramic streaming framework [8]–[13]. Among them, tile-based panoramic streaming has become the prevalent approach for transmitting 360-degree videos on the Internet due to its higher efficiency in storage and transmission. In tile-based streaming, as illustrated in Fig. 1, the server spatially splits the 360-degree video into several tiles that are independently encoded at various quality levels, while the client conducts viewpoint prediction and rate adaptation to prefetch video segments from the server.

Fig. 1. Framework of tile-based 360-degree video streaming.

Jun Fu, Zhibo Chen, and Weiping Li are with the CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]; [email protected]; [email protected]). Xiaoming Chen is a professor with the School of Computer Science and Engineering, Beijing Technology and Business University, China (e-mail: [email protected]). Corresponding authors: Zhibo Chen, Xiaoming Chen.
However, it is still technically challenging for such a framework to provide users with high quality of experience (QoE) due to fluctuating network conditions and users' irregular head movement. Therefore, to deal with the above issues, in this paper we focus on two related research problems, i.e., intelligent viewpoint prediction and reinforcement-learning-based rate adaptation.

For the first research problem, the existing schemes can be mainly divided into three categories: 1) single-user algorithms [5], [8], [9], [14], 2) content-based algorithms [15], [16], and 3) cross-user algorithms [10], [12]. These schemes, however, suffer from some key limitations: 1) single-user algorithms exploit a linear regression (LR) model to forecast the user's future viewpoints based on his/her historical viewpoints, which fails to perform well in long-term prediction because it ignores the nonlinear characteristics of head movement; 2) content-based approaches utilize motion and saliency maps of 360-degree videos to improve viewpoint prediction, since the user's viewpoints are largely related to the video content; however, extracting motion and saliency maps from high-definition 360-degree videos is time-consuming, and the obtained maps are not reliable enough [12]; 3) the competitive cross-user algorithm, KNN [10], amends the long-term prediction bias by exploiting the K nearest viewpoints of other users around the result predicted by the LR model; however, the LR model is easily biased, which inevitably impairs the performance of KNN.

To improve upon KNN, we aim to design a better viewpoint prediction scheme. Considering that users with similar viewing patterns may have similar preferences on future video frames, we propose a cross-user attentive network, called CUAN, to predict the user's future viewing trajectory.
Specifically, we first encode the user's and other users' historical viewpoints into hidden vectors via a shared recurrent neural network (RNN), and then conduct viewpoint prediction based on the cross-user information extracted by an attention mechanism.

For the second research problem, the existing methods can be divided into two branches: 1) combinatorial-optimization-based schemes [8], [9], [12] and 2) reinforcement learning (RL) based schemes [17], [18]. There are also several key limitations with these schemes: 1) combinatorial-optimization-based methods heavily rely on an accurate estimate of the bandwidth budget, e.g., overestimating the bandwidth budget endangers playback or even results in rebuffering; moreover, these methods typically do not take the cascading effect of bitrate decisions into consideration; 2) RL-based approaches have shown promising potential in tackling fluctuating network conditions and various QoE objectives, but they cannot be directly applied to tile-based 360-degree video streaming owing to the exponential action space (e.g., given $N$ tiles, each with $M$ bitrate levels, the dimension of the action space is $M^N$).

Therefore, we propose a sequential RL-based method, called 360SRL. Specifically, we introduce a sequential decision structure, i.e., selecting the bitrate for each tile in sequence instead of determining the bitrates of all tiles in one step, which reduces the action space of each decision step from exponential to linear. 360SRL then learns to make ABR decisions solely through thousands of interactions with the deployment environment, instead of relying on pre-programmed models or assumptions about the deployment environment.
As a result, 360SRL is able to learn ABR algorithms that adapt to a wide range of environments and QoE objectives.

The main contributions of this paper can be summarized as follows:
• We propose a cross-user attentive network for viewpoint prediction, named CUAN, which boosts the performance of viewpoint prediction by exploiting cross-user information extracted via an attention mechanism.
• We propose a sequential RL-based method for rate adaptation, called 360SRL, which successfully applies RL to tile-based 360-degree video streaming by introducing a sequential decision structure.
• We integrate CUAN and 360SRL into a 360-degree video adaptive streaming framework. Experimental results show that our prototype outperforms existing methods by a noticeable margin.

This journal paper is an extension of our published short conference paper [19]; the key differences lie in three aspects. First, this journal version sets up a complete panoramic video streaming framework with more theoretical analysis and solves another challenging problem in panoramic video streaming, i.e., viewpoint prediction, with a novel cross-user attentive network. Second, this journal version introduces a simple but efficient cross-user attention mechanism to mitigate viewpoint prediction error, which is validated with detailed experimental analysis. Third, more thorough experimental results and ablation studies are provided to verify the effectiveness of the proposed framework.

The remainder of this paper is organized as follows. Section II discusses related work concerning tile-based 360-degree video streaming. Section III details the proposed approaches for viewpoint prediction and rate adaptation in sequence. Performance evaluation and comparison are presented in Section IV. Finally, Section V concludes the paper and discusses future research directions.

II. RELATED WORK
A. Viewpoint Prediction
For viewpoint prediction, the existing algorithms can be categorized into three classes: single-user, content-based, and cross-user approaches. In the single-user category, Qian et al. [14] and Stefano et al. [20] employ a linear regression (LR) model to forecast the user's future viewpoints. To boost the performance of the LR model, Lan et al. [9] and Xu et al. [5] put forward a probabilistic model estimating the distribution of LR's prediction errors. However, the long-term prediction bias remains large, since the assumption of linear head movement is easily violated. In the content-based category, Fan et al. [16] and Xu et al. [15] leverage saliency and motion maps of 360-degree videos together with an RNN to conduct viewpoint prediction. However, extracting saliency and motion maps from high-definition 360-degree videos is computationally expensive, and the generated maps are not reliable enough [12]. In the cross-user category, CLS [12] groups other users' viewpoints into clusters via a density-based clustering algorithm [21], and then uses an SVM [22] to predict the cluster to which the current user belongs according to his/her past viewing trajectory. KNN [10] amends the prediction bias of the LR model by utilizing the K nearest viewpoints of other users around the predicted result. However, the long-term prediction bias still exists in KNN, since KNN is based on LR. Thus, in this paper, we aim to develop a viewpoint prediction algorithm that improves upon KNN.
B. Rate Adaptation
The existing rate adaptation approaches can be divided into two branches: combinatorial-optimization-based methods and reinforcement learning (RL) based methods. Schemes in the combinatorial optimization category typically formulate rate adaptation as a variant of the multiple-choice knapsack problem (MCKP) [23], which aims to maximize a defined QoE objective given a bandwidth budget. The available bandwidth estimation methods include rate-based ones [24], [25], buffer-based ones [26], [27], and target-buffer-based ones [9]. Since rate adaptation is an NP-hard optimization problem, brute-force search is infeasible. As a result, 360ProDash [9], CLS [12], and Hosseini et al. [8] present a greedy algorithm, a dynamic programming algorithm, and a divide-and-conquer approach, respectively, to solve the MCKP problem in a relatively low-complexity manner. However, these schemes fail to achieve optimal performance across a wide variety of network conditions, since the bandwidth budget is estimated via fixed rules on simplified or inaccurate models of the deployment environment [17]. Besides, these methods do not take the cascading effect of bitrate decisions into consideration; e.g., CLS [12] aims to maximize the quality of the user's FoV regardless of the change of bitrate between two consecutive FoVs. In the RL-based category, Pensieve [17] and D-DASH [18] have shown promising potential in dealing with various network conditions and QoE metrics in traditional video streaming. However, these methods cannot be directly applied to tile-based streaming systems due to the exponential action space. To reduce the action space, the recent method DRL360 [28] allocates the same bitrate to all tiles within the user's FoV regardless of their different importance, which inevitably limits the user's perceived video quality.

III. PROPOSED METHOD

A. Viewpoint Prediction

1) Problem Formulation:
We formulate viewpoint prediction in tile-based 360-degree video streaming as follows. Given the historical viewpoints of the $p$-th user $L^p_t = \{l^p_1, l^p_2, ..., l^p_t\}$, where $l^p_t = (x_t, y_t)$, with $x_t \in [-1, 1]$ and $y_t \in [-1, 1]$ corresponding to the (normalized) longitude and latitude of the viewpoint on a 3D sphere, and the viewpoints of the $M$ other users on the same video $L^M_{t+T}$, viewpoint prediction aims to forecast the future $T$ viewpoints $l^p_i$, $i = t+1, ..., t+T$. Mathematically, the objective of viewpoint prediction can be expressed as

\min_{F} \sum_{k=t+1}^{t+T} \| l^p_k - \hat{l}^p_k \|, \qquad (1)

where $F$ denotes the viewpoint prediction model and $\hat{l}^p_k$ represents the result predicted by $F$.

The existing competitive cross-user approach, KNN [10], chooses the LR model as $F$ and reduces the long-term prediction bias by exploiting the K nearest viewpoints of other users around the predicted result. However, the LR model is easily biased, which inevitably impairs the performance of KNN. As a result, to improve upon KNN, we propose a cross-user attentive network to model $F$, which takes the user's historical viewing trajectory and cross-user viewpoints into consideration. The motivation of this design is two-fold. On the one hand, the user's historical viewpoints provide key clues for viewpoint prediction, since they reveal the user's unique viewing pattern in exploring a video scene. On the other hand, considering that users with similar viewing patterns may have similar preferences on future video frames, we propose an attention mechanism to automatically extract useful cross-user information from the viewpoints of other users. Although the attention mechanism has proven effective in various tasks, such a technique needs to be carefully designed to accommodate the specific problem.
In the task of viewpoint prediction, when generating representations of cross-user information, it is intuitive to pay more attention to the viewpoints of other users who have similar preferences on the same video as the current user. In this paper, the similarity between other users and the current user is calculated based on their past viewing trajectories, rather than on their single-timestamp viewpoints [29]. Intuitively, this way of calculating similarity is more reasonable than the previous method [29], which is verified by our experimental results.
2) Cross-user Attentive Network:
As shown in Fig. 2, our proposed method consists of a trajectory encoder module, an attention module, and a viewpoint prediction module. We detail these modules in sequence.

The trajectory encoder module aims to extract temporal features from users' historical viewpoints. Inspired by the good performance of recurrent neural networks (RNNs) [30] in capturing sequential information [15], we employ the Long Short-Term Memory (LSTM) [31], a variant of the RNN, to encode the viewing path of a user. Specifically, for predicting the viewpoint at the $(t+1)$-th frame, we first feed the historical viewpoints of the $p$-th user into the LSTM:

f^p_{t+1} = h(l^p_1, l^p_2, ..., l^p_t), \qquad (2)

where the function $h(\cdot)$ denotes the input-output function of the LSTM. Then, we use the same LSTM to encode the other users' viewing trajectories:

f^i_{t+1} = h(l^i_1, l^i_2, ..., l^i_{t+1}), \quad i \in \{1, ..., M\}. \qquad (3)

The attention module is designed to extract information related to the $p$-th user from the other users' viewpoints. First, we derive the correlation coefficient between the $p$-th user and the others:

s^{pi}_{t+1} = z(f^i_{t+1}, f^p_{t+1}), \quad i \in \{1, ..., M\} \cup \{p\}, \qquad (4)

where $s^{pi}_{t+1}$ represents the similarity between the $p$-th user and the $i$-th user, and the function $z$ is modeled by the inner product. It is worth noting that $z$ can be modeled in other ways, such as
multiple fully-connected (FC) layers. Then, we normalize the correlation coefficients:

\alpha^{pi}_{t+1} = \frac{e^{s^{pi}_{t+1}}}{\sum_{i' \in \{1,...,M\} \cup \{p\}} e^{s^{pi'}_{t+1}}}. \qquad (5)

Finally, we obtain the fused feature:

g^p_{t+1} = \sum_{i \in \{1,...,M\} \cup \{p\}} \alpha^{pi}_{t+1} \cdot f^i_{t+1}. \qquad (6)

The viewpoint prediction module forecasts the future viewpoints according to the fused feature generated by the attention module. Specifically, the viewpoint at the $(t+1)$-th frame is estimated as

\hat{l}^p_{t+1} = r(g^p_{t+1}), \qquad (7)

where the function $r(\cdot)$ is modeled by one FC layer. It is worth noting that the viewpoints of the future $T$ frames are predicted in a rolling fashion.

Fig. 2. Illustration of the proposed cross-user attentive network, comprising a trajectory encoder module, an attention module, and a viewpoint prediction module. The trajectory encoder module is made up of two stacked Long Short-Term Memory (LSTM) layers, the attention module adopts a self-attention mechanism, and the viewpoint prediction module consists of one fully-connected (FC) layer and an activation function (Tanh).
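The three modules above can be sketched in PyTorch as follows. Layer sizes follow the implementation details reported later (two stacked 32-unit LSTM layers, one FC layer followed by Tanh), while the class and variable names, tensor shapes, and batching conventions are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class CUAN(nn.Module):
    """Minimal sketch of the cross-user attentive network (CUAN)."""

    def __init__(self, hidden=32):
        super().__init__()
        # Shared trajectory encoder h(.): two stacked LSTM layers, Eqs. (2)-(3).
        self.encoder = nn.LSTM(input_size=2, hidden_size=hidden,
                               num_layers=2, batch_first=True)
        # Viewpoint prediction module r(.): one FC layer + Tanh, Eq. (7).
        self.head = nn.Sequential(nn.Linear(hidden, 2), nn.Tanh())

    def forward(self, user_traj, others_traj):
        # user_traj:   (1, t, 2)   historical viewpoints of the current user
        # others_traj: (M, t+1, 2) viewpoints of M other users on the same video
        _, (h_user, _) = self.encoder(user_traj)      # f^p_{t+1}, Eq. (2)
        _, (h_others, _) = self.encoder(others_traj)  # f^i_{t+1}, Eq. (3)
        f_user = h_user[-1]                           # (1, hidden), top layer
        f_all = torch.cat([h_others[-1], f_user], dim=0)  # include user p itself
        # Inner-product similarity z(.), Eq. (4), normalized by softmax, Eq. (5).
        scores = f_all @ f_user.squeeze(0)            # (M+1,)
        alpha = torch.softmax(scores, dim=0)
        # Fused feature g^p_{t+1}, Eq. (6).
        g = (alpha.unsqueeze(1) * f_all).sum(dim=0)
        return self.head(g)                           # predicted (x, y), Eq. (7)
```

Because the head ends in Tanh, predictions stay in the normalized coordinate range $[-1, 1]$; rolling prediction over $T$ future frames would feed each output back as the next input.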
3) Loss function:
The loss function is defined as the sum of the absolute differences between the predicted viewpoints and the ground truth:

L = \sum_{i=t+1}^{t+T} \| \hat{l}^p_i - l^p_i \|. \qquad (8)
4) Implementation Details:
We implement the proposed CUAN in the popular deep learning framework PyTorch [32]. The function $h(\cdot)$ is made up of two stacked LSTM layers, each with 32 neurons. The function $r(\cdot)$ consists of one FC layer with 32 neurons, followed by a Tanh function. The lengths of the historical and future viewpoint windows are set to 1 second and 5 seconds, respectively. During training, 2048 samples are randomly drawn from the dataset per iteration. All trainable variables of CUAN are optimized by Adam [33], with $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, and the number of training epochs is set to 50.

B. Rate Adaptation

1) Problem Formulation:
In tile-based 360-degree video streaming, a 360-degree video is cut into $m$ $T$-second segments, and each segment is spatially split into $N$ tiles that are independently encoded at $M$ bitrate levels. Hence, there are $N \times M$ optional encoded chunks for each segment. Rate adaptation aims to find the optimal bitrate set $X = \{x_{i,j}\} \in \mathbb{Z}^{N \times M}$ (where $x_{i,j} = 1$ means choosing the $j$-th bitrate level for the $i$-th tile and $x_{i,j} = 0$ otherwise) for each segment so as to maximize the user's QoE objective. Mathematically, this problem can be formulated as

\max_{X} \sum_{t=1}^{m} Q_t, \qquad (9)

where $Q_t$ denotes the QoE score of the $t$-th segment.

In this paper, referring to existing QoE models [10], [28], [34]–[37], the QoE score of a segment comprises the following aspects:

• Viewport quality [Mbps]: Since only the video content within the user's FoV is rendered, the video quality depends solely on the viewport quality:

Q^1_t = \sum_{i=1}^{N} \sum_{j=1}^{M} x_{i,j} \cdot p_i \cdot r_{i,j}, \qquad (10)

where $p_i$ denotes the normalized viewing probability of the $i$-th tile and $r_{i,j}$ records the bitrate of chunk $(i, j)$. We suppress the segment index in $x_{i,j}$, $p_i$, and $r_{i,j}$ for compactness and clarity.

• Viewport temporal variation [Mbps]: Since drastic rate changes between two consecutive viewports may cause users dizziness and headache [28], viewport temporal variation is also taken into consideration:

Q^2_t = | Q^1_t - Q^1_{t-1} |. \qquad (11)
Fig. 3. Illustration of the sequential decision structure.

• Viewport spatial variation [Mbps]: We also take viewport spatial variation into account, because rate changes among tiles within the user's FoV may introduce blocking artifacts:

Q^3_t = \frac{1}{2} \sum_{i=1}^{N} \sum_{u \in U_i} p_i \cdot p_u \sum_{j=1}^{M} | x_{i,j} \cdot r_{i,j} - x_{u,j} \cdot r_{u,j} |, \qquad (12)

where $U_i$ denotes the index set of tiles in the 1-hop neighborhood of the $i$-th tile [10].

• Rebuffering [s]: Rebuffering typically occurs when the downloading time of a segment exceeds the buffer occupancy of the client player, which is the most annoying event for video-streaming users [38]. Thus, rebuffering is also a key aspect of the user's QoE [39]:

Q^4_t = \max\left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} x_{i,j} \cdot r_{i,j} \cdot T}{\xi_t} - b_{t-1},\ 0 \right), \qquad (13)

where $\xi_t$ and $b_{t-1}$ denote the network throughput and the buffer occupancy of the video player, respectively.

Based on the above analysis, the QoE objective is defined as

Q_t = Q^1_t - \eta_2 \cdot Q^2_t - \eta_3 \cdot Q^3_t - \eta_4 \cdot Q^4_t, \qquad (14)

where the $\eta_*$ are adjustable parameters, and different $\eta_*$ reflect different user preferences.
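The per-segment QoE of Eqs. (10)–(14) can be computed directly once the bitrates are chosen. The sketch below assumes scalar chosen bitrates per tile (i.e., it folds the one-hot $x_{i,j}$ selection into a single rate value) and uses illustrative $\eta$ weights, which are not values taken from the paper:

```python
def segment_qoe(rates, probs, neighbors, prev_viewport_quality,
                download_mbits, throughput, buffer_s,
                eta2=0.5, eta3=0.5, eta4=8.0):
    """Sketch of the segment-level QoE model, Eqs. (10)-(14).

    rates[i]: chosen bitrate (Mbps) of tile i; probs[i]: normalized viewing
    probability p_i; neighbors[i]: indices of 1-hop neighbor tiles U_i;
    download_mbits / throughput: segment download time; buffer_s: b_{t-1}.
    The eta weights are illustrative placeholders.
    """
    n = len(rates)
    # Eq. (10): viewport quality.
    q1 = sum(probs[i] * rates[i] for i in range(n))
    # Eq. (11): viewport temporal variation.
    q2 = abs(q1 - prev_viewport_quality)
    # Eq. (12): viewport spatial variation (each pair counted twice, halved).
    q3 = 0.5 * sum(probs[i] * probs[u] * abs(rates[i] - rates[u])
                   for i in range(n) for u in neighbors[i])
    # Eq. (13): rebuffering time.
    q4 = max(download_mbits / throughput - buffer_s, 0.0)
    # Eq. (14): weighted QoE objective.
    return q1 - eta2 * q2 - eta3 * q3 - eta4 * q4
```

For example, two mutually neighboring tiles with rates 1 and 2 Mbps and equal viewing probability 0.5 yield a viewport quality of 1.5 Mbps, which is then penalized by the temporal, spatial, and rebuffering terms.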
2) Sequential RL-based ABR Scheme:
Similar to ordinary video streaming [17], [18], we assume that the ABR decision process in tile-based 360-degree video streaming can also be regarded as a Markov Decision Process (MDP), illustrated by the "Top MDP" in Fig. 3. The "Top MDP" shows that, at a specific time $t$, an agent performs an $N$-dimensional action $a_t \in \mathbb{Z}^N$ (the bitrate set of the $t$-th segment), derived from the observed state $s_t$, on the deployment environment, and then receives a feedback signal $r_t$ that reflects the user's QoE. The Top MDP has $M^N$ possible actions at each decision step, which hinders the application of RL-based ABR algorithms [17], [18] due to the exponential action space. To resolve this problem, we propose a novel sequential decision structure that reduces the action space of one step from $M^N$ to $M$. Specifically, we stretch out one transition with an $N$-dimensional action in the "Top MDP" into $N$ cascading transitions with 1-dimensional actions in the "Bottom MDP", tied together by the Bellman equation [40]. As illustrated in the "Bottom MDP", the agent derives a 1-dimensional action $a^i_t$ from the state $s^i_t$, composed of the original state $s_t$ and the set of previously selected actions $\{a^1_t, ..., a^{i-1}_t\}$, and then gets a feedback signal $r^i_t$ recording the value of $a^i_t$. Finally, the agent receives a reward $r_t$ indicating the value of the action $a_t$. It is worth noting that, according to our experimental results, the decision order of tiles in the "Bottom MDP" is set as follows: the agent first picks the lowest bitrate for tiles outside the user's FoV, and then makes bitrate decisions for tiles within the user's FoV in descending order of viewing probability. Next, we introduce the design of the agent in detail.

The state, action, and reward representations of our agent are constructed as follows:

• State. The state describes the situation when we prefetch a new segment, typically including the network bandwidth, the buffer size of the player, and the file sizes of the upcoming segment.
Specifically, for the $i$-th tile in the $t$-th segment, the state is defined as

s^i_t = \{ \vec{\tau}_{t,i}, \vec{P}_{t,i}, \vec{q}_{t,i}, \hat{\xi}_t, Q^1_{t-1}, p^i_t, b^i_{t-1} \}, \qquad (15)

where $\vec{\tau}_{t,i}$ is an $M$-dimensional vector recording the file sizes of the $i$-th tile at different bitrate levels; $\vec{P}_{t,i}$ and $\vec{q}_{t,i}$ are the viewing probabilities and chosen bitrates of the tiles within the 1-hop neighborhood of the $i$-th tile; $\hat{\xi}_t$ is the network throughput estimated by the harmonic mean of the throughput experienced over the past five segment downloads; $Q^1_{t-1}$ records the viewport quality of the last segment; $p^i_t$ denotes the predicted viewing probability of the $i$-th tile; and $b^i_t$ is the estimated buffer occupancy of the video player, calculated as $b^i_t = b_{t-1} - \sum_{h=1}^{i-1} r_{h,a^h_t} T / \hat{\xi}_t$.

• Action. The output of our agent is an $M$-dimensional vector, where each entry corresponds to the probability of choosing a certain bitrate given the current state $s^i_t$. The entry with the highest probability is chosen as $a^i_t$.

• Reward. The reward assesses the value of the chosen action in a certain state, which guides the agent in learning the ABR algorithm. In this paper, the reward $r_t$ in the "Top MDP" is measured by Eq. (14). However, in the "Bottom MDP", the reward of a single transition, $r^i_t$, is not directly available, because the original environment does not change until a decision is made for the last transition. Hence, we modify Eq. (14) to estimate $r^i_t$ as follows:

r^i_t = Q^1_{t,i} - \eta_2 \cdot Q^2_{t,i} - \eta_3 \cdot Q^3_{t,i} - \eta_4 \cdot Q^4_{t,i}, \qquad (16)

Q^1_{t,i} = p^i_t \cdot r_{i,a^i_t}, \qquad (17)

Q^2_{t,i} = p^i_t \cdot | r_{i,a^i_t} - Q^1_{t-1} |, \qquad (18)

Q^3_{t,i} = \frac{1}{2} \sum_{u \in U_i} \delta^u_t \cdot p^i_t \cdot p^u_t \sum_{j=1}^{M} | r_{i,a^i_t} - x_{u,j} \cdot r_{u,j} |, \qquad (19)

Q^4_{t,i} = \max\left( \frac{r_{i,a^i_t} \cdot T}{\hat{\xi}_t} - b^i_t,\ 0 \right), \qquad (20)

where $\delta^u_t = 0$ means the bitrate of the $u$-th tile is not yet determined, and $\delta^u_t = 1$ otherwise.
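The decision order described above can be sketched as follows, with `policy` standing in for the learned actor network (a hypothetical callable, not the paper's implementation) and the state simplified to the pieces relevant to the ordering logic:

```python
import numpy as np

def sequential_bitrate_decision(view_probs, fov_mask, policy, n_levels):
    """Sketch of the "Bottom MDP" decision loop: out-of-FoV tiles get the
    lowest bitrate first, then in-FoV tiles are decided one at a time in
    descending viewing probability, each conditioned on earlier actions."""
    n_tiles = len(view_probs)
    actions = np.full(n_tiles, -1, dtype=int)
    # Tiles outside the FoV: lowest bitrate level (action 0).
    for i in np.flatnonzero(~fov_mask):
        actions[i] = 0
    # Tiles inside the FoV: one 1-dim decision per tile, highest p_i first.
    order = sorted(np.flatnonzero(fov_mask), key=lambda i: -view_probs[i])
    for i in order:
        # s^i_t = (original state, actions chosen so far); simplified here.
        state = (i, view_probs, actions.copy())
        logits = policy(state)                 # M-dim action distribution
        actions[i] = int(np.argmax(logits[:n_levels]))
    return actions
```

Each pass through the loop is one "Bottom MDP" transition, so a segment with $N$ tiles costs $N$ evaluations of an $M$-way policy rather than one evaluation over $M^N$ joint actions.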
3) Training Methodology:
In this paper, we use the state-of-the-art actor-critic RL technique A3C [41] as the fundamental training algorithm of our agent. In the framework of A3C,
Algorithm 1 Training Methodology of 360SRL

Input: discount factor in the "Top MDP", $\gamma_1$; discount factor in the "Bottom MDP", $\gamma_2$; global shared parameter vectors $\theta$, $\theta_v$; global shared counter $C$; thread-specific parameter vectors $\theta'$, $\theta'_v$; maximum number of iterations $C_{max}$.
Output: global shared parameter vectors $\theta$, $\theta_v$.
repeat
  Initialize the thread step counter $t \leftarrow 1$
  Reset gradients: $d\theta \leftarrow 0$ and $d\theta_v \leftarrow 0$
  Synchronize thread-specific parameters: $\theta' \leftarrow \theta$ and $\theta'_v \leftarrow \theta_v$
  Reset the deployment environment and get state $s_t$
  repeat
    for $i \in \{1, ..., N\}$ do
      Obtain $a^i_t$ according to the policy $\pi(a^i_t | s^i_t; \theta')$
      Calculate the single-step reward $r^i_t$
    end for
    Perform $a_t \in \mathbb{Z}^N$ on the deployment environment
    Receive reward $r_t$ and new state $s_{t+1}$
    $C \leftarrow C + 1$; $t \leftarrow t + 1$
  until all segments of a 360-degree video are downloaded
  $R \leftarrow 0$
  for $j \in \{t-1, ..., 1\}$ do
    $R \leftarrow r_j + \gamma_1 R$
    $R' \leftarrow 0$
    for $i \in \{N, ..., 1\}$ do
      $R' \leftarrow r^i_j + \gamma_2 R'$
      $r \leftarrow R + R'$
      Accumulate gradients w.r.t. $\theta'$: $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi_{\theta'}(s^i_j, a^i_j)(r - V(s^i_j; \theta'_v)) + \beta \nabla_{\theta'} H(\pi(s^i_j; \theta'))$
      Accumulate gradients w.r.t. $\theta'_v$: $d\theta_v \leftarrow d\theta_v + \partial (r - V(s^i_j; \theta'_v))^2 / \partial \theta'_v$
    end for
  end for
  Perform asynchronous update of $\theta$ using $d\theta$ and of $\theta_v$ using $d\theta_v$
until $C > C_{max}$

In the framework of A3C, multiple agents are trained asynchronously, and each agent consists of a policy network and a value network. The policy network aims at adjusting its parameters in the direction of increasing the accumulated discounted reward. The gradient of the cumulative discounted reward with respect to the policy parameters, $\theta$, can be written as

\nabla_\theta \mathbb{E}_{\pi_\theta} \Big[ \sum_{t=0}^{\infty} \gamma^t r_t \Big] = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(s, a) A^{\pi_\theta}(s, a) \big], \qquad (21)

where $A^{\pi_\theta}(s, a)$ is called the advantage function, which indicates how much better a specific action is compared with the average action taken according to the policy.
In practice, the unbiased estimate of $A^{\pi_\theta}(s_t, a_t)$ is replaced with the empirically computed advantage $A(s_t, a_t)$ [42]. Therefore, the gradient update of the policy network can be rewritten as

\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(s_t, a_t) A(s_t, a_t), \qquad (22)

where $\alpha$ is the learning rate. The advantage satisfies $A(s_t, a_t) = Q(s_t, a_t) - V^{\pi_\theta}(s_t)$, where $V^{\pi_\theta}(s_t)$ represents the value of the current state $s_t$ and $Q(s_t, a_t)$ is the value of the state-action pair $(s_t, a_t)$. To calculate the advantage, we use the value network to learn the value function $V^{\pi_\theta}(s_t)$ from the empirically observed rewards $r_t$. Following the standard temporal-difference method [42], we update the parameters of the value network, $\theta_v$, as follows:

\theta_v \leftarrow \theta_v - \alpha_v \sum_t \nabla_{\theta_v} \big( r_t + \gamma V^{\pi_\theta}(s_{t+1}; \theta_v) - V^{\pi_\theta}(s_t; \theta_v) \big)^2, \qquad (23)

where $\alpha_v$ is the learning rate of the value network and $\gamma$ is the discount factor. Finally, to ensure that the agent discovers good policies, we add an entropy regularization term to Eq. (22):

\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(s_t, a_t) A(s_t, a_t) + \beta \nabla_\theta H(\pi_\theta(\cdot | s_t)), \qquad (24)

where $H(\cdot)$ is the entropy of the policy at each time step and $\beta$ represents the strength of the entropy regularization term. The details of the training methodology are presented in Algorithm 1.

Fig. 4. The diagram of the proposed 360SRL, comprising a policy network and a value network. The two networks extract features from the input state via 1D convolutional neural networks (1D-CNNs) and fully-connected (FC) layers, concatenate the resulting features via a merge block, and finally generate a distribution over available bitrate levels and a scalar, respectively, through multi-layer perceptrons (MLPs).
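A minimal PyTorch sketch of one update can combine the nested discounted returns of Algorithm 1 with the entropy-regularized losses of Eqs. (22)–(24). The discount factors below follow the values reported in the implementation details (0.99 for the "Top MDP", 1 for the "Bottom MDP"); the function name, the mean-squared value loss, and the `beta` value are illustrative assumptions:

```python
import torch

def a3c_losses(log_probs, values, entropies, top_rewards, bottom_rewards,
               gamma_top=0.99, gamma_bottom=1.0, beta=0.01):
    """Sketch of the 360SRL losses.

    top_rewards[t] is r_t; bottom_rewards[t][i] is r^i_t for tile i.
    log_probs / entropies: per-tile scalar tensors; values: per-tile critic
    outputs of shape (1,); all three are flat lists in (t, i) decision order.
    """
    T = len(top_rewards)
    returns, R = [], 0.0
    for t in reversed(range(T)):                 # outer discounting, "Top MDP"
        R = top_rewards[t] + gamma_top * R
        R_in, tile_returns = 0.0, []
        for r_i in reversed(bottom_rewards[t]):  # inner discounting, "Bottom MDP"
            R_in = r_i + gamma_bottom * R_in
            tile_returns.append(R + R_in)        # r = R + R' in Algorithm 1
        returns = tile_returns[::-1] + returns   # restore forward tile order
    returns = torch.tensor(returns)
    values = torch.stack(values).squeeze(-1)
    advantage = returns - values.detach()        # A(s, a), Eq. (22)
    policy_loss = -(torch.stack(log_probs) * advantage).mean() \
                  - beta * torch.stack(entropies).mean()       # Eq. (24)
    value_loss = (returns - values).pow(2).mean()              # cf. Eq. (23)
    return policy_loss, value_loss
```

In an asynchronous setup, each worker would backpropagate these two losses through its thread-local copies $\theta'$, $\theta'_v$ and push the resulting gradients to the shared parameters.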
4) Implementation Details:
We also use PyTorch [32] to implement our agent. As demonstrated in Fig. 4, the agent feeds the chunk sizes of the $i$-th tile and the viewing probabilities and chosen bitrates of the tiles within the 1-hop neighborhood of the $i$-th tile into 1D convolutional neural network (1D-CNN) layers with 64 filters, each of size 3 with stride 1. The results from these layers are then aggregated with the other inputs in a multi-layer perceptron (MLP) comprising one FC layer with 64 neurons, followed by a softmax function. The value network uses the same architecture, but its final output is a single neuron without an activation function. During training, we set the discount factors of the "Top MDP" and "Bottom MDP" to 0.99 and 1, respectively, use 16 parallel asynchronous agents for exploration, and train until the global counter reaches $C_{max}$. The parameters of each agent are optimized by Adam [33], with $\beta_1$ and $\beta_2$ set to 0.9 and 0.999. Besides, we initialize the entropy weight $\beta$ to 1 and use a step decay schedule that drops $\beta$ by 0.1 at regular intervals during training.

IV. EXPERIMENT

A. Viewpoint Prediction

1) Setup:
We conduct viewpoint prediction experiments on a widely used public dataset, the Shanghai dataset [15]. The Shanghai dataset consists of 208 360-degree videos, whose durations vary from 20 to 60 seconds, and each video has at most 45 viewing trajectories. In this paper, we first normalize the longitude and latitude to be zero-centered with range [-1, 1]. Then, we split the Shanghai dataset into training and testing sets in the same manner as [15]. Besides, following the common setting in viewpoint prediction [15], we sample one frame from every five frames for model training and performance evaluation. In addition, we randomly choose 20 viewing paths as cross-user information for each video.
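The preprocessing steps above can be sketched as follows, assuming the raw viewpoints are given in degrees with longitude in [-180, 180] and latitude in [-90, 90] (a convention not stated explicitly in the text):

```python
import numpy as np

def preprocess_trajectory(lon_deg, lat_deg, step=5):
    """Normalize a viewing trajectory to zero-centered [-1, 1] coordinates
    and keep one frame out of every `step` frames (step=5 in the paper).
    Returns an array of shape (num_kept_frames, 2) with (x, y) pairs."""
    lon = np.asarray(lon_deg, dtype=float) / 180.0  # longitude -> [-1, 1]
    lat = np.asarray(lat_deg, dtype=float) / 90.0   # latitude  -> [-1, 1]
    traj = np.stack([lon, lat], axis=1)
    return traj[::step]                             # subsample 1-in-5 frames
```

With a 1-second history window and 5 frames kept per second, a trajectory prepared this way yields the short input sequences consumed by the encoder.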
2) Evaluation Metrics:
To evaluate the performance of the proposed method, we consider the following metrics:
• Longitude/Latitude Error: the Manhattan distance between the predicted longitude/latitude and its ground truth. The smaller the longitude/latitude error, the better the viewpoint prediction.
• Hit rate: in tile-based 360-degree video streaming, the viewing probabilities of tiles must be predicted correctly, so Xie et al. [12] propose Hit rate to measure the prediction precision of the viewing probabilities of tiles. A higher Hit rate typically corresponds to smaller longitude and latitude errors. The exact calculation of Hit rate can be found in [12].
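A plausible implementation of the two angular errors, assuming raw degrees with wrap-around on the longitude circle (the paper computes them on the normalized coordinates, so this is illustrative only):

```python
def longitude_error(pred_deg, gt_deg):
    # shortest angular distance on the 360-degree longitude circle
    d = abs(pred_deg - gt_deg) % 360.0
    return min(d, 360.0 - d)

def latitude_error(pred_deg, gt_deg):
    # latitude does not wrap, so the plain absolute difference suffices
    return abs(pred_deg - gt_deg)
```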
3) Methods for Comparison:
We compare our method with the following competitive baselines:
• Static: a naive strategy that assumes the user's viewpoint remains still all the time.
• LR [9]: first estimates the trend of head movement from the user's historical viewpoints by the least-squares method, then employs a linear regression model to predict the user's future viewpoints.
• KNN [10]: uses the same strategy as LR to predict the user's future viewpoints. To correct the prediction bias of the LR model, it utilizes the five nearest viewpoints of other users around the predicted result to calculate the viewing probabilities of tiles.
• AME [12]: estimates the user's future viewpoints via a sequence-to-sequence learning-based model. To reduce the long-term prediction error, it proposes an attention mechanism to exploit cross-user
TABLE I
THE AVERAGE RUNTIME OF CUAN AT DIFFERENT LENGTHS OF VIEWPOINT PREDICTION.

Length of viewpoint prediction (s)   CPU (ms)   GPU (ms)
1                                    8.9        7.7
3                                    20.1       19.7
5                                    32.9       32.4

information. It is worth noting that AME directly fuses the predicted viewpoints with other users' viewpoints, and that the contribution of others' viewpoints depends only on the similarity between the user's and others' single-timestamp viewpoints.
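For illustration, the least-squares extrapolation at the core of the LR baseline [9] can be sketched as follows (a toy version applied to one angular coordinate; the actual baseline operates on each coordinate with its own history window):

```python
import numpy as np

def lr_predict(timestamps, angles, future_t):
    # fit a straight line (least squares) to the viewing history and
    # extrapolate it to a future timestamp
    slope, intercept = np.polyfit(timestamps, angles, deg=1)
    return slope * future_t + intercept
```

When the head moves nonlinearly, this extrapolation drifts quickly, which is exactly the bias KNN [10] tries to correct with neighboring users' viewpoints.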
4) Main Results:
Fig. 5 summarizes the performance of each scheme in terms of latitude error, longitude error, and Hit rate over various prediction horizons. In terms of Hit rate at the fifth second, CUAN outperforms AME, KNN, LR, and Static by 3%, 2%, 14%, and 17%, respectively, and also shows superior capability in decreasing longitude and latitude errors compared to existing methods. Similar phenomena are observed at the first and third seconds. We can therefore conclude that CUAN exploits cross-user information better than AME and boosts the performance of KNN. In addition, we present two predicted 5-second trajectories of each approach in Fig. 6. CUAN adapts better to the user's head movement than AME, while LR is easily biased because its assumption of linear head movement is easily violated.
5) Ablation Study:
Firstly, we vary the hidden size of the LSTMs from 4 to 64. Results from this sweep are presented in Fig. 7. The performance of CUAN improves as the hidden size increases and begins to plateau once the hidden size exceeds 32. Hence, we set the hidden size of the LSTMs to 32 and keep this configuration in the following experiments. Secondly, we investigate the impact of cross-user information, the self-attention mechanism (LSTM-based or MLP-based), and the approach to fusing users' embeddings (addition or concatenation) on viewpoint prediction. As seen in Fig. 8, compared to CUAN without cross-user information, methods equipped with cross-user information show superior capability in forecasting users' future viewpoints, especially in long-term prediction. Meanwhile, MLP-based self-attention is inferior to LSTM-based CUAN in terms of latitude error, longitude error, and Hit rate, mainly because LSTMs are better suited than MLPs to capturing sequential information. We also find that adding users' embeddings performs slightly better than concatenating them in the task of viewpoint prediction. Therefore, we merge users' embeddings through addition instead of concatenation in CUAN. Finally, we evaluate the average runtime of CUAN at different lengths of viewpoint prediction. As shown in Table I, CUAN meets the real-time requirements of practical applications.
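The two fusion variants compared in this ablation differ only in how the target user's embedding and the attention-weighted cross-user embedding are merged; schematically, with hypothetical 4-dimensional embeddings:

```python
import numpy as np

u = np.array([0.1, 0.2, 0.3, 0.4])   # target user's embedding (hypothetical)
c = np.array([0.4, 0.3, 0.2, 0.1])   # attention-weighted cross-user embedding

fused_add = u + c                     # CUAN's choice: keeps the dimensionality
fused_cat = np.concatenate([u, c])    # ablated variant: doubles the width
```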
Fig. 5. Comparing CUAN with existing viewpoint prediction algorithms in terms of latitude error, longitude error, and Hit rate at the first, third, and fifth second.

Fig. 6. Visualized results of 5-second viewpoint prediction: (a) Example 1, (b) Example 2. (Each panel plots latitude and longitude against a 0-6 s timestamp axis for CUAN, AME, LR, Static, the viewing history, and the ground truth.)
B. Rate Adaptation

1) Setup:
We evaluate the performance of the proposed 360SRL in the following setting:
• Video Content: We randomly select 30 and 18 360-degree videos from the Shanghai dataset [15] as our training and testing data, respectively. To achieve adaptive streaming, we split each 360-degree video in the × (rows × columns) tiling pattern, encode each tile with quantization parameters (QP) from 22 to 42 in steps of 5 via the open-source encoder x264 [43], and then encapsulate the encoded bit-streams into chunks of 1-second video content.
• Network Traces: We randomly select 50 traces from the HSDPA dataset [44] as training data and use the remaining 28 traces for testing. Since the network capacity is often insufficient to afford even the minimum bitrate, we enlarge the network capacity by 3 Mbps. Meanwhile, to avoid the situation where the network capacity can always afford the maximum bitrate, we cap the network capacity at 8 Mbps.
• Viewpoint Traces: Since the Shanghai dataset [15] collects multiple viewing trajectories for each video, we directly reuse those data.
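The trace rescaling described above amounts to the following transform, reading "enlarge by 3 Mbps" as an additive shift per sample:

```python
def rescale_trace(throughput_mbps):
    # shift each throughput sample up by 3 Mbps so the minimum bitrate is
    # affordable, then cap at 8 Mbps so the maximum bitrate is not always
    # affordable
    return [min(t + 3.0, 8.0) for t in throughput_mbps]
```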
Fig. 7. Studying the influence of the hidden size in terms of latitude error, longitude error, and Hit rate at the first, third, and fifth second.

Fig. 8. Investigating the impact of cross-user information, the self-attention mechanism, and the approach to fusing users' embeddings in terms of latitude error, longitude error, and Hit rate at the first, third, and fifth second. CUAN w/o attention module denotes viewpoint prediction based only on the user's historical information, MLP denotes applying the self-attention mechanism on features extracted by a multi-layer perceptron, and Concat means merging users' embeddings via concatenation instead of addition in the attention module.
All experiments are conducted on an NVIDIA TITAN X GPU platform with an Intel(R) Core(TM) i7-6850K 3.60 GHz CPU.
2) Methods for Comparison:
We compare our method with the following methods, which collectively represent the state of the art in rate adaptation:
• MONO: treats panoramic video streaming as ordinary video streaming and employs the RL-based ABR algorithm proposed by Mao et al. [17] to pick the bitrate for upcoming segments.
• Buffer-based (BB): selects the same bitrate for the tiles within the user's FoV via the buffer-based algorithm proposed by Huang et al. [45], which uses a reservoir of 1 second and a cushion of 5 seconds; i.e., it chooses bitrates with the goal of keeping the buffer occupancy above 1 second and automatically picks the highest available bitrate once the buffer occupancy exceeds 5 seconds. Tiles outside the user's FoV are delivered at the lowest bitrate.
• CLS [12]: first utilizes a target-buffer-based algorithm to predict the bandwidth budget, then performs rate allocation according to the viewing probabilities of tiles. Note that this method aims to maximize the viewport quality regardless of the viewport temporal and spatial variation.
• DRL360 [28]: employs an RL-based algorithm to select the same bitrate for the tiles within the user's FoV and picks the lowest bitrate for the remaining tiles.
For a fair comparison, all ABR schemes use the proposed CUAN to predict the user's future viewpoints.
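The BB baseline's buffer-to-bitrate mapping [45] can be sketched as follows, assuming a linear map over the rate ladder between reservoir and cushion (the original algorithm maps over chunk sizes; this simplified rate-level version is illustrative):

```python
def bb_bitrate(buffer_s, bitrates, reservoir=1.0, cushion=5.0):
    # bitrates must be sorted ascending; below the reservoir stream at the
    # lowest rate, above the cushion at the highest, and map the buffer
    # occupancy linearly onto the rate ladder in between
    if buffer_s <= reservoir:
        return bitrates[0]
    if buffer_s >= cushion:
        return bitrates[-1]
    frac = (buffer_s - reservoir) / (cushion - reservoir)
    return bitrates[int(frac * (len(bitrates) - 1))]
```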
3) Main Results:
Fig. 9. Comparing 360SRL with existing ABR algorithms under three different QoE objectives: (a) η₁ = 1, η₂ = 1, η₃ = 4.; (b) η₁ = 1, η₂ = 2, η₃ = 4.; (c) η₁ = 2, η₂ = 1, η₃ = 4.

Fig. 10. Comparing 360SRL with existing ABR algorithms in terms of viewport quality, viewport temporal quality variation, viewport spatial quality variation, and rebuffering when the QoE objective is η₁ = 1, η₂ = 1, η₃ = 4.

As illustrated in Fig. 9, 360SRL achieves state-of-the-art performance under different QoE objectives, which validates its advantage. Specifically, under η₁ = 1, η₂ = 1, η₃ = 4., 360SRL outperforms MONO, CLS, BB, and DRL360 by 15%, 16%, 21%, and 6% in terms of normalized average QoE, respectively. When users prefer small viewport temporal and spatial variation, 360SRL is also superior to the existing algorithms. To understand where the improvement in average QoE comes from, we analyze the performance of each scheme on the individual terms of the QoE score (Eq. 14). As seen in Fig. 10, under η₁ = 1, η₂ = 1, η₃ = 4., 360SRL is comparable to CLS in terms of viewport quality and rebuffering, while showing superior capability in suppressing viewport temporal and spatial variation. Similar phenomena occur under the other two QoE objectives. Hence, we conclude that its advanced capability in reconciling the individual terms of Eq. 14 helps 360SRL beat the existing competitive ABR algorithms. In addition, compared with DRL360, a large portion of the gain obtained by 360SRL comes from its higher viewport quality.
This is mainly because 360SRL automatically chooses a bitrate for each tile within the user's FoV according to its viewing probability, instead of simply selecting the same bitrate for all of them.
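For reference, a QoE score of the kind decomposed in Fig. 10 has the typical additive form below; the sign conventions and argument names are assumptions, as the paper's exact Eq. (14) is defined earlier in the text:

```python
def qoe_score(quality, temporal_var, spatial_var, rebuffer, eta1, eta2, eta3):
    # reward viewport quality and penalize temporal variation, spatial
    # variation, and rebuffering, weighted by the objective's eta coefficients
    return quality - eta1 * temporal_var - eta2 * spatial_var - eta3 * rebuffer
```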
4) Ablation Study:
Firstly, to investigate the impact of the parametric size of 360SRL on the average QoE, we vary the number of filters in the 1D-CNN and the number of neurons in the FC layer from 4 to 128. As shown in Table II, the average QoE begins to plateau once the number of filters and neurons reaches 64. Thus, we set the number of filters and neurons to 64 and keep it unchanged in the following experiments. We also note that 360SRL is relatively insensitive to the parametric size: the average QoE with 4 filters and neurons is within 1.9% of that with 64. Secondly, we study the impact of the training algorithm on the average QoE. As shown in Table III, A3C achieves better performance than DQN [19] under three different configurations of the QoE objective. As a result, in this paper, we switch the implementation of 360SRL from DQN to A3C without introducing extra challenges. Thirdly, we investigate the impact of the maximum buffer occupancy of the video player on the average QoE. As seen in Table IV, the maximum buffer occupancy has a negligible impact: the average QoE with a 2-second maximum buffer occupancy is within 0.1% of that with 10 seconds. Fourthly, we study the effect of the decision order on the average QoE. Table VI summarizes the performance of the Z-scan order, random order, and the orders of viewing probability from low to high and from high to low. As seen, 360SRL is sensitive to the decision order, perhaps because it is hard for 360SRL to react when a single dimension of the input is changed. Hence, we adopt a greedy strategy, i.e., the order of viewing probability from high to low, as our decision order. Fifthly, we probe the impact of viewpoint prediction on the average QoE. As illustrated in Table V, CUAN achieves a significant improvement in average QoE compared with the existing viewpoint prediction algorithms, which validates the effectiveness of CUAN.
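The greedy decision order can be sketched as follows; `choose_bitrate` stands in for the learned policy and is a hypothetical callback:

```python
def sequential_decisions(view_probs, choose_bitrate):
    # visit tiles in descending viewing probability (the order that performed
    # best in Table VI) and decide one tile's bitrate per step; the decisions
    # made so far become part of the state for later steps
    order = sorted(range(len(view_probs)), key=lambda i: -view_probs[i])
    decisions = {}
    for tile in order:
        decisions[tile] = choose_bitrate(tile, dict(decisions))
    return decisions

# toy stand-in for the learned policy: records the visiting order and
# always returns the lowest bitrate level
visit_log = []
def toy_policy(tile, already_decided):
    visit_log.append(tile)
    return 0
```

With viewing probabilities [0.2, 0.5, 0.3], the tiles are visited in the order 1, 2, 0, so the most likely viewed tile is decided first.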
Finally, we evaluate the average runtime of 360SRL on CPU and GPU platforms. As seen in Table VII, 360SRL meets the real-time requirements of real-world applications.
TABLE II
THE IMPACT OF THE NUMBER OF FILTERS AND NEURONS ON THE AVERAGE QOE WHEN THE QOE OBJECTIVE IS η₁ = 1, η₂ = 1, η₃ = 4.

Number of filters and neurons (each)   Average QoE
4                                      12.17
8                                      12.25
16                                     12.16
32                                     12.26
64                                     12.40
128                                    12.34

V. CONCLUSION
In this paper, we focus on optimizing two indispensablecomponents in tile-based panoramic video adaptive streaming,
TABLE III
THE IMPACT OF THE TRAINING ALGORITHM ON THE AVERAGE QOE.

QoE objective                 DQN   A3C
η₁ = 1, η₂ = 1, η₃ = 4.
η₁ = 1, η₂ = 2, η₃ = 4.
η₁ = 2, η₂ = 1, η₃ = 4.

TABLE IV
THE IMPACT OF THE MAXIMUM BUFFER OCCUPANCY OF THE VIDEO PLAYER ON THE AVERAGE QOE WHEN THE QOE OBJECTIVE IS η₁ = 1, η₂ = 1, η₃ = 4.

Maximum Buffer Occupancy (second)   Average QoE
2                                   12.302
4                                   12.395
6                                   12.397
8                                   12.397
10                                  12.398

i.e., viewpoint prediction and rate adaptation. For the first optimization problem, we design a cross-user attentive network, named CUAN, which enhances the performance of viewpoint prediction through cross-user information extracted by an attention mechanism. For the second optimization problem, we propose a sequential RL-based approach, called 360SRL, which transforms the dimension of the action space from exponential to linear by introducing a sequential decision structure. Experimental results demonstrate that the proposed CUAN and 360SRL achieve state-of-the-art performance.

We also acknowledge the limitations of our current approach and would like to point out several important future directions to make the method more applicable to the real world. First, since the current scheme optimizes viewpoint prediction and rate adaptation individually, without considering the cascading influence of the two modules, a joint optimization framework could further improve the user's QoE. Second, our

TABLE V
THE IMPACT OF VIEWPOINT PREDICTION ON THE AVERAGE QOE WHEN THE QOE OBJECTIVE IS η₁ = 1, η₂ = 1, η₃ = 4.

Method   Average QoE
Static   11.30
LR       9.62
KNN      11.64
AME      11.79
CUAN     12.40

TABLE VI
THE IMPACT OF THE DECISION ORDER ON THE AVERAGE QOE WHEN THE QOE OBJECTIVE IS η₁ = 1, η₂ = 1, η₃ = 4.

Order        Average QoE
Z-scan       11.78
Low → High   10.92
Random       11.37
High → Low   12.40

TABLE VII
THE AVERAGE RUNTIME OF 360SRL ON CPU AND GPU PLATFORMS.

Platform   Average Runtime (ms)
CPU        22
GPU        10

proposed rate adaptation scheme adopts a simple QoE metric, whereas real-world QoE is much more complicated. Although some end-to-end learning-based methods [46] have tried to find more efficient metrics reflecting human visual experience in 2D video streaming, these schemes have not yet been investigated in tile-based 360-degree video streaming.

ACKNOWLEDGMENT

This work was supported in part by NSFC under Grants U1908209, 61571413, and 61632001, and by the National Key Research and Development Program of China under Grant 2018AAA0101400.

REFERENCES
Retrieved, vol. 27, p. 2011, 2011.
[3] T. Stockhammer, "Dynamic adaptive streaming over HTTP: standards and design principles," in Proceedings of the Second Annual ACM Conference on Multimedia Systems. ACM, 2011, pp. 133–144.
[4] K. K. Sreedhar, A. Aminlou, M. M. Hannuksela, and M. Gabbouj, "Viewport-adaptive encoding and streaming of 360-degree video for virtual reality applications," in . IEEE, 2016, pp. 583–586.
[5] Z. Xu, X. Zhang, K. Zhang, and Z. Guo, "Probabilistic viewport adaptive streaming for 360-degree videos," in . IEEE, 2018, pp. 1–5.
[6] L. Sun, F. Duanmu, Y. Liu, Y. Wang, Y. Ye, H. Shi, and D. Dai, "A two-tier system for on-demand streaming of 360 degree video over dynamic networks," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.
[7] L. Sun, F. Duanmu, Y. Liu, Y. Wang, Y. Ye, and H. Shi, "Multi-path multi-tier 360-degree video streaming in 5G networks," in Proceedings of the 9th ACM Multimedia Systems Conference. ACM, 2018, pp. 162–173.
[8] M. Hosseini and V. Swaminathan, "Adaptive 360 VR video streaming: Divide and conquer," in . IEEE, 2016, pp. 107–110.
[9] L. Xie, Z. Xu, Y. Ban, X. Zhang, and Z. Guo, "360ProbDASH: Improving QoE of 360 video streaming using tile-based HTTP adaptive streaming," in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 315–323.
[10] Y. Ban, L. Xie, Z. Xu, X. Zhang, Z. Guo, and Y. Wang, "CUB360: Exploiting cross-users behaviors for viewport prediction in 360 video adaptive streaming," in . IEEE, 2018, pp. 1–6.
[11] J. Le Feuvre and C. Concolato, "Tiled-based adaptive streaming using MPEG-DASH," in Proceedings of the 7th International Conference on Multimedia Systems. ACM, 2016, p. 41.
[12] L. Xie, X. Zhang, and Z. Guo, "CLS: A cross-user learning based system for improving QoE in 360-degree video adaptive streaming," in . ACM, 2018, pp. 564–572.
[13] Y. Ban, L. Xie, Z. Xu, X. Zhang, Z. Guo, and Y. Hu, "An optimal spatial-temporal smoothness approach for tile-based 360-degree video streaming," in . IEEE, 2017, pp. 1–4.
[14] F. Qian, L. Ji, B. Han, and V. Gopalakrishnan, "Optimizing 360 video delivery over cellular networks," in Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges. ACM, 2016, pp. 1–6.
[15] Y. Xu, Y. Dong, J. Wu, Z. Sun, Z. Shi, J. Yu, and S. Gao, "Gaze prediction in dynamic 360 immersive videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5333–5342.
[16] C.-L. Fan, J. Lee, W.-C. Lo, C.-Y. Huang, K.-T. Chen, and C.-H. Hsu, "Fixation prediction for 360 video streaming in head-mounted virtual reality," in Proceedings of the 27th Workshop on Network and Operating Systems Support for Digital Audio and Video. ACM, 2017, pp. 67–72.
[17] H. Mao, R. Netravali, and M. Alizadeh, "Neural adaptive video streaming with Pensieve," in Proceedings of the Conference of the ACM Special Interest Group on Data Communication. ACM, 2017, pp. 197–210.
[18] M. Gadaleta, F. Chiariotti, M. Rossi, and A. Zanella, "D-DASH: A deep Q-learning framework for DASH video streaming," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 703–718, 2017.
[19] J. Fu, X. Chen, Z. Zhang, S. Wu, and Z. Chen, "360SRL: A sequential reinforcement learning approach for ABR tile-based 360 video streaming," in . IEEE, 2019, pp. 290–295.
[20] S. Petrangeli, V. Swaminathan, M. Hosseini, and F. De Turck, "An HTTP/2-based adaptive streaming framework for 360 virtual reality videos," in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 306–314.
[21] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, no. 34, 1996, pp. 226–231.
[22] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
[23] P. Sinha and A. A. Zoltners, "The multiple-choice knapsack problem," Operations Research, vol. 27, no. 3, pp. 503–515, 1979.
[24] J. Jiang, V. Sekar, and H. Zhang, "Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE," IEEE/ACM Transactions on Networking (ToN), vol. 22, no. 1, pp. 326–340, 2014.
[25] Y. Sun, X. Yin, J. Jiang, V. Sekar, F. Lin, N. Wang, T. Liu, and B. Sinopoli, "CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction," in Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016, pp. 272–285.
[26] T.-Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, "A buffer-based approach to rate adaptation: Evidence from a large video streaming service," in ACM SIGCOMM Computer Communication Review, vol. 44, no. 4. ACM, 2014, pp. 187–198.
[27] K. Spiteri, R. Urgaonkar, and R. K. Sitaraman, "BOLA: Near-optimal bitrate adaptation for online videos," in IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications. IEEE, 2016, pp. 1–9.
[28] Y. Zhang, P. Zhao, K. Bian, Y. Liu, L. Song, and X. Li, "DRL360: 360-degree video streaming with deep reinforcement learning," in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. IEEE, 2019, pp. 1252–1260.
[29] C. Li, W. Zhang, Y. Liu, and Y. Wang, "Very long term field of view prediction for 360-degree video streaming," arXiv preprint arXiv:1902.01439, 2019.
[30] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland, "Finite state automata and simple recurrent networks," Neural Computation, vol. 1, no. 3, pp. 372–381, 1989.
[31] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3. IEEE, 2000, pp. 189–194.
[32] A. Paszke, S. Gross, S. Chintala, and G. Chanan, "PyTorch," 2017.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[34] L. Qian, Z. Cheng, Z. Fang, L. Ding, F. Yang, and W. Huang, "A QoE-driven encoder adaptation scheme for multi-user video streaming in wireless networks," IEEE Transactions on Broadcasting, vol. 63, no. 1, pp. 20–31, 2016.
[35] L. Yu, T. Tillo, and J. Xiao, "QoE-driven dynamic adaptive video streaming strategy with future information," IEEE Transactions on Broadcasting, vol. 63, no. 3, pp. 523–534, 2017.
[36] A. Doumanoglou, D. Griffin, J. Serrano, N. Zioulis, T. K. Phan, D. Jiménez, D. Zarpalas, F. Alvarez, M. Rio, and P. Daras, "Quality of experience for 3-D immersive media streaming," IEEE Transactions on Broadcasting, vol. 64, no. 2, pp. 379–391, 2018.
[37] W. J. Tam, F. Speranza, S. Yano, K. Shimono, and H. Ono, "Stereoscopic 3D-TV: visual comfort," IEEE Transactions on Broadcasting, vol. 57, no. 2, pp. 335–346, 2011.
[38] C. W. Chen, P. Chatzimisios, T. Dagiuklas, and L. Atzori, Multimedia Quality of Experience (QoE): Current Status and Future Requirements. John Wiley & Sons, 2015.
[39] T. De Pessemier, K. De Moor, W. Joseph, L. De Marez, and L. Martens, "Quantifying the influence of rebuffering interruptions on the user's quality of experience during mobile video watching," IEEE Transactions on Broadcasting, vol. 59, no. 1, pp. 47–61, 2012.
[40] S. Singh, P. Norvig, D. Cohn et al., "How to make software agents do the right thing: An introduction to reinforcement learning," Adaptive Systems Group, 1996.
[41] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[42] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[43] L. Merritt and R. Vanam, "x264: A high performance H.264/AVC encoder," [Online]. Available: http://neuron2.net/library/avc/overview_x264_v8_5.pdf, 2006.
[44] H. Riiser, P. Vigmostad, C. Griwodz, and P. Halvorsen, "Commute path bandwidth traces from 3G networks: analysis and applications," in Proceedings of the 4th ACM Multimedia Systems Conference, 2013, pp. 114–118.
[45] T.-Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, "A buffer-based approach to rate adaptation: Evidence from a large video streaming service," ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 187–198, 2015.
[46] T. Huang, R.-X. Zhang, C. Zhou, and L. Sun, "QARC: Video quality aware rate control for real-time video streaming based on deep reinforcement learning," in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1208–1216.
Jun Fu received the B.S. degree from the Department of Electrical Engineering, Chongqing University, in 2017, and is currently pursuing the Ph.D. degree at the University of Science and Technology of China (USTC), Hefei, China. His research interests include immersive media streaming, image and video compression, and automatic machine learning.
Zhibo Chen received the B.Sc. and Ph.D. degrees from the Department of Electrical Engineering, Tsinghua University, in 1998 and 2003, respectively. He is now a professor at the University of Science and Technology of China. His research interests include image and video compression, visual quality-of-experience assessment, immersive media computing, and intelligent media computing. He has more than 100 publications and more than 50 granted EU and US patent applications. He is an IEEE Senior Member, Secretary of the IEEE Visual Signal Processing and Communications Committee, and a member of the IEEE Multimedia System and Applications Committee. He was TPC chair of IEEE PCS 2019, an organization committee member of ICIP 2017 and ICME 2013, and has served as a TPC member for IEEE ISCAS and IEEE VCIP.
Xiaoming Chen received the B.Sc. degree from the Royal Melbourne Institute of Technology, Australia, and the Ph.D. degree from The University of Sydney, Australia, in 2004 and 2009, respectively. From 2010 to 2014, he was with the National University of Singapore, CSIRO Australia, and IBM. From 2014 to 2019, he was a Researcher with the Institute of Advanced Technology, University of Science and Technology of China. In 2014, he was invited into the 100-Talent Program by the government of Hefei, Anhui Province, China. He is currently a Professor with the School of Computer Science and Engineering, Beijing Technology and Business University, China. He has also been a Guest Researcher with the Beijing Research Institute, University of Science and Technology of China, since 2019. His research interests include immersive media computing, virtual reality, and related business information systems. His work has been published in ACM MM, IEEE Virtual Reality, IEEE TIP, IEEE T-CSVT, etc.