SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING
A Knowledge-Driven Quality-of-Experience Model for Adaptive Streaming Videos

Zhengfang Duanmu, Student Member, IEEE, Wentao Liu, Student Member, IEEE, Diqi Chen, Zhuoran Li, Student Member, IEEE, Zhou Wang, Fellow, IEEE, Yizhou Wang, Member, IEEE, and Wen Gao, Fellow, IEEE
Abstract—The fundamental conflict between the enormous space of adaptive streaming videos and the limited capacity of subjective experiments poses significant challenges to objective Quality-of-Experience (QoE) prediction. Existing objective QoE models exhibit complex functional forms and fail to generalize well in diverse streaming environments. In this study, we propose an objective QoE model, named the knowledge-driven streaming quality index (KSQI), that integrates prior knowledge of the human visual system and human-annotated data in a principled way. By analyzing subjective responses to streaming videos from a corpus of subjective studies, we show that a family of QoE functions lies in a convex set. Using a variant of projected gradient descent, we optimize the objective QoE model over a database of training videos. The proposed KSQI demonstrates strong generalizability to diverse streaming environments, as evidenced by state-of-the-art performance on four publicly available benchmark datasets.
Index Terms—Quality-of-Experience assessment, adaptive video streaming, quadratic programming.
I. INTRODUCTION

Video traffic in various content distribution networks is expected to occupy 71% of all consumed bandwidth by 2021 and exceed 82% by 2022 [1]. The explosion of data volume introduced by video streaming will quickly drain available network bandwidth in the next decade. Concurrent with the scarcity of network resources is the steady rise in user demands on video quality. With the emergence of new technologies such as 4K, high dynamic range, wide color gamut, and high frame rate, viewers' expectations of video quality are higher than ever. The diversity of streaming environments and the complexity of human Quality-of-Experience (QoE) responses have posed significant challenges to optimal content distribution services.

Adaptive bitrate (ABR) algorithms are the primary tools for modern Internet over-the-top (OTT) video streaming services. In a dynamic adaptive streaming environment, ABR achieves player-driven bitrate adaptation by providing video streams in a variety of bitrate and quality levels and breaking them into small HTTP file segments. Throughout the streaming process,
Z. Duanmu, W. Liu, Z. Li, and Z. Wang are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: {zduanmu, w238liu, z777li, zhou.wang}@uwaterloo.ca). D. Chen is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China (e-mail: [email protected]). Y. Wang and W. Gao are with the School of Electronic Engineering and Computer Science, Peking University, Beijing, 100871, China (e-mail: {yizhou.wang, wgao}@pku.edu.cn).

the video player at the client adaptively switches among the available streams by selecting segments based on playback rate, buffer condition, and instantaneous throughput, primarily to optimize viewers' QoE [2]–[7]. With many ABR algorithms at hand, it becomes pivotal to measure their performance so as to guide network resource allocation. Therefore, the development of an accurate objective QoE model lies at the root of ABR systems. State-of-the-art QoE models employ sophisticated machine learning techniques such as random forests [8], support vector machines [9], and neural networks [10] to model the complex human visual system (HVS). The success of these approaches heavily depends on the quantity and quality of training data, both of which are extremely limited in practice. First, there is a major conflict between the enormous sample space and the limited capacity for subjective QoE measurements. For example, the number of possible adaptation patterns in a streaming video with d temporal segments and b encoding levels is b^d, which further expands with respect to the source content, encoder type, and rebuffering patterns. By contrast, collecting ground-truth data via subjective testing is expensive and time-consuming.
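To make the scale of this conflict concrete, the number of adaptation patterns can be computed directly; the segment and level counts below are illustrative, not taken from any dataset in this paper.

```python
def num_adaptation_patterns(b: int, d: int) -> int:
    """Each of the d temporal segments can independently be served at any of
    the b encoding levels, giving b**d possible adaptation patterns."""
    return b ** d

# Even a short session, e.g. 30 segments at 5 encoding levels, yields
# roughly 9.3e20 patterns, far beyond what any subjective study could cover.
print(num_adaptation_patterns(5, 30))
```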
The largest publicly available subject-rated QoE database contains only a limited number of samples [11], which are extremely sparsely distributed in the sample space. Second, learning-based models assume that the training samples and testing samples come from the same distribution. However, this assumption has never been justified in existing studies and may hardly hold in practice. A motivating example is shown in Fig. 1, which presents the probability density functions of video presentation quality measured by a state-of-the-art VQA model, VMAF [12], rebuffering duration, and quality adaptation magnitude in six publicly available streaming QoE datasets. Clearly, there is significant variability in the characteristics of streaming videos across different datasets, suggesting that an objective QoE model optimized on a simple dataset such as WaterlooSQoE-I [13] may yield very poor predictions on complex datasets such as WaterlooSQoE-III [14], WaterlooSQoE-IV [11], and LIVE-NFLX-II [15], and vice versa. The streaming video probability density estimation is further complicated by the concept drift problem [16], where the characteristics of streaming videos change over time. For example, the drift in streaming video distribution may arise from the advancement of video acquisition [17]–[19], compression [20]–[22], transmission [2]–[7], and reproduction systems [23]–[25], and the steady rise in viewers' expectations of video quality [26], [27].

Figure 1. There exists significant variance in the characteristics of streaming videos, as evidenced by the distributions of (a) VMAF, (b) rebuffering duration, and (c) adaptation magnitude in six publicly available datasets.

Consequently, the construction of
a large-scale and representative training dataset remains an elusive goal.

Hindered by these two fundamental conflicts, it is highly difficult to develop objective QoE models that generalize to diverse impairments. To this end, we propose an objective QoE model, named the knowledge-driven streaming quality index (KSQI), that integrates prior knowledge of the HVS with a limited number of training data. From a Bayesian perspective, we show that one possible solution to the QoE assessment problem resides in a deeper understanding of the HVS.

Given a collection of subjective QoE studies, how do we make use of the knowledge in a principled and scalable manner? To answer this question, we analyze the HVS properties observed in existing subjective QoE studies, from which we derive a system of linear inequalities. We further show that a family of objective QoE models lies within a convex set that results from the intersection of a hyperplane and a positive cone in a functional space. This gives us both guidance on the form of our model and the constraints it must satisfy.

Building upon these insights into HVS properties, how do we design an objective QoE model that can accurately predict subjective QoE responses? We demonstrate that the QoE model parameter estimation problem can be formulated as a quadratic programming problem. Using a variant of projected gradient descent, we optimize the proposed model over a database of training samples with limited adaptation patterns. The resulting model is computationally efficient, mathematically well-behaved, and perceptually grounded.

We compare KSQI to ten objective QoE models on four benchmark datasets covering a broad set of video contents, encoder configurations, network conditions, ABR algorithms, and viewing devices. KSQI shows very strong generalizability in all considered scenarios, significantly outperforming all existing schemes. We show that the proposed model is superior not only on average, but also in extreme cases, via a set of intuitive examples.
We have made the implementations of all objective QoE models available at https://github.com/zduanmu/ksqi to facilitate future objective QoE research. In summary, this paper makes the following key contributions:
• A mathematical analysis of the space of QoE functions for adaptive streaming videos;
• The design of an objective QoE model combining the constraints from our analysis with human-annotated data;
• By far the most comprehensive evaluation of objective QoE models.

II. RELATED WORKS
Over the past decade, QoE for adaptive streaming videos has been a rapidly evolving research topic and has attracted an increasing amount of attention from both industry and academia. Existing objective QoE models can be categorized based on the observation space and the functional form they rely on to make QoE predictions. The earliest QoE models simply correlate the statistics of rebuffering events with QoE [28], assuming rebuffering dominates the viewing experience. Although several efforts have been put forth to improve prediction accuracy [29], [30], the neglect of picture quality reduces the relevance of these QoE models to real-world ABR scenarios. To overcome the limitations of rebuffering-centric QoE models, many studies propose to complement rebuffering duration with average bitrate or quantization parameter (QP) as the input to the quality prediction model. These models compute QoE as a weighted average of bitrate/QP and rebuffering duration, where the trade-off parameter is either determined empirically [31] or by solving a regression problem [32]. Motivated by observations that frequent quality adaptations annoy viewers [27], [33]–[35], a number of models have been developed to explicitly quantify the quality adaptation experience and then linearly combine it with average bitrate and rebuffering duration [4], [5], [34]. Due to their simplicity, these models remain the standard criteria for the assessment and optimization of ABR systems [4], [5], [7], [36]. Despite their demonstrated success, the aforementioned QoE models assume that every single bit contributes equally to video quality. However, this assumption is fundamentally flawed according to rate-distortion theory [37], and may deteriorate in different compression, transmission, and reproduction systems [22], [24], [38].
Recent efforts suggest replacing bitrate with state-of-the-art video quality assessment (VQA) models [12], [39], [40] as the presentation quality measure [6], [13], [41], achieving highly competitive performance on existing benchmarks. All these models make a priori assumptions about the form of the QoE functions. The subjective QoE response with respect to rebuffering and quality adaptation, however, can deviate significantly from an exponential or logarithmic function.

Assuming that the subjective QoE response is too complex to model with simple parametric functions, recent objective QoE models utilize machine learning techniques such as non-linear auto-regressive models [42], neural networks [43], support vector machines [44], and random forests [27], [45] to map streaming video features to subjective opinion scores. Although these models can fit arbitrarily complex continuous functions [46], they are often susceptible to overfitting. The defect is exacerbated by the limited capacity for subjective experiments and the concept drift problem, as will be demonstrated in the subsequent sections.

In addition to the specific limitations the two kinds of models may respectively have, they may qualitatively contradict the HVS properties observed in existing subjective studies. Furthermore, objective QoE models have not received comprehensive evaluation on subject-rated datasets of diverse video contents, encoder types, network conditions, ABR algorithms, and viewing devices. While many recent works acknowledge the importance of objective QoE models [3], [4], [7], [36], [47], a careful analysis, modeling, and evaluation of these models has yet to be done. We wish to address this void. In doing so, we seek a good compromise between 1) simple and rigid models built upon an intuitive understanding of QoE and 2) complex and indefinite models requiring a significant number of training samples.

III. CHARACTERIZING THE QOE FUNCTIONS
Despite the complexity of the HVS, there are three widely accepted key influencing factors in the QoE of streaming videos: video presentation quality, rebuffering, and quality adaptation (switching between profiles) [4], [5], [14], [15], [48]–[50]. The simplest approach to parametrizing the QoE function is to assume the three elements are additive. Formally, the QoE of chunk t is determined by

Q_t = P_t + S_t + A_t,

where P_t, S_t, and A_t denote the presentation quality, the rebuffering QoE function, and the adaptation QoE function of chunk t, respectively. For simplicity, we will drop the subscript t from P_t and Q_t, denote P_{t−1} by lowercase p, and P_t − P_{t−1} by Δp in the rest of this section unless otherwise specified. While the presentation quality P can be obtained from subjective tests or a reliable VQA model, the two QoE functions S and A are not well studied. Defining the space of QoE functions helps us build a model of these functions. It not only guides us as to the form such a model should take, but also determines the constraints these functions must satisfy. We begin by summarizing observations from a collection of existing subjective QoE studies, and then formulate the domain knowledge to define the space of these functions.

A. Space of Rebuffering Experience Function S

First, we assume that the influence of each rebuffering event is independent, additive, and only determined by the previous chunk's presentation quality p and the rebuffering duration τ [13], [50], [51]. This assumption allows us to analyze each rebuffering event separately, and reduces the dimensions of the functional space [13].
As such, S can be modeled as a bivariate function S(p, τ) ∈ {S | S : R² → R}.

Second, various subjective tests [52], [53] have attested that rebuffering duration is negatively correlated with the overall QoE of streaming videos, while very short rebuffering may not be perceived and thus has little impact on QoE [54], [55]. Formally,

S(p, τ1) ≥ S(p, τ2), ∀ p, τ1 ≤ τ2
S(p, 0) = 0, ∀ p.   (1)

The third assumption is that the QoE drop tends to be greater when the presentation quality of the previous chunk is higher, i.e.,

S(p1, τ) ≥ S(p2, τ), ∀ p1 ≤ p2, τ.   (2)

Such a trend has been observed in recent subjective tests [13], [50], and may be explained by the expectation confirmation theory [26].

The fourth assumption is elicited from the fact that, given a constant presentation quality and a fixed total duration of rebuffering, the overall QoE degrades as the number of rebuffering occurrences increases [30], [56], [57]. Mathematically, this may be expressed as

S(p, τ1) + S(p, τ2) ≤ S(p, τ1 + τ2), ∀ p, τ1, τ2.   (3)

The fifth remark on S is that, given the same rebuffering duration, videos with higher presentation quality consistently deliver higher overall QoE, despite the greater penalty for the rebuffering event [51]. This statement can be formulated as

S(p1, τ) + p1 ≤ S(p2, τ) + p2, ∀ p1 ≤ p2, τ.   (4)

In summary, we define the theoretical space of rebuffering QoE functions S as

W_S := {S : R² → R | S(p, 0) = 0, S(p, τ1) ≥ S(p, τ2), S(p1, τ) ≥ S(p2, τ), S(p, τ1) + S(p, τ2) ≤ S(p, τ1 + τ2), S(p1, τ) + p1 ≤ S(p2, τ) + p2, ∀ p, τ, τ1 ≤ τ2, p1 ≤ p2}.   (5)

The equality constraint and inequality constraints in (5) represent a hyperplane and a positive cone, respectively [58]. The convexity of the hyperplane and the cone determines that their intersection is also convex.

B. Space of Adaptation Experience Function A

Similar to video rebuffering, we first assume that the influence of each quality adaptation event is independent and additive, and only depends on the instantaneous presentation quality p and the intensity of the adaptation Δp [33]–[35], [56]. Thus the adaptation QoE function A lies in the space defined by {A = A(p, Δp) | A : R² → R}.

The second assumption is an intuitive one: A must have the same sign as the quality adaptation Δp [34], [35], [56], [59]. This assumption suggests that people always assign a penalty to presentation quality degradation, a reward to quality elevation, and neither penalty nor reward when no quality adaptation occurs. Mathematically, the assumption can be expressed as

A(p, Δp) < 0, ∀ p, Δp < 0
A(p, Δp) > 0, ∀ p, Δp > 0
A(p, 0) = 0, ∀ p.   (6)

Further analysis [33]–[35], [56] of the relationship between the QoE adjustment A and the intensity of quality adaptation Δp indicates that subjects tend to give a greater QoE penalty or reward when quality drops or improves by a greater amount. This finding, together with the second assumption, prompts our third assumption: A is monotonically increasing with respect to Δp. Formally, the two assumptions combined can be represented by

A(p, Δp1) ≤ A(p, Δp2), ∀ p, Δp1 ≤ Δp2.   (7)

Experiments in [35] find that quality degradation occurring in the high quality range leads to a greater penalty than that occurring in the low quality range, while quality elevation in the high quality range results in smaller rewards. Such an observation leads to the fourth assumption that the function A is monotonically decreasing along the other axis p, i.e.,

A(p1, Δp) ≥ A(p2, Δp), ∀ p1 ≤ p2, Δp.   (8)

Another commonly observed trend in adaptation QoE is that the reward for a positive quality change is relatively smaller than the penalty for a negative one, given the same intensity of quality adaptation and the same average presentation quality [33]–[35]. Formally, this can be summarized by

A(p, −Δp) + A(p − Δp, Δp) ≤ 0, ∀ p, Δp.   (9)

In summary, we define the space of adaptation experience functions A as

W_A := {A : R² → R | A(p, 0) = 0, A(p, Δp1) ≤ A(p, Δp2), A(p, −Δp) + A(p − Δp, Δp) ≤ 0, A(p1, Δp) ≥ A(p2, Δp), ∀ p, Δp, p1 ≤ p2, Δp1 ≤ Δp2}.   (10)

Analogous to the rebuffering experience function, the adaptation experience function A also lies in a convex set.

IV. A KNOWLEDGE-DRIVEN QOE MODEL
Given the functional space constrained by the hyperplane and the positive cone described above, there are infinitely many functions lying in the space. A good QoE model should be in close agreement with human perception. In this section, we present a roadmap for designing a perceptually grounded objective QoE model.
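As a minimal sketch of the additive formulation Q_t = P_t + S_t + A_t from Section III, together with the end-of-process pooling used later in this section, consider the following; the function names and numbers are ours and purely illustrative, not from a released implementation.

```python
def chunk_qoe(p: float, s: float, a: float) -> float:
    """Q_t = P_t + S_t + A_t: presentation quality plus the (non-positive)
    rebuffering penalty plus the signed adaptation adjustment."""
    return p + s + a

def overall_qoe(chunk_scores) -> float:
    """End-of-process QoE as a running mean: Y_t = ((t-1)*Y_{t-1} + Q_t)/t."""
    y = 0.0
    for t, q in enumerate(chunk_scores, start=1):
        y = ((t - 1) * y + q) / t
    return y

# Three chunks: a stall before chunk 2 and a downward switch at chunk 3.
chunks = [chunk_qoe(80, 0, 0), chunk_qoe(80, -15, 0), chunk_qoe(60, 0, -5)]
print(overall_qoe(chunks))  # (80 + 65 + 55) / 3
```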
A. Modeling the Presentation Quality
Traditionally, for the sake of operational convenience, bitrate is often used as the major indicator of video presentation quality [3]–[5], [7], [29], [60]. However, bitrate may deviate heavily from perceptual quality. The presentation quality model should provide meaningful and consistent QoE predictions across video contents, video resolutions, and viewing conditions/devices. To the best of our knowledge, currently the only video quality models that satisfy such requirements are SSIMplus [40] and VMAF [12]. Both models perform consistently well on various subject-rated video databases [61], [62], making them appropriate components of KSQI. In the rest of the paper, we present our results using VMAF as the presentation quality model, as it is open source, which facilitates reproducible research. Although the presentation quality scores are not available to the adaptive streaming player by default, they can either be embedded into the manifest file that describes the specifications of the video, or carried in the metadata of the video container. Thanks to its light overhead, this feature embedding technique has been successfully deployed in practical QoE measurement [13], [63] and ABR optimization systems [6], [38].
B. Modeling the Rebuffering QoE Function S

Strictly speaking, W_S is a space of continuous functions, but we may approximate it in terms of a vector space by densely sampling the supporting domain of S. Specifically, the supporting domain of S is defined as {(p, τ) | p ∈ [0, P], τ ∈ [0, τ_max]}, where P indicates the best quality and τ_max is the maximum rebuffering duration. By uniformly sampling both p and τ, we approximate the function S with a finite-size matrix S ∈ R^((N+1)×(N+1)), where an element s_{i,j} denotes the QoE penalty when (p, τ) = ((i−1)/N · P, (j−1)/N · τ_max). We then vectorize S as s ∈ R^((N+1)²) for the convenience of further formulation. Finally, we approximate the functional space W_S with a vector space

W_s := {s ∈ R^((N+1)²) | G_s s ≤ h_s, B_s s = c_s},

where G_s, h_s, B_s, and c_s are constructed so that all the entries in s satisfy the constraints in (5).

Given a training set of M_s video sequences, each of which has C_s chunks, one or more rebuffering events, no adaptation, and a mean opinion score (MOS) Q_m to indicate its overall QoE, we want to obtain a vector s* ∈ W_s that best fits the training data. Besides, it is beneficial to impose smoothness on the function S. Practically, many subjective experiments have empirically shown the smoothness of QoE functions [13], [42]. Mathematically, smoothness regularization may lead to well-behaved solutions. Thus, the objective function of s can be defined as

L_s := ε_s^F + λ ε_s^S,   (11)

where λ > 0 is a weighting factor. We adopt the mean squared error as the fidelity term ε_s^F, and the sum of squared second-order differences along the i and j axes as the smoothness term ε_s^S.
Formally, we define

ε_s^F := (1/M_s) Σ_{m=1}^{M_s} [ Q_m − (1/C_s) Σ_{c=1}^{C_s} ( P_c^m + s_{i_c^m, j_c^m} ) ]²

ε_s^S := 1/(N+1)² Σ_{i=1}^{N+1} Σ_{j=1}^{N+1} [ (∂²s_{i,j}/∂i²)² + (∂²s_{i,j}/∂j²)² ],

where P_c^m and s_{i_c^m, j_c^m} denote the presentation quality and rebuffering QoE penalty at chunk c of video m, respectively. It is not hard to see that both ε_s^F and ε_s^S take quadratic forms in s. As a result, we are able to estimate the rebuffering QoE matrix S by solving the following quadratic programming problem:

minimize_s  L_s = ε_s^F + λ ε_s^S
subject to  s ∈ W_s.   (12)
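To give a flavor of how such a constrained fit behaves, here is a deliberately simplified one-dimensional sketch: we fit a rebuffering penalty curve s(τ) on a grid to noisy targets under the constraints that s is non-increasing in τ and s(0) = 0, using projected gradient descent where the projection onto the monotone cone is a pool-adjacent-violators step. The paper's actual model is two-dimensional, carries the additional constraints in (5), and is solved with an off-the-shelf QP solver; everything below is our own toy illustration.

```python
import numpy as np

def project_nonincreasing(v):
    """Euclidean projection onto {s : s[0] >= s[1] >= ...}: isotonic
    regression (pool-adjacent-violators) applied to the negated sequence."""
    merged = []  # list of [block_mean, block_size]
    for x in -np.asarray(v, float):
        merged.append([float(x), 1])
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, s2 = merged.pop()
            m1, s1 = merged.pop()
            merged.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    out = []
    for mean, size in merged:
        out.extend([mean] * size)
    return -np.array(out)

def fit_rebuffer_penalty(targets, lam=0.1, lr=0.05, iters=500):
    """Projected gradient descent on 0.5*||s - targets||^2 plus a squared
    second-difference smoothness term, projecting each step onto
    {s non-increasing, s[0] = 0, s <= 0}."""
    s = np.zeros_like(targets, dtype=float)
    for _ in range(iters):
        grad = s - targets
        d2 = s[:-2] - 2.0 * s[1:-1] + s[2:]   # second differences
        grad[:-2] += lam * d2
        grad[1:-1] += lam * (-2.0 * d2)
        grad[2:] += lam * d2
        s = project_nonincreasing(s - lr * grad)
        s[0] = 0.0                            # S(p, 0) = 0
        s = np.minimum(s, 0.0)                # penalties are non-positive
    return s

# Noisy, non-monotone targets for s(tau) on a uniform grid of tau.
targets = np.array([0.0, -3.0, -2.0, -7.0, -6.0, -11.0])
print(fit_rebuffer_penalty(targets))
```

The output is guaranteed to satisfy the constraints by construction, regardless of how non-monotone the targets are; the fidelity term then picks the closest such curve.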
Table I
COMPARISON OF OBJECTIVE QOE MODELS

| QoE model | Presentation quality: regression function | Presentation quality: features | Rebuffering: regression function | Rebuffering: features | Switching: regression function |
| Mok2011 [29] | — | — | linear | τ | — |
| FTW [30] | — | — | exponential | τ | — |
| Liu2012 [31] | linear | bitrate | linear | τ | — |
| Xue2014 [32] | linear | QP | logarithmic | τ | — |
| Yin2015 [4] | linear | bitrate | linear | τ | linear |
| Spiteri2016 [5] | logarithmic | bitrate | linear | τ | logarithmic |
| Bentaleb2016 [6] | linear | VQA | linear | τ | — |
| SQI [13] | — | VQA | linear | VQA, τ | — |
| P.1203 [45] | random forest | bitrate, resolution | random forest | τ | random forest |
| VideoATLAS [44] | SVR | VQA | SVR | τ | SVR |
| KSQI | — | VQA | non-parametric | VQA, τ | non-parametric |

The convexity of W_s and (11) implies that there exists a unique solution to the optimization problem. The problem can be efficiently solved with projected gradient descent-based algorithms such as the alternating direction method of multipliers [64].

C. Modeling the Adaptation QoE Function A

Following the same approach, we work with the discrete version of A. The supporting domain of A is {(p, Δp) | p ∈ [0, P], Δp ∈ [−p, P − p]}, since the presentation quality can go neither below 0 nor above the best quality P. By uniformly sampling both p and Δp, we approximate the function A with a finite-size matrix A ∈ R^((N+1)×(N+1)), where an entry a_{i,j} denotes the QoE change when (p, Δp) = ((i−1)/N · P, (j−i)/N · P), and then vectorize A as a ∈ R^((N+1)²). Finally, the vector space of adaptation experience functions becomes

W_a := {a ∈ R^((N+1)²) | G_a a ≤ h_a, B_a a = c_a},

where G_a, h_a, B_a, and c_a are constructed according to the constraints in (10). Given a training set of M_a video sequences, each of which has C_a chunks, no rebuffering events, and a MOS Q_m, we aim to optimize L_a := ε_a^F + λ ε_a^S, where

ε_a^F := (1/M_a) Σ_{m=1}^{M_a} [ Q_m − (1/C_a) Σ_{c=1}^{C_a} ( P_c^m + a_{i_c^m, j_c^m} ) ]²

ε_a^S := 1/(N+1)² Σ_{i=1}^{N+1} Σ_{j=1}^{N+1} [ (∂²a_{i,j}/∂i²)² + (∂²a_{i,j}/∂j²)² ].
Here, a_{i_c^m, j_c^m} denotes the quality adaptation experience at chunk c of video m. The optimal quality adaptation experience matrix A can be obtained by solving the following quadratic programming problem:

minimize_a  L_a = ε_a^F + λ ε_a^S
subject to  a ∈ W_a.   (13)

D. Overall QoE
In practice, one usually requires a single end-of-process QoE measure. We use the mean value of the predicted QoE over the whole playback duration to evaluate the overall QoE. To reduce memory usage, the end-of-process QoE can be computed in a moving-average fashion:

Y_t = ((t − 1) Y_{t−1} + Q_t) / t,

where Y_t is the cumulative QoE up to the t-th segment in the streaming session.

V. EXPERIMENTS
In this section, we first describe the experimental setups and evaluation criteria. We then compare KSQI with classic and state-of-the-art objective QoE models. Furthermore, we develop an efficient methodology for examining the best-case performance of objective QoE models. Finally, we conduct a series of ablation experiments to identify the contributions of the core factors in KSQI.
A. Experimental Setup

1) Objective QoE Models:
We evaluate the performance of objective QoE models for adaptive streaming videos. The competing algorithms are chosen to cover a diversity of design philosophies, including classic parametric QoE models: FTW [30], Mok2011 [29], Liu2012 [31], Xue2014 [32], Yin2015 [4], Spiteri2016 [5], Bentaleb2016 [6], and SQI [13]; state-of-the-art learning-based QoE models: VideoATLAS [44] and P.1203 [45]; and the proposed KSQI. A description of the existing QoE models is shown in Table I. The implementation of VideoATLAS is obtained from the original authors, and we implement the other nine QoE models. We have made the implementations of the models publicly available at https://github.com/zduanmu/ksqi. For fairness, the parameters of all models are optimized on the WaterlooSQoE-I [13] and WaterlooSQoE-II [35] datasets, except for P.1203 [45], whose training methodology is not specified in the original paper. The WaterlooSQoE-I dataset contains compressed videos, compressed videos with initial buffering, and compressed videos with rebuffering. The WaterlooSQoE-II dataset involves video clips with variations in compression level, spatial resolution, and frame rate. For the models with hyper-parameters, we randomly split the datasets into training and validation sets, and the hyper-parameters with the lowest validation loss are chosen. For KSQI, we configure the hyper-parameters as follows.
Table II
PLCC BETWEEN THE OBJECTIVE QOE MODEL PREDICTION AND MOS ON THE BENCHMARK DATASETS

| QoE model | LIVE-NFLX-I | LIVE-NFLX-II | WaterlooSQoE-III | WaterlooSQoE-IV | Average | Weighted Average |
| Mok2011 [29] | 0.292 | 0.512 | 0.173 | 0.046 | 0.256 | 0.166 |
| FTW [30] | 0.286 | 0.568 | 0.323 | 0.147 | 0.331 | 0.263 |
| Xue2014 [32] | — | 0.788 | 0.387 | 0.166 | 0.447 | 0.328 |
| Liu2012 [31] | 0.524 | 0.732 | 0.609 | 0.282 | 0.537 | 0.438 |
| Yin2015 [4] | 0.376 | 0.673 | 0.722 | 0.323 | 0.524 | 0.466 |
| VideoATLAS [44] | 0.100 | 0.644 | 0.385 | 0.675 | 0.451 | 0.586 |
| P.1203 [45] | 0.325 | 0.817 | 0.769 | 0.636 | 0.637 | 0.679 |
| Bentaleb2016 [6] | 0.741 | 0.898 | 0.625 | 0.682 | 0.737 | 0.713 |
| Spiteri2016 [5] | 0.612 | 0.731 | | | | |
| KSQI | | | | | | |
Table III
SRCC BETWEEN THE OBJECTIVE QOE MODEL PREDICTION AND MOS ON THE BENCHMARK DATASETS

| QoE model | LIVE-NFLX-I | LIVE-NFLX-II | WaterlooSQoE-III | WaterlooSQoE-IV | Average | Weighted Average |
| Mok2011 [29] | 0.335 | 0.516 | 0.152 | 0.056 | 0.265 | 0.171 |
| FTW [30] | 0.325 | 0.549 | 0.184 | 0.082 | 0.285 | 0.197 |
| Xue2014 [32] | — | 0.778 | 0.388 | 0.219 | 0.462 | 0.360 |
| Liu2012 [31] | 0.438 | 0.732 | 0.598 | 0.468 | 0.559 | 0.539 |
| Yin2015 [4] | 0.441 | 0.686 | 0.741 | 0.541 | 0.602 | 0.601 |
| VideoATLAS [44] | 0.076 | 0.673 | 0.469 | 0.670 | 0.472 | 0.603 |
| Spiteri2016 [5] | 0.493 | 0.711 | | | | |
| KSQI | | | | | | |
Table IV
KRCC BETWEEN THE OBJECTIVE QOE MODEL PREDICTION AND MOS ON THE BENCHMARK DATASETS

| QoE model | LIVE-NFLX-I | LIVE-NFLX-II | WaterlooSQoE-III | WaterlooSQoE-IV | Average | Weighted Average |
| Mok2011 [29] | 0.275 | 0.425 | 0.112 | 0.044 | 0.214 | 0.137 |
| FTW [30] | 0.251 | 0.425 | 0.135 | 0.072 | 0.221 | 0.156 |
| Xue2014 [32] | — | 0.582 | 0.262 | 0.148 | 0.148 | 0.253 |
| Liu2012 [31] | 0.324 | 0.524 | 0.434 | 0.319 | 0.319 | 0.378 |
| Yin2015 [4] | 0.327 | 0.482 | 0.543 | 0.379 | 0.379 | 0.427 |
| VideoATLAS [44] | 0.050 | 0.491 | 0.330 | 0.480 | 0.338 | 0.432 |
| Spiteri2016 [5] | 0.376 | 0.501 | 0.597 | 0.461 | 0.484 | 0.490 |
| P.1203 [45] | 0.300 | 0.619 | | | | |
| KSQI | | | | | | |

We set the maximum rebuffering duration τ_max so that longer rebuffering events rarely occur, as shown in Fig. 1; the penalty of a rebuffering event longer than τ_max can be easily obtained by extrapolating the matrix S. We set the step size N so that the grid spacing roughly characterizes the standard deviation of subjective presentation quality evaluation. The maximum presentation quality value P = 100 is inherited from SSIMplus and VMAF. Although we could learn an initial buffering experience matrix independent from S, it would introduce unnecessary model complexity. Instead, we discount the impact of initial buffering and set the expected initial quality following the recommendation of [13]. We apply OSQP [65] to solve the quadratic programming problems in (12) and (13). The fidelity-smoothness tradeoff parameter λ = 1 is obtained by cross-validation. In the subsequent section, we will also show that KSQI performs consistently over a broad range of λ and step size N.
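For intuition, the learned matrix S acts as a lookup table at inference time: a continuous pair (p, τ) is mapped to grid indices and the corresponding penalty is read out (or interpolated). The sketch below uses a hand-made table and our own constants, not the learned values from the paper:

```python
import numpy as np

P_MAX = 100.0   # best presentation quality, on the SSIMplus/VMAF scale
TAU_MAX = 10.0  # assumed maximum rebuffering duration (illustrative)
N = 4           # number of grid steps, so the table is (N+1) x (N+1)

# Hand-made penalty table: rows index p, columns index tau.  Column 0 is
# all zeros (S(p, 0) = 0), and the penalty magnitude grows with both p
# and tau, matching the monotonicity constraints in (5).
S = np.array([[0.0,  -1.0,  -2.0,  -3.0,  -4.0],
              [0.0,  -2.0,  -4.0,  -6.0,  -8.0],
              [0.0,  -3.0,  -6.0,  -9.0, -12.0],
              [0.0,  -4.0,  -8.0, -12.0, -16.0],
              [0.0,  -5.0, -10.0, -15.0, -20.0]])

def rebuffer_penalty(p: float, tau: float) -> float:
    """Nearest-neighbour lookup of s_{i,j} for (p, tau) on the uniform grid."""
    i = int(round(p / P_MAX * N))
    j = int(round(tau / TAU_MAX * N))
    return float(S[i, j])

print(rebuffer_penalty(100.0, 5.0))  # -10.0: high quality, mid-length stall
```

A production model would interpolate bilinearly between grid points rather than rounding, but the table-driven structure is the same.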
2) Benchmark Databases:
We compare KSQI with state-of-the-art objective QoE models on four subject-rated adaptive streaming video datasets: LIVE-NFLX-I [50], LIVE-NFLX-II [15], WaterlooSQoE-III [14], and WaterlooSQoE-IV [11]. The LIVE-NFLX-I dataset consists of streaming videos derived from source content with handcrafted playout patterns. The LIVE-NFLX-II dataset consists of streaming videos generated from content-adaptive encoding profiles, bitrate adaptation algorithms, and network conditions. The WaterlooSQoE-III dataset contains streaming videos of diverse source contents recorded from a set of streaming experiments. The WaterlooSQoE-IV dataset contains highly realistic streaming videos constructed from a combination of video contents, video encoders, real-world network traces, ABR algorithms, and viewing devices. The streaming videos in different datasets are of diverse characteristics since they are generated from different source videos, encoding profiles, adaptive streaming algorithms, and network conditions. We do not evaluate Xue2014 on the LIVE-NFLX-I dataset because its quantization parameters (QP) and the encoded representations of the streaming videos are not publicly available.

Figure 2. Pairwise comparison matrix R. Each entry indicates the preference of the row model against the column model. R − R^T is drawn here for better visibility.
3) Evaluation Criteria:
Three criteria are employed for performance evaluation by comparing MOS and objective QoE scores, following the recommendation of the Video Quality Experts Group [66]. We adopt the Pearson linear correlation coefficient (PLCC) to evaluate prediction accuracy, and the Spearman rank-order correlation coefficient (SRCC) and Kendall rank correlation coefficient (KRCC) to assess prediction monotonicity. A better objective QoE model should have higher PLCC, SRCC, and KRCC.
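These criteria follow standard definitions; a self-contained numpy sketch (ignoring tie handling, which library routines such as scipy.stats treat more carefully) is:

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient (prediction accuracy)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def srcc(x, y):
    """Spearman rank-order correlation: PLCC computed on the ranks."""
    rank = lambda v: np.argsort(np.argsort(np.asarray(v)))
    return plcc(rank(x), rank(y))

def krcc(x, y):
    """Kendall rank correlation (tau-a): normalized concordant minus
    discordant pair count."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    return float(s / (n * (n - 1) / 2))

# A nonlinear but monotone prediction: rank metrics are 1, PLCC is below 1.
mos = [20, 35, 50, 65, 80]
pred = [1.0, 2.2, 2.9, 4.1, 5.0]
print(plcc(mos, pred), srcc(mos, pred), krcc(mos, pred))
```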
B. Performance Comparison
Tables II, III, and IV show the PLCC, SRCC, and KRCCon the benchmark datasets, respectively, where top 2 bestperformers are highlighted with bold face. We have severalobservations. First, the objective QoE models which employadvanced VQA models as the presentation quality measuregenerally performs favorably against the conventional bitrate-based QoE models. In particular, Bentaleb2016 significantlyoutperforms Yin2015, where the only difference betweenthem is the presentation quality measure. Second, althoughthe learning-based QoE models perform competitively oncertain test sets, they fail miserably on the other benchmarkdatasets. Specifically, the performance degradation of P.1203and VideoATLAS from one dataset to another can be as largeas 0.406 and 0.575, suggesting that the learning-based modelsexhibit low generalizability to diverse streaming environments.By contrast, KSQI achieves state-of-the-art performance onall benchmark datasets, thanks to the constraints given bydomain knowledge. Third, the classic QoE models with a fixedparametric form cannot faithfully capture the subjective QoEresponse on streaming videos with complex distortion patterns,evident by the low prediction accuracy on WaterlooSQoE-III. In spite of the authors’ effort in designing functionalforms to conform known HVS properties [13], [30], [32],the QoE functions can vary significantly from exponentialand logarithmic functions. On the other hand, KSQI doesnot assume a particular form of QoE functions and instead
maximizes the mathematical well-behavedness. In summary, we believe the performance improvement arises because 1) KSQI is equipped with an HVS-inspired VQA measure that generalizes well on a variety of video contents, encoders, and viewing devices; 2) the training procedure optimizes the quality prediction accuracy regularized by the prior knowledge of the HVS; and 3) the proposed model does not make inaccurate a priori assumptions on the form of QoE functions.

Figure 3. Global ranking results of the four QoE models.
C. Best-case Validation
Objective QoE models are used not only to evaluate, but also to optimize a variety of ABR algorithms and systems. A good rule of thumb is that an optimized system is only as good as the optimization criterion used to design it [67]. Conversely, the performance of an objective QoE model can be assessed by synthesizing optimal streaming videos with respect to the model, followed by visual inspection of the generated stimuli [68], [69]. Specifically, given a set of encoded and segmented videos and a realistic network trace, we can generate an optimal streaming video in terms of each objective QoE model. Subjective evaluation of the synthesized stimuli provides a best-case validation of the underlying objective QoE models. A good objective QoE model should produce perceptually better streaming videos than the other schemes.

In this paper, we select high-quality videos of diverse complexity to constitute the test sample set. All videos have the length of seconds. Using the source sequences, each video is encoded with an x264 encoder into representations in accordance with Netflix's recommendation [20]. We segment the encoded videos with GPAC's MP4Box [70] with a segment length of seconds for the following reasons. First, segments of this length are widely used in the development of ABR algorithms. Second, they allow us to derive test videos efficiently such that the videos cover diverse adaptation patterns in a limited time. Network traces of diverse characteristics are randomly selected from the HSDPA dataset [71]. We compare KSQI with three objective QoE models that have guided the development of ABR algorithms, including Yin2015, Spiteri2016, and Bentaleb2016.
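The synthesis of a QoE-optimal streaming video given full knowledge of a network trace can be sketched as follows. This is a simplified illustration, not the paper's implementation: the bitrate ladder, quality scores, and penalty weights are assumed, and the short trace is solved by exhaustive enumeration, which returns the same globally optimal plan that dynamic programming would for this additive QoE model.

```python
# Sketch: offline-optimal bitrate selection for an additive QoE model
# (segment quality minus rebuffering and switching penalties). All constants
# are illustrative, not from the paper.
import itertools

T = 4.0                                           # segment duration (s), assumed
BITRATES = [300, 750, 1850, 4300]                 # kbps ladder, assumed
QUALITY = {300: 40, 750: 60, 1850: 80, 4300: 95}  # toy quality scores
MU, LAM = 10.0, 0.5                               # rebuffer / switching weights, assumed

def offline_optimal(throughput_kbps):
    """Return the QoE-maximizing bitrate plan given the full future trace."""
    n = len(throughput_kbps)
    best, best_plan = float("-inf"), None
    for plan in itertools.product(BITRATES, repeat=n):
        buf, prev_q, qoe = 0.0, None, 0.0
        for i, br in enumerate(plan):
            dl = br * T / throughput_kbps[i]   # download time of segment i (s)
            rebuf = max(0.0, dl - buf)         # stall while downloading
            buf = max(buf - dl, 0.0) + T       # buffer after the download
            q = QUALITY[br]
            qoe += q - MU * rebuf
            if prev_q is not None:
                qoe -= LAM * abs(q - prev_q)   # quality switching penalty
            prev_q = q
        if qoe > best:
            best, best_plan = qoe, plan
    return best_plan, best

plan, score = offline_optimal([1200, 2500, 800, 3000])
print(plan, round(score, 1))
```

Swapping in a different QoE function changes which plan is optimal, which is exactly what the pairwise comparison below probes.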
Table V
Statistical significance matrix based on F-statistics on the combination of WaterlooSQoE-III, WaterlooSQoE-IV, LIVE-NFLX-I, and LIVE-NFLX-II datasets. A symbol "1" means that the performance of the row model is statistically better than that of the column model, a symbol "0" means that the row model is statistically worse, and a symbol "-" means that the row and column models are statistically indistinguishable.

               FTW  Mok2011  Liu2012  Yin2015  VideoATLAS  Spiteri2016  P.1203  Bentaleb2016  SQI  KSQI
FTW             -      -        0        0         0            0         0          0         0     0
Mok2011         -      -        0        0         0            0         0          0         0     0
Liu2012         1      1        -        -         0            0         0          0         0     0
Yin2015         1      1        -        -         0            0         0          0         0     0
VideoATLAS      1      1        1        1         -            0         0          0         0     0
Spiteri2016     1      1        1        1         1            -         -          0         0     0
P.1203          1      1        1        1         1            -         -          0         0     0
Bentaleb2016    1      1        1        1         1            1         1          -         0     0
SQI             1      1        1        1         1            1         1          1         -     0
KSQI            1      1        1        1         1            1         1          1         1     -
We present results for the offline optimal scheme [5], [7], which is computed using dynamic programming with complete future throughput information. The dynamic programming-based method generates globally optimal streaming videos for the considered QoE models, completely eliminating the influence of inaccurate throughput estimation. For each source video, we randomly select a network trace and optimize the streaming videos with respect to the four objective QoE models. In the end, we obtain a total of streaming videos generated from (source video, network trace) pairs × ABR algorithms. An online demonstration of the experiment is available at [72].

We perform a subjective user study that adopts the pairwise comparison methodology, in which a pair of streaming videos generated from the same video content and network trace are presented to human viewers. The subjective experiment is set up as a normal indoor home setting with an ordinary illumination level, with no reflecting ceiling walls and floors. A customized interface is created to render a pair of videos side-by-side on a 27-inch 4K monitor. The display is calibrated in accordance with the recommendations of ITU-R BT.500 [73]. For each video pair, the subjects are forced to choose which one has the better perceptual quality. A total of naïve subjects, including both males and females, participate in the subjective experiment. Visual acuity and color vision are confirmed for each subject before the subjective test. A training session is performed, during which 3 video pairs that are different from the videos in the testing set are presented to the subjects. We use the same methods to generate the videos used in the training and testing sessions.
Therefore, the subjects knew what distortion types to expect before the test session, and thus learning effects are kept minimal in the subjective experiment. For each subject, the whole study takes one hour, which is divided into two sessions with a break in between.

The results of the subjective experiment can be summarized as a 4 × 4 matrix R, where r_{i,j} represents the probability that QoE model i is better than QoE model j. Fig. 2 shows the result matrix R, where the higher the value of an entry (warmer color), the stronger the row model against the column model. It is obvious that KSQI performs favorably against the competing models. We further aggregate the pairwise comparison results into a global ranking via the maximum likelihood method for multiple options [51], [74], [75]. Let µ = [µ_1, µ_2, µ_3, µ_4] ∈ R^4 be the global ranking score vector. We maximize the log-likelihood of µ:

  arg max_µ Σ_{i,j} r_{i,j} log Φ(µ_i − µ_j),  subject to Σ_i µ_i = 0,

where Φ(·) is the standard normal cumulative distribution function. The constraint Σ_i µ_i = 0 is introduced to resolve the translation ambiguity. The optimization problem is convex and enjoys efficient solvers. A larger µ_i means that the optimal streaming video in terms of the i-th model is in general perceptually better than the optimal samples generated by the other QoE models. Fig. 3 shows the experimental results. It can be seen that KSQI significantly outperforms the standard QoE models. The results have significant implications for the development of ABR algorithms. Specifically, state-of-the-art ABR algorithms have reached a performance plateau, and significant improvement has become difficult to attain. However, the enormous difference in perceptual relevance between the bitrate-based QoE models and KSQI suggests that further improvement is attainable simply by adopting a perceptually motivated optimization criterion.

D. Statistical Significance Test
To ascertain that the improvement of the proposed model is statistically significant, we carry out a statistical significance analysis following the approach introduced in [76]. First, we linearly scale the MOSs in each dataset to the same perceptual scale [0, 100]. Second, a nonlinear regression function is applied to map the objective quality scores to the subjective scores independently on the four testing datasets. The prediction residuals of each QoE model from all datasets are aggregated into a vector. We observe that the prediction residuals all have zero mean, and thus the model with lower residual variance is generally considered better than the one with higher variance. We conduct hypothesis testing using F-statistics. Since the number of samples exceeds 50, the Gaussian assumption on the residuals approximately holds based on the central limit theorem [58]. The test statistic is the ratio of variances. The null hypothesis is that the prediction residuals from one quality model come from the same distribution
Table VI
PLCC between the variants of KSQI prediction and MOS on the benchmark datasets.

QoE model              LIVE-NFLX-I  LIVE-NFLX-II  WaterlooSQoE-III  WaterlooSQoE-IV  Average  Weighted Average
KSQI with bitrate         0.622        0.722          0.670             0.618         0.658       0.647
KSQI with log bitrate     0.686        0.715          0.787
Table VII
PLCC between the variants of KSQI prediction and MOS on the benchmark datasets.

Constraint              LIVE-NFLX-I  LIVE-NFLX-II  WaterlooSQoE-III  WaterlooSQoE-IV  Average  Weighted Average
(1)(2)(3)(4)(6)
(1)(2)(3)(4)(6)(7)
(1)(2)(3)(4)(6)(7)(8)
(1)                        0.744        0.902          0.788             0.718         0.788       0.766
(2)                        0.743        0.906          0.758             0.717         0.781       0.760
(3)                        0.743        0.895          0.798             0.713         0.788       0.764
(4)                        0.753        0.902          0.787             0.717         0.790       0.766
(6)                        0.745        0.884          0.770             0.691         0.773       0.744
(7)                        0.745        0.884          0.770             0.692         0.773       0.744
(8)                        0.745        0.884          0.770             0.691         0.773       0.744
(9)                        0.746        0.884          0.770             0.692         0.773       0.744
KSQI
Figure 4. Performance of KSQI with different numbers of bins.

and are statistically indistinguishable (with 95% confidence) from the residuals of another model. After comparing every possible pair of objective models, the results are summarized in Table V, where a symbol "1" means the row model performs significantly better than the column model, a symbol "0" means the opposite, and a symbol "-" indicates that the row and column models are statistically indistinguishable. It can be observed that the proposed model is statistically better than all other methods on the combination of all existing benchmark datasets.
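A minimal sketch of the variance-ratio F-test used in this comparison (the residual vectors here are illustrative random draws, not the paper's data):

```python
# Sketch: two-sided F-test on the ratio of two models' residual variances,
# producing the "1"/"0"/"-" symbols used in the significance matrix.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
res_a = rng.normal(0.0, 5.0, 200)   # residuals of model A (zero-mean, illustrative)
res_b = rng.normal(0.0, 9.0, 200)   # residuals of model B (larger variance)

def f_test(row_res, col_res, alpha=0.05):
    """Compare residual variances; lower variance means a better model."""
    F = np.var(row_res, ddof=1) / np.var(col_res, ddof=1)
    d1, d2 = len(row_res) - 1, len(col_res) - 1
    lo = f_dist.ppf(alpha / 2, d1, d2)
    hi = f_dist.ppf(1 - alpha / 2, d1, d2)
    if F < lo:
        return "1"   # row model's variance significantly smaller: row better
    if F > hi:
        return "0"   # row model significantly worse
    return "-"       # statistically indistinguishable

print(f_test(res_a, res_b), f_test(res_b, res_a), f_test(res_a, res_a))
```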
E. Ablation Experiment
We conduct a series of ablation experiments to single out the core contributors of KSQI. We first take bitrate [4], [31],
Figure 5. Performance of KSQI with different λ.

logarithmic bitrate [5], and QP [32] as the video presentation quality measure as opposed to VMAF, and then train the QoE model with the proposed optimization framework. In order to map the range of each video presentation quality measure to the same perceptual scale [0, 100], we apply a linear transform to the alternative measures before the training stage. From Table VI, we observe that KSQI achieves the best performance with the adoption of a state-of-the-art video quality measure such as VMAF.

Next, we analyze the impact of the knowledge-imposed constraints on the quality prediction performance. We start from a baseline model by solving the problem in (12) and (13) with no constraints and gradually increase the number of constraints. We then investigate the validity of each observation by imposing only one constraint in a variant model. The results are listed in Table VII, from which the key observations are as follows. First, the performance of KSQI generally improves with the number of imposed constraints, advocating the effectiveness of prior knowledge in regularizing the objective QoE functions. Second, while some of the constraints do not improve the performance of KSQI by themselves, the joint model achieves state-of-the-art performance. This suggests that the constraints may be complementary to each other. Third, constraint (3) has drastically different impacts on the LIVE-NFLX-II dataset and the WaterlooSQoE-III dataset, suggesting that the validity of the constraint may be influenced by other factors. A careful investigation may further improve the performance of the proposed QoE model.
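The general idea of regularizing a non-parametric fit with knowledge-driven constraints can be illustrated with a toy example. This is not the paper's actual problem (12)-(13): the data, the smoothness weight, and the single monotonicity constraint (QoE non-decreasing in presentation quality) are all assumptions for illustration.

```python
# Sketch: least-squares fit of a binned non-parametric function, regularized
# by a second-difference smoothness term and constrained to be non-decreasing.
import numpy as np
from scipy.optimize import minimize

y = np.array([20., 35., 30., 55., 50., 70., 85., 80.])  # noisy bin targets (toy)
lam = 0.1  # smoothness weight, analogous to the fidelity/smoothness tradeoff

def objective(f):
    fidelity = np.sum((f - y) ** 2)
    roughness = np.sum(np.diff(f, 2) ** 2)  # second-difference roughness
    return fidelity + lam * roughness

# Knowledge constraint: the fitted function must be non-decreasing across bins.
cons = [{"type": "ineq", "fun": (lambda f, i=i: f[i + 1] - f[i])}
        for i in range(len(y) - 1)]

res = minimize(objective, y.copy(), constraints=cons, method="SLSQP")
fit = res.x
print(np.round(fit, 1))  # a non-decreasing, smoothed version of y
```

Without the constraints, the fit would simply reproduce the noisy targets; the constraints pull it toward a shape consistent with the prior knowledge.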
F. Impact of Step Sizes
In previous experiments, the bin sizes of video presentation quality and rebuffering duration are fixed. To investigate the impact of step sizes, we train several variants of KSQI with different numbers of bins. We show the experimental results in Fig. 4. Theoretically speaking, the performance of KSQI should increase monotonically with the precision of the feature representations. However, the observation does not echo our expectation, which may be a consequence of insufficient training data and intrinsic noise in the subjective opinion scores. Nevertheless, KSQI is generally very robust to a broad range of bin sizes.

G. Impact of λ

The parameter λ in KSQI determines the tradeoff between the fidelity and the smoothness of the QoE functions. Although the optimal parameter is obtained from cross-validation in previous experiments, we also perform an experiment to investigate the impact of λ. Specifically, we train several versions of KSQI with different values of λ. The results are shown in Fig. 5, from which we can observe that the performance of KSQI is generally insensitive to λ.

VI. CONCLUSIONS
We propose a novel objective QoE model for adaptive streaming videos, namely KSQI, by regularizing a non-parametric model with known HVS properties. KSQI outperforms the existing objective QoE models by a sizable margin over a wide range of video contents, encoding configurations, network conditions, and viewing devices, which we believe arises from a perceptually motivated video quality representation, a knowledge-constrained optimization framework, and a non-parametric model of QoE functions.

The proposed model may be improved in many ways. First, KSQI is readily extendable when new knowledge of HVS properties is acquired. With proper modifications of the non-parametric functions, we may incorporate more features such as motion strength [41] into the QoE model. Second, there may be better ways to combine the video presentation quality, rebuffering experience, and quality adaptation experience. For example, we can jointly model all influencing factors by escalating the dimensionality of the non-parametric model. Third, how to integrate the QoE model into the adaptive bitrate selection algorithm for optimal playback control is another challenging problem that is worth further investigation.

REFERENCES
IEEE/ACM Transactions on Networking, vol. 22, no. 1, pp. 326–340, Feb. 2014.
[3] T. Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, "A buffer-based approach to rate adaptation: Evidence from a large video streaming service," ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 187–198, Feb. 2015.
[4] X. Yin, A. Jindal, V. Sekar, and B. Sinopoli, "A control-theoretic approach for dynamic adaptive video streaming over HTTP," ACM SIGCOMM Computer Communication Review, vol. 45, no. 4, pp. 325–338, Apr. 2015.
[5] K. Spiteri, R. Urgaonkar, and R. K. Sitaraman, "BOLA: Near-optimal bitrate adaptation for online videos," in Proc. IEEE Int. Conf. Computer Communications. San Francisco, CA, USA: IEEE, Apr. 2016, pp. 1–9.
[6] A. Bentaleb, A. C. Begen, and R. Zimmermann, "SDNDASH: Improving QoE of HTTP adaptive streaming using software defined networking," in Proc. ACM Int. Conf. Multimedia. Amsterdam, The Netherlands: ACM, Oct. 2016, pp. 1296–1305.
[7] H. Mao, R. Netravali, and M. Alizadeh, "Neural adaptive video streaming with Pensieve," in Proc. ACM SIGCOMM. Los Angeles, CA, USA: ACM, Aug. 2017, pp. 197–210.
[8] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.
[9] C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, Jun. 1998.
[10] M. T. Hagan, H. B. Demuth, M. H. Beale, and R. De Jesús, Neural Network Design. Boston, MA, USA: PWS Pub., 1996, vol. 20.
[11] Z. Duanmu, D. Chen, Z. Li, W. Liu, Z. Wang, Y. Wang, and W. Gao. (2019) Waterloo streaming Quality-of-Experience database IV. [Online]. Available: http://ece.uwaterloo.ca/~zduanmu/waterloosqoe4.
[12] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara. (2016) Toward a practical perceptual video quality metric. [Online]. Available: http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html.
[13] Z. Duanmu, K. Zeng, K. Ma, A. Rehman, and Z. Wang, "A Quality-of-Experience index for streaming video," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 1, pp. 154–166, Sep. 2017.
[14] Z. Duanmu, A. Rehman, and Z. Wang, "A Quality-of-Experience database for adaptive video streaming," IEEE Trans. Broadcasting, vol. 64, no. 2, pp. 474–487, Jun. 2018.
[15] C. G. Bampis, Z. Li, I. Katsavounidis, T. Y. Huang, C. Ekanadham, and A. C. Bovik, "Towards perceptually optimized end-to-end adaptive video streaming," ArXiv preprint arXiv:1808.03898, Aug. 2018.
[16] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys, vol. 46, no. 4, pp. 44.1–44.37, Apr. 2014.
[17] S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High dynamic range video," ACM Trans. Graphics, vol. 22, no. 3, pp. 319–325, Jul. 2003.
[18] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, "Light field photography with a hand-held plenoptic camera," Computer Science Technical Report, pp. 1–11, Apr. 2005.
[19] I. Ishii, T. Tatebe, Q. Gu, Y. Moriue, T. Takaki, and K. Tajima, "2000 fps real-time vision system with high-frame-rate video recording," in Proc. IEEE Int. Conf. Robotics and Automation. Anchorage, AK, USA: IEEE, May 2010, pp. 1536–1541.
[20] Netflix Inc. (2015) Per-title encode optimization. [Online]. Available: http://techblog.netflix.com/2015/12/per-title-encode-optimization.html
[21] L. Toni, R. Aparicio-Pardo, K. Pires, W. Simon, A. Blanc, and P. Frossard, "Optimal selection of adaptive streaming representations," ACM Trans. Multimedia Computing, Communications, and Applications, vol. 11, no. 2s, pp. 1–43, Feb. 2015.
[22] J. De Cock, Z. Li, M. Manohara, and A. Aaron, "Complexity-based consistent-quality encoding in the cloud," in Proc. IEEE Int. Conf. Image Proc. Phoenix, AZ, USA: IEEE, Sep. 2016, pp. 1484–1488.
[23] R. M. Nasiri, J. Wang, A. Rehman, S. Wang, and Z. Wang, "Perceptual quality assessment of high frame rate video," in Proc. IEEE Int. Workshop Multimedia Signal Processing. Xiamen, China: IEEE, Oct. 2015, pp. 1–6.
[24] Z. Wang and A. Rehman, "Begin with the end in mind: A unified end-to-end quality-of-experience monitoring, optimization and management framework," in SMPTE Annual Technical Conference and Exhibition. Hollywood, CA, USA: SMPTE, Oct. 2017, pp. 1–11.
[25] Z. Li, Z. Duanmu, W. Liu, and Z. Wang, "AVC, HEVC, VP9, AVS2, or AV1? - A comparative study of state-of-the-art video encoders on 4K videos," in Proc. Int. Conf. Image Analysis and Recognition. Waterloo, ON, Canada: AIMI, To Appear.
[26] R. L. Oliver, "A cognitive model of the antecedents and consequences of satisfaction decisions," Journal of Marketing Research, vol. 17, no. 4, pp. 460–469, Nov. 1980.
[27] Z. Duanmu, K. Ma, and Z. Wang, "Quality-of-Experience for adaptive streaming videos: An expectation confirmation theory motivated approach," IEEE Trans. Image Processing, vol. 27, no. 12, pp. 6135–6146, Dec. 2018.
[28] K. Watanabe, J. Okamoto, and T. Kurita, "Objective video quality assessment method for evaluating effects of freeze distortion in arbitrary video scenes," in Image Quality and System Performance IV, vol. 64940P. SPIE, Jan. 2007, pp. 1–8.
[29] R. K. Mok, X. Luo, E. W. Chan, and R. K. Chang, "QDASH: A QoE-aware DASH system," in Proc. ACM Conf. Multimedia Systems. Chapel Hill, NC, USA: ACM, Feb. 2012, pp. 11–22.
[30] T. Hoßfeld, R. Schatz, E. Biersack, and L. Plissonneau, "Internet video delivery in YouTube: From traffic measurements to Quality of Experience," in Data Traffic Monitoring and Analysis. Berlin, Heidelberg: Springer, Jan. 2013, pp. 264–301.
[31] X. Liu, F. Dobrian, H. Milner, J. Jiang, V. Sekar, I. Stoica, and H. Zhang, "A case for a coordinated internet video control plane," ACM SIGCOMM Computer Communication Review, vol. 42, no. 4, pp. 359–370, Sep. 2012.
[32] J. Xue, D. Zhang, H. Yu, and C. W. Chen, "Assessing Quality of Experience for adaptive HTTP video streaming," in Proc. IEEE Int. Conf. Multimedia and Expo Workshop. Chengdu, China: IEEE, Jul. 2014, pp. 1–6.
[33] P. Ni, R. Eg, A. Eichhorn, C. Griwodz, and P. Halvorsen, "Flicker effects in adaptive video streaming to handheld devices," in Proc. ACM Int. Conf. Multimedia. Scottsdale, AZ, USA: ACM, Nov. 2011, pp. 463–472.
[34] A. Rehman and Z. Wang, "Perceptual experience of time-varying video quality," in Proc. IEEE Int. Conf. Quality of Multimedia Experience. Klagenfurt am Wörthersee: IEEE, Jul. 2013, pp. 218–223.
[35] Z. Duanmu, K. Ma, and Z. Wang, "Quality-of-Experience of adaptive video streaming: Exploring the space of adaptations," in Proc. ACM Int. Conf. Multimedia. Mountain View, CA, USA: ACM, Oct. 2017, pp. 1752–1760.
[36] Z. Akhtar, Y. S. Nam, R. Govindan, S. Rao, J. Chen, E. Katz-Bassett, B. Ribeiro, J. Zhan, and H. Zhang, "Oboe: Auto-tuning video ABR algorithms to network conditions," in Proc. ACM SIGCOMM. Budapest, Hungary: ACM, Aug. 2018, pp. 44–58.
[37] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, ser. Prentice-Hall Electrical Engineering Series. Upper Saddle River, NJ, USA: Prentice-Hall, 1971.
[38] Z. Wang, K. Zeng, and A. Rehman, "Method and system for smart adaptive video streaming driven by perceptual Quality-of-Experience estimations," Aug. 2016, US Patent WO/2016/123721.
[39] M. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Trans. Broadcasting, vol. 50, no. 3, pp. 312–322, Sep. 2004.
[40] A. Rehman, K. Zeng, and Z. Wang, "Display device-adapted video Quality-of-Experience assessment," in Proc. SPIE. San Francisco, CA, USA: SPIE, Feb. 2015, pp. 939406.1–939406.11.
[41] Y. Liu, S. Dey, F. Ulupinar, M. Luby, and Y. Mao, "Deriving and validating user experience model for DASH video streaming," IEEE Trans. Broadcasting, vol. 61, no. 4, pp. 651–665, Dec. 2015.
[42] C. G. Bampis, Z. Li, and A. C. Bovik, "Continuous prediction of streaming video QoE using dynamic networks," IEEE Signal Processing Letters, vol. 24, no. 7, pp. 1083–1087, Jul. 2017.
[43] K. Singh, Y. Hadjadj-Aoul, and G. Rubino, "Quality of Experience estimation for adaptive HTTP/TCP video streaming using H.264/AVC," in Proc. IEEE Consumer Communications & Networking Conference. Las Vegas, NV, USA: IEEE, Jan. 2012, pp. 1–6.
[44] C. G. Bampis and A. C. Bovik, "Learning to predict streaming video QoE: Distortions, rebuffering and memory," ArXiv preprint arXiv:1703.00633.
Neural Networks, vol. 2, no. 5, pp. 359–366, Jan. 1989.
[47] T. Huang, R. Zhang, C. Zhou, and L. Sun, "QARC: Video quality aware rate control for real-time video streaming based on deep reinforcement learning," in Proc. ACM Int. Conf. Multimedia. Seoul, Republic of Korea: ACM, Oct. 2018, pp. 1208–1216.
[48] M. Seufert, S. Egger, M. Slanina, T. Zinner, T. Hoßfeld, and P. Tran-Gia, "A survey on Quality of Experience of HTTP adaptive streaming," IEEE Communications Surveys & Tutorials, vol. 17, no. 1, pp. 469–492, Sep. 2014.
[49] M. N. Garcia, F. De Simone, S. Tavakoli, N. Staelens, S. Egger, K. Brunnström, and A. Raake, "Quality of Experience and HTTP adaptive streaming: A review of subjective studies," in Proc. IEEE Int. Conf. Quality of Multimedia Experience. Singapore: IEEE, Sep. 2014, pp. 141–146.
[50] C. G. Bampis, Z. Li, A. K. Moorthy, I. Katsavounidis, A. Aaron, and A. C. Bovik, "Study of temporal effects on subjective video Quality of Experience," IEEE Trans. Image Processing, vol. 26, no. 11, pp. 5217–5231, Nov. 2017.
[51] K. Ma, Z. Duanmu, Z. Wang, Q. Wu, W. Liu, H. Yong, H. Li, and L. Zhang, "Group maximum differentiation competition: Model comparison with few samples," IEEE Trans. Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2018.2889948, IEEE Xplore early access, 2019.
[52] T. Hoßfeld, M. Seufert, M. Hirth, T. Zinner, P. Tran-Gia, and R. Schatz, "Quantification of YouTube QoE via crowdsourcing," in Proc. IEEE Int. Sym. Multimedia. Dana Point, CA, USA: IEEE, Dec. 2011, pp. 494–499.
[53] F. Dobrian, V. Sekar, A. Awan, I. Stoica, D. Joseph, A. Ganjam, J. Zhan, and H. Zhang, "Understanding the impact of video quality on user engagement," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 362–373, Aug. 2011.
[54] Y. Qi and M. Dai, "The effect of frame freezing and frame skipping on video quality," in Proc. IEEE Int. Conf. Intelligent Information Hiding and Multimedia Signal Processing. Pasadena, CA, USA: IEEE, Dec. 2006, pp. 423–426.
[55] N. Staelens, S. Moens, W. V. den Broeck, I. Marien, B. Vermeulen, P. Lambert, R. V. de Walle, and P. Demeester, "Assessing Quality of Experience of IPTV and video on demand services in real-life environments," IEEE Trans. Broadcasting, vol. 56, no. 4, pp. 458–466, Dec. 2010.
[56] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. De Veciana, "Video quality assessment on mobile devices: Subjective, behavioral and objective studies," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652–671, Oct. 2012.
[57] R. Pastrana-Vidal, J. C. Gicquel, C. Colomes, and H. Cherifi, "Sporadic frame dropping impact on quality perception," in Human Vision and Electronic Imaging IX. San Jose, CA, USA: SPIE, Jun. 2004, pp. 182–194.
[58] C. Bishop, Pattern Recognition and Machine Learning. Berlin, Heidelberg: Springer-Verlag, 2006.
[59] M. Grafl and C. Timmerer, "Representation switch smoothing for adaptive HTTP streaming," in Proc. IEEE Int. Workshop Perceptual Quality of Systems. Vienna, Austria: ISCA/DEGA, Sep. 2013, pp. 178–183.
[60] Z. Li, X. Zhu, J. Gahm, R. Pan, H. Hu, A. C. Begen, and D. Oran, "Probe and adapt: Rate adaptation for HTTP video streaming at scale," IEEE Journal on Selected Areas in Communications, vol. 32, no. 4, pp. 719–733, Apr. 2014.
[61] W. Liu, Z. Duanmu, and Z. Wang, "End-to-end blind quality assessment of compressed videos using deep neural networks," in Proc. ACM Int. Conf. Multimedia. Seoul, Republic of Korea: ACM, Oct. 2018, pp. 546–554.
[62] C. G. Bampis, A. C. Bovik, and Z. Li, "A simple prediction fusion improves data-driven full-reference video quality assessment models," in Picture Coding Symposium. San Francisco, CA, USA: IEEE, Jun. 2018, pp. 298–302.
[63] Z. Wang, Z. Duanmu, A. Rehman, and K. Zeng, "Method and system for automatic user Quality-of-Experience measurement of streaming video," Aug. 2017, US Patent WO/2017/152274.
[64] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, Jul. 2011.
[65] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd, "OSQP: An operator splitting solver for quadratic programs," ArXiv preprint arXiv:1711.08013.
IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, Jan. 2009.
[68] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.
[69] Z. Wang and E. P. Simoncelli, "Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities," Journal of Vision, vol. 8, no. 12, pp. 8–8, 2008.
[70] J. Le Feuvre, C. Concolato, and J. Moissinac, "GPAC: Open source multimedia framework," in Proc. ACM Int. Conf. Multimedia. Augsburg, Bavaria, Germany: ACM, Sep. 2007, pp. 1009–1012.
[71] H. Riiser, P. Vigmostad, C. Griwodz, and P. Halvorsen, "Commute path bandwidth traces from 3G networks: Analysis and applications," in Proc. ACM Conf. Multimedia Systems. Oslo, Norway: ACM, Feb. 2013, pp. 114–118.
[72] Z. Duanmu, W. Liu, D. Chen, Z. Li, Z. Wang, Y. Wang, and W. Gao. (2019) Pairwise comparison of objective QoE models via analysis-by-synthesis. [Online]. Available: http://ivc.uwaterloo.ca/research/KSQI/demo/.
[73] ITU-R BT.500-12, "Recommendation: Methodology for the subjective assessment of the quality of television pictures," Nov. 1993.
[74] K. Tsukida and M. R. Gupta, "How to analyze paired comparison data," University of Washington, Tech. Rep. UWEETR-2011-0004, May 2011.
[75] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, and F. Battisti, "Image database TID2013: Peculiarities, results and perspectives," Signal Processing: Image Communication, vol. 30, pp. 57–77, Jan. 2015.
[76] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms,"