A New Abstraction for Internet QoE Optimization
Junchen Jiang, University of Chicago
Siddhartha Sen, Microsoft Research
ABSTRACT
A perennial quest in networking research is how to achieve higher quality of experience (QoE) for users without incurring more resources. This work revisits an important yet often overlooked piece of the puzzle: what should the QoE abstraction be? A QoE abstraction is a representation of application quality that describes how decisions affect QoE. The conventional wisdom has relied on developing handcrafted quality metrics (e.g., video rebuffering events, web page loading time) that are specialized to each application, content, and setting. We argue that in many cases, it may be fundamentally hard to capture a user's perception of quality using a list of handcrafted metrics, and that expanding the metric list may lead to unnecessary complexity in the QoE model without a commensurate gain. Instead, we advocate for a new approach based on a new QoE abstraction called visual rendering. Rather than a list of metrics, we model the process of quality perception as a user watching a continuous "video" (visual rendering) of all the pixels on their screen. The key advantage of visual rendering is that it captures the full experience of a user with the same abstraction for all applications. This new abstraction opens new opportunities (e.g., the possibility of end-to-end deep learning models that infer QoE directly from a visual rendering), but it also gives rise to new research challenges (e.g., how to emulate the effect of an application decision on the visual rendering). This paper makes the case for visual rendering as a unifying abstraction for Internet QoE and outlines a new research agenda to unleash its opportunities.
1. Introduction
An inflection point in Internet traffic is afoot, driven by a confluence of trends in Internet applications (web services, video streaming, etc.): more devices with larger screens, more high-fidelity content, more interactive applications, and more impatient users [2, 10]. These trends are playing out against a backdrop of plateauing improvement in video and web service quality despite considerable academic and industrial research effort. The consequence is far reaching: application quality continues to fall short of user expectations, and application demands increasingly overwhelm the Internet's capacity; e.g., content providers are forced to reduce streaming video quality to cope with more users staying at home [12]. These unprecedented challenges call for a new approach to achieving higher quality of experience (QoE) for users given limited network resources.

Figure 1: Traditional QoE optimization is based on quality metrics, which fail to capture the full experience of a user. We propose a new abstraction called visual rendering, which captures user experience as a stream of pixels on the screen. (a) Conventional approach centered around quality metrics (video rebuffering, bitrate, page-load time, time-to-first-byte, ...) feeding a predictive QoE model and adaptive logic; (b) new abstraction of visual rendering (everything a user sees on screen) for Internet QoE.

The trade-off between QoE and resources has been widely studied in the networking and multimedia communities.
A key concept underpinning most QoE optimization work is a QoE model that infers the quality perceived by a user when interacting with an application, such as watching a streaming video (e.g., [16, 22]) or loading a web page (e.g., [21, 26, 15, 17]). QoE models are integral to protocols and control algorithms that adapt their decisions to maximize QoE given limited, dynamic availability of network resources (bandwidth, latency, etc.). An accurate QoE model enables these protocols to balance conflicting metrics (e.g., when does a user prefer higher resolution over less rebuffering?) and to achieve high QoE with minimum resources (e.g., for a given web page, how fast is fast enough for users?).

The conventional approach to QoE modeling is based on human-engineered features, including quality metrics such as video rebuffering events, page load time, etc., and other features such as screen size, genre of the video/website, etc. Over the decades, a large body of research has expanded the set of features and quality metrics, but recently there has been a dramatic acceleration, with not only more features but finer-grained features handcrafted for each application, each content type, and even each individual piece of content (i.e., a specific video or web page). Indeed, the march of QoE feature engineering is in full swing, fueled by new applications (e.g., interactive live video) on new devices (e.g., panoramic headsets) with new application behaviors (e.g., new online advertising methods). For example, recent work predicts video QoE using 275 features [42] and complex machine-learning models [20]. But the improvement in performance is not commensurate with this complexity.

The root of this complexity explosion, we argue, is that quality metrics and features act as an "information bottleneck" that reduces QoE perception to a list of values, when in fact the way users perceive QoE is more complex than what a (potentially long) list of values can capture. For instance, a user watching a video rarely perceives its quality by consciously counting how long each stall lasts; instead, the video stalls (and other quality incidents) influence the user's full viewing experience, including how much a stall disturbs their engagement with the video content (e.g., if it occurs during an important moment like a goal in a live sports game). By reducing QoE to a handful of metrics, it would be difficult to characterize the impact of such quality incidents on the user's full experience. In other words, the problem with today's QoE models is not that they need more features or complexity; it is that the feature-based abstraction of quality is a mismatch for capturing user quality perception.

In this paper, we propose a new abstraction for QoE modeling called visual rendering: a video stream that records all of the pixels displayed on the screen over time as seen by the user, including the activities of all visible windows/frames. For example, it may capture streaming video being played back by a viewer, or a sequence of web objects being rendered by a browser. Figure 1 contrasts the traditional QoE abstraction based on quality metrics with a visual rendering. A visual rendering is fundamentally distinct from the static content of an application (e.g., the raw video or web page content): it captures the rendering of the content on the screen after compression, reordering, and any other effects of the application and network protocols have taken effect.
The abstraction of visual rendering enjoys two unique advantages over the traditional feature-based abstraction:
• Visual rendering captures the full visual experience of a user, which encompasses the information captured by existing quality metrics, future ones we may discover, and others we may never discover.
• Visual rendering applies to all Internet applications, because users experience these applications by viewing pixels on a screen. Thus it is a unifying abstraction that could potentially lead to unified QoE models across applications.
Now, it may seem counter-intuitive that we address the high complexity of QoE modeling by using a seemingly more complex abstraction. However, recent trends give us reasons to be optimistic. Computer vision has been revolutionized by the transition from traditional feature-based models to far more accurate and general deep learning models, and the key enabling idea is to learn useful representations directly from raw images, rather than from handcrafted features. Inspired by this success, we believe a similar transformative approach can be applied to QoE modeling, especially since QoE perception and computer vision share the visual perception process. Although computer vision techniques and deep learning have been used for QoE optimization, they have been used within the framework of feature-based QoE modeling, e.g., modeling the relationship between quality metrics and QoE (e.g., [46, 47]) or deriving quality metrics from the static content (e.g., [32]). We believe the time has come for a redefinition of the QoE abstraction, driven by both application "pulls" (e.g., user experience as the key driver) and technology "pushes" (e.g., advances in computer vision).
2. Why QoE Modeling Matters
We believe that accurate QoE modeling is the key to achieving higher QoE in the face of limited network resources.
Today's Internet users have much higher expectations for application quality than a few years ago. As more applications move to mobile interfaces, users are becoming increasingly impatient and sensitive to sub-second increases in page load time [5]. The surge of live videos (e.g., [9]) has shifted people's perception of Internet videos from on-demand streaming to real-time interaction with a massive, live audience. This growing demand for low delays is being met with a craving for ultra-high-quality content. With mainstream content providers and websites offering more videos in 4K or higher resolutions [7, 4], Internet video viewers today demand much higher quality than ever before.

At the same time, network resources are not growing as fast or as evenly. The disparity of broadband network access at home is a widespread phenomenon, even in the US [3]. The gap between limited network resources and the quest for higher application quality underscores the need for new techniques that achieve better QoE-resource tradeoffs: either achieving higher QoE with the same resources or reducing resource demands without hurting QoE.
Applications use a wide range of control algorithms to optimize the QoE of Internet video (e.g., [44, 14, 45, 33]) and web services (e.g., [19, 30, 41]) under dynamic availability of network resources; accurate QoE modeling is key to the success of most of these techniques. At a high level, a QoE-optimizing control algorithm can be framed as choosing the optimal control action $a^*$ (e.g., selecting the video bitrate, prioritizing web objects in a page, etc.) from an action space $A$ that maximizes the expected QoE: $a^* = \operatorname{argmax}_{a \in A} \hat{Q}(a, \hat{r})$, where $\hat{r}$ estimates the available network resources (bandwidth, latency, etc.) and $\hat{Q}(a, \hat{r})$ is the estimated QoE when taking action $a$ under $\hat{r}$. Although this equation formulates a single-step optimization, control algorithms typically optimize a longer-term QoE objective, which has important implications for QoE modeling, but the idea is the same. (A minimal code sketch of this selection step follows the list below.)

A considerable amount of research has focused on making accurate predictions of $\hat{r}$ to improve QoE under a given $\hat{Q}$. We argue that accurate modeling of QoE ($\hat{Q}$) is at least as important as accurate predictions of network resources. In particular, the QoE model fundamentally limits the scope for improvement of all control algorithms, for three reasons:
• Balancing conflicting objectives: Applications are often faced with conflicting quality objectives. For instance, video QoE can be improved by increasing average bitrate, avoiding bitrate switches and rebuffering (stalls), and reducing start-up delay, but maximizing bitrate and minimizing join time are often in conflict, especially for short videos, and minimizing bitrate switches often conflicts with maximizing bitrate and reducing rebuffering [16], especially when bandwidth varies a lot. In such settings, an accurate QoE model is crucial for adaptive bitrate (ABR) algorithms to strike a good balance among the objectives.
• Identifying which actions matter: Not all quality improvements lead to higher QoE, e.g., because users have limited cognitive capacity to perceive the change. For instance, when the page load time is below or above certain thresholds, it may be too fast or too slow for users to experience a quality difference [49]. Knowing exactly when quality improvements have a diminishing impact on QoE is critical to achieving high QoE with minimum resources.
• Limiting action granularity: Finally, finer-grained QoE models allow finer-grained adaptation actions. For instance, it has been shown that the same video bitrate leads to different user-perceived quality depending on the video content [34, 11, 28]. If a QoE model is agnostic to the perceptual quality of each bitrate on each video chunk, then the ABR algorithm will not be able to raise/lower the bitrate of the chunks where it matters more/less, thus missing opportunities to improve QoE or save bandwidth.

Figure 2: An illustration of visual rendering and QoE modeling based on visual renderings. A visual rendering (everything a user sees on screen) spans, e.g., a video streaming session (normal and full-screen mode) and web sessions (news site, search engine), and feeds a quality function (DL algorithm) that outputs QoE.
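To make the selection step above concrete, here is a minimal, purely illustrative sketch of the single-step optimization $a^* = \operatorname{argmax}_{a \in A} \hat{Q}(a, \hat{r})$ for a video player choosing a bitrate. The bitrate ladder, the resource estimator, and the QoE model are hypothetical placeholders, and real controllers also optimize a longer-horizon objective; the sketch only shows how $\hat{Q}$ and $\hat{r}$ fit together.

```python
# A minimal, illustrative sketch of a* = argmax_{a in A} Q_hat(a, r_hat):
# a video player choosing a bitrate. The ladder, resource estimator, and
# QoE model are placeholders, not a real ABR algorithm.

BITRATES_KBPS = [300, 750, 1200, 2850, 4300]  # hypothetical action space A

def estimate_resources(throughput_history_kbps):
    """Stand-in for r_hat: a trivial bandwidth estimate."""
    return {"bandwidth_kbps": sum(throughput_history_kbps) / len(throughput_history_kbps)}

def estimated_qoe(bitrate_kbps, resources):
    """Stand-in for Q_hat(a, r_hat): rewards bitrate, penalizes the
    rebuffering risk implied by exceeding the estimated bandwidth."""
    rebuffer_risk = max(0.0, bitrate_kbps - resources["bandwidth_kbps"])
    return bitrate_kbps / 1000.0 - 0.005 * rebuffer_risk

def choose_bitrate(throughput_history_kbps):
    r_hat = estimate_resources(throughput_history_kbps)
    return max(BITRATES_KBPS, key=lambda a: estimated_qoe(a, r_hat))

print(choose_bitrate([2400, 2100, 2600]))  # -> 1200 with these toy weights
```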
3. The Case for Visual Rendering
We introduce visual rendering as a new abstraction for modeling QoE, outline its benefits and potential over current approaches, and discuss similar notions in prior work.
Although a plethora of techniques exist for modeling the QoE of Internet applications, they all share the same high-level feature-based approach. They first extract handcrafted features or quality metrics, and then build quality functions to model the relationship between these features and user QoE. Each quality metric is crafted to capture some aspect of application quality that might affect user QoE. Quality metrics are widely used in industry as key performance indicators for optimizing QoE, because they show stronger correlations with user ratings and engagement than traditional packet/flow-level performance metrics. For instance, video streaming QoE is modeled using metrics of visual quality of the rendered frames (e.g., SSIM [8]), quality stability (e.g., number of bitrate switches), and smoothness (e.g., rebuffering events [22]). The quality functions range from linear combinations of these metrics [45] to deep learning models [20]. Similarly, web QoE is modeled by variants of page load time (e.g., time-to-first-byte) to capture the impact of object loading progress on user QoE [17, 18, 21].

Figure 3: A qualitative comparison of traditional quality metrics and our proposed abstraction of visual rendering (axes: expressiveness and generalizability; points: simple quality metrics, complex quality metrics, visual rendering).

We explore a new approach: instead of modeling QoE as a function of handcrafted features, we instead model it directly from the pixels a user sees on the screen over time, including the activities of all visible windows. In a video streaming application this could be the frame-by-frame playback of a streamed video; in a web service this could be the visual sequence of web objects loading in a web browser. We call this a visual rendering. Figure 2 illustrates an example of a visual rendering from a web browser, in which it first loads a web page (search engine), then streams a video, and then loads another web page (news). Here, we assume the browser covers the full screen, but in general a visual rendering may include multiple windows.
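As a concrete, purely illustrative notion of what a visual rendering is as data, the sketch below records the pixels shown on a screen over a short window. It assumes the third-party mss screen-capture package; the frame rate, duration, and monitor choice are our assumptions, and a production pipeline would more likely capture frames inside the player or browser rather than from the OS screen.

```python
# A toy recording of a "visual rendering": periodically grab the pixels shown
# on screen and keep them with timestamps. Assumes the third-party `mss`
# package (pip install mss); the rate, duration, and monitor are illustrative.
import time
import mss

def capture_visual_rendering(duration_s=5.0, fps=10):
    frames = []
    with mss.mss() as sct:
        monitor = sct.monitors[1]            # primary display
        deadline = time.time() + duration_s
        while time.time() < deadline:
            shot = sct.grab(monitor)         # one screenful of raw pixels
            frames.append((time.time(), shot.size, bytes(shot.rgb)))
            time.sleep(1.0 / fps)
    return frames  # list of (timestamp, (width, height), RGB bytes)
```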
What is captured (and not captured)?
Intuitively, a visual rendering represents all visual input to the visual perception process. It thus captures both the spatial experience (objects appearing on the screen at the same time) as well as the temporal experience (dynamic loading of objects or playback of a video). That said, a visual rendering does not include any non-visual factors, such as audio information (which is rarely integrated by QoE models, except those used in the acoustics literature), or contextual information like the user's device, browser settings, etc. However, as we later discuss, contextual information influences the visual rendering seen by a user and hence must be accounted for in our modeling.
The key benefit of modeling QoE based on visual renderings is that they are potentially more expressive and more generalizable than feature-based QoE models (see Figure 3).
Expressiveness
Feature-based QoE models work well when the features (quality metrics) capture user QoE, but QoE usually varies substantially even when we limit the values of key quality metrics to a small range. To show this, we create multiple videos and ask Amazon MTurkers to rate their QoE on a scale of 1-5. All videos show the same content at the same bitrate and include the same half-second rebuffering stall, and are rated by the same MTurkers; the only difference is when the stall occurs. Most QoE models would predict the same QoE for all videos, but we observe systematic differences in the mean rating of each video, as shown in Figure 4. This is because QoE ratings drop sharply when the stall occurs at a critical moment in the video. A similar effect occurs even for quality metrics that are content aware. For example, VMAF [11] is a visual quality metric that gives lower QoE estimates if a bitrate drop occurs when frame pixels are more "complex". We run a large-scale analysis on a public video QoE dataset [24] and ask MTurkers to rate videos with similar (to within 5%) rebuffering time, number of bitrate switches, and VMAF scores. Figure 5 shows the means and variances of QoE ratings (on a scale of 0-100) against VMAF score. We see that the variances are consistently larger than the differences in mean QoE rating due to higher VMAF scores, which means that VMAF cannot adequately predict user QoE. Although the variances might be explained by finer-grained quality metrics, finding all such metrics is infeasible, and prediction errors like this are common even in state-of-the-art video/web QoE models.

Figure 4: An MTurk study showing that videos with the same quality metrics can result in significantly different user QoE depending on when a stall (indicated by the arrow) occurs.

Figure 5: An MTurk study showing that visual quality metrics like VMAF cannot fully explain user QoE, as the high variance shows (the lines show the mean and the belts show the stddev; x-axis: quality (VMAF), y-axis: QoE rating). Each mean and stddev is based on at least ten QoE ratings across the same set of 30 raters.
Why visual renderings might be expressive:
In contrast, visual renderings by definition preserve all visual information that affects a user's experience, in both video and web-based applications. This not only includes all the information needed to derive existing quality metrics (e.g., rebuffering time and page loading delay), but also preserves other information that might affect QoE, including factors that we have yet to discover. These factors include how content and application quality affect a user's perceived QoE. For instance, the QoE variance in Figures 4 and 5 may be caused by how visual content affects the relationship between application quality and QoE, which is captured by a visual rendering.

Visual renderings also include information that can potentially help model how new adaptation actions affect QoE. For example, consider a new video adaptation action where the player (slightly) slows down the playback of a video while replenishing the buffer, in order to avoid more abrupt stalls. Traditional feature-based QoE models cannot capture this effect because no feature or quality metric is designed for this action. In contrast, visual renderings naturally include all temporal information, including the slowdown of the video.
Generalizability
Traditional feature-based QoE models trade complexity for higher accuracy. A case in point is the set of quality metrics used for rebuffering: initially, its impact on video QoE was measured using the rebuffering "ratio" (the fraction of time spent in stalls during a video session) [22], but more complex metrics emerged over time, capturing factors such as the relationships between rebuffering stalls (e.g., length distribution and memory effect [25, 23]) and the differences in its effect on the QoE of live videos versus on-demand videos [16]. As QoE models become more complex and fine-grained, however, they also become harder to generalize. For instance, traditional QoE models might be able to explain the QoE variance in Figure 4 if they are customized to each video; but such per-video QoE models do not generalize and are prohibitively costly to create (§4.2). Similarly, web QoE models that are specialized to a web page can predict QoE much more accurately than a one-size-fits-all QoE model [21], but creating such per-page QoE models also faces a scalability problem, especially since content is continuously changing.
Why visual renderings might be generalizable:
It is difficult to say upfront, but we have reasons to believe that a QoE model based on visual rendering will generalize. There is a striking analogy between Internet QoE research and computer vision research in the pre-deep-learning era. Back then, each computer vision task (e.g., object detection, gesture detection, segmentation) had a separate literature that developed handcrafted features customized for the task. The success of deep learning in computer vision is not only that it provides more accurate models, but also that it provides a generalizable approach. Deep learning models take the raw pixels as input and are trained "end-to-end" with minimal domain-expert intervention. Moreover, deep learning models for different computer vision tasks often share the same convolutional layers (e.g., ResNet) as common feature extractors, which are more expressive than the best handcrafted features. We speculate that building end-to-end deep learning models directly from visual renderings might lead to a more generalizable approach to QoE modeling than relying on handcrafted features/models, as Figure 6 illustrates.

Figure 6: Visual renderings may offer a more generalizable approach to QoE modeling, by providing a unifying input to deep learning models that directly predict user QoE. (a) Traditional feature-based QoE model: a feature extractor produces quality metrics (features), one pipeline per application, website, or video genre, fed to a quality function. (b) New visual-rendering-based QoE model: the visual rendering is the input to a single quality function (DL algorithm) that outputs QoE.
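To ground this speculation, the following is a hedged sketch (not a proposal from this paper, and certainly not a trained model) of the kind of end-to-end architecture Figure 6(b) suggests: a shared convolutional backbone (a ResNet) extracts per-frame features from a visual rendering, a recurrent layer aggregates them over time, and a small head regresses a QoE score. The specific layers, dimensions, and frame sampling are illustrative assumptions.

```python
# A hedged sketch of the end-to-end idea in Figure 6(b): a shared CNN
# backbone extracts per-frame features from a visual rendering, an LSTM
# aggregates them over time, and a small head regresses a QoE score.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualRenderingQoE(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)      # per-frame feature extractor
        backbone.fc = nn.Identity()            # keep the 512-d pooled features
        self.backbone = backbone
        self.temporal = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)       # scalar QoE estimate

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.temporal(feats)
        return self.head(h[-1]).squeeze(-1)    # one score per rendering

# Example: score a short rendering sampled as 8 frames of 224x224 pixels.
model = VisualRenderingQoE()
print(model(torch.randn(1, 8, 3, 224, 224)).shape)  # torch.Size([1])
```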
Several concepts from prior work are closely related to visual renderings, but do not take them to their logical extreme.
Gaze tracking/prediction:
WebGaze [30] shares with us the insight that user gaze varies with the dynamic web loading process, implying that web QoE is influenced by the gaze trajectory in addition to traditional page load time metrics. In particular, WebGaze tracks gaze while a web loading process is replayed, which is similar to a visual rendering. Some follow-up work automatically derives user gaze from web content (e.g., [31, 38]), and similar techniques are also used to track user saliency in panoramic videos (e.g., [35, 50]). However, these efforts use gaze or saliency as another feature in traditional QoE models (e.g., to reweight web objects or video pixels/chunks).
Eliciting QoE feedback:
EYEORG [37] uses recorded videos to elicit user ratings (QoE) for video streaming and web services. They do this because users may have different network connectivities, so rather than letting them stream the videos or load the web pages, they show users pre-recorded videos of a video session or web page loading process. Though the idea of showing recorded videos resembles the concept of visual renderings, EYEORG and others [43] still model QoE as a function of pre-determined quality metrics.
4. Architecting for Visual Rendering
So far we have seen that the abstraction of visual renderings could lead to more expressive and generalizable QoE models. In this section, we discuss the technical challenges to realizing this ideal. At this stage of our research, we do not yet know if the advantages of visual renderings will outweigh the challenges. We recognize that our vision for re-architecting QoE frameworks is broader than what we can accomplish alone. By outlining a specific research agenda, we hope to spark discussions and efforts from the networking, multimedia and computer vision communities.
Optimization architecture:
Figure 7 depicts a logical view of a QoE optimization framework based on visual rendering. It applies two components to each adaptation action:
• A visual renderer (§4.1) first infers the visual rendering of the action, and
• A visual rendering-based QoE model (§4.2) then predicts the QoE of a given visual rendering.
Finally, we pick the action that achieves the best QoE (a minimal sketch of this loop appears below). The visual rendering of an action may also include the recent visual renderings up to this point, since QoE is often dependent on the content and the user's QoE in recent history. This can be done implicitly by the emulator or with the help of the client-side browser.

Figure 7: A framework for QoE optimization based on visual renderings and deep learning models. Each candidate action is turned into an estimated visual rendering by a visual renderer and scored by a quality function (e.g., a DL model); the decision is the action with the best estimated QoE.
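Below is a purely illustrative sketch of this loop; render_action and qoe_model are placeholder names standing in for the visual renderer and the rendering-based QoE model, both of which are the open problems discussed next.

```python
# A purely illustrative sketch of the loop in Figure 7. `render_action` and
# `qoe_model` stand in for the visual renderer (Section 4.1) and the
# rendering-based QoE model (Section 4.2); both are open research problems.
def choose_action(candidate_actions, context, recent_rendering,
                  render_action, qoe_model):
    best_action, best_qoe = None, float("-inf")
    for action in candidate_actions:
        # Estimate the pixels the user would see if this action were taken,
        # prepended with the recent rendering to capture history.
        rendering = recent_rendering + render_action(action, context)
        qoe = qoe_model(rendering)
        if qoe > best_qoe:
            best_action, best_qoe = action, qoe
    return best_action
```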
Inferring a visual rendering from an action in real time is a formidable task, because the visual rendering may depend on the specific content being shown as well as the context of the user's video or web session. In an ideal world, we would freeze time, take the action in a parallel world for the same user, capture the visual rendering it results in, and feed that to our QoE model. Since this is not possible, we must find an alternative approach.
Leveraging existing testing infrastructure:
Web content providers rely on extensive testing infrastructure to evaluate their application protocols and control algorithms, including automated unit tests, A/B testing frameworks, human testers, and others. Some of these testing environments emulate the experience of streaming a video or loading a web page, providing an ideal opportunity to capture a visual rendering. However, even if we are able to tap into this infrastructure to enumerate all possible visual renderings that result from the adaptation actions of an application, we still face two serious challenges to making this viable:
• Diverse clients: The visual rendering experienced by a user is influenced by several contextual factors such as the user's device, available bandwidth, browser settings, etc.
• Real-time decisions: There is very little time between when a user request arrives and when an adaptation action must be taken to deliver content to the user.
These challenges imply that a visual rendering must be contextualized to the user in real time. Since creating a visual rendering from scratch is not feasible in real time, and since offline-enumerated visual renderings (such as the ones mentioned above) are not contextualized to the user, we propose a compromise: parameterized visual renderings. That is, we enumerate parameterized visual renderings offline that can take contextual factors as input online and quickly specialize the visual rendering to those factors. Although this is still a difficult task, consider the following examples. If we record a visual rendering assuming a particular network bandwidth, we can emulate other network bandwidths by simply speeding up/slowing down the visual rendering. Similarly, if we record timings in the visual rendering of when distinct web objects are loaded, we might be able to speed up/slow down specific object loading events, or even rearrange the load order (with additional video editing effort).
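As a minimal sketch of the first example (bandwidth retiming), assuming a rendering is stored as timestamped frames, one could approximate a different bandwidth by linearly rescaling the timeline. The function and the linear-scaling assumption are ours; object-level edits like the second example would require real video editing.

```python
# A minimal sketch of one "parameterized visual rendering": retime a rendering
# recorded at one bandwidth to approximate another by linearly rescaling frame
# timestamps. Purely illustrative; real renderings would also need
# object-level edits for non-uniform effects.
def retime_rendering(frames, recorded_kbps, target_kbps):
    """frames: list of (timestamp_s, frame) captured at `recorded_kbps`.
    Returns the frames with timestamps stretched/compressed as if the same
    content had been delivered at `target_kbps`."""
    scale = recorded_kbps / target_kbps      # slower target -> longer timeline
    t0 = frames[0][0]
    return [(t0 + (ts - t0) * scale, frame) for ts, frame in frames]

# Example: a rendering captured at 5 Mbps, replayed as if delivered at 2 Mbps.
slowed = retime_rendering([(0.0, "f0"), (0.5, "f1"), (1.0, "f2")], 5000, 2000)
```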
Designing a visual rendering-based QoE function
We have two intuitive reasons to posit that a general visual rendering-based QoE model is plausible. First, from a cognitive perspective, the perception of streaming video and of web browsing involves the same psychophysical process. Second, visual renderings enable us to harness the power of deep-learning-based computer vision, which also models human perception. We elaborate on both aspects below.
Drawing ideas from cognitive visual perception:
Visual perception is a primary focus of cognitive research. It aims to reveal the general psychophysical process behind all visual perception activities, which include web browsing and watching videos. There are two key concepts: expectation, which describes how prior experience affects the perception of visual stimuli, and attention, which influences the neuronal representation of current visual stimuli [27].

There is a striking parallel between these two concepts and how application quality affects QoE in networking research. For instance, a video rebuffering event (stall) is a violation of the expectation, since the user expects the video to continue playing. Similarly, fast loading of a web page means higher QoE, because it meets the expectation of a user when a link is clicked. A user's expectation of application quality is also shaped by the quality of recent web/video sessions [29, 23], which has been studied under the framework of cognitive biases. Similarly, models of human visual attention are increasingly used in 360° videos [36, 35] and recently in web optimization [31, 38, 13]. In short, we posit that using the concepts of expectation and attention, high QoE can be interpreted as having less violation of expectation within the region of attention.
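One crude way to read that interpretation, purely as our own illustration (the paper does not define such a score), is an attention-weighted expectation-violation penalty computed per displayed frame. The maps below are placeholder arrays; in practice they would come from the attention and video-prediction models discussed next.

```python
# Our own crude illustration: score a frame by the expectation violation
# within the region of attention. The attention and expectation maps are
# placeholders; in practice they would come from saliency and
# video-prediction models.
import numpy as np

def expectation_violation(rendered, expected, attention):
    """All inputs are HxW arrays; `attention` is non-negative and sums to 1."""
    return float(np.sum(attention * np.abs(rendered - expected)))

h, w = 72, 128
attention = np.ones((h, w)) / (h * w)             # uniform attention placeholder
expected = np.zeros((h, w))                       # e.g., a predicted next frame
rendered = np.zeros((h, w)); rendered[20:40, 50:80] = 1.0   # a visible glitch
print(expectation_violation(rendered, expected, attention))  # higher = worse
```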
Drawing ideas from computer vision:
While the visual perception literature provides a useful framework for understanding QoE, we still need to automatically infer attention and expectation. This is where computer vision might provide useful building blocks. In the interest of space, we only highlight the three most relevant topics. (1) Visual attention (saliency) detection [39, 40] uses convolutional models to reason about the spatial structures that influence the distribution of human visual attention. (2) Video summarization (and highlight detection) [48] uses recurrent models to learn the temporal patterns in a video and when users will pay more attention to high-level incidents. (3) Video prediction predicts future video frames based on the previous ones, which helps to model user expectation of the content.
Open questions:
Despite the apparent congruity between QoE and computer vision, their mismatch is also evident.
• What should the QoE model look like? We can use mature techniques such as the attention mechanism to model attention and recurrent models to learn temporal patterns in a visual rendering, but combining them is challenging. One idea is to merge them similarly to how computer vision models and natural language models are combined to perform high-level tasks such as visual question-answering. We also speculate that the QoE model of one application could be fine-tuned to serve other applications via transfer learning, which requires less training data (a rough sketch follows this list).
• Visual renderings are not "natural" videos: Computer vision works well with natural images/videos that do not have artificial glitches (e.g., video rebuffering or bitrate switches) that influence QoE. For instance, quality incidents such as a video stall or bitrate switch can affect user attention/expectation (as observed in [30]), but they are rarely modeled in computer vision.
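As a hedged illustration of the transfer-learning idea in the first question (our assumption, not an evaluated recipe), one could freeze the per-frame backbone of a model like the VisualRenderingQoE sketch in §3 and fine-tune only its temporal layer and head on a small labeled dataset from the new application.

```python
# A hedged sketch of the transfer-learning idea above: freeze the per-frame
# backbone of the VisualRenderingQoE sketch and fine-tune only the temporal
# layer and head on a small labeled dataset from the new application.
import torch

def fine_tune_for_new_app(model, loader, epochs=3, lr=1e-4):
    """`model` has .backbone/.temporal/.head; `loader` yields (frames, rating)."""
    for p in model.backbone.parameters():
        p.requires_grad = False                 # reuse per-frame features as-is
    params = list(model.temporal.parameters()) + list(model.head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for frames, rating in loader:           # e.g., small web-QoE dataset
            opt.zero_grad()
            loss = loss_fn(model(frames), rating)
            loss.backward()
            opt.step()
    return model
```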
Creating new QoE datasets for training the model
Existing datasets are inadequate:
Training a QoE model requires an annotated visual rendering dataset that covers many combinations of content and quality incidents. Unfortunately, existing QoE datasets have limited variability of the video/web content. For instance, popular video QoE datasets include only a handful of videos (20 or fewer [23, 6, 37]), in part because QoE data collection can be frustratingly slow and expensive: to test one piece of video content, researchers need to recruit tens of participants and let each of them watch the same video rendered with different quality incidents.
The wisdom of the crowd:
A potential solution is to leverage commercial crowdsourcing platforms such as Amazon Mechanical Turk [1]. With its short response times, automatic scaling, and reasonable pricing, crowdsourcing is a promising alternative to lab studies for QoE annotation [49, 43]. That said, existing uses of crowdsourcing platforms only model specific relationships between quality metrics/features and QoE.
Open questions:
There are two key questions:
• How to create a visual rendering-based QoE dataset? One idea is to draw from popular content (e.g., the Alexa top-1000 web sites), but popularity does not necessarily mean adequate diversity. Alternatively, one can sample across many content genres, similar to how ImageNet compiles images of different objects from each class.
• Other sources of data? We recognize that a scaled-down version of the envisioned dataset can be built by a content provider (e.g., Netflix or Google). A content provider can passively monitor visual renderings seen by its users and label each visual rendering with the user engagement (how long a user watches a video or stays on the web site) as the QoE. This process can easily generate a large amount of annotated data, but the content could be biased.
We do not claim the ideas outlined here are the only (or optimal) way of building the envisioned QoE model. Instead, we hope they inspire more ideas and research.
5. A New Frontier for ML in Networking
Machine learning is increasingly used in networking, but so far it has largely been a "solver" of complex control problems such as scheduling, bitrate adaptation, and resource selection. The abstraction of visual rendering creates a new frontier for harnessing the power of deep learning, which revolutionized computer vision and may similarly transform user-facing applications and Internet QoE. We believe the confluence of trends (user QoE as the key driver and recent advances in computer vision) makes now the right time to explore this frontier.
6. References
[1] Amazon Mechanical Turk.
[2] Cisco Annual Internet Report (2018–2023) White Paper.
[3] Digital gap between rural and nonrural America persists.
[4] Fast Growth in 4K Televisions and UHD Content Requires Premium Content Protection.
[5] Find out how you stack up to new industry benchmarks for mobile page speed.
[6] LIVE Netflix Video Quality of Experience Database. http://live.ece.utexas.edu/research/LIVE_NFLXStudy/nflx_index.html.
[7] Streaming toward television's future: A detailed look at 4K video and how Akamai is making it a reality.
[8] The SSIM Index for Image Quality Assessment.
[9] Twitch, Facebook, YouTube and the future of Interactive Video. https://whatsnewinpublishing.com/twitch-facebook-youtube-and-the-future-of-interactive-video/.
[10] Video Quality of Experience: Requirements and Considerations for Meaningful Insight.
[11] VMAF: The Journey Continues. https://netflixtechblog.com/vmaf-the-journey-continues-44b51ee9ed12.
[12] YouTube joins Netflix in reducing video quality in Europe.
[13] PERCIVAL: Making in-browser perceptual ad blocking practical with deep learning. USENIX Association, 2020.
[14] Z. Akhtar, Y. S. Nam, R. Govindan, S. Rao, J. Chen, E. Katz-Bassett, B. Ribeiro, J. Zhan, and H. Zhang. Oboe: auto-tuning video ABR algorithms to network conditions. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 44–58, 2018.
[15] A. Balachandran, V. Aggarwal, E. Halepovic, J. Pang, S. Seshan, S. Venkataraman, and H. Yan. Modeling web quality-of-experience on cellular networks. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, pages 213–224, 2014.
[16] A. Balachandran, V. Sekar, A. Akella, S. Seshan, I. Stoica, and H. Zhang. Developing a predictive model of quality of experience for internet video. In ACM SIGCOMM Computer Communication Review, volume 43, pages 339–350. ACM, 2013.
[17] E. Bocchi, L. De Cicco, and D. Rossi. Measuring the quality of experience of web users. ACM SIGCOMM Computer Communication Review, 46(4):8–13, 2016.
[18] J. Brutlag, Z. Abrams, and P. Meenan. Above the fold time: Measuring web page performance visually. In Velocity: Web Performance and Operations Conference, 2011.
[19] M. Butkiewicz, D. Wang, Z. Wu, H. V. Madhyastha, and V. Sekar. Klotski: Reprioritizing web content to improve user experience on mobile devices. In NSDI, volume 1, pages 2–3, 2015.
[20] C. Cárdenas-Angelat, J. B. Polglase, C. J. Vaca-Rubio, and M. C. Aguayo-Torres. Application of deep learning techniques to video QoE prediction in smartphones, pages 252–256. IEEE, 2019.
[21] D. N. da Hora, A. S. Asrese, V. Christophides, R. Teixeira, and D. Rossi. Narrowing the gap between QoS metrics and web QoE using above-the-fold metrics. In International Conference on Passive and Active Network Measurement, pages 31–43. Springer, 2018.
[22] F. Dobrian, V. Sekar, A. Awan, I. Stoica, D. Joseph, A. Ganjam, J. Zhan, and H. Zhang. Understanding the impact of video quality on user engagement. In ACM SIGCOMM Computer Communication Review, volume 41, pages 362–373, 2011.
[23] Z. Duanmu, K. Ma, and Z. Wang. Quality-of-experience for adaptive streaming videos: An expectation confirmation theory motivated approach. IEEE Transactions on Image Processing, 27(12):6135–6146, 2018.
[24] Z. Duanmu, A. Rehman, and Z. Wang. A quality-of-experience database for adaptive video streaming. IEEE Transactions on Broadcasting, 64(2):474–487, June 2018.
[25] N. Eswara, S. Ashique, A. Panchbhai, S. Chakraborty, H. P. Sethuram, K. Kuchi, A. Kumar, and S. S. Channappayya. Streaming video QoE modeling and prediction: A long short-term memory approach. IEEE Transactions on Circuits and Systems for Video Technology, 30(3):661–673, 2019.
[26] Q. Gao, P. Dey, and P. Ahammad. Perceived performance of top retail webpages in the wild: Insights from large-scale crowdsourcing of above-the-fold QoE. In Proceedings of the Workshop on QoE-based Analysis and Management of Data Communication Networks, pages 13–18, 2017.
[27] N. Gordon, N. Tsuchiya, R. Koenig-Robert, and J. Hohwy. Expectation and attention increase the integration of top-down and bottom-up signals in perception through different pathways. PLoS Biology, 17(4):e3000233, 2019.
[28] Y. Guan, C. Zheng, X. Zhang, Z. Guo, and J. Jiang. Pano: Optimizing 360 video streaming with a better understanding of quality perception. In Proceedings of the ACM Special Interest Group on Data Communication, pages 394–407, 2019.
[29] T. Hoßfeld, S. Biedermann, R. Schatz, A. Platzer, S. Egger, and M. Fiedler. The memory effect and its implications on web QoE modeling, pages 103–110. IEEE, 2011.
[30] C. Kelton, J. Ryoo, A. Balasubramanian, and S. R. Das. Improving user perceived page load times using gaze. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 545–559, 2017.
[31] C. Kelton, Z. Wei, S. Ahn, A. Balasubramanian, S. R. Das, D. Samaras, and G. Zelinsky. Reading detection in real-time. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, pages 1–5, 2019.
[32] W. Liu, Z. Duanmu, and Z. Wang. End-to-end blind quality assessment of compressed videos using deep neural networks. In ACM Multimedia, pages 546–554, 2018.
[33] H. Mao, R. Netravali, and M. Alizadeh. Neural adaptive video streaming with Pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 197–210, 2017.
[34] V. Nathan, V. Sivaraman, R. Addanki, M. Khani, P. Goyal, and M. Alizadeh. End-to-end transport for video QoE fairness. In Proceedings of the ACM Special Interest Group on Data Communication, pages 408–423, 2019.
[35] A. Nguyen, Z. Yan, and K. Nahrstedt. Your attention is unique: Detecting 360-degree video saliency in head-mounted display for head movement prediction. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1190–1198, 2018.
[36] C. Ozcinar, J. Cabrera, and A. Smolic. Visual attention-aware omnidirectional video streaming using optimal tiles for virtual reality. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 9(1):217–230, 2019.
[37] M. Varvello, J. Blackburn, D. Naylor, and K. Papagiannaki. Eyeorg: A platform for crowdsourcing web quality of experience measurements. In Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies, pages 399–412, 2016.
[38] S. Vidyapu, V. S. Vedula, and S. Bhattacharya. Quantitative visual attention prediction on webpage images using multiclass SVM. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, pages 1–9, 2019.
[39] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang. Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146, 2019.
[40] W. Wang and J. Shen. Deep visual attention prediction. IEEE Transactions on Image Processing, 27(5):2368–2378, 2017.
[41] X. S. Wang, A. Krishnamurthy, and D. Wetherall. Speeding up web page loads with Shandian. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 109–122, 2016.
[42] S. Wassermann, N. Wehner, and P. Casas. Machine learning models for YouTube QoE and user engagement prediction in smartphones. ACM SIGMETRICS Performance Evaluation Review, 46(3):155–158, 2019.
[43] C.-C. Wu, K.-T. Chen, Y.-C. Chang, and C.-L. Lei. Crowdsourcing multimedia QoE evaluation: A trusted framework. IEEE Transactions on Multimedia, 15(5):1121–1137, 2013.
[44] F. Y. Yan, H. Ayers, C. Zhu, S. Fouladi, J. Hong, K. Zhang, P. Levis, and K. Winstein. Learning in situ: a randomized experiment in video streaming. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 495–511, 2020.
[45] X. Yin, A. Jindal, V. Sekar, and B. Sinopoli. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 325–338, 2015.
[46] T. Yue, H. Wang, S. Cheng, and J. Shao. Deep learning based QoE evaluation for internet video. Neurocomputing, 2019.
[47] H. Zhang, H. Hu, G. Gao, Y. Wen, and K. Guan. DeepQoE: A unified framework for learning to predict video QoE, pages 1–6. IEEE, 2018.
[48] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In European Conference on Computer Vision, pages 766–782. Springer, 2016.
[49] X. Zhang, S. Sen, D. Kurniawan, H. Gunawi, and J. Jiang. E2E: embracing user heterogeneity to improve quality of experience on the web. In Proceedings of the ACM Special Interest Group on Data Communication, pages 289–302, 2019.
[50] Z. Zhang, Y. Xu, J. Yu, and S. Gao. Saliency detection in 360 videos.