Self-play Reinforcement Learning for Video Transmission
Tianchi Huang, Rui-Xiao Zhang, Lifeng Sun*
Dept. of CS & Tech., BNRist, Key Laboratory of Pervasive Computing, Tsinghua University
*Corresponding Author. {htc19@mails, sunlf@}tsinghua.edu.cn
ABSTRACT
Video transmission services adopt adaptive algorithms to meet users' demands. Existing techniques are often optimized and evaluated by a function that linearly combines several weighted metrics. Nevertheless, we observe that such a function fails to describe the requirement accurately; thus, the proposed methods might eventually violate the original needs. To eliminate this concern, we propose
Zwei, a self-play reinforcement learning algorithm for video transmission tasks. Zwei aims to update the policy by straightforwardly utilizing the actual requirement. Technically, Zwei samples a number of trajectories from the same starting point and instantly estimates the win rate w.r.t. the competition outcome, where the competition result represents which trajectory is closer to the assigned requirement. Subsequently, Zwei optimizes the strategy by maximizing the win rate. To build Zwei, we develop simulation environments, design adequate neural network models, and invent training methods for dealing with different requirements in various video transmission scenarios. Trace-driven analysis over two representative tasks demonstrates that Zwei optimizes itself according to the assigned requirement faithfully, outperforming state-of-the-art methods under all considered scenarios.
CCS CONCEPTS
• Information systems → Multimedia streaming;
KEYWORDS
Video transmission, Self-play reinforcement learning.
ACM Reference Format:
Tianchi Huang, Rui-Xiao Zhang, Lifeng Sun. 2020. Self-play Reinforcement Learning for Video Transmission. In NOSSDAV'20, June 10–11, 2020, Istanbul, Turkey. ACM, New York, NY, USA, Article 4, 10 pages. https://doi.org/10.1145/3386290.3396930
1 INTRODUCTION
Thanks to the dynamic growth of video encoding technologies and basic Internet services [6], we currently live with the great help of video transmission services. In particular, videos are required to be transmitted in a way that fulfills users' requirements, where the requirement is often known as quality of experience (QoE) or quality of service (QoS). Unfortunately, although the fundamental
issue was first studied about two decades ago [15], current approaches, either heuristics or learning-based methods, fall short of achieving this goal. On the one hand, heuristics often use existing domain knowledge [15] as the working principle. However, such approaches sometimes require careful tuning and will backfire under circumstances that violate their presumptions, finally failing to achieve acceptable performance under all considered scenarios. On the other hand, learning-based methods [14, 22] leverage deep reinforcement learning (DRL) to train a neural network (NN) by interacting with the environment without any presumptions, aiming to obtain a higher reward. In recent work, the reward function is often defined as a linear equation combining several manipulated variables. In this study, we empirically discover that i) an inaccurate reward function may mislead the learning-based algorithm into generalizing bad strategies, since ii) the actual requirement can hardly be represented by a linear reward function with fixed weights. Moreover, iii) considering the diversity of real-world network environments, it is difficult to design an accurate reward function that perfectly fits any network condition (§2.2). As a result, despite its ability to gain a higher numerical reward score, a learning-based scheme may generalize a strategy that hardly meets the actual requirements.

Taking a look from another perspective, we observe that the aforementioned problem can be naturally described as a deterministic goal or requirement [9]. For instance, in most cases, the goal of an adaptive bitrate (ABR) streaming algorithm is to achieve lower rebuffering time first and, next, to reach a higher bitrate [10, 13]. Inspired by this opportunity, we envision a self-play reinforcement learning framework, named Zwei, as a solution for tackling the video transmission dilemma (§3). The key idea of Zwei is to repeatedly sample trajectories by itself and to distinguish which sequence is closer to the assigned demand, so as to iteratively learn a strategy that satisfies the demand. Specifically, we apply Monte Carlo (MC) search to estimate moving decisions from the starting point. In the MC search process, several trajectories are sampled from the starting state according to the current policy. Then the expected long-term win rate is estimated by averaging the competition results over each trajectory pair, where the result represents which of the two trajectories is closer to the actual demand. Having estimated the win rate, Zwei adopts proximal policy optimization (PPO) [18] to optimize the NN by increasing the probabilities of the winning samples and reducing the probabilities of the losing samples.

In the rest of the paper (§4), we evaluate the potential of Zwei using trace-driven analyses of various representative video transmission scenarios. To achieve this, we build several faithful video transmission simulators which accurately replicate the environment via real-world network datasets. Specifically, we validate Zwei on two different tasks (§2.1): a client-to-server service and a server-to-client service. Note that each of them has individual requirements and challenges. As expected, evaluation results
demonstrate the superiority of Zwei over existing state-of-the-art approaches on all tasks. In detail, we show that (i)
Zwei outperforms existing ABR algorithms on both HD videos and 4K videos, with improvements in Elo score [7] of 32.24%–36.38%, and (ii)
in the crowd-sourced live streaming (CLS) scheduling task, Zwei reduces the overall cost by 22% and decreases the overall stalling ratio by over 6.5% in comparison with the state-of-the-art learning-based scheduling method LTS [22].
Contributions:
This paper makes three key contributions.
• We point out the shortcomings of learning-based schemes in video transmission tasks and present the idea of updating networks without reward engineering (§2.2).
• We propose Zwei, a novel framework that employs self-play reinforcement learning to make this idea practical (§3).
• We implement Zwei in two representative video transmission scenarios, i.e., rate adaptation and crowd-sourced scheduling. Results indicate that Zwei outperforms existing schemes in all considered scenarios (§4).
2 BACKGROUND AND MOTIVATION
2.1 Video Transmission Scenarios
In this work, the video transmission service mainly consists of the following two scenarios:
Client-to-Server Service.
In this scenario, users adopt a video player to watch videos on demand. First, video contents are pre-encoded and pre-chunked into several bitrate ladders on the server. Then the video player, placed on the client side, dynamically picks the proper bitrate for the next chunk according to varying network conditions. Specifically, the bitrate decisions should achieve high bitrate and low rebuffering over the entire session [5, 10]. We call this adaptive bitrate streaming (ABR).
Server-to-Client Service.
Consider that we are a content provider with multiple content delivery networks (CDNs) of different costs and performance: how should users' requests be scheduled to the proper CDN so as to provide live streaming services with a lower stalling ratio and lower cost? This is commonly called crowd-sourced live streaming (CLS) [22]. Details of each video transmission scenario are given in §4. We can see that video transmission tasks are often required to perform well on multiple metrics simultaneously.
2.2 Motivation
Recent video transmission algorithms mainly fall into two types, i.e., heuristics and learning-based schemes. Heuristic methods utilize an existing fixed model or domain knowledge to construct the algorithm, yet they inevitably fail to achieve optimal performance in all considered network conditions due to inefficient parameter tuning (§5, [14]). Learning-based schemes train an NN model from scratch towards a higher reward score, where the reward function is often defined as a linear combination of several weighted metrics [14, 22]. Nevertheless, considering the characteristics of the video transmission tasks above, we argue that the policy generated by a linear-based reward fails to always stay on the right track. We therefore set up two experiments in the ABR scenario (§4.2) to prove this conjecture.
Observation 1.
The best learning-based ABR algorithm, Pensieve [14], is not always the best scheme on every network trace.
Figure 1: Average bitrate and rebuffering time for each ABR method, performed over the HSDPA network traces.
Figure 2: Comparison of the linear-based optimal and the requirement-based optimal strategy, evaluated on a fixed (0.5 mbps) trace and a real-world network trace [17]. We can see the difference between the actual requirement and the optimal trajectory generated by the linear-based reward.
Recently, given a deterministic QoE metric that linearly combines several underlying metrics, several attempts have been made to build ABR algorithms via model-based [21] or model-free reinforcement learning (RL) methods [14]. However, such methods heavily rely on the accuracy of the given QoE metric. In particular, how to set proper QoE parameters for each network condition is a critical challenge for training a good ABR algorithm. To verify whether the QoE parameters influence the performance of ABR algorithms, we set up an experiment reporting the average bitrate and rebuffering time of several state-of-the-art ABR baselines (§4.2.3) and Zwei. Despite the outstanding performance that Pensieve achieves, the best learning-based ABR algorithm does not always stand for the best scheme. On the contrary, Zwei always outperforms existing approaches.
Observation 2.
A linear weighted reward function can hardly map to the actual requirement across all network traces.
To better understand the effectiveness of weighted-sum reward functions, we compare the linear-based optimal strategy with the requirement-based optimal strategy under two representative network traces, where the linear-based optimal is the policy that obtains the maximum reward, and the requirement-based optimal stands for the strategy closest to the given requirement. Unsurprisingly, from Figure 2 we observe that the linear-based optimal policy behaves differently from the requirement-based optimal strategy. The reason is that the given weight parameters cannot be adjusted dynamically according to the current network status. In general, we believe that a policy learned by reward engineering might fall into unexpected conditions.
Summary.
In general, we observe that no matter how precisely and carefully the parameters of a linear-based reward function are tuned, such tuned functions can hardly meet the requirement under all network conditions; e.g., the proper parameters for stable and unstable network conditions are not the same. To that end, a traditional learning-based scheme, which optimizes the NN via the assigned function, will eventually fail to provide reliable performance on all network traces. We, therefore, attempt to learn the strategies from the original demands.
Figure 3: Zwei System Overview. The framework is mainly composed of four phases: Monte Carlo sampling, battle competition, win rate estimation, and policy optimization.
3 ZWEI
In this section, we introduce the details of Zwei, including its key principle, training algorithm, and implementation.
As mentioned before, we attempt to learn the strategy based on actual requirements instead of linear-based reward functions. Considering that the basic requirement seldom directly provides gradients for optimization, we therefore employ the self-play method [19], which enables the NN to explore better policies and suitable rewards via self-learning. Zwei treats the learning task as a competition between distinct trajectories sampled by itself, where the competition outcome is determined by a set of rules symbolizing which trajectory is closer to the given requirement. Subsequently, we are able to update the NN towards achieving a better outcome.
Figure 3 presents the main phases of our framework. The pipeline can be summarized as follows:
Phase 1: Monte Carlo Sampling.
First, we adopt the MC sampling method to sample $N$ different trajectories $T_n = \{s_0^n, a_0^n, s_1^n, a_1^n, \dots, a_t^n\}$, $n \in N$, w.r.t. the given policy $\pi(s)$ under the same environment ($a_t \sim \pi(s_t)$) and the same start point (the gray point in Figure 3). Next, we record and analyze the underlying metrics for the entire session. Finally, we store all samples $T_n$ into $D$. Note that Monte Carlo Tree Search (MCTS) [19], which is widely used in advanced research, could also be selected to implement this process.
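To make Phase 1 concrete, the following is a minimal Python sketch of the sampling loop. The `env` and `policy` interfaces here are illustrative assumptions, not the actual interfaces of our simulators: `env.reset(seed)` is assumed to replay the same network trace so that every rollout shares the same start state, and `policy(state)` samples an action from the current policy.

```python
def sample_trajectories(env, policy, n_trajectories=16, seed=0):
    """Phase 1 sketch: roll out N trajectories from one shared start point.

    Both `env` and `policy` are hypothetical stand-ins for the real
    simulator and neural-network policy.
    """
    batch = []  # plays the role of the trajectory set D in the text
    for _ in range(n_trajectories):
        state, done, trajectory = env.reset(seed=seed), False, []
        while not done:
            action = policy(state)                    # a_t ~ pi(.|s_t)
            next_state, metrics, done = env.step(action)
            trajectory.append((state, action, metrics))  # per-step record
            state = next_state
        batch.append(trajectory)
    return batch
```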
Phase 2: Battle Competition.
To better estimate how the current policy performs, Zwei requires a module to label all trajectories in $D$: given two different trajectories $T_i$ and $T_j$, both collected under the same environment settings ($T_i, T_j \in D$), we attempt to identify which trajectory is positive for NN updating and which trajectory is generated by the worse policy. Thus, we implement a rule, called Rule, which determines the better trajectory between the two given candidates, where better means closer to the requirement. At the end of the session, the terminal position $s_t$ is scored w.r.t. the rules of the requirement to compute the game outcome $o$:

$$o_i^j = \mathrm{Rule}(T_i, T_j), \quad (1)$$
$$\text{s.t. } o_i^j \in \{-1, 0, +1\}, \ T_i, T_j \in D, \ T_i \neq T_j. \quad (2)$$
Phase 3: Win Rate Estimation.
Next, having computed the competition outcome $o_i$ against every other trajectory, we estimate the average win rate $r_i$ for each trajectory $T_i$ in $D$, i.e., $r_i = \mathbb{E}[\mathrm{Rule}(T_i, \cdot)] = \lim_{N \to \infty} \frac{1}{N} \sum_{u}^{N} o_i^u$. Notice that the accuracy of the win rate estimation heavily depends on the number of trajectories $N$. Since it is impractical to sample an infinite number of trajectories in the real world, we further compare the performance of different sample numbers in §4.2.
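Phases 2 and 3 can be sketched together in a few lines, assuming a task-specific `rule(t_i, t_j)` that returns +1, 0, or -1 as in Eqs. (1)-(2); averaging each trajectory's outcomes over its opponents gives the finite-sample estimate of $r_i$.

```python
from itertools import combinations

def estimate_win_rates(batch, rule):
    """Phases 2-3 sketch: battle every trajectory pair, then average.

    `rule(t_i, t_j)` is assumed to return +1 / 0 / -1 per Eqs. (1)-(2).
    """
    n = len(batch)
    outcomes = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        o = rule(batch[i], batch[j])
        outcomes[i][j], outcomes[j][i] = o, -o   # zero-sum battle
    # r_i ~= average outcome of T_i over its n-1 opponents
    return [sum(row) / (n - 1) for row in outcomes]
```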
Phase 4: Policy Optimization.
In this phase, given a batch of collected samples and their win rates, our key idea is to update the policy by elevating the probabilities of the winning samples and diminishing the probabilities of the losing samples from the worse trajectories. In other words, the improved policy $\pi$ at state $s_t$ is required to pick the action $a_t$ that produces the best estimated win rate $r_t$. We employ Proximal Policy Optimization (PPO) [18], a state-of-the-art actor-critic method, to optimize the NN's policy. PPO uses a clip method to restrict the step size of the policy iteration and updates the NN by minimizing the following clipped surrogate objective. We list Zwei's loss function $L_{\mathrm{Zwei}}(\theta)$ in Eq. (6). The function consists of a policy loss and an outcome loss. The policy loss is computed as Eq. (3), where $p_t(\theta)$ denotes the probability ratio between the policy $\pi_\theta$ and the old policy $\pi_{\theta_{old}}$, i.e., $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$, and $\epsilon$ is the hyper-parameter that controls the clip range. $\hat{A}_t$ is the advantage function (Eq. (4)), where $V_{\theta_p}(s)$ is provided by another value NN. Meanwhile, the value loss aims to minimize the mean square error between $r_t$ and $V_{\theta_p}(s)$ (Eq. (5)). Moreover, we also add an entropy term $H(s_t; \theta)$ to encourage exploration. For more details, please refer to our repository [8].

$$L_{\mathrm{Policy}} = \hat{\mathbb{E}}_t\left[\min\left(p_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(p_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]. \quad (3)$$
$$\hat{A}_t = r_t - V_{\theta_p}(s_t). \quad (4)$$
$$L_{\mathrm{Value}} = \frac{1}{2}\,\hat{\mathbb{E}}_t\left[\left(V_{\theta_p}(s_t) - r_t\right)^2\right]. \quad (5)$$
$$\nabla L_{\mathrm{Zwei}} = \nabla_\theta L_{\mathrm{Policy}} + \nabla_{\theta_p} L_{\mathrm{Value}} + \nabla_\theta \beta H(s_t; \theta). \quad (6)$$
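The following NumPy sketch spells out how Eqs. (3)-(6) combine into a single scalar loss. The clip range eps = 0.2 and the entropy weight beta = 0.01 are illustrative defaults, not values taken from this paper, and the signs are arranged so that the result is a loss to minimize.

```python
import numpy as np

def zwei_loss(pi_new, pi_old, actions, win_rates, values, eps=0.2, beta=0.01):
    """NumPy sketch of Eqs. (3)-(6); eps and beta are illustrative.

    pi_new, pi_old: (T, n) action probabilities of the new / old policies;
    actions: (T,) chosen action indices; win_rates: (T,) estimated r_t;
    values: (T,) critic outputs V(s_t).
    """
    idx = np.arange(len(actions))
    ratio = pi_new[idx, actions] / pi_old[idx, actions]            # p_t(theta)
    adv = win_rates - values                                       # Eq. (4)
    policy = -np.mean(np.minimum(
        ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv))      # -Eq. (3)
    value = 0.5 * np.mean((values - win_rates) ** 2)               # Eq. (5)
    entropy = -np.mean(np.sum(pi_new * np.log(pi_new + 1e-8), axis=1))
    return policy + value - beta * entropy                         # Eq. (6)
```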
4 EVALUATION
In this section, we thoroughly evaluate the performance of Zwei over two representative scenarios.
4.1 Implementation
We use TensorFlow [3] to construct Zwei. Zwei's policy network outputs an n-dimensional vector through a softmax activation function, and the outcome network outputs a value through a tanh function, scaled into (−1, 1). Considering the characteristics of video transmission tasks, we construct a different NN architecture for each task but
use the same set of hyper-parameters for training the NN: sample number N = 16 and a fixed learning rate α. Notice that we discuss the NN architecture and the effect of different sample numbers in §4.2.

4.2 Adaptive Bitrate Streaming (ABR)
The ABR video streaming architecture consists of a video player client with a constrained buffer length and an HTTP server or content delivery network (CDN). The video player client decodes and renders video frames from the playback buffer. Once the streaming service starts, the client fetches video chunks from the HTTP server or CDN in order, guided by an ABR algorithm that determines the next chunk's video quality. After the video finishes playing, several metrics, such as total bitrate, total rebuffering time, and total bitrate change, are summarized as a QoE metric to evaluate performance.
We now explain the details of Zwei's neural network (NN) for this task, including its inputs, outputs, network architecture, and implementation.
Inputs.
For each chunk $k$, Zwei takes the state $S_k = \{X_k, \tau_k, N_k, b_k, c_k, l_k\}$ as input, in which $X_k$ represents the throughput measured over the past $t$ slots, $\tau_k$ is the vector of download times of the past $t$ chunks, $N_k$ lists the next chunk's size for each bitrate, $b_k$ is the current buffer occupancy, $l_k$ is the normalized value of the last selected bitrate, and $c_k$ is a scalar denoting the number of chunks remaining.

Outputs.
We use a discrete action space to describe the output: an n-dimensional vector indicating the probability of each bitrate being selected under the current ABR state $S_k$. In this work, we set n = 6, which is widely used in ABR papers [9, 14].
NN Representation.
Zwei's NN representation is quite simple: it leverages three fully-connected layers, sized 128, 64, and 64 respectively, as the feature extraction layers. The output of the NN's policy network is a vector representing the probability of each bitrate being selected. The NN's outcome network outputs a value scaled into (−1, 1).
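A sketch of this architecture in tf.keras is given below. The layer sizes and the output activations follow the text; the ReLU hidden activations, the flattened 25-dimensional state, and the use of tf.keras itself are assumptions of the sketch.

```python
import tensorflow as tf

def build_net(state_dim, out_dim, activation):
    """Sketch of the described trunk: three FC layers sized 128, 64, 64."""
    x_in = tf.keras.Input(shape=(state_dim,))
    x = x_in
    for units in (128, 64, 64):                 # sizes taken from the text
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(out_dim, activation=activation)(x)
    return tf.keras.Model(x_in, out)

# Separate policy and outcome (value) networks, as described in the text.
# state_dim=25 is a placeholder for the flattened ABR state.
policy_net = build_net(state_dim=25, out_dim=6, activation="softmax")
outcome_net = build_net(state_dim=25, out_dim=1, activation="tanh")
```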
Requirements for ABR tasks.
Algorithm 1 gives an example of the Rule used in the ABR scenario, where $\epsilon_c$ is a small number that adds noise to improve Zwei's robustness. In this work, we set $\epsilon_c$ as 10. Note that such a setting is trivial, and a small value (e.g., 0.1) can be used instead.

Algorithm 1 Rule for the ABR task.
Require: Trajectories $T_u$, $T_v$.
  Compute average bitrate $r_u, r_v$ and average rebuffering $b_u, b_v$ from the given trajectories $T_u, T_v$. ▷ Requirements: i) low rebuffering time; ii) high bitrate.
  Initialize return $s = \{-1, -1\}$.
  if $|b_u - b_v| < \epsilon_c$ and $|r_u - r_v| < \epsilon_c$ then
      Randomly set $s_u$ or $s_v$ as 1. ▷ Add noise to improve the robustness.
  else if $|b_u - b_v| < \epsilon_c$ then
      $s_i \leftarrow 1$, $i = \arg\max_{i \in \{u,v\}} r_i$
  else
      $s_i \leftarrow 1$, $i = \arg\min_{i \in \{u,v\}} b_i$
  end if
  return $s$
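For reference, Algorithm 1 translates into a few lines of Python. The trajectory summaries are assumed to be precomputed, and eps_c defaults to the small value (0.1) mentioned above.

```python
import random

def abr_rule(summary_u, summary_v, eps_c=0.1):
    """Python sketch of Algorithm 1.

    Each summary is an (avg_bitrate, avg_rebuffering) pair; returns
    s = [s_u, s_v] with the winner's entry set to 1.
    """
    (r_u, b_u), (r_v, b_v) = summary_u, summary_v
    s = [-1, -1]
    if abs(b_u - b_v) < eps_c and abs(r_u - r_v) < eps_c:
        s[random.randrange(2)] = 1        # near-tie: random winner adds noise
    elif abs(b_u - b_v) < eps_c:
        s[0 if r_u > r_v else 1] = 1      # rebuffering tied: higher bitrate wins
    else:
        s[0 if b_u < b_v else 1] = 1      # otherwise: lower rebuffering wins
    return s
```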
QoE Representation.
Recall that the goal of an ABR algorithm is to select the bitrate for the next chunk $k$ with high bitrate $r_k$ and less rebuffering time $t_k$ [13]; thus the optimization reward $q_k$ can normally be written as $q_k = r_k - \alpha t_k$, where $\alpha$ is the coefficient that adjusts the importance of the underlying metrics. Note that prior research often adds an additional smoothness metric $|r_k - r_{k-1}|$ to control the bitrate change over the entire session, while in practice this metric is negligible for the ABR algorithm [10, 13]. Hence, in this work we set the smoothness term to zero to better understand the fundamental performance of Zwei. Solving Zwei with the smoothness metric is left as future work.

Experimental Setup.
We employ the standard ABR emulation environment [14] to evaluate Zwei. We adopt various network bandwidth datasets, including HSDPA [17] and FCC [16]. The training process lasts approximately 45,000 steps, taking within 10 hours to obtain a reliable result. In this experiment, we set up two ABR scenarios, i.e., an HD and a 4K video scenario. In the HD video scenario, we adopt
EnvivioDash3, a video commonly used in recent work [14], to validate Zwei, where the video chunks are encoded at {0.3, 0.75, 1.2, 1.85, 2.85, 4.3} mbps. In the 4K video scenario, we use the popular open-source movie Big Buck Bunny [2], which has become a de facto standard test sequence. Specifically, we pick 6 bitrate levels from the Dash.js standard [1], with a maximum of 12 mbps.

ABR Baselines.
In this paper, we select several representative ABR algorithms covering different fundamental principles:
(1) Rate-based [11]: the basic baseline for ABR problems. It leverages the harmonic mean of the past five throughput measurements as the future bandwidth estimate.
(2) BOLA [20]: the most popular buffer-based ABR scheme in practice. BOLA turns the ABR problem into a utility maximization problem and solves it using Lyapunov optimization.
(3) RobustMPC [21]: a state-of-the-art heuristic method that maximizes the objective by jointly considering buffer occupancy and throughput predictions. We implement RobustMPC ourselves.
(4) Pensieve [14]: the state-of-the-art learning-based ABR scheme, which utilizes DRL to select the bitrate for the next video chunk.
(5) Tiyuntsong [9]: the first study of multi-objective optimization for ABR. Tiyuntsong uses an actor-critic method to update the NN via the competition of two agents under the same network condition.
Evaluation Metrics.
The Elo rating [7] is a traditional method for calculating the relative performance of players in zero-sum games. It is suitable for comparing different schemes via win-rate information only. Thus, we use the Elo score to compare different ABR schemes, similar to AlphaGo [19] and Tiyuntsong [9].
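For readers unfamiliar with Elo, a single rating update works as follows; the K-factor of 32 and the starting rating of 1200 are common defaults, not values taken from this paper.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update: score_a is 1.0 if scheme A wins the battle,
    0.0 if it loses, and 0.5 on a draw, matching the win/loss/draw
    outcomes produced by the Rule."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two schemes start at 1200; A beats B once.
a, b = elo_update(1200.0, 1200.0, score_a=1.0)  # a=1216.0, b=1184.0
```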
We study the performance of Zwei over the HD video dataset and HSDPA network traces. Following recent work [21], we set α = 4.3 for the basic QoE-HD metric.

Zwei vs. Existing Schemes.
In this experiment, we compare Zwei with existing ABR schemes over the HSDPA dataset. Results are computed as Elo scores and reported in Figure 4(a). Specifically, we first select several previously proposed approaches and validate their performance under the same network and video environment. Next, we use the Rule to estimate their win rates. Finally, we compute the Elo rating for these approaches.

Figure 4: Comparison results of Zwei and other ABR approaches in the typical ABR system with HD videos (video resolution = 1920×1080): (a) Zwei vs. ABR schemes; (b) Zwei vs. Tiyuntsong; (c) ABR scheme details; (d) CDF of ABR algorithms.
Figure 5: Comparison results of Zwei and other ABR approaches in the typical ABR system with 4K videos (video resolution = 3840×2160): (a) Zwei vs. ABR schemes; (b) Zwei with different samples; (c) ABR algorithm details; (d) CDF of ABR algorithms.

Through these experiments, we can see that Zwei outperforms recent ABR approaches. In particular, Zwei improves the Elo score by 36.38% compared with the state-of-the-art learning-based method Pensieve, and by 31.11% compared with the state-of-the-art heuristic method RobustMPC. The same conclusion is demonstrated by the CDF of QoE in Figure 4(d). Besides, we also illustrate the comparison of different methods on average bitrate and average rebuffering in Figure 4(c). As shown, Zwei not only achieves the highest bitrate but also obtains the lowest rebuffering across all network traces. We further compare Zwei with the previously proposed self-learning method Tiyuntsong under the same experimental setting. Results are shown as the Elo curves in Figure 4(b). As expected, Zwei with any NN architecture outperforms Tiyuntsong by 35% on average Elo score. The reason is that Tiyuntsong treats the learning process as a battle between two separate NNs, where each agent can only use the samples collected by itself. Thus, it lacks sample efficiency and sometimes gets stuck in sub-optimal solutions.
Zwei with Different NN Architectures.
This experiment considers Zwei with several NN architectures, where Zwei-1D is the standard ABR NN architecture [14], Zwei-2D uses stacked Conv-2D layers to extract features, and Zwei takes three fully-connected layers sized {128, 64, 64}. Results are shown in Figure 4(b). Unsurprisingly, when Zwei is trained with a more complicated NN architecture (i.e., Zwei-1D and Zwei-2D), it performs worse than the fully-connected NN scheme. This makes sense since ABR is a lightweight task which can be solved in a practical and uncomplicated manner instead of with an NN incorporating deep yet wide layers.

Results on 4K Videos.
We evaluate Zwei with 4K videos on the HSDPA network traces, where the videos are encoded with the 4K bitrate ladder described above. In particular, we have also retrained Pensieve with Big Buck Bunny and QoE-4K.

QoE Metrics for Other Approaches.
Due to the difference between the maximum bitrate of 4K video (i.e., 12 mbps) and HD video (i.e., 4.3 mbps), we refer to QoE-4K as the QoE function for the other ABRs (except Zwei). Specifically, we set the rebuffering penalty parameter α to 20 to better avoid the occurrence of rebuffering.

Zwei vs. Recent ABR Schemes.
Figure 5(a) plots the learning curves comparing Zwei and other ABR methods. We observe that Zwei already achieves state-of-the-art performance within roughly 200 epochs; finally, Zwei's performance improvement reaches 32.24% against the second-best algorithm Pensieve and 58.58%–120.76% against the others. What's more, as depicted in Figure 5(c), in terms of both average bitrate and total rebuffering time, Zwei is the best scheme among all candidates. In particular, Zwei increases the average bitrate by 10.46% while reducing total rebuffering time by 56.52%. Meanwhile, we also report the QoE comparison of the proposed schemes in Figure 5(d). As expected, Zwei outperforms existing ABR algorithms without reward engineering, increasing average QoE by 11.59%.
Zwei with Different Samples N.
Besides, we also investigate the Elo rating of Zwei with different sample numbers N, where we set N to four different values. Figure 5(b) illustrates the comparison; we observe that Zwei achieves better sample efficiency as the sample number N increases.

4.3 Crowd-sourced Live Streaming (CLS)
The CLS system is composed of a source server and several CDNs. Upon receiving viewers' requests, the CLS platform first aggregates all stream data at the source server, and then delivers the video stream to viewers through CDN providers according to a certain scheduling strategy.
Our experiments are conducted on the real-world CLS dataset provided by the authors of [22], spanning 1 week (6 days for training and 1 day for testing).
Figure 6: Results of Zwei under the CLS environment: (a) Zwei vs. existing methods; (b) Zwei training curve.

At each time step, we select 3 candidates from a total of 4 different CDN providers, and we fit a separate simulator for each of them. Following previous work [22], we use a piecewise linear model to characterize the relationship between workload and block ratio. Note that these features are extracted from the real CDN dataset. What's more, we set the CDN pricing model w.r.t. various CDN providers in industry, such as Amazon EC2 and Tencent CDN. See [8] for details.
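A minimal sketch of such a piecewise linear fit is shown below. The hinge-feature formulation and the knot placement are assumptions of this sketch, not the exact fitting procedure used for our CDN simulators.

```python
import numpy as np

def fit_piecewise_linear(workload, block_ratio, knots=(0.25, 0.5, 0.75)):
    """Least-squares fit of a piecewise linear workload -> block-ratio
    curve using hinge features at fixed breakpoints (`knots` are
    placeholders)."""
    workload = np.asarray(workload, dtype=float)
    basis = [np.ones_like(workload), workload]
    basis += [np.maximum(workload - t, 0.0) for t in knots]   # hinge features
    coef, *_ = np.linalg.lstsq(np.stack(basis, axis=1),
                               np.asarray(block_ratio, dtype=float),
                               rcond=None)

    def predict(w):
        w = np.asarray(w, dtype=float)
        return coef[0] + coef[1] * w + sum(
            c * np.maximum(w - t, 0.0) for c, t in zip(coef[2:], knots))

    return predict
```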
Algorithm 2 Rule for the CLS task.
Require: Trajectories $T_u$, $T_v$.
  Compute average stalling ratio $stall_u, stall_v$ and accumulative cost $c_u, c_v$ from the given trajectories $T_u, T_v$. ▷ Requirements: i) low stalling ratio; ii) low cost.
  Initialize return $s = \{-1, -1\}$.
  if $|stall_u - stall_v| < \epsilon_c$ and $|c_u - c_v| < \epsilon_c$ then
      Randomly set $s_u$ or $s_v$ as 1. ▷ Add noise to improve the robustness.
  else if $|stall_u - stall_v| < \epsilon_c$ then
      $s_i \leftarrow 1$, $i = \arg\min_{i \in \{u,v\}} c_i$
  else
      $s_i \leftarrow 1$, $i = \arg\min_{i \in \{u,v\}} stall_i$
  end if
  return $s$

Baselines.
Like the other scenario, we also compare Zwei with the following state-of-the-art scheduling baselines:
(1) Weighted round robin (WRR) [4]: requests are redirected to different CDN providers according to a constant ratio. We adopt the algorithm with the best parameters.
(2) E2 [12]: the Exploitation and Exploration (E2) algorithm utilizes the harmonic mean to estimate CDN providers' performance and selects the one with the highest upper confidence bound of reward. We use the E2 algorithm provided by the authors [12].
(3) LTS [22]: the state-of-the-art CLS algorithm, which uses deep reinforcement learning to train the NN towards a lower stalling ratio. However, it ignores the trade-off between cost and performance. We evaluate the LTS provided by Zhang et al. [22].
NN Representation.
We implement Zwei in CLS as suggested by recent work [22]. More precisely, for each CDN provider $i$, Zwei passes the past 20 normalized workload and stalling-ratio samples into Conv-2D layers with filter = 64, kernel size = 4, and stride = 1. The outputs are then merged into a hidden layer with 64 units, followed by a softmax activation function. We model the action space in a heuristic way: each CDN provider has 3 choices, i.e., incrementally increasing its configuration ratio by 1%, 5%, or 10%, as sketched below.
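The action update can be sketched as follows; the renormalization of the ratios after the bump is an assumption of this sketch.

```python
def apply_action(ratios, cdn_idx, choice):
    """Sketch of the heuristic action space: bump one CDN's scheduling
    ratio by 1%, 5%, or 10%, then renormalize so the ratios sum to 1
    (the renormalization is assumed, not stated in the text)."""
    step = (0.01, 0.05, 0.10)[choice]
    ratios = [r + step if i == cdn_idx else r for i, r in enumerate(ratios)]
    total = sum(ratios)
    return [r / total for r in ratios]

# Example: give CDN 0 a +10% bump from an even split across 3 CDNs.
print(apply_action([1 / 3, 1 / 3, 1 / 3], cdn_idx=0, choice=2))
```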
Requirements for CLS tasks.
Algorithm 2 describes the Rule of CLS. Note that $\epsilon_c$ = 10 here is also a small number that adds noise to improve Zwei's robustness.
Zwei vs. State-of-the-art Scheduling Methods.
We start by studying how well Zwei performs under the CLS scenario. As shown in Figure 6(a), we find that Zwei is the best scheme among the candidates. Specifically, Zwei reduces the overall cost by 22% compared with the state-of-the-art learning-based method LTS, and decreases the overall stalling ratio by over 6.5%. The reason is that LTS takes a weighted-sum combination function as the reward, which can hardly give clear guidance to the optimized algorithm. Moreover, comparing the performance of Zwei with the optimal strategy generated by the reward function, we observe that both the optimal policy and Zwei lie on the Pareto front. Zwei consumes less cost than the optimal policy since the requirement is to minimize the cost first.

Details.
Besides, we also present the training process in Figure 6(b). As shown, Zwei converges in fewer than 600 epochs, which takes about 3 hours. It is worth noting that Zwei goes through two stages on the CLS task. The first stage ranges from 0 to 100 epochs, where the goal of Zwei is to minimize the cost without considering the stalling ratio. In the rest of the process, Zwei attempts to reduce the stalling ratio while the cost curve converges to a steady state.
5 RELATED WORK
Heuristic Methods.
Heuristic ABR methods often adopt throughput prediction (e.g., FESTIVE [11]) or buffer occupancy control (e.g., BOLA [20]) to handle the task. However, such approaches suffer from either inaccurate bandwidth estimation or long-term bandwidth fluctuation problems. MPC [21] picks the next chunk's bitrate by jointly considering throughput and buffer occupancy. Nevertheless, MPC is sensitive to its parameters since it relies on a good understanding of different network conditions. For CLS scheduling, preliminary measurements show that strategies are largely statically configured [4]. In recent years, dynamic scheduling across different CDN providers has received more attention; Jiang et al. [12] use the E2 method to replace traditional model-based methods.
Learning-based Schemes.
In recent years, several learning-based attempts have been made to tackle the video transmission problem. For example, Mao et al. [14] develop an ABR algorithm that uses DRL to select the next chunk's bitrate. Tiyuntsong [9] optimizes itself towards a rule or a specific reward via the competition of two agents under the same network condition. LTS [22] is a DRL-based scheduling approach which outperforms previously proposed CLS approaches. However, such methods fail to achieve the actual requirements since they are optimized via a linear-based reward function.
6 CONCLUSION
We propose Zwei, which utilizes self-play RL to improve itself based on the actual requirement, where the requirement is often hard to describe in a linear-based manner. We show that Zwei outperforms recent work on two representative video transmission tasks (a 32.24% Elo score improvement in the ABR scenario and a 22% cost reduction over LTS in the CLS scenario).
Acknowledgement.
We thank the MMSys and NOSSDAV reviewers for their valuable feedback. This work was supported by NSFC under Grants 61936011 and 61521002, the Beijing Key Lab of Networked Multimedia, and the National Key R&D Program of China (No. 2018YFB1003703).
REFERENCES
[1] Dash Industry Forum. dash.js. https://github.com/Dash-Industry-Forum/dash.js.
[2] Blender Foundation. 2008. Big Buck Bunny. https://peach.blender.org/.
[3] Martín Abadi et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, Vol. 16.
[4] Vijay Kumar Adhikari, Yang Guo, Fang Hao, Matteo Varvello, Volker Hilt, Moritz Steiner, and Zhi-Li Zhang. 2012. Unreeling Netflix: Understanding and improving multi-CDN movie delivery. In INFOCOM'12.
[5] Abdelhak Bentaleb, Bayan Taani, Ali C. Begen, Christian Timmerer, and Roger Zimmermann. 2018. A Survey on Bitrate Adaptation Schemes for Streaming Media over HTTP. IEEE Communications Surveys & Tutorials.
[7] Rémi Coulom. 2008. Whole-History Rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games. Springer, 113–124.
[8] Tianchi Huang. 2020. Zwei. https://github.com/thu-media/Zwei.
[9] Tianchi Huang, Xin Yao, Chenglei Wu, Rui-Xiao Zhang, and Lifeng Sun. 2018. Tiyuntsong: A Self-Play Reinforcement Learning Approach for ABR Video Streaming. arXiv preprint arXiv:1811.06166.
[10] Te-Yuan Huang, Chaitanya Ekanadham, Andrew J. Berglund, and Zhi Li. 2019. Hindsight: Evaluate Video Bitrate Adaptation at Scale. In MMSys'19. ACM, New York, NY, USA, 86–97. https://doi.org/10.1145/3304109.3306219
[11] Junchen Jiang, Vyas Sekar, and Hui Zhang. 2014. Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE. TON 22, 1 (2014).
[12] Junchen Jiang, Shijie Sun, Vyas Sekar, and Hui Zhang. 2017. Pytheas: Enabling Data-Driven Quality of Experience Optimization Using Group-Based Exploration-Exploitation. In NSDI, Vol. 1.
[13] Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, and Eytan Bakshy. 2019. Real-world Video Adaptation with Reinforcement Learning. (2019).
[14] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. 2017. Neural adaptive video streaming with Pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 197–210.
[15] Jeonghoon Mo, Richard J. La, Venkat Anantharam, and Jean Walrand. 1999. Analysis and comparison of TCP Reno and Vegas. In INFOCOM'99.
[16] Federal Communications Commission. 2016. Raw Data: Measuring Broadband America.
[17] Haakon Riiser, Paul Vigmostad, Carsten Griwodz, and Pål Halvorsen. 2013. Commute path bandwidth traces from 3G networks: Analysis and applications. In Proceedings of the 4th ACM Multimedia Systems Conference.
[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[19] David Silver, Aja Huang, Chris J. Maddison, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–489.
[20] Kevin Spiteri, Rahul Urgaonkar, and Ramesh K. Sitaraman. 2016. BOLA: Near-optimal bitrate adaptation for online videos. In INFOCOM'16.
[21] Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. 2015. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In SIGCOMM'15.
[22] Rui-Xiao Zhang, Tianchi Huang, Ming Ma, Haitian Pang, Xin Yao, Chenglei Wu, and Lifeng Sun. 2019. Enhancing the crowdsourced live streaming: A deep reinforcement learning approach. In NOSSDAV'19.

A APPENDIX
This supplementary material details the principles of Zwei. Due to the length of the supplemental material, we list the contents to facilitate the selection of the parts of interest for review. Although these contents have NOT appeared in the main text, we believe they will help the reader gain a thorough understanding of Zwei.
A.1 Additional Experiments for ABR Tasks
We list the experimental details on the validation set in Tables 1 to 3, where the ABR algorithms include Zwei, BOLA, RobustMPC, Pensieve, and Rate-based. The validation set contains several different network scenarios; e.g., bus-1 represents network trace No. 1, collected on a bus.
Table 1: Details of ABR Tasks
Trace Name   BOLA (Avg. Bit. Rebuf.)   Rate-based (Avg. Bit. Rebuf.)   Pensieve (Avg. Bit. Rebuf.)   RobustMPC (Avg. Bit. Rebuf.)   Zwei (Avg. Bit. Rebuf.)
bus 1    2659.6 -    2188.3 -    2480.9 -    2839.4 5.1
bus 10   1467.0 -    1389.4 4.6  1413.8 -    1427.7 0.1
bus 11   1327.7 -    1195.7 -    1264.9 -    1336.2 0.6
bus 12   769.1 4.9   721.3 -     771.3 5.2   769.1 4.0
bus 13   852.1 9.7
bus 15   3919.1 -    3497.9 -    3598.9 -
bus 18   1623.4 -    1559.6 0.4  1596.8 -    1588.3 1.0
bus 19   1472.3 -    1212.8 -    1461.7 -    1440.4 -
bus 2    1820.2 -    1656.4 -    1737.2 -    1861.7 -
bus 20   1726.6 -    1517.0 1.6  1703.2 -    1752.1 -
bus 21   1669.1 -    1497.9 -    1526.6 -    1759.6 2.1
bus 22
bus 3    1331.9 -    1258.5 -    1283.0 -    1348.9 0.9
bus 4    2121.3 -    1710.6 -    2103.2 -    2160.6 1.2
bus 5    950.0 1.9   863.8 14.0  922.3 -
bus 7
car 1    1440.4 0.8  1413.8 10.1 1467.0 12.7 1447.9 4.0
car 10   1300.0 -    1127.7 -    1218.1 -    1338.3 2.6
car 11   920.2 1.7   673.4 1.0   925.5 -     889.4 -
car 12   1291.5 18.2
car 3    1351.1 1.8  1293.6 -    1334.0 2.0  1309.6 0.8
car 4    1433.0 1.6  1327.7 3.4  1314.9 -    1423.4 1.7
car 5    927.7 2.1   836.2 0.1   878.7 -     895.7 -
car 6    1295.7 -    1133.0 -    1263.8 -    1286.2 -
car 7    710.6 1.3   568.1 11.9  666.0 -     677.7 0.9
car 8    1688.3 -    1448.9 -    1676.6 -    1704.3 -
car 9    1485.1 -    1366.0 -    1423.4 -    1459.6 -
ferry 1  1454.3 0.3  1423.4 -    1441.5 -    1455.3 -
ferry 10 1119.1 0.2  1021.3 0.2  1104.3 0.2  1104.3 0.2
ferry 11 1038.3 1.2  847.9 -     978.7 -     1037.2 0.1
ferry 12 1525.5 -    1324.5 6.4  1466.0 -    1555.3 -
ferry 13 1421.3 0.5  1192.6 12.6
ferry 15 1551.1 -    1385.1 -    1535.1 -    1486.2 1.1
ferry 16 2684.0 -    2406.4 -    2480.9 -    2734.0 -
ferry 17 859.6 0.1   768.1 9.4   818.1 -     826.6 -
ferry 18 724.5 2.8   587.2 4.9   666.0 1.1   667.0 1.0
ferry 19 839.4 4.8   654.3 -
Table 2: Details of ABR Tasks
Trace Name   BOLA (Avg. Bit. Rebuf.)   Rate-based (Avg. Bit. Rebuf.)   Pensieve (Avg. Bit. Rebuf.)   RobustMPC (Avg. Bit. Rebuf.)   Zwei (Avg. Bit. Rebuf.)
ferry 20 686.2 6.7   635.1 -     670.2 5.8   672.3 5.0
ferry 3  1211.7 -    1093.6 -    1170.2 -    1231.9 0.7
ferry 4  562.8 4.2   472.3 1.9   543.6 -     558.5 4.7
ferry 5  1011.7 1.7  901.1 12.0  940.4 -     1018.1 2.2
ferry 6  2014.9 -    1598.9 -    2039.4 -    2093.6 -
ferry 7  1367.0 0.5  1231.9 1.4  1309.6 -    1346.8 0.6
ferry 8  1078.7 3.8  1010.6 -    1076.6 1.7  1078.7 2.5
ferry 9  964.9 3.8   787.2 5.1   868.1 3.7
metro 10 896.8 -     625.5 3.7   837.2 -     895.7 -
metro 2  1364.9 -    1285.1 2.8  1316.0 -    1381.9 2.5
metro 3  934.0 -     481.9 -     884.0 -     913.8 -
metro 4  1086.2 1.0  884.0 1.2   968.1 0.6   1030.9 -
metro 5  1178.7 1.0  1041.5 3.8  1142.6 -
metro 7  945.7 -     769.1 -     948.9 0.2   961.7 0.1
metro 8  740.4 2.5   443.6 6.4   738.3 -     724.5 -
metro 9  788.3 -     702.1 4.4   735.1 -     788.3 1.5
train 1  1028.7 4.2  863.8 2.3   996.8 -     1008.5 -
train 10 1838.3 -    1627.7 -    1709.6 -    1894.7 -
train 11 1256.4 0.4  1163.8 0.2  1171.3 -    1257.4 -
train 12 1304.3 0.8  911.7 1.3   1114.9 -    1212.8 -
train 13 1560.6 -    1236.2 -    1334.0 -    1564.9 -
train 14 1330.9 -    1206.4 -    1259.6 -    1320.2 -
train 15 1900.0 -    1423.4 -    1650.0 -
train 17 1620.2 -    1394.7 4.1  1556.4 -    1637.2 -
train 18 1978.7 -    1725.5 -    1803.2 -    2012.8 -
train 19 1423.4 -    1230.9 2.5  1302.1 -    1452.1 -
train 2  635.1 3.5   481.9 11.1
train 21 1598.9 -    1421.3 -    1596.8 -    1626.6 -
train 3  648.9 1.6   529.8 8.6   619.1 -     596.8 -
train 4  1084.0 3.3  868.1 -     1024.5 -    1059.6 0.3
train 5  1241.5 1.5  946.8 0.4   1155.3 -    1151.1 -
train 6  969.1 0.9   721.3 -     875.5 -     936.2 0.9
train 7  825.5 8.3   750.0 -
train 9  855.3 -     596.8 -     776.6 -     852.1 -
tram 1   635.1 -     472.3 1.6   625.5 -     635.1 -
tram 10  869.1 2.5   673.4 -     850.0 -
tram 12  807.4 -     548.9 -     795.7 -     816.0 -
tram 13  788.3 -     596.8 0.9   739.4 -     787.2 -
tram 14  839.4 2.2   644.7 1.1   808.5 -     797.9 -
tram 15  453.2 6.9   357.4 5.3   414.9 1.8   424.5 -
tram 16
Table 3: Details of ABR Tasks
Trace Name   BOLA (Avg. Bit. Rebuf.)   Rate-based (Avg. Bit. Rebuf.)   Pensieve (Avg. Bit. Rebuf.)   RobustMPC (Avg. Bit. Rebuf.)   Zwei (Avg. Bit. Rebuf.)
tram 19  967.0 0.1   788.3 -     907.4 -     973.4 -
tram 2   577.7 3.3   434.0 -     548.9 -     548.9 -
tram 20  930.9 -     730.9 -     885.1 -     928.7 -
tram 21  692.6 0.5   558.5 -     677.7 -     696.8 -
tram 22  510.6 0.3   376.6 -     491.5 -     501.1 -
tram 23  778.7 -     529.8 -     738.3 -     778.7 -
tram 24  788.3 -     616.0 3.4   739.4 -     769.1 0.3
tram 25  907.4 -     711.7 -     813.8 -     888.3 -
tram 26  568.1 0.8   300.0 -     539.4 -
tram 31  769.1 -     644.7 -     753.2 -     769.1 -
tram 32
tram 35  548.9 5.0   405.3 1.6   501.1 -     486.2 -
tram 36  850.0 0.3   539.4 -     819.1 -     816.0 -
tram 37  984.0 -     807.4 -     894.7 -     1003.2 -
tram 38  644.7 9.6   481.9 4.5
tram 4   462.8 5.2   300.0 -     414.9 -     414.9 -
tram 40  797.9 0.6   520.2 0.6   770.2 0.6   772.3 0.8
tram 41  769.1 2.8   663.8 -     734.0 -     750.0 0.5
tram 42  1069.1 9.0  692.6 -     961.7 -     1057.4 -
tram 43  1079.8 -    960.6 0.2   1048.9 -    1084.0 -
tram 44  869.1 -     759.6 -     801.1 -     874.5 0.7
tram 45  648.9 1.1   472.3 7.0
tram 47  788.3 0.9   539.4 -     705.3 -     764.9 -
tram 48  616.0 0.3   395.7 -     558.5 -     605.3 0.2
tram 49  963.8 1.1   778.7 -     963.8 -     985.1 2.2
tram 5   654.3 5.2   443.6 2.6   587.2 -     616.0 -
tram 50  863.8 6.8   797.9 -     851.1 -     845.7 -
tram 51  658.5 5.5   587.2 -
tram 53  657.4 17.3  443.6 3.8   434.0 0.2   428.7 -
tram 54  1123.4 6.8  862.8 6.6   1059.6 -
tram 56  971.3 -     797.9 -     944.7 -     957.4 -
tram 6   740.4 -     587.2 -     683.0 -     744.7 -
tram 7   759.6 -     596.8 -     720.2 -     744.7 -
tram 8   501.1 3.5   414.9 -     472.3 -     462.8 0.2
tram 9   606.4 0.2   328.7 -     568.1 -     572.3 -