Efficient Bitrate Ladder Construction for Content-Optimized Adaptive Video Streaming
Angeliki V. Katsenou∗, Joel Sole† and David R. Bull∗

Abstract—One of the challenges faced by many video providers is the heterogeneity of network specifications, user requirements, and content compression performance. The universal solution of a fixed bitrate ladder is inadequate for ensuring a high quality of user experience without re-buffering or the introduction of annoying compression artifacts. However, a content-tailored solution, based on extensively encoding across all resolutions and over a wide quality range, is highly expensive in terms of computational, financial, and energy costs. Inspired by this, we propose an approach that exploits machine learning to predict a content-optimized bitrate ladder. The method extracts spatio-temporal features from the uncompressed content, trains machine-learning models to predict the Pareto front parameters and, based on these, builds the ladder within a defined bitrate range. The method has the benefit of significantly reducing the number of encodes required per sequence. The presented results, based on 100 HEVC-encoded sequences, demonstrate a reduction in the number of encodes required when compared to an exhaustive search and to an interpolation-based method, by 89.06% and 61.46%, respectively, at the cost of an average Bjøntegaard Delta Rate difference of 1.78% compared to the exhaustive approach. Finally, a hybrid method is introduced that selects either the proposed or the interpolation-based method depending on the sequence features. This results in an overall 83.83% reduction of required encodings at the cost of an average Bjøntegaard Delta Rate difference of 1.26%.
Index Terms—Bitrate Ladder, Adaptive Video Streaming, Rate-Quality Curves, Video Compression, HEVC.
I. INTRODUCTION

In recent reports on internet traffic volumes [1], the share occupied by video data is predicted to reach 80% by 2023, with further rises anticipated subsequently. Due to the recent pandemic and the associated major shift towards remote working (work-from-home schemes, online education, etc.) [2], this figure is now expected to be reached even sooner. Although mobile users are exchanging more and more of the content they generate, the major share of the video networking index relates to on-demand video services, such as those provided by Netflix, Hulu, Amazon Prime, and others. All video service providers invest a significant amount of resource into optimizing video compression parameters prior to transmission. This enables them to increase user satisfaction, meeting varying end-user constraints while maintaining the highest possible level of delivered video quality.

The display quality of the delivered content may vary from device to device, and may be affected by a variety of factors such as location, terminal equipment type and available bandwidth. For example, a given mobile phone is likely to receive a different encoded version of the same source video on a 5G network than it would on a 4G network. These encodes may vary in terms of both compression ratio and spatial resolution. It also means that an end-user's device might receive content compressed at a lower resolution that is then upscaled to the device's native resolution prior to display.

Many video service providers adopt HTTP Adaptive Streaming (HAS) through Dynamic Adaptive Streaming over HTTP (DASH), which is the standard solution introduced by
Submitted on December 15, 2020. The work presented was supported by the Leverhulme Early Career Fellowship (ECF-2017-413) and by Netflix Inc. A. Katsenou and D. R. Bull are with the University of Bristol, Bristol BS1 5DD, UK (e-mail: {angeliki.katsenou, dave.bull}@bristol.ac.uk). J. Sole is with Netflix, Inc., Los Gatos, California, USA (e-mail: jsole@netflix.com).
MPEG [3]. In DASH, videos are encoded with different sets of parameters (resolution, quantization parameter, etc.) to allow the delivered video content to adapt to heterogeneous network conditions (available bandwidth, device characteristics, etc.), so as to ensure a high quality of experience. The traditional approach has been to construct (at the server side) a bitrate ladder, which constitutes a set of bitrate-resolution pairs. This type of bitrate ladder is often referred to as "one-size-fits-all". In early examples, two bitrate points were used for 1080p: 4300 kbps and 5800 kbps, regardless of title [4]. In later advances, differentiation was introduced based on the genre of the content, e.g. [5]. For example, higher bitrates were used for sports content with rapid motion and fast scene changes. These solutions, however, ignored the dependency of video compression performance on specific content characteristics, resulting in noticeable blocking and other visual artifacts in some cases and, thus, a degraded viewing experience.

Recently, content-customised solutions have been developed and adopted by industry, such as those used by Netflix [4], [6], [7]. The key task here is to invest in pre-processing, where each video title is split into shorter clips or chunks, usually associated with shots. Each short video chunk is encoded using optimized parameters, i.e. resolution, quantization level, intra-period, etc., with the aim of building the Pareto Front (PF) across all Rate-Quality (RQ) curves. After that, a set of target bitrates is used to find the best encoded bitstreams. Given the extensive parameter space (compression levels, spatial and temporal resolution, codec type, etc.) and taking into account the fact that this process must be repeated for each video chunk, the amount of computation needed is massive.
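To make the ladder concept concrete, the sketch below represents a bitrate ladder as ordered (bitrate, resolution) rungs and applies the basic DASH-style rule of serving the highest rung that fits the measured bandwidth. The rung values and the helper name are hypothetical, not taken from any cited system.

```python
# Minimal sketch (not from the cited systems): a bitrate ladder as an
# ordered list of (bitrate_kbps, resolution) rungs, and the basic
# DASH-style rule of serving the highest rung that fits the bandwidth.

LADDER = [  # hypothetical "one-size-fits-all" ladder
    (400, "540p"),
    (1200, "720p"),
    (2400, "1080p"),
    (4300, "1080p"),
    (5800, "1080p"),
]

def select_rung(bandwidth_kbps, ladder=LADDER):
    """Return the highest-bitrate rung not exceeding the available bandwidth."""
    feasible = [rung for rung in ladder if rung[0] <= bandwidth_kbps]
    return feasible[-1] if feasible else ladder[0]  # fall back to the lowest rung

print(select_rung(5000))  # (4300, '1080p')
print(select_rung(300))   # (400, '540p') -- best effort below the lowest rung
```

A content-tailored ladder replaces the fixed `LADDER` above with rungs derived per sequence, which is the subject of this paper.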
As a consequence, the industry heavily relies on cloud computing services for pre-processing, and this naturally comes with a high cost in financial, time and environmental terms.

Considering the above, in this paper we propose a content-gnostic method of estimating a close-to-optimal bitrate ladder for adaptive video streaming. The proposed method is based on extracting low-level content features from the uncompressed videos at their native spatial resolution and on training machine-learning models to predict the PF parameters of the rate-distortion curves across different resolutions. Based on the estimated PF parameters, a set of equations that relate the quantization parameters to the bitrate is defined. With this set of equations, and taking into account the available bitrate range, a suitable bitrate ladder is constructed per video sequence. A clear benefit of the proposed method compared to previous approaches is that it significantly reduces the amount of computation required. Furthermore, since the bitrate ladder is constructed using the RQ PF, the number of steps on the ladder is not fixed and might be reduced for certain videos compared to other bitrate ladder solutions. This further reduces the storage requirements for the resulting encodes.

A. Contribution
In our previous work [8], we proposed a content-gnostic method that predicts the cross-over points of RQ curves across three spatial resolutions (2160p, 1080p, and 720p) for each video chunk based on Peak Signal-to-Noise Ratio (PSNR). Our method had the unique characteristic of predicting the Quantization Parameters (QPs) associated with the curve intersection point without requiring any encodings. The encodings were performed only to estimate the bitrates corresponding to the switching of resolutions.

In this paper, we build upon this previous work and propose an extension to it. The new framework extends beyond the prediction of the RQ intersection points across spatial resolutions by introducing the following:
• A new content-driven process to predict the bitrate ladder is introduced that takes into account both bitrate and quality constraints. This method models the relationship between rate and quantization parameters across resolutions and utilises this for the estimation of the bitrate ladder.
• Further to the feature-based method, a hybrid methodology is proposed. This method selects either the content-driven method or an interpolation-based method for an input sequence, as a solution that can balance the prediction accuracy against the relative complexity.
• The test case presented is based on an extended set of resolutions from 2160p down to 540p, i.e. 3840×2160, 1920×1080, 1280×720, and 960×540.

B. Paper Organization
The rest of this paper has the following structure. Firstly, Section II outlines state-of-the-art research and industrial technologies. In Section III, the proposed framework is introduced. The dataset employed, together with its low-level features, is described in Section IV. In the same section, the modules that contribute to the construction of the bitrate ladder are detailed. Next, in Section V, the extracted spatio-temporal features and the machine learning techniques used for the prediction of the bitrate ladder are reported. Moreover, this section presents results on the predicted ladder and discusses the methods' relative complexity. Finally, conclusions and future work are summarised in Section VI.

II. RELATED WORK
As explained in the Introduction, the traditional approach builds a fixed "bitrate ladder" that yields a recommended spatial resolution for each available bitrate. The bitrate ladder is either content-agnostic, with fixed bitrate ranges allocated per resolution, e.g. [9], or employs a limited number of bitrate ladders based on genre, e.g. [5].

More advanced methods that move beyond the fixed "bitrate ladder" approach have been proposed recently by both academic and industry stakeholders [6], [7], [10]–[20]. All of these solutions rely on encoding statistics collected either by performing a massive number of encodings or by using a selective number of encodes to predict the bitrate ladders using pre-trained machine learning models.

Most of these recently introduced methods were based on a per-title optimization framework. Netflix [4], [6], [10] have reported that they obtain the RQ curves per title at different resolutions and at different bitrates by running several trial encodings at different quantization levels. This information is then used to construct the Pareto-optimal front (often referred to in the adaptive streaming literature as a convex hull) of the RQ curves, using both scaled PSNR [6] and scaled VMAF [10], and hence to obtain the optimal parameters for a defined bitrate range. An alternative approach, presented in [11], uses measurements of the actual usage of millions of video clips to create probability distributions of available bandwidth and viewport sizes. These probability distributions feed an optimization process that ensures video quality preservation while reducing the required bitrate compared with existing techniques. Other per-title-encoding approaches have been developed by Bitmovin [14], MUX [15], CAMBRIA [16] and others [17]. The Bitmovin [14] and CAMBRIA [16] solutions include computation of the encoding complexity.
According to the former [14], a complexity analysis is performed on each incoming video, and a variety of measurements are processed by a machine-trained model to adjust the encoding profile to match the content. The CAMBRIA solution estimates the encoding complexity by running a fast constant-rate encoding [16]. MUX [15] introduced a deep-learning-based approach that takes, as input, the vectorized video frames and predicts the bitrate ladder. The aforementioned approaches are compared in [17]. A further approach, which uses trial encodes to collect coding statistics at low resolutions and utilizes them within a probabilistic framework to speed up the encoding decisions at higher resolutions, is presented in [19].

Recent work has presented a per-scene optimization method that aims to either maximize the quality or minimize the bitrate of each encoded representation in video-on-demand HAS scenarios [20]. The method relies on building a quantized convex hull by encoding the sequences across a set of spatial resolutions. Furthermore, Netflix [7] has updated its dynamic
optimization method by building a bitrate ladder after complexity analysis and by further tuning pre-defined encoding parameters. Another interesting approach, which takes into account both quality constraints and bitrate network statistics, was proposed by Brightcove [18], [21]. The quality metric used in this case was the Structural Similarity Index Measure (SSIM), and the bitrate constraints were based on probabilistic models. Another recent approach, the iSize solution [22], uses pre-encodes within a deep learning framework to decide on the optimal set of encoding parameters and resolution at a block level.

While all of the above solutions are significant and have contributed to the enhancement of video services, it is not possible to make direct detailed comparisons. Firstly, they are proprietary, meaning that they are designed to satisfy differing business requirements and that their full details may not be public [23]. Moreover, some of the methods use different metrics in their bitrate adaptation process that are not always shared. One area of improvement for these methods is to reduce their complexity, as most of the aforementioned solutions rely on large numbers of encodes (in many cases massive). Hence the computational, energy and financial costs are high, since cloud encoding services are usually employed [24].

In the above context, our aim here is to build a methodology that can predict a close-to-optimal content-gnostic bitrate ladder at a reduced computational cost compared to traditional methods. The proposed methodology is outlined in the following section.

Fig. 1: Diagrammatic overview of the proposed framework. Blue blocks indicate off-the-shelf technologies/tools, yellow the methodologies employed in our previous work [8], and green blocks the new processes introduced in this paper.

III. OUTLINE OF THE PROPOSED FRAMEWORK
The proposed framework, as shown in Fig. 1, is structured in two processes that share common functionalities: the training and testing processes. Different line patterns are used to denote the information flow for these. During the training process, we first downscale the uncompressed sequences to create different spatial resolution versions. We also extract low-level content features from the native-resolution sequences. Then, we encode the native and downscaled sequences for a wide range of QPs. After decoding, we rescale all versions in order to compute quality metrics at the native resolution and construct the reference Pareto Front (PF) of the sequence. Simultaneously, we record the intersection points of the RQ curves across resolutions, that is, the bitrate, quality, QP, and resolution. The QP values at the intersection points between each resolution are henceforth called cross-over QPs. The QP values represent the independent variable in the encoding process (rate and quality are the dependent variables), and assume discrete values. Thus, our first step is to predict these QP values as a basis for bitrate prediction. The cross-over QPs and resolutions of the reference PF represent the ground truth for our predictions.

Using the extracted spatio-temporal features, the ground-truth cross-over points and the associated PFs, we train supervised machine learning models to perform regression and predict the cross-over points. Using the reference PF, we can construct the reference bitrate ladder as follows: firstly, we reduce the bitrate range to within practical limits for streaming; secondly, this trimmed Pareto surface is subsampled across the quality and bitrate dimensions, as detailed in Section IV-E.

The testing process is similar to, but simpler than, the training process. We first extract the spatio-temporal features that were selected during the training process from the uncompressed test sequences at the native resolution. Then, we use the trained models to predict the cross-over QPs.
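As an illustration of this regression stage, the sketch below trains a supervised model on synthetic feature vectors to predict a single cross-over QP and rounds the predictions to integer QP values. The choice of regressor, the feature columns and the synthetic data are our assumptions for illustration, not the exact configuration used in this work.

```python
# Sketch of the regression stage (illustrative only): train a supervised
# model on spatio-temporal features to predict one cross-over QP.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: rows = sequences, columns = features (e.g. SI, TI, MV, CF).
X_train = rng.uniform(0.0, 1.0, size=(80, 4))
# Synthetic "ground-truth" cross-over QPs loosely tied to the features.
y_train = 22 + 15 * X_train[:, 0] - 5 * X_train[:, 1] + rng.normal(0, 0.5, 80)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

X_test = rng.uniform(0.0, 1.0, size=(5, 4))
# Predicted QPs are rounded to integers, since QP is a discrete codec parameter.
qp_pred = np.rint(model.predict(X_test)).astype(int)
print(qp_pred)
```

In the actual framework, one such model would be trained per cross-over QP, with the features selected during training.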
These QPs help to determine the bitrates at which the resolution switches to the next one on the PF. Thus, we perform a small number of encodes across resolutions at the cross-over QPs. The next step involves defining the parameters of the set of equations that relate the quantization parameter to the bitrate. These equations indicate the resolution and QP at the target bitrate ladder rungs. The final step involves encoding the sequence at the predicted resolution and QPs for each ladder rung, so as to derive the predicted content-aware bitrate ladder.

While the proposed methodology has been implemented and demonstrated using the High Efficiency Video Coding (HEVC) codec [25], it is extendable to any video codec once properly trained. Furthermore, in this paper, we selected PSNR as the basis for constructing the bitrate ladder. PSNR remains the most commonly used quality metric, despite the fact that other quality metrics have been shown to offer a better correlation with perceived quality. Nevertheless, the method is adaptable to other quality metrics.

IV. CONSTRUCTING THE REFERENCE BITRATE LADDER
This section focuses on the exploration of the RQ space across resolutions, the definition and modelling of the reference Pareto surface, and the construction of the sequence-specific reference bitrate ladder.
A. Description of the Dataset
For any content-driven video processing framework, it is essential to have a large video dataset that covers a variety of scenes. Therefore, we employed a dataset of 100 publicly available UHD video sequences from different sources: Netflix Chimera [26], Ultra Video Group [27], Harmonic [28], SJTU [29] and AWS Elemental [30]. The same dataset was also used as a training dataset in [31]. Example frames from the dataset are depicted in Fig. 2. Many of the sequences have a native resolution of 4096×2160 and a bit depth of 10 bits per sample. Finally, the sequences were temporally cropped to 64 frames. (Two of the sequences were temporally downsampled from 120 to 60 fps in order to match the majority frame rate of 60 fps.)

Fig. 2: Sample frames of the considered dataset [31].

We illustrate in Fig. 3 the four basic descriptors of our dataset: Spatial Information (SI), Temporal Information (TI), average Motion Vector (MV) magnitude and Colourfulness (CF) [32]. These four distributions highlight the variety of the video content. SI is an indicator of edge energy; TI is an indicator of temporal variance; MV is another expression of how fast the motion in successive frames might be; and CF is an indication of the colour distribution. All features show a wide coverage of the spatio-temporal domain, with most of the sequences in the range of 150-250 for SI and 10-50 for TI. The histograms also reveal some of the outlying video sequences, such as TunnelFlag (sample frame in row 10, column 1 of Fig. 2) and Wood (row 10, column 9), which contain very dense edges. The video sequence Jockey (row 3, column 6) also differs from the others because of its high motion.

Fig. 3: Distributions of the SI, TI, MV, and CF descriptors of the considered dataset: (a) SI histogram; (b) TI histogram; (c) MV histogram; (d) CF histogram.
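For reference, SI and TI can be computed along the lines of ITU-T P.910, as in the simplified sketch below. This is a stand-in using Sobel gradients and frame differences, not the exact implementation behind Fig. 3.

```python
# Simplified SI/TI descriptors in the spirit of ITU-T P.910:
# SI is the maximum over frames of the std of the Sobel gradient magnitude,
# TI is the maximum over frames of the std of successive frame differences.
import numpy as np
from scipy.ndimage import sobel

def spatial_information(frames):
    """frames: sequence of 2-D luma arrays."""
    si_per_frame = []
    for f in frames:
        gx = sobel(f.astype(float), axis=1)
        gy = sobel(f.astype(float), axis=0)
        si_per_frame.append(np.sqrt(gx**2 + gy**2).std())
    return max(si_per_frame)

def temporal_information(frames):
    diffs = [(frames[i].astype(float) - frames[i - 1].astype(float)).std()
             for i in range(1, len(frames))]
    return max(diffs)

# Toy usage on random "luma" frames:
rng = np.random.default_rng(1)
frames = [rng.integers(0, 256, size=(64, 64)) for _ in range(5)]
print(spatial_information(frames), temporal_information(frames))
```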
B. The Reference Pareto-optimal Front
Each RQ curve at a resolution $S \in \mathcal{S}$ can be defined by the set of vectors: bitrate $\mathbf{R} = (R_1, R_2, \ldots, R_{|\mathbf{P}|})^\top$, video quality $\mathbf{Q} = (Q_1, Q_2, \ldots, Q_{|\mathbf{P}|})^\top$, and quantization points $\mathbf{P} = (P_1, P_2, \ldots, P_{|\mathbf{P}|})^\top$. As $\mathbf{P}$ represents the independent parameter QP that is given as an input parameter to the video codec, $\mathbf{R}$ and $\mathbf{Q}$ depend on $\mathbf{P}$.

Each RQ curve expresses the tradeoff between quality $\mathbf{Q}$ and bitrate $\mathbf{R}$ at a resolution $S$ over the independent parameter $\mathbf{P}$. The sets of independent variables $\mathcal{S}, \mathcal{P}$ form a decision variable space that is mapped to an objective function space containing the resulting $(R_i, Q_i)$ points. Our aim is to determine the optimal $\langle P_i^\ast, S_i^\ast \rangle$ that result in the highest quality $Q_i^\ast$ at the lowest possible $R_i^\ast$. These tuples of optimal points $\{\langle R_i^\ast, Q_i^\ast, P_i^\ast, S_i^\ast \rangle\}$ form the PF. Every point on the PF is dominant over every other point in the objective function
space.

Fig. 4: Example of intersecting RQ curves (2160p, 1080p and 720p), the feasible objective space and the cross-over points for the sequence Marathon: (a) RQ points and the feasible objective space; (b) intersections of the RQ curves and cross-over QPs.

To put it simply, the PF, $\mathcal{C}$, in our work is defined as a set of tuples, i.e.

$\mathcal{C} := \{\langle R_i^\ast, Q_i^\ast, P_i^\ast, S_i^\ast \rangle\}_{i=1}^{|\mathcal{C}|}$,   (1)

where $R_i^\ast > R_{i-1}^\ast$ with $R_i \in \mathbb{R}^{+,\ast}$ and $R_{min} < R_i < R_{max}$; $Q_i^\ast > Q_{i-1}^\ast$ with $Q_i \in \mathbb{R}^{+,\ast}$; $P_i^\ast \le P_{i-1}^\ast$ with $P_i \in \mathbb{N}^\ast$; $S_i^\ast \ge S_{i-1}^\ast$; and $|\mathcal{C}|$ expresses the cardinality, which varies per content. Depending on the shape of the RQ curves across resolutions and their intersection points, the PF shape and cardinality may differ per sequence.

In Fig. 4 (a), an example of the Rate-PSNR points across three resolutions, i.e. {2160p, 1080p, 720p}, and the respective feasible objective space is depicted. Next, in Fig. 4 (b), the PF is illustrated with the grey dashed line.

Furthermore, it is important to define the intersection points between the RQ curves of the same video across resolutions. The intersection points signal the switching of resolutions and are defined by pairs of QPs, called cross-over QPs, that are mathematically represented as

$\left( QP_{S_j}^{level_j}, QP_{S_{j-1}}^{level_{j-1}} \right)$, with $S_j \ne S_{j-1}$, $level_j \ne level_{j-1}$,

where $S_j$, $j \in \{1, 2, \ldots, |\mathcal{S}|\}$, are the resolutions of the intersecting curves of the same video sequence and $level \in \{high, low\}$ defines the range of QPs. The total number of cross-over QPs depends, of course, on the number of resolutions and equals $2 \times (|\mathcal{S}| - 1)$. The $level$ superscript is used in order to distinguish the intersection QPs of the same resolution; that is, $level$ indicates whether the intersection happens at the high or low range of QP values. The resolution and level cannot be the same for both QPs in a pair.
For example, the pair $(QP_{2160p}^{high}, QP_{1080p}^{low})$ indicates that the 2160p curve intersects the 1080p curve at a considerably high QP value for 2160p and at a low one for 1080p. Figure 4 (b) illustrates an example of cross-over QPs and how the notation is used for the three intersecting curves.

C. Constructing the Reference Pareto surface
We first construct the ground-truth PF and determine the intersection points of the RQ curves between different spatial resolutions. These intersection points mark the limits of the range for which encoding at the given resolution yields the best quality. When encoding at a lower resolution, all metrics are computed on the rescaled version (see Fig. 1): all sequences are first downscaled, then encoded, decoded and finally upscaled prior to metric computation.

We spatially downscaled all sequences in our dataset (see Fig. 1) using a Lanczos-3 filter [33], as implemented by FFmpeg [34], at four different resolutions, $\mathcal{S} = \{2160p, 1080p, 720p, 540p\}$. Then, we encoded all versions of the sequences with the HEVC reference software, HM16.20 [25], using the Random Access profile, a 64-frame intra period, a group-of-pictures length equal to 16 frames, and a fixed QP range $\mathcal{P}$ for all resolutions. The range of QPs selected was sufficiently wide to ensure that the RQ curves across resolutions intersect. As can be seen in Fig. 1, after decoding the sequences, we upscale them to the native resolution using the same filter. All quality metrics are computed at the display resolution (2160p), as also recommended in [35].

Fig. 5: Examples of RQ curves for a subset of the considered dataset at four different spatial resolutions: (a) log(R)-PSNR curves; (b) log(R)-PSNR PFs.

In Fig. 5 (a), we illustrate a subset of log₂(Rate)-PSNR curves from the considered dataset across the four spatial resolutions for the same range of quantization levels (log₂ is henceforth simplified as log). From these figures, we observe that the wide range of content features is reflected in the high diversity of the RQ curves. For example, the sequence ToddlerFountain (row 5, column 7, Fig. 2), which exhibits dynamic texture (fountains), has a smoother slope compared to a more static sequence such as HoneyBee (row 7, column 8, Fig. 2), which exhibits a steeper slope. Furthermore, there is a shift of the RQ curves toward lower quality and bitrate associated with downscaled spatial resolutions.
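Once the RQ points of all resolutions are pooled, the reference PF reduces to a dominance filter over (rate, quality) points. A minimal sketch follows; it is illustrative only, and the toy RQ values are invented:

```python
# Minimal sketch of building the Pareto front over pooled RQ points from
# all resolutions: keep only the points for which no other point offers at
# least the same quality at no more than the same bitrate.
def pareto_front(points):
    """points: list of (rate, quality, qp, resolution) tuples."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] >= p[1] and q != p  # cheaper and at least as good
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)  # ascending bitrate

# Toy example: two "curves" at different resolutions.
pts = [
    (1000, 34.0, 38, "1080p"), (2000, 37.0, 32, "1080p"), (4000, 39.5, 27, "1080p"),
    (1500, 33.0, 36, "2160p"), (3000, 38.0, 30, "2160p"), (6000, 42.0, 24, "2160p"),
]
print(pareto_front(pts))
```

Here the 2160p point at 1500 kbps is dropped because a 1080p point delivers higher quality at a lower bitrate, mirroring the dominance property defined above.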
The lower-resolution sequences saturate at lower quality values. However, sequences at lower resolutions demonstrate higher quality values at lower bitrates. Also, the intersection points differ significantly according to sequence characteristics (see Section V-B, Fig. 8). In Fig. 5 (b), we illustrate the resulting Pareto surfaces for our dataset across the four spatial resolutions. As expected, the composition of these curves varies for the different sequences. The PFs are composed of a different number of points across resolutions. This figure emphasizes the requirement for content-aware bitrate ladder construction.

After computing the quality metrics on all versions of the upscaled decoded sequences, we construct the PF for each sequence. This is referred to as the reference PF and is considered the ground truth. We assume that two curves across resolutions can only intersect once, and in a monotonically descending manner: 2160p with 1080p, 1080p with 720p, etc. Hence, if the resulting PF reveals more than one intersection between a pair of resolutions, we record the one with the highest bitrate.

Fig. 6: Examples of scatter plots of two cross-over QP pairs: (a) $QP_{1080p}^{low}$ vs $QP_{2160p}^{high}$; (b) $QP_{720p}^{high}$ vs $QP_{540p}^{low}$.
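The rule of keeping the highest-bitrate crossing can be sketched as follows for two RQ curves sampled on a common log-bitrate grid (an illustrative helper, not the authors' code):

```python
# Sketch (illustrative): locate where two sampled RQ curves cross, and if
# they cross more than once, keep the crossing at the highest bitrate.
import numpy as np

def last_crossing(log_rate, q_hi_res, q_lo_res):
    """Return the log-bitrate of the highest-bitrate sign change of
    (q_hi_res - q_lo_res), linearly interpolated, or None if no crossing."""
    d = np.asarray(q_hi_res) - np.asarray(q_lo_res)
    idx = np.nonzero(np.diff(np.sign(d)) != 0)[0]  # intervals containing a crossing
    if idx.size == 0:
        return None
    i = idx[-1]  # keep the crossing at the highest bitrate
    # Linear interpolation of the zero of d within [log_rate[i], log_rate[i+1]].
    t = d[i] / (d[i] - d[i + 1])
    return log_rate[i] + t * (log_rate[i + 1] - log_rate[i])

# Toy curves: the higher resolution overtakes the lower one as bitrate grows.
lr = np.array([1.0, 2.0, 3.0, 4.0])
q2160 = np.array([30.0, 33.0, 37.0, 41.0])
q1080 = np.array([32.0, 34.0, 36.0, 38.0])
print(last_crossing(lr, q2160, q1080))  # 2.5
```

The QPs of the two curves nearest to this crossing would give the corresponding pair of cross-over QPs.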
D. Cross-over QPs and their relationships
In this paper, we consider four spatial resolutions, $\mathcal{S} = \{2160p, 1080p, 720p, 540p\}$, so we have three intersections that correspond to three pairs of cross-over QPs, namely $(QP_{2160p}^{high}, QP_{1080p}^{low})$, $(QP_{1080p}^{high}, QP_{720p}^{low})$, and $(QP_{720p}^{high}, QP_{540p}^{low})$. A typical example of intersecting RQ curves is drawn in Fig. 4 (b). We can see that the RQ curves across resolutions reside in very close proximity, appearing to overlap across a wide range of bitrates. Such occurrences are common, and very often the quality values differ only marginally across resolutions for certain bitrate ranges.

Moreover, many of the cross-over QPs are highly correlated across different resolutions. Table I reports the Pearson Linear Correlation Coefficient (LCC) and Spearman Rank Order Correlation Coefficient (SROCC) for the cross-over QPs. Almost all QPs are highly correlated. These observations are useful, indicating that previously predicted cross-over QPs could be used as features. The linear relationship between pairs of cross-over QPs can be verified from the example scatter plots in Fig. 6, where two examples of pairs of cross-over QPs are given. It can be seen that the cross-over points show a close-to-linear shift across resolutions. An example of this linear relationship between cross-over QPs is given below:

$\widetilde{QP}_{1080p}^{low} = 1.{\ldots}\, QP_{2160p}^{high} - {\ldots}$,   (2)

$\widetilde{QP}_{720p}^{high} = 1.{\ldots}\, QP_{540p}^{low} - {\ldots}$,   (3)

where the estimated QP values are rounded to the nearest integer. In this case, the LCC is 0.9908 and the SROCC 0.9901 for the $(QP_{2160p}^{high}, QP_{1080p}^{low})$ pair, and 0.9563 and 0.9160, respectively, for the other pair.

E. The Reference Bitrate Ladder
After constructing the Pareto-optimal front, the next step is to build the bitrate ladder. We define the bitrate ladder as an ordered set $\mathbf{R}_L = \{R_{L,1}, R_{L,2}, \ldots, R_{L,|\mathcal{L}|}\}$, where $|\mathcal{L}|$ is the cardinality of $\mathbf{R}_L$ and $R_{L,1} < R_{L,2} < \ldots < R_{L,|\mathcal{L}|}$. The bitrate ladder is fully defined as a set of tuples $\mathcal{L}$ that comprise the bitrate values $\mathbf{R}_L$, the associated set of quality values $\mathbf{Q}_L$, a set of QP values $\mathbf{P}_L$, and a set of resolutions $\mathbf{S}_L$, i.e.

$\mathcal{L} := \{\langle R_{L,i}, Q_{L,i}, P_{L,i}, S_{L,i} \rangle\}_{i=1}^{|\mathcal{L}|}$.   (4)

In order to construct the bitrate ladder, we follow three steps. First, we define the range of bitrates that will be used for streaming, and trim our PF to lie between the lower $R_{min}$ and upper $R_{max}$ bitrate values. Next, we perform subsampling of the trimmed front across bitrate and quality.

Constraints across the quality dimension depend on the metric employed. From Fig. 5 (b), we observe that, for some sequences, the PF saturates after reaching a certain bitrate value. Allocating bits beyond this value would not improve video quality. A shorter bitrate ladder that takes this saturation into account can therefore be used for these sequences. In general, as mentioned in Section I, the length of the ladder will depend on the video content and its compression performance across the different resolutions.

Next, we follow common practice by selecting points on the PF such that each ladder point $R_{L,i}$ is approximately twice the bitrate of the previous point, i.e.

$R_{L,i} \simeq 2 R_{L,i-1}$,   (5)

where $R_{L,i} \in (R_{min}, R_{max})$ and $i \in \mathbb{N}$. Translated into the log domain, this expression can be written as

$\log(R_{L,i}) \simeq 1 + \log(R_{L,i-1})$.   (6)

We use an approximation in the above equations because, in practice, the curves are not continuous, but instead finite sets of discrete points, as a consequence of using integer QP values.

We next subsample the PF considering restrictions across the quality dimension. Put formally, we find the rate points on the ladder $R_{L,i}$ for which:

$Q_{L,i}(R_{L,i}) \le Q_{max}$,   (7)

$\frac{\mathrm{d} Q_{L,i}}{\mathrm{d} R_L} > \epsilon$,   (8)

where $Q_{max}$ is the maximum value that can be assumed by normalised metrics and $\epsilon \in \mathbb{R}$, $\epsilon \to 0$. As a consequence of the above constraints, the length of the ladder might vary. The use of different ladder lengths, dependent on compression complexity, was also suggested in [18]. The basic steps to construct the reference bitrate ladder explained above are briefly outlined in Algorithm 1. In Fig.
7, the reference bitrate ladders (a) per sequence and(b) on average for the considered dataset, are illustrated. Forthese figures, we considered the {150kbps,25Mbps} bitraterange. As can be seen from both plots, the steps on the ladderare clearly visible and are shifted to a greater or lesser extentaccording to the sequence. It is noticeable in Fig. 7 (a) that thelast step of the bitrate ladder appears to have a smaller numberof points. This indicates the variable length of the ladder as aresult of the PF in the considered bitrate range.Table II reports the statistics on the different combinationsdetected in the considered dataset. As expected, the combina-tions of higher resolutions are dominant. It can be seen, thatonly for 13.69% of the test sequences, the native resolutionis not included in the constructed reference bitrate ladder.The content-gnostic construction of a bitrate ladder offers theadvantage of combining a variable length bitrate ladder andlower than native resolutions where appropriate, as opposed REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER < TABLE I: Cross-correlation values (LCC, SROCC) between cross-over QPs.PSNR QP high p QP low p QP high p QP low p QP high p QP low p QP high p - (0.9908, 0.9901) (0.8974, 0.8638) (0.9108, 0.8825) (0.7908, 0.7414) (0.8378, 0.8073) QP low p - - (0.8842, 0.8521) (0.9040, 0.8812) (0.7700, 0.7318) (0.8149, 0.7971) QP high p - - - (0.9895, 0.9825) (0.9123, 0.8865) (0.9140, 0.8690) QP low p - - - - (0.9000, 0.8728) (0.9182, 0.8760) QP high p - - - - - (0.9563, 0.9160) Algorithm 1:
Reference Bitrate Ladder Construction
Input: video sequence, set of resolutions S, set of quantization points P
Output: reference bitrate ladder L per video sequence
% Step 1: Extract RQ Points
for each s ∈ S do
    Downscale sequence to s using the Lanczos-3 filter;
    for each p ∈ P do
        Encode sequence at QP = p with the RA profile, intraPeriod = 64, GoP length = 16;
        Compute bitrate R_p;
        Decode sequence;
        Upscale decoded sequence to the native resolution;
        Compute quality metrics Q_p;
    end
    RQ curve at s: {log(R), Q, P, S}
end
return RQ curves across all resolutions, {log(R), Q, P, S}
% Step 2: Compute the Reference PF
Find the RQ points that compose the PF;
Find the intersection points of the RQ curves;
return reference PF: C_ref ← {⟨log(R_i), Q_i, P_i, S_i⟩}_{i=1}^{|C|}.
% Step 3: Compute the Reference Bitrate Ladder
Trim the logarithmic bitrate range: log(R)_min ≤ log(R)_{L,i} ≤ log(R)_max;
Prune the trimmed C across the bitrate dimension using Eq. (6) → C′;
Prune C′ across the quality dimension according to Eqs. (7)-(8);
return reference bitrate ladder, as in Eq. (4):
L ← {⟨log(R)_{L,i}, Q_{L,i}, P_{L,i}, S_{L,i}⟩}_{i=1}^{|L|}.

to the traditional method that includes all resolutions and all target ladder rungs. This results in reduced encoding cost, while not degrading the end-user experience.

V. CONTENT-DRIVEN PREDICTION OF THE BITRATE LADDER
This section outlines the processes linked to the prediction of the PF, including feature extraction, feature selection, prediction of the cross-over points using machine-learning models, estimation of the PF parameters and the assessment
(a) Bitrate ladder points (PSNR (dB) vs log(Rate)).
(b) Average bitrate ladder (PSNR (dB) vs log(Rate)).
Fig. 7: The reference bitrate ladder for the considered dataset.

TABLE II: Resolution patterns that compose the bitrate ladders.
✓ ✓ ✓ ✓ ✓ ✓ ✓ -   17.32%
✓ ✓ - -           21.79%
✓ - - -           3.07%
- ✓ ✓ ✓ ✓ ✓ -     0.56%
- ✓ - -           1.4%

and evaluation of the results. A description of the proposed methodology for content-aware ladder prediction is given in Algorithm 2. Furthermore, this section includes a discussion of the results with respect to the optimality of the predicted bitrate ladders and the computational cost in terms of required encodings.

A. Spatio-Temporal Features
In this subsection, we discuss the spatio-temporal feature extraction, which is the first step for the prediction of the content-gnostic bitrate ladder, as described in Algorithm 2. Observing the RQ curves, their intersection points and their PFs, it is evident that there are strong dependencies on, and correlations with, content characteristics. For example, in the case of the Marathon sequence, whose PSNR-log(R) curves are depicted in Fig. 4, we observe that the PF comprises many 2160p resolution points. This can be attributed to the density of small moving structures (runners) within the scene. On the other hand, for other, more static sequences that include an out-of-focus background (e.g. Barscene), the intersection of the 1080p curve with the 2160p curve occurs at a much lower QP value. The challenge is to find suitable spatio-temporal features that reflect such content characteristics.

The literature is rich with various spatio-temporal features used to characterise the relationship between video content and compression performance [36]-[43]. The spatio-temporal features employed in this work have been carefully selected through extensive evaluation of a large variety of features,
Algorithm 2:
Prediction of the Bitrate Ladder
Input: test video sequence v at native resolution
Output: predicted bitrate ladder L̃ per video sequence
% Step 1: Extract Features
for each frame i ∈ {1, 2, . . . , NoFrames} do
    Compute GLCM contrast, homogeneity, correlation, energy and entropy on frame i;
    if i > 1 then
        Compute TC mean, standard deviation, skewness, kurtosis and entropy between frames (i − 1, i);
    else if i = 1 then
        Compute RsMSE;
    end
end
Compute the mean GLCM descriptors, and the mean and standard deviation of the TC statistics, over all frames;
% Step 2: Predict Cross-Over QPs
for each QP level do
    Select a subset of features using RFE;
    Predict the cross-over QP of that level;
    Update the set of features with the predicted QP;
end
% Step 3: Estimate the QP-log(R) Eq. Parameters
for each v do
    for each s ∈ S do
        if s ≠ |S| then
            if s > 1 then
                Downscale sequence v_d to resolution s using the Lanczos-3 filter;
            end
            Encode sequence v_d at Q̂P^high_s;
        else
            Encode sequence v at Q̂P^low_s;
        end
        Compute bitrate R_p;
        Decode sequence v′;
        if s > 1 then
            Upscale sequence v′ to the native resolution using the Lanczos-3 filter;
        end
        Compute quality metrics Q_p between (v′, v);
    end
    Estimate Eq. (9) parameters for video v′ for all s;
end
% Step 4: Compute the Bitrate Ladder
Average the R_p points of the cross-over QPs to define the resolution-switching bitrates.
Repeat Lines 20-21 of Algorithm 1 for each v′;
% Step 5: Validate Monotonicity and Concavity
Order non-monotonic points and remove concave points.
return predicted bitrate ladder:
(a) QP^high_2160p vs meanGLCM_ent.
(b) QP^high_2160p vs meanTC_skw.
Fig. 8: Example of the content dependency of the cross-over QPs.

modifying some so that they better represent the basic characteristics of video texture that relate to encoding difficulty, i.e. spatial diversity, coarseness and motion, as shown in [44]. We adopt those features that have been successfully used in our previous compression-related research [31], [44]-[46]. In particular, for the representation of spatial information, and specifically to express the variability of intensity contrast between neighboring pixels, we employ the Gray Level Co-occurrence Matrix (GLCM) [37] and extract its basic descriptors (contrast; correlation; homogeneity; energy; entropy) along with their mean and standard deviation across frames, as described in [44]. Another low-cost feature adopted is the Mean Squared Error of spatial Rescaling (RsMSE) of the first frame, similar to a feature suggested in [31]. This feature captures the distortions that result from spatial sub/upsampling. Furthermore, in order to combine both spatial and temporal characteristics, we employ Temporal Coherence (TC) [45] with its interframe statistics (mean; standard deviation; skewness; kurtosis; entropy), as well as their mean and standard deviation across all frames. Table III reports the full set of features and their statistics, biasing those with the lowest computational complexity and those that were selected via feature selection methods as described in Section V. Besides those referred to above, we have tested other features, including the normalised Laplacian pyramid [47], the normalised cross-correlation across successive frames [45], [48], the average frame difference, the optical flow [49], and more. In Fig.
8, we illustrate examples of the ground-truth cross-over QP^high_2160p plotted against the temporal mean of the GLCM entropy, meanGLCM_ent, and the mean of the temporal coherence skewness, meanTC_skw, extracted from the video sequences at their native resolution. The higher the value of meanGLCM_ent, the higher the spatial variability of the sequence. This is, in most cases, related to a high cross-over value for QP^low_2160p, which means that the PF comprises more points from the 2160p resolution. In the case of meanTC_skw, high values (positive skewness) indicate a temporally coherent sequence, where switching to a lower resolution is likely to happen at a lower QP value.

It is also worth highlighting that, in the last row of Table III, the predicted cross-over QPs are listed as features. As explained in Section IV-D, there is a strong relationship between cross-over QPs. Their inclusion has resulted in higher prediction accuracy. More details on this follow in the next subsection.

TABLE III: List of features and their notations.
Feature | Notations
Grey-Level Co-occurrence Matrix (GLCM) [37] | F1. meanGLCM_con, F6. stdGLCM_con; F2. meanGLCM_cor, F7. stdGLCM_cor; F3. meanGLCM_hom, F8. stdGLCM_hom; F4. meanGLCM_enr, F9. stdGLCM_enr; F5. meanGLCM_ent, F10. stdGLCM_ent
Temporal Coherence (TC) [45] | F11. meanTC_mean, F16. stdTC_mean; F12. meanTC_std, F17. stdTC_std; F13. meanTC_skw, F18. stdTC_skw; F14. meanTC_kur, F19. stdTC_kur; F15. meanTC_entr, F20. stdTC_entr
MSE from Rescaling using Lanczos filter (RsMSE) [31] | F21. RsMSE (from 2160p to 1080p), F22. RsMSE (from 2160p to 720p), F23. RsMSE (from 2160p to 540p)
Predicted cross-over QPs | F24. Q̂P^high_2160p, F25. Q̂P^low_2160p, F26. Q̂P^high_1080p, F27. Q̂P^low_1080p, F28. Q̂P^high_720p

B. Predicting the Cross-Over QP Values
The latest version of the HEVC reference software, HM16.20, was employed in this study. The Random Access profile was used according to the Common Test Conditions [25], [50], namely a 64-frame intra period and a Group of Pictures (GoP) length of 16 frames. After encoding, decoding, and upscaling the spatial resolution to 2160p, we computed the quality metrics and bitrate at a GoP level. Computing the RQ curves at a GoP level enabled a larger coverage of the RQ space.

Prior to the prediction of each cross-over QP value, we applied feature selection, using recursive feature elimination (RFE) [51], on the set of spatio-temporal features. We followed a sequential prediction of the cross-over QPs, starting from the highest resolution down to the lowest. While for the QP^high_2160p prediction we relied only on spatio-temporal features extracted from the uncompressed 2160p videos, for the rest of the cross-over QPs we also made use of their identified relations, as explained earlier. We trained and tested several machine-learning regression methods, including Support Vector Machines with different kernels and Random Forests. We also evaluated deep-learning-based regression with dense sequential layers (rectified linear unit activation and the Adam optimiser). However, Gaussian Process (GP) regression performed best for this work, as also shown in [8]. To avoid overfitting, we deployed a ten-fold random cross-validation process.

The results in Table IV report the outcome of the ten-fold cross-validation, with the accuracy-of-prediction metrics averaged over the ten folds. The table also lists the selected features and the accuracy of prediction for each predicted cross-over QP. Regarding the selected feature subsets, we observe that there are similarities across all predicted cross-over QPs. Additionally, the previously predicted QPs were also selected as features, as explained earlier. The reported values of R² are high, around 0.9.
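A minimal sketch of this prediction stage is given below (scikit-learn based; the paper does not state the RFE base estimator or the GP kernel, so the linear RFE driver and the RBF-plus-noise kernel here are assumptions, and the function name is ours):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import KFold, cross_val_predict

def predict_crossover_qp(X, y, n_features=6):
    """Select features with RFE, then regress a cross-over QP with a GP.

    X: (n_sequences, n_total_features) spatio-temporal features;
    y: ground-truth cross-over QPs. A linear model drives the RFE step
    (assumed; the paper does not specify the RFE base estimator).
    """
    selector = RFE(LinearRegression(), n_features_to_select=n_features)
    X_sel = selector.fit_transform(X, y)

    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                  normalize_y=True)
    # ten-fold cross-validated predictions, as in the paper's protocol
    y_hat = cross_val_predict(gp, X_sel, y,
                              cv=KFold(10, shuffle=True, random_state=0))
    # round and clip to the valid HEVC QP range before evaluation
    return np.clip(np.round(y_hat), 0, 51), selector.support_
```

The returned boolean mask indicates which features survived RFE, mirroring the "Selected Features" column of Table IV.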
Moreover, the cross-correlation metrics LCC and SROCC between the predicted and the ground-truth QPs are also high. In addition, the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) are considerably low and comparable for all predicted cross-over QPs. It is important to point out that the effectiveness of the method cannot be fully assessed by these results alone; the predicted cross-over QPs will be utilised to estimate the resolution-switching bitrates and to define the models that estimate the bitrate ladder. Hence, the comparison of the predicted bitrate ladder to the reference will provide the full assessment of this framework.

C. Modelling QP-log(Rate) to Estimate the Bitrate Ladder
In the construction of the bitrate ladder, after predicting the cross-over points, we need to know which resolution to pick for each rung of the ladder and which QP corresponds to the respective bitrate. In order to predict the QP that corresponds to each bitrate ladder rung, we explored the QP-log(R) relation. An example of QP-log(R) points across resolutions is illustrated in Fig. 9(a). As indicated, we confirmed that there is a strong linear correlation, with an average LCC equal to -0.9891 for 2160p, -0.9931 for 1080p, -0.9955 for 720p, and -0.9952 for 540p. Thus, by defining a set of log(R)-QP linear equations (one per resolution), we can estimate P̂_L. Put formally:

Q̂P_s = α_s log(R) + β_s, (9)

where α_s, β_s ∈ ℝ with s ∈ S. So, each P̂_{L,i} at R_{L,i} can be estimated by this equation. We explored whether the α_s, β_s parameters for the set of resolutions are correlated and noticed that there is a strong correlation between the α values, particularly at the lower resolutions, between 720p and 540p, with an LCC equal to 0.9890 and an SROCC equal to 0.9858. This can be observed in Fig. 9(b)-(d). Moreover, by fitting a first-order polynomial, we noticed that α_540p ≈ α_720p. This means that only one set of (QP, log(R)) values is required to determine the β_540p model parameter for the 540p resolution. The same cannot be applied to the higher resolutions, as the deviation of the estimated α value is significant.

In the considered example, where |S| = 4, we need to perform two encodes at three of the four resolutions and one encode for the remaining one in order to fully define the RQ relations across resolutions. Naturally, it is more efficient to place the single-encode shortcut at the lower resolutions.

D. Compared Methods
Ideally, we would validate our proposed method against the state-of-the-art technologies described in Section II. However, as those are proprietary, with no publicly available implementations, we have instead benchmarked using the following methods.

• Reference Ladder (RL): This exhaustive-search approach was used to construct our reference Pareto surface, as described in Section IV-C and Algorithm 1. To summarise, we encoded each sequence at different resolutions for a

(Footnote: The predicted values were rounded to the nearest integer and clipped to the range of valid QP values before computing the correlation metrics.)
TABLE IV: Selected features & validation metrics of predicted cross-over QPs for PSNR-log(R) curves.
QP | Selected Features | LCC | SROCC | R² | MAE | RMSE
Q̂P^high | F2, F4, F5, F11, F12, F14 | .9350 | .9164 | .91 | 1.41 | 1.96
Q̂P^low | F2, F4, F5, F11, F12, F14, F24 | .9442 | .9296 | .90 | 1.35 | 1.96
Q̂P^high | F2, F4, F5, F11, F12, F14, F25 | .9536 | .9076 | .91 | .95 | 1.36
Q̂P^low | F2, F4, F5, F11, F12, F14, F21, F25, F26 | .9531 | .8751 | .92 | .76 | 1.15
Q̂P^high | F2, F4, F5, F12, F13, F14 | .9316 | .8334 | .88 | .97 | 1.44
Q̂P^low | F2, F4, F5, F11-F15 | .9210 | .8535 | .89 | 1.17 | 1.59

(a) QP vs log(R) across resolutions for ToddlerFontain. (b) α_2160p vs α_1080p (LCC: 0.8748, SROCC: 0.8560). (c) α_1080p vs α_720p (LCC: 0.9420, SROCC: 0.9412). (d) α_720p vs α_540p (LCC: 0.9890, SROCC: 0.9858).
Fig. 9: Exploring the QP-log(R) model parameters across resolutions.

wide range of QP values, computed the cross-over QPs, constructed the PF, and built the bitrate ladders. This method creates the optimal bitrate ladders and requires the highest number of encodings.

• Interpolation-based Ladder (IL): This method is based on encoding using only a subset of QP values per resolution. Specifically, after encoding using a subset of QPs per resolution, we use piece-wise cubic Hermite interpolation [52] to find the RQ coordinates for the interim QP values. Based on these encodings and the interpolated RQs, the PF is extracted as in the RL method explained above. This method produces a suboptimal solution, whose accuracy depends on the number of encodes performed per resolution. The added benefit of this method is that it significantly reduces the number of encodings required compared to the RL.

• Feature-based Predicted Ladder (FL): This is the proposed method described earlier in Algorithm 2, where spatio-temporal features are extracted first to predict the RQ cross-over points that lie on the PF. Then, encodings at the cross-over QPs are used to define the bitrates where the resolution switches, and to estimate the parameters of Eq. (9).
After the estimation of the parameters, the equations are utilised, along with the switching bitrates, to estimate the QP values and the resolution for the bitrate ladder rungs.

• Hybrid Ladder (HL): This method combines the best-performing method per content, either FL or IL. A method-selection module is introduced after the spatio-temporal feature extraction. Using the extracted spatio-temporal features, a classifier selects, for each input sequence, which method, IL or FL, is expected to estimate the bitrate ladder more accurately.
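The per-resolution fitting of Eq. (9) and the rung-QP lookup used by FL can be sketched as below (a Python illustration, not the paper's Matlab/HM pipeline; base-2 logarithms are assumed, matching the figure axes, and the function names are ours):

```python
import numpy as np

def fit_qp_logr(rates, qps):
    """Least-squares fit of Eq. (9): QP_s = alpha_s * log(R) + beta_s."""
    alpha, beta = np.polyfit(np.log2(rates), qps, deg=1)
    return alpha, beta

def qp_for_rate(rate, alpha, beta, qp_range=(0, 51)):
    """QP of a ladder rung at a target bitrate, clipped to the HEVC range."""
    qp = alpha * np.log2(rate) + beta
    return int(np.clip(round(qp), *qp_range))
```

With α_540p taken equal to α_720p, a single (QP, log(R)) point is enough to recover β_540p, which is why only 2 × |S| − 1 initial encodes are needed.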
E. Bitrate Ladder Prediction Results
As described in Algorithm 2, after predicting the cross-over QPs, we perform encodings at the defined cross-over points in order to estimate the two parameters of Eq. (9) at each resolution. After defining the parameters, the bitrate ladders for the considered fitted models are constructed following the approach described in Section IV-E. In order to assess the predicted bitrate ladder against the two benchmarks, we computed the BD metrics [53], BD-Rate and BD-PSNR, per sequence. For the comparison of the methods, we report the mean values and the mean absolute deviation of both metrics in Table V. As an additional measure of optimality, this table also reports the average percentage of the predicted RQ points that belong to the PF (PF-hits). We selected the mean absolute deviation (mad) instead of the standard deviation because, as easily observed in the histograms of Fig. 11, the distributions are not normal. The distributions are skewed because the BD metrics are computed against the RL bitrate ladders, which are constructed from points on the PF, whereas the compared methods potentially comprise a mixture of points that either belong to the PF or to a suboptimal set. Moreover, in Fig. 13, we provide examples of predicted ladders using all methods for different sequences.
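For reference, the BD-Rate between two RQ curves is commonly computed by fitting a cubic polynomial of log-rate over quality and comparing the integrals over the overlapping quality interval; the sketch below is our implementation of that standard Bjøntegaard procedure, not the paper's code:

```python
import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjoentegaard Delta-Rate (%) between two RQ curves."""
    lr_ref, lr_test = np.log(rate_ref), np.log(rate_test)
    # cubic fit of log-rate as a function of quality, per curve
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)

    # overlapping quality interval
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))

    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100
```

A positive value means the test ladder needs more bitrate than the reference for the same quality; a curve with all rates doubled yields a BD-Rate of 100%.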
1) IL Method
We first investigated the accuracy of the IL method by computing the BD metrics for a varying number of QP samples |P_sub| per resolution, namely from 4 to 8. Figure 10 plots the mean BD-Rate and BD-PSNR with their mean absolute deviation for the different numbers |P_sub| of encodes per resolution. As can be seen from this figure, as the number of encodes increases, the mean BD-Rate drops, resulting in a very good approximation of the RL, as also verified by the results reported in Table V. The variations in the results are mainly attributed to the sensitivity of the interpolation method and the different number of steps.

Fig. 10: Mean BD-Rate and confidence intervals of the IL over the RL against the number of encodes required. (a) BD-Rate (%). (b) BD-PSNR (dB).

The BD statistics start converging as |P_sub| increases, with the best results achieved for |P_sub| = 7. From this point onward, we use this setting to compare with the other methods. For this case, as reported in Table V and shown in Fig. 11, the mean BD-Rate is 0.80% with a mean absolute deviation of 1.71%, while the BD-PSNR is -0.004 dB with a mean absolute deviation of 0.001 dB. Besides this, the IL method results in ladders composed of 87.5% RQ points from the PF. This is a strong indication of the optimality of the predicted ladder. The histograms of BD-Rate and BD-PSNR for 7 QP samples per resolution are plotted in Fig. 11(a)-(b). As can be seen, the distributions are tightly clustered around the mean values, although for a small number of sequences the |BD-Rate| is noticeably larger.
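The densification at the core of IL, interpolating log-rate and PSNR over QP from a few encodes per resolution with monotone piece-wise cubic Hermite (PCHIP) interpolation, can be sketched as follows (SciPy-based illustration; the function and variable names are ours):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def densify_rq(qps_sub, rates_sub, psnr_sub, qps_all):
    """Interpolate RQ points for interim QPs from a subset of encodes.

    PCHIP preserves the monotonic shape of the RQ data, avoiding the
    overshoot that a plain cubic spline can introduce between knots.
    """
    qps_sub = np.asarray(qps_sub, dtype=float)
    log_r = PchipInterpolator(qps_sub, np.log2(rates_sub))(qps_all)
    psnr = PchipInterpolator(qps_sub, psnr_sub)(qps_all)
    return 2.0 ** log_r, psnr
```

The densified (rate, PSNR) points per resolution can then feed the same PF extraction as the RL method.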
2) FL Method
For the FL method, we first performed the initial encodes required to determine the parameters of Eq. (9) that lead to the QP corresponding to each rung bitrate. The six predicted cross-over QPs, {Q̂P^high_2160p, Q̂P^low_2160p, Q̂P^high_1080p, Q̂P^low_1080p, Q̂P^high_720p, Q̂P^low_720p}, are used for the initial encodes. With the RQ points at the Q̂P^low, Q̂P^high pair of a given resolution, its α, β parameters are defined; the RQ points resulting from the remaining cross-over encodes are utilised to determine the α, β parameters of the other resolutions, as well as the β_540p parameter (exploiting α_540p ≈ α_720p). In addition to the above six initial encodes, one more encode at the 2160p resolution is used. The QP value for this extra encode is selected based on the Q̂P^high_2160p value: if it is towards the lower end, a higher QP is selected, and vice versa. The additional encode at 2160p helps to improve the predictions towards the higher bitrates because, as shown in Fig. 9, the α_2160p and α_1080p values deviate. Also, after the bitrate ladder construction with the FL method, we performed a monotonicity and concavity check, as indicated in Step 5 of Algorithm 2. According to this, if non-monotonic points are detected, we sort them; and if a bitrate ladder point results in a concave RQ curve, we remove this point, as it is most likely suboptimal.

Inspecting the histograms of the FL BD-Rate and BD-PSNR in Fig. 11 (c)-(d), we observe that, compared to IL, these distributions have heavier tails, which is verified by the higher mean absolute deviation values. Table V reports the mean and mean absolute deviation of the BD metrics of the proposed FL method. The mean BD-Rate is 0.98% higher compared to IL, while the mean absolute deviation is increased by 0.5%. Despite these increased figures, the PF-hits percentage remains high, over 80%.

(a) BD-Rate for IL. (b) BD-PSNR for IL. (c) BD-Rate for FL. (d) BD-PSNR for FL. (e) BD-Rate for HL. (f) BD-PSNR for HL.
Fig.
11: BD metric histograms for the compared methods in Table V.

TABLE V: BD metrics for the predicted ladders with the proposed methods and percentage of points on the PF.
Method   | BD-Rate [mean, mad] | BD-PSNR [mean, mad]  | PF-hits
IL vs RL | 0.80%, 1.71%        | -0.004 dB, 0.001 dB  | 87.50%
FL vs RL | 1.78%, 2.27%        | -0.04 dB, 0.05 dB    | 80.48%
HL vs RL | 1.26%, 1.91%        | -0.02 dB, 0.04 dB    | 83.86%
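One possible reading of the Step-5 validation of Algorithm 2 is sketched below; the paper does not spell out the exact rule, so treating a rung that falls below the chord of its neighbours in the (log-rate, quality) plane as a "concave" point to be removed is our assumption:

```python
import numpy as np

def validate_ladder(log_rates, qualities):
    """Sort non-monotonic rungs, then drop points below their neighbours' chord."""
    order = np.argsort(log_rates)
    r = np.asarray(log_rates)[order]
    q = np.sort(np.asarray(qualities))   # enforce non-decreasing quality
    keep = [0]
    for i in range(1, len(r) - 1):
        # quality on the straight line joining the previous kept rung
        # and the next rung, evaluated at this rung's log-rate
        chord = q[keep[-1]] + (q[i + 1] - q[keep[-1]]) * \
                (r[i] - r[keep[-1]]) / (r[i + 1] - r[keep[-1]])
        if q[i] >= chord:                # on/above the chord: keep the rung
            keep.append(i)
    if len(r) > 1:
        keep.append(len(r) - 1)
    return r[keep], q[keep]
```

A rung below the chord would make the quality-vs-log-rate curve dip locally, which is the signature of a suboptimal RQ point.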
3) HL Method
We investigated the effectiveness of a hybrid approach that combines the IL and FL methods. In many cases, IL and FL construct almost identical ladders, leading to a very similar BD-Rate when compared to the RL. Therefore, the rule that we apply is that the IL method should be chosen only for those cases where it improves the FL BD-Rate by at least a threshold T. We explored the impact of this threshold over a set of candidate values by estimating the best BD-Rate that could be achieved if the selection of the method were performed with 100% accuracy. In Fig. 12, we illustrate the effect of this threshold on the mean BD-Rate and the maximum expected average number of encodes per sequence. As clearly illustrated, increasing the value of T results in an increase of the mean BD-Rate while decreasing the average required number of encodings. From the BD-Rate to average-number-of-encodes tradeoff, we selected the T value that offered the best balance. After defining the threshold, we proceeded to the method selection step. To this end, before predicting the cross-over QPs, we implemented a binary classifier. If the IL method is selected for a sequence, then we proceed as described above. If the FL method is selected, we proceed with the cross-over QP prediction to apply the FL method. The classifier was built using Ensemble Trees utilizing the set of spatio-temporal
features F1-F20, and a ten-fold cross-validation was performed to avoid overfitting. The resulting classification accuracy was 68%. As a result, for 70% of the sequences the FL method was selected, while for 30% of the sequences the IL method was predicted to be more accurate.

The resulting BD statistics of this hybrid method, after the method selection process, are illustrated in Fig. 11(e)-(f). Although the classification accuracy was not very high, it is evident that these results show significant improvements compared to the FL results, with statistics closer to those of IL in (a) and (b). The mean BD-Rate value is 0.52% lower than that of FL and 0.46% higher than that of IL, while the mean absolute deviation is almost equal to that of IL. What is also significant is that the PF-hits value for HL drops only slightly (by 3.64%), indicating the optimality of the predicted ladders.

Fig. 12: Effect of the HL threshold on the expected accuracy of predictions and the number of encodes (mean BD-Rate (%) vs maximum average number of encodes per sequence, for increasing T values).
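The method-selection step can be sketched as follows (scikit-learn based; RandomForestClassifier stands in for the paper's "Ensemble Trees", and the threshold value used here is a placeholder, since the chosen T is not legible in this copy):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

def train_method_selector(features, bdrate_fl, bdrate_il, T=0.5):
    """Binary classifier choosing IL over FL per sequence.

    Label 1 (pick IL) only when IL improves on FL's BD-Rate by at least
    the threshold T; trained on the spatio-temporal features F1-F20.
    T=0.5 is an assumed placeholder value.
    """
    labels = (np.asarray(bdrate_fl) - np.asarray(bdrate_il) >= T).astype(int)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # ten-fold cross-validated accuracy, as in the paper's protocol
    acc = cross_val_score(clf, features, labels,
                          cv=KFold(10, shuffle=True, random_state=0)).mean()
    clf.fit(features, labels)
    return clf, acc
```

At inference time, the classifier runs on the extracted features before any cross-over QP prediction, so FL's extra machinery is invoked only when FL is selected.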
F. Relative Complexity
The proposed FL method and its hybrid combination with IL, HL, both offer a close-to-optimal prediction of the bitrate ladder, while achieving a significant reduction in the number of encodings required to build the ladder for any new video sequence. In Table VI, the numbers of encodings required for the compared methods are reported. As explained earlier, the exhaustive-search method produces the optimal PF, but this comes at the cost of the |S| × |P| encodings needed to derive it. In the considered test case, the number of encodings needed is 124.

For the IL method, based on the results of Fig. 10, seven encodes per resolution were selected. This means that a total of |S| × |P_sub| = 28 encodes is required per sequence in order to find the rate points where the resolution switches and to build the PF. In this case, an additional number of encodes E_req, with E_req ∈ {0, 1, . . . , |L|}, is required to build the bitrate ladder. This varies for each sequence, as it depends on the length of its ladder. The average recorded number of encodes for the considered dataset was 35.21. This method brings a significant reduction of 71.60% compared to the RL method in the presented test case.

The FL method initially requires only 2 × |S| − 1 = 7 encodes to define the rate points where the resolution switches and to compute the parameters of Eq. (9) for each sequence. Then, similarly to IL, E_req encodes are required to hit the target bitrates at each ladder rung. In the presented test case, 13.57 encodings were required on average for FL. Clearly, FL outperforms both the RL and IL approaches in terms of the number of encodes, requiring 89.06% fewer encodings compared to the RL method and about 61.46% fewer encodes compared to IL.

The hybrid method HL offers an important improvement in terms of the required number of encodings per sequence, while producing a very-close-to-optimal ladder. This depends, of course, on the number of sequences and how often IL or FL is invoked.
For the presented test case, where the FL method was selected for the ladder construction for 70% of the sequences, an average of 20.05 encodes was performed, resulting in an 83.83% reduction compared to RL and a 43.06% reduction compared to IL. Compared to pure FL, HL performed on average 6.48 more encodes.

Although FL achieves an important reduction in the number of encodings required, it also introduces an overhead associated with the computation of the extracted features and with the cross-over QP prediction. The cost of the construction of the bitrate ladder is thus almost identical for the IL and FL methods. The ratio of the average feature extraction time for a sequence at 2160p resolution to the average 2160p encoding time for a sequence at one QP is 0.18. The cross-over QP prediction time is negligible compared to the encoding time. Considering this, FL's complexity is still significantly lower than that of the IL method.

G. Discussion
From the results discussed above, the FL and HL methods offer a significant reduction in the required number of encodes compared to RL or IL for only a small BD-Rate cost. The FL and HL solutions are close to optimal, with over 80% of the produced bitrate ladder points belonging to the PF. Generally, IL, FL, and HL offer very similar solutions and produce bitrate ladders with a high percentage of points on the PF. Suboptimal points, of lower or higher target bitrate, could potentially be improved with an additional round of encodes at an incremented QP, in an attempt to hit the bitrate closer to the target.

Regarding the distributions of the BD statistics, we observed two kinds of outliers. On the one hand, negative BD-Rate values are observed for all three methods, against the expectation of only positive BD-Rate values. The negative BD-Rates are attributed to the fact that, in many cases, the bitrate ladder might be composed of points on the PF that are shifted towards higher bitrates and PSNR. Thus, in curved PSNR-log(R) ladders, those segments create a raised PF. An example of this is provided in Fig. 13 (e). On the other hand, we noticed that, for all tested methods, IL, FL, and HL, there are outliers of |BD-Rate| >
5%, which is generally considered an important deviation from the reference. Examples of those are given in Fig. 13 (f)-(h). These statistics are caused either by suboptimal points or by not matching all the ladder rungs. For example, in Fig. 13 (g), most of the points are not on the PF for any method (PF-hits: 2/7 for IL and 3/7 for FL). This specific sequence produces RQs with an unusually short PSNR range, 39-41.5 dB, within the wide bitrate range of [500 kbps, 25 Mbps]. This means that, although the BD-Rate might be considerable, the

(Footnote: The feature extraction is implemented in Matlab, while for the encodings the HM reference software was used.)
TABLE VI: Comparison of the number of encodings required per method for each sequence.
Method (|S| = 4) | Number of encodings | Average
RL | |S| × |P| | 4 × 31 = 124
IL | |S| × |P_sub| + E_req, with E_req ∈ {0, 1, . . . , |L|} | 28 + (from 0 up to |L|) = 35.21
FL | (2 × |S| − 1) + E_req | 7 + (from 0 up to |L|) = 13.57
HL | |S| × |P_sub| ∥ (2 × |S| − 1) + E_req | (28 ∥ 7) + (from 0 up to |L|) = 20.05

impact on the resulting quality is insignificant. Other outliers, such as Fig. 13 (h), observed for the FL and HL methods, are attributed to incorrect predictions of the cross-over QPs, which subsequently lead to an incorrect bitrate for the resolution switching. Thus, the predicted ladder is suboptimal. The other examples of RQs, in Fig. 13 (a)-(d), indicate cases of successful prediction by both the IL and FL methods, with a high PF-hits percentage (over 75%). They also indicate cases for HL where the better method was correctly selected through the classification step, and cases where it failed. In most cases, though, the BD-Rate difference is small.
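As a quick cross-check of the savings reported in this subsection, using the Table VI averages:

```python
# Encode-count savings from Table VI (|S| = 4 resolutions, |P| = 31 QPs
# for RL, |P_sub| = 7 QPs per resolution for IL; the IL/FL/HL averages
# are the dataset averages reported in the table).
rl = 4 * 31                        # exhaustive reference ladder: 124 encodes
il_avg, fl_avg, hl_avg = 35.21, 13.57, 20.05

saving = lambda a, b: round((1 - a / b) * 100, 2)
print(saving(fl_avg, rl))          # FL vs RL -> 89.06
print(saving(fl_avg, il_avg))      # FL vs IL -> 61.46
print(saving(hl_avg, rl))          # HL vs RL -> 83.83
print(saving(hl_avg, il_avg))      # HL vs IL -> 43.06
```

These reproduce the 89.06%, 61.46%, 83.83% and 43.06% figures quoted in the text.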
(a) Air-acrobatics: IL -0.25636%, FL -0.25636%, HL -0.25636%.
(b) BoxingPractice: IL -0.61134%, FL -0.54404%, HL -0.61134%.
(c) Coastguard: IL -0.52714%, FL 0.026409%, HL 0.026409%.
(d) Crosswalk: IL -0.4845%, FL 0.58864%, HL -0.4845%.
(e) Treeshade: IL -0.098218%, FL 1.6087%, HL 1.6087%.
(f) Raptors: IL 3.7272%, FL 5.4534%, HL 5.4534%.
(g) Skateboarding-scene8: IL 8.2619%, FL 5.7979%, HL 5.7979%.
(h) WindAndNature-scene2: IL 0.34706%, FL 10.2106%, HL 10.2106%.
Fig. 13: Examples of predicted ladders for different sequences (PSNR (dB) vs log2(Rate)), with the BD-Rate reported per tested method.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a reduced-complexity, content-customised solution that can predict the bitrate ladder for adaptive streaming, based on spatio-temporal features extracted from the uncompressed video at its native resolution. Our method predicts the intersection points of the RQ curves across spatial resolutions with a small number of video encodings, and then parameterises a set of equations that predict the PF RQ points at the target bitrate ladder rungs. This enables construction of the bitrate ladder via constrained sampling of the quality and bitrate sets of values. The proposed method was compared against two benchmarks: an exhaustive-search method (which produces the most accurate PF) and a more conventional interpolation-based method. When compared to the exhaustive search, the results show a mean BD-Rate loss of only 1.78% and a mean BD-PSNR of -0.04 dB, but with an average reduction of 89.06% in the number of encodings needed. Although the BD statistics of the interpolation-based method are better than those of the feature-based method, the latter provides a significant reduction of encodes, 61.46% on average, to build the bitrate ladder. A hybrid method, combining the feature-based and the interpolation-based approaches, was also examined, resulting in a 1.26% mean BD-Rate for only an additional 32.32% of encodings on average. Both FL and HL result in bitrate ladders that are composed, on average, of over 80% Pareto-optimal points. Adopting FL or HL could therefore result in significant savings in processing time and energy consumption.

Future work will focus on testing the effectiveness of the method across codecs and by employing different quality metrics. Firstly, as explained, the proposed method was developed for and tested on an HEVC codec. However, if the regression models were trained with data derived from a different codec, then we would expect the performance gains to be comparable.
Additional gains may, however, be possible by exploiting the correlation between content features and rate-distortion characteristics across different codecs. This could lead to even higher efficiency in estimating the bitrate ladder and in the overall content delivery. Furthermore, as video providers also use other quality metrics, e.g. VMAF or SSIM, to construct their bitrate ladders, we will test the proposed method on an extended set of quality metrics.

ACKNOWLEDGEMENTS
The authors would like to thank Dr Mariana Afonso and Kyle Swanson from the Netflix Video Coding Group for all the insightful discussions that helped improve this work.
REFERENCES

[2] capacitymedia.com/articles/3825382/outages-traffic-peaks-and-video-quality-the-internet-during-lockdown.
[3] I. Sodagar, “The MPEG-DASH standard for multimedia streaming over the internet,” IEEE MultiMedia, vol. 18, no. 4, pp. 62–67, 2011.
[4] A. Aaron, Z. Li, M. Manohara, J. De Cock, and D. Ronca, “Per-Title Encode Optimization,” https://netflixtechblog.com/per-title-encode-optimization-7e99442b62a2.
[5] S. Lederer, C. Müller, and C. Timmerer, “Dynamic Adaptive Streaming over HTTP Dataset,” in Proceedings of the 3rd Multimedia Systems Conference (ACM MMSys), 2012, pp. 89–94.
[6] J. De Cock, Z. Li, M. Manohara, and A. Aaron, “Complexity-based consistent-quality encoding in the cloud,” in IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 1484–1488.
[7] M. Afonso, A. Moorthy, L. Guo, L. Zhu, and A. Aaron, “Improving our video encodes for legacy devices,” https://netflixtechblog.com/improving-our-video-encodes-for-legacy-devices-2b6b56eec5c9.
[8] A. V. Katsenou, J. Sole, and D. R. Bull, “Content-gnostic Bitrate Ladder Prediction for Adaptive Video Streaming,” in Picture Coding Symposium, November 2019.
[9] Apple, “HLS Authoring Specification for Apple Devices,” https://developer.apple.com/documentation/http_live_streaming/hls_authoring_specification_for_apple_devices.
[10] I. Katsavounidis, “Dynamic optimizer - a perceptual video encoding optimization framework,” https://medium.com/netflix-techblog/dynamic-optimizer-a-perceptual-video-encoding-optimization-framework-e19f1e3a277f.
[11] C. Chen, Y. Lin, S. Benting, and A. Kokaram, “Optimized Transcoding for Large Scale Adaptive Streaming Using Playback Statistics,” Oct 2018, pp. 3269–3273.
[12] A. Zabrovskiy, C. Feldmann, and C. Timmerer, “A practical evaluation of video codecs for large-scale HTTP adaptive streaming services,” Oct 2018, pp. 998–1002.
[13] L. Toni, R. Aparicio-Pardo, K. Pires, G. Simon, A. Blanc, and P. Frossard, “Optimal Selection of Adaptive Streaming Representations,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 11, no. 2s, pp. 43:1–43:26, Feb. 2015.
[14] Bitmovin, “White Paper: Per Title Encoding,” https://bitmovin.com/whitepapers/Bitmovin-Per-Title.pdf, 2018.
[15] MUX, “Instant Per-Title Encoding,” https://mux.
[16] capellasystems.net/capella_wp/wp-content/uploads/2018/01/CambriaFTC_SABL.pdf.
[17] J. Ozer, “Per-Title Encoding Comparison: Crunch Video Optimization Technology compared to: Brightcove CAE, Capped CRF, Capella Systems SABL, JWPlayer,” https://streaminglearningcenter.com/wp-content/uploads/2018/07/Report_final.pdf.
[18] Y. A. Reznik, K. O. Lillevold, A. Jagannath, J. Greer, and J. Corley, “Optimal design of encoding profiles for ABR streaming,” in Proceedings of the 23rd Packet Video Workshop (ACM PV ’18), 2018, pp. 43–47.
[19] K. Goswami, B. Hariharan, P. Ramachandran, A. Giladi, D. Grois, K. Sampath, A. Matheswaran, A. K. Mishra, and K. Pikus, “Adaptive Multi-Resolution Encoding for ABR Streaming,” 2018, pp. 1008–1012.
[20] V. P. K. Malladi, C. Timmerer, and H. Hellwagner, “MIPSO: Multi-Period Per-Scene Optimization For HTTP Adaptive Streaming,” 2020, pp. 1–6.
[21] Y. A. Reznik, X. Li, K. O. Lillevold, A. Jagannath, and J. Greer, “Optimal multi-codec adaptive bitrate streaming,” 2019, pp. 348–353.
[22] E. Bourtsoulatze, A. Chadha, I. Fadeev, V. Giotsas, and Y. Andreopoulos, “Deep video precoding,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[23] A. Bentaleb, B. Taani, A. C. Begen, C. Timmerer, and R. Zimmermann, “A Survey on Bitrate Adaptation Schemes for Streaming Media Over HTTP,” IEEE Communications Surveys & Tutorials, vol. 21, no. 1, pp. 562–585, 2019.
[24] J. Ozer, “A Cloud Encoding Pricing Comparison,” http://docs.hybrik.com/repo/cloud_pricing_comparison.pdf.
[25] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Trans. on Circuits and Systems for Video Technology.
[26] cdvl.org/documents/NETFLIX_Chimera_4096x2160_Download_Instructions.pdf, 2015.
[27] T. U. Ultra Video Group, http://ultravideo.cs.tut.
[28] harmonicinc.com/4k-demo-footage-download/, [Online; accessed 2017-05-01].
[29] L. Song, X. Tang, W. Zhang, X. Yang, and P. Xia, “The SJTU 4K video sequence dataset,” in Fifth International Workshop on Quality of Multimedia Experience (QoMEX).
[30] youtube.com/playlist?list=PLwIpNYl7S0G_C5I76Tf46n6ImKssMn2kT/, [Online; accessed 2018-07-02].
[31] M. Afonso, F. Zhang, and D. R. Bull, “Spatial resolution adaptation framework for video compression,” in SPIE Optical Engineering + Applications, Proceedings Volume 10752, Applications of Digital Image Processing XLI, 2018.
[32] S. Winkler, “Analysis of public image and video databases for quality assessment,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 616–625, 2012.
[33] C. E. Duchon, “Lanczos filtering in one and two dimensions,” Journal of Applied Meteorology.
[34] ffmpeg.org.
[35] J. Sole, L. Guo, A. Norkin, M. Afonso, K. Swanson, and A. Aaron, “Performance comparison of video coding standards: an adaptive streaming perspective,” https://medium.com/netflix-techblog/performance-comparison-of-video-coding-standards-an-adaptive-streaming-perspective-d45d0183ca95.
[36] P. Salembier and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, John Wiley and Sons, Inc., New York, NY, USA, 2002.
[37] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. on Systems, Man, and Cybernetics, vol. SMC-3, no. 6, pp. 610–621, Nov 1973.
[38] R. M. Haralick, “Statistical and structural approaches to texture,” Proceedings of the IEEE, vol. 67, no. 5, pp. 786–804, May 1979.
[39] J. Zujovic, T. N. Pappas, and D. L. Neuhoff, “Structural Texture Similarity Metrics for Image Analysis and Retrieval,” IEEE Trans. on Image Processing, vol. 22, no. 7, pp. 2545–2558, July 2013.
[40] M. M. Subedar and L. J. Karam, “A no reference texture granularity index and application to visual media compression,” Sept 2015, pp. 760–764.
[41] M. Bosch, F. Zhu, and E. J. Delp, “Segmentation-Based Video Compression Using Texture and Motion Models,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, pp. 1366–1377, Nov 2011.
[42] F. Zhang and D. R. Bull, “A Parametric Framework for Video Compression Using Region-Based Texture Models,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 7, Nov 2011.
[43] C. H. Peh and L. F. Cheong, “Synergizing spatial and temporal texture,” IEEE Trans. on Image Processing, vol. 11, no. 10, pp. 1179–1191, Oct 2002.
[44] A. V. Katsenou, T. Ntasios, M. Afonso, D. Agrafiotis, and D. R. Bull, “Understanding Video Texture - a Basis for Video Compression,” in IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), Oct 2017.
[45] A. Katsenou, M. Afonso, D. Agrafiotis, and D. R. Bull, “Predicting Video Rate-Distortion Curves using Textural Features,” in Picture Coding Symposium (PCS), Dec 2016.
[46] A. V. Katsenou, D. Ma, and D. R. Bull, “Perceptually-Aligned Frame Rate Selection Using Spatio-Temporal Features,” in Picture Coding Symposium, June 2018, pp. 288–292.
[47] V. Laparra, J. Ballé, A. Berardino, and E. Simoncelli, “Perceptual image quality assessment using a normalized Laplacian pyramid,” in Electronic Imaging 2016, SPIE, 2016, pp. 1–6.
[48] J. P. Lewis, “Fast template matching,” in Vision Interface, 1995, vol. 95, pp. 15–19.
[49] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” in Scandinavian Conference on Image Analysis, Springer, 2003, pp. 363–370.
[50] K. Sharman and K. Sühring, “Common Test Conditions for HM video coding experiments,” Tech. Rep., Document JCTVC-AC1100 of JCT-VC, Oct. 2017.
[51] M. Kuhn and K. Johnson, Applied Predictive Modeling, First Edition, Springer, New York, USA, 2013.
[52] F. N. Fritsch and R. E. Carlson, “Monotone Piecewise Cubic Interpolation,”