Rate distortion optimization over large scale video corpus with machine learning
Sam John, Akshay Gadde and Balu Adsumilli
Google, Mountain View, CA
{samjohn, agadde, badsumilli}@google.com

ABSTRACT
We present an efficient codec-agnostic method for bitrate allocation over a large scale video corpus with the goal of minimizing the average bitrate subject to constraints on average and minimum quality. Our method clusters the videos in the corpus such that videos within one cluster have similar rate-distortion (R-D) characteristics. We train a support vector machine classifier to predict the R-D cluster of a video using simple video complexity features that are computationally easy to obtain. The model allows us to classify a large sample of the corpus in order to estimate the distribution of the number of videos in each of the clusters. We use this distribution to find the optimal encoder operating point for each R-D cluster. Experiments with the AV1 encoder show that our method can achieve the same average quality over the corpus with a lower average bitrate.
Index Terms: rate distortion optimization, clustering, machine learning, adaptive streaming, YouTube
1. INTRODUCTION
For massive video streaming platforms such as YouTube, it is desirable to deliver the best video quality at the minimum bitrate, since bitrate directly contributes to the streaming cost for these platforms as well as to the data cost and quality of experience of the users. These platforms host various types of video content ranging from song lyrics and simple animations to sports and gaming [1]. For a given video encoder, we can compute the rate-distortion (R-D) curve of a video by encoding the video at different bitrates and plotting the distortion achieved for each bitrate. The R-D characteristics of each content type differ significantly from those of the others. Even within one video, R-D characteristics can vary significantly over time. A video can be divided into chunks and a separate R-D curve can be obtained for each chunk. Given the R-D curves for each chunk in a video corpus, we can define a problem of finding the optimal bitrate for encoding each chunk such that an aggregate measure of distortion is minimized subject to a constraint on average bitrate (or average bitrate is minimized subject to a constraint on quality). Solving such an optimization problem directly at a scale such as YouTube's, which contains millions of hours of video [2], is infeasible since it would involve encoding each chunk at multiple bitrates to get the R-D curve and then solving a non-linear optimization problem in millions of variables. Moreover, it is necessary to encode each video chunk into multiple representations at different bitrates and resolutions for adaptive streaming over networks of varying bandwidth [3]. This introduces additional complexity to the problem since we need to find the optimal bitrates for all representations.

The problem of R-D optimal bitrate allocation over coding units given their R-D characteristics has been studied before [4].
The authors show that the optimal bitrate allocation is such that the marginal gain in quality achieved by spending one extra unit of bitrate is equal for all coding units. However, they do not consider bitrate allocation over different representations of a coding unit. Toni et al. [5] consider the problem of selecting the optimal representation of a video for adaptive streaming, taking into account network dynamics. The problem of finding optimal bitrates for different representations of a video chunk subject to a constraint on average delivered quality, considering the distribution of user bandwidths and viewports, is studied by Chen et al. [6]. All of these methods assume that the R-D curves of all video chunks in the corpus are known and that the number of encoding bitrates to optimize is small. These assumptions do not hold for R-D optimization over a large scale video corpus.

We propose a new efficient codec-agnostic method for allocating bitrates for all video chunks in a large scale corpus with the goal of minimizing the average bitrate while maintaining aggregate quality. Our method does not require encoding all chunks in the corpus at multiple bitrates to get their R-D curves. This reduces the computational complexity significantly, especially at scale. Instead, we use simple video complexity features obtained from the encoder pass-log to predict the R-D curve of a chunk using machine learning. We cluster the video chunks in the corpus into multiple categories based on their approximate R-D curves. We demonstrate that the number of clusters required to model the variation in the R-D characteristics across the corpus is much smaller than the size of the corpus. Because all video chunks in a category have similar R-D characteristics, we can optimize for bitrates over categories instead of individual chunks. This requires solving a relatively small non-linear optimization problem with dimensionality equal to the number of clusters. The proposed rate allocation method is outlined in Figure 1.

Fig. 1. Outline of the proposed algorithm. (Stages: feature extraction, R-D curve computation, R-D clustering, ML cluster prediction, R-D optimization over clusters. Data: training video chunks, test video chunk, optimal rate for test chunk.) White boxes show the input and output data. Solid arrows denote the training flow and dotted arrows indicate how models are used.

The rest of the paper is organized as follows. In Section 2, we describe the proposed R-D curve prediction method based on R-D curve clustering and video classification using video complexity features. Section 3 explains our R-D optimization formulation to compute the optimal bitrates for all video categories. Experimental results in Section 4 show the efficacy of our proposed approach. We conclude the paper in Section 5.
2. R-D CURVE MODELING
Our method attempts to categorize the video chunks in the corpus such that all video chunks in one category have similar R-D curves. We do this using a two-step approach (see Figure 1). In the first step, we cluster the R-D curves of training video chunks that are randomly sampled from the corpus. In the second step, we build a classification model to predict which R-D cluster a chunk belongs to using simple video complexity features obtained by a fast one-pass analysis of the chunk with an encoder (i.e., the encoder pass-log). R-D optimization is done over these clusters using their centroid R-D curves.
Let $n$ be the number of training video chunks selected from the corpus. We obtain $s$ points on the R-D curve of each training video chunk by encoding the chunk at fixed encoder operating points, $[q_1, \ldots, q_s]$. Each operating point $q_j$ corresponds to a quantization parameter (QP) or a constant rate factor (CRF) used by the encoder. Encoding a chunk $i$ at operating point $q_j$ results in a representation with bitrate $r_j$ and distortion $d_j$. The R-D curve samples for chunk $i$ can then be represented by a vector $x_i \in \mathbb{R}^{2s}$ of rate-distortion values, $[r_1, \ldots, r_s, d_1, \ldots, d_s]$. We normalize each component of $x$ as $x_j^{\text{norm}} = (x_j - m_j)/\sigma_j$, where $m_j$ and $\sigma_j$ are the sample mean and sample standard deviation of $x_j$, respectively. We cluster the vectors $x_1^{\text{norm}}, \ldots, x_n^{\text{norm}}$ into $k$ ($\ll n$) clusters $C = \{C_1, \ldots, C_k\}$ using $k$-means. The $L_2$ distance between normalized R-D points is used to define the cost function for clustering. It is reasonable to normalize and compute the distances for each component of $x$ across different video chunks since each component corresponds to a fixed encoder operating point. We use the centroid $\mu_l$ of cluster $C_l$ to get curves $\rho_l(q)$ and $\delta_l(q)$ for mapping an operating point $q$ to bitrate and distortion values, respectively. These centroid curves are expected to be a good approximation of the bitrate $r_i(q)$ and distortion $d_i(q)$ for any video chunk $i \in C_l$.

Fig. 2. Plot of mean relative error between the training R-D points and corresponding cluster centroids vs. number of clusters for different numbers of training R-D points $n$.

The number of clusters, $k$, needed to capture all the variation in the R-D characteristics in the corpus is determined empirically.
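The normalization and clustering steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the toy R-D vectors are made-up values, and the deterministic farthest-point initialization is an assumption of this sketch (the text does not say how $k$-means is initialized).

```python
from statistics import mean, stdev


def normalize(vectors):
    """Z-score every component across chunks: (x_j - m_j) / sigma_j."""
    columns = list(zip(*vectors))
    m = [mean(c) for c in columns]
    sigma = [stdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return [[(x - mj) / sj for x, mj, sj in zip(v, m, sigma)] for v in vectors]


def sq_dist(a, b):
    """Squared L2 distance, the usual k-means cost."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm; returns (centroids, labels)."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda l: sq_dist(p, centroids[l]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:  # keep the previous centroid if a cluster empties
                centroids[l] = [mean(c) for c in zip(*members)]
    return centroids, labels


# Toy R-D vectors [r_1, r_2, d_1, d_2] for s = 2 operating points (made-up values).
chunks = [
    [200, 100, 2.0, 4.0], [220, 110, 2.2, 4.3], [210, 105, 1.9, 4.1],
    [3000, 1500, 15.0, 30.0], [3200, 1600, 16.0, 31.0], [3100, 1550, 15.5, 29.0],
]
x_norm = normalize(chunks)
centroids, labels = kmeans(x_norm, k=2)
```

Each centroid has $2s$ components; its first $s$ entries play the role of $\rho_l(q_j)$ and its last $s$ entries the role of $\delta_l(q_j)$, in normalized units.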
Figure 2 shows the mean relative error between the R-D points for a corpus sample of $n$ chunks and their corresponding cluster centroids (i.e., $\frac{\sum_i \| x_i^{\text{norm}} - \mu_l \|}{\sum_i \| x_i^{\text{norm}} \|}$) as a function of $k$ for different values of $n$. The data points are obtained by encoding video chunks at different CRF values with the AV1 encoder [7]. It shows that approximating the corpus R-D points by the corresponding cluster centroids results in a small relative error that decreases slowly once the number of clusters exceeds […]. Moreover, the number of clusters needed to achieve this small error does not increase with the sample size $n$. Figure 3 shows the plots of bitrate vs. CRF, distortion vs. CRF and distortion vs. bitrate for different cluster centroids for $k = 10$. These plots show that the marginal reduction in distortion achieved by allocating a higher bitrate varies significantly across clusters. Therefore, the optimal operating point for achieving the best rate-distortion tradeoff will be different for each cluster.

Fig. 3. Plots of PSNR (dB) vs. bitrate (kbps), bitrate (kbps) vs. CRF and PSNR (dB) vs. CRF for different cluster centroids.

It is not feasible to sample the R-D curve for every video chunk in the corpus in order to determine its R-D cluster $C_l$, since this involves encoding each chunk multiple times at different operating points. In order to circumvent this problem, we train a support vector machine (SVM) classifier to predict the R-D cluster of a chunk using video complexity features that are computationally much cheaper to obtain.

Specifically, we use […] of the pass-log features given by the AV1 encoder. The AV1 encoder generates this pass-log by doing an analysis pass over a video without fully encoding it. The features include statistics related to prediction modes (inter or intra), prediction errors, reference frames used for inter prediction and motion vectors for each frame in the video [8]. These features are a good indicator of the spatial and temporal complexity of a video. Therefore, they are useful for predicting the R-D characteristics of the video (see Figure 4). In order to train the classifier, the ground truth class labels for video chunks in the training set are obtained by clustering their R-D curves. Therefore, the centroid R-D curve of the predicted cluster of a video is expected to be a good approximation of the R-D curve of the video.

Fig. 4. Projection of the training features using the first two principal components. The color denotes the R-D class of a point. R-D classes exhibit some clustering in this space.

The idea of clustering the R-D curves and building a model for R-D cluster prediction was proposed independently by Ling et al. [9]. However, our method and the method in [9] have some key differences. Firstly, the method in [9] uses the Bjontegaard Delta (BD) rate between the PSNR vs. bitrate curves as the distance metric in clustering. The problem with using this metric is that two R-D curves with substantially different slopes may have a very small BD rate distance. However, taking these differences in slopes into account is critical for efficient rate utilization. Ling et al. also use different features for predicting the R-D curve cluster. We find that using the encoder pass-log features is computationally efficient and allows accurate R-D cluster prediction.
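As a concrete reading of the mean relative error plotted in Figure 2, the following sketch computes the ratio of the summed distances to the assigned centroids over the summed norms of the points. It assumes plain (unsquared) Euclidean norms, which is one interpretation of the formula; the function names and the toy data are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))


def mean_relative_error(points, centroids, labels):
    """sum_i ||x_i - mu_l(i)|| / sum_i ||x_i||, with unsquared L2 norms (assumed)."""
    num = sum(l2_norm([a - b for a, b in zip(p, centroids[l])])
              for p, l in zip(points, labels))
    den = sum(l2_norm(p) for p in points)
    return num / den


# Four toy normalized points, two centroids, and the cluster label of each point.
points = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
centroids = [[2.0, 0.0], [0.0, 3.0]]
labels = [0, 0, 1, 1]
err = mean_relative_error(points, centroids, labels)  # (1+1+1+1) / (1+3+2+4)
```

Evaluating this quantity for increasing $k$ on a fixed sample is one way to reproduce the kind of curve shown in Figure 2.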
3. R-D OPTIMIZATION OVER VIDEO CORPUS
Our goal is to find the optimal encoder operating points for all R-D clusters so that the average corpus bitrate is minimized while the average and worst-case distortions remain below certain thresholds. In order to compute these averages, we need to estimate the distribution of the number of video chunks in different clusters. We do this by classifying a large sample of the video corpus into different clusters using the SVM model proposed in Section 2.2. We can use this distribution to compute the average bitrate and distortion for the corpus for a given set of operating points for the clusters based on their centroid R-D curves.

Let $q = [q_1, \ldots, q_k]$ be the encoder operating points for clusters $C_1, \ldots, C_k$, respectively. Let $\rho_l(q_l)$ and $\delta_l(q_l)$ denote the bitrate and distortion obtained by encoding a video chunk in cluster $C_l$ at operating point $q_l$, as given by the centroid R-D curves for $C_l$. The optimal value of $q$ is defined as the solution to the following problem:

$$
\begin{aligned}
\underset{q}{\text{minimize}} \quad & \sum_{l=1}^{k} w_l \, \rho_l(q_l) \\
\text{subject to} \quad & \sum_{l=1}^{k} w_l \, \delta_l(q_l) \le D_{\text{avg}}, \\
& \max_{l \in \{1, \ldots, k\}} \delta_l(q_l) \le D_{\max},
\end{aligned}
\tag{1}
$$

where $w_l$ is the fraction of video chunks in cluster $C_l$. Note that $\sum_l w_l = 1$. The solution to the above problem, $q^\star$, will minimize the average bitrate while maintaining the given constraints on distortion. Any video chunk in cluster $C_l$ is encoded using the optimal operating point $q_l^\star$ for that cluster.
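To make the structure of problem (1) concrete, the sketch below solves it by exhaustive search when each cluster's centroid curves are sampled on a small discrete grid of operating points. The text does not state which solver is used, so exhaustive search is only a stand-in that is practical for small $k$ and coarse grids; the function name, the table layout `rho[l][j]` and `delta[l][j]`, and all numbers are hypothetical.

```python
from itertools import product


def optimize_operating_points(weights, rho, delta, d_avg, d_max):
    """Exhaustive search for problem (1) on a discrete grid.

    rho[l][j], delta[l][j]: centroid bitrate and distortion of cluster l
    at grid index j. Returns (best_indices, best_avg_bitrate), or
    (None, inf) when no combination satisfies the constraints.
    """
    k = len(weights)
    # The max-distortion constraint applies to each cluster separately,
    # so infeasible grid points can be pruned before the joint search.
    feasible = [[j for j, d in enumerate(delta[l]) if d <= d_max]
                for l in range(k)]
    best, best_rate = None, float("inf")
    for choice in product(*feasible):
        avg_d = sum(weights[l] * delta[l][j] for l, j in enumerate(choice))
        if avg_d > d_avg:
            continue  # violates the average-distortion constraint
        avg_r = sum(weights[l] * rho[l][j] for l, j in enumerate(choice))
        if avg_r < best_rate:
            best, best_rate = choice, avg_r
    return best, best_rate


# Two toy clusters, three operating points each (made-up values).
weights = [0.5, 0.5]
rho = [[1000, 500, 250], [100, 50, 25]]        # kbps
delta = [[1.0, 1.5, 2.0], [2.0, 6.0, 12.0]]    # distortion, lower is better
# Constraints taken from using grid index 1 for both clusters:
# average distortion 3.75, maximum distortion 6.0, average bitrate 275 kbps.
best, best_rate = optimize_operating_points(weights, rho, delta, 3.75, 6.0)
```

In this toy case the search moves bits from the cluster whose distortion is nearly flat to the cluster whose distortion falls steeply with bitrate, which is the behavior the equal-slope argument of [4] would suggest.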
4. EXPERIMENTS
In order to evaluate the performance of the proposed method, we use a set of $n = 14000$ videos at […]p resolution randomly sampled from the YouTube corpus. We sample a 5 second long chunk from each video. Using short chunks ensures that the R-D characteristics do not change significantly within one chunk. We use the AV1 encoder developed by AOM [7] to generate the R-D points for each chunk. This is done by encoding each chunk in the constant quality (CQ) mode of the AV1 encoder at $s = 13$ CRF values. We cluster the vectors of R-D points into […] clusters as explained in Section 2.1.

We train an SVM classifier for predicting the R-D cluster of a video using the features given by the first pass log of the AV1 encoder [8]. […] of the data is used for training and the remaining […] is used for testing. We use the radial basis function (RBF) kernel in the SVM classifier. Optimal hyperparameters of the SVM (namely, the regularization parameter and the $\gamma$ used in the RBF kernel [10]) are computed with grid search using […]-fold cross-validation. The classifier gives an accuracy of […] on the test set. The distribution of the number of videos in different clusters (i.e., the weights $w_l$) is estimated by classifying a large sample of videos using the SVM model. It is shown in Figure 5.

Fig. 5. Distribution of number of videos in different clusters (percentage of videos in each cluster $l$).

We use the weights $w_l$ and the R-D points for cluster centroids to solve the optimization problem in Eq. (1). In order to get a baseline, we use the same CRF value for all R-D clusters. For each baseline CRF $q$, we compute the average, $\sum_{l=1}^{k} w_l \delta_l(q)$, and maximum, $\max_{l \in \{1, \ldots, k\}} \delta_l(q)$, distortions. We then use these values to set the constraints $D_{\text{avg}}$ and $D_{\max}$ in Eq. (1) and compute the optimal CRFs, $[q_l^\star]$, for all clusters.
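The weight estimation and the baseline constraints described above amount to a few lines of bookkeeping. The sketch below is illustrative only: `cluster_weights`, `baseline_constraints`, the label list, and the distortion table are hypothetical names and values, and the predicted labels would in practice come from the trained classifier.

```python
from collections import Counter


def cluster_weights(predicted_labels, k):
    """Fraction of classified chunks that fall in each of the k clusters."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return [counts[l] / n for l in range(k)]


def baseline_constraints(weights, delta, j):
    """D_avg and D_max when every cluster uses the same grid index j.

    delta[l][j] is the centroid distortion of cluster l at grid index j.
    """
    d_avg = sum(w * d[j] for w, d in zip(weights, delta))
    d_max = max(d[j] for d in delta)
    return d_avg, d_max


# Eight toy predictions over k = 3 clusters (made-up values).
predicted = [0, 0, 1, 2, 2, 2, 2, 1]
w = cluster_weights(predicted, k=3)
# Centroid distortions at two operating points per cluster (made-up values).
delta_table = [[1.0, 2.0], [3.0, 5.0], [2.0, 4.0]]
d_avg, d_max = baseline_constraints(w, delta_table, j=1)
```

The resulting pair would then be passed as the right-hand sides of the two constraints in Eq. (1).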
The expected baseline and optimal average bitrates are given by $\sum_{l=1}^{k} w_l \rho_l(q)$ and $\sum_{l=1}^{k} w_l \rho_l(q_l^\star)$, respectively. We repeat this for multiple baseline CRF values to get baseline and optimal rate-distortion sweeps. Figure 6 shows the plots of expected average and maximum distortion vs. expected average bitrate. Based on these plots, for the AV1 encoder, using the optimal CRF for each cluster is expected to improve the BD rate by […] compared to using the same CRF for all clusters for the same average distortion.

Fig. 6. Solid line: average PSNR (dB) vs. average bitrate (kbps). Dotted line: minimum PSNR vs. average bitrate. Computed using baseline (green) and optimal CRFs (red) with weights $w_l$ and centroid R-D curves.

We check the robustness of the optimization against inaccuracies in the R-D class prediction model and approximation errors in the centroid R-D curves. This is done by classifying each chunk in the data using the SVM classifier and then computing the bitrate and distortion at the optimal CRF for its predicted class using the actual R-D curve of the chunk. The baseline is obtained using the same CRF for all chunks. Plots of average distortion, $\frac{1}{n} \sum_{i=1}^{n} d_i(q_i)$, and maximum distortion, $\max_{i \in \{1, \ldots, n\}} d_i(q_i)$, vs. average bitrate, $\frac{1}{n} \sum_{i=1}^{n} r_i(q_i)$, computed over all chunks using optimal and baseline CRFs are shown in Figure 7. These plots also show that using optimal CRFs improves the BD rate by […] compared to the baseline for the same average distortion, thus indicating that the BD rate savings persist even with modelling errors. Figure 7 also shows the plots of average and maximum distortions vs. average bitrate obtained by optimizing the bitrates directly for all chunks in the data (blue solid and dotted lines, respectively). This can be considered the best rate allocation for the given chunks. The plots show that clustering based rate allocation achieves the smallest possible average distortion for a given average bitrate. However, the maximum distortion is larger than the smallest maximum distortion possible.

Fig. 7. Solid line: average PSNR (dB) vs. average bitrate (kbps). Dotted line: minimum PSNR vs. average bitrate. Computed over all chunks using optimal and baseline CRFs based on their predicted R-D class and actual R-D curves. Legend: baseline, optimal, upper bound.
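The robustness check described above can be sketched as follows: each chunk is encoded at the operating point chosen for its predicted cluster, but the bitrate and distortion are read from the chunk's own measured R-D points rather than from the centroid. The function name, the table layout, and the toy numbers are assumptions made for illustration.

```python
def evaluate(actual_rate, actual_dist, predicted_label, q_index):
    """Average bitrate, average distortion and maximum distortion over chunks.

    actual_rate[i][j], actual_dist[i][j]: measured values of chunk i at
    grid index j. q_index[l]: grid index used for every chunk whose
    predicted cluster is l.
    """
    n = len(predicted_label)
    chosen = [q_index[predicted_label[i]] for i in range(n)]
    rates = [actual_rate[i][j] for i, j in enumerate(chosen)]
    dists = [actual_dist[i][j] for i, j in enumerate(chosen)]
    return sum(rates) / n, sum(dists) / n, max(dists)


# Three toy chunks with two operating points each (made-up values).
actual_rate = [[100, 60], [900, 500], [120, 70]]       # kbps
actual_dist = [[1.0, 2.0], [3.0, 6.0], [1.5, 2.5]]     # lower is better
predicted_label = [0, 1, 0]

# Per-cluster operating points vs. one common operating point for all clusters.
opt = evaluate(actual_rate, actual_dist, predicted_label, q_index=[1, 0])
base = evaluate(actual_rate, actual_dist, predicted_label, q_index=[1, 1])
```

Sweeping the baseline operating point and repeating the comparison yields curves of the kind plotted in Figure 7.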
5. CONCLUSION
We presented an efficient method for optimal rate allocation over a large scale corpus using machine learning. Our method clusters the videos in the corpus based on their R-D characteristics and finds the optimal encoder operating points for all clusters. We developed a machine learning model to predict the R-D cluster of a test video using encoder pass-log features that are easy to obtain. In the future, we would like to develop a model for finding the optimal encoder parameters using features that are even simpler to compute and to reduce the amount of R-D data needed for training. It would also be interesting to integrate playback statistics into our framework to optimize bitrates for multiple formats of the same video used for adaptive streaming.

6. REFERENCES

[1] Y. Wang, S. Inguva, and B. Adsumilli, “YouTube UGC dataset for video compression research,” in IEEE International Workshop on Multimedia Signal Processing, 2019.
[2] “YouTube for press.”
[3] M. Seufert, S. Egger, M. Slanina, T. Zinner, T. Hoßfeld, and P. Tran-Gia, “A survey on quality of experience of HTTP adaptive streaming,” IEEE Communications Surveys & Tutorials, 2015.
[4] A. Ortega and K. Ramchandran, “Rate-distortion methods for image and video compression,” IEEE Signal Processing Magazine, Nov 1998.
[5] L. Toni, R. Aparicio-Pardo, K. Pires, G. Simon, A. Blanc, and P. Frossard, “Optimal selection of adaptive streaming representations,” ACM Trans. Multimedia Comput. Commun. Appl., Feb 2015.
[6] C. Chen, Y. Lin, S. Benting, and A. Kokaram, “Optimized transcoding for large scale adaptive streaming using playback statistics,” in IEEE International Conference on Image Processing, Oct 2018.
[7] Y. Chen, D. Mukherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi, et al., “An overview of core coding tools in the AV1 video codec,” in Picture Coding Symposium. IEEE, 2018.
[8] “Alliance for Open Media AV1 reference implementation,” https://aomedia.googlesource.com/aom/+/refs/heads/master/av1/encoder/firstpass.h.
[9] S. Ling, Y. Baveye, P. Le Callet, J. Skinner, and I. Katsavounidis, “Characterization of user generated content for perceptually-optimized video compression: Challenges, observations and perspectives,” in Human Vision and Electronic Imaging, 2020.
[10] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,”