Quality-driven Variable Frame-Rate for Green Video Coding in Broadcast Applications
Glenn Herrou, Charles Bonnineau, Wassim Hamidouche, Patrick Dumenil, Jerome Fournier, Luce Morin
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY - ACCEPTED VERSION 1
Abstract—The Digital Video Broadcasting (DVB) consortium has proposed to introduce Ultra-High Definition services in three phases: UHD-1 phase 1, UHD-1 phase 2 and UHD-2. The UHD-1 phase 2 specification includes several new features such as High Dynamic Range (HDR) and High Frame-Rate (HFR). It has been shown in several studies that HFR (+100 fps) enhances the perceptual quality and that this quality enhancement is content-dependent. On the other hand, HFR brings several challenges to the transmission chain, including codec complexity increase and bit-rate overhead, which may delay or even prevent its deployment in the broadcast ecosystem. In this paper, we propose a Variable Frame-Rate (VFR) solution to determine the minimum (critical) frame-rate that preserves the perceived quality of HFR video. The frame-rate determination is modeled as a 3-class classification problem which consists in dynamically and locally selecting one frame-rate among three: 30, 60 and 120 frames per second. Two random forest classifiers are trained with a ground truth carefully built by experts for this purpose. The subjective evaluations conducted on ten HFR video contents, not included in the training set, clearly show the efficiency of the proposed solution, which locally determines the lowest possible frame-rate while preserving the quality of the HFR content. Moreover, our VFR solution enables significant bit-rate savings and complexity reductions at both encoder and decoder sides.
Index Terms—High Frame-Rate (HFR), variable frame-rate, Ultra-High Definition (UHD), High Efficiency Video Coding (HEVC).
I. INTRODUCTION

THE DEPLOYMENT of the latest Ultra-High Definition TV (UHDTV) system [1] aims to increase the user's Quality of Experience (QoE) by adding to the existing High Definition TV (HDTV) system [2] new features such as higher spatial resolution, High Dynamic Range (HDR), wider color gamut and High Frame-Rate (HFR) [3], [4]. The technical definition of the UHDTV signal is available in the BT.2020 recommendation of the International Telecommunication Union (ITU) [1]. The Ultra-High Definition (UHD)-1 Phase 2 specification includes HFR among its new features.
Manuscript received April 13th, 2020; revised September 6th, 2020 and November 17th, 2020; accepted December 9th, 2020. This work has been funded by the French government through the ANR Investment referenced 10-AIRT-0007. All authors are with the Institute of Research and Technology (IRT) b<>com, 35510 Cesson-Sévigné, France. G. Herrou, C. Bonnineau, W. Hamidouche and L. Morin are also with INSA Rennes, Institut d'Electronique et des Technologies du Numérique (IETR), CNRS - UMR 6164, VAADER team, 20 Avenue des Buttes de Coesmes, 35708 Rennes, France (e-mail: [email protected]). C. Bonnineau is also with the Direction Technique, TDF, 35510 Cesson-Sévigné, France. P. Dumenil is also with Harmonic Inc., 35510 Cesson-Sévigné, France. J. Fournier is also with Orange Labs, 35510 Cesson-Sévigné, France.

Several studies have investigated the HFR video signal and shown its impact to enhance the viewing experience by reducing temporal artifacts, specifically motion blur and temporal aliasing [5]. Authors in [9] have conducted subjective evaluations and shown that the combination of HFR (100 fps) and high resolution (4K) significantly increases the QoE when the video is coded at high bit-rate. The subjective evaluations conducted in [10] have also demonstrated the impact of HFR (up to 120 fps) on the perceived video quality. Moreover, this study shows that HFR improves the quality of video content with camera motion, while lower frame-rates are more suitable for sequences with complex motion such as dynamic textures. The study conducted in [11] by the British Broadcasting Corporation (BBC) showed that down-conversion of HFR video to 50 fps would inevitably result in considerable quality degradation. Mackin et al. in [12] have presented a new database containing videos at frame-rates ranging from 15 fps to 120 fps.
Subjective evaluations performed on the video database have demonstrated a relationship between the frame-rate and the perceived video quality and confirmed that this relationship is content dependent.

The main limitation of HFR in a practical transmission chain is the significant increase in coding and decoding complexities. This complexity increase may prevent the deployment of HFR, since recent software and hardware codecs do not support real-time processing of high-resolution video at high frame-rates (+100 fps). This complexity overhead, estimated at 40% of the encoding and decoding times, is required to increase the frame-rate from 60 to 120 fps. Moreover, depending on the video content, HFR may increase both bit-rate and energy consumption compared to lower frame-rates without significant quality improvements, as shown in [10], [12]. A number of research works have investigated Variable Frame-Rate (VFR) [13]–[16]. These VFR solutions use different motion-related features with thresholding techniques [13], [16] or Machine Learning (ML) algorithms [14], [15] to select the desired frame-rate. The main limitations of these solutions are either a static sequence-level frame-rate adaptation, which greatly reduces the possible coding gains compared to a dynamic adaptation, or a target application with frame-rates lower than those of HFR content, making them unusable for the recent HFR format without major updates and thorough testing. In this paper, high frame-rate video refers to video represented by 100 frames per second and more (+100 fps).
In this paper, we propose a content-dependent variable frame-rate solution that determines the critical frame-rate of HFR videos. The critical frame-rate is the lowest possible frame-rate that does not affect the perceived video quality of the original HFR video signal. The proposed solution is based on a machine learning approach that takes as input spatial and temporal features of the video and determines as output the critical frame-rate. This can be considered as a classification problem that derives the critical frame-rate among three possible frame-rates: 30, 60 and 120 fps. The motivation behind these three frame-rates is mainly compliance with the frame-rates specified by ATSC [17] and DVB [18] for use with various broadcast standards. The subjective evaluations conducted on ten HFR video contents clearly show the efficiency of the proposed solution, which determines the lowest possible frame-rate while preserving the quality of the HFR content. This VFR solution enables significant bit-rate savings and complexity reductions at both encoder and decoder.

The rest of this paper is organized as follows. Section II gives a short overview of related works on HFR video, including coding, quality evaluation and rendering, followed by the objective and motivation of the paper. The proposed variable frame-rate solution is investigated in Section III as a classification problem with two binary Random Forest (RF) classifiers. The ground truth generation and feature extraction, used to train the two RF classifiers, are described in Section IV. Section V gives details on the training of the two RF classifiers. The performance of the proposed variable frame-rate solution is assessed in Section VI in terms of perceived video quality, compression and complexity efficiencies. Finally, Section VII concludes the paper.

II. RELATED WORK
A. High Frame-Rate Video
The UHDTV signal, defined in the ITU-R BT.2020 recommendation [1], introduces a number of improvements over HDTV [2] aiming at providing a better visual experience to the user. Along with a wider color gamut and an increased bit depth, which respectively allow depicting real colors and avoiding ringing artifacts, the key features of the UHDTV signal enabling a better depiction of live content are the higher spatial resolution - up to 3840x2160 and 7680x4320 pixels - and the increased frame-rate - up to 120 fps. The different experiments that led to the definition of each characteristic of the UHDTV signal are summarized in [3], [19].

High frame-rate video in particular has been an active field of research in the last decade, with the goal of avoiding well-known motion-related artifacts, namely flickering, motion blur and the stroboscopic effect, which are present at traditional HDTV frame-rates of 60 fps and lower. Flicker is a phenomenon in which unwanted visible fluctuations of luminance appear on a large part of the screen; it occurs at low refresh rates on non hold-type displays (e.g. Cathode Ray Tubes (CRTs)). Several studies [4], [20] have shown that flicker can be eliminated, for UHDTV signals, by simply using a frame-rate higher than 80 fps. The stroboscopic effect is the result of temporal aliasing, where the frame-rate is insufficient to represent smooth motion of objects in a scene, causing them to judder or appear multiple times. At a given frame-rate, strobing can be reduced by lowering the shutter speed of the camera. However, a lower shutter speed also increases motion blur, which is caused by the camera integrating an object's position over time while the shutter is open. Thus, strobing artifacts and motion blur cannot be optimized independently except by using a higher frame-rate [21]. Based on previous studies by Barten [22] and Daly [23], Laird et al.
[8] defined a spatio-velocity Contrast Sensitivity Function (CSF) model of the Human Visual System (HVS) taking into account the effect of eye velocity on sensitivity to motion. In [7], Noland uses this model along with traditional sampling theory to demonstrate that the theoretical frame-rate required to eliminate motion blur without any strobing effect is 140 fps for untracked motion and as high as 700 fps if eye movements are taken into account. Since this theoretical critical frame-rate is not yet achievable, several subjective studies have investigated the frame-rate for which motion-related artifacts are acceptable for the HVS. In [24], Selfridge et al. investigate the visibility of motion blur and strobing artifacts at various shutter angles and motion speeds for a frame-rate of 100 fps. Their subjective tests showed that even at such a frame-rate, motion-blur and strobing artifacts cannot both be avoided simultaneously. Kuroki et al. [6] conducted a subjective test with frame-rates ranging from 60 to 480 fps, concluding that no further improvements in the visibility of blurring and strobing artifacts were observed above 250 fps. Recently, Mackin et al. [5] have performed subjective tests on the visibility of motion artifacts for frame-rates up to 2000 fps, achieved using a strobe light with controllable flash frequency. The study concluded that a minimum of 100 fps was required to reach tolerable motion artifacts.

For the purpose of the UHDTV signal definition, several studies further investigated the importance of HFR for television [9], [11], [25]. Emoto et al. [25] showed that increasing the frame-rate from the traditional 60 fps to a high frame-rate of 120 fps provides a significant visual quality improvement. It is also stated that a further increase to 240 fps would improve the motion portrayal as well, but to a much lesser extent than the transition from 60 to 120 fps. Salmon et al.
[11] have also studied HFR for television, showing that at least 100 fps is required for improvements over HDTV, especially for content with high motion such as sports. Recently, with one of the first 65-inch UHD HFR prototype displays, Hulusic et al. [9] studied the joint and independent contributions of 4K resolution and HFR. The subjective tests carried out showed that the 2160p100Hz format enables a significant increase in visual quality over the other configurations - 1080p50Hz, 1080p100Hz and 2160p50Hz - but also that the improvements are strongly content dependent.
B. Compression of HFR content and Variable Frame-Rate
Since the adoption of HFR in the future television standard, through the second phase of the Digital Video Broadcasting (DVB) UHD standard [26], several studies of HFR content
compression have been carried out. Authors in [10] investigated the impact of high frame-rate on video compression, focusing on the perceptual quality of different motion types and frame-rates at several bit-rates. Using the test sequences of the public HFR dataset described in [12], compressed with a High Efficiency Video Coding (HEVC) encoder, it is shown that HFR is beneficial and desirable, especially at high bit-rates and even at current HDTV broadcast data rates, for sequences containing camera and/or simple motion, for which the encoder can make use of the increased temporal correlation to predict adjacent frames. Sugito et al. [27] showed that the bit-rate overhead introduced by the increase from 60 to 120 fps is reasonable, with only a fraction of the total bit-rate optimally allocated to the additional frames needed to achieve HFR capability. However, one of the main limitations of doubling the frame-rate is the additional encoding complexity, with a substantial increase in encoding time.

VFR, where the image frequency can be adapted based on the signal characteristics, is one of the solutions to cope with the complexity and bit-rate increases. Authors in [13] have studied the impact of both frame-rate and Quantization Parameter (QP) on the perceived video quality. They also developed an accurate rate model and quality model based on single-layer and scalable video bitstreams. These models have been applied to frame-rate-based adaptive rate control for single-layer and scalable video encodings. However, these models have been designed for frame-rate decisions between low values, which correspond to a very low motion portrayal quality compared to HFR videos.
In [14], a Support Vector Regression (SVR) is used to predict a satisfied user ratio - the percentage of people who do not see the difference between the original and lower frame-rates - which is then used to dynamically select the appropriate image frequency at a Group Of Pictures (GOP) or sequence level. The trained SVR uses complex and computationally demanding features, notably a visual saliency map and a spatial randomness map for each frame, thus making it unsuitable for real-time dynamic frame-rate selection. In addition, the training set is only composed of up-to-60 fps content, limiting the frame-rate choices to values of 60 fps and below. With their design targeting maximum frame-rates lower than those of HFR, there is no guarantee that the features used by the VFR models proposed in [13] and [14] would also work on HFR content, due to the different motion portrayal observed in HFR content.

More recently, VFR for HFR content has been investigated, aiming at offering a perceptually indistinguishable temporally downsampled video [15], [16]. Katsenou et al. train Bagged Decision Trees to predict the critical frame-rate at a sequence level [15]. The selected feature set is only composed of an Optical Flow (OF) for the temporal aspect and a Gray Level Co-occurrence Matrix (GLCM) for the spatial details contribution. In addition, the considered dataset consists of 22 test sequences with critical frame-rates of 60 or 120 fps. Thus, a good generalization of the VFR decision problem is hard to achieve with so little data to train and validate the model. In [16], a dynamic frame-rate adaptation is proposed based on the frame-rate dependent metric FRQM [28]. The temporal adaptation is coupled with a spatial resolution adaptation using kernel-based downsampling and a neural-network-based upsampling at the pre- and post-processing stages respectively. The spatio-temporal adaptation model shows high coding gains through both objective and subjective tests.
However, the authors indicate that the temporal adaptation has only been used for one sequence in their dataset, which contained only a small number of sequences, making the performance evaluation of the VFR part of the solution difficult, especially for HFR content.

C. Motion Blur Rendering and Video Frame Interpolation
In a pipeline using a VFR video format to transport the video, several processing steps could be added to improve the perceptual quality of the output video. On one hand, motion blur can be synthesized when the frame-rate is lowered, to render a video close to what would have been captured by a camera at the lower frame-rate with its corresponding shutter speed. This would reduce the stroboscopic effect due to the frame decimation, thus improving the visual quality of the VFR video. Motion blur synthesis has been extensively studied in an effort to render synthetic images as realistically as possible [29]. These techniques mostly rely on perfect knowledge of the depth and motion of the scene and are thus not compatible with a live broadcast use-case. More recently, a motion blur rendering algorithm using only two consecutive images as inputs to produce a motion-blurred output has been designed in [30]. These promising results are balanced by the computational demand of the algorithm, due to the underlying Convolutional Neural Network (CNN) architecture used to synthesize motion blur.

On the other hand, since most display devices do not support a variable frame-rate, the VFR video must be temporally upsampled to the original higher frame-rate before being displayed. Thus, frame interpolation methods can be used to improve the temporal upsampling step in order to obtain a visually better displayed video. Video frame interpolation is a well-studied field with several existing approaches to the problem. The classical approach interpolates intermediate frames from the optical flow field [31] of the scene. Interpolated frames, whose quality highly depends on the accuracy of the computationally expensive optical flow computation [32], typically suffer from motion boundaries and severe occlusions, thus showing strong artifacts even with state-of-the-art optical flow algorithms [33].
More recent promising works rely on neural networks to either predict per-pixel convolution kernels used to generate the interpolated frames [34] or leverage optical flow fields with exceptional motion maps [35]. However, these techniques involve a large number of convolutions, sometimes with large kernels (up to 41x41 for each pixel) to cope with large motion, thus making the computational demand unsuitable for real-time use-cases.
D. Objective and Motivation
Most existing algorithms featuring variable frame-rate have been designed purely for rate control in 30 fps video encoding schemes, with the goal of skipping frames when the bit budget constraint cannot be met. Such behavior does not properly take into account the impact on perceptual
[Fig. 1 pipeline: Input UHD 120 Hz → Frame-Rate Selection / Temporal Downsampling → Encoder → UHD VFR → Decoder → Temporal Upsampling → UHD 120 Hz → Display]
Fig. 1. Block diagram of the complete Variable Frame-Rate (VFR) coding scheme.

quality, making these solutions not suitable for HFR, which has been integrated into the UHDTV standard to improve motion portrayal. Since the improvements brought by HFR capabilities are highly content dependent, several recent studies have based the frame-rate selection on perceptual factors to lower the frame-rate when there is no impact on visual quality. However, they use computationally expensive features and rely on a small dataset, not always composed of 120 fps content, to train and validate the variable frame-rate models. Thus, these solutions do not achieve a good generalization of the problem. Moreover, they are not suitable for a real-time constraint, which is required for use-cases like the live broadcast of, for example, sport events, which would highly benefit from HFR.

In this paper, real-time variable frame-rate for HFR source content is addressed. The proposed system, depicted in Fig. 1, relies on two classifiers to predict the critical frame-rate. It has been designed with the following objectives:
• Design a dynamic frame-rate selection at a GOP level with perceptually invisible frame-rate changes.
• Low-complexity feature computation with low impact on the coding chain processing time.
• Training and testing on a well-dimensioned dataset containing various types of 120 fps content.
• Perceptual validation of the obtained objective performance through a subjective evaluation test.
• Assessment of the bit-rate and complexity savings within an HEVC encoding chain.

To meet the real-time constraint of live broadcast, state-of-the-art motion blur rendering and frame interpolation have not been integrated in the VFR pipeline. Instead, the proposed system has been designed with simple frame decimation (resp. duplication) as a temporal downsampling (resp. upsampling) tool. Additionally, the RF algorithm has been chosen as the classification technique based on a small benchmark comparing several ML techniques for VFR classification, which showed a better trade-off between prediction accuracy and computational complexity for the RF algorithm.

III. RANDOM FOREST CLASSIFIER FOR VARIABLE FRAME-RATE
This section briefly presents Random Forests (RFs) as a classification tool and introduces the method proposed in this work to reduce the frame-rate with no visual impact, i.e. the VFR decision problem, as a combination of two binary classification problems.
A. Background on Random Forests
Random Forests [36] are a common ML tool used to solve classification problems. A RF classifier is able to predict the value of a target variable, i.e. a class, based on a set of input variables, i.e. input features, using the majority vote of an ensemble of nearly independent decision trees.

A decision tree is constructed by first partitioning the training dataset, i.e. the features and the associated class of each sample, into two different subsets, called nodes. This process is performed recursively until either all the node samples belong to a single class or a tree constraint has been reached. At each node, each available input feature is evaluated for all its possible values, in order to achieve the best separation of the classes in the subsequent child nodes.

In this work, the criterion used to quantify the quality of a split, given the feature F and its threshold value t, is based on the Gini impurity measure, a common metric for Decision Trees [37]. It is computed as follows:

I_G(D) = \sum_{c \in C} P(c|D) (1 - P(c|D)),   (1)

with D the sample set under consideration, C the set of possible class labels and P(c|D) the conditional probability of class c given the sample set D.

The best split is then obtained by finding the pair (F, t) that maximizes the Mean Decrease Impurity (MDI) \Delta I_G defined by Equation (2):

\Delta I_G(D, F, t) = I_G(D) - \frac{|D_L|}{|D|} I_G(D_L) - \frac{|D_R|}{|D|} I_G(D_R),   (2)

with D_L = \{x \in D, F(x) < t\} (resp. D_R = \{x \in D, F(x) \geq t\}) the subset of the sample set D for which each sample x has a value of feature F(x) smaller than (resp. larger than or equal to) the threshold t, and |D| the cardinality of a set D.

To minimize the correlation between the trees of the RF, the bootstrap aggregating, or bagging, technique [38] is used to construct the forest. This consists in training each tree T_i with a different subset D_i of the input data sample set D.
Each D_i is obtained by a uniform sampling of D with replacement, i.e. replacing discarded samples by duplicates of a selected one. In addition, to further reduce the correlation between trees, only a random subset of the features, here \sqrt{n} features with n the total number of input features, is evaluated at each node to find the best available split.
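As an illustration, the split criterion of Equations (1) and (2) can be sketched in a few lines of Python. This is a toy single-feature implementation for clarity, not the code used in this work; the function names and variables (`gini`, `mdi`, `best_split`, `values`, `labels`) are ours:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: sum over classes of P(c|D) * (1 - P(c|D))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((k / n) * (1 - k / n) for k in Counter(labels).values())

def mdi(parent, left, right):
    """Mean Decrease Impurity of splitting `parent` into `left`/`right`."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_split(values, labels):
    """Exhaustive search of the threshold t maximizing the MDI for one feature."""
    best_t, best_gain = None, -1.0
    for t in sorted(set(values)):
        left = [c for v, c in zip(values, labels) if v < t]
        right = [c for v, c in zip(values, labels) if v >= t]
        if not left or not right:
            continue  # degenerate split, skip
        gain = mdi(labels, left, right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

For a perfectly separable toy feature such as `values = [1, 2, 10, 11]` with `labels = [0, 0, 1, 1]`, the search returns the threshold 10 with the maximum gain of 0.5 (the parent impurity), since both child nodes become pure.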
[Fig. 2 labels: Data Sample → Decision → Frame Decimation]
Fig. 2. Overall prediction scheme with cascaded binary RF classifiers.
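The cascaded decision of Fig. 2, combined with a regular frame decimation, can be sketched as follows. This is a minimal illustration under our own naming; the two classifier callables stand in for the trained RF models described later and are not the paper's actual implementation:

```python
def select_frame_rate(features, clf_120_vs_fd, clf_60_vs_30):
    """Cascaded decision: first choose between keeping 120 fps and
    applying Frame Decimation (FD); only if FD is chosen, run the
    second classifier to pick the lower frame-rate."""
    if clf_120_vs_fd(features) == "120":
        return 120                      # keep the original HFR chunk
    return 60 if clf_60_vs_30(features) == "60" else 30

def decimate(frames, src_fps, dst_fps):
    """Regular frame decimation: keep every (src_fps // dst_fps)-th frame."""
    step = src_fps // dst_fps
    return frames[::step]
```

For instance, decimating an 8-frame chunk from 120 fps to 30 fps keeps every fourth frame: `decimate(list(range(8)), 120, 30)` returns `[0, 4]`.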
B. VFR Classification Problem
The proposed solution aims at predicting when the frame-rate can be reduced, by discarding frames, without any perceptual impact on the quality of the original input HFR video. In an effort to keep the number of possible frame-rates small and obtain a regular frame decimation process, two frame-rates, 60 fps and 30 fps, were identified as potential candidates in addition to the original frame-rate of 120 fps. The VFR decision problem thus becomes a three-class classification.

In this work, a combination of two successive binary RF classifiers has been chosen to solve the classification problem, as depicted in Fig. 2, instead of directly training a forest with multi-class outputs. This decision leads to a better overall performance by training both classifiers independently on separate datasets and features. Indeed, in addition to the specialization of each binary classifier, almost all samples of the database can be used for training either one or both classifiers while keeping balanced training datasets, as described in Section IV, thus increasing the accuracy of the overall model.

The first RF classifier is specialized in deciding whether the frame-rate must remain 120 fps or a Frame Decimation (FD) can be applied without impacting the visual quality. If the 120 fps class is chosen by the first classifier, the frame-rate prediction process is stopped and no FD is applied. Otherwise, the prediction process continues by requesting the second classifier, which aims at selecting the appropriate lower frame-rate if a FD is applied on the input HFR video.

IV. GROUND TRUTH GENERATION
One of the most crucial steps towards training a supervised RF model is to gather a dataset which will be used as ground truth, i.e. examples - features representing a sample and the sample class label - used by the model to learn to predict. It is thus important to have a ground truth that contains a good representation of all the real cases the model could encounter, in order to achieve a good generalization of the problem. This section details the ground truth generation process necessary to obtain the datasets used to train both RF classifiers. The HFR database is first presented, followed by the detailed methodology for subjectively determining the critical frame-rate labels. Then, the creation of the dataset is described, with the composition of the balanced training sets on one hand and the feature extraction process on the other hand.

[Fig. 3 timeline: Ref-N | Test-N displayed side by side for a 1st and a 2nd repetition, followed by Vote-N]

Fig. 3. Basic Test Cell (BTC) for the SDSCE evaluation method.
A. HFR Video Database
To the best of our knowledge, there is no publicly available dataset for the VFR classification problem except for the Bristol Vision Institute High Frame-Rate (BVI-HFR) database [12]. However, this database only contains 22 videos, which is rather small to train a reliable model. In addition, the temporal downsampling technique used to create the lower frame-rates in this database is frame averaging, whereas the method chosen in this work is frame decimation, which could lead to different decisions on the critical frame-rate.

Therefore, a new database has been gathered, composed of 375 native HFR video clips of 5 to 10 seconds. These clips are all uncompressed and stored in YUV format with 4:2:0 chroma subsampling and 8-bit depth. Their original frame-rate is 120 fps and their spatial resolution 1920×1080. The sequences come from b<>com, Harmonic and other non-publicly available test sequences. In order to later evaluate the trained model on unseen data samples, 15 sequences with heterogeneous spatio-temporal characteristics have been extracted from the database, leaving 360 video clips to annotate before training both RF classifiers.

B. Critical Frame-rate Decision Methodology
The ground truth generation process requires each video of the database to be assigned a critical frame-rate chosen among the three considered in the VFR classification problem. A subjective test has thus been carried out with the objective of finding the lowest frame-rate for which no visual degradation can be observed compared to the original 120 fps video. Following the ITU-R BT.500-13 recommendation [40], the Simultaneous Double Stimulus for Continuous Evaluation (SDSCE) protocol has been used with a binary scale - either identical or visible difference - and two screens placed side by side. Subjects were asked, for each video of the database, whether there was a visible difference between the two displayed videos, i.e. the known reference and the lower frame-rate (either 30 or 60 fps) test video.

The test was composed of 750 Basic Test Cells (BTCs), randomly divided into 30-minute test sessions. Each BTC is 40 seconds long and is composed of a 2-second message announcing the test video index, followed by the side-by-side display of the reference and test videos with two repetitions. A 2-second break displaying a mid-gray image has been added between the repetitions. The BTC is concluded by a 4-second message asking the viewer to vote, as shown in Fig. 3.
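As a back-of-the-envelope check of the test dimensions stated above (750 BTCs of 40 seconds split into 30-minute sessions), and under our own assumption that sessions are simply filled up to the 30-minute budget:

```python
n_btc = 750            # number of Basic Test Cells in the test
btc_seconds = 40       # duration of one BTC
session_minutes = 30   # duration of one test session

total_minutes = n_btc * btc_seconds / 60            # 500.0 minutes of test material
n_sessions = -(-total_minutes // session_minutes)   # ceiling division: 17 sessions
```

This gives roughly 500 minutes of viewing per participant, i.e. about 17 sessions, which explains why the annotation was restricted to a small panel of expert viewers.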
TABLE I
DATABASE CRITICAL FRAME-RATE DISTRIBUTION
Critical frame-rate: 30 | 60 | 120
Due to the long duration of the subjective test, only five viewers, all experts in video processing, participated in the annotation of the whole database. The final frame-rate decision is the lowest frame-rate for which the majority of expert viewers did not notice any visible difference with the 120 fps reference video.

The tests were conducted in a controlled laboratory environment, following the ITU-R BT.500-13 recommendation [40]. Two identical 27-inch screens capable of displaying 120 fps content (Asus RoG Swift PG278Q) were used side by side, aligned and placed at a viewing distance of three times the screen height. Each participant was screened to ensure (corrected-to-) normal visual acuity and normal color vision.
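The label decision rule described above (the lowest frame-rate for which a majority of the expert viewers reported no visible difference against the 120 fps reference) can be written as a short sketch; the vote encoding and function name are our assumptions, not the paper's tooling:

```python
def critical_frame_rate(votes):
    """`votes` maps a tested frame-rate (30 or 60) to per-viewer answers
    (True = no visible difference vs. the 120 fps reference).
    Returns the lowest rate a strict majority judged transparent,
    falling back to 120 fps otherwise."""
    for rate in sorted(votes):                 # try 30 fps before 60 fps
        ballots = votes[rate]
        if sum(ballots) > len(ballots) / 2:    # strict majority saw no difference
            return rate
    return 120
```

For example, with five viewers, if 30 fps is visibly degraded for all of them but four out of five see no difference at 60 fps, the sequence is labeled 60 fps.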
C. Balanced Dataset Composition
The results of the expert subjective test are summarized in Table I. As can be seen, the sequences are not evenly distributed over the three possible frame-rates, due to a large proportion of the available content being captured with HFR-capable devices and containing high motion, for which frame decimation downsampling to 30 fps is critical. Since the goal is to allow for a frame-rate adaptation at the lowest possible level, i.e. a frame-rate decision for every chunk of 4 frames to keep a regular frame decimation process, the sequence-level labels obtained via the subjective test have been extended to 4-frame chunk-level labels. Thus, for sequences with significant motion discontinuities, video shots with uniform motion can be identified and assigned different labels within a single sequence. Based on the observations made by the experts after the subjective test, a total of 429 video shots with uniform motion have been extracted from the 360 native HFR sequences. This refinement allows for a more accurate annotation of the database, avoiding chunks being annotated with inconsistent labels, for instance a chunk not containing any movement associated to the 120 fps class. However, such cases could still remain in the ground truth due to the difficulty of identifying motion discontinuities with a precision of less than four frames.

From this ground truth, two different datasets were created to train the two RF classifiers. The first one contains all samples of the 120 fps class as well as those from the FD class. The FD class is composed of all 30 fps samples and a random subset of the 60 fps samples. The amount of selected samples has been chosen to produce a balanced dataset, i.e. to roughly obtain the same sample size for both the 120 fps and FD classes. The second dataset, used to train the second classifier, is comprised of the 30 fps samples and a random subset of the 60 fps samples, whose size has also been chosen to produce a balanced dataset. The choice of balancing the datasets has been made because the unbalanced class distribution in the database does not necessarily represent the distribution of media content in a broadcast context, the chosen use-case for the proposed solution, but rather relates to the current difficulty of finding HFR content with low motion. This is due to the fact that most of the currently available HFR content has been shot to demonstrate the gain in perceptual quality and motion portrayal brought by the technology.

D. Feature Extraction
The goal of a feature set is to gather the different metrics relevant to the considered classification problem that would help discriminate the output classes from one another. For the VFR classification problem, a first feature would intuitively be the motion information, e.g. the motion vectors between two consecutive frames. Indeed, high movement in a source HFR video will likely lead to visible temporal aliasing, i.e. stroboscopic effect, if a lower frame-rate is used after frame decimation. In addition, since motion blur is not added during the temporal downsampling process used in this work, lowering the frame-rate could introduce visible jerkiness in high motion videos.

For the three considered frame-rate classes, flickering, the other well-known motion-related artifact, can appear in highly textured areas where the local variation in luminance between two consecutively displayed frames would be visible at lower frame-rates. In an effort to capture this phenomenon in the feature set, the pixel luminance values and directional gradients can be used.

Based on the performed expert viewing sessions, it has been observed that small objects with high velocity, which would not necessarily be detected by the motion vectors depending on the used motion estimation algorithm, could induce visible artifacts at lower frame-rates. To take this observation into account, a simple metric capable of detecting both global displacements and small moving objects has been designed. This metric is based on the thresholding of the difference between two consecutive frames. First, the frame difference $D_n(i,j)$, i.e. the difference in pixel value of the luminance plane at the same spatial location $(i,j)$ between the $n$-th frame $F_n$ and the preceding one $F_{n-1}$, is computed for each pixel using Equation (3)

$$D_n(i,j) = |F_n(i,j) - F_{n-1}(i,j)|. \quad (3)$$
Then, a thresholding operation is performed on the frame difference image, defined as follows

$$A_{n,Th}(i,j) = \begin{cases} 1 & \text{if } D_n(i,j) \ge Th \\ 0 & \text{if } D_n(i,j) < Th, \end{cases} \quad (4)$$

with $A_{n,Th}$ the resulting thresholding activation map for the $n$-th frame and $Th$ a threshold. Fig. 4 depicts an example with both the original image and the resulting thresholded frame difference image.
The designed feature set is thus based on the following feature maps:
• NormMV, HorMV, VerMV: maps respectively representing the Motion Vectors (MVs) norm, horizontal coordinate and vertical coordinate.
Fig. 4. Example of thresholded motion difference with (a) the original image of the Jockey sequence and (b) the thresholding activation map with threshold
$Th = 25$.
• ThreshDiffMap: thresholded frame difference map as defined in Equation (4).
• GradMag, GradHor, GradVer: maps respectively representing the Sobel gradient magnitude, horizontal gradient and vertical gradient.
• Luma: pixel luminance map.
For each map, several scores have been computed, namely the mean value, the standard deviation, the maximum value and the mean of the 10% highest values, to produce a total of 32 different features that will serve as an initial feature set for the training of both the considered RF models.
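As an illustration, a minimal NumPy sketch of this per-frame feature extraction follows; the MV-based maps are omitted (they would require a motion estimator) and simple central differences stand in for the Sobel operator, so the function and map names are illustrative, not the authors' implementation:

```python
import numpy as np

def frame_features(f_prev, f_curr, th=25):
    """Illustrative per-frame feature extraction: builds a subset of the
    feature maps described above (MV maps omitted) and reduces each map
    to four scores: mean, standard deviation, maximum, and mean of the
    10% highest values."""
    f_prev = f_prev.astype(np.float64)
    f_curr = f_curr.astype(np.float64)

    # Equation (3): absolute luminance frame difference D_n.
    diff = np.abs(f_curr - f_prev)
    # Equation (4): binary thresholding activation map A_{n,Th}.
    activation = (diff >= th).astype(np.float64)

    # Directional gradients (central differences as a stand-in for Sobel).
    grad_ver, grad_hor = np.gradient(f_curr)
    grad_mag = np.hypot(grad_hor, grad_ver)

    maps = {"ThreshDiffMap": activation, "GradHor": grad_hor,
            "GradVer": grad_ver, "GradMag": grad_mag, "Luma": f_curr}

    feats = {}
    for name, m in maps.items():
        flat = np.sort(m, axis=None)
        top = flat[-max(1, flat.size // 10):]   # 10% highest values
        feats[f"mean_{name}"] = float(flat.mean())
        feats[f"stddev_{name}"] = float(flat.std())
        feats[f"max_{name}"] = float(flat.max())
        feats[f"top_{name}"] = float(top.mean())
    return feats
```

With the NormMV, HorMV and VerMV maps added from a motion estimator, this reduction yields the 32 features of the initial set (8 maps, 4 scores each).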
V. RANDOM FOREST TRAINING PROCESS
Once the ground truth is available and the features computed, the RF models can be trained to solve the VFR classification problem. This section first focuses on the performance evaluation process, necessary to assess and optimize the quality of the model on the critical frame-rate prediction task. Then, a feature selection process, used to reduce the initial feature set to only the relevant features for each binary classifier, is presented. Finally, the classification results are presented and analyzed.
A. Model Evaluation Process
In order to optimize a ML classifier, it is necessary to use a metric capable of evaluating the model classification performance to find the best parameters. To do so, several common metrics, namely precision, recall and F1-score [41], can be used. First, the trained model confusion matrix has to be computed from the true and predicted labels of the dataset samples. Then, once the different quantities of the confusion matrix, namely True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN), are available either as numbers of samples or normalized probabilities, the precision, recall and F1-score can be computed as follows, with $C = \{c_1, c_2\}$ the set of classes for the binary classifier under test

$$\mathrm{precision}(C) = \frac{1}{|C|}\sum_{c_i \in C}\frac{TP(c_i)}{TP(c_i)+FP(c_i)}, \quad (5)$$

$$\mathrm{recall}(C) = \frac{1}{|C|}\sum_{c_i \in C}\frac{TP(c_i)}{TP(c_i)+FN(c_i)}, \quad (6)$$

$$\mathrm{F1\text{-}score}(C) = \frac{2}{|C|}\sum_{c_i \in C}\frac{\mathrm{precision}(c_i)\,\mathrm{recall}(c_i)}{\mathrm{precision}(c_i)+\mathrm{recall}(c_i)}, \quad (7)$$

As for any binary classification problem, the goal is to maximize the confusion matrix main diagonal values, i.e. the number of TP and TN representing the correct predictions. This can be achieved by maximizing precision, recall, or F1-score during the training process, depending on the considered classification problem and the criticality of each error type. For the VFR classification problem, the main goal is also to minimize the critical errors - predicted frame-rate lower than the ground truth - which would potentially induce visible temporal artifacts, thus greatly reducing the output visual quality.
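As a minimal sketch, Equations (5)-(7) can be computed directly from a 2x2 confusion matrix; the layout assumed here (cm[i][j] counts samples of true class i predicted as class j) is a convention of this example:

```python
def binary_scores(cm):
    """Macro-averaged precision, recall and F1-score over the two
    classes of a binary classifier (Equations (5)-(7))."""
    precisions, recalls, f1s = [], [], []
    for c in (0, 1):
        tp = cm[c][c]              # true class c, predicted c
        fp = cm[1 - c][c]          # other class predicted as c
        fn = cm[c][1 - c]          # class c predicted as the other class
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r))
    n = 2  # |C|
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```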
To emphasize these critical errors and avoid them in the final model, another performance evaluation metric $M_{crit}$ has been designed as a combination of the precision of the lower frame-rate class and the recall of the higher frame-rate class, using Equation (8)

$$M_{crit}(C) = \frac{1}{|C|}\left[\mathrm{precision}(c_1) + \mathrm{recall}(c_2)\right] = \frac{1}{|C|}\left[\frac{TP(c_1)}{TP(c_1)+FP(c_1)} + \frac{TP(c_2)}{TP(c_2)+FN(c_2)}\right] = \frac{1}{|C|}\left[\frac{TP(c_1)}{TP(c_1)+FP(c_1)} + \frac{TN(c_1)}{TN(c_1)+FP(c_1)}\right], \quad (8)$$

with $C = \{c_1, c_2\}$ the set of ordered classes - the frame-rate of $c_1$ being lower than the frame-rate of $c_2$. This metric has been used together with the F1-score to assess the quality of RF models for both the feature selection process and the hyper-parameter tuning described in the next sections.

B. Feature Selection
In an effort to limit the model complexity and improve its performance, a dimensionality reduction algorithm has been used on the proposed initial feature set. Indeed, by only selecting the relevant features, thus removing features carrying useless information for the considered classification problem, both the feature computation time and training time are greatly reduced. Additionally, model over-fitting is also decreased when the size of the feature set is reduced, due to the reduction of noise in the input data and the elimination of highly correlated features, i.e. features that would carry the same information about the target variable.

In this work, a Recursive Feature Elimination (RFE) process has been used to reduce the dimension of the initial feature set. It consists in recursively evaluating the model performance on a dataset and a feature set in which the least important feature is removed after each iteration. The feature importance is computed in terms of mean decrease in Gini impurity, i.e. the average capacity of a feature to reduce the Gini impurity computed at a given tree node, using Equation (1). When the feature set size reaches the minimum tested dimension of 2, the feature set leading to the best model performance among all the tested dimensions is selected as the final feature set. This process has been performed independently for both proposed RF models with the same initial feature set, but leading to a different optimal feature set size for each binary RF classifier, respectively 26 and 11 features for the 120fps-FD and 60fps-30fps classifiers, as depicted in Fig. 5.

Fig. 6a shows the list of selected features for the
120fps-FD RF classifier with their corresponding feature importance. As can be observed, the most relevant features to discriminate
Fig. 5. Recursive Feature Elimination with weighted (F1, $M_{crit}$) score: (a) 120fps-FD RF model, (b) 60fps-30fps RF model.
samples from both classes are based on the two motion features ThreshDiffMap and
NormMV. This correlates well with the observations made by the experts during the ground truth annotation subjective tests. Indeed, it was pointed out that above a certain amount of movement, either from a moving camera or an object with high velocity - which can be captured by both metrics -, a stroboscopic effect as well as jerkiness, due to the lack of motion blur, could easily be detected at frame-rates lower than 120 fps. Most of the features based on spatial measures are present in the optimized feature set, with a significantly lower importance compared to the aforementioned motion features. This tends to indicate that flickering becomes an important criterion for keeping a high frame-rate when the amount of movement does not induce other motion artifacts.

For the
60fps-30fps RF model, the selected features and their importance are depicted in Fig. 6b. As for the first RF model, the features based on the motion vectors,
NormMV in particular, have a high capacity to discriminate samples from both classes. However, spatial features, based on the
Luma and
GradHor feature maps, hold a significantly higher importance compared to the first model, indicating that flickering mostly occurs at 30 fps for most of the videos of the training dataset.
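The RFE loop described above can be sketched generically as follows; `fit_and_score` is a hypothetical stand-in for training a random forest, scoring it (e.g. with the weighted F1/$M_{crit}$ metric) and reading its Gini-based feature importances:

```python
import numpy as np

def recursive_feature_elimination(X, y, fit_and_score, min_features=2):
    """Sketch of the RFE process: repeatedly fit a model, record its
    score, drop the least important remaining feature, and finally
    return the best-scoring subset of feature indices.
    `fit_and_score(X_sub, y)` must return (score, importances), with
    importances aligned with the columns of X_sub."""
    remaining = list(range(X.shape[1]))
    best_score, best_subset = -np.inf, list(remaining)
    while len(remaining) >= min_features:
        score, importances = fit_and_score(X[:, remaining], y)
        if score > best_score:
            best_score, best_subset = score, list(remaining)
        if len(remaining) == min_features:
            break
        # Drop the least important feature (mean decrease in Gini impurity).
        remaining.pop(int(np.argmin(importances)))
    return best_subset, best_score
```

With scikit-learn, `fit_and_score` could wrap a `RandomForestClassifier` and its `feature_importances_` attribute, which reports exactly the mean decrease in Gini impurity used here.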
C. Classification Results
With the final feature sets, an optimization has been conducted on the maximum tree depth and number of trees hyper-parameters, leading to a final model with 200 trees of depth 7 for the 120fps-FD classifier and 100 trees of depth 7 for the 60fps-30fps classifier. An in-depth analysis of the prediction capability of the final models, both individually and combined to form the overall VFR prediction scheme, can then be conducted. It is important to note that the models have been trained using a 10-fold cross-validation, so that the considered performance is a combination of the results from the validation fold of each iteration. This means that each tested sample prediction presented in the different confusion matrices has been obtained without using the validation sample for training. Additionally, the training set has not been shuffled, so that chunks from a same sequence could not be in the training and validation folds at the same time, thus avoiding a highly biased performance evaluation.

Figures 7a and 7b show the resulting confusion matrices of both RF models, individually. For the 120fps-FD classifier, low error rates can be observed for the 120fps and FD classes respectively, which represents a good performance considering the VFR classification problem and its imperfect ground truth. Indeed, the frontier between annotating a sequence with a 120fps label and an FD label can be difficult to keep consistent during the processing of the 360 videos of the training set. In addition, as detailed in Section IV-C, several sequences with high motion discontinuities have been separated into shots with different labels. Since these motion change frontiers could only be determined subjectively, and not at a precise frame level, some dataset samples, i.e. 4-frame chunks, located at these frontiers could have been annotated with incorrect labels. Therefore, the error rate is likely to be over-estimated, leading to a visible motion artifacts rate lower than the observed one.
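The unshuffled, sequence-aware fold construction described above can be sketched as follows; the round-robin assignment of whole sequences to folds is one possible implementation, not necessarily the authors':

```python
def sequence_folds(seq_ids, n_folds=10):
    """Build cross-validation splits in which all chunks of a given
    sequence fall into exactly one validation fold, so chunks of the
    same sequence never appear in the training and validation folds at
    the same time. seq_ids[i] is the source sequence of chunk i."""
    ordered = list(dict.fromkeys(seq_ids))          # first-appearance order
    fold_seqs = [set(ordered[k::n_folds]) for k in range(n_folds)]
    splits = []
    for fold in fold_seqs:
        val = [i for i, s in enumerate(seq_ids) if s in fold]
        train = [i for i, s in enumerate(seq_ids) if s not in fold]
        splits.append((train, val))
    return splits
```

The same effect can be obtained with scikit-learn's `GroupKFold`, using the sequence id as the group label.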
For the 60fps-30fps model, good correct prediction rates are observed for the 60fps and 30fps classes. Critical errors, defined as critical frame-rate under-estimations, represent a small proportion of the 60fps class samples. This rate can be problematic, since an under-estimation with a frame-rate of 30 fps could lead to severe visible motion artifacts. However, the same aforementioned remark concerning the imperfect ground truth applies to the training dataset of the 60fps-30fps model. The proportion of frame-rate over-estimation errors does not impact the visual quality and is thus not as prejudicial as the critical errors.

The overall VFR prediction scheme confusion matrix, obtained by combining the cross-validation validation-fold predictions of both models, is depicted in Fig. 7c. It is important to note that only a randomly chosen subset of the samples of the larger classes has been used to compute the overall prediction scheme confusion matrix, so that the three classes have the same number of samples. The observed performance is consistent with the individual RF model prediction results, with good probabilities of correct prediction. The only significant change is the lower correct prediction rate for the 60fps class, which can be explained by the fact that it is the intermediate class, thus sharing characteristics with the other two classes, which makes the discrimination of its samples harder to generalize. In addition, the extreme errors, i.e. the critical under-estimation of a 120fps sample with a predicted 30fps label or the exact inverse, rarely occur. This tends to bolster the hypothesis of the ground truth being imperfect due to possibly unstable/blurry annotation frontiers between adjacent labels. If this hypothesis is correct, the combined prediction model should lead to VFR output video sequences visually identical to the HFR input. However, the compression and encoding complexity gains should be slightly lower than with ground truth labels. The next section aims at verifying this statement.

VI. RESULTS AND ANALYSIS
Before analyzing the coding performance of the VFR coding scheme, the visual quality of the output VFR video must be evaluated to assess whether the RF model frame-rate decisions preserve the perceptual quality compared to the HFR source video. This section first describes the characteristics of the test set sequences and the chosen subjective evaluation
methodology. Then, the results of the subjective tests are detailed and discussed for both uncompressed and compressed VFR videos. Finally, the coding performance of the VFR coding scheme is presented in terms of bit-rate savings and complexity reduction.

Fig. 6. Feature importance measured with Mean Decrease in Gini Impurity (yellow: spatial features, green: motion features): (a) 120fps-FD RF model, (b) 60fps-30fps RF model.

Fig. 7. Individual classifier and overall scheme confusion matrices for a 10-fold cross-validation training with their respective datasets: (a) 120fps-FD classifier, (b) 60fps-30fps classifier, (c) overall scheme.

A. Specific Test Datasets and Subjective Tests Motivations
A total of 15 sequences has been selected to validate the performance of the VFR model. These sequences are unknown to the model, i.e. they have not been used during the cross-validation training of the two binary RF classifiers. The input frame-rate is 120 fps for all test sequences and their durations range between 9 and 13 seconds. Source content with a 3840x2160 original resolution has been downsampled to the 1920x1080 resolution with Lanczos3 filters [39] to ensure consistency during the subjective test.

The test set sequences have been selected from various sources to cover a wide range of spatio-temporal characteristics, both in terms of temporal and spatial information (SI
and TI), as recommended in [42] and shown in Fig. 8. They also depict several use-cases, including sporting events and movie-type clips, in addition to the more common natural video content.

Fig. 8. SI-TI characteristics for test sequences of the three considered sets.

In order to generate the final VFR model, both RF classifiers have been retrained on their respective whole datasets, with the feature sets and hyper-parameters determined via cross-validation. The prediction results for the test set are depicted in Fig. 9. As can be observed, high correct prediction rates are reached for the 30fps, 60fps and 120fps classes, showing the capacity of the model to generalize the VFR classification problem to unknown data. The slightly better prediction results for the test set, compared to the cross-validation predictions presented in Section V-C, may be explained by the more accurate ground truth labels obtained for the test set, thus minimizing the labeling issue previously raised. In addition, the low amount of samples falsely predicted with a lower-frame-rate class label should lead to a good perceptual quality of the VFR output videos, very close to the HFR source content.

To verify this statement, a subjective test comparing the uncompressed HFR and VFR videos has been designed. The 60 fps and 30 fps versions, obtained by frame decimation, have
also been introduced in the subjective evaluation to assess the interest of variable frame-rate compared to systematic temporal downsampling in terms of perceived quality.

Fig. 9. Cascaded RF model prediction performance on the test set.
B. Subjective Evaluation Methodology
The considered subjective evaluation aims at assessing the effect of a system, here the VFR coding scheme, on the visual quality. For this kind of test, the ITU-R BT.500-13 recommendation [40] proposes the Double Stimulus Continuous Quality Scale (DSCQS) method, which consists in showing the observer pairs of videos - the un-processed source content and the same sequence processed with the system under test - and asking the observer to rate the quality of both sequences. The grading scale is a continuous vertical scale divided into 5 equal parts corresponding to the common 5-level ITU-R quality labels:
Excellent, Good, Fair, Poor and
Bad.

For each test session, a series of video pairs is presented to the observer in a random order, to distribute the degrees of quality impairments over the entire session. Each pair of videos is internally random, i.e. the observer is not aware of the position of the reference un-processed video (A or B), which is presented twice, successively. Fig. 10 depicts the structure of a Basic Test Cell (BTC) presenting a pair of videos to assess. As can be observed, each BTC begins with a 2-second message indicating the id number of the current test point and ends with a message asking to vote. In addition, each display of a 10-second sequence is preceded by a 1-second message indicating if the following video is A or B on the answer sheet, making the total duration of a BTC equal to 50 seconds.

A total number of 10 sequences has been selected within the test set for the subjective test, as indicated in Fig. 8. The sequence set has been formed to cover a wide range of spatio-temporal characteristics and content types. For each sequence, 4 frame-rate pairs have been evaluated by the observers, comparing the 120 fps reference with the 120 fps (hidden reference), VFR, 60 fps and 30 fps versions. Therefore, a total of 40 BTCs were presented to each observer, randomly divided into two 20-minute sessions separated by a 10-minute break. VFR video sequences have been obtained using the predicted frame-rates resulting from the proposed VFR model. Fig. 11 depicts the evolution of frame-rate decisions over the duration of each sequence of the subjective evaluation test set. As can be observed, the predicted frame-rates are highly dependent on the test sequence, as expected considering the wide range of spatial and temporal information characteristics of the selected sequence set. In addition, the predicted frame-rates also vary over time for most of the test sequences, demonstrating the interest of the 4-frame level of granularity proposed for frame-rate decisions.

Fig. 10. Subjective test BTC structure for the DSCQS evaluation method.

The test was conducted in a controlled laboratory environment, with a viewing distance fixed to 3 times the screen height. A 65-inch LG OLED B6 display with HFR capabilities and a peak luminance of 340 cd/m2 has been used for both subjective tests. During the whole duration of the tests, all internal post-processing was disabled to avoid any impact on the perceived quality. Each test sequence in raw format (YUV 4:2:0 and 8-bit precision) has been encoded using the libx265 encoder at 100 Mbps in order to be presented to the TV set via the USB3 interface. Special care has been taken to ensure that the encoding needed for display did not introduce any 'coding' artifacts. A total of 19 participants took part in the subjective test. They were aged between 20 and 53, with (corrected-to-) normal visual acuity and color vision. A post-screening analysis of the results has been carried out, according to the method described in ITU-R Rec. BT.500-13, to detect and reject the outliers before computing the Mean Opinion Score (MOS) values.

C. Subjective Visual Quality Results
Fig. 12 shows the results of the subjective test carried out to demonstrate the interest of variable frame-rate and to evaluate the perceived quality of the proposed VFR model output. For each sequence of the previously presented test set, the Differential Mean Opinion Score (DMOS) values, computed using Equation (9), of each tested frame-rate are depicted together with their associated
Confidence Intervals (CIs). Since none of the participants were flagged as outliers after the post-screening analysis, the presented DMOS values have been obtained using the results from all 19 participants.
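Equation (9) reduces to a simple per-observer average of score differences; a minimal sketch:

```python
def dmos(scores_ref, scores_test):
    """DMOS for one sequence and one tested frame-rate (Equation (9)):
    100 minus the mean, over observers, of the difference between the
    hidden 120 fps reference score and the tested-version score."""
    assert len(scores_ref) == len(scores_test)
    n = len(scores_ref)
    return 100 - sum(r - t for r, t in zip(scores_ref, scores_test)) / n
```

A tested version scored identically to the hidden reference by every observer thus obtains a DMOS of 100.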
$$DMOS_f(s) = 100 - \frac{1}{N}\sum_{n=1}^{N}\left[S_{n,120fps}(s) - S_{n,f}(s)\right], \quad (9)$$

with $N$ the total number of valid participants ($N = 19$ in this test) and $DMOS_f(s)$ the DMOS value for sequence $s$ at the tested frame-rate $f$, $f \in \{120\,fps, VFR, 60\,fps, 30\,fps\}$. The pair $(S_{n,120fps}(s), S_{n,f}(s))$ represents the scores attributed to sequence $s$ at, respectively, the hidden 120 fps reference frame-rate and the tested frame-rate $f$, i.e. both videos of a given BTC, by the $n$-th participant, $n \in \{1, .., N\}$.

The first statement that can be made by analyzing the results of the subjective test is that, as previously stated, the benefit brought by a frame-rate of 120 images per second compared to lower frame-rates is highly content-dependent. Indeed, for the sequences Rugby7, library and Rugby6, there is a significant difference between the DMOS values associated to the 120 fps frame-rate and those of the 60 fps and 30 fps frame-rates. The
Fig. 11. Frame-rate decisions of the VFR algorithm for the test set sequences (Refuge1, Rowing1, Rugby7, library, bouncyball, Refuge4, Rowing2, Rugby6, flowers, martial_arts).

Fig. 12. Mean Opinion Score values with 95% confidence intervals for test set sequences and subjectively tested frame-rates.

same trend can be observed for the bouncyball, Rowing2 and martial_arts sequences. However, for these sequences, the CIs of the two corresponding
DMOS are overlapping, thus a significant difference between the perceived quality of the two frame-rates cannot be confidently guaranteed for these sequences. For other sequences, namely
Refuge1, Refuge4 and flowers, the perceived qualities of the 120 fps and 60 fps versions seem equivalent, with similar DMOS values and highly overlapping CIs. Finally, the sequence
Rowing1 shows no visual difference even with a frame decimation down to 30 fps.

Comparing the perceived qualities of the VFR model outputs with their source HFR 120 fps counterparts, the DMOS values of both configurations appear to be equivalent for every sequence. This trend highlights the interest of variable frame-rate, with its capacity to adapt to the quantity of movement possibly varying over time. For instance, the library sequence opens on a camera panning with a gradually slowing speed, which then stops at the middle of the sequence on a stationary top spinning at high speed. The first part of the video requires 120 fps to correctly portray the camera panning, while lower frame-rates can be used without introducing artifacts as the speed of the camera gradually drops. For this sequence, participants attributed significantly lower scores to the 60 fps and 30 fps frame-rates, due to the important motion artifacts present in the first part of the video at these frame-rates. On the contrary, the VFR model correctly lowers the frame-rate when the content permits it, resulting in a score identical to the one attributed to the source HFR video. However, despite highly correlated DMOS values and overlapping CIs, there is still a chance that the perceived qualities of the compared frame-rates are actually different.

To confirm these observations and confidently state that the VFR model output perceived quality is the same as for the source HFR content, a more rigorous analysis can be performed using a two-sample unequal variance Student's t-test with a two-tailed distribution (also called Welch's t-test). This test allows to determine whether the perceived qualities given by the MOS values of each pair of tested frame-rates are "significantly" different or not. In this case, the null hypothesis, $H_0$, would be that the tested frame-rate $f_{test}$ has the same perceived quality as the considered reference frame-rate $f_{ref}$.
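As a sketch of the machinery behind this test, the Welch t-statistic of Equation (10) and its Welch-Satterthwaite degrees of freedom can be computed as follows (a stdlib-only illustration, not the authors' tooling):

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t-statistic (Equation (10)) and the Welch-Satterthwaite
    degrees of freedom for two independent samples with possibly
    unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances.
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The two-tailed p-value then follows from a Student's t-distribution with `df` degrees of freedom, e.g. `scipy.stats.t.sf(abs(t), df) * 2`.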
The alternate hypothesis, $H_a$, would be that there is a difference between the perceived qualities of $f_{test}$ and $f_{ref}$. In order to test the similarity of each possible pair of frame-rates, the possible values for both frame-rates are $f_{test} \in \{VFR, 60\,fps, 30\,fps\}$ and $f_{ref} \in \{120\,fps, VFR, 60\,fps\}$.

First, considering the sample populations formed by the scores attributed to a sequence $s$ at the two compared frame-rates $f_{test}$ and $f_{ref}$, the t-statistic $t_{f_{test},f_{ref}}(s)$ can be used, expressed as follows

$$t_{f_{test},f_{ref}}(s) = \frac{\bar{S}_{f_{test}}(s) - \bar{S}_{f_{ref}}(s)}{\sqrt{\dfrac{\sigma^2_{f_{test}}(s)}{N_{f_{test}}} + \dfrac{\sigma^2_{f_{ref}}(s)}{N_{f_{ref}}}}}, \quad (10)$$

with $\bar{S}_{f_i}(s)$, $\sigma^2_{f_i}(s)$ and $N_{f_i}$ the sample mean, sample variance and sample population size for frame-rate $f_i$, $i \in \{test, ref\}$. In this test, $N_{f_{test}} = N_{f_{ref}} = N$, the number of observers that took part in the subjective test.

Then, by approximating the t-statistic with a Student's t-distribution, a value $p$, which indicates the degree of correlation between the means of the two sample populations, can be computed from the t-statistic. The higher the p-value is, the more significant the similarity between the distributions of the two populations is. A p-value lower than 0.05 indicates that there is statistical significance that the tested frame-rate $f_{test}$ has a different perceived quality compared to the considered reference frame-rate $f_{ref}$. Indeed, in this case, there is a low probability of committing a type-I error, i.e. rejecting the null hypothesis when it is true, meaning that the null hypothesis can be confidently rejected. On the contrary, if the p-value is greater than or equal to 0.05, the null hypothesis cannot be safely rejected and both frame-rates can be considered to have the same perceived quality.

Finally, the p-value does not give information on the probability of committing a type-II error, i.e. a failure to reject the null hypothesis when the alternate hypothesis is true, which is thus still a possibility. To ensure a statistically powerful test, the type-II error probability $\beta$ of the statistical test must be kept low. This probability $\beta$ has been computed for each possible pair of tested and reference frame-rates, resulting in a low average $\beta$ value, showing that there is, on average, a low chance of committing a type-II error. Therefore, the similarity assessment for each pair of possible frame-rates can be based only on the p-values resulting from the Student's t-test.

Fig. 13. p-value probabilities resulting from the two-sample unequal variance bilateral Student's t-test on MOS values for each pair of tested frame-rates and each test set sequence. p ≥ 0.05 (green) means there is no significant difference between the MOS values of the row and column frame-rate labels, while p < 0.05 (red) indicates that the MOS value of the row frame-rate label is significantly lower than the MOS value of the column frame-rate label.

Fig. 13 depicts the p-values computed for each sequence and each possible frame-rate combination. Green-colored cells show the frame-rate pairs for which the associated p-value is greater than or equal to 0.05. Since every VFR vs 120 fps comparison falls within this category, it can be confidently concluded that the perceived quality of the VFR model output video is always the same as that of the original 120 fps version. This confirms that the under-estimated frame-rate predictions, identified in the confusion matrix depicted in Fig. 9, do not impact the perceived quality of the VFR videos. This also tends to
This also tends tovalidate the hypothesis made while analyzing training errors,stating that the ground truth is imperfect due to the coarse-grained nature of the ground truth annotations. Indeed, with itsfine-grained decisions, the VFR model is capable of capturingsmaller variations of critical frame-rates, thus resulting in pre-dictions different from the ground truth, which are identifiedas prediction errors. D. Compression Efficiency and Complexity Reduction
In order to evaluate the impact of VFR on coding performance, both the source HFR videos and the VFR model outputs have been encoded using the HEVC reference software encoder HM16.12 [43]. The encoder was configured to use the HEVC Common Test Conditions (CTC) in Random Access (RA) configuration with a GOP size of 16 pictures and an intra-period of approximately 1 second, to match the considered broadcasting use-case. The quantization parameter was set to QP = {22, 27, 32, 37} to cover a wide range of bit-rates and applications.

For the VFR encodings, the HEVC reference software encoder has been modified to handle the critical frame-rate decision coming from the proposed VFR module for each chunk of 4 input frames. Fig. 14 depicts the GOP structures that need to be supported in order to encode a VFR video sequence. Thanks to the built-in support of temporal scalability, removing the frames from upper Temporal Layers (TLs) does not break the coding dependencies. The VFR GOP structure is thus enabled in the reference software by simply skipping frames within the core encoding loop, depending on the frame-rate decision fed to the encoder and the current Picture Order Count (POC). This results in a VFR encoding with a lower bit-rate and reduced coding complexity, while producing a bitstream decodable by the reference software decoder without any modification.

The bit-rate savings presented in this performance evaluation assume that, at the same QP, the perceived quality of the VFR decoded video is the same as that of the decoded original 120 fps video, as was demonstrated for the VFR and 120 fps uncompressed inputs. Indeed, on one hand, any frame decoded from the VFR bitstream is exactly the same as its corresponding frame in the decoded 120 fps video, due to the use of identical GOP structures. On the other hand, the validity of the VFR frame-rate decisions on compressed data has been verified through an expert subjective test.
This test aimed at evaluating, independently, the visual quality of the compressed 120 fps and VFR sequences for a subset of the VFR test set. The protocol used is the standard Degradation Category Rating (DCR) method [42] with an 11-grade scale. This subjective test resulted in the same MOS values for both the 120 fps and VFR decoded sequences at every tested bit-rate, showing that the critical frame-rate decisions obtained with the VFR model trained on uncompressed data remain valid for compressed content. The subjective test also showed that the frame-rate could be further reduced in some cases, due to
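The frame-skipping rule described above (dropping upper-TL frames according to the per-chunk decision and the current POC) can be sketched as follows. This is an illustrative Python reconstruction, not the authors' modified HM encoder, and the helper name is hypothetical:

```python
# Sketch (not the authors' code) of the decimation rule used to turn a
# 120 fps source into a VFR encode, one frame-rate decision per 4-frame chunk.

def kept_pocs(num_frames, chunk_decisions, chunk_size=4, full_rate=120):
    """Return the POCs to encode, given one decision (30, 60 or 120 fps)
    per 4-frame chunk of a 120 fps source.

    A decision of 120 keeps every frame of the chunk, 60 keeps every 2nd
    frame and 30 keeps every 4th, matching the dyadic temporal layers of
    the Random Access GOP: dropping upper TLs never breaks dependencies.
    """
    kept = []
    for poc in range(num_frames):
        fps = chunk_decisions[poc // chunk_size]
        step = full_rate // fps          # 1, 2 or 4
        if poc % step == 0:              # frame survives the decimation
            kept.append(poc)
    return kept

# One GOP of 16 frames: chunks decided at 120, 60, 30 and 60 fps.
print(kept_pocs(16, [120, 60, 30, 60]))  # -> [0, 1, 2, 3, 4, 6, 8, 12, 14]
```

Because the kept frames sit on the lower temporal layers of the regular RA GOP, the resulting bitstream remains decodable by an unmodified reference decoder, as stated above.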
Fig. 14. Example of GOP structures of size 16 for (a) a source HFR 120 fps content and (b) a VFR encoding with a different frame-rate for each 4-frame chunk.

the removal, by the encoding process, of details that justified a higher frame-rate in the original uncompressed video. The coding performance of the proposed model could thus be slightly improved by training it on compressed data. However, such an improvement would require annotating the entire database at several QPs, which would be a very time-consuming task for a small coding gain. The encoding results presented in this work are thus obtained with the model trained on uncompressed videos.

Table II summarizes the performance of the VFR encodings compared to regular 120 fps
HEVC encodings, in terms of both bit-rate savings and encoding complexity reduction, for the 15 sequences of the objective evaluation test set. The proportion of frames dropped by the VFR coding scheme is also given for information. Results are presented for both the VFR model decisions and the ground truth decisions, in order to compare the performance at two different levels of granularity.

With the VFR model decisions, the VFR coding scheme offers 4.3% bit-rate savings on average, ranging from 0% to 15% for sequences where 120 and 30 frames per second are chosen for the whole sequence, respectively. For sequences where 60 fps is mostly chosen, or with temporally varying decisions, the bit-rate savings are generally around 5%. These bit-rate savings are not equal to the proportion of frames dropped by the VFR model, due to the significantly lower number of bits used to encode the frames of the upper TLs. Indeed, upper TL frames are coded with larger quantization steps and greatly benefit from the inter-picture predictions of the RA coding configuration. The amount of transmitted quantized residuals is thus lower for these frames, especially when the motion is easily predictable and the source content does not contain high spatial detail. Regarding the complexity reduction brought by the VFR coding scheme, the results are close to the proportion of frames dropped, with an average encoding complexity reduction of 28%, ranging from 0% to 70%. The per-sequence results follow the same trend as the bit-rate savings, but with larger gain variations. The difference between the complexity reduction and the proportion of frames dropped mainly comes from the slightly lower coding complexity of the upper TL frames compared to the kept frames of the lower TLs. Indeed, a higher number of residual coefficients to binarize and process with the entropy coding engine increases the encoding time.
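As a cross-check against the "Frames Dropped" column of Table II, the proportion of frames removed by a list of per-chunk decisions can be computed directly. The function name and inputs below are illustrative, not from the paper:

```python
# Back-of-the-envelope sketch relating the per-chunk frame-rate decisions
# to the fraction of source frames dropped by the VFR scheme.

def frames_dropped_ratio(chunk_decisions, chunk_size=4, full_rate=120):
    """Fraction of source frames removed: a chunk kept at f fps retains
    chunk_size * f / full_rate of its frames."""
    total = len(chunk_decisions) * chunk_size
    kept = sum(chunk_size * fps // full_rate for fps in chunk_decisions)
    return 1 - kept / total

# A sequence decided at 60 fps throughout drops half of its frames.
print(frames_dropped_ratio([60] * 8))  # -> 0.5
```

As the paragraph above explains, the measured bit-rate savings in Table II are noticeably smaller than this ratio, since the dropped upper-TL frames are precisely the cheapest ones to encode.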
For the decoding complexity, the detailed results are not presented in this paper, but the observed gains are highly similar to those at the encoding side for the VFR coding scheme.

TABLE II
VFR HEVC ENCODING PERFORMANCE COMPARED TO 120 FPS HEVC ENCODINGS, FOR VFR PREDICTED LABELS (MODEL) AND GROUND TRUTH (G-T) LABELS ON THE TEST SET.

Sequence        Bit-rate savings      Enc. time reduction   Frames dropped
                Model      G-T        Model      G-T        Model      G-T
Refuge1         -1.1 %     -4.9 %     7.8 %      39 %       10 %       50 %
Rowing1         -9.3 %     -9.3 %     60 %       60 %       75 %       75 %
Rugby7          -0.1 %      0.0 %     0.6 %      0.0 %      0.9 %      0.0 %
library         -5.0 %     -4.8 %     39 %       37 %       42 %       41 %
bouncyball      -2.5 %      0.0 %     6.7 %      0.0 %      8.7 %      0.0 %
Refuge4         -3.2 %     -3.5 %     47 %       49 %       52 %       55 %
Rowing2         -0.6 %     -0.4 %     6.1 %      5.4 %      9.7 %      8.7 %
Rugby6           0.0 %      0.0 %     0.0 %      0.0 %      0.0 %      0.0 %
flowers         -4.1 %     -4.1 %     41 %       40 %       50 %       50 %
martial arts    -4.0 %     -0.5 %     23 %       5.9 %      28 %       7.6 %
Katana          -5.9 %     -1.2 %     35 %       11 %       38 %       13 %
NYCBike         -6.2 %     -5.6 %     27 %       25 %       32 %       29 %
pour            -1.6 %     -1.6 %     11 %       11 %       16 %       15 %
Refuge2         -15 %      -14.6 %    70 %       68 %       74 %       71 %
Refuge3         -5.8 %     -5.6 %     53 %       52 %       58 %       56 %
Average         -4.3 %     -3.7 %     28 %       27 %       33 %       32 %

With the ground truth annotated frame-rates, the bit-rate savings and complexity reduction results are very close to those obtained with the predicted frame-rates. This can be explained by the high correct prediction rate of the VFR model on the test set. The results only differ significantly for a few sequences.
Refuge1 shows lower gains for the VFR model output, due to an over-estimation of the required frame-rate, i.e. an alternation between 120 and 60 fps predictions while the annotated ground truth frame-rate is 60 fps for the major part of the sequence. The opposite situation can be observed for the sequences bouncyball, martial arts and Katana, where the VFR model selects lower frame-rates more frequently than the ground truth, resulting in higher gains.

VII. CONCLUSION
In this paper, a new variable frame-rate coding scheme is proposed for the broadcast delivery of HFR (120 fps) content. The proposed scheme incorporates a machine-learning-based VFR model capable of dynamically adapting the frame-rate of the video before encoding it and transmitting it to the end receiver.

The VFR model relies on several spatio-temporal features extracted from each frame of the input video to predict the lowest artifact-free frame-rate through two cascaded binary RF classifiers. The frame-rate adaptation is performed dynamically by choosing, for each chunk of 4 consecutive input frames, its associated critical frame-rate among the three possible values: 30, 60 or 120 fps. The model achieves a high average correct prediction rate of the critical frame-rate, while keeping the rate of frame-rate under-estimation errors low. The visual quality of the generated VFR videos has been carefully evaluated through formal subjective tests, showing a perceived quality identical to that of the source HFR content.

From a coding performance perspective, the proposed VFR coding scheme provides average bit-rate savings of 4.3%, in addition to an average complexity reduction of 28% at the encoding side, with similar reductions observed at the decoding side. It should be noted that this work can be applied to other broadcast frame-rates, such as 25, 50 and 100 fps, by adapting the proposed algorithm.

The work proposed in this paper has been shown at both the International Broadcasting Convention (IBC) 2019 and the National Association of Broadcasters (NAB) Show 2019, through a real-time demonstration in which the input legacy HFR and processed VFR videos were displayed synchronously on two HFR screens to demonstrate their equivalence in perceived quality.
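As an illustration of the cascaded binary decision described above, a minimal sketch using scikit-learn (assumed available) is given below. The features, thresholds and training data are synthetic stand-ins, not the paper's actual features or trained model:

```python
# Illustrative sketch of two cascaded binary random-forest decisions:
# stage 1 asks whether the full 120 fps is needed; if not, stage 2
# chooses between 60 and 30 fps. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy per-chunk features (e.g. a motion-magnitude-like value and a
# spatial-detail-like value); the real model uses the paper's features.
X = rng.random((300, 2))
y = np.where(X[:, 0] > 0.66, 120, np.where(X[:, 0] > 0.33, 60, 30))

clf_120 = RandomForestClassifier(n_estimators=50, random_state=0)
clf_120.fit(X, y == 120)                  # stage 1: 120 fps vs. lower
low = y < 120
clf_60 = RandomForestClassifier(n_estimators=50, random_state=0)
clf_60.fit(X[low], y[low] == 60)          # stage 2: 60 fps vs. 30 fps

def predict_rate(features):
    """Cascade the two binary classifiers into a 3-class decision."""
    x = np.asarray(features).reshape(1, -1)
    if clf_120.predict(x)[0]:
        return 120
    return 60 if clf_60.predict(x)[0] else 30

print(predict_rate([0.9, 0.5]))   # a high first feature favours 120 fps
```

Cascading two binary classifiers, rather than training one 3-class model, makes it possible to tune each stage separately, e.g. to penalize frame-rate under-estimations more heavily at the stage where they would be visible.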
The demonstration includes a real-time software implementation of the VFR prediction model, with a 7.2 ms (138 fps) average runtime for feature computation (6.9 ms) and frame-rate prediction (0.3 ms) on HD sequences, processed on a common consumer CPU.

The proposed solution is a practical candidate to lower the requirements for the broadcast delivery of the upcoming HFR services of the DVB UHD second deployment phase. Additionally, thanks to its hardware-friendly design (feature computation based on well-known H.264 encoding tools and existing RF hardware implementations [44]), the proposed VFR method can be considered by hardware video encoder manufacturers to enhance the quality of experience and reduce the energy footprint of their devices.

ACKNOWLEDGMENT
The authors would like to thank Franck Chi, Maxime Peralta and Clément Brossard, who also contributed to this project.

REFERENCES
[1] ITU-R, "Recommendation BT.2020-1: Parameter Values for UHDTV Systems for Production and International Programme Exchange."
[2] ITU-R, "Recommendation BT.709-5: Parameter Values for the HDTV Standards for Production and International Programme Exchange."
[3] M. Nilsson, "Ultra high definition video formats and standardisation," BT Media and Broadcast Research Paper, 2015.
[4] M. Sugawara and K. Masaoka, "UHDTV Image Format for Better Visual Experience," Proceedings of the IEEE, vol. 101, no. 1, pp. 8–17, 2013.
[5] A. Mackin, K. C. Noland, and D. R. Bull, "High frame rates and the visibility of motion artifacts," SMPTE Motion Imaging Journal, vol. 126, no. 5, pp. 41–51, 2017.
[6] Y. Kuroki, T. Nishi, S. Kobayashi, H. Oyaizu, and S. Yoshimura, "A psychophysical study of improvements in motion-image quality by using high frame rates," Journal of the Society for Information Display, 2007.
[7] K. Noland, "The application of sampling theory to television frame rate requirements," BBC R&D White Paper, vol. 282, 2014.
[8] J. Laird, M. Rosen, J. Pelz, E. Montag, and S. Daly, "Spatio-velocity CSF as a function of retinal velocity using unstabilized stimuli," in Human Vision and Electronic Imaging XI, vol. 6057. International Society for Optics and Photonics, 2006, p. 605705.
[9] V. Hulusic, G. Valenzise, J.-C. Gicquel, J. Fournier, and F. Dufaux, "Quality of experience in UHD-1 phase 2 television: the contribution of UHD+HFR technology," in Multimedia Signal Processing (MMSP), 2017 IEEE 19th International Workshop on. IEEE, 2017, pp. 1–6.
[10] A. Mackin, F. Zhang, M. A. Papadopoulos, and D. Bull, "Investigating the impact of high frame rates on video compression," in Image Processing (ICIP), IEEE International Conference on. IEEE, 2017.
[11] R. Salmon, T. Borer, M. Pindoria, M. Price, and A. Sheikh, "Higher frame rates for television," IBC Conference 2013, 2013.
[12] A. Mackin, F. Zhang, and D. R. Bull, "A study of subjective video quality at various frame rates," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 3407–3411.
[13] Z. Ma, M. Xu, Y.-F. Ou, and Y. Wang, "Modeling of rate and perceptual quality of compressed video as functions of frame rate and quantization stepsize and its applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 5, pp. 671–682, 2012.
[14] Q. Huang, S. Y. Jeong, S. Yang, D. Zhang, S. Hu, H. Y. Kim, J. S. Choi, and C.-C. J. Kuo, "Perceptual quality driven frame-rate selection (PQD-FRS) for high-frame-rate video," IEEE Transactions on Broadcasting, vol. 62, no. 3, pp. 640–653, 2016.
[15] A. V. Katsenou, D. Ma, and D. R. Bull, "Perceptually aligned frame rate selection using spatio-temporal features," in Picture Coding Symposium (PCS), 2018. IEEE, 2018, pp. 1–5.
[16] M. Afonso, F. Zhang, and D. R. Bull, "Video compression based on spatio-temporal resolution adaptation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 275–280, 2018.
[17] Advanced Television Systems Committee (ATSC) Standard.
[18] "Specification for the use of video and audio coding in broadcast and broadband applications," DVB, ETSI TS 101 154 V2.4.1, 2000.
[19] "The present state of ultra-high definition television," ITU-R Report BT.2246-6, March 2017.
[20] M. Emoto and M. Sugawara, "Critical fusion frequency for bright and wide field-of-view image display," Journal of Display Technology, vol. 8, no. 7, pp. 424–429, 2012.
[21] R. Salmon, M. Armstrong, and S. Jolly, "Higher frame rates for more immersive video and television," BBC White Paper WHP, vol. 209, 2011.
[22] P. G. Barten, Contrast Sensitivity of the Human Eye and Its Effects on Image Quality. SPIE Optical Engineering Press, Bellingham, WA, 1999.
[23] S. Daly, "Engineering observations from spatiovelocity and spatiotemporal visual models," in Vision Models and Applications to Image and Video Processing. Springer, 2001, pp. 179–200.
[24] R. Selfridge, K. C. Noland, and M. Hansard, "Visibility of motion blur and strobing artefacts in video at 100 frames per second," in European Conference on Visual Media Production (CVMP 2016). ACM, 2016.
[25] M. Emoto, Y. Kusakabe, and M. Sugawara, "High-frame-rate motion picture quality and its independence of viewing distance," Journal of Display Technology, vol. 10, no. 8, pp. 635–641, 2014.
[26] EBU, "EBU policy statement on ultra high definition television," European Broadcasting Union, Grand-Saconnex, Switzerland.
[27] Y. Sugito, S. Iwasaki, K. Chida, K. Iguchi, K. Kanda, X. Lei, H. Miyoshi, and K. Kazui, "A study on the required video bit-rate for 8K 120 Hz HEVC temporal scalable coding," in Picture Coding Symposium. IEEE, 2018.
[28] F. Zhang, A. Mackin, and D. R. Bull, "A frame rate dependent video quality metric based on temporal wavelet decomposition and spatiotemporal pooling," in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 300–304.
[29] F. Navarro, F. J. Serón, and D. Gutierrez, "Motion blur rendering: State of the art," in Computer Graphics Forum. Wiley Online Library, 2011.
[30] T. Brooks and J. T. Barron, "Learning to synthesize motion blur," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6840–6848.
[31] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.
[32] D. Sun, S. Roth, and M. J. Black, "A quantitative analysis of current practices in optical flow estimation and the principles behind them," International Journal of Computer Vision, vol. 106, pp. 115–137, 2014.
[33] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive convolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 670–679.
[35] M. Park, H. G. Kim, S. Lee, and Y. M. Ro, "Robust video frame interpolation with exceptional motion map," IEEE Transactions on Circuits and Systems for Video Technology, 2020.
[36] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, 2001.
[37] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and regression trees," 1984.
[38] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, 1996.
[39] C. E. Duchon, "Lanczos filtering in one and two dimensions," Journal of Applied Meteorology, vol. 18, no. 8, pp. 1016–1022, 1979.
[40] ITU-R, "Recommendation BT.500-13: Methodology for the Subjective Assessment of the Quality of Television Pictures."
[41] N. Chinchor, "MUC-4 evaluation metrics," in Proceedings of the 4th Conference on Message Understanding. Association for Computational Linguistics, 1992, pp. 22–29.
[42] "Subjective video quality assessment methods for multimedia applications," ITU-T Rec. P.910, April 2008.
[43] "HEVC reference software version 16.12." [Online]. Available: https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.12/
[44] B. Van Essen, C. Macaraeg, M. Gokhale, and R. Prenger, "Accelerating a random forest classifier: Multi-core, GP-GPU, or FPGA?" IEEE, 2012, pp. 232–239.
Glenn Herrou received the Dipl.-Ing. (M.Sc.) degree in electrical and computer engineering and the Ph.D. degree in signal processing from the Institut National des Sciences Appliquées (INSA) de Rennes, France, in 2016 and 2019, respectively. From 2016 to 2019, he worked at the Institute of Research and Technology b<>com, Cesson-Sévigné, France, on projects focusing on adaptive spatio-temporal resolution for efficient video coding. Since 2020, he has been a post-doctoral researcher in the VAADER team of the Institut d'Électronique et des Technologies du numéRique (IETR), Rennes, France. His current research interests focus on video coding and applied Machine/Deep Learning.

Wassim Hamidouche received the Master's and Ph.D. degrees, both in image processing, from the University of Poitiers, France, in 2007 and 2010, respectively. From 2011 to 2013, he was a junior scientist in the video coding team of the Canon Research Center in Rennes, France. He was a post-doctoral researcher from April 2013 to August 2015 with the VAADER team of IETR, where he worked on a collaborative project on HEVC video standardisation. Since September 2015, he has been an Associate Professor at INSA Rennes and a member of the VAADER team of the IETR laboratory. He joined the Advanced Media Content Lab of the b<>com IRT Research Institute as an academic member in September 2017. His research interests focus on video coding and the security of multimedia contents. He is the author/co-author of more than one hundred and twenty (120+) papers in top journals and conferences in image processing, two MPEG standards, two patents, several MPEG contributions, public datasets and open-source software projects.