Capturing Video Frame Rate Variations via Entropic Differencing
Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Pavan C. Madhusudana
The University of Texas at Austin, Austin, TX, [email protected]

Neil Birkbeck
Google Inc., Mountain View, CA, [email protected]

Yilin Wang
Google Inc., Mountain View, CA, [email protected]

Balu Adsumilli
Google Inc., Mountain View, CA, [email protected]

Alan C. Bovik
The University of Texas at Austin, Austin, TX, [email protected]
ABSTRACT
High frame rate videos have become increasingly popular in recent years, driven largely by the entertainment and streaming industries' push to provide consumers with a high quality of experience. To achieve the best trade-off between bandwidth requirements and video quality in terms of frame rate adaptation, it is imperative to understand the effects of frame rate on video quality. In this direction, we make two contributions: first, we design a High Frame Rate (HFR) video database consisting of 480 videos and around 19,000 human quality ratings. We then devise a novel statistical entropic differencing method, based on a Generalized Gaussian Distribution model in the spatial and temporal band-pass domains, which measures the difference in quality between reference and distorted videos. The proposed design is highly generalizable and can be employed when the reference and distorted sequences have different frame rates, without any need for temporal upsampling. We show through extensive experiments that our model correlates very well with subjective scores in the HFR database and achieves state of the art performance when compared with existing methodologies.

KEYWORDS
high frame rate, video quality assessment, full reference, entropy, natural video statistics, generalized Gaussian distribution
1 INTRODUCTION

As current media technology continues to emphasize ever higher quality regimes and more immersive and engaging experiences for consumers, the need to extend current video parameter spaces along spatial and temporal resolutions, screen sizes and dynamic range has become a topic of great importance, especially in the media and streaming industry. Existing and emerging standards have increasingly focused on improving spatial resolution (4K/8K) [13], High Dynamic Range (HDR) [16, 21] and multiview formats [10, 36]. However, much less emphasis has been placed on increasing frame rates, and for a long time the frame rates associated with television, cinema and other video streaming applications have changed little, rarely exceeding 60 frames per second (fps).

Various factors have limited wider adoption of High Frame Rate (HFR) video. Switching to HFR requires complex capture and display technologies that have not been commonly available. However, with the development of advanced digital cameras such as GoPro [3] and the Sony RX series [4], and the widespread availability of high performance monitors such as the Acer Predator [1], which are primarily designed for gaming applications, HFR technology is well poised for general adoption. Another possible reason for the limited popularity of HFR is limited knowledge about its perceptual benefits, which partly arises from the scarcity of HFR content. Recently, the HFR domain has gathered significant interest in the research community with the publication of databases such as Waterloo HFR [27] and BVI-HFR [20] that exclusively target HFR content.

Before future video pipelines can exploit HFR formats, it is imperative to analyze and evaluate the perceptual benefits of using HFR videos. A natural question arises as to whether viewing a particular video at a higher frame rate is better than viewing its lower frame rate version.
What is the quality gain achievable by going from a lower to a higher frame rate? This work addresses these concerns by analyzing subjective quality as well as by designing an objective video quality index that seeks to accurately quantify the quality variations that occur due to frame rate changes.

Perceptual Video Quality Assessment (VQA) is an integral component of numerous video applications such as digital cinema, video streaming services (YouTube, Netflix, Hulu, etc.) and social media (Facebook, Instagram, etc.). VQA models can be broadly classified into three main categories [7]: Full-Reference (FR), Reduced-Reference (RR) and No-Reference (NR) models. FR VQA models require the entire pristine undistorted stimulus along with its degraded versions [22, 33, 35, 41, 45, 47, 49], while RR models operate with limited reference information [5, 17, 37, 38, 44]. NR models operate without any knowledge of the pristine stimulus [18, 23, 24, 32]. This work addresses the problem of quality evaluation when the pristine and distorted sequences can possibly have different frame rates, so our primary focus is on FR and RR VQA methods.

Figure 1: Sample frames from source sequences. (a) bobblehead, (b) books, (c) bouncyball, (d) catch-track, (e) Flips, (f) Hurdles, (g) Longjump, (h) 3 Runners. (a)-(d): sequences from BVI-HFR; (e)-(h): sequences from Fox Media

It is a common belief that HFR videos can provide better visual quality, with reduced flicker and motion blur, particularly on content involving high motion. However, limited progress on HFR VQA models makes it hard to analyze the actual perceptual gain associated with switching to the HFR domain. Although a large number of FR VQA models have been proposed in the literature, almost all of them require the reference and distorted videos to have the same frame rate, since they typically perform pointwise comparisons. Even in
RR models, the reduction in reference information occurs mainly along spatial dimensions and not in the temporal domain. Although existing FR/RR methods can be applied trivially by upsampling the distorted sequence or downsampling the reference (we assume the reference frame rate is always at least as high as that of the distorted video), we show in our work that this can be counterproductive and can lead to highly inaccurate quality predictions. Moreover, the upsampling/downsampling process can introduce undesirable artifacts which can further affect the accuracy of quality estimates.

There has been very limited work addressing VQA in the HFR domain. One of the first models was proposed by Nasiri et al. [28], who measured the amount of aliasing occurring in the temporal frequency spectrum and employed it as a measure of quality. In [26] a motion smoothness measure is proposed for cross frame rate quality evaluation. Zhang et al. [46] propose a wavelet domain Frame Rate Quality Metric (FRQM), where the difference between the wavelet coefficients of the reference and the temporally upsampled distorted sequence is used to predict quality. FRQM has the limitation that it cannot be employed when the reference and distorted videos have the same frame rate, limiting its generalizability. Moreover, all the above methods only account for artifacts arising from frame rate variations, while other artifacts, such as those of compression, are not effectively addressed.

Our main contributions are in the design of subjective and objective VQA resources that can capture distortions arising from frame rate variations, and provide quality predictions that correlate well with human perception. Our contributions are twofold. First, we construct an HFR database consisting of 480 videos and conduct a large scale human study to subjectively evaluate them, obtaining around 40 human opinion scores for each video.
This database has unique characteristics in that it contains videos up to 120 fps and also includes the effects of compression. Although HFR datasets do exist [20, 27], they either do not consider the impact of compression or only contain videos with ≤
60 fps. Our second contribution is the design of a statistical VQA model, primarily motivated by variations observed in the distributions of band-pass coefficients. We propose a novel entropic differencing method using a Generalized Gaussian Distribution (GGD) model of both spatial and temporal band-pass responses, and show its effectiveness in capturing spatio-temporal artifacts. Our proposed method is simple in nature, has very few hyperparameters to tune, and does not require any computationally intensive training process. We evaluate our model on the database we developed and show that the predicted quality estimates outperform existing methods when compared against human opinion scores.

The rest of the paper is organized as follows. In Section 2 we present details about the database construction and subjective study. In Section 3 we provide a detailed description of our proposed VQA model. In Section 4 we report and analyze various experimental results, and we provide concluding remarks in Section 5.
2 DATABASE CONSTRUCTION AND SUBJECTIVE STUDY

We created an HFR database comprising 480 videos obtained from 16 diverse contents. All the source sequences are natural scenes captured at a frame rate of 120 fps and are currently available in the public domain. Of these 16 contents, 11 were sampled from the Bristol Vision Institute High Frame Rate (BVI-HFR) video database [20]; all are 10 seconds in duration and in 1920x1080 (HD) YUV 4:2:0 8 bit format. The other 5 videos correspond to sports content and were captured by Fox Media Group in 3840x2160 (UHD) YUV 4:2:0 10 bit format, each 6-8 seconds in duration. Sample video frames from the database are shown in Fig. 1.

We created 30 test sequences from each of the source sequences using 6 different frame rates: 24, 30, 60, 82, 98 and 120 fps, with 5 compression levels per frame rate. The frame rates were chosen based on the refresh rates supported by the monitor (Acer Predator X27 [1]) employed for the human study. All the sequences were compressed using FFmpeg [12] with the VP9 [25] codec by varying the Constant Rate Factor (CRF). The CRF values were chosen as follows: 2 values corresponded to lossless (CRF=0) and the highest possible compression
level in VP9 (CRF=63), and the remaining 3 CRF values were chosen such that approximately the same bitrates were obtained across all frame rates for a given source sequence. Thus, for each source content there are 6 (frame rates) × 5 (compression levels) = 30 videos, giving 16 × 30 = 480 videos in the database.

Figure 2: Plot of average MOS across frame rates. The shaded region represents standard deviation (±σ).
We conducted a human study with 85 undergraduate student volunteers at The University of Texas at Austin. The study was of the Single-Stimulus Continuous Quality Evaluation (SSCQE) [14] type, where the videos were played using the Venueplayer [9] application developed by VideoClarity (https://videoclarity.com/). The subjects were provided with a Palette Gear console [2], and could move a cursor over a continuous quality bar and press to choose their quality score. The scores were recorded on a scale of 0 to 39, with 39 corresponding to the best quality and 0 representing videos suffering from severe distortions. Each subject rated 240 videos across 2 sessions, with each session consisting of 120 videos and lasting approximately 30-40 minutes. A total of 42 human opinion scores were obtained for every video in the database. The subject rejection procedure detailed in the ITU-R BT.500-11 recommendation [14] was used to reject scores from unreliable subjects. In our study nine subjects were rejected, and MOS was calculated by averaging the scores from the remaining subjects. Fig. 2 shows the average MOS for every frame rate along with the corresponding standard deviation. We observe an effect of diminishing returns between perceived quality and increasing frame rate. Although the quality difference is significant when 24 and 120 fps videos are compared, the gap is much smaller for videos beyond 60 fps. We calculate Difference MOS (DMOS) by subtracting each MOS from the MOS of the corresponding reference:

DMOS_i = MOS^ref_i − MOS_i    (1)
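As a hedged sketch (array and function names are ours, not from the paper's code), equation 1 amounts to a simple indexed subtraction:

```python
import numpy as np

# Equation 1: DMOS for video i is the MOS of its (lossless, 120 fps)
# reference minus the MOS of video i itself. Shapes are illustrative.
def compute_dmos(mos, ref_index):
    """mos: (N,) MOS per video; ref_index: (N,) index of each video's reference."""
    mos = np.asarray(mos, dtype=float)
    return mos[np.asarray(ref_index)] - mos

mos = [35.0, 20.0, 28.0]
ref_index = [0, 0, 0]                 # all three videos share reference video 0
dmos = compute_dmos(mos, ref_index)   # -> [0.0, 15.0, 7.0]
```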
Figure 3: Distribution of band-pass coefficients across different frame rates
3 PROPOSED METHOD

In this section, we introduce a novel FR VQA method that can be employed when the reference and distorted videos possibly have different frame rates. Many existing VQA methods rely on two ideas. One is spatial band-pass filtering, such as the DCT [32] or wavelet decompositions [35, 37, 38], which yields coefficients having heavy tailed distributions. The second is divisive normalization, based on Gaussian Scale Mixture (GSM) models [42] and the concept of contrast masking. Divisive normalization transforms band-pass image coefficients to follow an uncorrelated Gaussian distribution [23, 31]. The presence of distortions tends to disrupt these statistical regularities, so quality indices measure the deviation from the GSM model to quantify quality. Although a large number of VQA models have been proposed in the literature, much less emphasis has been placed on designing temporal models that capture temporal artifacts. Existing methods employ basic operations such as absolute temporal differences [19] and frame differences [5, 32, 38] as the temporal components of their designs. Although these seem to perform well in the general case, their application is restricted to the case where the reference and distorted videos have the same frame rate. This work generalizes such methods by removing frame rate limitations and accounting for frame rate changes.

Consider a bank of K temporal band-pass filters denoted by b_i for i ∈ {1, . . . , K}. The temporal band-pass response of a video V(x, t) (where x = (x, y) represents spatial coordinates and t denotes the temporal dimension) is given by

B_i(x, t) = V(x, t) ∗ b_i(t),   i ∈ {1, . . . , K},    (2)

where B_i denotes the band-pass response of the i-th filter. Note that these are 1D filters applied only along the temporal dimension. Temporal differences are a special case, where the band-pass filter is essentially the high pass component of a one level Haar wavelet filter.
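A minimal sketch of the temporal filtering in equation 2, assuming a numpy video tensor of shape (T, H, W); only the single-level Haar high-pass band is shown here, whereas the paper uses a 3-level Haar decomposition:

```python
import numpy as np

# Equation 2 applies 1D filters along the time axis only. The simplest
# band-pass filter b_i is the Haar high-pass [1, -1]/sqrt(2), which
# reduces to scaled frame differencing.
def temporal_bandpass(video, taps):
    """video: (T, H, W) array; taps: 1D filter applied along the time axis."""
    return np.apply_along_axis(lambda v: np.convolve(v, taps, mode='valid'),
                               axis=0, arr=video)

haar_hp = np.array([1.0, -1.0]) / np.sqrt(2.0)
video = np.ones((8, 4, 4))                 # a static video has no temporal change
band = temporal_bandpass(video, haar_hp)   # so the high-pass response is all zeros
```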
We empirically observe that the distributions of the coefficients of B_i vary as a function of frame rate. This is illustrated in Fig. 3, where the distributions at different frame rates are shown for a 3-level Haar wavelet filter. We observe that as the frame rate increases, the distribution becomes more peaky, since the correlation between consecutive frames increases with frame rate, making the band-pass responses more sparse. Our work is primarily motivated by this observation, and we exploit this deviation to assess quality. Although these band-pass coefficients follow heavy tailed distributions, we observed that applying divisive normalization does not always make their distributions Gaussian.

Although this implies that the coefficients of B_i do not necessarily follow a GSM model, they can be well modelled as following a Generalized Gaussian Distribution (GGD). GGD models have been widely used to model band-pass coefficients in many previous applications, such as image denoising [6], texture retrieval [11] and the Video BLIINDS VQA method [32]. Our work is based on the intuition that entropic differences under a GGD model provide a simple way to measure the deviations in the distributions of band-pass coefficients caused by artifacts arising from changes in frame rate. We leverage this idea to design a statistical model that captures frame rate variations. In the next subsection we discuss the GGD based model of band-pass coefficients.

Let the reference and distorted videos be denoted by R and D respectively, with R_t, D_t representing the corresponding frames at time t. Note that R and D can have different frame rates, though we require them to have the same spatial resolution. Let the responses of the i-th band-pass filter b_i, i ∈ {1, 2, . . . , K} on the reference and distorted videos be denoted by B^R_it and B^D_it respectively. Assume that every frame of B^R_it, B^D_it follows a GGD model, i.e.
B^R_it ∼ GGD(μ^R_it, α^R_it, β^R_it) and B^D_it ∼ GGD(μ^D_it, α^D_it, β^D_it), where μ is the location parameter (the mean of the distribution), α is a scale parameter and β is the shape parameter. Note that these parameters are time varying, depending on the dynamics of the video under consideration. Since the band-pass coefficients have zero mean, we only consider the two-parameter GGD model: μ^R_it = μ^D_it = 0 ∀ i, t. The probability density of a zero mean GGD(α, β) is given by:

f(x; α, β) = (β / (2αΓ(1/β))) exp(−(|x|/α)^β)    (3)

where Γ(·) is the gamma function:

Γ(a) = ∫_0^∞ x^(a−1) e^(−x) dx.    (4)

The shape parameter β controls the shape of the distribution while α affects the variance. Special cases of the GGD include the Gaussian distribution when β = 2 and the Laplacian distribution when β =
1.

Let the band-pass coefficients at frame t be partitioned into nonoverlapping blocks of size √M × √M, indexed by b ∈ {1, 2, . . . , B}. Let B^R_ibt and B^D_ibt denote the vectors of band-pass coefficients in block b of subband i at frame t for the reference and distorted videos respectively. We allow the band-pass coefficients to pass through a Gaussian channel to model perceptual imperfections such as neural noise [35, 38]. Let B̃^R_ibt, B̃^D_ibt represent the coefficients which undergo these channel imperfections to yield the observed responses B^R_ibt, B^D_ibt respectively, and let B̃^R_ibt, B̃^D_ibt both be GGD random variables. This model is expressed as:

B^R_ibt = B̃^R_ibt + W^R_ibt,   B^D_ibt = B̃^D_ibt + W^D_ibt    (5)

where B̃^R_ibt is independent of W^R_ibt, B̃^D_ibt is independent of W^D_ibt, W^R_ibt ∼ N(0, σ²_W I_M) and W^D_ibt ∼ N(0, σ²_W I_M). It can be inferred from equation 5 that B^R_ibt, B^D_ibt need not be exactly GGD, although they can be well approximated by a GGD [48] due to the independence assumption. Similar to [37, 38], we hypothesize that the entropy values of B̃^R_ibt, B̃^D_ibt quantify information pertaining to observed quality variations. The entropy of a GGD random variable X ∼ GGD(0, α, β) has the closed form expression:

h(X) = 1/β − log( β / (2αΓ(1/β)) )    (6)

Entropy computation requires the GGD parameters of B̃^R_ibt and B̃^D_ibt. However, we only have access to B^R_ibt and B^D_ibt. Thus, to estimate these parameters we follow the kurtosis matching procedure detailed in [39].
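The closed-form GGD entropy of equation 6 can be written down directly; as a sanity check, β = 2 reduces to the familiar Gaussian entropy (a sketch, not the authors' code):

```python
import math

# Equation 6: differential entropy of a zero-mean GGD(alpha, beta),
# h(X) = 1/beta - log(beta / (2 * alpha * Gamma(1/beta))).
def ggd_entropy(alpha, beta):
    return 1.0 / beta - math.log(beta / (2.0 * alpha * math.gamma(1.0 / beta)))

# Sanity check: for beta = 2 the GGD is Gaussian with variance alpha**2 / 2,
# whose entropy is 0.5 * log(2 * pi * e * sigma**2).
alpha = 1.3
sigma2 = alpha ** 2 / 2.0
gauss_h = 0.5 * math.log(2.0 * math.pi * math.e * sigma2)
```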
The first step is to estimate the variances, which is a straightforward calculation due to the independence assumption:

σ²(B^R_ibt) = σ²(B̃^R_ibt) + σ²_W,   σ²(B̃^R_ibt) = σ²(B^R_ibt) − σ²_W
σ²(B^D_ibt) = σ²(B̃^D_ibt) + σ²_W,   σ²(B̃^D_ibt) = σ²(B^D_ibt) − σ²_W    (7)

The next step is to calculate the kurtosis κ:

κ(B̃^R_ibt) = κ(B^R_ibt) ( σ²(B^R_ibt) / σ²(B̃^R_ibt) )²,   κ(B̃^D_ibt) = κ(B^D_ibt) ( σ²(B^D_ibt) / σ²(B̃^D_ibt) )²    (8)

Interested readers can refer to [29, 39] for a detailed derivation of equation 8. Sample variance and kurtosis values of B^R_ibt, B^D_ibt are employed in equation 8 to calculate the kurtosis of B̃^R_ibt and B̃^D_ibt respectively. Lastly, the GGD parameters and the kurtosis are related by a bijective mapping [39], where the kurtosis of a GGD random variable is given by:
Kurtosis(X) = Γ(1/β) Γ(5/β) / Γ(3/β)²    (9)

A simple grid search can be used to estimate the shape parameter β from the kurtosis value obtained from equation 8. The other parameter α can then be obtained using the relation:

α = σ √( Γ(1/β) / Γ(3/β) )    (10)

Plugging the parameters obtained from equations 9 and 10 into equation 6, the entropies h(B̃^R_ibt) and h(B̃^D_ibt) can be computed. Next we show how these entropies can be effectively used to assess the quality of videos. We define entropy scaling factors:

γ^R_ibt = log(1 + σ²(B̃^R_ibt)),   γ^D_ibt = log(1 + σ²(B̃^D_ibt))    (11)

These scaling factors are similar to the ones used in [37, 38]. The scaling factors lend a more local nature to our model and also provide numerical stability in regions of low variance, where entropy estimates can be inconsistent. The entropies are modified by premultiplying these scaling factors, to obtain:

ε^R_ibt = γ^R_ibt h(B̃^R_ibt),   ε^D_ibt = γ^D_ibt h(B̃^D_ibt)    (12)

Although we could simply use the absolute difference between these entropies as a quality measure, there exists a frame rate bias associated with the entropy values, where different frame rates have entropies at different scales. Typically, high frame rate sequences such as 120 fps have much lower entropy values than lower frame rates such as 24 or 30 fps. Thus simple subtraction measures the difference between the frame rates of R and D. Though this is desirable, it can be counterproductive when comparing two distorted videos which have the same frame rate but different compression levels, as the frame rate bias dominates the entropy variation arising from compression. To remove this bias, we employ an additional video sequence termed the Pseudo Reference (PR) signal, which is obtained by temporally downsampling the reference to match the frame rate of the distorted video.
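The kurtosis-to-shape inversion of equations 9 and 10 described above can be sketched with a plain grid search; the grid limits here are our own illustrative choice:

```python
import math

# Equation 9: kurtosis of a GGD as a function of the shape parameter beta.
# It is monotone in beta, so a grid search inverts it; equation 10 then
# recovers alpha from the variance.
def ggd_kurtosis(beta):
    return (math.gamma(1.0 / beta) * math.gamma(5.0 / beta)
            / math.gamma(3.0 / beta) ** 2)

def fit_ggd(kurt, var, grid=None):
    grid = grid or [0.05 + 0.001 * i for i in range(5000)]  # beta in (0.05, 5.05)
    beta = min(grid, key=lambda b: abs(ggd_kurtosis(b) - kurt))
    alpha = math.sqrt(var * math.gamma(1.0 / beta) / math.gamma(3.0 / beta))
    return alpha, beta

# A Gaussian (beta = 2) has kurtosis 3; variance 4 then implies alpha = sqrt(8).
alpha, beta = fit_ggd(kurt=3.0, var=4.0)
```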
We use frame dropping for temporal downsampling, although any other downsampling technique could be employed. When the distorted sequence has the same frame rate as the reference, PR is equal to the reference. Similarly to B^R_ibt and B^D_ibt, we calculate ε^PR_ibt. We define the Generalized Temporal Index (GTI) as:
GTI_it = (1/B) Σ_{b=1}^{B} | (K + |ε^D_ibt − ε^PR_ibt|) (ε^R_ibt / ε^PR_ibt) − K |    (13)

The expression in equation 13 can be interpreted by decomposing it into two factors: an absolute difference term and a ratio term. The absolute difference term removes the frame rate bias and captures the quality changes as if the reference were at the same frame rate. The ratio term weights these factors depending on the reference and distorted frame rates. Note that when the reference and distorted videos have the same frame rate, the ratio term is 1, so the expression in equation 13 depends only on the absolute difference. K is a predefined constant used to avoid GTI becoming zero when D = PR ≠ R, which is the case when the distorted video is a downsampled version of the reference, since it can still contain temporal artifacts. Note that GTI = 0 when D = PR = R. In our implementation we use a fixed value of K.

In the previous subsection we discussed capturing temporal information using temporal band-pass responses. Although
GTI does capture spatial information, it is primarily influenced by the temporal filtering. To address this concern, we employ spatial band-pass filters applied to every frame, with the aim of extracting information about spatial artifacts. We use simple local Mean Subtracted (MS) filtering, similar to [5]. Let R^MS_t = R_t − μ^R_t and D^MS_t = D_t − μ^D_t be the reference and distorted MS coefficients, where the local means are calculated as:

μ^R_t(i, j) = Σ_{g=−G}^{G} Σ_{h=−H}^{H} ω_{g,h} R_t(i + g, j + h),
μ^D_t(i, j) = Σ_{g=−G}^{G} Σ_{h=−H}^{H} ω_{g,h} D_t(i + g, j + h)    (14)

where ω = {ω_{g,h} | g = −G, . . . , G; h = −H, . . . , H} is a 2D circularly symmetric Gaussian weighting function sampled out to 3 standard deviations and rescaled to unit volume, and R_t, D_t are frames in the pixel domain.

Algorithm 1: Generalized Spatio-Temporal Index
Input: Reference video R, distorted video D
Output: GSTI
1. Temporally downsample R to obtain the Pseudo Reference (PR).
2. Temporal band-pass filtering with b_i; obtain B^R_ibt, B^D_ibt, B^PR_ibt.
3. Spatial band-pass filtering; obtain R^MS_bt, D^MS_bt.
4. Calculate ε^R_ibt, ε^D_ibt, ε^PR_ibt from equation 12.
5. Calculate θ^R_bt, θ^D_bt from equation 17.
6. Temporal entropy pooling from equation 19.
7. Reference entropy subsampling from equation 23.
8. Calculate GTI and GSI from equations 13 and 18 respectively.
9. Obtain GSTI from equation 21.

In our implementation we use G = H =
7.

The MS coefficients R^MS_t, D^MS_t also follow a GGD model. As with the temporal responses, we divide each frame into nonoverlapping blocks of size √M × √M, indexed by b ∈ {1, 2, . . . , B}. The channel imperfections can be similarly modeled as:

R^MS_bt = R̃^MS_bt + Z^R_bt,   D^MS_bt = D̃^MS_bt + Z^D_bt    (15)

where R̃^MS_bt is independent of Z^R_bt, D̃^MS_bt is independent of Z^D_bt, Z^R_bt ∼ N(0, σ²_Z I_M) and Z^D_bt ∼ N(0, σ²_Z I_M). The entropies h(R̃^MS_t) and h(D̃^MS_t) can be calculated using the procedure detailed in subsection 3.1, by replacing the temporal band-pass responses with the corresponding MS coefficients. We similarly define scaling factors and modified entropies:

η^R_bt = log(1 + σ²(R̃^MS_bt)),   η^D_bt = log(1 + σ²(D̃^MS_bt))    (16)

θ^R_bt = η^R_bt h(R̃^MS_bt),   θ^D_bt = η^D_bt h(D̃^MS_bt)    (17)

Since the spatial entropies are computed using only the information from a single frame, their values are frame rate agnostic. Thus there are no scale variations due to frame rate bias, as seen in the temporal case. The Generalized Spatial Index (GSI) is defined as:

GSI_t = (1/B) Σ_{b=1}^{B} |θ^D_bt − θ^R_bt|    (18)

Empirically, we observe that employing entropy terms obtained from every frame results in noisy quality estimates. The effectiveness of the quality estimates can be greatly enhanced by incorporating a temporal pooling strategy for the entropy terms. We consider a window of length L and replace the entropy terms at frame t by the average of a block of L consecutive entropy estimates:

ε^D_ibt ← (1/L) Σ_{t′=t}^{t+L−1} ε^D_ibt′,   ε^R_ibt ← (1/L) Σ_{t′=t}^{t+L−1} ε^R_ibt′,   ε^PR_ibt ← (1/L) Σ_{t′=t}^{t+L−1} ε^PR_ibt′

θ^R_bt ← (1/L) Σ_{t′=t}^{t+L−1} θ^R_bt′,   θ^D_bt ← (1/L) Σ_{t′=t}^{t+L−1} θ^D_bt′    (19)

The pooled entropy terms are then used in equations 13 and 18 to obtain the temporal and spatial measures respectively. In our experiments we choose L =
5. We discuss the impact of L on performance in Section 4.5.

GSI and GTI operate individually on data obtained by separate processing of spatial and temporal frequency responses. Interestingly, while GSI is obtained in a purely spatial manner, GTI has both spatial and temporal information embedded in it (as the entropies are obtained in a spatial blockwise manner). Thus temporal artifacts such as judder only influence GTI, while spatial artifacts affect both GTI and GSI. A combined Generalized Spatio-Temporal Index (GSTI) is defined as:
GSTI_it = GTI_it · GSI_t    (20)

Equation 20 provides scores at the frame level. To obtain a video level quality score, we average pool the frame scores:

GSTI_i = (1/T) Σ_{t=1}^{T} GSTI_it    (21)

For simplicity, we implement our method only in the luminance domain. We use a 3-level Haar wavelet filter as the temporal band-pass filter b_i, i ∈ {1, . . . , 3} (we ignore the low pass response), where a higher i denotes a larger center frequency. We use a wavelet packet filter [8], as it provides linear bandwidth, which is beneficial in analyzing the impact of individual frequency bands on perceived quality. For entropy calculation we choose spatial blocks of size 5 × 5, i.e. √M = 5, and fix the channel noise variances σ²_W = σ²_Z. Since the reference can have a higher frame rate than the distorted video, ε^R_ibt, θ^R_bt will have a different number of frames than their distorted counterparts ε^D_ibt, θ^D_bt. Thus we temporally average the reference entropy terms with factor:

k = FPS_ref
/ FPS_dist    (22)
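The reference entropy averaging described here (equations 22 and 23) can be sketched as follows, assuming an integer frame rate ratio; names are illustrative:

```python
import numpy as np

# Equations 22-23: average the reference per-frame entropy terms in groups
# of k = fps_ref / fps_dist so both sequences have the same temporal length.
def subsample_entropies(eps_ref, fps_ref, fps_dist):
    k = fps_ref // fps_dist                  # assumes an integer ratio
    eps_ref = np.asarray(eps_ref, dtype=float)
    return eps_ref[: len(eps_ref) // k * k].reshape(-1, k).mean(axis=1)

eps = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0]
out = subsample_entropies(eps, fps_ref=120, fps_dist=60)  # k = 2 -> [2., 3., 6.]
```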
Table 1: Performance comparison across different FR algorithms on the HFR database. In each column the first and second best values are marked boldface and underlined, respectively
Method        | SROCC ↑ | KROCC ↑ | PLCC ↑ | RMSE ↓
PSNR          | 0.7062  | 0.5163  | 0.6810 | 9.094
SSIM [49]     | 0.4717  | 0.3277  | 0.4717 | 10.95
MS-SSIM [45]  | 0.5082  | 0.3553  | 0.4863 | 10.85
FSIM [47]     | 0.4556  | 0.3187  | 0.4535 | 11.068
ST-RRED [38]  | 0.5663  | 0.3893  | 0.5298 | 10.532
SpEED [5]     | 0.5003  | 0.3508  | 0.4631 | 11.006
FRQM [46]     | 0.4260  | 0.2964  | 0.4453 | 11.08
VMAF [19]     | 0.7500  | 0.5564  | 0.7288 | 8.503
deepVQA [15]  | 0.3575  | 0.3463  | 0.2462 | 11.650
Ours          |         |         |        |

ε̃^R_ibt = (1/k) Σ_{n=1}^{k} ε^R_ibt′,   θ̃^R_bt = (1/k) Σ_{n=1}^{k} θ^R_bt′,   where t′ = (t − 1)k + n    (23)

The above procedure is equivalent to dividing the entropy terms into k subsequences along the temporal dimension and averaging each subsequence [20]. The entire algorithm is summarized in Algorithm 1.

4 EXPERIMENTAL RESULTS

In this section we first describe the experimental settings, compared methods and basic evaluation criteria. Second, we evaluate our proposed method against existing state of the art algorithms. We then perform various ablation studies to analyze the performance variation.
Compared Methods. Since our proposed framework is an FR/RR model, we selected 4 FR IQA methods for comparison: PSNR, SSIM [49], MS-SSIM [45] and FSIM [47]. Note that these are image indices and do not take into account any temporal information. These indices are computed on every frame and averaged across all frames to obtain the video score. In addition to the above IQA metrics, we also include 5 FR VQA indices: ST-RRED [38], SpEED [5], FRQM [46], VMAF [19] (we use the pretrained VMAF model available at https://github.com/Netflix/vmaf) and deepVQA [15]. For deepVQA we use only stage 1 of the pretrained model (trained on the LIVE-VQA [34] database) obtained from the code released by the authors. All the above methods assume the reference and corresponding distorted sequences have the same frame rate. For cases with differing frame rates, we perform a naive temporal upsampling by frame duplication to match the reference frame rate. Although we could downsample the reference instead, we avoid doing so since it can introduce artifacts (e.g. judder) into the reference, which is not desirable.

Evaluation Criteria. Spearman's rank order correlation coefficient (SROCC), Kendall's rank order correlation coefficient (KROCC), Pearson's linear correlation coefficient (PLCC) and root mean squared

Table 2: Performance comparison of various FR methods for individual frame rates in the HFR database. In each column the first and second best values are marked boldface and underlined, respectively
Method | 24 fps | 30 fps | 60 fps | 82 fps | 98 fps | 120 fps | Overall (SROCC ↑ / PLCC ↑ in each column)
PSNR   | 0.4226 / 0.3788 | 0.4849 / 0.4316 |
Figure 4: Variation of SROCC with temporal frequency bands (b_1, . . . , b_3). The x-axis denotes the subband index i ∈ {1, . . . , 3}

error (RMSE) are the main performance criteria employed to evaluate the VQA methodologies. Before computing PLCC and RMSE, the predicted scores are passed through the four-parameter logistic nonlinearity described in [40]:

Q(x) = β₂ + (β₁ − β₂) / (1 + exp(−(x − β₃)/|β₄|))    (24)

In this section we analyze the correlation between the objective scores predicted by various FR methods and the human judgments obtained in the form of DMOS from equation 1. We use the uncompressed 120 fps videos (CRF=0) as reference sequences. The performance of the various FR methods is shown in Table 1. The FR IQA indices PSNR, SSIM, MS-SSIM and FSIM perform poorly due to the absence of temporal information, which carries great significance in the HFR database. Our proposed method outperforms all the existing models on every evaluation criterion, as illustrated in Table 1.

The reported results for our method in Table 1 correspond to the first subband (i.e. b_1) of the band-pass filter. In Fig. 4 we plot the performance variation with the choice of b_i, i ∈ {1, . . . , 3}, where

Table 3: Significance of Spatial and Temporal Measures
Measure          | SROCC ↑ | KROCC ↑ | PLCC ↑ | RMSE ↓
Spatial Measure  | 0.7352  | 0.5356  | 0.7096 | 8.75
Temporal Measure | 0.6460  | 0.4646  | 0.6466 | 9.473
Overall          |         |         |        |

a higher value of i represents a higher center frequency. The degradation in performance at higher frequencies could be explained by the temporal contrast sensitivity function (CSF) [30] of human vision, according to which sensitivity to the visual signal follows a band-pass response, resulting in reduced sensitivity to higher frequencies.

In this experiment we subdivide the HFR database into sets containing videos of the same frame rate, and analyze the performance on each set individually. The comparison is shown in Table 2. To avoid clutter we only include SROCC and PLCC; KROCC and RMSE follow similar trends. We observe that our proposed method achieves either the first or second best performance at every frame rate. We also observed an interesting anomaly, where PSNR achieves higher performance at lower frame rates than the other perceptual indices. This is counterintuitive, given that in many prior works PSNR has been shown to correlate poorly with human perception [43]. A possible reason is that temporal upsampling by frame duplication artificially boosts PSNR performance. Though the same argument can be made for other perceptual indices like SSIM and MS-SSIM, the global nature of PSNR may be a factor in this enhancement, in contrast to local neighborhood based perceptual methods such as SSIM and MS-SSIM. Note that at 120 fps the performance of PSNR is lower than that of the other perceptual IQA indices, since no frame duplication is involved, as the reference and distorted videos are at the same frame rate.

From Table 2 we observe that the prediction performance of our method increases gradually with frame rate. This behavior can be explained in terms of frame rate and compression, and their effects on the spatial and temporal measures.
Note that the frame rate of the reference video always remains the same. Fixing the frame rate makes the
Table 4: Performance variation with temporal pooling window length L
  L   SROCC ↑   KROCC ↑   PLCC ↑   RMSE ↓

To study the impact of each conceptual component of our algorithm, we test each in isolation; the correlation values are reported in Table 3. It can be inferred from the table that the spatial and temporal measures contribute complementary quality information, as their combination yields higher performance than either measure does individually.
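Throughout these tables, SROCC is computed on the raw predictions, while PLCC and RMSE are computed after the four-parameter logistic mapping of Eq. (24), following the VQEG protocol [40]. A minimal sketch of this evaluation step (the function names and the initialization of the β parameters are our assumptions, not the authors' code):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(x, b1, b2, b3, b4):
    # Four-parameter logistic non-linearity of Eq. (24).
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / abs(b4)))

def evaluate(pred, dmos):
    """SROCC on raw predictions; PLCC and RMSE after fitting Eq. (24)."""
    pred, dmos = np.asarray(pred, float), np.asarray(dmos, float)
    srocc = spearmanr(pred, dmos).correlation
    # Initial guesses (our choice): output range from DMOS, midpoint/scale from pred.
    p0 = [dmos.max(), dmos.min(), pred.mean(), pred.std() + 1e-6]
    beta, _ = curve_fit(logistic4, pred, dmos, p0=p0, maxfev=20000)
    mapped = logistic4(pred, *beta)
    plcc = pearsonr(mapped, dmos)[0]
    rmse = float(np.sqrt(np.mean((mapped - dmos) ** 2)))
    return srocc, plcc, rmse
```

The fitted logistic only monotonically rescales the predictions, so it affects PLCC and RMSE but leaves rank-based measures such as SROCC and KROCC unchanged.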
In Section 3.4 we introduced temporal pooling of the entropy estimates over a window of length L to enhance quality prediction. In Table 4 we show the variation in performance across different choices of the window length L; performance remains fairly stable with respect to L, except at the extreme values.

In the construction of the HFR database, the compression levels were chosen such that there were 5 distinct bitrate levels for each content. We divide the dataset into these 5 sets; their individual performance is shown in Table 5 (the bitrates decrease monotonically from bitrate-1 to bitrate-5). We can infer from the table that the qualities predicted in the high-bitrate regime are more accurate than those in the low-bitrate regime. This is because when the bitrate is high, the videos differ mainly in frame rate and the effect of compression is small; quality is then easier to predict, since it depends primarily on the temporal measure. However, when the bitrate is low, both the temporal and spatial measures come into play, and their simple product is not particularly effective at capturing this quality variation.
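As a rough sketch of the windowed pooling of entropy estimates and the product combination of the two measures described above (the non-overlapping windows, the use of absolute entropy differences, and all function names are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def window_means(entropies, L):
    """Pool a per-frame sequence of entropy estimates over non-overlapping
    windows of length L; the tail that does not fill a window is dropped."""
    e = np.asarray(entropies, dtype=float)
    n = (len(e) // L) * L
    return e[:n].reshape(-1, L).mean(axis=1)

def temporal_measure(ent_ref, ent_dis, L):
    """Entropic difference between reference and distorted sequences after
    temporal pooling: average absolute difference of the pooled values."""
    return float(np.mean(np.abs(window_means(ent_ref, L) - window_means(ent_dis, L))))

def overall_score(spatial, temporal):
    """Combine the spatial and temporal measures by their simple product."""
    return spatial * temporal
```

Pooling before differencing is what makes the window length L matter: differences that cancel within a window are smoothed away, which is not equivalent to differencing the raw per-frame entropies.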
We presented a simple, highly generalizable video quality evalua-tion method that can be employed when reference and distorted
Table 5: Performance variation with Bitrate
                        SROCC    KROCC    PLCC     RMSE
  Bitrate-1 (Highest)   0.8046   0.5962   0.8467    5.6265
  Bitrate-2             0.8270   0.6285   0.8291    5.434
  Bitrate-3             0.6590   0.4772   0.7005    6.416
  Bitrate-4             0.3415   0.2246   0.3863    8.14
  Bitrate-5 (Lowest)    0.3456   0.2289   0.3217   10.3958

videos have different frame rates, and we gauged its performance on our newly designed HFR database. An important characteristic of our method is that it captures spatio-temporal artifacts by means of spatial and temporal measures, with no requirement of temporal upsampling. We performed a holistic evaluation of our method in terms of correlation with human perception, and established that it is superior to, and more robust than, existing algorithms. We also conducted ablation studies gauging the significance of the spatial and temporal measures to the overall performance.

Although the proposed method achieves state-of-the-art performance, the highest correlation we obtained is around 0.8, which suggests that there is ample room for further improvement. For band-pass filtering we employed a simple Haar filter, which has a poor frequency response and can potentially limit performance. As part of future work we wish to explore other band-pass filters with superior frequency responses. Another avenue we wish to explore is the possibility of combining quality estimates from multiple temporal bands in a perceptually weighted manner, with weights motivated by the temporal CSF. Our proposed model could also be incorporated into a data-driven quality model such as VMAF [19] to further enhance performance.
REFERENCES
IEEE Signal Processing Letters 24, 9 (Sep. 2017), 1333–1337.
[6] S Grace Chang, Bin Yu, and Martin Vetterli. 2000. Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing 9, 9 (2000), 1532–1546.
[7] Shyamprasad Chikkerur, Vijay Sundaram, Martin Reisslein, and Lina J Karam. 2011. Objective video quality assessment methods: A classification, review, and performance comparison. IEEE Transactions on Broadcasting 57, 2 (2011), 165–182.
[8] Ronald R Coifman and M Victor Wickerhauser. 1992. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory 38, 2 (1992), 713–718.
[9] CVVP-3085-4K-8. [n.d.]. ClearView Player. https://videoclarity.com/PDF/ClearView-Player-DataSheet.pdf. [Online; accessed 1-November-2019].
[10] Varuna De Silva, Hemantha Kodikara Arachchi, Erhan Ekmekcioglu, and Ahmet Kondoz. 2013. Toward an impairment metric for stereoscopic video: A full-reference video quality metric to assess compressed stereoscopic video. IEEE Transactions on Image Processing 22, 9 (2013), 3392–3404.
[11] Minh N Do and Martin Vetterli. 2002. Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Transactions on Image Processing 11, 2 (2002), 146–158.
[12] FFmpeg. [n.d.]. Encoding for streaming sites. https://trac.ffmpeg.org/wiki. [Online; accessed 1-November-2019].
[13] Chang Ge, Ning Wang, Gerry Foster, and Mick Wilson. 2017. Toward QoE-assured 4K video-on-demand delivery through mobile edge virtualization with adaptive prefetching. IEEE Transactions on Multimedia 19, 10 (2017), 2222–2237.
[14] ITU-R Recommendation BT.500-11. 2000. "Methodology for the Subjective Assessment of the Quality of Television Pictures". International Telecommunication Union (2000).
[15] Woojae Kim, Jongyoo Kim, Sewoong Ahn, Jinwoo Kim, and Sanghoon Lee. 2018. Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network. In Proceedings of the European Conference on Computer Vision (ECCV). 219–234.
[16] Debarati Kundu, Deepti Ghadiyaram, Alan C Bovik, and Brian L Evans. 2017. No-reference quality assessment of tone-mapped HDR pictures. IEEE Transactions on Image Processing 26, 6 (2017), 2957–2971.
[17] Qiang Li and Zhou Wang. 2009. Reduced-reference image quality assessment using divisive normalization-based image representation. IEEE Journal of Selected Topics in Signal Processing 3, 2 (2009), 202–211.
[18] X. Li, Q. Guo, and X. Lu. 2016. Spatiotemporal Statistics for Video Quality Assessment. IEEE Transactions on Image Processing 25, 7 (July 2016), 3329–3342.
[19] Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara. 2016. Toward a practical perceptual video quality metric. http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html
[20] Alex Mackin, Fan Zhang, and David R Bull. 2018. A study of high frame rate video formats. IEEE Transactions on Multimedia 21, 6 (2018), 1499–1512.
[21] Zicong Mai, Hassan Mansour, Rafal Mantiuk, Panos Nasiopoulos, Rabab Ward, and Wolfgang Heidrich. 2010. Optimizing a tone curve for backward-compatible high dynamic range image and video compression. IEEE Transactions on Image Processing 20, 6 (2010), 1558–1571.
[22] K. Manasa and S. S. Channappayya. 2016. An Optical Flow-Based Full Reference Video Quality Assessment Algorithm. IEEE Transactions on Image Processing.
[23] IEEE Transactions on Image Processing 21, 12 (Dec. 2012), 4695–4708.
[24] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. 2013. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters 20, 3 (March 2013), 209–212.
[25] D. Mukherjee, J. Han, J. Bankoski, R. Bultje, A. Grange, J. Koleszar, P. Wilkins, and Y. Xu. 2015. A Technical Overview of VP9 - The Latest Open-Source Video Codec. SMPTE Motion Imaging Journal. IEEE, 1418–1422.
[27] Rasoul Mohammadi Nasiri, Jiheng Wang, Abdul Rehman, Shiqi Wang, and Zhou Wang. 2015. Perceptual quality assessment of high frame rate video. IEEE, 1–6.
[28] Rasoul Mohammadi Nasiri and Zhou Wang. 2017. Perceptual aliasing factors and the impact of frame rate on video quality. IEEE, 3475–3479.
[29] Xunyu Pan, Xing Zhang, and Siwei Lyu. 2012. Exposing image splicing with inconsistent local noise variances. IEEE, 1–10.
[30] John G Robson. 1966. Spatial and temporal contrast-sensitivity functions of the visual system. JOSA 56, 8 (1966), 1141–1142.
[31] Daniel L Ruderman. 1994. The statistics of natural images. Network: Computation in Neural Systems 5, 4 (1994), 517–548.
[32] Michele A Saad, Alan C Bovik, and Christophe Charrier. 2014. Blind Prediction of Natural Video Quality. IEEE Transactions on Image Processing 23, 3 (March 2014), 1352–1365.
[33] Kalpana Seshadrinathan and Alan C Bovik. 2010. Motion Tuned Spatio-Temporal Quality Assessment of Natural Videos. IEEE Transactions on Image Processing.
[34] IEEE Transactions on Image Processing 19, 6 (June 2010), 1427–1441.
[35] Hamid R Sheikh and Alan C Bovik. 2006. Image Information and Visual Quality. IEEE Transactions on Image Processing 15, 2 (Feb. 2006), 430–444.
[36] Aljoscha Smolic, Karsten Mueller, Nikolce Stefanoski, Joern Ostermann, Atanas Gotchev, Gozde B Akar, Georgios Triantafyllidis, and Alper Koz. 2007. Coding algorithms for 3DTV - a survey. IEEE Transactions on Circuits and Systems for Video Technology 17, 11 (2007), 1606–1621.
[37] Rajiv Soundararajan and Alan C Bovik. 2012. RRED Indices: Reduced Reference Entropic Differencing for Image Quality Assessment. IEEE Transactions on Image Processing 21, 2 (Feb. 2012), 517–526.
[38] Rajiv Soundararajan and Alan C Bovik. 2013. Video Quality Assessment by Reduced Reference Spatio-Temporal Entropic Differencing. IEEE Transactions on Circuits and Systems for Video Technology 23, 4 (April 2013), 684–694.
[39] Hamza Soury and Mohamed-Slim Alouini. 2015. New results on the sum of two generalized Gaussian random variables. IEEE, 1017–1021.
[40] VQEG. 2000. Final report from the video quality experts group on the validation of objective quality metrics for video quality assessment.
[41] P. V. Vu, C. T. Vu, and D. M. Chandler. 2011. A spatiotemporal most-apparent-distortion model for video quality assessment. 2505–2508.
[42] Martin J Wainwright and Eero P Simoncelli. 2000. Scale mixtures of Gaussians and the statistics of natural images. In Advances in Neural Information Processing Systems. 855–861.
[43] Zhou Wang and Alan C Bovik. 2009. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26, 1 (2009), 98–117.
[44] Zhou Wang and Eero P Simoncelli. 2005. Reduced-reference image quality assessment using a wavelet-domain natural image statistic model. In Human Vision and Electronic Imaging X, Vol. 5666. International Society for Optics and Photonics, 149–159.
[45] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. 2003. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, Vol. 2. 1398–1402.
[46] F. Zhang, A. Mackin, and D. R. Bull. 2017. A frame rate dependent video quality metric based on temporal wavelet decomposition and spatiotemporal pooling. 300–304.
[47] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. 2011. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing 20, 8 (2011), 2378–2386.
[48] Qian Zhao, Hong-wei Li, and Yuan-tong Shen. 2004. On the sum of generalized Gaussian random signals. In Proceedings 7th International Conference on Signal Processing (ICSP '04), Vol. 1. IEEE, 50–53.
[49] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity.