Subjective and Objective Quality Assessment of High Frame Rate Videos
Pavan C. Madhusudana, Xiangxu Yu, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
P. C. Madhusudana, X. Yu and A. C. Bovik are with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA (e-mail: [email protected]; [email protected]; [email protected]). Neil Birkbeck, Yilin Wang and Balu Adsumilli are with Google Inc. (e-mail: [email protected]; [email protected]; [email protected]).
Abstract—High frame rate (HFR) videos are becoming increasingly common with the tremendous popularity of live, high-action streaming content such as sports. Although HFR contents are generally of very high quality, high bandwidth requirements make them challenging to deliver efficiently, while simultaneously maintaining their quality. To optimize trade-offs between bandwidth requirements and video quality, in terms of frame rate adaptation, it is imperative to understand the intricate relationship between frame rate and perceptual video quality. To advance progress in this direction, we designed a new subjective resource, called the LIVE-YouTube-HFR (LIVE-YT-HFR) dataset, which is comprised of 480 videos having 6 different frame rates, obtained from 16 diverse contents. In order to understand the combined effects of compression and frame rate adjustment, we also processed videos at 5 compression levels at each frame rate. To obtain subjective labels on the videos, we conducted a human study yielding 19,000 human quality ratings obtained from a pool of 85 human subjects. We also conducted a holistic evaluation of existing state-of-the-art Full and No-Reference video quality algorithms, and statistically benchmarked their performance on the new database. The LIVE-YT-HFR database has been made available online for public use and evaluation purposes, with hopes that it will help advance research in this exciting video technology direction. It may be obtained at https://live.ece.utexas.edu/research/LIVE_YT_HFR/LIVE_YT_HFR/index.html
Index Terms—high frame rate, objective algorithm evaluations, subjective quality, video quality assessment, video quality database, full reference
I. INTRODUCTION

Recent advancements in hardware technology have resulted in a dramatic visual information explosion on the Internet. Visual data such as images and videos constitute as much as 80% of total Internet traffic. Contemporaneously, increasing demands for better consumer viewing experiences have impelled streaming and social video service providers to pursue the delivery of higher quality videos. The requirements of higher video quality can involve better immersive experiences, higher spatial resolutions, larger display sizes, high dynamic ranges (HDR), and high frame rates (HFR). Indeed, the rapid development of streaming video technology has made the production and reception of superior quality videos affordable to the general public. Popular mobile capture devices have made the creation of high quality video content
quite pervasive. Improved hardware supports the display of higher quality videos. Powerful GPUs are now able to display live, real-time 4K, HDR, and HFR videos on consumer displays, and virtual reality videos on head-mounted displays. Video service providers like YouTube, Netflix and Amazon Prime Video continue to offer videos having higher spatial resolutions and/or increased frame rates.

In the past, considerable research effort has been expended on improving spatial resolution (4K/8K) [1], HDR [2], [3] and multiview formats [4], [5]. However, there has been less progress on increasing the frame rates of consumer videos, and the vast majority of streamed or shared videos are still provided at 60 frames per second (fps) or less.

Various factors have limited the mainstream deployment of HFR videos. In the past, the necessity of sophisticated capture equipment and expensive display technologies placed HFR out of reach of the general populace. However, because of modern advanced consumer grade digital cameras such as the GoPro [6] and Sony RX series [7], more casual users can capture HFR videos at a reasonable cost. While the current dearth of HFR content is a factor hindering the growth of its popularity, this is likely to change, given high interest in live action, high-speed sporting events and outdoor activities. Yet, HFR contents require higher bandwidths, making them more challenging for mass distribution by the streaming entertainment industry.

As technology evolves, HFR videos are likely to occupy a larger proportion of online videos, so it is important to understand the perceptual benefits associated with them. It is also interesting to consider the benefits conveyed to viewers' experiences when shifting from the low to high frame rate regime. While there is a general notion that HFR videos provide better perceptual quality, by reducing temporal artifacts such as flicker and motion blur, there has been little work done to validate these notions. Video Quality Assessment (VQA) has mostly addressed developments like HDR and high spatial resolution. One reason for this is the lack of subjective datasets addressing HFR videos, especially beyond 60 fps.

Recently, there has been renewed interest in HFR research, along with newer datasets like Waterloo HFR [8] and BVI-HFR [9], which primarily address HFR content quality. These databases either contain only a few frame rates, and/or do not consider the joint effects of other distortions such as compression artifacts. To address these limitations and further advance progress on understanding HFR video quality, we have created a new HFR video resource, which we will refer to as the LIVE-YouTube High Frame Rate (LIVE-YT-HFR) Database. An important distinction of the new HFR database is the presence of six different frame rates with multiple spatial resolutions spread across a wide variety of contents. The new HFR database also encompasses a unique combination of compression and frame rate variations, evaluated and labeled by a large pool of volunteer subjects. Overall, the database comprises 480 videos, making it one of the largest existing HFR video quality datasets.
We also performed a holistic evaluation and benchmark study of current state-of-the-art VQA models. To help facilitate further development of HFR video quality research, we are making the new LIVE-YT-HFR dataset freely and publicly available in its entirety at https://live.ece.utexas.edu/research/LIVE_YT_HFR/LIVE_YT_HFR/index.html.

The rest of the paper is organized as follows: In Section II we discuss prior work on the HFR quality problem. Section III provides a detailed description of the new database and its construction. Section IV describes the subjective study. Section V compares and contrasts the performance of relevant VQA models on the new database. Finally, we provide some concluding remarks in Section VI.

II. PRIOR WORK
A. Subjective VQA
Research pertaining to video quality has made significant strides over the last decade. Several widely-used VQA databases have been proposed, including LIVE VQA [10], LIVE Mobile [11], CSIQ-VQA [12], CDVL [13], etc. These generally begin with a set of fewer than 20 pristine video contents, on which various distortions are applied, primarily compression artifacts arising from past and present codecs, at both Standard Definition (SD) and High Definition (HD) resolutions. In all of these databases, the reference and distorted sequences have the same frame rates, and therefore they do not contain artifacts arising from frame rate changes. Moreover, the distortions present in these legacy databases were synthetically applied. More recently, novel databases have emerged containing authentic distortions obtained from user-generated-content (UGC) videos. These include LIVE VQC [14], KoNViD-1k [15], and YouTube UGC [16]. Since the videos in these databases were captured by casual users, there are no pristine versions of any of the videos, hence they are primarily suited for blind video quality assessment research. Since only a single version of each content is available, these databases are not suitable for studying the perceptual impacts of frame rate changes.

Currently available datasets addressing HFR content are very limited. One of the first HFR databases was proposed by Nasiri et al. [8], containing SD and HD videos with frame rates no greater than 60 fps, distorted by various compression levels. However, this database has not been made publicly available. Mackin et al. introduced the BVI-HFR [9] database, which contains videos at 4 different frame rates varying from 15 fps to 120 fps. The dataset includes 22 source sequences at 120 fps, where the lower fps videos were obtained by subsampling the source videos via frame averaging. Possible shortcomings of this database are that it only includes frame rate artifacts, it does not consider the effects of compression on frame rate, and it uses simple frame averaging to subsample in time. The latter strategy imposes a strong assumption on the changed videos, creates specific motion blur artifacts, and may not match practical systems.
B. Objective VQA
Generally, VQA models are broadly classified into three categories: Full-Reference (FR), Reduced-Reference (RR) and No-Reference (NR). FR VQA models require entire pristine undistorted videos against which degraded versions are perceptually compared, while RR models operate with limited reference information. NR models predict quality with no reference knowledge.

Although FR Image Quality Assessment (IQA) models [17]–[19] can be easily extended to VQA by applying them on a frame-by-frame basis, in combination with a suitable temporal pooling strategy, their performance is often limited, since temporal information is not effectively used. An early VQA model, Video Quality Metric (VQM) [20], employs 3D spatio-temporal video blocks to compute certain features, and frame differencing to capture temporal variations. A modified SSIM algorithm [21], and the later MOVIE [22] index, both use a model of human visual motion processing in extra-cortical area MT to capture motion distortions. The ST-MAD [23] index uses a "most apparent distortion" concept [24] to quantify quality. Natural Scene Statistics (NSS) based VQA models, such as ST-RRED [25] and SpEED-VQA [26], compute statistical measurements such as spatial and temporal entropic differences in the band-pass domain, to measure quality deviations. Recently, learning-based FR-VQA frameworks have gained popularity due to their superior performance. The Video Multi-method Assessment Fusion (VMAF) algorithm [27] is a highly successful and widely used method, which uses a set of features derived from VIF [28], a frame-difference feature, and a detail feature [29], fusing them using a trained Support Vector Regressor (SVR). Kim et al. [30] proposed a model called DeepVQA, based on a CNN in combination with a convolutional neural aggregation network (CNAN) for temporal pooling, achieving competitive performance on the LIVE-VQA and CSIQ-VQA datasets.

VQA models relevant to HFR quality prediction are uncommon. Nasiri et al. [31] proposed an early model that measures the degree of aliasing of the temporal frequency spectrum. In [32], motion smoothness is used as a measure for cross-frame rate quality assessment. Zhang et al. [33] proposed a wavelet domain Frame Rate Quality Metric (FRQM), whereby absolute differences between temporally wavelet filtered sequences are used to quantify quality. Although FRQM achieves competitive performance on the BVI-HFR dataset, it cannot be used when both the reference and distorted videos have the same frame rate, thus limiting its generalizability.

The VQA models just discussed only address artifacts arising from frame rate variations, without accounting for the joint perceptual impacts of compression and frame rate. Recently a model called GSTI [34] was proposed, where entropic differences between temporally band-pass filtered responses were found to achieve better correlations against human judgments of quality, even when tested in the presence of both compression and frame rate variations.

[Fig. 1. Sample frames from source sequences in the LIVE-YT-HFR Database: (a) Runner, (b) 3 Runners, (c) Flips, (d) Hurdles, (e) Longjump, (f) bobblehead, (g) books, (h) bouncyball, (i) catch-track, (j) cyclist, (k) hamster, (l) lamppost, (m) leaves-wall, (n) library, (o) pour, (p) water-splashing. (a)-(e): sequences contributed by the Fox Media Group; (f)-(p): sequences from the BVI-HFR dataset.]
The absence of reference information makes NR video quality prediction quite challenging. Most existing models involve some kind of learning based procedure to find mappings between features (or pixels) and human subjective judgments of quality. Good examples are [35]–[38], which use NSS or other quality-aware features on which an SVR or Random Forest learner is trained to predict quality. Recent interest in assessing UGC video quality has resulted in several successful methods based on deep learning [39]–[41]. Although UGC videos contain a wide variety of interesting authentic distortions, they are less topical for understanding the effects of frame rate, since usually, only very high quality source videos are subjected to frame-rate reductions (during streaming), hence UGC datasets contain only one version of each video content. Nevertheless, frame-rate variations of UGC content may become a more important topic in the future, creating interesting research possibilities.

III. LIVE-YOUTUBE-HFR DATABASE
A detailed description of the new LIVE-YT-HFR database is presented in this section. Our main objective in creating this database is to provide a tool that the video quality research community can access when analyzing the impact of frame rates on perceptual video quality. We believe that studying the perception of artifacts arising from frame rate variations will prove to be beneficial when designing future VQA models.

TABLE I: Characterization of source sequences (120 Hz).

                          SI      CF      TI
Range                     69.65   29.22   31.07
Uniformity of Coverage    0.87    0.94    0.8

A. Source Sequences
We used 16 uncompressed source videos of natural scenes captured at a frame rate of 120 fps that are currently available in the public domain. Of these 16 videos, 11 sequences were borrowed from the Bristol Vision Institute High Frame Rate (BVI-HFR) video database [42]. These were captured using a RED Epic-X video camera with a spatial resolution of 3840 × 2160 (UHD-1) at a frame rate of 120 fps. The publicly available version of the database contains sequences that were spatially downsampled to 1920 × 1080 (HD) YUV 4:2:0 8 bit format, each of 10 seconds duration. The remaining 5 videos
contain high-motion sports content captured by the Fox Media Group in 3840 × 2160 (UHD-1) YUV 4:2:0 10 bit format, each of 6-8 seconds duration. Sample frames drawn from the source sequences, along with their IDs, are shown in Fig. 1. The database was restricted to contain only progressively scanned videos, to avoid separate issues associated with video de-interlacing artifacts.

[Fig. 2. (a) Spatial Information (SI) versus Colorfulness (CF), and (b) SI versus Temporal Information (TI), measured on the source sequences in the LIVE-YT-HFR database. The corresponding convex hulls are indicated by red lines.]

B. Content Description and Coverage
Similar to [9], we computed three low level descriptors on each source sequence: (i) Spatial Information (SI), indicating the amount of local spatial variation in each frame, (ii) Temporal Information (TI), which captures change across frames, and (iii) the Colorfulness (CF) measure [43]. SI is a Sobel magnitude measure, whereas TI uses the average squared luminance difference between successive frames:

$$TI = \sqrt{\frac{1}{N-1}\sum_{t=1}^{N-1}\frac{1}{P}\sum_{i,j}\left(I(i,j,t+1)-I(i,j,t)\right)^{2}} \qquad (1)$$

where $I(i,j,t)$ is the luminance at coordinate $(i,j)$ in frame $t$, $P$ is the total number of pixels in each frame, and $N$ is the number of frames in the video. Table I shows the range and uniformity characteristics of the source sequences, while the raw SI, CF and TI values are plotted in Fig. 2. These plots illustrate a diverse span of scenes and motions among the selected source sequences.
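For illustration, Eq. (1) reduces to the root mean square of successive luminance frame differences. The following is a minimal NumPy sketch, assuming the luminance (Y) plane of a video is supplied as an array of shape (N, H, W):

```python
import numpy as np

def temporal_information(luma):
    """TI of Eq. (1): RMS of frame-to-frame luminance differences,
    averaged over all P pixels and all N-1 frame pairs.

    luma: ndarray of shape (N, H, W), luminance plane only.
    """
    diffs = np.diff(luma.astype(np.float64), axis=0)  # (N-1, H, W)
    return float(np.sqrt(np.mean(diffs ** 2)))
```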
C. Temporal Downsampling

Simultaneously capturing the same scene at multiple frame rates without downsampling is impractical, as it would either require a specialized camera with concurrent multi-frame rate capture capability, or a careful configuration of a multi-camera system. Thus, lower frame rate versions were generated by temporally downsampling the original high frame rate (120 fps) source videos. In prior studies, two methods of downsampling have been used: frame dropping and frame averaging [9]. Dropping frames is similar to native capture at a lower frame rate with a reduced shutter angle [44]. However, while frame dropping is simple and computationally inexpensive, it can introduce judder/strobing artifacts, especially in videos captured with significant camera motion. Conversely, frame averaging alleviates the problem of judder/strobing distortions, but can introduce motion blur, resulting in the attenuation of visually important high spatio-temporal frequencies. The degree of high-frequency attenuation increases with the downsampling factor, making videos subsampled to low frame rates, such as 24 or 30 fps, strikingly blurred. Of course, motion compensation methods of frame averaging might be considered, but these can create other kinds of artifacts, and are not commonly used [45]. Since the choice of temporal downsampling method influences the perception of video quality, we decided to use the frame dropping method, in order to avoid the introduction of motion blur, and to obtain low frame-rate videos closer to natively captured ones. Frame dropping was performed by suitably modifying the fps filter available in FFmpeg [46].
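As a rough sketch (not the authors' exact modification of the filter), frame dropping to a target rate can be invoked from Python via FFmpeg's fps filter; the round=near option selects nearest-frame dropping rather than any blending, and the file names are placeholders:

```python
import subprocess

def drop_frames(src_path, dst_path, target_fps):
    """Temporally downsample by frame dropping with FFmpeg's fps filter.

    Assumes an FFmpeg binary on PATH. When reducing the rate, the fps
    filter drops (never blends) frames.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path,
         "-vf", f"fps={target_fps}:round=near",
         dst_path],
        check=True,
    )

drop_frames("source_120fps.y4m", "video_24fps.y4m", 24)
```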
D. Test Sequences
We created 30 test sequences from each source sequence, by subsampling them to 6 different frame rates: 24, 30, 60, 82, 98 and 120 fps. Each of these was subsequently subjected to 5 levels of VP9 compression. These frame rates were chosen based on the refresh rates supported by the monitor (Acer Predator X27 [47]) that was employed to conduct the human study. All of the sequences were compressed using FFmpeg VP9 compression [48] by varying the Constant Rate Factor (CRF), with values resulting in bit-rates $R_i$, $i \in \{1, \ldots, 5\}$, where $R_i < R_j$, $\forall i < j$. The strategy for choosing the 5 compression levels for a given source sequence was as follows: two of the levels, $R_5$ and $R_1$, correspond, respectively, to the lossless (CRF=0) and heaviest (CRF=63) possible compression levels in VP9. The other three CRF levels, yielding rates $R_2$, $R_3$ and $R_4$, were chosen such that compression resulted in approximately the same bit-rates across all frame rates. Thus, for a given source sequence, the bit-rates $R_2$, $R_3$ and $R_4$ remained constant, and were selected to ensure that there was adequate perceptual separation between them. The CRF values of the remaining videos derived from the source sequence were determined to approximately match these bit-rates. Thus, for each source content, there are 6 (frame rates) × 5 (compression levels) = 30 test sequences. The above procedure was repeated on every source sequence present in the database. Since bit-rates depend on content, there is significant variation of bit-rates across the compressed source sequences. This is illustrated in Fig. 3, where average bit-rates are plotted against content indices, and where the initial contents were 4K videos having higher bit-rate values. Given the 16 source videos described in Sec. III-A, we arrived at 16 × 30 = 480 videos in the database.

[Fig. 3. Variation of average bit-rate with content in the LIVE-YT-HFR Database.]

TABLE II: Display parameters and viewing conditions of the subjective study.

Parameter            Value
Screen Resolution    3840 × 2160
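The constant-quality VP9 encodes described above can be sketched as follows; this is illustrative only (the intermediate CRF values matching the shared target bit-rates $R_2$–$R_4$ were searched per sequence, a step omitted here, and the CRFs in the loop below are hypothetical):

```python
import subprocess

def encode_vp9(src_path, dst_path, crf):
    """Encode one sequence with libvpx-vp9 in constant-quality mode.

    CRF ranges over 0-63; the study used CRF=0 (treated as lossless)
    and CRF=63 (heaviest compression) as the two extreme levels.
    File names are placeholders.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path,
         "-c:v", "libvpx-vp9", "-crf", str(crf), "-b:v", "0",
         dst_path],
        check=True,
    )

for crf in (0, 23, 37, 49, 63):  # hypothetical intermediate CRFs
    encode_vp9("video_120fps.y4m", f"video_120fps_crf{crf}.webm", crf)
```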
E. Significance of LIVE-YT-HFR Database
The LIVE-YT-HFR database possesses some important and unique characteristics that distinguish it from both existing HFR and standard VQA databases. First, it contains sequences corresponding to six different frame rates, spanning the range 24 fps to 120 fps. Prior HFR datasets have either limited the content to less than 60 fps [8], or have contained only a few frame rates [9]. Standard VQA databases generally restrict all of the reference and distorted videos to the same frame rate. We believe that having a more fine-grained sampling of frame rates will make it possible to create better models of the impact of frame rate on perceptual video quality. Second, the database contains a mixture of contents at spatial resolutions 1080p and 4K. The inclusion of 4K contents increases the relevancy of the database, given strong trends in video streaming towards 4K standards. Lastly, the LIVE-YT-HFR Database includes VP9 compression artifacts, enabling the study of the joint effects of compression and frame rate on video quality. VP9 is a widely used alternative to MPEG compression, and it is heavily used by YouTube. The principles that can be learned will likely be applicable to other codecs as well, such as HEVC and AV1. Overall, the new database comprises 480 videos, making it one of the largest VQA databases currently available.

IV. SUBJECTIVE QUALITY ASSESSMENT
A. Subjective Testing Design
We employed a Single-Stimulus Continuous Quality Evaluation (SSCQE) [49] procedure to obtain subjective quality ratings on the videos in the LIVE-YT-HFR database. By "continuous," we refer to a continuous quality scale, as opposed to continuous quality collection over the duration of each video. The display parameters and viewing conditions employed in the subjective study are shown in Table II. Since the screen resolution of the display device is 4K, the 1080p sequences were spatially upsampled to 4K using Lanczos interpolation, while the 4K videos were shown at their native resolution. This is how 1080p videos are commonly reformatted for display in practice. During the study, the videos were played out using the Venueplayer [50] application developed by VideoClarity, which supports high frame rates and does not introduce artifacts that could impact the perception of video quality. To ensure perfect playback, all of the distorted sequences were processed and stored as raw YUV 4:2:0 files.

The LIVE-YT-HFR database was divided into 4 subsets of 120 videos each, such that every subject viewed only 2 of the 4 sets. Thus, each subject rated 240 videos across 2 sessions, where 120 videos were viewed in each session. We prepared playlists for each subject by randomly re-ordering the 120 videos. Care was taken to ensure that successive videos were obtained from different source sequences as well as different frame rates. This was done in order to inhibit any contextual and memory biases that could affect subjective quality judgments. Distinct playlists were created for every subject in every session, to avoid any prejudice arising from playing videos in any specific order. To avoid latency issues due to slow hard disk access, the entire playlist was loaded into memory before playback in each session. The monitor refresh rate was altered to exactly match each video's frame rate before it was played back.

After each video played, an interactive continuous quality rating scale was displayed on the screen, as shown in Fig. 4. The initial position of the cursor was randomized for every video. The quality bar was labeled with 5 Likert indicators, ranging from "Bad" to "Excellent," to assist the subjects in their rating task. The subjects could move the cursor using a Palette Gear console [51], then press a key on the console to enter each quality score. The subject was provided as much time as needed to enter each score, but could not modify the score once entered. After the score was received, the next video was presented. The continuous-scale scores were sampled on a numerical scale of 0 to 39, with 0 corresponding to "Bad" and 39 representing "Excellent."
B. Subjects and Training
A total of 85 volunteer undergraduate subjects were recruited at The University of Texas at Austin. The subject pool consisted of 14 female and 71 male participants, aged between 20 and 30 years. All subjects were screened for normal or corrected-to-normal color vision, and no subjects were rejected during screening. Each subject was individually informed of the purpose of the study, and a short training session was conducted to familiarize them with the rating procedure. During the training session, 6 videos approximately spanning the overall quality range of the test sequences were shown, to give the subjects an idea of the video quality they could expect during the actual study. The training videos were not part of the database and contained different content, and the scores on them were not recorded or considered. Training was only performed before the start of each subject's first session. The subjects were instructed to provide ratings based on perceived quality, rather than on any preference for, or interestingness of, content. To reduce subjective fatigue, a minimum of 24 hours was required between successive sessions.

No subject required more than 40 minutes to complete any session. In the end, each video was labelled by a minimum of 42 user ratings. The histograms of raw subjective scores for all four subsets are shown in Fig. 5. The very similar score distributions over the four subsets indicate that they contain very similar quality distributions.

[Fig. 5. Histogram of raw scores across all four subsets of the LIVE-YT-HFR Database.]

[Fig. 6. Histogram of MOS in 20 equally spaced bins.]
C. Processing of Subjective Scores
Let $m_{ijk}$ denote the score provided by subject $i$ to video $j$ in session $k \in \{1, 2\}$. Since not all videos in the LIVE-YT-HFR Database were rated by every subject, let $\delta(i,j)$ be an indicator function such that

$$\delta(i,j) = \begin{cases} 1 & \text{if subject } i \text{ rated video } j \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

Then, to normalize the scores received across the multiple sessions of each subject, we calculated Z-scores per session [52] as

$$\mu_{ik} = \frac{1}{N_{ik}} \sum_{j=1}^{N_{ik}} m_{ijk}, \qquad \sigma_{ik} = \sqrt{\frac{1}{N_{ik}-1} \sum_{j=1}^{N_{ik}} \left(m_{ijk} - \mu_{ik}\right)^{2}}, \qquad z_{ijk} = \frac{m_{ijk} - \mu_{ik}}{\sigma_{ik}},$$

where $N_{ik}$ is the number of videos seen by subject $i$ in session $k$. The Z-scores from all sessions were concatenated to form the matrix $\{z_{ij}\}$ of Z-scores assigned by subjects $i$ to videos $j \in \{1, \ldots, 480\}$, where the entries of $\{z_{ij}\}$ are empty at locations $(i,j)$ where $\delta(i,j) = 0$. We elected not to enforce any subject rejection procedure, as we observed that the inter-subject correlation was very high (inter-subject consistency is discussed in Sec. IV-D). Assuming $z_{ij}$ to have a standard normal distribution, 99.73% of the Z-scores should lie in $[-3, 3]$. A linear rescaling was therefore used to map the scores to the range $[0, 100]$ as

$$z'_{ij} = \frac{100\,(z_{ij} + 3)}{6}. \qquad (3)$$

Finally, the Mean Opinion Score (MOS) of each video was calculated by averaging the rescaled scores received for that video:

$$MOS_j = \frac{1}{N_j} \sum_{i=1}^{N} z'_{ij}\, \delta(i,j), \qquad (4)$$

where $N$ is the number of subjects and $N_j = \sum_{i=1}^{N} \delta(i,j)$ is the number of subjects who rated video $j$. The histogram of MOS, shown in Fig. 6, indicates that the MOS span a relatively broad range of the quality scale. We also calculated a Difference MOS (DMOS) by subtracting the MOS of each video from the MOS of its corresponding reference:

$$DMOS_j = MOS_{ref(j)} - MOS_j. \qquad (5)$$

DMOS is particularly useful for FR-VQA problems, to reduce content dependence.
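A compact NumPy rendering of this normalization pipeline, assuming the raw ratings of one session are held in a (subjects × videos) array with NaN marking unrated videos (so that δ(i, j) is implicit):

```python
import numpy as np

def zscore_session(raw):
    """Per-subject Z-scores for one session (per-session mean and
    unbiased standard deviation, as in the equations above).

    raw: (num_subjects, num_videos) array, NaN where delta(i, j) = 0.
    """
    mu = np.nanmean(raw, axis=1, keepdims=True)
    sigma = np.nanstd(raw, axis=1, ddof=1, keepdims=True)
    return (raw - mu) / sigma

# z: concatenation of both sessions, shape (num_subjects, 480)
# z_prime = 100.0 * (z + 3.0) / 6.0          # Eq. (3)
# mos = np.nanmean(z_prime, axis=0)          # Eq. (4); NaNs act as delta = 0
# dmos = mos[ref_idx] - mos                  # Eq. (5); ref_idx (hypothetical)
#                                            # maps each video to its 120 fps
#                                            # lossless reference
```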
[Fig. 8. MOS of anchor videos plotted against the number of subjects, along with 95% confidence intervals.]
[Fig. 9. (Left) Relationship between average MOS and frame rate. (Right) The effect of camera motion. Shaded regions represent 95% confidence intervals.]
D. Subject-Consistency Analysis
To ensure that the subjects' ratings were reliable, we performed additional analyses to evaluate inter- and intra-subject reliability.

a) Inter-Subject Consistency: To check inter-subject consistency, we split the scores received for every video into two disjoint, equal groups, and measured the correlation of MOS between these two groups. The random splits were performed over 100 trials, and the mean Spearman rank order correlation coefficient (SROCC) between the two groups was found to be very high. Fig. 7 shows a scatter plot of MOS between the two randomly divided groups. It may be observed that the majority of scores are concentrated near a line of unit slope passing through the origin, indicating a high consistency between the groups.

[Fig. 7. Scatter plot of MOS between two groups of subjects.]

b) Intra-Subject Consistency: Measuring intra-subject reliability provides information on the level of consistency demonstrated by individual subjects [53] over the videos they rated. We thus measured the SROCC between the individual opinion scores and MOS, obtaining a high median SROCC across all subjects. These additional experiments indicate that we can ascribe a high degree of confidence to the veracity of the obtained opinion scores, as well as to the framework used to conduct the subjective study.
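A minimal sketch of the split-half consistency check just described, assuming the rescaled scores sit in a (subjects × videos) array with NaN for unrated videos, and that every video is rated by someone in each half:

```python
import numpy as np
from scipy.stats import spearmanr

def split_half_srocc(z, n_trials=100, seed=0):
    """Mean SROCC between per-video MOS of two random halves of subjects."""
    rng = np.random.default_rng(seed)
    n_subj = z.shape[0]
    rhos = []
    for _ in range(n_trials):
        perm = rng.permutation(n_subj)
        half1, half2 = perm[: n_subj // 2], perm[n_subj // 2 :]
        mos1 = np.nanmean(z[half1], axis=0)
        mos2 = np.nanmean(z[half2], axis=0)
        rhos.append(spearmanr(mos1, mos2).correlation)
    return float(np.mean(rhos))
```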
[Fig. 10. Variation of average MOS with content across frame rates.]
[Fig. 11. Rate distortion curves for different frame rates at 1080p (left) and 4K (right) resolutions.]
E. Anchor Videos
In our study, not every video was rated by all subjects; each subject viewed only 50% of the entire set of videos present in the database. Since we enrolled 85 subjects, we obtained roughly 43 ratings per video. In order to analyze the impact on MOS of having a different subset of subjects view each video, as opposed to the entire population, we chose a subset of 30 anchor videos which were present in the viewing sets of all subjects. Thus, anchor videos received twice as many ratings as non-anchor videos. To analyze the influence of different subject groups contributing MOS, we randomly sampled subsets of the scores received for these anchor videos, and recalculated MOS on the reduced subsets, as shown in Fig. 8. We observe that the computed MOS values remained nearly constant across the number of subjects, although the standard deviation tended to be higher when the number of subjects fell below 40. The confidence intervals were calculated based on MOS variation over 25 trials. Fig. 8 depicts the results on 4 anchor videos, but very similar observations were made on the remaining anchor videos. A key takeaway of this exercise is that MOS was relatively robust against the number of subjects.
F. Analysis of Opinion Scores

a) Impact of frame rate on MOS: In Fig. 9 (left), the average MOS over all videos at each frame rate is plotted, along with the corresponding confidence intervals. Clearly, increases in frame rate led to higher perceived quality, but with diminishing returns for videos beyond 60 fps. In Fig. 9 (right), the impact of camera motion on MOS is illustrated. Videos with significant camera motion suffer from judder/strobing artifacts, particularly among lower frame rate versions. Thus, videos with camera motion tended to have lower MOS values than non-camera-motion videos at lower frame rates. However, this gap narrowed with increases in frame rate, indicating a valuable reduction in judder/strobing distortions at higher frame rates.

b) MOS content dependence: In Fig. 10 the impact of source content on MOS across different frame rates is analyzed. It may be seen that for certain contents there exists a clear demarcation between frame rates; however, this separation is considerably reduced beyond 60 fps. Note that videos at lower frame rates (24 fps, 30 fps) always had lower MOS values, irrespective of content, indicating the existence of annoying temporal distortions arising from frame rate variations. A salient takeaway from these plots is that there exists high perceptual disparity in the low fps regime, irrespective of content. However, moving towards high fps, there is a significant reduction in this gap, with the amount of reduction depending on the content.

c) Rate distortion curves: In Fig. 11, rate distortion (RD) curves are plotted for various frame rates of 1080p (left) and 4K (right) videos. The horizontal axis denotes bit-rates averaged across content over the 5 compression levels, as discussed in Sec. III-D. Note that we ignored the lossless (CRF=0) compression level when plotting Fig. 11, as the bit-rates associated with those sequences are large, and including them would make it harder to compare lower bit-rate videos. From the plots we may discern that there exists considerable overlap among the RD curves for frame rates above 60 fps in the low bit-rate region, while the amount of overlap gradually decreases towards the high bit-rate regime. Here as well, lower frame rates (24 fps, 30 fps) led to much lower MOS values across all bit-rates, reflecting the impact of temporal distortions on video quality.

TABLE III: Results of t-test between videos at various frame rates. A value of '1' indicates that the row is statistically superior (better visual quality) to the column, while a value of '0' indicates that the column is statistically superior to the row. A value of '-' indicates that the row and column are statistically similar. Each sub-entry in a row/column corresponds to the 16 contents, arranged in the same order as shown in Fig. 1.

          24 fps            30 fps            60 fps            82 fps            98 fps            120 fps
24 fps    ----------------  0000000-00000000  0000000000000000  0000000000000000  0000000000000000  0000000000000000
30 fps    1111111-11111111  ----------------  0000000000000-00  0000000000000000  0000000000000000  0000000000000000
60 fps    1111111111111111  1111111111111-11  ----------------  ----0-0-10--000-  0000--0-----00--  0000000-000-00--
82 fps    1111111111111111  1111111111111111  ----1-1-01--111-  ----------------  -00-----01------  0000--0-0-------
98 fps    1111111111111111  1111111111111111  1111--1-----11--  -11-----10------  ----------------  0000--0-0----0--
120 fps   1111111111111111  1111111111111111  1111111-111-11--  1111--1-1-------  1111--1-1----1--  ----------------
G. Statistical Significance
TABLE IV: Performance comparison of FR-VQA algorithms on the LIVE-YT-HFR database. The distorted videos were temporally upsampled to match the reference frame rate. In each column, the first and second best models are boldfaced.

              SROCC ↑   KROCC ↑   PLCC ↑    RMSE ↓
PSNR          0.6950    0.5071    0.6685    9.023
SSIM [17]     0.4494    0.3102    0.4526    10.819
MS-SSIM [18]  0.4898    0.3407    0.4673    10.726
FSIM [19]     0.4469    0.3151    0.4435    10.874
ST-RRED [25]  0.5531    0.3800    0.5107    10.431
SpEED [26]    0.4861    0.3409    0.4449    10.866
FRQM [33]     0.4216    0.2956    0.452     10.804
VMAF [27]
deepVQA [30]  0.3463    0.2371    0.3329    11.441
GSTI [34]

We analyzed the statistical significance of the subjective scores obtained from the human study by performing a t-test between Gaussian distributions centered at the MOS values (and also employing the standard deviations of the MOS), to infer the significance of individual frame rates at the 95% confidence level. Since the condition being studied is a function of content, we performed our experiments separately on each content. In Table III, a value of '1' signifies that the row condition was statistically superior (better visual quality) to the column condition, a value of '0' denotes that the row is worse than the column, and a value of '-' indicates that the row and column conditions were statistically equivalent. For example, on all 16 contents the 120 fps videos exhibited statistically better visual quality than the 24 fps and 30 fps videos. From the Table, we may observe that the lower frame rates exhibited high degrees of statistical separability, but this margin of difference decreased towards higher frame rates, especially beyond 60 fps. This reinforces our previous findings regarding the influence of frame rate on MOS.
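A hedged sketch of the pairwise test just described, using per-condition summary statistics (MOS mean, standard deviation, and number of ratings) and SciPy's two-sample t-test from stats, which matches the Gaussian assumption above:

```python
from scipy.stats import ttest_ind_from_stats

def compare_conditions(mos1, std1, n1, mos2, std2, n2, alpha=0.05):
    """Returns '1' if condition 1 is statistically superior (higher MOS),
    '0' if inferior, and '-' if the two are statistically equivalent."""
    _, p = ttest_ind_from_stats(mos1, std1, n1, mos2, std2, n2)
    if p >= alpha:
        return "-"
    return "1" if mos1 > mos2 else "0"
```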
V. EVALUATION OF OBJECTIVE QUALITY PREDICTORS
As a way of demonstrating the value of the new LIVE-YT-HFR Database, we evaluated a variety of relevant objective VQA models on it. We employed four performance criteria to evaluate the VQA models: Spearman's rank order correlation coefficient (SROCC), Kendall's rank order correlation coefficient (KROCC), Pearson's linear correlation coefficient (PLCC), and the root mean squared error (RMSE). Before computing PLCC and RMSE, the predicted scores were passed through a four-parameter logistic non-linearity, as described in [54]:

$$Q(x) = \beta_2 + \frac{\beta_1 - \beta_2}{1 + \exp\left(-\frac{x - \beta_3}{|\beta_4|}\right)}. \qquad (6)$$

TABLE V: Performance comparison of various FR methods at individual frame rates in the HFR database. The distorted videos were temporally upsampled to match the reference frame rate. Each entry gives SROCC / PLCC. In each column, the first and second best values are boldfaced.

              24 fps         30 fps         60 fps         82 fps         98 fps         120 fps        Overall
PSNR
deepVQA [30]  0.1144/0.0495  0.1353/0.1059  0.2527/0.1652  0.1803/0.1515  0.2816/0.2654  0.6865/0.6209  0.3463/0.3329
GSTI [34]
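For reference, a minimal sketch of this evaluation protocol (SROCC/KROCC on the raw predictions, PLCC/RMSE after fitting Eq. (6)); the initialization of the fit is a common heuristic, not taken from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import spearmanr, kendalltau, pearsonr

def logistic4(x, b1, b2, b3, b4):
    """Four-parameter logistic of Eq. (6)."""
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / abs(b4)))

def evaluate(pred, dmos):
    """Compute SROCC, KROCC, PLCC and RMSE for one VQA model."""
    p0 = [dmos.max(), dmos.min(), pred.mean(), pred.std() + 1e-6]
    params, _ = curve_fit(logistic4, pred, dmos, p0=p0, maxfev=20000)
    fitted = logistic4(pred, *params)
    return {
        "SROCC": spearmanr(pred, dmos).correlation,
        "KROCC": kendalltau(pred, dmos).correlation,
        "PLCC": pearsonr(fitted, dmos)[0],
        "RMSE": float(np.sqrt(np.mean((fitted - dmos) ** 2))),
    }
```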
TABLE VI: Results of F-test between residuals of model predictions and DMOS values across various FR methods. Each cell contains 7 entries: the 6 frame rates (24, 30, 60, 82, 98, 120 fps) and all videos, in that order. A value of '1' indicates that the row is statistically superior to the column, a value of '0' indicates that the column is statistically superior to the row, and a value of '-' indicates statistical equivalence between row and column.

          PSNR     SSIM     MS-SSIM  FSIM     ST-RRED  SpEED    FRQM     VMAF     deepVQA  GSTI
PSNR      -------  --111-1  --111-1  ---1111  ---1--1  --111-1  -1111-1  ----00-  --111-1  00000-0
SSIM      --000-0  -------  -------  -----1-  -----1-  -----1-  -------  --000-0  -------  00000-0
MS-SSIM   --000-0  -------  -------  -----1-  -------  -------  -------  --00000  -------  00000-0
FSIM      ---0000  -----0-  -----0-  -------  -------  -------  -------  --00000  -------  0000000
ST-RRED   ---0--0  -----0-  -------  -------  -------  -------  -------  ---0000  ------1  0000000
SpEED     --000-0  -----0-  -------  -------  -------  -------  -------  --00000  -------  0000000
FRQM      -0000-0  -------  -------  -------  -------  -------  -------  --000-0  -------  00000-0
VMAF      ----11-  --111-1  --11111  --11111  ---1111  --11111  --111-1  -------  --11111  000---0
deepVQA   --000-0  -------  -------  -------  ------0  -------  -------  --00000  -------  0000000
GSTI      11111-1  11111-1  11111-1  1111111  1111111  1111111  11111-1  111---1  1111111  -------
Since we obtained MOS values from the human study, our database can be employed to create and/or test both FR and NR VQA models.
A. FR-VQA Models
To conduct FR model evaluations, we used the DMOS values obtained from equation (5), considering the original lossless 120 fps videos as references. We began by testing four FR-IQA methods: PSNR, SSIM [17], MS-SSIM [18] and FSIM [19]. These are image quality models, and hence do not take into account any temporal information. They were computed on every frame, and the frame scores were averaged across all frames to obtain the final video scores. We also studied five popular FR-VQA models: ST-RRED [25], SpEED [26], FRQM [33], VMAF [27], and deepVQA [30]. Further, we also included a prototype model we recently devised, called the Generalized Spatio-Temporal Index (GSTI) [34], which is designed to capture artifacts arising from frame rate variations, while also being responsive to other distortions. When evaluating deepVQA, we only used stage-1 of the pretrained model (trained on the LIVE-VQA [10] database) obtained from the code released by the authors; for VMAF, we used the pretrained model available at https://github.com/Netflix/vmaf. Among the above VQA models, only FRQM and GSTI allow for frame rate variations, while the rest require the reference and corresponding distorted sequences to have the same frame rate. When there were differing frame rates, we performed naive temporal upsampling by frame duplication to match the reference frame rate. Although we could have downsampled the reference, we avoided this approach since it could potentially introduce artifacts (e.g. judder) into the reference, which is not desirable. We also did not consider any specialized temporal upsampling technique (e.g. motion compensated temporal interpolation), as performance can be very sensitive to the choice of interpolation method.
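A minimal sketch of such frame duplication, using nearest-frame indexing on the time axis (for integer ratios such as 24 to 120 fps this repeats each frame exactly 5 times; the handling of non-integer ratios such as 82 to 120 fps is an assumption):

```python
import numpy as np

def duplicate_to_rate(frames, src_fps, ref_fps):
    """Naively upsample a video to the reference frame rate by repeating
    frames, so that same-frame-rate FR models can be applied.

    frames: (N, H, W) or (N, H, W, C) array at src_fps.
    """
    n_src = frames.shape[0]
    n_ref = int(round(n_src * ref_fps / src_fps))
    idx = np.floor(np.arange(n_ref) * src_fps / ref_fps).astype(int)
    return frames[np.minimum(idx, n_src - 1)]
```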
The performance of the various FR methods is shown in Table IV. In Fig. 12, scatter plots of the objective VQA scores against DMOS are shown for all of the FR-VQA models, along with the best fitting logistic function obtained from equation (6). GSTI was the best performing model among those compared, across all performance criteria. The poor correlation values of the FR-IQA indices PSNR, SSIM, MS-SSIM and FSIM highlight the importance of crucial temporal information for VQA in HFR scenarios. The inferior performance of other existing VQA models is also indicative of the fundamental limitations encountered when reference and distorted sequences have differing frame rates.

In order to individually analyze performance against each frame rate, we subdivided the database into sets of videos having the same frame rates. The performance comparison is shown in Table V. To avoid clutter, we only included SROCC and PLCC scores in the evaluation; however, KROCC and RMSE were observed to follow the same trends as in Table V. It may be seen that VMAF and GSTI performed well across all frame rates. We also observed an interesting anomaly, where PSNR achieved higher performance at lower frame rates when compared to some of the other models. This seemed surprising, given that PSNR has been shown to correlate relatively poorly with human judgments of quality [55], even when the distortions are purely spatial in nature. However, in this case it shows higher correlation at lower frame rates without access to temporal distortion. This may have occurred because algorithms like SSIM estimate the spatial aspect of distortion more accurately, while missing temporal distortion (e.g. judder) entirely. PSNR shows similar behavior, but its less accurate spatial predictions may have aligned better with the overall space-time quality. The FRQM index correlates very poorly when analyzed at fixed frame rates. This is because it only captures frame rate variations, and hence is insensitive to other artifacts. Moreover, FRQM can only be calculated between videos having differing frame rates.

[Fig. 12. Scatter plots of objective VQA scores versus DMOS across all videos in the LIVE-YT-HFR database, for (a) PSNR, (b) SSIM [17], (c) MS-SSIM [18], (d) FSIM [19], (e) ST-RRED [25], (f) SpEED [26], (g) FRQM [33], (h) VMAF [27], (i) deepVQA [30], and (j) GSTI [34]. The broken red line depicts the best fitting logistic function.]
B. Statistical Evaluation

Next we addressed the question of whether the observed differences in performance in Table IV are statistically significant. We employed an F-test on the residuals between DMOS and the objective scores predicted by the VQA models after applying the logistic non-linearity [10]. The main underlying assumption is that the residuals follow a Gaussian distribution with zero mean. An F-test was conducted on the ratios of the variances of the residuals between each pair of objective models. Statistical equivalence is achieved if the variances of the residuals from the two objective models are equal at the 95% significance level. The results of the statistical significance tests are reported in Table VI; we followed the same convention as used in Table III in determining statistical superiority. Each cell in Table VI consists of 7 entries: the 6 frame rates (24, 30, 60, 82, 98, 120 fps) and all videos, in that order. To summarize the results in Table VI, the performance of GSTI was statistically superior to that of the other FR-VQA models across all frame rates.

TABLE VII: Median values of SROCC, KROCC, PLCC and RMSE for No-Reference QA algorithms over 500 iterations of randomly chosen train and test sets (subjective MOS vs. predicted MOS). The values inside the brackets denote standard deviations. The top two performing models are highlighted.

                  SROCC ↑       KROCC ↑       PLCC ↑        RMSE ↓
BRISQUE [35]      0.376(0.2)    0.255(0.14)   0.384(0.2)    12.47(4.44)
NIQE [56]         0.278(0.18)   0.2(0.12)     0.255(0.2)    12.71(1.33)
V-BLIINDS [36]
TLVQM [38]        0.320(0.25)   0.241(0.17)   0.289(0.23)   17.61(6.24)
Li et al. [41]
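A sketch of the pairwise F-test on residual variances described in this subsection (two-sided, at the 95% level):

```python
import numpy as np
from scipy.stats import f as f_dist

def compare_models(res_a, res_b, alpha=0.05):
    """Compare DMOS residuals of two models. Returns '1' if model A's
    residual variance is significantly smaller (A superior), '0' if
    significantly larger, and '-' if the two are statistically equivalent."""
    var_a, var_b = np.var(res_a, ddof=1), np.var(res_b, ddof=1)
    F = var_a / var_b
    dfa, dfb = len(res_a) - 1, len(res_b) - 1
    lo = f_dist.ppf(alpha / 2.0, dfa, dfb)
    hi = f_dist.ppf(1.0 - alpha / 2.0, dfa, dfb)
    if lo <= F <= hi:
        return "-"
    return "1" if var_a < var_b else "0"
```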
C. NR-VQA Models
Since we also obtained MOS values on every video, we were able to evaluate NR-VQA models on the new database. We compared the performance of several NR-VQA models, including BRISQUE [35], NIQE [56], V-BLIINDS [36] and TLVQM [38], as reported in Table VII. All of these models employ handcrafted features; the former three derive from Natural Scene Statistics (NSS) models, while the latter uses a combination of low and high complexity features. We also included the recently proposed model by [41], which employs a deep CNN along with a Gated Recurrent Unit (GRU) for blind video quality evaluation. To evaluate this model on our database, we employed a pretrained model (trained on the KonViD-1K [15] database) released by the authors. We report the performance of the BRISQUE, V-BLIINDS and TLVQM features when trained on the LIVE-YT-HFR database using a Support Vector Regressor (SVR) with a Radial Basis Function (RBF) kernel. For training purposes, we divided the LIVE-YT-HFR database content-wise into two random subsets: 80% for training and the remaining 20% for testing, ensuring that there was no overlap between the contents present in the train and test subsets. For fair analysis, we repeated this random train-test division 500 times, and report the median performance in Table VII. Since BRISQUE is an image quality model, we calculated features on every frame, and averaged the features across frames to obtain video level features. When computing NIQE, scores were obtained on every frame, then averaged to obtain overall video scores. It may be observed that V-BLIINDS and [41] were the top performing NR methods. There were substantial differences between the correlations obtained by the FR and NR models, indicating the significance of reference information.

In Table VIII the performance of the NR models at fixed frame rates is analyzed. It may be observed that [41] achieved the highest correlation across all frame rates. Interestingly, BRISQUE, although an IQA model, achieved high correlations for individual frame rates, but when analyzed collectively across frame rates yielded poor correlation. Since sets of individual frame rates only differ by the amount of compression, BRISQUE might effectively differentiate them, but its overall efficacy was reduced by its inability to capture frame rate quality variations. In Fig. 13, boxplots depicting the spreads of SROCC values for each NR algorithm are shown, illustrating the reduced spread of scores of the method in [41], as also reported in Table VIII.

TABLE VIII: Performance comparison of various NR models at individual frame rates in the HFR database. The numbers denote median values over 500 iterations of randomly chosen train and test sets (subjective MOS vs. predicted MOS). Each entry gives SROCC / PLCC, with standard deviations in brackets. The top two performing models in each column are highlighted.

                24 fps                 30 fps                 60 fps                 82 fps                 98 fps                 120 fps                Overall
BRISQUE [35]
TLVQM [38]      0.26(0.35)/0.28(0.29)  0.26(0.35)/0.22(0.26)  0.26(0.32)/0.23(0.27)  0.31(0.29)/0.25(0.26)  0.28(0.28)/0.22(0.25)  0.27(0.29)/0.21(0.28)  0.32(0.25)/0.29(0.23)
Li et al. [41]

[Fig. 13. Boxplots of SROCC distributions for multiple NR-VQA algorithms (BRISQUE, NIQE, V-BLIINDS, TLVQM, Li et al.).]
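The content-wise train/test protocol above can be sketched with scikit-learn's grouped splitting; hyperparameters below are library defaults, not necessarily those used in our experiments:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import SVR

def median_test_srocc(features, mos, contents, n_iters=500, seed=0):
    """80/20 content-wise splits repeated n_iters times; an RBF-kernel
    SVR maps NR features (e.g. BRISQUE/TLVQM) to MOS, and the median
    test SROCC is reported."""
    splitter = GroupShuffleSplit(n_splits=n_iters, test_size=0.2,
                                 random_state=seed)
    sroccs = []
    for train, test in splitter.split(features, mos, groups=contents):
        svr = SVR(kernel="rbf").fit(features[train], mos[train])
        pred = svr.predict(features[test])
        sroccs.append(spearmanr(pred, mos[test]).correlation)
    return float(np.median(sroccs))
```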
VI. DISCUSSION AND CONCLUSION
We constructed a large HFR database comprising 480 videos, spanning six different frame rates and five compression levels, obtained from 16 diverse contents involving both HD and UHD spatial resolutions. We used these to conduct a human study involving 85 volunteer subjects. The LIVE-YT-HFR Database is unique with respect to the number of frame rates, and the joint presence of compression artifacts and frame rate variations. We also presented a comprehensive evaluation of existing FR and NR-VQA models and benchmarked their performance on the new database.

Important conclusions of our analysis are that frame rate has considerable influence on human subjective judgments of video quality, and that humans prefer higher frame rates over lower ones. Further, this preference for higher frame rates is not ubiquitous, but depends on the content being viewed. Videos involving significant camera motion almost always received higher quality scores at high frame rates, as compared to low frame rates. Moreover, the quality gain associated with frame rate increases diminishes somewhat above 60 fps. This might be expected, since videos at lower frame rates suffer from judder/strobing artifacts, while quality variations at higher frame rates, e.g. 98 and 120 fps, are more subtle, becoming noticeable only when there is high motion.

The results of objective VQA model testing were particularly instructive. The majority of the IQA methods faltered, underscoring the importance of capturing temporal information. The tested FR-VQA models mainly suffered from two shortcomings: 1) almost all FR-VQA algorithms require the same frame rate for the reference and distorted videos, thus a temporal upsampling step is needed, which can influence the outcome; 2) when analyzed separately on fixed frame rates, model performance varied across frame rates. The tested NR-VQA models also failed to capture temporal artifacts arising from frame rate changes, since the features they use do not explicitly address these types of distortions.

We believe this new HFR database will benefit the research community in advancing the understanding of the complex relationships between frame rate and perceptual video quality. We also believe that these relationships are not limited to HFR content, and that much may be learned regarding temporal information in generic VQA models.
VII. ACKNOWLEDGMENT
The authors would like to thank all the volunteers who took part in the human study.
REFERENCES

[1] C. Ge, N. Wang, G. Foster, and M. Wilson, "Toward QoE-assured 4K video-on-demand delivery through mobile edge virtualization with adaptive prefetching," IEEE Trans. Multimedia, vol. 19, no. 10, pp. 2222–2237, 2017.
[2] Z. Mai, H. Mansour, R. Mantiuk, P. Nasiopoulos, R. Ward, and W. Heidrich, "Optimizing a tone curve for backward-compatible high dynamic range image and video compression," IEEE Trans. Image Process., vol. 20, no. 6, pp. 1558–1571, 2010.
[3] D. Kundu, D. Ghadiyaram, A. C. Bovik, and B. L. Evans, "No-reference quality assessment of tone-mapped HDR pictures," IEEE Trans. Image Process., vol. 26, no. 6, pp. 2957–2971, 2017.
[4] A. Smolic, K. Mueller, N. Stefanoski, J. Ostermann, A. Gotchev, G. B. Akar, G. Triantafyllidis, and A. Koz, "Coding algorithms for 3DTV - a survey," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1606–1621, 2007.
[5] V. De Silva, H. K. Arachchi, E. Ekmekcioglu, and A. Kondoz, "Toward an impairment metric for stereoscopic video: A full-reference video quality metric to assess compressed stereoscopic video," IEEE Trans. Image Process.
[6] "GoPro," https://gopro.com.
[7] "Sony RX series," https://www.sony.com.
[8] R. M. Nasiri et al., "Perceptual quality assessment of high frame rate video," in IEEE International Workshop on Multimedia Signal Processing (MMSP), 2015, pp. 1–6.
[9] A. Mackin, F. Zhang, and D. R. Bull, "A study of high frame rate video formats," IEEE Trans. Multimedia, vol. 21, no. 6, pp. 1499–1512, 2018.
[10] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Trans. Image Process., vol. 19, no. 6, pp. 1427–1441, June 2010.
[11] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. De Veciana, "Video quality assessment on mobile devices: Subjective, behavioral and objective studies," IEEE J. Sel. Topics Signal Process., vol. 6, no. 6, pp. 652–671, 2012.
[12] P. V. Vu and D. M. Chandler, "ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices," Journal of Electronic Imaging, vol. 23, no. 1, p. 013016, 2014.
[13] M. H. Pinson, "The consumer digital video library [best of the web]," IEEE Signal Process. Mag., vol. 30, no. 4, pp. 172–174, 2013.
[14] Z. Sinno and A. C. Bovik, "Large-scale study of perceptual video quality," IEEE Trans. Image Process., vol. 28, no. 2, pp. 612–627, 2018.
[15] V. Hosu, F. Hahn, M. Jenadeleh, H. Lin, H. Men, T. Szirányi, S. Li, and D. Saupe, "The Konstanz natural video database (KoNViD-1k)," in Int. Conf. Quality of Multimedia Experience (QoMEX), 2017, pp. 1–6.
[16] Y. Wang, S. Inguva, and B. Adsumilli, "YouTube UGC dataset for video compression research," in Proc. IEEE Int. Workshop Multimedia Signal Process., 2019, pp. 1–5.
[17] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, April 2004.
[18] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2, Nov 2003, pp. 1398–1402.
[19] L. Zhang, L. Zhang, X. Mou, and D. Zhang, "FSIM: A feature similarity index for image quality assessment," IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, 2011.
[20] M. H. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Trans. Broadcast., vol. 50, no. 3, pp. 312–322, Sep. 2004.
[21] K. Seshadrinathan and A. C. Bovik, "A structural similarity metric for video based on motion models," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2007, p. I-869.
[22] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Trans. Image Process., vol. 19, no. 2, pp. 335–350, Feb 2010.
[23] P. V. Vu, C. T. Vu, and D. M. Chandler, "A spatiotemporal most-apparent-distortion model for video quality assessment," in IEEE Int. Conf. Image Process. (ICIP), Sep. 2011, pp. 2505–2508.
[24] E. C. Larson and D. M. Chandler, "Most apparent distortion: full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging, vol. 19, no. 1, p. 011006, 2010.
[25] R. Soundararajan and A. C. Bovik, "Video quality assessment by reduced reference spatio-temporal entropic differencing," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 4, pp. 684–694, April 2013.
[26] C. G. Bampis, P. Gupta, R. Soundararajan, and A. C. Bovik, "SpEED-QA: Spatial efficient entropic differencing for image and video quality," IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1333–1337, Sep. 2017.
[27] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a practical perceptual video quality metric," http://techblog.netflix.com/2016/06/toward-practical-perceptual-video.html.
[28] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Trans. Image Process., vol. 15, no. 2, pp. 430–444, Feb. 2006.
[29] S. Li, F. Zhang, L. Ma, and K. N. Ngan, "Image quality assessment by separately evaluating detail losses and additive impairments," IEEE Trans. Multimedia, vol. 13, no. 5, pp. 935–949, 2011.
[30] W. Kim, J. Kim, S. Ahn, J. Kim, and S. Lee, "Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network," in Proc. European Conference on Computer Vision (ECCV), September 2018, pp. 219–234.
[31] R. M. Nasiri and Z. Wang, "Perceptual aliasing factors and the impact of frame rate on video quality," in IEEE Int. Conf. Image Process. (ICIP), 2017, pp. 3475–3479.
[32] R. M. Nasiri, Z. Duanmu, and Z. Wang, "Temporal motion smoothness and the impact of frame rate variation on video quality," in IEEE Int. Conf. Image Process. (ICIP), 2018, pp. 1418–1422.
[33] F. Zhang, A. Mackin, and D. R. Bull, "A frame rate dependent video quality metric based on temporal wavelet decomposition and spatiotemporal pooling," in IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 300–304.
[34] P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, "Capturing video frame rate variations through entropic differencing," arXiv preprint arXiv:2006.11424, 2020.
[35] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[36] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind prediction of natural video quality," IEEE Trans. Image Process., vol. 23, no. 3, pp. 1352–1365, March 2014.
[37] X. Li, Q. Guo, and X. Lu, "Spatiotemporal statistics for video quality assessment," IEEE Trans. Image Process., vol. 25, no. 7, pp. 3329–3342, July 2016.
[38] J. Korhonen, "Two-level approach for no-reference consumer video quality assessment," IEEE Trans. Image Process., vol. 28, no. 12, pp. 5923–5938, 2019.
[39] Y. Zhang, X. Gao, L. He, W. Lu, and R. He, "Blind video quality assessment with weakly supervised learning and resampling strategy," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 8, pp. 2244–2255, Aug 2019.
[40] S. Ahn and S. Lee, "Deep blind video quality assessment based on temporal human perception," in IEEE Int. Conf. Image Process. (ICIP), Oct 2018, pp. 619–623.
[41] D. Li, T. Jiang, and M. Jiang, "Quality assessment of in-the-wild videos," in Proc. 27th ACM Int. Conf. Multimedia, 2019, pp. 2351–2359.
[42] A. Mackin, F. Zhang, and D. R. Bull, "A study of subjective video quality at various frame rates," in IEEE Int. Conf. Image Process. (ICIP), Sep. 2015, pp. 3407–3411.
[43] D. Hasler and S. E. Suesstrunk, "Measuring colorfulness in natural images," in Human Vision and Electronic Imaging VIII, vol. 5007. International Society for Optics and Photonics, 2003, pp. 87–95.
[44] A. B. Watson, "High frame rates and human vision: A view through the window of visibility," SMPTE Motion Imaging Journal, vol. 122, no. 2, pp. 18–32, 2013.
[45] B.-D. Choi, J.-W. Han, C.-S. Kim, and S.-J. Ko, "Motion-compensated frame interpolation using bilateral motion estimation and adaptive overlapped block motion compensation," IEEE Trans. Circuits Syst. Video Technol.
[46] "FFmpeg," https://ffmpeg.org.
[47] "Acer Predator X27," https://www.acer.com.
[48] D. Mukherjee, J. Han, J. Bankoski, R. Bultje, A. Grange, J. Koleszar, P. Wilkins, and Y. Xu, "A technical overview of VP9 - the latest open-source video codec," SMPTE Motion Imaging Journal, vol. 124, no. 1, pp. 44–54, Jan 2015.
[49] ITU-R Recommendation BT.500-11, "Methodology for the subjective assessment of the quality of television pictures," International Telecommunication Union, 2000.
[50] CVVP-3085-4K-8, "ClearView Player," https://videoclarity.com/PDF/ClearView-Player-DataSheet.pdf, [Online; accessed 1-November-2019].
[51] "Palette Gear console," https://monogramcc.com, [Online; accessed 1-November-2019].
[52] A. M. van Dijk, J.-B. Martens, and A. B. Watson, "Quality assessment of coded images using numerical category scaling," in Proc. SPIE - Advanced Image and Video Communications and Storage Technologies, 1995.
[53] T. Hossfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, and P. Tran-Gia, "Best practices for QoE crowdtesting: QoE assessment with crowdsourcing," IEEE Trans. Multimedia, vol. 16, no. 2, pp. 541–558, 2013.
[54] VQEG, "Final report from the video quality experts group on the validation of objective quality metrics for video quality assessment," 2000.
[55] Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Process. Mag., vol. 26, no. 1, pp. 98–117, 2009.
[56] A. Mittal, R. Soundararajan, and A. C. Bovik, "Making a "completely blind" image quality analyzer," IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, 2013.