A user model for JND-based video quality assessment: theory and applications
Haiqiang Wang, Ioannis Katsavounidis, Xinfeng Zhang, Chao Yang, C.-C. Jay Kuo
Haiqiang Wang (a), Ioannis Katsavounidis (b), Xinfeng Zhang (a), Chao Yang (a), and C.-C. Jay Kuo (a)
(a) University of Southern California, Los Angeles, California, USA
(b) Netflix, Los Gatos, California, USA
ABSTRACT
Video quality assessment (VQA) technology has attracted a lot of attention in recent years due to an increasing demand for video streaming services. Existing VQA methods are designed to predict video quality in terms of the mean opinion score (MOS) calibrated by humans in subjective experiments. However, they cannot predict the satisfied user ratio (SUR) of an aggregated viewer group. Furthermore, they provide little guidance for video coding parameter selection, e.g., the Quantization Parameter (QP) of a set of consecutive frames, in practical video streaming services. To overcome these shortcomings, the just-noticeable-difference (JND) based VQA methodology has been proposed as an alternative. It is observed experimentally that the JND location is a normally distributed random variable. In this work, we explain this distribution by proposing a user model that takes both subject variabilities and content variabilities into account. This model is built upon a user's capability to discern the quality difference between video clips encoded with different QPs. Moreover, it analyzes video content characteristics to account for inter-content variability. The proposed user model is validated on the data collected in the VideoSet. It is demonstrated that the model is flexible enough to predict the SUR distribution of a specific user group.
Keywords:
Video Quality Assessment, Just Noticeable Difference, Satisfied User Ratio
1. INTRODUCTION
Although expensive in time and money, the subjective experiment is the ultimate method to quantify the perceptual quality of compressed video. Obtaining accurate and robust labels based on subjective votings provided by human observers is a critical step in Quality of Experience (QoE) evaluation. A typical subjective experiment involves: 1) selecting several representative stimuli, 2) presenting them to a group of subjects, and 3) having the subjects assign quality scores to them. The collected subjective scores should go through a cleaning and modeling process before being used to validate the performance of objective video quality assessment metrics.

Absolute Category Rating (ACR) is one of the most commonly used subjective test methods. Test video clips are displayed on a screen for a certain amount of time and observers rate their perceived quality using an abstract scale, such as “Excellent (5)”, “Good (4)”, “Fair (3)”, “Poor (2)” and “Bad (1)”. There are two approaches to aggregating multiple scores on a given clip: the mean opinion score (MOS) and the difference mean opinion score (DMOS). The MOS is computed as the average score from all subjects, while the DMOS is calculated from the difference between the raw quality scores of the reference and the test images.

Both MOS and DMOS are popular in the quality assessment community. However, they have several limitations [3, 4].
The first challenge is that the MOS scale is treated as an interval scale rather than an ordinal scale. It is assumed that there is a linear relationship between the MOS distance and the cognitive distance. For example, a quality drop from “Excellent” to “Good” is treated the same as that from “Poor” to “Bad”. There is no difference to a metric learning system, as the same ordinal distance is preserved (i.e., the quality distance is 1 for both cases in the aforementioned 5-level scale). However, the human viewing experience is quite different when the quality changes at different levels. It is also rare to find a video clip exhibiting poor or bad quality in real-life video applications. As a consequence, the number of useful quality levels drops from five to three, which is too coarse for video quality measurement.
Further author information: send correspondence to Haiqiang Wang.

The second challenge is that scores from subjects are typically assumed to be independently and identically distributed (i.i.d.) random variables. This assumption rarely holds. Given multiple quality votings on the same content, each individual voting contributes equally in the MOS aggregation method. However, subjects may have different levels of expertise on perceived video quality. A critical viewer may give low quality ratings to coded clips whose quality is still good to the majority. The same phenomenon occurs in all presented stimuli. The absolute category rating method is also confusing to subjects, as they have different understandings and interpretations of the rating scale.

To overcome the limitations of the MOS method, the just-noticeable-difference (JND) based VQA methodology was proposed as an alternative. A viewer is asked to compare a pair of coded clips and determine whether a noticeable difference can be observed or not. The pair consists of two stimuli, i.e., a distorted stimulus (comparison) and an anchor preserving the targeted quality. A bisection search is adopted to reduce the number of pair comparisons. The JND reflects the boundary of perceived quality levels, which is well suited for the determination of the optimal image/video quality with minimum bit rates. For example, the first JND, whose anchor is the source clip, is the boundary between the “Excellent” and “Good” categories. The boundary is subjectively decided rather than empirically selected by the experiment designer.

In MOS or JND-based VQA methods, subjective data are noisy due to the nature of “subjective opinion”. In the extreme case, some subjects submit random answers rather than good-faith attempts to label. Even worse, adversarial votings may happen due to malice or a systematic misinterpretation of the task. Thus, it is critical to study subject capability and reliability to alleviate their effects in the VQA task.

In this work, we propose a user model that takes subject bias and inconsistency into account. The perceived quality of compressed video is characterized by the satisfied user ratio (SUR). The SUR value is a continuous random variable depending on subject and content factors. We study how the SUR varies with the user profile as well as with contents of variable levels of difficulty. The proposed model aggregates quality ratings per user group to address inter-group differences. The proposed user model is validated on the data collected in the VideoSet. It is demonstrated that the model is flexible enough to predict the SUR distribution of a specific user group.

The rest of this paper is organized as follows. Related work is reviewed in Sec. 2. The proposed user model is presented in Sec. 3. Experimental results are shown in Sec. 4. Finally, concluding remarks are given in Sec. 5.
2. RELATED WORK
There were several popular datasets available in the video quality assessment community, such as LIVE, VQEG-HD, MCL-V, and NETFLIX-TEST, using the MOS aggregation approach. Recently, efforts have been made to examine MOS-based subjective test methods. Various methods were proposed from different perspectives to address the limitations mentioned in Section 1.

A theoretical subject model was proposed to model the three major factors that influence MOS accuracy: subject bias, subject inaccuracy, and stimulus scoring difficulty. It was reported that the distribution of these three factors spanned about ±25% of the rating scale. In particular, the subject error terms explained previously observed inconsistencies both within a single subject’s data and in lab-to-lab differences. A perceptually weighted rank correlation indicator was proposed, which rewarded the capability of correctly ranking high-quality images and suppressed the attention towards insensitive rank mistakes. A generative model was proposed to jointly recover content and subject factors by solving a maximum likelihood estimation problem. However, these models were proposed for the traditional MOS-based approaches.

Recently, there has been considerable effort in JND-based video quality analysis. The human visual system (HVS) cannot perceive small pixel variations in coded video until the difference reaches a certain level. However, the difference between contents selected for ranking in the traditional MOS-based framework was sufficiently large for the majority of subjects. We could conduct fine-grained quality analysis by directly measuring the JND threshold of each subject. Several datasets [8, 15, 16] were proposed with the JND methodology. Corresponding JND prediction methods were proposed in [17, 18]. However, the JND location was analyzed in a data-driven fashion. It was simply modeled by the mean value of multiple JND samples with a heuristic subject rejection approach.

A probability model was proposed to offer new insights into the JND phenomenon. Inspired by the generative approach for MOS data, the proposed generative model decomposed the JND-based video quality score into subject and content factors. A closed-form expression was derived to estimate the JND location by aggregating multiple binary decisions. It was shown that the JND samples followed a Normal distribution parameterized by the subject and content factors. These unknown factors were jointly optimized by solving a maximum likelihood estimation (MLE) problem.
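As a rough illustration of the objective behind such an MLE step, the sketch below writes out a negative log-likelihood for a JND matrix. It assumes each sample is Normal with mean y_c + b_s and variance v_c^2 + v_s^2 (that is, v_c and v_s are treated as standard deviations, which is an assumption of this sketch). The function name `neg_log_likelihood` and the synthetic parameter values are hypothetical and are not taken from the cited work.

```python
import math
import random

def neg_log_likelihood(Y, y, v_c, b, v_s):
    """Negative log-likelihood of a C x S JND matrix Y (list of rows) under
    Y[c][s] ~ Normal(mean = y[c] + b[s], variance = v_c[c]**2 + v_s[s]**2).
    Entries equal to None are treated as missing votes and skipped."""
    total = 0.0
    for c, row in enumerate(Y):
        for s, obs in enumerate(row):
            if obs is None:
                continue
            var = v_c[c] ** 2 + v_s[s] ** 2
            resid = obs - (y[c] + b[s])
            total += 0.5 * (math.log(2.0 * math.pi * var) + resid ** 2 / var)
    return total

# Synthetic data with made-up parameters: 20 contents, 30 subjects.
rng = random.Random(0)
true_y = [28.0 + 0.5 * c for c in range(20)]
true_vc = [1.0 + 0.1 * c for c in range(20)]
true_b = [-3.0 + 0.2 * s for s in range(30)]
true_vs = [1.0 + 0.05 * s for s in range(30)]
Y = [[rng.gauss(true_y[c] + true_b[s],
                math.sqrt(true_vc[c] ** 2 + true_vs[s] ** 2))
      for s in range(30)] for c in range(20)]

# The generating parameters should score better than clearly wrong ones.
wrong_y = [v + 5.0 for v in true_y]
print(neg_log_likelihood(Y, true_y, true_vc, true_b, true_vs))
print(neg_log_likelihood(Y, wrong_y, true_vc, true_b, true_vs))
```

Minimizing this function with a general-purpose optimizer would yield parameter estimates. Note that adding a constant to every y_c and subtracting it from every b_s leaves the likelihood unchanged, so a constraint such as zero-mean bias is needed to pin down a unique solution.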
3. PROPOSED USER MODEL
In this section, we present the proposed user model based on the JND methodology. Let $c$ denote a reference video content, which can be compressed into a set of clips $e_i$, $i = 0, 1, 2, \cdots, 51$, where $i$ is the quantization parameter (QP) index used in H.264/AVC. Typically, clip $e_i$ has a higher PSNR value than clip $e_j$ if $i < j$, and $e_0$ is the losslessly coded copy of $c$.

The JND of coded clips characterizes the distortion visibility threshold with respect to a given anchor, $e_i$. Through the subjective experiment, JND points can be obtained from a sequence of consecutive noticeable/unnoticeable difference tests between clip pairs $(e_i, e_j)$, where $j \in \{i+1, \cdots, 51\}$. For example, the anchor for the first JND point is $e_0$, and it remains the same while searching for the first JND point. A bisection search is adopted to update $e_j$ efficiently and reduce the total number of comparisons.

Consider a VQA dataset consisting of $C$ contents and $S$ subjects. The JND data matrix is modeled as $Y \in \mathbb{R}^{C \times S}$. An individual JND location $Y_{c,s}$, for $s = 1, \cdots, S$ and $c = 1, \cdots, C$, is obtained through six rounds of comparison. The following analysis is conducted on the data matrix to recover the underlying subject and content factors.

It has been demonstrated that the perceived video quality depends on several causal factors: 1) the bias of a subject, 2) the inconsistency of a subject, 3) the average JND location, and 4) the difficulty of a content to evaluate. The JND location of content $c$ from subject $s$ can be expressed as

$$Y_{c,s} = y_c + \mathcal{N}(0, v_c^2) + \mathcal{N}(b_s, v_s^2), \tag{1}$$

where $y_c$ and $v_c$ are content factors while $b_s$ and $v_s$ are subject factors. The difficulty of a content is modeled by $v_c \in [0, \infty)$. A larger $v_c$ value means that its masking effects are stronger and that even the most experienced experts have difficulty in spotting artifacts in compressed clips. The bias of a subject is modeled by the parameter $b_s \in (-\infty, +\infty)$. If $b_s < 0$, the subject is more sensitive to quality degradation in compressed video clips. If $b_s > 0$, the subject is less sensitive to distortions. An average subject has a bias around $b_s = 0$. Moreover, the subject variance parameter, $v_s$, captures the inconsistency of the quality votings from subject $s$. A consistent subject evaluates all sequences attentively.

Under the assumption that content and subject are independent factors on perceived video quality, the JND position can be expressed by a Gaussian distribution in the form of

$$Y_{c,s} \sim \mathcal{N}(\mu_Y, \sigma_Y^2), \tag{2}$$

where $\mu_Y = y_c + b_s$ and $\sigma_Y^2 = v_c^2 + v_s^2$. The unknown parameters are $\theta = (\{y_c\}, \{v_c\}, \{b_s\}, \{v_s\})$ for $c = 1, \cdots, C$ and $s = 1, \cdots, S$, where $\{\cdot\}$ denotes the corresponding parameter set. All unknown parameters can be jointly estimated via the Maximum Likelihood Estimation (MLE) method given the subjective data matrix $Y \in \mathbb{R}^{C \times S}$. This is a well-formulated parameter inference approach and we refer interested readers to [12, 19] for more details.

Among the four parameter sets in $\theta = (\{y_c\}, \{v_c\}, \{b_s\}, \{v_s\})$, we have limited control over the content factors, i.e., $y_c$ and $v_c$. Content factors should be independent parameters that are input to a quality model. In practice, it is difficult, sometimes even impossible, to model subject inconsistency (i.e., the $v_s$ term), as it is the viewer’s freedom to decide how much attention to pay to the video content.

On the other hand, the subject bias term (i.e., $b_s$) is a consistent prior of each subject. It is reasonable to model the subject bias and integrate it into a SUR model. We can roughly classify users into three groups based on the bias estimated from MLE. The user model aims to provide a flexible system to accommodate different viewer groups:

• Viewers who are easy-to-satisfy (ES), corresponding to a larger $b_s$;
• Viewers who have normal sensitivity (NS), corresponding to a neutral $b_s$;
• Viewers who are hard-to-satisfy (HS), corresponding to a smaller $b_s$.

Figure 1: Consecutive frames of contents

Furthermore, a viewer is said to be satisfied if he or she cannot perceive a quality difference between the compressed clip and its anchor. The Satisfied User Ratio (SUR) of video clip $e_i$ on user group $j$ can be expressed as

$$Z_{i,j} = 1 - \frac{1}{|S_j|} \sum_{s \in S_j} \mathbb{1}_s(e_i), \tag{3}$$

where $S_j$ is the $j$-th group of subjects and $|\cdot|$ denotes the cardinality. $\mathbb{1}_s(e_i) = 1$ or $0$ if the $s$-th subject can or cannot see the difference between compressed clip $e_i$ and its anchor, respectively. The summation term on the right-hand side of Eq. (3) is the empirical cumulative distribution function (CDF) of the random variable $Y_{c,s}$. Then, by substituting Eq. (2) into Eq. (3), we obtain a compact expression for the SUR curve as

$$Z_{i,j} = Q(e_i \mid \mu_Y, \sigma_Y^2) = Q(e_i \mid y_c + b_s, v_c^2 + v_s^2), \quad \text{for } s \in S_j, \tag{4}$$

where $Q(\cdot)$ is the Q-function of the normal distribution.
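To make Eqs. (3) and (4) concrete, here is a minimal Python sketch of both SUR expressions. It rests on two assumptions that are not stated explicitly in the text: a subject is taken to notice the difference once the clip's QP reaches his or her JND location, and $v_c$ and $v_s$ are treated as standard deviations whose squares add. The function names and the sample JND values are hypothetical.

```python
import math

def q_function(x):
    """Tail probability of the standard normal: Q(x) = P(X > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def model_sur(qp, y_c, b_s, v_c, v_s):
    """Model-based SUR in the spirit of Eq. (4) for a group sharing bias b_s.
    Assumption: v_c and v_s are standard deviations, so variances add."""
    mu = y_c + b_s
    sigma = math.sqrt(v_c ** 2 + v_s ** 2)
    return q_function((qp - mu) / sigma)

def empirical_sur(qp, jnd_locations):
    """Empirical SUR in the spirit of Eq. (3): one minus the fraction of
    subjects who see a difference at this QP.
    Assumption: a subject notices the difference once qp >= his or her JND."""
    noticed = sum(1 for y in jnd_locations if qp >= y)
    return 1.0 - noticed / len(jnd_locations)

# Hypothetical JND locations (QP units) of one user group on one content.
jnds = [27, 29, 30, 31, 31, 32, 33, 35]
for qp in (26, 30, 34):
    print(qp, empirical_sur(qp, jnds), round(model_sur(qp, 31.0, 0.0, 2.0, 2.0), 3))
```

Both curves decrease as QP grows. The model-based curve equals 0.5 exactly at QP = y_c + b_s, which is why a positive bias shifts the whole SUR curve toward larger QPs.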
By dividing users into different groups, the model achieves small intra-group variance and large inter-group variance, so that the JND and the SUR can be modeled more precisely. Alternatively, a universal model can be obtained by replacing $S_j$ with the union of all subjects, i.e., $S = \bigcup_j S_j$.
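The individual JND locations that feed this model come from the bisection search described at the beginning of this section. The sketch below illustrates one way such a search could work. It assumes that the subject's responses are monotone in QP and that a JND exists within the tested range. The function `find_jnd` and the simulated subject are hypothetical and do not reproduce the exact test protocol.

```python
def find_jnd(notices_difference, anchor_qp=0, max_qp=51):
    """Bisection search for the smallest QP above anchor_qp at which the
    subject reports a noticeable difference against the anchor.

    notices_difference(anchor_qp, qp) -> bool is the subject's answer.
    Assumptions: answers are monotone in qp (once noticeable, always
    noticeable) and the difference is noticeable at max_qp.
    Returns (jnd_qp, number_of_comparisons)."""
    lo, hi = anchor_qp + 1, max_qp
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if notices_difference(anchor_qp, mid):
            hi = mid          # JND is at mid or below
        else:
            lo = mid + 1      # JND is above mid
    return lo, comparisons

# Simulated subject whose first JND sits at QP 30 (made-up value).
subject = lambda anchor_qp, qp: qp >= 30
print(find_jnd(subject))
```

With 51 candidate QPs above the anchor, the interval halves at every step, so this sketch never needs more than six comparisons, in line with the six rounds mentioned above.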
4. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed user model using real JND data from the VideoSet and compare it with the MOS method. The VideoSet contains 220 video contents in four resolutions and three JND points per resolution per content. During the subjective test, the dataset was split into 15 subsets and each subset was evaluated independently by a group of subjects. The group size was around 35. We adopt a subset of the first JND point on 720p video in our experiment. It contains 15 video contents evaluated by 37 subjects.

The cleaned JND scores are shown in Figure 2a, and the estimated subject bias and inconsistency are shown in Figure 2c. Note that 5 subjects were identified as unreliable and their quality votings were removed. These subjects have a larger bias value or inconsistent measures. We refer interested readers to prior work for further details. Figure 2b shows the estimated content difficulty.

[Figure 2 panels: (a) video index (c) versus test subjects (s); (b) content difficulty ($v_c$) versus video index (c), MLE; (c) subject bias ($b_s$) and subject inconsistency ($v_s$) versus subject index (s), MLE; (d) recovered JND location ($y_c$) versus video index (c), MLE and MOS]
Figure 2: Visualization of cleaned JND data and estimated subject and content factors: (a) cleaned JND data, where each pixel represents one JND location and a brighter pixel means the JND happens at a larger QP, (b) estimated content difficulty (i.e., $v_c$) using the MLE method, (c) estimated subject bias and inconsistency (i.e., $b_s$ and $v_s$), and (d) estimated JND locations using the MLE and the MOS methods, respectively. The error bars in the subfigures represent the 95% confidence interval.

We classify viewers into different viewer groups based on the estimated subject bias from cleaned JND data. The distributions of subject bias and inconsistency are given in Figure 3. The left and middle figures are the histograms of the two statistics, respectively. For a large percentage of viewers, the bias and inconsistency are in a reasonable range (i.e., $[-4, 4]$ for the subject bias and $[0, .5]$ for subject inconsistency, respectively). The right figure is the scatter plot of these two factors. We do not observe a strong correlation between them.

In the following, we use two video clips, denoted EC and HC, with these settings:

• The subject bias is set to -4, 0, 4 for HS, NS and ES, respectively.
• Subject inconsistency is set to 2 for all subjects.
• The averaged JND locations are set to 31.7 and 30.39 for the two clips.
• The content difficulty levels are set to 3.962 and 1.326 for the two clips.

Figure 3: Illustration of subject factors. Left: the histogram of the subject bias. Middle: the histogram of subject inconsistency. Right: the scatter plot of subject inconsistency versus the subject bias.

We have the following two observations.

1. SUR difference for normal users

Consider the middle curves of the EC and HC contents. Subjects in this group have normal sensitivity and we use this group to represent the majority. Intuitively, the content diversity is large if we visually examine those two clips. However, if we target $SUR = 0.75$, which is the counterpart of the mean value in the MOS method, the QP locations from the modeled SUR curves are quite close. The difference increases when the SUR deviates from the $SUR = 0.75$ location. Contents that have a weak masking effect (shown in the blue curves) are less resistant to compression distortion, and their SUR drops sharply once artifacts become noticeable. In contrast, contents that have a strong masking effect (shown in the red curves) have better discriminatory power on subject capability, so that the SUR curve drops slowly. Given the same extra bitrate quota, we could expect a higher SUR gain from EC than from HC. It takes much more effort to satisfy critical users when the content has a strong masking effect. We conclude that it is essential to study content difficulty and subject capability to better model the perceived quality of compressed video.

2. SUR difference for different user groups

The SUR difference is considerably large among different user groups on the same content. We observe a gap between the three curves for both contents. The SUR curve of normal users is shifted by the subject bias $b_s$ in Eq. (4). Although the neutral user group covers the majority of users, we believe that a quality model would better characterize QoE by taking the user capability into consideration.

The above observations can be explained using the proposed user model, which shows the value of our study.
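The first observation can be checked numerically by inverting the modeled SUR curve. The sketch below does so with the settings listed in this section, under the assumption that the content difficulty and subject inconsistency values are standard deviations whose squares add. The labels "clip A" and "clip B" are hypothetical placeholders for the two clips, in the order their parameters are listed.

```python
import math
from statistics import NormalDist

def qp_at_sur(target_sur, y_c, b_s, v_c, v_s):
    """QP at which the modeled SUR equals target_sur.
    SUR = Q(z) = 1 - Phi(z) with z = (qp - (y_c + b_s)) / sigma, so
    qp = y_c + b_s + sigma * Phi^{-1}(1 - SUR).
    Assumption: v_c and v_s are standard deviations."""
    sigma = math.sqrt(v_c ** 2 + v_s ** 2)
    z = NormalDist().inv_cdf(1.0 - target_sur)
    return y_c + b_s + sigma * z

# (averaged JND location, content difficulty) as listed in this section.
clips = {"clip A": (31.7, 3.962), "clip B": (30.39, 1.326)}
groups = {"HS": -4.0, "NS": 0.0, "ES": 4.0}   # subject bias per group
V_S = 2.0                                      # subject inconsistency

for name, (y_c, v_c) in clips.items():
    for group, b_s in groups.items():
        qp = qp_at_sur(0.75, y_c, b_s, v_c, V_S)
        print(name, group, round(qp, 2))
```

Under these assumptions, the two normal-sensitivity QPs at SUR = 0.75 fall within about 0.1 of each other, while at a stricter target such as SUR = 0.95 they are roughly two QP steps apart. Each group's curve is simply shifted by its bias.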
5. CONCLUSION AND FUTURE WORK
A flexible user model was proposed in this work by considering the subject and content factors in the JND framework. The QoE of a group of users was characterized by the Satisfied User Ratio (SUR), while the JND location of content $c$ from subject $s$ was modeled as a random variable parameterized by subject and content factors. The model parameters can be estimated by the MLE method using a set of JND-based subjective test data. As an application of the proposed user model, we studied SUR curves that are influenced by different user profiles and contents of different difficulty levels. It was shown that the subject capability significantly affects the SUR curves, especially in the middle range of the quality curve.

The proposed user model provides valuable insights into the quality assessment problem. We would like to explore these insights for better SUR prediction for new contents in the future.

Figure 4: Illustration of the proposed user model (SUR versus QP). The blue and red curves demonstrate the SUR of EC and HC contents, respectively. For each content, the three curves show the SUR difference between different user groups (HS, NS and ES).
REFERENCES
[1] ITU-R BT. 500, “Methodology for the subjective assessment of the quality of television pictures,” (2003).
[2] ITU-T P.910, “Subjective video quality assessment methods for multimedia applications,” (1999).
[3] Chen, K.-T., Wu, C.-C., Chang, Y.-C., and Lei, C.-L., “A crowdsourceable QoE evaluation framework for multimedia content,” in [Proceedings of the 17th ACM international conference on Multimedia], 491–500, ACM (2009).
[4] Ye, P. and Doermann, D., “Active sampling for subjective image quality assessment,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 4249–4256 (2014).
[5] Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L., “Whose vote should count more: Optimal integration of labels from labelers of unknown expertise,” in [Advances in neural information processing systems], 2035–2043 (2009).
[6] Li, Q., Li, Y., Gao, J., Su, L., Zhao, B., Demirbas, M., Fan, W., and Han, J., “A confidence-aware approach for truth discovery on long-tail data,” Proceedings of the VLDB Endowment (4), 425–436 (2014).
[7] Lin, J. Y., Jin, L., Hu, S., Katsavounidis, I., Li, Z., Aaron, A., and Kuo, C.-C. J., “Experimental design and analysis of JND test on coded image/video,” in [SPIE Optical Engineering + Applications], 95990Z–95990Z, International Society for Optics and Photonics (2015).
[8] Wang, H., Katsavounidis, I., Zhou, J., Park, J., Lei, S., Zhou, X., Pun, M.-O., Jin, X., Wang, R., Wang, X., et al., “VideoSet: A large-scale compressed video quality dataset based on JND measurement,” Journal of Visual Communication and Image Representation, 292–302 (2017).
[9] Seshadrinathan, K., Soundararajan, R., Bovik, A. C., and Cormack, L. K., “Study of subjective and objective quality assessment of video,” IEEE Transactions on Image Processing (6), 1427–1441 (2010).
[10] Group, V. Q. E. et al., “Report on the validation of video quality models for high definition video content,” (2010).
[11] Lin, J. Y., Song, R., Wu, C.-H., Liu, T., Wang, H., and Kuo, C.-C. J., “MCL-V: A streaming video quality assessment database,” Journal of Visual Communication and Image Representation, 1–9 (2015).
[12] Li, Z. and Bampis, C. G., “Recover subjective quality scores from noisy measurements,” in [Data Compression Conference (DCC), 2017], 52–61, IEEE (2017).
[13] Janowski, L. and Pinson, M., “The accuracy of subjects in a quality experiment: A theoretical subject model,” IEEE Transactions on Multimedia (12), 2210–2224 (2015).
[14] Wu, Q., Li, H., Meng, F., and Ngan, K. N., “A perceptually weighted rank correlation indicator for objective image quality assessment,” IEEE Transactions on Image Processing, 2499–2513 (May 2018).
[15] Jin, L., Lin, J. Y., Hu, S., Wang, H., Wang, P., Katsavounidis, I., Aaron, A., and Kuo, C.-C. J., “Statistical study on perceived JPEG image quality via MCL-JCI dataset construction and analysis,” in [IS&T/SPIE Electronic Imaging], International Society for Optics and Photonics (2016).
[16] Wang, H., Gan, W., Hu, S., Lin, J. Y., Jin, L., Song, L., Wang, P., Katsavounidis, I., Aaron, A., and Kuo, C.-C. J., “MCL-JCV: A JND-based H.264/AVC video quality assessment dataset,” in [ ], 1509–1513 (Sept 2016).
[17] Huang, Q., Wang, H., Lim, S. C., Kim, H. Y., Jeong, S. Y., and Kuo, C.-C. J., “Measure and prediction of HEVC perceptually lossy/lossless boundary QP values,” in [Data Compression Conference (DCC), 2017], 42–51, IEEE (2017).
[18] Wang, H., Katsavounidis, I., Huang, Q., Zhou, X., and Kuo, C.-C. J., “Prediction of satisfied user ratio for compressed video,” arXiv preprint arXiv:1710.11090 (2017).
[19] Wang, H., Zhang, X., Yang, C., and Kuo, C.-C. J., “A JND-based video quality assessment model and its application,” arXiv preprint arXiv:1807.00920.