A JND-based Video Quality Assessment Model and Its Application
Haiqiang Wang, Xinfeng Zhang, Chao Yang, and C.-C. Jay Kuo, Fellow, IEEE
Abstract
Based on the Just-Noticeable-Difference (JND) criterion, a subjective video quality assessment (VQA) dataset, called the VideoSet, was constructed recently. In this work, we propose a JND-based VQA model using a probabilistic framework to analyze and clean collected subjective test data. While most traditional VQA models focus on content variability, our proposed VQA model takes both subject and content variabilities into account. The model parameters used to describe subject and content variabilities are jointly optimized by solving a maximum likelihood estimation (MLE) problem. As an application, the new subjective VQA model is used to filter out unreliable video quality scores collected in the VideoSet. Experiments are conducted to demonstrate the effectiveness of the proposed model.
Index Terms
Video Quality Assessment, Subjective Viewing Model, Just Noticeable Difference.
I. INTRODUCTION
Subjective quality evaluation is the ultimate means to measure the quality of experience (QoE) of users. Formal methods and guidelines for subjective quality assessment are specified in various ITU recommendations, such as ITU-T P.910 [1] and ITU-R BT.500 [2]. Several datasets on video quality assessment have been proposed, such as the LIVE dataset [3], the Netflix Public dataset [4], the VQEG HD3 dataset [5] and the VideoSet [6]. Furthermore, efforts have been made in developing objective quality metrics such as VQM-VFD [7], MOVIE [8] and VMAF [4].

Machine learning-based video quality assessment (VQA) systems rely heavily on the quality of collected subjective scores. A typical pipeline consists of three main steps. First, a group of human viewers is recruited to grade video quality based on individual perception. Second, the noisy subjective data are cleaned and combined to provide an estimate of the actual video quality. Third, a machine learning model is trained and tested on the calibrated dataset, and its performance is reported in terms of evaluation criteria. These are called the data collection, cleaning and analysis steps, respectively. In this work, we propose a novel method for the data cleaning step, which is essential for a variety of video contents viewed by different individuals. This is a challenging problem due to the following variabilities.
Haiqiang Wang, Xinfeng Zhang, Chao Yang and C.-C. Jay Kuo are with the Ming-Hsieh Department of Electrical Engineering, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089 USA. E-mail: {haiqianw, xinfengz, yangchao}@usc.edu, [email protected].

• Inter-subject variability. Subjects may have different vision capabilities.
• Intra-subject variability. The same subject may give different scores for the same content in multiple rounds.
• Content variability. Video contents have varying characteristics.

When each content is evaluated several times by different subjects, a straightforward approach is to use the most common label as the true label [9]. This problem was examined more carefully in [10], which verified the distribution assumptions required for parametric testing; it also discussed practical considerations and made recommendations on the testing procedure. Based on the Just-Noticeable-Difference (JND) criterion, a VQA dataset, called the VideoSet [6], was constructed recently. Motivated by [11], we develop a probabilistic VQA model to quantify the influence of subject and content factors on JND-based VQA scores. Furthermore, we show that our model is more robust than the MOS-based model in [11] for cleaning noisy measurements.

The rest of this work is organized as follows. Related previous work is reviewed in Sec. II. The proposed JND-based VQA model is introduced in Sec. III. The parameter inference problem is examined in Sec. IV. Experimental results in data cleaning are shown in Sec. V. Concluding remarks are given in Sec. VI.

II. REVIEW OF RELATED WORK
The impacts of subject and content variabilities on video quality scores are often analyzed separately. A z-score consistency test was used as a preprocessing step to identify unreliable subjects in the VideoSet. Another method was proposed in [12], which built a probabilistic model for the quality evaluation process and then estimated model parameters with a standard inference approach. A subject model was proposed in [13] to study the influence of subjects on test scores. An additive model was adopted, and model parameters were estimated using real data obtained by repetitive experiments on the same content. More recently, a generative model was proposed in [11] that treated content and subject factors jointly by solving a maximum likelihood estimation (MLE) problem. Their model was developed for the traditional mean-opinion-score (MOS) data acquisition process with continuous degradation category rating.

The JND-based VQA methodology provides a new framework for fine-grained video quality score acquisition. Several JND-based VQA datasets were constructed [6], [14], [15], and JND location prediction methods were examined in [16], [17]. Inspired by [11], we develop a JND-based VQA model that considers subject and content variabilities jointly in this work. Then, we show that this new method provides a powerful data cleaning tool for JND-based VQA datasets.

III. DERIVATION OF JND-BASED VQA MODEL
Consider a VQA dataset containing $C$ video contents, where each source video clip is denoted by $c$, $c = 1, \cdots, C$. Each source clip is encoded into a set of coded clips $d_i$, $i = 0, 1, 2, \cdots, 51$, where $i$ is the quantization parameter (QP) index used in the H.264/AVC standard. By design, clip $d_i$ has a higher PSNR value than clip $d_j$ if $i < j$, and $d_0$ is the losslessly coded copy of $c$. The JND of this set of coded clips characterizes the distortion visibility threshold with respect to a given anchor, $d_i$. Through subjective experiments, JND points can be obtained from a sequence of consecutive noticeable/unnoticeable difference tests between clip pairs $(d_i, d_j)$, where $j \in \{i+1, \cdots, 51\}$.

A. Binary Decisions in Subjective JND Tests
The anchor, $d_i$, is fixed while searching for the JND location. With a binary search procedure to update $d_j$, it takes at most $L = 6$ rounds to find the JND location. Here, we use $l$, $l = 1, \cdots, L$, to indicate the round number and $s$, $s = 1, \cdots, S$, to indicate the subject index, respectively. The test result obtained from subject $s$ at round $l$ on content $c$ is a binary decision: noticeable or unnoticeable difference. It is denoted by random variable $X_{c,s,l} \in \{0, 1\}$. If the decision is "unnoticeable difference", we set $X_{c,s,l} = 1$; otherwise, $X_{c,s,l} = 0$. The probability of $X_{c,s,l}$ can be written as
$$\Pr(X_{c,s,l} = 1) = p_{c,s,l} \quad\text{and}\quad \Pr(X_{c,s,l} = 0) = 1 - p_{c,s,l}, \tag{1}$$
where random variable $p_{c,s,l} \in [0, 1]$ is used to model the probability of making the "unnoticeable difference" decision at a given comparison.

We say that a decision was made confidently if all subjects made the same decision, whether it was "noticeable difference" or "unnoticeable difference". On the other hand, a decision was made least confidently if the two decisions were split evenly. In light of these observations, $p_{c,s,l}$ should be closer to zero for smaller $l$, since the quality difference between two clips is more obvious in earlier test rounds. It is close to 0.5 for larger $l$ as the coded clip approaches the final JND location.
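To make the decision model concrete, the following minimal sketch simulates one subject's decision sequence as independent Bernoulli trials according to Eq. (1). The per-round probabilities are hypothetical placeholders that follow the trend described above; they are not values taken from the paper.

```python
import numpy as np

# Simulate the binary decisions of Eq. (1) for one (content, subject) pair.
# X_l = 1 means "unnoticeable difference"; X_l = 0 means "noticeable".
rng = np.random.default_rng(0)

L = 6                                     # at most six binary-search rounds
p = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]  # hypothetical Pr(X_l = 1) per round

X = [int(rng.random() < p_l) for p_l in p]
print(X)  # e.g., [0, 0, 0, 1, 0, 1]
```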
B. JND Localization by Integrating Multiple Binary Decisions

During the subjective test, a JND sample is obtained through multiple binary decisions. Let $X_{c,s} = [X_{c,s,1}, \cdots, X_{c,s,L}]$ denote the sequence of decisions made by subject $s$ on content $c$. Random variable $X_{c,s,l}$ is assumed to be independent and identically distributed (i.i.d.) across subject index $s$. Furthermore, $X_{c,s,l}$ is independent of content index $c$, since the binary search approaches the ultimate JND location at the same rate regardless of the content. The search interval at round $l$, denoted by $\Delta QP_l$, can be expressed as
$$\Delta QP_l = \Delta QP_0 \left(\tfrac{1}{2}\right)^{l}, \tag{2}$$
where $\Delta QP_0 = 51$ is the initial search interval for the first JND. The anchor location is $QP_0$, i.e., the reference, and the JND is searched within $[QP_0, QP_{51}]$. We skip the comparison between the clip pair $(QP_0, QP_0)$ since it is a trivial one.

By definition, the JND location is the coded clip at the transition point from unnoticeable difference to noticeable difference against the anchor. It is located at the last round after a sequence of "noticeable difference" decisions. Thus, the JND location on content $c$ for subject $s$ can be obtained by integrating search intervals based on the decision sequence $X_{c,s}$ as
$$Y_{c,s} = \sum_{l=1}^{L} X_{c,s,l} \, \Delta QP_l, \tag{3}$$
since we need to add $\Delta QP_l$ to the offset (or the left end point) of the current search interval if $X_{c,s,l} = 1$ and keep the same offset if $X_{c,s,l} = 0$. The distance between the left end point of the search interval and the JND location is no larger than one QP when the search procedure converges. Then, the JND location can be expressed as a function of the confidence of subject $s$ on content $c$:
$$Y_{c,s} = \Delta QP_0 \sum_{l=1}^{L} p_{c,s,l} \left(\tfrac{1}{2}\right)^{l}. \tag{4}$$
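As a sanity check on Eqs. (2) and (3), here is a minimal sketch that converts a decision sequence into a JND location. It assumes the first-JND setting described above ($\Delta QP_0 = 51$, $L = 6$).

```python
# Recover the JND location of Eq. (3) from a binary decision sequence.
# Assumes the first-JND setting: initial interval 51 and L = 6 rounds.
DQP0, L = 51.0, 6

def jnd_location(X):
    """X is the length-L decision sequence [X_1, ..., X_L] of 0/1 values."""
    # Each "unnoticeable" decision (X_l = 1) moves the left end point of the
    # search interval right by Delta_QP_l = DQP0 * (1/2)**l, per Eq. (2).
    return sum(X[l - 1] * DQP0 * 0.5 ** l for l in range(1, L + 1))

# Example: no visible difference in rounds 1-2, visible difference afterwards.
print(jnd_location([1, 1, 0, 0, 0, 0]))  # 51 * (1/2 + 1/4) = 38.25
```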
C. Decomposing JND into Content and Subject Factors

The JND locations depend on several causal factors: 1) the bias of a subject, 2) the consistency of a subject, 3) the average JND location, and 4) the difficulty of a content to evaluate. To provide a closed-form expression of the JND location, we adopt the following probabilistic model for the confidence random variable:
$$p_{c,s,l} = \mu_l + \alpha \epsilon_c + \beta \delta_s, \tag{5}$$
where $\mu_l = (1 + e^{-\gamma l})^{-1}$ is the average confidence, and $\epsilon_c \sim N(\mu_c, \sigma_c^2)$ and $\delta_s \sim N(\mu_s, \sigma_s^2)$ are two Gaussian random variables that capture content and subject factors, respectively. $\alpha$ and $\beta$ are weights that control the effects of these factors. We set $\alpha = 1$ and $\beta = 1$, and choose $\gamma$ empirically.

By plugging Eq. (5) into Eq. (4), we can express the JND location as
$$Y_{c,s} = y_c + N(0, v_c) + N(b_s, v_s), \tag{6}$$
where $y_c = \Delta QP_0 \sum_{l=1}^{L} (\tfrac{1}{2})^l (\mu_l + \mu_c)$ and $v_c = \kappa^2 \sigma_c^2$ are content factors, and $b_s = \kappa \mu_s$ and $v_s = \kappa^2 \sigma_s^2$ are subject factors. $\kappa = \Delta QP_0 \sum_{l=1}^{L} (\tfrac{1}{2})^l \approx 50$ is a constant.

The JND-based VQA model in Eq. (6) decomposes the JND location into a content term and a subject term. The difficulty of a content is modeled by $v_c \in [0, \infty)$. A larger $v_c$ value means that its masking effect is stronger, so that even the most experienced experts have difficulty spotting artifacts in the compressed clips. The bias of a subject is modeled by parameter $b_s \in (-\infty, +\infty)$. If $b_s < 0$, the subject is more sensitive to quality degradation in compressed video clips; if $b_s > 0$, the subject is less sensitive to distortions. An average subject has a bias around $b_s = 0$. Moreover, the subject variance, $v_s$, captures the consistency of subject $s$. A consistent subject evaluates all sequences deliberately.
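The decomposition in Eq. (6) can be simulated directly. The sketch below draws a $C \times S$ matrix of synthetic JND samples from the model; all parameter values are hypothetical, chosen only to illustrate the roles of $y_c$, $v_c$, $b_s$, and $v_s$.

```python
import numpy as np

# Simulate JND samples from the generative model of Eq. (6):
# Y_{c,s} = y_c + N(0, v_c) + N(b_s, v_s), i.e., Y_{c,s} ~ N(y_c + b_s, v_c + v_s).
rng = np.random.default_rng(1)

C, S = 15, 37                           # contents and subjects, as in Sec. V
y_c = rng.uniform(25, 38, size=C)       # true JND locations (hypothetical)
v_c = rng.uniform(0.5, 4.0, size=C)     # content difficulty (variance)
b_s = rng.normal(0.0, 2.0, size=S)      # subject bias, centered at zero
v_s = rng.uniform(0.5, 4.0, size=S)     # subject inconsistency (variance)

Y = (y_c[:, None] + b_s[None, :]
     + rng.normal(size=(C, S)) * np.sqrt(v_c[:, None] + v_s[None, :]))
print(Y.shape)  # (15, 37)
```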
IV. PARAMETER ESTIMATION

The JND-based VQA model in Eq. (6) has a set of parameters to determine, namely, $\theta = (\{y_c\}, \{v_c\}, \{b_s\}, \{v_s\})$ with $c = 1, \cdots, C$ and $s = 1, \cdots, S$. Under the assumption that contents and subjects are independent factors of perceived video quality, the JND location can be expressed by the Gaussian distribution
$$Y_{c,s} \sim N(\mu_{c,s}, \sigma_{c,s}^2), \tag{7}$$
where $\mu_{c,s} = y_c + b_s$ and $\sigma_{c,s}^2 = v_c + v_s$. The task is to estimate the unknown parameters jointly, given observations on a set of contents from a group of subjects. A standard inference method to recover the true MOS score was studied in [11]. Here, we extend the procedure to estimate the parameters in the JND-based VQA model.
Fig. 1: Representative frames from 15 source contents.

Let $L(\theta) = \log p(\{y_{c,s}\} \mid \theta)$ be the log-likelihood function. One can show that the optimal estimator of $\theta$ is given by $\hat{\theta} = \arg\max_\theta L(\theta)$. By omitting constant terms, we can express the log-likelihood function as
$$L(\theta) = \log \prod_{c,s} p(y_{c,s} \mid y_c, b_s, v_c, v_s) = \sum_{c,s} \log p(y_{c,s} \mid y_c, b_s, v_c, v_s) \equiv \sum_{c,s} \left[ -\tfrac{1}{2}\log(v_c + v_s) - \frac{(y_{c,s} - y_c - b_s)^2}{2(v_c + v_s)} \right]. \tag{8}$$
The first- and second-order derivatives of $L(\theta)$ with respect to each parameter can be derived. They are used to update the parameters at each iteration according to the Newton-Raphson rule, i.e., $\theta \leftarrow \theta - \frac{\partial L / \partial \theta}{\partial^2 L / \partial \theta^2}$.
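The alternating Newton-Raphson updates can be written compactly. The following is a minimal sketch of the estimation loop, assuming a complete $C \times S$ matrix of JND samples; it omits the missing-data handling and convergence safeguards that a full implementation would need.

```python
import numpy as np

def fit_mle(Y, n_iter=200, v_min=1e-6):
    """Estimate theta = ({y_c}, {v_c}, {b_s}, {v_s}) from a C x S matrix of
    JND samples by maximizing Eq. (8) with Newton-Raphson updates."""
    C, S = Y.shape
    y_c = Y.mean(axis=1)                       # init: per-content sample mean
    b_s = np.zeros(S)                          # init: unbiased subjects
    v_c = np.full(C, Y.var(axis=1).mean())     # init: shared variances
    v_s = np.full(S, Y.var(axis=0).mean())

    for _ in range(n_iter):
        w = 1.0 / (v_c[:, None] + v_s[None, :])    # precisions 1/(v_c + v_s)
        e = Y - y_c[:, None] - b_s[None, :]        # residuals

        # Mean terms: the Newton step theta - L'/L'' reduces to adding a
        # precision-weighted average of the residuals.
        y_c = y_c + (e * w).sum(axis=1) / w.sum(axis=1)
        e = Y - y_c[:, None] - b_s[None, :]
        b_s = b_s + (e * w).sum(axis=0) / w.sum(axis=0)
        shift = b_s.mean()                         # y_c and b_s appear only as
        b_s -= shift                               # a sum, so pin the mean bias
        y_c += shift                               # to zero for identifiability
        e = Y - y_c[:, None] - b_s[None, :]

        # Variance terms: first and second derivatives of Eq. (8).
        d1 = 0.5 * (e**2 * w**2 - w)
        d2 = 0.5 * w**2 - e**2 * w**3
        v_c = np.maximum(v_c - d1.sum(axis=1) / d2.sum(axis=1), v_min)
        v_s = np.maximum(v_s - d1.sum(axis=0) / d2.sum(axis=0), v_min)

    return y_c, b_s, v_c, v_s
```

Fitting this to the synthetic matrix from the previous sketch, e.g., `y_hat, b_hat, vc_hat, vs_hat = fit_mle(Y)`, should approximately recover the planted parameters.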
V. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of the proposed model using real JND data from the VideoSet and compare it with another commonly used method. For reproducibility, the source code of the proposed model is available at: https://github.com/JohnhqWang/sureal.
A. Experiment Settings
The VideoSet contains 220 video contents in four resolutions and three JND points per resolution per content. During the subjective test, the dataset was split into 15 subsets, and each subset was evaluated independently by a group of subjects. We adopt a subset of the first JND points on 720p video in the experiment. The subset contains 15 video contents evaluated by 37 subjects. One representative frame from each of the 15 video clips is shown in Fig. 1. The measured raw JND scores are shown in Fig. 2(a).

Fig. 2: Experimental results: (a) raw JND data, where each pixel represents one JND location and a brighter pixel means the JND occurs at a larger QP; (b) estimated subject bias and inconsistency on raw JND data; (c) and (d) estimated content difficulty based on raw and cleaned JND data, respectively, using the proposed VQA+MLE method; (e) and (f) estimated JND locations based on raw and cleaned JND data, respectively, using both the proposed VQA+MLE method and the MOS method. Error bars in all subfigures represent the confidence interval.

Standard procedures have been provided by the ITU for subject screening and data modeling. For example, a subject rejection method was proposed in the ITU-R BT.500 Recommendation [2], and the differential MOS was defined in the ITU-T P.910 Recommendation [1] to alleviate the influence of subject and content factors. However, these procedures do not directly apply to the collected JND VQA data due to a different methodology: traditional VQA subjective tests evaluate video quality by a score, while JND-based VQA subjective tests target the distortion visibility threshold.

Here, we compare the proposed VQA model, whose parameters are estimated by the MLE method, with the standard MOS approach [11] in two settings. First, we compare them on the raw JND data without any cleaning. Second, we clean unreliable data using the proposed VQA model and compare the two models on the cleaned JND data.
B. Experiments on Raw JND Data
The first experiment was conducted on the raw JND data without outlier removal. Some subjects completed the subjective test hastily, without paying sufficient attention. By jointly estimating the content and subject factors, a good VQA data model can identify such outlying quality ratings from unreliable subjects. The estimated subject bias and inconsistency are shown in Fig. 2(b). The proposed JND-based VQA model flags subjects whose bias or inconsistency deviates markedly from that of the rest of the group.
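One simple way to operationalize this screening is sketched below under our own assumptions; the paper does not specify an exact rejection rule here, and the z-score threshold of 2 is a hypothetical choice.

```python
import numpy as np

# Flag unreliable subjects from the fitted model parameters (b_s, v_s).
# The threshold z_thresh = 2.0 is hypothetical, not a value from the paper.
def unreliable_subjects(b_s, v_s, z_thresh=2.0):
    z_bias = (b_s - b_s.mean()) / b_s.std()        # standardized bias
    z_incons = (v_s - v_s.mean()) / v_s.std()      # standardized inconsistency
    return np.where((np.abs(z_bias) > z_thresh) | (z_incons > z_thresh))[0]

# Usage: drop the flagged subjects' columns from the raw data matrix Y,
# then re-run fit_mle on the cleaned matrix.
```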
C. Experiments on Cleaned JND Data
Here, we remove the outlying JND samples detected by the proposed model. They come from subjects with a larger bias value or inconsistent measurements. We show the estimated content difficulty in Fig. 2(d) and compare the estimated JND locations of the proposed method and the MOS method in Fig. 2(f) on the cleaned dataset. We see that the proposed VQA+MLE method can estimate the relative content difficulty accurately. We also notice that the estimates changed substantially for some contents. The reason is that a considerable portion (5/37 = 13.5%) of the subjects were removed, and the bias and inconsistency of the removed scores have a great influence on the conclusions for these contents.

By comparing Figs. 2(e) and 2(f), we observe that outlying samples changed the distribution of the recovered JND locations in both methods. First, the confidence intervals of the MOS method decrease considerably, which reflects the vulnerability of the MOS method to noisy samples. In contrast, the proposed VQA+MLE method is more robust. Second, the recovered JND location increases by 0.5 to 1 QP in both methods after removing noisy samples. This demonstrates the importance of building a good VQA model and using it to filter out noisy samples.

VI. CONCLUSION AND FUTURE WORK
A JND-based VQA model was proposed to analyze measured JND-based VQA data. The model considered subject and content variabilities, and determined its parameters by solving an MLE problem iteratively. The technique can be used to remove biased and inconsistent samples and to estimate the content difficulty and JND locations. It was shown by experimental results that the proposed methodology is more robust to noisy subjects than the traditional MOS method.

The MLE optimization problem may have multiple local maxima, and the iterative optimization procedure may not converge to the global maximum. We would like to investigate this problem more deeply in the future. We may also look for other parameter estimation methods that are more efficient and robust.

REFERENCES

[1] ITU-T P.910, "Subjective video quality assessment methods for multimedia applications," 1999.
[2] ITU-R BT.500, "Methodology for the subjective assessment of the quality of television pictures," 2003.
[3] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427-1441, 2010.
[4] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a practical perceptual video quality metric," The Netflix Tech Blog, vol. 6, 2016.
[5] Video Quality Experts Group et al., "Report on the validation of video quality models for high definition video content," 2010.
[6] H. Wang, I. Katsavounidis, J. Zhou, J. Park, S. Lei, X. Zhou, M.-O. Pun, X. Jin, R. Wang, X. Wang et al., "VideoSet: A large-scale compressed video quality dataset based on JND measurement," Journal of Visual Communication and Image Representation, vol. 46, pp. 292-302, 2017.
[7] S. Wolf and M. Pinson, "Video quality model for variable frame delay (VQM-VFD)," National Telecommunications and Information Administration, NTIA Technical Memorandum TM-11-482, 2011.
[8] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335-350, 2010.
[9] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo, "Whose vote should count more: Optimal integration of labels from labelers of unknown expertise," in Advances in Neural Information Processing Systems, 2009, pp. 2035-2043.
[10] M. Narwaria, L. Krasula, and P. Le Callet, "Data analysis in multimedia quality assessment: Revisiting the statistical tests," IEEE Transactions on Multimedia, 2018.
[11] Z. Li and C. G. Bampis, "Recover subjective quality scores from noisy measurements," in Data Compression Conference (DCC). IEEE, 2017, pp. 52-61.
[12] Q. Liu, J. Peng, and A. T. Ihler, "Variational inference for crowdsourcing," in Advances in Neural Information Processing Systems, 2012, pp. 692-700.
[13] L. Janowski and M. Pinson, "The accuracy of subjects in a quality experiment: A theoretical subject model," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2210-2224, 2015.
[14] L. Jin, J. Y. Lin, S. Hu, H. Wang, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, "Statistical study on perceived JPEG image quality via MCL-JCI dataset construction and analysis," in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2016.
[15] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, "MCL-JCV: A JND-based H.264/AVC video quality assessment dataset," in IEEE International Conference on Image Processing (ICIP), Sept. 2016, pp. 1509-1513.
[16] Q. Huang, H. Wang, S. C. Lim, H. Y. Kim, S. Y. Jeong, and C.-C. J. Kuo, "Measure and prediction of HEVC perceptually lossy/lossless boundary QP values," in Data Compression Conference (DCC). IEEE, 2017, pp. 42-51.
[17] H. Wang, I. Katsavounidis, Q. Huang, X. Zhou, and C.-C. J. Kuo, "Prediction of satisfied user ratio for compressed video," arXiv preprint arXiv:1710.11090, 2017.