A Simple Model for Subject Behavior in Subjective Experiments

Abstract

In a subjective experiment to evaluate the perceptual audiovisual quality of multimedia and television services, raw opinion scores collected from test subjects are often noisy and unreliable. To produce the final mean opinion scores (MOS), recommendations such as ITU-R BT.500, ITU-T P.910 and ITU-T P.913 standardize post-test screening procedures to clean up the raw opinion scores, using techniques such as subject outlier rejection and bias removal. In this paper, we analyze the prior standardized techniques to demonstrate their weaknesses. As an alternative, we propose a simple model to account for two of the most dominant behaviors of subject inaccuracy: bias and inconsistency. We further show that this model can also effectively deal with inattentive subjects that give random scores. We propose to use maximum likelihood estimation to jointly solve for the model parameters, and present two numeric solvers: the first based on the Newton-Raphson method, and the second based on alternating projection (AP). We show that the AP solver generalizes the ITU-T P.913 post-test screening procedure by weighting a subject's contribution to the true quality score by her consistency (thus, the estimated quality scores can be interpreted as bias-subtracted consistency-weighted MOS). We compare the proposed methods with the standardized techniques using real datasets and synthetic simulations, and demonstrate that the proposed methods are the most valuable when the test conditions are challenging (for example, crowdsourcing and cross-lab studies), offering advantages such as better model-data fit, tighter confidence intervals, better robustness against subject outliers, the absence of hard coded parameters and thresholds, and auxiliary information on test subjects. The code for this work is open-sourced at https://github.com/Netflix/sureal.


A Simple Model for Subject Behavior in Subjective Experiments

Zhi Li (Netflix), Christos G. Bampis (Netflix), Lucjan Janowski (AGH, Poland) and Ioannis Katsavounidis (Facebook)

Abstract

In a subjective experiment to evaluate the perceptual audiovisual quality of multimedia and television services, raw opinion scores offered by subjects are often noisy and unreliable. Recommendations such as ITU-R BT.500, ITU-T P.910 and ITU-T P.913 standardize post-processing procedures to clean up the raw opinion scores, using techniques such as subject outlier rejection and bias removal. In this paper, we analyze the prior standardized techniques to demonstrate their weaknesses. As an alternative, we propose a simple model to account for two of the most dominant behaviors of subject inaccuracy: bias (aka systematic error) and inconsistency (aka random error). We further show that this model can also effectively deal with inattentive subjects that give random scores. We propose to use maximum likelihood estimation (MLE) to jointly estimate the model parameters, and present two numeric solvers: the first based on the Newton-Raphson method, and the second based on alternating projection. We show that the second solver can be considered as a generalization of the subject bias removal procedure in ITU-T P.913. We compare the proposed methods with the standardized techniques using real datasets and synthetic simulations, and demonstrate that the proposed methods have advantages in better model-data fit, tighter confidence intervals, better robustness against subject outliers, shorter runtime, the absence of hard coded parameters and thresholds, and auxiliary information on test subjects. The source code for this work is open-sourced at https://github.com/Netflix/sureal.

Introduction

Subjective experiment methodologies to evaluate the perceptual audiovisual quality of multimedia and television services have been well studied. Recommendations such as ITU-R BT.500 [1], ITU-T P.910 [2] and ITU-T P.913 [3] standardize the procedures to conduct subjective experiments and post-process raw opinion scores to yield the final mean opinion scores (MOS). To account for the inherently noisy and often unreliable nature of test subjects, the recommendations have included corrective mechanisms such as subject rejection (BT.500), subject bias removal (P.913), and criteria for establishing the confidence intervals of the MOS (BT.500, P.910 and P.913). The standardized procedures are not without their own limitations. For example, in BT.500, if a subject is deemed an outlier, all opinion scores for that subject are discarded, which could be excessive. The BT.500 procedure also incorporates a number of hard coded parameters and thresholds, which may not be suitable for all conditions and subjective tests.

As an alternative, we propose a simple model to account for two of the most dominant behaviors of test subject inaccuracy: bias and inconsistency. In addition, this model can effectively deal with inattentive subject outliers that give random scores. Compared to BT.500-style subject rejection, the proposed model can be thought of as performing “soft” subject rejection, as it explicitly models subject outliers as having large inconsistencies, and thus is able to diminish their effect on the estimated true quality scores. To solve for the model parameters, we propose to jointly optimize its likelihood function, also known as maximum likelihood estimation (MLE) [4]. We present two numeric solvers: 1) a Newton-Raphson (NR) solver [5], and 2) an Alternating Projection (AP) solver. We further show that the AP solver can be considered as a weighted and iterative generalization of the subject bias removal procedure in P.913. The AP solver also has the advantage of having no hard coded parameters and thresholds.

One of the challenges is to fairly compare the proposed methods to their alternatives. We evaluate the proposed simple model and its numerical solvers separately. To evaluate the model’s fit to real datasets, we use the Bayesian Information Criterion (BIC) [6], where the winner can be characterized as having a good fit to data while maintaining a small number of parameters. We also compare the confidence intervals of the estimated quality scores, where a tighter confidence interval implies a higher confidence in the estimation. To evaluate the model’s robustness against subject outliers, we perform a simulation study on how the true quality score’s root mean squared error (RMSE) changes compared to the clean case as the number of outliers increases. Lastly, to validate that the numerical solvers are indeed accurate, we use synthetic data to compare the recovered parameters against the ground truth.

The rest of the paper is organized as follows. Section Prior Art and Standards discusses prior art and standards. The proposed model is presented in Section Proposed Model, followed by the two numerical solvers in Section Proposed Solvers, and the calculation of confidence intervals in Section Confidence Interval. Section Experimental Results presents the experimental results. The source code of this work is open-sourced on GitHub [7].

Prior Art and Standards

Raw opinion scores collected from subjective experiments are known to be influenced by the inherently noisy and unreliable nature of human test subjects [8]. To compensate for the influence of individuals, a common practice is to average the raw opinion scores from multiple subjects, yielding a MOS per stimulus. Standardized recommendations incorporate more advanced corrective mechanisms to further compensate for test subjects’ influence, and criteria for establishing the confidence intervals of MOS.

• ITU-R BT.500 Recommendation [1] defines methodologies including double-stimulus impairment scale (DSIS) and double-stimulus continuous quality scale (DSCQS), and a corresponding procedure for subject rejection (ITU-R BT.500-14 Section A1-2.3.1) prior to the calculation of MOS. Video by video, the procedure counts the number of instances when a subject’s opinion score deviates from the mean by a few sigmas (i.e. standard deviations), and rejects the subject if the occurrences are more than a fraction. All scores corresponding to the rejected subjects are discarded, which could be considered excessive. On the other hand, our experiment shows that, in the presence of many outlier subjects, the procedure is only able to identify a portion of them. Another drawback of this approach is that it incorporates a number of hard coded parameters and thresholds to determine the outliers, which may not be suitable for all conditions. It also establishes the corresponding way of calculating the confidence interval (ITU-R BT.500-14 Section A1-2.2.1).
• ITU-T P.910 Recommendation [2] defines methodologies including absolute category rating (ACR), degradation category rating (DCR, equivalent to DSIS), absolute category rating with hidden reference (ACR-HR) and the corresponding differential MOS (DMOS) calculation, and recommends using the BT.500 subject rejection and confidence interval calculation procedure in conjunction.
• ITU-T P.913 Recommendation [3] defines a procedure to remove subject bias (ITU-T P.913 Section 12.4) before carrying out other steps. It first finds the mean score per stimulus, and subtracts it from the raw opinion scores to get the residual scores. Then it averages the residual scores on a per-subject basis to yield an estimate of each subject’s bias. The subject bias is then removed from the raw opinion scores. For P.913 to possess resistance to subject outliers, it needs to be combined with BT.500-style subject rejection. Yet, by doing so, it inherits weaknesses similar to those of BT.500.

For completeness, below we give mathematical descriptions of the subject rejection method standardized in ITU-R BT.500-14 and the subject bias removal method in ITU-T P.913. Let $u_{ijr}$ be the opinion score voted by subject $i$ on stimulus $j$ in repetition $r$. Note that, in BT.500-14, the notation $j$ is used to indicate test condition and $k$ is used to indicate sequence/image; in this paper, the test condition and sequence/image are combined and collectively represented by the stimulus notation $j$. Let $\mu_{jr}$ denote the mean value over scores for stimulus $j$ and for repetition $r$, i.e. $\mu_{jr} = (\sum_i 1)^{-1} \sum_i u_{ijr}$. Similarly, $m_{n,jr}$ denotes the $n$-th order central moment over scores for stimulus $j$ and repetition $r$, i.e. $m_{n,jr} = (\sum_i 1)^{-1} \sum_i (u_{ijr} - \mu_{jr})^n$. Lastly, $\sigma_{jr}$ denotes the sample standard deviation for stimulus $j$ and repetition $r$, i.e. $\sigma_{jr} = \sqrt{((\sum_i 1) - 1)^{-1} \sum_i (u_{ijr} - \mu_{jr})^2}$. In the previous, the term $\sum_i 1$ is the number of observers who scored stimulus $j$ in repetition $r$. This number of observers could be the same, $I$, or different per stimulus, if a subjective experiment has been designed in such a way. The subject rejection procedure in ITU-R BT.500-14 Section A1-2.3 can be summarized in Algorithm 1.

ITU-T P.913 does not consider repetitions, so the notation $u_{ij}$ denotes the opinion score voted by subject $i$ on stimulus $j$. The subject bias removal procedure in ITU-T P.913 Section 12.4 can be summarized in Algorithm 2.

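As an illustration, the two standardized procedures described above (P.913 subject bias removal and BT.500 subject rejection) could be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the reference implementation of either recommendation: the function names are hypothetical, scores are held in a 2-D array with one column per stimulus (or per stimulus and repetition), missing votes are marked as NaN, and the synthetic data at the end is invented purely for demonstration.

```python
import numpy as np

def p913_bias_removal(u):
    """Sketch of P.913 Section 12.4. u: (I, J) scores, NaN marks a missing vote."""
    mos = np.nanmean(u, axis=0)                      # MOS_j: mean over subjects
    bias = np.nanmean(u - mos[None, :], axis=1)      # BIAS_i: mean residual per subject
    return u - bias[:, None], bias                   # r_ij = u_ij - BIAS_i

def bt500_subject_rejection(u):
    """Sketch of BT.500 rejection. u: (I, N), one column per (stimulus, repetition).

    Assumes every subject cast at least one vote.
    """
    mu = np.nanmean(u, axis=0)
    m2 = np.nanmean((u - mu) ** 2, axis=0)           # 2nd central moment
    m4 = np.nanmean((u - mu) ** 4, axis=0)           # 4th central moment
    sigma = np.nanstd(u, axis=0, ddof=1)             # sample standard deviation
    kurtosis = m4 / m2 ** 2
    eps = np.where((kurtosis >= 2) & (kurtosis <= 4), 2.0, np.sqrt(20.0))
    p = np.sum(u >= mu + eps * sigma, axis=1)        # comparisons with NaN are False
    q = np.sum(u <= mu - eps * sigma, axis=1)
    n = np.sum(~np.isnan(u), axis=1)                 # votes cast by each subject
    total = p + q
    balance = np.abs(p - q) / np.maximum(total, 1)   # avoids 0/0 when total == 0
    reject = (total / n >= 0.05) & (balance < 0.3)
    return np.flatnonzero(reject).tolist()

# Hypothetical data: 30 subjects, 40 stimuli, last subject votes on an inverted scale.
rng = np.random.default_rng(0)
psi = np.tile([1.0, 2.0, 3.0, 4.0, 5.0], 8)
u = psi + rng.normal(0.0, 0.1, size=(30, 40))
u[29] = 6.0 - psi
rejected = bt500_subject_rejection(u)
debiased, bias = p913_bias_removal(u)
```

Note that the rejection rule depends on the number of subjects: with few subjects, a single outlier inflates both the kurtosis and the standard deviation of a column, so its own deviation may never reach the $\sqrt{20}\,\sigma$ threshold.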
Proposed Model

We propose a simple yet effective model to account for two of the most dominant effects of test subject inaccuracy: subject bias and subject inconsistency. We further show that this model can effectively deal with inattentive subjects that give random scores, without invoking explicit subject rejection. The proposed model is a simplified version of [9] without considering the ambiguity of video content. Compared to the previously proposed model, the solutions to the simplified model are more efficient and stable.

Algorithm 1 ITU-R BT.500 Subject Rejection [1]
• Input: $u_{ijr}$ for $i = 1, \ldots, I$, $j = 1, \ldots, J$ and $r = 1, \ldots, R$.
• Initialize $p(i) \leftarrow 0$, $q(i) \leftarrow 0$ for $i = 1, \ldots, I$.
• For $j = 1, \ldots, J$, $r = 1, \ldots, R$:
  – Let $\mathrm{Kurtosis}_{jr} = \frac{m_{4,jr}}{(m_{2,jr})^2}$.
  – If $2 \le \mathrm{Kurtosis}_{jr} \le 4$, then $\varepsilon_{jr} = 2$; otherwise $\varepsilon_{jr} = \sqrt{20}$.
  – For $i = 1, \ldots, I$:
    ∗ If $u_{ijr} \ge \mu_{jr} + \varepsilon_{jr} \sigma_{jr}$, then $p(i) \leftarrow p(i) + 1$.
    ∗ If $u_{ijr} \le \mu_{jr} - \varepsilon_{jr} \sigma_{jr}$, then $q(i) \leftarrow q(i) + 1$.
• Initialize $\mathrm{Set}_{rej} = \emptyset$.
• For $i = 1, \ldots, I$:
  – If $\frac{p(i) + q(i)}{\sum_{jr} 1} \ge 0.05$ and $\left| \frac{p(i) - q(i)}{p(i) + q(i)} \right| < 0.3$, then $\mathrm{Set}_{rej} \leftarrow \mathrm{Set}_{rej} \cup \{i\}$.
• Output: $\mathrm{Set}_{rej}$.

Algorithm 2 ITU-T P.913 Subject Bias Removal [3]
• Input:
  – $u_{ij}$ for subject $i = 1, \ldots, I$, stimulus $j = 1, \ldots, J$.
• For $j = 1, \ldots, J$:
  – Estimate MOS of stimulus as $\mathrm{MOS}_j = (\sum_i 1)^{-1} \sum_i u_{ij}$.
• For $i = 1, \ldots, I$:
  – Estimate subject bias as $\mathrm{BIAS}_i = (\sum_j 1)^{-1} \sum_j (u_{ij} - \mathrm{MOS}_j)$.
• Calculate the subject bias-removed opinion scores $r_{ij} = u_{ij} - \mathrm{BIAS}_i$, $i = 1, \ldots, I$, $j = 1, \ldots, J$.
• Use $r_{ij}$ instead of $u_{ij}$ as the opinion scores to carry out the remaining steps.

We assume that each opinion score $u_{ijr}$ can be represented by a random variable as follows:

$$U_{ijr} = \psi_j + \Delta_i + \upsilon_i X, \quad (1)$$

where $\psi_j$ is the true quality of stimulus $j$, $\Delta_i$ represents the bias of subject $i$, the non-negative term $\upsilon_i$ represents the inconsistency of subject $i$, and $X \sim N(0, 1)$ are i.i.d. Gaussian random variables. The index $r$ represents repetitions.

It is important to point out that a subject with erroneous behaviors can be modeled by a large inconsistency value $\upsilon_i$. The erroneous behaviors that can be modeled include but are not limited to: a subject giving random scores, a subject being absent-minded for a portion of a session, or a software issue that randomly shuffles a subject’s scores among multiple stimuli. By successfully estimating $\upsilon_i$ and accounting for its effect when calculating the true quality score, we can compensate for subject outliers without invoking BT.500-style subject rejection.

Given a collection of opinion scores $\{u_{ijr}\}$ from a subjective experiment, the task is to solve for the free parameters $\theta = (\{\psi_j\}, \{\Delta_i\}, \{\upsilon_i\})$, such that the model fits the observed scores the best. This can be formulated as a maximum likelihood estimation (MLE) problem. Let the log-likelihood function be $L(\theta) = \log P(\{u_{ijr}\} | \{\psi_j\}, \{\Delta_i\}, \{\upsilon_i\})$, i.e. a monotonic measure of the probability of observing the given raw scores, for a set of these parameters.
We can solve the model by finding $\hat{\theta}$ that maximizes $L(\theta)$, or $\hat{\theta} = \arg\max L(\theta)$. This problem can be numerically solved by the proposed Newton-Raphson method or the Alternating Projection method, to be discussed in Section Proposed Solvers.

It is important to notice that the recoverability of $\{\psi_j\}$ and $\{\Delta_i\}$ in (1) is up to a constant shift. Formally, assume $\hat{\theta} = (\{\hat{\psi}_j\}, \{\hat{\Delta}_i\}, \{\hat{\upsilon}_i\})$ is a solution that maximizes $L(\theta)$. One can easily show that $(\{\hat{\psi}_j + C\}, \{\hat{\Delta}_i - C\}, \{\hat{\upsilon}_i\})$, where $C \in \mathbb{R}$, is another solution that achieves the same maximum likelihood value $L(\hat{\theta})$. This implies that the optimal solution is not unique. In practice, we can enforce a unique solution by adding a constraint that forces the mean subject bias to be zero, or $\sum_i \Delta_i = 0$. This intuitively makes sense, since bias is relative: saying everyone is positively biased is equivalent to saying that no one is positively biased. It is also equivalent to assuming that the sample of observers that offer opinion scores in a subjective experiment is truly random and does not consist of only “expert” viewers or “lazy” viewers that tend to offer lower or higher opinion scores, as a whole. Once a subjective test establishes that the population from which subjects were recruited has such a collective bias, there is always the possibility to change the condition and thus properly estimate how a true “typical” observer, drawn from a more representative pool, would vote.

Lastly, one should keep in mind that it is always possible to use more complicated models than (1) to capture other effects in a subjective experiment. For example, [9] considers content ambiguity, and [10, 11] consider per-stimulus ambiguity. There are also environment-related factors that could induce biases. Additionally, the votes are influenced by the voting scales chosen, for example, continuous vs. discrete [12]. Our hope is that the proposed model strikes a good balance between model complexity and explanatory power. In Section Model-Data Fit, we show that the proposed model yields better model-data fit than BT.500 and P.913 used today.
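To make model (1) and the shift ambiguity concrete, the following sketch draws synthetic scores from the model, evaluates the Gaussian log-likelihood, and checks numerically that shifting $\{\psi_j\}$ by $+C$ and $\{\Delta_i\}$ by $-C$ leaves the likelihood unchanged. It is only an illustration: the function names, array layout (subjects × stimuli × repetitions), and all parameter values are assumptions chosen for the example and do not come from the paper's datasets.

```python
import numpy as np

def simulate_scores(psi, delta, upsilon, repetitions, rng):
    """Draw u_ijr = psi_j + delta_i + upsilon_i * X, with X ~ N(0, 1) i.i.d."""
    shape = (len(delta), len(psi), repetitions)
    x = rng.standard_normal(shape)
    return psi[None, :, None] + delta[:, None, None] + upsilon[:, None, None] * x

def log_likelihood(u, psi, delta, upsilon):
    """Gaussian log-likelihood L(theta); NaN entries (missing votes) are skipped."""
    mean = psi[None, :, None] + delta[:, None, None]
    sd = np.broadcast_to(upsilon[:, None, None], u.shape)
    ll = -0.5 * np.log(2 * np.pi) - np.log(sd) - (u - mean) ** 2 / (2 * sd ** 2)
    return np.nansum(ll)

# Hypothetical ground truth: 20 stimuli, 12 subjects, 2 repetitions.
rng = np.random.default_rng(0)
psi = rng.uniform(1.0, 5.0, size=20)
delta = rng.normal(0.0, 0.5, size=12)
upsilon = rng.uniform(0.3, 1.0, size=12)
u = simulate_scores(psi, delta, upsilon, repetitions=2, rng=rng)

# Shifting quality up and bias down by the same constant gives the same likelihood.
c = 0.7
l_original = log_likelihood(u, psi, delta, upsilon)
l_shifted = log_likelihood(u, psi + c, delta - c, upsilon)

# Enforcing the zero-mean-bias convention picks one representative of the family.
shift = delta.mean()
psi_unique, delta_unique = psi + shift, delta - shift
```

The last two lines show one way to apply the constraint $\sum_i \Delta_i = 0$ after the fact: move the mean bias into the quality scores.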

Proposed Solvers

Let us start by simplifying the form of the log-likelihood function $L(\theta)$. We can write:

$$
\begin{aligned}
L(\theta) &= \log P(\{u_{ijr}\} | \{\psi_j\}, \{\Delta_i\}, \{\upsilon_i\}) \\
&= \log \prod_{ijr} P(u_{ijr} | \psi_j, \Delta_i, \upsilon_i) \quad (2) \\
&= \sum_{ijr} \log P(u_{ijr} | \psi_j, \Delta_i, \upsilon_i) \\
&= \sum_{ijr} \log f(u_{ijr} | \psi_j + \Delta_i, \upsilon_i) \\
&\cong \sum_{ijr} -\log \upsilon_i - \frac{(u_{ijr} - \psi_j - \Delta_i)^2}{2 \upsilon_i^2} \quad (3)
\end{aligned}
$$

where (2) uses the independence assumption on opinion scores, $f(x | \mu, \upsilon)$ is the Gaussian density function with mean $\mu$ and standard deviation $\upsilon$, and $\cong$ denotes equality with omission of constant terms.

Note that not every subject needs to vote on each stimulus in every repetition. Our proposed solvers can effectively deal with subjective tests with incomplete data where some observations $u_{ijr}$ are missing. Denote by $\star$ the missing observations in an experiment. All summations in this paper ignore the missing observations $\star$, that is, $\sum_{ijr}$ is equivalent to $\sum_{ijr: u_{ijr} \neq \star}$, and so on.

Algorithm 3 Proposed Newton-Raphson (NR) solver
• Input:
  – $u_{ijr}$ for subject $i = 1, \ldots, I$, stimulus $j = 1, \ldots, J$ and repetition $r = 1, \ldots, R$.
  – Refresh rate $\alpha$.
  – Stop threshold $\psi^{thr}$.
• Initialize $\{\Delta_i\} \leftarrow \{0\}$, $\{\psi_j\} \leftarrow \{\mathrm{MOS}_j\}$, $\{\upsilon_i\} \leftarrow \{\mathrm{RSD}_i\}$.
• Loop:
  – $\{\psi_j^{prev}\} \leftarrow \{\psi_j\}$.
  – $\Delta_i \leftarrow (1 - \alpha) \cdot \Delta_i + \alpha \cdot \Delta_i^{new}$ where $\Delta_i^{new} = \Delta_i - \frac{\partial L(\theta) / \partial \Delta_i}{\partial^2 L(\theta) / \partial \Delta_i^2}$ for $i = 1, \ldots, I$.
  – $\upsilon_i \leftarrow (1 - \alpha) \cdot \upsilon_i + \alpha \cdot \upsilon_i^{new}$ where $\upsilon_i^{new} = \upsilon_i - \frac{\partial L(\theta) / \partial \upsilon_i}{\partial^2 L(\theta) / \partial \upsilon_i^2}$ for $i = 1, \ldots, I$.
  – $\psi_j \leftarrow (1 - \alpha) \cdot \psi_j + \alpha \cdot \psi_j^{new}$ where $\psi_j^{new} = \psi_j - \frac{\partial L(\theta) / \partial \psi_j}{\partial^2 L(\theta) / \partial \psi_j^2}$ for $j = 1, \ldots, J$.
  – If $\left( \sum_{j=1}^{J} (\psi_j - \psi_j^{prev})^2 \right)^{1/2} < \psi^{thr}$, break.
• Output: $\{\psi_j\}$, $\{\Delta_i\}$, $\{\upsilon_i\}$.

Newton-Raphson (NR) Solver

With (3), the first- and second-order partial derivatives of $L(\theta)$ can be derived (see the first Appendix). We can apply the Newton-Raphson rule [5] $a^{new} \leftarrow a - \frac{\partial L / \partial a}{\partial^2 L / \partial a^2}$ to update each parameter $a$ in iterations. We further use a refresh rate parameter $\alpha$ to control the speed of innovation and to avoid overshooting. Note that other update rules can be applied, but using the Newton-Raphson rule yields nice interpretability.

Also note that the NR solver finds a local optimal solution when the problem is non-convex. It is important to initialize the parameters properly. We choose zeros as the initial values for $\{\Delta_i\}$, the mean score $\mathrm{MOS}_j = (\sum_{ir} 1)^{-1} \sum_{ir} u_{ijr}$ for $\{\psi_j\}$, and the residue standard deviation $\mathrm{RSD}_i = \sigma_i(\{\varepsilon_{ijr}\})$ for $\{\upsilon_i\}$, where $\varepsilon_{ijr} = u_{ijr} - \mathrm{MOS}_j$ is the “residue”, $\sigma_i(\{\varepsilon_{ijr}\}) = \sqrt{(\sum_{jr} 1)^{-1} \sum_{jr} (\varepsilon_{ijr} - \bar{\varepsilon}_i)^2}$, and $\bar{\varepsilon}_i = (\sum_{jr} 1)^{-1} \sum_{jr} \varepsilon_{ijr}$. The NR solver is summarized in Algorithm 3. Good choices of the refresh rate and stop threshold are $\alpha = 0.1$ and $\psi^{thr} = 10^{-8}$, respectively, but varying these parameters would not significantly change the result.

The “new” parameters can be simplified to the following form:

$$\psi_j^{new} = \frac{\sum_{ir} \upsilon_i^{-2} (u_{ijr} - \Delta_i)}{\sum_{ir} \upsilon_i^{-2}}, \quad (4)$$

$$\Delta_i^{new} = \frac{\sum_{jr} (u_{ijr} - \psi_j)}{\sum_{jr} 1}, \quad (5)$$

$$\upsilon_i^{new} = \upsilon_i \frac{\sum_{jr} 2 \upsilon_i^2 - 4 (u_{ijr} - \psi_j - \Delta_i)^2}{\sum_{jr} \upsilon_i^2 - 3 (u_{ijr} - \psi_j - \Delta_i)^2}.$$

Note that there are strong intuitions behind the expressions for the newly estimated true quality $\psi_j^{new}$ and subject bias $\Delta_i^{new}$. In each iteration, $\psi_j^{new}$ is re-estimated as the weighted mean of the opinion scores $u_{ijr}$ with the currently estimated subject bias $\Delta_i$ removed. Each opinion score is weighted by the “subject consistency” $\upsilon_i^{-2}$, i.e., the higher the inconsistency for subject $i$, the less reliable the opinion score, hence the lower the weight. The subject bias $\Delta_i^{new}$ is simply the average shift between subject $i$’s opinion scores and the true values.

Alternating Projection (AP) Solver

This solver is called “alternating projection” because in a loop, we alternate between projecting (or averaging) the opinion scores along the subject dimension and the stimulus dimension. To start, we initialize $\{\psi_j\}$ to $\{\mathrm{MOS}_j\}$, where $\mathrm{MOS}_j = (\sum_{ir} 1)^{-1} \sum_{ir} u_{ijr}$, same as the NR solver. The subject bias $\{\Delta_i\}$ is initialized differently, to $\{\mathrm{BIAS}_i\}$, where $\mathrm{BIAS}_i = (\sum_{jr} 1)^{-1} \sum_{jr} (u_{ijr} - \mathrm{MOS}_j)$ is the average shift between subject $i$’s opinion scores and the true values. Note that the calculation of $\{\mathrm{MOS}_j\}$ and $\{\mathrm{BIAS}_i\}$ matches precisely the ones in Algorithm 2 (ITU-T P.913). Within the loop, first, the “residue” $\varepsilon_{ijr}$ is updated, followed by the calculation of the subject inconsistency $\upsilon_i$ as the residue’s standard deviation per subject, or $\sigma_i(\{\varepsilon_{ijr}\}) = \sqrt{(\sum_{jr} 1)^{-1} \sum_{jr} (\varepsilon_{ijr} - \bar{\varepsilon}_i)^2}$, where $\bar{\varepsilon}_i = (\sum_{jr} 1)^{-1} \sum_{jr} \varepsilon_{ijr}$. Then, the true quality $\{\psi_j\}$ and the subject bias $\{\Delta_i\}$ are re-estimated, by averaging the opinion scores along either the subject dimension $i$ or the stimulus dimension $j$. The projection formulas precisely match equations (4) and (5) of the Newton-Raphson method. The AP solver is summarized in Algorithm 4. A good choice of the stop threshold is $\psi^{thr} = 10^{-8}$.

In sum, the AP solver can be considered as a generalization of P.913 Section 12.4 in the following sense: first, the AP solver iterates until convergence whereas P.913 only goes through the initialization steps; second, in the AP solver the re-estimation of the quality score $\psi_j$ is weighted by the subject consistency $\upsilon_i^{-2}$, whereas in P.913 the re-estimation is unweighted. Please note that weighting multiple random variables by the inverse of their variance gives the minimum-error parameter estimate, as can be proven through Lagrange multipliers.

Confidence Interval

The estimate of each model parameter $\{\psi_j\}$, $\{\Delta_i\}$, $\{\upsilon_i\}$ is associated with a confidence interval. Using the Cramer-Rao bound [13], the asymptotic 95% confidence intervals for the mean terms $\psi_j$ and $\Delta_i$ have the form $CI(a) = a \pm 1.96 \left( -\frac{\partial^2 L(a)}{\partial a^2} \right)^{-1/2}$, where their second-order derivatives $\frac{\partial^2 L(a)}{\partial a^2}$ can be found in the first Appendix. The 95% confidence interval for the standard deviation term $\upsilon_i$ has the form $\left( \sqrt{\frac{k}{\chi^2_k(0.975)}} \upsilon, \sqrt{\frac{k}{\chi^2_k(0.025)}} \upsilon \right)$, where $\chi^2_k(a)$ is the percent point function (the inverse of CDF) of a chi-square distribution with $k$ degrees of freedom. After simplification, the confidence intervals for $\psi_j$, $\Delta_i$ and $\upsilon_i$ are:

$$CI(\psi_j) = \psi_j \pm 1.96 \frac{1}{\sqrt{\sum_{ir} \upsilon_i^{-2}}}, \quad (6)$$

$$CI(\Delta_i) = \Delta_i \pm 1.96 \frac{\upsilon_i}{\sqrt{\sum_{jr} 1}},$$

$$CI(\upsilon_i) = \left( \sqrt{\frac{k_i}{\chi^2_{k_i}(0.975)}} \upsilon_i, \sqrt{\frac{k_i}{\chi^2_{k_i}(0.025)}} \upsilon_i \right),$$

where $k_i = \sum_{jr} 1$ is the number of observations from subject $i$, i.e. the number of stimuli (counting repetitions) that subject $i$ has viewed.

There is one fact worth mentioning. Recall that $\sum_{ir}$ is equivalent to $\sum_{ir: u_{ijr} \neq \star}$, where $\star$ represents a missing observation. If there is no missing observation, that is, the subjective test has complete data, then the lengths of the confidence intervals for $\psi_j$, $j = 1, \ldots, J$ are the same, equal to $\frac{3.92}{\sqrt{\sum_{ir} \upsilon_i^{-2}}}$ (since it is independent of the subscript $j$). This is very different from the confidence intervals estimated from a conventional approach (for example, plain MOS, or BT.500), where each stimulus has a different confidence interval length (see the second Appendix for an MLE interpretation of the plain MOS). This phenomenon can be explained by the fact that all the true quality parameters $\psi_j$, $j = 1, \ldots, J$ are estimated jointly, yielding identical certainty for all the estimated parameters.

Algorithm 4 Proposed Alternating Projection (AP) solver
• Input:
  – $u_{ijr}$ for subject $i = 1, \ldots, I$, stimulus $j = 1, \ldots, J$ and repetition $r = 1, \ldots, R$.
  – Stop threshold $\psi^{thr}$.
• Initialize $\{\psi_j\} \leftarrow \{\mathrm{MOS}_j\}$, $\{\Delta_i\} \leftarrow \{\mathrm{BIAS}_i\}$.
• Loop:
  – $\{\psi_j^{prev}\} \leftarrow \{\psi_j\}$.
  – $\varepsilon_{ijr} = u_{ijr} - \psi_j - \Delta_i$ for $i = 1, \ldots, I$, $j = 1, \ldots, J$ and $r = 1, \ldots, R$.
  – $\upsilon_i \leftarrow \sigma_i(\{\varepsilon_{ijr}\})$ for $i = 1, \ldots, I$.
  – $\psi_j \leftarrow \frac{\sum_{ir} \upsilon_i^{-2} (u_{ijr} - \Delta_i)}{\sum_{ir} \upsilon_i^{-2}}$ for $j = 1, \ldots, J$.
  – $\Delta_i \leftarrow \frac{\sum_{jr} (u_{ijr} - \psi_j)}{\sum_{jr} 1}$ for $i = 1, \ldots, I$.
  – If $\left( \sum_{j=1}^{J} (\psi_j - \psi_j^{prev})^2 \right)^{1/2} < \psi^{thr}$, break.
• Output: $\{\psi_j\}$, $\{\Delta_i\}$, $\{\upsilon_i\}$.

Experimental Results

We compare the proposed method (the proposed model and its two numerical solvers) with the prior art BT.500 and P.913 recommendations. For P.913, after subject bias removal, we assume that a BT.500-style subject rejection is carried out, before calculating the MOS and the corresponding confidence intervals. We first illustrate the proposed model by giving visual examples on two datasets: the VQEG HD3 dataset [14] (which is the compression-only subset of the larger HDTV Ph1 Exp3 dataset) and the NFLX Public dataset [15]. We then validate the model-data fit using the Bayesian Information Criterion (BIC) on 22 datasets, including 20 datasets that are part of different larger experiments: VQEG HDTV Phase I [14]; ITS4S [16]; AGH/NTIA [10, 17]; MM2 [18]; ITU-T Supp23 Exp1 [19]; and ITS4S2 [20]. We also evaluate the confidence intervals on the estimated quality scores on these 22 datasets. Next, we demonstrate that the proposed model is much more effective in dealing with outlier subjects. We then use synthetic data to validate the accuracy of the numerical solvers and the confidence interval calculation. Lastly, we compare the runtime of the various schemes.

Figure 1. Raw opinion scores ($u_{ij}$), plotted as test subjects ($i$) against video stimuli ($j$), from (a) the VQEG HD3 dataset and (b) the NFLX Public dataset. Each pixel represents a raw opinion score. The darker the color, the lower the score. The impaired videos are arranged by contents, and within each content, from low quality to high quality (with the reference video always appearing last). For the NFLX Public dataset, the last four rows correspond to corrupted subjective data.
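For readers who want to see what the proposed method computes before turning to the results, the AP solver of Algorithm 4 together with the quality-score confidence interval (6) can be sketched as below. This is a hedged sketch rather than the released implementation: the function name `ap_solver`, the `max_iter` safeguard, the 3-D array layout (subjects × stimuli × repetitions, NaN for missing votes) and the synthetic data are all assumptions made for illustration, and the code assumes every subject has non-zero residue variance so that the weights $\upsilon_i^{-2}$ are finite.

```python
import numpy as np

def ap_solver(u, psi_thr=1e-8, max_iter=1000):
    """Alternating projection sketch. u: (I, J, R) scores, NaN marks a missing vote."""
    obs = ~np.isnan(u)                                   # True where a vote exists
    u0 = np.where(obs, u, 0.0)                           # missing votes add 0 to sums
    n_i = obs.sum(axis=(1, 2))                           # votes per subject
    psi = u0.sum(axis=(0, 2)) / obs.sum(axis=(0, 2))     # initialize to MOS_j
    delta = (obs * (u0 - psi[None, :, None])).sum(axis=(1, 2)) / n_i  # BIAS_i
    for _ in range(max_iter):
        psi_prev = psi
        eps = u - psi[None, :, None] - delta[:, None, None]  # residue (NaN if missing)
        upsilon = np.nanstd(eps, axis=(1, 2))                # inconsistency per subject
        w = obs * upsilon[:, None, None] ** -2.0             # consistency weights
        psi = (w * (u0 - delta[:, None, None])).sum(axis=(0, 2)) / w.sum(axis=(0, 2))
        delta = (obs * (u0 - psi[None, :, None])).sum(axis=(1, 2)) / n_i
        if np.linalg.norm(psi - psi_prev) < psi_thr:
            break
    ci_half_width = 1.96 / np.sqrt(w.sum(axis=(0, 2)))       # half-width from (6)
    return psi, delta, upsilon, ci_half_width

# Hypothetical experiment: 20 subjects, 30 stimuli, 2 repetitions.
rng = np.random.default_rng(0)
n_subj, n_stim, n_rep = 20, 30, 2
psi_true = rng.uniform(1.0, 5.0, size=n_stim)
delta_true = rng.normal(0.0, 0.3, size=n_subj)
ups_true = rng.uniform(0.3, 0.8, size=n_subj)
ups_true[-1] = 2.0                                       # one highly inconsistent subject
u = (psi_true[None, :, None] + delta_true[:, None, None]
     + ups_true[:, None, None] * rng.standard_normal((n_subj, n_stim, n_rep)))

psi_hat, delta_hat, ups_hat, ci = ap_solver(u)
```

In this sketch the inconsistent subject is never discarded; it simply receives a small weight $\upsilon_i^{-2}$ in the quality estimate, which is the "soft" rejection behavior described earlier. Because quality and bias are only identifiable up to a constant shift, comparisons against ground truth are best made after centering.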

Visual Examples

First, we demonstrate the proposed method on the VQEG HD3 and the NFLX Public datasets. Refer to Figure 1 for a visualization of the raw opinion scores. The 44th video of the VQEG HD3 dataset has a quality issue: all of its scores are low. The NFLX Public dataset includes four subjects whose raw scores were shuffled due to a software issue during data collection.

Figure 2 shows the recovered quality scores of the four methods compared. The quality scores recovered by the two proposed methods are numerically different from the ones from BT.500 and P.913, suggesting that the recovery is non-trivial. The average confidence intervals by the proposed methods are generally tighter, compared to the ones from BT.500 and P.913, suggesting that the estimation has higher confidence. The NBIC scores, to be discussed in detail in Section Model-Data Fit, represent how well the model fits the data. It can be observed that the proposed model fits the data better than BT.500 and P.913.

Figure 3 shows the recovered subject bias and subject inconsistency by the methods compared. On the VQEG HD3 dataset, it can be seen that the 20th subject has the most positive bias, which is evidenced by the whitish horizontal strip visible in Figure 1(a). On the NFLX Public dataset, the last four subjects, whose raw scores are scrambled, have very high subject inconsistency values. Correspondingly, their estimated biases have very loose confidence intervals. This illustrates that the proposed model is effective in modeling outlier subjects. In contrast, among the four outlier subjects, both BT.500 and P.913 fail to reject the 28th subject.

The subject bias and inconsistency revealed through the recovery process could be valuable information for subject screening. Unlike BT.500, which makes a binary decision on whether a subject is accepted or rejected, the proposed approach characterizes a subject’s inaccuracy in two dimensions, along with their confidence intervals, allowing further interpretation and study. How to use the bias and inconsistency information to better screen subjects remains our future work.
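The NBIC values quoted above for each method are a normalized form of the BIC, $(\log(n)|\theta| - 2L(\theta))/n$, as defined in Section Model-Data Fit. As a rough sketch of how such a number could be obtained for the proposed model, the snippet below evaluates the full Gaussian log-likelihood (including the constant term omitted in (3)) and applies the penalty. The function name is hypothetical, and the parameter count $|\theta| = J + 2I$ is an assumption obtained by counting one $\psi_j$ per stimulus and one $\Delta_i$ and one $\upsilon_i$ per subject in (1).

```python
import numpy as np

def nbic_proposed_model(u, psi, delta, upsilon):
    """Normalized BIC for model (1). u: (I, J, R) scores, NaN marks a missing vote."""
    n = np.sum(~np.isnan(u))                           # number of observed scores
    mean = psi[None, :, None] + delta[:, None, None]
    sd = np.broadcast_to(upsilon[:, None, None], u.shape)
    ll = -0.5 * np.log(2 * np.pi) - np.log(sd) - (u - mean) ** 2 / (2 * sd ** 2)
    log_lik = np.nansum(ll)                            # missing votes are skipped
    n_params = psi.size + delta.size + upsilon.size    # assumed |theta| = J + 2I
    return (np.log(n) * n_params - 2.0 * log_lik) / n

# Tiny hand-checkable case: 2 subjects, 3 stimuli, every score equals its mean.
u_toy = np.zeros((2, 3, 1))
nbic_toy = nbic_proposed_model(u_toy, np.zeros(3), np.zeros(2), np.ones(2))
```

Lower values indicate a better trade-off between fit and parameter count, so a model with extra per-subject parameters only wins if the likelihood gain outweighs the $\log(n)|\theta|$ penalty.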

Model-Data Fit

The Bayesian Information Criterion [6] is a criterion for model-data fit. When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. BIC attempts to balance between the degrees of freedom (characterized by the number of free parameters) and the goodness of fit (characterized by the log-likelihood function). Formally, BIC is defined as

$$BIC = \log(n) |\theta| - 2 L(\theta),$$

where $n = |\{u_{ijr}\}|$ is the total number of observations (i.e. the number of opinion scores), $|\theta|$ is the number of model parameters, and $L(\theta)$ is the log-likelihood function. The lower the number of free parameters $|\theta|$, and the higher the log-likelihood $L(\theta)$, the lower the BIC, and hence the better the fit. In this work, we adopt the notion of a “normalized BIC”, defined as the BIC divided by the number of observations, or:

$$NBIC = \frac{\log(n) |\theta| - 2 L(\theta)}{n},$$

as the model fit criterion, for easier comparison across datasets.

Table NBIC shows the NBIC reported on the compared methods on the 22 public datasets. The MOS method is the plain MOS without subject rejection or subject bias removal. $|\theta|$ for MOS and BT.500 is $2J$, where $J$ is the number of stimuli (refer to the second Appendix for an MLE interpretation of the plain MOS). For P.913, $|\theta|$ is equal to $2J + I$, where $I$ is the number of subjects (due to the subject bias term). For the calculation of the log-likelihood function, notice that if subject rejection is applied, only the opinion scores after rejection are taken into account. The result in Table NBIC shows that the proposed two solvers yield better model-data fit than the plain MOS, BT.500 and P.913 approaches.

Confidence Interval of Quality Scores

Table CI shows the average length of the confidence intervals on the compared methods on the 22 public datasets. The smaller the number, the tighter the confidence interval, and thus the more confident the estimation is. For MOS, BT.500 and P.913, the confidence intervals are calculated based on (8). For BT.500 and P.913, only the opinion scores after rejection are taken into account. For the proposed methods, the confidence intervals are calculated based on (6). It can be observed that the proposed two methods yield tighter confidence intervals compared to the other methods. For some databases BT.500 generates wider confidence intervals than the plain MOS. This phenomenon can be explained by the fact that subject rejection decreases the number of samples, even though the variance may also be decreased. Overall, the obtained confidence interval can be either narrower or wider.

Figure 2. Recovered quality score $\psi_j$ and its confidence interval, plotted against video stimuli ($j$), for the four methods compared, on (a) the VQEG HD3 dataset and (b) the NFLX Public dataset. Legend values for (a): BT.500 (SR_MOS) [NBIC 2.74] [avg CI 0.60]; P.913 (BR_SR_MOS) [NBIC 2.40] [avg CI 0.49]; Proposed (AP) [NBIC 2.30] [avg CI 0.46]. Legend values for (b): BT.500 (SR_MOS) [NBIC 2.57] [avg CI 0.54]; P.913 (BR_SR_MOS) [NBIC 2.55] [avg CI 0.50]; Proposed (AP) [NBIC 2.52] [avg CI 0.44]. The proposed NR method is not shown in the plots since it produces virtually identical results to the proposed AP method. For each method compared, the NBIC score (see Section Model-Data Fit) and the average length of the confidence interval are reported. (SR: subject rejection; BR: bias removal; avg CI: average confidence interval; NBIC: normalized Bayesian Information Criterion; NR: Newton-Raphson; AP: Alternating Projection.)

Robustness against Outlier Subjects

We demonstrate that the proposed method is much more effective in dealing with (corrupted) outlier subjects than the other methods. We use the following methodology in reporting the results. For each method compared, we have a benchmark result, which is the set of recovered quality scores obtained using that method, for fairness, on an unaltered full dataset (note that for the NFLX Public dataset, unlike the one used in Figures 1, 2 and 3, we start with the dataset where the corruption on the last four subjects has been corrected). We then consider that a number of the subjects are “corrupted” and simulate this by randomly shuffling each corrupted subject’s votes among the video stimuli. We then run each method compared on the partially corrupted datasets. The quality scores recovered are normalized by subtracting the mean and dividing by the standard deviation of the scores of the unaltered dataset. The normalized scores are compared against the benchmark, and a root-mean-squared-error (RMSE) value is reported.

Figure 4 reports the results on the two datasets, comparing plain MOS, BT.500 and P.913 with the proposed AP solver, as the number of corrupted subjects increases. It can be observed that in the presence of subject corruption, the proposed method achieves a substantial gain over the other methods. The reason is that the proposed model captures the variance of subjects explicitly and is able to compensate for it. The other methods, on the other hand, are only able to identify part of the corrupted subjects. Meanwhile, traditional subject rejection employs a set of hard-coded parameters to determine outliers, which may not be suited for all conditions. By contrast, the proposed model naturally integrates the various subjective effects together and is solved efficiently by the MLE formulation.

Figure 5 reports the results as we increase the probability of corruption from 0 to 1 while fixing the number of corrupted subjects to 10. It can be seen that as the corruption probability increases, the RMSE increases linearly or near-linearly for the other methods, while for the proposed method it increases much more slowly and saturates at a constant value without increasing further. A simplified explanation is that, since only a subset of a subject’s scores is unreliable, discarding all of the subject’s scores is a waste of valuable subjective data, which the proposed method effectively avoids.
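The corruption-and-scoring procedure described above can be sketched as follows. This is a minimal illustration, not the SUREAL implementation: the score matrix is synthetic, the function names are hypothetical, the plain MOS stands in for whichever method is being benchmarked, and we assume that the normalization uses the mean and standard deviation of the benchmark (unaltered) recovered scores.

```python
import numpy as np

def plain_mos(scores):
    """Plain MOS per stimulus; `scores` has shape (subjects, stimuli)."""
    return scores.mean(axis=0)

def shuffle_subjects(scores, n_corrupted, rng):
    """Corrupt n_corrupted randomly chosen subjects by shuffling their votes among stimuli."""
    corrupted = scores.copy()
    chosen = rng.permutation(scores.shape[0])[:n_corrupted]
    for i in chosen:
        corrupted[i] = rng.permutation(corrupted[i])
    return corrupted

def normalized_rmse(recovered, benchmark):
    """RMSE after normalizing both vectors with the benchmark's mean and std (assumption)."""
    mu, sigma = benchmark.mean(), benchmark.std()
    diff = (recovered - mu) / sigma - (benchmark - mu) / sigma
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical dataset: 20 subjects rating 40 stimuli once each.
rng = np.random.default_rng(0)
n_subjects, n_stimuli = 20, 40
true_quality = rng.uniform(1.0, 5.0, size=n_stimuli)
scores = true_quality + rng.normal(0.0, 0.5, size=(n_subjects, n_stimuli))

# Benchmark: the same method run on the unaltered dataset.
benchmark = plain_mos(scores)

rmse_by_count = {}
for n_corrupted in (0, 5, 10):
    corrupted = shuffle_subjects(scores, n_corrupted, rng)
    rmse_by_count[n_corrupted] = normalized_rmse(plain_mos(corrupted), benchmark)

for n_corrupted, rmse in rmse_by_count.items():
    print(f"{n_corrupted} corrupted subjects: RMSE = {rmse:.3f}")
```

Replacing `plain_mos` with another estimator (and recomputing its own benchmark on the unaltered data) gives the per-method curves of the kind shown in Figure 4.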

Validation of Solvers and Confidence Interval Calculation

Next, we demonstrate that the NR and AP solvers can accurately recover the parameters of the proposed model. This is shown using synthetic data, where the ground truth of the model parameters is known. In this section, we consider only the NFLX Public dataset for simulations.

[Figure 3 plots: for each dataset, a Subject Bias (∆_i) panel with legend P.913 (BR_SR_MOS) and Proposed (AP), and a Subject Inconsistency (υ_i) panel with legend Proposed (AP). (a) VQEG HD3 dataset: Proposed (AP) avg CI 0.28 (bias) and 0.20 (inconsistency). (b) NFLX Public dataset: Proposed (AP) avg CI 0.33 (bias) and 0.24 (inconsistency).]

Figure 3.

Recovered subject bias ∆_i and subject inconsistency υ_i for each subject i, for the methods compared, on (a) the VQEG HD3 dataset and (b) the NFLX Public dataset. The proposed NR method is not shown in the plots since it produces virtually identical results to the proposed AP method. For each method compared, the average length of the confidence interval is reported. (SR: subject rejection; BR: bias removal; avg CI: average confidence interval; NR: Newton-Raphson; AP: Alternating Projection.)

Table NBIC: Normalized Bayesian Information Criterion (NBIC) reported on the compared methods on public datasets. The NR and AP methods produce identical results. (MOS: plain mean opinion score; NR: Newton-Raphson; AP: Alternating Projection.)

Dataset  MOS  BT.500  P.913  NR/AP
VQEG HD3  2.75  2.74  2.39
NFLX Public  2.97  2.57  2.55
HDTV Ph1 Exp1  2.45  2.46  2.38
HDTV Ph1 Exp2  2.72  2.72  2.52
HDTV Ph1 Exp3  2.72  2.71  2.37
HDTV Ph1 Exp4  2.96  2.96  2.51
HDTV Ph1 Exp5  2.77  2.77  2.47
HDTV Ph1 Exp6  2.51  2.49  2.32
ITU-T Supp23 Exp1  2.91  2.91  2.35
MM2 1  2.80  2.78  2.83
MM2 2  3.89  3.89  3.52
MM2 3  2.48  2.47  2.45
MM2 4  2.74  2.73  2.62
MM2 5  2.90  2.82  2.67
MM2 6  2.81  2.74  2.74
MM2 7  2.73  2.72  2.76
MM2 8  3.00  2.92  2.88
MM2 9  3.27  3.21  2.95
MM2 10  3.04  3.05  2.98
its4s2  3.63  3.63  2.96
its4s AGH  3.15  3.05  2.77
its4s NTIA  2.94  2.91  2.53

Table CI: Average length of confidence intervals of the estimated quality scores reported on the compared methods on public datasets. The NR and AP methods produce identical results. (MOS: plain mean opinion score; NR: Newton-Raphson; AP: Alternating Projection.)

Dataset  MOS  BT.500  P.913  NR/AP
VQEG HD3  0.59  0.60  0.49
NFLX Public  0.62  0.54  0.5
HDTV Ph1 Exp1  0.50  0.61  0.48
HDTV Ph1 Exp2  0.57  0.57  0.53
HDTV Ph1 Exp3  0.56  0.59  0.52
HDTV Ph1 Exp4  0.63  0.63  0.52
HDTV Ph1 Exp5  0.57  0.57  0.53
HDTV Ph1 Exp6  0.50  0.51  0.48
ITU-T Supp23 Exp1  0.61  0.61  0.56
MM2 1  0.59  0.60  0.57
MM2 2  1.21  1.21  1.12
MM2 3  0.47  0.48  0.45
MM2 4  0.58  0.59  0.54
MM2 5  0.63  0.65  0.58
MM2 6  0.62  0.70  0.59
MM2 7  0.60  0.61  0.57
MM2 8  0.76  0.76  0.71
MM2 9  0.84  0.85  0.74
MM2 10  0.77  0.83  0.73
its4s2  0.82  0.82  0.66
its4s AGH  0.68  0.68  0.61
its4s NTIA  0.57  0.58  0.54

[Figure 4 plots: x-axis: No. Corrupted Subjects (Behavior: SHUFFLE); y-axis: RMSE of Normalized Quality Score (ψ_j); curves: MOS, BT.500 (SR_MOS), P.913 (BR_SR_MOS), Proposed (AP); panels: (a) VQEG HD3 dataset, (b) NFLX Public dataset.]

Figure 4.

RMSE of the (normalized) recovered quality score ψ_j as a function of the number of corrupted subjects, for the proposed method (AP) versus the other methods, on (a) the VQEG HD3 dataset and (b) the NFLX Public dataset. Subject corruption is simulated by scrambling the scores corresponding to a subject. The recovered quality score is normalized by subtracting the mean and dividing by the standard deviation of the scores of the unaltered dataset. (MOS: plain mean opinion score; SR: Subject Rejection; BR: Bias Removal; AP: Alternating Projection.)

The random samples are generated using the following methodology. For each proposed solver, we take the NFLX Public dataset and run the solver to estimate the parameters. The parameters estimated from a real dataset allow us to run simulations with practical settings. We then treat the estimated parameters as the “synthetic” parameters and run simulations to generate synthetic samples according to the model (1). Subsequently, we run the solver again on the synthetic data to yield the “recovered” parameters.

Figure 6 shows the scatter plots of the synthetic vs. recovered parameters, for the true quality ψ_j, subject bias ∆_i and subject inconsistency υ_i terms. It can be observed that the solvers recover the parameters reasonably well. Keep in mind that the synthetic data, unlike the usual subjective scores from category rating, are continuous. For discrete data, some specific problems would influence the obtained results, as described in [12]. Since those problems are not the main topic of this paper, we do not go into more detail and leave them as a topic for future research.

Figure 6(a) also shows the recovery results of BT.500 and P.913. It is noticeable that the subject biases recovered by the AP method and by the P.913 subject bias removal are very similar. This should not be surprising, considering that the AP method can be treated as a weighted and iterative generalization of the P.913 method.

Also plotted in Figure 6(b) are the confidence intervals of the recovered parameters. The reported “CI%” is the percentage of occurrences where the synthetic ground truth falls within the confidence interval. By definition, we expect the CI% to be 95% on average. To verify this, we run the same simulation on the 22 public datasets. For each dataset, the simulation is run 100 times with different seeds.
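The synthetic-versus-recovered check and the CI% computation can be illustrated with the minimal sketch below. It rests on several assumptions that are not taken from the SUREAL code: the "synthetic" parameters are drawn at random rather than estimated from a real dataset; the solver is a simple alternating scheme whose closed-form updates come from setting the first-order derivatives in the Appendix to zero; each subject rates each stimulus once; and the 95% interval for ψ_j uses the asymptotic half-width 1.96 / sqrt(Σ_i 1/υ_i²) implied by the second-order derivative with respect to ψ_j. All function names are hypothetical.

```python
import numpy as np

def simulate_scores(psi, delta, upsilon, rng):
    """Draw one score per (subject i, stimulus j): u_ij = psi_j + delta_i + upsilon_i * X."""
    noise = rng.standard_normal((delta.size, psi.size))
    return psi[None, :] + delta[:, None] + upsilon[:, None] * noise

def alternating_fit(u, max_iter=200, tol=1e-8):
    """Alternate closed-form updates for bias, inconsistency and quality (illustrative)."""
    psi = u.mean(axis=0)                                # initialize with the plain MOS
    for _ in range(max_iter):
        delta = (u - psi).mean(axis=1)                  # bias: mean residual per subject
        resid = u - psi - delta[:, None]
        upsilon = np.sqrt((resid ** 2).mean(axis=1))    # inconsistency: RMS residual per subject
        weight = 1.0 / upsilon ** 2
        # quality: bias-subtracted, consistency-weighted mean over subjects
        psi_new = (weight[:, None] * (u - delta[:, None])).sum(axis=0) / weight.sum()
        step = np.max(np.abs(psi_new - psi))
        psi = psi_new
        if step < tol:
            break
    half_width = 1.96 / np.sqrt(weight.sum())           # asymptotic 95% CI half-width of psi_j
    return psi, delta, upsilon, half_width

def coverage_experiment(n_runs=20, seed=0):
    """Return (CI% of psi_j, RMSE of psi_j) over repeated synthetic datasets."""
    rng = np.random.default_rng(seed)
    n_subjects, n_stimuli = 30, 50
    psi = rng.uniform(1.0, 5.0, n_stimuli)              # hypothetical "synthetic" parameters
    delta = rng.normal(0.0, 0.3, n_subjects)
    delta -= delta.mean()                               # remove the shift ambiguity between psi and delta
    upsilon = rng.uniform(0.3, 0.9, n_subjects)
    hits, sq_err = [], []
    for _ in range(n_runs):
        u = simulate_scores(psi, delta, upsilon, rng)
        psi_hat, _, _, half_width = alternating_fit(u)
        hits.append(np.abs(psi_hat - psi) <= half_width)
        sq_err.append((psi_hat - psi) ** 2)
    return 100.0 * float(np.mean(hits)), float(np.sqrt(np.mean(sq_err)))

ci_percent, rmse = coverage_experiment()
print(f"CI% = {ci_percent:.1f}, RMSE = {rmse:.4f}")
```

The same loop extends to ∆_i and υ_i by counting how often their synthetic values fall inside the corresponding intervals.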

Figure 5.

RMSE of the (normalized) recovered quality score ψ_j as a function of the probability of corruption (fixing the number of corrupted subjects to 10), for the proposed method (AP) versus the other methods, on (a) the VQEG HD3 dataset and (b) the NFLX Public dataset. Subject corruption is simulated by scrambling the scores corresponding to a subject. The recovered quality score is normalized by subtracting the mean and dividing by the standard deviation of the scores of the unaltered dataset. (MOS: plain mean opinion score; SR: Subject Rejection; BR: Bias Removal; AP: Alternating Projection.)

The result is shown in Table CI%. We compare the proposed NR and AP methods with the plain MOS. It can be seen that all methods yield a CI% very close to, but slightly below, 95%. The explanation is that both the plain MOS and the proposed methods assume that the underlying distribution is Gaussian; with both the mean and the standard deviation unknown, one should use a Student’s t-distribution instead. If the t-distribution is used, the coefficient can no longer be the fixed value 1.96 but is a function of the number of subjects and repetitions.

For the NR and AP methods, there are occasional cases where the CI% is significantly lower, for example, ψ_j and υ_i for the MM2 2 dataset. This is the case where the stimuli and/or subject dimensions are small, yielding non-Gaussian behavior (recall that the confidence interval calculated is asymptotic). In practice, we can introduce a correction term to compensate for the non-Gaussianity.

Runtime and Iterations

Lastly, we evaluate the runtime of the proposed NR and AP methods compared to the others. The experiment was performed on a MacBook Pro (15-inch, 2018) with a 2.9 GHz Intel Core i9 and 32 GB of 2400 MHz DDR4 memory, running macOS version 10.14.6. The schemes compared are implemented in Python and are open-sourced on GitHub [7]. The results of 100 simulation runs (based on a methodology similar to that of the previous sections) of each method are reported in Table Runtime. The results reveal the order of magnitude of the algorithms compared. The plain MOS is typically the fastest, while BT.500 and P.913 are two orders of magnitude slower. The NR and AP algorithms are three and one orders of magnitude slower, respectively. Notably, the AP runs faster than BT.500 and P.913, and is about 50x faster than the NR. The AP also requires about half the number of iterations of the NR to reach convergence.

[Figure 6 plots: scatter plots of synthetic (x-axis) vs. recovered (y-axis) parameters.
(a) Comparing BT.500, P.913 and AP. Quality Score (ψ_j): BT.500 (SR_MOS) RMSE 0.1186, P.913 (BR_SR_MOS) RMSE 0.1280, Proposed (AP) RMSE 0.1025. Subject Bias (∆_i): P.913 (BR_SR_MOS) RMSE 0.0483, Proposed (AP) RMSE 0.0483. Subject Inconsistency (υ_i): Proposed (AP) RMSE 0.0479.
(b) Confidence Intervals of NR and AP. Quality Score (ψ_j): Proposed (NR) RMSE 0.1027, CI% 94.9; Proposed (AP) RMSE 0.1025, CI% 94.9. Subject Bias (∆_i): Proposed (NR) RMSE 0.0496, CI% 100.0; Proposed (AP) RMSE 0.0483, CI% 100.0. Subject Inconsistency (υ_i): Proposed (NR) RMSE 0.0456, CI% 96.2; Proposed (AP) RMSE 0.0479, CI% 92.3.]

Figure 6.

Validation of the proposed NR and AP solvers using synthetic data. The random samples are generated using the following methodology. For each proposed solver, take the NFLX Public dataset and run the solver to estimate the parameters. Treat the estimated parameters as the “synthetic” parameters, and run simulations to generate synthetic samples according to the model (1). Run the solver again on the synthetic data to yield the “recovered” parameters. The x-axis shows the synthetic parameters and the y-axis shows the recovered parameters. (a) Comparing the proposed AP with BT.500 and P.913, (b) Proposed NR and AP with confidence intervals. (NR: Newton-Raphson; AP: Alternating Projection.)

Table CI%: Average confidence interval coverage (CI%) reported on public datasets. For each proposed solver and each dataset, run the solver to estimate the parameters. Treat the estimated parameters as the “synthetic” parameters, and run simulations to generate synthetic samples according to the model (1) (except for MOS, whose samples are generated according to (7)). Run the solver again on the synthetic data to yield the “recovered” parameters and their confidence intervals. The reported “CI%” is the percentage of occurrences where the synthetic ground truth falls within the confidence interval. For each dataset, the simulation is run 100 times with different seeds. Note that for both MOS and the proposed NR and AP methods, the CI% is slightly below 95%, due to the underlying Gaussian assumption used instead of the legitimate Student’s t-distribution. (MOS: plain mean opinion score; NR: Newton-Raphson; AP: Alternating Projection.)

Dataset            MOS   NR                AP
                   ψ_j   ψ_j   ∆_i   υ_i   ψ_j   ∆_i   υ_i
VQEG HD3           93.3  93.6  93.9  93.0  93.2  94.4  91.9
NFLX Public        94.2  93.7  94.5  93.1  93.5  94.1  92.3
HDTV Ph1 Exp1      93.9  94.1  93.9  93.1  93.8  94.2  91.3
HDTV Ph1 Exp2      93.8  94.0  94.5  92.5  93.8  94.0  91.2
HDTV Ph1 Exp3      93.9  93.9  94.4  92.5  93.7  94.1  90.6
HDTV Ph1 Exp4      93.8  94.0  94.3  91.9  93.8  94.1  90.9
HDTV Ph1 Exp5      93.8  94.1  94.2  92.2  93.9  94.2  90.9
HDTV Ph1 Exp6      93.8  94.0  94.4  92.6  93.9  94.0  91.0
ITU-T Supp23 Exp1  93.8  94.0  94.4  91.2  93.8  94.9  90.0
MM2 1              93.5  92.8  95.4  92.6  92.5  94.0  91.6
MM2 2              92.1  81.5  92.9  80.0  68.1  92.1  75.4
MM2 3              94.4  93.6  95.1  93.4  93.4  94.2  92.0
MM2 4              93.2  93.6  95.6  93.0  93.2  95.1  92.0
MM2 5              93.2  93.2  95.7  92.7  91.8  95.3  91.4
MM2 6              93.6  93.3  95.2  92.8  93.0  94.1  91.4
MM2 7              93.6  93.3  95.2  92.8  92.9  94.2  91.9
MM2 8              93.0  92.4  95.4  88.8  92.2  94.5  87.0
MM2 9              93.2  93.3  94.8  89.1  92.8  94.2  88.1
MM2 10             93.2  93.1  95.7  89.7  92.8  94.5  87.9
its4s2             93.1  94.1  94.6  60.6  94.1  94.2  59.2
its4s AGH          93.6  94.0  94.4  90.4  94.0  94.4  89.7
its4s NTIA         93.9  94.4  94.7  86.1  94.3  95.1  85.6
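As a quick arithmetic check on the runtime discussion, the ratios can be computed directly from one row of Table Runtime. The sketch below uses only the VQEG HD3 row; the variable names are illustrative, and other datasets give somewhat different ratios.

```python
# Mean runtimes (seconds) and iteration counts copied from the VQEG HD3 row of Table Runtime.
runtime = {"MOS": 5.2e-4, "BT.500": 1.5e-2, "P.913": 1.5e-2, "NR": 2.1e-1, "AP": 4.3e-3}
iterations = {"NR": 26.2, "AP": 12.1}

# How many times longer than AP each of the other screening or solver methods runs.
slowdown_vs_ap = {name: runtime[name] / runtime["AP"] for name in ("BT.500", "P.913", "NR")}

# How many times more iterations NR needs than AP to converge.
iteration_ratio = iterations["NR"] / iterations["AP"]

for name, ratio in slowdown_vs_ap.items():
    print(f"{name} takes {ratio:.1f}x the runtime of AP")
print(f"NR needs {iteration_ratio:.1f}x the iterations of AP")
```

On this row, the NR-to-AP runtime ratio is close to the "about 50x" figure quoted in the text, and the iteration ratio is slightly above 2.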

Conclusions

In this paper, we proposed a simple model to account for two of the most dominant effects of test subject inaccuracy: subject bias and subject inconsistency. We further proposed to solve for the model parameters through maximum likelihood estimation and presented two numerical solvers. We compared the proposed methodology with the standardized recommendations, including ITU-R BT.500 and ITU-T P.913, and showed that the proposed methods have advantages in: 1) better model-data fit, 2) tighter confidence intervals, 3) better robustness against subject outliers, 4) shorter runtime, 5) absence of hard-coded parameters and thresholds, and 6) auxiliary information on test subjects. We believe the proposed methodology is generally suitable for subjective evaluation of perceptual audiovisual quality in multimedia and television services, and we propose to update the corresponding recommendations with the methods presented.

References

[1] ITU-R BT.500-14 (10/2019): Methodologies for the Subjective Assessment of the Quality of Television Images.
[2] ITU-T P.910 (04/08): Subjective Video Quality Assessment Methods for Multimedia Applications.
[3] ITU-T P.913 (03/2016): Methods for the Subjective Assessment of Video Quality, Audio Quality and Audiovisual Quality of Internet Video and Distribution Quality Television in Any Environment.
[4] Maximum Likelihood Estimation (Wikipedia). https://en.wikipedia.org/wiki/Maximum_likelihood_estimation [Online; accessed 30-March-2020].
[5] Newton’s Method in Optimization (Wikipedia). https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization [Online; accessed 30-March-2020].
[6] Bayesian Information Criterion (Wikipedia). https://en.wikipedia.org/wiki/Bayesian_information_criterion [Online; accessed 30-March-2020].
[7] SUREAL - Subjective Recovery Analysis. https://github.com/Netflix/sureal [Online; accessed 30-March-2020].
[8] Tobias Hoßfeld, Raimund Schatz, and Sebastian Egger. SOS: The MOS is not enough!, pages 131–136, 2011.
[9] Zhi Li and Christos G. Bampis. Recover Subjective Quality Scores from Noisy Measurements. https://arxiv.org/abs/1611.01715 In Proceedings of the Data Compression Conference, 2017.
[10] L. Janowski and M. Pinson. Subject bias: Introducing a theoretical user model, pages 251–256, Sep. 2014.
[11] L. Janowski and M. Pinson. The accuracy of subjects in a quality experiment: A theoretical subject model. IEEE Transactions on Multimedia, 17(12):2210–2224, Dec 2015.
[12] Lucjan Janowski, Bogdan Ćmiel, Krzysztof Rusek, Jakub Nawała, and Zhi Li. Generalized score distribution. https://arxiv.org/abs/1909.04369 arXiv:1909.04369 [stat.ME], 2019.
[13] T.M. Cover and J.A. Thomas. Elements of Information Theory. A Wiley-Interscience publication. Wiley, 2006.
[14] Margaret Pinson, Filippo Speranza, M Barkowski, V Baroncini, R Bitto, S Borer, Y Dhondt, R Green, L Janowski, T Kawano, et al. Report on the validation of video quality models for high definition video content. Video Quality Experts Group, 2010.
[15] Netflix Public Dataset. https://github.com/Netflix/vmaf/blob/master/resource/doc/datasets.md [Online; accessed 30-March-2020].
[16] Margaret H. Pinson. ITS4S: A video quality dataset with four-second unrepeated scenes. Technical Report NTIA Technical Memo TM-18-532, NTIA/ITS, Feb. 2018.
[17] Margaret H. Pinson and L. Janowski. AGH/NTIA: A video quality subjective test with repeated sequences. Technical Report NTIA Technical Memo TM-14-505, NTIA/ITS, June 2014.
[18] M. H. Pinson, L. Janowski, R. Pepion, Q. Huynh-Thu, C. Schmidmer, P. Corriveau, A. Younkin, P. Le Callet, M. Barkowsky, and W. Ingram. The influence of subjects and environment on audiovisual subjective tests: An international study. IEEE Journal of Selected Topics in Signal Processing, 6(6):640–651, Oct 2012.
[19] ITU-T P-Series. ITU-T coded-speech database. Technical Report Series P: Telephone Transmission Quality, Telephone Installations, Local Line Networks, ITU-T, Feb. 1998.
[20] Margaret H. Pinson. ITS4S2: An image quality dataset with unrepeated images from consumer cameras. Technical Report NTIA Technical Memo TM-19-537, NTIA/ITS, Apr. 2019.

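Appendix: First- and Second-Order Partial Derivatives of L(θ)

We can derive the first-order and second-order partial derivatives of L(θ) with respect to ψ_j, ∆_i and υ_i as:

$$\frac{\partial L(\theta)}{\partial \psi_j} = \sum_{ir} \frac{u_{ijr} - \psi_j - \Delta_i}{\upsilon_i^2}$$

$$\frac{\partial L(\theta)}{\partial \Delta_i} = \sum_{jr} \frac{u_{ijr} - \psi_j - \Delta_i}{\upsilon_i^2}$$

$$\frac{\partial L(\theta)}{\partial \upsilon_i} = \sum_{jr} \left( -\frac{1}{\upsilon_i} + \frac{(u_{ijr} - \psi_j - \Delta_i)^2}{\upsilon_i^3} \right)$$

$$\frac{\partial^2 L(\theta)}{\partial \psi_j^2} = -\sum_{ir} \frac{1}{\upsilon_i^2}$$

$$\frac{\partial^2 L(\theta)}{\partial \Delta_i^2} = -\frac{1}{\upsilon_i^2} \sum_{jr} 1$$

$$\frac{\partial^2 L(\theta)}{\partial \upsilon_i^2} = \sum_{jr} \left( \frac{1}{\upsilon_i^2} - \frac{3 (u_{ijr} - \psi_j - \Delta_i)^2}{\upsilon_i^4} \right)$$

Appendix: An MLE Interpretation of the Plain MOS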

Table Runtime: Average runtime in seconds and number of iterations (for NR and AP) reported on public datasets. For each proposed solver and each dataset, run the solver to estimate the parameters. Treat the estimated parameters as the “synthetic” parameters, and run simulations to generate synthetic samples according to the model (1) (except for MOS, whose samples are generated according to (7)). Run the solver again on the synthetic data. For each dataset, the simulation is run 100 times with different seeds, and the mean is reported. For NR and AP, the number of iterations is also reported. (MOS: plain mean opinion score; NR: Newton-Raphson; AP: Alternating Projection.)

Dataset            Mean Runtime (seconds)                  No. Iterations
                   MOS     BT.500  P.913   NR      AP      NR    AP
VQEG HD3           5.2e-4  1.5e-2  1.5e-2  2.1e-1  4.3e-3  26.2  12.1
NFLX Public        5.7e-4  1.8e-2  1.9e-2  2.8e-1  4.5e-3  34.5  11.8
HDTV Ph1 Exp1      7.7e-4  3.3e-2  3.4e-2  2.0e-1  4.6e-3  23.4  10.3
HDTV Ph1 Exp2      7.8e-4  3.3e-2  3.4e-2  2.8e-1  4.9e-3  33.2  11.3
HDTV Ph1 Exp3      7.8e-4  3.3e-2  3.4e-2  2.5e-1  4.7e-3  29.4  10.7
HDTV Ph1 Exp4      7.6e-4  3.3e-2  3.4e-2  3.3e-1  5.0e-3  38.3  11.5
HDTV Ph1 Exp5      7.8e-4  3.3e-2  3.4e-2  2.7e-1  4.7e-3  31.3  10.8
HDTV Ph1 Exp6      7.6e-4  3.3e-2  3.4e-2  2.2e-1  4.6e-3  25.8  10.7
ITU-T Supp23 Exp1  8.1e-4  3.5e-2  3.5e-2  3.4e-1  5.0e-3  36.0  11.6
MM2 1              4.9e-4  1.3e-2  1.3e-2  2.1e-1  4.3e-3  27.4  12.4
MM2 2              4.0e-4  1.0e-2  1.1e-2  5.8e-1  1.4e-2  78.0  54.9
MM2 3              5.3e-4  1.3e-2  1.4e-2  1.8e-1  4.2e-3  23.3  11.6
MM2 4              5.0e-4  1.3e-2  1.4e-2  2.6e-1  4.6e-3  33.4  13.8
MM2 5              5.0e-4  1.3e-2  1.4e-2  2.9e-1  6.0e-3  37.3  19.3
MM2 6              4.8e-4  1.2e-2  1.3e-2  2.2e-1  4.3e-3  28.8  13.1
MM2 7              4.8e-4  1.2e-2  1.3e-2  2.0e-1  4.2e-3  25.6  12.3
MM2 8              4.3e-4  1.1e-2  1.1e-2  2.7e-1  5.5e-3  35.3  18.7
MM2 9              4.3e-4  1.1e-2  1.2e-2  2.8e-1  5.1e-3  36.5  16.8
MM2 10             4.3e-4  1.1e-2  1.2e-2  2.3e-1  4.8e-3  29.8  15.4
its4s2             3.3e-3  2.5e-1  2.5e-1  1.1e+0  1.3e-2  49.8  13.3
its4s AGH          8.7e-4  4.1e-2  4.2e-2  3.5e-1  5.3e-3  39.4  11.6
its4s NTIA         2.6e-3  1.6e-1  1.6e-1  6.4e-1  1.1e-2  46.2  11.3

The plain MOS and its confidence interval can be interpreted using the notion of maximum likelihood estimation. Consider the model:

$$U_{ijr} = \psi_j + \upsilon_j X, \tag{7}$$

where U_ijr is the opinion score, ψ_j is the true quality of stimulus j and υ_j is the “ambiguity” of j. X ∼ N(0, 1) is i.i.d. Gaussian. Note that this is different from the proposed model (1), where υ_i is associated with the subjects, not the stimuli. We can define the log-likelihood function for this model as L(θ) = log P({u_ijr} | ({ψ_j}, {υ_j})), and solve for the {ψ_j} and {υ_j} that maximize the log-likelihood function, as follows:

$$\psi_j = \frac{\sum_{ir} u_{ijr}}{\sum_{ir} 1}, \qquad \upsilon_j = \sqrt{\frac{\sum_{ir} (u_{ijr} - \psi_j)^2}{\sum_{ir} 1}}.$$

The second-order partial derivative with respect to ψ_j is

$$\frac{\partial^2 L(\theta)}{\partial \psi_j^2} = -\frac{1}{\upsilon_j^2} \sum_{ir} 1.$$

The 95% confidence interval of ψ_j is then:

$$CI(\psi_j) = \psi_j \pm 1.96 \frac{\upsilon_j}{\sqrt{\sum_{ir} 1}}. \tag{8}$$

One minor difference between (8) and the 95% confidence interval formula in BT.500-14 Section A1-2.2.1 is that the former uses differential degrees of freedom 0 and the latter uses 1 for the sample standard deviation calculation. In fact, neither one is fully precise. In the most precise way to calculate the confidence interval, one should use a Student’s tt