Describing Subjective Experiment Consistency by p -Value P-P Plot
Jakub Nawała, Lucjan Janowski, Bogdan Ćmiel, Krzysztof Rusek
∗ AGH University of Science and Technology, Department of Telecommunications, Kraków, Poland ([email protected])
† AGH University of Science and Technology, Department of Mathematical Analysis, Computational Mathematics and Probability Methods, Kraków, Poland
Abstract—There are phenomena that cannot be measured without subjective testing. However, subjective testing is a complex issue with many influencing factors. These interplay to yield either precise or incorrect results. Researchers require a tool to classify the results of a subjective experiment as either consistent or inconsistent. This is necessary in order to decide whether to treat the gathered scores as quality ground-truth data. Knowing if subjective scores can be trusted is key to drawing valid conclusions and building functional tools based on those scores (e.g., algorithms assessing the perceived quality of multimedia materials). We provide a tool to classify a subjective experiment (and all its results) as either consistent or inconsistent. Additionally, the tool identifies stimuli having irregular score distributions. The approach is based on treating subjective scores as a random variable coming from the discrete Generalized Score Distribution (GSD). The GSD, in combination with a bootstrapped G-test of goodness-of-fit, allows us to construct a p-value P–P plot that visualizes an experiment's consistency. The tool safeguards researchers from using inconsistent subjective data. In this way, it makes sure that the conclusions they draw and the tools they build are more precise and trustworthy. The proposed approach works in line with expectations drawn solely from experiment design descriptions of 21 real-life multimedia quality subjective experiments.

I. INTRODUCTION
Any system built with an end-user in mind should provide an excellent experience for that user. In telecommunications we say that the system should provide high Quality of Experience (QoE). Therefore, numerous works focus on QoE optimization in various contexts. Many target quality improvement through better utilization of network resources [2, 33]. Some go beyond that and use the QoE to optimize parameters like the power consumption of mobile device displays [45] or to improve user-perceived quality when watching 360° videos. All this research is based on the concept of measuring the QoE through subjective experiments (i.e., by asking selected end-users about their perception of the quality of a service, system or a single multimedia material of interest). The key takeaway here is that subjective experiments are necessary to assess the QoE.

Subjective experiments are not precise measuring systems. People are able to differentiate a limited number of intensities of a given stimulus [25]. Additionally, they are not perfectly consistent with their actual perception when formulating their judgement. We thus classify collecting subjective scores as a noisy process. There are recommended ways to analyse subjective scores and correct for typical errors. These include subject bias removal or discarding study participants whose opinion is not correlated with the general opinion of others (cf. Rec. ITU-T P.913). However, there is no tool for classifying the whole subjective experiment as either consistent or inconsistent. Such a tool could protect researchers and practitioners from using erroneous data and reaching ill-founded conclusions. Since discarding all the results at once is expensive (because it translates to re-organizing the whole experiment again), it would also be helpful to have a tool that can point to individual problematic stimuli (e.g., images or videos). Ideally, such a tool would point to stimuli for which study participants do not agree (or express any other unwanted behaviour).
By investigating these stimuli in greater detail and, potentially, discarding them, the general data consistency would improve (without necessitating discarding the whole experiment).

The main contribution of this paper is a tool classifying the whole subjective experiment as either consistent or inconsistent. The tool also points to specific stimuli with irregular (also referred to as atypical) score distributions. The overall consistency of an experiment is visualised using a p-value P–P plot (cf. Fig. 1).

There are four goals we address with this paper. First, we introduce the new data analysis method. Second, we present which score distributions the method classifies as typical (and thus consistent). Third, we show how the method operates on real-life subjective data. At last, we hope to convince the community that p-value P–P plots are a useful data analysis tool.

To validate the proposed method we use results from 21 real-life subjective experiments. These test the quality of a total of 4,360 stimuli and contain 98,282 individual scores. Among the stimuli are: videos (with and without audio), images, and audio samples. The analysed experiments cover different experiment designs and have a varying number of scores per stimulus (between 9 and 33), with most of them having the typical number of 24 scores per stimulus.

Fig. 1. Our main contribution is a tool allowing to discern between a consistent (a) and an inconsistent (b) subjective experiment. In each panel the ECDF of p-values (y axis) is plotted against the theoretical uniform CDF (x axis).

Having validated our method on the real-life data sets, we believe it can be used to assess the consistency of many different subjective experiments.
Furthermore, the method's ability to point to specific problematic stimuli eases data analysis and has the potential to provide new per-stimulus insights (e.g., indicate that a given image is especially difficult to score).

The next section highlights related work. Section III describes the theoretical background underlying the p-value P–P plot and the proposed method. In Section IV the p-value P–P plot is explained. In Section V the real-life data sets are described and analysed using our method. At last, Section VI concludes the paper.

II. RELATED WORK
Subjective scores analysis is a broad topic considered by numerous publications. This includes ITU standards and, among these, the ITU-T Rec. P.1401. Although focused on objective quality algorithms, it gives important guidelines regarding good practices related to subjective data analysis in general. In a similar vein, the work of Brunnström and Barkowsky [3] looks into appropriate sample sizes for subjective experiments. In particular, they devise a methodology to select the number of participants necessary to measure a certain MOS score difference between stimuli. An important observation here is that both mentioned publications rely on MOS score analysis. They also implicitly assume the correctness (also referred to as consistency) of data stemming from subjective experiments. Our work complements the toolkit introduced by these works, allowing to check whether subjective experiment consistency is a valid assumption. Additionally, our method gives per-stimulus information (in contrast to the MOS-only approach).

The search for analysis methods going beyond the MOS is present in a few existing works. One of these is [10], where the authors explore several measures related to user behavior and service acceptance. Both [16] and [38] go beyond the MOS by targeting score distributions. They propose to model the probability of each single score, rather than focusing on just the mean. Finally, works like [6] operate on higher levels of abstraction and propose a mapping between QoE and Quality of Service (QoS). With our work we also go beyond the MOS, but do not take into account the QoE–QoS mapping.

Not many publications analyse subjective scores by focusing on the answering process. An important exception here is [13], where the relation between the MOS and the standard deviation of opinion scores is studied. The authors of [13] introduce the HSE α parameter. (In the original paper the parameter is labelled SOS α. Since SOS is traditionally used to denote the standard deviation of opinion scores, and not wanting to introduce an ambiguous name, we refer to the parameter using the first letters of the surnames of its authors.) They argue that it can be used both to comprehensively summarize subjective experiment results and to check their consistency. Another interesting approach to subjective scores consistency is shown in [11], where various confidence intervals for MOS values are analysed. Our approach is to analyse consistency on a per-stimulus basis and is thus distinct from both [13] and [11].

Our method is based on the concept of a subject model. The idea was first presented in [18] and later in [22]. The subject model further extends the toolkit for subjective data analysis [9, 20]. It also helps make existing analyses more precise [21, 8]. The initial shape of the subject model is modified in [39, 40] and [42]. The derived models helped discover new phenomena (e.g., observing that content has a significant influence on both the standard deviation and the mean of opinion scores [39]). The analysis presented in our paper is based on a yet different subject model that was introduced in [19].

We do not compare our method to similar methods present in other fields that also deal with quality assessment. This is because they require a different set of data than what is usually gathered in QoE subjective experiments. One example is a tool used by the food industry called "Panel Check" [35, 41]. Its main drawback is the assumption of no tied answers. When using the 5-point scale (typical for QoE experiments) this assumption is difficult to satisfy. Another example is the signal detection theory (SDT), extensively used in psychology [24, 7]. Again, it requires measurements that are often not available in QoE experiments. Specifically, both the recognition score and the quality score are needed. In the typical QoE experiment only the latter is provided.

The P–P plots we use are a member of the family of plots referred to as probability plots. The second member of this family is the more popular Q–Q plot.
Both types of plots were introduced by Wilk and Gnanadesikan in 1968 [44]. Their general purpose is to compare two sets of data. Commonly, this is used to juxtapose a set of observations with a theoretical model. Probability plots are used throughout various disciplines, including astrophysics [23] and landscape and urban planning [4].

III. SUBJECTIVE SCORE AS A RANDOM VARIABLE
Scores (also referred to as answers) collected for a single stimulus in a subjective experiment can be modelled by a probability distribution. Expressing an answer as a random variable has numerous advantages. One is the ability to formally describe typical and atypical distributions. In this section we start from a discussion about possible score distributions and then describe a discrete distribution which is used to model these.
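The idea of an answer as a random variable can be sketched by drawing scores from a discrete distribution over the rating categories. The probabilities below are purely illustrative (they are not GSD probabilities), and the variable names are ours:

```python
import random
from collections import Counter

rng = random.Random(0)

# Illustrative probabilities of each answer on a 5-point categorical scale.
probs = {1: 0.05, 2: 0.40, 3: 0.40, 4: 0.10, 5: 0.05}

# Simulate one stimulus being scored by 24 subjects.
scores = rng.choices(list(probs), weights=probs.values(), k=24)

# Tally the answers into a count vector [k1, ..., k5] summing to 24.
counts = Counter(scores)
sample = [counts.get(s, 0) for s in range(1, 6)]
print(sample)
```

Collections of such count vectors, one per stimulus, are exactly the objects classified as typical or atypical in the next subsection.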
A. Typical or Atypical?
People do not usually use numbers to describe service quality. Instead, they voice their judgement as single words (e.g., "ok") or more complex verbal explanations. Nevertheless, their opinion can typically be mapped to selected categories. Therefore, the most popular scale used in subjective experiments is a categorical five-point scale (1—"Bad," 2—"Poor," 3—"Fair," 4—"Good" and 5—"Excellent"). We also use this scale, but note that the analysis can be extended to other scales as well (see [19] for the explanation).

Even in an ideal world a subject (i.e., study participant) using a discrete scale naturally generates scores with some degree of randomness. For example, if stimulus quality is not satisfactory enough to obtain the rating "Good," but also not sufficiently bad to obtain the rating "Fair," an ideal subject would not use the same answer each time the same stimulus is shown. In other words, they would alternate between the ratings "Good" and "Fair." This phenomenon alone makes score distribution analysis both important and challenging. It also justifies why it would be valuable to be able to discern between typical and atypical score distributions. Importantly, by typical we mean score distributions that we would expect to observe in a consistent subjective experiment. On the other hand, by atypical we mean distributions the appearance of which would be justifiable only by referring to some external factors influencing the scoring process—e.g., a bias [46].

We now analyse typical and atypical score distributions. To do so we divide the two classes (i.e., typical and atypical) into more granular classes. We name each class and shortly describe it. Additionally, next to the name of each class we show an exemplary sample of scores corresponding to this class. This sample is presented as [k1, k2, k3, k4, k5], where kj denotes the number of answers of category j. To simplify the discussion we assume we are always considering a single stimulus having the true quality ψ = 2.5. (By true quality we understand the non-directly-observable true quality of a stimulus. We use ψ as defined by SAM (Statistical Analysis Methods) of VQEG (Video Quality Experts Group) [15].) Additionally, we assume this stimulus is assigned 24 scores. In other words, k1 + k2 + k3 + k4 + k5 = 24. At last, it is worth pointing out that the Mean Opinion Score (MOS) for the exemplary sample should be close to 2.5 (i.e., (1·k1 + 2·k2 + 3·k3 + 4·k4 + 5·k5)/n ≈ 2.5).

These are score distributions that we treat as typical:
• Perfect [0, 12, 12, 0, 0]—represents the most stable answers we can imagine. All answers are as close to ψ as possible. In a typical subjective experiment subjects do not agree perfectly with each other, so this is not the most common class.
• Common [2, 10, 10, 2, 0]—shows answers spread around ψ. This class is common in real-life subjective experiments.
• Strongly spread [5, 7, 7, 5, 0]—still typical, but starts to contain a surprisingly large spread of scores. The border line between common and strongly spread depends on the experiment, stimulus difficulty, subject pool, and possibly other factors.

To some extent it is more interesting to consider score distributions that are atypical. Here is a list of those that we treat as such:
• Random answer(s) [1, 11, 11, 0, 1]—represents single answers appearing away from the majority of other answers.
• Bimodal [9, 0, 9, 6, 0]—represents a mixture of very different opinions. The lack of "Poor" answers in the sample shows that we have non-uniform groups of subjects. Potentially, some of them are experts and others are naïve observers. In such a case we should not analyze the scores as though they were coming from a single group of observers. For example, stating that this stimulus has the true quality of 2.5 would be incorrect. This is because for one group the quality is "Bad" and for the other it is a little bit better than "Fair."
• Sudden cut-off [3, 6, 15, 0, 0]—does not represent an obvious error. We could imagine that this exemplary sample is related to a stimulus of quality slightly below the rating "Fair." Nevertheless, we should still observe more "Poor" scores and fewer "Fair" ones. The lack of "Good" answers also seems unusual, especially since there are so many "Fair" ratings. Importantly, all of this can be due to chance, since 24 scores may not be enough to accurately represent the true underlying distribution. Another explanation is the presence of some specific disturbance in the voting process.
• Hate or love [11, 0, 0, 0, 13]—we doubt that such a distribution can be observed in subjective experiments. (Note that this time the sample mean is not 2.5. Since ψ is the true and hidden parameter, the observed sample does not always have a mean of 2.5.)
• Wrong MOS—any sample with the mean value very far from 2.5. Significantly, this problem can be detected only if we know the true quality.

B. Subject Model

In order to distinguish between typical and atypical score distributions we need a subject model accepting typical samples and excluding the atypical ones. It should have as few parameters as possible. The smallest number of parameters is two. One is the true quality ψ, while the other describes the answer spread θ. We consider a model where an answer for a stimulus is a random variable U drawn from a distribution:

U ∼ F(ψ, θ),    (1)

where F() is a cumulative distribution function.

We could use two types of models: (i) continuous (proposed in [12], [18] and [22]) and (ii) discrete (proposed in [19]). Since the discrete model better fits data from multiple subjective experiments [19], we use it as the basis of the presented method. The continuous model could be used as well.

We refer to the discrete model as a discrete distribution. We use this term (instead of discrete model) purposely. We are looking for a discrete distribution which is a function of two parameters. A distribution satisfying our constraint is the Generalized Score Distribution (GSD) described in [19].
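To make the exemplary classes concrete, a sample's MOS and variance can be computed directly from its count vector. The two count vectors below are stand-ins consistent with the constraints above (24 scores, mean close to ψ = 2.5); the helper name is ours:

```python
def summarize(counts):
    """counts[j] is the number of answers of category j+1 on a 5-point scale.
    Returns the sample MOS and the (population) variance of the scores."""
    n = sum(counts)
    scores = [j + 1 for j, k in enumerate(counts) for _ in range(k)]
    mos = sum(scores) / n
    var = sum((s - mos) ** 2 for s in scores) / n
    return mos, var

perfect = [0, 12, 12, 0, 0]  # most stable sample possible around psi = 2.5
bimodal = [9, 0, 9, 6, 0]    # a mixture of very different opinions

for name, c in [("perfect", perfect), ("bimodal", bimodal)]:
    mos, var = summarize(c)
    print(f"{name}: MOS = {mos:.2f}, variance = {var:.2f}")
```

Both samples share the same MOS of 2.5; only the variance and the shape of the distribution separate the perfect class from the bimodal one, which is why MOS-only analysis cannot flag atypical stimuli.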
Since the complete description of the distribution is lengthy, we refer the reader to [19]. Here, we only describe the GSD properties that are relevant for this work. (Probably "scoring model" would be a better name than "subject model." Still, since the latter has already been established in the literature [18], we stick to this convention.)

For a distribution with limited support the variance is limited as well. Moreover, if a distribution is discrete, the lower and upper bounds of the variance depend on the mean value. For a five-point scale and the mean (ψ) equal to 1.5, the smallest possible variance is obtained for 50% answers 1 and 50% answers 2. In contrast, if only scores 1 and 5 are used we obtain the maximum variance. The resulting variance limits (for ψ = 1.5) are V_min(1.5) = 0.25 and V_max(1.5) = 1.75. For a different ψ, for example ψ = 3, the minimal and maximal values are V_min(3) = 0 and V_max(3) = 4, respectively. The connection between the variance and the mean makes the analysis difficult, since the variance cannot be analysed without knowledge of the mean. One way to deal with this problem is to use the HSE α parameter [13].

The GSD parameters are the true quality ψ (defining the stimulus quality) and ρ (describing the answer spread, with answers closer to ψ the closer ρ is to 1). ρ is limited to the (0, 1] interval, regardless of the ψ value. This addresses the previously mentioned problem of the variance–mean dependency. Tab. I presents the GSD distribution for various ρ values.

All the typical score distribution classes described in Section III-A (i.e., perfect, common, and strongly spread) come from the GSD distribution. On the other hand, the atypical classes (i.e., random answer(s), bimodal, and sudden cut-off) are not part of the GSD. The only atypical class that comes from the GSD distribution is hate or love. This class is part of the GSD since it is a generalization of the strongly spread class. Therefore, we would have to manually set a boundary score spread that would still correspond to the strongly spread class and not to hate or love. We leave for future research the task of quantifying this boundary condition. Significantly, our data analysis has shown that hate or love samples are rare.

TABLE I
PRESENTATION OF HOW THE ρ PARAMETER INFLUENCES THE SCORE DISTRIBUTION MODELLED WITH GSD WHEN ψ = 2.5.

ρ | Score 1 | 2 | 3 | 4 | 5

Typical score distribution classes are from the GSD distribution and atypical ones are not. We can thus claim that the GSD reflects intuitions of the subjective testing community developed over the years through practical encounters with subjective data. As such, it shows which stimuli have score distributions that would be counter-intuitive to practitioners and which follow their intuitions and experiences. Naturally, the GSD is a model that tries to simplify the complex nature of reality. This means it should not be used as the ultimate measure of subjective data consistency. Instead, it should be juxtaposed with other consistency measures.

To check whether a sample of scores follows the GSD distribution we have to first estimate the GSD parameters for the sample and then validate whether the sample comes from the GSD with these estimated parameters. This is presented in the next section. The section also shows how to extend this reasoning to multiple samples.

IV. p-VALUE P–P PLOT
The main contribution of this paper is a tool, new for the QoE community, that assesses subjective experiment consistency and also detects which stimuli should be analyzed in greater detail. The algorithm, the philosophy behind it and practical issues related to its usage are described in this section.

We start from an overview of the philosophy behind the proposed methodology:
1) A consistent experiment contains stimuli with typical score distributions.
2) Typical distributions are described by the GSD distribution.
3) For each stimulus we can estimate the probability of whether its score distribution comes from the GSD.
4) If for many stimuli this probability is low, we should analyze the data in greater detail.
5) The p-value P–P plot reveals when the detailed analysis is needed and when the experiment can be treated as consistent.

In order to check if an assumed distribution fits specific data we have to perform a two-step procedure. In the first step, distribution parameters are estimated for an observed sample. In our case, we treat scores assigned to a single stimulus as a single sample. The second step is to use a goodness-of-fit (GoF) test to see how well the selected distribution (with the estimated parameter values) describes the sample. The GoF test returns a p-value, which states how likely it is to observe the sample, assuming it comes from the considered distribution. Fig. 2 visualizes the procedure.

Fig. 2. Algorithm to test the goodness-of-fit. The pipeline starts at the blue "Subjective data" block, which feeds both the fitted GSD(ψ, ρ) model (yielding expected frequencies) and the observed frequencies; the G-test of goodness-of-fit compares the two. The output is a p-value (red box), which we use to verify the null hypothesis that a sample comes from the GSD distribution.

The GoF test we use is a standard likelihood ratio approach called the G-test (cf. Sec. 14.3.4 of [1]). Since our sample sizes are usually small, we do not use the asymptotic distribution for calculating the p-value (as is the case for the popular χ² test of GoF—cf. Ch. 11 of [27]). Instead, we estimate the p-value using a bootstrapped version of the G-test. The approach is described in our GitHub repository [26]. Broader theoretical considerations are in [5].

The analysis shown in Fig. 2 generates as many p-values as there are stimuli in the experiment. With multiple p-values the analysis is more involved than usual [3]. Since we are not interested in detecting which stimulus is not from the GSD distribution, but rather in concluding about the experiment consistency as a whole, the p-value P–P plot is our main analytical tool [37].

A p-value P–P plot (see Fig. 1 or Fig. 3) presents a one-dimensional sample of observed quantities (in our case, a vector of p-values) in relation to a theoretical, expected probability distribution (in our case, the uniform distribution spanning the range from 0 to 1). The plot is based on the empirical cumulative distribution function (ECDF) of the observed sample (y axis) and the CDF of the expected probability distribution (x axis). Happily, since p-values are in the range from 0 to 1 and our expected uniform distribution is defined over the same range, values on the x axis not only correspond to the expected CDF but also to the observed p-values. (To grasp why the p-value P–P plot is the better choice here, we suggest reading [36].)

For a Simple Hypothesis, such as the one stating that the sample is from the standard Gaussian distribution, the expected shape of the p-value P–P plot is a straight line (x = y). This is because we expect p-values to follow the uniform distribution [34]. Still, deviating from the x = y line does not necessarily prove that the obtained p-value distribution is odd. Slight deviations are tolerable due to the random nature of the underlying process. Importantly, the assumption about the x = y line is not valid if we consider a Composite Hypothesis. One such example is a hypothesis stating that a sample is from the GSD distribution with unknown parameters.
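The two-step procedure of Fig. 2 can be sketched as a parametric bootstrap. Since the GSD pmf is lengthy (see [19]), the sketch below uses a shifted binomial (score = 1 + Bin(4, p)) as a stand-in one-parameter model; substituting the real GSD pmf and its estimator would reproduce the paper's actual pipeline. All function names are ours:

```python
import math
import random

def fit(counts):
    """Estimate p for the shifted-binomial stand-in from a count vector."""
    n = sum(counts)
    mean = sum((j + 1) * k for j, k in enumerate(counts)) / n
    return (mean - 1) / 4

def pmf(p):
    """Probability of each score 1..5 under score = 1 + Bin(4, p)."""
    return [math.comb(4, j) * p**j * (1 - p)**(4 - j) for j in range(5)]

def g_stat(counts, probs):
    """G statistic: 2 * sum(O * ln(O / E)) over non-empty cells."""
    n = sum(counts)
    return 2 * sum(o * math.log(o / (n * e))
                   for o, e in zip(counts, probs) if o > 0)

def bootstrap_p_value(counts, B=2000, seed=0):
    """Bootstrapped G-test: resample from the fitted model, refit on each
    resample, and count how often the resampled G exceeds the observed G."""
    rng = random.Random(seed)
    n = sum(counts)
    fitted = pmf(fit(counts))
    g_obs = g_stat(counts, fitted)
    exceed = 0
    for _ in range(B):
        sample = rng.choices(range(5), weights=fitted, k=n)
        boot = [sample.count(j) for j in range(5)]
        if g_stat(boot, pmf(fit(boot))) >= g_obs:
            exceed += 1
    return exceed / B
```

A well-behaved sample such as [2, 10, 10, 2, 0] yields a p-value that is not small, while a strongly bimodal sample such as [12, 0, 0, 0, 12] yields a p-value near zero, mirroring the typical/atypical split discussed in Section III.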
(A Simple Hypothesis is any hypothesis that specifies the population distribution completely; a Composite Hypothesis is any hypothesis that does not.) In this case, the expected shape of the p-value P–P plot is best described by a set of points falling below the x = y line (see Fig. 1a). Only if the points significantly exceed the line can we reject our hypothesis.

To quantify what it means to significantly exceed the x = y line we use the following procedure. For each considered α (i.e., a specific value on the x axis of the p-value P–P plot) we change each p-value to 0 or 1. We do it by checking whether it is larger (0) or smaller (1) than α. Now, the whole experiment is changed to a vector of zeros and ones with its length equal to the number of stimuli. The question under investigation is: "is the number of stimuli with a small p-value significant?" (Note that "small" is defined by α.) This can be understood as hypothesis testing verifying the null hypothesis stating that the probability of finding a stimulus with a p-value below α is smaller than α. This is a classical problem with the solution given by:

α̂ > α + z_(1−β) · √(α(1−α)/n),    (2)

where α̂ is the observed proportion of p-values smaller than α (i.e., the y axis value on the p-value P–P plot), α is the conjectured proportion of such stimuli, z_γ is the quantile of order γ of the standard normal distribution and β is the significance level of this test. (We use 5% and hence z_(1−β) = 1.64.) Importantly, if Eq. (2) is satisfied we reject the null hypothesis and say that a proportion of around α̂ − α of the stimuli truly have their p-value below α. Denoting the right-hand side of Eq. (2) by f(α), we get a formula for drawing a line spanning the whole range of α values (from 0 to 1). This is the black line visible in Fig. 1 and Fig. 3. When points on the p-value P–P plot significantly exceed this threshold line we say that an experiment is inconsistent. In other words, we reject our composite hypothesis stating that the score distributions of stimuli in an experiment come from the GSD distribution.

As stated in the previous paragraph, Eq. (2) is used to draw the theoretical threshold line in the p-value P–P plot. Now, if data points cross this line (i.e., fall above it) or are close to it, we should analyse all stimuli with p-values smaller than the α of the crossing point. Note that the α of the crossing point is found by looking at the x axis value on the P–P plot for which the data crosses the theoretical threshold. When analysing the stimuli with p-values smaller than the crossing point it is important to remember that it is natural to observe a fraction of stimuli with p-value below α as high as α. In other words, even if we draw the scores from the GSD itself, the fraction of stimuli with p-value below α could still be close to α. In any case, we are not allowed to simply remove all stimuli with p-value smaller than α. Instead, we should analyse their score distributions one by one and look for specific problems. Section V-B demonstrates such an analysis.

Describing the above algorithm from a practical perspective, let us consider generating 160 samples, each time drawing 24 values from the GSD distribution. This way we simulate a subjective experiment with 160 stimuli, each assigned 24 scores. Since the scores come from the GSD distribution, we know that each stimulus has a typical score distribution. We now apply the procedure from Fig. 2 to get a p-value for each stimulus. This generates a range of p-values, some of which are small (e.g., smaller than 0.05). These small p-values appear even though we know the input data is generated from typical score distributions only. Hence, the decision whether to classify some real experiment, as a whole, as inconsistent cannot be based on observing a few small p-values. Also, a single stimulus with a low p-value cannot be labelled as odd [34].
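Eq. (2) and the simulation above can be sketched together: one p-value per stimulus (uniform draws stand in for the per-stimulus bootstrapped G-tests, which is how p-values behave under a simple hypothesis) is compared against the threshold line. The function names are ours:

```python
import math
import random

def threshold(alpha, n, z=1.64):
    """Threshold line of Eq. (2): f(alpha) = alpha + z*sqrt(alpha*(1-alpha)/n),
    where z is the standard normal quantile of order 1 - beta (beta = 5%)."""
    return alpha + z * math.sqrt(alpha * (1 - alpha) / n)

def exceeds(p_values, alpha):
    """True if the fraction of p-values below alpha (the y-axis value of
    the p-value P-P plot) lies above the threshold line."""
    alpha_hat = sum(p < alpha for p in p_values) / len(p_values)
    return alpha_hat > threshold(alpha, len(p_values))

# Simulated experiment with 160 stimuli: uniform draws mimic the p-values
# of a perfectly consistent experiment; a few small p-values still appear.
rng = random.Random(1)
consistent = [rng.random() for _ in range(160)]
print(sum(p < 0.05 for p in consistent), "of 160 p-values fall below 0.05")

# By contrast, an experiment where 30 of 160 stimuli have tiny p-values
# crosses the threshold line at alpha = 0.05 (0.1875 > ~0.078) and is
# flagged as inconsistent.
flagged = [0.01] * 30 + [0.5] * 130
print(exceeds(flagged, 0.05))
```

Sweeping `alpha` over a grid and plotting `alpha_hat` against it reproduces the P–P plot itself, with `threshold(alpha, n)` as the black line of Fig. 1 and Fig. 3.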
Instead, the proportion of p-values (relative to the total number of stimuli) smaller than a certain α should be analysed—cf. Eq. (2).

Our simulation study showed that if the data comes from the GSD distribution, points on the p-value P–P plot fall below or occasionally coincide with the threshold line. Since this holds for the region of p-values that is critical for drawing conclusions (i.e., for α < 0.2), we can follow the standard recommended analysis from [37].

a) GSD Parameters Estimation: A practical issue related to the p-value P–P plot is GSD parameter estimation. The GSD has a complicated likelihood function. Thus, it is not possible to provide an analytical solution for parameter estimation. To simplify the estimation we pre-calculate the probabilities of each score for a grid of ψ and ρ values. Specifically, we use 399 values of ψ, spanning the interval [1.01, 4.99] with a step of 0.01, and 400 values of ρ, spanning the interval [0.0025, 1.0] with a step of 0.0025. The pre-calculated grid accelerates the estimation process and protects us from any potential difficulties that arise when using dynamic parameter optimization. Using simulation studies we have validated that the proposed estimation method works as expected. In other words, we are able to recover correct GSD parameters from synthetic data that is itself generated from the GSD.

V. SUBJECTIVE DATA ANALYSIS
This section puts our idea into practice. We first describe the six real-life subjective studies we use. The descriptions highlight reasonable expectations about the studies' consistency. We then verify these expectations by applying our method.
A. Data Sets
To check the practical distribution of subjective scores we use data from 21 subjective experiments. The data comes from six studies representing various stimulus types: (i) VQEG HDTV Phase I [28] (six experiments; video-only), (ii) ITS4S [29] (two experiments; video-only), (iii) AGH/NTIA [17, 31] (one experiment; video-only), (iv) MM2 [32] (ten experiments; audiovisual), (v) ITS4S2 [30] (one experiment; image), and (vi) ITU-T Supp. 23 [14] (one experiment; speech). This gives a total of 98,282 subjective scores.

All the experiments use the 5-level Absolute Category Rating (ACR) experiment design. Importantly, all the data we use is publicly accessible. Please refer to the references provided for each study or go to our repository [26] to download a single CSV file with all the results combined (and put into the tidy data format [43]).

a) VQEG HDTV Phase I:
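In the tidy data format [43], each row of the combined CSV is a single observation, i.e., one score given by one subject to one stimulus, so per-stimulus samples fall out of a simple grouping step. The column names below are hypothetical; check the repository [26] for the actual schema:

```python
import csv
import io
from collections import defaultdict

# A tiny stand-in for the combined CSV; the header names are hypothetical.
tidy_csv = io.StringIO(
    "stimulus_id,subject_id,score\n"
    "s1,u1,4\n"
    "s1,u2,5\n"
    "s2,u1,2\n"
)

# Group scores by stimulus: each resulting list is one sample, the unit
# of analysis for the per-stimulus goodness-of-fit test of Fig. 2.
per_stimulus = defaultdict(list)
for row in csv.DictReader(tidy_csv):
    per_stimulus[row["stimulus_id"]].append(int(row["score"]))
```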
All six experiments from this study took place in a controlled environment. Test participants were screened for normal visual acuity and normal colour vision. Only the scores of those testers who correlated well with the average opinion were kept. Impairments included compression and lossy transmission. All this suggests that it is fair to expect consistent results.

b) ITS4S:
Utilizing the unrepeated scene design, this study has the potential to strongly emphasize personal preference differences between participants. With no strict visual acuity and colour blindness testing, the study has a higher chance of including unreliable subjects. Furthermore, it uses content that can trigger a strong emotional response. Topping this with scores originating from a mixture of experts and naïve subjects, we expect this data to be less consistent than that of HDTV Phase I. Still, the study setting was not completely uncontrolled, as participants were seated in a lab-like environment. They were briefed and trained as well. This suggests better score consistency than that of crowdsourced studies. Due to experiment design choices (other than those described above), we expect the first experiment to be more consistent than the second one. For one reason, the second experiment allowed participants to take the test simultaneously in groups of ten.

c) AGH/NTIA:
The non-standard procedures in this test are: lack of screening for normal visual acuity and colour vision; the distance to the screen was not strictly controlled; two testers received intentionally erroneous instructions and one tester was a video quality expert. However, the study generally complied with Rec. ITU-T P.910, recruited testers through a temporary job recruitment agency and investigated compression as the only distortion. We expect this study to be less consistent than HDTV Phase I, but more consistent than other, looser experiment designs.

d) MM2:
Five out of ten experiments took place in a laboratory-like environment. The other half took place in a less controlled setting. All the experiments used the same set of audio-visual sequences. Significantly, the presence of audio might have increased the inter-tester difference of opinion. The study investigated compression as the only distortion source. Furthermore, all participants went through training and briefing. For a detailed specification please refer to Table IV in [32]. In general, we expect MM2's laboratory experiments to be more stable than those with the loose setting.

TABLE II
THE FOUR LOWEST p-VALUES OF THE TEST VALIDATING THE NULL HYPOTHESIS THAT AN EXPERIMENT CONTAINS MOSTLY TYPICAL SCORE DISTRIBUTIONS

Experiment               p-Value
ITS4S2                   0.00263
ITS4S—2nd experiment     0.02320
ITS4S—1st experiment     0.02476
MM2—IRCCyN (lab env.)    0.02634

e) ITS4S2: Although compliant with Rec. ITU-T P.913, this study has the greatest potential for inconsistent results. It investigates non-standard distortions related to consumer-grade cameras. This makes the study novel, but also means that different best practices (than those recommended) may apply. Finally, it includes content that may trigger a strong emotional response.

f) ITU-T Supp. 23:
This study investigates the subjective performance of a speech codec. Since it follows a strict and well-controlled experiment design, we expect its scores to be very consistent. Importantly, we only utilize data from the experiment conducted by Nortel (although the study contains three experiments in total).
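As a starting point for any re-analysis, the combined results file mentioned above (tidy data format [43]) can be pivoted into per-stimulus score counts. The sketch below uses a toy stand-in for the real CSV; the column names (`experiment`, `stimulus_id`, `score`) and the file name in the comment are assumptions, not the repository's documented schema.

```python
import pandas as pd

# Toy stand-in for the combined tidy-format CSV (one row per individual score).
# With the real data this would be, e.g.:
#   scores = pd.read_csv("combined_results.csv")  # file name assumed
scores = pd.DataFrame({
    "experiment":  ["its4s"] * 6 + ["hdtv1"] * 6,
    "stimulus_id": ["a", "a", "a", "b", "b", "b", "x", "x", "x", "y", "y", "y"],
    "score":       [3, 3, 5, 5, 4, 5, 1, 2, 2, 4, 4, 3],
})

# Per-stimulus score counts on the 5-level ACR scale (the shape of Tables III/IV)
counts = (
    scores.groupby(["experiment", "stimulus_id"])["score"]
          .value_counts()
          .unstack(fill_value=0)
          .reindex(columns=range(1, 6), fill_value=0)
)
print(counts)
```

Each row of `counts` then holds the number of 1s through 5s a single stimulus received, which is exactly the input the per-stimulus goodness-of-fit test operates on.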
B. Case Study on Real Data
We start by classifying each of the 21 experiments as either consistent or inconsistent. For this purpose we use the hypothesis testing approach described in Section IV. Selecting α = 0.05, we identify four experiments as inconsistent. Table II lists those, starting from the one with the smallest p-value. Recalling the study descriptions from Section V-A, only the inconsistency of the one laboratory experiment from MM2 is a surprise. It would be logical to expect inconsistency in the experiments with loose settings rather than in the one done in a lab. One potential explanation of this result is that the MM2 study is generally consistent; however, when repeating the same experiment ten times, the chance of randomly observing at least one experiment being inconsistent is significant (assuming ten independent tests at α = 0.05, the chance of at least one false alarm is 1 − 0.95^10 ≈ 0.40). Furthermore, we use the MM2 data as is. This means we do not perform any post-experimental screening of subjects before we do our analyses. The experiment we label as inconsistent includes data from three subjects poorly correlated with the general opinion (with correlations of 0.49, 0.59 and 0.64). In fact, this experiment includes two of the least correlated subjects among all of the ten MM2 experiments (compare Fig. 2 in [32]). We know from experience that the p-value P–P plot is sensitive to poorly correlated subjects. Thus, we hypothesize that the three outlying subjects are the main reason our method marks the one MM2 experiment as inconsistent. Please note that this p-value describes the hypothesis testing outcome and not the goodness-of-fit test for a particular stimulus.

Though the above analysis is based on hypothesis testing alone, we recommend taking a look at the p-value P–P plot of each experiment. One argument for using the P–P plot is that it applies the same reasoning as above, but for a range of α values. In fact, by looking at the p-value P–P plot of the AGH/NTIA experiment (see Fig. 3a) we see that it consistently fails the hypothesis testing for α values below 0.08. This points to at least partial inconsistency of the experiment, which is in line with the expectations given in Section V-A.

Fig. 3. p-Value P–P plots for (a) the one (and only) experiment from the AGH/NTIA study, (b) the first experiment from the VQEG HDTV Phase I study, (c) the one (and only) experiment from the ITS4S2 study and (d) the second experiment from the ITS4S study. Each panel plots the ecdf of p-values against the theoretical uniform cdf.

We decide not to show the p-value P–P plot for each experiment. However, we do provide a complete, detailed analysis of three selected experiments. We use data from: (i) the first experiment from the VQEG HDTV Phase I study (later referred to as HDTV1), (ii) the one (and only) experiment from the ITS4S2 study (later referred to as ITS4S2) and (iii) the second experiment from the ITS4S study (later referred to as ITS4S). Figures 3b, 3c and 3d present the p-value P–P plots for each experiment, respectively.

By looking at the figures we can say that the experiments are sorted in accordance with their consistency. With all points in Fig. 3b falling below the black line, the HDTV1 experiment is certainly consistent. This means that we do not have to investigate it further. In Fig. 3c we see points oscillating around the black line. This suggests that the ITS4S2 experiment is neither totally consistent nor inconsistent. In order to decide about its consistency we have to take a close look at stimuli with low p-values. Finally, from Fig. 3d we can quickly conclude that the ITS4S experiment is inconsistent. This is because all the points fall on or above the black line. Furthermore, the points falling above the line deviate from it significantly. Here also we have to take a close look at stimuli with low p-values.
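The visual check behind these plots reduces to comparing the empirical cdf of the per-stimulus p-values with the theoretical uniform cdf over a grid of α values. The sketch below illustrates that comparison on synthetic p-values (not real experiment data); the actual threshold line in our plots comes from the bootstrapped G-test framework of Section IV, so the plain diagonal used here is a simplification.

```python
import numpy as np

def pvalue_pp_points(p_values, alphas):
    """Empirical cdf of per-stimulus p-values evaluated on a grid of
    significance levels; plotted against the diagonal (the theoretical
    uniform cdf), these are the points of a p-value P-P plot."""
    p = np.asarray(p_values)
    return np.array([(p <= a).mean() for a in alphas])

# Illustrative p-values only (NOT data from any of the 21 experiments)
rng = np.random.default_rng(0)
consistent = rng.uniform(size=200)   # uniform p-values: well-behaved experiment
inconsistent = consistent ** 2       # excess of small p-values: problematic one

alphas = np.linspace(0.0, 0.2, 21)   # the 0-to-0.2 range the method inspects
ecdf_ok = pvalue_pp_points(consistent, alphas)
ecdf_bad = pvalue_pp_points(inconsistent, alphas)

# Points above the diagonal signal too many low p-values
print("consistent, points above diagonal:  ", int((ecdf_ok > alphas).sum()))
print("inconsistent, points above diagonal:", int((ecdf_bad > alphas).sum()))
```

Plotting `ecdf_ok` and `ecdf_bad` against `alphas` reproduces the qualitative structure of Fig. 3: a consistent experiment hugs or stays below the diagonal, while an inconsistent one rises well above it.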
However, this time we expect to find more such stimuli (relative to the total number of stimuli in this experiment).

TABLE III
SCORE COUNT FOR FIVE STIMULI WITH THE LOWEST p-VALUE FROM THE ITS4S EXPERIMENT. ID STANDS FOR IDENTIFIER.

         Score Count
ID    1    2    3    4    5    p-Value
a     0    0   13    5    6    0.0014
b     2    0    0    9   13    0.0021
c     0    0   14    6    4    0.0067
d     1   13    4    6    0    0.0076
e     0    9    3    8    3    0.0113

TABLE IV
SCORE COUNT FOR FIVE STIMULI WITH THE LOWEST p-VALUE FROM THE ITS4S2 EXPERIMENT. ID STANDS FOR IDENTIFIER.

         Score Count
ID    1    2    3    4    5    p-Value
f     0   11    3    0    2    0.0002
g     2    0    3   11    0    0.0004
h     1    2    1   12    0    0.0008
i     4    6    0    6    0    0.0012
j     1    0    0    9    6    0.0014

Proceeding to the detailed analysis of low p-value stimuli, we suggest considering all stimuli with a p-value lower than that of the right-most point exceeding the theoretical threshold. Please note that we only take into account the p-value range from 0 to 0.2 (cf. Section IV for the explanation). This means that even if there are points exceeding the black line with a corresponding p-value above 0.2, we still take 0.2 as the upper limit defining which stimuli should be analysed in detail. By applying this rule we observe that we need to analyse all stimuli with a p-value below 0.2 for both ITS4S and ITS4S2. This corresponds to 328 stimuli from ITS4S2 and 54 stimuli from ITS4S. Although the number for ITS4S2 is greater, when analysing the figures in relative terms (i.e., comparing them to the total number of stimuli in each experiment), we notice that there are actually more potentially problematic stimuli in ITS4S (25.47%) than in ITS4S2 (22.95%). To keep this case study concise we only take a look at the five stimuli with the lowest p-values from both experiments. Tables III and IV present score counts for these stimuli for ITS4S and ITS4S2, respectively.

Studying Table III we can classify the score distributions of problematic stimuli into one of three classes (representing a subset of the classes described in Section III-A): (i) bimodal, (ii) random answer(s) and (iii) sudden cut-off. Stimuli a, d and e fall into the bimodal class. Their score counts suggest that there are two modes in the score distribution. One potential explanation for this is that there are two groups of participants, both expressing a significant bias. If individual biases are similar within a group and significantly disjoint between the groups, we observe a score distribution similar to the one of stimuli a, d, and e.
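A minimal numerical check on stimulus d from Table III confirms the two-mode pattern and shows where the MOS lands:

```python
import numpy as np

# Score counts for stimulus d from Table III (ACR scores 1..5)
counts = np.array([1, 13, 4, 6, 0])
scores = np.arange(1, 6)

mos = (scores * counts).sum() / counts.sum()    # Mean Opinion Score
modes = scores[counts >= np.sort(counts)[-2]]   # the two most frequent scores

print(f"MOS = {mos:.3f}, modes at scores {modes.tolist()}")
# -> MOS = 2.625, modes at scores [2, 4]
```

Here the MOS (2.625) sits between the modal scores 2 and 4, so it represents the opinion of neither hypothetical participant group.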
It is also possible that the bimodal distribution does not come solely from characteristics of the participants as a whole. It may be that these stimuli present content that highlights individual preferences of the participants. In other words, participants who disagree when scoring these particular stimuli agree when scoring other stimuli. It would be audacious to suggest these stimuli should be discarded. Instead, we advise treating this case as a valuable insight about the study participants and the stimulus itself. Our framework classifies these stimuli as a source of inconsistency because they do not fit the general assumptions about the score distribution. For example, it is difficult to believe that the MOS provides meaningful information for such cases. The mean falls between the two modes and expresses the general opinion of neither of the two hypothetical participant groups.

Stimulus b can be assigned to the random answer(s) class. Here, it seems fairly safe to claim that the two 1s are a result of some error. Discarding these two answers, we are still at risk of removing a genuine opinion. However, we claim that many practitioners would agree to remove these.

Stimulus c fits into the sudden cut-off class. We observe many 3s, but no 2s and no 1s. Though this is not outright wrong, it is hard to believe that no one would assign this stimulus the score of 2. This type of stimulus is difficult to handle. Even if we discard selected scores thanks to, for example, removing poorly correlated study participants, the sudden cut-off in the middle of the scale is likely to remain. We advise taking a close look at the stimulus' content. One hypothesis is that this stimulus is difficult to score and people choose the middle of the scale as the safest option conveying the "I do not know" message. If true, this suggests that debriefing study participants may provide crucial insights about how to analyse such stimuli.
Another option (although usually not feasible) is to try to gather more scores for the stimulus. It is worth remembering that 24 observations may be too few to rightfully represent the shape of the underlying score distribution. Finally, we point out that stimulus a also seems to be of the sudden cut-off class.

Applying a similar analysis to the data in Table IV, we note that stimuli h and i belong to the bimodal class, and stimuli f, g and j to the random answer(s) class. However, upon closer inspection more peculiarities become visible. Please note that all the stimuli represent data following the description of the sudden cut-off class. For example, even if we discard the two outlying 1s from the scores of stimulus g, the sudden cut-off at score 4 remains.

A careful reader will notice that the HDTV1 experiment contains stimuli with a p-value below the 0.2 threshold (although we stated the experiment is consistent). Those stimuli could be analysed, but we advise not doing so. Statistically speaking, it is not unusual to observe a few low p-values. According to our framework, the low p-values are rare enough to classify the experiment as consistent.

This case study shows that our framework detects stimuli with atypical score distributions. Additional information (apart from the scores) and a closer inspection may still be necessary to decide what to do with the identified stimuli. Still, the method provides a handy tool reliably detecting potential defects present in the data.

VI. CONCLUSIONS
We introduce a new tool classifying the results of a subjective experiment as either consistent or inconsistent. The tool also highlights stimuli with irregular score distributions. We show that the method works by using data from 21 subjective experiments. Apart from the theoretical description, we provide a software implementation. To download it please go to: https://github.com/Qub3k/subjective-exp-consistency-check. There, we share the data set and Python scripts used for the data analysis and a cookbook-style tutorial on how to apply our method to arbitrary subjective data. The data set and scripts are also provided in the auxiliary material.

The procedure behind our method can be summarized in two steps. First, generate a p-value P–P plot for a subjective experiment and check whether data points fall above the line defining the theoretical threshold. Second, if they do, the experiment is potentially inconsistent and the score distributions of low p-value stimuli should be analysed; if they do not, the experiment is consistent.

Though practical, our method has its limitations. It cannot be directly used to show which study participants are potential outliers. Likewise, it does not show which observed inconsistencies are obtained by chance and which are the result of some true underlying phenomenon (e.g., a bias influencing the scoring process). Finally, if the GSD does not cover all real score distributions (i.e., the ones being the result of a valid subjective evaluation) we will observe false negatives (negative meaning classifying an experiment as inconsistent). For the explanation of why this last risk seems to be small we refer the reader to [19].

We do not directly compare our method with other screening techniques since none of them targets the consistency of a whole subjective experiment. They rather focus on discarding individual study participants or quantifying the mean-variance relationship.
(Significantly, the latter is also addressed when using the GSD to model the score distribution.)

There are two topics that we would like to explore in our further work. First, we would like to test whether a different subject model could be used instead of the GSD model. In particular, we would like to test the quantized normal model [18]. Second, we would like to run thorough simulation studies. This would help explore the limitations of our method and show how typical problems with subjective data influence p-value P–P plots.

We hope our work will serve as a useful tool for the research community and invite everyone to test it on their subjective data.

ACKNOWLEDGMENT
The authors would like to thank Netflix, Inc. for sponsoring this research and especially Zhi Li for his support and challenging questions. This work was supported by the Polish Ministry of Science and Higher Education with the subvention funds of the Faculty of Computer Science, Electronics and Telecommunications of AGH University and by the PL-Grid Infrastructure.

(Note on the auxiliary material: this is not valid for the arXiv version of the paper. To download the auxiliary material, please take a look at the ACM Digital Library entry for this paper.)
REFERENCES

[1] A. Agresti. 2002. Categorical Data Analysis (2nd ed.). Wiley.
[2] Divyashri Bhat, Rajvardhan Deshmukh, and Michael Zink. 2018. Improving QoE of ABR Streaming Sessions through QUIC Retransmissions. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM '18). Association for Computing Machinery, New York, NY, USA, 1616–1624. https://doi.org/10.1145/3240508.3240664
[3] Kjell Brunnström and Marcus Barkowsky. 2018. Statistical quality of experience analysis: on planning the sample size and statistical significance testing. Journal of Electronic Imaging 27, 5 (2018), 1–11. https://doi.org/10.1117/1.JEI.27.5.053013
[4] Andrea De Montis and Simone Caschili. 2012. Nuraghes and landscape planning: Coupling viewshed with complex network analysis. Landscape and Urban Planning.
[5] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall.
[6] Markus Fiedler, Tobias Hossfeld, and Phuoc Tran-Gia. 2010. A generic quantitative relationship between quality of experience and quality of service. IEEE Network.
[7] Psychological Review 124 (Jan 2017), 91–114. https://doi.org/10.1037/rev0000045
[8] Pedro Garcia Freitas, Alexandre Fieno Silva, Judith A. Redi, and Mylène C. Q. Farias. 2018. Performance analysis of a video quality ruler methodology for subjective quality assessment. Journal of Electronic Imaging 27, 5 (2018), 1–10. https://doi.org/10.1117/1.JEI.27.5.053020
[9] Tobias Hoßfeld, Poul E. Heegaard, Lea Skorin-Kapov, and Martin Varela. 2017. No silver bullet: QoE metrics, QoE fairness, and user diversity in the context of QoE management. 1–6. https://doi.org/10.1109/QoMEX.2017.7965671
[10] Tobias Hoßfeld, Poul E. Heegaard, Martin Varela, and Sebastian Möller. 2016. Formal Definition of QoE Metrics. (2016), 1–23. https://doi.org/10.1007/s41233-016-0002-1 arXiv:1607.00321
[11] Tobias Hossfeld, Poul E. Heegaard, Martin Varela, and Lea Skorin-Kapov. 2018. Confidence Interval Estimators for MOS Values. arXiv:1806.01126 http://arxiv.org/abs/1806.01126
[12] Tobias Hoßfeld, Poul E. Heegaard, Martin Varela, Lea Skorin-Kapov, and Markus Fiedler. 2020. From QoS Distributions to QoE Distributions: a System's Perspective. 1–7. arXiv:2003.12742 http://arxiv.org/abs/2003.12742
[13] Tobias Hoßfeld, Raimund Schatz, and Sebastian Egger. 2011. SOS: The MOS is not enough! (2011), 131–136. https://doi.org/10.1109/QoMEX.2011.6065690
[14] ITU-T Study Group 12. 1998. ITU-T Coded-Speech Database. http://handle.itu.int/11.1002/1000/4415
[15] Lucjan Janowski, Jakub Nawała, Werner Robitza, Zhi Li, Lukáš Krasula, and Krzysztof Rusek. 2019. Notation for Subject Answer Analysis. arXiv:1903.05940 [cs.MM]
[16] Lucjan Janowski and Zdzislaw Papir. 2009. Modeling subjective tests of quality of experience with a generalized linear model. (2009), 35–40. https://doi.org/10.1109/QOMEX.2009.5246979
[17] Lucjan Janowski and Margaret Pinson. 2014. Subject bias: Introducing a theoretical user model. 251–256. https://doi.org/10.1109/QoMEX.2014.6982327
[18] Lucjan Janowski and Margaret Pinson. 2015. The Accuracy of Subjects in a Quality Experiment: A Theoretical Subject Model. IEEE Transactions on Multimedia 17, 12 (Dec 2015), 2210–2224. https://doi.org/10.1109/TMM.2015.2484963
[19] Lucjan Janowski, Bogdan Ćmiel, Krzysztof Rusek, Jakub Nawała, and Zhi Li. 2019. Generalized Score Distribution. arXiv:1909.04369 [stat.ME]
[20] A. Kumcu, K. Bombeke, L. Platiša, L. Jovanov, J. Van Looy, and W. Philips. 2017. Performance of Four Subjective Video Quality Assessment Protocols and Impact of Different Rating Preprocessing and Analysis Methods. IEEE Journal of Selected Topics in Signal Processing 11, 1 (Feb 2017), 48–63. https://doi.org/10.1109/JSTSP.2016.2638681
[21] Jing Li and Patrick Le Callet. 2018. Improving the discriminability of standard subjective quality assessment methods: a case study. 1–3. https://doi.org/10.1109/QoMEX.2018.8463400
[22] Zhi Li and Christos G. Bampis. 2017. Recover Subjective Quality Scores from Noisy Measurements. Data Compression Conference Proceedings, Part F127767 (2017), 52–61. https://doi.org/10.1109/DCC.2017.26 arXiv:1611.01715
[23] R. E. Lupu, K. S. Scott, J. E. Aguirre, I. Aretxaga, R. Auld, E. Barton, A. Beelen, F. Bertoldi, J. J. Bock, D. Bonfield, C. M. Bradford, S. Buttiglione, A. Cava, D. L. Clements, J. Cooke, A. Cooray, H. Dannerbauer, A. Dariush, G. De Zotti, L. Dunne, S. Dye, S. Eales, D. Frayer, J. Fritz, J. Glenn, D. H. Hughes, E. Ibar, R. J. Ivison, M. J. Jarvis, J. Kamenetzky, S. Kim, G. Lagache, L. Leeuw, S. Maddox, P. R. Maloney, H. Matsuhara, E. J. Murphy, B. J. Naylor, M. Negrello, H. Nguyen, A. Omont, E. Pascale, M. Pohlen, E. Rigby, G. Rodighiero, S. Serjeant, D. Smith, P. Temi, M. Thompson, I. Valtchanov, A. Verma, J. D. Vieira, and J. Zmuidzinas. 2012. Measurements of CO redshifts with Z-spec for lensed submillimeter galaxies discovered in the H-ATLAS survey. Astrophysical Journal.
[24] Consciousness and Cognition 21 (2012), 422–430.
[25] George A. Miller. 1956. The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. The Psychological Review.
[26] subjective-exp-consistency-check GitHub Repository. https://github.com/Qub3k/subjective-exp-consistency-check
[27] Tormod Næs, Per B. Brockhoff, and Oliver Tomic. 2010. Statistics for Sensory and Consumer Science. John Wiley & Sons, Ltd.
[28] Margaret Pinson, Filippo Speranza, Akira Takahashi, Christian Schmidmer, Chulhee Lee, Jun Okamoto, Kjell Brunnström, Lucjan Janowski, Marcus Barkowsky, Nicolas Staelens, Quan Huynh-Thu, Rima Green, Roland Bitto, Ron Renaud, Silvio Borer, Taichi Kawano, Vittorio Baroncini, and Yves Dhondt. 2010. Report on the Validation of Video Quality Models for High Definition Video Content (HDTV Phase I).
[29] Margaret H. Pinson. 2018. ITS4S: A Video Quality Dataset with Four-Second Unrepeated Scenes. Technical Report, NTIA Technical Memorandum 18-532. U.S. Department of Commerce, National Telecommunications and Information Administration, Institute for Telecommunication Sciences.
[30] Margaret H. Pinson. 2019. ITS4S2: An Image Quality Dataset With Unrepeated Images From Consumer Cameras. Technical Report, NTIA Technical Memorandum 19-537. U.S. Department of Commerce, National Telecommunications and Information Administration, Institute for Telecommunication Sciences.
[31] Margaret H. Pinson and Lucjan Janowski. 2014. AGH/NTIA: A Video Quality Subjective Test with Repeated Sequences. Technical Report, NTIA Technical Memorandum 14-505. U.S. Department of Commerce, National Telecommunications and Information Administration, Institute for Telecommunication Sciences.
[32] Margaret H. Pinson, Lucjan Janowski, Romuald Pepion, Quan Huynh-Thu, Christian Schmidmer, Phillip Corriveau, Audrey Younkin, Patrick Le Callet, Marcus Barkowsky, and William Ingram. 2012. The Influence of Subjects and Environment on Audiovisual Subjective Tests: An International Study. IEEE Journal of Selected Topics in Signal Processing 6, 6 (Oct 2012), 640–651. https://doi.org/10.1109/JSTSP.2012.2215306
[33] Yanyuan Qin, Shuai Hao, Krishna R. Pattipati, Feng Qian, Subhabrata Sen, Bing Wang, and Chaoqun Yue. 2019. Quality-Aware Strategies for Optimizing ABR Video Streaming QoE and Reducing Data Usage. In Proceedings of the 10th ACM Multimedia Systems Conference (Amherst, Massachusetts) (MMSys '19). Association for Computing Machinery, New York, NY, USA, 189–200. https://doi.org/10.1145/3304109.3306231
[34] David Robinson. 2014. How to interpret a p-value histogram. http://varianceexplained.org/statistics/interpreting-pvalue-histogram/
[35] Rosaria Romano, Per Bruun Brockhoff, Margrethe Hersleth, Oliver Tomic, and Tormod Næs. 2008. Correcting for different use of the scale and the need for further analysis of individual differences in sensory analysis. Food Quality and Preference 19, 2 (2008), 197–209. 8th Sensometrics Meeting.
[36] Jonathan Rosenblatt. 2013. A Practitioner's Guide to Multiple Testing Error Rates. arXiv:1304.4920 [stat.ME]
[37] T. Schweder and E. Spjøtvoll. 1982. Plots of P-Values to Evaluate Many Tests Simultaneously. Biometrika.
[38] 1–6. https://doi.org/10.1109/QoMEX.2019.8743296
[39] K. D. Singh, Y. Hadjadj-Aoul, and G. Rubino. 2012. Quality of experience estimation for adaptive HTTP/TCP video streaming using H.264/AVC. 127–131.
[40] S. Tasaka. 2017. Bayesian Hierarchical Regression Models for QoE Estimation and Prediction in Audiovisual Communications. IEEE Transactions on Multimedia.
[41] LWT - Food Science and Technology 40, 2 (2007), 262–269.
[42] Haiqiang Wang, Ioannis Katsavounidis, Xinfeng Zhang, Chao Yang, and C.-C. Jay Kuo. 2018. A user model for JND-based video quality assessment: theory and applications. In Applications of Digital Image Processing XLI, Andrew G. Tescher (Ed.), Vol. 10752. International Society for Optics and Photonics, SPIE, 219–226. https://doi.org/10.1117/12.2320813
[43] Hadley Wickham. 2014. Tidy Data. Journal of Statistical Software, Articles 59, 10 (2014), 1–23. https://doi.org/10.18637/jss.v059.i10
[44] M. B. Wilk and R. Gnanadesikan. 1968. Probability Plotting Methods for the Analysis of Data. Biometrika.
[45] In Proceedings of the 23rd ACM International Conference on Multimedia (Brisbane, Australia) (MM '15). Association for Computing Machinery, New York, NY, USA, 431–440. https://doi.org/10.1145/2733373.2806269
[46] Sławomir Zieliński, Francis Rumsey, and Søren Bech. 2008. On some biases encountered in modern listening tests.