Ambiguity of Objective Image Quality Metrics: A New Methodology for Performance Evaluation
Manri Cheon a, Toinon Vigier b, Lukáš Krasula b, Junghyuk Lee a, Patrick Le Callet b, Jong-Seok Lee a,∗

a School of Integrated Technology, Yonsei University, 21983 Incheon, Korea
b LS2N UMR CNRS 6004, Université de Nantes, 44306 Nantes, France
Abstract
Objective image quality metrics try to estimate the perceptual quality of a given image by considering the characteristics of the human visual system. However, it is possible that a metric produces different quality scores even for two images that are perceptually indistinguishable to human viewers, an issue that has not been considered in existing studies on objective quality assessment. In this paper, we address this ambiguity of objective image quality assessment. We propose an approach to obtain an ambiguity interval of an objective metric, within which a quality score difference is not perceptually significant. In particular, we use the visual difference predictor, which can take into account viewing conditions that are important for visual quality perception. To demonstrate the usefulness of the proposed approach, we conduct experiments with 33 state-of-the-art image quality metrics from the viewpoints of their accuracy and ambiguity for three image quality databases. The results show that the ambiguity intervals can be applied as an additional figure of merit when conventional performance measurement does not determine superiority between metrics. The effect of the viewing distance on the ambiguity interval is also shown.
Keywords:
Quality of experience, objective quality assessment, ambiguity interval, viewing distance
A preliminary study [1] was presented at the International Conference on Quality of Multimedia Experience, Lisbon, Portugal, 2016.
∗ Corresponding author
Email address: [email protected] (Jong-Seok Lee)
Preprint submitted to Elsevier, January 20, 2021

1. Introduction

Multimedia systems operating in resource-constrained environments usually strive to achieve two conflicting objectives: achieving efficiency and providing high quality content. For instance, compression, e.g., JPEG [2] and JPEG2000 [3] for images and H.264/AVC [4] and HEVC [5] for videos, is a representative way to deal with this issue; it can reduce the amount of data to enhance storage and transmission efficiency at the cost of degradation of perceptual quality. Quality degradation introduced through enhanced efficiency tends to lower the quality of experience (QoE) of the consumers. Therefore, it is important to carefully consider the trade-off between the two objectives in designing multimedia systems and services.

The first step toward this goal is to accurately measure the perceptual quality of the content as perceived by human viewers, which can be performed via subjective or objective quality assessment [6, 7, 8]. The former is the most accurate way of assessing the QoE, where human subjects are asked to rate the given content in terms of perceptual quality. However, it is time-consuming and expensive, and cannot be used in real-time applications for controlling or optimizing the quality of the delivered content. Thus, objective quality assessment performed by objective metrics, which tries to automatically predict perceived quality, is widely used to replace subjective quality assessment. A number of objective quality metrics have been developed and used for various applications including compression, transmission, enhancement, etc. [9].

It has been considered that the primary goal of an objective metric is to predict subjective quality scores, usually denoted as mean opinion scores (MOS), as accurately as possible. The ITU-T P.1401 standard [10] specifies recommended procedures to evaluate the accuracy of an objective quality metric.
For instance, the Pearson linear correlation coefficient (PLCC) and Spearman rank-order correlation coefficient (SROCC) are computed to evaluate the linearity and monotonicity of metrics with respect to subjective data, respectively. In addition, the prediction error and consistency are also measured using the root-mean-square error (RMSE) and outlier ratio (OR), respectively. Additional statistical measures of performance have also been proposed in [11].

In this paper, however, we argue that accuracy is not the only perspective from which objective quality metrics should be judged, and propose that considering an additional figure of merit provides much more informative insight into the performance and behavior of the metrics: their ambiguity or, conversely, reliability. In general, the output of an objective metric for a given visual stimulus is expressed as a single value on a continuous scale. This means that when the predicted quality scores for a pair of stimuli are obtained from a metric, a quality superiority between the stimuli is always established, no matter how small the difference is. However, a nonzero quality score difference between two similar stimuli may lead to misleading conclusions when the quality difference is not perceivable by human viewers. In fact, the visual sensitivity of humans is limited in the sense that a small pixel value difference is sometimes visually indistinguishable depending on several factors such as overall luminance and neighboring pixel values [12].

Figure 1 shows example images demonstrating the existence of ambiguity of objective metrics [13]. For two reference images (parrots and house) from the LIVE Image Quality Assessment Database [6], JPEG2000 compression is applied to corrupt them at different bitrates. When Figures 1(a) and 1(b) are visually compared, their quality difference can be easily perceived.
We conducted a subjective quality assessment experiment using the paired comparison scheme [14, 15], where most of the hired subjects (14 out of 15) chose Figure 1(b) as the one having better quality. An objective metric, peak signal-to-noise ratio (PSNR), also rates Figure 1(b) as having better quality (with a difference of 2.49 dB), which is consistent with the quality superiority perceived by humans. On the other hand, the difference between Figures 1(c) and 1(d) is hardly noticeable; nearly half of the subjects (6 out of 15) chose Figure 1(c). However, the quality measured by PSNR still determines that Figure 1(d) is better, showing a difference of 2.54 dB, which is even larger than the difference between Figures 1(a) and 1(b). Such inconsistent results between subjective and objective quality measurements are undesirable for quality-optimized multimedia systems. For instance, a system relying on PSNR may try to deliver Figure 1(d) instead of Figure 1(c) to improve QoE at the cost of an increased bitrate (20 to 35 kbytes), which is actually not so worthwhile for users.

Figure 1: Example images from the LIVE Image Quality Assessment Database [6], demonstrating the ambiguity of objective quality metrics (in this case, PSNR). (a) 30.46 dB, (b) 32.95 dB, (c) 30.39 dB, (d) 32.93 dB.

An additional observation in this example is the content-dependence of the ambiguity of objective metrics. In other words, the perceptual insignificance of the PSNR difference is observed only for house.

Even the state-of-the-art objective quality metrics showing good performance in predicting perceived quality (e.g., [16, 17, 18]) have the issue of indistinguishable quality ranges, because all existing metrics produce single numerical values representing the perceptual quality of given stimuli, which is highly related to the reliability of the metrics. In this paper, therefore, we address the issue of ambiguity of objective quality assessment and propose an approach to measure the ambiguity as an interval defining the indistinguishable quality score range, which can be applied to any quality metric to supplement its usefulness in a new direction. Furthermore, we present use cases where the proposed approach can be useful, i.e., one for performance comparison of quality metrics and the other for analysis of metric performance in terms of reliability with respect to the viewing distance.

The main contributions of this paper can be summarized as follows:
1. We propose an approach to measure the ambiguity of objective quality metrics. The ambiguity is expressed as an interval on the scale of a metric's score, called the ambiguity interval, within which the quality difference is perceptually indistinguishable. In obtaining ambiguity intervals, we incorporate the viewing conditions, in particular the viewing distance, because it is one of the most important factors that significantly influence the visual sensitivity of human viewers. Our approach employs the visual difference predictor (VDP) [19], which automatically estimates a threshold for perceptually indistinguishable pixel value difference at each pixel location. Using VDP also eliminates the need to conduct subjective experiments to obtain the ambiguity intervals, which maximizes the applicability of the proposed approach.

2. We provide a practical use case, i.e., objective metric benchmarking, to demonstrate the effectiveness of the proposed approach. We use the ambiguity characteristics of metrics for performance comparison in addition to the accuracy measure. It is shown that the ambiguity can play an important role in determining the superiority among metrics. In the research community of multimedia quality assessment, systematic evaluation of objective metrics has been considered important to analyze their advantages and disadvantages [6, 7, 8, 20, 21, 22, 23]. The Video Quality Experts Group (VQEG), an international forum for perceptual quality assessment towards standardization, also puts a significant amount of effort into this. Thus, this use case proposes a novel framework for benchmarking objective quality metrics, which enables performance analysis of the metrics from multidimensional perspectives.

3. As another practical use case, we evaluate state-of-the-art metrics in terms of viewing distance. We show that the behavior of a metric depending on the viewing distance also provides valuable information in analyzing the metric's performance. Such information can be exploited as a part of benchmarking of objective metrics. In addition, it can be used to identify proper viewing conditions under which the metrics are reliable.

The rest of this paper is organized as follows. The following section presents the proposed approach in detail. Section 3 describes the experimental setup. The two use cases, where the ambiguity intervals are exploited, are given in Sections 4 and 5, respectively. Finally, conclusions are given in Section 6.
2. Proposed Method
As mentioned in the introduction, the goal of the proposed approach is to obtain an interval for a given objective quality metric, such that a score difference within the interval at that particular quality level is considered perceptually insignificant. The core idea is to change the amount of distortion (e.g., noise, compression artifacts, etc.) in an image and to check, using a perceptual model, whether the change of the distortion would be detected by human observers.

Algorithm 1 summarizes the procedure of the proposed approach to obtain the ambiguity interval (i.e., the upper and lower bounds of the interval) over the whole quality range for a source image and a type of distortion. Figure 2 illustrates the process to obtain the ambiguity interval for a particular quality level corresponding to a degraded image, which corresponds to lines 6 to 19 in Algorithm 1.

First, a quality degradation of the given distortion type is applied to the source image (I) with various amounts of distortion, and the objective quality levels of the resulting images are measured. Then, we determine the perceptual distinguishability between two images having different amounts of artifacts. For a given image I_i containing a certain type of artifacts, we obtain the level of ambiguity at the corresponding objective quality score (Q_i) as an interval around the score. We assess the perceivable difference of the given image (I_i) compared to an image generated from the same source image but with a different amount of artifacts (I_j). We gradually increase (or decrease) the amount of artifacts in I_j until an image that is perceptually distinguishable from the given image is found. Among the images that are perceptually indistinguishable from I_i, the one with the highest (or lowest) quality level is identified, and the difference between the corresponding quality score and the quality score of I_i is recorded as the width of the upper (or lower) bound of the interval, U_i (or L_i).

Algorithm 1 Computing the ambiguity interval
Input: Source image I having M pixels
Output: Upper bound widths U ∈ R^N and lower bound widths L ∈ R^N of the ambiguity interval
for i ← 1, N do                          ▹ N: number of considered quality levels
    I_i ← degrade_image(I, i)            ▹ Apply quality degradation (compression, blurring, etc.) to I (I_i is more degraded than I_{i−1})
    Q_i ← measure_quality(I_i)           ▹ Measure the objective quality (assume that a higher Q_i indicates higher quality)
end for
for i ← 1, N do
    for j ← i + 1, N do
        PMap ← vdp(I_i, I_j)             ▹ Obtain the perceivableness map
        if count(PMap > 0.5) / M > k then
            L_i ← Q_i − Q_{j−1}          ▹ Obtain the width of the lower bound
            break
        end if
    end for
    for j ← i − 1, 1 do
        PMap ← vdp(I_i, I_j)             ▹ Obtain the perceivableness map
        if count(PMap > 0.5) / M > k then
            U_i ← Q_{j+1} − Q_i          ▹ Obtain the width of the upper bound
            break
        end if
    end for
end for
return U and L

Figure 2: Procedure to obtain an ambiguity interval based on a perceivableness map, which judges whether two images are perceptually distinguishable or not. Note that the white pixels of the perceivableness map are the distinguishable pixels determined by VDP.

A visual just-noticeable difference (JND) model is used to determine whether two images having different amounts of distortion are perceptually distinguishable. The JND model compares the two images and produces a map of the same size as the input images, called the perceivableness map. Each pixel of the map represents the probability that the pixel value difference of the two images at the corresponding location is perceptually distinguishable. A probability of 0.5 (i.e., random chance) is considered as the threshold of distinguishability.
Therefore, if at most a certain proportion (denoted as k) of the pixels of the perceivableness map have values above 0.5, the two images are considered to be perceptually indistinguishable.

The JND model considered in this study is VDP, originally proposed by Daly [19]. It makes it possible to specify the viewing conditions, including the type, resolution, and parameters of the display, together with the viewing distance [24]. In particular, we use the latest version, known as HDR-VDP 2.2 [16]. The model quantifies the visible difference between two input images under specific viewing conditions. The images are first passed through a model of the optical retinal pathway, including a simulation of intra-ocular light scatter, photoreceptor spectral sensitivity, luminance masking, and achromatic response. Further on, they are compared on multiple scales considering the model of neural noise, neural contrast sensitivity, and contrast masking. Note that when producing a perceivableness map, VDP takes into account the contextual information for each pixel (i.e., its relationship with neighboring pixels).

Figure 3 shows examples of the ambiguity intervals, which are obtained for the visual information fidelity (VIF) metric [17]. To determine the intervals, we generate N = 100 images having different amounts of distortion (spanning the whole quality range) for each distortion type and each reference image in the LIVE Image Quality Assessment Database [6], and apply Algorithm 1 to them. In the figure, a higher score means a higher quality level, i.e., fewer artifacts. Three types of dependency of the interval are observed. First, the width of the interval is not necessarily uniform over the quality range. In Figure 3(c), for instance, the width of the interval is large for the intermediate quality range and small for low quality (near zero). This implies that the perceptual scale of the metric is not perfectly linear.
Second, the interval width is dependent on the content, which is in line with the observation made from Figure 1. This is related to the fact that the visibility of quality degradation depends on the image content due to perceptual mechanisms such as frequency-dependent contrast sensitivity, spatial masking, etc. Third, the type of distortion also influences the interval, because the detectability of quality difference depends on the type of artifacts. Detailed analysis is given in Section 4. In summary, the interval is dependent on the visual components included in the image, which are affected differently by the quality level, the distortion type, and the content itself.

Figure 3: Examples of obtained ambiguity intervals of VIF for the LIVE database. The upper and lower bounds for two different reference images are expressed in different colors. (a) JPEG, (b) JPEG2000 (JPEG2K), (c) Gaussian blur (GB), (d) white Gaussian noise (WN).

Figure 4: Examples of obtained ambiguity intervals for GB of the LIVE database. Different colors mean different reference images. (a) PSNR, (b) ADM.
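As a concrete illustration, Algorithm 1 can be sketched in Python. The `toy_vdp` and `psnr` functions below are simplified stand-ins (hypothetical, for illustration only): the real approach uses HDR-VDP 2.2, which models viewing distance, contrast sensitivity, and masking, and any objective metric can replace PSNR.

```python
import numpy as np

def psnr(ref, img):
    """Toy full-reference metric (PSNR); any metric could be plugged in."""
    mse = np.mean((ref.astype(float) - img.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")

def toy_vdp(img_a, img_b, sensitivity=0.02):
    """Stand-in for HDR-VDP: maps per-pixel |difference| to a detection
    probability in [0, 1). The real model also accounts for viewing
    distance, contrast sensitivity, and masking."""
    diff = np.abs(img_a.astype(float) - img_b.astype(float)) / 255.0
    return 1.0 - np.exp(-diff / sensitivity)

def ambiguity_intervals(source, degrade, measure=psnr, vdp=toy_vdp,
                        n_levels=20, k=0.01):
    """Algorithm 1: per-level widths of the upper (U) and lower (L)
    ambiguity bounds for one source image and one distortion type."""
    imgs = [degrade(source, i) for i in range(1, n_levels + 1)]
    q = [measure(source, im) for im in imgs]
    m = source.size
    U, L = np.zeros(n_levels), np.zeros(n_levels)
    for i in range(n_levels):
        # toward more degraded images: first distinguishable one bounds L_i
        for j in range(i + 1, n_levels):
            if np.count_nonzero(vdp(imgs[i], imgs[j]) > 0.5) / m > k:
                L[i] = q[i] - q[j - 1]  # last indistinguishable image
                break
        # toward less degraded images: first distinguishable one bounds U_i
        for j in range(i - 1, -1, -1):
            if np.count_nonzero(vdp(imgs[i], imgs[j]) > 0.5) / m > k:
                U[i] = q[j + 1] - q[i]
                break
    return U, L
```

With a real VDP and degradation pipeline, `degrade` would produce, e.g., JPEG images at increasing compression levels, and `measure` would be the metric under evaluation.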
The ambiguity intervals of an objective metric can be used to measure the performance of the metric in terms of quality resolution. Figure 4 shows examples for two different metrics, i.e., PSNR and the additive impairment and detail loss measure (ADM) [25], which have different output ranges and ambiguity interval widths. Overall, for instance, the intervals of ADM are larger than those of PSNR; the intervals of PSNR are relatively small for the low quality range and get larger as the quality increases, whereas the intervals of ADM are more uniform over the whole range. To enable easy comparison between the intervals of different metrics, we compute measures that summarize the ambiguity intervals of a metric. As the first step, the ambiguity intervals of a metric are normalized with the observed output range of the metric, since different metrics may have different ranges and units. Note that in our preliminary work [1], nonlinear regression using subjective rating data was employed for normalization, which limits the applicability of the method to cases where subjective data are available. In addition, only the quality levels associated with subjective ratings were used, which permitted ambiguity evaluation only at a coarse level.

(An implementation of HDR-VDP is publicly available at http://hdrvdp.sourceforge.net/wiki/)

Table 1: Characteristics of the three databases used for the experiments. A viewing distance is expressed as a multiple of the height of the display.

Database   Distortion types        Display               Viewing distance   Score
LIVE       JPEG, JPEG2K, GB, WN    sRGB, CRT, 21-inch    2H                 DMOS
VDID       JPEG, JPEG2K, GB, WN    sRGB, LCD, 23-inch    4H                 DMOS
CIDIQ      JPEG, JPEG2K, GB, PN    sRGB, LCD, 24-inch    1.5H               MOS

We propose to compute three statistics of the ambiguity intervals, namely the mean, maximum, and standard deviation of the widths of the ambiguity intervals over the whole quality range, in order to measure the performance of a metric in multiple aspects of ambiguity. They are measures of the sensitivity of a metric in an average sense, the coarsest quality resolution, and the uniformity of the quality resolution, respectively. The smaller each of these measures is, the better the performance of the metric is.
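The three summary statistics can be sketched as follows; `ambiguity_summary` is a hypothetical helper that normalizes the per-level bound widths by the metric's observed output range so that metrics with different scales and units can be compared.

```python
import numpy as np

def ambiguity_summary(upper, lower, score_range):
    """Summarize a metric's ambiguity intervals (hypothetical helper).

    upper, lower : per-level widths of the upper/lower ambiguity bounds
    score_range  : observed output range of the metric, used to normalize
    """
    widths = (np.asarray(upper, float) + np.asarray(lower, float)) / score_range
    return {
        "mean": float(widths.mean()),  # average sensitivity
        "max": float(widths.max()),    # coarsest quality resolution
        "std": float(widths.std()),    # (non-)uniformity of the resolution
    }
```

For all three statistics, smaller is better.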
3. Experimental setup
We conduct experiments in order to demonstrate applications where the proposed approach can be exploited effectively, which are shown in the following sections. This section explains the employed databases and the objective metrics considered in the experiments.
We employ three databases that are popularly used in research on perceptual quality assessment: the LIVE Image Quality Assessment Database (LIVE) [6], which is one of the most popular databases for benchmarking objective metrics; the Viewing Distance-changed Image Database (VDID) [26], which is the first image quality assessment database specifically established for varying viewing distances; and the Colourlab Image Database: Image Quality (CIDIQ) [27], which also contains subjective data for multiple viewing distances. The databases were produced based on different experimental setups in terms of reference images, distortion types, screens, viewing distances, etc. We select them to ensure reproducibility of distortion types and availability of information regarding viewing environments, e.g., information on the screen and viewing distance. Table 1 summarizes the characteristics of the databases. Four common distortion types are selected, i.e., JPEG compression, JPEG2000 (JPEG2K) compression, Gaussian blur (GB), and white Gaussian noise (WN). For the CIDIQ database, Poisson noise (PN) is considered instead of WN. JPEG and JPEG2K are well-known compression schemes for images, and GB and WN (or PN) are distortions that can easily occur in pre- or post-processing of images. VDID and CIDIQ have subjective results for two different viewing distances.
We consider 33 state-of-the-art objective quality metrics (28 full-reference (FR) metrics, one reduced-reference (RR) metric, and four no-reference (NR) metrics) for benchmarking. The tested FR metrics are PSNR, the structural similarity index (SSIM) [18], multi-scale structural similarity (MS-SSIM) [28], visual signal-to-noise ratio (VSNR) [29], VIF [17], universal image quality index (UQI) [30], information fidelity criterion (IFC) [31], noise quality measure (NQM) [32], weighted signal-to-noise ratio (WSNR) [32], modified versions of PSNR (PSNR-HVS [33], PSNR-HVS-M [34], PSNR-HMA, PSNR-HA, PSNR-HMA-C, and PSNR-HA-C [35]), optimal scale selection (OSS)-PSNR and OSS-SSIM [26], information content weighted SSIM (IW-SSIM) [36], feature similarity index (FSIM) and the chrominance extension of FSIM (FSIM-C) [37], gradient magnitude similarity deviation (GMSD) [38], most apparent distortion (MAD) [39], ADM [25], analysis of distortion distribution-based SSIM (ADD-SSIM) [40], ADD-gradient similarity index (ADD-GSIM) [40], visual saliency-induced index (VSI) [41], image quality assessment based on gradient similarity (GSM) [42], and perceptual similarity (PSIM) [43]. The RR metric is the reduced-reference entropic differencing index (RRED) [44], and the NR metrics are spatial-spectral entropy-based quality (SSEQ) [45], oriented gradients image quality assessment (OG-IQA) [46], blind image integrity notator using DCT statistics (BLIINDS2) [47], and accelerated screen image quality evaluator (ASIQE) [48].

4. Use case 1: Benchmarking of objective metrics
Objective quality metrics that can automatically predict perceived quality of visual content are a key component of quality-optimized multimedia systems. For instance, a method enhancing a given degraded image requires an objective metric as a criterion with respect to which the image is enhanced. Therefore, it is critical to identify a quality metric that mimics the human visual system as closely as possible, so that the results of optimization based on the metric are also optimal for human viewers. In this context, benchmarking studies of objective quality metrics have been conducted extensively in the literature, e.g., [8, 22, 49, 50]. In these studies, as mentioned in the introduction, the prediction accuracy of existing metrics is considered the most important performance index, which is typically measured in terms of PLCC, SROCC, OR, and RMSE. However, different metrics have different levels of ambiguity, which can be captured by the proposed approach. The use case presented in this section demonstrates how such information can be effectively used in benchmarking.

In this use case, we use the LIVE database. The accuracy performance of the 33 state-of-the-art objective metrics is measured by PLCC between the ground truth subjective quality scores and the predicted quality scores. In particular, PLCC is computed after nonlinear regression using the monotonic logistic function

    Q' = β₂ + (β₁ − β₂) / (1 + e^{−(Q − β₃)/β₄})    (1)

to fit the objective scores output by a metric to the subjective quality scores, as described in the recommendation [51]. Here, Q and Q' denote the objective scores before and after regression, respectively. The initial values of the parameters (β₁ to β₄) are set as suggested in [51]. In addition, statistical tests are also conducted [10], i.e., Z-tests are performed using the Fisher z-transformation for PLCC. The ambiguity performance of the metrics is evaluated based on the proposed approach.
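The regression of Eq. (1) followed by PLCC computation can be sketched as below. This is a minimal sketch using SciPy; the initial values in `p0` are a common heuristic, not necessarily those recommended in [51].

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

def logistic4(q, b1, b2, b3, b4):
    """Four-parameter monotonic logistic of Eq. (1)."""
    return b2 + (b1 - b2) / (1.0 + np.exp(-(q - b3) / b4))

def fitted_plcc(objective, subjective):
    """Fit Eq. (1) to map objective scores onto the subjective scale,
    then return PLCC between the mapped and subjective scores."""
    objective = np.asarray(objective, float)
    subjective = np.asarray(subjective, float)
    # heuristic initial values: saturation levels, midpoint, slope scale
    p0 = [subjective.max(), subjective.min(),
          objective.mean(), objective.std() or 1.0]
    params, _ = curve_fit(logistic4, objective, subjective,
                          p0=p0, maxfev=20000)
    mapped = logistic4(objective, *params)
    return pearsonr(mapped, subjective)[0]
```

SROCC, being rank-based, would be computed directly on the raw objective scores, since the monotonic mapping does not change ranks.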
The mean, maximum, and standard deviation of the widths of the ambiguity intervals are obtained. (Other measures such as SROCC, OR, and RMSE could also be used for accuracy, but we use only PLCC for conciseness of presentation.) In addition, non-parametric Wilcoxon-Mann-Whitney tests are conducted to statistically compare the ambiguity intervals of different metrics.

Figure 5: Performance of the objective metrics in terms of Pearson linear correlation coefficient (PLCC) scores (blue) and the mean of the ambiguity intervals (green) for the LIVE database. (a) JPEG, (b) JPEG2K, (c) GB, and (d) WN. The metrics are listed in descending order of the PLCC scores. The metrics statistically equivalent to the best metric in terms of PLCC are marked in a gray box.

Figure 5 summarizes the PLCC values and the mean ambiguity intervals of the 33 metrics. The results for the four distortion types are shown separately, and the metrics are listed in descending order of the PLCC values. In the figure, the metrics showing statistically equivalent performance to the best metric in terms of PLCC are marked in the gray box. We can observe that the superiority of a metric over the others in terms of accuracy may not coincide with its superiority in terms of ambiguity, and vice versa. For instance, in Figure 5(b), the best metric in terms of accuracy is FSIM-C, but GSM, which is statistically significantly inferior to FSIM-C, is the best in terms of ambiguity.

Many metrics predict perceived image quality with high accuracy. For instance, the best metric in terms of PLCC for JPEG in Figure 5(a), i.e., FSIM-C, which shows a PLCC of about 0.95, is not statistically different from PSNR-HA, which ranks 24th. Thus, it would be difficult to distinguish the superiority between these metrics. At this point, we can apply the results of the ambiguity analysis. Among the top 24 metrics, ADD-GSIM has the smallest mean width of the ambiguity intervals, which is revealed to be significantly smaller than the second smallest one (VSI) by the statistical test (p < 0.05).

Figure 6: Performance of the top-performing objective metrics showing statistically equivalent PLCC values for all data of the LIVE database. PLCC scores and the mean, maximum, and standard deviation values of the ambiguity intervals are shown.

… in a wide range of quality. As another example, ADD-SSIM, ADD-GSIM, FSIM, and FSIM-C show statistically equivalent performance in terms of the mean ambiguity intervals, with mean widths of only about 2.0-2.5% of the whole quality range. However, the maximum and standard deviation of the ambiguity intervals of ADD-SSIM are larger than those of the other three metrics, and thus it may be less preferable. Therefore, considering all the ambiguity measures, ADD-GSIM, FSIM, and FSIM-C can be regarded as the best metrics.
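The pairwise comparison of ambiguity intervals can be sketched as follows. This is an illustrative wrapper around SciPy's one-sided Mann-Whitney U test (a common implementation of the Wilcoxon-Mann-Whitney test), applied to the normalized interval widths of two metrics; the function name is hypothetical.

```python
from scipy.stats import mannwhitneyu

def significantly_less_ambiguous(widths_a, widths_b, alpha=0.05):
    """One-sided Mann-Whitney U test: are metric A's normalized
    ambiguity-interval widths statistically smaller than metric B's?"""
    _, p = mannwhitneyu(widths_a, widths_b, alternative="less")
    return p < alpha
```

A non-parametric test is appropriate here because the interval widths are not guaranteed to follow a normal distribution over the quality range.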
5. Use case 2: Viewing distance vs. ambiguity
The viewing distance is one of the most important factors that influence visual quality perception of human viewers. As the distance from a viewer to an image increases, fewer details in the image are distinguished, changes or artifacts in the image become less noticeable, and the viewer's quality perception becomes less reliable. The proposed approach incorporates this tendency by employing the VDP method, which considers the viewing environment including the viewing distance.

However, due to the differences in the underlying mechanisms of different objective metrics, they may show different ambiguity patterns with respect to the viewing distance. For instance, the superiority of the metrics in terms of the ambiguity interval may change depending on the viewing distance.

Figure 7: Examples of ambiguity intervals of two objective metrics, MS-SSIM (blue) and VSNR (orange), with respect to the viewing distance (in multiples of the display height) for GB of the VDID database. The superiority of a metric against the other varies depending on the viewing distance.

Figure 7 shows the mean ambiguity intervals of two metrics, MS-SSIM and VSNR, for GB of the VDID database with respect to the viewing distance. When the viewing distance is 4 or 5 times the display height (i.e., 4H or 5H), MS-SSIM shows slightly smaller ambiguity intervals than VSNR, whereas VSNR shows smaller intervals than MS-SSIM for the other viewing distances. Thus, the viewing distance should be considered carefully when the ambiguity of a metric is evaluated. In general, it is preferable for a metric not only to have high accuracy and low ambiguity for a particular viewing distance, but also to show consistent performance over various viewing distances in terms of both accuracy and ambiguity.

In this section, we demonstrate that the ambiguity behavior of metrics with respect to the viewing distance can be used to compare the reliability of the metrics, which can be seen as an extension of the benchmarking in the previous section, and to identify proper viewing distances at which a metric can be used reliably. The VDID and CIDIQ databases are used.

Performance of the metrics for two viewing distances in terms of PLCC and the mean of the ambiguity intervals is shown in Figure 8. Most of the metrics show statistically equivalent PLCC scores for the two viewing distances; only one and nine metrics show significantly different accuracy scores for VDID and CIDIQ, respectively (marked with asterisks in Figure 8).
However, for VDID, all metrics except for OSS-SSIM (marked with a square in Figure 8(a)) show significantly different ambiguityinterval widths for the two viewing distances. Furthermore, OSS-SSIM shows high ac-curacy, i.e., it is included in the group of top-performing metrics (showing statisticallyequivalent PLCC scores with the best one for the short distance), and shows the small-est mean ambiguity intervals for both viewing distances (which are statistically equiva-lent). Thus, we can choose OSS-SSIM as the best metric considering both the accuracyand the ambiguity for different viewing distances. OSS-SSIM explicitly considers theeffect of the viewing distance, which seems to be the reason for the consistency of itsambiguity performance. In the case of CIDIQ, all metrics have significantly differentresults of ambiguity intervals. MAD and IW-SSIM are two top-performing metrics interms of accuracy for the short distance. However, these metrics have relatively lowerperformance in terms of ambiguity (i.e., larger mean interval widths) than the follow-ing ones (in the ranking of accuracy), e.g., OSS-SSIM and ADD-GSIM. If we accepta slight loss in terms of accuracy, it would be a better choice to select ADD-GSIM orOSS-SSIM as the best metric with consideration of both the accuracy and ambiguityfor the two viewing distances.Next, we analyze patterns of the ambiguity intervals over various viewing distances.As an example, Figure 9 shows the mean widths of the ambiguity intervals of ADD-GSIM for each of the four distortion types of VDID. As aforementioned, as the viewingdistance increases, the ability of human viewers to distinguish the details in imagesdecreases. The ambiguity intervals obtained by our approach also tend to increase withthe increasing viewing distance. A gradual increase of the ambiguity intervals due toincrease of the viewing distance is acceptable, but a sudden increase of the slope wouldnot be desirable. 
For instance, in Figure 9(c), the slope for GB increases suddenly after 5H; thus, care must be taken when the metric is used at viewing distances larger than 5H.
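The "sudden increase of the slope" criterion can be screened for automatically. A minimal sketch, assuming the mean interval widths per viewing distance have already been computed (the sample values in the comment are hypothetical, chosen to mimic a jump after 5H; the threshold rule is an assumption, not taken from the paper):

```python
import numpy as np


def sudden_slope_increase(distances, widths, factor=2.0):
    """Flag viewing distances at which the growth of the mean
    ambiguity-interval width accelerates sharply.

    Returns the distances (in multiples of H) whose incoming segment
    slope exceeds `factor` times the median slope over the whole range.
    """
    d = np.asarray(distances, dtype=float)
    w = np.asarray(widths, dtype=float)
    slopes = np.diff(w) / np.diff(d)          # slope on each segment
    threshold = factor * np.median(slopes)
    return [float(d[i + 1]) for i, s in enumerate(slopes) if s > threshold]


# E.g., widths [0.10, 0.12, 0.14, 0.16, 0.30, 0.46] over 2H-7H are
# flagged at 6H and 7H, i.e., the slope jumps after 5H.
```

Such a screen could mark, per metric and distortion type, the range of viewing distances over which the metric remains usable.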
Figure 8: Performance of the metrics for two viewing distances in terms of PLCC scores and mean of ambiguity intervals for the (a) VDID and (b) CIDIQ databases. The metrics are listed in descending order of the PLCC for the short viewing distances (4H for VDID and 1.5H for CIDIQ). The metrics having statistically different accuracy between the two viewing distances are marked with asterisks. The metrics statistically equivalent to the best metric in terms of PLCC for the short viewing distance are marked with a gray box. The metric having statistically equivalent ambiguity interval widths for the two viewing distances is marked with a square.
Figure 9: Mean widths of the ambiguity intervals of ADD-GSIM for the VDID database, with respect to the viewing distance (2H-7H): (a) JPEG, (b) JPEG2K, (c) GB, and (d) WN. The distortion type influences the slopes of the curves.
6. Conclusion
In this paper, we have proposed a new way to measure the performance of objective image quality metrics from the viewpoint of quality resolution. The procedure to ob-
Figure 10: Mean widths of the ambiguity intervals of the objective metrics for the VDID database, with respect to the viewing distance (2H-7H). For each distortion type, metrics in the first, second, third, and last quarters in the ascending order of the mean ambiguity interval width for 4H are shown from left to right: (a1) OSS-SSIM, (a2) GSM, (a3) WSNR, and (a4) PSNR for JPEG; (b1) VSNR, (b2) FSIM, (b3) PSNR-HA, and (b4) VIF for JPEG2K; (c1) MS-SSIM, (c2) FSIM, (c3) BLIINDS2, and (c4) ADM for GB; (d1) OSS-SSIM, (d2) RRED, (d3) NQM, and (d4) IFC for WN.
Figure 11: Mean widths of the ambiguity intervals of the objective metrics for the CIDIQ database, with respect to the viewing distance (2H-7H). For each distortion type, metrics in the first, second, third, and last quarters in the ascending order of the mean ambiguity interval width for 1.5H are shown from left to right: (a1) ADD-GSIM, (a2) RRED, (a3) WSNR, and (a4) OSS-PSNR for JPEG; (b1) VSNR, (b2) GMSD, (b3) UQI, and (b4) VIF for JPEG2K; (c1) VSNR, (c2) GSM, (c3) GMSD, and (c4) ADM for GB; (d1) BLIINDS2, (d2) PSNR-HMA, (d3) PSNR, and (d4) OG-IQA for PN.
Acknowledgment
This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the "ICT Consilience Creative Program" (IITP-2018-2017-0-01015) supervised by the Institute for Information & communications Technology Promotion (IITP), and also by the IITP grant funded by the Korea government (MSIT) (R7124-16-0004, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding).