A Hitchhiker's Guide to Structural Similarity
Abhinau K. Venkataramanan, Chengyang Wu, Alan C. Bovik, Ioannis Katsavounidis, Zafar Shahid
Date of latest submission: January 30, 2021
Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78705 USA
Facebook, Menlo Park, CA 94025 USA
Corresponding author: Abhinau K. Venkataramanan (e-mail: [email protected]). This research was supported by funding from Facebook Video Infrastructure.
ABSTRACT
The Structural Similarity (SSIM) Index is a very widely used image/video quality model that continues to play an important role in the perceptual evaluation of compression algorithms, encoding recipes, and numerous other image/video processing algorithms. Several public implementations of the SSIM and Multiscale-SSIM (MS-SSIM) algorithms have been developed, which differ in efficiency and performance. This "bendable ruler" makes the process of quality assessment of encoding algorithms unreliable. To address this situation, we studied and compared the functions and performances of popular and widely used implementations of SSIM, and we also considered a variety of design choices. Based on our studies and experiments, we have arrived at a collection of recommendations on how to use SSIM most effectively, including ways to reduce its computational burden.
INDEX TERMS
Image/Video Quality Assessment, Structural Similarity Index, Pareto Optimality, Color SSIM, Spatio-Temporal Aggregation, Enhanced SSIM
I. INTRODUCTION
With the explosion of social media platforms and online streaming services, video has become the most widely consumed form of content on the internet, accounting for 60% of global internet traffic in 2019 [1]. Social media platforms have also led to an explosion in the amount of image data being shared and stored online. Handling such large volumes of image and video data is inconceivable without the use of compression algorithms such as JPEG [2] [3], AVIF (AV1 Intra) [4] [5], HEIF (HEVC Intra) [6], H.264 [7] [8], HEVC [9], EVC [10], VP9 [11], AV1 [12], SVT-AV1 [13], and the upcoming VVC and AV2 standards.

The goal of these algorithms is to perform lossy compression of images and videos to significantly reduce file sizes and bandwidth consumption, while incurring little or acceptable reduction of visual quality. In addition to compression-distorted streaming videos, a large fraction of the images and videos that are shared on social media are User Generated Content (UGC) [14] [15], i.e., not professionally created. As a result, even without any additional processing, these images and videos can have impaired quality because they were captured by uncertain hands. In all these circumstances, it is imperative to have available automatic perceptual quality models and algorithms which can accurately, reliably, and consistently predict the subjective quality of images/videos over this wide range of applications.

One way that perceptual quality models can provide significant gains in compression is by conducting perceptual Rate-Distortion Optimization (RDO) [16], where quantization parameters, encoding "recipes," and mode decisions are evaluated by balancing the resulting bitrates against the perceptual quality of the decoded videos.
Typically, a set of viable encodes is arrived at by constructing a perceptually-guided, Pareto-optimal bitrate ladder. To understand Pareto-optimality, consider two encodes of a video $E_1 = (R_1, D_1)$ and $E_2 = (R_2, D_2)$, where $R_i$ and $D_i$ denote the bitrate and the (perceptual) distortion associated with each encode. If $R_1 \leq R_2$ and $D_1 \leq D_2$, we say that $E_1$ "dominates" $E_2$, since better performance (lower distortion) is obtained at a lower cost (bitrate). So, we can prune any set of encodes $S = \{E_i\}$ by removing all those dominated by any other encode. The pruned set, say $S'$, has the property that for any two encodes $E_1$ and $E_2$ such that $R_1 < R_2$, we have $D_1 > D_2$. That is, we obtain a set of encodes such that an increase in bitrate corresponds to a decrease in distortion. Such a set is said to be Pareto-optimal. In general, we can define Pareto-optimality for any setting where a cost is incurred (bitrate, running time, etc.) to achieve better performance (accuracy, distortion, etc.).

The first distortion/quality metric used to measure image/video quality was the humble Peak Signal to Noise Ratio (PSNR), or log-reciprocal Mean Squared Error (MSE) between a reference video and possibly distorted versions of it (e.g., by compression). However, while the PSNR metric is simple and easy to calculate, it does not correlate very well with subjective opinion scores of picture or video quality [17]. This is because PSNR is a pixel-wise fidelity metric which does not account for spatial or temporal perceptual processes.

An important breakthrough on the perceptual quality problem emerged in the form of the Universal Quality Index (UQI) [18], the first form of the Structural Similarity Index (SSIM).
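The dominance-based pruning just described can be sketched in a few lines of Python. The helper below is purely illustrative (names and API are ours, not part of any encoder or implementation discussed in this paper):

```python
def pareto_prune(encodes):
    """Prune a set of encodes to its Pareto-optimal subset.

    Each encode is a (bitrate, distortion) pair; an encode dominates
    another when its bitrate and distortion are both no larger, and at
    least one is strictly smaller.
    """
    pruned = []
    for i, (r1, d1) in enumerate(encodes):
        dominated = any(
            r2 <= r1 and d2 <= d1 and (r2 < r1 or d2 < d1)
            for j, (r2, d2) in enumerate(encodes)
            if j != i
        )
        if not dominated:
            pruned.append((r1, d1))
    # Sorted by bitrate, distortion now decreases along the ladder
    return sorted(pruned)
```

After sorting by bitrate, the surviving encodes form a bitrate ladder on which spending more bits always buys lower distortion.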
Given a pair of images (reference and distorted), UQI creates a local quality map by measuring luminance, contrast and structural similarity over local neighborhoods, then pooling (averaging) values of this spatial quality map, yielding a single quality prediction (per picture or video frame). SSIM was later refined to better account for the interplay between adaptive gain control of the visual signal (the basis for masking effects) and saturation at low signal levels [19].

The SSIM concept reached a higher performance in the form of Multi-Scale SSIM (MS-SSIM) [20], which applies SSIM at five spatial resolutions obtained by successive dyadic sampling. Contrast and structure similarities are computed at each scale, while luminance similarity is only calculated at the coarsest scale. These scores are then combined using exponential weighting. SSIM and MS-SSIM are widely used by the streaming and social media industries to perceptually control the encodes of many billions of picture and video contents annually.

While SSIM and MS-SSIM are most commonly deployed on a frame-by-frame basis, temporal extensions of SSIM have also been developed. In [21], the authors compute frame-wise quality scores, weighted by the amount of motion in each frame. In [22], explicit motion fields are used to compute SSIM along motion trajectories, an idea that was elaborated on in the successful MOVIE index [23].

Following the success of SSIM, a great variety of picture and video quality models have been proposed. Among these, the most successful have relied on perceptually relevant "natural scene statistics" (NSS) models, which accurately and reliably describe bandpass and nonlinearly normalized visual signals [24] [25].
Distortions predictably alter these statistics, making it possible to create highly competitive picture and video quality predictors like the Full-Reference (FR) Visual Information Fidelity (VIF) index [26], the Spatio-Temporal Reduced Reference Entropic Differences (ST-RRED) model [27], the efficient SpEED-QA [28], and the Video Multi-method Assessment Fusion (VMAF) [29] model, which uses a simple learning model to fuse quality features derived from NSS models to obtain high performance and widespread industry adoption. Despite these advances, SSIM remains the most widely used perceptual quality algorithm because of its high performance, natural definition, and compute simplicity. Moreover, the success of SSIM can also be explained by NSS, at least in part [30].

In many situations, reference information is not available as a "gold standard" against which the quality of a test picture or video can be evaluated. No-Reference (NR) quality models have been developed that can accurately predict picture or video quality without a reference, by measuring NSS deviations. Notable NR quality models include BLIINDS [31], DIIVINE [32], BRISQUE [33], and NIQE [34]. The latter two, which have attained significant industry penetration, are similar to SSIM since they are defined by simple bandpass operations over multiple scales, normalized by local spatial energy. For encoding applications where the source video to be encoded is already impaired by some distortion(s), e.g. UGC, as is often found on sites like YouTube, Facebook, and Instagram, SSIM and NIQE can be combined via a 2-step assessment process to produce significantly improved encode quality predictions [35] [36].

Evaluating picture and video encodes at scale remains the most high-volume application of quality assessment, and SSIM continues to play a dominant role in this space. Nevertheless, many widely used versions of SSIM exist having different characteristics.
Understanding and unifying these various implementations would be greatly useful to industry. Moreover, there remain questions regarding the use of SSIM across different display sizes, devices and viewing distances, as well as how to handle color, and how to combine (pool) SSIM scores. Our objective here is to attempt to answer these questions, at least in part.
II. BACKGROUND
The basic SSIM index is a FR picture quality model defined between two luminance images of size $M \times N$, $I_1(i,j)$ and $I_2(i,j)$, as a multiplicative combination of three terms: luminance similarity $l(i,j)$, contrast similarity $c(i,j)$, and structure similarity $s(i,j)$. Color may be considered, but we will do so later.

These three terms are defined in terms of the local means $\mu_1(i,j)$, $\mu_2(i,j)$, standard deviations $\sigma_1(i,j)$, $\sigma_2(i,j)$, and correlations $\sigma_{12}(i,j)$ of luminance, as follows. Let $W_{ij}$ denote a windowed region of size $k \times k$ spanning the indices $\{i, \ldots, i+k-1\} \times \{j, \ldots, j+k-1\}$, and let $w(m,n)$ denote weights assigned to each index $(m,n)$ of this window. In practice, these weighting functions sum to unity, and have a finite-extent Gaussian or rectangular shape.

The local statistics are then calculated on (and between) the two images as

$$\mu_1(i,j) = \sum_{m,n \in W_{ij}} w(m,n) I_1(m,n), \quad (1)$$

$$\mu_2(i,j) = \sum_{m,n \in W_{ij}} w(m,n) I_2(m,n), \quad (2)$$

$$\sigma_1^2(i,j) = \sum_{m,n \in W_{ij}} w(m,n) I_1^2(m,n) - \mu_1^2(i,j), \quad (3)$$

$$\sigma_2^2(i,j) = \sum_{m,n \in W_{ij}} w(m,n) I_2^2(m,n) - \mu_2^2(i,j), \quad (4)$$

$$\sigma_{12}(i,j) = \sum_{m,n \in W_{ij}} w(m,n) I_1(m,n) I_2(m,n) - \mu_1(i,j)\mu_2(i,j). \quad (5)$$

Using these local statistics, the luminance, contrast and structural similarity terms are respectively defined as

$$l(i,j) = \frac{2\mu_1(i,j)\mu_2(i,j) + C_1}{\mu_1^2(i,j) + \mu_2^2(i,j) + C_1}, \quad (6)$$

$$c(i,j) = \frac{2\sigma_1(i,j)\sigma_2(i,j) + C_2}{\sigma_1^2(i,j) + \sigma_2^2(i,j) + C_2}, \quad (7)$$

$$s(i,j) = \frac{\sigma_{12}(i,j) + C_3}{\sigma_1(i,j)\sigma_2(i,j) + C_3}, \quad (8)$$

where $C_1$, $C_2$ and $C_3$ are saturation constants that contribute to numerical stability. Local quality scores are then defined as

$$Q(i,j) = l(i,j) \cdot c(i,j) \cdot s(i,j). \quad (9)$$
Adopting the common choice of $C_3 = C_2/2$, the contrast and structure terms combine:

$$cs(i,j) = c(i,j)\, s(i,j) = \frac{2\sigma_{12}(i,j) + C_2}{\sigma_1^2(i,j) + \sigma_2^2(i,j) + C_2}. \quad (10)$$

In this way, a SSIM quality map $Q(i,j)$ is defined, which can be used to visually localize distortions. Since a single picture quality score is usually desired, the average value of the quality map can be reported as the Mean SSIM (MSSIM) score between the two images:

$$SSIM(I_1, I_2) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} Q(i,j). \quad (11)$$

SSIM obeys the following desirable properties:
1) Symmetry: $SSIM(I_1, I_2) = SSIM(I_2, I_1)$
2) Boundedness: $|SSIM(I_1, I_2)| \leq 1$
3) Unique Maximum: $SSIM(I_1, I_2) = 1 \iff I_1 = I_2$

(Footnote: very rarely, distortion can cause a negative correlation between reference and test image patches.)

An important property of SSIM is that it accounts for the perceptual phenomenon of Weber's Law, whereby a Just Noticeable Difference (JND) is proportional to the local neighborhood property $Q$. This is the basis for perceptual masking of distortions, whereby the visibility of a distortion $\Delta Q$ is mediated by the relative perturbation $\Delta Q / Q$.

To illustrate the connection between SSIM and Weber masking, consider an error $\Delta\mu$ of local luminance $\mu_1$ in a test image:

$$\mu_2 = \mu_1 + \Delta\mu = \mu_1(1 + \Lambda), \quad (12)$$

where
$\Lambda = \Delta\mu/\mu_1$ is the relative change in luminance. Then, the luminance similarity term (6) becomes (dropping spatial indices)

$$l = \frac{2\mu_1\mu_2 + C_1}{\mu_1^2 + \mu_2^2 + C_1} \quad (13)$$

$$= \frac{2\mu_1^2(1+\Lambda) + C_1}{\mu_1^2\left(1 + (1+\Lambda)^2\right) + C_1} \quad (14)$$

$$= \frac{2(1+\Lambda) + C_1/\mu_1^2}{1 + (1+\Lambda)^2 + C_1/\mu_1^2}. \quad (15)$$

Since it is usually true that $C_1 \ll \mu_1^2$, the luminance term $l$ is approximately only a function of the relative luminance change, reflecting luminance masking.

Similarly, a locally perturbed contrast $\sigma_1$ in a test image may be expressed as $\sigma_2 = (1+\Sigma)\sigma_1$, where $\Sigma = \Delta\sigma/\sigma_1$ is the relative change in contrast from distortion. Similar to the above, we can express the contrast term (7) as

$$c = \frac{2(1+\Sigma) + C_2/\sigma_1^2}{1 + (1+\Sigma)^2 + C_2/\sigma_1^2}. \quad (16)$$

Since generally $C_2 \ll \sigma_1^2$, the contrast term $c$ is approximately a function of the relative, rather than absolute, change in contrast, thereby accounting for perceptual contrast masking.

Given an 8-bit luminance image, assume the nominal dynamic range $[0, L]$, where $L = 255$. Most commonly, the saturation constants are chosen relative to the dynamic range as $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$, where $K_1$ and $K_2$ are small constants.

SSIM is quite flexible and allows room for design choices. The recommended implementation of SSIM in [19] is
• If $\min(M, N) > 256$, resize images such that $\min(M, N) = 256$.
• Use a Gaussian weighting window in (1) - (5) having $k = 11$ and $\sigma = 1.5$.
• Choose regularization constants $K_1 = 0.01$, $K_2 = 0.03$.

One of our goals here is to compare and test different, commonly used implementations of SSIM and MS-SSIM, which make different design choices. We conduct performance evaluations on existing, well-regarded image and video quality databases. We study the effects of several design choices and make recommendations on best practices when utilizing SSIM.

III. DATABASES
One of our main goals is to help "standardize" the way SSIM is defined and used. Since many versions of SSIM exist and implementing SSIM involves design choices, reliable and accurate subjective test beds that capture the breadth of theoretical and practical distortions are indispensable tools for our analysis. Among these, we selected two picture quality databases and two video quality databases that are widely used.
A. LIVE IMAGE QUALITY ASSESSMENT DATABASE
The LIVE IQA database [37] [38] contains 29 reference pictures, each subjected to the following five distortions (each at four levels of severity):
• JPEG compression
• JPEG2000 compression
• Gaussian blur
• White noise
• Bit errors in JPEG2000 bit streams
LIVE IQA contains 982 pictures with nearly 30,000 corresponding Differential Mean Opinion Score (DMOS) human subject scores.
B. TAMPERE IMAGE DATABASE 2013
The Tampere Image Database 2013 (TID2013) [39] contains 3000 distorted pictures, subjected to 24 impairments at 5 distortion levels, synthetically applied to 25 pristine images. The 24 distortions are:
• Additive Gaussian noise
• Additive noise in color components, more intensive than additive noise in the luminance component
• Spatially correlated noise
• Masked noise
• High frequency noise
• Impulse noise
• Quantization noise
• Gaussian blur
• Image denoising
• JPEG compression
• JPEG2000 compression
• JPEG transmission errors
• JPEG2000 transmission errors
• Non eccentricity pattern noise
• Local block-wise distortions of different intensity
• Mean shift (intensity shift)
• Contrast change
• Change of color saturation
• Multiplicative Gaussian noise
• Comfort noise
• Lossy compression of noisy images
• Image color quantization with dither
• Chromatic aberrations
• Sparse sampling and reconstruction
The 3000 pictures in TID2013 are accompanied by human subjective quality scores in the form of over 500,000 MOS.
C. LIVE VIDEO QUALITY ASSESSMENT DATABASE
The LIVE VQA database [40] contains 10 reference videos, each subjected to the following four distortions (each applied at four levels of severity):
• MPEG-2 compression
• H.264 compression
• Transmission over error-prone IP networks
• Transmission over error-prone wireless networks
A total of 150 distorted videos are obtained, on which 4350 subjective DMOS were collected.
D. NETFLIX PUBLIC DATABASE
The Netflix Public Database, obtained from the VMAF [29] GitHub repository, contains 9 reference videos, each distorted by spatial scaling and compression, yielding 70 distorted videos. We selected this database because of its high relevance: it contains the kinds of distortions commonly observed in the streaming video deployments, at the largest scales, where SSIM is used.
IV. VERSIONS OF SSIM
Next, we take a deep dive into publicly available and commonly used implementations of SSIM and MS-SSIM. We compare various aspects of their performance, explain the differences between them, and provide recommendations for best practices when using SSIM. This is especially important because, as we will see, subtle differences in design choices can lead to significant changes in performance and efficiency.
A. PUBLIC SSIM AND MS-SSIM IMPLEMENTATIONS
We considered the following twelve SSIM implementations when carrying out our experiments:
1) FFMPEG [41]
2) LIBVMAF [29]
3) VideoClarity ClearView Player (ClearView)
4) HDRTools [42]
5) Daala [43]
6) Scikit-Image in Python (Rectangular) [44]
7) Scikit-Image in Python (Gaussian) [44]
8) Scikit-Video in Python (Rectangular) [45]
9) Scikit-Video in Python (Gaussian) [45]
10) Tensorflow in Python [46]
11) MATLAB
12) MATLAB (Fast)
"Rectangular" refers to using a constant weight window function to calculate local statistics, while "Gaussian" refers to using a Gaussian-shaped window function to calculate local statistics, as in [19]. Only the Python and MATLAB implementations allow the user to set parameters such as the SSIM window size. Hence, we tested the other implementations using the default parameters. "Fast" in item 12 refers to an accelerated implementation of SSIM in MATLAB.

In addition, the following eight MS-SSIM implementations were tested:
1) LIBVMAF
2) ClearView
3) HDRTools
4) Daala
5) Daala (Fast)
6) Scikit-Video in Python (Sum)
7) Scikit-Video in Python (Product)
8) Tensorflow in Python
"Sum" and "Product" in 6 and 7 refer to different ways of aggregating SSIM scores across scales. "Product" refers
TABLE 1: Salient features of SSIM implementations
• FFMPEG: 8x8 rectangular window, with a stride of 4. No scaling. Only valid convolution outputs are used.
• LIBVMAF: 11x11 Gaussian window. Downsampled by a factor of max(1, min(W, H)/256). Only valid convolution outputs are used. Only the luminance channel is processed.
• ClearView: Gaussian window. Scaling unknown. Zero padding is used to extend borders. Images of arbitrary resolutions are not supported; a configurable threshold is used to filter frame scores.
• HDRTools: Strided rectangular window (8x8 with stride 1, by default). No scaling. Only valid convolution outputs are used.
• Daala: Gaussian window truncated at a value of 0.5, with the window size k determined by σ. Scaling by Gaussian low-pass filtering, with standard deviation proportional to the image height h. Only valid convolution outputs are used.
• Scikit-Image: Rectangular or Gaussian window (11x11, by default). No scaling. Reflection padding is used to extend borders.
• Scikit-Video: Any window (user-input). Downsampled by a factor of max(1, min(W, H)/256). Zero padding is used to extend borders. Irrespective of window size, 5 border pixels are removed.
• Tensorflow: Gaussian window. No scaling. Only valid convolution outputs are used.
• FastQA: Strided rectangular window (11x11 with stride 1, by default). No scaling. Only valid convolution outputs are used. A custom implementation we created that uses integral images; publicly available.
• MATLAB: Gaussian window, with size inferred from the standard deviation σ as 2⌈3σ⌉ + 1. No scaling. Only valid convolution outputs are used.
• MATLAB (Fast): Rectangular window. No scaling. Only valid convolution outputs are used. Integer arithmetic is used to accelerate computations.

to the method proposed in [20], where MS-SSIM is computed as an exponentially-weighted product of SSIM scores from each scale. In "Sum", the MS-SSIM score is instead a weighted average of SSIM scores across scales.
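As a concrete (and simplified) illustration of the "Product" aggregation of [20], with 2x2 average-pooled dyadic down-sampling and luminance applied only at the coarsest scale, one might write the sketch below. It assumes even image dimensions at each pooled scale and handles borders by reflection (as Scikit-Image does) rather than valid-only convolution, so it will not numerically match any of the tested implementations:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Standard MS-SSIM scale exponents from [20]
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def _local_stats(a, b, sigma=1.5):
    # Gaussian-weighted local means, variances and covariance;
    # truncate=3.5 gives an 11-tap kernel at sigma=1.5
    f = lambda x: gaussian_filter(x, sigma, truncate=3.5)
    mu1, mu2 = f(a), f(b)
    v1 = f(a * a) - mu1 ** 2
    v2 = f(b * b) - mu2 ** 2
    cov = f(a * b) - mu1 * mu2
    return mu1, mu2, v1, v2, cov

def ms_ssim(a, b, L=255.0, K1=0.01, K2=0.03):
    a, b = a.astype(np.float64), b.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    score = 1.0
    for level, w in enumerate(WEIGHTS):
        mu1, mu2, v1, v2, cov = _local_stats(a, b)
        cs = np.mean((2 * cov + C2) / (v1 + v2 + C2))
        if level == len(WEIGHTS) - 1:
            # Luminance similarity enters only at the coarsest scale
            l = np.mean((2 * mu1 * mu2 + C1) / (mu1 ** 2 + mu2 ** 2 + C1))
            score *= (l * cs) ** w
        else:
            score *= cs ** w
            # Dyadic down-sampling by 2x2 average pooling
            a = 0.25 * (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2])
            b = 0.25 * (b[0::2, 0::2] + b[1::2, 0::2] + b[0::2, 1::2] + b[1::2, 1::2])
    return score
```

The "Sum" variant would replace the exponentiated product with a convex combination of the per-scale scores.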
"Fast" refers to an accelerated implementation of MS-SSIM in Daala.

B. SALIENT FEATURES OF SSIM AND MS-SSIM IMPLEMENTATIONS
The salient features of the various SSIM implementations are listed in Table 1, and the salient features of the various MS-SSIM implementations are listed below. To avoid repetition, we only discuss the aspects in which each MS-SSIM implementation deviates from the corresponding SSIM implementation.

1) LIBVMAF
a) Dyadic down-sampling is performed using a 9/7 biorthogonal wavelet filter.

2) Daala
a) Uses σ = 1.5, leading to a Gaussian window of size 11.
b) Dyadic down-sampling is performed by 2x2 average pooling.

3) Daala (Fast)
a) Multiscale processing is performed at 4 levels. The first four exponents used in the standard MS-SSIM formulation are renormalized to sum to 1.
b) An integer approximation to the Gaussian is used.
c) Dyadic down-sampling is performed by 2x2 average pooling.
d) When image dimensions were not multiples of 16, we found that this implementation suffered from memory leaks, which led to a considerable decrease in accuracy. A simple fix restored performance to expected levels.

4) Scikit-Video
a) Dyadic down-sampling is performed by low-pass filtering using a 2x2 average filter, followed by down-sampling.
b) Allows aggregating across scales by summation instead of product. For summation, the exponents β_i are normalized to sum to 1, leading to a convex combination.
c) At the coarsest scale, the algorithm uses α_M = β_M = γ_M = 1.
d) When image dimensions were large, we found that this implementation suffered from incompatible memory allocation, leading to crashes at runtime. We fixed this error, making the implementation usable.

5) Tensorflow
a) Dyadic down-sampling is carried out by average pooling 2x2 neighborhoods.

C. OFF-THE-SHELF PERFORMANCE USING DEFAULT PARAMETERS
In this section, we evaluate the off-the-shelf performance of the implementations discussed above. We first normalized the subjective scores of the pictures/videos in each database to the range [0, 1] by scaling and shifting. In all of the experiments in this section, unless mentioned otherwise, we computed SSIM scores only on the luminance channel.
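As a baseline for what these implementations compute, a minimal luminance-only SSIM following the recommendations of [19] (11x11 Gaussian window, σ = 1.5, K1 = 0.01, K2 = 0.03) can be written as below. Border handling (here, reflection) and the resizing step, which this sketch omits, are among the design choices on which the tested implementations differ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim(img1, img2, K1=0.01, K2=0.03, L=255.0, sigma=1.5):
    """Mean SSIM over an 11x11 Gaussian window (sigma = 1.5), per [19].
    Sketch only: no resizing step; borders handled by reflection."""
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    # truncate=3.5 yields a radius-5 (11-tap) Gaussian kernel at sigma=1.5
    f = lambda x: gaussian_filter(x, sigma, truncate=3.5)
    mu1, mu2 = f(img1), f(img2)
    v1 = f(img1 * img1) - mu1 ** 2        # local variance, Eq. (3)
    v2 = f(img2 * img2) - mu2 ** 2        # local variance, Eq. (4)
    cov = f(img1 * img2) - mu1 * mu2      # local covariance, Eq. (5)
    # l * cs with C3 = C2 / 2, i.e., Eqs. (6) and (10)
    q = ((2 * mu1 * mu2 + C1) * (2 * cov + C2)) / \
        ((mu1 ** 2 + mu2 ** 2 + C1) * (v1 + v2 + C2))
    return q.mean()
```

By construction this sketch satisfies the symmetry and unique-maximum properties listed in Section II.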
TABLE 2: Off-the-shelf performance of SSIM implementations (each cell: PCC / SROCC / 1-RMSE)

(a) Performance of SSIM implementations

Implementation | LIVE IQA | TID 2013 | LIVE VQA
FFMPEG | – | – | –
Daala SSIM | 0.940 / 0.929 / 0.907 | 0.701 / 0.657 / 0.873 | 0.621 / 0.618 / 0.829
ClearView | 0.791 / 0.789 / 0.883 | 0.718 / 0.683 / 0.876 | 0.455 / 0.376 / 0.805
HDRTools | 0.845 / 0.831 / 0.898 | 0.667 / 0.605 / 0.868 | 0.471 / 0.452 / 0.807
Scikit-Image (Rect) | 0.942 / 0.930 / 0.908 | 0.692 / 0.639 / 0.872 | 0.665 / 0.668 / 0.837
Scikit-Image (Gauss) | 0.939 / 0.925 / 0.906 | 0.677 / 0.628 / 0.869 | 0.563 / 0.549 / 0.819
Scikit-Video (Rect) | 0.944 / – / – | – | –
Scikit-Video (Gauss) | 0.945 / – / – | – | –
FastQA | – | – | –

(b) Performance of MS-SSIM implementations

Implementation | LIVE IQA | TID 2013 | LIVE VQA
LIBVMAF | 0.943 / 0.946 / 0.909 | – | –
Scikit-Video (Sum) | 0.943 / – / – | – | –
FastQA | – | – | –
MATLAB | 0.902 / 0.905 / 0.918 | 0.829 / 0.782 / 0.901 | 0.737 / 0.729 / 0.852
Tensorflow | – | – | –
It is well known that the relationship between SSIM (or any other quality metric) and subjective scores is non-linear. To account for this, we fit the five-parameter logistic (5PL) function [47] shown in (17), mapping SSIM values to subjective scores:

$$Q(x) = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2 (x - \beta_3)\right)} \right) + \beta_4 x + \beta_5, \quad (17)$$

where $x$ is the SSIM score, $\beta_i$ are the parameters of the logistic function, and $Q(x)$ is the predicted subjective quality.

After linearizing the SSIM values in this manner, we report the Pearson Correlation Coefficient (PCC), which is a measure of the linear correlation between the predicted and true quality; the Spearman Rank Order Correlation Coefficient (SROCC), which is a measure of the rank correlation (monotonicity); and the Root Mean Square Error (RMSE), which is a measure of the error in predicting subjective quality.

Table 2 shows the results of the experiments; the best three results in each column have been boldfaced. Among SSIM implementations, LIBVMAF and the Scikit-Video implementations generally outperformed all other algorithms. We attribute this superior performance to the use of scaling, which we will expound in Section VI.

Among the MS-SSIM implementations, there was no consistent winner. Python implementations like Scikit-Video and the FastQA implementation were often among the top three. Tensorflow's MS-SSIM implementation also performs well, lending strong empirical support to the use of MS-SSIM as a training objective for deep networks implemented in Tensorflow.

Since compression forms an important class of distortions encountered in practice, we also report the off-the-shelf performance of SSIM and MS-SSIM implementations on compression-distorted data in Table 3. When restricting the comparisons to compression, the LIBVMAF, Scikit-Image and FastQA implementations still generally outperform other SSIM implementations, while HDRTools and ClearView generally outperform other MS-SSIM implementations.

D. PERFORMANCE-EFFICIENCY TRADEOFFS
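For concreteness, the 5PL linearization of (17) described above can be performed with `scipy.optimize.curve_fit`. The sketch below uses synthetic, purely hypothetical data in place of real SSIM/MOS pairs:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_5pl(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic of Eq. (17)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# Synthetic SSIM scores and subjective scores (values hypothetical)
x = np.linspace(0.5, 1.0, 50)
y = logistic_5pl(x, 1.0, 12.0, 0.8, 0.2, 0.4)

# Fit the mapping, then linearize the metric through it
params, _ = curve_fit(logistic_5pl, x, y,
                      p0=[1.0, 10.0, 0.75, 0.0, 0.5], maxfev=20000)
rmse = float(np.sqrt(np.mean((y - logistic_5pl(x, *params)) ** 2)))
```

PCC and RMSE are then computed between `logistic_5pl(x, *params)` and the subjective scores, and SROCC directly on the raw metric values (rank correlation is invariant to the monotonic mapping).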
In addition to performance (i.e., correlation with subjective scores), it is important to consider the compute efficiency of these implementations. Algorithms that employ sophisticated techniques for down-sampling, calculation of local statistics, and multi-scale processing may provide improvements in performance, but often incur the cost of additional computational complexity. When deployed at scale, these additional costs can be significant.

To evaluate the compared algorithms in the context of this performance-efficiency tradeoff, we plotted the SROCC achieved by each algorithm against their execution time. Since some methods leverage multithreading/multiprocessing, we report the user time instead of the wall time of the processes.

As with any run-time experiments, we expected to observe slight variations in execution times between runs due
TABLE 3: Off-the-shelf performance of SSIM and MS-SSIM implementations on compression (each cell: PCC / SROCC / 1-RMSE)

(a) Performance of SSIM implementations

Implementation | LIVE IQA (Comp) | TID 2013 (Comp) | LIVE VQA (Comp) | Netflix Public
FFMPEG | 0.970 / 0.967 / 0.930 | 0.938 / 0.926 / 0.918 | 0.652 / 0.652 / 0.845 | 0.696 / 0.657 / 0.797
LIBVMAF | 0.941 / 0.954 / 0.902 | – | – | –
Daala SSIM | – | – | – | –
ClearView | 0.856 / 0.851 / 0.901 | 0.863 / 0.854 / 0.881 | 0.532 / 0.352 / 0.827 | 0.590 / 0.538 / 0.772
HDRTools | 0.912 / 0.901 / 0.922 | 0.895 / 0.903 / 0.895 | 0.579 / 0.532 / 0.833 | 0.589 / 0.558 / 0.772
Scikit-Image (Rect) | – | – | – | –
MATLAB (Fast) | 0.930 / 0.917 / – | – | – | –

(b) Performance of MS-SSIM implementations

Implementation | LIVE IQA (Comp) | TID 2013 (Comp) | LIVE VQA (Comp) | Netflix Public
LIBVMAF | 0.931 / 0.956 / 0.895 | 0.967 / 0.943 / 0.940 | 0.693 / 0.694 / 0.852 | 0.754 / 0.739 / 0.814
Daala | 0.936 / – / – | – | – | –
ClearView | 0.945 / 0.927 / – | – | – | –
HDRTools | – | – | – | –
MATLAB | 0.942 / 0.925 / – | – | – | –
MATLAB 0.942 0.925 to varying system conditions. To account for this, we ranevery SSIM implementation on each database five timesand recorded the total execution time of each run. We thenreported the median run-time over the five runs.We omitted Tensorflow implementations from these ex-periments because they run prohibitively slowly on CPUs,and we cannot compare their GPU run-times while all otherimplementations are run on the CPU. we also omit ClearViewimplementations because we had to run them on customhardware. The results on each database are shown in Fig. 1,where the Pareto-optimal implementations have been circled.From these plots, it may be observed that among the imple-mentations that we tested, Daala’s Fast MS-SSIM and Scikit-Video’s SSIM (using Rectangular windows) implementationare Pareto-optimal most often, followed the FastQA MS-SSIM implementation. In addition, among the SSIM im-plementations, Daala, LIBVMAF and the FastQA imple-mentations were often Pareto-optimal across databases. Notethat while the concept of Pareto-optimality is often used inthe context of “optimizing" an encode in a rate-distortionsense by varying a parameter, no parameters were optimizedduring our experiments. In our setting, an implementation isconsidered to be Pareto-optimal among the set of consideredimplementations if there is no implementation that bothachieves a higher SROCC and runs in lesser time.The nominal computational complexity of SSIM is O ( M N k ) . 
We propose a method to improve the efficiency of SSIM when the weighting function is rectangular, i.e., $w(i,j) = 1/k^2$, by using integral images, also known as summed-area tables [48] [49].

This can be done by forming five integral images as follows:

$$I_1^{(1)}(i,j) = \begin{cases} \sum_{m \leq i} \sum_{n \leq j} I_1(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \quad (18)$$

$$I_2^{(1)}(i,j) = \begin{cases} \sum_{m \leq i} \sum_{n \leq j} I_2(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \quad (19)$$

$$I_1^{(2)}(i,j) = \begin{cases} \sum_{m \leq i} \sum_{n \leq j} I_1^2(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \quad (20)$$

$$I_2^{(2)}(i,j) = \begin{cases} \sum_{m \leq i} \sum_{n \leq j} I_2^2(m,n) & i, j > 0 \\ 0 & \text{otherwise,} \end{cases} \quad (21)$$

$$I_{12}(i,j) = \begin{cases} \sum_{m \leq i} \sum_{n \leq j} I_1(m,n) I_2(m,n) & i, j > 0 \\ 0 & \text{otherwise.} \end{cases} \quad (22)$$

Given the integral image $I_1^{(1)}$, calculate the sum in any $k \times k$ window $W_{ij}$ in constant time via

$$S_1^{(1)}(i,j) = I_1^{(1)}(i+k-1, j+k-1) + I_1^{(1)}(i-1, j-1) - I_1^{(1)}(i+k-1, j-1) - I_1^{(1)}(i-1, j+k-1). \quad (23)$$

FIGURE 1: Correlation vs execution time. (a) LIVE IQA Database, (b) TID 2013 Database, (c) LIVE VQA Database, (d) Netflix Public Database.

This operation would require $O(k^2)$ time without the use of an integral image. Similarly, calculate local sums using the other integral images, and denote them $S_2^{(1)}(i,j)$, $S_1^{(2)}(i,j)$, $S_2^{(2)}(i,j)$, and $S_{12}(i,j)$, respectively. Then calculate the necessary local statistics as

$$\mu_1(i,j) = S_1^{(1)}(i,j)/k^2, \quad (24)$$

$$\mu_2(i,j) = S_2^{(1)}(i,j)/k^2, \quad (25)$$

$$\sigma_1^2(i,j) = S_1^{(2)}(i,j)/k^2 - \mu_1^2(i,j), \quad (26)$$

$$\sigma_2^2(i,j) = S_2^{(2)}(i,j)/k^2 - \mu_2^2(i,j), \quad (27)$$

$$\sigma_{12}(i,j) = S_{12}(i,j)/k^2 - \mu_1(i,j)\mu_2(i,j). \quad (28)$$

In this rapid way of computing the SSIM index, which is applicable when using a rectangular SSIM window, the compute complexity of SSIM is reduced to $O(MN)$.

V. SCALED SSIM
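The integral-image computation of Eqs. (18)-(28) above can be sketched in NumPy as follows; this is an illustration of the technique, not the FastQA implementation itself:

```python
import numpy as np

def local_sums(img, k):
    """k x k window sums at all valid locations via an integral image
    (summed-area table); O(MN) total, independent of k. Cf. Eq. (23)."""
    # I[i, j] holds the sum of img[:i, :j]; the zero row/column
    # implements the 'otherwise' branch of Eqs. (18)-(22)
    I = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    I[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    # Four-corner difference of Eq. (23), vectorized over all windows
    return I[k:, k:] + I[:-k, :-k] - I[k:, :-k] - I[:-k, k:]

def ssim_stats(img1, img2, k=8):
    """Local statistics of Eqs. (24)-(28) from five integral images."""
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    n = k * k
    mu1 = local_sums(img1, k) / n
    mu2 = local_sums(img2, k) / n
    var1 = local_sums(img1 * img1, k) / n - mu1 ** 2
    var2 = local_sums(img2 * img2, k) / n - mu2 ** 2
    cov = local_sums(img1 * img2, k) / n - mu1 * mu2
    return mu1, mu2, var1, var2, cov
```

Plugging these statistics into Eqs. (6) and (10) yields the SSIM map with uniform (rectangular) weighting at a cost independent of the window size.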
Arguably the most widespread use of SSIM (and other picture/video quality models) is in evaluating the quality of compression encodes. On streaming and social media platforms, pictures and videos are commonly encoded at lower resolutions for transmission. This is done either because the source has low-complexity content and can be down-sampled with relatively little additional loss (or if the available bandwidth requires it), or to decrease the decoding load at the user's end.

Perceptual distortion models have become common tools for determining the quality of encodes for Rate-Distortion Optimization (RDO) [16]. Advances in video hardware have enabled the accelerated encoding and decoding of videos, making the distortion estimation step the bottleneck when optimizing encoding "recipes." From Section IV-D, we know that the computational complexity of SSIM in terms of image dimensions is $O(MN)$. Including a scale factor $\alpha$ by which we resize the image, the computational complexity is $O(\alpha^2 MN)$. Due to this quadratic growth, the computational load of distortion estimation is an increasingly relevant issue given the prevalence of high-resolution videos.

Therefore, it is of great interest to be able to accurately predict the quality of high-resolution videos that are distorted in two steps: scaling followed by compression. For example, consider High Definition (HD) videos that are first resized to a lower resolution, which we call the compression resolution, then encoded and decoded using, for example, H.264 at this compression resolution. The videos are then up-sampled to the original resolution before they are rendered for display.
We will refer to this higher resolution as the rendering resolution. We aim to reduce the computational burden of perceptually-driven RDO by circumventing the computation of SSIM at the rendering resolution, i.e., between the HD source and rendered videos. We propose a suite of algorithms, called Scaled SSIM in [50], which predict SSIM by only using SSIM values computed at the lower compression resolution during runtime. The video compression pipeline in which we solve the Scaled SSIM problem is illustrated in Fig. 2. We achieve this using two classes of models that efficiently predict Scaled SSIM, which we refer to as
• Histogram Matching
• Feature-based models
All of the proposed models operate on a per-frame basis. We first trained and tested the performance of these models on an in-house video corpus of 60 pristine videos. We compressed these videos at 6 compression scales (144p, 240p, 360p, 480p, 540p, and 720p) using FFMPEG's H.264 (libx264) encoder at 11 choices of the Quantization Parameter (QP). In this manner, we obtained a total of 3960 videos having almost 1.75M frames. On this corpus, we can evaluate the accuracy of predicting SSIM scores, i.e., the correlation between predicted and true SSIM, which was computed using the "ssim" filter in FFMPEG. However, the end goal is to predict subjective scores, which are not available for this corpus. So, we instead evaluated the performance of our models against subjective scores on the Netflix Public Database.

A. HISTOGRAM MATCHING
We observe a non-linear relationship between SSIM values across encoding resolutions. Because framewise SSIM scores are calculated by averaging the local quality map obtained from SSIM, we can estimate SSIM scores by matching the histograms of these quality maps. However, to match histograms, we require the true histogram at the rendering resolution, which is what we wish to avoid estimating. So, we instead calculate the "true" quality map just once every k frames, and assume that the shape of the true histogram does not change significantly over a short period of k − 1 frames. This allows us to reuse this "reference map" for the next k − 1 frames as a heuristic model against which we match the shapes of the next k − 1 histograms at the compression scale. This histogram matching algorithm is illustrated in Fig. 3.

FIGURE 2: Video compression pipeline

FIGURE 3: Histogram Matching Solution

Let α ∈ (0, 1) be the factor by which we down-sampled the source video. Then, the ratio of required computation using our proposed approach to SSIM computation directly at the rendered scale is approximately

(1 − 1/k) α²(1 + β + γ) + (1/k)(1 + β). (29)

The factors β and γ account for computing and matching the histograms, respectively, which are both O(MN) operations. This ratio is a decreasing function of k, and approaches α²(1 + β + γ) as k → ∞. By comparison, if the rendered SSIM map were not sampled, the ratio would be approximately α². In practice, we have observed that the time taken to compute and match histograms is comparable to the time taken to compute the SSIM map at the compression scale. So, the computational burden of the matching step is small, albeit not negligible. This reduction in computational complexity as k increases is accompanied by a reduction in performance (accurate prediction of true SSIM), as shown in Fig. 4.
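As a minimal sketch of the matching step (function names and details are ours, not those of [50]): the compression-scale quality map is quantile-matched to the most recent rendering-scale reference map, and the mean of the matched map is taken as the Scaled SSIM estimate.

```python
import numpy as np

def match_histogram(source_map, reference_map):
    """Quantile-match the values of `source_map` (a compression-scale SSIM
    map) to the empirical distribution of `reference_map` (the most recent
    rendering-scale reference map)."""
    src = source_map.ravel()
    order = np.argsort(src)
    # Reference quantiles sampled on a uniform grid, one per source sample.
    ref_q = np.quantile(reference_map, np.linspace(0.0, 1.0, src.size))
    matched = np.empty_like(src)
    matched[order] = ref_q  # i-th smallest source value -> i-th ref quantile
    return matched.reshape(source_map.shape)

def scaled_ssim_estimate(source_map, reference_map):
    """Framewise Scaled SSIM estimate: mean of the matched quality map."""
    return float(match_histogram(source_map, reference_map).mean())
```

Because the matched map is a permutation of the reference quantiles, its mean is close to the mean of the reference map, which is what makes the estimate track the rendering-scale SSIM.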
We chose k = 5 in all the experiments unless otherwise mentioned. One drawback of this method is that it requires "guiding information" in the form of periodically updated reference quality maps. However, this issue is not a factor in the second class of models.

B. FEATURE-BASED MODELS
As we observed earlier, the net quality degradation occurs in two steps: scaling and compression. So, we calculate the contribution of each operation and use these as features to estimate the net distortion. Let X be a source video and S_α(X) denote the video obtained by scaling X by a factor α. Then, the result of
up-sampling the down-sampled video back to the original resolution may be denoted by S_{1/α}(S_α(X)).

FIGURE 4: Correlation vs. sampling interval for Histogram Matching

The SSIM value between X and S_{1/α}(S_α(X)) is a measure of the loss in quality from down-sampling the video. Since this SSIM is independent of the choice of codec and compression parameters, it can be pre-computed. The second source of quality degradation is compression. Let C(X; q) be the decoded video obtained by encoding the source video X using a Quantization Parameter (QP) q. Then, the SSIM value between S_α(X) and C(S_α(X); q) measures the loss of quality resulting from compression of the video. We use these two SSIM scores as features to predict the true SSIM and refer to these models as Two-Feature Models. In addition, we can also use the scaling factor α and the QP q as features. We call such models Four-Feature Models. In both cases, we train three regressors to predict the SSIM value at the rendering scale on each frame. The three regressors considered are
• Linear Support Vector Regressor (Linear SVR)
• Gaussian Radial Basis Function SVR (Gaussian SVR)
• Fully Connected Neural Network (NN)
The Neural Network is a small fully connected network having a single hidden layer with twice as many neurons as input features. We compared these models to a simple learning-free model, which is used as a baseline. The output of the baseline model is the simple product of the two SSIM features. This is similar to the 2stepQA picture quality model proposed in [36] [35] for two-stage distorted pictures. We call this the Product model.
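The learning-free baseline can be sketched in a few lines (the function name is ours); the learned models simply replace this product with a trained regressor on the same features.

```python
import numpy as np

def product_model(ssim_scaling, ssim_compression):
    """Learning-free Product model: predict rendering-scale SSIM as the
    product of the scaling feature SSIM(X, S_{1/a}(S_a(X))) and the
    compression feature SSIM(S_a(X), C(S_a(X); q))."""
    return np.asarray(ssim_scaling) * np.asarray(ssim_compression)
```

For example, a frame with a small scaling loss (0.99) and a visible compression loss (0.90) is predicted to have a rendering-scale SSIM of about 0.89.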
C. RESULTS
The correlations between predicted SSIM and true SSIM achieved by the various models on our in-house corpus are shown in Table 4, where "2" and "4" denote the number of features input to each learning-based model.

TABLE 4: Correlation with True SSIM on corpus test data

Model                PCC     SROCC
NN 2                 0.9461  0.9834
NN 4                 0.9845  0.9869
Linear SVR 2         0.9529  0.9759
Linear SVR 4         0.9215  0.9201
Gaussian SVR 2       0.8571  0.9591
Gaussian SVR 4       0.9598  0.9628
Product (Baseline)   0.9662  0.9829
Histogram Matching   0.9933  0.9956

Among the feature-based models, the four-feature NN performed best. This is to be expected, given the great learning capacity of NNs. It is interesting to note, however, that the learning-free Product model yielded comparable or better performance at a negligible computational cost. Finally, the Histogram Matching model provided near-perfect predictions, outperforming all other models. The cost of this performance is the additional periodic calculation of reference quality maps/histograms.

Because our corpus contains videos generated at various compression scales and QPs, we were able to evaluate the sensitivity of our best models' performance to these choices of encoding parameters. We illustrate this in Fig. 5. We observe that histogram matching performed consistently well across all encoding parameters, with only a slight decrease in performance in high-quality regions, i.e., high compression scale and low QP. We attribute this to the fact that most quality values at low QPs are very close to 1. As a result, the histogram of quality scores at the compression scale is concentrated close to 1, making histogram matching difficult. On the other hand, the feature-based models performed poorly in low-quality regions, i.e., low compression scale and high QP. However, videos are seldom compressed at such low qualities, so this does not affect performance in most practical use cases.

FIGURE 5: Variation in performance with choice of encoding scale and QP

Table 5 compares models based on the correlation they achieved against subjective opinion scores on the Netflix Public Database. Because our goal is to predict SSIM efficiently, we hold the performance of "true" SSIM as the gold standard against which we evaluated the performance of the Scaled SSIM models. Because videos in this database were generated by setting bitrates instead of QPs, we only tested our two-feature and histogram matching models. It is important to note that these models were not retrained on the Netflix database.

TABLE 5: Correlation with DMOS on Netflix Public Database

Model                PCC     SROCC
True SSIM            0.6962  0.6567
NN 2                 0.6759  0.6425
Linear SVR 2         0.6746  0.6196
Gaussian SVR 2       0.6756  0.6373
Product (Baseline)   0.6715  0.6215
Histogram Matching   0.6848  0.6616

From the table, it may be observed that SSIM estimated by Histogram Matching matches the performance of true SSIM. We also observe that the feature-based models approach true SSIM's performance, with the Product model offering an effective low-complexity alternative to the learning-based models.
VI. DEVICE DEPENDENCE
The Quality of Experience of an individual user varies to a great extent, depending not only on the visual quality of the content, but also on various other factors. These include, but may not be limited to [51]:
1. Context of the media contents.
2. Users' viewing preferences.
3. Condition of the terminal and application used for displaying contents.
4. Network bandwidth and latency.
5. The environment where users view content (background lighting conditions, viewing distances, audio devices, etc.).
From the perspective of compression and quality assessment algorithms, the factors to be considered are network fluctuations and terminal screens. The rest of this section restricts its scope to the size of screens and pixel resolutions, excluding the influence of other variables.
A. IMPACT OF SCREEN SIZE
The issue of screen size on viewing quality originally arose in the context of television [52]. Large wall-sized, theatre, and IMAX displays provide better experiences, with subjects reporting increased feelings of 'reality,' i.e., the illusion that viewers are present at the scene. Studies have also shown a possible relation between screen size and the intensity of viewers' responses to content.
B. DISPLAYED CONTENT AND PERCEIVED PROJECTION
The authors of [51] compared the experiences of users of devices of different screen sizes, both for web browsing and video viewing. Mean Opinion Scores were found to vary significantly, with an approximate difference of ΔMOS = 1 between high-end devices and low-end devices of that time. It was also discovered that the viewing of videos on tablets (iPad, etc.) benefited more from displaying contents at higher resolutions (a more significant impact on MOS) as compared to mobile phones. User experiences are a combined function of screen size, content resolution, and viewing distance. This is quantified by the contrast sensitivity function (CSF) of the HVS [53], which is broadly band-pass, so that increases in resolution beyond a certain limit are not perceivable and hence pose little impact on user experience. The pass band of the human spatial CSF peaks between 1-10 cycles/degree (cpd) (depending on illumination and temporal factors), falling off rapidly beyond. Naturally, it is desirable that picture and video quality assessment algorithms be able to adapt to screen size against an assumed viewing distance, i.e., by characterizing the projection of the screen on the retina [54]. Given a screen height H and a viewing distance D, the full-screen viewing angle is

α = 2 arctan(H / 2D). (30)

Then, if the viewing screen contains L pixels on each (vertical or horizontal) line, the maximum spatial frequency supported by the pixel spacing is

f_max = L / 2α (31)

in cpd.

C. TRANSFORMING ACROSS VARIOUS SCALES
While primitive specifications of viewing distances and pixel resolutions have been provided by the ITU [55], existing subjective picture and video quality databases are mainly defined by their content and their testing environments. Likewise, nearly all quality assessment models that operate over scales use the down-sampling transform

Z = max(1, ⌈H_I / 256⌉), (32)

which creates discontinuities as the height is varied (e.g., H_I = 510 and H_I = 520) and does not account for the viewing distance D. However, the Self-Adaptive Scale Transform (SAST) [56] seeks to remediate these weaknesses. SAST uses both the size of the display and the user's angle of view (including the viewing distance):

Z_s = sqrt( (H_I · W_I) / (H · W) ) (33)
    = sqrt( 1 / (4 tan(θ_H/2) tan(θ_W/2)) · (H_I/D) · (W_I/D) ), (34)

where D denotes the viewing distance, H_I and W_I are the height and width of the display, and the corresponding projection on the retina is H by W. Commonly assumed vertical and horizontal viewing angles are θ_H = 40° and θ_W = 50°.

Because of the band-pass nature of the visual system, a picture or video may be preprocessed using Adaptive High-frequency Clipping (AHC) [57]. In this method, instead of losing high-frequency information during scaling, high frequencies are selectively assigned smaller weights in a wavelet domain. A dedicated database called VDID2014 [58] was created to record human visual responses under varying viewing distances and pixel resolutions. The authors derived an optimal scale selection (OSS) model based on the SAST and AHC models. Instead of directly comparing reference and distorted contents, all frames of the picture or video are first preprocessed according to an assumed or derived viewing distance and pixel resolution, before applying quality assessment algorithms such as SSIM. The OSS model significantly boosted the performance of both legacy and modern IQA/VQA models while not significantly increasing computational complexity.

D. TRADEOFFS
While modern viewers maintain high expectations of visual quality, regardless of the device and the picture or video application being used, efforts to achieve consistent quality have been inconsistent across screens of various sizes and resolutions. A study of H.264 streams without packet loss [59] demonstrated that the required bit rate grows linearly with the horizontal screen size when the expected level of subjective quality is kept fixed. However, the required bit rates grew much faster with increased expectations of perceived quality, with saturation of MOS at high bit rates, and little quality improvement with further increases in bit rate. Notwithstanding future improvements in picture/video representation and compression technologies, the results of [59] supply approximate upper bounds on MOS against content resolution and bit rate.
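The viewing-geometry relations of Eqs. (30)-(34) above can be sketched in code as follows. The 256-pixel divisor in the legacy transform and the exact reading of the formulas follow our interpretation of the equations and should be treated as illustrative assumptions, to be checked against [55], [56].

```python
import math

def full_screen_viewing_angle(H, D):
    """Eq. (30): angle (degrees) subtended by a screen of height H viewed
    from distance D (same units)."""
    return 2.0 * math.degrees(math.atan(H / (2.0 * D)))

def max_spatial_frequency(L, H, D):
    """Eq. (31): highest spatial frequency (cycles/degree) supported by
    L pixels spanning the viewing angle (two pixels per cycle)."""
    return L / (2.0 * full_screen_viewing_angle(H, D))

def legacy_scale(H_I, divisor=256):
    """Eq. (32): integer down-sampling factor; note the jump between
    H_I = 510 and H_I = 520 that the text criticizes."""
    return max(1, math.ceil(H_I / divisor))

def sast_scale(H_I, W_I, D, theta_H=40.0, theta_W=50.0):
    """Eq. (34): SAST factor from display size (H_I, W_I), viewing
    distance D, and assumed vertical/horizontal viewing angles (degrees)."""
    t = 4.0 * math.tan(math.radians(theta_H) / 2) * math.tan(math.radians(theta_W) / 2)
    return math.sqrt((H_I / D) * (W_I / D) / t)
```

Note that the SAST factor scales as 1/D, so halving the viewing distance doubles the selected scale, unlike the legacy transform, which ignores D entirely.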
E. MOBILE DEVICES
Mobile devices have advanced rapidly in recent years, featuring larger screens and higher resolutions. Most legacy databases that focused on explaining the impact of screen sizes and pixel resolutions were constructed using Personal Digital Assistants (PDAs) or older cell phones with displays smaller than 4.5 inches. Popular resolutions at the time of these studies were 320p or 480p, which rarely appear on contemporary mobile devices. In an effort to investigate the same issue on screens larger than 5.7 inches and resolutions of 1440p (2K) or 2160p (4K), a recent subjective study [60] focused on more recent mobile devices having larger resolutions and screen sizes. The contents viewed by the subjects were re-scaled to 4K. The results from a one-way analysis of variance (ANOVA) suggested no significant relationship between screen size and perceived quality on screens ranging from 4 inches to 5.7 inches. However, a considerable MOS improvement of 0.15 was achieved by 1080p content over 720p content, but no further improvement was observed by increasing the content resolution to 1440p, suggesting a saturation of perceived quality with resolution. Of course, MOS tends to remain constant across content resolutions higher than that of the display, suggesting that service providers restrict spatial resolutions whenever device specifications are available.
VII. EFFECT OF WINDOW FUNCTIONS ON SSIM
At the heart of SSIM lies the computation of local statistics: the means, variances, and covariances comprising the luminance, contrast, and structure terms. As described in Section II, the computational complexity of SSIM is O(MNk²).

A. EFFECT OF WINDOW SIZE
While computational complexity increases quadratically with window size, using larger windows does not guarantee better performance. Indeed, since picture and video frames are non-stationary, computing local statistics is highly advantageous for local quality prediction. While small values of k can lead to noisier estimates due to a lack of samples, choosing large values of k risks losing the relevance of local luminance, contrast, structure, and distortion.

As mentioned earlier, the two most common choices of SSIM windows are Gaussian-shaped and rectangular-shaped. Traditionally, the use of rectangular windows is not recommended in image processing, due to frequency side-lobes and the resulting "noise-leakage." Because the frequency response of a rectangular window is a sinc function, undesired high frequencies can leak into the output. To mitigate this effect, Gaussian filtering is usually preferred, especially for denoising applications or if noise may be present. For these reasons, Gaussian-shaped windows were recommended by the authors of SSIM when calculating local statistics.

To investigate the effect of the choice of window and window size, we considered a set of values of the scale parameter σ of the Gaussian window, truncating the windows at a width proportional to σ and forcing the width to be odd. Since only the Python implementations allowed setting σ, we restricted our experiments to these implementations. All of these experiments were conducted using a stride of 1.

FIGURE 6: Effect of size and choice of window function on performance. (a) Variation of SSIM performance with window size. (b) Variation of MS-SSIM performance with window size.

FIGURE 7: Effect of size and choice of window function on performance on compression data. (a) Variation of SSIM performance with window size. (b) Variation of MS-SSIM performance with window size.

Given a Gaussian window of standard deviation σ, one can construct analogous rectangular windows in three ways: having the same physical size (i.e., width and height), having the same variance (considering the rectangular window as a sampled uniform distribution), or having the same (3 dB) bandwidth. To specify a rectangular window of size K + 1, we only need to specify K. Equating the variance of the Gaussian to that of a uniform distribution yields K = ⌈σ√12⌉, where ⌈·⌉ denotes the ceiling operation. Similarly, equating the 3 dB bandwidths of the two filters requires K ≈ ⌈3.34σ⌉.

The variation of SSIM performance against window choice is shown in Fig. 6. For simplicity, we only show the performance of rectangular windows having the same physical size. We observed similar results for rectangular windows having the same variance and 3 dB bandwidth. Surprisingly, the figure indicates that rectangular windows are an objectively better design choice than equivalent Gaussian windows! In particular, rectangular windows outperform Gaussian windows for smaller window sizes, achieving slightly higher peak performance. Our experiments suggest that using windows of linear dimension in the range 15 to 20 offers a good tradeoff between performance and computation on both picture databases. Using rectangular windows also offers the possibility of a significant computational advantage, because they can be implemented efficiently using integral images, as discussed in Section II. We report the variation of performance against window size on compression-distorted data in Fig. 7.
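The variance-matching rule can be derived directly: a continuous uniform window of width w has variance w²/12, so matching a Gaussian of standard deviation σ requires w = σ√12. A small sketch (the rounding convention is our own):

```python
import math

def rect_width_same_variance(sigma):
    """Width of a rectangular window whose variance matches a Gaussian of
    standard deviation sigma: w^2 / 12 = sigma^2  =>  w = sigma * sqrt(12),
    rounded up to the next odd integer so the window has a center sample."""
    w = math.ceil(sigma * math.sqrt(12.0))
    return w if w % 2 == 1 else w + 1
```

For example, a Gaussian with σ = 1.5 (a common SSIM default) is variance-matched by a 7-sample rectangular window.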
From these plots, it may be seen that when tested on compression distortions, SROCC is maximized for smaller window sizes, in the range 7 to 15. In both figures, it may be seen that the Scikit-Video implementations peak at smaller window sizes compared to other implementations. This is explained by the fact that the Scikit-Video implementations downsample images, which increases the "effective" size of a window function with respect to the original image.
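The integral-image implementation of rectangular-window SSIM mentioned above can be sketched as follows (parameter defaults and names are ours). Each k × k local sum is obtained in O(1) from a summed-area table, so the cost is O(MN) regardless of k, matching the local statistics of Eqs. (24)-(28).

```python
import numpy as np

def window_sums(img, k):
    """k x k sliding-window sums via an integral image (summed-area table)."""
    S = np.cumsum(np.cumsum(img, axis=0), axis=1)
    S = np.pad(S, ((1, 0), (1, 0)))  # zero row/column so indexing is uniform
    return S[k:, k:] - S[:-k, k:] - S[k:, :-k] + S[:-k, :-k]

def ssim_map_rect(x, y, k=15, data_range=1.0):
    """SSIM quality map with a k x k rectangular window, built from the
    integral-image local means, variances, and covariance."""
    C1, C2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    n = k * k
    mu1, mu2 = window_sums(x, k) / n, window_sums(y, k) / n
    var1 = window_sums(x * x, k) / n - mu1 ** 2
    var2 = window_sums(y * y, k) / n - mu2 ** 2
    cov = window_sums(x * y, k) / n - mu1 * mu2
    return ((2 * mu1 * mu2 + C1) * (2 * cov + C2)) / (
        (mu1 ** 2 + mu2 ** 2 + C1) * (var1 + var2 + C2))
```

Note that the five summed-area tables are each built in a single pass over the image, which is where the independence from k comes from.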
B. EFFECT OF STRIDE
It is also possible to compute SSIM in a subsampled manner, by including a stride s, where s is the distance between adjacent windows on which SSIM is computed. Then, the computational complexity of SSIM is O(MNk²/s²). We tested the effect of stride on performance using our FastQA Python implementation, because none of the existing implementations allow varying the stride. In Fig. 8, we report the variation of SROCC with stride, where lines labelled "(Comp)" denote the performance on compression-distorted data.

FIGURE 8: Variation of performance with stride. (a) LIVE IQA Database. (b) TID 2013 Database. (c) LIVE VQA Database. (d) Netflix Public Database.

From the figure, it may be seen that the SROCC is largely unaffected by the stride for s ≤ 5. This means that by choosing a stride of s = 5, we can obtain a significant improvement in efficiency (a 25x speedup), with little change in prediction performance.

VIII. MAPPING SCORES TO SUBJECTIVE QUALITY
Because of the different parameter configurations and approximations they use, the many available implementations of SSIM tend to disagree with each other, producing (usually slightly) different scores on the same contents. Of course, any inconsistency between deployed SSIM models is undesirable, since in a given application (such as controlling an encoder), changing the SSIM implementation may lead to unpredictable results. One way to address this issue is by applying a pre-determined function to map the obtained SSIM results to subjective scores. Among a collection of both nonlinear and piecewise linear mappings, the 5PL function in (17) is particularly useful.

FIGURE 9: Example of fitting a 5PL curve to a scatter plot of MOS vs. MS-SSIM
A. IMPROVEMENT DUE TO MAPPING
Fig. 9 shows a typical example of fitting raw results (a scatter plot of MOS vs. MS-SSIM) to a 5PL function. Both the MOS and objective scores cover fairly wide ranges, while the mapping function lies approximately in the middle of these. It can easily be observed that utilizing the fitted curve yields a considerable improvement in PCC and RMSE, while the SROCC remains the same due to the monotonic nature of the function. Table 6 shows the improvement obtained in PCC and RMSE after utilizing the fitted curve. For most of the evaluated models there is a considerable performance enhancement.
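As a sketch of the fitting step, one widely used five-parameter logistic from the VQA literature can be fit with scipy on (objective score, MOS) pairs; the exact form of the 5PL in Eq. (17) may differ, and the data below are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_5pl(x, b1, b2, b3, b4, b5):
    """A common five-parameter logistic mapping of objective scores to
    subjective scores (an assumption for the exact form of Eq. (17))."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# Synthetic (SSIM, MOS) pairs for illustration only.
ssim = np.linspace(0.6, 1.0, 50)
mos = 5.0 / (1.0 + np.exp(-12.0 * (ssim - 0.8)))

params, _ = curve_fit(
    logistic_5pl, ssim, mos,
    p0=[np.ptp(mos), 10.0, float(np.mean(ssim)), 0.0, float(np.mean(mos))],
    maxfev=20000)
fitted = logistic_5pl(ssim, *params)
```

After fitting, PCC and RMSE are computed between `fitted` and the subjective scores; since the mapping is monotonic, SROCC is unchanged, exactly as the text notes.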
B. GENERALIZING TO OTHER DATABASES
While better performance against subjective scores is obtained after mapping the raw data using logistic functions, this only works on databases where subjective scores are available to help optimize the model parameters. In real life, however, social media and streaming service providers lack subjective opinions of their shared or streamed content. This means that it is uncertain whether a set of fitted parameters will apply well to unknown data. In order to study the generalizability provided by logistic mapping, we optimized the function parameters on each individual database, then mapped the raw data in the other three databases using the obtained model, as a way of assessing performance on unknown content. If the correlation metrics were to remain similar, it would demonstrate that the logistic function can be used to provide steady performance on unseen data.

TABLE 6: Improvement in performance due to linearization

(a) LIVE IQA Database

Model            PCC (Raw)  PCC (Fitted)  RMSE (Raw)  RMSE (Fitted)
ClearView        0.5928     0.7929        0.4265      0.1161
HDRTools         0.7002     0.8445        0.3110      0.1021
MATLAB           0.7311     0.8610        0.3047      0.0969
ClearView (MS)   0.7677     0.8766        0.3720      0.0917
HDRTools (MS)    0.6673     0.9124        0.4576      0.0780

(b) TID 2013 Database

Model            PCC (Raw)  PCC (Fitted)  RMSE (Raw)  RMSE (Fitted)
ClearView        0.6612     0.7175        0.3492      0.1239
HDRTools         0.6205     0.6653        0.2704      0.1327
MATLAB           0.652      0.686         0.263       0.129
ClearView (MS)   0.7308     0.7508        0.3008      0.1174
HDRTools (MS)    0.7870     0.8384        0.3738      0.0969

(c) LIVE VQA Database

Model            PCC (Raw)  PCC (Fitted)  RMSE (Raw)  RMSE (Fitted)
ClearView        0.4277     0.4625        0.4385      0.1938
HDRTools         0.4469     0.4789        0.3780      0.1919
MATLAB           0.4767     0.5595        0.3798      0.1812
ClearView (MS)   0.5491     0.6569        0.4210      0.1648
HDRTools (MS)    0.6701     0.7394        0.4571      0.1472

(d) Netflix Public Database

Model            PCC (Raw)  PCC (Fitted)  RMSE (Raw)  RMSE (Fitted)
ClearView        0.5856     0.5896        0.4567      0.2283
HDRTools         0.5800     0.5800        0.3820      0.2302
MATLAB           0.6150     0.6150        0.3758      0.2228
ClearView (MS)   0.7862     0.8119        0.4359      0.1650
HDRTools (MS)    0.7171     0.7552        0.4632      0.1852

The results of the cross-database generalization experiments are shown in Table 7. The result in each cell of the table is the RMSE obtained by training a 5PL function on a "source" database (SD), then testing it on each "target" database (TD). From these experiments, we observed that when using MATLAB SSIM, the 5PL function generalized well between the TID 2013 and LIVE IQA databases. However, we observed poor generalization when fitting the 5PL on the LIVE VQA database, then testing on the LIVE IQA database. While it may be too much to expect strong performance on video distortions after training on pictures and picture distortions, the lesson learned is that a user or service provider should either select the most relevant database to train on, or conduct a user study directed at their use case, on which an SSIM mapping may be optimized.
IX. COLOR SSIM
Of course, the vast majority of shared and streamed pictures and videos are in color. Hence, it is naturally of interest to understand whether SSIM can be optimized to also account for color distortions. However, most available SSIM implementations operate only on the luminance channel. Distortions of the color components may certainly exert considerable influence on subjective quality. The most common approach to incorporating color information into SSIM is to calculate it on each color channel, whether in RGB, YUV, or another color space, and then combine the channel results.

More sophisticated approaches have been taken to incorporate color channel information into quality models. For example, CMSSIM [61] utilized CIELAB color space [62] distances to better distinguish color distortions and noise. This approach evolved, based on a later subjective study, into CSSIM [63], which generalizes the calculations of SSIM. Another approach, called SHSIM [64], defines hue similarity (HSIM) much like structural similarity, then combines the SSIM and HSIM scores. The combination of the two was found to better predict subjective quality than using luminance or color alone.

TABLE 7: Generalizability of 5PL SSIM mappings across databases (RMSE; SD = source database, TD = target database)

SD \ TD          Netflix Public  LIVE VQA  TID 2013  LIVE IQA
Netflix Public   0.228           0.197     0.307     0.718
LIVE VQA         0.232           0.194     0.243     0.464
TID 2013         0.250           0.209     0.124     0.147
LIVE IQA         0.245           0.208     0.152     0.116

The method called Quaternion SSIM (QSSIM) [65] combines multi-valued RGB (or any other tristimulus) color space vectors from picture or video pixels into a quaternion representation, providing a formal way to assess luminance and chrominance signals and their degradations together. Although different in their formulations, these algorithms express individual frames in a tristimulus color space, whether RGB, YUV, or CIELAB, depending on the application. In the following, we will define and assess each of these approaches.
A. QUATERNION SSIM
The quaternion SSIM algorithm uses quaternions [65] to represent color pixels as a vector of complex-like components (quaternions are often described as extensions of complex or phasor representations):

q(m, n) = r(m, n)·i + g(m, n)·j + b(m, n)·k. (35)

The quaternion picture or video frame can then be decomposed into constituent "DC" and "AC" components via

dc ≜ µ_q = (1/MN) Σ_{m=1}^{M} Σ_{n=1}^{N} q(m, n), (36)

and

ac_q ≜ q(m, n) − µ_q. (37)

The quaternion contrast is then defined as

σ_q = sqrt( 1/((M−1)(N−1)) · Σ_{m=1}^{M} Σ_{n=1}^{N} ||ac_q||² ), (38)

which, when computed on both reference and test signals, is used to form a correlation factor

σ_{q_ref, q_dis} = 1/((M−1)(N−1)) · Σ_{m=1}^{M} Σ_{n=1}^{N} ac_{q_ref} · ac_{q_dis}, (39)

yielding a quaternion formulation similar to the legacy grayscale SSIM:
QSSIM = | (2 µ_{q_ref} · µ_{q_dis}) / (||µ_{q_ref}||² + ||µ_{q_dis}||²) · (2 σ_{q_ref, q_dis}) / (σ_{q_ref}² + σ_{q_dis}²) |. (40)

B. CMSSIM
The CMSSIM algorithm first transforms the input picture or video signal into the CIE XYZ tristimulus color space. These XYZ pixels are then transformed into luminance, red-green, and blue-yellow opponent planes [63] as

[Q1]   [ 0.279   0.72   −0.107 ] [X]
[Q2] = [−0.449   0.29   −0.077 ] [Y]  (41)
[Q3]   [ 0.086  −0.59    0.501 ] [Z]

The resulting chromatic channels are then smoothed using Gaussian kernels, then transformed back into the XYZ tristimulus color space via

[X]   [ 0.6266  −1.8672  −0.1532 ] [Q1]
[Y] = [ 1.3699   0.9348   0.4362 ] [Q2]  (42)
[Z]   [ 1.5057   1.4213   2.5360 ] [Q3]

and then into CIELAB L*, a*, and b*. The dissimilarity between the reference and test chromatic components is then found as

ΔE(x, y) = sqrt( (L*_ref(x, y) − L*_dis(x, y))² + (a*_ref(x, y) − a*_dis(x, y))² + (b*_ref(x, y) − b*_dis(x, y))² ), (43)

which is then used to weight the values of the final SSIM map:

CMSSIM = l(x, y) · c(x, y) · s(x, y) · (1 − ΔE(x, y)/45). (44)

C. HSSIM
The HSSIM index is calculated by first transforming pictures or frames into an HSV color space. The color quality is then predicted using a weighted average of SSIM and hue similarity:

HSSIM(x, y) = (SSIM(x, y) + w · H(x, y)) / (1 + w), (45)

where w is a small positive weight and H(x, y) is of the same form as SSIM but operates on the hue channel instead of grayscale. Next, we discuss straightforward channel-wise SSIM as applied in YUV space. This model is also tested on the four databases.

D. CHANNEL-WISE SSIM
While image sensors normally capture RGB data in accordance with photopic (daylight) retinal sampling, most of the structural information is present in the luminance signal. This implies the existence of a lower-information (reduced bandwidth) color representation. In fact, both the retinal representation and modern opponent (chromatic) color spaces exploit this property of visual signals. Modern social media and streaming platforms ordinarily process RGB into a chromatic space such as YUV or YCbCr prior to compression. Likewise, IQA/VQA may be defined on color frames represented by luminance and chrominance. Since chromatic representations reduce the correlation between color planes, the chromatic components, having reduced entropies, may be down-sampled prior to compression. YCbCr values can be obtained directly from RGB via a linear transformation, typically

Y = 0.2126 × R + 0.7152 × G + 0.0722 × B, (46)
Cb = 0.5389 × (B − Y), (47)
Cr = 0.6350 × (R − Y), (48)

assuming ITU-R BT.709 conversion. However the YCbCr components are defined, a chromatic SSIM model may be defined based on a weighted average of the objective qualities of the individual YCbCr channels:

F(ref, dis) = (f(Y_ref, Y_dis) + α f(Cb_ref, Cb_dis) + β f(Cr_ref, Cr_dis)) / (1 + α + β), (49)

where f(·, ·) denotes the similarity between reference and distorted frames. In our case, the baseline SSIM is used as the base QA measure, i.e., f(·, ·) = SSIM(ref, dis). This method is used in popular image and video processing tools like FFMPEG and Daala, where a Color SSIM is calculated as a fixed weighted combination

SSIM = w_Y · SSIM_Y + w_Cb · SSIM_Cb + w_Cr · SSIM_Cr. (50)

Instead of testing all possible combinations of the hyperparameters in our experiments, we fixed α = β. We found a small negative common value of α = β to yield the best results, with the obtained performance of this optimized Color SSIM (CSSIM) on the four databases included in Table 8.

E. RESULTS
We compared the performances of the above four Color SSIM models on the same databases, with the results tabulated in Table 8. As may be observed, using color information can noticeably boost SSIM's quality prediction power, with QSSIM (RGB and YUV) yielding the largest gains.
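The channel-wise combination of Eq. (49) is a one-liner. For comparison, a fixed 0.8/0.1/0.1 weighting, a convention used in some popular tools and assumed here for illustration, corresponds to α = β = 0.125 in Eq. (49).

```python
def channel_weighted_ssim(ssim_y, ssim_cb, ssim_cr, alpha, beta):
    """Eq. (49) with f = SSIM: weighted average of per-channel scores."""
    return (ssim_y + alpha * ssim_cb + beta * ssim_cr) / (1.0 + alpha + beta)

def fixed_weight_color_ssim(ssim_y, ssim_cb, ssim_cr):
    """Fixed-weight variant in the spirit of Eq. (50); the 0.8/0.1/0.1
    weights are a common convention, assumed here for illustration."""
    return 0.8 * ssim_y + 0.1 * ssim_cb + 0.1 * ssim_cr
```

Note that Eq. (49) normalizes by 1 + α + β, so any fixed convex weighting of the three channels can be expressed through a suitable (α, β) pair.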
X. SPATIO-TEMPORAL AGGREGATION OF QUALITY SCORES
In its native form, SSIM is defined on a pair of image regions and returns a local quality score. When applied to a pair of images, a quality map is obtained of (approximately) the same size as the image. The most common method of aggregating these local quality values is to calculate Mean SSIM (MSSIM) to obtain a SSIM score on the entire image. This method of aggregating quality scores is also usually
TABLE 8: Comparison of color SSIM models
(a) LIVE IQA Database
Method            PCC     SRCC    RMSE
Baseline SSIM     0.8594  0.8449  0.0975
Quaternion (RGB)  0.8845  0.8748  0.0889
Quaternion (YUV)  0.8766  0.8657  0.0917
Quaternion (LAB)  0.5983  0.5946  0.1527
CMSSIM            0.7873  0.7806  0.1175
HSSIM             0.7873  0.7809  0.1175
FFMPEG            0.8650  0.8507  0.0956
Daala (SSIM)      0.7547  0.7163  0.1250
Daala (MS-SSIM)   0.8193  0.8113  0.1093
Daala (FastSSIM)  0.7489  0.7500  0.1263
CSSIM             0.8810  0.8790  0.0902

(b) TID 2013 Database
Method            PCC     SRCC    RMSE
Baseline SSIM     0.6902  0.6337  0.1287
Quaternion (RGB)  0.7456  0.7155  0.1185
Quaternion (YUV)  0.7735  0.7564  0.1127
Quaternion (LAB)  0.3537  0.3015  0.1663
CMSSIM            0.6216  0.6124  0.1393
HSSIM             0.6217  0.6125  0.1393
FFMPEG            0.7168  0.6781  0.1240
Daala (SSIM)      0.4691  0.4497  0.1570
Daala (MS-SSIM)   0.7414  0.7472  0.1193
Daala (FastSSIM)  0.6348  0.6037  0.1374
CSSIM             0.6961  0.6411  0.1277

(c) LIVE VQA Database
Method            PCC     SRCC    RMSE
Baseline SSIM     0.5429  0.5089  0.1836
Quaternion (RGB)  0.6709  0.6622  0.1621
Quaternion (YUV)  0.6139  0.5985  0.1726
Quaternion (LAB)  0.2914  0.2651  0.2091
CMSSIM            0.4233  0.3826  0.1980
HSSIM             0.4194  0.3842  0.1984
FFMPEG            0.4651  0.4415  0.1935
Daala (SSIM)      0.4702  0.4443  0.1929
Daala (MS-SSIM)   0.6488  0.6351  0.1663
Daala (FastSSIM)  0.5526  0.5208  0.1822
CSSIM             0.6249  0.5742  0.1707

(d) Netflix Public Database
Method            PCC     SRCC    RMSE
Baseline SSIM     0.6335  0.5904  0.2187
Quaternion (RGB)  0.7621  0.7557  0.1830
Quaternion (YUV)  0.7816  0.7763  0.1763
Quaternion (LAB)  0.4508  0.3690  0.2523
CMSSIM            0.5417  0.4379  0.2375
HSSIM             0.5501  0.4444  0.2360
FFMPEG            0.6695  0.6267  0.2099
Daala (SSIM)      0.6766  0.6475  0.2081
Daala (MS-SSIM)   0.7585  0.7398  0.1842
Daala (FastSSIM)  0.7695  0.7385  0.1805
CSSIM             0.6643  0.6056  0.2112

applied when applying SSIM to videos. Frame-wise MSSIM scores are calculated between pairs of corresponding frames, and the average value of MSSIM (over time) is reported as the single SSIM score of the entire video.

In the context of HTTP streaming, the authors of [66] evaluated various methods of temporally pooling SSIM scores, and found that over longer durations, the simple temporal mean performed about as well as other, more sophisticated pooling strategies. Here, we summarize and expand this work by simultaneously testing various spatial and temporal aggregation methods on the two video databases.

We begin by discussing various spatial and temporal pooling strategies that can be used to pool SSIM. Some of these methods require the tuning of hyperparameters. To optimize these hyperparameters, we use the baseline sample mean as the other pooling method, comparing the SROCC achieved by each choice of hyperparameters. That is, when optimizing a spatial pooling method, we use temporal mean pooling, and vice versa.

As we discuss below, many methods have been proposed which leverage either side information, such as visual attention maps, or computationally intensive procedures like optical flow estimation. While these methods offer principled ways to improve the SSIM model, we omit them from our comparisons because of their high cost, which is often unsuitable for practical deployments of SSIM at large scales. In subsequent sections, all of the SSIM quality maps were generated using the Scikit-Image implementation of SSIM, with rectangular windows and the default parameters.
While the exact results of the experiments may vary slightly with the choice of "base implementation," we expect these trends to hold across implementations.

FIGURE 10: Windowed-Moment-Pooling SROCC vs Window Size ((a) LIVE VQA Database, (b) Netflix Public Database)
A. MOMENT-BASED POOLING
A straightforward extension of the averaging operation used in SSIM is to replace it by one of the other two Pythagorean means: the Geometric Mean (GM) and the Harmonic Mean (HM). Since local SSIM scores can be negative, we can only use GM and HM pooling on the structural dissimilarity (DSSIM) scores, i.e., 1 − SSIM. However, this means that if the SSIM at any location is close to 1, the pooled score is dramatically decreased. So, we do not recommend using GM or HM for spatial pooling. However, framewise SSIM scores are nearly always positive, so we investigate the use of the Pythagorean means for temporal pooling. We also consider the sample median, since it is a robust measure of central tendency (MCT), unlike the mean.

Another method of pooling quality scores can be found in the MOVIE index [23]. It was found that the coefficient
TABLE 9: Performance of Windowed-Moment-Pooling
Database         Method          PCC     SROCC   RMSE
LIVE VQA         Baseline SSIM   0.6645  0.6664  0.1633
                 Windowed-AM     0.6778  0.6776  0.1607
                 Windowed-GM     0.6788  0.6782  0.1605
                 Windowed-HM     0.6788
                 Windowed-CoV    0.4151  0.6044  0.1988
Netflix Public   Baseline SSIM   0.7034  0.6804  0.2009
                 Windowed-AM     0.7222  0.6838  0.1955
                 Windowed-GM     0.7233  0.6834  0.1951
                 Windowed-HM

of variation (CoV) of quality values correlated well with subjective scores. Let x = (i, j) denote spatial indices. Given a quality map Q(x, t) having mean value µ_Q(t) and standard deviation σ_Q(t), the CoV-pooled score is defined as

S_CoV(t) = σ_Q(t) / µ_Q(t).    (51)

The same method can be used to pool frame-wise quality scores. One can also adapt the CoV method of temporal pooling using windowing, where the CoV is computed over short temporal windows before being averaged. That is, given a sequence of "local" temporal CoV values ρ_Q(t), define the windowed-CoV-pooled score

S_W-CoV = (1/T) Σ_t ρ_Q(t).    (52)

In the same vein, windowed versions of the three Pythagorean means can be used for temporal pooling. Using framewise SSIM scores obtained using the Scikit-Image implementation with a rectangular window of size 11, we tested the performance of the three windowed means (W-AM, W-GM, W-HM) and windowed-CoV (W-CoV) pooling. We analyzed the variation of performance of the windowed means against window size, over a range of window sizes w. The results of these experiments are shown in Fig. 10. We omitted the W-CoV method from this plot because it gave significantly inferior performance, as shown in Table 9, which lists the best performance of each windowed method.

These plots reveal similar trends. There is an initial decrease in performance with increased window size, but for large enough windows, there is improvement in performance over the baseline. While windowed-CoV performed very poorly, the difference between the three Pythagorean means is small, with windowed-GM being a good choice. However, to observe a reliable improvement in performance, a large window size is needed, of k ≈ 80 on the LIVE VQA database and k ≈ 50 on the Netflix Public database.
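The windowed temporal pooling just described can be sketched as follows. This is an illustrative sketch: GM and HM assume positive framewise scores, and since the text does not specify whether windows overlap, non-overlapping windows are assumed here.

```python
import numpy as np

def windowed_pool(scores, k, stat="am"):
    """Split framewise scores into windows of k frames, compute the chosen
    statistic per window, then average the per-window values."""
    scores = np.asarray(scores, dtype=float)
    vals = []
    for start in range(0, len(scores) - k + 1, k):
        w = scores[start:start + k]
        if stat == "am":                      # arithmetic mean
            vals.append(w.mean())
        elif stat == "gm":                    # geometric mean (scores > 0)
            vals.append(np.exp(np.mean(np.log(w))))
        elif stat == "hm":                    # harmonic mean (scores > 0)
            vals.append(len(w) / np.sum(1.0 / w))
        elif stat == "cov":                   # windowed CoV, as in (52)
            vals.append(w.std() / w.mean())
    return float(np.mean(vals))
```

For a constant score sequence all three Pythagorean means reduce to the same value, while the windowed CoV is zero; differences between them grow with the temporal variability of the scores.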
However, such large values of k could lead to significant delays in real-time applications, which may not be a reasonable cost considering the small increase in performance.

B. FIVE-NUMBER SUMMARY POOLING
The five-number summary (FNS) [67] method was proposed as a better way of summarizing the histogram of a spatial quality map, as compared to the simple mean. Given a spatial quality map Q(x, t) at time t, let Q_min(t) denote the minimum value, Q1(t) the 25th percentile (lower quartile), Q_med(t) the median value, Q3(t) the 75th percentile (upper quartile), and Q_max(t) the maximum value. The five-number summary score is then defined as

S_FNS(t) = ( Q_min(t) + Q1(t) + Q_med(t) + Q3(t) + Q_max(t) ) / 5.    (53)

Of course, FNS may likewise be applied to the framewise quality scores as a way of temporal pooling.

C. MEAN-DEVIATION POOLING
In [68], the authors proposed a SSIM-like quality index, which is then pooled using a "mean-deviation" operation. The deviation is defined as the power o of the Minkowski distance of order p between the quality values and their mean. More concretely, given a spatial quality map Q(x, t) at time t having mean value µ_Q(t), the pooled mean-deviation quality score is given by

S_MD^(p,o)(t) = [ (1/(MN)) Σ_x |Q(x, t) − µ_Q(t)|^p ]^(o/p).    (54)

In our experiments, the most common optimal choice was p = 2, corresponding to the standard deviation. When applying MD pooling to temporal scores, the final exponent o does not affect the SROCC, since exponentiation is a monotonic function. So, while we select p using the SROCC as discussed above, we choose o for temporal pooling by comparing PCC values.

D. LUMINANCE-WEIGHTED POOLING
In [21], the authors proposed a method of spatial pooling which assigns weights to regions of an image based on the local luminance (brightness), which we call Luminance-Weighted (LW) pooling. These weights account for the fact that the HVS is less sensitive to distortions in dark regions. Following our earlier convention, the local mean µ(x) is a measure of the local luminance in the reference image. Given a lower limit a_s and an interval length b_s, the weighting function is defined as

w_LW(x) = 0,                    if µ(x) < a_s
          (µ(x) − a_s) / b_s,   if a_s ≤ µ(x) < a_s + b_s
          1,                    if µ(x) ≥ a_s + b_s    (55)

Then, the spatially-weighted SSIM score is given by

S_LW(t) = (1/(MN)) Σ_x w_LW(x) Q(x, t).    (56)

We tested the performance of LW-pooling on all four databases over a grid of values of a_s and b_s. Note that choosing a_s = b_s = 0 corresponds to the standard baseline SSIM. The experimental variation of performance (SROCC) against choices of a_s and b_s is shown in Fig. 11.

FIGURE 11: LW-Pooling SROCC vs parameters a_s and b_s ((a) LIVE IQA Database, (b) TID2013 Database, (c) LIVE VQA Database, (d) Netflix Public Database)

On the LIVE IQA and Netflix Public databases, we observed that the best performance was achieved by the baseline a_s = b_s = 0. On the other two databases, the improvement in performance was insignificant, with an elevation of SROCC of less than 0.002. So, we do not recommend using LW-pooling.

E. DISTORTION-WEIGHTED POOLING
Distortion-Weighted (DW) pooling is a method that assigns different weights to low- and high-quality regions. We consider the common method of distortion weighting, where the weight assigned to a quality score is a power of the corresponding dissimilarity score (1 − Q). Concretely, given an exponent p, the spatial DW-pooled score of a spatial quality map Q(x, t) at time t is given by

S_DW^(p)(t) = Σ_x (1 − Q(x, t))^p Q(x, t) / Σ_x (1 − Q(x, t))^p.    (57)

Likewise, DW-pooling may be applied to the time series of framewise quality scores to perform temporal DW-pooling. We tested DW-pooling over a range of values of p on all four databases, for both spatial and temporal pooling. While DW-pooling can lead to a considerable increase in performance, we also found that the optimal value of p varied significantly between databases. So, in the absence of a dataset that the user can use to select p reliably, we do not recommend using DW-pooling off the shelf. If a user does have a relevant dataset, then DW-pooling may be profitably applied. We refer the reader to Table 13 for detailed results.

F. MINKOWSKI POOLING
The Minkowski Pooling (Mink) method is a generalization of the arithmetic mean, which provides another way to give additional weight to low quality scores. Because local quality scores can be negative, we pool the DSSIM scores. Given an exponent p, define the spatial Minkowski-pooled score as

S_Mink^(p)(t) = (1/(MN)) Σ_x (1 − Q(x, t))^p.    (58)

Once again, we tested a range of values of p, omitting p = 1 since it is identical to the baseline mean pooling. Spatial Minkowski pooling provided an improvement in performance on the video databases, with p = 4 being a good choice across databases. However, as with DW-pooling, the optimal choice of p for temporal pooling varied significantly between databases, and any improvement in performance was modest. So, we do not recommend using temporal Minkowski pooling, unless a specific application-relevant dataset is available.

G. PERCENTILE POOLING
More sophisticated techniques have been proposed to spatially pool SSIM scores. In [69], the authors propose pooling SSIM scores by visual importance. The visual importance of distortions was measured in two ways: visual attention maps using the Gaze-Attentive Fixation Finding Engine (GAFFE) [70], and percentile pooling (PP) of quality scores. Because this guide is tailored towards practical application of SSIM, we omit the additional computation of running GAFFE and focus only on PP.

The spatial PP method is specified by two parameters: p_s, the percentile of lower values to be modified, and r_s, the factor by which the lower p_s percentile is weighted. The idea of PP is to heavily weight the worst quality regions, which are likely to heavily bias the perception of quality. Define the lowest p_s percentile of values of the quality map Q(x, t) by perc(Q, p_s). The quality values are then re-weighted as

Q̃^(r_s,p_s)(x, t) = Q(x, t) / r_s,   if Q(x, t) ∈ perc(Q, p_s)
                    Q(x, t),          otherwise.    (59)

The PP quality score is then defined as the average of the re-weighted quality values:

S_PP^(r_s,p_s)(t) = (1/(MN)) Σ_x Q̃^(r_s,p_s)(x, t).    (60)

Larger values of p_s penalize more low-quality values, which may dilute the distortion severity, while larger values of r_s weight the low quality regions more heavily. While
the authors of [69] were circumspect regarding the value of percentile pooling, they recommended choosing p_s = 6 and r_s = 4000. However, on all four databases, we found that all choices of the parameters p_s and r_s performed worse than the baseline. This behavior is illustrated in Fig. 12 for a range of values of p_s (in percent) and r_s. While the discrepancy between our results and those of [69] may be attributed to variations in the implementations, we found that the performance of percentile pooling was inferior across all the tested databases. So, we do not recommend spatial percentile pooling.

FIGURE 12: Spatial PP SROCC vs parameters p_s and r_s ((a) LIVE IQA Database, (b) TID2013 Database, (c) LIVE VQA Database, (d) Netflix Public Database)

FIGURE 13: Temporal PP SROCC vs parameters p_t and r_t ((a) LIVE VQA Database, (b) Netflix Public Database)

We also studied temporal PP of aggregated framewise quality scores by penalizing low-quality frames. Once again, to find the optimal choice of the temporal PP parameters p_t and r_t, we computed framewise SSIM scores, on which we tested a range of values of p_t (in percent) and r_t. Figs. 13a and 13b plot the variation of performance (SROCC) with choices of p_t and r_t on the LIVE VQA and Netflix Public databases, respectively. The performances of the optimal PP algorithm and baseline SSIM are compared in Table 10. From Table 10, it may be observed that temporal PP gave

TABLE 10: Performance of Temporal PP SSIM
Database         Method          PCC     SROCC   RMSE
LIVE VQA         Baseline SSIM   0.5999  0.5971  0.1749
                 Temporal PP     0.6163  0.5986  0.1721
Netflix Public   Baseline SSIM   0.6815  0.6574  0.2068
                 Temporal PP     0.6868  0.6601  0.2054
TABLE 11: Performance of 3D SSIM/MS-SSIM
Database         Method             PCC     SROCC   RMSE
LIVE VQA         Framewise SSIM     0.6650  0.6677  0.1632
                 SSIM-3D            0.7300  0.7285  0.1494
                 Framewise MS-SSIM  0.7631  0.7551  0.1412
                 MS-SSIM-3D         0.7779  0.7681  0.1374
Netflix Public   Framewise SSIM     0.7022  0.6784  0.2012
                 SSIM-3D            0.7086  0.6948  0.1994
                 Framewise MS-SSIM  0.7454  0.7408  0.1884
                 MS-SSIM-3D         0.7512  0.7408  0.1865

only a minor improvement in performance over baseline SSIM. From the accompanying plots, while there were general trends in performance with variations of each parameter, the prediction performance of the pooled models was sensitive to small perturbations of the parameters. Coupled with the fact that the observed increases in performance were small, it is difficult to reliably identify good choices of the parameters p_t and r_t. So, we do not recommend using percentile pooling for temporal aggregation either.

H. SPATIO-TEMPORAL SSIM
Efforts have also been made to create spatio-temporal versions of SSIM. In [71], the authors proposed a 3D spatio-temporal SSIM, and its motion-tuned extension. To avoid additional computation related to motion estimation, we consider only the SSIM-3D model, and replace the motion-tuned weighting function by a rectangular window. Thus, SSIM-3D is defined identically to SSIM, except that the local statistics (mean, standard deviation, and correlation) are computed on 3D spatio-temporal neighborhoods instead of 2D spatial neighborhoods. Similarly, MS-SSIM-3D is defined as 3D SSIM computed over multiple spatial scales. Because we chose rectangular windows, we used integral images to efficiently compute the local statistics. We tested these 3D variants of SSIM and MS-SSIM on the LIVE VQA and Netflix Public video databases. We used
FIGURE 14: Variation of SSIM-3D and MS-SSIM-3D performance with Temporal Window Size K_t ((a) LIVE VQA Database, (b) Netflix Public Database)

rectangular filters of size 11 × 11 × K_t, and investigated the variation of the algorithm's performance with K_t. The performances of the baseline frame-wise SSIM/MS-SSIM (K_t = 1) models and the best SSIM-3D/MS-SSIM-3D models are listed in Table 11. The variation in performance of SSIM-3D and MS-SSIM-3D with K_t is plotted in Fig. 14.

From these figures, performance increases by both SSIM-3D and MS-SSIM-3D relative to the 2D frame-based versions may be observed on the LIVE VQA database, with the improvement being much more pronounced in the case of single-scale SSIM. When tested on the Netflix Public database, the improvement was much smaller. Again, the improvement was smaller for MS-SSIM-3D than SSIM-3D. From the plots, choosing K_t from a small interval beginning at 3 offers a solid improvement in performance. Another advantage of this approach is that, using small rectangular temporal windows, performance increases can be obtained without any increase in computational complexity, by maintaining rolling sums of the last K_t frames. This can be achieved as below, using a buffer of K_t frames, leading to an O(MNK_t) memory complexity.

As in equations (24)-(28), we can calculate the local statistics from the analogously defined sums over 3D neighborhoods S1^(1)(i, j, k), S1^(2)(i, j, k), S2^(1)(i, j, k), S2^(2)(i, j, k), and S12(i, j, k). As an illustrative example, consider

S1^(1)(i, j, k) = Σ_{m=i−l+1}^{i} Σ_{n=j−l+1}^{j} ( Σ_{o=k−K_t+1}^{k} I(m, n, o) ).    (61)

Defining the temporal sum

T1^(1)(i, j, k) = Σ_{o=k−K_t+1}^{k} I(i, j, o),    (62)

we can rewrite (61) as

S1^(1)(i, j, k) = Σ_{m=i−l+1}^{i} Σ_{n=j−l+1}^{j} T1^(1)(m, n, k).    (63)

Knowing T1^(1)(m, n, k), this sum can be computed efficiently using integral images, using equations (18)-(23). The temporal sum T1^(1)(m, n, k) itself can be updated efficiently with each new frame, by observing that

T1^(1)(i, j, k) = T1^(1)(i, j, k − 1) − I(i, j, k − K_t) + I(i, j, k).    (64)

In the same manner, we can also compute S1^(2)(i, j, k), S2^(1)(i, j, k), S2^(2)(i, j, k), and S12(i, j, k) efficiently. Combining these two methods, we can compute SSIM-3D in O(MN) time at each frame, irrespective of the temporal size of the window K_t.

Motion vectors were also used to incorporate temporal information in [72], which proposed a Motion Compensated SSIM (MC-SSIM) algorithm. Motion vectors were used to find matching blocks in the reference and test video sequences at each temporal index. The SSIM scores between these matched blocks were then used to calculate a temporal quality score. The clear drawback of this method is the computation of motion vectors, which is expensive, and the vectors may not be readily available.

This issue was addressed in [73], which proposed a spatio-temporal SSIM which did not use motion information. Instead, the authors visualize the video as a 3D volume in x, y, and t, where the frames lie in the x-y planes. The x-t and y-t planes contain both spatial and temporal information, and can also be compared using SSIM. The spatio-temporal SSIM (ST-SSIM) model is then defined as the average of the three SSIM values. Note that neighborhoods in the x-y, x-t, and y-t directions are special cases of the 3D neighborhoods used in SSIM-3D (obtained by setting the size along one dimension to 1). So, while we discuss ST-SSIM for completeness, we did not include it in our experiments.

I. FINAL RESULTS
Here, we provide comprehensive results of our experiments with the various spatial and temporal pooling algorithms described above. In all cases, we refer to each pooling method by the abbreviations listed above. To include information about the choice of optimal hyperparameters, we add superscripts to the abbreviated algorithm names. So, we denote Mean-Deviation Pooling by MD^(p,o), Distortion-Weighted Pooling by DW^(p), Minkowski Pooling by Mink^(p), and the Windowed AM, GM, and HM algorithms by W-AM^(k), W-GM^(k), and W-HM^(k) respectively, where k denotes the window size. Once again, we linearized pooled SSIM scores by fitting them with the 5PL function in (17), and report performance in terms of the PCC, SROCC, and RMSE values.

The performances of the various spatial pooling methods on the two image databases are tabulated in Table 12. On the video databases, we tested all pairs of spatial and temporal pooling methods, and these results are given in Table 13. In these tables, the columns represent the choice of spatial pooling (SP) method, while the rows represent the choice of temporal pooling (TP) method.

In Table 12, the best performing spatial pooling method is boldfaced. We found that CoV pooling performed best on the challenging TID 2013 database, while the baseline Mean SSIM method performed best on the LIVE IQA database, with CoV pooling a close second.

In Table 13, we boldfaced the five best results in each sub-table. It may be observed that MD pooling and CoV pooling performed best among the spatial pooling methods, while large windowed means performed well among the temporal pooling methods, significantly outperforming the spatio-temporal SSIM-3D and MS-SSIM-3D algorithms. However, as discussed earlier, using windowed means requires large windows (k ≈ 50-80) while providing only a minor performance improvement over the baseline.
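For concreteness, the spatial pooling rules compared in Tables 12 and 13 can be written compactly as follows. This is a numpy sketch over a quality map `q`; the hyperparameter defaults shown are illustrative, not the per-database optima reported in the tables.

```python
import numpy as np

def am_pool(q):                        # baseline arithmetic mean
    return np.mean(q)

def cov_pool(q):                       # Eq. (51): sigma_Q / mu_Q
    return np.std(q) / np.mean(q)

def fns_pool(q):                       # Eq. (53): five-number summary
    return float(np.mean(np.percentile(q, [0, 25, 50, 75, 100])))

def md_pool(q, p=2, o=1):              # Eq. (54): mean-deviation pooling
    return (np.mean(np.abs(q - np.mean(q)) ** p) ** (1.0 / p)) ** o

def dw_pool(q, p=1):                   # Eq. (57): distortion-weighted pooling
    w = (1.0 - q) ** p
    return np.sum(w * q) / np.sum(w)

def mink_pool(q, p=4):                 # Eq. (58): Minkowski pooling of DSSIM
    return np.mean((1.0 - q) ** p)
```

Note that CoV, MD, and Minkowski pooling produce distortion-like scores (larger values indicate worse quality), while AM, FNS, and DW pooling remain on the SSIM scale.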
In addition, because CoV pooling performed consistently well across all databases and does not have any hyperparameters to tune, we recommend using spatial CoV pooling on picture or video frame quality maps, and the standard arithmetic
TABLE 12: Comparing the Performance of Spatial Pooling Methods on Image Databases
(a) LIVE IQA Database
Method     PCC    SROCC  RMSE
AM         0.944  0.934  0.091
CoV        0.940  0.931  0.093
MD^(2,·)
DW^(1/·)
Mink^(·)

(b) TID 2013 Database
Method     PCC    SROCC  RMSE
AM         0.711  0.659  0.125
CoV        0.741  0.718  0.119
MD^(2,·)
DW^(1/·)
Mink^(·)

TABLE 13: Comparing the Performance of Pairs of Spatial and Temporal Pooling Methods on Video Databases
(a) LIVE VQA Database - PCC
TP \ SP     AM     CoV    MD^(2,·)  FNS    DW^(1)  Mink^(4)
AM          0.664  0.766  0.708     0.580  0.770   0.781
GM          0.665  0.738  0.650     0.522  0.777   0.757
HM          0.661  0.709  0.435     0.518  0.781   0.736
CoV         0.521  0.221  0.153     0.341  0.578   0.161
MD^(2,·)
W-AM^(80)
W-GM^(81)

(b) Netflix Public Database - PCC
TP \ SP     AM     CoV    MD^(4,·)  FNS    DW^(8)  Mink^(8)
AM          0.703  0.802  0.887     0.677
MD^(2,·)
W-AM^(50)
W-GM^(55)
W-HM^(78)

(c) LIVE VQA Database - SROCC
TP \ SP     AM     CoV    MD^(2,·)  FNS    DW^(1)  Mink^(4)
AM          0.667  0.762
MD^(2,·)
W-AM^(80)
W-GM^(81)

(d) Netflix Public Database - SROCC
TP \ SP     AM     CoV    MD^(4,·)  FNS    DW^(8)  Mink^(8)
AM          0.680  0.768  0.871     0.633
MD^(2,·)
W-AM^(50)
W-GM^(55)
W-HM^(78)

(e) LIVE VQA Database - RMSE
TP \ SP     AM     CoV    MD^(2,·)  FNS    DW^(1)  Mink^(4)
AM          0.163  0.140  0.154     0.178  0.140   0.137
GM          0.163  0.148  0.166     0.186  0.138   0.143
HM          0.164  0.154  0.197     0.187  0.137   0.148
CoV         0.187  0.213  0.216     0.205  0.178   0.216
MD^(2,·)
W-AM^(80)
W-GM^(81)

(f) Netflix Public Database - RMSE
TP \ SP     AM     CoV    MD^(4,·)  FNS    DW^(8)  Mink^(8)
AM          0.201  0.169  0.130     0.208

mean pooling of temporal frame SSIM scores. This quality aggregation method is identical to the one used in MOVIE. As in previous sections, we repeated the experiments on compression-distorted data, and reported the results in Tables 14 and 15. Even when we restricted the distortion types, we did not obtain concordant values of hyperparameters across the video databases.

Based on our recommendations, we propose a variant of SSIM, called "Enhanced SSIM", which we are making publicly available to the community as a command line tool. The specifications of Enhanced SSIM are described below.
1) Operates only on the luminance channel.
2) Uses rectangular windows to calculate local statistics, with a default size of 11x11. These rectangular windows are implemented using integral images.
3) Local quality scores are computed with a stride of 5.
4) The input image is down-sampled by a factor inferred from the value of D/H, using a default ratio of 3.0, corresponding to a typical D/H ratio for TV viewing.
5) Coefficient of Variation pooling is used to spatially aggregate the local quality scores.

TABLE 14: Comparing the Performance of Spatial Pooling Methods on Compressed Images
(a) LIVE IQA (Comp) Database
Method     PCC     SROCC   RMSE
AM         0.9701  0.9683  0.0699
CoV        0.8391  0.9675  0.1564
MD^(2,·)
DW^(·)
Mink^(2)

(b) TID2013 (Comp) Database
Method     PCC     SROCC   RMSE
AM         0.9438  0.9283  0.0777
CoV        0.9342  0.9505  0.0839
MD^(2,·)
DW^(1/·)
Mink^(·)

TABLE 15: Comparing the SROCC Achieved by Pooling Methods on Compressed LIVE VQA Videos
TP \ SP     AM     CoV    MD^(2,·)  FNS    DW^(1)  Mink^(4)
AM          0.692  0.704
MD^(2,·)
W-AM^(99)
W-GM^(89)
W-HM^(80)

We compare the performance of our implementation with LIBVMAF in Table 16, since LIBVMAF was the best performing implementation in Section IV. This table highlights the computational and performance benefits of using our recommendations. Once again, "(Comp)" refers to experiments conducted on compression data from each database.
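To make the recipe concrete, the core of the pipeline above (items 1, 2, 3, and 5; the D/H-based downsampling of item 4 is omitted) can be sketched as follows. This is an illustrative reimplementation, not the released tool; C1 and C2 are the usual SSIM stability constants for signals normalized to [0, 1].

```python
import numpy as np

def _integral(img):
    """Summed-area table, padded with a zero row/column for easy window sums."""
    return np.pad(np.cumsum(np.cumsum(img, axis=0), axis=1), ((1, 0), (1, 0)))

def _window_sums(ii, k, stride):
    """Sum of the image over every k-by-k window, sampled with the given stride."""
    h, w = ii.shape[0] - 1, ii.shape[1] - 1
    ys = np.arange(0, h - k + 1, stride)
    xs = np.arange(0, w - k + 1, stride)
    return (ii[np.ix_(ys + k, xs + k)] + ii[np.ix_(ys, xs)]
            - ii[np.ix_(ys, xs + k)] - ii[np.ix_(ys + k, xs)])

def enhanced_ssim(ref, dis, k=11, stride=5, c1=0.01 ** 2, c2=0.03 ** 2):
    """Luma-only SSIM: rectangular k-by-k windows via integral images,
    local scores on a stride-5 grid, and spatial CoV pooling (lower is better)."""
    n = float(k * k)
    mu_x = _window_sums(_integral(ref), k, stride) / n
    mu_y = _window_sums(_integral(dis), k, stride) / n
    var_x = _window_sums(_integral(ref * ref), k, stride) / n - mu_x ** 2
    var_y = _window_sums(_integral(dis * dis), k, stride) / n - mu_y ** 2
    cov_xy = _window_sums(_integral(ref * dis), k, stride) / n - mu_x * mu_y
    q = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) \
        / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return float(np.std(q) / np.mean(q))   # CoV pooling, Eq. (51)
```

Because each windowed moment is four integral-image lookups, the cost per local score is independent of the window size k, and the stride controls the density of the sampled quality map.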
XI. CONCLUSION
In this guide, we detailed the results of a series of experiments we conducted towards determining optimal design choices when deploying SSIM. We first evaluated the off-the-shelf performance and efficiency of several public implementations of SSIM and MS-SSIM, using which we identified a set of Pareto-optimal implementations. Using these results, we also described a method to improve the computational efficiency of SSIM using integral images. We then reviewed
Database           LIBVMAF  Enhanced
LIVE IQA
LIVE VQA           0.6954
Netflix Public     0.7652
LIVE IQA (Comp)    0.9543
TID 2013 (Comp)    0.9467
LIVE VQA (Comp)    0.6839
TABLE 16: Performance of Enhanced SSIM

a method, called Scaled SSIM, to improve the efficiency of computing SSIM across resolutions when conducting RDO in encoding pipelines. Following this, we reviewed the dependence of SSIM performance on the viewing device, where we discussed improvements to SSIM which account for viewing distance and screen size. We then investigated the dependence of SSIM on the choice of window, where we conducted extensive experiments to identify good choices for the size and type of window function (rectangular windows of size 15-20), thereby validating some observations we made when testing public SSIM implementations.

Due to the non-linear nature of SSIM, it is crucial to develop a good mapping function from SSIM scores to subjective scores. We tested a popular choice of such a mapping, the five-parameter logistic function, and demonstrated its generalizability. Further, while the baseline SSIM model is defined on two luminance images, most practical applications involve media having color. To account for this, we reviewed several Color SSIM models and compared their performance, finding that Quaternion SSIM was a consistently good choice. Finally, we performed a comprehensive evaluation of spatial and temporal aggregation methods used to deploy SSIM on videos. Based on these results, we recommended using spatial CoV pooling and temporal arithmetic mean pooling of framewise SSIM scores.

In all, we have conducted a comprehensive study of many design choices involved when implementing SSIM, and made recommendations on best practices. In addition, we have incorporated these recommendations into a variant of SSIM which we call Enhanced SSIM, for which we provide an openware command line tool for use by video quality engineers in academia and industry here.
REFERENCES
[2] CCITT Recommendation T.81, vol. 81, p. 6, 1991.
[3] W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard. Springer Science & Business Media, 1992.
[4] Netflix, "AVIF for next-generation image coding." [Online]. Available: https://netflixtechblog.com/avif-for-next-generation-image-coding-b1d75675fe4
[5] N. Barman and M. G. Martini, "An evaluation of the next-generation image coding standard AVIF," 2020, pp. 1–4.
[6] J. Lainema, M. M. Hannuksela, V. K. M. Vadakital, and E. B. Aksu, "HEVC still image coding and high efficiency image file format," 2016, pp. 71–75.
[7] T. Wiegand, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," JVT-G050, 2003.
[8] I. E. Richardson, The H.264 Advanced Video Compression Standard. John Wiley & Sons, 2011.
[9] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[10] K. Choi, J. Chen, D. Rusanovskyy, K. Choi, and E. S. Jang, "An overview of the MPEG-5 essential video coding standard [standards in a nutshell]," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 160–167, 2020.
[11] D. Mukherjee, J. Bankoski, A. Grange, J. Han, J. Koleszar, P. Wilkins, Y. Xu, and R. Bultje, "The latest open-source video codec VP9 - an overview and preliminary results," in Picture Coding Symposium (PCS), 2013, pp. 390–393.
[12] Y. Chen, D. Murherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker, C. Chen, H. Su, U. Joshi, C. Chiang, Y. Wang, P. Wilkins, J. Bankoski, L. Trudeau, N. Egge, J. Valin, T. Davies, S. Midtskogen, A. Norkin, and P. de Rivaz, "An overview of core coding tools in the AV1 video codec," in Picture Coding Symposium (PCS), 2018, pp. 41–45.
[13] F. Kossentini, H. Guermazi, N. Mahdi, C. Nouira, A. Naghdinezhad, H. Tmar, O. Khlif, P. Worth, and F. B. Amara, "The SVT-AV1 encoder: overview, features and speed-quality tradeoffs," in Applications of Digital Image Processing XLIII, A. G. Tescher and T. Ebrahimi, Eds., vol. 11510, International Society for Optics and Photonics. SPIE, 2020, pp. 469–490. [Online]. Available: https://doi.org/10.1117/12.2569270
[14] D. Ghadiyaram and A. C. Bovik, "Massive online crowdsourced study of subjective and objective picture quality," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2016.
[15] Z. Tu, Y. Wang, N. Birkbeck, B. Adsumilli, and A. C. Bovik, "UGC-VQA: Benchmarking blind video quality assessment for user generated content," ArXiv, vol. abs/2005.14354, 2020.
[16] I. Katsavounidis, "Dynamic optimizer - a perceptual video encoding optimization framework," The Netflix Tech Blog, 2018.
[17] Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, 2009.
[18] Z. Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Processing Letters, vol. 9, no. 3, pp. 81–84, 2002.
[19] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[20] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Asilomar Conference on Signals, Systems & Computers, 2003, pp. 1398–1402, vol. 2.
[21] Z. Wang, L. Lu, and A. C. Bovik, "Video quality assessment based on structural distortion measurement," Signal Processing: Image Communication, vol. 1, 2007, pp. I–869–I–872.
[23] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2010.
[24] M. J. Wainwright and E. P. Simoncelli, "Scale mixtures of Gaussians and the statistics of natural images," in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS'99. Cambridge, MA, USA: MIT Press, 1999, pp. 855–861.
[25] Z. Wang and A. C. Bovik, "Reduced- and no-reference image quality assessment," IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 29–40, 2011.
[26] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.
[27] R. Soundararajan and A. C. Bovik, "Video quality assessment by reduced reference spatio-temporal entropic differencing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, 2013.
[28] C. G. Bampis, P. Gupta, R. Soundararajan, and A. C. Bovik, "SpEED-QA: Spatial efficient entropic differencing for image and video quality," IEEE Signal Processing Letters, vol. 24, no. 9, pp. 1333–1337, 2017.
[29] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, "Toward a practical perceptual video quality metric," The Netflix Tech Blog, vol. 6, p. 2, 2016.
[30] K. Seshadrinathan and A. C. Bovik, "Unifying analysis of full reference image quality assessment," 2008, pp. 1200–1203.
[31] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind image quality assessment: A natural scene statistics approach in the DCT domain,"
IEEETransactions on Image Processing , vol. 21, no. 8, pp. 3339–3352, 2012.[32] A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: Fromnatural scene statistics to perceptual quality,”
IEEE Transactions on ImageProcessing , vol. 20, no. 12, pp. 3350–3364, 2011.[33] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image qualityassessment in the spatial domain,”
IEEE Transactions on Image Process-ing , vol. 21, no. 12, pp. 4695–4708, 2012.[34] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completelyblind” image quality analyzer,”
IEEE Signal Processing Letters , vol. 20,no. 3, pp. 209–212, 2013.[35] A. Bovik, “Assessing quality of images or videos using a two-stage qualityassessment,” Jan. 7 2020, US Patent 10,529,066.[36] X. Yu, C. G. Bampis, P. Gupta, and A. C. Bovik, “Predicting the qualityof images compressed after distortion in two steps,”
IEEE Transactions onImage Processing , vol. 28, no. 12, pp. 5757–5770, 2019.[37] H. R. Sheikh, “Image and video quality assessment research at live,” http://live. ece. utexas. edu/research/quality , 2003.[38] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation ofrecent full reference image quality assessment algorithms,”
IEEE Trans-actions on Image Processing , vol. 15, no. 11, pp. 3440–3451, 2006.[39] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian,J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J.Kuo], “Image database TID2013: Peculiarities, results and perspectives,”
Signal Processing: Image Communication
IEEETransactions on Image Processing , vol. 19, no. 6, pp. 1427–1441, 2010.[41] Ffmpeg. [Online]. Available: https://ffmpeg.org/[42] A. Tourapis and D. Singer, “HDRTools: A software package forvideo processing and analysis,” in lSO/IEC JTC1/SC29IWG11MPEG2014/m35156 , 2014.[43] T. J. Daede, N. E. Egge, J. Valin, G. Martres, and T. B. Terriberry,“Daala: A perceptually-driven next generation video codec,”
CoRR ,vol. abs/1603.03129, 2016. [Online]. Available: http://arxiv.org/abs/1603.03129[44] S. Van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D.Warner, N. Yager, E. Gouillart, and T. Yu, “Scikit-image: image processingin Python,”
PeerJ VOLUME X, 2021 enkataramanan et al. : A Hitchhiker’s Guide to Structural Simlarity [47] P. G. Gottschalk and J. R. Dunn, “The five-parameter logistic: a char-acterization and comparison with the four-parameter logistic,”
AnalyticalBiochemistry , vol. 343, no. 1, pp. 54–65, 2005.[48] F. C. Crow, “Summed-area tables for texture mapping,”
SIGGRAPHComput. Graph. , vol. 18, no. 3, p. 207–212, Jan. 1984. [Online].Available: https://doi.org/10.1145/964965.808600[49] P. Viola and M. J. Jones, “Robust real-time face detection,”
InternationalJournal of Computer Vision , vol. 57, no. 2, pp. 137–154, 2004.[50] A. K. Venkataramanan, C. Wu, and A. C. Bovik, “Optimizing video qualityestimation across resolutions,” in , 2020, pp. 1–5.[51] R. Schatz and S. Egger, “On the impact of terminal performance andscreen size on qoe,” in
Proc. ETSI Workshop Sel. Items Telecommun. Qual.Matters , 2012, pp. 1–26.[52] M. E. Grabe, M. Lombard, R. D. Reich, C. C. Bracken, and T. B. Ditton,“The role of screen size in viewer experiences of media content,”
VisualCommunication Quarterly , vol. 6, no. 2, pp. 4–9, 1999.[53] F. W. Campbell and J. G. Robson, “Application of fourier analysis to thevisibility of gratings,”
The Journal of physiology , vol. 197, no. 3, p. 551,1968.[54] S. Winkler, “Issues in vision modeling for perceptual video quality assess-ment,”
Signal Processing , vol. 78, no. 2, pp. 231–252, 1999.[55] I. BT, “Recommendation ITU-R BT.500-14 (10/2019): Methodologiesfor the subjective assessment of the quality of television images,”
ITU,Geneva , 2020.[56] K. Gu, G. Zhai, X. Yang, and W. Zhang, “Self-adaptive scale transformfor iqa metric,” in . IEEE, 2013, pp. 2365–2368.[57] K. Gu, G. Zhai, M. Liu, Q. Xu, X. Yang, J. Zhou, and W. Zhang, “Adaptivehigh-frequency clipping for improved image quality assessment,” in
VisualCommunications and Image Processing (VCIP) , 2013, pp. 1–5.[58] K. Gu, M. Liu, G. Zhai, X. Yang, and W. Zhang, “Quality assessmentconsidering viewing distance and image resolution,”
IEEE Transactionson Broadcasting , vol. 61, no. 3, pp. 520–531, 2015.[59] G. Cermak, M. Pinson, and S. Wolf, “The relationship among video qual-ity, screen resolution, and bit rate,”
IEEE Transactions on Broadcasting ,vol. 57, no. 2, pp. 258–262, 2011.[60] W. Zou, J. Song, and F. Yang, “Perceived image quality on mobile phoneswith different screen resolution,”
Mobile Information Systems , vol. 2016,2016.[61] M. Hassan and C. Bhagvati, “Structural similarity measure for colorimages,”
International Journal of Computer Applications , vol. 43, no. 14,pp. 7–12, 2012.[62] C. I. De L’Eclairage, “Recommendations on uniform color spaces, color-difference equations, psychometric color terms,”
Paris: CIE , 1978.[63] M. A. Hassan and M. S. Bashraheel, “Color-based structural similarityimage quality assessment,” in . IEEE, 2017, pp. 691–696.[64] Y. Shi, Y. Ding, R. Zhang, and J. Li, “Structure and hue similarity for colorimage quality assessment,” in . IEEE, 2009, pp. 329–333.[65] A. Kolaman and O. Yadid-Pecht, “Quaternion structural similarity: a newquality index for color images,”
IEEE Transactions on Image Processing ,vol. 21, no. 4, pp. 1526–1536, 2011.[66] M. Seufert, M. Slanina, S. Egger, and M. Kottkamp, ““to pool or notto pool”: A comparison of temporal pooling methods for http adaptivevideo streaming,” in , 2013, pp. 52–57.[67] C. G. Zewdie, M. Pedersen, and Z. Wang, “A new pooling strategy forimage quality metrics: Five number summary,” in , 2014, pp. 1–6.[68] H. Ziaei Nafchi, A. Shahkolaei, R. Hedjam, and M. Cheriet, “Mean devi-ation similarity index: Efficient and reliable full-reference image qualityevaluator,”
IEEE Access , vol. 4, pp. 5579–5590, 2016.[69] A. K. Moorthy and A. C. Bovik, “Visual importance pooling for imagequality assessment,”
IEEE Journal of Selected Topics in Signal Processing ,vol. 3, no. 2, pp. 193–201, 2009.[70] U. Rajashekar, I. van der Linde, A. C. Bovik, and L. K. Cormack, “Gaffe:A gaze-attentive fixation finding engine,”
IEEE Transactions on ImageProcessing , vol. 17, no. 4, pp. 564–573, 2008.[71] A. K. Moorthy and A. C. Bovik, “Efficient motion weighted spatio-temporal video SSIM index,” in
Human Vision and Electronic ImagingXV , B. E. Rogowitz and T. N. Pappas, Eds., vol. 7527, International Society for Optics and Photonics. SPIE, 2010, pp. 440 – 448. [Online].Available: https://doi.org/10.1117/12.844198[72] A. K. Moorthy and A. C. Bovik, “Efficient video quality assessment alongtemporal trajectories,”
IEEE Transactions on Circuits and Systems forVideo Technology , vol. 20, no. 11, pp. 1653–1658, 2010.[73] Y. Wang, T. Jiang, S. Ma, and W. Gao, “Spatio-temporal ssim index forvideo quality assessment,” in , 2012, pp. 1–6.
ABHINAU K. VENKATARAMANAN received his B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Hyderabad, India, in 2019. He is currently pursuing his M.S. and Ph.D. degrees in Electrical and Computer Engineering at the University of Texas at Austin, TX, USA.
During the summer of 2018, he was a summer research intern at the Robotics Institute at Carnegie Mellon University, as an SN Bose Scholar, where he worked on biologically-inspired reinforcement learning. He is currently a Graduate Research Assistant at the Laboratory for Image and Video Engineering at the University of Texas at Austin. His research interests include image and video quality assessment, perceptual optimization, deep learning, and reinforcement learning.
Mr. Venkataramanan was a recipient of the SN Bose Scholarship (Indo-US Science and Technology Foundation) and the KVPY Fellowship (Department of Science and Technology, Government of India). He was also awarded the Institute Silver Medal by the Indian Institute of Technology, Hyderabad, for securing the first rank in his department during his undergraduate studies.
CHENGYANG WU received his B.Eng. degree in Electrical Engineering from Shanghai Jiao Tong University in 2019. He is currently pursuing his Ph.D. degree in Electrical and Computer Engineering at the University of Texas at Austin, TX, USA.
He is currently a Graduate Research Assistant at the Laboratory for Image and Video Engineering at the University of Texas at Austin. His research interests include image and video quality assessment, no-reference methods, machine learning, and computer vision.
ALAN C. BOVIK (F ’95) is the Cockrell Family Regents Endowed Chair Professor at The University of Texas at Austin. His research interests include image processing, digital photography, digital television, digital streaming video, and visual perception. For his work in these areas he has been the recipient of the 2019 Progress Medal from The Royal Photographic Society, the 2019 IEEE Fourier Award, the 2017 Edwin H. Land Medal from The Optical Society, a 2015 Primetime Emmy Award for Outstanding Achievement in Engineering Development from the Television Academy, and the Norbert Wiener Society Award and the Karl Friedrich Gauss Education Award from the IEEE Signal Processing Society. He has also received about 10 ‘best journal paper’ awards, including the 2016 IEEE Signal Processing Society Sustained Impact Award. His books include The Essential Guides to Image and Video Processing. He co-founded and was the longest-serving Editor-in-Chief of the IEEE Transactions on Image Processing, and also created and chaired the IEEE International Conference on Image Processing, which was first held in Austin, Texas, in 1994.
IOANNIS KATSAVOUNIDIS received his B.S./M.S. degree from the Aristotle University of Thessaloniki, Greece, in 1991 and a Ph.D. from the University of Southern California in 1998, all in Electrical Engineering. He was an engineer for Caltech’s Physics department from 1996 to 2000, working on the MACRO high-energy astrophysics experiment in Italy. From 2000 to 2006, he was a Director of Software at InterVideo in Fremont, CA, working on advanced video codecs, developing technologies around error resilience, and optimizing video encoding and decoding. In 2007, he co-founded Cidana, a mobile multimedia software company, and served as its CTO in Shanghai, China. Between 2008 and 2015, he was an associate professor in the Electrical and Computer Engineering department at the University of Thessaly in Greece, where he taught and did research in signal, image, and video processing, as well as information theory. From 2015 to 2018, he was a Sr. Research Scientist at Netflix, part of the Encoding Technologies team, where he worked on video quality metrics and optimization problems, contributing to the development and popularization of VMAF and inventing the Dynamic Optimizer perceptual quality optimization framework. Since 2018, he has been a Research Scientist on Facebook’s Video Infrastructure team, working on large-scale video quality and video encoding optimization problems that involve HW and SW components. He is actively involved with the Alliance for Open Media (AOM) and next-generation royalty-free codec development, and with the Video Quality Experts Group (VQEG) in multiple projects around video quality. Dr. Katsavounidis has over 100 publications in multiple journals and conferences and over 40 patents.
ZAFAR SHAHID received his B.S. degree from the University of Engineering & Technology Lahore, Pakistan, in 2001, an M.S. from INSA (Institut National des Sciences Appliquées) de Lyon, France, in 2007, and a Ph.D. from the University of Montpellier, France, in 2010. His Ph.D. was funded by CNRS (Centre National de la Recherche Scientifique), France. From 2001 to 2006, he was a video software engineer at Streaming Networks, working on real-time multimedia products. He was involved in video codec optimization for the Philips VLIW processor TriMedia, and developed OEM/ODM products for this processor. During 2011, he worked in IRCCyN Labs, Nantes, France, on QoE for scalable video codecs and designed drift-free bitstream watermarking of H.264/AVC. During 2011-2015, he worked as a Media Encoding Architect for multiple start-ups, designing both IPTV and 50 ms end-to-end latency video pipelines for cloud gaming and tele-health. His video pipeline was used to stream Super Bowl 2014 to more than 500K viewers simultaneously on Desktop, iOS, and Android. During 2015-2016, he worked on cloud products for tone-mapping and metadata processing of HDR content. After this, he worked in the GameStream team of Nvidia during 2016-2019, where he worked on streaming of 4K video games from the cloud, and from GPU-enabled gaming desktops at home, to Tegra-powered Shield devices and Win/Mac clients. Since 2019, he has been working at Facebook on video codec optimization and quality problems at scale, involving both software & ASIC. Dr. Shahid has more than 40 publications and 3 patents.