User-generated Video Quality Assessment: A Subjective and Objective Study
Yang Li, Shengbin Meng, Xinfeng Zhang, Member, IEEE, Shiqi Wang, Member, IEEE, Yue Wang, and Siwei Ma, Member, IEEE

Yang Li and Siwei Ma are with the Institute of Digital Media, Peking University, Haidian District, Beijing 100871, China. Xinfeng Zhang is with the School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China. Shiqi Wang is with the Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China. Shengbin Meng and Yue Wang are with the Media Foundation Team, ByteDance Inc.
Abstract—Recently, we have observed an exponential increase of user-generated content (UGC) videos. The distinguishing characteristic of UGC videos originates from the video production and delivery chain, as they are usually acquired and processed by non-professional users before being uploaded to hosting platforms for sharing. As such, these videos usually undergo multiple distortion stages that may affect visual quality before ultimately being viewed. Inspired by the increasing consensus that the optimization of video coding and processing shall be fully driven by perceptual quality, in this paper we propose to study the quality of UGC videos from both objective and subjective perspectives. We first construct a UGC video quality assessment (VQA) database, aiming to provide useful guidance for UGC video coding and processing on the hosting platform. The database contains source UGC videos uploaded to the platform and their transcoded versions that are ultimately enjoyed by end-users, along with their subjective scores. Furthermore, we develop an objective quality assessment algorithm that automatically evaluates the quality of the transcoded videos based on the corrupted reference, in accordance with the application scenario of UGC video sharing on hosting platforms. The information from the corrupted reference is well leveraged, and the quality is predicted based on the inferred quality maps with deep neural networks (DNN). Experimental results show that the proposed method yields superior performance. Both the subjective and objective evaluations of the UGC videos also shed light on the design of perceptual UGC video coding.
Index Terms—User-generated content, video quality assessment, deep neural network.
I. INTRODUCTION

Video content was historically created by professional content producers. Recently, with the development of multimedia and network technologies, as well as the advances of acquisition devices, there has been an explosion of user-generated content (UGC) videos and related sharing services. Enormous numbers of videos generated without professional routines and practices are uploaded to sharing platforms such as Facebook, YouTube and TikTok. Compared to professionally-generated content (PGC) videos, the low barriers in video production and sharing make UGC content extremely diverse. In particular, the lack of proper shooting skills and professional video capture equipment makes the perceptual quality of UGC videos even worse.
Besides, special effects are sometimes incorporated to enhance the user experience, thereby increasing the difficulty of quality assessment and compression. The exponential increase in the demand for high-quality videos poses great challenges in practice. As such, effective UGC video quality assessment (VQA) algorithms become critical to guide the optimization of the hosting platform, in an effort to deliver videos with better visual quality under limited bandwidth.

In traditional full-reference (FR) quality assessment, pristine sources are available for reference, such that the quality of the distorted video can be predicted by signal- or feature-level comparisons. However, straightforwardly applying this strategy to UGC videos is problematic, as the source videos in the hosting platform have already been corrupted due to acquisition and compression distortions introduced before uploading. As such, traditional FR algorithms may be misled by the distorted reference and fail to predict the quality of the ultimately viewed UGC videos. One extreme example is that an excessively high bit rate is applied to transcode a video of extremely poor quality. In this scenario, the objective FR quality is not consistent with the subjective quality due to the high similarity with the corrupted reference. However, relying on no-reference (NR) algorithms only may omit the useful reference information, and may not ensure accurate and robust prediction on such diverse content.

In this paper, we first create a database with subjective ratings for UGC videos, revealing the complex nature of the UGC quality assessment problem. Furthermore, we present a corrupted-reference framework which delivers accurate predictions of the perceptual quality of UGC videos. The proposed algorithm measures perceptual quality by combining the local distortions of the source and transcoded videos, relying on the prediction of quality maps. In particular, the quality maps are predicted in a data-driven manner and fused through a learned network, such that the overall quality score is estimated from gradually pooled features. The three main contributions of this work are as follows.

1) We construct a dedicated exploration database for UGC videos, including the source and transcoded videos in the hosting platforms, as well as the objective and subjective scores. We further demonstrate that innovative quality assessment approaches should be developed based on careful investigations.

2) We propose a novel corrupted-reference VQA method for UGC videos based on deep neural networks (DNN). In contrast with traditional FR quality models, the intrinsic quality of the corrupted reference is incorporated to accurately infer the quality.

3) We show that the proposed framework outperforms the state-of-the-art methods in the application domain of UGC video processing.
While the field of UGC video coding and processing is still quickly evolving, we also envision the future perceptual UGC video compression scheme based on the proposed quality measure.

II. RELATED WORKS
A. Objective VQA Measures

1) Full-Reference VQA:
FR VQA algorithms deliver robust and accurate predictions based on fully accessible reference information. In [1], a hysteresis effect in subjective testing is observed, and a hysteresis-based temporal pooling strategy is applied to extend image quality assessment (IQA) metrics such as PSNR and SSIM [2] to VQA, which has been shown to be better than average pooling. In [3] and [4], video quality measures were designed based on structural features. Lu et al. [5] described the degradation of video quality via spatiotemporal 3D gradient differencing. Moreover, a VQA algorithm based on statistical characteristics of optical flow was proposed in [6]. In [7], a spatio-spectrally localized multiscale framework for evaluating dynamic video fidelity by motion quality along computed motion trajectories was presented. In [8], ViS3 estimates quality via separate predictions of the perceived degradation originating from spatial distortion as well as joint spatial and temporal distortion. Machine learning has also played a critical role in the development of modern VQA models. In [9], several perceptually relevant features and methods are combined by a random forest regression algorithm to boost performance. In [10], video multi-method assessment fusion (VMAF) produces remarkably improved quality prediction performance by mapping multiple features to human quality opinions using a support vector regressor (SVR). Motivated by the great success of convolutional neural networks (CNN) on numerous visual analysis tasks, a DNN-based approach was developed by joint learning of local quality and local weights in [11], and a pairwise-learning framework was proposed in [12] to train a perceptual image-error metric. Kim et al. [13] quantified spatio-temporal visual perception via a CNN and a convolutional neural aggregation network, and Zhang et al. [14] proposed a FR VQA metric by integrating transfer learning with a CNN.
2) No-Reference VQA:
NR VQA is a more natural and preferable way to assess perceived video quality, as the reference videos are unavailable in many practical video applications. Many methods focus on estimating the perceived quality of videos with specific distortions, such as compression distortion [15], transmission error [16] and scaling artifacts [17]. For distortion-unaware NR VQA methods, natural scene statistics (NSS) or natural video statistics (NVS) models are usually used, as they are sensitive to diverse distortions. Saad et al. [18] proposed an NR VQA algorithm, known as VBLIINDS, which contains an NSS model and a motion model that quantifies motion coherency. Mittal et al. [19] proposed a VQA model termed the video intrinsic integrity and distortion evaluation oracle (VIIDEO), which quantifies disturbances introduced by distortions according to an NVS model. In [20], the video content is disassembled into a predicted part and an uncertain part, such that their quality degradations are separately evaluated by an NVS model to yield the overall quality. Li et al. [21] proposed an NR-VQA metric based on NVS in the 3D discrete cosine transform (3D-DCT) domain. Recently, CNN-based NR-VQA methods have also been developed. Li et al. [22] proposed a shearlet- and CNN-based NR VQA method (SACONVA), where spatiotemporal features extracted by the 3D shearlet transform are fed to a CNN to predict a perceptual quality score. Liu et al. [23] exploited a 3D-CNN model for codec classification and quality assessment of compressed videos. In [24], an NR VQA framework based on weakly supervised learning with a CNN and a resampling strategy was presented. Li et al. [25] proposed an NR framework for in-the-wild videos by incorporating content dependency and temporal-memory effects. Moreover, generative networks have also been used to predict the quality map or the source given the distorted image to help the blind IQA task [26], [27].
B. VQA Databases
There are several publicly available video databases for VQA. LIVE [28] collects 10 uncompressed high-quality videos as reference videos, from which 150 distorted videos were created using four different distortion types and strengths. LIVE Mobile [29] consists of 200 distorted videos created from 10 RAW HD reference videos, and dynamically varying distortions are also considered. In MCL-JCV [30], a compressed VQA database was created based on the just noticeable difference (JND) model. CVD2014 [31] contains a total of 234 videos recorded using 78 different cameras, along with open-ended quality descriptions such as sharpness, graininess and color balance provided by the observers. LIVE-Qualcomm [32] consists of 208 videos captured using 8 different mobile devices, modeling six common in-capture distortion categories. KoNViD-1k [33] is a subjectively annotated VQA database containing 1,200 public-domain videos fairly sampled from a large public video database, YFCC100M. LIVE-VQC [34] contains 585 videos captured using 101 different devices with a wide range of distortion levels.

Apparently, databases with high-quality source videos such as LIVE and LIVE Mobile may not align with the UGC application scenarios, where databases with acquisition distortion such as CVD2014, LIVE-Qualcomm and KoNViD-1k are more realistic. In [35], the LIVE Wild Compressed Picture Quality Database was constructed, where images with acquisition distortions are further compressed. However, a database dedicated to UGC videos that considers UGC video compression remains absent, and there is a strong desire for an adequate database that simulates the UGC production chain from acquisition to processing on the hosting platform.
III. UGC VIDEO DATABASE
A. Video Collection
To cover typical content and characteristics representing UGC videos, 400 videos are randomly selected from the videos uploaded to TikTok [36] that meet the following criteria:
• With a resolution of 1280 × 720 (width × height);
• Belonging to the category of selfie, indoor, outdoor or screen content;
• Lasting longer than 10 seconds;
• Played at 30 frames per second (FPS).

Since 720p is one of the most widely adopted UGC video formats, we ensure that all selected videos share this resolution. Most videos can be classified into one of the selfie, indoor, outdoor and screen content categories. In particular, most areas of selfie videos are occupied by human faces, and screen content videos are mainly game screen recordings. Moreover, indoor videos are life scenes shot in close-up, and outdoor videos are outdoor scenery acquired with a distant view. A few videos with special content have been filtered out. Since we will crop all these videos to 10 seconds, videos shorter than 10 seconds are not considered.
B. Video Sampling
Subsequently, we sample the videos selected in the previous step according to their statistical characteristics to obtain the final source videos. Specifically, three attributes, including spatial perceptual information (SI), temporal perceptual information (TI) and a blur index, are employed. Among these indicators, SI and TI are highly correlated with the levels of distortion when the video is transmitted in a lossy fashion, as suggested in [37]. Since UGC videos uploaded by users are usually accompanied by varying degrees of blur artifacts which significantly affect perceptual quality, the blur metric is also included.
SI:
SI quantifies the spatial complexity and variety of a video, and it is defined as the maximum standard deviation over all Sobel-filtered frames,

SI = \max_{time} \{ std_{space} [ Sobel(F_n) ] \}   (1)

where F_n represents frame n, Sobel(·) is the Sobel filter and std_{space} represents the standard deviation over space.
TI:
TI quantifies the temporal changes of a video, and it is given by the maximum standard deviation of the frame difference derived from adjacent frames. As such, it can be formulated as follows,
TI = \max_{time} \{ std_{space} [ M_n(i, j) ] \}   (2)

M_n(i, j) = F_n(i, j) − F_{n−1}(i, j)   (3)

where F_n(i, j) is the pixel value at (i, j) of frame n.
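For concreteness, the SI and TI computations in Eqs. (1)-(3) can be reproduced with a few lines of Python; the sketch below assumes the frames have already been decoded into grayscale float arrays (the Sobel magnitude and the frame handling are illustrative choices, not code from this work).

```python
import numpy as np
from scipy import ndimage

def compute_si_ti(frames):
    """SI/TI per Eqs. (1)-(3); `frames` is a list of 2-D grayscale arrays."""
    si_values, ti_values = [], []
    for n, frame in enumerate(frames):
        # Sobel gradient magnitude of frame F_n
        gx = ndimage.sobel(frame, axis=0)
        gy = ndimage.sobel(frame, axis=1)
        si_values.append(np.hypot(gx, gy).std())
        if n > 0:
            # M_n(i, j) = F_n(i, j) - F_{n-1}(i, j)
            ti_values.append((frame - frames[n - 1]).std())
    return max(si_values), max(ti_values)  # max over time of the spatial std
```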
Blur: The cumulative probability of blur detection (CPBD) indicator [38] is adopted here to evaluate the level of blur. The average CPBD value of the sampled frames is used to indicate the blurriness of the video.

Fig. 1. Distribution of SI and TI indices for UGC videos. (a) 400 videos before sampling; (b) 50 videos after sampling.
Before sampling from these videos, we crop them to 10 seconds and remove the audio tracks. To make the characteristics of the sampled videos uniformly distributed in terms of these features, we adopt the sampling strategy introduced in [39]. In particular, the original videos are characterized by a set S,

S = \{ q_i \,|\, q_i \in \mathbb{R}^M, q_i \sim D_S^M \}_{i=1}^{K}   (4)

where M and K represent the number of features and videos, respectively (here M = 3 and K = 400). The main objective is to select a subset of N videos,

s = \{ \hat{q}_i \,|\, \hat{q}_i \in S, \hat{q}_i \sim D_s^M \}_{i=1}^{N}   (5)

with the uniform distribution D \in \mathbb{R}^{H \times M} (each of its columns D_{*j} denoting the probability mass function across the j-th dimension, which is quantized into H bins). As such, we introduce a set of M binary matrices B = \{ B^m \}_{m=1}^{M}, in which b_{ij}^m denotes whether or not the j-th item of S belongs to the i-th interval of the target PMF for dimension m, and a binary vector x \in \mathbb{Z}^K, where x_i is the decision variable determining whether the i-th item of S belongs to the subset s. As such, this problem can be formulated as follows,

\min_x \sum_{m=1}^{M} \| B^m x − N D_{*m} \|   s.t.   \|x\|_1 = N.   (6)

By finding the best solution of this optimization objective, a subset that is closest to the uniform distribution on all features can be sampled from the original database. Finally, 12, 13, 13 and 12 videos were chosen from the selfie, indoor, outdoor and screen content videos, respectively, using this sampling strategy. Fig. 1 shows the plots of SI against TI for the 400 videos before sampling and the 50 videos after sampling; it is apparent that the sampled videos span a wide range of the SI-TI space. Moreover, snapshots of some sampled videos from each content category are shown in Fig. 2.

Fig. 2. Examples of videos in the database. (a)-(b): selfie videos; (c)-(d): indoor videos; (e)-(f): outdoor videos; (g)-(h): screen content videos.
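Eq. (6) is a binary integer program; a simple way to approximate it in practice is a greedy swap search over the selection mask, sketched below (the bin count, iteration budget and swap heuristic are assumptions for illustration, not the authors' exact solver).

```python
import numpy as np

def uniformity_cost(feats, mask, n_select, n_bins, edges):
    # sum over features of || B^m x - N * D_{*m} || with a uniform target PMF
    return sum(
        np.linalg.norm(np.histogram(feats[mask, m], bins=edges[m])[0]
                       - n_select / n_bins)
        for m in range(feats.shape[1]))

def greedy_sample(feats, n_select, n_bins=10, iters=2000, seed=0):
    """Approximate Eq. (6): swap items in/out of the subset whenever the
    selected features move closer to a uniform distribution."""
    rng = np.random.default_rng(seed)
    K, M = feats.shape
    edges = [np.linspace(feats[:, m].min(), feats[:, m].max(), n_bins + 1)
             for m in range(M)]
    mask = np.zeros(K, dtype=bool)
    mask[rng.choice(K, n_select, replace=False)] = True
    best = uniformity_cost(feats, mask, n_select, n_bins, edges)
    for _ in range(iters):
        i = rng.choice(np.flatnonzero(mask))    # candidate to drop
        j = rng.choice(np.flatnonzero(~mask))   # candidate to add
        mask[i], mask[j] = False, True
        cost = uniformity_cost(feats, mask, n_select, n_bins, edges)
        if cost < best:
            best = cost
        else:
            mask[i], mask[j] = True, False      # revert a non-improving swap
    return np.flatnonzero(mask)                 # indices of selected videos
```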
C. Video Transcoding

Considering that our primary goal in building this database is to simulate the UGC production chain from acquisition to processing on the hosting platform, based on which quality assessment algorithms can be developed in an effort to further improve transcoding performance, we further transcode the sampled source videos using different codecs and compression levels. More specifically, the H.264/AVC [40] encoder x264 [41] and the HEVC [42] encoder x265 [43] are used to simulate the transcoding process in the hosting platforms, and five commonly used quantization parameters (QPs), 22, 27, 32, 37 and 42, are used to control the quality level of the transcoded videos for each codec. As such, each source video is transcoded into 10 corresponding versions. In total, there are 550 videos in our UGC-VIDEO database, including the source videos.
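This transcoding matrix can be reproduced with off-the-shelf tools; a minimal sketch driving ffmpeg's libx264/libx265 wrappers at fixed QPs is given below (the container, file naming and exact encoder flags are assumptions, as the paper does not specify its encoder invocation).

```python
import subprocess
from pathlib import Path

QPS = [22, 27, 32, 37, 42]
CODECS = {"libx264": "h264", "libx265": "hevc"}

def transcode_all(src: Path, out_dir: Path) -> None:
    """Produce the 10 transcoded versions of one source video
    (2 codecs x 5 QPs), mirroring the database construction above."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for codec, tag in CODECS.items():
        for qp in QPS:
            dst = out_dir / f"{src.stem}_{tag}_qp{qp}.mp4"
            subprocess.run(
                ["ffmpeg", "-y", "-i", str(src),
                 "-c:v", codec, "-qp", str(qp),
                 "-an",              # the database videos have audio removed
                 str(dst)],
                check=True)
```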
D. Subjective Testing and Analyses
After collecting the videos, subjective testing is conducted to obtain subjective scores using the absolute category rating with hidden reference (ACR-HR) [44] method, in which the videos are played one by one and the subjects are asked to provide an opinion score according to the five-grade rating scale. The full database is divided into three sessions, each containing 16 or 17 source videos along with their respective transcoded versions. Hence, each session lasts about half an hour to minimize viewer fatigue. In particular, at the beginning of each session, "dummy presentations" with various levels of perceptual quality are introduced to stabilize the opinions of the subjects, and the opinion data of these presentations are not taken into account in the final results of the experiment. The videos are displayed at their original resolution without scaling, and the subjects are required to click the corresponding button within a few seconds to choose from "Excellent", "Good", "Fair", "Poor" and "Bad", corresponding to scores of 5 to 1.

TABLE I: Performance comparisons of quality assessment algorithms in terms of SROCC.

Methods    Selfie  Indoor  Outdoor  Screen  Full database
BRISQUE    0.436   0.327   0.580    0.346   0.354
NIQE       0.511   0.480   0.453    0.128   0.314
VIIDEO     0.113   0.348   0.218    0.026   0.085
BLIINDS    0.382   0.386   0.051    0.462   0.175
PSNR       0.715   0.700   0.664    0.489   0.612
VIF        0.837   0.803   0.807    0.629   0.736
SSIM       0.842   0.798   0.857    0.464   0.714
MS-SSIM    0.821   0.783   0.842    0.507   0.722
SpEED-QA   0.839   0.747   0.838    0.746   0.786
ViS3       0.762   0.706   0.823    0.699   0.746
VMAF       0.823   0.821   0.856    0.825   0.814

TABLE II: Performance comparisons of quality assessment algorithms in terms of PLCC.

Methods    Selfie  Indoor  Outdoor  Screen  Full database
BRISQUE    0.416   0.346   0.611    0.328   0.315
NIQE       0.509   0.511   0.520    0.056   0.176
VIIDEO     0.251   0.178   0.326    0.032   0.157
BLIINDS    0.415   0.421   0.001    0.464   0.216
PSNR       0.717   0.733   0.639    0.452   0.579
VIF        0.862   0.820   0.850    0.633   0.626
SSIM       0.866   0.847   0.857    0.590   0.769
MS-SSIM    0.845   0.841   0.865    0.626   0.773
SpEED-QA   0.748   0.671   0.730    0.724   0.673
ViS3       0.787   0.744   0.872    0.754   0.783
VMAF       0.884   0.886   0.907    0.830   0.863

Fig. 3. Performance of FR metrics on the proposed database. (a) High quality reference video (MOS: 4.43); (b) HEVC transcoded video of (a) (MOS: 4.11, PSNR: 41.64 dB, SSIM: 0.996); (c) low quality reference video (MOS: 3.21); (d) HEVC transcoded video of (c) (MOS: 2.96, PSNR: 41.73 dB, SSIM: 0.977).

Fig. 4. Framework of the proposed objective quality assessment method. The quality maps from the source videos as well as the comparisons between source and transcoded videos are fused with a pooling network to obtain the final quality.

Following the subject screening procedure in [45], we first determine whether the scores for each test presentation are normally distributed. The score range of each video is then computed as 2 or \sqrt{20} standard deviations from the mean score, according to whether the scores are normally distributed. For each subject i, we count the number of scores above and below this range, denoted as P_i and Q_i. As such, subject i is rejected when

(P_i + Q_i) / (J K) > 0.05   and   | (P_i − Q_i) / (P_i + Q_i) | < 0.3   (7)

where J is the number of versions for each source and K denotes the number of source videos. Based on our analysis, no subject was rejected at this stage.
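A sketch of this screening step follows; the kurtosis-based normality check and the 0.05/0.3 thresholds mirror the standard BT.500 procedure [45], and the array layout (subjects × presentations) is an assumption.

```python
import numpy as np
from scipy.stats import kurtosis

def screen_subjects(scores, J, K):
    """Apply the rejection rule of Eq. (7).
    `scores` has shape (num_subjects, num_presentations)."""
    mean = scores.mean(axis=0)
    std = scores.std(axis=0, ddof=1)
    # BT.500: kurtosis in [2, 4] is treated as normally distributed
    beta2 = kurtosis(scores, axis=0, fisher=False)
    width = np.where((beta2 >= 2) & (beta2 <= 4), 2.0, np.sqrt(20.0))
    hi, lo = mean + width * std, mean - width * std
    rejected = []
    for i, s in enumerate(scores):
        p = int(np.sum(s > hi))   # P_i: scores above the range
        q = int(np.sum(s < lo))   # Q_i: scores below the range
        if (p + q) / (J * K) > 0.05 and abs(p - q) / (p + q) < 0.3:
            rejected.append(i)
    return rejected
```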
E. Performance of Existing Models

We evaluate the performance of several objective quality assessment algorithms on the established database using the Spearman rank-order correlation coefficient (SROCC) and the Pearson linear correlation coefficient (PLCC); the larger the values of SROCC and PLCC, the better the performance. Besides, before computing PLCC, the predicted scores are passed through a logistic non-linear regression as suggested in [46]:

f(x) = \beta_1 ( 1/2 − 1/(1 + e^{\beta_2 (x − \beta_3)}) ) + \beta_4 x + \beta_5.   (8)

IQA algorithms are extended to VQA methods by averaging frame-level quality scores. The tested quality measures include PSNR, SSIM [2], MS-SSIM [47], VIF [48], SpEED-QA [49], ViS3 [8], VMAF [10], VBLIINDS [18] and VIIDEO [19]. Table I and Table II tabulate the SROCC and PLCC between the algorithm scores and MOS for each content category, as well as across the full database. It is disappointing to find that the existing algorithms may not be able to provide adequate predictions on the UGC videos. However, these results still provide some useful insights that could benefit the design of UGC VQA models. First, as illustrated in Fig. 3, quality estimation based on comparisons against the reference can be problematic due to the corrupted reference. This suggests the importance of including the intrinsic quality of the reference. Second, most of the tested algorithms perform worst on screen content videos, which may be attributed to the particularity of this type of video compared to natural videos. These observations motivate a specifically designed VQA model that incorporates the intrinsic quality of the corrupted reference as well as a data-driven model for learning the statistics of the video content.
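This evaluation protocol is straightforward to reproduce with scipy: SROCC is computed on the raw predictions, while PLCC is computed after fitting the five-parameter logistic of Eq. (8) (the initial parameter guesses below are ad hoc).

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic5(x, b1, b2, b3, b4, b5):
    # Eq. (8): f(x) = b1*(1/2 - 1/(1 + exp(b2*(x - b3)))) + b4*x + b5
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def evaluate(pred, mos):
    """Return (SROCC, PLCC) with the logistic mapping applied before PLCC."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    srocc = spearmanr(pred, mos).correlation
    p0 = [np.ptp(mos), 0.1, np.mean(pred), 0.0, np.mean(mos)]  # ad hoc init
    params, _ = curve_fit(logistic5, pred, mos, p0=p0, maxfev=10000)
    plcc = pearsonr(logistic5(pred, *params), mos)[0]
    return srocc, plcc
```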
IV. OBJECTIVE QUALITY ASSESSMENT
A. Framework
As illustrated in Fig. 4, the proposed UGC video quality assessment framework leverages the intrinsic quality of the source videos as well as the comparisons between the source and transcoded videos. Instead of straightforwardly obtaining a quality score of the source videos using NR algorithms, we propose to learn and fuse intermediate quality maps, which meaningfully indicate the spatially variant quality of different regions. The inferred quality maps are fed to a pooling network, such that the local distortions are aggregated in a data-driven manner for final quality prediction. As such, the proposed quality evaluation framework consists of three main modules: a generative network G_\phi that generates the quality maps of the source videos, an evaluator E_\omega which produces the relative quality maps between the source and transcoded videos, and a regression network f_\theta that fuses the quality maps to obtain the final quality score. These modules are parameterized by \phi, \omega and \theta, respectively.

Given a source video V_s and its transcoded version V_t, we first predict the quality maps M_s of V_s using G_\phi:

M_s^i = G_\phi(I_s^i)   (9)

where I_s^i is the i-th frame of V_s and M_s^i represents its corresponding quality map. Meanwhile, the relative perceptual degradation between V_s and V_t can also be measured,

M_t^i = E_\omega(I_s^i, I_t^i)   (10)

where I_t^i and M_t^i are the i-th frame of V_t and its corresponding quality map, respectively. Finally, a quality pooling network f_\theta concatenates the quality maps, which deliver the intrinsic quality of the source videos as well as the relative quality between the source and transcoded videos:

\hat{S} = f_\theta( G_\phi(V_s), E_\omega(V_s, V_t) )   (11)

where \hat{S} is the predicted score of the transcoded video.

B. Quality Maps Generation Based on V_s and V_t

In the hosting platform, V_s is further transcoded into V_t, such that the difference between them originates from compression artifacts. As such, given V_s and V_t, to evaluate the relative distortion between them, we leverage existing quality metrics including SSIM [2], MDSI [50] and VIF [48], which reflect the local distortion from I_s^i to I_t^i from the perspectives of structure, gradient and visual information, respectively. Regarding SSIM, only the luminance component is considered, and the derived single-channel luminance similarity map of each frame pair is used as the SSIM quality map. With respect to MDSI, the combination of the gradient similarity map and the chromaticity similarity map is used as the MDSI map. Since VIF is a multi-scale method, only the VIF map derived from the frames at the original size is adopted. These quality maps are shown in Fig. 5, which implies that the adopted quality maps predict the visual quality well. The values in the quality maps are normalized to the range [0, 1] to facilitate subsequent training of the DNN.

C. Quality Maps Generation from Source Video V_s

Given the source video V_s, we aim at blindly estimating the quality map of each frame, since the pristine reference is not available. We adopt deep neural networks to ensure robust and accurate quality map prediction. In particular, ResNet [51] is employed with the consideration that residual connections make the training of the identity function easier, which gradually facilitates the adding of distortions from low level to high level.
The detailed architecture of the generative network is shown in Fig. 6. More specifically, the quality maps of the input frame are predicted after 10 identical residual blocks, each of which contains two 3 × 3 convolutional layers with batch normalization [52]. Since no pristine version of V_s is available for training, the pristine images of the Waterloo Exploration Database [53] are adopted, and multiple distortion stages are applied to these pristine images. Gaussian blur or Gaussian noise of different levels is injected into the pristine image, and subsequently these distorted images are compressed at certain compression levels by JPEG or JPEG2000 compression. The images after compression are used as training inputs, and their quality maps are used as ground-truth labels. As described in Section IV-B, different quality maps derived from existing FR methods can be adopted as training labels. Different quality maps predicted by the generative network and their corresponding ground-truth labels are shown in Fig. 7. The generative networks trained on the Waterloo database are then applied to our database to generate the quality maps of V_s, as shown in Fig. 8.

The loss function for the generative network consists of a structural loss characterized by SSIM and a pixel-wise loss, as introduced in [54], and is given by

L_G(P_k, \hat{P}_k) = \alpha \cdot L_{SSIM}(P_k, \hat{P}_k) + (1 − \alpha) \cdot L_{L1}(P_k, \hat{P}_k)   (12)

where P_k is the ground-truth quality map patch, \hat{P}_k is the corresponding generated patch, and \alpha is an empirically set weighting factor. The structural loss based on SSIM is formulated as

L_{SSIM}(P_k, \hat{P}_k) = 1 − SSIM(P_k, \hat{P}_k).   (13)
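A minimal PyTorch sketch of the loss in Eqs. (12)-(13) is shown below; the box-window SSIM approximation and the weight alpha = 0.84 (the value used in [54]) are assumptions, since the paper only states that alpha is set empirically.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=11):
    """Single-scale SSIM with a uniform window (a simplification of [2])."""
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, 1, pad)
    mu_y = F.avg_pool2d(y, win, 1, pad)
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def generator_loss(pred, gt, alpha=0.84):
    l_ssim = 1.0 - ssim_map(pred, gt).mean()    # Eq. (13)
    l_l1 = torch.abs(pred - gt).mean()          # pixel-wise L1 term
    return alpha * l_ssim + (1 - alpha) * l_l1  # Eq. (12)
```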
D. Quality Map Pooling

Fig. 5. Quality maps generated based on V_s and V_t. (a) One frame from V_s; (b)-(e) HEVC transcoded versions from V_s with QP 27, 32, 37 and 42; (f)-(i) corresponding SSIM maps; (j)-(m) corresponding MDSI maps; (n)-(q) corresponding VIF maps.

Fig. 6. The architecture of the generative network that produces the quality maps from V_s. Blue box: a convolutional layer Conv(d, f, s, p) with d filters of size f × f, a stride of s and a padding of p; yellow box: ReLU layer; gray box: batch normalization layer.

After the generation of quality maps from the source video and the transcoded video, a pooling network is trained to fuse these quality maps and generate a final quality score. In general, convolutional networks have been widely used to progressively reduce the resolution of feature maps, while such loss of spatial acuity may limit the performance. In our framework, a dilated residual network (DRN) [55] is employed, in which dilated convolutions are used to increase the resolution of the output feature maps without reducing the receptive field of individual neurons. As shown in Fig. 9, each set of input maps flows through independent convolutional layers, and the feature maps are concatenated after the first convolutional layer. Four dilated residual structures with 3 × 3 kernels follow, and the aggregated features are regressed to the final score through average pooling and a fully connected layer. The network is trained with the regression loss

L_{REG} = \| f_\theta( G_\phi(I_s^i), E_\omega(I_s^i, I_t^i) ) − S \|^2   (14)

where S denotes the human evaluation of V_t, and the video-level score is used as the training label for each quality map pair. Scores of all sampled frames are pooled at the sequence level by average pooling. In this step, 30 frames are uniformly sampled from each video to train the model.
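A compact PyTorch sketch of such a two-branch dilated pooling network follows; the channel widths, block count and dilation schedule are illustrative assumptions rather than the exact configuration of Fig. 9.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Residual block whose 3x3 convolutions use dilation so the receptive
    field grows without shrinking the feature maps [55]."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(self.body(x) + x)

class PoolingNet(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # independent stems for source-video and transcoded-video quality maps
        self.stem_s = nn.Conv2d(1, ch, 3, padding=1)
        self.stem_t = nn.Conv2d(1, ch, 3, padding=1)
        self.trunk = nn.Sequential(
            DilatedBlock(2 * ch, 1), DilatedBlock(2 * ch, 2),
            DilatedBlock(2 * ch, 2), DilatedBlock(2 * ch, 4))
        self.head = nn.Linear(2 * ch, 1)

    def forward(self, map_s, map_t):
        x = torch.cat([self.stem_s(map_s), self.stem_t(map_t)], dim=1)
        x = self.trunk(x).mean(dim=(2, 3))   # global average pooling
        return self.head(x).squeeze(1)       # per-frame score for Eq. (14)
```

Per-frame scores are then averaged over the 30 sampled frames to obtain the sequence-level prediction.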
V. EXPERIMENTAL RESULTS

A. Experimental Settings

1) Database:
Due to the lack of databases that align with the UGC application scenario, in particular from acquisition to processing on the hosting platform, the newly introduced UGC-VIDEO database described in Section III is used to evaluate our proposed method.
2) Compared methods:
Both FR and NR quality assessment algorithms are applicable to the quality assessment of UGC videos. In particular, for the FR methods, the source videos with various quality levels are used as references. The NR methods can be directly applied to the transcoded videos to evaluate the quality of the compressed videos. Effective FR and NR methods with high generalization capability are used for comparison, including PSNR, SSIM, MS-SSIM, VIF, NIQE, BRISQUE, VBLIINDS, VIIDEO and VMAF. In addition, the 2stepQA [35] method combining FR and NR models is also considered, which serves as a flexible framework based on different combinations of FR and NR methods.
Fig. 7. Illustration of the predicted quality maps and the corresponding ground-truth maps. (a)(h) distorted images with multiple distortions; (b)(i) ground-truth SSIM maps; (c)(j) predicted SSIM maps; (d)(k) ground-truth MDSI maps; (e)(l) predicted MDSI maps; (f)(m) ground-truth VIF maps; (g)(n) predicted VIF maps.

Fig. 8. Illustration of the predicted quality maps in our database. (a)(e) frames of V_s; (b)(f) predicted SSIM maps of V_s; (c)(g) predicted MDSI maps of V_s; (d)(h) predicted VIF maps of V_s.

Fig. 9. Detailed architecture of the pooling network. Blue box: a convolutional layer Conv(n, f, s, p, d) with n filters of size f × f, a stride of s, a padding of p and a dilation of d; red box: average pooling layer; yellow box: fully connected layer. There are batch normalization and ReLU layers after each convolutional layer, which are omitted here for simplification.

B. Training Details
The training process consists of two steps: (1) training the generative network on the modified Waterloo Exploration Database; (2) training the pooling network on UGC-VIDEO. In the original Waterloo Exploration Database [53], 94,880 distorted images are created from 4,744 pristine natural images by introducing four types of distortion (blur, noise, JPEG and JPEG2K), each with five levels. To enable the generative network to capture mixed distortions similar to those in the source UGC videos, we develop a new way to generate the distorted images. More specifically, noise or blur distortions of random level are first introduced to the pristine images, and subsequently compression distortion is injected by JPEG or JPEG2000 with a random compression level. As such, 4,744 distorted images with multiple distortions are created. VIF quality maps are calculated from the distorted images and the corresponding pristine images. Both the distorted images and their quality maps are cropped into 64 × 64 non-overlapping patches. The generative network is trained on these inputs (patches from the distorted images) and labels (corresponding patches from the VIF quality maps) using the Adam optimizer [56] for 100 epochs.

Subsequently, the pooling network is trained using the pre-trained generative network model, and the score is regressed from the quality maps. Once each quality map is derived from the generative network, we freeze the weights of the generative networks and train the pooling network using the MSE loss and the Adam optimizer. Through the dilated residual blocks and average pooling, the quality map pairs with the fully connected layers yield the final score.
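The mixed-distortion synthesis can be sketched as follows with Pillow; the distortion parameter ranges are assumptions, and JPEG 2000 is omitted for brevity (Pillow supports it only when built with OpenJPEG).

```python
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    """One random blur-or-noise stage followed by JPEG compression,
    mimicking the mixed-distortion pipeline described above."""
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 3.0)))
    else:
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0.0, random.uniform(2.0, 15.0), arr.shape)
        img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(10, 60))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```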
TABLE III: Mean and standard deviation of performance values (SROCC, PLCC, RMSE) of various FR and NR methods in 20 runs on the UGC-VIDEO database, i.e., mean (± std).

TABLE IV: Mean and standard deviation of performance values (SROCC, PLCC, RMSE) of the 2stepQA model using different combinations of FR and NR methods in 20 runs on the UGC-VIDEO database, i.e., mean (± std).

C. Performance Comparisons
To ensure fair comparisons with existing conventional and learning-based methods, the full database is randomly divided into non-overlapping 60% training, 20% validation and 20% test sets, according to the content of the source videos. Conventional quality measures which are not learning-based, i.e., PSNR, SSIM, VIF, NIQE, BRISQUE, VIIDEO and VBLIINDS, are directly evaluated on the 20% testing data after the parameters in Eqn. (8) are optimized with the training and validation data. For the 2stepQA method, the training and validation sets are merged to train the relevant parameters. For our method, the model with the highest SROCC on the validation set during training is chosen for testing. This procedure is repeated 20 times, and all of the above methods are tested on the same 20% test sets. In particular, the mean and standard deviation of the performance values are reported.

Table III shows the performance of the conventional methods. It is apparent that the FR algorithms perform better than the NR algorithms, and VMAF performs best by combining different metrics. Moreover, the 2stepQA performances with different combinations of FR and NR models are shown in Table IV; we can see that the performance of the reference algorithms is improved in most cases. However, due to the simplicity of the 2stepQA model and the lack of efficient NR models with high generalization capability, the 2stepQA method may degrade the performance of FR algorithms such as VIF and VMAF.

TABLE V: Mean and standard deviation of performance values of our proposed model using different combinations of quality maps, i.e., quality maps for the transcoded video + quality maps for the source video, in 20 runs on the UGC-VIDEO database, i.e., mean (± std).

Fig. 10. SROCC performance of the compared algorithms (full reference only, 2stepQA, and ours) over 20 trials on the UGC-VIDEO database.

The performance of the proposed framework is shown in Table V, where the VIF quality maps predicted by the generative network are used for the source video, and different quality maps calculated using existing FR methods are used as the quality maps of the transcoded videos. We can observe that our method is superior to the FR algorithms, with a larger performance improvement than the 2stepQA method. It is worth mentioning that the VMAF quality map represents the concatenation of multiple types of quality maps, due to the fact that VMAF is a combination of multiple indicators. More specifically, the VIF quality map and the motion map are contained in the VMAF quality map, where the motion map is the luminance component difference calculated along consecutive frames.

To demonstrate the effectiveness of our framework, the average SROCC performance of the FR methods, of directly combining FR and NR scores (2stepQA), and of our method are compared in Fig. 10. For the SSIM method, both 2stepQA and our method greatly improve the accuracy of prediction: 2stepQA increases the SROCC from 0.729 to 0.804 by introducing the BRISQUE score of the source video, and our method increases the SROCC to 0.812 by combining the quality maps of the source video with the SSIM quality map of the transcoded video. For the VIF method, 2stepQA fails to improve the performance, whereas our method brings a great performance improvement. As can be seen from Fig. 10, our method significantly improves the performance of existing reference models, and exhibits higher and more reliable correlations with subjective quality than the direct combination of the FR and NR algorithms.
TABLE VI: Performance of the ablation study. Mean and standard deviation (std) of performance values in 20 runs. Setting 1: quality maps of the source video are removed; Setting 2: quality maps of the transcoded videos are removed.

Fig. 11. Box plot in the ablation studies. The × marks in the middle represent the average. The bottom, middle and top bounds of each box represent the 25%, 50% and 75% percentage points, respectively.

D. Ablation Studies
To further provide evidence regarding the effectiveness ofthe proposed framework, we have conducted ablation studiesby removing the quality maps of source and transcoded videos.
1) Absence of quality map for source video:
We first show the performance variation due to the removal of the quality maps of the source videos. In particular, these quality maps are replaced by source video frames, such that the frames of the source videos and the quality maps of the transcoded videos are fed to the pooling network.
2) Absence of quality map for transcoded video:
The performance variations due to the removal of the quality maps of the transcoded videos are further studied. In this manner, the quality maps of the source videos and the frames of the transcoded video are fed to the pooling network.

We compare the full version of our proposed method (red) with the configuration in which the source video quality maps are removed (green) and the configuration in which the transcoded video quality maps are removed (blue), as shown in Table VI and Fig. 11. The removal of either the source video quality maps or the transcoded video quality maps causes a significant performance drop, further verifying the effectiveness of the quality maps of the source and transcoded videos.

VI. CONCLUSIONS
In this paper, we have systematically studied the video quality of UGC content. To facilitate the development of VQA for UGC videos, we have constructed a new subjective quality database. This database contains diverse UGC video sources along with their transcoded versions under different compression standards and levels. The subjective ratings of these videos are also provided as the ground truth. Based on the observations from the developed database, we propose a new objective video quality model with the design philosophy that the quality prediction relies not only on the divergence between the source video and the transcoded video, but also on the intrinsic quality of the source videos. The experimental results show that our method outperforms the state-of-the-art quality assessment methods. The proposed VQA method is also envisioned to be further adopted to regularize the quality of the output UGC videos, in an effort to provide a new paradigm of quality-driven UGC video coding.
REFERENCES

[1] K. Seshadrinathan and A. C. Bovik, "Temporal hysteresis model of time varying subjective video quality," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, pp. 1153–1156.
[2] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[3] Z. Wang, L. Lu, and A. C. Bovik, "Video quality assessment based on structural distortion measurement," Signal Processing: Image Communication, vol. 19, no. 2, pp. 121–132, 2004.
[4] Y. Wang, T. Jiang, S. Ma, and W. Gao, "Novel spatio-temporal structural information based video quality metric," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 7, pp. 989–998, 2012.
[5] W. Lu, R. He, J. Yang, C. Jia, and X. Gao, "A spatiotemporal model of video quality assessment via 3D gradient differencing," Information Sciences, vol. 478, pp. 141–151, 2019.
[6] K. Manasa and S. S. Channappayya, "An optical flow-based full reference video quality assessment algorithm," IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2480–2492, 2016.
[7] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2009.
[8] P. V. Vu and D. M. Chandler, "ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices," Journal of Electronic Imaging, vol. 23, no. 1, p. 013016, 2014.
[9] P. G. Freitas, W. Y. Akamine, and M. C. Farias, "Using multiple spatio-temporal features to estimate video quality," Signal Processing: Image Communication, vol. 64, pp. 1–10, 2018.
[10] J. Y. Lin, T.-J. Liu, E. C.-H. Wu, and C.-C. J. Kuo, "A fusion-based video quality assessment (FVQA) index," in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014, pp. 1–5.
[11] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2017.
[12] E. Prashnani, H. Cai, Y. Mostofi, and P. Sen, "PieAPP: Perceptual image-error assessment through pairwise preference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1808–1817.
[13] W. Kim, J. Kim, S. Ahn, J. Kim, and S. Lee, "Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 219–234.
[14] Y. Zhang, X. Gao, L. He, W. Lu, and R. He, "Objective video quality assessment combining transfer learning with CNN," IEEE Transactions on Neural Networks and Learning Systems, 2019.
[15] K. Zhu, C. Li, V. Asari, and D. Saupe, "No-reference video quality assessment based on artifact measurement and statistical analysis," IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 4, pp. 533–546, 2014.
[16] F. Zhang, W. Lin, Z. Chen, and K. N. Ngan, "Additive log-logistic model for networked video quality assessment," IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1536–1547, 2012.
[17] D. Ghadiyaram, C. Chen, S. Inguva, and A. Kokaram, "A no-reference video quality predictor for compression and scaling artifacts," IEEE, 2017, pp. 3445–3449.
[18] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind prediction of natural video quality," IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
[19] A. Mittal, M. A. Saad, and A. C. Bovik, "A completely blind video integrity oracle," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 289–300, 2015.
[20] Y. Zhu, Y. Wang, and Y. Shuai, "Blind video quality assessment based on spatio-temporal internal generative mechanism," IEEE, 2017, pp. 305–309.
[21] X. Li, Q. Guo, and X. Lu, "Spatiotemporal statistics for video quality assessment," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3329–3342, 2016.
[22] Y. Li, L.-M. Po, C.-H. Cheung, X. Xu, L. Feng, F. Yuan, and K.-W. Cheung, "No-reference video quality assessment with 3D shearlet transform and convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 6, pp. 1044–1057, 2015.
[23] W. Liu, Z. Duanmu, and Z. Wang, "End-to-end blind quality assessment of compressed videos using deep neural networks," in ACM Multimedia, 2018, pp. 546–554.
[24] Y. Zhang, X. Gao, L. He, W. Lu, and R. He, "Blind video quality assessment with weakly supervised learning and resampling strategy," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[25] D. Li, T. Jiang, and M. Jiang, "Quality assessment of in-the-wild videos," in Proceedings of the 27th ACM International Conference on Multimedia. ACM, 2019, pp. 2351–2359.
[26] H. Ren, D. Chen, and Y. Wang, "RAN4IQA: restorative adversarial nets for no-reference image quality assessment," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[27] D. Pan, P. Shi, M. Hou, Z. Ying, S. Fu, and Y. Zhang, "Blind predicting similar quality map for image quality assessment," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6373–6382.
[28] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.
[29] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. de Veciana, "Video quality assessment on mobile devices: Subjective, behavioral and objective studies," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652–671, 2012.
[30] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, "MCL-JCV: a JND-based H.264/AVC video quality assessment dataset," IEEE, 2016, pp. 1509–1513.
[31] M. Nuutinen, T. Virtanen, M. Vaahteranoksa, T. Vuori, P. Oittinen, and J. Häkkinen, "CVD2014: A database for evaluating no-reference video quality assessment algorithms," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3073–3086, 2016.
[32] D. Ghadiyaram, J. Pan, A. C. Bovik, A. K. Moorthy, P. Panda, and K.-C. Yang, "In-capture mobile video distortions: A study of subjective behavior and objective algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2061–2077, 2017.
[33] V. Hosu, F. Hahn, M. Jenadeleh, H. Lin, H. Men, T. Szirányi, S. Li, and D. Saupe, "The Konstanz natural video database (KoNViD-1k)," IEEE, 2017, pp. 1–6.
[34] Z. Sinno and A. C. Bovik, "Large-scale study of perceptual video quality," IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 612–627, 2018.
[35] X. Yu, C. G. Bampis, P. Gupta, and A. C. Bovik, "Predicting the quality of images compressed after distortion in two steps," IEEE Transactions on Image Processing, 2019.
[36] TikTok. [Online]. Available: https://www.tiktok.com
[37] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, Geneva, 2008.
[38] N. D. Narvekar and L. J. Karam, "A no-reference image blur metric based on the cumulative probability of blur detection (CPBD)," IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2678–2683, 2011.
[39] V. Vonikakis, R. Subramanian, and S. Winkler, "Shaping datasets: Optimal data selection for specific target distributions across dimensions," IEEE, 2016, pp. 3753–3757.
[40] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[41] L. Merritt and R. Vanam, "x264: A high performance H.264/AVC encoder," [Online]. Available: http://neuron2.net/library/avc/overview_x264_v8_5.pdf, 2006.
[42] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[43] x265. [Online]. Available: https://www.videolan.org/developers/x265.html
[44] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, 1999.
[45] "Methodology for the subjective assessment of the quality of television pictures," Recommendation ITU-R BT.500-13, 2012.
[46] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
[47] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
[48] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.
[49] C. G. Bampis, P. Gupta, R. Soundararajan, and A. C. Bovik, "SpEED-QA: Spatial efficient entropic differencing for image and video quality," IEEE Signal Processing Letters, vol. 24, no. 9, pp. 1333–1337, 2017.
[50] H. Z. Nafchi, A. Shahkolaei, R. Hedjam, and M. Cheriet, "Mean deviation similarity index: Efficient and reliable full-reference image quality evaluator," IEEE Access, vol. 4, pp. 5579–5590, 2016.
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[52] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[53] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang, "Waterloo Exploration Database: New challenges for image quality assessment models," IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 1004–1016, 2016.
[54] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47–57, 2016.
[55] F. Yu, V. Koltun, and T. Funkhouser, "Dilated residual networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 472–480.
[56] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.