Deep Local and Global Spatiotemporal Feature Aggregation for Blind Video Quality Assessment
Wei Zhou, Student Member, IEEE, and Zhibo Chen, Senior Member, IEEE
Abstract—In recent years, deep learning has achieved promising success for multimedia quality assessment, especially for image quality assessment (IQA). However, since videos exhibit more complex temporal characteristics, very little work has been done on video quality assessment (VQA) by exploiting powerful deep convolutional neural networks (DCNNs). In this paper, we propose an efficient VQA method named Deep SpatioTemporal video Quality assessor (DeepSTQ) to predict the perceptual quality of various distorted videos in a no-reference manner. In the proposed DeepSTQ, we first extract local and global spatiotemporal features using pre-trained deep learning models, without fine-tuning or training from scratch. The composited features consider distorted video frames as well as frame difference maps from both global and local views. Then, the aggregated features are regressed onto perceptual video quality scores. Finally, experimental results demonstrate that our proposed DeepSTQ outperforms state-of-the-art quality assessment algorithms.
Index Terms—Blind video quality assessment, deep convolutional neural network, global and local feature extraction, spatiotemporal aggregation
I. INTRODUCTION
With the rapid growth of visual multimedia applications, evaluating the perceptual quality of multimedia data has attracted increasing attention in both academia and industry [1]. Compared with image quality assessment (IQA), assessing video quality is more challenging due to the additional temporal dimension. Moreover, viewed videos often contain various visually annoying distortions that are introduced along the processing chain of digital videos, including capture, compression, transmission, reconstruction, etc. Therefore, the construction of accurate video quality assessment (VQA) methods is significant for optimizing existing video services.

In general, the most reliable VQA method is to conduct subjective tests [2]. During subjective tests, human subjects are asked to watch videos and then provide quality ratings for these videos. However, subjective quality assessment is usually labor-intensive and time-consuming, which makes it inapplicable in practical scenarios. Thus, it is desirable to develop efficient objective VQA algorithms to predict the perceptual quality of videos automatically.

Depending on the availability of the original non-distorted videos, objective VQA methods can be generally classified into three categories, namely full-reference (FR) VQA, reduced-reference (RR) VQA, and no-reference/blind (NR) VQA models.
W. Zhou and Z. Chen are with the CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China ([email protected]; [email protected]). This work was supported in part by NSFC under Grants U1908209 and 61632001, and by the National Key Research and Development Program of China under Grant 2018AAA0101400.
Fig. 1. Examples of a distorted video frame and a frame difference map from the CSIQ video quality database [3]. (a) Current video frame; (b) frame difference map between the current frame and the previous frame.

The FR VQA methods need the full information of the pristine reference videos to perform quality assessment. Traditional FR IQA metrics, such as the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) index [4], multiscale SSIM (MS-SSIM) [5], and the feature similarity (FSIM) index [6], are designed for assessing the perceptual quality of images rather than more complex videos. In the literature, several FR VQA algorithms have been proposed, including the motion-based video integrity evaluation (MOVIE) index [7], the spatiotemporal most-apparent-distortion (ST-MAD) model [8], the algorithm for video quality assessment via analysis of spatial and spatiotemporal slices (ViS3) [3], the just noticeable difference-based video quality (JVQ) index [9], etc. The RR VQA methods require only part of the original content. Typical RR VQA approaches include the video quality model (VQM) [10], the spatiotemporal RR entropic differences (STRRED) algorithm [11], and so on. Contrary to FR and RR VQA models, the NR VQA methods assess perceptual video quality without any information about the original videos. Consequently, the NR VQA task is more attractive since the pristine reference content is not always accessible in practical applications.

Recently, several studies have been carried out on NR VQA methods. In [12], the codebook representation for no-reference image assessment (CORNIA) [13] is directly extended to NR video quality evaluation, where V-CORNIA is proposed based on frame-level unsupervised feature learning and hysteresis temporal pooling. Moreover, a spatiotemporal quality assessment model of natural video scenes in the discrete cosine transform (DCT) domain, i.e., the video blind image integrity notator using DCT statistics (V-BLIINDS) [14], is presented, which is derived from the image-based index called BLIINDS [15]. In [16], the video intrinsic integrity and distortion evaluation oracle (VIIDEO) is proposed by employing a variety of space-time statistical regularities and probing into the intrinsic properties of space-time band-pass video correlations.
Fig. 2. The framework of our proposed DeepSTQ method.

Additionally, the deep blind video quality assessment (DeepBVQA) method [17] is proposed based on spatial features extracted from pre-trained deep learning models together with hand-crafted temporal features. Nevertheless, none of the above-mentioned algorithms utilize both local and global spatiotemporal features extracted by pre-trained deep learning models.

Meanwhile, deep convolutional neural networks (DCNNs) have shown enormous advances for many image processing and computer vision tasks. Since existing off-the-shelf DCNNs are trained on large-scale image databases with diverse image content such as ImageNet [18], they have a remarkable ability to extract discriminative image feature representations for quality assessment. Different from other quality assessment algorithms that fine-tune or train models from scratch [19], exploiting the generic image feature representations extracted from pre-trained deep learning models is simple and efficient. Therefore, in this paper, we propose a blind perceptual video quality evaluation method based on local and global spatiotemporal features from distorted video frames and frame difference maps, which are extracted by off-the-shelf DCNNs.

Fig. 1 shows examples of an RGB distorted video frame and a frame difference map from the CSIQ video quality database [3]. Here, the frame difference map is the difference between the current video frame and the previous video frame. It should be noted that we add 128 to each pixel value in the frame difference map for better visualization. We can see that the distorted video frame represents the spatial texture characteristics, while the frame difference map reveals motion information in a sense.

We describe the details of our proposed method in Section II. Section III presents the experimental results. We conclude the paper in Section IV.

II. PROPOSED METHOD
In this section, we provide a detailed description of our proposed no-reference perceptual video quality evaluation method, named Deep SpatioTemporal video Quality assessor (DeepSTQ). The framework of the proposed DeepSTQ method is shown in Fig. 2. First, we generate distorted video frames and frame difference maps from both local and global views. Second, we utilize pre-trained deep learning models to extract multi-view spatiotemporal features. Finally, by aggregating the extracted features, we regress them onto perceptual quality scores.
A. Local and Global Spatiotemporal Representation
Considering that a distorted video is composed of many distorted video frames, we first generate distorted video frames to represent the spatial texture information of the entire video. Then, the frame difference map reflecting motion information is computed from the gray-scale distorted frames as follows:

F_{i+1} = |D_{i+1} − D_i|,   (1)

where D_{i+1} and D_i denote the current frame and the previous frame, respectively.

Since the global and local views of an image are both important for quality assessment, we take the whole image as the global view and an image patch as the local view. Specifically, Fig. 3 gives an example of the global and local views for a distorted video frame and a frame difference map. The global view reflects the entire information of the image, while the local view reveals a local description of the distortions.
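To make Eq. (1) and the view generation concrete, the following Python sketch is a minimal illustration assuming OpenCV and NumPy; the function names, the center-crop position, and the 224 × 224 patch size are our own illustrative choices rather than the exact implementation of the paper:

```python
import cv2
import numpy as np

def frame_difference_map(prev_frame, cur_frame):
    # Eq. (1): F_{i+1} = |D_{i+1} - D_i| on gray-scale versions of the frames.
    d_prev = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
    d_cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
    return np.abs(d_cur - d_prev).astype(np.uint8)

def global_and_local_views(image, patch_size=224):
    # Whole image as the global view; a single center crop stands in for the
    # local-view patch (the actual patch sampling strategy is not specified).
    h, w = image.shape[:2]
    top, left = (h - patch_size) // 2, (w - patch_size) // 2
    local_patch = image[top:top + patch_size, left:left + patch_size]
    return image, local_patch

def visualize_difference(diff_map):
    # The paper adds 128 to each pixel for visualization only; clip to 8 bits.
    return np.clip(diff_map.astype(np.int16) + 128, 0, 255).astype(np.uint8)
```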
B. Deep Feature Extraction and Aggregation

In our designed model, we employ the powerful residual network ResNet-50 [20] for deep feature extraction. The pool5 layer of this network is taken as the feature representation, which has 2048 dimensions. Moreover, since the input size of ResNet-50 is 224 × 224, [...] part of each specific distorted video. Finally, the well-known regression model, i.e., support vector regression (SVR), is applied to the aggregated features to predict the perceptual quality score.

Fig. 3. Global and local views of the distorted video frame and the frame difference map. (a) Current video frame and the corresponding video patch; (b) frame difference map between the current frame and the previous frame, as well as the corresponding frame difference patch.
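A minimal sketch of the feature extraction and aggregation stage follows, assuming PyTorch, torchvision, and scikit-learn are used (the paper does not specify its implementation framework; the helper names, the temporal averaging, and the concatenation of the four feature groups are our own illustrative assumptions consistent with the ablation settings in Table II):

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVR

# Pre-trained ResNet-50 without fine-tuning; dropping the final fc layer
# leaves the 2048-dimensional pool5 (global average pooling) output.
resnet = models.resnet50(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),  # global views are resized to the network input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def to_rgb(image):
    # Gray-scale difference maps are replicated to three channels so that
    # they match the RGB input of ResNet-50 (an assumption on our part).
    return np.stack([image] * 3, axis=-1) if image.ndim == 2 else image

@torch.no_grad()
def pool5_feature(image):
    # 2048-D pool5 feature for one frame, difference map, or patch.
    x = preprocess(to_rgb(image)).unsqueeze(0)
    return extractor(x).flatten().numpy()

def video_feature(frames, diff_maps, patches, diff_patches):
    # Average each feature group over time, then concatenate the four groups
    # (global/local x frame/difference) into one aggregated representation.
    groups = (frames, diff_maps, patches, diff_patches)
    return np.concatenate(
        [np.mean([pool5_feature(img) for img in g], axis=0) for g in groups])

# SVR regresses the aggregated features onto subjective scores (DMOS), e.g.:
# svr = SVR(kernel="rbf"); svr.fit(train_features, train_dmos)
```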
III. EXPERIMENTS
The experiments are conducted on the CSIQ video quality database [3], which consists of 12 pristine reference videos and 216 distorted videos. The distorted videos cover six distortion types, including H.264 compression, HEVC compression, motion JPEG compression, wavelet-based compression using the Snow codec, additive white noise, and H.264 videos subjected to packet loss to simulate wireless transmission loss. Each video in the database is in the YUV420 format with a resolution of 832 × 480
and a duration of 10 seconds. The video frame rate ranges from 24 fps to 60 fps. A subjective quality score is provided for each video as a difference mean opinion score (DMOS).

For the performance evaluation of VQA algorithms, we adopt the Spearman rank-order correlation coefficient (SROCC) and the Pearson linear correlation coefficient (PLCC). The SROCC measures prediction monotonicity, while the PLCC measures prediction accuracy; higher SROCC and PLCC values indicate better correlation with human subjective opinions. It should be noted that, before calculating the PLCC values of objective VQA algorithms, a nonlinear logistic function is applied to map the predicted quality scores to the same scale as the subjective quality scores. Here, we fit a four-parameter logistic function to the predicted quality scores for a better fit to the subjective ratings as follows:

Q(x) = (β_1 − β_2) / (1 + e^{(x − β_3)/β_4}) + β_2,   (2)

where β_1 to β_4 are four free parameters determined during the curve-fitting process, x denotes the raw objective score, and Q(x) is the mapped score after the nonlinear fitting.

We compare the performance of our proposed DeepSTQ method with state-of-the-art VQA algorithms on the CSIQ video quality database [3]. 80% of the database is randomly chosen for training and the remaining 20% is used for testing, with no overlap between the training and test sets. We repeat 1000 iterations of this random cross validation and then report the median SROCC and PLCC as the final results.

TABLE I
SROCC AND PLCC PERFORMANCE COMPARISON ON THE CSIQ VIDEO QUALITY DATABASE [3].

Types  Methods           SROCC   PLCC
FR     PSNR              0.5461  0.5339
FR     SSIM              0.6946  0.7093
FR     MS-SSIM           0.7530  0.6665
FR     FSIM              0.7392  0.7514
FR     MOVIE             0.8060  0.7880
FR     ST-MAD            0.7355  0.7237
FR     ViS3              0.8410  0.8300
FR     JVQ               0.6840  0.7005
RR     VQM               0.7890  0.7690
RR     STRRED            0.8129  0.7894
NR     V-CORNIA          0.8216  0.8315
NR     V-BLIINDS         0.8069  0.8228
NR     VIIDEO            0.6498  0.6704
NR     DeepBVQA          0.8472  0.8532
NR     Proposed DeepSTQ  0.8533  0.8578
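As an illustration of this evaluation protocol, the following SciPy-based sketch fits the four-parameter logistic of Eq. (2) and computes SROCC and PLCC; the initial parameter guesses are our own heuristic, not from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(x, b1, b2, b3, b4):
    # Four-parameter logistic mapping of Eq. (2).
    return (b1 - b2) / (1.0 + np.exp((x - b3) / b4)) + b2

def evaluate(objective_scores, dmos):
    # SROCC is rank-based, so it is computed on the raw objective scores;
    # PLCC is computed after the nonlinear mapping onto the DMOS scale.
    srocc = spearmanr(objective_scores, dmos)[0]
    p0 = [np.max(dmos), np.min(dmos),             # heuristic initial guesses
          np.mean(objective_scores), np.std(objective_scores) + 1e-6]
    params, _ = curve_fit(logistic4, objective_scores, dmos,
                          p0=p0, maxfev=10000)
    plcc = pearsonr(logistic4(objective_scores, *params), dmos)[0]
    return srocc, plcc
```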
Fig. 4. Median performance with respect to the training percentage over 1000 iterations on the CSIQ video quality database [3].

As shown in Table I, the proposed method is compared with 14 state-of-the-art quality assessment metrics, including PSNR, SSIM [4], MS-SSIM [5], FSIM [6], MOVIE [7], ST-MAD [8], ViS3 [3], JVQ [9], VQM [10], STRRED [11], V-CORNIA [12], V-BLIINDS [14], VIIDEO [16], and DeepBVQA [17]. DeepBVQA is a deep learning based approach that predicts the quality of distorted videos without reference. It should be noted that PSNR, SSIM, MS-SSIM, and FSIM are FR IQA metrics, which we compute for each video frame and then average over frames to derive the final performance. From Table I, we can see that our proposed DeepSTQ outperforms the state-of-the-art algorithms, which demonstrates the effectiveness of the proposed method.
TABLE II
ABLATION STUDY OF THE PROPOSED METHOD ON THE CSIQ VIDEO QUALITY DATABASE [3].

Methods                                             SROCC   PLCC
Distorted video frames                              0.7915  0.8031
Frame difference maps                               0.8113  0.8177
Distorted video frames + Frame difference maps      0.8224  0.8307
Distorted video patches                             0.8232  0.8298
Frame difference patches                            0.8175  0.8263
Distorted video patches + Frame difference patches  0.8503  0.8551
Proposed DeepSTQ                                    0.8533  0.8578
Fig. 4 shows the SROCC and PLCC performance variation with respect to the training percentage on the CSIQ video quality database [3]. We find that a larger amount of training data brings about a performance increase.

Furthermore, in order to investigate the specific contribution of each technical component of our proposed DeepSTQ, we present an ablation study on the CSIQ video quality database [3], as shown in Table II. For each method, we extract features from the same pool5 layer of the pre-trained ResNet-50 model and then carry out perceptual quality regression. First, we observe that the features extracted from frame difference maps perform better than those extracted from distorted video frames. One possible explanation is that motion information has more influence on perceptual video quality than spatial texture. Besides, using the combined features from distorted video frames and frame difference maps outperforms using either type of feature alone, which verifies the importance of the composited spatiotemporal features. Second, contrary to the previous observation, the performance of features extracted from distorted video patches is superior to that of frame difference patches. This is because some of the cropped patches of frame difference maps lose abundant texture details and appear as almost uniform gray images. Likewise, using the combined features from distorted video patches and frame difference patches also shows an advantage over using either type of feature alone. Finally, the local-view results are better than the global-view results under the same setting, which suggests that local discriminative features are more important for quality assessment. The proposed DeepSTQ, which uses composited spatiotemporal features from both global and local views, achieves the highest SROCC and PLCC values.
IV. CONCLUSION
In this paper, we present an efficient blind video quality evaluation method based on deep learning models. Specifically, we first exploit pre-trained off-the-shelf DCNNs to generate discriminative features of distorted video frames as well as frame difference maps from both local and global views. The proposed DeepSTQ considers the spatiotemporal characteristics of various distorted videos. We then use a regression model to aggregate the features and assess perceptual video quality. Experimental results demonstrate that our proposed DeepSTQ achieves superior performance compared with other state-of-the-art VQA algorithms. In the future, we plan to study 3D video quality evaluation based on deep learning.
REFERENCES

[1] Zhibo Chen, Wei Zhou, and Weiping Li, "Blind stereoscopic video quality assessment: From depth perception to overall experience," IEEE Transactions on Image Processing, vol. 27, no. 2, pp. 721–734, 2018.
[2] Wei Zhou, Ning Liao, Zhibo Chen, and Weiping Li, "3D-HEVC visual quality assessment: Database and bitstream model," in Eighth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2016, pp. 1–6.
[3] Phong V. Vu and Damon M. Chandler, "ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices," Journal of Electronic Imaging, vol. 23, no. 1, p. 013016, 2014.
[4] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[5] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers. IEEE, 2003, vol. 2, pp. 1398–1402.
[6] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang, "FSIM: A feature similarity index for image quality assessment," IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378–2386, 2011.
[7] Kalpana Seshadrinathan and Alan Conrad Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2010.
[8] Phong V. Vu, Cuong T. Vu, and Damon M. Chandler, "A spatiotemporal most-apparent-distortion model for video quality assessment," in 18th IEEE International Conference on Image Processing (ICIP). IEEE, 2011, pp. 2505–2508.
[9] Woei-Tan Loh and David Boon Liang Bong, "A just noticeable difference-based video quality assessment method with low computational complexity," Sensing and Imaging, vol. 19, no. 1, p. 33, 2018.
[10] Margaret H. Pinson and Stephen Wolf, "A new standardized method for objectively measuring video quality," IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312–322, 2004.
[11] Rajiv Soundararajan and Alan C. Bovik, "Video quality assessment by reduced reference spatio-temporal entropic differencing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, 2013.
[12] Jingtao Xu, Peng Ye, Yong Liu, and David Doermann, "No-reference video quality assessment via feature learning," in IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 491–495.
[13] Peng Ye, Jayant Kumar, Le Kang, and David Doermann, "Unsupervised feature learning framework for no-reference image quality assessment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 1098–1105.
[14] Michele A. Saad, Alan C. Bovik, and Christophe Charrier, "Blind prediction of natural video quality," IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
[15] Michele A. Saad, Alan C. Bovik, and Christophe Charrier, "A DCT statistics-based blind image quality index," IEEE Signal Processing Letters, vol. 17, no. 6, pp. 583–586, 2010.
[16] Anish Mittal, Michele A. Saad, and Alan C. Bovik, "A completely blind video integrity oracle," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 289–300, 2016.
[17] Sewoong Ahn and Sanghoon Lee, "Deep blind video quality assessment based on temporal human perception," in IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 619–623.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.
[19] Wei Zhou, Zhibo Chen, and Weiping Li, "Stereoscopic video quality prediction based on end-to-end dual stream deep neural networks," in Pacific Rim Conference on Multimedia. Springer, 2018, pp. 482–492.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778.