Towards Deep Learning Methods for Quality Assessment of Computer-Generated Imagery
Markus Utke, Saman Zadtootaghaj, Steven Schmidt, Sebastian Möller
∗ Quality and Usability Lab, Technische Universität Berlin, Germany
† DFKI Projektbüro Berlin, Germany
[email protected], [email protected], [email protected], [email protected]
Abstract — Video game streaming services are growing rapidly due to new services such as passive video streaming, e.g. Twitch.tv, and cloud gaming, e.g. Nvidia GeForce Now. In contrast to traditional video content, gaming content has special characteristics such as extremely high motion for some games, special motion patterns, synthetic content and repetitive content, which make state-of-the-art video and image quality metrics perform weaker on this special computer-generated content. In this paper, we outline our plan to build a deep learning-based quality metric for video game quality assessment. In addition, we present initial results obtained by training the network on VMAF values as ground truth, to give some insights on how to build such a metric in the future. The paper describes the method used to choose an appropriate Convolutional Neural Network architecture. Furthermore, we estimate the size of the subjective quality dataset required to achieve sufficiently high performance. The results show that by taking around 5k images for training of the last six modules of Xception, we can obtain a relatively high-performance metric to assess the quality of distorted video games.
I. INTRODUCTION

The gaming industry has been one of the largest digital markets for decades and is rapidly growing with emerging online services such as gaming video streaming, online gaming and cloud gaming (CG). As the game industry grows, games that are more complex in terms of processing power are being developed, which requires players to update their end devices every few years in order to play high-end games. One solution is to move heavy processes such as rendering to the cloud and remove the need for high-end hardware on the customer side. Cloud gaming was proposed to offer more flexibility to users, allowing them to play any game anywhere and on any type of device. Apart from processing power, cloud gaming benefits users through platform independence; for game developers, it offers security for their products and promises a new market to increase their revenue. Besides cloud gaming, passive video streaming of gameplay has become popular, with hundreds of millions of viewers per year. Twitch.tv and YouTube Gaming are the two most popular services for passive game video streaming.

Quality assessment is a necessary process for any service provider to ensure the satisfaction of customers. While subjective tests are the basis of any quality assessment of multimedia services, service providers are seeking objective methods for predicting quality, as subjective experiments are expensive and time-consuming. Depending on the amount of access to the reference signal, signal-based video quality models can be divided into three classes: no-reference (NR), reduced-reference (RR), and full-reference (FR) metrics. For QoE assessment of cloud gaming and passive video streaming services such as Twitch.tv, NR metrics are of interest for service providers, as the reference signal is either not available or comes with the high cost of recording and syncing the reference signal with the distorted signal (e.g. cloud gaming).

In this paper, we aim at designing a Convolutional Neural Network (CNN) based NR video quality metric that can predict the quality of video games with high accuracy. The main idea of our work is to train a CNN on a huge number of frames which are annotated based on a full-reference quality metric, VMAF [1], and then retrain the last few layers of the pre-trained CNN on a smaller subjective image dataset. In this paper, we try to answer the following research questions before training such a huge network in the future to build the final metric:
• Are machine learning based quality assessment methods suitable for computer-generated imagery?
• How does a pre-trained deep CNN have to be retrained to obtain a decent result?
• Which pre-trained CNN architecture performs best among state-of-the-art models for CGI quality assessment?
• How much data, in terms of number of frames, is roughly required for transfer learning of CNNs?

In order to answer these research questions without training a whole network and conducting subjective experiments blindly, pre-trained CNN architectures are taken into consideration, and VMAF was chosen as ground truth to gain insight into the selection of an architecture, the rough number of frames required for transfer learning, and the expected performance.

II. RELATED WORK
Within the last decades, we have witnessed a huge number of research works on objective image and video quality assessment. In this section, due to limited space, we give a short overview of deep learning based quality models as well as metrics that are developed specifically for computer-generated content.

The performance of state-of-the-art video and image quality metrics on gaming videos was investigated in [2], which shows a high correlation of VMAF with Mean Opinion Scores (MOS), while most of the NR metrics perform quite poorly. With respect to gaming content, to the best knowledge of the authors, only two NR metrics have been developed. Zadtootaghaj et al. proposed an NR machine learning-based video quality metric for gaming content, named NR-GVQM, that is trained on low-level image features with the aim of predicting VMAF without having access to a reference video [3]. Another NR pixel-based video quality metric for gaming QoE was proposed by Goering et al. [4], called nofu. nofu is also a machine learning metric that extracts low-level features, some of which are hand-crafted by the authors and some taken from the state of the art. nofu as an NR metric has a slightly higher performance compared to VMAF as an FR metric on the GamingVideoSET [5].

Most deep neural network (DNN) based models are proposed for image quality assessment, for two reasons. First, video quality datasets are relatively small in terms of the number of annotated items compared to image quality datasets, because quality assessment of videos is more expensive than for image content. In addition, most state-of-the-art works use transfer learning methods, which take a pre-trained DNN and retrain its last few layers. Transfer learning is more suitable for image applications as more pre-trained models are available. Bosse et al. [6] presented a neural network-based approach to build FR and NR image quality metrics, inspired by VGG16, increasing the depth of the CNN to ten convolutional layers. Goering et al. [ref] proposed a hybrid NR image quality metric which was developed after training and testing a few pre-trained CNNs and extending them with signal-based features from state-of-the-art models.

III. DATASETS
With the aim of training and testing the model, GamingVideoSET was used [5]. GamingVideoSET consists of 24 source video sequences from 12 games (two sequences per game), which are encoded using H.264/MPEG-AVC under different bitrate-resolution pairs. 18 source video sequences from 10 games were selected and used in the training process of the model, while six source video sequences from four games were selected for the validation set. In order to investigate the suitability of ML methods for gaming content in light of the similarity between video sequences from the same game, two video sequences in the validation set were chosen from games that are also included in the training set (but different recorded sequences), and the four other video sequences were selected from games that are not in the training set. In total, there are 279 900 frames in the training set and 71 100 frames in the validation set, from which we sample as explained later.

IV. EXPERIMENTS AND RESULTS
To test whether deep CNNs are applicable to our type of data, we use transfer learning, denoting the process of retraining or fine-tuning a pre-trained neural network to make it perform a task that is different from the one it was originally trained for. Using Keras [7], a high-level neural network library, different model architectures are available along with their weights pre-trained on the ImageNet [8] database. We chose three different architectures and compared their performance on our dataset in Table I: DenseNet121 [9], ResNet50 [10], and Xception [11]. For each of these architectures, the fully connected network at the end was removed. Instead, we used one dense layer consisting of only one output neuron with linear activation. The output of the network is directly compared to the actual VMAF value. To calculate the quality at the video level, the frame-level values are averaged.

Because of the large size of the images in the dataset, we cannot efficiently train the network on the images directly. Hence, we crop random patches of size 299 × 299, the standard input size of Xception, from the frames we want to train on. This is done in parallel to the training, such that in each epoch a new random patch of each image is chosen.

We used the Xception architecture for the following investigations since it worked best for our setting (see Table I). However, it should be noted that for an extensive evaluation, multiple parameter settings would have to be compared.

Architecture   Level   R      RMSE   PCC     SRCC
DenseNet121    frame   0.87   7.43   0.934   0.…
               video   0.97   3.38   0.985   0.…
ResNet50       frame   0.87   6.98   0.938   0.…
               video   0.97   3.11   0.987   0.…
Xception       frame   0.87   6.85   0.939   0.…
               video   0.98   2.75   0.990   0.…

TABLE I: Results for different architectures

When using pre-trained networks, it is often sufficient to train only the last layers instead of retraining the whole network. To investigate this, we trained different amounts of the network on our dataset and compared the performance. The Xception architecture consists of 14 modules, each containing two to three convolutional layers. Figure 1 shows a comparison of the results when training only the last, the last four, and the last six modules. We observe that the results get better when more modules are included in the training process. However, this might increase the risk of overfitting, as the dataset is not large enough. Whether the performance can be improved further by training more than six modules is yet to be tested, which may come with a higher cost of computation.

Of the 12 different games in the dataset, two are used only in the validation set and two are present in both the validation and the training set. Surprisingly, comparing the results for these two groups showed no difference.

In the future we want to train our model on subjective ratings instead of VMAF values. Therefore, we tried to find a small subset of our data that still performs well. Since the training dataset consists of over 250 000 frames and consecutive frames can be very similar, only every n-th frame from every video is used for training. A very high number for n was chosen (n = 403) and then lowered step by step to find the point where the performance of the model stops improving.
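The architecture surgery described above (pre-trained base, fully connected top removed, one linear output neuron, only the last six modules trainable) can be sketched with Keras. This is a minimal illustration, not the authors' code: the pooling between Xception's convolutional stack and the output neuron, the optimizer and the loss are not specified in the paper (global average pooling, Adam and MSE are assumptions here), and freezing by the `block9_` to `block14_` name prefixes relies on Keras's naming of Xception's 14 modules.

```python
# Sketch of the transfer-learning setup: Xception pre-trained on ImageNet,
# top removed, one linear output neuron regressing the VMAF value.
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

base = Xception(weights="imagenet", include_top=False,
                pooling="avg",               # assumption: global average pooling before the head
                input_shape=(299, 299, 3))   # Xception's standard input size

vmaf_out = Dense(1, activation="linear", name="vmaf")(base.output)
model = Model(inputs=base.input, outputs=vmaf_out)

# Freeze everything, then unfreeze the new head and the last six of
# Xception's 14 modules (Keras names their layers block9_* .. block14_*).
for layer in model.layers:
    layer.trainable = layer.name == "vmaf" or any(
        layer.name.startswith(f"block{i}_") for i in range(9, 15))

model.compile(optimizer="adam", loss="mse")  # assumption: MSE against VMAF labels
```

Training fewer or more modules then only changes the `range(9, 15)` in the freezing loop.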
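The data-handling steps described in this section, random 299 × 299 patch cropping (a fresh patch per epoch), keeping every n-th frame of a video, and averaging frame-level predictions into a video-level score, can be sketched independently of the network. The function names are ours, chosen for illustration:

```python
import numpy as np

PATCH = 299  # standard input size of Xception

def random_patch(frame, rng):
    """Crop one random PATCH x PATCH region; a new crop is drawn every epoch."""
    h, w = frame.shape[:2]
    top = int(rng.integers(0, h - PATCH + 1))
    left = int(rng.integers(0, w - PATCH + 1))
    return frame[top:top + PATCH, left:left + PATCH]

def every_nth(num_frames, n):
    """Indices of every n-th frame of a video, thinning out near-duplicate frames."""
    return list(range(0, num_frames, n))

def video_quality(frame_scores):
    """Pool per-frame predictions into a video-level score by averaging."""
    return float(np.mean(frame_scores))
```

For a 900-frame sequence, for example, n = 403 keeps only frames 0, 403 and 806, i.e. three frames per video.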
Fig. 1: Different number of trained modules of the Xception network (RMSE and PCC on video level). (a) One module: RMSE 3.84, PCC 0.980; (b) four modules: RMSE 3.10, PCC 0.986; (c) six modules: RMSE 2.75, PCC 0.990.
n     Total number of frames   RMSE (frame)   RMSE (video)   SRCC (frame)   SRCC (video)
403   933                      7.50           4.40           0.939          0.…
…     …                        ….12           4.07           0.938          0.…
…     …                        ….18           3.72           0.940          0.…
…     …                        ….78           3.11           0.942          0.…
…     …                        ….09           3.19           0.937          0.…

TABLE II: RMSE and SRCC for different choices of n

Table II shows the RMSE and SRCC for different choices of n. To provide good results while minimizing the number of frames needed, n = 53 seems to be a good choice.

We also investigated whether cropping only the center of the images would lead to better performance than randomized cropping during training of the CNN (as users tend to look more at the center of images). Our results showed that it does not improve the network; taking random patches for training yields much higher performance.

V. DISCUSSION AND CONCLUSION
This paper is presented to demonstrate the effectiveness of using CNNs for quality assessment of multimedia services. While such CNN-based quality metrics come with a high computation cost, which might not be suitable especially for real-time services such as cloud gaming, the high performance of these methods motivates us to consider them as a future of QoE assessment for some special use cases, such as quality assessment of uploaded video content for transcoding purposes, e.g. for Twitch.tv. It has to be noted that this paper only presents the performance of a few examples of pre-trained CNNs that aim to predict an FR metric, VMAF. However, the authors expect to reach similar performance when real MOS values are used as ground truth. The initial results show that with a medium-sized image quality dataset, between 1k and 5k annotated images, the quality can be predicted with relatively high performance. It was observed that with 5k annotated frames, retraining six modules of Xception might avoid overfitting while also achieving high performance. Our plan is to reduce this number significantly by mixing the training of CNNs with VMAF and MOS values, which requires a smaller dataset of frames annotated with MOS values. In addition, we will try sophisticated techniques to pool the frame-level quality values into a video quality value, such as a Long Short-Term Memory (LSTM) architecture.

Moreover, we observed that using videos from the games that appear in both the training and validation set did not lead to significantly higher performance. However, we need to investigate this more, as we only tested it with a small dataset. Another observation was that making the network deeper might help to obtain better performance, as was also observed in [6]. However, that depends on the size of the training dataset as well as on its diversity of content; otherwise, training a deep CNN could lead to overfitting.

REFERENCES
[1] Netflix, "VMAF - Video Multi-Method Assessment Fusion." https://github.com/Netflix/vmaf. [Online: Accessed 2-Oct-2018].
[2] N. Barman, S. Schmidt, S. Zadtootaghaj, M. G. Martini, and S. Möller, "An evaluation of video quality assessment metrics for passive gaming video streaming," in Proceedings of the 23rd Packet Video Workshop, pp. 7–12, ACM, 2018.
[3] S. Zadtootaghaj, N. Barman, S. Schmidt, M. G. Martini, and S. Möller, "NR-GVQM: A No Reference Gaming Video Quality Metric," in 2018 IEEE International Symposium on Multimedia (ISM), pp. 131–134, IEEE, 2018.
[4] S. Göring, R. R. Ramachandra Rao, and A. Raake, "nofu - A Lightweight No-Reference Pixel Based Video Quality Model for Gaming QoE," accepted at Eleventh International Workshop on Quality of Multimedia Experience (QoMEX), 2019.
[5] N. Barman, S. Zadtootaghaj, S. Schmidt, M. G. Martini, and S. Möller, "GamingVideoSET: a dataset for gaming video streaming applications," in 16th Annual Workshop on Network and Systems Support for Games (NetGames), pp. 1–6, IEEE, 2018.
[6] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, "Deep neural networks for no-reference and full-reference image quality assessment," IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2018.
[7] F. Chollet et al., "Keras." https://keras.io, 2015.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[11] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258, 2017.