[PDF] Advances In Video Compression System Using Deep Neural Network: A Review And Case Studies

Abstract

Significant advances in video compression system have been made in the past several decades to satisfy the nearly exponential growth of Internet-scale video traffic. From the application perspective, we have identified three major functional blocks including pre-processing, coding, and post-processing, that have been continuously investigated to maximize the end-user quality of experience (QoE) under a limited bit rate budget. Recently, artificial intelligence (AI) powered techniques have shown great potential to further increase the efficiency of the aforementioned functional blocks, both individually and jointly. In this article, we review extensively recent technical advances in video compression system, with an emphasis on deep neural network (DNN)-based approaches; and then present three comprehensive case studies. On pre-processing, we show a switchable texture-based video coding example that leverages DNN-based scene understanding to extract semantic areas for the improvement of subsequent video coder. On coding, we present an end-to-end neural video coding framework that takes advantage of the stacked DNNs to efficiently and compactly code input raw videos via fully data-driven learning. On post-processing, we demonstrate two neural adaptive filters to respectively facilitate the in-loop and post filtering for the enhancement of compressed frames. Finally, a companion website hosting the contents developed in this work can be accessed publicly at this https URL

Full PDF

11 Advances In Video Compression System UsingDeep Neural Network: A Review And Case Studies

Dandan Ding (cid:63) , Member, IEEE,

Zhan Ma (cid:63) , Senior Member, IEEE,

Di Chen,

Member, IEEE,

Qingshuang Chen,

Member, IEEE,

Zoe Liu, and Fengqing Zhu,

Senior Member, IEEE

Abstract —Signiﬁcant advances in video compression systemhave been made in the past several decades to satisfy thenearly exponential growth of Internet-scale video trafﬁc. Fromthe application perspective, we have identiﬁed three majorfunctional blocks including pre-processing, coding, and post-processing, that have been continuously investigated to maximizethe end-user quality of experience (QoE) under a limited bit ratebudget. Recently, artiﬁcial intelligence (AI) powered techniqueshave shown great potential to further increase the efﬁciencyof the aforementioned functional blocks, both individually andjointly. In this article, we review extensively recent technicaladvances in video compression system, with an emphasis ondeep neural network (DNN)-based approaches; and then presentthree comprehensive case studies. On pre-processing, we showa switchable texture-based video coding example that leveragesDNN-based scene understanding to extract semantic areas for theimprovement of subsequent video coder. On coding, we present anend-to-end neural video coding framework that takes advantageof the stacked DNNs to efﬁciently and compactly code inputraw videos via fully data-driven learning. On post-processing,we demonstrate two neural adaptive ﬁlters to respectively fa-cilitate the in-loop and post ﬁltering for the enhancement ofcompressed frames. Finally, a companion website hosting thecontents developed in this work can be accessed publicly athttps://purdueviper.github.io/dnn-coding/.

Index Terms —Deep Neural Networks, Texture Analysis, NeuralVideo Coding, Adaptive Filters

I. I

NTRODUCTION

In recent years, Internet trafﬁc has been dominated by awide range of applications involving video, including videoon demand (VOD), live streaming, ultra-low latency real-time communications, etc. . With ever increasing demandsin resolution ( e.g. , 4K, 8K, gigapixel [1], high speed [2]),and ﬁdelity, ( e.g. , high dynamic range [3], and higher bitprecision or bit depth [4]), more efﬁcient video compres-sion is imperative for content transmission and storage, bywhich networked video services can be successfully deployed.Fundamentally, video compression systems devise appropriatealgorithms to minimize the end-to-end reconstruction distor-tion (or maximize the quality of experience (QoE)), under agiven bit rate budget. This is a classical rate-distortion (R-D) optimization problem. In the past, the majority of effort

D. Ding is with the School of Information Science and Engineering,Hangzhou Normal University, Hangzhou, Zhejiang, China.Z. Ma is with the School of Electronic Science and Engineering, NanjingUniversity, Nanjing, Jiangsu, China.D. Chen, Q. Chen, and F. Zhu are with the School of Electrical andComputer Engineering, Purdue University, West Lafayette, Indiana, USA.Z. Liu is with Visionular Inc, 280 2nd St., Los Altos, CA, USA. (cid:63)

These authors contributed equally. had been focused on the development and standardization ofvideo coding tools for optimized R-D performance, such asthe intra/inter prediction, transform, entropy coding, etc. , re-sulting in a number of popular standards and recommendationspeciﬁcations ( e.g. , ISO/IEC MPEG series [5]–[11], ITU-TH.26x series [9]–[13], AVS series [14]–[16], as well as theAV1 [17], [18] from the Alliance of Open Media (AOM) [19]).All these standards have been widely deployed in the marketand enabled advanced and high-performing services to bothenterprises and consumers. They have been adopted to coverall major video scenarios from VOD, to live streaming, toultra-low latency interactive real-time communications, usedfor applications such as telemedicine, distance learning, videoconferencing, broadcasting, e-commerce, online gaming, shortvideo platforms, etc . Meanwhile, the system R-D efﬁciencycan also be improved from pre-processing and post-processing,individually and jointly, for content adaptive encoding (CAE).Notable examples include saliency detection for subsequentregion-wise quantization control, and adaptive ﬁlters to alle-viate compression distortions [20]–[22].In this article, we therefore consider pre-processing , coding ,and post-processing as three basic functional blocks of anend-to-end video compression system, and optimize themto provide compact and high-quality representation of inputoriginal video. • The “coding” block is the core unit that converts rawpixels or pixel blocks into binary bits presentation. Overthe past decades, the “coding” R-D efﬁciency has beengradually improved by introducing more advanced toolsto better exploit spatial, temporal, and statistical redun-dancy [23]. Nevertheless, this process inevitably incurscompression artifacts, such as blockiness and ringing, dueto the R-D trade-off, especially at low bit rates. • The “post-processing” block is introduced to alleviatevisually perceptible impairments produced as byproductsof coding. Post-processing mostly relies on the desig-nated adaptive ﬁlters to enhance the reconstructed videoquality or QoE. Such “post-processing” ﬁlters can alsobe embedded into the “coding” loop to jointly improvereconstruction quality and R-D efﬁciency, e.g. , in-loopdeblocking [24] and sample adaptive offset (SAO) [25]; • The “pre-processing” block exploits the discriminativecontent preference of the human visual system (HVS),caused by the non-linear response and frequency se-lectivity ( e.g. , masking) of visual neurons in the visualpathway. Pre-processing can extract content semantics a r X i v : . [ ee ss . I V ] J a n ( e.g. , saliency, object instance) to improve the psychovi-sual performance of the “coding” block, for example, byallocating unequal qualities (UEQ) across different areasaccording to pre-processed cues [26]. Building upon the advancements in deep neural networks(DNN), numerous recently-created video processing algo-rithms have been greatly improved to achieve superior per-formance, mostly leveraging the powerful nonlinear represen-tation capacity of DNNs. At the same time, we have alsowitnessed an explosive growth in the invention of DNN-basedtechniques for video compression from both academic researchand industrial practices. For example, DNN-based ﬁlteringin post-processing was extensively studied when developingthe VVC standard under the joint task force of ISO/IEC andITU-T experts over the past three years. More recently, thestandard committee issued a Call-for-Evidence (CfE) [27],[28] to encourage the exploration of deep learning-based videocoding solutions beyond VVC.In this article, we discuss recent advances in pre-processing , coding , and post-processing , with particular emphasis on theuse of DNN-based approaches for efﬁcient video compression.We aim to provide a comprehensive overview to bring readersup to date on recent advances in this emerging ﬁeld. Wealso suggest promising directions for further exploration. Assummarized in Fig. 1, we ﬁrst dive into video pre-processing,emphasizing the analysis and application of content semantics, e.g. , saliency, object, texture characteristics, etc. , to videoencoding. We then discuss recently-developed DNN-basedvideo coding techniques for both modularized coding tooldevelopment and end-to-end fully learned framework explo-ration. Finally, we provide an overview of the adaptive ﬁltersthat can be either embedded in codec loop, or placed as apost enhancement to improve ﬁnal reconstruction. We alsopresent three case studies, including 1) switchable texture-based video coding in pre-processing; 2) end-to-end neuralvideo coding ; and 3) efﬁcient neural ﬁltering , to provideexamples the potential of DNNs to improve both subjectiveand objective efﬁciency over traditional video compressionmethodologies.The remainder of the article is organized as follows: FromSection II to IV, we extensively review the advances in respec-tive pre-processing, coding, and post-processing. Traditionalmethodologies are ﬁrst brieﬂy summarized, and then DNN-based approaches are discussed in detail. As in the casestudies, we propose three neural approaches in Section V, VI,and VII, respectively. Regarding pre-processing, we develop aCNN based texture analysis/synthesis scheme for AV1 codec.For video compression, an end-to-end neural coding frame-work is developed. In our discussion of post-processing,wepresent different neural methods for in-loop and post ﬁlteringthat can enhance the quality of reconstructed frames. Sec-tion VIII summarizes this work and discusses open challengesand future research directions. For your convenience, Table I Although adaptive ﬁlters can also be used in pre-processing for pre-ﬁltering, e.g. , denoising, motion deblurring, contrast enhancement, edgedetection, etc. , our primary focus in this work will be on semantic contentunderstanding for subsequent intelligent “coding”.

TABLE I: Abbreviations and Annotations

Abbreviation DescriptionAE AutoEncoderCNN Convolutional Neural NetworkCONV ConvolutionConvLSTM Convolutional LSTMDNN Deep Neural NetworkFCN Fully-Connected NetworkGAN Generative Adversarial NetworkLSTM Long Short-Term MemoryRNN Recurrent Neural NetworkVAE Variational AutoEncoderBD-PSNR Bjøntegaard Delta PSNRBD-Rate Bjøntegaard Delta RateGOP Group of PicturesMS-SSIM Multiscale SSIMMSE Mean Squared ErrorPSNR Peak Signal-to-Noise RatioQP Quantizatin ParameterQoE Quality of ExperienceSSIM Structural Similarity IndexUEQ UnEqual QualityVMAF Video Multi-Method Assessment FusionAV1 AOMedia Video 1AVS Audio Video StandardH.264/AVC H.264/Advanced Video CodingH.265/HEVC H.265/High-Efﬁciency Video CodingVVC Versatile Video CodingAOM Alliance of Open MediaMPEG Moving Picture Experts Group provides an overview of abbreviations and acronyms that arefrequently used throughout this paper.II. O

VERVIEW OF

DNN-

BASED V IDEO P RE - PROCESSING

Pre-processing techniques are generally applied prior to thevideo coding block, with the objective of guiding the videoencoder to remove psychovisual redundancy and to maintainor improve visual quality, while simultaneously lowering bitrate consumption. One category of pre-processing techniquesis the execution of pre-ﬁltering operations. Recently, a numberof deep learning-based pre-ﬁltering approaches have beenadopted for targeted coding optimization. These include de-noising [29], [30], motion deblurring [31], [32], contrastenhancement [33], edge detection [34], [35], etc . Anotherimportant topic area is closely related to the analysis of videocontent semantics, e.g. , object instance, saliency attention,texture distribution, etc. , and its application to intelligent videocoding. For the sake of simplicity, we refer to this group oftechniques as “pre-processing” for the remainder of this paper.In our discussion below, we also limit our focus to saliency-based and analysis/synthesis-based approaches.

A. Saliency-Based Video Pre-processing1) Saliency Prediction: Saliency is the quality of beingparticularly noticeable or important. Thus, the salient area refers to region of an image that predominantly attracts theattention of subjects. This concept corresponds closely tothe highly discriminative and selective behaviour displayedin visual neuronal processing [36], [37]. Content featureextraction, activation, suppression and aggregation also occurin the visual pathway [38]. input x x corr r corr CNN model

Input Original Videos OutputReconstructed Videos

Case Study I

Switchable Texture-based Video Coding

Pre-processing Coding

Case Study II

End-to-End Neural Video Coding

Case Study III

Neural Adaptive Filtering (In-loop & Post)

Post-processing T r an s m i ss i on Semantic Understanding Feature Representation Quality Enhancement

Fig. 1:

Topic Outline.

This article reviews DNN-based techniques used in pre-processing, coding, and post-processing of apractical video compression system. The “pre-processing” module leverages content semantics ( e.g. , texture) to guide videocoding, followed by the “coding” step to represent the video content using more compact spatio-temporal features. Finally,quality enhancement is applied in “post-processing” to improve reconstruction quality by alleviating processing artifacts.Companion case studies are respectively offered to showcase the potential of DNN algorithms in video compression.Earlier attempts to predict saliency typically utilized hand-crafted image features, such as color, intensity, and orientationcontrast [39]; motion contrast [40]; camera motion [41], etc. ,to predict saliency.Later on, DNN-based semantic-level features were exten-sively investigated for both image content [42]–[48] andvideo sequences [49]–[55]. Among these features, imagesaliency prediction only exploits spatial information, whilevideo saliency prediction often relies on spatial and temporalattributes jointly. One typical example of video saliency isa moving object that incurs spatio-temporal dynamics overtime, and is therefore more likely to attract users’ attention.For example, Bazzani et al. [49] modeled the spatial relationsin videos using 3D convolutional features and the temporalconsistency with a convolutional long short-term memory(LSTM) network. Bak et al. [50] applied a two-stream net-work that exploited different fusion mechanisms to effectivelyintegrate spatial and temporal information. Sun et al. [51]proposed a step-gained FCN to combine the time-domainmemory information and space-domain motion components.Jiang et al. [52] developed an object-to-motion CNN that wasapplied together with a LSTM network. All of these effortsto efﬁciently predict video saliency leveraged spatio-temporalattributes. More details regarding the spatio-temporal saliencymodels for video content can be found in [56].

2) Salient Object:

One special example of image saliencyinvolved the object instance in a visual scene, speciﬁcally, themoving object in videos. A simple yet effective solution tothe problem of predicting image saliency in this case involvedsegmenting foreground objects and background components.The segmentation of foreground objects and backgroundcomponents has mainly relied on foreground extraction orbackground subtraction. For example, motion information hasfrequently been used to mask out foreground objects [57]–[61].Recently, both CNN and foreground attentive neural net-work (FANN) models have been developed to perform fore-ground segmentation [62], [63]. In addition to conventionalGaussian mixture model-based background subtraction, recentexplorations have also shown that CNN models could beeffectively used for the same purpose [64], [65]. To address these separated foreground objects and background attributes,Zhang et al. [66] introduced a new background mode to morecompactly represent background information with better R-D efﬁciency. To the best of our knowledge, such foregroundobject/background segmentation has been mostly applied invideo surveillance applications, where the visual scene lendsitself to easier separation.

3) Video Compression with UEQ Scales:

Recalling thatsaliency or object refers to more visually attentive areas. Itis straightforward to apply UEQ setting in a video encoder,where light compression is used to encode the saliency area,while heavy compression is used elsewhere. Use of this tech-nique often results in a lower level of total bit rate consumptionwithout compromising QoE.For example, Hadi et al. [67] extended the well-known Itti-Koch-Niebur (IKN) model to estimate saliency in the DCTdomain, also considering camera motion. In addition, saliency-driven distortion was also introduced to accurately capture thesalient characteristics, in order to improve R-D optimizationin H.265/HEVC. Li et al. [68] suggested using graph-basedvisual saliency to adapt the quantizations in H.265/HEVC,to reduce total bits consumption. Similarly, Ku et al. [69]applied saliency-weighted Coding Tree Unit (CTU)-level bitallocation, where the CTU-aligned saliency weights weredetermined via low-level feature fusion.The aforementioned methodologies rely on traditional hand-crafted saliency prediction algorithms. As DNN-based saliencyalgorithms have demonstrated superior performance, we cansafely assume that their application to video coding will leadto better compression efﬁciency. For example, Zhu et al. [70]adopted a spatio-temporal saliency model to accurately controlthe QP in an encoder whose spatial saliency was generatedusing a 10-layer CNN, and whose temporal saliency was cal-culated assuming the 2D motion model (resulting in an averageof 0.24 BD-PSNR gains over H.265/HEVC reference model(version HM16.8)). Performance improvement due to ﬁne-grained quantization adaptation was reported using an open-source x264 encoder [71]. This was accomplished by jointlyexamining the input video frame and associated saliencymaps. These saliency maps were generated by utilizing three

CNN models suggested in [52], [56], [72]. Up to 25% bitrate reduction was reported when distortion was measuredusing the edge-weighted SSIM (EW-SSIM). Similarly, Sun etal. [73] implemented a saliency-driven CTU-level adaptive bitrate control, where the static saliency map of each frame wasextracted using a DNN model and dynamic saliency regionwhen it was tracked using a moving object segmentationalgorithm. Experiment results revealed that the PSNR ofsalient regions was improved by 1.85 dB on average.Though saliency-based pre-processing is mainly driven bypsychovisual studies, it heavily relies on saliency detection toperform UEQ-based adaptive quantization with a lower rate ofbit consumption but visually identical reconstruction. On theother hand, visual selectivity behaviour is closely associatedwith video content distribution ( e.g. , frequency response),leading to perceptually unequal preference. Thus, it is highlyexpected that such content semantics-induced discriminativefeatures can be utilized to improve the system efﬁciency whenintegrated into the video encoder. To this end, we will discussthe analysis/synthesis-based approach for pre-processing in thenext section.

B. Analysis/Synthesis Based Pre-processing

Since most videos are consumed by human vision, subjec-tive perception of HVS is the best way to evaluate quality.However, it is quite difﬁcult to devise a profoundly accuratemathematical HVS model in actual video encoder for rateand perceptual quality optimization, due to the complicatedand unclear information processing that occurs in the humanvisual pathway. Instead, many pioneering psychovisual studieshave suggested that neuronal response to compound stimuliis highly nonlinear [74]–[81] within the receptive ﬁeld. Thisleads to well-known visual behaviors, such as frequency se-lectivity, masking, etc. , where such stimuli are closely relatedto the content texture characteristics. Intuitively, video scenescan be broken down into areas that are either “perceptuallysigniﬁcant” ( e.g. , measured in an MSE sense) or “perceptuallyinsigniﬁcant”. For “perceptually insigniﬁcant” regions, userswill not perceive compression or processing impairments with-out a side-by-side comparison with the original sample. Thisis because the HVS gains semantic understanding by viewingcontent as a whole, instead of interpreting texture details pixel-by-pixel [82]. This notable effect of the HVS is also referredto as “masking,” where visually insigniﬁcant information, e.g. ,perceptually insigniﬁcant pixels, will be noticeably suppressed.In practice, we can ﬁrst analyze the texture characteristicsof original video content in the pre-processing step, e.g. , Texture Analyzer in Fig. 2, in order to sort textures by theirsigniﬁcance. Subsequently, we can use any standard compliantvideo encoder to encode the perceptually signiﬁcant areas asthe main bitstream payload, and apply a statistical model torepresent the perceptually insigniﬁcant textures with modelparameters encapsulated as side information. Finally, we canuse decoded areas and parsed textures to jointly synthesizethe reconstructed sequences in

Texture Synthesizer . This typeof texture modeling makes good use of statistical and psy-chovisual representation jointly, generally requiring fewer bits,

Original VideoOriginal Video Texture Encoder TextureAnalyzer Encoder Analyzer Side InformationSide InformationReconstructed TextureReconstructed Video TextureSynthesizer Decoder Video Synthesizer Decoder Side InformationSide Information C hanne l C hanne l C hanne l C hanne l Fig. 2:

Texture Coding System.

A general framework ofanalysis/synthesis based video coding.despite yielding visually identical sensation, compared to thetraditional hybrid “prediction+residual” method . Therefore,texture analysis and synthesis play a vital role for subsequentvideo coding. We will discuss related techniques below.

1) Texture Analysis:

Early developments in texture analysisand representation can be categorized into ﬁlter-based or statistical modeling-based approaches. Gabor ﬁlter is onetypical example of a ﬁlter-based approach, by which theinput image is convoluted with nonlinear activation for thederivation of corresponding texture representation [84], [85].At the same time, in order to identify static and dynamictextures for video content, Thakur et al. [86] utilized the 2Ddual tree complex wavelet transform and steerable pyramidtransform [87], respectively. To accurately capture the tem-poral variations in video, Bansal et al. [88] again suggestedthe use of optic ﬂow for dynamic texture indication andlater synthesis, where optical ﬂow could be generated usingtemporal ﬁltering. Leveraging statistical models such as theMarkovian random ﬁeld (MRF) [89], [90] is an alternativeway to analyze and represent texture. For efﬁcient texturedescription, statistical modeling such as this was then ex-tended using handcrafted local features, e.g. , the scale invariantfeature transform (SIFT) [91], speeded up robust features(SURF) [92], and local binary patterns (LBP) [93]Recently, stacked DNNs have demonstrated their superiorefﬁciency in many computer vision tasks, This efﬁciency ismainly due to the powerful capacity of DNN features to beused for video content representation. The most straightfor-ward scheme directly extracted features from the FC6 or FC7layer of AlexNet [94] for texture representation. Furthermore,Cimpoi et al. [95] demonstrated that Fisher vectorized [96]CNN features was a decent texture descriptor candidate.

2) Texture Synthesis:

Texture synthesis reverse-engineersthe analysis in pre-processing to restore pixels accordingly. Itgenerally includes both non-parametric and parametric meth-ods. For non-parametric synthesis, texture patches are usuallyresampled from reference images [97]–[99]. In contrast, theparametric method utilized statistical models to reconstructthe texture regions by jointly optimizing observation outcomesfrom the model and model itself [87], [100], [101].DNN-based solutions exhibit great potential for texture syn-thesis applications. One notable example demonstrating this A comprehensive survey of texture analysis/synthesis based video codingtechnologies can be found in [83]. potential used a pre-trained image classiﬁcation-based CNNmodel to generate texture patches [102]. Li et al. [103], thendemonstrated that a Markovian GAN-based texture synthesiscould offer remarkable quality improvement.To brieﬂy summarize, earlier “texture analysis/synthesis”approaches often relied on handcrafted models, as well ascorresponding parameters. While they have shown good per-formance to some extent for a set of test videos, it is usuallyvery difﬁcult to generalize them to large-scale video datasetswithout ﬁne-tuning parameters further. On the other hand,related neuroscience studies propose a broader deﬁnition oftexture which is more closely related to perceptual sensa-tion, although existing mathematical or data-driven texturerepresentations attempt to fully fulﬁll such perceptual motives.Furthermore, recent DNN-based schemes present a promisingperspective. However, the complexity of these schemes has notyet been appropriately exploited. So, in Section V, we willreveal a CNN-based pixel-level texture analysis approach tosegment perceptually insigniﬁcant texture areas in a frame forcompression and later synthesis. To model the textures bothspatially and temporally, we introduce a new coding modecalled the “switchable texture mode” that is determined atgroup of pictures (GoP) level according to the bit rate saving.III. O

VERVIEW OF

DNN-

BASED V IDEO C ODING

A number of investigations have shown that DNNs can beused for efﬁcient image/video coding [104]–[107]. This topichas attracted extensive attention in recent years, demonstratingits potential to enhance the conventional system with better R-D performance.There are three major directions currently under inves-tigation. One is resolution resampling-based video coding,by which the input videos are ﬁrst down-sampled prior tobeing encoded, and the reconstructed videos are up-sampledor super-resolved to the same resolution as the input [108]–[111]. This category generally develops up-scaling or super-resolution algorithms on top of standard video codecs. Thesecond direction under investigation is modularized neuralvideo coding (MOD-NVC), which has attempted to improveindividual coding tools in traditional hybrid coding frameworkusing learning-based solutions. The third direction is end-to-end neural video coding (E2E-NVC), which fully leveragesthe stacked neural networks to compactly represent input im-age/video in an end-to-end learning manner. In the followingsections, we will primarily review the latter two cases, sincethe ﬁrst one has been extensively discussed in many otherstudies [112].

A. Modularized Neural Video Coding (MOD-NVC)

The MOD-NVC has inherited the traditional hybrid codingframework within which handcrafted tools are reﬁned orreplaced using learned solutions. The general assumption isthat existing rule-based coding tools can be further improvedvia a data-driven approach that leverages powerful DNNs tolearn robust and efﬁcient mapping functions for more compactcontent representation. Two great articles have comprehen-sively reviewed relevant studies in this direction [106], [107]. We brieﬂy introduce key techniques in intra/inter prediction,quantization, and entropy coding. Though in-loop ﬁltering isanother important piece in the “coding” block, due to itssimilarities with post ﬁltering, we have chosen to review itin quality enhancement-aimed “post-processing” for the sakeof creating a more cohesive presentation.

1) Intra Prediction:

Video frame content presents highlycorrelated distribution across neighboring samples spatially.Thus, block redundancy can be effectively exploited usingcausal neighbors. In the meantime, due to the presence of localstructural dynamics, block pixels can be better representedfrom a variety of angular directed prediction.In conventional standards, such as the H.264/AVC,H.265/HEVC, or even emerging VVC, speciﬁc predictionrules are carefully designated to use weighted neighbors forrespective angular directions. From the H.264/AVC to recentVVC, intra coding efﬁciency has been gradually improvedby allowing more ﬁne-grained angular directions and ﬂexibleblock size/partitions. In practice, an optimal coding mode isoften determined by R-D optimization.One would intuitively expect that coding performance canbe further improved if better predictions can be produced.Therefore, there have been a number of attempts to lever-age the powerful capacity of stacked DNNs for better in-tra predictor generation, including the CNN-based predictorreﬁnement suggested in [113] to reduce prediction residual,additional learned mode trained using FCN models reportedin [114], [115], using RNNs in [116], using CNNs in [108],or even using GANs in [117], etc.

These approaches haveactively utilized the neighbor pixels or blocks, and/or othercontext information ( e.g. , mode) if applicable, in order toaccurately represent the local structures for better prediction.Many of these approaches have reported more than 3% BD-Rate gains against the popular H.265/HEVC reference model.These examples demonstrate the efﬁciency of DNNs in intraprediction.

2) Inter Prediction:

In addition to the spatial intra pre-diction, temporal correlations have also been exploited via inter prediction, by which previously reconstructed framesare utilized to generate inter predictor for compensation usingdisplaced motion vectors.Temporal prediction can be enhanced using references withhigher ﬁdelity, and more ﬁne-grained motion compensation.For example, fractional-pel interpolation is usually deployed toimprove prediction accuracy [118]. On the other hand, motioncompensation with ﬂexible block partitions is another majorcontributor to inter coding efﬁciency.Similarly, earlier attempts have been made to utilize DNNssolutions for better inter coding. For instance, CNN-basedinterpolations were studied in [119]–[121] to improve the half-pel samples. Besides, an additional virtual reference couldbe generated using CNN models for improved R-D decisionin [122]. Xia et al. [123] further extended this approachusing multiscale CNNs to create an additional reference closerto the current frame by which accurate pixel-wise motionrepresentation could be used. Furthermore, conventional ref-erences could also be enhanced using DNNs to reﬁne thecompensation [124].

3) Quantization and Entropy Coding:

Quantization andentropy coding are used to remove statistical redundancy.Scalar quantization is typically implemented in video encodersto remove insensitive high-frequency components, withoutlosing the perceptual quality, while saving the bit rate. Re-cently, a three-layer DNN was developed to predict the localvisibility threshold C T for each CTU, by which more accuratequantization could be achieved via the connection between C T and actual quantization stepsize. This development ledto noticeable R-D improvement, e.g. , upto 11% as reportedin [125].Context-adaptive binary arithmetic coding (CABAC) andits variants are techniques that are widely adopted to encodebinarized symbols. The efﬁciency of CABAC is heavily relianton the accuracy of probability estimation in different contexts.Since the H.264/AVC, handcrafted probability transfer func-tions (developed through exhaustive simulations, and typicallyimplemented using look-up tables) were utilized. In [115]and [126], the authors demonstrated that a combined FCN andCNN model could be used to predict intra mode probability forbetter entropy coding. Another example of a combined FCNand CNN model was presented in [127] to accurately encodetransform indexes via stacked CNNs. And likewise, in [128],intra DC coefﬁcient probability could be also estimated usingDNNs for better performance.All of these explorations have reported positive R-D gainswhen incorporating DNNs in traditional hybrid coding frame-works. A companion H.265/HEVC-based software model isalso offered by Liu et al. [106], to advance the potential forsociety to further pursue this line of exploration. However,integrating DNN-based tools could exponentially increase boththe computational and space complexity. Therefore, creatingharmony between learning-based and conventional rule-basedtools under the same framework requires further investigation.It is also worth noting that an alternative approach is cur-rently being explored in parallel. In this approach, researcherssuggest using an end-to-end neural video coding (E2E-NVC)framework to drive the raw video content representation vialayered feature extraction, activation, suppression, and aggre-gation, mostly in a supervised learning fashion, instead ofreﬁning individual coding tools. B. End-to-End Neural Video Coding (E2E-NVC)

Representing raw video pixels as compactly as possible bymassively exploiting its spatio-temporal and statistical correla-tions is the fundamental problem of lossy video coding. Overdecades, traditional hybrid coding frameworks have utilizedpixel-domain intra/inter prediction, transform, entropy coding, etc. , to fulﬁll this purpose. Each coding tool is extensivelyexamined under a speciﬁc codec structure to carefully justifythe trade-off between R-D efﬁciency and complexity. Thisprocess led to the creation of well-known international orindustry standards, such as the H.264/AVC, H.265/HEVC,AV1, etc.

On the other hand, DNNs have demonstrated a powerfulcapacity for video spatio-temporal feature representation forvision tasks, such as object segmentation, tracking, etc.

This naturally raises the question of whether it is possible to encodethose spatio-temporal features in a compact format for efﬁcientlossy compression.Recently, we have witnessed the growth of video codingtechnologies that rely completely on end-to-end supervisedlearning. Most learned schemes still closely follow the conven-tional intra/inter frame deﬁnition by which different algorithmsare investigated to efﬁciently represent the intra spatial tex-tures, inter motion, and the inter residuals (if applicable) [104],[129]–[131]. Raw video frames are fed into stacked DNNs toextract, activate, and aggregate appropriate compact features(at the bottleneck layer) for quantization and entropy coding.Similarly, R-D optimization is also facilitated to balance therate and distortion trade-off. In the following paragraphs, wewill brieﬂy review the aforementioned key components.

1) Nonlinear Transform and Quantization:

The autoen-coder or variational autoencoder (VAE) architectures are typ-ically used to transform the intra texture or inter residual intocompressible features.For example, Toderic et al. [132] ﬁrst applied fully-connected recurrent autoencoders for variable-rate thumbnailimage compression. Their work was then improved in [133],[134] with the support of full-resolution image, unequal bitallocation, etc.

Variable bit rate is intrinsically enabled by theserecurrent structures. The recurrent autoencoders, however, suf-fer from higher computational complexity at higher bit rates,because more recurrent processing is desired. Alternatively, convolutional autoencoders have been extensively studied inpast years, where different bit rates are adapted by setting avariety of λ s to optimize the R-D trade-off. Note that differentnetwork models may be required for individual bit rates,making hardware implementation challenging, ( e.g. , modelswitch from one bit rate to another). Recently, conditionalconvolution [135] and scaling factor [136] were proposed toenable variable-rate compression using a single or very limitednetwork model without noticeable coding efﬁciency loss,which makes the convolutional autoencoders more attractivefor practical applications.To generate a more compact feature representation, Balle et al. [105] suggested replacing the traditional nonlinear ac-tivation, e.g. , ReLU, using generalized divisive normalization(GDN) that is theoretically proven to be more consistent withhuman visual perception. A subsequent study [137] revealedthat GDN outperformed other nonlinear rectiﬁers, such asReLU, leakyReLU, and tanh, in compression tasks. Severalfollow-up studies [138], [139] directly applied GDN in theirnetworks for compression exploration.Quantization is a non-differentiable operation, basicallyconverting arbitrary elements into symbols with a limitedalphabet for efﬁcient entropy coding in compression. Quanti-zation must be derivable in the end-to-end learning frameworkfor back propagation. A number of methods, such as addinguniform noise [105], stochastic rounding [132] and soft-to-hard vector quantization [140], were developed to approximatea continuous distribution for differentiation.

2) Motion Representation:

Chen et al. [104] developed theDeepCoder where a simple convolutional autoencoder was ap-plied for both intra and residual coding at ﬁxed 32 ×

32 blocks, and block-based motion estimation in traditional video codingwas re-used for temporal compensation. Lu et al. [141] intro-duced the optical ﬂow for motion representation in their DVCwork, which, together with the intra coding in [142], demon-strated similar performance compared with the H.265/HEVC.However, coding efﬁciency suffered from a sharp loss at lowbit rates. Liu et al. [143] extended their non-local attentionoptimized image compression (NLAIC) for intra and residualencoding, and applied second-order ﬂow-to-ﬂow prediction formore compact motion representation, showing consistent rate-distortion gains across different contents and bit rates.Motion can also be implicitly inferred via temporal interpo-lation. For example, Wu et al. [144] applied RNN-based frameinterpolation. Together with the residual compensation, RNN-based frame interpolation offered comparable performance tothe H.264/AVC. Djelouah et al. [145] furthered interpolation-based video coding by utilizing advanced optical ﬂow estima-tion and feature domain residual coding. However, temporalinterpolation usually led to an inevitable structural codingdelay.Another interesting exploration made by Ripple et al. in [130] was to jointly encode motion ﬂow and residual usingcompound features, where a recurrent state was embedded toaggregate multi-frame information for efﬁcient ﬂow generationand residual coding.

3) R-D Optimization: Li et al. [146] utilized a separatethree-layer CNN to generate an importance map for spatial-complexity-based adaptive bit allocation, leading to noticeablesubjective quality improvement. Mentzer et al. [140] furtherutilized the masked bottleneck layer to unequally weight fea-tures at different spatial locations. Such importance map em-bedding is a straightforward approach to end-to-end training.Importance derivation was later improved with the non-localattention [147] mechanism to efﬁciently and implicitly captureboth global and local signiﬁcance for better compressionperformance [136].Probabilistic models play a vital role in data compression.Assuming the Gaussian distribution for feature elements, Balle et al. [142] utilized hyper priors to estimate the parametersof Gaussian scale model (GSM) for latent features. LaterHu et al. [148] used hierarchical hyper priors (coarse-to-ﬁne)to improve the entropy models in multiscale representations.Minnen et al. [149] improved the context modeling using jointautoregressive spatial neighbors and hyper priors based on theGaussian mixture model (GMM). Autoregressive spatial priorswere commonly fused by PixelCNNs or PixelRNNs [150].Reed et al. [151] further introduced multiscale PixelCNNs,yielding competitive density estimation and great boost inspeed ( e.g. , from O ( N ) to O (log N ) ). Prior aggregationwas later extended from 2D architectures to 3D PixelC-NNs [140]. Channel-wise weights sharing-based 3D imple-mentations could greatly reduce network parameters withoutperformance loss. A parallel 3D PixelCNNs for practicaldecoding is presented in Chen et al. [136]. Previous methodsaccumulated all the priors to estimate the probability based ona single GMM assumption for each element. Recent studieshave shown that weighted GMMs can further improve codingefﬁciency in [152], [153]. Pixel-error, such as MSE, was one of the most popularloss functions used. Concurrently, SSIM (or MS-SSIM) wasalso adopted because of its greater consistency with visualperception. Simulations revealed that SSIM-based loss canimprove reconstruction quality, especially at low bit rates.Towards the perceptual-optimized encoding, perceptual lossesthat were measured by adversarial loss [154]–[156] and VGGloss [157] were embedded in learning to produce visuallyappealing results.Though E2E-NVC is still in its infancy, its fast growing R-D efﬁciency holds a great deal of promise. This is especiallytrue, given that we can expect neural processors to be deployedmassively in the near future [158].IV. O VERVIEW OF

DNN-

BASED P OST - PROCESSING

Compression artifacts are inevitably present in both tra-ditional hybrid coding frameworks and learned compressionapproaches, e.g. , blockiness, ringing, cartoonishness, etc. ,severely impairing visual sensation and QoE. Thus, qualityenhancement ﬁlters are often applied as a post-ﬁltering step orin-loop module to alleviate compression distortions. Towardsthis goal, adaptive ﬁlters are usually developed to minimizethe error between original and distorted samples.

A. In-loop Filtering

Existing video standards are mainly utilizing the in-loopﬁlters to improve the subjective quality of reconstruction,and also to offer better R-D efﬁciency due to enhancedreferences. Examples include deblocking [24], sample adaptiveoffset (SAO) [25], constrained directional enhancement ﬁlter(CDEF) [159], loop-restoration (LR) [160], adaptive loop ﬁlter(ALF) [161], etc.

Recently, numerous CNN models have been developedfor in-loop ﬁltering via a data-driven approach to learn themapping functions. It is worth pointing out that predictionrelationships must be carefully examined when designingin-loop ﬁlters, due to the frame referencing structure andpotential error propagation. Both intra and inter predictions areutilized in popular video encoders, where an intra-coded frameonly exploits the spatial redundancy within current frame,while an inter-coded frame jointly explores the spatio-temporalcorrelations across frames over time.Earlier explorations of this subject have mainly focused ondesigning DNN-based ﬁlters for intra-coded frames, particu-larly by trading network depth and parameters for better cod-ing efﬁciency. For example, IFCNN [162], and VRCNN [163]are shallow networks with ≈ e.g. , 5.7% BD-Rate gain reported in [164]by using the model with 3,340,000 parameters, and 8.50%BD-Rate saving obtained in [167] by using the model with2,298,160 parameters. The more parameters a model has, themore complex it is. Unfortunately, greater complexity limitsthe network’s potential for practical application. Such intra-frame-based in-loop ﬁlters treat decoded frames equally, with-out the consideration of in-loop inter-prediction dependency. Nevertheless, aforementioned networks can be used in post-ﬁltering out of the coding loop.It is necessary to include temporal prediction dependencywhile designing the in-loop CNN-based ﬁlters for inter-framecoding. Some studies leveraged prior knowledge from theencoding process to assist the CNN training and inference.For example, Jia et al. [168] incorporated the co-located blockinformation for in-loop ﬁltering. Meng et al. [169] utilized thecoding unit partition for further performance improvement.Li et al. [170] input both the reconstructed frame and thedifference between the reconstructed and predicted pixels toimprove the coding efﬁciency. Applying prior knowledge inlearning may improve the coding performance, but it furthercomplicates the CNN model by involving additional informa-tion in the networks. On the other hand, the contribution ofthis prior knowledge is quite limited because such additionalpriors are already implicitly embedded in the reconstructedframe.If a CNN-based in-loop ﬁltering is applied to frame I , theimpact will be gradually propagated to frame I that has frame I as the reference. Subsequently, I is the reference of I , andso on so forth . If frame I is ﬁltered again by the same CNNmodel, an over-ﬁltering problem will be triggered, resultingin severely degraded performance, as analyzed in [171]. Toovercome this challenging problem, a CNN model calledSimNet was built to carry the relationship between the recon-structed frame and its original frame in [172] to adaptively skipﬁltering operations in inter coding. SimNet reported 7.27% and5.57% BD-Rate savings for intra- and inter- coding of AV1,respectively. A similar skipping strategy was suggested byChen et al. [173] to enable a wide activation residual network,yielding . and 9.64% BD-Rate savings for respectiveintra- and inter- coding on AV1 platform.Alternative solutions resort to the more expensive R-Doptimization to avoid the over-ﬁltering problem. For example,Yin et al. [174] developed three sets of CNN ﬁlters for lumaand chroma components, where the R-D optimal CNN modelis used and signaled in bitstream. Similar ideas are developedin [175], [176] as well, in which multiple CNN models aretrained and the R-D optimal model is selected for inference.It is impractical to use deeper and denser CNN modelsin applications. It is also very expensive to conduct R-Doptimization to choose the optimal one from a set of pre-trained models. Note that a limited number of pre-trainedmodels are theoretically insufﬁcient to be generalized forlarge-scale video samples. To this end, in Section VII-A, weintroduce a guided-CNN scheme which adapts shallow CNNmodels according to the characteristics of input video content. B. Post Filtering

Post ﬁltering is generally applied to the compressed framesat the decoder side to further enhance the video quality forbetter QoE.Previous in-loop ﬁlters designated for intra-coded framescan be re-used for single-frame post-ﬁltering [163], [177]– Even though more advanced inter referencing strategies can be devised,inter propagation-based behavior remains the same. [185]. Appropriate re-training may be applied in order to bettercapture the data characteristics. However, single-frame post-ﬁltering may introduce quality ﬂuctuation across frames. Thismay be due to the limited capacity of CNN models to dealwith a great amount of video contents. Thus, multi-frame postﬁltering can be devised to massively exploit the correlationacross successive temporal frames. By doing so, it not onlygreatly improves the single-frame solution, but also offersbetter temporal quality over time.Typically, a two-step strategy is applied for multi-frame postﬁltering. First, neighboring frames are aligned to the currentframe via (pixel-level) motion estimation and compensation(MEMC). Then, all aligned frames are fed into networksfor high-quality reconstruction. Thus, the accuracy of MEMCgreatly affects reconstruction performance. In applications,learned optical ﬂow, such as FlowNet [186], FlowNet2 [187],PWC-Net [188], and TOFlow [189], are widely used.Some exploration has already been made in this arena: Bao et al. [190] and Wang et al. [191] implemented a general videoquality enhancement framework for denoising, deblocking,and super-resolution, where Bao et al. [190] employed theFlowNet and Wang et al. [191] used pyramid, cascading, anddeformable convolutions to respectively align frames tempo-rally. Meanwhile, Yang et al. [192] proposed a multi-framequality enhancement framework called MFQE-1.0, in which aspatial transformer motion compensation (STMC) network isused for alignment, and a deep quality enhancement network(QE-net) is employed to improve reconstruction quality. Then,Guan et al. [193] upgraded MFQE-1.0 to MFQE-2.0 byreplacing QE-net using a dense CNN model, leading to betterperformance and less complexity. Later on, Tong et al. [194]suggested using FlowNet2 in MFQE-1.0 for temporal framealignment (instead of default STMC), yielding 0.23 dB PSNRgain over the original MFQE-1.0. Similarly, FlowNet2 is alsoused in [195] for improved efﬁciency.All of these studies suggested the importance of temporalalignment in post ﬁltering. Thus, in the subsequent casestudy (see Section VII-B), we ﬁrst examine the efﬁciencyof alignment, and then further discuss the contributions fromrespective intra-coded and inter-coded frames for the qualityenhancement of ﬁnal reconstruction. This will help audiencesgain a deeper understanding of similar post ﬁltering tech-niques. V. C

ASE S TUDY FOR P RE - PROCESSING :S WITCHABLE T EXTURE - BASED V IDEO C ODING

This section presents a switchable texture-based video pre-processing that leverages DNN-based semantic understandingfor subsequent coding improvement. In short, we exploitDNNs to accurately segment “perceptually InSIGnifcant” (pIn-SIG) texture areas to produce a corresponding pInSIG mask.In many instances, this mask drives the encoder to performseparately for pInSIG textures that are typically inferred with-out additional residuals, and “perceptually SIGniﬁcant” (pSIG)areas elsewhere using traditional hybrid coding method. Thisapproach is implemented on top of the AV1 codec [196]–[198]by enabling the GoP-level switchable mechanism, This yields

Resnet-50 Pool UpsampleResnet-50 Pool UpsampleDilated CONVDilated CNN CONVInput Image CNNInput Image

PSP ModulePSP Module

CONVCONV Scene Scene SegmentationSegmentation

PSP ModulePSP Module

Fig. 3:

Texture Analyzer.

Proposed semantic segmentationnetwork using PSPNet [200] and ResNet-50 [201].noticeable bit rate savings for both standard test sequencesand additional challenging sequences from YouTube UGCdataset [199], under similar perceptual quality. The methodwe propose is a pioneering work that integrates learning-basedtexture analysis and reconstruction approaches with modernvideo codec to enhance video compression performance.

A. Texture Analysis

Our previous attempt [202] yielded encouraging bit ratesavings without decreasing visual quality. This was accom-plished by perceptually differentiating pInSIG textures andother areas to be encoded in a hybrid coding framework.However, the corresponding texture masks were derived usingtraditional methods, at the coding block level. On the otherhand, building upon advancements created by DNNs and large-scale labeled datasets ( e.g. , ImageNet [203], COCO [204], andADE20K [205]), learning-based semantic scene segmentationalgorithms [200], [205], [206] have been tremendously im-proved to generate accurate pixel-level texture masks.In this work, we ﬁrst rely on the powerful ResNet50 [201]with dilated convolutions [207], [208] to extract feature mapsthat effectively embed the content semantics. We then in-troduce the pyramid pooling module from PSPNet [200] toproduce a pixel-level semantic segmentation map shown inFig. 3. Our implementation starts with a pre-trained PSPNetmodel generated using the MIT SceneParse150 [209] as ascene parsing benchmark. We then retrained the model ona subset of a densely annotated dataset ADE20K [205]. Inthe end, the model offers a pixel segmentation accuracy of80.23%.It is worthwhile to note that such pixel-level segmentationmay result in the creation of a number of semantic classes.Nevertheless, this study suggests grouping similar textureclasses commonly found in nature scenes together into fourmajor categories, e.g. , “earth and grass”, “water, sea andriver”, “mountain and hill”, and “tree”. Each texture categorywould have an individual segmentation mask to guide thecompression performed by the succeeding video encoder.

B. Switchable Texture-Based Video Coding

Texture masks are generally used to identify texture blocks,and to perform the encoding of texture blocks and non-textureblocks separately, as illustrated in Fig. 4a. In this case study, the AV1 reference software platform is selected to exemplifythe efﬁciency of our proposal.

Texture Blocks.

Texture and non-texture blocks are identi-ﬁed by overlaying the segmentation mask from the textureanalyzer on its corresponding frame. These frame-alignedtexture masks produce pixel-level accuracy, which is capableof supporting arbitrary texture shapes. However, in order tosupport the block processing commonly adopted by videoencoders, we propose reﬁning original pixel-level masks totheir block-based representations. The minimum size of atexture block is 16 ×

16. In order to avoid boundary artifactsand maintain temporal consistency, we implemented a con-servative two-step strategy to determine the texture block.First, the block itself must be fully contained in the textureregion marked using the pixel-level mask. Then, its warpedrepresentation to temporal references ( e.g. , the preceding andsucceeding frames in the encoding order) have to be insidethe masked texture area of corresponding reference frames aswell. Finally, these texture blocks are encoded using the texturemode , and non-texture blocks are encoded as usual using thehybrid coding structure.

Texture Mode.

A texture mode coded block is inferredby its temporal reference using the global motion parame-ters without incurring any motion compensation residuals. Incontrast, non-texture blocks are compressed using a hybrid“prediction+residual” scheme. For each current frame and anyone of its reference frames, AV1 syntax speciﬁes only one setof global motion parameters at the frame header. Therefore,to comply with the AV1 syntax, our implementation onlyconsiders one texture class for each frame. This guaranteesthe general compatibility of our solution to existing AV1decoders. We further modiﬁed the AV1 global motion tool toestimate the motion parameters based on the texture regionsof the current frame and its reference frame. We used thesame feature extraction and model ﬁtting approach as in theglobal motion coding tool in order to provide a more accuratemotion model for the texture regions. This was done to preventvisual artifacts on the block edges between the texture andnon-texture blocks in the reconstructed video. Although wehave demonstrated our algorithms using the AV1 standard,we expect that the same methodology can be applied toother standards. For instance, when using the H.265/HEVCstandard, we can leverage the SKIP mode syntax to signal thetexture mode instead of utilizing the global motion parameters.Previous discussions have suggested that the texture mode isenabled along with inter prediction. Our extensive studies havealso demonstrated that it is better to activate the texture modein frames where bi-directional predictions are allowed ( e.g. , B-frames), for the optimal trade-off between bit rate saving andperceived quality. As will be shown in following performancecomparisons, we use a 8-frame GoP (or Golden-Frame (GF)group deﬁned in AV1) to exemplify the texture modes inevery other frame, by which the compound prediction from bi-directional references can be facilitated for prediction warping.Such bi-directional prediction could also alleviate possibletemporal quality ﬂickering.

Switchable Optimization.

In our previous work [210], thetexture mode was enabled for every B frame, demonstrating Get scene change information from first pass encodingGet scene change information from first pass encodingLoad pixel-based texture mask of Load pixel-based texture mask of

Frame

Calculate motion parameter

Frame Level

Calculate motion parameter

Level

Is it a texture block?Is it a texture block?

Block

Choose texture mode

Block Level

Choose texture mode

Level

Get scene change information from first pass encodingGet scene change information from first pass encodingbased texture mask of the chosen texture classbased texture mask of the chosen texture classCalculate motion parameterCalculate motion parameter NNIs it a texture block?Is it a texture block?YYChoose texture mode RD optimizationChoose texture mode RD optimizationEncode blockEncode block (a)

Texture region Texture region percentageEncode with percentageEncode with texture mode Bit ratetexture mode enabled Bit rateenabledFirst GoPFirst GoP Encode with Bit rateEncode with texture mode Bit ratetexture mode disableddisabled Enable texture Enable texture mode for the GoPmode for the GoPin the scenein the scene Texture NTexture region Nregion > 10%> 10%YYHas bit NHas bit rate Nrate savingsavingY Disable texture YEnable texture Disable texture mode for the GoPEnable texture GoP mode for the GoPin the sceneGoP in the scene (b)

Fig. 4:

Texture mode and switchable control scheme. (a)Texture mode encoder implementation. (b) Switchable texturemode decision.signiﬁcant bit rate reduction at the same level of perceptualsensation in most standard test videos, in comparison to theAV1 anchor. However, some videos did cause the modelto perform more poorly. One reason for this effect is thathigher QP settings typically incur more all-zero residualblocks. Alternatively, texture mode is also content-dependent:a relatively small number of texture blocks may be presentfor some videos. Both scenarios limit the bit rate savings,and an overhead of extra bits is mandatory for global motionsignaling, if texture mode is enabled.To address these problems, we introduce a switchablescheme to determine whether texture mode could be poten-tially enabled for a GoP or a GF group. The criteria forswitching are based on the texture region percentage thatis calculated as the average ratio of texture blocks in B-frames, and on the potential bit rate savings with or withouttexture mode. Figure 4b illustrates the switchable texture modedecision. Currently, we use bit rate saving as a criterion forswitch decisions when the texture mode is enabled. Thisassumes perceptual sensation will remain nearly the same,since these texture blocks are perceptually insigniﬁcant.

C. Experimental Results

We selected sequences with texture regions from standardtest sequences and the more challenging YouTube UGC data set [199]. YouTube UGC dataset is a sample selected fromthousands of User Generated Content (UGC) videos uploadedto YouTube. The names of the UGC videos follow the formatof Category Resolution UniqueID. We calculate the bit ratesavings at different QP values for 150 frames of the test se-quences. In our experiments, we used the following parametersfor the AV1 codec as the baseline: 8-frame GoP or GF groupusing random access conﬁguration; 30 FPS; constant qualityrate control policy; multi-layer coding structure for all GFgroups; maximum intra frame interval at 150. We evaluatethe performance of our proposed method in terms of bit ratesavings and perceived quality.

1) Coding Performance:

To evaluate the performance of theproposed switchable texture mode method, bit rate savings atfour quantization levels (QP = 16, 24, 32, 40) are calculatedfor each test sequence in comparison to the AV1 baseline.

Texture Analysis.

We compare two DNN-based textureanalysis methods [210], [212] with a handcrafted feature-basedapproach [211] for selected standard test sequences. Resultsare shown in Table II. A positive bit rate saving (%) indicatesa reduction compared with the AV1 baseline. Compared to thefeature based approach, DNN-based methods show improvedperformance in terms of bit rate saving. The feature basedapproach relies on color and edge information to generatethe texture mask and is less accurate and consistent bothspatially and temporally. Therefore, the number of blocks thatare reconstructed using texture mode is usually much smallerthan that of DNN-based methods. Note that the parametersused in feature based approach require manually tuning foreach video to optimize the texture analysis output. The pixel-level segmentation [210] shows further advantages comparedwith block-level method [212], since the CNN model does notrequire block size to be ﬁxed.

Switchable Scheme.

We also compare the proposedmethod, a.k.a., tex-switch , with our previous work in [210],a.k.a., tex-allgf , which enables texture mode for all frames ina GF group. All three methods use the same encoder settingfor fair comparison. Bit rate saving results for various videosat different resolutions against the AV1 baseline are shown inTable III. A positive bit rate saving (%) indicates a reductioncompared with the AV1 baseline.In general, compared to the AV1 baseline, the codingperformance of tex-allgf shows signiﬁcant bit rate savingsat lower QPs. However, as QP increases, the savings arediminished. In some cases, tex-allgf exhibits poorer codingperformance than the AV1 baseline at a high QP ( e.g. , negativenumbers at QP 40). At a high QP, most blocks have zeroresidual due to heavy quantization, leading to very limitedmargins for bit rate savings using texture mode. In addition,few extra bits are required for the signalling of global motionof texture mode coded blocks. The bit savings gained throughresidual skipping in texture mode still cannot compensate forthe bits used as overhead for the side information.Furthermore, the proposed tex-switch method retains thegreatest bit rate savings offered by tex-allgf , and resolves the https://media.withyoutube.com/ AV1 codec change-Id: Ibed6015aa7cce12fcc6f314ffde76624df4ad2a1 TABLE II: Bit rate saving (%) comparison between handcraft feature (FM) [211], block-level DNN (BM) [212] and pixel-levelDNN (PM) [210] texture analysis against the AV1 baseline for selected standard test sequences using tex-allfg method.

Video Sequence QP=16 (%) QP=24 (%) QP=32 (%) QP=40 (%)FM BM PM FM BM PM FM BM PM FM BM PMCoastguard − .

17 7 .

80 9 . − .

36 6 .

99 8 . − .

43 4 .

70 5 . − .

62 1 .

90 2 . Flower .

42 10 .

55 13 .

00 5 .

42 8 .

66 10 .

78 2 .

51 5 .

96 4 .

95 0 .

19 3 .

38 1 . Waterfall .

65 4 .

63 13 .

11 1 .

58 3 .

96 7 . − . − .

33 1 . − . − . − . Netﬂix aerial .

15 8 .

59 9 . − .

26 2 .

15 5 . − . − .

68 1 . − . − . − . Intotree .

88 5 .

32 9 .

71 0 .

15 4 .

32 9 . − .

14 1 .

99 8 . − . − .

83 4 . TABLE III: Bit rate saving (%) comparison for tex-allgf and tex-switch methods against the AV1 baseline.

Resolution Video Sequence QP=16 (%) QP=24 (%) QP=32 (%) QP=40 (%) tex-allgf tex-switch tex-allgf tex-switch tex-allgf tex-switch tex-allgf tex-switch

CIF Bridgeclose .

78 15 .

78 10 .

87 10 .

87 4 .

21 4 .

21 2 .

77 2 . Bridgefar .

68 10 .

68 8 .

56 8 .

56 6 .

34 6 .

01 6 . Coastguard .

14 9 .

14 8 .

01 8 .

01 5 .

72 5 .

72 2 .

13 2 . Flower .

00 13 .

00 10 .

78 10 .

78 4 .

95 4 .

95 1 .

20 1 . Waterfall .

11 13 .

11 7 .

21 7 .

21 1 .

30 1 . − .

48 0 . ×

270 Netﬂix ariel .

15 9 .

15 5 .

59 5 .

59 1 .

05 1 . − .

01 0 . .

77 10 .

77 9 .

27 9 .

27 5 .

23 5 .

23 1 .

54 1 . NewsClip 360P-22ce .

37 17 .

37 15 .

79 15 .

79 16 .

37 16 .

37 17 .

98 17 . TelevisionClip 360P-3b9a .

45 1 .

45 0 .

48 0 . − .

09 0 . − .

26 0 . TelevisionClip 360P-74dd .

66 1 .

17 1 .

17 0 .

36 0 . − .

37 0 . .

81 3 .

81 2 .

57 2 .

57 0 .

93 0 .

06 0 . HowTo 480P-4c99 .

36 2 .

36 1 .

67 1 .

67 0 .

37 0 . − .

16 0 . MusicVideo 480P-1eee .

31 3 .

29 3 .

29 2 .

53 2 . − . − . NewsClip 480P-15fa .

31 6 .

05 5 .

79 0 .

53 0 . − .

79 0 . NewsClip 480P-7a0d .

54 11 .

54 10 .

03 10 .

03 1 .

53 1 .

53 0 .

08 0 . TelevisionClip 480P-19d3 .

13 3 .

13 2 .

86 2 .

86 1 .

66 1 .

66 0 .

58 0 . .

72 12 .

72 11 .

84 11 .

84 9 .

31 9 .

31 6 .

35 6 . MusicVideo 720P-3698 .

76 1 .

07 1 .

07 0 .

30 0 . − .

17 0 . MusicVideo 720P-4ad2 .

93 6 .

93 3 .

81 3 .

81 1 .

87 1 .

87 0 .

60 0 . .

31 7 .

31 6 .

07 6 .

07 3 .

21 3 .

21 0 .

72 0 . MusicVideo 1080P-55af .

88 3 .

88 1 .

78 1 .

78 0 .

31 0 . − . − . intotree .

71 9 .

42 9 .

42 8 .

46 8 .

46 4 .

92 4 . Average .

96 7 .

96 6 .

28 6 .

27 3 .

38 3 .

40 1 .

45 2 . loss at higher QP settings. As shown in Table III, negativenumbers are mostly removed (highlighted in green) by theintroduction of a GoP-level switchable texture mode. In somecases where tex-switch has zero bit rate savings compared tothe AV1 baseline, the texture mode is completely disabledfor all the GF groups, whereas tex-allgf has loss. In a fewcases, however, tex-switch has less bit rate saving than tex-allgf (highlighted in red). This is because the bit rate savingperformance of the ﬁrst GF group in the scene fails toaccurately represent the whole scene in some of the UGCsequences with short scene cuts. A possible solution is toidentify additional GF groups that show potential bit ratesavings and enable texture mode for these GF groups.

2) Subjective Evaluation:

Although signiﬁcant bit rate sav-ings have been achieved compared to the AV1 baseline, itis acknowledged that identical QP values do not necessarilyimply the same video quality. We have performed a subjectivevisual quality study with 20 participants. Reconstructed videosproduced by the proposed method ( tex-switch ) and the baselineAV1 codec at QP = 16, 24, 32 and 40 are arranged randomlyand assessed by the participants using a double stimuluscontinuous quality scale (DSCQS) method [213]. Subjectshave been asked to choose among three options: the ﬁrst videohas better visual quality, the second video has better visualquality, or there is no difference between two versions.The result of this study is summarized in Figure 5. The Fig. 5:

Subjective evaluation of visual preference.

Resultsshow average subjective preference (%) for QP = 16, 24, 32,40 compared between AV1 baseline and proposed switchabletexture mode.“Same Quality” indicates the percentage of participants thatcannot tell the difference between the reconstructed videosby the AV1 baseline codec and the proposed method tex-switch (69.03% on average). The term “tex-switch” indicatesthe percentage of participants that prefer the reconstructionsby the proposed method tex-switch (14.32% on average); andthe “AV1” indicates the percentage of participants who think the visual quality of the reconstructed videos using the AV1baseline is better (16.65% on average).We observe that the results are sequence dependent and thatspatial and temporal artifacts can appear in the reconstructedvideo. The main artifacts come from the inaccurate pixel-basedtexture mask. For example, in some frames of Television-Clip 360P-74dd sequence, the texture masks include parts ofthe moving objects in the foreground, which are reconstructedusing texture mode. Since the motion of the moving objectsis different from the motion of the texture area, there arenoticeable artifacts around those parts of the frame. To furtherimprove the accuracy of region analysis using DNN-based pre-processing, we plan to incorporate an in-loop perceptual visualquality metric for optimization during the texture analysis andreconstruction.

D. Discussion And Future Direction

We proposed a DNN based texture analysis/synthesis codingtool for AV1 codec. Experimental results show that our pro-posed method can achieve noticeable bit rate reduction withsatisfying visual quality for both standard test sets and usergenerated content, which is veriﬁed by a subjective study. Weenvision that video coding driven by semantic understandingwill continue to improve in terms of both quality and bit rate,especially by leveraging advances of deep learning methods.However, there remain several open challenges that requirefurther investigation.Accuracy of region analysis is one of the major challengesfor integrating semantic understanding into video coding.However, recent advances in scene understanding have signif-icantly improved the performance of region analysis. Visualartifacts are still noticeable when a non-texture region isincorrectly included in the texture mask, particularly if theanalysis/synthesis coding system is open loop. One potentialsolution is to incorporate some perceptual visual quality mea-sures in-loop during the texture region reconstruction.Video segmentation benchmark datasets are important fordeveloping machine learning methods for video based seman-tic understanding. Existing segmentation datasets are eitherbased on images with texture [214], or contain general videoobjects only [215], [216], or focus on visual quality but lacksegmentation ground truth.VI. C

ASE S TUDY FOR C ODING :E ND - TO -E ND N EURAL V IDEO C ODING (E2E-NVC)This section presents a framework for end-to-end neuralvideo coding. We include a discussion of its key components,as well as its overall efﬁciency. Our proposed method isextended from our pioneering work in [104] but with signiﬁ-cant performance improvements by allowing fully end-to-endlearning-based spatio-temporal feature representation. Moredetails can be found in [131], [136], [217].

A. Framework

As with all modern video encoders, the proposed E2E-NVCcompresses the ﬁrst frame in each group of pictures as an intra-frame using a VAE based compression engine (neuro-Intra).

Intra Encoder

Motion Encoder Motion Decoder Intra Decoder

Residual Encoder

Residual DecoderReference Frame Buffer-Multi-scale Motion Compensation Network +Inter Coding Compressed Binary FeaturesIntra Coding neuro-Intra neuro-Res neuro-Motion

MS-MCN (a) C o n v x / ↓ C o n v x / ↓ R e s i du a l B l o c k ( x ) C o n v x / ↓ C o n v x / ↓ R e s i du a l B l o c k ( x ) C o n v x / ↓ C o n v x / ↓ R e s i du a l B l o c k ( x ) R e s i du a l B l o c k ( x ) C o n v x / ↓ C o n v x / ↓ N o n l o c a l A tt e n t i o n C o n v x / ↓ C o n v x / ↓ N o n l o c a l A tt e n t i o n R e s i du a l B l o c k ( x ) C o n v x / ↑ N o n l o c a l A tt e n t i o n C o n v x / ↓ C o n v x / ↑ R e s i du a l B l o c k ( x ) C o n v x / ↑ N o n l o c a l A tt e n t i o n R e s i du a l B l o c k ( x ) C o n v x / ↑ R e s i du a l B l o c k ( x ) C o n v x / ↑ Q AE AD QAE AD P A C o n v x / ↓ C o n v x / ↓ R e s i du a l B l o c k ( x ) R e s i du a l B l o c k ( x ) C o n v x / ↑ InputOutput (b)

Fig. 6:

End-to-End Neural Video Coding (E2E-NVC).

This E2E-NVC in (a) consists of modularized intra and intercoding, where inter coding utilizes respective motion andresidual coding. Each component is well exploited usinga stacked CNNs-based VAE for efﬁcient representations ofintra pixels, displaced inter residuals, and inter motions. Allmodularized components are inter-connected and optimized inan end-to-end manner. (b) General VAE model applies stackedconvolutions ( e.g. , 5 ×

5) with main encoder-decoder ( E m , D m ) and hyper encoder-decoder pairs ( E h , D h ), where mainencoder E m includes four major convolutional layers ( e.g. ,convolutional downsampling and three residual blocks ( × D h mirrorsthe steps in hyper encoder E h for hyper prior informationgeneration. Prior aggregation (PA) engine collects the informa-tion from hyper prior, autoregressive spatial neighbors, as wellas temporal correspondences (if applicable) for main decoder D m to reconstruct input scene. Non-local attention is adoptedto simulate the saliency masking at bottlenecks, and rectiﬁedlinear unit (ReLU) is implicitly embedded with convolutionsfor enabling the nonlinearity. “Q” is for quantization, AE andAD for respective arithmetic encoding and decoding. 2 ↓ and2 ↑ are downsampling and upsampling at a factor of 2 for bothhorizontal and vertical dimensions.It codes the remaining frames in each group using motioncompensated prediction. As shown in Fig. 6a, the proposedE2E-NVC uses the VAE compressor (neuro-Motion) to gen-erate the multiscale motion ﬁeld between the current frame andthe reference frame. Then, a multiscale motion compensationnetwork (MS-MCN) takes multiscale compressed ﬂows, warpsthe multiscale features of the reference frame, and combinesthese warped features to generate the predicted frame. The prediction residual is then coded using another VAE-basedcompressor (neuro-Res).A low-delay E2E-NVC based video encoder is speciﬁcallyillustrated in this work. Given a group of pictures (GOP) X = { X , X , ..., X t } , we ﬁrst encode X using the neuro-Intramodule and have its reconstructed frame ˆ X . The followingframe X is encoded predictively, using neuro-Motion, MS-MCN, and neuro-Res together, as shown in Fig. 6a. Note thatMS-MCN takes the multiscale optical ﬂows (cid:110) (cid:126)f d , (cid:126)f d , ..., (cid:126)f sd (cid:111) derived by the pyramid decoder in neuro-Motion, and thenuses them to generate the predicted frame ˆ X p by multiscalemotion compensation. Displaced inter-residual r = X − ˆ X p is then compressed in neuro-Res, yielding the reconstruction ˆ r . The ﬁnal reconstruction ˆ X is given by ˆ X = ˆ X p +ˆ r . Allof the remaining P-frames in the group of pictures are thenencoded using the same procedure.Fig. 6b illustrates the general architecture of the VAE model.The VAE model includes a main encoder-decoder pair that isused for latent feature analysis and synthesis, as well as ahyper encoder-decoder for hyper prior generation. The mainencoder E m uses four stacked CNN layers. Each convolutionallayer employs stride convolutions to achieve downsampling(at a factor of 2 in this example) and cascaded convolutionsfor efﬁcient feature extraction (here, we use three ResNet-based residual blocks [201]) . We use two-layer hyper encoder E h to further generate the subsequent hyper priors as sideinformation, which is used in the entropy coding of the latentfeatures.We apply stacked convolutional layers with a limited (3 × e.g. , via unequalfeature quantization) [140], [218]. This allows resources to beassigned such that salient areas are more accurately recon-structed, while resources are conserved in the reconstructionof less-salient areas. To more accurately discern salient fromnon-salient areas, we adopt the non-local attention module(NLAM) at the bottleneck layers of both the main encoderand hyper encoder, prior to quantization, in order to includeboth global and local information.To enable more accurate conditional probability densitymodeling for entropy coding of the latent features, we in-troduce the Prior Aggregation (PA) engine which fuses the We choose to apply cascaded ResNets for stacked CNNs because they arehighly efﬁcient and reliable. Other efﬁcient CNN architectures could also beapplied. bit per pixel (bpp) P S N R ( d B ) NLAICneuro-IntraMinnen (2018)BPG (4:4:4)JPEG2000

Fig. 7:

Efﬁciency of neuro-Intra.

PSNR vs. rate perfor-mance of neuro-Intra in comparison to NLAIC [136], Minnen(2018) [149], BPG (4:4:4) and JPEG2000. Note that the curvesfor neuro-Intra and NLAIC overlap.inputs from the hyper priors, spatial neighbors, and temporalcontext (if applicable) . Information theory suggests that moreaccurate context modeling requires fewer resources ( e.g. , bits)to represent information [219]. For the sake of simplicity, weassume the latent features ( e.g. , motion, image pixel, residual)are following the Gaussian distribution as in [148], [149]. Weuse the PA engine to derive the mean and standard deviationof the distribution for each feature. B. Neural Intra Coding

Our neuro-Intra is a simpliﬁed version of the Non-LocalAttention optimized Image Compression (NLAIC) that wasoriginally proposed in [136].One major difference between the NLAIC and the VAEmodel using autoregressive spatial context in [149] is theintroduction of the NLAM inspired by [220]. In addition,we have applied 3D 5 × × to extract spatialpriors, which are fused with hyper priors in PA for entropycontext modeling ( e.g. , the bottom part of Fig. 9). Here, wehave assumed the single Gaussian distribution for the contextmodeling of entropy coding. Note that temporal priors are notused for intra-pixel and inter-residual in this paper by onlyutilizing the spatial priors.The original NLAIC applies multiple NLAMs in both mainand hyper coders, leading to excessive memory consumptionat a large spatial scale. In E2E-NVC, NLAMs are only used atthe bottleneck layers for both main and hyper encoder-decoderpairs, allowing bits to be allocated adaptively. Intra and residual coding only use joint spatial and hyper priors withouttemporal inference. This 5 × × Rate EstimationPyramidal Flow Decoder Q Predicted FrameWarpingFeature Fusion Quantized FeaturesRate Distortion Optimization Reconstruction Error

Multiscale CompressedFlows (MCF)

Multiscale Motion Compensation Network

PyramidalFeatures Aggregation neuro-Motion

Fig. 8:

Multiscale Motion Estimation and Compensation.

One-stage neuro-Motion with MS-MCN uses a pyramidal ﬂowdecoder to synthesize the multiscale compressed optical ﬂows (MCFs) that are used in a multiscale motion compensationnetwork for generating predicted frames.

Spatial-temporal Prior Aggregation

Temporal Prior

Updating

Autoregressive Priors

Quantized Motion Features Updated Priors

ConvLSTM

Hyper DecoderHyper Priors 3D Masked ConvTemporal Priors c on c a t e Obtained ProbabilityNext Time Step

Fig. 9:

Context-Adaptive Modeling Using Joint Spatio-temporal and Hyper Priors.

All priors are fused in PA toprovide estimates of the probability distribution parameters.To overcome the non-differentiability of the quantizationoperation, quantization is usually simulated by adding uniformnoise in [142]. However, such noise augmentation is notexactly consistent with the rounding in inference, which canyield performance loss (as reported by [135]). Thus, we applyuniversal quantization (UQ) [135] in neuro-Intra. UQ is usedfor neuro-Motion and neuro-Res as well. When applied tothe common Kodak dataset, neuro-Intra performed as well asNLAIC [136], and outperformed Minnen (2018) [149], BPG(4:4:4) and JPEG2000, as shown in Fig. 7.

C. Neural Motion Coding and Compensation

Inter-frame coding plays a vital role in video coding. Thekey is how to efﬁciently represent motion in a compact formatfor compensation. In contrast to the pixel-domain block-basedmotion estimation and compensation in conventional videocoding, we rely on optical ﬂow to accurately capture thetemporal information for motion compensation .To improve inter-frame prediction, we extend our earlierwork [131] to multiscale motion generation and compensa-tion. This multiscale motion processing directly transforms two concatenated frames (where one frame is the referencefrom the past, and one is the current frame) into quantizedtemporal features that represent the inter-frame motion. Thesequantized features are decoded into compressed optical ﬂowin an unsupervised way for frame compensation via warping.This one-stage scheme does not require any pre-trained ﬂownetwork such as FlowNet2 or PWC-net to generate the opticalﬂow explicitly. It allows us to quantize the motion featuresrather than the optical ﬂows, and to train the motion featureencoder and decoder together with explicit consideration ofquantization and rate constraint.The neuro-Motion module is modiﬁed for multiscale motiongeneration, where the main encoder is used for feature fusion.We replace the main decoder with a pyramidal ﬂow decoder ,which generates the multiscale compressed optical ﬂows(MCFs). MCFs will be processed together with the referenceframe, using a multiscale motion compensation network (MS-MCN) to obtain the predicted frame efﬁciently, as shown inFig. 8. Please refer to [217] for more details.Encoding motion compactly is another important factor foroverall performance improvement. We suggest the joint spatio-temporal and hyper prior-based context-adaptive model shownin Fig. 9 for efﬁciently inferring current quantized features.This is implemented in the PA engine of Fig. 6b.The joint spatio-temporal and hyper prior-based context-adaptive model mainly consists of a spatio-temporal-hyperaggregation module (STHAM) and a temporal updating mod-ule (TUM), shown in Fig. 9. At timestamp t , STHAM isintroduced to accumulate all the accessible priors and estimatethe mean and standard deviation of Gaussian Mixture Model(GMM) jointly using: ( µ F , σ F ) = F ( F , ..., F i − , ˆ z t , h t − ) , (1)Spatial priors are autoregressively derived using masked5 × × × × F i , i = 0 , , , ... are elements of quantized latentfeatures ( e.g. , motion ﬂow), h t − is aggregated temporal priors from motion ﬂows preceding the current frame. The neuro-Motion module exploits temporal redundancy to further pre-diction efﬁciency, leveraging the correlation between second-order moments of inter motion. A probabilistic model of eachelement to be encoded is derived with the estimated µ F and σ F by: p F | ( F ,..., F i − , ˆ z t , h t − ) ( F i | F , ..., F i − , ˆ z t , h t − )= (cid:89) i ( N ( µ F , σ F ) ∗ U ( − ,

12 ))( F i ) . (2)Note that TUM is applied to embedded current quantizedfeatures F t recurrently using a standard ConvLSTM [221]: ( h t , c t ) = ConvLSTM( F t , h t − , c t − ) , (3)where h t are updated temporal priors for the next frame, c t is a memory state to control information ﬂow across multipletime instances ( e.g. , frames). Other recurrent units can also beused to capture temporal correlations as in (3).It is worth noting that leveraging second-order informationfor the representation of compact motion is also widelyexplored in traditional video coding approaches. For example,motion vector predictions from spatial and temporal co-locatedneighbors are standardized in H.265/HEVC, by which onlymotion vector differences (after prediction) are encoded. D. Neural Residual Coding

Inter-frame residual coding is another signiﬁcant modulecontributing to the overall efﬁciency of the system. It is usedto compress the temporal prediction error pixels. It affectsthe efﬁciency of next frame prediction, since errors usuallypropagate temporally.Here we use the VAE architecture in Fig. 6b to encode theresidual r t . The rate-constrained loss function is used: L = λ · D ( X t , ( X pt + ˆ r t )) + R, (4)where D is the (cid:96) loss between a residual compensatedframe X pt + ˆ r t and X t . neuro-Res will be ﬁrst pretrainedusing the frames predicted by the pretrained neuro-Motion andMS-MCN, and a loss function in (4) where the rate R onlyaccounts for the bits for residual. Then we reﬁne neuro-Resjointly with neuro-Motion and MS-MCN, using a loss where R incorporates the bits for both motion and residual with twoframes. E. Experimental Comparison

We applied the same low-delay coding setting as DVCin [129] for our method and traditional H.264/AVC, andH.265/HEVC for comparison. We encoded 100 frames andused GOP of 10 on H.265/HEVC test sequences, and 600frames with GOP of 12 on the UVG dataset. For H.265/HEVC,we applied the fast mode of the x265 — a popular open-source H.265/HEVC encoder implementation; while the fastmode of the x264 is used as the representative of theH.264/AVC encoder. http://x265.org/ %SS 36 15 G % 19&/XB&935++ (a) %SS 0 6 66 , 0 19&/XB&935++ (b) Fig. 10:

BD-Rate Illustration Using PSNR & MS-SSIM. (a) NVC offers averaged 35.34% gain against the anchorH.264/AVC when distortion is measured using PSNR. (b)NVC shows over 50% gains against anchor H.264/AVC whenusing MS-SSIM evaluation. MS-SSIM is usually studied as aperceptual quality metric in image compression, especially ata low bit rate.We show the leading compression efﬁciency in Fig. 10using respective PSNR and MS-SSIM measures, acrossH.265/HEVC and UVG test sequences. In Table IV, by settingthe same anchor using H.264/AVC, our NVC presents 35%BD-Rate gains, while H.265/HEVC and DVC offer 30% and22% gains, respectively. If the distortion is measured bythe MS-SSIM, our gains in efﬁciency are even larger. Thisdemonstrates that NVC can achieve a 50% improvement inefﬁciency, while both H.265/HEVC and DVC achieve onlyaround 25%.Our NVC rivals the recent DVC Pro [222], an upgradeof the earlier DVC [141], e.g. , 35.54% and 50.83% BD-Rate reduction measured by PSNR and MS-SSIM distortionrespectively for NVC, while 34.57% and 45.88% marked forDVC Pro. DVC [141] has mainly achieved a higher level ofcoding efﬁciency than H.265/HEVC at high bit rates. However, TABLE IV: BD-Rate Gains of NVC, H.265/HEVC and DVC against the H.264/AVC.

Sequences H.265/HEVC DVC NVC

PSNR MS-SSIM PSNR MS-SSIM PSNR MS-SSIMBDBR BD-(D) BDBR BD-(D) BDBR BD-(D) BDBR BD-(D) BDBR BD-(D) BDBR BD-(D)ClassB -32.03% 0.78 -27.67% 0.0046 -27.92% 0.72 -22.56% 0.0049 -45.66% 1.21 -54.90% 0.0114

ClassC -20.88% 0.91 -19.57% 0.0054 -3.53% 0.13 -24.89% 0.0081 -17.82%

ClassD -12.39% 0.57 -9.68% 0.0023 -6.20% 0.26 -22.44% 0.0067 -15.53% 0.70 -43.64% 0.0123

ClassE -36.45% 0.99 -30.82% 0.0018 -35.94% 1.17 -29.08% 0.0027 -49.81% 1.70 -58.63% 0.0048

UVG -48.53% 1.00 -37.5% 0.0056 -37.74% 1.00 -16.46% 0.0032 -48.91% 1.24 -53.87% 0.0100

Average -30.05% 0.85 -25.04% 0.0039 -22.26% 0.65 -23.08% 0.0051 -35.54% 1.11 -50.83% 0.0103

NVC (BPP: 0.1274/PSNR: 28.07dB) H.265/HEVC (BPP: 0.1347/PSNR: 27.61dB) H.264/AVC (BPP: 0.1353/PSNR: 26.57dB)NVC (BPP: 0.0634/PSNR: 34.63dB) H.265/HEVC (BPP: 0.0627/PSNR: 33.88dB) H.264/AVC (BPP: 0.0687/PSNR: 32.57dB)

NVC (BPP: 0.0364/PSNR: 36.82dB) H.265/HEVC (BPP: 0.0368/PSNR: 36.24dB) H.264/AVC (BPP: 0.0395/PSNR: 35.15dB)

Fig. 11:

Visual Comparison.

Reconstructed frames of NVC, H.265/HEVC and H.264/AVC. We avoid blocky artifacts, visiblenoise, etc. , and provide better quality at lower bit rate.a sharp decline in the performance of DVC is revealed atlow bit rates ( e.g. , performing worse than H.264/AVC at somerates). We have also observed that DVC’s performance variesfor different test sequences. DVC Pro upgrades DVC withbetter intra/residual coding using [149] and λ ﬁne-tuning,showing state-of-the-art performance [222]. Visual Comparison

We provide a visual quality comparisonbetween NVC, H.264/AVC, and H.265/HEVC as shown inFig. 11. Generally, NVC yields reconstructions that are muchhigher in quality than those of its competitors, even with alower bit rate cost. For the sample clip “RaceHorse”, whichincludes non-translational motion and a complex background, NVC uses 7 percent fewer bits despite an improvement in qual-ity greater than 1.5 dB PSNR, compared with H.264/AVC. Forother cases, our method also shows robust improvement. Tradi-tional codec usually suffers from blocky artifacts and motion-induced noise close to the edges of objects. In H.264/AVC,you clearly can observe block partition boundaries with severepixel discontinuity. Our results provide higher-quality recon-struction and avoid noise and artifacts.

F. Discussion And Future Direction

We developed an end-to-end deep neural video codingframework that can learn compact spatio-temporal represen- tation of raw video input. Our extensive simulations yieldedvery encouraging results, demonstrating that our proposedmethod can offer consistent and stable gains over existingmethods ( e.g. , traditional H.265/HEVC, recent learning-basedapproaches [129], etc. ,) across a variety bit rates and a widerange of content.The H.264/AVC, H.264/HEVC, AVS, AV1, and eventhe VVC, are masterpieces of hybrid prediction/transformframework-based video coding. Rate-distortion optimization,rate control, etc. , can certainly be incorporated to improvelearning-based solutions. For example, reference frame selec-tion is an important means by which we can embed and aggre-gate the most appropriate information for reducing temporalerror and improving overall inter-coding efﬁciency. Makingdeep learning-based video coding practically applicable isanother direction worthy of deeper investigation.VII. C ASE S TUDIES FOR P OST - PROCESSING :E FFICIENT N EURAL F ILTERING

In this case study, both in-loop and post ﬁltering are demon-strated using stacked DNN-based neural ﬁlters for qualityenhancement of reconstructed frames. We speciﬁcally designa single-frame guided CNN which adapts pre-trained CNNmodels to different video contents for in-loop ﬁltering, and amulti-frame CNN leveraging spatio-temporal information forpost ﬁltering. Both reveal noticeable performance gains. Inpractice, neural ﬁlters can be devised, i.e. , in-loop or post,according to the application requirements.

A. In-loop Filtering via Guided CNN

As reviewed in Section IV, most existing works design aCNN model to directly map a degraded input frame to itsrestored version ( e.g. , ground truth label), as illustrated inFig. 12a. To ensure that the model is generalizable to othercontexts, CNN models are often designed to use deeper layers,denser connections, wider receptive ﬁelds, etc. , with hundredsof millions of parameters. As a consequence, such generalizedmodels are poorly suited to most practical applications. Toaddress this problem, we propose that content adaptive weightsbe used to guide a shallow CNN model (as shown in Fig. 12b)instead.The principle underlying this approach is sparse signaldecomposition: We expect that the CNN model can representany input as a weighted combination of channel-wise features.Note that weighting coefﬁcients are dependent on input sig-nals, making this model generalizable to a variety of contentcharacteristics.

Method.

Let x be a degraded block with N pixels in acolumn-wise vector format. The corresponding source block of x is s , which has a processing error d = s − x . We wish to have r corr from x so that the ﬁnal reconstruction x corr = x + r corr is closer to s .Let the CNN output layer have M channels, i.e. , r , r , · · · , r M − . Then, the r corr is assumed as a linear combi-nation of these channel-wise feature vectors, r corr = a r + a r + · · · + a M − r M − , (5) input x r r M-1 … a a M-1 x corr r corr CNN modelskip connectioninput x x corr r corr CNN modelskip connection (a)(b)

High-quality HF Low-quality LF High-quality HF FlowNetMotion compensationflow field Low-quality LF Compensated HF (b) Pre-processing (c) Multi-frame processingWARN model Enhanced LF WARN modelEnhanced HF and HF (a) Single-frame processing Compensated HF Plain ResBlk Proposed ResBlk

Raw videos

Video decoder

Reconstructed videos

Proposed case study I:

CNN-based texture analysis

Pre-processing Video encoder

Proposed case study II:

End-to-end neural video codingEfficient neural in-loop filtering

Proposed case study III:

CNN-based multi-frame enhancement

Post-processing T r an s m i ss i on Fig. 12:

CNN-based Restoration. (a) Conventional modelstructure. (b) Guided CNN model with adaptive weights.where a , a , · · · , a M − are the weighting parameters that areexplicitly signaled in the compressed bitstream.Our objective is to minimize the distance between therestored block x corr and its corresponding source s , i.e. , | x corr − s | = | r corr − d | . Given the channel-wise outputfeatures r , r , · · · , r M − , for a degraded input x , theweighting parameters a , a , · · · , a M − can then be estimatedby least-square optimization as [ a , a , · · · , a M − ] T = ( R T R ) − R T d , (6)where R = [ r , r , . . . , r M − ] is the matrix at a size of N × M comprised of stacked output features in column-wise order.The reconstruction error is given by e = | r corr − d | = | d | − d T R ( R T R ) − R T d . (7) Loss Function.

Assuming that one training batch is com-prised of T patch pairs: { s i , x i } , i = 0 , , , · · · , T − , theoverall reconstruction error over the training set is E = (cid:88) i {| d i | − d i T R i ( R i T R i ) − R i T d i } , (8)where d i = s i − x i is the error for the i th patch. R i =[ r i, , r i, , · · · , r i, M − ] is the corresponding channel-wise fea-tures in matrix form, with r i,j being the j th channel whentraining sample x i is passed through the CNN model. Giventhat | d i | is independent of the network model, the lossfunction can be simpliﬁed as L = (cid:88) i {− d i T R i ( R i T R i ) − R i T d i } . (9) Experimental Studies.

A shallow baseline CNN model(asdescribed in Table V) is used to demonstrate the efﬁciencyof the guided CNN model. This model is comprised of sevenlayers in total and has a ﬁxed kernel size of 3 ×

3. At thebottleneck layer, the channel number of the output feature mapis M . After extensive simulations, M = 2 was selected. Intotal, our model only requires 3,744 parameters, far fewer thanthe number required by existing methods.In training, 1000 pictures of DIV2K [223] were used. Allframes were compressed using the AV1 encoder with in-loopﬁlters CDEF [159] and LR [160] turned off to generate TABLE V: Layered structure and parameter settings of base-line CNN model.

Layer Kernel size Input channels Output channels Parameters1 × × × × × × × Total parameters corresponding quantization-induced degraded reconstructions.We divided the 64 QPs into six ranges and trained one modelfor each QP range. The six ranges include QP values 7 to16, 17 to 26, 27 to 36, 47 to 56, and 57 to 63. Compressedframes falling into the same QP range were used to trainthe corresponding CNN model. Frames were segmented into64 ×

64 patches. Each batch contained 1,000 patches. Weadopted the Adaptive moment estimation (Adam) algorithm,with the initial learning rate set at 1e-4. The learning rate ishalved every 20 epochs.We used the Tensorﬂow platform, which runs on NVIDIAGeForce GTX 1080Ti GPU, to evaluate coding efﬁciencyacross four QPs, e.g. , {

32, 43, 53, and 63 } . Our test setincluded 24 video sequences with resolutions ranging from2560 × × N was set to 64, 128, 256, and thewhole frame, respectively. We found that N = 256 yieldsthe best performance. For each block, the linear combinationparameters a i ( i = 0 , were derived accordingly. To strikean appropriate balance between bit consumption and modelefﬁciency, our experiments suggest that the dynamic range of a i is within 15.We compared the respective BD-Rate reductions of ourguided CNN model and a baseline CNN model against theAV1 baseline encoder. All ﬁlters were enabled for the AV1anchor. For a description of the baseline CNN model, seeTable V. Our guided CNN model is the baseline model plusthe adaptive weights.Both baseline and guided CNN models were applied on topof the AV1 encoder with only the deblocking ﬁlter enabled,and other ﬁlters (including CDEF and LR) turned off. The ﬁnd-ings reported in Table VI demonstrate that either baseline orguided CNN models can be used to replace additional adaptivein-loop ﬁlters, while improving R-D efﬁciency. Furthermore,regardless of block size and frame types, our guided modelalways outperformed the baseline CNN. This is mainly dueto the adaptive weights used to better characterize content dy-namics. Similar lightweight CNN structures can be upgradedusing deep models [163], [164], [167] for potentially greaterBD-Rate savings. B. Multi-frame Post Filtering

This section demonstrates how multi-frame video enhance-ment (MVE) scheme-based post ﬁltering can be used to min-imize compression artifacts. We implemented our proposed

Low-quality LF High-quality HF Motion compensationflow Low-quality LF (b) (c) WARN model (a)

Compensated HF Plain ResBlk Proposed ResBlk

Compensated HF Low-quality LF Compensated MF LF ResBlk ResBlk8 ResBlksPOC: 0 1 2 3 4 5 6 7 8GOP (GOP size = 16)HF HF HFLF LF LF LF LF LF 9 10 11 12 13 14 15 16HF HFLF LF LF LF LF LF × , WARN modelEnhanced HF Fig. 13:

WARN.

This wide activation residual network is usedto fuse/enhance input frame for improved quality. In MVEcase, it takes three inputs to enhance the LFs; and in SVEcase, it inputs a single frame and outputs its enhanced version.This WARN generally follows the residual network structurewith residual link and ResBlk embedded. Note that ResBlkis extended to support wide activation from its plain versionprior to ReLU activation.approach on AV1 reconstructed frames and achieved signiﬁ-cant coding improvement. Similar observations are expectedwith different anchors, such as the H.265/HEVC.

Method.

Single-frame video enhancement (SVE) refers tothe sole application of the fusion network without leveragingtemporal frame correlations. As discussed in Section IV, thereare a great number of network models that can be used to doSVE. In most cases, the efﬁciency and complexity are at oddswith one another: In other words, efﬁciency and complexitycome at the cost of deeper networks and higher numbers ofparameters. Recently, Yu et al. [224] discovered that modelswith more feature channels before activation could providesigniﬁcantly better performance with the same parametersand computational budgets. We designed a wide activationresidual network (WARN) by combining wide activation witha powerful deep residual network (ResNet) [225], shownin Fig. 13. This WARN illustrates the three inputs for anenhanced output in the MVE framework. In contrast, SVEnormally inputs a single frame, and outputs a correspondingenhanced representation.This MVE closely follows the two-step strategy reviewedin Section IV. It uses FlowNet2 [187] to perform pixel-level motion estimation/compensation-based temporal framealignment. Next, a WARN-based fusion network is used forﬁnal enhancement. We allow the two High-quality Frames(HF) immediately preceding and succeeding a low-qualityframe (LF) to enhance the Low-quality Frame (LF) in between.Bi-directional warping is performed for each LF to producecompensated HFs in Fig. 14.

Experimental Studies.

We evaluate both SVE and MVEagainst the AV1 baseline. A total of 118 video sequenceswere selected to train network models. More speciﬁcally, theﬁrst 200 frames of each sequence were encoded with AV1encoder to generate the reconstructed frames. The QPs are {

32, 43, 53, 63 } , yielding 23,600 reconstructed frames in total.After frame alignment, we selected one training set containingcompensated HF , compensated HF , and to-be-enhanced LF TABLE VI: BD-Rate savings of baseline and guided CNN models against the AV1.

Resolution Sequence All Intra Random AccessBaseline Guided CNN Baseline Guided CNNCNN N=64 N=128 N=256 Frame CNN N=64 N=128 N=256 Frame × PeopleOnStreet − . − . − . − . − . − . − . − . − . − . Trafﬁc − . − . − . − . − . − .

26% +1 . − . − . − . × BasketballDrive − .

45% +2 . − . − . − . − .

02% +8 .

04% +0 .

87% +0 . − . BQTerrace − . − . − . − . − . − .

33% +0 . − . − . − . Cactus − . − . − . − . − . − .

21% +1 . − . − . − . Kimono − .

23% +3 . − . − . − . − .

07% +6 .

07% +0 . − . − . ParkScene − .

21% +0 . − . − . − . − .

07% +1 . − . − . − . blue-sky − . − . − . − . − .

56% +0 .

00% +3 . − . − . − . crowd run − . − . − . − . − . − . − . − . − . − . × BasketballDrill − . − . − . − . − . − . − . − . − . − . BQMall − . − . − . − . − . − .

15% +0 . − . − . − . PartyScene − . − . − . − . − . − . − . − . − . − . RaceHorsesC − . − . − . − . − . − . − . − . − . − . × BasketballPass − . − . − . − . − . − .

20% +0 . − . − . − . BlowingBubbles − . − . − . − . − . − . − . − . − . − . BQSquare − . − . − . − . − . − . − . − . − . − . RaceHorses − . − . − . − . − . − . − . − . − . − . × Johnny − . − . − . − . − . − .

31% +8 . − . − . − . FourPeople − . − . − . − . − . − .

29% +17 .

99% +1 . − . − . KristenAndSara − . − . − . − . − . − .

42% +15 .

95% +0 . − . − . × Harbour − . − . − . − . − . − . − . − . − . − . Ice − . − . − . − . − . − . − . − . − . − . Silent − . − . − . − . − . − .

21% +1 . − . − . − . Students − . − . − . − . − . − .

52% +1 . − . − . − . Average − . − . − . − . − . − .

26% +2 . − . − . − . input x r r M-1 … a a M-1 x corr r corr CNN modelskip connectioninput x x corr skip connection (a)(b)

High-quality HF Low-quality LF High-quality HF FlowNetMotion compensation flow

Low-quality LF Compensated HF (b) (c) WARN model Enhanced LF WARN modelEnhanced HF (a) Compensated HF Plain ResBlk Proposed ResBlk

Compensated HF Low-quality LF LF ResBlk ResBlk8 ResBlks × , WARN modelEnhanced HF Fig. 14:

Enhancement Framework. (a) Single-input WARN-based SVE to enhance the HF. (b)+(c) Two-step MVE usingFlowNet2 for temporal alignment, and three-input WARN- based fusion to use preceding and succeeding HFs for LFenhancement.from every 8 frames, which yielded a total of 2900 trainingsets. These sets were used to train the WARN model as thefusion network. Notice that we trained the WARN modelsfor SVE and MVE individually. The GoP size was 16 witha hierarchical prediction structure. The LFs and HFs wereidentiﬁed using their QPs, i.e. , HFs with lower QP than thebase QP were decoded, such as frames 0, 4, 8, 12, and 16 inFig. 15.Algorithms were implemented using the Tensorﬂow plat-form, NVIDIA GeForce GTX 1080Ti GPU. In training, frameswere segmented into 64 ×

64 patches, with 64 patches includedin each batch. We adopted the Adam optimizer with theinitial learning rate set at 1e-4. The learning rate can be thenadjusted using the step strategy with γ = 0 . . An additional 18sequences were also employed for testing. These were mostlyused to evaluate video quality. The ﬁrst 50 frames of eachtest sequence were compressed. Then the reconstructed frameswere enhanced using the proposed SVE and MVE methods. Plain ResBlk Proposed ResBlk

Compensated HF Low-quality LF Compensated MF LF ResBlk ResBlk8 ResBlksPOC: 0 1 2 3 4 5 6 7 8GOP (GOP size = 16)HF HF HFLF LF LF LF LF LF 9 10 11 12 13 14 15 16HF HFLF LF LF LF LF LFOriginal frame AV1 reconstructed frame SVE enhancement MVE emhancement × , Fig. 15:

The hierarchical coding structure in the AV1encoder . The LFs are enhanced using HFs following theprediction structure via MVE scheme, and HFs are restoredusing SVE method.We applied the proposed method on AV1 reconstructedframes. The results are presented in Table VII. Due to thehierarchical coding structure in inter prediction, the LFs inFig. 15 were enhanced using the neighboring HFs via MVEframework. The HFs themselves are enhanced using the SVEmethod. TABLE VII: BD-rate improvement of proposed SVE andMVE scheme against the AV1.

Class Sequence All Intra Random AccessSVE MVE SVE MVEA PeopleOnStreet − . − . − . − . Trafﬁc − . − . − . − . B BasketballDrive − . − . − . − . BQTerrace − . − . − . − . Cactus − . − . − . − . Kimono − . − . − . − . ParkScene − . − . − . − . C BasketballDrill − . − . − . − . BQMall − . − . − . − . PartyScene − . − . − . − . RaceHorsesC − . − . − . − . D BasketballPass − . − . − . − . BlowingBubbles − . − . − . − . BQSquare − . − . − . − . RaceHorses − . − . − . − . E FourPeople − . − . − . − . Johnny − . − . − . − . KristenAndSara − . − . − . − . Average − . − . − . − . The overall BD-Rate savings of the SVE and MVE methodsare tabulated in Table VII, against the AV1. SVE achieves anaveraged reduction of 8.2% and 5.0% BD-rate for all intraand random access scenarios, respectively. On the other hand,our MVE obtains 20.1% and 7.5% BD-rate savings on aver-age, further demonstrating the effectiveness of our proposedscheme. When random access techniques are used, the HFsselected are generally distant from a target LF, which reducesthe beneﬁts provided from inter HFs. On the other hand,intra coding techniques uniformly demonstrate greater BD-ratesavings, because the neighboring frames nearest to target LFscan be used. This contributes signiﬁcantly to enhancement.Besides the objective measures, sample snapshots of recon-structed frames are illustrated in Fig. 16, clearly demonstratingthat blocky and ringing artifacts from the AV1 baseline areattenuated after applying either SVE or MVE based ﬁltering.Notably, MVE creates more visually appealing images thanSVE.

C. Discussion And Future Direction

In this section, we proposed DNN-based approaches forvideo quality enhancement. For in-loop ﬁltering, we developeda guided CNN framework to adapt pre-trained CNN models tovarious video contents. Under this framework, the guided CNNlearns to project an input signal onto a subspace of dimension M . The weighting parameters for a linear combination of thesechannels are explicitly signaled in the encoded bitstream to ob-tain the ﬁnal restoration. For post ﬁltering, we devised a spatio-temporal multi-frame architecture to alleviate the compressionartifacts. A two-step scheme is adopted in which optical ﬂowis ﬁrst obtained for accurate motion estimation/compensation,and then a wide activation residual network called WARNis designed for information fusion and quality enhancement.Our proposed enhancement approaches can be implementedon different CNN architectures. SVE enhancement MVE emhancement Original frame AV1 reconstructed frame SVE enhancement MVE emhancement

Fig. 16:

Qualitative Visualization

Zoomed-in snapshots ofreconstructed frames for the AV1 baseline, SVE and MVEﬁltered restoration, as well as the ground truth label.The quality of enhanced frames plays a signiﬁcant rolefor overall coding performance, since they serve as referenceframes for the motion estimation of subsequent frames. Ourfuture work will investigate the joint effect of in-loop ﬁlteringand motion estimation on reference frames to exploit theinherent correlations of these coding tools, which could furtherimprove coding performance.VIII. D

ISCUSSION AND C ONCLUSION

As an old Chinese saying goes, “A journey of a thousandmiles begins with a single step.” This is particularly true in therealm of technological advancement. Both the ﬁelds of videocompression and machine learning have been established formany decades, but until recently, they evolved separately inboth academic explorations and industrial practice.Lately, however, we have begun to witness the interdis-ciplinary advancements yielded by the proactive applicationof deep learning technologies [226] into video compressionsystems. Beneﬁts of these advances include remarkable im-provements in performance in many technical aspects. Toshowcase the remarkable products of this disciplinary cross-pollination, we have identiﬁed three major functional blocksin a practical video system, e.g. , pre-processing, coding, post-processing. We then reviewed related studies and publicationsto help the audience familiarize themselves with these topics.Finally, we presented three case studies to highlight the state-of-the-art efﬁciency resulting from the application of DNNs tovideo compression systems, which demonstrates this avenueof exploration’s great potential to bring about a new generationof video techniques, standards, and products.Though this article presents separate DNN-based case stud-ies for pre-processing, coding, and post-processing, we believethat a fully end-to-end DNN model could potentially offera greater improvement in performance, while enabling morefunctionalities. For example, Xia et al. [227] applied deepobject segmentation in pre-processing, and used it to guideneural video coding, demonstrating noticeable visual improve-ments at very low bit rates. Meanwhile, Lee et al. [228] andothers observed similar effects, when a neural adaptive ﬁlter was successfully used to further enhance neural compressedimages.Nevertheless, a number of open problems requiring substan-tial further study have been discovered. These include: • Model Generalization: It is vital for DNN models to begeneralizable to a wide variety of video content, differentartifacts, etc.

Currently, most DNN-based video compres-sion techniques utilize supervised learning, which oftendemands a signiﬁcant amount of labelled image/videodata for the full spectrum coverage of aforementioned ap-plication scenarios. Continuously developing a large-scaledataset, such as the ImageNet presents one possiblesolution to this problem. An alternative approach may usemore advanced techniques to alleviate uncertainty relatedto a limited training sample for model generalization.These techniques include (but are not limited to) few-shot learning [229] and self-supervised learning [226]. • Complexity: Existing DNN-based methods are mainlycriticized for their unbearable complexity in both com-putational and spatial dimensions. Compared to conven-tional video codec, which requires tens of Kilobyteson-chip memory, most DNN algorithms require severalMegabytes or even Gigabytes of memory space. On theother hand, although inference may be very fast, trainingcould take hours, days or even weeks for convergedand reliable models [141]. All of these issues presentserious barriers to the market adoption of DNN-basedtools, particularly on energy-efﬁcient mobile platforms.One promising solution is to design specialized hard-ware for the acceleration of DNN algorithms [158].Currently, neural processing units (NPU) have attractedsigniﬁcant attention, and have been gradually deployedin heterogeneous platforms ( e.g. , Qualcomm AI Enginein the Snapdragon chip series, Neural Processor in Applesilicons, etc. ) This paints a promising picture of a futurein which DNN algorithms can be deployed on NPU-equipped devices at a massive scale. • QoE Metric: Video quality matters. A video QoE metricthat is better correlated with the human visual systemis highly desirable, not only for quality evaluation, butalso for loss control in DNN-based video compression.There has been notable development in both subjec-tive and objective video quality assessments, yieldingseveral well-known metrics, such as SSIM [230], just-noticeable-distortion (JND) [231], and VMAF [232],some of which are actively adopted for the evaluation ofvideo algorithms, application products, etc.

On the otherhand, existing DNN-based video coding approaches canadaptively optimize the efﬁciency of a pre-deﬁned lossfunction, such as MSE, SSIM, adversarial loss [157],VGG feature based semantic loss, etc.

However, noneof these loss functions has shown clear advantages. Auniﬁed, differentiable, and HVS-driven metric is of greatimportance for the capacity of DNN-based video codingtechniques to offer perceptually better QoE. The exponential growth of Internet trafﬁc, a majority ofwhich involves videos and images, has been the drivingforce for the development of video compression systems.The availability of a vast amount of images through theInternet, meanwhile, has been critical for the renaissance ofthe ﬁeld of machine learning. In this work, we show thatrecent progress in deep learning can, in return, improve videocompression. These mutual positive feedbacks suggest thatsigniﬁcant progress could be achieved in both ﬁelds when theyare investigated together. Therefore, the approaches presentedin this work could be the stepping stones for improving thecompression efﬁciency in Internet-scale video applications.From a different perspective, most compressed videos willbe ultimately consumed by human beings or interpreted bymachines, for subsequent task decisions. This is a typicalcomputer vision (CV) problem, i.e. , content understandingand decisions for consumption or task-oriented application( e.g. , detection, classiﬁcation, etc .) Existing approaches haveperformed these tasks by ﬁrst decoding the video, and thenexamining the tasks via learned or rule-based methods basedon decoded pixels. Such separate processing, e.g. , video de-coding followed by CV tasks, is relied upon mainly becausetraditional pixel-prediction based differential video compres-sion methods break the spatio-temporal features that couldbe potentially helpful for vision tasks. In contrast, recentDNN-based video compression algorithms rely on the featureextraction, activation, suppression, and aggregation for morecompact representation. For these reasons, it is expected thatthe CV tasks can be fulﬁlled in the compressive domainwithout bit decoding and pixel reconstruction. Our earlierattempts have shown very encouraging gain in the accuracyof classiﬁcation and retrieval in compressive formats, withoutresorting to the traditional feature-based approaches usingdecoded pixels, which we report in [233], [234]. Usingpowerful DNNs to unify video compression and computervision techniques is an exciting new ﬁeld. It is also worthnoting that the ISO/IEC MPEG is now actively working ona new project called “Video Coding for Machine” (VCM) ,with emphasis on exploring video compression solutions forboth human perception and machine intelligence.R EFERENCES[1] D. J. Brady, M. E. Gehm, R. A. Stack, D. L. Marks, D. S. Kittle, D. R.Golish, E. Vera, and S. D. Feller, “Multiscale gigapixel photography,”

Nature , vol. 486, no. 7403, pp. 386–389, 2012.[2] M. Cheng, Z. Ma, S. Asif, Y. Xu, H. Liu, W. Bao, and J. Sun, “A dualcamera system for high spatiotemporal resolution video acquisition,”

IEEE Trans. Pattern Analysis and Machine Intelligence , no. 01, pp.1–1, 2020.[3] F. Dufaux, P. Le Callet, R. Mantiuk, and M. Mrak,

High dynamic rangevideo: from acquisition, to display and applications . Academic Press,2016.[4] M. Winken, D. Marpe, H. Schwarz, and T. Wiegand, “Bit-depthscalable video coding,” in , vol. 1. IEEE, 2007, pp. I–5.[5] P. Tudor, “Mpeg-2 video compression,”

Electronics & communicationengineering journal , vol. 7, no. 6, pp. 257–264, 1995.[6] B. G. Haskell, A. Puri, and A. N. Netravali,

Digital video: anintroduction to MPEG-2 . Springer Science & Business Media, 1996. https://mpeg.chiariglione.org/standards/exploration/video-coding-machines [7] T. Sikora, “The mpeg-4 video standard veriﬁcation model,” IEEETransactions on circuits and systems for video technology , vol. 7, no. 1,pp. 19–31, 1997.[8] W. Li, “Overview of ﬁne granularity scalability in mpeg-4 video stan-dard,”

IEEE Transactions on circuits and systems for video technology ,vol. 11, no. 3, pp. 301–317, 2001.[9] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overviewof the H.264/AVC video coding standard,”

IEEE Transactions oncircuits and systems for video technology , vol. 13, no. 7, pp. 560–576,2003.[10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview ofthe high efﬁciency video coding (HEVC) standard,”

IEEE Transactionson circuits and systems for video technology , vol. 22, no. 12, pp. 1649–1668, 2012.[11] V. Sze, M. Budagavi, and G. J. Sullivan, “High efﬁciency video coding(hevc),” in

Integrated circuit and systems, algorithms and architectures .Springer, 2014, vol. 39, pp. 49–90.[12] G. J. Sullivan, P. N. Topiwala, and A. Luthra, “The h. 264/avc advancedvideo coding standard: Overview and introduction to the ﬁdelity rangeextensions,” in

Applications of Digital Image Processing XXVII , vol.5558. International Society for Optics and Photonics, 2004, pp. 454–474.[13] A. Vetro, T. Wiegand, and G. J. Sullivan, “Overview of the stereo andmultiview video coding extensions of the h. 264/mpeg-4 avc standard,”

Proceedings of the IEEE , vol. 99, no. 4, pp. 626–642, 2011.[14] L. Yu, S. Chen, and J. Wang, “Overview of AVS-video codingstandards,”

Signal processing: Image communication , vol. 24, no. 4,pp. 247–262, 2009.[15] S. Ma, S. Wang, and W. Gao, “Overview of ieee 1857 video codingstandard,” in . IEEE, 2013, pp. 1500–1504.[16] J. Zhang, C. Jia, M. Lei, S. Wang, S. Ma, and W. Gao, “Recentdevelopment of avs video coding standard: AVS3,” in . IEEE, 2019, pp. 1–5.[17] Y. Chen, D. Murherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker,C. Chen, H. Su, U. Joshi et al. , “An overview of core coding tools inthe AV1 video codec,” in . IEEE,2018, pp. 41–45.[18] J. Han, B. Li, D. Mukherjee, C.-H. Chiang, C. Chen, H. Su, S. Parker,U. Joshi, Y. Chen, Y. Wang et al. , “A technical overview of av1,” arXivpreprint arXiv:2008.06091

Applications of Digital Image ProcessingXLII , A. G. Tescher and T. Ebrahimi, Eds., vol. 11137, September2019, p. 26.[22] Y.-C. Lin, H. Denman, and A. Kokaram, “Multipass encoding forreducing pulsing artifacts in cloud based video transcoding,” in , September 2015,pp. 907–911.[23] G. J. Sullivan and T. Wiegand, “Video compression-from concepts tothe H.264/AVC standard,”

Proceedings of the IEEE , vol. 93, no. 1, pp.18–31, 2005.[24] A. Norkin, G. Bjontegaard, A. Fuldseth, M. Narroschke, M. Ikeda,K. Andersson, M. Zhou, and G. Van der Auwera, “HEVC deblockingﬁlter,”

IEEE Transactions on Circuits and Systems for Video Technol-ogy , vol. 22, no. 12, pp. 1746–1754, 2012.[25] C.-M. Fu, E. Alshina, A. Alshin, Y.-W. Huang, C.-Y. Chen, C.-Y. Tsai,C.-W. Hsu, S.-M. Lei, J.-H. Park, and W.-J. Han, “Sample adaptiveoffset in the HEVC standard,”

IEEE Transactions on Circuits andSystems for Video technology , vol. 22, no. 12, pp. 1755–1764, 2012.[26] R. Gupta, M. T. Khanna, and S. Chaudhury, “Visual saliency guidedvideo compression algorithm,”

Signal Processing: Image Communica-tion , vol. 28, no. 9, pp. 1006–1022, 2013.[27] S. Liu, X. Li, W. Wang, E. Alshina, K. Kawamura, K. Unno, Y. Kidani,P. Wu, A. Segall, M. Wien et al. , “Ahg on neural network based codingtools,”

Joint Video Expert Team , no. JVET-S0267/M54764, June 2020.[28] S. Liu, E. Alshina, J. Pfaff, M. Wien, P. Wu, and Y. Ye, “Report ofahg11 meeting on neural network-based video coding,”

Joint VideoExpert Team , no. JVET-T0042/M54848, July 2020.[29] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond agaussian denoiser: Residual learning of deep cnn for image denoising,”

IEEE Transactions on Image Processing , vol. 26, no. 7, pp. 3142–3155,2017.[30] C. Tian, Y. Xu, L. Fei, and K. Yan, “Deep learning for image denoising:a survey,” in

International Conference on Genetic and EvolutionaryComputing . Springer, 2018, pp. 563–572.[31] A. Chakrabarti, “A neural approach to blind motion deblurring,” in

European conference on computer vision . Springer, 2016, pp. 221–235.[32] J. Koh, J. Lee, and S. Yoon, “Single-image deblurring with neuralnetworks: A comparative survey,”

Computer Vision and Image Under-standing , p. 103134, 2020.[33] Y. Zhu, X. Fu, and A. Liu, “Learning dual transformation networks forimage contrast enhancement,”

IEEE Signal Processing Letters , 2020.[34] W. Guan, T. Wang, J. Qi, L. Zhang, and H. Lu, “Edge-aware con-volution neural network based salient object detection,”

IEEE SignalProcessing Letters , vol. 26, no. 1, pp. 114–118, 2018.[35] L. Xu, J. Ren, Q. Yan, R. Liao, and J. Jia, “Deep edge-aware ﬁlters,” in

International Conference on Machine Learning , 2015, pp. 1669–1678.[36] L. Zhaoping, “A new framework for understanding vision from the per-spective of the primary visual cortex,”

Current opinion in neurobiology ,vol. 58, pp. 1–10, 2019.[37] X. Chen, M. Zirnsak, G. M. Vega, E. Govil, S. G. Lomber, andT. Moore, “The contribution of parietal cortex to visual salience,” bioRxiv , 2019, doi: http://doi.org/10.1101/619643.[38] O. Schwartz and E. Simoncelli, “Natural signal statistics and sensorygain control.”

Nature neuroscience , vol. 4, no. 8, p. 819, 2001.[39] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visualattention for rapid scene analysis,”

IEEE Transactions on patternanalysis and machine intelligence , vol. 20, no. 11, pp. 1254–1259,1998.[40] L. Itti, “Automatic foveation for video compression using a neuro-biological model of visual attention,”

IEEE transactions on imageprocessing , vol. 13, no. 10, pp. 1304–1318, 2004.[41] T. V. Nguyen, M. Xu, G. Gao, M. Kankanhalli, Q. Tian, and S. Yan,“Static saliency vs. dynamic saliency: a comparative study,” in

Proceed-ings of the 21st ACM international conference on Multimedia , 2013,pp. 987–996.[42] E. Vig, M. Dorr, and D. Cox, “Large-scale optimization of hierarchicalfeatures for saliency prediction in natural images,” in

Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition ,2014, pp. 2798–2805.[43] N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, “Predicting eye ﬁxationsusing convolutional neural networks,” in

Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2015, pp.362–370.[44] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in

Proceedings of the IEEE Conference on Computer Vision and PatternRecognition , 2016, pp. 478–487.[45] N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network forsalient object detection,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , 2016, pp. 678–686.[46] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr,“Deeply supervised salient object detection with short connections,” in

Proceedings of the IEEE Conference on Computer Vision and PatternRecognition , 2017, pp. 3203–3212.[47] B. Yan, H. Wang, X. Wang, and Y. Zhang, “An accurate saliencyprediction method based on generative adversarial networks,” in . IEEE, 2017,pp. 2339–2343.[48] Y. Xu, S. Gao, J. Wu, N. Li, and J. Yu, “Personalized saliency andits prediction,”

IEEE transactions on pattern analysis and machineintelligence , vol. 41, no. 12, pp. 2975–2989, 2018.[49] L. Bazzani, H. Larochelle, and L. Torresani, “Recurrent mixturedensity network for spatiotemporal visual attention,” arXiv preprintarXiv:1603.08199 , 2016.[50] C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal saliencynetworks for dynamic saliency prediction,”

IEEE Transactions onMultimedia , vol. 20, no. 7, pp. 1688–1698, 2017.[51] M. Sun, Z. Zhou, Q. Hu, Z. Wang, and J. Jiang, “SG-FCN: A motionand memory-based deep learning model for video saliency detection,”

IEEE transactions on cybernetics , vol. 49, no. 8, pp. 2900–2911, 2018.[52] L. Jiang, M. Xu, T. Liu, M. Qiao, and Z. Wang, “Deepvs: A deeplearning based video saliency prediction approach,” in

Proceedings ofthe European Conference on Computer Vision , 2018, pp. 602–617.[53] Z. Wang, J. Ren, D. Zhang, M. Sun, and J. Jiang, “A deep-learningbased feature hybrid framework for spatiotemporal saliency detectioninside videos,”

Neurocomputing , vol. 287, pp. 68–83, 2018. [54] R. Cong, J. Lei, H. Fu, F. Porikli, Q. Huang, and C. Hou, “Videosaliency detection via sparsity-based reconstruction and propagation,” IEEE Transactions on Image Processing , vol. 28, no. 10, pp. 4819–4831, 2019.[55] K. Min and J. J. Corso, “Tased-net: Temporally-aggregating spatialencoder-decoder network for video saliency detection,” in

Proceedingsof the IEEE International Conference on Computer Vision , 2019, pp.2394–2403.[56] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji,“Revisiting video saliency prediction in the deep learning era,”

IEEEtransactions on Pattern Analysis and Machine Intelligence , 2019.[57] A. Vetro, T. Haga, K. Sumi, and H. Sun, “Object-based coding for long-term archive of surveillance video,” in

Proceedings of InternationalConference on Multimedia and Expo , vol. 2, July 2003, p. 417,Baltimore, MD.[58] T. Nishi and H. Fujiyoshi, “Object-based video coding using pixelstate analysis,” in

Proceedings of the 17th International Conference onPattern Recognition , vol. 3, August 2004, pp. 306–309, Cambridge,UK.[59] L. Zhu and Q. Zhang, “Motion-based foreground extraction in com-pressed video,” in , vol. 2. IEEE, 2010, pp. 711–714.[60] Z. Zhang, T. Jing, J. Han, Y. Xu, and X. Li, “Flow-process foregroundregion of interest detection method for video codecs,”

IEEE Access ,vol. 5, pp. 16 263–16 276, 2017.[61] Y. Guo, Z. Xuan, and L. Song, “Foreground target extraction methodbased on neighbourhood pixel intensity correction,”

Australian Journalof Mechanical Engineering , pp. 1–10, 2019.[62] A. Shahbaz, V.-T. Hoang, and K.-H. Jo, “Convolutional neural networkbased foreground segmentation for video surveillance systems,” in

IECON 2019-45th Annual Conference of the IEEE Industrial Elec-tronics Society , vol. 1. IEEE, 2019, pp. 86–89.[63] S. Zhou, J. Wang, D. Meng, Y. Liang, Y. Gong, and N. Zheng,“Discriminative feature learning with foreground attention for personre-identiﬁcation,”

IEEE Transactions on Image Processing , vol. 28,no. 9, pp. 4671–4684, 2019.[64] M. Babaee, D. T. Dinh, and G. Rigoll, “A deep convolutional neuralnetwork for background subtraction,” arXiv preprint arXiv:1702.01731 ,2017.[65] X. Liang, S. Liao, X. Wang, W. Liu, Y. Chen, and S. Z. Li, “Deep back-ground subtraction with guided learning,” in . IEEE, 2018, pp. 1–6.[66] S. Zhang, K. Wei, H. Jia, X. Xie, and W. Gao, “An efﬁcient foreground-based surveillance video coding scheme in low bit-rate compression,”in . IEEE, 2012,pp. 1–6.[67] H. Hadizadeh and I. V. Baji´c, “Saliency-aware video compression,”

IEEE Transactions on Image Processing , vol. 23, no. 1, pp. 19–33,2013.[68] Y. Li, W. Liao, J. Huang, D. He, and Z. Chen, “Saliency based percep-tual HEVC,” in . IEEE, 2014, pp. 1–5.[69] C. Ku, G. Xiang, F. Qi, W. Yan, Y. Li, and X. Xie, “Bit allocation basedon visual saliency in HEVC,” in . IEEE, 2019, pp. 1–4.[70] S. Zhu and Z. Xu, “Spatiotemporal visual saliency guided perceptualhigh efﬁciency video coding with neural network,”

Neurocomputing ,vol. 275, pp. 511–522, 2018.[71] V. Lyudvichenko, M. Erofeev, A. Ploshkin, and D. Vatolin, “Improvingvideo compression with deep visual-attention models,” in

Proceedingsof the 2019 International Conference on Intelligent Medicine andImage Processing , 2019, pp. 88–94.[72] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Predicting humaneye ﬁxations via an lstm-based saliency attentive model,”

IEEE Trans-actions on Image Processing , vol. 27, no. 10, pp. 5142–5154, 2018.[73] X. Sun, X. Yang, S. Wang, and M. Liu, “Content-aware rate controlscheme for HEVC based on static and dynamic saliency detection,”

Neurocomputing , 2020.[74] M. Carandini, J. B. Demb, V. Mante, D. J. Tolhurst, Y. Dan, B. A.Olshausen, J. L. Gallant, and N. C. Rust, “Do we know what the earlyvisual system does?”

Journal of Neuroscience , vol. 25, no. 46, pp.10 577–10 597, 2005.[75] J. Kremkow, J. Jin, S. J. Komban, Y. Wang, R. Lashgari, X. Li,M. Jansen, Q. Zaidi, and J.-M. Alonso, “Neuronal nonlinearity explainsgreater visual spatial resolution for darks than lights,”

Proceedings of the National Academy of Sciences , vol. 111, no. 8, pp. 3170–3175,2014.[76] J. Ukita, T. Yoshida, and K. Ohki, “Characterisation of nonlinearreceptive ﬁelds of visual neurons by convolutional neural network,”

Scientiﬁc reports , vol. 9, no. 1, pp. 1–17, 2019.[77] P. Neri, “Nonlinear characterization of a simple process in humanvision,”

Journal of Vision , vol. 9, no. 12, pp. 1–1, 2009.[78] D. J. Heeger, “Normalization of cell responses in cat striate cortex,”

Visual Neuroscience , vol. 9, no. 2, pp. 181–197, 1992.[79] N. J. Priebe and D. Ferster, “Mechanisms of neuronal computation inmammalian visual cortex,”

Neuron , vol. 75, no. 2, pp. 194–208, 2012.[80] M. Carandini and D. J. Heeger, “Normalization as a canonical neuralcomputation,”

Nature Reviews Neuroscience , vol. 13, no. 1, p. 51, 2012.[81] M. H. Turner and F. Rieke, “Synaptic rectiﬁcation controls nonlinearspatial integration of natural visual inputs,”

Neuron , vol. 90, no. 6, pp.1257–1271, 2016.[82] D. Doshkov and P. Ndjiki-Nya, “Chapter 6 - how to use texture analysisand synthesis methods for video compression,” in

Academic PressLibrary in signal Processing , ser. Academic Press Library in SignalProcessing, S. Theodoridis and R. Chellappa, Eds. Oxford, UK:Elsevier, 2014, vol. 5, pp. 197–225.[83] P. Ndjiki-Nya, D. Doshkov, H. Kaprykowsky, F. Zhang, D. Bull, andT. Wiegand, “Perception-oriented video coding based on image analysisand completion: A review,”

Signal Processing: Image Communication ,vol. 27, no. 6, pp. 579–594, 2012.[84] A. K. Jain and F. Farrokhnia, “Unsupervised texture segmentationusing gabor ﬁlters,” in

Proceedings of the International Conferenceon Systems, Man, and Cybernetics Conference proceedings . IEEE,1990, pp. 14–19, Los Angeles, CA.[85] A. C. Bovik, M. Clark, and W. S. Geisler, “Multichannel textureanalysis using localized spatial ﬁlters,”

IEEE Transactions on PatternAnalysis and Machine Intelligence , vol. 12, no. 1, pp. 55–73, 1990.[86] U. S. Thakur and O. Chubach, “Texture analysis and synthesis usingsteerable pyramid decomposition for video coding,” in

Proceedings ofInternational Conference on Systems, Signals and Image Processing ,September 2015, pp. 204–207, London, UK.[87] J. Portilla and E. P. Simoncelli, “A parametric texture model based onjoint statistics of complex wavelet coefﬁcients,”

International Journalof Computer Vision , vol. 40, no. 1, pp. 49–70, 2000.[88] S. Bansal, S. Chaudhury, and B. Lall, “Dynamic texture synthesisfor video compression,” in

Proceeding of National Conference onCommunications , Feb 2013, pp. 1–5, New Delhi, India.[89] G. R. Cross and A. K. Jain, “Markov random ﬁeld texture models,”

IEEE Transactions on Pattern Analysis and Machine Intelligence , no. 1,pp. 25–39, 1983.[90] R. Chellappa and S. Chatterjee, “Classiﬁcation of textures usingGaussian Markov random ﬁelds,”

IEEE Transactions on Acoustics,Speech, and Signal Processing , vol. 33, no. 4, pp. 959–963, 1985.[91] D. G. Lowe, “Distinctive image features from scale-invariant key-points,”

International Journal of Computer Vision , vol. 60, no. 2, pp.91–110, 2004.[92] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robustfeatures,” in

Proceedings of the European Conference on ComputerVision . Springer, 2006, pp. 404–417, Graz, Austria.[93] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scaleand rotation invariant texture classiﬁcation with local binary patterns,”

IEEE Transactions on Pattern Analysis and Machine Intelligence ,vol. 24, no. 7, pp. 971–987, 2002.[94] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classiﬁcationwith deep convolutional neural networks,”

Advances in Neural Infor-mation Processing Systems , pp. 1097–1105, 2012.[95] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep ﬁlter banks for texturerecognition and segmentation,” in

Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2015, pp. 3828–3836,Boston, Massachusetts.[96] F. Perronnin, J. S´anchez, and T. Mensink, “Improving the ﬁsher kernelfor large-scale image classiﬁcation,” in

Proceedings of the EuropeanConference on Computer Vision . Springer, 2010, pp. 143–156, Crete,Greece.[97] A. A. Efros and T. K. Leung, “Texture synthesis by non-parametricsampling,” in

Proceedings of the Seventh IEEE International Confer-ence on Computer Vision , vol. 2. IEEE, 1999, pp. 1033–1038, Kerkyra,Greece.[98] L.-Y. Wei and M. Levoy, “Fast texture synthesis using tree-structuredvector quantization,” in

Proceedings of the 27th Annual Conference onComputer Graphics and Interactive Techniques , 2000, pp. 479–488,New Orleans, LA. [99] M. Ashikhmin, “Synthesizing natural textures,” in Proceedings of theSymposium on Interactive 3D Graphics , 2001, pp. 217–226, New York,NY.[100] H. Derin and H. Elliott, “Modeling and segmentation of noisy andtextured images using Gibbs random ﬁelds,”

IEEE Transactions onPattern Analysis and Machine Intelligence , no. 1, pp. 39–55, 1987.[101] D. J. Heeger and J. R. Bergen, “Pyramid-based texture analy-sis/synthesis,” in

Proceedings of the 22nd annual Conference onComputer Graphics and Interactive Techniques , 1995, pp. 229–238,Los Angeles, CA.[102] L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convo-lutional neural networks,”

Advances in Neural Information ProcessingSystems , pp. 262–270, 2015.[103] C. Li and M. Wand, “Precomputed real-time texture synthesis withmarkovian generative adversarial networks,” in

Proceedings of theEuropean Conference on Computer Vision . Springer, 2016, pp. 702–716, Amsterdam, The Netherlands.[104] T. Chen, H. Liu, Q. Shen, T. Yue, X. Cao, and Z. Ma, “Deepcoder: Adeep neural network based video compression,” in

Proceedings of theIEEE Visual Communications and Image Processing . IEEE, 2017, pp.1–4, St. Petersburg, FL.[105] J. Ball´e, V. Laparra, and E. P. Simoncelli, “End-to-end optimized imagecompression,” arXiv preprint arXiv:1611.01704 , 2016.[106] D. Liu, Y. Li, J. Lin, H. Li, and F. Wu, “Deep learning-based videocoding: A review and a case study,”

ACM Computing Surveys (CSUR) ,vol. 53, no. 1, pp. 1–35, 2020.[107] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wanga, “Image andvideo compression with neural networks: A review,”

IEEE Transactionson Circuits and Systems for Video Technology , 2019.[108] Y. Li, D. Liu, H. Li, L. Li, F. Wu, H. Zhang, and H. Yang, “Convolu-tional neural network-based block up-sampling for intra frame coding,”

IEEE Transactions on Circuits and Systems for Video Technology ,vol. 28, no. 9, pp. 2316–2330, 2018.[109] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-endcompression framework based on convolutional neural networks,”

IEEETransactions on Circuits and Systems for Video Technology , vol. 28,no. 10, pp. 3007–3018, 2017.[110] M. Afonso, F. Zhang, and D. R. Bull, “Video compression based onspatio-temporal resolution adaptation,”

IEEE Transactions on Circuitsand Systems for Video Technology , vol. 29, no. 1, pp. 275–280, 2018.[111] J. Lin, D. Liu, H. Yang, H. Li, and F. Wu, “Convolutional neuralnetwork-based block up-sampling for HEVC,”

IEEE Transactions onCircuits and Systems for Video Technology , vol. 29, no. 12, pp. 3701–3715, 2018.[112] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao,“Deep learning for single image super-resolution: A brief review,”

IEEETransactions on Multimedia , vol. 21, no. 12, pp. 3106–3121, 2019.[113] W. Cui, T. Zhang, S. Zhang, F. Jiang, W. Zuo, and D. Zhao, “Con-volutional neural networks based intra prediction for HEVC,” arXivpreprint arXiv:1808.05734 , 2018.[114] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, “Fully connected network-based intra prediction for image coding,”

IEEE Transactions on ImageProcessing , vol. 27, no. 7, pp. 3236–3247, 2018.[115] J. Pfaff, P. Helle, D. Maniry, S. Kaltenstadler, W. Samek, H. Schwarz,D. Marpe, and T. Wiegand, “Neural network based intra predictionfor video coding,”

Applications of Digital Image Processing XLI , vol.10752, p. 1075213, 2018.[116] Y. Hu, W. Yang, M. Li, and J. Liu, “Progressive spatial recurrentneural network for intra prediction,”

IEEE Transactions on Multimedia ,vol. 21, no. 12, pp. 3024–3037, 2019.[117] Z. Jin, P. An, and L. Shen, “Video intra prediction using convolutionalencoder decoder network,”

Neurocomputing , vol. 394, pp. 168–177,2020.[118] B. Girod, “Motion-compensating prediction with fractional-pel accu-racy,”

IEEE Transactions on Communications , vol. 41, no. 4, pp. 604–612, 1993.[119] N. Yan, D. Liu, H. Li, and F. Wu, “A convolutional neural networkapproach for half-pel interpolation in video coding,” in

Proceedings ofthe IEEE International Symposium on Circuits and Systems . IEEE,2017, pp. 1–4, Baltimore, MD.[120] H. Zhang, L. Song, Z. Luo, and X. Yang, “Learning a convolutionalneural network for fractional interpolation in HEVC inter coding,” in

Proceedings of the IEEE Conference on Visual Communications andImage Processing . IEEE, 2017, pp. 1–4, St. Petersburg, FL.[121] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, “One-for-all: Groupedvariation network-based fractional interpolation in video coding,”

IEEETransactions on Image Processing , vol. 28, no. 5, pp. 2140–2151, 2018. [122] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, “Enhancedmotion-compensated video coding with deep virtual reference framegeneration,”

IEEE Transactions on Image Processing , vol. 28, no. 10,pp. 4832–4844, 2019.[123] S. Xia, W. Yang, Y. Hu, and J. Liu, “Deep inter prediction via pixel-wise motion oriented reference generation,” in

Proceedings of the IEEEInternational Conference on Image Processing . IEEE, 2019, pp. 1710–1774, Taipei, Taiwan.[124] S. Huo, D. Liu, F. Wu, and H. Li, “Convolutional neural network-basedmotion compensation reﬁnement for video coding,” in

Proceedings ofthe IEEE International Symposium on Circuits and Systems . IEEE,2018, pp. 1–4, Florence, Italy.[125] M. M. Alam, T. D. Nguyen, M. T. Hagan, and D. M. Chandler, “Aperceptual quantization strategy for HEVC based on a convolutionalneural network trained on natural images,”

Applications of DigitalImage Processing XXXVIII , vol. 9599, p. 959918, 2015.[126] R. Song, D. Liu, H. Li, and F. Wu, “Neural network-based arithmeticcoding of intra prediction modes in HEVC,” in

Proceedings of theIEEE Visual Communications and Image Processing . IEEE, 2017,pp. 1–4, St. Petersburg, FL.[127] S. Puri, S. Lasserre, and P. Le Callet, “CNN-based transform indexprediction in multiple transforms framework to assist entropy coding,”in

Proceedings of the European Signal Processing Conference . IEEE,2017, pp. 798–802, Kos island, Greece.[128] C. Ma, D. Liu, X. Peng, and F. Wu, “Convolutional neural network-based arithmetic coding of dc coefﬁcients for HEVC intra coding,” in

Proceedings of the IEEE International Conference on Image Process-ing . IEEE, 2018, pp. 1772–1776, Athens, Greece.[129] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “Dvc: Anend-to-end deep video compression framework,” in

Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition , 2019,pp. 11 006–11 015, Long Beach, CA.[130] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, andL. Bourdev, “Learned video compression,” in

Proceedings of the IEEEInternational Conference on Computer Vision , 2019, pp. 3454–3463,South Korea.[131] H. Liu, H. Shen, L. Huang, M. Lu, T. Chen, and Z. Ma, “Learnedvideo compression via joint spatial-temporal correlation exploration,”in

Proceedings of the AAAI Conference on Artiﬁcial Intelligence ,vol. 34, no. 07, 2020, pp. 11 580–11 587.[132] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Min-nen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate imagecompression with recurrent neural networks,” in

Proceedings of theInternational Conference on Learning Representations , 2016.[133] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen,J. Shor, and M. Covell, “Full resolution image compression withrecurrent neural networks,” in

Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2017, pp. 5306–5314,Honolulu, Hawaii.[134] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen,S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy imagecompression with priming and spatially adaptive bit rates for recurrentnetworks,” in

Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition , 2018, pp. 4385–4393.[135] Y. Choi, M. El-Khamy, and J. Lee, “Variable rate deep image com-pression with a conditional autoencoder,” in

Proceedings of the IEEEInternational Conference on Computer Vision , 2019, pp. 3146–3154.[136] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “Neural imagecompression via non-local attention optimization and improved contextmodeling,” arXiv preprint arXiv:1910.06244 , 2019.[137] J. Ball´e, “Efﬁcient nonlinear transforms for lossy image compression,” arXiv preprint arXiv:1802.00847 , 2018.[138] J. Lee, S. Cho, and S.-K. Beack, “Context-adaptive entropymodel for end-to-end optimized image compression,” arXiv preprintarXiv:1809.10452 , 2018.[139] J. Klopp, Y.-C. F. Wang, S.-Y. Chien, and L.-G. Chen, “Learninga code-space predictor by exploiting intra-image-dependencies.” in

BMVC , 2018, p. 124.[140] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool,“Conditional probability models for deep image compression,” in

IEEEConference on Computer Vision and Pattern Recognition (CVPR) ,vol. 1, no. 2, 2018, p. 3.[141] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: Anend-to-end deep video compression framework,” in

Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition , 2019,pp. 11 006–11 015. [142] J. Ball´e, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Vari-ational image compression with a scale hyperprior,” arXiv preprintarXiv:1802.01436 , 2018.[143] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, “Non-local attention optimized deep image compression,” arXiv preprintarXiv:1904.09757 , 2019.[144] C.-Y. Wu, N. Singhal, and P. Kr¨ahenb¨uhl, “Video compression throughimage interpolation,” in Proceedings of the European Conference onComputer Vision , 2018, pp. 416–431.[145] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers, “Neuralinter-frame compression for video coding,” in

Proceedings of the IEEEInternational Conference on Computer Vision , 2019, pp. 6421–6429.[146] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutionalnetworks for content-weighted image compression,” arXiv preprintarXiv:1703.10553 , 2017.[147] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neuralnetworks,” in

Proceedings of the IEEE conference on computer visionand pattern recognition , 2018, pp. 7794–7803.[148] Y. Hu, W. Yang, and J. Liu, “Coarse-to-ﬁne hyper-prior modeling forlearned image compression.” in

Proceedings of the AAAI Conferenceon Artiﬁcial Intelligence , 2020, pp. 11 013–11 020.[149] D. Minnen, J. Ball´e, and G. D. Toderici, “Joint autoregressive andhierarchical priors for learned image compression,” in

Advances inNeural Information Processing Systems , 2018, pp. 10 794–10 803.[150] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrentneural networks,” arXiv preprint arXiv:1601.06759 , 2016.[151] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo,Z. Wang, Y. Chen, D. Belov, and N. De Freitas, “Parallel multiscaleautoregressive density estimation,” in

Proceedings of the 34th Inter-national Conference on Machine Learning . JMLR. org, 2017, pp.2912–2921.[152] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image com-pression with discretized gaussian mixture likelihoods and attentionmodules,” in

Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition , 2020, pp. 7939–7948.[153] J. Lee, S. Cho, and M. Kim, “An end-to-end joint learning scheme ofimage compression and quality enhancement with improved entropyminimization,” arXiv , pp. arXiv–1912, 2019.[154] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” arXiv preprint arXiv:1705.05823 , 2017.[155] C. Huang, H. Liu, T. Chen, S. Pu, Q. Shen, and Z. Ma, “Extremeimage coding via multiscale autoencoders with generative adversarialoptimization,” in

Proceedings of IEEE Visual Communications andImage Processing , 2019.[156] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V.Gool, “Generative adversarial networks for extreme learned imagecompression,” in

Proceedings of the IEEE International Conferenceon Computer Vision , 2019, pp. 221–231.[157] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma, “Deep image compressionvia end-to-end learning,” in

Proceedings of the IEEE InternationalConference on Computer Vision Workshops , 2018.[158] J. L. Hennessy and D. A. Patterson, “A new golden age for computerarchitecture,”

Communications of the ACM , vol. 62, no. 2, pp. 48–60,2019.[159] S. Midtskogen and J.-M. Valin, “The AV1 constrained directionalenhancement ﬁlter (CDEF),” in . IEEE, 2018, pp. 1193–1197.[160] D. Mukherjee, S. Li, Y. Chen, A. Anis, S. Parker, and J. Bankoski, “Aswitchable loop-restoration with side-information framework for theemerging AV1 video codec,” in . IEEE, 2017, pp. 265–269.[161] C.-Y. Tsai, C.-Y. Chen, T. Yamakage, I. S. Chong, Y.-W. Huang, C.-M.Fu, T. Itoh, T. Watanabe, T. Chujoh, M. Karczewicz et al. , “Adaptiveloop ﬁltering for video coding,”

IEEE Journal of Selected Topics inSignal Processing , vol. 7, no. 6, pp. 934–945, 2013.[162] W.-S. Park and M. Kim, “Cnn-based in-loop ﬁltering for codingefﬁciency improvement,” in . IEEE, 2016, pp.1–5.[163] Y. Dai, D. Liu, and F. Wu, “A convolutional neural network approachfor post-processing in HEVC intra coding,” in

International Conferenceon Multimedia Modeling . Springer, 2017, pp. 28–39.[164] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, “Residualhighway convolutional neural networks for in-loop ﬁltering in HEVC,”

IEEE Transactions on image processing , vol. 27, no. 8, pp. 3827–3841,2018. [165] X. Xu, J. Qian, L. Yu, H. Wang, X. Zeng, Z. Li, and N. Wang, “Denseinception attention neural network for in-loop ﬁlter,” in . IEEE, 2019, pp. 1–5.[166] K. Lin, C. Jia, Z. Zhao, L. Wang, S. Wang, S. Ma, and W. Gao,“Residual in residual based convolutional neural network in-loop ﬁlterfor avs3,” in . IEEE, 2019, pp. 1–5.[167] J. Kang, S. Kim, and K. M. Lee, “Multi-modal/multi-scale convolu-tional neural network based in-loop ﬁlter design for next generationvideo codec,” in . IEEE, 2017, pp. 26–30.[168] C. Jia, S. Wang, X. Zhang, S. Wang, and S. Ma, “Spatial-temporalresidue network based in-loop ﬁlter for video coding,” in . IEEE, 2017, pp. 1–4.[169] X. Meng, C. Chen, S. Zhu, and B. Zeng, “A new HEVC in-loop ﬁlterbased on multi-channel long-short-term dependency residual networks,”in . IEEE, 2018, pp. 187–196.[170] D. Li and L. Yu, “An in-loop ﬁlter based on low-complexity cnn usingresiduals in intra video coding,” in . IEEE, 2019, pp. 1–5.[171] D. Ding, L. Kong, G. Chen, Z. Liu, and Y. Fang, “A switchabledeep learning approach for in-loop ﬁltering in video coding,”

IEEETransactions on Circuits and Systems for Video Technology , vol. 30,no. 7, pp. 1871–1887, 2020.[172] D. Ding, G. Chen, D. Mukherjee, U. Joshi, and Y. Chen, “A CNN-based in-loop ﬁltering approach for AV1 video codec,” in

Proceedingsof the Picture Coding Symposium . IEEE, 2019, pp. 1–5, Ningbo,China.[173] G. Chen, D. Ding, D. Mukherjee, U. Joshi, and Y. Chen, “AV1 in-loopﬁltering using a wide-activation structured residual network,” in

Pro-ceedings of the IEEE International Conference on Image Processing .IEEE, 2019, pp. 1725–1729, Taipei, Taiwan.[174] H. Yin, R. Yang, X. Fang, and S. Ma, “Ce13-1.2: adaptive convolutionalneural network loop ﬁlter,”

JVET-N0480 , 2019.[175] T. Li, M. Xu, C. Zhu, R. Yang, Z. Wang, and Z. Guan, “A deep learningapproach for multi-frame in-loop ﬁlter of HEVC,”

IEEE Transactionson Image Processing , vol. 28, no. 11, pp. 5663–5678, 2019.[176] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma,“Content-aware convolutional neural network for in-loop ﬁltering inhigh efﬁciency video coding,”

IEEE Transactions on Image Processing ,vol. 28, no. 7, pp. 3343–3356, 2019.[177] C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifactsreduction by a deep convolutional network,” in

Proceedings of theIEEE International Conference on Computer Vision , 2015, pp. 576–584, Santiago, Chile.[178] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convo-lutional neural network for image compression artifact suppression,”in

Proceedings of the IEEE International Joint Conference on NeuralNetworks . IEEE, 2017, pp. 752–759, Anchorage, Alaska.[179] J. Guo and H. Chao, “Building dual-domain representations for com-pression artifacts reduction,” in

Proceedings of the European Confer-ence on Computer Vision . Springer, 2016, pp. 628–644, Amsterdam,The Netherlands.[180] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deepgenerative adversarial compression artifact removal,” in

Proceedingsof the IEEE International Conference on Computer Vision , 2017, pp.4826–4835, Venice, Italy.[181] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-cnn for image restoration,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition Workshops , 2018, pp. 773–782, Salt Lake City, UT.[182] Y. Zhang, L. Sun, C. Yan, X. Ji, and Q. Dai, “Adaptive residualnetworks for high-quality image restoration,”

IEEE Transactions onImage Processing , vol. 27, no. 7, pp. 3150–3163, 2018.[183] T. Wang, M. Chen, and H. Chao, “A novel deep learning-based methodof improving coding efﬁciency from the decoder-end for HEVC,” in

Proceedings of the Data Compression Conference . IEEE, 2017, pp.410–419, Snowbird, Utah.[184] R. Yang, M. Xu, and Z. Wang, “Decoder-side HEVC quality en-hancement with scalable convolutional neural network,” in . IEEE, 2017, pp.817–822.[185] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han, “EnhancingHEVC compressed videos with a partition-masked convolutional neuralnetwork,” in . IEEE, 2018, pp. 216–220.[186] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov,P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical ﬂow with convolutional networks,” in Proceedings of the IEEEinternational conference on computer vision , 2015, pp. 2758–2766.[187] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox,“Flownet 2.0: Evolution of optical ﬂow estimation with deep networks,”in

Proceedings of the IEEE conference on computer vision and patternrecognition , 2017, pp. 2462–2470.[188] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for opticalﬂow using pyramid, warping, and cost volume,” in

Proceedings of theIEEE conference on computer vision and pattern recognition , 2018,pp. 8934–8943.[189] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video en-hancement with task-oriented ﬂow,”

International Journal of ComputerVision , vol. 127, no. 8, pp. 1106–1125, 2019.[190] W. Bao, W.-S. Lai, X. Zhang, Z. Gao, and M.-H. Yang, “MEMC-Net:Motion estimation and motion compensation driven neural network forvideo interpolation and enhancement,”

IEEE transactions on patternanalysis and machine intelligence , 2019.[191] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy, “EDVR:Video restoration with enhanced deformable convolutional networks,”in

Proceedings of the IEEE Conference on Computer Vision and PatternRecognition Workshops , 2019, pp. 0–0.[192] R. Yang, M. Xu, Z. Wang, and T. Li, “Multi-frame quality enhancementfor compressed video,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , 2018, pp. 6664–6673.[193] Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu, and Z. Wang, “MFQE2.0: A new approach for multi-frame quality enhancement on com-pressed video,”

IEEE Transactions on Pattern Analysis and MachineIntelligence , 2019.[194] J. Tong, X. Wu, D. Ding, Z. Zhu, and Z. Liu, “Learning-basedmulti-frame video quality enhancement,” in

Proceedings of the IEEEInternational Conference on Image Processing . IEEE, 2019, pp. 929–933, Taipei, Taiwan.[195] M. Lu, M. Cheng, Y. Xu, S. Pu, Q. Shen, and Z. Ma, “Learnedquality enhancement via multi-frame priors for HEVC compliant low-delay applications,” in . IEEE, 2019, pp. 934–938.[196] U. Joshi, D. Mukherjee, J. Han, Y. Chen, S. Parker, H. Su, A. Chiang,Y. Xu, Z. Liu, Y. Wang et al. , “Novel inter and intra prediction toolsunder consideration for the emerging AV1 video codec,” in

Applicationsof Digital Image Processing XL , vol. 10396. International Society forOptics and Photonics, 2017, p. 103960F.[197] Z. Liu, D. Mukherjee, W.-T. Lin, P. Wilkins, J. Han, and Y. Xu,“Adaptive multi-reference prediction using a symmetric framework,”

Electronic Imaging , vol. 2017, no. 2, pp. 65–72, 2017.[198] Y. Chen, D. Murherjee, J. Han, A. Grange, Y. Xu, Z. Liu, S. Parker,C. Chen, H. Su, U. Joshi et al. , “An overview of core coding tools inthe AV1 video codec,” in . IEEE,2018, pp. 41–45.[199] Y. Wang, S. Inguva, and B. Adsumilli, “YouTube UGC dataset for videocompression research,”

IEEE International Workshop on MultimediaSignal Processing , September 2019, Kuala Lumpur, Malaysia.[200] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsingnetwork,” in

Proceedings of the IEEE conference on Computer Visionand Pattern Recognition , July 2017, pp. 2881–2890, Honolulu, HI.[201] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in

Proceedings of the IEEE conference on computer visionand pattern recognition , 2016, pp. 770–778.[202] M. Bosch, F. Zhu, and E. J. Delp, “Segmentation-Based Video Com-pression Using Texture and Motion Models,”

IEEE Journal of SelectedTopics in Signal Processing , vol. 5, no. 7, pp. 1366–1377, November2011.[203] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al. , “ImageNet largescale visual recognition challenge,”

International Journal of ComputerVision , vol. 115, no. 3, pp. 211–252, 2015.[204] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,P. Doll´ar, and C. L. Zitnick, “Microsoft COCO: Common objectsin context,” in

Proceedings of the IEEE European Conference onComputer Vision , September 2014, pp. 740–755, Z¨urich, Switzerland.[205] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, andA. Torralba, “Semantic understanding of scenes through the ADE20Kdataset,”

International Journal of Computer Vision , vol. 127, no. 3, pp.302–321, 2019.[206] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networksfor semantic segmentation,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , June 2015, pp. 3431–3440,Boston, MA. [207] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,“Semantic image segmentation with deep convolutional nets and fullyconnected crfs,” arXiv preprint arXiv:1412.7062 , 2014.[208] F. Yu and V. Koltun, “Multi-scale context aggregation by dilatedconvolutions,” arXiv preprint arXiv:1511.07122 , 2015.[209] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,“Scene parsing through ade20k dataset,” in

Proceedings of the IEEEconference on computer vision and pattern recognition , 2017, pp. 633–641.[210] D. Chen, Q. Chen, and F. Zhu, “Pixel-level texture segmentation basedAV1 video compression,” in . IEEE, 2019, pp. 1622–1626.[211] M. Bosch, F. Zhu, and E. J. Delp, “Spatial texture models for videocompression,”

Proceedings of IEEE International Conference on ImageProcessing , vol. 1, pp. 93–96, September 2007, San Antonio, TX.[212] C. Fu, D. Chen, E. Delp, Z. Liu, and F. Zhu, “Texture segmentationbased video compression using convolutional neural networks,”

Elec-tronic Imaging , vol. 2018, no. 2, pp. 155–1, 2018.[213] I.-R. R. BT.500-14, “Methodologies for the subjective assessment ofthe quality of television images,” Geneva, Tech. Rep., 2019.[214] M. Haindl and S. Mikes, “Texture segmentation benchmark,” in

Pro-ceedings of the 19th International Conference on Pattern Recognition .IEEE, 2008, pp. 1–4.[215] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang,“YouTube-VOS: A large-scale video object segmentation benchmark,” arXiv preprint , p. arXiv:1809.03327, 2018.[216] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, andA. Sorkine-Hornung, “A benchmark dataset and evaluation method-ology for video object segmentation,” in

Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , 2016, pp.724–732.[217] H. Liu, M. Lu, Z. Ma, F. Wang, Z. Xie, X. Cao, and Y. Wang, “Neuralvideo coding using multiscale motion compensation and spatiotemporalcontext model,” accepted by IEEE Trans. Circuits and Systems forVideo Technology , Oct. 2020.[218] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutionalnetworks for content-weighted image compression,” in

Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition ,2018, pp. 3214–3223.[219] T. M. Cover and J. A. Thomas,

Elements of information theory . JohnWiley & Sons, 2012.[220] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, “Residual non-local atten-tion networks for image restoration,” arXiv preprint arXiv:1903.10082 ,2019.[221] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c.Woo, “Convolutional lstm network: A machine learning approach forprecipitation nowcasting,”

Advances in neural information processingsystems , vol. 28, pp. 802–810, 2015.[222] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, “An end-to-end learning framework for video compression,”

IEEE Transactionson Pattern Analysis and Machine Intelligence , 2020.[223] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang,“Ntire 2017 challenge on single image super-resolution: Methods andresults,” in

Proceedings of the IEEE conference on Computer Visionand Pattern Recognition Workshops , 2017, pp. 114–125.[224] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, and T. Huang,“Wide activation for efﬁcient and accurate image super-resolution,” arXiv preprint arXiv:1808.08718 , 2018.[225] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deepresidual networks,” in

European Conference on Computer Vision .Springer, 2016, pp. 630–645.[226] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature , vol. 521,no. 7553, pp. 436–444, 2015.[227] Q. Xia, H. Liu, and Z. Ma, “Object-based image coding: A learning-driven revisit,” in . IEEE, 2020, pp. 1–6.[228] J. Lee, S. Cho, and M. Kim, “A hybrid architecture of jointly learningimage compression and quality enhancement with improved entropyminimization,” arXiv preprint arXiv:1912.12817 , 2019.[229] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a fewexamples: A survey on few-shot learning,”

ACM Computing Surveys(CSUR) , vol. 53, no. 3, pp. 1–34, 2020.[230] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structuralsimilarity for image quality assessment,” in

Proceedings of The Thrity-Seventh Asilomar Conference on Signals, Systems Computers , vol. 2,Nov 2003, pp. 1398–1402, paciﬁc Grove, CA. [231] D. Yuan, T. Zhao, Y. Xu, H. Xue, and L. Lin, “Visual jnd: a perceptualmeasurement in video coding,” IEEE Access , vol. 7, pp. 29 014–29 022,2019.[232] Netﬂix, Inc., “VMAF: Perceptual video quality assessment based onmulti-method fusion,” https://github.com/Netﬂix/vmaf, 2017.[233] Q. Shen, J. Cai, L. Liu, H. Liu, T. Chen, L. Ye, and Z. Ma,“Codedvision: Towards joint image understanding and compressionvia end-to-end learning,” in

Paciﬁc Rim Conference on Multimedia .Springer, 2018, pp. 3–14.[234] L. Liu, H. Liu, T. Chen, Q. Shen, and Z. Ma, “Codedretrieval: Jointimage compression and retrieval with neural networks,” in2019 IEEEVisual Communications and Image Processing (VCIP)