Advances In Video Compression System Using Deep Neural Network: A Review And Case Studies
Dandan Ding, Zhan Ma, Di Chen, Qingshuang Chen, Zoe Liu, Fengqing Zhu
Dandan Ding*, Member, IEEE, Zhan Ma*, Senior Member, IEEE, Di Chen, Member, IEEE, Qingshuang Chen, Member, IEEE, Zoe Liu, and Fengqing Zhu, Senior Member, IEEE
Abstract—Significant advances in video compression systems have been made in the past several decades to satisfy the nearly exponential growth of Internet-scale video traffic. From the application perspective, we have identified three major functional blocks, including pre-processing, coding, and post-processing, that have been continuously investigated to maximize the end-user quality of experience (QoE) under a limited bit rate budget. Recently, artificial intelligence (AI) powered techniques have shown great potential to further increase the efficiency of the aforementioned functional blocks, both individually and jointly. In this article, we review extensively recent technical advances in video compression systems, with an emphasis on deep neural network (DNN)-based approaches, and then present three comprehensive case studies. On pre-processing, we show a switchable texture-based video coding example that leverages DNN-based scene understanding to extract semantic areas for the improvement of the subsequent video coder. On coding, we present an end-to-end neural video coding framework that takes advantage of stacked DNNs to efficiently and compactly code input raw videos via fully data-driven learning. On post-processing, we demonstrate two neural adaptive filters to respectively facilitate the in-loop and post filtering for the enhancement of compressed frames. Finally, a companion website hosting the contents developed in this work can be accessed publicly at https://purdueviper.github.io/dnn-coding/.
Index Terms—Deep Neural Networks, Texture Analysis, Neural Video Coding, Adaptive Filters
I. INTRODUCTION
In recent years, Internet traffic has been dominated by a wide range of applications involving video, including video on demand (VOD), live streaming, ultra-low latency real-time communications, etc. With ever increasing demands in resolution (e.g., 4K, 8K, gigapixel [1], high speed [2]) and fidelity (e.g., high dynamic range [3], and higher bit precision or bit depth [4]), more efficient video compression is imperative for content transmission and storage, by which networked video services can be successfully deployed. Fundamentally, video compression systems devise appropriate algorithms to minimize the end-to-end reconstruction distortion (or maximize the quality of experience (QoE)) under a given bit rate budget. This is a classical rate-distortion (R-D) optimization problem. In the past, the majority of effort
D. Ding is with the School of Information Science and Engineering, Hangzhou Normal University, Hangzhou, Zhejiang, China. Z. Ma is with the School of Electronic Science and Engineering, Nanjing University, Nanjing, Jiangsu, China. D. Chen, Q. Chen, and F. Zhu are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana, USA. Z. Liu is with Visionular Inc., 280 2nd St., Los Altos, CA, USA. *These authors contributed equally.
had been focused on the development and standardization of video coding tools for optimized R-D performance, such as intra/inter prediction, transform, entropy coding, etc., resulting in a number of popular standards and recommendation specifications (e.g., the ISO/IEC MPEG series [5]–[11], the ITU-T H.26x series [9]–[13], the AVS series [14]–[16], as well as AV1 [17], [18] from the Alliance for Open Media (AOM) [19]). All these standards have been widely deployed in the market and enabled advanced and high-performing services to both enterprises and consumers. They have been adopted to cover all major video scenarios, from VOD to live streaming to ultra-low latency interactive real-time communications, used for applications such as telemedicine, distance learning, video conferencing, broadcasting, e-commerce, online gaming, short video platforms, etc. Meanwhile, the system R-D efficiency can also be improved from pre-processing and post-processing, individually and jointly, for content adaptive encoding (CAE). Notable examples include saliency detection for subsequent region-wise quantization control, and adaptive filters to alleviate compression distortions [20]–[22].

In this article, we therefore consider pre-processing, coding, and post-processing as three basic functional blocks of an end-to-end video compression system, and optimize them to provide a compact and high-quality representation of the input original video.

• The "coding" block is the core unit that converts raw pixels or pixel blocks into a binary bit representation. Over the past decades, the "coding" R-D efficiency has been gradually improved by introducing more advanced tools to better exploit spatial, temporal, and statistical redundancy [23]. Nevertheless, this process inevitably incurs compression artifacts, such as blockiness and ringing, due to the R-D trade-off, especially at low bit rates.

• The "post-processing" block is introduced to alleviate visually perceptible impairments produced as byproducts of coding. Post-processing mostly relies on designated adaptive filters to enhance the reconstructed video quality or QoE. Such "post-processing" filters can also be embedded into the "coding" loop to jointly improve reconstruction quality and R-D efficiency, e.g., in-loop deblocking [24] and sample adaptive offset (SAO) [25].

• The "pre-processing" block exploits the discriminative content preference of the human visual system (HVS), caused by the non-linear response and frequency selectivity (e.g., masking) of visual neurons in the visual pathway. Pre-processing can extract content semantics (e.g., saliency, object instance) to improve the psychovisual performance of the "coding" block, for example, by allocating unequal qualities (UEQ) across different areas according to pre-processed cues [26].

Building upon the advancements in deep neural networks (DNNs), numerous recently-created video processing algorithms have been greatly improved to achieve superior performance, mostly leveraging the powerful nonlinear representation capacity of DNNs. At the same time, we have also witnessed an explosive growth in the invention of DNN-based techniques for video compression from both academic research and industrial practice. For example, DNN-based filtering in post-processing was extensively studied when developing the VVC standard under the joint task force of ISO/IEC and ITU-T experts over the past three years.
More recently, the standards committee issued a Call-for-Evidence (CfE) [27], [28] to encourage the exploration of deep learning-based video coding solutions beyond VVC.

In this article, we discuss recent advances in pre-processing, coding, and post-processing, with particular emphasis on the use of DNN-based approaches for efficient video compression. We aim to provide a comprehensive overview to bring readers up to date on recent advances in this emerging field. We also suggest promising directions for further exploration. As summarized in Fig. 1, we first dive into video pre-processing, emphasizing the analysis and application of content semantics, e.g., saliency, object, texture characteristics, etc., to video encoding. (Although adaptive filters can also be used in pre-processing for pre-filtering, e.g., denoising, motion deblurring, contrast enhancement, edge detection, etc., our primary focus in this work is on semantic content understanding for subsequent intelligent "coding".) We then discuss recently-developed DNN-based video coding techniques for both modularized coding tool development and end-to-end fully learned framework exploration. Finally, we provide an overview of the adaptive filters that can be either embedded in the codec loop or placed as a post enhancement to improve the final reconstruction. We also present three case studies, including 1) switchable texture-based video coding in pre-processing, 2) end-to-end neural video coding, and 3) efficient neural filtering, to provide examples of the potential of DNNs to improve both subjective and objective efficiency over traditional video compression methodologies.

The remainder of the article is organized as follows: From Section II to IV, we extensively review the advances in respective pre-processing, coding, and post-processing. Traditional methodologies are first briefly summarized, and then DNN-based approaches are discussed in detail. As case studies, we propose three neural approaches in Sections V, VI, and VII, respectively. Regarding pre-processing, we develop a CNN-based texture analysis/synthesis scheme for the AV1 codec. For video compression, an end-to-end neural coding framework is developed. In our discussion of post-processing, we present different neural methods for in-loop and post filtering that can enhance the quality of reconstructed frames. Section VIII summarizes this work and discusses open challenges and future research directions. For your convenience, Table I provides an overview of abbreviations and acronyms that are frequently used throughout this paper.
TABLE I: Abbreviations and Annotations
Abbreviation Description
AE: AutoEncoder
CNN: Convolutional Neural Network
CONV: Convolution
ConvLSTM: Convolutional LSTM
DNN: Deep Neural Network
FCN: Fully-Connected Network
GAN: Generative Adversarial Network
LSTM: Long Short-Term Memory
RNN: Recurrent Neural Network
VAE: Variational AutoEncoder
BD-PSNR: Bjøntegaard Delta PSNR
BD-Rate: Bjøntegaard Delta Rate
GOP: Group of Pictures
MS-SSIM: Multiscale SSIM
MSE: Mean Squared Error
PSNR: Peak Signal-to-Noise Ratio
QP: Quantization Parameter
QoE: Quality of Experience
SSIM: Structural Similarity Index
UEQ: UnEqual Quality
VMAF: Video Multi-Method Assessment Fusion
AV1: AOMedia Video 1
AVS: Audio Video Standard
H.264/AVC: H.264/Advanced Video Coding
H.265/HEVC: H.265/High-Efficiency Video Coding
VVC: Versatile Video Coding
AOM: Alliance for Open Media
MPEG: Moving Picture Experts Group
II. OVERVIEW OF DNN-BASED VIDEO PRE-PROCESSING
Pre-processing techniques are generally applied prior to the video coding block, with the objective of guiding the video encoder to remove psychovisual redundancy and to maintain or improve visual quality, while simultaneously lowering bit rate consumption. One category of pre-processing techniques is the execution of pre-filtering operations. Recently, a number of deep learning-based pre-filtering approaches have been adopted for targeted coding optimization. These include denoising [29], [30], motion deblurring [31], [32], contrast enhancement [33], edge detection [34], [35], etc. Another important topic area is closely related to the analysis of video content semantics, e.g., object instance, saliency attention, texture distribution, etc., and its application to intelligent video coding. For the sake of simplicity, we refer to this group of techniques as "pre-processing" for the remainder of this paper. In our discussion below, we also limit our focus to saliency-based and analysis/synthesis-based approaches.
A. Saliency-Based Video Pre-processing

1) Saliency Prediction: Saliency is the quality of being particularly noticeable or important. Thus, the salient area refers to the region of an image that predominantly attracts the attention of subjects. This concept corresponds closely to the highly discriminative and selective behaviour displayed in visual neuronal processing [36], [37]. Content feature extraction, activation, suppression, and aggregation also occur in the visual pathway [38].
Fig. 1: Topic Outline.
This article reviews DNN-based techniques used in pre-processing, coding, and post-processing of a practical video compression system. The "pre-processing" module leverages content semantics (e.g., texture) to guide video coding, followed by the "coding" step to represent the video content using more compact spatio-temporal features. Finally, quality enhancement is applied in "post-processing" to improve reconstruction quality by alleviating processing artifacts. Companion case studies are respectively offered to showcase the potential of DNN algorithms in video compression.

Earlier attempts to predict saliency typically utilized handcrafted image features, such as color, intensity, and orientation contrast [39], motion contrast [40], camera motion [41], etc.

Later on, DNN-based semantic-level features were extensively investigated for both image content [42]–[48] and video sequences [49]–[55]. Among these, image saliency prediction only exploits spatial information, while video saliency prediction often relies on spatial and temporal attributes jointly. One typical example of video saliency is a moving object that incurs spatio-temporal dynamics over time, and is therefore more likely to attract users' attention. For example, Bazzani et al. [49] modeled the spatial relations in videos using 3D convolutional features and the temporal consistency with a convolutional long short-term memory (LSTM) network. Bak et al. [50] applied a two-stream network that exploited different fusion mechanisms to effectively integrate spatial and temporal information. Sun et al. [51] proposed a step-gained FCN to combine the time-domain memory information and space-domain motion components. Jiang et al. [52] developed an object-to-motion CNN that was applied together with an LSTM network. All of these efforts to efficiently predict video saliency leveraged spatio-temporal attributes. More details regarding spatio-temporal saliency models for video content can be found in [56].
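To make the joint spatio-temporal idea concrete, the sketch below (our illustration, not any of the cited architectures) assumes PyTorch: 3D convolutions pool evidence across neighboring frames, and a shared 2D head emits one saliency map per frame. All layer counts and widths are arbitrary choices for exposition.

```python
import torch
import torch.nn as nn

class SpatioTemporalSaliency(nn.Module):
    """Toy video-saliency model: 3D convolutions capture motion cues across
    neighboring frames; a shared 2D head emits one saliency map per frame."""
    def __init__(self, ch=16):
        super().__init__()
        self.spatiotemporal = nn.Sequential(
            nn.Conv3d(3, ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, clip):               # clip: (B, 3, T, H, W)
        feat = self.spatiotemporal(clip)   # (B, ch, T, H, W)
        B, C, T, H, W = feat.shape
        per_frame = feat.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
        sal = torch.sigmoid(self.head(per_frame))
        return sal.reshape(B, T, 1, H, W)  # per-frame saliency in [0, 1]

clip = torch.rand(1, 3, 8, 64, 64)          # an 8-frame RGB clip
print(SpatioTemporalSaliency()(clip).shape)  # torch.Size([1, 8, 1, 64, 64])
```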
2) Salient Object:
One special example of image saliency involves the object instance in a visual scene, specifically, the moving object in videos. A simple yet effective solution in this case involves segmenting foreground objects from background components, which has mainly relied on foreground extraction or background subtraction. For example, motion information has frequently been used to mask out foreground objects [57]–[61]. Recently, both CNN and foreground attentive neural network (FANN) models have been developed to perform foreground segmentation [62], [63]. In addition to conventional Gaussian mixture model-based background subtraction, recent explorations have also shown that CNN models can be effectively used for the same purpose [64], [65]. To address these separated foreground objects and background attributes, Zhang et al. [66] introduced a new background mode to more compactly represent background information with better R-D efficiency. To the best of our knowledge, such foreground object/background segmentation has been mostly applied in video surveillance applications, where the visual scene lends itself to easier separation.
3) Video Compression with UEQ Scales:
Recall that saliency or object areas are the more visually attentive regions. It is thus straightforward to apply a UEQ setting in a video encoder, where light compression is used to encode the salient area, while heavy compression is used elsewhere. Use of this technique often results in lower total bit rate consumption without compromising QoE.

For example, Hadi et al. [67] extended the well-known Itti-Koch-Niebur (IKN) model to estimate saliency in the DCT domain, also considering camera motion. In addition, saliency-driven distortion was also introduced to accurately capture the salient characteristics, in order to improve R-D optimization in H.265/HEVC. Li et al. [68] suggested using graph-based visual saliency to adapt the quantizations in H.265/HEVC, to reduce total bit consumption. Similarly, Ku et al. [69] applied saliency-weighted Coding Tree Unit (CTU)-level bit allocation, where the CTU-aligned saliency weights were determined via low-level feature fusion.

The aforementioned methodologies rely on traditional handcrafted saliency prediction algorithms. As DNN-based saliency algorithms have demonstrated superior performance, we can safely assume that their application to video coding will lead to better compression efficiency. For example, Zhu et al. [70] adopted a spatio-temporal saliency model to accurately control the QP in an encoder, whose spatial saliency was generated using a 10-layer CNN and whose temporal saliency was calculated assuming the 2D motion model, resulting in an average BD-PSNR gain of 0.24 dB over the H.265/HEVC reference model (version HM16.8). Performance improvement due to fine-grained quantization adaptation was reported using an open-source x264 encoder [71]. This was accomplished by jointly examining the input video frame and associated saliency maps. These saliency maps were generated by utilizing three CNN models suggested in [52], [56], [72]. Up to 25% bit rate reduction was reported when distortion was measured using the edge-weighted SSIM (EW-SSIM). Similarly, Sun et al. [73] implemented a saliency-driven CTU-level adaptive bit rate control, where the static saliency map of each frame was extracted using a DNN model and the dynamic saliency region was tracked using a moving object segmentation algorithm. Experimental results revealed that the PSNR of salient regions was improved by 1.85 dB on average.

Though saliency-based pre-processing is mainly driven by psychovisual studies, it heavily relies on saliency detection to perform UEQ-based adaptive quantization with lower bit consumption but visually identical reconstruction. On the other hand, visual selectivity behaviour is closely associated with the video content distribution (e.g., frequency response), leading to perceptually unequal preference. Thus, it is highly expected that such content semantics-induced discriminative features can be utilized to improve system efficiency when integrated into the video encoder. To this end, we will discuss the analysis/synthesis-based approach for pre-processing in the next section.
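As a concrete illustration of such UEQ-style quantization control, the following NumPy sketch (our simplification, not the logic of any cited encoder) maps a per-pixel saliency map to a CTU-level QP map; the linear mapping around the base QP, the 64-pixel CTU size, and the ±8 QP swing are all illustrative assumptions.

```python
import numpy as np

def saliency_to_qp_map(sal, base_qp, ctu=64, max_delta=8):
    """Map a per-pixel saliency map in [0, 1] to a per-CTU QP map: salient
    CTUs get a negative delta-QP (finer quantization), non-salient CTUs a
    positive one, keeping the frame-average QP roughly unchanged."""
    H, W = sal.shape
    rows, cols = H // ctu, W // ctu
    qp = np.empty((rows, cols), dtype=np.int32)
    for r in range(rows):
        for c in range(cols):
            block = sal[r*ctu:(r+1)*ctu, c*ctu:(c+1)*ctu]
            # mean saliency 0.5 -> no change; 1.0 -> -max_delta; 0.0 -> +max_delta
            delta = int(round((0.5 - block.mean()) * 2 * max_delta))
            qp[r, c] = np.clip(base_qp + delta, 0, 51)  # H.265-style QP range
    return qp

sal = np.random.rand(128, 128)          # stand-in for a DNN saliency map
print(saliency_to_qp_map(sal, base_qp=32))
```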
B. Analysis/Synthesis Based Pre-processing
Since most videos are consumed by human vision, subjective perception of the HVS is the best way to evaluate quality. However, it is quite difficult to devise a profoundly accurate mathematical HVS model in an actual video encoder for rate and perceptual quality optimization, due to the complicated and unclear information processing that occurs in the human visual pathway. Instead, many pioneering psychovisual studies have suggested that the neuronal response to compound stimuli is highly nonlinear [74]–[81] within the receptive field. This leads to well-known visual behaviors, such as frequency selectivity, masking, etc., where such stimuli are closely related to the content texture characteristics. Intuitively, video scenes can be broken down into areas that are either "perceptually significant" (e.g., measured in an MSE sense) or "perceptually insignificant". For "perceptually insignificant" regions, users will not perceive compression or processing impairments without a side-by-side comparison with the original sample. This is because the HVS gains semantic understanding by viewing content as a whole, instead of interpreting texture details pixel-by-pixel [82]. This notable effect of the HVS is also referred to as "masking," where visually insignificant information, e.g., perceptually insignificant pixels, will be noticeably suppressed.

In practice, we can first analyze the texture characteristics of the original video content in the pre-processing step, e.g., the Texture Analyzer in Fig. 2, in order to sort textures by their significance. Subsequently, we can use any standard-compliant video encoder to encode the perceptually significant areas as the main bitstream payload, and apply a statistical model to represent the perceptually insignificant textures, with model parameters encapsulated as side information. Finally, we can use the decoded areas and parsed textures to jointly synthesize the reconstructed sequences in the Texture Synthesizer. This type of texture modeling makes good use of statistical and psychovisual representation jointly, generally requiring fewer bits
despite yielding a visually identical sensation compared to the traditional hybrid "prediction+residual" method (a comprehensive survey of texture analysis/synthesis-based video coding technologies can be found in [83]). Therefore, texture analysis and synthesis play a vital role in subsequent video coding. We will discuss related techniques below.

Fig. 2: Texture Coding System. A general framework of analysis/synthesis-based video coding: the Texture Analyzer separates the original video into texture side information and regular content sent through the encoder over the channel, and the decoder-side Texture Synthesizer combines both into the reconstructed video.
1) Texture Analysis:
Early developments in texture analysis and representation can be categorized into filter-based or statistical modeling-based approaches. The Gabor filter is one typical example of a filter-based approach, by which the input image is convolved with nonlinear activation for the derivation of the corresponding texture representation [84], [85]. At the same time, in order to identify static and dynamic textures for video content, Thakur et al. [86] utilized the 2D dual-tree complex wavelet transform and the steerable pyramid transform [87], respectively. To accurately capture the temporal variations in video, Bansal et al. [88] suggested the use of optical flow for dynamic texture indication and later synthesis, where the optical flow could be generated using temporal filtering. Leveraging statistical models such as the Markovian random field (MRF) [89], [90] is an alternative way to analyze and represent texture. For efficient texture description, statistical modeling such as this was then extended using handcrafted local features, e.g., the scale-invariant feature transform (SIFT) [91], speeded-up robust features (SURF) [92], and local binary patterns (LBP) [93].

Recently, stacked DNNs have demonstrated superior efficiency in many computer vision tasks, mainly due to the powerful capacity of DNN features for video content representation. The most straightforward scheme directly extracted features from the FC6 or FC7 layer of AlexNet [94] for texture representation. Furthermore, Cimpoi et al. [95] demonstrated that Fisher-vectorized [96] CNN features were a decent texture descriptor candidate.
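For intuition about the filter-based family, here is a minimal filter-bank texture descriptor in the spirit of the Gabor approach above; the kernel parameters and the mean-absolute-response pooling are illustrative choices of ours, assuming NumPy and SciPy.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, lam=8.0, sigma=4.0, size=15):
    """Real part of a Gabor kernel at orientation theta (radians):
    a Gaussian envelope modulated by a cosine of wavelength lam."""
    half = size // 2
    y, x = np.mgrid[-half:half+1, -half:half+1].astype(np.float64)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def texture_descriptor(img, n_orient=4):
    """Filter-bank texture feature: mean absolute response per orientation."""
    feats = []
    for k in range(n_orient):
        resp = convolve2d(img, gabor_kernel(k * np.pi / n_orient), mode='same')
        feats.append(np.abs(resp).mean())
    return np.array(feats)

print(texture_descriptor(np.random.rand(64, 64)))  # 4-dim orientation signature
```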
2) Texture Synthesis:
Texture synthesis reverse-engineers the analysis in pre-processing to restore pixels accordingly. It generally includes both non-parametric and parametric methods. For non-parametric synthesis, texture patches are usually resampled from reference images [97]–[99]. In contrast, parametric methods utilize statistical models to reconstruct the texture regions by jointly optimizing the observation outcomes from the model and the model itself [87], [100], [101].

DNN-based solutions exhibit great potential for texture synthesis applications. One notable example demonstrating this potential used a pre-trained image classification-based CNN model to generate texture patches [102]. Li et al. [103] then demonstrated that Markovian GAN-based texture synthesis could offer remarkable quality improvement.

To briefly summarize, earlier "texture analysis/synthesis" approaches often relied on handcrafted models, as well as corresponding parameters. While they have shown good performance to some extent for a set of test videos, it is usually very difficult to generalize them to large-scale video datasets without further fine-tuning of parameters. On the other hand, related neuroscience studies propose a broader definition of texture which is more closely related to perceptual sensation, although existing mathematical or data-driven texture representations attempt to fully fulfill such perceptual motives. Furthermore, recent DNN-based schemes present a promising perspective; however, their complexity has not yet been appropriately exploited. So, in Section V, we will reveal a CNN-based pixel-level texture analysis approach to segment perceptually insignificant texture areas in a frame for compression and later synthesis. To model the textures both spatially and temporally, we introduce a new coding mode called the "switchable texture mode" that is determined at the group of pictures (GoP) level according to the bit rate saving.
III. OVERVIEW OF DNN-BASED VIDEO CODING
A number of investigations have shown that DNNs can be used for efficient image/video coding [104]–[107]. This topic has attracted extensive attention in recent years, demonstrating its potential to enhance the conventional system with better R-D performance.

There are three major directions currently under investigation. One is resolution resampling-based video coding, by which the input videos are first down-sampled prior to being encoded, and the reconstructed videos are up-sampled or super-resolved to the same resolution as the input [108]–[111]. This category generally develops up-scaling or super-resolution algorithms on top of standard video codecs. The second direction under investigation is modularized neural video coding (MOD-NVC), which has attempted to improve individual coding tools in the traditional hybrid coding framework using learning-based solutions. The third direction is end-to-end neural video coding (E2E-NVC), which fully leverages stacked neural networks to compactly represent the input image/video in an end-to-end learning manner. In the following sections, we primarily review the latter two cases, since the first one has been extensively discussed in many other studies [112].
A. Modularized Neural Video Coding (MOD-NVC)
The MOD-NVC inherits the traditional hybrid coding framework, within which handcrafted tools are refined or replaced using learned solutions. The general assumption is that existing rule-based coding tools can be further improved via a data-driven approach that leverages powerful DNNs to learn robust and efficient mapping functions for more compact content representation. Two great articles have comprehensively reviewed relevant studies in this direction [106], [107]. We briefly introduce key techniques in intra/inter prediction, quantization, and entropy coding. Though in-loop filtering is another important piece of the "coding" block, due to its similarities with post filtering, we have chosen to review it in the quality enhancement-aimed "post-processing" for the sake of a more cohesive presentation.
1) Intra Prediction:
Video frame content presents a highly correlated distribution across neighboring samples spatially. Thus, block redundancy can be effectively exploited using causal neighbors. In the meantime, due to the presence of local structural dynamics, block pixels can be better represented by a variety of angular directed predictions.

In conventional standards, such as H.264/AVC, H.265/HEVC, or even the emerging VVC, specific prediction rules are carefully designated to use weighted neighbors for respective angular directions. From H.264/AVC to the recent VVC, intra coding efficiency has been gradually improved by allowing more fine-grained angular directions and flexible block sizes/partitions. In practice, an optimal coding mode is often determined by R-D optimization.

One would intuitively expect that coding performance can be further improved if better predictions can be produced. Therefore, there have been a number of attempts to leverage the powerful capacity of stacked DNNs for better intra predictor generation, including the CNN-based predictor refinement suggested in [113] to reduce prediction residual, an additional learned mode trained using FCN models reported in [114], [115], using RNNs in [116], using CNNs in [108], or even using GANs in [117], etc. These approaches have actively utilized the neighboring pixels or blocks, and/or other context information (e.g., mode) if applicable, in order to accurately represent the local structures for better prediction. Many of these approaches have reported more than 3% BD-Rate gains against the popular H.265/HEVC reference model. These examples demonstrate the efficiency of DNNs in intra prediction.
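A minimal sketch of the learned intra-prediction idea, assuming PyTorch: the network sees a causal context patch with the to-be-coded block zeroed out and regresses that block. The context/block sizes and layer widths are illustrative, not those of any cited design.

```python
import torch
import torch.nn as nn

class IntraPredictorCNN(nn.Module):
    """Toy learned intra predictor: given a (N+B)x(N+B) patch whose unknown
    BxB bottom-right block is zeroed out, predict that block from its
    causal (top/left) neighborhood."""
    def __init__(self, block=8, ch=32):
        super().__init__()
        self.block = block
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, patch):             # patch: (B, 1, N+B, N+B)
        pred = self.net(patch)
        return pred[..., -self.block:, -self.block:]  # predicted BxB block

model = IntraPredictorCNN()
patch = torch.rand(4, 1, 16, 16)
patch[..., 8:, 8:] = 0.0                  # mask the block to be predicted
target = torch.rand(4, 1, 8, 8)           # ground-truth pixels of that block
loss = nn.functional.mse_loss(model(patch), target)  # residual-minimizing loss
```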
2) Inter Prediction:
In addition to spatial intra prediction, temporal correlations have also been exploited via inter prediction, by which previously reconstructed frames are utilized to generate the inter predictor for compensation using displaced motion vectors.

Temporal prediction can be enhanced using references with higher fidelity and more fine-grained motion compensation. For example, fractional-pel interpolation is usually deployed to improve prediction accuracy [118]. On the other hand, motion compensation with flexible block partitions is another major contributor to inter coding efficiency.

Similarly, earlier attempts have been made to utilize DNN solutions for better inter coding. For instance, CNN-based interpolations were studied in [119]–[121] to improve the half-pel samples. Besides, an additional virtual reference could be generated using CNN models for improved R-D decisions in [122]. Xia et al. [123] further extended this approach using multiscale CNNs to create an additional reference closer to the current frame, by which accurate pixel-wise motion representation could be used. Furthermore, conventional references could also be enhanced using DNNs to refine the compensation [124].
3) Quantization and Entropy Coding:
Quantization and entropy coding are used to remove statistical redundancy. Scalar quantization is typically implemented in video encoders to remove insensitive high-frequency components, saving bit rate without losing perceptual quality. Recently, a three-layer DNN was developed to predict the local visibility threshold C_T for each CTU, by which more accurate quantization could be achieved via the connection between C_T and the actual quantization step size. This development led to noticeable R-D improvement, e.g., up to 11% as reported in [125].

Context-adaptive binary arithmetic coding (CABAC) and its variants are techniques that are widely adopted to encode binarized symbols. The efficiency of CABAC is heavily reliant on the accuracy of probability estimation in different contexts. Since H.264/AVC, handcrafted probability transfer functions (developed through exhaustive simulations, and typically implemented using look-up tables) have been utilized. In [115] and [126], the authors demonstrated that a combined FCN and CNN model could be used to predict intra mode probability for better entropy coding. Another example was presented in [127] to accurately encode transform indexes via stacked CNNs. Likewise, in [128], the intra DC coefficient probability could also be estimated using DNNs for better performance.

All of these explorations have reported positive R-D gains when incorporating DNNs in traditional hybrid coding frameworks. A companion H.265/HEVC-based software model is also offered by Liu et al. [106] to help the community further pursue this line of exploration. However, integrating DNN-based tools could exponentially increase both the computational and space complexity. Therefore, creating harmony between learning-based and conventional rule-based tools under the same framework requires further investigation.

It is also worth noting that an alternative approach is currently being explored in parallel. In this approach, researchers suggest using an end-to-end neural video coding (E2E-NVC) framework to drive the raw video content representation via layered feature extraction, activation, suppression, and aggregation, mostly in a supervised learning fashion, instead of refining individual coding tools.

B. End-to-End Neural Video Coding (E2E-NVC)
Representing raw video pixels as compactly as possible by massively exploiting their spatio-temporal and statistical correlations is the fundamental problem of lossy video coding. Over decades, traditional hybrid coding frameworks have utilized pixel-domain intra/inter prediction, transform, entropy coding, etc., to fulfill this purpose. Each coding tool is extensively examined under a specific codec structure to carefully justify the trade-off between R-D efficiency and complexity. This process led to the creation of well-known international or industry standards, such as H.264/AVC, H.265/HEVC, AV1, etc.
On the other hand, DNNs have demonstrated a powerful capacity for video spatio-temporal feature representation in vision tasks, such as object segmentation, tracking, etc. This naturally raises the question of whether it is possible to encode those spatio-temporal features in a compact format for efficient lossy compression.

Recently, we have witnessed the growth of video coding technologies that rely completely on end-to-end supervised learning. Most learned schemes still closely follow the conventional intra/inter frame definition, by which different algorithms are investigated to efficiently represent the intra spatial textures, inter motion, and the inter residuals (if applicable) [104], [129]–[131]. Raw video frames are fed into stacked DNNs to extract, activate, and aggregate appropriate compact features (at the bottleneck layer) for quantization and entropy coding. Similarly, R-D optimization is also facilitated to balance the rate and distortion trade-off. In the following paragraphs, we briefly review the aforementioned key components.
1) Nonlinear Transform and Quantization:
Autoencoder or variational autoencoder (VAE) architectures are typically used to transform the intra texture or inter residual into compressible features.

For example, Toderici et al. [132] first applied fully-connected recurrent autoencoders for variable-rate thumbnail image compression. Their work was then improved in [133], [134] with the support of full-resolution images, unequal bit allocation, etc. Variable bit rate is intrinsically enabled by these recurrent structures. The recurrent autoencoders, however, suffer from higher computational complexity at higher bit rates, because more recurrent processing is required. Alternatively, convolutional autoencoders have been extensively studied in past years, where different bit rates are adapted by setting a variety of λs to optimize the R-D trade-off. Note that different network models may be required for individual bit rates, making hardware implementation challenging (e.g., model switching from one bit rate to another). Recently, conditional convolution [135] and scaling factors [136] were proposed to enable variable-rate compression using a single or very limited set of network models without noticeable coding efficiency loss, which makes convolutional autoencoders more attractive for practical applications.

To generate a more compact feature representation, Ballé et al. [105] suggested replacing traditional nonlinear activations, e.g., ReLU, with generalized divisive normalization (GDN), which is theoretically proven to be more consistent with human visual perception. A subsequent study [137] revealed that GDN outperformed other nonlinear rectifiers, such as ReLU, leakyReLU, and tanh, in compression tasks. Several follow-up studies [138], [139] directly applied GDN in their networks for compression exploration.

Quantization is a non-differentiable operation, basically converting arbitrary elements into symbols with a limited alphabet for efficient entropy coding in compression. Quantization must be made differentiable in the end-to-end learning framework for back propagation. A number of methods, such as adding uniform noise [105], stochastic rounding [132], and soft-to-hard vector quantization [140], were developed to approximate a continuous distribution for differentiation.
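The two most common differentiable quantization proxies can be written in a few lines; this sketch assumes PyTorch, and the 192-channel bottleneck shape is only illustrative.

```python
import torch

def quantize_ste(y):
    """Straight-through estimator: hard rounding in the forward pass,
    identity gradient in the backward pass."""
    return y + (torch.round(y) - y).detach()

def quantize_noise(y):
    """Training-time proxy in the spirit of [105]: additive uniform noise in
    [-0.5, 0.5) mimics the rounding error while keeping the graph smooth."""
    return y + torch.empty_like(y).uniform_(-0.5, 0.5)

y = torch.randn(2, 192, 16, 16, requires_grad=True)  # bottleneck features
y_train = quantize_noise(y)   # used when optimizing the network end-to-end
y_test = quantize_ste(y)      # hard symbols for actual entropy coding
```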
2) Motion Representation:
Chen et al. [104] developed DeepCoder, where a simple convolutional autoencoder was applied for both intra and residual coding at fixed 32×32 blocks, and block-based motion estimation from traditional video coding was re-used for temporal compensation. Lu et al. [141] introduced optical flow for motion representation in their DVC work, which, together with the intra coding in [142], demonstrated similar performance compared with H.265/HEVC. However, coding efficiency suffered a sharp loss at low bit rates. Liu et al. [143] extended their non-local attention optimized image compression (NLAIC) for intra and residual encoding, and applied second-order flow-to-flow prediction for more compact motion representation, showing consistent rate-distortion gains across different contents and bit rates.

Motion can also be implicitly inferred via temporal interpolation. For example, Wu et al. [144] applied RNN-based frame interpolation, which, together with residual compensation, offered comparable performance to H.264/AVC. Djelouah et al. [145] furthered interpolation-based video coding by utilizing advanced optical flow estimation and feature-domain residual coding. However, temporal interpolation usually leads to an inevitable structural coding delay.

Another interesting exploration, made by Rippel et al. in [130], was to jointly encode motion flow and residual using compound features, where a recurrent state was embedded to aggregate multi-frame information for efficient flow generation and residual coding.
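Flow-based temporal compensation in these codecs boils down to backward warping a reference frame with a dense motion field; a minimal PyTorch sketch (ours, not any cited implementation) follows.

```python
import torch
import torch.nn.functional as F

def warp(reference, flow):
    """Backward-warp a reference frame toward the current frame.
    reference: (B, C, H, W); flow: (B, 2, H, W), where flow[b, :, i, j] is the
    (dx, dy) displacement from current pixel (j, i) to its match in the reference."""
    B, _, H, W = reference.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).float().to(reference.device)  # (H, W, 2)
    tgt = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)                 # add motion
    # normalize sampling locations to [-1, 1] as required by grid_sample
    tgt[..., 0] = 2.0 * tgt[..., 0] / (W - 1) - 1.0
    tgt[..., 1] = 2.0 * tgt[..., 1] / (H - 1) - 1.0
    return F.grid_sample(reference, tgt, align_corners=True)

ref = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)           # zero motion -> identity warp
assert torch.allclose(warp(ref, flow), ref, atol=1e-5)
```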
3) R-D Optimization: Li et al. [146] utilized a separate three-layer CNN to generate an importance map for spatial-complexity-based adaptive bit allocation, leading to noticeable subjective quality improvement. Mentzer et al. [140] further utilized the masked bottleneck layer to unequally weight features at different spatial locations. Such importance map embedding is a straightforward approach to end-to-end training. Importance derivation was later improved with the non-local attention [147] mechanism to efficiently and implicitly capture both global and local significance for better compression performance [136].

Probabilistic models play a vital role in data compression. Assuming a Gaussian distribution for feature elements, Ballé et al. [142] utilized hyper priors to estimate the parameters of the Gaussian scale model (GSM) for latent features. Later, Hu et al. [148] used hierarchical (coarse-to-fine) hyper priors to improve the entropy models in multiscale representations. Minnen et al. [149] improved the context modeling using joint autoregressive spatial neighbors and hyper priors based on the Gaussian mixture model (GMM). Autoregressive spatial priors are commonly fused by PixelCNNs or PixelRNNs [150]. Reed et al. [151] further introduced multiscale PixelCNNs, yielding competitive density estimation and a great boost in speed (e.g., from O(N) to O(log N)). Prior aggregation was later extended from 2D architectures to 3D PixelCNNs [140]. Channel-wise weight-sharing-based 3D implementations can greatly reduce network parameters without performance loss. A parallel 3D PixelCNN for practical decoding is presented by Chen et al. [136]. Previous methods accumulated all the priors to estimate the probability based on a single GMM assumption for each element; recent studies have shown that weighted GMMs can further improve coding efficiency [152], [153].

Pixel error, such as MSE, is one of the most popular loss functions. Concurrently, SSIM (or MS-SSIM) has also been adopted because of its greater consistency with visual perception. Simulations revealed that SSIM-based loss can improve reconstruction quality, especially at low bit rates. Towards perceptually optimized encoding, perceptual losses measured by adversarial loss [154]–[156] and VGG loss [157] were embedded in learning to produce visually appealing results.

Though E2E-NVC is still in its infancy, its fast-growing R-D efficiency holds a great deal of promise. This is especially true given that we can expect neural processors to be deployed massively in the near future [158].
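The training objective shared by most of the schemes above is the Lagrangian rate-distortion loss L = R + λD, where the rate term is derived from the entropy model's element-wise likelihoods. A minimal sketch, assuming PyTorch and toy likelihood values:

```python
import torch

def rd_loss(x, x_hat, likelihoods, lam=0.01):
    """Rate-distortion objective L = R + lambda * D for end-to-end codecs:
    R is the expected code length (in bits per pixel) implied by the entropy
    model's likelihoods; D is the mean squared reconstruction error."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(likelihoods).sum() / num_pixels
    dist = torch.mean((x - x_hat) ** 2)
    return rate_bpp + lam * dist, rate_bpp, dist

x = torch.rand(1, 3, 64, 64)                       # original frame
x_hat = x + 0.01 * torch.randn_like(x)             # toy reconstruction
p = torch.rand(1, 192, 4, 4).clamp(1e-9, 1.0)      # toy per-element likelihoods
loss, bpp, mse = rd_loss(x, x_hat, p)              # sweep lam for the R-D curve
```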
IV. OVERVIEW OF DNN-BASED POST-PROCESSING
Compression artifacts, e.g., blockiness, ringing, cartoonishness, etc., are inevitably present in both traditional hybrid coding frameworks and learned compression approaches, severely impairing visual sensation and QoE. Thus, quality enhancement filters are often applied as a post-filtering step or as an in-loop module to alleviate compression distortions. Towards this goal, adaptive filters are usually developed to minimize the error between original and distorted samples.
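Most of the learned filters reviewed below share the same residual-learning skeleton: the network predicts the compression error and adds it back to the decoded frame. Here is a minimal sketch, assuming PyTorch, with illustrative depth and width (not any specific published filter):

```python
import torch
import torch.nn as nn

class ResidualEnhanceCNN(nn.Module):
    """Toy quality-enhancement filter: predict the compression residual and
    add it back, which is easier to learn than the full mapping from a
    distorted frame to its pristine version."""
    def __init__(self, ch=32, depth=5):
        super().__init__()
        layers = [nn.Conv2d(1, ch, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(ch, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, decoded):
        return decoded + self.body(decoded)   # global skip connection

y = torch.rand(1, 1, 64, 64)                  # decoded luma frame
enhanced = ResidualEnhanceCNN()(y)            # trained against the original
```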
A. In-loop Filtering
Existing video standards mainly utilize in-loop filters to improve the subjective quality of the reconstruction, and also to offer better R-D efficiency due to enhanced references. Examples include deblocking [24], sample adaptive offset (SAO) [25], the constrained directional enhancement filter (CDEF) [159], loop restoration (LR) [160], the adaptive loop filter (ALF) [161], etc.
Recently, numerous CNN models have been developed for in-loop filtering via a data-driven approach to learn the mapping functions. It is worth pointing out that prediction relationships must be carefully examined when designing in-loop filters, due to the frame referencing structure and potential error propagation. Both intra and inter predictions are utilized in popular video encoders, where an intra-coded frame only exploits the spatial redundancy within the current frame, while an inter-coded frame jointly explores the spatio-temporal correlations across frames over time.

Earlier explorations of this subject have mainly focused on designing DNN-based filters for intra-coded frames, particularly by trading network depth and parameters for better coding efficiency. For example, IFCNN [162] and VRCNN [163] are shallow networks with only a handful of layers, while substantially larger models were later used to buy additional gains, e.g., the 5.7% BD-Rate gain reported in [164] using a model with 3,340,000 parameters, and the 8.50% BD-Rate saving obtained in [167] using a model with 2,298,160 parameters. The more parameters a model has, the more complex it is; unfortunately, greater complexity limits the network's potential for practical application. Such intra-frame-based in-loop filters treat decoded frames equally, without consideration of the in-loop inter-prediction dependency. Nevertheless, the aforementioned networks can be used in post-filtering outside the coding loop.

It is necessary to include the temporal prediction dependency while designing in-loop CNN-based filters for inter-frame coding. Some studies leveraged prior knowledge from the encoding process to assist CNN training and inference. For example, Jia et al. [168] incorporated the co-located block information for in-loop filtering. Meng et al. [169] utilized the coding unit partition for further performance improvement. Li et al. [170] input both the reconstructed frame and the difference between the reconstructed and predicted pixels to improve the coding efficiency. Applying prior knowledge in learning may improve the coding performance, but it further complicates the CNN model by involving additional information in the networks. On the other hand, the contribution of this prior knowledge is quite limited, because such additional priors are already implicitly embedded in the reconstructed frame.

If a CNN-based in-loop filter is applied to frame I_t, its impact will gradually propagate to frame I_{t+1}, which uses I_t as a reference; I_{t+1} is then the reference of I_{t+2}, and so forth (even though more advanced inter referencing strategies can be devised, this propagation-based behavior remains the same). If such an already-affected frame is filtered again by the same CNN model, an over-filtering problem is triggered, resulting in severely degraded performance, as analyzed in [171]. To overcome this challenging problem, a CNN model called SimNet was built to capture the relationship between the reconstructed frame and its original frame in [172], so as to adaptively skip filtering operations in inter coding. SimNet reported 7.27% and 5.57% BD-Rate savings for intra and inter coding of AV1, respectively. A similar skipping strategy was suggested by Chen et al. [173] to enable a wide-activation residual network, yielding additional BD-Rate savings for respective intra and inter coding (9.64% for the latter) on the AV1 platform.

Alternative solutions resort to the more expensive R-D optimization to avoid the over-filtering problem. For example, Yin et al. [174] developed three sets of CNN filters for luma and chroma components, where the R-D optimal CNN model is used and signaled in the bitstream.
Similar ideas are developed in [175], [176] as well, in which multiple CNN models are trained and the R-D optimal model is selected for inference.

It is impractical to use deeper and denser CNN models in applications. It is also very expensive to conduct R-D optimization to choose the optimal one from a set of pre-trained models. Note that a limited number of pre-trained models are theoretically insufficient to generalize to large-scale video samples. To this end, in Section VII-A, we introduce a guided-CNN scheme which adapts shallow CNN models according to the characteristics of the input video content.

B. Post Filtering
Post filtering is generally applied to the compressed frames at the decoder side to further enhance the video quality for better QoE.

Previous in-loop filters designated for intra-coded frames can be re-used for single-frame post-filtering [163], [177]–[185]. Appropriate re-training may be applied in order to better capture the data characteristics. However, single-frame post-filtering may introduce quality fluctuation across frames. This may be due to the limited capacity of CNN models to deal with a great variety of video contents. Thus, multi-frame post filtering can be devised to massively exploit the correlation across successive temporal frames. By doing so, it not only greatly improves on the single-frame solution, but also offers better temporal quality over time.

Typically, a two-step strategy is applied for multi-frame post filtering. First, neighboring frames are aligned to the current frame via (pixel-level) motion estimation and compensation (MEMC). Then, all aligned frames are fed into networks for high-quality reconstruction. Thus, the accuracy of MEMC greatly affects reconstruction performance. In applications, learned optical flow, such as FlowNet [186], FlowNet2 [187], PWC-Net [188], and TOFlow [189], is widely used.

Some exploration has already been made in this arena: Bao et al. [190] and Wang et al. [191] implemented general video quality enhancement frameworks for denoising, deblocking, and super-resolution, where Bao et al. [190] employed FlowNet and Wang et al. [191] used pyramid, cascading, and deformable convolutions to respectively align frames temporally. Meanwhile, Yang et al. [192] proposed a multi-frame quality enhancement framework called MFQE-1.0, in which a spatial transformer motion compensation (STMC) network is used for alignment, and a deep quality enhancement network (QE-net) is employed to improve reconstruction quality. Then, Guan et al. [193] upgraded MFQE-1.0 to MFQE-2.0 by replacing QE-net with a dense CNN model, leading to better performance and less complexity. Later on, Tong et al. [194] suggested using FlowNet2 in MFQE-1.0 for temporal frame alignment (instead of the default STMC), yielding 0.23 dB PSNR gain over the original MFQE-1.0. Similarly, FlowNet2 is also used in [195] for improved efficiency.

All of these studies suggest the importance of temporal alignment in post filtering. Thus, in the subsequent case study (see Section VII-B), we first examine the efficiency of alignment, and then further discuss the contributions from respective intra-coded and inter-coded frames to the quality enhancement of the final reconstruction. This will help audiences gain a deeper understanding of similar post filtering techniques.
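A minimal sketch of the second (fusion) step of the two-step strategy, assuming PyTorch, luma-only input, and neighbors already aligned by MEMC (e.g., with a warping routine like the one sketched in Section III-B); the three-frame window and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiFrameFusion(nn.Module):
    """Fuse MEMC-aligned neighboring frames with the current frame along the
    channel axis and predict an enhancement residual for the current frame."""
    def __init__(self, n_frames=3, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_frames, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, aligned):            # aligned: (B, n_frames, H, W)
        mid = aligned.shape[1] // 2
        current = aligned[:, mid:mid + 1]  # the frame being enhanced
        return current + self.net(aligned)

frames = torch.rand(1, 3, 64, 64)          # previous / current / next, aligned
out = MultiFrameFusion()(frames)
```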
V. CASE STUDY FOR PRE-PROCESSING: SWITCHABLE TEXTURE-BASED VIDEO CODING
This section presents a switchable texture-based video pre-processing that leverages DNN-based semantic understanding for subsequent coding improvement. In short, we exploit DNNs to accurately segment "perceptually InSIGnificant" (pInSIG) texture areas to produce a corresponding pInSIG mask. In many instances, this mask drives the encoder to treat pInSIG textures, which are typically inferred without additional residuals, separately from the "perceptually SIGnificant" (pSIG) areas elsewhere, which are coded using the traditional hybrid method. This approach is implemented on top of the AV1 codec [196]–[198] by enabling a GoP-level switchable mechanism. This yields
noticeable bit rate savings for both standard test sequences and additional challenging sequences from the YouTube UGC dataset [199], under similar perceptual quality. The method we propose is a pioneering work that integrates learning-based texture analysis and reconstruction approaches with a modern video codec to enhance video compression performance.

Fig. 3: Texture Analyzer. Proposed semantic segmentation network using PSPNet [200] and ResNet-50 [201]: the input image passes through dilated-convolution ResNet-50 feature extraction, the pyramid pooling (PSP) module with pooling and upsampling, and a final convolution to produce the scene segmentation.
A. Texture Analysis
Our previous attempt [202] yielded encouraging bit rate savings without decreasing visual quality. This was accomplished by perceptually differentiating pInSIG textures from other areas to be encoded in a hybrid coding framework. However, the corresponding texture masks were derived using traditional methods, at the coding block level. On the other hand, building upon advancements created by DNNs and large-scale labeled datasets (e.g., ImageNet [203], COCO [204], and ADE20K [205]), learning-based semantic scene segmentation algorithms [200], [205], [206] have been tremendously improved to generate accurate pixel-level texture masks.

In this work, we first rely on the powerful ResNet-50 [201] with dilated convolutions [207], [208] to extract feature maps that effectively embed the content semantics. We then introduce the pyramid pooling module from PSPNet [200] to produce a pixel-level semantic segmentation map, as shown in Fig. 3. Our implementation starts with a pre-trained PSPNet model generated using the MIT SceneParse150 [209] as a scene parsing benchmark. We then retrained the model on a subset of the densely annotated dataset ADE20K [205]. In the end, the model offers a pixel segmentation accuracy of 80.23%.

It is worthwhile to note that such pixel-level segmentation may result in a large number of semantic classes. Nevertheless, this study suggests grouping similar texture classes commonly found in nature scenes into four major categories, e.g., "earth and grass", "water, sea and river", "mountain and hill", and "tree". Each texture category has an individual segmentation mask to guide the compression performed by the succeeding video encoder.
B. Switchable Texture-Based Video Coding
Texture masks are generally used to identify texture blocks, and to perform the encoding of texture blocks and non-texture blocks separately, as illustrated in Fig. 4a. In this case study, the AV1 reference software platform is selected to exemplify the efficiency of our proposal.
Texture Blocks.
Texture and non-texture blocks are identified by overlaying the segmentation mask from the texture analyzer on its corresponding frame. These frame-aligned texture masks provide pixel-level accuracy, which is capable of supporting arbitrary texture shapes. However, in order to support the block processing commonly adopted by video encoders, we propose refining the original pixel-level masks to their block-based representations. The minimum size of a texture block is 16×16. In order to avoid boundary artifacts and maintain temporal consistency, we implemented a conservative two-step strategy to determine the texture block (see the sketch below). First, the block itself must be fully contained in the texture region marked by the pixel-level mask. Then, its warped representation in the temporal references (e.g., the preceding and succeeding frames in the encoding order) has to be inside the masked texture area of the corresponding reference frames as well. Finally, these texture blocks are encoded using the texture mode, and non-texture blocks are encoded as usual using the hybrid coding structure.
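A simplified NumPy sketch of this two-step block decision; for brevity it assumes a purely translational, integer global motion vector shared by both references, whereas the actual implementation uses the AV1 global motion model described below.

```python
import numpy as np

def block_texture_mask(mask, prev_mask, next_mask, mv, bs=16):
    """Conservative two-step texture-block decision: a bs x bs block is marked
    'texture' only if (1) it lies fully inside the current frame's pixel-level
    mask and (2) its motion-displaced position lies fully inside the masks of
    both reference frames. mv: global (dx, dy) displacement, integer pixels."""
    H, W = mask.shape
    out = np.zeros((H // bs, W // bs), dtype=bool)
    dx, dy = mv
    for r in range(H // bs):
        for c in range(W // bs):
            y, x = r * bs, c * bs
            if not mask[y:y+bs, x:x+bs].all():        # step 1: spatial check
                continue
            yw, xw = y + dy, x + dx                   # step 2: temporal check
            if 0 <= yw and yw + bs <= H and 0 <= xw and xw + bs <= W \
               and prev_mask[yw:yw+bs, xw:xw+bs].all() \
               and next_mask[yw:yw+bs, xw:xw+bs].all():
                out[r, c] = True
    return out
```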
Texture Mode.
A texture-mode-coded block is inferred from its temporal reference using the global motion parameters, without incurring any motion compensation residuals. In contrast, non-texture blocks are compressed using the hybrid "prediction+residual" scheme. For each current frame and any one of its reference frames, the AV1 syntax specifies only one set of global motion parameters at the frame header. Therefore, to comply with the AV1 syntax, our implementation only considers one texture class for each frame. This guarantees the general compatibility of our solution with existing AV1 decoders. We further modified the AV1 global motion tool to estimate the motion parameters based on the texture regions of the current frame and its reference frame. We used the same feature extraction and model fitting approach as in the global motion coding tool in order to provide a more accurate motion model for the texture regions. This was done to prevent visual artifacts on the block edges between the texture and non-texture blocks in the reconstructed video. Although we have demonstrated our algorithms using the AV1 standard, we expect that the same methodology can be applied to other standards. For instance, when using the H.265/HEVC standard, we can leverage the SKIP mode syntax to signal the texture mode instead of utilizing the global motion parameters.

Previous discussions have suggested that the texture mode is enabled along with inter prediction. Our extensive studies have also demonstrated that it is better to activate the texture mode in frames where bi-directional predictions are allowed (e.g., B-frames), for the optimal trade-off between bit rate saving and perceived quality. As will be shown in the following performance comparisons, we use an 8-frame GoP (or Golden-Frame (GF) group as defined in AV1) and enable the texture mode in every other frame, by which the compound prediction from bi-directional references can be facilitated for prediction warping. Such bi-directional prediction can also alleviate possible temporal quality flickering.
Switchable Optimization.
In our previous work [210], the texture mode was enabled for every B-frame, demonstrating
significant bit rate reduction at the same level of perceptual sensation for most standard test videos, in comparison to the AV1 anchor. However, some videos did cause the model to perform more poorly. One reason for this effect is that higher QP settings typically incur more all-zero residual blocks. Alternatively, texture mode is also content-dependent: a relatively small number of texture blocks may be present in some videos. Both scenarios limit the bit rate savings, and an overhead of extra bits is mandatory for global motion signaling if texture mode is enabled.

To address these problems, we introduce a switchable scheme to determine whether texture mode could potentially be enabled for a GoP or a GF group. The criteria for switching are based on the texture region percentage, calculated as the average ratio of texture blocks in B-frames, and on the potential bit rate savings with or without texture mode. Figure 4b illustrates the switchable texture mode decision; a sketch of this decision logic is given below. Currently, we use bit rate saving as the criterion for switch decisions when the texture mode is enabled. This assumes perceptual sensation will remain nearly the same, since these texture blocks are perceptually insignificant.

Fig. 4: Texture mode and switchable control scheme. (a) Texture mode encoder implementation: scene-change information from the first-pass encoding and the pixel-level mask of the chosen texture class drive frame-level motion parameter calculation; at the block level, texture blocks choose the texture mode via R-D optimization before the block is encoded. (b) Switchable texture mode decision: texture mode is enabled for the GoPs in a scene only when the texture region percentage exceeds 10% and encoding the first GoP with texture mode enabled yields bit rate savings over encoding with it disabled.
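The decision in Fig. 4b reduces to a few comparisons; this sketch assumes the 10% texture-area threshold shown in the flowchart and bit counts obtained from two trial encodings of the first GoP (the numeric values below are purely illustrative).

```python
def choose_texture_mode(texture_pct, bits_with_tex, bits_without_tex,
                        min_pct=0.10):
    """GoP-level switch (Fig. 4b): enable texture mode only when enough
    texture blocks exist in the B-frames AND skipping their residuals
    actually saves bits after paying the global-motion signaling overhead."""
    if texture_pct <= min_pct:            # too few texture blocks in B-frames
        return False
    return bits_with_tex < bits_without_tex

# e.g., 23% average texture area, two trial encodings of the first GoP
print(choose_texture_mode(0.23, bits_with_tex=118_000, bits_without_tex=125_000))
```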
C. Experimental Results
We selected sequences with texture regions from standard test sequences and the more challenging YouTube UGC dataset [199] (https://media.withyoutube.com/). The YouTube UGC dataset is a sample selected from thousands of User Generated Content (UGC) videos uploaded to YouTube. The names of the UGC videos follow the format Category_Resolution_UniqueID. We calculate the bit rate savings at different QP values for 150 frames of the test sequences. In our experiments, we used the following configuration of the AV1 codec (change-Id: Ibed6015aa7cce12fcc6f314ffde76624df4ad2a1) as the baseline: 8-frame GoP or GF group using the random access configuration; 30 FPS; constant quality rate control; multi-layer coding structure for all GF groups; maximum intra frame interval of 150. We evaluate the performance of our proposed method in terms of bit rate savings and perceived quality.
1) Coding Performance:
To evaluate the performance of the proposed switchable texture mode method, bit rate savings at four quantization levels (QP = 16, 24, 32, 40) are calculated for each test sequence in comparison to the AV1 baseline.
Texture Analysis.
We compare two DNN-based texture analysis methods [210], [212] with a handcrafted feature-based approach [211] on selected standard test sequences. Results are shown in Table II. A positive bit rate saving (%) indicates a reduction relative to the AV1 baseline. Compared to the feature-based approach, the DNN-based methods show improved bit rate savings. The feature-based approach relies on color and edge information to generate the texture mask and is less accurate and consistent both spatially and temporally. Therefore, the number of blocks reconstructed using texture mode is usually much smaller than that of the DNN-based methods. Note that the parameters used in the feature-based approach require manual tuning for each video to optimize the texture analysis output. The pixel-level segmentation [210] shows further advantages over the block-level method [212], since the CNN model does not require the block size to be fixed.
Switchable Scheme.
We also compare the proposed method, a.k.a. tex-switch, with our previous work in [210], a.k.a. tex-allgf, which enables texture mode for all frames in a GF group. All three methods use the same encoder settings for a fair comparison. Bit rate saving results for various videos at different resolutions against the AV1 baseline are shown in Table III. A positive bit rate saving (%) indicates a reduction compared with the AV1 baseline.

In general, compared to the AV1 baseline, tex-allgf shows significant bit rate savings at lower QPs. However, as QP increases, the savings diminish. In some cases, tex-allgf exhibits poorer coding performance than the AV1 baseline at a high QP (e.g., negative numbers at QP 40). At a high QP, most blocks have zero residual due to heavy quantization, leaving very limited margins for bit rate savings with texture mode. In addition, a few extra bits are still required for signaling the global motion of texture mode coded blocks, and the bits saved through residual skipping then cannot compensate for this side-information overhead.

Furthermore, the proposed tex-switch method retains the greatest bit rate savings offered by tex-allgf and resolves the loss at higher QP settings.
TABLE II: Bit rate saving (%) comparison between handcrafted feature (FM) [211], block-level DNN (BM) [212], and pixel-level DNN (PM) [210] texture analysis against the AV1 baseline for selected standard test sequences (Coastguard, Flower, Waterfall, Netflix aerial, Intotree) using the tex-allgf method, at QP = 16, 24, 32, and 40.

TABLE III: Bit rate saving (%) comparison for the tex-allgf and tex-switch methods against the AV1 baseline, for sequences from CIF to 1080P resolution at QP = 16, 24, 32, and 40, with per-QP averages.
As shown in Table III, the negative numbers are mostly removed by the introduction of the GoP-level switchable texture mode. In some cases, tex-switch shows zero bit rate savings compared to the AV1 baseline because texture mode is completely disabled for all the GF groups, whereas tex-allgf incurs a loss. In a few cases, however, tex-switch achieves less bit rate saving than tex-allgf. This is because the bit rate saving performance of the first GF group in a scene fails to accurately represent the whole scene in some of the UGC sequences with short scene cuts. A possible solution is to identify additional GF groups that show potential bit rate savings and enable texture mode for those GF groups.
2) Subjective Evaluation:
Although significant bit rate savings have been achieved compared to the AV1 baseline, it is acknowledged that identical QP values do not necessarily imply the same video quality. We therefore performed a subjective visual quality study with 20 participants. Reconstructed videos produced by the proposed method (tex-switch) and the baseline AV1 codec at QP = 16, 24, 32, and 40 were arranged randomly and assessed by the participants using the double stimulus continuous quality scale (DSCQS) method [213]. Subjects were asked to choose among three options: the first video has better visual quality, the second video has better visual quality, or there is no difference between the two versions.

Fig. 5: Subjective evaluation of visual preference. Results show average subjective preference (%) for QP = 16, 24, 32, 40, compared between the AV1 baseline and the proposed switchable texture mode.

The result of this study is summarized in Figure 5. "Same Quality" indicates the percentage of participants who cannot tell the difference between the reconstructed videos produced by the AV1 baseline codec and by the proposed method tex-switch (69.03% on average). "tex-switch" indicates the percentage of participants who prefer the reconstructions produced by the proposed method (14.32% on average), and "AV1" indicates the percentage of participants who consider the visual quality of the reconstructed videos using the AV1 baseline to be better (16.65% on average).

We observe that the results are sequence dependent and that spatial and temporal artifacts can appear in the reconstructed video. The main artifacts come from an inaccurate pixel-based texture mask. For example, in some frames of the TelevisionClip_360P-74dd sequence, the texture masks include parts of the moving objects in the foreground, which are then reconstructed using texture mode. Since the motion of the moving objects differs from the motion of the texture area, noticeable artifacts appear around those parts of the frame. To further improve the accuracy of region analysis using DNN-based pre-processing, we plan to incorporate an in-loop perceptual visual quality metric for optimization during texture analysis and reconstruction.
D. Discussion And Future Direction
We proposed a DNN-based texture analysis/synthesis coding tool for the AV1 codec. Experimental results show that the proposed method can achieve noticeable bit rate reduction with satisfactory visual quality on both standard test sets and user generated content, as verified by a subjective study. We envision that video coding driven by semantic understanding will continue to improve in terms of both quality and bit rate, especially by leveraging advances in deep learning. However, several open challenges remain and require further investigation.

Accuracy of region analysis is one of the major challenges for integrating semantic understanding into video coding, although recent advances in scene understanding have significantly improved region analysis performance. Visual artifacts are still noticeable when a non-texture region is incorrectly included in the texture mask, particularly if the analysis/synthesis coding system is open loop. One potential solution is to incorporate perceptual visual quality measures in-loop during the texture region reconstruction.

Video segmentation benchmark datasets are important for developing machine learning methods for video-based semantic understanding. Existing segmentation datasets are either based on images with texture [214], contain general video objects only [215], [216], or focus on visual quality but lack segmentation ground truth.

VI. CASE STUDY FOR CODING: END-TO-END NEURAL VIDEO CODING (E2E-NVC)
This section presents a framework for end-to-end neural video coding. We discuss its key components as well as its overall efficiency. The proposed method extends our pioneering work in [104], with significant performance improvements enabled by fully end-to-end learned spatio-temporal feature representation. More details can be found in [131], [136], [217].
A. Framework
As with all modern video encoders, the proposed E2E-NVC compresses the first frame in each group of pictures as an intra-frame using a VAE-based compression engine (neuro-Intra).
Fig. 6: End-to-End Neural Video Coding (E2E-NVC). (a) E2E-NVC consists of modularized intra and inter coding, where inter coding utilizes respective motion and residual coding. Each component is well exploited using a stacked-CNN-based VAE for efficient representations of intra pixels, displaced inter residuals, and inter motions. All modularized components are inter-connected and optimized in an end-to-end manner. (b) The general VAE model applies stacked convolutions (e.g., 5 × 5) with a main encoder-decoder pair ($E_m$, $D_m$) and a hyper encoder-decoder pair ($E_h$, $D_h$), where the main encoder $E_m$ includes four major convolutional layers (each with convolutional downsampling and three residual blocks), and the hyper decoder $D_h$ mirrors the steps in the hyper encoder $E_h$ for hyper prior generation. The prior aggregation (PA) engine collects information from hyper priors, autoregressive spatial neighbors, and temporal correspondences (if applicable) for the main decoder $D_m$ to reconstruct the input scene. Non-local attention is adopted at the bottlenecks to simulate saliency masking, and rectified linear units (ReLU) are embedded with the convolutions to enable nonlinearity. "Q" denotes quantization; AE and AD denote arithmetic encoding and decoding, respectively. 2↓ and 2↑ denote downsampling and upsampling by a factor of 2 in both horizontal and vertical dimensions.

It codes the remaining frames in each group using motion compensated prediction. As shown in Fig. 6a, the proposed E2E-NVC uses a VAE compressor (neuro-Motion) to generate the multiscale motion field between the current frame and the reference frame. Then, a multiscale motion compensation network (MS-MCN) takes the multiscale compressed flows, warps the multiscale features of the reference frame, and combines these warped features to generate the predicted frame. The prediction residual is then coded using another VAE-based compressor (neuro-Res).

A low-delay E2E-NVC based video encoder is specifically illustrated in this work. Given a group of pictures (GOP) $\mathbf{X} = \{X_1, X_2, \ldots, X_t\}$, we first encode $X_1$ using the neuro-Intra module to obtain its reconstructed frame $\hat{X}_1$. The following frame $X_2$ is encoded predictively, using neuro-Motion, MS-MCN, and neuro-Res together, as shown in Fig. 6a. Note that MS-MCN takes the multiscale optical flows $\{\vec{f}_d^{\,1}, \vec{f}_d^{\,2}, \ldots, \vec{f}_d^{\,s}\}$ derived by the pyramid decoder in neuro-Motion, and uses them to generate the predicted frame $\hat{X}_2^p$ by multiscale motion compensation. The displaced inter-residual $r_2 = X_2 - \hat{X}_2^p$ is then compressed by neuro-Res, yielding the reconstruction $\hat{r}_2$. The final reconstruction is given by $\hat{X}_2 = \hat{X}_2^p + \hat{r}_2$. All of the remaining P-frames in the group of pictures are then encoded using the same procedure.

Fig. 6b illustrates the general architecture of the VAE model. The VAE model includes a main encoder-decoder pair used for latent feature analysis and synthesis, as well as a hyper encoder-decoder pair for hyper prior generation. The main encoder $E_m$ uses four stacked CNN layers. Each convolutional layer employs stride convolutions to achieve downsampling (at a factor of 2 in this example) and cascaded convolutions for efficient feature extraction (here, we use three ResNet-based residual blocks [201]; we choose cascaded ResNets because they are highly efficient and reliable, though other efficient CNN architectures could also be applied). We use a two-layer hyper encoder $E_h$ to generate the subsequent hyper priors as side information, which is used in the entropy coding of the latent features.

We apply stacked convolutional layers with a limited (3 × 3) receptive field and allocate bits adaptively across the latent features (e.g., via unequal feature quantization) [140], [218]. This allows resources to be assigned such that salient areas are more accurately reconstructed, while resources are conserved in the reconstruction of less-salient areas.
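As a concrete reading of the Fig. 6b description, the following PyTorch sketch stacks four stride-2 convolutional stages with three residual blocks each for the main encoder, and two stride-2 layers for the hyper encoder. The 128-channel width, 5 × 5 kernels, and plain ReLU nonlinearities are illustrative assumptions rather than the exact E2E-NVC configuration; the NLAM bottlenecks and the decoders are omitted for brevity.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain two-conv residual block; the text uses three of these per stage."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class MainEncoder(nn.Module):
    """E_m: four stride-2 downsampling stages, each followed by residual blocks."""
    def __init__(self, in_ch: int = 3, ch: int = 128):
        super().__init__()
        layers, c = [], in_ch
        for _ in range(4):
            layers.append(nn.Conv2d(c, ch, 5, stride=2, padding=2))  # 2x downsampling
            layers.extend(ResidualBlock(ch) for _ in range(3))
            c = ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # latent features y (quantized before entropy coding)

class HyperEncoder(nn.Module):
    """E_h: two stride-2 layers mapping latents y to hyper priors z."""
    def __init__(self, ch: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2))

    def forward(self, y):
        return self.net(y)

# y = MainEncoder()(torch.randn(1, 3, 256, 256))  # -> (1, 128, 16, 16)
# z = HyperEncoder()(y)                           # -> (1, 128, 4, 4)
```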
To more accurately discern salient from non-salient areas, we adopt the non-local attention module (NLAM) at the bottleneck layers of both the main encoder and the hyper encoder, prior to quantization, in order to capture both global and local information.

To enable more accurate conditional probability density modeling for the entropy coding of the latent features, we introduce the prior aggregation (PA) engine, which fuses the inputs from the hyper priors, spatial neighbors, and temporal context (if applicable). Information theory suggests that more accurate context modeling requires fewer resources (e.g., bits) to represent information [219]. For simplicity, we assume the latent features (e.g., motion, image pixels, residuals) follow a Gaussian distribution, as in [148], [149]. We use the PA engine to derive the mean and standard deviation of the distribution for each feature element.

Fig. 7: Efficiency of neuro-Intra. PSNR vs. rate performance of neuro-Intra in comparison to NLAIC [136], Minnen (2018) [149], BPG (4:4:4), and JPEG2000. Note that the curves for neuro-Intra and NLAIC overlap.

B. Neural Intra Coding
Our neuro-Intra is a simplified version of the Non-Local Attention optimized Image Compression (NLAIC) originally proposed in [136]. One major difference between NLAIC and the VAE model using autoregressive spatial context in [149] is the introduction of the NLAM, inspired by [220]. In addition, we apply 3D 5 × 5 × 5 masked convolutions to extract spatial priors, which are fused with the hyper priors in PA for entropy context modeling (e.g., the bottom part of Fig. 9). Here, we assume a single Gaussian distribution for the context modeling of entropy coding. Note that temporal priors are not used for intra-pixel and inter-residual coding in this paper; only the spatial priors are utilized.

The original NLAIC applies multiple NLAMs in both the main and hyper coders, leading to excessive memory consumption at large spatial scales. In E2E-NVC, NLAMs are only used at the bottleneck layers of both the main and hyper encoder-decoder pairs, allowing bits to be allocated adaptively. Intra and residual coding only use joint spatial and hyper priors, without temporal inference.
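The masked convolution that extracts the autoregressive spatial priors can be sketched as follows. The causal raster-order mask shown here is one plausible reading of the 5 × 5 × 5 masked convolution in [136], not its exact implementation; the latent tensor (C, H, W) is treated as a one-channel 3D volume.

```python
import torch
import torch.nn as nn

class MaskedConv3d(nn.Conv3d):
    """Zero out weights at and after the center tap so each latent element is
    predicted only from previously decoded neighbors (raster scan order).
    Input shape: (N, 1, C, H, W), the latent volume as a single 3D channel."""
    def __init__(self, ch: int = 1, k: int = 5):
        super().__init__(ch, ch, k, padding=k // 2)
        mask = torch.ones_like(self.weight)  # (out, in, kD, kH, kW)
        c = k // 2
        mask[..., c, c, c:] = 0      # center row: current and future positions
        mask[..., c, c + 1:, :] = 0  # rows below the center in the center slice
        mask[..., c + 1:, :, :] = 0  # future slices
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # re-enforce causality at every call
        return super().forward(x)
```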
Fig. 8: Multiscale Motion Estimation and Compensation. One-stage neuro-Motion with MS-MCN uses a pyramidal flow decoder to synthesize the multiscale compressed optical flows (MCFs), which are used in a multiscale motion compensation network to generate predicted frames.

Fig. 9: Context-Adaptive Modeling Using Joint Spatio-temporal and Hyper Priors. All priors are fused in PA to provide estimates of the probability distribution parameters.

To overcome the non-differentiability of the quantization operation, quantization is usually simulated by adding uniform noise during training [142]. However, such noise augmentation is not exactly consistent with the rounding applied at inference, which can yield a performance loss (as reported in [135]). Thus, we apply universal quantization (UQ) [135] in neuro-Intra; UQ is used in neuro-Motion and neuro-Res as well. On the common Kodak dataset, neuro-Intra performs as well as NLAIC [136] and outperforms Minnen (2018) [149], BPG (4:4:4), and JPEG2000, as shown in Fig. 7.
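A minimal sketch of the two training-time quantization surrogates discussed above is given below. The shared-dither form of universal quantization follows the common formulation and is an assumption about details; the implementation in [135] may differ.

```python
import torch

def noise_quant(y: torch.Tensor) -> torch.Tensor:
    """Surrogate of [142]: add i.i.d. uniform noise in [-0.5, 0.5) during training."""
    return y + torch.rand_like(y) - 0.5

def universal_quant(y: torch.Tensor) -> torch.Tensor:
    """Universal quantization: round against a dither shared by encoder and
    decoder, so the same operation can be used in training and inference."""
    u = torch.rand_like(y) - 0.5  # in practice drawn from a seed known to both sides
    return torch.round(y + u) - u
```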
C. Neural Motion Coding and Compensation
Inter-frame coding plays a vital role in video coding. The key is how to efficiently represent motion in a compact format for compensation. In contrast to the pixel-domain block-based motion estimation and compensation in conventional video coding, we rely on optical flow to accurately capture the temporal information for motion compensation.

To improve inter-frame prediction, we extend our earlier work [131] to multiscale motion generation and compensation. This multiscale motion processing directly transforms two concatenated frames (one the reference frame from the past, the other the current frame) into quantized temporal features that represent the inter-frame motion. These quantized features are decoded into compressed optical flows in an unsupervised way for frame compensation via warping. This one-stage scheme does not require any pre-trained flow network, such as FlowNet2 or PWC-Net, to generate the optical flow explicitly. It allows us to quantize the motion features rather than the optical flows, and to train the motion feature encoder and decoder together with explicit consideration of quantization and rate constraints.

The neuro-Motion module is modified for multiscale motion generation, where the main encoder is used for feature fusion. We replace the main decoder with a pyramidal flow decoder, which generates the multiscale compressed optical flows (MCFs). The MCFs are processed together with the reference frame by the multiscale motion compensation network (MS-MCN) to obtain the predicted frame efficiently, as shown in Fig. 8. Please refer to [217] for more details.

Encoding motion compactly is another important factor for overall performance improvement. We suggest the joint spatio-temporal and hyper prior-based context-adaptive model shown in Fig. 9 for efficiently inferring the current quantized features. This is implemented in the PA engine of Fig. 6b.

The joint spatio-temporal and hyper prior-based context-adaptive model mainly consists of a spatio-temporal-hyper aggregation module (STHAM) and a temporal updating module (TUM), shown in Fig. 9. At timestamp $t$, STHAM is introduced to accumulate all the accessible priors and jointly estimate the mean and standard deviation of the Gaussian Mixture Model (GMM) using:

$(\mu_{F_i}, \sigma_{F_i}) = \mathcal{F}(F_0, \ldots, F_{i-1}, \hat{z}_t, h_{t-1}),$   (1)

where the spatial priors are autoregressively derived using masked 5 × 5 × 5 convolutions, $F_i$, $i = 0, 1, 2, \ldots$ are elements of the quantized latent features (e.g., motion flow), $\hat{z}_t$ denotes the hyper priors, and $h_{t-1}$ is the aggregated temporal prior from the motion flows preceding the current frame. The neuro-Motion module thereby exploits temporal redundancy to further improve prediction efficiency, leveraging the correlation between second-order moments of inter motion. A probabilistic model of each element to be encoded is derived from the estimated $\mu_{F_i}$ and $\sigma_{F_i}$ by:

$p_{F \mid (F_0, \ldots, F_{i-1}, \hat{z}_t, h_{t-1})}(F_i \mid F_0, \ldots, F_{i-1}, \hat{z}_t, h_{t-1}) = \prod_i \left( \mathcal{N}(\mu_{F_i}, \sigma_{F_i}) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) \right)(F_i).$   (2)

Note that TUM is applied to the current quantized features $F_t$ recurrently using a standard ConvLSTM [221]:

$(h_t, c_t) = \mathrm{ConvLSTM}(F_t, h_{t-1}, c_{t-1}),$   (3)

where $h_t$ is the updated temporal prior for the next frame and $c_t$ is a memory state that controls the information flow across multiple time instances (e.g., frames). Other recurrent units can also be used to capture the temporal correlations in (3).

It is worth noting that leveraging second-order information for compact motion representation is also widely explored in traditional video coding. For example, motion vector prediction from spatially and temporally co-located neighbors is standardized in H.265/HEVC, by which only motion vector differences (after prediction) are encoded.
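The following sketch is a schematic rendering of Eqs. (1) and (2): the priors are fused into per-element Gaussian parameters, and the likelihood of each quantized element is the Gaussian convolved with a unit-width uniform, i.e., a CDF difference. The fusion layers, tensor shapes, and the plain convolution standing in for the masked convolution are illustrative assumptions, and the priors are assumed to be resampled to a common spatial size.

```python
import torch
import torch.nn as nn

class STHAM(nn.Module):
    """Eq. (1): fuse spatial, hyper, and temporal priors into (mu, sigma)."""
    def __init__(self, ch: int):
        super().__init__()
        self.spatial = nn.Conv2d(ch, ch, 5, padding=2)  # stand-in for the masked conv
        self.fuse = nn.Conv2d(3 * ch, 2 * ch, 1)        # -> per-element (mu, log_sigma)

    def forward(self, f_hat, z_hat, h_prev):
        priors = torch.cat([self.spatial(f_hat), z_hat, h_prev], dim=1)
        mu, log_sigma = self.fuse(priors).chunk(2, dim=1)
        return mu, log_sigma.exp()

def likelihood(f, mu, sigma):
    """Eq. (2): (N(mu, sigma) * U(-1/2, 1/2))(f) = CDF(f + 1/2) - CDF(f - 1/2)."""
    gauss = torch.distributions.Normal(mu, sigma)
    return gauss.cdf(f + 0.5) - gauss.cdf(f - 0.5)

# Eq. (3), the TUM, then updates the temporal prior recurrently with any
# ConvLSTM cell implementation: (h_t, c_t) = ConvLSTM(F_t, h_{t-1}, c_{t-1}).
```

The negative log of the likelihood above is the bit estimate used by the rate term during training, which is why sharper (smaller-sigma) predictions translate directly into fewer coded bits.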
D. Neural Residual Coding

Inter-frame residual coding is another significant module contributing to the overall efficiency of the system. It compresses the temporal prediction error pixels, and it also affects the efficiency of next-frame prediction, since errors usually propagate temporally.

Here we use the VAE architecture in Fig. 6b to encode the residual $r_t$. The rate-constrained loss function is:

$L = \lambda \cdot D(X_t, (X_t^p + \hat{r}_t)) + R,$   (4)

where $D$ is the $\ell_2$ loss between the residual-compensated frame $X_t^p + \hat{r}_t$ and $X_t$. neuro-Res is first pretrained using the frames predicted by the pretrained neuro-Motion and MS-MCN, with the loss function in (4) where the rate $R$ only accounts for the residual bits. We then refine neuro-Res jointly with neuro-Motion and MS-MCN, using a loss where $R$ incorporates the bits for both motion and residual over two frames.
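Eq. (4) translates directly into a training objective. The sketch below assumes an MSE distortion for the $\ell_2$ term and that rate_bits is the entropy-model estimate of the coding cost; it is a schematic, not our exact training code.

```python
import torch
import torch.nn.functional as F

def rd_loss(x_t: torch.Tensor, x_pred: torch.Tensor, r_hat: torch.Tensor,
            rate_bits: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq. (4): L = lambda * D(X_t, X_t^p + r_hat) + R, with an l2 distortion.
    In pretraining, rate_bits covers only the residual; in joint refinement
    it covers motion and residual bits over two frames."""
    distortion = F.mse_loss(x_pred + r_hat, x_t)
    return lam * distortion + rate_bits
```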
E. Experimental Comparison

We applied the same low-delay coding settings as DVC [129] to our method, and compared against traditional H.264/AVC and H.265/HEVC. We encoded 100 frames with a GOP of 10 on the H.265/HEVC test sequences, and 600 frames with a GOP of 12 on the UVG dataset. For H.265/HEVC, we applied the fast mode of x265 (http://x265.org/), a popular open-source H.265/HEVC encoder implementation, while the fast mode of x264 was used as the representative H.264/AVC encoder.