Dynamically Expanded CNN Array for Video Coding
Everett Fall, Kai-Wei Chang, Liang-Gee Chen

Abstract
Video coding is a critical step in all popular methods of streaming video. Marked progress has been made in video quality, compression, and computational efficiency. Recently, there has been interest in applying techniques from the fast-progressing field of Machine Learning to further improve video coding. We present a method that uses convolutional neural networks to help refine the output of various standard coding methods. The novelty of our approach is to train multiple different sets of network parameters, with each set corresponding to a specific, short segment of video. The array of network parameter sets expands dynamically to match a video of any length. We show that our method can improve the quality and compression efficiency of standard video codecs.
1. Introduction
In recent years there have been many advances in video coding standards, and several prominent video codecs are used commercially, such as H.264/AVC (Schwarz et al., 2007), H.265/HEVC (Sullivan et al., 2012), and VP9. However, most of these traditional coding algorithms are block-based and therefore suffer from block artifacts. They are also largely hand-designed, making joint optimization difficult.

In Machine Learning (ML), the past decade has yielded vast improvements and altogether new techniques for processing images and video, such as convolutional neural networks (CNNs). CNNs offer an intuitive theoretical approach of encoding or embedding information as vectors that can be interpreted as high-level features. Despite the success of CNNs in many ML tasks, they have seen very limited success as a direct replacement for traditional commercial video codecs, partially because training ML models is computationally expensive and partially because existing methods such as H.265 or VP9 are already highly optimized, making improvement on benchmarks a challenging barrier to entry.

In this work we take a different approach. Instead of directly replacing traditional codecs by using CNNs to learn an end-to-end model, we apply ML as a post-processing step to refine the output. The novelty of our approach is to train a dynamically expanding array of many small CNNs, allowing each network to specialize in refining a relatively short segment of video. This refiner module switches between different networks as the video is decoded.

We evaluate our method on several commonly used commercial codecs and show that the refiner network makes a substantial improvement to quality with only a minimal increase in code size.
2. Related Work
Several deep learning-integrated video compression methods have been proposed to improve traditional coding. One approach is replacing or enhancing individual modules in traditional coding, especially in the state-of-the-art HEVC codec: for example, improving motion compensation and inter-prediction (Huo et al., 2018; Yan et al., 2019; Zhao et al., 2018), improving intra-prediction (Song et al., 2017; Cui et al., 2018; Pfaff et al., 2018), and replacing the in-loop filter (Park & Kim, 2016; Kang et al., 2017; Zhang et al., 2018; Jia et al., 2019). Another approach is to apply ML as a post-processing step to improve video quality: for example, (Li et al., 2017) proposed a dynamic-metadata post-processing scheme based on a CNN, while (Yang et al., 2017) and (Wang et al., 2017) proposed the Decoder-side Scalable Convolutional Neural Network (DS-CNN) and the Deep CNN-based Auto Decoder (DCAD), respectively, for video quality and efficiency enhancement.

Instead of using ML to form a hybrid video coding framework, some works propose an end-to-end framework for video compression whose performance can be on par with commercial codecs. (Wu et al., 2018) developed an end-to-end deep video codec that relies on repeatedly interpolating images in a hierarchical manner. Inheriting the conventional video coding structure, (Lu et al., 2018) employed multiple neural networks to constitute the different modules, which can be jointly optimized through a single loss function. (Rippel et al., 2018) proposed a learned end-to-end model for a low-latency mode with spatial rate control.
3. Dynamically Expanded CNN Array
We denote a video as a sequence of frames $X = x_1, x_2, \ldots$ where each $x_t \in \mathbb{R}^{W \times H \times 3}$ is an image with width $W$ and height $H$, with 3 color channels for each pixel. A video coding scheme provides two algorithms: one to transform the video into code, $E : X \mapsto \mathbb{R}^N$, and one that converts an encoded video back into a sequence of images, $D : \mathbb{R}^N \mapsto X$. Let $\hat{X}$ denote the sequence of frames generated by encoding and then decoding a video: $\hat{X} = D(E(X))$. The goal of video coding is to minimize the size of the encoded video and to design $E$ and $D$ to be as computationally efficient as possible. In the case of lossy compression coding, there is a pixel-wise error, known as the residual, associated with each decoded frame, $\Delta_t = \hat{x}_t - x_t$, which it is also desirable to minimize.

Our method is designed to complement an existing coding scheme, reducing the error in the decoded video by adding a refining function $x'_t = R(\hat{x}_t)$, with $R : \mathbb{R}^{W \times H \times 3} \mapsto \mathbb{R}^{W \times H \times 3}$, which yields $X' = x'_1, x'_2, \ldots$ when applied to each frame of $X$, as shown in Fig. 1. The existing coding scheme could be a standard commercial codec or a custom end-to-end CNN.

Let $s_{i,j} = x_i, x_{i+1}, \ldots, x_j$ denote a contiguous segment of $X$. The novelty of our method is to use a CNN to implement $R$ with parameters $\theta(t)$ that vary with time. Specifically, we partition $X$ into many small segments of duration $\rho$, namely $s_{1,\rho}, s_{\rho+1,2\rho}, \ldots$, and for each segment a corresponding set of network parameters is learned for $R$, which we denote $\theta_{i,j}$. Intuitively, this can be thought of as an array of CNNs that expands dynamically as needed to refine a video of any length.

Learning $\theta_{i,j}$ is accomplished through standard stochastic gradient descent. A training example consists of $(x = \hat{x}_t,\ y = x_t)$, where $x$ is sampled from $X$ in the range $i \le t \le j$. Intuitively, the network learns to predict $x_t$ from $\hat{x}_t$. Alternatively, the refining function can learn to predict the residual (denoted $\Delta'_t = R_\Delta(\hat{x}_t)$), in which case refining involves removing the predicted residual from the signal. In this case the training example consists of $(x = \hat{x}_t,\ y = \Delta_t)$ and the output is obtained by $x'_t = \hat{x}_t - R_\Delta(\hat{x}_t)$.

Another commonly desired characteristic of an encoding scheme is for the code to have a localized temporal correspondence to the video. This allows a small segment of the video, known as a random access segment, to be decoded without requiring the entire code, which is useful for data transmission applications such as video streaming. Since each segment $s_{i,j}$ is relatively short, our method can also support random access by transmitting $\theta_{i,j}$ in advance of the corresponding random access segment of the code.
Figure 1. The video $X$ is encoded and decoded by some coding scheme to produce $\hat{X}$. The correct parameters for the refiner network are selected according to the current frame being processed. The refiner produces $X'$ from $\hat{X}$.
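To make the method concrete, the following is a minimal sketch in PyTorch of the expanding CNN array, per-segment training, and decode-side switching. Frames are assumed to be available as (T, 3, H, W) tensors; the RefinerCNN layer widths, the Adam optimizer, the learning rate, and the minibatch size are illustrative assumptions rather than values prescribed by the paper.

import torch
import torch.nn as nn


class RefinerCNN(nn.Module):
    # Small CNN implementing the refining function R: x_hat_t -> x'_t.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x_hat: torch.Tensor) -> torch.Tensor:
        return self.net(x_hat)


def train_cnn_array(decoded, original, rho=50, steps=500, lr=1e-3,
                    predict_residual=False):
    # Learn one parameter set theta_{i,j} per rho-frame segment.
    # decoded/original: (T, 3, H, W) tensors holding x_hat_t and x_t.
    # Returns the dynamically expanded array as a list of state dicts.
    params = []
    for start in range(0, decoded.shape[0], rho):
        seg_hat = decoded[start:start + rho]
        seg_gt = original[start:start + rho]
        # Target is the residual Delta_t = x_hat_t - x_t in residual mode,
        # otherwise the original frame x_t.
        target = seg_hat - seg_gt if predict_residual else seg_gt
        net = RefinerCNN()
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(steps):
            idx = torch.randint(0, seg_hat.shape[0], (4,))  # frames i <= t <= j
            loss = torch.mean((net(seg_hat[idx]) - target[idx]) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
        params.append(net.state_dict())
    return params


def refine_frame(x_hat_t, t, params, rho=50, predict_residual=False):
    # Decode-side switch: select theta for the segment containing frame t.
    net = RefinerCNN()
    net.load_state_dict(params[t // rho])
    with torch.no_grad():
        out = net(x_hat_t.unsqueeze(0)).squeeze(0)
    # In residual mode the predicted residual is removed from the signal.
    return x_hat_t - out if predict_residual else out

In this sketch the array is simply the list params; random access is supported by transmitting params[k] ahead of the code for segment k.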
4. Experimental Evaluation
In this section we present the results of initial experiments as a proof of concept for our method. We implemented our proposed method and applied it to the benchmark dataset "Big Buck Bunny". We used the H.264 codec with a high CRF value (low quality and high compression ratio) to create the input to the refiner network. The refiner contains 4 convolutional layers with 5x5 filters followed by 3 convolutional layers with 3x3 filters, and applies to segments of size $\rho = 50$ frames. The network is given approximately 500 training steps, which corresponds to 10 epochs of the (very small) dataset for each 50-frame segment. The before-and-after result of a typical frame is shown in Fig. 2. The quality of the refined image was MS-SSIM: 0.9802, PSNR: 36.96 (the original H.264 CRF-36 output was MS-SSIM: 0.9589, PSNR: 33.82).
Figure 2. Left: The refiner input (output of the H.264 codec using CRF 36). Middle: Output of the refiner. Right: Ground truth.
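As a rough guide to reproducing this experiment, the sketch below builds the refiner as described (four convolutional layers with 5x5 filters followed by three with 3x3 filters) and computes PSNR, one of the two reported metrics. The channel width of 32 and the ReLU activations are assumptions; the paper does not report them.

import math
import torch
import torch.nn as nn


def build_experiment_refiner(width: int = 32) -> nn.Sequential:
    # Four 5x5 conv layers followed by three 3x3 conv layers (Section 4).
    layers = []
    in_ch = 3
    for _ in range(4):
        layers += [nn.Conv2d(in_ch, width, kernel_size=5, padding=2),
                   nn.ReLU(inplace=True)]
        in_ch = width
    for i in range(3):
        out_ch = 3 if i == 2 else width  # final layer maps back to RGB
        layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
        if i < 2:
            layers.append(nn.ReLU(inplace=True))
        in_ch = out_ch
    return nn.Sequential(*layers)


def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> float:
    # Peak signal-to-noise ratio for frames with values in [0, max_val].
    mse = torch.mean((x - y) ** 2).item()
    return 10.0 * math.log10(max_val ** 2 / mse)

Padding is chosen to preserve spatial resolution, so the refiner can be applied to full decoded frames of any size.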
5. Conclusion
In this work we introduced a novel method for video coding which uses an array of CNNs to refine each frame. We described the algorithms used to segment the video, train a CNN for each segment, and automatically switch between networks during the decoding process. We implemented our proposed algorithm and conducted several experiments to evaluate its performance against standard benchmark coding schemes. Our method was able to provide a substantial improvement to quality and reduce the compressed video size.
References
Cui, W., Zhang, T., Zhang, S., Jiang, F., Zuo, W., and Zhao, D. Convolutional neural networks based intra prediction for HEVC. CoRR, abs/1808.05734, 2018.

Huo, S., Liu, D., Wu, F., and Li, H. Convolutional neural network-based motion compensation refinement for video coding. In IEEE International Symposium on Circuits and Systems, ISCAS 2018, 27-30 May 2018, Florence, Italy, pp. 1-4, 2018. doi: 10.1109/ISCAS.2018.8351609.

Jia, C., Wang, S., Zhang, X., Wang, S., Liu, J., Pu, S., and Ma, S. Content-aware convolutional neural network for in-loop filtering in high efficiency video coding. IEEE Transactions on Image Processing, pp. 1-1, 2019. ISSN 1057-7149. doi: 10.1109/TIP.2019.2896489.

Kang, J., Kim, S., and Lee, K. M. Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec. In IEEE International Conference on Image Processing, ICIP 2017, pp. 26-30, 2017. doi: 10.1109/ICIP.2017.8296236.

Li, C., Song, L., Xie, R., and Zhang, W. CNN based post-processing to improve HEVC. In IEEE International Conference on Image Processing, ICIP 2017, pp. 4577-4580, 2017. doi: 10.1109/ICIP.2017.8297149.

Lu, G., Ouyang, W., Xu, D., Zhang, X., Cai, C., and Gao, Z. DVC: an end-to-end deep video compression framework. CoRR, abs/1812.00101, 2018.

Park, W. and Kim, M. CNN-based in-loop filtering for coding efficiency improvement. In IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop, IVMSP 2016, Bordeaux, France, July 11-12, 2016, pp. 1-5, 2016. doi: 10.1109/IVMSPW.2016.7528223.

Pfaff, J., Helle, P., Maniry, D., Kaltenstadler, S., Samek, W., Schwarz, H., Marpe, D., and Wiegand, T. Neural network based intra prediction for video coding, 2018.

Rippel, O., Nair, S., Lew, C., Branson, S., Anderson, A. G., and Bourdev, L. D. Learned video compression. CoRR, abs/1811.06981, 2018.

Schwarz, H., Marpe, D., and Wiegand, T. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Trans. Circuits Syst. Video Techn., 17(9):1103-1120, 2007. doi: 10.1109/TCSVT.2007.905532.

Song, R., Liu, D., Li, H., and Wu, F. Neural network-based arithmetic coding of intra prediction modes in HEVC. In IEEE Visual Communications and Image Processing, VCIP 2017, pp. 1-4, 2017. doi: 10.1109/VCIP.2017.8305104.

Sullivan, G. J., Ohm, J., Han, W., and Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Techn., 22(12):1649-1668, 2012. doi: 10.1109/TCSVT.2012.2221191.

Wang, T., Chen, M., and Chao, H. A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC. In Data Compression Conference, DCC 2017, pp. 410-419, 2017. doi: 10.1109/DCC.2017.42.

Wu, C., Singhal, N., and Krähenbühl, P. Video compression through image interpolation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pp. 425-440, 2018. doi: 10.1007/978-3-030-01237-3.
Yan, N., Liu, D., Li, H., Li, B., Li, L., and Wu, F. Convolutional neural network-based fractional-pixel motion compensation. IEEE Trans. Circuits Syst. Video Techn., 29(3):840-853, 2019. doi: 10.1109/TCSVT.2018.2816932.

Yang, R., Xu, M., and Wang, Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In IEEE International Conference on Multimedia and Expo, ICME 2017, pp. 817-822, 2017. doi: 10.1109/ICME.2017.8019299.

Zhang, Y., Shen, T., Ji, X., Zhang, Y., Xiong, R., and Dai, Q. Residual highway convolutional neural networks for in-loop filtering in HEVC. IEEE Trans. Image Processing, 27(8):3827-3841, 2018. doi: 10.1109/TIP.2018.2815841.

Zhao, L., Wang, S., Zhang, X., Wang, S., Ma, S., and Gao, W. Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding. In IEEE International Conference on Image Processing, ICIP 2018, Athens, Greece, October 7-10, 2018.