Dynamically Expanded CNN Array for Video Coding
Everett Fall, Kai-Wei Chang, Liang-Gee Chen

Abstract
Video coding is a critical step in all popular methods of streaming video. Marked progress has been made in video quality, compression, and computational efficiency. Recently, there has been interest in applying techniques from the fast-progressing field of Machine Learning to further improve video coding. We present a method that uses convolutional neural networks to help refine the output of various standard coding methods. The novelty of our approach is to train multiple different sets of network parameters, with each set corresponding to a specific, short segment of video. The array of network parameter sets expands dynamically to match a video of any length. We show that our method can improve the quality and compression efficiency of standard video codecs.
1. Introduction
In recent years there have been many advances in video coding standards, and several prominent video codecs are used commercially, such as H.264/AVC (Schwarz et al., 2007), H.265/HEVC (Sullivan et al., 2012), and VP9. However, most of these traditional coding algorithms are block-based and therefore suffer from block artifacts. They are also largely hand-designed, making joint optimization difficult.

In Machine Learning (ML), the past decade has yielded vast improvements and altogether new techniques for processing images and video, such as convolutional neural networks (CNNs). CNNs offer an intuitive theoretical approach of encoding or embedding information as vectors that can be interpreted as high-level features. Despite the success of CNNs in many ML tasks, they have seen very limited success as a direct replacement for traditional commercial video codecs, partially because training ML models is computationally expensive and partially because existing methods such as H.265 or VP9 are already highly optimized, making improvement on benchmarks a challenging barrier to entry.

In this work we take a different approach. Instead of directly replacing traditional codecs by using CNNs to learn an end-to-end model, we apply ML as a post-processing step to refine the output. The novelty of our approach is to train a dynamically expanding array of many small CNNs, allowing each network to specialize in refining a relatively short segment of video. This refiner module switches between different networks as the video is decoded.

We evaluate our method on several commonly used commercial codecs and show that the refiner network makes a substantial improvement to quality with only a minimal increase in code size.
2. Related Work
Several deep learning-integrated video compression methods have been proposed to improve traditional coding. One approach is replacing or enhancing individual modules in traditional coding, especially in the state-of-the-art HEVC codec: for example, improving motion compensation and inter-prediction (Huo et al., 2018; Yan et al., 2019; Zhao et al., 2018), improving intra-prediction (Song et al., 2017; Cui et al., 2018; Pfaff et al., 2018), and replacing the in-loop filter (Park & Kim, 2016; Kang et al., 2017; Zhang et al., 2018; Jia et al., 2019). Another approach is to apply ML as a post-processing step to improve video quality: for example, (Li et al., 2017) proposed a dynamic-metadata post-processing scheme based on a CNN, while (Yang et al., 2017) and (Wang et al., 2017) proposed the Decoder-side Scalable Convolutional Neural Network (DS-CNN) and the Deep CNN-based Auto Decoder (DCAD), respectively, for video quality and efficiency enhancement.

Instead of using ML to form a hybrid video coding framework, some works propose an end-to-end framework for video compression whose performance can be on par with commercial codecs. (Wu et al., 2018) developed an end-to-end deep video codec that relies on repeatedly interpolating images in a hierarchical manner. Inheriting the conventional video coding structure, (Lu et al., 2018) employed multiple neural networks to constitute the different modules, which can be jointly optimized through a single loss function. (Rippel et al., 2018) proposed a learned end-to-end model for a low-latency mode with spatial rate control.
3. Dynamically Expanded CNN Array
We denote a video as a sequence of frames $X = x_1, x_2, \ldots$ where each $x_t \in \mathbb{R}^{W \times H \times 3}$ is an image with width $W$ and height $H$, with 3 color channels for each pixel. A video coding scheme provides two algorithms: one to transform the video into code, $E : X \mapsto \mathbb{R}^N$, and one that converts an encoded video back into a sequence of images, $D : \mathbb{R}^N \mapsto X$. Let $\hat{X}$ denote the sequence of frames generated by encoding and then decoding a video: $\hat{X} = D(E(X))$. The goal of video coding is to minimize the size of the encoded video and to design $E$ and $D$ to be as computationally efficient as possible. In the case of lossy compression coding, there is a pixel-wise error, known as the residual, associated with each decoded frame, $\Delta_t = \hat{x}_t - x_t$, which it is also desirable to minimize.

Our method is designed to complement an existing coding scheme, reducing the error in the decoded video by adding a refining function $x'_t = R(\hat{x}_t)$, with $R : \mathbb{R}^{W \times H \times 3} \mapsto \mathbb{R}^{W \times H \times 3}$, which yields $X' = x'_1, x'_2, \ldots$ when applied to each frame of $X$, as shown in Fig. 1. The existing coding scheme could be a standard commercial codec or a custom end-to-end CNN.

Let $s_{i,j} = x_i, x_{i+1}, \ldots, x_j$ denote a contiguous segment of $X$. The novelty of our method is to use a CNN to implement $R$ with parameters $\theta(t)$ that vary with time. Specifically, we partition $X$ into many small segments of duration $\rho$, namely $s_{1,\rho}, s_{\rho+1,2\rho}, \ldots$, and for each segment a corresponding set of network parameters is learned for $R$, which we denote $\theta_{i,j}$. Intuitively, this can be thought of as an array of CNNs that expands dynamically as needed to refine a video of any length.

Learning $\theta_{i,j}$ is accomplished through standard stochastic gradient descent. A training example consists of $(x = \hat{x}_t,\ y = x_t)$, where $x$ is sampled from $X$ in the range $i \le t \le j$. Intuitively, the network learns to predict $x_t$ from $\hat{x}_t$. Alternatively, the refining function can learn to predict the residual (denoted $\Delta'_t = R_\Delta(\hat{x}_t)$), in which case refining involves removing the predicted residual from the signal. In this case the training example consists of $(x = \hat{x}_t,\ y = \Delta_t)$ and the output is obtained by $x'_t = \hat{x}_t - R_\Delta(\hat{x}_t)$.

Another commonly desired characteristic of an encoding scheme is for the code to have a localized temporal correspondence to the video. This allows a small segment of the video, known as a random access segment, to be decoded without requiring the entire code, which is useful for data transmission applications such as video streaming. Since each segment $s_{i,j}$ is relatively short, our method can also support random access by transmitting $\theta_{i,j}$ in advance of the corresponding random access segment of the code.
Figure 1. The video $X$ is encoded and decoded by some coding scheme to produce $\hat{X}$. The correct parameters for the refiner network are selected according to the current frame being processed. The refiner produces $X'$ from $\hat{X}$.
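To make the method concrete, the following is a minimal sketch in PyTorch of the expanding CNN array, per-segment training, and decode-side switching. Frames are assumed to be available as (T, 3, H, W) tensors; the RefinerCNN layer widths, the Adam optimizer, the learning rate, and the minibatch size are illustrative assumptions rather than values prescribed by the paper.

import torch
import torch.nn as nn


class RefinerCNN(nn.Module):
    # Small CNN implementing the refining function R: x_hat_t -> x'_t.
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x_hat: torch.Tensor) -> torch.Tensor:
        return self.net(x_hat)


def train_cnn_array(decoded, original, rho=50, steps=500, lr=1e-3,
                    predict_residual=False):
    # Learn one parameter set theta_{i,j} per rho-frame segment.
    # decoded/original: (T, 3, H, W) tensors holding x_hat_t and x_t.
    # Returns the dynamically expanded array as a list of state dicts.
    params = []
    for start in range(0, decoded.shape[0], rho):
        seg_hat = decoded[start:start + rho]
        seg_gt = original[start:start + rho]
        # Target is the residual Delta_t = x_hat_t - x_t in residual mode,
        # otherwise the original frame x_t.
        target = seg_hat - seg_gt if predict_residual else seg_gt
        net = RefinerCNN()
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(steps):
            idx = torch.randint(0, seg_hat.shape[0], (4,))  # frames i <= t <= j
            loss = torch.mean((net(seg_hat[idx]) - target[idx]) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
        params.append(net.state_dict())
    return params


def refine_frame(x_hat_t, t, params, rho=50, predict_residual=False):
    # Decode-side switch: select theta for the segment containing frame t.
    net = RefinerCNN()
    net.load_state_dict(params[t // rho])
    with torch.no_grad():
        out = net(x_hat_t.unsqueeze(0)).squeeze(0)
    # In residual mode the predicted residual is removed from the signal.
    return x_hat_t - out if predict_residual else out

In this sketch the array is simply the list params; random access is supported by transmitting params[k] ahead of the code for segment k.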
4. Experimental Evaluation
In this section we present the results of initial experiments as a proof of concept for our method. We implemented our proposed method and applied it to the benchmark dataset "Big Buck Bunny". We used the H.264 codec with a high CRF value (low quality and high compression ratio) to create the input to the refiner network. The refiner contains 4 convolutional layers with 5x5 filters followed by 3 convolutional layers with 3x3 filters, and applies to segments of size $\rho = 50$ frames. The network is given approximately 500 training steps, which corresponds to 10 epochs of the (very small) dataset for each 50-frame segment. The before-and-after result of a typical frame is shown in Fig. 2. The quality of the refined image was MS-SSIM: 0.9802, PSNR: 36.96 (the original H.264 CRF-36 output was MS-SSIM: 0.9589, PSNR: 33.82).
Figure 2. Left: The refiner input (output of the H.264 codec using CRF 36). Middle: Output of the refiner. Right: Ground truth.
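As a rough guide to reproducing this experiment, the sketch below builds the refiner as described (four convolutional layers with 5x5 filters followed by three with 3x3 filters) and computes PSNR, one of the two reported metrics. The channel width of 32 and the ReLU activations are assumptions; the paper does not report them.

import math
import torch
import torch.nn as nn


def build_experiment_refiner(width: int = 32) -> nn.Sequential:
    # Four 5x5 conv layers followed by three 3x3 conv layers (Section 4).
    layers = []
    in_ch = 3
    for _ in range(4):
        layers += [nn.Conv2d(in_ch, width, kernel_size=5, padding=2),
                   nn.ReLU(inplace=True)]
        in_ch = width
    for i in range(3):
        out_ch = 3 if i == 2 else width  # final layer maps back to RGB
        layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
        if i < 2:
            layers.append(nn.ReLU(inplace=True))
        in_ch = out_ch
    return nn.Sequential(*layers)


def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> float:
    # Peak signal-to-noise ratio for frames with values in [0, max_val].
    mse = torch.mean((x - y) ** 2).item()
    return 10.0 * math.log10(max_val ** 2 / mse)

Padding is chosen to preserve spatial resolution, so the refiner can be applied to full decoded frames of any size.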
5. Conclusion
In this work we introduced a novel method for video coding which uses an array of CNNs to refine each frame. We described the algorithms used to segment the video, train a CNN for each segment, and automatically switch between networks during the decoding process. We implemented our proposed algorithm and conducted several experiments to evaluate its performance against standard benchmark coding schemes. Our method was able to provide a substantial improvement to quality and reduce the compressed video size.
References
Cui, W., Zhang, T., Zhang, S., Jiang, F., Zuo, W., and Zhao, D. Convolutional neural networks based intra prediction for HEVC. CoRR, abs/1808.05734, 2018.

Huo, S., Liu, D., Wu, F., and Li, H. Convolutional neural network-based motion compensation refinement for video coding. In IEEE International Symposium on Circuits and Systems, ISCAS 2018, 27-30 May 2018, Florence, Italy, pp. 1-4, 2018. doi: 10.1109/ISCAS.2018.8351609.

Jia, C., Wang, S., Zhang, X., Wang, S., Liu, J., Pu, S., and Ma, S. Content-aware convolutional neural network for in-loop filtering in high efficiency video coding. IEEE Transactions on Image Processing, pp. 1-1, 2019. ISSN 1057-7149. doi: 10.1109/TIP.2019.2896489.

Kang, J., Kim, S., and Lee, K. M. Multi-modal/multi-scale convolutional neural network based in-loop filter design for next generation video codec. In IEEE International Conference on Image Processing, ICIP 2017, pp. 26-30, 2017. doi: 10.1109/ICIP.2017.8296236.

Li, C., Song, L., Xie, R., and Zhang, W. CNN based post-processing to improve HEVC. In IEEE International Conference on Image Processing, ICIP 2017, pp. 4577-4580, 2017. doi: 10.1109/ICIP.2017.8297149.

Lu, G., Ouyang, W., Xu, D., Zhang, X., Cai, C., and Gao, Z. DVC: an end-to-end deep video compression framework. CoRR, abs/1812.00101, 2018.

Park, W. and Kim, M. CNN-based in-loop filtering for coding efficiency improvement. In IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop, IVMSP 2016, Bordeaux, France, July 11-12, 2016, pp. 1-5, 2016. doi: 10.1109/IVMSPW.2016.7528223.

Pfaff, J., Helle, P., Maniry, D., Kaltenstadler, S., Samek, W., Schwarz, H., Marpe, D., and Wiegand, T. Neural network based intra prediction for video coding, 2018.

Rippel, O., Nair, S., Lew, C., Branson, S., Anderson, A. G., and Bourdev, L. D. Learned video compression. CoRR, abs/1811.06981, 2018.

Schwarz, H., Marpe, D., and Wiegand, T. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Trans. Circuits Syst. Video Techn., 17(9):1103-1120, 2007. doi: 10.1109/TCSVT.2007.905532.

Song, R., Liu, D., Li, H., and Wu, F. Neural network-based arithmetic coding of intra prediction modes in HEVC. In IEEE Visual Communications and Image Processing, VCIP 2017, pp. 1-4, 2017. doi: 10.1109/VCIP.2017.8305104.

Sullivan, G. J., Ohm, J., Han, W., and Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Techn., 22(12):1649-1668, 2012. doi: 10.1109/TCSVT.2012.2221191.

Wang, T., Chen, M., and Chao, H. A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC. In Data Compression Conference, DCC 2017, pp. 410-419, 2017. doi: 10.1109/DCC.2017.42.

Wu, C., Singhal, N., and Krähenbühl, P. Video compression through image interpolation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pp. 425-440, 2018. doi: 10.1007/978-3-030-01237-3.
Yan, N., Liu, D., Li, H., Li, B., Li, L., and Wu, F. Convolutional neural network-based fractional-pixel motion compensation. IEEE Trans. Circuits Syst. Video Techn., 29(3):840-853, 2019. doi: 10.1109/TCSVT.2018.2816932.

Yang, R., Xu, M., and Wang, Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In IEEE International Conference on Multimedia and Expo, ICME 2017, pp. 817-822, 2017. doi: 10.1109/ICME.2017.8019299.

Zhang, Y., Shen, T., Ji, X., Zhang, Y., Xiong, R., and Dai, Q. Residual highway convolutional neural networks for in-loop filtering in HEVC. IEEE Trans. Image Processing, 27(8):3827-3841, 2018. doi: 10.1109/TIP.2018.2815841.

Zhao, L., Wang, S., Zhang, X., Wang, S., Ma, S., and Gao, W. Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding. In IEEE International Conference on Image Processing, ICIP 2018, Athens, Greece, October 7-10, 2018.