Editorial: Introduction to the Issue on Deep Learning for Image/Video Restoration and Compression

Abstract

Recent works have shown that learned models can achieve significant performance gains, especially in terms of perceptual quality measures, over traditional methods. Hence, the state of the art in image restoration and compression is getting redefined. This special issue covers the state of the art in learned image/video restoration and compression to promote further progress in innovative architectures and training methods for effective and efficient networks for image/video restoration and compression.


IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 15, NO. 2, FEBRUARY 2021

Introduction to the Issue on Deep Learning for Image/Video Restoration and Compression

I. INTRODUCTION

The huge success of deep-learning-based approaches in computer vision has inspired research in learned solutions to classic image/video processing problems, such as denoising, deblurring, dehazing, deraining, super-resolution (SR), and compression. Hence, learning-based methods have emerged as a promising nonlinear signal-processing framework for image/video restoration and compression.

Recent works have shown that learned models can achieve significant performance gains, especially in terms of perceptual quality measures, over traditional methods. Hence, the state of the art in image restoration and compression is getting redefined. This special issue covers the state of the art in learned image/video restoration and compression to promote further progress in innovative architectures and training methods for effective and efficient networks for image/video restoration and compression.

In the following, we provide a short overview of the state of the art in learned image and video processing in Section II. Section III introduces the articles in this issue. Finally, we provide the outlook for future directions in Section IV.

II. OVERVIEW OF THE STATE OF THE ART

A. Image/Video Restoration and Super-resolution

Many researchers reported results that exceed the state of the art in image/video restoration and SR by a wide margin via supervised learning using pairs of ground-truth (GT) images/video and degraded or low-resolution (LR) images/video generated by known degradation models, such as bicubic downsampling. However, there is a need for further research and room for improvement in at least three key areas: generalization of these results to real-world problems, efficiency of the solutions, and perceptual optimization of the results.

Most existing image restoration/SR methods assume a predefined degradation process from a GT image/video to a degraded/LR one, which can hardly hold true in real imaging with complex degradation types. To fill this gap, growing attention has been paid in recent years to approaches for unknown degradations, namely real-world SR or blind SR. We can roughly divide these methods into four groups. The first group of methods utilizes an external dataset to learn an SR model well adapted to a large set of downsampling kernels, such as IKC [1], SRMD [2], or USRNet [3]. Another group of methods leverages the internal statistics within a single image derived from the degradation model, thus requiring no external dataset for training, like ZSSR [4] and DGDMLSR [5]. The third group resorts to implicit modeling, which defines the degradation process implicitly through a data distribution [6], [7], [8]. In particular, these methods utilize data-distribution learning with Generative Adversarial Networks (GANs) [9] to grasp the implicit degradation model contained within a dataset, like WESPE [6], FSSR [10] and CinCGAN [11]. The last group directly builds real image datasets with input-output pairs for specific applications, such as DPED [12], RealSR [13], Zurich RAW-to-RGB [14] and DRealSR [15]. These new datasets make it possible to take advantage of existing supervised-learning methods in real-world applications.

For real-world applications, besides dealing with data captured in uncontrolled or challenging conditions, the restoration/SR solutions need to be run-time, memory and energy efficient and to run on constrained hardware [16]. In a pioneering work, Ronneberger et al. [17] introduced U-Net, a widely adopted efficient neural design for image-to-image mapping. Since then, tremendous progress has been achieved in Neural Architecture Search (NAS) [18]. However, while very efficient architectures have been optimized for tasks such as image classification (MobileNetV3 [19]), solutions are still being sought for image restoration/SR tasks, as shown in the AIM 2020 challenge on efficient SR [20].

Another active area of research is perceptual image restoration and SR. Variations of the GAN architecture have been proposed for various low-level vision tasks to obtain perceptually better results with more texture details. In a pioneering work, Ledig et al. [21] proposed the SRGAN model, which could generate photo-realistic images in SR tasks. Ignatov et al. [12], [6] proposed to use perceptual losses and GANs to learn from paired or unpaired images to enhance the images from a smartphone camera to a DSLR target camera quality. In [22], the authors made the observation that there is a trade-off between fidelity (measured by full-reference metrics) and perception (measured by no-reference metrics). In the PIRM 2018-SR Challenge [23], ESRGAN [24] achieved state-of-the-art performance by improving the network architecture for the generator and the loss functions. Benefiting from a learnable ranker, RankSRGAN [25] can optimize the generative network in the direction of any image quality assessment (IQA) metric and achieves state-of-the-art performance. Although remarkable progress has been made, Gu et al. [26] reveal that existing IQA methods cannot objectively evaluate perceptual SR methods. In the newly proposed IQA dataset, there is still a large gap between IQA methods and human labels.
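To make the notion of a "known degradation model" used for generating LR training pairs more concrete, the following is a minimal sketch of the classical blur, downsample, and add-noise pipeline. It is an illustration under stated assumptions, not the procedure of any cited method: the function names `gaussian_kernel` and `degrade`, the Gaussian blur kernel (used here in place of bicubic downsampling), the replicate border handling, and all parameter defaults are hypothetical choices. Plain Python lists are used to keep the sketch free of dependencies.

```python
import math
import random


def gaussian_kernel(size=7, sigma=1.2):
    """Return a normalized size x size Gaussian blur kernel (assumed kernel shape)."""
    center = (size - 1) / 2.0
    g = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(g) ** 2  # sum of the outer product g g^T
    return [[gi * gj / total for gj in g] for gi in g]


def degrade(hr, kernel, scale=2, noise_sigma=0.0, seed=None):
    """Synthesize an LR image from a grayscale HR image with values in [0, 1].

    Steps: blur with `kernel`, keep every `scale`-th pixel, add Gaussian noise.
    Borders are handled by replicating the nearest edge pixel.
    """
    h, w = len(hr), len(hr[0])
    radius = len(kernel) // 2
    rng = random.Random(seed)
    lr = []
    for y in range(0, h, scale):
        row = []
        for x in range(0, w, scale):
            acc = 0.0
            for i, kernel_row in enumerate(kernel):
                for j, k_val in enumerate(kernel_row):
                    yy = min(max(y + i - radius, 0), h - 1)
                    xx = min(max(x + j - radius, 0), w - 1)
                    acc += k_val * hr[yy][xx]
            if noise_sigma > 0:
                acc += rng.gauss(0.0, noise_sigma)
            row.append(min(max(acc, 0.0), 1.0))  # clip to the valid range
        lr.append(row)
    return lr


if __name__ == "__main__":
    hr_image = [[(x + y) / 30.0 for x in range(16)] for y in range(16)]
    lr_image = degrade(hr_image, gaussian_kernel(), scale=2, noise_sigma=0.01, seed=0)
    print(len(lr_image), len(lr_image[0]))
```

A model trained only on pairs produced this way sees a single, fixed kernel and noise level, which is one way to read the generalization gap to real-world images discussed above; blind SR methods aim to cope with kernels and noise that are not known in advance.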

B. Image/Video Compression

Much of the early work in applying learned models to compression focused on image compression, starting with approaches that solely learned non-linear transformations of image inputs without learning corresponding probability models [27]. Subsequent, more effective approaches jointly learned models of non-linear auto-encoders with models of the latent-variable probability distributions [28], [29]. The model coupling was done by minimizing the Lagrangian formulation of the rate-distortion loss, using the learned probability model for the rate estimation and the decoded reconstruction from the scalar-quantized latent variables for the distortion. Adding side information (“hyper priors”) to allow the probability models themselves to adapt locally resulted in learned models that exceeded the performance of traditional image encoders (e.g., BPG) [30]. Extensions to the adaptive probability modeling include additional layers of side information about the probability models for the hyper-priors themselves, as well as autoregressive context models [31].

Much of the previous work in learned video compression addressed the problem of replacing parts of standard compression systems (e.g., HEVC) with learned components [32], [33], [34], [35], [36], [37], [38]. End-to-end optimized fully learned models have also shown promise. Some learned models based on uni-directional motion compensation (low-latency) [39], [40] have outperformed H.264 in PSNR and HEVC in MS-SSIM. The best performance to date in fully learned low-latency video compression uses a learned scale-space motion-compensation model [41]. Recently, end-to-end optimized learned models based on bi-directional motion compensation have also shown competitive performance [42], [43].
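The Lagrangian rate-distortion loss used to couple the auto-encoder with the learned probability model, as described for image compression earlier in this subsection, can be written in one common notation as follows. The symbols are assumptions introduced here for illustration and do not appear in the text: $g_a$ and $g_s$ denote the analysis (encoder) and synthesis (decoder) transforms, $Q$ scalar quantization, $p_{\hat{y}}$ the learned probability model of the quantized latents, $d$ a distortion measure such as mean squared error, and $\lambda$ the trade-off weight.

```latex
\begin{align}
  \hat{y} &= Q\big(g_a(x)\big), \qquad \hat{x} = g_s(\hat{y}), \\
  \mathcal{L} &= \underbrace{\mathbb{E}_{x}\big[-\log_2 p_{\hat{y}}(\hat{y})\big]}_{\text{rate } R}
      \;+\; \lambda\, \underbrace{\mathbb{E}_{x}\big[d(x, \hat{x})\big]}_{\text{distortion } D}.
\end{align}
```

With side information $\hat{z}$ (a hyper-prior), the rate term would, under the same assumed notation, become $\mathbb{E}_{x}\big[-\log_2 p_{\hat{y}\mid\hat{z}}(\hat{y}\mid\hat{z}) - \log_2 p_{\hat{z}}(\hat{z})\big]$, so that the bits spent on $\hat{z}$ are counted alongside the bits saved by the locally adapted model of $\hat{y}$.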

C. Point Cloud Denoising and Compression

III. OVERVIEW OF THE ARTICLES

This special issue consists of 20 papers on recent advances in deep learning for image/video restoration and compression: 13 papers on image/video restoration and super-resolution, 5 papers on image/video compression, and 2 papers on point cloud processing. We provide a short introduction to these papers in the following.

A. Image/Video Restoration and Super-resolution

The paper “Degradation aware approach to image restoration using knowledge distillation” is the first journal paper on the application of knowledge distillation to image restoration. The authors present a new approach to handle image-specific and spatially-varying degradations that occur in practice, such as rain-streaks, haze, raindrops, and motion blur. They decompose the restoration task into two stages of degradation localization and degraded region-guided restoration, unlike existing methods that directly learn a mapping between the degraded and clean images.

In “Color image restoration exploiting inter-channel correlation with a 3-stage CNN,” Cui et al. propose a 3-stage CNN for color image restoration tasks. In this framework, first the green component is reconstructed, followed by the red and blue channels with parallel networks, then all the intermediate reconstructions are concatenated to generate the final result. This method is successful in three typical color image restoration tasks: color-image demosaicking, color compression artifact reduction, and real-world color image denoising.

Another image restoration work, “A deep primal-dual proximal network for image restoration,” borrows ideas from image classification tasks and proposes a primal-dual proximal network. Specifically, it reformulates a specific instance of the primal-dual hybrid gradient (PDHG) algorithm as a deep network with fixed layers. Each layer corresponds to one iteration of the primal-dual algorithm. Two learning strategies, full learning and partial learning, are also proposed for better optimization. The proposed DeepPDNet shows excellent performance on several benchmark datasets for image restoration and super-resolution.

The paper “Semi-supervised landmark-guided restoration of atmospheric turbulent images” considers restoration of atmospheric turbulent (AT) images. As there is no paired training dataset for AT images, especially with faces, this work proposes a semi-supervised method for jointly extracting facial landmarks and restoring degraded images. The proposed approach learns to generate AT images by combining the content from a clean image and turbulence information from AT images in an unpaired manner. It adopts heatmaps from the landmark localization network as an additional prior. Experiments demonstrate the effectiveness of the proposed network on both AT image restoration and landmark localization.

In the rain-removal task, Kui et al. (“Multi-level memory compensation network for rain removal via divide-and-conquer strategy”) leverage the divide-and-conquer strategy by decomposing the learning task into several subproblems according to levels of texture richness. The method produces a high-quality rain-free image by subtracting the predicted rain information from multiple subnetworks. Each subnetwork processes a specific sub-sampled image, sampled from the original rainy ones via the Gaussian kernel. Experiments show that the proposed MLMCN outperforms existing deraining methods on several benchmark datasets and on the high-level object detection task.

Also, Yasarla et al. (“Exploring Overcomplete Representations for Single Image Deraining using CNNs”) propose a deraining solution called Over-and-Under Complete Deraining Network (OUCD). OUCD consists of two branches: one employing an overcomplete convolutional network architecture for learning local structures by restraining the receptive field of filters, and another one employing U-Net targeting global structures. The solution significantly improves over the state of the art on synthetic and real benchmarks.

Ning et al. (“Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network”) propose an explainable approach toward SISR named model-guided deep unfolding network (MoG-DUN). MoG-DUN unfolds the iterative process of model-based SISR into a multi-stage concatenation of building blocks with three interconnected trainable modules (denoising, nonlocal-AR, and reconstruction). Experiments show improvements over existing model-based methods.

In “Multi-scale image super-resolution via a single extendable deep network,” Zhang et al. propose a solution (MSWSR) addressing efficiency and arbitrary upscaling factors. MSWSR implements multi-scale SR simultaneously by learning multi-level wavelet coefficients of the target image. Structurally, MSWSR is composed of one CNN part for low frequencies and one extendable RNN part for high frequencies and multi-scale SR. A side window kernel is proposed for efficiency.

In “WDN: A Wide and Deep Network to Divide-and-Conquer Image Super-resolution”, Singh and Mittal propose to divide the SISR problem into multiple sub-problems and then solve/conquer them within a neural network design. Their network architecture is much wider and deeper than existing networks and employs a new technique to calibrate the intensities of feature map pixels. The advantages are demonstrated through extensive experiments.

In “Multi-Grid Back-Projection Networks,” Navarrete Michelini et al. demonstrate the power of the Multi-Grid Back-Projection (MGBP) network architecture on image and video super-resolution tasks with fidelity and/or perceptual quality targets. MGBP combines a novel cross-scale residual block inspired by the iterative back-projection (IBP) algorithm and a multi-grid recursion strategy inspired by multi-grid PDE solvers to scale computational complexity efficiently with increasing output resolutions.

In “LSTM-DNN Based Autoencoder Network for Nonlinear Hyperspectral Image Unmixing”, Zhao et al. address the problem of blind hyperspectral unmixing by proposing a non-symmetric autoencoder network to fully exploit the spectral and spatial correlation information. An LSTM captures spectral correlation information, a spatial regularization improves the spatial continuity of results, and an attention mechanism further enhances the unmixing performance. The effectiveness of the proposed method is validated on synthetic and real data.

“Uncertainty-Aware Semantic Guidance and Evaluation for Image Inpainting” addresses the problem of filling in missing irregularly shaped areas of an image, a problem that arises in practice when trying to recover an image that has an overlay (e.g., super-imposed text) or a foreground object that is being synthetically removed, or when trying to create a different viewpoint of a scene (in newly dis-occluded areas). The approach taken is to iteratively evaluate inpainted visual contents as well as a structural segmentation mask. The approach surpasses other state-of-the-art approaches in terms of clear boundaries and photo-realistic textures.

The paper “Deep energy: Task driven training of deep neural networks” offers an unsupervised training approach using task-specific energy functions, where the proposed solution is better than the one obtained by a direct minimization of the energy function due to the added regularization property of deep neural networks.

B. Image/Video Compression

Breakthroughs in modeling latent-variable probability distributions jointly with parameterized non-linear transformations [28], [29] were what was needed to allow learning-based approaches to image compression to quickly surpass the performance achieved by more traditional models. “Nonlinear transform coding” is the first journal paper with comprehensive coverage of latent-variable RD-curve optimization for nonlinear-transform coding.

The next two papers focus on different approaches to intra-frame block compression. “Intra-frame coding using a conditional autoencoder” introduces an auto-encoder approach to mode selection for predicting intra-frame image/video blocks. The learned latent-space variable is itself the prediction-function index (replacing the mode index used in classic intra-frame block coding), and the context pixels condition both parts of the auto-encoder architecture. Cross-channel prediction is provided between the luma and chroma encodings, to avoid the need to separately send the latent variables for the chroma channels. The results improve the Bjøntegaard delta rate (BD-rate) for both luma and chroma channels compared to the previous state of the art. The second of these papers, “Attention-based neural networks for chroma intra prediction in video coding”, also looks at intra-frame chroma prediction but does so with a very different approach. In this paper, a purely convolutional network is used with an attention layer [46] to cross-index between the (known) chroma boundary pixels and the (previously decoded) luma pixels within the block. Since the network is purely convolutional, it is able to handle all block sizes (4x4, 8x8, and 16x16), reducing both the space required for the models and the average computation used across the video duration.

The next paper, “MFRNet: A new CNN architecture for post-processing and in-loop filtering,” also looks at leveraging convolutional neural networks within the framework of a traditional video compressor, but this time for in-loop and post-processing filtering. The paper introduces a new neural-network architecture that allows reuse of early-layer representations throughout the remaining layers of the network. The results show significant PSNR gains for both in-loop and post-processing.


Finally, “Learning for video compression with recurrent auto-encoder and recurrent probability model” presents a fully learning-based approach to video compression that outperforms the default-speed setting for x265, using recurrent probability models for the latent variables of the recurrent auto-encoder network that is used to encode the motion-compensated video frames.

C. Point Cloud Denoising and Compression

Finally, there are two papers on deep learning for point cloud processing, one on denoising and one on compression.

The paper entitled “Learning robust graph-convolutional representations for point cloud denoising” proposes a deep learning method that can simultaneously denoise a point cloud and remove outliers in a single model. The core of the proposed method is a graph-convolutional neural network able to efficiently deal with the irregular domain and the permutation invariance problem typical of point clouds.

The paper entitled “Adaptive deep learning-based point cloud geometry coding” is the first journal paper on point cloud compression. It presents a novel deep learning-based solution for point cloud geometry coding that divides the point cloud into 3D blocks and selects the most suitable available deep learning coding model to code each block, thus maximizing the compression performance.

IV. OUTLOOK

Many compelling future research challenges remain to be addressed. These include: i) learned models contain millions of parameters, which makes real-time inference on common devices a challenge; ii) it is difficult to interpret learned models or to provide performance bounds on results; iii) it is important to provide perceptual loss functions for training that accurately reflect human preferences; iv) the performance of learned models trained on synthetically generated data drops sharply on real-world images/video, where the quantity and quality of training data is limited; and v) exploiting temporal correlations for efficient and effective video restoration and compression is challenging.

We hope that this special issue broadly summarizes the current state of the art in learned methods for image/video restoration and compression, and inspires researchers to work on numerous future directions calling for deeper investigation.

REFERENCES

[1] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with iterative kernel correction,” in CVPR, 2019.
[2] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018.
[3] K. Zhang, L. Van Gool, and R. Timofte, “Deep Unfolding Network for Image Super-Resolution,” in Proc. CVPR, pp. 3217-3226, 2020.
[4] A. Shocher, N. Cohen, and M. Irani, “‘Zero-shot’ super-resolution using deep internal learning,” in CVPR, 2018.
[5] X. Cheng, Z. Fu, and J. Yang, “Zero-shot image super-resolution with depth guided internal degradation learning,” in ECCV, 2020.
[6] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool, “WESPE: Weakly Supervised Photo Enhancer for Digital Cameras,” in Proc. CVPRW, pp. 691-700, 2018.
[7] A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool, “Night-to-day image translation for retrieval-based localization,” in Int. Conf. on Robotics and Automation (ICRA), pp. 5958-5964, 2019.
[8] A. Lugmayr, M. Danelljan, and R. Timofte, “Unsupervised learning for real-world super-resolution,” in Proc. ICCVW, pp. 3408-3416, 2019.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NeurIPS, 2014.
[10] M. Fritsche, S. Gu, and R. Timofte, “Frequency separation for real-world super-resolution,” in ICCV Workshop, 2019.
[11] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, “Unsupervised image super resolution using cycle-in-cycle generative adversarial networks,” in IEEE/CVF Conf. on CVPR Workshop, 2018.
[12] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool, “DSLR-Quality Photos on Mobile Devices With Deep Convolutional Networks,” in Proc. ICCV, pp. 3277-3285, 2017.
[13] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang, “Toward real-world single image super-resolution: A new benchmark and a new model,” in IEEE/CVF Conf. on CVPR, 2019.
[14] A. Ignatov, L. Van Gool, and R. Timofte, “Replacing Mobile Camera ISP With a Single Deep Learning Model,” in Proc. CVPRW, 2020.
[15] P. Wei, Z. Xie, H. Lu, Z. Zhan, Q. Ye, W. Zuo, and L. Lin, “Component Divide-and-Conquer for Real-World Image Super-Resolution,” in European Conf. Comp. Vision (ECCV), 2020.
[16] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “AI benchmark: Running deep neural networks on Android smartphones,” in Proc. ECCV Workshops, 2018.
[17] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Int. Conf. on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241, 2015.
[18] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” J. Mach. Learn. Res., 2019.
[19] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, “Searching for MobileNetV3,” in Proc. ICCV, pp. 1314-1324, 2019.
[20] K. Zhang, M. Danelljan, Y. Li, R. Timofte, et al., “AIM 2020 Challenge on Efficient Super-Resolution: Methods and Results,” in Proc. ECCV Workshops, 2020.
[21] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE/CVF Conf. on CVPR, 2017.
[22] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in IEEE/CVF Conf. on CVPR, 2018.
[23] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The PIRM challenge on perceptual image super-resolution,” in European Conf. Comp. Vision (ECCV), 2018.
[24] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, “ESRGAN: Enhanced super-resolution generative adversarial networks,” in ECCV Workshops, 2018.
[25] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “RankSRGAN: Generative adversarial networks with ranker for image super-resolution,” in Int. Conf. Comp. Vision (ICCV), 2019.
[26] J. Gu, H. Cai, H. Chen, X. Ye, J. Ren, and C. Dong, “PIPAL: a Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration,” in European Conf. Comp. Vision (ECCV), 2020.
[27] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Variable rate image compression with recurrent neural networks,” in Proc. Int. Conf. on Learning Representations, 2016.
[28] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. on Learning Representations, 2017.
[29] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. Int. Conf. on Learning Representations, 2017.
[30] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in ICLR, 2018.
[31] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in NeurIPS, 2018.
[32] X. Zhao, J. Chen, A. Said, V. Seregin, H. E. Egilmez, and M. Karczewicz, “NSST: Non-separable secondary transforms for next generation video coding,” in IEEE Picture Coding Symposium (PCS), 2016.
[33] Y. Dai, D. Liu, and F. Wu, “A convolutional neural network approach for post-processing in HEVC intra coding,” in Int. Conf. on Multimedia Modeling, Springer, pp. 28-39, 2017.
[34] Y. Li, L. Li, Z. Li, J. Yang, N. Xu, D. Liu, and H. Li, “A hybrid neural network for chroma intra prediction,” in IEEE Int. Conf. on Image Processing (ICIP), pp. 1797-1801, 2018.
[35] D. Wang, S. Xia, W. Yang, Y. Hu, and J. Liu, “Partition tree guided progressive rethinking network for in-loop filtering of HEVC,” in IEEE Int. Conf. on Image Processing (ICIP), pp. 2671-2675, 2019.
[36] P. Helle, J. Pfaff, M. Schafer, R. Rischke, H. Schwarz, D. Marpe, and T. Wiegand, “Intra picture prediction for video coding with neural networks,” in IEEE Data Compression Conference (DCC), 2019.
[37] M. Gorriz, S. Blasi, A. F. Smeaton, N. E. O’Connor, and M. Mrak, “Chroma intra prediction with attention-based CNN architectures,” in Proc. IEEE ICIP, 2020.
[38] L. Murn, S. Blasi, A. F. Smeaton, N. E. O’Connor, and M. Mrak, “Interpreting CNN for low complexity learned sub-pixel motion compensation in video coding,” arXiv preprint arXiv:2006.06392, 2020.
[39] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 11006-11015, 2019.
[40] O. Rippel, S. Nair, C. Lew, S. Branson, A. G. Anderson, and L. Bourdev, “Learned video compression,” in Proc. ICCV, 2019.
[41] E. Agustsson, D. Minnen, N. Johnston, J. Ballé, S. J. Hwang, and G. Toderici, “Scale-space flow for end-to-end optimized video compression,” in Proc. CVPR, 2020.
[42] R. Yang, F. Mentzer, L. Van Gool, and R. Timofte, “Learning for video compression with hierarchical quality and recurrent enhancement,” in IEEE/CVF Conf. on CVPR, 2020.
[43] M. A. Yılmaz and A. M. Tekalp, “End-to-end rate-distortion optimization for bi-directional learned video compression,” in IEEE Int. Conf. on Image Processing, Abu Dhabi, UAE, Nov. 2020.
[44] X.-F. Han, J. S. Jin, M.-J. Wang, W. Jiang, L. Gao, and L. Xiao, “A review of algorithms for filtering the 3D point cloud,” Signal Processing: Image Communication, vol. 57, pp. 103-112, 2017.
[45] D. Graziosi, O. Nakagami, S. Kuma, A. Zaghetto, T. Suzuki, and A. Tabatabai, “An overview of ongoing point cloud compression standardization activities: Video-based (V-PCC) and geometry-based (G-PCC),” APSIPA Trans. on Signal and Information Processing, 9, E13, 2020. doi:10.1017/ATSIP.2020.12
[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv:1706.03762, 2017.