Light Field Compression by Residual CNN Assisted JPEG
Eisa Hedayati
Michigan Technological University

Timothy C. Havens
Michigan Technological University

Jeremy P. Bos
Michigan Technological University
October 2, 2020

Abstract
Light field (LF) imaging has gained significant attention due to its recent success in 3-dimensional (3D) display and rendering, as well as in augmented and virtual reality applications. Nonetheless, because of the two extra dimensions, LFs are much larger than conventional images. We develop a JPEG-assisted, learning-based technique to reconstruct an LF from a JPEG bitstream with a bit per pixel ratio of 0.0047 on average. For compression, we keep the LF's center view and compress it with JPEG at 50% quality. Our reconstruction pipeline consists of a small JPEG enhancement network (JPEG-Hance) and a depth estimation network (Depth-Net), followed by view synthesis that warps the enhanced center view. Our pipeline is significantly faster than using video compression on pseudo-sequences extracted from an LF, both in compression and decompression, while maintaining effective performance. We show that, at 1% of the compression time and with an 18x speedup in decompression, the LFs reconstructed by our method have a better structural similarity index metric (SSIM) and a comparable peak signal-to-noise ratio (PSNR) compared to the state-of-the-art video compression techniques used to compress LFs.
Light fields (LFs) have two extra dimensions compared to conventional images, which represent the angular information of the scene. Hence, LFs contain a large volume of data, which makes storage and transfer time consuming and costly. Decompressing LF video with a high angular resolution at an acceptable frame rate (fps) for streaming is also challenging. We aim to address these problems by predicting the entire LF from its JPEG compressed center view.

Direct application of standard image compression techniques, such as JPEG or PNG, to the LF does not take advantage of the redundancy between LF views. Better success has been achieved with video compression techniques, in which a sequence of images, called a pseudo-sequence, is built from the LF views [1]. A combination of machine learning (ML) methods capable of predicting LF views with video compression techniques was explored in [2]. In this work, we present a combination of JPEG compression with ML view prediction. LF synthesis techniques have shown that the entire LF can be estimated from a single view or a sparse set of views. Here, we show that there is enough information in the JPEG compressed center view, or in a small group of sub-aperture images (SAIs), to predict the entire LF with sufficient quality. We test the success of our method by comparing against state-of-the-art LF compression methods that use the existing HEVC codec.

Our method is faster in compression and decompression by 100x and 10x, respectively, compared to the direct use of HEVC. This speedup means that a set of 30 LFs with a spatial resolution of (375 x 540) and an angular resolution of (7 x 7) can be decompressed on a typical gaming GPU in less than 0.02 seconds, while HEVC-based methods take more than 0.39 seconds. A modest increase in the spatial or angular dimensions easily pushes HEVC-based methods past one second, making streaming infeasible. While speeding up the process, we have maintained and, in most cases, improved the quality of reconstruction at the same bit per pixel (bpp) ratio. We use the mean peak signal-to-noise ratio (MPSNR) over all of the views and the mean structural similarity index metric (MSSIM) to compare the reconstructed LFs of our model with those that use HEVC. We show that while the MPSNR of our method is comparable to the direct employment of HEVC, our model achieves a higher MSSIM. This results in fewer artifacts and better quality in the extracted synthetic-aperture depth of field (DoF) images.

The contributions of this paper are as follows. We achieve a compression speedup of more than 100x and a decompression speedup of more than 10x compared to applying HEVC to a pseudo-sequence of LF views. At an average bpp of 0.0047, the DoF images extracted from LFs reconstructed with our method have improved SSIM on average compared to the direct use of HEVC. Finally, we introduce a small, fast, and efficient convolutional neural network (CNN) for enhancing JPEG images; this network also boosts the SSIM of the final decompressed LF.

Linear view synthesis by Levin and Durand [3] and depth of field extension and super-resolution by Bishop and Favaro [4] are among the earliest works on LF view synthesis and reconstruction. Flynn et al. [5] proposed a deep learning method to predict novel views from a sequence of images with wide baselines. LF view synthesis became more popular after Kalantari et al. [6] showed that an LF can be synthesized from its corner SAIs with high quality. Building on the work of Kalantari et al., Yeung et al. used different sets of views to reconstruct dense LFs [7]. Srinivasan et al. demonstrated the possibility of estimating the entire LF from its center view by extrapolation with machine learning methods [8]. Choi et al. extended the extrapolation to LFs captured with arrays of cameras [9]. LF fusion [10] and depth-guided techniques [11] have been popular for reconstructing an LF from a single SAI or a sparse set of SAIs. Hu et al. [12] aimed for a faster LF reconstruction method using hierarchical feature fusion. The backbone of nearly every view synthesis method enumerated here is depth-map estimation. The current work is categorized as single-view LF reconstruction. Our method is unique in that we estimate the entire LF from a lossy JPEG compressed view. We use residual learning methods to predict the likely artifacts in the JPEG compressed center view, assisting the main network in accurately estimating the depth map.
Lossless and lossy compression methods have been investigated extensively in the literature. For the lossless case, Perra [13] proposed an adaptive block differential prediction method, and Helin et al. [14] described sparse modeling with predictive coding for the SAIs of the LF. The lossy models can be classified into two sub-categories: standardized image/video compression techniques and machine-learning-assisted compression techniques.
Standardized image and video compression techniques (especially HEVC) have been directly applied to address the bulkiness of LFs; see, e.g., [1, 15, 16]. Other methods, such as homography-based low-rank models [20] and Fourier disparity layers [18], have been used to reduce the angular dimension of the LF. In another work, the LF was segmented by depth into 4D spatial-angular blocks used for prediction, and the residue was encoded with JPEG-2000 [19].
Following the breakthrough in synthesizing LF views from the four corner views with CNNs [6], another work introduced a compression technique that uses the same synthesis method and compresses the four corner views with HEVC [17]. In another work, the authors proposed keeping half of the views, encoding them with HEVC, and synthesizing the other half with a CNN [2]. A CNN-based epipolar plane image super-resolution algorithm has also been used in cooperation with HEVC to compress LFs [21]. Wang et al. proposed a new LF video compression technique that deploys view synthesis from multiple inputs while encoding the input views with a proposed region-of-interest scheme [22]. To the best of our knowledge, because view extrapolation is ill posed, LF reconstruction from a lossy compressed single input (specifically, JPEG) had not been explored before our work.
For several decades, researchers have addressed JPEG compression artifact reduction with three main groups of approaches: prior-knowledge-based, filter-based, and learning-based. Here, we are interested in the learning-based approaches. The basic intention of learning-based methods is to find a non-linear mapping from the JPEG compressed image (compressed at various compression ratios) to the ground truth uncompressed image. To the best of our knowledge, the first deep learning model to address this problem was created by Dong et al. [23], who showed that a reconstructed JPEG image can be enhanced by a relatively shallow CNN. Since then, multiple researchers have gradually improved the performance of learning-based methods by introducing new networks, such as dual-domain representations [24], deep dual-domain based fast restoration [25], encoder-decoder networks with symmetric skip connections [26], CAS-CNN [27], one-two-one networks [28], DMCNN [29], and the dual-stream multi-path recursive residual network [30]. While deeper networks and state-of-the-art architectures have improved JPEG artifact reduction, we are not focused solely on this task here. The ultimate goal of our JPEG-Hance network is to improve the depth map estimated from the JPEG compressed center image of an LF; JPEG artifact reduction is the natural first step toward extracting better depth maps.
Here we describe our compression and decompression pipelines. The compression pipeline is simply the extraction of the LF's center view, compression of that view by JPEG at 50% quality, and discarding of all other views. The decompression pipeline consists of the following steps, sketched in code after the list:

1. JPEG decompression of the center view $c_J$.
2. Enhancement of $c_J$ into $c_E$ by JPEG-Hance,
$$c_E = J(c_J). \tag{1}$$
3. Estimation of the depth map $d(\mathbf{x}, \mathbf{u})$ of every view $\mathbf{u}$ from $c_E$,
$$d = D(c_E). \tag{2}$$
4. Reconstruction of the LF by warping,
$$\hat{L}(\mathbf{x}, \mathbf{u})_{\mathbf{u}_0 \rightarrow \mathbf{u}} = L\big(\mathbf{x} + (\mathbf{u} - \mathbf{u}_0)\, d(\mathbf{x}, \mathbf{u}),\, \mathbf{u}_0\big), \tag{3}$$

where $\hat{L}$ is the approximated LF and $\mathbf{u}_0$ is the middle view index. Variables $\mathbf{x}$ and $\mathbf{u}$ are the spatial and angular indices, respectively.
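As a concrete illustration, the following is a minimal sketch of the whole codec, assuming an LF stored as a (7, 7, H, W, 3) uint8 array and placeholder `jpeg_hance` and `depth_net` callables standing in for our trained networks; the names and the (H, W, 49) depth-map layout are assumptions for this sketch, not fixed interfaces. The warp of Eq. (3) is shown with nearest-neighbor sampling for brevity; a bilinear sampler would be used in practice.

```python
import tensorflow as tf

def compress(lf):
    """Compression pipeline: keep only the center view, JPEG-encode it at 50%
    quality, and discard everything else. `lf` is a (7, 7, H, W, 3) uint8 array."""
    u0 = lf.shape[0] // 2
    return tf.io.encode_jpeg(tf.constant(lf[u0, u0]), quality=50).numpy()

def decompress(bitstream, jpeg_hance, depth_net, views=7):
    """Decompression pipeline, steps 1-4 (Eqs. 1-3)."""
    c_j = tf.cast(tf.io.decode_jpeg(bitstream), tf.float32) / 127.5 - 1.0  # step 1
    c_e = jpeg_hance(c_j[None])[0]                 # step 2: c_E = J(c_J), Eq. (1)
    d = depth_net(c_e[None])[0]                    # step 3: (H, W, views**2) maps, Eq. (2)
    h, w = c_e.shape[0], c_e.shape[1]
    ys, xs = tf.meshgrid(tf.range(h), tf.range(w), indexing='ij')
    ys, xs = tf.cast(ys, tf.float32), tf.cast(xs, tf.float32)
    u0, lf_hat = views // 2, []
    for i in range(views):
        for j in range(views):
            disp = d[..., i * views + j]
            # Step 4, Eq. (3): sample the center view at x + (u - u0) * d(x, u).
            sy = tf.clip_by_value(tf.cast(tf.round(ys + (i - u0) * disp), tf.int32), 0, h - 1)
            sx = tf.clip_by_value(tf.cast(tf.round(xs + (j - u0) * disp), tf.int32), 0, w - 1)
            lf_hat.append(tf.gather_nd(c_e, tf.stack([sy, sx], axis=-1)))
    return tf.reshape(tf.stack(lf_hat), (views, views, h, w, 3))
```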
The main goal of our JPEG-Hance network is to assist Depth-Net in providing a better depth map estimate. In doing so, it is certainly beneficial to improve the overall quality of the JPEG decompressed image by reducing the error between the uncompressed ground truth image and the lossy compressed one. However, the goal of our network is not general JPEG artifact reduction; instead, JPEG-Hance should learn to enhance the parts of the image that have the most effect on improving depth information extraction. To achieve this, JPEG-Hance also needs to find correspondence information from the extracted depth maps. Therefore, it is trained in two phases: first, it learns to enhance any typical JPEG decompressed image; then it is trained again as part of the whole depth estimation pipeline. The architecture of JPEG-Hance is shown in Fig. 1. Encouraged by ResNet50's bottleneck building block structure, we designed the JPEG-Hance residual blocks with a batch normalization (BN) layer after each convolution, followed by an exponential linear unit (ELU). ELUs followed by a final tanh layer appear to be the most promising pair of activation functions when regressing image data scaled to the interval [-1, 1]. JPEG-Hance is pre-trained by minimizing the mean squared error of each pixel value in the RGB channels; it is then added to the training pipeline for full reconstruction of LFs.

Multiple images provide geometric information that can be used for LF reconstruction; a single image does not. Such information must therefore be extracted by other means. Machine learning techniques, particularly CNNs, have shown promising potential for estimating geometry from a single image [8, 6, 7]. Thus, for the problem of depth estimation from our enhanced center image, we use a residual CNN.
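For reference, here is a minimal Keras sketch of the bottleneck residual block used in JPEG-Hance; the ResNet50-style 1x1/3x3/1x1 layout, the BN layers, and the ELUs come from the description above, while the filter counts are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters=32, expansion=4):
    """ResNet50-style bottleneck with BN + ELU after each convolution, as in the
    JPEG-Hance residual blocks. Filter counts are assumptions, not the paper's."""
    shortcut = x
    y = layers.Conv2D(filters, 1, padding='same')(x)               # 1x1 reduce
    y = layers.ELU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(filters, 3, padding='same')(y)               # 3x3 spatial
    y = layers.ELU()(layers.BatchNormalization()(y))
    y = layers.Conv2D(filters * expansion, 1, padding='same')(y)   # 1x1 expand
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters * expansion:                  # match channels
        shortcut = layers.Conv2D(filters * expansion, 1)(shortcut)
    return layers.ELU()(layers.Add()([shortcut, y]))
```

The Depth-Net variant described below swaps the first two BN layers for instance normalization, and the last JPEG-Hance layer is a tanh so that outputs match the [-1, 1] image range.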
Figure 1: JPEG-Hance detailed structure.
Figure 2: Depth-Net detailed structure.
Figure 3: Depth-Net residual blocks.

Our Depth-Net is responsible for estimating the corresponding depth (disparity) map for all 49 views from the middle JPEG compressed view. The structure of Depth-Net is depicted in Fig. 2. Depth-Net has three variants of residual blocks. The first variant is a down-sampler, which uses a 2D convolution with a stride of (2 x 2), halving the spatial dimensions of the input; this block is used just before the first Depth Residual Block and each time the feature size is increased. The second type, the Depth Residual Block, is the main residual block; it is used the most and extracts most of the features. Its structure mimics the bottleneck structure of ResNet50, with instance normalization added after each of the first two convolution layers. Last, the Upsampler block consists of a 2D deconvolution (transposed convolution) layer with a stride of (2 x 2) and two 2D convolution layers with kernel sizes of (3 x 3). These blocks are shown in Fig. 3.

Because we train Depth-Net on the actual LF data and not on ground truth depth maps, our loss functions must train the network in an unsupervised manner. We therefore define the Depth-Net pre-training loss as a weighted sum of four sub-functions: i) photometric loss $L_p$, ii) defocus loss $L_r$, iii) depth-consistency loss $L_c$, and iv) DoF loss $L_{dof}$:
$$L_{depth} = \alpha_1 L_p + \alpha_2 L_r + \alpha_3 L_c + \alpha_4 L_{dof}, \tag{4}$$
where the weights $\alpha_1, \ldots, \alpha_4$ were chosen empirically and found to work well overall in our experiments.

The image quality comparison sub-function $\psi$ [11] combines the mean absolute difference of pixels with the image structural dissimilarity (DSSIM), which is derived from the structural similarity index metric (SSIM) [31]:
$$\psi(I_1, I_2) = \beta\, \frac{1 - \mathrm{SSIM}(I_1, I_2)}{2} + (1 - \beta)\, \lVert I_1 - I_2 \rVert_1, \tag{5}$$
where $I_1, I_2$ are the two images being compared and $\beta \in (0, 1)$ is a weight that we tuned empirically. Using the sub-function $\psi$, the photometric loss is defined as [11]
$$L_p = \sum_{\mathbf{u}} \Big[ \psi\big(\hat{L}(\mathbf{x}, \mathbf{u})_{\mathbf{u}_0 \rightarrow \mathbf{u}},\, L(\mathbf{x}, \mathbf{u})\big) + \psi\big(\hat{L}(\mathbf{x}, \mathbf{u}_0)_{\mathbf{u} \rightarrow \mathbf{u}_0},\, L(\mathbf{x}, \mathbf{u}_0)\big) \Big]. \tag{6}$$

Because we are training an unsupervised Depth-Net, the more prior knowledge we can give the network, the better the training quality. Zhou et al. [11] introduced the defocus cue loss
$$L_r = \psi\Big( L(\mathbf{x}, \mathbf{u}_0),\; \frac{1}{N} \sum_{\mathbf{u}} \hat{L}(\mathbf{x}, \mathbf{u})_{\mathbf{u} \rightarrow \mathbf{u}_0} \Big). \tag{7}$$
Depth consistency (left-right or forward-backward) has also been shown in the literature [32, 33, 34] to be a promising regularizer for LF view synthesis, where
$$d_{\mathbf{u} \rightarrow \mathbf{u}_0}(\mathbf{x}) = d_{\mathbf{u}}\big(\mathbf{x} + (\mathbf{u} - \mathbf{u}_0)\, d(\mathbf{x}, \mathbf{u})\big), \tag{8a}$$
$$L_c = \sum_{\mathbf{u}} \lVert d_{\mathbf{u}_0}(\mathbf{x}) - d_{\mathbf{u} \rightarrow \mathbf{u}_0}(\mathbf{x}) \rVert_1. \tag{8b}$$
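As a concrete reference for Eq. (5), $\psi$ can be sketched with TensorFlow's built-in SSIM; the default value of beta below is a placeholder, since the tuned value is not given here. The final loss term follows.

```python
import tensorflow as tf

def psi(i1, i2, beta=0.8):
    """Image comparison sub-function of Eq. (5): a blend of DSSIM and the mean
    absolute difference. Inputs are (N, H, W, 3) batches scaled to [-1, 1];
    beta is a placeholder, not the paper's tuned value. Returns one score per image."""
    dssim = (1.0 - tf.image.ssim(i1, i2, max_val=2.0)) / 2.0
    l1 = tf.reduce_mean(tf.abs(i1 - i2), axis=[1, 2, 3])
    return beta * dssim + (1.0 - beta) * l1
```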
Finally, we include a depth of field (DoF) loss to further assist the network in learning depth information:
$$\mathrm{DoF}(\mathbf{x}) = \frac{1}{N} \sum_{\mathbf{u}} L(\mathbf{x}, \mathbf{u}), \tag{9a}$$
$$L_{dof} = \psi\big(\mathrm{DoF},\, \widehat{\mathrm{DoF}}\big), \tag{9b}$$
where $\widehat{\mathrm{DoF}}$ is computed as in (9a) from the views warped from the center view.
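Putting the pieces together, a simplified sketch of the total loss in Eq. (4) is given below. It treats the reconstructed views as the warped stack, uses the `psi` sketch above, folds Eq. (6) into a single direction for brevity, and leaves the alpha weights as placeholders, since their tuned values are elided here.

```python
import tensorflow as tf

def depth_training_loss(l_true, l_hat, center, d_center, d_warped, psi,
                        alphas=(1.0, 1.0, 0.5, 1.0)):
    """Eq. (4) as a weighted sum of the four sub-losses. l_true/l_hat are
    (N, H, W, 3) stacks of ground-truth and warped views, center is the
    (1, H, W, 3) center view, and d_center/d_warped are (N, H, W) depth maps
    for Eq. (8). The alpha weights are placeholders."""
    a1, a2, a3, a4 = alphas
    l_p = tf.reduce_sum(psi(l_hat, l_true))                   # photometric, Eq. (6)
    refocused = tf.reduce_mean(l_hat, axis=0, keepdims=True)  # mean of warped views
    l_r = psi(refocused, center)[0]                           # defocus cue, Eq. (7)
    l_c = tf.reduce_sum(tf.abs(d_center - d_warped))          # depth consistency, Eq. (8b)
    dof = tf.reduce_mean(l_true, axis=0, keepdims=True)       # DoF image, Eq. (9a)
    dof_hat = tf.reduce_mean(l_hat, axis=0, keepdims=True)
    l_dof = psi(dof_hat, dof)[0]                              # DoF loss, Eq. (9b)
    return a1 * l_p + a2 * l_r + a3 * l_c + a4 * l_dof
```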
In this section, we describe our method's implementation details. Then, we use public data sets [8, 6] to evaluate our method and investigate the impact of different parts of our network on the performance of our model.

We conducted our experiments on two public data sets: Flowers [8] and 30 Scenes [6], both captured with a Lytro Illum camera. The spatial resolution of the captures varies slightly, so each LF is cropped to 7 x 7 views at a fixed spatial size to provide consistent dimensions and remove vignetting.

Our pipeline is trained in multiple steps, and the input pipeline was one of the main variables during the training phase. We implemented our model with TensorFlow 2.2 in Python 3.7 on a workstation with an Intel Xeon W-2223 at 3.60 GHz, 64 GB of DDR4 memory, and an NVIDIA Quadro RTX 5000.
Our JPEG-Hance was pre-trained on the 30 Scenes training data set, which contains 100 scenes. The center views of these 100 scenes were extracted and used for training. In the training phase, the spatial input of JPEG-Hance is a fixed-size crop: a training pool is created by cropping the center views of the 100 scenes in 8-pixel steps, yielding 150,000 distinct crops, which we found sufficient to train JPEG-Hance without over-fitting or under-fitting. The learning rate was set to 0.0004. JPEG-Hance is a relatively small network with only 202,435 trainable parameters, and its pre-training phase takes about 90 minutes to converge.

Our Depth-Net also has a pre-training step, performed on the Flowers data set, which contains 3,343 flower scenes. During this phase, JPEG-Hance is used to enhance the center images while only the Depth-Net parameters are trained. The input pipeline applies random cropping and data augmentation: 50% of the samples are left unaltered, 15% receive a random contrast change, 15% receive a random brightness change, and the remaining 20% receive a random hue change (see the sketch below). For each epoch, 16 random crops were extracted, and the network was trained for 10 epochs with a learning rate of 0.0004. Depth-Net is the main network responsible for extracting the depth map and thus has far more trainable parameters, about 38.2 million; its pre-training phase takes about 7 hours to converge.
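A minimal sketch of this augmentation scheme follows; the selection rates come from the description above, while the numeric ranges are placeholders, since the exact values are elided in the text.

```python
import tensorflow as tf

def augment(image):
    """Apply one of four treatments with the stated selection rates:
    50% unaltered, 15% contrast, 15% brightness, 20% hue.
    All numeric ranges below are placeholder assumptions."""
    r = tf.random.uniform([])
    if r < 0.50:
        return image
    if r < 0.65:
        return tf.image.random_contrast(image, 0.5, 1.5)
    if r < 0.80:
        return tf.image.random_brightness(image, 0.2)
    return tf.image.random_hue(image, 0.1)
```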
After pre-training the two networks, we train the entire pipeline by adding the 100 scenes to the Flowers data set pool and using the same input pipeline as for Depth-Net pre-training. The entire pipeline is trained for 45 epochs, with the learning rate gradually decreasing from 0.0001 to 0.000001 over the last 5 epochs.

The last fine-tuning step trains the pipeline on the data sets at a larger input spatial size; here, the augmentation selection is 25% original, 25% random contrast, 25% random brightness, and 25% random hue. Because the Depth-Net input dimension is fixed by its structure, the input images have to be zero-padded and the resulting LFs cropped back to the correct size. The fine-tuning phase takes 40 epochs to converge, with the learning rate gradually decaying from 0.00005 to 0.000001; it took around 20 hours, while all of the pre-training phases together took less than 10 hours.

Figure 4: Top: the MPSNR of each LF reconstructed by our method and by HEVC. Bottom: the MSSIM of each reconstructed LF.

We compared our compression-decompression results with a pseudo-sequence method using the HEVC video codec, choosing a raster scan order over a spiral one because raster had slightly better performance. The 30 Scenes data set is used for comparing our method with HEVC. We use the MSSIM and MPSNR metrics, as well as the SSIM and PSNR of the DoF images extracted from the LFs, to compare the results, where
$$\mathrm{MSSIM} = \frac{1}{M} \sum_{\mathbf{u}} \mathrm{SSIM}\big(LF_{\mathbf{u}}, \widetilde{LF}_{\mathbf{u}}\big), \tag{10}$$
$$\mathrm{MPSNR} = \frac{1}{M} \sum_{\mathbf{u}} \mathrm{PSNR}\big(LF_{\mathbf{u}}, \widetilde{LF}_{\mathbf{u}}\big), \tag{11}$$
and $M$ is the number of views. To have a fair comparison, we tuned the QP factor of HEVC for each LF to reach approximately the same bpp for the HEVC compressed LF and our method's compressed representation. The average bpp for both methods on the 30 Scenes data is 0.0047. Fig. 4 shows that the LFs reconstructed by our method have very similar MPSNR and MSSIM to those decompressed by HEVC; looking closely at Fig. 4, our method is slightly inferior in MPSNR but outperforms HEVC in MSSIM. Furthermore, Fig. 5 shows the PSNR and SSIM of the DoF images extracted from the reconstructed LFs. Here, our method meaningfully outperforms HEVC in SSIM while further closing the gap in PSNR. It is worth noting that, for quality assessment, the SSIM metric is more reliable than PSNR.

As shown in Table 1, the compression time of our proposed method is more than 100 times shorter than that of HEVC on the same computational hardware. This is because our compression pipeline is simply a JPEG encode of a fraction of the LF (1/49 in our case, with 49 views), whereas the HEVC algorithm processes all of the views.
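For reference, the per-LF metrics of Eqs. (10)-(11) can be computed as in the following minimal sketch, assuming the views of each LF are stacked along the batch axis.

```python
import tensorflow as tf

def mpsnr_mssim(lf_true, lf_rec, max_val=1.0):
    """Mean PSNR and SSIM over all M views (Eqs. 10-11). lf_true and lf_rec
    are (M, H, W, 3) stacks of ground-truth and reconstructed views."""
    mssim = tf.reduce_mean(tf.image.ssim(lf_true, lf_rec, max_val=max_val))
    mpsnr = tf.reduce_mean(tf.image.psnr(lf_true, lf_rec, max_val=max_val))
    return float(mpsnr), float(mssim)
```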
Figure 5: The DoF images extracted from the ground truth LFs and the reconstructed LFs, compared using the SSIM and PSNR metrics. The top plot shows PSNR and the bottom plot shows SSIM for each LF in the 30 Scenes data set.

Table 1: Time taken to compress all 30 LFs in the 30 Scenes data set using our method and HEVC; our method is more than 102 times faster.

Method                    CUDA    Compression time (s)
HEVC                      No      43.53
JPEG-Hance + Depth-Net    No      -

For decompression, our method is 18 times faster than HEVC; see Table 2. To give a fair comparison, we used the NVIDIA-optimized HEVC codec, which uses the GPU's video decoder; on the same hardware, this is the fastest available implementation of HEVC. Overall, these results indicate that our model is suitable for compressing light field videos with high angular resolution, because the LF can be decompressed in near real time inside the GPU without the barrier of transferring high volumes of data from host to GPU: the host-to-GPU bandwidth used equals the size of only the center view of the LF.
We have designed a machine-learning-assisted LF compression technique built from two sequential CNNs (JPEG-Hance and Depth-Net). We showed that there is enough information in a highly compressed LF center view to estimate a depth map for the LF and use it to reconstruct the whole LF, and that both compression and decompression are faster with our method. We used the public Flowers and 30 Scenes data sets to conduct our experiments and evaluate our model, achieving more than 100 times speedup in compression and about 18 times faster reconstruction compared to applying HEVC to LF pseudo-sequences.

Table 2: Reconstruction time for all LFs in the 30 Scenes data set with our method and HEVC; our method is 18 times faster in decompression.

Method                    CUDA    Reconstruction time (s)
HEVC                      Yes     0.399
JPEG-Hance + Depth-Net    Yes     -

Compared to HEVC, the LFs reconstructed with our method
have superior MSSIM and comparable MPSNR. For future work, we will try to add other views to improve the reconstruction quality; the added views will likely be compressed more heavily than the center view. We will also explore options to enhance the MPSNR. Finally, we look forward to deploying our method on actual LF video to explore the achievable compression ratio and streaming capabilities.
References

[1] D. Liu, L. Wang, L. Li, Zhiwei Xiong, Feng Wu, and Wenjun Zeng. Pseudo-sequence-based light field image compression. In , pages 1–4, 2016.
[2] Z. Zhao, S. Wang, C. Jia, X. Zhang, S. Ma, and J. Yang. Light field image compression based on deep learning. In , pages 1–6, 2018.
[3] A. Levin and F. Durand. Linear view synthesis using a dimensionality gap light field prior. In , pages 1831–1838, 2010.
[4] T. E. Bishop and P. Favaro. The light field camera: Extended depth of field, aliasing, and superresolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):972–986, 2012.
[5] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[6] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Trans. Graph., 35(6), November 2016.
[7] Henry Wing Fung Yeung, Junhui Hou, Jie Chen, Yuk Ying Chung, and Xiaoming Chen. Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues. In The European Conference on Computer Vision (ECCV), September 2018.
[8] Pratul P. Srinivasan, Tongzhou Wang, Ashwin Sreelal, Ravi Ramamoorthi, and Ren Ng. Learning to synthesize a 4D RGBD light field from a single image. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[9] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H. Kim, and Jan Kautz. Extreme view synthesis. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[10] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph., 38(4), July 2019.
[11] Wenhui Zhou, Gaomin Liu, Jiangwei Shi, Hua Zhang, and Guojun Dai. Depth-guided view synthesis for light field reconstruction from a single image. Image and Vision Computing, 95:103874, 2020.
[12] Zexi Hu, Yuk Ying Chung, Wanli Ouyang, Xiaoming Chen, and Zhibo Chen. Light field reconstruction using hierarchical features fusion. Expert Systems with Applications, 151:113394, 2020.
[13] Cristian Perra. Lossless plenoptic image compression using adaptive block differential prediction. In , pages 1231–1234. IEEE, 2015.
[14] Petri Helin, Pekka Astola, Bhaskar Rao, and Ioan Tabus. Sparse modelling and predictive coding of subaperture images for lossless plenoptic image compression. In , pages 1–4. IEEE, 2016.
[15] C. Conti, P. Nunes, and L. D. Soares. HEVC-based light field image coding with bi-predicted self-similarity compensation. In , pages 1–4, 2016.
[16] Y. Li, R. Olsson, and M. Sjöström. Compression of unfocused plenoptic images using a displacement intra prediction. In , pages 1–4, 2016.
[17] X. Jiang, M. Le Pendu, and C. Guillemot. Light field compression using depth image based view synthesis. In , pages 19–24, 2016.
[18] E. Dib, M. L. Pendu, and C. Guillemot. Light field compression using Fourier disparity layers. In , pages 3751–3755, 2019.
[19] I. Tabus, P. Helin, and P. Astola. Lossy compression of lenslet images from plenoptic cameras combining sparse predictive coding and JPEG 2000. In , pages 4567–4571, 2017.
[20] X. Jiang, M. Le Pendu, R. A. Farrugia, and C. Guillemot. Light field compression with homography-based low-rank approximation. IEEE Journal of Selected Topics in Signal Processing, 11(7):1132–1145, 2017.
[21] J. Zhao, P. An, X. Huang, L. Shan, and R. Ma. Light field image sparse coding via CNN-based EPI super-resolution. In , pages 1–4, 2018.
[22] B. Wang, Q. Peng, E. Wang, K. Han, and W. Xiang. Region-of-interest compression and view synthesis for light field video streaming. IEEE Access, 7:41183–41192, 2019.
[23] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[24] Jun Guo and Hongyang Chao. Building dual-domain representations for compression artifacts reduction. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 628–644, Cham, 2016. Springer International Publishing.
[25] Zhangyang Wang, Ding Liu, Shiyu Chang, Qing Ling, Yingzhen Yang, and Thomas S. Huang. D3: Deep dual-domain based fast restoration of JPEG-compressed images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[26] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2802–2810. Curran Associates, Inc., 2016.
[27] L. Cavigelli, P. Hager, and L. Benini. CAS-CNN: A deep convolutional neural network for image compression artifact suppression. In , pages 752–759, 2017.
[28] Baochang Zhang, Jiaxin Gu, Chen Chen, Jungong Han, Xiangbo Su, Xianbin Cao, and Jianzhuang Liu. One-two-one networks for compression artifacts reduction in remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 145:184–196, 2018.
[29] X. Zhang, W. Yang, Y. Hu, and J. Liu. DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal. In , pages 390–394, 2018.
[30] Z. Jin, M. Z. Iqbal, W. Zou, X. Li, and E. Steinbach. Dual-stream multi-path recursive residual network for JPEG image compression artifacts reduction. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2020.
[31] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[32] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[33] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[34] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.