Deep Coarse-to-fine Dense Light Field Reconstruction with Flexible Sampling and Geometry-aware Fusion

Jing Jin, Junhui Hou, Senior Member, IEEE, Jie Chen, Member, IEEE, Huanqiang Zeng, Senior Member, IEEE, Sam Kwong, Fellow, IEEE, Jingyi Yu
Abstract—A densely-sampled light field (LF) is highly desirable in various applications, such as 3-D reconstruction, post-capture refocusing and virtual reality. However, it is costly to acquire such data. Although many computational methods have been proposed to reconstruct a densely-sampled LF from a sparsely-sampled one, they still suffer from either low reconstruction quality, low computational efficiency, or the restriction on the regularity of the sampling pattern. To this end, we propose a novel learning-based method, which accepts sparsely-sampled LFs with irregular structures, and produces densely-sampled LFs with arbitrary angular resolution accurately and efficiently. We also propose a simple yet effective method for optimizing the sampling pattern. Our proposed method, an end-to-end trainable network, reconstructs a densely-sampled LF in a coarse-to-fine manner. Specifically, the coarse sub-aperture image (SAI) synthesis module first explores the scene geometry from an unstructured sparsely-sampled LF and leverages it to independently synthesize novel SAIs, in which a confidence-based blending strategy is proposed to fuse the information from different input SAIs, giving an intermediate densely-sampled LF. Then, the efficient LF refinement module learns the angular relationship within the intermediate result to recover the LF parallax structure. Comprehensive experimental evaluations demonstrate the superiority of our method on both real-world and synthetic LF images when compared with state-of-the-art methods. In addition, we illustrate the benefits and advantages of the proposed approach when applied in various LF-based applications, including image-based rendering and depth estimation enhancement.
Index Terms—Light field, deep learning, depth estimation, super resolution, compression, image-based rendering.
1 INTRODUCTION

The light field (LF) is a high-dimensional function describing light rays through every point traveling in every direction in free space [1], [2]. This function was initially introduced for LF rendering, which is an attractive method for generating novel views from a given set of pre-acquired views. In contrast to traditional image-based rendering (IBR) methods, LF rendering treats the captured images as samples of the LF function, and novel views can be generated in real time by re-sampling a slice from the function, during which no geometry information is required. To avoid ghosting effects, the LF is required to be densely sampled [3]. Densely-sampled LFs containing sufficient information also facilitate a wide range of applications, such as accurate depth inference [4], [5], 3-D scene reconstruction [6] and post-capture refocusing [7]. In addition, with the rapid development of virtual reality technology, a densely-sampled LF becomes vital as it provides smooth angular parallax shift as well as natural focus details, which are important for a satisfying immersive viewing experience [8], [9], [10].

• J. Jin, J. Hou, and S. Kwong are with the Department of Computer Science, City University of Hong Kong, Hong Kong. E-mail: [email protected]; {jh.hou, cssamk}@cityu.edu.hk
• J. Chen is with the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore 639798. E-mail: [email protected]
• H. Zeng is with the School of Information Science and Engineering, Huaqiao University, Xiamen 361021, China. E-mail: [email protected]
• J. Yu is with the School of Information Science and Technology, ShanghaiTech University, Shanghai, China, and the Department of Computer and Information Sciences, University of Delaware, Newark, DE. E-mail: [email protected]

The densely-sampled LF is highly desirable but raises great challenges for the acquisition.
For example, LF images with high angular resolution can be captured using a camera array [11] for simultaneous sampling from different viewpoints, or a computer-controlled gantry [12] for time-sequential sampling at different positions. However, the former is expensive and bulky, and the latter is limited to static scenes. The commercialization of hand-held LF cameras such as Lytro [13] and Raytrix [14] makes it convenient to acquire LF images. These cameras are cheaper and portable by encoding 4-D LF data onto a single 2-D sensor. However, due to limited sensor resolution, a trade-off between spatial and angular resolution exists.

Instead of relying on the development of hardware, many computational methods have been proposed for reconstructing a densely-sampled LF from a sparse one, which can be realized with low-cost commercial devices. Previous works [15], [16], [17], [18], [19], [20] either estimate disparity maps as auxiliary information, or use specific priors such as sparsity in a transformation domain for dense reconstruction. With the recent development of deep learning solutions for visual modeling, some learning-based methods [21], [22], [23], [24] have been proposed. However, most of the existing methods require the input sub-aperture images (SAIs) to be sampled with a specific or regular pattern, which raises difficulties for practical acquisition. Moreover, since the scene geometry is only implicitly and insufficiently modeled in these methods, the aliasing problem becomes serious in the reconstructed images when the input LF is extremely under-sampled, i.e., the samples have large disparities.

As a preliminary work [25], we proposed a learning-based model for densely-sampled LF reconstruction, in which the reconstruction of all novel SAIs is performed in one forward pass, during which the intrinsic LF structural information among them is fully explored. See more details in Section 2.2.
Although this method can produce impressive and state-of-the-art results on extensive real-world images captured by the Lytro Illum camera, the performance degradation caused by sparse sampling and the problem of non-flexibility still exist. In this paper, built upon [25], we provide a few distinguishable improvements, enabling flexible and accurate reconstruction of a densely-sampled LF from sparse sampling. We inherit the coarse-to-fine framework of [25]. That is, the proposed model consists of two modules, namely the coarse SAI synthesis and the efficient LF refinement. Specifically, the coarse SAI synthesis module independently synthesizes novel SAIs using geometry-based warping, where we take sampling with large disparities and arbitrary patterns into consideration. We also propose a novel confidence-based strategy for handling the occluded regions when blending the warped images from different viewpoints. We further refine the coarse results by exploiting all the intermediate SAIs with efficient pseudo 4-D filters. Such a refinement module is capable of improving the reconstruction quality by utilizing the intrinsic LF parallax structure.

In summary, the main contributions of this paper are as follows:

• we propose an end-to-end learning-based method for the reconstruction of densely-sampled LFs from sparsely-sampled LFs. Our method maintains high reconstruction quality when the sampling disparity increases, and improves the generality by enabling flexible input positions as well as flexible output angular resolution.
We also propose effective strategies for handling occlusions and preserving the LF parallax structure;

• we investigate how the sampling pattern affects the reconstruction quality, and propose a simple yet effective method for optimizing the sampling pattern;

• we design various and extensive experiments to comprehensively evaluate and analyze our method as well as those under comparison; and

• we demonstrate and discuss the benefits of the proposed approach to LF-based downstream applications.

The rest of this paper is organized as follows. Sec. 2 comprehensively reviews existing methods for view synthesis and densely-sampled LF reconstruction. Sec. 3 presents the proposed approach and investigates the optimization of sampling patterns. In Sec. 4, extensive experiments are carried out to evaluate the performance of the proposed approach. The benefits of the proposed approach to practical LF-based applications are validated and discussed in Sec. 5. Finally, Sec. 6 concludes this paper.

2 RELATED WORK
View synthesis, taking one or more views as inputs to render novel views, is a long-standing problem in the fields of computer graphics and computer vision. Most algorithms leverage scene geometry information for view synthesis, that is, they extract/learn the global/local geometry from the input viewpoints and use the resulting geometry information to warp the input views, followed by blending for novel view rendering [26], [27]. However, the forward warping operation typically leads to a hole-filling problem in occlusion areas. Flynn et al. [28] proposed to project input views to a set of depth planes and learn the weights to average the color of each plane. This method needs to learn specific geometry for different target viewpoints. To overcome this shortcoming, some methods based on 3-D scene representations were proposed. Penner et al. [29] presented a soft 3-D representation that preserves depth uncertainty. Tulsiani et al. [30] modeled the 3-D structure of the scene by learning to predict a layer-based representation, which represents multiple ordered depths per pixel along with color values. Zhou et al. [31] proposed to use multi-plane images, where each plane encodes color and transparency maps. Through these methods, novel views at varying positions can be rendered by simply forward projecting the corresponding representations. Besides, many methods aim at reconstructing 3-D scenes and synthesizing novel views from a single image (e.g., [32], [33], [34], [35]). However, these methods are still limited to simple and non-photorealistic synthetic objects.
LF rendering needs densely-sampled LFs as inputs. In what follows, we only focus on methods that reconstruct a densely-sampled LF from a sparsely-sampled one. Available solutions can be roughly classified into two categories: non-learning-based methods and learning-based methods.
Non-learning based methods.
Many traditional solutions that were originally developed for natural image processing, such as Gaussian models and sparse representation, have been explored for LF processing tasks. Among them, Mitra et al. [16] modeled LF patches using a Gaussian mixture model to address many LF processing tasks. Although it can achieve promising results to a certain extent, it is not robust against noise. Shi et al. [18] explored sparsity in the continuous Fourier domain to reconstruct densely-sampled LFs from a small set of samples. Vagharshakyan et al. [20] proposed an approach using the sparse representation of epipolar-plane images (EPIs) in the shearlet transform domain. These methods require the sparsely-sampled LF to be sampled on a regular grid. Moreover, some methods explore compressive LF photography. Marwah et al. [17] proposed a compressive LF camera architecture which allows LF reconstruction based on overcomplete dictionaries. To reduce the computational cost of dictionary learning, Kamal et al. [36] exploited a joint tensor low-rank and sparse prior for compressive reconstruction. These methods were specifically designed for coded LF acquisition.

Many works leverage explicit depth information for LF reconstruction.
Fig. 1: The flowchart of the proposed method for reconstructing a densely-sampled LF with M × N SAIs from a sparsely- and arbitrarily-sampled LF with K SAIs. The proposed model consists of two phases, i.e., the coarse SAI synthesis and the efficient LF refinement.

Zhang et al. [19] proposed a depth-assisted phase-based synthesis strategy for a micro-baseline stereo pair. Patch-based synthesis methods were presented by Zhang et al. [37], in which the center SAI is decomposed into different depth layers and LF editing is performed on all layers. However, this method has limited performance for view synthesis, especially for complex scenes. Some works were developed based on the idea of warping given SAIs to novel SAIs guided by an estimated disparity map. Wanner and Goldluecke [4] formulated the SAI synthesis problem as an energy minimization problem with a total variation prior, where the disparity map is obtained through global optimization with a structure tensor computed on 2-D EPI slices. This approach treats disparity estimation as a step separate from view synthesis, which makes the reconstruction quality heavily dependent on the accuracy of the estimated disparity maps. Although subsequent research [5], [15], [38] has shown significantly better disparity estimation, ghosting and tearing effects are still present.
Learning-based methods.
With the great success of deep convolutional neural networks in the field of image processing [39], [40], [41], [42], many learning-based methods have been proposed for densely-sampled LF reconstruction. Yoon et al. [21] jointly super-resolved the LF image in both the spatial and angular domains using a network that closely resembles the model proposed in [43]. Their approach is limited to scale-2 angular super-resolution and cannot flexibly adapt to sparsely-sampled LF inputs. Following the idea of single image super-resolution, Wu et al. [23], [44] proposed an LF reconstruction method which focuses on recovering the high-frequency details of bicubically up-sampled EPIs. In these methods, a blur-deblur scheme was proposed to address the information asymmetry problem caused by sparse angular sampling. Based on the observation that an EPI shows a clear structure when sheared with the disparity value, Wu et al. [24] proposed to fuse a set of sheared EPIs for LF reconstruction. Wang et al. [45] also proposed a method based on EPIs, which applies 3-D convolutional layers to recover the details on horizontal and vertical EPIs sequentially. However, since each EPI is a 2-D slice of the 4-D LF, the accessible spatial and angular information of these EPI-based models is severely restricted. Moreover, for these models, novel SAIs must be synthesized horizontally or vertically in the 2-D angular domain, resulting in accumulated errors. Yeung et al. [25] proposed an end-to-end network for densely-sampled LF reconstruction. By exploring the relationships between SAIs with pseudo 4-D filters, this method achieves state-of-the-art performance over a large number of real-world scenes captured by the Lytro camera.

In addition, depth information is also utilized in some learning-based methods for LF reconstruction. Srinivasan et al. [46] proposed to synthesize a 4-D LF image from a 2-D RGB image based on an estimated 4-D ray depth.
However, this method requires a large training dataset and only works on simple scenes, since the information contained in a single 2-D image is extremely limited. Kalantari et al. [22] proposed to synthesize novel SAIs with two sequential networks that perform depth estimation and color prediction successively. Although this method achieves good performance on LF images captured by the Lytro camera, the depth estimation and color prediction modules are implemented in a straightforward manner, which leaves room for improvement.

3 THE PROPOSED APPROACH
A 4-D LF can be represented with the two-plane parameterization, which uniquely describes the propagation direction of a light ray via two points on two parallel planes, i.e., the angular plane $(u, v)$ and the spatial plane $(x, y)$. Let $I \in \mathbb{R}^{W \times H \times M \times N}$ denote a densely-sampled LF containing $M \times N$ SAIs of spatial dimension $W \times H$, which are sampled on the angular plane with a regular 2-D grid of size $M \times N$. Let $\mathcal{U}$ be the set of 2-D angular coordinates of the SAIs in $I$, i.e., $\mathcal{U} = \{\mathbf{u} \mid \mathbf{u} = (u, v), 1 \le u \le M, 1 \le v \le N\}$. The SAI at $\mathbf{u}$ is denoted as $I_{\mathbf{u}} \in \mathbb{R}^{W \times H}$. Let $I_s$ denote a sparsely-sampled LF with $K$ SAIs, $\mathcal{P}$ be the set of the 2-D angular coordinates of the SAIs in $I_s$, i.e., $\mathcal{P} = \{\mathbf{p}_k \mid \mathbf{p}_k = (u, v), 1 \le k \le K\}$, and $I_{\mathbf{p}_k}$ be an SAI in $I_s$ located at $\mathbf{p}_k$. Moreover, the SAIs of a sparsely-sampled LF are assumed to be arbitrarily sampled from a certain densely-sampled LF, i.e., $\mathcal{P} \subset \mathcal{U}$ and $K \ll MN$. The unsampled SAIs, which belong to $I$ but do not appear in $I_s$, are denoted by $\bar{I}_s = \{I_{\mathbf{q}_l} \mid \mathbf{q}_l \in \mathcal{Q} = \mathcal{U} \backslash \mathcal{P}, 1 \le l \le MN - K\}$, with the operator $\backslash$ returning the difference between two sets.

Our goal is to learn $\hat{\bar{I}}_s$ as close to $\bar{I}_s$ as possible based on $I_s$, such that a densely-sampled LF denoted by $\hat{I} \in \mathbb{R}^{W \times H \times M \times N}$ can be reconstructed together with $I_s$. This problem can be implicitly formulated as:

$$\hat{I} = I_s \cup \hat{\bar{I}}_s = f(I_s, \mathcal{P}, \mathcal{Q}),$$   (1)

where $f$ denotes the mapping function to be learnt, and $\cup$ is the operator combining two sets.

The SAIs in $I$ are correlated with each other, which reveals the LF parallax structure. Specifically, under the Lambertian assumption and in the absence of occlusions, the relationship between SAIs of $I$ can be expressed as

$$I_{\mathbf{u}}(\mathbf{x}) = I_{\mathbf{u} + \Delta\mathbf{u}}(\mathbf{x} + d\Delta\mathbf{u}),$$   (2)

where $\mathbf{x} = (x, y)$ denotes the spatial coordinates, and $d$ is the disparity at the pixel $I_{\mathbf{u}}(\mathbf{x})$.
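To make Eq. (2) concrete, the following minimal numpy sketch (our own illustration, not the paper's code) resamples an SAI for a fronto-parallel scene with a single uniform integer disparity; the function name `shift_sai` and the axis convention (u indexing rows, v indexing columns) are assumptions of this sketch:

```python
import numpy as np

def shift_sai(sai, delta_u, delta_v, d):
    """Eq. (2): I_u(x) = I_{u+du}(x + d*du). Equivalently, the view at angular
    offset (delta_u, delta_v) shows the reference content displaced by
    d*(delta_u, delta_v) pixels: I_{u+du}(x) = I_u(x - d*du).
    Integer shifts only; np.roll wraps at the border, so only the interior
    of the result is physically meaningful."""
    return np.roll(sai, (int(d * delta_u), int(d * delta_v)), axis=(0, 1))

# A view one angular step away under uniform disparity d = 2.
sai = np.arange(25.0).reshape(5, 5)
neighbor = shift_sai(sai, 1, 0, 2)
```

Shifting back by the opposite angular offset recovers the reference view, which is exactly the consistency the refinement module is later asked to preserve.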
Being aware of this unique characteristic as well as the great success of deep learning, we propose a learning-based approach to explore the LF parallax structure for densely-sampled LF reconstruction, i.e., constructing a deep network to learn $f$, as shown in Fig. 1. Our approach consists of two modules, namely the coarse SAI synthesis network $f_c(\cdot)$ and the LF refinement network $f_r(\cdot)$, which predict $\hat{I}$ in a coarse-to-fine manner. To be specific, by explicitly learning the scene geometry from the input SAIs, the coarse SAI synthesis network individually generates novel SAIs, giving an intermediate densely-sampled LF denoted as $\tilde{I}$:

$$\tilde{I} = I_s \cup \tilde{\bar{I}}_s = f_c(I_s, \mathcal{P}, \mathcal{Q}).$$   (3)

The independent synthesis of the novel SAIs greatly saves computational time and memory usage during the testing stage. Then, the efficient refinement network learns residuals for $\tilde{I}$ by exploring the complementary information between the SAIs to recover the LF parallax structure, leading to the final output:

$$\hat{I} = \tilde{I} + f_r(\tilde{I}).$$   (4)

By characterizing the sparsely- and densely-sampled LFs, our approach improves the flexibility and accuracy of densely-sampled LF reconstruction. Specifically, our approach has the following characteristics:

• it overcomes the aliasing problem caused by sparse sampling, making it possible to take sparsely-sampled LFs with different angular sampling rates as inputs;

• it enables SAIs with arbitrary angular sampling patterns to be used as inputs, which brings more flexibility for densely-sampled LF reconstruction.
Moreover, we further investigate how to optimize the sampling pattern to improve the reconstruction quality;

• beyond the aforementioned goal, our method can produce densely-sampled LFs with user-defined angular resolution, making it more flexible for densely-sampled LF reconstruction in various scenarios; and

• it is able to accurately recover the valuable LF parallax structure, which is crucial for various applications based on a densely-sampled LF.

In the following, the details of the proposed approach are presented step by step.

Coarse SAI synthesis. This module aims at independently synthesizing intermediate novel SAIs denoted by $\tilde{\bar{I}}_s = \{\tilde{I}_{\mathbf{q}_l}\}$, which is formulated as

$$\tilde{I}_{\mathbf{q}_l} = f_c(I_s, \mathcal{P}, \mathbf{q}_l).$$   (5)

To handle inputs with large disparities, we utilize the geometry information explicitly for novel SAI synthesis. That is, we learn the disparity map at $\mathbf{q}_l$ from $I_s$ and synthesize the target SAI via backward warping. To deal with the challenge posed by irregular sampling patterns, we construct the disparity estimation network by learning correspondences from plane-sweep volumes (PSVs) [47]. We also propose a new strategy for blending the warped images, which is able to alleviate the artifacts around occlusion boundaries caused by warping. To this end, this module consists of three steps: PSV construction, disparity estimation, and warping and blending.

PSV construction. A naive way of disparity estimation is to directly extract features from $I_s$ using sequential convolutional layers. However, for randomly-sampled SAI inputs, i.e., when the angular position set $\mathcal{P}$ varies, it is difficult to properly provide the network with indicators w.r.t. the sampling and target positions, making the prediction unreliable (see results in Fig. 9). Instead, we use PSVs for disparity estimation.
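As a hedged sketch of the construction detailed next (each input SAI backward-warped toward the target position over a set of candidate disparity planes, cf. Eq. (6)); `build_psv` is our own illustrative name, and nearest-integer sampling stands in for the bilinear interpolation a real implementation would use:

```python
import numpy as np

def build_psv(sais, positions, q, disparities):
    """Plane-sweep volume for target angular position q: each input SAI I_{p_k}
    is backward-warped under every candidate disparity d, sampling the source
    at x + d*(q - p_k). Nearest-integer sampling with border clamping.
    Returns an array of shape (K, D, H, W)."""
    H, W = sais[0].shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    psv = np.zeros((len(sais), len(disparities), H, W))
    for k, (sai, p) in enumerate(zip(sais, positions)):
        du, dv = q[0] - p[0], q[1] - p[1]
        for j, d in enumerate(disparities):
            yy = np.clip(np.round(ys + d * du).astype(int), 0, H - 1)
            xx = np.clip(np.round(xs + d * dv).astype(int), 0, W - 1)
            psv[k, j] = sai[yy, xx]
    return psv
```

Because the angular offsets (q − p_k) enter the warping directly, an arbitrary (irregular) set of sampling positions is encoded into the volume itself, which is the property the paper relies on.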
A PSV with respect to a target position $\mathbf{q}_l$ is constructed by backward warping, i.e., reprojecting $I_s = \{I_{\mathbf{p}_k}\}$ with respect to a set of disparity planes $\{d\}$, resulting in a set of warped images $V = \{V_d^k\}$:

$$V_d^k(\mathbf{x}) = I_{\mathbf{p}_k}(\mathbf{x} + d(\mathbf{q}_l - \mathbf{p}_k)).$$   (6)

In this way, the arbitrary sampling positions of the input SAIs, as well as the target position for synthesis, are encoded into the PSVs during their construction.

Disparity inference from a PSV is based on the principle of photo-consistency. However, in occlusion areas or on non-Lambertian surfaces, the relationships between the matching patches of different SAIs are complicated. We propose to feed the whole PSV into the disparity estimation network, which differs from the way adopted in [22], where simple hand-crafted features such as the mean and standard deviation of the PSV across disparity planes are used. With the convolutional network's powerful representation learning ability, we are able to accurately estimate the disparity maps in challenging regions using the rich information provided by the PSVs.

Disparity estimation. The disparity estimation network is designed to predict a disparity map $D_{\mathbf{q}_l}$ at the target position $\mathbf{q}_l$ based on $V$. The network consists of a cost calculator that learns the matching cost for each disparity plane, and an estimator that predicts the disparity value. For the cost calculator, several convolutional layers are applied to each disparity plane with shared weights. For a typical disparity plane $d^*$, features measuring the similarity and diversity between images warped from different input SAIs are extracted from $\{V_{d^*}^k\}$. We use a large kernel size to obtain a relatively large receptive field, and keep the number of channels in the final layer of the cost calculator small. For the disparity estimator, the features from all disparity planes are concatenated together. Then sequential convolutional layers are used to predict the disparity value. Instead of selecting the disparity value with the minimum cost from the predefined disparity set, we let the network learn the disparity value, so that the number of predefined disparity planes, as well as the width of the network (i.e., the channel number), can be reduced. The number of channels in the hidden layers of the estimator is gradually decreased to finally output a disparity map $D_{\mathbf{q}_l}$.

Warping and blending. The novel SAI at the target position $\mathbf{q}_l$ can be synthesized by warping the input SAIs in $I_s$ using the predicted disparity map $D_{\mathbf{q}_l}$. Specifically, the image $I_{\mathbf{q}_l \leftarrow \mathbf{p}_k}$ resulting from warping $I_{\mathbf{p}_k}$ to the target position $\mathbf{q}_l$ can be expressed as

$$I_{\mathbf{q}_l \leftarrow \mathbf{p}_k}(\mathbf{x}) = I_{\mathbf{p}_k}(\mathbf{x} + (\mathbf{q}_l - \mathbf{p}_k) \cdot D_{\mathbf{q}_l}(\mathbf{x})).$$   (7)

Since the input SAIs contain valuable information about the scene from different viewpoints, they contribute to different areas of the target SAI. The warped images inevitably show artifacts around occlusion boundaries, and the locations of the artifacts vary among different source SAIs. Direct combination of the images warped from different viewpoints by simple averaging, or by the convolutional layers adopted in [22], produces blurry effects caused by the $\ell_1$/$\ell_2$ loss [48], especially when the input SAIs have large disparities. Therefore, we propose a blending strategy that fuses the images warped from different input SAIs using adaptive dense confidence maps. Specifically, the confidence maps are learned to indicate the pixel-wise accuracy of the images warped from different input SAIs, so that the more accurate regions are selected to form the synthesized SAI.
This strategy properly handles the occlusion problem after warping and preserves clear textures in the synthesized novel SAI (see details in Sec. 4.3). The $K$ confidence maps corresponding to the $K$ input SAIs, along with the disparity maps, are predicted by the final layer of the disparity estimation network. This is feasible because the network has learnt the relationships between the input SAIs and implicitly modeled their relationships to the target SAI. The blending can then be formulated as:

$$\tilde{I}_{\mathbf{q}_l} = \sum_{k=1}^{K} C_k \odot I_{\mathbf{q}_l \leftarrow \mathbf{p}_k},$$   (8)

where $C_k$ is the confidence map for the $k$-th input SAI, and $\odot$ is the element-wise multiplication operator.

Efficient LF refinement. In the coarse SAI synthesis phase, novel SAIs are independently synthesized, and the LF parallax structure among them is not well taken into account, resulting in possible photometric inconsistencies between SAIs in the intermediate LF image $\tilde{I}$. Therefore, an efficient refinement network is designed to further exploit the structure of $\tilde{I}$, which is expected to recover the photo-consistency and further improve the reconstruction quality of the densely-sampled LF. Since the goal is to correct possible flaws that are inconsistent across SAIs while preserving high-frequency textures, residual learning is used in this module. In summary, we first exploit the LF parallax structure of $\tilde{I}$ and then reconstruct residual maps for it, as formulated in Eq. (4).

The LF parallax structure. To exploit the LF parallax structure within $\tilde{I}$, 4-D convolution is a straightforward choice. However, the computational cost required by 4-D convolution is very high. Instead, pseudo filters or separable filters, which reduce model complexity by approximating a high-dimensional filter with a combination of filters of lower dimensions, have been applied to various computer vision problems, such as image structure extraction [49], 3-D rendering [50] and video frame interpolation [51].
They have recently been adopted in [52] for LF material classification and in [53] for LF spatial super-resolution, which verifies that pseudo 4-D filters can achieve performance comparable to 4-D filters. Therefore, we adopt pseudo 4-D filters, each of which approximates a single 4-D filtering step with two 2-D filters. Specifically, the intermediate feature maps are reshaped between the stack of spatial feature maps $F_{spa} \in \mathbb{R}^{W \times H \times f_c \times MN}$ and the stack of angular ones $F_{ang} \in \mathbb{R}^{M \times N \times f_c \times WH}$, so that the convolution is performed alternately on the spatial and angular domains. Such a design significantly reduces the computation required by a 4-D convolution, while remaining capable of effectively extracting both spatial and angular information from the LF image.
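The reshaping that alternates a pseudo 4-D convolution between the two 2-D domains can be sketched as follows (a numpy illustration with our own function names and axis ordering; a real implementation would run 2-D convolutions on the reshaped stacks):

```python
import numpy as np

def spa_to_ang(f, M, N):
    """Reshape a stack of spatial feature maps, shape (M*N, C, H, W), into a
    stack of angular feature maps, shape (H*W, C, M, N), so that a 2-D
    convolution can act on the angular dimensions: one half of a pseudo
    4-D convolution."""
    MN, C, H, W = f.shape
    return f.reshape(M, N, C, H, W).transpose(3, 4, 2, 0, 1).reshape(H * W, C, M, N)

def ang_to_spa(f, H, W):
    """Inverse reshape: (H*W, C, M, N) back to (M*N, C, H, W) for the
    spatial-convolution half."""
    HW, C, M, N = f.shape
    return f.reshape(H, W, C, M, N).transpose(3, 4, 2, 0, 1).reshape(M * N, C, H, W)
```

The round trip is lossless, so alternating spatial and angular 2-D convolutions touches every (x, y, u, v) combination without ever materialising a 4-D kernel.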
Residual reconstruction. After exploring the relationships along the angular dimensions, the residual maps are reconstructed separately for each SAI in the intermediate LF image. Several 2-D spatial convolutional layers are applied to learn a residual map from the extracted spatial-angular deep features for each SAI. Here each SAI is processed independently for two reasons. First, we believe the preceding spatial-angular convolutions are capable of exploiting the LF parallax structure. Second, and more importantly, in this way we can build a network that is fully convolutional in both the spatial and angular dimensions, such that flexible output angular resolution is achieved. Finally, the reconstructed residual map is added to the previously synthesized intermediate LF image to form the final reconstructed LF $\hat{I}$.

All modules in our approach are differentiable, leading to an end-to-end trainable network. The loss function for training the network consists of three parts. The first part provides supervision for the intermediate LF by calculating the absolute error between the intermediate LF images and the ground-truth ones, i.e.,

$$\ell_s = \|I - \tilde{I}\|_1.$$   (9)

To promote smoothness of the predicted ray disparity, we penalize the $\ell_1$ norm of the second-order gradients [54], denoted as $\ell_{smooth}$:

$$\ell_{smooth} = \sum_{l=1}^{MN-K} \|\nabla_{xx} D_{\mathbf{q}_l}\|_1 + \|\nabla_{xy} D_{\mathbf{q}_l}\|_1 + \|\nabla_{yx} D_{\mathbf{q}_l}\|_1 + \|\nabla_{yy} D_{\mathbf{q}_l}\|_1,$$   (10)

where $\nabla_{xx}$, $\nabla_{xy}$, $\nabla_{yx}$ and $\nabla_{yy}$ are the second-order gradients over the spatial domain of the disparity map $D_{\mathbf{q}_l}$. Finally, the output reconstructed LF image is optimized by minimizing the absolute error:

$$\ell_r = \|I - \hat{I}\|_1.$$   (11)

Thus, our final objective is written as

$$\ell = \lambda_1 \ell_s + \lambda_2 \ell_{smooth} + \lambda_3 \ell_r,$$   (12)

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ weight the reconstruction accuracy and the disparity smoothness, and are set empirically.

Fig. 2: Illustration of the relationship between the minimum distance of the sampling patterns and the reconstruction quality, tested on the HCI dataset; panels (a)-(c) correspond to the three reconstruction tasks. The blue dots denote randomly generated patterns. The green dots and their annotations correspond to the patterns in Fig. 3. The results of the patterns optimized by our method are highlighted as red stars.

Fig. 3: Illustration of different sampling patterns. From top to bottom are sampling patterns with 4, 3 and 2 input SAIs, respectively. (f), (l) and (r) depict the sampling patterns optimized by our algorithm for the three reconstruction tasks, respectively.

Optimizing the sampling pattern for densely-sampled LF reconstruction is a valuable topic, as it could further exploit the full potential of the reconstruction algorithm and improve the reconstruction quality using as few hardware resources as possible. Additionally, optimizing the sampling pattern may be beneficial to its application in LF compression (see more details in Sec. 6). In this section, we first investigate qualitatively and experimentally how the sampling pattern affects the reconstruction, and then propose a simple yet effective method for optimizing the sampling pattern tailored to our reconstruction model.

Intuitively, the reconstruction quality is influenced by how thoroughly the scene has been recorded by the sparsely-sampled input. Since most foreground objects can be completely captured from different viewpoints, the occluded regions are the critical challenge. There are several factors that affect the amount of information about the occluded areas that can be captured with the LF.
One of thefactors is the overall distance between the novel SAIs andthe sampled SAIs. That is, SAIs nearby can provide morereferences for novel SAI reconstruction compared to thosefar away. Additionally, sampling patterns with SAIs dis-tributed at more diverse locations along the horizontal andvertical directions are better than their counterparts withless variation, as the former sees more occluded regions.Finally, this issue should be related to the scene content.Factors such as the geometry complexity between objectscan play an important role.We experimentally investigated the effect of the sam-pling pattern on reconstruction quality. First, we define ametric, namely minimum distance, which is the average ofthe angular Euclidean distances of all novel SAIs to theirnearest input SAI in the 2-D sampling grid. We then con-ducted the following experiments, in which we randomlyselected some sampling patterns for → × , → × and → × dense reconstruction, respectively, then fittedthe relationships between their minimum distance againsttheir reconstruction quality with a second degree polyno-mial. Fig. 2 illustrates the results, where we can see thatwith the increase of the minimum distance of the samplingpattern, the corresponding reconstruction quality decreases HerbsStillLifeRockBikes
Ground truth Vagharshakyan et al. [20] Wu et al. [23] Wu et al. [24] Wang et al. [45] Kalantari et al. [22] Yeung et al. [25] Ours (fixed)
Fig. 4: Visual comparisons of different methods on the synthesized center SAI for the fixed-pattern task with 2×2 corner inputs (fixed models). Selected regions have been zoomed in for better comparison. It is recommended to view this figure by zooming in.

in general. Moreover, the corresponding sampling patterns of the green dots are illustrated in Fig. 3. It can be seen that patterns with smaller variation along the horizontal or vertical direction always stay below the fitted curve (e.g., with close values of the minimum distance, sampling pattern (b) performs better than (c), and similar scenarios can be found between (l) and (i), and between (q) and (n)), which indicates that the divergence is indeed a factor influencing the reconstruction quality.

Based on the above observations, we propose a simple yet effective strategy for optimizing the sampling pattern,

TABLE 1: Comparison of attributes for densely-sampled LF reconstruction algorithms, where flexible input means whether the method is feasible for an arbitrary sampling pattern, and flexible output means whether the method can produce densely-sampled LFs with flexible angular resolution.
| Algorithms | learning-based | geometry-based | flexible input | flexible output |
|---|---|---|---|---|
| Vagharshakyan et al. [20] | - | - | - | ✓ |
| Wu et al. [23] | ✓ | - | - | ✓ |
| Wu et al. [24] | ✓ | ✓ | - | ✓ |
| Wang et al. [45] | ✓ | - | - | - |
| Kalantari et al. [22] | ✓ | ✓ | ✓ | ✓ |
| Yeung et al. [25] | ✓ | - | - | - |
| Ours | ✓ | ✓ | ✓ | ✓ |
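The minimum-distance metric used in the sampling-pattern analysis above (the average angular Euclidean distance from each novel SAI to its nearest sampled SAI on the 2-D grid) is simple to compute. A small NumPy sketch follows; the function name and the coordinate convention (integer angular grid positions) are assumptions for illustration:

```python
import numpy as np
from itertools import product

def minimum_distance(grid_size, sampled):
    """Average, over all novel SAI positions on an M x N angular grid, of the
    Euclidean distance to the nearest sampled SAI."""
    m, n = grid_size
    sampled = np.asarray(sampled, dtype=float)
    dists = []
    for pos in product(range(m), range(n)):
        p = np.array(pos, dtype=float)
        if any((p == s).all() for s in sampled):
            continue  # the sampled SAIs themselves are not novel views
        dists.append(np.linalg.norm(sampled - p, axis=1).min())
    return float(np.mean(dists))
```

For example, the four-corner pattern on a small grid yields a lower minimum distance than a single-corner pattern, matching the intuition that a lower value correlates with better reconstruction quality.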
TABLE 2: Quantitative comparisons (PSNR/SSIM) of the proposed approach with the state-of-the-art ones under the fixed-pattern task with 2×2 corner inputs. The input sparsely-sampled LFs are sampled at the four corners during both training and test.

| Test set | Disparity | Vagharshakyan et al. [20] | Wu et al. [23] | Wu et al. [24] | Wang et al. [45] | Kalantari et al. [22] | Yeung et al. [25] | Ours (fixed) |
|---|---|---|---|---|---|---|---|---|
| HCI | [-24, 24] | 26.98/0.734 | 26.64/0.744 | 31.84/0.898 | 29.61/0.819 | 32.85/0.909 | 32.30/0.900 | |
| HCI old | [-18, 18] | 32.47/0.853 | 31.43/0.850 | 37.61/0.942 | 35.73/0.898 | 38.58/0.944 | 39.69/0.941 | |
| 30scenes | [-6, 6] | 34.17/0.907 | 33.66/0.918 | 39.17/0.975 | 38.22/0.970 | 41.40/0.982 | | |
| Occlusions | [-6, 6] | 32.64/0.923 | 32.72/0.924 | 34.41/0.955 | 35.42/0.962 | 37.25/0.972 | | |
| Reflective | [-6, 6] | 35.34/0.935 | 34.76/0.930 | 36.38/0.944 | 35.96/0.942 | 38.09/0.953 | 38.33/0.960 | |

which is formulated as:
$$\arg\min_{\mathbf{P}, \mathbf{O}} \sum_{l=1}^{MN-K} \sum_{k=1}^{K} o_{l,k} \|q_l - p_k\|, \quad \text{s.t.} \;\; q_l \in \mathcal{Q}, \; p_k \in \mathcal{P}, \; \forall l, k, \quad o_{l,k} \in \{0, 1\}, \; \forall l, \; \sum_{k=1}^{K} o_{l,k} = 1, \qquad (13)$$
where $o_{l,k}$ is the $(l,k)$-th entry of the indicator matrix $\mathbf{O} \in \mathbb{R}^{(MN-K) \times K}$, which indicates whether the $k$-th sampled SAI is the nearest one among all samples to the $l$-th novel SAI. We first find a solution of the optimization problem in Eq. (13) using the deterministic annealing based method [55], [56]. As the solution varies with initialization, we select the one producing the minimum objective value after repeating the algorithm 5 times with random initialization. In addition, as the resulting optimized positions may not be located on the grid, we consider the divergence along both the horizontal and vertical directions to round the solutions. In this way, we obtain the optimized sampling patterns depicted in Fig. 3 (f), (l) and (r). As demonstrated in Fig. 2, the quantitative reconstruction quality under the sampling patterns produced by our algorithm is the highest among all patterns, which indicates the effectiveness of our sampling pattern selection algorithm. Furthermore, we experimentally verified the effectiveness of the proposed strategy for optimizing the sampling pattern on LFs with different scene content; see Section 4.3.

4 EXPERIMENTAL RESULTS
Both synthetic LF images from the 4-D LF benchmarks [57], [58] and real-world LF images captured with a Lytro Illum camera, provided by the Stanford Lytro LF Archive [59] and Kalantari et al. [22], were employed for training and testing. Specifically, 20 synthetic images and 100 real-world images were used for training, while 9 synthetic LF images, including 4 from the HCI [57] dataset and 5 from the HCI old [58] dataset, and 3 datasets with 70 real-world LF images captured with a Lytro Illum camera, namely 30scenes [22], Occlusions [59] and Reflective [59], were used for testing. These datasets cover several important factors in evaluating methods for LF reconstruction. Specifically, the synthetic datasets contain high-resolution textures, measuring the ability to maintain high-frequency details. The real-world datasets evaluate the performance of different methods under natural illumination and practical camera distortion. Moreover, the HCI dataset contains LF images with large disparities, which emphasizes robustness to sparser sampling. The Occlusions and Reflective datasets focus on challenging scenes in which the assumption of photo-consistency does not hold.

During training, spatial patches were randomly cropped, and the batch size was set to 1 due to the limitation of computational memory. Moreover, we adopted the ADAM [60] optimizer with its two momentum parameters β₁ and β₂. The learning rate was reduced by half whenever the loss stopped decreasing. The spatial resolution of the model output was kept unchanged via zero padding. We implemented the model with PyTorch. The code will be publicly available.

Besides our preliminary work Yeung et al. [25], we also compared with 5 state-of-the-art learning-based methods specifically designed for densely-sampled LF reconstruction, i.e., Vagharshakyan et al. [20], Wu et al. [23],
TABLE 3: Quantitative comparisons of the proposed approach with Kalantari et al. [22] on reconstruction with arbitrary sampling patterns under the 4-input task. Sampling patterns (a), (c) and (f) (depicted in Fig. 3) are used for comparison.

| Test set | (a): Kalantari et al. [22] | Ours | (c): Kalantari et al. [22] | Ours | (f): Kalantari et al. [22] | Ours |
|---|---|---|---|---|---|---|
| HCI | | | | | | |
| HCI old | | | | | | |
| Occlusions | | | | | | |
| Reflective | | | | | | |
TABLE 4: Quantitative comparisons of the proposed approach with Kalantari et al. [22] on reconstruction with arbitrary sampling patterns under the 3-input task. Sampling patterns (g), (j) and (l) (depicted in Fig. 3) are used for comparison.

| Test set | (g): Kalantari et al. [22] | Ours | (j): Kalantari et al. [22] | Ours | (l): Kalantari et al. [22] | Ours |
|---|---|---|---|---|---|---|
| HCI | | | | | | |
| HCI old | | | | | | |
| Occlusions | | | | | | |
| Reflective | | | | | | |
TABLE 5: Quantitative comparisons of the proposed approach with Kalantari et al. [22] on reconstruction with arbitrary sampling patterns under the 2-input task. Sampling patterns (m), (p) and (r) (depicted in Fig. 3) are used for comparison.

| Test set | (m): Kalantari et al. [22] | Ours | (p): Kalantari et al. [22] | Ours | (r): Kalantari et al. [22] | Ours |
|---|---|---|---|---|---|---|
| HCI | | | | | | |
| HCI old | | | | | | |
| Occlusions | | | | | | |
| Reflective | | | | | | |

Wu et al. [24], Wang et al. [45], and Kalantari et al. [22]. Table 1 lists the feature comparisons of these algorithms in terms of whether they are learning-based, whether they are geometry-based, whether they are flexible with arbitrary input patterns, and whether they can produce reconstructions with flexible angular resolution. We conducted various experiments for comparison, listed as follows:

• As 5 out of the 6 methods under comparison, i.e., Vagharshakyan et al. [20], Wu et al. [23], Wu et al. [24], Wang et al. [45], and Yeung et al. [25], are unable to handle inputs with flexible and irregular sampling patterns, we first designed the fixed-pattern experiment with 2×2 corner inputs, in which the same fixed sampling pattern was used during both training and testing, such that all compared methods could be evaluated. We name our
1. Note that the methods with released training code, i.e., Wu et al. [24], Wang et al. [45], Kalantari et al. [22], and Yeung et al. [25], were retrained with the same training data for fair comparison. The retrained models achieve performance comparable to the models provided by the authors. For the method without released training code, i.e., Wu et al. [23], we used the trained model provided by the authors.
Ours (fixed) under such a training setting. Seesubsection 1); • as both Ours and Kalantari et al. [22] can accept flexi-ble and irregular sampling patterns, we designed theexperiments → × , → × and → × , inwhich sparsely-sampled LFs each containing K SAIswith arbitrary positions and structures were fed intothe network during training, and some of patterns il-lustrated in Fig. 3 were used during testing. Here weconsidered three cases, i.e., K = 2 , , , respectively.See subsection 2); and • we compared the ability of different methods on pre-serving the LF parallax structure both quantitativelyand qualitatively. We also evaluate the running timefor different methods. See subsection 3). Comparison on the reconstruction with fixed inputsampling patterns.
This comparison was performed over the fixed-pattern task, which reconstructs a densely-sampled LF from a sparsely-sampled LF with 2×2 SAIs distributed regularly. Here the SAIs of a sparsely-sampled

Bicycles  Dishes  IMG-1528-eslf  IMG-1555-eslf
Ground truth Kalantari et al. [22] Ours Ground truth Kalantari et al. [22] Ours
Fig. 5: Visual comparisons of different methods on the synthesized center SAI for the 4-input task with sampling pattern (a) (flexible models). Selected regions have been zoomed in for better comparison. It is recommended to view this figure by zooming in.

LF are located at the four corners of the densely-sampled LF to be reconstructed, as shown in Fig. 3(a). We used the average PSNR and SSIM over all synthesized novel SAIs to quantitatively measure the quality of the reconstructed densely-sampled LFs, and the corresponding results are listed in Table 2. It can be observed that:

• the performance of all methods decreases as the disparity between input SAIs increases;

• EPI-based methods, including Vagharshakyan et al. [20], Wu et al. [23], Wu et al. [24], and Wang et al. [45], are inferior to the others. A possible reason is that only 2 rows or columns of pixels are available during the reconstruction of each EPI, making it difficult to recover the intermediate linear structures without modeling the 2-D spatial structure, especially for complicated scenes. Among them, Wu et al. [24] performs relatively better, as depth information is utilized as guidance;

• Kalantari et al. [22] achieves good results on the real-world datasets, which indicates the effectiveness of geometry-based warping. However, it fails on the
HCI dataset with larger disparities. The reason is that Kalantari et al. [22] uses hand-crafted features to estimate the disparity and simple convolutional layers to combine the warped images, which makes it difficult to build long-distance connections between SAIs with large disparities;

• Yeung et al. [25] achieves the best results on the real-world datasets, indicating that its pseudo 4-D filters effectively explore the spatial and angular relationships between input SAIs. However, this method also does not work well on the
HCI dataset, because it relies entirely on deep regression for novel view synthesis, which indicates the importance of explicit geometric modeling for reconstruction from sparse sampling; and

• our approach achieves the highest PSNR/SSIM on the HCI and
HCI old datasets, and comparable performance with Yeung et al. [25] on the 30scenes, Occlusions and
Reflective datasets, showing the advantages of the proposed framework.

We also visually compared the reconstruction results of the different algorithms, as shown in Fig. 4. It can be observed that Wu et al. [23], Wu et al. [24] and Wang et al. [45] fail to recover delicate structures, such as the leaves and the textures on the wall, while Kalantari et al. [22] and Yeung et al. [25] struggle with large disparities. In contrast, our approach produces accurate estimations, which are closer to

Fig. 6: Quantitative comparisons of the LF parallax structure by comparing the parallax content PR curves for different methods.
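Each point of a precision-recall curve like those in Fig. 6 compares a predicted edge mask against a ground-truth one. The snippet below is a generic sketch of one such operating point on binary masks, not the exact parallax-edge protocol of [61]; function and variable names are illustrative:

```python
import numpy as np

def edge_precision_recall(pred_edges, gt_edges):
    """Precision and recall of a binary predicted edge mask against a binary
    ground-truth edge mask."""
    pred = pred_edges.astype(bool)
    gt = gt_edges.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # true-positive edge pixels
    precision = tp / max(pred.sum(), 1)       # fraction of predictions correct
    recall = tp / max(gt.sum(), 1)            # fraction of ground truth found
    return precision, recall
```

Sweeping a detection threshold over the edge strength, and computing one (precision, recall) pair per threshold, traces out the full curve.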
HCI 1  HCI 2  HCI 3  HCI 4  Occlusions 1  Occlusions 2  Occlusions 3  Occlusions 4  30scenes 1  30scenes 2  30scenes 3  30scenes 4  (1) (2) (3) (4) (5) (6) (l)
Fig. 7: The images and sampling patterns used to investigate the effectiveness of the optimized sampling patterns on LFs with different scene content. 12 different scenes are manually selected. The optimized sampling pattern (l) obtained by our method is compared with 6 neighboring patterns (1)-(6).

the ground-truth ones.

Comparison on the reconstruction with flexible input sampling patterns.
We performed comparisons over random input positions between Kalantari et al. [22] and our approach. During training, the input SAIs were selected at random positions, and the input patterns illustrated in Fig. 3 were used for testing. We report the quantitative results of the tasks with 4, 3 and 2 input SAIs in Tables 3, 4 and 5, respectively. It can be observed that our method improves the PSNR by around 4 dB on the synthetic datasets and by around 0.4-1 dB on the real-world datasets.

To visually compare the outputs of Kalantari et al. [22] and our method, we calculated the error maps of the reconstructed center SAI under the 4-input task with pattern (a) in Fig. 5. The results further demonstrate the advantages of our proposed approach. As shown in the results on synthetic data in Fig. 5 (see the first row), basic textures are severely blurred or distorted in the reconstructed SAI of Kalantari et al. [22] when the inputs have large disparities, while our method can reconstruct most of the high-frequency details. For real-world LF reconstruction in Fig. 5 (see the second row), Kalantari et al. [22] produces artifacts near the boundaries of the foreground objects, while fine edges and small objects are well preserved in the results of our method.

Comparison of the LF parallax structure.
Fig. 8: Illustration of the effectiveness of the optimized sampling patterns on LFs with different scene content. The selected LF scenes and sampling patterns are illustrated in Fig. 7. The red pentagrams mark the highest PSNR achieved with the optimized sampling pattern by our method, and red dots mark the highest PSNR achieved with other patterns.

The most valuable information of LF images is the LF parallax structure, which implicitly represents the scene geometry. We compared the LF parallax structure of the densely-sampled LFs reconstructed by the different algorithms. In Figs. 4 and 5, the EPIs of the reconstructed LF images are compared. It can be seen that the EPIs of our method preserve clearer linear structures and are closer to the ground truth.

We also quantitatively evaluated the LF parallax structure using the LF parallax edge precision-recall (PR) curves [61]. Fig. 6 shows the comparisons of the PR curves of the densely-sampled LFs reconstructed by the different algorithms with fixed and flexible sampling. It can be observed that the PR curves of our method are closer to the top-right corner than the others, indicating that our method preserves the LF parallax structure best.

Moreover, as SSIM is a well-known metric for measuring the structural similarity between images, the SSIM values of EPIs were computed as a metric for evaluating the preservation of the LF parallax structure, as done in [62]. The results over different datasets are listed in Table 6. It can be seen that the EPIs reconstructed by our method achieve the highest SSIM values, validating that our method better preserves the LF parallax structure. This conclusion is consistent with the one drawn from the PR curve comparison.

Comparison of the running time.
We compared the running time (in seconds) of the different methods for reconstructing a densely-sampled LF, and Table 7 lists the results. All methods were tested on a desktop with an Intel i7-8700 CPU @ 3.70 GHz, 32 GB RAM and an NVIDIA GeForce RTX 2080 Ti. From Table 7, it can be observed that our approach, taking only about 0.8 seconds to generate a novel SAI, is much faster than the other methods except Wang et al. [45] and Yeung et al. [25]. Although Wang et al. [45] and Yeung et al. [25] are faster, our approach is superior in terms of reconstruction quality and angular flexibility.
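The EPI slices used throughout these comparisons are 2-D cuts of the 4-D LF. Assuming an array layout of (angular row v, angular column u, spatial row y, spatial column x), which is a convention chosen here for illustration and not specified in the text, extraction is plain indexing:

```python
import numpy as np

def horizontal_epi(lf, v, y):
    """Fix the angular row v and spatial row y of an LF with shape
    (V, U, Y, X); the resulting (U, X) slice is a horizontal EPI, whose line
    slopes encode the per-point disparity."""
    return lf[v, :, y, :]

def vertical_epi(lf, u, x):
    """Fix the angular column u and spatial column x, yielding a (V, Y)
    vertical EPI."""
    return lf[:, u, :, x]
```

Metrics such as the EPI SSIM of Table 6 are then computed between corresponding slices of the reconstructed and ground-truth LFs.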
In this section, we experimentally validated the effectiveness of our view sampling optimization strategy on LFs with different scene content, as well as the effectiveness of three

TABLE 6: Comparisons of the SSIM of EPIs reconstructed by different methods.
| Test set | Disparity | Vagharshakyan et al. [20] | Wu et al. [23] | Wu et al. [24] | Wang et al. [45] | Kalantari et al. [22] | Yeung et al. [25] | Ours (fixed) |
|---|---|---|---|---|---|---|---|---|
| HCI | [-24, 24] | 0.747 | 0.736 | 0.896 | 0.775 | 0.905 | 0.899 | |
| | [-6, 6] | 0.904 | 0.907 | 0.969 | 0.923 | 0.977 | 0.981 | |
TABLE 7: Comparisons of the running time (in seconds) of different methods for reconstructing a densely-sampled LF.

| Algorithms | Vagharshakyan et al. [20] | Wu et al. [23] | Wu et al. [24] | Wang et al. [45] | Kalantari et al. [22] | Yeung et al. [25] | Ours (fixed) |
|---|---|---|---|---|---|---|---|
| HCI | | | | | | | |

Simple convolution  Kalantari et al. [22] (inter)  Ours        (a) HCI  (b) 30scenes
Fig. 9: Visual comparisons of the intermediate by-product disparity maps estimated by directly applying convolutional layers to the input SAIs, by Kalantari et al. [22], and by our network. Kalantari et al. [22] (inter) denotes the modified network of Kalantari et al. [22] with intermediate supervision on the warped images using ground-truth targets.
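The two operations that Figs. 9 and 10 revolve around, disparity-based warping of an input SAI to a novel viewpoint and confidence-weighted fusion of the warped images, can be sketched as follows. This is a simplified NumPy illustration with nearest-neighbour sampling and softmax-normalized confidences, not the network's differentiable implementation, and all names and shapes are assumptions:

```python
import numpy as np

def warp_to_view(src, disparity, delta_u, delta_v):
    """Backward-warp a source SAI to a novel viewpoint at angular offset
    (delta_u, delta_v), using the disparity map estimated at the novel view:
    each target pixel (y, x) samples the source at
    (y + d * delta_v, x + d * delta_u)."""
    h, w = src.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + disparity * delta_v).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + disparity * delta_u).astype(int), 0, w - 1)
    return src[sy, sx]

def confidence_blend(warped, logits):
    """Fuse images warped from different input SAIs with per-pixel confidence
    maps, normalized by a softmax so the weights sum to one at every pixel."""
    logits = np.stack(logits)                       # (K, H, W)
    weights = np.exp(logits - logits.max(axis=0))   # numerically stable softmax
    weights /= weights.sum(axis=0)
    return (np.stack(warped) * weights).sum(axis=0)
```

In the network both steps are differentiable (bilinear sampling, learned confidence maps), so the disparity estimator receives gradients through the warping, which is exactly what the intermediate supervision of Kalantari et al. [22] (inter) strengthens.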
Input views  Ground truth  Disparity map  Warped images  Confidence maps  Blended images
Fig. 10: Demonstration of the effectiveness of our blending strategy. The estimated disparity map, zoomed-in views of the images warped from the input SAIs, the learned confidence maps and the blended images are presented.

components of our network, including the disparity estimation module, the blending strategy and the refinement module.

TABLE 8: Effectiveness verification of the refinement module in our approach. We compare the reconstruction quality of the LF images generated by our method without the refinement module against those generated with all modules, under the 4-input and 3-input tasks over the HCI and 30scenes test sets.

4-input task, patterns (a) and (f):
| Test set | without refinement | with refinement | without refinement | with refinement |
|---|---|---|---|---|
| HCI | | | | |

3-input task, patterns (g) and (l):
| Test set | without refinement | with refinement | without refinement | with refinement |
|---|---|---|---|---|
| HCI | | | | |

The effectiveness of the optimization strategy for the sampling pattern on different scene content.
As there is no metric to quantify scene content complexity, we manually selected images covering different scenes and captured with different camera settings. As shown in Fig. 7, the selected images vary in geometry complexity, object category (e.g., Occlusions 1 and Occlusions 3), camera parameters (e.g., Occlusions 3 and Occlusions 4), and data acquisition method (e.g., the HCI and 30scenes images), etc. The 6 sampling patterns neighboring the one optimized by our strategy were used for comparison, as illustrated in the bottom row of Fig. 7. The PSNR of the LFs reconstructed from inputs with different sampling patterns on LFs with different scene content is plotted in Fig. 8. It can be seen that although the PSNR values of the reconstructed LFs follow different trends as the sampling pattern changes, the highest PSNR is achieved with the sampling pattern selected by our method in most cases (9 out of 12). In addition, our selected sampling pattern achieves a PSNR comparable to the highest one even when it is not optimal. Although the selected images cannot cover all scenarios, our experiments show that the proposed optimization strategy is generally applicable in most of the cases we experimented with.

The effectiveness of the disparity estimation module.
In our approach, the disparity maps are estimated by constructing PSVs, which are fed into the subsequent network. Alternative ways include applying convolutional layers to the input SAIs directly, or extracting hand-crafted features from PSVs as the input of a network [22]. To validate the advantages of our disparity estimation module, we visually compared the by-product disparity maps estimated in these three manners. Note that when training the network of Kalantari et al. [22] using the code provided by the authors, the estimated disparity maps for the HCI dataset are nearly all zeros. We believe the reason is that the only objective of the network is to optimize the final reconstruction, by applying a loss function to the last refinement module. For LF datasets with large disparities, such a loss function cannot efficiently back-propagate to the disparity values via the warping operators. Therefore, we modified their source code and re-trained their network with an added intermediate supervision on the warped images using ground-truth targets, denoted as Kalantari et al. [22] (inter). The estimated disparity maps then become reasonable. In addition, the average PSNR value of its final reconstructions on the HCI dataset is also improved by around 0.3 dB. As shown in Fig. 9, our method produces disparity maps with far fewer errors both in the background and at occlusion boundaries.

TABLE 9: Effectiveness verification of the confidence-based fusion compared with blending using convolutional layers as in [22]. PSNR/SSIM values are provided for comparison. (a) and (f) are two sampling patterns for the 4-input task depicted in Fig. 3.

| Test set | (a): Ours cnn blend | Ours conf blend | (f): Ours cnn blend | Ours conf blend |
|---|---|---|---|---|
| | | | | |
| Occlusions | | | | |
| Reflective | | | | |

The effectiveness of the blending strategy.
The blending strategy in our approach is designed to address the occlusion issues arising during the fusion of the images warped from different input SAIs. To validate the effectiveness of the proposed blending strategy, the intermediate results before and after blending are visualized in Fig. 10. It can be observed that the errors around occlusion boundaries in the intermediate images warped from different source SAIs are closely related to the locations of the source SAIs, and appear in different positions. The learned confidence maps are able to indicate these error areas in each warped image, and provide guidance for the fusion of the warped images. Blending under the guidance of the confidence maps helps remove these errors, while the correct regions of each warped image are preserved.

Moreover, to demonstrate the advantage of the proposed blending strategy, we quantitatively compared the blended results of our method and of the method used in [22]. First, we removed the refinement module from our model, so that the remaining view synthesis network consists of disparity estimation, warping and the confidence-based blending. We denote this model as Ours conf blend. Then, we replaced the confidence-based blending with the blending strategy used in [22], i.e., using convolutional layers to directly combine the warped images. This new model, denoted as Ours cnn blend, was trained on the same datasets as ours. In this way, the only difference between the two models is the blending mechanism. We compared their reconstruction quality, and the results are listed in Table 9, where it can be seen that Ours conf blend achieves higher PSNR/SSIM values than Ours cnn blend, validating the advantage of our confidence-based blending strategy.

The effectiveness of the refinement module.
To demonstrate the effectiveness of the refinement module, we quantitatively compared the quality of the LF images generated by our method without the refinement module against those generated with all modules; Table 8 lists the results. It can be seen that the refinement provides around 1 dB of PSNR improvement, which indicates that the refinement module efficiently exploits the complementary information between the synthesized SAIs and improves the intermediate LF images. Moreover, Fig. 6 shows the comparisons of the parallax content PR curves, which demonstrate that the refinement helps recover the LF parallax structure in the reconstructed densely-sampled LFs.

Input positions / Output positions    Wu et al. [23]  Wu et al. [24]  Kalantari et al. [22]  Ours (fixed)
Fig. 11: Visual comparisons of LF reconstruction with flexible output angular resolution. We present the results of dense reconstruction from the 4 corner SAIs of sampling grids with two different baselines (top and bottom). The center SAI of the LF images reconstructed by the different algorithms is presented. Horizontal and vertical EPIs corresponding to the colored lines are shown below the center SAI, and regions with obvious artifacts or blurring are highlighted with yellow boxes. It is recommended to view this figure by zooming in.

Center View  Ground truth  Sparse LF  Wu et al. [23]  Wu et al. [24]  Wang et al. [45]  Kalantari et al. [22]  Yeung et al. [25]  Ours (fixed)
Fig. 12: Visual comparisons of the depth estimation results (as depth is inversely proportional to disparity, we do not distinguish between them). From left to right: the center SAIs of the LF images, and the disparity maps estimated from the ground-truth densely-sampled LFs, from the sparsely-sampled LFs, and from the densely-sampled LFs reconstructed by the different algorithms. It is recommended to view this figure by zooming in.
6 APPLICATIONS
In this section, we discuss two applications that benefit from our accurate, flexible and efficient method for the reconstruction of densely-sampled LFs.
IBR aims at generating novel views from a set of captured images; a comprehensive review of IBR can be found in [63]. Among IBR techniques, LF rendering is attractive because novel views can be generated by straightforward interpolation, without the need for any geometric information, so that real-time rendering can be achieved.

TABLE 10: Quantitative comparisons (100 × MSE) of the depth estimated from the ground-truth densely-sampled LFs, the sparsely-sampled LFs, and the densely-sampled LFs reconstructed by different algorithms.

| LF image | Ground-truth LF | Sparse LF | Wu et al. [23] | Wu et al. [24] | Wang et al. [45] | Kalantari et al. [22] | Yeung et al. [25] | Ours (fixed) |
|---|---|---|---|---|---|---|---|---|
| Buddha | | | | | | | | |
| Buddha2 | | | | | | | | |
| StillLife | | | | | | | | |
| Papillon | | | | | | | | |
| Monasroom | | | | | | | | |

To produce novel views without ghosting artifacts, LF rendering requires the LF to be densely sampled, with disparities between neighboring views of less than 1 pixel [3]. Therefore, for a sparsely-sampled LF that does not meet this sampling requirement, our method can reconstruct a densely-sampled LF with the desired angular resolution to enable subsequent LF rendering. More generally, as our method is capable of generating novel views at arbitrary viewpoints from a set of sparsely-sampled SAIs, it can realize IBR directly.

To validate the effectiveness of our approach for the IBR application, we performed comparisons of dense reconstruction under different sampling baselines and output angular resolutions. Specifically, we compared the performance of the different algorithms when reconstructing densely-sampled LFs from 4 corner SAIs sampled at grids with two different baselines on the HCI dataset. As the ground-truth images are unavailable, we visually compared the center SAIs of the reconstructed LF images. Moreover, to compare the ability to preserve the LF parallax structure, horizontal and vertical EPIs are presented. Fig. 11 shows the results, and it can be observed that our method produces novel SAIs with sharp textures and EPIs with clear linear structures, even when the input sampling baselines are relatively large.
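The "straightforward interpolation" that LF rendering performs on a densely-sampled LF can be illustrated by bilinear interpolation in the angular domain; the array layout (angular row, angular column, spatial row, spatial column) and the function name are assumptions for illustration:

```python
import numpy as np

def render_view(lf, u, v):
    """Render a novel view at fractional angular position (u, v) by bilinear
    interpolation between the four surrounding SAIs of an LF with shape
    (V, U, Y, X). Valid only when the LF is sampled densely enough that
    neighboring-view disparities are sub-pixel [3]."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1 = min(u0 + 1, lf.shape[1] - 1)
    v1 = min(v0 + 1, lf.shape[0] - 1)
    au, av = u - u0, v - v0
    top = (1 - au) * lf[v0, u0] + au * lf[v0, u1]   # interpolate along u
    bot = (1 - au) * lf[v1, u0] + au * lf[v1, u1]
    return (1 - av) * top + av * bot                # then along v
```

When the sampling is too sparse for this to be ghost-free, the reconstruction network described in this paper serves as the upsampling step that restores the dense LF first.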
The value of an LF image lies in the implicitly encoded scene geometry information. By finding correspondences across different SAIs, depth maps can be estimated from LF images. A densely-sampled LF leads to more accurate and more robust depth inference, as matching points can be detected more easily and occlusion problems can be alleviated by the multiple viewpoints. Therefore, the proposed method can be used to enhance LF depth estimation.

Here, we present the depth maps estimated from sparsely-sampled LF images as well as those estimated from the densely-sampled LF images reconstructed by the different algorithms. The state-of-the-art depth estimation algorithm [5] was applied, and Fig. 12 shows the results. It can be observed that the reconstructed densely-sampled LFs enable better estimations than the sparsely-sampled ones, and that the depth maps from our method are more accurate than those from the others, especially in regions with fine details and occluded boundaries. Additionally, the high accuracy of the estimated depth maps further validates the advantage of our method in preserving the LF parallax structure.

Moreover, as shown in Table 10, we provide quantitative comparisons of the depth maps estimated from the different reconstructions. The mean squared error (MSE) between the estimated depth map and its ground truth was used to measure the accuracy. It can be seen that
Ours (fixed) produces the lowest MSE values over all five scenes. Notably, the MSE values of Ours (fixed) are even lower than those of the depth maps estimated from the ground-truth densely-sampled LFs on 4 out of 5 scenes. The reason is twofold: first, no method can guarantee perfectly accurate estimations, and the adopted depth estimation method [5] adapts better to the LFs reconstructed by our method; second, considerable noise is present in the raw LF images [57], and this noise may be suppressed to some extent by our reconstruction algorithm.
7 CONCLUSION AND FUTURE WORK
We have presented a novel learning-based algorithm for the reconstruction of densely-sampled LFs from sparsely-sampled ones. Owing to the deep, effective and comprehensive modeling of the unique LF parallax structure, including geometry-based SAI synthesis built on position-aware PSVs, the adaptive blending strategy and the efficient LF refinement network, our method removes the obstacles of arbitrary sampling patterns and sparse sampling, not only achieving over 4 dB improvement on synthetic data and 1 dB improvement on real-world data compared with state-of-the-art methods, but also better preserving the valuable LF parallax structure. Besides, we proposed a simple yet effective algorithm to optimize the sparse sampling pattern for better reconstruction quality. Last but not least, the potential of our method for improving subsequent LF-based applications has been validated and discussed.

During the sampling pattern optimization, we built a scene content-independent strategy, which considers only the overall distance between the novel views and the sampled ones and the distribution divergence of the samples. In fact, the optimal sampling pattern should vary with the scene content, such as the geometric complexity and textural information. In future work, we plan to predict a scene content-dependent optimized sampling pattern via a CNN trained with ground-truth optimal sampling patterns obtained via exhaustive search.

Another interesting line of future work is exploring the potential of the proposed framework for LF data compression. The huge size of LF data poses great challenges to both storage and transmission. In [64], an LF image is partitioned into key SAIs and non-key SAIs, and the non-key SAIs are compensated by reconstruction from the key SAIs; only the key SAIs and the residuals of the non-key SAIs are encoded.
Our framework, which adapts to flexible inputs, can naturally be utilized to optimize the selection of key SAIs so that the reconstruction quality of the non-key SAIs, and hence the compression performance, can be improved with the same number of key SAIs. Moreover, our experimental results have demonstrated that with the optimized sampling patterns, the number of key SAIs can be reduced without penalizing reconstruction performance, which means that the encoding bits for key SAIs can be saved. In the future, we will comprehensively study how the sampling pattern and the number of input views affect compression performance, and experimentally verify the application of the proposed framework to LF compression.

ACKNOWLEDGEMENT
We thank the authors of [45] for sharing their source code.

REFERENCES