Deep Selective Combinatorial Embedding and Consistency Regularization for Light Field Super-resolution
Jing Jin, Student Member, IEEE, Junhui Hou, Senior Member, IEEE, Zhiyu Zhu, Student Member, IEEE, Jie Chen, Member, IEEE, and Sam Kwong, Fellow, IEEE
Abstract—Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution, as the limited detector resolution has to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high-dimensionality characteristic and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited as they fail to thoroughly explore the coherence among LF sub-aperture images (SAIs) and are insufficient in accurately preserving the scene's parallax structure. To tackle this challenge, we propose a novel learning-based LF spatial SR framework. Specifically, each SAI of an LF image is first coarsely and individually super-resolved by exploring the complementary information among SAIs with selective combinatorial geometry embedding. To achieve efficient and effective selection of the complementary information, we propose two novel sub-modules conducted hierarchically: the patch selector provides an option of retrieving similar image patches based on offline disparity estimation to handle large-disparity correlations, and the SAI selector adaptively and flexibly selects the most informative SAIs to improve the embedding efficiency. To preserve the parallax structure among the reconstructed SAIs, we subsequently append a consistency regularization network trained over a structure-aware loss function to refine the parallax relationships over the coarse estimation. In addition, we extend the proposed method to irregular LF data. To the best of our knowledge, this is the first learning-based SR method for irregular LF data. Experimental results over both synthetic and real-world LF datasets demonstrate the significant advantage of our approach over state-of-the-art methods, i.e., our method not only improves the average PSNR/SSIM but also preserves more accurate parallax details.
Index Terms—Light field, deep learning, super-resolution, depth.
I. INTRODUCTION

4D light field (LF) images differ from conventional 2D images as they record not only intensities but also directions of light rays. The rich information enables a wide range of applications, such as 3D reconstruction [1]–[4], refocusing [5], and virtual reality [6], [7]. LF images can be conveniently captured with commercial micro-lens based cameras [8], [9], which parameterize 4D LFs with two planes.
This work was supported in part by the Hong Kong Research Grants Council under grants 9048123 (CityU 21211518) and 9042820 (CityU 11219019), and in part by the Basic Research General Program of Shenzhen Municipality under grant JCYJ20190808183003968. (Corresponding author: J. Hou.) J. Jin, J. Hou, Z. Zhu, and S. Kwong are with the Department of Computer Science, City University of Hong Kong, Hong Kong. (E-mail: {jingjin25-c, zhiyuzhu2-c}@my.cityu.edu.hk; {jh.hou, cssamk}@cityu.edu.hk) J. Chen is with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. (E-mail: [email protected])

However, due to the limited sensor resolution, recorded LF images always suffer from low spatial resolution. Therefore, LF spatial super-resolution (SR) is highly necessary for subsequent applications [10].

Some traditional methods for LF spatial SR have been proposed [11]–[13]. Due to the high dimensionality of LF data, the reconstruction quality of these methods is quite limited. Recently, some learning-based methods [14]–[16] have been proposed to address the problem of 4D LF spatial SR via data-driven training. Although these methods have improved both performance and efficiency, two problems remain unsolved: the complementary information within all sub-aperture images (SAIs) is not well utilized, and the structural consistency of the reconstruction is not well preserved (see more analyses in Sec. III).

In this paper, we propose a learning-based method for LF spatial SR, focusing on addressing the two problems of complete complementary information fusion and LF parallax structure preservation. As shown in Fig. 1, our approach consists of two modules, i.e., a coarse SR module via selective combinatorial embedding and a refinement module via structural consistency regularization. Specifically, the coarse SR module separately super-resolves individual SAIs by learning combinatorial correlations and fusing the complementary information of different SAIs in a selective fashion, giving a coarse super-resolved LF image. To select the complementary information both effectively and efficiently, we propose a plug-and-play patch selector, which is capable of locating similar patches for large-disparity LFs based on the offline disparity prediction, and an SAI selector, which can adaptively and flexibly select auxiliary SAIs. The refinement module exploits the spatial-angular geometry coherence among the coarse result, and enforces the structural consistency in the high-resolution space. More precisely, we adopt alternate spatial-angular convolutional layers with dense connections to extract deep features from the 4D LF data efficiently and effectively, in combination with a structure-aware loss based on the EPI gradient to constrain the final reconstruction. Extensive experimental results on both real-world and synthetic datasets demonstrate the significant advantage of our method. That is, as shown in Fig. 2, our method produces much higher PSNR/SSIM at a moderate speed, compared with state-of-the-art methods.
Fig. 1: The flowchart of the proposed approach and illustration of the detailed architectures of the coarse SR and refinement modules. The coarse SR module takes advantage of the complementary information of selective SAIs of an LF image by learning their combinatorial correlations with the target SAI. At the same time, the unique details of each individual SAI are also well retained. We propose a patch selector to handle large-disparity LFs. The refinement module recovers the SAI consistency among the resulting coarse LF image by exploring the spatial-angular relationships and a structure-aware loss.

A preliminary version of this work has been published in CVPR 2020 [17]. In this paper, we further improve the effectiveness and efficiency of the preliminary model, and the additional technical contributions are listed as follows:
• the model of [17] utilizes all SAIs to super-resolve each individual SAI of an LF image. Although such a manner is able to exploit the complementary information among SAIs maximally, huge computational and memory costs are required as the angular resolution of the LF image increases. To this end, we propose a novel SAI selector, which is capable of adaptively selecting a certain number of informative SAIs, such that the reconstruction quality and the computational and memory costs can be balanced in a flexible fashion. Based on the SAI selector, we also investigate the relationship between the computational cost and the reconstruction quality to offer guidance in practice;
• to cover the corresponding areas between different SAIs, the model of [17] relies on increasing the receptive field of the network, which requires a deeper model with more parameters to ensure the performance when the disparity of the LF image increases. To address this drawback, we propose a novel plug-and-play patch selector based on the offline disparity prediction to align different SAIs at the patch level, so that the model can easily handle LFs with large disparities while avoiding deepening the network architecture;
• we modify the residual blocks to reduce the parameter number of the coarse SR module without compromising reconstruction performance, and introduce dense connections to improve the refinement module;
• and we extend the proposed SR method to irregular LF data and verify its effectiveness (see Sec. V-D). To the best of our knowledge, this is the first learning-based method for irregular LF SR.

The rest of this paper is organized as follows. Sec. II briefly introduces the two-plane representation of 4D LFs and comprehensively reviews existing methods for LF spatial SR.
Fig. 2: Comparisons of the running time (in seconds) and reconstruction quality (PSNR/SSIM) of different methods, i.e., PCA-RR (SSIM 0.944), GB (0.952), EDSR (0.968), ResLF (0.973), LF-SAS (0.977), LF-ATO (0.979), and Ours (0.980). The LF images were super-resolved with a scale factor of 2. Deep learning based methods, i.e., EDSR [18], ResLF [16], LF-SAS [19], LF-ATO [17], and Ours, were implemented on a GPU. The PSNR/SSIM value refers to the average over 108 LF images in the Stanford Lytro Archive dataset [20].

Sec. III presents the motivation of our method, followed by the proposed framework in Sec. IV. In Sec. V, extensive experiments and comparisons are carried out to evaluate the performance of the proposed model, together with comprehensive ablation studies of the individual modules. Finally, Sec. VI concludes this paper.
II. RELATED WORK
A. Two-plane Representation of 4D LFs
The 4D LF is commonly represented using the two-plane parameterization, as illustrated in Fig. 3. Each light ray is determined by its intersections with two parallel planes, i.e., a spatial plane $(x, y)$ and an angular plane $(u, v)$. Let $L(\mathbf{x}, \mathbf{u})$ denote a 4D LF image, where $\mathbf{x} = (x, y)$ and $\mathbf{u} = (u, v)$. An SAI, denoted as $L_{\mathbf{u}^*} = L(\mathbf{x}, \mathbf{u}^*)$, is a 2D slice of the LF image at a fixed angular position $\mathbf{u}^*$. The SAIs at different angular positions capture the 3D scene from slightly different viewpoints.

Under the Lambertian assumption, projections of the same scene point have the same intensity in different SAIs. This geometric relation leads to a particular LF parallax structure, which can be formulated as:

$$L_{\mathbf{u}}(\mathbf{x}) = L_{\mathbf{u}'}(\mathbf{x} + d(\mathbf{u}' - \mathbf{u})), \quad (1)$$

where $d$ is the disparity of the point $L(\mathbf{x}, \mathbf{u})$. The most straightforward representation of the LF parallax structure is the epipolar-plane image (EPI). Specifically, each EPI is a 2D slice of the 4D LF at one fixed spatial and one fixed angular position, and consists of straight lines whose slopes correspond to scene points at different depths.
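To make Eq. (1) concrete, the short sketch below resamples one SAI into the viewpoint of another, given a per-pixel disparity map. It is only an illustration of the parallax relation; in particular, the pairing of the (x, y) image axes with the (u, v) angular axes is an assumption that may need to be transposed for a particular dataset.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_sai(sai_src, d, u_src, u_tgt):
    """Synthesize the SAI at angular position u_tgt from the SAI at u_src
    via Eq. (1): L_u(x) = L_{u'}(x + d (u' - u)).
    sai_src : (H, W) source SAI L_{u'};  d : (H, W) disparity map
    u_src, u_tgt : angular positions u' and u as (u, v) tuples (assumed axes)."""
    H, W = sai_src.shape
    y, x = np.mgrid[0:H, 0:W].astype(np.float64)
    du, dv = u_src[0] - u_tgt[0], u_src[1] - u_tgt[1]
    # sample the source SAI at x + d * (u' - u); columns move with u, rows with v
    coords = np.stack([y + d * dv, x + d * du])
    return map_coordinates(sai_src, coords, order=1, mode='nearest')
```

Under the Lambertian assumption, the returned image should closely match the SAI actually captured at u_tgt, except in occluded regions.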
B. Single Image SR

Single image SR (SISR) is a classical problem in the field of image processing. To solve this ill-posed inverse problem, traditional methods explore various image priors based on intuitive image understanding [21] or natural image statistics [22]. Inspired by the great success of deep convolutional neural networks (CNNs) on image classification [23], Dong et al. [24] pioneered deep learning based methods for SISR, and interpreted the CNN as the counterpart of sparse coding. Afterwards, various CNNs with deeper architectures [18], [25]–[27] were proposed and achieved significantly better performance than traditional methods. More recently, non-local attention has been used to explore long-distance spatial contextual information [28], [29]. We refer the reader to [30], [31] for a comprehensive review of SISR.
C. LF Spatial SR
As multiple SAIs are available in LF images, the correlations between them can be used to directly constrain the inverse problem, and the complementary information between them can greatly improve the performance of SR. Existing methods for LF spatial SR can be classified into two categories: optimization-based and learning-based methods.

Traditional LF spatial SR methods physically model the relations between SAIs based on estimated disparities, and then formulate SR as an optimization problem. Bishop and Favaro [32] first estimated the disparity from the LF image, and then used it to build an image formation model, which was employed to formulate a variational Bayesian framework for SR. Wanner and Goldluecke [11], [33] applied the structure tensor on EPIs to estimate disparity maps, which were employed in a variational framework for spatial and angular SR. Mitra and Veeraraghavan [13] proposed a common framework for LF processing, which models LF patches using a Gaussian mixture model conditioned on their disparity values. To avoid the requirement of precise disparity estimation, Rossi and Frossard [12] proposed to regularize the problem using a graph-based prior, which explicitly enforces the LF geometric structure.

Fig. 3: Illustration of the two-plane representation of 4D LFs.

Learning-based methods exploit the cross-SAI redundancies and utilize the complementary information between SAIs to learn the mapping from low-resolution to high-resolution SAIs. Farrugia [34] constructed a dictionary of examples from 3D patch-volumes extracted from pairs of low-resolution and high-resolution LFs. A linear mapping function is then learned via multivariate ridge regression between the subspaces of these patch-volumes, which is directly applied to super-resolve the low-resolution LF images. The recent success of CNNs in single image SR [24], [26], [31] inspired many learning-based methods for LF spatial SR. Yoon et al. [14], [35] first proposed to use CNNs to process LF data. They used a network with an architecture similar to that in [24] to improve the spatial resolution of neighboring SAIs, which were then used to interpolate novel SAIs for angular SR. Wang et al. [15] used a bidirectional recurrent CNN to sequentially model correlations between horizontally or vertically adjacent SAIs. The predictions of the horizontal and vertical sub-networks are combined using the stacked generalization technique. Zhang et al. [16] proposed a residual network to super-resolve the SAIs of LF images. Similar to [36], SAIs along four directions are first stacked and fed into different branches to extract sub-pixel correlations. Then the residual information from different branches is integrated for the final reconstruction. However, the performance on side SAIs is significantly degraded compared with the central SAI, as only a few SAIs can be utilized, which results in undesired inconsistency in the reconstructed LF images. Additionally, this method requires various models suited to SAIs at different angular positions, e.g., 6 models for a 7×7 LF image, which makes practical storage and application harder. Yeung et al. [19] used the alternate spatial-angular convolution to super-resolve all SAIs of the LF in a single forward pass.

III. MOTIVATION
Given a low-resolution LF image, denoted as $L^{lr} \in \mathbb{R}^{H \times W \times M \times N}$, LF spatial SR aims at reconstructing a super-resolved LF image close to the ground-truth high-resolution LF image $L^{hr} \in \mathbb{R}^{\alpha H \times \alpha W \times M \times N}$, where $H \times W$ is the spatial resolution, $M \times N$ is the angular resolution, and $\alpha$ is the upsampling factor. We believe the following two issues are paramount for high-quality LF spatial SR: (1) thorough exploration of the complementary information among SAIs; and (2) strict regularization of the LF parallax structure. In what follows, we discuss these issues in more detail, which will shed light on the proposed method.

Fig. 4: Illustration of different network architectures for the fusion of the complementary information among SAIs. (a) LFCNN [14], (b) LFNet [15], (c) ResLF [16], (d) LF-SAS [19], and (e) our proposed SR method via selective combinatorial embedding. Colored boxes represent images or feature maps of different SAIs. Among them, red-framed boxes are SAIs to be super-resolved, and blue boxes are SAIs whose information is utilized.

A. Complementary Information among Small-disparity SAIs
An LF image contains multiple observations of the same scene from slightly varying angles. Due to occlusions, non-Lambertian reflections, and other factors, the visual information is asymmetric among these observations. In other words, information absent in one SAI may be captured by another one; hence all SAIs are potentially helpful for high-quality SR.

Traditional optimization-based methods [11]–[13], [33] typically model the relationships among SAIs using explicit disparity maps. Inaccurate disparity estimation in occluded or non-Lambertian regions will induce artifacts, and the correction of such artifacts is beyond the capabilities of these optimization-based models. Instead, recent learning-based methods, such as LFCNN [35], LFNet [15], ResLF [16], and LF-SAS [19], explore the complementary information among SAIs through data-driven training. Although these methods improve both the reconstruction quality and the computational efficiency, the complementary information among SAIs has not been fully exploited due to the limitation of their SAI fusion mechanisms. Fig. 4 shows the architectures of different SAI fusion approaches. LFCNN only uses neighboring SAIs in a pair or square, while LFNet only takes SAIs in a horizontal or vertical 3D LF. ResLF considers 4D structures by constructing directional stacks, which leaves SAIs not located on the "star" shape unutilized.

An intuitive way to fully take advantage of the cross-SAI information is stacking the images or features of all SAIs, feeding them into a deep network, and predicting the high-frequency details for all SAIs simultaneously. However, this method will compromise unique details that belong only to individual SAIs, since it is the average error over all SAIs that is optimized during network training. The recent advanced learning-based method LF-SAS [19] adopts this manner, using 4D or 2D spatial-angular separable convolutional layers to explore the cross-SAI information, and achieves state-of-the-art performance. However, as the angular convolution is applied on a regular grid, less information is provided for the corner SAIs, leading to performance degradation at the corners.

To this end, we propose a novel fusion strategy for LF SR, which super-resolves each individual SAI by combining the information from combinatorial geometry embedding with different SAIs. Moreover, to reduce possible redundancy of the auxiliary information and save time and memory costs, we further propose an adaptive SAI selector, which flexibly selects the auxiliary SAIs based on their spatial correlations.
B. Complementary Information among Large-disparity SAIs
For LF images with large disparities, corresponding pixels in different SAIs are relatively far apart in the spatial coordinates. A shallow network has a limited capacity to capture such complementary information. Although deepening the model may address this issue, it also leads to more parameters. Moreover, it makes it difficult to apply a unified model to LF image datasets with different disparity ranges. Existing learning-based methods for LF spatial SR neglect this problem, resulting in limited performance on LF images with large disparities. In video SR [37], a similar issue is addressed by motion compensation based on estimated flow maps. However, the pixel-level correction inevitably introduces artifacts.

In this paper, we propose a patch selector to cope with large disparities without changing the network architecture. The patch selector can be included in training and testing according to the requirements, i.e., we can train the network with the patch selector, and only apply it to LFs with large disparities during testing. We explicitly take advantage of the depth information of the LF image to locate corresponding patches in different SAIs, and directly extract the complementary information between them for SR reconstruction.
C. LF Parallax Structure
As the most important property of an LF image, the parallax structure should be well preserved after SR. Generally, existing methods promote the fidelity of this structure by enforcing corresponding pixels to share similar intensity values. Specifically, traditional methods employ particular regularization in the optimization formulation, such as the low-rank [38] and graph-based [12] regularizers. Farrugia and Guillemot [39] first used optical flow to align all SAIs and then super-resolved them simultaneously via an efficient CNN. However, the disparity between SAIs needs to be recovered by warping and inpainting afterwards, which causes inevitable high-frequency loss. For most learning-based methods [15], [16], the cross-SAI correlations are only exploited in the low-resolution space, while the consistency in the high-resolution space is not well modeled. See the quantitative verification in Sec. V-B.

We address the challenge of LF parallax structure preservation with a subsequent refinement module applied to the coarse high-resolution results. Specifically, an additional network is applied to explore the spatial-angular geometry coherence in the high-resolution space, which models the parallax structure implicitly. Moreover, we use a structure-aware loss function defined on EPIs, which not only enforces SAI consistency but also models the inconsistency in non-Lambertian regions.

IV. PROPOSED METHOD
As illustrated in Fig. 1, our framework consists of a coarse SR module, which super-resolves each SAI of an LF image individually by fusing the selective combinatorial embedding, and a refinement module, which enforces the LF parallax structure of the reconstructed LF image via structural consistency regularization. In what follows, we detail each module.
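As a roadmap for the following subsections, here is a minimal PyTorch-style sketch of how the two modules compose at inference time; sai_selector, coarse_sr, and refine_net are hypothetical placeholders for the components detailed below, not the authors' exact interfaces.

```python
import torch

@torch.no_grad()
def super_resolve_lf(lf_lr, sai_selector, coarse_sr, refine_net, k=9):
    """Hypothetical top-level inference pass of the two-module framework.
    lf_lr: (M*N, 1, H, W) tensor holding all low-resolution SAIs."""
    coarse = []
    for r in range(lf_lr.shape[0]):                    # each SAI takes turns as the target
        target = lf_lr[r:r + 1]
        aux_idx, scores = sai_selector(target, lf_lr, k)          # adaptive SAI selection
        coarse.append(coarse_sr(target, lf_lr[aux_idx], scores))  # selective embedding SR
    coarse_lf = torch.cat(coarse, dim=0)               # (M*N, 1, aH, aW) coarse estimate
    return refine_net(coarse_lf)                       # structural consistency regularization
```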
A. Coarse SR via Selective Combinatorial Embedding
Let $L^{lr}_{\mathbf{u}_r}$ denote a typical SAI to be super-resolved, and $\{L^{lr}_{\mathbf{u}_a}\}$ the set of $k$ ($k \leq MN$) auxiliary SAIs providing complementary information, i.e., $|\{L^{lr}_{\mathbf{u}_a}\}| = k$. In the coarse SR module, we first determine $\{L^{lr}_{\mathbf{u}_a}\}$ adaptively via an SAI selector, and then leverage the complementary information of these SAIs to assist the SR of $L^{lr}_{\mathbf{u}_r}$. In addition, we also propose a patch selector, which is optionally applied to LFs with large disparities.
1) Adaptive SAI selector:
In this module, we use a network to distinguish, among all SAIs $\{L^{lr}_{\mathbf{u}}\}$ of the LF, a certain number of SAIs that make the most contributions to the SR of $L^{lr}_{\mathbf{u}_r}$, and select them as the auxiliary SAIs $\{L^{lr}_{\mathbf{u}_a}\}$. The selection is learned based on the relations of image content between $L^{lr}_{\mathbf{u}_r}$ and $L^{lr}_{\mathbf{u}}$.

As shown in Fig. 1, $L^{lr}_{\mathbf{u}_r}$ and each SAI in $\{L^{lr}_{\mathbf{u}}\}$ are concatenated in a pairwise manner, and a convolutional network followed by an adaptive average pooling layer is applied to produce a scalar score for each pair:

$$s_{\mathbf{u}} = f_s(L^{lr}_{\mathbf{u}_r}, L^{lr}_{\mathbf{u}}), \quad (2)$$

where $s_{\mathbf{u}}$ is the scalar score representing the degree of correlation between $L^{lr}_{\mathbf{u}_r}$ and $L^{lr}_{\mathbf{u}}$. A higher score indicates that the corresponding SAI is expected to contribute more to the SR of $L^{lr}_{\mathbf{u}_r}$. Then, the SAIs in $\{L^{lr}_{\mathbf{u}}\}$ that produce the top-$k$ scores are retained as $\{L^{lr}_{\mathbf{u}_a}\}$:

$$\{\mathbf{u}_a\} = \arg \operatorname{top-}k_{\mathbf{u}} \{s_{\mathbf{u}}\}, \quad (3)$$

where $\{\mathbf{u}_a\}$ are the angular positions of the selected $\{L^{lr}_{\mathbf{u}_a}\}$. To make the top-$k$ function differentiable for back-propagation, the scores are multiplied with the features in the subsequent SR sub-module. The loss then penalizes the scores of the selected SAIs in each iteration to dynamically adjust their ranking. Moreover, to enable a flexible number of selected SAIs, i.e., the value of $k$, a max-pooling layer is applied to the features before the all-SAI fusion. Some examples of the selected results are presented in Fig. 9.
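A minimal PyTorch sketch of Eqs. (2) and (3) is given below; the width and depth of the scoring network $f_s$ are illustrative assumptions, as the text does not bind the selector to these exact layer sizes.

```python
import torch
import torch.nn as nn

class SAISelector(nn.Module):
    """Sketch of the adaptive SAI selector realizing Eqs. (2)-(3)."""
    def __init__(self, nf=32):
        super().__init__()
        self.f_s = nn.Sequential(
            nn.Conv2d(2, nf, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nf, nf, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nf, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(1))                   # one scalar score per pair

    def forward(self, target, all_sais, k):
        # target: (1, 1, H, W); all_sais: (MN, 1, H, W)
        pairs = torch.cat([target.expand_as(all_sais), all_sais], dim=1)  # (MN, 2, H, W)
        scores = self.f_s(pairs).flatten()             # s_u of Eq. (2)
        top_scores, idx = torch.topk(scores, k)        # arg top-k of Eq. (3)
        # multiplying top_scores with downstream features keeps the selection differentiable
        return idx, top_scores
```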
2) Combinatorial geometry embedding:
This module focuses on extracting the complementary information from $\{L^{lr}_{\mathbf{u}_a}\}$ to assist the SR of $L^{lr}_{\mathbf{u}_r}$. As shown in Fig. 1, four sub-phases are involved, i.e., per-SAI feature extraction, combinatorial correlation learning, all-SAI fusion, and upsampling.

Per-SAI feature extraction. We first extract deep features, denoted as $F_{\mathbf{u}}$, from $L^{lr}_{\mathbf{u}_r}$ and $\{L^{lr}_{\mathbf{u}_a}\}$ separately, i.e.,

$$F_{\mathbf{u}_r} = f_1(L^{lr}_{\mathbf{u}_r}), \quad F_{\mathbf{u}_a} = f_1(L^{lr}_{\mathbf{u}_a}). \quad (4)$$

Inspired by the excellent performance of residual blocks [25], [40], which learn residual mappings by incorporating the identity shortcut, we use them for deep feature extraction. Note that, different from the 'conv-relu-conv' structure used in [17], our residual blocks consist of only one layer of convolution and ReLU, and use the pre-activation proposed in [41]. This modification greatly reduces the parameter number while keeping the performance on par with the residual block used in [17]. The feature extraction network $f_1(\cdot)$ contains one convolutional layer and $n_1$ residual blocks. The parameters of $f_1(\cdot)$ are shared across all SAIs.

Combinatorial correlation learning. The geometric correlations between $L^{lr}_{\mathbf{u}_r}$ and $L^{lr}_{\mathbf{u}_a}$ vary with their angular positions $\mathbf{u}_r$ and $\mathbf{u}_a$. To make our model compatible with all SAIs at different $\mathbf{u}_r$ in the LF, we use the network $f_2(\cdot)$ to learn the correlations between the features of a pair of SAIs $\{F_{\mathbf{u}_1}, F_{\mathbf{u}_2}\}$, where the angular positions $\mathbf{u}_1$ and $\mathbf{u}_2$ can be arbitrarily selected. Based on the correlations between $F_{\mathbf{u}_1}$ and $F_{\mathbf{u}_2}$, $f_2(\cdot)$ is designed to extract information from $F_{\mathbf{u}_2}$ and embed it into the features of $F_{\mathbf{u}_1}$. Here, $\mathbf{u}_1$ is set to the angular position of the target SAI, and $\mathbf{u}_2$ can be the position of any auxiliary SAI. Thus the output can be written as:

$$F^{\mathbf{u}_a}_{\mathbf{u}_r} = f_2(F_{\mathbf{u}_r}, F_{\mathbf{u}_a}), \quad (5)$$

where $F^{\mathbf{u}_a}_{\mathbf{u}_r}$ denotes the features of $L^{lr}_{\mathbf{u}_r}$ incorporated with the information of $L^{lr}_{\mathbf{u}_a}$. The network $f_2(\cdot)$ consists of a concatenation operator to combine the features $F_{\mathbf{u}_r}$ and $F_{\mathbf{u}_a}$ as inputs, and a convolutional layer followed by $n_2$ residual blocks. The ability of $f_2(\cdot)$ to handle an arbitrary pair of SAIs is naturally learned by accepting the target SAI and all auxiliary SAIs in each training iteration.

All-SAI fusion. The output of $f_2(\cdot)$, i.e., $\{F^{\mathbf{u}_{a_i}}_{\mathbf{u}_r} \,|\, i = 1, \cdots, k\}$, is a stack of features with embedded geometry information from $\{L^{lr}_{\mathbf{u}_a}\}$. These features have been trained to align with $L^{lr}_{\mathbf{u}_r}$, so they can be fused directly. To enable a flexible number of auxiliary SAIs, we apply a max-pooling layer along the SAI dimension of the feature stack, which produces a smaller set of SAI-level features without loss of information, i.e., $\{F'^{\mathbf{u}_{a_j}}_{\mathbf{u}_r} \,|\, j = 1, \cdots, p\}$, where $p \leq k$. After that, the fusion process can be formulated as:

$$F_{\mathbf{u}_r} = f_3(F'^{\mathbf{u}_{a_1}}_{\mathbf{u}_r}, \cdots, F'^{\mathbf{u}_{a_p}}_{\mathbf{u}_r}). \quad (6)$$

To fuse these features, we first combine them channel-wise, i.e., we combine the feature maps at the same channel across all SAIs. Then, all channel maps are used to extract deeper features. The network $f_3(\cdot)$ consists of one convolutional layer, $n_3$ residual blocks for channel-wise SAI fusion, and $n_4$ residual blocks for channel fusion.

Upsampling. We use an architecture similar to residual learning in SISR [25]. To reduce the memory consumption and computational complexity, all feature learning and fusion are conducted in the low-resolution space. The fused features are upsampled using the efficient sub-pixel convolutional layer [42], and a residual map is then reconstructed by a subsequent convolutional layer $f_4(\cdot)$. The final reconstruction is produced by adding the residual map to the upsampled image:

$$L^{sr}_{\mathbf{u}_r} = f_4(U_s(F_{\mathbf{u}_r})) + U_b(L^{lr}_{\mathbf{u}_r}), \quad (7)$$

where $U_s(\cdot)$ is the sub-pixel convolutional layer and $U_b(\cdot)$ is the bicubic interpolation process.

Loss function. The coarse SR module super-resolves $L^{lr}_{\mathbf{u}_r}$ individually, and the output $\widehat{L}^{sr}_{\mathbf{u}_r}$ is trained to approach the ground-truth high-resolution image $L^{hr}_{\mathbf{u}_r}$. We use the $\ell_1$ error between them to define the loss function:

$$\ell_v = \| \widehat{L}^{sr}_{\mathbf{u}_r} - L^{hr}_{\mathbf{u}_r} \|_1. \quad (8)$$
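The following sketch illustrates two of the building blocks above: the single-convolution pre-activation residual block and the sub-pixel upsampling head of Eq. (7). Treating the input as a single-channel (Y) image and the exact layer shapes are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PreActResBlock(nn.Module):
    """Modified residual block: one pre-activated convolution (ReLU-Conv)
    plus the identity shortcut, instead of the 'conv-relu-conv' of [17]."""
    def __init__(self, nf=64):
        super().__init__()
        self.body = nn.Sequential(nn.ReLU(inplace=True),
                                  nn.Conv2d(nf, nf, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Upsampler(nn.Module):
    """Sub-pixel upsampling head realizing Eq. (7); alpha is the SR factor."""
    def __init__(self, nf=64, alpha=2):
        super().__init__()
        self.up = nn.Sequential(nn.Conv2d(nf, nf * alpha ** 2, 3, padding=1),
                                nn.PixelShuffle(alpha))      # U_s(.)
        self.f4 = nn.Conv2d(nf, 1, 3, padding=1)             # residual map
        self.alpha = alpha

    def forward(self, feat, lr_img):
        bic = F.interpolate(lr_img, scale_factor=self.alpha,
                            mode='bicubic', align_corners=False)  # U_b(.)
        return self.f4(self.up(feat)) + bic
```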
3) Disparity-based patch selector:
To address the performance limitation on LFs with large disparities, we propose a disparity-based patch selector. This module aligns patches in different SAIs before the SAI selector by taking advantage of the disparity map of the central SAI of an input LF, obtained via an offline disparity prediction method [43]. We specifically design the training and testing strategies to make the proposed patch selector a plug-and-play module, which can be optionally applied to LFs with large disparities without changing the network architecture.

Suppose the disparity map of the central SAI is denoted as $D_{\mathbf{u}_c}$. To reduce the influence of disparity estimation errors and non-Lambertian areas, we utilize the disparity map to locate the correspondences at the patch level instead of the pixel level. During training, we first randomly crop a patch centered at $\mathbf{x}_{\mathbf{u}_c}$ in the central SAI, and then calculate the patch-level disparity $d_p$ by averaging over the corresponding patch of the disparity map, i.e.,

$$d_p = \frac{1}{|P_{\mathbf{x}_{\mathbf{u}_c}}|} \sum_{\mathbf{x} \in P_{\mathbf{x}_{\mathbf{u}_c}}} D_{\mathbf{u}_c}(\mathbf{x}), \quad (9)$$

where $P_{\mathbf{x}_{\mathbf{u}_c}}$ is the patch centered at $\mathbf{x}_{\mathbf{u}_c}$, and $|P_{\mathbf{x}_{\mathbf{u}_c}}|$ is the number of pixels in $P_{\mathbf{x}_{\mathbf{u}_c}}$. Based on $d_p$, patches in the other SAIs of the input LF are cropped to produce candidates for the auxiliary patches, i.e., the central point of the patch at SAI $\mathbf{u}$ is computed as:

$$\mathbf{x}_{\mathbf{u}} = \mathbf{x}_{\mathbf{u}_c} + d_p(\mathbf{u}_c - \mathbf{u}). \quad (10)$$

As an example, Fig. 5 visualizes the effect of the patch selector. Before applying the patch selector, there is an obvious translation between patches of different SAIs caused by a relatively large disparity, whereas after applying the patch selector, the contents of the different patches are well aligned. In this way, the patches fed to the network are almost aligned, so that the network can easily learn their correlations and extract the complementary information.

Fig. 5: Visualization of the effect of the proposed disparity-based patch selector: (a) before and (b) after applying the patch selector. Framed regions are cropped as patches for training or testing.

During testing, each SAI of the LF takes turns to be the target SAI, and thus the disparity maps at all SAIs are required to reconstruct the LF image. Considering that most disparity estimation algorithms for LFs only produce the central disparity map, and that the disparity map at a different SAI also keeps the geometric relation described in Eq. (1), we generate the disparity maps at the other SAIs by forward-warping the central-SAI disparity map. After that, patches at each target SAI can be super-resolved similarly to the training process.
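A NumPy sketch of Eqs. (9) and (10) is given below. The dictionary-based LF layout and the pairing of image axes with angular axes are assumptions for illustration; border handling (patches falling outside the image) is omitted.

```python
import numpy as np

def select_patches(lf, disp_c, x_c, y_c, ps, u_c=(0, 0)):
    """Disparity-based patch selector, Eqs. (9)-(10).
    lf     : dict mapping angular position (u, v) -> (H, W) SAI
    disp_c : (H, W) disparity map of the central SAI
    (x_c, y_c) : patch centre in the central SAI; ps : odd patch size."""
    h = ps // 2
    # Eq. (9): patch-level disparity = mean disparity over the central patch
    d_p = disp_c[y_c - h:y_c + h + 1, x_c - h:x_c + h + 1].mean()
    patches = {}
    for (u, v), sai in lf.items():
        # Eq. (10): patch centre in SAI u is x_{u_c} + d_p (u_c - u)
        xu = int(round(x_c + d_p * (u_c[0] - u)))
        yu = int(round(y_c + d_p * (u_c[1] - v)))
        patches[(u, v)] = sai[yu - h:yu + h + 1, xu - h:xu + h + 1]
    return patches
```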
B. Refinement via Structural Consistency Regularization
We apply structural consistency regularization to the coarse results produced by the coarse SR module. This refinement module employs the efficient alternate spatial-angular convolution to implicitly model the cross-SAI correlations among the coarse LF images. In addition, a structure-aware loss function defined on EPIs is used to enforce the structural consistency of the final reconstruction.
1) Efficient alternate spatial-angular convolution:
To regularize the LF parallax structure, an intuitive method is to use 4D or 3D convolutions. However, 4D or 3D CNNs result in a significant increase in the parameter number and computational complexity. To improve the efficiency while still exploring the spatial-angular correlations, we adopt the alternate spatial-angular convolution [19], [44], [45], which handles the spatial and angular dimensions in an alternating manner with 2D convolutions.

In our regularization network, we use $n_5$ layers of alternate spatial-angular convolutions. Specifically, for the coarse result $\widehat{L}^{sr} \in \mathbb{R}^{\alpha H \times \alpha W \times M \times N}$, we first extract features from each SAI separately and construct a stack of spatial features, i.e., $F_s \in \mathbb{R}^{\alpha H \times \alpha W \times c \times MN}$, where $c$ is the number of feature maps. Then we apply 2D spatial convolutions on $F_s$. The output features are reshaped into stacks of angular patches, i.e., $F_a \in \mathbb{R}^{M \times N \times c \times \alpha^2 HW}$, on which angular convolutions are applied. Afterwards, the features are reshaped back for spatial convolutions, and the 'Spatial Conv-Reshape-Angular Conv-Reshape' process repeats $n_5$ times. Moreover, to enhance the information flow in the network, we add dense connections [46] between the spatial-angular convolution stages.

TABLE I: The details of the datasets used for evaluation.

Real-world: Stanford Lytro [20], Kalantari Lytro [47], Stanford Gantry [48]
Synthetic: Synthetic [49], [50], HCI old [51]
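Returning to the alternate spatial-angular convolution, its reshaping logic can be sketched in PyTorch as follows; the 3×3 angular kernel is an illustrative choice, and the dense connections across stages are omitted here for brevity.

```python
import torch.nn as nn

class AltSpatialAngularConv(nn.Module):
    """One 'Spatial Conv-Reshape-Angular Conv-Reshape' stage. Features are kept
    as (B, c, M, N, H, W); 2D convs are applied by folding the complementary
    dimensions into the batch axis. Assumes M, N >= 3 for the angular conv."""
    def __init__(self, c=64):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.angular = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        b, c, m, n, h, w = x.shape
        # spatial convolution applied to each SAI independently
        xs = x.permute(0, 2, 3, 1, 4, 5).reshape(b * m * n, c, h, w)
        xs = self.spatial(xs).reshape(b, m, n, c, h, w)
        # angular convolution applied to each angular patch independently
        xa = xs.permute(0, 4, 5, 3, 1, 2).reshape(b * h * w, c, m, n)
        xa = self.angular(xa).reshape(b, h, w, c, m, n)
        return xa.permute(0, 3, 4, 5, 1, 2)   # back to (B, c, M, N, H, W)
```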
C. Structure-aware Loss Function
The objective function is defined as the $\ell_1$ error between the predicted LF image and the ground truth:

$$\ell_r = \| \widehat{L}^{rf} - L^{hr} \|_1, \quad (11)$$

where $\widehat{L}^{rf}$ is the final reconstruction by the refinement module.

A high-quality LF reconstruction shall have strictly linear patterns on the EPIs. Therefore, to further enhance the parallax consistency, we add additional constraints on the output EPIs. Specifically, we incorporate the EPI gradient loss, which computes the $\ell_1$ distance between the gradients of the EPIs of our final output and those of the ground-truth LF, into the training of the refinement module. The gradients are computed along both spatial and angular dimensions on both horizontal and vertical EPIs:

$$\ell_e = \|\nabla_x \widehat{E}_{y,v} - \nabla_x E_{y,v}\|_1 + \|\nabla_u \widehat{E}_{y,v} - \nabla_u E_{y,v}\|_1 + \|\nabla_y \widehat{E}_{x,u} - \nabla_y E_{x,u}\|_1 + \|\nabla_v \widehat{E}_{x,u} - \nabla_v E_{x,u}\|_1, \quad (12)$$

where $\widehat{E}_{y,v}$ and $\widehat{E}_{x,u}$ denote EPIs of the reconstructed LF images, and $E_{y,v}$ and $E_{x,u}$ denote EPIs of the ground-truth LF images.
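A sketch of Eq. (12) in PyTorch is shown below, assuming the reconstructed and ground-truth LFs are stored as (M, N, H, W) tensors with the angular dimensions first. Because the gradients are taken along single axes, differencing the full 4D tensors is equivalent to differencing every horizontal and vertical EPI; the mean-absolute (ℓ1) reduction is an assumption.

```python
import torch

def epi_gradient_loss(pred, gt):
    """EPI gradient loss of Eq. (12) for LFs shaped (M, N, H, W):
    dims are (u, v, y, x). Horizontal EPIs E_{y,v} vary over (x, u);
    vertical EPIs E_{x,u} vary over (y, v)."""
    def grad(t, dim):
        # forward difference along one axis
        return t.narrow(dim, 1, t.size(dim) - 1) - t.narrow(dim, 0, t.size(dim) - 1)

    loss = pred.new_zeros(())
    for spatial_dim, angular_dim in [(3, 0), (2, 1)]:   # horizontal, then vertical EPIs
        loss = loss + torch.mean(torch.abs(grad(pred, spatial_dim) - grad(gt, spatial_dim))) \
                    + torch.mean(torch.abs(grad(pred, angular_dim) - grad(gt, angular_dim)))
    return loss
```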
D. Implementation and Training Details

1) Training strategy: To make the coarse SR module compatible with all different angular positions, we first trained it independently of the refinement module. During training, a training sample of an LF image was fed into the network, and an SAI at a random angular position was selected as the target SAI. To enable the flexibility of the SAI selector, the number of auxiliary SAIs $k$ was randomly set in each iteration. After the coarse SR network training was completed, we fixed its parameters and used them to generate the coarse inputs for the training of the subsequent refinement module.
2) Parameter setting:
In our network, each convolutional layer has 64 filters with kernel size 3×3, and zero-padding was applied to keep the spatial resolution unchanged. In the coarse SR module, we set $n_1 = 5$, $n_2 = 5$, $n_3 = 3$, and $n_4 = 3$ for the numbers of residual blocks, and $p = 9$ for the feature number in the all-SAI fusion. For refinement, we used $n_5 = 10$ layers of spatial-angular convolutions.

During training, we used LF images with angular resolution of 7×7, and randomly cropped LF patches in the spatial dimensions. The batch size was set to 1. The Adam optimizer [52] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ was used. The learning rate was initially set to 1e−4 and decreased by a factor of 0.5 every 250 epochs.

V. EXPERIMENTAL RESULTS
A. Datasets
Both synthetic LF datasets, i.e., HCI [49], [51] and Inria [50], and real-world LF datasets, i.e., the Stanford LF Archives [20], [48] and Kalantari Lytro [47], were used for training and testing. Specifically, 180 LF images, including 160 real-world images and 20 synthetic images, were used for training, and 198 LF images containing 155 real-world scenes and 43 synthetic scenes were used for evaluation. As shown in Table I, different datasets exhibit different properties. To be specific, Stanford Lytro [20] and Kalantari Lytro [47] contain rich outdoor-scene images captured with a Lytro Illum camera, which have relatively low spatial resolution and a small disparity range. Stanford Gantry [48] contains indoor LF images captured with a camera array, resulting in high spatial resolution and a large disparity range. These real-world images evaluate the performance of different methods under natural illumination and practical camera distortion. The LF images in Synthetic [49], [50] were generated with the open-source software Blender, and have a relatively large disparity range. HCI old [51] was also generated with Blender, but with noise and a moderate disparity range. Compared with the real-world LF images, these synthetic images have sharper textures and more high-frequency details.

As in previous works [15], [16], [18], we used bicubic down-sampling to generate low-resolution LF images. Only the Y channel was used for training and testing, while the Cb and Cr channels were upsampled using bicubic interpolation when generating visual results.

B. Comparison with State-of-the-Art Methods
In addition to our preliminary work LF-ATO [17], we compared with five state-of-the-art LF SR methods, including one optimization-based method, i.e., GB [12], and four learning-based methods, i.e., PCA-RR [34], LFNet [15], LF-SAS [19], and ResLF [16], as well as one SISR method, i.e., EDSR [18]. The results of bicubic interpolation are also provided as a baseline.
1) Quantitative comparisons of reconstruction quality:
PSNR and SSIM were used as the quantitative indicators for comparison, and the average PSNR/SSIM over each testing dataset is listed in Table II, where it can be observed that LF-SAS, LF-ATO, and Ours generally achieve state-of-the-art results. In comparison with LF-SAS, our method improves the PSNR of the 2× reconstructed LFs by around 0.7 dB on small-disparity datasets, and by more than 1 dB on large-disparity datasets, benefiting from the patch selector. In 4× reconstruction, LF-SAS achieves results comparable with Ours, as the progressive architecture used in LF-SAS helps to utilize the complementary information between the 2× super-resolved SAIs. Compared with LF-ATO, our method improves the PSNR value via a deeper network with dense connections in the refinement module.
TABLE II: Quantitative comparisons (PSNR/SSIM) of different methods on 2× and 4× LF spatial SR. The best results are highlighted in bold. PSNR/SSIM refers to the average value over all the scenes of a dataset.

Dataset | Scale | Bicubic | PCA-RR [34] | LFNet [15] | GB [12] | EDSR [18] | ResLF [16] | LF-SAS [19] | LF-ATO [17] | Ours
Stanford Lytro [20] | 2 | 35.59/0.940 | 36.02/0.944 | 36.79/0.953 | 36.46/0.952 | 39.39/0.968 | 40.44/0.973 | 41.60/0.977 | 41.96/0.979 | –
Kalantari Lytro [47] | 2 | 37.51/0.960 | 38.29/0.964 | 38.80/0.969 | 39.33/0.976 | 41.55/0.980 | 42.95/0.984 | 43.75/0.986 | 44.02/– | –
Stanford Gantry [48] | 2 | 39.21/0.977 | 34.25/0.928 | 34.61/0.933 | 35.83/0.947 | 43.65/0.988 | 42.45/0.985 | 43.36/0.987 | 44.47/– | –
Synthetic [49], [50] | 2 | 33.14/0.911 | 32.21/0.885 | 33.89/0.919 | 35.73/0.946 | 37.44/0.946 | 37.43/0.952 | 38.79/0.960 | 39.44/0.963 | –
HCI old [51] | 2 | 34.48/0.915 | 35.05/0.923 | 35.34/0.928 | 36.65/0.951 | 37.74/0.946 | 38.72/0.957 | 40.09/0.966 | 40.54/– | –
Stanford Lytro [20] | 4 | 30.13/0.813 | 30.60/0.828 | 30.60/0.829 | 30.29/0.848 | 32.57/0.872 | 33.11/0.884 | 34.58/0.907 | 34.46/0.907 | –
Kalantari Lytro [47] | 4 | 31.63/0.864 | 32.57/0.882 | 32.14/0.879 | 31.86/0.892 | 34.59/0.916 | 35.55/0.930 | 37.03/– | –/0.946 | –
Stanford Gantry [48] | 4 | 33.19/0.920 | 31.27/0.855 | 31.32/0.860 | 30.92/0.873 | 36.40/0.954 | 35.54/0.945 | 36.25/0.947 | 37.06/0.960 | –
Synthetic [49], [50] | 4 | 28.49/0.792 | 28.76/0.791 | 28.95/0.806 | 29.11/0.832 | 31.63/0.861 | 31.60/0.869 | 32.87/0.889 | 32.68/0.887 | –
HCI old [51] | 4 | 30.04/0.791 | 30.65/0.815 | 30.52/0.807 | 29.91/0.812 | 32.49/0.845 | 32.71/0.860 | 33.86/0.884 | 33.81/0.884 | –
41, Max: 37 . .
44, Std: 0 . EDSR
Min: 36 .
85, Max: 37 . .
28, Std: 0 . ResLF
Min: 38 .
26, Max: 39 . .
80, Std: 0 . LF-SAS
Min: 39 .
92, Max: 40 . .
14, Std: 0 . Ours
Fig. 6: Comparison of the PSNR of individual SAIs. We visualize with color the average PSNR value over the 43 LFs in Synthetic [49], [50], together with min/max/std statistics per method (minimum per-SAI PSNR: EDSR 37.41 dB, ResLF 36.85 dB, LF-SAS 38.26 dB, Ours 39.92 dB).

We also compared the PSNR of individual SAIs for different methods. As shown in Fig. 6, it can be observed that the corner SAIs of LF images reconstructed by ResLF and LF-SAS always have significantly lower PSNR values than those of SAIs closer to the center. The performance degradation at corner SAIs in ResLF and LF-SAS is caused by the fewer neighboring SAIs, which provide less complementary information. In contrast, this issue is greatly alleviated in our method, which is credited to the better way of utilizing the information of auxiliary SAIs. The small variance of the PSNR values of SAIs reconstructed by the SISR method EDSR results from its limited ability, i.e., each SAI is independently super-resolved without utilizing the information of other SAIs.
2) Qualitative comparisons of reconstruction quality:
We also provide visual comparisons of different methods, shown in Fig. 7 for 2× SR and Fig. 8 for 4× SR, where it can be observed that blurring effects occur to some extent in texture regions of the results of EDSR, ResLF, and LF-SAS, such as the lines in the map, the digits on the clock, and the characters on the cards. In contrast, our method produces SR results with sharper textures, which are closer to the ground truth, demonstrating its advantage.
3) Comparisons of the LF parallax structure:
The LF parallax structure is the most valuable information of LF data. As discussed in Sec. III, the straight lines in EPIs provide a direct representation of the LF parallax structure. To qualitatively compare the ability of different methods to preserve the LF parallax structure, the EPIs constructed from the reconstructions of different methods are depicted in Figs. 7 and 8, where it can be seen that the EPIs from our method show clearer and more consistent straight lines, which are closer to the ground-truth ones, compared with those from other methods. The quantitative comparison of the EPIs reconstructed by different methods in terms of PSNR and SSIM is listed in Table III, which also shows the advantage of our method in terms of the preservation of structural consistency.

Moreover, depth estimation algorithms for LF images are usually built on the relations described in Eq. (1), and thus it is expected that reconstructed LF images with better LF parallax structures will lead to depth maps with higher accuracy. Based on this fact, the quality of the LF parallax structures of the super-resolved LF images can be indirectly evaluated via depth estimation. In Table IV, we quantitatively compare the accuracy of the depth estimated from the LF images super-resolved by different SR methods, as well as from the ground-truth high-resolution LF images. The accuracy is measured by the Mean Square Error (MSE) and the Bad Pixel Ratio (BPR), i.e., the percentage of pixels whose error is larger than a typical threshold, between the estimated depth maps and the ground-truth ones. The popular light field depth estimation method in [43] was used. From Table IV, it can be observed that our method produces the lowest MSE and BPR values among the LF SR methods. Note that the MSE and BPR values of the depth estimated from the ground-truth LF images are even higher than those estimated from super-resolved LF images on the HCI old dataset. The reason could be that the raw LF images in HCI old are noisy [51], while the noise might be suppressed by the reconstruction methods.

TABLE III: Comparisons of the PSNR/SSIM values of EPIs reconstructed by different methods.

Dataset | Scale | EDSR | ResLF | LF-SAS | Ours
Kalantari Lytro | 2 | 42.78/0.975 | 42.76/0.977 | 44.01/0.983 | –
Stanford Gantry | 2 | 45.08/0.988 | 35.95/0.980 | 43.98/0.987 | –
Synthetic | 2 | 38.53/0.945 | 38.09/0.950 | 39.53/0.960 | –
Kalantari Lytro | 4 | 36.15/0.916 | 36.47/0.922 | – | –
Stanford Gantry | 4 | 38.56/0.956 | 32.92/0.945 | 37.78/0.953 | –
Synthetic | 4 | 33.15/0.873 | 32.62/0.874 | 34.02/0.896 | –
Fig. 7: Visual comparisons of different methods (Ground Truth, PCA-RR, EDSR, ResLF, LF-SAS, and Ours) on 2× reconstruction. The predicted central SAIs, the zoom-ins of the framed patches, the EPIs at the colored lines, and the zoom-ins of the framed patches in the EPIs are provided. Zoom in the figure for better viewing.

TABLE IV: Quantitative comparisons (100 × MSE / BPR with threshold 0.07) of the depth estimated from the ground-truth high-resolution LFs and the 2× super-resolved LFs produced by different algorithms. The best and second best results are highlighted in red and blue, respectively.

Dataset | GT | PCA-RR | EDSR | ResLF | LF-SAS | Ours
Synthetic [50] | 6.46/35.5 | 9.78/51.57 | 7.42/39.83 | 7.35/40.41 | 7.56/39.14 | 7.18/37.12
HCI old [51] | 0.84/11.72 | 0.87/14.44 | 0.76/13.27 | 0.69/9.46 | 0.68/9.64 | 0.67/9.40
C. Ablation Study

1) Effectiveness of the SAI selector:
To investigate the effectiveness of the proposed SAI selector, we visualize in Fig. 9 the angular positions of the auxiliary SAIs selected by our network, where it can be seen that the SAI selector produces content-adaptive selections. Moreover, the selected SAIs are concentrated among the neighbors of the target SAI. These results are consistent with our intuition, i.e., neighboring SAIs are more similar to the target SAI and thus can contribute more information for SR. The use of neighboring SAIs as auxiliary information is also consistent with previous works [16], [35], while our strategy is more flexible and adaptive, and avoids the performance degradation at corner SAIs.

We then investigated how the proposed SAI selector affects the reconstruction quality. Fig. 10 shows the PSNR of the super-resolved LF with different numbers of selected SAIs, together with the results of a random SAI selection strategy for comparison, which selects a certain number of auxiliary SAIs randomly during both training and testing. From Fig. 10, it can be observed that the more auxiliary SAIs are utilized, the higher the PSNR, which is reasonable as more auxiliary SAIs provide more complementary information to enhance the reconstruction. Moreover, the proposed SAI selector produces better results than the random SAI selector, which demonstrates the effectiveness of the proposed SAI selection strategy.
Fig. 8: Visual comparisons of different methods (Ground Truth, PCA-RR, EDSR, ResLF, LF-SAS, and Ours) on 4× reconstruction. The predicted central SAIs, the zoom-ins of the framed patches, the EPIs at the colored lines, and the zoom-ins of the framed patches in the EPIs are provided. Zoom in the figure for better viewing.

We also investigated the impact of the number of selected SAIs on the computational complexity of the reconstruction, measured in running time and FLOPs. The results are plotted in Fig. 11, where it can be observed that both increase linearly as the number of selected SAIs increases, which indicates that the proposed SAI selection module can effectively reduce the computational cost by utilizing fewer auxiliary SAIs for SR. In combination with the results in Fig. 10, we can conclude that the proposed SAI selector optimizes the balance between reconstruction quality and computational efficiency.
2) Effectiveness of the disparity-based patch selector:
As the patch selector is a plug-and-play operator during training and testing, we investigated its effectiveness under three settings over two datasets, i.e., the real-world dataset Kalantari Lytro [47] with small disparities, and the synthetic dataset [49], [50] with relatively large disparities, as listed in Table V. From Table V, it can be observed that the use of the patch selector has little influence on LFs with small disparities, which is as expected because the network already has a sufficient receptive field to cover the complementary areas in different SAIs. In contrast, when testing on LFs with large disparities, using the patch selector produces significant improvements, which validates the effectiveness of the proposed patch selector. Note that training the network with or without the patch selector gives similar results, so we can universally train one model but apply different testing strategies for LFs with different disparities.
Fig. 9: Visualization of the selected auxiliary SAIs for two different LF images, (a) Cars and (b) Flower1, with k = 10 and k = 18. The yellow and blue grids indicate the angular positions of the target SAIs and the selected auxiliary SAIs, respectively. k is the number of auxiliary SAIs.

Fig. 10: Comparison of the reconstruction quality (PSNR) of two SAI selection strategies on two datasets, (a) Inria Synthetic and (b) Kalantari Lytro. Select-learn denotes the proposed learnable SAI selector, and Select-random denotes the random SAI selector.

TABLE V: Comparison of the reconstruction quality of 2× SR of the proposed method with (w/) and without (w/o) the patch selector. w/o and w/ mean that the model is trained or tested without or with the patch selector, respectively.

Setting | Kalantari Lytro [47] | Synthetic [49], [50]
train w/o, test w/o | 43.77/0.986 | 39.30/0.963
train w/, test w/o | 43.76/0.986 | 39.32/0.961
train w/, test w/ | 43.76/0.986 | –
3) Effectiveness of the structural consistency regularization: We compared the reconstruction quality of the coarse (before regularization) and final (after regularization) results, as listed in Table VI, where it can be observed that the refinement module improves the PSNR values of the reconstructed results on both 2× and 4× SR, demonstrating the effectiveness of the proposed refinement module.
Fig. 11: Illustration of the computational complexity, (a) running time and (b) FLOPs, with different numbers of selected auxiliary SAIs. The running time was computed over a 7×7 LF with 2× SR, and the FLOPs were obtained by computing one forward pass.

TABLE VI: Comparisons of the reconstruction quality before and after the regularization on 2× and 4× LF SR.

Dataset | Scale | w/o regularization | w/ regularization
Stanford Lytro [20] | 2 | 41.65/0.978 | –
Kalantari Lytro [47] | 2 | 43.76/0.986 | –
Synthetic [49], [50] | 2 | 39.84/0.966 | –
Stanford Lytro [20] | 4 | 34.49/0.905 | –
Kalantari Lytro [47] | 4 | 36.88/0.945 | –
Synthetic [49], [50] | 4 | –/0.896 | –

D. Extended Application on Irregular LF SR
Currently, all existing LF spatial SR methods are designed for regular LF images, modeling the relations between SAIs sampled on a regular grid in the angular plane. However, in order to densely sample the LF, sampling an irregular LF image from unstructured viewpoints is more practically meaningful [53], [54]. Different from previous methods, our proposed coarse SR module super-resolves SAIs via combinatorial geometry embedding, and thus can naturally handle the reconstruction of irregular LF images.

To validate the ability of our proposed coarse SR module on the reconstruction of irregular LF images, we compared it with EDSR [18], which super-resolves each SAI independently, and LF-SAS-1D, which was developed by replacing the 2D kernels of the angular convolution of LF-SAS [19] with 1D kernels. We simulated irregular LF data by stacking SAIs selected from regular LF data. In our experiments, we randomly generated 3 irregular patterns with 36 SAIs selected from 7×7 LF images, and compared different methods by the average PSNR/SSIM over these 3 patterns for robust evaluation. The results are listed in Table VII, where it can be observed that on irregular LFs with small disparities, although LF-SAS-1D improves the reconstruction quality compared with EDSR by taking advantage of the complementary information from different SAIs, its performance is significantly degraded compared with the results in Table II. On LFs with large disparities, LF-SAS-1D produces results similar to those of EDSR, indicating that the complementary information is not effectively utilized. We deduce that the structured spatial-angular convolution is no longer suitable for exploring the relations between unstructured SAIs. In contrast, the reconstruction quality of our proposed coarse SR module is preserved, leading to the highest PSNR/SSIM values, which demonstrates its ability on the reconstruction of irregular LF images.

TABLE VII: Quantitative comparisons (PSNR/SSIM) of different methods on 2× reconstruction of irregular LF images. Ours-coarse denotes our proposed coarse SR module.

Dataset | EDSR [18] | LF-SAS-1D [19] | Ours-coarse
Kalantari Lytro [47] | 41.55/0.980 | 42.64/0.983 | –
Synthetic [49], [50] | 37.44/0.946 | 37.47/0.948 | –

VI. CONCLUSION
In this paper, we have presented a learning-based method for LF spatial SR. We focused on addressing two crucial problems, which we believe are paramount for high-quality LF spatial SR, i.e., how to fully take advantage of the complementary information among SAIs, and how to preserve the LF parallax structure in the reconstruction. We modeled them with two modules, i.e., coarse SR via selective combinatorial embedding and refinement via structural consistency regularization. Owing to the selective combinatorial embedding, our network improves the SR performance on LFs with large disparities, and is capable of handling irregular LF data. Experimental results demonstrate that our method efficiently generates super-resolved LF images with higher PSNR/SSIM and better LF structure, compared with the state-of-the-art methods.

REFERENCES
[1] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. H. Gross, "Scene reconstruction from high spatio-angular resolution light fields," ACM Transactions on Graphics, vol. 32, no. 4, p. 73, 2013.
[2] H. Zhu, Q. Wang, and J. Yu, "Occlusion-model guided antiocclusion depth estimation in light field," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 965–978, 2017.
[3] L. Si and Q. Wang, "Dense depth-map estimation and geometry inference from light fields via global optimization," in Asian Conference on Computer Vision (ACCV), 2016, pp. 83–98.
[4] H. Zhu, Q. Zhang, and Q. Wang, "4D light field superpixel and segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6384–6392.
[5] J. Fiss, B. Curless, and R. Szeliski, "Refocusing plenoptic images using depth-adaptive splatting," in IEEE International Conference on Computational Photography (ICCP), 2014, pp. 1–9.
[6] F.-C. Huang, K. Chen, and G. Wetzstein, "The light field stereoscope: immersive computer graphics via factored near-eye light field displays with focus cues," ACM Transactions on Graphics, vol. 34, no. 4, p. 60, 2015.
[7] J. Yu, "A light-field journey to virtual reality," IEEE MultiMedia.
[10] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Liu, Q. Dai, T. Chai, and Y. Liu, "Light field image processing: An overview," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 926–954, 2017.
[11] S. Wanner and B. Goldluecke, "Variational light field analysis for disparity estimation and super-resolution," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 606–619, 2014.
[12] M. Rossi and P. Frossard, "Geometry-consistent light field super-resolution via graph-based regularization," IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4207–4218, 2018.
[13] K. Mitra and A. Veeraraghavan, "Light field denoising, light field superresolution and stereo camera based refocussing using a GMM light field patch prior," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 22–28.
[14] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. S. Kweon, "Light-field image super-resolution using convolutional neural network," IEEE Signal Processing Letters, vol. 24, no. 6, pp. 848–852, 2017.
[15] Y. Wang, F. Liu, K. Zhang, G. Hou, Z. Sun, and T. Tan, "LFNet: A novel bidirectional recurrent convolutional neural network for light-field image super-resolution," IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4274–4286, 2018.
[16] S. Zhang, Y. Lin, and H. Sheng, "Residual networks for light field image super-resolution," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11046–11055.
[17] J. Jin, J. Hou, J. Chen, and S. Kwong, "Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2260–2269.
[18] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 136–144.
[19] H. W. F. Yeung, J. Hou, X. Chen, J. Chen, Z. Chen, and Y. Y. Chung, "Light field spatial super-resolution using deep efficient spatial-angular separable convolution," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2319–2330, 2018.
[20] A. S. Raj, M. Lowney, R. Shah, and G. Wetzstein, "Stanford Lytro light field archive," http://lightfields.stanford.edu/LF2016.html, [Online].
[21] J. Sun, Z. Xu, and H.-Y. Shum, "Image super-resolution using gradient profile prior," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[22] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 1097–1105.
[24] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[25] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1646–1654.
[26] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 624–632.
[27] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2472–2481.
[28] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, "Second-order attention network for single image super-resolution," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11065–11074.
[29] Y. Mei, Y. Fan, Y. Zhou, L. Huang, T. S. Huang, and H. Shi, "Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5690–5699.
[30] J. Tian and K.-K. Ma, "A survey on super-resolution imaging," Signal, Image and Video Processing, vol. 5, no. 3, pp. 329–342, 2011.
[31] Z. Wang, J. Chen, and S. C. Hoi, "Deep learning for image super-resolution: A survey," arXiv preprint arXiv:1902.06068, 2019.
[32] T. E. Bishop and P. Favaro, "The light field camera: Extended depth of field, aliasing, and superresolution," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 972–986, 2012.
[33] S. Wanner and B. Goldluecke, "Spatial and angular variational super-resolution of 4D light fields," in European Conference on Computer Vision (ECCV), 2012, pp. 608–621.
[34] R. A. Farrugia, C. Galea, and C. Guillemot, "Super resolution of light field images using linear subspace projection of patch-volumes," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 1058–1071, 2017.
[35] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, "Learning a deep convolutional network for light-field image super-resolution," in IEEE International Conference on Computer Vision Workshops (ICCVW), 2015, pp. 24–32.
[36] C. Shin, H.-G. Jeon, Y. Yoon, I. So Kweon, and S. Joo Kim, "EPINET: A fully-convolutional neural network using epipolar geometry for depth from light field images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4748–4757.
[37] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, "Real-time video super-resolution with spatio-temporal networks and motion compensation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4778–4787.
[38] S. Heber and T. Pock, "Shape from light field meets robust PCA," in European Conference on Computer Vision (ECCV), 2014, pp. 751–767.
[39] R. Farrugia and C. Guillemot, "Light field super-resolution using a low-rank prior and deep convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[41] ——, "Identity mappings in deep residual networks," in European Conference on Computer Vision (ECCV), 2016, pp. 630–645.
[42] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
[43] T.-C. Wang, A. A. Efros, and R. Ramamoorthi, "Occlusion-aware depth estimation using light-field cameras," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3487–3495.
[44] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 261–270.
[45] W. F. H. Yeung, J. Hou, J. Chen, Y. Y. Chung, and X. Chen, "Fast light field reconstruction with deep coarse-to-fine modeling of spatial-angular clues," in European Conference on Computer Vision (ECCV), 2018, pp. 137–152.
[46] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
[47] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, "Learning-based view synthesis for light field cameras," ACM Transactions on Graphics, vol. 35, no. 6, pp. 193:1–193:10, 2016.
[48] V. Vaish and A. Adams, "The (new) Stanford light field archive," http://lightfield.stanford.edu/lfs.html, [Online].
[49] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, "A dataset and evaluation methodology for depth estimation on 4D light fields," in Asian Conference on Computer Vision (ACCV), 2016, pp. 19–34.
[50] J. Shi, X. Jiang, and C. Guillemot, "A framework for learning depth from a flexible subset of dense and sparse light field views," IEEE Transactions on Image Processing, pp. 1–15, 2019.
[51] S. Wanner, S. Meister, and B. Goldluecke, "Datasets and benchmarks for densely sampled 4D light fields," in Vision, Modelling and Visualization (VMV), vol. 13, 2013, pp. 225–226.
[52] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[53] K. Yücer, A. Sorkine-Hornung, O. Wang, and O. Sorkine-Hornung, "Efficient 3D object segmentation from densely sampled light fields with applications to 3D reconstruction," ACM Transactions on Graphics, vol. 35, no. 3, pp. 1–15, 2016.
[54] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines," ACM Transactions on Graphics, vol. 38, no. 4, 2019.