Detail-preserving and Content-aware Variational Multi-view Stereo Reconstruction
Zhaoxin Li, Kuanquan Wang, Wangmeng Zuo, Deyu Meng, Lei Zhang
Abstract—Accurate recovery of 3D geometric surfaces from calibrated 2D multi-view images is a fundamental yet active research area in computer vision. Despite steady progress in multi-view stereo reconstruction, most existing methods are still limited in recovering fine-scale details and sharp features while suppressing noise, and may fail in reconstructing regions with few textures. To address these limitations, this paper presents a Detail-preserving and Content-aware Variational (DCV) multi-view stereo method, which reconstructs the 3D surface by alternating between reprojection error minimization and mesh denoising. In reprojection error minimization, we propose a novel inter-image similarity measure, which is effective in preserving fine-scale details of the reconstructed surface and builds a connection between guided image filtering and image registration. In mesh denoising, we propose a content-aware $\ell_p$-minimization algorithm that adaptively estimates the $p$ value and the regularization parameters based on the current input. It is much more effective in suppressing noise while preserving sharp features than conventional isotropic mesh smoothing. Experimental results on benchmark datasets demonstrate that our DCV method is capable of recovering more surface details, and obtains cleaner and more accurate reconstructions than state-of-the-art methods. In particular, our method achieves the best results among all published methods on the Middlebury dino ring and dino sparse ring datasets in terms of both completeness and accuracy.

Index Terms—Multi-view stereo, reprojection error, feature-preserving, $\ell_p$ minimization, mesh denoising.

I. INTRODUCTION

Multi-view stereo (MVS), which aims at inferring a scene's 3D geometric surface from a set of calibrated 2D images captured from different views, is a fundamental problem in computer vision.
Due to its capability of high-quality reconstruction for both indoor and outdoor scenes, MVS has been widely used in science and engineering [10]–[12]. Driven by the MVS benchmark datasets in [1] and [2], various MVS algorithms have been proposed to gradually improve the accuracy and completeness of MVS reconstruction [4], [31], [43], [62], and MVS remains an active research area that attracts considerable attention [3]–[5].

The performance of existing MVS methods is limited by factors such as violation of the Lambertian reflectance model, inaccurate camera calibration, lack of textures on the object, and false matches. Therefore, noise is inevitable in the reconstructed 3D surface, resulting in degraded accuracy and visually unpleasant artifacts. A number of methods, e.g., weighted minimal surface models [13], [14], have been proposed to suppress noise. However, this line of methods usually imposes an isotropic smoothness prior on 3D models, and tends to over-smooth fine-scale details and sharp features.

To overcome these limitations, various methods have been developed to suppress noise while preserving sharp features. Based on the 3D model representation, these methods can be grouped into three categories, i.e., point cloud-based, volumetric-based, and mesh-based. For point cloud-based methods, a smoothness prior is introduced in [58] to improve the accuracy of local matches on each stereo pair.

Z. Li is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, and the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong. K. Wang and W. Zuo are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. D. Meng is with the Institute for Information and System Sciences, Xi'an Jiaotong University, Xi'an, China. L. Zhang is with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
In [43], [59], accurate point clouds in highly textured regions are generated by deploying reliable features, and are then propagated to the neighbouring regions. Besides, several heuristic strategies [60], [61] have been suggested to evaluate the reliability of each point and remove noise based on local geometry orientation, photometric consistency and visibility. For these methods [43], [58]–[61], meshing the point clouds is usually required to generate the final 3D surface, which may lead to over-smoothing of thinly protruding structures. Besides, noise and missing data in the point clouds can be propagated to the meshing step, resulting in artifacts in the final reconstruction.

In volumetric-based methods, a photoflux term of photoconsistency [25] is introduced to provide a data-driven ballooning force toward the maximally photo-consistent surface. Such an energy term is helpful in segmenting thin structures, but fails to recover structures in concave regions. Kolev et al. [4] added a stereo regional term to enforce the background constraint based on a set of depth maps. The regional term can be updated along the iterations to infer the occluded regions, making the method work well in recovering both protruding structures and concave regions. Kostrikov et al. [24] further improved the method in [4] by proposing a robust camera selection algorithm for labelling voxels as interior or exterior. However, the high memory requirements of volumetric-based methods hamper their application to large-volume and high-quality MVS reconstruction.

For mesh-based MVS, a number of variational methods [5], [7], [19], [29]–[32] have been proposed to improve the reconstruction quality. They can also be employed as a refinement step of other methods for high-quality reconstruction [31]–[33], [43]. However, most existing mesh-based methods adopt isotropic mesh smoothing, where the photoconsistency is computed by the zero-mean normalized cross-correlation (ZNCC). This often makes them fail in recovering the fine
details and sharp features of the object surface.

In addition to the above methods, other cues, e.g., silhouettes [19], [21]–[23] and surface orientation [26], can also be incorporated to help 3D reconstruction. The silhouettes of an object can be fused to ensure that the reconstructed surface preserves protrusions and indentations. They can be adopted in either volumetric-based [21]–[23] or mesh-based methods [9], [19], [20]. However, the incorporation of silhouettes cannot guarantee the preservation of fine-scale surface details and sharp features that are not on the contour generators of the surface. The surface orientation of the observed shape can be employed to design an anisotropic weight surface [26]. However, the computation of surface orientation needs accurate second-order surface derivatives, and the constant albedo assumption may not hold [27], [28], making surface orientation applicable only to some restricted scenarios.

Compared with point cloud-based and volumetric-based methods, mesh-based methods are more feasible for reconstructing high-resolution surfaces with low memory requirements, but their accuracy is generally limited, partially due to the use of isotropic mesh smoothing and the ZNCC-based inter-image similarity measure. To address these issues, in this paper we propose a novel inter-image similarity measure and a content-aware mesh denoising algorithm, resulting in a detail-preserving and content-aware variational (DCV) method for MVS. As shown in Fig. 1, the contribution of this work is two-fold:

• An inter-image similarity measure is proposed to preserve fine-scale details of the reconstructed surface. The proposed similarity measure also builds a connection between guided image filtering [34] and image registration, giving our measure promising edge-preserving performance.

• A content-aware $\ell_p$-minimization algorithm is proposed for mesh denoising.
By adaptively estimating a suitable $p$ value and the regularization parameters, our algorithm works very well in mesh smoothing while preserving sharp features.

Extensive experimental results on benchmark datasets validate the superiority of our DCV method for accurate 3D reconstruction. Moreover, DCV achieves the best results among all published methods on the Middlebury dino ring and dino sparse ring datasets in terms of both completeness and accuracy.

The paper is organized as follows. Section II introduces the related work. Section III briefly introduces the concept of the reprojection error and its minimization. Section IV presents the pipeline of our method and its two major components, i.e., the detail-preserving similarity measure and content-aware mesh denoising. Section V presents the experimental results. Finally, the paper is concluded in Section VI.

II. RELATED WORK

This section gives a brief survey of mesh-based MVS methods according to the two key elements that decide the quality of MVS reconstruction: data fidelity and regularization. Mesh-based MVS methods can provide high-resolution reconstruction with low memory requirements, and they are convenient to accelerate using graphics hardware. Due to these advantages, the mesh-based representation has been widely adopted in state-of-the-art MVS methods for 3D reconstruction of indoor [5], [19] and outdoor scenarios [31], [32], and for surface refinement [31], [32], [43].
Data fidelity.
Data fidelity is used to measure photometric consistency between images. In some early works [7], [8], [19], data fidelity is measured by comparing projections of surface points (or of a planar patch tangent to a surface point) with the corresponding neighbouring images, i.e., photoconsistency. The total consistency of the mesh surface is a summation of photoconsistency over all the mesh vertices. The main drawback of this measure is the projective distortion that occurs in high-curvature regions of objects. Another line of methods compares the image pixels with rendered surface textures [9], [28] by implicitly assuming controlled lighting environments. The reprojection error minimization framework, also known as the reprojection error functional [5], [20], [35], [36], attempts to solve the MVS problem by comparing the observed and predicted values of pixels generated from the reconstructed surface. The total consistency of the mesh surface is measured in image space instead of on the 3D surface to alleviate projective distortion. It can also be considered as an image registration problem [37], i.e., registering the input images and their predicted images.

A similarity measure is needed to quantify the reprojection error. In previous works, differentiable and isotropic similarity measures have been widely used, such as Zero-mean Normalized Cross-Correlation (ZNCC) [29], [31], [32], [37] and the Sum of Squared Differences (SSD) [30], [35], [36]. However, these isotropic similarity measures tend to flatten or smooth the sharp features of the surface, and are limited in recovering fine-scale details. Edge-aware anisotropic methods can be used to replace the isotropic ones. In fact, anisotropic methods have been independently proposed in binocular stereo vision [15]–[18].
Among them, guided image filtering has been used [16], [17] due to its effectiveness and efficiency. However, the guided image filter in stereo vision is employed to filter discrete disparity space images (DSI), and cannot be directly adopted in the reprojection error minimization framework, where a variational measure is necessary. The proposed method fills the gap between guided image filtering-based anisotropic measures and variational image registration, and it is effective in reconstructing fine-scale details.

Surface regularization:
Mesh regularization methods are introduced to improve smoothness while preserving details of the 3D surface, and can be divided into two categories, i.e., surface smoothing and denoising. For surface smoothing, several mesh smoothing operators, e.g., the discrete Laplace–Beltrami operator [38], have been adopted as band-pass filters in MVS. Other smoothing methods, such as mean curvature motion [35], [36] and gradient flow [6], have been studied and applied to mesh-based MVS [5], [30]. To improve the computational efficiency, different approximations, such as the Laplacian approximation [7], [20], [29], the umbrella operator [19] and the paraboloid approximation [28], have been proposed. Higher-order derivatives, e.g., a combination of the first- and second-order Laplacians [43] and thin-plate energy [31], [32], have

Fig. 1: Overview of the proposed DCV method.

been suggested to handle artificial shrinkage of small components and to penalize strong bending. However, higher-order derivatives are not well defined in regions with sharp features, and their computation is sensitive to noise.

Mesh denoising aims to remove noise or spurious details while preserving sharp edge and corner features, and can be further classified into three sub-categories. The first is based on bilateral filtering of vertices [39]; the second combines normal filtering and vertex position updating [40], [41]; and the third is based on optimizing an $\ell_0$-norm based non-convex energy function [42]. In this work, we propose a novel mesh denoising method by considering the gradient distribution of surface meshes, which is formulated as an $\ell_p$-minimization problem in the maximum a posteriori (MAP) framework. Moreover, we adaptively select the $p$ value and the regularization parameters, making our method content-aware so as to preserve sharp features.

III.
PREREQUISITES: REPROJECTION ERROR AND ITS MINIMIZATION

Let $S \subset \mathbb{R}^3$ denote a reconstructed surface of the object, $B \subset \mathbb{R}^3$ stand for its background, and $I_i : \Omega_i \subset \mathbb{R}^2 \rightarrow \mathbb{R}^d$ denote the observed (input) image of camera $i$ ($d = 1$ for gray-scale and $d = 3$ for color images). $S_i$ is the visible part of the surface for camera $i$. We define $S_{i,j}$ as the shared visible surface of camera $i$ and camera $j$. Let $\pi_i : \mathbb{R}^3 \rightarrow \Omega_i$ be the perspective projection which projects a 3D point $\mathbf{x}$ to a 2D pixel $\mathbf{p}$. Let $\hat{I}_{i,S,B}$ be the predicted image of $I_i$ via the surface and background; $\hat{I}_{i,S}$ is the predicted image for the object part and $\hat{I}_{i,B}$ is the predicted image for the background part. With the desired 3D reconstruction of the object $S$ and background $B$, it is natural to assume that the image predicted by the 3D object and background models should be similar to the observed image. Therefore, the reprojection error minimization framework adopts the following functional [5], [30], [35], [36]:

$$E_{im}(S) = \sum_i \int_{\Omega_i} g\big(I_i(\mathbf{p}), \hat{I}_{i,S,B}(\mathbf{p})\big)\, d\mathbf{p} = \sum_i \Big[ \int_{\pi_i \circ S_i} g_F^i\big(I_i(\mathbf{p}), \hat{I}_{i,S}(\mathbf{p})\big)\, d\mathbf{p} + \int_{\Omega_i - \pi_i \circ S_i} g_B^i\big(I_i(\mathbf{p}), \hat{I}_{i,B}(\mathbf{p})\big)\, d\mathbf{p} \Big], \quad (1)$$

where $\pi_i \circ S_i$ denotes the projection of surface $S_i$ onto $I_i$, and the reprojection error $g(I_1, I_2)(\mathbf{p})$ denotes the similarity measure

Fig. 2: Illustration of the visibility of the surface. (a) Visible parts of the surface for each camera: $S_i$ for camera $i$ (center $C_i$) and $S_j$ for camera $j$ (center $C_j$). $S_{i,j}$ is the shared visible part for cameras $i$ and $j$. (b) The interior points are those surface points which are visible from both cameras of a stereo pair and are not on the contour generators of the surface. The horizons are the points on the contour generators for a specific camera. The terminators are occluded by horizons and lie behind the horizons along the camera ray.

between images $I_1$ and $I_2$ at pixel $\mathbf{p}$.
The reprojection error $g_F^i(I_i(\mathbf{p}), \hat{I}_{i,S}(\mathbf{p}))$ measures the similarity between image $I_i$ and its predicted image via the surface of the object, and $g_B^i(I_i(\mathbf{p}), \hat{I}_{i,B}(\mathbf{p}))$ measures the similarity between image $I_i$ and the predicted background image.

The predicted image can be generated by rendering the surface and background. In particular, $\hat{I}_{i,S}$ is defined based on stereo pairs and is usually not a single image. One of the predicted images of $I_i$ can be computed by first projecting its neighbouring image $I_j$ onto the reconstructed surface $S$ and then projecting back to the image space of camera $i$, which defines a predicted image $\hat{I}_{i,j,S}$. The valid definition domain for $\hat{I}_{i,j,S}$ is the projection of the shared visible surface $S_{i,j}$, i.e., $\pi_i \circ S_{i,j}$. By counting all the neighbouring images of $I_i$, $g_F^i$ is defined as:

$$g_F^i(\mathbf{p}) = \sum_j m(I_i, \hat{I}_{i,j,S})(\mathbf{p}) = \sum_j m_{i,j}(\mathbf{p}), \quad (2)$$

where $m$ is a similarity measure of two pixels computed in a small square window centered on $\mathbf{p}$. The definition domain of the predicted image $\hat{I}_{i,B}$ of $I_i$ via the background is $\Omega_i - \pi_i \circ S_i$, i.e., the complement of $\pi_i \circ S_i$. To simplify the computation, we assume that the background is uniformly black, which can be implemented by segmenting silhouettes from the observed images. Based on the fact that $d\mathbf{u} = -\mathbf{x} \cdot \mathbf{n}(\mathbf{x}) / x_z^3\, ds$, with simple algebra Eq. (1) can be rewritten as an integral over the surface by counting only the visible points (see [36] for details):

$$E_{im}(S) = \sum_i \int_S \frac{-\mathbf{x} \cdot \mathbf{n}(\mathbf{x})}{x_z^3} \Big[ g_F^i(\mathbf{x}, \mathbf{n}(\mathbf{x})) - g_B^i(\mathbf{x}, \mathbf{n}(\mathbf{x})) \Big] \Lambda_{i,S}(\mathbf{x})\, ds, \quad (3)$$

where $\mathbf{n}(\mathbf{x})$ is the outward normal of surface $S$ at point $\mathbf{x}$, and $\Lambda_{i,S} : \mathbb{R}^3 \rightarrow [0,
1]$ is the visibility function, which equals 1 if $\mathbf{x}$ is visible from camera $i$ and 0 otherwise.

The functional of reprojection errors in Eq. (3) can be reformulated on a mesh-based discrete representation. Let us parametrize the surface $S$ as a triangle mesh $M$ with a set of vertices $V = \{v_1, \ldots, v_n\}$ and a set of triangular faces $F = \{f_1, \ldots, f_m\}$, $f_i \in V \times V \times V$. The geometric embedding of a triangle mesh into $\mathbb{R}^3$ is specified by associating each vertex with a 3D position. Let $\mathbf{x}_i$ denote the position of vertex $v_i$. Over each triangular face, points are parametrized using barycentric coordinates $\mathbf{x}(\mathbf{u})$: $\mathbf{u} = (u, v) \in T = \{(u, v) \mid u \in [0, 1],\ v \in [0, 1 - u]\}$. The energy functional on the triangle mesh is formulated as follows:

$$E_{im}(M) = \sum_i \sum_k A_k \int_T G(\mathbf{x})\, N_k\, \Lambda_{i,S}(\mathbf{x})\, d\mathbf{u}, \quad (4)$$

where $G(\mathbf{x}) = -(\mathbf{x}/x_z^3) \cdot \big[g_F^i(\mathbf{x}, \mathbf{n}(\mathbf{x})) - g_B^i(\mathbf{x}, \mathbf{n}(\mathbf{x}))\big]$, $N_k$ and $A_k$ are the normal and area of triangle $f_k$, respectively, and $d\mathbf{u} = ds/(2A_k)$ relates the parameter-domain element to the surface area element of the triangle mesh.

The energy functional in Eq. (4) can be optimized by gradient descent over all the vertices of the mesh. According to [5], [30], [35], [36], the evolution equation for the gradient descent flow is:

$$\mathbf{x}_k(0) = \mathbf{x}_k^0, \qquad d\mathbf{x}_k / dt = V_k = -(1/A_k)\big[ M_k^{int} + M_k^{horiz} \big], \quad (5)$$

where $M_k^{int}$ is defined over the faces adjacent to vertex $v_k$ as:

$$M_k^{int} = \sum_{k'} A_{k'} N_{k'} \int_T \nabla G(\mathbf{x})\, (1 - u - v)\, du\, dv, \quad (6)$$

and $M_k^{horiz}$ is defined as:

$$M_k^{horiz} = \sum_{\text{horizon edges } H_{k,j}} \int_{u \in [0,1]} 2\big[G(T(\mathbf{y})) - G(\mathbf{y})\big]\, \frac{\mathbf{y} \wedge H_{k,j}}{|\mathbf{y}|\, [\mathbf{y}]_z}\, (1 - u)\, du, \quad (7)$$

where $V_k$ is the velocity vector, $H_{k,j}$ is the vector such that $\langle \mathbf{x}_k, \mathbf{x}_k + H_{k,j} \rangle$ is the edge of the triangular face $f_j$ generating the horizon, $\mathbf{y} = \mathbf{x}_k + u H_{k,j}$, and $T(\mathbf{x})$ is the terminator of $\mathbf{x}$. The definitions of horizon and terminator are illustrated in Fig. 2(b). $M_k^{int}$ is the gradient for a vertex (interior point) that does not change its visibility state, and $M_k^{horiz}$ is the gradient for a vertex that exhibits strong changes in visibility during the evolution.

The term $M_k^{horiz}$ is used to confine the horizons of the surface in the different cameras. Although its influence is considerably decreased by the introduction of surface regularization, this term is very useful for preserving thin protruding structures on the border between object and background, and it naturally corresponds to a silhouette constraint [19]–[23]. The form of $g_B^i$ decides the consistency of the reconstructed model with the silhouettes, and SSD can be used to measure this error. The term $M_k^{int}$ is crucial to the reconstruction quality. To evolve the current surface $S$, we need to estimate the derivative of $g_F^i(\mathbf{x})$.
As shown in [37], the gradient of the similarity measure $m_{i,j}$ with respect to an infinitesimal displacement $\delta S$ of the 3D surface point $\mathbf{x}$ can be computed using the chain rule:

$$\lim_{\epsilon \to 0} \frac{\partial m_{i,j}(S + \epsilon \delta S)}{\partial \epsilon} = \lim_{\epsilon \to 0} \int_{\pi_i \circ S_{i,j}} \frac{\partial m(I_i, \hat{I}_{i,j,S})}{\partial \hat{I}_{i,j,S}}(\mathbf{p}_j) \times \frac{\partial \hat{I}_{i,j,S}}{\partial \mathbf{p}_j} \times \frac{\partial \mathbf{p}_j}{\partial \mathbf{x}} \times \frac{\partial \pi_{i,S}^{-1}}{\partial \epsilon}\, d\mathbf{p}_i, \quad (8)$$

and we have

$$\nabla g_F^i(S)(\mathbf{x}) = -\sum_{j,\, i \neq j} \eta_{\pi_i \circ S_{i,j}}\, \frac{\partial m(I_i, \hat{I}_{i,j,S})}{\partial \hat{I}_{i,j,S}}(\mathbf{p}_j) \times \frac{\partial \hat{I}_{i,j,S}}{\partial \mathbf{p}_j} \times \frac{\partial \mathbf{p}_j}{\partial \mathbf{x}} \times \frac{\mathbf{d}_i}{z_i}\, \mathbf{n}, \quad (9)$$

where $\mathbf{p}_i$ and $\mathbf{p}_j$ are the pixel positions in images $I_i$ and $\hat{I}_{i,j,S}$, respectively, $\pi_{i,S}^{-1} : \Omega_i \rightarrow \mathbb{R}^3$ is the inverse projection which maps pixels of camera $i$ onto the surface, $\mathbf{d}_i$ is the vector joining the center of camera $i$ and $\mathbf{x}$, and $\eta$ is the Kronecker symbol which cancels the gradient computation in the region outside the shared visible surface of both cameras. When the surface moves, the predicted image tends to change. Hence, the variation of the reprojection errors involves the derivative of the similarity measure with respect to its second argument $\hat{I}_{i,j,S}$, i.e., $\partial m_{i,j}$, as shown in the first derivative term on the right-hand side of Eq. (9). Therefore, the variation of the predicted images strongly affects the 3D shape of the surface.

IV. PROPOSED METHOD

Following the mesh-based MVS framework, our DCV model consists of two terms, i.e., the data fidelity $E_{im}$ and the surface regularization $E_{reg}$. The energy functional of our model can be formulated as:

$$E(S) = E_{im}(S) + \lambda E_{reg}(S), \quad (10)$$

where $S$ denotes the reconstructed surface of the object, and $\lambda$ is a trade-off parameter. Note that $E_{im}$ is usually differentiable while $E_{reg}$ is non-smooth. The model can be solved by extending the proximal gradient algorithm [63], which iteratively performs the following two steps.

Step 1. Gradient Descent.
Given the current estimate $S^k$, gradient descent is adopted to minimize the data fidelity term $E_{im}$:

$$S^{k+0.5} = S^k - \eta\, \partial E_{im}(S) / \partial S, \quad (11)$$

where $\eta$ is the step size.

Step 2. Surface Denoising.
Given $S^{k+0.5}$, the reconstructed surface $S$ is further refined by solving the following mesh denoising problem:

$$S^{k+1} = \arg\min_S \|S - S^{k+0.5}\|^2 + \lambda \eta\, E_{reg}(S). \quad (12)$$

Given a nonsmooth convex function $E_{reg}$ and a smooth convex function $E_{im}$ with Lipschitz constant $L$, when the step size $\eta \leq 1/L$ and the surface denoising problem has a global solution, the algorithm converges to the global optimum [63]. In our case, even though $E_{reg}$ is nonconvex, the algorithm empirically converges to a satisfactory solution. In this work, we propose a detail-preserving similarity measure for Step 1 and a content-aware mesh denoising algorithm for
Step 2, which are described in detail in the following two subsections, respectively.
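As an illustration, the two-step alternation of Eqs. (11) and (12) can be sketched as follows. This is a minimal Python sketch; `grad_E_im` and `denoise` are hypothetical callables standing in for the reprojection-error gradient of Section III and the mesh denoiser of Section IV-B, and operate on an array of vertex positions:

```python
import numpy as np

def dcv_outer_loop(V0, grad_E_im, denoise, eta=1e-2, lam=1.0, n_iters=50):
    """Proximal-gradient-style alternation of Eqs. (11)-(12).

    V0        : (n, 3) array of initial mesh vertex positions
    grad_E_im : callable returning dE_im/dV for the current vertices
    denoise   : callable solving the mesh-denoising (proximal) subproblem
                argmin_V ||V - V_half||^2 + lam*eta*E_reg(V)
    """
    V = V0.copy()
    for _ in range(n_iters):
        # Step 1: gradient descent on the reprojection-error term, Eq. (11)
        V_half = V - eta * grad_E_im(V)
        # Step 2: surface denoising as the proximal step, Eq. (12)
        V = denoise(V_half, lam * eta)
    return V
```

With a smooth data term and a step size below the reciprocal Lipschitz constant, this loop is the standard proximal gradient iteration; the paper's nonconvex regularizer makes convergence empirical rather than guaranteed.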
A. Detail-preserving Inter-image Similarity Measure
The similarity measure $m(I_i, \hat{I}_{i,j,S})$ is critical for the minimization of the reprojection error between $I_i$ and its predicted image $\hat{I}_{i,j,S}$ in $\pi_i \circ S_{i,j}$. In the variational framework, it is desirable that the similarity measure $m(I_i, \hat{I}_{i,j,S})$ be differentiable. Among the existing similarity measures [45]–[47], zero-mean normalized cross-correlation (ZNCC) is the most commonly used one due to the following advantages: (1) it is robust to inter-image affine illumination variation; and (2) its derivative can be efficiently computed. However, the isotropic property of ZNCC treats all pixels equally and tends to flatten the details of the surface. In this section, we first review the derivative of the ZNCC-based similarity measure and then propose a detail-preserving similarity measure based on the principle of guided image filtering.
1) Derivative of ZNCC-based Similarity Measure:
The ZNCC measure is defined as follows:

$$m(I_1, I_2)(\mathbf{p}) = v_{1,2}(\mathbf{p}) / \sqrt{v_1(\mathbf{p})\, v_2(\mathbf{p})}, \quad (13)$$

where $v_1$, $v_2$ and $v_{1,2}$ are given by

$$v_i(\mathbf{p}) = G_\sigma \star I_i^2(\mathbf{p}) / \omega(\mathbf{p}) - \mu_i^2(\mathbf{p}) + \epsilon, \quad (14)$$
$$v_{1,2}(\mathbf{p}) = G_\sigma \star I_1 I_2(\mathbf{p}) / \omega(\mathbf{p}) - \mu_1 \mu_2(\mathbf{p}), \quad (15)$$
$$\mu_i(\mathbf{p}) = G_\sigma \star I_i(\mathbf{p}) / \omega(\mathbf{p}), \quad (16)$$

where $G_\sigma$ is a Gaussian kernel with standard deviation $\sigma$, $\omega$ is a normalization coefficient accounting for the shape of the support domain, $\omega = \int_{\pi_i \circ S_{i,j}} G_\sigma(\mathbf{p} - \mathbf{q})\, d\mathbf{q}$, and the small positive constant $\epsilon$ is introduced to prevent the denominator from being zero. The derivative of $m(I_1, I_2)$ with respect to any entry of $I_2$ at pixel position $\mathbf{p}$ has the following form [37]:

$$\partial m(\mathbf{p}) = \alpha(\mathbf{p})\, I_1(\mathbf{p}) + \beta(\mathbf{p})\, I_2(\mathbf{p}) + \gamma(\mathbf{p}), \quad (17)$$
$$\alpha(\mathbf{p}) = G_\sigma \star \frac{-1}{\omega \sqrt{v_1 v_2}}(\mathbf{p}), \quad (18)$$
$$\beta(\mathbf{p}) = G_\sigma \star \frac{m}{\omega v_2}(\mathbf{p}), \quad (19)$$
$$\gamma(\mathbf{p}) = G_\sigma \star \Big(\frac{\mu_1}{\omega \sqrt{v_1 v_2}} - \frac{\mu_2\, m}{\omega v_2}\Big)(\mathbf{p}). \quad (20)$$

Note that a variation at $\mathbf{p}$ also affects the similarity measure at its neighboring positions. In fact, if we restrict ZNCC to a local square window of size $w$, a variation of pixel $\mathbf{p}$ affects all of its neighbouring pixels in a region of size $2w \times 2w$.
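For illustration, the windowed ZNCC of Eqs. (13)-(16) can be computed with Gaussian-weighted local statistics. This is a minimal numpy/scipy sketch; the boundary normalization by $\omega$ is omitted by assuming the window lies fully inside the valid domain:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def zncc_map(I1, I2, sigma=2.0, eps=1e-6):
    """Per-pixel windowed ZNCC, m = v12 / sqrt(v1 * v2), of Eq. (13).

    Local means, variances and covariance are obtained by Gaussian
    filtering, as in Eqs. (14)-(16); eps keeps the denominator nonzero.
    """
    mu1 = gaussian_filter(I1, sigma)
    mu2 = gaussian_filter(I2, sigma)
    v1 = gaussian_filter(I1 * I1, sigma) - mu1**2 + eps   # local variance of I1
    v2 = gaussian_filter(I2 * I2, sigma) - mu2**2 + eps   # local variance of I2
    v12 = gaussian_filter(I1 * I2, sigma) - mu1 * mu2     # local covariance
    return v12 / np.sqrt(v1 * v2)
```

As a sanity check, an image and an affinely re-lit copy of it ($I_2 = aI_1 + b$) score close to 1 everywhere, reflecting ZNCC's robustness to affine illumination change.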
2) Detail-preserving Similarity Measure Based on Guided Image Filtering:
Let $I_1$ be the filtering input, and $I_2$ be the guidance image. The principle of guided image filtering is to assume a local linear transformation between the filtering output $Q$ and the guidance image $I_2$ for any pixel $\mathbf{p}$ belonging to a local window $w_k$ ($k$ is the center of the window):

$$Q(\mathbf{p}) = a(\mathbf{p})\, I_2(\mathbf{p}) + b(\mathbf{p}). \quad (21)$$
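A minimal gray-scale guided filter along these lines might look as follows. This is a Python sketch, using Gaussian windows to match the $G_\sigma$-weighted statistics of this section rather than the box windows of the original guided filter; the `eps` regularizer plays the role of the tolerance $\epsilon$ of Eq. (14):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def guided_filter(I, G, sigma=2.0, eps=1e-3):
    """Guided image filtering (cf. He et al. [34]) with Gaussian windows.

    I : filtering input image;  G : guidance image.
    Uses the local linear model Q = a*G + b with
    a = cov(G, I) / (var(G) + eps) and b = mean(I) - a*mean(G),
    then averages the per-window coefficients to form the output.
    """
    mu_G = gaussian_filter(G, sigma)
    mu_I = gaussian_filter(I, sigma)
    var_G = gaussian_filter(G * G, sigma) - mu_G**2
    cov_GI = gaussian_filter(G * I, sigma) - mu_G * mu_I
    a = cov_GI / (var_G + eps)
    b = mu_I - a * mu_G
    # smooth the coefficients before forming the output image
    return gaussian_filter(a, sigma) * G + gaussian_filter(b, sigma)
```

A larger `eps` suppresses the linear coefficient `a` and drives the output toward a plain smoothing of the input; a small `eps` lets edges of the guidance pass through.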
1. Then, Eq. (28) can beapproximately rewritten as: v ∂ m ( p ) + I ( p ) ≈ ( G σ (cid:63) v , ω v ( p )) I ( p ) + G σ (cid:63) ( v µ − v µ ) ω v ( p ) . (29)Note that the right sides of Eq. (26) and Eq. (28) are thesame, and the minimization of reprojection error is actuallythe maximization of similarity measure. Therefore, we have I ( p ) + v ∂ m ( p ) ≈ Q ( p ) , (30)and guided image filtering can be approximately interpretedas one step of variational image registration of I ( p ) and I ( p )with constraint v ( p ) = v ( p ) and stepsize v .Motivated by the connection between guided image filteringand image registration, and to enhance the edge preservationof ZNCC-based similarity measure, we modify the derivative ∂ m in Eq. (17) by adding a term to enforce the constraint (a) (b) (c)(d) (e) (f) Fig. 3: Illustration of the proposed similarity measure. (a) One sampleimage from the temple sparse ring dataset. (b) The reconstructionresult obtained by isotropic ZNCC similarity measure. (c) Theresult obtained by the proposed detail-preserving similarity measure.Comparing (c) with (b), one can see that the proposed method canbetter recovers the fine-scale details. (d)-(f) show another examplefrom the Buddha dataset. v ( p ) = v ( p ): ∂ ˜ m ( p ) = α I ( p ) + β I ( p ) + γ ( p ) + κ G σ (cid:63) ( v − v ) v ( p ) , (31)where κ is a tradeo ff parameter to adjust the influence of thevariance constraint. In practice, we initialize κ with a smallvalue in the beginning of surface evolution, and graduallyincrease it until convergence. By Eq. (31), the predicted imageˆ I i , j , S is implicitly set as the guidance image. As shown in Fig.3, the proposed similarity measure can recover the fine-scaledetails and largely extend the edge preservation capability ofthe original isotropic measure. B. 
Content-aware Mesh Denoising via $\ell_p$-norm Minimization

Let matrix $X_0 = (\mathbf{x}_{0,i})_{i=1}^n \in \mathbb{R}^{3 \times n}$ store the positions of the $n$ vertices of the reconstructed noisy surface mesh $S^{k+0.5}$, and let matrix $X = (\mathbf{x}_i)_{i=1}^n \in \mathbb{R}^{3 \times n}$ store the positions of the $n$ vertices of the noise-free surface mesh to be estimated. The mesh denoising problem in Eq. (12) can be reformulated as:

$$\min_X \|X - X_0\|_F^2 + \lambda R(X). \quad (32)$$

In this section, we first present the proposed content-aware model based on the MAP framework, and then propose an alternating minimization algorithm for content-aware mesh denoising.
1) MAP-based Mesh Denoising with Hyper-Laplacian Prior:
Denote by $q(X)$ the prior on the sharpness of the noise-free mesh, and by $q(X_0|X)$ the likelihood of the noisy mesh. The MAP framework estimates $X$ by maximizing the posterior probability $q(X|X_0) \propto q(X_0|X)\, q(X)$. By assuming that the noise is additive white Gaussian noise with standard deviation $\sigma$, the likelihood of the noisy mesh can be modeled as:

$$q(X_0 | X, \sigma) = \prod_i \frac{1}{(2\pi\sigma^2)^{3/2}} \exp\Big(-\frac{\|\mathbf{x}_i - \mathbf{x}_{0,i}\|^2}{2\sigma^2}\Big). \quad (33)$$

Fig. 5: Shape parameters $(p, \theta)$ of ten different 3D models, including Armadillo.

For a surface mesh, the edge-based discrete Laplacian operator $D \in \mathbb{R}^{m \times n}$ proposed by He et al. [42] can be adopted for computing surface gradients (where $m$ is the number of edges in the mesh). In image restoration, it has been empirically verified that natural image gradients generally follow a heavy-tailed distribution and can be well described by a hyper-Laplacian [48]. Therefore, we suggest using a hyper-Laplacian to model surface gradients:

$$q(X | \theta, p) = \prod_i \frac{p\, \theta^{1/p}}{2\, \Gamma(1/p)} \exp\big(-\theta\, |(DX)_i|^p\big), \quad (34)$$

where $\Gamma$ is the Gamma function, and $p$ and $\theta$ are the shape parameters. $p \in [0,
1]$ determines the peakiness and $\theta$ determines the width of the hyper-Laplacian distribution.

One concern is whether the surface gradients of real 3D models follow the hyper-Laplacian distribution. Fig. 4 shows the empirical distributions and the corresponding hyper-Laplacian fits of the surface gradients of three real models. One can see that the hyper-Laplacian fits the empirical distributions very well.

It should be noted that the shape parameters $p$ and $\theta$ vary for different 3D models. To illustrate this, we computed the shape parameters on more than twenty public models, including Armadillo and Bunny from the Stanford repository [55] and models from the AIM@SHAPE repository [56]. We provide the shape parameters of ten models in Fig. 5. One can see that the $p$ values vary up to 0.75 and the $\theta$ values vary from 5.506 to 483.9. Therefore, instead of fixing the shape parameters, the $p$ and $\theta$ values should be adaptively estimated for each 3D model. We propose the following content-aware mesh denoising model that jointly estimates the mesh $X$, the noise level $\sigma$, and the shape parameters $\theta$ and $p$ from the observation $X_0$:

$$(\bar{X}, \bar{\sigma}, \bar{\theta}, \bar{p}) = \arg\min_{X, \sigma, \theta, p} \big\{ -\log q(X, \sigma, \theta, p \mid X_0) \big\} = \arg\min_{X, \sigma, \theta, p} \Big\{ \frac{3n}{2} \log(2\pi\sigma^2) + \frac{\|X - X_0\|_F^2}{2\sigma^2} + \theta \sum_i |(DX)_i|^p + m \Big( \log \Gamma\big(\tfrac{1}{p}\big) - \log \tfrac{p}{2} - \tfrac{1}{p} \log \theta \Big) \Big\}. \quad (35)$$
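As an illustration, the shape parameters of the hyper-Laplacian prior in Eq. (34) can be fitted by combining a closed-form $\theta$ update with a 1D exhaustive search over $p$. This is a simplified Python sketch; `fit_hyper_laplacian` and its `p_grid` are hypothetical choices, operating on the stacked absolute edge gradients $|(DX)_i|$:

```python
import numpy as np
from scipy.special import gammaln

def fit_hyper_laplacian(g, p_grid=np.linspace(0.05, 1.0, 96)):
    """Fit the shape parameters (p, theta) of the hyper-Laplacian in Eq. (34).

    g : array of surface gradient magnitudes |(DX)_i|.
    For each candidate p, theta = m / (p * sum|g|^p) minimizes the
    negative log-likelihood in closed form; the best p is then picked
    by exhaustive 1D search over p_grid.
    """
    g = np.abs(np.ravel(g)) + 1e-12     # guard against exact zeros
    m = g.size
    best = (None, None, np.inf)
    for p in p_grid:
        s = np.sum(g**p)
        theta = m / (p * s)
        # negative log-likelihood of the hyper-Laplacian, up to constants
        nll = theta * s + m * (gammaln(1.0 / p) - np.log(p)
                               - np.log(theta) / p)
        if nll < best[2]:
            best = (p, theta, nll)
    return best[0], best[1]
```

On gradients drawn from a Laplacian distribution (the $p = 1$ case), the search recovers a $p$ near 1 and a $\theta$ near the inverse mean absolute gradient, matching the closed-form $\theta$ update.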
2) Alternating Minimization:
We propose an alternating minimization algorithm to minimize Eq. (35) for the joint estimation of the noise-free (denoised) mesh $X$, the noise standard deviation $\sigma$, and the hyper-Laplacian parameters $\theta$ and $p$, by iteratively solving the following two subproblems.

Fig. 4: Surface gradient distributions of three real 3D models: (a) the Red circular box model, (b) the Hand Olivier model, and (c) the Gargo model. (d)-(f) are their empirical distributions of sharpness (red) and the corresponding fitted hyper-Laplacian profiles (blue).

(1) Given $X$, the optimization problems w.r.t. $\sigma$ and $(\theta, p)$ can be reformulated as

$\bar{\sigma} = \arg\min_{\sigma} \Big\{ \frac{n}{2}\log(2\pi\sigma^2) + \frac{\|\tilde{X} - X\|^2}{2\sigma^2} \Big\}, \qquad (36)$

$(\bar{\theta}, \bar{p}) = \arg\min_{\theta, p} \Big\{ \theta \sum_i |(DX)_i|_p^p + m\Big(\log\big(2\Gamma(1/p)\big) - \log p - \frac{1}{p}\log\theta\Big) \Big\}. \qquad (37)$

The $\sigma$-subproblem has the closed-form solution $\sigma^2 = \|\tilde{X} - X\|^2 / n$. The problem in Eq. (37) can be solved by (a) finding the optimal $\theta$ for a given $p$, which yields the closed-form solution $\theta = m / \big(p \sum_i |(DX)_i|_p^p\big)$, and (b) using a simple 1D exhaustive search to estimate $p$ for the given $\theta$.

(2) Given $\sigma$, $\theta$ and $p$, we define $\lambda = 2\theta\sigma^2$, and then we have

$\bar{X} = \arg\min_{X} \|\tilde{X} - X\|^2 + \lambda \|DX\|_p^p, \qquad (38)$

where $\lambda$ is the regularization parameter. Using the variable splitting approach, Eq. (38) can be reformulated as

$\bar{X} = \arg\min_{X, \psi} \|\tilde{X} - X\|^2 + \lambda \|\psi\|_p^p + \beta \|DX - \psi\|^2, \qquad (39)$

which can be optimized with an alternating optimization method. With $\psi$ fixed, $X$ is obtained by solving the quadratic problem

$\bar{X} = \arg\min_{X} \|\tilde{X} - X\|^2 + \beta \|DX - \psi\|^2. \qquad (40)$

With $X$ fixed, $\psi$ is obtained by solving

$\bar{\psi} = \arg\min_{\psi} \lambda \|\psi\|_p^p + \beta \|DX - \psi\|^2. \qquad (41)$

This subproblem can be efficiently solved by the generalized shrinkage/thresholding (GST) method [49]. The solution for each $(\psi)_i$ can be written as

$(\psi)_i = T^{\mathrm{GST}}_p\big((DX)_i;\ \lambda / \beta\big), \qquad (42)$

where $T^{\mathrm{GST}}_p$ is the generalized shrinkage/thresholding operator [49]. As the penalty factor $\beta \to \infty$, the solution of Eq. (39) converges to that of Eq. (38). In practice, we adopt a continuation technique, initializing $\beta$ with a small value and gradually increasing it until convergence.

V. EXPERIMENTS

We implemented the proposed DCV method in C++ with OpenGL and the CGAL and TAUCS libraries. The predicted images are estimated using projective texture mapping, and OpenGL is used to generate the horizon and terminator of the triangular surface. CGAL is used to manipulate the triangular mesh, and TAUCS to manipulate the sparse matrices. We quantitatively and qualitatively evaluate the performance of DCV on multiple datasets, including the Middlebury benchmark and several public datasets with indoor and outdoor scenes, all with camera calibration parameters available. We also provide three real datasets, i.e., Buddha, Totoro and bell, taken with a mobile phone or a digital camera; for these three datasets, the cameras are calibrated using the Bundler software [52]. In all experiments, the window size for calculating the similarity measure is set to 7 × 7.

A.
Initialization and Implementation Details
In our experiments, we consider two initialization methods: visual hull and PMVS + PSR (Poisson Surface Reconstruction). The visual hull is the intersection of the visual cones associated with all image silhouettes, and provides a good initialization for most indoor scenes where the object of interest can easily be segmented from the background. PMVS is open-source software developed by Furukawa and Ponce [43]. A set of dense patches is generated by PMVS with its default parameters, and a triangular surface mesh is then estimated using PSR [51] with the octree depth fixed to 8. PMVS + PSR is mainly used to initialize scenes where the background is not easy to segment from the foreground, including some outdoor and indoor scenes. For the temple dataset, because some small protruding structures tend to be over-smoothed when the large concave region at the back of the temple is recovered, we also use PMVS + PSR to generate the initial mesh. The statistics of all the datasets used in our experiments are listed in Table I, including the number of images, image resolution, initialization and running times (CPU i7, 2.4 GHz).

Two issues, non-convexity and topology adaptivity, are considered in our implementation. Since the Lp sparsity (0 ≤ p ≤ 1) is used, the objective functional of DCV is non-convex, making the algorithm sensitive to local minima. To alleviate this, we adopt a multi-resolution scheme: we first minimize the energy on a low-resolution mesh with accordingly downsampled images, and then optimize it on the high-resolution mesh with full-size images. A Gaussian pyramid is used to downsample the images. The Qslim algorithm [53] is used to simplify the mesh, and the √3-subdivision scheme [54] to subdivide the mesh to a higher resolution. The second issue is the topology adaptivity of mesh-based methods, which need an initial surface with an approximately consistent topology. We use two initialization methods, visual hull and PMVS + PSR, in the experiments. Other initialization methods can also be deployed, e.g., methods based on features, fusion of depth maps, or volumetric optimization.
TABLE I: Datasets used in our experiments

Dataset         Images   Resolution    Initialization   Time (min)
dino sparse     16       640 × 480     visual hull      90
dino ring       48       640 × 480     visual hull      150
temple sparse   16       640 × 480     PMVS + PSR       105
temple ring     47       640 × 480     PMVS + PSR       170
Beethoven       33       1024 × 768    visual hull      180
bird            21       1024 × 768    visual hull      160
fountain-P11    11       3072 × 2048   PMVS + PSR       210
Herzjesu-P8     8        3072 × 2048   PMVS + PSR       150
Totoro          8        1504 × …      PMVS + PSR       45
Buddha          5        2400 × …      PMVS + PSR       30
bell            3        1504 × …      PMVS + PSR       20
statuegirl      50       2592 × …      PMVS + PSR       750

B. Middlebury Datasets
We first evaluate the effectiveness of DCV on the Middlebury benchmark [1] using two performance indicators: accuracy and completeness. Accuracy is measured by the distance d such that 90% of the reconstructed surface is within distance d of the ground-truth surface. Completeness is measured by the percentage f of the ground-truth surface that is within 1.25 mm of the reconstructed surface.

We test DCV on the dino ring (48 views), dino sparse ring (16 views), temple ring (47 views) and temple sparse ring (16 views) datasets. The smaller the number of views, the more difficult and challenging the reconstruction. The accuracy and completeness of DCV on these four datasets are shown in Fig. 6. These results are also publicly available on the Middlebury evaluation page [50], where they can be compared with the state of the art. It is worth mentioning that, at the time this paper was submitted, our DCV method achieved the best results on dino ring and dino sparse ring in terms of both completeness and accuracy. Although our results on the temple ring and temple sparse ring datasets are not top ranked, the visual quality of our results is better than that of most other top-ranked methods. In Fig. 7, we compare the reconstruction results of DCV and several top-ranked methods on the temple ring dataset. The reconstruction results of DCV on dino sparse ring and temple sparse ring are shown in Fig. 8, including a comparison with their coarse initializations, which indicates that the proposed method is not sensitive to initialization.

The quantitative comparison between DCV and several state-of-the-art methods [4], [25], [26], [29], [35], [36], [43] is listed in Table II. Since some reconstruction results were not reported by the authors, we label them as '−'. We also conduct a visual comparison with four representative methods [26], [29], [36], [43] on the dino sparse ring dataset in Fig. 9. The methods proposed in [29], [36] use an isotropic similarity measure for reprojection error minimization together with isotropic mesh smoothing. The method proposed in [26] uses an anisotropic weighted minimal surface functional. The method in [43] combines a patch-based approach with isotropic surface refinement.

Fig. 6: Evaluation results of DCV on the Middlebury benchmark.
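The two Middlebury-style indicators can be sketched directly from precomputed point-to-surface distances. The distance computation itself and the benchmark's exact protocol are outside this sketch, and the function names are ours:

```python
import numpy as np

def accuracy(d_rec_to_gt, quantile=0.90):
    """Distance d such that `quantile` of the reconstructed surface
    lies within d of the ground truth (smaller is better)."""
    return float(np.quantile(np.asarray(d_rec_to_gt), quantile))

def completeness(d_gt_to_rec, tol=1.25):
    """Fraction of the ground-truth surface within `tol` (mm) of the
    reconstruction (larger is better)."""
    d = np.asarray(d_gt_to_rec)
    return float(np.mean(d < tol))

# Toy example with synthetic per-point distances (millimetres):
# nine close points and one outlier.
d1 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0])
print(accuracy(d1))      # 90th-percentile distance
print(completeness(d1))  # fraction within 1.25 mm
```

Note that the two measures use distances in opposite directions (reconstruction-to-ground-truth for accuracy, ground-truth-to-reconstruction for completeness), so a surface can be accurate yet incomplete, or vice versa.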
C. Results on the Other Datasets
We further apply DCV to several other public datasets: the
Beethoven dataset [25], the bird dataset [26], the fountain-P11 dataset [2], and the statuegirl dataset [64]. We also conductexperiments on two real datasets collected by us:
Buddha and bell. Since ground truths of these datasets are not available, a qualitative evaluation of the reconstruction results is adopted in these experiments.

The Beethoven dataset and the bird dataset contain thirty-three 1024 × 768 images and twenty-one 1024 × 768 images, respectively. They were captured by a set of synchronized cameras. The Beethoven dataset presents a textureless, smooth surface, while the bird dataset presents a highly textured surface. The reconstruction results on Beethoven by several state-of-the-art methods [29], [31], [36] and by the proposed DCV are shown in Fig. 10. Thanks to the content-aware Lp-minimization-based denoising algorithm, DCV is able to effectively suppress noise and outliers while keeping the sharp features of the surface. The results on the bird dataset by these methods are shown in Fig. 11. If the isotropic similarity measure is used, small protrusions such as the wings, the details of the feathers, the claws and the head of the bird are treated as high-frequency noise and are thus over-smoothed. With the proposed detail-preserving similarity measure, these fine-scale details are well preserved.

The Buddha dataset consists of five 2400 × … images, and the bell dataset of only three 1504 × … images. The reconstruction results on the Buddha and bell datasets are shown in Fig. 12. (Although there was an evaluation system for quantitatively evaluating the fountain-P11 dataset, the evaluation service is currently unavailable.)
TABLE II: Quantitative comparison between DCV and several state-of-the-art methods on the Middlebury datasets in terms of accuracy / completeness.

Fig. 7: Comparison of reconstruction results on the Middlebury temple ring dataset. The names of the compared methods follow the entries on the Middlebury evaluation website. (a) Vu [31], Acc. 0.45, Comp. 99.8%. (b) Campbell [62], Acc. 0.48, Comp. 99.4%. (c) Furukawa3 [43], Acc. 0.47, Comp. 99.6%. (d) Hernandez [19], Acc. 0.52, Comp. 99.5%. (e) The proposed DCV method, Acc. 0.73, Comp. 98.2%. (f) Ground truth.
Fig. 8: Reconstruction results of DCV on dino sparse ring (first row) and temple sparse ring (second row). (a) Sample images of dino sparse ring; (b) and (d) the visual hulls; (c) and (e) the reconstruction results of DCV. (f) Sample images of temple sparse ring; (g) and (i) the reconstruction results of PMVS + PSR, which are adopted as the initial surfaces for DCV; (h) and (j) the reconstruction results of DCV.

Fig. 9: Comparison of reconstruction results on the dino sparse ring dataset. From left to right: results by [36], [26], [29], [43], the proposed DCV, and the ground truth, respectively. DCV clearly performs better in preserving details and sharp features while filtering out the noise.

Fig. 10: Reconstruction results by several state-of-the-art methods and the proposed DCV on the
Beethoven dataset. From left column to right column: input images, results by [29], [36], [31], and DCV in two views, respectively.

The bell images were captured in a museum, and thus only a partial surface of the bell is reconstructed. The small number of observed images makes the regularization scheme all the more important. One of the observed images is shown in Fig. 12(a). The initial surface of the bell is estimated by PMVS + PSR; the point cloud generated by PMVS and the watertight surface generated by PSR are shown in Fig. 12(f) and Fig. 12(b), respectively. Fig. 12(c), Fig. 12(d) and Fig. 12(e) show the reconstruction results of "isotropic similarity + isotropic regularization", "detail-preserving similarity + isotropic regularization", and DCV, respectively. The chosen isotropic regularization combines the first-order and second-order Laplacians [43]. Fig. 12(h) and Fig. 12(j) show close-ups of the corresponding results in Fig. 12(b) and Fig. 12(e). It is clear that DCV gives the best results among all competing methods in terms of both fine details and surface smoothness.

The fountain-P11 is an outdoor dataset comprising eleven 3072 × 2048 images. The reconstruction results on the fountain-P11 dataset are shown in Fig. 13: the input images, the initial surface generated by PMVS + PSR and the reconstruction results by DCV are shown in Fig. 13(a), and the comparison of DCV with the isotropic method in Fig. 13(b). It can easily be seen that DCV performs better than the isotropic method in preserving fine-scale details and sharp features.

The statuegirl is an outdoor dataset comprising fifty 2592 × … images. One of the input images, the initial surface generated by PMVS + PSR, and the reconstruction results by DCV and by the commercial 3D reconstruction software Smart3Dcapture (free edition) [64] are shown in the first row of Fig. 14, respectively. Close-ups of different surface regions are shown in the second and third rows. The images were downsampled by half before performing the reconstructions.
The Smart3Dcapture software integrates a complete and robust reconstruction pipeline, including camera calibration, dense reconstruction and visualization; we used its ultra-high-precision option to recover more details. For DCV, Bundler is used for calibration and PMVS + PSR for initialization. The comparison shows that DCV generally obtains results similar to Smart3Dcapture, and in some parts (e.g., the toes) it recovers more fine-scale details.
Fig. 13: (a) Results of DCV on the fountain-P11 dataset. From left to right: several input images, the initial surface, and the results obtained by DCV. (b) The left image shows the result obtained by the method based on the isotropic similarity measure and surface regularization; the right image shows the result obtained by DCV, which clearly performs better in preserving small-scale details and sharp features.
D. Evaluation on Content-aware Mesh Denoising
Our DCV method consists of two components, i.e., the detail-preserving similarity measure and the content-aware Lp mesh denoising. The effectiveness of the former has been validated in Fig. 3 by comparison with the isotropic ZNCC measure. To evaluate the effectiveness of the latter, we implement four variants of DCV by substituting the content-aware Lp mesh denoising with four competing denoising methods: the isotropic mesh smoothing method (i.e., the one based on the combination of first-order and second-order Laplacians [43]) and three anisotropic mesh denoising methods (i.e., two-step normal filtering [40], bilateral normal filtering [41], and L0 mesh denoising [42]). Two datasets, Herzjesu-P8 and Totoro, are used for evaluating the mesh denoising methods.

The Herzjesu-P8 dataset contains eight 3072 × 2048 images, and the Totoro dataset contains eight 1504 × … images. As shown in Fig. 15, on the Herzjesu-P8 dataset the anisotropic methods perform better than the isotropic method, and our content-aware Lp denoising achieves results similar to He et al.'s L0 denoising but is visually more pleasant. As shown in Fig. 16, on the Totoro dataset our content-aware Lp denoising obtains much better results than the other methods. Unlike the competing denoising methods, the proposed Lp denoising is content-aware and is able to reconstruct objects with flat regions, sharp edges, and fine-scale details.

Fig. 11: Reconstruction results by state-of-the-art methods and the proposed DCV on the bird dataset. From left column to right column: input images, results by [29], [36], [31] and DCV in two views, respectively.

Fig. 12: Reconstruction results on the bell dataset. (a) One of the input images. (b) The initial reconstruction using PMVS + PSR; the point cloud generated by PMVS is shown in (f). (c) Reconstruction result using the isotropic similarity measure + isotropic smoothing. (d) Reconstruction result using the detail-preserving similarity measure + isotropic smoothing. (e) Reconstruction result by DCV. (g)-(j) are close-ups corresponding to the red rectangle regions in (b)-(e), respectively. One can see that DCV preserves the fine details well while producing a smooth surface.

VI. CONCLUSION

In this paper, we proposed a detail-preserving and content-aware variational (DCV) method for multi-view stereo (MVS) reconstruction. First, by connecting guided image filtering with image registration, a novel similarity measure was proposed to preserve fine-scale details in the reconstruction. Second, based on hyper-Laplacian modelling of surface gradients, a content-aware mesh denoising method based on Lp minimization was presented to suppress noise and outliers while preserving sharp features. Compared with state-of-the-art MVS methods, the proposed DCV method is capable of reconstructing a smooth and clean surface with finely preserved details and sharp features. The running time of our single-threaded CPU implementation of DCV on the datasets used in this paper ranges from twenty minutes to several hours.
In the future, a GPU-based parallel implementation of the main parts of the gradient computation, using the Nvidia CUDA framework, will be adopted to improve the speed of DCV.

ACKNOWLEDGEMENT

The authors would like to thank Dr. Daniel Scharstein for evaluating our results on the Middlebury datasets, and Dr. Hoang Hiep Vu for providing the statuegirl dataset.

REFERENCES

[1] S. Seitz, B. Curless, J. Diebel, D. Scharstein and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in
Proc. CVPR , pp. 519-526, 2006.[2] C. Strecha, W. von Hansen, L. Van Gool, P. Fua and U. Thoennessen,“On benchmarking camera calibration and multiview stereo for highresolution imagery,” in
Proc. CVPR, pp. 1-8, 2008.

Fig. 14: Results on the statuegirl dataset. First row, from left to right: one of the input images, the initial surface by PMVS + PSR, results by DCV, and results by Smart3Dcapture Free Edition with the ultra-high-precision setting. Second and third rows, from left to right: close-ups of the reconstruction results in the first row.
Fig. 15: Results of different mesh denoising methods on the Herzjesu-P8 dataset. (a) Input images. (b) Results of the combination of first-order and second-order Laplacians [29], [43]. (c) Results of Sun et al.'s method [40]. (d) Results of Zheng et al.'s bilateral normal filtering [41]. (e) Results of He et al.'s L0 denoising [42]. (f) Results of our Lp denoising method. All models are flat-shaded to show the faceting effect.

[3] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer and M. Pollefeys, "Live Metric 3D Reconstruction on Mobile Phones," in Proc. ICCV, pp. 1-8, 2013.
[4] I. Kostrikov, E. Horbert and B. Leibe, "Probabilistic Labeling Cost for High-Accuracy Multi-View Reconstruction," in
Proc. CVPR , 2014.[5] A. Delaunoy and M. Pollefeys, “Photometric Bundle Adjustment forDense Multi-View 3D Modeling,” in
Proc. CVPR , 2014.[6] M. Meyer, M. Desbrun, P. Schrder and A. H. Barr, “Discrete di ff erential-geometry operators for triangulated 2-manifolds,” Visualization andMathematics III,Part I , pp 35-57, 2003[7] Y. Duan, L. Yang, H. Qin and D. Samaras, “Shape Reconstruction from3D and 2D Data Using PDE-Based Deformable Surfaces,” in
Proc.ECCV , pp 238-251, 2004[8] G. Vogiatzis, P. Torr, S.M. Seitz and R. Cipolla, “Reconstructing reliefsurfaces,” in
Proc. BMVC , 2004 [9] J. Isidoro and S. Sclaro ff , “Stochastic Refinement of the Visual Hull toSatisfy Photometric and Silhouette Consistency Constraints,” in Proc.ICCV , pp. 1335-1342, 2003[10] D. Bradley, T. Popa, A. She ff er, W. Heidrich and T. Boubekeur,“Markerless garment capture,” ACM Trans. Graphics , vol. 27, no. 3,pp. 538-551, 2008.[11] P. Yan, S.M. Khan and M. Shah, “3d model based object class detectionin an arbitrary view,” in
Proc. ICCV , pp. 1-6, 2006.[12] A. Kushal and J. Ponce, “Modeling 3d objects from stereo views andrecognizing them in photographs,” in
Proc. ECCV , pp. 563-574, 2006.[13] O. Faugeras and R. Keriven, “Variational principles, surface evolution,pdes, level set methods, and the stereo problem,”
IEEE Trans. ImageProcess. , vol. 7, no. 3, pp. 336-344, 1998.[14] B. Goldlcke, I. Ihrke, C. Linz and M. Magnor, “Weighted minimalhypersurface reconstruction,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 7, pp. 1194-1208, 2007.

Fig. 16: Results of different mesh denoising methods on the Totoro dataset. (a) Input images. (b) Results of the combination of first-order and second-order Laplacians [29], [43]. (c) Results of Sun et al.'s method [40]. (d) Results of Zheng et al.'s bilateral normal filtering [41]. (e) Results of He et al.'s L0 denoising [42]. (f) Results of our content-aware Lp denoising method. All models are flat-shaded to show the faceting effect.

[15] K. Yoon and I. Kweon, "Adaptive support-weight approach for correspondence search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 650-656, 2006.
[16] C. Rhemann, A. Hosni, M. Bleyer, C. Rother and M. Gelautz, "Fast cost-volume filtering for visual correspondence and beyond," in
Proc. CVPR, pp. 1-8, 2011.
[17] Z. Ma, K. He, Y. Wei, J. Sun and E. Wu, "Constant time weighted median filtering for stereo matching and beyond," in
Proc. ICCV , pp.1-8, 2013.[18] Q. Yang, “A non-local cost aggregation method for stereo matching,” in
Proc. CVPR , pp. 1402-1409, 2012.[19] C. Hernandez and F. Schmitt, “Silhouette and stereo fusion for 3d objectmodeling,”
Comput. Vis. Image Und. , vol. 96, no. 3, pp. 367-392, 2004.[20] Z. Li, K. Wang, W. Jia, H. C. Chen, W. Zuo, D. Meng and M. Sun,“Multiview stereo and silhouette fusion via minimizing generalizedreprojection error,”
Image Vision Comput. , vol. 33, no. 1, pp. 1-14, 2015.[21] P. Song, X. Wu and M. Wang, “Volumetric stereo and silhouette fusionfor image-based modeling,”
Visual Comput. , vol. 26, no. 12, pp. 1435-1450, 2010.[22] S. Sinha, P. Mordohai and M. Pollefeys, “Multiview stereo via graphcuts on the dual of an adaptive tetrahedral mesh,” in
Proc. ICCV , pp.1-8, 2007.[23] D. Cremers and K. Kolev, “Multiview stereo and silhouette consistencyvia convex functionals over convex domains,”
IEEE Trans. Pattern Anal.Mach. Intell. , vol. 33, no. 6, pp. 1161-1174, 2011.[24] Y. Boykov and V. Lempitsky, “From Photohulls to Photoflux Optimiza-tion,” in
Proc. BMVC , vol. 3, pp. 1149-1158, 2006.[25] K. Kolev, M. Klodt, T. Brox and D. Cremers, “Continuous globaloptimization in multiview 3D reconstruction,”
Int. J. Comput. Vis. , pp.80-96 , 2009.[26] K. Kolev, T. Pock and D. Cremers, “Anisotropic minimal surfaces inte-grating photoconsistency and normal information for multiview stereo,”in
Proc. ECCV , pp. 538-551, 2010.[27] C. Wu, B. Wilburn, Y. Matsushita and C. Theobalt, “High-quality shapefrom multi-view stereo and shading under general illumination,” in
Proc.CVPR , pp. 969-976, 2011.[28] N. Birkbeck, D. Cobzas, P. Sturm and M. Jagersand, “Variational shapeand reflectance estimation under changing light and viewpoints,” in
Proc.ECCV , pp. 536-549, 2006.[29] A. Zaharescu, E. Boyer and R. Horaud, “Transformesh: A topology-adaptive mesh-based approach to surface evolution,” in
Proc. ACCV ,166-175, 2007.[30] A. Delaunoy, E. Prados, P. Gargallo, P. J. Philippe and P. Sturm,“Minimizing the multi-view stereo reprojection error for triangularsurface meshes,” in
Proc. BMVC , pp. 1-10, 2008.[31] H. Vu, R. Keriven, P. Labatut and J. P. Pons, “Towards high-resolutionlarge-scale multi-view stereo,” in
Proc. CVPR, pp. 1-10, 2009.
[32] F. Lafarge, R. Keriven, M. Bredif and H. Vu, "Hybrid multi-view reconstruction by jump-diffusion," in Proc. CVPR, pp. 350-357, 2010.
[33] F. Lafarge, R. Keriven and M. Bredif, "Insertion of 3D-Primitives in Mesh-Based Representations: Towards Compact Models Preserving the Details,"
IEEE Trans. Image Process. , vol. 19, no. 7, pp. 1683-1694,2010.[34] K. He, J. Sun and X. Tang, “Guided image filtering,” in
Proc. ECCV ,pp. 1-14, 2010.[35] P. Gargallo, E. Prados and P. Sturm, “Minimizing the reprojection errorin surface reconstruction from images,” in
Proc. ICCV , pp. 1-8, 2007.[36] A. Delaunoy and E. Prados, “Gradient flows for optimizing triangularmesh-based surfaces: Applications to 3D reconstruction problems deal-ing with visibility,”
Int. J. Comput. Vis. , vol. 95, no. 2, pp. 100-123,2011.[37] J. Pons, R. Keriven and O. Faugeras, “Multi-view stereo reconstructionand scene flow estimation with a global image-based matching score,”
Int. J. Comput. Vis. , vol. 72, no. 2, pp. 179-193, 2007.[38] D. A. Field, “Laplacian smoothing and delaunay triangulations,”
Com-mun. Numer. Meth. En. , vol. 4, no. 6, pp. 709-712, 1988.[39] S. Fleishman, I. Drori and D. Cohen-Or, “Bilateral mesh denoising,”
ACM Trans. Graphics, vol. 22, no. 3, pp. 950-953, 2003.
[40] X. Sun, P. L. Rosin, R. R. Martin and F. C. Langbein, "Fast and effective feature-preserving mesh denoising," IEEE Trans. Vis. Comput. Gr., vol. 13, no. 5, pp. 925-938, 2007.
[41] Y. Zheng, H. Fu, O. K. Au and C. Tai, "Bilateral Normal Filtering for Mesh Denoising,"
IEEE Trans. Vis. Comput. Gr. , vol. 17, no. 10, pp.1521-1530, 2013.[42] L. He and S. Schaefer, “Mesh denoising via l0 minimization,”
ACMTrans. Graphics , vol. 32, no. 4, pp. 1-8, 2013.[43] Y. Furukawa and J. Ponce, “Accurate, dense, and robust multiviewstereopsis,”
IEEE Trans. Pattern Anal. Mach. Intell. , pp. 1362-1376,2008.[44] I. Eckstein, J.P. Pons, Y. Tong, C. J. Kuo and M. Desbrun, “Generalizedsurface flows for mesh processing,” in
Proc. SGP , pp. 1-8, 2007.[45] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of densetwo-frame stereo correspondence algorithms,”
Int. J. Comput. Vis. , vol.47, no. 1, pp. 7-42, 2002.[46] S. Wei and S. Lai, “Fast Template Matching Based on Normalized CrossCorrelation With Adaptive Multilevel Winner Update,”
IEEE Trans.Image Process. , vol. 17, no. 11, pp. 2227-2235, 2008.[47] M. Drulea and S. Nedevschi., “Motion Estimation Using the CorrelationTransform,”
IEEE Trans. Image Process. , vol. 22, no. 8, pp. 3260-3270,2013.[48] D. Krishnan and R. Fergus, “Fast image deconvolution using hyper-laplacian priors,” in
Proc. NIPS , pp. 1-9, 2009.[49] W. Zuo, D. Meng, L. Zhang, X. Feng and D. Zhang, “A generalizediterated shrinkage algorithm for non-convex sparse coding,”in
Proc. ICCV, pp. 1-8, 2013.
[50] http://vision.middlebury.edu/mview/eval/
[51] M. Kazhdan, M. Bolitho and H. Hoppe, "Poisson Surface Reconstruction," in Proc. SGP, pp. 61-70, 2006.
[52] N. Snavely, S. M. Seitz and R. Szeliski, "Photo Tourism: Exploring image collections in 3D," ACM Trans. Graphics, pp. 835-846, 2006.
[53] M. Garland and P. S. Heckbert, "Surface simplification using quadric error metrics," in Proc. ACM SIGGRAPH, pp. 209-216, 1997.
[54] L. Kobbelt, "√3-subdivision," in Proc. SIGGRAPH, pp. 103-112, 2000.
[55] http://graphics.stanford.edu/data/
[56] http://…/ontologies/shapes/
[57] D. Bradley, T. Boubekeur and W. Heidrich, "Accurate multi-view reconstruction using robust binocular stereo and surface meshing," in Proc. CVPR, 2008.
[58] Y. Liu, X. Cao, Q. Dai and W. Xu, "Continuous depth estimation for multi-view stereo," in
Proc. CVPR, 2009.
[59] E. Tola, C. Strecha and P. Fua, "Efficient large-scale multi-view stereo for ultra high-resolution image sets," Mach. Vision Appl., vol. 23, no. 5, pp. 903-920, 2012.
[60] C. Bailer, M. Finckh and H. P. Lensch, "Scale robust multi view stereo," in
Proc. ECCV , 2012.[61] S. Shen “Accurate Multiple View 3D Reconstruction Using Patch-BasedStereo for Large-Scale Scenes,”
IEEE Trans. Image Process. , vol. 22,no. 5, pp. 1901-1914, 2013.[62] N. Campbell, G. Vogiatzis, C. Hernandez and R. Cipolla, “Usingmultiple hypotheses to improve depth-maps for multi-view stereo,” in
Proc. ECCV, 2008.
[63] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183-202, 2009.
[64] https://community.acute3d.com