Single View Depth Estimation from Examples
Tal Hassner and Ronen Basri
Abstract—We describe a non-parametric, "example-based" method for estimating the depth of an object viewed in a single photo. Our method consults a database of example 3D geometries, searching for those which look similar to the object in the photo. The known depths of the selected database objects act as shape priors which constrain the process of estimating the object's depth. We show how this process can be performed by optimizing a well defined target likelihood function via a hard-EM procedure. We address the problem of representing the (possibly infinite) variability of viewing conditions with a finite (and often very small) example set by proposing an on-the-fly example update scheme. We further demonstrate the importance of non-stationarity in avoiding misleading examples when estimating structured shapes. We evaluate our method and present both qualitative as well as quantitative results for challenging object classes. Finally, we show how this same technique may be readily applied to a number of related problems. These include the novel task of estimating the occluded depth of an object's backside and the task of tailoring custom fitting image-maps for input depths.
1 INTRODUCTION

The human visual system is remarkably adept at estimating the shapes of objects from just a single view, despite this being an ill-posed problem; many different shapes can appear the same in an image, and any one of them is as plausible as the next. To overcome this difficulty, existing computational methods routinely make a-priori assumptions on the lighting properties, the object's surface properties, the structure of the scene, and more. Here, we make the following alternative assumption: we assume that the object viewed is roughly similar in shape (but by no means identical) to the shapes of a set of related objects. The obvious example here is of course faces. As we will later show, examples of typical face shapes can be used to estimate even highly unusual faces. We claim that the same is true for other object classes, and indeed demonstrate results for images of challenging objects, including hands, full body figures (non-rigid objects), and fish (highly textured objects).

Specifically, we assume that we have at our disposal a database of relevant example 3D geometries. We can easily obtain the appearances of these objects, viewed under any desired viewing condition, by using standard rendering techniques. To estimate the shape of a novel object from a single image, we search through the appearances of these objects, possibly rendering new appearances as we go, looking for ones appearing similar to the input image. Once found, the known depths of these selected objects serve as priors for the object's shape. We perform this task at the patch level, thus obtaining depth estimates very different from those in the database.

R. Basri is with the Dept. of Computer Science and Applied Math., Weizmann Institute, Israel.
This process is performed via a hard-EM procedure, optimizing a well defined target likelihood function representing the likelihood of the estimated depth given the input image and the set of examples.

This approach to depth estimation has a number of advantages: (1) Our method is non-parametric, and as such requires no a-priori model selection or design. Consequently, (2) it is versatile. As we will show, the same method is used to estimate the shapes of objects belonging to very different object classes, and even to solve additional related tasks. Finally, (3) a data-driven approach requires making no assumptions on the properties of the object in the image nor the viewing conditions. Our chief requirement is the existence of a suitable set of 3D examples. We believe this to be a reasonable requirement given the growing availability of such databases.

Obviously, in taking an example-based approach to depth estimation, we have no guarantee that the example data sets we use contain objects sufficiently similar to the one in the input image. We therefore follow the example of methods such as [1], [2], [3] in seeking to produce plausible depth estimates and not necessarily the true depths. Here, however, the concept of a plausible depth is formally defined by our target function. Moreover, we present quantitative results suggesting that our method is indeed capable of producing accurate estimates even for challenging objects, given an adequate example set.

To summarize, this report reviews the following topics.

• Example-based approach to depth estimation.
We describe an approach to single-view depth estimation and present both qualitative and quantitative results on a number of challenging object classes. We have tested our method on large sets of objects, including real and synthetic images of objects with arbitrary texture, pose, and genus, viewed under unconstrained viewing conditions.

• On-the-fly example update scheme.
We augment existing example-based methods by arguing that examples need not be selected a-priori. To handle the possibly infinite viewing conditions and postures of the objects being reconstructed, we produce better suited examples, while removing less adequate ones, on-the-fly, as part of the reconstruction process.

• Non-stationarity for structured shape reconstruction.
We emphasize the importance of non-stationarity in avoiding depth ambiguities and making better example selections.

• Additional applications.
We show how the same method used for depth estimation may also be used for the additional tasks of estimating the depths of the occluded backsides of objects viewed in an image, as well as estimating the colors of objects from their shape.

The rest of this report is organized as follows. In the next section we review related work. Our depth estimation framework is described in Sec. 3. Our example update scheme is presented in Sec. 4, followed by a discussion of non-stationarity in Sec. 5. We propose additional applications based on our method in Sec. 6. Implementation and results are presented in Sec. 7. Finally, we conclude in Sec. 8.
2 RELATED WORK
Depth estimation.
There is an immense volume of literature on the problem of estimating the shapes of objects or scenes from a single image. Indeed, this problem is considered to be one of the classical challenges of Computer Vision. Methods for single image reconstruction very often rely on different cues such as shading, silhouette shapes, texture, and vanishing points (e.g., [1], [4], [5], [6], [7], [8]). These methods restrict the allowable reconstructions by placing constraints on the properties of reconstructed objects (e.g., reflectance properties, viewing conditions, and symmetry).

There has been growing interest in producing depth estimates for large scale outdoor scenes from single images. One approach [2], [9] reconstructs outdoor scenes assuming they can be labeled as "ground," "sky," and "vertical" billboards. Other approaches include the Diorama construction method of [1] and the Make3D system of [10]. Although both visually pleasing and quantitatively accurate estimates have been demonstrated, it is unclear how to extend these methods to classes other than outdoor scenes.

Recently, a growing number of methods explicitly use examples to guide the reconstruction process. One notable approach makes the assumption that all 3D objects in the class being modeled lie in a linear space spanned by a few basis objects (e.g., [11], [12], [13], [14]). This approach is applicable to faces, but it is less clear how to extend it to more variable classes, because it requires dense correspondences between surface points across examples. Another approach [15] uses a single example to produce accurate, shape-from-shading estimates of face shapes. This approach too is tailored for the particular problem of estimating face shapes. By contrast, our chief assumption is that the object viewed in the query image has similar looking counterparts in our example set, and so our method can be applied to produce depth estimates for a range of different object classes.
Synthesis “by-example”.
A fully data-driven method was first proposed in [16], inspired by methods for constructing 3D models by combining parts [17]. It operates by assuming a collection of example reference images, along with known 3D shapes (depths). For a given query image, it seeks to match the appearance of the query to the appearance of these references, and produces a depth estimate by combining the known reference depth values associated with the matching appearances. This report elaborates on the original method described in [16] and provides additional information and results compared to that paper.
Shape decoration.
Sec. 6 demonstrates how our framework can be applied to solve additional problems beyond shape reconstruction. In particular, we demonstrate the use of our method for automatically colorizing depth-maps, as a quick means for decorating 3D shapes (Sec. 6.2). Existing automatic methods for decorating 3D models have mostly focused on covering the surface of 3D models with texture examples (e.g., [18], [19], [20]). We note in particular a work concurrent to our own [21], which uses an optimization procedure similar to the one used here. Their goal, however, is to cover 3D surfaces with 2D texture examples. Finally, recent methods have attempted to allow the modeler semi-automatic means of producing non-textural image-maps (e.g., [22]). These methods rely on the user forming explicit correspondences between parts of the 3D surface and different texture examples, which are then merged together to produce the complete image-map for an input 3D model. Our work, on the other hand, is fully automatic.

Finally, there have been a number of publications presenting methods for 3D model (e.g., triangulated mesh) correspondences and cross parameterization (e.g., [23], [24]). These methods establish correspondences across two or more 3D surfaces. Once these correspondences are computed, surface properties, such as texture, can be transferred from one corresponding 3D object to another, thus providing a novel model with a custom image-map. These methods, however, often require a human modeler to input a seed set of correspondences across the models, or else assume the models are similar in general form. In our colorization method, no prior correspondences are required, the process is fully automatic, and the models need only be locally similar.
Fig. 1. Visualization of our process. (a) The input image. (b) Step (i) finds for every query patch a similar patch in the database. Each patch provides depth estimates for the pixels it covers. Thus, overlapping patches provide several depth estimates for each pixel. These values are combined at each pixel to produce a new estimate for that pixel's depth in Step (ii). This process is then repeated until convergence (Step (iii)) by returning to Step (i), now searching for patches matching in both intensity and depth, using the current depth estimate for the comparison. (c) Our final depth estimate.
3 ESTIMATING DEPTH FROM EXAMPLES
Given a query image I of some object of a certain class, our goal is to estimate a depth map D for the object. To this end we use examples of feasible mappings from intensities (appearances) to depths for the class. These mappings are given in a database S = {M_i}_{i=1..n} = {(I_i, D_i)}_{i=1..n}, where I_i and D_i respectively are the image and the depth-map of an object from the class. These image-depth pairs are produced by applying standard rendering techniques to a set of example, textured, 3D geometries. For simplicity we assume first that all the image-depth pairs in the database were produced by rendering the geometries from a single viewing direction, shared also with the input image. Later, in Sec. 4, we relax this assumption by demonstrating how an estimate of the camera pose may be recovered along with the depth.

Our goal is to produce a depth D such that every k × k patch of mappings in M = (I, D) will have a similar counterpart in S (i.e., will be feasible). Specifically, we seek a depth D satisfying the following two criteria:

1) For every k × k patch of mappings in M, there is a similar patch in S, and
2) if two patches overlap at a pixel p, then the two database patches selected as their matches must agree on the depth at p.

We next describe how we produce depth estimates satisfying these criteria. Given an input image I, we produce a depth estimate D meeting the two criteria mentioned above by building on the following simple, two-step procedure (see also Fig. 1): (i) At every location p in I we consider a k × k window around p and seek a matching window in the database with a similar intensity pattern, in the least squares sense (Fig. 1.(i)). (ii) Finding such a window, we extract its corresponding k × k depths. We do this for all pixels in I, matching overlapping intensity patterns and obtaining k^2 depth estimates for every pixel coordinate. The depth value at every p is then determined by taking a Gaussian weighted mean of these k^2 estimates (Fig. 1.(ii)).
Here, the Gaussian weights weigh in favor of estimates from patches centered closer to p.

Of course, there is nothing to guarantee that the depth estimate obtained by executing these two steps just once will meet our criteria. In order to produce a suitable estimate, we therefore take the current depth to be an initial guess, which we then refine iteratively. We repeat the following process until convergence (see also Fig. 2): At every step we seek, for every patch in M, a database patch similar in both intensity as well as depth, using D from the previous iteration for the comparison. Thus, unlike the initial step, we now look for similar mappings. Having found new matches, we compute a new depth estimate for each pixel as before, by taking the Gaussian weighted mean of its k^2 estimates. In Section 3.2 we prove that this two-step procedure is a hard-EM optimization of a well defined target function. As such, it is guaranteed to converge to a local optimum of the target function.

Fig. 2 summarizes this process. The function getSimilarPatches searches S for patches of mappings which match those of M, in the least squares sense. The set of all such matching patches is denoted V. The function updateDepths then updates the depth estimate D at every pixel p by taking the weighted mean over all depth values for p in V.

D = estimateDepth(I, S)
  M = (I, ?)
  repeat until no change in M
    (i)  V = getSimilarPatches(M, S)
    (ii) D = updateDepths(M, V)
    M = (I, D)

Fig. 2.
Summary of the basic steps of our algorithm.
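To make the loop of Fig. 2 concrete, the following is a minimal Python sketch under simplifying assumptions of ours: a flat list of k × k example mapping patches, exhaustive nearest-neighbor search, and a single scale (function and parameter names are ours, not the paper's).

```python
import numpy as np

def gaussian_weights(k, sigma=1.0):
    """k x k kernel: estimates from patches centered nearer a pixel weigh more."""
    ax = np.arange(k) - (k - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    w = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return w / w.sum()

def estimate_depth(I, db_patches, k=3, iters=3, depth_weight=1.0):
    """Toy version of estimateDepth (Fig. 2).

    I          -- (H, W) query intensity image.
    db_patches -- list of (intensity_patch, depth_patch) k x k example mappings.
    """
    H, W = I.shape
    D = np.zeros((H, W))
    w = gaussian_weights(k)
    r = k // 2
    for it in range(iters):
        acc = np.zeros((H, W))
        norm = np.zeros((H, W))
        for y in range(r, H - r):
            for x in range(r, W - r):
                Wi = I[y - r:y + r + 1, x - r:x + r + 1]
                Wd = D[y - r:y + r + 1, x - r:x + r + 1]
                # Step (i): nearest database mapping in the least-squares sense.
                # After the first pass the current depth estimate joins the
                # comparison, so we search for similar *mappings*, not just
                # similar intensities.
                best_pd, best_cost = None, np.inf
                for pi, pd in db_patches:
                    cost = np.sum((Wi - pi)**2)
                    if it > 0:
                        cost += depth_weight * np.sum((Wd - pd)**2)
                    if cost < best_cost:
                        best_pd, best_cost = pd, cost
                # Each matched patch votes a depth for every pixel it covers.
                acc[y - r:y + r + 1, x - r:x + r + 1] += w * best_pd
                norm[y - r:y + r + 1, x - r:x + r + 1] += w
        # Step (ii): Gaussian-weighted mean of the overlapping estimates.
        D = np.where(norm > 0, acc / np.maximum(norm, 1e-12), D)
    return D
```

The first pass ignores the (still unknown) depth channel, matching the initial intensity-only step; later passes compare full mappings, as in the iteration described above.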
We now analyze our iterative process and show that it is in fact a hard-EM optimization [25] of the following target function (which, in turn, satisfies our criteria of Sec. 3). Denote by W_p a k × k window from the query M centered at p, containing both intensity values and (unknown) depth values, and denote by V a similar window in some M_i ∈ S. Our target function can now be defined as

  Plaus(D | I, S) = ∏_{p ∈ I} max_{V ∈ S} Sim(W_p, V),    (1)

with the similarity measure Sim(W_p, V) being:

  Sim(W_p, V) = exp( -(1/2) (W_p - V)^T Σ^{-1} (W_p - V) ),    (2)

where Σ is a constant diagonal matrix whose components represent the individual variances of the intensity and depth components of patches in the class. These are provided by the user as weights (see also Sec. 7.1). To make this norm robust to illumination changes, we normalize the intensities in each window to have zero mean and unit variance, similarly to the normalization often applied to patches in detection and recognition methods (e.g., [26]).

Fig. 3.
Graphical model representation.
Please see text for more details.
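Since Σ in Eq. (2) is diagonal, the exponent reduces to a sum of squared component differences, each weighted by its per-component variance. A small sketch of this similarity (the dict-of-arrays window representation and the two variance values are our assumptions, standing in for the user-supplied weights):

```python
import numpy as np

def normalize_intensity(patch, eps=1e-8):
    """Zero-mean, unit-variance normalization, for robustness to illumination."""
    return (patch - patch.mean()) / (patch.std() + eps)

def sim(Wp, V, sigma_intensity=0.1, sigma_depth=1.0):
    """Gaussian similarity of two mapping windows, in the spirit of Eq. (2).

    Wp, V -- dicts with 'I' (intensity) and 'D' (depth) arrays of equal shape.
    With a diagonal covariance, the Mahalanobis term is just a weighted SSD
    over the intensity and depth components.
    """
    di = normalize_intensity(Wp['I']) - normalize_intensity(V['I'])
    dd = Wp['D'] - V['D']
    m = np.sum(di**2) / sigma_intensity**2 + np.sum(dd**2) / sigma_depth**2
    return np.exp(-0.5 * m)
```

Because the intensities are normalized, windows that differ only by an affine intensity change score as identical, which is the illumination robustness noted above.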
In Fig. 3 we represent the intensities of the query image I as observables, and the matching database patches V and the sought depth values D as hidden variables. The joint probability of the observed and hidden variables can be formulated through the edge potentials by

  f(I, V; D) = ∏_{p ∈ I} ∏_{q ∈ W_p} φ_I(V_p(q), I(q)) · φ_D(V_p(q), D(q)),

where V_p is the database patch matched with an image patch W_p centered at p by the global assignment V. Taking φ_I and φ_D to be Gaussians with different covariances over the appearance and depth respectively implies

  f(I, V; D) = ∏_{p ∈ I} Sim(W_p, V_p),

where
Sim is defined in (2). Integrating over all possible assignments of V we obtain the likelihood function

  L = f(I; D) = Σ_V f(I, V; D) = Σ_V ∏_{p ∈ I} Sim(W_p, V_p).

We approximate the sum with a maximum operator. Note that this is common practice for EM algorithms, often referred to as hard-EM (e.g., [25]). Since similarities can be computed independently, we can interchange the product and maximum operators, obtaining the following maximum likelihood:

  max L ≈ ∏_{p ∈ I} max_{V ∈ S} Sim(W_p, V) = Plaus(D | I, S),

which is our cost function (1).

The function estimateDepth (Fig. 2) maximizes this measure by implementing a hard-EM optimization. The function getSimilarPatches performs a hard E-step by selecting the set of assignments V^{t+1} for time t+1 which maximizes the posterior:

  f(V^{t+1} | I; D^t) ∝ ∏_{p ∈ I} Sim(W_p, V_p).

Here, D^t is the depth estimate at time t. Due to the independence of patch similarities, this can be maximized by finding, for each patch in M, the most similar patch in the database, in the least squares sense.

The function updateDepths approximates the M-step by finding the most likely depth assignment at each pixel:

  D^{t+1}(p) = arg max_{D(p)} ( - Σ_{q ∈ W_p} (D(p) - depth(V^{t+1}_q(p)))^2 ).

This is maximized by taking the mean depth value over all k^2 estimates depth(V^{t+1}_q(p)), for all neighboring pixels q. We note that hard-EM optimization is well known to converge to a local optimum of the target function [25].

4 FINDING THE RIGHT EXAMPLES
By-example, patch based approaches have become quite popular and are successfully employed for solving problems ranging from texture synthesis to recognition. The underlying assumption behind these methods is that class variability can be captured by a finite, preferably small, set of examples. Many applications can typically guarantee these conditions (notably texture synthesis). However, when the examples include non-rigid objects, objects varying in texture, or when viewing conditions are allowed to change, it becomes increasingly harder to apply these methods: adding more examples to allow more variability (e.g., rotations of the input image in [27]) implies larger storage requirements, longer running times, and a higher risk of false matches.

Our goal here is to handle objects viewed from any direction, non-rigid objects (e.g., hands), and objects which vary in texture (e.g., fish). Ideally, we would like to use few examples whose shape (depth) is similar to that of the object in the input image, viewed under similar conditions. This, however, implies a chicken-and-egg problem: depth estimation requires choosing similar example objects, but knowing which objects are similar first requires a depth estimate.

Our optimization scheme provides a convenient means of solving this problem. Instead of committing beforehand to a fixed set of examples, we update the set of examples on-the-fly, alongside the optimization process. We start with an initial seed database of examples. In subsequent iterations of our optimization we drop the least used examples M_i from our database, replacing them with ones deemed better suited for the depth estimation process. These are produced by on-the-fly rendering of more suitable 3D models, with viewing conditions closer to the ones used in the query. In our experiments, we applied this idea to search for more similar example objects and better viewing angles. We believe that other parameters, such as lighting conditions, can also be similarly resolved. We next describe the details of our implementation.

Fig. 4 demonstrates a depth estimation result produced by using example images generated from a single incorrect viewing angle (Fig. 4.a) and four fixed, widely spaced viewing angles (Fig. 4.b). Both results are inadequate. It stands to reason that mappings from viewing angles closer to the true one will contribute more patches to the process than those further away. We thus adopt the following scheme. We start with a small number of pre-selected views, sparsely covering parts of the viewing sphere (the gray cameras in Fig. 4.c). The seed database S is produced by taking the mappings M_i of our objects, rendered from these views, and is used to obtain an initial depth estimate. In subsequent iterations, we re-estimate our views by taking the mean of the currently used angles, linearly weighted by the relative number of patches selected from each angle. We then drop from S mappings originating from the least used angle and replace them with ones from the new view. If the new view is sufficiently close to one of the remaining angles (e.g., its distance to an existing view falls below a predefined threshold), we instead increase the number of objects to maintain the size of S. Fig. 4.c presents a result obtained with our angle update scheme.

Although methods exist which accurately estimate the viewing angle [28], [29], we preferred embedding this estimation in our optimization. To understand why, consider non-rigid classes such as the human body, where posture cannot be captured with only a few parameters. Our approach uses information from several viewing angles simultaneously, without pre-committing to any single view.

Fig. 4. Depth estimates with unknown viewing angle. A woman's face viewed from camera angle (α, β) = (0°, −°). (a) Database mappings S rendered with the camera at angle (0°, °). (b) Database generated with cameras positioned at angles (−°, °), (20°, °), (−°, −°), and (20°, −°), without updating the database viewing position. (c) Estimating depth while updating the database camera view on-the-fly. Starting from the angles in (b), now updating angles until convergence to (−°, −°).

Although we have collected at least 40 objects in each database, we use no more than 12 objects at a time in the optimization, as it becomes increasingly difficult to handle larger sets. We select the ones used in practice as follows. Starting from a set of arbitrarily selected seed objects, at every update step we drop those least referenced. We then scan the remainder of our objects for those whose depth, D_i, best matches the current depth estimate D (i.e., ‖D − D_i‖ is smallest, with D and D_i center aligned), adding them to the database in place of those dropped. In practice, a fourth of our objects were replaced after the first iteration of our process.

5 PRESERVING GLOBAL STRUCTURE
The scheme described in Sec. 3.1 makes an implicit stationarity assumption [30]: put simply, the probability for the depth at any pixel, given those of its neighbors, is the same throughout the output image. This is generally untrue for structured objects, where depth often depends on position. For example, the probability of a pixel's depth being "tip-of-the-nose high" is different at different locations of a face. To overcome this problem, we suggest enforcing non-stationarity by adding additional constraints to the patch matching process. Specifically, we encourage selection of patches from similar semantic parts by favoring patches which match not only in intensities and depth, but also in position relative to the centroid of the input depth-map. This is achieved by adding relative position values to each patch of mappings in both the database and the query image.

Let p = (x, y) be the coordinates of a pixel in I (the center of the patch W_p), and let (x_c, y_c) be the coordinates of the center of mass of the area occupied by non-background depths in the current depth estimate D. We add the values (δx, δy) = (x − x_c, y − y_c) to each such patch W_p, and similar values to all database patches (i.e., using the center of each depth image D_i for (x_c, y_c)). These values now force the matching process to find patches similar in both mapping and global position. Fig. 5 demonstrates a reconstruction result with and without these constraints.

If the query object is segmented from the background, an initial estimate for the query's centroid can be obtained from the foreground pixels. Alternatively, this constraint can be applied only after an initial depth estimate has been computed (i.e., Sec. 3).

Fig. 5. Preserving relative position. (a) Input image. (b) Depth estimate without position preservation constraints and (c) with them.
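The (δx, δy) augmentation above amounts to appending two extra components to each patch vector before matching. A minimal sketch, assuming flattened patch vectors and explicit (x, y) patch centers (array layout and names are ours):

```python
import numpy as np

def augment_with_position(patches, centers, depth_map, background=0.0):
    """Append (dx, dy) offsets from the foreground centroid to each patch vector.

    patches   -- (N, F) array of flattened mapping patches.
    centers   -- (N, 2) array of (x, y) patch-center coordinates.
    depth_map -- (H, W) current depth estimate; non-background pixels
                 define the center of mass (x_c, y_c).
    """
    ys, xs = np.nonzero(depth_map != background)
    xc, yc = xs.mean(), ys.mean()
    offsets = centers - np.array([xc, yc])  # (dx, dy) = (x - x_c, y - y_c)
    return np.hstack([patches, offsets])
```

Matching the augmented vectors with the same least-squares criterion then penalizes patches drawn from the wrong part of the shape; the weight given to the two extra components plays the role of the position variance in Σ.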
6 ADDITIONAL APPLICATIONS

One of the appealing aspects of this example-based approach is that other problems besides depth estimation may be similarly solved, with little or no change to the method we described. Specifically, we have thus far considered example mappings from appearances to depths. This has allowed us to estimate depths for input appearances. We next show how, by defining alternative mappings, we obtain solutions to additional problems within the very same framework.
What can be said about the shape of a surface which does not appear in the image? Methods for depth estimation have predominantly focused on estimating the shapes of surfaces visible in images. Here we suggest that an input image may contain sufficient cues which, coupled with additional examples, may allow us to guess the shapes of surfaces even when they are occluded in the image. Specifically, we next demonstrate how the shape of the occluded backside of an object may be estimated, using the same process described thus far, from a single image of the object's front.

We consider a database of mappings containing not appearances and their corresponding depths, but rather depths and corresponding second depth layers (or, in general, multiple depth layers). This additional depth layer is taken here to be the depth at the back of the objects viewed. Once again, we generate this database by applying standard rendering techniques to our example 3D geometries. We thus obtain a database S′ = {M′_i}_{i=1..n} = {(D_i, D′_i)}_{i=1..n}, where D′_i is the second depth layer.

Having recovered the visible depth of an object (its depth map, D), we define the mapping from visible to occluded depth as M′(p) = (D(p), D′(p)), where D′ is its second depth layer. Synthesizing D′ can now proceed similarly to the synthesis of the visible depth layers. We note that this second depth layer may indeed have little in common with the true depth at the object's back. It is, however, reminiscent of the image hole-filling problem in attempting to produce a plausible estimate of this information, where none other exists.

An additional problem may be solved by reversing our original mappings. Here we propose an application similar in spirit to the problem faced by a sculptor when applying paint to enhance the appearance of statues. Given an input depth-map, our goal is to fabricate a tailor-made image-map for the depth.
The motivation for doing this comes from the graphics community, where considerable effort is put into developing automatic 3D colorization techniques. Here we achieve this goal by simply switching the roles of the intensities and depths in the example mappings: we now use a database S = {M′_i}_{i=1..n} = {(D_i, I_i)}_{i=1..n}. Given an input depth map D, our goal is now to produce an image map I such that M′ = (D, I) consists of feasible mappings from shape to intensities.

We have found that for this particular application, on-the-fly database update is unnecessary, as our input is a 3D shape, allowing us to easily select similar shapes from the database before commencing with the optimization. We thus choose for synthesis a small number (often as small as one or two) of models whose depths best match the input depth in the least squares sense. These are used throughout the synthesis process. We note that when only one database object is used, our method effectively morphs its image-map to fit the 3D features of the novel input depth (see Sec. 7.2).

Fig. 6.
Depth estimates at multiple resolutions.
From left to right: input image, five intermediate depth-map estimates from different resolutions, and a zoomed-in view of our output reconstruction.

Fig. 7.
Database mappings used as examples.
In the top row, two appearance-depth images, out of the 67 in the Fish database. Bottom row, two of 50 pairs from our Human-posture database.
7 IMPLEMENTATION AND RESULTS
For the purpose of depth reconstruction, the mapping at each pixel in M = (I, D), and similarly every M_i = (I_i, D_i), encodes both appearance and depth (see examples in Fig. 7). In practice, the appearance component of each pixel is its intensity and high frequency values, as encoded in the Gaussian and Laplacian pyramids of I [31]. We have found direct synthesis of depths to result in low frequency noise (e.g., "lumpy" surfaces). We thus estimate a Laplacian pyramid of the depth, producing the final depth by collapsing the depth high-frequency estimates from all scales. In this fashion, low frequency depths are synthesized in the coarse scale of the pyramid and only sharpened at finer scales (see example in Fig. 6). For depth colorization we used mappings from depths and depth high frequencies to Y, Cb, Cr components, also computed at different scales of the Gaussian and Laplacian pyramids.

Different patch components, including relative positions, contribute different amounts of information in different classes, as reflected by their different variances (i.e., Σ in the definition of Sim, Eq. 2). For example, faces are highly structured; thus, position plays an important role in their reconstruction. On the other hand, due to the variability of human postures, relative position is less reliable for that class. We therefore amplify different components of each patch of mappings, W_p, for different classes, by weighting them differently. Section 7.2 presents weights computed automatically for different object classes and quantitative results obtained with these weights.

Finally, we note that, in principle, database objects may come in any coordinate system, and in particular their depth values can be shifted (i.e., z′ = z + z_0, for some constant z_0). This may pose a problem when combining depths from different objects to form a single estimate. A possible solution would be to synthesize surface normals instead of depths (as in, e.g., [32]). Doing so, however, raises the problem of dealing with depth discontinuities. Here we chose instead to produce our examples in a common frame of reference, by setting z = 0 at the centroid of the 3D object and performing the reconstruction in this common frame of reference.

In our reconstruction and colorization experiments, we used the following data sets: 52 Hand and 45 Human-posture objects, produced by exporting built-in models from the Poser software, 77 busts from the USF head database [33], and a fish database [34] containing 41 models. In addition, for the colorization experiments we used a database of five human figures, courtesy of Cyberware [35]. Our objects are stored as textured 3D triangulated meshes. We can thus render them to produce example mappings using any standard rendering engine. Example mappings from the fish and human posture data-sets are displayed in Fig. 7. We used 3D Studio Max for rendering the images and depth-maps. We preferred pre-rendering the images and depth maps instead of rendering different objects from different angles on-the-fly. Thus, we trade rendering times for disk access times and large storage. Note that this is an implementation decision; at any one time we load only a small number of images into memory. The angle update step (Sec. 4) therefore selects the existing pre-rendered angle closest to the mean angle.
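The pyramid encoding above can be illustrated with a minimal numpy sketch. Nearest-neighbor resampling and 2×2 block averaging stand in for the standard Gaussian filtering of [31], and image dimensions are assumed divisible by 2^levels; collapsing the pyramid shows how per-scale depth high frequencies recombine into a full-resolution depth.

```python
import numpy as np

def _downsample(img):
    """Halve resolution by 2x2 block averaging (stand-in for Gaussian blur + decimation)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def _upsample(img, shape):
    """Nearest-neighbor 2x upsample, cropped to `shape` (assumes even-sized inputs)."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def build_laplacian_pyramid(img, levels=3):
    """Levels 0..levels-2 hold band-pass (high-frequency) residuals;
    the last level holds the low-frequency remainder."""
    gauss = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        gauss.append(_downsample(gauss[-1]))
    lap = [g - _upsample(gs, g.shape) for g, gs in zip(gauss[:-1], gauss[1:])]
    lap.append(gauss[-1])
    return lap

def collapse_laplacian_pyramid(lap):
    """Recombine the per-scale estimates into a single full-resolution map."""
    out = lap[-1]
    for level in reversed(lap[:-1]):
        out = _upsample(out, level.shape) + level
    return out
```

In the synthesis setting, each band-pass level of the depth pyramid would be estimated by patch matching at that scale before collapsing; with unmodified levels, collapsing exactly reconstructs the input.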
Depth Reconstruction.
Our results include depth estimates from single images for structured objects such as faces (Fig. 4, 5, 10), as well as for highly non-rigid objects such as hands (Fig. 1 and 11) and full bodies (Fig. 6 and 8) in various postures. These results include, in particular, objects with genus higher than zero (e.g., Fig. 8) and objects with depth discontinuities, such as the fingers of the hand in Fig. 1 and the fin of the left fish in Fig. 9. Additionally, we show that our method can produce estimates even when the objects in the image are highly textured, as in the fish examples in Fig. 9. Similarly to shape-from-shading methods, we assume here that the query objects were pre-segmented from their background and then aligned with a single preselected database image to solve for scale and image-plane rotation differences (see, e.g., [15]).
It is interesting to compare our method with a method tailored to reconstructing face depths [15] (see Fig. 12). For a non-standard face (a cyclops), our patch-based method appropriately produces a shape estimate with only one eye socket, using examples of typical, binocular faces. Although [15] produces a more finely detailed estimate, their strong global face-shape prior results in an estimate erroneously containing three separate eyes.
Occluded back estimation results (Sec. 6.1) are presented in Fig. 8 and 11. Two non-structured objects with non-trivial backs were selected for these tests.
Fig. 8. Full body depth results. Left to right: input image; the output depth without and with texture; input image of a man; output depth; textured view of the output; output estimate of the depth at the back. Man results shown zoomed in.
Fig. 9. Two fish depth results. Left to right: input image (removed from the example database); estimated depth and a textured view of the output; input image; estimated depth and a textured view of the output.
Fig. 10. Two face depth results. Left to right: input image; four most-referenced database images in the last iteration; our output depth without and with texture; input image; four most-referenced database images in the last iteration; our output depth without and with texture.
Fig. 11. Hand depth result. (a) Input image. (b) Our output. (c) Output estimate for the back of the hand.
In general, the quality of our depth estimate depends on the database used, the input image, and how well the two match. Fig. 13 presents some failed results. In Fig. 13(a), our method's lack of a global prior resulted in a middle finger which points both forward and downward. In Fig. 13(b), the subject was wearing a dark shirt and bright trousers, very different from the uniformly colored objects in the database.
TABLE 1
Depth estimation database parameters. m: number of mappings (objects) M_i used for synthesis; weights for the intensity, depth, and relative-position components. Patch sizes were × pixels in all tests.

DB Name        m   Weights (intensity, depth, position)
Human-posture  4   0.2140, 0.1116, 0.0092
Hands          5   0.1294, 0.1120, 0.0103
Fish           5   0.1361, 0.1063, 0.0116
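The per-class weights of Table 1 scale the contribution of each patch component (intensity, depth, relative position) to the patch similarity Sim of Eq. 2, acting like a diagonal (inverse-)covariance. A minimal sketch of such a weighted patch distance, assuming each patch is stored as a dict of component vectors; this layout is ours, for illustration, not the paper's data structure:

```python
import numpy as np

def weighted_patch_distance(query, example, weights):
    # `query` and `example` map component names (e.g., "intensity",
    # "depth", "position") to flat vectors; `weights` maps each name
    # to its scalar weight, playing the role of the diagonal Σ in
    # the similarity measure of Eq. 2.
    d = 0.0
    for name, w in weights.items():
        diff = np.asarray(query[name]) - np.asarray(example[name])
        d += w * float(np.sum(diff ** 2))
    return d
```

For example, with the human-posture row of Table 1 as weights, patches that differ only in intensity are penalized roughly twice as strongly as patches differing equally in depth, and position differences contribute comparatively little.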
Quantitative depth estimation results.
To evaluate the performance of our algorithm, we ran leave-one-out depth estimation tests on the human-posture, hands, and fish data sets. We used five randomly selected objects from each object class for training. Their images and ground-truth depths were used to automatically search for optimal weights for the three components of the mappings (appearance, depth, and relative position). We used a direct simplex search method to find these three parameters, separately for each class, minimizing the error between our depth estimate and the known ground truth. The parameters thus obtained are presented in Table 1. The search for parameters was performed only once for each class, and the parameters obtained were applied to all input images. We did not screen our results for failures, and all depth estimates were included when computing the global result.
We next estimated the depths of the objects which were not included in the training. The quality of our estimates was compared against a naive baseline: selecting the depths belonging to the database objects most similar in appearance to the test images. We have also included the accuracy obtained by applying the method of [10], using their own code¹. We note, however, that this method was developed and optimized for outdoor scenes, and so it is not surprising that it should under-perform when applied to images of objects. Table 2 summarizes our results, comparing the mean and STD of the L1 distances between the ground-truth depths and the depths estimated by our method, by Make3D [10], and by the naive selection as a baseline.
Fig. 14, 15, and 16 present depth estimates obtained by our own method in these batch tests. The wide standard deviation of both the baseline and our method, as reported in Table 2, suggests that these three data sets do not fully capture the range of shapes and appearances of objects in their classes; some objects do not have sufficiently similar counterparts in the database, and consequently their estimates (obtained with our method as well as with the baseline) were poor. This is not surprising considering the nature of the objects included in these sets (i.e., non-rigid and textured objects). Paired t-tests comparing our algorithm to the baseline method show the improved performance of our method to be significant for all three data sets, with p < for the human-postures, p < for the hands, and p < for the fish data sets.
Fig. 12. Cyclops depth result. (a) Input image of a Cyclopean face. (b-c) Top row: our depth estimate rendered in 3D and a textured view; bottom row: depth estimate produced using the method of [15], rendered in 3D and a textured view. Both methods use example shapes of binocular faces. Although [15] produces more detailed estimates, their strong prior on face shape results in a face with three separate eyes.
Fig. 13. Failures. (a) Hand reconstructions are particularly challenging, as hands are largely untextured and can vary greatly in posture. (b) The uniform black shirt differed greatly from those worn by our database objects (see Fig. 7). No reliable matches could thus be found, resulting in a lumpy surface; the resulting surface is presented from a zoomed-in view.

TABLE 2
Depth estimation quantitative results. Mean and STD of L1 distances between estimated depths and ground truth.

DB Name  Baseline     Make3D [10]  Our method
Posture  .040 ± .01   .248 ± .09   .023 ± .00
Hands    .039 ± .02   .228 ± .05   .026 ± .01
Fish     .044 ± .02   .277 ± .12   .036 ± .02

Automatic Colorization.
Some colorization results are presented in Fig. 17–20. Note in particular how pairs of database image-maps seamlessly mesh to produce the output image-map in Fig. 20, where the database objects are presented alongside the output result. Some failed results are reported in Fig. 21. We believe that the failed faces were due to the automatic example selection disregarding the different colors of the selected database examples. In the case of the fish, the failures were due to the anomalous shapes of the input depths.
To obtain quantitative results for our colorization scheme, we again ran leave-one-out tests on the face and fish data sets. Here, since the quality of our colorization results is subjective, we polled 10 subjects,
1. Make3D code available from http://make3d.cs.cornell.edu
Fig. 14. Hand depth estimates. Four out of the 52 hand-object depth estimates computed using automatically obtained weights (see Table 1 for parameter values). In both columns, from left to right: input image, its ground truth, four of the five automatically selected database examples used for the reconstructions, and our output estimate.
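The "automatically obtained weights" referenced in these captions were found by a direct simplex search minimizing the error between estimated and ground-truth depths on a few training objects per class. The following is a hedged sketch of such a search using SciPy's Nelder-Mead implementation; `reconstruct` is a stand-in for the full depth-estimation pipeline, and the toy pipeline in the usage below only exercises the optimization mechanics:

```python
import numpy as np
from scipy.optimize import minimize

def fit_component_weights(reconstruct, train_set, w0=(0.1, 0.1, 0.01)):
    """Direct simplex (Nelder-Mead) search over the three component
    weights (appearance, depth, relative position), minimizing the
    mean L1 distance between estimated and ground-truth depths on
    the training objects."""
    def mean_l1_error(w):
        return float(np.mean([np.abs(reconstruct(img, w) - gt).mean()
                              for img, gt in train_set]))
    result = minimize(mean_l1_error, np.asarray(w0, dtype=float),
                      method="Nelder-Mead")
    return result.x
```

In the paper this search is run once per class on five training objects, and the weights found are then reused for all inputs of that class.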
Fig. 15. Human-posture depth estimates. Four out of the 45 human-posture depth estimates computed using automatically obtained weights (see Table 1 for parameter values). In both columns, from left to right: input image, its ground truth, the four automatically selected database examples used for the reconstructions, and our output estimate.
asking "how many image-maps are faulty or otherwise appear inferior to those in the database". Out of the 57 fish results, on average 28% were found to be faulty. Similarly, 28% of our face results were found faulty out of the 76 faces in the database. The parameters used in these tests are reported in Table 3.
Run-time.
Our running time was approximately . minutes for a × pixel image, using 12 example images at any one time, on a Pentium 4, 2.8GHz computer with 2GB of RAM. We used three pyramid levels, each scaled to half the size of the previous level. Patch sizes, unless otherwise noted, were taken to be × at the coarsest scale, × at the second level, and × for the finest scale.
CONCLUSIONS AND FUTURE WORK
Clearly, having prior knowledge about the shapes of objects in the world is beneficial for determining the shapes of novel objects. This idea is particularly useful when only a single image of the world is available. Motivated by this basic understanding, we formulate an algorithm which produces depth estimates from single images, given examples of typical related objects. The ultimate goal of our algorithm is to produce a depth estimate which is consistent with both the appearance of the input image and the known depths in our example set. We show how this goal may be formally stated and achieved by way of a strong optimization technique.
At the heart of our method is the realization that the problem of depth estimation may be stated as using known mappings from appearances to depths to produce a new, plausible mapping given a novel appearance (image). This observation is coupled with the idea of storing 3D geometries explicitly and using them to render example appearance-depth mappings on-the-fly. We can thus produce example mappings capturing an essentially infinite range of viewing conditions without limiting the example set a-priori.
As a consequence, we obtain an algorithm which is versatile in the objects and viewing conditions it can be applied to. Moreover, the algorithm is versatile in the problems it may be used to solve: the general formulation of our mappings allows us to estimate additional properties of the objects in the scene, in particular, the shape of the occluded back of the object.

TABLE 3
Colorization database parameters. m: number of mappings (objects) M_i used for synthesis; k: patch width and height, from fine to coarse scale of three pyramid levels; weights for the depth, depth high-frequency, Y, Cb, Cr, and relative-position components. Note that relative position is amplified for the structured face and human data sets. Also, as our eyes are sensitive to intensities, we amplify Y as well.

DB Name  m  k         Weights (depth, depth HF, Y, Cb, Cr, position)
Humans   1  7, 9, 9   0.08, 0.06, 8, 1.1, 1.1, 10
Busts    2  7, 11, 9  0.08, 0.06, 8, 1.1, 1.1, 10
Fish     2  7, 11, 9  0.08, 0.06, 8, 1.1, 1.1, 0.1

Fig. 16. Fish depth estimates. Four out of the 41 fish depth estimates computed using automatically obtained weights (see Table 1 for parameter values). From left to right: input image, its ground truth, four of the five automatically selected database examples used for the reconstructions, and our output estimate.
Future work.
It seems natural to explore how additional information may be estimated using the same framework. For example, can foreground-background segmentation be estimated alongside the depth estimation? There are additional directions which we feel require further study. Chiefly, we would like to explore how the method may be improved, both in accuracy and in speed. Here we would like to capitalize on recent advances in image representation and matching, mainly in dense and invariant image representations such as [36], [37].
We believe it would also be interesting to explore how explicit 3D representations may be further exploited. In particular, can an accurate camera viewing position (or illumination, or posture, etc.) be estimated by a similar means of producing novel examples on-the-fly and comparing them to the input image? Also, can we learn more about the world occluded from view, by using both known and example information?
Fig. 17. Fish image-maps. (a) Input depth-map. (b) Automatically selected database objects (image-maps displayed). (c) Output image marked with the areas taken from each database image. (d) Input depth rendered with, from top to bottom, the result image and the database image-maps. Note the mismatched features when using the database images. (e) Textured 3D view of our output.
Fig. 18. Fish image-maps. Top row: input depth-maps; bottom row: our output image-maps.
Fig. 19. Human image-maps. Three human figure results. Using a single database object, our method effectively morphs the database image, automatically fitting it to the input depth's 3D features. For each result, displayed from left to right, are the input depth, the depth textured with the automatically selected database image-map (in red, depth areas not covered by the database map), and our result.
Fig. 20. Bust image-maps. Three bust results. For each result, displayed from left to right, are the input depth, our result, and the two database objects automatically selected to produce it.
Fig. 21. Failures. Bust failures caused by differently colored database image-maps. Fish failures are due to anomalous input depths.

REFERENCES

[1] J. Assa, L. Wolf, Diorama construction from a single image, in: Eurographics, 2007, pp. 599–608.
[2] D. Hoiem, A. Efros, M. Hebert, Automatic photo pop-up, ACM Trans. Graph. 24 (3) (2005) 577–584.
[3] L. Zhang, G. Dugas-Phocion, J.-S. Samson, S. M. Seitz, Single view modeling of free-form scenes, in: Proc. IEEE Conf. Comput. Vision Pattern Recognition, Vol. 1, 2001, pp. 990–997.
[4] R. Cipolla, G. Fletcher, P. Giblin, Surface geometry from cusps of apparent contours, in: Proc. IEEE Int. Conf. Comput. Vision, 1995, pp. 858–863.
[5] A. Criminisi, I. Reid, A. Zisserman, Single view metrology, Int. J. Comput. Vision 40 (2) (2000) 123–148.
[6] E. Delage, H. Lee, A. Ng, Automatic single-image 3D reconstructions of indoor Manhattan world scenes, in: Proc. of the International Symposium of Robotics Research (ISRR), 2005, pp. 305–321.
[7] B. Horn, Obtaining Shape from Shading Information, The Psychology of Computer Vision, McGraw-Hill, 1975.
[8] A. Witkin, Recovering surface shape and orientation from texture, AI 17 (1–3) (1981) 17–45.
[9] D. Hoiem, A. Efros, M. Hebert, Geometric context from a single image, in: Proc. IEEE Int. Conf. Comput. Vision, IEEE Computer