Semi-Supervised Adversarial Monocular Depth Estimation
Rongrong Ji, Senior Member, IEEE, Ke Li*, Yan Wang, Member, IEEE, Xiaoshuai Sun, Member, IEEE, Feng Guo, Member, IEEE, Xiaowei Guo, Yongjian Wu, Feiyue Huang, and Jiebo Luo, Fellow, IEEE
Abstract—In this paper, we address the problem of monocular depth estimation when only a limited number of training image-depth pairs are available. To achieve a high regression accuracy, the state-of-the-art estimation methods rely on CNNs trained with a large number of image-depth pairs, which are prohibitively costly or even infeasible to acquire. Aiming to break the curse of such expensive data collection, we propose a semi-supervised adversarial learning framework that only utilizes a small number of image-depth pairs in conjunction with a large number of easily-available monocular images to achieve high performance. In particular, we use one generator to regress the depth and two discriminators to evaluate the predicted depth, i.e., one inspects the image-depth pair while the other inspects the depth channel alone. These two discriminators provide their feedback to the generator as the loss to generate more realistic and accurate depth predictions. Experiments show that the proposed approach can (1) improve most state-of-the-art models on the NYUD v2 dataset by effectively leveraging additional unlabeled data sources; (2) reach state-of-the-art accuracy when the training set is small, e.g., on the Make3D dataset; (3) adapt well to an unseen new dataset (Make3D in our case) after training on an annotated dataset (KITTI in our case).
Index Terms—Monocular Depth Estimation, Generative Adversarial Learning, Semi-Supervised Learning
Rongrong Ji, Feng Guo and Xiaoshuai Sun are with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, 361005, China. Yan Wang is with Microsoft. Jiebo Luo is with the Department of Computer Science, University of Rochester. Ke Li, Xiaowei Guo, Yongjian Wu and Feiyue Huang are with Tencent Youtu Lab. E-mail: [email protected]. Manuscript received on Oct 11, 2018, and revised on June 27, 2019.

1 INTRODUCTION

Estimating scene depth is the foundation of various computer vision tasks. With the recent success of deep learning techniques [1], [2], [3], vision-based depth estimation has become a more flexible and affordable solution for depth acquisition compared with using active sensors like LIDAR. Among the vision-based approaches, monocular depth estimation has unique advantages due to its extremely low requirement on the sensor, and thus has received extensive attention from both academia and industry. However, the practical performance of monocular depth estimation remains unsatisfactory in real-world applications. To achieve reasonable accuracy, early works relied on visual cues such as shading [4] and texture [5], or on information additional to the image content such as camera motion [6] and multi-view stereo [7]. However, such dependency on extra sensor information greatly weakens the unique advantages of monocular depth estimation. Subsequently, learning-based methods [1], [2], [8] have been introduced and have attracted increasing research attention. Among the earliest works, Saxena et al. [8], [9] adopted superpixels together with Markov Random Fields (MRFs) and Conditional Random Fields (CRFs) to infer depth. Non-parametric approaches, such as [10], aim to reconstruct the depth of a given image by warping the most similar images in a dataset and then transferring the corresponding depth maps. Notably, deep learning schemes were introduced to depth estimation by Eigen et al. [1], and have subsequently dominated the learning-based estimation methods. In these settings, Convolutional Neural Networks (CNNs) are trained with meticulously designed loss functions to regress the depth value. However, such methods typically need a large (or at least sufficient) amount of image-depth pairs to effectively train the deep models. For instance, [1], [11] and [3] take more than 50K ground-truth depth maps to sufficiently train their deep models. Such a requirement is indeed problematic, i.e., ground-truth depth maps are typically expensive, and sometimes even infeasible to acquire, due to the intrinsic limitations of depth acquisition approaches. For instance, most depth sensors are generally expensive with relatively low resolutions compared to existing image sensors, and their depth sensing ranges are typically fixed and hard to adapt to different scenarios. More recently, video sequences and stereo image pairs have been introduced as an alternative approach to generate depth pairs [12], [13].
However, these works instead introduce a new requirement to align videos or image pairs, and are therefore not suitable when only irrelevant images are available.

In this work, we propose a semi-supervised depth estimation framework to relax the need for large-scale supervised information, which instead only requires a small amount of image-depth pairs accompanied by a large amount of cheaply-available monocular images in training. Differing from the supervised training schemes in most existing models [1], [2], [12], such a semi-supervised setting imposes extra unsupervised cues. In particular, to train the depth regressor, a generative adversarial learning paradigm is introduced, which contains a generator for depth estimation and two discriminators to evaluate the depth estimation quality and its consistency with the corresponding RGB image, respectively. The generator can be any cutting-edge image-to-depth estimation model, e.g., [2], [3], [14]. One of the discriminators, called the pair discriminator, aims to distinguish images and their predicted depth maps from real image-depth pairs. The other discriminator, called the depth discriminator, aims to evaluate the quality of the depth map, i.e., it tells whether the predicted depth is drawn from the same distribution as the real depth. During training, unannotated images are fed to the generator and the network loss is computed according to the feedback from these two discriminators. The generator, the pair discriminator and the depth discriminator together simulate a Bayesian framework: the depth discriminator accounts for the prior $p(d)$, and the pair discriminator accounts for the joint distribution $p(d, I)$. The likelihood is then $p(I \mid d) = p(d, I)/p(d)$, and the final posterior satisfies $p(d \mid I) \propto p(I \mid d)\, p(d)$.

The contributions of our work are as follows:

• We propose a semi-supervised framework to release modern models' reliance on image-depth pairs by leveraging unlabeled RGB images in the depth estimation task.

• We implement the semi-supervised framework by an adversarial learning paradigm, in which a generator network estimates the depth while two discriminator networks inspect the estimated depth-image pair and the depth, respectively.

• The framework generalizes well to different network architectures. For example, the generator can be any cutting-edge depth estimator. Specifically, the networks described in [2], [3], [14] all receive a performance gain in our experiments. The semi-supervised framework can also be used in dataset adaptation.

Thorough experimental validations are given in three folds:

• We demonstrate that the proposed framework is able to benefit most cutting-edge models [2], [3], [14] when limited training data (image-depth pairs) are available. On the NYUD dataset, the proposed framework decreases the Rel, RMSE and log10 errors of our baseline model [3] in a practical setting where labeled training data is limited. We achieve this by leveraging additional unlabeled images drawn from the same distribution, and alternately training the generator with the supervised information (from the L1 discrepancy) and the unsupervised information (from the discriminators' genuineness feedback).

• We demonstrate that the proposed framework is able to reach state-of-the-art accuracy when the training set is small.
On the Make3D dataset, our method reveals its potential on small datasets and reaches state-of-the-art evaluation errors in terms of Rel, RMSE and log10.

• We demonstrate that the proposed framework adapts well to unseen new datasets after training on an annotated one. When the model trained on KITTI is directly tested on the Make3D dataset, our semi-supervised method overall performs the best among the compared supervised models.

[Fig. 1 diagram: an unannotated image is fed to the generator, which predicts a depth map; the pair discriminator inspects (image, depth) pairs and the depth discriminator inspects depth maps alone, each labeling its input as real or fake.]

Fig. 1. Our semi-supervised adversarial framework. We try to leverage a vast amount of unannotated images together with a small number of image-depth pairs to train a depth estimation network. The generator
The generatorreceives gradients from two discriminator networks to update its param-eters. Unlike traditional losses such as L , L and Huber norm, the lossfrom two discriminators’ feedback can preserve better details with lessrequirement on the amount of ground truth. The rest of our paper are organized as follows. We intro-duce and discuss the related work in Sec. 2. The proposedapproach is then presented in Sec. 3. We give details of theexperimental settings and discussions in Sec. 4. Finally, weconclude the paper in Sec. 5.
2 RELATED WORK
Early works have revealed that 3D shape can be inferred from monocular camera images by shading [4], texture [5], and motion [6], etc. For instance, the methods in [4], [5], [15] inferred 3D object shapes from shading or texture under the assumption that color and texture are distributed uniformly. The methods in [16], [17], [18] studied the reconstruction of known object classes. Levin et al. [19] used a modified camera aperture to predict depth by defocusing. Structure from Motion (SfM) [6] and visual Simultaneous Localization And Mapping (vSLAM) [20] follow another direction, and have been popular algorithms for reconstructing a 3D scene from multiple images. SfM and vSLAM leverage camera motion to estimate camera poses first, and then infer depth by triangulating consecutive keyframe pairs. The reconstructed points are accurate with respect to their relative positions, but only cover a very small portion of the entire scene, which leads to a sparse depth map that is hard to use in various applications.
Among the earliest works in learning-based monocular depth estimation, Saxena et al. [8] predicted depth from local features, and then incorporated global context to refine the local prediction via an MRF. This work was later extended in [9], which reconstructs scene structure using the predicted depth map. Karsch et al. [10] used non-parametric feature
matching to retrieve the nearest neighbors of the given images within an RGB-D dataset. Then the corresponding depth maps of the retrieved images are warped and merged together to obtain the final visually pleasing depth estimation. Ladicky et al. [21] integrated depth estimation with semantic segmentation, and trained a classifier to solve these two problems. The work in [1] is among the pioneers in utilizing CNNs to regress depth maps from a single view. In their work, a global coarse-scale network is adopted to capture the overall scale, and then a local fine-scale network is adopted to refine the local details of the depth map. The authors further trained their network on multiple tasks to demonstrate its generalization ability [22]. By virtue of a pretrained ResNet-50 and an efficient up-sampling scheme, Laina et al. [2] constructed a fully convolutional residual network, which decreases the evaluation errors by a large margin. Hu et al. later extended [2] in the feature extraction stage with squeeze-and-excitation blocks [23]. In line with the success of CNNs, Liu et al. [24] introduced a structural loss in CNN training, and Wang et al. [25] further refined the estimated depth using a hierarchical CRF. More recently, Xu et al. [26] proposed to integrate side outputs of the model by CRFs. Zhang et al. [27] trained the FCRN [2] in a supervised manner under an adversarial learning framework. It is worth noting that most monocular depth estimators either rely on large amounts of ground-truth depth data or predict disparity as an intermediate step. To this end, Atapour-Abarghouei et al. [28] trained a depth estimation model using pixel-perfect synthetic data, which overcomes the domain bias and resolves the above issues to a certain extent.

Since a high-quality depth map is not always easy to collect, unsupervised methods have also been introduced. Garg et al. [29] exploited the consistency between two registered views. They first warped the right image using the predicted inverse depth map, and then trained the estimation network to minimize the reconstruction error between the left image and the warped right image. Kumar et al. [30] followed the work in [29] by using a discriminator network in an adversarial framework to distinguish the reconstructed image from the real image. Xie et al. [31] designed a similar pipeline and combined disparity maps of multiple levels to predict the right view, in which the objective function is the L1 loss between the output right view and the ground-truth right view. Godard et al. [13] proposed a novel objective function that considers an appearance matching loss, a disparity smoothness loss and a left-right disparity consistency loss, which achieves promising results in depth estimation. Requiring only a monocular video as input, Ranftl et al. [32] reconstructed complex dynamic scene depth from temporal sequences by motion models. Zhou et al. [12] presented an unsupervised scheme for depth estimation using unlabeled video clips with view synthesis, which is implemented by a Depth CNN and a Pose CNN.

Besides the supervised and unsupervised solutions, semi-supervised depth estimation has also been studied very recently. Kuznietsov et al. [33] used a sparse ground-truth depth map together with a registered stereo image pair to train a CNN, which has reached state-of-the-art performance on the KITTI dataset [34]. Note that our approach has a very different setting regarding the training data from [33].
Fig. 2. A basic encoder-decoder generator architecture. The encoder extracts features while reducing the spatial size, which is usually done by basic convolutions or convBlocks [23], [36]. The decoder gradually upsamples the extracted features to a size similar to the input image, which is usually implemented by deconvolution or naive interpolation.
During training, we do not require all the images to have a registered depth (only a limited amount of image-depth pairs is needed), and we also need fewer ground-truth depth maps to get comparable results, as quantitatively shown in Sec. 4.5.

3 THE PROPOSED APPROACH
The proposed generative adversarial learning framework consists of one generator network and two discriminator networks. The generator network accepts an unannotated image as its input and predicts a corresponding depth map. The predicted depth is then fed to: (1) a Pair Discriminator (PD) network together with the RGB channels as its input; (2) a Depth Discriminator (DD) network together with a real depth map sampled from the labeled data, which gives feedback to the generator about whether the predicted depth comes from the same distribution as the real depth. It is worth noting that Nguyen et al. [35] have used a similar architecture called D2GAN, but their generator is not conditioned and the two discriminators are designed to inspect the predicted data twice to avoid mode collapse. D2GAN is unable to focus on the predicted depth map while guaranteeing its consistency with the corresponding RGB image, and is thus not suitable for the depth estimation task.
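Viewed through the Bayesian lens sketched in the introduction, the two discriminators can be read as estimating the two factors of the posterior. The following is a restatement of that factorization from Sec. 1, with $d$ the depth map and $I$ the image:

$$p(I \mid d) = \frac{p(d, I)}{p(d)}, \qquad p(d \mid I) \;\propto\; p(I \mid d)\, p(d) \;=\; p(d, I),$$

where the pair discriminator scores samples of the joint $p(d, I)$, the depth discriminator scores samples of the prior $p(d)$, and the normalizing constant $p(I)$ does not depend on $d$ and can be dropped.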
Similar to semantic segmentation [37], edge detection [38], and other image translation tasks [39], depth estimation is typically conducted using an encoder-decoder network, as illustrated in Fig. 2. However, instead of classifying each pixel into a discrete label, the depth estimation network conducts a regression task that maps the pixel intensities to continuous depth values. To accomplish such a dense prediction, most existing works [1], [2], [22], [26] first use a CNN to progressively extract feature maps from low level to high level. After that, a decoder network is plugged in to regress the depth while restoring the spatial resolution by upsampling the compact feature maps. To this end, Eigen et al. [1] proposed an architecture that contains two phases: a coarse phase that produces low-resolution predictions using both convolution and fully-connected layers, and a fine phase that refines the first phase's output with more convolution layers. Notably, the use of the fully-connected layer in the first phase enables the coarse network to have a full receptive field, which is essential to capture the global scale of the scene. However, such a design is memory-intensive given the large number of parameters.
Fig. 3. The architecture of the pair discriminator (PatchGAN [39]), which consists of five convolution layers with increasing receptive field sizes in each layer. The input image and the predicted depth map are split into patches, and each patch is inspected as real or fake by the discriminator.

Laina et al. [2] took advantage of the ResNet architecture [36] and designed a fast up-convolution module, which is similar to a ResNet block but has a reverse data flow. Hu et al. [3] later inherited this decoding scheme, adopting Squeeze-and-Excitation blocks [23] and intermediate feature fusion with an edge-aware loss function to train the model.

In this work, the generator can be any cutting-edge depth estimation model (or, more broadly, any image-to-image translation model). As quantitatively demonstrated later in our experiments, we have found that most modern architectures can be easily plugged into our semi-supervised framework as the generator to get a performance gain. However, unlike the previous works [1], [2], [22] that predict a depth map smaller than the corresponding RGB image, we predict the depth map with the same size as the input image. And instead of reconstructing the original spatial resolution using naive methods such as bilinear interpolation, we want the network to learn the upsampling scheme by itself. Throughout the paper, we use the network proposed in [3] as our generator backbone, and add an additional deconvolution layer at the end to upsample to the input resolution. Furthermore, we also tested the models described in [2], [14]; the corresponding comparison is shown later in Tab. 4.
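While the paper uses the network of [3] as its backbone, the overall shape of such a generator can be illustrated with a minimal PyTorch sketch: an encoder that shrinks the spatial size, a decoder that restores it, and a trailing deconvolution so the prediction matches the input resolution. The layer counts and channel widths below are illustrative assumptions, not the authors' design.

```python
# A minimal, hypothetical encoder-decoder depth generator (NOT the backbone
# of [3]); it only illustrates the structure described in the text.
import torch
import torch.nn as nn

class ToyDepthGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: progressively extract features while halving spatial size.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: learned upsampling back to the input resolution; the final
        # deconvolution regresses a single-channel depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

if __name__ == "__main__":
    g = ToyDepthGenerator()
    # Sanity check: predicted depth has the same spatial size as the input.
    print(g(torch.randn(1, 3, 240, 320)).shape)  # torch.Size([1, 1, 240, 320])
```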
The two discriminator networks measure how far the distribution of the predicted depth map (and the image) is from the real one(s). Naive losses such as the L1 and L2 norms penalize the generator by per-pixel discrepancy, which tends to lead to overall blurry depth maps, both theoretically and empirically. The discriminator network's feedback (sometimes referred to as the GAN loss [39]), by contrast, detects such artifacts and penalizes the generator so that it produces a depth map that accords with the natural depth value distribution, thus encouraging more high frequencies (crispness/edges/details) in the predicted depth map. The GAN loss has effects similar to the higher-order potentials of CRFs [40], since it can access the joint configuration of many depth patches. It helps to guide the generator to produce sharp depth maps and to consider neighbor relations better than the pairwise potential.

TABLE 1
The receptive field size of neurons in each feature layer. A larger size takes more contextual information into account.

Layer                | 1     | 2       | 3       | 4       | 5
Receptive field size | 4 × 4 | 10 × 10 | 22 × 22 | 46 × 46 | 70 × 70

For the pair discriminator, we use the PatchGAN [39] to predict the real/fake labels at the scale of patches. As shown in Fig. 3, the pair discriminator classifies each overlapped 70 × 70 patch in the input image-depth pairs as real or fake, and computes the discriminator loss by averaging all responses. Although the patch size N can be of any size between 1 and min(h_input, w_input), Isola et al. [39] have verified that a size around 70 × 70 is best to encourage the overall sharpness while preventing artifacts at patch borders. Isola et al. [39] also showed that smaller or bigger patches do not improve performance on the task of semantic segmentation. For our implementation, the patch sizes, a.k.a. the receptive field sizes of each layer's neurons, are listed in Tab. 1. The network has five convolution layers in total. The first four layers are each followed by a batch normalization layer and a leaky ReLU layer to improve stability.

For the depth discriminator, we use a network similar to the pair discriminator, with the difference that it only receives the predicted depth map and the real depth map as its inputs. Other details of the model architecture are the same as the pair discriminator.
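A sketch of both patch-based discriminators is given below, assuming a PatchGAN-style [39] stack of five 4 × 4 convolutions whose receptive fields grow exactly as in Tab. 1 (4, 10, 22, 46, 70). The channel widths are illustrative assumptions; the only structural difference between the two discriminators is the number of input channels.

```python
# Hypothetical PatchGAN-style discriminators matching the receptive fields of
# Tab. 1; widths (64/128/256/512) are assumed, not the authors' values.
import torch
import torch.nn as nn

def patch_discriminator(in_channels):
    layers, c_in = [], in_channels
    # Strides 2,2,2,1 with 4x4 kernels give receptive fields 4, 10, 22, 46.
    for c_out, stride in zip([64, 128, 256, 512], [2, 2, 2, 1]):
        layers += [nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1),
                   nn.BatchNorm2d(c_out),           # paper: BN after first four convs
                   nn.LeakyReLU(0.2, inplace=True)]
        c_in = c_out
    # Fifth convolution: one real/fake score per 70x70 patch.
    layers.append(nn.Conv2d(c_in, 1, 4, stride=1, padding=1))
    return nn.Sequential(*layers)

pair_d = patch_discriminator(in_channels=3 + 1)   # RGB stacked with a depth map
depth_d = patch_discriminator(in_channels=1)      # depth channel only

img, depth = torch.randn(2, 3, 240, 320), torch.randn(2, 1, 240, 320)
print(pair_d(torch.cat([img, depth], dim=1)).shape)  # grid of patch scores
print(depth_d(depth).shape)
```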
Assuming we have n + m images (n ≪ m) and n depth maps for the first n images, the training data we use can be expressed as:

$$\mathcal{I} = \{ I_1, I_2, \ldots, I_n, I_{n+1}, \ldots, I_{n+m} \}, \qquad \mathcal{Y} = \{ Y_1, Y_2, \ldots, Y_n \},$$

where $I_i$ has a shape of $h \times w \times 3$ representing an RGB image, and $Y_i$ has a shape of $h \times w$ representing a depth map. Our goal is to use all the n + m images to learn a mapping f from the image domain $\mathcal{I}$ to the depth domain $\mathcal{Y}$ by minimizing a loss function. Different from traditional GAN losses, our loss function is tailored to the problem in two ways.

First, the function is defined solely on the input image and the output depth, without the noise component $z$, which is different from the vanilla GAN [41] and Conditional GANs [42]. The vanilla GAN tries to map a random noise vector $z$ to a desired image $y$, which can be represented by $z \xrightarrow{G} y$. Conditional GANs [42] condition the mapping by adding an image $i$ to the generator, which can be viewed as $\{i, z\} \xrightarrow{G} y$. However, as observed in [43] and [39], the input noise vector $z$ tends to be ignored by the generator and does not affect the output result $y$ much. Therefore, we discard the noise vector $z$ and only use the image $i$ as the input to the generator, which is denoted by $i \xrightarrow{G} y$ in the depth estimation task ($i$ represents the image and $y$ represents the predicted depth map).

The second difference comes from the novel structure of the proposed GAN, which has three components: the Generator, the Pair Discriminator, and the Depth Discriminator. The loss function for the Pair Discriminator is designed as:

$$\mathcal{L}_{PD} = -\mathbb{E}_{i,y \sim p_{data}(i,y)}\big[\log PD(i, y)\big] - \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - PD(i', G(i'))\big)\big]. \tag{1}$$

Here $PD(x, y)$ represents the probability that $(x, y)$ is a real pair from the training set. $i$ is sampled from $\{I_1, I_2, \ldots, I_n\}$ and $i'$ is sampled from $\{I_{n+1}, \ldots, I_{n+m}\}$. The depth map is stacked with the corresponding image to allow the discriminator to penalize the mismatch between the joint distributions of $(i, d)$ and $(i', G(i'))$.

In order to predict high-quality depth maps, the generator should fool the pair discriminator by making $PD(i', G(i'))$ as close to 1 as possible. The corresponding objective for G is then:

$$\mathcal{L}_{G}^{PD} = \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - PD(i', G(i'))\big)\big]. \tag{2}$$

Note that maximizing $-\mathbb{E}_{i' \sim p_{data}(i')}[\log(1 - PD(i', G(i')))]$ means maximizing the probability that the discriminator makes a mistake, which is equivalent to minimizing $\mathbb{E}_{i' \sim p_{data}(i')}[\log(1 - PD(i', G(i')))]$.

The depth discriminator (DD), however, only looks at the predicted map to infer whether the generated map is drawn from the same distribution as the limited amount of ground-truth depth maps. Its objective function is therefore written as:

$$\mathcal{L}_{DD} = -\mathbb{E}_{y \sim p_{data}(y)}\big[\log DD(y)\big] - \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - DD(G(i'))\big)\big]. \tag{3}$$

Correspondingly, the generator computes its loss using the following equation:

$$\mathcal{L}_{G}^{DD} = \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - DD(G(i'))\big)\big]. \tag{4}$$

Note that the depth discriminator does not care whether the generated depth corresponds to the image. It only aims to ensure that the predicted map looks "natural", i.e., the map is not distinguishable from samples drawn from $p_{data}(d)$. Finally, we combine the two discriminator networks with the generator network. More formally, PD, DD and G play a three-player minimax optimization game as below:

$$\min_G \max_{PD, DD} V(G, PD, DD) = \mathbb{E}_{i,y \sim p_{data}(i,y)}\big[\log PD(i, y)\big] + \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - PD(i', G(i'))\big)\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\log DD(y)\big] + \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - DD(G(i'))\big)\big]. \tag{5}$$

The pair discriminator and the depth discriminator are optimized using Eqs. (1) and (3), respectively.
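Eqs. (1)-(4) can be rendered schematically in PyTorch as below, reusing the `pair_d` and `depth_d` sketches from above. Note one common practical deviation: instead of minimizing $\log(1 - D(\cdot))$ directly, the generator terms here use the non-saturating trick (binary cross-entropy against "real" targets), which has the same fixed points but better gradients. This is a sketch of the objectives, not the authors' exact implementation.

```python
# Schematic adversarial losses for the generator G and the two discriminators.
import torch
import torch.nn.functional as F

def bce(logits, is_real):
    # Binary cross-entropy of raw patch scores against all-real/all-fake targets.
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def loss_pd(pair_d, img_real, depth_real, img_unl, depth_fake):
    # Eq. (1): real image-depth pairs vs. (unlabeled image, predicted depth).
    real_term = bce(pair_d(torch.cat([img_real, depth_real], dim=1)), is_real=True)
    fake_term = bce(pair_d(torch.cat([img_unl, depth_fake.detach()], dim=1)), is_real=False)
    return real_term + fake_term

def loss_dd(depth_d, depth_real, depth_fake):
    # Eq. (3): real depth maps vs. predicted depth maps (depth channel only).
    return (bce(depth_d(depth_real), is_real=True) +
            bce(depth_d(depth_fake.detach()), is_real=False))

def loss_g_pd(pair_d, img_unl, depth_fake):
    # Eq. (2), non-saturating form: fool the pair discriminator.
    return bce(pair_d(torch.cat([img_unl, depth_fake], dim=1)), is_real=True)

def loss_g_dd(depth_d, depth_fake):
    # Eq. (4), non-saturating form: fool the depth discriminator.
    return bce(depth_d(depth_fake), is_real=True)
```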
The generator now considers the feedback from these two discriminators:

$$\mathcal{L}_G = \lambda\, \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - PD(i', G(i'))\big)\big] + (1 - \lambda)\, \mathbb{E}_{i' \sim p_{data}(i')}\big[\log\big(1 - DD(G(i'))\big)\big], \tag{6}$$

where $0 < \lambda \leq 1$ is a hyperparameter to control the trade-off between the pair discriminator and the depth discriminator. In our experiments, we have found that weighting the pair discriminator more heavily gives a reasonable result both qualitatively and quantitatively, which means the feedback from the pair discriminator should be given more attention. Following the work in [39], we alternately optimize PD, DD and G one step at a time.

We embed the semi-supervised training phase into the supervised training phase. During training, supervised training and semi-supervised training switch between iterations. In the supervised iteration, we make full use of the ground-truth depth map by computing both the L1 regression loss and the GAN loss as follows:

$$\mathcal{L}_G = \lambda\, \mathbb{E}_{i \sim p_{data}(i)}\big[\log\big(1 - PD(i, G(i))\big)\big] + (1 - \lambda)\, \mathbb{E}_{i \sim p_{data}(i)}\big[\log\big(1 - DD(G(i))\big)\big] + \beta\, \mathbb{E}_{i,y \sim p_{data}(i,y)}\big[\|y - G(i)\|_1\big], \tag{7}$$

where the parameters $\lambda$ and $\beta$ control the weights between the regression loss and the GAN loss. In our experiments, we have found that a larger $\beta$ at the initial training stage gives better results. The parameters of the pair discriminator and the depth discriminator are updated by computing the losses below:

$$\mathcal{L}_{PD} = -\mathbb{E}_{i,y \sim p_{data}(i,y)}\big[\log PD(i, y)\big] - \mathbb{E}_{i \sim p_{data}(i)}\big[\log\big(1 - PD(i, G(i))\big)\big], \tag{8}$$

$$\mathcal{L}_{DD} = -\mathbb{E}_{y \sim p_{data}(y)}\big[\log DD(y)\big] - \mathbb{E}_{i \sim p_{data}(i)}\big[\log\big(1 - DD(G(i))\big)\big]. \tag{9}$$

In the semi-supervised phase, we optimize the generator and discriminator networks by the loss functions defined in Sec. 3.2. Note that the key difference between the two phases is that in the supervised training phase all the losses are computed on labeled image-depth pairs without using the unlabeled images. We have found that this stabilizes the training process and gives the best performance.
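A schematic training loop for this alternating scheme is sketched below, reusing the network and loss sketches above. The data loaders, learning rates and the weights `lam`/`beta` are illustrative placeholders, not the paper's settings.

```python
# Hypothetical training loop alternating supervised iterations (Eqs. (7)-(9))
# with semi-supervised iterations (Eqs. (1), (3) and (6)).
import itertools
import torch
import torch.nn.functional as F

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_pd = torch.optim.Adam(pair_d.parameters(), lr=1e-4)
opt_dd = torch.optim.Adam(depth_d.parameters(), lr=1e-4)
lam, beta = 0.7, 10.0  # assumed values, not the paper's

unlabeled_iter = itertools.cycle(unlabeled_loader)
for step, (img_l, depth_l) in enumerate(labeled_loader):
    supervised = (step % 2 == 0)                  # switch phases per iteration
    img_in = img_l if supervised else next(unlabeled_iter)
    depth_fake = generator(img_in)

    # One step for each discriminator; both see detached generator outputs.
    opt_pd.zero_grad()
    loss_pd(pair_d, img_l, depth_l, img_in, depth_fake).backward()
    opt_pd.step()
    opt_dd.zero_grad()
    loss_dd(depth_d, depth_l, depth_fake).backward()
    opt_dd.step()

    # One step for the generator: weighted feedback from both discriminators,
    # plus the L1 regression term when ground truth is available (Eq. (7)).
    loss_g = (lam * loss_g_pd(pair_d, img_in, depth_fake) +
              (1.0 - lam) * loss_g_dd(depth_d, depth_fake))
    if supervised:
        loss_g = loss_g + beta * F.l1_loss(depth_fake, depth_l)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```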
Fig. 4. Qualitative results on the NYU Depth test set. All losses are applied to the same model architecture with the same learning strategy using 500 image-depth pairs. Our semi-supervised models are able to produce finer depth maps than the compared approaches.
4 EXPERIMENTS
We conduct our experiments in three aspects:

• In Sec. 4.3, we first validate that the proposed GAN loss is superior for the depth regression problem over other losses when using a limited amount of ground-truth data. Then we show that most cutting-edge models can get a performance gain under our semi-supervised settings. Last but not least, we conduct extensive experiments to explore when the semi-supervised framework can potentially improve the performance.

• In Sec. 4.4, we show that our GAN loss can reach state-of-the-art performance on the Make3D dataset, which verifies the point that our method has an advantage when the data amount is practically small.

• In Sec. 4.5, we further test the proposed method on the task of domain adaptation, which shows that the proposed semi-supervised scheme performs better than traditional supervised schemes.

We implement our model using PyTorch on an NVIDIA GeForce GTX TITAN X GPU with 12 GB of GPU memory. For the pair discriminator and the depth discriminator, we initialize the parameters of all layers with values drawn from a zero-mean normal distribution with a small standard deviation. As for the generator, we warm it up by training it with a simple L1 loss for initialization. The overall training pipeline is introduced in Alg. 1. We have quantitatively observed that the pretrained initialization performs better than the normal initialization. We stop training when the loss of the generator has converged.
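The initialization just described can be sketched as follows; the standard deviation and the single warm-up pass over the labeled data are assumptions for illustration, not the paper's exact settings.

```python
# Hypothetical initialization: normal init for the discriminators, then a
# supervised L1 warm-up for the generator before adversarial training.
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_normal(module, std=0.02):  # assumed small std
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

pair_d.apply(init_normal)
depth_d.apply(init_normal)

# Warm-up: train the generator alone with a simple L1 regression loss.
warmup_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
for img_l, depth_l in labeled_loader:
    loss = F.l1_loss(generator(img_l), depth_l)
    warmup_opt.zero_grad()
    loss.backward()
    warmup_opt.step()
```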
In alignment with [1], we evaluate our model with the following error metrics:

$$\mathrm{ARD} = \frac{1}{|T|} \sum_{y \in T} |y_i - y_i^*| / y_i^*,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{y \in T} (y_i - y_i^*)^2},$$

$$\mathrm{RMSE}(\log) = \sqrt{\frac{1}{|T|} \sum_{y \in T} (\log y_i - \log y_i^*)^2},$$

$$\log_{10} = \frac{1}{|T|} \sum_{y \in T} |\log_{10} y_i - \log_{10} y_i^*|,$$

$$\delta = \%\ \text{of}\ y_i\ \text{s.t.}\ \max\!\left(\frac{y_i}{y_i^*}, \frac{y_i^*}{y_i}\right) < thr.$$

In the equations above, ARD and RMSE are abbreviations for Absolute Relative Difference (rel) and Root Mean Squared Error, respectively. $y_i$ is the predicted pixel depth value and $y_i^*$ is the ground-truth depth value. $T$ represents the set of evaluated pixels in the test set. We use $\delta_1$, $\delta_2$, and $\delta_3$ to denote the cases where $\delta < 1.25$, $\delta < 1.25^2$, and $\delta < 1.25^3$. Note that for the rel, RMSE, RMSE(log) and log10 metrics, lower is better, while for the $\delta_i$ metrics, higher is better.
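These metrics can be written as a small NumPy routine, shown below. `pred` and `gt` are depth maps in meters and `mask` selects the valid pixels (e.g. excluding missing ground truth or out-of-range distances); this mirrors the definitions above rather than any official evaluation script.

```python
# Standard monocular depth metrics over the valid pixels of a prediction.
import numpy as np

def depth_metrics(pred, gt, mask=None, thr=1.25):
    if mask is None:
        mask = gt > 0                      # ignore pixels without ground truth
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)
    return {
        "rel":      float(np.mean(np.abs(p - g) / g)),                     # ARD
        "rmse":     float(np.sqrt(np.mean((p - g) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "log10":    float(np.mean(np.abs(np.log10(p) - np.log10(g)))),
        "delta1":   float(np.mean(ratio < thr)),
        "delta2":   float(np.mean(ratio < thr ** 2)),
        "delta3":   float(np.mean(ratio < thr ** 3)),
    }
```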
As mentioned in the previous sections, our framework can benefit most cutting-edge models. However, the model architecture itself sometimes greatly influences the estimation accuracy. To show that our performance gain is general enough to benefit a wide range of backbone networks, we use the off-the-shelf models [2], [3], [14] as our baselines, which meet our assumptions for the generator network as described in Sec. 3.1.1.

The NYU Depth v2 dataset [44] contains RGB images and the corresponding depth maps of various indoor scenes, captured by a Microsoft Kinect v1. The full training set contains 464 scenes, which are officially split into 249 scenes for training and 215 scenes for testing. Together with the full set, a sufficiently labeled subset is also offered. This subset has 795 training images and 654 testing images. Following the previous works in [1], [2], we test on the 654 images in the following experiments. The raw training depth maps given by the dataset are in a range of (0 m, 10 m) and are stored as uint8 PNG images. Before feeding the data to the
generator network, we first downsample both the image and the depth map to a size of 320 × 240 (half of the full resolution) in order to accelerate the training process, and then center-crop each of them to exclude the blank frame border. The RGB images are normalized using the mean and standard deviation values computed on ImageNet [45]. Parameters of the generator and the discriminators are all optimized by Adam [46], with an initial learning rate that is multiplied by 0.1 periodically during training.
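The preprocessing just described can be sketched with torchvision transforms as below. The crop size is an assumed placeholder (the paper's exact value is not reproduced in this text), and the ImageNet statistics are the standard published ones.

```python
# Hypothetical NYU Depth v2 preprocessing: half-resolution resize, center
# crop to drop the blank frame border, and ImageNet normalization for RGB.
import torchvision.transforms as T

rgb_transform = T.Compose([
    T.Resize((240, 320)),                 # half of the 480x640 full resolution
    T.CenterCrop((228, 304)),             # assumed crop removing blank borders
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
depth_transform = T.Compose([
    T.Resize((240, 320)),
    T.CenterCrop((228, 304)),
    T.ToTensor(),                         # keeps depth as a single channel
])
```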
Evaluating the loss functions. We first evaluate whether the proposed GAN loss can improve the overall performance. We train our baseline generator with the L1, L2, Huber, scale-invariant [1], berHu [2] and edge-aware [3] losses in a supervised manner, and with the loss proposed in Sec. 3.2 in a semi-supervised manner. To study the effect of the discriminator structure, we experiment with three types of discriminators: the pair discriminator only, the depth discriminator only, and a combination of both. We use 500 labeled images and evaluate the predicted maps with the measurements described in Sec. 4.1. To make sure the losses are fairly treated during training, we plot the convergence curves in Fig. 5 and only stop training when loss convergence is observed. Note that the loss of berHu is scaled by 0.1 for better visualization. Quantitative results are reported in Tab. 2.

TABLE 2
Evaluation of different loss functions on the NYU Depth test set. All the models are trained using the same generator architecture. Our GAN loss gives the best quantitative results when the number of training data is limited.

loss                | type       | #pairs | rel   | rms   | log10 | δ1    | δ2    | δ3
L1                  | supervised | 500    | 0.201 | 0.750 | 0.083 | 0.680 | 0.917 | 0.978
L2                  | supervised | 500    | 0.195 | 0.706 | 0.080 | 0.695 | 0.924 | 0.981
Huber               | supervised | 500    | 0.204 | 0.698 | 0.080 | 0.696 | 0.920 | 0.977
Scale-invariant [1] | supervised | 500    | 0.192 |       |       |       |       |

Among the nine losses, our semi-supervised loss outperforms all the other losses that are popularly used in the depth estimation task, and is better than using either of the discriminators alone. We also randomly visualize some results to get a qualitative sense of the predicted depth maps. The visualizations are shown in Fig. 4. It can be seen that the L1, L2 and Huber losses are good at capturing the overall mean depth scale, but are prone to output blurry depth maps. The scale-invariant error, instead, measures the scene point relationships irrespective of the absolute global scale, which can better preserve the relative depth. Besides, berHu [2] shows a good quantitative result considering that the depth distribution in the NYU Depth dataset is heavy-tailed [47]. Our semi-supervised framework produces the best result, which preserves more sharpness at object borders and is more natural than the other losses. We also compare our method with [27], in which a GAN-variant loss was proposed to train a depth estimator. Results are given in Tab. 3.

TABLE 3
Comparisons with the GAN loss variant. Our method, though not designed for depth estimation when training data is sufficient (12K in this case), still performs better w.r.t. most of the evaluation metrics.

algorithm         | #pairs | rel   | rms
Zhang et al. [27] | 12K    | 0.128 | 0.551

Fig. 5. The convergence curves of the different loss functions (L1, L2, Huber, Sca-Inv [1], berHu [2], Edge-Aware [3]) during model training. After about 15 epochs, all losses tend to stop decreasing, and we stop training when convergence is observed.

We can see that though our semi-supervised framework is designed for situations where training data is limited, we can still outperform [27] when training data is sufficient (12K in this case). More qualitative results are shown in Fig. 12 (a). We can see that the predicted maps give better results for near objects, but sometimes tend to confuse the middle and far distances. This is actually also the case for human beings: for near objects, we can estimate the distance with centimeter precision, but for far objects it is very hard to achieve meter-level precision.
Improvements over other methods.
Besides our baseline model of [3], we also tried two different models [2], [14] under our semi-supervised learning framework and compared the results with those trained using the L1 regression loss. Comparison results are reported in Tab. 4. All three model architectures receive performance gains at different training-set sizes. Further, to suggest a data range for when the proposed semi-supervised framework can bring a performance gain, we train our baseline model with the loss proposed in [3] in a supervised manner, and compare the results with ours in Fig. 9.
TABLE 4
Quantitative comparisons of models trained by supervised learning and by our semi-supervised framework.

generator          | #pairs | rel   | rms   | log10
Unet [14]          | 100    | 0.374 | 1.641 | 0.209
Unet [14] semi     | 100    |       |       |
Unet [14]          | 500    | 0.372 | 1.179 | 0.141
Unet [14] semi     | 500    |       |       |
Unet [14]          | 1000   | 0.365 | 1.180 | 0.137
Unet [14] semi     | 1000   |       |       |
FCRN [2]           | 100    | 0.286 | 0.964 | 0.114
FCRN [2] semi      | 100    |       |       |
FCRN [2]           | 500    | 0.272 | 0.906 | 0.111
FCRN [2] semi      | 500    |       |       |
FCRN [2]           | 1000   | 0.220 | 0.760 | 0.086
FCRN [2] semi      | 1000   |       |       |
Hu et al. [3]      | 100    | 0.322 | 1.387 | 0.151
Hu et al. [3] semi | 100    |       |       |
Hu et al. [3]      | 500    | 0.197 | 0.837 | 0.084
Hu et al. [3] semi | 500    |       |       |
Hu et al. [3]      | 1000   | 0.168 | 0.751 | 0.071
Hu et al. [3] semi | 1000   |       |       |
TABLE 5
Error analysis against different semantic areas.

area      | rel   | rms   | log10 | δ1    | δ2    | δ3
floor     |       |       |       |       |       |
structure | 0.192 | 0.831 | 0.081 | 0.690 | 0.919 | 0.981
props     | 0.187 | 0.694 | 0.078 | 0.711 | 0.931 | 0.984
furniture | 0.176 | 0.612 | 0.074 | 0.728 | 0.938 | 0.987
missing   | 0.186 | 0.721 | 0.078 | 0.708 | 0.930 | 0.983
overall   | 0.183 | 0.704 | 0.077 | 0.713 | 0.931 | 0.984

It can be seen that when the training data is practically small (a few hundred pairs), our method can vastly boost the prediction accuracy. The margin gradually decreases as the training data becomes sufficient. When there are more than roughly 1K training image-depth pairs, fully-supervised approaches are more suitable for the problem.
Error analysis.
We further analyze the prediction errors with respect to different semantic regions. Besides images and depth maps, the NYU Depth v2 dataset also provides semantic labels for each image. Silberman et al. [44] defined four general class labels (floor, structure, props and furniture) plus one missing area out of the original 894 class labels. We follow this class combination strategy and compute the errors in these five areas. Results are reported in Tab. 5. We can see that our model performs best in the floor area and worst in the structure area (ignoring the missing-label area). This may be due to the fact that floors are smoother with fewer depth hops, while the depth of structures is more varied and often appears at far distances.

Ablation study on the number of additional RGB images.
To further evaluate the effect of the number of additional RGB images on our semi-supervised NYU Depth model's performance, we experiment with increasing numbers of additional RGB images together with 500 image-depth pairs in our semi-supervised setting. Results are given in Fig. 6, from which we can see that given the 500 training image-depth pairs, adding more additional RGB images (within a wide range) is beneficial to the model performance. The performance gain gradually saturates after 10 times more RGB images are used.

Testing on images in the wild.
Fig. 6. Performance curves with respect to different numbers of additional RGB images used on the NYU Depth dataset. In all experiments, 500 image-depth pairs are used. Our semi-supervised framework can effectively boost the performance by leveraging extra unlabeled RGB images.

Fig. 7. Some qualitative results on wild images using the model trained on NYUD v2 with 1k ground-truth image pairs.

We randomly pick some
Internet images covering both indoor and outdoor scenes, and then test our model on them to see whether the model generalizes well to unseen scenes drawn from an unfamiliar data distribution. Results are shown in Fig. 7. We compare the predicted depth with our own visual estimation and find that the results are decent. For indoor scenes, our model often predicts depth with high accuracy. For outdoor scenes, our model predicts relative depth decently but fails to capture the absolute distance scale. This can be explained by domain shift, since the absolute distances of outdoor scenes are significantly different from the NYU Depth indoor dataset. In most cases, the model predicts reliable and convincing results, which is useful for various applications, such as cleaning robots, UAVs, and computational photography.
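Inference on such wild images amounts to applying the training-time normalization and a single forward pass, as in the hedged sketch below; `rgb_transform` and `generator` are the earlier sketches, and the file name is a placeholder.

```python
# Minimal inference sketch for an image in the wild.
import torch
from PIL import Image

@torch.no_grad()
def predict_depth(path, model, transform):
    img = Image.open(path).convert("RGB")
    x = transform(img).unsqueeze(0)       # add the batch dimension
    model.eval()
    depth = model(x).squeeze().cpu().numpy()
    return depth  # note: the absolute scale may be off on outdoor scenes

depth = predict_depth("wild_image.jpg", generator, rgb_transform)
```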
Make3D [9] is an outdoor dataset captured by an RGB camera and a laser scanner. It contains aligned image
and range data, which are officially divided into 400 training pairs and 134 testing pairs. The RGB images are stored at a 2,272 × 1,704 resolution, while their depth maps are stored at a 305 × 55 resolution, which is an order of magnitude smaller in spatial size. Thus, we first resize all images and the corresponding depth maps to a uniform size using bilinear interpolation. Besides, due to the inaccurate long-range distances and glasses in data acquisition, we mask out all invalid depth pixels and pixels with distances larger than
70 meters following [2]. We first train our model with all 400 training image-depth pairs in a supervised manner as described in Sec. 3.3 to get a fair comparison with other supervised methods. Then, we take another 425 RGB images from a similar dataset, namely dataset-2 of Make3D, to further evaluate the effectiveness of the proposed semi-supervised learning.

We compare our methods with [2], [10], [24], [33] and report the results in Tab. 6.

TABLE 6
Comparison on the Make3D dataset. Our methods generalize well to outdoor scenes and can reach state-of-the-art performance. Pixels with distance larger than 70 m are masked out.

algorithm             | type            | rel   | rms    | log10
Make3D [9]            | supervised      | 0.370 | -      | 0.187
Liu et al. [48]       | supervised      | 0.379 | -      | 0.148
DepthTransfer [49]    | supervised      | 0.361 | 15.10  | 0.148
Liu et al. [24]       | supervised      | 0.338 | 12.60  | 0.134
Li et al. [50]        | supervised      | 0.279 | 10.27  | 0.102
Liu et al. [51]       | supervised      | 0.287 | 14.09  | 0.122
Roy et al. [47]       | supervised      | 0.260 | 12.40  | 0.119
MS-CRF [26]           | supervised      | 0.198 | 8.56   | -
Atapour-A et al. [28] | unsupervised    | 0.423 | 9.002  | -
Laina et al. [2]      | supervised      | 0.201 | 7.038  | 0.079
DORN (VGG) [11]       | supervised      | 0.238 | 10.01  | 0.087
DORN (ResNet) [11]    | supervised      | 0.162 | 7.32   | 0.067
Hu et al. [3]         | supervised      | 0.179 | 6.613  | 0.070
Ours-GAN              | supervised      | 0.158 | 6.139  | 0.067
Ours-GAN              | semi-supervised |       |        |

Our models push the evaluation errors down to a new level, and the semi-supervised scheme reaches state-of-the-art performance on all metrics. We can thus conclude that the semi-supervised learning effectively utilizes the extra RGB images and generalizes well to outdoor scenes. Sample qualitative results are shown in Fig. 8, and more qualitative results are given in Fig. 12 (b). It can be observed that our model overcomes the distraction from shadow areas and captures the underlying structure behind the scene appearance well.
Fig. 8. Qualitative results on the Make3D dataset. The rows (from top to bottom) are RGB images, ground-truth depth maps, and results by our supervised model, respectively. Pixels with distance larger than 70 m are masked out.

In this section, we first evaluate our model in a larger data regime, using 22K image-depth pairs from the KITTI dataset [34] together with 20K additional RGB images from the Cityscape dataset [52]. Evaluation results are reported in Tab. 7. Then, we experiment in a setting where domain shift exists (KITTI → Make3D in our case). That is, we train on the KITTI image-depth pairs while testing on the Make3D dataset.

The KITTI dataset [34] is captured by two high-resolution color (and grayscale) cameras and a Velodyne laser scanner mounted on a driving car. The raw dataset contains scenes categorized as City, Residential, Road or Campus. Following [1], we use part of the scenes of the dataset for training and leave the rest for testing, which results in 22K image-depth pairs in total. The raw dataset stores depth by saving 3D points sampled by a rotating LIDAR scanner. The ground-truth depth maps are generated by reprojecting these points onto the left RGB images using the given intrinsics and extrinsics. All image data and Velodyne depth measures have a spatial resolution of roughly 1,242 × 375. Besides the training image-depth pairs drawn from the KITTI dataset, we also leverage more RGB images from the Cityscape dataset [52]. The Cityscape dataset offers dense pixel annotations of urban street scenes similar to those of the KITTI dataset. We draw 20k RGB images from this dataset and combine them with the 22k KITTI image-depth pairs to train our model in the semi-supervised manner. A validation set of 160 samples drawn from the KITTI dataset is used to tune hyper-parameters. We test our model on the commonly-used Eigen split [1], which contains 697 images selected from 29 scenes. During testing, we mask out the pixels whose ground-truth distances fall outside the valid evaluation range, and do not consider them in the subsequent error computation. We report the quantitative comparisons with the state-of-the-art methods in Tab. 7.

TABLE 7
Comparison on the KITTI dataset. The values of the baselines are taken from the original papers.

algorithm              | training data              | rel   | rms   | rms(log)
Eigen et al. [1]       | KITTI                      | 0.190 | 7.156 | 0.270
Zhou et al. [12]       | KITTI+Cityscape            | 0.198 | 6.565 | 0.275
Kuznietsov et al. [33] | KITTI                      | 0.108 |       |
Kumar et al. [30]      | KITTI                      | 0.219 | 6.340 | 0.273
Atapour-A et al. [28]  | KITTI+Cityscape            | 0.110 | 4.726 | 0.194
Ours-GAN               | KITTI                      | 0.107 | 4.405 | 0.181
Ours-GAN               | KITTI+Cityscape (20k RGBs) |       |       |

By leveraging extra RGB images through the semi-supervised learning, our model can further decrease the evaluation error down to a new level. We also give some qualitative results in Fig. 10. The predicted maps are sometimes even more natural than the ground-truth depth maps, due to the fact that the ground-truth depths given by KITTI are sparse and need to be densified by in-painting for visualization, while our model directly predicts dense depth values.

Fig. 9. Performance curves with respect to different numbers of training image-depth pairs. We can see that our semi-supervised framework can boost the performance when the number of labeled training samples is less than 1K.

Fig. 10. Qualitative results on the KITTI dataset. The columns (from left to right) are RGB images, ground-truth depth maps, and results by our semi-supervised model, respectively. Pixels with distances beyond the evaluation range are masked out.

We also evaluate our model from the viewpoint of domain adaptation. Domain adaptation is necessary when the training data distribution differs from that of the testing data, which may cause a significant performance drop during algorithm deployment, as shown in Tab. 8. The problem can be partially solved by our proposed semi-supervised learning, in which RGB images drawn from unseen scenes are leveraged. We train on the image-depth pairs from KITTI along with images from Make3D, and then test the learned model on Make3D. To make the predicted maps of Make3D images comparable with those of KITTI images, we randomly crop a region from the KITTI images in training, whose spatial size is aligned with the image size of Make3D. As shown in Tab. 8, our model outperforms the models that are trained on KITTI in a supervised manner and directly tested on Make3D.

TABLE 8
Adaptation results from KITTI to Make3D. Unlike other supervised methods, our model is capable of leveraging the images (without their corresponding depth maps) in Make3D and adapts best to it.

algorithm        | training data            | rel   | rms    | log10
Laina et al. [2] | KITTI                    | 0.587 | 17.957 |
DORN [11]        | KITTI                    | 0.589 | 10.701 | 0.239
Ours             | KITTI                    | 0.555 | 12.153 | 0.387
Ours             | KITTI+Make3D (400 RGBs)  |       |        |
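The random crop used to align KITTI's wide frames with Make3D's input size can be sketched as below; `random_aligned_crop` is a hypothetical helper, and the target size would be set to match whatever resolution the Make3D inputs are resized to.

```python
# Hypothetical helper: crop a region of a wide KITTI frame so its spatial
# size matches the Make3D inputs, keeping the image and depth map aligned.
import random

def random_aligned_crop(img, depth, target_h, target_w):
    # img: HxWx3 array, depth: HxW array; both from the same KITTI frame.
    h, w = depth.shape
    top = random.randint(0, h - target_h)
    left = random.randint(0, w - target_w)
    return (img[top:top + target_h, left:left + target_w],
            depth[top:top + target_h, left:left + target_w])
```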
We further take a deeper look into the cases when our modelis less effective:
Fig. 11. Failures in glass and mirror area (from left to right: RGB image,ground truth depth and predicted depth).
Abundant Training GTs are available.
As shown in Fig. 9, when the model is fed with sufficient data, adding more unlabeled images offers no further help to the prediction accuracy. We experiment with the three models in Tab. 4 and find that our semi-supervised framework helps to improve the prediction accuracy in the regime where the amount of data is still the first driving force for model training. After a saturation point of the training data, giving more unlabeled images does not help the model learn better. We argue that this is expected for a semi-supervised learning approach, which assumes that the annotations are insufficient. When the training data is abundant, such assumptions are violated, and thus we cannot expect the semi-supervised approach to further outperform fully-supervised approaches.
Predicting depth in translucent or highly reflective regions.
As illustrated in Fig. 11, our model fails to predict the correct depth values in glass and mirror regions, due to flaws in the training data. IR-based depth acquisition equipment, such as the Kinect, may collect misleading depth values in transmissive regions such as glass, and in regions with specular reflection such as mirrors. With these flaws in the training data, the model likewise cannot handle depth prediction in such regions.
5 CONCLUSION
In this work, we propose an adversarial framework that produces state-of-the-art depth estimation results using only a limited amount of training samples. We achieve this by passing the feedback of two different discriminators (one inspects image-depth pairs and the other inspects the depth channel only) to the generator as a unified loss, which makes effective use of unlabeled RGB images. The experimental results on the NYU Depth, Make3D and KITTI datasets demonstrate that our method with the proposed GAN loss achieves better quantitative and qualitative results. Moreover, our method generalizes well to both indoor and outdoor scenes. Our model also has less demand for labeled data and can easily adopt various depth estimation models as the generator, and is therefore more flexible and robust in real-world applications.

Fig. 12. More qualitative results on (a) the NYU Depth v2 dataset and (b) the Make3D dataset. The columns from left to right are RGB images, ground-truth depth maps and the depth predicted by our model. Depth values are normalized for visualization.

ACKNOWLEDGMENTS
This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503) and the Nature Science Foundation of China (No. U1705262, No. 61772443, and No. 61572410).

REFERENCES

[1] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[2] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 239–248.
[3] J. Hu, M. Ozay, Y. Zhang, and T. Okatani, "Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries," in IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1043–1051.
[4] E. Prados and O. Faugeras, "Shape from shading," in Handbook of Mathematical Models in Computer Vision. Springer, 2006, pp. 375–388.
[5] J. Aloimonos, "Shape from texture," Biological Cybernetics, vol. 58, no. 5, pp. 345–360, 1988.
[6] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[7] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1. IEEE, 2006, pp. 519–528.
[8] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2006, pp. 1161–1168.
[9] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.
[10] K. Karsch, C. Liu, and S. B. Kang, "Depth extraction from video using non-parametric sampling," in European Conference on Computer Vision. Springer, 2012, pp. 775–788.
[11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
[12] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in CVPR, vol. 2, no. 6, 2017, p. 7.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in CVPR, vol. 2, no. 6, 2017, p. 7.
[14] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[15] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah, "Shape-from-shading: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 8, pp. 690–706, 1999.
[16] T. Nagai, T. Naruse, M. Ikehara, and A. Kurematsu, "HMM-based surface reconstruction from single images," in Image Processing. 2002. Proceedings. 2002 International Conference on, vol. 2. IEEE, 2002, pp. II–II.
[17] T. Hassner and R. Basri, "Example based 3D reconstruction from single 2D images," in Computer Vision and Pattern Recognition Workshop, 2006. CVPRW'06. Conference on. IEEE, 2006, pp. 15–15.
[18] F. Han and S.-C. Zhu, "Bayesian reconstruction of 3D shapes and scenes from a single image," in Higher-Level Knowledge in 3D Modeling and Motion Analysis, 2003. HLK 2003. First IEEE International Workshop on. IEEE, 2003, pp. 12–20.
[19] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, "Image and depth from a conventional camera with a coded aperture," ACM Transactions on Graphics (TOG), vol. 26, no. 3, p. 70, 2007.
[20] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE, 2007, pp. 225–234.
[21] L. Ladicky, J. Shi, and M. Pollefeys, "Pulling things out of perspective," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 89–96.
[22] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
[23] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[24] M. Liu, M. Salzmann, and X. He, "Discrete-continuous depth estimation from a single image," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 716–723.
[25] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. L. Yuille, "Towards unified depth and semantic prediction from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2800–2809.
[26] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, "Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation," in Proceedings of CVPR, 2017.
[27] S. Zhang, N. Li, C. Qiu, Z. Yu, H. Zheng, and B. Zheng, "Depth map prediction from a single image with generative adversarial nets," Multimedia Tools and Applications, pp. 1–18, 2018.
[28] A. Atapour-Abarghouei and T. P. Breckon, "Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2800–2810.
[29] R. Garg, V. K. BG, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," in European Conference on Computer Vision. Springer, 2016, pp. 740–756.
[30] A. CS Kumar, S. M. Bhandarkar, and M. Prasad, "Monocular depth prediction using generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 300–308.
[31] J. Xie, R. Girshick, and A. Farhadi, "Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 842–857.
[32] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun, "Dense monocular depth estimation in complex dynamic scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4058–4066.
[33] Y. Kuznietsov, J. Stückler, and B. Leibe, "Semi-supervised deep learning for monocular depth map prediction," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6647–6655.
[34] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset,"
International Journal of Robotics Research(IJRR) , 2013.[35] T. Nguyen, T. Le, H. Vu, and D. Phung, “Dual discriminatorgenerative adversarial nets,” in
Advances in Neural InformationProcessing Systems , 2017, pp. 2667–2677.[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in
Proceedings of the IEEE conference on computervision and pattern recognition , 2016, pp. 770–778.[37] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-works for semantic segmentation,” in
Proceedings of the IEEEconference on computer vision and pattern recognition , 2015, pp. 3431–3440.[38] S. Xie and Z. Tu, “Holistically-nested edge detection,” in
Proceed-ings of the IEEE international conference on computer vision , 2015, pp.1395–1403.[39] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-imagetranslation with conditional adversarial networks,” arXiv preprint ,2017.[40] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Seman-tic segmentation using adversarial networks,” arXiv preprintarXiv:1611.08408 , 2016.[41] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarialnets,” in
Advances in neural information processing systems , 2014, pp.2672–2680.[42] M. Mirza and S. Osindero, “Conditional generative adversarialnets,” arXiv preprint arXiv:1411.1784 , 2014.[43] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scalevideo prediction beyond mean square error,” arXiv preprintarXiv:1511.05440 , 2015.[44] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor seg-mentation and support inference from rgbd images,” in
EuropeanConference on Computer Vision . Springer, 2012, pp. 746–760.[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al. , “Imagenetlarge scale visual recognition challenge,”
International journal ofcomputer vision , vol. 115, no. 3, pp. 211–252, 2015.[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-tion,” arXiv preprint arXiv:1412.6980 , 2014.[47] A. Roy and S. Todorovic, “Monocular depth estimation usingneural regression forest,” in
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , 2016, pp. 5506–5514.[48] B. Liu, S. Gould, and D. Koller, “Single image depth estimationfrom predicted semantic labels,” in . IEEE, 2010,pp. 1253–1260.[49] K. Karsch, C. Liu, and S. B. Kang, “Depth transfer: Depth extrac-tion from video using non-parametric sampling,”
IEEE transactionson pattern analysis and machine intelligence , vol. 36, no. 11, pp. 2144–2158, 2014.[50] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depthand surface normal estimation from monocular images usingregression on deep features and hierarchical crfs,” in
Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition ,2015, pp. 1119–1127.[51] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from singlemonocular images using deep convolutional neural fields,”
IEEEtransactions on pattern analysis and machine intelligence , vol. 38,no. 10, pp. 2024–2039, 2016.[52] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be-nenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes datasetfor semantic urban scene understanding,” in