Casting Geometric Constraints in Semantic Segmentation as Semi-Supervised Learning
Sinisa Stekovic, Friedrich Fraundorfer, Vincent Lepetit
Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria
Université Paris-Est, École des Ponts ParisTech, Paris, France
{sinisa.stekovic, fraundorfer, lepetit}@icg.tugraz.at

Abstract
We propose a simple yet effective method to learn to segment new indoor scenes from video frames: State-of-the-art methods trained on one dataset, even as large as the SUNRGB-D dataset, can perform poorly when applied to images that are not part of the dataset, because of the dataset bias, a common phenomenon in computer vision. To make semantic segmentation more useful in practice, one can exploit geometric constraints. Our main contribution is to show that these constraints can be cast conveniently as semi-supervised terms, which enforce the fact that the same class should be predicted for the projections of the same 3D location in different images. This is interesting as we can exploit general existing techniques developed for semi-supervised learning to efficiently incorporate the constraints. We show that this approach can efficiently and accurately learn to segment target sequences of ScanNet and our own target sequences using only annotations from SUNRGB-D, and geometric relations between the video frames of target sequences.
1. Introduction
Semantic segmentation of images provides high-level understanding of a scene, useful for many applications such as robotics and augmented reality. Recent approaches can perform very well [23, 15, 45, 5]. In practice, however, it is difficult to generalize from existing datasets to new scenes. In other words, it is a challenging task to obtain good segmentations of images that do not belong to the training datasets. To demonstrate this, we trained a state-of-the-art segmentation method, DeepLabV3+ [5], on the SUNRGB-D dataset [37], which is made of more than 10,000 training images of indoor scenes. Fig. 1a shows the segmentation we obtain when we attempt to segment a new image, which does not belong to the dataset. The performance is clearly poor.
Figure 1: (a) Even the state-of-the-art method DeepLabV3+, trained with training data from SUNRGB-D, makes many mistakes when segmenting an image outside the SUNRGB-D dataset. (b) After exploiting geometric constraints on an unlabeled sequence of the new scene, our semi-supervised S4-Net approach predicts much better segmentations.

This shows that the SUNRGB-D dataset was not sufficient to generalize to this image, despite the size of the training dataset. To make semantic segmentation more practical and to break this dataset bias, one can exploit geometric constraints [25, 31, 27, 14], in addition to readily available training data such as the SUNRGB-D dataset. We introduce an efficient formalization of this approach, which relies on the observation that geometric constraints can be introduced as standard terms from the semi-supervised learning literature. This results in an elegant, simple, and powerful method that can learn to segment new environments from video frames, which makes it very useful for applications such as robotics and augmented reality.

More exactly, we adapt a general technique for semi-supervised learning that consists of adding constraints on pairs of unlabeled training samples that are close to each other in the feature space, to enforce the fact that two such samples should belong to the same category [22, 39, 2]. This is very close to what we want to do when enforcing geometric constraints for semantic segmentation: Pairs of unlabeled pixels that correspond to the same physical 3D point should be labeled with the same category. In practice, to obtain the geometric information needed to enforce the constraints, we can rely on measurements from depth sensors, or train a network to predict depth maps as well, using recent techniques for monocular image reconstruction.

In contrast to previous methods exploiting geometric constraints for semantic segmentation, our method introduces several novelties. Compared to [25], our approach applies geometric constraints to completely unlabeled scenes. Furthermore, when compared to [31, 27, 14], which use simple label fusion to segment a given target sequence, our approach can generalize from a representation of one target sequence from the target scene to segmenting unseen images of the target scene. We demonstrate this aspect further in the evaluation section.

In short, our contribution is to show that semi-supervised learning is a simple yet principled and powerful way to exploit geometric constraints in learning semantic segmentation. We demonstrate this by learning to annotate sequences of the ScanNet [7] dataset using only annotations from the SUNRGB-D dataset. We also demonstrate the effectiveness of the proposed method through the semantic labeling of our own newly generated sequence, unrelated to SUNRGB-D and ScanNet.

In the rest of the paper, we discuss related work, describe our approach, and present its evaluation with quantitative and qualitative experiments together with an ablation study.
2. Related Work
In this section, we discuss related work on the aspects of semantic segmentation, domain adaptation, and general semi-supervised learning, and also recent methods for learning depth prediction from single images, as they also exploit geometric constraints similar to our approach. Finally, we discuss similarities and differences with other works that also combine segmentation and geometry.
The introduction of deep learning has had a large impact on the performance of semantic segmentation. Fully Convolutional Networks (FCNs) [23] allow segmentation prediction for input of arbitrary size. In this setting, standard image classification networks [36, 16] can be used by transforming fully-connected layers into convolutional ones. FCNs use deconvolutional layers that learn the interpolation for the upsampling process. Other works including SegNet [3] and U-Net [33] rely on similar architectures. Such works have been applied to a variety of segmentation tasks [33, 1, 29].

Recent methods address the problem of utilizing global context information for semantic segmentation. PSPNet [45] proposes to capture global context information through a pyramid pooling module that combines features under four different pyramid scales. DeepLabV3+ [5] uses atrous convolutions to control the resolution of feature responses and applies atrous spatial pyramid pooling for segmenting objects at multiple scales. In our experiments, we apply our approach to both DeepLabV3+ and PSPNet to demonstrate that it generalizes to different network architectures. In principle, any other architecture could be used instead.
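As a toy illustration of this fully-connected-to-convolutional conversion (our sketch; the layer sizes and the 21-class head are illustrative, not tied to any particular network), a classifier head can be rewritten as an equivalent convolution so that the network accepts inputs of arbitrary size:

```python
import torch
import torch.nn as nn

# A fully-connected classifier head over 512 x 7 x 7 features ...
fc = nn.Linear(512 * 7 * 7, 21)
# ... is equivalent to a 7x7 convolution with the same weights, so the
# network can now ingest larger inputs and output a coarse score map.
conv = nn.Conv2d(512, 21, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(21, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

# On a 7x7 feature map the two heads agree exactly (up to float error).
x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)
```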
Availability of ground truth labels is often the main limitation of supervised methods in practice. In contrast, semi-supervised learning is a general approach aiming at exploiting both labeled and unlabeled or weakly labeled training data. Some approaches rely on adversarial networks to measure the quality of unlabeled data [8, 10, 21, 17]. More in line with our work are the popular consistency-based models [22, 39, 2]. These methods enforce the model output to be consistent under small input perturbations. As explained in [2], consistency-based models can be viewed as a student-teacher model: To measure the consistency of a model f, or the student, its predictions are compared to the predictions of a teacher model g, a different trained model, while at the same time applying small input perturbations.

The Π-model [22] is a recent consistency-based method where the student is its own teacher, i.e. f = g. It relies on a cross-entropy loss term applied to labeled data only, and an additional term that penalizes differences in predictions for small perturbations of the input data (see the sketch at the end of this subsection). Our semi-supervised approach is closely related to the Π-model but relies on geometric consistency instead of enforcing consistent predictions for different input perturbations.

As pixel-level annotations, required for semantic segmentation tasks, are typically very time consuming to obtain, weakly-supervised methods become very interesting options for further increasing the amount of training samples. One way of obtaining more training data is through image-level annotations or bounding boxes. [30] demonstrates that a network trained with a large number of such weakly-supervised samples, in combination with a small amount of samples with pixel-level annotations, achieves comparable results to a fully supervised approach. Given image-level annotations rather than pixel-level annotations, [41] generates dense object localization maps which are then utilized in a weakly- or semi-supervised framework to learn semantic segmentation. Our geometric constraints can be seen as a form of weak supervision, but instead of weak labels our approach relies only on weak constraints enforcing consistent annotations for 3D points of the scene.

Domain adaptation has been studied for the field of semantic segmentation. One can argue that overcoming the dataset bias is closely related to the field of domain adaptation. In the context of semantic segmentation, such approaches usually leverage synthetic datasets, which are inexpensive to generate at scale, for improving performance on real data [24, 28, 38, 4]. However, as further explained in [38], due to the large domain gap between real and synthetic images, such domain adaptation methods easily overfit to synthetic data and can fail to generalize to real images.

Very recently, in terms of domain adaptation approaches that rely on real data only, Kalluri et al. [20] proposed a unified segmentation model for different target domains that minimizes a supervised loss for labeled data of the target domains and exploits visual similarity between unlabeled data of the domains. Results indicate an increase in performance for all of the target domains. However, such an approach still requires labeled images for all of the target domains. Here, we focus on adapting the source domain to a related target domain for which no labeling is available.
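To make the consistency-based models discussed above concrete, the following toy sketch (ours; the Gaussian input noise, the MSE consistency term, and the weight w are illustrative choices in the spirit of [22]) shows a Π-model-style loss in PyTorch:

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, x_labeled, y_labeled, x_unlabeled, w=1.0):
    # Supervised cross-entropy on the labeled samples only.
    loss_sup = F.cross_entropy(model(x_labeled), y_labeled)
    # Two forward passes on the same unlabeled samples under different
    # small input perturbations; the "teacher" pass gets no gradient.
    out_student = model(x_unlabeled + 0.1 * torch.randn_like(x_unlabeled))
    with torch.no_grad():
        out_teacher = model(x_unlabeled + 0.1 * torch.randn_like(x_unlabeled))
    # Penalize disagreement between the student and teacher predictions.
    loss_cons = F.mse_loss(F.softmax(out_student, dim=1),
                           F.softmax(out_teacher, dim=1))
    return loss_sup + w * loss_cons
```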
Geometry in semantic segmentation has already been considered for the purpose of semantic mapping. [25] trains a CNN by propagating manually labeled segmentations of frames to new frames by warping. In contrast, we do not need any manual annotations for the target sequences of the scene. SemanticFusion [27] uses a pre-trained CNN together with the ElasticFusion SLAM method [42], and merges multiple segmentation predictions from different viewpoints. [31, 14] rely on using a pre-trained CNN together with 3D reconstruction methods and improve accuracy over initial segmentations. However, these approaches are applied to a CNN with fixed parameters and rely on geometric constraints at inference time. In contrast, our method uses geometric constraints to improve single-view segmentation predictions for the target scene and afterwards requires only color information for segmenting unseen images of the scene.
Because of view warping, our approach is also related to recent work on unsupervised single-view depth estimation. Both Zhou et al. [46] and Godard et al. [12] proposed an unsupervised approach for learning depth estimation from video data. This is done by learning to predict a depth map so that a view can be warped into another one. This research direction quickly became popular, and has since been extended by many authors [44, 26, 40, 13].

Our work is related to these methods as it also introduces constraints between multiple views, by using warping. We demonstrate that this type of constraint can be utilized for the task of semantic segmentation.
Figure 2: Method overview. (a) Our S4-Net approach combines supervised data from SUNRGB-D and an image sequence from a target scene without any annotations. By exploiting geometric constraints of the target image sequence, we obtain a network with high performance for the target scene, and labels for the target sequence. (b) After being trained by S4-Net, the segmentation network can be applied to unseen images of the target scene with much better performance.
3. Approach Overview
For the rest of the paper, we refer to our Semi-Supervised method for Semantic Segmentation as S4-Net. We assume that we are given a dataset of color images and their segmentations:

\[ \mathcal{S} = \{ e_i = (I_i, A_i) \}_i , \]

where \(I_i\) is a color image and \(A_i\) is the corresponding ground truth segmentation. In practice we use the SUNRGB-D dataset. Based on these annotations, we would like to train a segmentation model \(f(\cdot)\) for a new scene given a sequence of registered frames, for which no labels are known a priori:

\[ \mathcal{U} = \{ e_j = (I_j, D_j, T_j) \}_j , \]

where \(I_j\) is a color image, \(D_j\) is the corresponding depth map, and \(T_j\) the corresponding camera pose. As a direct result of S4-Net, we obtain automatic annotations for the sequence \(\mathcal{U}\). Additionally, the output of S4-Net is a trained network \(f(\cdot)\). At test time, the network \(f(\cdot)\) trained with S4-Net can be used to predict correct segmentations for new images of the scene. We present the method overview in Fig. 2.

3.1. Semi-Supervised Learning and Geometric Consistency

We optimize the parameters \(\Theta\) of \(f(\cdot)\) by minimizing the semi-supervised loss term:

\[ L = L_S + \lambda L_G , \quad (1) \]

where \(L_S\) is a supervised loss term and \(L_G\) is a term that exploits geometric constraints. In practice, we set the discount factor \(\lambda\) to the same value for all experiments. \(L_S\) is a standard term for supervised learning of semantic segmentation:

\[ L_S = \sum_{e \in \mathcal{S}} l_{\mathrm{WCE}}\big( f(I(e); \Theta),\, A(e) \big) , \quad (2) \]

where \(l_{\mathrm{WCE}}\) is the weighted cross-entropy of the segmentation prediction \(f(I; \Theta)\) relative to the manual annotation \(A\). The class weights are calculated using median frequency balancing [11] to prevent overfitting to the most common classes. \(L_G\) exploits geometric constraints to enforce consistency between predictions for images taken from different viewpoints:

\[ L_G = \sum_{e \in \mathcal{U}} l_{\mathrm{CE}}\Big( f(I(e); \Theta),\; \operatorname*{Merge}_{e' \in N(e)}\big( \mathrm{Warp}_{e' \to e}( f(I(e'); \Theta') ) \big) \Big) , \quad (3) \]

where \(N(e)\) is a subset of \(\mathcal{U}\) containing samples with a viewpoint that overlaps with the viewpoint of \(e\). The function \(\mathrm{Warp}_{e' \to e}(S)\) warps segmentation \(S\) from frame \(e'\) to frame \(e\). We give more details on this warp operation in Section 3.2. The Merge function merges the given neighbouring views by first summing the pixelwise probabilities and then performing an argmax operation to obtain the final pixelwise labels.

We consider the prediction \(f(I(e'); \Theta')\) as a teacher prediction and, similarly to the Π-model [22], it is treated as a constant when calculating the update of the network parameters. The parameters \(\Theta'\) are updated at a fixed interval of iterations to equal the parameters \(\Theta\). We found that this step helps to further stabilize the learning process. \(l_{\mathrm{CE}}\) is the standard cross-entropy loss function that compares the predicted segmentations. We found empirically that using weighted cross-entropy here tends to converge to solutions with incorrect segmentations.

3.2. Warping Between Views

We base our warping function \(\mathrm{Warp}_{e' \to e}\) on the inverse warping method used in [46]. For a 2D location \(p\) in homogeneous coordinates of a target sample \(e\), we find the corresponding location \(p'\) in the source sample \(e'\) using:

\[ p' = K\, T_{e \to e'}\, d\, K^{-1}\, p , \quad (4) \]

where \(K\) is the intrinsic matrix of the camera, \(T_{e \to e'}\) is the relative transformation matrix between the target and the source samples, and \(d\) is the predicted depth value at location \(p\).
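To make Eq. (4) concrete, below is a minimal PyTorch sketch of this inverse warping of segmentation probabilities, including the bilinear sampling discussed next. It is an illustration under simple pinhole assumptions, not the authors' implementation, and all tensor names are ours:

```python
import torch
import torch.nn.functional as F

def warp_probs(probs_src, depth_tgt, T_tgt_to_src, K):
    """Warp source class probabilities (B, C, H, W) into the target view.

    depth_tgt:    (B, 1, H, W) depth of the target view.
    T_tgt_to_src: (B, 4, 4) relative pose from target to source camera.
    K:            (B, 3, 3) camera intrinsics.
    """
    B, C, H, W = probs_src.shape
    device = probs_src.device

    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    v, u = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).view(3, -1)

    # Back-project to 3D in the target camera: X = d * K^{-1} p.
    cam = torch.inverse(K) @ pix.unsqueeze(0).expand(B, -1, -1)  # (B, 3, H*W)
    cam = cam * depth_tgt.view(B, 1, -1)

    # Transform into the source camera and project: p' = K T X.
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src = K @ (T_tgt_to_src @ cam_h)[:, :3]                      # (B, 3, H*W)
    uv = src[:, :2] / src[:, 2:].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample the source probabilities.
    uv = uv.view(B, 2, H, W).permute(0, 2, 3, 1)
    grid = torch.stack([2 * uv[..., 0] / (W - 1) - 1,
                        2 * uv[..., 1] / (H - 1) - 1], dim=-1)
    return F.grid_sample(probs_src, grid, mode="bilinear", align_corners=True)
```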
Since \(p'\) in general lies between discrete image locations, we use the differentiable bilinear interpolation from [18] to compute the final projected value from the 4 neighbouring pixels. This transformation is applied to the segmentation probabilities predicted by the network.

In practice, not every pixel in the target sample has a corresponding pixel in the source sample. This can happen because depth information is not necessarily available for every pixel when using depth cameras, and because some pixels in the target sample may not be visible in the source sample, either since they are occluded or, simply, since they are not in the field of view of the source sample. If the difference between the depths is larger than a threshold value, this means that the pixel is occluded and does not correspond to the same physical 3D point. We simply ignore the pixels without correspondents in the loss function. Additionally, we ignore the pixels that are located near the edges of the predicted segments: Segmentation predictions in these regions tend to be less reliable and, for such regions, insignificant errors in one view can easily induce significant errors in other views because of the different perspectives, as shown in Fig. 3.

For sequences captured with an RGB camera, depth data is not available, and we rely on predicted depths to enforce the geometric constraints. If the supervised dataset \(\mathcal{S}\) also includes ground truth depths, we can introduce additional loss terms to learn depth estimation:

\[ L_D = L_{DS} + \lambda_D L_{DG} , \quad (5) \]

where \(L_{DS}\) is a supervised depth loss term, \(L_{DG}\) is a semi-supervised term that exploits geometric constraints in the depth domain, and \(\lambda_D\) is a weighting factor. \(L_{DS}\) is an absolute difference loss term:

\[ L_{DS} = \sum_{e \in \mathcal{S}} \big| f_d(I(e); \Theta_d) - \hat{D}(e) \big| , \quad (6) \]

where \(f_d(I; \Theta_d)\) is the depth prediction for network parameters \(\Theta_d\) and \(\hat{D}\) is the ground truth depth map. The term \(L_{DG}\) corrects noisy depth predictions for the target scene \(\mathcal{U}\) through geometric constraints only:

\[ L_{DG} = \sum_{e \in \mathcal{U}} \sum_{e' \in N(e)} L_{\mathrm{INT}}\big( I(e),\, \mathrm{Warp}_{e' \to e}(I(e')) \big) , \quad (7) \]

where \(L_{\mathrm{INT}}\) is a loss term comparing pixelwise intensities together with the structure similarity loss term from recent literature on monocular depth prediction [46, 12, 44, 26, 40, 13]. We apply this term only to the target image pixels where the predicted segmentations are consistent with each other and further away from segmentation borders.
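The following sketch illustrates how such a validity mask could be computed. We assume z_expected is the depth of each target pixel expressed in the source camera (the depth component obtained during the warp of Eq. (4)) and depth_src is the source depth map sampled at \(p'\); the threshold and border size are illustrative placeholders, not the paper's values:

```python
import torch
import torch.nn.functional as F

def geometric_mask(z_expected, depth_src, probs_tgt, thresh=0.1, border=5):
    """Boolean mask (B, 1, H, W): True where the geometric term is applied."""
    valid = (z_expected > 0) & (depth_src > 0)       # a correspondent exists
    # Occlusion check: expected and observed depths must agree.
    valid &= (z_expected - depth_src).abs() < thresh
    # Ignore pixels near predicted segment boundaries: a pixel is kept only
    # if its whole (border x border) neighbourhood has one predicted class.
    labels = probs_tgt.argmax(dim=1, keepdim=True).float()
    lo = -F.max_pool2d(-labels, border, stride=1, padding=border // 2)
    hi = F.max_pool2d(labels, border, stride=1, padding=border // 2)
    return valid & (lo == hi)
```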
Figure 3: Process of warping the source segmentation prediction in (a) to the corresponding target view. The encircled region in (b) demonstrates that warping in boundary regions of segmentation predictions can induce errors in the target view. In this case, the warped segmentation prediction falsely assigns wall labels to the edges of the bed region. Hence, we introduce a segmentation boundary mask to resolve this issue in (c). As a direct result, S4-Net is able to recover quality segmentations in the affected boundary regions in (d).

We found that such a mask helps regularize depth predictions for occluded regions of the image. We explain this further in the supplementary material. For enforcing geometric constraints on semantic segmentation with the term \(L_G\), we consider the depth prediction as a constant when calculating the update of the network parameters.

We use DeepLabV3+ [5] or PSPNet [45] as the network \(f(\cdot)\) in our experiments. In both cases, as the base network, we use ResNet-50 [16] pre-trained on the ImageNet classification task [9]. However, S4-Net is not restricted to a specific type of architecture and could be applied to other architectures as well. When predicting depth maps, the encoder is shared between the depth network and the segmentation network. The depth decoder has the same architecture as the segmentation decoder, but they do not share any parameters. We give further details on network initialization and the training procedure in the supplementary material.
4. Evaluation
We evaluate S4-Net on the task of learning to predict semantic segmentation from color images for a target scene. The task for the network is to learn segmentations for a target scene without any knowledge of the ground truth labels for the scene. Hence, S4-Net requires an uncorrelated annotated dataset to obtain prior knowledge about the segmentation task that it needs to perform. Additionally, in order to learn accurate segmentations for the target scene, it utilizes frame sequences of that scene. By exploiting geometric constraints between the frames, S4-Net learns to predict segmentations across the target scene.
Datasets.
In all of our evaluations, we use the SUNRGB-D dataset [37], consisting of annotated RGB-D training images, as the supervised dataset S, and perform mirroring on these samples to augment the supervised data. The SUNRGB-D dataset is a collection made of an original dataset and additional previously published datasets [35, 19, 43]. The images are manually segmented into object categories that are typical for an indoor scenario. The full list of object categories is given in the supplementary material.

First, we evaluate S4-Net on scenes from the ScanNet dataset [7]. Second, we show that S4-Net is general by applying it to our own data. Finally, we show that enforcing geometric constraints through depth predictions can be used to learn quality segmentation predictions for the target scene.

As previously discussed, we use SUNRGB-D for the supervised training data only. Therefore, for our first experimental setup, we evaluate S4-Net on six scenes from the ScanNet dataset [7] to demonstrate the generalization aspect of S4-Net, as it is the scenario that motivates our work. These scenes represent different indoor scenarios, including apartment, hotel room, public bathroom, large kitchen space, lounge area, and study room. Even though the RGB-D sequences in the ScanNet dataset are annotated and can be mapped to our desired segmentation task, we utilize these annotations only to validate our results.

Data Split.
Intentionally, we choose scenes from the ScanNet dataset which were scanned twice during the creation of the dataset. The first scan of each scene is utilized during training while the second scan is used for validation purposes only. We refer to these independently recorded scans as "Scan 1" and "Scan 2". During training, we use the registered RGB-D sequence from "Scan 1" when applying geometric constraints. For evaluation, we additionally validate performance on "Scan 2", for which the camera follows a different pathway.

In this experiment, we show that our network, trained with geometric constraints from a target scene of ScanNet and supervised data from SUNRGB-D, notably improves performance over our supervised baseline for the target scene.
                           "Scan 1" (ScanNet)                   "Scan 2" (ScanNet)
                           (Unlabeled images during training)   (Excluded from training)
                           pix acc  mean acc  mIOU   fwIOU      pix acc  mean acc  mIOU   fwIOU

DeepLabV3+ network architecture
Supervised baseline        0.765    0.634     0.533  0.692      0.772    0.651     0.544  -
S4-Net                     -        -         -      -          -        -         -      -
S4-Net with depth pred.    0.794    0.679     0.581  0.724      0.797    0.684     0.586  -

PSPNet network architecture
Supervised baseline        0.727    0.597     0.486  0.644      0.737    0.610     0.499  -
S4-Net                     -        -         -      -          -        -         -      -

Table 1: Quantitative evaluation on the target scenes from ScanNet. We include results averaged over the target scenes used during the experiments. The results for "Scan 1" show significant improvements for the images where we applied S4-Net. Furthermore, the segmentation accuracy for "Scan 2" indicates that our trained network brings similar improvements over the supervised baseline for the images that were not utilized by S4-Net during training. Our experiments also show that S4-Net can be applied to different segmentation networks, with a significant gain in accuracy in comparison to the supervised baselines, and the results with depth prediction show a comparable increase in performance over the supervised baseline.

Figure 4: Qualitative results on unseen images from "Scan 2" of the target scenes from ScanNet, DeepLabV3+ network architecture. As S4-Net does not rely on manual annotations for the target scenes, it predicts segmentations that are sometimes more accurate than the manual annotations. More specifically, it correctly segments the otherwise unlabelled table and box regions in (a) and (c), and in regions with wrong manual annotations it correctly predicts paper segments in (b) and (c).

This is true for all of our experimental scenes in ScanNet in regions where our supervised baseline already provides a certain level of generalization for some of the viewpoints. For "Scan 1" we measure a significant increase in performance. To further demonstrate different use cases of S4-Net, we show that the network fine-tuned for "Scan 1" predicts high-grade segmentations also for "Scan 2" of the scene. The two scans are recorded independently, which results in different camera paths for the recordings. As there are no direct neighboring frames between the two independently recorded scans, the results on "Scan 2" demonstrate the ability of the S4-Net trained network to generalize to independently recorded scans of the same scene.

In our quantitative evaluations in Table 1, we present results averaged over all of our experimental scenes. We observe that the S4-Net approach clearly overcomes the dataset bias of the supervised approach, as it demonstrates superior performance on the target scene in comparison to its supervised baseline. Not only does the performance increase for "Scan 1", but we also observe that the increase in performance for the images of "Scan 2" is just as significant. Furthermore, by observing the performance for different network architectures, we show that S4-Net can be applied to arbitrary segmentation network architectures.

We further demonstrate the benefits of our approach in Fig. 4, where we show some qualitative results. We observe that, in areas where the supervised approach makes very noisy predictions, our approach predicts consistent segmentations. This is an indicator that confident segmentation predictions are propagated to less confident viewpoints, and not the other way around.

Our experiments on ScanNet demonstrate that S4-Net is useful for different practical applications. First, the evaluations on "Scan 1" show that the approach is applicable for the use case of automatically labelling indoor scenes. The second application is that, once the network has converged for the target sequence, we can reliably segment new images of the scene without the need for depth data.
So far we have demonstrated that S4-Net works well for the chosen scenes from the ScanNet dataset. To show that S4-Net generalizes well, we also evaluate it on our own data. For this purpose, we captured a living room area using an Intel RealSense D400 series depth camera and registered the scene using an implementation of a scene reconstruction pipeline [6] from the Open3D library [32]. We refer to this scene as the "Room" dataset.

Data Split. In line with the ScanNet experiments, we scanned the "Room" scene twice. "Scan 1" provides the training images. For evaluation purposes, we then sampled images from "Scan 1" that capture different viewpoints of the scene, and we manually annotated them using the LabelMe annotation tool [34]. We annotated additional images from "Scan 2", which was recorded independently of "Scan 1", to further demonstrate the aspect of generalization across the scene.

Figure 5: Qualitative evaluation on unseen images from "Scan 2" of the "Room" scene for the DeepLabV3+ network architecture. The supervised baseline already predicts high quality segmentations across the scene. However, the supervised baseline still predicts noisy or incorrect segmentations for some viewpoints, especially due to partial visibility of objects, for example the sofa in (b) and the table in (c). S4-Net demonstrates notable improvements in these regions.

Table 2 gives the results of our quantitative evaluation. We again observe a significant increase in performance over the supervised baseline approach for images of "Scan 1" and "Scan 2". Our qualitative evaluations in Fig. 5 show many overall improvements. Even though our supervised baseline might predict quality segmentations for specific viewpoints, for other viewpoints it fails completely, as these data samples are not represented well throughout the SUNRGB-D dataset. In contrast, the S4-Net approach preserves quality segmentations in such regions. This further proves that the usage of geometric constraints is, indeed, a very powerful method for transferring knowledge from the supervised baseline to a new scene.

https://software.intel.com/en-us/realsense/d400
Furthermore, we evaluate the aspect of using depth predictions for enforcing geometric constraints. As SUNRGB-D also contains ground truth depth data, it provides supervision for both the depth network and the segmentation network in this scenario. When enforcing geometric constraints for the target scenes, warping between different viewpoints is performed by using depth predictions instead of the ground truth depth images. For this experiment, we empirically found a setting of λ_D that achieved satisfying quality for both segmentation and depth predictions. In the case of PSPNet, due to the low accuracy of the initial depth predictions, we excluded this part from our evaluations.

Our quantitative evaluations in Table 1 demonstrate performance comparable to S4-Net with ground truth depth for the ScanNet scenes. Similarly, in Table 2 we observe that S4-Net with depth predictions shows significant improvements for the "Room" scene in comparison to the supervised baseline.

                           "Scan 1" ("Room")                    "Scan 2" ("Room")
                           (Unlabeled images during training)   (Excluded from training)
                           pix acc  mean acc  mIOU   fwIOU      pix acc  mean acc  mIOU   fwIOU

DeepLabV3+ network architecture
Supervised baseline        0.890    0.757     0.699  0.827      0.817    0.726     0.660  -
S4-Net                     -        -         -      -          -        -         -      -
S4-Net with depth pred.    0.922    0.801     0.757  0.868      0.911    0.755     0.712  -

PSPNet network architecture
Supervised baseline        0.862    0.673     0.597  0.788      0.728    0.629     0.497  -
S4-Net                     0.888    0.723     0.645  0.817      0.847    0.708     0.604  -

Table 2: Quantitative evaluation on the "Room" scene. Similarly to our experiments on ScanNet, S4-Net demonstrates a significant performance increase for both the images from "Scan 1" and "Scan 2" of the scene. We observe improvements for different network architectures and also when using S4-Net with the depth prediction network.

In our qualitative results in Fig. 6, we visualize the results of S4-Net with depth predictions for the ScanNet scenes. Even though one would expect that noisy depth predictions considerably decrease the quality of the geometric constraints, S4-Net still demonstrates quality improvements in this scenario that are comparable to our results when using ground truth depth for enforcing geometric constraints. Even though the initial depth predictions of the supervised baseline are noisy, S4-Net also learns better depth predictions for the target scene. Hence, the geometric constraints on semantic segmentation improve during training, enabling convergence for S4-Net. For the unseen "Scan 2" sequences from ScanNet, the Root Mean Square (RMS) error drops from 0.612 to 0.404 on average after applying S4-Net. For the "Scan 2" images from the "Room" scene, the average RMS error drops from 0.584 to 0.495. We show further quantitative and qualitative results on depth predictions in the supplementary material.
5. Conclusion
We showed that semi-supervised learning is a good theoretical framework for enforcing geometric constraints and for adapting semantic segmentation to new scenes. We also investigated a potential problem which could appear with such semi-supervised constraints on non-annotated sequences: the learning might assign labels which are consistent among views, but wrong. Our experiments have shown that this is only very rarely the case. Instead, the semi-supervised constraints yield significant improvements, without the need for additional manual labels. This is possible because the network can learn to propagate labels from locations where it is confident to more difficult locations.

Figure 6: Qualitative results of S4-Net with depth predictions on unseen images from "Scan 2" of the target scenes from ScanNet. Similarly to our previous observations, S4-Net predicts quality segmentations in many regions which are noisy, wrongly labeled, or unlabeled in the manual annotations.

In summary, our S4-Net approach yields quality labels across given target sequences, which makes it very interesting for the task of sequence labelling. The segmentation network trained with S4-Net also generalizes nicely to unseen images of the target scene. This makes our approach useful for applications relying on semantic segmentation, for example in robotics and augmented reality. Finally, we have shown that the attractive idea of enforcing geometric constraints by means of depth predictions produces satisfying segmentations and achieves accuracy that is comparable to the accuracy obtained when using ground truth depth information.
Acknowledgment
This work was supported by the Christian Doppler Laboratory for Semantic 3D Computer Vision, funded in part by Qualcomm Inc.

References

[1] A. Armagan, M. Hirzer, P. M. Roth, and V. Lepetit. Learning to Align Semantic Segmentation and 2.5D Maps for Geolocalization. In CVPR, 2017.
[2] B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. ICLR, 2019.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. TPAMI, 2017.
[4] W.-L. Chang, H.-P. Wang, W.-H. Peng, and W.-C. Chiu. All about Structure: Adapting Structural Information across Domains for Boosting Semantic Segmentation. In CVPR, 2019.
[5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV, 2018.
[6] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust Reconstruction of Indoor Scenes. In CVPR, 2015.
[7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Niessner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In CVPR, 2017.
[8] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. Salakhutdinov. Good Semi-supervised Learning That Requires a Bad GAN. In NIPS, 2017.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[10] C. N. dos Santos, K. Wadhawan, and B. Zhou. Learning Loss Functions for Semi-Supervised Learning via Discriminative Adversarial Networks. arXiv preprint arXiv:1707.02198, 2017.
[11] D. Eigen and R. Fergus. Predicting Depth, Surface Normals and Semantic Labels With a Common Multi-Scale Convolutional Architecture. In ICCV, 2015.
[12] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In CVPR, 2017.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Digging Into Self-Supervised Monocular Depth Estimation. arXiv preprint arXiv:1806.01260, 2018.
[14] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart, and J. Nieto. Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery. arXiv preprint arXiv:1903.00268, 2019.
[15] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[17] W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang. Adversarial Learning for Semi-Supervised Semantic Segmentation. In BMVC, 2018.
[18] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In NIPS, 2015.
[19] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3D Object Dataset: Putting the Kinect to Work. In Consumer Depth Cameras for Computer Vision, 2013.
[20] T. Kalluri, G. Varma, M. Chandraker, and C. Jawahar. Universal Semi-Supervised Semantic Segmentation. ICCV, 2019.
[21] A. Kumar, P. Sattigeri, and P. T. Fletcher. Improved Semi-Supervised Learning with GANs using Manifold Invariances. In NIPS, 2017.
[22] S. Laine and T. Aila. Temporal Ensembling for Semi-Supervised Learning. In ICLR, 2017.
[23] J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In CVPR, 2015.
[24] Y. Luo, P. Liu, T. Guan, J. Yu, and Y. Yang. Significance-Aware Information Bottleneck for Domain Adaptive Semantic Segmentation. arXiv preprint arXiv:1904.00876, 2019.
[25] L. Ma, J. Stückler, C. Kerl, and D. Cremers. Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras. In IROS, 2017.
[26] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In CVPR, 2018.
[27] J. McCormac, A. Handa, A. Davison, and S. Leutenegger. SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks. In ICRA, 2017.
[28] U. Michieli, M. Biasetton, G. Agresti, and P. Zanuttigh. Adversarial Learning and Self-Teaching Techniques for Domain Adaptation in Semantic Segmentation. arXiv preprint arXiv:1909.00781, 2019.
[29] A. Milioto, P. Lottes, and C. Stachniss. Real-Time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs. In ICRA, 2018.
[30] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation. In ICCV, 2015.
[31] Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung. Real-Time Progressive 3D Semantic Segmentation for Indoor Scenes. In WACV, 2019.
[32] Q.-Y. Zhou, J. Park, and V. Koltun. Open3D: A Modern Library for 3D Data Processing. arXiv:1801.09847, 2018.
[33] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
[34] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A Database and Web-Based Tool for Image Annotation. IJCV, 2008.
[35] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor Segmentation and Support Inference from RGBD Images. In ECCV, 2012.
[36] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
[37] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. In CVPR, 2015.
[38] R. Sun, X. Zhu, C. Wu, C. Huang, J. Shi, and L. Ma. Not All Areas Are Equal: Transfer Learning for Semantic Segmentation via Hierarchical Region Selection. In CVPR, 2019.
[39] A. Tarvainen and H. Valpola. Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results. In NIPS, 2017.
[40] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning Depth from Monocular Videos using Direct Methods. In CVPR, 2018.
[41] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation. In CVPR, 2018.
[42] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-Time Dense SLAM and Light Source Estimation. IJRR, 2016.
[43] J. Xiao, A. Owens, and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels. In ICCV, 2013.
[44] Z. Yin and J. Shi. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR, 2018.
[45] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In CVPR, 2017.
[46] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR, 2017.

Casting Geometric Constraints in Semantic Segmentation as Semi-Supervised Learning - Supplementary Material
Sinisa Stekovic, Friedrich Fraundorfer, Vincent Lepetit
Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria
Université Paris-Est, École des Ponts ParisTech, Paris, France
{sinisa.stekovic, fraundorfer, lepetit}@icg.tugraz.at
1. Temporal Constraints for Semantic Segmentation
As shown in Figure 1, even though comparing predictions of noisy representations of input images has been shown to provide powerful temporal constraints in semi-supervised learning for image classification problems [9, 11, 1], such constraints are dangerous for semantic segmentation, as they induce significant prediction error at the pixelwise level. In comparison, our geometric constraints show notably reduced prediction noise, which significantly improves the probability of convergence for S4-Net.
2. Network Initialization and Training
Initialization.
To initialize the network f(·; Θ), we first train it only based on the L_S loss term. This avoids the problem of converging to a bad local minimum introduced by the term L_G. As is the case with other consistency-based models, minimizing L_G may fall into a solution where a single class is predicted for all the image locations. Even though tuning the hyper-parameter λ more carefully might resolve this problem, we noticed that using this pre-training step makes the convergence to a correct model much easier.

When predicting depth maps, the encoder is shared between the depth network and the segmentation network. The depth decoder has the same architecture as the segmentation decoder, but they do not share any parameters. To initialize the networks f(·; Θ) and f_d(·; Θ_d), we train them using only the supervised loss terms L_S + L_DS. The full loss term L + L_D is utilized afterwards.
At every iteration, we sample a batch of examples from each of the involved datasets. The input images are resized to a fixed resolution. For better convergence, we pre-train the network using only the L_S term on the SUNRGB-D training set until convergence on the SUNRGB-D test set. We use the Adam optimizer [8] for this step. In our experiments, we refer to this network as the supervised baseline. When learning to segment a target scene with our semi-supervised approach, we fine-tune the network with a lower initial learning rate until no further improvements in performance are notable for the target scene. We found a fixed setting of the parameter λ that balances the loss terms during this stage of training. For the geometric consistency loss term L_G, for a given sample e, we randomly sample neighbouring viewpoints with a minimum overlapping region relative to the viewpoint of e. Through this, we achieve a trade-off between stability of the training process and the computational costs. In the case of PSPNet, all loss terms are applied to the auxiliary prediction branch as well. According to [14], this step improves gradient back-propagation during training. Furthermore, when applying S4-Net with PSPNet, due to very high computational costs, we reduce the batch size for the dataset U.

Figure 1: Semi-supervised terms for semantic segmentation. (a) Simply comparing predictions of noisy representations of input training images does not exploit the geometric constraints of the problem. (b) Merging predictions of noise-added inputs induces further errors in pixelwise segmentation and does not perform well as a temporal constraint. (c) and (d) Merging information from different views is more useful as a temporal constraint and enables convergence for S4-Net, as shown in (e).

Figure 2: Colormap used for visualizing segmentations. Background is not one of the categories, but it is used to visualize regions of images without manual annotations.
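As an illustration of one fine-tuning iteration described above, here is a hedged PyTorch sketch; the helper names, the precomputed merged teacher labels, the validity mask, and the value of lam are assumptions on our part, not the authors' released code:

```python
import torch.nn.functional as F

def finetune_step(model, sup_batch, geo_batch, optimizer, class_weights, lam):
    imgs_s, labels_s = sup_batch      # labeled SUNRGB-D batch
    # Target-scene batch: merged warped teacher labels and a float validity
    # mask. (The teacher parameters that produced pseudo_t are refreshed
    # from the student at a fixed interval, outside of this function.)
    imgs_t, pseudo_t, mask_t = geo_batch
    # Supervised term: weighted cross-entropy (median-frequency weights).
    loss_s = F.cross_entropy(model(imgs_s), labels_s, weight=class_weights)
    # Geometric term: unweighted cross-entropy against the constant merged
    # teacher predictions, restricted to the valid pixels.
    loss_g = F.cross_entropy(model(imgs_t), pseudo_t, reduction="none")
    loss_g = (loss_g * mask_t).sum() / mask_t.sum().clamp(min=1)
    loss = loss_s + lam * loss_g
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```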
3. Evaluation Metrics
For quantitative evaluation, we report common evaluation metrics. For a given image, we compare the segmentation prediction to the corresponding ground truth annotation using the following metrics:

• Pixel accuracy: \( \text{pix acc} = \sum_i n_{ii} \big/ \sum_i t_i \);

• Mean accuracy: \( \text{mean acc} = \frac{1}{n_c} \sum_i n_{ii} / t_i \);

• Mean IOU: \( \text{mIOU} = \frac{1}{n_c} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}} \);

• Frequency weighted IOU: \( \text{fwIOU} = \frac{1}{\sum_k t_k} \sum_i \frac{t_i\, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}} \),

where \(n_{ij}\) is the number of pixels of class \(i\) predicted as class \(j\), \(n_c\) the number of classes, and \(t_i\) the number of pixels that belong to class \(i\). Results are then averaged across the given dataset to measure the performance.
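These four metrics can be computed from a confusion matrix; the following numpy sketch (ours; it assumes every class occurs at least once in the ground truth, so no division by zero occurs) matches the definitions above:

```python
import numpy as np

def segmentation_metrics(n):
    """Metrics from a confusion matrix n where n[i, j] counts pixels of
    ground truth class i predicted as class j."""
    n = n.astype(np.float64)
    t = n.sum(axis=1)                    # t_i: pixels belonging to class i
    n_ii = np.diag(n)                    # correctly classified pixels
    union = t + n.sum(axis=0) - n_ii     # t_i + sum_j n_ji - n_ii
    pix_acc = n_ii.sum() / t.sum()
    mean_acc = np.mean(n_ii / t)
    miou = np.mean(n_ii / union)
    fwiou = (t * n_ii / union).sum() / t.sum()
    return pix_acc, mean_acc, miou, fwiou
```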
4. Failure Cases
As demonstrated in Figure 3, we observe that S4-Net fails when the estimated camera poses are inaccurate for the registered scene. Hence, the performance of S4-Net is tightly related to the performance of the underlying scene registration pipeline.
Figure 3: Failure case of S4-Net due to noisy geometric constraints caused by a high level of reflectivity in scene0011. Even the manual annotations are inaccurate, which indicates high error in the camera poses across the scene.
Figure 4: Even though comparisons with ground truth annotations might indicate wrong segmentation predictions, our segmentation predictions are often visually very appealing. In (a), S4-Net wrongly predicts a window that is actually occluded by a transparent curtain. In (b), both the supervised baseline and S4-Net confuse the night stand with a table. In (c), S4-Net predicts a sofa where the manual annotation suggests a chair, and the actual definition of the category for this object would be a 'single seater sofa'. As this category is not one of the given categories for the task, we find both of these alternatives equally correct.

Furthermore, the categories colormap in Figure 2 clearly reveals many object categories that might be interchangeable with each other. We show in Figure 4 that S4-Net sometimes predicts these 'alternative' categories instead of the ones proposed in the manual annotations.
5. Quantitative Results for Individual Scenes from ScanNet
Table 1 presents details on all of the ScanNet scans that were used for the experiments. Table 2 presents quantitative experiments on the individual scenes from ScanNet. We observe that S4-Net consistently outperforms its supervised baseline for all of the scenes.
Figure 5: When warping the source view in (a) to the target view (d), we notice that occlusions can lead to the association of incorrect correspondences between the two views, as encircled in (b). When learning to predict depth with the L_INT term, this results in errors caused by intensity differences of incorrect correspondences. We resolve this by masking out correspondences with inconsistent segmentation predictions, as shown in (c).
Scene        "Scan 1"  "Scan 2"
scene0000    -         -
scene0006    -         -
scene0009    980       920
scene0011    -         -
scene0022    -         -
scene0030    -         -

Table 1: Number of samples for the utilized scans from ScanNet.
6. Segmentation Mask for Learning Depth Prediction
We observe that the L_INT loss term is very sensitive to occlusions in the scene. Even though existing works utilizing this term for monocular depth estimation achieve good results [15, 5, 13, 10, 12, 6], the baseline between their two camera views is very small. This reduces the negative influence of occlusions between the views and, hence, the network still learns quality depth estimations. As the baseline between the two views increases, occlusions in the scene can have more negative influence on the geometric constraints, as we show in Figure 5b. Hence, in order to deal with occlusions in the image, we apply the L_INT term only to the target image pixels where the predicted segmentations are consistent with each other and further away from segmentation borders. In Figure 5c we show that such a mask successfully masks out the affected regions of the warped image.
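For illustration, a common form of such an intensity-plus-SSIM term with the segmentation-consistency mask applied could look as follows; the 3x3 SSIM window and the 0.85 mixing weight follow common practice in the cited monocular depth literature and are not taken from this paper:

```python
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    # Local statistics over a 3x3 window (kernel 3, stride 1, padding 1).
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
        ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
    return ((1 - s) / 2).clamp(0, 1)   # dissimilarity in [0, 1]

def photometric_loss(img_tgt, img_src_warped, mask, alpha=0.85):
    # Per-pixel error, averaged over color channels, masked by the
    # segmentation-consistency mask (B, 1, H, W).
    l1 = (img_tgt - img_src_warped).abs().mean(dim=1, keepdim=True)
    sim = ssim(img_tgt, img_src_warped).mean(dim=1, keepdim=True)
    err = alpha * sim + (1 - alpha) * l1
    return (err * mask).sum() / mask.sum().clamp(min=1)
```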
7. Evaluation of Depth Predictions
In Table 3 and Figure 6, we demonstrate that fine-tuning S4-Net with depth predictions for the target scene also improves depth estimations across the scene. Although applying a smoothness loss term would further improve our depth estimations [4, 15, 5, 13, 10, 12, 6], the quality of the learned predictions is good enough for enforcing geometric constraints on semantic segmentation.
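The depth metrics reported in Table 3 are the standard ones from the monocular depth literature; a minimal numpy sketch of their computation (ours, evaluating only valid positive-depth pixels) is given below:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over pixels with valid ground truth."""
    mask = gt > 0
    pred, gt = np.maximum(pred[mask], 1e-6), gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```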
                           "Scan 1" (ScanNet)                  "Scan 2" (ScanNet)
                           pix acc  mean acc  mIOU   fwIOU     pix acc  mean acc  mIOU   fwIOU

DeepLabV3+ network architecture
Supervised baseline
scene0000 (Apartment)      0.670    0.505     0.398  0.591     0.709    0.536     0.435  -
scene0006 (Hotel room)     0.751    0.629     0.540  0.683     0.723    0.617     0.511  -
scene0009 (Bathroom)       0.896    0.811     0.730  0.855     0.908    0.851     0.750  -
scene0011 (Kitchen)        0.675    0.532     0.407  0.575     0.716    0.555     0.442  -
scene0022 (Lounge)         0.872    0.751     0.657  0.810     0.829    0.697     0.585  -
scene0030 (Study room)     0.726    0.574     0.467  0.638     0.744    0.647     0.539  -
Average                    0.765    0.634     0.533  0.692     0.772    0.651     0.544  -
S4-Net
scene0000 (Apartment)      0.726    0.565     0.465  0.642     0.752    0.579     0.486  -
scene0006 (Hotel room)     0.816    0.725     0.647  0.767     0.782    0.679     0.599  -
scene0009 (Bathroom)       0.907    0.880     0.785  0.873     0.927    0.909     0.813  -
scene0011 (Kitchen)        0.701    0.553     0.431  0.595     0.741    0.582     0.470  -
scene0022 (Lounge)         0.896    0.805     0.708  0.833     0.831    0.724     0.608  -
scene0030 (Study room)     0.770    0.596     0.505  0.689     0.787    0.674     0.580  -
Average                    -        -         -      -         -        -         -      -
S4-Net with Depth Predictions
scene0000 (Apartment)      0.713    0.538     0.438  0.630     0.746    0.563     0.471  -
scene0006 (Hotel room)     0.803    0.715     0.636  0.754     0.779    0.674     0.595  -
scene0009 (Bathroom)       0.932    0.890     0.804  0.902     0.956    0.924     0.848  -
scene0011 (Kitchen)        0.659    0.545     0.414  0.542     0.699    0.564     0.443  -
scene0022 (Lounge)         -        0.791     0.701  0.842     0.825    0.704     0.590  -
scene0030 (Study room)     0.759    0.593     0.493  0.674     0.775    0.674     0.569  -
Average                    0.794    0.679     0.581  0.724     0.797    0.684     0.586  -

PSPNet network architecture
Supervised baseline
scene0000 (Apartment)      0.618    0.453     0.346  0.534     0.664    0.486     0.380  -
scene0006 (Hotel room)     0.683    0.586     0.472  -         0.670    0.580     0.458  -
scene0009 (Bathroom)       0.888    0.810     0.705  0.832     0.904    0.814     0.725  -
scene0011 (Kitchen)        0.635    0.488     0.362  0.531     0.660    0.498     0.386  -
scene0022 (Lounge)         0.849    0.724     0.618  0.777     0.821    0.685     0.569  -
scene0030 (Study room)     0.686    0.522     0.414  0.591     0.703    0.599     0.478  -
Average                    0.727    0.597     0.486  0.644     0.737    0.610     0.499  -
S4-Net
scene0000 (Apartment)      0.716    0.523     0.420  0.630     0.731    0.541     0.445  -
scene0006 (Hotel room)     0.760    0.679     0.565  0.686     0.722    0.629     0.511  -
scene0009 (Bathroom)       0.910    0.872     0.748  0.863     0.934    0.879     0.792  -
scene0011 (Kitchen)        0.680    0.523     -      0.567     0.722    0.552     0.440  -
scene0022 (Lounge)         0.897    0.767     0.668  0.832     0.817    0.694     0.569  -
scene0030 (Study room)     0.724    0.526     0.433  0.627     0.752    0.617     0.519  -
Average                    -        -         -      -         -        -         -      -

Table 2: Quantitative evaluation on the target scenes from ScanNet. The table shows results calculated per individual scene and average performance over all of the target scenes. The numbers for individual scenes indicate that S4-Net consistently outperforms the supervised baseline for all of the scenes.
"Scan 2" (ScanNet)
                           Lower is better                     Higher is better
                           Abs Rel  Sq Rel  RMSE   RMSE log    δ<1.25  δ<1.25²  δ<1.25³

DeepLabV3+ network architecture
Supervised baseline
scene0000 (Apartment)      0.296    0.192   0.524  0.271       0.449   0.918    -
scene0006 (Hotel room)     0.358    0.280   0.609  0.321       0.374   0.857    -
scene0009 (Bathroom)       0.329    0.204   0.532  0.293       0.257   0.958    -
scene0011 (Kitchen)        0.319    0.322   0.711  0.303       0.444   -        -
scene0022 (Lounge)         0.324    0.370   0.723  0.297       0.441   0.932    -
scene0030 (Study Room)     0.298    0.187   0.574  0.274       0.357   0.963    -
Average                    0.321    0.259   0.612  0.293       0.387   0.921    -
S4-Net with Depth Predictions
scene0000 (Apartment)      0.115    0.048   0.269  0.131       0.885   0.988    -
scene0006 (Hotel room)     0.122    0.090   0.452  0.174       0.886   0.958    -
scene0009 (Bathroom)       0.121    0.053   0.271  0.133       0.923   0.993    -
scene0011 (Kitchen)        0.168    0.159   0.553  0.219       0.804   0.938    -
scene0022 (Lounge)         0.179    0.199   0.535  0.195       0.823   0.970    -
scene0030 (Study Room)     0.134    0.057   0.341  0.151       0.897   0.990    -
Average                    0.140    0.101   0.404  0.167       0.870   0.973    -

"Scan 2" ("Room")
                           Abs Rel  Sq Rel  RMSE   RMSE log    δ<1.25  δ<1.25²  δ<1.25³

DeepLabV3+ network architecture
Supervised baseline        0.419    0.263   0.584  0.358       0.153   0.839    -
S4-Net with Depth Pred.    0.337    0.183   0.495  0.298       0.307   0.925    -

Table 3: Quantitative evaluation of depth predictions on the target scenes. The results indicate improvements in the depth domain for S4-Net with depth predictions. Hence, we obtain quality geometric constraints for learning semantic segmentation.

Figure 6: Qualitative evaluation of our depth predictions on unseen images of the target scenes from ScanNet for the DeepLabV3+ network architecture. After applying S4-Net with depth predictions to the target scenes, we observe improvements also in the depth domain.

References

[1] B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. ICLR, 2019.
[2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV, 2018.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[4] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In ECCV, 2016.
[5] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In CVPR, 2017.
[6] C. Godard, O. Mac Aodha, and G. J. Brostow. Digging Into Self-Supervised Monocular Depth Estimation. arXiv preprint arXiv:1806.01260, 2018.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[8] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[9] S. Laine and T. Aila. Temporal Ensembling for Semi-Supervised Learning. In ICLR, 2017.
[10] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In CVPR, 2018.
[11] A. Tarvainen and H. Valpola. Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results. In NIPS, 2017.
[12] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning Depth from Monocular Videos using Direct Methods. In CVPR, 2018.
[13] Z. Yin and J. Shi. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR, 2018.
[14] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In CVPR, 2017.
[15] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR, 2017.