Self-Supervised Attention Learning for Depth and Ego-motion Estimation
Assem Sadek and Boris Chidlovskii
Naver Labs Europe, chemin Maupertuis 6, Meylan-38240, France
[email protected]
Abstract — We address the problem of depth and ego-motion estimation from image sequences. Recent advances in the domain propose to train a deep learning model for both tasks using image reconstruction in a self-supervised manner. We revise the assumptions and the limitations of the current approaches and propose two improvements to boost the performance of depth and ego-motion estimation. We first use Lie group properties to enforce the geometric consistency between images in the sequence and their reconstructions. We then propose a mechanism to pay attention to image regions where the image reconstruction gets corrupted. We show how to integrate the attention mechanism in the pipeline in the form of attention gates and how to use the attention coefficients as a mask. We evaluate the new architecture on the KITTI datasets and compare it to the previous techniques. We show that our approach improves the state-of-the-art results for ego-motion estimation and achieves comparable results for depth estimation.
I. INTRODUCTION

The tasks of depth estimation and ego-motion estimation are long-standing problems in computer vision; their successful solution is crucial for a wide variety of applications, such as autonomous driving, robot navigation, visual localization, Augmented/Virtual Reality applications, etc.

In the last years, deep learning networks [6], [8], [17], [30] achieved results comparable with traditional geometric methods for depth estimation. They show competitive results in complex and ambiguous 3D areas, with CNNs serving as deep regressors coupled with classical components, to get the best of the geometric and learning paradigms. For ego-motion estimation, several works [16], [30] have achieved a level of performance comparable to traditional techniques based on SLAM algorithms [12], [19], [20].

Early methods for depth and ego-motion (DEM) estimation are based on supervised learning; they require large annotated datasets and calibrated setups. Trained and tested on publicly available benchmark datasets, these techniques show a limited capacity to generalize beyond the data they are trained on. Moreover, data annotation is often slow and costly. The annotations also suffer from structural artifacts, particularly in the presence of reflective, transparent, dark or non-reflective surfaces, for which sensors output infinity values. All these challenges strongly motivated the shift to unsupervised learning of depth and ego-motion, in particular from monocular (single-camera) videos.

To enable DEM estimation without annotations, the major idea is to process both tasks jointly [30]. In the self-supervised setting, an assumption is made about spatial consistency and temporal coherence between consecutive frames in a sequence. The only external data needed is the camera intrinsics. Recent progress in the domain [18], [28], [26], [3] allows using monocular unlabeled videos to provide self-supervision signals to a learning component. The 3D geometry estimation includes per-pixel depth estimation from a single image and 6DoF relative pose estimation between neighbour images.

Self-supervised learning greatly boosted DEM estimation performance. There however remains a gap with respect to the supervised methods. The underlying assumption of a static world is often violated in real scenes, and the geometric image reconstruction gets corrupted by unidentified moving objects, occlusions, reflection effects, etc.

Multiple improvements have been recently proposed to address these issues [1], [3], [18], [26], [27]. They are often based on adding more components to the architecture, such as flow nets [27], semantic segmentation [18], adversarial networks [1] and multiple masks [26]. These approaches lead however to an important growth of model parameters, making the architecture and training procedure more complex.

In this paper we propose an alternative and effective solution to the problem, based on the attention mechanism [15]. Initially proposed for natural language processing tasks [4], attention and its variants have been successfully extended to computer vision tasks, including image classification [25], semantic segmentation [13], [21], image captioning [29] and depth estimation [28]. Inspired by these successes, we propose to include the attention mechanism in self-supervised learning for DEM estimation.
We show that so-called attention gates can be integrated in the baseline architecture and trained from scratch to automatically learn to focus on corrupted regions without additional supervision. The attention gates do not require a large number of model parameters and introduce a minimal computational overhead. In return, the proposed mechanism improves model sensitivity and accuracy for dense depth and ego-motion estimation. The attention gates are integrated into the depth estimation network. Consequently, the depth network predicts both the depth estimation and the attention coefficients, which are then used to weigh the difference between the true and reconstructed pixels when minimizing the objective function.

We evaluate the proposed architecture on the KITTI datasets and compare it to the state-of-the-art techniques. We show that our approach improves the state-of-the-art results for ego-motion estimation and achieves comparable results for depth estimation.

II. RELATED WORK

Eigen et al. [7] were first to directly regress a CNN over pixel values and to use multi-scale features for monocular depth estimation. They used global (coarse-scale) and local (fine-scale) networks to accomplish the tasks of global depth prediction and local refinement.

Garg et al. [8] proposed to use a calibrated stereo camera pair setup where the depth is produced as an intermediate output and the supervision comes as a reconstruction of one image from the other in a stereo pair. Images on the stereo rig have a fixed and known transformation, and the depth can be learned from this functional relationship.

An important step forward was made in [10], where the depth estimation problem was reformulated in a new way. Godard et al. employ binocular stereo pairs in training but, at inference time, only one view is used to estimate the depth. By exploiting epipolar geometry constraints, they generate disparity images by training their network with an image reconstruction loss. The model does not require any labelled depth data and learns to predict pixel-level correspondences between pairs of rectified stereo images.

Mahjourian et al. [18] made another step by using camera ego-motion and 3D geometric constraints. Zhou et al. [30] proposed a novel approach for unsupervised learning of depth and ego-motion from monocular video only. An additional module to learn the motion of objects was introduced in [24]; however, their architecture recommends optional supervision by ground-truth depth or optical flow to improve performance.

The static world assumption does not hold in real scenes because of unidentified moving objects, occlusions, photo-effects, etc., which violate the underlying assumption and corrupt the geometric image reconstruction. Recent works address these limitations and propose a number of improvements, varying from new objective functions, additional modules and pixel masking to new learning schemes.

Almalioglu et al. [1] proposed a framework that predicts camera motion and a monocular depth map of the scene using deep convolutional Generative Adversarial Networks (GANs). An additional adversarial module helps learn more accurate models and makes reconstructed images indistinguishable from the real images.

Wang et al. [26] coped with errors in realistic scenes due to reflective surfaces and occlusions.
They combined geometric and photometric losses by introducing a matching loss constrained by epipolar geometry, and designed multiple masks to solve image pixel mismatch caused by the movement of the camera.

Another solution was proposed in the UnDEMoN architecture [2]. The authors changed the objective function and tried to minimize spatial and temporal reconstruction losses simultaneously. These losses are defined using a bilinear sampling kernel and penalized using the Charbonnier penalty function.

Most recently, Bian et al. [3] analyzed violations of the underlying static assumption in geometric image reconstruction and came to the conclusion that, due to a lack of proper constraints, networks output scale-inconsistent results over different samples. To remedy the problem, the authors proposed a geometry consistency loss for scale-consistent predictions and a mask for handling moving objects and occlusions.

Since our approach leverages neither additional modules nor multi-task learning, our attention-based framework is much simpler and more efficient.

III. BASELINE ARCHITECTURE AND EXTENSIONS
Similarly to the recent methods [18], [26], [30], our baseline architecture includes depth estimation and pose estimation modules. The depth module is an encoder-decoder network (DispNet); it takes a target image and outputs disparity values $\hat{D}_t(p)$ for every pixel $p$ in the image. The pose module (PoseNet) inputs the target image $I_t$ and two neighbour (source) images $I_s$, $s \in \{t-1, t+1\}$, and outputs transformation matrices $\hat{T}_{t \to s}$ representing the six degrees of freedom (6DoF) relative pose between the images.
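To make the interfaces concrete, the following minimal NumPy sketch mimics the I/O contract of the two modules; all shapes, names and stand-in outputs are illustrative assumptions, not the networks' actual layers.

```python
import numpy as np

# Illustrative I/O contract of the two modules (shapes and stand-in
# outputs are ours, not the exact layer configuration of the paper).
H, W = 128, 416                                       # example input resolution
I_t = np.random.rand(H, W, 3)                         # target frame I_t
I_s = [np.random.rand(H, W, 3) for _ in range(2)]     # sources I_{t-1}, I_{t+1}

def dispnet(image):
    """Stand-in for the encoder-decoder depth network: per-pixel disparity."""
    return np.full(image.shape[:2], 0.1)

def posenet(target, sources):
    """Stand-in for the pose network: one 6-DoF vector (3 translation,
    3 rotation parameters) per source frame."""
    return np.zeros((len(sources), 6))

disparity = dispnet(I_t)                           # D_hat_t, shape (H, W)
poses     = posenet(I_t, I_s)                      # T_hat_{t->s}, shape (2, 6)
depth     = 1.0 / np.maximum(disparity, 1e-6)      # depth from disparity
```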
1) Image reconstruction:
The self-supervised learning proceeds by image reconstruction using the inverse warping technique [10]. Being differentiable, it allows back-propagating the gradients to the networks during training. It reconstructs the target image $I_t$ by sampling pixels from the source images $I_s$, based on the estimated disparity map $\hat{D}_t$ and the relative pose transformation matrices $\hat{T}_{t \to s}$, $s \in \{t-1, t+1\}$. The sampling is done by projecting the homogeneous coordinates of a target pixel $p_t$ onto the source view $p_s$. Given the camera intrinsics $K$, the estimated depth of the pixel $\hat{D}_t(p_t)$ and the transformation matrix $\hat{T}_{t \to s}$, the projection is done by

$$p_s \sim K \hat{T}_{t \to s} \hat{D}_t(p_t) K^{-1} p_t. \quad (1)$$

For non-discrete values of $p_s$, the differentiable bilinear sampling interpolation [14] is used to find the intensity value at that position. The intensity value in the reconstructed image $\hat{I}_s$ is interpolated using the 4 pixel neighbours of $p_s$ (top-right, top-left, bottom-right, bottom-left), as follows:

$$\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} w^{ij} I_s(p_s^{ij}), \quad (2)$$

where $\hat{I}_s(p_t)$ is the intensity value of $p_t$ in the reconstructed image $\hat{I}_s$. The weight $w^{ij}$ is linearly proportional to the spatial proximity between $p_s$ and the neighbour $p_s^{ij}$; the four weights $w^{ij}$ sum to 1.
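The following NumPy sketch implements the projection of Eq. (1) and the bilinear sampling of Eq. (2) for a single source view; function and variable names are ours, and boundary and negative-depth handling is simplified.

```python
import numpy as np

def inverse_warp(I_s, D_t, T_ts, K):
    """Reconstruct the target view from source image I_s, following
    Eq. (1)-(2). I_s: (H, W, 3); D_t: (H, W) predicted depth;
    T_ts: (4, 4) homogeneous transform; K: (3, 3) intrinsics."""
    H, W, _ = I_s.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p_t = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])  # homogeneous pixels

    # Eq. (1): back-project with predicted depth, move to the source
    # frame, re-project with the intrinsics.
    cam_t = D_t.ravel() * (np.linalg.inv(K) @ p_t)
    cam_s = T_ts[:3, :3] @ cam_t + T_ts[:3, 3:4]
    p_s = K @ cam_s
    xs = np.clip(p_s[0] / p_s[2], 0, W - 1)   # continuous source coordinates
    ys = np.clip(p_s[1] / p_s[2], 0, H - 1)

    # Eq. (2): bilinear interpolation over the 4 pixel neighbours of p_s.
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    wx, wy = xs - x0, ys - y0
    I_rec = ((1 - wy) * (1 - wx))[:, None] * I_s[y0, x0] \
          + ((1 - wy) * wx)[:, None] * I_s[y0, x0 + 1] \
          + (wy * (1 - wx))[:, None] * I_s[y0 + 1, x0] \
          + (wy * wx)[:, None] * I_s[y0 + 1, x0 + 1]
    return I_rec.reshape(H, W, 3)
```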
2) Photometric Loss:
Under the static world assumption, many existing methods apply the photometric loss [1], [23], defined as the L1 objective: $L_p = \sum_s \sum_p |I_t(p) - \hat{I}_s(p)|$.

Any violation of the static world assumption in real scenes affects the reconstruction drastically. To overcome this limitation, one solution is to use the SSIM loss [18], [26], [23]. A more advanced solution [30] is to introduce an explainability mask to indicate the importance of a pixel in the warped images. If the pixel contributes to a corrupted synthesis, the explainability value of the pixel will be negligible.

Fig. 1. Attention-based DEM architecture.

The explainability values are produced by a dedicated module (ExpNet) in [30]; it shares the encoder with the PoseNet and branches off in the decoding part. All three modules, for depth, pose and explainability, are trained simultaneously. The ExpNet decoder generates a per-pixel mask $\hat{E}_k(p)$. Similar to PoseNet, the explainability map $\hat{E}_k$ is generated for both source images. Per-pixel explainability values are embedded in the photometric loss:

$$L_p = \frac{1}{|V|} \sum_p \hat{E}_k(p)\, |I_t(p) - \hat{I}_s(p)|, \quad (3)$$

where $|V|$ is the number of pixels in the image. To avoid the trivial solution of (3) with $\hat{E}_k(p)$ equal to zero, a constraint is added on the values of $\hat{E}_k(p)$. This constraint is implemented as a regularization loss $L_{reg}(\hat{E}_k)$, defined as a cross-entropy loss between the mask value and a constant label 1.
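A minimal sketch of the masked photometric loss of Eq. (3) and the mask regularizer, assuming NumPy arrays and our own function names:

```python
import numpy as np

def photometric_loss(I_t, I_rec, E=None):
    """L1 photometric loss; with the optional per-pixel mask E in [0, 1]
    it becomes the masked loss of Eq. (3)."""
    diff = np.abs(I_t - I_rec).sum(axis=-1)      # per-pixel residual, (H, W)
    if E is not None:
        diff = E * diff                          # down-weight unexplainable pixels
    return diff.mean()                           # mean over |V| pixels

def mask_regularizer(E, eps=1e-6):
    """Cross-entropy of the mask against the constant label 1; without it,
    E = 0 everywhere would trivially minimize Eq. (3)."""
    return -np.log(np.clip(E, eps, 1.0)).mean()
```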
3) Depth smoothness:
We follow [23] in including a smoothness term to resolve the gradient-locality issue and remove discontinuities of the learned depth in low-texture regions. We use the edge-aware depth smoothness loss, which uses the image gradient to weigh the depth gradient:

$$L_{smo} = \sum_p |\nabla D(p)|^T \cdot e^{-|\nabla I(p)|}, \quad (4)$$

where $p$ is the pixel on the depth map $D$ and image $I$, $\nabla$ denotes the 2D differential operator, and $|\cdot|$ is the element-wise absolute value. We apply the smoothness loss on three additional intermediate layers from DispNet and ExpNet.
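A possible NumPy rendering of the edge-aware smoothness term of Eq. (4), using finite differences for the 2D differential operator:

```python
import numpy as np

def edge_aware_smoothness(D, I):
    """Eq. (4): depth gradients are penalized less where the image itself
    has strong gradients (likely true depth edges). D: (H, W), I: (H, W, 3)."""
    dDx = np.abs(np.diff(D, axis=1))                 # |dD/dx|, (H, W-1)
    dDy = np.abs(np.diff(D, axis=0))                 # |dD/dy|, (H-1, W)
    dIx = np.abs(np.diff(I, axis=1)).mean(axis=-1)   # image gradient, channel mean
    dIy = np.abs(np.diff(I, axis=0)).mean(axis=-1)
    return (dDx * np.exp(-dIx)).mean() + (dDy * np.exp(-dIy)).mean()
```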
A. Backward-Forward Consistency

The baseline architecture presented in the previous section integrates all components that have proved their efficiency in the state-of-the-art methods. Now we propose the first extension of our architecture and consider reinforcing the geometric consistency by using the Lie group property [5].
Fig. 2. Relative transformation between two different views for the same camera.
Indeed, the set of 3D space transformations $T$ forms a Lie group $SE(3)$; it is represented by linear transformations on homogeneous vectors, $T = [R\,|\,t] \in SE(3)$, with the rotation component $R \in SO(3)$ and the translation component $t \in \mathbb{R}^3$. For every transformation $T \in SE(3)$, there is an inverse transformation $T^{-1} \in SE(3)$ such that $T T^{-1} = I$ (see Figure 2).

The PoseNet estimates relative pose transformations from a given target to the source frames. Therefore, for every pair of neighbour frames $(t-1, t)$, we obtain the forward transformation $\hat{T}_{t-1 \to t}$ as well as the backward one $\hat{T}_{t \to t-1}$. In the general case, for every pair of transformations $\hat{T}_{t \to s}$ and $\hat{T}_{s \to t}$, we impose an additional forward-backward geometric constraint; it requires the product of the forward and backward transformations to be as close as possible to the identity matrix $I_{4 \times 4}$. The corresponding loss is defined over all pairs of relative pose transformations:

$$L_{bf} = \sum_s \sum_t |\hat{T}_{s \to t} \hat{T}_{t \to s} - I_{4 \times 4}|. \quad (5)$$
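The constraint of Eq. (5) is cheap to express; a sketch with one forward/backward pair (names are ours):

```python
import numpy as np

def backward_forward_loss(T_fwd, T_bwd):
    """Eq. (5): the product of forward and backward relative poses should
    be the 4x4 identity; the residual is penalized element-wise (L1)."""
    return np.abs(T_bwd @ T_fwd - np.eye(4)).sum()

# A perfect forward/backward pair yields zero loss.
T = np.eye(4)
T[:3, 3] = [0.5, 0.0, 1.0]                       # a pure translation
assert np.isclose(backward_forward_loss(T, np.linalg.inv(T)), 0.0)
```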
1) Total Loss:
The total training loss is given by

$$L_{total} = L_p + \lambda_{smo} L_{smo} + \lambda_{reg} L_{reg} + \lambda_{bf} L_{bf}, \quad (6)$$

where $\lambda_{smo}$, $\lambda_{reg}$ and $\lambda_{bf}$ are hyper-parameters, set in the experiments to the fixed values that showed the most stable results.
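Putting the pieces together, a sketch of Eq. (6) composed from the loss functions above; the lam_* weights are placeholders, not the values tuned in our experiments:

```python
def total_training_loss(I_t, I_rec, E, D, T_fwd, T_bwd,
                        lam_smo=0.1, lam_reg=0.1, lam_bf=0.1):
    """Eq. (6), composed from the loss sketches above; the lam_* weights
    are placeholders, not the values used in the paper's experiments."""
    return (photometric_loss(I_t, I_rec, E)
            + lam_smo * edge_aware_smoothness(D, I_t)
            + lam_reg * mask_regularizer(E)
            + lam_bf * backward_forward_loss(T_fwd, T_bwd))
```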
B. Self-attention gates

Our second extension of the baseline architecture addresses the attention mechanism and lets the network know where to look as it is performing the task of DEM estimation. Unlike integrating attention in Conditional Random Fields [28], our proposal is inspired by the recent works in semantic segmentation [15], in particular in medical imaging [21], [22]. We treat attention in depth estimation similarly to semantic segmentation: if each instance (group of pixels) belongs to a certain semantic label (e.g. pedestrian), then the same group of pixels will have close and continuous depth values. We therefore pay attention to any violation of this principle as a potential source of corrupted image reconstruction.

We propose to integrate the attention mechanism in the depth module (DispNet). As shown in Figure 3, the encoder does not change, while the decoder layers are interleaved with the attention gates (AGs). The integration is done as follows.

Fig. 3. DispNet with the integrated attention gates (in orange). The input image is progressively filtered and downsampled by a factor of 2 at each scale $i$ in the encoding part of the network, $H_i = H / 2^{i-1}$. Attention gates filter the features propagated through the skip connections. Feature selectivity in AGs is achieved by use of contextual information (gating) extracted at coarser scales.

Let $x^l = \{x_i^l\}_{i=1}^n$ be the activation map of a chosen layer $l \in \{1, \dots, L\}$, where each $x_i^l$ represents the pixel-wise feature vector of length $F_l$ (i.e. the number of channels). For each $x_i^l$, the AG computes coefficients $\alpha^l = \{\alpha_i^l\}_{i=1}^n$, $\alpha_i^l \in [0, 1]$, in order to identify corrupted image regions and prune feature responses, preserving only the activations relevant to accurate depth estimation. The output of the AG is $\hat{x}^l = \{\alpha_i^l x_i^l\}_{i=1}^n$, where each feature vector is scaled by the corresponding attention coefficient.

The attention coefficients $\alpha_i^l$ are computed as follows. In DispNet, the features at the coarse level identify the location of the target objects and model their relationship at the global scale. Let $g \in \mathbb{R}^{F_g}$ be such a global feature vector providing information to the AGs to disambiguate task-irrelevant feature content in $x^l$. The idea is to consider each $x^l$ and $g$ jointly to attend to the features at each scale $l$ that are most relevant to the objective being minimised.

The gating vector contains contextual information to prune lower-level feature responses, as suggested in AGs for image classification [25]. We prefer additive attention to the multiplicative one, as it has experimentally been shown to achieve a higher accuracy [22]:

$$q_{att,i}^l = \psi^T \big( \sigma_1( W_x^T x_i^l + W_g^T g + b_{xg} ) \big) + b_\psi, \qquad \alpha_i^l = \sigma_2\big( q_{att,i}^l(x_i^l, g; \Theta_{att}) \big), \quad (7)$$

where $\sigma_1(x)$ is an element-wise nonlinear function, in particular we use $\sigma_1(x) = \max(x, 0)$, and $\sigma_2(x)$ is a sigmoid activation function. Each AG is characterised by a set of parameters $\Theta_{att}$ containing the linear transformations $W_x \in \mathbb{R}^{F_l \times F_{int}}$, $W_g \in \mathbb{R}^{F_g \times F_{int}}$, $\psi \in \mathbb{R}^{F_{int}}$, and the bias terms $b_\psi \in \mathbb{R}$, $b_{xg} \in \mathbb{R}^{F_{int}}$. AG parameters can be trained with the standard back-propagation updates together with the other DispNet parameters.
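A forward-pass sketch of one additive attention gate from Eq. (7), written with plain NumPy for clarity (a trainable version would be a standard network module); shapes and names are ours:

```python
import numpy as np

def attention_gate(x_l, g, W_x, W_g, psi, b_xg, b_psi):
    """Forward pass of an additive attention gate, Eq. (7).
    x_l: (N, F_l) skip-connection features over N pixels;
    g:   (F_g,) gating vector (in practice resampled to x_l's resolution);
    W_x: (F_l, F_int); W_g: (F_g, F_int); psi: (F_int,)."""
    q = np.maximum(x_l @ W_x + g @ W_g + b_xg, 0.0) @ psi + b_psi  # sigma_1 = ReLU
    alpha = 1.0 / (1.0 + np.exp(-q))                               # sigma_2 = sigmoid
    return alpha[:, None] * x_l, alpha            # gated features and mask

# Example with made-up dimensions.
rng = np.random.default_rng(0)
N, F_l, F_g, F_int = 64, 32, 128, 16
out, alpha = attention_gate(rng.normal(size=(N, F_l)), rng.normal(size=F_g),
                            rng.normal(size=(F_l, F_int)),
                            rng.normal(size=(F_g, F_int)),
                            rng.normal(size=F_int), 0.0, 0.0)
assert out.shape == (N, F_l) and 0.0 <= alpha.min() and alpha.max() <= 1.0
```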
With attention gates integrated in DispNet, we modify the photometric loss in (6) accordingly, with the attention coefficient $\alpha(p)$ for pixel $p$ used instead of the explainability value $\hat{E}(p)$. Figure 4 in Section IV visualizes the attention coefficients for three example images. It shows that the system pays less attention to moving objects, as well as to 2D edges and boundaries of regions with discontinuous depth values, which are sensitive to erroneous depth estimation.

IV. EVALUATION RESULTS

In this section, we present and analyze the evaluation results of depth and ego-motion estimation. We support our analysis with visualizations, such as the attention coefficients used for masking pixels with likely corrupted image reconstruction.

A. Depth estimation
We evaluate depth estimation on the publicly available KITTI raw dataset. It contains 42,382 rectified stereo pairs from 61 scenes; image size is 1242 × 375 pixels. We report the standard metrics of [7]: the absolute relative error (Abs Rel), squared relative error (Sq Rel), root mean squared error (RMSE), RMSE of log depth (RMSE log), and the accuracy with threshold $t$, where $t \in \{1.25, 1.25^2, 1.25^3\}$ (see [7] for more detail; a sketch of these metrics is given at the end of this subsection).

We test our architecture presented in Section III in the baseline configuration, extended with the Backward-Forward loss, the attention gates, and both. Table I reports our depth evaluation results and compares them to the state-of-the-art methods. As the table shows, our method, without using additional modules, outperforms the supervised and most unsupervised techniques and shows performance comparable to the most recent methods [1], [3], which extend the baseline with additional modules and components.

a) Attention coefficients: Figure 4 visualizes the effect of attention coefficients as masks for down-weighting image regions that get corrupted in the image reconstruction. It actually visualizes the inverse attention, where white color refers to a low attention coefficient $\alpha_i$, thus having a lower weight, and black color refers to high values. We can see in the figure that low attention coefficients point to pixels that have a high probability of being corrupted. First of all, this concerns image regions corresponding to moving objects. In addition, regions with discontinuous depth values are considered corrupted as well; this often includes region boundaries. Thin objects like street lights and sign poles are also down-weighted because of the high probability of depth discontinuity and corrupted image reconstruction.

These results support the hypothesis of using attention coefficients as a mask; it represents an alternative to the explainability module in [30]. The regions of interest are most likely rigid objects, whose depth the network can estimate with more confidence: the depth is almost constant over the object, as in the segmentation problem. Rigid objects are also the most reliable cues for estimating the change in position between frames, which is why the coefficients for moving objects are suppressed. Unlike the explainability mask, the attention gates do not require an additional module; they are integrated in the existing depth module.
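For reference, a minimal sketch of these standard metrics, following the usual definitions from [7] (names are ours):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard single-view depth metrics over valid ground-truth pixels."""
    valid = gt > 0
    gt, pred = gt[valid], pred[valid]
    abs_rel  = np.mean(np.abs(gt - pred) / gt)
    sq_rel   = np.mean((gt - pred) ** 2 / gt)
    rmse     = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    ratio    = np.maximum(gt / pred, pred / gt)
    acc      = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, acc
```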
B. Pose estimation

We use the publicly available KITTI visual odometry dataset. The official split contains 11 driving sequences with ground-truth poses obtained from GPS readings. We use sequences 09 and 10 to evaluate our approaches, to align with previous SLAM-based works.

We follow the previous works in using the absolute trajectory error (ATE) as the evaluation metric. It measures the difference between points of the ground-truth and the predicted trajectory. Using timestamps to associate the ground-truth poses with the corresponding predicted poses, we compute the difference between each pair of poses, and output the mean and standard deviation.
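A sketch of the ATE computation as used here; trajectory alignment and timestamp association are assumed to have been done beforehand:

```python
import numpy as np

def absolute_trajectory_error(gt_xyz, pred_xyz):
    """ATE over one snippet: per-pose Euclidean distance between associated
    ground-truth and predicted positions (each of shape (N, 3)), assumed
    already expressed in a common, aligned frame; returns mean and std."""
    d = np.linalg.norm(gt_xyz - pred_xyz, axis=1)
    return d.mean(), d.std()
```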
1) Evaluation results:
Table II reports the evaluation results of the Backward-Forward and attention modules, separately and jointly, and compares them to the previous works. Both extensions improve on the baseline, and the attention module performs particularly well. When coupled with the BF loss, the attention boosts the performance of PoseNet training towards a consistent ego-motion estimation and outperforms all the state-of-the-art methods [30], [18], [1], [23], which use additional models or pixel masks and thus increase the model size considerably.

Figure 5 visualizes the pose estimation for test sequence 09. For each degree of freedom, the translation $(x, y, z)$ and rotation (roll, pitch, yaw), it compares the estimation by different methods to the ground truth. The figure unveils another important issue: it demonstrates the oscillation of roll and yaw values and their discontinuity when put in the $[-\pi, \pi]$ interval, while orientation changes are continuous in the real world. A recent analysis of 3D and $n$-dimensional rotations [31] shows that discontinuous orientation representations make them difficult for neural networks to learn. Therefore, this might be a subject of a deeper analysis and a need for replacing quaternions with an alternative, continuous representation.

C. Model size and the training time
Our architecture for DEM estimation requires fewer parameters in the model and shows a faster training time than the state-of-the-art methods. Indeed, adding the BF loss has a negligible impact on the training time with respect to the baseline. Adding attention gates increases the model size by 5-10% and the training time by 10-25%. For comparison, adding the semantic segmentation [18] or GAN module [1] doubles the model size and requires 4-10 times more time to train the models.

V. CONCLUSIONS

We have presented two extensions of the baseline architecture for the depth and pose estimation tasks, both aimed at improving the performance in the self-supervised setting. Adding the backward-forward consistency loss to the training process allowed us to boost the performance. Our method follows one of the current trends of forcing the learned models to respect geometric principles by adding penalties for any consistency violation. This idea opens a possibility to explore and impose more geometric constraints on the learned models, which might further improve the accuracy.

We have shown the effectiveness of attention gates integrated in the depth module of DEM estimation. This demonstrates that the attention principle can be extended to navigation tasks. The attention gates help identify corrupted image regions where the static world assumption is violated. In addition, the attention model can be explored as masking coefficients in multiple different ways; it represents a strong alternative to the explainability network in the baseline architecture.
REFERENCES

[1] Yasin Almalioglu, Muhamad Risqi U. Saputra, Pedro P. B. de Gusmao, Andrew Markham, and Niki Trigoni. GANVO: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In Proceedings of ICRA, 2019.
[2] V. Madhu Babu, Kaushik Das, Anima Majumder, and Swagat Kumar. UnDEMoN: Unsupervised deep network for depth and ego-motion estimation. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1082–1088, 2018.
[3] Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, and Ian D. Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Proc. NIPS, pages 35–45, 2019.

Fig. 4. Original images from the KITTI dataset (top) and the corresponding attention coefficient maps (bottom).

TABLE I. Single-view depth results on the KITTI dataset [9] using the split of Eigen et al. [7]. Best and runner-up results are shown in bold and italic, respectively.

Method                  | Supervision | Abs Rel | Sq Rel | RMSE  | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³
Eigen et al. [7] Coarse | Depth       | 0.214   | 1.605  | 6.563 | 0.292    | 0.673  | 0.884   | 0.957
Eigen et al. [7] Fine   | Depth       | 0.203   | 1.548  | 6.307 | 0.282    | 0.702  | 0.890   | 0.958
Liu et al. [17]         | Pose        | 0.202   | 1.614  | 6.523 | 0.275    | 0.678  | 0.895   | 0.965
Godard et al. [10]      | No          | –       | –      | –     | –        | –      | –       | –
Shen et al. [23]        | No          | 0.156   | 1.309  | 5.730 | 0.236    | 0.797  | 0.929   | 0.969
Bian et al. [3]         | No          | –       | –      | –     | –        | –      | –       | –
Ours (BF)               | No          | 0.213   | 1.849  | 6.781 | 0.288    | 0.679  | 0.887   | 0.957
Ours (Attention)        | No          | 0.171   | 1.281  | 5.981 | 0.252    | 0.755  | 0.921   | 0.968
Ours (BF+Attention)     | No          | 0.162   | –      | –     | –        | –      | –       | –

TABLE II. Absolute Trajectory Error (ATE) on the KITTI odometry split, averaged over all frame snippets (lower is better).

Method                      | Seq. 09 | Seq. 10
ORB-SLAM (full)             | –       | –
ORB-SLAM (short)            | –       | –
Zhou et al. [30] (baseline) | –       | –
Mahjourian et al. [18]      | –       | –
Almalioglu et al. [1]       | –       | –
Shen et al. [23]            | –       | –
Ours (BF)                   | –       | –
Ours (Attention)            | –       | –
Ours (BF+Attention)         | –       | –

[4] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. CoRR, abs/1904.02874, 2019.
[5] José Luis Blanco Claraco. A tutorial on SE(3) transformation parameterizations and on-manifold optimization. Technical report, 2019.
[6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. IEEE International Conference on Computer Vision (ICCV), 2015.
[7] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proc. NIPS, pages 2366–2374, 2014.
[8] Ravi Garg, Vijay Kumar B. G., Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. 2016.
[9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. Technical report.
[10] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6602–6611, 2017.
[11] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. Technical report.
[12] Cristopher Gómez, Matías Mattamala, Tim Resink, and Javier Ruiz-del-Solar. Visual SLAM-based localization and navigation for service robots: The Pepper case. 2018.
[13] Qin Huang, Chunyang Xia, Chi-Hao Wu, Siyang Li, Ye Wang, Yuhang Song, and C.-C. Jay Kuo. Semantic segmentation with reverse attention. In Proc. BMVC, 2017.
[14] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. 2015.
[15] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr. Learn to pay attention. In Proc. International Conference on Learning Representations (ICLR), 2018.
[16] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. 2015.
[17] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[18] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5667–5675, 2018.
[19] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
[20] Raul Mur-Artal and Juan D. Tardos. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[21] Ozan Oktay, Jo Schlemper, Loïc Le Folgoc, Matthew C. H. Lee, Mattias P. Heinrich, Kazunari Misawa, Kensaku Mori, Steven G. McDonagh, Nils Y. Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention U-Net: Learning where to look for the pancreas. arXiv:1804.03999, 2018.
[22] Jo Schlemper, Ozan Oktay, Liang Chen, Jacqueline Matthew, Caroline Knight, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention-gated networks for improving ultrasound scan plane detection. arXiv:1804.05338, 2018.
[23] Tianwei Shen, Zixin Luo, Lei Zhou, Hanyu Deng, Runze Zhang, Tian Fang, and Long Quan. Beyond photometric loss for self-supervised ego-motion estimation. In Proc. International Conference on Robotics and Automation (ICRA), pages 6359–6365, 2019.
[24] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv:1704.07804, 2017.
[25] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6458, 2017.
[26] G. Wang, H. Wang, Y. Liu, and W. Chen. Unsupervised learning of monocular depth and ego-motion using multiple masks. In Proc. International Conference on Robotics and Automation (ICRA), pages 4724–4730, 2019.
[27] Yang Wang, Zhenheng Yang, Peng Wang, Yi Yang, Chenxu Luo, and Wei Xu. Joint unsupervised learning of optical flow and depth by watching stereo videos. arXiv:1810.03654, 2018.
[28] Dan Xu, Wei Wang, Hao Tang, Hong Liu, Nicu Sebe, and Elisa Ricci. Structured attention guided convolutional neural fields for monocular depth estimation. arXiv:1803.11029, 2018.
[29] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proc. International Conference on Machine Learning (ICML), pages 2048–2057, 2015.
[30] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. IEEE International Conference on Computer Vision, pages 6612–6619, 2017.
[31] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019.

Fig. 5. Sequence 09: comparing the pose estimation to the ground truth for each degree of freedom, including a) the translation (x, y, z) and b) the rotation (roll, pitch, yaw).