FlowNorm: A Learning-based Method for Increasing Convergence Range of Direct Alignment
Ke Wang, Kaixuan Wang, and Shaojie Shen
Abstract — Many approaches have been proposed to estimate camera poses by directly minimizing photometric error. However, due to the non-convex property of direct alignment, proper initialization is still required for these methods. Many robust norms (e.g., the Huber norm) have been proposed to deal with the outlier terms caused by incorrect initializations. These robust norms are defined solely on the magnitude of each error term. In this paper, we propose a novel robust norm, named FlowNorm, that exploits information from both the local error term and the global image registration. While the local information is defined on patch alignments, the global information is estimated using a learning-based network. Using both the local and global information, we achieve an unprecedented convergence range in which images can be aligned given large view-angle changes or small overlaps. We further demonstrate the usability of the proposed robust norm by integrating it into the direct methods DSO and BA-Net, generating more robust and accurate results in real time.
I. INTRODUCTION
Direct methods are widely used to solve visual odometry and monocular stereo problems [1]–[4]. By directly minimizing the photometric error between pixels in the source frame and the target frame, camera poses and scene geometry can be estimated in a joint optimization process. Compared with indirect methods [5]–[8], which solve the problem by minimizing the reprojection error between matched sparse features, direct methods avoid the pre-processed feature matching step and can utilize more pixels in the image. However, intensity-based optimization is prone to local minima due to the non-convex property of complex images.

In recent years, many approaches have been proposed to expand the convergence range of direct methods. SVO [9] combines matched feature points with photometric optimization. Although matched features can provide pose initialization for further optimization, they rely on the textures of the environment and are prone to outliers. With the help of learning-based methods, many researchers have proposed networks [10], [11] to generate smooth feature maps for direct optimization. Compared with the image intensity domain, optimization on feature maps shows advantages in convergence range. For example, BA-Net [10] can estimate camera poses given images with small overlaps. LS-Net [12] uses an end-to-end trained network as a solver for two-frame monocular stereo problems. Learning-based methods achieve superior performance on evaluation datasets, such as RGB-D datasets or the KITTI dataset, but have not been widely used on robotic platforms. The reason may be the limited
The authors are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong. [email protected]

Fig. 1. A simple example showing the different contributions of points when aligning two functions: \( \arg\min_t \sum_x \| g(x + t) - f(x) \| \). A point x with \( (g(x + t) - f(x))\, g'(x + t) < 0 \) contributes to the optimization of t and is marked in green, while a point with \( (g(x + t) - f(x))\, g'(x + t) > 0 \) counteracts the optimization and is marked in red. As shown in (a), with a good initialization, most of the points contribute positively to the optimization. However, with a worse initialization, as in (b), negative points make the optimization fall into a local minimum.

computation resources of general robotic platforms and the diversity of robotic application scenes.

One of the contributions of this paper is a study of the direct optimization process, followed by the design of a robust norm for the optimization. Due to non-convexity, during the photometric minimization (or feature-consistency minimization in learning-based methods), not all pixels contribute to the convergence. The difference between pixels depends on both the local texture and the global pose initialization, which establishes the pixel correspondences. We illustrate the convergence problem in Fig. 1. As shown, for good initializations, most of the correspondences contribute to the final estimation. However, given a bad initialization, most of the correspondences will suppress the convergence and make the optimization fall into local minima. Based on this observation, we propose the flow norm, which uses a low-accuracy optical flow prediction network to distinguish which correspondences will suppress the convergence of direct alignment.

In summary, the contributions of our paper are the following:
• We propose a new norm to expand the convergence range of the traditional nonlinear solver for the direct alignment problem.
• To the best of our knowledge, the proposed method is the first that can distinguish which correspondences will suppress solver convergence in the direct alignment setting.
• We build FlowNorm versions of DSO and BA-Net, and the FlowNorm DSO retains the real-time property.

To demonstrate the effectiveness of our method, we evaluate it on the SceneNN dataset [13], the TUM-MonoVO dataset [14], and the ICL-NUIM dataset [15], showing that the FlowNorm versions consistently outperform the original versions.

II. RELATED WORK
Semi-dense visual odometry [4] is a pioneering work that tracks a monocular camera in real time using a direct alignment algorithm. SVO [9] uses matched features to calculate an initial pose for joint optimization. Following the idea of the direct method, Engel et al. proposed LSD-SLAM [3], which solves the camera pose using keyframes with depth values. DSO [16] is the baseline direct alignment work, which jointly optimizes all model parameters, including geometry represented as inverse depth and camera motion. DSO further integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. Although all these methods feature real-time efficiency and high accuracy, they rely on incrementally tracking the camera poses to ensure large overlaps and proper initialization.

To increase the convergence range of direct methods, many learning-based methods have been proposed to replace the intensity map with feature maps. BA-Net [10] formulates bundle adjustment (BA) as a differentiable layer and utilizes a standard encoder-decoder network to generate the feature map and depth map. Camera poses and depth maps are optimized by minimizing the feature consistency between projected pixels. Benefiting from the generated feature maps, BA-Net expands the convergence range of direct alignment. GN-Net [11] uses a novel Gauss-Newton loss for training deep feature maps. The direct alignment in GN-Net, based on minimizing the feature-metric error, achieves robust performance under dynamic lighting or weather changes. These two approaches nicely combine traditional direct alignment and deep learning techniques.

LS-Net [12] uses an end-to-end trained network to replace the traditional nonlinear solver. Given a photometric error map and a Jacobian matrix, LS-Net estimates the updated depth map and camera motion.
Although it achieves impressive results on datasets, the generalization ability of LS-Net has not been demonstrated.

In this paper, we propose a different solution that improves the robustness of direct optimization. The core of the contribution is a robust norm that distinguishes error terms using both local and global information. Different from most learning-based methods, which use a heavy network to generate high-dimensional feature maps, we utilize a lightweight network to improve both the robustness and accuracy of state-of-the-art methods, with an overhead of only 14 ms.

III. DIRECT ALIGNMENT REVISITED
Before introducing our enhanced direct alignment algorithm, we revisit classic direct alignment to give a better understanding of where the difficulties lie and why our method is desirable. We only introduce the most relevant content, and refer readers to [16] for a more comprehensive introduction.

Given a source/target image pair I_s and I_t, the direct alignment problem is formulated as estimating the relative transformation T between the image pair, together with d_i ∈ D = {d_i | i = 1 ⋯ N}, the depths of the pixels p_si ∈ P_s = {p_si | i = 1 ⋯ N} in the image I_s. Let X = {T, D}; we can estimate X by minimizing the norm of the photometric error

\[ \hat{X} = \arg\min_{X} \sum_{i=1}^{N} \left| e_i(X) \right|, \tag{1} \]

where |·| denotes the L1 norm or Huber norm of a vector, N is the number of selected pixels, and the photometric error

\[ e_i(X) = I_t(p'_{ti}) - I_s(p_{si}) \tag{2} \]

measures the intensity difference between the ith pixel p_si in I_s and its corresponding pixel p′_ti in I_t. p′_ti is computed by the projection function

\[ p'_{ti} = \pi(p_{si}, T, d_i) = \frac{1}{s} K T d_i K^{-1} p_{si}, \tag{3} \]

which projects the 2-D point p_si from I_s to I_t, where d_i is the depth value of p_si in I_s, and K and s are the camera intrinsic matrix and a scale factor, respectively.

The general strategy for minimizing Eq. (1) is the Gauss-Newton (GN) or Levenberg-Marquardt (LM) algorithm [17]; both are iterative. At the jth iteration, the GN algorithm solves for an optimal update

\[ \Delta X_j = -\left( J_j^{T} J_j \right)^{-1} J_j^{T} E_j. \tag{4} \]

Here E_j = [e_1(X_j), e_2(X_j), ⋯, e_N(X_j)], where X_j contains the parameter values at the jth iteration. Let δ denote a small se(3) perturbation around X_j; J_j is the Jacobian matrix of E_j with respect to δ.
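For concreteness, the projection of Eq. (3) can be written out as code. This is our illustrative sketch, not the paper's implementation: the intrinsics are hypothetical, and the scale factor s is realized as the perspective division by the transformed depth.

```python
import numpy as np

def project(p_s, T, d, K):
    """Project pixel p_s = (u, v) from I_s into I_t, as in Eq. (3).

    Back-project with depth d, apply the rigid transform T = [R | t],
    and re-project with intrinsics K; the perspective division by the
    transformed depth plays the role of the scale factor s.
    """
    p_h = np.array([p_s[0], p_s[1], 1.0])   # homogeneous pixel coordinates
    X_s = d * (np.linalg.inv(K) @ p_h)      # 3-D point in the I_s frame
    X_t = T[:3, :3] @ X_s + T[:3, 3]        # point expressed in the I_t frame
    x = K @ X_t
    return x[:2] / x[2]                     # p'_ti after perspective division

# Hypothetical intrinsics and a 10 cm translation along the x-axis.
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4); T[0, 3] = 0.1
p_t = project(np.array([320.0, 240.0]), T, d=2.0, K=K)
print(p_t)  # center pixel shifts right by f*tx/d = 525*0.1/2 = 26.25 px
```

A parallax of f·t/d pixels for a fronto-parallel translation is a quick sanity check on the implementation.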
Let p′_ti represent the projection position of p_si in I_t based on the parameters X_j. The ith row of J_j is

\[ J_j(i) = \left[ \frac{\partial e_i(X_j)}{\partial I_t(p'_{ti})} \, \frac{\partial I_t(p'_{ti})}{\partial p'_{ti}} \, \frac{\partial p'_{ti}}{\partial \delta} \right], \tag{5} \]

where ∂e_i(X_j)/∂I_t(p′_ti) and ∂p′_ti/∂δ are smooth compared with the increment ΔX_j. In contrast, ∂I_t(p′_ti)/∂p′_ti is much less smooth; as found in DSO, it is only valid within a 1-2 pixel radius. Hence effective optimization requires that all parameters involved in computing p′_ti be initialized accurately enough to be off by no more than 1-2 pixels. However, providing such an accurate initialization is difficult when there is a large view change between I_s and I_t.

Fig. 2. Illustration of θ and θ₀, which are used in the definition of the flow norm. x is the derivative of the residual with respect to p′_ti, which depends entirely on local information. Conversely, θ₀ relies on the global information p′_ti, p_oti, and σ.

IV. APPROACH
To deal with the local minima problem of direct alignment, we design a flow norm that guides the nonlinear solver out of local minima. Assume we have a coarse optical flow map between the image pair. The key idea of FlowNorm is to balance the local information (the residual-decreasing direction) against the global optical flow information. Because the optical flow is coarse and unreliable, we only down-weight those correspondences whose residual-decreasing directions are highly inconsistent with the corresponding flow positions.
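The local-minimum behavior illustrated in Fig. 1 can be checked numerically. The following sketch is our illustration, not from the paper: a toy 1-D sine signal stands in for the image, and a sample point is counted as "helpful" when its individual Gauss-Newton pull points toward the true shift.

```python
import numpy as np

# 1-D analogue of Fig. 1: align g(x + t) to f(x) = g(x + t_true) over t.
# A point helps the Gauss-Newton update when its individual pull
# -r*J/J^2 points toward t_true, i.e. when r * J * (t - t_true) > 0.
t_true = 0.6
g = lambda x: np.sin(x)          # toy 1-D "image"
f = lambda x: g(x + t_true)      # shifted copy of the same signal
g_prime = lambda x: np.cos(x)

def helpful_fraction(t, xs):
    """Fraction of sample points whose residual gradient pulls t toward t_true."""
    r = g(xs + t) - f(xs)        # residual g(x + t) - f(x)
    J = g_prime(xs + t)          # derivative of the residual w.r.t. t
    return np.mean(r * J * (t - t_true) > 0)

xs = np.linspace(0.0, 2.0 * np.pi, 200)
print(helpful_fraction(t_true + 0.05, xs))  # good initialization: nearly all points help
print(helpful_fraction(t_true + 2.0, xs))   # bad initialization: many points counteract
```

With a good initialization, almost every sample pulls in the right direction; far from the optimum, a large fraction of samples pulls the wrong way, which is exactly the situation the flow norm is designed to detect.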
A. Flow Norm
Following the definitions in Sect. III, F denotes the computed coarse flow map between I_s and I_t, P_ot represents the flow positions computed by P_ot = P_s + F, and P′_t is the projection position of P_s in I_t based on the current relative pose T. p_si, p_oti, and p′_ti denote the ith items of P_s, P_ot, and P′_t, respectively. The flow norm of the residual e_i is defined as

\[ L(p_{si}, p_{oti}, p'_{ti}, e_i) = \begin{cases} e_i, & \left| p_{oti} - p'_{ti} \right| \le \sigma \\ e_i, & \cos\theta_0 \le \cos\theta \\ \left( \dfrac{\cos\theta + 1}{\cos\theta_0 + 1} \right) e_i, & \cos\theta < \cos\theta_0, \end{cases} \tag{6} \]

where σ is the variance of the computed flow (the method for computing σ is described in Sect. IV-B) and |·| denotes the L2 norm of a vector. v = p_oti − p′_ti is the direction from the projection position p′_ti to the flow position p_oti, x = ∂e_i/∂p′_ti represents the derivative direction at p′_ti, and θ denotes the angle between v and x, computed as

\[ \cos(\theta) = \frac{v^{T} x}{|v| \, |x|}. \tag{7} \]

As shown in Fig. 2, when the projection position p′_ti lies outside the circle with p_oti as its center and σ as its radius, θ₀ represents the angle between the tangent line l and the direction v. Thus

\[ \cos(\theta_0) = \frac{\sqrt{v^{T} v - \sigma^2}}{|v|}. \tag{8} \]

In summary, we keep correspondences fully active when their projection positions are close to the flow positions or their local gradients agree with the global information.

Fig. 3. Illustration of the change of the flow factor s with increasing θ.

With the proposed flow norm, the cost function of direct alignment is formulated as

\[ \hat{X} = \arg\min_{X} \sum_{i=1}^{N} L(p_{si}, p_{oti}, p'_{ti}, e_i). \tag{9} \]
The new optimal update step of the GN method for the jth iteration is

\[ \Delta \tilde{X}_j = -\left( J_j^{T} S J_j \right)^{-1} J_j^{T} S E_j, \tag{10} \]

where S is a diagonal matrix whose ith diagonal entry is the flow norm factor of the ith residual, summarized as

\[ s_i = \begin{cases} 1, & \left| p_{oti} - p'_{ti} \right| \le \sigma \\ 1, & \cos\theta_0 \le \cos\theta \\ \dfrac{\cos\theta + 1}{\cos\theta_0 + 1}, & \cos\theta < \cos\theta_0. \end{cases} \tag{11} \]

Fig. 3 illustrates how the flow factor s changes as θ increases; the four curves correspond to different values of θ₀. From Eq. (8), for the same p′_ti and p_oti, a larger θ₀ corresponds to a larger flow uncertainty σ. For a large flow uncertainty, the flow norm takes more account of the local information and assigns it a larger weight. To prevent overshoot during convergence and to keep the optimized results from being biased by the noisy flow, we only apply the flow norm in the tracker when it runs on the coarse levels of the image pyramid. For example, the image pyramid of DSO has four levels, and we only apply our flow norm on the top two levels.

Although the form of the flow norm is similar to that of the Huber norm, their cores are very different: the Huber norm utilizes the local information of correspondences, while the flow norm depends on the coarse flow. In fact, they are complementary to each other.
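The per-residual weight of Eqs. (6)-(8) and (11) can be sketched as follows. This is our illustration of the reconstructed equations; it assumes v points from the projection position toward the flow position, as described above, and the input values are hypothetical.

```python
import numpy as np

def flow_norm_factor(p_proj, p_flow, grad, sigma):
    """Flow-norm weight s_i of Eqs. (6)-(8) and (11) for one residual.

    p_proj: projected position p'_ti under the current pose,
    p_flow: flow position p_oti from the predicted flow map,
    grad:   derivative direction x = de_i/dp'_ti (local information),
    sigma:  uncertainty radius of the coarse flow.
    """
    v = p_flow - p_proj                          # from projection toward flow position
    dist = np.linalg.norm(v)
    if dist <= sigma:                            # projection already inside flow circle
        return 1.0
    cos_t = float(v @ grad) / (dist * np.linalg.norm(grad))   # Eq. (7)
    cos_t0 = np.sqrt(dist**2 - sigma**2) / dist               # Eq. (8), tangent angle
    if cos_t >= cos_t0:                          # local gradient agrees with the flow
        return 1.0
    return (cos_t + 1.0) / (cos_t0 + 1.0)        # smooth down-weighting, Eq. (11)

# A gradient pointing at the flow position keeps full weight...
print(flow_norm_factor(np.array([10., 10.]), np.array([20., 10.]),
                       np.array([1., 0.]), sigma=2.0))   # -> 1.0
# ...while a gradient pointing directly away is suppressed entirely.
print(flow_norm_factor(np.array([10., 10.]), np.array([20., 10.]),
                       np.array([-1., 0.]), sigma=2.0))  # -> 0.0
```

The resulting factors form the diagonal of S in the weighted Gauss-Newton update of Eq. (10).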
B. Shrunken PWC-Net

To obtain the optical flow map, we employ a shrunken PWC-Net to predict the optical flow between two images. Approaches that learn to predict optical flow from an image pair have been studied in previous works [18]–[22]. However, due to their high computation cost, these networks cannot be migrated directly to our work. To obtain the optical flow efficiently, we take the baseline network PWC-Net [21] as a reference, and then shrink its convolutional layers and reduce its input image size. The shrinking process is a tradeoff between prediction accuracy and computing efficiency. As our method works on the coarse levels of the image pyramid, it can robustly utilize the inaccurate optical flow.

Fig. 4. Overview of the FlowNorm DSO. The predicted flow map is used to suppress those correspondences whose local gradients are highly inconsistent with the predicted flow in the top two levels of the image pyramid.

Firstly, we change the input of the network from three-channel RGB ([3 × …]) to single-channel grey images ([1 × …]); RGB images are converted to grey before being fed into the shrunken network. Secondly, we remove one coding block and two pooling operations from the encoder, so that the output size of the last encoding layer is [15 × …]. Finally, we remove one decoding block and reduce the correlation radius from 4 to 3, as the correlation operation of the decoding block is computationally expensive. The size of the predicted flow is [112 × …]. Our encoding and decoding blocks are identical to those of PWC-Net. The shrunken network architecture is shown in the supplementary video.

Let Θ be the set of all learnable parameters in our shrunken network. W^l_Θ and W^l_GT denote the predicted flow field and the corresponding ground truth at the lth pyramid level, respectively.
We use the same multiscale training loss proposed in FlowNet [19]:

\[ L(\Theta) = \sum_{l=l_0}^{L} \alpha_l \sum_{x} \left| W_{\Theta}^{l}(x) - W_{GT}^{l}(x) \right| + \gamma \left| \Theta \right|, \tag{12} \]

where the second term regularizes the parameters of the model against overfitting, the α_l are the balance weights for the different pyramid levels, and γ weights the regularizer.

The variance σ of the predicted flow is computed by averaging the squared L2 error of the prediction results on the testing dataset.
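A minimal numerical sketch of the loss in Eq. (12) and of the flow uncertainty σ follows. This is our illustration with toy values; in particular, we interpret the "averaged squared L2 error" as being turned into a radius by taking its square root before use in Eq. (6).

```python
import numpy as np

def multiscale_loss(preds, gts, alphas, theta, gamma):
    """Multiscale training loss of Eq. (12), FlowNet-style (illustrative).

    preds/gts: per-level flow fields of shape (H_l, W_l, 2);
    alphas: per-level balance weights; theta: flattened model parameters;
    gamma: regularization weight.
    """
    data = sum(a * np.linalg.norm(p - g, axis=-1).sum()
               for a, p, g in zip(alphas, preds, gts))
    return data + gamma * np.linalg.norm(theta)   # second term fights overfitting

def flow_sigma(pred, gt):
    """Flow uncertainty: mean squared L2 error over the test set,
    returned as a radius (its square root) for use in Eq. (6)."""
    return np.sqrt(((pred - gt) ** 2).sum(axis=-1).mean())

# Toy check: one level, two pixels, each off by the flow vector (3, 4).
pred = np.zeros((1, 2, 2)); pred[..., 0] = 3.0; pred[..., 1] = 4.0
gt = np.zeros((1, 2, 2))
print(multiscale_loss([pred], [gt], [1.0], np.zeros(4), gamma=0.0))  # -> 10.0
print(flow_sigma(pred, gt))                                          # -> 5.0
```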
C. Overview of the FlowNorm DSO

To demonstrate the effectiveness and efficiency of our method, we take our flow norm as a plug-in component for the baseline methods DSO and BA-Net. FlowNorm DSO is still a real-time system, which can be directly compared with DSO on any dataset.

As shown in Fig. 4, the plug-in is composed of three parts: the latest image is first fed into the encoder network to construct multi-scale feature maps. Then, the decoder takes the concatenation of the latest feature maps and the keyframe's feature maps and outputs a predicted flow map. Finally, the predicted flow map is used in the BA of the DSO tracker on the top two levels of the image pyramid. Although the pipeline needs to encode two images, each frame only needs to be encoded once, by buffering the feature maps of the active keyframes.
D. Comparison with FlowInit DSO
To completely prove the effectiveness and efficiency of the flow norm, we also construct a competitive strategy. Given p_si ∈ P_s = {p_si | i = 1 ⋯ N} and d_i ∈ D = {d_i | i = 1 ⋯ N} at I_s, and the predicted positions p_oti ∈ P_ot = {p_oti | i = 1 ⋯ N} at I_t, we compute an initial transform T₀ by minimizing the geometric error

\[ T_0 = \arg\min_{T} \sum_{i=1}^{N} \left| \pi(p_{si}, T, d_i) - p_{oti} \right|, \tag{13} \]

where π(·) is the projection function defined in Sect. III. Then, we take T₀ as an initialization for the tracker of DSO. We call the DSO initialized from the predicted flow map FlowInit DSO. We find that the performance of this initialization strategy is on par with the FlowNorm DSO strategy for well-predicted flow positions. However, for very poor flow predictions, tracking with the initialization strategy is highly unstable. Because we shrink the prediction network, the predicted flow usually has an overall offset, which the initialization from the geometric BA then inherits. More comparison details are shown in the next section.
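Eq. (13) can be illustrated with a toy Gauss-Newton solver. This is our sketch, not the paper's implementation: for brevity it estimates only a translation with the rotation fixed to identity, the intrinsics are hypothetical, and the Jacobian is computed numerically.

```python
import numpy as np

# Toy, translation-only version of Eq. (13): recover an initial camera
# translation by Gauss-Newton on the geometric error between projected
# points and flow positions. The paper solves for a full transform T.
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])

def project(X, t):
    """Project a 3-D point X under translation t with intrinsics K."""
    x = K @ (X + t)
    return x[:2] / x[2]

def estimate_translation(pixels, depths, flow_pos, iters=10):
    Kinv = np.linalg.inv(K)
    pts = [d * (Kinv @ np.array([u, v, 1.0])) for (u, v), d in zip(pixels, depths)]
    t = np.zeros(3)
    for _ in range(iters):
        J, r = [], []
        for X, q in zip(pts, flow_pos):
            r.append(project(X, t) - q)               # geometric residual
            eps = 1e-6                                # numerical Jacobian w.r.t. t
            J.append(np.stack([(project(X, t + eps * np.eye(3)[k]) - project(X, t)) / eps
                               for k in range(3)], axis=1))
        J, r = np.vstack(J), np.concatenate(r)
        t = t - np.linalg.solve(J.T @ J, J.T @ r)     # Gauss-Newton update
    return t

# Synthetic data: flow positions generated from a known translation.
pixels = [(100.0, 100.0), (500.0, 120.0), (120.0, 400.0), (480.0, 380.0)]
depths = [1.0, 2.0, 1.5, 2.5]
t_true = np.array([0.10, -0.05, 0.02])
flow_pos = [project(d * (np.linalg.inv(K) @ np.array([u, v, 1.0])), t_true)
            for (u, v), d in zip(pixels, depths)]
print(estimate_translation(pixels, depths, flow_pos))  # -> approx. [0.1, -0.05, 0.02]
```

With noise-free correspondences the solver recovers the translation; the instability described above arises when the flow positions and depths are noisy.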
V. EXPERIMENTS

To verify the effectiveness of our method, we build FlowNorm versions of DSO and BA-Net. We evaluate our system on a Linux machine with an Intel Core i7-7700 CPU at 3.50 GHz and an Nvidia Titan Xp GPU.

A. Training
We train the shrunken PWC-Net on the SceneNN [13] dataset, which consists of 94 Kinect-captured RGB-D image sequences with ground-truth poses. We select 44/25 image sequences from the SceneNN dataset as the training/testing sets, respectively. Then, we sample image pairs from the training and testing sets and generate the ground-truth optical flow by projecting pixels from one image to the other. During the projection process, we remove occluded areas by verifying whether the depth of a pixel is consistent with the depth at its projection position. Our shrunken PWC-Net is trained with ADAM [23] with an initial learning rate of 0.0001. The level weights α_l in the training loss defined in Eq. (12) are fixed per pyramid level, and the trade-off weight γ is set to 0.0004. Although our network lacks some of the layers of PWC-Net, we still load the parameters of PWC-Net into the corresponding layers of the shrunken version as initial parameters. The total training process takes one day on a computer with one Titan Xp.

B. FlowNorm in DSO
We compare FlowNorm DSO with the original DSO on two monocular datasets: the TUM-MonoVO dataset [14] and the ICL-NUIM dataset [15]. The TUM-MonoVO dataset provides 50 photometrically calibrated sequences, comprising different indoor and outdoor environments. The ICL-NUIM dataset contains 8 ray-traced sequences from two indoor environments. Since the TUM-MonoVO dataset only provides loop-closure ground truth, we evaluate all sequences using the alignment error defined in the TUM-MonoVO dataset.
Fig. 5. The accumulated number of runs whose alignment errors are smaller than e_align; larger is better. The testing dataset contains all downsampled sequences from the TUM-MonoVO and ICL-NUIM datasets.

To increase the difficulty of the evaluation, we add a new evaluation metric. We downsample the image sequences with skips of 1, 2, 3, …, 13 frames and run each sequence twice, for a total of 1508 runs. Apart from the alignment errors, we also measure two numbers for every sequence: the maximum skip number without losing tracking and the maximum skip number that can be tracked with acceptable accuracy. We take the alignment error of the sequence without downsampling as the reference for whether tracking has acceptable accuracy and label it error₀. If the alignment error of a downsampled sequence is smaller than three times its corresponding error₀, we mark the tracking result of the run as having acceptable accuracy.

Fig. 6. Comparison of the convergence ability on all 50 sequences of the TUM-MonoVO dataset: (a) the maximum skip number with acceptable tracking accuracy; (b) the maximum skip number without losing tracking.

Fig. 5 illustrates the statistical performance of FlowNorm DSO and DSO. The accuracy of FlowNorm DSO is better than that of the original DSO. The performance of the original DSO is on par with its FlowNorm version when the downsampling rate is low; however, FlowNorm DSO is more robust as the downsampling rate increases. Fig. 6 shows the maximum skip number with acceptable tracking accuracy and the maximum skip number without losing tracking for all sequences in the TUM-MonoVO dataset. FlowNorm DSO (blue) has consistently better performance than the original version. Note that we only downsample sequences with 1 to 13 steps; a maximum step of 13 means we did not observe losing tracking, or the tracking results of all runs were acceptable, on that sequence. Fig. 7 shows the tracking trajectories of FlowNorm DSO
(green) and the original DSO (red) on the first sequence of the TUM-MonoVO dataset with a downsampling rate of 10. DSO loses tracking in the black-box area, as the camera undergoes a large view change there.

Fig. 7. An example of losing tracking on the first sequence of the TUM-MonoVO dataset with a skip of 10 frames. The red and green trajectories are computed by DSO and FlowNorm DSO, respectively.
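The acceptance rule used in the evaluation above can be sketched as follows. This is our paraphrase of the metric, with hypothetical error values.

```python
import numpy as np

# A downsampled run is "acceptable" if its alignment error stays below
# three times error_0, the error of the same sequence without downsampling.

def max_acceptable_skip(errors_by_skip, error_0, factor=3.0):
    """Largest frame skip (1..13) still tracked with acceptable accuracy.

    errors_by_skip: {skip: alignment error}; np.nan marks lost tracking.
    """
    ok = [s for s, e in errors_by_skip.items()
          if np.isfinite(e) and e < factor * error_0]
    return max(ok) if ok else 0

def max_skip_without_losing(errors_by_skip):
    """Largest frame skip for which tracking did not get lost."""
    ok = [s for s, e in errors_by_skip.items() if np.isfinite(e)]
    return max(ok) if ok else 0

runs = {1: 0.50, 2: 0.60, 3: 2.00, 4: 0.90, 5: np.nan}  # hypothetical errors
print(max_acceptable_skip(runs, error_0=0.50))   # -> 4  (skip 3 fails the 3x bound)
print(max_skip_without_losing(runs))             # -> 4  (tracking lost at skip 5)
```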
C. FlowNorm in BA-Net
As the code of BA-Net is not publicly available, we construct a motion-tracking version of it. The constructed BA-Net is trained on our training/validation dataset. Similar to FlowNorm DSO, we use the predicted flow to guide the convergence of BA-Net. We use the remaining part of the SceneNN dataset to build a challenging image-pair dataset, and generate initial poses by adding rotation and translation noise to the ground-truth poses of these image pairs. The final number of image pairs is 60126. We compare how many pairs are successfully aligned by BA-Net and by its FlowNorm version. The results are 48261 and 37218 for the FlowNorm version and the original version, respectively, which shows that our method can further expand the convergence range of a tracker based on minimizing the feature-metric residual.
D. FlowNorm vs. FlowInit
To prove the effectiveness of the flow norm, we compare it against the competitive strategy described in Sect. IV-D, which computes an initial pose from the predicted flow directly. We find that if we simply take the computed initial pose as the initialization of the DSO tracker, the tracker becomes very unstable (it usually loses tracking when it gets an inaccurate optical flow). We believe the reason for such tracking loss is that the indirect method depends heavily on correct matching, while the depths and correspondences used to compute the initial pose both contain a lot of noise. Instead, we insert the computed initial pose into the queue of candidate poses in the DSO tracker; this queue in DSO is used to prevent loss of tracking. We compare FlowInit DSO and FlowNorm DSO, with results shown in Table I. In the table, "accept. acc." and "w/o losing" denote the maximum skip number with acceptable tracking accuracy and the maximum skip number without losing tracking, respectively. "Ave. align err." denotes the average of the first five alignment errors (downsampling rates 1 to 5). Due to space limitations, we only show the comparison results for the first three sequences of the TUM-MonoVO dataset.
TABLE I
FLOWNORM VS. FLOWINIT

Config    | Seq. | Ave. align err. | accept. acc. | w/o losing
DSO       | 01   | 0.5760          | 9            | 11
FlowInit  | 01   | 0.5380          | 9            | 13
FlowNorm  | 01   |                 |              |

E. Runtime analysis
In the implementation, we integrate the trained model into DSO via the PyTorch C++ API¹ and create a new thread for the flow prediction. The forward pass of the network runs on the GPU while the other modules of DSO run on the CPU, so the flow prediction does not affect the other modules of DSO. On our computer, the forward pass of the shrunken network takes 14 ms per frame.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have presented a flow norm that enhances the convergence range of direct alignment by utilizing a coarse flow map to constrain those correspondences that are highly inconsistent with it. We employed a shrunken PWC-Net to generate the coarse flow map and built variants of DSO and BA-Net to prove the effectiveness of the flow norm. We also compared the flow norm with a competitive strategy that obtains the initial pose from the predicted flow directly. Our experiments proved the effectiveness and efficiency of the flow norm. In future work, we plan to investigate new network architectures to increase the accuracy of the prediction network and to explore more formulations of the flow norm.

¹ https://pytorch.org/cppdocs/

REFERENCES

[1] Hatem Alismail, Brett Browning, and Simon Lucey. Photometric bundle adjustment for vision-based SLAM. In
Asian Conference on Computer Vision (ACCV), pages 324–341. Springer, 2016.
[2] Amal Delaunoy and Marc Pollefeys. Photometric bundle adjustment for dense multi-view 3D modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1486–1493, 2014.
[3] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), pages 834–849. Springer, 2014.
[4] Jakob Engel, Jürgen Sturm, and Daniel Cremers. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1449–1456, 2013.
[5] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 72–79. IEEE, 2009.
[6] David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.
[7] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[8] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M. Seitz. Multicore bundle adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3057–3064. IEEE, 2011.
[9] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.
[10] Chengzhou Tang and Ping Tan. BA-Net: Dense bundle adjustment network. In International Conference on Learning Representations (ICLR), 2019.
[11] Lukas von Stumberg, Patrick Wenzel, Qadeer Khan, and Daniel Cremers. GN-Net: The Gauss-Newton loss for multi-weather relocalization. 2019.
[12] Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan Leutenegger, and Andrew J. Davison. Learning to solve nonlinear least squares for monocular stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 284–299, 2018.
[13] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. In Proceedings of the International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016.
[14] Jakob Engel, Vladyslav Usenko, and Daniel Cremers. A photometrically calibrated benchmark for monocular visual odometry. arXiv preprint arXiv:1607.02555, 2016.
[15] Ankur Handa, Thomas Whelan, John McDonald, and Andrew J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1524–1531. IEEE, 2014.
[16] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.
[17] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[18] Jia Xu, René Ranftl, and Vladlen Koltun. Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1289–1297, 2017.
[19] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
[20] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
[21] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8934–8943, 2018.
[22] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1164–1172, 2015.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.