FlowNorm: A Learning-based Method for Increasing Convergence Range of Direct Alignment
Ke Wang, Kaixuan Wang, and Shaojie Shen
Abstract — Many approaches have been proposed to estimate camera poses by directly minimizing photometric error. However, due to the non-convex property of direct alignment, proper initialization is still required for these methods. Many robust norms (e.g., the Huber norm) have been proposed to deal with the outlier terms caused by incorrect initializations. These robust norms are defined solely on the magnitude of each error term. In this paper, we propose a novel robust norm, named FlowNorm, that exploits information from both the local error term and the global image registration. While the local information is defined on patch alignments, the global information is estimated using a learning-based network. Using both the local and global information, we achieve an unprecedented convergence range in which images can be aligned given large view-angle changes or small overlaps. We further demonstrate the usability of the proposed robust norm by integrating it into the direct methods DSO and BA-Net, generating more robust and accurate results in real time.
I. INTRODUCTION
Direct methods are widely used to solve visual odometry and monocular stereo problems [1]–[4]. By directly minimizing the photometric error between pixels in the source frame and the target frame, camera poses and scene geometry can be estimated in a joint optimization process. Compared with indirect methods [5]–[8], which solve the problem by minimizing the reprojection error between matched sparse features, direct methods avoid the pre-processed feature matching step and can utilize more pixels in the image. However, intensity-based optimization is prone to local minima due to the non-convex property of complex images.

In recent years, many approaches have been proposed to expand the convergence range of direct methods. SVO [9] combines matched feature points with photometric optimization. Although matched features can provide pose initialization for further optimization, they rely on the textures of the environment and are prone to outliers. With the help of learning-based methods, many researchers have proposed networks [10], [11] to generate smooth feature maps for direct optimization. Compared with the image intensity domain, optimization on feature maps shows advantages in convergence range. For example, BA-Net [10] can estimate camera poses given images with small overlaps. LS-Net [12] uses an end-to-end trained network as a solver for two-frame monocular stereo problems. Learning-based methods achieve superior performance on evaluation datasets, such as RGB-D datasets or the KITTI dataset, but have not been widely used on robotic platforms. The reason may be the limited
The authors are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong. [email protected]

Fig. 1. A simple example showing the different contributions of points when aligning two functions: \( \arg\min_t \sum_x \| g(x + t) - f(x) \| \). A point x with \( (g(x + t) - f(x))\, g'(x + t) < 0 \) contributes to the optimization of t and is marked in green, while a point with \( (g(x + t) - f(x))\, g'(x + t) > 0 \) counteracts the optimization and is marked in red. As shown in (a), with a good initialization, most of the points contribute positively to the optimization. However, with a worse initialization, as in (b), negative points make the optimization fall into a local minimum.

computation resources of general robotic platforms and the diversity of robotic application scenes.

One of the contributions of this paper is a study of the direct optimization process, followed by the design of a robust norm for the optimization. Due to non-convexity, during the photometric minimization (or feature-consistency minimization in learning-based methods), not all pixels contribute to the convergence. The difference between pixels depends on both the local texture and the global pose initialization, which establishes the pixel correspondences. We illustrate the convergence problem in Fig. 1. As shown, for good initializations, most of the correspondences contribute to the final estimation. However, given a bad initialization, most of the correspondences will suppress the convergence and make the optimization fall into local minima. Based on this observation, we propose the flow norm, which uses a low-accuracy optical flow prediction network to distinguish which correspondences will suppress the convergence of direct alignment.

In summary, the contributions of our paper are the following:
• We propose a new norm to expand the convergence range of the traditional nonlinear solver for the direct alignment problem.
• To the best of our knowledge, the proposed method is the first that can distinguish which correspondences will suppress solver convergence in the direct alignment setting.
• We build FlowNorm versions of DSO and BA-Net, and the FlowNorm DSO retains the real-time property.

To demonstrate the effectiveness of our method, we evaluate it on the SceneNN dataset [13], the TUM-MonoVO dataset [14], and the ICL-NUIM dataset [15], showing that the FlowNorm versions consistently outperform the original versions.

II. RELATED WORK
Semi-dense visual odometry [4] is a pioneering work that tracks a monocular camera in real time using a direct alignment algorithm. SVO [9] uses matched features to calculate an initial pose for joint optimization. Following the idea of the direct method, Engel et al. proposed LSD-SLAM [3], which solves the camera pose using keyframes with depth values. DSO [16] is the baseline direct alignment work, which jointly optimizes all model parameters, including geometry represented as inverse depth and camera motion. DSO further integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. Although all these methods feature real-time efficiency and high accuracy, they rely on incrementally tracking the camera poses to ensure large overlaps and proper initialization.

To increase the convergence range of direct methods, many learning-based methods have been proposed to replace the intensity map with feature maps. BA-Net [10] formulates bundle adjustment (BA) as a differentiable layer and utilizes a standard encoder-decoder network to generate the feature map and depth map. Camera poses and depth maps are optimized by minimizing the feature consistency between projected pixels. Benefiting from the generated feature maps, BA-Net expands the convergence range of direct alignment. GN-Net [11] uses a novel Gauss-Newton loss for training deep feature maps. The direct alignment in GN-Net, based on minimizing the feature-metric error, achieves robust performance under dynamic lighting or weather changes. These two approaches nicely combine traditional direct alignment and deep learning techniques.

LS-Net [12] uses an end-to-end trained network to replace the traditional nonlinear solver. Given a photometric error map and a Jacobian matrix, LS-Net estimates the updated depth map and camera motion.
Although it achieves impressive results on datasets, the generalization ability of LS-Net has not been demonstrated.

In this paper, we propose a different solution that improves the robustness of direct optimization. The core of the contribution is a robust norm that distinguishes error terms using both local and global information. Different from most learning-based methods, which use a heavy network to generate high-dimensional feature maps, we utilize a lightweight network to improve both the robustness and accuracy of state-of-the-art methods, with an overhead of only 14 ms.

III. DIRECT ALIGNMENT REVISITED
Before introducing our enhanced direct alignment algorithm, we revisit classic direct alignment to give a better understanding of where the difficulties lie and why our method is desirable. We only introduce the most relevant content, and refer readers to [16] for a more comprehensive introduction.

Given a source/target image pair I_s and I_t, the direct alignment problem is formulated as estimating the relative transformation T between the image pair, together with d_i ∈ D = {d_i | i = 1 ⋯ N}, the depths of the pixels p_si ∈ P_s = {p_si | i = 1 ⋯ N} in the image I_s. Let X = {T, D}; we can estimate X by minimizing the norm of the photometric error

\[ \hat{X} = \arg\min_{X} \sum_{i=1}^{N} \left| e_i(X) \right|, \tag{1} \]

where |·| denotes the L1 norm or Huber norm of a vector, N is the number of selected pixels, and the photometric error

\[ e_i(X) = I_t(p'_{ti}) - I_s(p_{si}) \tag{2} \]

measures the intensity difference between the ith pixel p_si in I_s and its corresponding pixel p′_ti in I_t. p′_ti is computed by the projection function

\[ p'_{ti} = \pi(p_{si}, T, d_i) = \frac{1}{s} K T d_i K^{-1} p_{si}, \tag{3} \]

which projects the 2-D point p_si from I_s to I_t, where d_i is the depth value of p_si in I_s, and K and s are the camera intrinsic matrix and a scale factor, respectively.

The general strategy for minimizing Eq. (1) is the Gauss-Newton (GN) or Levenberg-Marquardt (LM) algorithm [17]; both are iterative. At the jth iteration, the GN algorithm solves for an optimal update

\[ \Delta X_j = -\left( J_j^{T} J_j \right)^{-1} J_j^{T} E_j. \tag{4} \]

Here E_j = [e_1(X_j), e_2(X_j), ⋯, e_N(X_j)], where X_j contains the parameter values at the jth iteration. Let δ denote a small se(3) perturbation around X_j; J_j is the Jacobian matrix of E_j with respect to δ.
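For concreteness, the projection of Eq. (3) can be written out as code. This is our illustrative sketch, not the paper's implementation: the intrinsics are hypothetical, and the scale factor s is realized as the perspective division by the transformed depth.

```python
import numpy as np

def project(p_s, T, d, K):
    """Project pixel p_s = (u, v) from I_s into I_t, as in Eq. (3).

    Back-project with depth d, apply the rigid transform T = [R | t],
    and re-project with intrinsics K; the perspective division by the
    transformed depth plays the role of the scale factor s.
    """
    p_h = np.array([p_s[0], p_s[1], 1.0])   # homogeneous pixel coordinates
    X_s = d * (np.linalg.inv(K) @ p_h)      # 3-D point in the I_s frame
    X_t = T[:3, :3] @ X_s + T[:3, 3]        # point expressed in the I_t frame
    x = K @ X_t
    return x[:2] / x[2]                     # p'_ti after perspective division

# Hypothetical intrinsics and a 10 cm translation along the x-axis.
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4); T[0, 3] = 0.1
p_t = project(np.array([320.0, 240.0]), T, d=2.0, K=K)
print(p_t)  # center pixel shifts right by f*tx/d = 525*0.1/2 = 26.25 px
```

A parallax of f·t/d pixels for a fronto-parallel translation is a quick sanity check on the implementation.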
Let p′_ti represent the projection position of p_si in I_t based on the parameters X_j. The ith row of J_j is

\[ J_j(i) = \left[ \frac{\partial e_i(X_j)}{\partial I_t(p'_{ti})} \, \frac{\partial I_t(p'_{ti})}{\partial p'_{ti}} \, \frac{\partial p'_{ti}}{\partial \delta} \right], \tag{5} \]

where ∂e_i(X_j)/∂I_t(p′_ti) and ∂p′_ti/∂δ are smooth compared with the increment ΔX_j. In contrast, ∂I_t(p′_ti)/∂p′_ti is much less smooth; as found in DSO, it is only valid within a 1-2 pixel radius. Hence effective optimization requires that all parameters involved in computing p′_ti be initialized accurately enough to be off by no more than 1-2 pixels. However, providing such an accurate initialization is difficult when there is a large view change between I_s and I_t.

Fig. 2. Illustration of θ and θ₀, which are used in the definition of the flow norm. x is the derivative of the residual with respect to p′_ti, which depends entirely on local information. Conversely, θ₀ relies on the global information p′_ti, p_oti, and σ.

IV. APPROACH
To deal with the local minima problem of direct alignment, we design a flow norm that guides the nonlinear solver out of local minima. Assume we have a coarse optical flow map between the image pair. The key idea of FlowNorm is to balance the local information (the residual-decreasing direction) against the global optical flow information. Because the optical flow is coarse and unreliable, we only down-weight those correspondences whose residual-decreasing directions are highly inconsistent with the corresponding flow positions.
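The local-minimum behavior illustrated in Fig. 1 can be checked numerically. The following sketch is our illustration, not from the paper: a toy 1-D sine signal stands in for the image, and a sample point is counted as "helpful" when its individual Gauss-Newton pull points toward the true shift.

```python
import numpy as np

# 1-D analogue of Fig. 1: align g(x + t) to f(x) = g(x + t_true) over t.
# A point helps the Gauss-Newton update when its individual pull
# -r*J/J^2 points toward t_true, i.e. when r * J * (t - t_true) > 0.
t_true = 0.6
g = lambda x: np.sin(x)          # toy 1-D "image"
f = lambda x: g(x + t_true)      # shifted copy of the same signal
g_prime = lambda x: np.cos(x)

def helpful_fraction(t, xs):
    """Fraction of sample points whose residual gradient pulls t toward t_true."""
    r = g(xs + t) - f(xs)        # residual g(x + t) - f(x)
    J = g_prime(xs + t)          # derivative of the residual w.r.t. t
    return np.mean(r * J * (t - t_true) > 0)

xs = np.linspace(0.0, 2.0 * np.pi, 200)
print(helpful_fraction(t_true + 0.05, xs))  # good initialization: nearly all points help
print(helpful_fraction(t_true + 2.0, xs))   # bad initialization: many points counteract
```

With a good initialization, almost every sample pulls in the right direction; far from the optimum, a large fraction of samples pulls the wrong way, which is exactly the situation the flow norm is designed to detect.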
A. Flow Norm
Following the definitions in Sect. III, F denotes the computed coarse flow map between I_s and I_t, P_ot represents the flow positions computed by P_ot = P_s + F, and P′_t is the projection position of P_s in I_t based on the current relative pose T. p_si, p_oti, and p′_ti denote the ith items of P_s, P_ot, and P′_t, respectively. The flow norm of the residual e_i is defined as

\[ L(p_{si}, p_{oti}, p'_{ti}, e_i) = \begin{cases} e_i, & \left| p_{oti} - p'_{ti} \right| \le \sigma \\ e_i, & \cos\theta_0 \le \cos\theta \\ \left( \dfrac{\cos\theta + 1}{\cos\theta_0 + 1} \right) e_i, & \cos\theta < \cos\theta_0, \end{cases} \tag{6} \]

where σ is the variance of the computed flow (the method for computing σ is described in Sect. IV-B) and |·| denotes the L2 norm of a vector. v = p_oti − p′_ti is the direction from the projection position p′_ti to the flow position p_oti, x = ∂e_i/∂p′_ti represents the derivative direction at p′_ti, and θ denotes the angle between v and x, computed as

\[ \cos(\theta) = \frac{v^{T} x}{|v| \, |x|}. \tag{7} \]

As shown in Fig. 2, when the projection position p′_ti lies outside the circle with p_oti as its center and σ as its radius, θ₀ represents the angle between the tangent line l and the direction v. Thus

\[ \cos(\theta_0) = \frac{\sqrt{v^{T} v - \sigma^2}}{|v|}. \tag{8} \]

In summary, we keep correspondences fully active when their projection positions are close to the flow positions or their local gradients agree with the global information.

Fig. 3. Illustration of the change of the flow factor s with increasing θ.

With the proposed flow norm, the cost function of direct alignment is formulated as

\[ \hat{X} = \arg\min_{X} \sum_{i=1}^{N} L(p_{si}, p_{oti}, p'_{ti}, e_i). \tag{9} \]
The new optimal update step of the GN method for the jth iteration is

\[ \Delta \tilde{X}_j = -\left( J_j^{T} S J_j \right)^{-1} J_j^{T} S E_j, \tag{10} \]

where S is a diagonal matrix whose ith diagonal entry is the flow norm factor of the ith residual, summarized as

\[ s_i = \begin{cases} 1, & \left| p_{oti} - p'_{ti} \right| \le \sigma \\ 1, & \cos\theta_0 \le \cos\theta \\ \dfrac{\cos\theta + 1}{\cos\theta_0 + 1}, & \cos\theta < \cos\theta_0. \end{cases} \tag{11} \]

Fig. 3 illustrates how the flow factor s changes as θ increases; the four curves correspond to different values of θ₀. From Eq. (8), for the same p′_ti and p_oti, a larger θ₀ corresponds to a larger flow uncertainty σ. For a large flow uncertainty, the flow norm takes more account of the local information and assigns it a larger weight. To prevent overshoot during convergence and to keep the optimized results from being biased by the noisy flow, we only apply the flow norm in the tracker when it runs on the coarse levels of the image pyramid. For example, the image pyramid of DSO has four levels, and we only apply our flow norm on the top two levels.

Although the form of the flow norm is similar to that of the Huber norm, their cores are very different: the Huber norm utilizes the local information of correspondences, while the flow norm depends on the coarse flow. In fact, they are complementary to each other.
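The per-residual weight of Eqs. (6)-(8) and (11) can be sketched as follows. This is our illustration of the reconstructed equations; it assumes v points from the projection position toward the flow position, as described above, and the input values are hypothetical.

```python
import numpy as np

def flow_norm_factor(p_proj, p_flow, grad, sigma):
    """Flow-norm weight s_i of Eqs. (6)-(8) and (11) for one residual.

    p_proj: projected position p'_ti under the current pose,
    p_flow: flow position p_oti from the predicted flow map,
    grad:   derivative direction x = de_i/dp'_ti (local information),
    sigma:  uncertainty radius of the coarse flow.
    """
    v = p_flow - p_proj                          # from projection toward flow position
    dist = np.linalg.norm(v)
    if dist <= sigma:                            # projection already inside flow circle
        return 1.0
    cos_t = float(v @ grad) / (dist * np.linalg.norm(grad))   # Eq. (7)
    cos_t0 = np.sqrt(dist**2 - sigma**2) / dist               # Eq. (8), tangent angle
    if cos_t >= cos_t0:                          # local gradient agrees with the flow
        return 1.0
    return (cos_t + 1.0) / (cos_t0 + 1.0)        # smooth down-weighting, Eq. (11)

# A gradient pointing at the flow position keeps full weight...
print(flow_norm_factor(np.array([10., 10.]), np.array([20., 10.]),
                       np.array([1., 0.]), sigma=2.0))   # -> 1.0
# ...while a gradient pointing directly away is suppressed entirely.
print(flow_norm_factor(np.array([10., 10.]), np.array([20., 10.]),
                       np.array([-1., 0.]), sigma=2.0))  # -> 0.0
```

The resulting factors form the diagonal of S in the weighted Gauss-Newton update of Eq. (10).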
B. Shrunken PWC-Net

To obtain the optical flow map, we employ a shrunken PWC-Net to predict the optical flow between two images. Approaches that learn to predict optical flow from an image pair have been studied in previous works [18]–[22]. However, due to their high computation cost, these networks cannot be migrated directly to our work. To obtain the optical flow efficiently, we take the baseline network PWC-Net [21] as a reference, and then shrink its convolutional layers and reduce its input image size. The shrinking process is a tradeoff between prediction accuracy and computing efficiency. As our method works on the coarse levels of the image pyramid, it can robustly utilize the inaccurate optical flow.

Fig. 4. Overview of the FlowNorm DSO. The predicted flow map is used to suppress those correspondences whose local gradients are highly inconsistent with the predicted flow in the top two levels of the image pyramid.

Firstly, we change the input of the network from three-channel RGB ([3 × …]) to single-channel grey images ([1 × …]); RGB images are converted to grey before being fed into the shrunken network. Secondly, we remove one coding block and two pooling operations from the encoder, so that the output size of the last encoding layer is [15 × …]. Finally, we remove one decoding block and reduce the correlation radius from 4 to 3, as the correlation operation of the decoding block is computationally expensive. The size of the predicted flow is [112 × …]. Our encoding and decoding blocks are identical to those of PWC-Net. The shrunken network architecture is shown in the supplementary video.

Let Θ be the set of all learnable parameters in our shrunken network. W^l_Θ and W^l_GT denote the predicted flow field and the corresponding ground truth at the lth pyramid level, respectively.
We use the same multiscale training loss proposed in FlowNet [19]:

\[ L(\Theta) = \sum_{l=l_0}^{L} \alpha_l \sum_{x} \left| W_{\Theta}^{l}(x) - W_{GT}^{l}(x) \right| + \gamma \left| \Theta \right|, \tag{12} \]

where the second term regularizes the parameters of the model against overfitting, the α_l are the balance weights for the different pyramid levels, and γ weights the regularizer.

The variance σ of the predicted flow is computed by averaging the squared L2 error of the prediction results on the testing dataset.
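A minimal numerical sketch of the loss in Eq. (12) and of the flow uncertainty σ follows. This is our illustration with toy values; in particular, we interpret the "averaged squared L2 error" as being turned into a radius by taking its square root before use in Eq. (6).

```python
import numpy as np

def multiscale_loss(preds, gts, alphas, theta, gamma):
    """Multiscale training loss of Eq. (12), FlowNet-style (illustrative).

    preds/gts: per-level flow fields of shape (H_l, W_l, 2);
    alphas: per-level balance weights; theta: flattened model parameters;
    gamma: regularization weight.
    """
    data = sum(a * np.linalg.norm(p - g, axis=-1).sum()
               for a, p, g in zip(alphas, preds, gts))
    return data + gamma * np.linalg.norm(theta)   # second term fights overfitting

def flow_sigma(pred, gt):
    """Flow uncertainty: mean squared L2 error over the test set,
    returned as a radius (its square root) for use in Eq. (6)."""
    return np.sqrt(((pred - gt) ** 2).sum(axis=-1).mean())

# Toy check: one level, two pixels, each off by the flow vector (3, 4).
pred = np.zeros((1, 2, 2)); pred[..., 0] = 3.0; pred[..., 1] = 4.0
gt = np.zeros((1, 2, 2))
print(multiscale_loss([pred], [gt], [1.0], np.zeros(4), gamma=0.0))  # -> 10.0
print(flow_sigma(pred, gt))                                          # -> 5.0
```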
C. Overview of the FlowNorm DSO

To demonstrate the effectiveness and efficiency of our method, we take our flow norm as a plug-in component for the baseline methods DSO and BA-Net. FlowNorm DSO is still a real-time system, which can be directly compared with DSO on any dataset.

As shown in Fig. 4, the plug-in is composed of three parts: the latest image is first fed into the encoder network to construct multi-scale feature maps. Then, the decoder takes the concatenation of the latest feature maps and the keyframe's feature maps and outputs a predicted flow map. Finally, the predicted flow map is used in the BA of the DSO tracker on the top two levels of the image pyramid. Although the pipeline needs to encode two images, each frame only needs to be encoded once, by buffering the feature maps of the active keyframes.
D. Comparison with FlowInit DSO
To completely prove the effectiveness and efficiency of the flow norm, we also construct a competitive strategy. Given p_si ∈ P_s = {p_si | i = 1 ⋯ N} and d_i ∈ D = {d_i | i = 1 ⋯ N} at I_s, and the predicted positions p_oti ∈ P_ot = {p_oti | i = 1 ⋯ N} at I_t, we compute an initial transform T₀ by minimizing the geometric error

\[ T_0 = \arg\min_{T} \sum_{i=1}^{N} \left| \pi(p_{si}, T, d_i) - p_{oti} \right|, \tag{13} \]

where π(·) is the projection function defined in Sect. III. Then, we take T₀ as an initialization for the tracker of DSO. We call the DSO initialized from the predicted flow map FlowInit DSO. We find that the performance of this initialization strategy is on par with the FlowNorm DSO strategy for well-predicted flow positions. However, for very poor flow predictions, tracking with the initialization strategy is highly unstable. Because we shrink the prediction network, the predicted flow usually has an overall offset, which the initialization from the geometric BA then inherits. More comparison details are shown in the next section.
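Eq. (13) can be illustrated with a toy Gauss-Newton solver. This is our sketch, not the paper's implementation: for brevity it estimates only a translation with the rotation fixed to identity, the intrinsics are hypothetical, and the Jacobian is computed numerically.

```python
import numpy as np

# Toy, translation-only version of Eq. (13): recover an initial camera
# translation by Gauss-Newton on the geometric error between projected
# points and flow positions. The paper solves for a full transform T.
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])

def project(X, t):
    """Project a 3-D point X under translation t with intrinsics K."""
    x = K @ (X + t)
    return x[:2] / x[2]

def estimate_translation(pixels, depths, flow_pos, iters=10):
    Kinv = np.linalg.inv(K)
    pts = [d * (Kinv @ np.array([u, v, 1.0])) for (u, v), d in zip(pixels, depths)]
    t = np.zeros(3)
    for _ in range(iters):
        J, r = [], []
        for X, q in zip(pts, flow_pos):
            r.append(project(X, t) - q)               # geometric residual
            eps = 1e-6                                # numerical Jacobian w.r.t. t
            J.append(np.stack([(project(X, t + eps * np.eye(3)[k]) - project(X, t)) / eps
                               for k in range(3)], axis=1))
        J, r = np.vstack(J), np.concatenate(r)
        t = t - np.linalg.solve(J.T @ J, J.T @ r)     # Gauss-Newton update
    return t

# Synthetic data: flow positions generated from a known translation.
pixels = [(100.0, 100.0), (500.0, 120.0), (120.0, 400.0), (480.0, 380.0)]
depths = [1.0, 2.0, 1.5, 2.5]
t_true = np.array([0.10, -0.05, 0.02])
flow_pos = [project(d * (np.linalg.inv(K) @ np.array([u, v, 1.0])), t_true)
            for (u, v), d in zip(pixels, depths)]
print(estimate_translation(pixels, depths, flow_pos))  # -> approx. [0.1, -0.05, 0.02]
```

With noise-free correspondences the solver recovers the translation; the instability described above arises when the flow positions and depths are noisy.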
V. EXPERIMENTS

To verify the effectiveness of our method, we build FlowNorm versions of DSO and BA-Net. We evaluate our system on a Linux machine with an Intel Core i7-7700 CPU at 3.50 GHz and an Nvidia Titan Xp GPU.

A. Training
We train the shrunken PWC-Net on the SceneNN [13] dataset, which consists of 94 Kinect-captured RGB-D image sequences with ground-truth poses. We select 44/25 image sequences from the SceneNN dataset as the training/testing sets, respectively. Then, we sample image pairs from the training and testing sets and generate the ground-truth optical flow by projecting pixels from one image to the other. During the projection process, we remove occluded areas by verifying whether the depth of a pixel is consistent with the depth at its projection position. Our shrunken PWC-Net is trained with ADAM [23] with an initial learning rate of 0.0001. The level weights α_l in the training loss defined in Eq. (12) are fixed per pyramid level, and the trade-off weight γ is set to 0.0004. Although our network lacks some of the layers of PWC-Net, we still load the parameters of PWC-Net into the corresponding layers of the shrunken version as initial parameters. The total training process takes one day on a computer with one Titan Xp.

B. FlowNorm in DSO
We compare FlowNorm DSO with the original DSO on two monocular datasets: the TUM-MonoVO dataset [14] and the ICL-NUIM dataset [15]. The TUM-MonoVO dataset provides 50 photometrically calibrated sequences, comprising different indoor and outdoor environments. The ICL-NUIM dataset contains 8 ray-traced sequences from two indoor environments. Since the TUM-MonoVO dataset only provides loop-closure ground truth, we evaluate all sequences using the alignment error defined in the TUM-MonoVO dataset.
Fig. 5. The accumulated number of runs whose alignment errors are smaller than e_align; larger is better. The testing dataset contains all downsampled sequences from the TUM-MonoVO and ICL-NUIM datasets.

To increase the difficulty of the evaluation, we add a new evaluation metric. We downsample the image sequences with skips of 1, 2, 3, …, 13 frames and run each sequence twice, for a total of 1508 runs. Apart from the alignment errors, we also measure two numbers for every sequence: the maximum skip number without losing tracking and the maximum skip number that can be tracked with acceptable accuracy. We take the alignment error of the sequence without downsampling as the reference for whether tracking has acceptable accuracy and label it error₀. If the alignment error of a downsampled sequence is smaller than three times its corresponding error₀, we mark the tracking result of the run as having acceptable accuracy.

Fig. 6. Comparison of the convergence ability on all 50 sequences of the TUM-MonoVO dataset: (a) the maximum skip number with acceptable tracking accuracy; (b) the maximum skip number without losing tracking.

Fig. 5 illustrates the statistical performance of FlowNorm DSO and DSO. The accuracy of FlowNorm DSO is better than that of the original DSO. The performance of the original DSO is on par with its FlowNorm version when the downsampling rate is low; however, FlowNorm DSO is more robust as the downsampling rate increases. Fig. 6 shows the maximum skip number with acceptable tracking accuracy and the maximum skip number without losing tracking for all sequences in the TUM-MonoVO dataset. FlowNorm DSO (blue) has consistently better performance than the original version. Note that we only downsample sequences with 1 to 13 steps; a maximum step of 13 means we did not observe losing tracking, or the tracking results of all runs were acceptable, on that sequence. Fig. 7 shows the tracking trajectories of FlowNorm DSO
(green) and the original DSO (red) on the first sequence of the TUM-MonoVO dataset with a downsampling rate of 10. DSO loses tracking in the black-box area, as the camera undergoes a large view change there.

Fig. 7. An example of losing tracking on the first sequence of the TUM-MonoVO dataset with a skip of 10 frames. The red and green trajectories are computed by DSO and FlowNorm DSO, respectively.
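The acceptance rule used in the evaluation above can be sketched as follows. This is our paraphrase of the metric, with hypothetical error values.

```python
import numpy as np

# A downsampled run is "acceptable" if its alignment error stays below
# three times error_0, the error of the same sequence without downsampling.

def max_acceptable_skip(errors_by_skip, error_0, factor=3.0):
    """Largest frame skip (1..13) still tracked with acceptable accuracy.

    errors_by_skip: {skip: alignment error}; np.nan marks lost tracking.
    """
    ok = [s for s, e in errors_by_skip.items()
          if np.isfinite(e) and e < factor * error_0]
    return max(ok) if ok else 0

def max_skip_without_losing(errors_by_skip):
    """Largest frame skip for which tracking did not get lost."""
    ok = [s for s, e in errors_by_skip.items() if np.isfinite(e)]
    return max(ok) if ok else 0

runs = {1: 0.50, 2: 0.60, 3: 2.00, 4: 0.90, 5: np.nan}  # hypothetical errors
print(max_acceptable_skip(runs, error_0=0.50))   # -> 4  (skip 3 fails the 3x bound)
print(max_skip_without_losing(runs))             # -> 4  (tracking lost at skip 5)
```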
C. FlowNorm in BA-Net
As the code of BA-Net is not publicly available, we construct a motion-tracking version of it. The constructed BA-Net is trained on our training/validation dataset. Similar to FlowNorm DSO, we use the predicted flow to guide the convergence of BA-Net. We use the remaining part of the SceneNN dataset to build a challenging image-pair dataset, and generate initial poses by adding rotation and translation noise to the ground-truth poses of these image pairs. The final number of image pairs is 60126. We compare how many pairs are successfully aligned by BA-Net and by its FlowNorm version. The results are 48261 and 37218 for the FlowNorm version and the original version, respectively, which shows that our method can further expand the convergence range of a tracker based on minimizing the feature-metric residual.
D. FlowNorm vs. FlowInit
To prove the effectiveness of the flow norm, we compare it against the competitive strategy described in Sect. IV-D, which computes an initial pose from the predicted flow directly. We find that if we simply take the computed initial pose as the initialization of the DSO tracker, the tracker becomes very unstable (it usually loses tracking when it gets an inaccurate optical flow). We believe the reason for such tracking loss is that the indirect method depends heavily on correct matching, while the depths and correspondences used to compute the initial pose both contain a lot of noise. Instead, we insert the computed initial pose into the queue of candidate poses in the DSO tracker; this queue in DSO is used to prevent loss of tracking. We compare FlowInit DSO and FlowNorm DSO, with results shown in Table I. In the table, "accept. acc." and "w/o losing" denote the maximum skip number with acceptable tracking accuracy and the maximum skip number without losing tracking, respectively. "Ave. align err." denotes the average of the first five alignment errors (downsampling rates 1 to 5). Due to space limitations, we only show the comparison results for the first three sequences of the TUM-MonoVO dataset.
TABLE I
FLOWNORM VS. FLOWINIT

Config    | Seq. | Ave. align err. | accept. acc. | w/o losing
DSO       | 01   | 0.5760          | 9            | 11
FlowInit  | 01   | 0.5380          | 9            | 13
FlowNorm  | 01   |                 |              |

E. Runtime analysis
In the implementation, we integrate the trained model into DSO via the PyTorch C++ API¹ and create a new thread for the flow prediction. The forward pass of the network runs on the GPU while the other modules of DSO run on the CPU, so the flow prediction does not affect the other modules of DSO. On our computer, the forward pass of the shrunken network takes 14 ms per frame.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have presented a flow norm that enhances the convergence range of direct alignment by utilizing a coarse flow map to constrain those correspondences that are highly inconsistent with it. We employed a shrunken PWC-Net to generate the coarse flow map and built variants of DSO and BA-Net to prove the effectiveness of the flow norm. We also compared the flow norm with a competitive strategy that obtains the initial pose from the predicted flow directly. Our experiments proved the effectiveness and efficiency of the flow norm. In future work, we plan to investigate new network architectures to increase the accuracy of the prediction network and to explore more formulations of the flow norm.

¹ https://pytorch.org/cppdocs/

REFERENCES

[1] Hatem Alismail, Brett Browning, and Simon Lucey. Photometric bundle adjustment for vision-based SLAM. In
Asian Conference on Computer Vision (ACCV), pages 324–341. Springer, 2016.
[2] Amal Delaunoy and Marc Pollefeys. Photometric bundle adjustment for dense multi-view 3D modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1486–1493, 2014.
[3] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), pages 834–849. Springer, 2014.
[4] Jakob Engel, Jürgen Sturm, and Daniel Cremers. Semi-dense visual odometry for a monocular camera. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1449–1456, 2013.
[5] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 72–79. IEEE, 2009.
[6] David Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.
[7] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[8] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M. Seitz. Multicore bundle adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3057–3064. IEEE, 2011.
[9] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.
[10] Chengzhou Tang and Ping Tan. BA-Net: Dense bundle adjustment network. In International Conference on Learning Representations (ICLR), 2019.
[11] Lukas von Stumberg, Patrick Wenzel, Qadeer Khan, and Daniel Cremers. GN-Net: The Gauss-Newton loss for multi-weather relocalization. 2019.
[12] Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan Leutenegger, and Andrew J. Davison. Learning to solve nonlinear least squares for monocular stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 284–299, 2018.
[13] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. In Proceedings of the International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016.
[14] Jakob Engel, Vladyslav Usenko, and Daniel Cremers. A photometrically calibrated benchmark for monocular visual odometry. arXiv preprint arXiv:1607.02555, 2016.
[15] Ankur Handa, Thomas Whelan, John McDonald, and Andrew J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1524–1531. IEEE, 2014.
[16] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.
[17] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[18] Jia Xu, René Ranftl, and Vladlen Koltun. Accurate optical flow via direct cost volume processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1289–1297, 2017.
[19] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
[20] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2462–2470, 2017.
[21] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8934–8943, 2018.
[22] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1164–1172, 2015.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.