On-line non-overlapping camera calibration net
Zhao Fangda, Toru Tamaki, Takio Kurita, Bisser Raytchev, Kazufumi Kaneda
Hiroshima University, Hiroshima 739-8527, Japan
Abstract
We propose an easy-to-use non-overlapping camera calibration method. First, successive images are fed to a PoseNet-based network to obtain the ego-motion of each camera between frames. Next, the pose between the cameras is estimated. Instead of using a batch method, we propose an on-line method for the inter-camera pose estimation. Furthermore, we implement the entire procedure on a computation graph. Experiments with simulations and the KITTI dataset show the proposed method to be effective in simulation.
Camera calibration is one of the fundamental tasks in computer vision and has been studied for decades, yet it remains a popular topic. Recently, calibration methods have been proposed for cameras that do not share fields of view (FOVs), a setting called non-overlapping calibration. In this case, standard stereo camera calibration methods do not work. For this task, many methods have been proposed, such as SLAM-based [3, 5], mirror-based [1, 14, 18, 19], tracking-based [15, 2], trajectory-based [16, 6], and AR marker-based [21, 22] methods.

However, these approaches are off-line or computationally demanding. The use of mirrors [1, 14, 18, 19] and AR markers [21, 22] forces the calibration to be done before moving the calibration rig or driving cars (in the case of car-mounted cameras). Trajectory-based methods [16, 6] take camera trajectory data as a batch to be processed after shooting videos with the cameras. SLAM-based methods [3, 5] may work on-line; however, considerable computational resources are usually required.

In this paper, we propose an on-line calibration method for non-overlapping cameras. Assuming the inter-camera pose is fixed, the proposed method continuously takes video frames as input and computes the inter-camera pose at each frame. This makes the calibration easy to use in practical situations. Our contributions are:

• First, we develop an on-line method for estimating the inter-camera pose by extending an existing batch trajectory-based method.
• Second, we combine a PoseNet-based ego-motion network with our on-line pose update scheme.
• Finally, we implement the entire procedure on a computation graph; that is, the network consists of an ego-motion part and an inter-camera pose update part.
Preprint. Work in progress.

Here we briefly review the trajectory-based method [7] on which our proposed method is based. Without loss of generality, we assume that there are two cameras fixed on the same rig (hence their relative pose does not change); one is called the master and the other the slave, and hereafter we call them camera 0 and camera 1, respectively. Fixing the coordinate system of camera 0, the problem is to estimate the coordinate system of camera 1, represented by a rotation matrix $\Delta R$ and a translation $\Delta T$. In addition, the scale difference $\Delta\lambda$ between the two cameras is also estimated; that is, the two camera coordinate systems are related by a similarity transformation. To this end, two sequences of camera poses are obtained by moving the rig carrying the two cameras and estimating the camera motions with respect to their initial positions. More specifically, $R^i_t$ and $T^i_t$ denote the pose of camera $i$ at time $t$, with $R^0_0 = I$, $T^0_0 = 0$, $R^1_0 = \Delta R$, and $T^1_0 = \Delta T$. Further, we denote by $q^i_t = (w^i_t, x^i_t, y^i_t, z^i_t)$ the quaternion of the rotation $R^i_t$ (writing $q^0_t = (w_t, x_t, y_t, z_t)$ for camera 0), and by $A^i_t$ the matrix composed of the quaternions of the two cameras at frame $t$, defined by
$$
A^i_t =
\begin{pmatrix}
w_t - w^i_t & -x_t + x^i_t & -y_t + y^i_t & -z_t + z^i_t \\
x_t - x^i_t & w_t - w^i_t & -z_t - z^i_t & y_t + y^i_t \\
y_t - y^i_t & z_t + z^i_t & w_t - w^i_t & -x_t - x^i_t \\
z_t - z^i_t & -y_t - y^i_t & x_t + x^i_t & w_t - w^i_t
\end{pmatrix}.
$$
Given camera poses $\{R^i_t, T^i_t\}$ ($t = 1, 2, \ldots, N$) over $N$ frames, the rotation between the cameras, $\Delta q$ (the quaternion of $\Delta R$), is obtained by solving
$$
\min_{\Delta q} \sum_t \| A^1_t \Delta q \|^2. \quad (1)
$$
Once the rotation $\Delta R$ is recovered, the translation $\Delta T$ and the scale difference $\Delta\lambda$ are obtained by solving
$$
\min_{\Delta x} \sum_t \| B^1_t \Delta x - T^1_t \|^2, \quad
\Delta x = \begin{pmatrix} \Delta T \\ \Delta\lambda \end{pmatrix}, \quad (2)
$$
where $B^i_t = (I - R^i_t,\ \Delta R\, T^0_t)$.

The trajectory-based method [7] needs camera poses at each time step. Here we use a PoseNet-based ego-motion estimation method. PoseNet [13] is a network that estimates a 6D camera pose by matching the current video frame with learned 3D scenes. In this sense, it performs camera re-localization, because it cannot deal with new scenes that were not seen during training.
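To make problem (1) concrete, here is a minimal numpy sketch on a synthetic setup of our own (not the paper's code; `quat_mul` and `A_block` are hypothetical helper names, and the multiplication-order convention for $\Delta q$ is our assumption). It generates quaternion pairs related by a fixed $\Delta q$ and recovers it as the right singular vector of the least singular value of the stacked matrix.

```python
import numpy as np

def quat_mul(p, q):
    # Hamilton product p * q; quaternions stored as (w, x, y, z)
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def A_block(q0, q1):
    # L(q0) - R(q1): left-multiplication matrix of q0 minus right-multiplication
    # matrix of q1, so that A_block(q0, q1) @ dq = q0*dq - dq*q1
    w, x, y, z = q0
    L = np.array([[w, -x, -y, -z],
                  [x,  w, -z,  y],
                  [y,  z,  w, -x],
                  [z, -y,  x,  w]])
    w, x, y, z = q1
    R = np.array([[w, -x, -y, -z],
                  [x,  w,  z, -y],
                  [y, -z,  w,  x],
                  [z,  y, -x,  w]])
    return L - R

rng = np.random.default_rng(0)
dq = rng.normal(size=4)
dq /= np.linalg.norm(dq)                  # ground-truth inter-camera rotation quaternion
dq_inv = dq * np.array([1, -1, -1, -1])   # conjugate = inverse for a unit quaternion

blocks = []
for _ in range(50):                       # N synthetic frames
    q0 = rng.normal(size=4)
    q0 /= np.linalg.norm(q0)              # camera-0 rotation at frame t
    q1 = quat_mul(quat_mul(dq_inv, q0), dq)  # camera-1 rotation consistent with dq
    blocks.append(A_block(q0, q1))

# Problem (1): dq is the right singular vector of the least singular value
_, _, Vt = np.linalg.svd(np.vstack(blocks))
dq_est = Vt[-1]
print(abs(dq_est @ dq))                   # close to 1: dq is recovered up to sign
```

In noise-free data the stacked matrix has an exact one-dimensional null space spanned by $\Delta q$; with noisy poses the same singular vector gives the least-squares solution.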
PoseNet has been extended to an ego-motion estimation network [23], which takes successive video frames and computes the depth of the scene as well as the camera motion between the frames (i.e., the ego-motion).

A straightforward approach would be to feed the ego-motion results to the trajectory-based method. However, this naive idea has the following drawbacks. First, it is a batch approach, because a specified number of frames (e.g., 200) is necessary to compute the inter-camera pose. One therefore has to wait for hundreds of frames to obtain a result, which is undesirable when we want to adjust, or temporarily attach, the cameras on the rig. Second, this batch approach is not useful when we want to compute the entire procedure on a GPU. In contrast, our method is on-line, suits adjustable or temporarily attached cameras, and updates parameters on the same computation graph as the PoseNet-based ego-motion estimation.
The batch problems (1) and (2) can be solved with an SVD and a least-squares solver, respectively. Here we propose to incrementally update the inter-camera pose at each time step.
To solve problem (1), we propose to use incremental SVD [4, 10, 20]. First, the SVD of a matrix $A$ is given as $U S V^T$. Then a new matrix $B$ is appended by stacking the two vertically as $(A^T\ B^T)^T$. Incremental SVD computes the SVD of $(A^T\ B^T)^T$ from $U S V^T = A$ and $B$ as follows.
1. Perform the QL factorization $Q L = (I - V V^T) B^T$.
2. Perform the SVD $\tilde U \tilde S \tilde V^T = \begin{pmatrix} S & 0 \\ B V & L^T \end{pmatrix}$.
3. The SVD of $(A^T\ B^T)^T$ is then given by the left singular vectors $\begin{pmatrix} U & 0 \\ 0 & I \end{pmatrix} \tilde U$, the singular values $\tilde S$, and the right singular vectors $(V\ \ Q)\,\tilde V$.

Figure 1: Overview of the proposed method.

In our problem, $A$ is square and therefore $V$ is orthogonal ($V V^T = I$), which makes the update rules much simpler. Furthermore, we can rewrite problem (1) as $(A_1^T, A_2^T, \ldots, A_N^T)^T \Delta q = 0$, whose solution is the right singular vector $v$ corresponding to the least singular value. Therefore, we can discard the left singular vectors $U$ and keep only the right singular vectors $V$. Algorithm 1 shows our method for recursively computing the right singular vectors of the matrix $(A_1^T, A_2^T, \ldots, A_N^T)^T$.

To solve problem (2), we use recursive least squares (RLS) [12, 11, 17]. In particular, given $\{(B_t, T^1_t)\}_{t=1}^N$, we derive the following exponentially weighted block RLS with forgetting factor $\lambda$:
$$\Gamma_t = (I + \lambda^{-1} B_t C_{t-1} B_t^T)^{-1} \quad (3)$$
$$G_t = \lambda^{-1} C_{t-1} B_t^T \Gamma_t \quad (4)$$
$$\Delta x_t = \Delta x_{t-1} + G_t (T^1_t - B_t \Delta x_{t-1}) \quad (5)$$
$$C_t = \lambda^{-1} C_{t-1} - G_t \Gamma_t^{-1} G_t^T, \quad (6)$$
with the initialization of $\Delta x_0$ as a zero vector and $C_0$ as an identity matrix.

The proposed method, shown in Algorithm 2, combines the incremental SVD and RLS. It takes video frames from both cameras and estimates the inter-camera pose at each time step. Both the incremental SVD and RLS are recursive but not iterative, which means that the solutions are exact. Our method is, however, not exactly the same as the batch method [7], because $B^1_t$ in the RLS includes a temporary solution of the rotation $\Delta R$.

As we use a PoseNet-based ego-motion estimation network, it is reasonable to implement the on-line estimation on the same computation graph on a GPU. This reduces memory transfer between GPUs and CPUs. An overview of the proposed method is shown in Figure 1.
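Because the appended blocks are square, the incremental SVD collapses to one small SVD per step, and only $S$ and $V$ need to be propagated. A minimal numpy sketch under this simplification (random $4 \times 4$ blocks standing in for the $A_t$; this is an illustration of the update rule, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
blocks = [rng.normal(size=(4, 4)) for _ in range(30)]  # stand-ins for the 4x4 blocks A_t

# Initial SVD of the first block; U is discarded, only S and V are kept
_, S, Vt = np.linalg.svd(blocks[0])
V = Vt.T

# Incremental update: with square blocks, V stays a full orthogonal basis,
# (I - V V^T) B^T = 0, and the QL step vanishes; each step only needs the
# SVD of the small 8x4 matrix obtained by stacking diag(S) on A_t V.
for A_t in blocks[1:]:
    K = np.vstack([np.diag(S), A_t @ V])
    _, S, Vt_tilde = np.linalg.svd(K)
    V = V @ Vt_tilde.T

# The recursion is exact: it matches a batch SVD of the fully stacked matrix
_, S_batch, Vt_batch = np.linalg.svd(np.vstack(blocks))
print(np.allclose(S, S_batch))        # singular values agree
print(abs(V[:, -1] @ Vt_batch[-1]))   # least right singular vectors agree up to sign
```

Since the left singular vectors are never stored, the per-step cost is the SVD of an $8 \times 4$ matrix, independent of the number of frames processed so far.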
Note that in our current prototype implementation, the two parts are separated due to some implementation issues.

Algorithm 1: Incremental SVD algorithm.
Input: $A_1, A_2, \ldots, A_N$
Output: solution quaternion $q$
$U S V^T = A_1$ ; // initial SVD
for $t = 2$ to $N$ do
    $\tilde U \tilde S \tilde V^T = \begin{pmatrix} S \\ A_t V \end{pmatrix}$ ; // SVD
    $S = \tilde S$ ; // update $S$
    $V = V \tilde V$ ; // update $V$
return $v$ as $q$ ; // right singular vector of the least singular value

Algorithm 2:
On-line algorithm. Arrows indicate some trivial transformations.
Init: $S, V, C, x, \lambda$
Input: frames $I^0_t, I^0_{t-1}, I^1_t, I^1_{t-1}$
Output: $\Delta R, \Delta T, \Delta\lambda$
// Ego-motion
$R^0_t, T^0_t = \mathrm{PoseNet}(I^0_t, I^0_{t-1})$ ; // camera 0
$R^1_t, T^1_t = \mathrm{PoseNet}(I^1_t, I^1_{t-1})$ ; // camera 1
// Rotation with incremental SVD
$A \leftarrow R^0_t, R^1_t$
$\tilde U \tilde S \tilde V^T = \begin{pmatrix} S \\ A V \end{pmatrix}$ ; // SVD
$S = \tilde S$ ; $V = V \tilde V$ ; // update $S$ and $V$
$\Delta R \leftarrow V$ ; // rotation between cameras
// Translation: RLS with forgetting factor
$B \leftarrow R^1_t, \Delta R, T^0_t$ ; $b = T^1_t$
$\Gamma = (I + \lambda^{-1} B C B^T)^{-1}$
$G = \lambda^{-1} C B^T \Gamma$
$x = x + G (b - B x)$ ; // update $x$
$C = \lambda^{-1} C - G \Gamma^{-1} G^T$
$\Delta T, \Delta\lambda \leftarrow x$ ; // translation and scale
return $\Delta R, \Delta T, \Delta\lambda$

First, we evaluated the proposed method with synthetic trajectories. Two camera poses were randomly generated (with a unit baseline length, due to the scale ambiguity) and fixed on a rig; then a trajectory of the rig over 128 frames was generated using a linear dynamic system. We added Gaussian noise with zero mean and different standard deviations (noise levels) to the rotation and translation of each frame independently. Figure 2 shows the relative errors of the estimated inter-camera rotation and translation. We see that the errors decrease as the iterations increase, and the final errors (at the last iteration) become larger when the noise level is higher. In rotation estimation, the errors in the first several frames are large but decrease rapidly. In translation estimation, however, this effect remains even after a large number of iterations. We therefore postponed the translation estimation until after 60 iterations. This effectively reduces the error in translation, and the results are acceptable in practical situations: translation errors of about 0.25 with respect to the baseline length of 1.
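For reference, the RLS recursion (3)–(6) with forgetting factor $\lambda$ can be sketched on toy data as follows (random regressors stand in for $B_t$; the variable names and noise level are our own assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
x_true = rng.normal(size=4)       # stand-in for the unknown (Delta T, Delta lambda)
lam = 0.99                        # forgetting factor lambda

x = np.zeros(4)                   # Delta x_0 = 0
C = np.eye(4)                     # C_0 = I

for _ in range(500):
    B = rng.normal(size=(3, 4))   # stand-in for B_t = (I - R^1_t, Delta R T^0_t)
    b = B @ x_true + 0.01 * rng.normal(size=3)              # noisy observation of T^1_t
    Gamma = np.linalg.inv(np.eye(3) + (B @ C @ B.T) / lam)  # eq. (3)
    G = (C @ B.T @ Gamma) / lam                             # eq. (4)
    x = x + G @ (b - B @ x)                                 # eq. (5)
    C = C / lam - G @ np.linalg.inv(Gamma) @ G.T            # eq. (6)

print(np.max(np.abs(x - x_true)))  # small: x converges to the true parameters
```

Each step inverts only a $3 \times 3$ matrix, so the per-frame cost is constant, which is what makes the on-line update cheap enough to run on the same computation graph.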
Figure 2: Error of angles (in deg) in estimating (top) rotation axes and (bottom) translation vectors over iterations. Different curves show results with different amounts of noise added to the poses.

Next, we evaluated the proposed method with real video frames from the KITTI dataset [9, 8]. This is not a non-overlapping calibration problem; however, the ground-truth poses are available, and we use them to evaluate the accuracy of our method. Figure 3 shows the errors in rotation and translation over iterations. The errors increase as the iterations increase, which demonstrates that the proposed method does not work well for real video frames in this setting. This is caused by the ego-motion method [23] producing pose estimates at different scales every several frames. We are currently working on this problem.
We have proposed a method for non-overlapping camera calibration. The method is on-line and uses a computation graph to implement the estimation of rotation with incremental SVD and of translation with RLS. Results with camera poses generated synthetically and obtained from real video frames with PoseNet demonstrate that our method works well in simulation. Our future work includes experiments and evaluations on real videos taken by non-overlapping cameras.
Acknowledgements
This work was supported in part by JSPS KAKENHI grant number JP16H06540.
Figure 3: Error of angles (in deg) in estimating (top) rotation axes and (bottom) translation vectors over iterations.

References

[1] Amit Agrawal. Extrinsic Camera Calibration without a Direct View Using Spherical Mirror. Pages 2368–2375. IEEE, Dec. 2013.
[2] Nadeem Anjum, Murtaza Taj, and Andrea Cavallaro. Relative Position Estimation of Non-Overlapping Cameras. Pages II-281–II-284, 2007.
[3] Esra Ataer-Cansizoglu, Yuichi Taguchi, Srikumar Ramalingam, and Yohei Miki. Calibration of Non-overlapping Cameras Using an External SLAM System. Volume 1, pages 509–516. IEEE, Dec. 2014.
[4] James R. Bunch and Christopher P. Nielsen. Updating the singular value decomposition. Numerische Mathematik, 31(2):111–129, Jun. 1978.
[5] Gerardo Carrera, Adrien Angeli, and Andrew J. Davison. SLAM-based automatic extrinsic calibration of a multi-camera rig. Pages 2652–2659. IEEE, May 2011.
[6] Sandro Esquivel, Felix Woelk, and Reinhard Koch. Calibration of a multi-camera rig from non-overlapping views. In Joint Pattern Recognition Symposium, pages 82–91. Springer, 2007.
[7] Sandro Esquivel, Felix Woelk, and Reinhard Koch. Calibration of a multi-camera rig from non-overlapping views. In Proceedings of the 29th DAGM Conference on Pattern Recognition, pages 82–91, Berlin, Heidelberg, 2007. Springer-Verlag.
[8] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[10] Ming Gu and Stanley C. Eisenstat. A stable and fast algorithm for updating the singular value decomposition, 1993.
[11] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1996.
[12] Simon Haykin. Adaptive Filter Theory (3rd Ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.
[13] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. Pages 2938–2946. IEEE, Dec. 2015.
[14] Ram Krishan Kumar, Adrian Ilie, Jan-Michael Frahm, and Marc Pollefeys. Simple calibration of non-overlapping cameras with a mirror. Pages 1–7. IEEE, Jun. 2008.
[15] Bernhard Lamprecht, Stefan Rass, Simone Fuchs, and Kyandoghere Kyamakya. Extrinsic camera calibration for an on-board two-camera system without overlapping field of view. IEEE Conference on Intelligent Transportation Systems (ITSC), pages 265–270, 2007.
[16] Pierre Lébraly, Eric Royer, Omar Ait-Aider, and Michel Dhome. Calibration of non-overlapping cameras - application to vision-based robotics. In Proceedings of the British Machine Vision Conference, pages 10.1–10.12. BMVA Press, 2010.
[17] Ali H. Sayed. Fundamentals of Adaptive Filtering. Wiley-IEEE Press, 2003.
[18] Kosuke Takahashi, Shohei Nobuhara, and Takashi Matsuyama. A new mirror-based extrinsic camera calibration using an orthogonality constraint. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1051–1058, 2012.
[19] Kosuke Takahashi, Shohei Nobuhara, and Takashi Matsuyama. Mirror-based camera pose estimation using an orthogonality constraint. IPSJ Transactions on Computer Vision and Applications, 8:11–19, 2016.
[20] Hongyuan Zha and Horst D. Simon. On Updating Problems in Latent Semantic Indexing. SIAM Journal on Scientific Computing, 21(2):782–791, Jan. 1999.
[21] Fangda Zhao, Toru Tamaki, Takio Kurita, Bisser Raytchev, and Kazufumi Kaneda. Marker based simple non-overlapping camera calibration. Pages 1180–1184, Sep. 2016.
[22] Fangda Zhao, Toru Tamaki, Takio Kurita, Bisser Raytchev, and Kazufumi Kaneda. Marker-based non-overlapping camera calibration methods with additional support camera views. Image and Vision Computing, 70:46–54, Feb. 2018.
[23] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised Learning of Depth and Ego-Motion from Video. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).