Stereo 3D Object Trajectory Reconstruction
Sebastian Bullinger, Christoph Bodensteiner, Michael Arens
Fraunhofer IOSB, Ettlingen, Germany
Rainer Stiefelhagen
Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]
Abstract
We present a method to reconstruct the three-dimensional trajectory of a moving instance of a known object category using stereo video data. We track the two-dimensional shape of objects on pixel level exploiting instance-aware semantic segmentation techniques and optical flow cues. We apply Structure from Motion (SfM) techniques to object and background images to determine for each frame initial camera poses relative to object instances and background structures. We refine the initial SfM results by integrating stereo camera constraints exploiting factor graphs. We compute the object trajectory by combining object and background camera pose information. In contrast to stereo matching methods, our approach leverages temporally adjacent views for object point triangulation. As opposed to monocular trajectory reconstruction approaches, our method exhibits no degenerate cases. We evaluate our approach using publicly available video data of vehicles in urban scenes.
1. Introduction
The reconstruction of three-dimensional object motion trajectories is important for autonomous systems and surveillance applications. On platforms such as drones or wearable systems, one wants to achieve this task with a minimal number of devices in order to reduce weight or lower production costs. We propose an approach to reconstruct three-dimensional object motion trajectories using two cameras as sensors. The results are essential for applications like environment perception and geo-registration of three-dimensional object trajectories. 3D stereo measurement precision deteriorates quickly with camera distance [20] due to limited camera baselines. We tackle this problem by combining temporally adjacent views using Structure from Motion techniques: even small object rotations may result in large camera baseline differences.

In many scenes, objects cover only a minority of pixels, which increases the difficulty of reconstructing object motion trajectories from image data. In such cases, current state-of-the-art Structure from Motion (SfM) approaches [18, 22] are likely to treat moving object observations as outliers and reconstruct background structures instead. Previous works, e.g. [15, 16], detect moving objects by applying motion segmentation or keypoint tracking. Recent progress in instance-aware semantic segmentation [7, 11] and optical flow [13, 12] techniques allows for object tracking on pixel level [2] and handles stationary objects naturally. We extend the approach in [2] to track objects on pixel level in stereo video data. Stereo object tracking allows us to use [22] and [18] for object and background reconstruction. We refine the reconstruction results by incorporating stereo constraints using GTSAM [6].
GTSAM provides functionality to model reconstruction problems with factor graphs. The incorporation of stereo constraints removes the scale ambiguity between object and background reconstruction and allows us to compute consistent object motion trajectories.
Semantic segmentation or scene parsing is the task of providing semantic information at pixel level. Shelhamer et al. [23] applied Fully Convolutional Networks for semantic segmentation, which are trained end-to-end. Recently, [7, 11] proposed instance-aware semantic segmentation approaches. The field of Structure from Motion (SfM) can be divided into iterative and global approaches. Iterative or sequential SfM methods [18, 22, 25] are more likely to find reasonable solutions than global SfM approaches [18, 25]. However, the latter are less prone to drift. GTSAM [6] allows to model and to optimize SfM problems using factor graphs, but does not provide functionality to perform data association and initialization. [3] analyze the importance of initialization techniques for Simultaneous Localization and Mapping (SLAM) using GTSAM. We perform data association and initialization using state-of-the-art SfM libraries [18, 22]. Previous works [4, 24] exploit specific camera poses to reconstruct object trajectories in monocular video data. These approaches are specifically designed for driving scenarios. [8] reconstruct vehicle shapes and trajectories in stereo video data using off-the-shelf ego-motion and stereo reconstruction algorithms. [19] combine object proposals, stereo, visual odometry and scene flow to compute three-dimensional vehicle tracks in traffic scenes. The object trajectory reconstructions in [8] and [19] are limited by the stereo camera baseline.

Figure 1: Overview of the Trajectory Reconstruction Pipeline. Boxes with corners denote computation results and boxes with rounded corners denote computation steps.
The core contributions of this work are as follows. (1) We present a new framework to reconstruct the three-dimensional trajectory of moving instances of known object categories in stereo video data, leveraging state-of-the-art semantic segmentation and Structure from Motion approaches. (2) We propose a novel approach to track the two-dimensional shape of objects on pixel level in stereo video data. (3) We present a novel method to compute object motion trajectories consistent with image observations and background structures, using state-of-the-art SfM techniques for data initialization and factor graphs for refinement exploiting stereo constraints. (4) As opposed to stereo matching methods, our approach leverages views from different time steps for object point triangulation. (5) We demonstrate the usefulness of our method by showing qualitative results of reconstructed object motion trajectories.
2. Object Motion Trajectory Reconstruction
The pipeline of our approach is shown in Fig. 1. The input is an ordered stereo image sequence. We track two-dimensional object shapes on pixel level across video sequences, exploiting instance-aware semantic segmentation [17] to identify object shapes and optical flow [13] to associate extracted object shapes in corresponding stereo images and subsequent frames. Without loss of generality, we describe motion trajectory reconstructions of single objects. We apply SfM [18, 22] to object and background images as shown in Fig. 1. Object images denote pictures containing only color information of single object instances. Similarly, background images show only environment structures. We combine information of object and background SfM reconstructions to determine consistent object motion trajectories. We use GTSAM [6] to refine object and background reconstructions and resolve the scale ambiguity using stereo camera baseline constraints.

The point triangulation of stereo matching or stereo correspondence methods [21] is limited by the baseline of the corresponding stereo camera [20]. In contrast, SfM allows to triangulate 3D points by exploiting information of subsequent frames. Since already small object rotations may result in large camera baseline changes, our method is not necessarily limited by the stereo camera baseline. In contrast to stereo correspondence methods, the proposed approach builds object models reflecting the information of each frame. Building an object model with stereo matching techniques requires additional steps to fuse the 3D points of subsequent frames. The presented method does not require a calibration of the stereo camera.
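The baseline argument above can be made concrete with a first-order stereo error model. The sketch below is not part of the paper's pipeline; the focal length, baselines and disparity error are hypothetical values chosen for illustration:

```python
# For a rectified stereo pair, depth is Z = f * b / d (f: focal length in
# pixels, b: baseline in meters, d: disparity in pixels). A disparity error
# dd propagates to a depth error of roughly dZ = Z^2 / (f * b) * dd, so for
# a fixed baseline the error grows quadratically with distance. Combining
# temporally adjacent views effectively enlarges b.
def depth_error(z, f_px, baseline_m, disparity_err_px=0.5):
    """First-order depth uncertainty (meters) at distance z (meters)."""
    return z ** 2 / (f_px * baseline_m) * disparity_err_px

f_px = 1000.0      # hypothetical focal length
stereo_b = 0.5     # hypothetical fixed stereo rig baseline (m)
temporal_b = 5.0   # hypothetical wider baseline from temporally adjacent views (m)

for z in (10.0, 30.0, 50.0):
    print(f"Z={z:4.0f} m: stereo ±{depth_error(z, f_px, stereo_b):5.2f} m, "
          f"wide baseline ±{depth_error(z, f_px, temporal_b):5.2f} m")
```

With these numbers the rig's error at 50 m is ±2.5 m, while the tenfold baseline reduces it by a factor of ten; the point of the sketch is the quadratic growth in Z, matching the deterioration reported in [20].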
The proposed Stereo Multiple Object Tracking (MOT) approach extends the monocular tracking algorithm presented in [2] and is depicted in Fig. 2. [2] allows to track the two-dimensional shape of objects of known categories across video sequences on pixel level. We use optical flow matches to associate instance-aware semantic segmentations between subsequent frames to maintain the tracker state. In contrast to motion model based tracking methods, this approach allows to naturally associate objects between left and right images of stereo cameras. [2] uses the Kuhn-Munkres algorithm [14] to solve the assignment problem, i.e. to determine object associations between image pairs. The assignment problem consists of finding a maximum weight matching in a weighted two-dimensional (or bipartite) graph. In the stereo MOT case this translates to a four-dimensional matching problem, because the object instances in the left image $I_{i,l}$ and the right image $I_{i,r}$ at time $i$ as well as the object instances in the left image $I_{i+1,l}$ and the right image $I_{i+1,r}$ at time $i+1$ must be associated. Let $OF_{i,lr}$ and $OF_{i,ln}$ denote the optical flow between images $I_{i,l}$ and $I_{i,r}$ as well as $I_{i,l}$ and $I_{i+1,l}$. We do not solve the associations of $I_{i,l}$, $I_{i+1,l}$, $I_{i,r}$ and $I_{i+1,r}$ simultaneously, since (a) the four-dimensional matching problem is NP-complete and (b) the simultaneous determination of two subsequent stereo image pairs requires the computation of three optical flow fields in addition to $OF_{i,lr}$ and $OF_{i,ln}$. Instead, we track object instances in the left images $I_{i,l}$ and $I_{i+1,l}$ using the object affinity matrix presented in [2] as input for the Kuhn-Munkres algorithm. Concretely, the affinity matrix is defined according to equation (1):

$$A_t = \begin{pmatrix} O_{1,1} & \cdots & O_{1,v} & \cdots & O_{1,n_v} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ O_{u,1} & \cdots & O_{u,v} & \cdots & O_{u,n_v} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ O_{n_u,1} & \cdots & O_{n_u,v} & \cdots & O_{n_u,n_v} \end{pmatrix} \quad (1)$$

Here, $O_{u,v}$ denotes the overlap of prediction $u$ in $P_{i,ln}$ and detection $v$ in $D_{i+1,l}$ (see Fig. 2). Let $n_u$ denote the number of predictions in $P_{i,ln}$ and $n_v$ the number of detections in $D_{i+1,l}$. This affinity measure reflects locality and visual similarity. The tracker state $T_{i+1,l}$ contains only tracks of object instances in images corresponding to the left camera. We use the optical flow between left and right images $OF_{i+1,lr}$ to associate the tracker state of left images $T_{i+1,l}$ with objects visible in the corresponding right image. The association between predictions $P_{i+1,lr}$ and detections $D_{i+1,r}$ in the right images is also solved using the affinity matrix of [2] as input for [14]. In this case $O_{u,v}$ denotes the overlap of prediction $u$ in $P_{i+1,lr}$ and detection $v$ in $D_{i+1,r}$; $n_u$ denotes the number of predictions in $P_{i+1,lr}$ and $n_v$ the number of detections in $D_{i+1,r}$.

Figure 2: Stereo Object Tracking Scheme. The variables have the following meaning: $I$: image, $OF$: optical flow, $D$: detection, $P$: prediction, $T$: tracker state, $i$: image index, $l$: left, $r$: right, $ln$: left-next, $lr$: left-right. Arrows show the relation of computation steps. A computation step depends on the results connected with incoming arrows. The optical flow color coding used is defined in [1]. The figure is best viewed in color.

We follow the pipeline outlined in Fig. 1 and apply SfM simultaneously to object and background images. We denote the corresponding reconstruction results with $sfm(o)$ and $sfm(b)$. Each object image has a corresponding background image, i.e. the background image extracted from the same input frame.
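The assignment step described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the IoU-style overlap and the helper names are assumptions, and SciPy's `linear_sum_assignment` (a Kuhn-Munkres variant) stands in for the Hungarian-method solver of [14]:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_overlap(pred, det):
    """One possible choice for O_{u,v}: intersection over union of two masks."""
    inter = np.logical_and(pred, det).sum()
    union = np.logical_or(pred, det).sum()
    return inter / union if union > 0 else 0.0

def associate(predictions, detections, min_affinity=0.1):
    """Build the affinity matrix A_t and solve the maximum-weight matching.

    Returns (u, v) pairs matching prediction u to detection v."""
    affinity = np.array([[mask_overlap(p, d) for d in detections]
                         for p in predictions])
    # linear_sum_assignment minimizes cost, so negate to maximize affinity.
    rows, cols = linear_sum_assignment(-affinity)
    return [(u, v) for u, v in zip(rows, cols) if affinity[u, v] >= min_affinity]

# Toy 1D "masks" standing in for per-pixel instance segmentations.
p0 = np.array([1, 1, 0, 0], dtype=bool)
p1 = np.array([0, 0, 1, 1], dtype=bool)
d0 = np.array([0, 1, 1, 1], dtype=bool)
d1 = np.array([1, 1, 1, 0], dtype=bool)
print(associate([p0, p1], [d0, d1]))  # [(0, 1), (1, 0)]
```

The `min_affinity` threshold discards matches with negligible overlap, so unmatched predictions can be handled by the tracker's track-termination logic.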
We consider only object-background image pairs which are part of $sfm(o)$ and $sfm(b)$. Reconstructed cameras without corresponding object or background camera are removed from the reconstruction. Let $o^{(o)}_j$ denote the 3D points contained in $sfm(o)$. The superscript $o$ in $o^{(o)}_j$ describes the corresponding coordinate frame. The variable $j$ denotes the index of the points in the object point cloud. We combine information of object-background image pairs to define object motion trajectories parameterized by a single parameter. The object reconstruction $sfm(o)$ contains object point positions $o^{(o)}_j$ as well as corresponding camera centers $c^{(o)}_i$ and rotations $R^{(o)}_i$. We convert the object points $o^{(o)}_j$ defined in the coordinate frame system (CFS) of the object reconstruction to points $o^{(i)}_j$ in the CFS of camera $i$ using $o^{(i)}_j = R^{(o)}_i \cdot (o^{(o)}_j - c^{(o)}_i)$. We use the camera center $c^{(b)}_i$ and the corresponding rotation $R^{(b)}_i$ contained in the background reconstruction $sfm(b)$ to transform object points in camera coordinates to the background CFS using equation (2):

$$o^{(b)}_{j,i} = c^{(b)}_i + R^{(b)\,T}_i \cdot o^{(i)}_j \quad (2)$$

The naive combination of object and background reconstruction results in inconsistent object motion trajectories due to the scale ambiguity of SfM [10]. We adjust the scale between object and background reconstruction using the baseline of the stereo cameras in object and background reconstructions as reference. Reconstructions of dynamic objects using state-of-the-art SfM tools occasionally contain badly registered cameras and incorrectly triangulated object points (see Fig. 3). Reasons for this are small object sizes, changing illumination and reflecting surfaces.
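The two coordinate transformations and the baseline-based scale adjustment can be sketched in a few lines of NumPy. This is an illustrative sketch following the paper's notation; the function and variable names are assumptions:

```python
import numpy as np

def baseline_scale(c_left_bg, c_right_bg, c_left_obj, c_right_obj):
    """Scale ratio that makes the object reconstruction's stereo baseline
    match the baseline measured in the background reconstruction."""
    return (np.linalg.norm(c_left_bg - c_right_bg)
            / np.linalg.norm(c_left_obj - c_right_obj))

def object_points_in_background(o_obj, R_obj_i, c_obj_i, R_bg_i, c_bg_i, scale=1.0):
    """Map Nx3 object points from the object CFS into the background CFS.

    Step 1: o_cam = R^(o)_i (o^(o)_j - c^(o)_i), scaled to background units.
    Step 2, equation (2): o^(b)_{j,i} = c^(b)_i + R^(b)_i^T o_cam."""
    o_cam = scale * (R_obj_i @ (o_obj - c_obj_i).T).T
    return c_bg_i + (R_bg_i.T @ o_cam.T).T

# Toy example: identity rotations, object camera at the origin.
o_obj = np.array([[1.0, 0.0, 0.0]])
identity = np.eye(3)
pts_bg = object_points_in_background(
    o_obj, identity, np.zeros(3), identity, np.array([5.0, 0.0, 0.0]), scale=2.0)
print(pts_bg)  # [[7. 0. 0.]]
```

Evaluating `object_points_in_background` once per frame $i$ yields the trajectory samples $o^{(b)}_{j,i}$ of equation (2).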
Incorrectly estimated camera baselines hamper the correct estimation of the scale ratio between object and background reconstruction. We leverage factor graphs [6] to model stereo camera constraints and to refine the previously computed SfM reconstructions. For each triangulated point we search for corresponding stereo feature observations, i.e. pairs of feature observations which appear in the left and the right image of the same time step. Stereo image rectification preprocessing allows us to assume that the stereo feature observation positions should show (almost) the same $y$ coordinate. Since the feature observations are computed for each image independently, we use only stereo feature observations with a $y$ pixel difference smaller than three pixels. We average the $y$ coordinate to define the final stereo constraint. The resulting reconstructions show consistent camera stereo baselines. Note that GTSAM [6] does not provide functionality to perform data association and initialization. Fig. 3 shows a comparison of initial and refined reconstructions. We can recover the full object motion trajectory by computing equation (2) for each object-background image pair. We use $o^{(b)}_{j,i}$ of all cameras and object points as the object motion trajectory representation.

Figure 3: Comparison of initial SfM object reconstructions (left column) and corresponding refinements using stereo constraints (right column). The cameras are shown in red. The blue and green circles emphasize incorrectly registered cameras and triangulated points, respectively.
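The selection of stereo feature observations described above can be sketched as follows; the dictionary-based data layout and the function name are assumptions made for illustration:

```python
def build_stereo_observations(left_obs, right_obs, max_y_diff=3.0):
    """Pair feature observations of the same 3D point seen in the left and
    right image of one time step.

    left_obs/right_obs map a track id to an (x, y) keypoint position in the
    rectified left/right image. Pairs whose rows differ by max_y_diff pixels
    or more are discarded; the remaining y coordinates are averaged to form
    the final stereo measurement (u_left, u_right, v)."""
    stereo = {}
    for tid, (xl, yl) in left_obs.items():
        if tid not in right_obs:
            continue  # point not observed in the right image
        xr, yr = right_obs[tid]
        if abs(yl - yr) < max_y_diff:
            stereo[tid] = (xl, xr, 0.5 * (yl + yr))  # enforce a common row
    return stereo

left = {1: (100.0, 50.0), 2: (80.0, 40.0)}
right = {1: (90.0, 51.0), 2: (70.0, 48.0)}  # track 2 violates the row constraint
print(build_stereo_observations(left, right))  # {1: (100.0, 90.0, 50.5)}
```

Each surviving tuple can then be attached to the factor graph as a stereo measurement of the corresponding triangulated point.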
3. Experiments and Evaluation
Due to the lack of suitable benchmark datasets, we show qualitative results using publicly available video data [5, 9]. For object tracking we evaluated [7, 17, 11] for instance-aware semantic segmentation and [12, 13] for optical flow computation. We observed that [11] and [12] achieved the best segmentation and optical flow results. [12] computes more stable optical flow vectors for moving objects than [13]. We considered the following SfM pipelines for object and background reconstructions: Colmap [22], OpenMVG [18], Theia [25] and VisualSfM [26]. Our object trajectory reconstruction pipeline uses Colmap for object and OpenMVG for background reconstructions. Colmap and OpenMVG created the most reliable object and background reconstructions in our experiments.

Figure 4: Vehicle trajectory reconstruction using three sequences (stuttgart01-stuttgart03) contained in the Cityscapes dataset [5] and one sequence (2011_09_26_drive_0013) of the KITTI dataset [9]. Panels: (a) left input frame, (b) left object segmentation, (c) left background segmentation, (d) object reconstruction, (e) background reconstruction, (f) trajectory reconstruction (top view), (g) trajectory reconstruction (side view). Object segmentations and object reconstructions are exemplarily shown for one of the vehicles visible in the scene. The reconstructed cameras are shown in red. The vehicle trajectories are colored in green and blue. The figure is best viewed in color.
4. Conclusions
This paper presents a pipeline to reconstruct the three-dimensional trajectory of moving objects using stereo video data. We presented a novel approach to track objects on pixel level across stereo video sequences. This allows us to apply state-of-the-art SfM techniques simultaneously to different objects. We demonstrate how to resolve the scale ambiguity of object and background SfM reconstructions leveraging stereo constraints. In contrast to previously published stereo 3D object trajectory reconstruction approaches, our method leverages temporally adjacent frames for object and background reconstruction. Thus, the presented method is not limited by the stereo camera baseline. Due to the lack of stereo 3D object motion trajectory benchmark datasets with suitable ground truth data, we showed qualitative results on the Cityscapes and KITTI datasets. In future work we will analyze robustness and limitations of the presented approach w.r.t. decreasing object sizes.
References

[1] S. Baker, S. Roth, D. Scharstein, M. J. Black, J. P. Lewis, and R. Szeliski. A database and evaluation methodology for optical flow. In IEEE International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[2] S. Bullinger, C. Bodensteiner, and M. Arens. Instance flow based online multiple object tracking. In IEEE International Conference on Image Processing (ICIP), 2017.
[3] L. Carlone, R. Tron, K. Daniilidis, and F. Dellaert. Initialization techniques for 3D SLAM: A survey on rotation estimation and its use in pose graph optimization. In IEEE International Conference on Robotics and Automation (ICRA), pages 4597–4604, 2015.
[4] F. Chhaya, N. D. Reddy, S. Upadhyay, V. Chari, M. Z. Zia, and K. M. Krishna. Monocular reconstruction of vehicles: Combining SLAM with shape priors. In IEEE International Conference on Robotics and Automation (ICRA), 2016.
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] F. Dellaert. Factor graphs and GTSAM: A hands-on introduction. Technical Report GT-RIM-CP&R-2012-002, Georgia Institute of Technology, 2012.
[7] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] F. Engelmann, J. Stückler, and B. Leibe. SAMP: Shape and motion priors for 4D vehicle reconstruction. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
[9] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11), 2013.
[10] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
[12] Y. Hu, R. Song, and Y. Li. Efficient coarse-to-fine PatchMatch for large displacement optical flow. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[15] A. Kundu, K. M. Krishna, and C. V. Jawahar. Realtime multibody visual SLAM with a smoothly moving monocular camera. In IEEE International Conference on Computer Vision (ICCV), 2011.
[16] K. Lebeda, S. Hadfield, and R. Bowden. 2D or not 2D: Bridging the gap between tracking and structure from motion. In Asian Conference on Computer Vision (ACCV), 2014.
[17] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] P. Moulon, P. Monasse, R. Marlet, et al. OpenMVG: An open multiple view geometry library, 2013.
[19] A. Ošep, W. Mehner, M. Mathias, and B. Leibe. Combined image- and world-space tracking in traffic scenes. In IEEE International Conference on Robotics and Automation (ICRA), 2017.
[20] P. Pinggera, D. Pfeiffer, U. Franke, and R. Mester. Know your limits: Accuracy of long range stereoscopic object measurements in practice. In European Conference on Computer Vision (ECCV), 2014.
[21] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 2002.
[22] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[23] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
[24] S. Song, M. Chandraker, and C. C. Guest. High accuracy monocular SFM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 2016.
[25] C. Sweeney. Theia multiview geometry library: Tutorial & reference.
[26] C. Wu. VisualSFM: A visual structure from motion system, 2011.