Future Localization from an Egocentric Depth Image
Hyun Soo Park    Yedong Niu    Jianbo Shi
University of Pennsylvania
{hypar, yedniu, jshi}@seas.upenn.edu

Abstract
This paper presents a method for future localization: to predict a set of plausible trajectories of ego-motion given a depth image. We predict paths avoiding obstacles, between objects, even paths turning around a corner into space behind objects. As a byproduct of the predicted trajectories of ego-motion, we discover in the image the empty space occluded by foreground objects. We use no image based features such as semantic labeling/segmentation or object detection/recognition for this algorithm. Inspired by proxemics, we represent the space around a person using an EgoSpace map, akin to an illustrated tourist map, that measures a likelihood of occlusion in the egocentric coordinate system. A future trajectory of ego-motion is modeled by a linear combination of compact trajectory bases, allowing us to constrain the predicted trajectory. We learn the relationship between the EgoSpace map and the trajectory from the EgoMotion dataset, which provides in-situ measurements of the future trajectory. A cost function that takes into account partial occlusion due to foreground objects is minimized to predict a trajectory. This cost function generates a trajectory that passes through the occluded space, which allows us to discover the empty space behind the foreground objects. We quantitatively evaluate our method to show predictive validity and apply it to various real world scenes including walking, shopping, and social interactions.
1. Introduction
Consider a dynamic scene such as Figure 1 where you, as the camera wearer, plan to pass through the corridor in the shopping mall while others walk in different directions. You need to plan your trajectory to avoid collisions with others and with objects such as walls and fences. Looking ahead, you would plan a trajectory that enters the shop by turning left at the corner, although such space cannot be seen directly from your perspective.

The fundamental problem we are interested in is future localization: where am I supposed to be after 5, 10, and 15 seconds? This challenging task requires understanding the scene in terms of long term temporal human behaviors with respect to the spatial scene layout, with missing data due to occlusions.

Figure 1. Future localization (left) and occluded space discovery (right). Where am I supposed to be after 5, 10, and 15 seconds? We present a method to predict a set of plausible trajectories given a first person depth image. As a byproduct of the predicted trajectories, the space occluded by foreground objects, such as the space inside the shop or behind the ladies, is discovered.

We study the future localization problem using a first person depth (stereo) camera. We present a method to predict a set of plausible trajectories of ego-motion given a depth image captured from an egocentric view. As a byproduct of the predicted trajectories, the occluded space behind foreground objects is discovered. Our method relies purely on the depth measurements, i.e., no image based features such as semantic labeling/segmentation or object detection/recognition are required. (Any depth sensor such as Kinect or Creative Senz3D is complementary to our depth measurement.)

Inspired by proxemics [10], we represent the space around a camera wearer using an EgoSpace map, which resembles an illustrated tourist map: an overhead map with objects seen from first person video projected onto it.

A predictive future localization model, using the EgoSpace map, is learned from in-situ first person stereo videos of various life logging activities such as commutes, shopping, and social interactions. By leveraging structure from motion, camera trajectories are reconstructed. These camera trajectories are associated with their depth image at each time instant, i.e., given the depth image, a future camera trajectory is precisely measured, while the depth image is obtained by the stereo cameras as shown in Figure 2(a).

In a training phase, we discriminatively learn the relationship between the EgoSpace map and the future camera trajectory. We model a trajectory of ego-motion using a linear combination of compact trajectory bases. By the nature of the alignment between ego-motion and gaze direction, the trajectory is highly structured. We empirically show that only a few (on the order of four) trajectory bases suffice to reconstruct ego-motion accurately. This compact representation allows us to efficiently find a set of trajectories that are compatible with the associated depth image using EgoSpace map matching. This provides an initialization of the predicted trajectories. However, not all these 're-imagined' trajectories avoid objects in the current first person view. We refine them by minimizing a cost function that takes into account the compatibility between the obstacles in the EgoSpace map and the trajectory. This cost function explicitly models partial occlusion of a trajectory, which allows us to discover the space behind foreground objects.

Why EgoSpace map?
Two cues are strongly related to predicting a trajectory of ego-motion, e.g., where is he or she going? (1) ego-cue: a vanishing point is often aligned with gaze direction, and the 2D visual layout of the obstacles in the first person view implicitly encodes the semantics of the scene. (2) exo-cue: objects in a 3D scene such as roads, buildings, and tables constrain the space where the wearer can navigate. Such cues can be explicitly extracted from an ego-depth image where the gaze direction of the wearer can be calibrated with respect to a ground plane (exocentric coordinate) while the depth provides obstacles with respect to the wearer (egocentric coordinate). Our EgoSpace map representation exploits these two cues: we measure depth from an egocentric view and create an illustrated tourist map representation capturing both the 2D visual arrangement of the obstacles (in first person view) and their 3D layout (in overhead view). This representation allows us to analyze and understand different scene types and gaze directions in the same coordinate system.
Contributions
To the best of our knowledge, this is the first paper that predicts ego-motion from a depth image, without semantic scene labels or object detection, via in-situ first person measurements. Core technical contributions of our paper are (a) a predictive model that describes a spatial distribution of objects with respect to an egocentric view, allowing us to register different scenes in a unified coordinate system; (b) a compact subspace representation of the predicted trajectories, making a search for trajectory parameters feasible without explicit modeling of the dynamics of human behaviors; (c) occluded space discovery through trajectory prediction; and (d) the EgoMotion dataset with depth and long term camera trajectories, which includes diverse daily activities across camera wearers. We evaluate our algorithm on predicting ego-motion in real world scenes.
2. Related Work
Our framework lies at the intersection of behavior prediction and egocentric vision.
Predicting where-to-go is a long standing task in behavioral science. This task requires understanding the interactions of agents with objects in a scene that afford a space to move. There is a large body of literature on human behavior prediction algorithms. Pentland and Liu [28] modeled human behaviors using a hidden Markov dynamic model to recognize driving patterns. Such a Markovian model is an attractive choice to encode human behaviors because it reflects the way humans make decisions [17, 19, 38]. These models, especially the partially observable Markov decision process (POMDP), have influenced motion planning in robotics [15, 29, 31].

In computer vision, Ali and Shah [3] developed a flow field model that predicts spatial crowd behaviors for tracking in extremely cluttered crowd scenes. Inspired by the social force model [11], Mehran et al. [22] predicted pedestrian behaviors in a crowd scene to detect abnormal behaviors, and Pellegrini et al. [27] used a modified model to track multiple agents. Ryoo [32] presented a bag-of-words approach to recognize social activities at the early stage of videos. Vu et al. [36] predicted plausible activities from a static scene by associating the scene statistics with labeled actions. In terms of the trajectory prediction task, our work is closely related to three path planning frameworks by Gong et al. [9], Kitani et al. [13], and Alahi et al. [2]. Gong et al. presented a method to generate multiple plausible trajectories of each agent in the scene constructed by homotopy classes, which allows them to produce a long term trajectory for visual tracking in crowd scenes. Kitani et al. leveraged inverse optimal control theory to learn human preference with respect to the scene semantic labels, which enables them to predict the paths an agent follows. Alahi et al. introduced a geometric feature, the social affinity model, that captures a spatial relationship of neighboring agents to predict destinations of a crowd.

Unlike previous methods that use semantic labels/segmentation or object detection/tracking, which are often noisy in real world scenes, our measurement is a single depth image that can be reliably obtained by stereo cameras or depth sensors. Estimating optimal parameters for Markovian models is often intractable. In contrast, our trajectory representation in an egocentric view can be encoded using compact trajectory bases, which makes learning tractable because of the reduced number of parameters.
Figure 2. (a) We use ego-stereo cameras to capture our dataset, from which the depth image is computed. Any depth sensor such as Kinect is complementary to our stereo setup. (b) Inspired by proxemics, we represent the space around a person using an EgoSpace map computed from (c) the depth image. (d) The EgoSpace map, φ(r, θ), captures a likelihood of occlusion.

A first person camera is an ideal camera placement to observe human activities because it reflects the attention of the camera wearer. This characteristic provides a powerful cue to understand human behaviors [5, 7, 12, 30, 33]. Kitani et al. [12] used scene statistics produced by camera ego-motion to recognize sport activities from a first person camera. Traditional vision frameworks such as object detection, recognition, and segmentation have been successfully applied to first person data: Pirsiavash and Ramanan [30] recognized daily activities using deformable part models, Lee et al. [18] found important persons and objects, Fathi et al. [5] discovered objects, and Li et al. [20, 21] segmented pixels corresponding to hands. In a social setting, Fathi et al. [6] presented a method to recognize social interactions by detecting gaze directions of people, and Park et al. [25] introduced an algorithm to reconstruct joint attention in 3D by leveraging 3D reconstruction of camera ego-motion. This reconstruction makes prediction of joint attention possible by learning the spatial relationship between a social formation and joint attention [26].

Such characteristics of first person cameras have been used to generate interesting applications in vision, graphics, and robotics. Lee et al. [18] summarized a life logging video, and Xiong et al. [37] detected iconic images using a web image prior. Arev et al. [4] used 3D joint attention to edit social video footage, and Kopf et al. [14] used 3D camera motion to generate a hyperlapse first person video. In robotics, Ryoo et al. [34] predicted human activities for human-robot interactions.

Unlike most previous methods, our task primarily focuses on predicting future behaviors by leveraging in-situ measurements from 3D reconstruction of camera ego-motion. This also allows us to tackle a more challenging problem: to discover an empty space that is not observable because of visual occlusion.
3. Representation
Inspired by proxemics [10], we present a characterization of space with respect to the egocentric coordinate system, called the EgoSpace map.
3.1. EgoSpace Map

The EgoSpace map is a representation of space experienced from the first-person view but visualized in an overhead bird's-eye map, akin to an illustrated tourist map.

It has three key ingredients. First, we define an egocentric coordinate system centered at the feet location, the projection of the center of eyes onto the ground plane, as shown in Figure 2(b). The normal direction, n, of the ground plane is aligned with the Y-axis, and the height of the eye location is h, i.e., c = [0 h 0]^T, where c is the 3D location of the center of eyes. The gaze direction defines the tangential directions of the ground plane: the Z-axis is aligned with the projection of the gaze direction, v, i.e., v = [0 v_y v_z]^T.

Second, the EgoSpace map encodes the depth cue from a first person view onto an overhead view on the ground plane. Using a log-polar parametrization φ(r, θ) of the X-Z (ground) plane, we define the EgoSpace map as a function φ : R × S → R measuring the likelihood of occlusion introduced by foreground objects from the gaze direction. One can think of the eye gaze as a light source shining on foreground objects and casting shadows onto the ground plane. On the shadow image we record the object height, which is proportional to the occlusion likelihood.

Formally, φ(r, θ) measures the height of the point, u, from the ground plane at which the ray, g, from the center of eyes, c, toward (r, θ) first intersects an occluding object, i.e.,

    φ(r, θ) = u^T n,    (1)

where u = λ* g + c with λ* = min{λ > 0 | λ g + c ∈ ∪_{i=1}^{I} O_i}, and {O_i}_{i=1}^{I} is the set of objects in the scene.

We discretize the polar coordinate system by uniform sampling in angle and uniform sampling in the inverse of the radius, which results in nearly uniform sampling in the egocentric view as shown in Figure 2(c). Note that the locations at which the EgoSpace map is measured are almost radially uniform from the first person view point. Figure 2(d) shows the EgoSpace map for Figure 2(c).

For future localization, the ground plane provides a free space for us to move into. On the EgoSpace map, φ(r, θ) = 0 if, from the first person view, the point (r, θ) lies on the ground plane. More interestingly, the space behind an object also indicates potential places to navigate. Since the EgoSpace map is represented in the ground plane, not in the first person view, the space behind an object is marked as occluded area (the rightmost few columns of the map).

Third, the area outside of the first person view depth image boundary is set to φ_max = 2 m. On the EgoSpace map, the shape of this mask is uniquely defined by the gaze direction (roll and pitch angles of the head direction). For example, Figure 2(c) shows a case where the wearer is looking ahead almost parallel to the ground; the ground area close to the wearer (small r) was not visible, i.e., φ(r, θ) there is marked as φ = φ_max. If the wearer is looking down, the masked area on the EgoSpace map would instead be at large values of r.

The EgoSpace representation supports learning future localization from first person videos by combining cues from 3D scene geometry and gaze direction. Its benefits include: 1) the gaze direction normalized coordinate system provides a common 3D reference frame to learn in; 2) the overhead view representation removes the variations in first person 3D experience due to the head's pitch angle; 3) the log-polar encoding and sampling gives more importance to nearby space; and 4) the depth masking implicitly encodes both the roll and pitch angles of the head, making it more situation aware.
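For concreteness, the following is a minimal numpy sketch of this shadow-casting construction of φ(r, θ). It is an illustrative approximation rather than the paper's implementation: the point cloud is assumed to be already expressed in the egocentric frame (origin at the feet, Y up, Z along the gaze projection), and the grid bounds, field of view, and thresholds are placeholder values, not the paper's.

```python
import numpy as np

def egospace_map(points, eye_height, n_r=32, n_theta=32,
                 r_min=0.5, r_max=30.0, phi_max=2.0):
    """Approximate Eq. (1): treat the eye as a light source and record,
    per log-polar ground cell, the height of the nearest scene point
    whose 'shadow' falls on that cell (~0 for visible ground).
    points: (N, 3) array in the egocentric frame."""
    c = np.array([0.0, eye_height, 0.0])        # center of eyes
    phi = np.full((n_r, n_theta), phi_max)      # cells never hit stay masked
    nearest = np.full((n_r, n_theta), np.inf)   # distance of first hit per ray

    # Uniform sampling in inverse radius and in angle (bounds are assumed).
    inv_r_edges = np.linspace(1.0 / r_max, 1.0 / r_min, n_r + 1)
    th_edges = np.linspace(-np.pi / 2, np.pi / 2, n_theta + 1)

    for u in points:
        g = u - c
        if g[1] >= 0:                           # ray must descend to hit ground
            continue
        p = c - (c[1] / g[1]) * g               # shadow of u on the ground plane
        r, th = np.hypot(p[0], p[2]), np.arctan2(p[0], p[2])
        if not (r_min <= r <= r_max and th_edges[0] <= th <= th_edges[-1]):
            continue
        i = np.clip(np.searchsorted(inv_r_edges, 1.0 / r) - 1, 0, n_r - 1)
        j = np.clip(np.searchsorted(th_edges, th) - 1, 0, n_theta - 1)
        dist = np.linalg.norm(g)                # first intersection along the ray
        if dist < nearest[i, j]:
            nearest[i, j] = dist
            phi[i, j] = max(u[1], 0.0)          # object height above the ground
    return phi
```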
3.2. Trajectory Representation

Let X = [x_1 z_1 ··· x_F z_F]^T ∈ R^{2F} be a 2D trajectory on the ground plane of the egocentric coordinate system, where F is the number of future frames to predict and x_i and z_i are the two coordinates at the i-th time instance as shown in Figure 2(b). In practice, this trajectory can be obtained by projecting the 3D camera poses between the (f+1)-th and (f+F)-th time instances at the f-th time instant onto the ground plane. This allows us to represent all trajectories in the same egocentric coordinate system, normalized by gaze direction because the Z-axis is aligned with the gaze direction.

The gaze direction normalized trajectory is highly compressible. Most trajectories of ego-motion can be encoded using a linear combination of trajectory bases learned using Principal Component Analysis (PCA) from the EgoMotion dataset described in Section 5:

    X = B β + X̄,    (2)

where X̄ is the mean trajectory and B ∈ R^{2F×K} is a collection of trajectory bases, i.e., each column of B is a trajectory basis and K is the number of bases. In practice, a small K (on the order of four bases) reconstructs trajectories accurately, as shown in Figure 3(a) and Figure 3(b). β ∈ R^K is the trajectory coefficient, the low dimensional parametrization of X. In Figure 3(b), we compare the reconstruction error produced by the PCA bases and generic DCT bases [1].

Figure 3. (a) We register all trajectories in an egocentric coordinate system, which results in highly redundant trajectories that can be represented by a linear combination of (b) compact trajectory bases (reconstruction error of PCA and DCT bases on training and testing data).
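As an illustration of Eq. (2), the following hedged numpy sketch learns the mean trajectory and the top-K PCA bases from a stack of gaze-normalized training trajectories and encodes/decodes a trajectory. Function names and the training matrix layout are our assumptions.

```python
import numpy as np

def learn_bases(trajs, K=4):
    """trajs: (N, 2F) matrix; each row is [x1 z1 ... xF zF] in the
    gaze-normalized egocentric frame. Returns the mean trajectory
    Xbar and the top-K PCA bases B (2F x K), as in Eq. (2)."""
    Xbar = trajs.mean(axis=0)
    _, _, Vt = np.linalg.svd(trajs - Xbar, full_matrices=False)
    return Xbar, Vt[:K].T            # columns of B are trajectory bases

def encode(X, Xbar, B):
    return B.T @ (X - Xbar)          # beta: K-dim trajectory coefficient

def decode(beta, Xbar, B):
    return B @ beta + Xbar           # X = B beta + Xbar
```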
4. Prediction
A trajectory of ego-motion is associated with an EgoSpace map, i.e., given a depth image, we know how we explored the space in the training data (Section 5). By leveraging the computational representation of egocentric space and trajectory described in Section 3, in this section we present a method to predict a set of plausible trajectories given an EgoSpace map and to discover the occluded space using the predicted trajectories.
Estimating an X that conforms to a depth image amounts to finding a path that stays on the ground plane by minimizing the following cost along the trajectory:

    minimize_β  Σ_{i=1}^{F} φ̃(B_i β),    (3)

where φ̃ : R² → R is the Cartesian coordinate representation of the EgoSpace map φ, and B_i ∈ R^{2×K} is the matrix composed of the (2i−1)-th and (2i)-th rows of B. Therefore, B_i β is the point (x_i, z_i) at the i-th time instant.

Equation (3) finds a trajectory that stays on the ground given a depth image. This approach has been used in the robotics community for various path planning tasks. However, it does not account for a trajectory that is partially occluded by objects, because the occluded part of the trajectory always produces a higher cost. Instead, we introduce a novel cost function that minimizes the trajectory cost difference between the given depth image and the depth image retrieved from the database:

    minimize_β  Σ_{i=1}^{F} max(0, φ̃(B_i β) − φ̃_D(B_i β_D)),    (4)

where φ̃_D and β_D are the EgoSpace map and trajectory parameter retrieved from the training dataset. This minimization finds a partially occluded trajectory as long as there exists a trajectory in the database that has a similar occlusion cost.

There exist an infinite number of trajectories that are compatible with a given EgoSpace map. More importantly, the cost function in Equation (4) is nonlinear, so an initialization of the solution is critical. We initialize β using a trajectory retrieved from the training data by EgoSpace map matching. The dataset is divided into 3 gaze directions (3 pitch angles) to reduce false matches dominated by the area beyond the depth image. Given an EgoSpace map, the k-nearest neighbors (KNN) are found using a K-d tree [23]. Other search or planning methods such as structured SVM [35] and Rapidly Exploring Random Trees (RRT) [16] are complementary to the KNN search.

The predicted trajectories of ego-motion allow us to discover the hidden space occluded by foreground objects, because trajectories can still be predicted in the hidden space. We build a likelihood map of the occluded space as follows:

    ψ(x) = [Σ_{j=1}^{J} Σ_{i=1}^{F} exp(−‖x − B_i β^j‖²/σ²) φ̃(B_i β^j)] / [Σ_{j=1}^{J} Σ_{i=1}^{F} exp(−‖x − B_i β^j‖²/σ²)],    (5)

where ψ(x) is the likelihood of the occluded space that a trajectory can pass through at the evaluation point x ∈ R² on the ground, β^j is the j-th predicted trajectory, J is the number of predicted trajectories, and σ is the bandwidth of the Gaussian kernel. Equation (5) takes into account the likelihood of the predicted trajectories weighted by the likelihood of occlusion: ψ(x) is high when many trajectories are predicted at x while φ̃(x) is high.
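To make Equations (4) and (5) concrete, here is a hedged Python sketch of the refinement and of the occluded-space likelihood map. The EgoSpace lookups phi and phi_D are assumed to be vectorized callables from ground points to occlusion likelihoods, and the derivative-free optimizer is our choice for the non-smooth max(0, ·) cost, not necessarily the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def refine_trajectory(beta0, beta_D, B, Xbar, phi, phi_D):
    """Eq. (4): minimize sum_i max(0, phi(B_i beta) - phi_D(B_i beta_D)),
    starting from the KNN-retrieved initialization beta0. phi / phi_D map
    an (F, 2) array of ground points to per-point occlusion likelihoods."""
    F = B.shape[0] // 2
    ref = phi_D((B @ beta_D + Xbar).reshape(F, 2))   # retrieved trajectory cost

    def cost(beta):
        pts = (B @ beta + Xbar).reshape(F, 2)
        return np.maximum(0.0, phi(pts) - ref).sum()

    return minimize(cost, beta0, method="Nelder-Mead").x

def occlusion_likelihood(x, betas, B, Xbar, phi, sigma=1.0):
    """Eq. (5): kernel-weighted average of phi along the J predicted
    trajectories, evaluated at a ground point x (shape (2,))."""
    num = den = 0.0
    for beta in betas:
        pts = (B @ beta + Xbar).reshape(-1, 2)
        w = np.exp(-np.sum((pts - x) ** 2, axis=1) / sigma ** 2)
        num += (w * phi(pts)).sum()
        den += w.sum()
    return num / max(den, 1e-12)
```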
5. EgoMotion Dataset
We present a new dataset, the EgoMotion dataset, captured by first person stereo cameras. The dataset includes various indoor and outdoor scenes such as parks, malls, and campuses with various activities such as walking, shopping, and social interactions.
A stereo pair of GoPro Hero 3 (Black Edition) cameras with a 100 mm baseline is used to capture the EgoMotion dataset as shown in Figure 2(a). All videos are recorded at 1280×960 and 100 fps. The stereo cameras are calibrated prior to the data collection and synchronized manually with a synchronization token at the beginning of each sequence.
Depth Computation
We compute disparity between the stereo pair after stereo rectification. A cost space of stereo matching is generated for each scan line, and each pixel is matched using dynamic programming in a coarse-to-fine manner.
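The paper's matcher is a coarse-to-fine scanline dynamic program; as a rough, hedged stand-in, the following uses OpenCV's semi-global matcher on an already-rectified pair. File names, matcher parameters, and the focal length are placeholder values; only the 100 mm baseline comes from the text above.

```python
import cv2
import numpy as np

# Rectified stereo pair (hypothetical file names).
left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)

# Semi-global matching as a stand-in for the coarse-to-fine scanline DP.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                blockSize=5, P1=8 * 5 * 5, P2=32 * 5 * 5)
disp = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed-point

# Triangulate: depth = f * baseline / disparity (f in pixels, 100 mm baseline).
f_px, baseline_m = 1000.0, 0.1                                 # f is assumed
depth = np.where(disp > 0, f_px * baseline_m / disp, 0.0)
```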
3D Reconstruction of Ego-motion
We reconstruct a camera trajectory using a standard structure from motion pipeline with a few modifications to handle a large number of images (a 30 minute walking sequence at a 30 fps reconstruction rate produces 108,000 HD images). We partition the data such that each subset includes fewer than 500 images with sufficient overlap with neighboring image sets (100 image overlap). We reconstruct each subset independently and merge them by minimizing the cross reprojection error between two subsets, i.e., a point in one subset is reprojected to a camera in the other subset. Then, we project the reconstructed camera trajectory onto the ground plane estimated by fitting a plane using RANSAC [8], as sketched below.
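A minimal sketch of this last step, assuming the reconstructed 3D points and camera centers are given as numpy arrays (the iteration count and inlier tolerance are illustrative):

```python
import numpy as np

def fit_ground_plane(points, n_iter=500, tol=0.05, seed=0):
    """RANSAC plane fit: returns (n, d) with ||n|| = 1 and n.x + d = 0."""
    rng = np.random.default_rng(seed)
    best = (None, None, -1)
    for _ in range(n_iter):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:            # degenerate (collinear) sample
            continue
        n /= np.linalg.norm(n)
        d = -n @ p0
        inliers = int((np.abs(points @ n + d) < tol).sum())
        if inliers > best[2]:
            best = (n, d, inliers)
    return best[0], best[1]

def project_to_plane(cams, n, d):
    """Orthogonal projection of camera centers onto the ground plane."""
    return cams - (cams @ n + d)[:, None] * n
```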
Scenes

We collect both indoor and outdoor data, consisting of 21 scenes with 55,933 frames, 7.7 hours long in total, including walking on campus, in parks, and on downtown streets; shopping in a mall, cafe, and grocery store; as well as taking public transportation. The data covers various activities (walking, talking, and shopping), scenes (campus, parks, malls, and downtown streets), cities, and times. We also collect repeated daily routines multiple times on a campus. The dataset is summarized in Table 1.
We define the EgoSpace map with respect to a gaze direction, which allows us to canonicalize all trajectories in one coordinate system and further to represent them with compact bases. This stems from a primary conjecture: the gaze direction is aligned with ego-motion. In this section, we empirically verify the conjecture on our EgoMotion dataset.

Figure 4. From our dataset, we empirically show that the gaze direction is highly correlated with the direction of destination, i.e., we look where we go: (a) gaze pitch angle versus direction of destination; (b) yaw distribution of the direction of destination when looking up, straight, and down.
We compute the pitch angle of the gaze direction by calibrating the relationship between the first person camera and the gaze direction [25]. The pitch angle is cos⁻¹ v_z by definition in Section 3.1, and the position after 10 seconds is used to measure the direction of destination. Figure 4(a) shows a distribution of the direction of destination with respect to the gaze pitch angle, which indicates that the gaze direction is aligned with the pitch axis. Figure 4(b) shows the yaw distribution of the direction of destination given the pitch angle (a horizontal cross section of Figure 4(a)). This also indicates that the gaze direction is highly correlated with the direction of destination.

Scene      IKEA   Costco  Mall   Park   School1/2    Downtown1/2  Grocery1/2/3       Bus1/2
Frames     966    577     2683   3088   3754/3736    2856/3405    2858/2892/2834     2292/1850
Duration   08:03  04:49   22:22  25:44  31:17/31:08  23:48/28:23  23:49/24:06/23:37  19:06/15:25

Scene      Campus1  Campus2  Campus3  Campus4  Campus5  Campus6  Campus7  Campus8
Frames     2607     1884     1975     2359     3337     4034     2568     3378
Duration   21:44    15:42    16:28    19:40    27:49    33:37    21:24    28:09

Table 1. EgoMotion dataset (duration in mm:ss).
6. Results
We apply our method to predict ego-motion and hidden space in real world scenes by leveraging the EgoMotion dataset. We divide all scenes into two categories, indoor and outdoor, as ego-motion has different characteristics in each, e.g., speed and scene layout. Note that for all evaluations, we predict a scene that is not included in the training data, i.e., training and testing scenes are completely separated.
We quantitatively evaluate our trajectory prediction by comparing with ground truth trajectories obtained by 3D reconstruction of the first person camera. Our evaluation addresses the future localization problem.

Multiple trajectories are often equally plausible, e.g., at a Y-junction, while only one ground truth trajectory is available per image. This results in a large prediction error. To address this multiple path configuration, we measure predictive precision, i.e., how often one of our predicted trajectories aligns with the ground truth trajectory:

    prec. = Σ_{i=1}^{N} D_i / N,

where N is the number of testing images, D_i = 1 if min_k max_t ‖X̂_t − X^k_t‖ < ε, and D_i = 0 otherwise, where X^k_t is the location at the t-th time instant of the k-th predicted trajectory and X̂ is the ground truth trajectory. We set ε = 1 m. Note that unlike previous approaches that measured a spatial distance between trajectories [13], our evaluation measures a spatiotemporal distance between trajectories because the time scale also needs to be considered (dynamic time warping was used to handle the time scale).

Four baseline methods are used to compare against our approach (we designed these baselines ourselves because no previous algorithm predicts trajectories of ego-motion): one method solely based on gaze direction, two methods with a depth image subsampled at the same resolution as our EgoSpace map, and one method with the EgoSpace map but without the trajectory refinement of Equation (4). (1) Going straight: we generate a trajectory aligned with the gaze direction to test gaze bias. (2) Pure 2D: we retrieve a set of trajectories using KNN solely based on a subsampled depth image. (3) 2D+ground plane: we retrieve trajectories using the subsampled depth image but transform the coordinates of the trajectories such that they lie on the ground plane of the test image; this coordinate transform takes into account the 3D camera direction with respect to the ground plane of the test image. (4) EgoSpace w/o trajectory optimization: the trajectories are retrieved by the EgoSpace map but without adaptation to the test image by Equation (4); in fact, this method provides the initialization of our predicted trajectories.

Figure 5 shows evaluations on indoor and outdoor depth images. We retrieve k neighbors from the dataset and measure precision. Our method outperforms the baseline algorithms by a large margin. These experiments indicate that the EgoSpace representation has strong predictive power compared to the camera pose oriented feature produced by the subsampled depth image. Also, scene adaptation by the trajectory optimization produces more accurate prediction (see the performance gap from the initialization). As noted in Section 5, a gaze direction is a good predictor but not strong enough to predict long term behavior. Note that the precision at small k may be significantly improved by using N-best algorithms [24] based on homotopy classes [9], because KNN retrieves many redundant trajectories. In Table 2, we measure the average precision across all scenes in Section 5.

Indoor             0∼5 secs               5∼10 secs              10∼15 secs
                   k=100  k=60   k=30    k=100  k=60   k=30     k=100  k=60   k=30
Going straight     0.571                 0.221                  0.124
Pure 2D            0.643  0.507  0.308   0.524  0.379  0.217    0.346  0.229  0.123
2D+Ground plane    0.710  0.556  0.367   0.561  0.413  0.267    0.384  0.261  0.162
EgoSpace w/o opt.  0.690  0.534  0.341   0.570  0.265  0.255    0.401  0.265  0.156
EgoSpace w/ opt.   –      –      –       –      –      –        –      –      –

Outdoor            0∼5 secs               5∼10 secs              10∼15 secs
                   k=100  k=60   k=30    k=100  k=60   k=30     k=100  k=60   k=30
Going straight     0.443                 0.259                  0.103
Pure 2D            0.535  0.506  0.303   0.417  0.391  0.218    0.267  0.255  0.142
2D+Ground plane    0.554  0.554  0.350   0.425  0.407  0.244    0.293  0.261  0.135
EgoSpace w/o opt.  0.567  0.527  0.329   0.432  0.399  0.233    0.289  0.250  0.141
EgoSpace w/ opt.   –      –      –       –      –      –        –      –      –

Table 2. Average precision (k is the number of neighbors).
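For clarity, here is a small numpy sketch of the predictive precision metric defined above; the array layout and the ε default are our assumptions.

```python
import numpy as np

def predictive_precision(gt, preds, eps=1.0):
    """Fraction of test images where at least one predicted trajectory
    stays within eps of the ground truth at every time step:
    D_i = 1 if min_k max_t ||Xhat_t - X^k_t|| < eps.
    gt:    list of (F, 2) ground-truth trajectories, one per test image.
    preds: list of (k, F, 2) arrays of predicted trajectories."""
    hits = 0
    for Xhat, Xk in zip(gt, preds):
        d = np.linalg.norm(Xk - Xhat[None], axis=2)  # (k, F) spatiotemporal
        if d.max(axis=1).min() < eps:
            hits += 1
    return hits / len(gt)
```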
Occluded Space Discovery
We quantitatively evaluate our occluded space discovery by measuring the detection rate, D/N, where D is the number of true positive detections and N is the total number of detections produced by the space discovery. We threshold the likelihood of the occluded space, ψ, from Equation (5) and manually evaluate whether each detection is correct. Note that no ground truth label is available unless the camera wearer has already passed through the space. The detection rate in Table 3 indicates that our method predicts the outdoor scenes better than the indoor scenes. This is because in the indoor scenes such as Grocery and IKEA, the camera wearer had a number of close interactions with objects such as shelves and products, where the view of the scene is substantially limited.
Figure 5. Precision versus the number of predicted trajectories, k, at 0∼5, 5∼10, and 10∼15 seconds. We compare our method with four baseline representations: (1) Going straight; (2) Pure 2D: no EgoSpace map, without adaptation of the ground plane to the test scene; (3) 2D + ground plane: no EgoSpace map, with adaptation of the ground plane to the test scene; (4) EgoSpace without trajectory optimization. Our method outperforms the other representations.
Indoor          Mall I   Grocery   IKEA
Detection rate  0.5882   0.2371    0.3937

Outdoor         Park     Bus stop  Walk
Detection rate  0.6234   0.6593    0.6338
Table 3. Detection rate
We apply our method to real world examples to predict a set of plausible trajectories of ego-motion and the space occluded by foreground objects. Our training dataset is completely separated from the testing data, e.g., the Grocery scene was used in training to predict the IKEA scene. Given a depth image, we estimate the ground plane by RANSAC based plane fitting with gravity and height priors. This ground plane is used to define the EgoSpace map with respect to the camera direction (the yaw angle of the gaze direction is assumed to be aligned with the camera direction).

Figure 1 and Figure 7 illustrate our results on the EgoMotion dataset. In the testing phase, only a depth image is used, while 3D reconstructions of camera poses were used in the training phase. In Figure 7, we show (1) the image and ground truth ego-motion; (2) the input depth image; (3) the EgoSpace map overlaid with the predicted trajectories (gray) and the ground truth trajectory (red); (4) the reprojection of the trajectories; and (5) the reprojection of the occluded space computed from the EgoSpace map (inset image). For all scenes, our method predicts plausible trajectories that pass through unexplored space.

Obstacle Avoidance
Our cost function in Equation (4) minimizes the cost difference between trajectories from the training data and the testing data. This precludes a trajectory from passing through an object unless the retrieved trajectory was partially occluded. The EgoSpace map captures this obstacle avoidance, as shown in Campus and Grocery.
Multiple Plausible Trajectories
Our prediction produces a number of plausible trajectories that conform to the testing scene: trifurcated trajectories in Campus, bifurcated trajectories in Bus stop, and multiple directions of trajectories in Mall I.
Occluded Space Discovery
The space occluded by foreground objects is discovered by the predicted trajectories: the space inside the shop and behind the person in Figure 1; the space occluded by the left fence and persons in Campus; the space behind the cars and the parking vending machine in Bus stop; the space behind the persons and trees in Park; the space inside the shop and around the left corner; the space behind the column; and the space occluded by the fence.
7. Discussion
In this paper, we present a method to predict ego-motion and the space occluded by foreground objects from a first person depth image. The EgoSpace map, which encodes a likelihood of occlusion, is used to represent the scene around a camera wearer. We associate a trajectory with the EgoSpace map in the training phase to predict a set of plausible trajectories given a test depth image. The trajectories, parametrized by a linear combination of compact trajectory bases, are refined to conform with the test depth image. The occluded space is detected by measuring how often the predicted trajectories invade the occluded space.
Figure 6. Our method fails due to mis-estimation of the ground plane, different scene distributions, and failure of depth estimation.
Limitation
Our framework needs three ingredients: similar scene training data, ground plane estimation, and depth computation. Failure cases of each are illustrated in Figure 6.

Figure 7. Given a depth image (second column), we predict a set of plausible trajectories of ego-motion (fourth column) and discover the occluded space (fifth column) using the EgoSpace map (third column: predicted trajectories in gray and ground truth trajectory in red). The first column shows an image with the ground truth trajectory of ego-motion measured by 3D reconstruction of a first person camera (time is color-coded). Scenes shown: Mall II, IKEA, Mall I, Park, Grocery store, Bus stop, and Campus. For more scene description, see Section 6.

References

[1] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. PAMI, 2011.
[2] A. Alahi, V. Ramanathan, and L. Fei-Fei. Socially-aware large-scale crowd forecasting. In CVPR, 2014.
[3] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. In ECCV, 2008.
[4] I. Arev, H. S. Park, Y. Sheikh, J. K. Hodgins, and A. Shamir. Automatic editing of footage from multiple social cameras. SIGGRAPH, 2014.
[5] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011.
[6] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interaction: A first-person perspective. In CVPR, 2012.
[7] A. Fathi and J. M. Rehg. Modeling actions through state changes. In CVPR, 2013.
[8] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
[9] H. Gong, J. Sim, M. Likhachev, and J. Shi. Multi-hypothesis motion planning for visual object tracking. In ICCV, 2011.
[10] E. T. Hall. A system for the notation of proxemic behaviour. American Anthropologist, 1963.
[11] D. Helbing and P. Molnár. Social force model for pedestrian dynamics. Physical Review E, 1995.
[12] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, 2011.
[13] K. M. Kitani, B. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
[14] J. Kopf, M. Cohen, and R. Szeliski. First-person hyper-lapse videos. SIGGRAPH, 2014.
[15] H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee. Motion planning under uncertainty for robotic tasks with long time horizons. In Robotics Research, 2009.
[16] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning. Technical Report 98-11, Computer Science Department, Iowa State University, 1998.
[17] R. Lee, D. H. Wolpert, S. Backhaus, R. Bent, J. Bono, and B. Tracey. Modeling humans as reinforcement learners: How to predict human behavior in multi-stage games. In NIPS, 2011.
[18] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
[19] S. Levine and V. Koltun. Continuous inverse optimal control with locally optimal examples. In ICML, 2012.
[20] C. Li and K. M. Kitani. Pixel-level hand detection for egocentric videos. In CVPR, 2013.
[21] Y. Li, A. Fathi, and J. M. Rehg. Learning to predict gaze in egocentric video. In ICCV, 2013.
[22] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In CVPR, 2009.
[23] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. PAMI, 2014.
[24] D. Park and D. Ramanan. N-best maximal decoders for part models. In ICCV, 2011.
[25] H. S. Park, E. Jain, and Y. Sheikh. 3D social saliency from head-mounted cameras. In NIPS, 2012.
[26] H. S. Park and J. Shi. Social saliency prediction. In CVPR, 2015.
[27] S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[28] A. Pentland and A. Liu. Modeling and prediction of human behavior. Neural Computation, 1999.
[29] J. Pineau and G. J. Gordon. POMDP planning for robust robot control. In Robotics Research, 2007.
[30] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012.
[31] S. Ragi and E. K. P. Chong. UAV path planning in a dynamic environment via partially observable Markov decision process. IEEE Transactions on Aerospace and Electronic Systems, 2013.
[32] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[33] M. S. Ryoo and L. Matthies. First-person activity recognition: What are they doing to me? In CVPR, 2013.
[34] M. S. Ryoo, B. Rothrock, and L. Matthies. Pooled motion features for first-person videos. In CVPR, 2015.
[35] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
[36] T. Vu, C. Olsson, I. Laptev, A. Oliva, and J. Sivic. Predicting actions from static scenes. In ECCV, 2014.
[37] B. Xiong and K. Grauman. Detecting snap points in egocentric video with a web photo prior. In ECCV, 2014.
[38] B. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.