Structure from Recurrent Motion: From Rigidity to Recurrency
Xiu Li, Hongdong Li, Hanbyul Joo, Yebin Liu, Yaser Sheikh
Tsinghua University, Carnegie Mellon University, Australian National University

Abstract
This paper proposes a new method for Non-Rigid Structure-from-Motion (NRSfM) from a long monocular video sequence observing a non-rigid object performing recurrent and possibly repetitive dynamic actions. Departing from the traditional idea of using a linear low-order or low-rank shape model for the task of NRSfM, our method exploits the property of shape recurrency (i.e., many deforming shapes tend to repeat themselves in time). We show that recurrency is in fact a generalized rigidity. Based on this, we reduce NRSfM problems to rigid ones, provided that a certain recurrency condition is satisfied. Given such a reduction, standard rigid-SfM techniques are directly applicable (without any change) to the reconstruction of non-rigid dynamic shapes. To implement this idea as a practical approach, this paper develops efficient algorithms for automatic recurrency detection, as well as camera-view clustering via a rigidity check. Experiments on both simulated sequences and real data demonstrate the effectiveness of the method. Since this paper offers a novel perspective on rethinking structure-from-motion, we hope it will inspire other new problems in the field.
1. Introduction
Structure-from-Motion (SfM) has been a success story in computer vision. Given multiple images of a rigidly moving object, one is able to recover the 3D shape of the object, as well as the camera locations, by using geometric multi-view constraints. Recent research in SfM has been extended to the reconstruction of non-rigid dynamic objects or scenes from multiple images, leading to "Non-Rigid Structure from Motion" (NRSfM for short).

Despite remarkable progress made in NRSfM, existing methods suffer from serious limitations. Notably, they often assume simple linear models, either over the non-rigid shape [7] or over motion trajectories [1], or both [9]. These linear models, while useful for characterizing certain classes of deforming objects (e.g., faces, human poses, or clothing), are unable to capture the variety of dynamic objects in rapid deformation that are nevertheless common in reality.

This paper presents a new method for non-rigid structure from motion. Contrary to the traditional wisdom in NRSfM, we do not make a linear-model assumption. Instead, we describe how to exploit shape recurrency for the task of non-rigid reconstruction. Specifically, we observe that in our physical world many deforming objects (and their shapes) tend to repeat themselves from time to time, or even only occasionally. In the context of SfM, if a shape reoccurs in the video we say it is recurrent.

This observation of recurrency enables us to use the existing knowledge of multi-view geometry to reconstruct a shape. Given a video sequence, if one is able to recognize a shape that was seen before, then these two image instances can be used as a virtual stereo pair of the same rigid object in space. Therefore one can simply apply standard rigid-SfM techniques to reconstruct a non-rigid object, without developing new methods.
For instance, the techniques used in rigid SfM [12], such as the fundamental matrix, computing camera poses with the Perspective-n-Point (PnP) algorithm, triangulating 3D points, bundle adjustment, and rigid factorization, can all be used without modification. We conducted experiments on both synthetic and real data, showing the efficacy of our method.
2. The Key Insight
Rigidity is a fundamental property that underpins almost all work in rigid Structure-from-Motion (SfM). We say an object is rigid if its shape remains constant over time. For this reason, multiple images of the same object, taken from different viewpoints, can be viewed as redundant observations of the same target, making the task of rigid SfM mathematically well-posed and solvable. In contrast, the shape of a non-rigid object changes over time, violating the rigidity assumption and rendering NRSfM ill-posed.

In this paper, we show that shape recurrency is in fact a generalized rigidity. At first glance, finding rigid pairs may seem a restrictive condition; however, satisfying this condition is far easier than it appears. Recurrent motions are ubiquitous in our surroundings, including human walking, animals running, leaves waving, a clock's pendulum swaying, car wheels rotating, and so on. Many human motions, such as martial arts, dance, and sports, contain various repetitive motions and patterns. Even dramatic or non-periodic motions can be included: as long as the visual observation is long enough, it becomes highly probable that a previously seen shape is revisited. If we are given multiple sequences of similar human motions, although they are not exactly the same scenes, this further increases the chances of finding recurrent motions.

This is the key insight of this paper.

Figure 1. This composite slow-motion picture of figure skating illustrates the basic idea of our non-rigid SfM method. Although the skater's body pose keeps changing dynamically over time, there are moments when she strikes (nearly) identical postures, e.g., as indicated by the two red arrows and the two blue arrows. Using a pair of such recurrent observations, albeit distant in time, one can reconstruct the 3D pose (shape) of the skater at those time instants using only standard rigid-SfM techniques.
To further illustrate this idea, consider the example of figure skating in Fig. 1, showing a composite (strobe-type) photograph made by fusing multiple frames of slow-motion photos, which vividly captures the dynamic performance of the skater. Examine carefully each of the individual postures of the skater at different time steps; it is not difficult to recognize several (nearly) repeated poses.

To apply our idea of reconstructing non-rigid shapes from recurrence, we propose a novel method that formulates the NRSfM problem as a graph-clustering problem, which can be solved in a Normalized-Cut [35] framework. In particular, we build a method to compute the probability representing the rigidity of a shape from images at two different time instances. The final recurrence relations are solved globally, considering all connections in the constructed graph.
3. Problem Formulation and Main Algorithm
Consider a non-rigid dynamic 3D object observed by a moving pinhole camera, capturing N images at time steps t = 1, 2, .., N. Our task is then to recover all N temporal shapes of the object, S(1), S(2), .., S(N). To be precise, the shape of the object at time t, S(t), is defined by a set of M feature points (landmarks) on the object: S(t) = [X_t1, X_t2, .., X_tM], where X_ti denotes the homogeneous coordinates of the i-th feature point of the object at time t. Clearly, S(t) is a 4 × M matrix.

Given a pinhole camera with projection matrix P, a single 3D point X is projected onto the image at position x by the homogeneous equation x ≅ PX. For the shape of a temporally deforming object at time t, we have x(t) ≅ P(t) S(t), where x(t) denotes the image measurements of the shape S(t) at time t, and P(t) is the camera matrix of the t-th frame.

By collecting all N frames of observations of the non-rigid object at t = 1, .., N, we obtain the basic equation system of the N-view, M-point NRSfM problem:

    [x(1)]   [P(1)            ] [S(1)]
    [x(2)] ≅ [     P(2)       ] [S(2)]
    [ ... ]  [          ...   ] [ ... ]
    [x(N)]   [            P(N)] [S(N)]    (1)

Definition 3.1 (Rigidity). Given two 3D shapes S and S′ with known correspondences, we say that they form a rigid pair if they are related by a rigid transformation T. A rigid transformation can be compactly represented by a 4 × 4 matrix T, hence S′ = TS, ∃ T ∈ SE(3). We use S ≈ S′ to denote that S and S′ form a rigid pair.

Example 3.1 (Rigid Object). The shape of a rigid object remains constant at all times: S(t) ≈ S(t′), ∀ t ≠ t′.

Example 3.2 (Periodic Deformation). A non-rigid object undergoing periodic deformation with period p returns to its previous shape after a multiple of the period: S(t) ≈ S(t + kp), ∀ k ∈ ℕ.

Example 3.3 (Recurrent Object).
A shape at time t reoccurs after some time lapse δ: S(t) ≈ S(t + δ).

3.1. Rigidity Check via Epipolar Geometry

If two 3D shapes (represented by point clouds) and their exact correspondences are given, checking whether they are rigidly related is a trivial task. However, this is not possible in the case of NRSfM, where the shapes are not known a priori. All we have are two corresponding images of the shapes, and the rigidity test has to be conducted based on the input images only.

In this paper, we use an epipolar test for this purpose. It is based on a well-known result of epipolar geometry: if two 3D shapes differ only by a rigid Euclidean transformation, then their two images must satisfy the epipolar relationship. Put mathematically, S ≈ S′ ⇒ x′_i^T F x_i = 0, ∀i, where F is the fundamental matrix between the two images of S and S′, respectively. Note that this equation must be verified over all pairs of correspondences (x_i, x′_i).

Also note that satisfying the epipolar relationship is only a necessary condition for two shapes S and S′ to be rigidly related. This is because the epipolar relationship is invariant to any 4 × 4 projective transformation of 3-space. As a result, it is a weaker condition than the rigidity test, meaning that even if two images pass the epipolar test they may still be non-rigidly related. Fortunately, in practice this is not a serious issue, because the odds that a generic dynamic object (with more than 5 landmark points) changes its shape precisely following a 15-DoF 3D projectivity are negligible. In other words, there is virtually no risk of mistaking the two cases.

The above idea of an epipolar test looks very simple. As such, one might be tempted to rush to implementing the following straightforward algorithm:

1. Estimate a fundamental matrix from the correspondences using the linear 8-point algorithm;
2. Compute the mean residual error by averaging the point-to-epipolar-line distances evaluated at the key points in the image;
3. If this mean residual error is less than a pre-defined tolerance, return 'rigid'; otherwise return 'non-rigid'.

Unfortunately, despite its simplicity, the above algorithm is not useful in practice, for the following two reasons. (1) Ill-posed estimation: it is well known that linear methods for epipolar-geometry estimation are very sensitive to outliers; a single outlier may destroy the fundamental-matrix estimate. In our context, however, the situation is much worse than merely having a few outliers. Whenever the two feature-point sets are in fact not rigidly related, forcing them to fit a single fundamental matrix with any linear algorithm can only yield a meaningless estimate, subsequently leading to meaningless residual errors and an unreliable decision. In short, fitting all feature points to a single epipolar geometry is ill-posed. Instead, to perform a proper rigidity test one must consider the underlying 3D rigid reconstructability of all these image points. (2) Degenerate cases: even if two sets of points are indeed connected by a valid and meaningful fundamental matrix, there is no guarantee that a valid 3D reconstruction can be computed from the epipolar geometry. For example, when the camera undergoes a pure rotation, there is not enough disparity (parallax) in the correspondences to allow a proper reconstruction: the two cameras share a single center of projection, so depth cannot be observed. In such cases, the two images can be mapped to each other by a planar homography, and the fundamental-matrix estimate is non-unique.
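The naive test is easy to state in code. A minimal sketch with numpy (function names are ours, and the fundamental matrix F is assumed to be given, which sidesteps the ill-posed estimation issue described above):

```python
import numpy as np

def epipolar_residuals(F, x, x_prime):
    """Point-to-epipolar-line distances d(x'_i, l_i) with l_i = F x_i,
    for homogeneous 2D correspondences given as 3xM arrays."""
    lines = F @ x                                        # epipolar lines in image 2
    algebraic = np.abs(np.sum(x_prime * lines, axis=0))  # |x'_i^T F x_i|
    return algebraic / np.hypot(lines[0], lines[1])      # geometric distance

def naive_rigidity_test(F, x, x_prime, tol=1.0):
    """The naive test: threshold the mean residual (in pixels)."""
    mean_d = epipolar_residuals(F, x, x_prime).mean()
    return "rigid" if mean_d < tol else "non-rigid"

# Toy check: for a purely horizontally translating camera, F is the skew
# matrix of the epipole (1, 0, 0) and epipolar lines are the image rows,
# so points must keep their y-coordinate between the two views.
F = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
rng = np.random.default_rng(1)
x = np.vstack([rng.uniform(0, 100, (2, 20)), np.ones(20)])
x_rigid = x.copy(); x_rigid[0] += 5.0                   # slides along the lines
x_bent = x.copy();  x_bent[1] += rng.uniform(2, 10, 20) # leaves the lines
```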
Our solution:
We propose a new algorithm for the rigidity test, named the "Modified Epipolar Test", which resolves both of the above issues. First, it uses a (minimal) subset-sampling mechanism to ensure that the estimated two-view epipolar geometries (i.e., fundamental matrices) are meaningful. Second, it adopts model selection to exclude degenerate cases associated with a planar homography. The detailed algorithm is presented in Section 4.
Given that the above rigidity test is in place, we are now ready to present the main algorithm of the paper, namely Structure-from-Recurrent-Motion (SfRM).

Algorithm 1: A high-level sketch of our Structure-from-Recurrent-Motion algorithm

Input: N perspective views of a non-rigid shape S(t), t = 1, .., N; the desired number of clusters K.
Output: The reconstructed 3D shapes S(t), ∀ t ∈ {1, .., N}, up to non-rigid transformations.

for i = 1, .., N, j = 1, .., N do
    Call Algorithm 2 (the modified epipolar test) to fill a matrix A whose (i, j)-th entry A(i, j) gives the probability that images i and j are rigidly related.
end
[Clustering] Form a view-graph G(V, E, A) connecting all N views, using A as the affinity matrix. Run a suitable graph-clustering algorithm to cluster the N views into K clusters.
[Reconstruction] Apply any rigid-SfM reconstruction method to each of the K clusters.

Note that the core steps of the algorithm are the A-matrix computation and the graph clustering. It should also be noted that our algorithm only makes use of rigid-SfM routines to achieve non-rigid shape reconstruction.
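Stripped of its details, Algorithm 1 is a pairwise affinity computation followed by view clustering and per-cluster rigid reconstruction. A schematic sketch (the three callables are hypothetical placeholders for Algorithm 2, a K-way graph-clustering routine, and any rigid-SfM backend):

```python
import numpy as np

def sfrm(frames, K, rigidity_probability, cluster, rigid_sfm):
    """Skeleton of Algorithm 1 (Structure-from-Recurrent-Motion).
    frames: per-view 2D observations; K: number of clusters.
    The three callables stand in for the modified epipolar test,
    a K-way graph clustering (e.g. normalized cut), and rigid SfM."""
    N = len(frames)
    A = np.zeros((N, N))                  # pairwise rigidity affinities
    for i in range(N):
        for j in range(i + 1, N):
            A[i, j] = A[j, i] = rigidity_probability(frames[i], frames[j])
    labels = cluster(A, K)                # K-way view clustering on A
    shapes = {k: rigid_sfm([frames[i] for i in range(N) if labels[i] == k])
              for k in range(K)}          # one rigid reconstruction per cluster
    return A, labels, shapes

# Toy run with stand-in callables: frames with equal tags are "rigid pairs".
frames = ["a", "b", "a", "b"]
prob = lambda f, g: 1.0 if f == g else 0.0
cluster = lambda A, K: [0 if A[i, 0] == 1.0 or i == 0 else 1
                        for i in range(len(A))]      # stub clustering
rigid_sfm = lambda views: len(views)                 # stub reconstruction
A, labels, shapes = sfrm(frames, 2, prob, cluster, rigid_sfm)
```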
4. Modified Epipolar Test
In this section, we describe our modified epipolar-test algorithm. Its output is the probability that two images are projections of the same rigid shape. As discussed in the previous section, we implement this by checking whether the two sets of correspondences are related by some fundamental matrix, while at the same time not related by any planar homography. The latter condition (i.e., excluding a homography) ensures that 3D reconstruction is possible. Our algorithm is inspired by the early work of McReynolds and Lowe on the same task of rigidity checking [27]; however, ours is much simpler, involving neither complicated parameter tuning nor non-linear refinement. Rigidity checking has also been applied to solving multi-view geometry problems without recovering camera motion [23].

We proceed by presenting the algorithm first, followed by the necessary explanations and comments.
Algorithm 2: Modified Epipolar Test

Input: Two images, with M feature correspondences {(x_i, x′_i) | i = 1, .., M}.
Output: The probability P that the two images are rigidly related.

1. (Initialization): Set parameters σ_F, σ_H, τ_F, τ_H.
2. (Estimate fundamental matrices): Enumerate all C(M, 8) 8-point subsets of the M points; store them in a list indexed by k.
   for k = 1, .., C(M, 8) do
   • Pick the k-th 8-point subset and estimate the fundamental matrix F_k with the linear 8-point algorithm.
   • Given F_k, compute the geometric (point-to-epipolar-line) distances d_F(x′_i, F_k x_i) for all M points.
   • Convert the distances to probability measures by applying a Gaussian kernel, and compute their product:

       P_F(k) = ∏_{i=1..M} exp( −d_F(x′_i, F_k x_i) / σ_F ).    (2)

   end
   Take the minimum over all C(M, 8) probabilities:

       P_F = min_k P_F(k).    (3)

3. (Estimate homographies): Run a similar procedure for homography estimation, sampling all C(M, 4) 4-point subsets indexed by l. The overall homography probability is

       P_H = min_l ∏_{i=1..M} exp( −d_H(x′_i, H_l x_i) / σ_H ).    (4)

4. (Compute overall probability): We now have both P_F and P_H. Compare them with their respective tolerances τ_F and τ_H:
   if (P_F ≥ τ_F) and (P_H < τ_H), then set P = P_F (1 − P_H); else set P = 0. Return P.

In Step 2 of Algorithm 2, we sample subsets of the data points, each consisting of 8 points, the minimum required to linearly fit a fundamental matrix. This way we avoid forcing too many points to fit a single epipolar geometry. If the cameras are calibrated, one could also sample 5 points and use the non-linear 5-point essential-matrix algorithm for better sampling efficiency (e.g. [24, 25]).

Once a fundamental matrix F_k has been estimated from an 8-tuple, we evaluate how likely it is that every feature point satisfies this fundamental matrix. Assuming this probability is independent for each point, the product in Eq. (2) gives the total probability P_F(k) of how well F_k explains all M points. Exhausting all C(M, 8) subsets, we pick the smallest value (Eq. (3)) as a conservative estimate of the rigidity score. In Step 3, we repeat a similar sampling-and-fitting procedure for homography estimation; the idea is to perform model selection [36] to filter out the degenerate cases. Finally, in Step 4, we report the overall probability (of the rigidity check for the two images) as the product of P_F and (1 − P_H) when P_F is sufficiently high (i.e., ≥ τ_F) and P_H is sufficiently low (i.e., < τ_H); otherwise we report 0. In summary, our algorithm estimates a rigidity score defined by the worst-case goodness-of-fit achieved over all tentative fundamental matrices (one per 8-tuple), while at the same time favoring the case that can hardly be explained by a homography.

Algorithm 2 can be computationally expensive due to its exhaustive subset enumeration in Step 2. For example, when M = 100, C(M, 8) is about 186 billion.

Figure 2. Examples of A matrices: (from left to right) periodic, recurrent, and rigid scenarios.

Below we show that one can almost safely replace the enumeration step with a randomized sampling process using far fewer samples, at little loss of accuracy. Specifically, we only need to replace the loop over all C(M, 8) subsets in Step 2 with K random draws of minimal 8-tuples.

Suppose a proportion e of the subsets is valid (i.e., e is the inlier ratio), where 'valid' means the 8-tuple gives rise to a good epipolar geometry that explains all data points well enough. Then the probability of picking a valid 8-tuple in a single draw is e, and the probability of drawing an outlier is 1 − e. If one samples K times, the probability that all K draws are outliers is (1 − e)^K, so the probability of obtaining at least one valid estimate is p = 1 − (1 − e)^K. This probability can be made very high in practice, suggesting that even a small number of random samples suffices. Note that this argument is akin to the probability calculation used in RANSAC; see [37] for details.
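The randomized test can be sketched in a few dozen lines of numpy. This is a simplified reading of Algorithm 2, not a faithful implementation: Hartley-style coordinate normalization is omitted, the Gaussian kernel is written as exp(−d/σ_F) following Eq. (2), the paper's conservative minimum over samples is kept, and the homography branch and the τ thresholds are left out; all function names are ours.

```python
import numpy as np

def eight_point(x, xp):
    """Linear 8-point fit of a fundamental matrix from 3x8 homogeneous
    correspondences (Hartley normalization omitted for brevity)."""
    A = np.stack([np.kron(xp[:, i], x[:, i]) for i in range(8)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)            # null vector of A, row-major
    U, s, Vt2 = np.linalg.svd(F)
    s[2] = 0.0                          # enforce the rank-2 constraint
    return U @ np.diag(s) @ Vt2

def point_line_dist(F, x, xp):
    """Point-to-epipolar-line distances d_F(x'_i, F x_i)."""
    lines = F @ x
    return np.abs(np.sum(xp * lines, axis=0)) / np.hypot(lines[0], lines[1])

def rigidity_score(x, xp, K=100, sigma=1.0, seed=0):
    """Randomized P_F of Eqs. (2)-(3): for K random 8-tuples, fit F_k and
    turn the residuals into a Gaussian-kernel product; keep the paper's
    conservative minimum over the sampled tuples."""
    rng = np.random.default_rng(seed)
    M = x.shape[1]
    P_F = np.inf
    for _ in range(K):
        idx = rng.choice(M, size=8, replace=False)
        Fk = eight_point(x[:, idx], xp[:, idx])
        d = point_line_dist(Fk, x, xp)
        P_F = min(P_F, float(np.prod(np.exp(-d / sigma))))
    return P_F

def success_odds(e, K):
    """Probability of drawing at least one valid 8-tuple in K samples."""
    return 1.0 - (1.0 - e) ** K

# Toy scene: random 3D points observed by two calibrated views.
rng = np.random.default_rng(3)
X = np.vstack([rng.uniform(-1, 1, (2, 30)), rng.uniform(4, 6, (1, 30)),
               np.ones((1, 30))])
c, s_ = np.cos(0.3), np.sin(0.3)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.array([[c, 0, s_], [0, 1, 0], [-s_, 0, c]]),
                np.array([[0.5], [0.0], [0.0]])])
def project(P, Xh):
    xh = P @ Xh
    return xh / xh[2]
x1, x2_rigid = project(P1, X), project(P2, X)
X_def = X.copy()
X_def[:3] += rng.normal(scale=0.5, size=(3, 30))   # non-rigid deformation
x2_nonrigid = project(P2, X_def)
```

The score is near one for the rigid pair (every sampled F_k explains all points) and drops sharply for the deformed pair, while `success_odds` mirrors the RANSAC-style argument p = 1 − (1 − e)^K from the text.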
5. View Clustering and Block Reconstruction
For a given video sequence containing N views, we construct a complete view-graph G(V, E, A) of N nodes, each corresponding to one view. E denotes the set of edges, and A the affinity matrix in which A(i, j) measures the similarity between node i and node j. After the affinity-computation step of Algorithm 1, we have obtained an N × N matrix A, which we use as the affinity matrix of our view-graph.

Fig. 2 visualizes example A matrices characterizing different types of dynamic movements in videos: periodic motion, recurrent motion, and rigid motion, respectively. In Fig. 2, bright colors in the matrix indicate the views at which a particular shape re-occurred.

Given the view-graph G(V, E, A) with the rigidity matrix A as its affinity matrix, and a suitable number K of intended clusters, we use spectral clustering to perform K-way camera-view clustering. If two views are clustered into the same group, the two views are related by a rigid transformation.

Figure 3. For periodic motion, with K = 40 (i.e., one period). From left to right: original A matrix, rearranged A matrix after clustering, and cluster membership.

Figure 4. For general recurrent motion, with K = 25. From left to right: original A matrix, rearranged A matrix after clustering, and cluster membership.

Specifically, we use Shi and Malik's Normalized Cut [35] for its simplicity. The algorithm goes as follows. First, compute a diagonal matrix D with entries D(i, i) = Σ_j A(i, j). Then form the symmetric normalized Laplacian L = I − D^{−1/2} A D^{−1/2}. Next, take the eigenvectors corresponding to the second-smallest and higher eigenvalues of L and run the K-means algorithm on them to achieve the K-way clustering. Examples are given in Figs. 3 and 4. After the spectral clustering, the A matrix is rearranged into a block-diagonal structure.
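The spectral-clustering step can be sketched in a self-contained way using only numpy. This is one common variant, not necessarily the exact recipe used in our experiments: we embed each view with the bottom-K eigenvectors of the symmetric normalized Laplacian and run a tiny deterministic K-means on the embedding rows.

```python
import numpy as np

def kmeans(E, K, iters=50):
    """Tiny deterministic K-means: farthest-point init + Lloyd iterations."""
    centers = [E[0]]
    for _ in range(1, K):
        d2 = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d2)])   # row farthest from current centers
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((E[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = E[labels == k].mean(axis=0)
    return labels

def ncut_cluster(A, K):
    """K-way spectral clustering of a view-graph with affinity A: embed the
    views with the bottom-K eigenvectors of the symmetric normalized
    Laplacian, then cluster the (row-normalized) embedding with K-means."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - Dinv @ A @ Dinv
    w, V = np.linalg.eigh(L)               # eigenvalues in ascending order
    E = V[:, :K]                           # bottom-K eigenvectors
    E /= np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    return kmeans(E, K)

# Toy affinity with two recurrent groups of views (weak cross-talk of 0.01).
A = np.full((6, 6), 0.01)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
labels = ncut_cluster(A, 2)
```

On this toy A, views 0-2 and views 3-5 end up in two separate clusters, mirroring the block-diagonal rearrangement shown in Figs. 3 and 4.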
Each block represents a cluster of views that are rigidly connected, up to an accuracy of about the diameter of the cluster. They can therefore be considered multiple rigid projections of the same shape, and any standard rigid-SfM technique can be used to recover the 3D shape. In our experiments, specifically, we use incremental bundle adjustment, which gradually adds new frames to a local triangulation thread.

As each rigid shape cluster is reconstructed independently, all recovered shapes are determined only up to an ambiguous scale. To achieve globally consistent reconstruction results, we align the shape scales by normalizing the distance between two selected landmarks (e.g., by normalizing the maximum limb length for a human body).

6. Results
The input to our method is multi-frame feature correspondences, as in other NRSfM methods (e.g., [11, 1]). Finding correspondences is a difficult task in itself, especially for non-rigid deforming objects, where self-occlusions may occur frequently. In our experiments with synthetic data, we assume that the correspondences are provided. For the real data, we use the publicly available OpenPose [4] library to detect human poses, faces, and hands in the sequences.
The first experiment aims to validate that Algorithms 1 and 2 work on a real sequence with periodic movements, which is a special (and simpler) case of recurrent motion. We use a sequence capturing a person walking at a constant speed, observed by a moving camera from different viewpoints, resulting in a nearly periodic sequence. We apply OpenPose [4] to detect 14 landmark points on the person over all 700 frames. Example frames are shown in Fig. 5. For the entire sequence, the rigidity (i.e., affinity) matrix computed by Algorithm 2 is shown on the left of Fig. 6. The figure shows a strong periodicity, visible as bright bands parallel to the main diagonal. Moreover, the period can be readily read off as p = 40 frames, although our algorithm does not make use of this value; instead, frames with repetitive shapes are automatically grouped together via view-graph clustering. The middle and right panels of Fig. 6 show the rearranged affinity matrix after spectral clustering and the final cluster membership, respectively, where the evident 'blocky' structure clearly reveals the grouping. We then perform rigid SfM on all views within each block. Fig. 7 shows example pose-reconstruction results; note that the poses are in 3D.

So far, our algorithm has focused only on recovering the non-rigid shape itself, ignoring its absolute pose in the world coordinate frame. In practice, this is easily fixed, assuming the ego-motion of each camera view can be recovered by, for example, standard rigid SfM/SLAM against a stationary background. We conduct this experiment by first tracking background points, then estimating absolute camera poses relative to the background, followed by a Procrustes alignment between the absolute camera poses and each reconstructed human pose. The final reconstruction result, with background point clouds, human poses, and trajectories, is shown in Fig. 8.
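As an illustrative aside (our pipeline never uses the period explicitly, as noted above), the period visible as bright diagonal bands in the affinity matrix can also be read off automatically, e.g. by scoring each candidate lag by the mean affinity along the corresponding off-diagonal:

```python
import numpy as np

def estimate_period(A, min_lag=2):
    """Score each candidate lag by the mean affinity along the matching
    off-diagonal of A, and return the best-scoring lag as the period."""
    N = len(A)
    scores = [np.diagonal(A, offset=lag).mean()
              for lag in range(min_lag, N // 2)]
    return min_lag + int(np.argmax(scores))

# Synthetic periodic affinity: frames i and j look rigid iff i ≡ j (mod 40).
N, p = 200, 40
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
A = ((i - j) % p == 0).astype(float)
```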
This experiment aims to demonstrate the performanceof our method on a general ( non-periodic ) video sequencewhich is likely to contain recurrent movements.
Figure 5. A (nearly) periodic walking sequence.
Figure 6. Affinity matrices before and after spectral clustering (i.e., N-Cut). The 'blocky' structure becomes evident after the N-cut. Right: the final view-clustering result.
Figure 7. 3D reconstruction results for the walking sequence.
Figure 8. Consistent 3D reconstruction of both the dynamic foreground object (with temporal trajectories) and the static background scene.
We choose a solo dancing sequence captured in the CMU Panoptic Studio [17]. This dataset contains videos from camera arrays. To increase the probability of successful reconstruction, we do not directly use one specific camera; instead, we extract a time-consecutive video by randomly "hopping" between different cameras in the dataset, simulating a video captured by a monocular camera randomly roaming in space.

This dancing sequence is challenging, as the motion of the dancer is fast and the dance itself is complicated, containing many unnatural body movements. The computed affinity matrix is shown in Fig. 9 and exhibits no obvious structure. However, after applying our graph clustering, a clear (albeit noisy) block-wise pattern emerges, suggesting that the video indeed contains many recurrent (repetitive) body poses. Example reconstruction results (along with the discovered recurrent frames) are shown in Fig. 10.

Figure 9. The computed original affinity matrix, and the block-wise pattern after clustering, on the CMU dancing sequence. There is no obvious cyclic pattern in the original affinity matrix; after clustering, clear recurrence patterns are revealed.

Figure 10. 3D reconstruction results on the dance sequence.
To quantitatively measure the performance of our method, we use Blender to generate synthetic deformations with recurrence. We take the flying-cloth dataset [38] and repeat the sequence several times to mimic recurrency. Camera views are randomly generated. Fig. 11 shows sample frames of the data.

For this sequence, all ground truth (object shape and camera poses) is available. Noise of different levels is added on the image planes. Our method successfully detects the recurrency and reconstructs the shape, as shown in the second row of Fig. 11. Reconstruction quality is measured by the shape error after alignment, as well as by the portion of successfully reconstructed frames. We evaluate both criteria at different noise levels; the results are given in Fig. 13.

We also compare our method with other state-of-the-art template-free NRSfM methods [1, 7]. The result is shown in Fig. 12. In terms of overall reconstruction accuracy, their performances are comparable to ours, while ours is superior for frames exhibiting strong recurrency.

Figure 11. Cloth waving in the wind, and our SfRM reconstructions.

Figure 12. Histograms of reprojection errors for the different methods (SfRM (ours): 0.0849; DLH: 0.1742; DCT: 0.2379). Here we compare our method with the shape-basis-based method [7] and the trajectory-basis-based method [1].

Figure 13. SfRM performance at different noise levels (reconstruction ratio and reconstruction error vs. pixel uncertainty). As noise increases, the reconstruction error rises while the success ratio falls, showing that our method handles increasing amounts of noise gracefully.
Fig. 14 gives the timing results of our SfRM system (excluding the rigid reconstruction), showing a clear linear relationship with respect to the number of feature points, as well as with respect to the number of random samples (in Algorithm 2), but a quadratic relationship with respect to the number of image frames. In our experiments we chose K (the number of clusters) empirically; in future work we would like to investigate how to determine K automatically.

We also test our method on face and hand data captured in the Panoptic Studio. Sample qualitative results are shown in Fig. 15.
Figure 14. Timing (in seconds) as a function of the number of random samples, the number of points, and the number of frames.

Figure 15. Example images of 3D reconstruction of face and hand data.
7. Related Work
The idea behind our SfRM method is rather different from conventional NRSfM approaches. For space reasons we do not review the NRSfM literature here, but refer interested readers to recent publications on this topic and the references therein [7, 29, 15, 19, 8, 18]. Below, we focus on previous work with similar ideas.

A cornerstone of our method is the mechanism for detecting shape recurrence in a video. Similar ideas have been proposed for periodic dynamic-motion analysis [2, 32, 34, 33]. Our work is specifically inspired by [2, 14]; however, there are major differences. First, their methods assume strictly periodic motions, and need to estimate the period automatically [6] or manually [2]. As such, they can only handle limited periodic motions, such as well-controlled walking and running. In contrast, our method extends to the more general case of recurrent motions, which includes both a-periodic and re-occurring cases. Moreover, their methods assume a static camera, and under the periodicity assumption the target is not allowed to turn around and has to move (walking or running) along a straight line, capturing only partial surfaces [2] or trajectories [32]. By comparison, our method allows free-form target movements and camera motions. Finally, our method is fully automatic, whereas their methods rely on a significant amount of manual interaction.

Our method can be applied to 3D human-pose recovery, and is therefore related to much work in that domain [13, 31, 30, 28, 20, 21]. In particular, it is related to the research direction that tries to lift 3D pose from 2D images, e.g., [3, 5, 26]. Earlier work in this direction requires either knowledge of the target's bone lengths [22] or human pose- and shape-space priors [3]. Although our experiments use 3D human poses, mainly as exemplar recurrent movements, our method does not take advantage of any category-specific priors; rather, we treat poses as general point clouds in 3D.
It can be applied to objects beyond the human body. Another category of work on human pose capture relies on large-scale pose databases for retrieving the most similar pose based on a 3D-2D pose-similarity metric [10, 5, 16]. Their performance depends heavily on the size and quality of a database of the specific type of target, while ours works in general scenarios. A recent deep-learning approach by Martinez et al. [26] shows that a well-designed network directly regressing 3D keypoint positions from 2D joint detections performs well. However, it relies on a large amount of training data for a specific class, whereas ours works without training.
8. Conclusion
We have presented a new method for solving Non-Rigid Structure-from-Motion (NRSfM) for long video sequences containing recurrent motion. It directly extends the concept of rigidity to recurrency, as well as periodicity. With this method at hand, one is able to use traditional rigid-SfM techniques directly for non-rigid problems. Key contributions of this work include a randomized algorithm for robust two-view rigidity checking, and a view-graph clustering mechanism that automatically discovers recurrent shapes, enabling the subsequent rigid reconstructions. Our experiments, while finite in scope, demonstrate the usefulness of the proposed method, which is practically relevant thanks to the ubiquity of recurrent motions in reality. One may object that our method will not work if a shape is seen only once. We admit this is a fair criticism, but we argue that, in that case, there would be little practical value in reconstructing a shape of such a fleeting nature. Our proposed view-graph and shape-clustering algorithms are examples of unsupervised machine-learning techniques; in this regard, we hope this paper offers insights that bridge SfM research with learning methods.
Acknowledgement.
We would like to thank the reviewers and ACs for their valuable comments. This work was completed while XL was a visiting PhD student at CMU under the CSC Scholarship (201706210160). HL is grateful to YS for his very generous hosting. HL's work is funded in part by the Australia ARC Centre of Excellence for Robotic Vision (CE140100016). YL's research is funded by the National Key Foundation for Exploring Scientific Instrument (2013YQ140517) and NSFC grant (No. 61522111).

References

[1] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Nonrigid structure from motion in trajectory space. In NIPS, 2009.
[2] S. Belongie and J. Wills. Structure from periodic motion. In Spatial Coherence for Visual Motion Analysis. Springer, 2006.
[3] F. Bogo, A. Kanazawa, C. Lassner, P. V. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[5] C. Chen and D. Ramanan. 3D human pose estimation = 2D pose estimation + matching. In CVPR, 2017.
[6] R. Cutler and L. S. Davis. Robust real-time periodic motion detection, analysis, and applications. TPAMI, 2000.
[7] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. IJCV, 2014.
[8] M. Gallardo, T. Collins, A. Bartoli, and F. Mathias. Dense non-rigid structure-from-motion and shading with unknown albedos. In CVPR, 2017.
[9] P. F. Gotardo and A. M. Martinez. Non-rigid structure from motion with complementary rank-3 spaces. In CVPR, 2011.
[10] A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham. 3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding. In CVPR, 2014.
[11] R. Hartley and R. Vidal. Perspective nonrigid shape and motion recovery. In ECCV, 2008.
[12] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[13] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI, 2014.
[14] I. Laptev, S. Belongie, P. Pérez, and J. Wills. Periodic motion detection and segmentation via approximate sequence alignment. In ICCV, 2013.
[15] P. Ji, H. Li, Y. Dai, and I. Reid. Maximizing rigidity revisited: A convex programming approach for generic 3D shape reconstruction from multiple perspective views. In ICCV, 2017.
[16] H. Jiang. 3D human pose reconstruction using millions of exemplars. In ICPR, 2010.
[17] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic Studio: A massively multiview system for social motion capture. In ICCV, 2015.
[18] S. Kumar, A. Cherian, Y. Dai, and H. Li. Scalable dense non-rigid structure-from-motion: A Grassmannian perspective. In CVPR, 2018.
[19] S. Kumar, Y. Dai, and H. Li. Multi-body non-rigid structure-from-motion. 2016.
[20] W. Kusakunniran, Q. Wu, H. Li, and J. Zhang. Automatic gait recognition using weighted binary pattern on video. In IEEE AVSS, 2009.
[21] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang. Recognizing gaits across views through correlated motion co-clustering. TIP, 2014.
[22] H. Lee and Z. Chen. Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing, 1985.
[23] H. Li. Multi-view structure computation without explicitly estimating motion. In CVPR, 2010.
[24] H. Li and R. Hartley. Five-point motion estimation made easy. In ICPR, 2006.
[25] H. Li and R. Hartley. A simple solution to the six-point two-view focal-length problem. In ECCV, 2006.
[26] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[27] D. P. McReynolds and D. G. Lowe. Rigidity checking of 3D point correspondences under perspective projection. TPAMI, 1996.
[28] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. TOG, 2017.
[29] H. S. Park, T. Shiratori, I. A. Matthews, and Y. Sheikh. 3D reconstruction of a moving point from a series of 2D projections. In ECCV, 2010.
[30] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, 2017.
[31] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D human pose from 2D image landmarks. In ECCV, 2012.
[32] E. Ribnick and N. Papanikolopoulos. 3D reconstruction of periodic motion from a single view. IJCV, 2010.
[33] S. M. Seitz and C. R. Dyer. Affine Invariant Detection of Periodic Motion. University of Wisconsin-Madison, Computer Sciences Department, 1994.
[34] S. M. Seitz and C. R. Dyer. Detecting irregularities in cyclic motion. In Motion of Non-Rigid and Articulated Objects, IEEE Workshop on, 1994.
[35] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 2000.
[36] P. H. Torr. An assessment of information criteria for motion model selection. In CVPR, 1997.
[37] P. H. Torr and D. W. Murray. The development and comparison of robust methods for estimating the fundamental matrix. IJCV, 1997.
[38] R. White, K. Crane, and D. A. Forsyth. Capturing and animating occluded cloth.