Self-Similarity Based Time Warping
Christopher J. Tralie
Duke University, Department of Mathematics
[email protected]
arXiv preprint [cs.CV]
Abstract
In this work, we explore the problem of aligning two time-ordered point clouds which are spatially transformed and re-parameterized versions of each other. This has a diverse array of applications, such as cross-modal time series synchronization (e.g., MOCAP to video) and alignment of discretized curves in images. Most other works that address this problem attempt to jointly uncover a spatial alignment and correspondences between the two point clouds, or to derive local invariants to spatial transformations, such as curvature, before computing correspondences. By contrast, we sidestep spatial alignment completely by using self-similarity matrices (SSMs) as a proxy to the time-ordered point clouds, since self-similarity matrices are blind to isometries and respect global geometry. Our algorithm, dubbed “Isometry Blind Dynamic Time Warping” (IBDTW), is simple and general, and we show that its associated dissimilarity measure lower bounds the L1 Gromov-Hausdorff distance between the two point sets when restricted to warping paths. We also present a local, partial alignment extension of IBDTW based on the Smith Waterman algorithm. This eliminates the need for tedious manual cropping of time series, which is ordinarily necessary for global alignment algorithms to function properly.
1. Introduction / Background
Figure 1. A concept figure for our technique of aligning time-ordered point clouds which are rotated/translated/flipped and re-parameterized versions of each other. Rows of self-similarity matrices (SSMs) of points which are in correspondence are re-parameterized versions of each other, which reduces the global alignment problem to a series of 1D time warping problems. This observation forms the basis of our algorithm, which returns a “warping path,” drawn in cyan in the lower left plot of this figure, that informs how to synchronize the point clouds.

In this work, we address the problem of synchronizing sampled curves, which we refer to as “time-ordered point clouds” (TOPCs). The problem of synchronizing TOPCs which trace similar trajectories but which may be parameterized differently is usually approached with the Dynamic Time Warping (DTW) algorithm [31, 32]. Since sequential data can often be translated into a sequence of vectors in some feature space, this algorithm has found widespread use in applications such as spoken word synchronization [31, 32], gesture recognition [37], touch screen authentication [12], video contour shape sequence alignment [26], and general time series alignment [5], to name a few of the thousands of works that use it. The problem becomes substantially more difficult, however, when the point clouds undergo spatial transformations or dimensionality shifts in addition to re-parameterizations, which is more common across modalities. For instance, one may want to synchronize a motion capture sequence expressed with quaternions of joints with a video of a similar motion in some feature space (see Section 5.4). There is no apparent correspondence between these spaces a priori. This problem even arises within modalities, such as aligning gestures from different people who reside in different spatial locations.
Thus, when synchronizing sampled curves, it is important to address not only re-parameterizations, but also spatial transformations such as maps between spaces or rotations/translations/flips within the same space.

In our work, we avoid explicitly solving for spatial maps by using self-similarity matrices (SSMs). Figure 1 shows a sketch of the technique. Even if the curves have been rotated/translated/flipped and re-parameterized, rows of the SSMs which are in correspondence are re-parameterized versions of each other. Our technique is simple both conceptually and in implementation, it is fully unsupervised, and it is parameter free. There are also theoretical connections between our algorithm and metric geometry, as shown in Section 2.2. We also show an extension of our basic technique to partially align time series across modalities (Section 3), and this is the first known solution to that problem. Finally, with proper normalization (Section 4), these techniques can address cross-modal alignment. We show favorable results on a number of benchmark datasets (Section 5).

The main data structure we rely on in this work is the self-similarity matrix.
Given a space curve parameterized by the unit interval, $\gamma : [0, 1] \to (\mathcal{M}, d)$, a Self-Similarity Image (SSI) is a function $D_\gamma : [0, 1] \times [0, 1] \to \mathbb{R}$ so that

$$D_\gamma(i, j) = d(\gamma(i), \gamma(j)) \tag{1}$$

The discretized version of an SSI corresponding to a sampled version of a curve is a self-similarity matrix (SSM). SSIs and SSMs are naturally blind to isometries of the underlying space curve and time-ordered point cloud, respectively; these structures remain the same if the curve/point cloud is rotated/translated/flipped.

Time-ordered SSMs have been applied to the problem of human activity recognition in video [20], periodicity and symmetry detection in video motion [11], musical audio note boundary detection [14], music structure understanding and segmentation [4, 23, 35, 28], cover song identification [39], and dynamical systems [29], to name a few areas. In this work, we study more general properties of time-ordered SSMs that make them useful for alignment.

The warping path is the basic primitive object we seek in order to synchronize two time-ordered point clouds. It is a discrete version of an orientation preserving homeomorphism of the unit interval used to re-parameterize curves. In layman's terms, it provides a way to step forward along both point clouds jointly in a continuous way without backtracking, so that they are optimally aligned over all steps. More precisely, given two sets $X$ and $Y$, a correspondence $\mathcal{C}$ between the two sets is such that $\mathcal{C} \subset X \times Y$ and $\forall x \in X\ \exists y \in Y$ s.t. $(x, y) \in \mathcal{C}$ and $\forall y \in Y\ \exists x \in X$ s.t. $(x, y) \in \mathcal{C}$. In other words, a correspondence is a matching between two sets $X$ and $Y$ so that each element in $X$ is matched to at least one element in $Y$, and each element of $Y$ is matched to at least one element in $X$. Let $X$ and $Y$ be two sets whose elements are adorned with a time order: $X = \{x_1, x_2, \ldots, x_M\}$ and $Y = \{y_1, y_2, \ldots, y_N\}$.
A warping path between $X$ and $Y$ is a correspondence $\mathcal{W}$ which can be put into the sequence $\mathcal{W} = (c_1, c_2, \ldots, c_K)$ satisfying the following properties:
• Monotonicity: If $(x_i, y_j) \in \mathcal{W}$, then $(x_k, y_l) \notin \mathcal{W}$ for $k < i$, $l > j$
• Boundary Conditions: $(x_1, y_1), (x_M, y_N) \in \mathcal{W}$
• Continuity: $c_i - c_{i-1} \in \{(0, 1), (1, 0), (1, 1)\}$

Now suppose there are two time-ordered point clouds $X$ and $Y$ which both live in the same metric space $(\mathcal{M}, d)$. The Dynamic Time Warping (DTW) Dissimilarity [31, 32] between $X$ and $Y$ is

$$\mathrm{DTW}(X, Y) = \min_{\mathcal{W} \in \Omega} \sum_{(i, j) \in \mathcal{W}} d(x_i, y_j)$$

where $\Omega$ is the set of all valid warping paths between $X$ and $Y$. DTW satisfies the following subsequence relation:

$$\mathrm{DTW}_{ij} = d(x_i, y_j) + \min \left\{ \mathrm{DTW}_{i-1, j-1},\ \mathrm{DTW}_{i-1, j},\ \mathrm{DTW}_{i, j-1} \right\} \tag{2}$$

where $\mathrm{DTW}_{ij}$ is the DTW dissimilarity between $\{x_1, x_2, \ldots, x_i\}$ and $\{y_1, y_2, \ldots, y_j\}$. This makes it possible to solve DTW with a dynamic programming algorithm which takes $O(MN)$ time. This algorithm computes the cost of the shortest path from the upper left to the lower right of the “cross-similarity matrix” (CSM), or the $M \times N$ matrix holding all distances $d_{ij}$ between $x_i$ and $y_j$. A (not necessarily unique) shortest path realizing this distance is a warping path which can be used to align the two time series.

In addition to synchronizing curves in the same ambient space which are approximately re-parameterizations of each other, there has also been some recent work on the more difficult problem of matching curves which live in different ambient spaces or which live in the same space but which may differ by a spatial transformation in addition to re-parameterization. One objective for spatial alignment is the optimal rigid transformation taking one set of points to another. More precisely, given two Euclidean point clouds
$X, Y \subset \mathbb{R}^d$, each with $N$ points which are assumed to be in correspondence, the Procrustes distance [46, 22] is

$$d_P(X, Y) = \min_{R_x, R_y, t_x, t_y} \sum_{i=1}^{N} \| R_x (x_i - t_x) - R_y (y_i - t_y) \| \tag{3}$$

(Note that DTW is not a metric, as it fails to satisfy the triangle inequality; for an example, see [30], Section 4.1.)

One issue with Procrustes is that not only do $X$ and $Y$ have to have the same number of points, but the correspondences must be known a priori. Often in practice, neither of these assumptions is true. To deal with this, one can use the “Iterative Closest Points” (ICP) algorithm [6, 8], which switches back and forth between finding correspondences with nearest neighbors and solving the Procrustes problem. The authors of [49] use a modified version of ICP, replacing the nearest neighbor correspondence step with DTW. This ensures that the time order will be respected, which is not guaranteed with nearest neighbors only.

There are also techniques which use canonical correlation analysis (CCA) instead of Procrustes analysis. Given two point clouds with $N$ points represented by matrices $X \in \mathbb{R}^{d_x \times N}$ and $Y \in \mathbb{R}^{d_y \times N}$, assumed to be in correspondence, CCA is defined as

$$d_{CCA} = \min_{V_x \in \mathbb{R}^{d_x \times b},\ V_y} \| V_x^T X - V_y^T Y \|_F \tag{4}$$

for some chosen constant $b \leq \min(d_x, d_y)$, s.t. $V_x^T X X^T V_x = V_y^T Y Y^T V_y = I_b$. This is better suited to cross-modal applications where scaling is involved. Like Procrustes, this assumes that the correspondences are known a priori. To find the correspondences, the authors of [53] take the same iterative approach as the authors of [49] did with ICP, but they alternate back and forth between DTW and CCA instead of DTW and Procrustes. An updated version of this algorithm known as “generalized time warping” (GTW) [51, 52] was developed which aligns multiple sequences using a single optimization objective where the spatial alignment and time warping are coupled.
Finally, a recent work [40, 41] takes a similar approach, but it replaces CCA by learning features in the projection stage with a deep neural network. Like all supervised learning approaches, however, this method requires training data with known correspondences. Furthermore, all of the techniques we have mentioned so far require a good initial guess to converge to a globally optimal solution.

As an alternative to solving for a spatial alignment explicitly, many works perform time warping on a surrogate function which is invariant to isometries of the input. A popular choice is to numerically estimate curvature [24, 33, 15]. Some works use the triangle area between triples of points as an invariant [1], and some use the turning angle of the curve [9], which is related to curvature. These techniques can suffer from numerical difficulties when estimating the invariants. Also, most of the invariants are local, so small differences can cause the curves to “drift” over time (e.g., a U is similar to a 6 with local curvature [33]), though using integrated curvature [10] can ameliorate this.

(Regarding time order in ICP with DTW versus nearest neighbors only: this is analogous to the difference between the Fréchet Distance [2] and the Hausdorff Distance between curves.)
Figure 2. Self-similarity images of different parameterizations of a figure 8. Rectangles in one image map to rectangles in the other image, and lines in one image map to lines in the other image.
Beyond spatial alignment and invariants, the authors of [47] address the more general case of cross-domain object matching (CDOM) with general correspondences and address warping paths as a special case. However, their problem reduces to the quadratic assignment problem, which is NP-hard, and their iterative approximation requires a good initial guess. The authors of [42] address a special case where curves form closed loops, using cohomology to find maps from point clouds to the circle, where they are synchronized, but this only works for periodic time series. The authors of [16] jointly align curves on manifolds, which is effective but requires learning the manifolds. Perhaps the most similar to our approach is the action recognition work of [21], from which we drew much inspiration, which applies DTW to small patches of SSMs to align time warped actions from different camera views. However, they only use elements near the diagonal of SSMs, and their scheme does not extend across modalities.
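The two background primitives of this section, the self-similarity matrix of Equation 1 and the DTW recurrence of Equation 2, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the author's code: the function names `ssm`, `cross_similarity`, and `dtw`, the Euclidean metric, and the test curve are assumptions made for the example.

```python
import numpy as np

def ssm(X):
    """Self-similarity matrix of a TOPC: D[i, j] = ||X[i] - X[j]|| (Eq. 1, Euclidean case)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cross_similarity(X, Y):
    """Cross-similarity matrix (CSM): all distances between points of X and points of Y."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dtw(cost):
    """DTW dissimilarity over a cost matrix, using the recurrence of Eq. 2."""
    M, N = cost.shape
    acc = np.full((M + 1, N + 1), np.inf)  # padded so the boundary needs no special case
    acc[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[M, N]

# A sampled figure-8 curve, plus a rotated and translated copy (hypothetical test data)
t = np.linspace(0, 1, 40)
X = np.stack([np.cos(2 * np.pi * t), np.sin(4 * np.pi * t)], axis=1)
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
Y = X @ R.T + np.array([3.0, -1.0])

ssm_gap = np.abs(ssm(X) - ssm(Y)).max()   # SSMs agree up to round-off
dtw_same = dtw(cross_similarity(X, X))    # zero: the diagonal path is free
dtw_moved = dtw(cross_similarity(X, Y))   # large: plain DTW sees the rigid motion
```

The contrast between `ssm_gap` and `dtw_moved` is the motivation for the next section: the SSM is unchanged by the rigid motion, while ordinary DTW on the cross-similarity matrix is not.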
2. Isometry Blind Dynamic Time Warping
Most of the approaches we reviewed to align time series which have undergone linear transformations try to explicitly factor out those transformations before doing an alignment, but this is not necessary if we build our algorithm on top of the self-similarity matrices of the two point clouds, which are already blind to isometries. To set the stage for our algorithms, we first study the maps that are induced between self-similarity images by re-parameterization functions, which will help in the algorithm design. For example, take the figure 8 curve, $\gamma(t) = (\cos(2\pi t), \sin(4\pi t))$. The bottom left of Figure 2 shows the SSM of a linearly parameterized sampled version of this curve, while the bottom right of Figure 2 shows the SSM corresponding to a re-parameterized sampled version. The maps between the domains of the two SSMs always send rectangles to rectangles, and they are independent of the underlying curve being parameterized (they only depend on the relationship between the two parameterizations). To see this, start with a space curve $\gamma : [0, 1] \to \mathbb{R}^d$ and its resulting self-similarity image $D_\gamma$. Given a homeomorphism $h : [0, 1] \to [0, 1]$, which yields a space curve $\gamma_h = \gamma \circ h : [0, 1] \to \mathbb{R}^d$ and a corresponding self-similarity image $D_{\gamma_h}$, there is an induced homeomorphism $h \times h$ from the square to itself between the two domains of $D_\gamma$ and $D_{\gamma_h}$; that is, $D_{\gamma_h}(s, t) = D_\gamma(h(s), h(t))$. If we fix a correspondence $s \iff u = h(s)$, then this shows that row $h(s)$ of $D_\gamma$ is a 1D re-parameterization of row $s$ of $D_{\gamma_h}$, making rigorous the observation in Figure 1. Note that for a discrete version of these maps between time-ordered point clouds, one can replace the homeomorphism $h$ with a warping path $\mathcal{W}$, and the relationships are otherwise the same.

We now have the prerequisites necessary to define our main algorithm. The idea is quite simple.
Based on our observations in Figure 1 and Figure 2, if we know that point $i$ in a time-ordered point cloud (TOPC) X is in correspondence with a point $j$ in TOPC Y, then we should match the $i$-th row of X's SSM to the $j$-th row of Y's SSM under the $L_1$ distance, enforcing the constraint that $(i, j) \in \mathcal{W}$. However, since it is unknown a priori which rows should be in correspondence, we try every row $i$ of SSM A against every row $j$ of SSM B, and we create a cross-similarity time warping matrix (CSWM) $C$ so that $C_{ij}$ contains the $L_1$ DTW dissimilarity between row $i$ of SSM A and row $j$ of SSM B, constrained to warping paths which include $(i, j)$. To enforce that $(i, j)$ be in the optimal warping path, we exploit the boundary condition property of DTW by running the original DTW algorithm twice: once between $\mathrm{SSMA}_i[1:i]$ and $\mathrm{SSMB}_j[1:j]$, and once between $\mathrm{SSMA}_i[i:M]$ and $\mathrm{SSMB}_j[j:N]$, summing the costs. After doing this $\forall i, j$, we apply the ordinary DTW algorithm to $C$. Algorithm 1 summarizes this process. Note that a serial implementation of this algorithm takes $O(M^2 N^2)$ time, since a 1D DTW, which itself takes $O(MN)$ time, is computed for each of the $MN$ row pairs. To mitigate this, we implement a linear systolic array [50] version of DTW in CUDA. With unlimited parallel processors, this reduces computation to $O(M + N)$. In practice, we witness a 30x speedup between point clouds with hundreds of samples.

Figure 3 shows an example of this algorithm on two rotated/translated/re-parameterized time-ordered point clouds in $\mathbb{R}^2$ (point clouds 1 and 2). As the colors show, IBDTW puts the points into correspondence correctly even without first spatially aligning them. We also show alignment to a third time-ordered point cloud, which is metrically distorted in addition to being rotated/translated/re-parameterized. The returned warping degrades gracefully. We will explore this more rigorously in Section 5.1.

Figure 3. An example of IBDTW between 3 different samplings of a pinched ellipse. The optimal warping path found by Algorithm 1 is drawn in cyan on top of the CSWM in each case. Based on this, points which are in correspondence are drawn with the same color in the lower left figure. Though time-ordered point cloud 2 has more points towards the beginning and fewer points towards the end than time-ordered point cloud 1, correct regions are put into correspondence with each other. Furthermore, in addition to being parameterized this way, time-ordered point cloud 3 is also distorted geometrically, but the correspondences are still reasonable.

Algorithm 1 Isometry Blind Dynamic Time Warping
 1: procedure IBDTW(X, Y, d_X, d_Y)
 2:     ▷ TOPCs X and Y with M and N points, metrics d_X and d_Y, respectively
 3:     ▷ Initialize cross-similarity warp matrix (CSWM)
 4:     C ← M × N matrix with every entry ∞
 5:     for i = 1 : M do
 6:         ▷ i-th row of d_X
 7:         A ← [d_X(x_i, x_1), d_X(x_i, x_2), ..., d_X(x_i, x_M)]
 8:         for j = 1 : N do
 9:             ▷ j-th row of d_Y
10:             B ← [d_Y(y_j, y_1), d_Y(y_j, y_2), ..., d_Y(y_j, y_N)]
11:             C_ij ← ConstrainedDTW(A, B, L1, i, j)
12:         end for
13:     end for
14:     ▷ Use the CSWM C in ordinary DTW
15:     D ← DTW(X, Y, C)
16:     return (D, C)    ▷ Return the cost and the CSWM
17: end procedure

IBDTW can be put into the Gromov-Hausdorff Distance framework, which describes how to “embed” one metric space into another. More formally, given two discrete metric spaces $(X, d_X)$ and $(Y, d_Y)$, and a correspondence $\mathcal{C}$ between $X$ and $Y$, the $p$-stress is defined as

$$S_p(X, Y, \mathcal{C}) = \left( \sum_{(x, y), (x', y') \in \mathcal{C}} | d_X(x, x') - d_Y(y, y') |^p \right)^{1/p} \tag{5}$$

Intuitively, the $p$-stress measures how much one has to stretch one metric space when moving it to another. The Gromov-Hausdorff Distance $d_{GH}$ uses $S_\infty$ specifically:

$$d_{GH}(X, Y) = \frac{1}{2} \inf_{\mathcal{C} \in \Pi} S_\infty(X, Y, \mathcal{C}) \tag{6}$$

where $\Pi$ is the set of all correspondences between $X$ and $Y$. In other words, the Gromov-Hausdorff Distance measures the smallest possible distortion between a pair of points over all possible embeddings of one metric space into another. Unfortunately, computing the Gromov-Hausdorff Distance is NP-complete, but we can still connect Algorithm 1 to the Gromov-Hausdorff Distance via the following lemma:

Lemma 1.
The cost returned by Algorithm 1 lower bounds $S_1(X, Y, \mathcal{W})$, or the 1-stress restricted to warping paths.

Proof: Note that the optimal IBDTW warping path $\mathcal{W}^*$ has the following cost $c(\mathcal{W}^*)$:

$$c(\mathcal{W}^*) = \sum_{(x_i, y_j), (x', y') \in \mathcal{W}^*} | d_X(x_i, x') - d_Y(y_j, y') | \tag{7}$$

which can be rewritten as

$$c(\mathcal{W}^*) = \frac{1}{2} \sum_{(x_i, y_j) \in \mathcal{W}^*} \sum_{(x', y') \in \mathcal{W}^*} | d_X(x_i, x') - d_Y(y_j, y') | \tag{8}$$

since if $(x', y') = (x_i, y_j)$ then the cost is zero, and all other terms are counted twice. Now fix an $x_i$ and $y_j$. Then the sum of the terms of the form $| d_X(x_i, x') - d_Y(y_j, y') |$ is simply the $L_1$ warping distance between the 1D time series which are the $i$-th row of $d_X$, $d_X[i, :]$, and the $j$-th row of $d_Y$, $d_Y[j, :]$, under the warping $\mathcal{W}^*$. Also, the DTW Distance between $d_X[i, :]$ and $d_Y[j, :]$ is at most the $L_1$ warping distance under $\mathcal{W}^*$, and is potentially lower since we are computing them greedily only between $x_i$ and $y_j$, ignoring all other constraints. Hence, the sum of the terms $| d_X(x_i, x') - d_Y(y_j, y') |$ is lower bounded by Line 11 in Algorithm 1. ∎

For a more direct analogy with DTW, Algorithm 1 was designed to lower bound the 1-stress restricted to warping paths. We note that a similar technique could be used to lower bound the Gromov-Hausdorff Distance restricted to warping paths by replacing constrained DTW in the inner loop in Line 11 with a constrained version of the discrete Fréchet Distance [13] to find the maximum distortion induced by putting two points in correspondence. In this work, however, we stick to the 1-stress, since it gives a more informative overall picture of the full metric space.
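As a rough illustration of Algorithm 1 and of the bound in Lemma 1, the following serial NumPy sketch builds the CSWM with constrained DTW and then runs ordinary DTW on it. It is a minimal sketch under stated assumptions, not the CUDA implementation described above: the helper names (`dtw_table`, `constrained_dtw`, `backtrace`, `ibdtw`, `stress1_ordered`), the 0-based indexing, the spiral test curve, and the convention of summing the 1-stress over all ordered pairs of path elements are choices made here for the example.

```python
import numpy as np

def dtw_table(cost):
    """Accumulated-cost table for DTW over a cost matrix (recurrence of Eq. 2)."""
    M, N = cost.shape
    acc = np.full((M + 1, N + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[1:, 1:]

def constrained_dtw(a, b, i, j):
    """L1 DTW between 1D series a and b, forced through (i, j) by two boundary-anchored runs."""
    cost = np.abs(a[:, None] - b[None, :])
    head = dtw_table(cost[: i + 1, : j + 1])[-1, -1]  # start of both series up to (i, j)
    tail = dtw_table(cost[i:, j:])[-1, -1]            # (i, j) to the end of both series
    return head + tail

def backtrace(acc):
    """Recover one optimal warping path by repeatedly stepping to the cheapest predecessor."""
    i, j = acc.shape[0] - 1, acc.shape[1] - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            steps.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    return path[::-1]

def ibdtw(DX, DY):
    """Serial IBDTW on two SSMs: returns the cost, the CSWM, and one warping path."""
    M, N = DX.shape[0], DY.shape[0]
    C = np.full((M, N), np.inf)
    for i in range(M):
        for j in range(N):
            C[i, j] = constrained_dtw(DX[i], DY[j], i, j)
    acc = dtw_table(C)
    return acc[-1, -1], C, backtrace(acc)

def stress1_ordered(DX, DY, path):
    """Sum of |d_X(i, k) - d_Y(j, l)| over all ordered pairs of path elements (assumed convention)."""
    I = np.array([p[0] for p in path])
    J = np.array([p[1] for p in path])
    return np.abs(DX[np.ix_(I, I)] - DY[np.ix_(J, J)]).sum()

def ssm(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical test data: a spiral, and a rotated/translated/re-parameterized copy
s = np.linspace(0, 1, 16)
X = np.stack([s * np.cos(4 * s), s * np.sin(4 * s)], axis=1)
u = np.linspace(0, 1, 20) ** 2
theta = 1.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
Y = np.stack([u * np.cos(4 * u), u * np.sin(4 * u)], axis=1) @ R.T + np.array([5.0, 2.0])

DX, DY = ssm(X), ssm(Y)
cost, C, path = ibdtw(DX, DY)
bound = stress1_ordered(DX, DY, path)   # cost should not exceed this
cost_self, _, _ = ibdtw(DX, DX)         # aligning a cloud with itself costs nothing
```

Because every row pair needs its own pair of DTW runs, this sketch is only practical for small point clouds, which is the cost the systolic array implementation is meant to address.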
3. Isometry Blind Partial Time Warping
One of the drawbacks of IBDTW is that it requires a global alignment. However, if the sequences only partially overlap, forcing a global alignment leads to poor results, unless manual cropping is done to ensure that sequences start and end at the same place [53]. To automate cropping, we design an isometry blind time warping algorithm that can do partial alignment. This algorithm is like IBDTW, except DTW is replaced with the “Smith Waterman” algorithm [36, 45], which seeks the best contiguous subsequences in each time series which match each other. (This algorithm was originally developed for gene sequence alignment, but it has been adapted to multimedia problems such as music alignment [34] and video copyright infringement detection [7].) Unlike dynamic time warping, Smith Waterman seeks to maximize an alignment score, and the alignment does not have to start on the first element of each sequence or end on the last element of each sequence. To solve this, the exact same dynamic programming algorithm is used, except there is one extra “restart” condition if a local alignment has become sufficiently poor. The recurrence is

$$SW_{ij} = \max \left\{ SW_{i-1, j-1} + m(x_i, y_j),\ SW_{i-1, j} + g,\ SW_{i, j-1} + g,\ 0 \right\} \tag{9}$$

where $m(x_i, y_j)$ is a matching score between points $x_i$ and $y_j$, which is positive for a match and negative for a mismatch, $g$ is a gap penalty, and the final entry $0$ is the restart condition.

Figure 4. Partial alignment with IBPTW on jigsaw puzzle pieces. The middle row shows the optimal partial alignment. The bottom row shows a locally optimal partial alignment with a lower score. Please refer to the color version of this figure for full detail.

Like DTW, we may modify Smith Waterman to return the best subsequence constrained to match the $i$-th point in the first sequence to the $j$-th point in the second sequence by running Smith Waterman between $\{X_1, X_2, \ldots, X_i\}$ and $\{Y_1, Y_2, \ldots, Y_j\}$, and then again between the reversed sequences $\{X_M, X_{M-1}, X_{M-2}, \ldots, X_i\}$ and $\{Y_N, Y_{N-1}, Y_{N-2}, \ldots, Y_j\}$. Then, the Isometry Blind Partial Time Warping (IBPTW) algorithm is exactly like Algorithm 1, except we replace Line 11 with constrained Smith Waterman, and we replace Line 15 with unconstrained Smith Waterman. We refer to the matrix $C$ (Line 4) as the “partial cross-similarity warp matrix (PCSWM).”

Figure 5. An example of matching the SSM of an oscillating line segment captured with Lagrangian coordinates to an SSM of the oscillating line segment captured with Eulerian coordinates, and vice versa. The right column shows an example of a row from each of the matrices in different cases. The stipple line pattern shows the original row, the line segment shows the corresponding row from the SSM with the target distribution, and the solid row shows the remapped version of the original row. In this case, it is easier to remap the Lagrangian coordinates to Eulerian coordinates, though both remappings are closer to the target than the original.

In practice, we define $m_1(a, b) = \exp(-|d(a, b)| / \sigma) - 0.6$ for Line 11. If $d(a, b)$ is the L1 distance between elements of two histogram normalized SSM rows, then it ranges between 0 and 1. Thus, there is a positive matching score of up to 0.4 for the most similar SSM values and a negative matching score approaching -0.6 between the most dissimilar values. We also choose a gap penalty of -0.4 to promote diagonal matches. Otherwise, the warping path maximizing the alignment score contains longer horizontal and vertical lines, leading to undesirable pauses of one time series with respect to the other. For the outer loop (Line 15), we use

$$m_2(a_i, b_j) = \frac{S_{ij} - \mathrm{md}(S)}{\max(|S - \mathrm{md}(S)|)}$$

where $\mathrm{md}$ is the median operation. This will give a high score of up to $1$ to row pairs which have a high subsequence score in common, and a low score no smaller than $-1$ to rows which do not have a good subsequence. Figure 4 shows an example of this algorithm on two jigsaw puzzle pieces which should fit together, with $m_1$ and $m_2$ defined as above, σ = 0.…, and $g_1$, $g_2$ = −…. The longest subsequence is indeed along the cutouts where they match together. It is also possible to backtrace from anywhere in the PCSWM to find other subsequences which match in common, so Figure 4 also shows an example of a suboptimal but good local alignment.
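The Smith Waterman recurrence with its restart condition, together with the matching score $\exp(-|d|/\sigma) - 0.6$ and gap penalty $-0.4$ described above, can be sketched as follows. This is an unconstrained, single-pass illustration rather than the full IBPTW pipeline; the function names, the value `sigma=0.1` (the value used in the text is not legible here), and the toy sequences are assumptions for the example.

```python
import numpy as np

def match_score(a, b, sigma=0.1):
    """Matching scores exp(-|a_i - b_j| / sigma) - 0.6 for values in [0, 1].

    sigma=0.1 is an assumed value chosen only for this illustration.
    """
    return np.exp(-np.abs(a[:, None] - b[None, :]) / sigma) - 0.6

def smith_waterman(score, gap=-0.4):
    """Local alignment table: diagonal match, two gap moves, or restart at 0."""
    M, N = score.shape
    sw = np.zeros((M + 1, N + 1))
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sw[i, j] = max(sw[i - 1, j - 1] + score[i - 1, j - 1],  # match/mismatch
                           sw[i - 1, j] + gap,                       # gap in b
                           sw[i, j - 1] + gap,                       # gap in a
                           0.0)                                      # restart
    return sw

# Toy data: b contains the segment a[2:7] surrounded by values that match nothing in a
a = np.arange(10) / 10.0
b = np.concatenate([np.ones(4), a[2:7], np.ones(3)])

sw = smith_waterman(match_score(a, b))
best = np.unravel_index(sw.argmax(), sw.shape)  # table cell where the best local alignment ends
```

The best local alignment ends at the table cell just past the last element of the shared segment in each sequence, and the junk values on either side are skipped without any manual cropping.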
4. Cross-Modal Histogram Normalization
The schemes we have presented work well for point clouds sampled from isometric curves, but the isometry assumption does not usually hold in cross-modal applications. Not only can the scales be drastically different between modalities, but it is unlikely that a uniform re-scaling will fix the problem. For instance, consider a 1D bar of length $B$ oscillating sinusoidally over the interval $[0, A]$ with a period of $T$. Its center position is measured as $c_t = (A/2) + (A/2) \cos(2\pi t / T)$. This type of measurement, which follows the object in question, is in Lagrangian coordinates. By contrast, suppose we take a 1D video of the bar with $A$ pixels, where each pixel in each frame measures occupancy by the bar at that frame in the video. Then pixel $i$ in video frame $X_t$ is parameterized by time as

$$X_t[i] = \begin{cases} 1 & |c_t + B/2 - i| < B/2 \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

These pixel by pixel measurements at fixed positions are referred to as Eulerian coordinates. Let the Lagrangian SSM $D_1$ be the 1D metric between two different centers, $D_1[s, t] = |c_s - c_t|$, and let the Eulerian SSM $D_2$ be the Euclidean metric $D_2[s, t] = \|X_s - X_t\|$ between each frame of the video. Although they are measuring the same process, the SSMs have a locally different character. Rows of $D_1$ are perfect sinusoids, while rows of $D_2$ are more like square waves, since there are sharp transitions from foreground to background in Eulerian coordinates. Figure 5 summarizes all of this visually.

To address this kind of local rescaling between an SSM $D_1$ and an SSM $D_2$, we first divide each SSM by its respective max, and we quantize each to $L$ levels evenly spaced in $[0, 1]$. We then apply a monotonic, one-to-one map $f$ to each pixel in $D_1$ so that the CDF of $D_1$ approximately matches the CDF of $D_2$ (see, e.g., [17] ch. 3.3). Note that this process can be done from $D_1$ to $D_2$ or from $D_2$ to $D_1$, as shown in Figure 5. Since this process is not necessarily symmetric, we perform both sets of normalizations, and we choose the one which yields a better alignment score.
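The normalization above can be sketched as standard CDF matching on quantized SSM values, applied here to the Lagrangian and Eulerian SSMs of the oscillating bar. This is an illustrative sketch only: the function name `match_histogram`, the default of 256 levels, and the bar parameters are assumptions, and the lookup table built here is monotone non-decreasing rather than strictly one-to-one.

```python
import numpy as np

def match_histogram(D_src, D_tgt, levels=256):
    """Remap D_src so that its value distribution approximately matches that of D_tgt."""
    # Divide each SSM by its max and quantize to evenly spaced levels in [0, 1]
    src = np.round(D_src / D_src.max() * (levels - 1)).astype(int)
    tgt = np.round(D_tgt / D_tgt.max() * (levels - 1)).astype(int)
    # Empirical CDFs over the quantization levels
    src_cdf = np.cumsum(np.bincount(src.ravel(), minlength=levels)) / src.size
    tgt_cdf = np.cumsum(np.bincount(tgt.ravel(), minlength=levels)) / tgt.size
    # For each source level, the first target level whose CDF reaches the source CDF
    lut = np.searchsorted(tgt_cdf, src_cdf, side="left")
    lut = np.minimum(lut, levels - 1)
    return lut[src] / (levels - 1)

# Oscillating bar (hypothetical parameters): A pixels, bar length B, period T
A, B, T = 100, 10, 50
t = np.arange(100)
c = A / 2 + (A / 2) * np.cos(2 * np.pi * t / T)          # Lagrangian: bar center
pixels = np.arange(A)
video = (np.abs(c[:, None] + B / 2 - pixels[None, :]) < B / 2).astype(float)  # Eulerian frames

D1 = np.abs(c[:, None] - c[None, :])                                  # Lagrangian SSM
D2 = np.linalg.norm(video[:, None, :] - video[None, :, :], axis=-1)   # Eulerian SSM

D1_to_D2 = match_histogram(D1, D2)   # Lagrangian values remapped toward the Eulerian CDF
D2_to_D1 = match_histogram(D2, D1)   # and the other direction
```

Both directions are computed because, as noted above, the remapping is not symmetric; which one to keep would be decided by the resulting alignment score.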
5. Experimental Results
In this section, we will quantitatively compare the IBDTW algorithm for global alignment with several other techniques in the literature, including ordinary dynamic time warping (DTW), derivative dynamic time warping (DDTW) [24] (a curvature-based version), canonical time warping (CTW) [53], Generalized Time Warping (GTW) [51, 52], and Iterative Motion Warping (IMW) [19] (a simpler version of CTW which is restricted to the same space). We use code from [53] and [51, 52] to compute all of these alignments. We use the default parameters provided in this code, and, as in [53] and [51, 52], we use the results of DTW to initialize CTW and GTW. In all of our experiments, we show results both from IBDTW and IBDTW after SSM rank normalization, which we refer to as “IBDTWN.” When a ground truth warping path exists, we report the alignment error as in [52] and [40]. Given a warping path $\mathcal{W} = (x_1, x_2, \ldots, x_M)$ and a ground truth path $\mathcal{W}_{GT} = (y_1, y_2, \ldots, y_N)$, the alignment error is

$$\frac{1}{M + N} \left( \sum_{i=1}^{M} \min_{j=1}^{N} \| x_i - y_j \| + \sum_{j=1}^{N} \min_{i=1}^{M} \| x_i - y_j \| \right) \tag{11}$$

which is (roughly) the average number of samples by which $\mathcal{W}$ is shifted from $\mathcal{W}_{GT}$ at any point in time.

Figure 6. Comparisons of alignment error distributions for different techniques on synthetic rotated/translated/re-parameterized/distorted 2D/3D curves drawn from the classes shown on the left. Log plot shown for contrast since IBDTW performs so well relative to other methods.

Figure 7 (panels: Before Alignment; PCSWM Two Traversals; After Alignment). Two closed loops in the shape of a fork which have been rotated/translated/re-parameterized, in addition to starting at different points. The left plot shows the forks before alignment. The center plot shows the PCSWM resulting from aligning both point clouds each repeating themselves twice, as shown by the colors along the rows (left fork) and columns (right fork), which correspond to the colors in the left plot. The optimal partial warping path truncated to the first repetition of the left fork is superimposed in cyan. The forks in the right plot are put into correct correspondence by this truncated partial warping path.

Figure 8. Distribution of IBPTW alignment errors for circularly shifted/warped/distorted MPEG-7 loops.

We first perform an experiment aligning a series of rotated/translated/flipped and re-parameterized sampled curves. As in [52], we re-parameterize the curves with random convex combinations of polynomial, logarithmic, exponential, and hyperbolic tangent functions. To distort the curves, we move random control points in random directions after spatial transformation and re-parameterization. Let $X$ be the TOPC before transformation/re-parameterization/distortion and let $Y$ be the curve afterwards. The average ratio $d_{GH}(X, Y) / \mathrm{diam}(X)$, where $\mathrm{diam}$ is the “diameter” of $X$ (the maximum inter-point distance), is 0.18. Figure 6 shows the results. IBDTW performs the best, while doing the normalization for IBDTWN only degrades the results slightly.

We also showcase our IBPTW algorithm by aligning 2D loops, or curves $\gamma : [0, 1] \to \mathbb{R}^2$ so that $\gamma(0) = \gamma(1)$, which is useful in recognizing boundaries of foreground objects in video. Note that a geometrically equivalent loop can be parameterized starting at a different point in the interior of the first loop: $\gamma'(t) = \gamma(t - \tau \pmod 1)$. Also, it is possible that this loop is parameterized differently: $\gamma'_h(t) = \gamma'(h(t))$ for some orientation preserving homeomorphism $h : [0, 1] \to [0, 1]$, and $\gamma'_h(t)$ may also be transformed spatially.
To demonstrate how our algorithm is able to align such curves, we use examples from the MPEG-7 dataset of 2D contours [25]. Given a point cloud A and a point cloud B, we partially align the concatenated point clouds AA and BB. If A starts T samples later than B, then there will be a partial warping path which starts at the first sample of AA and sample T of BB. Figure 7 shows an example where two forks are successfully put into correspondence this way. Figure 8 shows a histogram of alignment errors for distorted/circularly shifted/re-parameterized curves over 7 classes from the MPEG-7 dataset, using σ = 0 . and m , m = − . , and with a mean $d_{GH} / \mathrm{diam}$ = 0 . . IBPTW returns excellent alignments for most loops, though there is a cluster of outliers that occur due to (near) symmetries of some loops.

Figure 9. Comparison of IBDTW with different time warping techniques on walking videos from the Weizmann action dataset [18] (left) and on 2D/3D facial expression videos from the BU4D dataset [48] (right).

Figure 10. Example aligning MOCAP data expressed in the product space of quaternions with videos in raw pixel space.

For our first cross modal experiment, we align 4 videos of people walking from the Weizmann dataset [18], cropped to 4 walking cycles each. As in [53], we use different features between each pair of videos we align. On one video, we use the binary mask of the foreground object of the person walking, where every pixel is a dimension and every frame is a point in the TOPC. On the second video, we use the Euclidean distance transform (EDT) [27]. To assess performance rigorously, we create a more controlled experiment where the second video is the same as the first video after a time warp and applying EDT, so that we have access to the ground truth. Figure 9 shows the results. CTW performs slightly better than IBDTWN, but not appreciably. Furthermore, IBDTW without normalization has the worst average alignment error, demonstrating how normalization is needed in cross-modal applications.
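The alignment error of Equation 11, which underlies the comparisons above, can be sketched as follows. This is a minimal illustration rather than our evaluation code: the function name `alignment_error` is hypothetical, warping paths are assumed to be given as arrays of (row, column) index pairs, and the norm is assumed to be Euclidean.

```python
import numpy as np

def alignment_error(W, W_gt):
    """Symmetric average nearest-point distance between two warping paths.

    W:    (M, 2) array-like of index pairs on the computed path.
    W_gt: (N, 2) array-like of index pairs on the ground truth path.
    """
    W = np.asarray(W, dtype=float)
    W_gt = np.asarray(W_gt, dtype=float)
    # D[i, j] = || x_i - y_j || for every pair of path points
    D = np.linalg.norm(W[:, None, :] - W_gt[None, :, :], axis=2)
    # Sum of nearest-neighbor distances in both directions, normalized by M + N
    return (D.min(axis=1).sum() + D.min(axis=0).sum()) / (len(W) + len(W_gt))

W = [(0, 0), (1, 1), (2, 2)]
W_shifted = [(0, 1), (1, 2), (2, 3)]
print(alignment_error(W, W))          # 0.0
print(alignment_error(W, W_shifted))  # 1.0
```

In the second call every point of one path lies exactly one sample away from the other path, so the error is 1, matching the reading of Equation 11 as an average shift in samples.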
To show a more difficult cross-modal application, we also synchronize expressions drawn from the BU4D face dataset [48], which includes RGB videos of people making different facial expressions and a 3D triangle mesh corresponding to each frame. For our RGB features, we simply take each channel and each pixel to be a dimension, so a video with W × H pixels lives in $\mathbb{R}^{WH}$. For the 3D mesh features, we use shape histograms [3] after mean centering and RMS normalizing each mesh. Our shape histograms have 20 radial shells, each with 66 sectors whose centers are equally distributed across the sphere. We perform an experiment where we take the 2D features from a face and the 3D features of the same face which has been warped in time. We perform 10 such warpings and alignments for 9 faces from 6 different expression types. Figure 9 shows the aggregated results. The performance is strikingly worse than on the Weizmann dataset, though the normalized IBDTW has about half of the alignment error of other techniques.
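To make the raw pixel features concrete, the sketch below flattens a video into a time-ordered point cloud with one coordinate per pixel and channel, and then computes its Euclidean self-similarity matrix. The function names `video_to_topc` and `euclidean_ssm`, the `(T, H, W, C)` array layout, and the random test video are assumptions made only for this illustration.

```python
import numpy as np

def video_to_topc(frames):
    """Flatten a (T, H, W, C) video so that each frame becomes one point."""
    frames = np.asarray(frames, dtype=float)
    return frames.reshape(frames.shape[0], -1)

def euclidean_ssm(X):
    """Return the T x T matrix of pairwise Euclidean distances between rows of X."""
    sq = np.sum(X * X, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    D = np.sqrt(np.maximum(D2, 0.0))  # clamp tiny negative values from round-off
    np.fill_diagonal(D, 0.0)          # each frame is at distance zero from itself
    return D

rng = np.random.default_rng(0)
frames = rng.random((6, 8, 10, 3))    # hypothetical tiny video: T=6, H=8, W=10, C=3
X = video_to_topc(frames)             # shape (6, 240)
D = euclidean_ssm(X)                  # shape (6, 6)
```

Only the T × T matrix `D` is needed downstream, so the dimension of the feature space never has to match between the two modalities being aligned.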
Figure 11. Example aligning “Smooth Criminal” by Michael Jackson to a faster tempo cover by Alien Ant Farm. The optimal warping path is superimposed in magenta over the PCSWM. MJ’s version has an intro which is not present in the AAF version, and which is properly skipped by the warping path.

One strength of our algorithms is that they run without modification for features in arbitrary metric spaces. For example, Figure 10 shows IBDTW between motion capture data expressed in a product space of quaternions of N joints with video data expressed as raw pixels. The distance we choose on the product space of quaternions is $\sum_{i=1}^{N} \cos^{-1}(|q_i \cdot p_i|)$ between two sets of quaternions $(q_1, q_2, \ldots, q_N)$ and $(p_1, p_2, \ldots, p_N)$. For our second example, Figure 11 shows IBPTW used to align two “cover songs” (two versions of the same song with different instruments / tempos) after fusing note-based and timbral features using similarity network fusion [43, 44] to learn an improved metric for self-similarity (see [38] for more info on this process). Please also refer to supplementary material for video and audio for these examples.
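The quaternion distance used for the MOCAP example above can be sketched as follows. This is a minimal illustration under stated assumptions: quaternions are unit length and stored as rows of four numbers, a pose is an `(N, 4)` array with one row per joint, and the names `quaternion_pose_distance` and `quaternion_ssm` are hypothetical.

```python
import numpy as np

def quaternion_pose_distance(Q, P):
    """Sum over joints of arccos(|q_i . p_i|) for two (N, 4) sets of unit quaternions."""
    Q = np.asarray(Q, dtype=float)
    P = np.asarray(P, dtype=float)
    dots = np.abs(np.sum(Q * P, axis=-1))
    # Clip guards against round-off pushing |q . p| slightly above 1
    return float(np.sum(np.arccos(np.clip(dots, 0.0, 1.0))))

def quaternion_ssm(frames):
    """Self-similarity matrix of a (T, N, 4) motion under the distance above."""
    frames = np.asarray(frames, dtype=float)
    # dots[t, s, n] = |q_n(t) . q_n(s)| for every pair of frames and every joint
    dots = np.abs(np.einsum("tnk,snk->tsn", frames, frames))
    return np.arccos(np.clip(dots, 0.0, 1.0)).sum(axis=2)

q_id = np.array([1.0, 0.0, 0.0, 0.0])
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
pose_a = np.stack([q_id, q_id])       # two joints, both at the identity
pose_b = np.stack([q_z90, -q_id])     # first joint rotated; -q and q are the same rotation
S = quaternion_ssm(np.stack([pose_a, pose_b]))
```

The absolute value makes the distance insensitive to the sign ambiguity of unit quaternions, and the resulting matrix `S` can be passed to the alignment exactly like a Euclidean SSM.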
6. Conclusions
In this work, we have shown it is possible to synchronize time-ordered point clouds that are spatially transformed without explicitly uncovering the spatial transformation. As we have shown by our experiments, our algorithms perform excellently when aligning nearly isometric sampled curves, and, with proper normalization, we are competitive with state-of-the-art unsupervised techniques for cross-modal applications. Furthermore, IBDTW requires no parameters, making it approachable “out of the box,” while, in our experience, CTW and GTW require many parameters that make or break performance. Finally, we have opened the door for straightforward non-Euclidean time warping, and we hope to see more such applications (e.g. time series of graphs).
Acknowledgments
Christopher Tralie was partially supported by an NSF Graduate Fellowship under grant DGF-1106401 and an NSF big data grant DKA-1447491. Paul Bendich is thanked for the suggestion to compare the technique to the Gromov-Hausdorff Distance, and for helpful comments on an initial manuscript.
References
[1] N. Alajlan, I. El Rube, M. S. Kamel, and G. Freeman. Shape retrieval using triangle-area representation and dynamic space warping. Pattern Recognition, 40(7):1911–1920, 2007.
[2] H. Alt and M. Godau. Computing the fréchet distance between two polygonal curves. International Journal of Computational Geometry & Applications, 5(01n02):75–91, 1995.
[3] M. Ankerst, G. Kastenmüller, H.-P. Kriegel, and T. Seidl. 3d shape histograms for similarity search and classification in spatial databases. In International Symposium on Spatial Databases, pages 207–226. Springer, 1999.
[4] J. P. Bello. Grouping recorded music by structural similarity. Int. Conf. Music Inf. Retrieval (ISMIR-09), 2009.
[5] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, 1994.
[6] P. J. Besl and N. D. McKay. Method for registration of 3-d shapes. In Robotics-DL tentative, pages 586–606. International Society for Optics and Photonics, 1992.
[7] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. The video genome. arXiv preprint arXiv:1003.5320, 2010.
[8] Y. Chen and G. Medioni. Object modelling by registration of multiple range images. Image and vision computing, 10(3):145–155, 1992.
[9] S. D. Cohen and L. J. Guibas. Partial matching of planar polylines under similarity transformations. In SODA, pages 777–786, 1997.
[10] M. Cui, J. Femiani, J. Hu, P. Wonka, and A. Razdan. Curve matching for open 2d curves. Pattern Recognition Letters, 30(1):1–10, 2009.
[11] R. Cutler and L. S. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):781–796, 2000.
[12] A. De Luca, A. Hang, F. Brudy, C. Lindner, and H. Hussmann. Touch me once and i know it’s you!: implicit authentication based on touch screen patterns. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 987–996. ACM, 2012.
[13] T. Eiter and H. Mannila. Computing discrete fréchet distance. Technical report, Citeseer, 1994.
[14] J. Foote. Automatic audio segmentation using a measure of audio novelty. In Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, volume 1, pages 452–455. IEEE, 2000.
[15] M. Frenkel and R. Basri. Curve matching using the fast marching method. In EMMCVPR, pages 35–51. Springer, 2003.
[16] D. Gong and G. Medioni. Dynamic manifold warping for view invariant action recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 571–578. IEEE, 2011.
[17] R. Gonzalez and R. Woods. Digital Image Processing. Addison-Wesley world student series. Addison-Wesley, 1992.
[18] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, December 2007.
[19] E. Hsu, K. Pulli, and J. Popović. Style translation for human motion. In ACM Transactions on Graphics (TOG), volume 24, pages 1082–1089. ACM, 2005.
[20] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez. Cross-view action recognition from temporal self-similarities. In Proceedings of the 10th European Conference on Computer Vision: Part II, pages 293–306. Springer-Verlag, 2008.
[21] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):172–185, 2011.
[22] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32(5):922–923, 1976.
[23] F. Kaiser and T. Sikora. Music structure discovery in popular music using non-negative matrix factorization. In ISMIR, pages 429–434, 2010.
[24] E. J. Keogh and M. J. Pazzani. Derivative dynamic time warping. In Proceedings of the 2001 SIAM International Conference on Data Mining, pages 1–11. SIAM, 2001.
[25] L. Latecki. Shape data for the mpeg-7 core experiment ce-shape-1. , 2002.
[26] P. Maurel and G. Sapiro. Dynamic shapes average. 2003.
[27] C. R. Maurer, R. Qi, and V. Raghavan. A linear time algorithm for computing exact euclidean distance transforms of binary images in arbitrary dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2):265–270, 2003.
[28] B. McFee and D. P. Ellis. Analyzing song structure with spectral clustering. In , 2014.
[29] G. McGuire, N. B. Azar, and M. Shelhamer. Recurrence matrices and the preservation of dynamical properties. Physics Letters A, 237(1-2):43–47, 1997.
[30] M. Müller. Information retrieval for music and motion, volume 2. Springer, 2007.
[31] H. Sakoe and S. Chiba. A similarity evaluation of speech patterns by dynamic programming. In Nat. Meeting of Institute of Electronic Communications Engineers of Japan, page 136, 1970.
[32] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
[33] T. B. Sebastian, P. N. Klein, and B. B. Kimia. On aligning curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1):116–125, 2003.
[34] J. Serra, E. Gómez, P. Herrera, and X. Serra. Chroma binary similarity and local alignment applied to cover song identification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(6):1138–1151, 2008.
[35] J. Serra, M. Müller, P. Grosche, and J. L. Arcos. Unsupervised detection of music boundaries by time series structure features. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[36] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981.
[37] G. A. ten Holt, M. J. Reinders, and E. Hendriks. Multi-dimensional dynamic time warping for gesture recognition. In Thirteenth annual conference of the Advanced School for Computing and Imaging, volume 300, 2007.
[38] C. J. Tralie. Early mfcc and hpcp fusion for robust cover song identification. In , 2017.
[39] C. J. Tralie and P. Bendich. Cover song identification with timbral shape. In , 2015.
[40] G. Trigeorgis, M. Nicolaou, S. Zafeiriou, and B. Schuller. Deep canonical time warping for simultaneous alignment and representation learning of sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[41] G. Trigeorgis, M. A. Nicolaou, S. Zafeiriou, and B. W. Schuller. Deep canonical time warping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5110–5118, 2016.
[42] M. Vejdemo-Johansson, F. T. Pokorny, P. Skraba, and D. Kragic. Cohomological learning of periodic motion. Applicable Algebra in Engineering, Communication and Computing, 26(1-2):5–26, 2015.
[43] B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, and Z. Tu. Unsupervised metric fusion by cross diffusion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2997–3004. IEEE, 2012.
[44] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and A. Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature methods, 11(3):333–337, 2014.
[45] M. S. Waterman. Introduction to computational biology: maps, sequences and genomes. CRC Press, 1995.
[46] G. Wahba. A least squares estimate of spacecraft attitude. SIAM Review, 7(3):409, 1965.
[47] M. Yamada, L. Sigal, M. Raptis, M. Toyoda, Y. Chang, and M. Sugiyama. Cross-domain matching with squared-loss mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1764–1776, 2015.
[48] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3d dynamic facial expression database. In Automatic Face & Gesture Recognition, 2008. FG’08. 8th IEEE International Conference on, pages 1–6. IEEE, 2008.
[49] R. Ying, J. Pan, K. Fox, and P. K. Agarwal. A simple efficient approximation algorithm for dynamic time warping. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS ’16, pages 21:1–21:10, New York, NY, USA, 2016. ACM.
[50] C. W. Yu, K. Kwong, K.-H. Lee, and P. H. W. Leong. A smith-waterman systolic cell. In New Algorithms, Architectures and Applications for Reconfigurable Computing, pages 291–300. Springer, 2005.
[51] F. Zhou and F. De la Torre. Generalized time warping for multi-modal alignment of human motion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1282–1289. IEEE, 2012.
[52] F. Zhou and F. De la Torre. Generalized canonical time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):279–294, 2016.
[53] F. Zhou and F. Torre. Canonical time warping for alignment of human behavior. In