Time Coupled Diffusion Maps
NICHOLAS F. MARSHALL AND MATTHEW J. HIRN
Abstract.
We consider a collection of n points in R^d measured at m times, which are encoded in an n × d × m data tensor. Our objective is to define a single embedding of the n points into Euclidean space which summarizes the geometry as described by the data tensor. In the case of a fixed data set, diffusion maps and related graph Laplacian methods define such an embedding via the eigenfunctions of a diffusion operator constructed on the data. Given a sequence of m measurements of n points, we introduce the notion of time coupled diffusion maps, which have natural geometric and probabilistic interpretations. To frame our method in the context of manifold learning, we model evolving data as samples from an underlying manifold with a time-dependent metric, and we describe a connection of our method to the heat equation on such a manifold.

1. Introduction
In many machine learning and signal processing tasks, the observable data is high dimensional, but it lies on a low-dimensional intrinsic manifold. In recent years, several manifold learning methods have emerged which attempt to recover the intrinsic manifold underlying datasets. In particular, graph Laplacian methods have become popular due to their practicality and theoretical guarantees [1, 2, 3, 4, 5, 6, 7, 8].

Current graph Laplacian methods implicitly assume a static intrinsic manifold, or equivalently, that the dynamics underlying the data generation process are stationary. For many applications, this stationarity assumption is justified, as datasets often consist of a single snapshot of a system, or are recorded over small time windows. However, in the case where data is accumulated over longer periods of time, accounting for changing dynamics may be advantageous. Furthermore, if a system is particularly noisy, combining a large number of snapshots over time may help recover structure hidden in noise. These observations raise the following question: how can graph Laplacian methods be extended to account for changing dynamics while maintaining theoretical guarantees?

In this paper, we propose modeling data with changing dynamics by assuming there exists an underlying intrinsic manifold with a time-dependent metric. We will describe the proposed method using the diffusion maps framework: a popular graph Laplacian framework which is robust to non-uniform sampling [4]. We remark that diffusion maps are highly related to other manifold learning methods such as Laplacian eigenmaps and spectral clustering. In fact, if data is uniformly sampled from the underlying manifold, diffusion maps [4] is essentially eigenvalue-weighted Laplacian eigenmaps [9]. Although we assume points on the intrinsic manifold are fixed, their geometry, i.e., their dependence structure, is allowed to change.
We can conceptualize samples from a manifold with a time-dependent metric by considering a corresponding point cloud smoothly moving through R^d, produced by isometrically embedding the manifold over time. The evolution of the metric dictates the movement of the points, and vice versa. In practice, datasets conforming to this model are commonly encountered; e.g., an RGB video feed consists of a collection of n pixels which move through R^3. In general, we consider data consisting of a collection of n points in R^d measured at m times, encoded in an n × d × m data tensor X. The tensor X can be expressed as a sequence (X_1, ..., X_m) of n × d matrices whose entries correspond across the sequence. Given such a sequence (X_1, ..., X_m), the time coupled diffusion map framework introduced in this paper is based on the product operator:
(1) P^{(m)} = P_m P_{m−1} · · · P_2 P_1,
where each P_i is a diffusion operator constructed from X_i. We will show that this discrete diffusion process, which is formally defined in the following section, approximates a continuous diffusion process on an assumed underlying manifold with a time-dependent metric. Additionally, we introduce the notion of time coupled diffusion maps, named thus because the time evolution of the data has been coupled to the time evolution of a diffusion process.

Key words and phrases. Manifold learning; dimensionality reduction; diffusion distance; heat equation; time-dependent metric.

1.1. Related works.
In the diffusion geometry literature, several techniques have been developed which also utilize multiple diffusion kernels for a variety of objectives, including: iteratively refining the representation of data, facilitating comparison, and combining multiple measurements of a fixed system.

An early example of a multiple kernel method is the denoising algorithm of Szlam, Maggioni, and Coifman [10], which iteratively smooths an image via an anisotropic diffusion process. That is, the algorithm alternates between constructing a diffusion kernel on a given data set (in this case an image) and applying the constructed kernel to the data:
X_i → P_i,   X_{i+1} = P_i X_i,
where the arrow denotes that P_i is constructed based on X_i. More recently, in [11] Welp, Wolf, Hirn, and Krishnaswamy introduce an iterative diffusion based construction which acts to coarse-grain data. From a theoretical perspective, both of these methods can be considered in the context of the time coupled diffusion framework introduced in this paper.

In [12] Wang, Jiang, Wang, Zhou, and Tu introduce the notion of Cross Diffusion as a metric fusion algorithm with applications to image processing. They demonstrate how multiple metrics on a given data set can be combined by considering the iterative cross diffusion
P_1^{(t+1)} = P_1 P_2^{(t)} P_1^T   and   P_2^{(t+1)} = P_2 P_1^{(t)} P_2^T,
where P_1^{(0)} = P_1 and P_2^{(0)} = P_2 are constructed from two different metrics on the given data. A generalized method for m metrics is also described.

In [13] Coifman and Hirn present a method of extending diffusion maps to allow comparisons across multiple measurements of a system, even when such measurements are of different modalities.

In [14, 15] Lederman and Talmon introduce the idea of Alternating Diffusion: a method of combining measurements from multiple sensors to extract the common source of variability (i.e., the common manifold), while filtering out sensor-specific effects.
The method is based on the product operators P_1 P_2 and P_2 P_1, where P_1 and P_2 are constructed from two different views of the data. In [16], Lindenbaum, Yeredor, Salhov, and Averbuch follow a similar approach, but concatenate a collection of alternating products in a block matrix, defining Multi-View Diffusion Maps. Recently, several applications and extensions of Alternating Diffusion have been developed. In [15] Lederman, Talmon, Wu, Lo, and Coifman demonstrate an application of Alternating Diffusion to sleep stage assessment. In [17] Talmon and Wu describe a general notion of nonlinear manifold filtering, which extracts a common manifold from multiple sensors.

Our work extends the current diffusion maps literature by considering evolving dynamics rather than enhancing the analysis of a fixed system. We consider an n × d × m data tensor describing a system of n points in R^d over m times, and seek to construct a manifold model and diffusion geometry framework for this setting. A similar framework is considered by Banisch and Koltai in [18]. However, rather than study the product operator (1), they study the sum of the operators P_i and prove a relation with the dynamic Laplacian.
1.2. Organization.
The remainder of the paper is organized as follows. In Section 2 we describe the construction of time coupled diffusion maps. In Section 3 we establish a connection between the product operator P^{(t)} and the heat kernel on the assumed underlying manifold with a time-dependent metric. In Section 4 we present numerical results on synthetic data. Section 5 investigates the continuous analog of the operator P^{(t)} and the corresponding time coupled diffusion distance. Concluding remarks are given in Section 6.

2. Time coupled diffusion maps
In this section, we introduce the notion of time coupled diffusion maps.

2.1. Notation.
Let M be a compact smooth manifold with a smooth 1-parameter family of Riemannian metrics g(τ), τ ∈ [0, T], T < ∞. We refer to the parameter τ as time throughout. For each time τ, let ι_τ : M ↪ R^d, M_τ = ι_τ(M), denote an isometric embedding of (M, g(τ)) into d-dimensional Euclidean space, where d is fixed for all τ ∈ [0, T]. Let X = {x_j}_{j=1}^n ⊆ M denote a finite collection of n points sampled from M, and let (τ_0, τ_1, ..., τ_m) be a uniform partition of [0, T] with τ_0 = 0 and τ_m = T. We assume that our data consists of measurements of the n points X at times τ_1, ..., τ_m, such that we have one measurement set for each of the time intervals (τ_{i−1}, τ_i] for i = 1, ..., m. More precisely, our data will consist of a sequence of m sets (X_1, ..., X_m), where X_i = ι_{τ_i}(X) = {x_j^{(i)}}_{j=1}^n; see Figure 1 for an illustration.

Figure 1.
Illustration of the data model.

Such data can be represented as an n × d × m tensor corresponding to n points in R^d measured at m times. Suppose each X_i is distributed over M_{τ_i} according to a density q_{τ_i} : M_{τ_i} → R. We assume that at time τ_1, the n points X_1 = {x_j^{(1)}}_{j=1}^n are sampled independently from M_{τ_1} according to the density q_{τ_1}. Since no re-sampling occurs, any changes in the densities q_{τ_i}, for i = 1, ..., m, result from deformations of q_{τ_1} induced by changes in the Riemannian volume of M, which itself is induced by changes in the Riemannian metric g(τ) over time. In particular,
x_j^{(i)} = ι_{τ_i} ∘ ι_{τ_1}^{−1}(x_j^{(1)})   and   q_{τ_i}(x) = q_{τ_1}(ι_{τ_1} ∘ ι_{τ_i}^{−1}(x)) |det(D(ι_{τ_1} ∘ ι_{τ_i}^{−1}))(x)|.

Remark.
Two remarks are in order. (1)
Even if the initial sampling density is uniform, the changing geometry may alter the density over time. Hence, the density invariant kernel construction described in [2, 4] plays an essential role in our construction. We assume the minimum sampling density over time is bounded below, and the associated error term enters our error analysis as a constant. We refer the reader to a recent paper by Berry and Harlim [19] for a detailed error analysis of variable density kernel constructions. (2)
Since we have assumed the initial samples X_1 are i.i.d., for fixed i each set of samples X_i is also i.i.d., since X_i is the image of the i.i.d. variables X_1 under the continuous, hence measurable, map ι_{τ_i} ∘ ι_{τ_1}^{−1}.

2.2. Kernel construction.
Given a sequence of data sets (X_1, ..., X_m) as described above, we proceed as follows. For each data set X_i, we construct a diffusion operator P_{ε,i} following the diffusion maps framework [4]. For completeness, we include the details of the construction of P_{ε,i} in the following. For each time index i, we define a Gaussian kernel K_{ε,i} on X using the measurements X_i,
(2) K_{ε,i}(x_j, x_k) = exp(−‖x_j^{(i)} − x_k^{(i)}‖² / ε),   x_j, x_k ∈ X and x_j^{(i)}, x_k^{(i)} ∈ X_i,
where ‖·‖ denotes the Euclidean norm in the ambient space R^d. By summing over the second variable of the kernel K_{ε,i} we approximate the density q_{τ_i} by
(3) q_{ε,i}(x_j) = Σ_{k=1}^n K_{ε,i}(x_j, x_k),   x_j ∈ X.
Using q_{ε,i} we normalize the kernel K_{ε,i} as follows:
(4) K̃_{ε,i}(x_j, x_k) = K_{ε,i}(x_j, x_k) / (q_{ε,i}(x_j) q_{ε,i}(x_k)),   x_j, x_k ∈ X.
Then the corresponding diffusion operator P_{ε,i} : R^n → R^n is given by
(5) (P_{ε,i} f)(x_j) = Σ_{k=1}^n [K̃_{ε,i}(x_j, x_k) / Σ_{l=1}^n K̃_{ε,i}(x_j, x_l)] f(x_k).
Each P_{ε,i} is a matrix which is a diffusion operator when applied to column vectors on the left, and a Markov operator when applied to row vectors on the right. For 1 ≤ t ≤ m, define:
(6) P_ε^{(t)} = P_{ε,t} P_{ε,t−1} · · · P_{ε,1}.
Like its constituent components, the matrix P_ε^{(t)} is a diffusion operator when acting on column vectors from the left and a Markov operator when acting on row vectors from the right. As a Markov operator, P_ε^{(t)} acts backwards in time, in the sense that P_ε^{(t)} acts by first applying P_{ε,t}, second P_{ε,t−1}, and so on. On the other hand, as a diffusion operator P_ε^{(t)} acts forward in time, first applying P_{ε,1}, second applying P_{ε,2}, and so on. We have chosen the ordering convention in equation (6) to favor the geometric interpretation.

2.3. Time coupled diffusion distance.
Let δ_j denote a Dirac distribution centered at x_j, i.e., δ_j(x_j) = 1 and δ_j = 0 elsewhere. We compare the points x_j and x_k by comparing the posterior distributions of δ_j^T and δ_k^T under the Markov operator P_ε^{(t)}. More specifically, following [4] we define a diffusion based distance as the L² distance between these posterior distributions weighted by the reciprocal of the stationary distribution of the Markov chain. That is, we define the distance D_ε^{(t)} by
(7) D_ε^{(t)}(x_j, x_k) = ‖δ_j^T P_ε^{(t)} − δ_k^T P_ε^{(t)}‖_{L²(1/π^{(t)})},
where π^{(t)} is the stationary distribution of P_ε^{(t)}, i.e., (π^{(t)})^T P_ε^{(t)} = (π^{(t)})^T, and ‖·‖_{L²(1/π^{(t)})} is the weighted L² norm:
(8) ‖f‖_{L²(1/π^{(t)})} := sqrt( Σ_{j=1}^n f(x_j)² / π^{(t)}(x_j) ).
We refer to D_ε^{(t)} as the time coupled diffusion distance, since it extends the diffusion distance in [4] by coupling the evolution time of the data (X_1, ..., X_t) with that of the diffusion process governed by (P_{ε,1}, ..., P_{ε,t}). Our next objective is to construct a time coupled diffusion map, that is, an embedding of X into Euclidean space which preserves the time coupled diffusion distance. For notational brevity, we suppress the dependence on ε in the following.

2.4. Time coupled diffusion map.
Recall that the diffusion map for a static manifold (M, g) is defined in terms of the eigenvectors and eigenvalues of the transition matrix P, which is constructed in the same manner as (5). The inhomogeneous transition operator P^{(t)} is also row stochastic, but unlike P, does not necessarily have a complete basis of real eigenvectors. Instead, we define the operator A^{(t)} by
(9) A^{(t)} = Π_{(t)}^{1/2} P^{(t)} Π_{(t)}^{−1/2},
where Π_{(t)} denotes the matrix with the stationary distribution π^{(t)} of P^{(t)} along the diagonal and zeros elsewhere. First, observe that (π^{(t)})^{1/2} is both a right and left eigenvector of A^{(t)} with eigenvalue
1. In fact, A^{(t)} has operator norm one and naturally arises when constructing a diffusion framework starting with a Markov chain; see Appendix A. Next, we compute the singular value decomposition (SVD) of A^{(t)}:
(10) A^{(t)} = U_{(t)} Σ_{(t)} V_{(t)}^T,
where U_{(t)} is an orthogonal matrix of left singular vectors, Σ_{(t)} is a diagonal matrix of corresponding singular values, and V_{(t)} is an orthogonal matrix of right singular vectors. Define
(11) Ψ_{(t)} := Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)}.

Lemma 2.1.
The embedding
(12) x_j ↦ δ_j^T Ψ_{(t)}
of the data X into Euclidean space preserves the time coupled diffusion distance. That is to say,
D^{(t)}(x_j, x_k) = ‖δ_j^T Ψ_{(t)} − δ_k^T Ψ_{(t)}‖_{L²}.
We refer to the embedding x_j ↦ δ_j^T Ψ_{(t)} as the time coupled diffusion map.

Proof. By the definition of the time coupled diffusion distance (7) and the definition of the weighted L² norm (8),
(13) D^{(t)}(x_j, x_k) = ‖δ_j^T P^{(t)} − δ_k^T P^{(t)}‖_{L²(1/π^{(t)})} = ‖δ_j^T P^{(t)} Π_{(t)}^{−1/2} − δ_k^T P^{(t)} Π_{(t)}^{−1/2}‖_{L²}.
Multiplying equation (9) on the left by Π_{(t)}^{−1/2} yields Π_{(t)}^{−1/2} A^{(t)} = P^{(t)} Π_{(t)}^{−1/2}, and substituting this expression into (13) gives
D^{(t)}(x_j, x_k) = ‖δ_j^T Π_{(t)}^{−1/2} A^{(t)} − δ_k^T Π_{(t)}^{−1/2} A^{(t)}‖_{L²}.
Expanding A^{(t)} in its singular value decomposition (10) yields
D^{(t)}(x_j, x_k) = ‖δ_j^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} V_{(t)}^T − δ_k^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} V_{(t)}^T‖_{L²}.
Now, since the transformation V_{(t)} is orthogonal,
D^{(t)}(x_j, x_k) = ‖δ_j^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} − δ_k^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)}‖_{L²},
and substituting Ψ_{(t)} = Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} into this equation yields the result. □

Remark.
Several remarks regarding time coupled diffusion maps are in order.

(1) If t = 1, the time coupled diffusion map (12) is equivalent to the definition of the standard density-invariant diffusion map in [4]; see the calculations in Appendix A, which are motivated by similar calculations in [2].

(2) The first coordinate of the embedding (12) is always constant because (π^{(t)})^{1/2} is the top left singular vector of A^{(t)}, and therefore this coordinate of the embedding can be discarded.

(3) In order to produce an embedding of X into R^l for some l > 0, we can compute a rank l + 1 singular value decomposition in equation (10) and map x_j ↦ δ_j^T Ψ_{(t),l}, where Ψ_{(t),l} denotes the matrix consisting of columns 2 through l + 1 of the matrix Ψ_{(t)}. Since the singular values of A^{(t)} are of the form σ_0 > σ_1 ≥ · · · ≥ σ_{n−1}, the truncated embedding will preserve the diffusion distance up to an error on the order of σ_{l+1}. As in other diffusion based methods, in the case where data lies on a low dimensional manifold, we expect the first few coordinates of the embedding to provide a meaningful summary of the data.
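The construction of Sections 2.2–2.4 can be summarized in a short dense NumPy sketch. This is our own illustration (the function names are hypothetical, and a practical implementation would use sparse truncated kernels for large n): it builds each P_{ε,i} via equations (2)–(5), forms the product (6), conjugates as in (9), and assembles the embedding (11).

```python
import numpy as np

def diffusion_operator(X_i, eps):
    """Density-invariant diffusion operator P_{eps,i}, equations (2)-(5)."""
    # (2) Gaussian kernel on pairwise squared Euclidean distances.
    sq = np.sum((X_i[:, None, :] - X_i[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)
    # (3) density estimate by summing over the second variable.
    q = K.sum(axis=1)
    # (4) density-invariant normalization.
    K_tilde = K / np.outer(q, q)
    # (5) row-normalization to a Markov (row-stochastic) matrix.
    return K_tilde / K_tilde.sum(axis=1, keepdims=True)

def time_coupled_embedding(X_seq, eps):
    """Product (6), operator (9), SVD (10), and embedding (11)."""
    n = X_seq[0].shape[0]
    # (6): P^(t) = P_{eps,t} ... P_{eps,1}, applying P_{eps,1} first.
    P = np.eye(n)
    for X_i in X_seq:
        P = diffusion_operator(X_i, eps) @ P
    # stationary distribution pi^(t): left Perron eigenvector of P^(t).
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    # (9): A^(t) = Pi^{1/2} P^(t) Pi^{-1/2}.
    A = (np.sqrt(pi)[:, None] * P) / np.sqrt(pi)[None, :]
    # (10)-(11): SVD of A^(t) and embedding Psi = Pi^{-1/2} U Sigma.
    U, S, _ = np.linalg.svd(A)
    Psi = (U * S) / np.sqrt(pi)[:, None]
    return P, pi, Psi
```

By Lemma 2.1, the time coupled diffusion distance (7) computed from the rows of P^{(t)} agrees with the Euclidean distance between the corresponding rows of Ψ_{(t)}. For large n one would instead keep each factor sparse, apply P^{(t)} factor-by-factor as a linear operator, and compute a truncated SVD iteratively, as noted in the computational remark before Section 3.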
2.5. Comparison to standard diffusion maps.
It is instructive at this point to make a comparison with the original diffusion maps of Coifman and Lafon [4]. In that setting, one is given samples X of an isometric embedding of a manifold M with a static metric g. The density-invariant version of diffusion maps uses the same construction as outlined in equations (2), (3), (4), and (5), which yields a Markov matrix P, which is not indexed by a time i since the metric is static. Running this homogeneous Markov chain forward t steps is equivalent to composing P with itself t times, i.e., P^t. Since P is, by construction, similar to a symmetric matrix, it can be decomposed into a diagonal matrix of real eigenvalues Λ and a basis of right eigenvectors Φ. Since the left eigenvectors of P are orthogonal in L²(1/π), the diffusion map
(14) x_j ↦ δ_j^T Φ Λ^t
preserves the diffusion distance D^t(x_j, x_k) = ‖δ_j^T P^t − δ_k^T P^t‖_{L²(1/π)}, where π is the stationary distribution of P. A main result of [4] is that as n → ∞ and ε → 0, we have P_ε^{t/ε} → e^{tΔ}, where e^{tΔ} is the Neumann heat kernel on M. Since the eigenfunctions and eigenvalues of e^{tΔ} give a complete geometric description of the manifold M [20], the diffusion maps embedding (14) learns the geometry of the manifold from the samples X.

In the time-dependent case, we consider a sequence of data (X_1, ..., X_t) and construct a corresponding family of operators (P_1, ..., P_t). Rather than raising a matrix to a power, we compose the family of operators, obtaining P^{(t)} defined in equation (6). After defining a distance D^{(t)} based on P^{(t)}, we seek a distance preserving embedding. However, since the product of symmetric matrices is not in general symmetric, there is no reason we should suspect P^{(t)} to have a basis of real eigenvalues and eigenvectors. Therefore, we construct a diffusion maps framework based on the Markov operator using the singular value decomposition of the operator A^{(t)}, as described in equations (9), (10), (11), and (12). We expect the operator P^{(t)} encodes some average sense of affinity across the data, and by extension that the time coupled diffusion map is similarly meaningful. In the following, we make the connection of P^{(t)} to the data precise by showing that P^{(t)} approximates the heat kernel on the underlying manifold (M, g(·)), which will be defined in Section 3. Thus, we conjecture that the time coupled diffusion map aggregates important geometrical information of the manifold (M, g(·)) over the time interval [0, T]. Numerical results in Section 4 lend credence to this conjecture. There is, however, no theoretical result that directly links geometrical information of (M, g(·)) over arbitrarily long time scales with its heat kernel. The closest results are contained in [21, 22], in which the heat kernel of (M, g(·)) is used to embed the manifold into a single Hilbert space, so that one can observe the flow of M over time.
This result, however, only holds for short time scales.

Before describing the connection of P^{(t)} to heat flow, we make a brief computational note. In practice, it is common to construct each diffusion matrix P_i defined in equation (5) as a sparse matrix by, for example, truncating values which fall below a certain threshold. However, if the product
P^{(t)} = P_t P_{t−1} · · · P_2 P_1
is computed explicitly, there is no reason to expect sparsity will be maintained. Moreover, since each P_i is a diffusion operator on an assumed underlying k-dimensional manifold M, if each P_i initially has l nonzero entries in each row, the number of nonzero entries in each row of the product P^{(t)} could be on the order of (l · t)^k. Therefore, rather than explicitly constructing P^{(t)}, we can treat P^{(t)} as an operator that can be applied to vectors in order n · l · t operations. Using this approach, the singular value decomposition in equation (10) can be computed by iterative methods such as Lanczos iteration or subspace iteration [23]. When n is large relative to l · t, this approach may provide significant computational savings and reduce the memory requirements of the computation.

3. The heat kernel for (M, g(·))

We begin by stating the definition and properties of the heat kernel for a manifold with a time-dependent metric. Let g(τ), τ ∈ [0, T], be a smooth family of metrics on a manifold M. The heat equation for such a manifold is:
(15) ∂u/∂t = Δ_{g(t)} u,
where u : M × [0, T] → R.
We say that Z : (M × [0, T]) × (M × [0, T]) → R is the heat kernel for ∂/∂t − Δ_{g(t)} if the following three conditions hold:
(H1) Z(x, τ; y, σ) is C² in the spatial variables x, y and C¹ in the temporal variables τ, σ,
(H2) (∂/∂τ − Δ_{g(τ)}) Z(·, ·; y, σ) = 0, and
(H3) lim_{τ ↘ σ} Z(·, τ; y, σ) = δ_y.
Several researchers have studied the heat equation (15) and the corresponding heat kernel for a manifold (M, g(·)), in large part due to its relationship with the Ricci flow [24, 25], but also for data analysis [21]. The following properties of the heat kernel have been established (see [24]). First, the heat kernel exists and is the unique positive function satisfying (H1), (H2), and (H3). Additionally, the solution to the initial value problem
(16) ∂u/∂t = Δ_{g(t)} u,   u(x, 0) = f(x),
has the integral representation [24, Corollary 2.2]:
u(x, τ) = ∫_M Z(x, τ; y, 0) f(y) dV(y, 0),
where V(τ) is the Riemannian volume of M at time τ. The heat kernel Z(x, τ; y, σ) also possesses properties analogous to those of the standard heat kernel, such as the semigroup property,
(17) Z(x, τ; y, σ) = ∫_M Z(x, τ; ξ, ν) Z(ξ, ν; y, σ) dV(ξ, ν),   for all x, y ∈ M, ν ∈ (σ, τ).
Furthermore, its “rows” sum to one, precisely stated as
(18) ∫_M Z(x, τ; y, σ) dV(y, σ) = 1,   for all x ∈ M, σ < τ ∈ [0, T].
In the case of a static metric, the associated integral transform of the Neumann heat kernel is known to have the asymptotic approximation [4]:
(19) e^{tΔ} = lim_{ε→0} (I + εΔ)^{t/ε} = (I + εΔ)^{t/ε} + O(ε).
For t ≤ T, we define the associated integral transform of Z as the operator T_Z^{(t)} : L²(M) → L²(M),
(20) T_Z^{(t)} f(x) = ∫_M Z(x, t; y, 0) f(y) dV(y, 0).
We want to express T_Z^{(t)} in a form analogous to (19); we would expect it to resemble e^{∫_0^t Δ_{g(τ)} dτ}. However, as the family Δ_{g(τ)} is not in general commutative with respect to composition, it is necessary to consider the ordered exponential of Δ_{g(τ)} over [0, t], which is the equivalent of the exponential for non-commutative operators, and which is denoted T e^{∫_0^t Δ_{g(τ)} dτ}. The ordered exponential can be expressed in terms of the power series:
T e^{∫_0^t Δ_{g(τ)} dτ} = I + ∫_0^t Δ_{g(τ_1)} dτ_1 + ∫_0^t Δ_{g(τ_1)} ∫_0^{τ_1} Δ_{g(τ_2)} dτ_2 dτ_1 + · · ·
= I + Σ_{i=1}^∞ ∫_0^t Δ_{g(τ_1)} ∫_0^{τ_1} Δ_{g(τ_2)} · · · ∫_0^{τ_{i−1}} Δ_{g(τ_i)} dτ_i dτ_{i−1} · · · dτ_2 dτ_1,
where I denotes the identity in the appropriate space. The operator T can be thought of as enforcing time order in the products of the standard exponential expansion, in a sense which is equivalent to the power series definition, cf. [26]. Considering T e^{∫_0^t Δ_{g(τ)} dτ} as a formal power series, it is easy to see that the heat equation is satisfied. Let {φ_l}_{l≥0} be the eigenfunctions of Δ_{g(0)}, ordered in terms of increasing eigenvalue, and define E_K := Span{φ_l : 0 ≤ l ≤ K}. Then Δ_{g(0)} is bounded on E_K, and furthermore, as g(τ) is a smooth family of metrics for τ ∈ [0, T], we can conclude Δ_{g(τ)} is uniformly bounded on E_K for all τ ∈ [0, T] (see Appendix B). Hence on E_K the power series for T e^{∫_0^t Δ_{g(τ)} dτ} converges, and by uniqueness we conclude T e^{∫_0^t Δ_{g(τ)} dτ} is the heat kernel on E_K. Henceforth we will use
(21) T_Z^{(t)} = T e^{∫_0^t Δ_{g(τ)} dτ}
as the notation for the Neumann heat kernel. We remark that E_K could also be defined independent of the differential structure of the manifold as K-band-limited functions on M, cf. [27]. We prefer the former definition as it provides consistency with the definitions in [4].

Recall that n is the number of spatial samples of the manifold M, and that m is the number of temporal measurements on the time interval [0, T]. We assume that the time interval [0, T] is divided into m intervals [τ_{i−1}, τ_i), each of length ε, where τ_0 = 0 and τ_m = T. For simplicity, we assume that our m measurements are taken at (τ_1, ..., τ_m). Our main result is that in the limit of large data, both spatially and temporally, the transition operator P_ε^{(⌈t/ε⌉)} converges to the heat kernel:
P_ε^{(⌈t/ε⌉)} → T e^{∫_0^t Δ_{g(τ)} dτ}   as n → ∞ and ε → 0.
More precisely:
Theorem 3.1.
Suppose the isometric embedding M_τ ⊂ R^d of a time-dependent manifold (M, g(τ)) is measured at a common set X = {x_j}_{j=1}^n ⊂ M of n points at ε spaced units of time over a time interval [0, T], so that, in particular, we have time samples (τ_i)_{i=1}^m ⊂ [0, T] with τ_i = i · ε and m = T/ε. Then, for any sufficiently smooth function f : M → R and t ≤ T, the heat kernel T e^{∫_0^t Δ_{g(τ)} dτ} can be approximated by the operator P_ε^{(⌈t/ε⌉)}:
(22) P_ε^{(⌈t/ε⌉)} f(x_j) = T e^{∫_0^t Δ_{g(τ)} dτ} f(x_j) + O(n^{−1/2} ε^{−d/4−1/2}, ε),   x_j ∈ X.

We prove Theorem 3.1 in the following. Recall that E_K is defined to be the span of the first K + 1 eigenfunctions of the Laplace-Beltrami operator Δ_{g(0)}.

Lemma 3.2. On E_K, the heat kernel T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} admits the following asymptotic expansion:
(23) T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ε · Δ_{g(σ)} + O(ε²).

Proof.
By definition,
T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ∫_σ^{σ+ε} Δ_{g(τ_1)} dτ_1 + ∫_σ^{σ+ε} Δ_{g(τ_1)} ∫_σ^{τ_1} Δ_{g(τ_2)} dτ_2 dτ_1 + ∫_σ^{σ+ε} Δ_{g(τ_1)} ∫_σ^{τ_1} Δ_{g(τ_2)} ∫_σ^{τ_2} Δ_{g(τ_3)} dτ_3 dτ_2 dτ_1 + · · ·
The tail of this series, starting with the third term, is O(ε²). For the second term, we can separate the contribution of Δ_{g(σ)} from the integral, yielding:
T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ε · Δ_{g(σ)} + ∫_σ^{σ+ε} (Δ_{g(τ)} − Δ_{g(σ)}) dτ + O(ε²).
Now we can bound the integral by using the smoothness of Δ_{g(τ)}, which results from the smoothness of g(τ). By the smoothness of Δ_{g(τ)} with respect to time and the fact that it is uniformly bounded on E_K for all τ ∈ [0, T],
‖Δ_{g(τ)} − Δ_{g(τ′)}‖ ≤ C |τ − τ′|,   C ∈ R_+.
Therefore,
‖∫_σ^{σ+ε} (Δ_{g(τ)} − Δ_{g(σ)}) dτ‖ ≤ ε · sup_{τ ∈ [σ, σ+ε]} ‖Δ_{g(τ)} − Δ_{g(σ)}‖ ≤ Cε²,
and hence we have:
T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ε · Δ_{g(σ)} + O(ε²),
as desired. □
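The short time expansion (23) can be observed numerically on a small stand-in for Δ_{g(τ)}. In the sketch below (our own illustration; the family generator(tau) is a hypothetical 3 × 3 graph-Laplacian-like surrogate for Δ_{g(τ)} restricted to a finite-dimensional subspace), the deviation of I + ε·Δ_{g(σ)} from a fine-step approximation of the ordered exponential over [σ, σ + ε] decays quadratically, so halving ε roughly quarters the error.

```python
import numpy as np

def generator(tau):
    """Hypothetical smooth family of symmetric generators standing in for
    Delta_{g(tau)} on a finite-dimensional subspace E_K."""
    w = 1.0 + 0.5 * np.sin(tau)          # smoothly varying edge weight
    B = w * np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
    return B - np.diag(B.sum(axis=1))    # graph-Laplacian sign convention

def ordered_exp(sigma, eps, steps=400):
    """Fine-step product approximation of the ordered exponential
    over the interval [sigma, sigma + eps]."""
    h = eps / steps
    U = np.eye(3)
    for k in range(steps):
        U = (np.eye(3) + h * generator(sigma + (k + 1) * h)) @ U
    return U

def expansion_error(sigma, eps):
    """Deviation of the first-order expansion (23) from the ordered exponential."""
    first_order = np.eye(3) + eps * generator(sigma)
    return np.linalg.norm(ordered_exp(sigma, eps) - first_order)

err1 = expansion_error(sigma=0.3, eps=0.2)
err2 = expansion_error(sigma=0.3, eps=0.1)   # roughly err1 / 4, consistent with O(eps^2)
```

The same experiment, iterated over consecutive intervals, illustrates the long-time product expansion developed next.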
By using the short time expansion (23), we can develop a long time asymptotic expansion of the heat kernel. In the following, we write Δ_i = Δ_{g(τ_i)} to simplify notation.

Lemma 3.3.
Let (τ_1, τ_2, ..., τ_m) be a uniform partition of [0, T] with step size ε > 0, so that τ_i = ε · i and τ_m = m · ε = T. For 0 < t ≤ T, define ℓ := min{i ∈ {1, ..., m} : τ_i ≥ t}. Then on E_K the heat kernel T e^{∫_0^t Δ_{g(τ)} dτ} has the expansion
(24) T e^{∫_0^t Δ_{g(τ)} dτ} = (I + εΔ_ℓ)(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(ε).

Proof.
First suppose t = τ_ℓ. Using the semigroup property (17) of the heat kernel, we have:
(25) T e^{∫_0^t Δ_{g(τ)} dτ} = T e^{∫_{τ_{ℓ−1}}^{τ_ℓ} Δ_{g(τ)} dτ} · T e^{∫_{τ_{ℓ−2}}^{τ_{ℓ−1}} Δ_{g(τ)} dτ} · · · T e^{∫_{τ_0}^{τ_1} Δ_{g(τ)} dτ}.
Combining the expansion (25) and Lemma 3.2 yields the following:
T e^{∫_0^t Δ_{g(τ)} dτ} = (I + εΔ_ℓ + O(ε²))(I + εΔ_{ℓ−1} + O(ε²)) · · · (I + εΔ_1 + O(ε²)) = (I + εΔ_ℓ)(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + Ξ.
We now analyze the error term Ξ. First note that for small enough ε we have:
‖I + εΔ_i‖ ≤ 1,   for all i = 1, ..., m.
Furthermore, since [0, T] is fixed, the number of temporal measurements m is m = T/ε = O(1/ε). Putting together these two facts, one obtains:
Ξ = Σ_{k=1}^{ℓ} (ℓ choose k) O(ε^{2k}) ≤ Σ_{k=1}^{ℓ} (ℓ^k / k!) O(ε^{2k}) = Σ_{k=1}^{ℓ} (1/k!) O(ε^k) = O(ε),
which proves (24) for t = τ_ℓ.
Now let τ_ℓ > t. The previous argument shows:
T e^{∫_0^{τ_ℓ} Δ_{g(τ)} dτ} = (I + εΔ_{g(τ_ℓ)})(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(ε),
T e^{∫_0^t Δ_{g(τ)} dτ} = (I + εΔ_{g(t)})(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(ε).
Therefore the difference between the two operators is:
‖T e^{∫_0^{τ_ℓ} Δ_{g(τ)} dτ} − T e^{∫_0^t Δ_{g(τ)} dτ}‖ ≤ ε ‖Δ_{g(τ_ℓ)} − Δ_{g(t)}‖ + O(ε) ≤ Cε² + O(ε) = O(ε),
and so (24) holds for any 0 < t ≤ T. □

Proof of Theorem 3.1.
Recall that each of our diffusion operators P_{ε,i}, for fixed i, is defined the same as in the standard diffusion maps framework. Hence, by adapting the convergence result from Singer [28], we have that on E_K:
P_{ε,i} = I + εΔ_i + O(n^{−1/2} ε^{−d/4+1/2}, ε²).
Let ℓ = ⌈t/ε⌉ and recall our ℓ-step transition operator P_ε^{(ℓ)} is defined as
P_ε^{(ℓ)} := P_{ε,ℓ} P_{ε,ℓ−1} · · · P_{ε,1}.
The proof of Lemma 3.3 demonstrates that the error of the product increases by a factor of ε^{−1}, hence:
P_ε^{(ℓ)} = (I + εΔ_ℓ)(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(n^{−1/2} ε^{−d/4−1/2}, ε).
Therefore, using Lemma 3.3 we conclude that
P_ε^{(⌈t/ε⌉)} = T e^{∫_0^t Δ_{g(τ)} dτ} + O(n^{−1/2} ε^{−d/4−1/2}, ε)
on the set E_K, where K is fixed but arbitrary. The result follows by taking the union and closure of ∪_{K>0} E_K, as in [4]. □

Remark.
The connection to heat diffusion can also be considered in terms of infinitesimal generators. For the static case, the product P_ε^{t/ε} approaches a continuous-time Markov operator P^t, which can be represented in terms of a generator G as P^t = e^{tG}, where G = Δ. In the above, we have shown that the ⌈t/ε⌉-step transition operator for our inhomogeneous Markov chain, P_ε^{(⌈t/ε⌉)}, approaches the continuous process T e^{∫_0^t Δ_{g(τ)} dτ}, whose time-dependent infinitesimal generator at time t is Δ_{g(t)}.

4. Numerical Example
Suppose that $D$ consists of $n = 10{,}000$ points sampled uniformly at random from the unit disc in $\mathbb{R}^2$. Define the maps $h, v : \mathbb{R}^2 \to \mathbb{R}^2$ by

$h(x, y) = \big(x,\ y \cdot (1 - \cos \pi x)\big)$ and $v(x, y) = \big(x \cdot (1 - \cos \pi y),\ y\big).$

We refer to the images $h(D)$ and $v(D)$ as horizontal and vertical barbells, respectively. In the following, we define a deformation of the $n = 10{,}000$ points from a horizontal to a vertical barbell over $m = 9$ times. We define $X_1 = h(D)$, $X_5 = D$, and $X_9 = v(D)$. The intermediate sets $X_2, X_3, X_4$ and $X_6, X_7, X_8$ are defined by linear interpolation. Specifically, if $X_i = \{x_j^{(i)}\}_{j=1}^n$, then

$x_j^{(i)} = \big(x_j^{(5)} - x_j^{(1)}\big) \cdot \frac{i-1}{4} + x_j^{(1)}$ and $x_j^{(i)} = \big(x_j^{(9)} - x_j^{(5)}\big) \cdot \frac{i-5}{4} + x_j^{(5)},$

for $i = 2, 3, 4$ and $i = 6, 7, 8$, respectively. In the first row of Figure 2 the data $X_i$ is plotted for $i = 1, 5, 9$. The sequence of data sets $(X_1, \ldots, X_9)$ can be stored as a $10{,}000 \times 2 \times 9$ data tensor, i.e., $n = 10{,}000$ points in $\mathbb{R}^2$ measured over the 9 times. Moreover, by fixing the first coordinate of this tensor, we obtain for each point a $2 \times 9$ matrix, i.e., a trajectory in $\mathbb{R}^2$ over the 9 times. Note how, as discussed in Section 2.1, even though points were sampled uniformly at time $i = 5$, the deformation of the data manifold creates a nonuniform distribution of points at the other times. In the following, we compare three methods of embedding the data.

First, as a baseline, we compute the diffusion map for each fixed data set $X_1$, $X_5$, and $X_9$ and plot the first three embedding coordinates $\varphi_1, \varphi_2, \varphi_3$ in the second row of Figure 2. Second, we define a concatenated data diffusion map based on the concatenation of the data up to time $i$. That is, at time $i$ we consider the data

$X^{(i)} := \big\{\big(x_j^{(1)}, \ldots, x_j^{(i)}\big)\big\}_{j=1}^n \subseteq \mathbb{R}^{2i},$

and construct a diffusion map via equations (2), (3), (4), and (5), where the kernel $K_{\varepsilon,i}(x_j, x_k)$ in equation (2) is replaced by the Gaussian kernel $K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k)$ defined by

$K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k) := \exp\left(-\frac{\big\|\big(x_j^{(1)}, \ldots, x_j^{(i)}\big) - \big(x_k^{(1)}, \ldots, x_k^{(i)}\big)\big\|^2_{\ell^2(\mathbb{R}^{2i})}}{\varepsilon}\right).$

The first three coordinates $\zeta_1, \zeta_2, \zeta_3$ of the concatenated data diffusion map are plotted in the third row of Figure 2. Finally, for $\{X_i\}_{i=1}^1$, $\{X_i\}_{i=1}^5$, and $\{X_i\}_{i=1}^9$ we computed the time coupled diffusion map and plot the first three coordinates $\psi_1, \psi_2, \psi_3$ in the fourth row of Figure 2. We would like to draw the reader's attention to the difference between the concatenated data diffusion map and the time coupled diffusion map at time $i = 9$. To facilitate visualization, we have plotted the first two coordinates of these embeddings in Figure 3.

Observe that each point $x_j$ in the data $\{x_j^{(i)}\}_{j=1}^n$, $i = 1, \ldots, 9$, can be classified by the side (left or right) on which the point resides in the horizontal barbell at time $i = 1$, and the side (up or down) on which the point resides in the vertical barbell at time $i = 9$.
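As a concrete sketch, the barbell construction and the per-time diffusion operators described above can be implemented as follows. This is an illustrative re-implementation, not the authors' code: the sample size is reduced, the bandwidth `eps` is an arbitrary choice, and any density normalization appearing in equations (2)–(5) is omitted, so only the plain row-normalized Gaussian kernel is used.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # reduced from the paper's n = 10,000 to keep the sketch fast

# sample n points uniformly from the unit disc (rejection sampling)
pts = rng.uniform(-1, 1, size=(4 * n, 2))
D = pts[np.sum(pts ** 2, axis=1) <= 1][:n]

def h(q):  # horizontal barbell: pinch the disc along the line x = 0
    return np.column_stack([q[:, 0], q[:, 1] * (1 - np.cos(np.pi * q[:, 0]))])

def v(q):  # vertical barbell: pinch the disc along the line y = 0
    return np.column_stack([q[:, 0] * (1 - np.cos(np.pi * q[:, 1])), q[:, 1]])

# m = 9 snapshots: X_1 = h(D), X_5 = D, X_9 = v(D), linear interpolation between
X = {1: h(D), 5: D, 9: v(D)}
for i in (2, 3, 4):
    X[i] = X[1] + (X[5] - X[1]) * (i - 1) / 4
for i in (6, 7, 8):
    X[i] = X[5] + (X[9] - X[5]) * (i - 5) / 4

def markov(Y, eps):
    """Row-normalized Gaussian kernel on one snapshot (the per-time operator)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)
    return K / K.sum(axis=1, keepdims=True)

# time coupled operator: product of the per-time Markov matrices, latest first
eps = 0.05
P = np.eye(n)
for i in range(1, 10):
    P = markov(X[i], eps) @ P

# four ground-truth classes: left/right at time 1 crossed with up/down at time 9
labels = 2 * (D[:, 0] > 0) + (X[9][:, 1] > 0)
```

Since each factor is row-stochastic, the product `P` is again a Markov matrix; its singular value decomposition (after the normalization discussed in the paper) yields the time coupled diffusion map coordinates.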
Figure 2. In the first row, the data sets $X_i$ are plotted for times $i = 1, 5, 9$; in the second row, the first three coordinates of the diffusion map of the fixed sets $X_1$, $X_5$, and $X_9$ are plotted; in the third row, the first three coordinates of the concatenated data diffusion map (as described in Section 4) for the sequences of data sets $\{X_i\}_{i=1}^1$, $\{X_i\}_{i=1}^5$, and $\{X_i\}_{i=1}^9$ are plotted; finally, in the fourth row, the time coupled diffusion maps for the sequences of data sets $\{X_i\}_{i=1}^1$, $\{X_i\}_{i=1}^5$, and $\{X_i\}_{i=1}^9$ are plotted. A consistent color map is used across all plots.

Each of the four lines in the time coupled diffusion map embedding corresponds to one of these four classes. On the other hand, we interpret the concatenated data diffusion map by observing that the kernel $K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k)$ can also be written

$K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k) = \exp\left(-\frac{\sum_{l=1}^i \|x_j^{(l)} - x_k^{(l)}\|^2_{\ell^2(\mathbb{R}^2)}}{\varepsilon}\right) = \prod_{l=1}^i \exp\left(-\frac{\|x_j^{(l)} - x_k^{(l)}\|^2_{\ell^2(\mathbb{R}^2)}}{\varepsilon}\right).$

Thus, concatenating the available data is equivalent to either averaging (up to a constant) the square distances for the available times, or multiplying the Gaussian kernels pointwise for the
available times. The resulting concatenated data diffusion map encodes the fact that on average the points are spread out over the disc; the only indication of the existence of the barbells at times $i = 1$ and $i = 9$ are the corners in the embedding. In contrast, the time coupled diffusion map embedding clearly captures the two barbell events: the four lines in this embedding represent the four classes in the data induced by the barbell events at times $i = 1$ and $i = 9$. Moreover, this numerical example demonstrates that the time coupled diffusion map embedding aggregates information over time. Indeed, the two barbell events are temporally disjoint, and thus no single temporal slice of the data is sufficient to recover the four classes in the data. This temporal disjointness is reflected in the standard diffusion map embeddings in row two of Figure 2. In summary, the numerical results provide an example where averaging square distances over time, or equivalently multiplying Gaussian kernels pointwise over time, is insufficient to capture geometric events in the evolution of the data; instead, by viewing the data as an evolving manifold and approximating heat flow on the manifold via the product of diffusion operators, we are able to define an embedding which provides a concise description of the structures which occur in the evolution of the data.

Figure 3. For time $i = 9$, the first two coordinates of the concatenated data DM (left) and time coupled DM (right) are plotted.

5. Continuous time coupled diffusion map
Theorem 3.1 approximates the operator $T_Z^{(t)} : L^2(\mathcal{M}) \to L^2(\mathcal{M})$, defined in (20), with the discrete inhomogeneous Markov chain $P_\varepsilon^{(\lceil t/\varepsilon \rceil)}$. The inhomogeneous chain $P_\varepsilon^{(\lceil t/\varepsilon \rceil)}$ is the foundation of the time coupled diffusion distance, and after the normalization in (9) the singular value decomposition can be used to construct the corresponding time coupled diffusion map (12). In this section we further investigate the connection between the heat kernel $Z$ and the time coupled diffusion map through the continuous operator $T_Z^{(t)}$.

Recall the diffusion operator $T_Z^{(t)} f(x) = \int_{\mathcal{M}} Z(x,t;y,0) f(y)\,dV(y,0)$ introduced in Section 3. If $f$ is an initial distribution over $\mathcal{M}$, the operator $T_Z^{(t)}$ diffuses $f$ forward in time over the changing manifold geometry, up to time $\tau = t$. The adjoint of this operator is $(T_Z^{(t)})^* : L^2(\mathcal{M}) \to L^2(\mathcal{M})$,

$(T_Z^{(t)})^* f(x) = \int_{\mathcal{M}} Z(y,t;x,0) f(y)\,dV(y,t).$

It follows from (18) that the operator $(T_Z^{(t)})^*$ maps a probability distribution over $(\mathcal{M}, g(t))$ backwards in time to a probability distribution over $(\mathcal{M}, g(0))$. Indeed, if $f$ is a probability density function, then $(T_Z^{(t)})^* f$ is its posterior probability distribution as in Section 2. In the continuous setting, the analysis is performed over the manifold $(\mathcal{M}, g(\cdot))$, and all integration is performed with respect to the Riemannian volume on the manifold, similar to the approach in [22, 21, 20]. In the continuous case, the stationary distribution is constant and the normalization in (9) is therefore unnecessary. Moreover, the operator $T_Z^{(t)}$ is bi-stochastic in the sense that the constant function is an eigenfunction of both $T_Z^{(t)}$ and $(T_Z^{(t)})^*$. Indeed, in Lemma 3.2 we showed that $T_Z^{(t)}$ can be written as the limit of a product, with an increasing number of terms, of self-adjoint operators, which each have the constant eigenfunction. Therefore, the continuous analog of the discrete diffusion distance (7) is the diffusion distance

$D^{(t)}(x,y)^2 = \int_{\mathcal{M}} \big(Z(x,t;w,0) - Z(y,t;w,0)\big)^2\,dV(w,0),$

or equivalently, $D^{(t)}(x,y) = \big\|(T_Z^{(t)})^* \delta_x - (T_Z^{(t)})^* \delta_y\big\|_{L^2(\mathcal{M})}$, where $\delta_x$ is the Dirac distribution centered at $x \in \mathcal{M}$ (compare to (7)). Since the "row" sums of $Z(x,t;y,0)$ are 1 (see (18)), and the manifold $\mathcal{M}$ is compact:

$\int_{\mathcal{M}} \int_{\mathcal{M}} Z(x,t;y,0)^2\,dV(y,0)\,dV(x,t) < \infty.$

Therefore, the operator $T_Z^{(t)} f(x) = \int_{\mathcal{M}} Z(x,t;y,0) f(y)\,dV(y,0)$ is compact; see for example [29]. Similarly, the operator $(T_Z^{(t)})^*$ is compact, and thus $T_Z^{(t)} (T_Z^{(t)})^*$ is compact, since it is the composition of two compact operators. Moreover, since $T_Z^{(t)} (T_Z^{(t)})^*$ is also self-adjoint, by the spectral theorem $T_Z^{(t)} (T_Z^{(t)})^*$ has an eigendecomposition, and similarly, $(T_Z^{(t)})^* T_Z^{(t)}$ has an eigendecomposition. Therefore, analogous to the discrete decomposition (10), $T_Z^{(t)}$ has a singular value decomposition

$T_Z^{(t)} f(x) = \sum_{k \ge 0} \sigma_k^{(t)} \big\langle f, \varphi_k^{(t)} \big\rangle\, \psi_k^{(t)}(x),$

where $\varphi_k^{(t)} \in L^2(\mathcal{M})$ is a right singular function of $T_Z^{(t)}$ (an eigenfunction of $(T_Z^{(t)})^* T_Z^{(t)}$), $\psi_k^{(t)} \in L^2(\mathcal{M})$ is a left singular function of $T_Z^{(t)}$ (an eigenfunction of $T_Z^{(t)} (T_Z^{(t)})^*$), and $\{\sigma_k^{(t)}\}_{k \ge 0}$ are the corresponding singular values (the square roots of the eigenvalues of $(T_Z^{(t)})^* T_Z^{(t)}$, or equivalently of $T_Z^{(t)} (T_Z^{(t)})^*$, since these operators share the same spectrum). However, recall that in the discrete case we must take into account the stationary distribution of the Markov chain using the normalization in (9), since the discrete operator $P_\varepsilon^{(t)}$ is not in general bi-stochastic. In the continuous setting the left singular functions are used to define the time coupled diffusion map (12), written here as:

$\Psi^{(t)}(x) = \big(\sigma_k^{(t)} \psi_k^{(t)}(x)\big)_{k \ge 1},$

where the index $k$ starts at 1 because the 0th left singular function is constant. We remark that the operator $T_Z^{(t)} (T_Z^{(t)})^*$ can be written as an integral operator

(26) $T_Z^{(t)} (T_Z^{(t)})^* f(x) = \int_{\mathcal{M}} K_Z^b(x,y;t) f(y)\,dV(y,t)$, where $K_Z^b(x,y;t) = \int_{\mathcal{M}} Z(x,t;w,0)\, Z(y,t;w,0)\,dV(w,0).$
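In the discrete setting, the relationship between the normalized operator, its SVD, and the diffusion distance can be checked numerically. The following sketch assumes, consistent with Appendix A, that the normalization in (9) amounts to conjugating the Markov matrix by the square root of its stationary distribution; the matrix here is a generic random stand-in, not an operator built from data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

# a discrete stand-in for the diffusion operator: random row-stochastic matrix
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)

# stationary distribution pi: the left Perron eigenvector, pi P = pi
w, V = np.linalg.eig(P.T)
pi = np.abs(np.real(V[:, np.argmax(np.real(w))]))
pi /= pi.sum()

# assumed form of the normalization (9): A = Pi^{1/2} P Pi^{-1/2}
A = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
U, S, Vt = np.linalg.svd(A)

# posterior distance between two states equals the Euclidean distance between
# their singular-value-weighted rows of U; note the 0th component does not drop
# out of the difference here, unlike the continuous bi-stochastic case where
# the 0th singular function is constant
j, k = 0, 1
e = np.eye(n)
d_posterior = np.linalg.norm(A.T @ (e[j] - e[k]))
d_embed = np.linalg.norm(S * (U[j] - U[k]))
print(abs(d_posterior - d_embed))  # agrees up to round-off
```

The agreement is exact SVD algebra, since $A^\top (e_j - e_k) = V \Sigma U^\top (e_j - e_k)$ and $V$ is orthogonal; the top singular value of the conjugated matrix is 1, matching Lemma A.1.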
We refer to $K_Z^b(x,y;t)$ as the backwards kernel since it is integrated against functions on $(\mathcal{M}, g(t))$, i.e., integrated against functions with respect to the Riemannian volume at the end of the time interval $[0, t]$. Similarly, the operator $(T_Z^{(t)})^* T_Z^{(t)}$ has the integral representation

$(T_Z^{(t)})^* T_Z^{(t)} f(x) = \int_{\mathcal{M}} K_Z^f(x,y;t) f(y)\,dV(y,0)$, where $K_Z^f(x,y;t) = \int_{\mathcal{M}} Z(w,t;x,0)\, Z(w,t;y,0)\,dV(w,t).$

We refer to $K_Z^f$ as the forward kernel since it is integrated against functions on $(\mathcal{M}, g(0))$. The kernels $K_Z^b$ and $K_Z^f$ can be interpreted as analogous to the matrices $A^{(t)} (A^{(t)})^\top$ and $(A^{(t)})^\top A^{(t)}$ from the discrete case, whose eigenvectors are the left and right singular vectors of $A^{(t)}$, respectively. Since the time coupled diffusion map is based on the SVD of $T_Z^{(t)}$, diffusing $\varphi_k^{(t)}$ forward in time results in $\sigma_k^{(t)} \psi_k^{(t)}$, and conversely the backward propagation of $\psi_k^{(t)}$ leads to $\sigma_k^{(t)} \varphi_k^{(t)}$; specifically,

$T_Z^{(t)} \varphi_k^{(t)} = \sigma_k^{(t)} \psi_k^{(t)}$ and $(T_Z^{(t)})^* \psi_k^{(t)} = \sigma_k^{(t)} \varphi_k^{(t)}.$

If we define $\widetilde{\Phi}^{(t)}(x) = \big(\varphi_k^{(t)}(x)\big)_{k \ge 1}$ and set $T_Z^{(t)} \widetilde{\Phi}^{(t)}(x) = \big(T_Z^{(t)} \varphi_k^{(t)}(x)\big)_{k \ge 1}$, then this observation leads to:

$D^{(t)}(x,y) = \big\|(T_Z^{(t)})^* \delta_x - (T_Z^{(t)})^* \delta_y\big\|_{L^2(\mathcal{M})} = \big\|\Psi^{(t)}(x) - \Psi^{(t)}(y)\big\|_{\ell^2} = \big\|T_Z^{(t)} \widetilde{\Phi}^{(t)}(x) - T_Z^{(t)} \widetilde{\Phi}^{(t)}(y)\big\|_{\ell^2}.$

We see that the difference in posterior distributions of $\delta_x$ and $\delta_y$ is equivalent to the difference between the diffused distributions of $\widetilde{\Phi}^{(t)}(x)$ and $\widetilde{\Phi}^{(t)}(y)$; these equivalent distances are, in turn, equal to the Euclidean distance between points under the time coupled diffusion map $\Psi^{(t)}$.

6. Conclusion
We have introduced the notion of time coupled diffusion maps as a method of summarizing evolving data via an embedding into Euclidean space. In particular, we have introduced a method of modeling evolving data as samples from a manifold with a time-dependent metric. We show that the constructed time inhomogeneous Markov chain approximates heat diffusion over a manifold $(\mathcal{M}, g(\cdot))$ with a smoothly varying family of metrics $g(\tau)$. In the context of manifold learning, these operators and resulting embeddings are related to the heat kernel of $\partial_t u = \Delta_{g(t)} u$, which provides geometric and probabilistic interpretations. Numerical experiments indicate that the map encodes aggregate geometric information over arbitrarily long time scales, and thus summarizes the geometry of a sequence of datasets in a useful way.

These results open new directions related to diffusion based manifold learning, and in particular the use of inhomogeneous Markov chains and asymmetric diffusion semigroups to understand data geometry. Early results on single cell data [11] show the usefulness of this type of diffusion process in biology, but further numerical and theoretical study is needed. Understanding precisely the nature of the geometric information encoded by the time coupled diffusion map (extending [21, 22]) would give theoretical insight and could lead to further developments. More immediately, the results contained here provide a theoretical foundation for understanding dynamic data that can be modeled as a time varying manifold.

7. Acknowledgements
N.M. was a participant in the 2013 Research Experience for Undergraduates (REU) at Cornell University under the supervision of M.H. During the REU program both were supported by the National Science Foundation grant number NSF-1156350. This paper is the result of work started during the REU. M.H. was supported by the European Research Council (ERC) grant InvariantClass 320959 while writing the first version of the paper. He is currently supported by the Alfred P. Sloan Fellowship (grant number FG-2016-6607), the DARPA Young Faculty Award (grant number D16AP00117), and the NSF (grant number 1620216). Both authors would like to thank Ronald Coifman for numerous insightful conversations, and the reviewers for their comments which improved the manuscript.
References

[1] M. Belkin, Problems of learning on manifolds, Ph.D. thesis, University of Chicago (2003).
[2] S. Lafon, Diffusion maps and geometric harmonics, Ph.D. thesis, Yale University (2004).
[3] M. Belkin, P. Niyogi, Towards a theoretical foundation for Laplacian-based manifold methods, Journal of Computer and System Sciences 74 (8) (2008) 1289–1308.
[4] R. R. Coifman, S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (1) (2006) 5–30.
[5] A. Singer, From graph to manifold Laplacian: The convergence rate, Applied and Computational Harmonic Analysis 21 (1) (2006) 128–134.
[6] A. Singer, H.-T. Wu, Vector diffusion maps and the connection Laplacian, Communications on Pure and Applied Mathematics 65 (8) (2012) 1067–1144.
[7] G. Wolf, A. Averbuch, Linear-projection diffusion on smooth Euclidean submanifolds, Applied and Computational Harmonic Analysis 34 (1) (2013) 1–14.
[8] T. Berry, T. Sauer, Local kernels and the geometric structure of data, Applied and Computational Harmonic Analysis (2015), in press.
[9] R. Talmon, I. Cohen, S. Gannot, R. R. Coifman, Diffusion maps for signal processing: A deeper look at manifold-learning techniques based on kernels and graphs, IEEE Signal Processing Magazine 30 (4) (2013) 75–86. doi:10.1109/MSP.2013.2250353.
[10] A. D. Szlam, M. Maggioni, R. R. Coifman, Regularization on graphs with function-adapted diffusion processes, Journal of Machine Learning Research 9 (2008) 1711–1739. URL http://jmlr.org/papers/volume9/szlam08a/szlam08a.pdf
[11] T. Welp, G. Wolf, M. Hirn, S. Krishnaswamy, A diffusion-based condensation process for multiscale analysis of single cell data, in: ICML Workshop on Computational Biology, New York, NY, 2016, pp. 1–5.
[12] B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, Z. Tu, Unsupervised metric fusion by cross diffusion, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 2997–3004. doi:10.1109/CVPR.2012.6248029.
[13] R. R. Coifman, M. J. Hirn, Diffusion maps for changing data, Applied and Computational Harmonic Analysis 36 (1) (2014) 79–107. doi:10.1016/j.acha.2013.03.001.
[14] R. R. Lederman, R. Talmon, Common manifold learning using alternating-diffusion, Tech. rep., Yale (2014).
[15] R. R. Lederman, R. Talmon, H.-T. Wu, Y.-L. Lo, R. R. Coifman, Alternating diffusion for common manifold learning with application to sleep stage assessment, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5758–5762. doi:10.1109/ICASSP.2015.7179075.
[16] O. Lindenbaum, A. Yeredor, M. Salhov, A. Averbuch, Multiview diffusion maps, arXiv:1508.05550 (2015).
[17] R. Talmon, H.-T. Wu, Latent common manifold learning with alternating diffusion: analysis and applications, arXiv:1602.00078.
[18] R. Banisch, P. Koltai, Understanding the geometry of transport: Diffusion maps for Lagrangian trajectory data unravel coherent sets, Chaos 27 (2017) 035804, arXiv:1603.04709.
[19] T. Berry, J. Harlim, Variable bandwidth diffusion kernels, Applied and Computational Harmonic Analysis (2015). doi:10.1016/j.acha.2015.01.001.
[20] P. Bérard, G. Besson, S. Gallot, Embedding Riemannian manifolds by their heat kernel, Geometric and Functional Analysis 4 (4) (1994) 373–398.
[21] H. Abdallah, Processus de diffusion sur un flot de variétés Riemanniennes, Ph.D. thesis, L'Université de Grenoble (2010).
[22] H. Abdallah, Embedding Riemannian manifolds via their eigenfunctions and their heat kernel, Bulletin of the Korean Mathematical Society 49 (5) (2012) 939–947.
[23] H. Rutishauser, Computational aspects of F. L. Bauer's simultaneous iteration method, Numerische Mathematik 13 (1) (1969) 4–13. doi:10.1007/BF02165269. URL http://dx.doi.org/10.1007/BF02165269
[24] C. M. Guenther, The fundamental solution on manifolds with time-dependent metrics, The Journal of Geometric Analysis 12 (3) (2002) 425–436. doi:10.1007/BF02922048.
[25] B. Chow, S.-C. Chu, D. Glickenstein, C. Guenther, J. Isenberg, T. Ivey, D. Knopf, P. Lu, F. Luo, L. Ni, The Ricci Flow: Techniques and Applications, Part III: Geometric-Analytic Aspects, Vol. 163, AMS, 2008.
[26] P.-L. Giscard, K. Lui, S. J. Thwaite, D. Jaksch, An exact formulation of the time-ordered exponential using path-sums, arXiv:1410.6637.
[27] S. Mousazadeh, I. Cohen, Out-of-sample extension of band-limited functions on homogeneous manifolds using diffusion maps, Signal Processing 108 (2015) 521–529. doi:10.1016/j.sigpro.2014.10.024.
[28] A. Singer, From graph to manifold Laplacian: The convergence rate, Applied and Computational Harmonic Analysis 21 (1) (2006) 128–134, Special Issue: Diffusion Maps and Wavelets. doi:10.1016/j.acha.2006.03.004.
[29] M. Pedersen, Functional Analysis in Applied Mathematics and Engineering, Chapman & Hall/CRC Press, 2000.
Appendix A. Diffusion geometry from Markov kernels
We show that the operator $A^{(t)}$, defined in Section 2, line (9), has operator norm one and that $(\pi^{(t)})^{1/2}$, the square root of its stationary distribution, is an eigenvector with eigenvalue one. We do so by following [2] and developing a diffusion geometry framework starting from a Markov kernel.

Suppose that $(X, dx)$ is a measure space equipped with a Markov kernel $p(x,y)$. Moreover, we assume the kernel $p(x,y)$ is nonnegative and has the conservation property

$\int_X p(x,y)\,dy = 1$ for all $x \in X$.

The kernel $p(x,y)$ induces a Markov operator $P^*$ from $L^2(X)$ to itself defined by

$(P^* f)(x) = \int_X p(y,x) f(y)\,dy$ for all $x \in X$.

The notation $P^*$ has been used to bring attention to the fact that $P^*$ is the adjoint of the operator $P$ defined as:

$(P f)(x) = \int_X p(x,y) f(y)\,dy.$

We assume that $P^*$ has a unique strictly positive stationary distribution $v(x)$, that is to say,

$(P^* v)(x) = \int_X p(y,x) v(y)\,dy = v(x)$ for all $x \in X$.

For example, if $X$ is a finite set with the counting measure (as in Section 2), it would suffice to assume that the Markov chain is irreducible and positive recurrent. By $v^{1/2}(x)$ we denote the positive pointwise square root of $v(x)$. We define an operator $A : L^2(X) \to L^2(X)$ by

(27) $(Af)(x) = \int_X \frac{v^{1/2}(x)\,p(x,y)}{v^{1/2}(y)}\, f(y)\,dy = \int_X a(x,y) f(y)\,dy$ for all $x \in X$.

Here the kernel $a(x,y) = v^{1/2}(x)\,p(x,y)/v^{1/2}(y)$. We refer to $A$ as an averaging or diffusion operator and show it has several nice properties.

Lemma A.1.
The operator norm $\|A\| = 1$, and the norm is achieved by the function $v^{1/2}(x)$.

Remark.
Taking $A = A^{(t)}$ establishes the claim from the beginning of the appendix, which was originally stated after line (9) in Section 2.

Proof. First we evaluate $(Av^{1/2})(x)$ to check the second part of the assertion and establish a lower bound on the operator norm:

$(Av^{1/2})(x) = \int_X \frac{v^{1/2}(x)\,p(x,y)}{v^{1/2}(y)}\, v^{1/2}(y)\,dy = v^{1/2}(x) \int_X p(x,y)\,dy = v^{1/2}(x),$

using the Markov property of the kernel. To establish that 1 is an upper bound on the operator norm, we show that $\langle Af, Af \rangle \le \|f\|^2$. First, observe that by Cauchy–Schwarz,

(28) $\left|\int_X \frac{v^{1/2}(x)\,p(x,y)\,f(y)}{v^{1/2}(y)}\,dy\right| \le \left(\int_X v(x)\,p(x,y)\,dy\right)^{1/2} \left(\int_X \frac{p(x,y)\,f(y)^2}{v(y)}\,dy\right)^{1/2} = v^{1/2}(x) \cdot \left(\int_X \frac{p(x,y)\,f(y)^2}{v(y)}\,dy\right)^{1/2}.$

Now, writing $\langle Af, Af \rangle$ as a double integral yields:

$\langle Af, Af \rangle = \int_X \left[\int_X \frac{v^{1/2}(x)\,p(x,y)\,f(y)}{v^{1/2}(y)}\,dy\right] \left[\int_X \frac{v^{1/2}(x)\,p(x,z)\,f(z)}{v^{1/2}(z)}\,dz\right] dx.$

By applying the inequality (28) to each factor, we see that

$\langle Af, Af \rangle \le \int_X \left(\int_X \frac{v(x)\,p(x,y)\,f(y)^2}{v(y)}\,dy\right)^{1/2} \left(\int_X \frac{v(x)\,p(x,z)\,f(z)^2}{v(z)}\,dz\right)^{1/2} dx.$
Applying Cauchy–Schwarz again,

$\langle Af, Af \rangle \le \left(\int_X \int_X \frac{v(x)\,p(x,y)\,f(y)^2}{v(y)}\,dy\,dx\right)^{1/2} \left(\int_X \int_X \frac{v(x)\,p(x,z)\,f(z)^2}{v(z)}\,dz\,dx\right)^{1/2}.$

Finally, using the fact that $v(x)$ is the stationary distribution, so that $\int_X v(x)\,p(x,y)\,dx = v(y)$, shows that:

$\langle Af, Af \rangle \le \left(\int_X f(y)^2\,dy\right)^{1/2} \left(\int_X f(z)^2\,dz\right)^{1/2} = \|f\|^2,$

as was to be shown. $\square$

Appendix B. Uniform boundedness of $\Delta_{g(\tau)}$ on $E_K$

We prove that the family of operators $\{\Delta_{g(\tau)}\}_{0 \le \tau \le T}$ is uniformly bounded on $E_K \subset L^2(\mathcal{M})$, where $E_K = \mathrm{Span}\{\phi_l : 0 \le l \le K\}$, and $\phi_l$ is the $l$-th eigenfunction of $\Delta_{g(0)}$ with eigenvalue $\lambda_l$, ordered so that $0 = \lambda_0 < \lambda_1 \le \cdots \le \lambda_K$.

To simplify notation, set $\Delta_\tau = \Delta_{g(\tau)} : E_K \to L^2(\mathcal{M})$, and consider the function $\alpha(\tau) = \|\Delta_\tau\|$. If $\alpha(\tau)$ is a continuous function in $\tau$, then $\alpha(\tau)$ is uniformly bounded on $[0, T]$, since $\alpha(0) = \lambda_K < \infty$ and $[0, T]$ is compact. It thus remains to show that $\alpha(\tau)$ is a continuous function.

Let $\Delta_\tau^* : L^2(\mathcal{M}) \to E_K$ be the adjoint of $\Delta_\tau$. It suffices to show that $\beta(\tau) = \|\Delta_\tau^* \Delta_\tau\|$ is a continuous function in $\tau$, since $\|\Delta_\tau^* \Delta_\tau\| = \|\Delta_\tau\|^2$. We have $\Delta_\tau^* \Delta_\tau : E_K \to E_K$, and so the operator $\Delta_\tau^* \Delta_\tau$ can be represented by the $(K+1) \times (K+1)$ matrix $M_\tau$, defined through:

$\Delta_\tau^* \Delta_\tau\, \phi_j = \sum_{i=0}^{K} (M_\tau)_{ij}\, \phi_i.$

Recalling that the metric tensor $g(\tau)$ varies smoothly in $\tau$, it follows that the entries of $M_\tau$ are continuous in $\tau$, since in local coordinates:

$\Delta_\tau f = \frac{1}{\sqrt{|g(\tau)|}}\, \partial_i \left(\sqrt{|g(\tau)|}\, g(\tau)^{ij}\, \partial_j f\right),$

where $|g(\tau)|$ is the determinant of $g(\tau)$ in the local chart, $g(\tau)^{ij}$ are the entries of the inverse of the metric tensor, and the Einstein summation convention is used. But then the eigenvalues of $M_\tau$ vary continuously in $\tau$, and in particular the operator norm of $M_\tau$ is a continuous function of $\tau$.
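Lemma A.1 can also be checked numerically for a finite-state chain; a minimal sketch, with a random strictly positive Markov matrix standing in for the kernel $p(x,y)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40

# random Markov kernel: strictly positive, rows sum to 1 (conservation property)
p = rng.random((n, n)) + 0.1
p /= p.sum(axis=1, keepdims=True)

# stationary distribution v of the adjoint: v P = v (left Perron eigenvector)
w, V = np.linalg.eig(p.T)
v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
v /= v.sum()

# conjugated kernel a(x, y) = v(x)^{1/2} p(x, y) / v(y)^{1/2}, as in (27)
A = np.sqrt(v)[:, None] * p / np.sqrt(v)[None, :]

print(np.linalg.norm(A, 2))                         # operator norm: 1 up to round-off
print(np.linalg.norm(A @ np.sqrt(v) - np.sqrt(v)))  # sqrt(v) is fixed by A: ~0
```

The strictly positive entries guarantee an irreducible, aperiodic chain, so the Perron eigenvector is unique and positive, matching the assumptions of the appendix.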
Department of Mathematics, Yale University, New Haven, CT 06511, USA
E-mail address: [email protected]

Department of Computational Mathematics, Science & Engineering and Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
E-mail address: