Time Coupled Diffusion Maps
NICHOLAS F. MARSHALL AND MATTHEW J. HIRN
Abstract.
We consider a collection of n points in R^d measured at m times, which are encoded in an n × d × m data tensor. Our objective is to define a single embedding of the n points into Euclidean space which summarizes the geometry as described by the data tensor. In the case of a fixed data set, diffusion maps and related graph Laplacian methods define such an embedding via the eigenfunctions of a diffusion operator constructed on the data. Given a sequence of m measurements of n points, we introduce the notion of time coupled diffusion maps, which have natural geometric and probabilistic interpretations. To frame our method in the context of manifold learning, we model evolving data as samples from an underlying manifold with a time-dependent metric, and we describe a connection of our method to the heat equation on such a manifold.

1. Introduction
In many machine learning and signal processing tasks, the observable data is high dimensional, but it lies on a low-dimensional intrinsic manifold. In recent years, several manifold learning methods have emerged which attempt to recover the intrinsic manifold underlying datasets. In particular, graph Laplacian methods have become popular due to their practicality and theoretical guarantees [1, 2, 3, 4, 5, 6, 7, 8].

Current graph Laplacian methods implicitly assume a static intrinsic manifold, or equivalently, that the dynamics underlying the data generation process are stationary. For many applications, this stationarity assumption is justified, as datasets often consist of a single snapshot of a system, or are recorded over small time windows. However, in the case where data is accumulated over longer periods of time, accounting for changing dynamics may be advantageous. Furthermore, if a system is particularly noisy, combining a large number of snapshots over time may help recover structure hidden in noise. These observations raise the following question: how can graph Laplacian methods be extended to account for changing dynamics while maintaining theoretical guarantees?

In this paper, we propose modeling data with changing dynamics by assuming there exists an underlying intrinsic manifold with a time-dependent metric. We will describe the proposed method using the diffusion maps framework: a popular graph Laplacian framework which is robust to non-uniform sampling [4]. We remark that diffusion maps are highly related to other manifold learning methods such as Laplacian eigenmaps and spectral clustering. In fact, if data is uniformly sampled from the underlying manifold, diffusion maps [4] is essentially eigenvalue-weighted Laplacian eigenmaps [9]. Although we assume points on the intrinsic manifold are fixed, their geometry, i.e., their dependence structure, is allowed to change.
We can conceptualize samples from a manifold with a time-dependent metric by considering a corresponding point cloud smoothly moving through R^d, produced by isometrically embedding the manifold over time. The evolution of the metric dictates the movement of the points, and vice versa. In practice, datasets conforming to this model are commonly encountered; e.g., an RGB video feed consists of a collection of n pixels which move through R^3. In general, we consider data consisting of a collection of n points in R^d measured at m times, encoded in an n × d × m data tensor X. The tensor X can be expressed as a sequence (X_1, ..., X_m) of n × d matrices whose entries correspond across the sequence. Given such a sequence (X_1, ..., X_m), the time coupled diffusion map framework introduced in this paper is based on the product operator:
(1) P^{(m)} = P_m P_{m−1} · · · P_2 P_1,
where each P_i is a diffusion operator constructed from X_i. We will show that this discrete diffusion process, which is formally defined in the following section, approximates a continuous diffusion process on an assumed underlying manifold with a time-dependent metric. Additionally, we introduce the notion of time coupled diffusion maps, named thus because the time evolution of the data has been coupled to the time evolution of a diffusion process.

Key words and phrases. Manifold learning; dimensionality reduction; diffusion distance; heat equation; time-dependent metric.

1.1. Related works.
In the diffusion geometry literature, several techniques have been developed which also utilize multiple diffusion kernels for a variety of objectives, including: iteratively refining the representation of data, facilitating comparison, and combining multiple measurements of a fixed system.

An early example of a multiple kernel method is the denoising algorithm of Szlam, Maggioni, and Coifman [10], which iteratively smooths an image via an anisotropic diffusion process. That is, the algorithm alternates between constructing a diffusion kernel on a given data set (in this case an image) and applying the constructed kernel to the data:
X_i → P_i,   X_{i+1} = P_i X_i,
where the arrow denotes that P_i is constructed based on X_i. More recently, in [11] Welp, Wolf, Hirn, and Krishnaswamy introduce an iterative diffusion based construction which acts to coarse-grain data. From a theoretical perspective, both of these methods can be considered in the context of the time coupled diffusion framework introduced in this paper.

In [12] Wang, Jiang, Wang, Zhou, and Tu introduce the notion of Cross Diffusion as a metric fusion algorithm with applications to image processing. They demonstrate how multiple metrics on a given data set can be combined by considering the iterative cross diffusion
P_1^{(t+1)} = P_1 P_2^{(t)} P_1^T   and   P_2^{(t+1)} = P_2 P_1^{(t)} P_2^T,
where P_1^{(0)} = P_1 and P_2^{(0)} = P_2 are constructed from two different metrics on the given data. A generalized method for m metrics is also described.

In [13] Coifman and Hirn present a method of extending diffusion maps to allow comparisons across multiple measurements of a system, even when such measurements are of different modalities.

In [14, 15] Lederman and Talmon introduce the idea of Alternating Diffusion: a method of combining measurements from multiple sensors to extract the common source of variability (i.e., the common manifold), while filtering out sensor-specific effects.
The method is based on the product operators P_1 P_2 and P_2 P_1, where P_1 and P_2 are constructed from two different views of the data. In [16], Lindenbaum, Yeredor, Salhov, and Averbuch follow a similar approach, but concatenate a collection of alternating products in a block matrix, defining Multi-View Diffusion Maps. Recently, several applications and extensions of Alternating Diffusion have been developed. In [15] Lederman, Talmon, Wu, Lo, and Coifman demonstrate an application of Alternating Diffusion to sleep stage assessment. In [17] Talmon and Wu describe a general notion of nonlinear manifold filtering, which extracts a common manifold from multiple sensors.

Our work extends the current diffusion maps literature by considering evolving dynamics rather than enhancing the analysis of a fixed system. We consider an n × d × m data tensor describing a system of n points in R^d over m times, and seek to construct a manifold model and diffusion geometry framework for this setting. A similar framework is considered by Banisch and Koltai in [18]. However, rather than study the product operator (1), they study the sum of the operators P_i and prove a relation with the dynamic Laplacian.
1.2. Organization.
The remainder of the paper is organized as follows. In Section 2 we describe the construction of time coupled diffusion maps. In Section 3 we establish a connection between the product operator P^{(t)} and the heat kernel on the assumed underlying manifold with a time-dependent metric. In Section 4 we present numerical results on synthetic data. Section 5 investigates the continuous analog of the operator P^{(t)} and the corresponding time coupled diffusion distance. Concluding remarks are given in Section 6.

2. Time coupled diffusion maps
In this section, we introduce the notion of time coupled diffusion maps.

2.1. Notation.
Let M be a compact smooth manifold with a smooth 1-parameter family of Riemannian metrics g(τ), τ ∈ [0, T], T < ∞. We refer to the parameter τ as time throughout. For each time τ, let ι_τ : M ↪ R^d, M_τ = ι_τ(M), denote an isometric embedding of (M, g(τ)) into d-dimensional Euclidean space, where d is fixed for all τ ∈ [0, T]. Let X = {x_j}_{j=1}^n ⊆ M denote a finite collection of n points sampled from M, and let (τ_0, τ_1, ..., τ_m) be a uniform partition of [0, T] with τ_0 = 0 and τ_m = T. We assume that our data consists of measurements of the n points X at times τ_1, ..., τ_m, such that we have one measurement set for each of the time intervals (τ_{i−1}, τ_i] for i = 1, ..., m. More precisely, our data will consist of a sequence of m sets (X_1, ..., X_m), where X_i = ι_{τ_i}(X) = {x_j^{(i)}}_{j=1}^n; see Figure 1 for an illustration.

Figure 1.
Illustration of the data model.

Such data can be represented as an n × d × m tensor corresponding to n points in R^d measured at m times. Suppose each X_i is distributed over M_{τ_i} according to a density q_{τ_i} : M_{τ_i} → R. We assume that at time τ_1, the n points X_1 = {x_j^{(1)}}_{j=1}^n are sampled independently from M_{τ_1} according to the density q_{τ_1}. Since no re-sampling occurs, any changes in the densities q_{τ_i}, for i = 1, ..., m, result from deformations of q_{τ_1} induced by changes in the Riemannian volume of M, which itself is induced by changes in the Riemannian metric g(τ) over time. In particular,
x_j^{(i)} = ι_{τ_i} ∘ ι_{τ_1}^{−1}(x_j^{(1)})   and   q_{τ_i}(x) = q_{τ_1}(ι_{τ_1} ∘ ι_{τ_i}^{−1}(x)) |det(D(ι_{τ_1} ∘ ι_{τ_i}^{−1}))(x)|.

Remark.
Two remarks are in order. (1)
Even if the initial sampling density is uniform, the changing geometry may alter the density over time. Hence, the density invariant kernel construction described in [2, 4] plays an essential role in our construction. We assume the minimum sampling density over time is bounded below, and the associated error term enters our error analysis as a constant. We refer the reader to a recent paper by Berry and Harlim [19] for a detailed error analysis of variable density kernel constructions. (2)
Since we have assumed the initial samples X_1 are i.i.d., for fixed i each set of samples X_i is also i.i.d., since X_i is the image of the i.i.d. variables X_1 under the continuous, hence measurable, map ι_{τ_i} ∘ ι_{τ_1}^{−1}.

2.2. Kernel construction.
Given a sequence of data sets (X_1, ..., X_m) as described above, we proceed as follows. For each data set X_i, we construct a diffusion operator P_{ε,i} following the diffusion maps framework [4]. For completeness, we include the details of the construction of P_{ε,i} in the following. For each time index i, we define a Gaussian kernel K_{ε,i} on X using the measurements X_i,
(2) K_{ε,i}(x_j, x_k) = exp(−‖x_j^{(i)} − x_k^{(i)}‖² / ε),   x_j, x_k ∈ X and x_j^{(i)}, x_k^{(i)} ∈ X_i,
where ‖·‖ denotes the Euclidean norm in the ambient space R^d. By summing over the second variable of the kernel K_{ε,i} we approximate the density q_{τ_i} by
(3) q_{ε,i}(x_j) = Σ_{k=1}^n K_{ε,i}(x_j, x_k),   x_j ∈ X.
Using q_{ε,i} we normalize the kernel K_{ε,i} as follows:
(4) K̃_{ε,i}(x_j, x_k) = K_{ε,i}(x_j, x_k) / (q_{ε,i}(x_j) q_{ε,i}(x_k)),   x_j, x_k ∈ X.
Then the corresponding diffusion operator P_{ε,i} : R^n → R^n is given by
(5) (P_{ε,i} f)(x_j) = Σ_{k=1}^n [K̃_{ε,i}(x_j, x_k) / Σ_{l=1}^n K̃_{ε,i}(x_j, x_l)] f(x_k).
Each P_{ε,i} is a matrix which is a diffusion operator when applied to column vectors on the left, and a Markov operator when applied to row vectors on the right. For 1 ≤ t ≤ m, define:
(6) P_ε^{(t)} = P_{ε,t} P_{ε,t−1} · · · P_{ε,1}.
Like its constituent components, the matrix P_ε^{(t)} is a diffusion operator when acting on column vectors from the left and a Markov operator when acting on row vectors from the right. As a Markov operator, P_ε^{(t)} acts backwards in time, in the sense that P_ε^{(t)} acts by first applying P_{ε,t}, second P_{ε,t−1}, and so on. On the other hand, as a diffusion operator P_ε^{(t)} acts forward in time, first applying P_{ε,1}, second applying P_{ε,2}, and so on. We have chosen the ordering convention in equation (6) to favor the geometric interpretation.

2.3. Time coupled diffusion distance.
Let δ_j denote a Dirac distribution centered at x_j, i.e., δ_j(x_j) = 1 and δ_j = 0 elsewhere. We compare the points x_j and x_k by comparing the posterior distributions of δ_j^T and δ_k^T under the Markov operator P_ε^{(t)}. More specifically, following [4] we define a diffusion based distance as the L² distance between these posterior distributions weighted by the reciprocal of the stationary distribution of the Markov chain. That is, we define the distance D_ε^{(t)} by
(7) D_ε^{(t)}(x_j, x_k) = ‖δ_j^T P_ε^{(t)} − δ_k^T P_ε^{(t)}‖_{L²(1/π^{(t)})},
where π^{(t)} is the stationary distribution of P_ε^{(t)}, i.e., (π^{(t)})^T P_ε^{(t)} = (π^{(t)})^T, and ‖·‖_{L²(1/π^{(t)})} is the weighted L² norm:
(8) ‖f‖_{L²(1/π^{(t)})} := sqrt( Σ_{j=1}^n f(x_j)² / π^{(t)}(x_j) ).
We refer to D_ε^{(t)} as the time coupled diffusion distance, since it extends the diffusion distance in [4] by coupling the evolution time of the data (X_1, ..., X_t) with that of the diffusion process governed by (P_{ε,1}, ..., P_{ε,t}). Our next objective is to construct a time coupled diffusion map, that is, an embedding of X into Euclidean space which preserves the time coupled diffusion distance. For notational brevity, we suppress the dependence on ε in the following.

2.4. Time coupled diffusion map.
Recall that the diffusion map for a static manifold (M, g) is defined in terms of the eigenvectors and eigenvalues of the transition matrix P, which is constructed in the same manner as (5). The inhomogeneous transition operator P^{(t)} is also row stochastic, but unlike P, does not necessarily have a complete basis of real eigenvectors. Instead, we define the operator A^{(t)} by
(9) A^{(t)} = Π_{(t)}^{1/2} P^{(t)} Π_{(t)}^{−1/2},
where Π_{(t)} denotes the matrix with the stationary distribution π^{(t)} of P^{(t)} along the diagonal and zeros elsewhere. First, observe that (π^{(t)})^{1/2} is both a right and left eigenvector of A^{(t)} with eigenvalue
1. In fact, A^{(t)} has operator norm one and naturally arises when constructing a diffusion framework starting with a Markov chain; see Appendix A. Next, we compute the singular value decomposition (SVD) of A^{(t)}:
(10) A^{(t)} = U_{(t)} Σ_{(t)} V_{(t)}^T,
where U_{(t)} is an orthogonal matrix of left singular vectors, Σ_{(t)} is a diagonal matrix of corresponding singular values, and V_{(t)} is an orthogonal matrix of right singular vectors. Define
(11) Ψ_{(t)} := Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)}.

Lemma 2.1.
The embedding
(12) x_j ↦ δ_j^T Ψ_{(t)}
of the data X into Euclidean space preserves the time coupled diffusion distance. That is to say,
D^{(t)}(x_j, x_k) = ‖δ_j^T Ψ_{(t)} − δ_k^T Ψ_{(t)}‖_{L²}.
We refer to the embedding x_j ↦ δ_j^T Ψ_{(t)} as the time coupled diffusion map.

Proof. By the definition of the time coupled diffusion distance (7) and the definition of the weighted L² norm (8),
(13) D^{(t)}(x_j, x_k) = ‖δ_j^T P^{(t)} − δ_k^T P^{(t)}‖_{L²(1/π^{(t)})} = ‖δ_j^T P^{(t)} Π_{(t)}^{−1/2} − δ_k^T P^{(t)} Π_{(t)}^{−1/2}‖_{L²}.
Multiplying equation (9) on the left by Π_{(t)}^{−1/2} yields Π_{(t)}^{−1/2} A^{(t)} = P^{(t)} Π_{(t)}^{−1/2}, and substituting this expression into (13) gives
D^{(t)}(x_j, x_k) = ‖δ_j^T Π_{(t)}^{−1/2} A^{(t)} − δ_k^T Π_{(t)}^{−1/2} A^{(t)}‖_{L²}.
Expanding A^{(t)} in its singular value decomposition (10) yields
D^{(t)}(x_j, x_k) = ‖δ_j^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} V_{(t)}^T − δ_k^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} V_{(t)}^T‖_{L²}.
Now, since the transformation V_{(t)} is orthogonal,
D^{(t)}(x_j, x_k) = ‖δ_j^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} − δ_k^T Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)}‖_{L²},
and substituting Ψ_{(t)} = Π_{(t)}^{−1/2} U_{(t)} Σ_{(t)} into this equation yields the result. □

Remark.
Several remarks regarding time coupled diffusion maps are in order.

(1) If t = 1, the time coupled diffusion map (12) is equivalent to the definition of the standard density-invariant diffusion map in [4]; see the calculations in Appendix A, which are motivated by similar calculations in [2].

(2) The first coordinate of the embedding (12) is always constant because (π^{(t)})^{1/2} is the top left singular vector of A^{(t)}, and therefore this coordinate of the embedding can be discarded.

(3) In order to produce an embedding of X into R^l for some l > 0, we can compute a rank l + 1 singular value decomposition in equation (10) and map x_j ↦ δ_j^T Ψ_{(t),l}, where Ψ_{(t),l} denotes the matrix consisting of columns 2 through l + 1 of the matrix Ψ_{(t)}. Since the singular values of A^{(t)} are of the form σ_0 > σ_1 ≥ · · · ≥ σ_{n−1}, the truncated embedding will preserve the diffusion distance up to an error on the order of σ_{l+1}. As in other diffusion based methods, in the case where data lies on a low dimensional manifold, we expect the first few coordinates of the embedding to provide a meaningful summary of the data.
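The construction of Sections 2.2–2.4 can be summarized in a short dense NumPy sketch. This is our own illustration (the function names are hypothetical, and a practical implementation would use sparse truncated kernels for large n): it builds each P_{ε,i} via equations (2)–(5), forms the product (6), conjugates as in (9), and assembles the embedding (11).

```python
import numpy as np

def diffusion_operator(X_i, eps):
    """Density-invariant diffusion operator P_{eps,i}, equations (2)-(5)."""
    # (2) Gaussian kernel on pairwise squared Euclidean distances.
    sq = np.sum((X_i[:, None, :] - X_i[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)
    # (3) density estimate by summing over the second variable.
    q = K.sum(axis=1)
    # (4) density-invariant normalization.
    K_tilde = K / np.outer(q, q)
    # (5) row-normalization to a Markov (row-stochastic) matrix.
    return K_tilde / K_tilde.sum(axis=1, keepdims=True)

def time_coupled_embedding(X_seq, eps):
    """Product (6), operator (9), SVD (10), and embedding (11)."""
    n = X_seq[0].shape[0]
    # (6): P^(t) = P_{eps,t} ... P_{eps,1}, applying P_{eps,1} first.
    P = np.eye(n)
    for X_i in X_seq:
        P = diffusion_operator(X_i, eps) @ P
    # stationary distribution pi^(t): left Perron eigenvector of P^(t).
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    # (9): A^(t) = Pi^{1/2} P^(t) Pi^{-1/2}.
    A = (np.sqrt(pi)[:, None] * P) / np.sqrt(pi)[None, :]
    # (10)-(11): SVD of A^(t) and embedding Psi = Pi^{-1/2} U Sigma.
    U, S, _ = np.linalg.svd(A)
    Psi = (U * S) / np.sqrt(pi)[:, None]
    return P, pi, Psi
```

By Lemma 2.1, the time coupled diffusion distance (7) computed from the rows of P^{(t)} agrees with the Euclidean distance between the corresponding rows of Ψ_{(t)}. For large n one would instead keep each factor sparse, apply P^{(t)} factor-by-factor as a linear operator, and compute a truncated SVD iteratively, as noted in the computational remark before Section 3.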
2.5. Comparison to standard diffusion maps.
It is instructive at this point to make a comparison with the original diffusion maps of Coifman and Lafon [4]. In that setting, one is given samples X of an isometric embedding of a manifold M with a static metric g. The density-invariant version of diffusion maps uses the same construction as outlined in equations (2), (3), (4), and (5), which yields a Markov matrix P, which is not indexed by a time i since the metric is static. Running this homogeneous Markov chain forward t steps is equivalent to composing P with itself t times, i.e., P^t. Since P is, by construction, similar to a symmetric matrix, it can be decomposed into a diagonal matrix of real eigenvalues Λ and a basis of right eigenvectors Φ. Since the left eigenvectors of P are orthogonal in L²(1/π), the diffusion map
(14) x_j ↦ δ_j^T Φ Λ^t
preserves the diffusion distance D^t(x_j, x_k) = ‖δ_j^T P^t − δ_k^T P^t‖_{L²(1/π)}, where π is the stationary distribution of P. A main result of [4] is that as n → ∞ and ε → 0, we have P_ε^{t/ε} → e^{tΔ}, where e^{tΔ} is the Neumann heat kernel on M. Since the eigenfunctions and eigenvalues of e^{tΔ} give a complete geometric description of the manifold M [20], the diffusion maps embedding (14) learns the geometry of the manifold from the samples X.

In the time-dependent case, we consider a sequence of data (X_1, ..., X_t) and construct a corresponding family of operators (P_1, ..., P_t). Rather than raising a matrix to a power, we compose the family of operators, obtaining P^{(t)} defined in equation (6). After defining a distance D^{(t)} based on P^{(t)}, we seek a distance preserving embedding. However, since the product of symmetric matrices is not in general symmetric, there is no reason we should suspect P^{(t)} to have a basis of real eigenvalues and eigenvectors. Therefore, we construct a diffusion maps framework based on the Markov operator using the singular value decomposition of the operator A^{(t)}, as described in equations (9), (10), (11), and (12). We expect the operator P^{(t)} encodes some average sense of affinity across the data, and by extension that the time coupled diffusion map is similarly meaningful. In the following, we make the connection of P^{(t)} to the data precise by showing that P^{(t)} approximates the heat kernel on the underlying manifold (M, g(·)), which will be defined in Section 3. Thus, we conjecture that the time coupled diffusion map aggregates important geometrical information of the manifold (M, g(·)) over the time interval [0, T]. Numerical results in Section 4 lend credence to this conjecture. There is, however, no theoretical result that directly links geometrical information of (M, g(·)) over arbitrarily long time scales with its heat kernel. The closest results are contained in [21, 22], in which the heat kernel of (M, g(·)) is used to embed the manifold into a single Hilbert space, so that one can observe the flow of M over time.
This result, however, only holds for short time scales.

Before describing the connection of P^{(t)} to heat flow, we make a brief computational note. In practice, it is common to construct each diffusion matrix P_i defined in equation (5) as a sparse matrix by, for example, truncating values which fall below a certain threshold. However, if the product
P^{(t)} = P_t P_{t−1} · · · P_2 P_1
is computed explicitly, there is no reason to expect sparsity will be maintained. Moreover, since each P_i is a diffusion operator on an assumed underlying k-dimensional manifold M, if each P_i initially has l nonzero entries in each row, the number of nonzero entries in each row of the product P^{(t)} could be on the order of (l · t)^k. Therefore, rather than explicitly constructing P^{(t)}, we can treat P^{(t)} as an operator that can be applied to vectors in order n · l · t operations. Using this approach, the singular value decomposition in equation (10) can be computed by iterative methods such as Lanczos iteration or subspace iteration [23]. When n is large relative to l · t, this approach may provide significant computational savings and reduce the memory requirements of the computation.

3. The heat kernel for (M, g(·))

We begin by stating the definition and properties of the heat kernel for a manifold with a time-dependent metric. Let g(τ), τ ∈ [0, T], be a smooth family of metrics on a manifold M. The heat equation for such a manifold is:
(15) ∂u/∂t = Δ_{g(t)} u,
where u : M × [0, T] → R.
We say that Z : (M × [0, T]) × (M × [0, T]) → R is the heat kernel for ∂/∂t − Δ_{g(t)} if the following three conditions hold:
(H1) Z(x, τ; y, σ) is C² in the spatial variables x, y and C¹ in the temporal variables τ, σ,
(H2) (∂/∂τ − Δ_{g(τ)}) Z(·, ·; y, σ) = 0, and
(H3) lim_{τ ↘ σ} Z(·, τ; y, σ) = δ_y.
Several researchers have studied the heat equation (15) and the corresponding heat kernel for a manifold (M, g(·)), in large part due to its relationship with the Ricci flow [24, 25], but also for data analysis [21]. The following properties of the heat kernel have been established (see [24]). First, the heat kernel exists and is the unique positive function satisfying (H1), (H2), and (H3). Additionally, the solution to the initial value problem
(16) ∂u/∂t = Δ_{g(t)} u,   u(x, 0) = f(x),
has the integral representation [24, Corollary 2.2]:
u(x, τ) = ∫_M Z(x, τ; y, 0) f(y) dV(y, 0),
where V(τ) is the Riemannian volume of M at time τ. The heat kernel Z(x, τ; y, σ) also possesses properties analogous to those of the standard heat kernel, such as the semigroup property,
(17) Z(x, τ; y, σ) = ∫_M Z(x, τ; ξ, ν) Z(ξ, ν; y, σ) dV(ξ, ν),   for all x, y ∈ M, ν ∈ (σ, τ).
Furthermore, its “rows” sum to one, precisely stated as
(18) ∫_M Z(x, τ; y, σ) dV(y, σ) = 1,   for all x ∈ M, σ < τ ∈ [0, T].
In the case of a static metric, the associated integral transform of the Neumann heat kernel is known to have the asymptotic approximation [4]:
(19) e^{tΔ} = lim_{ε→0} (I + εΔ)^{t/ε} = (I + εΔ)^{t/ε} + O(ε).
For t ≤ T, we define the associated integral transform of Z as the operator T_Z^{(t)} : L²(M) → L²(M),
(20) T_Z^{(t)} f(x) = ∫_M Z(x, t; y, 0) f(y) dV(y, 0).
We want to express T_Z^{(t)} in a form analogous to (19); we would expect it to resemble e^{∫_0^t Δ_{g(τ)} dτ}. However, as the family Δ_{g(τ)} is not in general commutative with respect to composition, it is necessary to consider the ordered exponential of Δ_{g(τ)} over [0, t], which is the equivalent of the exponential for non-commutative operators, and which is denoted T e^{∫_0^t Δ_{g(τ)} dτ}. The ordered exponential can be expressed in terms of the power series:
T e^{∫_0^t Δ_{g(τ)} dτ} = I + ∫_0^t Δ_{g(τ_1)} dτ_1 + ∫_0^t Δ_{g(τ_1)} ∫_0^{τ_1} Δ_{g(τ_2)} dτ_2 dτ_1 + · · ·
= I + Σ_{i=1}^∞ ∫_0^t Δ_{g(τ_1)} ∫_0^{τ_1} Δ_{g(τ_2)} · · · ∫_0^{τ_{i−1}} Δ_{g(τ_i)} dτ_i dτ_{i−1} · · · dτ_2 dτ_1,
where I denotes the identity in the appropriate space. The operator T can be thought of as enforcing time order in the products of the standard exponential expansion, in a sense which is equivalent to the power series definition, cf. [26]. Considering T e^{∫_0^t Δ_{g(τ)} dτ} as a formal power series, it is easy to see that the heat equation is satisfied. Let {φ_l}_{l≥0} be the eigenfunctions of Δ_{g(0)}, ordered in terms of increasing eigenvalue, and define E_K := Span{φ_l : 0 ≤ l ≤ K}. Then Δ_{g(0)} is bounded on E_K, and furthermore, as g(τ) is a smooth family of metrics for τ ∈ [0, T], we can conclude Δ_{g(τ)} is uniformly bounded on E_K for all τ ∈ [0, T] (see Appendix B). Hence on E_K the power series for T e^{∫_0^t Δ_{g(τ)} dτ} converges, and by uniqueness we conclude T e^{∫_0^t Δ_{g(τ)} dτ} is the heat kernel on E_K. Henceforth we will use
(21) T_Z^{(t)} = T e^{∫_0^t Δ_{g(τ)} dτ}
as the notation for the Neumann heat kernel. We remark that E_K could also be defined independent of the differential structure of the manifold as K-band-limited functions on M, cf. [27]. We prefer the former definition as it provides consistency with the definitions in [4].

Recall that n is the number of spatial samples of the manifold M, and that m is the number of temporal measurements on the time interval [0, T]. We assume that the time interval [0, T] is divided into m intervals [τ_{i−1}, τ_i), each of length ε, where τ_0 = 0 and τ_m = T. For simplicity, we assume that our m measurements are taken at (τ_1, ..., τ_m). Our main result is that in the limit of large data, both spatially and temporally, the transition operator P_ε^{(⌈t/ε⌉)} converges to the heat kernel:
P_ε^{(⌈t/ε⌉)} → T e^{∫_0^t Δ_{g(τ)} dτ}   as n → ∞ and ε → 0.
More precisely:
Theorem 3.1.
Suppose the isometric embedding M_τ ⊂ R^d of a time-dependent manifold (M, g(τ)) is measured at a common set X = {x_j}_{j=1}^n ⊂ M of n points at ε spaced units of time over a time interval [0, T], so that, in particular, we have time samples (τ_i)_{i=1}^m ⊂ [0, T] with τ_i = i · ε and m = T/ε. Then, for any sufficiently smooth function f : M → R and t ≤ T, the heat kernel T e^{∫_0^t Δ_{g(τ)} dτ} can be approximated by the operator P_ε^{(⌈t/ε⌉)}:
(22) P_ε^{(⌈t/ε⌉)} f(x_j) = T e^{∫_0^t Δ_{g(τ)} dτ} f(x_j) + O(n^{−1/2} ε^{−d/4−1/2}, ε),   x_j ∈ X.

We prove Theorem 3.1 in the following. Recall that E_K is defined to be the span of the first K + 1 eigenfunctions of the Laplace-Beltrami operator Δ_{g(0)}.

Lemma 3.2. On E_K, the heat kernel T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} admits the following asymptotic expansion:
(23) T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ε · Δ_{g(σ)} + O(ε²).

Proof.
By definition,
T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ∫_σ^{σ+ε} Δ_{g(τ_1)} dτ_1 + ∫_σ^{σ+ε} Δ_{g(τ_1)} ∫_σ^{τ_1} Δ_{g(τ_2)} dτ_2 dτ_1 + ∫_σ^{σ+ε} Δ_{g(τ_1)} ∫_σ^{τ_1} Δ_{g(τ_2)} ∫_σ^{τ_2} Δ_{g(τ_3)} dτ_3 dτ_2 dτ_1 + · · ·
The tail of this series, starting with the third term, is O(ε²). For the second term, we can separate the contribution of Δ_{g(σ)} from the integral, yielding:
T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ε · Δ_{g(σ)} + ∫_σ^{σ+ε} (Δ_{g(τ)} − Δ_{g(σ)}) dτ + O(ε²).
Now we can bound the integral by using the smoothness of Δ_{g(τ)}, which results from the smoothness of g(τ). By the smoothness of Δ_{g(τ)} with respect to time and the fact that it is uniformly bounded on E_K for all τ ∈ [0, T],
‖Δ_{g(τ)} − Δ_{g(τ′)}‖ ≤ C |τ − τ′|,   C ∈ R_+.
Therefore,
‖∫_σ^{σ+ε} (Δ_{g(τ)} − Δ_{g(σ)}) dτ‖ ≤ ε · sup_{τ ∈ [σ, σ+ε]} ‖Δ_{g(τ)} − Δ_{g(σ)}‖ ≤ Cε²,
and hence we have:
T e^{∫_σ^{σ+ε} Δ_{g(τ)} dτ} = I + ε · Δ_{g(σ)} + O(ε²),
as desired. □
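The short time expansion (23) can be observed numerically on a small stand-in for Δ_{g(τ)}. In the sketch below (our own illustration; the family generator(tau) is a hypothetical 3 × 3 graph-Laplacian-like surrogate for Δ_{g(τ)} restricted to a finite-dimensional subspace), the deviation of I + ε·Δ_{g(σ)} from a fine-step approximation of the ordered exponential over [σ, σ + ε] decays quadratically, so halving ε roughly quarters the error.

```python
import numpy as np

def generator(tau):
    """Hypothetical smooth family of symmetric generators standing in for
    Delta_{g(tau)} on a finite-dimensional subspace E_K."""
    w = 1.0 + 0.5 * np.sin(tau)          # smoothly varying edge weight
    B = w * np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
    return B - np.diag(B.sum(axis=1))    # graph-Laplacian sign convention

def ordered_exp(sigma, eps, steps=400):
    """Fine-step product approximation of the ordered exponential
    over the interval [sigma, sigma + eps]."""
    h = eps / steps
    U = np.eye(3)
    for k in range(steps):
        U = (np.eye(3) + h * generator(sigma + (k + 1) * h)) @ U
    return U

def expansion_error(sigma, eps):
    """Deviation of the first-order expansion (23) from the ordered exponential."""
    first_order = np.eye(3) + eps * generator(sigma)
    return np.linalg.norm(ordered_exp(sigma, eps) - first_order)

err1 = expansion_error(sigma=0.3, eps=0.2)
err2 = expansion_error(sigma=0.3, eps=0.1)   # roughly err1 / 4, consistent with O(eps^2)
```

The same experiment, iterated over consecutive intervals, illustrates the long-time product expansion developed next.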
By using the short time expansion (23), we can develop a long time asymptotic expansion of the heat kernel. In the following, we write Δ_i = Δ_{g(τ_i)} to simplify notation.

Lemma 3.3.
Let (τ_1, τ_2, ..., τ_m) be a uniform partition of [0, T] with step size ε > 0, so that τ_i = ε · i and τ_m = m · ε = T. For 0 < t ≤ T, define ℓ := min{i ∈ {1, ..., m} : τ_i ≥ t}. Then on E_K the heat kernel T e^{∫_0^t Δ_{g(τ)} dτ} has the expansion
(24) T e^{∫_0^t Δ_{g(τ)} dτ} = (I + εΔ_ℓ)(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(ε).

Proof.
First suppose t = τ_ℓ. Using the semigroup property (17) of the heat kernel, we have:
(25) T e^{∫_0^t Δ_{g(τ)} dτ} = T e^{∫_{τ_{ℓ−1}}^{τ_ℓ} Δ_{g(τ)} dτ} · T e^{∫_{τ_{ℓ−2}}^{τ_{ℓ−1}} Δ_{g(τ)} dτ} · · · T e^{∫_{τ_0}^{τ_1} Δ_{g(τ)} dτ}.
Combining the expansion (25) and Lemma 3.2 yields the following:
T e^{∫_0^t Δ_{g(τ)} dτ} = (I + εΔ_ℓ + O(ε²))(I + εΔ_{ℓ−1} + O(ε²)) · · · (I + εΔ_1 + O(ε²)) = (I + εΔ_ℓ)(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + Ξ.
We now analyze the error term Ξ. First note that for small enough ε we have:
‖I + εΔ_i‖ ≤ 1,   for all i = 1, ..., m.
Furthermore, since [0, T] is fixed, the number of temporal measurements m is m = T/ε = O(1/ε). Putting together these two facts, one obtains:
Ξ = Σ_{k=1}^{ℓ} (ℓ choose k) O(ε^{2k}) ≤ Σ_{k=1}^{ℓ} (ℓ^k / k!) O(ε^{2k}) = Σ_{k=1}^{ℓ} (1/k!) O(ε^k) = O(ε),
which proves (24) for t = τ_ℓ.
Now let τ_ℓ > t. The previous argument shows:
T e^{∫_0^{τ_ℓ} Δ_{g(τ)} dτ} = (I + εΔ_{g(τ_ℓ)})(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(ε),
T e^{∫_0^t Δ_{g(τ)} dτ} = (I + εΔ_{g(t)})(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(ε).
Therefore the difference between the two operators is:
‖T e^{∫_0^{τ_ℓ} Δ_{g(τ)} dτ} − T e^{∫_0^t Δ_{g(τ)} dτ}‖ ≤ ε ‖Δ_{g(τ_ℓ)} − Δ_{g(t)}‖ + O(ε) ≤ Cε² + O(ε) = O(ε),
and so (24) holds for any 0 < t ≤ T. □

Proof of Theorem 3.1.
Recall that each of our diffusion operators P_{ε,i}, for fixed i, is defined the same as in the standard diffusion maps framework. Hence, by adapting the convergence result from Singer [28], we have that on E_K:
P_{ε,i} = I + εΔ_i + O(n^{−1/2} ε^{−d/4+1/2}, ε²).
Let ℓ = ⌈t/ε⌉ and recall our ℓ-step transition operator P_ε^{(ℓ)} is defined as
P_ε^{(ℓ)} := P_{ε,ℓ} P_{ε,ℓ−1} · · · P_{ε,1}.
The proof of Lemma 3.3 demonstrates that the error of the product increases by a factor of ε^{−1}, hence:
P_ε^{(ℓ)} = (I + εΔ_ℓ)(I + εΔ_{ℓ−1}) · · · (I + εΔ_1) + O(n^{−1/2} ε^{−d/4−1/2}, ε).
Therefore, using Lemma 3.3 we conclude that
P_ε^{(⌈t/ε⌉)} = T e^{∫_0^t Δ_{g(τ)} dτ} + O(n^{−1/2} ε^{−d/4−1/2}, ε)
on the set E_K, where K is fixed but arbitrary. The result follows by taking the union and closure of ∪_{K>0} E_K, as in [4]. □

Remark.
The connection to heat diffusion can also be considered in terms of infinitesimal generators. For the static case, the product P_ε^{t/ε} approaches a continuous-time Markov operator P^t, which can be represented in terms of a generator G as P^t = e^{tG}, where G = Δ. In the above, we have shown that the ⌈t/ε⌉-step transition operator for our inhomogeneous Markov chain, P_ε^{(⌈t/ε⌉)}, approaches the continuous process T e^{∫_0^t Δ_{g(τ)} dτ}, whose time-dependent infinitesimal generator at time t is Δ_{g(t)}.

4. Numerical Example
Suppose that $D$ consists of $n = 10{,}000$ points sampled uniformly at random from the unit disc in $\mathbb{R}^2$. Define the maps $h, v : \mathbb{R}^2 \to \mathbb{R}^2$ by

$h(x, y) = \big(x,\ y \cdot (1 - \cos \pi x)\big)$ and $v(x, y) = \big(x \cdot (1 - \cos \pi y),\ y\big).$

We refer to the images $h(D)$ and $v(D)$ as horizontal and vertical barbells, respectively. In the following, we define a deformation of the $n = 10{,}000$ points from a horizontal to a vertical barbell over $m = 9$ times. We define $X_1 = h(D)$, $X_5 = D$, and $X_9 = v(D)$. The intermediate sets $X_2, X_3, X_4$ and $X_6, X_7, X_8$ are defined by linear interpolation. Specifically, if $X_i = \{x_j^{(i)}\}_{j=1}^n$, then

$x_j^{(i)} = \big(x_j^{(5)} - x_j^{(1)}\big) \cdot \frac{i-1}{4} + x_j^{(1)}$ and $x_j^{(i)} = \big(x_j^{(9)} - x_j^{(5)}\big) \cdot \frac{i-5}{4} + x_j^{(5)},$

for $i = 2, 3, 4$ and $i = 6, 7, 8$, respectively. In the first row of Figure 2 the data $X_i$ is plotted for $i = 1, 5, 9$. The sequence of data sets $(X_1, \ldots, X_9)$ can be stored as a $10{,}000 \times 2 \times 9$ data tensor, i.e., $n = 10{,}000$ points in $\mathbb{R}^2$ measured over the 9 times. Moreover, by fixing the first coordinate of this tensor, we obtain for each point a $2 \times 9$ matrix, i.e., a trajectory in $\mathbb{R}^2$ over the 9 times. Note how, as discussed in Section 2.1, even though points were sampled uniformly at time $i = 5$, the deformation of the data manifold creates a nonuniform distribution of points at the other times. In the following, we compare three methods of embedding the data.

First, as a baseline, we compute the diffusion map for each fixed data set $X_1$, $X_5$, and $X_9$ and plot the first three embedding coordinates $\varphi_1, \varphi_2, \varphi_3$ in the second row of Figure 2. Second, we define a concatenated data diffusion map based on the concatenation of the data up to time $i$. That is, at time $i$ we consider the data

$X^{(i)} := \big\{\big(x_j^{(1)}, \ldots, x_j^{(i)}\big)\big\}_{j=1}^n \subseteq \mathbb{R}^{2i},$

and construct a diffusion map via equations (2), (3), (4), and (5), where the kernel $K_{\varepsilon,i}(x_j, x_k)$ in equation (2) is replaced by the Gaussian kernel $K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k)$ defined by

$K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k) := \exp\left(-\frac{\big\|\big(x_j^{(1)}, \ldots, x_j^{(i)}\big) - \big(x_k^{(1)}, \ldots, x_k^{(i)}\big)\big\|^2_{\ell^2(\mathbb{R}^{2i})}}{\varepsilon}\right).$

The first three coordinates $\zeta_1, \zeta_2, \zeta_3$ of the concatenated data diffusion map are plotted in the third row of Figure 2. Finally, for $\{X_i\}_{i=1}^1$, $\{X_i\}_{i=1}^5$, and $\{X_i\}_{i=1}^9$ we computed the time coupled diffusion map and plot the first three coordinates $\psi_1, \psi_2, \psi_3$ in the fourth row of Figure 2. We would like to draw the reader's attention to the difference between the concatenated data diffusion map and the time coupled diffusion map at time $i = 9$. To facilitate visualization, we have plotted the first two coordinates of these embeddings in Figure 3.

Observe that each point $x_j$ in the data $\{x_j^{(i)}\}_{j=1}^n$, $i = 1, \ldots, 9$, can be classified by the side (left or right) on which the point resides in the horizontal barbell at time $i = 1$, and the side (up or down) on which the point resides in the vertical barbell at time $i = 9$.
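As a concrete sketch, the barbell construction and the per-time diffusion operators described above can be implemented as follows. This is an illustrative re-implementation, not the authors' code: the sample size is reduced, the bandwidth `eps` is an arbitrary choice, and any density normalization appearing in equations (2)–(5) is omitted, so only the plain row-normalized Gaussian kernel is used.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # reduced from the paper's n = 10,000 to keep the sketch fast

# sample n points uniformly from the unit disc (rejection sampling)
pts = rng.uniform(-1, 1, size=(4 * n, 2))
D = pts[np.sum(pts ** 2, axis=1) <= 1][:n]

def h(q):  # horizontal barbell: pinch the disc along the line x = 0
    return np.column_stack([q[:, 0], q[:, 1] * (1 - np.cos(np.pi * q[:, 0]))])

def v(q):  # vertical barbell: pinch the disc along the line y = 0
    return np.column_stack([q[:, 0] * (1 - np.cos(np.pi * q[:, 1])), q[:, 1]])

# m = 9 snapshots: X_1 = h(D), X_5 = D, X_9 = v(D), linear interpolation between
X = {1: h(D), 5: D, 9: v(D)}
for i in (2, 3, 4):
    X[i] = X[1] + (X[5] - X[1]) * (i - 1) / 4
for i in (6, 7, 8):
    X[i] = X[5] + (X[9] - X[5]) * (i - 5) / 4

def markov(Y, eps):
    """Row-normalized Gaussian kernel on one snapshot (the per-time operator)."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)
    return K / K.sum(axis=1, keepdims=True)

# time coupled operator: product of the per-time Markov matrices, latest first
eps = 0.05
P = np.eye(n)
for i in range(1, 10):
    P = markov(X[i], eps) @ P

# four ground-truth classes: left/right at time 1 crossed with up/down at time 9
labels = 2 * (D[:, 0] > 0) + (X[9][:, 1] > 0)
```

Since each factor is row-stochastic, the product `P` is again a Markov matrix; its singular value decomposition (after the normalization discussed in the paper) yields the time coupled diffusion map coordinates.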
Figure 2. In the first row, the data sets $X_i$ are plotted for times $i = 1, 5, 9$; in the second row, the first three coordinates of the diffusion map of the fixed sets $X_1$, $X_5$, and $X_9$ are plotted; in the third row, the first three coordinates of the concatenated data diffusion map (as described in Section 4) for the sequences of data sets $\{X_i\}_{i=1}^1$, $\{X_i\}_{i=1}^5$, and $\{X_i\}_{i=1}^9$ are plotted; finally, in the fourth row, the time coupled diffusion maps for the sequences of data sets $\{X_i\}_{i=1}^1$, $\{X_i\}_{i=1}^5$, and $\{X_i\}_{i=1}^9$ are plotted. A consistent color map is used across all plots.

Each of the four lines in the time coupled diffusion map embedding corresponds to one of these four classes. On the other hand, we interpret the concatenated data diffusion map by observing that the kernel $K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k)$ can also be written

$K^{\mathrm{concat}}_{\varepsilon,i}(x_j, x_k) = \exp\left(-\frac{\sum_{l=1}^i \|x_j^{(l)} - x_k^{(l)}\|^2_{\ell^2(\mathbb{R}^2)}}{\varepsilon}\right) = \prod_{l=1}^i \exp\left(-\frac{\|x_j^{(l)} - x_k^{(l)}\|^2_{\ell^2(\mathbb{R}^2)}}{\varepsilon}\right).$

Thus, concatenating the available data is equivalent to either averaging (up to a constant) the square distances for the available times, or multiplying the Gaussian kernels pointwise for the
available times. The resulting concatenated data diffusion map encodes the fact that on average the points are spread out over the disc; the only indication of the existence of the barbells at times $i = 1$ and $i = 9$ are the corners in the embedding. In contrast, the time coupled diffusion map embedding clearly captures the two barbell events: the four lines in this embedding represent the four classes in the data induced by the barbell events at times $i = 1$ and $i = 9$. Moreover, this numerical example demonstrates that the time coupled diffusion map embedding aggregates information over time. Indeed, the two barbell events are temporally disjoint, and thus no single temporal slice of the data is sufficient to recover the four classes in the data. This temporal disjointness is reflected in the standard diffusion map embeddings in row two of Figure 2. In summary, the numerical results provide an example where averaging square distances over time, or equivalently multiplying Gaussian kernels pointwise over time, is insufficient to capture geometric events in the evolution of the data; instead, by viewing the data as an evolving manifold and approximating heat flow on the manifold via the product of diffusion operators, we are able to define an embedding which provides a concise description of the structures which occur in the evolution of the data.

Figure 3. For time $i = 9$, the first two coordinates of the concatenated data DM (left) and time coupled DM (right) are plotted.

5. Continuous time coupled diffusion map
Theorem 3.1 approximates the operator $T_Z^{(t)} : L^2(\mathcal{M}) \to L^2(\mathcal{M})$, defined in (20), with the discrete inhomogeneous Markov chain $P_\varepsilon^{(\lceil t/\varepsilon \rceil)}$. The inhomogeneous chain $P_\varepsilon^{(\lceil t/\varepsilon \rceil)}$ is the foundation of the time coupled diffusion distance, and after the normalization in (9) the singular value decomposition can be used to construct the corresponding time coupled diffusion map (12). In this section we further investigate the connection between the heat kernel $Z$ and the time coupled diffusion map through the continuous operator $T_Z^{(t)}$.

Recall the diffusion operator $T_Z^{(t)} f(x) = \int_{\mathcal{M}} Z(x,t;y,0) f(y)\,dV(y,0)$ introduced in Section 3. If $f$ is an initial distribution over $\mathcal{M}$, the operator $T_Z^{(t)}$ diffuses $f$ forward in time over the changing manifold geometry, up to time $\tau = t$. The adjoint of this operator is $(T_Z^{(t)})^* : L^2(\mathcal{M}) \to L^2(\mathcal{M})$,

$(T_Z^{(t)})^* f(x) = \int_{\mathcal{M}} Z(y,t;x,0) f(y)\,dV(y,t).$

It follows from (18) that the operator $(T_Z^{(t)})^*$ maps a probability distribution over $(\mathcal{M}, g(t))$ backwards in time to a probability distribution over $(\mathcal{M}, g(0))$. Indeed, if $f$ is a probability density function, then $(T_Z^{(t)})^* f$ is its posterior probability distribution as in Section 2. In the continuous setting, the analysis is performed over the manifold $(\mathcal{M}, g(\cdot))$, and all integration is performed with respect to the Riemannian volume on the manifold, similar to the approach in [22, 21, 20]. In the continuous case, the stationary distribution is constant and the normalization in (9) is therefore unnecessary. Moreover, the operator $T_Z^{(t)}$ is bi-stochastic in the sense that the constant function is an eigenfunction of both $T_Z^{(t)}$ and $(T_Z^{(t)})^*$. Indeed, in Lemma 3.2 we showed that $T_Z^{(t)}$ can be written as the limit of a product, with an increasing number of terms, of self-adjoint operators, which each have the constant eigenfunction. Therefore, the continuous analog of the discrete diffusion distance (7) is the diffusion distance

$D^{(t)}(x,y)^2 = \int_{\mathcal{M}} \big(Z(x,t;w,0) - Z(y,t;w,0)\big)^2\,dV(w,0),$

or equivalently, $D^{(t)}(x,y) = \big\|(T_Z^{(t)})^* \delta_x - (T_Z^{(t)})^* \delta_y\big\|_{L^2(\mathcal{M})}$, where $\delta_x$ is the Dirac distribution centered at $x \in \mathcal{M}$ (compare to (7)). Since the "row" sums of $Z(x,t;y,0)$ are 1 (see (18)), and the manifold $\mathcal{M}$ is compact:

$\int_{\mathcal{M}} \int_{\mathcal{M}} Z(x,t;y,0)^2\,dV(y,0)\,dV(x,t) < \infty.$

Therefore, the operator $T_Z^{(t)} f(x) = \int_{\mathcal{M}} Z(x,t;y,0) f(y)\,dV(y,0)$ is compact; see for example [29]. Similarly, the operator $(T_Z^{(t)})^*$ is compact, and thus $T_Z^{(t)} (T_Z^{(t)})^*$ is compact, since it is the composition of two compact operators. Moreover, since $T_Z^{(t)} (T_Z^{(t)})^*$ is also self-adjoint, by the spectral theorem $T_Z^{(t)} (T_Z^{(t)})^*$ has an eigendecomposition, and similarly, $(T_Z^{(t)})^* T_Z^{(t)}$ has an eigendecomposition. Therefore, analogous to the discrete decomposition (10), $T_Z^{(t)}$ has a singular value decomposition

$T_Z^{(t)} f(x) = \sum_{k \ge 0} \sigma_k^{(t)} \big\langle f, \varphi_k^{(t)} \big\rangle\, \psi_k^{(t)}(x),$

where $\varphi_k^{(t)} \in L^2(\mathcal{M})$ is a right singular function of $T_Z^{(t)}$ (an eigenfunction of $(T_Z^{(t)})^* T_Z^{(t)}$), $\psi_k^{(t)} \in L^2(\mathcal{M})$ is a left singular function of $T_Z^{(t)}$ (an eigenfunction of $T_Z^{(t)} (T_Z^{(t)})^*$), and $\{\sigma_k^{(t)}\}_{k \ge 0}$ are the corresponding singular values (the square roots of the eigenvalues of $(T_Z^{(t)})^* T_Z^{(t)}$, or equivalently of $T_Z^{(t)} (T_Z^{(t)})^*$, since these operators share the same spectrum). However, recall that in the discrete case we must take into account the stationary distribution of the Markov chain using the normalization in (9), since the discrete operator $P_\varepsilon^{(t)}$ is not in general bi-stochastic. In the continuous setting the left singular functions are used to define the time coupled diffusion map (12), written here as:

$\Psi^{(t)}(x) = \big(\sigma_k^{(t)} \psi_k^{(t)}(x)\big)_{k \ge 1},$

where the index $k$ starts at 1 because the 0th left singular function is constant. We remark that the operator $T_Z^{(t)} (T_Z^{(t)})^*$ can be written as an integral operator

(26) $T_Z^{(t)} (T_Z^{(t)})^* f(x) = \int_{\mathcal{M}} K_Z^b(x,y;t) f(y)\,dV(y,t)$, where $K_Z^b(x,y;t) = \int_{\mathcal{M}} Z(x,t;w,0)\, Z(y,t;w,0)\,dV(w,0).$
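In the discrete setting, the relationship between the normalized operator, its SVD, and the diffusion distance can be checked numerically. The following sketch assumes, consistent with Appendix A, that the normalization in (9) amounts to conjugating the Markov matrix by the square root of its stationary distribution; the matrix here is a generic random stand-in, not an operator built from data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

# a discrete stand-in for the diffusion operator: random row-stochastic matrix
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)

# stationary distribution pi: the left Perron eigenvector, pi P = pi
w, V = np.linalg.eig(P.T)
pi = np.abs(np.real(V[:, np.argmax(np.real(w))]))
pi /= pi.sum()

# assumed form of the normalization (9): A = Pi^{1/2} P Pi^{-1/2}
A = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
U, S, Vt = np.linalg.svd(A)

# posterior distance between two states equals the Euclidean distance between
# their singular-value-weighted rows of U; note the 0th component does not drop
# out of the difference here, unlike the continuous bi-stochastic case where
# the 0th singular function is constant
j, k = 0, 1
e = np.eye(n)
d_posterior = np.linalg.norm(A.T @ (e[j] - e[k]))
d_embed = np.linalg.norm(S * (U[j] - U[k]))
print(abs(d_posterior - d_embed))  # agrees up to round-off
```

The agreement is exact SVD algebra, since $A^\top (e_j - e_k) = V \Sigma U^\top (e_j - e_k)$ and $V$ is orthogonal; the top singular value of the conjugated matrix is 1, matching Lemma A.1.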
We refer to $K_Z^b(x,y;t)$ as the backwards kernel since it is integrated against functions on $(\mathcal{M}, g(t))$, i.e., integrated against functions with respect to the Riemannian volume at the end of the time interval $[0, t]$. Similarly, the operator $(T_Z^{(t)})^* T_Z^{(t)}$ has the integral representation

$(T_Z^{(t)})^* T_Z^{(t)} f(x) = \int_{\mathcal{M}} K_Z^f(x,y;t) f(y)\,dV(y,0)$, where $K_Z^f(x,y;t) = \int_{\mathcal{M}} Z(w,t;x,0)\, Z(w,t;y,0)\,dV(w,t).$

We refer to $K_Z^f$ as the forward kernel since it is integrated against functions on $(\mathcal{M}, g(0))$. The kernels $K_Z^b$ and $K_Z^f$ can be interpreted as analogous to the matrices $A^{(t)} (A^{(t)})^\top$ and $(A^{(t)})^\top A^{(t)}$ from the discrete case, whose eigenvectors are the left and right singular vectors of $A^{(t)}$, respectively. Since the time coupled diffusion map is based on the SVD of $T_Z^{(t)}$, diffusing $\varphi_k^{(t)}$ forward in time results in $\sigma_k^{(t)} \psi_k^{(t)}$, and conversely the backward propagation of $\psi_k^{(t)}$ leads to $\sigma_k^{(t)} \varphi_k^{(t)}$; specifically,

$T_Z^{(t)} \varphi_k^{(t)} = \sigma_k^{(t)} \psi_k^{(t)}$ and $(T_Z^{(t)})^* \psi_k^{(t)} = \sigma_k^{(t)} \varphi_k^{(t)}.$

If we define $\widetilde{\Phi}^{(t)}(x) = \big(\varphi_k^{(t)}(x)\big)_{k \ge 1}$ and set $T_Z^{(t)} \widetilde{\Phi}^{(t)}(x) = \big(T_Z^{(t)} \varphi_k^{(t)}(x)\big)_{k \ge 1}$, then this observation leads to:

$D^{(t)}(x,y) = \big\|(T_Z^{(t)})^* \delta_x - (T_Z^{(t)})^* \delta_y\big\|_{L^2(\mathcal{M})} = \big\|\Psi^{(t)}(x) - \Psi^{(t)}(y)\big\|_{\ell^2} = \big\|T_Z^{(t)} \widetilde{\Phi}^{(t)}(x) - T_Z^{(t)} \widetilde{\Phi}^{(t)}(y)\big\|_{\ell^2}.$

We see that the difference in posterior distributions of $\delta_x$ and $\delta_y$ is equivalent to the difference between the diffused distributions of $\widetilde{\Phi}^{(t)}(x)$ and $\widetilde{\Phi}^{(t)}(y)$; these equivalent distances are, in turn, equal to the Euclidean distance between points under the time coupled diffusion map $\Psi^{(t)}$.

6. Conclusion
We have introduced the notion of time coupled diffusion maps as a method of summarizing evolving data via an embedding into Euclidean space. In particular, we have introduced a method of modeling evolving data as samples from a manifold with a time-dependent metric. We show that the constructed time inhomogeneous Markov chain approximates heat diffusion over a manifold $(\mathcal{M}, g(\cdot))$ with a smoothly varying family of metrics $g(\tau)$. In the context of manifold learning, these operators and resulting embeddings are related to the heat kernel of $\partial_t u = \Delta_{g(t)} u$, which provides geometric and probabilistic interpretations. Numerical experiments indicate that the map encodes aggregate geometric information over arbitrarily long time scales, and thus summarizes the geometry of a sequence of datasets in a useful way.

These results open new directions related to diffusion based manifold learning, and in particular the use of inhomogeneous Markov chains and asymmetric diffusion semigroups to understand data geometry. Early results on single cell data [11] show the usefulness of this type of diffusion process in biology, but further numerical and theoretical study is needed. Understanding precisely the nature of the geometric information encoded by the time coupled diffusion map (extending [21, 22]) would give theoretical insight and could lead to further developments. More immediately, the results contained here provide a theoretical foundation for understanding dynamic data that can be modeled as a time varying manifold.

7. Acknowledgements
N.M. was a participant in the 2013 Research Experience for Undergraduates (REU) at Cornell University under the supervision of M.H. During the REU program both were supported by the National Science Foundation grant number NSF-1156350. This paper is the result of work started during the REU. M.H. was supported by the European Research Council (ERC) grant InvariantClass 320959 while writing the first version of the paper. He is currently supported by the Alfred P. Sloan Fellowship (grant number FG-2016-6607), the DARPA Young Faculty Award (grant number D16AP00117), and the NSF (grant number 1620216). Both authors would like to thank Ronald Coifman for numerous insightful conversations, and the reviewers for their comments which improved the manuscript.
References

[1] M. Belkin, Problems of learning on manifolds, Ph.D. thesis, University of Chicago (2003).
[2] S. Lafon, Diffusion maps and geometric harmonics, Ph.D. thesis, Yale University (2004).
[3] M. Belkin, P. Niyogi, Towards a theoretical foundation for Laplacian-based manifold methods, Journal of Computer and System Sciences 74 (8) (2008) 1289–1308.
[4] R. R. Coifman, S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (1) (2006) 5–30.
[5] A. Singer, From graph to manifold Laplacian: The convergence rate, Applied and Computational Harmonic Analysis 21 (1) (2006) 128–134.
[6] A. Singer, H.-T. Wu, Vector diffusion maps and the connection Laplacian, Communications on Pure and Applied Mathematics 65 (8) (2012) 1067–1144.
[7] G. Wolf, A. Averbuch, Linear-projection diffusion on smooth Euclidean submanifolds, Applied and Computational Harmonic Analysis 34 (1) (2013) 1–14.
[8] T. Berry, T. Sauer, Local kernels and the geometric structure of data, Applied and Computational Harmonic Analysis (2015), in press.
[9] R. Talmon, I. Cohen, S. Gannot, R. R. Coifman, Diffusion maps for signal processing: A deeper look at manifold-learning techniques based on kernels and graphs, IEEE Signal Processing Magazine 30 (4) (2013) 75–86. doi:10.1109/MSP.2013.2250353.
[10] A. D. Szlam, M. Maggioni, R. R. Coifman, Regularization on graphs with function-adapted diffusion processes, Journal of Machine Learning Research 9 (2008) 1711–1739. URL http://jmlr.org/papers/volume9/szlam08a/szlam08a.pdf
[11] T. Welp, G. Wolf, M. Hirn, S. Krishnaswamy, A diffusion-based condensation process for multiscale analysis of single cell data, in: ICML Workshop on Computational Biology, New York, NY, 2016, pp. 1–5.
[12] B. Wang, J. Jiang, W. Wang, Z.-H. Zhou, Z. Tu, Unsupervised metric fusion by cross diffusion, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 2997–3004. doi:10.1109/CVPR.2012.6248029.
[13] R. R. Coifman, M. J. Hirn, Diffusion maps for changing data, Applied and Computational Harmonic Analysis 36 (1) (2014) 79–107. doi:10.1016/j.acha.2013.03.001.
[14] R. R. Lederman, R. Talmon, Common manifold learning using alternating-diffusion, Tech. rep., Yale (2014).
[15] R. R. Lederman, R. Talmon, H.-T. Wu, Y.-L. Lo, R. R. Coifman, Alternating diffusion for common manifold learning with application to sleep stage assessment, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5758–5762. doi:10.1109/ICASSP.2015.7179075.
[16] O. Lindenbaum, A. Yeredor, M. Salhov, A. Averbuch, Multiview diffusion maps, arXiv:1508.05550 (2015).
[17] R. Talmon, H.-T. Wu, Latent common manifold learning with alternating diffusion: analysis and applications, arXiv:1602.00078.
[18] R. Banisch, P. Koltai, Understanding the geometry of transport: Diffusion maps for Lagrangian trajectory data unravel coherent sets, Chaos 27 (2017) 035804, arXiv:1603.04709.
[19] T. Berry, J. Harlim, Variable bandwidth diffusion kernels, Applied and Computational Harmonic Analysis (2015). doi:10.1016/j.acha.2015.01.001.
[20] P. Bérard, G. Besson, S. Gallot, Embedding Riemannian manifolds by their heat kernel, Geometric and Functional Analysis 4 (4) (1994) 373–398.
[21] H. Abdallah, Processus de diffusion sur un flot de variétés Riemanniennes, Ph.D. thesis, L'Université de Grenoble (2010).
[22] H. Abdallah, Embedding Riemannian manifolds via their eigenfunctions and their heat kernel, Bulletin of the Korean Mathematical Society 49 (5) (2012) 939–947.
[23] H. Rutishauser, Computational aspects of F. L. Bauer's simultaneous iteration method, Numerische Mathematik 13 (1) (1969) 4–13. doi:10.1007/BF02165269. URL http://dx.doi.org/10.1007/BF02165269
[24] C. M. Guenther, The fundamental solution on manifolds with time-dependent metrics, The Journal of Geometric Analysis 12 (3) (2002) 425–436. doi:10.1007/BF02922048.
[25] B. Chow, S.-C. Chu, D. Glickenstein, C. Guenther, J. Isenberg, T. Ivey, D. Knopf, P. Lu, F. Luo, L. Ni, The Ricci Flow: Techniques and Applications, Part III: Geometric-Analytic Aspects, Vol. 163, AMS, 2008.
[26] P.-L. Giscard, K. Lui, S. J. Thwaite, D. Jaksch, An exact formulation of the time-ordered exponential using path-sums, arXiv:1410.6637.
[27] S. Mousazadeh, I. Cohen, Out-of-sample extension of band-limited functions on homogeneous manifolds using diffusion maps, Signal Processing 108 (2015) 521–529. doi:10.1016/j.sigpro.2014.10.024.
[28] A. Singer, From graph to manifold Laplacian: The convergence rate, Applied and Computational Harmonic Analysis 21 (1) (2006) 128–134, Special Issue: Diffusion Maps and Wavelets. doi:10.1016/j.acha.2006.03.004.
[29] M. Pedersen, Functional Analysis in Applied Mathematics and Engineering, Chapman & Hall/CRC Press, 2000.
Appendix A. Diffusion geometry from Markov kernels
We show that the operator $A^{(t)}$, defined in Section 2, line (9), has operator norm one and that $(\pi^{(t)})^{1/2}$, the square root of its stationary distribution, is an eigenvector with eigenvalue one. We do so by following [2] and developing a diffusion geometry framework starting from a Markov kernel.

Suppose that $(X, dx)$ is a measure space equipped with a Markov kernel $p(x,y)$. Moreover, we assume the kernel $p(x,y)$ is nonnegative and has the conservation property

$\int_X p(x,y)\,dy = 1$ for all $x \in X$.

The kernel $p(x,y)$ induces a Markov operator $P^*$ from $L^2(X)$ to itself defined by

$(P^* f)(x) = \int_X p(y,x) f(y)\,dy$ for all $x \in X$.

The notation $P^*$ has been used to bring attention to the fact that $P^*$ is the adjoint of the operator $P$ defined as:

$(P f)(x) = \int_X p(x,y) f(y)\,dy.$

We assume that $P^*$ has a unique strictly positive stationary distribution $v(x)$, that is to say,

$(P^* v)(x) = \int_X p(y,x) v(y)\,dy = v(x)$ for all $x \in X$.

For example, if $X$ is a finite set with the counting measure (as in Section 2), it would suffice to assume that the Markov chain is irreducible and positive recurrent. By $v^{1/2}(x)$ we denote the positive pointwise square root of $v(x)$. We define an operator $A : L^2(X) \to L^2(X)$ by

(27) $(Af)(x) = \int_X \frac{v^{1/2}(x)\,p(x,y)}{v^{1/2}(y)}\, f(y)\,dy = \int_X a(x,y) f(y)\,dy$ for all $x \in X$.

Here the kernel $a(x,y) = v^{1/2}(x)\,p(x,y)/v^{1/2}(y)$. We refer to $A$ as an averaging or diffusion operator and show it has several nice properties.

Lemma A.1.
The operator norm $\|A\| = 1$, and the norm is achieved by the function $v^{1/2}(x)$.

Remark.
Taking $A = A^{(t)}$ establishes the claim from the beginning of the appendix, which was originally stated after line (9) in Section 2.

Proof. First we evaluate $(Av^{1/2})(x)$ to check the second part of the assertion and establish a lower bound on the operator norm:

$(Av^{1/2})(x) = \int_X \frac{v^{1/2}(x)\,p(x,y)}{v^{1/2}(y)}\, v^{1/2}(y)\,dy = v^{1/2}(x) \int_X p(x,y)\,dy = v^{1/2}(x),$

using the Markov property of the kernel. To establish that 1 is an upper bound on the operator norm, we show that $\langle Af, Af \rangle \le \|f\|^2$. First, observe that by Cauchy–Schwarz,

(28) $\left|\int_X \frac{v^{1/2}(x)\,p(x,y)\,f(y)}{v^{1/2}(y)}\,dy\right| \le \left(\int_X v(x)\,p(x,y)\,dy\right)^{1/2} \left(\int_X \frac{p(x,y)\,f(y)^2}{v(y)}\,dy\right)^{1/2} = v^{1/2}(x) \cdot \left(\int_X \frac{p(x,y)\,f(y)^2}{v(y)}\,dy\right)^{1/2}.$

Now, writing $\langle Af, Af \rangle$ as a double integral yields:

$\langle Af, Af \rangle = \int_X \left[\int_X \frac{v^{1/2}(x)\,p(x,y)\,f(y)}{v^{1/2}(y)}\,dy\right] \left[\int_X \frac{v^{1/2}(x)\,p(x,z)\,f(z)}{v^{1/2}(z)}\,dz\right] dx.$

By applying the inequality (28) to each factor, we see that

$\langle Af, Af \rangle \le \int_X \left(\int_X \frac{v(x)\,p(x,y)\,f(y)^2}{v(y)}\,dy\right)^{1/2} \left(\int_X \frac{v(x)\,p(x,z)\,f(z)^2}{v(z)}\,dz\right)^{1/2} dx.$
Applying Cauchy–Schwarz again,

$\langle Af, Af \rangle \le \left(\int_X \int_X \frac{v(x)\,p(x,y)\,f(y)^2}{v(y)}\,dy\,dx\right)^{1/2} \left(\int_X \int_X \frac{v(x)\,p(x,z)\,f(z)^2}{v(z)}\,dz\,dx\right)^{1/2}.$

Finally, using the fact that $v(x)$ is the stationary distribution, so that $\int_X v(x)\,p(x,y)\,dx = v(y)$, shows that:

$\langle Af, Af \rangle \le \left(\int_X f(y)^2\,dy\right)^{1/2} \left(\int_X f(z)^2\,dz\right)^{1/2} = \|f\|^2,$

as was to be shown. $\square$

Appendix B. Uniform boundedness of $\Delta_{g(\tau)}$ on $E_K$

We prove that the family of operators $\{\Delta_{g(\tau)}\}_{0 \le \tau \le T}$ is uniformly bounded on $E_K \subset L^2(\mathcal{M})$, where $E_K = \mathrm{Span}\{\phi_l : 0 \le l \le K\}$, and $\phi_l$ is the $l$-th eigenfunction of $\Delta_{g(0)}$ with eigenvalue $\lambda_l$, ordered so that $0 = \lambda_0 < \lambda_1 \le \cdots \le \lambda_K$.

To simplify notation, set $\Delta_\tau = \Delta_{g(\tau)} : E_K \to L^2(\mathcal{M})$, and consider the function $\alpha(\tau) = \|\Delta_\tau\|$. If $\alpha(\tau)$ is a continuous function in $\tau$, then $\alpha(\tau)$ is uniformly bounded on $[0, T]$, since $\alpha(0) = \lambda_K < \infty$ and $[0, T]$ is compact. It thus remains to show that $\alpha(\tau)$ is a continuous function.

Let $\Delta_\tau^* : L^2(\mathcal{M}) \to E_K$ be the adjoint of $\Delta_\tau$. It suffices to show that $\beta(\tau) = \|\Delta_\tau^* \Delta_\tau\|$ is a continuous function in $\tau$, since $\|\Delta_\tau^* \Delta_\tau\| = \|\Delta_\tau\|^2$. We have $\Delta_\tau^* \Delta_\tau : E_K \to E_K$, and so the operator $\Delta_\tau^* \Delta_\tau$ can be represented by the $(K+1) \times (K+1)$ matrix $M_\tau$, defined through:

$\Delta_\tau^* \Delta_\tau\, \phi_j = \sum_{i=0}^{K} (M_\tau)_{ij}\, \phi_i.$

Recalling that the metric tensor $g(\tau)$ varies smoothly in $\tau$, it follows that the entries of $M_\tau$ are continuous in $\tau$, since in local coordinates:

$\Delta_\tau f = \frac{1}{\sqrt{|g(\tau)|}}\, \partial_i \left(\sqrt{|g(\tau)|}\, g(\tau)^{ij}\, \partial_j f\right),$

where $|g(\tau)|$ is the determinant of $g(\tau)$ in the local chart, $g(\tau)^{ij}$ are the entries of the inverse of the metric tensor, and the Einstein summation convention is used. But then the eigenvalues of $M_\tau$ vary continuously in $\tau$, and in particular the operator norm of $M_\tau$ is a continuous function of $\tau$.
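Lemma A.1 can also be checked numerically for a finite-state chain; a minimal sketch, with a random strictly positive Markov matrix standing in for the kernel $p(x,y)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40

# random Markov kernel: strictly positive, rows sum to 1 (conservation property)
p = rng.random((n, n)) + 0.1
p /= p.sum(axis=1, keepdims=True)

# stationary distribution v of the adjoint: v P = v (left Perron eigenvector)
w, V = np.linalg.eig(p.T)
v = np.abs(np.real(V[:, np.argmax(np.real(w))]))
v /= v.sum()

# conjugated kernel a(x, y) = v(x)^{1/2} p(x, y) / v(y)^{1/2}, as in (27)
A = np.sqrt(v)[:, None] * p / np.sqrt(v)[None, :]

print(np.linalg.norm(A, 2))                         # operator norm: 1 up to round-off
print(np.linalg.norm(A @ np.sqrt(v) - np.sqrt(v)))  # sqrt(v) is fixed by A: ~0
```

The strictly positive entries guarantee an irreducible, aperiodic chain, so the Perron eigenvector is unique and positive, matching the assumptions of the appendix.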
Department of Mathematics, Yale University, New Haven, CT 06511, USA
E-mail address: [email protected]

Department of Computational Mathematics, Science & Engineering and Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
E-mail address: