The Shape of Data and Probability Measures
TThe Shape of Data and Probability Measures
Diego H. D´ıaz Mart´ınez a , Facundo M´emoli b , Washington Mio a a Department of Mathematics, Florida State University, Tallahassee, FL 32306-4510 USA b Department of Mathematics, Ohio State University, Columbus, OH 43210-1174 USA
Abstract
We introduce the notion of multiscale covariance tensor fields (CTF) asso-ciated with Euclidean random variables as a gateway to the shape of theirdistributions. Multiscale CTFs quantify variation of the data about everypoint in the data landscape at all spatial scales, unlike the usual covari-ance tensor that only quantifies global variation about the mean. Empiricalforms of localized covariance previously have been used in data analysis andvisualization, for example, in local principal component analysis, but we de-velop a framework for the systematic treatment of theoretical questions andmathematical analysis of computational models. We prove strong stabilitytheorems with respect to the Wasserstein distance between probability mea-sures, obtain consistency results for estimators, as well as bounds on therate of convergence of empirical CTFs. These results show that CTFs arerobust to sampling, noise and outliers. We provide numerous illustrations ofhow CTFs let us extract shape from data and also apply CTFs to manifoldclustering, the problem of categorizing data points according to their noisymembership in a collection of possibly intersecting smooth submanifolds ofEuclidean space. We prove that the proposed manifold clustering method isstable and carry out several experiments to illustrate the method.
Keywords: shape of data, multiscale data analysis, covariance fields,Fr´echet functions, manifold clustering
1. Introduction
Probing, analyzing and visualizing the shape of complex data are chal-lenges that are magnified by the intricate dependence of their structuralproperties, as basic as dimensionality, on location and scale (cf. [1]). As such,resolving and integrating the geometry and topology of data across scales are1 a r X i v : . [ s t a t . M L ] F e b roblems of foremost importance. In this paper, we develop the notion ofmultiscale covariance tensor fields (CTF) associated with Euclidean randomvariables and show that many properties of the shape of their distributionsbecome accessible through CTFs, which provide stable representations thatcan be estimated reliably from data.For a random vector y ∈ R d , scale dependence is controlled by a kernelfunction K ( x, y, σ ) (cid:62)
0, where x, y ∈ R d and σ > x , at scale σ >
0, the kernel masksthe distribution by attributing weight K ( x, y, σ ) to data located at y , cre-ating a windowing effect. More simply put, K ( x, y, σ ) quantifies how wellan observer at x sees data at y at scale σ . Covariation of the weighted datais measured relative to every point x ∈ R d , not just about the mean as iscommon practice, thus giving rise to a multiscale covariance field. Specialcases of these covariance fields were introduced in [2], targeting applicationsto such problems as detection of local scales and feature rich points in shapes.Here we present a more systematic treatment that includes a broader formu-lation of multiscale CTFs, stability theorems that ensure that properties ofprobability measures derived from multiscale CTFs are robust, as well asconsistency results and convergence rates for empirical CTFs. We prove sta-bility of CTFs with respect to the Wasserstein distance between probabilitymeasures, a metric that is finding uses in an ever expanding landscape ofproblems and whose origins are in optimal transport theory [3, 4]. SinceWasserstein distance metrizes weak convergence of probability measures, weobtain a strong stability result that ensures that if two probability distribu-tions are similar in a weak sense, then their multiscale CTFs are uniformlyclose over the entire domain. Convergence rates are derived from the sta-bility theorems and results by Fournier and Guillin [5] and Garc´ıa-Trillosand Slepˇcev [6] on convergence of empirical measures. The standard covari-ance tensor of a random vector y ∈ R d quantifies covariation of y about themean, but may be extended to a full covariance field by considering covaria-tion about arbitrary points. Nonetheless, this field provides no informationabout the organization of the data other than that already contained in thecovariance about the mean. Thus, a localized formulation is essential forgaining additional insight into the shape of data.The trace of a multiscale CTF is a scalar field that gives a multiscaleanalogue of the classical Fr´echet function V ( x ) = E [ (cid:107) y − x (cid:107) ] of a randomvariable y with finite second moment. The Fr´echet function provides a moregeometric interpretation of the mean as the unique minimizer of V ; that is,2he point µ ∈ R d with respect to which the spread of y is minimal. Similarly,the local extrema and other properties of the multiscale Fr´echet functionprovide a wealth of information about the distribution of y . In fact, we showthat the distribution of any random vector may be fully recovered from themultiscale Fr´echet function associated with the Gaussian kernel.Several variants of empirical localized or weighted covariance previouslyhave been used in data analysis, but we develop a framework for the for-mulation and systematic treatment of such problems. Allard et al. havedeveloped a computational model termed geometric multi-resolution analy-sis for multiscale data analysis based on covariance localized to hierarchiesof dyadic cubes [7]. In computer graphics, local principal component anal-ysis (PCA) is commonly used in the estimation of normals to surfaces frompoint-cloud data [8] for surface reconstruction; see also [9] and referencestherein. In computer vision, tensor voting by Medioni et al. [10] has beenapplied to multiple image analysis and processing tasks. Brox et al. 
haveused empirical covariance weighted by the isotropic Gaussian kernel in non-parametric density estimation targeting applications in motion tracking [11].In the literature dealing with clustering, especially clustering of multiple pos-sibly intersecting manifolds, local PCA ideas have been used in the works ofKushnir et al. [12], Goldberg et al. [13], Gong et al. [14], Wang et al. [15],and in a series of papers by Arias-Castro and collaborators [16, 17, 18].(a) (b) (c) Figure 1: Examples of data clustered along intersecting manifolds.
The special case of affine linear subspaces, known as subspace clustering,has been addressed in the machine learning and computer vision literature bymany authors using a variety of techniques (cf. [19, 20, 21, 22, 23, 24, 25, 26]).More general manifold clustering has been considered in [27, 28, 29, 12, 30,13, 14, 15, 18, 31]. In our approach, we exploit the fact that localized covari-ance tensors encode rich information about the tangential structure of thesubmanifolds that underlie the data. Combined with information about the3relative) positions of the data points, they yield an effective data representa-tion for manifold clustering. Although several different clustering techniquescould be applied to the “tensorized” data, we use the single linkage hierarchi-cal method because it produces provenly stable dendrograms. In conjunctionwith the stability and consistency results for covariance fields, this ensuresthat the manifold clustering method is stable at all steps. Dendrogram sta-bility is analyzed in the framework of [32].
Contributions and Organization of the paper.
The paper includes several il-lustrations and applications of CTFs to data analysis. For example, to il-lustrate how geometric information can be extracted from CTFs, we showthat the curvature of plane curves and the principal curvatures of surfacesin R can be calculated from the spectrum of multiscale CTFs. Thus, multi-scale covariance tensors give a way of extending these infinitesimal measuresof geometric complexity to all scales and general probability distributions,not just those supported on smooth submanifolds. We also apply multiscaleCTFs to manifold clustering, the problem of clustering Euclidean data thatare organized along a finite union of possibly intersecting smooth submani-folds. Fig. 1 shows three such examples.The main goals of the paper are: (i) to establish the foundations for analy-sis, visualization and management of data with methods based on multiscalecovariance tensor fields, and (ii) to describe applications that characterizethe usefulness of CTFs in data analysis. In Section 2, we formulate the no-tion of multiscale CTFs for a broad class of kernels and give examples thatillustrate how CTFs reveal the geometry of data. In Section 3, we show thatthe curvature of a plane curve and the principal curvatures of a surface in R can be recovered from small-scale covariance. Section 4 is devoted to themain theoretical developments. We prove stability and consistency theoremsfor multiscale covariance tensor fields under mild regularity assumptions onthe kernel, and also analyze rates of convergence that are important for ap-plications in data analysis. Since some discontinuous kernels are of practicalinterest, we also investigate convergence results for such kernels, including apointwise central limit theorem. Multiscale Fr´echet functions are discussedin Section 5 and manifold clustering in Section 6. We close with a summaryand some discussion. 4 . Covariance Tensor Fields To define covariance tensor fields, we introduce some notation. Elementsof the tensor product R d ⊗ R d may be identified with bilinear forms B : R d × R d → R through the Euclidean inner product. More precisely, a pure tensor x ⊗ y corresponds to the bilinear form x ⊗ y ( u, v ) = (cid:104) x, u (cid:105) · (cid:104) y, v (cid:105) , (1) ∀ u, v ∈ R d , where (cid:104) , (cid:105) denotes Euclidean inner product. Bilinear forms as-sociated with more general elements of R d ⊗ R d can be described by linearextension. In Euclidean coordinates, we abuse notation and also write thecoordinate vectors of x, y ∈ R d as x and y . With this convention, letting A be the d × d matrix A = xy T , we have x ⊗ y ( u, v ) = (cid:104) u, Av (cid:105) , (2)where the superscript T denotes transposition. In this manner, using Eu-clidean coordinates, an element of R d ⊗ R d also can be identified with a d × d matrix by linear extension of the correspondence x ⊗ y ↔ A . Through theseidentifications, we refer to an element Σ ∈ R d ⊗ R d interchangeably as atensor, a bilinear form or a matrix. We equip R d ⊗ R d with the inner productdefined on pure tensors by (cid:104) x ⊗ y , x ⊗ y (cid:105) = (cid:104) x , x (cid:105) (cid:104) y , y (cid:105) (3)and extended linearly to R d ⊗ R d . Thus, the corresponding norm satisfies (cid:107) x ⊗ y (cid:107) = (cid:107) x (cid:107)(cid:107) y (cid:107) , (4)for any x, y ∈ R d . In matrix representation, this is the Frobenius norm.Throughout the paper, we view R d as a measurable space equipped withthe Borel σ -algebra for the Euclidean metric. 
Let y be an R d -valued randomvariable distributed according to the probability measure α . Suppose that y has expected value E [ y ] = µ ∈ R d and finite second moment. As a motivationfor the definition of multiscale CTFs, recall that the covariance tensor of y is defined asΣ α ( µ ) = E [( y − µ ) ⊗ ( y − µ )] = (cid:90) R d ( y − µ ) ⊗ ( y − µ ) α ( dy ) ∈ R d ⊗ R d . (5)5n matrix notation, Σ α ( µ ) = (cid:90) R d ( y − µ )( y − µ ) T α ( dy ) . (6)The bilinear form associated with Σ α ( µ ) clearly is symmetric and positivesemi-definite.Covariation of y may be measured with respect to any x ∈ R d , not just µ .Thus, Σ α ( µ ) may be extended to a global covariance tensor field Σ α : R d → R d ⊗ R d given by Σ α ( x ) = (cid:90) R d ( y − x ) ⊗ ( y − x ) α ( dy ) . (7)Note, however, thatΣ α ( x ) = Σ α ( µ ) + ( µ − x ) ⊗ ( µ − x ) , (8)for any x ∈ R d . Thus, for x (cid:54) = µ , Σ α ( x ) does not reveal any informationabout the distribution of y other than that already contained in Σ α ( µ ). Incontrast, as we shall see below, multiscale analogues are rich in informationabout the shape of α . We adopt the notation ν d for the volume of the unit ball in R d and ω d − for the “surface area” of the unit sphere S d − ⊂ R d , d ≥
1. Recall that ω d − = 2 π d/ / Γ( d/ · ) is the Gamma function, and ω d − = d ν d .We make the convention that ν = 1.Let y be an R d -valued random variable with distribution α and let K bea multiscale kernel; that is, a measurable function K : R d × R d × (0 , ∞ ) → R such that K ( x, y, σ ) (cid:62)
0, for any x, y ∈ R d and σ > Definition 1.
The multiscale covariance tensor field (CTF) of y associatedwith the kernel K is the one-parameter family of tensor fields, indexed by σ ∈ (0 , ∞ ), given byΣ α ( x, σ ) := (cid:90) R d ( y − x ) ⊗ ( y − x ) K ( x, y, σ ) α ( dy ) , (9)provided that the integral converges for each x ∈ R d and σ > emark 1. Note that Σ α depends only on the probability measure α , noton y . For this reason, we refer to Σ α interchangeably as the multiscale CTFof the random variable y or the probability measure α .Σ α ( x, σ ) measures the covariation of y about x with probability massat y weighted by K ( x, y, σ ). It is simple to verify that the bilinear formΣ α ( x, σ ) is symmetric and positive semi-definite. Note that if K is boundedfor each σ >
0, that is, ∃ M σ > K ( x, y, σ ) ≤ M σ , ∀ x, y ∈ R d ,then Σ α ( x, σ ) is well defined for any random variable y with finite secondmoment. In particular if K ≡
1, Σ α ( x, σ ) = Σ α ( x ), ∀ x ∈ R d . However, asour primary goal is to study the organization of data and random variablesat scales ranging from local to global, we consider kernels in R d that satisfyadditional decay conditions as they produce a windowing effect. The kernelsare constructed as follows. Definition 2.
Let d be a positive integer and f : [0 , ∞ ) → R a bounded andmeasurable function satisfying:(a) f ( r ) (cid:62) ∀ r ∈ [0 , ∞ );(b) M d = (cid:82) ∞ r d − f ( r ) dr < ∞ ;(c) There is C > rf ( r ) ≤ C , ∀ r ∈ [0 , ∞ ).The multiscale kernel K : R d × R d × (0 , ∞ ) → R associated with f is definedas K ( x, y, σ ) := 1 C d ( σ ) f (cid:18) (cid:107) y − x (cid:107) σ (cid:19) , (10)where C d ( σ ) = σ d M d ω d − .Condition (b) in the definition implies that the normalizing constant C d ( σ ) is well defined. The normalization is adopted so that (cid:82) K ( x, y, σ ) dy =1, ∀ x ∈ R d and ∀ σ >
0. Condition (c) guarantees that the integral in (9) isconvergent for any probability measure α . Henceforth, for convenience, weassume that sup f = 1. This is not restrictive since scaling f does not changethe kernel K because of the normalization.Whereas we investigate properties of multiscale CTFs in a more generalsetting, our examples and experiments focus on two special kernels:7i) The isotropic Gaussian kernel G ( x, y, σ ) = 1(2 πσ ) d/ exp (cid:18) − (cid:107) y − x (cid:107) σ (cid:19) , (11)which is associated with the function f ( x ) = e − x/ ;(ii) The truncation kernel T ( x, y, σ ) = 1 σ d ν d χ (cid:18) (cid:107) y − x (cid:107) σ (cid:19) (12)associated with the characteristic function χ : [0 , ∞ ) → R of the unitinterval [0 , x ,the kernel T attributes a uniform weight to mass at points within theclosed ball of radius σ centered at x and weight zero to mass elsewhere. Remark 2.
The kernel K defined in (10) is homogeneous and isotropic; thatis, for any isometry ϕ : R d → R d , K ( ϕ ( x ) , ϕ ( y ) , σ ) = K ( x, y, σ ), ∀ x, y ∈ R d and σ >
0. Moreover, if we write ϕ ( x ) = U x + b , with U ∈ O ( d ) and b ∈ R d ,then U Σ α ( x, σ ) U T = Σ ϕ ∗ ( α ) ( ϕ ( x ) , σ ) , (13)for any ( x, σ ) ∈ R d × (0 , ∞ ). Here O ( d ) is the group of d × d orthogonalmatrices and ϕ ∗ ( α ) is the pushforward of α under ϕ . Remark 3.
Multiscale covariance tensor fields can be defined for any positiveBorel measure α that satisfies (cid:90) R d (cid:107) z (cid:107) f ( (cid:107) z (cid:107) ) α ( dz ) < ∞ , (14)not just for probability measures. In particular, if f has compact support,covariance fields are defined for any locally finite Borel measure α ; that is,measures for which every point p ∈ R d has an open neighborhood U p suchthat α ( U p ) < ∞ .We conclude this section with examples that support our contention thatmultiscale covariance tensor fields are rich in information about the shape ofdata. 8 xample . This example shows that the spectrum of multiscale covariancetensors allow us to estimate the dimensionality of data in a scale dependentmanner. We consider the data points y , . . . , y n in R , shown in Figure2, and calculate Σ α n centered at one of the data points for the Gaussiankernel at scales σ = 0 . σ = 2. Here α n denotes the empirical measure n − (cid:80) ni =1 δ y i . The covariance tensors are depicted as ellipses whose principalaxes are in the direction of the eigenvectors of the covariance matrix andprincipal radii are proportional to √ λ and √ λ , where 0 ≤ λ ≤ λ are theeigenvalues of the covariance. At scale σ = 0 . λ /λ = 0 . σ = 2, the ratio of the eigenvalues is 0 . σ = 0 . σ = 2 Figure 2: Estimating data dimensionality at different scales through multiscale covariance.
Example R d ) . Let v , . . . , v r ∈ R d , 1 ≤ r ≤ d ,be orthonormal vectors and consider the subspace H = < v , . . . , v r > thatthey span. Let α denote the singular measure supported on H induced bythe volume form on H . The measure α clearly is locally finite. We calculatemultiscale covariance fields at points x ∈ H to show that H may be recoveredfrom Σ α ( x, σ ). By Remark 2, we may assume that x = 0. A calculation showsthat for the Gaussian kernel,Σ α (0 , σ ) = r ( √ π ) d − r σ d − r − r (cid:88) i =1 v i ⊗ v i . (15)For the truncation kernel,Σ α (0 , σ ) = λ r r (cid:88) i =1 v i ⊗ v i , (16)where λ r = 1 σ d − r − ν r − ν d (cid:90) π/ − π/ sin θ cos r θ dθ . (17)9or r = 1, this expression simplifies to λ = 2 / (3 σ d − ν d ). Thus, for bothkernels, the orthogonal complement of H is the null space of Σ α (0 , σ ) and H is the eigenspace associated with the positive eigenvalue λ r . Example n segments) . Consider the wedge (one-point union) W of n segments L , . . . , L n in R d attached at the origin, as depicted in Fig. 3.Each segment L i is determined by its length (cid:96) i > v i . We assume that v i (cid:54) = v j , for any 1 ≤ i < j ≤ n . Let α be the singularmeasure on R d that is supported on W and agrees with the measure inducedby arc length on each segment L i . We consider the multiscale covariancefield of α associated with the truncation kernel. For x ∈ L i , x (cid:54) = 0, as inthe case r = 1 in Example 2, we have that Σ α ( x, σ ) = (2 / σ d − ν d ) v i ⊗ v i atsmall enough scales. Thus, Σ α ( x, σ ) has rank one. However, at the origin,Σ α (0 , σ ) = 13 σ d ν d n (cid:88) i =1 (min( σ, (cid:96) i )) v i ⊗ v i . (18)for any σ >
0. Thus, for σ ≤ min { (cid:96) i , ≤ i ≤ n } ,Σ α (0 , σ ) = 13 σ d − ν d n (cid:88) i =1 v i ⊗ v i . (19) Figure 3: Covariance at the one-point union of line segments.
3. Geometry of Curves and Surfaces
In this section, we show how multiscale CTFs associated with the trun-cation kernel extract precise local geometric information from plane curvesand surfaces in R . 10 .1. Plane CurvesExample . We begin with the special case of a circle. Let C R ⊂ R be thecircle of radius R centered at the origin in R and α the singular measuresupported on C R induced by arc length. For any x ∈ R , we denote r = (cid:107) x (cid:107) .If x is such that | r − R | > σ then Σ α ( x, σ ) = 0 . Assume that x ∈ R and0 < σ < R are such that r ∈ [ R − σ, R + σ ]. In this case, in the coordinatesystem given by the directions n = x/ (cid:107) x (cid:107) and t = n ⊥ , a calculation showsthat Σ α ( x, σ ) is diagonal with entries λ n ( x, σ ) = 1 πσ (cid:2) Rφ (cid:0) R + 2 r (cid:1) + R ( R cos φ − r ) sin φ (cid:3) λ t ( x, σ ) = R πσ ( φ − sin φ cos φ ) , (20)where φ = arccos (cid:16) R + r − σ rR (cid:17) . Thus, the normal and tangential vectors, n and t , are eigenvectors with eigenvalues λ n and λ t , respectively. Fig. 4 showsthe eigenvalues as functions of r , 0 . ≤ r ≤ .
1, for σ = 0 . Figure 4: Tangential (blue) and normal (red) eigenvalues as a function of r , 0 . ≤ r ≤ . σ = 0 .
1, of the multiscale CTF associated with the truncation kernel for the singularmeasure induced by arc length, supported on the unit circle in R . Now we consider a general smooth curve C ⊂ R , that is, a 1-dimensional,smooth, properly embedded submanifold of R . Let α be the singular mea-sure on R supported on C and induced by arc length. This measure islocally finite because the embedding is proper. We calculate the small-scalecovariance at points on C for the truncation kernel and show that the curva-ture can be recovered from the eigenvalues of Σ α . Let x ∈ C be fixed. Thearc-length parametrization of C near x may be written as X ( s ) = s − κ s O ( s ) and Y ( s ) = κs κ s s O ( s ) , (21)11here X ( s ) and Y ( s ) are coordinates along the tangent and normal to C at x , respectively [33]. Here, the curvature κ and its derivative κ s are evaluatedat x . A calculation yields: Proposition 1.
Let σ > be small. If C is a smooth plane curve and x ∈ C ,then in the coordinates specified above we have Σ α ( x, σ ) = (cid:18) σ π − κ σ π + O ( σ ) κ s σ π + O ( σ ) κ s σ π + O ( σ ) κ σ π + O ( σ ) (cid:19) . (22)Proposition 1 implies that, for σ > α are λ = 2 σ π − κ σ π + O ( σ ) and λ = κ σ π + O ( σ ) , (23)so that tr Σ α ( x, σ ) = 2 σ π + κ σ π + O ( σ ) . (24)Thus, the curvature at x ∈ C may be recovered, up to a sign, as κ = ± lim σ → √ πσ / (cid:18) tr Σ α ( x, σ ) − σ π (cid:19) / . (25) R Example . Let S R be the sphere of radius R centered at the origin in R .For x ∈ R , we let r = (cid:107) x (cid:107) . If x is such that | r − R | > σ , then Σ α ( x, σ ) = 0 . Assume that x (cid:54) = (0 , ,
0) and σ > r ∈ [ R − σ, R + σ ]. Inthe coordinate system given by the vector n = x/ (cid:107) x (cid:107) , and any orthonormalbasis { t , t } of the orthogonal complement n ⊥ , a direct calculation showsthat Σ α ( x, σ ) is a 3 × λ t ( x, σ ) = λ t ( x, σ ) = R σ sin (cid:18) φ (cid:19) (cos φ + 2) λ n ( x, σ ) = R σ (1 − cos φ ) (cid:0) R + R cos φ ( R cos φ + R − r ) (cid:1) + R σ (1 − cos φ ) (cid:0) − Rr + 3 r (cid:1) (26)where φ = arccos (cid:16) R + r − σ Rr (cid:17) . In particular, this means that λ n is the ein-genvalue corresponding to the eigenvector n along the normal direction to12 igure 5: Tangential (blue) and normal eigenvalues (red) of the multiscale CTF associatedwith the truncation kernel as a function of r , 0 . ≤ r ≤ .
1, at σ = 0 .
1, for the singularmeasure induced by surface area, supported on the unit sphere in R . the sphere at x/ (cid:107) x (cid:107) , and { t , t } span the eigenspace along the tangent di-rections with eigenvalue λ t = λ t . Fig. 5 shows a plot of the eigenvalues asa function of r , 0 . ≤ r ≤ ,
1, for σ = 0 . S ⊂ R . Let α be thesingular measure on R supported on S and induced by the area measure on S . We calculate the small-scale covariance at points on S for the truncationkernel and show that the principal curvatures may indeed be recovered fromthe spectrum of Σ. Given a non-umbilic point p ∈ S , one can choose aCartesian coordinate system centered at p so that the x -axis is along thedirection of maximal curvature at p , the y -axis is along the direction ofminimal curvature at p , and the z -axis is along the normal to S at p . Proposition 2.
Let σ > be small, p ∈ S be non-umbilic, and α be thesurface area measure on S . In the coordinate system described above, thecovariance tensor for the truncation kernel is given by Σ α ( p, σ ) = A t O ( σ ) O ( σ ) O ( σ ) A t O ( σ ) O ( σ ) O ( σ ) A n , (27) where t = 3 σ
16 + 1256 ( − κ − κ κ + κ ) σ + O ( σ ) ,A t = 3 σ
16 + 1256 ( κ − κ κ − κ ) σ + O ( σ ) ,A n = 3 κ + 2 κ κ + 3 κ σ + O ( σ ) , (28) and κ > κ are the principal curvatures of S at p . It follows from this result that, for σ > α ( p, σ ) = 316 σ + 164 ( κ − κ ) σ + O ( σ ) anddet Σ α ( p, σ ) = (cid:0) κ + 2 κ κ + 3 κ (cid:1) π σ + O ( σ ) . (29)As a consequence, κ and κ can be recovered from the spectrum of Σ α ( p, σ )as a function of σ . Indeed, from the small scale asymptotics of the traceand determinant of Σ α ( p, σ ), we can extract the values of ( κ − κ ) and3 κ + 2 κ κ + 3 κ from which we can determine the values of κ and κ . Proof of Proposition 2.
Using cylindrical coordinates in the chosen referencesystem, we can parametrize the patch S ∩ B ( p, σ ) as ( ρ cos φ, ρ sin φ, z ( ρ, φ )),for φ ∈ [0 , π ], ρ ∈ [0 , ρ σ ( φ )], where ρ σ ( φ ) = σ − (cid:0) κ (cos φ ) + κ (sin φ ) (cid:1) σ + O ( σ ), and z ( ρ, φ ) = ρ (cid:0) κ (cos φ ) + κ (sin φ ) (cid:1) + O ( σ ) . The area elementon the patch is given by dA = (cid:18) ρ + ρ κ (cos φ ) + κ (sin φ ) ) + O ( ρ ) (cid:19) dρ dφ. (30)Now we have all the ingredients needed to compute Σ α ( p, σ ). For example,to calculate the (1 , (cid:82)(cid:82) S ∩ B (0 ,σ ) x dA as (cid:90) π (cid:90) ρ σ ( φ )0 (cid:20) ρ cos φ + ρ cos φ (cid:0) κ sin φ + κ cos φ (cid:1) + O ( ρ ) (cid:21) dρ dφ , (31)which after a simple but tedious calculation yields the desired result. Thecomputation of other entries of the matrix follows similar steps.14 . Stability and Consistency For each p ∈ [1 , ∞ ), let P p ( R d ) denote the collection of all Borel proba-bility measures α on R d whose p th moment M p ( α ) = (cid:82) (cid:107) z (cid:107) p α ( dz ) is finite.We adopt the notation m p ( α ) = M /pp ( α ). For p = ∞ , we let P ∞ ( R d ) be thecollection of all Borel probability measures on R d with bounded support and m ∞ ( α ) = sup {(cid:107) z (cid:107) , z ∈ supp [ α ] } . By Jensen’s inequality, if 1 ≤ q ≤ p ≤ ∞ ,then P p ( R d ) ⊂ P q ( R d ) and m q ( α ) ≤ m p ( α ), for any α ∈ P p ( R d ). Definition 3.
For p ∈ [1 , ∞ ] and λ >
0, we define P λp ( R d ) ⊂ P p ( R d ) as thesubset of all α ∈ P p ( R d ) such that α ( A ) ≤ λ L ( A ), for all measurable sets A ,where L stands for Lebesgue measure. Example . If α ∈ P p ( R d ) is absolutely continuous with respect to theLebesgue measure with density function f ∈ L ∞ ( R d ) satisfying (cid:107) f (cid:107) ∞ ≤ λ ,then α ∈ P λp ( R d ).Let us recall the definition of the p -Wasserstein distance W p ( α, β ) between α, β ∈ P p ( R d ). Let Γ( α, β ) be the collection of all couplings of α and β ; thatis, probability measures µ on R d × R d such that ( π ) ∗ µ = α and ( π ) ∗ µ = β ,where π , π : R d × R d → R d denote projections onto the first and secondcomponents, respectively. Definition 4.
For p ∈ [1 , ∞ ), the p -Wasserstein distance between α, β ∈P p ( R d ) is given by W p ( α, β ) := inf µ ∈ Γ( α,β ) (cid:18)(cid:90) (cid:90) (cid:107) z − z (cid:107) p µ ( dz × dz ) (cid:19) /p , and the ∞ -Wasserstein distance between α, β ∈ P ∞ ( R d ) by W ∞ ( α, β ) := inf µ ∈ Γ( α,β ) sup {(cid:107) z − z (cid:107) , ( z , z ) ∈ supp [ µ ] } . Remark 4. (i) For any α, β ∈ P p ( R d ), p ∈ [1 , ∞ ], there exists a coupling that realizesthe infimum in the definition of W p ( α, β ) (cf. [34]).(ii) It is a standard result that, for each p ∈ [1 , ∞ ), W p defines a metricon P p ( R d ) that is compatible with weak convergence of probabilitymeasures [3].(iii) If ϕ : R d → R d is an isometry, then W p ( α, β ) = W p ( ϕ ∗ ( α ) , ϕ ∗ ( β )), forany α, β ∈ P p ( R d ). 15 .1. Smooth Kernels Theorem 1 (Stability for Smooth Kernels) . Let f : [0 , ∞ ) → R be as inDefinition 2 with multiscale kernel K . Suppose that f is differentiable andthere exists a constant A > such that r / | f (cid:48) ( r ) | ≤ A , ∀ r ≥ . Then,there is a constant A f > , that depends only on f , such that sup x ∈ R d (cid:107) Σ α ( x, σ ) − Σ β ( x, σ ) (cid:107) ≤ σA f C d ( σ ) W ( α, β ) , for any α, β ∈ P ( R d ) and any σ > . Here (cid:107) · (cid:107) is the norm associated withthe inner product defined in (3) . Theorem 1 shows that multiscale covariance fields yield a robust repre-sentation of probability measures that make their geometric properties morereadily accessible, as illustrated in our examples. In Section 5, we show thatnot only is Σ α ( · , σ ) stable, but all the information contained in the probabil-ity measure α is fully absorbed into the multiscale CTF associated with theGaussian kernel. In fact, α may be recovered from the multiscale scalar fieldgiven by V α ( x, σ ) = tr Σ α ( x, σ ), x ∈ R d and σ > K σ : R d → R by K σ ( z ) = K ( z, , σ ) = 1 C d ( σ ) f (cid:18) (cid:107) z (cid:107) σ (cid:19) (32)and Q σ : R d → R d ⊗ R d by Q σ ( z ) = ( z ⊗ z ) K σ ( z ) . (33) Lemma 1.
Let f be as in Definition 2 and suppose that f is differentiableand there is a constant A > such that r / | f (cid:48) ( r ) | ≤ A , ∀ r ≥ . Then,there is a constant A f > , that depends only on f , such that (cid:107) Q σ ( z ) − Q σ ( z ) (cid:107) ≤ A f σC d ( σ ) (cid:107) z − z (cid:107) , for any z , z ∈ R d and all σ > .Proof. Let z ( t ) = tz + (1 − t ) z , 0 ≤ t ≤
1. Then, Q σ ( z ) − Q σ ( z ) = (cid:90) ddt Q σ ( z ( t )) dt . (34)16ince ddt Q σ ( z ) = ( z − z ) ⊗ zC d ( σ ) f (cid:18) (cid:107) z (cid:107) σ (cid:19) + z ⊗ ( z − z ) C d ( σ ) f (cid:18) (cid:107) z (cid:107) σ (cid:19) + ( z ⊗ z ) 2 σ C d ( σ ) f (cid:48) (cid:18) (cid:107) z (cid:107) σ (cid:19) ( z · ( z − z )) , (35)it follows that (cid:13)(cid:13)(cid:13)(cid:13) ddt Q σ ( z ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) z (cid:107) C d ( σ ) f (cid:18) (cid:107) z (cid:107) σ (cid:19) (cid:107) z − z (cid:107) + 2 (cid:107) z (cid:107) σ C d ( σ ) (cid:12)(cid:12)(cid:12)(cid:12) f (cid:48) (cid:18) (cid:107) z (cid:107) σ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) z − z (cid:107) = 2 σC d ( σ ) (cid:107) z (cid:107) σ f (cid:18) (cid:107) z (cid:107) σ (cid:19) (cid:107) z − z (cid:107) + 2 σC d ( σ ) (cid:107) z (cid:107) σ (cid:12)(cid:12)(cid:12)(cid:12) f (cid:48) (cid:18) (cid:107) z (cid:107) σ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) z − z (cid:107) . (36)Since f is smooth, condition (c) of Definition 2 ensures that there is a con-stant A > √ rf ( r ) ≤ A , ∀ r ≥
0. Moreover, by hypothesis r / | f (cid:48) ( r ) | ≤ A . Thus, (36) implies that (cid:13)(cid:13)(cid:13)(cid:13) ddt Q σ ( z ) (cid:13)(cid:13)(cid:13)(cid:13) ≤ σA f C d ( σ ) (cid:107) z − z (cid:107) , (37)where A f = 2( A + A ). The lemma follows from (34) and (37). Proof of Theorem 1.
Without loss of generality, we may assume that x = 0.We express the covariance fields asΣ α ( x, σ ) = (cid:90) Q σ ( z ) α ( dz ) and Σ β ( x, σ ) = (cid:90) Q σ ( z ) β ( dz ) . (38)Given η > W ( α, β ) < η , let µ ∈ Γ( α, β ) be a coupling such that (cid:90) (cid:90) (cid:107) z − z (cid:107) µ ( dz × dz ) < η . (39)We may writeΣ α ( x, σ ) = (cid:90) Q σ ( z ) µ ( dz × dz ) and Σ β ( x, σ ) = (cid:90) Q σ ( z ) µ ( dz × dz ) . (40)17herefore, (cid:107) Σ α ( x, σ ) − Σ β ( x, σ ) (cid:107) ≤ (cid:90) (cid:90) (cid:107) Q σ ( z ) − Q σ ( z ) (cid:107) µ ( dz × dz ) . (41)Lemma 1 and (41) imply that (cid:107) Σ α ( x, σ ) − Σ β ( x, σ ) (cid:107) ≤ σA f C d ( σ ) (cid:90) (cid:90) (cid:107) z − z (cid:107) µ ( dz × dz ) ≤ σA f C d ( σ ) η . (42)Since (42) holds for any η > W ( α, β ), we can conclude that (cid:107) Σ α ( x, σ ) − Σ β ( x, σ ) (cid:107) ≤ σA f C d ( σ ) W ( α, β ) , (43)as claimed.In what follows, given random vectors y i ∈ R d , i ∈ N , we let α n = (cid:80) ni =1 δ y i /n . Corollary 1 (Consistency for Smooth Kernels) . Let K be a multiscale kernelas in Theorem 1 and σ > . If α ∈ P ( R d ) and y i ∈ R d , i ∈ N , are i.i.d.random variables with distribution α , then sup x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) n ↑∞ −−→ almost surely.Proof. Theorem 1 implies thatsup x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) ≤ σA f C d ( σ ) W ( α n , α ) . (44)The conclusion follows from the fact that W metrizes weak convergenceof probability measures in P ( R d ) and Varadarajan’s Theorem [35] aboutconvergence of empirical measures on Polish spaces that ensures that α n converges weakly to α almost surely. 18orollary 1 guarantees the asymptotic consistency of empirical CTFs.However, in applications, it is important to have estimates of the rate ofconvergence, which we derive from the stability theorem and a result ofFournier and Guillin [5, Theorem 1]. Theorem 2 (Fournier and Guillin, [5]) . Let α ∈ P s ( R d ) , where s > . If y , . . . , y n are i.i.d. random variables with distribution α and p ∈ [1 , s ) , thenthere exists a constant b > that depends only on p, s and d such that E [ W p ( α, α n )] ≤ b m ps ( α ) · n − s − ps + n − if p > d/ and s (cid:54) = 2 p ; n − s − ps + n − log(1 + n ) if p = d/ and s (cid:54) = 2 p ; n − s − ps + n − pd if p ∈ [1 , d/ and s (cid:54) = d/ ( d − p ) ,for any n ≥ . Corollary 2.
Let f be as in Theorem 1 and σ > . Suppose that α ∈ P ( R d ) and y i , i ∈ N , are i.i.d. random variables with distribution α . Then, there isa constant b > , that depends only on d , such that E (cid:20) sup x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) (cid:21) ≤ σA f bC d ( σ ) m ( α ) · n − + n − if d = 1 ; n − + n − log(1 + n ) if d = 2 ; n − + n − d if d ≥ .Proof. Since P ( R d ) ⊆ P ( R d ), α is also an element of P ( R d ). The conclu-sion follows by invoking Theorem 1 and the result of Fournier and Guillinwith p = 1 and s = 3. The constant b depends only on d because we arefixing p and s .Theorem 1 ensures that multiscale covariance fields are stable. However,the results do not apply to some discontinuous kernels of practical interest.Nonetheless, we prove a stability theorem for the truncation kernel, as wellas pointwise convergence results for more general kernels.19 .2. The Truncation Kernel We begin our discussion of covariance fields associated with the truncationkernel with a stability theorem with respect to the ∞ -Wasserstein metric. Inpreparation for the proof of the theorem, we introduce some notation. For0 ≤ a < b , let R d ( a, b ) = { y ∈ R d : a < (cid:107) y (cid:107) ≤ b } . (45)and s d ( a, b ) = (cid:90) R d ( a,b ) (cid:107) y (cid:107) dy (46)be its radial moment of inertia. We will use the fact that for any B ≥ b , theinequality s d ( a, b ) ≤ ( b − a ) ω d − d + 2 B d +2 B − a (47)holds. Indeed, s d ( a, b ) = ω d − d + 2 (cid:0) b d +2 − a d +2 (cid:1) = ω d − d + 2 a d +2 (cid:0) ( b/a ) d +2 − (cid:1) = ω d − d + 2 a d +2 ( b − a ) a (cid:32) (cid:18) ba (cid:19) + · · · + (cid:18) ba (cid:19) d +1 (cid:33) ≤ b − aB − a ω d − d + 2 a d +2 B − aa (cid:32) (cid:18) Ba (cid:19) + · · · + (cid:18) Ba (cid:19) d +1 (cid:33) ≤ ( b − a ) ω d − ( d + 2) B d +2 − a d +2 B − a ≤ ( b − a ) ω d − d + 2 B d +2 B − a . (48)Let P ∞ ( R d ) and P λ ∞ ( R d ) be as in Definition 3. Theorem 3 (Stability for the Truncation Kernel) . Let σ > and λ > .Suppose Ω ⊂ R d is a compact set and let c > satisfy diam(Ω) ≤ c . There isa constant A = A ( σ, d, c ) > such that if α ∈ P λ ∞ ( R d ) and β ∈ P ∞ ( R d ) havetheir supports contained in Ω , then the multiscale covariance tensor fields of α and β associated with the truncation kernel T satisfy sup x ∈ R d (cid:107) Σ α ( x, σ ) − Σ β ( x, σ ) (cid:107) ≤ λAW ∞ ( α, β ) . Proof.
Without loss of generality, we may assume that x ∈ R d is the origin.We abbreviate η = W ∞ ( α, β ) and let µ ∈ Γ( β, α ) be a coupling that realizes η . Clearly, η ≤ diam(Ω) ≤ c . 20sing the notation introduced in (45), let A σ,η = R d ( σ, σ + η ). We writeΣ α ( x, σ ) − Σ β ( x, σ ) = Σ α ( x, σ ) − ( σ + η ) d σ d Σ α ( x, σ + η ) (cid:124) (cid:123)(cid:122) (cid:125) T + ( σ + η ) d σ d Σ α ( x, σ + η ) − Σ β ( x, σ ) (cid:124) (cid:123)(cid:122) (cid:125) T . (49)Using the fact that α ∈ P λ ∞ ( R d ), we bound the norm of T as follows: (cid:107) T (cid:107) = 1 σ d ν d (cid:13)(cid:13)(cid:13) (cid:90) A σ,η y ⊗ y dα ( y ) (cid:13)(cid:13)(cid:13) ≤ σ d ν d (cid:90) A σ,η (cid:107) y (cid:107) dα ( y ) ≤ λσ d ν d (cid:90) A σ,η (cid:107) y (cid:107) dy = λσ d ν d s d ( σ, σ + η ) . (50)Using (47) with a = σ , b = σ + η and B = σ + c , we have that s d ( σ, σ + η ) ≤ η ω d − d + 2 ( σ + c ) d +2 c = η ν d dd + 2 ( σ + c ) d +2 c . (51)Thus, (cid:107) T (cid:107) ≤ λdd + 2 ( σ + c ) d +2 cσ d W ∞ ( α, β ) . (52)Now we examine T . Let I : R d → R the characteristic function of the closedball of radius 1 centered at the origin. Using the coupling µ , we write T = 1 σ d ν d (cid:90) (cid:90) R d × R d (cid:20) ( y ⊗ y ) I (cid:18) y σ + η (cid:19) − ( y ⊗ y ) I (cid:16) y σ (cid:17)(cid:21) µ ( dy × dy ) . (53)Note that the integrand in (53) vanishes on ( R d \ B σ + η ) × ( R d \ B σ ) andthe integral also vanishes over (cid:0) R d \ B σ + η (cid:1) × B σ because this subdomain isdisjoint from supp [ µ ]. Combining these remarks with y ⊗ y − y ⊗ y = ( y − y ) ⊗ y + y ⊗ ( y − y ) , (54)21e may rewrite (53) as T = 1 σ d ν d (cid:90) (cid:90) B σ + η × B σ [( y ⊗ y ) − ( y ⊗ y )] µ ( dy × dy )+ 1 σ d ν d (cid:90) (cid:90) B σ + η × ( R d \ B σ ) y ⊗ y µ ( dy × dy )= 1 σ d ν d (cid:90) (cid:90) B σ + η × B σ ( y − y ) ⊗ y µ ( dy × dy )+ 1 σ d ν d (cid:90) (cid:90) B σ + η × B σ y ⊗ ( y − y ) µ ( dy × dy )+ 1 σ d ν d (cid:90) (cid:90) B σ + η × A σ, η y ⊗ y µ ( dy × dy ) . (55)For the last equality, we used again the fact that( y , y ) ∈ supp [ µ ] = ⇒ (cid:107) y − y (cid:107) ≤ η . (56)From (55) and (56), using the facts that (cid:107) y (cid:107) ≤ σ + η for y ∈ B σ + η and (cid:107) y (cid:107) ≤ σ for y ∈ B σ , we may conclude that (cid:107) T (cid:107) ≤ σ + ησ d ν d (cid:90) (cid:90) B σ + η × B σ (cid:107) y − y (cid:107) µ ( dy × dy )+ σσ d ν d (cid:90) (cid:90) B σ + η × B σ (cid:107) y − y (cid:107) µ ( dy × dy )+ 1 σ d ν d (cid:90) (cid:90) B σ + η × A σ, η (cid:107) y (cid:107) µ ( dy × dy ) ≤ σ + cσ d ν d η (cid:90) B σ + η α ( dy ) + σσ d ν d η (cid:90) B σ + η α ( dy )+ 1 σ d ν d (cid:90) A ( σ − η )+ , η (cid:107) y (cid:107) α ( dy ) , (57)where ( σ − η ) + = max { σ − η, } . Since α ∈ P λ ∞ ( R d ) and η ≤ c , (cid:90) B σ + η α ( dy ) ≤ λ (cid:90) B σ + η dy = λ ( σ + η ) d ν d ≤ λ ( σ + c ) d ν d (58)22nd (cid:90) A σ − η, η (cid:107) y (cid:107) α ( dy ) ≤ λ (cid:90) A ( σ − η )+ , η (cid:107) y (cid:107) dy = λs d (( σ − η ) + , σ + η ) ≤ λη ω d − d + 2 ( σ + c ) d +2 η + c ≤ λη dd + 2 ν d ( σ + c ) d +2 c , (59)where we used (47) with a = ( σ − η ) + , b = σ + η and B = σ + c . Combining(57), (58) and (59), we obtain (cid:107) T (cid:107) ≤ σ + cσ d λ ( σ + c ) d η + 2 λσ d dd + 2 ( σ + c ) d +2 c η (60)From (49), (52) and (60), it follows that (cid:107) Σ α ( x, σ ) − Σ β ( x, σ ) (cid:107) ≤ λ dd + 2 ( σ + c ) d +2 cσ d W ∞ ( α, β )+ λ (2 σ + c )( σ + c ) d σ d W ∞ ( α, β )+ λ d ( d + 2) ( σ + c ) d +2 cσ d W ∞ ( α, β )= λA ( σ, d, c ) W ∞ ( α, β ) . 
(61)Since x ∈ R d is arbitrary, the claim follows.We now derive a consistency result and estimates for the rate of con-vergence of empirical approximations to multiscale covariance fields. Thefollowing result is a W ∞ -counterpart to the theorem by Fournier and Guillinstated above. Theorem 4 (Garc´ıa-Trillos and Slepˇcev [6]) . Let Ω ⊂ R d be a boundedconnected open subset with Lipschitz boundary. Let α be a probability measureon Ω with density f α : Ω → (0 , ∞ ) such that there exists λ ≥ with λ − ≤ f α ( x ) ≤ λ , for all x ∈ Ω , and let y , . . . , y n be i.i.d. random variables withdistribution α . Then, there exist constants c , C , C > , depending only on Ω and λ , such that for all n ∈ N and p > , P (cid:0) W ∞ ( α, α n ) ≤ ( C + C √ p ) r d ( n ) (cid:1) ≥ − c n − p , here r ( n ) = ln( n ) / n / and r d ( n ) = ln( n ) /d n /d , for d ≥ . Corollary 3 (Consistency for the Truncation Kernel) . Let α be a probabiltymeasure on R d with density f α and let Ω α be the interior of the support of α . Assume that Ω α is bounded and connected with Lipschitz boundary ∂ Ω α .Furthermore, assume that there exists λ ≥ such that λ − ≤ f α ( z ) ≤ λ , forall z ∈ Ω α . If y i , i ∈ N , are i.i.d. random variables with distribution α , then,for any p > , there are constants C = C (Ω α , λ, p ) > and c = c (Ω α , λ ) > such that P (cid:18) sup x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) ≤ C r d ( n ) (cid:19) ≥ − c n − p . Here, α n = (cid:80) ni =1 δ y i /n .Proof. We use Theorem 4 and write C (cid:48) = C + √ p C . Theorem 3 impliesthat there is a constant C (cid:48)(cid:48) = C (cid:48)(cid:48) (Ω α , λ ) > x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) ≤ C (cid:48)(cid:48) W ∞ ( α, α n ) . (62)Thus, P (cid:16) sup x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) ≤ C (cid:48) C (cid:48)(cid:48) r d ( n ) (cid:17) ≥≥ P (cid:0) W ∞ ( α, α n ) ≤ C (cid:48) r d ( n ) (cid:1) ≥ − c n − p . (63)The claim follows by setting C = C (cid:48) C (cid:48)(cid:48) . Corollary 4.
Let σ > and p > . Under the assumptions of Corollary 3,for the truncation kernel, there exist N = N ( σ, Ω α , λ ) ∈ N and a constant A = A ( σ, Ω α , λ ) > such that E (cid:20) sup x ∈ R d (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) (cid:21) ≤ Ar d ( n ) , for all n ≥ N. Proof.
We apply to W ∞ ( α, α n ) the identity E ( Z ) = (cid:82) ∞ P ( Z > t ) dt that isvalid for any non-negative random variable Z with finite first moment. Since W ∞ ( α, α n ) ≤ D = diam(Ω α ), we get E [ W ∞ ( α, α n )] = (cid:90) D P ( W ∞ ( α, α n ) > t ) dt . (64)24heorem 4 implies that P ( W ∞ ( α, α n ) > ( C + √ p C ) r d ( n )) ≤ n − p . (65)Let t = min { D, ( C + √ p C ) r d ( n ) } . From (64) and (65), E [ W ∞ ( α, α n )] = (cid:90) t P ( W ∞ ( α, α n ) > t ) dt + (cid:90) Dt P ( W ∞ ( α, α n ) > t ) dt ≤ t + D P ( W ∞ ( α, α n ) > ( C + √ p C ) r d ( n )) ≤ ( C + √ p C ) r d ( n ) + Dn − p . (66)Fixing p , say p = 2, for n sufficiently large, the dominant term on this lastexpression is the one involving r d ( n ). Thus, the claim follows from (66) andTheorem 3 applied to α and β = α n . Remark 5.
We carry out an experiment to test the convergence rates ob-tained in Corollary 4. We consider the probability measure α supportedon the unit circle S ⊂ R induced by the normalized arc length element(2 π ) − ds . In this case, for the truncation kernel, Σ α was calculated explicitlyin Example 4. We consider sets of i.i.d. samples of size n , 10 ≤ n ≤ .For each n , thirty sets of samples are taken. For each such set, we computeΣ α n and estimate the “error” as max (cid:107) Σ α n ( x, σ ) − Σ α ( x, σ ) (cid:107) , for σ = 0 . ×
24 grid on the square[ − . , . × [ − . , . ε n be the average error over all thirty sets ofsamples. Figure 6 shows a plot (in blue) of ε n in log-log scale. To compare ε n with the predicted rates, we use a least-squares fit, in log-log scale, ofthe form ε = Cr ( n ) = C ln( n ) / n / , also shown in Figure 6 (in red). The dis-crepancy between the predicted and observed rates suggests that Corollary4 might not be optimal. A curve of the form ε = Cn − / , shown in green,produces a tighter fit to the data, suggesting that the optimal bound mightbe O ( n − / ). We conclude the discussion of convergence of empirical CTFs with apointwise central limit theorem (CLT) that holds for kernels in the full gen-erality of Definition 2. One may think of it as a CLT for each entry of thematrix Σ α ( x, σ ). If e , . . . , e d is an orthonormal basis of R d , the ( i, j )-entry ofthe covariance matrix in this coordinate system is given by Σ α ( x, σ )( e i , e j ),25 igure 6: Log-log plots of experimental error rates (in blue) for empirical covariance fields,rates predicted by Corollary 4 (in red), and a least-squares fit of order n − / (in green). the bilinear form Σ α ( x, σ ) evaluated at ( e i , e j ). In matrix notation, this isthe same as (cid:104) e i , Σ α ( x, σ ) e j (cid:105) . More generally, for fixed u, v, x ∈ R d and σ > α ( x, σ )( u, v ) = (cid:90) ( y − x ) ⊗ ( y − x )( u, v ) K ( x, y, σ ) α ( dy ) . (67)Consider the random variable z uv ( y ) = ( y − x ) ⊗ ( y − x ) ( u, v ) K ( x, y, σ ) , (68)where y has distribution α . Clearly, E [ z uv ] = Σ α ( x, σ )( u, v ) . (69) Theorem 5 (Central Limit) . If f is as in Definition 2, then z uv has finitevariance σ uv . Moreover, if z i , i ∈ N , are i.i.d. random variables with thesame distribution as z uv , then √ n (cid:32) n n (cid:88) i =1 z i − Σ α ( x, σ )( u, v ) (cid:33) d −→ N (0 , σ uv ) , as n → ∞ , where convergence is in distribution and N (0 , σ uv ) is normallydistributed with mean zero and variance σ uv . roof. We show that z uv has finite second moment. From (68) and (4), (cid:90) z uv α ( dy ) (cid:54) (cid:107) u (cid:107)(cid:107) v (cid:107) (cid:90) (cid:107) y − x (cid:107) K ( x, y, σ ) α ( dy ) ≤ σ (cid:107) u (cid:107)(cid:107) v (cid:107) C d ( σ ) (cid:90) (cid:107) y − x (cid:107) σ f (cid:18) (cid:107) y − x (cid:107) σ (cid:19) α ( dy ) ≤ σ (cid:107) u (cid:107)(cid:107) v (cid:107) C d ( σ ) C . (70)The last inequality follows from condition (c) in Definition 2 that ensuresthat r f ( r ) < C , for any r >
0. The theorem now follows from a directapplication of the classical CLT.
Remark 6.
Note that (70) implies that if (cid:107) u (cid:107) = (cid:107) v (cid:107) = 1, then σ uv = (cid:90) z uv α ( dy ) − ( E [ z uv ]) ≤ C σ C d ( σ ) , (71)giving a uniform bound on the variance of z uv over x ∈ R d and u, v ∈ S d − .
5. Multiscale Fr´echet Functions
The mean of a random vector y ∈ R d is a simple and yet oftentimes infor-mative, “one-element” summary of the distribution of y . If y has finite secondmoment and is distributed according to the probability measure α , then themean may be characterized more geometrically as the unique minimizer ofthe Fr´echet function F α ( x ) = E (cid:2) (cid:107) y − x (cid:107) (cid:3) = (cid:90) (cid:107) y − x (cid:107) α ( dy ) , (72)which measures the spread of y about x ∈ R d . The mean, however, is not aseffective for complex distributions of practical interest such as multimodaldistributions or those supported in nonlinear subspaces. In this section,we introduce a multiscale analogue of the Fr´echet function that is rich ininformation about the shape of the distribution of y . At each fixed scale,the local minima of the function may be viewed as localized analogues of themean, as illustrated in examples below. However, instead of just focusing onthe local extrema, we take the view that it is more informative to investigatethe behavior of the full multiscale Fr´echet function, as this lets us uncovermore information about the distribution of y .27 efinition 5. Let f : [0 , ∞ ) → R be as in Definition 2 with associated kernel K : R d × R d × (0 , ∞ ) → R . The multiscale Fr´echet function V α : R d × (0 , ∞ ) → R is defined as V α ( x, σ ) := (cid:90) (cid:107) y − x (cid:107) K ( x, y, σ ) α ( dy ) . Proposition 3.
For each σ > , the multiscale Fr´echet function satisfies V α ( x, σ ) = tr Σ α ( x, σ ) . Proof.
Let { e , . . . , e d } ⊂ R d be an orthonormal basis. Then, (cid:107) y − x (cid:107) = d (cid:88) i =1 (cid:104) y − x, e i (cid:105) = d (cid:88) i =1 ( y − x ) ⊗ ( y − x )( e i , e i ) . (73)Hence, V α ( x, σ ) = d (cid:88) i =1 (cid:90) ( y − x ) ⊗ ( y − x )( e i , e i ) K ( x, y, σ ) α ( dy )= d (cid:88) i =1 (cid:18)(cid:90) ( y − x ) ⊗ ( y − x ) K ( x, y, σ ) α ( dy ) (cid:19) ( e i , e i )= d (cid:88) i =1 Σ α ( x, σ )( e i , e i ) = tr Σ α ( x, σ ) , (74)as claimed. Corollary 5 (Stability) . Let f : [0 , ∞ ) → R be as in Definition 2 with mul-tiscale kernel K . Suppose that f is differentiable and there exists a constant A > such that r / | f (cid:48) ( r ) | ≤ A , ∀ r ≥ . Then, there is a constant A f > ,that depends only on f , such that sup x ∈ R d | V α ( x, σ ) − V β ( x, σ ) | ≤ σdA f C d ( σ ) W ( α, β ) , , for any α, β ∈ P ( R d ) and any σ > .Proof. The result follows from Proposition 3, Theorem 1 and the fact thatfor any d × d matrix X , | tr X | ≤ d (cid:107) X (cid:107) , where (cid:107) X (cid:107) is the Frobenius normof X . 28imilarly, Corollary 1 and Proposition 3 yield the following consistencyresult for multiscale Fr´echet functions. Corollary 6 (Consistency) . Suppose that α ∈ P ( R d ) . Let y i ∈ R d , i ∈ N ,be i.i.d. random variables with distribution α and K a multiscale kernel asin Theorem 1. Then, for each fixed σ > , sup x ∈ R d | V α ( x, σ ) − V α n ( x, σ ) | n ↑∞ −−→ almost surely. The following result about convergence of multiscale Fr´echet functionsare immediate consequences of Corollary 2 and Corollary 5.
Corollary 7.
Let f be as in Theorem 1 and σ > . Suppose that α ∈ P ( R d ) and y i , i ∈ N , are i.i.d. random variables with distribution α . Then, there isa constant β > , that depends only on d , such that E (cid:20) sup x ∈ R d | V α ( x, σ ) − V α n ( x, σ ) | (cid:21) ≤ σdA f βC d ( σ ) m ( α ) · n − + n − if d = 1 ; n − + n − log(1 + n ) if d = 2 ; n − + n − d if d ≥ . Remark 7.
Analogous stability and consistency results for the truncationkernel follow from Theorem 3, Corollary 3 and Corollary 4.For more general kernels, the following pointwise central limit theoremholds. For fixed x ∈ R d and σ >
0, let t ( y ) = (cid:107) y − x (cid:107) K ( x, y, σ ) , (75)whose expected value is E [ t ] = V α ( x, σ ). As in Theorem 5, the variance of t is finite and denoted σ t . Theorem 6 (Central Limit) . Let f be as in Definition 2. If t i ∈ R , i ∈ N ,are i.i.d. random variables with the same distribution as t , then √ n (cid:32) n n (cid:88) i =1 t i − V α ( x, σ ) (cid:33) d −→ N (0 , σ t ) , as n → ∞ , where convergence is in distribution and N (0 , σ V ) is normallydistributed with mean zero and variance σ t . α may be fully recovered fromits multiscale Fr´echet function associated with the Gaussian kernel, as thefollowing result shows. Proposition 4.
Let σ > be fixed. Any probability measure α is completelydetermined by the Fr´echet function V α ( · , σ ) associated with the Gaussian ker-nel at scale σ .Proof. Let h σ : R d → R be given by h σ ( x ) = (cid:107) x (cid:107) (2 πσ ) d/ exp (cid:18) − (cid:107) x (cid:107) σ (cid:19) . (76)Then, for the Gaussian kernel, we may express the multiscale Fr´echet functionas the convolution V α ( x, σ ) = ( h σ ∗ α )( x ). Under Fourier transform, for eachfixed σ >
0, we obtain (cid:98) V α ( ξ, σ ) = (cid:98) h σ ( ξ ) φ α ( − πξ ) , (77)where φ α is the characteristic function of α defined as φ α ( ξ ) = (cid:82) e i (cid:104) x,ξ (cid:105) α ( dx ).Therefore, φ α ( ξ ) = (cid:98) V α ( − ξ/ π, σ ) / (cid:98) h σ ( − ξ/ π ) (78)provided that (cid:98) h σ ( − ξ/ π ) (cid:54) = 0. A calculation shows that (cid:98) h σ ( − ξ/ π ) = σ (cid:18) d − σ (cid:107) ξ (cid:107) π (cid:19) exp (cid:18) − σ (cid:107) ξ (cid:107) π (cid:19) , (79)which only vanishes at points ξ on the sphere of radius ρ σ = √ πd/σ aboutthe origin. Thus, (78) implies that we can recover φ α ( ξ ) from (cid:98) V α ( · , σ ), if (cid:107) ξ (cid:107) (cid:54) = √ πd/σ . By continuity, we can recover φ α ( ξ ), for any ξ . The claimnow follows from the fact that the characteristic function φ α determines α [35].The following examples illustrate how information about the shape ofdata can be extracted from multiscale Fr´echet functions. Example . We consider n = 400 data points distributed into two clustersof 200 points, each sampled from a Gaussian of variance 0 .
36 centered atdifferent points. The data points are plotted in blue in Fig. 7(a), which30lso shows the empirical Fr´echet function V n at scale σ = 3. The localminima of V n captures what is perceived as the “centers” of the two clustersat that scale. However, more information about the data distribution canbe uncovered from V n . For example, the local minima may be viewed asattractors of the (negative) gradient field −∇ V n , indicated by the arrows inthe figure. The stable manifold of each attractor, which comprises pointsthat move toward the attractors under the associated flow may be viewedas clusters inferred from the data at that scale. These clusters are delimitedby the repellers of the system, which correspond to the local maxima of V n .Fig. 7(b) shows how V n varies across scales, highlighting the bifurcation ofthe attractors (in red) and repellers (in green) as σ changes. In data analysis,(a) (b) Figure 7: (a) Fr´echet function for data on the line (highlighted in blue) computed withthe Gaussian kernel at scale σ = 3; (b) Fr´echet function across scales. such bifurcation diagrams may find several applications. For example, if thedata represent the distribution of some phenotypic trait for two species thathave evolved from a single group, the multiscale Fr´echet function and theassociated bifurcation diagram let us create an evolutionary model for thetrait from the observed data. Example . Here we consider the dataset in R shown in panel (a) of Fig. 8.Panels (b)–(h) show the Fr´echet function for the Gaussian kernel calculatedat increasing scales. The gradient field −∇ V n at scale σ = 2 .
25 is depictedin panel (a) of Fig. 9 along with the two attractors p and p , and theirstable manifolds that were estimated numerically. The stable manifoldsmay be viewed as estimations at scale σ = 2 .
25 of clusters of the underlyingprobability measure α from which the data was sampled. Panel (b) showsthe gradient field and the covariance tensors at the attractors depicted as31ata σ =1.0 σ = 1 . σ = 2 . σ = 3 . σ = 4 . σ = 5 . σ = 6 . σ = 7 . σ = 9 . Figure 8: Heat maps of the multiscale Fr´echet function for 2D data at increasing scalescomputed with the Gaussian kernel. (a) (b)
Figure 9: (a) 2D data, attractors and their stable manifolds at a fixed scale ( σ = 2 . ellipses with principal radii proportional to the square root of the eigenvaluesof the covariance matrix. This may be viewed as a localized analogue ofprincipal component analysis (PCA) that is able to uncover geometry thatis not detectable with standard PCA. Analysis of the spectra of Σ n ( p i , σ ), i = 1 ,
2, suggests that the data is organized around two one-dimensionalclusters, whereas standard PCA is not sensitive to the local dimensionalitybecause of the orientation of the clusters.32hese examples are intended as proof-of-concept illustrations. Topolog-ical and other methods will be explored in forthcoming work for extractionof structural information from V n .
6. Hierarchical Manifold Clustering
Clustering is a central theme in pattern analysis with a rich history; cf.[36]. One of the most studied forms of the problem is that of partitioning adataset into various subsets if there is some form of spatial separation of thedata into subgroups. Motivated by problems in such areas as computer visionand video analysis, cf. [37], there has been a growing interest in clusteringdata that are organized as a finite union of possibly intersecting subspacesthat have some special geometric structure [38, 16, 18, 26]. As illustrated inFig. 1, the data may consist of noisy samples from an arrangement of (affine)linear subspaces of a Euclidean space such as a collection of lines in a plane,or an arrangement of lines and planes in R . More generally, the clusters maycomprise a finite collection of possibly non-linear, smooth submanifolds of aEuclidean space that intersect transversely. Here we propose an approachto manifold clustering based on CTFs. The basic idea is to use covariancefields to incorporate directional information at each data point. Formally,this is achieved via a section of the tensor bundle R d × ( R d ⊗ R d ) over R d , asfollows. Given a probability measure α and a multiscale kernel, let Σ α ( x, σ )be the associated CTF. For each σ >
0, consider the section ι α ; σ : R d → R d × ( R d ⊗ R d ) given by x (cid:55)→ ( x, Σ α ( x, σ )). On the total space of the tensorbundle, define the metric (cid:107) ( x, Σ) − ( x (cid:48) , Σ (cid:48) ) (cid:107) γ = (cid:0) (cid:107) Σ − Σ (cid:48) (cid:107) + γ (cid:107) x − x (cid:48) (cid:107) (cid:1) / , (80)where x, x (cid:48) ∈ R d , Σ , Σ (cid:48) ∈ R d ⊗ R d , and γ ≥ (cid:107) · (cid:107) onlydefines a pseudo-metric since (cid:107) · (cid:107) disregards “horizontal” distances.For any subset X ⊆ R d , we denote by X α ; γ,σ the metric space ( X, d α ; γ,σ ),where d α ; γ,σ ( x, x (cid:48) ) = (cid:107) ι α ; σ ( x ) − ι α ; σ ( x (cid:48) ) (cid:107) γ . (81)For a dataset A = { a , . . . , a n } ⊂ R d , the proposed clustering method isbased on the single-linkage method [39] applied to the finite metric space A α n ; γ,σ associated with the empirical measure α n = n − (cid:80) ni =1 δ a i . Equiva-lently, clustering is based on the n × n affinity matrix D whose ( i, j )-entry33s d ij = d α n ; γ,σ ( a i , a j ) . (82)Recall that single linkage on a finite metric space A = ( A, d A ) starts from n clusters, each a singleton { a i } , 1 (cid:54) i (cid:54) n , sequentially merging the closestclusters until all data points coalesce into a single cluster. Closeness of twoclusters, say A , A ⊂ A , is measured by the inter-cluster distance d sl ( A , A ) = min a ∈ A ,a (cid:48) ∈ A d A ( a, a (cid:48) ) , (83)We choose single linkage because it yields stable dendrograms, as expoundedbelow, under assumptions on the probability measure from which the datais sampled that are not very restrictive. Combined with our stability andconsistency results for covariance fields, this guarantees that the manifoldclustering method is stable at all stages. We denote a metric space by X = ( X, d X ). An ultrametric space is a pair( X, u X ), where u X : X × X → R + is a metric on X that satisfies the strongtriangle inequality u X ( x, x (cid:48) ) ≤ max { u X ( x, x (cid:48)(cid:48) ) , u X ( x (cid:48)(cid:48) , x (cid:48) ) } , (84)for all x, x (cid:48) , x (cid:48)(cid:48) ∈ X . Any such function u X is called an ultrametric on X .As proved in [32], dendrograms over a finite set X are in structure-preserving, bijective correspondence with ultrametrics on X . In this formula-tion, a hierarchical clustering method can be regarded as a map H : M → U from finite metric spaces into finite ultrametric spaces. Henceforward, H willdenote the map given by single linkage hierarchical clustering. It is known[32] that if X = ( X, d X ) ∈ M , then H ( X ) = ( X, u X ) is given by u X ( x, x (cid:48) ) = min x = x ,...,x r = x (cid:48) max i d X ( x i , x i +1 ) . (85)The minimum above is taken over all finite ordered sequences x , x , . . . , x r of points in X such that x = x and x r = x (cid:48) . If x, x (cid:48) ∈ X , then u X ( x, x (cid:48) )may be interpreted as the dendrogram level at which the clusters containing x and x (cid:48) first merge. This is known as the cophenetic distance between x and x (cid:48) . 34he main goal of this section is to formulate and prove stability of themap M (cid:51) X (cid:55)→ H ( X ) ∈ U . The question of stability of single linkageclustering can be approached using ideas related to the Gromov-Hausdorffdistance [40], as follows. A correspondence R between two sets X and Y isa subset of X × Y such that π ( R ) = X and π ( R ) = Y , where π and π denote projections onto the first and second factors. 
The main goal of this section is to formulate and prove stability of the map $\mathcal{M} \ni \mathbb{X} \mapsto H(\mathbb{X}) \in \mathcal{U}$. The question of stability of single linkage clustering can be approached using ideas related to the Gromov-Hausdorff distance [40], as follows. A correspondence $R$ between two sets $X$ and $Y$ is a subset of $X \times Y$ such that $\pi_1(R) = X$ and $\pi_2(R) = Y$, where $\pi_1$ and $\pi_2$ denote the projections onto the first and second factors. Given $X$ and $Y$, we denote by $\mathcal{R}(X,Y)$ the set of all correspondences between $X$ and $Y$.

Definition 6.
Let $\mathbb{X}$ and $\mathbb{Y}$ be compact metric spaces.

(i) The distortion of a correspondence $R$ between $X$ and $Y$ is defined by
$$\mathrm{dis}(R; \mathbb{X}, \mathbb{Y}) := \max_{(x,y), (x',y') \in R} |d_X(x,x') - d_Y(y,y')|.$$

(ii) The Gromov-Hausdorff distance between $\mathbb{X}$ and $\mathbb{Y}$ is given by
$$d_{GH}(\mathbb{X}, \mathbb{Y}) := \frac{1}{2} \inf_R \mathrm{dis}(R; \mathbb{X}, \mathbb{Y}),$$
where the infimum is taken over all correspondences between $X$ and $Y$.
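For finite metric spaces, the distortion of a given correspondence is easy to evaluate; it is the minimization over all correspondences in $d_{GH}$ that is computationally hard. A sketch of ours, with the correspondence encoded as an array of index pairs (our own convention):

```python
import numpy as np

def distortion(R, DX, DY):
    """dis(R; X, Y) for a correspondence R given as an (m, 2) integer array
    of index pairs (i, j), with DX, DY the distance matrices of X and Y."""
    i, j = R[:, 0], R[:, 1]
    # |d_X(x, x') - d_Y(y, y')| over all pairs of pairs drawn from R
    return np.abs(DX[np.ix_(i, i)] - DY[np.ix_(j, j)]).max()
```

With $X$ and $Y$ indexed by the same set, the identity correspondence `np.stack([np.arange(n)] * 2, axis=1)` gives the entrywise sup-norm difference of the two distance matrices, hence the upper bound $d_{GH} \leq \frac{1}{2} \max_{i,j} |d_X(x_i,x_j) - d_Y(y_i,y_j)|$.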
Proposition 5. For any $\mathbb{X}, \mathbb{Y} \in \mathcal{M}$ and any correspondence $R \in \mathcal{R}(X,Y)$,
$$\mathrm{dis}(R; H(\mathbb{X}), H(\mathbb{Y})) \leq \mathrm{dis}(R; \mathbb{X}, \mathbb{Y}).$$
As a consequence, $d_{GH}(H(\mathbb{X}), H(\mathbb{Y})) \leq d_{GH}(\mathbb{X}, \mathbb{Y})$.

Remark 8.
The claim of the proposition may be written, equivalently, as follows. If $u_X$ and $u_Y$ denote the ultrametrics produced by single linkage hierarchical clustering on $\mathbb{X}$ and $\mathbb{Y}$, then
$$|u_X(x,x') - u_Y(y,y')| \leq \max_{(x,y), (x',y') \in R} |d_X(x,x') - d_Y(y,y')|, \qquad (86)$$
for any correspondence $R$ between $X$ and $Y$ and all $(x,y), (x',y') \in R$.

Proof of Proposition 5.
We prove (86). Given a correspondence $R \in \mathcal{R}(X,Y)$ and $(x,y), (x',y') \in R$, let $x = x_0, x_1, \ldots, x_n = x'$ in $X$ be such that $\max_i d_X(x_i, x_{i+1}) = u_X(x,x')$. Let $y_0 = y$, $y_n = y'$ and choose $y_1, \ldots, y_{n-1} \in Y$ such that $(x_i, y_i) \in R$ for all $i = 1, \ldots, n-1$. This is possible since any correspondence $R$ satisfies $\pi_1(R) = X$. Notice that
$$\begin{aligned}
u_Y(y,y') &\leq \max_i d_Y(y_i, y_{i+1}) \\
&\leq \max_i \left( d_X(x_i, x_{i+1}) + |d_X(x_i, x_{i+1}) - d_Y(y_i, y_{i+1})| \right) \\
&\leq \max_i d_X(x_i, x_{i+1}) + \max_{(x,y), (x',y') \in R} |d_X(x,x') - d_Y(y,y')| \\
&= u_X(x,x') + \max_{(x,y), (x',y') \in R} |d_X(x,x') - d_Y(y,y')|. \qquad (87)
\end{aligned}$$
The claim follows since (87) also holds if we reverse the roles of $X$ and $Y$.
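As an illustrative numerical check of Proposition 5 (ours, not part of the argument): perturb a point cloud, take the identity correspondence between the two copies, and compare distortions before and after applying $H$. This reuses `minimax_ultrametric` and `distortion` from the sketches above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
Y = X + 0.05 * rng.normal(size=X.shape)                # small perturbation of X

DX, DY = squareform(pdist(X)), squareform(pdist(Y))
R = np.stack([np.arange(30), np.arange(30)], axis=1)   # identity correspondence

UX, UY = minimax_ultrametric(DX), minimax_ultrametric(DY)
# dis(R; H(X), H(Y)) <= dis(R; X, Y), as Proposition 5 asserts
assert distortion(R, UX, UY) <= distortion(R, DX, DY) + 1e-12
```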
Lemma 2. Let $\alpha, \beta \in \mathcal{P}_\infty(\mathbb{R}^d)$ and $\sigma > 0$. If a kernel satisfies the conditions of Lemma 1, then
$$\sup_{(a,b) \in R_\mu} \|\Sigma_\alpha(a,\sigma) - \Sigma_\beta(b,\sigma)\| \leq \frac{2 A_f \sigma}{C_d(\sigma)} \sup_{(a,b) \in R_\mu} \|a - b\|,$$
for any coupling $\mu \in \Gamma(\alpha,\beta)$, where $R_\mu := \mathrm{supp}[\mu]$ and $A_f > 0$ is as in Lemma 1.

Proof. Let $\mu \in \Gamma(\alpha,\beta)$ and set $\zeta = \sup_{(a,b) \in \mathrm{supp}[\mu]} \|a - b\|$. Let $(y,y'), (a,b) \in \mathrm{supp}[\mu]$. In the notation of Lemma 1, setting $z_1 = y - a$ and $z_2 = y' - b$, we have
$$\|Q_\sigma(y-a) - Q_\sigma(y'-b)\| \leq \frac{A_f \sigma}{C_d(\sigma)} \left( \|y - y'\| + \|a - b\| \right) \leq \frac{2 A_f \sigma}{C_d(\sigma)} \zeta, \qquad (88)$$
where in the last inequality we used $\|y - y'\| \leq \zeta$ and $\|a - b\| \leq \zeta$. Since
$$\|\Sigma_\alpha(a,\sigma) - \Sigma_\beta(b,\sigma)\| \leq \iint \|Q_\sigma(y-a) - Q_\sigma(y'-b)\| \, \mu(dy \times dy'), \qquad (89)$$
the lemma follows.

Lemma 3 (Lemma 2.2 of [41]). Let $\alpha, \beta \in \mathcal{P}_\infty(\mathbb{R}^d)$. Then, for any coupling $\mu \in \Gamma(\alpha,\beta)$, $R_\mu = \mathrm{supp}[\mu]$ gives a correspondence between $A = \mathrm{supp}[\alpha]$ and $B = \mathrm{supp}[\beta]$.
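For uniform empirical measures on two point sets of equal size, the $\infty$-Wasserstein distance appearing below reduces to a bottleneck assignment: the smallest threshold $t$ at which the bipartite graph of pairs with $\|a_i - b_j\| \leq t$ admits a perfect matching. The sketch below is our own (the paper does not prescribe an algorithm here); it binary-searches the threshold over the pairwise distances.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching
from scipy.spatial.distance import cdist

def w_infinity(A, B):
    """W_inf between uniform empirical measures on equal-size point sets A, B:
    the smallest t such that a perfect matching exists using only pairs with
    ||a_i - b_j|| <= t. Candidate thresholds are the pairwise distances."""
    D = cdist(A, B)
    cands = np.unique(D)
    lo, hi = 0, len(cands) - 1
    while lo < hi:                               # binary search on the threshold
        mid = (lo + hi) // 2
        graph = csr_matrix((D <= cands[mid]).astype(np.int8))
        match = maximum_bipartite_matching(graph, perm_type="column")
        if (match >= 0).all():                   # perfect matching at this level
            hi = mid
        else:
            lo = mid + 1
    return cands[lo]
```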
Theorem 7. Let $\alpha, \beta \in \mathcal{P}_\infty(\mathbb{R}^d)$, $A = \mathrm{supp}[\alpha]$, $B = \mathrm{supp}[\beta]$, $\sigma > 0$ and $\gamma \geq 0$. Then, for any kernel satisfying the conditions of Lemma 1,
$$d_{GH}\big( (A, d_{\alpha;\gamma,\sigma}), (B, d_{\beta;\gamma,\sigma}) \big) \leq \left( \frac{2 A_f \sigma}{C_d(\sigma)} + \gamma \right) W_\infty(\alpha,\beta),$$
with $A_f > 0$ as in Lemma 1.

Proof. Let $\mu \in \Gamma(\alpha,\beta)$ be a coupling that realizes $W_\infty(\alpha,\beta)$. By Lemma 3, $R_\mu = \mathrm{supp}[\mu]$ is a correspondence between $A$ and $B$. Thus,
$$\begin{aligned}
d_{GH}\big( A_{\alpha;\gamma,\sigma}, B_{\beta;\gamma,\sigma} \big) &\leq \tfrac{1}{2} \, \mathrm{dis}(R_\mu; A, B) \\
&= \tfrac{1}{2} \sup_{(a,b), (a',b') \in R_\mu} |d_{\alpha;\gamma,\sigma}(a,a') - d_{\beta;\gamma,\sigma}(b,b')| \\
&\leq \tfrac{1}{2} \sup_{(a,b), (a',b') \in R_\mu} \Big( \|\Sigma_\alpha(a,\sigma) - \Sigma_\beta(b,\sigma)\| + \|\Sigma_\alpha(a',\sigma) - \Sigma_\beta(b',\sigma)\| \\
&\qquad\qquad\qquad\qquad + \gamma \|a - b\| + \gamma \|a' - b'\| \Big) \\
&\leq \sup_{(a,b) \in R_\mu} \|\Sigma_\alpha(a,\sigma) - \Sigma_\beta(b,\sigma)\| + \gamma \sup_{(a,b) \in R_\mu} \|a - b\| \\
&\leq \left( \frac{2 A_f \sigma}{C_d(\sigma)} + \gamma \right) \sup_{(a,b) \in R_\mu} \|a - b\|, \qquad (90)
\end{aligned}$$
where the last step follows from Lemma 2. The conclusion follows since $W_\infty(\alpha,\beta) = \sup_{(a,b) \in R_\mu} \|a - b\|$.

Combining Proposition 5 and Theorem 7, we obtain:
Corollary 8 (Stability of Hierarchical Manifold Clustering). Let $\alpha, \beta \in \mathcal{P}_\infty(\mathbb{R}^d)$ be probability measures with finite support, $A = \mathrm{supp}[\alpha]$, $B = \mathrm{supp}[\beta]$, $\sigma > 0$ and $\gamma \geq 0$. Then,
$$d_{GH}\big( H(A_{\alpha;\gamma,\sigma}), H(B_{\beta;\gamma,\sigma}) \big) \leq \left( \frac{2 A_f \sigma}{C_d(\sigma)} + \gamma \right) W_\infty(\alpha,\beta).$$

A question that our paper leaves open is whether, under a reasonable generative model for the sampling from a collection of intersecting manifolds, one may be able to prove that the empirical dendrogram converges in probability to a dendrogram that represents the spatial organization of the underlying manifolds. In the simpler context of clustering $n$ i.i.d. samples $\{x_1, \ldots, x_n\}$ from a compact metric space $(X, d_X)$ endowed with a Borel probability measure $\alpha_X$, it has been established in [32] that single linkage hierarchical clustering converges to a dendrogram whose hierarchical structure depends on the support of $\alpha_X$ in a precise way. In the case of flat clustering, i.e., when the goal is to obtain a single partition of the dataset, consistency results for some multi-manifold clustering methods based on local PCA are given in [16, 17, 18].
Let $X = \{x_1, \ldots, x_n\}$ be a dataset in $\mathbb{R}^d$. For $\gamma, \sigma > 0$, we apply the single linkage method to the metric space $X_{\alpha_n;\gamma,\sigma} = (X, d_{\alpha_n;\gamma,\sigma})$, where $\alpha_n$ is the empirical measure associated with $X$ and $d_{\alpha_n;\gamma,\sigma}$ is the distance defined in (81). The ultrametric associated with $H(X_{\alpha_n;\gamma,\sigma})$ is abbreviated $u_{\alpha_n;\gamma,\sigma}$.

In this setting, identifying informative dendrogram cutoff levels is often an important task, which can be approached in different ways depending on the nature of the problem. For example, a cutoff level $h$ may be based on a pre-assigned number of clusters, be learned from training data, or be chosen in a more exploratory manner. We give examples that illustrate all three viewpoints.

Example 9. In this experiment we consider the unlabeled point cloud in Fig. 1(a) that represents an arrangement of two parallel planes and two lines that intersect the planes transversely. Each plane contains 225 points on a uniform grid and each line contains 30 equally spaced points. Cutting the dendrogram at four clusters, our method finds the four affine linear subspaces accurately with the Gaussian kernel at $\sigma = 0.6$.
In this case, it is important to choose $\gamma \neq 0$ since the spatial component of (82) is needed to discriminate the parallel planes. In Fig. 1(a), the points are colored according to cluster membership. The covariance tensors at data points on the planes that are away from the cluster intersections have two dominating eigenvalues, whereas for points on the lines they have only one such eigenvalue. Thus, an analysis of the spectrum of the covariance tensors at data points lets us infer the dimension of each cluster, as sketched below.
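A minimal sketch of this spectral dimension estimate, reusing `empirical_ctf` from the pipeline sketch above; the eigenvalue-ratio threshold 0.1 is an illustrative choice of ours, not a value taken from the paper.

```python
import numpy as np

def cluster_dimension(pts, sigma, ratio=0.1):
    """Estimate the intrinsic dimension of a cluster from the spectra of the
    CTFs at its points: count eigenvalues exceeding `ratio` times the largest,
    then take the median count over the cluster."""
    counts = []
    for T in empirical_ctf(pts, sigma):      # one d x d tensor per point
        ev = np.linalg.eigvalsh(T)           # real eigenvalues, ascending
        counts.append(int(np.sum(ev > ratio * ev[-1])))
    return int(np.median(counts))
```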
Example 10 (A Line Arrangement). In this example, the point-cloud data represents three intersecting lines in $\mathbb{R}^2$, as shown in Fig. 10(a). Each line segment is sampled at 200 equally spaced points. Since the slopes of the lines are different, we expect the covariance matrices to be able to cluster the points without the aid of additional spatial information. Thus, we set $\gamma = 0$ in (82) and $\sigma = 0.4$. The number of clusters was set to six to test the ability of the algorithm to detect not only the lines, but also the three intersections. Fig. 10(b) shows the single-linkage dendrogram, highlighting each of the six clusters. The data points are colored according to cluster membership. As expected, well delineated clusters are detected away from the intersection points because the covariance matrices are highly anisotropic with principal axes that align well with the corresponding line segments. Although the covariance matrices are not as anisotropic near the intersection points, there are enough differences in their behavior near the three intersection loci for the algorithm to be able to place them into different clusters.

Figure 10: (a) an arrangement of three lines and (b) clustering dendrogram; (c) noisy lines with outliers and (d) clustering dendrogram.

The next two examples are of a more exploratory nature in that the dendrogram cutoff was chosen through experimentation with the data.
Example 11 (Noisy Lines with Outliers). This is a noisy version of Example 10, as shown in Fig. 10(c). As before, each line is represented by 200 points, but we have added Gaussian noise of width 0.015 to the data, as well as 180 outliers sampled from the uniform distribution on a rectangle containing the lines. Because of the nature of the data, the number of clusters was set to $m = 80$ so that the three main clusters did not get merged because of the outliers. The figure also shows a line fitted to each of the three largest clusters using principal component analysis, as sketched below. The method was able to sharply recover the three lines, even in the presence of noise and outliers. The majority of the 80 clusters are singletons of outliers and these are colored black in the figure. We remark that the choice of $\sigma = 0.51$ is crucial when dealing with data contaminated by noise. In this case, it was also important to set $\gamma \neq 0$ to better cope with noise.
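The line-fitting step is ordinary PCA: each of the top clusters gets the line through its centroid along its leading principal direction. A brief sketch, assuming cluster labels produced by the hypothetical `ctf_single_linkage` above:

```python
import numpy as np

def fit_line_pca(cluster_pts):
    """Fit a line (point, unit direction) to a cluster by PCA: the line passes
    through the centroid along the leading singular direction of the
    centered cluster."""
    mu = cluster_pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(cluster_pts - mu, full_matrices=False)
    return mu, Vt[0]          # leading right singular vector = first PC

# e.g., fit the three largest clusters (labels start at 1 with fcluster):
# top = np.argsort(np.bincount(labels)[1:])[::-1][:3] + 1
# lines = [fit_line_pca(pts[labels == c]) for c in top]
```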
Example 12 (Floor Cracks). We apply the clustering method to the segmentation of two images of concrete floor cracks. Panel (a) of Fig. 11 shows the original images, whereas panel (b) shows binary images obtained from an edge detection algorithm. We cluster the foreground pixels of the binary images. As in Example 11, it is important to allow a fairly large number of clusters so that the clusters that detect the main cracks do not get merged because of the noisy pixels. Panels (c) and (d) show the outputs (not to scale) of the clustering algorithm.

Figure 11: (a) original and (b) processed images of floor cracks; (c) and (d) show clustering based on CTFs.
To further test the ability of the method to cluster intersecting manifolds, we experimented with synthetic data comprising multiple arrangements such as the intersecting lines in Fig. 1(b).
Example 13. We consider three synthetic datasets of point clouds representing random arrangements of: (i) three line segments in $\mathbb{R}^2$; (ii) four curves in $\mathbb{R}^2$ that are either line segments or arcs of parabolas; and (iii) three patches of planes in $\mathbb{R}^3$. Each of these datasets contains a total of 250 point clouds, 50 used for training the algorithm and 200 test samples. The points in each point cloud are labeled to allow quantification of the accuracy of the output of the algorithm. Fig. 12 shows a few samples from each of these datasets. Parameter values that optimize classification performance are learned from the training samples. Note, however, that even though the number $k$ of clusters is known, specifying a height $h$ that yields precisely $k$ clusters may yield undesirable results. For example, important clusters representing different components of an arrangement of manifolds may get merged due to the presence of outliers or the behavior near the intersections. Thus, it is often preferable to choose a lower cutoff level before this phenomenon occurs, at the expense of getting a larger number $\ell$ of clusters. In such situations, we select the largest $k$ clusters and assign each point in the remaining $(\ell - k)$ clusters to the closest of the top $k$ clusters. Experiments indicate that a good baseline for the cutoff level $h$ is the mean cophenetic distance, which for a point cloud $\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ is given by
$$h = \frac{2}{n(n-1)} \sum_{i<j} u_{\alpha_n;\gamma,\sigma}(x_i, x_j).$$

Figure 12: Random arrangements of line segments (row 1), segments of lines and parabolas (row 2), and plane patches (row 3).

The synthetic data used in the manifold clustering experiments is available at https://bitbucket.org/diegodiaz-math/ctf-files/.

Acknowledgements

This research was supported in part by NSF grants IIS-1422400, DMS-1418007 and DBI-1262351, and by the Erwin Schrödinger Institute in Vienna. We thank Dejan Slepčev for a discussion about the results of [6].

References

[1] A. Little, Y.-M. Jung, M. Maggioni, Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale SVD, in: IEEE/SP 15th Workshop on Statistical Signal Processing, 2009, pp. 85–88.
[2] D. Díaz Martínez, F. Mémoli, W. Mio, Multiscale covariance fields, local scales, and shape transforms, in: F. Nielsen, F. Barbaresco (Eds.), Geometric Science of Information, Vol. 8085 of Lecture Notes in Computer Science, Springer, 2013, pp. 794–801.
[3] C. Villani, Topics in Optimal Transportation, Vol. 58 of Graduate Studies in Mathematics, American Mathematical Society, Providence, RI, 2003.
[4] C. Villani, Optimal Transport, Old and New, Springer, 2009.
[5] N. Fournier, A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields.
[6] N. García-Trillos, D. Slepčev, On the rate of convergence of empirical measures in ∞-transportation distance, Canadian Journal of Mathematics, doi:10.4153/CJM-2014-044-6.
[7] W. Allard, G. Chen, M. Maggioni, Multi-scale geometric methods for data sets II: Geometric multi-resolution analysis, Appl Comput Harmonic Anal 32 (2012) 435–462.
[8] J. Berkmann, T. Caelli, Computation of surface geometry and segmentation using covariance techniques, IEEE Trans Pattern Anal Mach Intell 16 (1994) 1114–1116.
[9] R. Rusu, Semantic 3D object maps for everyday manipulation in human living environments, Ph.D. thesis, Technische Universität München (2009).
[10] G. Medioni, M.-S. Lee, C.-H.
Tang, A Computational Framework for Segmentation and Grouping, Elsevier Science, 2000.
[11] T. Brox, B. Rosenhahn, D. Cremers, H.-P. Seidel, Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking, in: Human Motion – Understanding, Modeling, Capture and Animation, Vol. 4814 of Lecture Notes in Computer Science, Springer, 2007, pp. 152–165.
[12] D. Kushnir, M. Galun, A. Brandt, Fast multiscale clustering and manifold identification, Pattern Recognition 39 (10) (2006) 1876–1891.
[13] A. B. Goldberg, X. Zhu, A. Singh, Z. Xu, R. D. Nowak, Multi-manifold semi-supervised learning, in: AISTATS, 2009, pp. 169–176.
[14] D. Gong, X. Zhao, G. Medioni, Robust multiple manifolds structure learning, arXiv preprint arXiv:1206.4624.
[15] Y. Wang, Y. Jiang, Y. Wu, Z.-H. Zhou, Spectral clustering on multiple manifolds, IEEE Transactions on Neural Networks 22 (7) (2011) 1149–1161.
[16] E. Arias-Castro, Clustering based on pairwise distances when the data is of mixed dimensions, IEEE Transactions on Information Theory 57 (3) (2011) 1692–1706.
[17] E. Arias-Castro, G. Chen, G. Lerman, et al., Spectral clustering based on local linear approximations, Electronic Journal of Statistics 5 (2011) 1537–1587.
[18] E. Arias-Castro, G. Lerman, T. Zhang, Spectral clustering based on local PCA, arXiv preprint arXiv:1301.2007.
[19] R. Vidal, Y. Ma, S. Sastry, Generalized principal component analysis (GPCA), IEEE Trans Pattern Anal Mach Intell 27 (12) (2005) 1945–1959.
[20] V. Govind, A tensor decomposition for geometric grouping and segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, 2005, pp. 1150–1157.
[21] G. Chen, G. Lerman, Spectral curvature clustering (SCC), Int J Comput Vis 81 (3) (2009) 317–330.
[22] G. Lerman, T. Zhang, Robust recovery of multiple subspaces by geometric $\ell_p$