Heuristic Framework for Multi-Scale Testing of the Multi-Manifold Hypothesis
F. Patricia Medina, Linda Ness, Melanie Weber, Karamatou Yacoubou Djima
Abstract
When analyzing empirical data, we often find that global linear models overestimate the number of parameters required. In such cases, we may ask whether the data lies on or near a manifold or a set of manifolds (a so-called multi-manifold) of lower dimension than the ambient space. This question can be phrased as a (multi-)manifold hypothesis. The identification of such intrinsic multiscale features is a cornerstone of data analysis and representation, and has given rise to a large body of work on manifold learning. In this work, we review key results on multi-scale data analysis and intrinsic dimension, followed by the introduction of a heuristic, multiscale framework for testing the multi-manifold hypothesis. Our method implements a hypothesis test on a set of spline-interpolated manifolds constructed from variance-based intrinsic dimensions. The workflow is suitable for empirical data analysis, as we demonstrate on two use cases.
In many empirical data sets, the dimension of the ambient space exceeds the number of parameters required to parametrize local models. Geometrically, this is evident in data sets sampled from a manifold of lower dimension than the ambient space. The simplest hypothesis for explaining this observation is that the number of local
F. Patricia Medina, Worcester Polytechnic Institute; e-mail: [email protected]
Linda Ness, Rutgers University; e-mail: [email protected]
Melanie Weber, Princeton University; e-mail: [email protected]
Karamatou Yacoubou Djima, Amherst College; e-mail: [email protected]

parameters required to model the data is constant. We can formalize this by asking whether the data lies on or near a d-dimensional manifold or whether the data was sampled from a distribution supported on a manifold. This manifold hypothesis is central to the field of manifold learning. In the present article, we outline a heuristic framework for a hypothesis test suitable for computation and empirical data analysis. We consider sets of manifolds (multi-manifolds) instead of single manifolds, since empirical data is more likely to lie near a multi-manifold than on a single manifold (see, e.g., [1]). For this, consider the following motivating question: Given a data set in R^n, is it on or near a multi-manifold? Note that the manifolds do not need to be linear; they may have different intrinsic dimensions and they may intersect.
Proposition 1 (Multi-Manifold Hypothesis Test). Given a data set X = {x_i}_{i ∈ I} in R^D and a multi-manifold V, is the expected distance of the points in X to V larger than one would expect? If so, reject V as a multi-manifold that fits X.

This hypothesis is closely related to the identification of intrinsic dimensions. A large body of work has been devoted to the study and computation of intrinsic dimension. If the data set can be partitioned into subsets, each of which has a single intrinsic dimension, hypothesis testing methods might be applied to the corresponding subsets separately.

In the present paper, we propose a heuristic framework for testing a multi-manifold hypothesis on real-world data sets. Our method partitions a data set into subsets based on intrinsic dimension and constructs a multi-manifold whose dimensional components fit the partitions. Finally, we compute test statistics to evaluate the goodness-of-fit of a candidate multi-manifold with the data set. To our knowledge, this is the first implementable heuristic for multi-manifold hypothesis testing.

For efficiently computing intrinsic dimensions, we introduce a multi-scale, variance-based notion of intrinsic dimension, denoted d_VLID. We define d_VLID to be a pointwise statistic that depends on a set of local neighborhoods at each point: the intrinsic dimension d is computed for the data points in each local neighborhood, and d_VLID is the minimum of these intrinsic dimensions. Hence points sampled from a local manifold of dimension d have d_VLID equal to d. A more formal definition is given in Section 2.4. We demonstrate our method on two low-dimensional, densely sampled data sets with visible geometry: one data set is a sample from a sphere-line configuration (see Fig. 2), the other a subset of a 3-dimensional image of the Golden Gate Bridge recorded with LiDAR technology. The computational experiments demonstrate that multi-scale techniques can be used to overcome the issue of linear models overestimating the dimension of the underlying data. The decomposition of the data set into subsets with a single local intrinsic dimension promises to improve the understanding of the data and to provide features that could be used as preprocessed input to further analysis and machine learning tools.

Our method provides a practical heuristic for testing a manifold hypothesis, which is central to manifold learning. The introduced framework is general and can be implemented using a variety of computational tools in different parts of the workflow. Two fundamental types of statistical reasoning, hypothesis testing and variance-based analysis, are used in combination with multiscale representation methods.

We start with an extensive review of (multi-scale) techniques for dimensionality analysis and manifold learning (Section 2). In Section 3 we propose a heuristic, multiscale framework for testing a multi-manifold hypothesis. Section 4 describes our implementation of the framework, including a variance-based notion of intrinsic dimension that we developed as part of the workflow. We demonstrate our method on (i) a simple sphere-line configuration and (ii) imaging data obtained with LiDAR technology. The paper concludes with a list of open questions and directions that we suggest for future work.
In this section, we review related work on manifold learning and geometric data analysis that underlies or motivates the ideas outlined in the paper.
A real-world data set X is typically a set of m vectors x_i with D components. Hence the data set X is a subset of m points in a D-dimensional Euclidean space, denoted X ⊂ R^D. A central question in manifold learning is: is X on or near a manifold of dimension d < D? If so, i.e., if the manifold hypothesis is true, then it is reasonable to expect that X has another representation as a subset of a space of dimension d < D, where d may be much smaller than D, denoted d ≪ D. Furthermore, because the data points are on a manifold which may be curved in its embedding space, the most natural or informative dimension reduction may be non-linear.

Many results in manifold learning are focused on dimension reduction mappings f : X → R^d, defined by non-linear functions of the D-dimensional coordinates of points in X. Conveniently, the mappings can be defined for general data sets X; they do not require the manifold hypothesis to be validated first. Some of the first papers that discussed this and presented examples and non-linear methods for dimension reduction are [19, 23, 43, 44, 50]. These methods could be used to infer non-linear parameters, e.g., the pose variables and azimuth lighting angle sufficient to parametrize a set of images, which would have been invisible to traditional dimension reduction techniques such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) [50]. Laplacian Eigenmaps (LE) is a well-known example of a non-linear dimension reduction mapping, defined by the first few eigenfunctions of the normalized Laplacian matrix associated to a given data set. LE is used in numerous applications and is very popular in spectral clustering [40].
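As a concrete illustration, the LE embedding just described can be sketched in a few lines. This is a minimal dense-matrix sketch: the function name, the `eps` bandwidth and the use of a full kernel matrix rather than a sparse neighborhood graph are our illustrative choices, not a reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_components=2, eps=1.0):
    """Sketch of a Laplacian Eigenmaps embedding of the rows of X (N x D)."""
    # Heat-kernel similarities L_ij = exp(-||x_i - x_j||^2 / eps), cf. Eq. (1).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    L = np.exp(-sq / eps)
    D = np.diag(L.sum(axis=1))   # diagonal normalization matrix
    M = D - L                    # graph Laplacian
    # Generalized eigenproblem M v = lambda D v; the first eigenvector is
    # constant (eigenvalue ~0), so skip it and keep the next n_components.
    _, vecs = eigh(M, D)
    return vecs[:, 1:n_components + 1]
```

For large N one would replace the dense kernel by a k-nearest-neighbor graph, but the spectral structure used is the same.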
In [5, 7], Belkin and Niyogi justified the LE algorithm by proving that, when a sufficiently large data set X is uniformly sampled from a low-dimensional manifold M of R^D, the first few eigenvectors of the normalized Laplacian matrix are discrete approximations of the eigenfunctions of the Laplace-Beltrami operator on the manifold. Recall that the normalized Laplacian matrix is M = D − L, where L is the similarity kernel matrix whose entries L_{i,j} are defined by

$$L_{i,j} = \exp\left(-\frac{\|x_i - x_j\|^2}{\varepsilon}\right) \qquad (1)$$

and D is the diagonal normalization matrix with entries $D_{i,i} = \sum_j L_{i,j}$.

Subsequent research has focused on non-linear dimension reduction mappings that approximately preserve distances. Using a symmetric matrix adjoint to the normalized Laplacian, Nadler, Lafon, Coifman and Kevrekidis [38] defined a non-linear dimension reduction mapping known as Diffusion Maps, which approximately preserves diffusion distances. The normalized Laplacian and its symmetric adjoint are stochastic matrices and hence define random walks; the diffusion distance at time t between two points x_i and x_j is the probability that the random walks starting at x_i and x_j will reach the same point at time t. This distance is a more accurate and robust model for the distance traveled by moving to nearby points, i.e., the distance traveled by moving along the manifold best fitting the data points. Diffusion maps have been applied to many types of data sets, for example, in characterizing the properties of molecular dynamics [17, 42, 52, 55]. In further related developments, Rohrdanz, Zheng, Maggioni and Clementi [54] used locally scaled diffusion maps to more accurately determine reaction coordinates. Joncas, Meila and McQueen [29] developed methods for defining and computing non-linear dimension reduction mappings that approximately preserve the original metric induced by the ambient Euclidean metric.
Approximate preservation of this metric would enable preservation of shape properties involving curvature. McQueen, Meila, VanderPlas and Zhang have developed and documented Megaman, a scalable, publicly available software package for manifold learning from data [37]. Our intrinsic dimension algorithms demonstrate automated methods for decomposing data sets into subsets, each of which lies on or near a (not necessarily linear) submanifold of a single dimension.

The manifold hypothesis is central to the area of manifold learning. Recent work by Fefferman, Mitter, and Narayanan [24] formulated and proved a manifold hypothesis test, thereby providing a theoretical framework for testing whether a given data set lies on or near a manifold. Narayanan and Mitter obtained bounds on the sample complexity of Empirical Risk Minimization for a class of manifolds with bounds on dimension, volume, and curvature [39].

For data sampled from a single manifold of dimension d in R^D with a restricted noise model, Chen, Little and Maggioni [15] introduced Geometric Multi-Resolution Analysis (GMRA). Using a notion of geometric wavelets, they show that one can construct a linear multi-manifold that gives a good local approximation to this manifold on certain scales. The local linear multi-manifold can be obtained by projecting onto the local linear subspace determined by the intrinsic dimension. GMRA exploits a dyadic tree to decompose the manifold and sampled data into pieces at each scale. The current implementation of our method also uses a dyadic tree and computes local linear approximations to the data.

Lerman and collaborators noted that empirical data is more likely to fit a set of manifolds than a single manifold, motivating the notion of multi-manifolds that we adopt here.
We review recent work on multi-manifolds that motivated our approach. Arias-Castro, Chen and Lerman [1] point out that when a data set lies on or near multiple manifolds in Euclidean space, the "foremost" problem is clustering the data into subsets associated with different manifolds. They propose a Higher Order Spectral Clustering (HOSC) algorithm that applies spectral clustering to a pairwise affinity function. The algorithm provably outperforms other clustering methods (e.g., Ng, Jordan and Weiss [40]) in its accuracy on small scales and under low sampling rates. It utilizes the notion of tubular neighborhoods around manifolds and leverages the definition of correlation intrinsic dimension [26, 34] to determine the radii of these neighborhoods. The approach assumes that the data lies almost completely in these neighborhoods, with the exception of a set of outliers which satisfy particular sampling assumptions. While we adopt some of these ideas, our heuristic approach does not make this assumption, nor does it assume a particular sample distribution. Additional context on multi-manifolds can be found in [53] and the references therein.

A challenging problem is to determine if a set of data is a subset of "nice" manifolds, i.e., is piece-wise smooth. One way to make this precise is the notion of rectifiability:
Definition 1 (Rectifiability).
A subset X ⊂ R^D with Hausdorff dimension d ∈ Z is said to be rectifiable if it is contained in the union of a countable family of d-dimensional Lipschitz graphs, with the exception of a set of Hausdorff measure zero.
A stronger quantitative condition implying rectifiability, the BPLG property, was established by David and Semmes [22]. Prior to this, Jones [30] proved a necessary and sufficient condition for a subset of the plane to be contained in a plane curve (i.e., in the image of the unit interval under a Lipschitz mapping). He defined β-numbers for each scale and location, which measure the deviation of a set from the best-fitting line, and proved that the length of the curve is bounded in terms of the sum of the β-numbers. Recently, Azzam and Schul [2] proved a variant of Jones' theorem for a more general case, providing bounds on the Hausdorff measure for integer-dimensional subsets of Euclidean spaces using a generalization of Jones' β-numbers.

In this paper, we did not attempt to determine if there are conditions on subsets of Euclidean spaces with a specified variance-based intrinsic dimension which would guarantee quantitative rectifiability. Jones' multi-scale techniques and the statistics associated with each location and scale inspired our multi-scale definition of variance-based dimension for each locality. The approach in this paper enlarges the class of Multi-scale SVD (MSVD) unsupervised learning techniques (sometimes referred to as MLPCA) used previously to automatically generate features for supervised machine learning [3, 4, 8].

We now review the notion of stratified spaces, which is used synonymously with multi-manifolds:
Definition 2 (Stratified Space).
A stratified space is a topological space that can be decomposed into manifolds.

While the two notions are closely related, the emphasis of stratified spaces is topological. Bendich, Gasparovic, Tralie and Harer [9] used stratified spaces to develop a heuristic approach for partitioning the space. Their approach is both similar and complementary to the partitioning approach used in our methodology. It exploits previous ideas in [3, 4, 8] on multi-scale data analysis. One similarity is the use of a tree-based approach that decomposes data sets using tree structures: while they construct the tree-based decomposition using the CoverTree algorithm [10] with gap-based local intrinsic dimensions, we compute a fixed dyadic tree structure using variance-based intrinsic dimensions. A second similarity arises in the construction of multi-manifolds: while we focus on fitting piece-wise linear manifolds to the data on which to compute the test statistics, they summarize the decomposition into a graph structure that captures the local topology. The results of both approaches are to some extent complementary: our fixed dyadic tree structure gives coarse-grained information on the topology of the multi-manifold, while their approach provides more refined information by exploiting persistent homology statistics to refine the stopping condition and to coalesce some of the sets in the original decomposition.
The problem of estimating the intrinsic dimension (ID) of a data set is a recurring topic in the analysis of large data sets that require efficient representation, i.e., representation that simplifies visualization, decreases storage needs, improves computational complexity, etc. An essential step in this problem is to uncover the true or intrinsic dimensionality of the data. Indeed, although the data may be embedded in R^D, its intrinsic dimension, or as Fukunaga defines it [25], the minimum number d such that the data set lies entirely within a d-dimensional subspace of R^D, is often much smaller than D. From this point of view, intrinsic dimensionality estimation can be put under the general umbrella of dimension reduction.

The intrinsic dimension of a data set can be estimated globally and locally. Global estimation methods assume that there is only one dimension for the entire data set. By contrast, local estimation methods assume that the dimension differs from one region of the data set to another and, therefore, the dimension is computed for each data point in relation to its neighbors.

In our work, we focus on local estimation of intrinsic dimensionality; however, it is important to note that several local techniques are obtained by adapting a global technique to small regions or points in a large data set. We will often use "intrinsic dimension of a point" to refer to the local intrinsic dimension of the data set centered at the said point. This abuse of language is common in dimensionality estimation; points are not regarded as zero-dimensional objects but rather as carrying the dimensionality of a region large enough to accurately capture the surrounding manifold but small enough to preserve a notion of locality.

In the following, we review a few important estimation techniques.

Projection-based Methods.
The goal of projection-based methods is to find the best subspace R^d on which to project a data set embedded in R^D. The criterion for the best subspace is often encoded by an error or cost function that one seeks to minimize. For example, PCA, a very popular linear projection technique, minimizes the reconstruction error between a data matrix and its reconstruction, which is the projection onto basis vectors that represent the directions of greatest variance of the data. The PCA algorithm for estimating intrinsic dimension is as follows:

1. Compute the eigenvalues λ_1, ..., λ_D of the D × D data covariance matrix and order them from highest to lowest.
2. Compute the (percent) cumulative sum of the first k eigenvalues,

$$100 \left( \sum_{i=1}^{k} \lambda_i \right) \Big/ \left( \sum_{i=1}^{D} \lambda_i \right).$$

These cumulative sums are fractions of the total variance explained by the corresponding eigenvalues.
3. Define the intrinsic dimension d as the number of non-null eigenvalues whose cumulative sum is larger than a prescribed threshold value, e.g., 95%.
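The three steps above can be sketched as follows; this is a minimal sketch, and the function name and the 95% default are illustrative choices.

```python
import numpy as np

def pca_intrinsic_dimension(X, threshold=0.95):
    """PCA intrinsic-dimension estimate: the smallest number of top
    eigenvalues whose cumulative explained-variance fraction exceeds
    `threshold` (step 3 above)."""
    Xc = X - X.mean(axis=0)                          # center the data
    cov = np.cov(Xc, rowvar=False)                   # D x D covariance matrix
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues, descending
    cumfrac = np.cumsum(evals) / evals.sum()         # cumulative variance fractions
    return int(np.searchsorted(cumfrac, threshold) + 1)
```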
Even though PCA remains a go-to technique in dimensionality reduction, it has several known issues, such as its lack of robustness to noise or its overestimation of the intrinsic dimension in global settings for certain data sets, in particular those that are non-linear. For instance, PCA characterizes the d-dimensional sphere as being (d + 1)-dimensional.

Multiscale Methods.
Another method based on the singular value decomposition is the Multiscale Singular Value Decomposition (MSVD) method of Anna Little and Mauro Maggioni. MSVD is a multiscale approach to determining intrinsic dimension, but it can also be classified as a projection method. The main difference between this method and the local PCA of Fukunaga and Olsen is that the local PCA algorithm computes the intrinsic dimension using a fixed scale determined interactively, while MSVD estimates the intrinsic dimension by studying the growth of the squared singular values (SSVs) as a function of scale [15, 35]. MSVD is based on the observation that for small scales r, SSVs representing the tangential space, i.e., the intrinsic dimension, have a linear relationship with r, while SSVs representing the curvature space have a quadratic relationship with r. For large scales, SSVs representing the tangential space have a quadratic relationship with r, while SSVs representing the curvature space have a quartic relationship with r.

In the absence of noise, the MSVD algorithm can be summarized as follows: given a data set X = {x_1, ..., x_N} ⊆ R^D and a range of scales or radii r_1, ..., r_p,

(i) construct a ball B_{r_j}(x_i) of radius r_j centered at x_i, for i = 1, ..., N and j = 1, ..., p;
(ii) compute the SSVs λ_k(x_i, r_j), k = 1, ..., D, for each ball B_{r_j}(x_i);
(iii) for each point x_i, use a least-squares regression of λ_k as a function of r to discriminate the curvature SSVs from the tangential ones;
(iv) define the intrinsic dimension d as the number of tangential SSVs.

In the presence of noise, an extra step is added to eliminate certain values of r where the noise creates variability in the SSVs that cannot be attributed to dimensionality. In [15], the authors implemented the MSVD algorithm on both artificial manifolds and real-world data sets and obtained excellent results.

Fractal-Based Methods.
These techniques estimate the intrinsic dimension based on the box-counting dimension, which is itself a simplified version of the Hausdorff dimension. Consider the data set X ⊆ R^D and let v(r) be the minimal number of boxes of size r needed to cover X. The box-counting dimension d of X is defined as

$$d := \lim_{r \to 0} \frac{\ln(v(r))}{\ln(1/r)}. \qquad (2)$$

The box-counting dimension estimation is computationally prohibitive; therefore many methods, such as the correlation dimension, attempt to give a computationally feasible approximation. This dimension estimate is based on the correlation integral

$$C(r) := \lim_{N \to \infty} \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \mathbb{1}_{\{\|x_j - x_i\| \le r\}}, \qquad (3)$$

where x_1, ..., x_N are N i.i.d. samples which lie on X. Given C(r), the corresponding correlation dimension is given by

$$d_C \approx \lim_{r \to 0} \frac{\ln(C(r))}{\ln(r)}. \qquad (4)$$

The GP algorithm, named after its creators, Grassberger and Procaccia, estimates d by finding the slope of the linear part of the plot of ln(C(r)) versus ln(r). This decreases the sensitivity of the algorithm to the choice of r. However, their method is still computationally expensive, as one needs N > 10^{d_C/2} data points to obtain an accurate estimate of the intrinsic dimensionality. In 2002, Camastra and Vinciarelli proposed a fractal adaptation of the GP method that can be used for smaller data sets X [13]. The algorithm starts by generating data sets Y_i, i = 1, ..., m, of the same size as X for which the intrinsic dimensionality d_i is known. Using the GP method, they compute the correlation dimension d_C^{(i)} of each data set and create a reference curve, which is the best-fitting curve to the data set {(d_i, d_C^{(i)}) : i = 1, ..., m}. Then they determine the correlation dimension d_C for X and, using the reference curve, find the corresponding intrinsic dimension d.
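A minimal sketch of the GP slope estimate follows. The fixed radii grid and the regression over all positive C(r) values are simplifying assumptions; the original algorithm restricts the fit to the linear part of the plot.

```python
import numpy as np

def correlation_dimension(X, radii):
    """Sketch of the Grassberger-Procaccia estimate: evaluate the correlation
    integral C(r) of Eq. (3) over a range of radii and return the slope of
    the regression of ln C(r) on ln r, approximating d_C in Eq. (4)."""
    radii = np.asarray(radii, dtype=float)
    N = len(X)
    # pairwise distances over the upper triangle (i < j)
    diff = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)[np.triu_indices(N, k=1)]
    C = np.array([2.0 * np.count_nonzero(dists <= r) / (N * (N - 1))
                  for r in radii])
    keep = C > 0  # ln C(r) is undefined where no pair falls within r
    slope, _ = np.polyfit(np.log(radii[keep]), np.log(C[keep]), 1)
    return float(slope)
```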
The method of Camastra and Vinciarelli is based on the assumption that the reference curve depends on N but is not affected by the type of data set Y_i used in its construction.

Several other fractal-based methods were also developed to improve GP. The method of surrogate data consists in computing the correlation dimension for a (surrogate) data set with size larger than X but with the same statistical properties (mean, variance and Fourier spectrum), in the spirit of the bootstrap method [51]. Takens's method improves the expected error in the GP algorithm and is based on Fisher's method of maximum likelihood [49].

Other estimators of intrinsic dimension based on the correlation integral apply the maximum likelihood estimation (MLE) principle to the distances between data points. In their 2005 paper, Levina and Bickel assume that the observations within a specified radius of x are sampled from a Poisson process and estimate the intrinsic dimension of the Poisson process approximation via suitable statistical measures [34]. Other MLE-based methods include an extension of Levina's and Bickel's work in [28], where the authors model the data set as a process of translated Poisson mixtures with regularizing restrictions in the presence of noise.

Nearest Neighbor-based Methods.
Suppose we are given data points X = {x_1, ..., x_N} ⊂ R^D drawn according to an unknown density p(x), and assume that X has intrinsic dimension d. Let V_d be the volume of the unit ball in R^d and denote by R_k(x) the distance between a point x and its k-th nearest neighbor. The intrinsic density of X can be approximated by the formula [34]

$$\frac{k}{N} \approx p(x) \, V_d \, R_k(x)^d.$$

Based on this formula, Pettis et al. [41] show that, with some additional assumptions, the intrinsic dimensionality d and k are related by

$$\frac{1}{d} \ln k \approx \ln \bar{R}_k + C, \qquad (5)$$

where C is a constant and $\bar{R}_k$ is the mean k-th nearest-neighbor distance,

$$\bar{R}_k = \frac{1}{N} \sum_{i=1}^{N} R_k(x_i).$$

One then plots $\ln \bar{R}_k$ versus ln k via linear regression, and d is estimated as the reciprocal of the slope of this line.

Another method based on nearest neighbors is the Geodesic Minimal Spanning Tree (GMST), which estimates the intrinsic dimension by (1) finding the geodesic distances between all points in a data set, (2) constructing a similarity matrix based on these distances and (3) computing a minimal spanning subgraph from which the intrinsic dimension is estimated [21]. A major drawback of nearest-neighbor-based approaches is their large negative bias due to undersampling. Improvements were obtained by giving more weight to interior points and forcing constant dimension in small neighborhoods for local estimation [14, 20].

Analytic Methods based on metric spaces.
Nearest neighbor search is a fundamental area in which it is also essential to estimate the intrinsic dimension. In [10], the authors mention two quantities that can be used as proxies for intrinsic dimension. The first was developed by Karger and Ruhl for classes of metrics with a growth bound [31]. Let B_r(x) denote the ball of radius r > 0 centered at x. For a data set X, Karger and Ruhl define the expansion constant c of X as the smallest value c ≥ 2 such that

$$|B_r(x)| \le c \, |B_{r/2}(x)| \quad \text{for all } x \in X \text{ and } r > 0.$$

From this, assuming that X is sampled uniformly from some surface of dimension d, which would imply c ∼ 2^d, they define the expansion dimension d_KR by

$$d_{KR} = \log_2 c.$$

However, in practice, this formula often overestimates the intrinsic dimension. For example, the KR-dimension may grow arbitrarily large if one adds just a single point to a data set embedded in a Euclidean space. Another intrinsic dimension estimate comes from Krauthgamer and Lee [32] and is based on the doubling dimension or doubling constant, i.e., the minimum value c such that every ball of a given radius in a set X can be covered by c balls of half the radius. Given c, the dimension d_KL is defined as before as

$$d_{KL} = \log_2 c.$$

The dimension estimate d_KL is more robust to changes in data sets than d_KR; however, there are few convergence results for the algorithm [10]. When representing a dimensional clustering hierarchy (as used in the cover tree algorithm [10]), c can be used to bound the number of children in the next tree level. Its value is computed by considering balls B_1 and B_2 of radius r and 2r around each data point and counting the number of data points in each ball. Then c is the smallest value such that |B_2| ≤ c |B_1|. Such a tree structure allows for performing a fast nearest neighbor search after a one-time construction cost that is near-linear in |X| and with storage O(|X|) [10].
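A rough empirical estimate of the expansion constant, and hence of d_KR, can be sketched as follows. The finite radius grid and the dense distance matrix are simplifying assumptions for illustration.

```python
import numpy as np

def expansion_dimension(X, radii):
    """Sketch of the Karger-Ruhl expansion dimension: estimate the smallest
    c with |B_r(x)| <= c * |B_{r/2}(x)| over the sampled points and radii,
    then return d_KR = log2(c)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    c = 1.0
    for r in radii:
        big = np.count_nonzero(dists <= r, axis=1)        # |B_r(x)|, counts x itself
        small = np.count_nonzero(dists <= r / 2, axis=1)  # |B_{r/2}(x)|, always >= 1
        c = max(c, float(np.max(big / small)))            # worst-case growth ratio
    return float(np.log2(c))
```

Adding a single isolated point to X can inflate the ratio at some radius, which illustrates the sensitivity of d_KR noted above.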
Interestingly, the doubling dimension allows for a rigorous derivation of the cover tree complexity bounds just described, an approach that could be extended to the methods described below.

There are several other ideas for estimating intrinsic dimension, including Multidimensional Scaling (MDS) methods, Topology Representing Networks (TRN), Bayesian estimators and many more. A lengthier account of those presented here can be found in [35]. Camastra's survey of data dimensionality estimation gives a very good description and classification of different estimators [12]. A thorough survey of nonlinear dimensionality reduction techniques can be found in [33].

The present paper defines a variance-based notion of intrinsic dimension, d_VLID, similar to the MSVD method in its multi-scale approach. Moreover, it is similar to PCA in that it exploits the principal values accounting for a prescribed proportion of the total variance (see Sec. 4.1).
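The formal definition of d_VLID is deferred to Sec. 4.1; the following is only a rough sketch consistent with the pointwise description given earlier (the minimum, over a set of local neighborhoods, of a variance-threshold dimension). The function name, the radii and the 95% threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def variance_based_lid(X, x, radii, threshold=0.95):
    """Sketch of a variance-based local intrinsic dimension at a point x:
    for each neighborhood radius, count the principal values needed to
    explain `threshold` of the local variance, then take the minimum
    over the neighborhoods."""
    dims = []
    for r in radii:
        nbhd = X[np.linalg.norm(X - x, axis=1) <= r]
        if len(nbhd) < 2:
            continue  # too few points to estimate a local covariance
        cov = np.cov(nbhd, rowvar=False)
        evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
        frac = np.cumsum(evals) / evals.sum()
        dims.append(int(np.searchsorted(frac, threshold) + 1))
    return min(dims) if dims else 0
```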
We now present a computational methodology for testing the multi-manifold hypothesis (Prop. 1). Our approach is based on a training-testing routine that constructs candidate manifolds based on one part of the data (training set) and evaluates the hypothesis through a testing procedure on the remaining data points (testing set). The workflow consists of three major steps: (i) the sampling of training and testing sets, (ii) the construction of candidate manifolds and (iii) goodness-of-fit statistics for evaluation.
For the first step, we either separate the data points into two groups (training/testing) or sub-sample two sets of data points if the given data set is very large. The sampling should preserve the intrinsic geometry of the original data set, since we want to test whether we can construct a candidate manifold that represents the whole data set reasonably well. To construct candidate manifolds, we draw on the extensive literature on manifold learning and dimensionality analysis, as detailed below.

A key step in the methodology is the evaluation of the candidate manifolds, which represents the actual hypothesis test. For this, we want to estimate an approximate square distance, that is, compute shortest distances from each sample point to the candidate manifold. Formally, we evaluate the empirical loss against the loss function

$$L(V, P) = \int d(x, V)^2 \, dP(x), \qquad (1)$$

where P is the probability distribution from which the data set is sampled, by analyzing the distribution of the deviation

$$P\left[\, \sup_k \left| \frac{1}{|X|} \sum_{x_i \in X} d(x_i, V)^2 - L(V, P) \right| < \varepsilon \,\right] > 1 - \delta. \qquad (2)$$

Here, δ is the significance level (e.g., the commonly used δ = 0.05) and k a resolution parameter in the construction of the candidate manifold V. However, since we cannot directly compute the loss function L, the test statistic (Eq. 2) is not suitable for computational purposes. Instead, we use the following heuristic:

$$\sup_k \frac{1}{|X|} \sum_{x_i \in X} d(x_i, V)^2 < \hat{\delta}, \qquad (3)$$

where k is again a resolution parameter and $\hat{\delta} := \hat{\delta}(|X|, k)$ the square-distance threshold for which we are willing to accept the candidate manifold. The threshold depends on both the sample size |X| and the resolution parameter k.

These ideas are implemented by the following workflow, shown schematically in Fig. 1.

Step 1 Preprocessing.
We assume the data is pre-processed to lie in R^D. Local intrinsic dimensions are computed for each point as part of the pre-processing. With this, the data can be partitioned into sets of different intrinsic dimensions, and steps 2, 3 and 4 can be applied to each partition separately.

Step 2 Hierarchical Multi-scale Partitioning.
We construct a hierarchy of partitions of the data using dyadic trees. The hierarchical partitioning provides a multiscale view of the data, where the scale index is the resolution parameter. A stopping condition determines the leaf sets of the hierarchical partition. In our implementation, the stopping condition ensures that the local intrinsic dimension is smaller than or equal to the pointwise intrinsic dimension. Algorithmic tools for this construction include CoverTree [10] (which gives tree-like ε-nets with dyadically decreasing ε) or dyadic partitions, see, e.g., [16].

Fig. 1: Workflow for the heuristic multi-manifold hypothesis test. We partition the data set using intrinsic dimensions directly computed from the data. Based on this, we construct multi-manifolds consisting of piece-wise linear manifolds that fit the data. The set of candidate multi-manifolds is then used to conduct a hypothesis test on the goodness-of-fit with the sample data.

Step 3 Manifold construction.
We perform a spline interpolation on the leaf sets of the partition tree, which gives piece-wise linear candidate manifolds consistent with the computed intrinsic dimensions. Coordinates associated with these piece-wise linear manifolds can be used to construct non-linear splines to achieve a better goodness-of-fit.

Step 4: Test statistics.
We compute approximate square distances (Eq. 3) for the candidate multi-manifold. The total square distance is used as the decision parameter for the hypothesis test.
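The acceptance heuristic of Eq. 3 amounts to comparing an empirical mean of squared point-to-manifold distances against a threshold. A minimal sketch in Python (the function names and the distance callback are our own illustrative choices, not part of the reference implementation):

```python
import numpy as np

def mean_sq_distance(X, dist_to_manifold):
    """Empirical mean of the squared distances from the sample points
    (rows of X) to a candidate manifold, i.e. the left-hand side of Eq. 3."""
    return float(np.mean([dist_to_manifold(x) ** 2 for x in np.asarray(X)]))

def accept_candidate(X, dist_to_manifold, delta_hat):
    """Accept the candidate manifold if the empirical mean squared distance
    stays below the threshold delta_hat = delta_hat(|X|, k)."""
    return mean_sq_distance(X, dist_to_manifold) < delta_hat
```

For instance, with the x-axis as candidate manifold in R^2, `dist_to_manifold` would return the absolute second coordinate of a point.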
We implemented the methodology by defining algorithms for three functions:

• Pre-processing. A local intrinsic dimension d for each point of a data set X ⊂ R^D.
• Multi-manifold construction. A dyadic linear multi-manifold V(X) approximating a data set X.
• Test statistics. A test statistic S which takes as input a set X of data points and a dyadic linear multi-manifold V, and outputs the expected value of the sum of the squared distances between X and V,

S(X, V) = E[SQD(X, V)].   (1)

The workflow then consists of the following steps:

1. Preprocessing. Subdivide the sample points X into a training and a testing set (X_train and X_test). Compute local intrinsic dimensions for each data point in X_train. Stratify X_train into strata S_k using the local intrinsic dimensions.
2. Multi-manifold construction. For each stratum S_k, construct a dyadic linear multi-manifold V(S_k) that approximates the stratum.
3. Test statistics. For each stratum S_k, construct a probability distribution by applying the test statistic to the testing points X_test of the data set and the dyadic linear multi-manifold V(S_k) that approximates the complementary training set X_train. For higher accuracy, test statistics are averaged over multiple runs.

This implementation allows for testing the goodness-of-fit of a candidate multi-manifold V. We sample a subset S from the candidate multi-manifold V and compute intrinsic dimensions for each point in S. Based on these intrinsic dimensions, we stratify S into strata S_k. Then, we construct a dyadic linear multi-manifold V(S_k) for each stratum. For each value k of the intrinsic dimension, the expected value E[SQD(S_k, V(S_k))] of the sum of squared distances is computed and compared with the empirical distribution. If SQD(S_k, V(S_k)) lies outside of the specified confidence interval, the hypothesis is rejected. For greater accuracy, the hypothesis test can be repeated multiple times. If there is no stratum in the data set of the same intrinsic dimension as a stratum S_k, the hypothesis is rejected for that stratum of the candidate multi-manifold.

Parameters: neighborhood definition.
It is clear that the method by which we define neighborhoods of points is essential for local estimation, both in terms of complexity and global estimation issues: considering a small neighborhood can create computational errors and non-representative values, while looking at a large neighborhood might cause global estimation issues. Here, we consider two types of neighborhood constructions: (i) neighborhoods consisting of balls centered at a design point, and (ii) neighborhoods of the nearest neighbors of a design point. The size of the neighborhoods is chosen experimentally; we do not yet have a principled way to determine it.
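The two neighborhood constructions can be sketched as follows (a brute-force illustrative sketch; the function names are ours):

```python
import numpy as np

def ball_neighborhood(X, p, r):
    """Construction (i): all points of X within Euclidean distance r
    of the design point p."""
    X = np.asarray(X, dtype=float)
    return X[np.linalg.norm(X - np.asarray(p, dtype=float), axis=1) <= r]

def knn_neighborhood(X, p, k):
    """Construction (ii): the k nearest neighbors of the design point p
    among the rows of X."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(np.linalg.norm(X - np.asarray(p, dtype=float), axis=1))
    return X[order[:k]]
```

As the text notes, the k-nearest-neighbor construction guarantees a non-empty neighborhood, whereas a ball of too-small radius may contain no points.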
In the current implementation, we used a variance-based local intrinsic dimension d_VLID. We define d_VLID in terms of a variance-based intrinsic dimension d_VID, which takes as input a finite data set N ⊂ R^D, a variance-based threshold t ∈ [0, 1] and a cutoff parameter c. If there are too few points in N, i.e. |N| ≤ c, then d_VID is undefined. Otherwise, its output is the smallest integer i such that the sum of the first i squared singular values of the centered data set N − E(N) accounts for at least a proportion t of the total variance. In this case, d_VID is the PCA-based intrinsic dimension defined in Section 2.5. Recall that the total variance of a centered matrix is the sum of the squares of its singular values:

d_VID(N) = min { 1 ≤ i ≤ n : ∑_{j=1}^{i} σ_j^2 ≥ t · ∑_{j=1}^{n} σ_j^2 }.   (2)

The variance-based intrinsic dimension depends on the parameters t and c, and on a list L that determines a set of neighborhoods N_i of (design) points in X. For example, L could be a list of radii r_i for neighborhoods B(p, r_i) of radius r_i centered at a design point p. For the nearest-neighbor based construction, L could be a list of neighborhoods KNN(p, k) consisting of the k nearest neighbors of design points p. The value of the variance-based local intrinsic dimension function at a point p is then defined as the minimum, over the neighborhoods N_i whose cardinality exceeds the cutoff c, of the variance-based intrinsic dimension d_VID(N_i):

d_VLID(p) = min_{1 ≤ i ≤ n, |N_i| > c} { d_VID(N_i) }.   (3)

The novelty of d_VLID is its multiscale exploitation of projection-based intrinsic dimension, combined with a notion of cutoff.
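A sketch of the variance-based dimensions just defined, using the SVD of the centered neighborhood (the helper names are ours; the default t = 0.95 matches the sphere-line experiment below, and c is the cutoff):

```python
import numpy as np

def variance_based_dim(N, t=0.95, c=3):
    """d_VID of a neighborhood N (rows = points): the smallest i such that
    the first i squared singular values of the centered data capture at
    least a fraction t of the total variance. Returns None (undefined)
    if the neighborhood has at most c points."""
    N = np.asarray(N, dtype=float)
    if len(N) <= c:
        return None
    s = np.linalg.svd(N - N.mean(axis=0), compute_uv=False)
    cum = np.cumsum(s**2) / (s**2).sum()
    return int(np.searchsorted(cum, t) + 1)

def local_intrinsic_dim(neighborhoods, t=0.95, c=3):
    """d_VLID at a design point p: minimum d_VID over the supplied
    neighborhoods of p whose size exceeds the cutoff c."""
    dims = [variance_based_dim(N, t, c) for N in neighborhoods]
    dims = [d for d in dims if d is not None]
    return min(dims) if dims else None
```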
An alternate method for computing intrinsic dimension is based on the GMST method applied locally. Suppose that we have a data set X = {x_1, x_2, ..., x_N}, where the sample points are drawn from a bounded density supported on a compact, d-dimensional Riemannian sub-manifold. Assume that this condition holds locally for some n larger than a certain value n*. Our local GMST algorithm uses the following steps:

1. Consider a point x_i and construct a neighborhood N_{n,i} of x_i using either a ball centered at x_i containing, say, n sample points, or the n nearest neighbors of x_i, with n > n*.
2. For each x_i and the constructed neighborhood N_{n,i} above, find the k nearest neighbors of each point x_i in N_{n,i}, where k < n. These form the sub-neighborhood N_{n,k,i}.
3. Compute the total edge length of the kNN graph for each N_{n,i}:

L_{γ,k}(N_{n,i}) := ∑_{i=1}^{n} ∑_{x_j ∈ N_{n,k,i}} |x_j − x_i|^γ,

where the parameter γ determines locality. An equivalent formula applies if balls are used.
4. Using the fact that, with probability 1 [21],

L_{γ,k}(N_{n,i}) = a n^{(d_{i,n} − γ)/d_{i,n}} + ε_n,   (4)

where ε_n becomes small as n grows and a is some positive constant, the intrinsic dimension d_{i,n} at each x_i is found by applying non-linear least squares.

Compute the intrinsic dimension d_{i,n} for multiple neighborhoods N_{n,i} about x_i. The final intrinsic dimension at x_i is found by averaging over the number of neighborhoods.

Given a data set X ⊂ R^D, we recursively construct a sequence of linear multi-manifolds approximating the data set by recursively constructing a tree of dyadic cubes, such that the cubes C at each level of the tree are disjoint, their union contains X, and X ∩ C is approximated by the best-fitting linear space L_C of dimension d_v(X ∩ C) containing E(X ∩ C). Here E(X ∩ C) is the average of all of the points in X ∩ C and d_v(X ∩ C) is the variance-based dimension of X ∩ C. This linear space can be computed using the Singular Value Decomposition.
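The growth-rate fit at the heart of the local GMST steps described above can be sketched as follows. Instead of full non-linear least squares we use the simpler log-log slope m of the leading term of Eq. 4, L ≈ a·n^((d−γ)/d), which gives d = γ/(1 − m). This is a rough illustrative sketch under our own simplifications, not the authors' implementation:

```python
import numpy as np

def knn_graph_length(P, k, gamma=1.0):
    """Total gamma-weighted edge length of the k-nearest-neighbor graph
    of the points P (brute-force pairwise distances)."""
    P = np.asarray(P, dtype=float)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude self-distances
    knn = np.sort(D, axis=1)[:, :k]      # k nearest distances per point
    return float((knn**gamma).sum())

def gmst_dimension(P, sizes, k=5, gamma=1.0, rng=None):
    """Estimate intrinsic dimension from the growth rate of the kNN graph
    length over sub-samples of increasing size n: since L(n) ~ a * n**((d-gamma)/d),
    the log-log slope m yields d = gamma / (1 - m)."""
    rng = np.random.default_rng(rng)
    P = np.asarray(P, dtype=float)
    lengths = [knn_graph_length(P[rng.choice(len(P), size=n, replace=False)], k, gamma)
               for n in sizes]
    m = np.polyfit(np.log(sizes), np.log(lengths), 1)[0]
    return gamma / (1.0 - m)
```

On uniform samples from a segment or a square, the estimate should land near 1 and 2 respectively, up to finite-sample noise.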
Dyadic cubes are translates of cubes consisting of points whose i-th coordinates lie in a dyadic interval [0, 2^{−k_i}]. For the root of the tree, choose a cube which contains X. To obtain the other cubes, choose an order of the coordinates and sequentially divide the cube in half along a specific coordinate axis. Recursively cycle through the sequence of coordinates. This results in a binary tree, making the computation easier, although a tree can also be constructed by halving all of the sides of the parent cube (not just one side).

The depth of the tree varies with the stopping condition used in the algorithm. Different stopping conditions for the recursive algorithm determine different dyadic linear multi-manifolds V(X). In our implementation we constructed the dyadic linear multi-manifolds for subsets X(i) ⊂ X that consist of all points with local intrinsic dimension i. In this case, we could use the stopping condition that the variance-based intrinsic dimension of the leaves is smaller than i. The sets L_C ∩ C corresponding to the leaf cubes form the candidate multi-manifold V(X(i)). We define V(X) as the union of the dyadic linear multi-manifolds V(X(i)) over the intrinsic dimensions i.

In the current implementation we exploit the variance-based definition of the intrinsic dimension d_v. We observe that for any data set X, the sum of squared distances to its best-fitting linear space L of dimension d_v equals the sum of the squared singular values beyond index d_v, and is therefore bounded above by the (1 − t) fraction of the total variance:

SQD(X, L) = ∑_{i > d_v} σ_i^2 ≤ (1 − t) ∑_{i ≥ 1} σ_i^2.   (5)

We used this observation to define SQD(S, V): we define SQD for each multi-manifold component, i.e. for each linear space L_C in the dyadic linear multi-manifold,

SQD(L_C) = ∑_{i > d_loc(S ∩ C)} σ_i^2.   (6)

In this equation, σ_i is the i-th singular value of the centered data set S ∩ C − E(S ∩ C).
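The recursive dyadic splitting and the singular-value form of the squared distance to the best-fitting subspace can be sketched as follows. The stopping rule here is our own simplification (a cell becomes a leaf once its first target_dim singular directions capture at least 95% of its variance); all names and defaults are illustrative:

```python
import numpy as np

def tail_sqd(points, d):
    """Sum of squared singular values beyond index d of the centered points:
    the squared distance of the points to their best-fitting d-dimensional
    affine subspace."""
    P = np.asarray(points, dtype=float)
    s = np.linalg.svd(P - P.mean(axis=0), compute_uv=False)
    return float((s[d:]**2).sum())

def dyadic_leaves(points, lo, hi, target_dim, axis=0, min_pts=4, depth=0, max_depth=12):
    """Recursively halve the cube [lo, hi] along cycling coordinate axes,
    returning the leaf point sets. Cells with fewer than min_pts points are
    dropped, mirroring the paper's minimum-point rule for components."""
    P = np.asarray(points, dtype=float)
    if len(P) < min_pts:
        return []
    s = np.linalg.svd(P - P.mean(axis=0), compute_uv=False)
    total = (s**2).sum()
    # leaf: first target_dim directions explain (almost) all variance
    if depth >= max_depth or total == 0 or (s[:target_dim]**2).sum() >= 0.95 * total:
        return [P]
    mid = 0.5 * (lo[axis] + hi[axis])
    left, right = P[P[:, axis] <= mid], P[P[:, axis] > mid]
    nxt = (axis + 1) % P.shape[1]
    lo_r, hi_l = lo.copy(), hi.copy()
    lo_r[axis], hi_l[axis] = mid, mid
    return (dyadic_leaves(left, lo, hi_l, target_dim, nxt, min_pts, depth + 1, max_depth)
            + dyadic_leaves(right, lo_r, hi, target_dim, nxt, min_pts, depth + 1, max_depth))
```

For points on a "V" shape in the plane, a single split along the first coordinate already yields two leaves that are each well-fitted by a one-dimensional subspace.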
Then the sum-of-squared-distances function from a data set S to a multi-manifold V consisting of components L_C is defined, in the current implementation, by summing the sum-of-squared-distances functions of the components:

SQD(S, V) = ∑_{L_C ∈ V} SQD(L_C).   (7)

We demonstrate the methodology for two low-dimensional use cases. For both cases, we compute intrinsic dimensions, construct a candidate multi-manifold and compute test statistics. The first data set consists of a simple sphere-line object with components of different intrinsic dimensions: a one-dimensional line, a two-dimensional surface and three-dimensional intersection points (see Figure 2).
Fig. 2: Sphere-line object with components of different intrinsic dimensions.
The second data set consists of three-dimensional coordinates for a LiDAR image of the Golden Gate Bridge (see Figure 8). Intuitively, the bridge cables appear to be one-dimensional, while the bridge surface appears two-dimensional.
The data set consists of a sample from a sphere-line configuration (see Fig. 2). For ease of computation, only points on the sphere and the line segments external to the sphere were sampled. We first computed the intrinsic dimension of the sample points. As shown in Fig. 2, we would expect to find points of intrinsic dimension 1 (line) and 2 (sphere surface), and two points of intrinsic dimension 3 (intersection points). The sphere is curved, so samples from its surface will not be well-approximated by a linear multi-manifold. We sampled randomly using polar coordinates on the sphere in order to preserve the intrinsic geometry to the best possible extent. The sample X consisted of 2708 points: 2513 from the sphere and 193 from the line external to the sphere. The sampled sphere is centered at the origin; the line sample was randomly selected from the intervals [-1, -1/2] and [1/2, 1] on the x-axis. Because polar coordinates were used, the sampling from the sphere was not uniform with respect to the surface-area measure. First, the intrinsic dimension of the points in X was computed using the variance-based intrinsic dimension algorithm discussed in Section 4.1 with neighborhoods of dyadic radii; 37 points were assigned intrinsic dimension three. The sample points are shown in Fig. 3, color coded by intrinsic dimension values.

For each of the intrinsic-dimension-based strata X(i), a dyadic linear multi-manifold V(X(i)) is computed approximating the stratum. A summary of the properties of each of the multi-manifolds is shown in Table 1. The multi-manifold for the points with intrinsic dimension one, V(X(1)), has only one linear component, which agrees with the fact that all of these points are on the x-axis. In this sample, the multi-manifold for the intrinsic dimension two points, V(X(2)), has two linear components, one corresponding to the cube consisting of points with x ≤ 0 and the other to the cube with x > 0. The somewhat surprising fact is that the local intrinsic dimension of the sphere samples in each of these halves of the unit cube is one. In this example the parameter is t = 0.95. The multi-manifold V(X(3)) for the 37 points of intrinsic dimension three also had only one component. This is explained by Figure 3, which shows that most of the points of intrinsic dimension three are on the x-axis, a 1-dimensional linear manifold, near the points of intersection with the sphere. To summarize the goodness-of-fit of the linear multi-manifolds, we compute the expected value of the squared distances of the data points in each cube to the best-fitting linear affine space of the local intrinsic dimension d.

The last step of the methodology is the computation of a probability distribution H(i) for each value of the intrinsic dimension i. For the sphere-line example, this was done by randomly choosing 20 test subsets for each i, which determined
Fig. 3: Intrinsic Dimensions of the Sphere-Line Sample.

dim | total pts | V points | components | E(SQD)

Table 1: Summary of the Multi-Manifolds V(X(i)) for the Sphere-Line Example. The intrinsic dimension is shown in the first column, the number of points in each stratum in the second. The third column shows the total number of points; this is the sum of the points that lie in the dyadic cubes associated with each component of the multi-linear manifold. In this case, the total number of points equals the number of supported points for each stratum, since the sampling was fairly dense. However, that will not be true in general, since in the top-down recursive algorithm no component of the multi-manifold will be constructed if there are fewer than K log(K) points. Here K was specified to be three, since that was the maximum intrinsic dimension. The fourth column lists the number of components of V(X(i)).
20 training subsets X_train(i) = X(i) − X_test(i). The dyadic linear multi-manifold V(X(i)) was computed for each training set, and the expected value E_i(SQD) of the sum of the squares of the distances of the test subset X_test(i) to V(X(i)) was computed (cube by cube). The hypothesis-testing probability distribution H(i) is the distribution of the statistic E_i(SQD). The expected value, standard deviation, and z-score cutoff were computed for a 95% confidence interval. The computed values for these statistics are shown in Tab. 2. The information in Tab. 2 is sufficient to make a hypothesis-testing decision for each intrinsic dimension.

d | E(E(SQD)) | support | train count | test count | runs | SD(E(SQD)) | z cutoff
1 | 0      | 1 | 1814 | 893 | 20 | 0      | 0.3110
2 | 0.0541 | 1 | 1814 | 893 | 20 | 0.0007 | 1.5190
3 | 0.0011 | 1 | 1814 | 893 | 20 | 0.0004 | 1.7146
Table 2: Test Statistics for the Sphere-Line Sample. The third column shows the expected value of the number of points in the dyadic cubes supporting the multi-manifold. The fourth and fifth columns show the expected values of the sizes of the training and testing sets.
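The decision step summarized in Tab. 2 — accept a stratum's candidate multi-manifold when the observed statistic lies within a z-score band of the empirically sampled E(SQD) values — can be sketched as follows (an illustrative helper, not the reference code):

```python
import numpy as np

def hypothesis_decision(sqd_samples, observed_sqd, z_cutoff=1.96):
    """Accept the candidate multi-manifold for one stratum if the observed
    sum of squared distances lies within z_cutoff standard deviations of
    the mean of the empirically sampled E(SQD) statistics."""
    mu, sd = np.mean(sqd_samples), np.std(sqd_samples)
    if sd == 0:
        return observed_sqd == mu
    return abs(observed_sqd - mu) <= z_cutoff * sd
```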
We also implemented a simple version of local GMST. We use the first algorithm (using Laurens van der Maaten's implementation) and compute the intrinsic dimension of each point in the data sets using neighborhoods N_{n,i} of size n in the range 200 to 400, with increments of 25. We only performed the experiment on the sphere-line data set, as the computation time grows prohibitively large while the results obtained do not match all our predictions; see Fig. 14. Our results show that the points on the lines have dimension 1 and the points around the intersection of the line and the sphere have dimension around 3. However, this is the case for several points on the other parts of the sphere as well. The results are thus poorer than those obtained with the variance-based estimator. This is because, for the GMST, the size n of N_{n,i} for each x_i has to be large enough for the guarantee of Eq. 4 to hold. As noted above, the method used to construct the neighborhood of a point is essential for local estimation, and we do not yet have a principled way to choose these sizes. In future work, we hope to vary the parameters k, n and γ for the GMST. We also hope to implement the MSVD algorithm [15, 35].

To provide the context for our LiDAR data use case, we summarize LiDAR technology, example applications, the LiDAR data collection process and the measurements taken in the process. We then describe the specific use case in this context. LiDAR stands for light detection and ranging; it is an optical remote sensing technique that uses laser light to densely sample the surface of the earth, producing highly accurate x, y and z measurements.
The resulting mass point cloud data sets can be managed, visualized, analyzed and shared using ArcGIS. The collection vehicle for LiDAR data might be an aircraft, helicopter, ground vehicle or tripod.

LiDAR is an active optical sensor that transmits laser beams towards a target while moving through specific survey routes. The reflection of the laser from the target is detected and analyzed by receivers in the LiDAR sensor. These receivers record the precise time from when the laser pulse leaves the system to when it returns, in order to calculate the range distance between the sensor and the target; this is combined with positional information from GPS (Global Positioning System) and INS (inertial navigation system). These distance measurements are transformed to measurements of actual three-dimensional points of the reflective target in object space.

LiDAR can be applied, for instance, to updating digital elevation models, glacial monitoring, detecting faults and measuring uplift, forest inventory, detecting shoreline and beach volume changes, landslide risk analysis, habitat mapping and urban development [36]. 3D LiDAR point clouds have many applications in the Geosciences. A very important application is the classification of the 3D cloud into elementary classes. For example, it can be used to differentiate between vegetation, man-made structures and water. Alternatively, only two classes, such as ground and non-ground, could be used. Another useful classification is based on the heterogeneity of surfaces. For instance, we might be interested in classifying the point cloud of a reservoir into classes such as gravel, sand and rock. The design of algorithms for classification of this data using a multi-scale intrinsic dimensionality approach is of great interest to different scientific communities [11], [3].

The LiDAR data considered here was converted to 3D coordinates using the free QGIS software.
It contains approximately 87,000 points; a scatter plot is shown in Figure 8. In terms of dimensionality, the catenary cables at the top of the bridge should have intrinsic dimension one and the bridge surface intrinsic dimension two. We will test this intuition using the multi-manifold testing framework.

The point data is post-processed after the LiDAR data collection survey into highly accurate geo-referenced x, y, z coordinates by analyzing the laser time range, laser scan angle, GPS position, and INS information. We have followed very closely the exposition in [46] and [48].

LiDAR point attributes.
The following attributes, along with the position (x, y, z), are maintained for each recorded laser pulse. We have included a description of each attribute and complemented the intensity attribute description with the exposition in [46]. (The description of each of the attributes below is taken verbatim from http://desktop.arcgis.com/en/arcmap/, under 'Fundamentals about LiDAR' in 'Manage Data'.)

• Intensity. The intensity of each return is captured by the LiDAR sensors. The intensity value is a measure of the return signal strength: it measures the peak amplitude of return pulses as they are reflected back from the target to the detector of the LiDAR system.
• Return number. An emitted laser pulse can have up to five returns depending on the features it is reflected from and the capabilities of the laser scanner used to collect the data. The first return will be flagged as return number one, the second as return number two, and so on.
• Number of returns. The number of returns is the total number of returns for a given pulse. Laser pulses emitted from a LiDAR system reflect from objects both on and above the ground surface: vegetation, buildings, bridges, and so on. One emitted laser pulse can return to the LiDAR sensor as one or many returns. Any emitted laser pulse that encounters multiple reflection surfaces as it travels toward the ground is split into as many returns as there are reflective surfaces.
• Point classification. Every LiDAR point that is post-processed can have a classification that defines the type of object that has reflected the laser pulse. LiDAR points can be classified into a number of categories including bare earth or ground, top of canopy, and water. The different classes are defined using numeric integer codes in the LAS files.
• Edge of flight line.
The points will be symbolized based on a value of 0 or 1. Points flagged at the edge of the flight line will be given a value of 1, and all other points will be given a value of 0.
• RGB. LiDAR data can be attributed with RGB (red, green, and blue) bands. This attribution often comes from imagery collected at the same time as the LiDAR survey.
• GPS time. The GPS time stamp at which the laser point was emitted from the aircraft. The time is in GPS seconds of the week.
• Scan angle. The scan angle is a value in degrees between -90 and +90. At 0 degrees, the laser pulse is directly below the aircraft at nadir. At -90 degrees, the laser pulse is to the left side of the aircraft, while at +90, the laser pulse is to the right side of the aircraft in the direction of flight. Most LiDAR systems are currently less than ±30 degrees.
• Scan direction. The scan direction is the direction the laser scanning mirror was traveling at the time of the output laser pulse. A value of 1 is a positive scan direction, and a value of 0 is a negative scan direction. A positive value indicates the scanner is moving from the left side to the right side of the in-track flight direction, and a negative value is the opposite.

Point clouds are a very dense collection of points over an area. A laser pulse can be returned many times to the airborne sensor; see Figure 4 for a graphic explanation of this process with a tree.

In the case of a simple laser profiler mounted on an airborne platform, the laser points vertically toward the ground to allow a rapid series of measurements of the distances to the ground from the successive positions of the moving platform. The measurements of the vertical distances from the platform to a series of adjacent points along the ground track are made possible through the forward motion of the airborne or space-borne platform. If the positions and altitudes of the platform at these successive positions in the air or in space are known or can be determined (e.g., using a GPS/IMU system), then the corresponding ranges measured at these points will allow their ground elevation values to be determined. Consequently, these allow the terrain profile along the flight line to be constructed (see Figures 5 and 6).
Fig. 4:
A pulse can be reflected off a tree's trunk, branches and foliage, as well as off the ground. The image is recreated from a figure in [48], p. 7.
Fig. 5:
Profile being measured along a line on the terrain from an airborne or space-borne platform using a laser altimeter. The image is reproduced from [46], Chapter 1, p. 7.
Fig. 6:
The profile belongs to a series of terrain profiles measured in the cross-track direction of an airborne platform. The image was recreated from Figure 1.5 (b), p. 8 in [46].
Fig. 7:
3D point cloud LiDAR visualization of the Golden Gate Bridge, San Francisco, CA. The image was produced by Jason Stoker (USGS) using LP360 by Qcoherent [47].
For our use case, we use LiDAR data from the Golden Gate Bridge, San Francisco, CA. We extracted the original data (more than eight million points) from the USGS EarthExplorer (https://earthexplorer.usgs.gov/) and sampled it using the software QGIS. Fig. 7 shows a visualization of the 3D point cloud data of the complete bridge. We worked with one part of the bridge and the surrounding ground, vegetation and water (see Fig. 8). We did not work with all the above-mentioned attributes, but extracted only the spatial coordinates x, y, z for our study.

Fig. 8:
Scatter plot of the Golden Gate Bridge section of the data. We extracted ∼87,000 points from the original LAS file for the analysis.
The data was pre-processed by computing the variance-based local intrinsic dimension using balls of dyadic scales 4 through 7. Specifically, we used neighborhoods of radii diam · 2^{−scale} for scale = 4, ..., 7, where diam is the maximum of the coordinate diameters. Figure 9 shows that the catenary cables indeed have intrinsic dimension one, the surface of the bridge has intrinsic dimension two, and the intersections of the main catenary cables with the bridge columns have dimension three. This confirms our intuition about the intrinsic dimension.
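The multiscale neighborhood radii used in this pre-processing step can be written down directly (a small sketch; diam here is the maximum coordinate diameter, as in the text):

```python
import numpy as np

def dyadic_radii(X, scales=range(4, 8)):
    """Radii diam * 2**(-scale) for scale = 4, ..., 7, where diam is the
    maximum of the coordinate diameters of the data set X."""
    X = np.asarray(X, dtype=float)
    diam = float((X.max(axis=0) - X.min(axis=0)).max())
    return [diam * 2.0 ** (-s) for s in scales]
```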
Fig. 9:
LiDAR data set of the Golden Gate Bridge.
Two additional views were computed and visualized using variance-based local intrinsic dimension based color-coding to further understand the data set. In Fig. 10 the data set was colored by a lexicographic ordering of the intrinsic dimension, minus the ordering of the radii. In Fig. 11 the data was colored by the expected value of the total variance over the radii at which the intrinsic dimension was observed. Figures 10 and 11 show more subtle distinctions than are revealed by the intrinsic dimension statistics, but these were not used in the remainder of the analysis. This experiment demonstrates that meaningful geometric structures can be inferred by analyzing intrinsic dimensions in densely sampled low-dimensional data.

We also computed the intrinsic dimension using the variance-based estimator for the same data sets, but this time we formed the neighborhoods using the k nearest neighbors of a given point x. Our results are practically identical to the ones obtained when the neighborhoods are formed using balls of radius r centered at x. The main advantage of k nearest neighbors is that we are guaranteed that the neighborhoods considered for the intrinsic dimension estimation are not empty. For the sphere-line example in Fig. 12, the entire sample consisted of 2708 points and we computed the intrinsic dimension for neighborhood sizes in the range 50 to 700, with increments of 25. For the LiDAR data in Fig. 13, we used a range of neighborhood sizes of 50 to 800, with increments of 50. The same conclusion as for the sphere-line holds, i.e., for most points, the intrinsic dimension obtained is identical to that obtained using balls.

Fig. 10: LiDAR data set of the Golden Gate Bridge.
For each of the three intrinsic-dimension-based strata D(i) of the LiDAR data, a dyadic linear multi-manifold V(D(i)) was computed approximating the stratum. A summary of the properties of each of the multi-manifolds is shown in Table 3. There is one row for each intrinsic dimension.

dim | total pts | MM points | components | E(SQD)
1 | 1891  | 1885  | 20 | 0.0003
2 | 84698 | 84698 | 66 | 0.0002
3 | 1185  | 1185  | 1  | 0.0079

Table 3: Summary of the Multi-Manifolds V(D(i)) for the LiDAR data.

Fig. 11:
LiDAR data set of the Golden Gate Bridge.
Fig. 12:
Variance-Based estimator with k-nearest neighbors for the sphere-line data, points color-coded by intrinsic dimension.
Fig. 13:
Variance-Based estimator with k-nearest neighbors for the LiDAR data, points color-coded by intrinsic dimension.
Fig. 14:
Local GMST with k-nearest neighbors for the sphere-line data.
Finally, we computed a probability distribution H(i) for each intrinsic dimension value. As for the sphere-line example, this was done by randomly sampling testing and training subsets. The results are shown in Table 4.

d | E(E(SQD)) | support | train count | test count | runs | SD(E(SQD)) | z cutoff
1 | 2.9926 | 1 | 58834 | 28940 | 20 | 0.17154 | 1.4401
2 | 3.9316 | 1 | 58834 | 28940 | 20 | 0.21638 | 1.7247
3 | 0      | 1 | 58834 | 28940 | 20 | 0      | 1.5343

Table 4: Test Statistics for the LiDAR data.
This article presents a summary of conceptual ideas and preliminary results from a workshop collaboration. In line with the exploratory style of the article, we outline a number of further research questions and possible future directions:

1. For what data sets and applications is multi-manifold hypothesis testing useful in practice? The examples in this paper are limited to densely sampled low-dimensional data sets. How does the method perform on higher-dimensional data sets and on sparse data sets (e.g., Word2Vec)?
2. Could intrinsic dimension statistics be used to find change points or change boundaries (commonly used in statistics)? Can the dyadic linear multi-manifold structure be useful for the formulation of high-dimensional trends for multi-dimensional time series and high-dimensional change boundary detection?
3. Can a dyadic linear multi-manifold structure be exploited to construct a non-linear multi-manifold which models the data more accurately, has known smoothness properties, and has as few components as possible?
4. What are the most practical and effective methods for improving the scalability of the intrinsic dimension computation? What additional state-of-the-art algorithms can be exploited to realize computationally efficient hypothesis testing for multi-manifolds?
5. How robust is the presented approach? The investigations could include robustness to changes in the tree structure, the neighborhood choices and changes in the intrinsic dimension algorithm itself.
6. Are there additional or alternative test statistics which could be efficiently computed to compare samples of candidate multi-manifolds and the constructed training manifolds, for example test statistics that compare structural properties?
7. Could computational topology be used to estimate the optimal number of manifold components and the minimal number of patches?
8. Are there conditions on a data set, as a subset of a tree-structured space, which will guarantee that the total variance for node subsets associated with a level in the tree is monotonically decreasing as the distance of the level from the root increases?
9. How could the theory of quantitative rectifiability be exploited or enhanced to provide theoretical guarantees for multi-manifold hypothesis testing?
In this paper we present conceptual ideas and preliminary results for the development of a heuristic framework for multi-scale testing of the multi-manifold hypothesis: given a data set X = {x_i}_{i ∈ I} in R^n and a multi-manifold V, is the square distance of the points in X to the multi-manifold V larger than one would expect? If so, we reject V as a multi-manifold that fits X. We describe an implementation of this heuristic framework and demonstrate it on two low-dimensional, densely sampled data sets with intuitive geometry. The experiments demonstrate that the computed low-dimensional multi-manifolds are consistent with the intuitive geometry.

Our approach exploits fundamental methods of statistical reasoning, hypothesis testing and simple variance-based analysis, as well as multi-scale representation methods. We apply summary statistics to data computed at multiple scales using results from geometric representation theory. The specific distribution is computed empirically from the data.

We expect that many other algorithms can be exploited in alternative realizations of the framework. Further directions that could build on our approach are outlined at the end of the paper. To ensure the reproducibility of our results, the prototype implementation will be made publicly available on GitHub.

Acknowledgements
This research started at the Women in Data Science and Mathematics Research Collaboration Workshop (WiSDM), July 17-21, 2017, at the Institute for Computational and Experimental Research in Mathematics (ICERM). The workshop was partially supported by grant number NSF-HRD 1500481-AWM ADVANCE and co-sponsored by Brown's Data Science Initiative. Additional support for some participant travel was provided by DIMACS in association with its Special Focus on Information Sharing and Dynamic Data Analysis. LN worked on this project during a visit to DIMACS, partially supported by the National Science Foundation under grant number CCF-1445755 and by DARPA SocialSim-W911NF-17-C-0098. FPM received partial travel funding from the Worcester Polytechnic Institute Mathematical Sciences Department. We thank Brie Finegold and Katherine M. Kinnaird for their participation in the workshop and in early-stage experiments. In addition, we thank Anna Little for helpful discussions on intrinsic dimensions and Jason Stoker for sharing material on LiDAR data.
Code Availability
An implementation of the workflow in MATLAB is available on GitHub: https://github.com/MelWe/mm-hypothesis.

References
1. E. Arias-Castro, G. Chen, and G. Lerman, Spectral clustering based on local linear approximations, Electronic Journal of Statistics, 5 (2011), pp. 1537–1587.
2. J. Azzam and R. Schul, An analyst's traveling salesman theorem for sets of dimension larger than one, tech report, (2017). https://arxiv.org/abs/1609.02892.
3. D. Bassu, R. Izmailov, A. McIntosh, L. Ness, and D. Shallcross, Centralized multi-scale singular vector decomposition for feature construction in lidar image classification problems, in IEEE Applied Imagery and Pattern Recognition Workshop (AIPR), IEEE, 2012.
4. D. Bassu, R. Izmailov, A. McIntosh, L. Ness, and D. Shallcross, Application of multi-scale singular vector decomposition to vessel classification in overhead satellite imagery, in Proceedings Volume 9631: Seventh Annual International Conference on Digital Image Processing ICDIP 2015, C. Falco and X. Jiang, eds., 2015.
5. M. Belkin and P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in NIPS, vol. 14, 2002.
6. M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 15 (2002), pp. 1373–1396.
7. M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 15 (2003), pp. 1373–1396.
8. P. Bendich, E. Gasparovic, J. Harer, R. Izmailov, and L. Ness, Multi-scale local shape analysis and feature selection in machine learning applications, IEEE, 2014. http://arxiv.org/pdf/1410.3169.pdf.
9. P. Bendich, E. Gasparovic, C. Tralie, and J. Harer, Scaffoldings and spines: Organizing high-dimensional data using cover trees, local principal component analysis, and persistent homology, technical report, (2016). https://arxiv.org/pdf/1602.06245.pdf.
10. A. Beygelzimer, S. Kakade, and J. Langford, Cover trees for nearest neighbor, in Proceedings of the 23rd International Conference on Machine Learning, ICML '06, New York, NY, USA, 2006, ACM, pp. 97–104.
11. N. Brodu and D. Lague, 3D terrestrial lidar data classification of complex natural scenes using a multi-scale dimensionality criterion: Applications in geomorphology, ISPRS Journal of Photogrammetry and Remote Sensing, 68 (2012), pp. 121–134.
12. F. Camastra, Data dimensionality estimation methods: A survey, Pattern Recognition, 36 (2003), pp. 2945–2954.
13. F. Camastra and A. Vinciarelli, Estimating the intrinsic dimension of data with a fractal-based method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2002), pp. 1404–1407.
14. K. Carter and A. Hero, Variance reduction with neighborhood smoothing for local intrinsic dimension estimation, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
15. G. Chen, A. Little, and M. Maggioni, Multi-resolution geometric analysis for data in high dimensions, in Excursions in Harmonic Analysis: The February Fourier Talks at the Norbert Wiener Center, Springer, 2013.
16. G. Chen, A. Little, and M. Maggioni, Multi-resolution geometric analysis for data in high dimensions, in Excursions in Harmonic Analysis, Springer, 2013, pp. 259–285.
17. J. Chodera, W. Swope, J. Pitera, and K. Dill, Long-time protein folding dynamics from short-time molecular dynamics simulations, Multiscale Modeling and Simulation, 5 (2006), pp. 1214–1226.
18. R. Coifman and S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis, 21 (2006), pp. 5–30.
19. R. Coifman, S. Lafon, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, Proc. Natl. Acad. Sci. USA, 102 (2005), pp. 7426–7431.
20. J. Costa, A. Girotra, and A. Hero, Estimating local intrinsic dimension with k-nearest neighbor graphs, in IEEE/SP 13th Workshop on Statistical Signal Processing, 2005.
21. J. A. Costa and A. Hero, Geodesic entropic graphs for dimension and entropy estimation in manifold learning, IEEE Transactions on Signal Processing, 52 (2004), pp. 2210–2211.
22. G. David and S. Semmes, Quantitative rectifiability and Lipschitz mappings, Trans. Amer. Math. Soc., 337 (1993), pp. 855–889. http://dx.doi.org/10.2307/2154247.
23. D. Donoho and C. Grimes, Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, Proc. Natl. Acad. Sci. USA, 100 (2003), pp. 5591–5596.
24. C. Fefferman, S. Mitter, and H. Narayanan, Testing the manifold hypothesis, J. Amer. Math. Soc., 29 (2016), pp. 983–1049.
25. K. Fukunaga, Intrinsic dimensionality extraction, in Classification, Pattern Recognition and Reduction of Dimensionality, vol. 2 of Handbook of Statistics, Elsevier, 1982, pp. 347–360.
26. P. Grassberger and I. Procaccia, Measuring the strangeness of strange attractors, Physica D, 9 (1983), pp. 189–208.
27. J. Ham, D. Lee, S. Mika, and B. Schölkopf, A kernel view of the dimensionality reduction of manifolds, in Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, New York, NY, USA, 2004, ACM, pp. 47–55.
28. G. Haro, G. Randall, and G. Sapiro, Translated Poisson mixture model for stratification learning, International Journal of Computer Vision, 80 (2008), pp. 358–374.
29. D. Joncas, M. Meila, and J. McQueen, Improved graph Laplacian via geometric self-consistency, in NIPS, 2017.
30. P. W. Jones, Rectifiable sets and the traveling salesman problem, Invent. Math., 102 (1990), pp. 1–15.
31. D. R. Karger and M. Ruhl, Finding nearest neighbors in growth-restricted metrics, in Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC '02, New York, NY, USA, 2002, ACM, pp. 741–750.
32. R. Krauthgamer and J. R. Lee, Navigating nets: Simple algorithms for proximity search, in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '04, Philadelphia, PA, USA, 2004, Society for Industrial and Applied Mathematics, pp. 798–807.
33. J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, Springer Publishing Company, Incorporated, 1st ed., 2007.
34. E. Levina and P. Bickel, Maximum likelihood estimation of intrinsic dimension, in Advances in Neural Information Processing Systems (NIPS), vol. 17, MIT Press, 2005, pp. 777–784.
35. A. Little, Estimating the Intrinsic Dimension of High-Dimensional Data Sets: A Multiscale, Geometric Approach, PhD thesis, Duke University, 2011.
36. P. M. Mather, Computer Processing of Remotely-Sensed Images: An Introduction, John Wiley & Sons.
37. J. McQueen, M. Meila, J. VanderPlas, and Z. Zhang, Megaman: Scalable manifold learning in python, Journal of Machine Learning Research, 17 (2016), pp. 1–5.
38. B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, Appl. Comput. Harmon. Anal., 21 (2006), pp. 113–127.
39. H. Narayanan and S. Mitter, Sample complexity of testing the manifold hypothesis, in Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, eds., Curran Associates, Inc., 2010, pp. 1786–1794.
40. A. Ng, M. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, in Advances in Neural Information Processing Systems (NIPS), vol. 14, 2002, pp. 849–856.
41. K. Pettis, T. Bailey, A. Jain, and R. Dubes, An intrinsic dimensionality estimator from near-neighbor information, IEEE Trans. Pattern Analysis and Machine Intelligence, (1979), pp. 25–37.
42. R. Coifman, I. Kevrekidis, S. Lafon, M. Maggioni, and B. Nadler, Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems, Multiscale Modeling and Simulation, 7 (2008), pp. 842–864.
43. S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290 (2000), pp. 2323–2326.
44. S. T. Roweis and L. K. Saul, Think globally, fit locally: unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research, 4 (2003), pp. 119–155.
45. B. Schölkopf, A. J. Smola, and K. Müller, Kernel principal component analysis, in Advances in kernel methods: support vector learning, (1999), pp. 327–352.
46. J. Shan and C. K. Toth, Topographic Laser Ranging and Scanning: Principles and Processing, CRC Press, 1st ed., 2008.
47. J. Stoker, [entry incomplete in source].
48. [entry incomplete in source].
49. F. Takens, On the numerical determination of the dimension of an attractor, Springer Berlin Heidelberg, Berlin, Heidelberg, 1985, pp. 99–106.
50. J. B. Tenenbaum, V. de Silva, and J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science, 290 (2000), pp. 2319–2323.
51. J. Theiler, S. Eubank, A. Longtin, B. Galdrikian, and J. D. Farmer, Testing for nonlinearity in time series: the method of surrogate data, Physica D: Nonlinear Phenomena, 58 (1992), pp. 77–94.
52. J. Wang and A. L. Ferguson, Nonlinear reconstruction of single-molecule free-energy surfaces from univariate time series, Physical Review E, 93 (2016).
53. X. Wang, K. Slavakis, and G. Lerman, Riemannian multi-manifold modeling, technical report, (2014). arXiv:1410.0095 (with supplementary webpage and code).
54. W. Zheng, M. Rohrdanz, M. Maggioni, and C. Clementi, Determination of reaction coordinates via locally scaled diffusion map, The Journal of Chemical Physics, 134 (2011).
55. W. Zheng, M. Rohrdanz, and C. Clementi, [entry incomplete in source].