Optimal Recovery of Mahalanobis Distance in High Dimension
MATAN GAVISH, RONEN TALMON, PEI-CHUN SU, AND HAU-TIENG WU
Abstract.
In this paper, we study the problem of Mahalanobis distance estimation from high-dimensional noisy data. By relying on recent transformative results in covariance matrix estimation, we demonstrate the sensitivity of MD to measurement noise, determining the exact asymptotic signal-to-noise ratio at which MD fails, and quantifying its performance otherwise. In addition, for an appropriate loss function, we propose an asymptotically optimal shrinker, which is shown to be beneficial over the classical implementation of the MD, both analytically and in simulations.

1. INTRODUCTION
High-dimensional datasets encountered in modern science often exhibit nonlinear lower-dimensional structures. Efforts to uncover such low-dimensional structures revolve around the discovery of meaningful measures of pairwise discrepancy between data points [22, 16, 3, 4]. In the so-called metric design problem, the data analyst aims to find a useful metric representing the relationship between data points embedded in high-dimensional space. In this paper, we study the Mahalanobis Distance (MD), a popular, and arguably the first, method for metric design [12, 14]. MD was originally proposed in 1936 with the classical low-dimensional setting in mind, namely, for the case where the ambient dimension p is much smaller than the dataset size n.

Interestingly, due to its useful statistical and invariance properties, MD became the basis of several geometric data analysis [28, 23, 26] and manifold learning [17, 20] techniques aimed specifically at the high-dimensional regime p ≍ n. It was recently shown that MD is also implicitly used in the seminal Locally Linear Embedding algorithm [16], when the barycenter step is properly expressed [25, (2.17)].

As the number of dimensions in typical data analysis applications continues to grow, it becomes increasingly crucial to understand the behavior of MD, as well as other metric design algorithms, in the high-dimensional regime p ≍ n. At first glance, it might seem that this regime poses little more than a computational inconvenience for metric design using MD. Indeed, it is easy to show that in the absence of measurement noise, MD cares little about the increase in the ambient dimension p.

This paper calls attention to the following key observation. In the high-dimensional regime n ≍ p, in the presence of ambient measurement noise, a new phenomenon emerges, which introduces nontrivial effects on the performance of MD. Depending on the noise level, in the high-dimensional regime, MD may be adversely affected or even fail completely. Clearly, the assumption of measurement noise cannot be realistically excluded, and yet, to the best of our knowledge, this phenomenon has not been previously fully studied. A first step in this direction was taken in [5], with the calculation of the distribution of MD under specific assumptions.

Let us describe this key phenomenon informally at first. The computation of MD involves estimation of the inverse covariance matrix, or precision matrix, corresponding to the data at hand. Classically, the estimation relies on the sample covariance, which is inverted using the Moore-Penrose pseudo-inverse. It is well known that, in the regime n ≍ p, the sample covariance matrix is a poor estimator of the underlying covariance matrix. Indeed, advances in random matrix theory from the last decade imply that the eigenvalues and eigenvectors of the sample covariance matrix are inconsistent, namely, they do not converge to the corresponding eigenvalues and eigenvectors of the underlying covariance matrix [15, 7]. Importantly, such inconsistencies in small eigenvalues, which lead to inaccurate covariance matrix estimation, become immense when applying the Moore-Penrose pseudo-inverse.

In this paper, we study this problem and propose a remedy. By relying on existing results in covariance matrix estimation, we measure the sensitivity of MD to measurement noise.
Under the assumption that the data lie on a low-dimensional linear subspace or manifold in the ambient space R^p and that the measurement noise is white Gaussian, we are able to determine the exact asymptotic signal-to-noise ratio at which MD fails, and quantify its performance otherwise. In addition, it has been known since the 1970's [19] that by shrinking the sample covariance eigenvalues one can significantly mitigate the noise effects and improve the covariance estimation in high dimensions. We formulate the classical MD as a particular choice of shrinkage estimator for the eigenvalues of the sample covariance. Building on recent results in high-dimensional covariance estimation, including the general theory in [6] and a special case with application in random tomography [18, Section 4.4], we find an asymptotically optimal shrinker, which is beneficial over the classical implementation of the MD whenever MD is computed from noisy high-dimensional data. We show that under a suitable choice of a loss function for the estimation of MD, our shrinker is the unique asymptotically optimal shrinker; the improvement in asymptotic loss it offers over the classical MD is calculated exactly.

While the present paper focuses on MD, we posit that the same phenomenon holds much more broadly and in fact affects several widely-used manifold learning and metric learning algorithms. In this regard, the present paper seeks to highlight the fact that manifold learning and metric learning algorithms will not perform as predicted by the noiseless theory in high dimensions, and may fail completely beyond a certain noise level.

2. PROBLEM SETUP
2.1. Signal Model.
Consider a point cloud in R^p supported on a d-dimensional linear subspace, where d ≤ p. For example, we may assume that we are sampling locally from a d-dimensional embedded submanifold of R^p, in which case the linear subspace is a specific tangent space to the manifold. Assuming the standard Gaussian model, we may assume that the point cloud is sampled independently and identically (i.i.d.) from
X ∼ N(µ, Σ_X) ∈ R^p,
where Σ_X is the population covariance matrix and its rank is equal to d. The MD between an arbitrary point z ∈ R^p and the underlying signal distribution X is defined by
(1) d_{Σ_X^†}(z, X) = (z − µ)^⊤ Σ_X^† (z − µ),
where † denotes the Moore-Penrose pseudo-inverse. Note that since Σ_X is positive semi-definite, by the Cholesky decomposition we have Σ_X^† = W W^⊤, where W ∈ R^{p×d}. Hence
d_{Σ_X^†}(z, X) = (z − µ)^⊤ (W W^⊤)(z − µ) = ‖W^⊤(z − µ)‖²,
which indicates that, geometrically, MD evaluates the relationship between z and X through a proper linear transform. A primary merit of MD for manifold learning stems from its invariance to rotation and rescaling. To see this property, consider a random variable X̃ = cAX, where c ∈ R and A ∈ O(p), where O(p) denotes the group of p-by-p orthogonal matrices. Its population covariance matrix is
(2) Σ_{X̃} = c² A Σ_X A^⊤
and its population mean is also rotated and rescaled to µ̃ = cAµ. If z̃ = cAz, then the MD between z̃ and X̃ is
(3) d_{Σ_{X̃}^†}(z̃, X̃) = (cA(z − µ))^⊤ Σ_{X̃}^† (cA(z − µ)) = (z − µ)^⊤ Σ_X^† (z − µ) = d_{Σ_X^†}(z, X),
demonstrating the invariance.

Suppose that the samples of X, which we refer to as the signal, are not directly observable. Instead, the observed data consist of samples from the random variable
Y = X + σξ,
where ξ ∼ N(0, I_p) is Gaussian measurement noise, which we assume for simplicity to be white, independent of X, and 0 ≤ σ < ∞. Assume that y_1, . . . , y_n i.i.d. ∼ Y is a sample of n data points. Since Σ_X is unknown, the quantity d_{Σ_X^†}(z, X), or Σ_X^† in (1), must be estimated from the (noisy) data. For simplicity of exposition, we assume that µ and σ are known; these assumptions can easily be removed.

2.2. A Class of Estimators and a Loss Function.
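To make the definitions concrete, here is a minimal numerical sketch (Python with NumPy; the dimensions, rank, and random quantities are illustrative and not taken from the paper) that evaluates (1) through the Moore-Penrose pseudo-inverse and checks the rotation/rescaling invariance (3).

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 10, 3                                   # illustrative ambient dimension and rank

# Rank-d population covariance Sigma_X = W0 W0^T and mean mu
W0 = rng.standard_normal((p, d))
Sigma_X = W0 @ W0.T
mu = rng.standard_normal(p)

def mahalanobis(z, mu, Sigma):
    """Squared Mahalanobis distance (1), using the Moore-Penrose pseudo-inverse."""
    diff = z - mu
    return diff @ np.linalg.pinv(Sigma) @ diff

z = mu + W0 @ rng.standard_normal(d)           # a test point in the signal subspace

# Rotate and rescale: X_tilde = c A X, so Sigma_tilde = c^2 A Sigma_X A^T, as in (2)
c = 2.5
A, _ = np.linalg.qr(rng.standard_normal((p, p)))   # a random orthogonal matrix
Sigma_t = c**2 * A @ Sigma_X @ A.T
z_t, mu_t = c * A @ z, c * A @ mu

# Invariance (3): both evaluations agree up to floating-point error
print(mahalanobis(z, mu, Sigma_X), mahalanobis(z_t, mu_t, Sigma_t))
```

The two printed values coincide up to numerical error, which is exactly the invariance stated in (3).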
For any p-by-p matrix M_n estimated from y_1, . . . , y_n, consider the estimator for MD
(4) d_{M_n}(z, X) = (z − µ)^⊤ M_n (z − µ).
In order to quantitatively measure the performance of any MD estimator d_{M_n}(z, X), it is useful to introduce a loss function. For any estimator of the form (4), the absolute value of the estimation error with respect to the true value (1) is
|d_{Σ_X^†}(z, X) − d_{M_n}(z, X)| = |(z − µ)^⊤ [Σ_X^† − M_n](z − µ)|.
As the test vector z is arbitrary, it is natural to consider the worst case, and define the loss of M_n at the (unknown) underlying low-dimensional covariance Σ_X by
(5) L_n(M_n | Σ_X) := sup_{‖z − µ‖_{R^p} = 1} |(z − µ)^⊤ [Σ_X^† − M_n](z − µ)| = ‖Σ_X^† − M_n‖_op,
where ‖·‖_op is the matrix operator norm. To keep the notation light, the dependence of L_n on µ and σ, as well as the dependence of M_n on the sample y_1, . . . , y_n, are implicit.

2.3. Shrinkage Estimators.
Consider matrices of the form M_n^η := η(S_n), where η : [0, ∞) → [0, ∞) and S_n is the sample covariance; here η acts spectrally, i.e., on the eigenvalues of S_n while leaving its eigenvectors unchanged. We call M_n^η the shrinkage estimator of Σ_X^† with shrinker η. A typical example is the classical MD estimator, which is a shrinkage estimator with η = η_σ^classical, where
(6) η_σ^classical(α) = 1/(α − σ²) if α > σ², and 0 if α ≤ σ².
From [12], in the traditional setup, when the dimension p is fixed and n → ∞, the classical MD estimator attains zero loss asymptotically.

Theorem 1.
Let p be fixed independently of n. Then lim_{n→∞} L_n(η_σ^classical(S_n) | Σ_X) = 0.

Proof.
Since it is well known that (S_n − σ²I_p) → Σ_X as n → ∞, substituting M_n with η_σ^classical(S_n) in (5) and taking the limit as n → ∞ completes the proof. □
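For illustration only (the function names are ours, not the paper's; NumPy assumed), here is a sketch of the classical shrinker (6) applied spectrally to the sample covariance, together with the operator-norm loss (5).

```python
import numpy as np

def classical_shrinker(alpha, sigma=1.0):
    """Classical MD shrinker (6): invert eigenvalues above the noise level sigma^2."""
    return 1.0 / (alpha - sigma**2) if alpha > sigma**2 else 0.0

def shrinkage_estimator(S, eta, **kw):
    """Apply a scalar shrinker eta to the eigenvalues of the sample covariance S."""
    lam, V = np.linalg.eigh(S)                       # eigendecomposition of S_n
    return V @ np.diag([eta(l, **kw) for l in lam]) @ V.T

def loss(M, Sigma_X):
    """Operator-norm loss (5) relative to the true precision matrix Sigma_X^dagger."""
    return np.linalg.norm(np.linalg.pinv(Sigma_X) - M, ord=2)
```

For example, `loss(shrinkage_estimator(S, classical_shrinker, sigma=1.0), Sigma_X)` evaluates the classical MD estimator against a given Σ_X.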
When p grows with n, such that p = p_n → ∞ with p_n/n → β > 0, the situation is quite different. It is known that, in this situation, the sample covariance matrix is an inconsistent estimate of the population covariance matrix [8], and Theorem 1 might not hold; that is, the classical MD might not be optimal. The following questions naturally arise when β > 0:
(1) Which shrinkage function η minimizes the loss L_n(η | ·)?
(2) How does the optimal loss compare with the loss L_n(η_σ^classical | ·)?
In the sequel, we attempt to answer these questions.

3. OPTIMAL SHRINKAGE FOR MAHALANOBIS DISTANCE
To formally capture the low-dimensional structure assumption, consider the following model, based on Johnstone's spiked covariance model [7]. Without loss of generality, we set the noise level σ = 1 and will discuss the general case subsequently.

Assumption 1 (Asymptotic(β)). The number of variables p = p_n grows with the number of observations n, such that p/n → β as n → ∞, for 0 < β ≤ 1.

Assumption 2 (Spiked model). Suppose Σ_X = Σ_Y − σ²I_p with the eigendecomposition
(7) Σ_X = U [ Σ_d  0 ; 0  0_{p−d} ] U^⊤ ∈ R^{p×p},
where d ≥ 0, Σ_d = diag(ℓ_1, · · · , ℓ_d) is a d × d matrix whose diagonal consists of the d spikes ℓ_1 > · · · > ℓ_d > 0, which are fixed and independent of p and n, and the off-diagonal elements are set to zero. For completeness, denote ℓ_{d+1} = . . . = ℓ_p = 0. Note that we assume that all spikes are simple. When d = 0, it is the null case.

Denote the eigendecomposition of S_n as
(8) S_n = V_n diag(λ_{1,n}, . . . , λ_{p,n}) V_n^⊤ ∈ R^{p×p},
where λ_{1,n} ≥ . . . ≥ λ_{p,n} ≥ 0 and V_n ∈ O(p) is the matrix whose columns are the empirical eigenvectors v_{i,n} ∈ R^p, i = 1, . . . , p. Under Assumption 1 and Assumption 2, results collected from [13, 1, 2, 15, 6] imply three important facts about the sample covariance matrix S_n.

(1) Eigenvalue spread.
Suppose Assumption 1 holds and consider the null case where Σ_d = 0. As n → ∞, the spread of the empirical eigenvalues λ_{i,n} converges to a continuous distribution called the "Marcenko-Pastur" law [13],
(9) (√((λ_+ − x)(x − λ_−)) / (2πβx)) 1_{[λ_−, λ_+]}(x) dx,
where λ_+ = (1 + √β)² and λ_− = (1 − √β)² are the limiting bulk edges.

(2) Top eigenvalue bias.
Suppose Assumption 1 and Assumption 2 hold. For 1 ≤ i ≤ d, the empirical eigenvalues satisfy λ_{i,n} → λ(ℓ_i) =: λ_i a.s. as n → ∞, where
(10) λ(α) = α + 1 + β + β/α if α > ℓ_+, and (1 + √β)² if 0 ≤ α ≤ ℓ_+,
is defined on α ∈ [0, ∞) and ℓ_+ := √β. For d + 1 ≤ i ≤ p, since ℓ_i = 0, the empirical eigenvalues λ_{i,n} follow the Marcenko-Pastur law (9).

(3) Top eigenvector inconsistency.
Suppose Assumption 1 and Assumption 2 hold. Let c_{i,n} and s_{i,n} be the cosine and sine of the angle between the i-th population eigenvector and the i-th empirical eigenvector, after properly flipping the sign of each empirical eigenvector. Note that there exists a sequence of R_n ∈ O(p) so that R_n V_n converges almost surely (a.s.) to V ∈ O(p). In the following we assume that the empirical eigenvectors have been properly rotated. It is known that when n → ∞, c_{i,n} → c(ℓ_i) a.s. and s_{i,n} → s(ℓ_i) a.s., where
(11) c(α) = √((α² − β)/(α² + βα)) if α > ℓ_+, and 0 if 0 ≤ α ≤ ℓ_+,
and
(12) s(α) = √(1 − c²(α))
are defined on α ∈ [0, ∞).

The above three properties imply that the classical MD may not be the best estimator. Inspired by [6], we may "correct" the biased eigenvalues to improve the estimation. Denote the asymptotic loss function by
(13) L_∞(η | ℓ_1, . . . , ℓ_d) := lim_{n→∞} L_n(M_n^η | Σ_X^†),
assuming the limit exists.
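The asymptotic maps (10)-(12) are straightforward to evaluate numerically. The following sketch (NumPy; the function names are ours and illustrative) computes them for any spike strength α and aspect ratio β, with σ = 1.

```python
import numpy as np

def spike_forward(alpha, beta):
    """Asymptotic sample eigenvalue lambda(alpha) of eq. (10), for sigma = 1."""
    l_plus = np.sqrt(beta)                        # detectability threshold ell_+
    if alpha > l_plus:
        return alpha + 1 + beta + beta / alpha
    return (1 + np.sqrt(beta))**2                 # sticks to the bulk edge lambda_+

def cosine(alpha, beta):
    """Asymptotic cosine c(alpha) of eq. (11) between population and sample eigenvectors."""
    l_plus = np.sqrt(beta)
    if alpha > l_plus:
        return np.sqrt((alpha**2 - beta) / (alpha**2 + beta * alpha))
    return 0.0

def sine(alpha, beta):
    """Asymptotic sine s(alpha) of eq. (12)."""
    return np.sqrt(1.0 - cosine(alpha, beta)**2)
```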
To find a shrinkage function η that minimizes L_∞(η | ℓ_1, . . . , ℓ_d), it is natural to construct the estimator by recovering the spikes ℓ_i from the biased eigenvalues λ_i. From the inversion of (10), recalling that ℓ_+ = √β, we can define
(14) ℓ(α) := (α + 1 − β + √((α + 1 − β)² − 4α))/2 − 1,   α > λ_+,
and consider the shrinkage function
(15) η^inv(α) = 1/ℓ(α) if α > λ_+, and 0 if 0 ≤ α ≤ λ_+.
However, since the true (population) eigenvectors from U and the empirical eigenvectors from V_n are not collinear [6], it is reasonable to expect the existence of an optimal shrinkage function η^∗ satisfying
L_∞(η^∗ | ℓ_1, . . . , ℓ_d) ≤ L_∞(η^inv | ℓ_1, . . . , ℓ_d)
for any spikes ℓ_1, . . . , ℓ_d. Below we show that η^inv is in fact the optimal shrinker.

3.1. Derivation of the Optimal Shrinker when σ = 1.

Definition 1. A function η : [0, ∞) → [0, ∞) is called a shrinker if it is continuous when λ > λ_+, and η(λ) = 0 when 0 ≤ λ ≤ λ_+.

Note that this shrinker is a bulk shrinker considered in [6, Definition 3]. For a shrinker η, the associated shrinkage estimator converges almost surely, that is,
(16) M_n^η → M^η := V · diag(η(λ_1), . . . , η(λ_p)) · V^⊤ a.s.
Thus, the sequence of loss functions also converges almost surely,
(17) L_n(M_n^η | Σ_X^†) = ‖Σ_X^† − M_n^η‖_op → ‖Σ_X^† − M^η‖_op a.s.
As a result, the limit in (13) exists when η is a shrinker, and we have the following theorem, which in turn establishes the optimal shrinker.

Theorem 2 (Characterization of the asymptotic loss). Suppose σ = 1. Consider the spiked covariance model satisfying Assumption 1 and Assumption 2 and a shrinker η : [0, ∞) → [0, ∞). We have, a.s.,
(18) L_∞(η | ℓ_1, . . . , ℓ_d) = max_{i=1,...,p} {∆(ℓ_i, η(λ_i))},
where ∆ : [0, ∞) × [0, ∞) → [0, ∞) is given by
(19) ∆(α, ζ) = u_+(α, ζ) if α > ℓ_+ and ζ ≤ 1/α; −u_−(α, ζ) if α > ℓ_+ and ζ > 1/α; 1/α if 0 < α ≤ ℓ_+; and 0 if α = 0,
where
(20) u_+(α, ζ) = (1/2)(1/α − ζ + √((1/α − ζ)² + 4ζ s²(α)/α)),
(21) u_−(α, ζ) = (1/2)(1/α − ζ − √((1/α − ζ)² + 4ζ s²(α)/α)).
Figure 1. The asymptotic loss for several β = p/n values as a function of the spike strength ℓ_1 in a single spike model (ℓ_2 = ℓ_3 = . . . = ℓ_d = 0). The noise level is fixed and set to σ = 1.

Proof.
Based on the property of "simultaneous block-diagonalization" for Σ_X^† and M_n^η in [6, Section 2], the properties of "orthogonal invariance" and "max-decomposability" for the operator norm in [6, Section 3], and the convergence of c_{i,n} and s_{i,n} in (11) and (12), we have
L_n(M_n^η | Σ_X^†) = max_i ‖A_i − B_{i,n}‖_op,
where A_i = [ 1/ℓ_i  0 ; 0  0 ] when ℓ_i ≠ 0 and A_i = 0_{2×2} otherwise, and
B_{i,n} = η(λ_{i,n}) [ c_{i,n}²  c_{i,n}s_{i,n} ; c_{i,n}s_{i,n}  s_{i,n}² ].
When n → ∞, the loss converges a.s. to max_i ‖A_i − B_i‖_op, where
B_i = η(λ_i) [ c²(ℓ_i)  c(ℓ_i)s(ℓ_i) ; c(ℓ_i)s(ℓ_i)  s²(ℓ_i) ].
Now we evaluate ‖A_i − B_i‖_op for different ℓ_i.

When ℓ_i > ℓ_+, denote the eigenvalues of A_i − B_i by u_+(ℓ_i, η(λ_i)) and u_−(ℓ_i, η(λ_i)). If η(λ_i) > 1/ℓ_i, we have 0 ≤ u_+(ℓ_i, η(λ_i)) ≤ −u_−(ℓ_i, η(λ_i)), and hence ‖A_i − B_i‖_op = −u_−(ℓ_i, η(λ_i)); otherwise, we have u_+(ℓ_i, η(λ_i)) ≥ −u_−(ℓ_i, η(λ_i)) ≥ 0, and hence ‖A_i − B_i‖_op = u_+(ℓ_i, η(λ_i)). For 0 < ℓ_i ≤ ℓ_+, since c(ℓ_i) = 0, we have B_i = [ 0  0 ; 0  η(λ_i) ], which equals 0_{2×2} since η(λ_i) = 0 by the definition of a shrinker; thus, ‖A_i − B_i‖_op = 1/ℓ_i. Finally, for ℓ_i = 0, A_i is a 2×2 zero matrix and ‖A_i − B_i‖_op = η(λ_i) = 0. This concludes the proof. □

Figure 1 illustrates the obtained asymptotic loss for several β = p/n values as a function of the spike strength in a single spike model. It is clear that for each β, there is a transition at ℓ_+ = √β.
We note the following interesting phenomenon stemming from Theorem 2. If there exists a nontrivial spike ℓ_i > 0 with ℓ_i ≤ ℓ_+, then L_∞(η | ℓ_1, . . . , ℓ_d) is dominated by 1/ℓ_i. Consequently, in this large p, large n regime, we cannot "rescue" such a spike, and the associated signal is lost in the noise.

An immediate consequence of Theorem 2 is that η^inv is the optimal shrinker.

Corollary 1.
Suppose σ = 1 and Assumption 1 and Assumption 2 hold. Define the asymptotically optimal shrinkage function as
(22) η^∗ := arg min_η {L_∞(η | ℓ_1, . . . , ℓ_d)},
where the argmin is evaluated over the set of all shrinkers. Then η^∗ is unique and equals η^inv given in (15). Moreover, its associated loss is
(23) max_i {∆(ℓ(λ_i))},
where
(24) ∆(α) = s(α)/α = √(β(α + 1)/(α³(α + β))) if α > ℓ_+; 1/α if 0 < α ≤ ℓ_+; and 0 if α = 0.

Note that this result coincides with the findings reported in [6]. Precisely, it is shown in [6, (1.12)] that for the operator norm, ℓ(α) in (14) is the optimal shrinkage for the covariance matrix and the precision matrix. In this corollary, we show that for the Mahalanobis distance, which is related to the precision matrix, the optimal estimator is also achieved by the optimal shrinkage, taking ℓ(α) into account.

Proof.
Based on Theorem 2, the optimal shrinker η^∗ attains min_η max_{i=1,...,d} {∆(ℓ_i, η(λ_i))}. By the max-min inequality, we have
(25) max_{i=1,...,d} min_η {∆(ℓ_i, η(λ_i))} ≤ min_η max_{i=1,...,d} {∆(ℓ_i, η(λ_i))}.
Note that for any given shrinker η, there is some ℓ_j that attains the maximum in max_{i=1,...,d} {∆(ℓ_i, η(λ_i))}. By the same argument as in [6], if we can solve arg min_{η≥0} {∆(α, η(λ(α)))} for each α > ℓ_+ separately, the two sides of (25) coincide and the pointwise minimizer yields the optimal shrinker. To simplify the notation, we denote η(λ(α)) by η.

For α > ℓ_+ and η > 1/α, we have ∆(α, η) = −u_−(α, η). By a direct calculation, we get
∂_η ∆(α, η) = (1/2)(1 + ((η − 1/α) + 2s²(α)/α)/√((1/α − η)² + 4η s²(α)/α)) > 0.
For α > ℓ_+ and 0 ≤ η ≤ 1/α, we have ∆(α, η) = u_+(α, η), and similarly, by taking the derivative of (20), we have
∂_η ∆(α, η) = (1/2)(−1 + ((η − 1/α) + 2s²(α)/α)/√((1/α − η)² + 4η s²(α)/α)) ≤ 0.
As a result, the loss is decreasing when 0 ≤ η ≤ 1/α and increasing when η > 1/α, with a discontinuity of its derivative at η = 1/α, so that the minimum over η is attained at η = 1/α. These facts imply that η^∗(λ_i) = 1/ℓ(λ_i) when λ_i > λ_+. Furthermore, by substituting η with η^∗ in (20) or (21), we get ∆(α) = s(α)/α. By definition, η^∗ vanishes below the bulk edge; thus, for 0 < α ≤ ℓ_+, ∆(α) = 1/α, and for ℓ = 0, ∆(ℓ) = 0. Finally, it is clear that η^∗ is continuous when α > λ_+, and η^∗(α) = 0 when 0 ≤ α ≤ λ_+, so η^∗ is indeed a shrinker. We thus conclude that η^∗ is the optimal shrinker. □

Figure 2. The obtained optimal shrinker with the classical shrinker overlaid, for β = p/n = 1 and σ = 1. To enhance the visualization, the y-axis of the figure is truncated.

Figure 2 illustrates the obtained optimal shrinker with the classical shrinker overlaid, for β = p/n = 1 and σ = 1. Clearly, compared with the classical shrinker, the obtained optimal shrinker truncates the eigenvalues more aggressively.
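For completeness, the optimal asymptotic loss (24) attained by η^∗ can be evaluated as follows (a sketch under the same conventions as the earlier blocks; σ = 1, NumPy, names are ours):

```python
import numpy as np

def optimal_loss(alpha, beta):
    """Asymptotic loss (24) attained by the optimal shrinker at a spike of strength alpha."""
    l_plus = np.sqrt(beta)
    if alpha == 0.0:
        return 0.0
    if alpha <= l_plus:
        return 1.0 / alpha                                   # lost spike: loss 1/alpha
    c2 = (alpha**2 - beta) / (alpha**2 + beta * alpha)       # c^2(alpha), eq. (11)
    return np.sqrt(1.0 - c2) / alpha                         # s(alpha)/alpha
```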
3.2. Derivation of the Optimal Shrinker when σ ≠ 1.

To handle the general case when σ > 0, we first rescale the data and model by setting ℓ'_i := ℓ_i/σ² and λ'_{i,n} := λ_{i,n}/σ², and consider the following shrinker defined on [0, ∞):
(26) η_σ(α) := η(α/σ²)/σ².
Note that since η plays the role of estimating the precision matrix, we normalize it back by dividing η(α/σ²) by σ². The shrinkage estimator for Σ_X^† becomes M_n^{η_σ} := η_σ(S_n), and the general optimal shrinker becomes
(27) η̃^∗(α) = 1/(σ² ℓ(α/σ²)) if α > σ²λ_+, and 0 if 0 ≤ α ≤ σ²λ_+,
and the associated loss is
max_i {∆(ℓ(λ_i/σ²))/σ²}.
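A sketch of the general optimal shrinker (27), with the spike inversion (14) inlined (NumPy; function names are ours and not from the paper):

```python
import numpy as np

def ell(lam, beta):
    """Invert the eigenvalue bias map, eq. (14): recover a spike from a sample eigenvalue."""
    t = lam + 1.0 - beta
    return 0.5 * (t + np.sqrt(t**2 - 4.0 * lam)) - 1.0

def optimal_shrinker(alpha, beta, sigma=1.0):
    """General optimal shrinker (27): rescale by sigma^2, invert the debiased spike, rescale back."""
    lam_plus = sigma**2 * (1.0 + np.sqrt(beta))**2   # bulk edge on the scale of S_n
    if alpha <= lam_plus:
        return 0.0
    return 1.0 / (sigma**2 * ell(alpha / sigma**2, beta))
```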
4. SIMULATION STUDY

Figure 3. (top) The performance of the optimal shrinker and the classical shrinker when d = 1. (bottom) The performance of the optimal shrinker and the classical shrinker when d = 4. The black curve represents the median of the difference between the loss of the classical shrinker and the theoretical optimal loss, in log scale. The blue curve represents the median of the difference between the loss of the optimal shrinker and the theoretical optimal loss, in log scale. The error bars depict the interquartile range of each shrinker (in log scale). The vertical blue line is σ = 1/√ℓ_+, which indicates the tolerable noise level for the given β and signal strength.

To numerically compare the optimal shrinker and the classical shrinker, we set β = 0.2, 0.4, 0.6, 0.8, 1 and n = 300, so that p = βn. For simplicity, we set ℓ_i = i for i = 1, . . . , d. We consider d ∈ {1, 4}. Suppose x_i, i = 1, . . . , n, are sampled i.i.d. from the random vector ∑_{ℓ=1}^d ζ_ℓ e_ℓ, where e_ℓ ∈ R^p is the unit vector with ℓ-th entry 1,
ζ_ℓ ∼ N(0, 1) for ℓ = 1, . . . , d, and ζ_ℓ is independent of ζ_k when ℓ ≠ k. The noisy data are simulated by y_i = Ax_i + σξ_i, where ξ_i ∼ N(0, I_p) is the noise, independent of ζ_ℓ, and A ∈ O(p) is randomly sampled from O(p). In the simulation, we take σ = 0.2, 0.4, . . . , 1.8. For each σ, we repeat the experiment 200 times and report the mean and variance of the loss L_n.
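The following self-contained sketch mirrors one cell of this experiment for a single illustrative choice of β, d, σ, and spike values; it is our approximation of the described protocol, not the authors' code (NumPy assumed).

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, d, sigma = 300, 0.6, 1, 0.8          # one illustrative (beta, d, sigma) setting
p = int(beta * n)

# Rank-d signal covariance with spikes on the first d coordinates, rotated by a random A in O(p)
spikes = np.arange(1, d + 1, dtype=float)
A, _ = np.linalg.qr(rng.standard_normal((p, p)))
Sigma_X = A[:, :d] @ np.diag(spikes) @ A[:, :d].T

# Noisy observations y_i = A x_i + sigma * xi_i and their sample covariance S_n (mu = 0 known)
X = A[:, :d] @ (np.sqrt(spikes)[:, None] * rng.standard_normal((d, n)))
Y = X + sigma * rng.standard_normal((p, n))
S = Y @ Y.T / n

def apply_shrinker(S, eta):
    """Apply a scalar shrinker to the eigenvalues of S, as in (16)."""
    lam, V = np.linalg.eigh(S)
    return V @ np.diag([eta(l) for l in lam]) @ V.T

classical = lambda a: 1.0 / (a - sigma**2) if a > sigma**2 else 0.0          # eq. (6)

def optimal(a):                                                               # eq. (27)
    lam_plus = sigma**2 * (1 + np.sqrt(beta))**2
    if a <= lam_plus:
        return 0.0
    t = a / sigma**2 + 1 - beta
    spike = 0.5 * (t + np.sqrt(t**2 - 4 * a / sigma**2)) - 1                  # eq. (14)
    return 1.0 / (sigma**2 * spike)

P_true = np.linalg.pinv(Sigma_X)
for name, eta in [("classical", classical), ("optimal", optimal)]:
    print(name, np.linalg.norm(P_true - apply_shrinker(S, eta), ord=2))       # loss (5)
```

Both shrinkers act eigenvalue-by-eigenvalue, as in (16); only the scalar rule differs.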
Figure 3 shows the log loss of the optimal and classical shrinkers when d = 1 and d = 4. We observe that the loss using the classical shrinker is significantly larger. This stems from the fact that, in the large p and large n regime, there are eigenvalues greater than σ² that are not associated with the signal. When applying the classical shrinker (6) (the Moore-Penrose pseudo-inverse), these irrelevant eigenvalues contribute significantly, leading to high loss. Conversely, the optimal shrinker is much more 'selective' (as illustrated in Figure 2), associating larger eigenvalues with the noise, thereby increasing the robustness of the estimator.

Our main motivation for considering MD in the high-dimensional regime p ≍ n comes from manifold learning. In manifold learning, point clouds with possible nonlinear structures are modeled by manifolds; that is, data points in R^p are assumed to lie on or close to a d-dimensional smooth manifold M ⊂ R^p. This mathematical definition can be understood intuitively: the set of data points in a small region can be well approximated by a d-dimensional affine subspace of R^p. The dimension of the manifold d is usually fixed when we sample more and more data, representing intrinsic properties of the data, whereas p may vary, depending, for example, on the specific observation modality or the advance of the sampling technology. Traditionally, the goal is to learn the structure of the manifold from the data points, and in turn, use the learned structure to embed the high-dimensional data in low dimension, facilitating a compact representation of the 'essence' of the data. In a recent line of work [17, 20], a variant of MD was proposed and used to reveal hidden intrinsic manifolds underlying high-dimensional, possibly multimodal, observations. The main purpose of MD in this hidden manifold setup is handling possible deformations caused by the observation/sampling process, which is equivalent to estimating the precision matrix locally on the manifold. Studying this case and examining the benefits stemming from the incorporation of the proposed optimal shrinker extends the scope of this paper and is left for future work.

Here we only test the performance of the proposed optimal shrinker in a simplified setup, where the high-dimensional data lie on a lower-dimensional manifold. Consider the model Y = X + σξ, where X is sampled from a curvy manifold with one chart M embedded in R^p, parametrized by [s, t, ·, 0, . . . , 0]^⊤ ∈ R^p with the third coordinate a fixed smooth nonlinear function of s and t, ξ ∼ N(0, I_p) ∈ R^p, and σ is the noise level. The ambient dimension is fixed at p = 100. For each β > 0, n = p/β samples are taken by sampling (s, t) uniformly over a fixed range. The estimation quality is quantified by the normalized error
Error(M_n, y) := |d_{M_n}(y, X) − d_{Σ_X^†}(y, X)| / d_{Σ_X^†}(y, X),
where y ∈ R^p is an arbitrary point on the manifold. We examine two cases, y = y_1 = [0, . . . , 0]^⊤ ∈ M and y = y_2 ∈ M, a second point whose first coordinate equals 2. Each case was repeated 500 times, with the mean and standard deviation reported. In Table 1, we compare the performance of the optimal shrinker estimator M_n = η̃^∗(S_n) with the performance of the classical estimator M_n = η_σ^classical(S_n). We observe that the optimal shrinker outperforms the classical estimator in this well-controlled manifold setup.
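A sketch of the normalized error used in the manifold experiment, assuming the true local covariance Σ_X is available for evaluation (names are ours; NumPy assumed):

```python
import numpy as np

def relative_error(M, Sigma_X, y, mu):
    """Normalized MD estimation error Error(M_n, y) used in the manifold experiment."""
    diff = y - mu
    d_true = diff @ np.linalg.pinv(Sigma_X) @ diff    # d_{Sigma_X^dagger}(y, X), eq. (1)
    d_est = diff @ M @ diff                           # d_{M_n}(y, X), eq. (4)
    return abs(d_est - d_true) / d_true
```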
5. CONCLUSIONS

We proposed a new estimator for MD based on precision matrix shrinkage. For an appropriate loss function, we show that the proposed estimator is asymptotically optimal and outperforms the classical implementation of MD using the Moore-Penrose pseudo-inverse of the sample covariance. Importantly, the proposed estimator is particularly beneficial when the data are noisy and high-dimensional, a case in which the classical MD estimator might completely fail. Consequently, we believe that the new estimator may be useful in modern data analysis applications involving, for example, local principal component analysis, metric design, and manifold learning.

In this work, we focused on the case in which the intrinsic dimensionality of the data (the rank of the covariance matrix) d is unknown, and therefore it was not explicitly used in the estimation. Yet, in many scenarios, this dimension is known. In this case, it could be beneficial to consider a direct truncation and use only the top d eigen-pairs for the estimation of the precision matrix. Note that in the particular manifold setup, this case is essentially different from the rank-aware shrinker discussed in [6]; in the manifold setup, the rank of the covariance matrix associated with points residing inside a small neighborhood of any point on the manifold could be much larger than d. The benefit of this truncation approach is left for future study.
Table 1. The normalized loss Error(M_n, y) at the two test points y_1 and y_2, for the optimal shrinker estimator M_n = η̃^∗(S_n) and the classical estimator M_n = η_σ^classical(S_n) in the manifold setup, over several values of β and σ. The mean and the standard deviation over 500 realizations are reported.

Acknowledgements
MG has been supported by H-CSRC Security Research Center and Israeli Science Foundation grant no. 1523/16. MG and RT were supported by a grant from the Tel-Aviv University ICRC Research Center.
References

[1] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
[2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[4] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
[5] D. Dai and T. Holgersson. High-dimensional CLTs for individual Mahalanobis distances. In Müjgan Tez and Dietrich von Rosen, editors, Trends and Perspectives in Linear Statistical Inference, pages 57–68, 2018.
[6] D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. The Annals of Statistics, 46(4):1742–1778, 2018.
[7] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, 2001.
[8] I. M. Johnstone. High dimensional statistical inference and random matrices. In Proc. Int. Congr. Math., pages 307–333, 2006.
[9] S. Kritchman and B. Nadler. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. Trans. Sig. Proc., 57(10):3930–3941, October 2009.
[10] E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, pages 777–784, 2005.
[11] G.-R. Liu, Y.-L. Lo, Y.-C. Sheu, and H.-T. Wu. Diffuse to fuse EEG spectra – intrinsic geometry of sleep dynamics for classification. arXiv preprint arXiv:1803.01710, 2018.
[12] P. C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2:49–55, 1936.
[13] V. A. Marcenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
[14] G. J. McLachlan. Mahalanobis distance. Resonance, 4(6):20–26, 1999.
[15] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17:1617–1642, 2007.
[16] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[17] A. Singer and R. R. Coifman. Non-linear independent component analysis with diffusion maps. Applied and Computational Harmonic Analysis, 25(2):226–239, 2008.
[18] A. Singer and H.-T. Wu. Two-dimensional tomography from noisy projections taken at unknown random directions. SIAM J. Imaging Sciences, 6(1):136–175, 2012.
[19] C. M. Stein. Lectures on the theory of estimation of many parameters. Journal of Soviet Mathematics, 74(5), 1986.
[20] R. Talmon and R. R. Coifman. Empirical intrinsic geometry for nonlinear modeling and time series filtering. Proceedings of the National Academy of Sciences, 110(31):12535–12540, 2013.
[21] R. Talmon, S. Mallat, H. Zaveri, and R. R. Coifman. Manifold learning for latent variable inference in dynamical systems. IEEE Transactions on Signal Processing, 63(15):3843–3856, 2015.
[22] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[23] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, 2009.
[24] H.-T. Wu, R. Talmon, and Y.-L. Lo. Assess sleep stage by modern signal processing techniques. IEEE Trans. Biomed. Engineering, 62(4):1159–1168, 2015.
[25] H.-T. Wu and N. Wu. Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. Annals of Statistics, in press, 2018.
[26] S. Xiang, F. Nie, and C. Zhang. Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognition, 41:3600–3612, 2008.
[27] O. Yair, R. Talmon, R. R. Coifman, and I. G. Kevrekidis. Reconstruction of normal forms by learning informed observation geometries from data. Proceedings of the National Academy of Sciences, 114(38):E7865–E7874, 2017.
[28] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Michigan State University, 2(2):4, 2006.
The School of Computer Science and Engineering, Hebrew University of Jerusalem
E-mail address: [email protected]

The Department of Electrical Engineering, Technion – Israel Institute of Technology
E-mail address: [email protected]

The Department of Mathematics, Duke University
E-mail address: [email protected]

The Department of Mathematics and Department of Statistical Science, Duke University
E-mail address: