Matrix Means and a Novel High-Dimensional Shrinkage Phenomenon
Asad Lodhia, Keith Levin and Elizaveta Levina
Department of Statistics, University of Michigan
October 17, 2019
Abstract
Many statistical settings call for estimating a population parameter, most typically the population mean, from a sample of matrices. The most natural estimate of the population mean is the arithmetic mean, but there are many other matrix means that may behave differently, especially in high dimensions. Here we consider the matrix harmonic mean as an alternative to the arithmetic matrix mean. We show that in certain high-dimensional regimes, the harmonic mean yields an improvement over the arithmetic mean in estimation error as measured by the operator norm. Counter-intuitively, studying the asymptotic behavior of these two matrix means in a spiked covariance estimation problem, we find that this improvement in operator norm error does not imply better recovery of the leading eigenvector. We also show that a Rao-Blackwellized version of the harmonic mean is equivalent to a linear shrinkage estimator that has been studied previously in the high-dimensional covariance estimation literature. Simulations complement the theoretical results, illustrating the conditions under which the harmonic matrix mean yields an empirically better estimate.
Matrix estimation problems arise in statistics in a number of areas, most prominently in covariance estimation, but also in network analysis, low-rank recovery and time series analysis. Typically, the focus is on estimating a matrix based on a single noisy realization of that matrix. For example, the problem of covariance estimation [24] focuses on estimating a population covariance matrix based on a sample of vectors, which are usually combined to form a sample covariance matrix or another estimate of the population covariance. In network analysis and matrix completion problems, the goal is typically to estimate the expectation of a matrix-valued random variable based on a single observation under suitable structural assumptions (see, e.g., [18] and citations therein). A related setting that has received less attention is when we observe a sample of matrices and aim to estimate an underlying population mean or other parameter. This arises frequently in neuroimaging data analysis, where each matrix represents connectivity within a particular subject's brain and the goal is to estimate a population brain connectivity pattern [10, 47].

The most direct approach to estimating the underlying population mean from a sample of matrices is to take the arithmetic (sample) mean, perhaps with some regularization to ensure the stability of the resulting estimate. The arithmetic matrix mean is the mean with respect to Euclidean geometry on the space of matrices, which is often not the most suitable geometry for a matrix model. A simple example can be found in our recent work [35], where we showed that in
the problem of estimating a low-rank expectation of a collection of independent random matrices with different variances, a weighted average improves upon the naïve arithmetic matrix mean. Somewhat surprisingly in light of the rich geometry of matrices, fairly little attention has been paid in the literature to the use of other matrix geometries and their associated means. An exception is work by Schwartzman [45], who argued for using the intrinsic geometry of the positive definite cone [7] in the problem of covariance estimation, since the covariance matrix has to be positive (semi-)definite, and showed that a mean with respect to this different matrix geometry can, under certain models, yield an appreciably better estimate. Recent work has considered Fréchet means in the context of multiple-network analysis [37]. Continuing in this vein, the goal of the current work is to better understand the sampling distribution of matrix means other than the arithmetic mean under different matrix geometries.

The subject of computing means with respect to different geometries has been studied at some length in the signal processing and computer vision communities, mostly in the context of the Grassmann and Stiefel manifolds [1, 22, 38]. See [46] for a good discussion of how taking intrinsic geometry into account leads to estimators other than the arithmetic mean. Recent work has considered similar geometric concerns in the context of network data [30, 37]. Kolaczyk and coauthors [30] considered the problem of averaging multiple networks on the same number of vertices, developed a novel geometry for this setting and derived a Fréchet mean for that geometry. Recent work by Lunagómez, Olhede and Wolfe [37] considered a similar network averaging problem, and presented a framework for both specifying distributions of networks and deriving corresponding sample means.

Unfortunately, most of these matrix means are not amenable to the currently available tools from random matrix theory.
Instead, in this paper we consider the behavior of the harmonic mean of a collection of random matrices, as an example of a matrix mean that arises from a markedly different notion of matrix geometry compared to the matrix arithmetic mean. The harmonic mean turns out to be well-suited to analysis using techniques in random matrix theory, and it is our hope that results established here may be extended to other related matrix means in the future. Building on random matrix results by the first author [36], we show how the harmonic matrix mean can, in certain regimes, yield a better estimate of the population mean matrix in spectral norm compared to the arithmetic mean, but not necessarily in recovering the top eigenvector in a spiked covariance model, making an important distinction between two measures of successful performance which are often assumed to behave similarly. We characterize the settings in which the harmonic matrix mean improves upon the more naïve arithmetic matrix mean as well as the settings in which it does not, and show the implications of these results for covariance estimation.

In the rest of this section, we briefly review the literature on covariance matrix estimation, since we focus on estimating the mean of a collection of Wishart matrices, which can be thought of as sample covariance matrices. There is an extensive literature on estimating a population covariance matrix on p variables from n observations. The sample covariance matrix is the maximum likelihood estimator for Gaussian data, and when the dimension p is fixed, there are classical results that fully describe its behavior [40, 2]. When p is allowed to grow with n, the sample covariance matrix is not well-behaved, and in particular becomes singular as soon as p ≥ n. There has been extensive work on understanding this phenomenon in random matrix theory, starting from the pioneering work of [39] and many more recent results, especially focusing on estimating the spectrum [50, 23].
Much work in random matrix theory has focused on the spiked model, in which the population covariance is the sum of the identity matrix and a low-rank signal matrix [28, 6, 12, 25].

The recent literature on covariance estimation in high dimensions (see [24] for a review) focused on addressing the shortcomings of the sample covariance, mainly by applying regularization. James-Stein type shrinkage was considered in early Bayesian approaches [20] and in the Ledoit-Wolf estimator [33], which shrinks the sample covariance matrix towards the identity matrix using coefficients optimized with respect to a Frobenius norm loss. Subsequent papers presented variants with different estimates of the optimal coefficients and different choices of the shrinkage target matrix for normal data [19, 26], as well as for the more general case of finite fourth moments [48]. Follow-up work presented a nonlinear regularization scheme [34], motivated by the fact that linear shrinkage of the sample covariance fails to account for the nonlinear dispersion of the sample eigenvalues. An alternative approach to regularization of the sample covariance matrix is to impose structural constraints. This class of methods includes banding or tapering of the covariance matrix, suitable when the variables have a natural ordering [49, 8, 44], and thresholding when the variables are not ordered [9, 44, 11]. Minimax rates for many structured covariance estimation problems are now known [14, 16, 15]; see [13] for a good overview of these results.

The remainder of this paper is laid out as follows. In Section 2 we establish notation and introduce the random matrix models under study in the present paper. In Section 3, we establish the asymptotic behavior of the harmonic matrix mean under these models. In Section 4 we compute a Rao-Blackwellized version of the harmonic mean of two random covariance matrices.
In Section 5, we analyze a spiked covariance model and compute the limit of the inner product of the top eigenvector of the harmonic mean with the top eigenvector of the population covariance. Finally, Section 6 briefly presents numerical simulations highlighting the settings in which the harmonic matrix mean succeeds and fails in covariance estimation. Section 7 concludes with discussion.

We begin by establishing notation. We denote the identity matrix by I, with its dimension clear from context. For a p × p matrix M, ‖M‖ denotes its operator norm and ‖M‖_F denotes its Frobenius norm. For a set A, let 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 otherwise. We denote by S_p(R) and S_p(C) the spaces of p × p symmetric and Hermitian positive definite matrices, respectively. For a p × p symmetric or Hermitian matrix M, the eigenvalues of M are denoted λ_1(M) ≥ λ_2(M) ≥ ··· ≥ λ_p(M) and their corresponding eigenvectors are denoted v_1(M), ..., v_p(M). We use ⪯ for the positive semidefinite ordering, so that M_1 ⪯ M_2 if and only if M_2 − M_1 is positive semidefinite.

Suppose that we wish to estimate the population mean Σ of a collection of N independent identically distributed self-adjoint positive definite p-by-p random matrices. The most commonly used model for positive (semi)definite random matrices is the Wishart distribution, which arises in covariance estimation and is well-studied in the random matrix theory literature.

Definition 1 (Wishart Random Matrix: Real Observation Model). Let X be a random p × n matrix with columns drawn i.i.d. from a centered normal with covariance Σ ∈ R^{p×p}.
Then W = XX^⊤/n is a real-valued random Wishart matrix with parameters Σ and n.

Some of our results are more straightforwardly stated and proved for the complex-valued version of the Wishart distribution, which we define here for the special case of identity covariance.

Definition 2 (Wishart Random Matrix: Complex Observation Model). Let X be a random p × n matrix with i.i.d. complex standard Gaussian random entries, i.e., entries of the form (Z_1 + √−1 Z_2)/√2, where Z_1 and Z_2 are independent standard real Gaussian random variables. Then W = XX^*/n is a random matrix following the complex Wishart distribution with parameters I and n.

Let {X_i}_{i=1}^N be a sequence of independent identically distributed p × n matrices with columns drawn i.i.d. from a centered normal with covariance Σ. Then for each i = 1, 2, ..., N,

    W_i := X_i X_i^*/n    (1)

is the sample covariance matrix, which follows the real-valued Wishart distribution with parameters Σ and n. Most previous approaches to covariance estimation and other related matrix estimation problems have focused on bounding the error between an estimator and Σ. This error is most commonly measured in Frobenius norm or operator norm, the latter of which is more relevant in some applications since, by the Davis-Kahan theorem [21], small operator norm error implies that one can recover the leading eigenvectors of Σ. This is of particular interest in covariance estimation when the task at hand is principal component analysis (see Section 5), but is also relevant in other problems when Σ is low-rank. For example, in the case of network analysis [35], the eigenvectors of Σ encode community structure.

Even in the modestly high-dimensional regime of p/n → γ ∈ (0, 1), the sample covariance does not concentrate about Σ in operator norm: when Σ = I, the spectral measure of each W_i satisfies the Marčenko-Pastur law with parameter γ in the large-n limit. In fact, we have the stronger result (see Proposition 1 below) that

    ‖W_i − I‖ → γ + 2√γ   a.s.
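This limit is easy to illustrate numerically. The sketch below is our own illustration (the sizes and seed are arbitrary choices, not from the paper), checking the limit for γ = 1/4:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 1600                   # illustrative sizes; gamma = p/n = 0.25
gamma = p / n

X = rng.standard_normal((p, n))
W = X @ X.T / n                    # Wishart matrix with Sigma = I

# Operator-norm distance to the population covariance I
err = np.linalg.norm(W - np.eye(p), ord=2)
predicted = gamma + 2 * np.sqrt(gamma)   # = 1.25 for gamma = 0.25
```

At these sizes the empirical error already agrees with the almost-sure limit to within a few percent.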
A straightforward estimator of Σ in this setting is the arithmetic mean of the N matrices,

    A := (1/N) Σ_{i=1}^N W_i,

which can be equivalently expressed as

    A = [X_1, ···, X_N][X_1, ···, X_N]^* / (Nn).
The arithmetic mean is a sample covariance based on a total of T = nN observations in this case, since we center by the known rather than estimated observation mean, and every covariance matrix is based on the same number of observations. Note that in the present work we assume that the observed data are mean-0, and thus there is no need to center the observations about a sample mean. This assumption comes with minimal loss of generality, since the centered sample covariance matrix of a collection of n normal random variables is Wishart distributed with parameter n − 1 in place of n, which has no effect on the asymptotic analyses below.

Remark 1.
In practical applications, there are situations where pooling observations is not appropriate, and the arithmetic mean may be ill-suited to estimating Σ as a result. For example, in resting state fMRI data, pooling observations from different subjects at a given brain region is infeasible, as the response signals at a particular brain location are not time-aligned across subjects. Nonetheless, combining sample covariance or correlation matrices across subjects via some other procedure may still be valid for estimating the population covariance or correlation matrix.

Keeping the number of observed matrices N fixed and letting p and n grow, for the case Σ = I, defining Γ = lim p/T = γ/N, we have (see Proposition 1)

    ‖A − I‖ → Γ + 2√Γ   a.s.

Throughout this paper, T will denote the total number of observations of points in p-dimensional space. The regime of interest is that in which p/T → Γ, and we will consider T = nN where N is a fixed number of matrices, and n will be tending to infinity with p. It will be convenient to define γ = lim p/n, which has the relationship Γ = γ/N.

In past work [36], the first author analyzed the behavior of the harmonic mean of a collection of independent Wishart random matrices in the regime p/n → γ ∈ (0, 1), defined as

    H := N (Σ_{i=1}^N W_i^{-1})^{-1},    (2)

provided that n > p (so that the W_i are invertible almost surely). While the harmonic mean is also inconsistent as an estimator for Σ, we will show that its operator-norm bias is, under certain conditions, smaller than that of the arithmetic mean.

When the W_i are drawn from the same underlying population, the harmonic mean can be a better estimate of the population mean Σ in operator norm than the arithmetic mean [36]. This improvement is best understood as a data splitting result, in which we partition a sample of p-dimensional observations, and compute the harmonic mean of the covariance estimates computed from each part.
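The splitting scheme can be sketched as follows (our own illustration, with arbitrary sizes and seed). The final check uses the operator arithmetic-harmonic mean inequality H ⪯ A, which holds deterministically for any positive definite W_i:

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, n = 50, 2, 200               # illustrative sizes; T = N * n observations

def part_covariance():
    # sample covariance of one cell of the partition (known mean zero)
    X = rng.standard_normal((p, n))
    return X @ X.T / n

Ws = [part_covariance() for _ in range(N)]

A = sum(Ws) / N                                            # arithmetic mean
H = N * np.linalg.inv(sum(np.linalg.inv(W) for W in Ws))   # harmonic mean, Eq. (2)

# Operator AM-HM inequality: A - H is positive semidefinite.
gap_eigs = np.linalg.eigvalsh(A - H)
```

Both means are symmetric positive definite, and every eigenvalue of A − H is nonnegative up to floating-point error.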
This is certainly counter-intuitive, but we remind the reader that our intuitions are often wrong in the high-dimensional regime.

Let D be a set of T points in R^p, and let P := {D_i}_{i=1}^N be a partition of D into N ≥ 2 cells D_i, so that

    D = ∪_{i=1}^N D_i.

Define the Wishart random matrix associated with each D_i as

    W(D_i) := (1/|D_i|) Σ_{x ∈ D_i} xx^*,

and the arithmetic and harmonic means associated with P as, respectively,

    A(P) := (1/N) Σ_{D_i ∈ P} W(D_i)   and   H(P) := N (Σ_{D_i ∈ P} W(D_i)^{-1})^{-1},

provided that W(D_i) is invertible for all i.

If the sets making up the partition P are all of the same size, then A(P) is in fact simply the sample covariance of the vectors in D and does not depend on P. The convergence of the spectrum of A(P) is classical [39]. The statement given here can be found in the modern reference [3, Theorems 3.6–7 and Theorems 5.9–10].

Proposition 1 (Marčenko-Pastur law). Suppose D is a collection of T i.i.d. p-dimensional complex Gaussians with zero mean and covariance I, with p and T tending to infinity in such a way that p/T → Γ ∈ (0, 1/2], and let P be a deterministic sequence of partitions of D such that |D_i| are equal for all D_i ∈ P. The spectral measure of A(P) converges weakly almost surely to the measure with density

    (1/(2πΓx)) √((S_+ − x)(x − S_−)) 1_{[S_−,S_+]}(x),   where   S_± = (1 ± √Γ)^2.    (3)

Further, we have the convergence

    ‖A(P) − I‖ → Γ + 2√Γ   a.s.

The following result describes the limiting behaviour of H(P) under similar conditions.

Proposition 2.
Let D be a collection of T i.i.d. p-dimensional complex Gaussians with zero mean and covariance I, with p and T tending to infinity in such a way that

    |p/T − Γ| ≤ K/p,

where K > 0 is a constant and Γ ∈ (0, 1/2]. Let N ≥ 2 be fixed with T divisible by N, and let P be a deterministic sequence of partitions of D of size N ≤ ⌊Γ^{-1}⌋ such that |D_i| are equal for all D_i ∈ P. The spectral measure of H(P) converges weakly almost surely to the measure with density

    (1/(2πΓx)) √((E_+ − x)(x − E_−)) 1_{[E_−,E_+]}(x),   where   E_± := 1 − (N − 2)Γ ± 2√Γ √(1 − (N − 1)Γ).    (4)

Further, we have the convergence

    ‖H(P) − I‖ → (N − 2)Γ + 2√Γ √(1 − (N − 1)Γ)   a.s.

The above result is a restatement of the results of [36, Theorem 2.1], since if the D_1, D_2, ..., D_N are disjoint and equal sized, then the W(D_i) are a collection of N growing Wishart matrices of the same dimension p and shared parameter n, with p/n → γ = NΓ. The case where |D_i| is permitted to vary in i, while still feasibly handled by the tools of this paper, is more complicated. The limiting spectrum of the harmonic mean in this setting depends on the roots of a high-degree polynomial, with the result that comparison of the harmonic and arithmetic mean requires a more subtle analysis. In contrast, when the cells of the partition are of equal size, the limiting spectral measure of the harmonic mean is characterized by the roots of a quadratic and thus admits an explicit solution. For the sake of concreteness and simplicity, we will assume that P is a partition with cells of equal size for the remainder of the paper.

With the interpretation of H(P) as a mean formed by splitting D into equal parts, we have the following Theorem.

Theorem 1. Under the assumptions of Proposition 2, the operator norm ‖H(P) − I‖ is minimized for a partition P of size N = 2.
Further, for such a partition,

    lim_{p,T→∞} ‖H(P) − I‖ = 2√Γ √(1 − Γ) < lim_{p,T→∞} ‖A(P) − I‖ = Γ + 2√Γ.

Proof.
By Proposition 2, for a partition of size N the limiting operator norm error is f(N), where

    f(x) = (x − 2)Γ + 2√Γ √(1 − (x − 1)Γ),   2 ≤ x ≤ Γ^{-1}.

Differentiating,

    f′(x) = Γ − Γ√Γ / √(1 − (x − 1)Γ),

which is nonnegative precisely when 1 − (x − 1)Γ ≥ Γ, that is, when x ≤ Γ^{-1}. For Γ ∈ (0, 1/2], f is therefore nondecreasing on the feasible interval [2, Γ^{-1}], so that the minimizer of f(x) is 2.

The above result suggests that given a collection D of T ≥ 2p observations, it is better asymptotically (as measured in operator norm error) to estimate the covariance by splitting D into two equal parts D_1 and D_2 and computing the harmonic mean of W(D_1) and W(D_2) than it is to directly compute the sample covariance matrix of D. We note that for N = 2,

    E_+ = 1 + 2√Γ √(1 − Γ) < S_+ = 1 + 2√Γ + Γ,

so that E_+ is closer to 1. This in turn implies that, in the case where the true covariance matrix is the identity, the harmonic mean is shrunk toward the true population covariance when compared with the arithmetic mean. As we will see below in Section 4, Rao-Blackwellizing the harmonic mean yields yet another shrinkage estimator, one that has appeared elsewhere in the literature.

There are two main drawbacks to using the complex Gaussian model from Definition 2. The first is the requirement that the random vectors be complex-valued Gaussians, which is not the case for most datasets. The second is the requirement that the true covariance be equal to the identity. The need for complex Gaussian matrices arises in the proof of Theorem 2.1 in [36], which relies on a result in free probability [17], and the remainder of the argument extends immediately to real-valued data with sub-Gaussian entries [36, Remark 3]. This is not unusual: many applications of random matrix theory to statistics have first been established for complex random matrices, for technical reasons. For example, the celebrated Baik, Ben Arous and Péché transition for spiked covariance models was first established for complex Gaussian random vectors [4] and the Tracy-Widom law for the largest eigenvalue of a general sample covariance was first established for complex Gaussians [29].
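As a numerical illustration of Theorem 1, the following sketch of our own uses real-valued Gaussians in place of the complex model of the theorem, with arbitrary sizes chosen so that Γ = 1/4:

```python
import numpy as np

rng = np.random.default_rng(2)
p, T, N = 300, 1200, 2             # illustrative sizes; Gamma = p/T = 0.25
n = T // N
Gamma = p / T

def sample_wishart():
    X = rng.standard_normal((p, n))
    return X @ X.T / n

Ws = [sample_wishart() for _ in range(N)]
A = sum(Ws) / N
H = N * np.linalg.inv(sum(np.linalg.inv(W) for W in Ws))

I = np.eye(p)
err_A = np.linalg.norm(A - I, ord=2)
err_H = np.linalg.norm(H - I, ord=2)

lim_A = Gamma + 2 * np.sqrt(Gamma)                 # = 1.25
lim_H = 2 * np.sqrt(Gamma) * np.sqrt(1 - Gamma)    # approx 0.866
```

At these sizes the two empirical operator norm errors are already close to their respective limits, with the harmonic mean clearly ahead.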
We believe these results can be made to apply to real random variables, and empirical simulations with multivariate real-valued Gaussians support this claim.

The requirement that the vectors have identity covariance is partially addressed by [36, Corollary 2.1.1], which we restate here.

Proposition 3.
Under the same assumptions as Proposition 2, let N = 2 and suppose Σ is a positive definite matrix such that

    lim sup_{p,T→∞} ‖Σ‖ ‖Σ^{-1}‖ ‖H(P) − I‖ / ‖A(P) − I‖ < 1   a.s.

Then

    lim sup_{p,T→∞} ‖√Σ H(P) √Σ − Σ‖ / ‖√Σ A(P) √Σ − Σ‖ < 1   a.s.

Since multiplying H(P) by √Σ on both sides gives a complex Wishart model with population covariance Σ (see Remark 3 below), the bound in Proposition 3 holds so long as the condition number of Σ lies in a certain range. See [36, Remark 2] for further discussion.

The harmonic mean has an interesting additional property with respect to the Frobenius norm. The arithmetic matrix mean is usually motivated as minimizing the squared Frobenius norm error. A similar objective motivates many existing shrinkage estimators for the covariance matrix [33, 26]. Under the setting considered above, the harmonic matrix mean, despite not being optimized for this loss, matches the Frobenius norm error of the arithmetic mean asymptotically.
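This agreement can be checked numerically by integrating (x − 1)² against the limiting spectral densities of Propositions 1 and 2 with N = 2; both integrals equal Γ. The sketch below is our own (the value Γ = 0.3 is illustrative; the substitution x = c + r sin t removes the square-root edge singularities):

```python
import numpy as np

def second_moment(center, radius, Gamma, m=200001):
    # integrate (x-1)^2 * sqrt(r^2 - (x-c)^2) / (2 pi Gamma x) dx over the
    # support [c - r, c + r], via x = c + r*sin(t), dx = r*cos(t) dt
    t = np.linspace(-np.pi / 2, np.pi / 2, m)
    x = center + radius * np.sin(t)
    f = (x - 1.0) ** 2 * (radius * np.cos(t)) ** 2 / (2 * np.pi * Gamma * x)
    return float(np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2)   # trapezoid rule

Gamma = 0.3
# Arithmetic mean: Marchenko-Pastur, support (1 +/- sqrt(Gamma))^2
mp = second_moment(1 + Gamma, 2 * np.sqrt(Gamma), Gamma)
# Harmonic mean (N = 2): support 1 +/- 2 sqrt(Gamma) sqrt(1 - Gamma)
hm = second_moment(1.0, 2 * np.sqrt(Gamma * (1 - Gamma)), Gamma)
```

Both `mp` and `hm` evaluate to Γ = 0.3 up to quadrature error.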
Lemma 1.
Under the conditions of Proposition 2, when N = 2 we have

    lim_{p,T→∞} (1/p) ‖H(P) − I‖_F^2 = lim_{p,T→∞} (1/p) ‖A(P) − I‖_F^2 = Γ   a.s.

Proof. Since H(P) − I is symmetric, by the almost sure weak convergence of the spectral measure of H(P), it suffices to show

    lim_{p,T→∞} (1/p) Tr[(H(P) − I)^2] → ∫_{E_−}^{E_+} (x − 1)^2 √((E_+ − x)(x − E_−)) / (2πΓx) dx,

where E_± are defined in Equation (4), and compare with

    lim_{p,T→∞} (1/p) Tr[(A(P) − I)^2] → ∫_{S_−}^{S_+} (x − 1)^2 √((S_+ − x)(x − S_−)) / (2πΓx) dx,

where S_± are defined in Equation (3). Note that

    (E_+ − E_−)/2 = 2√Γ √(1 − Γ),   E_+ + E_− = 2,

while

    (S_+ − S_−)/2 = 2√Γ,   S_+ + S_− = 2(1 + Γ).

Using Lemma 3 in the Appendix,

    ∫_{E_−}^{E_+} (x^2 − 2x + 1) √((E_+ − x)(x − E_−)) / (2πΓx) dx = (1 − Γ) − 2(1 − Γ) + 1 = Γ,

while

    ∫_{S_−}^{S_+} (x^2 − 2x + 1) √((S_+ − x)(x − S_−)) / (2πΓx) dx = (1 + Γ) − 2 + 1 = Γ.

Rao-Blackwell Improvement of the Harmonic Mean
The results in Section 3 are somewhat unexpected, and raise the question of whether other matrix means have similar properties. Analyzing other means such as the geometric or the more complicated Fréchet means [7, 45] under the high-dimensional regime poses a significant challenge since these operations currently fall outside the scope of known free probability theory. Random matrix techniques do, however, allow us to extend our analysis of the harmonic mean by computing its expectation conditioned on the arithmetic mean. By the Rao-Blackwell theorem, using this conditional expectation as an estimator yields an expected spectral norm error no worse than the unconditioned harmonic mean.

In this section we restrict ourselves to the model of Definition 1, so as to ensure the availability of explicit integrals for our quantities of interest. We expect the same results can be established for the complex model in Definition 2, but doing so would require reworking the results of [31] (restated in the Appendix for ease of reference) for the complex Wishart ensemble, which is outside the scope of the present article. Further, we restrict our attention to T = 2n (with n ≥ p) to facilitate comparison with the N = 2 case studied in the previous section. To this end, let D = D_1 ∪ D_2, where D_1 and D_2 are disjoint with |D_1| = |D_2| = n, and

    W_1 := W(D_1) = (1/|D_1|) Σ_{x ∈ D_1} xx^* = (1/n) Σ_{x ∈ D_1} xx^*,
    W_2 := W(D_2) = (1/|D_2|) Σ_{x ∈ D_2} xx^* = (1/n) Σ_{x ∈ D_2} xx^*.

The matrices W_1 and W_2 have densities [2, Theorem 7.2.2]

    f_{W_i}(w_i) = C_{n,p} det(w_i)^{(n−p−1)/2} exp(−(n/2) Tr Σ^{-1} w_i),

where

    C_{n,p} = (n/2)^{np/2} / (det(Σ)^{n/2} Γ_p(n/2)),   Γ_p(x) = π^{p(p−1)/4} Π_{i=1}^p Γ(x − (i − 1)/2).

These densities are supported on the space S_p(R) of all p × p symmetric positive definite real matrices.
As before, define

    A = (W_1 + W_2)/2,    (5)
    H = 2 (W_1^{-1} + W_2^{-1})^{-1},    (6)

and note that

    A = (1/(2n)) Σ_{x ∈ D} xx^*

is a Wishart random matrix with parameters Σ and 2n, and thus

    f_A(a) := C_{2n,p} det(a)^{(2n−p−1)/2} exp(−n Tr Σ^{-1} a),    (7)

where a takes values in all of S_p(R).

Recall that the matrix A is a sufficient statistic for the covariance matrix Σ, and note that the loss function

    ℓ(M, Σ) := ‖M − Σ‖

is convex in the variable M. By the Rao-Blackwell Theorem [43, 5a.2 (ii)], we have

    E{ℓ(E[H | A], Σ)} ≤ E{ℓ(H, Σ)},

which is to say that as an estimator, H is outperformed by the conditional expectation E[H | A], which we now compute.

Observe that the harmonic mean satisfies [7, Section 4.1]

    H = 2W_1 − W_1 A^{-1} W_1,    H = 2W_2 − W_2 A^{-1} W_2.

Averaging these two equations gives

    H = 2A − (1/2) W_1 A^{-1} W_1 − (1/2) W_2 A^{-1} W_2,

and taking the conditional expectation yields

    E[H | A] = 2A − (1/2) E[W_1 A^{-1} W_1 | A] − (1/2) E[W_2 A^{-1} W_2 | A].

To compute the matrix-valued integrals E[W_1 A^{-1} W_1 | A] and E[W_2 A^{-1} W_2 | A], we proceed by directly computing the conditional density of W_i given A.

We begin with the joint density of W_1 and W_2:

    f_{W_1}(w_1) f_{W_2}(w_2) = C_{n,p}^2 det(w_1)^{(n−p−1)/2} det(w_2)^{(n−p−1)/2} exp[−(n/2) Tr{Σ^{-1}(w_1 + w_2)}].

We will use this formula to obtain an expression for the joint density of W_1 and A. For a symmetric matrix M with entries m_{i,j}, let dm_{i,j} denote Lebesgue measure over that entry and define

    (dM) := ∧_{1 ≤ i ≤ j ≤ p} dm_{i,j},

that is, (dM) is the volume form of the matrix M. The "shear" transformation

    (w_1, w_2) ↦ (w_1, a),   a := (w_1 + w_2)/2,

maps the domain S_p(R) × S_p(R) to the region

    {M ∈ S_p(R) : 0 ⪯ M ⪯ 2a} × S_p(R),

where we remind the reader that ⪯ denotes the positive semidefinite ordering.
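The identities above are purely algebraic and can be verified directly on a random instance (a sketch of our own; sizes are arbitrary and the checks are exact up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 30, 100                     # illustrative sizes with n > p

def samp():
    X = rng.standard_normal((p, n))
    return X @ X.T / n

W1, W2 = samp(), samp()
A = (W1 + W2) / 2
Ainv = np.linalg.inv(A)
H = 2 * np.linalg.inv(np.linalg.inv(W1) + np.linalg.inv(W2))

H_alt1 = 2 * W1 - W1 @ Ainv @ W1                           # one-sided identity
H_alt2 = 2 * A - 0.5 * (W1 @ Ainv @ W1 + W2 @ Ainv @ W2)   # averaged identity
```

Both alternative expressions reproduce the harmonic mean H exactly.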
The Jacobian of this mapping is given by

    (dw_1) ∧ (dw_2) = 2^{p(p+1)/2} (dw_1) ∧ (da),

so that

    f_{W_1,A}(w_1, a) = C_{n,p}^2 2^{p(p+1)/2} det(w_1)^{(n−p−1)/2} det(2a − w_1)^{(n−p−1)/2} exp[−n Tr Σ^{-1} a].

To obtain the conditional distribution, we divide by (7), yielding

    f_{W_1|A}(w_1 | a) = (2^{np} C_{n,p}^2 / C_{2n,p}) det(w_1)^{(n−p−1)/2} det(2a − w_1)^{(n−p−1)/2} / det(2a)^{(2n−p−1)/2},    (8)

where w_1 is supported on the region

    D(a) := {m ∈ S_p(R) : 0 ⪯ m ⪯ 2a}.

Evaluating this density at a = A, for w_1 ∈ D(A),

    f_{W_1|A}(w_1 | A) := (2^{np} C_{n,p}^2 / C_{2n,p}) det(w_1)^{(n−p−1)/2} det(2A − w_1)^{(n−p−1)/2} / det(2A)^{(2n−p−1)/2}

gives a multivariate Beta distribution B(p; n, n; 2A) (see Definition 4 in the Appendix). With this notation, we have

    E[W_1 A^{-1} W_1 | A] = ∫_{D(A)} w A^{-1} w f_{W_1|A}(w | A) (dw).

The integration over w can be done using Theorem 5 in the Appendix, with n_1 = n, n_2 = n, and setting ∆ = 2A, which yields the following Lemma.

Lemma 2.
For any F that is a function of A taking value in the space of p × p matrices,

    E[W_1 F W_1 | A] = ({n(2n + 1) − 2} AFA + n {(AFA)^⊤ + Tr(AF) A}) / ((2n − 1)(n + 1)).

Setting F = A^{-1} yields

    E[W_1 A^{-1} W_1 | A] = [(2n(n + 1) − 2 + pn) / ((2n − 1)(n + 1))] A.

The same calculation can be carried out for E[W_2 A^{-1} W_2 | A] to give

    E[H | A] = 2A − [(2n(n + 1) − 2 + pn) / ((2n − 1)(n + 1))] A = [n(2n − p) / ((2n − 1)(n + 1))] A,

which is simply a rescaling of A by a deterministic constant. We summarize this result as a theorem.

Theorem 2.
Let T = 2n and D be as in Definition 1. If P is a partition of size 2 with |D_1| = |D_2| = n, then

    E[H(P) | A(P)] = [n(2n − p) / ((2n − 1)(n + 1))] A(P).

In particular, as p/T = p/(2n) → Γ ∈ (0, 1/2], the limiting spectral distribution of E[H(P) | A(P)] is that of (1 − Γ) Z, where Z is a random variable distributed according to the limiting spectral distribution of A.

The above calculations can be further extended by making a few adjustments to the matrices W_1, W_2. A number of matrix estimators take the form

    Ã := c (A + d Λ̂),

where c, d are positive scalars and Λ̂ is a positive semidefinite matrix, all depending only on A. Estimators of this form have been extensively studied in the covariance estimation literature [33, 26, 32]. One could take the extra step of applying the same regularization procedure to the matrices W_1 and W_2 before computing their (Rao-Blackwellized) harmonic mean. Suppose we replace W_1 and W_2 with

    W̃_1 := c (W_1 + d Λ̂)   and   W̃_2 := c (W_2 + d Λ̂),

respectively. Letting H̃ be the harmonic mean of W̃_1 and W̃_2, we can compute a Rao-Blackwell improvement of H̃ in much the same way that we did for H above. Indeed, we still have

    H̃ = 2Ã − (1/2) W̃_1 Ã^{-1} W̃_1 − (1/2) W̃_2 Ã^{-1} W̃_2.

We can compute the conditional expectation with respect to A as follows:

    E[H̃ | A] = 2Ã − (c^2/2)(E[W_1 Ã^{-1} W_1 | A] + E[W_2 Ã^{-1} W_2 | A])
              − c^2 d (A Ã^{-1} Λ̂ + Λ̂ Ã^{-1} A) − (cd)^2 Λ̂ Ã^{-1} Λ̂.    (9)

Using Lemma 2 with F = Ã^{-1} (and noting that A Ã^{-1} A is symmetric), we have

    (c^2/2)(E[W_1 Ã^{-1} W_1 | A] + E[W_2 Ã^{-1} W_2 | A]) = c^2 ({2n(n + 1) − 2} A Ã^{-1} A + n Tr(A Ã^{-1}) A) / ((2n − 1)(n + 1)).
    (10)

Combining Equations (9) and (10), we have

    E[H̃ | A] = 2Ã − c^2 ({2n(n + 1) − 2} A Ã^{-1} A + n Tr(A Ã^{-1}) A) / ((2n − 1)(n + 1))
              − c^2 d (A Ã^{-1} Λ̂ + Λ̂ Ã^{-1} A) − (cd)^2 Λ̂ Ã^{-1} Λ̂,

which we can write solely in terms of Ã and Λ̂ by substituting cA with Ã − cd Λ̂, obtaining

    E[H̃ | A] = 2Ã − ({2n(n + 1) − 2}[Ã − 2cd Λ̂ + (cd Λ̂) Ã^{-1} (cd Λ̂)]) / ((2n − 1)(n + 1))
              − ([np − n cd Tr(Λ̂ Ã^{-1})](Ã − cd Λ̂)) / ((2n − 1)(n + 1))
              − 2cd Λ̂ + (cd Λ̂) Ã^{-1} (cd Λ̂).

We summarize the above results in the following Theorem.

Theorem 3.
The Rao-Blackwell improvement of H̃, the harmonic mean of the matrices c(W_1 + d Λ̂) and c(W_2 + d Λ̂), where c and d are positive constants that depend only on A and Λ̂ is a positive semidefinite matrix that depends only on A, is given by

    E[H̃ | A] = n [(2n − p + cd Tr(Λ̂ Ã^{-1})) / ((2n − 1)(n + 1))] Ã
              + [(1 − n) / ((2n − 1)(n + 1))] (cd Λ̂) Ã^{-1} (cd Λ̂)
              + [(2n − 2 + np − n cd Tr(Λ̂ Ã^{-1})) / ((2n − 1)(n + 1))] cd Λ̂.

Remark 2.
For linear shrinkage estimators of the form

    Ã = (1 − λ) A + λ I,

as in [33], setting Λ̂ = I, c = 1 − λ, and cd = λ in our formula gives

    E[H̃ | A] = n [(2n − p + λ Tr(Ã^{-1})) / ((2n − 1)(n + 1))] Ã
              + [(1 − n) / ((2n − 1)(n + 1))] λ^2 Ã^{-1}
              + [(2n − 2 + np − n λ Tr(Ã^{-1})) / ((2n − 1)(n + 1))] λ I.    (11)

The results outlined above are unexpected, and somewhat odd. The implication of Theorem 2 is that the Rao-Blackwellization of H(P) is a deterministic constant multiple of A(P). This suggests that the expense of computing H(P) is not warranted, since while H(P) may improve upon A(P) as an estimator, a scalar multiple of A(P) improves still further upon H(P). Figure 4 in Section 6 explores this point empirically in the finite-sample regime. On the other hand, the form of the Rao-Blackwellized estimator in Equation (11), obtained from Theorem 3, bears noting. In contrast to the Rao-Blackwellized version of H considered in Theorem 2, this estimator involves a linear combination of Ã, Ã^{-1} and I. The form of this estimator is fundamentally different from the class of linear shrinkage estimators considered elsewhere in the literature [33, 26, 48], and warrants further study.

A major motivation for working with the operator norm is to obtain guarantees on convergence of eigenvectors, which are often the main object of interest in covariance estimation, as when the covariance is used for principal component analysis. This is done via the Davis-Kahan theorem, which bounds the distance between the leading eigenvectors v_1(Σ̂) and v_1(Σ) in terms of ‖Σ̂ − Σ‖. For example, it can be shown [51, Corollary 1] that

    ‖v_1(Σ̂) − v_1(Σ)‖ ≤ 2^{3/2} ‖Σ̂ − Σ‖ / (λ_1(Σ) − λ_2(Σ))

if ⟨v_1(Σ̂), v_1(Σ)⟩ ≥ 0. In this section, we examine whether H(P) carries information about the leading eigenvector of the population covariance matrix Σ in the regime p/T → Γ ∈ (0, 1/2], and how it compares in this respect with A(P).
We return to the complex case once more due to our reliance on Proposition 2, but as alluded to previously, we expect the resulting asymptotic formula to hold for the real case without any adjustment.

Definition 3. Let D ⊂ ℂᵖ be a set of T i.i.d. p-dimensional centered multivariate complex Gaussians with population covariance matrix
\[
\Sigma = I + \theta v v^{*}, \tag{13}
\]
where θ > 0 and v ∈ ℂᵖ has norm one. As in Proposition 2, assume that
\[
\left| \frac{p}{T} - \Gamma \right| \le \frac{K}{p},
\]
where K > 0 is a constant not depending on p or T, and Γ ∈ (0, 1/2).

Remark 3.
Let D₀ be a collection of multivariate complex Gaussians with zero mean and covariance I. If we define
\[
D = \{\sqrt{\Sigma}\, x : x \in D_0\} =: \sqrt{\Sigma}\, D_0,
\]
where Σ is given in Definition 3, then D has the same distribution as the model in Equation (13). Moreover, by this same transformation, we may take a partition P₀ of D₀ and generate a partition P of D by replacing each D₀,ᵢ in P₀ by Dᵢ := √Σ D₀,ᵢ. With this definition we have the equalities
\[
H(P) = \sqrt{\Sigma}\, H(P_0) \sqrt{\Sigma} \quad \text{and} \quad A(P) = \sqrt{\Sigma}\, A(P_0) \sqrt{\Sigma}.
\]
The theorem below follows from well-established results in the literature and can be generalized without any change to higher-rank perturbations of the identity. We focus on the simple case of one spike in order to get more explicit insight into the performance of H(P).

Theorem 4.
Let D be a spiked model as in Definition 3, and suppose P is a partition satisfying the conditions of Proposition 2, with N = 2. Then we have the almost sure convergence
\[
\lambda_1\{H(P)\} \to
\begin{cases}
1 + \dfrac{\Gamma}{\theta} + (1-\Gamma)\theta & \text{if } \theta > \sqrt{\dfrac{\Gamma}{1-\Gamma}}, \\[1ex]
1 + 2\sqrt{\Gamma}\sqrt{1-\Gamma} & \text{otherwise},
\end{cases}
\]
and
\[
\left| \left\langle v_1\{H(P)\}, v \right\rangle \right|^2 \to
\begin{cases}
\dfrac{\theta+1}{\theta} \cdot \dfrac{\theta^2(1-\Gamma) - \Gamma}{\theta^2(1-\Gamma) + \theta + \Gamma} & \text{if } \theta > \sqrt{\dfrac{\Gamma}{1-\Gamma}}, \\[1ex]
0 & \text{otherwise}.
\end{cases}
\]

Proof. To prove this result we will use the general framework of [5], which considers multiplicative spikes of the form
\[
\widetilde{M} := \sqrt{I + \theta v v^{*}}\, M \sqrt{I + \theta v v^{*}}.
\]
Here θ and v are as in Definition 3 and M is a Hermitian matrix whose eigenvalue distribution converges weakly to a spectral measure ν almost surely, with ν supported on the interval [a, b]. Assume further that the largest (smallest) eigenvalue of M converges to the right (left) edge of the support of ν and that the distribution of M is invariant under unitary conjugation; recall this implies the matrix of eigenvectors of M is Haar distributed on the unitary group. Define, for z ∈ ℂ \ [a, b],
\[
m_\nu(z) := \int_{\mathbb{R}} \frac{\nu(\mathrm{d}x)}{z - x}, \qquad
t_\nu(z) := \int_{\mathbb{R}} \frac{x\, \nu(\mathrm{d}x)}{z - x} = -1 + z\, m_\nu(z),
\]
and let t_ν⁻¹ be the functional inverse of t_ν. By [5, Theorem 2.7] we have the almost sure convergence
\[
\lambda_1(\widetilde{M}) \to
\begin{cases}
t_\nu^{-1}\left(\dfrac{1}{\theta}\right) & \text{if } \theta > 1/t_\nu(b^{+}), \\[1ex]
b & \text{otherwise}.
\end{cases}
\]
Here t_ν(b⁺) is the limit of t_ν(z) as z decreases to b, well defined when ν has a density with square root decay near b [5, Proposition 2.10]. Furthermore, by [5, Remark 2.11] we have the almost sure convergence
\[
\left| \left\langle v_1(\widetilde{M}), v \right\rangle \right|^2 \to
\begin{cases}
\dfrac{-(\theta+1)}{\theta^2\, \rho\, t_\nu'(\rho)} & \text{if } \theta > 1/t_\nu(b^{+}), \\[1ex]
0 & \text{otherwise},
\end{cases}
\qquad \text{where } \rho := t_\nu^{-1}\left(\frac{1}{\theta}\right).
\]
Applying these results using Remark 3 and taking M = H(P₀), where P₀ is the partition of a data set D₀ with population covariance I, we see that M satisfies the required convergence properties by Proposition 2 and is unitarily invariant.
Letting ν be the limiting spectral measure of H(P₀), and noting that for N = 2 the edges of its support are
\[
E_{\pm} = 1 \pm 2\sqrt{\Gamma}\sqrt{1-\Gamma},
\]
the proof now proceeds by calculation. From the results of [36, Equation (18)], m_ν(z) satisfies the fixed point equation
\[
\Gamma z\, m_\nu^2(z) + (1 - 2\Gamma - z)\, m_\nu(z) + 1 = 0.
\]
Inserting the definition of t_ν(z) and simplifying yields
\[
\Gamma\, t_\nu^2(z) + (1 - z)\, t_\nu(z) + 1 - \Gamma = 0.
\]
Taking the limit as z goes to E₊ and utilizing the square root decay of ν at E₊ yields
\[
\left( t_\nu(E_+) - \sqrt{\frac{1-\Gamma}{\Gamma}} \right)^{2} = 0
\implies t_\nu(E_+) = \sqrt{\frac{1-\Gamma}{\Gamma}}.
\]
Hence, a phase transition in the largest eigenvalue of H(P) occurs for θ > √(Γ/(1−Γ)). We can solve for the inverse of t_ν(z) by substituting z = t_ν⁻¹(w) into the polynomial fixed point equation for t_ν(z), obtaining
\[
t_\nu^{-1}(w) = 1 + \Gamma w + \frac{1-\Gamma}{w}.
\]
Assuming θ > √(Γ/(1−Γ)) and inserting w = 1/θ gives the location of the spiked eigenvalue of H(P),
\[
\rho = t_\nu^{-1}\left(\frac{1}{\theta}\right) = 1 + \frac{\Gamma}{\theta} + (1-\Gamma)\theta.
\]
Differentiating the fixed point equation of t_ν(z) gives the following fixed point equation for t_ν′(z):
\[
\left( 2\Gamma\, t_\nu(z) + 1 - z \right) t_\nu'(z) - t_\nu(z) = 0.
\]
Evaluating at ρ, where t_ν(ρ) = 1/θ, yields
\[
t_\nu'(\rho) = \frac{1/\theta}{2\Gamma/\theta + 1 - \rho} = \frac{1}{\Gamma - \theta^2(1-\Gamma)},
\]
and hence
\[
\frac{-(\theta+1)}{\theta^2\, \rho\, t_\nu'(\rho)}
= \frac{\theta+1}{\theta^2} \cdot \frac{\theta^2(1-\Gamma) - \Gamma}{1 + \frac{\Gamma}{\theta} + (1-\Gamma)\theta},
\]
which concludes the proof.

Remark 4.
The same analysis has been performed on A(P) [4, 42, 27, 41]. Under this setting, we have the almost sure convergence [5, Section 3.2]
\[
\lambda_1\{A(P)\} \to
\begin{cases}
(\theta + 1)\left(1 + \dfrac{\Gamma}{\theta}\right) & \text{if } \theta > \sqrt{\Gamma}, \\[1ex]
\left(1 + \sqrt{\Gamma}\right)^2 & \text{otherwise},
\end{cases}
\]
and
\[
\left| \left\langle v_1\{A(P)\}, v \right\rangle \right|^2 \to
\begin{cases}
\dfrac{1 - \Gamma/\theta^2}{1 + \Gamma/\theta} & \text{if } \theta > \sqrt{\Gamma}, \\[1ex]
0 & \text{otherwise}.
\end{cases}
\]
When 0 < Γ < 1/2, we have √(Γ/(1−Γ)) > √Γ, and it is possible to choose θ such that
\[
\sqrt{\Gamma} < \theta \le \sqrt{\frac{\Gamma}{1-\Gamma}}. \tag{14}
\]
Comparing the phase transition for the harmonic mean given in Theorem 4 and that for the arithmetic mean given in Remark 4, we see that when θ satisfies Equation (14), Theorem 4 and Remark 4 imply the almost sure convergence
\[
\left| \left\langle v_1\{H(P)\}, v \right\rangle \right|^2 \to 0,
\qquad
\left| \left\langle v_1\{A(P)\}, v \right\rangle \right|^2 \to \frac{1 - \Gamma/\theta^2}{1 + \Gamma/\theta}.
\]
This means that for low signal strength θ, v₁{H(P)} fails to have any relationship with v. On the other hand, when θ > √(Γ/(1−Γ)), the leading eigenvectors of both H(P) and A(P) have some relationship with v in the limit as p/T → Γ. We observe that
\[
\lim_{p,T\to\infty} \left( \left| \left\langle v_1\{A(P)\}, v \right\rangle \right|^2 - \left| \left\langle v_1\{H(P)\}, v \right\rangle \right|^2 \right)
= \frac{1 - \Gamma/\theta^2}{1 + \Gamma/\theta} - \frac{\theta+1}{\theta} \cdot \frac{\theta^2(1-\Gamma) - \Gamma}{\theta^2(1-\Gamma) + \theta + \Gamma}
= \frac{\Gamma^2 (1+\theta)^2}{(\theta + \Gamma)\left\{ \theta^2(1-\Gamma) + \theta + \Gamma \right\}} > 0,
\]
so that the leading eigenvector of A(P) functions as a better estimator, asymptotically, for all possible choices of θ.

Compare this result with the bound predicted by solely analyzing the upper bounds obtained from the Davis-Kahan theorem.
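Before turning to that comparison, the limits just stated can be probed numerically. The sketch below is our own; it uses real rather than complex Gaussians, which the discussion above suggests should obey the same limits. With Γ = p/T = 0.4 and θ = 2, the reconstructed formulas predict λ₁{H(P)} → 1 + Γ/θ + (1−Γ)θ = 2.4, |⟨v₁{H(P)}, v⟩|² → 0.625, and |⟨v₁{A(P)}, v⟩|² → 0.75.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, theta = 400, 1000, 2.0
Gamma = p / T                       # = 0.4, below the 1/2 threshold
v = np.zeros(p); v[0] = 1.0

# One sample from the spiked model Sigma = I + theta * v v^T.
X = rng.standard_normal((p, T))
X[0] *= np.sqrt(1.0 + theta)        # sqrt(Sigma) acts only on the first coordinate

n = T // 2
W1 = X[:, :n] @ X[:, :n].T / n      # sample covariance of each half
W2 = X[:, n:] @ X[:, n:].T / n
A = (W1 + W2) / 2                   # arithmetic mean = full-sample covariance
H = 2 * np.linalg.inv(np.linalg.inv(W1) + np.linalg.inv(W2))  # harmonic mean

lamH, VH = np.linalg.eigh(H)
lamA, VA = np.linalg.eigh(A)
overlap_H = (VH[:, -1] @ v) ** 2
overlap_A = (VA[:, -1] @ v) ** 2

rho_pred = 1 + Gamma / theta + (1 - Gamma) * theta              # 2.4
ovH_pred = (theta + 1) / theta * (theta**2 * (1 - Gamma) - Gamma) / (
    theta**2 * (1 - Gamma) + theta + Gamma)                     # 0.625
ovA_pred = (1 - Gamma / theta**2) / (1 + Gamma / theta)         # 0.75
print(lamH[-1], rho_pred, overlap_H, ovH_pred, overlap_A, ovA_pred)
```

At p = 400 the empirical quantities typically land within a few percent of the predicted limits, and the arithmetic mean's eigenvector overlap exceeds the harmonic mean's, consistent with the asymptotic comparison above.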
That is, taking λ₁(Σ) − λ₂(Σ) = θ in Equation (12), we have
\[
\left\| v_1\{H(P)\} - v \right\| \le \frac{2^{3/2}\, \|H(P) - \Sigma\|}{\theta}
\quad \text{when } \langle v_1\{H(P)\}, v \rangle > 0,
\]
and
\[
\left\| v_1\{A(P)\} - v \right\| \le \frac{2^{3/2}\, \|A(P) - \Sigma\|}{\theta}
\quad \text{when } \langle v_1\{A(P)\}, v \rangle > 0.
\]
Since the condition number of Σ is 1 + θ, by Proposition 3 we have
\[
\limsup_{p,T\to\infty} \frac{\|H(P) - \Sigma\|}{\|A(P) - \Sigma\|} < 1
\quad \text{when } \theta < \frac{2 + \sqrt{\Gamma}}{2\sqrt{1-\Gamma}} - 1.
\]
In order for |⟨v₁{A(P)}, v⟩| ↛
0, we would need θ > √Γ. Notice that as Γ → 1/2,
\[
\lim_{\Gamma \to 1/2} \left( \frac{2 + \sqrt{\Gamma}}{2\sqrt{1-\Gamma}} - 1 \right) = \sqrt{2} - \frac{1}{2}
\;>\; \lim_{\Gamma \to 1/2} \sqrt{\Gamma} = \frac{1}{\sqrt{2}}.
\]
It follows that there exist values of θ and Γ for which the Davis-Kahan theorem predicts that the harmonic mean yields better eigenvector recovery than the arithmetic mean, even though v₁{H(P)} has no asymptotic relationship to v while v₁{A(P)} does.

Our theoretical results presented above are asymptotic, so we complement them here with a brief investigation of the empirical finite-sample performance of the harmonic and arithmetic matrix means and related estimators.

We begin by comparing the operator norm error of the arithmetic and harmonic matrix means in recovering the true covariance matrix. We generate a random covariance matrix Σ =
U D U^⊤ ∈ ℝ^{p×p} by choosing U ∈ ℝ^{p×p} according to Haar measure on the orthogonal matrices and populating the entries of the diagonal matrix D with i.i.d. draws from the uniform distribution on the interval [1, b]. The parameter b ≥ 1 controls how well-conditioned Σ is (b = 1 corresponds to Σ = I). We then draw T = 4p independent mean 0 normal random vectors with covariance matrix Σ. Splitting these T random vectors into two equal-size samples of size n = T/
2, we compute the harmonic mean of the two resulting sample covariances. We compare the performance of this estimator to the sample covariance matrix of the full sample (i.e., the arithmetic mean of the two splits). We also include, for the sake of comparison, the shrinkage estimator proposed by Fisher and Sun [26] specifically for normal data in the high-dimensional setting, which should improve on the sample covariance matrix. Similar to the scheme originally proposed by Ledoit and Wolf [33], this estimator is a convex combination of the sample covariance matrix and a target matrix, which we take here to be the identity I ∈ ℝ^{p×p}.

Figures 1 and 2 display, for various choices of the dimension p and the condition number b, the operator norm relative error in recovering Σ for these three estimators. Over a wide range of condition numbers, the harmonic mean yields a better estimate of the population covariance Σ than does the arithmetic mean, but does not manage to match the performance of the Fisher-Sun regularized estimator. Unsurprisingly, when the regularization target matrix is close to the truth,
Figure 1:
Operator norm relative error ‖Σ̂ − Σ‖/‖Σ‖ as a function of the condition number parameter b for different choices of the data dimension p. The plot compares the arithmetic matrix mean (red), the Fisher-Sun estimator (green) and the harmonic matrix mean (blue). Each point is the mean of 20 independent trials, with the shaded regions indicating two standard deviations.
Figure 2:
Operator norm relative error in recovering the true covariance matrix Σ as a function of the data dimension for different choices of the condition number parameter b, for the arithmetic mean (red), Fisher-Sun regularized sample covariance (green) and harmonic mean (blue). Each point is the mean of 20 independent trials, with the shaded regions indicating two standard deviations.

as is the case when the condition number b is close to 1, regularization yields an especially large improvement in estimation error, but the gap between the harmonic mean and the Fisher-Sun estimator narrows substantially as soon as the condition number becomes even moderately large, corresponding to the population covariance matrix being far from the identity. In keeping with Proposition 3, which suggests that we should only expect the harmonic mean to yield improvement over the arithmetic mean for suitably small condition numbers, the size of the improvement of the harmonic mean over the (unregularized) arithmetic mean does appear to shrink as the condition number increases. Note that performance stabilizes as b and p increase because we are assessing the estimators according to relative error, not because the problem is necessarily becoming easier for larger values of these parameters.

It is natural to ask how these estimators compare as the number of observations varies. Toward that end, consider the same experimental setup discussed above, but taking ⌈qp⌉ samples, with q > 1. Larger values of q correspond, roughly, to better-conditioned sample covariance matrices. Figure 3 shows how the three estimators of the population matrix mean compare as a function of this parameter q for different choices of the dimension p and condition number b. We see that as the number of samples increases (i.e., as q increases), the improvement of the harmonic mean over the arithmetic mean decreases.
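The simulation protocol described above can be sketched in a few lines. The snippet below is our own minimal version, with a smaller dimension and fewer trials than the figures; it reproduces the qualitative findings that the harmonic mean beats the arithmetic mean in relative operator norm error when Σ is well-conditioned, and that the gap narrows as the number of samples qp grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_errors(p, b, q, reps, rng):
    """Average relative operator-norm errors of the arithmetic and harmonic means."""
    errs = np.zeros(2)
    for _ in range(reps):
        # Sigma = U D U^T: orthogonal U from QR (signs do not affect Sigma's law),
        # diagonal D with i.i.d. uniform entries on [1, b].
        U, _ = np.linalg.qr(rng.standard_normal((p, p)))
        D = rng.uniform(1.0, b, size=p)
        Sigma = (U * D) @ U.T
        root = (U * np.sqrt(D)) @ U.T
        T = int(np.ceil(q * p))
        n = T // 2
        X = root @ rng.standard_normal((p, T))
        W1 = X[:, :n] @ X[:, :n].T / n
        W2 = X[:, n:] @ X[:, n:].T / n
        AM = (W1 + W2) / 2
        HM = 2 * np.linalg.inv(np.linalg.inv(W1) + np.linalg.inv(W2))
        denom = np.linalg.norm(Sigma, 2)
        errs += [np.linalg.norm(AM - Sigma, 2) / denom,
                 np.linalg.norm(HM - Sigma, 2) / denom]
    return errs / reps

am4, hm4 = mean_errors(p=150, b=1, q=4, reps=5, rng=rng)   # T = 4p, as in Figures 1-2
am8, hm8 = mean_errors(p=150, b=1, q=8, reps=5, rng=rng)   # more samples
print(am4, hm4, am8, hm8)
```

With b = 1 (so Σ = I), the harmonic mean's error is substantially smaller at q = 4, and the ratio of harmonic to arithmetic error moves toward 1 at q = 8, mirroring Figure 3.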
This is in keeping with Proposition 2, as well as the intuition that as
Figure 3:
Operator norm error in recovering the true covariance matrix Σ as a function of the number of normal samples, parameterized by q, for different choices of the data dimension p and the condition number b, for the arithmetic mean (orange-red) and harmonic mean (teal). Each point is the mean of 20 independent trials, with the shaded regions indicating two standard deviations.

the number of samples increases, the sample covariance becomes a more stable (though still not consistent) estimate of the population covariance.

In Section 4, we saw that in the N = 2 case, the matrix harmonic mean H could be further improved by Rao-Blackwellization. Thus, we have four possible estimates of the population covariance: the arithmetic mean, the Fisher-Sun regularization of the arithmetic mean, the harmonic mean and the Rao-Blackwellization of the harmonic mean. Figure 4 compares these four estimators, under the same experimental setup as the one described above. The plot shows that Rao-Blackwellizing the harmonic mean does little to change its behavior. Indeed, the performance of the harmonic mean and its Rao-Blackwellization are so similar that their lines overlap in the plot. As in the plots above, the arithmetic mean performs more poorly as an estimator compared to the harmonic mean, but regularization using the method of Fisher and Sun improves its performance over that of the harmonic mean, if only slightly.
Figure 4:
Operator norm relative error in recovering the true covariance matrix Σ as a function of the condition number b for different choices of the data dimension p, for the arithmetic mean (red), Fisher-Sun regularization of the arithmetic mean (green), harmonic mean (blue) and Rao-Blackwellized harmonic mean (purple). Each point is the mean of 20 independent trials, with the shaded regions indicating two standard deviations. Note that the lines corresponding to the harmonic mean and its Rao-Blackwellization overlap.

Discussion
The results presented here seem to contradict our intuitions about matrix estimation under various matrix norms. It is common in the literature to measure the quality of an estimator Σ̂ by the limit of the Frobenius norm error ‖Σ̂ − Σ‖_F, and shrinkage estimators such as [33] are designed to minimize this quantity. Since the operator norm is bounded above by the Frobenius norm, controlling the Frobenius norm error is sufficient to control the operator norm. However, in the high-dimensional regime, these arguments become more subtle. One must often normalize the Frobenius norm by the matrix dimension to ensure convergence and obtain a sensible asymptotic analysis. Typically, this normalization takes the form p^{-1/2}‖Σ̂ − Σ‖_F. The analogous operator norm quantity, p^{-1/2}‖Σ̂ − Σ‖, converges to zero in many common settings, rendering the upper bound in terms of p^{-1/2}‖Σ̂ − Σ‖_F (asymptotically) trivial. For example, when Σ̂ is Wishart-distributed (as happens when Σ̂ is a sample covariance), ‖Σ̂ − Σ‖ = O(1). Thus, obtaining non-trivial bounds on ‖Σ̂ − Σ‖ requires direct analysis rather than a Frobenius norm bound.

Ultimately, the choice to analyze the operator norm error as opposed to the Frobenius norm error or some other matrix norm is guided by the inference task at hand, and the available information about the population covariance Σ that one wishes to capture in the estimator Σ̂.
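This normalization effect is easy to see numerically. The short sketch below is our own illustration at Σ = I: with the aspect ratio p/n held fixed, the operator norm error of the sample covariance stays O(1) as p grows, so p^{-1/2}‖Σ̂ − I‖ vanishes while p^{-1/2}‖Σ̂ − I‖_F does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def error_norms(p, gamma=0.25):
    """Operator and Frobenius norms of the sample-covariance error at Sigma = I."""
    n = int(p / gamma)                     # fixed aspect ratio p/n = gamma
    X = rng.standard_normal((p, n))
    E = X @ X.T / n - np.eye(p)            # estimation error of the sample covariance
    return np.linalg.norm(E, 2), np.linalg.norm(E, 'fro')

results = {p: error_norms(p) for p in (100, 400, 900)}
for p, (op, fro) in results.items():
    print(p, round(op, 3), round(op / np.sqrt(p), 3), round(fro / np.sqrt(p), 3))
```

The operator norm column stays roughly constant (near 2√γ + γ), its p^{-1/2}-scaled version shrinks toward zero, and the p^{-1/2}-scaled Frobenius column stays bounded away from zero (near √γ), so the Frobenius-based upper bound on the scaled operator error is asymptotically vacuous.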
For the task of recovering the leading eigenvector(s) of Σ, the operator norm takes a more prominent role due to the Davis-Kahan inequality, but here again the resulting bound need not be tight, and the resulting bound on eigenvector recovery may be trivial.

In summary, our results highlight the ways in which the harmonic mean H(P) may outperform the arithmetic mean A(P) as an estimator of the population covariance Σ:

• Under certain conditions on Σ, the operator norm error of H(P) is better than that of A(P). In particular, when Σ = I, the normalized Frobenius norm error of H(P) matches that of A(P), despite H(P) not being optimized for the Frobenius norm loss. We have observed similar behavior empirically when Σ is close but not equal to the identity.

• For a spiked model Σ = I + θvv*, there is a range of values of θ for which the eigenvector recovery using H(P) is always worse than that obtained using A(P), even though H(P) provides a better operator norm error than A(P).

We are not aware of any other estimators with these two properties, let alone of one that has the interpretation of being a mean with respect to a different geometry. Moreover, the fact that H(P) can be interpreted as the result of a data-splitting procedure suggests the possibility of other procedures with similar interpretations that improve over classical estimators in the high-dimensional regime. The Rao-Blackwellization of H(P) shows that, for the purpose of covariance estimation, a similar or better improvement over A(P) can also be achieved by a suitably-chosen scalar multiple of A(P).
While this shows that H(P) has better-performing alternatives in practice, our results shed light on the behavior of different means in high dimensions, opening the door to future work and a better understanding of other measures of matrix estimation error.

Our original reason for investigating other matrix means was the problem of misaligned observations, as happens in brain imaging data. In such settings, pooling the raw data (e.g., the underlying time series of resting-state fMRI imaging) is not feasible, since observations cannot be aligned across samples. We initially believed that H(P) outperforming A(P) empirically was a consequence of this setting, but surprisingly found it to be the case even in the classic covariance estimation problem with no misalignment. The results presented here are only a partial explanation of this phenomenon, and a better understanding of the various matrix means in both the aligned and the misaligned settings in high dimension warrants further study.

Acknowledgements
K. Levin and A. Lodhia are supported by NSF DMS Research Training Grant 1646108. E. Levina's research is partially supported by NSF DMS grants 1521551 and 1916222.
References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation.
Acta Applicandae Mathematicae , 80(2):199–220,2004.[2] T. W. Anderson.
An introduction to multivariate statistical analysis. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, third edition, 2003. [3] Z. Bai and J. W. Silverstein.
Spectral analysis of large dimensional random matrices. Springer Series in Statistics. Springer, New York, second edition, 2010. [4] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices.
The Annals of Probability , 33(5):1643–1697, 2005.[5] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low rankperturbations of large random matrices.
Advances in Mathematics , 227(1):494–521, 2011.[6] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimen-sion.
The Annals of Statistics , 41(4):1780–1815, 2013.[7] R. Bhatia.
Positive definite matrices . Princeton Series in Applied Mathematics. PrincetonUniversity Press, Princeton, NJ, 2007.[8] P. J. Bickel and E. Levina. Covariance regularization by thresholding.
The Annals of Statistics ,36(6):2577–2604, 2008.[9] P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices.
The Annalsof Statistics , 36(1):199–227, 2008.[10] E. Bullmore and O. Sporns. Complex brain networks: Graph theoretical analysis of structuraland functional systems.
Nature Reviews Neuroscience , 10:186–198, 2009.[11] T. T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation.
Journalof the American Statistical Association , 106(494):672–684, 2011.[12] T. T. Cai, Z. Ma, and Y. Wu. Optimal estimation and rank detection for sparse spikedcovariance matrices.
Probability Theory and Related Fields , 161(3-4):781–815, 2015.2113] T. T. Cai, Z. Ren, and H. H. Zhou. Estimating structured high-dimensional covariance andprecision matrices: optimal rates and adaptive estimation.
Electronic Journal of Statistics ,10(1):1–59, 2016.[14] T. T. Cai, C.-H. Zhang, and H. H. Zhou. Optimal rates of convergence for covariance matrixestimation.
The Annals of Statistics, 38(4):2118–2144, 2010. [15] T. T. Cai and H. H. Zhou. Minimax estimation of large covariance matrices under ℓ₁-norm. Statistica Sinica, 22(4):1319–1349, 2012. [16] T. T. Cai and H. H. Zhou. Optimal rates of convergence for sparse covariance matrix estimation.
The Annals of Statistics , 40(5):2389–2420, 2012.[17] M. Capitaine and C. Donati-Martin. Strong asymptotic freeness for Wigner and Wishartmatrices.
Indiana University Mathematics Journal , 56(2):767–803, 2007.[18] S. Chatterjee. Matrix estimation by universal singular value thresholding.
The Annals ofStatistics , 43(1):177–214, 2015.[19] Y. Chen, A. Wiesel, Y. C. Eldar, and A. O. Hero. Shrinkage algorithms for MMSE covarianceestimation.
IEEE Trans. Signal Process. , 58(10):5016–5029, 2010.[20] M. J. Daniels and R. E. Kass. Shrinkage estimators for covariance matrices.
Biometrics ,57(4):1173–1184, 2001.[21] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation.
SIAM J. NumericalAnalysis , 7(1), March 1970.[22] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonalityconstraints.
SIAM Journal on Matrix Analysis and Applications , 20(2):303–353, 1999.[23] N. El Karoui. Spectrum estimation for large dimensional covariance matrices using randommatrix theory.
The Annals of Statistics , 36(6):2757–2790, 2008.[24] J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precisionmatrices.
The Econometrics Journal , 19(1):C1–C32, 2016.[25] Z. Fan, I. M. Johnstone, and Y. Sun. Spiked covariances and principal components analysisin high-dimensional random effects models, 2018.[26] T. J. Fisher and X. Sun. Improved Stein-type shrinkage estimators for the high-dimensionalmultivariate normal covariance matrix.
Computational Statistics & Data Analysis , 55(5):1909–1918, 2011.[27] D. C. Hoyle and M. Rattray. Statistical mechanics of learning multiple orthogonal signals:asymptotic theory and fluctuation effects.
Physical Review E. Statistical, Nonlinear, and SoftMatter Physics , 75(1):016101, 13, 2007.[28] I. M. Johnstone and A. Y. Lu. On consistency and sparsity for principal components analysisin high dimensions.
Journal of the American Statistical Association , 104(486):682–693, 2009.2229] N. El Karoui. Tracy-Widom limit for the largest eigenvalue of a large class of complex samplecovariance matrices.
The Annals of Probability, 35(2):663–714, 2007. [30] E. Kolaczyk, L. Lin, S. Rosenberg, J. Xu, and J. Walters. Averages of unlabeled networks: Geometric characterization and asymptotic behavior. arXiv:1709.02793, 2017. [31] Y. Konno. Exact moments of the multivariate F and beta distributions. Journal of the Japan Statistical Society (Nihon Tôkei Gakkai Kaihô), 18(2):123–130, 1988. [32] T. Kubokawa and A. Inoue. Estimation of covariance and precision matrices under scale-invariant quadratic loss in high dimension.
Electronic Journal of Statistics , 8(1):130–158,2014.[33] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices.
Journal of Multivariate Analysis , 88(2):365–411, 2004.[34] O. Ledoit and M. Wolf. Nonlinear shrinkage estimation of large-dimensional covariance ma-trices.
The Annals of Statistics, 40(2):1024–1060, 2012. [35] K. Levin, A. Lodhia, and E. Levina. Recovering low-rank structure from multiple networks with unknown edge distributions. arXiv:1906.07265, 2019. [36] A. Lodhia. Harmonic means of Wishart random matrices. arXiv:1905.02357, 2019. [37] S. Lunagómez, S. C. Olhede, and P. J. Wolfe. Modeling network populations via graph distances, 2019. [38] T. Marrinan, J. R. Beveridge, B. Draper, M. Kirby, and C. Peterson. Finding the subspace mean or median to fit your need. In
Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition , pages 1082–1089, 2014.[39] V. A. Marˇcenko and L. A. Pastur. Distribution of eigenvalues in certain sets of randommatrices.
Matematicheskii Sbornik , 72 (114):507–536, 1967.[40] R. J. Muirhead.
Aspects of multivariate statistical theory . John Wiley & Sons, Inc., New York,1982. Wiley Series in Probability and Mathematical Statistics.[41] B. Nadler. Finite sample approximation results for principal component analysis: a matrixperturbation approach.
The Annals of Statistics , 36(6):2791–2817, 2008.[42] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model.
Statistica Sinica , 17(4):1617–1642, 2007.[43] C. R. Rao.
Linear statistical inference and its applications . John Wiley & Sons, New York-London-Sydney, second edition, 1973. Wiley Series in Probability and Mathematical Statistics.[44] A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices.
Journal of the American Statistical Association , 104(485):177–186, 2009.[45] A. Schwartzman. Lognormal distributions and geometric averages of symmetric positive defi-nite matrices.
International Statistical Review , 84(3):456–486, 2016.2346] S. T. Smith. Covariance, subspace, and intrinsic Cram´er-Rao bounds.
IEEE Transactions onSignal Processing , 53(5):1610–1630, 2005.[47] O. Sporns.
Discovering the Human Connectome . MIT Press, 2012.[48] A. Touloumis. Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings.
Computational Statistics & Data Analysis , 83:251–261, 2015.[49] W. B. Wu and M. Pourahmadi. Nonparametric estimation of large covariance matrices oflongitudinal data.
Biometrika , 90(4):831–844, 2003.[50] J. Yao, S. Zheng, and Z. Bai.
Large sample covariance matrices and high-dimensional dataanalysis , volume 39 of
Cambridge Series in Statistical and Probabilistic Mathematics . Cam-bridge University Press, New York, 2015.[51] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem forstatisticians.
Biometrika , 102:315–323, 2015.
A Technical Results
Lemma 3.
For every 0 < a < b and k ≥ 1,
\[
\int_a^b x^{k-1} \sqrt{(b-x)(x-a)}\, \mathrm{d}x
= \frac{\pi}{2} \left( \frac{a+b}{2} \right)^{k+1} \sum_{j=0}^{\lfloor (k-1)/2 \rfloor} \frac{1}{j+1} \binom{k-1}{2j} \binom{2j}{j} \left( \frac{b-a}{b+a} \right)^{2j+2} \frac{1}{2^{2j}}.
\]
Proof.
The change of variables
\[
u = \frac{2}{b-a} \left( x - \frac{a+b}{2} \right)
\]
makes the above integral equal to
\[
\left( \frac{b-a}{2} \right)^2 \int_{-1}^{1} \left\{ \left( \frac{b-a}{2} \right) u + \left( \frac{a+b}{2} \right) \right\}^{k-1} \sqrt{1-u^2}\, \mathrm{d}u,
\]
and expanding, this is equal to
\[
\sum_{j=0}^{k-1} \binom{k-1}{j} \left( \frac{b-a}{2} \right)^{j+2} \left( \frac{a+b}{2} \right)^{k-1-j} \int_{-1}^{1} u^j \sqrt{1-u^2}\, \mathrm{d}u.
\]
Each odd j vanishes by symmetry, yielding
\[
\sum_{j=0}^{\lfloor (k-1)/2 \rfloor} \binom{k-1}{2j} \left( \frac{b-a}{2} \right)^{2j+2} \left( \frac{a+b}{2} \right)^{k-1-2j} \int_{-1}^{1} u^{2j} \sqrt{1-u^2}\, \mathrm{d}u.
\]
This integral with respect to u is well-known and can be evaluated either through the trigonometric substitution u = sin θ and repeatedly integrating by parts, or by using evenness, changing variables to u = √v and relating the resulting integral to the Beta function. We obtain
\[
\int_{-1}^{1} u^{2j} \sqrt{1-u^2}\, \mathrm{d}u = \frac{\pi}{2^{2j+1}} \binom{2j}{j} \frac{1}{j+1}.
\]
Substituting this into the sum above yields
\[
\frac{\pi}{2} \sum_{j=0}^{\lfloor (k-1)/2 \rfloor} \binom{k-1}{2j} \left( \frac{b-a}{2} \right)^{2j+2} \left( \frac{a+b}{2} \right)^{k-1-2j} \binom{2j}{j} \frac{1}{(j+1)\, 2^{2j}},
\]
which is equivalent to the claimed formula.

The following are results from [31]. Let ∆ be a fixed p × p positive definite matrix.

Definition 4 (Multivariate Beta). A random p × p positive definite matrix L is said to follow a multivariate Beta distribution with parameters p, n₁, n₂ and ∆ if its density obeys
\[
f_L(\ell) := K_{n_1,n_2} \frac{\det(\ell)^{(n_1-p-1)/2} \det(\Delta - \ell)^{(n_2-p-1)/2}}{\det(\Delta)^{(N-p-1)/2}}, \qquad 0 \preceq \ell \preceq \Delta,
\]
\[
K_{n_1,n_2} := \frac{\Gamma_p\left(\frac{N}{2}\right)}{\Gamma_p\left(\frac{n_1}{2}\right) \Gamma_p\left(\frac{n_2}{2}\right)}, \qquad
\Gamma_p(x) := \pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma\left( x - \frac{i-1}{2} \right), \qquad N := n_1 + n_2; \tag{15}
\]
we denote this distribution as B(p; n₁, n₂; ∆).

The following result appears as Corollary 3.3 (ii) in [31]. We restate it here for ease of reference.

Theorem 5.
Let T be an arbitrary p × p deterministic matrix, and suppose the matrix L is distributed as B(p; n₁, n₂; ∆). Then
\[
\mathbb{E}[LTL] = \frac{n_1}{N(N-1)(N+2)} \left[ \left\{ n_1(N+1) - 2 \right\} \Delta T \Delta + n_2 \left\{ (\Delta T \Delta)^{\top} + \operatorname{Tr}(\Delta T)\, \Delta \right\} \right], \tag{16}
\]
where N = n₁ + n₂.
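The two results in this appendix are straightforward to check numerically. The sketch below is our own (with hypothetical parameter choices): the first part verifies the closed form of Lemma 3 as reconstructed here against quadrature after the substitution x = c + ρ sin t, which removes the square-root singularities; the second checks Equation (16) by Monte Carlo for a multivariate Beta matrix built from two Wishart factors with ∆ = I.

```python
import numpy as np
from math import comb, pi

def lemma3_closed_form(a, b, k):
    # Reconstructed right-hand side of Lemma 3.
    c, r = (a + b) / 2, (b - a) / (b + a)
    s = sum(comb(k - 1, 2 * j) * comb(2 * j, j) * r ** (2 * j + 2) / ((j + 1) * 4 ** j)
            for j in range((k - 1) // 2 + 1))
    return (pi / 2) * c ** (k + 1) * s

def lemma3_quadrature(a, b, k, m=200001):
    # x = c + rho*sin(t) turns the integrand into a smooth function of t.
    c, rho = (a + b) / 2, (b - a) / 2
    t = np.linspace(-pi / 2, pi / 2, m)
    y = rho ** 2 * np.cos(t) ** 2 * (c + rho * np.sin(t)) ** (k - 1)
    dt = t[1] - t[0]
    return (y.sum() - y[0] / 2 - y[-1] / 2) * dt   # trapezoid rule

quad_errs = [abs(lemma3_closed_form(a, b, k) - lemma3_quadrature(a, b, k))
             / lemma3_closed_form(a, b, k)
             for (a, b, k) in [(1.0, 3.0, 1), (0.5, 2.0, 4), (2.0, 5.0, 6)]]
print(max(quad_errs))

# Monte Carlo check of Equation (16) for L ~ B(p; n1, n2; I).
rng = np.random.default_rng(0)
p, n1, n2, trials = 2, 5, 7, 40000
N = n1 + n2
Tmat = np.array([[0.7, -0.3], [0.2, 1.1]])   # nonsymmetric, to exercise the transpose term
acc = np.zeros((p, p))
for _ in range(trials):
    G1 = rng.standard_normal((p, n1))
    G2 = rng.standard_normal((p, n2))
    V1 = G1 @ G1.T
    w, U = np.linalg.eigh(V1 + G2 @ G2.T)
    Sih = (U / np.sqrt(w)) @ U.T             # (V1 + V2)^{-1/2}
    L = Sih @ V1 @ Sih
    acc += L @ Tmat @ L
emp = acc / trials
pred = n1 / (N * (N - 1) * (N + 2)) * (
    (n1 * (N + 1) - 2) * Tmat + n2 * (Tmat.T + np.trace(Tmat) * np.eye(p)))
print(np.abs(emp - pred).max())
```

For k = 1 the lemma reduces to the area π(b − a)²/8 of a semicircular region, which gives an exact spot check; the Monte Carlo average of LTL matches the prediction of Equation (16) entrywise to within sampling error.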