Random matrices: Universality of ESDs and the circular law
TERENCE TAO, VAN VU, AND MANJUNATH KRISHNAPUR (APPENDIX)
Abstract.
Given an n × n complex matrix A, let

µ_A(x, y) := (1/n) |{1 ≤ i ≤ n : Re λ_i ≤ x, Im λ_i ≤ y}|

be the empirical spectral distribution (ESD) of its eigenvalues λ_i ∈ C, i = 1, …, n. We consider the limiting distribution (both in probability and in the almost sure convergence sense) of the normalized ESD µ_{(1/√n)A_n} of a random matrix A_n = (a_ij)_{1 ≤ i,j ≤ n}, where the random variables a_ij − E(a_ij) are iid copies of a fixed random variable x with unit variance. We prove a universality principle for such ensembles, namely that the limit distribution in question is independent of the actual choice of x. In particular, in order to compute this distribution, one can assume that x is real or complex gaussian. As a related result, we show how laws for this ESD follow from laws for the singular value distribution of (1/√n)A_n − zI for complex z. As a corollary we establish the Circular Law conjecture (both almost surely and in probability), which asserts that µ_{(1/√n)A_n} converges to the uniform measure on the unit disk when the a_ij have zero mean.

1. Introduction
Empirical spectral distributions.
This paper is concerned with the convergence of empirical spectral distributions of random matrices, both in the sense of convergence in probability and in the almost sure sense.
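Concretely, an ESD can be sampled and integrated against test functions with a few lines of linear algebra. The following is a minimal numerical sketch (assuming NumPy; the matrix size, seed, and test function are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))            # iid real N(0, 1) entries
lam = np.linalg.eigvals(A / np.sqrt(n))    # spectrum of the normalized matrix

# The ESD puts mass 1/n on each eigenvalue, so integrating a test function
# f against it is simply an average over the eigenvalues.
f = lambda w: np.exp(-np.abs(w) ** 2)
integral = float(np.mean(f(lam)))

# After dividing by sqrt(n), the eigenvalues are O(1):
radius = float(np.max(np.abs(lam)))
```

Note that the integral against any bounded test function is automatically bounded by 1, reflecting that the ESD is a probability measure.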
Definition 1.2 (Modes of convergence). For each n, let F_n be a random variable taking values in some Hausdorff topological space X, and let F be another element of X.

• We say that F_n converges in probability to F if for every neighbourhood V of F, we have lim_{n→∞} P(F_n ∈ V) = 1.
• We say that F_n converges almost surely to F if we have P(lim_{n→∞} F_n = F) = 1.
Similarly, if X_n is a scalar random variable, we say that X_n is bounded in probability if we have

lim_{C→∞} lim inf_{n→∞} P(|X_n| ≤ C) = 1

and almost surely bounded if we have

P(lim sup_{n→∞} |X_n| < ∞) = 1.

Let M_n(C) denote the set of n × n complex matrices. For A ∈ M_n(C), we let

µ_A(s, t) := (1/n) |{1 ≤ i ≤ n : Re λ_i ≤ s, Im λ_i ≤ t}|

be the empirical spectral distribution (ESD) of its eigenvalues λ_i ∈ C, i = 1, …, n. This is a discrete probability measure on C.

Now suppose that A_n ∈ M_n(C) is a random matrix ensemble (i.e. a probability distribution on M_n(C)), and let µ_∞ be a probability measure on C. We give the space of probability measures on C the usual vague topology; thus a sequence of deterministic measures µ_n converges to µ if ∫_C f dµ_n converges to ∫_C f dµ for every test function (i.e. continuous and compactly supported function) f : C → R. Thus, by Definition 1.2, we see that µ_{(1/√n)A_n} converges in probability to µ_∞ if for every continuous and compactly supported function f : C → R, the expression

∫_C f(z) dµ_{(1/√n)A_n}(z) − ∫_C f(z) dµ_∞(z)   (1)

converges to zero in probability, thus

lim_{n→∞} P(|∫_C f(z) dµ_{(1/√n)A_n}(z) − ∫_C f(z) dµ_∞(z)| ≥ ε) = 0

for every ε > 0. Similarly, µ_{(1/√n)A_n} converges almost surely to µ_∞ if with probability 1, the expression (1) converges to zero for all f : C → R.

Remark 1.3. In practice, our matrices A_n will have bounded entries on the average, which suggests (by the Weyl comparison inequality, see Lemma A.2) that their eigenvalues should be of size about O(√n); thus the normalization by 1/√n is natural.

1.4. Universality.
A fundamental problem in the theory of random matrices is to determine the limiting distribution of the ESD of a random matrix ensemble (either in probability or in the almost sure sense), as the size of the random matrix tends to infinity. The situation with this problem, so far, is that the analysis depends very much on which ensemble one is dealing with. In some cases, such as
when the entries have a gaussian distribution, a powerful group-theoretic structure (e.g. invariance under the orthogonal group O(n) or unitary group U(n)) plays an essential role, as one can use it to derive an explicit formula for the joint distribution of the eigenvalues. The limiting distribution can then be computed directly from this formula. In the majority of cases, however, there is little symmetry, and such a formula is not available. Consequently, the problem becomes much harder and its analysis typically requires tools from various areas of mathematics. On the other hand, there is a well-known intuition behind this problem (and many others concerning random matrices), the universality phenomenon, which asserts that the limiting distribution should not depend on the particular distribution of the entries. This phenomenon motivates many theorems and conjectures in the area. In the following, we mention two famous examples, Wigner's semi-circle law and the Circular Law conjecture.

Wigner's semi-circle law.
In the 1950's, motivated by numerical experiments, Wigner [28] proved that the ESD of an n × n hermitian matrix with (upper diagonal) entries being iid gaussian random variables converges to the semi-circle law F whose density is given by

ρ(x) = (1/2π) √(4 − x²) for |x| ≤ 2, and ρ(x) = 0 for |x| > 2.

Wigner's result (which holds for both modes of convergence) was later extended to many other ensembles. The most general form only requires the mean and variance of the entries [16, 2]:
Theorem 1.5.
Let A_n be the n × n hermitian random matrix whose upper diagonal entries are iid complex random variables with mean 0 and variance 1. Then the ESD of (1/√n)A_n converges (both in probability and in the almost sure sense) to the semi-circle distribution.

Circular Law Conjecture. The well-known Circular Law conjecture deals with non-hermitian matrices.
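Before turning to the non-hermitian setting, Theorem 1.5 can be illustrated numerically. The sketch below (NumPy assumed; the size, seed, and tolerances are illustrative) compares empirical moments of a Wigner matrix's ESD against the semi-circle predictions ∫ x² ρ(x) dx = 1 and ∫ x⁴ ρ(x) dx = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
G = rng.standard_normal((n, n))
H = (G + G.T) / np.sqrt(2)                 # real symmetric; off-diagonal variance 1
lam = np.linalg.eigvalsh(H / np.sqrt(n))   # real spectrum of the normalized matrix

m2 = float(np.mean(lam ** 2))      # semi-circle law predicts 1
m4 = float(np.mean(lam ** 4))      # semi-circle law predicts 2
edge = float(np.max(np.abs(lam)))  # support of the semi-circle law is [-2, 2]
```

The even moments of the semi-circle law are the Catalan numbers, which is the combinatorial heart of Wigner's original moment-method proof.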
Conjecture 1.6.
Let A_n be the n × n random matrix whose entries are iid complex random variables with mean 0 and variance 1. Then the ESD of (1/√n)A_n converges (both in probability and in the almost sure sense) to the uniform distribution on the unit disk.

Similarly to Wigner's law, this conjecture was posed, based on numerical evidence, in the 1950's. The case when the entries have complex gaussian distribution was verified by Mehta [14] in 1967, using Ginibre's formula for the joint density function of the eigenvalues of A_n (see, for example, [2, Chapter 10]):

p(λ_1, …, λ_n) = c_n ∏_{1 ≤ i < j ≤ n} |λ_i − λ_j|² exp(−Σ_{k=1}^n |λ_k|²).   (2)

[Figure 1. Eigenvalue plots of two randomly generated 5000 × 5000 matrices. On the left, each entry was an iid Bernoulli random variable, taking the values +1 and −1 with probability 1/2 each. On the right, each entry was an iid gaussian random variable, with probability density function (1/√(2π)) exp(−x²/2).]

Theorem 1.7. Let x and y be complex random variables with zero mean and unit variance. Let X_n and Y_n be n × n random matrices whose entries are iid copies of x and y, respectively. For each n, let M_n be a deterministic n × n matrix satisfying

sup_n (1/n²) ‖M_n‖²_2 < ∞,   (3)

where ‖·‖_2 denotes the Frobenius norm. Let A_n := M_n + X_n and B_n := M_n + Y_n. Then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges in probability to zero. If furthermore we make the additional hypothesis that the ESDs

µ_{((1/√n)M_n − zI)((1/√n)M_n − zI)*}   (4)

converge to a limit for almost every z, then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges almost surely to zero.

Remark 1.8. The theorem still holds if we restrict the size of the matrices to an infinite subsequence n_1 < n_2 < … of positive integers. This freedom to pass to a subsequence is useful for technical reasons involving compactness arguments.

The condition (3) has the following useful consequence, which we shall use repeatedly:

Lemma 1.9 (Tightness of ESDs). Let M_n and A_n be as in Theorem 1.7. Then the quantities (1/n²)‖A_n‖²_2 and ∫_C |z|² dµ_{(1/√n)A_n}(z) are almost surely bounded (and hence also bounded in probability).

Proof. By the Weyl comparison inequality (Lemma A.2) it suffices to show that (1/n²)‖A_n‖²_2 is almost surely bounded.
By (3) and the triangle inequality it suffices to show that (1/n²)‖X_n‖²_2 is almost surely bounded. But this follows from the finite second moment of x and the strong law of large numbers. □

As an immediate corollary of Theorem 1.7, we have

Corollary 1.10 (Universality principle). Let x, y be complex random variables with zero mean and unit variance. Let X_n and Y_n be n × n random matrices whose entries are iid copies of x and y, respectively. For each n, let M_n be a deterministic n × n matrix satisfying (3). Let A_n := M_n + X_n and B_n := M_n + Y_n. If µ_{(1/√n)B_n} converges in probability to a limiting measure µ, then µ_{(1/√n)A_n} also converges in probability to µ. If furthermore we make the additional hypothesis that the ESDs (4) converge to a limit for almost every z, then we can replace "in probability" by "almost surely" in the previous sentence.

A demonstration of this corollary appears in Figure 2.

Remark 1.11. One consequence of Corollary 1.10 (in the case when (4) converges to a limit) is that the ESD µ_{(1/√n)A_n} behaves asymptotically deterministically, in the sense that there exists a deterministic measure µ_n for each n such that µ_{(1/√n)A_n} − µ_n converges almost surely to zero. Indeed, one can simply take µ_n to be an instance of µ_{(1/√n)B_n}, where the B_n are selected independently of the A_n, and the claim will hold almost surely. The question remains as to whether µ_n itself converges to some limit as n → ∞; we partially address this issue in Theorem 1.23 below.

1.12. The Circular Law Conjecture. Thanks to Corollary 1.10, we can reduce the problem of computing the limiting distribution to the case when the entries are gaussian (or have any special distribution satisfying the variance bound). In particular, since the Circular Law is verified for random matrices with complex gaussian entries (see [14]), it follows that this law (both in probability and in the almost sure sense) holds in full generality.
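As an illustration of the law just established, one can check the radial statistics of the spectrum of a Bernoulli matrix against the uniform measure on the unit disk, for which µ({|z| ≤ r}) = r². A minimal sketch (NumPy assumed; size, seed, and tolerances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = rng.integers(0, 2, (n, n)) * 2.0 - 1.0   # iid Bernoulli entries, +1/-1 each w.p. 1/2
lam = np.linalg.eigvals(X / np.sqrt(n))

# Circular law: the mass of the ESD inside the disk of radius r tends to r^2.
frac_half = float(np.mean(np.abs(lam) <= 0.5))   # limiting value: 0.25
frac_nine = float(np.mean(np.abs(lam) <= 0.9))   # limiting value: 0.81
```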
In other words, we have shown:

(Footnote: The authors thank Oded Schramm for this observation. The idea of establishing a limiting law by first replacing a general random variable with a gaussian one is sometimes referred to as the "Lindeberg trick" in the literature.)

[Figure 2. Eigenvalue plots of randomly generated n × n matrices of the form D_n + M_n, where n = 5000. In the left column, each entry of M_n was an iid Bernoulli random variable, taking the values +1 and −1 with probability 1/2 each, and in the right column, each entry was an iid gaussian random variable with probability density function (1/√(2π)) exp(−x²/2); D_n is a deterministic diagonal matrix.]

Theorem 1.13 (Circular Law). Let X_n be the n × n random matrix whose entries are iid complex random variables with mean 0 and variance 1. Then the ESD of (1/√n)X_n converges (both in probability and in the almost sure sense) to the uniform distribution on the unit disk.

Remark 1.14. In [26] (see also [10] for an alternate proof for the in probability sense), this theorem was proven with the extra assumption that the entries have finite (2 + ε)-th moment for any fixed ε > 0. These results take M_n to be the all zero matrix (for which the boundedness and convergence hypotheses are trivial). In [12], explicit distributions were computed for the case when M_n is an arbitrary diagonal matrix and X_n has iid gaussian entries. The formula for the limiting distribution is somewhat technical, but its support is easy to describe: it is exactly the set of z ∈ C for which ∫ |z − x|^{−2} dµ(x) ≥ 1, where µ is the limiting distribution of the ESD of M_n. (In the case M_n is all zero, µ has all its mass at the origin, and so the set of z is the unit disk.)

The proof of Theorem 1.7 actually shows that if M_n and M′_n both obey (3) and have the property that the difference between the ESD (4) and the counterpart for M′_n converges to zero for almost every z, then Theorem 1.7 holds with A_n := M_n + X_n and B_n := M′_n + Y_n (see Remark B.3). This has the following interesting consequence. Assume that M_n is a matrix with low rank, say o(n). In this case, it is easy to see that the ESD (4) concentrates at |z|², since the matrix involved here is a self-adjoint low rank perturbation of |z|² I. Thus, we can replace M_n by the zero matrix and obtain

Corollary 1.15 (Circular Law for shifted matrices). Let X_n be the n × n random matrix whose entries are iid complex random variables with mean 0 and variance 1, and let M_n be a deterministic matrix with rank o(n) and obeying (3). Let A_n := M_n + X_n. Then the ESD of (1/√n)A_n converges (in either sense) to the uniform distribution on the unit disk.

In particular, this shows that Theorem 1.13 still holds if the entries have (the same) non-zero mean. This extends a result of Chafaï [5], which in addition assumed that the entries had finite fourth moment.

1.16. Extensions. We can extend Theorem 1.7 in several ways. First, by conditioning, we can obtain a theorem for M_n being a random matrix.

Theorem 1.17 (Universality from a random base matrix). Let x and y be complex random variables with zero mean and unit variance. Let X_n = (x_ij)_{1 ≤ i,j ≤ n} and Y_n = (y_ij)_{1 ≤ i,j ≤ n} be n × n random matrices whose entries are iid copies of x and y, respectively. For each n, let M_n be a random n × n matrix, independent of X_n and Y_n, such that (1/n²)‖M_n‖²_2 is bounded in probability (see Definition 1.2). Let A_n := M_n + X_n and B_n := M_n + Y_n. Then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges in probability to zero.
If we furthermore assume that (1/n²)‖M_n‖²_2 is almost surely bounded, and (4) converges almost surely to some limit for almost every z, then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges almost surely to zero.

We can also address a more general form of random matrices (cf. [8]). Let K_n, L_n be two sequences of matrices. Define A_n := M_n + K_n X_n L_n and B_n := M_n + K_n Y_n L_n. We can show that under some mild assumptions on M_n, K_n, L_n, Theorem 1.7 still holds:

[Figure 3. Eigenvalue plots of two randomly generated 5000 × 5000 matrices of the form A + B M_n B, where A and B are deterministic diagonal matrices. On the left, each entry of M_n was an iid Bernoulli random variable, taking the values +1 and −1 with probability 1/2 each. On the right, each entry of M_n was an iid gaussian random variable, with probability density function (1/√(2π)) exp(−x²/2).]

Theorem 1.18. Let x and y be complex random variables with zero mean and unit variance. Let X_n and Y_n be n × n random matrices whose entries are iid copies of x and y, respectively. Let M_n, K_n, L_n be random n × n matrices (independent of X_n, Y_n), and let A_n := M_n + K_n X_n L_n and B_n := M_n + K_n Y_n L_n. Assume that the expression

(1/n²)‖A_n‖²_2 + (1/n²)‖B_n‖²_2 + (1/n²)‖K_n^{−1} M_n L_n^{−1}‖²_2 + (1/n)‖K_n^{−1} L_n^{−1}‖²_2   (5)

is bounded in probability. Then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges in probability to zero. If furthermore we assume that (5) is almost surely bounded, and that for almost every z the ESDs

µ_{((1/√n)K_n^{−1} M_n L_n^{−1} − z K_n^{−1} L_n^{−1})((1/√n)K_n^{−1} M_n L_n^{−1} − z K_n^{−1} L_n^{−1})*}   (6)

converge almost surely to a limit, then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges almost surely to zero.

Note that Theorem 1.17 is the special case of Theorem 1.18 in which K_n = L_n = I.
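The universality phenomenon in Theorems 1.7 and 1.17 can be observed directly by comparing a statistic of the two ESDs for Bernoulli and gaussian entries. The following sketch (NumPy assumed; the base matrix M_n, size, and seed are illustrative choices, not from the paper) does this for A_n = M_n + X_n and B_n = M_n + Y_n:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
M = np.eye(n)                                 # hypothetical deterministic base matrix M_n
X = rng.integers(0, 2, (n, n)) * 2.0 - 1.0    # iid Bernoulli +-1: mean 0, variance 1
Y = rng.standard_normal((n, n))               # iid gaussian: mean 0, variance 1

lamA = np.linalg.eigvals((M + X) / np.sqrt(n))
lamB = np.linalg.eigvals((M + Y) / np.sqrt(n))

# Universality predicts the two ESDs agree asymptotically; compare a simple
# statistic of each, the mean modulus of the eigenvalues.
statA = float(np.mean(np.abs(lamA)))
statB = float(np.mean(np.abs(lamB)))
```

Here M_n = I satisfies (3), since its Frobenius norm is only √n; any statistic computed from the two spectra should agree up to finite-size fluctuations.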
It seems of interest to see whether the hypotheses on (5) can be verified for various natural random or deterministic matrices M_n, K_n, L_n, normalised appropriately by a suitable power of n. We do not pursue this matter here. A demonstration of the above theorem for the Bernoulli and the gaussian case appears in Figure 3. The proofs of these extensions are discussed in Section 7.

Another direction for generalization is to consider random matrices whose entries are independent, but not necessarily identically distributed. Most of the tools used in this paper (e.g. the law of large numbers, Talagrand's inequality, and the least singular value bound from [26]) extend without difficulty to this setting. Furthermore, Krishnapur pointed out that one can also prove a "universal" version of Theorem B.1. This leads to a generalization in Appendix C (written by Krishnapur). For similar reasons, one expects to be able to extend the above results to the case when X_n and Y_n are sparse iid random matrices; for instance, the least singular value bounds from [26] extend to this case, and the circular law for sparse iid matrices is already known in several cases [9], [26]. We, however, will not pursue these matters here.

1.19. Computing the ESD of a random non-hermitian matrix via the ESD of a hermitian one. Theorem 1.7 provides one useful way to compute the (limiting distribution of the) ESD of a random non-hermitian matrix, namely that one can restrict to any particular distribution (such as complex gaussian) of the entries. The proof of this theorem (with some modification) also provides another way to deal with this problem, namely that one can reduce the problem of computing the ESD of (1/√n)A_n to that of ((1/√n)A_n − zI)((1/√n)A_n − zI)*, for fixed z ∈ C. More precisely, we have the following equivalences.

Theorem 1.20 (Equivalences for convergence). Let A_n be as in Theorem 1.7, and let µ be a probability measure on C with the second moment condition ∫ |z|² dµ(z) < ∞.
Then the following are equivalent:

(i) The ESD µ_{(1/√n)A_n} of (1/√n)A_n converges in probability to µ.
(ii) For almost every complex number z, (1/n) log |det((1/√n)A_n − zI)| converges in probability to ∫_C log |w − z| dµ(w).
(iii) For almost every complex number z, there exists a sequence ε_n > 0 of positive numbers converging to zero such that (1/2n) log det(((1/√n)A_n − zI + ε_n I)(((1/√n)A_n − zI)* + ε_n I)) converges in probability to ∫_C log |w − z| dµ(w).

If furthermore the ESDs (4) converge to a limit for almost every z, then we can replace convergence in probability by almost sure convergence in the above equivalences.

We prove this result in Section 8. As a corollary, we have a criterion for when (1/√n)A_n converges to a distribution µ:

Corollary 1.21. Let A_n be as in Theorem 1.7, and let µ be a probability measure on C with the second moment condition ∫ |z|² dµ(z) < ∞. Suppose that for almost every complex number z, the ESD of ((1/√n)A_n − zI)((1/√n)A_n − zI)* converges in probability to a limiting distribution η_z on [0, +∞) such that the integral ∫_0^∞ log t dη_z(t) is absolutely convergent and equal to 2 ∫_C log |w − z| dµ(w). Then the ESD of (1/√n)A_n converges in probability to µ. If the ESDs (4) converge to a limit for almost every z, then we can replace convergence in probability by almost sure convergence in the above implication.

Proof. We verify the claim for almost sure convergence only; the proof for convergence in probability is similar and is left as an exercise to the reader. By Lemma 1.9, we see that for fixed z, |(1/n) trace(((1/√n)A_n − zI)((1/√n)A_n − zI)*)| is almost surely bounded. Taking limits, we conclude that

∫_0^∞ t dη_z(t) < ∞.

We then see from the dominated convergence theorem that for any ε > 0, (1/n) log det((((1/√n)A_n − zI) + εI)(((1/√n)A_n − zI)* + εI)) converges almost surely to ∫_0^∞ log(t + ε) dη_z(t).
From this we obtain hypothesis (iii) of Theorem 1.20 (if ε_n is chosen to decay to zero sufficiently slowly), and the claim follows. □

Since the eigenvalues of ((1/√n)A_n − zI)((1/√n)A_n − zI)* are the squares of the singular values of (1/√n)A_n − zI, we can also say that Theorem 1.20 reduces the problem of computing the limiting distribution of the eigenvalues of (1/√n)A_n to that of the singular values of (1/√n)A_n − zI. The big gain here is that the matrix ((1/√n)A_n − zI)((1/√n)A_n − zI)* is hermitian. (Random matrices of this type are often called sample covariance matrices in the literature.) This allows one to use standard tools such as truncation, Wigner's moment method and the Stieltjes transform (see, for instance, the proof of Theorem 1.5 in [2, Chapter 2]), or results such as Theorem B.1; techniques from free probability are also very powerful for such problems. These methods cannot be applied to non-hermitian matrices for various reasons (see [2, Chapter 10] for a discussion), and their failure has been the main difficulty in attacking problems such as the Circular Law conjecture. One can use Corollary 1.21 to give another proof of Theorem 1.13, without relying on explicit formulas such as (2). We omit the details.

1.22. Existence of the limit. The results in the previous sections provide two different ways to compute (explicitly) the limiting measure of the ESD of random matrices. In fact there is a simple compactness argument that guarantees the existence of the limit, assuming of course that the deterministic ESDs (4) already converge, although the argument does not provide too much information on what the limit actually is. More precisely, we have

Theorem 1.23. Let x be a complex random variable with zero mean and unit variance. Let X_n be the n × n random matrix whose entries are iid copies of x. For each n, let M_n be a deterministic n × n matrix satisfying

sup_n (1/n²) ‖M_n‖²_2 < ∞.   (7)

Assume furthermore that the ESD (4) converges for almost every z ∈ C.
Then the ESD of (1/√n)A_n, where A_n := M_n + X_n, converges (in both senses) to a limiting measure µ.

Proof. We let f_1, f_2, f_3, … be an enumeration of a sequence of test functions which is dense in the uniform topology (such a sequence exists thanks to the Stone-Weierstrass theorem and the compact support of test functions). By applying the Bolzano-Weierstrass theorem once for each function in this sequence and then using the Arzelà-Ascoli diagonalization argument, we can find a subsequence along which ∫_C f_j(z) dµ_{(1/√n)A_n}(z) converges in probability to some limit for each j, and hence by a limiting argument ∫_C g(z) dµ_{(1/√n)A_n}(z) converges in probability to a limit for each test function g. By the Riesz representation theorem we conclude that along this subsequence, µ_{(1/√n)A_n} converges in probability to some limit µ, which is also a probability measure by the tightness bounds in Lemma 1.9.

Applying Theorem 1.20, we conclude that for almost every z, the expression

(1/n) log det(((1/√n)A_n − zI + ε_n I)(((1/√n)A_n − zI)* + ε_n I))   (8)

converges in probability to 2 ∫_C log |w − z| dµ(w) along this subsequence, for some ε_n converging to zero. On the other hand, from the hypotheses and the theorem of Dozier and Silverstein (see Theorem B.1) we know that for almost every z, the expression (8) has an almost sure limit along the entire sequence of n. Combining the two facts we see that for almost every z, (8) in fact converges almost surely to 2 ∫_C log |w − z| dµ(w) along the full sequence of n. The claim now follows from another application of Theorem 1.20. □

1.24. Notation. The asymptotic notation is used under the assumption that n → ∞, holding all other parameters fixed. Thus for instance, if we say that a quantity a_{z,n} depending on n and another parameter z is equal to o(1), this means that a_{z,n} converges to zero as n → ∞ for fixed z, but this convergence need not be uniform in z.
As another example, the condition (3) is equivalent to asserting that ‖M_n‖_2 = O(n) as n → ∞.

2. The replacement principle

The first step toward Theorem 1.7 is the following result, which gives a general criterion for two random matrix ensembles (1/√n)A_n, (1/√n)B_n to converge to the same limit.

Theorem 2.1 (Replacement principle). Suppose for each n that A_n, B_n ∈ M_n(C) are ensembles of random matrices. Assume that:

(i) The expression

(1/n²)‖A_n‖²_2 + (1/n²)‖B_n‖²_2   (9)

is bounded in probability (resp. almost surely).

(ii) For almost all complex numbers z,

(1/n) log |det((1/√n)A_n − zI)| − (1/n) log |det((1/√n)B_n − zI)|

converges in probability (resp. almost surely) to zero. In particular, for each fixed z, these determinants are non-zero with probability 1 − o(1) for all n (resp. almost surely non-zero for all but finitely many n).

Then µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges in probability (resp. almost surely) to zero.

We would like to remark here that we do not need to require independence among the entries of A_n and B_n. The proof of this theorem is rather "soft" in nature, relying primarily on the Stieltjes transform technique (following Girko [7]) that analyses the ESD µ_{(1/√n)A_n} in terms of the log-determinants (1/n) log |det((1/√n)A_n − zI)|, combined with tools from classical real analysis such as the dominated convergence theorem (see Lemma 3.1 for the precise version of this theorem that we need). The details are given in Section 3.

In view of Lemma 1.9, we see that Theorem 1.7 follows immediately from Theorem 2.1 and the following proposition.

Proposition 2.2 (Converging determinant). Let x and y be complex random variables with zero mean and unit variance. Let X_n and Y_n be n × n random matrices whose entries are iid copies of x and y, respectively. For each n, let M_n be a deterministic n × n matrix satisfying (3). Set A_n := M_n + X_n and B_n := M_n + Y_n.
Then for every fixed z ∈ C,

(1/n) log |det((1/√n)A_n − zI)| − (1/n) log |det((1/√n)B_n − zI)|   (10)

converges in probability to zero. If furthermore we assume that (4) converges to a limit for this value of z, then (10) converges almost surely to zero.

For any square matrix A of size n, let λ_i(A) and s_i(A) be the eigenvalues and singular values of A. Furthermore, let d_i(A) be the distance from the i-th row vector of A to the subspace spanned by the first i − 1 row vectors. Then

|det A| = ∏_{i=1}^n |λ_i(A)| = ∏_{i=1}^n s_i(A) = ∏_{i=1}^n d_i(A).   (11)

We will need to study the singular values and distances of (1/√n)A_n − zI and (1/√n)B_n − zI in order to estimate their determinants. The proof of Proposition 2.2, which occupies Sections 4, 5 and 6, is the heart of the paper. This proof relies on the following three ingredients:

• A result by Dozier and Silverstein [3] that compares the ESD of the singular values of the matrices (1/√n)A_n − zI and (1/√n)B_n − zI. This will let us handle all the rows from 1 to (1 − δ)n for some small δ > 0.
• A lower tail estimate for the distance between a random vector and a fixed subspace of relatively large co-dimension, using a concentration inequality of Talagrand [13]. This will handle the contribution of the rows between (1 − δ)n and (say) n − n^{0.99}.
• A polynomial lower bound for the least singular value of (1/√n)A_n − zI and (1/√n)B_n − zI from [26, 27]. This bound enables us to handle the contribution of the last n^{0.99} rows.

3. The replacement principle

The purpose of this section is to establish Theorem 2.1. We begin with a version of the dominated convergence theorem.

Lemma 3.1 (Dominated convergence). Let (X, ν) be a finite measure space. For each integer n ≥ 1, let f_n : X → R be a random function which is jointly measurable with respect to X and the underlying probability space.
Assume that:

(i) (Uniform integrability) There exists δ > 0 such that ∫_X |f_n(x)|^{1+δ} dν(x) is bounded in probability (resp. almost surely).
(ii) (Pointwise convergence in probability) For ν-almost every x ∈ X, f_n(x) converges in probability (resp. almost surely) to zero.

Then ∫_X f_n(x) dν(x) converges in probability (resp. almost surely) to zero.

Proof. We first prove the claim for convergence in probability. We can normalise ν to be a probability measure. Let ε > 0. It suffices to show that

∫_X f_n(x) dν(x) = O(ε)

with probability 1 − O(ε) − o(1). By hypothesis (i), we already know that with probability 1 − O(ε) − o(1),

∫_X |f_n(x)|^{1+δ} dν(x) ≤ C_ε

for some C_ε depending on ε. This implies that

∫_X |f_n(x)| I(|f_n(x)| ≥ M) dν(x) ≤ C_ε / M^δ

for any M > 0, where I(E) denotes the indicator of an event E. In particular, for M large enough we have

∫_X |f_n(x)| I(|f_n(x)| ≥ M) dν(x) ≤ ε

with probability 1 − O(ε) − o(1), and so it will suffice to show that

∫_X f_n(x) I(|f_n(x)| ≤ M) dν(x) = O(ε)   (12)

with probability 1 − o(1).

Fix M. By hypothesis (ii), we have lim_{n→∞} P(|f_n(x)| ≥ ε) = 0 for ν-almost every x ∈ X. By the dominated convergence theorem, we conclude that

∫_X P(|f_n(x)| ≥ ε) dν(x) = o(1).

By Fubini's theorem, we conclude that

E ∫_X I(|f_n(x)| ≥ ε) dν(x) = o(1)

and so by Markov's inequality, we have

∫_X I(|f_n(x)| ≥ ε) dν(x) = O(ε/M)

with probability 1 − o(1). The claim (12) easily follows.

Now we prove the claim for almost sure convergence. Again we normalise ν to be a probability measure and let ε > 0. With probability 1 − O(ε) we have

∫_X |f_n(x)|^{1+δ} dν(x) ≤ C_ε

for all sufficiently large n, and some C_ε depending on ε. Also, with probability 1, f_n(x) converges to zero for almost every x.
The claim now follows by invoking (the deterministic special case of) the convergence in probability version of the lemma that we have just proven. □

Now we begin the proof of Theorem 2.1. We thus assume that A_n, B_n are as in that theorem. We shall first prove the claim for convergence in probability, and indicate later how to modify the proof to obtain the principle for almost sure convergence.

From the boundedness in probability of (9) and Weyl's comparison inequality (Lemma A.2) we see that for every ε > 0 there exists C_ε > 0 such that for all n, the eigenvalues λ_1, …, λ_n of A_n obey the bound

(1/n) Σ_{j=1}^n |λ_j/√n|² ≤ C_ε,   (13)

or equivalently that

∫_C |z|² dµ_{(1/√n)A_n}(z) ≤ C_ε,

with probability 1 − O(ε) − o(1). Similarly we have

∫_C |z|² dµ_{(1/√n)B_n}(z) ≤ C_ε.

In particular, for each n we see that with probability 1 − O(ε) − o(1) we have the tightness bounds

µ_{(1/√n)A_n}{z ∈ C : |z| ≥ R} ≤ C_ε / R²   (14)

and

µ_{(1/√n)B_n}{z ∈ C : |z| ≥ R} ≤ C_ε / R²   (15)

for all R > 0.

We now take the standard step of passing from the ESDs µ_{(1/√n)A_n}, µ_{(1/√n)B_n} to the characteristic functions m_{(1/√n)A_n}, m_{(1/√n)B_n} : R² → C, which are defined by the formulae

m_{(1/√n)A_n}(u, v) := ∫_C e^{iu Re(z) + iv Im(z)} dµ_{(1/√n)A_n}(z)
m_{(1/√n)B_n}(u, v) := ∫_C e^{iu Re(z) + iv Im(z)} dµ_{(1/√n)B_n}(z);

thus the functions m_{(1/√n)A_n}, m_{(1/√n)B_n} are continuous and are bounded uniformly in magnitude by 1. Thanks to the tightness bounds (14)-(15), we can easily pass back and forth between convergence of ESDs and convergence of characteristic functions:

Lemma 3.2. Let the notation and assumptions be as above. Then the following are equivalent:

(i) µ_{(1/√n)A_n} − µ_{(1/√n)B_n} converges in probability to zero.
(ii) For almost every u, v, m_{(1/√n)A_n}(u, v) − m_{(1/√n)B_n}(u, v) converges in probability to zero.

Proof. We first show that (i) implies (ii).
Fix u, v, and let ε > 0. By (14) and (15), we can find R depending on C_ε and ε such that

µ_{(1/√n)A_n}({z ∈ C : |z| ≥ R}) + µ_{(1/√n)B_n}({z ∈ C : |z| ≥ R}) ≤ ε

with probability 1 − O(ε) − o(1). In particular, with probability 1 − O(ε) − o(1) we have

m_{(1/√n)B_n}(u, v) − m_{(1/√n)A_n}(u, v) = ∫ ψ(z/R) e^{iu Re(z) + iv Im(z)} [dµ_{(1/√n)B_n}(z) − dµ_{(1/√n)A_n}(z)] + O(ε),

where ψ is any smooth compactly supported function that equals one on the unit ball. But since µ_{(1/√n)B_n} − µ_{(1/√n)A_n} converges in probability to zero, the integral here converges to zero in probability. The claim follows.

Now we prove that (ii) implies (i). Since continuous compactly supported functions are the uniform limit of smooth compactly supported functions, it suffices to show that ∫_C f dµ_{(1/√n)A_n} − ∫_C f dµ_{(1/√n)B_n} converges in probability to zero for every smooth compactly supported function f : C → C. Now fix a smooth compactly supported function f : C → C. By Fourier analysis, we can write

∫_C f dµ_{(1/√n)A_n} − ∫_C f dµ_{(1/√n)B_n} = ∫_R ∫_R f̂(u, v) (m_{(1/√n)A_n}(u, v) − m_{(1/√n)B_n}(u, v)) du dv   (16)

for some smooth, rapidly decreasing function f̂. In particular, the measure dν = f̂(u, v) du dv is finite. The claim now follows from dominated convergence (Lemma 3.1); note that the function m_{(1/√n)A_n} − m_{(1/√n)B_n} is bounded and so clearly obeys the moment condition required in that lemma. □

In view of the above lemma, it suffices to show that m_{(1/√n)A_n}(u, v) − m_{(1/√n)B_n}(u, v) converges in probability to zero for almost every u, v ∈ R. Fix u, v. Since we can exclude a set of measure zero, we can assume that u, v are non-zero.
We allow all implied constants in the arguments below to depend on $u, v$.

Following Girko [7], we now proceed via the Stieltjes-like transform $g_{\frac{1}{\sqrt n}A_n}: \mathbf C \to \mathbf R$, defined almost everywhere by the formula
$$g_{\frac{1}{\sqrt n}A_n}(z) := 2\operatorname{Re}\int_{\mathbf C} \frac{z-w}{|z-w|^2}\, d\mu_{\frac{1}{\sqrt n}A_n}(w) = \frac{2}{n}\operatorname{Re}\sum_{j=1}^n \frac{z - \frac{1}{\sqrt n}\lambda_j}{|z - \frac{1}{\sqrt n}\lambda_j|^2}; \qquad (17)$$
observe that this is a locally integrable function on $\mathbf C$, and that
$$g_{\frac{1}{\sqrt n}A_n}(z) = \frac{\partial}{\partial\operatorname{Re}(z)}\,\frac{2}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n - zI\right)\right| \qquad (18)$$
for all but finitely many $z$.

We have the following fundamental identity:

Lemma 3.3 (Girko's identity) [7]. For every non-zero $u, v$ we have
$$m_{\frac{1}{\sqrt n}A_n}(u,v) = \frac{u^2+v^2}{4\pi i u}\int_{\mathbf R}\left(\int_{\mathbf R} g_{\frac{1}{\sqrt n}A_n}(s+it)\, e^{ius+ivt}\, dt\right) ds,$$
where the inner integral is absolutely integrable for almost every $s$, and the outer integral is absolutely convergent.

Proof. We argue as in [2, Lemma 3.1]. Since
$$m_{\frac{1}{\sqrt n}A_n}(u,v) = \frac{1}{n}\sum_{j=1}^n e^{i(u\operatorname{Re}(\frac{1}{\sqrt n}\lambda_j) + v\operatorname{Im}(\frac{1}{\sqrt n}\lambda_j))}$$
it suffices from (17) to show that
$$e^{i(u\operatorname{Re}(w)+v\operatorname{Im}(w))} = \frac{u^2+v^2}{4\pi i u}\int_{\mathbf R}\left(\int_{\mathbf R} 2\,\frac{\operatorname{Re}(s+it-w)}{|s+it-w|^2}\, e^{ius+ivt}\, dt\right) ds$$
for each complex number $w$, with an absolutely convergent inner integral and outer integral. But standard contour integration shows that
$$\int_{\mathbf R} \frac{\operatorname{Re}(s+it-w)}{|s+it-w|^2}\, e^{ius+ivt}\, dt = \pi\,\operatorname{sgn}(s-\operatorname{Re}(w))\, e^{-|v||s-\operatorname{Re}(w)|}\, e^{ius}\, e^{iv\operatorname{Im}(w)} \qquad (19)$$
for every $s \ne \operatorname{Re}(w)$ (the inner integral evaluates via the classical formula $\int_{\mathbf R} e^{iv\tau}/(a^2+\tau^2)\, d\tau = \frac{\pi}{|a|}e^{-|v||a|}$ with $a := s - \operatorname{Re}(w)$), and the claim follows by an elementary integration. $\Box$

We can of course define $g_{\frac{1}{\sqrt n}B_n}$ similarly, with analogous identities. To conclude the proof of Theorem 2.1, it thus suffices to show that for any $\varepsilon > 0$ and all sufficiently large $n$, we have
$$\int_{\mathbf R}\left(\int_{\mathbf R} \big(g_{\frac{1}{\sqrt n}A_n}(s+it) - g_{\frac{1}{\sqrt n}B_n}(s+it)\big)\, e^{ius+ivt}\, dt\right) ds = O(\varepsilon) \qquad (20)$$
with probability $1 - O(\varepsilon) - o(1)$.

Fix $\varepsilon > 0$. By (14), (15), we can find an $R > 1$ such that with probability $1 - O(\varepsilon) - o(1)$,
$$\mu_{\frac{1}{\sqrt n}A_n}(\{z \in \mathbf C : |z| \ge R\}) + \mu_{\frac{1}{\sqrt n}B_n}(\{z \in \mathbf C : |z| \ge R\}) \le \varepsilon.$$
(21)

We now condition on the event that (21) holds.

We now smoothly localize the $z$ variable to a compact set as follows. Let $\psi: \mathbf R \to \mathbf R^+$ be a smooth cutoff function which equals 1 on $[-1,1]$ and is compactly supported.

Lemma 3.4 (Truncation in $s, t$). Let $w \in \mathbf C$.

(i) The integral
$$\int_{\mathbf R} \left|\int_{\mathbf R} \frac{\operatorname{Re}(w-(s+it))}{|w-(s+it)|^2}\, e^{ius+ivt}\, dt\right| (1-\psi(s/R))\, ds$$
is of size $O(1)$, and (if $R$ is large enough) is of size $O(\varepsilon)$ when $|w| \le R$.

(ii) The integral
$$\int_{\mathbf R} \left|\int_{\mathbf R} \frac{\operatorname{Re}(w-(s+it))}{|w-(s+it)|^2}\, e^{ius+ivt}\,(1-\psi(t/R))\, dt\right| \psi(s/R)\, ds \qquad (22)$$
is of size $O(1)$, and (if $R$ is large enough) is of size $O(\varepsilon)$ when $|w| \le R$.

Proof. The claim (i) follows easily from (19), so we turn to (ii). We first verify the claim that (22) is bounded. Replacing everything by absolute values one sees that
$$\left|\int_{\mathbf R} \frac{\operatorname{Re}(w-(s+it))}{|w-(s+it)|^2}\, e^{ius+ivt}\,(1-\psi(t/R))\, dt\right| = O(1)$$
(in fact one can obtain an explicit upper bound of $\pi$), so we can dispose of the region of integration in which $s = \operatorname{Re}(w) + O(1)$. For the remaining values of $s$, we use repeated integration by parts, integrating the $e^{ivt}$ term and differentiating the others. After two such integrations we obtain the bound
$$\left|\int_{\mathbf R} \frac{\operatorname{Re}(w-(s+it))}{|w-(s+it)|^2}\, e^{ius+ivt}\,(1-\psi(t/R))\, dt\right| = O\big((R^{-1} + |s-\operatorname{Re}(w)|^{-1})^2\big).$$
The claim then follows.

Finally, if $|w| \le R$, then one easily verifies (by repeated integration by parts) that
$$\int_{\mathbf R} \frac{\operatorname{Re}(w-(s+it))}{|w-(s+it)|^2}\, e^{ius+ivt}\,(1-\psi(t/R))\, dt = O(1/R)$$
(say), and so the final claim of (ii) follows. $\Box$

From this lemma and (17), the triangle inequality and (21) we conclude that
$$\int_{\mathbf R}\left(\int_{\mathbf R} g_{\frac{1}{\sqrt n}A_n}(s+it)\, e^{ius+ivt}\, dt\right)(1-\psi(s/R))\, ds = O(\varepsilon) \qquad (23)$$
and
$$\int_{\mathbf R}\left(\int_{\mathbf R} g_{\frac{1}{\sqrt n}A_n}(s+it)\, e^{ius+ivt}\,(1-\psi(t/R))\, dt\right)\psi(s/R)\, ds = O(\varepsilon).$$
(24)

From (23), (24) (and their counterparts for $g_{\frac{1}{\sqrt n}B_n}$) and the triangle inequality, we thus see that to prove (20), it suffices to show that
$$\int_{\mathbf R}\int_{\mathbf R} \big(g_{\frac{1}{\sqrt n}A_n}(s+it) - g_{\frac{1}{\sqrt n}B_n}(s+it)\big)\, e^{ius+ivt}\,\psi(t/R)\,\psi(s/R)\, dt\, ds \qquad (25)$$
converges in probability to zero for every fixed $R \ge 1$. Note that the integrands here are now jointly absolutely integrable in $t, s$, and so we may now freely interchange the order of integration.

Fix $R$. Using (18) and integration by parts in the $s$ variable, we can rewrite (25) in the form
$$\int_{\mathbf R}\int_{\mathbf R} f_n(s,t)\,\phi_{u,v,R}(s,t)\, ds\, dt$$
where
$$f_n(s,t) := \frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n - zI\right)\right| - \frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}B_n - zI\right)\right|$$
(with $z := s+it$) and
$$\phi_{u,v,R}(s,t) := -2\,\frac{\partial}{\partial s}\big(e^{ius+ivt}\,\psi(t/R)\,\psi(s/R)\big).$$
(Note that there are finitely many values of $t$ for which the integration by parts is not justified due to singularities in $g_{\frac{1}{\sqrt n}A_n}$ or $g_{\frac{1}{\sqrt n}B_n}$, but these values of $t$ clearly give a zero contribution at the end of the day.) Thus it will suffice to show that
$$\int_{\mathbf R}\int_{\mathbf R} |f_n(s,t)|\,|\phi_{u,v,R}(s,t)|\, ds\, dt$$
converges in probability to zero.

From (11) we have
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n - zI\right)\right| = \frac{1}{n}\sum_{j=1}^n \log\left|\tfrac{1}{\sqrt n}\lambda_j - (s+it)\right| \qquad (26)$$
and similarly for $B_n$. From the boundedness and compact support of $\phi_{u,v,R}$ we observe that
$$\int_{\mathbf R}\int_{\mathbf R} \Big|\log\big|\tfrac{1}{\sqrt n}\lambda - (s+it)\big|\Big|\, |\phi_{u,v,R}(s,t)|\, ds\, dt \le O_{\phi_{u,v,R}}\Big(1 + \frac{1}{n}|\lambda|^2\Big)$$
for all $\lambda \in \mathbf C$; from this, (26), (13), and the triangle inequality we see that
$$\int_{\mathbf R}\int_{\mathbf R} |f_n(s,t)|\,|\phi_{u,v,R}(s,t)|\, ds\, dt \qquad (27)$$
is bounded uniformly in $n$. Since by hypothesis $f_n(s,t)$ converges in probability to zero for almost every $s, t$, the claim now follows from dominated convergence (Lemma 3.1). The proof of Theorem 2.1 is now complete in the case of convergence in probability.

3.5. The almost sure convergence case.
We now indicate how to adapt the above arguments to the case of almost sure convergence. Firstly, since (9) is now almost surely bounded instead of just bounded in probability, we can now say that for every $\varepsilon > 0$ there exists $C_\varepsilon > 0$ such that with probability $1 - O(\varepsilon)$, (14), (15) hold for all sufficiently large $n$ (as opposed to these bounds holding with probability $1 - O(\varepsilon) - o(1)$ for each $n$ separately).

Next, we observe the (well-known) fact that Lemma 3.2 continues to hold when convergence in probability is replaced by almost sure convergence throughout. Indeed the implication of (ii) from (i) is nearly identical and is left as an exercise to the reader. To deduce (i) from (ii) in the almost sure case, observe from the separability of the space of smooth compactly supported functions in the uniform topology that it suffices to show that (16) converges almost surely to zero for each $f$. On the other hand, from (ii) and Fubini's theorem we know that with probability 1, $m_{\frac{1}{\sqrt n}A_n}(u,v) - m_{\frac{1}{\sqrt n}B_n}(u,v)$ converges to zero for almost every $u, v$, and the claim follows from the (ordinary) dominated convergence theorem.

Once again we use Girko's identity, Lemma 3.3, and reduce to showing that for every $\varepsilon > 0$, one has with probability $1 - O(\varepsilon)$ that (20) holds for all but finitely many $n$. From our bounds on (14), (15) we see that with probability $1 - O(\varepsilon)$, (21) holds for all but finitely many $n$. We apply Lemma 3.4 (which is deterministic) and reduce to showing that (25) converges almost surely to zero for each fixed $R \ge 1$. The rest of the argument proceeds as in the convergence in probability case.

3.6. An alternate argument. There is an alternate derivation of Theorem 2.1 that avoids Fourier analysis, and is instead based on the observation that for any complex polynomial $P(z)$, the distributional Laplacian $\Delta\log|P(z)|$ of the logarithm of the magnitude of $P$ is equal to $2\pi$ times the counting measure of the zeroes of $P$ (counting multiplicity).
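This observation reduces, after factoring $P(z) = c\prod_{j=1}^k (z-\lambda_j)$, to the fact that $\frac{1}{2\pi}\log|z|$ is the fundamental solution of the Laplacian in the plane; a minimal sketch:

```latex
\log|P(z)| = \log|c| + \sum_{j=1}^k \log|z-\lambda_j|,
\qquad
\Delta \log|z-\lambda_j| = 2\pi\,\delta_{\lambda_j},
% summing over the zeroes (with multiplicity):
\Delta \log|P(z)| = 2\pi \sum_{j=1}^k \delta_{\lambda_j}
```

in the sense of distributions; applying Green's theorem to move $\Delta$ onto a smooth compactly supported test function $f$ then yields the integral identity used in the text.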
In particular, we see from Green's theorem that
$$\int_{\mathbf C} f\, d\big(\mu_{\frac{1}{\sqrt n}A_n} - \mu_{\frac{1}{\sqrt n}B_n}\big) = \frac{1}{2\pi n}\int_{\mathbf C} (\Delta f(z))\left(\log\left|\det\left(\tfrac{1}{\sqrt n}A_n - zI\right)\right| - \log\left|\det\left(\tfrac{1}{\sqrt n}B_n - zI\right)\right|\right) dz$$
for any smooth, compactly supported $f$. Applying Lemma 3.1 we can then get convergence of this integral (either in probability or in the almost sure sense, as appropriate); the uniform integrability required can be established by repeating the computations used to bound (27). One can then easily take limits to replace smooth compactly supported $f$ by continuous compactly supported $f$; we omit the details.

4. Proof of Proposition 2.2

In this section we present the proof of Proposition 2.2, modulo several key lemmas. Let $x, y, M_n, A_n, B_n, z$ be as in that proposition. By shifting $M_n$ by $\sqrt n\, zI$ if necessary we can assume $z = 0$. Our task is now to show that
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n\right)\right| - \frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}B_n\right)\right|$$
converges in probability to zero, and also almost surely to zero if $\mu_{\frac{1}{n}M_nM_n^*}$ converges. [Footnote: We thank Manjunath Krishnapur for this simpler argument.]

Let us first remark that the almost sure convergence claim implies the convergence in probability claim. Indeed, suppose that convergence in probability failed; then there would exist an $\varepsilon > 0$ such that
$$\mathbf P\left(\left|\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n\right)\right| - \frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}B_n\right)\right|\right| \ge \varepsilon\right) \ge \varepsilon \qquad (28)$$
for a subsequence of $n$. By vague sequential compactness one can pass to a further subsequence along which $\mu_{\frac{1}{n}M_nM_n^*}$ converges, and hence by hypothesis one has almost sure (and hence in probability) convergence to zero along this sequence, contradicting (28). Thus it suffices to establish almost sure convergence assuming the convergence of $\mu_{\frac{1}{n}M_nM_n^*}$.

Let $Z_1, \dots, Z_n$ be the rows of $M_n$. By assumption (3) we have
$$\sum_{i=1}^n \|Z_i\|^2 = O(n^2).$$
In particular, at least half of the $Z_i$ have norm $O(\sqrt n)$.
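The pigeonhole step here is Markov's inequality; writing the $O(n^2)$ bound as $\sum_i \|Z_i\|^2 \le Cn^2$ for some constant $C$:

```latex
\#\{1 \le i \le n : \|Z_i\|^2 > 2Cn\}
\;\le\; \frac{1}{2Cn}\sum_{i=1}^n \|Z_i\|^2
\;\le\; \frac{n}{2},
```

so at least $n/2$ of the rows satisfy $\|Z_i\| \le \sqrt{2Cn} = O(\sqrt n)$.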
By permuting the rows of $M_n, A_n, B_n$ if necessary, we may assume that it is the last half of the rows which have this property, thus
$$\|Z_i\| = O(\sqrt n) \quad \text{for all } n/2 \le i \le n. \qquad (29)$$
Let $\sigma_1(A) \ge \dots \ge \sigma_n(A) \ge 0$ denote the singular values of a matrix $A$, in decreasing order. We have the following fundamental lower bound:

Lemma 4.1 (Least singular value bound). With probability 1, we have
$$\sigma_n(A_n),\ \sigma_n(B_n) \ge n^{-O(1)} \qquad (30)$$
for all but finitely many $n$. In particular, with probability 1, $A_n$ and $B_n$ are invertible for all but finitely many $n$.

Proof. This follows immediately from [26, Theorem 2.1] or [27, Theorem 4.1] and the Borel-Cantelli lemma, noting from (3) of Proposition 2.2 that the operator norm of $M_n$ is of polynomial size $n^{O(1)}$. There are previous results in [17], [24], [18], [25], which handled special cases with more assumptions on $M_n$ and the underlying distributions $x, y$ (for instance, in some of the prior results $M_n$ was assumed to vanish, or $x, y$ were assumed to be integer-valued or to have finite higher moments). One can obtain explicit bounds on the tail probability and on the exponent $O(1)$; see [27]. However, for our applications the above bounds will suffice. $\Box$

We also have with probability 1 the crude upper bound
$$\sigma_1(A_n),\ \sigma_1(B_n) \le n^{O(1)} \qquad (31)$$
for all but finitely many $n$, which follows easily from the polynomial size of $M_n$, the bounded second moment of $x, y$, and the Borel-Cantelli lemma. Again, much sharper bounds are available, especially if $x$ and $y$ have finite fourth moment, but we will not need these bounds here.

Let $X_1, \dots, X_n$ be the rows of $A_n$, and for each $1 \le i \le n$ let $V_i$ be the $(i-1)$-dimensional space spanned by $X_1, \dots, X_{i-1}$. From (11) we have
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n\right)\right| = \frac{1}{n}\sum_{i=1}^n \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right)$$
and similarly
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}B_n\right)\right| = \frac{1}{n}\sum_{i=1}^n \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}Y_i, W_i\right)$$
where $Y_1, \dots, Y_n$ are the rows of $B_n$, and $W_i$ is spanned by $Y_1, \dots$
$, Y_{i-1}$.

Our task is then to show that
$$\frac{1}{n}\sum_{i=1}^n \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right) - \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}Y_i, W_i\right)$$
converges almost surely to zero.

From (30), (31) and Lemma A.4 we almost surely obtain the bound
$$\log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right),\ \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}Y_i, W_i\right) = O(\log n)$$
for all but finitely many $n$. Thus it suffices to show that
$$\frac{1}{n}\sum_{1 \le i \le n-n^{0.99}} \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right) - \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}Y_i, W_i\right)$$
(say) converges almost surely to zero. This follows immediately from the following two lemmas.

Lemma 4.2 (High-dimensional contribution). For every $\varepsilon > 0$ there exists $0 < \delta < 1/2$ such that with probability 1, one has
$$\frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} \left|\log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right)\right| = O(\varepsilon)$$
for all but finitely many $n$. Similarly with $\operatorname{dist}(\frac{1}{\sqrt n}X_i, V_i)$ replaced by $\operatorname{dist}(\frac{1}{\sqrt n}Y_i, W_i)$.

Lemma 4.3 (Low-dimensional contribution). For every $\varepsilon > 0$ there exists $0 < \delta < 1/2$ such that with probability $1 - O(\varepsilon)$, one has
$$\frac{1}{n}\sum_{1 \le i \le (1-\delta)n} \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right) - \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}Y_i, W_i\right) = O(\varepsilon)$$
for all but finitely many $n$.

The next two sections will be devoted to the proofs of these two lemmas.

5. Proof of Lemma 4.2

We now prove Lemma 4.2. We can of course take $n$ to be large depending on all fixed parameters. Let $0 < \delta < 1/2$ be a small parameter depending on $\varepsilon$, to be chosen later. Clearly it suffices to prove this lemma for $\operatorname{dist}(\frac{1}{\sqrt n}X_i, V_i)$. We first prove the (much easier) bound for the positive component of the logarithm. By the Borel-Cantelli lemma it suffices to show that
$$\sum_{n=1}^\infty \mathbf P\left(\frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} \max\left(\log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right), 0\right) \ge \varepsilon\right) < \infty.$$
To establish this, we use the crude bound
$$\max\left(\log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right), 0\right) \le \max\left(\log\tfrac{1}{\sqrt n}\|X_i\|, 0\right)$$
and hence
$$\frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} \max\left(\log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right), 0\right) \le O\left(\sum_{m=0}^\infty \frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} I\big(\|X_i\| \ge 2^m\sqrt n\big)\right).$$
(32)

Thus if the left-hand side of (32) exceeds $\varepsilon$, we must have
$$\frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} I\big(\|X_i\| \ge 2^m\sqrt n\big) \ge \varepsilon/(100+m)^2$$
(say) for some $m \ge 0$. On the other hand, from (29) and the second moment method we see that $\mathbf P(\|X_i\| \ge 2^m\sqrt n) = O(2^{-2m})$, and thus by Hoeffding's inequality we have
$$\mathbf P\left(\frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} I\big(\|X_i\| \ge 2^m\sqrt n\big) \ge \varepsilon/(100+m)^2\right) \le C\exp(-cn^{0.9} - cm^{0.9})$$
(say) for some constants $C, c > 0$ depending on $\varepsilon$, if $\delta$ is chosen sufficiently small depending on $\varepsilon$. The claim follows.

It remains to establish the bound for the negative component of the logarithm. By the Borel-Cantelli lemma it suffices to show that
$$\sum_{n=1}^\infty \mathbf P\left(\frac{1}{n}\sum_{(1-\delta)n \le i \le n-n^{0.99}} \max\left(-\log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right), 0\right) \ge \varepsilon\right) < \infty.$$
This will follow from the union bound and the following estimate.

Proposition 5.1 (Lower tail bound). Let $1 \le d \le n-n^{0.99}$ and $0 < c < 1$, and let $W$ be a (deterministic) $d$-dimensional subspace of $\mathbf C^n$. Let $X$ be a row of $A_n$ (the exact choice of row is not important). Then
$$\mathbf P\big(\operatorname{dist}(X, W) \le c\sqrt{n-d}\big) = O(\exp(-n^{0.01})).$$
(The implied constant of course depends on $c$.)

Indeed, since $X_i$ and $V_i$ are independent of each other, the proposition implies that
$$\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right) \ge \frac{1}{2\sqrt n}\sqrt{n-i+1}$$
(say) for each $(1-\delta)n \le i \le n-n^{0.99}$, with probability $1 - O(n^{-100})$ (say). Setting $\delta$ sufficiently small (compared to $\varepsilon$), taking logarithms and summing in $i$ and $n$ one obtains the claim.

It remains to prove the proposition. Similar lower bounds concerning the distance of a random vector to a fixed subspace have appeared in [22], [18], [19]. Here, however, we have the complication that the coefficients of $X$ have non-zero mean and have no higher moment bounds than the second moment; in particular, they can be unbounded. We first eliminate the problem that $X$ has non-zero mean.
Write $X = v + X'$, where $v := \mathbf E(X)$ is a deterministic vector (which could be quite large) and $X'$ has mean zero. Then we have $\operatorname{dist}(X, W) \ge \operatorname{dist}(X', \operatorname{span}(W, v))$. Thus Proposition 5.1 follows from the mean zero case (after making the harmless change of incrementing $d$ to $d+1$, and adjusting the parameters slightly to suit this).

Henceforth we assume that $X$ has mean zero, thus $X = (x_1, \dots, x_n)$ for some iid copies $x_1, \dots, x_n$ of $x$. Now we deal with the problem that the $x_1, \dots, x_n$ can be unbounded. By Chebyshev's inequality, we have $\mathbf P(|x_i| \ge n^{0.49}) = O(n^{-0.98})$ for all $1 \le i \le n$. The events $|x_i| \ge n^{0.49}$ are jointly independent in $i$. By the Chernoff inequality (see, for instance, [23, Chapter 1]), we can show that with probability $1 - O(\exp(-n^{0.01}))$, there are at most $n^{0.1}$ indices $i$ for which $|x_i| \ge n^{0.49}$. (One can also verify this directly using binomial coefficients and Stirling's formula.) By conditioning on the various possible sets of indices for which $|x_i| \ge n^{0.49}$, we see that it suffices to show that
$$\mathbf P\big(\operatorname{dist}(X, W) \le c\sqrt{n-d}\ \big|\ E_I\big) = O(\exp(-n^{0.01}))$$
for each $I \subset \{1, \dots, n\}$ of cardinality at most $n^{0.1}$, where $E_I$ is the event that $I = \{1 \le i \le n : |x_i| \ge n^{0.49}\}$.

Without loss of generality we can take $I = \{n'+1, \dots, n\}$ for some $n - n^{0.1} \le n' \le n$. We then observe that
$$\operatorname{dist}(X, W) \ge \operatorname{dist}(\pi(X), \pi(W))$$
where $\pi: \mathbf C^n \to \mathbf C^{n'}$ is the orthogonal projection onto the first $n'$ coordinates. By conditioning on the coordinates $x_{n'+1}, \dots, x_n$ and making the minor change of replacing $n$ with $n'$ (and adjusting $c$ slightly), we may thus reduce to the case when $I$ is empty; thus it suffices to show that
$$\mathbf P\big(\operatorname{dist}(X, W) \le c\sqrt{n-d}\ \big|\ |x_i| < n^{0.49}\ \text{for all } i\big) = O(\exp(-n^{0.01})).$$
Let $\tilde x$ be the random variable $x$ conditioned to the event $|x| < n^{0.49}$, and let $\tilde X = (\tilde x_1, \dots, \tilde x_n)$ be a vector consisting of iid copies of $\tilde x$.
It then suffices to show that
$$\mathbf P\big(\operatorname{dist}(\tilde X, W) \le c\sqrt{n-d}\big) = O(\exp(-n^{0.01})). \qquad (33)$$
Note that $\tilde x$ might have a non-zero mean, but this can be easily dealt with by the same trick used before, subtracting $\mathbf E\tilde x$ from $\tilde x$ to make $\tilde X$ have zero mean. Since $x$ had variance 1, we see from monotone convergence that $\tilde x$ has variance $1 - o(1)$.

To prove (33), we recall the following inequality of Talagrand.

Theorem 5.2 (Talagrand's inequality). Let $D$ be the unit disk $\{z \in \mathbf C : |z| \le 1\}$. For every product probability $\mu$ on $D^n$, every convex 1-Lipschitz function $F: \mathbf C^n \to \mathbf R$, and every $r \ge 0$,
$$\mu(|F - M(F)| \ge r) \le 4\exp(-r^2/8),$$
where $M(F)$ denotes the median of $F$.

Proof. This is the complex version of [13, Corollary 4.10], in which $D$ was replaced by the unit interval $[0,1]$ (with $r^2/4$ in the exponent). $\Box$

We apply this theorem with $\mu$ equal to the distribution of $\tilde X/n^{0.49}$ and $F: \mathbf C^n \to \mathbf R$ equal to the convex 1-Lipschitz function $F(v) := \operatorname{dist}(v, W)$, and conclude that
$$\mathbf P\big(|\operatorname{dist}(\tilde X, W) - M(\operatorname{dist}(\tilde X, W))| \ge n^{0.49} r\big) \le 4\exp(-r^2/8) \qquad (34)$$
for every $r > 0$. On the other hand, we can easily compute the second moment (cf. [22, Lemma 2.5]):

Lemma 5.3. We have $\mathbf E(\operatorname{dist}(\tilde X, W)^2) = (1-o(1))(n-d)$.

Proof. Let $\pi = (\pi_{ij})_{1 \le i,j \le n}$ be the orthogonal projection matrix onto the orthogonal complement of $W$. Observe that
$$\operatorname{dist}(\tilde X, W)^2 = \sum_{i=1}^n\sum_{j=1}^n \overline{\tilde x_i}\, \pi_{ij}\, \tilde x_j.$$
Since the $\tilde x_i$ are iid with mean zero, we thus have
$$\mathbf E(\operatorname{dist}(\tilde X, W)^2) = (\mathbf E|\tilde x|^2)\sum_{i=1}^n \pi_{ii}.$$
But $\sum_{i=1}^n \pi_{ii} = \operatorname{trace}(\pi)$ is equal to $n-d$. Since $\tilde x$ had variance $1 - o(1)$, the claim follows. $\Box$

Since $n - d \ge n^{0.99}$ and $c < 1$, the claim (33) now follows from (34) and the above lemma. (Indeed, (34) shows that $\operatorname{dist}(\tilde X, W)$ concentrates within $O(n^{0.49})$ of its median $M$, so by Lemma 5.3 one has $M = (1-o(1))\sqrt{n-d}$; applying (34) once more with $r$ comparable to $\sqrt{n-d}/n^{0.49} \ge n^{0.005}$ yields (33).) The proof of Lemma 4.2 is now complete.

6. Proof of Lemma 4.3

We now begin the proof of Lemma 4.3. Fix $\varepsilon$, and assume that $\delta$ is sufficiently small depending on $\varepsilon$. Write $n' := \lfloor(1-\delta)n\rfloor$.
Observe that $\prod_{i=1}^{n'}\operatorname{dist}(\frac{1}{\sqrt n}X_i, V_i)$ is the $n'$-dimensional volume of the parallelepiped spanned by $\frac{1}{\sqrt n}X_1, \dots, \frac{1}{\sqrt n}X_{n'}$, which is also equal to $\det(\frac{1}{n}A_{n,n'}A_{n,n'}^*)^{1/2}$, where $A_{n,n'}$ is the $n' \times n$ matrix with rows $X_1, \dots, X_{n'}$. Expressing this determinant as the product of singular values, we conclude the identity
$$\frac{1}{n}\sum_{1 \le i \le (1-\delta)n} \log\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, V_i\right) = \frac{1}{n}\sum_{i=1}^{n'} \log\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right).$$
Similarly for $Y_i$, $W_i$, and $B_{n,n'}$ (the matrix generated by $Y_1, \dots, Y_{n'}$). Thus it suffices to show that with probability $1 - O(\varepsilon)$, one has
$$\frac{1}{n'}\sum_{i=1}^{n'} \log\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right) - \log\left(\tfrac{1}{\sqrt n}\sigma_i(B_{n,n'})\right) = O(\varepsilon) \qquad (35)$$
for all but finitely many $n$. We rewrite (35) as
$$\int_0^\infty \log t\ d\nu_{n,n'}(t) = O(\varepsilon) \qquad (36)$$
where $d\nu_{n,n'}$ is the difference of two ESDs:
$$d\nu_{n,n'} = \mu_{\frac{1}{n}A_{n,n'}A_{n,n'}^*} - \mu_{\frac{1}{n}B_{n,n'}B_{n,n'}^*}.$$
We control (35) by dividing the range of $t$ into several parts.

6.1. The region of very large $t$. We now control the region where $t \ge R_\varepsilon$ for some large $R_\varepsilon$. From Lemma A.2 we have that
$$\frac{1}{n}\sum_{i=1}^{n'}\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right)^2, \qquad \frac{1}{n}\sum_{i=1}^{n'}\left(\tfrac{1}{\sqrt n}\sigma_i(B_{n,n'})\right)^2$$
are almost surely bounded, and thus
$$\int_0^\infty t\ |d\nu_{n,n'}|(t)$$
is also almost surely bounded. Thus, with probability $1 - O(\varepsilon)$, we have
$$\int_0^\infty t\ |d\nu_{n,n'}|(t) \le C_\varepsilon$$
for all but finitely many $n$, and some $C_\varepsilon$ independent of $n$, which implies that
$$\int_{R_\varepsilon}^\infty |\log t|\ |d\nu_{n,n'}|(t) \le \varepsilon \qquad (37)$$
for all but finitely many $n$, and some $R_\varepsilon$ depending only on $\varepsilon$.

6.2. The region of intermediate $t$. We now control the region $\varepsilon \le t \le R_\varepsilon$.

Lemma 6.3.
Let $\psi$ be a smooth function which equals 1 on $[\varepsilon, R_\varepsilon]$ and is supported on $[\varepsilon/2, 2R_\varepsilon]$. Then with probability 1, we have
$$\int_0^\infty \psi(t)\log t\ d\nu_{n,n'}(t) = O(\varepsilon) \qquad (38)$$
for all but finitely many $n$, if $\delta$ is sufficiently small depending on $\varepsilon$ and $\psi$.

Proof. From the interlacing property (Lemma A.1), we see that
$$\int_0^\infty \psi(t)\log t\ d\nu_{n,n'}(t) = \int_0^\infty \psi(t)\log t\ d\nu_{n,n}(t) + O(\varepsilon)$$
if $\delta$ is sufficiently small depending on $\varepsilon$ and $\psi$.

We now apply the recent result in [3, Theorem 1.1]. For the reader's convenience, we restate this result in the Appendix; see Theorem B.1. This result asserts under the above hypotheses that the ESDs $d\mu_{\frac{1}{n}A_nA_n^*}$ and $d\mu_{\frac{1}{n}B_nB_n^*}$ converge almost surely to the same limit (in fact, this limit is given explicitly in terms of the limiting distribution of $\mu_{\frac{1}{n}M_nM_n^*}$ via the inverse Stieltjes transform of (47)). In particular, $\nu_{n,n}$ converges almost surely to zero, and the claim follows. $\Box$

Remark 6.4. Note that for the convergence in probability case of Proposition 2.2, we need to apply Theorem B.1 to a subsequence of $n$ rather than to all $n$, thanks to the subsequence extraction performed at the beginning of Section 4.

6.5. The region of moderately small $t$. We now control the region $\delta \le t \le \varepsilon$. For this we need some bounds on the low singular values of $A_{n,n'}$ and $B_{n,n'}$.

Lemma 6.6. With probability 1, we have
$$\frac{1}{n}\sum_{i=1}^{n'}\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right)^{-2} = O(1) \qquad (39)$$
for all but finitely many $n$, and similarly with $A_{n,n'}$ replaced by $B_{n,n'}$.

Proof. Clearly it suffices to establish the claim for $A_{n,n'}$. Using Proposition 5.1 and the Borel-Cantelli lemma, we see that with probability 1, we have
$$\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, \operatorname{span}(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_{n'})\right) \ge \tfrac{1}{2}\sqrt\delta$$
for all but finitely many $n$, and all $1 \le i \le n'$. The claim then follows from Lemma A.4.
$\Box$

Since the $\sigma_i(A_{n,n'})$ are decreasing in $i$, and $n' = \lfloor(1-\delta)n\rfloor$, we see that the above lemma implies that with probability 1, we have
$$\frac{1}{\sqrt n}\sigma_{\lfloor(1-2\delta)n\rfloor}(A_{n,n'}) \ge c\delta$$
for all but finitely many $n$, and some absolute constant $c > 0$. We can generalize this lower bound to handle higher singular values also:

Lemma 6.7. There exists an absolute constant $c > 0$ such that with probability 1, we have
$$\frac{1}{\sqrt n}\sigma_i(A_{n,n'}) \ge c\,\frac{n'-i}{n} \qquad (40)$$
for all but finitely many $n$, and all $1 \le i \le (1-\delta)n$, and similarly with $A_{n,n'}$ replaced by $B_{n,n'}$.

Proof. Clearly it suffices to establish the claim for $A_{n,n'}$. Using Proposition 5.1 and the Borel-Cantelli lemma, we see that with probability 1, we have
$$\operatorname{dist}\left(\tfrac{1}{\sqrt n}X_i, \operatorname{span}(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_{n''})\right) \ge \tfrac{1}{2}\sqrt{\frac{n-n''}{n}}$$
for all but finitely many $n$, and all $1 \le i \le n''$ and $n/2 \le n'' \le n'$. Applying Lemma A.4, we conclude that we almost surely have
$$\frac{1}{n}\sum_{i=1}^{n''}\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n''})\right)^{-2} = O\left(\frac{n}{n-n''}\right)$$
for all but finitely many $n$, and all $n/2 \le n'' \le n'$. Using the crude bound
$$\sum_{i=1}^{n''}\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n''})\right)^{-2} \ge (n-n'')\left(\tfrac{1}{\sqrt n}\sigma_{n''-(n-n'')}(A_{n,n''})\right)^{-2}$$
we conclude that we almost surely have
$$\frac{1}{\sqrt n}\sigma_{n''-(n-n'')}(A_{n,n''}) \ge c'\,\frac{n-n''}{n}$$
for all but finitely many $n$, all $n/2 \le n'' \le n'$, and some absolute constant $c' > 0$. The claim now follows from the Cauchy interlacing property (Lemma A.1). $\Box$
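Both applications of Lemma A.4 above rest on the negative second moment identity, which relates inverse squared singular values to row distances. A sketch of why it holds, for a full-rank $n' \times n$ matrix $A$ with rows $R_1, \dots, R_{n'}$ and Gram matrix $G := AA^*$:

```latex
\sum_{i=1}^{n'} \sigma_i(A)^{-2}
= \operatorname{trace}(G^{-1})
= \sum_{j=1}^{n'} (G^{-1})_{jj}
= \sum_{j=1}^{n'} \operatorname{dist}\big(R_j,\ \operatorname{span}(R_l : l \ne j)\big)^{-2},
```

the last step because, by the Schur complement formula for the inverse of a block matrix, $(G^{-1})_{jj}$ is the reciprocal of $\|R_j\|^2 - \|\pi_j R_j\|^2 = \operatorname{dist}(R_j, \operatorname{span}(R_l : l \ne j))^2$, where $\pi_j$ denotes orthogonal projection onto the span of the other rows.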
Remark 6.8. If one assumes stronger moment assumptions (e.g. subgaussian) on $x$, then more precise bounds are known, especially in the $M_n = 0$ case: see [19], [20].

From this lemma we can now bound the relevant contribution to (35):

Lemma 6.9. With probability 1, and if $\delta$ is sufficiently small depending on $\varepsilon$, we have
$$\int_\delta^\varepsilon |\log t|\ |d\nu_{n,n'}|(t) = O(\varepsilon) \qquad (41)$$
for all but finitely many $n$.

Proof. By the triangle inequality and symmetry it suffices to show that with probability 1, we have
$$\int_\delta^\varepsilon |\log t|\ d\mu_{\frac{1}{n}A_{n,n'}A_{n,n'}^*}(t) = O(\varepsilon)$$
for all but finitely many $n$. We rewrite the left-hand side as
$$\frac{1}{n'}\sum_{i=1}^{n'} f\left(\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right)^2\right)$$
where $f(t) := |\log t|\, I(\delta \le t \le \varepsilon)$. Since $f$ cannot exceed $|\log\delta|$, we see that the contribution of the case $i \ge (1-2\delta)n$ is acceptable if $\delta$ is small enough, so it suffices to show that we almost surely have
$$\frac{1}{n}\sum_{1 \le i \le (1-2\delta)n} f\left(\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right)^2\right) = O(\varepsilon)$$
for all but finitely many $n$.

By Lemma 6.7, we may assume that $n$ is such that (40) holds. As a consequence, we see that the only terms in the above sum which are non-vanishing are those for which $i = (1-O(\sqrt\varepsilon))n$. But then if we apply (40) and crudely estimate $f(t) \le -\log t$ we obtain the claim. $\Box$

6.10. The contribution of very small $t$. Finally, we need to control the contribution when $t \le \delta$.

Lemma 6.11. With probability 1, and if $\delta$ is sufficiently small depending on $\varepsilon$, we have
$$\int_0^\delta |\log t|\ |d\nu_{n,n'}|(t) = O(\varepsilon) \qquad (42)$$
for all but finitely many $n$.

Proof. By arguing as in the proof of Lemma 6.9, it suffices to show that we almost surely have
$$\frac{1}{n'}\sum_{i=1}^{n'} g\left(\left(\tfrac{1}{\sqrt n}\sigma_i(A_{n,n'})\right)^2\right) = O(\varepsilon)$$
for all but finitely many $n$, where $g(t) := |\log t|\, I(t \le \delta)$.

By Lemma 6.6, we may assume $n$ is such that (39) holds. On the other hand, if $\delta$ is small enough, we have the bound $g(t) \le \varepsilon t^{-1}$. The claim now follows from (39).
$\Box$

Putting together (37), (38), (41), (42) we see that with probability $1 - O(\varepsilon)$, we have (36) for all but finitely many $n$, and the claim follows.

7. Extensions

7.1. Proof of Theorem 1.17. The theorem in the case of almost sure convergence follows immediately from Theorem 1.7 by conditioning on $M_n$, so it remains to verify the theorem in the case of convergence in probability.

Let us fix a test function $f$ (as in (1)) and a positive $\varepsilon$. By the boundedness in probability of $\frac{1}{n^2}\|M_n\|_2^2$, we can find a $C = C_\varepsilon$ such that $\mathbf P(M_n \in \Omega_n) \ge 1-\varepsilon$, where
$$\Omega_n := \left\{M \in M_n(\mathbf C) : \frac{1}{n^2}\|M\|_2^2 \le C\right\}.$$
Let $M_n^f$ be the matrix in $\Omega_n$ which maximizes the quantity
$$\mathbf P\left(\left|\int_{\mathbf C} f(z)\ d\mu_{\frac{1}{\sqrt n}(M_n^f + X_n)}(z) - \int_{\mathbf C} f(z)\ d\mu_{\frac{1}{\sqrt n}(M_n^f + Y_n)}(z)\right| \ge \varepsilon\right).$$
Applying Theorem 1.7 to the sequences $M_n^f + X_n$ and $M_n^f + Y_n$, we see that this quantity is $o(1)$. Theorem 1.17 follows by integrating over all possible values of $M_n$ using the definition of $M_n^f$, as well as the fact that $\mathbf P(\Omega_n) \ge 1-\varepsilon$, and then letting $\varepsilon \to 0$.

7.2. Proof of Theorem 1.18. We first verify the claim for convergence in probability.

The condition (i) of Theorem 2.1 is satisfied thanks to the boundedness in probability of (5). In order to complete the proof, one needs to check (ii). Notice that
$$\det\left(\tfrac{1}{\sqrt n}A_n - zI\right) = \det\left(\tfrac{1}{\sqrt n}(K_n^{-1}M_nL_n^{-1} + X_n) - zK_n^{-1}L_n^{-1}\right)\det L_nK_n.$$
The term $\det L_nK_n$ also appears in $\det(\frac{1}{\sqrt n}B_n - zI)$ and becomes additive (and thus cancels) after taking logarithms. Therefore, one only needs to show that
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}(K_n^{-1}M_nL_n^{-1} + X_n) - zK_n^{-1}L_n^{-1}\right)\right| - \frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}(K_n^{-1}M_nL_n^{-1} + Y_n) - zK_n^{-1}L_n^{-1}\right)\right|$$
converges in probability to zero. One can obtain this by repeating the proof of Proposition 2.2.
[Footnote: If the maximum is not attained, one can instead choose $M_n^f$ to be a matrix which maximizes this quantity to within a factor of two (say).]

The slight change here is that $zI$ is replaced by $zK_n^{-1}L_n^{-1}$, but this has no significant impact, except that we need to show that
$$F_n := \frac{1}{\sqrt n}K_n^{-1}M_nL_n^{-1} - zK_n^{-1}L_n^{-1}$$
satisfies
$$\frac{1}{n}\operatorname{trace}(F_nF_n^*) = \frac{1}{n}\|F_n\|_2^2 = O(1)$$
almost surely (in order to guarantee (3)). But this is a consequence of the boundedness in probability of (5).

The proof of the almost sure convergence is established similarly, with the obvious changes (e.g. replacing boundedness in probability with almost sure boundedness). We omit the details.

8. Proof of Theorem 1.20

We first prove that (ii) implies (i) for almost sure convergence. Let $A_n$ and $\mu$ be as in Theorem 1.20. Construct a diagonal matrix $B'_n$ whose diagonal entries are independent samples from $\mu$, and let $B_n := \sqrt n B'_n$. We wish to invoke Theorem 2.1. We first need to verify the almost sure boundedness of (9). The bound for $A_n$ follows from Lemma 1.9, and the bound for $B_n$ follows from the second moment hypothesis on $\mu$ and the (strong) law of large numbers. By Theorem 2.1, the problem now reduces to showing that for almost all complex numbers $z$,
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}A_n - zI\right)\right| - \frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}B_n - zI\right)\right|$$
converges almost surely to zero. The second term is easy to compute:
$$\frac{1}{n}\log\left|\det\left(\tfrac{1}{\sqrt n}B_n - zI\right)\right| = \frac{1}{n}\log\left|\det(B'_n - zI)\right| = \frac{\sum_{i=1}^n \log|\lambda_i - z|}{n},$$
where the $\lambda_i$ are iid samples from $\mu$. On the other hand, from Fubini's theorem we see that $\int_{\mathbf C} \log|w-z|\ d\mu(w)$ is locally integrable in $z$, and thus
$$\int_{\mathbf C} \big|\log|w-z|\big|\ d\mu(w) < \infty \qquad (43)$$
for almost every $z$. If $z$ is such that (43) holds, then by the strong law of large numbers, we see that $\frac{\sum_{i=1}^n \log|\lambda_i - z|}{n}$ converges almost surely to $\int_{\mathbf C} \log|w-z|\ d\mu(w)$. This shows that (ii) implies (i) for almost sure convergence.
The proof for convergence in probability is identical and is left as an exercise to the reader.

Now we show that (iii) implies (ii) for almost sure convergence. Let $z$ be such that (43) and (iii) hold. To show (ii), it suffices from (11) to show that $\frac{1}{n}\sum_{i=1}^n \log\sigma_i$ converges almost surely to $\int_{\mathbf C} \log|w-z|\ d\mu(w)$, where $\sigma_i = \sigma_i(\frac{1}{\sqrt n}A_n - zI)$ are the singular values of $\frac{1}{\sqrt n}A_n - zI$. On the other hand, from (iii) we already know that $\frac{1}{n}\sum_{i=1}^n \log\sqrt{\sigma_i^2 + \varepsilon_n}$ converges almost surely to $\int_{\mathbf C} \log|w-z|\ d\mu(w)$. Thus it suffices to show that
$$\frac{1}{n}\sum_{i=1}^n \log\sqrt{\sigma_i^2 + \varepsilon_n} - \log\sigma_i \qquad (44)$$
converges almost surely to zero.

From Lemma 1.9, we know that $\frac{1}{n^2}\|A_n\|_2^2$ is almost surely bounded, and so for each $z$,
$$\frac{1}{n}\sum_{i=1}^n \sigma_i^2 = \frac{1}{n}\left\|\tfrac{1}{\sqrt n}A_n - zI\right\|_2^2$$
is almost surely bounded also. From this we easily see that
$$\frac{1}{n}\sum_{1 \le i \le n:\ \sigma_i \ge \delta_n} \log\sqrt{\sigma_i^2 + \varepsilon_n} - \log\sigma_i$$
converges almost surely to zero for some sequence $\delta_n$ (depending on $\varepsilon_n$) converging sufficiently slowly to zero. To conclude the almost sure convergence of (44) to zero, it thus suffices to show that
$$\frac{1}{n}\sum_{1 \le i \le n:\ \sigma_i \le \delta_n} \log\frac{1}{\sigma_i}$$
converges almost surely to zero. Using Lemma 4.1, we almost surely have $\sup_i \log\frac{1}{\sigma_i} \le O(\log n)$ for all but finitely many $n$, so it suffices to show that
$$\frac{1}{n}\sum_{1 \le i \le n-n^{0.99}:\ \sigma_i < \delta_n} \log\frac{1}{\sigma_i}$$
converges almost surely to zero. To do this, it suffices by the union bound and the Borel-Cantelli lemma to show that
$$\mathbf P\left(\sigma_{n-i} \le \frac{ci}{n}\right) = O(\exp(-n^{0.01})) \qquad (45)$$
for all $n^{0.99} \le i \le n-1$ and some absolute constant $c > 0$.

For this we argue as in the proof of Lemma 6.7. Fix $i$. Let $A'_n$ be the matrix formed by the first $n-k$ rows of $A_n - z\sqrt n I$, with $k := \lfloor i/2\rfloor$, and let $\sigma'_j$, $1 \le j \le n-k$, be the singular values of $A'_n$ (in decreasing order, as usual). By the interlacing law (Lemma A.1) and re-normalizing,
$$\sigma_{n-i} \ge \frac{1}{\sqrt n}\,\sigma'_{n-i}.$$
(46)By Lemma A.4, we have that σ (cid:48)− + · · · + σ (cid:48)− n − k = dist − + · · · + dist − n − k , where dist j is the distance from the j th row of A (cid:48) n to the subspacespanned by the remaining rows.As shown in the proof of Lemma 4.2, with probability 1 − exp( − n − . ),dist j is bounded from below by Ω( √ k ) = Ω( √ i ) for all j . Thus, withthis probability, the right hand side in the above identity is O ( n/i ).On the other hand, as the σ (cid:48) j are ordered decreasingly, the left handside is at least ( i − k ) σ (cid:48)− n − i = i σ (cid:48)− n − i . It follows that with probability 1 − exp( − n − . ), σ (cid:48) n − i = Ω( i √ n ) . This and (46) complete the proof of (45), and so (44) converges almostsurely to zero.As previously observed, the convergence of (44) to zero shows that (ii)implies (iii) for almost sure convergence. An inspection of the argumentshows the convergence of (44) to zero also lets us deduce (iii) from (ii).The claim for convergence in probability follows similarly. To concludethe proof of Theorem 1.20, it thus suffices to show that (i) implies (ii).Again we start with the almost sure convergence case. Assume that(i) holds, and let z be such that (43) holds. By shifting A by √ nzI ifnecessary we may take z to be zero. Let λ , . . . , λ n denote the eigen-values of √ n A n . By (11), it suffices to show that n (cid:80) nj =1 log | λ j | con-verges almost surely to (cid:82) C log | w | dµ ( w ). From (13) we know that n (cid:80) nj =1 | λ j | is almost surely bounded. From this and (i) we concludethat n (cid:80) nj =1 log( | λ j | + ε ) converges almost surely to (cid:82) C (log | w | + ε ) dµ ( w )for any fixed ε > 0. Combining this with (43) and dominated con-vergence, we see that n (cid:80) nj =1 log( | λ j | + ε n ) converges almost surely to (cid:82) C log | w | dµ ( w ) for some sequence ε n > n n (cid:88) j =1 log( | λ j | + ε n ) − log | λ j | converges almost surely to zero. 
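The identity (11), which drives this chain of equivalences, states that $\frac{1}{n}\log|\det M| = \frac{1}{n}\sum_i \log\sigma_i(M)$ for any square matrix $M$, and for $M = \frac{1}{\sqrt{n}}A_n - zI$ the same quantity also equals $\frac{1}{n}\sum_j \log|\lambda_j - z|$ with $\lambda_j$ the eigenvalues of $\frac{1}{\sqrt{n}}A_n$. A quick numerical sanity check (illustrative only; a Gaussian matrix stands in for $A_n$):

```python
import numpy as np

rng = np.random.default_rng(42)
n, z = 60, 0.3 + 0.4j

# complex Gaussian matrix with unit-variance entries, playing the role of A_n
A = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
M = A / np.sqrt(n) - z * np.eye(n)

# (1/n) log|det M|, via slogdet to avoid overflow/underflow of the determinant
_, logabsdet = np.linalg.slogdet(M)
lhs = logabsdet / n

# (1/n) * sum of log singular values of M
via_sv = np.sum(np.log(np.linalg.svd(M, compute_uv=False))) / n

# (1/n) * sum of log distances |lambda_j - z| over eigenvalues of A/sqrt(n)
via_eig = np.sum(np.log(np.abs(np.linalg.eigvals(A / np.sqrt(n)) - z))) / n

print(lhs, via_sv, via_eig)   # all three agree up to floating-point rounding
```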
By repeating the arguments used to establish the almost sure convergence of (44) to zero, it suffices to show that
$$\frac{1}{n}\sum_{1\le i\le n:\ |\lambda_i|\le\delta_n}\log\frac{1}{|\lambda_i|}$$
converges almost surely to zero.

Let us order the eigenvalues $\lambda_i$ so that $|\lambda_1| \ge \dots \ge |\lambda_n|$. From Lemma 4.1 and (45) (and the Borel-Cantelli lemma) we know that we almost surely have
$$\frac{1}{n}\sum_{(1-\kappa)n\le i\le n}\log\frac{1}{\sigma_i} \le O\Big(\kappa\log\frac{1}{\kappa}\Big) + o(1)$$
for any fixed $0 < \kappa < 1/2$, and hence by Weyl's comparison inequality (Lemma A.3) that we almost surely have
$$\frac{1}{n}\sum_{(1-\kappa)n\le i\le n}\log\frac{1}{|\lambda_i|} \le O\Big(\kappa\log\frac{1}{\kappa}\Big) + o(1).$$
Since (43) forces $\mu(\{0\}) = 0$, it follows from (i) that the proportion of eigenvalues with $|\lambda_i| \le \delta_n$ is almost surely less than $\kappa$ for all sufficiently large $n$; as the $|\lambda_i|$ are also almost surely bounded, we conclude that
$$\frac{1}{n}\sum_{1\le i\le n:\ |\lambda_i|\le\delta_n}\log\frac{1}{|\lambda_i|} \le O\Big(\kappa\log\frac{1}{\kappa}\Big) + o(1)$$
almost surely. Letting $\kappa \to 0$, the claim follows. The analogous implication for convergence in probability is similar. The proof of Theorem 1.20 is now complete.

Appendix A. Linear algebra inequalities

In this appendix we record some elementary identities and inequalities regarding the eigenvalues and singular values of matrices.

Lemma A.1 (Cauchy's interlacing law). Let $A$ be an $n\times n$ matrix with complex entries and let $A'$ be the submatrix formed by the first $m := n-k$ rows. Let $\sigma_1(A) \ge \dots \ge \sigma_n(A) \ge 0$ denote the singular values of $A$, and similarly for $A'$. Then we have
$$\sigma_i(A) \ge \sigma_i(A') \ge \sigma_{i+k}(A)$$
for every $1 \le i \le n-k$.

Proof. The claim follows easily from the minimax characterizations
$$\sigma_i(A) = \sup_{V_i\subseteq\mathbb{C}^n}\ \inf_{v\in V_i:\ \|v\|=1}\|Av\| \quad\text{and}\quad \sigma_i(A') = \sup_{V_i\subseteq\mathbb{C}^n}\ \inf_{v\in V_i:\ \|v\|=1}\|A'v\|$$
of the singular values, where $V_i$ ranges over $i$-dimensional complex subspaces. $\Box$

Lemma A.2 (Weyl comparison inequality for second moment). Let $A = (a_{ij})_{1\le i,j\le n} \in M_n(\mathbb{C})$ have generalized eigenvalues $\lambda_1,\dots,\lambda_n\in\mathbb{C}$ and singular values $\sigma_1(A)\ge\dots\ge\sigma_n(A)\ge 0$. Then
$$\sum_{j=1}^n |\lambda_j|^2 \le \sum_{j=1}^n \sigma_j(A)^2 = \|A\|^2 = \sum_{i=1}^n\sum_{j=1}^n |a_{ij}|^2.$$

Proof. The two equalities here are clear, so it suffices to prove the inequality. By the Jordan normal form we can write $A = BUB^{-1}$ for some upper-triangular $U$ and invertible $B$. By the QR factorization we can write $B = QR$ for some unitary $Q$ and upper-triangular $R$. We conclude that $A = QVQ^{-1}$ for some upper-triangular $V$ (namely $V = RUR^{-1}$). Conjugating by $Q$, we thus reduce to the case when $A$ is an upper-triangular matrix, in which case the eigenvalues are simply the diagonal entries $a_{11},\dots,a_{nn}$ and the claim is clear. $\Box$
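Lemmas A.1 and A.2 are easy to probe numerically. The following sketch (illustrative only, not part of the text) checks the interlacing of singular values under deletion of $k$ rows, and the Weyl second-moment bound, on a random complex matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 12, 3

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A_sub = A[: n - k, :]                         # first m = n - k rows

s = np.linalg.svd(A, compute_uv=False)        # sigma_1(A) >= ... >= sigma_n(A)
s_sub = np.linalg.svd(A_sub, compute_uv=False)

# Lemma A.1: sigma_i(A) >= sigma_i(A') >= sigma_{i+k}(A)  (0-indexed here)
for i in range(n - k):
    assert s[i] + 1e-9 >= s_sub[i] >= s[i + k] - 1e-9

# Lemma A.2: sum |lambda_j|^2 <= sum sigma_j(A)^2 = ||A||_F^2
lam = np.linalg.eigvals(A)
assert np.sum(np.abs(lam) ** 2) <= np.sum(s ** 2) + 1e-6
assert np.isclose(np.sum(s ** 2), np.linalg.norm(A, 'fro') ** 2)

print("interlacing and second-moment checks passed")
```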
We also have the following (stronger) variant of the above inequality:

Lemma A.3 (Weyl comparison inequality for products). Let $A = (a_{ij})_{1\le i,j\le n} \in M_n(\mathbb{C})$ have generalized eigenvalues $\lambda_1,\dots,\lambda_n\in\mathbb{C}$, ordered so that $|\lambda_1| \ge \dots \ge |\lambda_n|$, and singular values $\sigma_1(A)\ge\dots\ge\sigma_n(A)\ge 0$. Then we have
$$\prod_{j=1}^J |\lambda_j| \le \prod_{j=1}^J \sigma_j(A) \quad\text{and}\quad \prod_{j=J}^n \sigma_j(A) \le \prod_{j=J}^n |\lambda_j|$$
for all $1 \le J \le n$.

Proof. It suffices to prove the former claim, as the latter then follows from (11). By arguing as in Lemma A.2 we may assume that $A$ is upper triangular, so that the diagonal entries are some permutation of $\lambda_1,\dots,\lambda_n$. Consider the symmetric minor $A'$ of $A$ formed by the rows and columns corresponding to the entries $\lambda_1,\dots,\lambda_J$. The determinant of this matrix is $\lambda_1\cdots\lambda_J$, and thus by (11) we have
$$\prod_{j=1}^J \sigma_j(A') = \prod_{j=1}^J |\lambda_j|.$$
The claim then follows from the Cauchy interlacing inequality (Lemma A.1). $\Box$

Now we record a useful identity for the negative second moment of a rectangular matrix.

Lemma A.4 (Negative second moment). Let $1 \le n' \le n$, and let $A$ be a full-rank $n'\times n$ matrix with singular values $\sigma_1(A) \ge \dots \ge \sigma_{n'}(A) > 0$ and rows $X_1,\dots,X_{n'} \in \mathbb{C}^n$. For each $1 \le i \le n'$, let $W_i$ be the hyperplane generated by the $n'-1$ rows $X_1,\dots,X_{i-1},X_{i+1},\dots,X_{n'}$. Then
$$\sum_{j=1}^{n'} \sigma_j(A)^{-2} = \sum_{j=1}^{n'} \mathrm{dist}(X_j, W_j)^{-2}.$$

Proof. Observe that the $n'\times n'$ matrix $(AA^*)^{-1}$ has eigenvalues $\sigma_1(A)^{-2},\dots,\sigma_{n'}(A)^{-2}$. Taking traces, we conclude that
$$\sum_{j=1}^{n'} \sigma_j(A)^{-2} = \sum_{j=1}^{n'} (AA^*)^{-1}e_j\cdot e_j,$$
where $e_1,\dots,e_{n'}$ is the standard basis of $\mathbb{C}^{n'}$. But if $v_j := (AA^*)^{-1}e_j = (v_{j,1},\dots,v_{j,n'})$, then $A^*v_j = v_{j,1}X_1 + \cdots
\cdots + v_{j,n'}X_{n'}$ is orthogonal to $A^*e_i = X_i$ for $i \ne j$ (and is thus orthogonal to $W_j$), and has an inner product of $1$ with $A^*e_j = X_j$. Since $A^*v_j$ is orthogonal to $W_j$, it equals $v_{j,j}$ times the component of $X_j$ orthogonal to $W_j$; taking inner products with $X_j$, we conclude that
$$v_{j,j}\,\mathrm{dist}(X_j, W_j)^2 = 1.$$
Since $v_{j,j} = v_j\cdot e_j = (AA^*)^{-1}e_j\cdot e_j$, the claim follows. $\Box$

Appendix B. A result of Dozier and Silverstein

Here we reproduce Theorem 1.1 of [3], which we used at the end of Section 6.

Theorem B.1 ([3, Theorem 1.1]). Let $c$ be a positive constant and let $x$ be a random variable with variance one. Let $X_n$ be an $n\times r$ random matrix whose entries are iid copies of $x$, where $r = (c+o(1))n$. Let $M_n$ be a random $n\times r$ matrix independent of $X_n$ such that the ESD of $\frac{1}{cn}M_nM_n^*$ converges to a limiting distribution $H$. Define $C_n := \frac{1}{cn}(M_n+X_n)(M_n+X_n)^*$. Then the ESD of $C_n$ converges almost surely (and hence also in probability) to a limiting distribution $F$, whose Stieltjes transform $m(z) := \int \frac{1}{\lambda-z}\,dF(\lambda)$ satisfies the integral equation
$$m = \int \frac{dH(t)}{\frac{t}{1+cm} - (1+cm)z + (1-c)} \qquad (47)$$
for any $z$ in the upper half-plane.

Remark B.2. The theorem still holds if we restrict the sizes $n$ of the matrices to an infinite subsequence $n_1 < n_2 < \dots$ of positive integers. One can show this by, for example, artificially filling in the missing indices, or by repeating the proof of Theorem B.1 under this restriction.

Remark B.3. In (47) only $H$ appears; the actual definition of $M_n$ is irrelevant. Thus, one can conclude that if $M_n$ and $M'_n$ are such that the ESDs of $\frac{1}{cn}M_nM_n^*$ and $\frac{1}{cn}M'_nM_n'^*$ tend to the same limit, then the ESDs of $\frac{1}{cn}(M_n+X_n)(M_n+X_n)^*$ and $\frac{1}{cn}(M'_n+X_n)(M'_n+X_n)^*$ also tend to the same limit.

Remark B.4. It was mentioned by Speicher [21] and also by Krishnapur (private communication) that Theorem B.1 can be proved using free probability, which is different from the approach in [3].

Appendix C.
Using a Hermitian invariance principle (by Manjunath Krishnapur)

The authors have shown invariance principles for ESDs of several non-Hermitian matrix models. As in earlier papers, the proof goes through Hermitian matrices, but it does not need rates of convergence of the Hermitian ESDs, thanks to new ideas such as Lemma 4.2. However, because of the use of Theorem B.1, it may appear that a limiting result for the associated Hermitian matrices is necessary to carry the program through. In this appendix, we point out how one may obtain a weak invariance principle for ESDs of non-Hermitian matrices by using an invariance principle for Hermitian matrices due to Chatterjee [4], in cases where a convergence result such as Theorem B.1 is not available. As mentioned earlier, other parts of the proof do not require the entries to be iid. Thus, as a consequence, we can obtain a weak invariance principle for a random matrix model with independent but not identically distributed entries.

We need the following definition from [26, Section 2].

Definition C.1 (Controlled second moment). Let $\kappa \ge 1$. A complex random variable $x$ is said to have $\kappa$-controlled second moment if one has the upper bound $\mathbf{E}|x|^2 \le \kappa$ (in particular, $|\mathbf{E}x| \le \kappa^{1/2}$), and the lower bound
$$\mathbf{E}\,\mathrm{Re}(zx-w)^2\,\mathbf{I}(|x|\le\kappa) \ge \frac{1}{\kappa}\,\mathrm{Re}(z)^2 \qquad (48)$$
for all complex numbers $z, w$.

Example. The Bernoulli random variable ($\mathbf{P}(x=+1) = \mathbf{P}(x=-1) = 1/2$) has $1$-controlled second moment. The condition (48) asserts in particular that $x$ has variance at least $1/\kappa$, but it also asserts that a significant portion of this variance occurs inside the event $|x|\le\kappa$, and it also contains some more technical phase information about the covariance matrix of $\mathrm{Re}(x)$ and $\mathrm{Im}(x)$.

Theorem C.2. Let $M_n = \big(\mu^{(n)}_{i,j}\big)_{i,j\le n}$ and $C_n = \big(\sigma^{(n)}_{i,j}\big)_{i,j\le n}$ be constant (i.e. deterministic) matrices satisfying
(1) $\sup_n n^{-1}\|M_n\| < \infty$,
(2) $a \le \sigma^{(n)}_{i,j} \le b$ for all $n, i, j$, for some $0 < a < b < \infty$.
Given a matrix $X = (x_{i,j})_{i,j\le n}$, set
$$A_n(X) = \frac{1}{\sqrt{n}}(M_n + C_n\cdot X) = \frac{1}{\sqrt{n}}\big(\mu^{(n)}_{i,j} + \sigma^{(n)}_{i,j}x_{i,j}\big)_{i,j\le n}$$
(here "$\cdot$" denotes the Hadamard product). Now suppose that the $x^{(n)}_{i,j}$ are independent complex-valued random variables with $\mathbf{E}[x^{(n)}_{i,j}] = 0$ and $\mathbf{E}[|x^{(n)}_{i,j}|^2] = 1$, and that the $y^{(n)}_{i,j}$ are independent random variables, also having zero mean and unit variance. Assume furthermore that both the $x^{(n)}_{i,j}$ and the $y^{(n)}_{i,j}$ have $\kappa$-controlled second moment for some constant $\kappa$. Assume also Pastur's condition
$$\frac{1}{n^2}\sum_{i,j=1}^n \mathbf{E}\Big[|x^{(n)}_{i,j}|^2\,\mathbf{I}\big(|x^{(n)}_{i,j}| \ge \epsilon\sqrt{n}\big)\Big] \longrightarrow 0 \quad\text{for all } \epsilon > 0, \qquad (49)$$
and the same for $Y$ in place of $X$. Then $\mu_{A_n(X)} - \mu_{A_n(Y)} \to 0$ in the sense of probability.

Some remarks.
(1) If we assume that the $x^{(n)}_{i,j}$ are i.i.d. and the $y^{(n)}_{i,j}$ are i.i.d., then Pastur's condition is obviously satisfied. Further, the condition of $\kappa$-controlled second moment is also not necessary (see the first step in the proof sketch).
(2) Although the weak invariance principle in the paper uses only subsequential limits (see Remark 6.4), it does use Theorem B.1 to say that the subsequential limits are the same for $X$ as for $Y$. Hence we need some changes in the proof in order to establish Theorem C.2, which we make in this appendix.
(3) This highlights the important new ideas of the paper, such as Lemma 4.2, which eliminate the need for rates of convergence of the ESDs of the Hermitian matrices $(A_n - zI)^*(A_n - zI)$. This is unlike all earlier papers in the subject, which followed Bai's approach and required such rates (e.g., [1], [26], [9], [15]). The need for rates made it impossible to use the invariance principle for Hermitian matrices as we shall do now.
(4) Take $C_n = J$ (the all-ones matrix) and $M_n = 0$.
Then Pastur's condition (49) implies almost sure convergence of the ESD of $A_n(X)^*A_n(X)$ (see [2, Theorem 3.9]). For general $C_n$, since we use Chatterjee's invariance principle, which assumes Pastur's condition but gives only weak invariance, we are able to assert only weak invariance for the non-Hermitian ESDs as well. Thus, there is some room for improvement here, namely, to strengthen the conclusion of Theorem C.2 to almost sure convergence.
(5) Does the ESD of $A_n(X)$ converge? Perhaps so, provided the singular values of $C_n - zI$ have a limiting measure for every $z$. In [12] we discuss some easy-to-check sufficient conditions on $C_n$ which imply convergence.

The following lemma is a "Wishart" analogue of the computations in Section 2 of [4], which considers Wigner matrices. As in that paper, the idea is to consider the Stieltjes transform of the ESD of $A_n(X)^*A_n(X)$ as a function of $X$. However, a slight twist is needed as compared to Wigner matrices, because the entries of $A_n(X)^*A_n(X)$ are quadratic in $X$, whereas the invariance principle we invoke requires bounds on the sup-norm of the derivatives of the Stieltjes transform.

Lemma C.3. Let $X$ and $Y$ be as in Theorem C.2. Let $\nu^X_n$ and $\nu^Y_n$ be the ESDs of $A_n(X)^*A_n(X)$ and $A_n(Y)^*A_n(Y)$. Then $\nu^X_n - \nu^Y_n \to 0$ weakly as $n \to \infty$.

Proof. Let
$$H_n(X) = \begin{pmatrix} 0 & A_n(X) \\ A_n(X)^* & 0 \end{pmatrix}$$
have ESD $\theta^X_n$. The eigenvalues of $H_n(X)$ are exactly the positive and negative square roots of the eigenvalues of $A_n(X)^*A_n(X)$. Thus we must show that $\theta^X_n - \theta^Y_n \to 0$. Fix $\alpha$ in the upper half-plane and let $f(X) := \frac{1}{n}\mathrm{Tr}(H_n(X) - \alpha I)^{-1}$. The proof is complete if we show that $\mathbf{E}[f(X)] - \mathbf{E}[f(Y)] \to 0$ for every $\alpha$ with $\mathrm{Im}\{\alpha\} > 0$. This can be done by following the same calculations as in [4]. It works because the entries of $H_n(X)$ are linear in $X$, and hence the first partial derivative of $H_n$ with respect to any $x_{i,j}$ is a constant matrix.
One must also use the upper bound on the $\sigma^{(n)}_{i,j}$ to bound the derivatives of $f$. $\Box$

Remark: Obviously the same conclusion holds for $A_n - zI$, just by absorbing the shift $-z\sqrt{n}\,I$ into $M_n$.

Proof of Theorem C.2. The conditions on $M_n$ and $C_n$ show that the first condition of Theorem 2.1 is satisfied (where the two matrices $A_n$ and $B_n$ are now $A_n(X)$ and $A_n(Y)$). Thus we only need to show an analogue of Proposition 2.2 (only the weak part). We sketch the modifications needed.
(1) Lemma 4.1 can be proved under independence and $\kappa$-controlled second moment, without the i.i.d. assumption (see [26, Theorem 2.5]). If we make the i.i.d. assumption, then Lemma 4.1 is itself applicable, which explains the first remark after the statement of the theorem. The upper bounds on singular values in (31) are very general and hold in our setting for the same reasons. Hence we reduce to Lemma 4.2 and Lemma 4.3 as in the paper.
(2) The high-dimensional contribution (the analogue of Lemma 4.2) is proved in almost the same way. In the proof of the lower tail bound (Proposition 5.1), use the bounds on the $\sigma^{(n)}_{i,j}$ appropriately. In particular, we get a lower bound of $a^2(n-d)$ for the second moment of $\mathrm{dist}(X, W)$ in Lemma 5.3, and in applying Theorem 5.2 we get a Lipschitz constant of $b$ for $F(X) = \mathrm{dist}(X, W)$.
(3) In the low-dimensional contribution (Lemma 4.3), the calculations in Sections 6.1, 6.5 and 6.10 are exactly as before (in Section 6.5, we use the concentration result already outlined in the previous step).
(4) That leaves Section 6.2, which is the only step that is handled differently. Here we apply Lemma C.3 instead of quoting Theorem B.1. $\Box$

Acknowledgements. The first author is supported by a grant from the MacArthur Foundation and by NSF grant DMS-0649473. The second author is supported by an NSF Career Grant. The authors would like to thank M. Krishnapur for useful discussions and his careful reading of an early draft, and Ken Miller, Ricky, and weiyu for further corrections. We would also like to thank P. Matchett Wood for providing the figures in the introduction.

References

[1] Z. D. Bai, Circular law, Ann. Probab. 25 (1997), 494-529.
[2] Z. D. Bai and J. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, Mathematics Monograph Series, Science Press, Beijing, 2006.
[3] R. Dozier and J. Silverstein, On the empirical distribution of eigenvalues of large dimensional information-plus-noise-type matrices, J. Multivariate Anal. 98 (2007), 678-694.
[4] S. Chatterjee, A simple invariance principle, arXiv:math/0508213.
[5] D. Chafai, Circular law for non-central random matrices, preprint.
[6] A. Edelman, Eigenvalues and condition numbers of random matrices, SIAM J. Matrix Anal. Appl. 9 (1988), 543-560.
[7] V. L. Girko, Circular law, Theory Probab. Appl. 29 (1984), 694-706.
[8] V. L. Girko, The strong circular law. Twenty years later. II, Random Oper. Stochastic Equations 12 (2004), no. 3, 255-312.
[9] F. Götze and A. N. Tikhomirov, On the circular law, preprint.
[10] F. Götze and A. N. Tikhomirov, The circular law for random matrices, preprint.
[11] J. Ginibre, Statistical ensembles of complex, quaternion, and real matrices, Journal of Mathematical Physics 6 (1965), 440-449.
[12] M. Krishnapur and V. Vu, manuscript in preparation.
[13] M. Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs, volume 89, AMS, 2001.
[14] M. L. Mehta, Random Matrices and the Statistical Theory of Energy Levels, Academic Press, New York, NY, 1967.
[15] G. Pan and W. Zhou, Circular law, extreme singular values and potential theory, preprint.
[16] L. A. Pastur, On the spectrum of random matrices, Teoret. Mat. Fiz. 10 (1972), 102-112.
[17] M. Rudelson, Invertibility of random matrices: norm of the inverse, Annals of Mathematics, to appear.
[18] M. Rudelson and R. Vershynin, The Littlewood-Offord problem and the condition number of random matrices, Advances in Mathematics, to appear.
[19] M. Rudelson and R. Vershynin, The smallest singular value of a rectangular random matrix, preprint.
[20] M. Rudelson and R. Vershynin, The least singular value of a random square matrix is $O(n^{-1/2})$, preprint.
[21] R. Speicher, survey in preparation.
[22] T. Tao and V. Vu, On random $\pm1$ matrices: singularity and determinant, Random Structures Algorithms 28 (2006), no. 1, 1-23.
[23] T. Tao and V. Vu, Additive Combinatorics, Cambridge University Press, 2006.
[24] T. Tao and V. Vu, Inverse Littlewood-Offord theorems and the condition number of random discrete matrices, Annals of Mathematics, to appear.
[25] T. Tao and V. Vu, The condition number of a randomly perturbed matrix, STOC 2007.
[26] T. Tao and V. Vu, Random matrices: the circular law, Communications in Contemporary Mathematics 10 (2008), 261-307.
[27] T. Tao and V. Vu, Random matrices: a general approach for the least singular value problem, preprint.
[28] E. P. Wigner, On the distribution of the roots of certain symmetric matrices, Annals of Mathematics 67 (1958), 325-327.

Department of Mathematics, UCLA, Los Angeles, CA 90095-1555
E-mail address: tao@math.ucla.edu

Department of Mathematics, Rutgers University, Piscataway, NJ 08854-8019
E-mail address: vanvu@math.rutgers.edu

Department of Mathematics, University of Toronto, Toronto, Canada M5S 2E4