A Random Matrix Approach to Neural Networks
aa r X i v : . [ m a t h . P R ] J un Submitted to the Annals of Applied Probability
A RANDOM MATRIX APPROACHTO NEURAL NETWORKS
By Cosme Louart, Zhenyu Liao, and Romain Couillet ∗ CentraleSup´elec, University of Paris–Saclay, France.
This article studies the Gram random matrix model G = T Σ T Σ,Σ = σ ( W X ), classically found in the analysis of random feature mapsand random neural networks, where X = [ x , . . . , x T ] ∈ R p × T is a(data) matrix of bounded norm, W ∈ R n × p is a matrix of indepen-dent zero-mean unit variance entries, and σ : R → R is a Lipschitzcontinuous (activation) function — σ ( W X ) being understood entry-wise. By means of a key concentration of measure lemma arisingfrom non-asymptotic random matrix arguments, we prove that, as n, p, T grow large at the same rate, the resolvent Q = ( G + γI T ) − ,for γ >
0, has a similar behavior as that met in sample covariancematrix models, involving notably the moment Φ = Tn E[ G ], which pro-vides in passing a deterministic equivalent for the empirical spectralmeasure of G . Application-wise, this result enables the estimation ofthe asymptotic performance of single-layer random neural networks.This in turn provides practical insights into the underlying mecha-nisms into play in random neural networks, entailing several unex-pected consequences, as well as a fast practical means to tune thenetwork hyperparameters.
1. Introduction.
Artificial neural networks, developed in the late fifties(Rosenblatt, 1958) in an attempt to develop machines capable of brain-likebehaviors, know today an unprecedented research interest, notably in its ap-plications to computer vision and machine learning at large (Krizhevsky, Sutskever and Hinton,2012; Schmidhuber, 2015) where superhuman performances on specific tasksare now commonly achieved. Recent progress in neural network perfor-mances however find their source in the processing power of modern com-puters as well as in the availability of large datasets rather than in thedevelopment of new mathematics. In fact, for lack of appropriate tools tounderstand the theoretical behavior of the non-linear activations and de-terministic data dependence underlying these networks, the discrepancy be-tween mathematical and practical (heuristic) studies of neural networks haskept widening. A first salient problem in harnessing neural networks liesin their being completely designed upon a deterministic training dataset ∗ Couillet’s work is supported by the ANR Project RMT4GRAPH (ANR-14-CE28-0006).
MSC 2010 subject classifications:
Primary 60B20; secondary 62M45 imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. X = [ x , . . . , x T ] ∈ R p × T , so that their resulting performances intricatelydepend first and foremost on X . Recent works have nonetheless establishedthat, when smartly designed, mere randomly connected neural networkscan achieve performances close to those reached by entirely data-drivennetwork designs (Rahimi and Recht, 2007; Saxe et al., 2011). As a mat-ter of fact, to handle gigantic databases, the computationally expensivelearning phase (the so-called backpropagation of the error method) typi-cal of deep neural network structures becomes impractical, while it was re-cently shown that smartly designed single-layer random networks (as stud-ied presently) can already reach superhuman capabilities (Cambria et al.,2015) and beat expert knowledge in specific fields (Jaeger and Haas, 2004).These various findings have opened the road to the study of neural networksby means of statistical and probabilistic tools (Choromanska et al., 2015;Giryes, Sapiro and Bronstein, 2015). The second problem relates to the non-linear activation functions present at each neuron, which have long beenknown (as opposed to linear activations) to help design universal approxi-mators for any input-output target map (Hornik, Stinchcombe and White,1989).In this work, we propose an original random matrix-based approach tounderstand the end-to-end regression performance of single-layer randomartificial neural networks, sometimes referred to as extreme learning ma-chines (Huang, Zhu and Siew, 2006; Huang et al., 2012), when the number T and size p of the input dataset are large and scale proportionally withthe number n of neurons in the network. These networks can also be seen,from a more immediate statistical viewpoint, as a mere linear ridge-regressorrelating a random feature map σ ( W X ) ∈ R n × T of explanatory variables X = [ x , . . . , x T ] ∈ R p × T and target variables y = [ y , . . . , y T ] ∈ R d × T ,for W ∈ R n × p a randomly designed matrix and σ ( · ) a non-linear R → R function (applied component-wise). Our approach has several interestingfeatures both for theoretical and practical considerations. It is first one ofthe few known attempts to move the random matrix realm away from ma-trices with independent or linearly dependent entries. Notable exceptionsare the line of works surrounding kernel random matrices (El Karoui, 2010;Couillet and Benaych-Georges, 2016) as well as large dimensional robuststatistics models (Couillet, Pascal and Silverstein, 2015; El Karoui, 2013;Zhang, Cheng and Singer, 2014). Here, to alleviate the non-linear difficulty,we exploit concentration of measure arguments (Ledoux, 2005) for non-asymptotic random matrices, thereby pushing further the original ideas of(El Karoui, 2009; Vershynin, 2012) established for simpler random matrixmodels. While we believe that more powerful, albeit more computational imsart-aap ver. 
2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS intensive, tools (such as an appropriate adaptation of the Gaussian toolsadvocated in (Pastur and ˆSerbina, 2011)) cannot be avoided to handle ad-vanced considerations in neural networks, we demonstrate here that theconcentration of measure phenomenon allows one to fully characterize themain quantities at the heart of the single-layer regression problem at hand.In terms of practical applications, our findings shed light on the alreadyincompletely understood extreme learning machines which have proved ex-tremely efficient in handling machine learning problems involving large tohuge datasets (Huang et al., 2012; Cambria et al., 2015) at a computation-ally affordable cost. But our objective is also to pave to path to the un-derstanding of more involved neural network structures, featuring notablymultiple layers and some steps of learning by means of backpropagation ofthe error.Our main contribution is twofold. From a theoretical perspective, we firstobtain a key lemma, Lemma 1, on the concentration of quadratic forms ofthe type σ ( w T X ) Aσ ( X T w ) where w = ϕ ( ˜ w ), ˜ w ∼ N (0 , I p ), with ϕ : R → R and σ : R → R Lipschitz functions, and X ∈ R p × T , A ∈ R n × n are deter-ministic matrices. This non-asymptotic result (valid for all n, p, T ) is thenexploited under a simultaneous growth regime for n, p, T and boundednessconditions on k X k and k A k to obtain, in Theorem 1, a deterministic ap-proximation ¯ Q of the resolvent E[ Q ], where Q = ( T Σ T Σ + γI T ) − , γ > σ ( W X ), for some W = ϕ ( ˜ W ), ˜ W ∈ R n × p having independent N (0 , T Σ T Σ, such as its limitingspectral measure in Theorem 2.Application-wise, the theoretical findings are an important preliminarystep for the understanding and improvement of various statistical meth-ods based on random features in the large dimensional regime. Specifically,here, we consider the question of linear ridge-regression from random featuremaps, which coincides with the aforementioned single hidden-layer randomneural network known as extreme learning machine. We show that, undermild conditions, both the training E train and testing E test mean-square er-rors, respectively corresponding to the regression errors on known input-output pairs ( x , y ) , . . . , ( x T , y T ) (with x i ∈ R p , y i ∈ R d ) and unknownpairings (ˆ x , ˆ y ) , . . . , (ˆ x ˆ T , ˆ y ˆ T ), almost surely converge to deterministic limit-ing values as n, p, T grow large at the same rate (while d is kept constant)for every fixed ridge-regression parameter γ >
0. Simulations on real image imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017
C. LOUART ET AL. datasets are provided that corroborate our results.These findings provide new insights into the roles played by the acti-vation function σ ( · ) and the random distribution of the entries of W inrandom feature maps as well as by the ridge-regression parameter γ in theneural network performance. We notably exhibit and prove some peculiarbehaviors, such as the impossibility for the network to carry out elementaryGaussian mixture classification tasks, when either the activation function orthe random weights distribution are ill chosen.Besides, for the practitioner, the theoretical formulas retrieved in thiswork allow for a fast offline tuning of the aforementioned hyperparametersof the neural network, notably when T is not too large compared to p .The graphical results provided in the course of the article were particularlyobtained within a 100- to 500-fold gain in computation time between theoryand simulations.The remainder of the article is structured as follows: in Section 2, weintroduce the mathematical model of the system under investigation. Ourmain results are then described and discussed in Section 3, the proofs ofwhich are deferred to Section 5. Section 4 discusses our main findings. Thearticle closes on concluding remarks on envisioned extensions of the presentwork in Section 6. The appendix provides some intermediary lemmas ofconstant use throughout the proof section. Reproducibility:
Python 3 codes used to produce the results of Section 4are available at https://github.com/Zhenyu-LIAO/RMT4ELM
Notations:
The norm k · k is understood as the Euclidean norm for vectorsand the operator norm for matrices, while the norm k · k F is the Frobe-nius norm for matrices. All vectors in the article are understood as columnvectors.
2. System Model.
We consider a ridge-regression task on random fea-ture maps defined as follows. Each input data x ∈ R p is multiplied by a ma-trix W ∈ R n × p ; a non-linear function σ : R → R is then applied entry-wiseto the vector W x , thereby providing a set of n random features σ ( W x ) ∈ R n for each datum x ∈ R p . The output z ∈ R d of the linear regression is theinner product z = β T σ ( W x ) for some matrix β ∈ R n × d to be designed.From a neural network viewpoint, the n neurons of the network are thevirtual units operating the mapping W i · x σ ( W i · x ) ( W i · being the i -th imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS row of W ), for 1 ≤ i ≤ n . The neural network then operates in two phases:a training phase where the regression matrix β is learned based on a knowninput-output dataset pair ( X, Y ) and a testing phase where, for β now fixed,the network operates on a new input dataset ˆ X with corresponding unknownoutput ˆ Y .During the training phase, based on a set of known input X = [ x , . . . , x T ] ∈ R p × T and output Y = [ y , . . . , y T ] ∈ R d × T datasets, the matrix β is chosenso as to minimize the mean square error T P Ti =1 k z i − y i k + γ k β k F , where z i = β T σ ( W x i ) and γ > β , thisleads to the explicit ridge-regressor β = 1 T Σ (cid:18) T Σ T Σ + γI T (cid:19) − Y T where we defined Σ ≡ σ ( W X ). This follows from differentiating the meansquare error along β to obtain 0 = γβ + T P Ti =1 σ ( W x i )( β T σ ( W x i ) − y i ) T ,so that ( T ΣΣ T + γI n ) β = T Σ Y T which, along with ( T ΣΣ T + γI n ) − Σ =Σ( T Σ T Σ + γI T ) − , gives the result.In the remainder, we will also denote Q ≡ (cid:18) T Σ T Σ + γI T (cid:19) − the resolvent of T Σ T Σ. The matrix Q naturally appears as a key quantity inthe performance analysis of the neural network. Notably, the mean-squareerror E train on the training dataset X is given by E train = 1 T (cid:13)(cid:13)(cid:13) Y T − Σ T β (cid:13)(cid:13)(cid:13) F = γ T tr Y T Y Q . (1)Under the growth rate assumptions on n, p, T taken below, it shall appearthat the random variable E train concentrates around its mean, letting thenappear E[ Q ] as a central object in the asymptotic evaluation of E train .The testing phase of the neural network is more interesting in practiceas it unveils the actual performance of neural networks. For a test datasetˆ X ∈ R p × ˆ T of length ˆ T , with unknown output ˆ Y ∈ R d × ˆ T , the test mean-square error is defined by E test = 1 T (cid:13)(cid:13)(cid:13) ˆ Y T − ˆΣ T β (cid:13)(cid:13)(cid:13) F imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. where ˆΣ = σ ( W ˆ X ) and β is the same as used in (1) (and thus only dependson ( X, Y ) and γ ). One of the key questions in the analysis of such an ele-mentary neural network lies in the determination of γ which minimizes E test (and is thus said to have good generalization performance). Notably, small γ values are known to reduce E train but to induce the popular overfitting is-sue which generally increases E test , while large γ values engender both largevalues for E train and E test .From a mathematical standpoint though, the study of E test brings for-ward some technical difficulties that do not allow for as a simple treatmentthrough the present concentration of measure methodology as the study of E train . 
Nonetheless, the analysis of E train allows at least for heuristic ap-proaches to become available, which we shall exploit to propose an asymp-totic deterministic approximation for E test .From a technical standpoint, we shall make the following set of assump-tions on the mapping x σ ( W x ). Assumption W ) . The matrix W is defined by W = ϕ ( ˜ W ) (understood entry-wise), where ˜ W has independent and identically distributed N (0 , entries and ϕ ( · ) is λ ϕ -Lipschitz. For a = ϕ ( b ) ∈ R ℓ , ℓ ≥
1, with b ∼ N (0 , I ℓ ), we shall subsequently denote a ∼ N ϕ (0 , I ℓ ).Under the notations of Assumption 1, we have in particular W ij ∼ N (0 , ϕ ( t ) = t and W ij ∼ U ( − ,
1) (the uniform distribution on [ − , ϕ ( t ) = − √ π R ∞ t e − x dx ( ϕ is here a p /π -Lipschitz map).We further need the following regularity condition on the function σ . Assumption σ ) . The function σ is Lipschitz continuouswith parameter λ σ . This assumption holds for many of the activation functions traditionallyconsidered in neural networks, such as sigmoid functions, the rectified linearunit σ ( t ) = max( t, imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS Assumption . As n → ∞ , < lim inf n min { p/n, T /n } ≤ lim sup n max { p/n, T /n } < ∞ while γ, λ σ , λ ϕ > and d are kept constant. In addition, lim sup n k X k < ∞ lim sup n max ij | Y ij | < ∞ .
3. Main Results.
Main technical results and training performance.
As a standard pre-liminary step in the asymptotic random matrix analysis of the expectationE[ Q ] of the resolvent Q = ( T Σ T Σ+ γI T ) − , a convergence of quadratic formsbased on the row vectors of Σ is necessary (see e.g., (Mar˘cenko and Pastur,1967; Silverstein and Bai, 1995)). Such results are usually obtained by ex-ploiting the independence (or linear dependence) in the vector entries. Thisnot being the case here, as the entries of the vector σ ( X T w ) are in gen-eral not independent, we resort to a concentration of measure approach, asadvocated in (El Karoui, 2009). The following lemma, stated here in a non-asymptotic random matrix regime (that is, without necessarily resorting toAssumption 3), and thus of independent interest, provides this concentrationresult. For this lemma, we need first to define the following key matrixΦ = E h σ ( w T X ) T σ ( w T X ) i (2)of size T × T , where w ∼ N ϕ (0 , I p ). Lemma . Let Assumptions 1–2hold. Let also A ∈ R T × T such that k A k ≤ and, for X ∈ R p × T and w ∼ N ϕ (0 , I p ) , define the random vector σ ≡ σ ( w T X ) T ∈ R T . Then, P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cT k X k λ ϕλ σ min (cid:18) t t ,t (cid:19) for t ≡ | σ (0) | + λ ϕ λ σ k X k q pT and C, c > independent of all other param-eters. In particular, under the additional Assumption 3, P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cn min( t,t ) for some C, c > . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Note that this lemma partially extends concentration of measure resultsinvolving quadratic forms, see e.g., (Rudelson et al., 2013, Theorem 1.1), tonon-linear vectors.With this result in place, the standard resolvent approaches of randommatrix theory apply, providing our main theoretical finding as follows.
Theorem Q ]) . Let Assumptions 1–3hold and define ¯ Q as ¯ Q ≡ (cid:18) nT Φ1 + δ + γI T (cid:19) − where δ is implicitly defined as the unique positive solution to δ = T tr Φ ¯ Q .Then, for all ε > , there exists c > such that (cid:13)(cid:13) E[ Q ] − ¯ Q (cid:13)(cid:13) ≤ cn − + ε . As a corollary of Theorem 1 along with a concentration argument on T tr Q , we have the following result on the spectral measure of T Σ T Σ, whichmay be seen as a non-linear extension of (Silverstein and Bai, 1995) forwhich σ ( t ) = t . Theorem T Σ T Σ) . Let Assumptions 1–3 hold and, for λ , . . . , λ T the eigenvalues of T Σ T Σ , define µ n = T P Ti =1 δ λ i .Then, for every bounded continuous function f , with probability one Z f dµ n − Z f d ¯ µ n → . where ¯ µ n is the measure defined through its Stieltjes transform m ¯ µ n ( z ) ≡ R ( t − z ) − d ¯ µ n ( t ) given, for z ∈ { w ∈ C , ℑ [ w ] > } , by m ¯ µ n ( z ) = 1 T tr (cid:18) nT Φ1 + δ z − zI T (cid:19) − with δ z the unique solution in { w ∈ C , ℑ [ w ] > } of δ z = 1 T tr Φ (cid:18) nT Φ1 + δ z − zI T (cid:19) − . Note that ¯ µ n has a well-known form, already met in early random ma-trix works (e.g., (Silverstein and Bai, 1995)) on sample covariance matrixmodels. Notably, ¯ µ n is also the deterministic equivalent of the empirical imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS spectral measure of T P T W T W P for any deterministic matrix P ∈ R p × T such that P T P = Φ. As such, to some extent, the results above provide aconsistent asymptotic linearization of T Σ T Σ. From standard spiked modelarguments (see e.g., (Benaych-Georges and Nadakuditi, 2012)), the result k E[ Q ] − ¯ Q k → T Σ T Σ (if any) behave similarly to those of T P T W T W P ,a remark that has fundamental importance in the neural network perfor-mance understanding.However, as shall be shown in Section 3.3, and contrary to empiricalcovariance matrix models of the type P T W T W P , Φ explicitly depends onthe distribution of W ij (that is, beyond its first two moments). Thus, theaforementioned linearization of T Σ T Σ, and subsequently the deterministicequivalent for µ n , are not universal with respect to the distribution of zero-mean unit variance W ij . This is in striking contrast to the many linearrandom matrix models studied to date which often exhibit such universalbehaviors. This property too will have deep consequences in the performanceof neural networks as shall be shown through Figure 3 in Section 4 for anexample where inappropriate choices for the law of W lead to network failureto fulfill the regression task.For convenience in the following, letting δ and Φ be defined as in Theo-rem 1, we shall denote Ψ = nT Φ1 + δ . (3)Theorem 1 provides the central step in the evaluation of E train , for whichnot only E[ Q ] but also E[ Q ] needs be estimated. This last ingredient isprovided in the following proposition. Proposition
QAQ ]) . Let Assumptions 1–3 hold and A ∈ R T × T be a symmetric non-negative definite matrix which iseither Φ or a matrix with uniformly bounded operator norm (with respect to T ). Then, for all ε > , there exists c > such that, for all n , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E[ QAQ ] − ¯ QA ¯ Q + n tr (cid:0) Ψ ¯ QA ¯ Q (cid:1) − n tr Ψ ¯ Q ¯ Q Ψ ¯ Q !(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ cn − + ε . As an immediate consequence of Proposition 1, we have the followingresult on the training mean-square error of single-layer random neural net-works. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Theorem . Let Assumptions 1–3 hold and ¯ Q , Ψ be defined as in Theorem 1 and (3) . Then, for all ε > , n − ε (cid:0) E train − ¯ E train (cid:1) → almost surely, where E train = 1 T (cid:13)(cid:13)(cid:13) Y T − Σ T β (cid:13)(cid:13)(cid:13) F = γ T tr Y T Y Q ¯ E train = γ T tr Y T Y ¯ Q " n tr Ψ ¯ Q − n tr(Ψ ¯ Q ) Ψ + I T ¯ Q. Since ¯ Q and Φ share the same orthogonal eigenvector basis, it appearsthat E train depends on the alignment between the right singular vectors of Y and the eigenvectors of Φ, with weighting coefficients (cid:18) γλ i + γ (cid:19) λ i n P Tj =1 λ j ( λ j + γ ) − − n P Tj =1 λ j ( λ j + γ ) − ! , ≤ i ≤ T where we denoted λ i = λ i (Ψ), 1 ≤ i ≤ T , the eigenvalues of Ψ (whichdepend on γ through λ i (Ψ) = nT (1+ δ ) λ i (Φ)). If lim inf n n/T >
1, it is easilyseen that δ → γ →
0, in which case E train → n n/T < δ → ∞ as γ → E train consequently does not have a simple limit (see Section 4.3for more discussion on this aspect).Theorem 3 is also reminiscent of applied random matrix works on empiri-cal covariance matrix models, such as (Bai and Silverstein, 2007; Kammoun et al.,2009), then further emphasizing the strong connection between the non-linear matrix σ ( W X ) and its linear counterpart W Φ .As a side note, observe that, to obtain Theorem 3, we could have usedthe fact that tr Y T Y Q = − ∂∂γ tr Y T Y Q which, along with some analyticityarguments (for instance when extending the definition of Q = Q ( γ ) to Q ( z ), z ∈ C ), would have directly ensured that ∂ ¯ Q∂γ is an asymptotic equivalentfor − E[ Q ], without the need for the explicit derivation of Proposition 1.Nonetheless, as shall appear subsequently, Proposition 1 is also a proxy tothe asymptotic analysis of E test . Besides, the technical proof of Proposition 1quite interestingly showcases the strength of the concentration of measuretools under study here. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS Testing performance.
As previously mentioned, harnessing the asymp-totic testing performance E test seems, to the best of the authors’ knowledge,out of current reach with the sole concentration of measure arguments usedfor the proof of the previous main results. Nonetheless, if not fully effec-tive, these arguments allow for an intuitive derivation of a deterministicequivalent for E test , which is strongly supported by simulation results. Weprovide this result below under the form of a yet unproven claim, a heuristicderivation of which is provided at the end of Section 5.To introduce this result, let ˆ X = [ˆ x , . . . , ˆ x ˆ T ] ∈ R p × ˆ T be a set of inputdata with corresponding output ˆ Y = [ˆ y , . . . , ˆ y ˆ T ] ∈ R d × ˆ T . We also defineˆΣ = σ ( W ˆ X ) ∈ R p × ˆ T . We assume that ˆ X and ˆ Y satisfy the same growthrate conditions as X and Y in Assumption 3. To introduce our claim, weneed to extend the definition of Φ in (2) and Ψ in (3) to the followingnotations: for all pair of matrices ( A, B ) of appropriate dimensions,Φ AB = E h σ ( w T A ) T σ ( w T B ) i Ψ AB = nT Φ AB δ where w ∼ N ϕ (0 , I p ). In particular, Φ = Φ XX and Ψ = Ψ XX .With these notations in place, we are in position to state our claimedresult. Conjecture E test ) . Let Assumptions 1–2 hold and ˆ X, ˆ Y satisfy the same conditions as X, Y in Assumption 3. Then,for all ε > , n − ε (cid:0) E test − ¯ E test (cid:1) → almost surely, where E test = 1ˆ T (cid:13)(cid:13)(cid:13) ˆ Y T − ˆΣ T β (cid:13)(cid:13)(cid:13) F ¯ E test = 1ˆ T (cid:13)(cid:13)(cid:13) ˆ Y T − Ψ T X ˆ X ¯ QY T (cid:13)(cid:13)(cid:13) F + n tr Y T Y ¯ Q Ψ ¯ Q − n tr(Ψ ¯ Q ) (cid:20) T tr Ψ ˆ X ˆ X − T tr( I T + γ ¯ Q )(Ψ X ˆ X Ψ ˆ XX ¯ Q ) (cid:21) . While not immediate at first sight, one can confirm (using notably therelation Ψ ¯ Q + γ ¯ Q = I T ) that, for ( ˆ X, ˆ Y ) = ( X, Y ), ¯ E train = ¯ E test , as ex-pected. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
In order to evaluate practically the results of Theorem 3 and Conjecture 1,it is a first step to be capable of estimating the values of Φ AB for various σ ( · ) activation functions of practical interest. Such results, which call forcompletely different mathematical tools (mostly based on integration tricks),are provided in the subsequent section.3.3. Evaluation of Φ AB . The evaluation of Φ AB = E[ σ ( w T A ) T σ ( w T B )]for arbitrary matrices A, B naturally boils down to the evaluation of itsindividual entries and thus to the calculus, for arbitrary vectors a, b ∈ R p ,of Φ ab ≡ E[ σ ( w T a ) σ ( w T b )] = (2 π ) − p Z σ ( ϕ ( ˜ w ) T a ) σ ( ϕ ( ˜ w ) T b ) e − k ˜ w k d ˜ w. (4)The evaluation of (4) can be obtained through various integration tricksfor a wide family of mappings ϕ ( · ) and activation functions σ ( · ). The mostpopular activation functions in neural networks are sigmoid functions, suchas σ ( t ) = erf( t ) ≡ √ π R t e − u du , as well as the so-called rectified linear unit(ReLU) defined by σ ( t ) = max( t,
0) which has been recently popularizedas a result of its robust behavior in deep neural networks. In physical ar-tificial neural networks implemented using light projections, σ ( t ) = | t | isthe preferred choice. Note that all aforementioned functions are Lipschitzcontinuous and therefore in accordance with Assumption 2.Despite their not abiding by the prescription of Assumptions 1 and 2, webelieve that the results of this article could be extended to more generalsettings, as discussed in Section 4. In particular, since the key ingredient inthe proof of all our results is that the vector σ ( w T X ) follows a concentrationof measure phenomenon, induced by the Gaussianity of ˜ w (if w = ϕ ( ˜ w )),the Lipschitz character of σ and the norm boundedness of X , it is likely,although not necessarily simple to prove, that σ ( w T X ) may still concentrateunder relaxed assumptions. This is likely the case for more generic vectors w than N ϕ (0 , I p ) as well as for a larger class of activation functions, such aspolynomial or piece-wise Lipschitz continuous functions.In anticipation of these likely generalizations, we provide in Table 1 thevalues of Φ ab for w ∼ N (0 , I p ) (i.e., for ϕ ( t ) = t ) and for a set of functions σ ( · ) not necessarily satisfying Assumption 2. Denoting Φ ≡ Φ( σ ( t )), it isinteresting to remark that, since arccos( x ) = − arcsin( x )+ π , Φ(max( t, t ) + Φ( | t | ). Also, [Φ(cos( t )) + Φ(sin( t ))] a,b = exp( − k a − b k ), a resultreminiscent of (Rahimi and Recht, 2007). Finally, note that Φ(erf( κt )) → It is in particular not difficult to prove, based on our framework, that, as n/T → ∞ ,a random neural network composed of n/ σ ( t ) = cos( t ) imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS σ ( t ) Φ ab t a T b max( t, π k a kk b k (cid:16) ∠ ( a, b ) acos( − ∠ ( a, b )) + p − ∠ ( a, b ) (cid:17) | t | π k a kk b k (cid:16) ∠ ( a, b ) asin( ∠ ( a, b )) + p − ∠ ( a, b ) (cid:17) erf( t ) π asin (cid:18) a T b √ (1+2 k a k )(1+2 k b k ) (cid:19) { t> } − π acos( ∠ ( a, b ))sign( t ) π asin( ∠ ( a, b ))cos( t ) exp( − ( k a k + k b k )) cosh( a T b )sin( t ) exp( − ( k a k + k b k )) sinh( a T b ). Table 1
Values of Φ ab for w ∼ N (0 , I p ) , ∠ ( a, b ) ≡ a T b k a kk b k . Φ(sign( t )) as κ → ∞ , inducing that the extension by continuity of erf( κt )to sign( t ) propagates to their associated kernels.In addition to these results for w ∼ N (0 , I p ), we also evaluated Φ ab =E[ σ ( w T a ) σ ( w T b )] for σ ( t ) = ζ t + ζ t + ζ and w ∈ R p a vector of indepen-dent and identically distributed entries of zero mean and moments of order k equal to m k (so m = 0); w is not restricted here to satisfy w ∼ N ϕ (0 , I p ).In this case, we findΦ ab = ζ h m (cid:16) a T b ) + k a k k b k (cid:17) + ( m − m )( a ) T ( b ) i + ζ m a T b + ζ ζ m h ( a ) T b + a T ( b ) i + ζ ζ m (cid:2) k a k + k b k (cid:3) + ζ (5)where we defined ( a ) ≡ [ a , . . . , a p ] T .It is already interesting to remark that, while classical random matrixmodels exhibit a well-known universality property — in the sense that theirlimiting spectral distribution is independent of the moments (higher thantwo) of the entries of the involved random matrix, here W —, for σ ( · ) a poly-nomial of order two, Φ and thus µ n strongly depend on E[ W kij ] for k = 3 , W ij are Bernoulli withzero mean and unit variance but succeed with possibly high performance ifthe W ij are standard Gaussian (which is explained by the disappearance ornot of the term ( a T b ) and ( a ) T ( b ) in (5) if m = m ). and n/ σ ( t ) = sin( t ) implements a Gaussian differencekernel. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
4. Practical Outcomes.
We discuss in this section the outcomes ofour main results in terms of neural network application. The technical dis-cussions on Theorem 1 and Proposition 1 will be made in the course of theirrespective proofs in Section 5.4.1.
Simulation Results.
We first provide in this section a simulation cor-roborating the findings of Theorem 3 and suggesting the validity of Conjec-ture 1. To this end, we consider the task of classifying the popular MNISTimage database (LeCun, Cortes and Burges, 1998), composed of grayscalehandwritten digits of size 28 ×
28, with a neural network composed of n = 512units and standard Gaussian W . We represent here each image as a p = 784-size vector; 1 024 images of sevens and 1 024 images of nines were extractedfrom the database and were evenly split in 512 training and test images,respectively. The database images were jointly centered and scaled so to fallclose to the setting of Assumption 3 on X and ˆ X (an admissible preprocess-ing intervention). The columns of the output values Y and ˆ Y were takenas unidimensional ( d = 1) with Y j , ˆ Y j ∈ {− , } depending on the imageclass. Figure 1 displays the simulated (averaged over 100 realizations of W )versus theoretical values of E train and E test for three choices of Lipschitzcontinuous functions σ ( · ), as a function of γ .Note that a perfect match between theory and practice is observed, forboth E train and E test , which is a strong indicator of both the validity ofConjecture 1 and the adequacy of Assumption 3 to the MNIST dataset.We subsequently provide in Figure 2 the comparison between theoreticalformulas and practical simulations for a set of functions σ ( · ) which do notsatisfy Assumption 2, i.e., either discontinuous or non-Lipschitz maps. Thecloseness between both sets of curves is again remarkably good, althoughto a lesser extent than for the Lipschitz continuous functions of Figure 1.Also, the achieved performances are generally worse than those observed inFigure 1.It should be noted that the performance estimates provided by Theorem 3and Conjecture 1 can be efficiently implemented at low computational costin practice. Indeed, by diagonalizing Φ (which is a marginal cost independentof γ ), ¯ E train can be computed for all γ through mere vector operations; simi-larly ¯ E test is obtained by the marginal cost of a basis change of Φ ˆ XX and thematrix product Φ X ˆ X Φ ˆ XX , all remaining operations being accessible throughvector operations. As a consequence, the simulation durations to generatethe aforementioned theoretical curves using the linked Python script were imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS − − − − − σ ( t ) = max( t, σ ( t ) = erf(t) σ ( t ) = t σ ( t ) = | t | γ M S E ¯ E train ¯ E test E train E test Fig 1 . Neural network performance for Lipschitz continuous σ ( · ) , W ij ∼ N (0 , , as afunction of γ , for 2-class MNIST data (sevens, nines), n = 512 , T = ˆ T = 1024 , p = 784 . found to be 100 to 500 times faster than to generate the simulated net-work performances. Beyond their theoretical interest, the provided formulastherefore allow for an efficient offline tuning of the network hyperparameters,notably the choice of an appropriate value for the ridge-regression parameter γ . 4.2. The underlying kernel.
Theorem 1 and the subsequent theoreticalfindings importantly reveal that the neural network performances are di-rectly related to the Gram matrix Φ, which acts as a deterministic ker-nel on the dataset X . This is in fact a well-known result found e.g., in(Williams, 1998) where it is shown that, as n → ∞ alone, the neural net-work behaves as a mere kernel operator (this observation is retrieved herein the subsequent Section 4.3). This remark was then put at an advantagein (Rahimi and Recht, 2007) and subsequent works, where random featuremaps of the type x σ ( W x ) are proposed as a computationally efficientproxy to evaluate kernels ( x, y ) Φ( x, y ).As discussed previously, the formulas for ¯ E train and ¯ E test suggest thatgood performances are achieved if the dominant eigenvectors of Φ show agood alignment to Y (and similarly for Φ X ˆ X and ˆ Y ). This naturally drives us imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.10 − − − − − σ ( t ) = sign( t ) σ ( t ) = 1 { t> } σ ( t ) = 1 − t γ M S E ¯ E train ¯ E test E train E test Fig 2 . Neural network performance for σ ( · ) either discontinuous or non Lipschitz, W ij ∼N (0 , , as a function of γ , for 2-class MNIST data (sevens, nines), n = 512 , T = ˆ T =1024 , p = 784 . to finding a priori simple regression tasks where ill-choices of Φ may annihi-late the neural network performance. Following recent works on the asymp-totic performance analysis of kernel methods for Gaussian mixture models(Couillet and Benaych-Georges, 2016; Zhenyu Liao, 2017; Mai and Couillet,2017) and (Couillet and Kammoun, 2016), we describe here such a task.Let x , . . . , x T/ ∼ N (0 , p C ) and x T/ , . . . , x T ∼ N (0 , p C ) where C and C are such that tr C = tr C , k C k , k C k are bounded, and tr( C − C ) = O ( p ). Accordingly, y , . . . , y T/ = − y T/ , . . . , y T = 1. Itis proved in the aforementioned articles that, under these conditions, it istheoretically possible, in the large p, T limit, to classify the data using akernel least-square support vector machine (that is, with a training dataset)or with a kernel spectral clustering method (that is, in a completely unsu-pervised manner) with a non-trivial limiting error probability (i.e., neitherzero nor one). This scenario has the interesting feature that x T i x j → i = j while k x i k − p tr( C + C ) →
0, almost surely,irrespective of the class of x i , thereby allowing for a Taylor expansion of thenon-linear kernels as early proposed in (El Karoui, 2010).Transposed to our present setting, the aforementioned Taylor expan- imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS sion allows for a consistent approximation ˜Φ of Φ by an information-plus-noise (spiked) random matrix model (see e.g., (Loubaton and Vallet, 2010;Benaych-Georges and Nadakuditi, 2012)). In the present Gaussian mixturecontext, it is shown in (Couillet and Benaych-Georges, 2016) that data clas-sification is (asymptotically at least) only possible if ˜Φ ij explicitly con-tains the quadratic term ( x T i x j ) (or combinations of ( x i ) T x j , ( x j ) T x i , and( x i ) T ( x j )). In particular, letting a, b ∼ N (0 , C i ) with i = 1 ,
2, it is easilyseen from Table 1 that only max( t, | t | , and cos( t ) can realize the task.Indeed, we have the following Taylor expansions around x = 0:asin( x ) = x + O ( x )sinh( x ) = x + O ( x )acos( x ) = π − x + O ( x )cosh( x ) = 1 + x O ( x ) x acos( − x ) + p − x = 1 + πx x O ( x ) x asin( x ) + p − x = 1 + x O ( x )where only the last three functions (only found in the expression of Φ ab corresponding to σ ( t ) = max( t, | t | , or cos( t )) exhibit a quadratic term.More surprisingly maybe, recalling now Equation (5) which considers non-necessarily Gaussian W ij with moments m k of order k , a more refined anal-ysis shows that the aforementioned Gaussian mixture classification task willfail if m = 0 and m = m , so for instance for W ij ∈ {− , } Bernoulliwith parameter . The performance comparison of this scenario is shownin the top part of Figure 3 for σ ( t ) = − t + 1 and C = diag( I p/ , I p/ ), C = diag(4 I p/ , I p/ ), for W ij ∼ N (0 ,
1) and W ij ∼ Bern (that is, Bernoulli { ( − , ) , (1 , ) } ). The choice of σ ( t ) = ζ t + ζ t + ζ with ζ = 0 is mo-tivated by (Couillet and Benaych-Georges, 2016; Couillet and Kammoun,2016) where it is shown, in a somewhat different setting, that this choiceis optimal for class recovery. Note that, while the test performances areoverall rather weak in this setting, for W ij ∼ N (0 , E test drops below one(the amplitude of the ˆ Y ij ), thereby indicating that non-trivial classificationis performed. This is not so for the Bernoulli W ij ∼ Bern case where E test issystematically greater than | ˆ Y ij | = 1. This is theoretically explained by thefact that, from Equation (5), Φ ij contains structural information about thedata classes through the term 2 m ( x T i x j ) + ( m − m )( x i ) T ( x j ) which in-duces an information-plus-noise model for Φ as long as 2 m +( m − m ) = 0, imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. i.e., m = m (see (Couillet and Benaych-Georges, 2016) for details). Thisis visually seen in the bottom part of Figure 3 where the Gaussian scenariopresents an isolated eigenvalue for Φ with corresponding structured eigen-vector, which is not the case of the Bernoulli scenario. To complete thisdiscussion, it appears relevant in the present setting to choose W ij in such away that m − m is far from zero, thus suggesting the interest of heavy-taileddistributions. To confirm this prediction, Figure 3 additionally displays theperformance achieved and the spectrum of Φ observed for W ij ∼ Stud, thatis, following a Student-t distribution with degree of freedom ν = 7 normal-ized to unit variance (in this case m = 1 and m = 5). Figure 3 confirmsthe large superiority of this choice over the Gaussian case (note nonethelessthe slight inaccuracy of our theoretical formulas in this case, which is likelydue to too small values of p, n, T to accommodate W ij with higher ordermoments, an observation which is confirmed in simulations when letting ν be even smaller).4.3. Limiting cases.
We have suggested that Φ contains, in its dominanteigenmodes, all the usable information describing X . In the Gaussian mix-ture example above, it was notably shown that Φ may completely fail tocontain this information, resulting in the impossibility to perform a classifi-cation task, even if one were to take infinitely many neurons in the network.For Φ containing useful information about X , it is intuitive to expect thatboth inf γ ¯ E train and inf γ ¯ E test become smaller as n/T and n/p become large.It is in fact easy to see that, if Φ is invertible (which is likely to occur inmost cases if lim inf n T /p > n →∞ ¯ E train = 0lim n →∞ ¯ E test − T (cid:13)(cid:13)(cid:13) ˆ Y T − Φ ˆ XX Φ − Y T (cid:13)(cid:13)(cid:13) F = 0and we fall back on the performance of a classical kernel regression. It isinteresting in particular to note that, as the number of neurons n becomeslarge, the effect of γ on E test flattens out. Therefore, a smart choice of γ is only relevant for small (and thus computationally more efficient) neuronlayers. This observation is depicted in Figure 4 where it is made clear thata growth of n reduces E train to zero while E test saturates to a non-zero limitwhich becomes increasingly irrespective of γ . Note additionally the interest-ing phenomenon occurring for n ≤ T where too small values of γ induceimportant performance losses, thereby suggesting a strong importance ofproper choices of γ in this regime. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS − − − − − . − . − . . . W ij ∼ N (0 , W ij ∼ Bern W ij ∼ Stud γ M S E ¯ E train ¯ E test E train E test closespike no spike farspike W ij ∼ N (0 , W ij ∼ Bern W ij ∼ Stud
Fig 3 . (Top) Neural network performance for σ ( t ) = − t + 1 , with different W ij , for a -class Gaussian mixture model (see details in text), n = 512 , T = ˆ T = 1024 , p = 256 .(Bottom) Spectra and second eigenvector of Φ for different W ij (first eigenvalues are oforder n and not shown; associated eigenvectors are provably non informative). Of course, practical interest lies precisely in situations where n is nottoo large. We may thus subsequently assume that lim sup n n/T <
1. Inthis case, as suggested by Figures 1–2, the mean-square error performancesachieved as γ → σ ( · ) for imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.10 − − − − − n = 256 → γ M S E ¯ E train ¯ E test E train E test Fig 4 . Neural network performance for growing n ( , , , , ) as afunction of γ , σ ( t ) = max( t, ; 2-class MNIST data (sevens, nines), T = ˆ T = 1024 , p = 784 . Limiting ( n = ∞ ) ¯ E test shown in thick black line. optimally chosen γ . It is important for this study to differentiate betweencases where r ≡ rank(Φ) is smaller or greater than n . Indeed, observe that,with the spectral decomposition Φ = U r Λ r U T r for Λ r ∈ R r × r diagonal and U r ∈ R T × r , δ = 1 T tr Φ (cid:18) nT Φ1 + δ + γI T (cid:19) − = 1 T tr Λ r (cid:18) nT Λ r δ + γI r (cid:19) − which satisfies, as γ → ( δ → rn − r , r < nγδ → ∆ = T tr Φ (cid:0) nT Φ∆ + I T (cid:1) − , r ≥ n. A phase transition therefore exists whereby δ assumes a finite positive valuein the small γ limit if r/n <
1, or scales like 1 /γ otherwise.As a consequence, if r < n , as γ →
0, Ψ → nT (1 − rn )Φ and ¯ Q ∼ Tn − r U r Λ − r U T r + γ V r V T r , where V r ∈ R T × ( n − r ) is any matrix such that [ U r V r ]is orthogonal, so that Ψ ¯ Q → U r U T r and Ψ ¯ Q → U r Λ − r U T r ; and thus, imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS ¯ E train → T tr Y V r V T r Y T = T k Y V r k F , which states that the residual train-ing error corresponds to the energy of Y not captured by the space spannedby Φ. Since E train is an increasing function of γ , so is ¯ E train (at least for alllarge n ) and thus T k Y V r k F corresponds to the lowest achievable asymptotictraining error.If instead r > n (which is the most likely outcome in practice), as γ → Q ∼ γ ( nT Φ∆ + I T ) − and thus¯ E train γ → −→ T tr Y Q ∆ " n tr Ψ ∆ Q − n tr(Ψ ∆ Q ∆ ) Ψ ∆ + I T Q ∆ Y T where Ψ ∆ = nT Φ∆ and Q ∆ = ( nT Φ∆ + I T ) − .These results suggest that neural networks should be designed both ina way that reduces the rank of Φ while maintaining a strong alignmentbetween the dominant eigenvectors of Φ and the output matrix Y .Interestingly, if X is assumed as above to be extracted from a Gaussianmixture and that Y ∈ R × T is a classification vector with Y j ∈ {− , } ,then the tools proposed in (Couillet and Benaych-Georges, 2016) (related tospike random matrix analysis) allow for an explicit evaluation of the afore-mentioned limits as n, p, T grow large. This analysis is however cumbersomeand outside the scope of the present work.
5. Proof of the Main Results.
In the remainder, we shall use exten-sively the following notations:Σ = σ ( W X ) = σ T ... σ T n , W = w T ... w T n i.e., σ i = σ ( w T i X ) T . Also, we shall define Σ − i ∈ R ( n − × T the matrix Σ with i -th row removed, and correspondingly Q − i = (cid:18) T Σ T Σ − T σ i σ T i + γI T (cid:19) − . Finally, because of exchangeability, it shall often be convenient to work withthe generic random vector w ∼ N ϕ (0 , I T ), the random vector σ distributedas any of the σ i ’s, the random matrix Σ − distributed as any of the Σ − i ’s,and with the random matrix Q − distributed as any of the Q − i ’s. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Concentration Results on Σ . Our first results provide concentra-tion of measure properties on functionals of Σ. These results unfold fromthe following concentration inequality for Lipschitz applications of a Gaus-sian vector; see e.g., (Ledoux, 2005, Corollary 2.6, Propositions 1.3, 1.8) or(Tao, 2012, Theorem 2.1.12). For d ∈ N , consider µ the canonical Gaus-sian probability on R d defined through its density dµ ( w ) = (2 π ) − d e − k w k and f : R d → R a λ f -Lipschitz function. Then, we have the said normalconcentration µ (cid:18)(cid:26)(cid:12)(cid:12)(cid:12)(cid:12) f − Z f dµ (cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:27)(cid:19) ≤ Ce − c t λ f (6)where C, c > d and λ f . As a corollary (see e.g., (Ledoux,2005, Proposition 1.10)), for every k ≥ "(cid:12)(cid:12)(cid:12)(cid:12) f − Z f dµ (cid:12)(cid:12)(cid:12)(cid:12) k ≤ (cid:18) Cλ f √ c (cid:19) k . The main approach to the proof of our results, starting with that of thekey Lemma 1, is as follows: since W ij = ϕ ( ˜ W ij ) with ˜ W ij ∼ N (0 ,
1) and ϕ Lipschitz, the normal concentration of ˜ W transfers to W which furtherinduces a normal concentration of the random vector σ and the matrix Σ,thereby implying that Lipschitz functionals of σ or Σ also concentrate. Aspointed out earlier, these concentration results are used in place for theindependence assumptions (and their multiple consequences on convergenceof random variables) classically exploited in random matrix theory. Notations:
In all subsequent lemmas and proofs, the letters c, c i , C, C i > n and t below) and may be reused from line toline. Additionally, the variable ε > c, c i , C, C i may depend on ε .We start by recalling the first part of the statement of Lemma 1 andsubsequently providing its proof. Lemma . Let Assumptions 1–2hold. Let also A ∈ R T × T such that k A k ≤ and, for X ∈ R p × T and imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS w ∼ N ϕ (0 , I p ) , define the random vector σ ≡ σ ( w T X ) T ∈ R T . Then, P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cT k X k λ ϕλ σ min (cid:18) t t ,t (cid:19) for t ≡ | σ (0) | + λ ϕ λ σ k X k q pT and C, c > independent of all other param-eters. Proof.
The layout of the proof is as follows: since the application w T σ T Aσ is “quadratic” in w and thus not Lipschitz (therefore not allowing fora natural transfer of the concentration of w to T σ T Aσ ), we first prove that √ T k σ k satisfies a concentration inequality, which provides a high probability O (1) bound on √ T k σ k . Conditioning on this event, the map w √ T σ T Aσ can then be shown to be Lipschitz (by isolating one of the σ terms forbounding and the other one for retrieving the Lipschitz character) and, upto an appropriate control of concentration results under conditioning, theresult is obtained.Following this plan, we first provide a concentration inequality for k σ k .To this end, note that the application ψ : R p → R T , ˜ w σ ( ϕ ( ˜ w ) T X ) T isLipschitz with parameter λ ϕ λ σ k X k as the combination of the λ ϕ -Lipschitzfunction ϕ : ˜ w w , the k X k -Lipschitz map R n → R T , w X T w and the λ σ -Lipschitz map R T → R T , Y σ ( Y ). As a Gaussian vector, ˜ w has anormal concentration and so does ψ ( ˜ w ). Since the Euclidean norm R T → R , Y
7→ k Y k is 1-Lipschitz, we thus have immediately by (6) P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13)(cid:13) √ T σ ( w T X ) (cid:13)(cid:13)(cid:13)(cid:13) − E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) √ T σ ( w T X ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:19) ≤ Ce − cTt k X k λ σλ ϕ for some c, C > σ ( w T X ), (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13) σ ( w T X ) (cid:13)(cid:13)(cid:13) − (cid:13)(cid:13)(cid:13) σ (0)1 T T (cid:13)(cid:13)(cid:13)(cid:12)(cid:12)(cid:12) ≤ (cid:13)(cid:13)(cid:13) σ ( w T X ) − σ (0)1 T T (cid:13)(cid:13)(cid:13) ≤ λ σ k w k · k X k so that, by Jensen’s inequality,E (cid:20)(cid:13)(cid:13)(cid:13)(cid:13) √ T σ ( w T X ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:21) ≤ | σ (0) | + λ σ E (cid:20) √ T k w k (cid:21) k X k≤ | σ (0) | + λ σ s E (cid:20) T k w k (cid:21) k X k imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. with E[ k ϕ ( ˜ w ) k ] ≤ λ ϕ E[ k ˜ w k ] = pλ ϕ (since ˜ w ∼ N (0 , I p )). Letting t ≡| σ (0) | + λ σ λ ϕ k X k q pT , we then find P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) √ T σ ( w T X ) (cid:13)(cid:13)(cid:13)(cid:13) ≥ t + t (cid:19) ≤ Ce − cTt λ ϕλ σ k X k which, with the remark t ≥ t ⇒ ( t − t ) ≥ t /
2, may be equivalentlystated as ∀ t ≥ t , P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) √ T σ ( w T X ) (cid:13)(cid:13)(cid:13)(cid:13) ≥ t (cid:19) ≤ Ce − cTt λ ϕλ σ k X k . (7)As a side (but important) remark, note that, since P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) Σ √ T (cid:13)(cid:13)(cid:13)(cid:13) F ≥ t √ T (cid:19) = P vuut n X i =1 (cid:13)(cid:13)(cid:13)(cid:13) σ i √ T (cid:13)(cid:13)(cid:13)(cid:13) ≥ t √ T ≤ P max ≤ i ≤ n (cid:13)(cid:13)(cid:13)(cid:13) σ i √ T (cid:13)(cid:13)(cid:13)(cid:13) ≥ r Tn t ! ≤ nP (cid:13)(cid:13)(cid:13)(cid:13) σ √ T (cid:13)(cid:13)(cid:13)(cid:13) ≥ r Tn t ! the result above implies that ∀ t ≥ t , P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) Σ √ T (cid:13)(cid:13)(cid:13)(cid:13) F ≥ t √ T (cid:19) ≤ Cne − cT t nλ ϕλ σ k X k and thus, since k · k F ≥ k · k , we have ∀ t ≥ t , P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) Σ √ T (cid:13)(cid:13)(cid:13)(cid:13) ≥ t √ T (cid:19) ≤ Cne − cT t nλ ϕλ σ k X k Thus, in particular, under the additional Assumption 3, with high probabil-ity, the operator norm of Σ √ T cannot exceed a rate √ T . Remark . The aforementionedcontrol of k Σ k arises from the bound k Σ k ≤ k Σ k F which may be quite loose(by as much as a factor √ T ). Intuitively, under the supplementary Assump-tion 3, if E[ σ ] = 0 , then Σ √ T is “dominated” by the matrix √ T E[ σ ]1 T T , the op-erator norm of which is indeed of order √ n and the bound is tight. If σ ( t ) = t and E[ W ij ] = 0 , we however know that k Σ √ T k = O (1) (Bai and Silverstein, imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS E[ σ ] = 0 , then k Σ √ T k should remain of this order. And, if instead E[ σ ] = 0 , the contribution of √ T E[ σ ]1 T T should merely engender a single large amplitude isolate singularvalue in the spectrum of Σ √ T and the other singular values remain of order O (1) . These intuitions are not captured by our concentration of measureapproach.Since Σ = σ ( W X ) is an entry-wise operation, concentration results withrespect to the Frobenius norm are natural, where with respect to the operatornorm are hardly accessible. Back to our present considerations, let us define the probability space A K = { w, k σ ( w T X ) k ≤ K √ T } . Conditioning the random variable of inter-est in Lemma 2 with respect to A K and its complementary A cK , for some K ≥ t , gives P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ ( w T X ) Aσ ( w T X ) T − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ P (cid:18)(cid:26)(cid:12)(cid:12)(cid:12)(cid:12) T σ ( w T X ) Aσ ( w T X ) T − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:27) , A K (cid:19) + P ( A cK ) . We can already bound P ( A cK ) thanks to (7). As for the first right-handside term, note that on the set { σ ( w T X ) , w ∈ A K } , the function f : R T → R : σ σ T Aσ is K √ T -Lipschitz. This is because, for all σ, σ + h ∈{ σ ( w T X ) , w ∈ A K } , k f ( σ + h ) − f ( σ ) k = (cid:13)(cid:13)(cid:13) h T Aσ + ( σ + h ) T Ah (cid:13)(cid:13)(cid:13) ≤ K √ T k h k . Since conditioning does not allow for a straightforward application of (6),we consider instead ˜ f , a K √ T -Lipschitz continuation to R T of f A K , therestriction of f to A K , such that all the radial derivative of ˜ f are constantin the set { σ, k σ k ≥ K √ T } . We may thus now apply (6) and our previousresults to obtain P (cid:16)(cid:12)(cid:12)(cid:12) ˜ f ( σ ( w T X )) − E[ ˜ f ( σ ( w T X ))] (cid:12)(cid:12)(cid:12) ≥ KT t (cid:17) ≤ e − cTt k X k λ σλ ϕ . 
Therefore, P (cid:16)n(cid:12)(cid:12)(cid:12) f ( σ ( w T X )) − E[ ˜ f ( σ ( w T X ))] (cid:12)(cid:12)(cid:12) ≥ KT t o , A K (cid:17) = P (cid:16)n(cid:12)(cid:12)(cid:12) ˜ f ( σ ( w T X )) − E[ ˜ f ( σ ( w T X ))] (cid:12)(cid:12)(cid:12) ≥ KT t o , A K (cid:17) ≤ P (cid:16)(cid:12)(cid:12)(cid:12) ˜ f ( σ ( w T X )) − E[ ˜ f ( σ ( w T X ))] (cid:12)(cid:12)(cid:12) ≥ KT t (cid:17) ≤ e − cTt k X k λ σλ ϕ . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Our next step is then to bound the difference ∆ = | E[ ˜ f ( σ ( w T X ))] − E[ f ( σ ( w T X ))] | .Since f and ˜ f are equal on { σ, k σ k ≤ K √ T } ,∆ ≤ Z k σ k≥ K √ T (cid:16) | f ( σ ) | + | ˜ f ( σ ) | (cid:17) dµ σ ( σ )where µ σ is the law of σ ( w T X ). Since k A k ≤
1, for k σ k ≥ K √ T , max( | f ( σ ) | , | ˜ f ( σ ) | ) ≤k σ k and thus∆ ≤ Z k σ k≥ K √ T k σ k dµ σ = 2 Z k σ k≥ K √ T Z ∞ t =0 k σ k ≥ t dtdµ σ = 2 Z ∞ t =0 P (cid:0)(cid:8) k σ k ≥ t (cid:9) , A cK (cid:1) dt ≤ Z K Tt =0 P ( A cK ) dt + 2 Z ∞ t = K T P ( k σ ( w T X ) k ≥ t ) dt ≤ P ( A cK ) K T + 2 Z ∞ t = K T Ce − ct λ ϕλ σ k X k dt ≤ CT K e − cTK λ ϕλ σ k X k + 2 Cλ ϕ λ σ k X k c e − cTK λ ϕλ σ k X k ≤ Cc λ ϕ λ σ k X k where in last inequality we used the fact that for x ∈ R , xe − x ≤ e − ≤ K ≥ t ≥ λ σ λ ϕ k X k q pT . As a consequence, P (cid:16)n(cid:12)(cid:12)(cid:12) f ( σ ( w T X )) − E[ f ( σ ( w T X ))] (cid:12)(cid:12)(cid:12) ≥ KT t + ∆ o , A K (cid:17) ≤ Ce − cTt k X k λ ϕλ σ so that, with the same remark as before, for t ≥ KT , P (cid:16)n(cid:12)(cid:12)(cid:12) f ( σ ( w T X )) − E[ f ( σ ( w T X ))] (cid:12)(cid:12)(cid:12) ≥ KT t o , A K (cid:17) ≤ Ce − cTt k X k λ ϕλ σ . To avoid the condition t ≥ KT , we use the fact that, probabilities beinglower than one, it suffices to replace C by λC with λ ≥ λCe − c Tt k X k λ ϕλ σ ≥ t ≤ KT .
The above inequality holds if we take for instance λ = C e C c since then t ≤ KT ≤ Cλ ϕ λ σ k X k cKT ≤ Cλ ϕ λ σ k X k c √ pT (using successively ∆ ≥ Cc λ ϕ λ σ k X k imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS and K ≥ λ σ λ ϕ k X k q pT ) and thus λCe − cTt k X k λ ϕλ σ ≥ λCe − C cp ≥ λCe − C c ≥ . Therefore, setting λ = max(1 , C e C ′ c ), we get for every t > P (cid:16)n(cid:12)(cid:12)(cid:12) f ( σ ( w T X ) − E[ f ( σ ( w T X )] (cid:12)(cid:12)(cid:12) ≥ KT t o , A K (cid:17) ≤ λCe − cTt k X k λ ϕλ σ which, together with the inequality P ( A cK ) ≤ Ce − cTK λ ϕλ σ k X k , gives P (cid:16)(cid:12)(cid:12)(cid:12) f ( σ ( w T X ) − E[ f ( σ ( w T X )] (cid:12)(cid:12)(cid:12) ≥ KT t (cid:17) ≤ λCe − Tct k X k λ ϕλ σ + Ce − cTK λ ϕλ σ k X k . We then conclude P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ ( w T X ) Aσ ( w T X ) T − T tr(Φ A ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:19) ≤ ( λ + 1) Ce − cT k X k λ ϕλ σ min( t /K ,K ) and, with K = max(4 t , √ t ), P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ ( w T X ) Aσ ( w T X ) T − T tr(Φ A ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:19) ≤ ( λ + 1) Ce − cT min t t ,t ! k X k λ ϕλ σ . Indeed, if 4 t ≤ √ t then min( t /K , K ) = t , while if 4 t ≥ √ t thenmin( t /K , K ) = min( t / t , t ) = t / t .As a corollary of Lemma 2, we have the following control of the momentsof T σ T Aσ . Corollary . Let Assumptions 1–2hold. For w ∼ N ϕ (0 , I p ) , σ ≡ σ ( w T X ) T ∈ R T , A ∈ R T × T such that k A k ≤ ,and k ∈ N , E "(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) k ≤ C (cid:18) t η √ T (cid:19) k + C (cid:18) η T (cid:19) k with t = | σ (0) | + λ σ λ ϕ k X k q pT , η = k X k λ σ λ ϕ , and C , C > independentof the other parameters. In particular, under the additional Assumption 3, E "(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) k ≤ Cn k/ imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Proof.
We use the fact that, for a nonnegative random variable Y ,E[ Y ] = R ∞ P ( Y > t ) dt , so thatE "(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) k = Z ∞ P (cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) k > u ! du = Z ∞ kv k − P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T Aσ − T tr Φ A (cid:12)(cid:12)(cid:12)(cid:12) > v (cid:19) dv ≤ Z ∞ kv k − Ce − cTη min (cid:18) v t ,v (cid:19) dv ≤ Z t kv k − Ce − cTv t η dv + Z ∞ t kv k − Ce − cTvη dv ≤ Z ∞ kv k − Ce − cTv t η dv + Z ∞ kv k − Ce − cTvη dv = (cid:18) t η √ cT (cid:19) k Z ∞ kt k − Ce − t dt + (cid:18) η cT (cid:19) k Z ∞ kt k − Ce − t dt which, along with the boundedness of the integrals, concludes the proof.Beyond concentration results on functions of the vector σ , we also havethe following convenient property for functions of the matrix Σ. Lemma . Let f : R n × T → R be a λ f -Lipschitz function with respect to the Froebnius norm. Then, under Assump-tions 1–2, P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) f (cid:18) Σ √ T (cid:19) − E f (cid:18) Σ √ T (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cTt λ σλ ϕλ f k X k for some C, c > . In particular, under the additional Assumption 3, P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) f (cid:18) Σ √ T (cid:19) − E f (cid:18) Σ √ T (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cT t . Proof.
Denoting W = ϕ ( ˜ W ), since vec( ˜ W ) ≡ [ ˜ W , · · · , ˜ W np ] is a Gaus-sian vector, by the normal concentration of Gaussian vectors, for g a λ g - imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS Lipschitz function of W with respect to the Frobenius norm (i.e., the Eu-clidean norm of vec( W )), by (6), P ( | g ( W ) − E[ g ( W )] | > t ) = P (cid:16)(cid:12)(cid:12)(cid:12) g ( ϕ ( ˜ W )) − E[ g ( ϕ ( ˜ W ))] (cid:12)(cid:12)(cid:12) > t (cid:17) ≤ Ce − ct λ gλ ϕ for some C, c >
0. Let’s consider in particular g : W f (Σ / √ T ) andremark that | g ( W + H ) − g ( W ) | = (cid:12)(cid:12)(cid:12)(cid:12) f (cid:18) σ (( W + H ) X ) √ T (cid:19) − f (cid:18) σ ( W X ) √ T (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ λ f √ T k σ (( W + H ) X ) − σ ( W X ) k F ≤ λ f λ σ √ T k HX k F = λ f λ σ √ T √ tr HXX T H T ≤ λ f λ σ √ T q k XX T kk H k F concluding the proof.A first corollary of Lemma 3 is the concentration of the Stieltjes transform T tr (cid:0) T Σ T Σ − zI T (cid:1) − of µ n , the empirical spectral measure of T Σ T Σ, for all z ∈ C \ R + (so in particular, for z = − γ , γ > Corollary µ n ) . UnderAssumptions 1–2, for z ∈ C \ R + , P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T tr (cid:18) T Σ T Σ − zI T (cid:19) − − E " T tr (cid:18) T Σ T Σ − zI T (cid:19) − > t ! ≤ Ce − c dist( z, R +)2 Tt λ σλ ϕ k X k for some C, c > , where dist( z, R + ) is the Hausdorff set distance. In par-ticular, for z = − γ , γ > , and under the additional Assumption 3 P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T tr Q − T tr E[ Q ] (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cnt . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Proof.
We can apply Lemma 3 for f : R T tr( R T R − zI T ) − , sincewe have | f ( R + H ) − f ( R ) | = (cid:12)(cid:12)(cid:12)(cid:12) T tr(( R + H ) T ( R + H ) − zI T ) − (( R + H ) T H + H T R )( R T R − zI T ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) T tr(( R + H ) T ( R + H ) − zI T ) − ( R + H ) T H ( R T R − zI T ) − (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) T tr(( R + H ) T ( R + H ) − zI T ) − H T R ( R T R − zI T ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ k H k dist( z, R + ) ≤ k H k F dist( z, R + ) where, for the second to last inequality, we successively used the relations | tr AB | ≤ √ tr AA T √ tr BB T , | tr CD | ≤ k D k tr C for nonnegative definite C , and k ( R T R − zI T ) − k ≤ dist( z, R + ) − , k ( R T R − zI T ) − R T R k ≤ k ( R T R − zI T ) − R T k = k ( R T R − zI T ) − R T R ( R T R − zI T ) − k ≤ k ( R T R − zI T ) − R T R k k ( R T R − zI T ) − k ≤ dist( z, R + ) − , for z ∈ C \ R + , and finally k · k ≤ k · k F .Lemma 3 also allows for an important application of Lemma 2 as follows. Lemma T σ T Q − σ ) . Let Assumptions 1–3 hold andwrite W T = [ w , . . . , w n ] . Define σ ≡ σ ( w T X ) T ∈ R T and, for W T − =[ w , . . . , w n ] and Σ − = σ ( W − X ) , let Q − = ( T Σ T − Σ − + γI T ) − . Then, for A, B ∈ R T × T such that k A k , k B k ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T AQ − Bσ − T tr Φ A E[ Q − ] B (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cn min( t ,t ) for some C, c > independent of the other parameters. Proof.
Let f : R T σ T A ( R T R + γI T ) − Bσ . Reproducing the proofof Corollary 2, conditionally to T k σ k ≤ K for any arbitrary large enough K >
0, it appears that f is Lipschitz with parameter of order O (1). Alongwith (7) and Assumption 3, this thus ensures that P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T AQ − Bσ − T σ T A E[ Q − ] Bσ (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T AQ − Bσ − T σ T A E[ Q − ] Bσ (cid:12)(cid:12)(cid:12)(cid:12) > t, k σ k T ≤ K (cid:19) + P (cid:18) k σ k T > K (cid:19) ≤ Ce − cnt imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS for some C, c >
0. We may then apply Lemma 1 on the bounded normmatrix A E[ Q − ] B to further find that P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T AQ − Bσ − T tr Φ A E[ Q − ] B (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T AQ − Bσ − T σ T A E[ Q − ] Bσ (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) + P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T σ T A E[ Q − ] Bσ − T tr Φ A E[ Q − ] B (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ C ′ e − c ′ n min( t ,t ) which concludes the proof.As a further corollary of Lemma 3, we have the following concentrationresult on the training mean-square error of the neural network under study. Corollary . Under As-sumptions 1–3, P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y Q − T tr Y T Y E (cid:2) Q (cid:3)(cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ Ce − cnt for some C, c > independent of the other parameters. Proof.
We apply Lemma 3 to the mapping f : R T tr Y T Y ( R T R + γI T ) − . Denoting Q = ( R T R + γI T ) − and Q H = (( R + H ) T ( R + H )+ γI T ) − ,remark indeed that | f ( R + H ) − f ( R ) | = (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y (( Q H ) − Q ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y ( Q H − Q ) Q H (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y Q ( Q H − Q ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y Q H (( R + H ) T ( R + H ) − R T R ) QQ H (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y QQ H (( R + H ) T ( R + H ) − R T R ) Q (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y Q H ( R + H ) T HQQ H (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y Q H H T RQQ H (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y QQ H ( R + H ) T RQ (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12) T tr Y T Y QQ H H T RQ (cid:12)(cid:12)(cid:12)(cid:12) . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. As k Q H ( R + H ) T k = p k Q H ( R + H ) T ( R + H ) Q H k and k RQ k = p k QR T RQ k are bounded and T tr Y T Y is also bounded by Assumption 3, this implies | f ( R + H ) − f ( R ) | ≤ C k H k ≤ C k H k F for some C >
0. The function f is thus Lipschitz with parameter independentof n , which allows us to conclude using Lemma 3.The aforementioned concentration results are the building blocks of theproofs of Theorem 1–3 which, under all Assumptions 1–3, are establishedusing standard random matrix approaches.5.2. Asymptotic Equivalents.
First Equivalent for E[ Q ] . This section is dedicated to a first char-acterization of E[ Q ], in the “simultaneously large” n, p, T regime. This pre-liminary step is classical in studying resolvents in random matrix theory asthe direct comparison of E[ Q ] to ¯ Q with the implicit δ may be cumbersome.To this end, let us thus define the intermediary deterministic matrix˜ Q = (cid:18) nT Φ1 + α + γI T (cid:19) − with α ≡ T tr ΦE[ Q − ], where we recall that Q − is a random matrix dis-tributed as, say, ( T Σ T Σ − T σ σ T + γI T ) − .First note that, since T tr Φ = E[ T k σ k ] and, from (7) and Assump-tion 3, P ( T k σ k > t ) ≤ Ce − cnt for all large t , we find that T tr Φ = R ∞ t P ( T k σ k > t ) dt ≤ C ′ for some constant C ′ . Thus, α ≤ k E[ Q − ] k T tr Φ ≤ C ′ γ is uniformly bounded.We will show here that k E[ Q ] − ˜ Q k → n → ∞ in the regime ofAssumption 3. As the proof steps are somewhat classical, we defer to theappendix some classical intermediary lemmas (Lemmas 5–7). Using the re-solvent identity, Lemma 5, we start by writingE[ Q ] − ˜ Q = E (cid:20) Q (cid:18) nT Φ1 + α − T Σ T Σ (cid:19)(cid:21) ˜ Q = E[ Q ] nT Φ1 + α ˜ Q − E (cid:20) Q T Σ T Σ (cid:21) ˜ Q = E[ Q ] nT Φ1 + α ˜ Q − T n X i =1 E h Qσ i σ T i i ˜ Q imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS which, from Lemma 6, gives, for Q − i = ( T Σ T Σ − T σ i σ T i + γI T ) − ,E[ Q ] − ˜ Q = E[ Q ] nT Φ1 + α ˜ Q − T n X i =1 E " Q − i σ i σ T i T σ T i Q − i σ i ˜ Q = E[ Q ] nT Φ1 + α ˜ Q −
11 + α T n X i =1 E h Q − i σ i σ T i i ˜ Q + 1 T n X i =1 E " Q − i σ i σ T i (cid:0) T σ T i Q − i σ i − α (cid:1) (1 + α )(1 + T σ T i Q − i σ i ) ˜ Q. Note now, from the independence of Q − i and σ i σ T i , that the second right-hand side expectation is simply E[ Q − i ]Φ. Also, exploiting Lemma 6 in re-verse on the rightmost term, this givesE[ Q ] − ˜ Q = 1 T n X i =1 E[ Q − Q − i ]Φ1 + α ˜ Q + 11 + α T n X i =1 E (cid:20) Qσ i σ T i ˜ Q (cid:18) T σ T i Q − i σ i − α (cid:19)(cid:21) . (8)It is convenient at this point to note that, since E[ Q ] − ˜ Q is symmetric, wemay writeE[ Q ] − ˜ Q = 12 11 + α T n X i =1 (cid:16) E[ Q − Q − i ]Φ ˜ Q + ˜ Q ΦE[ Q − Q − i ] (cid:17) + 1 T n X i =1 E (cid:20)(cid:16) Qσ i σ T i ˜ Q + ˜ Qσ i σ T i Q (cid:17) (cid:18) T σ T i Q − i σ i − α (cid:19)(cid:21)! . (9)We study the two right-hand side terms of (9) independently.For the first term, since Q − Q − i = − Q T σ i σ T i Q − i ,1 T n X i =1 E[ Q − Q − i ]Φ1 + α ˜ Q = 11 + α T E " Q T n X i =1 σ i σ T i Q − i Φ ˜ Q = 11 + α T E " Q T n X i =1 σ i σ T i Q (cid:18) T σ T i Q − i σ i (cid:19) Φ ˜ Q where we used again Lemma 6 in reverse. Denoting D = diag( { T σ T i Q − i σ i } ni =1 ),this can be compactly written1 T n X i =1 E[ Q − Q − i ]Φ1 + α ˜ Q = 11 + α T E (cid:20) Q T Σ T D Σ Q (cid:21) Φ ˜ Q. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Note at this point that, from Lemma 7, k Φ ˜ Q k ≤ (1 + α ) Tn and (cid:13)(cid:13)(cid:13)(cid:13) Q √ T Σ T (cid:13)(cid:13)(cid:13)(cid:13) = s(cid:13)(cid:13)(cid:13)(cid:13) Q T Σ T Σ Q (cid:13)(cid:13)(cid:13)(cid:13) ≤ γ − . Besides, by Lemma 4 and the union bound, P (cid:18) max ≤ i ≤ n D ii > α + t (cid:19) ≤ Cne − cn min( t ,t ) for some C, c >
0, so in particular, recalling that α ≤ C ′ for some constant C ′ > (cid:20) max ≤ i ≤ n D ii (cid:21) = Z C ′ )0 P (cid:18) max ≤ i ≤ n D ii > t (cid:19) dt + Z ∞ C ′ ) P (cid:18) max ≤ i ≤ n D ii > t (cid:19) dt ≤ C ′ ) + Z ∞ C ′ ) Cne − cn min(( t − (1+ C ′ )) ,t − (1+ C ′ )) dt = 2(1 + C ′ ) + Z ∞ C ′ Cne − cnt dt = 2(1 + C ′ ) + e − Cn (1+ C ′ ) = O (1) . As a consequence of all the above (and of the boundedness of α ), we havethat, for some c >
0, 1 T (cid:13)(cid:13)(cid:13)(cid:13) E (cid:20) Q T Σ T D Σ Q (cid:21) Φ ˜ Q (cid:13)(cid:13)(cid:13)(cid:13) ≤ cn . (10)Let us now consider the second right-hand side term of (9). Using therelation ab T + ba T (cid:22) aa T + bb T in the order of Hermitian matrices (whichunfolds from ( a − b )( a − b ) T (cid:23) a = T Qσ i ( T σ T i Q − i σ i − α )and b = T − ˜ Qσ i ,1 T n X i =1 E (cid:20)(cid:16) Qσ i σ T i ˜ Q + ˜ Qσ i σ T Q (cid:17) (cid:18) T σ T i Q − i σ i − α (cid:19)(cid:21) (cid:22) √ T n X i =1 E " Qσ i σ T i Q (cid:18) T σ T i Q − i σ i − α (cid:19) + 1 T √ T n X i =1 E h ˜ Qσ i σ T i ˜ Q i = √ T E (cid:20) Q T Σ T D Σ Q (cid:21) + nT √ T ˜ Q Φ ˜ Q imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS where D = diag( { T σ T i Q − i σ i − α } ni =1 ). Of course, since we also have − aa T − bb T (cid:22) ab T + ba T (from ( a + b )( a + b ) T (cid:23) T n X i =1 E (cid:20)(cid:16) Qσ i σ T i ˜ Q + ˜ Qσ i σ T Q (cid:17) (cid:18) T σ T i Q − i σ i − α (cid:19)(cid:21) (cid:23) −√ T E (cid:20) Q T Σ T D Σ Q (cid:21) − nT √ T ˜ Q Φ ˜ Q. But from Lemma 4, P (cid:16) k D k > tn ε − (cid:17) = P (cid:18) max ≤ i ≤ n (cid:12)(cid:12)(cid:12)(cid:12) T σ T i Q − i σ i − α (cid:12)(cid:12)(cid:12)(cid:12) > tn ε − (cid:19) ≤ Cne − c min( n ε t ,n
12 + ε t ) so that, with a similar reasoning as in the proof of Corollary 1, (cid:13)(cid:13)(cid:13)(cid:13) √ T E (cid:20) Q T Σ T D Σ Q (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) ≤ √ T E (cid:2) k D k (cid:3) ≤ Cn ε ′ − where we additionally used k Q Σ k ≤ √ T in the first inequality.Since in addition (cid:13)(cid:13)(cid:13) nT √ T ˜ Q Φ ˜ Q (cid:13)(cid:13)(cid:13) ≤ Cn − , this gives (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T n X i =1 E (cid:20)(cid:16) Qσ i σ T i ˜ Q + ˜ Qσ i σ T i Q (cid:17) (cid:18) T σ T i Q − i σ i − α (cid:19)(cid:21)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ Cn ε − . Together with (9), we thus conclude that (cid:13)(cid:13)(cid:13) E[ Q ] − ˜ Q (cid:13)(cid:13)(cid:13) ≤ Cn ε − . Note in passing that we proved that k E [ Q − Q − ] k = Tn (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) T n X i =1 E [ Q − Q − i ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13) n E (cid:20) Q T Σ T D Σ Q (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) ≤ cn where the first equality holds by exchangeability arguments.In particular, α = 1 T tr ΦE[ Q − ] = 1 T tr ΦE[ Q ] + 1 T tr Φ(E[ Q − ] − E[ Q ]) imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. where | T tr Φ(E[ Q − ] − E[ Q ]) | ≤ cn . And thus, by the previous result, (cid:12)(cid:12)(cid:12)(cid:12) α − T tr Φ ˜ Q (cid:12)(cid:12)(cid:12)(cid:12) ≤ Cn − + ε T tr Φ . We have proved in the beginning of the section that T tr Φ is bounded andthus we finally conclude that (cid:13)(cid:13)(cid:13)(cid:13) α − T tr Φ ˜ Q (cid:13)(cid:13)(cid:13)(cid:13) ≤ Cn ε − . Second Equivalent for E[ Q ] . In this section, we show that E[ Q ]can be approximated by the matrix ¯ Q , which we recall is defined as¯ Q = (cid:18) nT Φ1 + δ + γI T (cid:19) − where δ > δ = T tr Φ ¯ Q . The fact that δ > f : [0 , ∞ ) → (0 , ∞ ), x f ( x ), satisfies x ≥ x ′ ⇒ f ( x ) ≥ f ( x ′ ), ∀ a > , af ( x ) > f ( ax ) and thereexists x such that x ≥ f ( x ), then f has a unique fixed point (Yates, 1995,Th 2). It is easily shown that δ T tr Φ ¯ Q is such a map, so that δ existsand is unique.To compare ˜ Q and ¯ Q , using the resolvent identity, Lemma 5, we start bywriting ˜ Q − ¯ Q = ( α − δ ) ˜ Q nT
Φ(1 + α )(1 + δ ) ¯ Q from which | α − δ | = (cid:12)(cid:12)(cid:12)(cid:12) T tr Φ (cid:0) E[ Q − ] − ¯ Q (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) T tr Φ (cid:16) ˜ Q − ¯ Q (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) + cn − + ε = | α − δ | T tr Φ ˜ Q nT Φ ¯ Q (1 + α )(1 + δ ) + cn − + ε imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS which implies that | α − δ | − T tr Φ ˜ Q nT Φ ¯ Q (1 + α )(1 + δ ) ! ≤ cn − + ε . It thus remains to show thatlim sup n T tr Φ ˜ Q nT Φ ¯ Q (1 + α )(1 + δ ) < | α − δ | ≤ cn ε − . To this end, note that, by Cauchy–Schwarz’sinequality,1 T tr Φ ˜ Q nT Φ ¯ Q (1 + α )(1 + δ ) ≤ s nT (1 + δ ) T tr Φ ¯ Q · nT (1 + α ) T tr Φ ˜ Q so that it is sufficient to bound the limsup of both terms under the squareroot strictly by one. Next, remark that δ = 1 T tr Φ ¯ Q = 1 T tr Φ ¯ Q ¯ Q − = n (1 + δ ) T (1 + δ ) T tr Φ ¯ Q + γ T tr Φ ¯ Q . In particular, nT (1 + δ ) T tr Φ ¯ Q = δ nT (1+ δ ) T tr Φ ¯ Q (1 + δ ) nT (1+ δ ) T tr Φ ¯ Q + γ T tr Φ ¯ Q ≤ δ δ . But at the same time, since k ( nT Φ + γI T ) − k ≤ γ − , δ ≤ γT tr Φthe limsup of which is bounded. We thus conclude thatlim sup n nT (1 + δ ) T tr Φ ¯ Q < . (11)Similarly, α , which is known to be bounded, satisfies α = (1 + α ) nT (1 + α ) T tr Φ ˜ Q + γ T tr Φ ˜ Q + O ( n ε − )and we thus have alsolim sup n nT (1 + α ) T tr Φ ˜ Q < imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. which completes to prove that | α − δ | ≤ cn ε − .As a consequence of all this, k ˜ Q − ¯ Q k = | α − δ | · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˜ Q nT Φ ¯ Q (1 + α )(1 + δ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ cn − + ε and we have thus proved that k E[ Q ] − ¯ Q k ≤ cn − + ε for some c > P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T tr Q − T tr ¯ Q (cid:12)(cid:12)(cid:12)(cid:12) > t (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) T tr Q − T tr E[ Q ] (cid:12)(cid:12)(cid:12)(cid:12) > t − (cid:12)(cid:12)(cid:12)(cid:12) T tr E[ Q ] − T tr ¯ Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) ≤ C ′ e − c ′ n ( t − cn −
12 + ε ) ≤ C ′ e − c ′ nt for all large n . As a consequence, for all γ > T tr Q − T tr ¯ Q → m µ n − m ¯ µ n of Stieltjes transforms m µ n : C \ R + → C , z T tr( T Σ T Σ − zI T ) − and m ¯ µ n : C \ R + → C , z T tr( nT Φ1+ δ z − zI T ) − (with δ z the unique Stieltjes transform solutionto δ z = T tr Φ( nT Φ1+ δ z − zI T ) − ) converges to zero for each z in a subset of C \ R + having at least one accumulation point (namely R − ), almost surelyso (that is, on a probability set A z with P ( A z ) = 1). Thus, letting { z k } ∞ k =1 be a converging sequence strictly included in R − , on the probability onespace A = ∩ ∞ k =1 A k , m µ n ( z k ) − m ¯ µ n ( z k ) → k . Now, m µ n is com-plex analytic on C \ R + and bounded on all compact subsets of C \ R + .Besides, it was shown in (Silverstein and Bai, 1995; Silverstein and Choi,1995) that the function m ¯ µ n is well-defined, complex analytic and boundedon all compact subsets of C \ R + . As a result, on A , m µ n − m ¯ µ n is complexanalytic, bounded on all compact subsets of C \ R + and converges to zero ona subset admitting at least one accumulation point. Thus, by Vitali’s con-vergence theorem (Titchmarsh, 1939), with probability one, m µ n − m ¯ µ n con-verges to zero everywhere on C \ R + . This implies, by (Bai and Silverstein,2009, Theorem B.9), that µ n − ¯ µ n →
0, vaguely as a signed finite measure,with probability one, and, since ¯ µ n is a probability measure (again from theresults of (Silverstein and Bai, 1995; Silverstein and Choi, 1995)), we havethus proved Theorem 2.5.2.3. Asymptotic Equivalent for E[ QAQ ] , where A is either Φ or sym-metric of bounded norm. The evaluation of the second order statistics of imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017
RANDOM MATRIX APPROACH TO NEURAL NETWORKS the neural network under study requires, beside E[ Q ], to evaluate the moreinvolved form E[ QAQ ], where A is a symmetric matrix either equal to Φor of bounded norm (so in particular k ¯ QA k is bounded). To evaluate thisquantity, first writeE[ QAQ ] = E (cid:2) ¯ QAQ (cid:3) + E (cid:2) ( Q − ¯ Q ) AQ (cid:3) = E (cid:2) ¯ QAQ (cid:3) + E (cid:20) Q (cid:18) nT Φ1 + δ − T Σ T Σ (cid:19) ¯ QAQ (cid:21) = E (cid:2) ¯ QAQ (cid:3) + nT
11 + δ E (cid:2) Q Φ ¯
QAQ (cid:3) − T n X i =1 E h Qσ i σ T i ¯ QAQ i . Of course, since
QAQ is symmetric, we may writeE[
QAQ ] = 12 (cid:18) E (cid:2) ¯ QAQ + QA ¯ Q (cid:3) + nT
11 + δ E (cid:2) Q Φ ¯
QAQ + QA ¯ Q Φ Q (cid:3) − T n X i =1 E h Qσ i σ T i ¯ QAQ + QA ¯ Qσ i σ T i Q i! which will reveal more practical to handle.First note that, since (cid:13)(cid:13) E[ Q ] − ¯ Q (cid:13)(cid:13) ≤ Cn ε − and A is such that k ¯ QA k isbounded, k E[ ¯
QAQ ] − ¯ QA ¯ Q ] k ≤ k ¯ QA kk E[ Q ] − ¯ Q k ≤ C ′ n ε − , which providesan estimate for the first expectation. We next evaluate the last right-handside expectation above. With the same notations as previously, from ex-changeability arguments and using Q = Q − − Q T σσ T Q − , observe that1 T n X i =1 E h Qσ i σ T i ¯ QAQ i = nT E h Qσσ T ¯ QAQ i = nT E " Q − σσ T ¯ QAQ T σ T Q − σ = nT
11 + δ E h Q − σσ T ¯ QAQ i + nT
11 + δ E " Q − σσ T ¯ QAQ δ − T σ T Q − σ T σ T Q − σ imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. which, reusing Q = Q − − Q T σσ T Q − , is further decomposed as1 T n X i =1 E h Qσ i σ T i ¯ QAQ i = nT
11 + δ E h Q − σσ T ¯ QAQ − i − nT
11 + δ E " Q − σσ T ¯ QAQ − σσ T Q − T σ T Q − σ + nT E " Q − σσ T ¯ QAQ − δ − T σ T Q − σ (1 + δ ) (cid:0) T σ T Q − σ (cid:1) − nT E " Q − σσ T ¯ QAQ − σσ T Q − (cid:0) δ − T σ T Q − σ (cid:1) (1 + δ ) (cid:0) T σ T Q − σ (cid:1) = nT
11 + δ E (cid:2) Q − Φ ¯
QAQ − (cid:3) − nT
11 + δ E " Q − σσ T Q − T σ T ¯ QAQ − σ T σ T Q − σ + nT E " Q − σσ T (cid:0) δ − T σ T Q − σ (cid:1) (1 + δ ) (cid:0) T σ T Q − σ (cid:1) ¯ QAQ − − nT E " Q − σσ T Q − T σ T ¯ QAQ − σ (cid:0) δ − T σ T Q − σ (cid:1) (1 + δ ) (cid:0) T σ T Q − σ (cid:1) ≡ Z + Z + Z + Z (where in the previous to last line, we have merely reorganized the termsconveniently) and our interest is in handling Z + Z T + Z + Z T + Z + Z T + Z + Z T . Let us first treat term Z . Since ¯ QAQ − is bounded, byLemma 4, T σ T ¯ QAQ − σ concentrates around T tr Φ ¯ QAE [ Q − ]; but, as k Φ ¯ Q k is bounded, we also have | T tr Φ ¯ QAE [ Q − ] − T tr Φ ¯ QA ¯ Q | ≤ cn ε − . We thusdeduce, with similar arguments as previously, that − Q − σσ T Q − Cn ε − (cid:22) Q − σσ T Q − " T σ T ¯ QAQ − σ T σ T Q − σ − T tr Φ ¯ QA ¯ Q δ (cid:22) Q − σσ T Q − Cn ε − with probability exponentially close to one, in the order of symmetric ma-trices. Taking expectation and norms on both sides, and conditioning on the imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS aforementioned event and its complementary, we thus have that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E " Q − σσ T Q − T σ T ¯ QAQ − σ T σ T Q − σ − E [ Q − Φ Q − ] T tr Φ ¯ QA ¯ Q δ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ k E [ Q − Φ Q − ] k Cn ε − + C ′ ne − cn ε ′ ≤ k E [ Q − Φ Q − ] k C ′′ n ε − But, again by exchangeability arguments,E[ Q − Φ Q − ] = E[ Q − σσ T Q − ] = E " Qσσ T Q (cid:18) T σ T Q − σ (cid:19) = Tn E (cid:20) Q T Σ T D Σ Q (cid:21) with D = diag( { T σ T i Q − σ i } ), the operator norm of which is bounded asO(1). So finally, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E " Q − σσ T Q − T σ T ¯ QAQ − σ T σ T Q − σ − E [ Q − Φ Q − ] T tr Φ ¯ QA ¯ Q δ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ Cn ε − . We now move to term Z + Z T . Using the relation ab T + ba T (cid:22) aa T + bb T ,E " ( δ − T σ T Q − σ ) Q − σσ T ¯ QAQ − + Q − A ¯ Qσσ T Q − (1 + T σ T Q − σ ) (cid:22) √ n E " ( δ − T σ T Q − σ ) (1 + T σ T Q − σ ) Q − σσ T Q − + 1 √ n E h Q − A ¯ Qσσ T ¯ QAQ − i = √ n Tn E (cid:20) Q T Σ T D Σ Q (cid:21) + 1 √ n E (cid:2) Q − A ¯ Q Φ ¯
QAQ − (cid:3) and the symmetrical lower bound (equal to the opposite of the upper bound),where D = diag(( δ − T σ T i Q − i σ i ) / (1 + T σ T i Q − i σ i )). For the same reasonsas above, the first right-hand side term is bounded by Cn ε − . As for thesecond term, for A = I T , it is clearly bounded; for A = Φ, using nT ¯ Q Φ1+ δ = I T − γ ¯ Q , E[ Q − A ¯ Q Φ ¯
QAQ − ] can be expressed in terms of E[ Q − Φ Q − ] andE[ Q − ¯ Q k Φ Q − ] for k = 1 ,
2, all of which have been shown to be bounded (atmost by Cn ε ). We thus conclude that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E "(cid:18) δ − T σ T Q − σ (cid:19) Q − σσ T ¯ QAQ − + Q − A ¯ Qσσ T Q − (1 + T σ T Q − σ ) ≤ Cn ε − . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Finally, term Z can be handled similarly as term Z and is shown to beof norm bounded by Cn ε − .As a consequence of all the above, we thus find thatE [ QAQ ] = ¯ QA ¯ Q + nT E (cid:2) Q Φ ¯
QAQ (cid:3) δ − nT E (cid:2) Q − Φ ¯
QAQ − (cid:3) δ + nT T tr Φ ¯ QA ¯ Q (1 + δ ) E[ Q − Φ Q − ] + O ( n ε − ) . It is attractive to feel that the sum of the second and third terms abovevanishes. This is indeed verified by observing that, for any matrix B ,E [ QBQ ] − E [ Q − BQ ] = 1 T E h Qσσ T Q − BQ i = 1 T E (cid:20) Qσσ T QBQ (cid:18)
T σ T Q − σ (cid:19)(cid:21) = 1 n E (cid:20) Q T Σ T D Σ QBQ (cid:21) and symmetricallyE [
QBQ ] − E [
QBQ − ] = 1 n E (cid:20) QBQ T Σ T D Σ Q (cid:21) with D = diag(1 + T σ T i Q − i σ i ), and a similar reasoning is performed tocontrol E[ Q − BQ ] − E[ Q − BQ − ] and E[ QBQ − ] − E[ Q − BQ − ]. For B bounded, k E[ Q T Σ T D Σ QBQ ] k is bounded as O (1), and thus k E[ QBQ ] − E[ Q − BQ − ] k is of order O ( n − ). So in particular, taking A of bounded norm, we find thatE [ QAQ ] = ¯ QA ¯ Q + nT T tr Φ ¯ QA ¯ Q (1 + δ ) E[ Q − Φ Q − ] + O ( n ε − ) . Take now B = Φ. Then, from the relation AB T + BA T (cid:22) AA T + BB T inthe order of symmetric matrices, (cid:13)(cid:13)(cid:13)(cid:13) E [ Q Φ Q ] −
12 E [ Q − Φ Q + Q Φ Q − ] (cid:13)(cid:13)(cid:13)(cid:13) = 12 n (cid:13)(cid:13)(cid:13)(cid:13) E (cid:20) Q T Σ T D Σ Q Φ Q + Q Φ Q T Σ T D Σ Q (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) ≤ n (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) E (cid:20) Q T Σ T D Σ Q T Σ T D Σ Q (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) + k E [ Q Φ Q Φ Q ] k (cid:19) . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS The first norm in the parenthesis is bounded by Cn ε and it thus remainsto control the second norm. To this end, similar to the control of E[ Q Φ Q ],by writing E[ Q Φ Q Φ Q ] = E[ Qσ σ T Qσ σ T Q ] for σ , σ independent vectorswith the same law as σ , and exploiting the exchangeability, we obtain aftersome calculus that E[ Q Φ Q ] can be expressed as the sum of terms of the formE[ Q ++ 1 T Σ T ++ D Σ ++ Q ++ ] or E[ Q ++ 1 T Σ T ++ D Σ ++ Q ++ 1 T Σ T ++ D Σ ++ Q ++ ] for D, D diagonal matrices of norm bounded as O (1), while Σ ++ and Q ++ aresimilar as Σ and Q , only for n replaced by n +2. All these terms are boundedas O (1) and we finally obtain that E[ Q Φ Q Φ Q ] is bounded and thus (cid:13)(cid:13)(cid:13)(cid:13) E [ Q Φ Q ] −
12 E [ Q − Φ Q + Q Φ Q − ] (cid:13)(cid:13)(cid:13)(cid:13) ≤ Cn .
With the additional control on Q Φ Q − − Q − Φ Q − and Q − Φ Q − Q − Φ Q − ,together, this implies that E[ Q Φ Q ] = E[ Q − Φ Q − ]+ O k·k ( n − ). Hence, for A =Φ, exploiting the fact that nT δ Φ ¯ Q Φ = Φ − γ ¯ Q Φ, we have the simplificationE [ Q Φ Q ] = ¯ Q Φ ¯ Q + nT E (cid:2) Q Φ ¯ Q Φ Q (cid:3) δ − nT E (cid:2) Q − Φ ¯ Q Φ Q − (cid:3) δ + nT T tr Φ ¯ Q (1 + δ ) E[ Q − Φ Q − ] + O k·k ( n ε − )= ¯ Q Φ ¯ Q + nT T tr Φ ¯ Q (1 + δ ) E[ Q Φ Q ] + O k·k ( n ε − ) . or equivalentlyE [ Q Φ Q ] − nT T tr Φ ¯ Q (1 + δ ) ! = ¯ Q Φ ¯ Q + O k·k ( n ε − ) . We have already shown in (11) that lim sup n nT T tr Φ ¯ Q (1+ δ ) < Q Φ Q ] = ¯ Q Φ ¯ Q − nT T tr Φ ¯ Q (1+ δ ) + O k·k ( n ε − ) . So finally, for all A of bounded norm,E [ QAQ ] = ¯ QA ¯ Q + nT T tr Φ ¯ QA ¯ Q (1 + δ ) ¯ Q Φ ¯ Q − nT T tr Φ ¯ Q (1+ δ ) + O ( n ε − )which proves immediately Proposition 1 and Theorem 3. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Derivation of Φ ab . Gaussian w . In this section, we evaluate the terms Φ ab providedin Table 1. The proof for the term corresponding to σ ( t ) = erf( t ) can bealready be found in (Williams, 1998, Section 3.1) and is not recalled here.For the other functions σ ( · ), we follow a similar approach as in (Williams,1998), as detailed next.The evaluation of Φ ab for w ∼ N (0 , I p ) requires to estimate I ≡ (2 π ) − p Z R p σ ( w T a ) σ ( w T b ) e − k w k dw. Assume that a and b and not linearly dependent. It is convenient to ob-serve that this integral can be reduced to a two-dimensional integration byconsidering the basis e , . . . , e p defined (for instance) by e = a k a k , e = b k b k − a T b k a kk b k a k a k q − ( a T b ) k a k k b k and e , . . . , e p any completion of the basis. By letting w = ˜ w e + . . . + ˜ w p e p and a = ˜ a e (˜ a = k a k ), b = ˜ b e + ˜ b e (where ˜ b = a T b k a k and ˜ b = k b k q − ( a T b ) k a k k b k ), this reduces I to I = 12 π Z R Z R σ ( ˜ w ˜ a ) σ ( ˜ w ˜ b + ˜ w ˜ b ) e − ( ˜ w + ˜ w ) d ˜ w d ˜ w . Letting ˜ w = [ ˜ w , ˜ w ] T , ˜ a = [˜ a , T and ˜ b = [˜ b , ˜ b ] T , this is convenientlywritten as the two-dimensional integral I = 12 π Z R σ ( ˜ w T ˜ a ) σ ( ˜ w T ˜ b ) e − k ˜ w k d ˜ w. The case where a and b would be linearly dependent can then be obtainedby continuity arguments. The function σ ( t ) = max( t, . For this function, we have I = 12 π Z min( ˜ w T ˜ a, ˜ w T ˜ b ) ≥ ˜ w T ˜ a · ˜ w T ˜ b · e − k ˜ w k d ˜ w. Since ˜ a = ˜ a e , a simple geometric representation lets us observe that n ˜ w | min( ˜ w T ˜ a, ˜ w T ˜ b ) ≥ o = n r cos( θ ) e + r sin( θ ) e | r ≥ , θ ∈ [ θ − π , π o imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS where we defined θ ≡ arccos (cid:16) ˜ b k ˜ b k (cid:17) = − arcsin (cid:16) ˜ b k ˜ b k (cid:17) + π . We may thus oper-ate a polar coordinate change of variable (with inverse Jacobian determinantequal to r ) to obtain I = 12 π Z π θ − π Z R + ( r cos( θ )˜ a ) (cid:16) r cos( θ )˜ b + r sin( θ )˜ b (cid:17) re − r dθdr = ˜ a π Z π θ − π cos( θ ) (cid:16) cos( θ )˜ b + sin( θ )˜ b (cid:17) dθ Z R + r e − r dr. With two integration by parts, we have that R R + r e − r dr = 2. Classicaltrigonometric formulas also provide Z π θ − π cos( θ ) dθ = 12 ( π − θ ) + 12 sin(2 θ )= 12 π − arccos ˜ b k ˜ b k ! + ˜ b k ˜ b k ˜ b k ˜ b k !Z π θ − π cos( θ ) sin( θ ) dθ = 12 sin ( θ ) = 12 ˜ b k ˜ b k ! where we used in particular sin(2 arccos( x )) = 2 x √ − x . Altogether, thisis after simplification and replacement of ˜ a , ˜ b and ˜ b , I = 12 π k a kk b k (cid:16)p − ∠ ( a, b ) + ∠ ( a, b ) arccos( − ∠ ( a, b )) (cid:17) . It is worth noticing that this may be more compactly written as I = 12 π k a kk b k Z ∠ ( a,b ) − arccos( − x ) dx. which is minimum for ∠ ( a, b ) → − − x ) ≥ − , I > a and b not linearlydependent.For a and b linearly dependent, we simply have I = 0 for ∠ ( a, b ) = − I = k a kk b k for ∠ ( a, b ) = 1. The function σ ( t ) = | t | . Since | t | = max( t,
0) + max( − t, | w T a | · | w T b | = max( w T a,
0) max( w T b,
0) + max( w T ( − a ) ,
0) max( w T ( − b ) , w T ( − a ) ,
0) max( w T b,
0) + max( w T a,
0) max( w T ( − b ) , . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Hence, reusing the results above, we have here I = k a kk b k π (cid:16) p − ∠ ( a, b ) + 2 ∠ ( a, b ) acos( − ∠ ( a, b )) − ∠ ( a, b ) acos( ∠ ( a, b )) (cid:17) . Using the identity acos( − x ) − acos( x ) = 2 asin( x ) provides the expectedresult. The function σ ( t ) = 1 t ≥ . With the same notations as in the case σ ( t ) =max( t, I = 12 π Z min( ˜ w T ˜ a, ˜ w T ˜ b ) ≥ e − k ˜ w k d ˜ w. After a polar coordinate change of variable, this is I = 12 π Z π θ − π dθ Z R + re − r dr = 12 − θ π from which the result unfolds. The function σ ( t ) = sign( t ) . Here it suffices to note that sign( t ) = 1 t ≥ − − t ≥ so that σ ( w T a ) σ ( w T b ) = 1 w T a ≥ w T b ≥ + 1 w T ( − a ) ≥ w T ( − b ) ≥ − w T ( − a ) ≥ w T b ≥ − w T a ≥ w T ( − b ) ≥ and to apply the result of the previous section, with either ( a, b ), ( − a, b ),( a, − b ) or ( − a, − b ). Since arccos( − x ) = − arccos( x ) + π , we conclude that I = (2 π ) − p Z R p sign( w T a )sign( w T b ) e − k w k dw = 1 − θ π . The functions σ ( t ) = cos( t ) and σ ( t ) = sin( t ) .. Let us first consider σ ( t ) =cos( t ). We have here to evaluate I = 12 π Z R cos (cid:16) ˜ w T ˜ a (cid:17) cos (cid:16) ˜ w T ˜ b (cid:17) e − k ˜ w k d ˜ w = 18 π Z R (cid:16) e ı ˜ w T ˜ a + e − ı ˜ w T ˜ a (cid:17) (cid:16) e ı ˜ w T ˜ b + e − ı ˜ w T ˜ b (cid:17) e − k ˜ w k d ˜ w which boils down to evaluating, for d ∈ { ˜ a + ˜ b, ˜ a − ˜ b, − ˜ a + ˜ b, − ˜ a − ˜ b } , theintegral e − k d k Z R e − k ˜ w − ıd k d ˜ w = (2 π ) e − k d k . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS Altogether, we find I = 12 (cid:16) e − k a + b k + e − k a − b k (cid:17) = e − ( k a k + k b k ) cosh( a T b ) . For σ ( t ) = sin( t ), it suffices to appropriately adapt the signs in the ex-pression of I (using the relation sin( t ) = ı ( e t + e − t )) to obtain in the end I = 12 (cid:16) e − k a + b k + e − k a − b k (cid:17) = e − ( k a k + k b k ) sinh( a T b )as desired.5.4. Polynomial σ ( · ) and generic w . In this section, we prove Equation 5for σ ( t ) = ζ t + ζ t + ζ and w ∈ R p a random vector with independent andidentically distributed entries of zero mean and moment of order k equal to m k . The result is based on standard combinatorics. We are to evaluateΦ ab = E h(cid:16) ζ ( w T a ) + ζ w T a + ζ (cid:17) (cid:16) ζ ( w T b ) + ζ w T b + ζ (cid:17)i . After development, it appears that one needs only assess, for say vectors c, d ∈ R p that take values in { a, b } , the momentsE[( w T c ) ( w T d ) ] = X i i j j c i c i d j d j E[ w i w i w j w j ]= X i m c i d i + X i = j m c i d j + 2 X i = i m c i d i c i d i = X i m c i d i + X i j − X i = j m c i d j + 2 X i i − X i = i m c i d i c i d i = m ( c ) T ( d ) + m ( k c k k d k − ( c ) T ( d ))+ 2 m (cid:16) ( c T d ) − ( c ) T ( d ) (cid:17) = ( m − m )( c ) T ( d ) + m (cid:16) k c k k d k + 2( c T d ) (cid:17) E h ( w T c ) ( w T d ) i = X i i j c i c i d j E[ w i w i w j ] = X i m c i d i = m ( c ) d E h ( w T c ) i = X i i c i c i E[ w i w i ] = m k c k where we recall the definition ( a ) = [ a , . . . , a p ] T . Gathering all the termsfor appropriate selections of c, d leads to (5). imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Heuristic derivation of Conjecture 1.
Conjecture 1 essentially fol-lows as an aftermath of Remark 1. We believe that, similar to Σ, ˆΣ is ex-pected to be of the form ˆΣ = ˆΣ ◦ + ˆ¯ σ T ˆ T , where ˆ¯ σ = E[ σ ( w T ˆ X )] T , with k ˆΣ ◦ √ T k ≤ n ε with high probability. Besides, if X, ˆ X were chosen as consti-tuted of Gaussian mixture vectors, with non-trivial growth rate conditionsas introduced in (Couillet and Benaych-Georges, 2016), it is easily seen that¯ σ = c p + v and ˆ¯ σ = c p + ˆ v , for some constant c and k v k , k ˆ v k = O (1).This subsequently ensures that Φ X ˆ X and Φ ˆ X ˆ X would be of a similar formΦ ◦ X ˆ X + ¯ σ ˆ¯ σ T and Φ ◦ ˆ X ˆ X + ˆ¯ σ ˆ¯ σ T with Φ ◦ X ˆ X and Φ ◦ ˆ X ˆ X of bounded norm. Thesefacts, that would require more advanced proof techniques, let envision thefollowing heuristic derivation for Conjecture 1.Recall that our interest is on the test performance E test defined as E test = 1ˆ T (cid:13)(cid:13)(cid:13) ˆ Y T − ˆΣ T β (cid:13)(cid:13)(cid:13) F which may be rewritten as E test = 1ˆ T tr (cid:16) ˆ Y ˆ Y T (cid:17) − T ˆ T tr (cid:16) Y Q Σ T ˆΣ ˆ Y T (cid:17) + 1 T ˆ T tr (cid:16) Y Q Σ T ˆΣ ˆΣ T Σ QY T (cid:17) ≡ Z − Z + Z . (12)If ˆΣ = ˆΣ ◦ + ˆ¯ σ T ˆ T follows the aforementioned claimed operator norm control,reproducing the steps of Corollary 3 leads to a similar concentration for E test , which we shall then admit. We are therefore left to evaluating E[ Z ]and E[ Z ]. imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS We start with the term E[ Z ], which we expand asE[ Z ] = 2 T ˆ T E h tr( Y Q Σ T ˆΣ ˆ Y T ) i = 2 T ˆ T n X i =1 h tr( Y Qσ i ˆ σ T i ˆ Y T ) i = 2 T ˆ T n X i =1 E " tr Y Q − i σ i ˆ σ T i ˆ Y T T σ T i Q − i σ i ! = 2 T ˆ T
11 + δ n X i =1 E h tr (cid:16) Y Q − i σ i ˆ σ T i ˆ Y T (cid:17)i + 2 T ˆ T
11 + δ n X i =1 E " tr (cid:16) Y Q − i σ i ˆ σ T i ˆ Y T (cid:17) δ − T σ T i Q − i σ i T σ T i Q − i σ i = 2 nT ˆ T
11 + δ tr (cid:16) Y E[ Q − ]Φ X ˆ X ˆ Y T (cid:17) + 2 T ˆ T
11 + δ E h tr (cid:16) Y Q Σ T D ˆΣ ˆ Y T (cid:17)i ≡ Z + Z with D = diag( { δ − T σ T i Q − i σ i } ), the operator norm of which is bounded by n ε − with high probability. Now, observe that, again with the assumptionthat ˆΣ = ˆΣ ◦ + ¯ σ T ˆ T with controlled ˆΣ ◦ , Z may be decomposed as2 T ˆ T
11 + δ E h tr (cid:16) Y Q Σ T D ˆΣ ˆ Y T (cid:17)i = 2 T ˆ T
11 + δ E h tr (cid:16) Y Q Σ T D ˆΣ ◦ ˆ Y T (cid:17)i + 2 T ˆ T
11 + δ T ˆ T ˆ Y T E h Y Q Σ T D ¯ σ i . In the display above, the first right-hand side term is now of order O ( n ε − ).As for the second right-hand side term, note that D ¯ σ is a vector of inde-pendent and identically distributed zero mean and variance O ( n − ) entries;while note formally independent of Y Q Σ T , it is nonetheless expected thatthis independence “weakens” asymptotically (a behavior several times ob-served in linear random matrix models), so that one expects by central limitarguments that the second right-hand side term be also of order O ( n ε − ).This would thus result inE[ Z ] = 2 nT ˆ T
11 + δ tr (cid:16) Y E[ Q − ]Φ X ˆ X ˆ Y T (cid:17) + O ( n ε − )= 2 nT ˆ T
11 + δ tr (cid:16) Y ¯ Q Φ X ˆ X ˆ Y T (cid:17) + O ( n ε − )= 2ˆ T tr (cid:16) Y ¯ Q Ψ X ˆ X ˆ Y T (cid:17) + O ( n ε − ) imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. where we used k E[ Q − ] − ¯ Q k ≤ Cn ε − and the definition Ψ X ˆ X = nT Φ X ˆ X δ .We then move on to E[ Z ] of Equation (12), which can be developed asE[ Z ] = 1 T ˆ T E h tr (cid:16) Y Q Σ T ˆΣ ˆΣ T Σ QY T (cid:17)i = 1 T ˆ T n X i,j =1 E h tr (cid:16) Y Qσ i ˆ σ T i ˆ σ j σ T j QY T (cid:17)i = 1 T ˆ T n X i,j =1 E " tr Y Q − i σ i ˆ σ T i T σ T i Q − i σ i ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! = 1 T ˆ T n X i =1 X j = i E " tr Y Q − i σ i ˆ σ T i T σ T i Q − i σ i ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! + 1 T ˆ T n X i =1 E " tr Y Q − i σ i ˆ σ T i ˆ σ i σ T i Q − i (1 + T σ T i Q − i σ i ) Y T ! ≡ Z + Z . In the term Z , reproducing the proof of Lemma 1 with the condition k ˆ X k bounded, we obtain that ˆ σ T i ˆ σ i ˆ T concentrates around T tr Φ ˆ X ˆ X , whichallows us to write Z = 1 T ˆ T n X i =1 E " tr Y Q − i σ i tr(Φ ˆ X ˆ X ) σ T i Q − i (1 + T σ T i Q − i σ i ) Y T ! + 1 T ˆ T n X i =1 E " tr Y Q − i σ i (cid:0) ˆ σ T i ˆ σ i − tr Φ ˆ T (cid:1) σ T i Q − i (1 + T σ T i Q − i σ i ) Y T ! = 1 T tr(Φ ˆ X ˆ X )ˆ T n X i =1 E " tr Y Q − i σ i σ T i Q − i (1 + T σ T i Q − i σ i ) Y T ! + 1 T n X i =1 E " tr Y Qσ i ˆ σ T i ˆ σ i − tr Φ ˆ T ˆ T ! σ T i QY T ! ≡ Z + Z with D = diag( { T σ T i ˆ σ i − T tr Φ ˆ T ˆ T } ni =1 ) and thus Z can be rewritten as Z = 1 T E (cid:20) tr (cid:18) Y Q Σ T √ T D Σ Q √ T Y T (cid:19)(cid:21) = O ( n ε − ) imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS while for Z , following the same arguments as previously, we have Z = 1 T tr Φ ˆ X ˆ X ˆ T n X i =1 E " tr Y Q − i σ i σ T i Q − i (1 + T σ T i Q − i σ i ) Y T ! = 1 T tr Φ ˆ X ˆ X ˆ T n X i =1 δ ) E h tr (cid:16) Y Q − i σ i σ T i Q − i Y T (cid:17)i + 1 T tr Φ ˆ X ˆ X ˆ T n X i =1 δ ) E (cid:20) tr (cid:16) Y Qσ i σ T i QY T (cid:17) (cid:18) (1 + δ ) − (1 + 1 T σ T i Q − i σ i ) (cid:19)(cid:21) = 1 T tr Φ ˆ X ˆ X ˆ T n X i =1 δ ) E h tr (cid:16) Y Q − i Φ X Q − i Y T (cid:17)i + 1 T tr Φ ˆ X ˆ X ˆ T n X i =1 δ ) E h tr (cid:16) Y Q Σ T D Σ QY T (cid:17)i = nT E h tr (cid:16) Y Q − Φ X Q − Y T (cid:17)i tr(Φ ˆ X ˆ X )ˆ T (1 + δ ) + O ( n ε − )where D = diag( { (1 + δ ) − (1 + T σ T i Q − i σ i ) } ni =1 ).Since E[ Q − AQ − ] = E[ QAQ ] + O k·k ( n ε − ), we are free to plug in theasymptotic equivalent of E[ QAQ ] derived in Section 5.2.3, and we deduce Z = nT E " tr Y ¯ Q Φ X ¯ Q + ¯ Q Ψ X ¯ Q · n tr (cid:0) Ψ X ¯ Q Φ X ¯ Q (cid:1) − n tr (cid:0) Ψ X ¯ Q (cid:1) ! Y T tr(Φ ˆ X ˆ X )ˆ T (1 + δ ) = n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr (cid:0) Ψ X ¯ Q (cid:1) T tr(Ψ ˆ X ˆ X ) + O ( n ε − ) . The term Z of the double sum over i and j ( j = i ) needs more efforts.To handle this term, we need to remove the dependence of both σ i and σ j in Q in sequence. We start with j as follows: Z = 1 T ˆ T n X i =1 X j = i E " tr Y Qσ i ˆ σ T i ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! = 1 T ˆ T n X i =1 X j = i E " tr Y Q − j σ i ˆ σ T i ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! − T ˆ T n X i =1 X j = i E " tr Y Q − j σ j σ T j Q − j σ i ˆ σ T i T σ T j Q − j σ j ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! ≡ Z − Z imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. where in the previous to last inequality we used the relation Q = Q − j − Q − j σ j σ T j Q − j T σ T j Q − j σ j . 
For Z , we replace 1 + T σ T j Q − j σ j by 1 + δ and take expectation over w j Z = 1 T ˆ T n X i =1 X j = i E " tr Y Q − j σ i ˆ σ T i ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! = 1 T ˆ T n X j =1 E " tr Y Q − j Σ T − j ˆΣ − j ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! = 1 T ˆ T
11 + δ n X j =1 E h tr (cid:16) Y Q − j Σ T − j ˆΣ − j ˆ σ j σ T j Q − j Y T (cid:17)i + 1 T ˆ T
11 + δ n X j =1 E " tr Y Q − j Σ T − j ˆΣ − j ˆ σ j σ T j Q − j ( δ − T σ T j Q − j σ j )1 + T σ T j Q − j σ j Y T ! ≡ Z + Z . The idea to handle Z is to retrieve forms of the type P nj =1 d j ˆ σ j σ T j =ˆΣ T D Σ for some D satisfying k D k ≤ n ε − with high probability. To thisend, we use Q − j Σ T − j ˆΣ − j T = Q − j Σ T ˆΣ T − Q − j σ j ˆ σ T j T = Q Σ T ˆΣ T + Qσ j σ T j Q − T σ T j Qσ j Σ T ˆΣ T − Q − j σ j ˆ σ T j T and thus Z can be expanded as the sum of three terms that shall be imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS studied in order: Z = 1 T ˆ T
11 + δ n X j =1 E " tr Y Q − j Σ T − j ˆΣ − j ˆ σ j σ T j Q − j ( δ − T σ T j Q − j σ j )1 + T σ T j Q − j σ j Y T ! = 1 T ˆ T
11 + δ E " tr Y Q Σ T ˆΣ T ˆΣ T D Σ QY T ! + 1 T ˆ T
11 + δ n X j =1 E " tr Y Qσ j σ T j Q Σ T ˆΣˆ σ j ( δ − T σ T j Q − j σ j ) σ T j QT (1 − T σ T j Qσ j ) Y T ! − T ˆ T
11 + δ n X j =1 E (cid:20) tr (cid:18) Y Qσ j ˆ σ T j ˆ σ j σ T j Q ( δ − T σ T j Q − j σ j )(1 + 1 T σ T j Q − j σ j ) Y T (cid:19)(cid:21) ≡ Z + Z − Z . where D = diag( { δ − T σ T j Q − j σ j } ni =1 ). First, Z is of order O ( n ε − ) since Q Σ T ˆΣ T is of bounded operator norm. Subsequently, Z can be rewrittenas Z = 1ˆ T
11 + δ E (cid:20) tr (cid:18) Y Q Σ T D Σ T QY T (cid:19)(cid:21) = O ( n ε − )with here D = diag (cid:16) δ − T σ T j Q − j σ j (cid:17) (cid:18) T tr (cid:18) Q − j Σ T − j ˆΣ − j T Φ ˆ XX (cid:19) + T tr ( Q − j Φ) T tr Φ ˆ X ˆ X (cid:19) (1 − T σ T j Qσ j )(1 + T σ T j Q − j σ j ) ni =1 . The same arguments apply for Z but for D = diag (cid:26) tr Φ ˆ X ˆ X T ( δ − T σ T j Q − j σ j )(1 + 1 T σ T j Q − j σ j ) (cid:27) ni =1 which completes to show that | Z | ≤ Cn ε − and thus Z = Z + O ( n ε − )= 1 T ˆ T
11 + δ n X j =1 E h tr (cid:16) Y Q − j Σ T − j ˆΣ − j ˆ σ j σ T j Q − j Y T (cid:17)i + O ( n ε − ) . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
It remains to handle Z . Under the same claims as above, we have Z = 1 T ˆ T
11 + δ n X j =1 E " tr Y Q − j Σ T − j ˆΣ − j T Φ ˆ XX Q − j Y T ! = 1 T ˆ T
11 + δ n X j =1 X i = j E (cid:20) tr (cid:18) Y Q − j σ i ˆ σ T i T Φ ˆ XX Q − j Y T (cid:19)(cid:21) = 1 T ˆ T
11 + δ n X j =1 X i = j E " tr Y Q − ij σ i ˆ σ T i T σ T i Q − ij σ i Φ ˆ XX Q − ij Y T ! − T ˆ T
11 + δ n X j =1 X i = j E " tr Y Q − ij σ i ˆ σ T i T σ T i Q − ij σ i Φ ˆ XX Q − ij σ i σ T i Q − ij T σ T i Q − ij σ i Y T ! ≡ Z − Z where we introduced the notation Q − ij = ( T Σ T Σ − T σ i σ T i − T σ j σ T j + γI T ) − .For Z , we replace T σ T i Q − ij σ i by δ , and take the expectation over w i , imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS as follows Z = 1 T ˆ T
11 + δ n X j =1 X i = j E " tr Y Q − ij σ i ˆ σ T i T σ T i Q − ij σ i Φ ˆ XX Q − ij Y T ! = 1 T ˆ T δ ) n X j =1 X i = j E h tr (cid:16) Y Q − ij σ i ˆ σ T i Φ ˆ XX Q − ij Y T (cid:17)i + 1 T ˆ T δ ) n X j =1 X i = j E " tr Y Q − ij σ i ˆ σ T i ( δ − T σ T i Q − ij σ i )1 + T σ T i Q − ij σ i Φ ˆ XX Q − ij Y T ! = n T ˆ T δ ) E h tr (cid:16) Y Q −− Φ X ˆ X Φ ˆ XX Q −− Y T (cid:17)i + 1 T ˆ T δ ) n X j =1 X i = j E (cid:20) tr (cid:18) Y Q − j σ i ˆ σ T i (cid:18) δ − T σ T i Q − ij σ i (cid:19) Φ ˆ XX Q − j Y T (cid:19)(cid:21) + 1 T ˆ T δ ) n X j =1 X i = j E " tr Y Q − j σ i ˆ σ T i Φ ˆ XX Q − j T σ i σ T i Q − j − T σ T i Q − j σ i Y T (cid:18) δ − T σ T i Q − ij σ i (cid:19)! = n T ˆ T δ ) E h tr (cid:16) Y Q −− Φ X ˆ X Φ ˆ XX Q −− Y T (cid:17)i + 1 T ˆ T δ ) n X j =1 E h tr (cid:16) Y Q − j Σ T − j D ˆΣ − j Φ ˆ XX Q − j Y T (cid:17)i + nT ˆ T δ ) n X j =1 E h Y Q − j Σ T − j D ′ Σ − j Q − j Y T i + O ( n ε − )= n T ˆ T δ ) E h tr (cid:16) Y Q −− Φ X ˆ X Φ ˆ XX Q −− Y T (cid:17)i + O ( n ε − )with Q −− having the same law as Q − ij , D = diag( { δ − T σ T i Q − ij σ i } ni =1 )and D ′ = diag (cid:26) ( δ − T σ T i Q − ij σ i ) T tr ( Φ ˆ XX Q − ij Φ X ˆ X ) (1 − T σ T i Q − j σ i )(1+ T σ T i Q − ij σ i ) (cid:27) ni =1 , both expected to be oforder O ( n ε − ). Using again the asymptotic equivalent of E[ QAQ ] devisedin Section 5.2.3, we then have Z = n T ˆ T δ ) E h tr (cid:16) Y Q −− Φ X ˆ X Φ ˆ XX Q −− Y T (cid:17)i + O ( n ε − )= 1ˆ T tr (cid:16) Y ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ QY T (cid:17) + 1ˆ T tr (cid:0) Ψ X ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ Q (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q )+ O ( n ε − ) . imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Following the same principle, we deduce for Z that Z = 1 T ˆ T
11 + δ n X j =1 X i = j E " tr Y Q − ij σ i ˆ σ T i T σ T i Q − ij σ i Φ ˆ XX Q − ij σ i σ T i Q − ij T σ T i Q − ij σ i Y T ! = 1 T ˆ T δ ) n X j =1 X i = j E (cid:20) tr (cid:16) Y Q − ij σ i σ T i Q − ij Y T (cid:17) T tr (cid:0) Φ ˆ XX Q − ij Φ X ˆ X (cid:1)(cid:21) + 1 T ˆ T δ ) n X j =1 X i = j E h tr (cid:16) Y Q − j σ i D i σ T i Q − j Y T (cid:17)i + O ( n ε − )= n T ˆ T
11 + δ E (cid:20) tr (cid:16) Y Q −− Φ X Q −− Y T (cid:17) T tr (cid:0) Φ ˆ XX Q −− Φ X ˆ X (cid:1)(cid:21) + O ( n ε − )= 1ˆ T tr (cid:0) Ψ ˆ XX ¯ Q Ψ X ˆ X (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) + O ( n ε − ) . with D i = T tr (cid:0) Φ ˆ XX Q − ij Φ X ˆ X (cid:1) (cid:2) (1 + δ ) − (1 + T σ T i Q − ij σ i ) (cid:3) , also believedto be of order O ( n ε − ). Recalling the fact that Z = Z + O ( n ε − ), wecan thus conclude for Z that Z = 1ˆ T tr (cid:16) Y ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ QY T (cid:17) + 1ˆ T tr (cid:0) Ψ X ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ Q (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) − T tr (cid:0) Ψ ˆ XX ¯ Q Ψ X ˆ X (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) + O ( n ε − ) . As for Z , we have Z = 1 T ˆ T n X i =1 X j = i E " tr Y Q − j σ j σ T j Q − j σ i ˆ σ T i T σ T j Q − j σ j ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! = 1 T ˆ T n X j =1 E " tr Y Q − j σ j σ T j Q − j Σ T − j ˆΣ − j T σ T j Q − j σ j ˆ σ j σ T j Q − j T σ T j Q − j σ j Y T ! . Since Q − j T Σ T − j ˆΣ − j is expected to be of bounded norm, using the concen- imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS tration inequality of the quadratic form T σ T j Q − j Σ T − j ˆΣ − j T ˆ σ j , we infer Z = 1 T ˆ T n X j =1 E " tr Y Q − j σ j σ T j Q − j Y T (1 + T σ T j Q − j σ j ) ! (cid:18) T tr (cid:16) Q − j Σ T − j ˆΣ − j Φ ˆ XX (cid:17) + O ( n ε − ) (cid:19) = 1 T ˆ T n X j =1 E " tr Y Q − j σ j σ T j Q − j Y T (1 + T σ T j Q − j σ j ) ! (cid:18) T tr (cid:16) Q − j Σ T − j ˆΣ − j Φ ˆ XX (cid:17)(cid:19) + O ( n ε − ) . We again replace T σ T j Q − j σ j by δ and take expectation over w j to obtain Z = 1 T ˆ T δ ) n X j =1 E (cid:20) tr (cid:16) Y Q − j σ j σ T j Q − j Y T (cid:17) T tr (cid:16) Q − j Σ T − j ˆΣ − j Φ ˆ XX (cid:17)(cid:21) + 1 T ˆ T δ ) n X j =1 E " tr( Y Q − j σ j D j σ T j Q − j Y T )(1 + T σ T j Q − j σ j ) T tr (cid:16) Q − j Σ T − j ˆΣ − j Φ ˆ XX (cid:17) + O ( n ε − )= nT ˆ T δ ) E (cid:20) tr (cid:16) Y Q − Φ X Q − Y T (cid:17) T tr (cid:16) Q − Σ T − ˆΣ − Φ ˆ XX (cid:17)(cid:21) + 1 T ˆ T δ ) E (cid:20) tr (cid:16) Y Q Σ T D Σ QY T (cid:17) T tr (cid:16) Q − Σ T − ˆΣ − Φ ˆ XX (cid:17)(cid:21) + O ( n ε − )with D j = (1 + δ ) − (1 + T σ T j Q − j σ j ) = O ( n ε − ), which eventually bringsthe second term to vanish, and we thus get Z = nT ˆ T δ ) E (cid:20) tr (cid:16) Y Q − Φ X Q − Y T (cid:17) T tr (cid:16) Q − Σ T − ˆΣ − Φ ˆ XX (cid:17)(cid:21) + O ( n ε − ) . For the term T tr (cid:16) Q − Σ T − ˆΣ − Φ ˆ XX (cid:17) we apply again the concentrationinequality to get1 T tr (cid:16) Q − Σ T − ˆΣ − Φ ˆ XX (cid:17) = 1 T X i = j tr (cid:16) Q − j σ i ˆ σ T i Φ ˆ XX (cid:17) = 1 T X i = j tr Q − ij σ i ˆ σ T i T σ T i Q − ij σ i Φ ˆ XX ! = 1 T
11 + δ X i = j tr (cid:16) Q − ij σ i ˆ σ T i Φ ˆ XX (cid:17) + 1 T
11 + δ X i = j tr Q − ij σ i ˆ σ T i ( δ − T σ T i Q − ij σ i )1 + T σ T i Q − ij σ i Φ ˆ XX ! = n − T
11 + δ tr (cid:0) Φ ˆ XX E[ Q −− ]Φ X ˆ X (cid:1) + 1 T
11 + δ tr (cid:16) Q − j Σ T − j D ˆΣ − j Φ ˆ XX (cid:17) + O ( n ε − ) imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL. with high probability, where D = diag( { δ − T σ T i Q − ij σ i } ni =1 ), the norm ofwhich is of order O ( n ε − ). This entails1 T tr (cid:16) Q − Σ T − ˆΣ − Φ ˆ XX (cid:17) = nT
11 + δ tr (cid:0) Φ ˆ XX E[ Q −− ]Φ X ˆ X (cid:1) + O ( n ε − )with high probability. Once more plugging the asymptotic equivalent ofE[ QAQ ] deduced in Section 5.2.3, we conclude for Z that Z = 1ˆ T tr (cid:0) Ψ ˆ XX ¯ Q Ψ X ˆ X (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) + O ( n ε − )and eventually for Z Z = 1ˆ T tr (cid:16) Y ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ QY T (cid:17) + 1ˆ T tr (cid:0) Ψ X ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ Q (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) − T tr (cid:0) Ψ ˆ XX ¯ Q Ψ X ˆ X (cid:1) n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) + O ( n ε − ) . Combining the estimates of E[ Z ] as well as Z and Z , we finally havethe estimates for the test error defined in (12) as E test = 1ˆ T (cid:13)(cid:13)(cid:13) ˆ Y T − Ψ T X ˆ X ¯ QY T (cid:13)(cid:13)(cid:13) F + n tr (cid:0) Y ¯ Q Ψ X ¯ QY T (cid:1) − n tr(Ψ X ¯ Q ) (cid:20) T tr Ψ ˆ X ˆ X + 1ˆ T tr (cid:0) Ψ X ¯ Q Ψ X ˆ X Ψ ˆ XX ¯ Q (cid:1) − T tr (cid:0) Ψ ˆ X ˆ X ¯ Q Ψ X ˆ X (cid:1)(cid:21) + O ( n ε − ) . Since by definition, ¯ Q = (Ψ X + γI T ) − , we may useΨ X ¯ Q = (Ψ X + γI T − γI T ) (Ψ X + γI T ) − = I T − γ ¯ Q in the second term in brackets to finally retrieve the form of Conjecture 1.
6. Concluding Remarks.
This article provides a possible direction ofexploration of random matrices involving entry-wise non-linear transforma-tions (here through the function σ ( · )), as typically found in modelling neuralnetworks, by means of a concentration of measure approach. The main ad-vantage of the method is that it leverages the concentration of an initialrandom vector w (here a Lipschitz function of a Gaussian vector) to trans-fer concentration to all vector σ (or matrix Σ) being Lipschitz functions of imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS w . This induces that Lipschitz functionals of σ (or Σ) further satisfy con-centration inequalities and thus, if the Lipschitz parameter scales with n ,convergence results as n → ∞ . With this in mind, note that we could havegeneralized our input-output model z = β T σ ( W x ) of Section 2 to z = β T σ ( x ; W )for σ : R p × P → R n with P some probability space and W ∈ P a ran-dom variable such that σ ( x ; W ) and σ ( X ; W ) (where σ ( · ) is here appliedcolumn-wise) satisfy a concentration of measure phenomenon; it is not evennecessary that σ ( X ; W ) has a normal concentration so long that the corre-sponding concentration function allows for appropriate convergence results.This generalized setting however has the drawback of being less explicit andless practical (as most neural networks involve linear maps W x rather thannon-linear maps of W and x ).A much less demanding generalization though would consist in changingthe vector w ∼ N ϕ (0 , I p ) for a vector w still satisfying an exponential (notnecessarily normal) concentration. This is the case notably if w = ϕ ( ˜ w ) with ϕ ( · ) a Lipschitz map with Lipschitz parameter bounded by, say, log( n ) orany small enough power of n . This would then allow for w with heavier thanGaussian tails.Despite its simplicity, the concentration method also has some stronglimitations that presently do not allow for a sufficiently profound analysis ofthe testing mean square error. We believe that Conjecture 1 can be provedby means of more elaborate methods. Notably, we believe that the powerfulGaussian method advertised in (Pastur and ˆSerbina, 2011) which relies onStein’s lemma and the Poincar´e–Nash inequality could provide a refinedcontrol of the residual terms involved in the derivation of Conjecture 1.However, since Stein’s lemma (which states that E[ xφ ( x )] = E[ φ ′ ( x )] for x ∼ N (0 ,
1) and differentiable polynomially bounded φ ) can only be usedon products xφ ( x ) involving the linear component x , the latter is not directlyaccessible; we nonetheless believe that appropriate ansatzs of Stein’s lemma,adapted to the non-linear setting and currently under investigation, couldbe exploited.As a striking example, one key advantage of such a tool would be thepossibility to evaluate expectations of the type Z = E[ σσ T ( T σ T Q − σ − α )]which, in our present analysis, was shown to be bounded in the order ofsymmetric matrices by Φ Cn ε − with high probability. Thus, if no matrix(such as ¯ Q ) pre-multiplies Z , since k Φ k can grow as large as O ( n ), Z cannotbe shown to vanish. But such a bound does not account for the fact that imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 C. LOUART ET AL.
Φ would in general be unbounded because of the term ¯ σ ¯ σ T in the displayΦ = ¯ σ ¯ σ T + E[( σ − ¯ σ )( σ − ¯ σ ) T ], where ¯ σ = E[ σ ]. Intuitively, the “mean”contribution ¯ σ ¯ σ T of σσ T , being post-multiplied in Z by T σ T Q − σ − α (whichaverages to zero) disappears; and thus only smaller order terms remain.We believe that the aforementioned ansatzs for the Gaussian tools wouldbe capable of subtly handling this self-averaging effect on Z to prove that k Z k vanishes (for σ ( t ) = t , it is simple to show that k Z k ≤ Cn − ). Inaddition, Stein’s lemma-based methods only require the differentiability of σ ( · ), which need not be Lipschitz, thereby allowing for a larger class ofactivation functions.As suggested in the simulations of Figure 2, our results also seem to extendto non continuous functions σ ( · ). To date, we cannot envision a methodallowing to tackle this setting.In terms of neural network applications, the present article is merely afirst step towards a better understanding of the “hardening” effect occurringin large dimensional networks with numerous samples and large data points(that is, simultaneously large n, p, T ), which we exemplified here throughthe convergence of mean-square errors. The mere fact that some standardperformance measure of these random networks would “freeze” as n, p, T grow at the predicted regime and that the performance would heavily dependon the distribution of the random entries is already in itself an interestingresult to neural network understanding and dimensioning. However, moreinteresting questions remain open. Since neural networks are today dedicatedto classification rather than regression, a first question is the study of theasymptotic statistics of the output z = β T σ ( W x ) itself; we believe that z satisfies a central limit theorem with mean and covariance allowing forassessing the asymptotic misclassification rate.A further extension of the present work would be to go beyond the single-layer network and include multiple layers (finitely many or possibly a numberscaling with n ) in the network design. The interest here would be on thekey question of the best distribution of the number of neurons across thesuccessive layers.It is also classical in neural networks to introduce different (possibly ran-dom) biases at the neuron level, thereby turning σ ( t ) into σ ( t + b ) for arandom variable b different for each neuron. This has the effect of mitigat-ing the negative impact of the mean E[ σ ( w T i x j )], which is independent ofthe neuron index i .Finally, neural networks, despite their having been recently shown to op- imsart-aap ver. 2014/10/16 file: ELM_Romain_MSE.tex date: June 30, 2017 RANDOM MATRIX APPROACH TO NEURAL NETWORKS erate almost equally well when taken random in some very specific scenar-ios, are usually only initiated as random networks before being subsequentlytrained through backpropagation of the error on the training dataset (that is,essentially through convex gradient descent). We believe that our frameworkcan allow for the understanding of at least finitely many steps of gradient de-scent, which may then provide further insights into the overall performanceof deep learning networks.APPENDIX A: INTERMEDIARY LEMMASThis section recalls some elementary algebraic relations and identitiesused throughout the proof section. Lemma . For invertible matrices
A, B , A − − B − = A − ( B − A ) B − . Lemma . For A Hermitian, v a vectorand t ∈ R , if A and A + tvv T are invertible, then (cid:16) A + tvv T (cid:17) − v = A − v tv T A − v . Lemma . For nonnegative definite A and z ∈ C \ R + , k ( A − zI T ) − k ≤ dist( z, R + ) − k A ( A − zI T ) − k ≤ where dist( x, A ) is the Hausdorff distance of a point to a set. In particular,for γ > , k ( A + γI T ) − k ≤ γ − and k A ( A + γI T ) − k ≤ . REFERENCES
REFERENCES

Akhiezer, N. I. and Glazman, I. M. (1993). Theory of Linear Operators in Hilbert Space. Courier Dover Publications.
Bai, Z. D. and Silverstein, J. W. (1998). No eigenvalues outside the support of the limiting spectral distribution of large dimensional sample covariance matrices. The Annals of Probability.
Bai, Z. D. and Silverstein, J. W. (2007). On the signal-to-interference-ratio of CDMA systems in wireless communications. Annals of Applied Probability.
Bai, Z. D. and Silverstein, J. W. (2009). Spectral Analysis of Large Dimensional Random Matrices, second ed. Springer Series in Statistics, New York, NY, USA.
Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis.
Cambria, E., Gastaldo, P., Bisio, F. and Zunino, R. (2015). An ELM-based model for affective analogical reasoning. Neurocomputing.
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. and LeCun, Y. (2015). The loss surfaces of multilayer networks. In AISTATS.
Couillet, R. and Benaych-Georges, F. (2016). Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics.
Couillet, R. and Kammoun, A. (2016). Random matrix improved subspace clustering.
Couillet, R., Pascal, F. and Silverstein, J. W. (2015). The random matrix regime of Maronna's M-estimator with elliptically distributed samples. Journal of Multivariate Analysis.
Giryes, R., Sapiro, G. and Bronstein, A. M. (2015). Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Transactions on Signal Processing.
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks.
Hoydis, J., Couillet, R. and Debbah, M. (2013). Random beamforming over quasi-static and fading channels: a deterministic equivalent approach. IEEE Transactions on Information Theory.
Huang, G.-B., Zhu, Q.-Y. and Siew, C.-K. (2006). Extreme learning machine: theory and applications. Neurocomputing.
Huang, G.-B., Zhou, H., Ding, X. and Zhang, R. (2012). Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science.
Kammoun, A., Kharouf, M., Hachem, W. and Najim, J. (2009). A central limit theorem for the SINR at the LMMSE estimator output for large-dimensional signals. IEEE Transactions on Information Theory.
El Karoui, N. (2009). Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond. The Annals of Applied Probability.
El Karoui, N. (2010). The spectrum of kernel random matrices. The Annals of Statistics.
El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
LeCun, Y., Cortes, C. and Burges, C. (1998). The MNIST database of handwritten digits.
Ledoux, M. (2005). The Concentration of Measure Phenomenon. American Mathematical Society.
Loubaton, P. and Vallet, P. (2010). Almost sure localization of the eigenvalues in a Gaussian information plus noise model. Application to the spiked models. Electronic Journal of Probability.
Mai, X. and Couillet, R. (2017). The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17).
Marčenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Math USSR-Sbornik.
Pastur, L. and Shcherbina, M. (2011). Eigenvalue Distribution of Large Random Matrices. American Mathematical Society.
Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review.
Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability.
Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B. and Ng, A. Y. (2011). On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
Schmidhuber, J. (2015). Deep learning in neural networks: an overview. Neural Networks.
Silverstein, J. W. and Bai, Z. D. (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis.
Silverstein, J. W. and Choi, S. (1995). Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis.
Tao, T. (2012). Topics in Random Matrix Theory. American Mathematical Society.
Titchmarsh, E. C. (1939). The Theory of Functions. Oxford University Press, New York, NY, USA.
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 210–268. Cambridge University Press.
Williams, C. K. I. (1998). Computation with infinite neural networks. Neural Computation.
Yates, R. D. (1995). A framework for uplink power control in cellular radio systems. IEEE Journal on Selected Areas in Communications.
Zhang, T., Cheng, X. and Singer, A. (2014). Marchenko-Pastur law for Tyler's and Maronna's M-estimators. http://arxiv.org/abs/1401.3424.
Liao, Z. and Couillet, R. (2017). A large dimensional analysis of least squares support vector machines. Submitted to Journal of Machine Learning Research.