Exact results on high-dimensional linear regression via statistical physics
Alexander Mozeika, Mansoor Sheikh, Fabian Aguirre-Lopez, Fabrizio Antenucci, Anthony CC Coolen
London Institute for Mathematical Sciences, 35A South Street, London W1K 2XF, United Kingdom.
Dept of Mathematics, King's College London, The Strand, London WC2R 2LS, United Kingdom.
Saddle Point Science Ltd., 35A South Street, London W1K 2XF, United Kingdom.
Dept of Biophysics, Radboud University, 6525AJ Nijmegen, The Netherlands.
(Dated: December 2, 2020)

It is clear that conventional statistical inference protocols need to be revised to deal correctly with the high-dimensional data that are now common. Most recent studies aimed at achieving this revision rely on powerful approximation techniques, which call for rigorous results against which they can be tested. In this context, the simplest case of high-dimensional linear regression has acquired significant new relevance and attention. In this paper we use the statistical physics perspective on inference to derive a number of new exact results for linear regression in the high-dimensional regime.
PACS numbers: 02.50.Tt, 05.10.-a, 64.60.De
Introduction – The advent of modern high-dimensional data poses a significant challenge to statistical inference. The latter is well understood in the conventional regime of growing sample size with constant dimension. For high-dimensional data, where the dimension is of the same order as the sample size, the foundations of inference methods are still fragile, and even the simplest scenario of linear regression [1] has to be revised [2]. The study of linear regression (LR) in the high-dimensional regime has recently attracted significant attention in the mathematics [3–6] and statistical physics communities [7–9]. The statistical physics framework is naturally suited to deal with high-dimensional data. While the connection between statistical physics and information theory was established a while ago by Jaynes [10], the approach has more recently been extended to information processing [11] and machine learning [12]. In the statistical physics framework, the free energy encodes the statistical properties of inference, akin to cumulant generating functions in statistics, but its direct computation via high-dimensional integrals is often difficult. This led to the development of several non-rigorous methods, such as the mean-field approximation, the replica trick and the cavity method [13]. Message passing in particular, which can be seen as an algorithmic implementation of the latter [14], has emerged as an efficient analysis tool for statistical inference in high dimensions [15–17].

Most rigorous results on high-dimensional LR were obtained upon assuming uncorrelated data [4, 8, 15, 17], possibly with sparsity of parameters [3, 6]. Recently, correlations in sampling were analyzed in [16] for rotationally invariant data matrices. In all these studies, however, the parameters of the noise in the data were assumed known, unlike the standard statistical setting where they are usually inferred [1]. The exact prescription of the noise strength is unwelcome, since it artificially removes an important source of overfitting in realistic applications of regression. In high-dimensional LR, inference protocols can mistake noise for signal, reflected in increased under-estimation of the noise and over-estimation of the magnitude of other model parameters. In this paper we derive new exact results for the high-dimensional regime of Bayesian LR which complement the aforementioned rigorous studies, using the statistical physics formulation of inference.
Statement of the problem and preview of results – We consider Bayesian inference of the LR model, t = Zθ + σε, where t ∈ ℝ^N and Z ∈ ℝ^{N×d} are observed and the parameters θ ∈ ℝ^d and σ ∈ ℝ₊ are to be inferred, with ε denoting zero-average noise. We adopt a teacher-student scenario [18, 19]: the teacher samples independently the rows of Z from some probability distribution P(z) and then uses the LR model to obtain t with the true parameters (θ₀, σ₀). We assume that the student then applies the Bayes formula to try to infer (θ₀, σ₀), assuming a Gaussian prior N(0, η⁻¹I_d) for θ, and a generic prior P(σ²) for the noise parameter σ². Specifically, we do not consider the case where the observations are coming from an unknown source and/or one needs to do model selection. We map the LR inference problem onto a Gibbs-Boltzmann distribution with inverse 'temperature' β. This allows us to investigate properties of different inference protocols. In particular, maximum a posteriori (MAP) inference is obtained for β → ∞ and η > 0, maximum likelihood (ML) inference for β → ∞ and η = 0, and marginalization inference for β = 1. We will refer to 'ML (MAP) at finite temperature' for the case of η = 0 (η > 0) and β finite.

The high-dimensional regime is obtained for (N, d) → (∞, ∞) with fixed ratio ζ = d/N ∈ (0, ∞). We will henceforth indicate this limit as (N, d) → ∞, to simplify notation. Note that in order to keep t finite in the (N, d) → ∞ limit, the matrix Z has to be replaced with Z/√d (unless of course we impose a suitable sparsity condition).

Within the above setting we obtain the following results: (i) The ML estimator σ̂² of the noise parameter σ² is self-averaging as (N, d) → ∞ (i.e. its variance vanishes in this limit), for any distributions of Z and ε. We bound the likelihood of deviations of σ̂² from its mean for Gaussian noise ε; (ii) If σ² is known and the distributions of Z and ε are Gaussian, we compute the distribution of the MAP and ML estimators of θ; (iii) We compute the characteristic function of the mean square error d⁻¹‖θ₀ − θ̂_ML[D]‖² for the ML estimator at finite (N, d), where θ₀ are the true parameters; (iv) We determine average and variance of the free energy of ML inference for finite β and finite (N, d). The ML free energy density is self-averaging as (N, d) → ∞ if the spectrum of the covariance matrix ZᵀZ/N is self-averaging. For Gaussian ε and Z, we recover the results obtained by the replica method in [9]; (v) If the true parameters θ₀ are independent random variables, we derive average and variance of the free energy of MAP inference for finite β and (N, d). The MAP free energy is shown to be self-averaging if the spectrum of ZᵀZ/N is self-averaging as (N, d) → ∞. In the following subsections we describe how the above results were obtained, with full mathematical details relegated to the Appendix.

Statistical Physics and Bayesian Inference – We assume that we observe a data sample of 'input-output' pairs {(z₁, t₁), ..., (z_N, t_N)}, where (z_i, t_i) ∈ ℝ^{d+1}, generated randomly and independently from

P(t, z|θ) = P(t|z, θ)P(z),   (1)

with parameters θ that are unknown to us. If we assume a prior distribution P(θ), then the distribution of θ, given the data, follows from the Bayes formula

P(θ|D) = P(θ) ∏_{i=1}^N P(t_i|z_i, θ) / ∫dθ̃ P(θ̃) ∏_{i=1}^N P(t_i|z_i, θ̃).   (2)

Here D = {t, Z}, with t = (t₁, ..., t_N), and Z = (z₁, ..., z_N) is an N×d matrix. In Bayesian language, expression (2) is the posterior distribution of θ, given the prior distribution P(θ) and the observed data D. The simplest way to use (2) for inference is to compute the maximum a posteriori (MAP) estimator

θ̂_MAP[D] = argmin_θ E(θ|D),   (3)

in which the so-called Bayesian likelihood function

E(θ|D) = −∑_{i=1}^N log P(t_i|z_i, θ) − log P(θ)   (4)

consists of a first term, the log-likelihood used also in maximum likelihood (ML) inference, and a second term, which acts as a regularizer. Bayesian inference can thus be seen as a generalization of MAP inference, and MAP inference generalizes ML inference.

The square error d⁻¹‖θ₀ − θ̂[D]‖², with the Euclidean norm ‖···‖ and the true parameters θ₀ underlying the data, is often used to quantify the quality of inference in (3). Its first moment is the mean square error (MSE) d⁻¹⟨⟨‖θ₀ − θ̂[D]‖²⟩_D⟩_{θ₀}. Furthermore, the posterior mean

θ̂[D] = ∫dθ P(θ|D) θ   (5)

(the marginalization estimator) is the minimum MSE (MMSE) estimator in the Bayes optimal case, i.e. when prior distribution and model likelihood are known [19]. The above approaches to Bayesian inference can be unified conveniently in a single statistical physics (SP) formulation via the Gibbs-Boltzmann distribution

P_β(θ|D) = e^{−βE(θ|D)} / Z_β[D],   (6)

with the normalization constant, or 'partition function', Z_β[D] = ∫dθ e^{−βE(θ|D)}. For β = 1 this is the evidence term of Bayesian inference. In statistical physics language, (4) plays the role of 'energy' in (6) and β is the (fictional) inverse temperature. The temperature can be interpreted as a noise amplitude in stochastic gradient descent minimization of E(θ|D) [9]. Properties of the system (6) follow upon evaluating the 'free energy'

F_β[D] = −β⁻¹ log Z_β[D].   (7)

The estimators (3) and (5) are recovered from the average ∫dθ P_β(θ|D)θ by taking the zero 'temperature' limit β → ∞, or by setting β = 1, respectively. This follows upon observing that for β = 1 the distribution (6) and the posterior (2) are identical, and that θ̂_MAP[D] = lim_{β→∞} ∫dθ P_β(θ|D)θ by the Laplace argument [20].
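The following minimal numerical sketch (not from the paper; all parameter values are illustrative) shows the unification expressed by (6): for fixed σ² the Gibbs measure over θ in the linear model below is Gaussian for any β, so the marginalization (β = 1) estimator and the MAP (β → ∞) estimator share the same mean.

```python
# A minimal numerical sketch of the unification in Eq. (6): for fixed sigma^2
# the Gibbs measure over theta in ridge-regularized LR is Gaussian, so the
# beta = 1 posterior mean and the beta -> infinity (MAP) estimator coincide.
# All parameter values are illustrative assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 50                                   # zeta = d/N = 0.25
theta0 = rng.normal(size=d)                      # teacher parameters
sigma0 = 1.0                                     # true noise strength
Z = rng.normal(size=(N, d)) / np.sqrt(d)         # rescaled inputs Z -> Z/sqrt(d)
t = Z @ theta0 + sigma0 * rng.normal(size=N)     # teacher outputs

eta, beta, sigma2 = 0.1, 1.0, sigma0**2          # prior strength, temperature
J_reg = Z.T @ Z + sigma2 * eta * np.eye(d)       # 'regularized' J
mean = np.linalg.solve(J_reg, Z.T @ t)           # Gaussian mean = MAP estimator
cov = sigma2 / beta * np.linalg.inv(J_reg)       # Gaussian covariance ~ 1/beta

samples = rng.multivariate_normal(mean, cov, size=20000)
print(np.max(np.abs(samples.mean(axis=0) - mean)))   # posterior mean ~ MAP mean
```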
We note that the interpretation of the MAP estimator (3) in the SP framework (6) is that θ̂_MAP[D] is the 'ground state' of the system. The Kullback-Leibler (KL) 'distance' [21] between the distribution P(t, z|θ) and its empirical counterpart P̂(t, z|D) = N⁻¹∑_i δ(t − t_i)δ(z − z_i), given by

D(P̂[D]‖P_θ) = ∫dt dz P̂(t, z|D) log[P̂(t, z|D)/P(t, z|θ)],   (8)

can also be used to obtain the ML estimator, via θ̂_ML[D] = argmin_θ D(P̂‖P_θ). Furthermore, since N D(P̂[D]‖P_θ) = E(θ|D) + log P(θ) − N S(P̂[D]), where the last term, minus the Shannon entropy of P̂[D], is independent of θ, the MAP estimator can be obtained via θ̂_MAP[D] = argmin_θ {N D(P̂‖P_θ) − log P(θ)}. Finally, the KL distance (8) can also be used to define the difference ∆D(θ, θ₀|D) = D(P̂[D]‖P_θ) − D(P̂[D]‖P_{θ₀}), where θ₀ are the true parameters responsible for the data, which served as a useful measure of over-fitting in ML inference [22], and was recently extended to MAP inference in generalized linear models [9]. Both latter studies used the SP framework, equivalent to (7), to study typical (as opposed to worst-case) properties of inference in the high-dimensional regime via the average free energy ⟨F_β[D]⟩_D/N as computed by the replica method [13].

Bayesian Linear Regression – In Bayesian linear regression (LR) with Gaussian priors, called ridge regression, it is assumed that the data (z_i, t_i) are for all i sampled independently from the distribution N(t|θ·z, σ²)P(z), so the energy (4) is given by

E(θ, σ²|D) = (1/2σ²)‖t − Zθ‖² + (η/2)‖θ‖² + (N/2) log(2πσ²) − log P(σ²),   (9)

where η ≥ 0 is the precision of the Gaussian prior P(θ). The true parameters of D are written as θ₀ and σ₀², i.e. we assume that t = Zθ₀ + ε with the noise vector ε being sampled from some distribution, e.g. the multivariate Gaussian N(0, σ₀²I_N), with mean 0 and covariance σ₀²I_N. The energy function can also be written as

2σ²E(θ, σ²|D) = σ²[N log(2πσ²) − 2 log P(σ²)] + (θ − J_{σ²η}⁻¹Zᵀt)ᵀ J_{σ²η} (θ − J_{σ²η}⁻¹Zᵀt) + tᵀ(I_N − Z J_{σ²η}⁻¹Zᵀ)t,   (10)

where we defined the d×d matrix J = ZᵀZ, with elements [J]_{kℓ} = ∑_{i=1}^N z_i(k)z_i(ℓ), and its 'regularized' version J_{σ²η} = J + σ²η I_d. The distribution (6) is now

P_β(θ, σ²|D) = P_β(θ|σ², D) e^{−β[F_{β,σ²}[D] + (N/2)log(2πσ²) − log P(σ²)]} / ∫₀^∞ dσ̃² e^{−β[F_{β,σ̃²}[D] + (N/2)log(2πσ̃²) − log P(σ̃²)]},   (11)

where P_β(θ|σ², D) is the Gaussian distribution

P_β(θ|σ², D) = N(θ | J_{σ²η}⁻¹Zᵀt, σ²β⁻¹J_{σ²η}⁻¹)   (12)

with mean J_{σ²η}⁻¹Zᵀt and covariance σ²β⁻¹J_{σ²η}⁻¹. We have also defined the conditional free energy

F_{β,σ²}[D] = d/2β + (1/2σ²) tᵀ(I_N − Z J_{σ²η}⁻¹Zᵀ)t − (1/2β) log|2πe σ²β⁻¹J_{σ²η}⁻¹|,   (13)

while the full free energy associated with (11) is given by

F_β[D] = −β⁻¹ log ∫dθ dσ² e^{−βE(θ,σ²|D)} = −β⁻¹ log ∫₀^∞ dσ² e^{−β[F_{β,σ²}[D] + (N/2)log(2πσ²) − log P(σ²)]}.   (14)

Note that if the noise parameter σ² is known, i.e. P(σ²) = δ(σ² − σ₀²), then F_β[D] = F_{β,σ₀²}[D] + (N/2)log(2πσ₀²) and P_β(θ, σ²|D) = P_β(θ|σ₀², D)δ(σ² − σ₀²). For β → ∞ the free energy is via the Laplace argument given by F_∞[D] = min_{θ,σ²} E(θ, σ²|D); F_∞[D] is the ground state energy of (11). The ground state {θ̂[D], σ̂²[D]} = argmin_{θ,σ²} E(θ, σ²|D) is given by

θ̂[D] = J_{σ²η}⁻¹Zᵀt,   (15)

i.e. the mean of (12), and the solution of the equation

σ² = (1/N)‖t − Zθ̂[D]‖² + (2σ²/N) ∂/∂σ² log P(σ²),   (16)

corresponding to the MAP estimators of the parameters [23]. From the second line in (14) we infer

F_∞[D] = min_{σ²} [F_{∞,σ²}[D] + (N/2)log(2πσ²) − log P(σ²)]   (17)

(again via the Laplace argument), as well as, for (N, d) → ∞, the free energy density f_β[D] = F_β[D]/N at any β:

f_β[D] = min_{σ²} [F_{β,σ²}[D]/N + (1/2)log(2πσ²) − log P(σ²)/N].   (18)

For β = 1 the distribution (11) can be used to compute the MMSE estimators of θ and σ², given by the averages

∫dθ dσ² P_β(θ, σ²|D) θ = ⟨J_{σ²η}⁻¹Zᵀt⟩_{σ²},   ∫dθ dσ² P_β(θ, σ²|D) σ² = ⟨σ²⟩_{σ²},   (19)

where the short-hand ⟨···⟩_{σ²} refers to averaging over the following marginal of the distribution (11):

P_β(σ²|D) = e^{−β[F_{β,σ²}[D] + (N/2)log(2πσ²) − log P(σ²)]} / ∫₀^∞ dσ̃² e^{−β[F_{β,σ̃²}[D] + (N/2)log(2πσ̃²) − log P(σ̃²)]}.   (20)

For (N, d) → ∞ this marginal is dominated by the solution of (18). The dominant value of θ in (19) is (15), but with σ² being the solution of the following equation, which for β = 1 gives the MMSE estimators, and which recovers the MAP estimators (15) and (16) when β → ∞:

σ² = [β/(β − ζ)] (1/N)‖t − Zθ̂[D]‖² − [σ⁴η/(β − ζ)] (1/N) Tr[J_{σ²η}⁻¹] + [2σ²β/((β − ζ)N)] ∂/∂σ² log P(σ²).   (21)

The free energies (13) and (14) obey the Helmholtz free energy relations. In particular, with E(θ|D) = E(θ, σ²|D) − (N/2)log(2πσ²) + log P(σ²), we get

F_{β,σ²}[D] = E_β[D] − T S_β[D],   (22)

where T = 1/β, with the average energy

E_β[D] = ∫dθ P_β(θ|σ², D) E(θ|D),   (23)

and with the differential entropy

S_β[D] = −∫dθ P_β(θ|σ², D) log P_β(θ|σ², D).   (24)

In the free energy (13) we have

E_β[D] = d/2β + (1/2σ²) tᵀ(I_N − Z J_{σ²η}⁻¹Zᵀ)t,   S_β[D] = (1/2) log|2πe σ²β⁻¹J_{σ²η}⁻¹|.   (25)

Furthermore, the average energy can be written as E_β[D] = d/2β + min_θ E(θ|D).
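The joint ground-state equations (15) and (16) can be solved by simple fixed-point iteration, as in the sketch below (our illustration, not from the paper; it assumes a flat prior P(σ²), so the last term of (16) drops out, and η = 0 corresponds to ML).

```python
# A small sketch of the joint MAP/ML ground-state equations (15)-(16),
# iterated to a fixed point. Assumes a flat prior P(sigma^2); eta = 0 is ML.
import numpy as np

rng = np.random.default_rng(1)
N, d, eta, sigma0 = 400, 100, 0.0, 1.0
theta0 = rng.normal(size=d)
Z = rng.normal(size=(N, d)) / np.sqrt(d)            # rescaled inputs
t = Z @ theta0 + sigma0 * rng.normal(size=N)

sigma2 = 1.0                                         # initial guess for sigma^2
for _ in range(100):
    J_reg = Z.T @ Z + sigma2 * eta * np.eye(d)       # J_{sigma^2 eta}
    theta_hat = np.linalg.solve(J_reg, Z.T @ t)      # Eq. (15)
    sigma2_new = np.sum((t - Z @ theta_hat)**2) / N  # Eq. (16), flat prior
    if abs(sigma2_new - sigma2) < 1e-12:
        break
    sigma2 = sigma2_new

# For eta = 0 the estimator concentrates on sigma0^2 (1 - zeta): the noise is
# under-estimated by a factor (1 - d/N), cf. Eq. (32) below.
print(sigma2, sigma0**2 * (1 - d / N))
```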
Distribution of estimators θ̂_MAP and θ̂_ML – If the noise parameter σ² is independent of the realization of the data D, e.g. σ² is known or self-averaging as (N, d) → ∞, and the noise ε has Gaussian statistics N(0, σ₀²I_N), the distribution of the MAP estimator (15) is

P(θ̂) = ⟨N(θ̂ | J_{σ²η}⁻¹Jθ₀, σ₀² J_{σ²η}⁻¹ J J_{σ²η}⁻¹)⟩_Z.   (26)

For η = 0, i.e. ML inference, and without averaging over Z, this recovers Theorem 7.6b in [1]. To probe the (N, d) → ∞ regime we rescale z_i(µ) → z_i(µ)/√d with now z_i(µ) = O(1). This gives J = C/ζ and J_{σ²η} = C_{ζσ²η}/ζ, with the sample covariance matrix [C]_{kℓ} = N⁻¹∑_{i=1}^N z_i(k)z_i(ℓ) and C_{ζσ²η} = C + ζσ²η I, so

P(θ̂) = ⟨N(θ̂ | C_{ζσ²η}⁻¹Cθ₀, ζσ₀² C_{ζσ²η}⁻¹ C C_{ζσ²η}⁻¹)⟩_Z.   (27)

Furthermore, for a Gaussian sample with true covariance matrix Σ, i.e. if each z_i in Z is drawn independently from N(0, Σ), the distribution of θ̂ for any finite (N, d) is the Gaussian mixture

P(θ̂) = ∫dC W(C|Σ/N, d, N) N(θ̂ | C_{ζσ²η}⁻¹[C]Cθ₀, ζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C]).   (28)

The integral is over all symmetric positive definite d×d matrices, and W(C|Σ/N, d, N) is the Wishart distribution, which is non-singular when d ≤ N. Note that (28) also represents the distribution of 'ground states' of (12). For η = 0 the distribution (28) becomes the multivariate Student's t-distribution with N+1−d degrees of freedom:

P(θ̂) = [Γ((N+1)/2)/Γ((N+1−d)/2)] |(1−ζ+1/N)Σ / (π(N+1−d)ζσ₀²)|^{1/2} × [1 + ((1−ζ+1/N)/((N+1−d)ζσ₀²)) (θ̂−θ₀)ᵀ Σ (θ̂−θ₀)]^{−(N+1)/2}.   (29)

The vector of true parameters θ₀ is the mode, and [ζσ₀²/(1−ζ−N⁻¹)]Σ⁻¹ is the covariance matrix of (29). In the regime (N, d) → ∞ one can recover from (29) the moments of the multivariate Gaussian suggested by the replica method [9]. In this regime one indeed finds that any finite subset of components of θ̂_ML is described by a Gaussian distribution [24, 25].

Statistical properties of the estimator σ̂² – For η = 0 the estimator (15) simplifies considerably to

θ̂_ML[D] = (ZᵀZ)⁻¹Zᵀt,   (30)

giving us, via (16), the ML noise estimator

σ̂² = (1/N) εᵀ(I_N − Z(ZᵀZ)⁻¹Zᵀ)ε.   (31)

In particular, if the noise ε originates from a distribution with mean 0 and covariance σ₀²I_N, mean and variance of σ̂² are

⟨σ̂²⟩_ε = σ₀²(1−ζ),   Var(σ̂²) = (2σ₀⁴/N)(1−ζ).   (32)

Hence for (N, d) → ∞ the noise estimator (31) is independent of Z and self-averaging [26]. Furthermore, for finite (N, d) and δ > 0, the probability that σ̂² deviates from its mean by more than δ is bounded by

Prob[σ̂² ∉ (σ₀²(1−ζ)−δ, σ₀²(1−ζ)+δ)] ≤ ⟨e^{−α‖t−Zθ̂_ML[D]‖²}⟩_D e^{αN(σ₀²(1−ζ)−δ)} + ⟨e^{α‖t−Zθ̂_ML[D]‖²}⟩_D e^{−αN(σ₀²(1−ζ)+δ)}.   (33)

Assuming that the noise ε is Gaussian, described by N(0, σ₀²I_N), the moment-generating function (MGF)

⟨e^{α‖t−Zθ̂_ML[D]‖²}⟩_D = e^{−(N/2)(1−ζ) log(1−2ασ₀²)}   (34)

is independent of Z, allowing us to estimate the fluctuations of σ̂² for δ ∈ (0, σ₀²(1−ζ)) via the inequality

Prob[σ̂² ∉ (σ₀²(1−ζ)−δ, σ₀²(1−ζ)+δ)] ≤ e^{−(N/2)[(1−ζ)log((1−ζ)/(1−ζ−δ/σ₀²)) − δ/σ₀²]} + e^{−(N/2)[(1−ζ)log((1−ζ)/(1−ζ+δ/σ₀²)) + δ/σ₀²]}.   (35)

For α = ia with a ∈ ℝ the MGF (34) becomes the characteristic function (CF)

⟨e^{ia‖t−Zθ̂_ML[D]‖²}⟩_D = (1 − 2ia σ₀²)^{−N(1−ζ)/2}   (36)

of the random variable ‖t − Zθ̂_ML[D]‖². Note that (36) is the CF of the gamma distribution (see Theorem 7.6b in [1]), with mean Nσ₀²(1−ζ) and variance 2Nσ₀⁴(1−ζ). Mean and variance of σ̂² are σ₀²(1−ζ) and 2σ₀⁴(1−ζ)/N, respectively. For σ₀² = 1 we obtain that Nσ̂² has a chi-square distribution with N(1−ζ) degrees of freedom, as expected from Cochran's theorem [27]. Finally, it follows from (32) and (21) that the finite temperature ML noise estimator in the high-dimensional regime is given by σ̂² = [β/(β−ζ)] σ₀²(1−ζ). We observe that for β = 1 we obtain unbiased estimation of σ₀². Note that these results also follow from the average free energy computed by the replica method [9].
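A quick Monte Carlo sketch of (31)-(32) (our illustration, with arbitrary parameter choices) is given below; the mean only requires ε to have zero mean and covariance σ₀²I_N, while the variance is checked here for Gaussian noise, for which (34)-(36) hold.

```python
# Monte Carlo sketch of Eq. (32): mean and variance of the ML noise estimator
# (31). Gaussian noise is assumed for the variance check; parameters are
# illustrative.
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma0, trials = 200, 80, 1.0, 4000
zeta = d / N
est = np.empty(trials)
for k in range(trials):
    Z = rng.normal(size=(N, d))
    eps = sigma0 * rng.normal(size=N)
    resid = eps - Z @ np.linalg.solve(Z.T @ Z, Z.T @ eps)   # (I - P) eps
    est[k] = resid @ resid / N                              # Eq. (31)

print("mean:", est.mean(), " theory:", sigma0**2 * (1 - zeta))
print("var :", est.var(),  " theory:", 2 * sigma0**4 * (1 - zeta) / N)
```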
Statistical properties of MSE in ML inference – Using the distribution (29), and with the eigenvalues λ₁(Σ) ≤ λ₂(Σ) ≤ ··· ≤ λ_d(Σ) of the true (population) covariance matrix Σ, the CF of the MSE, defined as d⁻¹‖θ₀ − θ̂_ML[D]‖², can for finite (N, d) be written as

⟨e^{iα d⁻¹‖θ₀−θ̂_ML[D]‖²}⟩_D = ∫₀^∞ dω Γ_{N+1−d}(ω) ∏_{ℓ=1}^d [1 − 2iα ζσ₀²/(ωd(1−ζ+N⁻¹)λ_ℓ(Σ))]^{−1/2},   (37)

with the gamma distribution Γ_ν(ω) = [(ν/2)^{ν/2}/Γ(ν/2)] ω^{ν/2−1} e^{−νω/2} for ν > 0. The last term in (37) is the product of CFs of gamma distributions with the same 'shape' parameter 1/2, but different 'scale' parameters 2ζσ₀²/[ωd(1−ζ+N⁻¹)λ_ℓ(Σ)]. From (37) one obtains mean and variance of the MSE:

d⁻¹⟨‖θ₀−θ̂_ML[D]‖²⟩_D = [ζσ₀²/(1−ζ−N⁻¹)] Tr[Σ⁻¹]/d,   Var(d⁻¹‖θ₀−θ̂_ML[D]‖²) = 2[ζσ₀²/(1−ζ)]² Tr[Σ⁻²]/d².   (38)

The latter gives us the condition for self-averaging of the MSE, i.e. Var(d⁻¹‖θ₀−θ̂_ML[D]‖²) → 0 as (N, d) → ∞. We finally consider deviations of d⁻¹‖θ₀−θ̂_ML[D]‖² from the mean µ(Σ) given in (38). The probability of observing the event d⁻¹‖θ₀−θ̂_ML[D]‖² ∉ (µ(Σ)−δ, µ(Σ)+δ) for δ > 0 is bounded from above as follows:

Prob[d⁻¹‖θ₀−θ̂_ML[D]‖² ∉ (µ(Σ)−δ, µ(Σ)+δ)] ≤ C₋ e^{−NΦ₋[α,µ(λ_d),δ]} + C₊ e^{−NΦ₊[α,µ(λ₁),δ]},   (39)

with some small α > 0 and constants C±. For the rate function Φ₋[α, µ(λ_d), δ] to be positive for arbitrarily small δ it is sufficient that µ(λ_d) ≥ 1, where µ(λ) = ζσ₀²/(1−ζ)λ, while for µ(λ_d) < 1 the admissible δ values must satisfy δ > 1 − µ(λ_d). The rate function Φ₊[α, µ(λ₁), δ] is positive for any δ ∈ (0, µ(λ₁)).

Statistical properties of the free energy – We consider the free energy (13) for finite inverse temperature β and finite (N, d). Assuming that the noise ε has mean 0 and covariance σ₀²I_N, and that the parameter σ² is independent of D, the average free energy is

⟨F_{β,σ²}[D]⟩_D = d/2β + (1/2σ²) θ₀ᵀ⟨J − JJ_{σ²η}⁻¹J⟩_Z θ₀ + (σ₀²/2σ²)(N − ⟨Tr[JJ_{σ²η}⁻¹]⟩_Z) − (1/2β)⟨log|2πe σ²β⁻¹J_{σ²η}⁻¹|⟩_Z.   (40)

Under the same assumptions, the variance of F_{β,σ²}[D] can be obtained by exploiting the Helmholtz free energy representation F_{β,σ²}[D] = E_β[D] − T S_β[D], giving us

Var(F_{β,σ²}[D]) = Var(E_β[D]) + T² Var(S_β[D]) − 2T Cov(E_β[D], S_β[D]).   (41)

The full details on each term in (41) are found in the Appendix.

Free energy of ML inference – For η = 0 and after transforming z_i(µ) → z_i(µ)/√d for all (i, µ), with z_i(µ) = O(1), (40) gives the average free energy density

⟨F_{β,σ²}[D]/N⟩_D = (σ₀²/2σ²)(1−ζ) + (ζ/2β) log[β/(2πσ²ζ)] + (ζ/2β) ∫dλ ρ_d(λ) log(λ),   (42)

where we defined the average eigenvalue density ρ_d(λ) = ⟨ρ_d(λ|Z)⟩_Z of the empirical covariance matrix, with

ρ_d(λ|Z) = (1/d) ∑_{ℓ=1}^d δ[λ − λ_ℓ(ZᵀZ/N)].   (43)

The variance of the free energy density is

Var(F_{β,σ²}[D]/N) = Var(E_β[D]/N) + T² Var(S_β[D]/N) = (ζ/2β)² ∫dλ dλ̃ C_d(λ, λ̃) log(λ) log(λ̃) + σ₀⁴(1−ζ)/(2σ⁴N),   (44)

where we have defined the correlation function C_d(λ, λ̃) = ⟨ρ_d(λ|Z)ρ_d(λ̃|Z)⟩_Z − ⟨ρ_d(λ|Z)⟩_Z⟨ρ_d(λ̃|Z)⟩_Z. Clearly, if ∫dλ dλ̃ C_d(λ, λ̃) f(λ, λ̃) → 0 as (N, d) → ∞, for any smooth function f(λ, λ̃), then the free energy density f_β[D] = F_β[D]/N is self-averaging. Finally, if we use (42) in the free energy density (18) for η = 0, and assume Gaussian data with true population covariance matrix Σ = I_d, then for β ∈ [ζ, ∞) we find

lim_{N→∞} f_β[D] = [(β−ζ)/2β] log[2πσ₀²(1−ζ)/(β−ζ)] + [log(β)+1]/2 − (1/2β)[ζ log ζ + (1−ζ) log(1−ζ) + 2ζ],   (45)

with the convention 0 log 0 = 0. For β ∈ (0, ζ) we get lim_{N→∞} f_β[D] = −∞. Since for λ ∈ [a₋, a₊] and 0 < ζ < 1 the density ρ_d(λ|Z) converges to (2πλζ)⁻¹√((λ−a₋)(a₊−λ)) in a distributional sense as (N, d) → ∞ [28], with a± = (1±√ζ)², the free energy density is self-averaging. Its values are plotted versus the temperature in Figure 1. Furthermore, the average free energy density (45) is identical to that of [9]. Since lim_{β↓ζ} lim_{N→∞} f_β[D] is finite, the system exhibits a zeroth-order phase transition [29] at T = 1/ζ.
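The asymptotic formula (45) can be checked against a direct finite-N evaluation of (18) with (42), using the empirical spectrum of ZᵀZ/N; a minimal sketch (our illustration, parameter values are assumptions) follows.

```python
# Sketch: asymptotic ML free energy density (45) versus a finite-N evaluation
# of (18) with (42), using the empirical spectrum of Z^T Z / N for Gaussian
# data with Sigma = I_d. Parameter values are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

N, d, sigma0, beta = 4000, 1000, 1.0, 2.0
zeta = d / N
rng = np.random.default_rng(3)
Z = rng.normal(size=(N, d))
lam = np.linalg.eigvalsh(Z.T @ Z / N)            # empirical eigenvalues

def f_of_sigma2(s2):                              # Eq. (42) inside Eq. (18)
    return (sigma0**2 * (1 - zeta) / (2 * s2)
            + zeta / (2 * beta) * np.log(beta / (2 * np.pi * s2 * zeta))
            + zeta / (2 * beta) * np.mean(np.log(lam))
            + 0.5 * np.log(2 * np.pi * s2))

res = minimize_scalar(f_of_sigma2, bounds=(1e-3, 10), method="bounded")

f_asym = ((beta - zeta) / (2 * beta)
          * np.log(2 * np.pi * sigma0**2 * (1 - zeta) / (beta - zeta))
          + (np.log(beta) + 1) / 2
          - (zeta * np.log(zeta) + (1 - zeta) * np.log(1 - zeta) + 2 * zeta)
          / (2 * beta))
print(res.fun, f_asym)                            # should agree for large N
```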
Free energy of MAP inference – We next assume that the true parameters θ₀ are drawn at random, with mean 0 and covariance matrix S²I_d. As before we rescale z_i(µ) → z_i(µ)/√d where z_i(µ) = O(1), and define J = C/ζ (so that C = ZᵀZ/N) and C_{ζσ²η} = ζ J_{σ²η}. Then the average of (40) over θ₀ becomes

⟨⟨F_{β,σ²}[D]/N⟩_D⟩_{θ₀} = ζ/2β + (S²ζη/2) ∫dλ ρ_d(λ) λ/(λ+ζσ²η) + (σ₀²/2σ²)[1 − ζ ∫dλ ρ_d(λ) λ/(λ+ζσ²η)] + (ζ/2β) ∫dλ ρ_d(λ) log(λ+ζσ²η) − (ζ/2β) log(2πe σ²β⁻¹ζ).   (46)

Furthermore, using (41), we obtain, under the same assumptions, that Var(F_{β,σ²}[D]/N) is of the form (see section F 4 in the Appendix):

Var(F_{β,σ²}[D]/N) = ∫dλ dλ̃ C_d(λ, λ̃) Φ(λ, λ̃) + O(1/N).   (47)

[FIG. 1. Asymptotic free energy density f_β = lim_{N→∞} f_β[D] of finite temperature ML inference as a function of temperature T = 1/β, plotted for a range of values of ζ (from right to left), in the high-dimensional regime where N, d → ∞ with fixed ratio ζ = d/N. For β → ∞ it approaches the value (1/2) log[2πe σ₀²(1−ζ)]. For β → ζ it approaches (1/2ζ)[ζ log(1−ζ) − log(1−ζ) − ζ], and for β ∈ (0, ζ) the free energy density is −∞. Here the true noise parameter is σ₀² = 1 and the true data covariance matrix is I_d.]

Hence for η > 0 the free energy density of MAP inference is self-averaging whenever ρ_d(λ|Z) is self-averaging (since then C_d(λ, λ̃) → 0 as (N, d) → ∞). This is the case e.g. for Gaussian data with covariance matrix Σ = I_d.

Summary and outlook – The above results emphasize that much can still be learned about high-dimensional Bayesian linear regression from exact calculations with standard methods. Many questions remain open, and we hope that this paper may contribute to future work in this direction. Some results appear well within reach, such as the extension to sub-Gaussian noise of the argument that leads to (35), employing techniques used in [30]. Other results are less immediate but seem feasible, such as extending some of the ML results to MAP inference, starting from evaluation of the distribution of θ̂_MAP (28) for (N, d) → ∞. Another interesting line of work would be to try to extend our present results to generalized linear models (GLMs). Other crucial investigations, such as a rigorous analytical study of the effect of model mismatch, appear instead to be still quite challenging with current techniques.

ACKNOWLEDGMENTS
We would like to thank Dr Heather Battey and Sir David Cox for reading a first draft of this manuscript and for their valuable suggestions. AM is supported by Cancer Research UK (award C45074/A26553) and the UK's Medical Research Council (award MR/R014043/1). FA is supported by EIT Health ID 19277 (RGS@HOME) under H2020. FAL gratefully acknowledges financial support through a scholarship from Conacyt (Mexico).
Appendix A: Ingredients
We write I_N for the N×N identity matrix. The data D = {t, Z}, where t ∈ ℝ^N and Z = (z₁, ..., z_N) is the N×d matrix, is a set of observed 'input-output' pairs {(z₁, t₁), ..., (z_N, t_N)} generated by the process

t = Zθ₀ + ε.   (A1)

The true parameter vector θ₀ ∈ ℝ^d is unknown, and the vector ε ∈ ℝ^N represents noise with mean 0 and covariance σ₀²I_N, with also the true noise parameter σ₀² unknown to us. The (empirical) covariance matrix of the input data is

J[Z] = ZᵀZ,   (A2)

where [J[Z]]_{kℓ} = ∑_{i=1}^N z_i(k)z_i(ℓ). To simplify notation we will sometimes omit the dependence on Z and write J ≡ J[Z]. The maximum a posteriori (MAP) estimator of θ in linear regression with Gaussian prior N(0, η⁻¹I_d) is

θ̂[D] = J_{σ²η}⁻¹Zᵀt,   (A3)

where J_{σ²η} = J + σ²η I_d. For η = 0 the above gives us the maximum likelihood (ML) estimator

θ̂[D] = J⁻¹Zᵀt.   (A4)

We are interested in the high-dimensional regime: (N, d) → (∞, ∞) with fixed ζ = d/N > 0, which we will write as (N, d) → ∞ to simplify notation.

Appendix B: Distribution of the θ̂ estimator in MAP inference

Let us assume that the noise parameter σ² is independent of the data D, and that the noise ε is sampled from the Gaussian distribution N(0, σ₀²I_N). The distribution of the MAP estimator (A3) can then be computed as follows:

P(θ̂) = ⟨δ(θ̂ − J_{σ²η}⁻¹Zᵀt)⟩_D = ⟨δ(θ̂ − J_{σ²η}⁻¹Zᵀ(Zθ₀ + ε))⟩_D = ⟨⟨δ(θ̂ − J_{σ²η}⁻¹Jθ₀ − J_{σ²η}⁻¹Zᵀε)⟩_ε⟩_Z
= ∫ dx/(2π)^d ⟨e^{ixᵀθ̂} e^{−ixᵀJ_{σ²η}⁻¹Jθ₀} ⟨e^{−ixᵀJ_{σ²η}⁻¹Zᵀε}⟩_ε⟩_Z
= ∫ dx/(2π)^d ⟨e^{−(σ₀²/2) xᵀ J_{σ²η}⁻¹ J J_{σ²η}⁻¹ x + ixᵀ(θ̂ − J_{σ²η}⁻¹Jθ₀)}⟩_Z
= ⟨N(θ̂ | J_{σ²η}⁻¹Jθ₀, σ₀² J_{σ²η}⁻¹ J J_{σ²η}⁻¹)⟩_Z.   (B1)

To take the limit (N, d) → ∞, we rescale z_i(µ) → z_i(µ)/√d with z_i(µ) = O(1). Now J = C/ζ, where ζ = d/N, [C]_{µν} = N⁻¹∑_i z_i(µ)z_i(ν), and J_{σ²η}⁻¹ = ζ C_{ζσ²η}⁻¹, giving us the distribution

P(θ̂) = ⟨N(θ̂ | C_{ζσ²η}⁻¹Cθ₀, ζσ₀² C_{ζσ²η}⁻¹ C C_{ζσ²η}⁻¹)⟩_Z,   (B2)

with C_{ζσ²η} ≡ C + ζσ²η I. Furthermore, since C ≡ C[Z] and C_{ζσ²η}⁻¹ ≡ C_{ζσ²η}⁻¹[C[Z]], we have that

P(θ̂) = ⟨N(θ̂ | C_{ζσ²η}⁻¹[C[Z]]C[Z]θ₀, ζσ₀² C_{ζσ²η}⁻¹[C[Z]] C[Z] C_{ζσ²η}⁻¹[C[Z]])⟩_Z
= ∫dC {∏_{i=1}^N ∫P(z_i)dz_i} δ(C − C[Z]) N(θ̂ | C_{ζσ²η}⁻¹[C]Cθ₀, ζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C])
= ∫dC ∫dĈ/(2π)^{d²} {∏_{i=1}^N ∫P(z_i)dz_i} exp[iTr{Ĉ(C − C[Z])}] N(θ̂ | C_{ζσ²η}⁻¹[C]Cθ₀, ζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C]).   (B3)

Let us now consider the following average, assuming that z is sampled from the Gaussian distribution N(0, Σ):

{∏_{i=1}^N ∫P(z_i)dz_i} exp[−iTr{Ĉ C[Z]}] = {∏_{i=1}^N ∫P(z_i)dz_i} exp[−(i/N) Tr{Ĉ ∑_{i=1}^N z_i z_iᵀ}] = [∫P(z)dz e^{−(i/N)Tr{Ĉzzᵀ}}]^N = |I + (2i/N)ΣĈ|^{−N/2}.   (B4)

This is the characteristic function of the Wishart distribution [31], defined by the density

W(J|Σ/N, d, N) = |Σ/N|^{−N/2} |J|^{(N−d−1)/2} e^{−(N/2)Tr(JΣ⁻¹)} / [2^{Nd/2} π^{d(d−1)/4} ∏_{ℓ=1}^d Γ((N+1−ℓ)/2)].   (B5)

The Wishart distribution is singular when d > N. Thus for Gaussian z the distribution (B1) is the Gaussian mixture

P(θ̂) = ∫dC W(C|Σ/N, d, N) N(θ̂ | C_{ζσ²η}⁻¹[C]Cθ₀, ζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C])
= ∫dC [|Σ/N|^{−N/2} |C|^{(N−d−1)/2} e^{−(N/2)Tr(CΣ⁻¹)} / (2^{Nd/2} π^{d(d−1)/4} ∏_{ℓ=1}^d Γ((N+1−ℓ)/2))] × e^{−(1/2)(θ̂−C_{ζσ²η}⁻¹Cθ₀)ᵀ [ζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C]]⁻¹ (θ̂−C_{ζσ²η}⁻¹Cθ₀)} / |2πζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C]|^{1/2}.   (B6)

We note that an alternative derivation of this result is provided in [9].
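The mixture representation (B6) can be probed numerically: sampling θ̂ directly from the regression model should be statistically indistinguishable from first drawing C from W(C|Σ/N, d, N) and then drawing a Gaussian with C-dependent covariance. A minimal sketch (our illustration, for η = 0 and assumed parameter values) follows.

```python
# Monte Carlo sketch of the Gaussian-mixture representation (B6), for eta = 0:
# theta_hat sampled directly from the regression model versus theta_hat
# sampled via C ~ W(C | Sigma/N, d, N) followed by a Gaussian draw.
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(4)
N, d, sigma0, trials = 100, 25, 1.0, 3000
zeta, Sigma = d / N, np.eye(25)
theta0 = np.zeros(d)

direct, mixture = np.empty((trials, d)), np.empty((trials, d))
for k in range(trials):
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=N) / np.sqrt(d)
    t = Z @ theta0 + sigma0 * rng.normal(size=N)
    direct[k] = np.linalg.solve(Z.T @ Z, Z.T @ t)
    C = wishart(df=N, scale=Sigma / N).rvs(random_state=rng)
    mixture[k] = rng.multivariate_normal(theta0, zeta * sigma0**2 * np.linalg.inv(C))

# Both routes target the same law of theta_hat (a multivariate Student-t with
# N + 1 - d degrees of freedom when eta = 0), so the covariances should match.
print(np.cov(direct.T)[0, 0], np.cov(mixture.T)[0, 0])
```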
Appendix C: Distribution of the θ̂ estimator in ML inference

Let us consider the following integral appearing in the distribution (B6) of the MAP estimator:

∫dC |C|^{(N−d−1)/2} e^{−(N/2)Tr(CΣ⁻¹)} e^{−(1/2)(θ̂−C_{ζσ²η}⁻¹Cθ₀)ᵀ[ζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C]]⁻¹(θ̂−C_{ζσ²η}⁻¹Cθ₀)} / |2πζσ₀² C_{ζσ²η}⁻¹[C] C C_{ζσ²η}⁻¹[C]|^{1/2}.

For η = 0, i.e. ML inference, we have C_{ζσ²η}⁻¹C = I_d and the Gaussian covariance reduces to ζσ₀²C⁻¹, so this integral simplifies to

∫dC |C|^{(N−d−1)/2} e^{−(1/2)Tr{C[NΣ⁻¹ + (1/ζσ₀²)(θ̂−θ₀)(θ̂−θ₀)ᵀ]}} / |2πζσ₀²C⁻¹|^{1/2} = (2πζσ₀²)^{−d/2} ∫dC |C|^{(N−d)/2} e^{−(1/2)Tr{C[NΣ⁻¹ + (1/ζσ₀²)(θ̂−θ₀)(θ̂−θ₀)ᵀ]}},

and can be computed by using the normalization identity ∫dJ W(J|Σ, d, N) = 1, from which one obtains

∫dC |C|^{(N−d−1)/2} e^{−(1/2)Tr(CΣ⁻¹)} = |Σ|^{N/2} 2^{Nd/2} π^{d(d−1)/4} ∏_{ℓ=1}^d Γ((N+1−ℓ)/2).   (C1)

Hence also

∫dC |C|^{(N−d)/2} e^{−(1/2)Tr(CΣ⁻¹)} = |Σ|^{(N+1)/2} 2^{(N+1)d/2} π^{d(d−1)/4} ∏_{ℓ=1}^d Γ((N+2−ℓ)/2),   (C2)

which gives us the result

∫dC |C|^{(N−d)/2} e^{−(1/2)Tr{C[NΣ⁻¹ + (1/ζσ₀²)(θ̂−θ₀)(θ̂−θ₀)ᵀ]}} = |NΣ⁻¹ + (θ̂−θ₀)(θ̂−θ₀)ᵀ/ζσ₀²|^{−(N+1)/2} 2^{(N+1)d/2} π^{d(d−1)/4} ∏_{ℓ=1}^d Γ((N+2−ℓ)/2).   (C3)

For η = 0 the distribution (B6) thereby becomes

P(θ̂) = [|Σ/N|^{−N/2} / (2^{Nd/2} π^{d(d−1)/4} ∏_{ℓ=1}^d Γ((N+1−ℓ)/2))] ∫dC |C|^{(N−d−1)/2} e^{−(N/2)Tr(CΣ⁻¹)} e^{−(1/2ζσ₀²)(θ̂−θ₀)ᵀC(θ̂−θ₀)} / |2πζσ₀²C⁻¹|^{1/2}
= (2πζσ₀²)^{−d/2} [∏_{ℓ=1}^d Γ((N+2−ℓ)/2)/Γ((N+1−ℓ)/2)] 2^{d/2} |Σ/N|^{−N/2} |NΣ⁻¹ + (1/ζσ₀²)(θ̂−θ₀)(θ̂−θ₀)ᵀ|^{−(N+1)/2}
= π^{−d/2} [Γ((N+1)/2)/Γ((N+1−d)/2)] |Σ/(ζσ₀²N)|^{1/2} [1 + (θ̂−θ₀)ᵀ (Σ/(ζσ₀²N)) (θ̂−θ₀)]^{−(N+1)/2}.   (C4)

The last line in the above was obtained using the 'matrix determinant lemma'. Thus, after slight rearrangement,

P(θ̂) = [π(N+1−d)]^{−d/2} [Γ((N+1)/2)/Γ((N+1−d)/2)] |(1−ζ+1/N)Σ/(ζσ₀²)|^{1/2} [1 + (θ̂−θ₀)ᵀ ((1−ζ+1/N)Σ/(ζσ₀²(N+1−d))) (θ̂−θ₀)]^{−(N+1)/2},   (C5)

which is the multivariate Student's t-distribution, with N+1−d degrees of freedom, 'location' vector θ₀ and 'shape' matrix ζσ₀²Σ⁻¹/(1−ζ+1/N).

Appendix D: Statistical properties of the σ̂² estimator in ML inference

In ML inference the estimator of θ is given by (A4), and the estimator of the noise parameter σ² is given by

σ̂²[D] = (1/N)‖t − Zθ̂[D]‖² = (1/N)‖t − Z(ZᵀZ)⁻¹Zᵀt‖² = (1/N)‖(I_N − Z(ZᵀZ)⁻¹Zᵀ)t‖² = (1/N)‖(I_N − Z(ZᵀZ)⁻¹Zᵀ)(Zθ₀ + ε)‖² = (1/N)‖(I_N − Z(ZᵀZ)⁻¹Zᵀ)ε‖² = (1/N) εᵀ(I_N − Z(ZᵀZ)⁻¹Zᵀ)ε.   (D1)

In the above we used (I_N − Z(ZᵀZ)⁻¹Zᵀ)Zθ₀ = Zθ₀ − Z(ZᵀZ)⁻¹ZᵀZθ₀ = 0 and (I_N − Z(ZᵀZ)⁻¹Zᵀ)² = I_N − Z(ZᵀZ)⁻¹Zᵀ, i.e. I_N − Z(ZᵀZ)⁻¹Zᵀ is an idempotent matrix. For ε sampled from any distribution with mean 0 and covariance σ₀²I_N, the average and variance of σ̂²[D] are (by Wick's theorem):

⟨σ̂²[D]⟩_ε = (1/N)⟨εᵀ(I_N − Z(ZᵀZ)⁻¹Zᵀ)ε⟩_ε = (σ₀²/N) Tr(I_N − Z(ZᵀZ)⁻¹Zᵀ) = σ₀²(1−ζ),   (D2)
⟨σ̂⁴[D]⟩_ε − ⟨σ̂²[D]⟩_ε² = (2σ₀⁴/N)(1−ζ).   (D3)

Next we are interested in the probability of the event σ̂²[D] ∉ (σ₀²(1−ζ)−δ, σ₀²(1−ζ)+δ). This is given by

Prob[(1/N)∑_{i=1}^N (t_i − θ̂[D]·z_i)² ∉ (σ₀²(1−ζ)−δ, σ₀²(1−ζ)+δ)] = Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≤ N(σ₀²(1−ζ)−δ)] + Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≥ N(σ₀²(1−ζ)+δ)].   (D4)

First, we consider the probability
Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≥ N(σ₀²(1−ζ)+δ)] = Prob[e^{α∑_i (t_i−θ̂[D]·z_i)²} ≥ e^{αN(σ₀²(1−ζ)+δ)}] ≤ ⟨e^{α∑_i (t_i−θ̂[D]·z_i)²}⟩_D e^{−αN(σ₀²(1−ζ)+δ)},   (D5)

where α > 0 and we used the Markov inequality. We now assume the noise to be Gaussian and consider the moment-generating function

⟨e^{α∑_i (t_i−θ̂[D]·z_i)²}⟩_D = ⟨e^{αεᵀ(I_N − Z(ZᵀZ)⁻¹Zᵀ)ε}⟩_D
= ∫dε N(ε|0, σ₀²I_N) ⟨e^{αεᵀ(I_N − Z(ZᵀZ)⁻¹Zᵀ)ε}⟩_Z
= (2πσ₀²)^{−N/2} ⟨∫dε e^{−(1/2)εᵀ[I_N/σ₀² − 2α(I_N − Z(ZᵀZ)⁻¹Zᵀ)]ε}⟩_Z
= ⟨|I_N − 2ασ₀²(I_N − Z(ZᵀZ)⁻¹Zᵀ)|^{−1/2}⟩_Z
= ⟨|I_N(1−2ασ₀²) + 2ασ₀² Z(ZᵀZ)⁻¹Zᵀ|^{−1/2}⟩_Z
= (1−2ασ₀²)^{−N/2} ⟨|I_N + [2ασ₀²/(1−2ασ₀²)] Z(ZᵀZ)⁻¹Zᵀ|^{−1/2}⟩_Z.   (D6)

Now Z(ZᵀZ)⁻¹Zᵀ is a projection matrix, with eigenvalues λ_i ∈ {0, 1} and Tr[Z(ZᵀZ)⁻¹Zᵀ] = d, giving us

⟨e^{α∑_i (t_i−θ̂[D]·z_i)²}⟩_D = (1−2ασ₀²)^{−N/2} e^{−(d/2) log[1 + 2ασ₀²/(1−2ασ₀²)]} = e^{−(N/2)[ζ log(1/(1−2ασ₀²)) + log(1−2ασ₀²)]} = e^{−(N/2)(1−ζ) log(1−2ασ₀²)}.   (D7)

Hence

Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≥ N(σ₀²(1−ζ)+δ)] ≤ e^{−(N/2)(1−ζ)log(1−2ασ₀²)} e^{−αN(σ₀²(1−ζ)+δ)} = e^{−NΦ(α)}.   (D8)

The rate function

Φ(α) = [(1−ζ)/2] log(1−2ασ₀²) + α(σ₀²(1−ζ)+δ)   (D9)

has a maximum at

α₀ = δ / [2σ₀²((1−ζ)σ₀² + δ)],   (D10)

such that max_α Φ(α) = (1/2){(1−ζ) log[(1−ζ)/(1−ζ+δ/σ₀²)] + δ/σ₀²}, and hence

Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≥ N(σ₀²(1−ζ)+δ)] ≤ e^{−(N/2)[(1−ζ)log((1−ζ)/(1−ζ+δ/σ₀²)) + δ/σ₀²]}.   (D11)

We note that in the above expression the rate function (1−ζ)log[(1−ζ)/(1−ζ+δ/σ₀²)] + δ/σ₀² vanishes when δ/σ₀² = 0, and is a monotonically increasing function of δ/σ₀². Second, for α > 0,

Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≤ N(σ₀²(1−ζ)−δ)] = Prob[e^{−α∑_i(t_i−θ̂[D]·z_i)²} ≥ e^{−αN(σ₀²(1−ζ)−δ)}] ≤ ⟨e^{−α∑_i(t_i−θ̂[D]·z_i)²}⟩_D e^{αN(σ₀²(1−ζ)−δ)} = e^{−(N/2)(1−ζ)log(1+2ασ₀²)} e^{αN(σ₀²(1−ζ)−δ)} = e^{−Nφ(α)},   (D12)

where

φ(α) = [(1−ζ)/2] log(1+2ασ₀²) − α(σ₀²(1−ζ)−δ).   (D13)

Here we used the Markov inequality and the result (D7) with α → −α. The rate function φ(α) has a maximum at α₀ = δ/[2σ₀²((1−ζ)σ₀² − δ)], such that max_α φ(α) = (1/2){(1−ζ) log[(1−ζ)/(1−ζ−δ/σ₀²)] − δ/σ₀²}, and hence

Prob[∑_{i=1}^N (t_i − θ̂[D]·z_i)² ≤ N(σ₀²(1−ζ)−δ)] ≤ e^{−(N/2)[(1−ζ)log((1−ζ)/(1−ζ−δ/σ₀²)) − δ/σ₀²]}.   (D14)

Here the rate function (1−ζ)log[(1−ζ)/(1−ζ−δ/σ₀²)] − δ/σ₀² vanishes when δ/σ₀² = 0, and is a monotonically increasing function of δ/σ₀² when σ₀²(1−ζ) > δ. Finally, combining the inequalities (D11) and (D14) we obtain the inequality

Prob[σ̂²[D] ∉ (σ₀²(1−ζ)−δ, σ₀²(1−ζ)+δ)] ≤ e^{−(N/2)[(1−ζ)log((1−ζ)/(1−ζ−δ/σ₀²)) − δ/σ₀²]} + e^{−(N/2)[(1−ζ)log((1−ζ)/(1−ζ+δ/σ₀²)) + δ/σ₀²]},   (D15)

which is valid for δ ∈ (0, σ₀²(1−ζ)).
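Since Nσ̂²/σ₀² is chi-square with N(1−ζ) degrees of freedom for Gaussian noise, the bound (D15) is easy to test empirically; a minimal sketch (our illustration, with assumed parameter values) follows.

```python
# Sketch comparing the empirical tail probability of sigma_hat^2 with the
# Chernoff-type bound (D15), for Gaussian noise. N sigma_hat^2 / sigma0^2 is
# chi-square with N(1 - zeta) degrees of freedom, so we sample it directly.
import numpy as np

rng = np.random.default_rng(5)
N, zeta, sigma0, delta, trials = 200, 0.3, 1.0, 0.2, 200_000
dof = int(N * (1 - zeta))

s2 = sigma0**2 * rng.chisquare(dof, size=trials) / N    # samples of sigma_hat^2
centre = sigma0**2 * (1 - zeta)
emp = np.mean(np.abs(s2 - centre) >= delta)

x = delta / sigma0**2
rate_lo = (1 - zeta) * np.log((1 - zeta) / (1 - zeta - x)) - x
rate_hi = (1 - zeta) * np.log((1 - zeta) / (1 - zeta + x)) + x
bound = np.exp(-N * rate_lo / 2) + np.exp(-N * rate_hi / 2)
print(emp, bound)        # the bound should dominate the empirical frequency
```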
Appendix E: Statistical properties of MSE in ML inference

In this section we consider statistical properties of the mean square error (MSE) d⁻¹‖θ₀ − θ̂[D]‖², where θ₀ is the vector of the true parameters responsible for the data, and θ̂[D] is the ML estimator (A4).
1. Moment generating function
Let us consider the moment generating function

⟨e^{α‖θ₀−θ̂[D]‖²}⟩_D = ∫dθ̂ P(θ̂) e^{α‖θ₀−θ̂‖²}.   (E1)

Using the Student's t form (C5) of P(θ̂), shifting the integration variable by θ₀, and representing the multivariate Student t-distribution as a mixture of Gaussians [32],

P(θ̂) = ∫₀^∞ dω N(θ̂ | θ₀, [ζσ₀²/(ω(1−ζ+1/N))] Σ⁻¹) Γ_{N+1−d}(ω),

we obtain

⟨e^{α‖θ₀−θ̂[D]‖²}⟩_D = ∫dθ̂ ∫₀^∞ dω Γ_{N+1−d}(ω) e^{−(1/2)θ̂ᵀ[(ω(1−ζ+1/N)/ζσ₀²)Σ − 2αI_d]θ̂} / |2π(ζσ₀²/(ω(1−ζ+1/N)))Σ⁻¹|^{1/2}
= ∫₀^∞ dω Γ_{N+1−d}(ω) |I_d − 2α(ζσ₀²/(ω(1−ζ+1/N)))Σ⁻¹|^{−1/2}
= ∫₀^∞ dω e^{−(1/2)∑_{µ=1}^d log[1 − 2αζσ₀²/(ω(1−ζ+1/N)λ_µ(Σ))]} Γ_{N+1−d}(ω),   (E2)

where we encounter the gamma distribution, for ν > 0:

Γ_ν(ω) = [(ν/2)^{ν/2}/Γ(ν/2)] ω^{ν/2−1} e^{−νω/2}.   (E3)

Thus the moment generating function is given by

⟨e^{α‖θ₀−θ̂[D]‖²}⟩_D = ∫₀^∞ dω Γ_{N+1−d}(ω) ∏_{ℓ=1}^d [1 − 2αζσ₀²/(ω(1−ζ+1/N)λ_ℓ(Σ))]^{−1/2},   (E4)

and, by the transformation α = ia in the above, we also obtain the characteristic function

⟨e^{ia‖θ₀−θ̂‖²}⟩_D = ∫₀^∞ dω Γ_{N+1−d}(ω) ∏_{ℓ=1}^d [1 − 2iaζσ₀²/(ω(1−ζ+1/N)λ_ℓ(Σ))]^{−1/2}.   (E5)

We note that the last term in the above is the product of characteristic functions of gamma distributions. Each gamma distribution has the same 'shape' parameter 1/2 and 'scale' parameter 2ζσ₀²/[ω(1−ζ+1/N)λ_ℓ(Σ)].

a. The first two moments of the MSE

Let us now consider derivatives of the moment generating function (E4) upon replacing α → α/d. The derivative with respect to α gives us

∂/∂α ⟨e^{(α/d)‖θ₀−θ̂[D]‖²}⟩_D = ∫₀^∞ dω Γ_{N+1−d}(ω) ∏_{ℓ=1}^d [1 − 2αζσ₀²/(ωd(1−ζ+1/N)λ_ℓ(Σ))]^{−1/2} × (1/d)∑_{ℓ=1}^d [1 − 2αζσ₀²/(ωd(1−ζ+1/N)λ_ℓ(Σ))]^{−1} ζσ₀²/(ω(1−ζ+1/N)λ_ℓ(Σ)).   (E6)

For α = 0 this gives us the average

(1/d)⟨‖θ₀−θ̂[D]‖²⟩_D = [ζσ₀²/(1−ζ+1/N)] (Tr[Σ⁻¹]/d) ∫₀^∞ dω Γ_{N+1−d}(ω) ω⁻¹ = [ζσ₀²/(1−ζ+1/N)] [(1−ζ+1/N)/(1−ζ−1/N)] (Tr[Σ⁻¹]/d) = [ζσ₀²/(1−ζ−1/N)] (Tr[Σ⁻¹]/d).   (E7)

Now we consider the second derivative with respect to α. Differentiating (E6) once more, using the product rule on the two α-dependent factors, gives two contributions: the square of the single-eigenvalue sum, and the sum of squared single-eigenvalue terms. Evaluation at α = 0, together with ∫₀^∞ dω Γ_{N+1−d}(ω) ω⁻² = (1−ζ+1/N)²/[(1−ζ−1/N)(1−ζ−3/N)], gives us the second moment

⟨(1/d²)‖θ₀−θ̂[D]‖⁴⟩_D = ζ²σ₀⁴/[(1−ζ−1/N)(1−ζ−3/N)] [(Tr[Σ⁻¹]/d)² + (2/d²)Tr[Σ⁻²]].   (E9)

Now upon combining the mean (E7) and the second moment (E9) we obtain the variance of the random variable d⁻¹‖θ₀−θ̂[D]‖² for (N, d) → ∞:

⟨(1/d²)‖θ₀−θ̂[D]‖⁴⟩_D − (1/d²)⟨‖θ₀−θ̂[D]‖²⟩_D² = {ζ²σ₀⁴/[(1−ζ−1/N)(1−ζ−3/N)] − ζ²σ₀⁴/(1−ζ−1/N)²} (Tr[Σ⁻¹]/d)² + ζ²σ₀⁴/[(1−ζ−1/N)(1−ζ−3/N)] (2/d²)Tr[Σ⁻²] = 2[ζσ₀²/(1−ζ)]² Tr[Σ⁻²]/d².   (E10)

b. Properties of the MGF for large (N, d)

Let us consider the moment generating function (E4) of the MSE for the covariance matrix Σ = λI_d. The mean MSE ⟨d⁻¹‖θ₀−θ̂[D]‖²⟩_D is given, using equation (E7) for large (N, d), by

µ(λ) = ζσ₀²/(1−ζ)λ.   (E11)

The MGF of the MSE is given by

⟨e^{α‖θ₀−θ̂[D]‖²}⟩_D = ∫₀^∞ dω Γ_{N+1−d}(ω) [1 − 2αζσ₀²/(ω(1−ζ+1/N)λ)]^{−d/2}
= ∫₀^∞ dω Γ_{N+1−d}(ω) [ω(1−ζ+1/N)λ / (ω(1−ζ+1/N)λ − 2αζσ₀²)]^{d/2}
= ∫₀^∞ dω Γ_{N+1−d}(ω) [ω/(ω − 2αµ(λ))]^{d/2},   (E12)

where in the last line we assumed (N, d) → ∞.
Let us now consider the integral

(1/N) log ∫₀^∞ dω Γ_{N+1−d}(ω) [ω/(ω−α)]^{d/2}
= (1/N) log{[(N+1−d)/2]^{(N+1−d)/2}/Γ((N+1−d)/2)} + (1/N) log ∫₀^∞ dω ω^{(N−d−1)/2} e^{−(N+1−d)ω/2} [ω/(ω−α)]^{d/2}
= (1/N) log{[(N+1−d)/2]^{(N+1−d)/2}/Γ((N+1−d)/2)} + (1/N) log ∫₀^∞ dω e^{(N/2)[(1−ζ)log ω − (1−ζ)ω + ζ log(ω/(ω−α))] + O(log ω)}
= (1−ζ)/2 + (1/2)φ₋(ω₀⁻) + (1/2N) log[(1−ζ)/(−φ₋″(ω₀⁻))] + O(1/N),   (E13)

where ω₀⁻ = argmax_{ω∈(0,∞)} φ₋(ω), with the function

φ₋(ω) = (1−ζ) log ω − (1−ζ)ω + ζ log[ω/(ω−α)].   (E14)

We note that φ₋(ω) has a maximum when the solution of

φ₋′(ω) = 1/ω − (1−ζ) − ζ/(ω−α) = 0,   i.e. of   (1−ζ)ω² − (1−ζ)(1+α)ω + α = 0,   (E15)

satisfies the condition φ₋″(ω) < 0, where φ₋″(ω) = −1/ω² + ζ/(ω−α)², i.e. ζω² − (ω−α)² < 0. The latter is satisfied when ω ∈ (0, (1−√ζ)α/(1−ζ)) ∪ ((1+√ζ)α/(1−ζ), ∞) for ζ ∈ [0, 1), and when ω ∈ (0, α/2) for ζ = 1. However, the difference ω − α for ζ ∈ [0, 1) is negative on the interval (0, (1−√ζ)α/(1−ζ)), so φ₋(ω) is undefined there. The same is also true for (0, α/2) at ζ = 1, thus leaving us only with the interval ((1+√ζ)α/(1−ζ), ∞) with ζ ∈ (0, 1). The quadratic equation in (E15) has the two solutions

ω = (1+α)/2 ± √[((1+α)/2)² − α/(1−ζ)]

when the inequality ((1+α)/2)² ≥ α/(1−ζ) is satisfied. This holds when α ∈ (0, (1+ζ−2√ζ)/(1−ζ)] ∪ [(1+ζ+2√ζ)/(1−ζ), ∞), but for ζ ∈ (0, 1) only one solution,

ω₀⁻ = (1+α)/2 + √[((1+α)/2)² − α/(1−ζ)],

belongs to the interval ((1+√ζ)α/(1−ζ), ∞), namely when α ∈ (0, (1+ζ−2√ζ)/(1−ζ)), i.e. it corresponds to the maximum of φ₋(ω). We note that upon substituting α → 2αµ(λ), the factor appearing in the integral (E12), we obtain

ω₀⁻ = (1+2αµ(λ))/2 + √[((1+2αµ(λ))/2)² − 2αµ(λ)/(1−ζ)]   (E16)

for α ∈ (0, λ(1+ζ−2√ζ)/2ζσ₀²) and ζ ∈ (0, 1). The MGF for large N is then found to become

⟨e^{α‖θ₀−θ̂[D]‖²}⟩_D = √[(1−ζ)/(−φ₋″(ω₀⁻))] e^{(N/2)(1−ζ+φ₋(ω₀⁻))} [1 + O(1/N)].   (E17)
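Before turning to deviation probabilities, the mean (E7) can be verified by direct simulation; the sketch below (our illustration, assuming a diagonal Σ with arbitrarily chosen eigenvalues) uses that θ̂ − θ₀ = (ZᵀZ)⁻¹Zᵀε, so θ₀ = 0 can be taken without loss of generality.

```python
# Monte Carlo sketch of the MSE mean (E7) for Gaussian data with a diagonal
# population covariance Sigma; lam_true holds its (illustrative) eigenvalues.
import numpy as np

rng = np.random.default_rng(6)
N, d, sigma0, trials = 150, 45, 1.0, 4000
zeta = d / N
lam_true = np.linspace(0.5, 2.0, d)               # eigenvalues of Sigma
mse = np.empty(trials)
for k in range(trials):
    Z = rng.normal(size=(N, d)) * np.sqrt(lam_true) / np.sqrt(d)
    eps = sigma0 * rng.normal(size=N)
    theta_hat = np.linalg.solve(Z.T @ Z, Z.T @ eps)   # theta_0 = 0 w.l.o.g.
    mse[k] = np.sum(theta_hat**2) / d

mean_theory = zeta * sigma0**2 / (1 - zeta - 1 / N) * np.mean(1 / lam_true)
print(mse.mean(), mean_theory)     # should agree within sampling error
```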
2. Deviations from the mean
We are interested in the probability of the event d || θ − ˆ θ [ D ] || / ∈ (cid:0) µ ( λ ) − δ, µ ( λ ) + δ (cid:1) . This is given byProb (cid:20) d || θ − ˆ θ [ D ] || / ∈ (cid:0) µ ( λ ) − δ, µ ( λ ) + δ (cid:1)(cid:21) = Prob h || θ − ˆ θ [ D ] || ≤ d (cid:0) µ ( λ ) − δ (cid:1)i + Prob h || θ − ˆ θ [ D ] || ≥ d (cid:0) µ ( λ ) + δ (cid:1)i . (E18)First, for α > h || θ − ˆ θ [ D ] || ≥ d (cid:0) µ ( λ ) + δ (cid:1)i = Prob (cid:20) e α || θ − ˆ θ [ D ] || ≥ e αd (cid:0) µ ( λ )+ δ (cid:1)(cid:21) ≤ D e α || θ − ˆ θ [ D ] || E D e − αd (cid:0) µ ( λ )+ δ (cid:1) = s ζ − φ ′′− ( ω − ) e N (cid:0) − ζ + φ − ( ω − ) (cid:1) + O (cid:0) /N (cid:1) e − αζN (cid:0) µ ( λ )+ δ (cid:1) = s ζ − φ ′′− ( ω − ) e − N h ζ − − φ − ( ω − )+ αζ (cid:0) µ ( λ )+ δ (cid:1) i + O (cid:0) /N (cid:1) (E19)From this it follows that for N → ∞ we have − N log Prob h || θ − ˆ θ [ D ] || ≥ d (cid:0) µ ( λ ) + δ (cid:1)i ≥ ζ − − φ − ( ω − ) + α ζ (cid:0) µ ( λ ) + δ (cid:1) + O (cid:0) /N (cid:1) (E20)We seek an upper bound with respect to α , but it is not clear how to implement this analytically for any α . However,for small α the function (divided by ζ ) appearing in the right-hand side of (E20) has the following Taylor expansion: (cid:0) µ ( λ ) − δ (cid:1) α + ζ (cid:0) µ ( λ ) − (cid:1) − − ζ ) α + (cid:0) µ − (cid:1) ζ + (cid:0) µ − µ − µ + 2 (cid:1) ζ − (cid:0) − ζ (cid:1) α + O (cid:0) α (cid:1) , (E21)so if µ ( λ ) + δ >
1, the first term in this expansion is positive and hence if α > µ ( λ ) ≥
1, where µ ( λ ) = ζσ / (1 − ζ ) λ , the value of δ > µ < δ .8Second, for α > h || θ − ˆ θ [ D ] || ≤ d (cid:0) µ ( λ ) − δ (cid:1)i = Prob (cid:20) e − α || θ − ˆ θ [ D ] || ≥ e − αd (cid:0) µ ( λ ) − δ (cid:1)(cid:21) ≤ D e − α || θ − ˆ θ [ D ] || E D e αd (cid:0) µ ( λ ) − δ (cid:1) = s ζ − φ ′′ + ( ω +0 ) e N (cid:0) − ζ + φ + ( ω +0 ) (cid:1) + O (cid:0) /N (cid:1) e αζN (cid:0) µ ( λ ) − δ (cid:1) = s ζ − φ ′′ + ( ω +0 ) e − N h ζ − − φ + ( ω +0 ) − αζ (cid:0) µ ( λ ) − δ (cid:1) i + O (cid:0) /N (cid:1) , (E22)where the function φ + , defined as φ + ( ω ) = (1 − ζ ) log ω − (1 − ζ ) ω + ζ log (cid:0) ωω + α (cid:1) , (E23)has a maximum at ω +0 = 12 (cid:0) − α + q(cid:0) α − (cid:1) + 4 α/ (1 − ζ ) (cid:1) . (E24)Now for α → αµ ( λ ) we have in the exponential of (E22): ζ − − φ + ( ω +0 ) − αζ (cid:0) µ ( λ ) − δ (cid:1) = δ ζ α − µ ( λ ) ζ (cid:0) − ζ (cid:1) α + µ ( λ ) ζ (cid:0) ζ +1 (cid:1) (cid:0) − ζ (cid:1) α − µ ( λ ) ζ (cid:0) ζ +3 ζ +1 (cid:1) (cid:0) − ζ (cid:1) α + O ( α ) . (E25)This is positive for sufficiently small α , and hence the lower bound in the inequality − N log Prob h || θ − ˆ θ [ D ] || ≤ d (cid:0) µ ( λ ) − δ (cid:1)i ≥ ζ − − φ + ( ω +0 ) − αζ (cid:0) µ ( λ ) − δ (cid:1) + O (cid:0) /N (cid:1) (E26)is positive for any δ ∈ (0 , µ ( λ )) and sufficiently small α , when N → ∞ .Now combining (E20) with (E26) allows us bound the probability (E18) as followsProb (cid:20) d || θ − ˆ θ [ D ] || / ∈ (cid:0) µ ( λ ) − δ, µ ( λ ) + δ (cid:1)(cid:21) ≤ C − e − N h ζ − − φ − ( ω − )+ α ζ (cid:0) µ ( λ )+ δ (cid:1) i + C + e − N h ζ − − φ + ( ω +0 ) − αζ (cid:0) µ ( λ ) − δ (cid:1) i . (E27)for some constants C ± and some sufficiently small α >
0. We note that for the first term in the above upper boundto vanish, as N → ∞ , for arbitrary small δ it is sufficient that µ ( λ ) ≥
1, where µ ( λ ) = ζσ / (1 − ζ ) λ , but for µ ( λ ) < δ must be such that δ > − µ ( λ ). The second term in the upper bound is vanishing for any δ ∈ (0 , µ ( λ )).Finally we consider deviations of d || θ − ˆ θ [ D ] || from its mean µ (cid:0) Σ (cid:1) = [ ζσ / (1 − ζ − N − )] d Tr (cid:2) Σ − (cid:3) , derived in(E7). To this end we consider the probability of event d || θ − ˆ θ [ D ] || / ∈ (cid:0) µ (cid:0) Σ (cid:1) − δ, µ (cid:0) Σ (cid:1) + δ (cid:1) given by the sumProb h || θ − ˆ θ [ D ] || ≤ d (cid:0) µ (cid:0) Σ (cid:1) − δ (cid:1)i + Prob h || θ − ˆ θ [ D ] || ≥ d (cid:0) µ (cid:0) Σ (cid:1) + δ (cid:1)i ≤ D e − α || θ − ˆ θ [ D ] || E D e αd (cid:0) µ (cid:0) Σ (cid:1) − δ (cid:1) + D e α || θ − ˆ θ [ D ] || E D e − αd (cid:0) µ (cid:0) Σ (cid:1) + δ (cid:1) (E28)for α >
0. Let us order the eigenvalues of Σ in a such a way that λ ( Σ ) ≤ λ ( Σ ) ≤ · · · ≤ λ d ( Σ ) then, using (E4), forthe moment generating functions in above we obtain D e − α || θ − ˆ θ [ D ] || E D = Z ∞ Γ N +1 − d ( ω ) d ω d Y ℓ =1 (cid:0) α ζσ ω (1 − ζ + 1 /N ) λ ℓ ( Σ ) (cid:1) − ≤ Z ∞ Γ N +1 − d ( ω ) d ω (cid:0) α ζσ ω (1 − ζ + 1 /N ) λ ( Σ ) (cid:1) − d (E29)and D e α || θ − ˆ θ [ D ] || E D = Z ∞ Γ N +1 − d ( ω ) d ω d Y ℓ =1 (cid:0) − α ζσ ω (1 − ζ + 1 /N ) λ ℓ ( Σ ) (cid:1) − ≤ Z ∞ Γ N +1 − d ( ω ) d ω (cid:0) − α ζσ ω (1 − ζ + 1 /N ) λ d ( Σ ) (cid:1) − d . (E30)9Furthermore, by the inequalities 1 /λ d ( Σ ) ≤ d Tr (cid:2) Σ − (cid:3) ≤ /λ ( Σ ), the mean obeys µ (cid:0) λ d ( Σ ) (cid:1) ≤ µ (cid:0) Σ (cid:1) ≤ µ (cid:0) λ ( Σ ) (cid:1) ,where µ (cid:0) λ (cid:1) = ζσ / (1 − ζ − N − ) λ . The latter combined with the upper bounds in (E29) and (E30) gives usProb (cid:20) d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) θ − ˆ θ [ D ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) / ∈ (cid:0) µ (cid:0) Σ (cid:1) − δ, µ (cid:0) Σ (cid:1) + δ (cid:1)(cid:21) ≤ Z ∞ d ω Γ N +1 − d ( ω ) (cid:0) αζσ ω (1 − ζ + 1 /N ) λ ( Σ ) (cid:1) − d e αd (cid:0) µ (cid:0) λ ( Σ ) (cid:1) − δ (cid:1) + Z ∞ d ω Γ N +1 − d ( ω ) (cid:0) − αζσ ω (1 − ζ + 1 /N ) λ d ( Σ ) (cid:1) − d e − αd (cid:0) µ (cid:0) λ d ( Σ ) (cid:1) + δ (cid:1) (E31)Finally, using similar steps to those that led us earlier to (E27) giveProb (cid:20) d (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) θ − ˆ θ [ D ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) / ∈ (cid:0) µ (cid:0) Σ (cid:1) − δ, µ (cid:0) Σ (cid:1) + δ (cid:1)(cid:21) ≤ C − e − N Φ − [ α,µ ( λ d ) ,δ ] + C + e − N Φ + [ α,µ ( λ ) ,δ ] , (E32)for some constants C ± and some sufficiently small α >
0. In the above we have defined the (rate) functionsΦ − [ α, µ ( λ d ) , δ ] = 12 h ζ − − φ − ( ω − ) + α ζ (cid:0) µ ( λ d ( Σ )) + δ (cid:1)i , (E33)Φ + [ α, µ ( λ ) , δ ] = 12 h ζ − − φ + ( ω +0 ) − αζ (cid:0) µ (cid:0) λ ( Σ ) (cid:1) − δ (cid:1)i . (E34)Here φ − ( ω − ) is defined by (E14) and (E16) with µ ( λ ) replaced by µ ( λ d ), and φ + ( ω +0 ) is defined by (E23) and (E24)with α replaced by αµ ( λ ). We note that for the first term in the upper bound (E32) to vanish, as ( N, d ) → ∞ , forarbitrary small δ , it is sufficient that µ ( λ d ) ≥
Appendix F: Statistical properties of the free energy
In this appendix we consider the statistical properties of the (conditional) free energy

F_{β,σ}[D] = d/(2β) + (1/2σ²) t^T( I_N − Z J_{σ²η}^{-1} Z^T ) t − (1/2β) log| 2πe σ²β^{-1} J_{σ²η}^{-1} |,   (F1)

assuming that σ is independent of the data D.
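For concreteness, the following sketch evaluates (F1) on synthetic data, under the conventions used throughout this appendix, here taken to be J = Z^T Z and J_{σ²η} = J + σ²η I_d; parameter values are arbitrary.

```python
import numpy as np

# Sketch: direct evaluation of the conditional free energy (F1), assuming
# J = Z^T Z and J_{sigma^2 eta} = J + sigma^2 eta I_d.
rng = np.random.default_rng(0)
N, d = 200, 100
beta, sigma2, eta = 1.0, 1.0, 0.5

Z = rng.normal(size=(N, d)) / np.sqrt(d)
theta0 = rng.normal(size=d)
t = Z @ theta0 + np.sqrt(sigma2) * rng.normal(size=N)

J_eta = Z.T @ Z + sigma2 * eta * np.eye(d)
quad = t @ (t - Z @ np.linalg.solve(J_eta, Z.T @ t))   # t^T (I - Z J_eta^{-1} Z^T) t
_, logdet = np.linalg.slogdet(J_eta)                   # log |J_{sigma^2 eta}|
# log|2 pi e sigma^2 beta^{-1} J_eta^{-1}| = d log(2 pi e sigma^2/beta) - log|J_eta|
F = d/(2*beta) + quad/(2*sigma2) \
    - (d*np.log(2*np.pi*np.e*sigma2/beta) - logdet)/(2*beta)
print(F / N)                                           # free energy density
```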
1. The average of the free energy
Let us consider the average free energy

⟨F_{β,σ}[D]⟩_D = d/(2β) + (1/2σ²) ⟨t^T(I_N − Z J_{σ²η}^{-1} Z^T) t⟩_D − (1/2β) ⟨log|2πe σ²β^{-1} J_{σ²η}^{-1}|⟩_D
  = d/(2β) + (1/2σ²) ⟨⟨t^T(I_N − Z J_{σ²η}^{-1} Z^T) t⟩_ε⟩_Z − (1/2β) ⟨log|2πe σ²β^{-1} J_{σ²η}^{-1}|⟩_Z.   (F2)

Now, assuming that the noise vector ε has mean 0 and covariance σ₀² I_N, we can work out

⟨t^T(I_N − Z J_{σ²η}^{-1} Z^T) t⟩_ε = ⟨Tr[(I_N − Z J_{σ²η}^{-1} Z^T) t t^T]⟩_ε
  = Tr[(I_N − Z J_{σ²η}^{-1} Z^T) ⟨(Zθ_0 + ε)(Zθ_0 + ε)^T⟩_ε]
  = Tr[(I_N − Z J_{σ²η}^{-1} Z^T)(Zθ_0θ_0^T Z^T + 2Zθ_0⟨ε^T⟩_ε + ⟨εε^T⟩_ε)]
  = Tr[(I_N − Z J_{σ²η}^{-1} Z^T)(Zθ_0θ_0^T Z^T + σ₀² I_N)]
  = θ_0^T Z^T(I_N − Z J_{σ²η}^{-1} Z^T) Zθ_0 + σ₀² Tr[I_N − Z J_{σ²η}^{-1} Z^T]
  = θ_0^T (J − J J_{σ²η}^{-1} J) θ_0 + σ₀² (N − Tr[J J_{σ²η}^{-1}]),   (F3)

and hence the average free energy is given by

⟨F_{β,σ}[D]⟩_D = d/(2β) + (1/2σ²) θ_0^T ⟨J − J J_{σ²η}^{-1} J⟩_Z θ_0 + (σ₀²/2σ²)( N − ⟨Tr[J J_{σ²η}^{-1}]⟩_Z ) − (1/2β) ⟨log|2πe σ²β^{-1} J_{σ²η}^{-1}|⟩_Z.   (F4)
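The noise average (F3) at fixed design Z can be verified directly by sampling; the sketch below compares a Monte Carlo estimate of ⟨t^T(I_N − Z J_{σ²η}^{-1} Z^T)t⟩_ε with the closed form, again assuming J = Z^T Z and Gaussian noise of covariance σ₀² I_N.

```python
import numpy as np

# Sketch: Monte Carlo check of the noise average (F3) at fixed design Z,
# assuming J = Z^T Z, J_eta = J + sigma^2 eta I_d, eps ~ N(0, sigma0^2 I_N).
rng = np.random.default_rng(1)
N, d = 100, 50
sigma2, sigma0_2, eta = 1.0, 0.8, 0.3

Z = rng.normal(size=(N, d)) / np.sqrt(d)
theta0 = rng.normal(size=d)
J = Z.T @ Z
J_eta = J + sigma2 * eta * np.eye(d)
A = np.eye(N) - Z @ np.linalg.solve(J_eta, Z.T)        # A = I_N - Z J_eta^{-1} Z^T

eps = np.sqrt(sigma0_2) * rng.normal(size=(20000, N))  # one noise sample per row
t = Z @ theta0 + eps
mc = np.mean(np.einsum('si,ij,sj->s', t, A, t))        # sampled <t^T A t>_eps

exact = theta0 @ (J - J @ np.linalg.solve(J_eta, J)) @ theta0 \
        + sigma0_2 * (N - np.trace(J @ np.linalg.inv(J_eta)))
print(mc, exact)                                       # agree to MC accuracy
```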
2. The variance of the free energy
We turn to the variance of the free energy F_{β,σ}[D]. To this end we exploit the free energy equality F = U − TS, which gives us Var(F) = Var(U − TS) = Var(U) − 2T Cov(U, S) + T² Var(S). The latter applied to (F1) leads to

Var(F_{β,σ}[D]) = Var(E[D]) + T² Var(S[D]) − 2T Cov(E[D], S[D]).   (F5)

Let us consider the energy variance

Var(E[D]) = Var( d/(2β) + (1/2σ²) t^T(I_N − Z J_{σ²η}^{-1} Z^T) t )
  = (1/4σ⁴) Var( (Zθ_0 + ε)^T (I_N − Z J_{σ²η}^{-1} Z^T)(Zθ_0 + ε) ).   (F6)

If we define v = Zθ_0 and A = I_N − Z J_{σ²η}^{-1} Z^T, then the above is of the form

Var((v+ε)^T A (v+ε)) = Var( v^T A v + 2ε^T A v + ε^T A ε )
  = Var(v^T A v) + 4 Var(ε^T A v) + Var(ε^T A ε)
  + 2[ 2Cov(v^T A v, ε^T A v) + Cov(v^T A v, ε^T A ε) + 2Cov(ε^T A v, ε^T A ε) ].   (F7)

In the following we will use the identities

Z^T A Z = J − J J_{σ²η}^{-1} J,   Tr[A] = N − Tr[J J_{σ²η}^{-1}],   (F8)
Z^T A² Z = J(I_d − J_{σ²η}^{-1} J)²,   Tr[A²] = N − 2Tr[J J_{σ²η}^{-1}] + Tr[(J J_{σ²η}^{-1})²].   (F9)

We can now compute each term in (F7):

Var(v^T A v) = Var( θ_0^T Z^T(I_N − Z J_{σ²η}^{-1} Z^T) Zθ_0 ) = Var( θ_0^T(J − J J_{σ²η}^{-1} J)θ_0 )
  = ⟨(θ_0^T(J − J J_{σ²η}^{-1} J)θ_0)²⟩_Z − ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0⟩_Z²,   (F10)

Var(ε^T A v) = ⟨⟨(ε^T A v)²⟩_ε⟩_Z − ⟨⟨ε^T A v⟩_ε⟩_Z² = ⟨⟨(ε^T A v)²⟩_ε⟩_Z = ⟨v^T A^T ⟨εε^T⟩_ε A v⟩_Z = σ₀² ⟨v^T A² v⟩_Z
  = σ₀² ⟨θ_0^T Z^T(I_N − Z J_{σ²η}^{-1} Z^T)² Zθ_0⟩_Z = σ₀² θ_0^T ⟨J(I_d − J_{σ²η}^{-1} J)²⟩_Z θ_0,   (F11)

Var(ε^T A ε) = ⟨⟨(ε^T A ε)²⟩_ε⟩_Z − ⟨⟨ε^T A ε⟩_ε⟩_Z² = σ₀⁴( ⟨Tr[A]²⟩_Z + 2⟨Tr[A²]⟩_Z − ⟨Tr[A]⟩_Z² )
  = σ₀⁴[ ⟨(N − Tr[J J_{σ²η}^{-1}])²⟩_Z + 2⟨N − 2Tr[J J_{σ²η}^{-1}] + Tr[(J J_{σ²η}^{-1})²]⟩_Z − ⟨N − Tr[J J_{σ²η}^{-1}]⟩_Z² ],   (F12)

Cov(v^T A v, ε^T A v) = ⟨⟨v^T A v · ε^T A v⟩_ε⟩_Z − ⟨⟨v^T A v⟩_ε⟩_Z ⟨⟨ε^T A v⟩_ε⟩_Z = 0,   (F13)

Cov(v^T A v, ε^T A ε) = ⟨⟨v^T A v · ε^T A ε⟩_ε⟩_Z − ⟨⟨v^T A v⟩_ε⟩_Z ⟨⟨ε^T A ε⟩_ε⟩_Z
  = ⟨v^T A v Tr[A⟨εε^T⟩_ε]⟩_Z − ⟨v^T A v⟩_Z ⟨⟨ε^T A ε⟩_ε⟩_Z
  = σ₀²[ ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0 (N − Tr[J J_{σ²η}^{-1}])⟩_Z − ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0⟩_Z ⟨N − Tr[J J_{σ²η}^{-1}]⟩_Z ],   (F14)

Cov(ε^T A v, ε^T A ε) = ⟨⟨ε^T A v · ε^T A ε⟩_ε⟩_Z − ⟨⟨ε^T A v⟩_ε⟩_Z ⟨⟨ε^T A ε⟩_ε⟩_Z = 0.   (F15)

In deriving the above identities we also used, for Gaussian noise, the result

⟨(ε^T A ε)²⟩_ε = Σ_{i₁,…,i₄} ⟨ε_{i₁}ε_{i₂}ε_{i₃}ε_{i₄}⟩_ε A_{i₁i₂} A_{i₃i₄}
  = σ₀⁴ Σ_{i₁,…,i₄} [ δ_{i₁i₂}δ_{i₃i₄} + δ_{i₁i₃}δ_{i₂i₄} + δ_{i₁i₄}δ_{i₂i₃} ] A_{i₁i₂} A_{i₃i₄}
  = σ₀⁴ Σ_{i,j} { A_{ii}A_{jj} + 2A_{ij}² } = σ₀⁴( Tr[A]² + 2Tr[A²] ).   (F16)

Using all of the above results in (F6) we obtain

4σ⁴ Var(E[D]) = ⟨(θ_0^T(J − J J_{σ²η}^{-1} J)θ_0)²⟩_Z − ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0⟩_Z² + 4σ₀² θ_0^T ⟨J(I_d − J_{σ²η}^{-1} J)²⟩_Z θ_0
  + σ₀⁴[ ⟨(N − Tr[J J_{σ²η}^{-1}])²⟩_Z + 2⟨N − 2Tr[J J_{σ²η}^{-1}] + Tr[(J J_{σ²η}^{-1})²]⟩_Z − ⟨N − Tr[J J_{σ²η}^{-1}]⟩_Z² ]
  + 2σ₀²( ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0 (N − Tr[J J_{σ²η}^{-1}])⟩_Z − ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0⟩_Z ⟨N − Tr[J J_{σ²η}^{-1}]⟩_Z ).   (F17)

The entropy variance is given by

Var(S[D]) = Var( ½ log|2πe σ²β^{-1} J_{σ²η}^{-1}| ) = ¼( ⟨log²|J_{σ²η}^{-1}|⟩_Z − ⟨log|J_{σ²η}^{-1}|⟩_Z² ),   (F18)

and the covariance is

Cov(E[D], S[D]) = Cov( d/(2β) + (1/2σ²) t^T(I_N − Z J_{σ²η}^{-1} Z^T) t , (d/2) log(2πe σ²β^{-1}) + ½ log|J_{σ²η}^{-1}| )
  = (1/4σ²) Cov( t^T(I_N − Z J_{σ²η}^{-1} Z^T) t , log|J_{σ²η}^{-1}| )
  = (1/4σ²)[ ⟨⟨t^T(I_N − Z J_{σ²η}^{-1} Z^T) t⟩_ε log|J_{σ²η}^{-1}|⟩_Z − ⟨⟨t^T(I_N − Z J_{σ²η}^{-1} Z^T) t⟩_ε⟩_Z ⟨log|J_{σ²η}^{-1}|⟩_Z ]
  = (1/4σ²)[ ⟨[θ_0^T(J − J J_{σ²η}^{-1} J)θ_0 + σ₀²(N − Tr[J J_{σ²η}^{-1}])] log|J_{σ²η}^{-1}|⟩_Z
  − ⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0 + σ₀²(N − Tr[J J_{σ²η}^{-1}])⟩_Z ⟨log|J_{σ²η}^{-1}|⟩_Z ].   (F19)

Finally, using the results (F17), (F18) and (F19), the variance of the free energy follows from equation (F5).
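The Wick-type identity (F16) underlying these variance formulas can likewise be checked by direct sampling; the sketch below does so for an arbitrary symmetric matrix A and Gaussian noise.

```python
import numpy as np

# Sketch: sampling check of the Gaussian fourth-moment identity (F16),
# <(eps^T A eps)^2>_eps = sigma0^4 (Tr[A]^2 + 2 Tr[A^2]), for symmetric A.
rng = np.random.default_rng(2)
N, sigma0_2 = 40, 0.7

M = rng.normal(size=(N, N))
A = (M + M.T) / 2                                  # arbitrary symmetric matrix

eps = np.sqrt(sigma0_2) * rng.normal(size=(200000, N))
q = np.einsum('si,ij,sj->s', eps, A, eps)          # eps^T A eps per sample
print(np.mean(q**2), sigma0_2**2 * (np.trace(A)**2 + 2*np.trace(A @ A)))
```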
3. Free energy of ML inference
For η = 0 we simply have J_{σ²η} = J, and the average free energy (F4) is given by

⟨F_{β,σ}[D]⟩_D = d/(2β) + (1/2σ²) θ_0^T ⟨J − J J^{-1} J⟩_Z θ_0 + (σ₀²/2σ²)( N − ⟨Tr[J J^{-1}]⟩_Z ) − (1/2β) ⟨log|2πe σ²β^{-1} J^{-1}|⟩_Z
  = d/(2β) + (σ₀²/2σ²)(N − d) − (d/2β) log(2πe σ²) + (d/2β) log β + (1/2β) ⟨log|J|⟩_Z.   (F20)

Once more we put Z → Z/√d, where now z_i(µ) = O(1) for all (i,µ); then we obtain the average free energy density

(1/N)⟨F_{β,σ}[D]⟩_D = (σ₀²/2σ²)(1−ζ) + (ζ/2β) log( β/(2πσ²ζ) ) + (ζ/2β) ∫dλ ⟨ρ_d(λ|Z)⟩_Z log λ,   (F21)

with the density of eigenvalues of the d×d empirical covariance matrix C = N^{-1} Z^T Z:

ρ_d(λ|Z) = (1/d) Σ_{ℓ=1}^d δ( λ − λ_ℓ(C) ).   (F22)

We can similarly compute the variance of (F1) for η = 0 and Z → Z/√d. Firstly, we consider the energy variance (F17), which reduces to Var(E[D]) = (σ₀⁴/2σ⁴)(N − d), from which we deduce

Var(E[D]/N) = (σ₀⁴/2σ⁴)(1−ζ)/N.   (F23)

Secondly, the entropy variance (F18) is given by

Var(S[D]) = ¼[ ⟨log²|J|⟩_Z − ⟨log|J|⟩_Z² ] = (d²/4)[ ⟨((1/d)Σ_{ℓ=1}^d log λ_ℓ)²⟩_Z − ⟨(1/d)Σ_{ℓ=1}^d log λ_ℓ⟩_Z² ]
  = (d²/4)[ ⟨(∫dλ ρ_d(λ|Z) log λ)²⟩_Z − ⟨∫dλ ρ_d(λ|Z) log λ⟩_Z² ],   (F24)

and hence

Var(S[D]/N) = (ζ²/4)[ ⟨(∫dλ ρ_d(λ|Z) log λ)²⟩_Z − ⟨∫dλ ρ_d(λ|Z) log λ⟩_Z² ].   (F25)

Finally, we consider the covariance (F19), which gives us

Cov(E[D], S[D]) = (σ₀²/4σ²)(N − d) ⟨log|J^{-1}|⟩_Z − (σ₀²/4σ²)(N − d) ⟨log|J^{-1}|⟩_Z = 0.   (F26)

Using the above results in identity (F5) we obtain the desired variance of the free energy density:

Var(F_{β,σ}[D]/N) = Var(E[D]/N) + T² Var(S[D]/N)
  = (σ₀⁴/2σ⁴)(1−ζ)/N + (ζ²/4β²) ∫dλ dλ̃ [ ⟨ρ_d(λ|Z)ρ_d(λ̃|Z)⟩_Z − ⟨ρ_d(λ|Z)⟩_Z ⟨ρ_d(λ̃|Z)⟩_Z ] log(λ) log(λ̃).   (F27)
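As a numerical illustration of (F21), the sketch below compares the directly evaluated free energy density at η = 0 with the spectral expression, using a single realisation of Z (so the Z-average of ρ_d is replaced by one sample) and replacing the fluctuating quadratic term by its noise average (σ₀²/2σ²)(1−ζ); the two values then agree up to O(1/√N) fluctuations.

```python
import numpy as np

# Sketch: free energy density at eta = 0 versus the spectral form (F21);
# conventions as in the earlier sketches (J = Z^T Z, Z -> Z/sqrt(d)).
rng = np.random.default_rng(3)
N, d = 1000, 500
zeta = d / N
beta, sigma2, sigma0_2 = 2.0, 1.0, 1.0

X = rng.normal(size=(N, d))                 # O(1) entries z_i(mu)
Z = X / np.sqrt(d)                          # the substitution Z -> Z/sqrt(d)
theta0 = rng.normal(size=d)
t = Z @ theta0 + np.sqrt(sigma0_2) * rng.normal(size=N)

J = Z.T @ Z                                 # eta = 0: J_{sigma^2 eta} = J = C/zeta
quad = t @ (t - Z @ np.linalg.solve(J, Z.T @ t))
_, logdetJ = np.linalg.slogdet(J)
F = d/(2*beta) + quad/(2*sigma2) \
    - (d*np.log(2*np.pi*np.e*sigma2/beta) - logdetJ)/(2*beta)

lam = np.linalg.eigvalsh(X.T @ X / N)       # spectrum of C = X^T X / N
F21 = (sigma0_2/(2*sigma2))*(1 - zeta) \
      + (zeta/(2*beta))*np.log(beta/(2*np.pi*sigma2*zeta)) \
      + (zeta/(2*beta))*np.mean(np.log(lam))
print(F/N, F21)                             # close, up to O(1/sqrt(N))
```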
4. Free energy of MAP inference
Let us assume that the true parameters θ_0 are drawn randomly, with mean 0 and covariance matrix S² I_d. We may then compute the average of (F4):

⟨⟨F_{β,σ}[D]⟩_D⟩_{θ_0} = d/(2β) + (1/2σ²) ⟨⟨θ_0^T(J − J J_{σ²η}^{-1} J)θ_0⟩_{θ_0}⟩_Z + (σ₀²/2σ²)( N − ⟨Tr[J J_{σ²η}^{-1}]⟩_Z ) − (1/2β) ⟨log|2πe σ²β^{-1} J_{σ²η}^{-1}|⟩_Z
  = d/(2β) + (1/2σ²) ⟨Tr[(J − J J_{σ²η}^{-1} J) ⟨θ_0θ_0^T⟩_{θ_0}]⟩_Z + (σ₀²/2σ²)( N − ⟨Tr[J J_{σ²η}^{-1}]⟩_Z ) − (1/2β) ⟨log|2πe σ²β^{-1} J_{σ²η}^{-1}|⟩_Z
  = d/(2β) + (S²/2σ²) ⟨Tr[J − J J_{σ²η}^{-1} J]⟩_Z + (σ₀²/2σ²)( N − ⟨Tr[J J_{σ²η}^{-1}]⟩_Z ) − (1/2β) ⟨log|2πe σ²β^{-1} J_{σ²η}^{-1}|⟩_Z
  = d/(2β) + (S²/2ζσ²) ⟨Tr[C − C C_{ζσ²η}^{-1} C]⟩_Z + (σ₀²/2σ²)( N − ⟨Tr[C C_{ζσ²η}^{-1}]⟩_Z ) − (1/2β) ⟨log|2πe σ²β^{-1} ζ C_{ζσ²η}^{-1}|⟩_Z.   (F28)

In the last line we have set Z → Z/√d, with z_i(µ) = O(1) for all (i,µ), and used J = C/ζ and J_{σ²η}^{-1} = ζ C_{ζσ²η}^{-1}, where C = Z^T Z/N. Let us consider the matrix product C C_η^{-1} = C(C + ηI_d)^{-1} = ((C + ηI_d)C^{-1})^{-1} = (I_d + ηC^{-1})^{-1}, giving

Tr[C C_η^{-1}] = Tr[(I_d + ηC^{-1})^{-1}] = Σ_{ℓ=1}^d 1/λ_ℓ(I_d + ηC^{-1}) = Σ_{ℓ=1}^d 1/(1 + ηλ_ℓ(C^{-1})) = Σ_{ℓ=1}^d 1/(1 + ηλ_ℓ^{-1}(C)) = Σ_{ℓ=1}^d λ_ℓ(C)/(λ_ℓ(C)+η).   (F29)

Similarly we can write

C² C_η^{-1} = C²(C + ηI_d)^{-1} = C(I_d + ηC^{-1})^{-1} = ((I_d + ηC^{-1})C^{-1})^{-1} = (C^{-1} + ηC^{-2})^{-1}.   (F30)

The matrices C² and (C + ηI_d)^{-1} obviously commute, so

Tr[C C_η^{-1} C] = Tr[C² C_η^{-1}] = Σ_{ℓ=1}^d λ_ℓ(C²)/λ_ℓ(C_η) = Σ_{ℓ=1}^d λ_ℓ²(C)/λ_ℓ(C_η).   (F31)

Now C_η = C + ηI_d = C(I_d + ηC^{-1}) = (I_d + ηC^{-1})C, and hence

λ_ℓ(C_η) = λ_ℓ(C) λ_ℓ(I_d + ηC^{-1}) = λ_ℓ(C)( 1 + ηλ_ℓ^{-1}(C) ) = λ_ℓ(C) + η.   (F32)

Thus Tr[C C_η^{-1} C] = Σ_{ℓ=1}^d λ_ℓ²(C)/(λ_ℓ(C)+η). Finally, we consider the inverse

C_η^{-1} = (C + ηI_d)^{-1} = (I_d + ηC^{-1})^{-1} C^{-1} = C^{-1}(I_d + ηC^{-1})^{-1}.   (F33)

The matrices C^{-1} and (I_d + ηC^{-1})^{-1} obviously commute, and hence the ℓ-th eigenvalue of C_η^{-1} is given by

λ_ℓ(C_η^{-1}) = 1/( λ_ℓ(C) + η ).   (F34)
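The identities (F29)-(F34) are straightforward to confirm numerically, for instance with the following sketch for a random empirical covariance matrix C.

```python
import numpy as np

# Sketch: numerical confirmation of the eigenvalue identities (F29)-(F34)
# for C_eta = C + eta I_d, with C a random empirical covariance matrix.
rng = np.random.default_rng(4)
N, d, eta = 60, 30, 0.4

X = rng.normal(size=(N, d))
C = X.T @ X / N
lam = np.linalg.eigvalsh(C)                            # ascending eigenvalues
C_eta_inv = np.linalg.inv(C + eta * np.eye(d))

print(np.trace(C @ C_eta_inv),     np.sum(lam/(lam + eta)))        # (F29)
print(np.trace(C @ C_eta_inv @ C), np.sum(lam**2/(lam + eta)))     # (F31)-(F32)
# (F34): the eigenvalues of C_eta^{-1} are 1/(lambda_l(C) + eta)
print(np.allclose(np.linalg.eigvalsh(C_eta_inv), np.sort(1/(lam + eta))))
```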
Using the above results on the relevant matrices in (F28) allows us to compute the average free energy:

⟨⟨(1/N) F_{β,σ}[D]⟩_D⟩_{θ_0} = ζ/(2β) + (S²/2ζσ²N) ⟨Σ_{ℓ=1}^d ζσ²η λ_ℓ(C)/(λ_ℓ(C)+ζσ²η)⟩_Z + (σ₀²/2σ²)( 1 − (1/N) ⟨Σ_{ℓ=1}^d λ_ℓ(C)/(λ_ℓ(C)+ζσ²η)⟩_Z ) − (ζ/2β) log(2πe σ²β^{-1}ζ) + (1/2βN) ⟨Σ_{ℓ=1}^d log(λ_ℓ(C)+ζσ²η)⟩_Z,   (F35)

and hence the average free energy density is given by

⟨⟨(1/N) F_{β,σ}[D]⟩_D⟩_{θ_0} = ζ/(2β) + (S²ζη/2) ∫dλ ⟨ρ_d(λ|Z)⟩_Z λ/(λ+ζσ²η) + (σ₀²/2σ²)( 1 − ζ ∫dλ ⟨ρ_d(λ|Z)⟩_Z λ/(λ+ζσ²η) ) − (ζ/2β) log(2πe σ²β^{-1}ζ) + (ζ/2β) ∫dλ ⟨ρ_d(λ|Z)⟩_Z log(λ+ζσ²η).   (F36)

In the derivation of the above results we used the simple eigenvalue identities

λ_ℓ(C_η^{-1}) = 1/(λ_ℓ(C)+η),   λ_ℓ(C C_η^{-1}) = λ_ℓ(C)/(λ_ℓ(C)+η),   λ_ℓ(C − C C_η^{-1} C) = ηλ_ℓ(C)/(λ_ℓ(C)+η),   (F37)

from which one also obtains

Tr[C − C C_η^{-1} C] = dη ∫dλ ρ_d(λ|Z) λ/(λ+η),   Tr[C C_η^{-1}] = d ∫dλ ρ_d(λ|Z) λ/(λ+η).   (F38)

Finally, we compute the variance of the free energy (F1). First we consider the energy variance (F17), where now also θ_0 is averaged over; using (F16), applied this time to the Gaussian vector θ_0, we obtain

4σ⁴ Var(E[D]) = S⁴ ⟨Tr[J − J J_{σ²η}^{-1} J]² + 2Tr[(J − J J_{σ²η}^{-1} J)²]⟩_Z − S⁴ ⟨Tr[J − J J_{σ²η}^{-1} J]⟩_Z² + 4σ₀² S² ⟨Tr[J(I_d − J_{σ²η}^{-1} J)²]⟩_Z
  + σ₀⁴[ ⟨(N − Tr[J J_{σ²η}^{-1}])²⟩_Z + 2⟨N − 2Tr[J J_{σ²η}^{-1}] + Tr[(J J_{σ²η}^{-1})²]⟩_Z − ⟨N − Tr[J J_{σ²η}^{-1}]⟩_Z² ]
  + 2σ₀² S²[ ⟨Tr[J − J J_{σ²η}^{-1} J](N − Tr[J J_{σ²η}^{-1}])⟩_Z − ⟨Tr[J − J J_{σ²η}^{-1} J]⟩_Z ⟨N − Tr[J J_{σ²η}^{-1}]⟩_Z ],   (F39)

where we have used (F16).
Hence, for Z → Z/√d with z_i(µ) = O(1), expressing all traces in (F39) in terms of the spectral density (F22) via (F37) and (F38), and collecting the orders that survive the limit, we obtain

4σ⁴ Var(E[D]/N) = ζ²( S²σ²η − σ₀² )²[ ⟨(∫dλ ρ_d(λ|Z) λ/(λ+ζσ²η))²⟩_Z − ⟨∫dλ ρ_d(λ|Z) λ/(λ+ζσ²η)⟩_Z² ] + O(1/N),   (F40)

and hence

Var(E[D]/N) = [ζ²(S²σ²η − σ₀²)²/4σ⁴][ ⟨(∫dλ ρ_d(λ|Z) λ/(λ+ζσ²η))²⟩_Z − ⟨∫dλ ρ_d(λ|Z) λ/(λ+ζσ²η)⟩_Z² ] + O(1/N).   (F41)

Furthermore, the covariance (F19) becomes

4σ² Cov(E[D], S[D]) = Cov( t^T(I_N − Z J_{σ²η}^{-1} Z^T) t , log|J_{σ²η}^{-1}| )
  = ⟨[ S² Tr[J − J J_{σ²η}^{-1} J] + σ₀²(N − Tr[J J_{σ²η}^{-1}]) ] log|J_{σ²η}^{-1}|⟩_Z − ⟨S² Tr[J − J J_{σ²η}^{-1} J] + σ₀²(N − Tr[J J_{σ²η}^{-1}])⟩_Z ⟨log|J_{σ²η}^{-1}|⟩_Z.   (F42)

From this, upon setting Z → Z/√d with z_i(µ) = O(1) for all (i,µ), then follows the result

4σ² Cov(E[D]/N, S[D]/N) = ζ² ∫dλ dλ̃ [ ⟨ρ_d(λ|Z)⟩_Z ⟨ρ_d(λ̃|Z)⟩_Z − ⟨ρ_d(λ|Z)ρ_d(λ̃|Z)⟩_Z ] [ S² λζσ²η/(λ+ζσ²η) + σ₀²( 1 − ζλ/(λ+ζσ²η) ) ] log(λ̃+ζσ²η).   (F43)

Finally, the entropy variance (F18) for Z → Z/√d is given by

Var(S[D]/N) = (ζ²/4) ∫dλ dλ̃ [ ⟨ρ_d(λ|Z)ρ_d(λ̃|Z)⟩_Z − ⟨ρ_d(λ|Z)⟩_Z ⟨ρ_d(λ̃|Z)⟩_Z ] log(λ+ζσ²η) log(λ̃+ζσ²η).   (F44)
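As a consistency check on the average (F36) derived above, the following sketch compares a Monte Carlo average of F/N over θ_0 ~ N(0, S² I_d) and the noise, at fixed Z, with the spectral expression evaluated on that Z's own eigenvalue density (so no Z-average is involved); the conventions for J and J_{σ²η} are the same assumptions as in the earlier sketches.

```python
import numpy as np

# Sketch: Monte Carlo average of F/N over theta_0 and noise at fixed Z,
# versus the spectral expression (F36) with this Z's own eigenvalue density.
rng = np.random.default_rng(5)
N, d = 600, 300
zeta = d / N
beta, sigma2, sigma0_2, S2, eta = 2.0, 1.0, 0.8, 1.5, 0.5

X = rng.normal(size=(N, d))
Z = X / np.sqrt(d)
J_eta = Z.T @ Z + sigma2 * eta * np.eye(d)
P = Z @ np.linalg.solve(J_eta, Z.T)                    # Z J_eta^{-1} Z^T
_, logdet = np.linalg.slogdet(J_eta)

Fs = []
for _ in range(400):
    theta0 = np.sqrt(S2) * rng.normal(size=d)
    t = Z @ theta0 + np.sqrt(sigma0_2) * rng.normal(size=N)
    quad = t @ (t - P @ t)
    Fs.append(d/(2*beta) + quad/(2*sigma2)
              - (d*np.log(2*np.pi*np.e*sigma2/beta) - logdet)/(2*beta))
mc = np.mean(Fs) / N

lam = np.linalg.eigvalsh(X.T @ X / N)                  # spectrum of C
g = lam / (lam + zeta*sigma2*eta)
F36 = zeta/(2*beta) + 0.5*S2*zeta*eta*np.mean(g) \
      + (sigma0_2/(2*sigma2))*(1 - zeta*np.mean(g)) \
      - (zeta/(2*beta))*np.log(2*np.pi*np.e*sigma2*zeta/beta) \
      + (zeta/(2*beta))*np.mean(np.log(lam + zeta*sigma2*eta))
print(mc, F36)                                         # agree to MC accuracy
```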
Using all of the above results in (F5) we finally obtain the variance of the free energy density:

Var(F_{β,σ}[D]/N) = ∫dλ dλ̃ [ ⟨ρ_d(λ|Z)ρ_d(λ̃|Z)⟩_Z − ⟨ρ_d(λ|Z)⟩_Z ⟨ρ_d(λ̃|Z)⟩_Z ]
  × { (ζ²/4σ⁴)( S²σ²η − σ₀² )² [λ/(λ+ζσ²η)][λ̃/(λ̃+ζσ²η)] + (T²ζ²/4) log(λ+ζσ²η) log(λ̃+ζσ²η)
  + (Tζ²/2σ²)[ S² λζσ²η/(λ+ζσ²η) + σ₀²( 1 − ζλ/(λ+ζσ²η) ) ] log(λ̃+ζσ²η) } + O(1/N).   (F45)

Appendix G: Self-averaging of the estimator σ̂²

Let us consider the equation

σ² = [β/(β−ζ)] (1/N) ||t − Zθ̂[D]||² − [σ⁴η/(β−ζ)] (1/N) Tr[J_{σ²η}^{-1}] + [2σ²β/((β−ζ)N)] ∂/∂σ² log P(σ²).   (G1)

Upon inserting the MAP estimator (A3) and using the short-hand σ² = v, we obtain

v = [β/(β−ζ)] (1/N) t^T(I_N − Z J_{vη}^{-1} Z^T) t − [v²η/(β−ζ)] (1/N) Tr[J_{vη}^{-1}] + [2vβ/((β−ζ)N)] ∂_v log P(v).   (G2)

To solve this equation for v we define the following recursion, with the short-hand ∂_v ≡ ∂/∂v:

v_{t+1} = [β/(β−ζ)] (1/N) t^T(I_N − Z J_{v_tη}^{-1} Z^T) t − [v_t²η/(β−ζ)] (1/N) Tr[J_{v_tη}^{-1}] + [2v_tβ/((β−ζ)N)] ∂_v log P(v)|_{v=v_t}.   (G3)

Since t = Zθ_0 + ε, this recursion, of which the desired estimator is the fixed-point, has the general form

v_{t+1} = Ψ[v_t | Z, θ_0, ε].   (G4)

Thus, for any choice of {Z, θ_0, ε}, which play the role of 'disorder', the function Ψ is a random non-linear operator acting on v_t. If the initial value v_0 is independent of the disorder, then the next value v_1 is independent of the particular realisation of the disorder, i.e. v_1 is self-averaging, as soon as the operator Ψ is self-averaging, i.e. if

lim_{(N,d)→∞} [ ⟨Ψ²[v|Z,θ_0,ε]⟩_{Z,θ_0,ε} − ⟨Ψ[v|Z,θ_0,ε]⟩²_{Z,θ_0,ε} ] = 0.   (G5)

By induction, all v_t with t ≥ 1 are then self-averaging, and the stochastic map (G4) can for (N,d) → ∞ be replaced by the following deterministic map, whose fixed-point will be the asymptotic estimator σ̂² that is then guaranteed to be self-averaging:

v_{t+1} = ⟨Ψ[v_t | Z, θ_0, ε]⟩_{Z,θ_0,ε}.   (G6)

To prove the self-averaging property of Ψ we assume that the true parameters θ_0 and the noise ε have mean 0 and covariance matrices S² I_d and σ₀² I_N, respectively. Let us first consider the average

⟨Ψ[v|Z,θ_0,ε]⟩_{Z,θ_0,ε} = [β/(N(β−ζ))] ⟨t^T(I_N − Z J_{vη}^{-1} Z^T) t⟩_{Z,θ_0,ε} − [v²η/(N(β−ζ))] ⟨Tr[J_{vη}^{-1}]⟩_Z + [2vβ/((β−ζ)N)] ∂_v log P(v)|_{v=v}
  = [βσ₀²/(N(β−ζ))] ⟨Tr[I_N − Z J_{vη}^{-1} Z^T]⟩_Z + [β/(N(β−ζ))] ⟨⟨θ_0^T Z^T(I_N − Z J_{vη}^{-1} Z^T) Zθ_0⟩_{θ_0}⟩_Z − [v²η/(N(β−ζ))] ⟨Tr[J_{vη}^{-1}]⟩_Z + [2vβ/((β−ζ)N)] ∂_v log P(v)|_{v=v}
  = [βσ₀²/(N(β−ζ))] ⟨Tr[I_N − Z J_{vη}^{-1} Z^T]⟩_Z + [βS²/(N(β−ζ))] ⟨Tr[J(I_d − J_{vη}^{-1} J)]⟩_Z − [v²η/(N(β−ζ))] ⟨Tr[J_{vη}^{-1}]⟩_Z + [2vβ/((β−ζ)N)] ∂_v log P(v)|_{v=v}
  = [βσ₀²/(β−ζ)]( 1 − ζ + (1/N) ⟨Tr[I_d − J J_{vη}^{-1}]⟩_Z ) + [βS²/(N(β−ζ))] ⟨Tr[J(I_d − J_{vη}^{-1} J)]⟩_Z − [v²η/(N(β−ζ))] ⟨Tr[J_{vη}^{-1}]⟩_Z + [2vβ/((β−ζ)N)] ∂_v log P(v)|_{v=v}.   (G7)

We then make the usual substitution Z → Z/√d, with z_i(µ) = O(1) for all (i,µ), and define the average ρ̄_d(λ) = ⟨ρ_d(λ|Z)⟩_Z of the eigenvalue density

ρ_d(λ|Z) = (1/d) Σ_{ℓ=1}^d δ[ λ − λ_ℓ(Z^T Z/N) ]   (G8)
of the empirical covariance matrix. Then the above average becomes

⟨Ψ[v|Z,θ_0,ε]⟩_{Z,θ_0,ε} = [βσ₀²/(β−ζ)]( 1 − ζ + ζ ∫dλ ρ̄_d(λ) ζvη/(λ+ζvη) ) + [βS²/(β−ζ)] ∫dλ ρ̄_d(λ) [ζvη/(λ+ζvη)] λ − [v²ηζ²/(β−ζ)] ∫dλ ρ̄_d(λ)/(λ+ζvη) + [2vβ/((β−ζ)N)] ∂_v log P(v)|_{v=v}.   (G9)

Finally we turn to the variance:

Var(Ψ[v|Z,θ_0,ε]) = [β/(β−ζ)]² (1/N²) Var( t^T(I_N − Z J_{vη}^{-1} Z^T) t ) + [v²η/(β−ζ)]² (1/N²) Var( Tr[J_{vη}^{-1}] )
  − [2βv²η/((β−ζ)²N²)] Cov( t^T(I_N − Z J_{vη}^{-1} Z^T) t , Tr[J_{vη}^{-1}] ).   (G10)

Computing in this expression the relevant averages over the random variables Z, θ_0 and ε, with the familiar substitution Z → Z/√d, gives us the following result:

Var(Ψ[v|Z,θ_0,ε]) = ∫dλ dλ̃ C_d(λ,λ̃) { [β/(β−ζ)]² [ζvη/(λ+ζvη)][ζvη/(λ̃+ζvη)] [ σ₀⁴ζ² + S⁴λλ̃ + 2σ₀²S²ζλ ]
  + [v²ζ²η/(β−ζ)]² / ((λ+ζvη)(λ̃+ζvη)) − [2βv²ζ²η/(β−ζ)²] [ζvη/(λ+ζvη)] [ σ₀²ζ + S²λ ] / (λ̃+ζvη) }
  + (2/N)[β/(β−ζ)]² ∫dλ ρ̄_d(λ) ( σ₀²[ 1 − ζ + ζ ζvη/(λ+ζvη) ] + S²ζ [ζvη/(λ+ζvη)] λ ),   (G11)

with the correlation function C_d(λ,λ̃) = ⟨ρ_d(λ|Z)ρ_d(λ̃|Z)⟩_Z − ⟨ρ_d(λ|Z)⟩_Z ⟨ρ_d(λ̃|Z)⟩_Z. Clearly, if the spectrum ρ_d(λ|Z) is self-averaging when (N,d) → ∞, then the correlation function will vanish in this limit, and hence Ψ[v|Z,θ_0,ε] will be self-averaging.
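The recursion (G3) and its self-averaging fixed point can also be explored numerically. The sketch below assumes a flat hyperprior (so the ∂_v log P term is dropped) and the MAP/ridge estimator θ̂[D] = J_{vη}^{-1} Z^T t with J = Z^T Z, with coefficients as reconstructed in (G3); two independent disorder realisations yield nearly identical fixed points at large N, illustrating (G5)-(G6).

```python
import numpy as np

# Sketch of the stochastic map (G3)/(G4) and its fixed point, assuming a flat
# hyperprior and theta_hat = J_{v eta}^{-1} Z^T t; purely illustrative.
beta, eta = 1.0, 0.2
N, d = 2000, 1000
zeta = d / N
S2, sigma0_2 = 1.0, 1.0

def make_psi(seed):
    """Draw one disorder realisation {Z, theta_0, eps}; return Psi[v | ...]."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(N, d)) / np.sqrt(d)
    theta0 = np.sqrt(S2) * rng.normal(size=d)
    t = Z @ theta0 + np.sqrt(sigma0_2) * rng.normal(size=N)
    lamJ, V = np.linalg.eigh(Z.T @ Z)      # spectral data of J = Z^T Z
    w = V.T @ (Z.T @ t)
    tt = t @ t
    def psi(v):
        quad = tt - np.sum(w**2 / (lamJ + v*eta))  # t^T (I - Z J_{v eta}^{-1} Z^T) t
        tr = np.sum(1.0 / (lamJ + v*eta))          # Tr[J_{v eta}^{-1}]
        return beta/(beta - zeta) * quad/N - v**2 * eta/(beta - zeta) * tr/N
    return psi

for seed in (1, 2):                        # two independent disorder realisations
    psi, v = make_psi(seed), 1.0
    for _ in range(50):
        v = psi(v)                         # iterate to the fixed point
    print(seed, v)                         # nearly identical: self-averaging
```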