Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization
Benjamin Aubin†, Florent Krzakala⋆, Yue M. Lu◦, Lenka Zdeborová†
† Université Paris-Saclay, CNRS, CEA, Institut de physique théorique, 91191 Gif-sur-Yvette, France.
⋆ Laboratoire de Physique Statistique, CNRS & Sorbonne Universités, École Normale Supérieure, PSL University, Paris, France.
◦ John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.
Abstract
We consider a commonly studied supervised classification task on a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performance of standard classifiers in the high-dimensional regime where α = n/d is kept finite in the limit of a high dimension d and number of samples n. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by ℓ2-regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Second, focusing on commonly used loss functions and optimizing the ℓ2 regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As α → ∞ they lead to Bayes-optimal rates, a fact that does not follow from the predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error.

Introduction
High-dimensional statistics, where the ratio α = n/d is kept finite while the dimensionality d and the number of samples n grow, often displays interesting non-intuitive features. Asymptotic generalization performance for such problems in the so-called teacher-student setting, with synthetic data, has been the subject of intense investigation spanning many decades [1–6]. To understand the effectiveness of modern machine learning techniques, and also the limitations of classical statistical learning approaches [7, 8], it is of interest to revisit this line of research. Indeed, this direction is currently experiencing a renewal of interest, as testified by some very recent, yet already rather influential papers [9–13]. The present paper subscribes to this line of work and studies high-dimensional classification within one of the simplest models considered in statistics and machine learning: convex linear estimation with data generated by a teacher perceptron [14]. We focus on the generalization abilities in this problem and compare the performance of Bayes-optimal estimation to the more standard empirical risk minimization. We then compare the results with the predictions of standard generalization bounds, which illustrates in particular their limitations even in this simple, yet non-trivial, setting.
Synthetic data model — We consider a supervised machine learning task whose dataset is generated by a single-layer neural network, often named a teacher [1–3], that belongs to the Generalized Linear Model (GLM) class. We therefore assume the n samples are drawn according to
$$\mathbf{y} = \varphi_{\mathrm{out}}^\star\Big(\tfrac{1}{\sqrt{d}}\,\mathbf{X}\mathbf{w}^\star\Big) \;\Leftrightarrow\; \mathbf{y} \sim P_{\mathrm{out}}^\star(\cdot)\,, \qquad (1)$$
where w⋆ ∈ R^d denotes the ground-truth vector drawn from a probability distribution P_{w⋆} with second moment ρ_{w⋆} ≡ (1/d) E[‖w⋆‖²₂], and φ⋆_out represents a deterministic or stochastic activation function, equivalently associated to a distribution P⋆_out. The input data matrix X = (x_µ)ⁿ_{µ=1} ∈ R^{n×d} contains iid Gaussian vectors, i.e. ∀µ ∈ [1:n], x_µ ∼ N(0, I_d). Even though the framework we use and the theorems and results we derive are valid for a rather generic channel in eq. (1) — including regression problems — we mainly focus the presentation on the commonly considered perceptron case: a binary classification task with data given by a sign activation function φ⋆_out(z) = sign(z), with a Gaussian weight distribution P_{w⋆}(w⋆) = N_{w⋆}(0, ρ_{w⋆} I_d). The ±1 labels are thus generated as
$$\mathbf{y} = \mathrm{sign}\Big(\tfrac{1}{\sqrt{d}}\,\mathbf{X}\mathbf{w}^\star\Big)\,, \quad \text{with } \mathbf{w}^\star \sim \mathcal{N}_{\mathbf{w}^\star}(\mathbf{0},\rho_{w^\star}\mathbf{I}_d)\,. \qquad (2)$$
Empirical Risk Minimization — The workhorse of machine learning is Empirical Risk Minimization (ERM), where one minimizes a loss function in the corresponding high-dimensional parameter space R^d. To avoid overfitting the training set, one often adds a regularization term r. ERM then corresponds to estimating ŵ_erm = argmin_w L(w; y, X), where the regularized training loss L is defined, using the notation z_µ(w, x_µ) ≡ (1/√d) x_µ^⊤ w, by
$$\mathcal{L}(\mathbf{w};\mathbf{y},\mathbf{X}) = \sum_{\mu=1}^{n} l\big(y_\mu, z_\mu(\mathbf{w},\mathbf{x}_\mu)\big) + r(\mathbf{w})\,. \qquad (3)$$
The goal of the present paper is to discuss the generalization performance of these estimators for the classification task (2) in the high-dimensional limit. We focus our analysis on commonly used loss functions l, namely the square loss l_square(y, z) = ½(y − z)², the logistic loss l_logistic(y, z) = log(1 + exp(−yz)) and the hinge loss l_hinge(y, z) = max(0, 1 − yz). We mainly illustrate our results for the ℓ2 regularization r(w) = λ‖w‖²₂/2, where we introduced a regularization strength hyper-parameter λ.
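As a concrete illustration, the following minimal sketch (our own, not the code behind the figures; d, α and λ are arbitrary choices) sets up the teacher-student experiment of eqs. (2)-(3) with scikit-learn, the library also used for our simulations; note that scikit-learn parametrizes the ℓ2 penalty through C = 1/λ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, alpha, lam = 1000, 2.0, 0.5            # dimension, samples per dimension, l2 strength
n = int(alpha * d)

w_star = rng.standard_normal(d)            # teacher weights, P_w* = N(0, I_d)
X = rng.standard_normal((n, d))            # iid Gaussian inputs
y = np.sign(X @ w_star / np.sqrt(d))       # +-1 labels from the sign teacher, eq. (2)

# l2-regularized logistic regression; sklearn's penalty is ||w||^2/(2C), so lambda = 1/C
clf = LogisticRegression(C=1.0 / lam, fit_intercept=False)
clf.fit(X, y)

# the test error on fresh samples estimates the generalization error e_g
X_test = rng.standard_normal((10 * n, d))
y_test = np.sign(X_test @ w_star / np.sqrt(d))
print("empirical e_g:", np.mean(clf.predict(X_test) != y_test))
```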
Related works — The above learning problem has been extensively studied in the statistical physics community using the heuristic replica method [1–3, 14, 15]. Due to the interest in high-dimensional statistics, these questions have experienced a resurgence in popularity in recent years. In particular, rigorous works on related problems are much more recent. The authors of [10] rigorously established the replica-theory predictions for the Bayes-optimal generalization error. Here we focus on standard ERM estimation and compare it to the results obtained in [10]. The authors of [16] rigorously analyzed M-estimators for the regression case where data are generated by a linear-activation teacher. Here we analyze classification with a more general and non-linear teacher, focusing in particular on the sign teacher. The case of the max-margin loss was studied in [17] with a technically closely related proof, but with a focus on the over-parametrized regime, thus not addressing the questions that we focus on. A range of unregularized losses was also analyzed for a sigmoid teacher (which is very similar to a sign teacher), again in the context of double-descent behaviour, in [18, 19]. Here we focus instead on the regularized case, as it drastically improves the generalization performance of ERM and allows us to compare with Bayes-optimal estimation as well as with standard generalization bounds. Our proof, as in the above-mentioned works and [20], is based on Gordon's minimax formalism, including in particular the effect of the regularization.
Main contributions —
Our first main contribution is to provide rigorously, in Sec. 2, the classification generalization performance of empirical risk minimization with the loss given by (3) in the high-dimensional limit, for any convex loss and an ℓ2 regularization. Note that the proof is easily extended to any convex separable regularization. Additionally, we provide a proof of the equivalence between the results of our paper and the ones initially obtained by the replica method, which is of additional interest given the wide range of applications of these heuristic statistical-physics techniques in machine learning and computer science [21, 22]. In particular, the replica predictions in [15, 23–25] follow from our results. Another approach that originated in physics are the so-called TAP equations [26–28], which lead to the Approximate Message Passing (AMP) algorithm for solving linear and generalized linear problems with Gaussian matrices [29, 30]. This algorithm can be analyzed with the so-called state evolution method [31], and it is widely believed (and in fact proven for linear problems [4, 32]) that the fixed point of the state evolution gives the optimal error in high-dimensional convex optimization problems. The state evolution equations are in fact equivalent to the ones given by the replica theory, and our results therefore vindicate this approach as well. We also demonstrate numerically that these asymptotic results are very accurate even for moderate system sizes; the simulations have been performed with the scikit-learn library [33].

Secondly, and more importantly, we provide in Sec. 3 a detailed analysis of the generalization error for standard losses such as square, hinge (or equivalently support vector machine) and logistic, as a function of the regularization strength λ and the number of samples per dimension α. We observe, in particular, that while ridge regression never closely approaches the Bayes-optimal performance, logistic regression with optimized ℓ2 regularization gets extremely close to optimal. And so does, to a lesser extent, the hinge regression and the max-margin estimator to which the unregularized logistic and hinge converge [34]. It is quite remarkable that these canonical losses are able to approach the error of the Bayes-optimal estimator, for which, in principle, the marginals of a high-dimensional probability distribution need to be evaluated. Notably, all the latter losses give, for a good choice of the regularization strength λ, generalization errors scaling as Θ(α⁻¹) for large α, just as the Bayes-optimal generalization error [10]. This is found to be at variance with the predictions of Rademacher and max-margin-based bounds, which predict instead a Θ(α^{-1/2}) rate [35, 36] and therefore appear to be vacuous in the high-dimensional regime.

Third, in Sec. 4, we design a custom (non-convex) loss and regularizer that provably gives a plug-in estimator efficiently achieving Bayes-optimal performance, including the optimal Θ(α⁻¹) rate for the generalization error. Our construction is related to the one discussed in [37–39], but is not restricted to convex losses.

In the formulas that arise for this statistical estimation problem, the correlations between the estimator ŵ and the ground-truth vector w⋆ play a fundamental role, and we thus define two scalar overlap parameters to measure the statistical reconstruction:
$$m \equiv \frac{1}{d}\,\mathbb{E}_{\mathbf{y},\mathbf{X}}\big[\hat{\mathbf{w}}^\intercal \mathbf{w}^\star\big]\,, \qquad q \equiv \frac{1}{d}\,\mathbb{E}_{\mathbf{y},\mathbf{X}}\big[\|\hat{\mathbf{w}}\|_2^2\big]\,. \qquad (4)$$
In particular, the generalization error of the estimator ŵ(α) ∈ R^d obtained by performing Empirical Risk Minimization (ERM) on the training loss L in eq. (3) with n = αd samples,
$$e_g^{\mathrm{erm}}(\alpha) \equiv \mathbb{E}_{y,\mathbf{x}}\big[\mathbb{1}\big[y \neq \hat{y}(\hat{\mathbf{w}}(\alpha);\mathbf{x})\big]\big]\,, \qquad (5)$$
where ŷ(ŵ(α); x) denotes the predicted label, has, both at finite d and in the asymptotic limit, an explicit expression depending only on the above overlaps m and q:

Proposition 2.1 (Generalization error of classification). In our synthetic binary classification task, the generalization error of ERM (or equivalently the test error) is given by
$$e_g^{\mathrm{erm}}(\alpha) = \frac{1}{\pi}\arccos\big(\sqrt{\eta}\big)\,, \quad \text{with } \eta \equiv \frac{m^2}{\rho_{w^\star}\, q} \text{ and } \rho_{w^\star} \equiv \frac{1}{d}\,\mathbb{E}\big[\|\mathbf{w}^\star\|_2^2\big]\,. \qquad (6)$$

Proof. The proof, shown in SM. II, is a simple computation based on a Gaussian integration.

To obtain the generalization performance, it thus remains to obtain the asymptotic values of m, q (and thus of η) in the limit d → ∞. With the ℓ2 regularization, these values are characterized by a set of fixed-point equations given by the next theorems. For any τ > 0, let us first recall the definitions of the Moreau-Yosida regularization and the proximal operator of a convex loss function (y, z) ↦ ℓ(y·z):
$$\mathcal{M}_\tau(z) = \min_x \Big\{ \ell(x) + \frac{(x-z)^2}{2\tau} \Big\}\,, \qquad \mathcal{P}_\tau(z) = \underset{x}{\mathrm{argmin}} \Big\{ \ell(x) + \frac{(x-z)^2}{2\tau} \Big\}\,. \qquad (7)$$

Theorem 2.2 (Gordon's min-max fixed point - Binary classification with ℓ2 regularization). As n, d → ∞ with n/d = α = Θ(1), the overlap parameters m, q concentrate to
$$m \underset{d\to\infty}{\longrightarrow} \sqrt{\rho_{w^\star}}\,\mu^*\,, \qquad q \underset{d\to\infty}{\longrightarrow} (\mu^*)^2 + (\delta^*)^2\,, \qquad (8)$$
where the parameters µ* and δ* are solutions of
$$(\mu^*,\delta^*) = \underset{\mu,\,\delta\geq 0}{\mathrm{argmin}}\ \sup_{\tau>0} \Big\{ \frac{\lambda(\mu^2+\delta^2)}{2} - \frac{\delta^2}{2\tau} + \alpha\,\mathbb{E}_{g,s}\,\mathcal{M}_\tau\big[\delta g + \mu s\,\varphi_{\mathrm{out}}^\star(\sqrt{\rho_{w^\star}}\, s)\big] \Big\}\,, \qquad (9)$$
and g, s are two iid standard normal random variables. The solutions (µ*, δ*) of (9) can be reformulated as a set of fixed-point equations:
$$\mu^* = \frac{\alpha}{\lambda\tau^* + \alpha}\ \mathbb{E}_{g,s}\Big[ s\cdot\varphi_{\mathrm{out}}^\star(\sqrt{\rho_{w^\star}}\,s)\cdot\mathcal{P}_{\tau^*}\big(\delta^* g + \mu^* s\,\varphi_{\mathrm{out}}^\star(\sqrt{\rho_{w^\star}}\,s)\big)\Big]\,,$$
$$\delta^* = \frac{\alpha}{\lambda\tau^* + \alpha - 1}\ \mathbb{E}_{g,s}\Big[ g\cdot\mathcal{P}_{\tau^*}\big(\delta^* g + \mu^* s\,\varphi_{\mathrm{out}}^\star(\sqrt{\rho_{w^\star}}\,s)\big)\Big]\,,$$
$$(\delta^*)^2 = \alpha\,\mathbb{E}_{g,s}\Big[\Big(\big(\delta^* g + \mu^* s\,\varphi_{\mathrm{out}}^\star(\sqrt{\rho_{w^\star}}\,s)\big) - \mathcal{P}_{\tau^*}\big(\delta^* g + \mu^* s\,\varphi_{\mathrm{out}}^\star(\sqrt{\rho_{w^\star}}\,s)\big)\Big)^2\Big]\,. \qquad (10)$$
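Before turning to the proof, here is a small self-contained sketch (our own illustration, with α and λ free parameters and a fixed Monte Carlo sample) that evaluates the min-max problem of eq. (9) for the hinge loss and the noiseless sign teacher, for which the Moreau envelope of ℓ(x) = max(0, 1 − x) is available in closed form, and then reads off e_g through eqs. (6) and (8).

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(1)
g = rng.standard_normal(200_000)
s = rng.standard_normal(200_000)

def moreau_hinge(z, tau):
    """Moreau envelope M_tau of l(x) = max(0, 1 - x), eq. (7), in closed form."""
    out = np.where(z > 1.0, 0.0, (1.0 - z) ** 2 / (2.0 * tau))
    return np.where(z < 1.0 - tau, 1.0 - z - tau / 2.0, out)

def objective(params, alpha, lam):
    mu, delta = params
    z = delta * g + mu * np.abs(s)        # s * sign(s) = |s| for the sign teacher
    def neg_inner(tau):                   # the sup over tau > 0 as a 1d minimization
        return -(lam * (mu**2 + delta**2) / 2 - delta**2 / (2 * tau)
                 + alpha * np.mean(moreau_hinge(z, tau)))
    res = minimize_scalar(neg_inner, bounds=(1e-6, 50.0), method="bounded")
    return -res.fun

alpha, lam = 2.0, 0.1
res = minimize(objective, x0=[0.5, 0.5], args=(alpha, lam),
               bounds=[(0, None), (1e-6, None)], method="Nelder-Mead")
mu, delta = res.x
eta = mu**2 / (mu**2 + delta**2)          # eta = m^2/(rho q), with m = sqrt(rho) mu
print("e_g =", np.arccos(np.sqrt(eta)) / np.pi)
```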
Proof. The proof, shown in SM. III.1, is an application of Gordon's minimax theory.

This set of fixed-point equations can finally be mapped to the ones obtained by the heuristic replica method from statistical physics (whose derivation is shown in SM. IV), as well as to the state evolution of the approximate message passing algorithm [27, 30, 40]. Their validity for this convex estimation problem is thus rigorously established by the following theorem:
Corollary 2.3 (Equivalence Gordon-replicas). As n, d → ∞ with n/d = α = Θ(1), the overlap parameters m, q concentrate to the fixed point of the following set of equations:
$$m = \alpha\,\Sigma\,\rho_{w^\star}\cdot\mathbb{E}_{y,\xi}\Big[\mathcal{Z}_{\mathrm{out}^\star}\big(y,\sqrt{\rho_{w^\star}\eta}\,\xi,\rho_{w^\star}(1-\eta)\big)\, f_{\mathrm{out}^\star}\big(y,\sqrt{\rho_{w^\star}\eta}\,\xi,\rho_{w^\star}(1-\eta)\big)\cdot f_{\mathrm{out}}\big(y,q^{1/2}\xi,\Sigma\big)\Big]\,,$$
$$q = m^2/\rho_{w^\star} + \alpha\,\Sigma^2\cdot\mathbb{E}_{y,\xi}\Big[\mathcal{Z}_{\mathrm{out}^\star}\big(y,\sqrt{\rho_{w^\star}\eta}\,\xi,\rho_{w^\star}(1-\eta)\big)\cdot f_{\mathrm{out}}\big(y,q^{1/2}\xi,\Sigma\big)^2\Big]\,, \qquad (11)$$
$$\Sigma = \Big(\lambda - \alpha\cdot\mathbb{E}_{y,\xi}\Big[\mathcal{Z}_{\mathrm{out}^\star}\big(y,\sqrt{\rho_{w^\star}\eta}\,\xi,\rho_{w^\star}(1-\eta)\big)\cdot\partial_\omega f_{\mathrm{out}}\big(y,q^{1/2}\xi,\Sigma\big)\Big]\Big)^{-1}$$
(note that for a generic convex and non-separable regularizer, different from ℓ2, the fixed point would instead contain six equations, see SM. III.2), with
$$\eta \equiv \frac{m^2}{\rho_{w^\star}\, q}\,, \qquad f_{\mathrm{out}}(y,\omega,V) \equiv V^{-1}\big(\mathcal{P}_{V}[l(y,\cdot)](\omega) - \omega\big)\,,$$
$$\mathcal{Z}_{\mathrm{out}^\star}(y,\omega,V) = \mathbb{E}_z\Big[P_{\mathrm{out}^\star}\big(y\,\big|\,\sqrt{V}z+\omega\big)\Big]\,, \qquad f_{\mathrm{out}^\star}(y,\omega,V) \equiv \partial_\omega \log\big(\mathcal{Z}_{\mathrm{out}^\star}(y,\omega,V)\big)\,, \qquad (12)$$
where ξ, z denote two iid standard normal random variables, and E_y denotes the continuous or discrete sum over all possible values of y according to P_out⋆.
Proof. For the sake of clarity, the proof is again left to SM. III.3.

Equivalent equations for the whole GLM class (classification and regression), with any separable and convex regularizer, are given in SM. III.2.
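For the square loss with ℓ2 regularization and the noiseless sign teacher, every quantity entering eqs. (11)-(12) is analytic (f_out = (y − ω)/(1 + Σ), and Z_out⋆ is a Gaussian tail, cf. SM. I.4), so the fixed point can be reached by a simple damped iteration. The sketch below is our own illustration; the damping and Monte Carlo sample sizes are arbitrary choices.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
xi = rng.standard_normal(400_000)                    # Monte Carlo samples for E_xi

def ridge_overlaps(alpha, lam, rho=1.0, iters=300, damp=0.3):
    m, q, Sigma = 0.1, 0.5, 1.0
    for _ in range(iters):
        eta = np.clip(m**2 / (rho * q), 0.0, 1 - 1e-9)
        omega_s = np.sqrt(rho * eta) * xi            # teacher-side mean (m/sqrt(q)) xi
        V_s = rho * (1 - eta)                        # teacher-side variance
        N = np.exp(-omega_s**2 / (2 * V_s)) / np.sqrt(2 * np.pi * V_s)
        m_new, q_hat = 0.0, 0.0
        for y in (+1.0, -1.0):
            Z = 0.5 * (1 + erf(y * omega_s / np.sqrt(2 * V_s)))   # Z_out* for the sign channel
            f_out = (y - np.sqrt(q) * xi) / (1 + Sigma)           # square-loss denoiser
            m_new += np.mean(y * N * f_out)          # Z_out* f_out* = y N for the sign channel
            q_hat += np.mean(Z * f_out**2)
        m_new *= alpha * Sigma * rho
        q_new = m_new**2 / rho + alpha * Sigma**2 * q_hat
        Sigma_new = 1.0 / (lam + alpha / (1 + Sigma))             # d_omega f_out = -1/(1+Sigma)
        m, q, Sigma = [(1 - damp) * old + damp * new
                       for old, new in zip((m, q, Sigma), (m_new, q_new, Sigma_new))]
    return m, q

m, q = ridge_overlaps(alpha=3.0, lam=0.5708)
print("e_g(ridge) =", np.arccos(np.sqrt(m**2 / q)) / np.pi)       # eq. (6) with rho = 1
```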
Bayes-optimal baseline —
Finally, we compare the ERM performance to the Bayes-optimal generalization error. Since the Bayes-optimal estimator is the information-theoretically best possible one, we use it as a reference baseline for comparison. The expression of the Bayes-optimal generalization error was derived in [24] and proven in [10]; we recall the result here:
Theorem 2.4 (Bayes asymptotic performance, from [10]). For the model (1) with P_{w⋆}(w⋆) = N_{w⋆}(0, ρ_{w⋆} I_d), the Bayes-optimal generalization error is quantified by two scalar parameters q_b and q̂_b that verify the set of fixed-point equations
$$q_b = \frac{\rho_{w^\star}^2\,\hat{q}_b}{1+\rho_{w^\star}\hat{q}_b}\,, \qquad \hat{q}_b = \alpha\,\mathbb{E}_{y,\xi}\Big[\mathcal{Z}_{\mathrm{out}^\star}\big(y,q_b^{1/2}\xi,\rho_{w^\star}-q_b\big)\cdot f_{\mathrm{out}^\star}\big(y,q_b^{1/2}\xi,\rho_{w^\star}-q_b\big)^2\Big]\,, \qquad (13)$$
and reads
$$e_g^{\mathrm{bayes}}(\alpha) = \frac{1}{\pi}\arccos\big(\sqrt{\eta_b}\big)\,, \quad \text{with } \eta_b = \frac{q_b}{\rho_{w^\star}}\,. \qquad (14)$$

We now move to the core of the paper and analyze the set of fixed-point equations (10), or equivalently (11), leading to the generalization performance given by (6), for common classifiers on our synthetic binary classification task. As already stressed, even though the results are valid for a wide range of regularizers, we focus on estimators based on ERM with ℓ2 regularization r(w) = λ‖w‖²₂/2, and with the square loss (ridge regression) l_square(y, z) = ½(y − z)², the logistic loss (logistic regression) l_logistic(y, z) = log(1 + exp(−yz)), or the hinge loss (SVM) l_hinge(y, z) = max(0, 1 − yz). In particular, we study the influence of the hyper-parameter λ on the generalization performance and the different large-α generalization rates in the high-dimensional regime, and compare with the Bayes results. We show the solutions of the set of fixed-point equations (11) in Figs. 1a, 1b, 1c, respectively for ridge, hinge and logistic ℓ2-regularized regressions. Ridge regression is a special case, for which the quadratic loss allows to derive and fully solve the equations (see SM. V.3). In general, however, the set of equations has no analytical closed form and therefore needs to be solved numerically. This is in particular the case for logistic and hinge regression, whose Moreau-Yosida regularization eq. (12) is, however, analytical.

First, to highlight the accuracy of the theoretical predictions, we compare in Figs. 1a-1c the ERM asymptotic (d → ∞) generalization error with the performance of numerical simulations (d = 10, averaged over n_s = 20 samples) of ERM on the training loss eq. (3). Presented for a wide range of numbers of samples α and regularization strengths λ, we observe a perfect match between theoretical predictions and numerical simulations, so that the error bars are barely visible and have therefore been removed. This shows that the asymptotic predictions are valid even at very moderate sizes. As an information-theoretic baseline, we also show the Bayes-optimal performance (black) given by the solution of eq. (13).
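The Bayes baseline itself reduces to a one-dimensional fixed point. The sketch below (our own illustration, assuming the noiseless sign teacher, for which Z_out⋆(±1, ω, V) = Φ(±ω/√V) and Σ_y Z_out⋆ f_out⋆² = N_ω(0, V)²/(Z₊Z₋)) iterates eq. (13) by Monte Carlo.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)
xi = rng.standard_normal(400_000)

def bayes_error(alpha, rho=1.0, iters=150, damp=0.5):
    q = 0.1 * rho
    for _ in range(iters):
        V = rho - q
        omega = np.sqrt(q) * xi
        N = np.exp(-omega**2 / (2 * V)) / np.sqrt(2 * np.pi * V)
        Zp = 0.5 * (1 + erf(omega / np.sqrt(2 * V)))     # Z_out* for y = +1
        Zm = 1.0 - Zp                                     # and for y = -1
        # sum_y Z f*^2 = N^2 (1/Zp + 1/Zm); clip the tails to avoid 0/0
        q_hat = alpha * np.mean(N**2 / np.clip(Zp * Zm, 1e-12, None))
        q_new = rho**2 * q_hat / (1 + rho * q_hat)        # eq. (13), Gaussian prior
        q = (1 - damp) * q + damp * q_new
    return np.arccos(np.sqrt(q / rho)) / np.pi            # eq. (14)

print("e_g(bayes) =", bayes_error(alpha=3.0))
```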
Ridge estimation — As one might expect, the square loss gives the worst performance. For low values of the regularization, it leads to an interpolation peak at α = 1. The limit of vanishing regularization λ → 0 leads to the least-norm or pseudo-inverse estimator ŵ_pseudo = (X^⊤X)^{-1}X^⊤y. The corresponding generalization error presents the largest interpolation peak and achieves a maximal generalization error e_g = 0.5. These are well-known observations, discussed as early as in [23, 25], which are the object of a renewal of interest under the name double descent, following a recent series of papers [11, 41–47]. This double-descent behaviour for the pseudo-inverse is shown in Fig. 1a with a yellow line. On the contrary, larger regularization strengths do not suffer from this peak at α = 1, but their generalization error is significantly worse than the Bayes-optimal baseline for larger values of α. Indeed, as one might expect, for a large number of samples a large regularization wrongly biases the training. However, even with optimized regularization, the performance of the ridge estimator remains far away from the Bayes-optimal performance.
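The interpolation peak is easy to reproduce numerically; the following small experiment (our own sketch, sizes arbitrary) evaluates the least-norm estimator across α and shows the error spiking towards e_g = 0.5 at α = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 400
for alpha in (0.5, 0.9, 1.0, 1.1, 2.0, 4.0):
    n = int(alpha * d)
    errs = []
    for _ in range(10):
        w_star = rng.standard_normal(d)
        X = rng.standard_normal((n, d)) / np.sqrt(d)
        y = np.sign(X @ w_star)
        w_hat = np.linalg.pinv(X) @ y        # least-norm solution, the lambda -> 0 ridge limit
        X_t = rng.standard_normal((5000, d)) / np.sqrt(d)
        errs.append(np.mean(np.sign(X_t @ w_hat) != np.sign(X_t @ w_star)))
    print(f"alpha = {alpha:.1f}: e_g = {np.mean(errs):.3f}")
```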
Hinge and logistic estimation — Both these losses, which are the classical ones used in classification problems, drastically improve the generalization error. First of all, let us notice that they do not display a double-descent behavior. This is due to the fact that our results are illustrated in the noiseless case, where our synthetic dataset is always linearly separable. Optimizing the regularization, our results in Figs. 1b-1c show that both hinge and logistic ERM-based classification approach the Bayes error very closely. To put these results in perspective, note that the performance of logistic regression on non-linearly separable data is, however, very poor, as illustrated by our analysis of a rectangle-door teacher (see SM. V.6).
Max-margin estimation—
As discussed in [34], both the logistic and hinge estimators converge, for vanishing regularization λ → 0, to the max-margin solution. Taking the λ → 0 limit in our equations, we thus obtain the performance of the max-margin estimator. While this is not what gives the best generalization error (as can be seen in Fig. 1c, the logistic with an optimized λ has a lower error), the max-margin estimator gives very good results and gets very close to the Bayes error. Optimal regularization—
Defining the regularization value that optimizes the generalization as
$$\lambda_{\mathrm{opt}}(\alpha) = \underset{\lambda}{\mathrm{argmin}}\ e_g^{\mathrm{erm}}(\alpha,\lambda)\,, \qquad (15)$$
we show in Figs. 1b-1c that the optimal values λ_opt(α) (dashed-dotted orange) for both logistic and hinge regression decrease to 0 as α grows and more data are given. Somewhat surprisingly, we observe in particular that the generalization performance of logistic regression with optimal regularization is extremely close to the Bayes performance. The difference from the optimized logistic generalization error is barely visible by eye, so we explicitly plot the difference in the inset of Fig. 1c.

Ridge regression (Fig. 1a) shows a singular behaviour: there exists an optimal value (purple), moreover independent of α, achieved for λ_opt ≃ 0.5708. This value was first found numerically and confirmed afterwards semi-analytically in SM. V.3.

Generalization rates at large α — Finally, we turn to the very instructive behavior at large values of α, when a large amount of data is available. First, we notice that the Bayes-optimal generalization error, whose large-α analysis is performed in SM. V.1, decreases as e_g^bayes ≈ 0.44 α⁻¹. Compared to this optimal value, ridge regression gives poor performance in this regime. For any value of the regularization λ — and in particular for both the pseudo-inverse case at λ = 0 and the optimal estimator λ_opt — its generalization error decreases much more slowly than the Bayes rate, only as e_g^ridge ∼ Θ(α^{-1/2}) (see SM. V.3 for the derivation). Hinge and logistic regression present a radically different, and more favorable, behaviour. Figs. 1b-1c show that keeping λ finite as α goes to ∞ does not yield the Bayes-optimal rate. However, the max-margin solution (corresponding to the λ → 0 limit of these estimators) gives extremely good performance, e_g^{logistic,hinge} ∼_{λ→0} e_g^{max-margin} ≈ 0.50 α⁻¹ (see the derivation in SM. V.4). This is the same rate as the Bayes one, only with a slightly larger constant. Comparison with VC and Rademacher statistical bounds—
Given that both the max-margin estimator and the optimized logistic achieve optimal generalization rates going as Θ(α⁻¹), it is of interest to compare those rates to the predictions of statistical-learning-theory bounds. Statistical learning analysis (see e.g. [35, 36, 48]) relies to a large extent on the Vapnik-Chervonenkis (VC) dimension analysis and on the so-called
Rademacher complexity. The uniform convergence result states that if the Rademacher complexity or the Vapnik-Chervonenkis dimension d_VC is finite, then for a large enough number of samples the generalization gap vanishes uniformly over all possible values of the parameters. Informally, uniform convergence tells us that with high probability, for any value of the weights w, the generalization gap satisfies
$$R^{\mathrm{population}}(\mathbf{w}) - R_n^{\mathrm{empirical}}(\mathbf{w}) = \Theta\Big(\sqrt{d_{\mathrm{VC}}/n}\Big)\,,$$
where d_VC = d − 1 for our GLM hypothesis class. Therefore, given that the empirical risk can go to zero (since our data are separable), this provides a generalization error upper bound e_g ≤ Θ(α^{-1/2}). This is much worse than what we observe in practice, where we reach the Bayes rate e_g = Θ(α⁻¹). Tighter bounds can be obtained using the Rademacher complexity, and this was studied recently (using the aforementioned replica method) in [49] for the very same problem. We reproduced their results and plotted the Rademacher complexity generalization bound in Fig. 1 (dashed green); it decreases as Θ(α^{-1/2}) for the binary classification task eq. (2).

One may wonder if this could somehow be improved. Another statistical-physics heuristic computation, however, suggests that, unfortunately, uniform bounds are plagued by a slow Θ(α^{-1/2}) rate.

[Figure 1: Asymptotic generalization error for ℓ2 regularization (d → ∞) as a function of α for different regularization strengths λ, compared to numerical simulations (points) of ridge regression for d = 10 and averaged over n_s = 20 samples. Numerics have been performed with the default methods Ridge, LinearSVC, LogisticRegression of the scikit-learn package [33]. The Bayes-optimal performance is shown with a black line and goes as Θ(α⁻¹), while the Rademacher complexity bound (dashed green) decreases as Θ(α^{-1/2}). Both hinge and logistic converge to the max-margin estimator (limit λ = 0), shown in dashed black, which decreases as Θ(α⁻¹), while ridge decreases as Θ(α^{-1/2}). (a) Ridge regression: square loss with ℓ2 regularization; the interpolation peak at α = 1 is maximal for the pseudo-inverse estimator λ = 0 (yellow line), which reaches e_g = 0.5. (b) Hinge regression: hinge loss with ℓ2 regularization; for clarity the rescaled value of λ_opt (dotted-dashed orange) is shown, as well as its generalization error e_g^opt (dotted orange), slightly below and almost indistinguishable from the max-margin performance (dashed black). (c) Logistic regression: logistic loss with ℓ2 regularization; the value of λ_opt (dotted-dashed orange) is shown, as well as its generalization error e_g^opt (dotted orange), visually indistinguishable from the Bayes-optimal line; their difference e_g^opt − e_g^bayes is shown as an inset (dashed orange).]
Indeed, the authors of [50] showed, with a replica-method-style computation, that there exist sets of weights in the binary classification task (2) that lead to Θ(α^{-1/2}) rates: the uniform bound is thus tight. The gap observed between the uniform bound and the almost Bayes-optimal results observed in practice in this case is therefore not a paradox, but an illustration that the price to pay for uniform convergence is the inability to describe the optimal rates one can sometimes get in practice. We therefore believe that the fact that this phenomenon can be observed in such a simple problem sheds an interesting light on the current debate on understanding generalization in deep learning [7].

Remarking that our synthetic dataset is linearly separable, we may try to take this fact into consideration to improve the generalization rate. In particular, this can be done using the max-margin-based generalization error bound for separable data:

Theorem 3.1 (Hard-margin generalization bound [35, 36, 48]). Given a set S = {x₁, ..., x_n}, drawn from a distribution D, such that ∀µ ∈ [1:n], ‖x_µ‖₂ ≤ r, let ŵ be the hard-margin SVM estimator on S. With probability 1 − δ, the generalization error is bounded by
$$e_g(\alpha) \underset{\alpha\to\infty}{\leq} \Big(2 r \|\hat{\mathbf{w}}\|_2 + \sqrt{\log(4/\delta)\log\|\hat{\mathbf{w}}\|_2}\Big)\Big/\sqrt{n}\,. \qquad (16)$$
In our case one has r² ≃ (1/d) E_x‖x‖²₂ = (1/d) Σᵢ E[xᵢ²] = 1. On the other hand, in the large-size limit, the norm of the estimator satisfies ‖ŵ‖₂/√d ≃ √q, which yields e_g(α) ≤ 2√(q/α). We now need to plug in the value of the norm q obtained by our max-margin solution to finally obtain the result. Unfortunately, this bound turns out to be even worse than the previous one. Indeed, the norm of the hard-margin estimator q is found to grow with α in the solution of the fixed-point equations, and therefore the margin decays rather fast, rendering the bound vacuous. For small values of α one finds that q ∼ α, which provides a vacuous constant generalization bound e_g ≤ Θ(1), while for large α, q ∼ α², which yields an even worse bound e_g ≤ Θ(√α). Clearly, max-margin-based bounds do not perform well in this high-dimensional example.

Given the fact that logistic and hinge losses reach values extremely close to the Bayes-optimal generalization performance, one may wonder if by somehow slightly altering these losses one could actually reach the Bayesian values with a plug-in estimator obtained by ERM. This is what we achieve in this section, by constructing a (non-convex) optimization problem with a specially tuned loss and regularization whose solution yields Bayes-optimal generalization. Recent insights have shown that one can indeed sometimes re-interpret Bayesian estimation as an optimization program in inverse problems [37, 38, 51, 52]. In particular, [39] showed explicitly, on the basis of the non-rigorous replica method of statistical mechanics, that some Bayes-optimal reconstruction problems can be turned into convex M-estimation.

Matching the ERM and Bayes-optimal generalization errors, eqs. (6)-(14), with overlaps respectively solutions of eqs. (11)-(13), and assuming that Z_{w⋆}(γ, Λ) ≡ E_{w∼P_{w⋆}} exp(−Λw²/2 + γw) and Z_out⋆(y, ω, V) are log-concave in γ and ω, we define the optimal loss and regularizer l_opt, r_opt:
$$l_{\mathrm{opt}}(y,z) = -\min_\omega\Big(\frac{(z-\omega)^2}{2(\rho_{w^\star}-q_b)} + \log\mathcal{Z}_{\mathrm{out}^\star}\big(y,\omega,\rho_{w^\star}-q_b\big)\Big)\,,$$
$$r_{\mathrm{opt}}(w) = -\min_\gamma\Big(\frac{1}{2}\hat{q}_b w^2 - \gamma w + \log\mathcal{Z}_{w^\star}\big(\gamma,\hat{q}_b\big)\Big)\,, \quad \text{with } (q_b,\hat{q}_b) \text{ the solution of eq. (13)}\,. \qquad (17)$$
See SM. VI for the derivation.
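As an illustration of eq. (17), the sketch below (our own, assuming the noiseless sign channel with ρ_{w⋆} = 1, for which Z_out⋆(1, ω, V) = Φ(ω/√V), and an arbitrary value of q_b) evaluates l_opt pointwise by a scalar minimization; in this noiseless case the minimand is unbounded below for z ≤ 0, so only z > 0 is evaluated. For the Gaussian teacher prior, the companion minimization defining r_opt can even be done in closed form and gives back a plain quadratic, r_opt(w) = w²/(2ρ_{w⋆}) + const, consistent with the comparison to the ℓ2 norm in Fig. 2.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_ndtr      # log of the standard normal CDF, numerically stable

def l_opt(z, q_b, rho=1.0):
    V = rho - q_b
    # log Z_out*(y=1, omega, V) = log Phi(omega / sqrt(V)) for the noiseless sign channel
    def minimand(omega):
        return (z - omega) ** 2 / (2 * V) + log_ndtr(omega / np.sqrt(V))
    res = minimize_scalar(minimand, bounds=(-30, 30), method="bounded")
    return -res.fun                      # eq. (17)

for z in (0.25, 0.5, 1.0, 2.0):
    print(z, l_opt(z, q_b=0.5))
```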
Following these considerations, we provide the following theorem:

Theorem 4.1. The result of the empirical risk minimization eq. (3) with l_opt and r_opt in eq. (17) leads to the Bayes-optimal generalization error in the high-dimensional regime.

Proof. We present only a sketch of the proof here. First, we note that the so-called Bayes-optimal Approximate Message Passing (AMP) algorithm [30] is provably convergent and indeed reaches the Bayes-optimal performance (see [53]). Second, we remark that an AMP algorithm for the minimization of the ERM with loss and regularization given by (17) is exactly identical to the Bayes-optimal AMP. This shows that AMP applied to the ERM problem corresponding to (17) both converges to its fixed point and reaches Bayes-optimal performance. The theorem finally follows by noting (see [32, 54]) that the AMP fixed point corresponds to the extremization conditions of the loss.

[Figure 2: Optimal loss l_opt(y = 1, z) (compared with the logistic loss) and regularizer r_opt(w) (compared with the ℓ2 norm ‖w‖²₂) for the model eq. (2), as a function of α.]

The optimal loss and regularizer l_opt and r_opt for the model (2) are illustrated in Fig. 2, and numerical evidence for ERM with (17), compared to ℓ2 logistic regression and to the Bayes performance, is presented in SM. VI.

Acknowledgments

This work is supported by the ERC under the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe, by the French Agence Nationale de la Recherche under grants ANR-17-CE23-0023-01 PAIL and ANR-19-P3IA-0001 PRAIRIE, and by the US National Science Foundation under grants CCF-1718698 and CCF-1910410. We also acknowledge support from the chaire CFM-ENS "Science des données". Part of this work was done when Yue M. Lu was visiting École Normale as a CFM-ENS "Laplace" invited researcher.
References

[1] Hyunjune Sebastian Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.
[2] Timothy L. H. Watkin, Albrecht Rau, and Michael Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499, 1993.
[3] Andreas Engel and Christian Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.
[4] Mohsen Bayati and Andrea Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017, 2011.
[5] Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2013.
[6] David Donoho and Andrea Montanari. High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3-4):935–969, 2016.
[7] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[8] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[9] Emmanuel J. Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression, 2018.
[10] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
[11] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation, 2019.
[12] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
[13] Song Mei and Andrea Montanari. The generalization error of random features regression: precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
[14] Elizabeth Gardner and Bernard Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983, 1989.
[15] Manfred Opper and Wolfgang Kinzel. Statistical mechanics of generalization. In Models of Neural Networks III, pages 151–209. Springer, 1996.
[16] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592–5628, 2018.
[17] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of max-margin linear classifiers: high-dimensional asymptotics in the overparametrized regime, 2019.
[18] Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.
[19] Ganesh Kini and Christos Thrampoulidis. Analytic study of double descent in binary classification: the impact of loss. arXiv preprint arXiv:2001.11572, 2020.
[20] Francesca Mignacco, Florent Krzakala, Yue M. Lu, and Lenka Zdeborová. The role of regularization in classification of high-dimensional noisy Gaussian mixture. 2(2):1–21, 2020.
[21] Marc Mézard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.
[22] Lenka Zdeborová. Understanding deep learning is also a job for physicists. Nature Physics, 2020.
[23] M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: General Physics, 23(11), 1990.
[24] Manfred Opper and David Haussler. Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Physical Review Letters, 66(20):2677–2680, 1991.
[25] M. Opper and W. Kinzel. Models of Neural Networks III. Springer, 1996.
[26] Marc Mézard. The space of interactions in neural networks: Gardner's computation with the cavity method. Journal of Physics A: Mathematical and General, 22(12):2181, 1989.
[27] Yoshiyuki Kabashima. A CDMA multiuser detection algorithm on the basis of belief propagation. Journal of Physics A: Mathematical and General, 36(43):11111, 2003.
[28] Yoshiyuki Kabashima and Shinsuke Uda. A BP-based algorithm for performing Bayesian inference in large perceptron-type networks. In International Conference on Algorithmic Learning Theory, pages 479–493. Springer, 2004.
[29] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[30] Sundeep Rangan. Generalized approximate message passing for estimation with random linear mixing. pages 2168–2172, 2011.
[31] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
[32] Cédric Gerbelot, Alia Abbara, and Florent Krzakala. Asymptotic errors for convex penalized linear regression beyond Gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[34] Saharon Rosset, Ji Zhu, and Trevor J. Hastie. Margin maximizing loss functions. In Advances in Neural Information Processing Systems 16, pages 1237–1244. MIT Press, 2004.
[35] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.
[36] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[37] Rémi Gribonval. Should penalized least squares regression be interpreted as maximum a posteriori estimation? IEEE Transactions on Signal Processing, 59(5):2405–2410, 2011.
[38] Rémi Gribonval and Pierre Machart. Reconciling "priors" & "priors" without prejudice? In Advances in Neural Information Processing Systems 26, pages 2193–2201. Curran Associates, Inc., 2013.
[39] Madhu Advani and Surya Ganguli. An equivalence between high dimensional Bayes optimal inference and M-estimation. Advances in Neural Information Processing Systems, (1):3386–3394, 2016.
[40] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.
[41] Mario Geiger, Stefano Spigler, Stéphane d'Ascoli, Levent Sagun, Marco Baity-Jesi, Giulio Biroli, and Matthieu Wyart. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1), 2019.
[42] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020.
[43] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[44] Partha P. Mitra. Understanding overfitting peaks in generalization error: analytical risk curves for l2 and l1 penalized interpolation, 2019.
[45] Song Mei and Andrea Montanari. The generalization error of random features regression: precise asymptotics and double descent curve, 2019.
[46] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. arXiv preprint arXiv:2002.09339, 2020.
[47] Stéphane d'Ascoli, Maria Refinetti, Giulio Biroli, and Florent Krzakala. Double trouble in double descent: bias and variance(s) in the lazy regime. arXiv preprint arXiv:2003.01054, 2020.
[48] Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, 1998.
[49] Alia Abbara, Benjamin Aubin, Florent Krzakala, and Lenka Zdeborová. Rademacher complexity and spin glasses: a link between the replica and statistical theories of learning. arXiv preprint arXiv:1912.02729, 2019.
[50] A. Engel and W. Fink. Statistical mechanics calculation of Vapnik-Chervonenkis bounds for perceptrons. Journal of Physics A: Mathematical and General, 26(23):6893, 1993.
[51] Rémi Gribonval and Mila Nikolova. A characterization of proximity operators. arXiv preprint arXiv:1807.04014, 2018.
[52] Rémi Gribonval and Mila Nikolova. On Bayesian estimation and proximity operators. Applied and Computational Harmonic Analysis, 2019.
[53] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Phase transitions, optimal errors and optimality of message-passing in generalized linear models. pages 1–59, 2017.
[54] Andrea Montanari, Y. C. Eldar, and G. Kutyniok. Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, pages 394–438, 2012.
[55] H. Nishimori. Exact results and critical properties of the Ising model with competing interactions. Journal of Physics C: Solid State Physics, 13(21):4071–4076, 1980.
[56] Osame Kinouchi and Nestor Caticha. Learning algorithm that gives the Bayes generalization limit for perceptrons. Physical Review E, 54(1):R54–R57, 1996.
[57] Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568, 2013.
[58] David Donoho and Andrea Montanari. High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3-4):935–969, 2016.
[59] Madhu Advani and Surya Ganguli. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6(3):1–16, 2016.

Supplementary material
In this supplementary material (SM), we provide the proofs and computation details leading to the results presented in the main manuscript. In Sec. I, we first recall the definition of the statistical model used in Sec. 1 and give proper definitions of the denoising distributions involved in the analysis of Bayes-optimal and Empirical Risk Minimization (ERM) estimation. In particular, we provide the analytical expressions of the denoising functions used in Sec. 3 to analyze ridge, hinge and logistic regression. In Sec. II, we detail the computation of the binary classification generalization error, leading to the expressions in Proposition 2.1 and Thm. 2.4, respectively for ERM and Bayes-optimal estimation. In Sec. III, we present the proofs of the central theorems stated in Sec. 2. In particular, we derive the Gordon-based proof of Thm. 2.2 in the more general regression (real-valued) version, and provide as well the proof of Corollary 2.3, which establishes the equivalence between the set of fixed-point equations of the Gordon proof in the binary classification case and the one resulting from the heuristic replica computation. The corresponding statistical physics framework used to analyze Bayes and ERM statistical estimation, and the replica computation leading to the expressions in Corollary 2.3, are detailed in Sec. IV. Sec. V provides additional technical details on the results with ℓ2 regularization addressed in Sec. 3. In particular, we present the large-α expansions of the generalization error for the Bayes-optimal, ridge, pseudo-inverse and max-margin estimators, and we investigate the performance of logistic regression on non-linearly separable data. Finally, in Sec. VI, we show the derivation of the fine-tuned loss and regularizer provably leading to Bayes-optimal performance, as explained and advocated in Sec. 4, and we show numerical evidence that ERM indeed achieves the Bayes-optimal error in Fig. 5.

Table of Contents

I Definitions and notations
  I.1 Statistical model
  I.2 Bayes-optimal and ERM estimation
  I.3 Denoising distributions and updates
  I.4 Applications
II Binary classification generalization errors
  II.1 General case
  II.2 Bayes-optimal generalization error
  II.3 ERM generalization error
III Proofs of the ERM fixed points
  III.1 Gordon's result and proofs
  III.2 Replica's formulation
  III.3 Equivalence Gordon-Replica's formulation - ℓ2 regularization and Gaussian weights
IV Replica computation for Bayes-optimal and ERM estimations
  IV.1 Statistical inference and free entropy
  IV.2 Replica computation
  IV.3 ERM and Bayes-optimal free entropy
  IV.4 Sets of fixed point equations
  IV.5 Useful derivations
V Applications
  V.1 Bayes-optimal estimation
  V.2 Generalities on ERM with ℓ2 regularization
  V.3 Ridge regression - Square loss with ℓ2 regularization
  V.4 Hinge regression / SVM - Hinge loss with ℓ2 regularization
  V.5 Logistic regression
  V.6 Logistic with non-linearly separable data - A rectangle door teacher
VI Reaching Bayes optimality
  VI.1 Generalized Approximate Message Passing (GAMP) algorithm
  VI.2 Matching Bayes-optimal and ERM performances
  VI.3 Summary and numerical evidences

I Definitions and notations
I.1 Statistical model
We recall the supervised machine learning task considered in the main manuscript, eq. (1), whose dataset is generated by a single-layer neural network, often named a teacher, that belongs to the Generalized Linear Model (GLM) class. We therefore assume the n samples are drawn according to
$$\mathbf{y} = \varphi_{\mathrm{out}}^\star\Big(\tfrac{1}{\sqrt{d}}\,\mathbf{X}\mathbf{w}^\star\Big) \;\Leftrightarrow\; \mathbf{y} \sim P_{\mathrm{out}}^\star(\cdot)\,, \qquad (18)$$
where w⋆ ∈ R^d denotes the ground-truth vector drawn from a probability distribution P_{w⋆} with second moment ρ_{w⋆} ≡ (1/d) E[‖w⋆‖²₂], and φ⋆_out represents a deterministic or stochastic activation function, equivalently associated to a distribution P⋆_out. The input data matrix X = (x_µ)ⁿ_{µ=1} ∈ R^{n×d} contains iid Gaussian vectors, i.e. ∀µ ∈ [1:n], x_µ ∼ N(0, I_d).
I.2 Bayes-optimal and ERM estimation

Inferring the above statistical model from the observations {y, X} can be tackled in several ways. In particular, Bayesian inference provides a generic framework for statistical estimation based on the high-dimensional, often intractable, posterior distribution
$$P(\mathbf{w}\,|\,\mathbf{y},\mathbf{X}) = \frac{P(\mathbf{y}\,|\,\mathbf{w},\mathbf{X})\,P(\mathbf{w})}{P(\mathbf{y},\mathbf{X})}\,. \qquad (19)$$
Estimating the average of the above posterior distribution, in the case where we have access to the ground-truth prior distributions P(y|w, X) = P_out⋆(y|z) with z ≡ (1/√d)Xw and P(w) = P_{w⋆}(w), is referred to as Bayes-optimal estimation and leads to the corresponding Minimal Mean-Squared Error (MMSE) estimator ŵ_mmse = E_{P(w|y,X)}[w]. It has been rigorously analyzed in detail in [10] for the whole GLM class eq. (18). Another celebrated approach, widely used in practice, is Empirical Risk Minimization (ERM), which minimizes instead a regularized loss:
$$\hat{\mathbf{w}}_{\mathrm{erm}} = \underset{\mathbf{w}}{\mathrm{argmin}}\,\big[\mathcal{L}(\mathbf{w};\mathbf{y},\mathbf{X})\big] \quad \text{with} \quad \mathcal{L}(\mathbf{w};\mathbf{y},\mathbf{X}) = \sum_{\mu=1}^n l(\mathbf{w}; y_\mu, \mathbf{x}_\mu) + r(\mathbf{w})\,. \qquad (20)$$
Interestingly, analyzing ERM estimation can be included in the above Bayesian framework. Indeed, exponentiating eq. (20), we see that minimizing the loss L is equivalent to maximizing the posterior distribution P(w|y, X) = e^{−L(w;y,X)} if we choose the prior distributions carefully, as functions of the regularizer r and the loss l:
$$-\log P(\mathbf{y}\,|\,\mathbf{w},\mathbf{X}) = l(\mathbf{w};\mathbf{y},\mathbf{X})\,, \qquad -\log P(\mathbf{w}) = r(\mathbf{w})\,. \qquad (21)$$
Computing the maximum of the posterior refers instead to the so-called Maximum A Posteriori (MAP) estimator, and analyzing the empirical minimization of (20) is therefore equivalent to obtaining the performance of the MAP estimator with prior distributions given by (21). Thus the study of both ERM (MAP) and Bayes-optimal (MMSE) estimation simply reduces to the analysis of the posterior eq. (19).

I.3 Denoising distributions and updates

Analyzing the posterior distribution eq. (19) in the high-dimensional regime [10] boils down to introducing the scalar denoising distributions Q_w, Q_out and their respective normalizations Z_w, Z_out:
$$Q_w(w;\gamma,\Lambda) \equiv \frac{P_w(w)}{\mathcal{Z}_w(\gamma,\Lambda)}\,e^{-\frac{\Lambda}{2}w^2+\gamma w}\,, \qquad Q_{\mathrm{out}}(z;y,\omega,V) \equiv \frac{P_{\mathrm{out}}(y|z)}{\mathcal{Z}_{\mathrm{out}}(y,\omega,V)}\,\frac{e^{-\frac{1}{2}V^{-1}(z-\omega)^2}}{\sqrt{2\pi V}}\,,$$
$$\mathcal{Z}_w(\gamma,\Lambda) \equiv \mathbb{E}_{w\sim P_w}\Big[e^{-\frac{\Lambda}{2}w^2+\gamma w}\Big]\,, \qquad \mathcal{Z}_{\mathrm{out}}(y,\omega,V) \equiv \mathbb{E}_{z\sim\mathcal{N}(0,1)}\Big[P_{\mathrm{out}}\big(y\,|\,\sqrt{V}z+\omega\big)\Big]\,. \qquad (22)$$
We define as well the denoising functions, which play a central role in Bayesian inference. Note in particular that they correspond to the updates of the Approximate Message Passing algorithm of [30], recalled in Sec. VI.1. They are defined as the derivatives of log Z_w and log Z_out, namely
$$f_w(\gamma,\Lambda) \equiv \partial_\gamma \log(\mathcal{Z}_w) = \mathbb{E}_{Q_w}[w] \quad \text{and} \quad \partial_\gamma f_w(\gamma,\Lambda) = \mathbb{E}_{Q_w}\big[w^2\big] - f_w^2\,,$$
$$f_{\mathrm{out}}(y,\omega,V) \equiv \partial_\omega \log(\mathcal{Z}_{\mathrm{out}}) = V^{-1}\,\mathbb{E}_{Q_{\mathrm{out}}}[z-\omega] \quad \text{and} \quad \partial_\omega f_{\mathrm{out}}(y,\omega,V) \equiv \frac{\partial f_{\mathrm{out}}(y,\omega,V)}{\partial\omega}\,. \qquad (23)$$
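These scalar objects are one-dimensional integrals and are straightforward to evaluate numerically for any channel. A minimal sketch (our own; the probit channel and its noise level are arbitrary example choices):

```python
import numpy as np
from scipy.stats import norm

zs, ws = np.polynomial.hermite_e.hermegauss(101)     # nodes/weights for E over N(0,1)
ws = ws / np.sqrt(2 * np.pi)

def Z_out(P_out, y, omega, V):
    """Partition function of eq. (22) via Gauss-Hermite quadrature."""
    return np.sum(ws * P_out(y, np.sqrt(V) * zs + omega))

def f_out(P_out, y, omega, V, eps=1e-5):
    """f_out = d/d_omega log Z_out, eq. (23), by centered finite difference."""
    return (np.log(Z_out(P_out, y, omega + eps, V))
            - np.log(Z_out(P_out, y, omega - eps, V))) / (2 * eps)

# example: probit channel P(y|z) = Phi(y z / sqrt(Delta)), with Delta = 0.1
probit = lambda y, z: norm.cdf(y * z / np.sqrt(0.1))
print(Z_out(probit, +1, 0.3, 0.5), f_out(probit, +1, 0.3, 0.5))
```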
I.3.1 Bayes-optimal - MMSE denoising functions

In Bayes-optimal estimation, the ground-truth prior and channel distributions P_{w⋆}(w) and P_out⋆(y|z) of the teacher eq. (1) are known. Hence, replacing P_w and P_out in (22), we obtain the Bayes-optimal scalar denoising distributions, in terms of which the Bayes-optimal free entropy eq. (95) is written:
$$Q_{w^\star}(w;\gamma,\Lambda) \equiv \frac{P_{w^\star}(w)}{\mathcal{Z}_{w^\star}(\gamma,\Lambda)}\,e^{-\frac{\Lambda}{2}w^2+\gamma w}\,, \qquad Q_{\mathrm{out}^\star}(z;y,\omega,V) \equiv \frac{P_{\mathrm{out}^\star}(y|z)}{\mathcal{Z}_{\mathrm{out}^\star}(y,\omega,V)}\,\frac{e^{-\frac{1}{2}V^{-1}(z-\omega)^2}}{\sqrt{2\pi V}}\,, \qquad (24)$$
and the denoising updates are therefore given by eq. (23) with the corresponding distributions:
$$f_{w^\star}(\gamma,\Lambda) \equiv \partial_\gamma \log \mathcal{Z}_{w^\star}(\gamma,\Lambda)\,, \qquad f_{\mathrm{out}^\star}(y,\omega,V) \equiv \partial_\omega \log \mathcal{Z}_{\mathrm{out}^\star}(y,\omega,V)\,. \qquad (25)$$

I.3.2 ERM - MAP denoising functions
Before defining similar denoising functions to analyze the MAP estimator for ERM estimation, we first recall the definition of the Moreau-Yosida regularization.
Moreau-Yosida regularization and proximal
Let Σ > 0, and let f be a function, convex in its last argument z (any remaining arguments are kept fixed and left implicit). Defining the regularized functional
$$\mathcal{L}_\Sigma[f](z;x) = f(z) + \frac{1}{2\Sigma}(z-x)^2\,, \qquad (26)$$
the Moreau-Yosida regularization M_Σ and the proximal map P_Σ are defined by
$$\mathcal{P}_\Sigma[f](x) = \underset{z}{\mathrm{argmin}}\ \mathcal{L}_\Sigma[f](z;x) = \underset{z}{\mathrm{argmin}}\Big[f(z) + \frac{1}{2\Sigma}(z-x)^2\Big]\,, \qquad (27)$$
$$\mathcal{M}_\Sigma[f](x) = \min_z\ \mathcal{L}_\Sigma[f](z;x) = \min_z\Big[f(z) + \frac{1}{2\Sigma}(z-x)^2\Big]\,. \qquad (28)$$
The MAP denoising functions for any convex loss l(y,·) and convex separable regularizer r(·) can be written in terms of the Moreau-Yosida regularization or the proximal map as follows:
$$f_w^{\mathrm{map},r}(\gamma,\Lambda) \equiv \mathcal{P}_{\Lambda^{-1}}[r(\cdot)](\Lambda^{-1}\gamma) = \Lambda^{-1}\gamma - \Lambda^{-1}\,\partial_{\Lambda^{-1}\gamma}\,\mathcal{M}_{\Lambda^{-1}}[r(\cdot)](\Lambda^{-1}\gamma)\,,$$
$$f_{\mathrm{out}}^{\mathrm{map},l}(y,\omega,V) \equiv -\partial_\omega\,\mathcal{M}_V[l(y,\cdot)](\omega) = V^{-1}\big(\mathcal{P}_V[l(y,\cdot)](\omega)-\omega\big)\,. \qquad (29)$$
The above updates can be taken as definitions, but it is instructive to derive them from the generic definition of the denoising distributions eq. (23) when maximizing the posterior distribution. This is done by taking, in physics language, a zero-temperature limit, and we present it in detail in the next paragraphs.
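A generic numerical proximal map and Moreau-Yosida envelope, eqs. (27)-(28), take a few lines; this sketch (ours) is useful to sanity-check the closed forms derived below in Sec. I.4.2.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_and_envelope(loss, x, sigma):
    """Return (P_Sigma[loss](x), M_Sigma[loss](x)) by brute-force 1d minimization."""
    res = minimize_scalar(lambda z: loss(z) + (z - x) ** 2 / (2 * sigma),
                          bounds=(x - 50 * (1 + sigma), x + 50 * (1 + sigma)),
                          method="bounded")
    return res.x, res.fun

# check against the analytic hinge proximal of Sec. I.4.2 (case y = 1):
hinge = lambda z: max(0.0, 1.0 - z)
for x in (-2.0, 0.7, 1.5):
    print(x, prox_and_envelope(hinge, x, sigma=0.5))
```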
Derivation of the MAP updates

To access the maximum of the generic distributions eq. (22), we introduce a fictive noise/temperature Δ, or inverse temperature β = Δ^{-1}. In particular, for Bayes-optimal estimation this temperature is finite and fixed to Δ = β = 1. Indeed, with the mapping eq. (21), minimizing the loss function L (20) is equivalent to maximizing the posterior distribution. This can therefore be done by taking the zero noise/temperature limit Δ → 0 of the channel and prior denoising distributions Q_out and Q_w. This is the purpose of the following paragraphs, where we present the derivation leading to the result (29).

Channel
Using the mapping eq. (21), we assume that the channel distribution can be expressed as P(y|z) ∝ e^{−l(y,z)}. We therefore introduce the corresponding channel distribution P_out at finite temperature Δ associated to the convex loss l(y, z):
$$P_{\mathrm{out}}^{\mathrm{map}}(y|z) = \frac{e^{-\frac{1}{\Delta}l(y,z)}}{\sqrt{2\pi\Delta}}\,.$$
Note that the case of the square loss l(y, z) = ½(y − z)² is very specific: its channel distribution simply reads P_out(y|z) = e^{−(y−z)²/(2Δ)}/√(2πΔ), and it is therefore equivalent to predicting labels y according to a noisy Gaussian linear model y = z + √Δ ξ, where ξ ∼ N(0, 1), so that Δ plays the role of the real noise of the model.

In order to obtain a non-trivial limit and a closed set of equations when Δ → 0, we must define rescaled variables as follows:
$$V_\dagger \equiv \lim_{\Delta\to 0} \frac{V}{\Delta}\,, \qquad f_{\mathrm{out},\dagger}^{\mathrm{map}}(y,\omega,V_\dagger) \equiv \lim_{\Delta\to 0} \Delta\times f_{\mathrm{out}}^{\mathrm{map}}(y,\omega,V)\,,$$
where the subscript † denotes the rescaled quantities in the Δ → 0 limit. Similarly to eq. (26), we therefore introduce the rescaled functional
$$\mathcal{L}_{V_\dagger}[l(y,\cdot)](z;\omega) = l(y,z) + \frac{1}{2V_\dagger}(z-\omega)^2\,, \qquad (30)$$
such that, injecting P_out^map, the channel denoising distribution Q_out^map and the corresponding partition function Z_out^map of eq. (22) simplify in the zero-temperature limit as follows:
$$Q_{\mathrm{out}}^{\mathrm{map}}(z;y,\omega,V) \equiv \lim_{\Delta\to 0} \frac{e^{-\frac{1}{\Delta}\big(l(y,z)+\frac{1}{2V_\dagger}(z-\omega)^2\big)}}{\sqrt{2\pi\Delta V_\dagger}\sqrt{2\pi\Delta}} = \lim_{\Delta\to 0} \frac{e^{-\frac{1}{\Delta}\mathcal{L}_{V_\dagger}[l(y,\cdot)](z;\omega)}}{\sqrt{2\pi\Delta V_\dagger}\sqrt{2\pi\Delta}} \propto \delta\big(z-\mathcal{P}_{V_\dagger}[l(y,\cdot)](\omega)\big)\,, \qquad (31)$$
$$\mathcal{Z}_{\mathrm{out}}^{\mathrm{map}}(y,\omega,V) = \lim_{\Delta\to 0} \int_{\mathbb{R}} \mathrm{d}z\, Q_{\mathrm{out}}^{\mathrm{map}}(z;y,\omega,V) \propto \lim_{\Delta\to 0} \frac{e^{-\frac{1}{\Delta}\mathcal{M}_{V_\dagger}[l(y,\cdot)](\omega)}}{\sqrt{2\pi\Delta V_\dagger}\sqrt{2\pi\Delta}}\,, \qquad (32)$$
which involve the proximal map and the Moreau-Yosida regularization defined in eq. (28). Finally, taking the zero-temperature limit, the MAP denoising function f_out,†^map leads to the result (29):
$$f_{\mathrm{out},\dagger}^{\mathrm{map}}(y,\omega,V_\dagger) \equiv \lim_{\Delta\to 0} \Delta\times f_{\mathrm{out}}^{\mathrm{map}}(y,\omega,V) = \lim_{\Delta\to 0} \Delta\times\partial_\omega\log\mathcal{Z}_{\mathrm{out}}^{\mathrm{map}} = \lim_{\Delta\to 0} \frac{\Delta}{V}\,\mathbb{E}_{Q_{\mathrm{out}}^{\mathrm{map}}}[z-\omega]$$
$$= -\partial_\omega\,\mathcal{M}_{V_\dagger}[l(y,\cdot)](\omega) = V_\dagger^{-1}\big(\mathcal{P}_{V_\dagger}[l(y,\cdot)](\omega)-\omega\big)\,. \qquad (33)$$

Prior
Similarly to the above, using the mapping eq. (21), for a convex and separable regularizer r the corresponding prior distribution at temperature Δ can be written
$$P_w^{\mathrm{map}}(w) = e^{-\frac{1}{\Delta}r(w)}\,.$$
Note that at Δ = 1 the classical ℓ1 regularization with strength λ, r_{ℓ1}(w) = λ|w|, and the ℓ2 regularization r_{ℓ2}(w) = λw²/2 are equivalent to choosing a Laplace prior P_w(w) ∝ e^{−λ|w|} or a Gaussian prior P_w(w) ∝ e^{−λw²/2}, respectively. To obtain a meaningful limit as Δ → 0, we again introduce the rescaled variables
$$\Lambda_\dagger \equiv \lim_{\Delta\to 0} \Delta\times\Lambda\,, \qquad \gamma_\dagger \equiv \lim_{\Delta\to 0} \Delta\times\gamma\,,$$
and the functional
$$\mathcal{L}_{\Lambda_\dagger^{-1}}[r(\cdot)]\big(w;\Lambda_\dagger^{-1}\gamma_\dagger\big) = r(w) + \frac{\Lambda_\dagger}{2}\big(w-\Lambda_\dagger^{-1}\gamma_\dagger\big)^2 = \Big[r(w) + \frac{\Lambda_\dagger}{2}w^2 - \gamma_\dagger w\Big] + \frac{1}{2}\gamma_\dagger^2\Lambda_\dagger^{-1}\,, \qquad (34)$$
such that in the zero-temperature limit the prior denoising distribution Q_w^map and the partition function Z_w^map reduce to
$$Q_w^{\mathrm{map}}(w;\gamma,\Lambda) \equiv \lim_{\Delta\to 0} \frac{P_w(w)\,e^{-\frac{\Lambda}{2}w^2+\gamma w}}{\mathcal{Z}_w^{\mathrm{map}}(\gamma,\Lambda)} \propto \delta\Big(w-\mathcal{P}_{\Lambda_\dagger^{-1}}[r]\big(\Lambda_\dagger^{-1}\gamma_\dagger\big)\Big)\,, \qquad (35)$$
$$\mathcal{Z}_w^{\mathrm{map}}(\gamma,\Lambda) = \lim_{\Delta\to 0} \int_{\mathbb{R}} \mathrm{d}w\, P_w(w)\,e^{-\frac{\Lambda}{2}w^2+\gamma w} \propto \lim_{\Delta\to 0} e^{-\frac{1}{\Delta}\big(\mathcal{M}_{\Lambda_\dagger^{-1}}[r](\Lambda_\dagger^{-1}\gamma_\dagger) - \frac{1}{2}\gamma_\dagger^2\Lambda_\dagger^{-1}\big)}\,, \qquad (36)$$
which again involve the proximal map P_{Λ_†^{-1}} and the Moreau-Yosida regularization M_{Λ_†^{-1}} defined in eq. (28). Finally, the MAP denoising update f_{w,†}^map is simply given by
$$f_{w,\dagger}^{\mathrm{map}}(\gamma_\dagger,\Lambda_\dagger) \equiv \lim_{\Delta\to 0} f_w^{\mathrm{map}}(\gamma,\Lambda) = \lim_{\Delta\to 0} \partial_\gamma\log\mathcal{Z}_w^{\mathrm{map}} = \lim_{\Delta\to 0} \mathbb{E}_{Q_w^{\mathrm{map}}}[w]$$
$$= \partial_{\gamma_\dagger}\Big(-\mathcal{M}_{\Lambda_\dagger^{-1}}[r(\cdot)]\big(\Lambda_\dagger^{-1}\gamma_\dagger\big) + \frac{1}{2}\gamma_\dagger^2\Lambda_\dagger^{-1}\Big) = \Lambda_\dagger^{-1}\gamma_\dagger - \Lambda_\dagger^{-1}\,\partial_{\Lambda_\dagger^{-1}\gamma_\dagger}\,\mathcal{M}_{\Lambda_\dagger^{-1}}[r(\cdot)]\big(\Lambda_\dagger^{-1}\gamma_\dagger\big) = \mathcal{P}_{\Lambda_\dagger^{-1}}[r(\cdot)]\big(\Lambda_\dagger^{-1}\gamma_\dagger\big)$$
$$= \underset{w}{\mathrm{argmin}}\Big[r(w) + \frac{\Lambda_\dagger}{2}w^2 - \gamma_\dagger w\Big]\,, \qquad (37)$$
and we recover the result (29).
In this section we list the explicit expressions of the Bayes-optimal eq. (25) and ERM eq. (29)denoising functions largely used to produce the examples in Sec. 3.
I.4.1 Bayes-optimal updates
The Bayes-optimal denoising functions (25) are detailed in the case of a linear , sign and rectangledoor channel with a Gaussian noise ξ ∼ N (0 , and variance ∆ ≥ , and for Gaussian and sparse-binary weights.
Channel • Linear: y = ϕ out (cid:63) ( z ) = z + √ ∆ ξ Z out (cid:63) ( y, ω, V ) = N ω ( y, ∆ (cid:63) + V ) ,f out (cid:63) ( y, ω, V ) = (∆ (cid:63) + V ) − ( y − ω ) , ∂ ω f out (cid:63) ( y, ω, V ) = − (∆ (cid:63) + V ) − . (38)24 Sign: y = ϕ out (cid:63) ( z ) = sign ( z ) + √ ∆ (cid:63) ξ Z out (cid:63) ( y, ω, V ) = N y (1 , ∆ (cid:63) ) 12 (cid:18) (cid:18) ω √ V (cid:19)(cid:19) + N y ( − , ∆ (cid:63) ) 12 (cid:18) − erf (cid:18) ω √ V (cid:19)(cid:19) ,f out (cid:63) ( y, ω, V ) = N y (1 , ∆ (cid:63) ) − N y ( − , ∆ (cid:63) ) Z out (cid:63) ( y, ω, V ) N ω (0 , V ) . (39) • Rectangle door: y = ϕ out (cid:63) ( z ) = ( κ m ≤ z ≤ κ M ) − ( z ≤ κ m or z ≥ κ M ) + √ ∆ (cid:63) ξ For κ m < κ M , we obtain Z out (cid:63) ( y, ω, V ) = N y (1 , ∆ (cid:63) ) 12 (cid:18) erf (cid:18) κ M − ω √ V (cid:19) − erf (cid:18) κ m − ω √ V (cid:19)(cid:19) + N y ( − , ∆ (cid:63) ) 12 (cid:18) − (cid:18) erf (cid:18) κ M − ω √ V (cid:19) − erf (cid:18) κ m − ω √ V (cid:19)(cid:19)(cid:19) ,f out (cid:63) ( y, ω, V ) = 1 Z out ( N y (1 , ∆ (cid:63) ) ( −N ω ( κ M , V ) + N ω ( κ m , V ))+ N y ( − , ∆ (cid:63) ) ( N ω ( κ M , V ) − N ω ( κ m , V ))) . (40) Prior • Gaussian weights: w ∼ P w ( w ) = N w ( µ, σ ) Z w (cid:63) ( γ, Λ) = e γ σ +2 γµ − Λ µ σ +1) √ Λ σ + 1 , f w (cid:63) ( γ, Λ) = γσ + µ σ , ∂ γ f w (cid:63) ( γ, Λ) = σ σ . (41) • Sparse-binary weights: w ∼ P w ( w ) = ρδ ( w ) + ( ρ − ( δ ( w −
1) + δ ( w + 1)) Z w (cid:63) ( γ, Λ) = ρ + e − Λ2 (1 − ρ ) cosh( γ ) ,f w (cid:63) ( γ, Λ) = e − Λ2 (1 − ρ ) sinh( γ ) ρ + e − Λ2 (1 − ρ ) cosh( γ ) , ∂ γ f w (cid:63) ( γ, Λ) = e − Λ2 (1 − ρ ) cosh( γ ) ρ + e − Λ2 (1 − ρ ) cosh( γ ) . (42) I.4.2 ERM updates
The ERM denoising functions (29) have, very often, no explicit expression except for the square and hinge losses, and for (cid:96) , (cid:96) regularizations that are analytical. However, in the particularcase of a two times differentiable convex loss the denoising functions can still be written as thesolution of an implicit equation detailed below.25 onvex losses • Square loss
The proximal map for the square loss l square ( y, z ) = ( y − z ) is easily obtained and reads P V (cid:20)
12 ( y, . ) (cid:21) ( ω ) = argmin z (cid:20)
12 ( y − z ) + 12 V ( z − ω ) (cid:21) = (1 + V ) − ( ω + yV ) . Therefore (29) yields f squareout ( y, ω, V ) = V − (cid:18) P V (cid:20)
12 ( y, . ) (cid:21) ( ω ) − ω (cid:19) = (1 + V ) − ( y − ω ) ,∂ ω f squareout ( y, ω, V ) = − (1 + V ) − . (43) • Hinge loss
The proximal map of the hinge loss l hinge ( y, z ) = max (0 , − yz ) P V (cid:104) l hinge ( y, . ) (cid:105) ( ω ) = argmin z max (0 , − yz ) + 12 V ( z − ω ) (cid:124) (cid:123)(cid:122) (cid:125) ≡L ≡ z (cid:63) ( y, ω, V ) . can be expressed analytically by distinguishing all the possible cases:• − yz < : L = V ( z − ω ) ⇒ z (cid:63) = ω if yz (cid:63) < ⇔ z (cid:63) = ω if ωy < .• − yz > : L = V ( z − ω ) + 1 − yz ⇒ ( z (cid:63) − ω ) = yV ⇔ z (cid:63) = ω + V y if − yz (cid:63) > ⇔ z (cid:63) = ω + V y if ωy < − y V = 1 − V , as y = 1 .• Hence we have one last region to study − V < ωy < . It follows y (1 − V ) < ω < y : V ( z − y ) ≤ V ( z − ω ) ⇒ z (cid:63) = y . Finally we obtain a simple analytical expression for the proximal and its derivative P V (cid:104) l hinge ( y, . ) (cid:105) ( ω ) = ω + V y if ωy < − Vy if − V < ωy < ω if ωy > , ∂ ω P V (cid:104) l hinge ( y, . ) (cid:105) ( ω ) = if ωy < − V if − V < ωy < if ωy > . Hence with (29), the hinge denoising function and its derivative read f hingeout ( y, ω, V ) = y if ωy < − V ( y − ω ) V if − V < ωy < otherwise , ∂ ω f hingeout ( y, ω, V ) = (cid:40) − V if − V < ωy < otherwise . (44)26 Generic differentiable convex loss
In general, finding the proximal map in (29) is intractable. In particular, it is the case for thelogistic loss considered in Sec. V.5. However assuming the convex loss is a generic two timesdifferentiable function l ∈ D , taking the derivative of the proximal map P V [ l ( y, . )] ( ω ) = argmin z (cid:20) l ( y, z ) + 12 V ( z − ω ) (cid:21) ≡ z (cid:63) ( y, ω, V ) , verifies therefore the implicit equations: z (cid:63) ( y, ω, V ) = ω − V ∂ z l ( y, z (cid:63) ( y, ω, V )) , ∂ ω z (cid:63) ( y, ω, V ) = (cid:0) V ∂ z l ( y, z (cid:63) ( y, ω, V )) (cid:1) − . (45)Once those equations solved, the denoising function and its derivative are simply expressed as f diffout ( y, ω, V ) = V − ( z (cid:63) ( y, ω, V ) − ω ) , ∂ ω f diffout ( y, ω, V ) = V − ( ∂ ω z (cid:63) ( y, ω, V ) − , (46)with z (cid:63) ( y, ω, V ) = P V [ l ( y, . )] ( ω ) solution of (45). Regularizations • (cid:96) regularization Using the definition of the prior update in eq. (29) for the (cid:96) regularization r ( w ) = λw , weobtain f (cid:96) w ( γ, Λ) = argmin w (cid:20) λw w − γw (cid:21) = γλ + Λ ,∂ γ f (cid:96) w ( γ, Λ) = 1 λ + Λ and Z (cid:96) w ( γ, Λ) = exp (cid:18) γ Λ2( λ + Λ) (cid:19) . (47) • (cid:96) regularization Performing the same computation for the (cid:96) regularization r ( w ) = λ | w | , we obtain f (cid:96) w ( γ, Λ) = argmin w (cid:20) λ (cid:107) w (cid:107) + 12 Λ w − γw (cid:21) = γ − λ Λ γ > λ γ + λ Λ γ + λ < otherwise ,∂ γ f (cid:96) w ( γ, Λ) = (cid:40) (cid:107) γ (cid:107) > λ otherwise . (48)27 I Binary classification generalization errors
In this section, we present the computation of the asymptotic generalization error e g ( α ) ≡ lim d →∞ E y, x [ y (cid:54) = ˆ y ( ˆ w ( α ); x )] , (49)leading to expressions in Proposition. 2.1 and Thm. 2.4. The computation at finite dimension issimilar if we do not consider the limit d → ∞ . II.1 General case
The generalization error e g is the prediction error of the estimator ˆ w on new samples { y , X } ,where X is an iid Gaussian matrix and y are ± labels generated according to (18): y = ϕ out (cid:63) ( z ) with z = 1 √ d X w (cid:63) . (50)As the model fitted by ERM may not lead to binary outputs, we may add a non-linearity ϕ : R (cid:55)→ {± } (for example a sign) on top of it to insure to obtain binary outputs ˆ y ± according to ˆ y = ϕ (ˆ z ) with ˆ z = 1 √ d X ˆ w . (51)The classification generalization error is given by the probability that the predicted labels ˆ y andthe true labels y do not match. To compute it, first note that the vectors ( z , ˆ z ) averaged over allpossible ground truth vectors w (cid:63) (or equivalently labels y ) and input matrix X follow in thelarge size limit a joint Gaussian distribution with zero mean and covariance matrix σ = lim d →∞ E w (cid:63) , X d (cid:20) w (cid:63) (cid:124) w (cid:63) w (cid:63) (cid:124) ˆ ww (cid:63) (cid:124) ˆ w ˆ w (cid:124) ˆ w (cid:21) ≡ (cid:20) σ w (cid:63) σ w (cid:63) ˆw σ w (cid:63) ˆw σ ˆw (cid:21) . (52)The asymptotic generalization error depends only on the covariance matrix σ and as the samplesare iid it reads e g ( α ) = lim d →∞ E y, x [ y (cid:54) = ˆ y ( ˆ w ( α ); x )] = 1 − P [ y = ˆ y ( ˆ w ( α ); x )] = 1 − (cid:90) ( R + ) d x N x ( , σ )= 1 − (cid:32)
12 + 1 π atan (cid:32)(cid:115) σ (cid:63) ˆw σ w (cid:63) σ ˆw − σ (cid:63) ˆw (cid:33)(cid:33) = 1 π acos (cid:18) σ w (cid:63) ˆw √ σ w (cid:63) σ ˆw (cid:19) , (53)where we used the fact that atan( x ) = π − acos (cid:16) x − x (cid:17) and acos (cid:0) x − (cid:1) = acos( x ) .Finally e g ( α ) ≡ lim d →∞ E y, x [ y (cid:54) = ˆ y ( ˆ w ( α ); x )] = 1 π acos (cid:18) σ w (cid:63) ˆw √ ρ w (cid:63) σ ˆw (cid:19) , (54)with σ w (cid:63) ˆw ≡ lim d →∞ E w (cid:63) , X d ˆ w (cid:124) w (cid:63) , ρ w (cid:63) ≡ lim d →∞ E w (cid:63) d (cid:107) w (cid:63) (cid:107) , σ ˆw ≡ lim d →∞ E w (cid:63) , X d (cid:107) ˆ w (cid:107) . I.2 Bayes-optimal generalization error
The Bayes-optimal generalization error for classification is equal to eq. (54) where the Bayesestimator ˆ w is the average over the posterior distribution eq. (19) denoted (cid:104) . (cid:105) , knowing theteacher prior P w (cid:63) and channel P out (cid:63) distributions: ˆ w = (cid:104) w (cid:105) w . Hence the parameters σ ˆw and σ w (cid:63) ˆw read in the Bayes-optimal case σ ˆw ≡ lim d →∞ E w (cid:63) , X d (cid:107) ˆ w (cid:107) = lim d →∞ E w (cid:63) , X d (cid:107)(cid:104) w (cid:105) w (cid:107) ≡ q b ,σ w (cid:63) ˆw ≡ lim d →∞ E w (cid:63) , X d ˆ w (cid:124) w (cid:63) = lim d →∞ E w (cid:63) , X d (cid:104) w (cid:105) (cid:124) w w (cid:63) ≡ m b . Using Nishimori identity [55], we easily obtain m b = q b which is solution of eq. (13). Thereforethe generalization error simplifies e bayesg ( α ) = 1 π acos (cid:0) √ η b (cid:1) , with η b = q b ρ w (cid:63) . (55) II.3 ERM generalization error
The generalization error of the ERM estimator is given again by eq. (54) with parameters σ ˆw ≡ lim d →∞ E w (cid:63) , X d (cid:107) ˆ w (cid:107) = lim d →∞ E w (cid:63) , X d (cid:107) ˆ w erm (cid:107) ≡ q ,σ w (cid:63) ˆw ≡ lim d →∞ E w (cid:63) , X d ˆ w (cid:124) w (cid:63) = lim d →∞ E w (cid:63) , X d ( ˆ w erm ) (cid:124) w (cid:63) ≡ m . where the parameters m, q are the asymptotic ERM overlaps solutions of eq. (11) and that finallylead to the ERM generalization error for classification: e ermg ( α ) = 1 π acos ( √ η ) , with η ≡ m ρ w (cid:63) q . (56)29 II Proofs of the ERM fixed points
III.1 Gordon’s result and proofs
We consider in this section that the data have been generated by a teacher (18) with Gaussianweights w (cid:63) ∼ P w (cid:63) ( w (cid:63) ) = N w (cid:63) ( , ρ w (cid:63) I d ) with ρ w (cid:63) ≡ E (cid:2) ( w (cid:63) ) (cid:3) . (57) III.1.1 For real outputs - Regression with (cid:96) regularization In what follows, we prove a theorem that characterizes the asymptotic performance of empiricalrisk minimization ˆ w erm = argmin w n (cid:88) i =1 l (cid:16) y i , √ d x (cid:124) i w (cid:17) + λ (cid:107) w (cid:107) , (58)where { y i } ≤ i ≤ n are general real-valued outputs (that are not necessarily binary), l ( y, z ) is aloss function that is convex with respect to z , and λ > is the strength of the (cid:96) regularization.Note that this setting is more general than the one considered in Thm. 2.2 in the main text,which focuses on binary outputs and loss functions in the form of l ( y, z ) = (cid:96) ( yz ) for someconvex function (cid:96) ( · ) . Theorem III.1 (Regression with (cid:96) regularization) . As n, d → ∞ with n/d = α = Θ(1) , theoverlap parameters m, q concentrate to m −→ d →∞ √ ρ w (cid:63) µ ∗ , q −→ d →∞ ( µ ∗ ) + ( δ ∗ ) , (59) where the parameters µ ∗ , δ ∗ are the solutions of ( µ ∗ , δ ∗ ) = arg min µ,δ ≥ sup τ> (cid:26) λ ( µ + δ )2 − δ τ + α E g,s M τ [ l ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , . )]( µs + δg ) (cid:27) . (60) Here, M τ [ l ( , . )]( x ) is the Moreau-Yosida regularization defined in (28) , and g, s are two iid standardnormal random variables. Proof.
Since the teacher weight vector w (cid:63) is independent of the input data matrix X, we canassume without loss of generality that w (cid:63) = √ dρ d e , where e is the first natural basis vector of R d , and ρ d = (cid:107) w (cid:63) (cid:107) / √ d . As d → ∞ , ρ d → √ ρ w (cid:63) .Accordingly, it will be convenient to split the data matrix into two parts:X = (cid:2) s B (cid:3) , (61)where s ∈ R n × and B ∈ R n × ( d − are two sub-matrices of iid standard normal entries. Theweight vector w in (58) can also be written as w = [ √ dµ, v (cid:124) ] (cid:124) , where µ ∈ R denotes the30rojection of w onto the direction spanned by the teacher weight vector w (cid:63) , and v ∈ R d − isthe projection of w onto the complement subspace. These representations serve to simplify thenotations in our subsequent derivations. For example, we can now write the output as y i = ϕ out (cid:63) ( ρ d s i ) , (62)where s i is the i th entry of the Gaussian vector s in (61).Let Φ d denote the cost of the ERM in (58), normalized by d . Using our new representationsintroduced above, we have Φ d = min µ, v d n (cid:88) i =1 l (cid:16) y i , µs i + √ d b (cid:124) i v (cid:17) + λ ( dµ + (cid:107) v (cid:107) )2 d , (63)where b (cid:124) i denotes the i th row of B. Since the loss function l ( y i , z ) is convex with respect to z ,we can rewrite it as l ( y i , z ) = sup q { qz − l ∗ ( y i , q ) } , (64)where l ∗ ( y i , q ) = sup z { qz − l ( y i , z ) } is its convex conjugate. Substituting (64) into (63), wehave Φ d = min µ, v sup q µ q (cid:124) s d + 1 d / q (cid:124) B v − d n (cid:88) i =1 l ∗ ( y i , q i ) + λ (cid:16) dµ + (cid:107) v (cid:107) (cid:17) d . (65)Now consider a new optimization problem (cid:101) Φ d = min µ, v sup q µ q (cid:124) s d + (cid:107) q (cid:107)√ d h (cid:124) v d + (cid:107) v (cid:107)√ d g (cid:124) q d − d n (cid:88) i =1 l ∗ ( y i , q i ) + λ (cid:16) dµ + (cid:107) v (cid:107) (cid:17) d , (66)where h ∼ N ( , I d − ) and g ∼ N ( , I n ) are two independent standard normal vectors. Itfollows from Gordon’s minimax comparison inequality (see, e.g. , [ ? ]) that P ( | Φ d − c | ≥ (cid:15) ) ≤ P (cid:16)(cid:12)(cid:12)(cid:12)(cid:101) Φ d − c (cid:12)(cid:12)(cid:12) ≥ (cid:15) (cid:17) (67)for any constants c and (cid:15) > . This implies that (cid:101) Φ d serves as a surrogate of Φ d . Specifically, if (cid:101) Φ d concentrates around some deterministic limit c as d → ∞ , so does Φ d . In what follows, weproceed to solve the surrogate problem in (66). First, let δ = (cid:107) v (cid:107) / √ d . It is easy to see that (66)31an be simplified as (cid:101) Φ d = min µ,δ ≥ sup q (cid:40) q (cid:124) ( µ s + δ g ) d − δ (cid:107) q (cid:107)√ d (cid:107) h (cid:107)√ d − d n (cid:88) i =1 l ∗ ( y i , q i ) + λ ( µ + δ )2 (cid:41) ( a ) = min µ,δ ≥ sup τ> sup q (cid:40) − τ (cid:107) q (cid:107) d − δ (cid:107) h (cid:107) τ d + q (cid:124) ( µ s + δ g ) d − d n (cid:88) i =1 l ∗ ( y i , q i ) + λ ( µ + δ )2 (cid:41) = min µ,δ ≥ sup τ> (cid:40) λ ( µ + δ )2 − δ (cid:107) h (cid:107) τ d − αn inf q (cid:104) τ (cid:107) q (cid:107) − q (cid:124) ( µ s + δ g ) + n (cid:88) i =1 l ∗ ( y i , q i ) (cid:105)(cid:41) ( b ) = min µ,δ ≥ sup τ> (cid:40) λ ( µ + δ )2 − δ (cid:107) h (cid:107) τ d − αn n (cid:88) i =1 M τ [ l ( y i , . )]( µs i + δg i ) (cid:41) . 
In ( a ) , we have introduced an auxiliary variable τ to rewrite − δ (cid:107) q (cid:107)√ d (cid:107) h (cid:107)√ d as − δ (cid:107) q (cid:107)√ d (cid:107) h (cid:107)√ d = sup τ> (cid:40) − τ (cid:107) q (cid:107) d − δ (cid:107) h (cid:107) τ d (cid:41) , and to get ( b ) , we use the identity inf q (cid:110) τ q − qz + (cid:96) ∗ ( q ) (cid:111) = − inf x (cid:26) ( z − x ) τ + (cid:96) ( x ) (cid:27) that holds for any z and for any convex function (cid:96) ( x ) and its conjugate (cid:96) ∗ ( q ) . As d → ∞ ,standard concentration arguments give us (cid:107) h (cid:107) d → and n (cid:80) ni =1 M τ [ l ( y i , . )]( µs i + δg i ) → E g,s M τ [ l ( y, . )]( µs + δg ) locally uniformly over τ, µ and δ . Using (67) and recalling (62), wecan then conclude that the normalized cost of the ERM Φ d converges to the optimal value ofthe deterministic optimization problem in (60). Finally, since λ > , one can show that the costfunction of (60) has a unique global minima at µ ∗ and δ ∗ . It follows that the empirical values of ( µ, δ ) also converge to their corresponding deterministic limits ( µ ∗ , δ ∗ ) . III.1.2 For binary outputs - Classification with (cid:96) regularization In what follows, we specialize the previous theorem to the case of binary classification, with aconvex loss function in the form of l ( y, z ) = (cid:96) ( yz ) for some function (cid:96) ( · ) . Theorem III.2 (Thm. 2.2 in the main text. Gordon’s min-max fixed point - Classification with (cid:96) regularization) . As n, d → ∞ with n/d = α = Θ(1) , the overlap parameters m, q concentrateto m −→ d →∞ √ ρ w (cid:63) µ ∗ , q −→ d →∞ ( µ ∗ ) + ( δ ∗ ) , (68) where parameters µ ∗ , δ ∗ are solutions of ( µ ∗ , δ ∗ ) = arg min µ,δ ≥ sup τ> (cid:26) λ ( µ + δ )2 − δ τ + α E g,s M τ [ δg + µsϕ out (cid:63) ( √ ρ w (cid:63) s )] (cid:27) , (69)32 nd g, s are two iid standard normal random variables. The solutions ( µ ∗ , δ ∗ , τ ∗ ) of (69) can bereformulated as a set of fixed point equations µ ∗ = αλτ ∗ + α E [ s · ϕ out (cid:63) ( √ ρ w (cid:63) s ) · P τ ∗ ( δ ∗ g + µ ∗ sϕ out (cid:63) ( √ ρ w (cid:63) s ))] ,δ ∗ = αλτ ∗ + α − E [ g · P τ ∗ ( δ ∗ g + µ ∗ sϕ out (cid:63) ( √ ρ w (cid:63) s ))] , ( δ ∗ ) = α E [( δ ∗ g + µ ∗ sϕ out (cid:63) ( √ ρ w (cid:63) s ) − P τ ∗ ( δ ∗ g + µ ∗ sϕ out (cid:63) ( √ ρ w (cid:63) s ))) ] , (70)where M τ and P τ denote the Moreau-Yosida regularization and the proximal map of a convexloss function ( y, z ) (cid:55)→ (cid:96) ( yz ) : M τ ( z ) = min x (cid:26) (cid:96) ( x ) + ( x − z ) τ (cid:27) , P τ ( z ) = arg min x (cid:26) (cid:96) ( x ) + ( x − z ) τ (cid:27) . Proof.
We start by deriving (69) as a special case of (60). To that end, we note that M τ [ l ( y, . )]( z ) = min x (cid:26) l ( y ; x ) + ( x − z ) τ (cid:27) = min x (cid:26) (cid:96) ( yx ) + ( x − z ) τ (cid:27) = min x (cid:26) (cid:96) ( x ) + ( x − yz ) τ (cid:27) = M τ ( yz ) , where to reach the last equality we have used the fact that y ∈ {± } . Substituting this specialform into (60) and recalling (62), we reach (69).Finally, to obtain the fixed point equations (70), we simply take the partial derivatives of thecost function in (69) with respect to µ, δ, τ , and use the following well-known calculus rules forthe Moreau-Yosida regularization [ ? ]: ∂ M τ ( z ) ∂z = z − P τ ( z ) τ ,∂ M τ ( z ) ∂τ = − ( z − P τ ( z )) τ . III.2 Replica’s formulation
The replica computation presented in Sec. IV boils down to the characterization of the overlaps m, q in the high-dimensional limit n, d → ∞ with α = nd = Θ(1) , given by the solution of aset of, in the most general case, six fixed point equations over m, q, Q, ˆ m, ˆ q, ˆ Q . Introducing thenatural variables Σ ≡ Q − q , ˆΣ ≡ ˆ Q + ˆ q , η ≡ m ρ w (cid:63) q and ˆ η ≡ ˆ m ˆ q , the set of fixed point equations33or arbitrary P w (cid:63) , P out (cid:63) , convex loss l ( y, z ) and regularizer r ( w ) , is finally given by m = E ξ (cid:104) Z w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) f w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) f w (cid:16) ˆ q / ξ, ˆΣ (cid:17)(cid:105) ,q = E ξ (cid:20) Z w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) f w (cid:16) ˆ q / ξ, ˆΣ (cid:17) (cid:21) , Σ = E ξ (cid:104) Z w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) ∂ γ f w (cid:16) ˆ q / ξ, ˆΣ (cid:17)(cid:105) , ˆ m = α E y,ξ (cid:104) Z out (cid:63) ( . ) · f out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) , ˆ q = α E y,ξ (cid:20) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:16) y, q / ξ, Σ (cid:17) (cid:21) , ˆΣ = − α E y,ξ (cid:104) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) ∂ ω f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) . (71)The above equations depend on the Bayes-optimal partition functions Z w (cid:63) , Z out (cid:63) defined ineq. (24), the updates f w (cid:63) , f out (cid:63) in eq. (25) and the ERM updates f w , f out eq. (29). III.3 Equivalence Gordon-Replica’s formulation - (cid:96) regularization and Gaus-sian weights III.3.1 Replica’s formulation for (cid:96) regularization The proximal for the (cid:96) penalty with strength λ can be computed explicitly in eq. (47) andthe corresponding denoising function is simply given by f (cid:96) ,λ w ( γ, Λ) = γλ +Λ . Therefore, fora Gaussian teacher (57) already considered in Thm. (70) with second moment ρ w (cid:63) , using thedenoising function (41), the fixed point equations over m, q, Σ can be computed analyticallyand lead to m = ρ w (cid:63) ˆ mλ + ˆΣ , q = ρ w (cid:63) ˆ m + ˆ q ( λ + ˆΣ) , Σ = 1 λ + ˆΣ . (72)Hence, removing the hat variables in eqs. (71), the set of fixed point equations can be rewrittenin a more compact way leading to the Corollary. 2.3 that we recall here: Corollary III.3 (Corollary. 2.3 in the main text. Equivalence Gordon-Replicas) . The set of fixedpoint equations (70) in Thm. III.2 that govern the asymptotic behaviour of the overlaps m and q isequivalent to the following set of equations, obtained from the heuristic replica computation: m = α Σ ρ w (cid:63) · E y,ξ (cid:104) Z out (cid:63) ( . ) · f out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) · f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) q = m /ρ w (cid:63) + α Σ · E y,ξ (cid:20) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) · f out (cid:16) y, q / ξ, Σ (cid:17) (cid:21) (73) Σ = (cid:16) λ − α · E y,ξ (cid:104) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) · ∂ ω f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105)(cid:17) − with η ≡ m ρ w (cid:63) q , ξ ∼ N (0 , and E y the continuous or discrete sum over all possible values y according to P out (cid:63) .Proof of Corollary. III.3(Corollary. 2.3). For the sake of clarity, we use the abusive notation P V ( y, ω ) = P V [ l ( y, . )]( ω ) , and we remove the ∗ . 34 ictionary We first map the Gordon’s parameters ( µ, δ, τ ) in eq. (70) to ( m, q, Σ ) in eq. 
(73): √ ρ w (cid:63) µ ↔ m , µ + δ ↔ q , τ ↔ Σ . so that η = m ρ w (cid:63) q = µ µ + δ , − η = δ µ + δ . From eq. (24), we can rewrite the channel partition function Z out (cid:63) and its derivative Z out (cid:63) ( y, ω, V ) = E z (cid:104) P out (cid:63) (cid:16) y |√ V z + ω (cid:17)(cid:105) ,∂ ω Z out (cid:63) ( y, ω, V ) = 1 √ V E z (cid:104) zP out (cid:63) (cid:16) y |√ V z + ω (cid:17)(cid:105) , (74)where z denotes a standard normal random variable. Equation over m Let us start with the equation over m in eq. (73): m = Σ αρ w (cid:63) E y,ξ (cid:104) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) = Σ α √ ρ w (cid:63) √ − η E y,ξ,z (cid:104) zP out (cid:63) (cid:16) y |√ ρ w (cid:63) (cid:16)(cid:112) − ηz + √ ηξ (cid:17)(cid:17) Σ − ( P Σ ( y, √ qξ ) − √ qξ ) (cid:105) (Using eq. (74)) ⇔ µ = (cid:112) µ + δ δ α E y,ξ,z (cid:34) zP out (cid:63) (cid:34) y |√ ρ w (cid:63) δz + µξ (cid:112) µ + δ (cid:35) (cid:16) P τ (cid:16) y, (cid:112) µ + δ ξ (cid:17) − (cid:112) µ + δ ξ (cid:17)(cid:35) (Dictionary) = (cid:112) µ + δ δ α E ξ,z (cid:34) z (cid:32) P τ (cid:32) ϕ out (cid:63) (cid:32) √ ρ w (cid:63) δz + µξ (cid:112) µ + δ (cid:33) , (cid:112) µ + δ ξ (cid:33) − (cid:112) µ + δ ξ (cid:33)(cid:35) (Integration over y ) = α E s,g (cid:104)(cid:16) s − µδ g (cid:17) ( P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs ) − ( δg + µs )) (cid:105) (Change of variables ( ξ, z ) → ( g, s ) ) = α E s,g (cid:104)(cid:16) s − µδ g (cid:17) ( P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )) (cid:105) (Gaussian integrations) ⇔ µ = α E s,g (cid:2) s · P τ (cid:0) ϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1) , δg + µs (cid:1)(cid:3) αδ E s,g (cid:2) g · P τ (cid:0) ϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1) , δg + µs (cid:1)(cid:3) = αλτ + α E s,g [ s · ϕ out (cid:63) ( √ ρ w (cid:63) s ) ( P τ ( δg + µs ) ϕ out (cid:63) ( √ ρ w (cid:63) s ))] , (Second fixed point equation)where we used the fact that P out (cid:63) ( y | z ) = δ ( y − ϕ out (cid:63) ( z )) , the change of variables s = µξ + δz √ µ + δ g = δξ − µz √ µ + δ ⇔ ξ = δg + µs √ µ + δ z = δs − µg √ µ + δ , (75)35nd finally in the last equality the definition of the second fixed point equation in eqs. (70): δ = α E s,g (cid:2) g · P τ (cid:0) ϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1) , δg + µs (cid:1)(cid:3) λτ + α − . (76) Equation over q Let us now compute the equation over q in eq. (73): q − m /ρ w (cid:63) = Σ α E y,ξ (cid:20) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:16) y, q / ξ, Σ (cid:17) (cid:21) = Σ α E y,ξ,z (cid:20) P out (cid:63) (cid:16) y |√ ρ w (cid:63) (cid:16)(cid:112) − ηz + √ ηξ (cid:17)(cid:17) ( p Σ ( y, √ qξ ) − √ qξ ) (cid:21) (Using eq. (74)) ⇔ δ = α E y,ξ,z (cid:34) P out (cid:63) (cid:32) y |√ ρ w (cid:63) δz + µξ (cid:112) µ + δ (cid:33) (cid:16) p τ (cid:16) y, (cid:112) µ + δ ξ (cid:17) − (cid:112) µ + δ ξ (cid:17) (cid:35) (Dictionary) = α E ξ,z (cid:32) p τ (cid:32) ϕ out (cid:63) (cid:32) √ ρ w (cid:63) δz + µξ (cid:112) µ + δ (cid:33) , (cid:112) µ + δ ξ (cid:33) − (cid:112) µ + δ ξ (cid:33) (Integration over y ) = α E g,s (cid:104) ( p τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs ) − ( δg + µs )) (cid:105) (Change of variables ( ξ, z ) → ( g, s ) ) Equation over Σ Let us conclude with the equation over Σ in eq. (73) that we encounteredin eq. (76). 
Let us first compute α E y,ξ (cid:104) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) ∂ ω f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) = α E y,ξ,z (cid:20) P out (cid:63) (cid:16) y |√ ρ w (cid:63) (cid:16)(cid:112) − ηz + √ ηξ (cid:17)(cid:17)
1Σ ( ∂ ω p Σ ( y, √ qξ ) − (cid:21) (Using eq. (74)) = ατ E y,ξ,z (cid:34) P out (cid:63) (cid:32) y |√ ρ w (cid:63) δz + µξ (cid:112) µ + δ (cid:33) (cid:16) ∂ ω P τ (cid:16) y, (cid:112) µ + δ ξ (cid:17) − (cid:17)(cid:35) (Dictionary) = ατ E ξ,z (cid:34) ∂ ω P τ (cid:32) ϕ out (cid:63) (cid:32) √ ρ w (cid:63) δz + µξ (cid:112) µ + δ (cid:33) , (cid:112) µ + δ ξ (cid:33)(cid:35) − ατ (Integration over y ) = 1 τ α ( E g,s [ ∂ ω P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] − (Change of variables ( ξ, z ) → ( g, s ) )36herefore, the last equation over Σ in eq. (73) reads Σ = (cid:16) λ − α E y,ξ (cid:104) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) ∂ ω f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105)(cid:17) − ⇔ τ = (cid:18) λ − τ α ( E g,s [ ∂ ω P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] − (cid:19) − ⇔ α E g,s [ ∂ ω P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] = τ λ + α − . Noting that E g,s [ ∂ ω P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] = 1 δ E g,s [ d∂ ω P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )]= 1 δ E g,s [ ∂ g P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] = 1 δ E g,s [ g P τ ( δg + µsϕ out (cid:63) ( √ ρ w (cid:63) s ))] (Stein’s lemma)where we used the Stein’s lemma in the last equality, we finally obtain α E g,s [ ∂ ω P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] = τ λ + α − ⇔ δ = ατ λ + α − E g,s [ g · P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] . Gauge transformation
We still remain to prove that E s,g [ g · P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] = E s,g [ g · P τ ( δg + µsϕ out (cid:63) ( √ ρ w (cid:63) s ))] E s,g [ s · P τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs )] = E s,g [ s · P τ ( δg + µsϕ out (cid:63) ( √ ρ w (cid:63) s ))] E g,s (cid:104) ( p τ ( ϕ out (cid:63) ( √ ρ w (cid:63) s ) , δg + µs ) − ( δg + µs )) (cid:105) = E g,s (cid:104) (( p τ − ) ( δg + µsϕ out (cid:63) ( √ ρ w (cid:63) s ))) (cid:105) (77)As ϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1) = ± , we can transform s → sϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1) = ˜ s . It does not changethe distribution of the random variable ˜ s that is still a normal random variable. Finally denot-ing P τ (cid:0) , δg + µsϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1)(cid:1) = P τ (cid:0) δg + µsϕ out (cid:63) (cid:0) √ ρ w (cid:63) s (cid:1)(cid:1) , we obtain the equivalencewith eq. (70), which concludes the proof. 37 V Replica computation for Bayes-optimal and ERM estimations
In this section, we present the statistical physics framework and the replica computation leadingto the general set of fixed point equations (11) and to the Bayes-optimal fixed point equations(13).
IV.1 Statistical inference and free entropy
As stressed in Sec. I, both ERM and Bayes-optimal estimations can be analyzed in a unifiedframework that consists in studying the joint distribution P ( y , X ) in the following posteriordistribution P ( w | y , X ) = P ( y | w , X ) P ( w ) P ( y , X ) , (78)known as the so-called partition function in the physics literature. It is the generating functionof many useful statistical quantities and is defined by Z ( y , X ) ≡ P ( y , X ) = (cid:90) R d d w P out ( y | w , X ) P w ( w )= (cid:90) R n d z P out ( y | z ) (cid:90) R d d w P w ( w ) δ (cid:18) z − √ d X w (cid:19) , (79)where we introduced the variable z = √ d X w . However in the considered high-dimensionalregime ( d → ∞ , n → ∞ , α = Θ(1) ), we are interested instead in the averaged (over instancesof input data X and teacher weights w (cid:63) or equivalently over the output labels y ) free entropy Φ defined as Φ( α ) ≡ E y , X (cid:20) lim d →∞ d log Z ( y , X ) (cid:21) . (80)The replica method is an heuristic method of statistical mechanics that allows to computethe above average over the random dataset { y , X } . We show in the next section the classicalcomputation for the Generalized Linear Model hypothesis class and iid data X. IV.2 Replica computation
IV.2.1 Derivation
We present here the replica computation of the averaged free entropy Φ( α ) in eq. (80) for generalprior distributions P w , P w (cid:63) and channel distributions P out , P out (cid:63) , so that the computationremain valid for both Bayes-optimal and ERM estimation (with any convex loss l and regularizer r ). 38 eplica trick The average in eq. (80) is intractable in general, and the computation relies onthe so called replica trick that consists in applying the identity E y , X (cid:20) lim d →∞ d log Z ( y , X ) (cid:21) = lim r → (cid:20) lim d →∞ d ∂ log E y , X [ Z ( y , X ) r ] ∂r (cid:21) . (81)This is interesting in the sense that it reduces the intractable average to the computation ofthe moments of the averaged partition function, which are easiest quantities to compute. Notethat for r ∈ N , Z ( y , X ) r represents the partition function of r ∈ N identical non-interactingcopies of the initial system, called replicas . Taking the average will then correlate the replicas,before taking the number of replicas r → . Therefore, we assume there exists an analyticalcontinuation so that r ∈ R and the limit is well defined. Finally, note we exchanged the orderof the limits r → and d → ∞ . These technicalities are crucial points but are not rigorouslyjustified and we will ignore them in the rest of the computation.Thus the replicated partition function in eq. (81) can be written as E y , X [ Z ( y , X ) r ] = E w (cid:63) , X (cid:34) r (cid:89) a =1 (cid:90) R n d z a P out a ( y | z a ) (cid:90) R d d w a P w a ( w a ) δ (cid:18) z a − √ d X w a (cid:19)(cid:35) = E X (cid:90) R n d y (cid:90) R n d z (cid:63) P out (cid:63) ( y | z (cid:63) ) (cid:90) R d d w (cid:63) P w (cid:63) ( w (cid:63) ) δ (cid:18) z (cid:63) − √ d X w (cid:63) (cid:19) × (cid:34) r (cid:89) a =1 (cid:90) R n d z a P out a ( y | z a ) (cid:90) R d d w a P w a ( w a ) δ (cid:18) z a − √ d X w a (cid:19)(cid:35) = E X (cid:90) R n d y r (cid:89) a =0 (cid:90) R n d z a P out a ( y | z a ) (cid:90) R d d w a P w a ( w a ) δ (cid:18) z a − √ d X w a (cid:19) (82)with the decoupled channel P out ( y | z ) = n (cid:89) µ =1 P out ( y µ | z µ ) . Note that the average over y isequivalent to the one over the ground truth vector w (cid:63) , which can be considered as a new replica w with index a = 0 leading to a total of r + 1 replicas.We suppose that inputs are drawn from an iid distribution, for example a Gaussian N (0 , .More precisely, for i, j ∈ [1 : d ] , µ, ν ∈ [1 : n ] , E X (cid:104) x ( µ ) i x ( ν ) j (cid:105) = δ µν δ ij . Hence z aµ = √ d (cid:80) di =1 x ( µ ) i w ai is the sum of iid random variables. The central limit theorem insures that z aµ ∼ N (cid:0) E X [ z aµ ] , E X [ z aµ z bµ ] (cid:1) , with the two first moments given by: E X [ z aµ ] = √ d (cid:80) di =1 E X (cid:104) x ( µ ) i (cid:105) w ai = 0 E X [ z aµ z bµ ] = d (cid:80) ij E X (cid:104) x ( µ ) i x ( µ ) j (cid:105) w ai w bj = d (cid:80) ij δ ij w ai w bj = d w a · w b . (83)In the following we introduce the symmetric overlap matrix Q ( { w a } ) ≡ (cid:0) d w a · w b (cid:1) a,b =0 ..r . Letus define ˜z µ ≡ ( z aµ ) a =0 ..r and ˜w i ≡ ( w ai ) a =0 ..r . 
The vector ˜z µ follows a multivariate Gaussian39istribution ˜z µ ∼ P ˜z (˜ z ; Q ) = N ˜ z ( r +1 , Q ) and as P ˜w ( ˜w ) = (cid:81) ra =0 P w ( ˜ w a ) it follows E y , X [ Z ( y , X ) r ] = E X (cid:90) R n d y r (cid:89) a =0 (cid:90) R n d z a P out a ( y | z a ) (cid:90) R d d w a P w a ( w a ) δ (cid:18) z a − √ d X w a (cid:19) = (cid:20)(cid:90) R d y (cid:90) R r +1 d˜ z P out ( y | ˜ z ) P ˜z (˜ z ; Q ( ˜ w )) (cid:21) n (cid:20)(cid:90) R r +1 d ˜ w P ˜w ( ˜ w ) (cid:21) d , because the channel and the prior distributions factorize. Introducing the change of variableand the Fourier representation of the δ -Dirac function, which involves a new ad-hoc parameter ˆ Q : (cid:90) R r +1 × r +1 d Q (cid:89) a ≤ b δ (cid:32) dQ ab − d (cid:88) i =1 w ai w bi (cid:33) ∝ (cid:90) R r +1 × r +1 d Q (cid:90) R r +1 × r +1 d ˆ Q exp (cid:16) − d Tr (cid:104) Q ˆ Q (cid:105)(cid:17) exp (cid:32) d (cid:88) i =1 ˜w (cid:124) i ˆ Q ˜w i (cid:33) , the replicated partition function becomes an integral over the symmetric matrices Q ∈ R r +1 × r +1 and ˆ Q ∈ R r +1 × r +1 , that can be evaluated using a Laplace method in the d → ∞ limit, E y , X [ Z ( y , X ) r ] = (cid:90) R r +1 × r +1 d Q (cid:90) R r +1 × r +1 d ˆ Qe d Φ ( r ) ( Q, ˆ Q ) (84) (cid:39) d →∞ exp (cid:16) d · extr Q, ˆ Q (cid:110) Φ ( r ) ( Q, ˆ Q ) (cid:111)(cid:17) , (85)where we defined Φ ( r ) ( Q, ˆ Q ) = − Tr (cid:104) Q ˆ Q (cid:105) + log Ψ ( r )w ( ˆ Q ) + α log Ψ ( r )out ( Q )Ψ ( r )w ( ˆ Q ) = (cid:90) R r +1 d ˜w P ˜w ( ˜w ) e ˜w (cid:124) ˆ Q ˜w Ψ ( r )out ( Q ) = (cid:90) d y (cid:90) R r +1 d ˜z P ˜ z ( ˜z ; Q ) P out ( y | ˜z ) , (86)and P ˜ z (˜ z ; Q ) = e − ˜ z (cid:124) Q − ˜ z det(2 πQ ) / .Finally switching the two limits r → and d → ∞ , the quenched free entropy Φ simplifiesas a saddle point equation Φ( α ) = extr Q, ˆ Q (cid:40) lim r → ∂ Φ ( r ) ( Q, ˆ Q ) ∂r (cid:41) , (87)over symmetric matrices Q ∈ R r +1 × r +1 and ˆ Q ∈ R r +1 × r +1 . In the following we will assume asimple ansatz for these matrices in order to first obtain an analytic expression in r before takingthe derivative with respect to r . 40 S free entropy
Let’s compute the functional Φ ( r ) ( Q, ˆ Q ) appearing in the free entropyeq. (87) in the simplest ansatz: the Replica Symmetric ansatz. This later assumes that all replicaremain equivalent with a common overlap q = d w a · w b for a (cid:54) = b , a norm Q = d (cid:107) w a (cid:107) , andan overlap with the ground truth m = d w a · w (cid:63) , leading to the following expressions of thereplica symmetric matrices Q rs ∈ R r +1 × r +1 and ˆ Q rs ∈ R r +1 × r +1 : Q rs = Q m ... mm Q ... ...... ... ... qm ... q Q and ˆ Q rs = ˆ Q ˆ m ... ˆ m ˆ m − ˆ Q ... ...... ... ... ˆ q ˆ m ... ˆ q − ˆ Q , (88)with Q = ρ w (cid:63) = d (cid:107) w (cid:63) (cid:107) . Let’s compute separately the terms involved in the functional Φ ( r ) ( Q, ˆ Q ) eq. (86) with this ansatz: the first is a trace term, the second a term Ψ ( r )w dependingon the prior distributions P w , P w (cid:63) and finally the third a term Ψ ( r )out that depends on the channeldistributions P out (cid:63) , P out . Trace term
The trace term can be easily computed and takes the following form: Tr (cid:16) Q ˆ Q (cid:17)(cid:12)(cid:12)(cid:12) rs = Q ˆ Q + rm ˆ m − rQ ˆ Q + r ( r − q ˆ q . (89) Prior integral
Evaluated at the RS fixed point, and using a Gaussian identity also known asa Hubbard-Stratonovich transformation E ξ exp( √ aξ ) = e a , the prior integral can be furthersimplified Ψ ( r )w ( ˆ Q ) (cid:12)(cid:12)(cid:12) rs = (cid:90) R r +1 d ˜w P ˜w ( ˜w ) e ˜w (cid:124) ˆ Q rs ˜w = E w (cid:63) e ˆ Q ( w (cid:63) ) (cid:90) R r d ˜w P ˜w ( ˜w ) e w (cid:63) ˆ m (cid:80) ra =1 ˜ w a − ( ˆ Q +ˆ q ) (cid:80) ra =1 ( ˜ w a ) + ˆ q ( (cid:80) ra =1 ˜ w a ) = E ξ,w (cid:63) e ˆ Q ( w (cid:63) ) (cid:20) E w exp (cid:18)(cid:20) ˆ mw (cid:63) w −
12 ( ˆ Q + ˆ q ) w + ˆ q / ξw (cid:21)(cid:19)(cid:21) r . (90) Channel integral
Let’s focus on the inverse matrix Q − = Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − Q − (91)41ith Q − = (cid:0) Q − rm ( Q + ( r − q ) − m (cid:1) − Q − = − (cid:0) Q − rm ( Q + ( r − q ) − m (cid:1) − m ( q + ( r − q ) − Q − = ( Q − q ) − − ( Q + ( r − q ) − q ( Q − q ) − +( Q + ( r − q ) − m (cid:0) Q − rm ( Q + ( r − q ) − m (cid:1) − m ( Q + ( r − q ) − Q − = − ( Q + ( r − q ) − q ( Q − q ) − +( Q + ( r − q ) − m (cid:0) Q − rm ( Q + ( r − q ) − m (cid:1) − m ( Q + ( r − q ) − and its determinant: det Q rs = ( Q − q ) r − ( Q + ( r − q ) (cid:0) Q − rm ( Q + ( r − q ) − m (cid:1) Using the same kind of Gaussian transformation, we obtain Ψ ( r )out ( Q ) (cid:12)(cid:12)(cid:12) rs = (cid:90) d y (cid:90) R r +1 d ˜z e − ˜ z (cid:124) Q − ˜ z − log(det(2 πQ rs )) P out ( y | ˜z )= E y,ξ e − log(det(2 πQ rs )) × (cid:90) d z (cid:63) P out (cid:63) ( y | z (cid:63) ) e − Q − ( z (cid:63) ) (cid:20)(cid:90) dzP out ( y | z ) e − Q − z (cid:63) z − ( Q − − Q − ) z − Q − / ξz (cid:21) r IV.3 ERM and Bayes-optimal free entropy
Taking carefully the derivative and the r → limit imposes ˆ Q = 0 and we finally obtain thereplica symmetric free entropy Φ rs : Φ rs ( α ) ≡ E y , X (cid:20) lim d →∞ d log ( Z ( y , X )) (cid:21) (92) = extr Q, ˆ Q,q, ˆ q,m, ˆ m (cid:26) − m ˆ m + 12 Q ˆ Q + 12 q ˆ q + Ψ w (cid:16) ˆ Q, ˆ m, ˆ q (cid:17) + α Ψ out ( Q, m, q ; ρ w (cid:63) ) (cid:27) , where ρ w (cid:63) = lim d →∞ E w (cid:63) d (cid:107) w (cid:63) (cid:107) and the channel and prior integrals are defined by Ψ w (cid:16) ˆ Q, ˆ m, ˆ q (cid:17) ≡ E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:105) , Ψ out ( Q, m, q ; ρ w (cid:63) ) ≡ E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) log Z out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) , (93)where again Z out (cid:63) and Z w (cid:63) are defined in eq. (24) and depend on the teacher , while the denoisingfunctions Z out and Z w depend on the inference model. In particular, we explicit in the nextsections the above free entropy in the case of ERM and Bayes-optimal estimation.42 V.3.1 ERM estimation
As described in eq. (21), the free entropy for ERM estimation is therefore given by eq. (92) if wetake − log P ( y | z ) = l ( y , z ) and − log P ( w ) = r ( w ) . As described in Sec. I.3.2 they lead to thefollowing partition functions: Z λ w ( γ, Λ) = lim ∆ → e − M Λ − [ r ( λ,. )](Λ − γ ) e − γ Λ − , Z out ( y, ω, V ) = lim ∆ → e − M V ∆ [ l ( y,. )]( ω ) √ πV √ π ∆ , (94)with the Moreau-Yosida regularization (28). IV.3.2 Bayes-optimal estimation
In the Bayes-optimal case, we have access to the ground truth distributions P ( y | z ) = P out (cid:63) ( y | z ) and P ( w ) = P w (cid:63) ( w ) , and therefore Z out = Z out (cid:63) , Z w = Z w (cid:63) . Nishimori conditions in theBayes-optimal case [55] imply that Q = ρ w (cid:63) , m = q = q b , ˆ Q = 0 , ˆ m = ˆ q = ˆ q b . Therefore thefree entropy eq. (92) simplifies as an optimization problem over two scalar overlaps q b , ˆ q b : Φ b ( α ) = extr q b , ˆ q b (cid:26) − q b ˆ q b + Ψ bw (ˆ q b ) + α Ψ bout ( q b ; ρ w (cid:63) ) (cid:27) , (95)with free entropy terms Ψ bw and Ψ bout given by Ψ bw (ˆ q ) = E ξ (cid:104) Z w (cid:63) (cid:16) ˆ q / ξ, ˆ q (cid:17) log Z w (cid:63) (cid:16) ˆ q / ξ, ˆ q (cid:17)(cid:105) , Ψ bout ( q ; ρ w (cid:63) ) = E y,ξ (cid:104) Z out (cid:63) (cid:16) y, q / ξ, ρ w (cid:63) − q (cid:17) log Z out (cid:63) (cid:16) y, q / ξ, ρ w (cid:63) − q (cid:17)(cid:105) . and again Z out (cid:63) and Z w (cid:63) are defined in eq. (24). The above replica symmetric free entropy inthe Bayes-optimal case has been rigorously proven in [10]. IV.4 Sets of fixed point equations
As highlighted in Sec. II, the asymptotic overlaps m, q measure the performances of the ERMor Bayes-optimal statistical estimators, whose behaviours are respectively characterized byextremizing the free entropy (92) and (95). This section is devoted to derive the correspondingsets of fixed point equations.
IV.4.1 ERM estimation
Extremizing the free entropy eq. (92), we easily obtain the set of six fixed point equations ˆ Q = − α∂ Q Ψ out , Q = − ∂ ˆ Q Ψ w ˆ q = − α∂ q Ψ out , q = − ∂ ˆ q Ψ w , ˆ m = α∂ m Ψ out , m = ∂ ˆ m Ψ w . (96)43hese equations can be formulated as functions of the partition functions Z out (cid:63) , Z w (cid:63) and thedenoising functions f out (cid:63) , f w (cid:63) , f out , f w defined in eq. (25) and eq. (29). The derivation is shownin Appendix. IV.5.3 and defining the natural variables Σ = Q − q , ˆΣ = ˆ Q + ˆ q , η ≡ m ρ w (cid:63) q and ˆ η ≡ ˆ m ˆ q , it can be written as m = E ξ (cid:104) Z w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) f w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) f w (cid:16) ˆ q / ξ, ˆΣ (cid:17)(cid:105) ,q = E ξ (cid:20) Z w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) f w (cid:16) ˆ q / ξ, ˆΣ (cid:17) (cid:21) , Σ = E ξ (cid:104) Z w (cid:63) (cid:16)(cid:112) ˆ ηξ, ˆ η (cid:17) ∂ γ f w (cid:16) ˆ q / ξ, ˆΣ (cid:17)(cid:105) , ˆ m = α E y,ξ (cid:104) Z out (cid:63) ( . ) · f out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) , ˆ q = α E y,ξ (cid:20) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) f out (cid:16) y, q / ξ, Σ (cid:17) (cid:21) , ˆΣ = − α E y,ξ (cid:104) Z out (cid:63) ( y, √ ρ w (cid:63) ηξ, ρ w (cid:63) (1 − η )) ∂ ω f out (cid:16) y, q / ξ, Σ (cid:17)(cid:105) , (97)and we finally obtain the set of equations eqs. (71). IV.4.2 Bayes-optimal estimation
Extremizing the Bayes-optimal free entropy eq. (95), we easily obtain the set of 2 fixed pointequations over the scalar parameters q b , ˆ q b . In fact, it can also be deduced from eq. (97) usingthe Nishimori conditions f w = f w (cid:63) , f out = f out (cid:63) , m = q = q b , Σ = ρ w (cid:63) − q, ˆ m = ˆ q = ˆ q b and ˆ Q = 0 that lead to the result (13) in Thm. 2.4, from [10] ˆ q b = α E y,ξ (cid:20) Z out (cid:63) (cid:16) y, q / ξ, ρ w (cid:63) − q b (cid:17) f out (cid:63) (cid:16) y, q / ξ, ρ w (cid:63) − q b (cid:17) (cid:21) ,q b = E ξ (cid:20) Z w (cid:63) (cid:16) ˆ q / ξ, ˆ q b (cid:17) f w (cid:63) (cid:16) ˆ q / ξ, ˆ q b (cid:17) (cid:21) . (98) IV.5 Useful derivations
In this section, we give useful computation steps that we used to transform the sets of fixedpoint equations (96).
IV.5.1 Prior free entropy term
In specific simple cases, the prior free entropy term Ψ w (cid:16) ˆ Q, ˆ m, ˆ q (cid:17) ≡ E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:105) in (93) can be computed explicitly. This is the case of Gaussian and binary priors P w (cid:63) with (cid:96) regularization. In particular, they lead surprisingly to the same expression meaning thatchoosing a binary or Gaussian teacher distribution does not affect the ERM performances with (cid:96) regularization. 44 aussian prior Let us compute the corresponding free entropy term with partition functions Z w (cid:63) for a Gaussian prior P w (cid:63) ( w (cid:63) ) = N w (cid:63) (0 , ρ w (cid:63) ) and Z (cid:96) ,λ w for a (cid:96) regularization respectivelygiven by eq. (41) and eq. (47): Z w (cid:63) ( γ, Λ) = e γ ρ w (cid:63) ( Λ ρ w (cid:63) +1 ) √ Λ ρ w (cid:63) + 1 , Z (cid:96) ,λ w ( γ, Λ) = e γ λ ) √ Λ + λ .
The prior free entropy term reads Ψ w (cid:16) ˆ Q, ˆ m, ˆ q (cid:17) = E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z (cid:96) ,λ w (cid:16) ˆ q / ξ, ˆ q + ˆ Q (cid:17)(cid:105) = E ξ Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) ˆ qξ (cid:16) λ + ˆ Q + ˆ q (cid:17) −
12 log (cid:16) λ + ˆ Q + ˆ q (cid:17) = (cid:90) d ξ N ξ (cid:0) , ρ w (cid:63) ˆ m ˆ q − (cid:1) ˆ qξ (cid:16) λ + ˆ Q + ˆ q (cid:17) −
12 log (cid:16) λ + ˆ Q + ˆ q (cid:17) = 12 (cid:18) ˆ q + ρ w (cid:63) ˆ m λ + ˆ Q + ˆ q − log (cid:16) λ + ˆ Q + ˆ q (cid:17)(cid:19) (99)In the Bayes-optimal case for ρ w (cid:63) = 1 , the computation is similar and is given by the aboveexpression with λ = 1 , ˆ Q = 0 , ˆ m = ˆ q : Ψ bayesw (ˆ q ) = = 12 (ˆ q − log (1 + ˆ q )) (100) Binary prior
Let us compute the corresponding free entropy term with partition functions Z w (cid:63) for a binary prior P w (cid:63) ( w (cid:63) ) = ( δ ( w (cid:63) −
1) + δ ( w (cid:63) + 1)) and Z (cid:96) ,λ w for a (cid:96) regularizationrespectively given by eq. (42) and eq. (47): Z w (cid:63) ( γ, Λ) = e − Λ2 cosh( γ ) , Z (cid:96) ,λ w ( γ, Λ) = e γ λ ) √ Λ + λ .
The entropy term Ψ w reads Ψ w (cid:16) ˆ Q, ˆ m, ˆ q (cid:17) = E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z (cid:96) ,λ w (cid:16) ˆ q / ξ, ˆ q + ˆ Q (cid:17)(cid:105) = E ξ Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) ˆ qξ (cid:16) λ + ˆ Q + ˆ q (cid:17) −
12 log (cid:16) λ + ˆ Q + ˆ q (cid:17) = (cid:90) dξ e − ξ √ π e − ˆ m ˆ q − m cosh (cid:16) ˆ m ˆ q − / ξ (cid:17) ˆ qξ (cid:16) λ + ˆ Q + ˆ q (cid:17) −
12 log (cid:16) λ + ˆ Q + ˆ q (cid:17) = 12 (cid:18) ˆ q + ˆ m λ + ˆ Q + ˆ q − log (cid:16) λ + ˆ Q + ˆ q (cid:17)(cid:19) (101)45e recover exactly the same free entropy term than for Gaussian prior teacher eq. (99) for ρ w (cid:63) = 1 . IV.5.2 Updates derivatives
Let’s compute, in full generality, the derivative of the partition functions defined in Sec. 22 andthat will be useful to simplify the set (96). ∂ γ Z w ( γ, Λ) = Z w ( γ, Λ) × E Q w [ w ] = Z w ( γ, Λ) f w ( γ, Λ) ∂ Λ Z w ( γ, Λ) = − Z w ( γ, Λ) × E Q w (cid:2) w (cid:3) = − (cid:0) ∂ γ f w ( γ, Λ) + f ( γ, Λ) (cid:1) ∂ ω Z out ( y, ω, V ) = Z out ( y, ω, V ) × V − E Q out [ z − ω ]= Z out ( y, ω, V ) f out ( y, ω, V ) ∂ V Z out ( y, ω, V ) = 12 Z out ( y, ω, V ) × (cid:0) E Q out (cid:2) V − ( z − ω ) (cid:3) − V − (cid:1) = 12 Z out ( y, ω, V ) (cid:0) ∂ ω f out ( y, ω, V ) + f ( y, ω, V ) (cid:1) (102) IV.5.3 Simplifications of the fixed point equations
We recall the set of fixed point equations eq. (96) ˆ Q = − α∂ Q Ψ out , Q = − ∂ ˆ Q Ψ w ˆ q = − α∂ q Ψ out , q = − ∂ ˆ q Ψ w , ˆ m = α∂ m Ψ out , m = ∂ ˆ m Ψ w , (103)that can be simplified and formulated as functions of Z out (cid:63) , Z w (cid:63) , f out (cid:63) , f w (cid:63) , f out , and f w definedin eq. (25) and eq. (29), using the derivatives in (102).46 quation over ˆ q∂ q Ψ out = ∂ q E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) log Z out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) = E y,ξ [ ∂ q ω (cid:63) ∂ ω Z out (cid:63) log Z out + ∂ q V (cid:63) ∂ V Z out (cid:63) log Z out + Z out (cid:63) Z out ( ∂ q ω∂ ω Z out + ∂ q V ∂ V Z out ) (cid:21) = E y,ξ (cid:20) − m q − / ξf out (cid:63) Z out (cid:63) log Z out + m q − (cid:0) ∂ ω f out (cid:63) + f (cid:63) (cid:1) Z out (cid:63) log Z out + Z out (cid:63) Z out (cid:18) q − / ξf out Z out − (cid:0) ∂ ω f out + f (cid:1) Z out (cid:19)(cid:21) = 12 E y,ξ (cid:2) − m q − ∂ ξ ( f out (cid:63) Z out (cid:63) log Z out ) + m q − (cid:0) ∂ ω f out (cid:63) + f (cid:63) (cid:1) Z out (cid:63) log Z out + (cid:0) ∂ ξ ( f out Z out (cid:63) ) − (cid:0) ∂ ω f out + f (cid:1) Z out (cid:63) (cid:1)(cid:3) (Stein lemma) = 12 E y,ξ (cid:2) − m q − (cid:0) ∂ ω f out (cid:63) log Z out + Z out (cid:63) f (cid:63) log Z out − (cid:0) ∂ ω f out (cid:63) + f (cid:63) (cid:1) Z out (cid:63) log Z out (cid:1)(cid:3) + 12 E y,ξ (cid:2) − mq − Z out (cid:63) f out (cid:63) f out (cid:3) + 12 E y,ξ (cid:2) ∂ ω f out Z out + mq − Z out (cid:63) f out (cid:63) f out − (cid:0) ∂ ω f out + f (cid:1) Z out (cid:63) (cid:3) = − E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) f (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) , (Simplifications with (102))that leads to ˆ q = − α∂ q Ψ out = α E y,ξ (cid:20) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) f out (cid:16) y, q / ξ, Q − q (cid:17) (cid:21) . (104) Equation over ˆ m∂ m Ψ out = E y,ξ (cid:104) ∂ m Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) log Z out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) = E y,ξ [( ∂ m ω (cid:63) ∂ ω Z out (cid:63) + ∂ m V (cid:63) ∂ V Z out (cid:63) ) log Z out ]= E y,ξ (cid:104)(cid:16) q − / ξf out (cid:63) Z out (cid:63) − mq − (cid:0) ∂ ω f out (cid:63) + f (cid:63) (cid:1) Z out (cid:63) (cid:17) log Z out (cid:105) = E y,ξ (cid:2) ∂ ξ ( f out (cid:63) Z out (cid:63) log Z out ) − (cid:0) ∂ ω f out (cid:63) + f (cid:63) (cid:1) Z out (cid:63) log Z out (cid:3) (Stein Lemma) = E y,ξ (cid:2) mq − ( ∂ ω f out (cid:63) Z out (cid:63) log Z out + f out (cid:63) ∂ ω Z out (cid:63) log Z out − (cid:0) ∂ ω f out (cid:63) + f (cid:63) (cid:1) Z out (cid:63) (cid:1) log Z out (cid:3) + E y,ξ [ Z out (cid:63) f out (cid:63) f out ]= E y,ξ (cid:104) Z out (cid:63) ( ., ., . ) f out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) f out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) (Simplifications with (102))47hat leads to ˆ m = α∂ m Ψ out = α E y,ξ (cid:104) Z out (cid:63) ( ., ., . ) f out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) f out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) . 
(105) Equation over ˆ Q∂ Q Ψ out = E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) ∂ Q log Z out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) = E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) ∂ Q V ∂ V log Z out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) = 12 E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) (cid:0) ∂ ω f out + f (cid:1) (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) leading to ˆ Q = − α∂ Q Ψ out = − α E y,ξ (cid:104) Z out (cid:63) (cid:16) y, mq − / ξ, ρ w (cid:63) − mq − m (cid:17) ∂ ω f out (cid:16) y, q / ξ, Q − q (cid:17)(cid:105) − ˆ q . (106) Equation over q∂ ˆ q Ψ w = ∂ ˆ q E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:105) = E ξ (cid:20) ∂ ˆ q ω (cid:63) ∂ ω Z w (cid:63) log Z w + ∂ ˆ q V (cid:63) ∂ V Z w (cid:63) log Z w + Z w (cid:63) Z w ( ∂ ˆ q ω∂ ω Z w + ∂ ˆ q V ∂ V Z w ) (cid:21) = E ξ (cid:20) − ˆ m q − / ξf w (cid:63) Z w (cid:63) log Z w + ˆ m ˆ q − (cid:0) ∂ ω f w (cid:63) + f (cid:63) (cid:1) Z w (cid:63) log Z w + Z w (cid:63) Z w (cid:18)
12 ˆ q − / ξf w Z w − (cid:0) ∂ ω f w + f (cid:1) Z w (cid:19)(cid:21) = E ξ (cid:20) − ˆ m q − / ∂ ξ ( f w (cid:63) Z w (cid:63) log Z w ) + ˆ m ˆ q − (cid:0) ∂ ω f w (cid:63) + f (cid:63) (cid:1) Z w (cid:63) log Z w + (cid:18)
12 ˆ q − / ∂ ξ ( f w Z w (cid:63) ) − (cid:0) ∂ ω f w + f (cid:1) Z w (cid:63) (cid:19)(cid:21) (Stein lemma) = 12 E ξ (cid:2) − ˆ m ˆ q − (cid:0) ∂ ω f w (cid:63) Z w (cid:63) log Z w + Z w (cid:63) f (cid:63) log Z w − (cid:0) ∂ ω f w (cid:63) + f (cid:63) (cid:1) Z w (cid:63) log Z w (cid:1) − ˆ m ˆ q − Z w (cid:63) f w (cid:63) f w + (cid:0) ˆ m ˆ q − Z w (cid:63) f w f w (cid:63) + Z w (cid:63) ∂ ω f w − (cid:0) ∂ ω f w + f (cid:1) Z w (cid:63) (cid:1)(cid:3) = − E ξ (cid:20) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) f w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17) (cid:21) (Simplifications with (102))leading to q = − ∂ ˆ q Ψ w = E ξ (cid:20) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) f w (cid:16) ˆ q / ξ, ˆ q + ˆ Q (cid:17) (cid:21) (107)48 quation over m∂ ˆ m Ψ w = ∂ m E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:105) = E ξ [( ∂ ˆ m ω (cid:63) ∂ ω Z w (cid:63) + ∂ ˆ m V (cid:63) ∂ V Z w (cid:63) ) log Z w ]= E ξ (cid:104)(cid:16) ˆ q − / ξf w (cid:63) Z w (cid:63) − ˆ m ˆ q − (cid:0) ∂ ω f w (cid:63) + f (cid:63) (cid:1) Z w (cid:63) (cid:17) log Z w (cid:105) = E ξ (cid:2) ˆ m ˆ q − ∂ ξ ( f w (cid:63) Z w (cid:63) log Z w ) − (cid:0) ∂ ω f w (cid:63) + f (cid:63) (cid:1) Z w (cid:63) log Z w (cid:3) (Stein Lemma) = E ξ (cid:2) ˆ m ˆ q − (cid:0) ∂ ω f w (cid:63) Z w (cid:63) log Z w + Z w (cid:63) f (cid:63) log Z w − (cid:0) ∂ ω f w (cid:63) + f (cid:63) (cid:1) Z w (cid:63) log Z w (cid:1) + Z w (cid:63) f w (cid:63) f w ]= E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) f w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) f w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:105) (Simplifications with (102))leading to m = 2 ∂ ˆ m Ψ w = E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) f w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) f w (cid:16) ˆ q / ξ, ˆ q + ˆ Q (cid:17)(cid:105) (108) Equation over Q∂ ˆ Q Ψ w (cid:16) ˆ Q, ˆ m, ˆ q (cid:17) = ∂ ˆ Q E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) log Z w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:105) = E ξ (cid:20) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) Z w ∂ ˆ Q Λ ∂ Λ Z w (cid:16) ˆ q / ξ, ˆ Q + ˆ q (cid:17)(cid:21) = − E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) (cid:0) ∂ γ f w + f w (cid:1)(cid:105) (with (102))hence Q = − ∂ ˆ Q Ψ w = E ξ (cid:104) Z w (cid:63) (cid:16) ˆ m ˆ q − / ξ, ˆ m ˆ q − ˆ m (cid:17) ∂ γ f w (cid:16) ˆ q / ξ, ˆ q + ˆ Q (cid:17)(cid:105) + q . (109)49 Applications
In this section, we provide details of the results presented in Sec. 3. In particular as an illustration,we consider a Gaussian teacher ( ρ w (cid:63) = 1 ) with a noiseless sign activation: P out (cid:63) ( y | z ) = δ ( y − sign ( z )) , P w (cid:63) ( w (cid:63) ) = N w (cid:63) (0 , ρ w (cid:63) ) , (110)whose corresponding denoising functions are derived in eq. (39) and eq. (41). Remark V.1.
Note that performances of ERM with (cid:96) regularization for a teacher with Gaussianweights P w (cid:63) ( w ) = N w (0 , or binary weights P w (cid:63) ( w ) = ( δ ( w −
1) + δ ( w + 1)) , will besimilar. Indeed free entropy terms Ψ w eq. (93) for a Gaussian prior (99) and for binary weights (101) are equal in this setting, so do the set of fixed point equations. V.1 Bayes-optimal estimation
Using expressions eq. (39) and eq. (41), corresponding to the teacher model eq. (110), the priorequation eq. (98) can be simplified while the channel one has no analytical expression. Hencethe set of fixed point equations eqs. (100) for the model eq. (110) read q b = ˆ q b q b , ˆ q b = α E y,ξ (cid:20) Z out (cid:63) (cid:16) y, q / ξ, ρ w (cid:63) − q b (cid:17) f out (cid:63) (cid:16) y, q / ξ, ρ w (cid:63) − q b (cid:17) (cid:21) . (111) Large α behaviour Let us derive the large α behaviour of the Bayes-optimal generalizationerror eq. (55) that depends only on the overlap q b solution of eq. (111). q b measures thecorrelation with the ground truth, so we expect that in the limit α → ∞ , q b → . Therefore,we need to extract the behaviour of ˆ q b in eq. (111). Injecting expressions Z out (cid:63) and f out (cid:63) fromeq. (39), we obtain ˆ q b = α E y,ξ (cid:20) Z out (cid:63) (cid:16) y, q / ξ, − q b (cid:17) f out (cid:63) (cid:16) y, q / ξ, − q b (cid:17) (cid:21) = 2 α (cid:90) Dξy N √ qξ (0 , − q b ) (cid:18) (cid:18) √ q b ξ √ − q b ) (cid:19)(cid:19) = 2 π α − q b (cid:90) Dξ e − q b ξ − q b (cid:18) (cid:18) √ q b ξ √ − q b ) (cid:19)(cid:19) , where the last integral can be computed in the limit q b → : (cid:90) Dξ e − q b ξ − q b (cid:18) (cid:18) √ q b ξ √ − q b ) (cid:19)(cid:19) = (cid:90) d ξ − e ξ q b+1)2(1 − q b) √ π (cid:18) (cid:18) √ q b ξ √ − q b ) (cid:19)(cid:19) (cid:39) (cid:90) d ξ − e ξ − q b √ π (cid:18) (cid:18) ξ √ − q b ) (cid:19)(cid:19) = √ − q b √ π (cid:90) d η e − η (cid:16) η √ (cid:17) = c √ π (cid:112) − q b , c ≡ (cid:82) d η e − η (cid:16) η √ (cid:17) (cid:39) . . Finally, we obtain in the large α limit: ˆ q b = k α √ − q b , q b = ˆ q b q b , with k ≡ c π √ π (cid:39) . . The above equations can be solved analytically and lead to: q b = 12 (cid:16) αk (cid:112) α k + 4 − α k (cid:17) (cid:39) α →∞ − α k , ˆ q b = k α , and therefore the Bayes-optimal asymptotic generalization error is given by e bayesg ( α ) = 1 π acos ( √ q b ) (cid:39) α →∞ kπ α (cid:39) . α . (112) V.2 Generalities on ERM with (cid:96) regularization Combining the teacher update for Gaussian weights eq. (41) with the update associated to the (cid:96) regularization eq. (41), the free entropy term can be explicitly derived in (99). Taking thecorresponding derivatives, the fixed point equations for m, q, Σ eq. (96) are thus explicit andsimply read Σ = 1 λ + ˆΣ , q = ρ w (cid:63) ˆ m + ˆ q ( λ + ˆΣ) , m = ρ w (cid:63) ˆ mλ + ˆΣ . (113)All the following examples have been performed with a (cid:96) regularization, so that the aboveequations (113) remain valid for the different losses considered in Sec. 3. In the next subsections,we provide some details on the asymptotic performances of ERM with various losses with (cid:96) regularization and ρ w (cid:63) = 1 .In general for a generic loss, the proximal eq. (29) has no analytical expression, just as thefixed point equations (97). The square loss is particular in the sense eqs. (97) have a closed formsolution. Also the Hinge loss has an analytical proximal. Apart from that, eqs. (97) must besolved numerically. However it is useful to notice that the proximal can be easily found for atwo times differentiable loss using eq. (46). This is for example the case of the logistic loss. V.3 Ridge regression - Square loss with (cid:96) regularization The prior equations over m, q, Σ are already derived in eq. (113) and remain valid. Combiningeq. 
V.3 Ridge regression - Square loss with ℓ2 regularization

The prior equations over m, q, Σ were already derived in eq. (113) and remain valid. Combining eq. (39) for the considered sign channel, with a potential additional Gaussian noise of variance ∆* in (110), and the square loss eq. (43), the channel fixed point equations for q̂, m̂, Σ̂ eqs. (97) lead to

Σ = 1/(λ + Σ̂),   Σ̂ = α/(Σ + 1),
q = (m̂² + q̂)/(λ + Σ̂)²,   q̂ = α(1 + ∆* + q − 2√(2/π) m)/(Σ + 1)²,
m = m̂/(λ + Σ̂),   m̂ = α√(2/π)/(Σ + 1).   (114)

V.3.1 Pseudo-inverse estimator

We analyze the fixed point equations eqs. (114) for the pseudo-inverse estimator, that is, in the limit λ → 0.

Solving for Σ. Combining the first two equations over Σ and Σ̂ in (114), we obtain

Σ = ( √((α + λ − 1)² + 4λ) − α − λ + 1 )/(2λ) ≃ (1 − α + |α − 1|)/(2λ) + (1/2)( (α + 1)/|α − 1| − 1 )   as λ → 0,   (115)

which exhibits two different behaviours depending on whether α < 1 or α > 1.
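Before specializing to the two regimes, note that the full system (114) can simply be iterated numerically at any (α, λ, ∆*). A minimal sketch (the function name, initialization and damping are our own choices):

```python
import numpy as np

def ridge_fixed_point(alpha, lam, delta=0.0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Iterate the channel/prior fixed-point equations (114) for ridge regression."""
    m, q, sigma = 0.1, 0.5, 1.0
    for _ in range(max_iter):
        sigma_hat = alpha / (sigma + 1.0)
        m_hat = alpha * np.sqrt(2.0 / np.pi) / (sigma + 1.0)
        q_hat = alpha * (1.0 + delta + q - 2.0 * np.sqrt(2.0 / np.pi) * m) / (sigma + 1.0) ** 2
        denom = lam + sigma_hat
        m_new, q_new, s_new = m_hat / denom, (m_hat**2 + q_hat) / denom**2, 1.0 / denom
        if max(abs(m_new - m), abs(q_new - q), abs(s_new - sigma)) < tol:
            break
        m = damping * m_new + (1.0 - damping) * m
        q = damping * q_new + (1.0 - damping) * q
        sigma = damping * s_new + (1.0 - damping) * sigma
    e_g = np.arccos(m / np.sqrt(q)) / np.pi  # generalization error for the sign teacher
    return m, q, sigma, e_g
```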
Regime α < 1. In this regime, eq. (115) becomes Σ ≃ (1 − α)/λ + α/(1 − α), which leads to the following closed set of equations in the limit λ → 0:

Σ = ((1 − α)² + αλ)/(λ(1 − α)) ≃ (1 − α)/λ,   Σ̂ = (1 − α)αλ/((1 − α)² + αλ) ≃ λα/(1 − α),
m ≃ α√(2/π),   m̂ ≃ λα√(2/π)/(1 − α),
q ≃ α(π(1 + ∆*) − 2α)/(π(1 − α)),   q̂ ≃ αλ²(2(α² − 2α) + π(1 + ∆*))/(π(1 − α)³).   (116)

Hence we obtain for α < 1:

m_pseudo = α√(2/π),   q_pseudo = α(π(1 + ∆*) − 2α)/(π(1 − α)),   (117)

and the corresponding generalization error

e_g^pseudo(α) = (1/π) acos( √( 2α(1 − α)/(π(1 + ∆*) − 2α) ) )   if α < 1.   (118)

Note in particular that e_g^pseudo(α) → 0.5 as α → 1, meaning that at the interpolation peak α = 1 the pseudo-inverse estimator reaches the maximal generalization error.
Regime α > 1. Eq. (115) becomes

Σ = (1/2)( (α + 1)/(α − 1) − 1 ) = 1/(α − 1).
In the limit λ → 0, the fixed point equations eqs. (114) then reduce to

Σ + 1 = α/(α − 1),   Σ̂ = α − 1,
q = 2/π + q̂/(α − 1)²,   q̂ = ((α − 1)²/α)(1 + ∆* + q − 4/π),
m = √(2/π),   m̂ = (α − 1)√(2/π).   (119)

In particular we obtain for α > 1:

m_pseudo = √(2/π),   q_pseudo = (1/(α − 1))( 1 + ∆* + (2/π)(α − 2) ),   (120)

and the corresponding generalization error

e_g^pseudo(α) = (1/π) acos( √( (α − 1)/( (π/2)(1 + ∆*) + α − 2 ) ) )   if α > 1.   (121)

Large α behaviour. From this expression we easily obtain the large α behaviour of the pseudo-inverse estimator:

e_g^pseudo(α) = (1/π) acos( √( (α − 1)/( (π/2)(1 + ∆*) + α − 2 ) ) ) ≃ c/√α   as α → ∞,

with C ≡ (π/2)(1 + ∆*) − 1 and c ≡ √C/π. In particular, for a noiseless teacher ∆* = 0, c = √(π/2 − 1)/π ≈ 0.24, leading to

e_g^pseudo(α) ≃ 0.24/√α   as α → ∞.   (122)
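Both branches (118) and (121) combine into a single closed-form curve. A minimal sketch (the function name is ours):

```python
import numpy as np

def eg_pseudo_inverse(alpha, delta=0.0):
    """Pseudo-inverse generalization error, eqs. (118) for alpha<1 and (121) for alpha>1."""
    if alpha < 1.0:
        arg = 2.0 * alpha * (1.0 - alpha) / (np.pi * (1.0 + delta) - 2.0 * alpha)
    else:
        arg = (alpha - 1.0) / (np.pi / 2.0 * (1.0 + delta) + alpha - 2.0)
    return np.arccos(np.sqrt(arg)) / np.pi

# sanity checks: maximal error at the interpolation peak, ~0.24/sqrt(alpha) decay
print(eg_pseudo_inverse(0.999))      # -> close to 0.5
print(eg_pseudo_inverse(1e4) * 1e2)  # -> close to 0.24
```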
V.3.2 Ridge at finite λ

Let us now consider the set of fixed point equations eq. (114) at finite λ ≠ 0. Defining

t ≡ √((α + λ − 1)² + 4λ) = √(α² + 2α(λ − 1) + (λ + 1)²),   s ≡ t + α + λ + 1,   D ≡ α² + α(t + 2λ − 2) + (λ + 1)(t + λ + 1),

the solution of (114) can be written in closed form:

Σ = (t − α − λ + 1)/(2λ),   Σ̂ = (t + α − λ − 1)/2,
m = 2√(2/π) α/s,   m̂ = 2√(2/π) αλ/(t − α + λ + 1),
q = 2α( (2α + π(1 + ∆*)) s − 8α )/(π s D),

while q̂ follows from the channel equation in (114), q̂ = α(1 + ∆* + q − 2√(2/π) m)/(Σ + 1)².

Generalization error behaviour at large α. Expanding the ratio m/√q in the large α limit, we obtain m/√q ≃ 1 − C/(2α) with C = (π/2)(1 + ∆*) − 1, leading to

e_g^{ridge,λ}(α) = (1/π) acos(m/√q) ≃ c/√α   as α → ∞, with c = √C/π.   (123)

Thus the asymptotic generalization error of ridge regression with any regularization strength λ ≥ 0 decreases as 0.24/√α (for ∆* = 0), exactly as the pseudo-inverse result.
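In code, the closed-form solution above gives the full curve e_g^{ridge,λ}(α) directly. A minimal sketch (the function name is ours; ∆* = 0 by default):

```python
import numpy as np

def eg_ridge(alpha, lam, delta=0.0):
    """Closed-form ridge generalization error from the finite-lambda solution of (114)."""
    t = np.sqrt((alpha + lam - 1.0) ** 2 + 4.0 * lam)
    s = t + alpha + lam + 1.0
    d = alpha**2 + alpha * (t + 2.0 * lam - 2.0) + (lam + 1.0) * (t + lam + 1.0)
    m = 2.0 * np.sqrt(2.0 / np.pi) * alpha / s
    q = 2.0 * alpha * ((2.0 * alpha + np.pi * (1.0 + delta)) * s - 8.0 * alpha) / (np.pi * s * d)
    return np.arccos(m / np.sqrt(q)) / np.pi

print(eg_ridge(2.0, 1e-9))  # matches the pseudo-inverse value (121) at alpha = 2
```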
Optimal regularization. The optimal value λ_opt(α), introduced in Sec. 3, which minimizes the generalization error at a given α, can be found by taking the derivative of m/√q with respect to λ: it is the root of the functional

F[α, λ, ∆*] ≡ ∂_λ (m/√q).

Using the closed-form expressions above, F is an explicit but lengthy rational function of α, λ and t, which we evaluate numerically rather than reproduce here.
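Equivalently, λ_opt can be located without the explicit derivative by minimizing the closed-form error over λ at fixed α. A minimal sketch (the solver bounds are our own choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def eg_ridge(alpha, lam, delta=0.0):
    # closed-form ridge error derived above
    t = np.sqrt((alpha + lam - 1.0) ** 2 + 4.0 * lam)
    s = t + alpha + lam + 1.0
    d = alpha**2 + alpha * (t + 2.0 * lam - 2.0) + (lam + 1.0) * (t + lam + 1.0)
    m = 2.0 * np.sqrt(2.0 / np.pi) * alpha / s
    q = 2.0 * alpha * ((2.0 * alpha + np.pi * (1.0 + delta)) * s - 8.0 * alpha) / (np.pi * s * d)
    return np.arccos(m / np.sqrt(q)) / np.pi

for alpha in (2.0, 10.0, 100.0):
    res = minimize_scalar(lambda lam: eg_ridge(alpha, lam), bounds=(1e-6, 10.0), method="bounded")
    print(alpha, res.x)  # the minimizer is essentially the same for every alpha
```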
Figure 3: (Left) Absolute value of the derivative of m/√q with respect to λ, plotted on a logarithmic scale for α ∈ {1, 2, 3, 5, 10, 20, 30, 50, 100, 1000}. λ_opt is reached at the root of the functional F[α, λ], which corresponds to the divergence on the logarithmic scale. Over this wide range of α the optimal value is clearly constant and independent of α; its value is approximately λ_opt ≈ 0.5708. (Right) Bayes-optimal (black) vs ridge regression (dashed red) generalization errors with optimal ℓ2 regularization λ_opt.

Unfortunately, this functional cannot be analyzed analytically. Instead, we plot its value as a function of λ for a wide range of α (with ∆* = 0), and we observe that there exists a unique root λ_opt ≈ 0.5708, independent of α, as illustrated in Fig. 3 (left). As an illustration, we show the generalization error of ridge regression with the optimal regularization λ_opt compared to the Bayes-optimal performance in Fig. 3 (right).

V.4 Hinge regression / SVM - Hinge loss with ℓ2 regularization

The hinge loss l_hinge(y, z) = max(0, 1 − yz) is piecewise linear and is therefore another simple example of an analytically tractable loss. In particular, its proximal map can be computed from eq. (44), and the corresponding denoising functions read

f_out(y, √q ξ, Σ) = y if yξ < (1 − Σ)/√q;   (y − √q ξ)/Σ if (1 − Σ)/√q < yξ < 1/√q;   0 otherwise,
∂_ω f_out(y, √q ξ, Σ) = −1/Σ if (1 − Σ)/√q < yξ < 1/√q;   0 otherwise.   (124)

The fixed point equations eq. (97) unfortunately have no closed form and need to be solved numerically.
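A direct transcription of the hinge denoising functions (124), which can be plugged into a numerical solver for eqs. (97). A minimal sketch (the function name is ours; ω stands for √q ξ):

```python
def f_out_hinge(y, omega, sigma):
    """Hinge-loss channel denoiser and its omega-derivative, eq. (124)."""
    if y * omega < 1.0 - sigma:
        return y, 0.0                       # sample beyond the active margin region
    if y * omega < 1.0:
        return (y - omega) / sigma, -1.0 / sigma  # sample on the margin
    return 0.0, 0.0                         # correctly classified, no update
```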
V.4.1 Max-margin estimator

As proven in [34], both the hinge and logistic estimators converge to the max-margin solution in the limit λ → 0 as soon as the data are linearly separable. We start from the fixed point equations for the hinge loss, whose denoising functions (124) are analytical. Taking the λ → 0 limit is non-trivial, and we therefore need to introduce rescaled variables to obtain a closed set of equations. Numerical evidence at finite α shows that the correct scalings are

m̂ = Θ(λ),   q̂ = Θ(λ²),   Σ̂ = Θ(λ),   m = Θ(1),   q = Θ(1),   Σ = Θ(λ^{-1}).

In terms of the rescaled Θ(1) variables, the fixed point equations eq. (97) simplify and become

m = m̂/(1 + Σ̂),   q = (m̂² + q̂)/(1 + Σ̂)²,   Σ = 1/(1 + Σ̂),
m̂ = 2α I_m̂(q, η)/Σ,   q̂ = 2α I_q̂(q, η)/Σ²,   Σ̂ = 2α I_Σ̂(q, η)/Σ,   (125)

with η ≡ m²/q and

I_m̂(q, η) ≡ √η ∫_{−∞}^{1/√q} dξ N_ξ(0, 1) N_{√η ξ}(0, 1 − η) (1 − √q ξ),
I_q̂(q, η) ≡ ∫_{−∞}^{1/√q} dξ N_ξ(0, 1) (1/2)(1 + erf( √η ξ/√(2(1 − η)) )) (1 − √q ξ)²,
I_Σ̂(q, η) ≡ ∫_{−∞}^{1/√q} dξ N_ξ(0, 1) (1/2)(1 + erf( √η ξ/√(2(1 − η)) )),   (126)

all of which reduce to explicit combinations of erf and Gaussian factors.

Large α expansion. Numerically, at large α (and λ → 0), we obtain the following scalings:

q = Θ(α²),   m = Θ(α),   Σ = Θ(1),   q̂ = Θ(1),   m̂ = Θ(α),   Σ̂ = Θ(1).   (127)

Therefore, in order to close the equations, we introduce new variables (c_q, c_η) such that

q ≃ c_q α²,   η = 1 − c_η/α²   as α → ∞.   (128)

In this limit, we can extract the large α behaviour of the integrals I_m̂, I_q̂, I_Σ̂:

I_m̂(q, η) = I∞_m̂(c_q, c_η),   I_q̂(q, η) = I∞_q̂(c_q, c_η)/α,   I_Σ̂(q, η) = I∞_Σ̂(c_q, c_η)/α,   (129)

where I∞_m̂, I∞_q̂, I∞_Σ̂ are Θ(1) and are obtained from (126) by the change of variables ξ = u/α:

I∞_m̂(c_q, c_η) = (1/√(2π)) ∫_{−∞}^{1/√c_q} du N_u(0, c_η) (1 − √c_q u),
I∞_q̂(c_q, c_η) = (1/√(2π)) ∫_{−∞}^{1/√c_q} du Φ(u/√c_η) (1 − √c_q u)²,
I∞_Σ̂(c_q, c_η) = (1/√(2π)) ∫_{−∞}^{1/√c_q} du Φ(u/√c_η),   (130)

with Φ the standard Gaussian cdf. Hence the set of fixed point equations eq. (125) simplifies to

Σ̂ = 2I∞_Σ̂/(1 − 2I∞_Σ̂),   Σ = 1 − 2I∞_Σ̂,
m̂ = 2α I∞_m̂/(1 − 2I∞_Σ̂),   m = 2α I∞_m̂,
q̂ = 2I∞_q̂/(1 − 2I∞_Σ̂)²,   q = 4α²(I∞_m̂)² + 2I∞_q̂,   (131)

which can be closed by rewriting eqs. (128):

η = m²/q = 1 − c_η/α² = 1 − I∞_q̂/(2(I∞_m̂)² α²),   q = c_q α² ≃ 4α²(I∞_m̂)².   (132)

Equivalently, (c*_q, c*_η) is the root of the set of non-linear equations (F_η, F_q):

F_η(c_q, c_η) ≡ I∞_q̂/(2(I∞_m̂)²) − c_η,   F_q(c_q, c_η) ≡ 4(I∞_m̂)² − c_q,   (133)

which cannot be solved analytically. However, a unique numerical solution (c*_q, c*_η) is found, and the generalization error of the max-margin estimator in the large α regime is therefore given by

e_g^max-margin(α) = (1/π) arccos(m/√q) ≃ (1/π) arccos( √(1 − c*_η/α²) ) ≃ K/α   as α → ∞,   (134)

with K = √(c*_η)/π ≈ 0.50, leading to

e_g^max-margin(α) ≃ 0.50/α   as α → ∞.   (135)
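These asymptotics can be checked against finite-size simulations: a weakly regularized linear SVM approximates the max-margin estimator, and for the sign teacher the generalization error of any linear classifier is the normalized angle between ŵ and w*. A minimal sketch (our parameter choices, using scikit-learn's LinearSVC):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d = 200
for alpha in (2, 5, 10):
    n = int(alpha * d)
    w_star = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = np.sign(X @ w_star / np.sqrt(d))
    # large C ~ weak regularization ~ max-margin on separable data
    clf = LinearSVC(C=1e4, loss="hinge", fit_intercept=False, max_iter=50_000).fit(X, y)
    w_hat = clf.coef_.ravel()
    overlap = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star))
    print(alpha, np.arccos(overlap) / np.pi)  # compare with ~0.50/alpha at large alpha
```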
V.5 Logistic regression

The logistic loss derives from the cross-entropy loss l(y, z) = −y log σ(z) − (1 − y) log(1 − σ(z)), with σ the sigmoid activation function, and simplifies for binary labels y = ±1 to l_logistic(y, z) = log(1 + exp(−yz)), with first two derivatives

∂_z l_logistic(y, z) = −y/(e^{zy} + 1),   ∂²_z l_logistic(y, z) = y²/(4 cosh²(yz/2)) = 1/(4 cosh²(z/2)).

Its proximal is not analytical, but it can be written as the solution of the implicit equation (45), providing the corresponding denoising functions (46). Solving the fixed point equations (97), we obtain performances that approach the Bayes-optimal baseline very closely, as illustrated in Fig. 4 (left).

Figure 4: (Left) Logistic regression - Generalization error as a function of α for different regularization strengths λ ∈ {10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10²}, together with the max-margin (λ → 0) limit. As λ decreases, the generalization error approaches the Bayes-optimal error (black line) very closely; the difference with the Bayes error is shown in the inset. Logistic regression flirts with the Bayes error but never achieves it exactly. The asymptotic prediction is compared with numerical logistic regression in dimension d = 10³, averaged over n_s = 20 samples and performed with the default method LogisticRegression of the scikit-learn package [33]. (Right) Rectangle door teacher with κ ≈ 0.6745: Bayes-optimal generalization error (black) compared with the asymptotic generalization performance of ℓ2 logistic regression with λ = 10^{-2} (dashed yellow line) and numerical ERM (crosses).
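Because the logistic proximal has no closed form, eq. (45) is solved numerically; since the loss is twice differentiable, a few Newton steps on the scalar problem suffice, in the spirit of eq. (46). A minimal sketch (the function name and iteration count are ours):

```python
import numpy as np

def prox_logistic(y, omega, V, steps=50):
    """Proximal map x = argmin_x log(1+exp(-y*x)) + (x-omega)^2/(2V), via Newton."""
    x = omega
    for _ in range(steps):
        grad = -y / (1.0 + np.exp(y * x)) + (x - omega) / V
        hess = 1.0 / (4.0 * np.cosh(y * x / 2.0) ** 2) + 1.0 / V
        x -= grad / hess
    return x

# channel denoiser: f_out = (prox - omega) / V
y, omega, V = 1.0, -0.3, 2.0
x_star = prox_logistic(y, omega, V)
print((x_star - omega) / V)
```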
V.6 Logistic with non-linearly separable data - A rectangle door teacher

The analysis of ERM on the linearly separable dataset generated by (110) revealed that logistic regression with ℓ2 regularization approaches the Bayes-optimal error very closely. It is therefore natural to investigate whether logistic regression performs as well on a more complicated, non-linearly separable dataset obtained with a rectangle door channel

y = sign( |X w*/√d| − κ ).   (136)

This channel was already considered in [10], and we fix the width of the door to κ ≈ 0.6745 so as to obtain labels ±1 with probability 1/2. We then compare the ERM performance of logistic regression with ℓ2 regularization to the Bayes-optimal performance given by (111), with the denoising functions derived in eq. (40). We show the comparison in Fig. 4 (right) for a single arbitrary hyper-parameter λ = 10^{-2}, as results are similar for any regularization. As one might expect, logistic regression is not able to reach the Bayes-optimal generalization error. Both the Bayes-optimal and ERM performances are stuck in the symmetric fixed point m = 0 up to a threshold α_it; above this threshold the symmetric fixed point becomes unstable and the Bayes error decreases to zero in the α → ∞ limit, while logistic regression with any λ remains stuck at its maximal generalization error. In this non-linearly separable case, logistic regression thus largely underperforms the Bayes-optimal estimator.
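For reference, generating the dataset (136) used in this comparison is a few lines of code; a minimal sketch (our parameter choices, with κ set to the median of |z| for z ~ N(0, 1) so that the labels are balanced):

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 500, 3.0
n = int(alpha * d)
kappa = 0.6745  # median of |z|, z ~ N(0,1): P(|z| > kappa) = 1/2

w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
z = X @ w_star / np.sqrt(d)
y = np.sign(np.abs(z) - kappa)  # rectangle door channel, eq. (136)
print(y.mean())                 # close to 0: balanced labels
```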
VI Reaching Bayes optimality

In this section, we propose a derivation, inspired by [24, 37-39, 51, 52, 56-59], of the fine-tuned loss and regularizer (17) discussed in Sec. 4. We assume that the dataset is generated by a teacher (18) such that Z_out*(·, ω, ·) and Z_w*(γ, ·) are respectively log-concave in ω and γ. The derivation is based on the GAMP algorithm, introduced in [30] for the model eq. (1), which we start by recalling.
VI.1 Generalized Approximate Message Passing (GAMP) algorithm

The GAMP algorithm can be written as the following set of iterative equations, which depend on the update functions (23):

ŵ^{t+1} = f_w(γ^t, Λ^t),   ĉ_w^{t+1} = ∂_γ f_w(γ^t, Λ^t),   f_out^t = f_out(y, ω^t, V^t),
Λ_i^t = −(1/d) Σ_{μ=1}^n X_{μi}² ∂_ω f_out,μ^t,   γ_i^t = (1/√d) Σ_{μ=1}^n X_{μi} f_out,μ^t + Λ_i^t ŵ_i^t,
V_μ^t = (1/d) Σ_{i=1}^d X_{μi}² ĉ_w,i^t,   ω_μ^t = (1/√d) Σ_{i=1}^d X_{μi} ŵ_i^t − V_μ^t f_out,μ^{t−1}.   (137)

It has been proven in [53] that the GAMP algorithm with the Bayes-optimal update functions f_w = f_w* and f_out = f_out* (25) converges to the Bayes-optimal performance in the large size limit. Yet the GAMP denoising functions are generic and can be chosen at will, depending on the statistical estimation method. In particular, we may choose the denoising functions for Bayes-optimal estimation (25) or the ones corresponding to ERM estimation (29):

f_w^bayes(γ, Λ) = ∂_γ log Z_w*(γ, Λ),
f_out^bayes(y, ω, V) = ∂_ω log Z_out*(y, ω, V),
f_w^{erm,r}(γ, Λ) = Λ^{-1}γ − Λ^{-1} ∂_{Λ^{-1}γ} M_{Λ^{-1}}[r(·)](Λ^{-1}γ),
f_out^{erm,l}(y, ω, V) = −∂_ω M_V[l(y, ·)](ω),   (138)

whose corresponding GAMP algorithms (137) will potentially reach different fixed points, and thus different performances. Since GAMP with the Bayes-optimal updates provably leads to the optimal generalization error, for ERM to match this performance it is sufficient to enforce that at each time step t the Bayes-optimal and ERM denoising functions coincide, f^bayes = f^erm. Enforcing these two constraints leads to the expressions of the optimal loss l_opt and regularizer r_opt for which ERM matches the Bayes-optimal performance.
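A minimal sketch of (137) for the noiseless sign teacher with Gaussian weights, using the Bayes-optimal updates (25); the initialization and the absence of damping are our own simplifications, and the Gaussian-cdf ratio below can be numerically fragile for very negative arguments:

```python
import numpy as np
from scipy.stats import norm

def gamp_sign_gaussian(X, y, rho=1.0, n_iter=100):
    """GAMP iterations (137) with Bayes-optimal denoisers for y = sign(X w*/sqrt(d))."""
    n, d = X.shape
    Xs = X / np.sqrt(d)
    X2 = Xs**2
    w_hat, c_w, f_out = np.zeros(d), rho * np.ones(d), np.zeros(n)
    for _ in range(n_iter):
        V = X2 @ c_w
        omega = Xs @ w_hat - V * f_out          # Onsager-corrected mean
        z = y * omega / np.sqrt(V)
        ratio = norm.pdf(z) / norm.cdf(z)
        f_out = y * ratio / np.sqrt(V)          # f_out* = d/d_omega log Phi(y omega/sqrt(V))
        df_out = -ratio * (z + ratio) / V
        Lam = -(X2.T @ df_out)
        gam = Xs.T @ f_out + Lam * w_hat
        # Gaussian-prior denoiser: posterior mean/variance under N(0, rho)
        w_hat = gam / (Lam + 1.0 / rho)
        c_w = 1.0 / (Lam + 1.0 / rho)
    return w_hat
```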
VI.2 Matching Bayes-optimal and ERM performances

Imposing the equality of the channel updates, we obtain

f_out^bayes(y, ω, V) = f_out^{erm,l}(y, ω, V) ⇔ ∂_ω log Z_out*(y, ω, V) = −∂_ω M_V[l_opt(y, ·)](ω),

so that

M_V[log Z_out*(y, ·, V)](ω) = M_V[ −M_V[l_opt(y, ·)] ](ω) = −l_opt(y, ω),

where we inverted the Moreau-Yosida regularization in the last equality, which is valid as long as Z_out*(y, ω, V) is log-concave in ω (see [39] for a derivation). We finally obtain

l_opt(y, z) = −M_V[log Z_out*(y, ·, V)](z) = −min_ω [ (z − ω)²/(2V) + log Z_out*(y, ω, V) ].   (139)
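For the sign channel with additional Gaussian noise of variance ∆*, log Z_out*(y, ω, V) = log Φ(yω/√(V + ∆*)), and (139) reduces to a scalar minimization. A hedged sketch (we keep ∆* > 0, which keeps the inner problem bounded below; the bounds and solver are our own choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def l_opt_sign(y, z, V, delta=0.1):
    """Optimal loss (139) for the sign channel with noise variance delta."""
    obj = lambda w: (z - w) ** 2 / (2.0 * V) + norm.logcdf(y * w / np.sqrt(V + delta))
    return -minimize_scalar(obj, bounds=(-30.0, 30.0), method="bounded").fun

print(l_opt_sign(1.0, -0.5, 0.3), l_opt_sign(1.0, 0.5, 0.3))
```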
Let us now perform the same computation for the prior updates. First, we introduce a rescaled denoising distribution:

Q̃_w*(w; γ, Λ) ≡ P_w*(w) e^{−(Λ/2)(w − Λ^{-1}γ)²} / Z̃_w*(γ, Λ),   log Z̃_w*(γ, Λ) = log Z_w*(γ, Λ) − (1/2) Λ^{-1} γ²,   (140)

so that the prior updates read

f_w^bayes(γ, Λ) = ∂_γ log Z_w*(γ, Λ) = Λ^{-1}γ + Λ^{-1} ∂_{Λ^{-1}γ} log Z̃_w*,
f_w^{erm,r}(γ, Λ) = P_{Λ^{-1}}[r](Λ^{-1}γ) = Λ^{-1}γ − Λ^{-1} ∂_{Λ^{-1}γ} M_{Λ^{-1}}[r](Λ^{-1}γ).   (141)

Imposing the equality of the Bayes-optimal and ERM prior updates,

f_w^bayes(γ, Λ) = f_w^{erm,r}(γ, Λ) ⇔ ∂_{Λ^{-1}γ} log Z̃_w* = −∂_{Λ^{-1}γ} M_{Λ^{-1}}[r_opt](Λ^{-1}γ),   (142)

and assuming that Z_w*(γ, Λ) is log-concave in γ, we may again invert the Moreau-Yosida regularization, which leads to

r_opt(w) = −M_{Λ^{-1}}[ log Z̃_w*(·, Λ) ](w) = −min_γ [ (Λ/2)(w − Λ^{-1}γ)² + log Z̃_w*(γ, Λ) ] = −min_γ [ (Λ/2) w² − γw + log Z_w*(γ, Λ) ].   (143)
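As a sanity check of (143), the Gaussian weight prior can be worked out in closed form. A short computation (ours), for P_w*(w) = N_w(0, ρ_w*):

Z_w*(γ, Λ) = ∫ dw N_w(0, ρ_w*) e^{−Λw²/2 + γw} = e^{ ρ_w* γ²/(2(1 + ρ_w* Λ)) } / √(1 + ρ_w* Λ),

so the minimizer in (143) is γ* = w(1 + ρ_w* Λ)/ρ_w*, and

r_opt(w) = w²/(2ρ_w*) + (1/2) log(1 + ρ_w* Λ),

i.e., up to an additive constant independent of w, an ℓ2 regularizer with strength 1/ρ_w*.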
State evolution of GAMP. The last step is to characterize the variances V and Λ involved in (139) and (143), which are so far undetermined. To achieve the Bayes-optimal performance, we need to use the variances V and Λ solving the Bayes-optimal GAMP algorithm (137). In the large size limit, these quantities concentrate, and their expectations over the ground truth w* and the input data X are given by the state evolution of GAMP [53]:

E_{w*,X}[V] = ρ_w* − q_b,   E_{w*,X}[Λ] = q̂_b,   (144)

where q_b and q̂_b are the solutions of the Bayes-optimal set of fixed point equations eq. (13).
VI.3 Summary and numerical evidence

Choosing the fine-tuned (potentially non-convex, depending on Z_out* and Z_w*) loss and regularizer

l_opt(y, z) = −min_ω [ (z − ω)²/(2(ρ_w* − q_b)) + log Z_out*(y, ω, ρ_w* − q_b) ],
r_opt(w) = −min_γ [ (q̂_b/2) w² − γw + log Z_w*(γ, q̂_b) ],   (145)

where q_b and q̂_b are the solutions of the Bayes-optimal set of fixed point equations eq. (13), we showed that ERM provably matches the Bayes-optimal performance. In particular, we illustrated the behaviour of the optimal loss l_opt and regularizer r_opt for the model (2) in Fig. 2 of the main text. Note that even though the loss l_opt is not convex (it appears to be quasi-convex), the numerical simulations of ERM with (145) (black dots) presented in Fig. 5 show that ERM indeed achieves the Bayes-optimal performance (black line), even at finite dimension.

Figure 5: Generalization error obtained by minimizing the optimal loss l_opt with the optimal regularizer r_opt for the model (2), compared with ℓ2 logistic regression at various λ and with the Bayes-optimal performance. The numerics were performed with scipy.optimize.minimize (L-BFGS-B solver) for d = 10³, averaged over n_s = 10 instances.
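A minimal sketch of this experiment (our implementation choices: the state-evolution quantity ρ_w* − q_b is passed in as a fixed constant V rather than solved for, a small channel noise ∆* keeps the inner minimization of (139) bounded, the loss is tabulated on a grid, and r_opt is taken quadratic as derived above for Gaussian weights):

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
d, alpha, V, delta = 50, 3.0, 0.3, 0.1   # V stands in for rho_w* - q_b
n = int(alpha * d)
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star / np.sqrt(d))

def l_opt(y_, z):
    obj = lambda w: (z - w) ** 2 / (2 * V) + norm.logcdf(y_ * w / np.sqrt(V + delta))
    return -minimize_scalar(obj, bounds=(-30, 30), method="bounded").fun

zs = np.linspace(-8, 8, 400)  # tabulate the loss once, then interpolate
tab = {s: np.array([l_opt(s, z) for z in zs]) for s in (-1.0, 1.0)}

def risk(w):
    z = X @ w / np.sqrt(d)
    data = sum(np.interp(zi, zs, tab[yi]) for zi, yi in zip(z, y))
    return data + 0.5 * w @ w  # quadratic r_opt for Gaussian weights

res = minimize(risk, rng.standard_normal(d), method="L-BFGS-B")
w_hat = res.x
overlap = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star))
print(np.arccos(overlap) / np.pi)  # generalization error estimate
```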