Asymptotic Errors for Teacher-Student Convex Generalized Linear Models (or : How to Prove Kabashima's Replica Formula)
Cédric Gerbelot, Alia Abbara and Florent Krzakala
Laboratoire de Physique de l'École Normale Supérieure, PSL University, CNRS, Sorbonne Universités, Paris, France.
[email protected], [email protected], [email protected]
July 2, 2020
Abstract
There has been a recent surge of interest in the study of asymptotic reconstruction performance in various cases of generalized linear estimation problems in the teacher-student setting, especially for the case of i.i.d. standard normal matrices. In this work, we prove a general analytical formula for the reconstruction performance of convex generalized linear models, and go beyond such matrices by considering all rotationally-invariant data matrices with arbitrary bounded spectrum, proving a decade-old conjecture originally derived using the replica method from statistical physics. This is achieved by leveraging state-of-the-art advances in message passing algorithms and the statistical properties of their iterates. Our proof is crucially based on the construction of converging sequences of an oracle multi-layer vector approximate message passing algorithm, where the convergence analysis is done by checking the stability of an equivalent dynamical system. Beyond its generality, our result also provides further insight into overparametrized non-linear models, a fundamental building block of modern machine learning. We illustrate our claim with numerical examples on mainstream learning methods such as logistic regression and linear support vector classifiers, showing excellent agreement between moderate-size simulations and the asymptotic prediction.
In the modern era of statistics and machine learning, data analysis often requires solving high-dimensional estimation problems with a very large number of parameters. Developing algorithms for this task and understanding their limitations have become a major challenge. In this paper, we consider this question in the framework of supervised learning under the teacher-student point of view: (i) data is synthetic and labels are generated by a "teacher" rule and (ii) training is done with a convex Generalized Linear Model (GLM) with a convex penalty term. Such problems are ubiquitous in machine learning, statistics, communications, and signal processing.
The study of asymptotic (i.e. large-dimensional) reconstruction performance of generalized linear estimation in the teacher-student setting has been the subject of a significant body of work over the past few decades [SST92, WRB93, EVdB01, BM11b, EKBB+13, DM16, ZK16], and is currently witnessing a renewal of interest, especially for the case of identically and independently distributed (i.i.d.) standard normal data matrices, see e.g. [SCC19, HMRT19, MM19]. Our aim in this paper is to provide a general analytical formula describing the reconstruction performance of such convex generalized linear models, but for a broader class of more adaptable matrices.
The problem is defined as follows: we aim at reconstructing a given i.i.d. weight vector x ∈ R^N from outputs y ∈ R^M generated using a training set (f_µ)_{µ=1,...,M} and the "teacher" rule:
y = φ(Fx + ω)   (1)
where φ is a proper, closed, convex and separable function, and ω ∼ N(0, ∆ Id) is an i.i.d. noise vector. To go beyond the simple Gaussian i.i.d. case tackled in a majority of theoretical work, we shall allow matrices of arbitrary spectrum. We consider the data matrix F ∈ R^{M×N}, obtained by concatenating the vectors of the training set, to be rotationally invariant: its singular value decomposition reads F = UDV^T where U ∈ R^{M×M}, V ∈ R^{N×N} are uniformly sampled from the orthogonal groups O(M) and O(N), respectively. D ∈ R^{M×N} contains the singular values of F. Our analysis encompasses any singular value distribution with compact support. Crucially, we place ourselves in the so-called high-dimensional regime, so that M, N → ∞ while the ratio α ≡ M/N is kept finite. Our goal is to study the reconstruction performance of the generalized linear estimation method:
x̂ ∈ argmin_{x ∈ R^N} { g(Fx, y) + f(x) }   (2)
where g and f are proper, closed, convex and separable functions. This type of procedure counts among the fundamental building blocks of modern machine learning, and encompasses several mainstream methods such as logistic regression, LASSO or linear support vector machines. Focusing on the asymptotic limit, we are interested in assessing the reconstruction performance: for instance through the mean squared error E ≡ E[‖x − x̂‖²]/N or the overlap m_x = x̂^T x/N.
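To make the setting concrete, the following minimal Python sketch (an illustration, not the authors' code) generates data from the teacher rule (1) with a rotationally invariant matrix built from Haar-distributed singular vectors, fits the GLM (2) with a logistic loss and ℓ2 penalty, and measures the overlap and angle with the teacher vector. The sign non-linearity, the uniform singular-value density and all parameter values are illustrative choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, M = 200, 400                      # aspect ratio alpha = M/N = 2, moderate size

# Rotationally invariant data matrix F = U D V^T with Haar-distributed U, V
U = ortho_group.rvs(M, random_state=0)
V = ortho_group.rvs(N, random_state=1)
d = rng.uniform(0.5, 1.5, size=min(M, N))        # illustrative bounded singular values
D = np.zeros((M, N))
D[:min(M, N), :min(M, N)] = np.diag(d)
F = U @ D @ V.T

# Teacher rule y = phi(Fx + w), here a noisy sign (binary classification)
x = rng.standard_normal(N)
w = np.sqrt(0.01) * rng.standard_normal(M)
y = np.sign(F @ x + w)

# Student: logistic regression with a small l2 penalty, an instance of problem (2)
clf = LogisticRegression(C=1e4, fit_intercept=False, max_iter=10_000).fit(F, y)
x_hat = clf.coef_.ravel()

m_x = x_hat @ x / N                               # overlap
q_x, rho_x = x_hat @ x_hat / N, x @ x / N         # squared norms
theta = np.arccos(m_x / np.sqrt(rho_x * q_x))     # angle with the teacher vector
print(f"overlap m_x = {m_x:.3f}, angle theta = {theta:.3f} rad")
```

At such moderate sizes, the measured angle can already be compared with the asymptotic prediction of Theorem 1, as done in Figure 1.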
Main contributions — Our main contributions are the following:
• We provide an analytical formula for the reconstruction error of problem (2) with data generated by (1) in the asymptotic setup, for all convex losses and penalties (including for instance logistic, hinge, LASSO and elastic net), and for all rotationally invariant sequences of matrices F.
• By doing so, we give a mathematically rigorous proof of a replica formula obtained heuristically through statistical physics for this problem, notably by Y. Kabashima [Kab08]. This is a significant step beyond the setting of most rigorous work on replica results, which assumes i.i.d. random Gaussian matrices.
• Our proof method has an interest of its own, and builds on a detailed mapping between alternating-direction descent methods [BPC+11] from convex optimization and a set of algorithms called multi-layer approximate message-passing algorithms [MKMZ17, SRF16]. In particular, it crucially uses ideas from [BM11b] and [FRS18]. Our work also builds on recent results on penalized linear regression [GAK20] and generalizes them to the more generic non-linear case.
Related work —
The simplest case of the present question, when both f and g are quadratic functions, can be mapped to a random matrix theory problem and solved rigorously, as in e.g. [HMRT19]. Handling non-linearity is, however, more challenging. A long history of research tackles this difficulty in the high-dimensional limit, especially in the statistical physics literature where the asymptotic limit focus is common. The usual analytical approach in statistical physics of learning [SST92, WRB93, EVdB01] is a heuristic, non-rigorous but very adaptable technique called the replica method [MPV87, MM09]. In particular, it has been applied to many variations of the present problem, and built the foundation of a large number of deep, non-trivial results in machine learning, signal processing and statistics, e.g. [GD89, OKKN90, OK96, Bie03, KWT09, GS10, AG16, Mit19, ESAP+20]. Notably, the replica prediction for the present problem with rotationally invariant matrices was derived by Kabashima [Kab08].
Proving the validity of a replica prediction is a difficult task altogether. There has been recent progress, however, in the particular case of Gaussian data, where the matrix F is simply made of i.i.d. standard Gaussian coefficients. In this case, the asymptotic performance of the LASSO was rigorously derived in [BM11a], and the existence of the logistic estimator discussed in [SCC19]. A set of papers managed to extend this study to a large set of convex losses g, using the so-called Gordon comparison theorem [TAH18]. We significantly broaden those results here by proving the Kabashima formula, which is valid beyond Gaussian matrices, since it holds for any rotationally invariant matrix, and for any convex and separable loss g and regularization f.
Our proof strategy is based on the use of approximate message passing [DMM09, Ran11], as pioneered in [BM11b], and is similar to a recent work [GAK20] on a simpler setting. This family of algorithms is a statistical physics-inspired variant of belief propagation [Méz89, Kab03, KU04] where local beliefs are approximated by Gaussian distributions. A key feature of these algorithms is the existence of the state evolution equations, an iterative scalar equivalent model which allows one to track the asymptotic statistical properties of the iterates at every time step. A series of groundbreaking papers initiated with [BM11a] proved that these equations are exact in the large system limit, and remarkably extended the method to treat nonlinear problems [Ran11] and handle rotationally invariant matrices [RSF19, TK20]. We shall use a variant of these algorithms called multi-layer vector approximate message passing (MLVAMP) [SRF16, FRS18].
The key technical point in our approach is the proof of convergence of an oracle version of MLVAMP. This is achieved by phrasing our oracle algorithm as a dynamical system similar to ADMM [BPC+11, NLR+15, LRP16].
Our main result is a rigorous analytical formula for the reconstruction error obtained by the empirical risk minimization of (2) with data generated by (1). We start by stating the necessary assumptions.
Assumptions 1.
Consider the minimization problem (2) with f and g proper closed, convex andseparable functions, and F ∈ R M × N an orthogonally invariant matrix. Consider that the empiricaldistributions of the underlying truth x and eigenvalues of F T F respectively converge with second ordermoments, as defined in appendix A, to given distributions p x and p λ . Assume that the distribution p λ is non-trivial and has compact support. Finally consider the limit M, N → ∞ with fixed ratio α = M/N . Additional detail on the set of assumptions is given in appendix A. This analysis framework ismainly due to [BM11a], and roughly amounts to non-diverging input quantities and well-behavedupdate functions, which applies to the presently studied setting.3 heorem 1 (Fixed point equations) . Let ˆx be the estimator of x defined in (2) , and ˆz = Fˆx theestimator of z = Fx . Ground-truth vectors have norms ρ x ≡ (cid:107) x (cid:107) /N , ρ z ≡ (cid:107) z (cid:107) /M . Under theset of assumptions 1, the squared norms and the overlap of these estimators with the ground-truth aregiven by: m ∗ x ≡ lim N →∞ ˆx T x N = E (cid:104) x η f/ ˆ Q ∗ x ( X ) (cid:105) m ∗ z ≡ lim M →∞ ˆz T z M = E (cid:104) z η g ( .,y ) / ˆ Q ∗ z ( Z ) (cid:105) (3) q ∗ x ≡ lim N →∞ (cid:107) ˆx (cid:107) N = E (cid:104) η f/ ˆ Q ∗ x ( X ) (cid:105) q ∗ z ≡ lim N →∞ (cid:107) ˆz (cid:107) N = E (cid:104) η g ( .,y ) / ˆ Q ∗ z ( Z ) (cid:105) (4) where X = ˆ m ∗ x x + √ ˆ χ ∗ x ξ x ˆ Q x , Z = ˆ m ∗ z z + √ ˆ χ ∗ z ξ z ˆ Q z . Expectations are taken with respect to the randomvariables x ∼ p x , z ∼ N (0 , √ ρ z ) , y ∼ φ ( z + ω ) , ξ x , ξ z ∼ N (0 , , and eigenvalues λ ∼ p λ . η isa shorthand for the scalar proximal operator: η γf ( z ) = arg min x ∈X (cid:26) γf ( x ) + 12 ( x − z ) (cid:27) (5) and parameters ˆ Q ∗ x , ˆ Q ∗ z , ˆ m ∗ x , ˆ m ∗ z , ˆ χ ∗ x , ˆ χ ∗ z are given by the fixed point of the system: ˆ Q x = ˆ Q x ( E (cid:104) η (cid:48) f/ ˆ Q x ( X ) (cid:105) − − Q z = ˆ Q z ( E (cid:104) η (cid:48) g ( .,y ) / ˆ Q z ( Z ) (cid:105) − − m x = E (cid:104) x η f/ ˆ Q x ( X ) (cid:105) ρ x χ x − ˆ m x ˆ m z = E (cid:104) z η g ( .,y ) / ˆ Q z ( Z ) (cid:105) ρ z χ z − ˆ m z ˆ χ x = E (cid:104) η f/ ˆ Q x ( X ) (cid:105) χ x − ρ x ( ˆ m x + ˆ m x ) − ˆ χ x ˆ χ z = E (cid:104) η g ( .,y ) / ˆ Q z ( Z ) (cid:105) χ z − ρ z ( ˆ m z + ˆ m z ) − ˆ χ z ˆ Q x = E (cid:20) Q x + λ ˆ Q z (cid:21) − − ˆ Q x (6a) ˆ Q z = α E (cid:20) λ ˆ Q x + λ ˆ Q z (cid:21) − − ˆ Q z (6b) ˆ m x = 1 χ x E (cid:20) ˆ m x + λ ˆ m z ˆ Q x + λ ˆ Q z (cid:21) − ˆ m x (6c) ˆ m z = ρ x αχ z ρ z E (cid:20) λ ( ˆ m x + λ ˆ m z )ˆ Q x + λ ˆ Q z (cid:21) − ˆ m z (6d) ˆ χ x = 1 χ x E (cid:34) ˆ χ x + λ ˆ χ z + ρ x ( ˆ m x + λ ˆ m z ) ( ˆ Q x + λ ˆ Q z ) (cid:35) (6e) − ρ x ( ˆ m x + ˆ m x ) − ˆ χ x ˆ χ z = 1 αχ z E (cid:34) λ ( ˆ χ x + λ ˆ χ z + ρ x ( ˆ m x + λ ˆ m z ) )( ˆ Q x + λ ˆ Q z ) (cid:35) (6f) − ρ z ( ˆ m z + ˆ m z ) − ˆ χ z , where χ x = ( ˆ Q x + ˆ Q x ) − , and χ z = ( ˆ Q z + ˆ Q z ) − . This set of fixed point equations naturally stems from the "replica-symmetric" free energy notationcommonly used in the statistical physics community [MPV87, MM09], which we can now rigorouslyestablish as the following corollary to Theorem 1 :4 orollary 1 (The Kabashima formula) . Theorem 1 can equivalently be rewritten as the solution ofthe extreme value problem defined by the replica free energy from [TK20]. 
f = − extr m x ,χ x ,q x ,m z ,χ z ,q z { g F + g G − g S } , (7) g F = extr ˆ m x , ˆ χ x , ˆ Q x , ˆ m z , ˆ χ z , ˆ Q z (cid:26) q x ˆ Q x − χ x ˆ χ x − ˆ m x m x − α ˆ m z m z + α (cid:16) q z ˆ Q z − χ z ˆ χ z (cid:17) + E (cid:104) φ x ( ˆ m x , ˆ Q x , ˆ χ x ; x , ξ x ) (cid:105) + α E (cid:104) φ z ( ˆ m z , ˆ Q z , ˆ χ z ; z , ξ z ) (cid:105)(cid:111) ,g G = extr ˆ m x , ˆ χ x , ˆ Q x , ˆ m z , ˆ χ z , ˆ Q z (cid:26) q x ˆ Q x − χ x ˆ χ x − m x ˆ m x − αm z ˆ m z + α (cid:16) q z ˆ Q z − χ z ˆ χ z (cid:17) − (cid:18) E (cid:104) log( ˆ Q x + λ ˆ Q z ) (cid:105) − E (cid:20) ˆ χ x + λ ˆ χ z ˆ Q x + λ ˆ Q z (cid:21) − E (cid:34) ρ x ( ˆ m x + λ ˆ m z ) ( ˆ Q x + λ ˆ Q z ) (cid:35)(cid:33)(cid:41) ,g S = 12 (cid:18) q x χ x − m x ρ x χ x (cid:19) + α (cid:18) q z χ z − m z ρ z χ z (cid:19) , where φ x and φ z are the potential functions φ x ( ˆ m x , ˆ Q x , ˆ χ x ; x , ξ x ) = lim β →∞ β log (cid:90) e − β ˆ Q x x + β ( ˆ m x x + √ ˆ χ x ξ x ) x − βf ( x ) dx, (8) φ z ( ˆ m z , ˆ Q z , ˆ χ z ; z , χ z ) = lim β →∞ β log (cid:90) e − β ˆ Q z z + β ( ˆ m z z + √ ˆ χ z ξ z ) z − βg ( y,z ) dz. (9)In this β → ∞ limit (the so-called zero temperature limit in statistical physics), potentials φ x and φ z correspond to maximum a posteriori (MAP) estimation. Note that they are closely relatedto the Moreau envelopes M [PB +
14, BC +
11] of f and g, which represent a smoothed form of the objective function with the same minimizers:
φ_x(m̂_x, Q̂_x, χ̂_x; x, ξ_x) = Q̂_x X²/2 − M_{f/Q̂_x}(X)   (10)
where M_{γf}(z) = inf_x { f(x) + (1/2γ)‖x − z‖² }, ∀γ > 0, and X = (m̂_x x + √χ̂_x ξ_x)/Q̂_x   (11)
The expression for φ_z is similar. These equivalent forms can be obtained using Laplace's method on (8) and (9), as detailed in appendix B. This shows that our result also extends the framework of [TAH18], where the reconstruction performance of Gaussian GLMs is characterized using expected Moreau envelopes, which naturally appear in the free energy (7).
With the knowledge of the asymptotic overlap m*_x and squared norms q*_x, ρ_x, most quantities of interest can be estimated. For instance, the quadratic reconstruction error is obtained from its definition as E = ρ_x + q*_x − 2m*_x, while the angle between the ground-truth vector and the estimator is θ = arccos(m*_x/(√(ρ_x q*_x))). One can also estimate the generalization error for new random Gaussian samples [EVdB01], or compute similar errors for the denoising of z. Such replica predictions being widely used [GD89, OKKN90, OK96, Bie03, KWT09, GS10, AG16, Mit19, ESAP+
20] we simply illustrate Theorem 1 in Figure 1 on a classification problem. Simulation details are given in appendix F. The comparison with finite-size numerical experiments shows that, despite being asymptotic in nature, these predictions are extremely accurate even at moderate system sizes.
Figure 1: Illustration of Theorem 1 in a binary classification problem with data generated as y = φ(Fx + ω), with the data matrix F being Left: a Gaussian i.i.d. matrix and
Right : arandom orthogonal invariant matrix with a squared uniform density of eigenvalues. We plot theangle between the estimator and the ground-truth vector θ = arccos ( m ∗ x / ( √ ρ x q ∗ x )) as a function ofthe aspect ratio α = M/N for Ridge regression, a Support Vector Machine with linear kernel anda Logistic regression, with (cid:96) penalty λ = 10 − . The theoretical prediction (full line) is comparedwith numerical experiments (points) conducted using standard convex optimization solvers from[PVG + Our proof follows an approach pioneered in [BM11b] for the LASSO. The idea is to build a sequenceof iterates that provably converges towards the estimator ˆx , while the statistical properties of thoseiterates follow equations that match those of Theorem 1 at their fixed point. We must thereforeconcern ourselves with three fundamental aspects:(i) construct a sequence of iterates with a rigorous statistical characterization that matches theequations of Theorem 1,(ii) verify that the sequence’s fixed point reaches the estimator ˆ x ,(iii) determine the conditions for this sequence to be provably convergent, so that the statisticalcharacterization indeed applies to the point of interest ˆ x , and not to arbitrary vectors from adiverging trajectory.We handle point (i) with the 2-layer version of the MLVAMP algorithm proposed by [FRS18]. It isan expectation-propagation [Min01] inspired algorithm that iteratively matches the moments of atarget distribution with those of a tractable approximation of the objective. We remind the full setof iterations in appendix C. In the maximum a posteriori setting, MLVAMP is similar to a multilayerproximal descent method and can be viewed as successively solving an alternating direction methodof multipliers (ADMM) [BPC +
11] step for each layer of the model, with an additional LMMSE (linear minimum mean square error) step on each layer. As pointed out in [FSARS16], the main difference between MLVAMP and standard convex optimization methods is the implicit prescription of descent step sizes and prefactors, adapting them to the local curvature of the functions composing the objective cost.
The main result of [FRS18] is a rigorous proof of state evolution equations for MLVAMP, which holds for any isotropically distributed initialization of the estimates. The following lemma establishes the link between those equations and our main theorem.
Lemma 1. (Fixed point of 2-layer MLVAMP state evolution equations) The state evolution equations of 2-layer MLVAMP from [FRS18], reminded in appendix E, match the equations of Theorem 1 at their fixed point.
Proof
See appendix E.
We can thus construct the sequences of interest using the 2-layer MLVAMP algorithm. However, being only interested in the fixed point of the state evolution equations, we introduce an oracle algorithm where the second-order parameters, i.e. the implicit step-sizes and prefactors stemming from the Bayesian nature of the algorithm, are prescribed from the fixed point of the state evolution equations. In our notations, these parameters correspond to Q̂_{1x}, Q̂_{1z}, Q̂_{2x}, Q̂_{2z}, χ_x, χ_z.
The Oracle-MLVAMP iterations then read: initialize h^{(0)}_{1x}, h^{(0)}_{2z}, prescribe Q̂_{1x}, Q̂_{1z}, Q̂_{2x}, Q̂_{2z}, χ_x, χ_z.
Forward pass – denoising:  x̂^{(t)}_1 = Prox_{f/Q̂_{1x}}(h^{(t)}_{1x})
Forward pass – LMMSE:  ẑ^{(t)}_2 = F(Q̂_{2z} F^T F + Q̂_{2x} Id)^{-1}(Q̂_{2x} h^{(t)}_{2x} + Q̂_{2z} F^T h^{(t)}_{2z})   (12a)
h^{(t)}_{2x} = (x̂^{(t)}_1/χ_x − Q̂_{1x} h^{(t)}_{1x})/Q̂_{2x},   h^{(t)}_{1z} = (ẑ^{(t)}_2/χ_z − Q̂_{2z} h^{(t)}_{2z})/Q̂_{1z}   (12b)
Backward pass – denoising:  ẑ^{(t)}_1 = Prox_{g(·,y)/Q̂_{1z}}(h^{(t)}_{1z})
Backward pass – LMMSE:  x̂^{(t+1)}_2 = (Q̂_{2z} F^T F + Q̂_{2x} Id)^{-1}(Q̂_{2x} h^{(t)}_{2x} + Q̂_{2z} F^T h^{(t+1)}_{2z})   (12c)
h^{(t+1)}_{2z} = (ẑ^{(t)}_1/χ_z − Q̂_{1z} h^{(t)}_{1z})/Q̂_{2z},   h^{(t+1)}_{1x} = (x̂^{(t+1)}_2/χ_x − Q̂_{2x} h^{(t)}_{2x})/Q̂_{1x}.   (12d)
At each iteration, Oracle-MLVAMP returns two sets of estimators (x̂^{(t)}_1, x̂^{(t)}_2) and (ẑ^{(t)}_1, ẑ^{(t)}_2) which respectively aim at reconstructing the minimizer x̂ and ẑ = Fx̂. At the fixed point, we have x̂^{(t)}_1 = x̂^{(t)}_2 and ẑ^{(t)}_1 = ẑ^{(t)}_2. From Lemma 1, we know that the iterates of Oracle-MLVAMP can be characterized by the equations of Theorem 1 with proper time indices. We must now show that the estimator of interest defined by (1) and (2) can be systematically reached using Oracle-MLVAMP. We thus continue with point (ii).
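For readers who wish to experiment with the iterations above, the following Python sketch transcribes (12) directly. It is illustrative only: the oracle parameters and the proximal maps are passed in as inputs (the callables prox_f and prox_g are hypothetical placeholders), whereas in the proof they are prescribed from the state evolution fixed point, and no damping or stopping criterion is included.

```python
import numpy as np

def oracle_mlvamp(F, y, prox_f, prox_g, Q1x, Q1z, Q2x, Q2z, chi_x, chi_z,
                  n_iter=200):
    """Sketch of the Oracle-MLVAMP iterations (12).

    prox_f(v, gamma) and prox_g(v, gamma, y) are vectorized proximal maps of the
    penalty f and of the loss g(., y); the scalars Q1x, Q1z, Q2x, Q2z, chi_x,
    chi_z are assumed to be prescribed from the state evolution fixed point.
    """
    M, N = F.shape
    h1x, h2x = np.zeros(N), np.zeros(N)
    h1z, h2z = np.zeros(M), np.zeros(M)
    # The LMMSE resolvent is fixed along the trajectory, compute it once
    A_inv = np.linalg.inv(Q2z * F.T @ F + Q2x * np.eye(N))
    for _ in range(n_iter):
        # Forward pass: denoising then LMMSE, eqs. (12a)-(12b)
        x1 = prox_f(h1x, 1.0 / Q1x)
        z2 = F @ (A_inv @ (Q2x * h2x + Q2z * F.T @ h2z))
        h2x = (x1 / chi_x - Q1x * h1x) / Q2x
        h1z = (z2 / chi_z - Q2z * h2z) / Q1z
        # Backward pass: denoising then LMMSE, eqs. (12c)-(12d)
        z1 = prox_g(h1z, 1.0 / Q1z, y)
        h2z = (z1 / chi_z - Q1z * h1z) / Q2z
        x2 = A_inv @ (Q2x * h2x + Q2z * F.T @ h2z)
        h1x = (x2 / chi_x - Q2x * h2x) / Q1x
    return x1, z1
```

Since F is fixed, the resolvent (Q̂_{2z}F^TF + Q̂_{2x}Id)^{-1} is precomputed once, which is what the sketch does.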
Lemma 2. (Fixed point of Oracle-MLVAMP) The fixed point of the iterations (12) is unique and is the solution to the convex optimization problem (2).
Proof
See appendix D.
This part is quite straightforward and is a direct consequence of the structure of the algorithm and properties of proximal operators. We now need to characterize the convergence properties of Oracle-MLVAMP and move to point (iii). From this point on, we will make the assumption that f and g are twice differentiable, keeping in mind that any continuous function can be approximated with a twice differentiable one to an arbitrary precision, as proved, for example, in [AFLMR07].
Lemma 3. (Linear convergence of Oracle-MLVAMP) Consider the smoothed problem
x̂(λ, λ̃) ∈ argmin_{x ∈ R^N} { g̃(Fx, y) + f̃(x) }   (13)
where f̃(x) = f(x) + (λ/2)‖x‖² and g̃(x) = g(x) + (λ̃/2)‖x‖². Take the Oracle-MLVAMP iterations run on (13), from which we extract x^{(t)} = [h^{(t)}_{2z}, h^{(t)}_{1x}]^T. We then have the following:
∀ λ̃ > 0, ∃ λ* s.t. ∀ λ ≥ λ*:   (14)
∃ 0 < c < λ s.t. ‖x^{(t)} − x*‖ ≤ (c/λ)^t ‖x^{(0)} − x*‖   (15)
which implies lim_{t→∞} ‖x^{(t)} − x̂‖ = 0   (16)
where x* is the unique fixed point of the iterations (12).
Proof
See appendix G.
The intuition here is that, for a loss function with any non-zero strong convexity constant, and a sufficiently strongly convex regularization, Oracle-MLVAMP systematically converges linearly towards its unique fixed point. We elaborate on this lemma in the next section. We can now state the following lemma, which claims that Theorem 1 holds for a sufficiently strongly convex problem.
Lemma 4. (Asymptotic error for the smoothed problem) Consider the smoothed minimization problem (13). Under the set of assumptions 1, we then have the following statement:
∀ λ̃ > 0, ∃ λ* s.t. ∀ λ ≥ λ*: Theorem 1 holds.   (17)
Proof of Lemma 4
Lemma 4 immediately follows from the conjunction of Lemmas 1, 2 and 3. Indeed, any isotropically initialized trajectory of Oracle-MLVAMP will converge to its unique fixed point according to the Banach fixed point theorem and Lemma 3. Lemma 1 then tells us that each step of this trajectory verifies the equations of Theorem 1 with proper time indices. Finally, Lemma 2 shows that the fixed point of this trajectory is the unique solution to the smoothed problem (13). Proving Theorem 1 boils down to performing an analytic continuation on the ridge parameters.
Proof of Theorem 1
The optimality condition of the smoothed problem (13) reads
∂f(x̂) + F^T ∂g(Fx̂) + (λ Id + λ̃ F^T F) x̂ = 0   (18)
and defines an analytic function of (λ, λ̃) for the coordinates of the solution x̂(λ, λ̃), and thus for ẑ = Fx̂ as well, using the analytic inverse function theorem from [KP02]. From the form of the proximal of a convex (and differentiable) function with an additional ridge penalty (see appendix B):
Prox_{γ(f + (λ/2)‖·‖²)}(z) = ((1 + γλ)Id + γf′)^{-1}(z),   (19)
we directly see that all equations defining the scalar quantities in the state evolution equations (6) are analytic in (λ, λ̃), which implicitly defines an analytic function for any scalar combination of those quantities [KP02]. Moreover, Lemma 4 holds for an open subset of (λ, λ̃); we can therefore use an analytic continuation theorem [KP02] to extend Lemma 4 to all non-negative values of (λ, λ̃), finally proving Theorem 1 by choosing (λ, λ̃) = (0, 0).
The remaining technical part is the proof of the convergence Lemma 3. For this purpose, we use a dynamical system reformulation of Oracle-MLVAMP and a result from control theory, adapted to machine learning in [LRP16] and more specifically to ADMM in [NLR+15].
Oracle-MLVAMP as a dynamical system: proof of Lemma 3
The key idea of the approach pioneered in [LRP16] is to recast any non-linear dynamical systemas a linear one, where convergence will be naturally characterized by a matrix norm. For a givennon-linearity ˜ O and iterate h , we define the variable u = ˜ O ( h ) and rewrite the initial algorithm interms of this trivial transform. Any property of ˜ O is then summarized in a constraint matrix linking h and u . For example, if ˜ O has Lipschitz constant ω , then for all t : (cid:107) u t +1 − u t (cid:107) (cid:54) ω (cid:107) h t +1 − h t (cid:107) , (20)which can be rewritten in matrix form: (cid:20) h t +1 − h t u t +1 − u t (cid:21) T (cid:20) ω I d h − I d u (cid:21) (cid:20) h t +1 − h t u t +1 − u t (cid:21) (cid:62) (21)where I d h , I d u are the identity matrices with dimensions of u , h , i.e. M or N in our case. Anyco-coercivity property (verified by proximal operators) can be rewritten in matrix form but yields nonblock diagonal constraint matrices. We will thus directly use the Lipschitz constants for our proof,as they lead to simpler derivations and suffice to prove the required result. The main theorem from[LRP16], adapted to ADMM in [NLR + h (0)1 x , h (0)2 z h ( t +1)1 x = W ˜ O h ( t )1 x + W ˜ O ( W h ( t )2 z + W ˜ O h ( t )1 x ) (22) h ( t +1)2 z = ˜ O ( W h ( t )2 z + W ˜ O h ( t )1 x ) (23)where W = ˆ Q x ˆ Q x (cid:18) χ x ( ˆ Q z F T F + ˆ Q x Id ) − − Id (cid:19) W = ˆ Q z χ x ˆ Q x ( ˆ Q z F T F + ˆ Q x Id ) − F T (24) W = ˆ Q z ˆ Q z (cid:18) χ z F ( ˆ Q z F T F + ˆ Q x Id ) − F T − Id (cid:19) W = ˆ Q x ˆ Q z χ z F ( ˆ Q z F T F + ˆ Q x Id ) − (25) ˜O = ˆ Q x ˆ Q x (cid:18) χ x ˆ Q x Prox f / ˆ Q x ( · ) − Id (cid:19) ˜O = ˆ Q z ˆ Q z (cid:18) χ z ˆ Q z Prox g ( .,y ) / ˆ Q z ( · ) − Id (cid:19) . (26) For the linear recast, we then define the variables: u ( t )0 = ˜ O ( h ( t )1 x ) , v ( t ) = W h ( t )2 z + W u ( t )0 , and u ( t )1 = ˜ O ( v ( t ) ) (27)such that h ( t +1)2 z = u ( t )1 , h ( t +1)1 x = W u ( t )0 + W u ( t )1 . (28)where u , h x ∈ R N , v , u , h z ∈ R M . We then write x ( t ) = (cid:34) h ( t )2 z h ( t )1 x (cid:35) , u ( t ) = (cid:34) u ( t )1 u ( t )0 (cid:35) , z ( t )0 = (cid:34) h ( t )1 x u ( t )0 (cid:35) , z ( t )1 = (cid:34) v ( t ) u ( t )1 (cid:35) . x ( t +1) = Ax ( t ) + Bu ( t ) (29) z ( t )1 = C x ( t ) + D u ( t ) (30) z ( t )2 = C x ( t ) + D u ( t ) (31)where A = 0 ( M + N ) × ( M + N ) B = (cid:20) I M × M M × N W W (cid:21) (32) C = (cid:20) N × M I N × N N × M N × N (cid:21) D = (cid:20) N × M N × N N × M I N × N (cid:21) (33) C = (cid:20) W M × N M × M M × N (cid:21) D = (cid:20) M × M W I M × M M × N (cid:21) . (34)The next step is to impose the properties of the non-linearities ˜ O , ˜ O through constraint matrices.The Lipschitz constants ω , ω of ˜ O , ˜ O can be determined using properties of proximal operators[GB16] and are directly linked to the strong convexity and smoothness of the cost function andregularization. The relevant properties of proximal operators are reminded in appendix B, and thesubsequent derivation of the Lipschitz constants is detailed in appendix G, and gives the followingresults: ω = ˆ Q x ˆ Q x (cid:115) Q x − ˆ Q x ( ˆ Q x + λ ) ω = ˆ Q z ˆ Q z (cid:115) Q z − ˆ Q z ( ˆ Q z + ˜ λ ) . (35)We thus define the constraints matrices M = (cid:20) ω − (cid:21) ⊗ I N × N M = (cid:20) ω − (cid:21) ⊗ I M × M (36)where ⊗ denotes the Kronecker product. We then use Theorem 4 from [LRP16] in the appropriateform for Oracle-MLVAMP, as was done in [NLR +
15] for ADMM.
Proposition 1. (Theorem 4 from [LRP16]) Consider the following linear matrix inequality with τ ∈ [0 , : (cid:23) (cid:20) A T PA − τ P A T PBB T PA B T PB (cid:21) + (cid:20) C D C D (cid:21) T (cid:20) β M β M (cid:21) (cid:20) C D C D (cid:21) . (37) If (37) is feasible for some P (cid:31) and β , β (cid:62) , then for any initialization x , we have: (cid:107) x ( t ) − ˆx (cid:107) (cid:54) (cid:112) κ ( P ) τ t (cid:107) x − ˆx (cid:107) for all t (38) where κ ( P ) is the condition number of P . We show in appendix G how the additional ridge penalties from the smoothed problem (13)parametrized by ˜ λ , λ can be used to make (37) feasible and prove Lemma 3. The core idea is toleverage on the Lipschitz constants (35), the operator norms of the matrices defined in (24) and thefollowing upper and lower bounds on the ˆ Q parameters: λ min ( H f ) (cid:54) ˆ Q x (cid:54) λ max ( H f ) λ min ( H g ) (cid:54) ˆ Q z (cid:54) λ max ( H g ) (39) ˆ Q z λ min ( F T F ) (cid:54) ˆ Q x (cid:54) ˆ Q z λ max ( F T F ) ˆ Q x λ max ( FF T ) (cid:54) ˆ Q z (cid:54) ˆ Q x λ min ( FF T ) , (40)10here H f , H g are the Hessian of the loss and regularization functions. These bounds are easilyobtained from the definitions of χ x , χ z in the state evolution equations (or equivalently in Theorem1), and the fact that the derivative of a proximal operator reads, for a twice differentiable function: D η γf ( x ) = ( Id + γ H f ( η γf ( x ))) − . (41)Detail of this derivation can also be found in appendices B and G. For the smoothed problem (13),the maximum and minimum eigenvalues of the Hessians are directly augmented by ˜ λ , λ , whichallows us to control the scaling of the ˆ Q parameters. The rest of the convergence proof is then basedon successive application of Schur’s lemma [HJ12] on the linear matrix inequality (37) and translatingthe resulting conditions on inequalities which can be verified by choosing the appropriate ˜ λ , λ , β , β .Convergence of gradient-based descent methods for sufficiently strongly-convex objectives is a coher-ent result from an optimization point of view. This is corroborated by the symbolic convergencerates derived for ADMM in [NLR + https://github.com/cgerbelo/Replica_GLM_orth.inv. Acknowledgments
The authors would like to thank Benjamin Aubin, Yoshiyuki Kabashima and Lenka Zdeborová for discussions. This work is supported by the French Agence Nationale de la Recherche under grants ANR-17-CE23-0023-01 PAIL and ANR-19-P3IA-0001 PRAIRIE. Additional funding is acknowledged from "Chaire de recherche sur les modèles et sciences des données", Fondation CFM pour la Recherche-ENS.
References
[AFLMR07] Daniel Azagra, Juan Ferrera, Fernando López-Mesas, and Yenny Rangel. Smooth approximation of Lipschitz functions on Riemannian manifolds.
Journal of MathematicalAnalysis and Applications , 326(2):1370–1378, 2007.[AG16] Madhu Advani and Surya Ganguli. An equivalence between high dimensional bayesoptimal inference and m-estimation. In
Advances in Neural Information ProcessingSystems , pages 3378–3386, 2016.[BC +
11] Heinz H Bauschke, Patrick L Combettes, et al.
Convex analysis and monotone operatortheory in Hilbert spaces , volume 408. Springer, 2011.[Bie03] Thomas Published Biehl, Michael; Caticha, Nestor; Opper, Manfred; Villmann. Statis-tical Physics of Learning and Generalization.
Adaptivity and Learning , pages 77–88,2003.[BM11a] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on densegraphs, with applications to compressed sensing.
IEEE Transactions on InformationTheory , 57(2):764–785, 2011.[BM11b] Mohsen Bayati and Andrea Montanari. The lasso risk for gaussian matrices.
IEEETransactions on Information Theory , 58(4):1997–2017, 2011.[BPC +
11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributedoptimization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends R (cid:13) in Machine learning , 3(1):1–122, 2011.[DM16] David Donoho and Andrea Montanari. High dimensional robust m-estimation: Asymp-totic variance via approximate message passing. Probability Theory and Related Fields ,166(3-4):935–969, 2016.[DMM09] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms forcompressed sensing.
Proceedings of the National Academy of Sciences , 106(45):18914–18919, 2009.[EKBB +
13] Noureddine El Karoui, Derek Bean, Peter J Bickel, Chinghway Lim, and Bin Yu. Onrobust regression with high-dimensional predictors.
Proceedings of the National Academyof Sciences , 110(36):14557–14562, 2013.[ESAP +
20] Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, andAlyson K Fletcher. Generalization error of generalized linear models in high dimensions. arXiv preprint arXiv:2005.00180 , 2020.[EVdB01] Andreas Engel and Christian Van den Broeck.
Statistical mechanics of learning . Cam-bridge University Press, 2001.[FRS18] Alyson K Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networksin high dimensions. In , pages 1884–1888. IEEE, 2018.12FSARS16] Alyson Fletcher, Mojtaba Sahraee-Ardakan, Sundeep Rangan, and Philip Schniter.Expectation consistent approximate inference: Generalizations and convergence. In , pages 190–194.IEEE, 2016.[GAK20] Cédric Gerbelot, Alia Abbara, and Florent Krzakala. Asymptotic errors for convexpenalized linear regression beyond gaussian matrices. arXiv preprint arXiv:2002.04372 ,2020.[GB16] Pontus Giselsson and Stephen Boyd. Linear convergence and metric selection for douglas-rachford splitting and admm.
IEEE Transactions on Automatic Control , 62(2):532–544,2016.[GD89] Elizabeth Gardner and Bernard Derrida. Three unfinished works on the optimal storagecapacity of networks.
Journal of Physics A: Mathematical and General , 22(12):1983,1989.[GS10] Surya Ganguli and Haim Sompolinsky. Statistical mechanics of compressed sensing.
Physical review letters , 104(18):188701, 2010.[HJ12] Roger A Horn and Charles R Johnson.
Matrix analysis . Cambridge university press,2012.[HMRT19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises inhigh-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560 ,2019.[Kab03] Yoshiyuki Kabashima. A cdma multiuser detection algorithm on the basis of beliefpropagation.
Journal of Physics A: Mathematical and General , 36(43):11111, 2003.[Kab08] Yoshiyuki Kabashima. Inference from correlated patterns: a unified theory for perceptronlearning and linear vector channels. In
Journal of Physics: Conference Series , volume 95,page 012001. IOP Publishing, 2008.[KP02] Steven G Krantz and Harold R Parks.
A primer of real analytic functions . SpringerScience & Business Media, 2002.[KU04] Yoshiyuki Kabashima and Shinsuke Uda. A bp-based algorithm for performing bayesianinference in large perceptron-type networks. In
International Conference on AlgorithmicLearning Theory , pages 479–493. Springer, 2004.[KWT09] Yoshiyuki Kabashima, Tadashi Wadayama, and Toshiyuki Tanaka. A typical reconstruc-tion limit for compressed sensing based on lp-norm minimization.
Journal of StatisticalMechanics: Theory and Experiment , 2009(09):L09003, 2009.[LRP16] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of opti-mization algorithms via integral quadratic constraints.
SIAM Journal on Optimization ,26(1):57–95, 2016.[Méz89] Marc Mézard. The space of interactions in neural networks: Gardner’s computationwith the cavity method.
Journal of Physics A: Mathematical and General , 22(12):2181,1989. 13Min01] Thomas Peter Minka.
A family of algorithms for approximate Bayesian inference . PhDthesis, Massachusetts Institute of Technology, 2001.[Mit19] Partha P Mitra. Compressed sensing and overparametrized networks: Overfitting peaksin a model of misparametrized sparse regression in the interpolation limit. 2019.[MKMZ17] Andre Manoel, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Multi-layergeneralized linear estimation. In , pages 2098–2102. IEEE, 2017.[MM09] Marc Mezard and Andrea Montanari.
Information, physics, and computation . OxfordUniversity Press, 2009.[MM19] Song Mei and Andrea Montanari. The generalization error of random features regression:Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355 , 2019.[MPV87] Marc Mézard, Giorgio Parisi, and Miguel Virasoro.
Spin glass theory and beyond: AnIntroduction to the Replica Method and Its Applications , volume 9. World ScientificPublishing Company, 1987.[NLR +
15] Robert Nishihara, Laurent Lessard, Benjamin Recht, Andrew Packard, and Michael IJordan. A general analysis of the convergence of admm. arXiv preprint arXiv:1502.02009 ,2015.[OK96] Manfred Opper and Wolfgang Kinzel. Statistical mechanics of generalization. In
Modelsof neural networks III , pages 151–209. Springer, 1996.[OKKN90] M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptronto generalise.
Journal of Physics A: General Physics , 23(11), 1990.[PB +
14] Neal Parikh, Stephen Boyd, et al. Proximal algorithms.
Foundations and Trends R (cid:13) inOptimization , 1(3):127–239, 2014.[PVG +
11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, BertrandThirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, VincentDubourg, et al. Scikit-learn: Machine learning in python. the Journal of machineLearning research , 12:2825–2830, 2011.[Ran11] Sundeep Rangan. Generalized approximate message passing for estimation with ran-dom linear mixing. In , pages 2168–2172. IEEE, 2011.[RSF19] Sundeep Rangan, Philip Schniter, and Alyson K Fletcher. Vector approximate messagepassing.
IEEE Transactions on Information Theory , 2019.[SCC19] Pragya Sur, Yuxin Chen, and Emmanuel J Candès. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square.
Probability Theoryand Related Fields , 175(1-2):487–558, 2019.[SRF16] Philip Schniter, Sundeep Rangan, and Alyson K Fletcher. Vector approximate messagepassing for the generalized linear model. In , pages 1525–1529. IEEE, 2016.14SST92] Hyunjune Sebastian Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanicsof learning from examples.
Physical review A , 45(8):6056, 1992.[TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysisof regularized m -estimators in high dimensions. IEEE Transactions on InformationTheory , 64(8):5592–5628, 2018.[TK20] Takashi Takahashi and Yoshiyuki Kabashima. Macroscopic analysis of vector approxi-mate message passing in a model mismatch setting. arXiv preprint arXiv:2001.02824 ,2020.[TVV04] Antonia M Tulino, Sergio Verdú, and Sergio Verdu.
Random matrix theory and wirelesscommunications . Now Publishers Inc, 2004.[WRB93] Timothy LH Watkin, Albrecht Rau, and Michael Biehl. The statistical mechanics oflearning a rule.
Reviews of Modern Physics , 65(2):499, 1993.[ZK16] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds andalgorithms.
Advances in Physics, 65(5):453–552, 2016.
A Convergence of vector sequences
This section is a brief summary of the framework originally introduced in [BM11a] and used in [FRS18, RSF19]. We review the key definitions and verify that they apply in our setting. We remind the full set of state evolution equations from [FRS18] at (97), when applied to learning a GLM, in appendix E.
The main building blocks are the notions of vector sequence and pseudo-Lipschitz function, which allow one to define empirical convergence with p-th order moment. Consider a vector of the form
x(N) = (x_1(N), ..., x_N(N))   (42)
where each sub-vector x_n(N) ∈ R^r for any given r ∈ N*. For r = 1, which we use in Theorem 1, x(N) is denoted a vector sequence.
Given p ≥ 1, a function f: R^r → R^s is said to be pseudo-Lipschitz continuous of order p if there exists a constant C > 0 such that for all x_1, x_2 ∈ R^r:
‖f(x_1) − f(x_2)‖ ≤ C ‖x_1 − x_2‖ [1 + ‖x_1‖^{p−1} + ‖x_2‖^{p−1}]   (43)
Then, a given vector sequence x(N) converges empirically with p-th order moment if there exists a random variable X ∈ R^r such that:
• E|X|^p < ∞; and
• for any scalar-valued pseudo-Lipschitz continuous f(.) of order p,
lim_{N→∞} (1/N) Σ_{n=1}^N f(x_n(N)) = E[f(X)] a.s.   (44)
Note that defining an empirically converging singular value distribution implicitly defines a sequence of matrices F(N) using the definition of rotational invariance from the introduction. This naturally brings us back to the original definitions from [BM11a]. An important point is that the almost sure convergence of the second condition holds for random vector sequences, such as the ones we consider in the introduction. Note that the noise vector ω must also satisfy these conditions, and naturally does when it is an i.i.d. Gaussian one. We also remind the definition of uniform Lipschitz continuity. For a given mapping φ(x, A) defined on x ∈ X and A ∈ R, we say it is uniformly Lipschitz continuous in x at A = Ā if there exist constants L_1 and L_2 ≥ 0 and an open neighborhood U of Ā such that:
‖φ(x_1, A) − φ(x_2, A)‖ ≤ L_1 ‖x_1 − x_2‖   (45)
for all x_1, x_2 ∈ X and A ∈ U; and
‖φ(x, A_1) − φ(x, A_2)‖ ≤ L_2 (1 + ‖x‖) |A_1 − A_2|   (46)
for all x ∈ X and A_1, A_2 ∈ U.
The additional conditions for the state evolution theorem from [FRS18] (assumption 2) to hold are uniform Lipschitz continuity (which implies continuity) of the update functions (97f, 97j, 97n, 97r) and their derivatives in their arguments at their parameters. These conditions are straightforward to check on the linear update functions (97j, 97r). For the update functions involving proximal operators (97f, 97n), uniform Lipschitz continuity can be checked on a case by case basis and holds true for most commonly used functions for losses and regularizations. Potentially pathological cases can be treated by working on compact spaces excluding any bothersome divergence.
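As a toy illustration of the empirical convergence (44) (not part of the paper), the following snippet checks numerically that, for an i.i.d. standard Gaussian vector sequence, the empirical average of a pseudo-Lipschitz observable of order 2 approaches its expectation as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A pseudo-Lipschitz observable of order p = 2 (at most quadratic growth)
f = lambda x: x**2 / 2 + np.abs(x)

# E[f(X)] for X ~ N(0,1): E[X^2]/2 = 1/2 and E[|X|] = sqrt(2/pi)
expected = 0.5 + np.sqrt(2 / np.pi)

for N in (10**2, 10**4, 10**6):
    x = rng.standard_normal(N)      # vector sequence with r = 1
    empirical = f(x).mean()         # (1/N) sum_n f(x_n(N))
    print(f"N = {N:>8d}: empirical = {empirical:.4f}, expected = {expected:.4f}")
```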
B Convex analysis and properties of proximal operators
We start this section with a few useful definitions from convex analysis, which can all be found in textbooks such as [BC+11], and which will be used to determine the Lipschitz constants of the operators Õ_1, Õ_2. In what follows, we denote X the Hilbert space with scalar inner product serving as input and output space, here R^N or R^M. For simplicity, we will write all operators as going from X to X.
Definition 1. (Co-coercivity) Let T: X → X and β ∈ R*_+. Then T is β-co-coercive if βT is firmly nonexpansive, i.e.
⟨x − y, T(x) − T(y)⟩ ≥ β ‖T(x) − T(y)‖²   (47)
for all x, y ∈ X. Proximal operators are 1-co-coercive, or equivalently firmly nonexpansive.
Definition 2. (Strong convexity) A proper closed function is σ -strongly convex with σ > if f − σ (cid:107) . (cid:107) is convex. If f is differentiable, the definition is equivalent to f ( x ) (cid:62) f ( y ) + (cid:104)∇ f ( y ) , x − y (cid:105) + σ (cid:107) x − y (cid:107) (48) for all x , y ∈ X . Definition 3. (Smoothness for convex functions) A proper closed function f is β -smooth with β > if β (cid:107) . (cid:107) − f is convex. If f is differentiable, the definition is equivalent to f ( x ) (cid:54) f ( y ) + (cid:104)∇ f ( y ) , x − y (cid:105) + β (cid:107) x − y (cid:107) (49) for all x , y ∈ X . An immediate consequence of those definitions is the following second order condition: for twicedifferentiable functions, f is σ -strongly convex and β -smooth if and only if: σ Id (cid:22) H f (cid:22) β Id . (50) Corollary 2. (Remark 4.24 [BC + T : X → X is β -cocoercive if and only if β T ishalf-averaged. This means that T can be expressed as: T = 12 β (Id + S) (51) where S is a nonexpansive operator. Proposition 2. (Resolvent of the sub-differential [BC + f is the resolvent of the sub-differential ∂f of f :Prox γf = (Id + γ∂ f) − . (52)The following proposition is due to [GB16], and is useful determining upper bounds on theLipschitz constant of update functions involving proximal operators. Proposition 3. (Proposition 2 from [GB16]) Assume that f is σ -strongly convex and β -smooth andthat γ ∈ ]0 , ∞ [ . Then Prox γf − γβ Id is γβ − γσ -cocoercive if β > σ and 0-Lipschitz if β = σ . We will use these definitions and properties to derive the Lipschitz constants of ˜ O , ˜ O in appendixG. 17 .1 Jacobian of the proximal Using proposition 2, the proximal operator can be written, for any parameter γ ∈ R + and x in theinput space X : Prox γf ( x ) = (Id + γ∂ f) − ( x ) . (53)For any convex and differentiable function f , we have:Prox γf ( x ) + γ ∇ f ( Prox γf ( x )) = x (54)For a twice differentiable f , applying the chain rule then yields: D Prox γf ( x ) + γ H f ( Prox γf ( x )) D Prox γf ( x ) = Id (55)where D is the Jacobian matrix and H the Hessian. Since f is a convex function, its Hessian ispositive semi-definite, and, knowing that γ is strictly positive, the matrix (Id + γ H f ( Prox γf )) isinvertible. We thus have: D Prox γf ( x ) = (Id + γ H f ( Prox γf ( x ))) − (56) B.2 Proximal of ridge regularized functions
Since we consider only separable functions, we can work with scalar versions of the proximal operators. The scalar proximal of a given function with an added ridge regularization can be written:
Prox_{γ(f + (λ/2)‖·‖²)}(x) = (Id + γ(∂f + λ Id))^{-1}(x)   (57)
= ((1 + γλ)Id + γf′)^{-1}(x) for differentiable f.   (58)
The function inside the inverse is analytic in λ with non-zero derivative (γ is strictly positive), so its inverse is also analytic in λ according to the analytic inverse function theorem [KP02].
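As a quick numerical sanity check of identity (58) (illustrative only, with an arbitrary smooth convex choice of f that is not taken from the paper), one can compare the proximal map of f + (λ/2)x² computed by direct minimization with the resolvent form ((1 + γλ)Id + γf′)^{-1}:

```python
import numpy as np
from scipy.optimize import minimize_scalar, brentq

# Illustrative smooth convex function and its derivative (softplus)
f = lambda x: np.log(1 + np.exp(x))
fp = lambda x: 1 / (1 + np.exp(-x))

gamma, lam, z = 0.7, 2.0, 1.3

# Prox by direct minimization of f(x) + (lam/2) x^2 + (x - z)^2 / (2 gamma)
obj = lambda x: f(x) + 0.5 * lam * x**2 + (x - z)**2 / (2 * gamma)
prox_direct = minimize_scalar(obj).x

# Prox as the resolvent (58): solve (1 + gamma*lam) x + gamma f'(x) = z
resolvent = lambda x: (1 + gamma * lam) * x + gamma * fp(x) - z
prox_resolvent = brentq(resolvent, -10, 10)

print(prox_direct, prox_resolvent)   # the two values agree up to solver tolerance
```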
Here we show how the potentials defined for the replica free energy of corollary can be mapped toMoreau envelopes in the β → ∞ limit. We consider the scalar case since the replica expressions arescalar. All functions are separable here, so any needed generalization to the multidimensional case isimmediate. We start by reminding the definition of the Moreau envelope [BC +
11, PB + M γf of aproper, closed and convex function f for a given γ ∈ R ∗ + and any z ∈ R : M γf ( z ) = inf x ∈ R (cid:110) f ( x ) + (1 / γ ) (cid:107) x − z (cid:107) (cid:111) (59)The Moreau envelope can be interpreted as a smoothed version of a given objective function with thesame minimizer. For (cid:96) minimization for example, it allows to work with a differentiable objective.By definition of the proximal operator we have the following identity:Prox γf ( z ) = arg min x ∈ R (cid:110) f ( x ) + (1 / γ ) (cid:107) x − z (cid:107) (cid:111) (60) M γf ( z ) = f ( Prox γf ( z )) + 12 (cid:107) Prox γf ( z ) − z (cid:107) (61)18e can now match the replica potentials with the Moreau envelope. We start from the definition ofsaid potentials, to which we apply Laplace’s approximation: φ x ( ˆ m x , ˆ Q x , ˆ χ x ; x , ξ x ) = lim β →∞ β log (cid:90) e − β ˆ Q x x + β ( ˆ m x x + √ ˆ χ x ξ x ) x − βf ( x ) dx (62) = − ˆ Q x x ∗ ) + ( ˆ m x x + (cid:112) ˆ χ x ξ x ) x ∗ − f ( x ∗ ) (63)where x ∗ = arg min x (cid:40) − ˆ Q x x + ( ˆ m x x + (cid:112) ˆ χ x ξ x ) x − f ( x ) (cid:41) (64)This is an unconstraint convex optimization problem, thus its optimality condition is enough tocharacterize its unique minimizer: − ˆ Q x x ∗ + ( ˆ m x x + (cid:112) ˆ χ x ξ x ) − ∂f ( x ∗ ) = 0 (65) ⇐⇒ x ∗ = ( Id + 1ˆ Q x ∂f ) − (cid:18) ˆ m x x + √ ˆ χ x ξ x ˆ Q x (cid:19) (66) ⇐⇒ x ∗ = Prox f ˆ Q x (cid:18) ˆ m x x + √ ˆ χ x ξ x ˆ Q x (cid:19) (67)Replacing this in the replica potential and completing the square, we get: φ x ( ˆ m x , ˆ Q x , ˆ χ x ; x , ξ x ) = − f ( Prox γf ( X )) − ˆ Q x (cid:107) X − Prox γf ( X ) (cid:107) + X Q x (68) = ˆ Q x X − M Q x f ( X ) (69)where we used the shorthand X = ˆ m x x + √ ˆ χ x ξ x ˆ Q x . C Background on multilayer vector approximate message passing
Here we simply give the full iterations of the MLVAMP algorithm from [FRS18] applied to a 2-layernetwork. For a given operator T : X → X , the brackets (cid:104) T ( x ) (cid:105) = d (cid:80) di =1 T ( x ) i denote element-wiseaveraging operations, where d is M or N in our case. For a given matrix M ∈ R d × d , the bracketsamount to (cid:104) M (cid:105) = d Tr ( M ) . Going from MLVAMP to Oracle-MLVAMP simply boils down toprescribing the scalar terms in the following iterations:19nitialize h (0)1 x , h (0)2 z Forward pass – denoising ( L = 0) Forward pass - LMMSE ( L = 1) ˆx ( t )1 = g x ( h ( t )1 x , ˆ Q ( t )1 x ) ˆz ( t )2 = g z ( h ( t )2 x , h ( t )2 z , ˆ Q ( t )2 x , ˆ Q ( t )2 z ) (70a) χ ( t )1 x = 1ˆ Q ( t )1 x (cid:42) ∂g x ( h ( t )1 x , ˆ Q ( t )1 x ) ∂ h ( t )1 x (cid:43) χ ( t )2 z = 1ˆ Q ( t )2 z (cid:42) ∂g z ( h ( t )2 x , h ( t )2 z , ˆ Q ( t )2 x , ˆ Q ( t )2 z ) ∂ h ( t )2 z (cid:43) (70b) ˆ Q ( t )2 x = 1 /χ ( t )1 x − ˆ Q ( t )1 x ˆ Q ( t )1 z = 1 /χ ( t )2 z − ˆ Q ( t )2 z (70c) h ( t )2 x = ( ˆx ( t )1 /χ ( t )1 x − ˆ Q ( t )1 x h ( t )1 x ) / ˆ Q ( t )2 x h ( t )1 z = ( ˆz ( t ) /χ ( t )2 z − ˆ Q ( t )2 z h ( t )2 z ) / ˆ Q ( t )1 z (70d) Backward pass – denoising ( L = 1) Backward pass - LMMSE ( L = 0) ˆz ( t )1 = g z ( h ( t )1 z , ˆ Q ( t )1 z ) ˆx ( t +1)2 = g x ( h ( t )2 x , h ( t +1)2 z , ˆ Q ( t )2 x , ˆ Q ( t +1)2 z ) (70e) χ ( t )1 z = 1ˆ Q ( t )1 z (cid:42) ∂g z ( h ( t )1 z , ˆ Q ( t )1 z ) ∂ h ( t )1 z (cid:43) χ ( t +1)2 x = 1ˆ Q ( t )2 x (cid:42) ∂g x ( h ( t )2 x , h ( t +1)2 z , ˆ Q ( t )2 x , ˆ Q ( t +1)2 z ) ∂ h ( t )2 x (cid:43) (70f) ˆ Q ( t +1)2 z = 1 /χ ( t )1 z − ˆ Q ( t )1 z ˆ Q ( t +1)1 x = 1 /χ ( t +1)2 x − ˆ Q ( t )2 x (70g) h ( t +1)2 z = ( ˆz ( t )1 /χ ( t )1 z − ˆ Q ( t )1 z h ( t )1 z ) / ˆ Q ( t +1)2 z h ( t +1)1 x = ( ˆx ( t +1)2 /χ ( t +1)2 x − ˆ Q ( t )2 x h ( t )2 x ) / ˆ Q ( t +1)1 x . (70h)Denoiser functions g x and g z can be written as proximal operators in the MAP setting: g x ( h ( t )1 x , ˆ Q ( t )1 x ) = arg min x (cid:40) f ( x ) + ˆ Q ( t )1 x (cid:107) x − h ( t )1 x (cid:107) (cid:41) = Prox f/ ˆ Q ( t )1 x ( h ( t )1 x ) (71) g z ( h ( t )1 z , ˆ Q ( t )1 z ) = arg min z (cid:40) g ( y , z ) + ˆ Q ( t )1 z (cid:107) z − h ( t )1 z (cid:107) (cid:41) = Prox g ( ., y ) / ˆ Q ( t )1 z ( h ( t )1 z ) . (72)LMMSE denoisers g z and g x in the MAP setting read (see [SRF16]): g z ( h ( t )2 x , h ( t )2 z , ˆ Q ( t )2 x , ˆ Q ( t )2 z ) = F ( ˆ Q ( t )2 z F T F + ˆ Q ( t )2 x Id ) − ( ˆ Q ( t )2 x h ( t )2 x + ˆ Q ( t )2 z F T h ( t )2 z ) (73) g x ( h ( t )2 x , h ( t +1)2 z , ˆ Q ( t )2 x , ˆ Q ( t +1)2 z ) = ( ˆ Q ( t +1)2 z F T F + ˆ Q ( t )2 x Id ) − ( ˆ Q ( t )2 x h ( t )2 x + ˆ Q ( t +1)2 z F T h ( t +1)2 z ) . (74) D Fixed point analysis of two-layer MLVAMP
Here we show that the fixed point of 2-layer MLVAMP coincides with the unique minimizer of theconvex problem 2, proving Lemma 2. Writing the fixed point of the scalar parameters of the iterations(70), we get the following prescriptions on the scalar quantities: χ x ≡ χ x = 1 χ x = ˆ Q x + ˆ Q x χ z ≡ χ z = 1 χ z = ˆ Q z + ˆ Q z (75) ˆ Q x χ x + ˆ Q x χ x = 1 ˆ Q z χ z + ˆ Q z χ z = 1 (76)and the following ones on the estimates: ˆ x = ˆ x ˆ z = ˆ z (77) ˆ z = F ˆ x ˆ z = F ˆ x (78)20e would like the fixed point of MLVAMP to satisfy the following first-order optimality condition ∂f ( ˆx ) + F T ∂g ( Fx ∗ ) = 0 , (79)which characterizes the unique minimizer of the unconstraint convex problem (2). Replacing h x ’sexpression inside h x reads h x = (cid:18) ˆx χ x − ˆ Q x h x (cid:19) / ˆ Q x (80) = (cid:18) ˆx χ x − (cid:18) ˆx χ x − ˆ Q x h x (cid:19)(cid:19) / ˆ Q x (81)and using (75) we get ˆx = ˆx , and a similar reasoning gives ˆz = ˆz . From (73) and (74), we clearlyfind ˆz = Fˆx . Inverting the proximal operators in (71) and (72) yields ˆx + 1ˆ Q x ∂g ( ˆx ) = h x (82) ˆz + 1ˆ Q z ∂g ( ˆz ) = h z . (83)Starting from the MLVAMP equation on h x , we write h x = (cid:18) ˆx χ x − ˆ Q x h x (cid:19) / ˆ Q x (84) = (cid:18) ˆx χ x − ( ˆ Q z F T F + ˆ Q x Id ) ˆx + ˆ Q z F T h z (cid:19) / ˆ Q x (85) = − (cid:18) ˆ Q z F T F + ˆ Q x (cid:18) − χ x ˆ Q x (cid:19) Id (cid:19) ˆx / ˆ Q x + F T (cid:18) ˆ Q z (cid:18) χ z ˆ Q z − (cid:19) ˆz − ∂ g ( ˆz ) (cid:19) (86)which is equal to the left-hand term in (82). Using this equality, as well as ˆz = Fˆx and relations (75)and (76) yields ∂f ( ˆx ) + F T ∂g ( Fˆx ) = 0 . (87)Hence, the fixed point of MLVAMP satisfies the optimality condition (79) and is indeed the desiredMAP estimator: ˆx = ˆx = ˆx . E State evolution equations
E.1 Heuristic state evolution equations
The state evolution equations track the evolution of MLVAMP (70) and provide statistical propertiesof its iterates. They are derived in [TK20] taking the heuristic assumption that h , h , h , h behave as Gaussian estimates, which comes from the physics cavity approach: ˆ Q ( t )1 x h ( t )1 x − ˆ m ( t )1 x x d = (cid:113) ˆ χ ( t )1 x ξ ( t )1 x (88a) V T ( ˆ Q ( t )2 x h ( t )2 x − ˆ m ( t )2 x x ) d = (cid:113) ˆ χ ( t )2 x ξ ( t )2 x (88b) U T ( ˆ Q ( t )1 z h ( t )1 z − ˆ m ( t )1 z z ) d = (cid:113) ˆ χ ( t )1 z ξ ( t )1 z (88c) ˆ Q ( t )2 z h ( t )2 z − ˆ m ( t )2 z z d = (cid:113) ˆ χ ( t )2 z ξ ( t )2 z (88d)21here d = denotes equality of empirical distributions. U and V come from the singular value decompo-sition F = UDV T and are Haar-sampled; ξ ( t )1 x , ξ ( t )2 x , ξ ( t )1 z , ξ ( t )2 z are normal Gaussian vectors, independentfrom x , z , V T x and U T z . Parameters ˆ Q ( t )1 x , ˆ Q ( t )1 z , ˆ Q ( t )2 x , ˆ Q ( t )2 z are defined through MLVAMP’s itera-tions (70); while parameters ˆ m ( t )1 x , ˆ m ( t )1 z , ˆ m ( t )2 x , ˆ m ( t )2 z and ˆ χ ( t )1 x , ˆ χ ( t )1 z , ˆ χ ( t )2 x , ˆ χ ( t )2 z are prescribed through SEequations. Other useful variables are the overlaps and squared norms of estimators, for k ∈ { , } : m ( t ) kx = x (cid:62) ˆ x ( t ) k N q ( t ) kx = (cid:107) ˆ x ( t ) k (cid:107) Nm ( t ) kz = z (cid:62) ˆ z ( t ) k M q ( t ) kz = (cid:107) ˆ z ( t ) k (cid:107) M .
Starting from assumptions (88), and following the derivation of [TK20] adapted to the iteration orderfrom (70), the heuristic state evolution equations read:Initialize ˆ Q (0)1 x , ˆ Q (0)2 z , ˆ m (0)1 x , ˆ m (0)2 z , ˆ χ (0)1 x , ˆ χ (0)2 z > .m ( t )1 x = E x η f/ ˆ Q ( t )1 x ˆ m ( t )1 x x + (cid:113) ˆ χ ( t )1 x ξ ( t )1 x ˆ Q ( t )1 x (89a) χ ( t )1 x = 1ˆ Q ( t )1 x E η (cid:48) f/ ˆ Q ( t )1 x ˆ m ( t )1 x x + (cid:113) ˆ χ ( t )1 x ξ ( t )1 x ˆ Q ( t )1 x (89b) q ( t )1 x = E η f/ ˆ Q ( t )1 x ˆ m ( t )1 x x + (cid:113) ˆ χ ( t )1 x ξ ( t )1 x ˆ Q ( t )1 x (89c) ˆ Q ( t )2 x = 1 χ ( t )1 x − ˆ Q ( t )1 x (89d) ˆ m ( t )2 x = m ( t )1 x ρ x χ ( t )1 x − ˆ m ( t )1 x (89e) ˆ χ ( t )2 x = q ( t )1 x ( χ ( t )1 x ) − ( m ( t )1 x ) ρ x ( χ ( t )1 x ) − ˆ χ ( t )1 x (89f) m ( t )2 z = ρ x α E (cid:34) λ ( ˆ m ( t )2 x + λ ˆ m ( t )2 z )ˆ Q ( t )2 x + λ ˆ Q ( t )2 z (cid:35) (89g) χ ( t )2 z = 1 α E (cid:34) λ ˆ Q ( t )2 x + λ ˆ Q ( t )2 z (cid:35) (89h) q ( t )2 z = 1 α E (cid:34) λ ( ˆ χ ( t )2 x + λ ˆ χ ( t )2 z )( ˆ Q ( t )2 x + λ ˆ Q ( t )2 z ) (cid:35) + ρ x α E λ (cid:34) λ ( ˆ m ( t )2 x + λ ˆ m ( t )2 z ) ( ˆ Q ( t )2 x + λ ˆ Q ( t )2 z ) (cid:35) (89i) ˆ Q ( t )1 z = 1 χ ( t )2 z − ˆ Q ( t )2 z (89j) ˆ m ( t )1 z = m ( t )2 z ρ z χ ( t )2 z − ˆ m ( t )2 z (89k)22 χ ( t )1 z = q ( t )2 z ( χ ( t )2 z ) − ( m ( t )2 z ) ρ z ( χ ( t )2 z ) − ˆ χ ( t )2 z (89l) m ( t )1 z = E z η g ( y,. ) / ˆ Q ( t )1 z ˆ m ( t )1 z z + (cid:113) ˆ χ ( t )1 z ξ ( t )1 z ˆ Q ( t )1 z (89m) χ ( t )1 z = 1ˆ Q ( t )1 z E η (cid:48) g ( y,. ) / ˆ Q ( t )1 z ˆ m ( t )1 z z + (cid:113) ˆ χ ( t )1 z ξ ( t )1 z ˆ Q ( t )1 z (89n) q ( t )1 z = E η g ( y,. ) / ˆ Q ( t )1 z ˆ m ( t )1 z z + (cid:113) ˆ χ ( t )1 z ξ ( t )1 z ˆ Q ( t )1 z (89o) ˆ Q ( t +1)2 z = 1 χ ( t )1 z − ˆ Q ( t )1 z (89p) ˆ m ( t +1)2 z = m ( t )1 z ρ z χ ( t )1 z − ˆ m ( t )1 z (89q) ˆ χ ( t +1)2 z = q ( t )1 z ( χ ( t )1 z ) − ( m ( t )1 z ) ρ z ( χ ( t )1 z ) − ˆ χ ( t )1 z (89r) m ( t +1)2 x = ρ x E (cid:34) ˆ m ( t )2 x + λ ˆ m ( t +1)2 z ˆ Q ( t )2 x + λ ˆ Q ( t +1)2 z (cid:35) (89s) χ ( t +1)2 x = E (cid:34) Q ( t )2 x + λ ˆ Q ( t +1)2 z (cid:35) (89t) q ( t +1)2 x = E (cid:34) ˆ χ ( t )2 x + λ ˆ χ ( t +1)2 z ( ˆ Q ( t )2 x + λ ˆ Q ( t +1)2 z ) (cid:35) + ρ x E (cid:34) ( ˆ m ( t +1)2 x + λ ˆ m ( t +1)2 z ) ( ˆ Q ( t )2 x + λ ˆ Q ( t +1)2 z ) (cid:35) (89u) ˆ Q ( t +1)1 x = 1 χ ( t +1)2 x − ˆ Q ( t )2 x (89v) ˆ m ( t +1)1 x = m ( t +1)2 x ρ x χ ( t +1)2 x − ˆ m ( t )2 x (89w) ˆ χ ( t +1)1 x = q ( t +1)2 x ( χ ( t +1)2 x ) − ( m ( t +1)2 x ) ρ x ( χ ( t +1)2 x ) − ˆ χ ( t )2 x . (89x)We are interested in the fixed point of these state evolution equations, where χ ( t )1 x = χ ( t )2 x = χ x , q ( t )1 x = q ( t )2 x = q x , m ( t )1 x = m ( t )2 x = m x , χ ( t )1 z = χ ( t )2 z = χ z , q ( t )1 z = q ( t )2 z = q z , and m ( t )1 z = m ( t )2 z = m z areachieved. From there we easily recover (6). However, these equations are not rigorous since thestarting assumptions are not proven. Therefore, we will turn to a rigorous formalism to consolidatethose results. E.2 Rigorous state evolution formalism
We now look into the state evolution equations derived for MLVAMP in [SRF16]. Those equations areproven to be exact in the asymptotic limit, and follow the same algorithm as (70). In particular, they23rovide statistical properties of vectors h x , h x , h z , h z . We can read relations from [FRS18] usingthe following dictionary between our notations and theirs, valid at each iteration of the algorithm: ˆ Q x , ˆ Q x , ˆ Q z , ˆ Q z ←→ γ − , γ +0 , γ +1 , γ − (90a) χ x ˆ Q x , χ x ˆ Q x , χ z ˆ Q z , χ z ˆ Q z ←→ α − , α +0 , α − , α +1 (90b) x , z , ρ x , ρ z , h x , h x , h z , h z ←→ Q , Q , τ , τ , r − , r +0 , r +1 , r − . (90c)Placing ourselves in the asymptotic limit, [FRS18] shows the following equalities: r − = Q + Q − (91a) r +0 = Q + Q +0 (91b) r − = Q + Q − (91c) r +1 = Q + Q +1 (91d)where Q − ∼ N (0 , τ − ) N and Q − ∼ N (0 , τ − ) N are i.i.d. Gaussian vectors. Q +0 , Q +1 have the followingnorms and non-zero correlations with ground-truth vectors Q , Q : τ +0 ≡ (cid:107) Q +0 (cid:107) N c +0 ≡ Q T Q + N (92) τ +1 ≡ (cid:107) Q +1 (cid:107) M c +1 ≡ Q T Q + M . (93)With simple manipulations, we can rewrite (91) as: r − = Q + Q − (94a) V T r +0 d = (cid:18) c +0 τ (cid:19) V T Q + V T ˜Q +0 (94b) r − = Q + Q − (94c) U T r +1 d = (cid:18) c +1 τ (cid:19) U T Q + U T ˜Q +1 (94d)where for k ∈ { , } vectors ˜Q + k = − c + k τ k Q k + Q + k (95)and Q − , Q − have no correlation with ground-truth vectors Q , Q , U T Q , V T Q . Besides, Lemma 5from [RSF19] states that V T ˜Q +0 and U T ˜Q +1 have components that converge empirically to Gaussianvariables, respectively N (0 , τ +0 ) and N (0 , τ +1 ) . Let us now translate this in our own terms, using thefollowing relations that complete our dictionary with state evolution parameters: ˆ m x ˆ Q x ←→ m z ˆ Q z ←→ (96a) ˆ m x ˆ Q x ←→ c +0 τ ˆ m z ˆ Q z ←→ c +1 τ (96b) ˆ χ x ˆ Q x ←→ τ − ˆ χ z ˆ Q z ←→ τ − (96c) ˆ χ x ˆ Q x ←→ τ +0 − ( c +0 ) τ ˆ χ z ˆ Q z ←→ τ +1 − ( c +1 ) τ . (96d)24imple bookkeeping transforms equations (94) into a rigorous statement of starting assumptions (91)from [TK20]. Since those assumptions are now rigorously established in the asymptotic limit, theremaining derivation of state evolution equations (89) holds and provides a mathematically exactstatement. E.3 Scalar equivalent model of state evolution
For the sake of completeness, we provide an overview of the explicit matching between the state evolution formalism of [FRS18], which was developed in a series of papers, and the replica formulation of [TK20], which relies on statistical physics methods. Although not necessary for our proof, it is interesting to develop an intuition about the correspondence between these two faces of the same coin. We have seen in the previous subsection that [FRS18] introduces ground-truth vectors Q_0, Q_1 and estimates r_0^\pm, r_1^\pm which are related to the vectors Q_0^\pm, Q_1^\pm. Let us introduce a few more vectors using the matrices from the singular value decomposition F = UDV^T. Let s_\nu \in \mathbb{R}^N be the vector containing the square roots of the eigenvalues of F^T F, with p_\nu its element-wise distribution, and s_\mu \in \mathbb{R}^M the vector containing the square roots of the eigenvalues of FF^T, with p_\mu its element-wise distribution. Note that these two vectors both contain the singular values of F, but one of them also contains max(M,N) - min(M,N) zero values. p_\mu and p_\nu are both well-defined since p_\lambda is properly defined in Assumptions 1. We also define

P_0 = V^T Q_0, \quad P_0^+ = V^T Q_0^+, \quad P_0^- = V^T Q_0^-, \quad P_1 = U Q_1, \quad P_1^+ = U Q_1^+, \quad P_1^- = U Q_1^-.

By virtue of Lemma 5 from [RSF19], the six previous vectors have elements that converge empirically to Gaussian variables. Hence, all the defined vectors have an element-wise separable distribution, and we can write the state evolution as a scalar model on random variables sampled from those distributions. To do so, we simply write the variables without the bold font: for instance Q_0 \sim p_x, s_\nu \sim p_\nu, and Q_0^- refers to the random variable distributed according to the element-wise distribution of the vector Q_0^-. The scalar random variable state evolution from [FRS18] now reads.

Initialize \gamma_1^{-(0)}, \gamma_0^{-(0)}, \tau_0^{-(0)}, \tau_1^{-(0)}, Q_0^{-(0)} \sim N(0, \tau_0^{-(0)}), Q_1^{-(0)} \sim N(0, \tau_1^{-(0)}), \alpha_0^{-(0)}, \alpha_1^{-(0)}.

Initial pass (ground truth only):
s_\nu \sim p_\nu, \quad s_\mu \sim p_\mu, \quad Q_0 \sim p_x   (97a)
\tau_0 = \mathbb{E}[(Q_0)^2], \quad P_0 \sim N(0, \tau_0)   (97b)
Q_1 = s_\mu P_0, \quad \tau_1 = \mathbb{E}[(s_\mu P_0)^2] = \mathbb{E}[(s_\mu)^2]\,\tau_0, \quad P_1 \sim N(0, \tau_1)   (97c)

Forward pass (estimation):
\alpha_0^{+(t)} = \mathbb{E}\left[ \eta'_{f/\gamma_0^{-(t)}}(Q_0 + Q_0^{-(t)}) \right]   (97d)
\gamma_0^{+(t)} = \frac{\gamma_0^{-(t)}}{\alpha_0^{+(t)}} - \gamma_0^{-(t)}   (97e)
Q_0^{+(t)} = \frac{1}{1 - \alpha_0^{+(t)}}\left\{ \eta_{f/\gamma_0^{-(t)}}(Q_0 + Q_0^{-(t)}) - Q_0 - \alpha_0^{+(t)} Q_0^{-(t)} \right\}   (97f)
K_0^{+(t)} = \mathrm{Cov}\!\left(Q_0, Q_0^{+(t)}\right), \quad \left(P_0, P_0^{+(t)}\right) \sim N\!\left(0, K_0^{+(t)}\right)   (97g)
\alpha_1^{+(t)} = \mathbb{E}\left[ \frac{s_\mu^2\,\gamma_1^{-(t)}}{\gamma_1^{-(t)} s_\mu^2 + \gamma_0^{+(t)}} \right]   (97h)
\gamma_1^{+(t)} = \frac{\gamma_1^{-(t)}}{\alpha_1^{+(t)}} - \gamma_1^{-(t)}   (97i)
Q_1^{+(t)} = \frac{1}{1 - \alpha_1^{+(t)}}\left\{ \frac{s_\mu^2\,\gamma_1^{-(t)}}{\gamma_1^{-(t)} s_\mu^2 + \gamma_0^{+(t)}}\left(Q_1^{-(t)} + Q_1\right) + \frac{s_\mu\,\gamma_0^{+(t)}}{\gamma_1^{-(t)} s_\mu^2 + \gamma_0^{+(t)}}\left(P_0^{+(t)} + P_0\right) - Q_1 - \alpha_1^{+(t)} Q_1^{-(t)} \right\}   (97j)
K_1^{+(t)} = \mathrm{Cov}\!\left(Q_1, Q_1^{+(t)}\right), \quad \left(P_1, P_1^{+(t)}\right) \sim N\!\left(0, K_1^{+(t)}\right)   (97k)

Backward pass (estimation):
\alpha_1^{-(t+1)} = \mathbb{E}\left[ \eta'_{g(y,\cdot)/\gamma_1^{+(t)}}(P_1 + P_1^{+(t)}) \right]   (97l)
\gamma_1^{-(t+1)} = \frac{\gamma_1^{+(t)}}{\alpha_1^{-(t+1)}} - \gamma_1^{+(t)}   (97m)
P_1^{-(t+1)} = \frac{1}{1 - \alpha_1^{-(t+1)}}\left\{ \eta_{g(y,\cdot)/\gamma_1^{+(t)}}(P_1 + P_1^{+(t)}) - P_1 - \alpha_1^{-(t+1)} P_1^{+(t)} \right\}   (97n)
\tau_1^{-(t+1)} = \mathbb{E}\left[ (P_1^{-(t+1)})^2 \right], \quad Q_1^{-(t+1)} \sim N(0, \tau_1^{-(t+1)})   (97o)
\alpha_0^{-(t+1)} = \mathbb{E}\left[ \frac{\gamma_0^{+(t)}}{\gamma_1^{-(t+1)} s_\nu^2 + \gamma_0^{+(t)}} \right]   (97p)
\gamma_0^{-(t+1)} = \frac{\gamma_0^{+(t)}}{\alpha_0^{-(t+1)}} - \gamma_0^{+(t)}   (97q)
P_0^{-(t+1)} = \frac{1}{1 - \alpha_0^{-(t+1)}}\left\{ \frac{s_\nu\,\gamma_1^{-(t+1)}}{\gamma_1^{-(t+1)} s_\nu^2 + \gamma_0^{+(t)}}\left(Q_1^{-(t+1)} + Q_1\right) + \frac{\gamma_0^{+(t)}}{\gamma_1^{-(t+1)} s_\nu^2 + \gamma_0^{+(t)}}\left(P_0^{+(t)} + P_0\right) - P_0 - \alpha_0^{-(t+1)} P_0^{+(t)} \right\}   (97r)
\tau_0^{-(t+1)} = \mathbb{E}\left[ (P_0^{-(t+1)})^2 \right], \quad Q_0^{-(t+1)} \sim N(0, \tau_0^{-(t+1)}).   (97s)
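To make the scalar model concrete, the following minimal Python sketch (not part of the script distributed with the paper) evaluates one forward-pass update (97d)-(97f) by Monte Carlo for the special case f(x) = lambda_1 |x|, for which eta_{f/gamma} is the soft-thresholding function. The prior, the regularization strength and the incoming parameters are illustrative assumptions, not values used in the experiments; the same sampling strategy applies verbatim to the backward pass with eta_{g(y,.)/gamma}.

import numpy as np

# Monte Carlo evaluation of the scalar forward-pass update (97d)-(97f),
# assuming f(x) = lambda_1 * |x| so that the denoiser is soft-thresholding.
rng = np.random.default_rng(0)
n_samples = 1_000_000
lambda_1 = 0.5          # l1 penalty (illustrative)
gamma0_minus = 1.2      # incoming precision gamma_0^-
tau0_minus = 0.4        # variance of the effective Gaussian noise Q_0^-

# Ground truth Q_0 ~ p_x, here a Gauss-Bernoulli prior as in (118)
rho = 0.3
Q0 = rng.standard_normal(n_samples) * (rng.random(n_samples) < rho)
Q0_minus = np.sqrt(tau0_minus) * rng.standard_normal(n_samples)
r = Q0 + Q0_minus                         # effective observation, cf. (91a)

thr = lambda_1 / gamma0_minus
eta = np.sign(r) * np.maximum(np.abs(r) - thr, 0.0)   # eta_{f/gamma_0^-}
eta_prime = (np.abs(r) > thr).astype(float)

alpha0_plus = eta_prime.mean()                                      # (97d)
gamma0_plus = gamma0_minus / alpha0_plus - gamma0_minus             # (97e)
Q0_plus = (eta - Q0 - alpha0_plus * Q0_minus) / (1 - alpha0_plus)   # (97f)

tau0_plus = np.mean(Q0_plus ** 2)         # cf. definition (92)
c0_plus = np.mean(Q0 * Q0_plus)
print(alpha0_plus, gamma0_plus, tau0_plus, c0_plus)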
E.4 Direct matching of the state evolution fixed point equations

To be consistent, we should be able to show that equations (97) allow us to recover equations (89) at their fixed point. Although somewhat tedious, this task is facilitated by the dictionaries (90) and (96). We give here an overview of this matching through a few examples.

• Recovering equation (89e)

Let us start from the rigorous scalar state evolution, in particular equation (97f), which defines the variable Q_0^+. We drop the time indices since we focus on the fixed point. We first compute the correlation

c_0^+ = \mathbb{E}[Q_0 Q_0^+] = \frac{1}{1 - \alpha_0^+}\left\{ \mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] - \tau_0 \right\}   (98)

where we have used \mathbb{E}[(Q_0)^2] = \tau_0. At the fixed point, we know from MLVAMP, or simply by translating equations (75), (76), that

1 - \alpha_0^+ = \alpha_0^-, \qquad \alpha_0^- = \frac{\gamma_0^+}{\gamma_0^- + \gamma_0^+}, \qquad \gamma_0^+ \alpha_0^+ = \gamma_0^- \alpha_0^-.

Simple manipulations take us to

c_0^+ = \frac{\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big]}{\alpha_0^-} - \tau_0\left(1 + \frac{\gamma_0^-}{\gamma_0^+}\right)   (99)

\left(1 + \frac{c_0^+}{\tau_0}\right)\gamma_0^+ = \frac{\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big]\,\gamma_0^+}{\tau_0\,\alpha_0^-} - \gamma_0^-.   (100)

Now let us translate this back into our notations. The term \mathbb{E}[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)] simply translates into m_x, and the rest of the terms can all be changed according to our dictionary. Equation (100) exactly becomes

\hat m_{2x} = \frac{m_x}{\rho_x \chi_x} - \hat m_{1x},   (101)

hence we perfectly recover equation (89e) at the fixed point.

• Recovering equation (89f)

We start again from (97f) and square it:

\mathbb{E}[(Q_0^+)^2] = \frac{1}{(1 - \alpha_0^+)^2}\Big\{ \mathbb{E}\big[\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)^2\big] + \mathbb{E}[(Q_0)^2] + (\alpha_0^+)^2\,\mathbb{E}[(Q_0^-)^2] - 2\,\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] - 2\alpha_0^+\,\mathbb{E}\big[Q_0^-\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] \Big\}   (102)

\tau_0^+ = \frac{1}{(1 - \alpha_0^+)^2}\Big\{ \mathbb{E}\big[\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)^2\big] + \tau_0 + (\alpha_0^+)^2\,\tau_0^- - 2\,\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] - 2\alpha_0^+\,\mathbb{E}\big[Q_0^-\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] \Big\}.   (103)

Since Q_0^- is a Gaussian variable, independent from Q_0, we can use Stein's lemma together with equation (97d) to get

\mathbb{E}\big[Q_0^-\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] = \alpha_0^+\,\tau_0^-.   (104)

Moreover, from (98) we have

(c_0^+)^2 (\alpha_0^-)^2 = \left( \mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] - \tau_0 \right)^2   (105)

\frac{(c_0^+)^2 (\alpha_0^-)^2}{\tau_0} - \frac{\big(\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big]\big)^2}{\tau_0} = -2\,\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big] + \tau_0.   (106)

Replacing (104) and (106) into (103), we reach

\left(\tau_0^+ - \frac{(c_0^+)^2}{\tau_0}\right)(\alpha_0^-)^2 = \mathbb{E}\big[\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)^2\big] - \frac{\big(\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big]\big)^2}{\tau_0} - (\alpha_0^+)^2\,\tau_0^-   (107)

\left(\tau_0^+ - \frac{(c_0^+)^2}{\tau_0}\right)(\gamma_0^+)^2 = \frac{\mathbb{E}\big[\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)^2\big]\,(\gamma_0^+)^2}{(\alpha_0^-)^2} - \frac{\big(\mathbb{E}\big[Q_0\,\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)\big]\big)^2 (\gamma_0^+)^2}{\tau_0 (\alpha_0^-)^2} - (\gamma_0^-)^2\,\tau_0^-.   (108)

Notice that \mathbb{E}[\eta_{f/\gamma_0^-}(Q_0 + Q_0^-)^2] simply translates into our variable q_x from its definition (89c), and our dictionary directly transforms (108) into equation (89f):

\hat\chi_{2x} = \frac{q_x}{\chi_x^2} - \frac{m_x^2}{\rho_x \chi_x^2} - \hat\chi_{1x}.   (109)

• Recovering equation (89s)

We first note that, for any function h,

\mathbb{E}[h(s_\nu)] = \min(1, \alpha)\,\mathbb{E}[h(s_\mu)] + \max(0, 1 - \alpha)\,h(0),   (110)

and that s_\nu^2 is distributed according to p_\lambda. Applying this to h(s) = \frac{\gamma_1^- s^2}{\gamma_1^- s^2 + \gamma_0^+} and starting from (97j), we rewrite

\alpha_1^+ = \mathbb{E}\left[ \frac{\gamma_1^- s_\mu^2}{\gamma_1^- s_\mu^2 + \gamma_0^+} \right]   (111)
= \frac{1}{\alpha}\,\mathbb{E}\left[ \frac{\gamma_1^- \lambda}{\gamma_1^- \lambda + \gamma_0^+} \right]   (112)

with \lambda \sim p_\lambda, which translates into equation (89s):

\chi_z = \frac{1}{\alpha}\,\mathbb{E}\left[ \frac{\lambda}{\hat Q_{2x} + \lambda \hat Q_{2z}} \right].   (113)

In a similar fashion, we can recover all equations (89) by writing the variances and correlations between the scalar random variables defined in (97), and by using the independence properties established in [FRS18]; this directly shows the matching between the two state evolution formalisms at their fixed point.

F Numerical implementation details
Here we give a few derivation details for the implementation of the equations presented in Theorem 1. We provide the Python script used to produce the figures in the main body of the paper as an example. The experimental points were obtained using the convex optimization tools of [PVG+11], with N = 200, M = \alpha N, for a range of values of \alpha. Each point is averaged over 100 instances to get smoother curves. The theoretical prediction was simply obtained by iterating the equations from Theorem 1. This can lead to unstable numerical schemes, and we include a few comments about stability in the code provided with this version of the paper. For Gaussian data, the design matrices were obtained by sampling i.i.d. Gaussian entries with zero mean and variance 1/M, effectively yielding the Marchenko-Pastur distribution [TVV04] for the averages over the eigenvalues of F^T F in the state evolution equations:

\lambda_{F^T F} \sim \max(0, 1 - \alpha)\,\delta(\lambda) + \alpha\,\frac{\sqrt{(\lambda - a)_+ (b - \lambda)_+}}{2\pi\lambda}   (114)

where a = \big(1 - \sqrt{1/\alpha}\big)^2, b = \big(1 + \sqrt{1/\alpha}\big)^2, and (x)_+ = \max(0, x). For the example of a rotationally invariant matrix with arbitrary spectrum, we chose to sample the singular values of F from the uniform distribution \mathcal{U}\big([(1 - \alpha), (1 + \alpha)]\big). This leads to the following distribution for the eigenvalues of F^T F:

\lambda_{F^T F} \sim \max(0, 1 - \alpha)\,\delta(\lambda) + \min(1, \alpha)\,\frac{\mathbb{I}\{\sqrt{\lambda} \in [(1 - \alpha), (1 + \alpha)]\}}{2\big((1 + \alpha) - (1 - \alpha)\big)\sqrt{\lambda}}   (115)

where \mathbb{I} is the indicator function. The only quantities that require additional calculus are the averages of the proximal operators, of their squares, and of their derivatives. We now give the corresponding expressions for the losses and regularizations that were used to produce the figures.
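For illustration, the following minimal Python sketch (independent of the provided script; the sizes and the aspect ratio are arbitrary choices) draws empirical eigenvalues of F^T F for the two ensembles above. Such samples can be used to approximate the spectral averages E_lambda[...] appearing in the state evolution equations by empirical means.

import numpy as np

# Empirical eigenvalue samples of F^T F for the two ensembles used in the experiments.
rng = np.random.default_rng(0)
N, alpha = 1000, 0.7
M = int(alpha * N)

# (i) Gaussian i.i.d. design: entries N(0, 1/M); the eigenvalues of F^T F follow
# the Marchenko-Pastur law (114) in the large-size limit.
F_gauss = rng.standard_normal((M, N)) / np.sqrt(M)
lam_gauss = np.linalg.eigvalsh(F_gauss.T @ F_gauss)

# (ii) Rotationally invariant design: F = U D V^T with Haar-distributed U, V and
# singular values drawn uniformly, as in (115).
def haar(n, rng):
    # QR of a Gaussian matrix, sign-corrected, gives a Haar orthogonal matrix
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

s = rng.uniform(1 - alpha, 1 + alpha, size=min(M, N))
D = np.zeros((M, N))
D[:min(M, N), :min(M, N)] = np.diag(s)
F_rot = haar(M, rng) @ D @ haar(N, rng).T
lam_rot = np.linalg.eigvalsh(F_rot.T @ F_rot)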
F.1 Regularization: elastic net

For the elastic net regularization, we can obtain an exact expression, avoiding any numerical integration. The proximal operator of the elastic net reads:

\mathrm{Prox}_{(\lambda_1 |x| + \lambda_2 \|x\|^2)/\hat Q_x}(\cdot) = \frac{1}{1 + \lambda_2/\hat Q_x}\, s\!\left(\cdot, \frac{\lambda_1}{\hat Q_x}\right)   (116)

where s(\cdot, \lambda_1/\hat Q_x) is the soft-thresholding function:

s\!\left(r_k, \frac{\lambda_1}{\hat Q_x}\right) = \begin{cases} r_k + \lambda_1/\hat Q_x & \text{if } r_k < -\lambda_1/\hat Q_x \\ 0 & \text{if } -\lambda_1/\hat Q_x < r_k < \lambda_1/\hat Q_x \\ r_k - \lambda_1/\hat Q_x & \text{if } r_k > \lambda_1/\hat Q_x. \end{cases}   (117)

We assume that the ground-truth x is drawn from a Gauss-Bernoulli law of the form:

\phi(x) = (1 - \rho)\,\delta(x) + \frac{\rho}{\sqrt{2\pi}}\,\exp(-x^2/2).   (118)

Note that we made our plots with \rho = 1, but this form can be used to study the effect of sparsity in the model. Writing X = \frac{\hat m_x x + \sqrt{\hat\chi_x}\,\xi_x}{\hat Q_x}, and remembering that \xi_x \sim N(0, 1), a little calculus then shows that:

\mathbb{E}\big[\mathrm{Prox}_{f/\hat Q_x}(X)^2\big] = \frac{1}{(1 + \lambda_2/\hat Q_x)^2}\Bigg\{ (1 - \rho)\left[ \frac{\lambda_1^2 + \hat Q_x^2\,\tau_2}{\hat Q_x^2}\,\mathrm{erfc}\!\left(\frac{\lambda_1}{\hat Q_x\sqrt{2\tau_2}}\right) - \frac{\lambda_1\sqrt{2\tau_2}}{\hat Q_x\sqrt{\pi}}\,\exp\!\left(-\frac{\lambda_1^2}{2\hat Q_x^2\,\tau_2}\right) \right] + \rho\left[ \frac{\lambda_1^2 + \hat Q_x^2(\tau_1 + \tau_2)}{\hat Q_x^2}\,\mathrm{erfc}\!\left(\frac{\lambda_1}{\hat Q_x\sqrt{2(\tau_1 + \tau_2)}}\right) - \frac{\lambda_1\sqrt{2(\tau_1 + \tau_2)}}{\hat Q_x\sqrt{\pi}}\,\exp\!\left(-\frac{\lambda_1^2}{2\hat Q_x^2(\tau_1 + \tau_2)}\right) \right] \Bigg\}   (119)

\mathbb{E}\big[\mathrm{Prox}'_{f/\hat Q_x}(X)\big] = \frac{1}{1 + \lambda_2/\hat Q_x}\left[ (1 - \rho)\,\mathrm{erfc}\!\left(\frac{\lambda_1}{\hat Q_x\sqrt{2\tau_2}}\right) + \rho\,\mathrm{erfc}\!\left(\frac{\lambda_1}{\hat Q_x\sqrt{2(\tau_1 + \tau_2)}}\right) \right]   (120)

and

\mathbb{E}\big[x\,\mathrm{Prox}_{f/\hat Q_x}(X)\big] = \frac{\rho\sqrt{\tau_1}}{1 + \lambda_2/\hat Q_x}\,\mathrm{erfc}\!\left(\frac{\lambda_1}{\hat Q_x\sqrt{2(\tau_1 + \tau_2)}}\right)   (121)

where we write \tau_1 = (\hat m_x/\hat Q_x)^2 and \tau_2 = \hat\chi_x/\hat Q_x^2. We now turn to the loss functions.
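As a sanity check on the closed forms above, the three averages can always be estimated by Monte Carlo directly from the definition of X. The short Python sketch below does this for the elastic net proximal (116)-(117); all parameter values are purely illustrative and do not correspond to the figures.

import numpy as np

# Monte Carlo estimates of E[Prox^2], E[Prox'] and E[x Prox] for the elastic net.
rng = np.random.default_rng(0)
lam1, lam2 = 0.3, 0.1          # elastic net parameters lambda_1, lambda_2 (illustrative)
Q, m, chi = 1.5, 0.8, 0.6      # fixed-point parameters \hat Q_x, \hat m_x, \hat chi_x (illustrative)
rho = 1.0                      # density of the Gauss-Bernoulli prior (118)

def prox_elastic_net(r):
    # Proximal operator (116)-(117): rescaled soft-thresholding
    soft = np.sign(r) * np.maximum(np.abs(r) - lam1 / Q, 0.0)
    return soft / (1.0 + lam2 / Q)

def prox_prime(r):
    return (np.abs(r) > lam1 / Q).astype(float) / (1.0 + lam2 / Q)

n = 2_000_000
x = rng.standard_normal(n) * (rng.random(n) < rho)
xi = rng.standard_normal(n)
X = (m * x + np.sqrt(chi) * xi) / Q      # effective field entering the averages

print(np.mean(prox_elastic_net(X) ** 2))   # E[Prox^2]
print(np.mean(prox_prime(X)))              # E[Prox']
print(np.mean(x * prox_elastic_net(X)))    # E[x Prox]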
F.2 Loss functions

For the loss functions, the corresponding averages sometimes have no closed form, as is the case for the logistic loss. In that case, numerical integration cannot be avoided, and we recommend marginalizing analytically over all the variables that can be averaged out. In the present model, if the teacher y is chosen as a sign, the averages reduce to one-dimensional integrals, leading to a stable and reasonably fast implementation (a few minutes to generate a curve comparable to those of Figure 1 for the non-linear models, the ridge regression being very fast). The interested reader can find the corresponding marginalized prefactors in the code provided jointly with this paper. We do not give the details of these computations, as they amount to straightforward but tedious calculus and do not yield any convenient closed form.

Square loss
The square loss is defined as:

f(x, y) = \frac{1}{2}(x - y)^2,   (122)

and its proximal operator and partial derivative read:

\mathrm{Prox}_{\gamma f}(p) = \frac{\gamma}{1 + \gamma}\,p + \frac{1}{1 + \gamma}\,y   (123)

\frac{\partial}{\partial p}\,\mathrm{Prox}_{\gamma f}(p) = \frac{\gamma}{1 + \gamma}.   (124)

Using this form with a plain ridge penalty (elastic net with \ell_1 = 0) leads to great simplifications in the equations of Theorem 1, and we recover the classical expressions obtained for ridge regression in papers such as [HMRT19, GAK20].

Hinge loss
The hinge loss reads:

f(x, y) = \max(0, 1 - yx).   (125)

Assuming y \in \{-1, +1\}, its proximal operator and partial derivative read:

\mathrm{Prox}_{\gamma f}(p) = \begin{cases} p + y/\gamma & \text{if } \gamma(1 - yp) \geqslant 1 \\ y & \text{if } 0 \leqslant \gamma(1 - yp) \leqslant 1 \\ p & \text{if } \gamma(1 - yp) \leqslant 0 \end{cases}   (126)

\frac{\partial}{\partial p}\,\mathrm{Prox}_{\gamma f}(p) = \begin{cases} 1 & \text{if } \gamma(1 - yp) \geqslant 1 \\ 0 & \text{if } 0 \leqslant \gamma(1 - yp) \leqslant 1 \\ 1 & \text{if } \gamma(1 - yp) \leqslant 0. \end{cases}   (127)

Logistic loss

f(x, y) = \log(1 + \exp(-yx)).   (128)

Its proximal operator (at point p) is the solution of the fixed point equation:

x = p + \frac{y}{\gamma(1 + \exp(yx))},   (129)

and its derivative, given that the logistic loss is twice differentiable, reads:

\frac{\partial}{\partial p}\,\mathrm{Prox}_{\gamma f}(p) = \frac{1}{1 + \frac{1}{\gamma}\frac{\partial^2 f}{\partial x^2}\big(\mathrm{Prox}_{\gamma f}(p)\big)}   (130)

= \frac{1}{1 + \frac{1}{4\gamma \cosh^2\!\big(\mathrm{Prox}_{\gamma f}(p)/2\big)}}.   (131)
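For reference, the following minimal Python sketch implements the three proximal operators above under the convention Prox_{gamma f}(p) = argmin_x { gamma (x - p)^2 / 2 + f(x, y) } used in (123)-(131); the logistic proximal solves the scalar equation (129) with a few Newton iterations. Function names and default values are illustrative, not taken from the provided script.

import numpy as np

def prox_square(p, y, gamma):
    # (123): closed form for f(x, y) = (x - y)^2 / 2
    return (gamma * p + y) / (1.0 + gamma)

def prox_hinge(p, y, gamma):
    # (126): three regimes for f(x, y) = max(0, 1 - y x), with y in {-1, +1}
    t = gamma * (1.0 - y * p)
    if t >= 1.0:
        return p + y / gamma
    if t >= 0.0:
        return y
    return p

def prox_logistic(p, y, gamma, n_iter=50):
    # (129): solve gamma (x - p) - y / (1 + exp(y x)) = 0 by Newton's method
    x = p
    for _ in range(n_iter):
        grad = gamma * (x - p) - y / (1.0 + np.exp(y * x))
        hess = gamma + 1.0 / (4.0 * np.cosh(y * x / 2.0) ** 2)
        x -= grad / hess
    return x

# Example: the three proximal operators evaluated at the same point
print(prox_square(0.3, 1.0, 2.0), prox_hinge(0.3, 1.0, 2.0), prox_logistic(0.3, 1.0, 2.0))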
G Proof of Lemma 3: Convergence analysis of Oracle-MLVAMP

In this section, we give the details of the convergence proof of Oracle-MLVAMP. We start by proving the bounds on the \hat Q parameters. We remind the reader that we place ourselves at the fixed point of the state evolution equations, so that the prescriptions from appendix D apply.

G.1 Variance parameters bounds
From the fixed point of the MLVAMP iterations, as used in appendix D, the definition of the averagingoperators in appendix C, and the form of the Jacobian of the proximal operator form appendix B,we obtain the following relations on the ˆ Q parameters involving the Hessian matrices of f and g : Q x + ˆ Q x = 1 N Tr (cid:104) ( H f + ˆ Q x Id ) − (cid:105) = 1 N Tr (cid:104) ( ˆ Q z F T F + ˆ Q x Id ) − (cid:105) (132) Q z + ˆ Q z = 1 N Tr (cid:104) ( H g + ˆ Q z Id ) − (cid:105) = 1 M Tr (cid:104) FF T ( ˆ Q z FF T + ˆ Q x Id ) − (cid:105) . (133)From those, we obtain the following inequalities: λ min ( H f ) (cid:54) ˆ Q x (cid:54) λ max ( H f ) (134) λ min ( H g ) (cid:54) ˆ Q z (cid:54) λ max ( H g ) (135) ˆ Q z λ min ( F T F ) (cid:54) ˆ Q x (cid:54) ˆ Q z λ max ( F T F ) (136) ˆ Q x λ max ( FF T ) (cid:54) ˆ Q z (cid:54) ˆ Q x λ min ( FF T ) . (137)Note that λ min ( FF T ) can be equal to , the right-hand side of the last inequality would then beuninformative. λ max ( F T F ) = λ max ( FF T ) are strictly positive, since the spectrum is assumed to benon-trivial. As advocated by the second order condition of strong convexity from appendix B ordirectly looking at the Hessian matrices, respectively adding a ridge penalty parametrized by λ , ˜ λ to f, g augments λ min ( H f ) , λ max ( H f ) by λ and λ min ( H f ) , λ max ( H f ) by ˜ λ .31 .2 Operator norms and Lipschitz constants Operator norms of the matrices W , W , W , W The norms of the linear operators W , W , W , W can be computed or bounded with respect tothe singular values of the matrix F . The derivations are straightforward and do not require anyspecific mathematical result. Denoting (cid:107) W (cid:107) the operator norm of a given matrix W , we have thefollowing: (cid:107) W (cid:107) = ˆ Q x ˆ Q x (cid:32) | ˆ Q x − ˆ Q z λ min ( F T F ) | ˆ Q x + ˆ Q z λ min ( F T F ) , | ˆ Q x − ˆ Q z λ max ( F T F ) | ˆ Q x + ˆ Q z λ max ( F T F ) (cid:33) (138) (cid:107) W (cid:107) = ˆ Q z χ x ˆ Q x (cid:112) λ max ( F T F )ˆ Q x + ˆ Q z λ min ( F T F ) (139) (cid:107) W (cid:107) = ˆ Q z ˆ Q z (cid:32) | ˆ Q x − ˆ Q z λ min ( FF T ) | ˆ Q x + ˆ Q z λ min ( FF T ) , | ˆ Q x − ˆ Q z λ max ( FF T ) | ˆ Q x + ˆ Q z λ max ( FF T ) (cid:33) (140) (cid:107) W (cid:107) = ˆ Q x χ z ˆ Q z (cid:112) λ max ( F T F )ˆ Q x + ˆ Q z λ min ( F T F ) . (141) Lispchitz constants of ˜ O , ˜ O We now derive upper bounds of the Lipschitz constants of ˜ O , ˜ O using the convex analysis reminderin appendix B. We give detail for ˜ O , the derivation is identical for ˜ O . Let ( σ , β ) ∈ R ∗ bethe strong-convexity and smoothness constants of f . Note that, from the upper and lower boundsobtained in appendix G.1, we have σ (cid:54) ˆ Q x (cid:54) β . Case 1: < σ < β Proposition 3 gives the following expression:Prox Q x f = 12 (cid:32)
11 + σ / ˆ Q x + 11 + β / ˆ Q x (cid:33) Id + 12 (cid:32)
11 + σ / ˆQ −
11 + β / ˆQ (cid:33) S (142)where S is a non-expansive operator. Replacing in the expression of ˜ O leads to: ˜ O = ˆ Q x ˆ Q x (cid:18)(cid:18) χ x (cid:18) Q x + σ + 1ˆ Q x + β (cid:19) − (cid:19) Id + 12 χ x (cid:18) + σ − + β (cid:19) S (cid:19) (143)which, knowing that ˆ Q x + ˆ Q x = χ x , and separating the case where the first term of the sum in 143is negative or positive, ˜ O has Lipschitz constant: ω = ˆ Q x ˆ Q x max (cid:32) ˆ Q x − σ ˆ Q x + σ , β − ˆ Q x ˆ Q x + β (cid:33) . (144) Case 2: < σ = β In this case, we have from Proposition 3: (cid:107)
Prox Q x f ( x ) − Prox Q x f ( y ) (cid:107) = (cid:32)
11 + σ / ˆ Q x (cid:33) (cid:107) x − y (cid:107) (145)32hich, with the firm non-expansiveness of the proximal operator gives, for any x, y ∈ R : (cid:107) ˜ O ( x ) − ˜ O ( y ) (cid:107) = (cid:32) ˆ Q x ˆ Q x (cid:33) (cid:18) Q x χ x (cid:107) Prox Q x f ( x ) − Prox Q x f ( y ) (cid:107) (146) − χ x (cid:28) x − y, Prox Q x f ( x ) − Prox Q x f ( y ) (cid:29) + (cid:107) x − y (cid:107) (cid:19) (147) (cid:54) (cid:32) ˆ Q x ˆ Q x (cid:33) (cid:18)(cid:32) Q x χ x − χ x (cid:33) (cid:107) Prox Q x f ( x ) − Prox Q x f ( y ) (cid:107) + (cid:107) x − y (cid:107) (cid:19) (148) = (cid:32) ˆ Q x ˆ Q x (cid:33) (cid:32) Q x χ x − χ x (cid:33) (cid:32)
11 + σ / ˆ Q x (cid:33) + 1 (cid:107) x − y (cid:107) (149) = (cid:32) ˆ Q x − ˆ Q x ( ˆ Q x + σ ) + 1 (cid:33) (cid:107) x − y (cid:107) . (150)The upper bound on the Lipschitz constant is therefore: ω = ˆ Q x ˆ Q x (cid:115) Q x − ˆ Q x )( ˆ Q x + σ ) . (151) Case 3: no strong convexity or smoothness assumption
In this case the only informationwe have is the firm nonexpansiveness of the proximal operator, which leads us to the same derivationas the previous one up to (148), where the first term in the sum can be positive or negative. Thisyields the Lipschitz constant: ω = ˆ Q x ˆ Q x max (cid:18) , A A (cid:19) . (152) Recovering (35)
In our proof, we make no assumption on the strong-convexity or smoothnessof the function, but adding the ridge penalties λ , ˜ λ brings us for both ˜ O and ˜ O to either the firstof the second case above. It is straightforward to see that the Lipschitz constant (151) is a upperbound of (144). We thus use (151) for generality, and recover the expressions (35) shown in the mainbody of the paper. ω = ˆ Q x ˆ Q x (cid:115) Q x − ˆ Q x ( ˆ Q x + λ ) (153) ω = ˆ Q z ˆ Q z (cid:115) Q z − ˆ Q z ( ˆ Q z + ˜ λ ) . (154) G.3 Dynamical system convergence analysis
We are now ready to prove the convergence lemma 3. The choice of additional regularization is λ arbitrarily large, and ˜ λ fixed but non-zero. ˆ Q x , ˆ Q z can thus be made arbitrarily large, and ˆ Q z , ˆ Q x remain finite. We write the corresponding linear matrix inequality (37) and expand theconstraint term: (cid:23) (cid:20) A T PA − τ P A T PBB T PA B T PB (cid:21) + (cid:20) β C T M C + β C T M C β C T M D + β C T M D β D T M C + β D T M C β D T M D + β D T M D (cid:21) (155)33 little basic algebra shows that: C T M C = (cid:20) M × M M × N N × M ω I N × N (cid:21) C T M C = (cid:20) ω W T W M × N N × M N × N (cid:21) (156) C T M D = 0 ( M + N ) × ( M + N ) D T M C = 0 ( M + N ) × ( M + N ) (157) C T M D = (cid:20) M × M ω W T W N × M N × N (cid:21) D T M C = (cid:20) M × M M × N ω W T W N × N (cid:21) (158) D T M D = (cid:20) M × M M × N N × M − I N × N (cid:21) D T M D = (cid:20) − I M × M M × N N × M ω W T W (cid:21) (159)where all the matrices constituting the blocks have been defined in section 4. This gives the followingform for the constraint matrix: (cid:20) H H H T H (cid:21) (160)where H = (cid:20) β ω W T W M × N N × M β ω I N × N (cid:21) (161) H = (cid:20) M × M β ω W T W N × M N × N (cid:21) (162) H = (cid:20) − β I M × M M × N N × M − β I N × N + β ω W T W (cid:21) (163)thus the LMI becomes: (cid:23) (cid:20) − τ P + H H H T B T PB + H (cid:21) . (164)We take P as block diagonal: P = (cid:20) P M × N N × M P (cid:21) (165)where P ∈ R M × M and P ∈ R N × N are positive definite (no zero eigenvalues) and diagonalizable inthe same basis as F T F , which is also the eigenbasis of W , W , W T W , W T W . We then have: B T PB = (cid:20) P + W T P W W T P W W T P W W T P W (cid:21) . (166)We are then trying find the conditions for the following problem to be feasible with < τ < : (cid:20) τ P − H − H − H T − ( B T PB + H ) (cid:21) (cid:23) (167)Schur’s lemma then says that the strict version of (167), which we will consider, is equivalent [HJ12]to: − ( B T PB + H ) (cid:31) and τ P − H + H ( B T PB + H ) − H T (cid:31) (168)We start with − ( B T PB + H ) . 34 onditions for − ( B T PB + H ) (cid:31) We want to derive the conditions for: (cid:20) β I M × M − P − W T P W − W T P W − W T P W β I N × N − β ω W T W − W T P W (cid:21) (cid:31) . (169)Applying Schur’s lemma again gives the equivalent problem: β I N × N − β ω W T W − W T P W (cid:31) (170) β I M × M − P − W T P W − W T P W ( β I N × N − β ω W T W − W T P W ) − W T P W (cid:31) . (171)We start with (170). A sufficient condition for it to hold true is: β > β ω λ max ( W T W ) + λ max ( P ) λ max ( W T W ) . (172)From appendix G.1, we have: λ max ( W T W ) (cid:54) (cid:32) ˆ Q x ˆ Q x (cid:33) max (cid:32) | ˆ Q x − ˆ Q z λ min ( F T F ) | ˆ Q x + ˆ Q z λ min ( F T F ) , | ˆ Q x − ˆ Q z λ max ( F T F ) | ˆ Q x + ˆ Q z λ max ( F T F ) (cid:33) (173) (cid:54) max (cid:32) − ˆ Q z ˆ Q x λ min ( F T F ) (cid:33) , (cid:32) − ˆ Q z ˆ Q x λ max ( F T F ) (cid:33) = b (174)and ω λ max ( W T W ) (cid:54) (cid:32) ˆ Q z ˆ Q z (cid:33) (cid:32) Q z ) − ( ˆ Q z ) ( ˆ Q z + ˜ λ ) (cid:33)(cid:32) ˆ Q x χ z ˆ Q z (cid:33) λ max ( F T F )( ˆ Q x + ˆ Q z λ min ( F T F )) (175) (cid:54) ˆ Q z (cid:32) λ + ˜ λ ˆ Q z + ( ˆ Q z ) ˆ Q z (cid:33) (cid:32) ˆ Q z + ˆ Q z ˆ Q z ( ˆ Q z + ˜ λ ) (cid:33) λ max ( F T F ) . 
(176)For arbitrarily large ˆ Q z , the quantity (cid:16) λ + ˜ λ ˆ Q z + ( ˆ Q z ) ˆ Q z (cid:17) (cid:16) ˆ Q z + ˆ Q z ˆ Q z ( ˆ Q z +˜ λ ) (cid:17) λ max ( F T F ) is triviallybounded above whatever the value of ˜ λ , ˆ Q z . Let b be such an upper bound independent of λ , ˆ Q x , ˆ Q z . The sufficient condition for (170) to hold thus becomes: β > β ˆ Q z b + λ max ( P ) b (177)where b , b are constants independent of λ , ˆ Q x , ˆ Q z .We now turn to (171). A sufficient condition for it to hold is: β > λ max ( P ) + λ max ( W T W ) λ max ( P )+ ( λ max ( P )) λ max ( W T W ) λ max ( W T W ) β − β ω λ max ( W T W ) − λ max ( P ) λ max ( W T W ) (178)35ote that condition (170) ensures that the denominator in (178) is non-zero. We then have: λ max ( W T W ) (cid:54) (cid:32) ˆ Q z χ x ˆ Q x (cid:33) λ max ( F T F )( ˆ Q x + ˆ Q z λ min ( F T F ) (179) (cid:54) ˆ Q z (1 + ˆ Q x ˆ Q x )ˆ Q x λ max ( F T F ) (180)This quantity can be bounded above by a constant independent of λ , ˆ Q x , ˆ Q z for arbitrarily large ˆ Q x . Let b be such a constant . Then a sufficient condition for condition (171) to hold is: β > λ max ( P ) + b λ max ( P ) + b b ( λ max ( P )) β − β ˆ Q z b − λ max ( P ) b (181)we see that β must scale linearly with ˆ Q z which is one of the parameters that is made arbitrarilylarge. Then β also needs to become arbitrarily large for the conditions to hold. We choose β = 2 β ˆ Q z b + λ max ( P ) b for the rest of the proof. Condition (177) is then automatically verified,and β needs to be chosen according to condition (181), which becomes: β > λ max ( P ) + b λ max ( P ) + b b λ max ( P ) β ˆ Q z b (182)This obviously has a bounded solution for large values of ˆ Q z . We now turn to the second part of(168). Conditions for τ P − H + H ( B T PB + H ) − H T (cid:31) We need to study the term − H ( B T PB + H ) − H T (we study it with the − sign since themiddle matrix is negative definite from conditions (170,171) which are now verified). As we willsee, because of the form of H , we don’t need to explicitly compute the whole inverse. Let Z = − ( B T PB + H ) − = (cid:20) Z Z Z T Z (cid:21) ( Z has the same block dimensions as ( B T PB + H ) ). We thenhave: − H ( B T PB + H ) − H T = H ZH T (183) = (cid:20) β ω W T W Z W T W M × N N × M N × N (cid:21) . (184)We thus only need to characterize the lower right block of Z . It is easy to see that conditions (170)and (171) also enforce that both the Schur complements associated with the upper left and lowerright blocks of − ( B T PB + H ) are invertible, thus giving the following form for Z using the blockmatrix inversion lemma [HJ12]: Z = ( β I N − β ω W T W (185) − W T P W − W T P W ( β I M − P − W T P W ) − W T P W ) − . (186)36e thus have the following upper bound on the largest eigenvalue of Z : λ max ( Z ) (cid:54) β − β ˆ Q z b − λ max ( P ) b − b b λ max ( P ) β − λ max ( P ) − b λ max ( P ) , (187)using the prescription β = 2 β ˆ Q z b + λ max ( P ) b , we get: λ max ( Z ) = 1 β ˆ Q z b − b b λ max ( P ) β − λ max ( P ) − b λ max ( P ) (cid:54) b ˆ Q z (188)where b is a constant independent of the arbitrarily large parameters λ , ˆ Q x , ˆ Q z . Thus λ max ( Z ) can be made arbitrarily small by making λ arbitrarily large.We now want to find conditions for τ P − H + H ( B T PB + H ) − H T (cid:31) which is equivalent to: (cid:20) τ P − β ω W T W − β ω W T W Z W T W M × N N × M τ P − β ω I N (cid:21) . (189)This involves a block diagonal matrix, we only need to check the positive-definiteness of the separateblocks. 
We start with the upper left block, for which a sufficient condition is: τ λ min ( P ) − β ω λ max ( W T W ) − β ω λ max ( W T W ) λ max ( W T W ) λ max ( Z ) > (190)Using the bounds from appendix G.1, we have: ω λ max ( W T W ) (cid:54) (cid:32) ˆ Q z ˆ Q z (cid:33) (cid:18) Q z ) − ( ˆ Q z ) ( ˆ Q z + ˜ λ ) (cid:19) λ max ( W T W ) (191) (cid:54) (cid:18) Q z ) − ( ˆ Q z ) ( ˆ Q z + ˜ λ ) (cid:19) max (cid:32) | ˆ Q x − ˆ Q z λ min ( F T F ) | ˆ Q x + ˆ Q z λ min ( F T F ) , | ˆ Q x − ˆ Q z λ max ( F T F ) | ˆ Q x + ˆ Q z λ max ( F T F ) (cid:33) (192) (cid:54) λ ˆ Q z + ˜ λ + ( ˆ Q z ) ( ˆ Q z + ˜ λ ) max((1 − ˆ Q z ˆ Q x λ min ( F T F )) , (1 − ˆ Q z ˆ Q x λ max ( F T F )) ) (193) (cid:54) Q z (2˜ λ + (˜ λ + ( ˆ Q z ) )ˆ Q z ) max((1 − ˆ Q z ˆ Q x λ min ( F T F )) , (1 − ˆ Q z ˆ Q x λ max ( F T F )) ) (194)Thus there exists a constant b , independent of λ , ˆ Q z , ˆ Q x such that, for sufficiently large ˆ Q z : ω λ max ( W T W ) (cid:54) b ˆ Q z . (195)Remember that we had: ω λ max ( W T W ) (cid:54) ˆ Q z b , (196)which gives the following sufficient condition for the upper left block in (189): τ λ min ( P ) − β b ˆ Q z − β b b b ˆ Q z > . (197)A sufficient condition for the lower right block in (189) then reads: τ λ min ( P ) − β ω > , (198)37here we have: β ω = (cid:32) ˆ Q x ˆ Q x (cid:33) (cid:32) Q x ) − ( ˆ Q x ) ( ˆ Q x + λ ) (cid:33) (2 β ˆ Q z b + λ max ( P ) b ) (199) = 1ˆ Q x ( ˆ Q x ) (cid:32) Q x ) − ( ˆ Q x ) ( ˆ Q x + λ ) (cid:33) (cid:32) β ˆ Q z ˆ Q x b + λ max ( P ) b ˆ Q x (cid:33) (200)We remind the reader that ˆ Q z , ˆ Q x grow linearly with λ . Thus the dominant scaling at large λ is(exchanging ˆ Q x with ˆ Q z up to a constant): β ω (cid:54) b ˆ Q z , (201)where b is a constant independent of the arbitrarily large quantities. The final condition becomes: τ λ min ( P ) − β b ˆ Q z − β b b b ˆ Q z > (202) τ λ min ( P ) − b ˆ Q z > (203)where we want τ < . We now choose τ = ˜ c/ ˆ Q z with a constant ˜ c independent of λ , ˆ Q z , ˆ Q x thatverifies ˜ c > max (cid:16) ( β b + β b b b λ min ( P ) , b λ min ( P ) (cid:17) , such that: ˜ c ˆ Q z λ min ( P ) − β b ˆ Q z − β b b b ˆ Q z > (204) ˜ c ˆ Q z λ min ( P ) − b ˆ Q z > . (205)Since β is bounded for large values of ˆ Q z , and the b i and c are constants independent of λ , ˆ Q x , ˆ Q z ,one can then just enforce ˜ c < ˆ Q z using the additional ridge penalty parametrized by λ on theregularization to obtain τ < and a linear convergence rate proportional to (cid:113) ˜ cλ that ensuresconvergence. We see that the eigenvalues of the matrix P are of little importance as long as theyare non-vanishing. We choose P as the identity. In the statement of Lemma 3, we write c the exactconstant which comes from going from ˆ Q z to λ2