Understanding Implicit Regularization in Over-Parameterized Nonlinear Statistical Model
Jianqing Fan∗   Zhuoran Yang∗   Mengxin Yu∗

July 17, 2020
Abstract
We study the implicit regularization phenomenon induced by simple optimization algorithms in over-parameterized nonlinear statistical models. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role of implicit regularization in nonlinear models without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In particular, for the vector single index model with Gaussian covariates, our proposed estimator is shown to enjoy the oracle statistical rate. Our results capture the implicit regularization phenomenon in over-parameterized nonlinear and noisy statistical models with possibly heavy-tailed data.
∗ Department of Operations Research and Financial Engineering, Princeton University; email: {jqfan, zy6, mengxiny}@princeton.edu. Research supported by the NSF grants DMS-1662139 and DMS-1712591, the ONR grant N00014-19-1-2120, and the NIH grant 2R01-GM072611-16.

1 Introduction

With the astonishing empirical success in various application domains such as computer vision (Voulodimos et al., 2018), natural language processing (Otter et al., 2020; Torfi et al., 2020), and reinforcement learning (Arulkumaran et al., 2017; Li, 2017), deep learning (LeCun et al., 2015; Goodfellow et al., 2016; Fan et al., 2019a) has become one of the most prevalent classes of machine learning methods. When applying deep learning to supervised learning tasks such as regression and classification, the regression function or classifier is represented by a deep neural network, which is learned by minimizing a loss function of the network weights. Here the loss function is defined as the empirical risk computed from the training data, and the optimization problem is usually solved by gradient-based optimization methods. Due to the nonlinearity of the activation function and the multi-layer functional composition, the landscape of the loss function is highly nonconvex, with many saddle points and local minima (Dauphin et al., 2014; Swirszcz et al., 2016; Yun et al., 2019). Moreover, the neural network is often over-parameterized in the sense that the total number of network weights exceeds the number of training data, making the regression or classification problem ill-posed from a statistical perspective. Surprisingly, however, it is often observed empirically that simple algorithms such as (stochastic) gradient descent tend to find the global minimum of the loss function despite nonconvexity. Moreover, the obtained solution also generalizes well to unseen data with small test error (Neyshabur et al., 2014; Zhang et al., 2017). These mysterious observations cannot be fully explained by the classical theory of nonconvex optimization and generalization bounds via uniform convergence.

To understand such an intriguing phenomenon, Neyshabur et al. (2014) and Zhang et al. (2017) show empirically that the generalization stems from an "implicit regularization" of the optimization algorithm. Specifically, they observe that, in over-parameterized statistical models, although the optimization problem has bad local minima with large generalization error, the optimization algorithm, usually a variant of gradient descent, guards the iterates away from such bad local minima and prefers solutions that generalize well. Thus, without adding any regularization term to the optimization objective, the implicit preference of the optimization algorithm itself plays the role of regularization. Implicit regularization has been shown to be indispensable in training deep learning models (Neyshabur et al., 2014, 2017; Zhang et al., 2017; Keskar et al., 2017; Poggio et al., 2017; Wilson et al., 2017).

In order to characterize the implicit regularization effect, Gunasekar et al. (2017) and Li et al. (2018) provide empirical evidence and theoretical guarantees for the implicit regularization of gradient descent for least-squares regression with a two-layer linear neural network, i.e., low-rank matrix sensing. They show that gradient descent biases towards the minimum nuclear norm solution when the initialization is close to the origin, the stepsizes are sufficiently small, and no explicit regularization is imposed.
More specifically, when the true parameter is a rank r positive-semidefinite matrixin R d ˆ d , they rewrite the parameter as UU J , where U P R d ˆ d , and propose to estimate the trueparameter by updating U via gradient descent. Li et al. (2018) proves that, with r O p r d q i.i.d.observations of the model, gradient descent provably recovers the true parameter with accuracy,where r O p¨q hides absolute constants and poly-logarithmic terms. Thus, in over-parametrized ma-trix sensing problems, the implicit regularization of gradient descent can be viewed as equivalentto adding a nuclear norm penalty explicitly. See also Arora et al. (2019a) for a related topic ondeep linear network.Moreover, Zhao et al. (2019); Vaˇskeviˇcius et al. (2019) recently study the implicit regularizationof gradient descent for high-dimensional linear regression with a sparse signal parameter, whichis a vector in R p with s nonzero entries. They propose to re-parametrize the parameter usingtwo vectors in R p via the Hadamard product and estimate the true parameter via un-regularizedgradient descent with proper initialization, stepsizes, and the number of iterations. They proveindependently that, with n “ O p s log p q i.i.d. observations, gradient descent yields an estimator ofthe true parameter with optimal statistical accuracy. More interestingly, when the nonzero entries2f the true parameter all have sufficiently large magnitude, the proposed estimator attains theoracle O p a s log s { n q rate that is independent of the ambient dimension p . Hence, for sparse linearregression, the implicit regularization of gradient descent has the same effect as the folded concavepenalties (Fan et al., 2014) such as smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001)and minimax concave penalty (MCP) (Zhang et al., 2010).However, the aforementioned works all establish theoretical results for linear models, whereasthe deep learning models are highly nonlinear. Besides, these works all assume the response variablein the linear model has zero or light-tailed noise and the covariates satisfy the restricted isometryproperty (RIP) condition (Cand´es, 2008). Thus, one question is left open:Can we characterize the implicit regularization of optimization algorithms for nonlinear andover-parameterized statistical models with possibly heavy-tailed data?In this work, we focus on the single index model, where the response variable Y and the covariate X satisfy Y “ f px X, β ˚ yq ` (cid:15) , where β ˚ is the true parameter, (cid:15) is the random noise, and f : R Ñ R is an unknown (nonlinear) link function. Here β ˚ is either a s -sparse vector in R p or a rank r matrix in R d ˆ d . Since f is unknown, we further assume that the (cid:96) - or Frobenius norm of β ˚ isequal to one. Our goal is to recover the true parameter β ˚ given n i.i.d. observations of the model.In a single index model, since the link function f is unknown, it is infeasible to directly estimate β ˚ via nonlinear least-squares. Moreover, jointly minimizing the least-squares loss function withrespect to β ˚ and f is computationally intractable. To overcome these challenges, a recent lineof research proposes to estimate β ˚ by the method of moments when the distribution of X isknown. This helps us provide a deep understanding on the implicit regularization induced byover-parameterization in the nonlinear models without excessive technicality and eliminate othercomplicated factors that convolve insights. 
Specifically, when X is a standard Gaussian randomvariable, Stein’s identity (Stein et al., 1972) implies that the expectation of Y ¨ X is proportionalto β ˚ . Thus, despite the nonlinear link function, β ˚ can be accurately estimated by neglecting f and fitting a regularized least-squares regression. In particular, when β ˚ is a sparse vector,Plan and Vershynin (2016); Plan et al. (2017) prove that the Lasso estimator achieves the optimalstatistical rate of convergence. Subsequently, such an approach has been extended to the casesbeyond Gaussian covariates. In particular, Goldstein et al. (2018); Wei (2018); Goldstein andWei (2019) allow the covariates to follow an elliptically symmetric distribution that can be heavy-tailed. In addition, utilizing a generalized version of Stein’s identity (Stein et al., 2004), Yang et al.(2017a) extends the Lasso approach to the setting where the covariate X has a known density p .Specifically, when p is known, we can define the score function S p p¨q by S p p¨q “ ´ ∇ log p p¨q ,which satisfies that E r Y ¨ S p p X qs identifies the direction of β ˚ . Thus, the true parameter can beestimated by via an M -estimation problem with S p p X q served as the covariate.Following the approach of Yang et al. (2017a), we aim to estimate β ˚ via Stein’s identity andwithout any explicit regularization. To this end, we first adopt the quadratic loss function inYang et al. (2017a) and rewrite the parameter of interest by over-parameterization. When β ˚ isa sparse vector in R p , we adopt a Hadamard product parameterization (Hoff, 2017; Zhao et al.,2019; Vaˇskeviˇcius et al., 2019) and write β ˚ as w d w ´ v d v , where both w and v are vectors3n R p . We propose to minimize the loss function as a function of the new parameters via gradientdescent, where both w and v are initialized near an all-zero vector and the stepsizes are fixed tobe a sufficiently small constant η ą
0. Furthermore, when β ˚ is a low-rank matrix, we similarlyrepresent β ˚ as WW J ´ VV J and propose to recover β ˚ by applying the gradient descent algorithmto the quadratic loss function under the new parameterization.Furthermore, the analysis of our algorithm faces the following two challenges. First, due to over-parameterization, there exist exponentially many stationary points of the population loss functionthat are far from the true parameter. Thus, it seems that the gradient descent algorithm would belikely to return a stationary point that incurs a large error. Second, both the response Y and thescore S p p X q can be heavy-tailed random variables. Thus, the gradient of the empirical loss functioncan deviate significantly from its expectation, which poses an additional challenge to establishingthe statistical error of the proposed estimator.To overcome these difficulties, in our algorithm, instead of estimating E r Y ¨ S p p X qs by itsempirical counterpart, we construct robust estimators via proper truncation techniques, which havebeen widely applied in high-dimensional M -estimation problems with heavy-tailed data (Fan et al.,2020b; Zhu, 2017; Wei and Minsker, 2017; Minsker, 2018; Fan et al., 2020a; Ke et al., 2019; Minskerand Wei, 2020). These robust estimators are then employed to compute the update directions of thegradient descent algorithm. Moreover, despite the seemingly perilous loss surface, we prove that,when initialized near the origin and sufficiently small stepsizes, the gradient descent algorithm guardthe iterates from bad stationary points. More importantly, when the number of iterations is properlychosen, the obtained estimator provably enjoys (near-)optimal O p a s log p { n q and O p a rd log d { n q (cid:96) -statistical rates under the sparse and low-rank settings, respectively. Moreover, for sparse β ˚ ,when the magnitude of the nonzero entries is sufficiently large, we prove that our estimator enjoys anoracle O p a s log n { n q (cid:96) -statistical rate and that is independent of the dimensionality p . In addition,we also establish near-optimal (cid:96) -statistical rates. Our proof is based on a jointly statistical andcomputational analysis of the gradient descent dynamics. Specifically, we decompose the iteratesinto a signal part and a noise part, where the signal part share the same sparse or low-rank structuresas the true signal and the noise part are orthogonal to the true signal. We prove that the signalpart converges to the true parameter efficiently whereas the noise part accumulates at a ratherslow rate and thus remains small for a sufficiently large number of iterations. Such a dichotomybetween the signal and noise parts characterizes the implicit regularization of the gradient descentalgorithm and enables us to establish the statistical error of the final estimator.To summarize, our contribution is three-fold. First, for sparse and low-rank single index modelswhere the random noise is heavy-tailed, we employ a quadratic loss function based on a robustestimator of E r Y ¨ S p p X qs and propose to estimate β ˚ by combining over-parameterization andregularization-free gradient descent. 
Second, we prove that, when the initialization, stepsizes,and stopping time of the gradient descent algorithm are properly chosen, the proposed estimatorachieves near-optimal statistical rates of convergence under both the sparse and low-rank settings.Moreover, when the true parameter is sparse and its nonzero entries all sufficiently large in absolutevalue, our estimator provably enjoys the oracle statistical rate. Finally, our theory complements theresults of Li et al. (2018) by allowing the true parameter to be a general low-rank and symmetric4atrix and incorporating heavy-tailed noise in the model. Our work belongs to the recent line of research on understanding the implicit regularization ofgradient-based optimization methods in various statistical models. For over-parameterized logisticregression with separable data, Soudry et al. (2018) proves that the iterates of the gradient de-scent algorithm converge to the max-margin solution. This work is extended by Ji and Telgarsky(2019b,a); Gunasekar et al. (2018b); Nacson et al. (2019); Ji and Telgarsky (2019c) for studyinglinear classification problems with other loss functions, parameterization, or training algorithms.Montanari et al. (2019); Deng et al. (2019) study the asymptotic generalization error of the max-margin classifier under the over-parameterized regime. Recently, for neural network classifiers, Xuet al. (2018); Lyu and Li (2020); Chizat and Bach (2020) prove that gradient descent converges tothe max-margin classifier under certain conditions. In addition, various works have established theimplicit regularization phenomenon for regression. For example, for low-rank matrix sensing, Liet al. (2018); Gunasekar et al. (2017) show that, with over-parameterization, unregularized gradientdescent finds the optimal solution efficiently. For various models including matrix factorization,Ma et al. (2020) proves that the iterates of gradient descent stays in a benign region that enjoyslinear convergence. Arora et al. (2019a); Gidel et al. (2019) characterize the implicit regularizationof gradient descent in deep matrix factorization. For sparse linear regression, Zhao et al. (2019);Vaˇskeviˇcius et al. (2019) prove that, with re-parameterization, gradient descent finds an estimatorwhich attains the optimal statistical rate of convergence. Gunasekar et al. (2018a) studies theimplicit regularization of generic optimization methods in over-parameterized linear regression andclassification. Furthermore, for nonlinear regression models, Du et al. (2018) proves that, for neuralnetworks with homogeneous action functions, gradient descent automatically balances the weightsacross different layers. Oymak and Soltanolkotabi (2018); Azizan et al. (2019) show that, in over-parameterized models, when the loss function satisfies certain conditions, both gradient descentand mirror descent algorithms converge to one of the global minima which is the closest to theinitial point.Moreover, in linear regression, when initialized from the origin, gradient descent converges tothe minimum (cid:96) -norm (min-norm) solution. Besides, as shown in Soudry et al. (2018), gradientdescent converges to the max-margin classifier in over-parameterized logistic regression. There is arecent line of works on characterizing the risk of the min-norm and max-margin estimators underthe over-parametrized setting where p is larger than n . See, e.g, Belkin et al. (2018, 2019); Liang andRakhlin (2018); Bartlett et al. (2019); Hastie et al. (2019); Derezi´nski et al. 
(2019); Ma et al. (2019);Mei and Montanari (2019); Montanari et al. (2019); Kini and Thrampoulidis (2020); Muthukumaret al. (2020) and the references therein. These works prove that, as p grows to be larger than n ,the risk first increases and then magically decreases after a certain threshold. Thus, there existsanother bias-variance tradeoff in the over-parameterization regime. Such a mysterious phenomenonis coined by Belkin et al. (2018) as the “double-descent” phenomenon, which is conceived as anoutcome of implicit regularization and over-parameterization.Furthermore, there exists a large body of literature on the optimization and generalization of5raining over-parameterized neural works. In a line of research, using mean-field approximation,Chizat and Bach (2018); Rotskoff and Vanden-Eijnden (2018); Sirignano and Spiliopoulos (2018);Mei et al. (2018, 2019); Wei et al. (2019) propose various optimization approaches with probableconvergence to the global optima of the training loss. Besides, with different scaling, another line ofworks study the convergence and generalization of gradient-based methods for over-parameterizedneural networks under the framework of the neural tangent kernel (NTK) (Jacot et al., 2018). See,e.g., Du et al. (2019b,a); Zou et al. (2018); Chizat et al. (2019); Allen-Zhu et al. (2019a,b); Jacotet al. (2018); Cao and Gu (2019); Arora et al. (2019b); Lee et al. (2019); Weinan et al. (2019);Yehudai and Shamir (2019); Bai and Lee (2019); Huang et al. (2020) and the references therein.Their theory shows that a sufficiently wide neural network can be well approximated by the randomfeature model (Rahimi and Recht, 2008). Then, with sufficiently small stepsizes, (stochastic) gradi-ent descent algorithm implicitly forces the network weights to stay in a neighborhood of the initialvalue. Such an implicit regularization phenomenon enables these papers to establish convergencerates and generalization errors for neural network training.Furthermore, our work is also closely related to the large body of literature on single indexmodels. Single index model has been extensively studied in the low-dimensional setting. See, e.g.,Han (1987); McCullagh and Nelder (1989); Hardle et al. (1993); Carroll et al. (1997); Xia et al.(1999); Horowitz (2009) and the references therein. Most of these works propose to jointly estimate β ˚ and f based on solving the global optimum of nonconvex M -estimation problems. Thus, thesemethods can be computationally intractable in the worst case. Under the Gaussian or ellipticalassumption on the covariates, a more related line of research propose efficient estimators of thedirection of β ˚ based on factorizing a set of moments involving X and Y . See, e.g., Brillinger(1982); Li et al. (1989); Li (1991, 1992); Duan et al. (1991); Cook (1998); Cook and Lee (1999);Cook and Ni (2005) and the references therein. Furthermore, for single index models in the high-dimensional setting, Thrampoulidis et al. (2015); Genzel (2016); Plan and Vershynin (2016); Planet al. (2017); Neykov et al. (2016a); Zhang et al. (2016); Yang et al. (2017a); Goldstein et al. (2018);Wei (2018); Goldstein and Wei (2019); Na et al. (2019) propose to estimate the direction of β ˚ via (cid:96) -regularized regression. Most of these works impose moment conditions inspired by Brillinger(1982), which ensures that the direction of β ˚ can be recovered from the covariance of Y and atransformation of X . Among these papers, our work is closely related to Yang et al. 
(2017a) in thatwe adopt the same loss function based on generalized Stein’s identity (Stein et al., 2004). Thatwork only studies the statistical error of the (cid:96) -regularized estimator, which is a solution to a convexoptimization problem. In comparison, without any regularization, we construct estimators basedon over-parameterization and gradient descent. We provide both statistical and computationalerrors of the proposed algorithm and establish a similar statistical rate of convergence as in Yanget al. (2017a). Moreover, when each nonzero entry of β ˚ is sufficiently large, we further obtain anoracle statistical rate which cannot be obtained by the (cid:96) -regularized estimator. Furthermore, Jianget al. (2014); Neykov et al. (2016b); Yang et al. (2017b); Tan et al. (2018); Lin et al. (2018); Yanget al. (2019); Balasubramanian et al. (2018); Babichev et al. (2018); Qian et al. (2019); Lin et al.(2019) generalize models such as misspecified phase retrieval (Cand´es et al., 2015), slice inverseregression (Li, 1991), and multiple index model (Xia, 2008) to the high-dimensional setting. The6stimators proposed in these works are based on second-order moments involving Y and X andrequire (cid:96) -regularization, hence are not directly comparable with our estimator. In this subsection, we give an introduction to our notations. Throughout this work, we use r n s todenote the set t , , . . . , n u . For a subset S in r n s and a vector u , we use u S to denote the vectorwhose i -th entry is u i if i P S and 0 otherwise. For any vector u and q ě
0, we use $\|u\|_{\ell_q}$ to denote its vector $\ell_q$ norm. In addition, the inner product $\langle u, v\rangle$ between any pair of vectors $u, v$ is defined as the Euclidean inner product $u^\top v$. Moreover, we define $u \odot v$ as the Hadamard (entrywise) product of the vectors $u, v$. For any given matrix $X \in \mathbb{R}^{d\times d}$, we use $\|X\|_{\mathrm{op}}$, $\|X\|_{\mathrm{F}}$, and $\|X\|_*$ to denote the operator norm, Frobenius norm, and nuclear norm of $X$, respectively. In addition, for any two matrices $X, Y \in \mathbb{R}^{d\times d}$, we define their inner product as $\langle X, Y\rangle = \mathrm{tr}(X^\top Y)$. Moreover, if we write $X \ge 0$ or $X \le 0$, then the matrix $X$ is meant to be positive semidefinite or negative semidefinite, respectively.

We let $\{a_n, b_n\}_{n\ge 1}$ be any two positive sequences. We write $a_n \lesssim b_n$ if there exists a universal constant $C$ such that $a_n \le C\, b_n$, and we write $a_n \ll b_n$ if $a_n / b_n \to 0$. In addition, we write $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold, and the notations $a_n = O(b_n)$ and $a_n = o(b_n)$ share the same meanings as $a_n \lesssim b_n$ and $a_n \ll b_n$, respectively. Moreover, $a_n = \widetilde{O}(b_n)$ means $a_n \le C\, b_n$ up to logarithmic terms.

The organization of our paper is as follows. We introduce the background knowledge in §2. In §3 and §4 we present our main results for the over-parameterized sparse vector and low-rank matrix single index models, respectively, and in §5 we corroborate the theory with numerical experiments.

2 Background

In this section, we introduce the phenomenon of implicit regularization via over-parameterization, the high-dimensional single index model, and the generalized Stein's identity (Stein et al., 2004).
2.1 Implicit Regularization via Over-Parameterization

Both Gunasekar et al. (2017) and Li et al. (2018) have studied least-squares objectives over positive semidefinite matrices $\beta \in \mathbb{R}^{d\times d}$ of the form
$$\min_{\beta \ge 0} \; F(\beta) = \frac{1}{2n}\sum_{i=1}^n \bigl(y_i - \langle X_i, \beta\rangle\bigr)^2, \qquad (2.1)$$
where the labels $\{y_i\}_{i=1}^n$ are generated from linear measurements $y_i = \langle X_i, \beta^*\rangle$, $i \in [n]$, with $\beta^* \in \mathbb{R}^{d\times d}$ being positive semidefinite and low rank. Here $\beta^*$ is of rank $r$, where $r$ is much smaller than $d$. Instead of working with $\beta$ directly, they write $\beta$ as $\beta = UU^\top$, where $U \in \mathbb{R}^{d\times d}$, and study the optimization problem over $U$,
$$\min_{U \in \mathbb{R}^{d\times d}} \; f(U) = \frac{1}{2n}\sum_{i=1}^n \bigl(y_i - \langle X_i, UU^\top\rangle\bigr)^2. \qquad (2.2)$$
The least-squares problem in (2.2) is over-parameterized because here $\beta$ is parameterized by $U$, which has $d^2$ degrees of freedom, whereas $\beta^*$, being a rank-$r$ matrix, has $O(rd)$ degrees of freedom. Gunasekar et al. (2017) proves that when $\{X_i\}_{i=1}^n$ are commutative and $U$ is properly initialized, if the gradient flow of (2.2) converges to a solution $\widehat{U}$ such that $\widehat{\beta} = \widehat{U}\widehat{U}^\top$ is a globally optimal solution of (2.1), then $\widehat{\beta}$ has the minimum nuclear norm over all global optima. Namely,
$$\widehat{\beta} \in \mathop{\mathrm{argmin}}_{\beta \ge 0} \|\beta\|_*, \quad \text{subject to } \langle X_i, \widehat{\beta}\rangle = y_i, \ \forall i \in [n].$$
However, the assumption of commutable $\{X_i\}_{i=1}^n$ is very restrictive. Gunasekar et al. (2017) conjectures that a similar result still holds when the covariates satisfy weaker conditions. Subsequently, Li et al. (2018) proves this conjecture partially. In particular, assuming $\{X_i\}_{i=1}^n$ satisfy the restricted isometry property (RIP) condition (Cand\'es, 2008), it proves that nearly exact recovery of $\beta^*$ is achieved by applying gradient descent to (2.2) with the initialization close to zero and sufficiently small stepsizes.

As for noisy statistical models, both Zhao et al. (2019) and Vaškevičius et al. (2019) independently study the over-parameterized high-dimensional noisy linear regression problem. Specifically, here the response variables $\{y_i\}_{i=1}^n$ are generated from the model
$$y_i = x_i^\top \beta^* + \epsilon_i, \quad i \in [n], \qquad (2.3)$$
where $\beta^* \in \mathbb{R}^p$ and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables that are independent of the covariates $\{x_i\}_{i=1}^n$. Moreover, here $\beta^*$ has only $s$ nonzero entries, where $s \ll p$. Instead of adding sparsity-enforcing penalties, they propose to estimate $\beta^*$ via gradient descent with respect to $(w, v)$ on the loss function
$$\min_{w \in \mathbb{R}^p, v \in \mathbb{R}^p} \; L(w, v) = \frac{1}{2n}\sum_{i=1}^n \bigl[x_i^\top(w \odot w - v \odot v) - y_i\bigr]^2, \qquad (2.4)$$
where the parameter $\beta$ of the linear model is over-parameterized as $\beta = w \odot w - v \odot v$. Under the restricted isometry property (RIP) condition on the covariates, these works prove that, when the hyperparameters are properly selected, gradient descent on (2.4) finds an estimator of $\beta^*$ with the optimal statistical rate of convergence. A minimal numerical sketch of this mechanism in the matrix-sensing setting (2.2) is given below.
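The following is a small sketch, written by us for illustration and not taken from any of the cited papers, of gradient descent on the over-parameterized objective (2.2) with NumPy; the problem sizes, stepsize, and initialization scale are illustrative assumptions.

```python
# Minimal sketch of gradient descent on the over-parameterized matrix-sensing
# objective (2.2); all sizes and hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 20, 2, 400                       # ambient dimension, true rank, sample size

# Ground truth: rank-r positive semidefinite beta_star with unit Frobenius norm.
B = rng.standard_normal((d, r))
beta_star = B @ B.T
beta_star /= np.linalg.norm(beta_star, "fro")

X = rng.standard_normal((n, d, d))         # sensing matrices
y = np.einsum("nij,ij->n", X, beta_star)   # noiseless linear measurements

U = 1e-3 * np.eye(d)                       # small initialization, full d-by-d factor
eta = 0.05
for t in range(2000):
    residual = np.einsum("nij,ij->n", X, U @ U.T) - y       # <X_i, UU^T> - y_i
    grad_beta = np.einsum("n,nij->ij", residual, X) / n     # gradient w.r.t. beta
    U -= eta * (grad_beta + grad_beta.T) @ U                # chain rule through UU^T
print(np.linalg.norm(U @ U.T - beta_star, "fro"))           # small recovery error
```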
2.2 High-Dimensional Single Index Models and Score Functions

In this subsection, we first introduce the score functions associated with random vectors and matrices, which are utilized in our algorithms. Then we formally define the high-dimensional single index model (SIM) in both the vector and matrix settings.

Definition 2.1. Let $x \in \mathbb{R}^p$ be a random vector with density function $p(x)\colon \mathbb{R}^p \to \mathbb{R}$. The score function $S_p(x)\colon \mathbb{R}^p \to \mathbb{R}^p$ associated with $x$ is defined as
$$S_p(x) := -\nabla_x \log p(x) = -\nabla_x p(x) / p(x).$$
Here the score function $S_p(x)$ relies on the density function $p(x)$ of the covariate $x$. To simplify the notation, in the rest of the paper we omit the subscript $p$ from $S_p$ when the underlying distribution of $x$ is clear.

Remark:
If the covariate $X \in \mathbb{R}^{d\times d}$ is a random matrix whose entries are i.i.d. with density $p(x)$, we define the score function $S(X) \in \mathbb{R}^{d\times d}$ entrywise. In other words, for any $(i, j) \in [d] \times [d]$,
$$S(X)_{i,j} := -p'(X_{i,j}) / p(X_{i,j}). \qquad (2.5)$$
Next, we discuss the first-order generalized Stein's identity.

Lemma 2.2. (First-Order General Stein's Identity, Stein et al. (2004)) Assume that the covariate $x \in \mathbb{R}^p$ follows a distribution with density function $p(x)\colon \mathbb{R}^p \to \mathbb{R}$ which is differentiable and satisfies $|p(x)| \to 0$ as $\|x\|_2 \to \infty$. Then for any differentiable function $f(x)$ with $\mathbb{E}[|f(x) S(x)|] \vee \mathbb{E}[|\nabla_x f(x)|] < \infty$, it holds that
$$\mathbb{E}[f(x) S(x)] = \mathbb{E}[\nabla_x f(x)],$$
where $S(x) = -\nabla_x p(x)/p(x)$ is the score function of $x$ defined in Definition 2.1.

Remark:
In the case of a matrix covariate, the same conclusion holds by simply replacing $x \in \mathbb{R}^p$ with $X \in \mathbb{R}^{d\times d}$ in Lemma 2.2, using the definition of the matrix score function in (2.5).

We next introduce the definitions of the class of models that we are interested in.

Definition 2.3. (Sparse Vector SIM) We assume the response $Y \in \mathbb{R}$ is generated from the model
$$Y = f(\langle x, \beta^*\rangle) + \epsilon, \qquad (2.6)$$
with unknown link $f\colon \mathbb{R} \to \mathbb{R}$, $p$-dimensional covariate $x$, and signal $\beta^*$, which is the parameter of interest. Here we let the noise $\epsilon$ be additive and mean zero, in the sense that $\epsilon \in \mathbb{R}$ is an exogenous random noise with $\mathbb{E}[\epsilon] = 0$. In addition, unless otherwise indicated, we assume the entries of $x$ are i.i.d. random variables with known density $p(x)$. The underlying true signal $\beta^*$ is assumed to be $s$-sparse with $s \ll p$. Since the length of $\beta^*$ can be absorbed into the unknown link $f$, we let $\|\beta^*\|_2 = 1$. This model covers, for instance, linear regression $y_i = x_i^\top \beta^* + \epsilon_i$, phase retrieval $y_i = (x_i^\top \beta^*)^2 + \epsilon_i$, as well as one-bit compressed sensing $y_i = \mathrm{sign}(x_i^\top \beta^*) + \epsilon_i$. Note that the model depends on the covariate $x$ only through an inner product; thus we are able to extend the sparse vector SIM to the case of matrix-valued covariates. Next, we define the low-rank matrix SIM as follows.

Definition 2.4. (Symmetric Low-Rank Matrix SIM) For the low-rank matrix SIM, we assume the response $Y \in \mathbb{R}$ is generated from
$$Y = f(\langle X, \beta^*\rangle) + \epsilon, \qquad (2.7)$$
in which $\beta^* \in \mathbb{R}^{d\times d}$ is a low-rank symmetric matrix with rank $r \ll d$ and the link function $f$ is unknown. For the covariate $X \in \mathbb{R}^{d\times d}$, we assume the entries of $X$ are i.i.d. with known density $p(x)$. For model identifiability, we let $Y$ be generated from the model with $\|\beta^*\|_{\mathrm{F}} = 1$, since $\|\beta^*\|_{\mathrm{F}}$ can also be absorbed into the unknown link function $f$. In addition, the noise term $\epsilon$ is also assumed to be additive and mean zero.

As discussed in the introduction, almost all of the existing literature focuses on implicit regularization for linear models with sub-Gaussian data. One question remains open: does the implicit regularization phenomenon exist only for linear models with light-tailed noise? Motivated by these prior arts, in the following §3 and §4 we theoretically investigate the phenomenon of implicit regularization for high-dimensional SIMs with both Gaussian and general designs.
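Before turning to the main results, the following small Monte Carlo check, our own illustration rather than an experiment from the paper, shows numerically that the Stein-type moment $\frac{1}{n}\sum_i y_i S(x_i)$ aligns with the direction of $\beta^*$ under a sparse vector SIM; with standard Gaussian covariates, $S(x) = x$, and the link, dimensions, and noise level below are illustrative assumptions.

```python
# Monte Carlo check that (1/n) * sum_i y_i S(x_i) points along beta_star
# for a nonlinear link, as suggested by Lemma 2.2 (Gaussian design: S(x) = x).
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 5000, 50, 5
beta_star = np.zeros(p)
beta_star[:s] = 1.0 / np.sqrt(s)                 # unit-norm, s-sparse signal

x = rng.standard_normal((n, p))                  # standard Gaussian covariates
eps = 0.1 * rng.standard_normal(n)
y = np.sin(x @ beta_star) + x @ beta_star + eps  # illustrative link f(u) = sin(u) + u

moment = x.T @ y / n                             # empirical version of E[Y * S(x)]
cosine = moment @ beta_star / np.linalg.norm(moment)
print(cosine)                                    # close to 1: direction recovered
```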
3 Main Results for Over-Parametrized Vector SIM

Leveraging Lemma 2.2 together with the definition of the sparse vector SIM in Definition 2.3, we get
$$\mathbb{E}[Y \cdot S(x)] = \mathbb{E}[f(\langle x, \beta^*\rangle) \cdot S(x)] = \mathbb{E}[f'(\langle x, \beta^*\rangle)] \cdot \beta^* := \mu^* \beta^*,$$
which recovers our true signal $\beta^*$ up to scaling. We then notice that $Y \cdot S(x)$ is a good estimator of the direction of $\beta^*$ as long as $f$ satisfies $\mathbb{E}[f'(\langle x, \beta^*\rangle)] \neq 0$. Thus, throughout this whole section, we focus on the sparse vector SIM with $\mathbb{E}[f'(\langle x, \beta^*\rangle)] \neq 0$. In this scenario, the population-level optimization problem that we want to solve is equivalent to
$$\min_\beta \; L(\beta) := \tfrac{1}{2}\langle \beta, \beta\rangle - \langle \beta, \mathbb{E}[Y \cdot S(x)]\rangle.$$
As the population expectation is inaccessible, $\mathbb{E}[Y \cdot S(x)]$ is replaced by its sample-version estimator $\frac{1}{n}\sum_{i=1}^n y_i S(x_i)$. In the regime of the high-dimensional SIM, in which the underlying signal $\beta^*$ is assumed to be sparse by Definition 2.3, one proposal is to solve the following regularized problem:
$$\min_\beta \; L(\beta) := \tfrac{1}{2}\langle \beta, \beta\rangle - \Bigl\langle \beta, \frac{1}{n}\sum_{i=1}^n y_i S(x_i)\Bigr\rangle + \lambda \|\beta\|_1 \qquad (3.1)$$
in order to get a solution $\widehat{\beta}$ with an optimal rate of convergence to $\mu^* \beta^*$.

With a flurry of recent studies on implicit regularization in both computer science and statistics, one may wonder whether, instead of adding penalties and tuning parameters, we can still achieve an estimator with the optimal convergence rate in the scenario of the high-dimensional SIM. Motivated by the pioneering work on over-parameterized linear models in §2.1, we over-parameterize $\beta$ as $\beta = w \odot w - v \odot v$, where both $w$ and $v$ are $p$-dimensional vectors. Then our modified objective function $L(\beta) = L(w, v)$ becomes
$$L(w, v) = \tfrac{1}{2}\langle w \odot w - v \odot v, \, w \odot w - v \odot v\rangle - \Bigl\langle w \odot w - v \odot v, \, \frac{1}{n}\sum_{i=1}^n y_i S(x_i)\Bigr\rangle. \qquad (3.2)$$
Gradient descent updates of $w$, $v$, and $\beta$ for solving (3.2) are given by
$$w_{t+1} = w_t - \eta\Bigl(w_t \odot w_t - v_t \odot v_t - \frac{1}{n}\sum_{i=1}^n S(x_i) y_i\Bigr) \odot w_t, \qquad (3.3)$$
$$v_{t+1} = v_t + \eta\Bigl(w_t \odot w_t - v_t \odot v_t - \frac{1}{n}\sum_{i=1}^n S(x_i) y_i\Bigr) \odot v_t, \qquad (3.4)$$
$$\beta_{t+1} = w_{t+1} \odot w_{t+1} - v_{t+1} \odot v_{t+1}, \qquad (3.5)$$
where the constant factor 2 arising from differentiating (3.2) is absorbed into the stepsize $\eta$. Since zero is a stationary point of the algorithm, it cannot be a starting point. Ideally, we would initialize the components whose true coefficients are zero at zero and the nonzero ones away from zero, so that the initialization is closer to the true parameter $\beta^*$. However, this is not feasible since we do not know the support of $\beta^*$. Instead, we initialize $w_0$ and $v_0$ as $w_0 = v_0 = \alpha \cdot \mathbf{1}_p$, where $\alpha$ is a small constant and $\mathbf{1}_p$ is the all-one vector. This provides a good compromise: zero components get nearly zero initializations, which are the majority under the sparsity assumption, and nonzero components get nonzero initializations. Even though we initialize every component at the same value, the nonzero components move quickly to their stationary values, while the zero components remain small. This is how over-parameterization differentiates active components from inactive components. We illustrate this by a simulation experiment, based on the simple implementation of the updates (3.3)–(3.5) sketched below.
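The following short sketch, written by us for illustration (the data and hyperparameters are placeholders), implements the updates (3.3)–(3.5) for standard Gaussian covariates, in which case $S(x_i) = x_i$.

```python
# Sketch of the over-parameterized gradient-descent updates (3.3)-(3.5).
# g_hat is the plug-in estimate (1/n) * sum_i y_i S(x_i); here S(x_i) = x_i.
import numpy as np

def overparam_gd(x, y, eta=0.01, alpha=1e-6, n_iter=5000):
    """Run updates (3.3)-(3.5) and return the path of iterates beta_t."""
    n, p = x.shape
    g_hat = x.T @ y / n                    # (1/n) * sum_i S(x_i) y_i
    w = alpha * np.ones(p)                 # w_0 = alpha * 1_p
    v = alpha * np.ones(p)                 # v_0 = alpha * 1_p
    path = []
    for _ in range(n_iter):
        residual = w * w - v * v - g_hat   # beta_t - g_hat, entrywise
        w = w - eta * residual * w         # update (3.3)
        v = v + eta * residual * v         # update (3.4)
        path.append(w * w - v * v)         # beta_{t+1}, as in (3.5)
    return np.array(path)
```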
A simulation study. In this simulation, we fix the sample size $n$, the dimension $p$, and the sparsity $s = 5$. Let $S := \{i : |\beta^*_i| > 0\}$. The responses $\{y_i\}_{i=1}^n$ are generated from $y_i = f(\langle x_i, \beta^*\rangle) + \epsilon_i$, $i \in [n]$, with link functions $f(x) = x$ (linear regression) and $f(x) = \sin(x)$. Here we assume $\beta^*$ is $s$-sparse with $\beta^*_i = 1/\sqrt{s}$ for $i \in S$ (for model identification), and $\{x_i\}_{i=1}^n$ are standard Gaussian random vectors. We first over-parameterize the parameter $\beta$ as $w \odot w - v \odot v$ and initialize $w_0 = v_0 = \alpha \cdot \mathbf{1}_p$ with a small value of $\alpha$. Then we update $w$, $v$, and $\beta$ according to equations (3.3), (3.4), and (3.5) with stepsize $\eta = 0.01$. The evolution of the distance between our unnormalized iterates $\beta_t$ and $\mu^*\beta^*$, the trajectories of $\beta_{j,t}$ for $j \in S$, and $\max_{j \in S^c}|\beta_{j,t}|$ are depicted in Figures 1 and 2.

From the simulation results given in Figure 1(a) and Figure 2(a), we notice that there exists a time interval in which we can nearly recover $\mu^*\beta^*$. From plots (b) in Figures 1 and 2, we see that with over-parameterization, the five nonzero components all increase rapidly and converge quickly to their stationary points. Meanwhile, the maximum estimation error over the inactive components remains small, as shown in Figure 1(c) and Figure 2(c). In other words, running gradient descent with respect to the over-parameterized parameters helps us distinguish nonzero components from zero components, while applying gradient descent to the ordinary loss cannot.

[Figure 1: with link function $f(x) = x$, panel (a) shows the evolution of the distance $\|\beta_t - \mu^*\beta^*\|_2$ against the iteration number $t$; panel (b) depicts the trajectories $\beta_{j,t}$ ($j \in S$) for the five nonzero components; panel (c) presents the trajectory of $\max_{j \in S^c}|\beta_{j,t}|$.]

[Figure 2: with link function $f(x) = \sin(x)$, the panels are analogous to Figure 1: (a) the evolution of $\|\beta_t - \mu^*\beta^*\|_2$, (b) the trajectories $\beta_{j,t}$ ($j \in S$) for the five nonzero components, and (c) the trajectory of $\max_{j \in S^c}|\beta_{j,t}|$.]

It is worth noting that if we set the partial derivatives of $L(w, v)$ with respect to both $w$ and $v$ to zero,
$$\nabla_w L(w, v) = \Bigl(w \odot w - v \odot v - \frac{1}{n}\sum_{i=1}^n S(x_i) y_i\Bigr) \odot w = 0, \qquad -\nabla_v L(w, v) = \Bigl(w \odot w - v \odot v - \frac{1}{n}\sum_{i=1}^n S(x_i) y_i\Bigr) \odot v = 0,$$
we see that there exist exponentially many saddle points of our loss function $L(w, v)$. However, it can be inferred from our analysis of the trajectories of different entries below that saddle points are not hit before the iterate $\beta_t$ reaches the "good region" in which we enjoy optimal $\ell_1$- and $\ell_2$-convergence of $\beta_t$ to $\mu^*\beta^*$.

3.1 Gaussian Design

In this subsection, we discuss the over-parameterized SIM with Gaussian covariates: $x \in \mathbb{R}^p \sim N(\mu, \Sigma)$. Only in this subsection, we change the identifiability condition in Definition 2.3 from assuming $\|\beta^*\|_2 = 1$ to assuming $\|\Sigma^{1/2}\beta^*\|_2 = 1$. Let us begin with the basic assumption.
Assumption 3.1.
Assume $\mu^* = \mathbb{E}[f'(\langle x, \beta^*\rangle)] \neq 0$.
(a). There exist constants $C_{\min}$ and $C_{\max}$ such that $C_{\min} I_{p\times p} \le \Sigma \le C_{\max} I_{p\times p}$ holds.
(b). Both $\{f(\langle x_i, \beta^*\rangle)\}_{i=1}^n$ and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables, with sub-Gaussian norms denoted by $\sigma_f$ and $\sigma_\epsilon$, respectively.

The score function for $x \sim N(\mu, \Sigma)$ is $S(x) = \Sigma^{-1}(x - \mu)$, and Assumption 3.1(a) makes the Gaussian-distributed covariates non-degenerate. Assumption 3.1(b) enables the concentration of the empirical estimator $\frac{1}{n}\sum_{i=1}^n y_i S(x_i)$ around its population version $\mu^*\beta^*$. Note that this assumption is quite standard and is easily satisfied by a broad class of models, including models with bounded link functions and sub-Gaussian noise. It includes the linear regression model in (2.3) studied by Zhao et al. (2019) and Vaškevičius et al. (2019). In addition, this assumption will be further relaxed to bounded finite moments in §3.2 and §4.2.
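For concreteness, the plug-in moment that drives the update direction in Algorithm 1 below can be computed as in the following short helper of our own; the function name and array shapes are illustrative assumptions.

```python
# Helper for the Gaussian-design plug-in moment (1/n) * sum_i y_i * Sigma^{-1}(x_i - mu),
# i.e., the quantity appearing in the updates of Algorithm 1.
import numpy as np

def gaussian_score_moment(x, y, mu, Sigma):
    """x: (n, p) covariates, y: (n,) responses; returns a length-p vector."""
    scores = np.linalg.solve(Sigma, (x - mu).T).T   # rows are Sigma^{-1}(x_i - mu)
    return scores.T @ y / x.shape[0]
```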
Algorithm for Vector SIM with Gaussian Design
Data:
Training covariates t x i u ni “ , response variables t y i u ni “ , initial value α , step size η ;Initialize variables w “ α ¨ p ˆ , v “ α ¨ p ˆ and set iteration number t “ while t ă T dow t ` “ w t ´ η “ w t d w t ´ v t d v t ´ n ř ni “ Σ ´ p x i ´ µ q y i ‰ d w t ; v t ` “ v t ` η “ w t d w t ´ v t d v t ´ n ř ni “ Σ ´ p x i ´ µ q y i ‰ d v t ; β t ` “ w t d w t ´ v t d v t ; t “ t ` endResult: Output the final estimate p β ˚ “ β T .Now we are ready to present the statistical rates of convergence for the estimator constructed byAlgorithm 1. Let us divide the support set S “ t i : | β ˚ i | ą u into S “ t i : | β i | Á log p a log p { n u and S “ t i : 0 ă | β ˚ | À a log p { n u , which correspond to the sets of strong and weak signals,respectively. We let s and s be the cardinality of S and S , respectively. In addition, we let s m “ min i P S | µ ˚ β ˚ i | be the smallest value of strong signals.13 heorem 3.2. Apart from Assumption 3.1, if we further let our initial value α satisfy 0 ă α À { p and set stepsize η as 0 ă η À {p max i | β ˚ i |q in our Algorithm 1, there exist constants a , a ą T P r a log p s m { α q{ ηs m , a a n { log p { η s , we obtain that } β T ´ µ ˚ β ˚ } À s log nn ` s log pn , } β T ´ µ ˚ β ˚ } À s c log nn ` s c log pn hold with probability at least 1 ´ p ´ ´ n ´ . Meanwhile, the statistical rates of convergence forthe normalized iterates are given by ››››› β T ›› Σ { β T ›› ´ µ ˚ β ˚ | µ ˚ | ››››› À s log nn ` s log pn , ››››› β T ›› Σ { β T ›› ´ µ ˚ β ˚ | µ ˚ | ››››› À a p s ` s q c s log nn ` s log pn . We conclude from Theorem 3.2 that if we just have strong signals, then with high probability,for any T P r a log p s m { α q{ ηs m , a a n { log p { η s , we get the oracle statistical rates O p a s log n { n q and O p s a log n { n q in terms of the (cid:96) - and (cid:96) -norms respectively. Notice that these oracle rates areindependent of the ambient dimension p . Besides, when β ˚ also consists of weak signals, we achieve O p a s log p { n q and O p s a log p { n q statistical rates in terms of the (cid:96) - and (cid:96) -norms, respectively,where s is the sparsity of β ˚ . Such statistical rates match the minimax rates of sparse linearregression (Raskutti et al., 2011) and are thus minimax optimal. Notice that the oracle ratesare achievable via explicit regularization using folded concave penalties (Fan et al., 2014) such asSCAD (Fan and Li, 2001) and MCP (Zhang et al., 2010). Thus, Theorem 3.2 shows that, withover-parameterization, the implicit regularization of gradient descent has the same effect as addinga folded concave penalty function to the loss function in (3.2) explicitly.Furthermore, comparing our work to Plan and Vershynin (2016); Plan et al. (2017), whichstudy on high dimensional SIM with (cid:96) -regularization, thanks to the implicit regularization phe-nomenon, we avoid bias brought by the (cid:96) -penalty and attain the oracle statistical rate. Theorem3.2 generalizes the results in Zhao et al. (2019) and Vaˇskeviˇcius et al. (2019) for the linear modelto high-dimensional SIMs. In addition, to satisfy the RIP condition, their sample complexity is atleast O p s log p q if their covariate x follows the Gaussian distribution. Whereas, by using the lossfunction in (3.2) motivated by the Stein’s identity (Stein et al., 1972, 2004), the RIP condition isunnecessary in our analysis. Instead, our theory only requires that n ´ ř ni “ S p x i q ¨ y i concentratesat a fast rate. 
As a result, our sample complexity is max t O p s log p q , O p log p qu for (cid:96) -consistency,which is better than O p s log p q when s is much larger than ? log p . Here, the O p log p q term arisesdue to ensuring O p a n { log p q Á O p log p { α qq , where α À { p is the magnitude of initialization.The proof ideas behind Theorem 3.2 are as follows. First, we are able to control the strengthsof both error and weak signal components, denoted by } β t d S c } , } β t d S } respectively, at thesame order with their initial values until O p a n { log p { η q steps. Meanwhile, every entry of strongsignal part β t d S grows at exponential rates to (cid:15) “ O p a log n { n q accuracy around µ ˚ β ˚ d S within O p log p { α q{ ηs m q steps. The final statistical rates are obtained by combining the results onthe active and inactive components together. See § A.1 and § A.2 for the detail.14inally, as shown in Theorem 3.2, if the stopping time T P r a log p s m { α q{ ηs m , a a n { log p { η s ,we will get an estimator β T with optimal statistical rates with high probability. However, inpractice, the constants a and a are unknown. Thus, in the following, we introduce a method toselect a proper stopping time T by estimating f . T We split the dataset into training data and testing data. We utilize the training data to implementAlgorithm 1 and get the estimator β t as well as the value of the training loss (3.2) at step t .We notice β t varies slowly inside the optimal time interval specified in Theorem 3.2, so that thefluctuation of the training loss (3.2) can be smaller than a threshold. Based on that, we choose m testing points on the flatted curve of the training loss (3.2) and denote their corresponding numberof iterations as t t j u , j P r m s . For each j P r m s , we then reuse the training data and normalizedestimator β t j {} Σ { β t j } , j P r m s to fit the link function f . Let the obtained estimator be p f j . Forthe testing dataset, we perform out-of-sample prediction and get m prediction losses: l j “ n test n test ÿ i “ “ Y i ´ p f j px x i , β t j {} Σ { β t j } yq ‰ , @ j P r m s . Next, we choose T as t j ˚ where j ˚ “ argmin j Pr m s l j .In the following § p f j and establish its theoreticalguarantee. We now consider estimating the nonparametric component and the prediction risk. Suppose weare given an estimator p β of β and n i.i.d. observations t y i , x i u ni “ of the model. For simplicity ofthe technical analysis, we assume that p β is independent of t y i , x i u ni “ , which can be achieved bydata-splitting. Moreover, we assume that p β is an estimator of β ˚ such that ›› p β ´ β ˚ ›› “ o p n ´ { q , ›› Σ { p β ›› “ , and ›› Σ { β ˚ ›› “ . (3.6)Our goal is to construct an estimate the regression function f px¨ , β ˚ yq based on p β and t y i , x i u ni “ .Note that, when β ˚ is known, we can directly estimate f based on y i and Z ˚ i : “ x J i β ˚ , i P r n s via standard non-parametric regression. When p β is accurate, a direct idea is to replace Z ˚ i by Z i : “ x J i p β and follow the similar route. For a new observation x , we define Z as Z : “ x J p β and Z ˚ as Z ˚ : “ x J β ˚ respectively.To predict Y , we estimate function g p z q using kernel regression with data tp y i , x J i p β qu ni “ . Specif-ically, we let the function K h p u q be K h p u q : “ { h ¨ K p u { h q , in which K : R Ñ R is a kernel functionwith K p u q “ I t| u |ď u and h is a bandwidth. 
By the definitions of Z ˚ , Z, and Z i , i P r n s given above,the prediction function p g p Z q is defined as p g p Z q “ $&% ř ni “ y i K h p Z ´ Z i q ř ni “ K h p Z ´ Z i q , | Z ´ µ J p β | ď R, , otherwise , (3.7)15here we follow the convention that 0 { “
0. In what follows, we consider the (cid:96) -prediction risk of p g , which is given by E „!p g ` x x , p β y ˘ ´ f ` x x , β ˚ y ˘) , where the expectation is taken with respect to x and t x i , y i u ni “ . Before proceeding to the theoreticalguarantees, we make the following assumption on the regularity of f . Assumption 3.3.
There exists an α ą C ą | f p x q| , | f p x q| ď C ` | x | α . For the rationality of the Assumption 3.3, we note that the constraint on f p x q and f p x q givenabove is weaker than assuming f p x q and f p x q are bounded functions directly. Next, we presentTheorem 3.4 which characterizes the convergence rate of mean integrated error of our predictionfunction p g p Z q . Theorem 3.4.
If we set R “ a log p n q and h — n ´ { in (3.7), under Assumption 3.3, the (cid:96) -prediction risk of p g defined in (3.7) is given by E „!p g ` x x , p β D q ´ f ` x x , β ˚ y ˘) À polylog p n q n { , where p β “ β T {} Σ { β T } is the normalized β T given in Theorem 3.2 and polylog p n q containsterms that are polynomials of log n .The proof of Theorem 3.4 is given in section A.3. Note that it is possible to refine the analysison the prediction risk for f with higher order derivatives by utilizing higher order kernels; seeTsybakov (2008) therein. But this is not the key message of our paper. In this subsection, we extend our methodology to the setting with covariates generated from ageneral distribution. Following our discussions at the beginning of §
3, ideally we would aim to solve the loss function with over-parameterized variables given in (3.2). However, when the distribution of $x$ has a general density $p$, the score $S(x)$ can be a heavy-tailed random variable, so that $\mathbb{E}[Y \cdot S(x)]$ and its empirical counterpart may not be sufficiently close.

To remedy this issue, we modify the loss function in (3.2) by replacing $y_i$ and $S(x_i)$ with their truncated (Winsorized) versions $\check{y}_i$ and $\check{S}(x_i)$, respectively. Specifically, we propose to apply gradient descent to the following modified loss function with respect to $w$ and $v$:
$$\min_{w, v} \; L(w, v) := \tfrac{1}{2}\langle w \odot w - v \odot v, \, w \odot w - v \odot v\rangle - \frac{1}{n}\sum_{i=1}^n \check{y}_i \bigl\langle w \odot w - v \odot v, \, \check{S}(x_i)\bigr\rangle. \qquad (3.8)$$
We denote by $\check{a} \in \mathbb{R}^d$ the truncated version of a vector $a \in \mathbb{R}^d$ based on a parameter $\tau$ (Fan et al., 2020a). That is, its entries are given by $[\check{a}]_j = [a]_j$ if $|[a]_j| \le \tau$ and $\tau$ otherwise. By applying elementwise truncation to $\{y_i\}_{i=1}^n$ and $\{S(x_i)\}_{i=1}^n$ in (3.8), we allow the score $S(x)$ and the response $Y$ to both have heavy-tailed distributions. By choosing a proper threshold $\tau$, such a truncation step ensures that $\frac{1}{n}\sum_{i=1}^n \check{y}_i \check{S}(x_i)$ converges to $\mathbb{E}[Y \cdot S(x)]$ at the desired entrywise rate. Compared with Algorithm 1, here we only modify the definition of the loss function. Thus, we defer the details of the proposed algorithm for this setting to Algorithm 3 in §A.5.
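As a concrete illustration (our own sketch, not the paper's code), the truncation and the resulting robust moment estimate can be implemented as follows; we use a sign-preserving clip, a common variant of the Winsorization described above, and the truncation levels are illustrative inputs.

```python
# Sketch of elementwise truncation and the robust plug-in estimate of E[Y * S(x)]
# used in the modified loss (3.8); tau_x and tau_y are user-chosen levels.
import numpy as np

def truncate(a, tau):
    """Entrywise Winsorization: keep a_j if |a_j| <= tau, otherwise clip to +/- tau."""
    return np.clip(a, -tau, tau)

def robust_moment(scores, y, tau_x, tau_y):
    """scores: (n, p) matrix of S(x_i); y: (n,) responses; returns a length-p vector."""
    return truncate(scores, tau_x).T @ truncate(y, tau_y) / len(y)
```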
Before stating our main theorem, we first present an assumption on the distributions of the covariate and the response variables.

Assumption 3.5.
Assume there exists a constant $M$ such that
$$\mathbb{E}\bigl[Y^4\bigr] \le M, \qquad \mathbb{E}\bigl[S(x)_j^4\bigr] \le M, \quad \forall j \in [p].$$
Assuming that the fourth moments exist and are bounded is significantly weaker than the sub-Gaussian assumption. Moreover, such an assumption is prevalent in the literature on robust statistics (Fan et al., 2020b, 2018, 2019b). Now we are ready to introduce the theoretical results for the setting with general design.
Theorem 3.6.
Under our Assumption 3.5, we set the thresholding parameter τ “ pp M ¨ n q{ log p q { { α satisfy 0 ă α À { p , and set the stepsize η such that 0 ă η À {p max i | β ˚ i |q in Algorithm 3 given in § A.5. There exist constants a , a , such that } β T ´ µ ˚ β ˚ } À s log pn , } β T ´ µ ˚ β ˚ } À s c log pn hold with probability at least 1 ´ p ´ , for any T P r a log p s m { α q{p ηs m q , a a n { log p { η s . Here s is the cardinality of our support set S . In addition, for the normalized iterates, we further have ›››› β T } β T } ´ µ ˚ β ˚ | µ ˚ | ›››› À s log pn , ›››› β T } β T } ´ µ ˚ β ˚ | µ ˚ | ›››› À s c log pn . Compared with Theorem 3.2 for the Gaussian design, here we achieve the O p a s log p { n q and O p s a log p { n q statistical rates of convergence in terms of the (cid:96) - and (cid:96) -norms, respectively. Theserates are the same of those achieved by adding an (cid:96) -norm regularization explicitly (Plan andVershynin, 2016; Plan et al., 2017; Yang et al., 2017a) and are minimax optimal (Raskutti et al.,2011). Moreover, we note that here S p x q and Y can be both heavy-tailed and our truncationprocedure successfully tackles such a challenge without sacrificing the statistical rates.It is also worthwhile noting that Theorem 3.6 characterizes the implicit regularization phe-nomenon of optimization algorithms for over-parameterized nonlinear models with heavy-taileddata. Here the optimization algorithm is the standard gradient descent, combined with an addi-tional truncation to the data, which can be viewed as a pre-processing step. This result adds tothe existing literature on implicit regularization which mainly focuses on linear models with zeroor light-tailed noise (Li et al., 2018; Zhao et al., 2019; Vaˇskeviˇcius et al., 2019).17 Main Results for Over-Parametrized Low Rank SIM
In this section, we present the results for over-parameterized low rank matrix SIM introducedin Definition 2.4 with both standard Gaussian and generally distributed covariates. Similar tothe results in §
3, here we also focus on matrix SIM with first-order links, i.e., we assume that µ ˚ “ E r f px X , β ˚ yqs ‰
0, where β ˚ is a low rank matrix with rank r . Note that we assume that theentries of covariate X P R d ˆ d are i.i.d. with density p . Also recall that we define the score function S p X q P R d ˆ d in (2.5). Then, similar to the loss function in (3.2), we consider the loss function L p β q : “ x β, β y ´ A β, n n ÿ i “ y i S p X i q E , where β P R d ˆ d is a symmetric matrix. Hereafter, we rewrite β as WW J ´ VV J , where both W and V are matrices in R d ˆ d . With such over-parameterization, we propose to estimate β ˚ byapplying gradient descent to the loss function L p W , V q : “ x WW J ´ VV J , WW J ´ VV J y ´ A WW J ´ VV J , n n ÿ i “ y i S p X i q E . (4.1)Since the rank of β ˚ is unknown, we initialization W and V as W “ V “ α ¨ I d ˆ d for a small α ą t W t , V t , β t u via W t ` “ W t ´ η ´ W t W J t ´ V t V J t ´ n n ÿ i “ S p X i q y i ´ n n ÿ i “ S p X i q J y i ¯ W t , (4.2) V t ` “ V t ` η ´ W t W J t ´ V t V J t ´ n n ÿ i “ S p X i q y i ´ n n ÿ i “ S p X i q J y i ¯ V t , (4.3) β t ` “ W t W J t ´ V t V J t , where η in (4.2) and (4.3) is the stepsize. Note that here the algorithm does not impose any explicitregularization. In the rest of this section, we show that such a procedure yields an estimator of thetrue parameter β ˚ with near-optimal statistical rates of convergence.Like the case of sparse vector, here we also divide eigenvalues of µ ˚ β ˚ into different groupsby their strengths. We let r ˚ i , i P r n s be the i -th eigenvalue of µ ˚ β ˚ . The support set R of theeigenvalues is defined as R : “ t i : | r ˚ i | ą u with size r . We then divide the support set R into R : “ t i : | r ˚ i | Á log d a d log d { n u and R : “ t i : 0 ă | r ˚ i | À a d log d { n u , which correspond tocollections of strong and weak signals with cardinality denoting by r and r , respectively. Moreover,we use r m to denote the minimum strong eigenvalue in magnitude, i.e. r m “ min i P R | r ˚ i | . In this subsection, we focus on the model in (2.7) with the entries of covariate X being i.i.d. N p , q random variables. In this case, S p X i q “ X i . This leads to Algorithm 2, by using (4.1)-(4.3).Similar to the case in § lgorithm 2: Algorithm for Low Rank Matrix SIM with Gaussian Design
Data:
Training design matrix X i P R d ˆ d , i P r n s , response variables t y i u ni “ , initial value α and step size η ;Initialize W “ α ¨ I d ˆ d , V “ α ¨ I d ˆ d and set iteration number t “ while t ă T doW t ` “ W t ´ η p W t W J t ´ V t V J t ´ n ř ni “ X i y i ´ n ř ni “ X J i y i q W t ; V t ` “ V t ` η p W t W J t ´ V t V J t ´ n ř ni “ X i y i ´ n ř ni “ X J i y i q V t ; β t ` “ W t W J t ´ V t V J t ; t “ t ` endResult: Output the final estimate p β “ β T . Assumption 4.1. t y i u ni “ are i.i.d. sub-Gaussian random variables with sub-Gaussian norm σ y . Theorem 4.2.
We set α À { d and stepsize η À {p max i | r ˚ i |q in Algorithm 2. Under Assumption4.1, there exist constants a , a such that for any T P r a log p r m { α q{ ηr m , a a n {p d log d q{ η s , withprobability 1 ´ {p d q ´ { n , we obtain ›› β T ´ µ ˚ β ˚ ›› F À rd log dn , ›› β T ´ µ ˚ β ˚ ›› ˚ À r c d log dn . Moreover, for the normalized iterates β t {} β t } F , we have ›››› β T } β T } F ´ µ ˚ β ˚ | µ ˚ | ›››› F À rd log dn , ›››› β T } β T } F ´ µ ˚ β ˚ | µ ˚ | ›››› ˚ À r c d log dn . As shown in Theorem 4.2, with the proper choices of initialization parameter α , stepsize η , andthe stopping time T , Algorithm 2 constructs an estimator that achieves near-optimal statisticalrates of convergence (up to logarithmic factors compared to minimax lower bound (Rohde andTsybakov, 2011)). Notice that the statistical rates established in Theorem 4.2 are also enjoyedby the M -estimator based on the least-squares loss function with nuclear norm penalty (Plan andVershynin, 2016; Plan et al., 2017). Thus, in terms of statistical estimation, applying gradientdescent to the over-parameterized loss function in (4.1) is equivalent to adding a nuclear normpenalty explicitly, hence demonstrating the implicit regularization effect.Furthermore, our method extends the existing works that focus on the implicit regularizationphenomenon in noiseless linear matrix sensing models with positive semidefinite signal matrices(Gunasekar et al., 2017; Li et al., 2018; Arora et al., 2019a; Gidel et al., 2019). Specifically, weallow a more general class of models with nonlinear links and symmetric signal matrices. Moreover,compared with Li et al. (2018), our strengths are two-fold. First, under the setting of standardGaussian design with signals at constant level, our sample complexity is only at the order of r O p rd q whereas they need at least r O p r d q samples so as to establish their RIP condition (Cand´es, 2008).Second, our results also hold under the existence of weak signals, i.e. 0 ă min i P R | r ˚ i | À r O p a d { n q .When we fix d and r , in order to meet the RIP condition with parameter δ , the sample size n needs to19atisfy n Á O p { δ q according to Theorem 4.2 in Recht et al. (2010). As Li et al. (2018) requires anRIP parameter δ with δ À O p min i P R | r ˚ i | {? r q in its Theorem 1, the corresponding minimum signalstrength min i P R | r ˚ i | should satisfy min i P R | r ˚ i | Á O pp { n q { q which brings a stronger assumptionthan us.The way of choosing stopping time T in the case of matrix SIM is almost the same with ourmethod in § x J β ˚ by tr p X J β ˚ q Indeed, as we assume } Σ { β ˚ } “ } β ˚ } F “ x J β t and tr p X J β t q follow the standard normal distribution. Thus, our resultson the prediction risk in § In the rest of this section, we focus on the low rank matrix SIM beyond Gaussian covariates.Hereafter, we assume the entries of X are i.i.d. random variables with a known density function p : R Ñ R . Recall that, according to the remarks following Definition 2.1, the score function S p X q P R d ˆ d is defined as S p X q j,k : “ S p X j,k q “ ´ p p X j,k q{ p p X j,k q , where S p X q j,k and X j,k are the p j, k q -th entries of S p X q and X for all j, k P r d s . However, similarto the results in § S p X q can have heavy-tailed distributions and thus n ´ ř ni “ y i ¨ S p X i q may not converge its expectation E r Y ¨ S p X qs efficiently in terms of spectral norm. Here X i is the i -th observation of the covariate X . 
To tackle such a challenge, we employ a shrinkage approach (Catoni et al., 2012; Fan et al., 2020b; Minsker, 2018) to construct a robust estimator of $\mathbb{E}[Y \cdot S(X)]$. Specifically, we let
$$\varphi(x) = \begin{cases} -\log\bigl(1 - x + x^2/2\bigr), & x \le 0, \\ \;\;\,\log\bigl(1 + x + x^2/2\bigr), & x > 0, \end{cases}$$
which is approximately $x$ when $x$ is small and grows at a logarithmic rate for large $|x|$. The rescaled version $\lambda^{-1}\varphi(\lambda x)$ thus behaves like the identity for small $\lambda$. For a matrix $X \in \mathbb{R}^{d\times d}$, we apply the spectral decomposition to its Hermitian dilation and obtain
$$X^* := \begin{bmatrix} 0 & X \\ X^\top & 0 \end{bmatrix} = Q \Sigma^* Q^\top,$$
where $\Sigma^* \in \mathbb{R}^{2d\times 2d}$ is a diagonal matrix. Based on this decomposition, we define $\widetilde{X} = Q \varphi(\Sigma^*) Q^\top$, where $\varphi$ is applied elementwise to the diagonal entries of $\Sigma^*$. We then write $\widetilde{X}$ as a block matrix
$$\widetilde{X} := \begin{bmatrix} \widetilde{X}_{11} & \widetilde{X}_{12} \\ \widetilde{X}_{21} & \widetilde{X}_{22} \end{bmatrix},$$
where each block $\widetilde{X}_{jk}$ is in $\mathbb{R}^{d\times d}$. We further define a mapping $\varphi\colon \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ by letting $\varphi(X) := \widetilde{X}_{12}$, which is a regularized version of $X$. Given data $y$, $X$, we finally define $H(\cdot)$ as
$$H\bigl(y\, S(X), \kappa\bigr) := \frac{1}{\kappa}\,\varphi\bigl(\kappa\, y \cdot S(X)\bigr), \quad \forall \kappa > 0, \qquad (4.4)$$
where $\kappa$ is a thresholding parameter converging to zero. Based on the operator $H$ defined in (4.4), we define a loss function $L(W, V)$ as
$$L(W, V) := \tfrac{1}{2}\bigl\langle WW^\top - VV^\top, \, WW^\top - VV^\top\bigr\rangle - \frac{1}{n}\sum_{i=1}^n \bigl\langle WW^\top - VV^\top, \, H\bigl(y_i S(X_i), \kappa\bigr)\bigr\rangle. \qquad (4.5)$$
After over-parameterizing $\beta$ as $WW^\top - VV^\top$, we propose to construct an estimator of $\beta^*$ by applying gradient descent to the loss function in (4.5) with respect to $W$ and $V$. See Algorithm 4 in §B.3 for the details of the algorithm.
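For concreteness, a minimal sketch of the operator $H$ in (4.4) based on the Hermitian dilation is given below; this is our own illustration, and the function names and the way the matrix argument is passed in are assumptions.

```python
# Sketch of the shrinkage operator H in (4.4): apply the influence function phi to
# the spectrum of the Hermitian dilation of kappa * A (with A = y * S(X)), rescale
# by 1/kappa, and read off the upper-right d-by-d block.
import numpy as np

def phi(x):
    """Catoni-type influence function: ~x for small x, logarithmic growth for large |x|."""
    return np.where(x <= 0, -np.log1p(-x + x**2 / 2), np.log1p(x + x**2 / 2))

def shrink(A, kappa):
    """Return H(A, kappa) for a d-by-d matrix A."""
    d = A.shape[0]
    M = kappa * A
    dilation = np.block([[np.zeros((d, d)), M], [M.T, np.zeros((d, d))]])
    eigval, eigvec = np.linalg.eigh(dilation)            # dilation is symmetric
    shrunk = eigvec @ np.diag(phi(eigval)) @ eigvec.T    # phi applied spectrally
    return shrunk[:d, d:] / kappa                        # upper-right block, rescaled
```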
In the following, we present the statistical rates of convergence for the obtained estimator. We first introduce the assumption on $Y$ and $p_0$.

Assumption 4.3. We assume that both the response variable $Y$ and the entries of $S(X)$ possess bounded fourth moments. Specifically, there exists a constant $M$ such that
$$\mathbb{E}[Y^4] \le M, \qquad \mathbb{E}\big[S(X)_{i,j}^4\big] \le M, \qquad \forall\, (i,j) \in [d]\times[d].$$

Next, we present the main theorem for the low rank matrix SIM.
Theorem 4.4.
In Algorithm 4, we set the parameter $\kappa$ in (4.4) as $\kappa = \sqrt{\log(2d)/(nd\,M)}$ and let the initialization parameter $\alpha$ and the stepsize $\eta$ satisfy $\alpha \lesssim 1/d$ and $0 < \eta \lesssim 1/(\max_i |r^*_i|)$, respectively. Then, under Assumption 4.3, there exist constants $a_1, a_2$ such that for any $T \in [\, a_1\log(r_m/\alpha)/(\eta r_m),\; a_2\sqrt{n/(d\log d)}/\eta\,]$, with probability at least $1 - (2d)^{-c_0}$ for an absolute constant $c_0 > 0$, we obtain
$$\|\beta_T - \mu^*\beta^*\|_F^2 \lesssim \frac{rd\log d}{n}, \qquad \|\beta_T - \mu^*\beta^*\|_* \lesssim r\sqrt{\frac{d\log d}{n}}.$$
Moreover, for the normalized iterate $\beta_T/\|\beta_T\|_F$, we have
$$\Big\|\frac{\beta_T}{\|\beta_T\|_F} - \frac{\mu^*\beta^*}{|\mu^*|}\Big\|_F^2 \lesssim \frac{rd\log d}{n}, \qquad \Big\|\frac{\beta_T}{\|\beta_T\|_F} - \frac{\mu^*\beta^*}{|\mu^*|}\Big\|_* \lesssim r\sqrt{\frac{d\log d}{n}}.$$

For the low rank matrix SIM, when the hyperparameters of the gradient descent algorithm are properly chosen, we thus capture the implicit regularization phenomenon by applying a simple optimization procedure to an over-parameterized loss function with heavy-tailed measurements. Here, applying the thresholding operator $H$ in (4.4) can also be viewed as a data pre-processing step, which arises from handling heavy-tailed observations. Note that the Frobenius- and nuclear-norm statistical rates given in Theorem 4.4 are minimax optimal up to a logarithmic term (Rohde and Tsybakov, 2011). Similar results were also obtained by Plan and Vershynin (2016); Yang et al. (2017a); Goldstein et al. (2018); Na et al. (2019) via adding explicit nuclear norm regularization. Thus, in terms of statistical recovery, when employing the thresholding in (4.4) and over-parameterization, gradient descent enforces an implicit regularization that has the same effect as the nuclear norm penalty.
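To make the estimation procedure concrete, here is a minimal sketch of regularization-free gradient descent on the over-parameterized matrix loss; target stands for the moment estimator used in the loss, i.e., the sample average of $y_i X_i$ under the Gaussian design of Algorithm 2 or of $H(y_i S(X_i), \kappa)$ under the general design of Algorithm 4. The identity-matrix initialization, the symmetrization step, and the absolute constants in the update are our choices, and the stopping time $T$ should be chosen as in Theorems 4.2 and 4.4.

import numpy as np

def overparam_matrix_gd(target, alpha, eta, T):
    # Regularization-free gradient descent with beta parameterized as W W^T - V V^T:
    #   W <- W - eta * (W W^T - V V^T - sym(target)) W,
    #   V <- V + eta * (W W^T - V V^T - sym(target)) V,
    # starting from the small initialization W_0 = V_0 = alpha * I.
    d = target.shape[0]
    sym = (target + target.T) / 2.0          # symmetrized moment estimator
    W = alpha * np.eye(d)
    V = alpha * np.eye(d)
    for _ in range(T):
        residual = W @ W.T - V @ V.T - sym
        W_new = W - eta * residual @ W
        V_new = V + eta * residual @ V
        W, V = W_new, V_new
    return W @ W.T - V @ V.T                 # beta_T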
5 Numerical Experiments

In this section, we illustrate the performance of the proposed estimator in different settings via simulation studies. We let $\epsilon$ be centered Gaussian noise with a small variance in the models defined in (2.6) and (2.7), and we choose the link function to be one of the nonlinear functions $\{f_j\}$ plotted in Figures 3 and 4.

Figure 3: Plots of four of the link functions, panels (a)-(d).

Figure 4: Plots of the remaining four link functions, panels (a)-(d).
To measure the estimation accuracy, we use $\mathrm{dist}(\hat\beta, \beta^*) = \min\{\|\hat\beta/\|\hat\beta\|_\bullet - \beta^*\|_\bullet,\; \|\hat\beta/\|\hat\beta\|_\bullet + \beta^*\|_\bullet\}$, where $\bullet$ stands for the Euclidean norm in the vector case and the Frobenius norm in the setting of matrix covariates. The number of simulations is 100.

Recall that Theorems 3.2 and 3.6 establish the $\sqrt{s\log p/n}$ statistical rate of convergence in the $\ell_2$-norm. To vary this quantity, we fix $p$, let $s$ take one of three values, and use the value of $\sqrt{s\log p/n}$ to determine $n$. In addition, we choose the support of $\beta^*$ randomly among all subsets of $\{1, \ldots, p\}$ with cardinality $s$. For each $j \in \mathrm{supp}(\beta^*)$, we set $\beta^*_j = 1/\sqrt{s}\cdot \mathrm{Uniform}(\{-1, 1\})$. Besides, we let the entries of the covariate $x$ be i.i.d., with either the standard Gaussian distribution, Student's t-distribution with 5 degrees of freedom, or the Gamma distribution with shape parameter 8 and scale parameter 0.1. Based on $\beta^*$, the distribution of $x$, and one of the aforementioned univariate link functions $\{f_j\}$, we generate $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ from the vector SIM given in (2.6).
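A minimal sketch of this data-generating process and of the error metric dist is given below; the particular link function and the noise standard deviation are placeholders rather than the exact values used in our experiments.

import numpy as np

rng = np.random.default_rng(0)

def make_vector_sim(n, p, s, link, noise_sd=0.5):
    # Generate {(x_i, y_i)} from the vector SIM y = f(<x, beta*>) + eps with an
    # s-sparse beta* whose nonzero entries are +-1/sqrt(s) on a random support.
    support = rng.choice(p, size=s, replace=False)
    beta_star = np.zeros(p)
    beta_star[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    X = rng.standard_normal((n, p))                      # standard Gaussian covariates
    y = link(X @ beta_star) + noise_sd * rng.standard_normal(n)
    return X, y, beta_star

def dist(beta_hat, beta_star):
    # minimum over the global sign ambiguity of the distance between the
    # normalized estimate and the (unit-norm) true parameter
    b = beta_hat / np.linalg.norm(beta_hat)
    return min(np.linalg.norm(b - beta_star), np.linalg.norm(b + beta_star))

# example call with a cubic-type link (placeholder choice)
X, y, beta_star = make_vector_sim(n=2000, p=500, s=5, link=lambda z: z ** 3 + z)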
As for the optimization procedure, we take a small initialization value $\alpha$ and stepsize $\eta = 0.005$ in Algorithms 1 and 3. Our estimator $\hat\beta$ is chosen as $\hat\beta = \mathrm{argmin}_{\beta_t} \mathrm{dist}(\beta_t, \beta^*)$, where $\beta_t$ is the $t$-th iterate of Algorithm 1 or Algorithm 3. This choice of stopping time is idealized, but it serves our purpose: as shown in our asymptotic results, there is an interval of suitable stopping times, and a data-driven choice yields similar results while taking much longer to run.

With standard Gaussian covariates, we plot the average distance $\mathrm{dist}(\hat\beta, \beta^*)$ against $\sqrt{s\log p/n}$ in Figure 5 for two of the link functions, based on 100 independent trials for each $n$. The results show that the estimation error is bounded effectively by a linear function of $\sqrt{s\log p/n}$. Indeed, the linearity holds surprisingly well, which corroborates our theory.

Figure 5: The average $\ell_2$-distances between the true parameter $\beta^*$ and the estimated parameters $\hat\beta$ in the vector SIM with standard Gaussian covariates, for two link functions (panels (a) and (b)).

As for generally distributed covariates, we set $p_0(x)$ given in Definition 2.3 to be one of the following distributions: (i) Student's t-distribution with 5 degrees of freedom and (ii) the Gamma distribution with shape parameter 8 and scale parameter 0.1. The score functions of these two distributions are $S_0(x) = 6x/(5 + x^2)$ and $S_0(x) = 10 - 7/x$, respectively. In addition, the truncating parameter $\tau$ in Algorithm 3 is taken as $\tau = (n/\log p)^{1/4}$. We then plot the distance $\mathrm{dist}(\hat\beta, \beta^*)$ against $\sqrt{s\log p/n}$ in Figure 6 for the $t(5)$ and $\mathrm{Gamma}(8, 0.1)$ distributed covariates, based on 100 independent experiments. It is also worth noting that the estimation errors align well with a linear function of $\sqrt{s\log p/n}$.

Figure 6: The average $\ell_2$-distances between the true parameter and the estimated parameters in the vector SIM for (a) $t(5)$ distributed covariates and (b) $\mathrm{Gamma}(8, 0.1)$ distributed covariates, each with its corresponding link function.
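For the two general designs above, the entrywise score $S_0(x) = -p_0'(x)/p_0(x)$ and the truncation used by Algorithm 3 can be coded directly; the sketch below reflects our reading of the truncation rule (elementwise clipping at level $\tau$), and the Gamma score is written for shape 8 and rate 10 (scale 0.1).

import numpy as np

def score_t5(x):
    # score -p0'(x)/p0(x) of Student's t distribution with 5 degrees of freedom
    return 6.0 * x / (5.0 + x ** 2)

def score_gamma(x, shape=8.0, rate=10.0):
    # score of Gamma(shape, scale = 1/rate): -((shape - 1)/x - rate)
    return rate - (shape - 1.0) / x

def truncate(a, tau):
    # elementwise truncation at level tau (applied to both y and S(x))
    return np.clip(a, -tau, tau)

def tau_of(n, p):
    # truncation level used with Algorithm 3 in these experiments
    return (n / np.log(p)) ** 0.25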
In the scenario of the low rank matrix, the statistical rate in Frobenius norm is $\sqrt{rd\log d/n}$, according to Theorems 4.2 and 4.4. Throughout the matrix experiments we set $d = 25$, and for each of the three values of $r$ considered we use $\sqrt{rd\log d/n}$ to determine $n$. The true parameter matrix $\beta^*$ is set to be $USU^\top$, where $U \in \mathbb{R}^{d\times d}$ is a random orthogonal matrix and $S$ is a diagonal matrix with $r$ nonzero entries whose positions are chosen randomly among the index set $\{1, \ldots, d\}$. Moreover, we set the nonzero diagonal entries of $S$ to $1/\sqrt{r}\cdot\mathrm{Uniform}(\{-1, 1\})$. Besides, we let every entry of the covariate $X$ be i.i.d., with one of the same three distributions as in the vector experiments. We then use $\beta^*$, the distribution of $X$, and one of the link functions $\{f_j\}$ to generate $n$ i.i.d. data $\{(X_i, y_i)\}_{i=1}^n$ based on (2.7).
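Analogously, the matrix SIM data can be generated as in the following minimal sketch (random orthogonal $U$ via a QR factorization, $r$ nonzero spectral entries equal to $\pm 1/\sqrt{r}$); the link function and noise level are again placeholders.

import numpy as np

rng = np.random.default_rng(0)

def make_matrix_sim(n, d, r, link, noise_sd=0.5):
    # Generate {(X_i, y_i)} from the matrix SIM y = f(<X, beta*>) + eps,
    # with beta* = U S U^T of rank r and ||beta*||_F = 1.
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix
    diag = np.zeros(d)
    idx = rng.choice(d, size=r, replace=False)
    diag[idx] = rng.choice([-1.0, 1.0], size=r) / np.sqrt(r)
    beta_star = U @ np.diag(diag) @ U.T
    X = rng.standard_normal((n, d, d))                 # i.i.d. standard Gaussian entries
    inner = np.einsum('nij,ij->n', X, beta_star)       # <X_i, beta*>
    y = link(inner) + noise_sd * rng.standard_normal(n)
    return X, y, beta_star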
As for the optimization procedure, we take a small initialization value $\alpha$ and stepsize $\eta = 0.005$ and implement Algorithm 2 and Algorithm 4 for the Gaussian and general designs, respectively. Our estimator $\hat\beta$ is again chosen as $\hat\beta = \mathrm{argmin}_{\beta_t}\mathrm{dist}(\beta_t, \beta^*)$, where $\beta_t$ is the $t$-th iterate of Algorithm 2 or Algorithm 4. Again, this is the idealized choice of stopping time, but it serves the purpose, as the result does not depend very much on the precise choice of stopping time.

With standard Gaussian covariates, we plot the averaged distance $\mathrm{dist}(\hat\beta, \beta^*)$ against $\sqrt{rd\log d/n}$ in Figure 7 for two of the link functions, based on 100 independent trials for each case. The estimation error again depends linearly on $\sqrt{rd\log d/n}$, and the simulation results are consistent with what is predicted by the theory.

Figure 7: The averaged distances between the true parameter $\beta^*$ and the estimated parameter matrices $\hat\beta$ in the matrix SIM with standard Gaussian covariates, for two link functions (panels (a) and (b)).

We also show the distance $\mathrm{dist}(\hat\beta, \beta^*)$ against $\sqrt{rd\log d/n}$ in Figure 8 for $t(5)$ and $\mathrm{Gamma}(8, 0.1)$ distributed covariates, based on 100 independent experiments; the results are again in line with the theory. Here the shrinkage parameter $\kappa$ in Algorithm 4 is set to $\kappa = \sqrt{\log(2d)/(nd)}$.

Figure 8: The averaged distances between the true parameter $\beta^*$ and the estimated parameter matrices $\hat\beta$ for (a) $t(5)$ distributed covariates and (b) $\mathrm{Gamma}(8, 0.1)$ distributed covariates, each with its corresponding link function.

In this paper, we study the implicit regularization induced by the gradient descent algorithm in over-parameterized vector and matrix single index models. We consider the case where the link function is unknown, the distribution of the covariates is known a priori, and the signal parameter is either an $s$-sparse vector in $\mathbb{R}^p$ or a rank-$r$ matrix in $\mathbb{R}^{d\times d}$. Using the score function and Stein's identity, we propose an over-parameterized nonlinear least-squares loss function. To handle the possibly heavy-tailed distributions of the score functions and the response variables, we adopt additional truncation techniques that robustify the loss function. For both the vector and matrix SIMs, we construct an estimator of the signal parameter by applying gradient descent to the proposed loss function, without any explicit regularization. We prove that, when initialized near the origin, gradient descent with a small stepsize finds an estimator that enjoys minimax-optimal statistical rates of convergence. Moreover, for the vector SIM with Gaussian design, we further obtain oracle statistical rates that are independent of the ambient dimension. Our results demonstrate that the implicit regularization phenomenon also appears when applying simple optimization algorithms to over-parametrized nonlinear statistical models with possibly heavy-tailed data.

A Proofs of Theoretical Results in §3

In this section, we prove the results presented in §3. Specifically, in §A.1 we first consider a special case in which all entries of the signal $\mu^*\beta^*$ are non-negative. The analysis of this simpler case conveys the key ideas that will be used for proving Theorem 3.2 in §A.2. Moreover, we present the proofs of Theorems 3.4 and 3.6 in §A.3 and §A.4, respectively.
A.1 A Warm-Up Example: Non-Negative Signal
When each entry of the signal parameter µ ˚ β ˚ is non-negative, instead of writing µ ˚ β ˚ as w d w ´ v d v , we simply parameterize it as w d w , where w P R p . In this case, the loss function L p w , v q given in (3.2) is reduced to L p w q “ x w d w , w d w y ´ A w d w , n n ÿ i “ y i S p x i q E , (A.1)where t x i u i Pr n s are n i.i.d. observations in R p generated from the Gaussian distribution N p µ, Σ q with score function S p x q “ Σ ´ p x ´ µ q . Then, starting from w “ α ¨ p ˆ , we obtain t w t u t ě viarunning gradient descent on L p w q in (A.1) with a constant stepsize η ą
0, i.e., w t ` “ w t ´ η ´ w t d w t ´ n n ÿ i “ S p x i q y i ¯ d w t , @ t ě . (A.2)From t w t u t ě , we define β t “ w t d w t for all t ě
0, which are used to estimate µ ˚ β ˚ .In the sequel, we show that the statements in Theorem 3.2 also hold for t β t u t ě defined above.Specifically, with α and η chosen as the same in Theorem 3.2, there exist two absolute constants a and a such that for any T P r a log p s m { α q{ ηs m , a a n { log p { η s , with probability at least1 ´ p ´ ´ n ´ , we have } β T ´ µ ˚ β ˚ } À s log nn ` s log pn , } β T ´ µ ˚ β ˚ } À s c log nn ` s c log pn . Meanwhile, we also get the convergence rate for the normalized version of our iterates ››››› β T ›› Σ { β T ›› ´ µ ˚ β ˚ | µ ˚ | ››››› À s log nn ` s log pn , ››››› β T ›› Σ { β T ›› ´ µ ˚ β ˚ | µ ˚ | ››››› À a p s ` s q c s log nn ` s log pn . Before proceeding to the theoretical proof for this warm-up example, we first remind readersof our notations. We define the support set S of our signals as S : “ t i : | β ˚ i | ą u . Then pureerror part of the t -th iterate w t is denoted by e t “ S c d w t , in which S c is a vector whose i -th entry is one if i P S c and zero otherwise. Recall that our underlying true signal β ˚ is s -sparse, we further classify signals inside S in terms of their strengths. So we define subset S as S : “ t i : | β ˚ i | Á log p a log p { n u which contains strong signals and subset S that contains weak27ignals as S : “ t i : 0 ă | β ˚ i | À a log p { n u . Thus, strong signal part and weak signal part of w t are denoted by s t “ S d w t and u t “ S d w t respectively. In addition, we let s and s bethe size of set S and S . For simplicity, through our proof in § A, we set γ “ a n { log p by γ andΦ n “ n ř ni “ S p x i q y i . Proof of Theorem 3.2 for the Warm-Up Example.
The proof of Theorem 3.2 for the warm-up example requires two major ingredients: (i) controlling the strengths of the pure-error and weak-signal parts of the iterates $\{w_t\}_{t\ge 0}$, and (ii) establishing entrywise convergence of the strong-signal components of the iterates $\{\beta_t\}_{t\ge 0}$. For the pure-error and weak-signal parts of the iterates, we characterize their dynamics in the following Lemma A.1.

Lemma A.1.
Under assumptions in Theorem 3.2, with probability 1 ´ p ´ , there exists a constant a depending on absolute constants C , C ą } e t } ď C α À p , } u t } ď C α À p , for any t ď T : “ a γη . Proof.
See § A.1.1 for a detailed proof.To be more specific, in Lemma A.1 given above, we prove that the strengths of } e t } and } u t } are controlled well by a term of order O p { p q with high probability for all t ď T “ O p γ { η q . Thisfurther implies that ›› β t d S c ´ µ ˚ β ˚ d S c ›› À s log pn ` p (A.3)holds with probability at least 1 ´ p ´ when t ď T .For strong signal components of our iterates t β t u t ě , in the following Lemma A.2, we prove that β t d S converges to µ ˚ β ˚ d S entrywisely with high probability after certain iterations. Lemma A.2.
Under assumptions in Theorem 3.2, if we further choose 0 ă η ď {p
16 max i | µ ˚ β ˚ i |q ,there exists a constant a such that } S d β t ´ S d µ ˚ β ˚ } ď M c log nn holds with probability 1 ´ n ´ , for any t ě a log p s m α q{p ηs m q . Here s m “ min i P S r µ ˚ β ˚ s i is thesmallest value of strong signals and M is a constant that is proportional to max t} f } ψ , σ u . Proof.
See § A.1.2 for a detailed proof.Utilizing the conclusion from Lemma A.2, we obtain an upper bound of (cid:96) -distance between β t d S and µ ˚ β ˚ d S as } β t d S ´ µ ˚ β ˚ d S } ď M s log nn , when t ě a log ´ s m α ¯L p ηs m q . (A.4)28ombining (A.3) and (A.4), we get the first conclusion that } β T ´ µ ˚ β ˚ } ď M s log nn ` M s log pn ` c p holds with probability at least 1 ´ n ´ ´ p ´ , when T P r O p log p s m { α q{p ηs m qq , O p γ { η qs .Next, we prove the (cid:96) -convergence rate for the normalized version of our iterates. We note that µ ˚ “ E r f px x , β ˚ yqs is a constant. Without loss of generality, we assume µ ˚ ą
0. Then there existsan n ˚ depending on µ ˚ such that we have } Σ { β T } ě µ ˚ ´ } Σ { p β T ´ µ ˚ β ˚ q} ě µ ˚ ´ a C max ¨ d M s log nn ` M s log pn ` c p ě µ ˚ , by triangle inequality, when n ě n ˚ and T P r O p log p s m { α q{p ηs m qq , O p γ { η qs . Then we furtherobtain ›››› β T } Σ { β T } ´ β ˚ ›››› “ } β T ´ } Σ { β T } ¨ β ˚ } } Σ { β T } ď } β T ´ µ ˚ β ˚ } ` } µ ˚ β ˚ ´ } Σ { β T } β ˚ } } Σ { β T } ď µ ˚ } β T ´ µ ˚ β ˚ } ` C max µ ˚ C min } µ ˚ β ˚ ´ β T } ď M s log nn ` M s log pn ` c p . Inequalities regarding the (cid:96) -convergence rates for β T and its normalized version β T {} Σ { β T } with T P r O p log p s m { α q{p ηs m qq , O p γ { η qs can also be established by following similar argumentsabove. This concludes the proof for the warm-up case of Theorem 3.2. A.1.1 Proof of Lemma A.1
Proof.
We prove our Lemma A.1 by induction. As we initialize e with } e } ď α À { p and u with } u } ď α À { p , Lemma A.1 holds when t “
0. Next, for any t ˚ with 0 ď t ˚ ă T “ a γ { η , ifthe conclusion of Lemma A.1 holds for any t with 0 ď t ď t ˚ , we need to verify that it also holdsat step t ˚ `
1. Moreover, the constant a in the expression of T will be specified by us during ourproof.By the updating rule of gradient descent given in (A.2) and the definition of e t ` , we have e t ` “ S c d w t ` “ S c d “ w t ´ η ` β t ´ Φ n ˘ d w t ‰ “ e t ´ η ` β t ´ Φ n ˘ d e t . Note that w t “ s t ` u t ` e t and each component has a disjoint support set. Then we further have } e t ` } “ ›› e t ´ η “ s t d s t ` u t d u t ` e t d e t ´ µ ˚ β ˚ ´ ` Φ n ´ µ ˚ β ˚ ˘‰ d e t ›› “ ›› e t ´ η ` e t d e t ` Φ n ´ E r Φ n s ˘ d e t ›› , s t d e t “ , u t d e t “ , and β ˚ d e t “
0. By triangle inequality, } e t ` } ď “ ` η `›› e t d e t ›› ` ›› Φ n ´ E r Φ n s ›› ˘‰ ¨ } e t } . (A.5)From the expression on the right hand side of (A.5), we further obtain that } e t ` } ď “ ` η ` C α ` M a log p { n ˘‰ ¨ } e t } (A.6)holds with probability 1 ´ p ´ for any t with 0 ď t ď t ˚ , according to the following Lemma A.3and our hypothesis induction at time t ˚ . Lemma A.3.
Under assumptions given in Theorem 3.2, with probability 1 ´ p ´ , we obtain ›› Φ n ´ E r Φ n s ›› ď M c log pn , in which M is a constant which is proportional to max t} f } ψ , σ u . Proof.
The detailed proof is given in § A.1.3We now deal with u t using a similar technique. For u t , by the gradient updates given in (A.2),we obtain } u t ` } “ ›› u t ´ η ` β t ´ Φ n ˘ d u t ›› “ ›› u t ´ η “ u t d u t ´ µ ˚ β ˚ ´ ` Φ n ´ µ ˚ β ˚ ˘‰ d u t ›› . According to our definition of set S , without loss of generality, we assume there exists a constant M such that | µ ˚ β ˚ i | ď M a log p { n , for all i P S . Together with our induction hypothesis at step t ˚ , we have } u t ` } ď “ ` η ` C α ` max t M, M u ¨ a log p { n ˘‰ ¨ } u t } , (A.7)with probability 1 ´ p ´ for any t with 0 ď t ď t ˚ . As we have assumed n ! p under our settingsof high dimensional SIM, by our assumption on α ( α À { p ) stated in Theorem 3.2, there exists aconstant c ą t C α , C α u ď c c log pn holds. If we set c as c “ {p c ` max t M, M uq and let a “ c log p min t C , C uq in our expressionof T with T “ a γ { η , combining (A.6) and (A.7), we then have } e t ˚ ` } ď “ ` η {p c γ q ‰ t ˚ ` ¨ } e } ď exp ` T log ` ` η {p c γ q ˘˘ ¨ α ď exp p log p C qq ¨ α À { p, and } u t ˚ ` } ď “ ` η {p c γ q ‰ t ˚ ` ¨ } u } ď exp ` T log ` ` η {p c γ q ˘˘ ¨ α ď exp p log p C qq ¨ α À { p. Thus, our induction hypothesis also holds for t ˚ `
1. In addition, as $t^*$ is arbitrarily chosen, we claim our conclusion for Lemma A.1.

A.1.2 Proof of Lemma A.2

Proof.
Following (A.2), our updating rule with respect to strong signal component s t is given by s t ` “ S d w t ` “ s t ´ η ` β t ´ Φ n ˘ d s t “ s t ´ η “ s t d s t ` u t d u t ` e t d e t ´ µ ˚ β ˚ ´ ` S d Φ n ´ µ ˚ β ˚ d S ˘‰ d s t “ s t ´ η “ s t d s t ´ µ ˚ β ˚ d S ´ ` S d Φ n ´ µ ˚ β ˚ d S ˘‰ d s t . As a reminder, here we also denote n ř ni “ S p x i q y i as Φ n . Then we further get the evolution onstrong signal part of β t ` as S d β t ` “ s t ` d s t ` “ (cid:32) S ´ η “ S d β t ´ µ ˚ β ˚ d S ´ ` S d Φ n ´ µ ˚ β ˚ d S ˘‰( d S d β t . In addition, by following a similar proof procedure of Lemma A.3, we obtain that ›› S d Φ n ´ E r S d Φ n s ›› ď M c log nn (A.8)holds with probability 1 ´ n ´ , in which M is a constant only relying on max t} f } ψ , σ u . Next,we analyze dynamics of every entry of S d β t separately. For any i P S , we get the evolution of β t,i as β t ` ,i “ “ ´ η ` β t,i ´ µ ˚ β ˚ i ´ ξ i ˘‰ ¨ β t,i , (A.9)where ξ i “ r n ř nk “ S p x k q y k s i ´ µ ˚ β ˚ i . By direct calculation from (A.8), we have that | ξ i | ď M a log n { n holds for any i P S simultaneously with probability 1 ´ n ´ .The basic idea of the proof is as follows. For simplicity of notation, let β i : “ µ ˚ β ˚ i ` ξ i . Then,it is clear that β i is a stationary point of equation (A.9) and our task is to show that the iteration(A.9) converges to the stationary point sufficiently fast before the errors in components S c and S are too large. Since we start from a small initial value with β ,i ă β i (with high probability), thenonlinear factor “ ´ η ` β t,i ´ µ ˚ β ˚ i ´ ξ i ˘‰ in (A.9) is always greater than one. Thus, the sequence β t,i is monotonically increasing with | β t,i | ă β i with the nonlinear factor in (A.9) close to one as β i,i approaches β i .red We now analyze the dynamics of t β t,i u t ě . We divide its evolution into several phases: thetime it takes from β ,i “ α to β i {
2; the time it takes from to β i { { β i ; the time it takes from3 { β i to 7 { β i ; and so on. We divide our analysis into three steps. Step I.
For 0 ă β t,i ď β i { , we get a geometric increment of β t,i , namely β t ` ,i ě β t,i ¨ „ ` η ¨ ˆ µ ˚ β ˚ i ` ξ i ˙ . We wish to get a sufficient condition for time t such that after t -th iteration, β t,i ě β i {
2, namely,finding the time t to satisfy α ¨ ˆ ` ηβ i ˙ t ě β i , t ě T ,i : “
12 log ˆ β i α ˙M log ˆ ` ηβ i ˙ . Let us get an upper bound for term T ,i . By inequality x log p x q ´ x ` ě , when x ě
0, we obtain T ,i ď
12 log ˆ β i α ˙Mˆ ηβ i { ` ηβ i { ˙ ď log ˆ β i α ˙M` ηβ i ˘ ` log ˆ β i α ˙ ď ˆ β i α ˙M` ηβ i ˘ . The last inequality follows from our assumption on η . Thus, for every i P S , when t ě p β i {p α qq{ ηβ i ,we get β t,i ě β i { Step II. If p ´ { m q β i ď β t,i ď p ´ { m ` q β i with some m ě
1, we also obtain a geometricincrement of β t,i by (A.9), namely β t ` ,i ě ˆ ` ηβ i m ` ˙ ¨ β t,i . Similarly, we also want a sufficient condition for t i,m such that we have β t,i ˆ ` ηβ i m ` ˙ t i,m ě ˆ ´ m ` ˙ β i , which is equivalent to find a t i,m that satisfies t i,m ě T i,m : “
12 log ˆ p ´ { m ` q β i β t,i ˙M log ˆ ` ηβ i m ` ˙ . Similar to the case of the
Step I , we obtain an upper bound of T i,m as T i,m : “
12 log ˆ p ´ { m ` q β i β t,i ˙M log ˆ ` ηβ i m ` ˙ ď
12 log ˆ p ´ { m ` q β i p ´ { m q β i ˙Mˆ ηβ i { m ` ` ηβ i { m ` ˙ “
12 log ˆ ´ { m ` ´ { m ˙Mˆ ηβ i { m ` ` ηβ i { m ` ˙ . By direct calculation, we further have T i,m ď
12 log ˆ ` { m ` ´ { m ˙ ¨ ˆ ` ηβ i m ` ˙Mˆ ηβ i m ` ˙ ď ˆ { m ` ´ { m ˙Mˆ ηβ i m ` ˙ ď ηβ i . The last inequality follows from our assumption on m ě . So for t i,m ě { ηβ i , we get β t ` t m ,i ěp ´ { m ` q β i under our settings of Step II . For the target (cid:15) “ M a log n { n , if we repeat our Step II above for m : “ r log p β i { (cid:15) q s times, we have β i { m ď (cid:15) and β t,i ě β i ´ (cid:15), after T i ě ˆ β i α ˙M` ηβ i ˘ ě ˆ β i α ˙M` ηβ i ˘ ` R log ˆ β i (cid:15) ˙VM` ηβ i ˘ ,
by our assumptions on $\alpha$ and $\epsilon$.

Step III.
It remains to prove that our iterates β t,i , i P S never exceed β i through the wholeiteration if we take stepsize small enough. Without loss of generality, we assume β t,i ă β i , and wewant to prove that β t ` ,i ď β i holds for all t . As β t,i ă β i , there must exist an m satisfying ˆ ´ m ˙ β i ď β t,i ď ˆ ´ m ` ˙ β i . Then we get an upper bound of β t ` ,i by (A.9) as β t ` ,i ď ˆ ` ηβ i m ˙ β t,i “ ˆ ` ηβ i m ` η β i m ˙ β t,i ď ˆ ´ m ` ˙ β i ` ηβ i m ` ˆ ´ m ` ˙ β i ` η β i m ˆ ´ m ` ˙ β i . When we take the stepsize η satisfying η ď {
16 max i µ ˚ β ˚ i ď { β i , for every i P S , we obtain4 ηβ i m ` ˆ ´ m ` ˙ β i ` η β i m ˆ ´ m ` ˙ β i ď β i m ` . Finally, after letting (cid:15) “ M a log n { n and following three steps of our proof given above, we obtain µ ˚ β ˚ i ´ (cid:15) ď β t,i ď µ ˚ β ˚ i ` (cid:15), for every i P S after T iterations. And we have T satisfies T ě max i P S T i “ a log ˆ s m α ˙ { ` ηs m ˘ . with some constants a . Here we utilize the fact that µ ˚ β ˚ i and β i are at the same order by ourdefinition of set S and the concentration upper bound of β i to µ ˚ β ˚ i with max i P S | µ ˚ β ˚ i ´ β i | À a log n { n . Thus, for t ě O p log p s m { α q{ ηs m q , we have } β t d S ´ µ ˚ β ˚ i d S } ď M c log nn , and we obtain our conclusion of Lemma A.2. A.1.3 Proof of Lemma A.3
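The scalar recursion (A.9) that drives this argument is easy to simulate. The sketch below, with illustrative values of $\eta$, $\mu^*\beta^*_i$ and $\xi_i$ chosen by us (and up to the absolute constant multiplying $\eta$), reproduces the behavior described in Steps I-III: geometric growth from the small initialization followed by saturation near the stationary point, without overshooting.

import numpy as np

def entrywise_dynamics(signal, xi, eta, alpha, T):
    # iterate beta_{t+1} = (1 - eta * (beta_t - signal - xi)) * beta_t, as in (A.9),
    # from the small initialization beta_0 = alpha ** 2
    beta = alpha ** 2
    path = [beta]
    for _ in range(T):
        beta = (1.0 - eta * (beta - signal - xi)) * beta
        path.append(beta)
    return np.array(path)

# illustrative values: eta is below 1/(16 * signal), so the iterate grows
# geometrically while far below signal + xi and then settles at that level
path = entrywise_dynamics(signal=1.0, xi=0.01, eta=0.05, alpha=1e-3, T=400)
print(path[0], path[100], path[-1])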
Proof.
By our definition of Φ n , we have ›› Φ n ´ E r Φ n s ›› “ ›››› n n ÿ i “ S p x i q y i ´ n n ÿ i “ E r S p x i q y i s ›››› ď ›››› n n ÿ i “ S p x i q f p x T i β ˚ q ´ n n ÿ i “ E “ S p x i q f p x T i β ˚ q ‰›››› ` ›››› n n ÿ i “ S p x i q (cid:15) i ›››› . f p x Tj β ˚ q as f j , j P r n s and the i -th row of n ř ni “ S p x i q f i ´ n ř ni “ E r S p x i q f i s as W i , i P r p s . Then we get the expression of W i as W i “ S p x q i f ` ¨ ¨ ¨ ` S p x n q i f n n ´ E „ S p x q i f ` ¨ ¨ ¨ ` S p x n q i f n n , which can be regarded as a concentration of n i.i.d. sub-exponential variables with sub-exponentialnorm } S p x q i f ´ E r S p x q i f s} ψ ď sup i } S p x q i } ψ } f } ψ : “ K. After applying Bernstein inequality given in Corollary 2.8.3 of Vershynin (2018), we have P ´ max i Pr p s | W i | ě t ¯ ď ´ ´ c min (cid:32) t { K , t { K ( ¨ n ` log p ¯ , (A.10)in which c is a universal constant. We further set t “ K a p {p cn q in (A.10), then we claimmax i Pr p s | W i | ď K c pcn holds with probability 1 ´ p ´ . Similarly, we also get } n ř ni “ x i (cid:15) i } À σ a log p { n, with probability1 ´ p ´ . After denoting max t K, σ u as M , we obtain that ›› Φ n ´ E r Φ n s ›› ď M c log pn holds with probability 1 ´ p ´ . Thus, we claim our conclusion of Lemma A.3.We have already proved our warm-up case for non-negative signals, in the following section, § A.2, we analyze the situation when we have general signals.
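As a quick numerical sanity check of this concentration (a sketch, not part of the proof), one can compare the max-norm deviation of $\Phi_n$ from its mean with the $\sqrt{\log p/n}$ scale; the link function $f(z) = z^3 + z$, for which $\mu^* = \mathbb{E}[f'(z)] = 4$ under the standard Gaussian design, and all constants below are our illustrative choices.

import numpy as np

rng = np.random.default_rng(1)

def max_deviation(n, p, s=5, reps=50):
    # Monte Carlo estimate of || Phi_n - E[Phi_n] ||_inf for the link f(z) = z^3 + z
    # under the standard Gaussian design, where S(x) = x and mu* = E[f'(z)] = 4,
    # so that E[Phi_n] = 4 * beta*.
    beta = np.zeros(p)
    beta[:s] = 1.0 / np.sqrt(s)
    devs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        z = X @ beta
        y = z ** 3 + z + rng.standard_normal(n)
        phi_n = X.T @ y / n
        devs.append(np.max(np.abs(phi_n - 4.0 * beta)))
    return np.mean(devs), np.sqrt(np.log(p) / n)

# the first number is a constant multiple of the second, as Lemma A.3 predicts
print(max_deviation(n=2000, p=500))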
A.2 Proof of Theorem 3.2 with General Signals
In the following we consider a more general case, in which we are not able to get any informationabout the sign of µ ˚ β ˚ i , for any i P r p s . In this situation, we over-parameterize µ ˚ β ˚ as w d w ´ v d v ,in which w and v are vectors with size p ˆ
1. Then we apply gradient descent to the following lossfunction (A.11)min w , v L p w , v q “ x w d w ´ v d v , w d w ´ v d v y ´ A w d w ´ v d v , n n ÿ i “ y i S p x i q E . (A.11)with respect to w and v . Their gradient descent updates are given by w t ` “ w t ´ η ` w t d w t ´ v t d v t ´ Φ n ˘ d w t . (A.12) v t ` “ v t ` η ` w t d w t ´ v t d v t ´ Φ n ˘ d v t (A.13)Similar to the case of non-negative signals, here we also remind readers of our notations first. Wedivide entries of β ˚ into different groups in terms of their strengths by using the same way with34ur method in § A.1. The support set S of our signal is defined as S : “ t i : | β ˚ i | ą u , and the set S which contains the strong signals is denoted by S : “ t i : | β i | Á log p a log p { n u . In addition,we also define S as S : “ t i : 0 ă | β ˚ | À a log p { n u , which contains all indices of the weak signals.Likewise, pure error parts of w t and v t are denoted by e ,t : “ S c d w t and e ,t : “ S c d v t respectively. In addition, strong signal parts of w t and v t are denoted by s ,t “ S d w t and s ,t “ S d v t , meanwhile, weak signals parts are written as u ,t : “ S d w t and u ,t : “ S d v t .Here, we also denote a n { log p by γ and n ř ni “ S p x i q y i by Φ n , and let s and s be the size of set S and S respectively. Proof.
The proof idea of Theorem 3.2 is almost the same with our warm-up example in § A.1, it alsorequires our analysis on dynamics of pure error, weak signal, strong signal components separately.According to our conclusion from the following Lemma A.4, there exist an another constant a suchthat we are able to control our pure error, weak signal parts at the same level with their initialvalues α within the time horizon 0 ď t ď T “ a a n { log p { η . Lemma A.4.
Under the assumptions in Theorem 3.2, there exists a constant a depending on C , C , such that } e ,t } ď C ¨ α À p , } e ,t } ď C ¨ α À p , } u ,t } ď C ¨ α À p , } u ,t } ď C ¨ α À p , hold with probability 1 ´ p ´ , for all t ď T “ a γ { η . Proof.
Please see § A.2.1 for a detailed proof.In addition, we also conclude from the following Lemma A.5 that we are able to obtain entrywiseconvergence of strong signal component.
Lemma A.5.
Under assumptions in Theorem 3.2, if we further choose 0 ă η ď {p
48 max i | µ ˚ β ˚ i |q ,there exists a constant a such that ›› β t d S ´ µ ˚ β ˚ d S ›› ď M c log nn holds with probability 1 ´ n ´ for all t ě a log p s m { α q{p ηs m q , where s m denotes the minimumvalue of | µ ˚ β ˚ i | , with i P S Proof.
Please see §A.2.2 for a detailed proof. Finally, following the same proof procedure as in our warm-up case, we claim our conclusion of Theorem 3.2. Next, in the following two subsections, namely §A.2.1 and §A.2.2, we give our proofs of Lemma A.4 and Lemma A.5.

A.2.1 Proof of Lemma A.4
Proof.
Similar with our analysis of proving Lemma A.1, here we also prove our Lemma A.4 byinduction hypothesis. It holds that our initializations } e , } , } e , } , } u , } and } u , } satisfyour conclusion given in Lemma A.4. Next, for an arbitrarily chosen t ˚ with 0 ď t ˚ ă T “ a γ { η ,we also assume Lemma A.4 holds for any t , with 0 ď t ď t ˚ and we aim at verifying our conclusionfor the step t ˚ `
1. In addition, constant a will be given during our proof.From our gradient descent updates of w t and v t given in (A.12)-(A.13), our updates with respectto pure error parts e ,t , e ,t and weak signal components u ,t , u ,t are obtained as follows e ,t ` “ e ,t ´ η ` β t ´ Φ n ˘ d e ,t , e ,t ` “ e ,t ` η ` β t ´ Φ n ˘ d e ,t , (A.14) u ,t ` “ u ,t ´ η ` β t ´ Φ n ˘ d u ,t , u ,t ` “ u ,t ` η ` β t ´ Φ n ˘ d u ,t . (A.15)Then for any l P t , u , the following inequalities always hold according to triangle inequality } e l,t ` } ď “ ` η ` } e ,t d e ,t } ` } e ,t d e ,t } ` } Φ n ´ E r Φ n s} ˘‰ ¨ } e l,t } , (A.16) } u l,t ` } ď “ ` η ` } u ,t d u ,t } ` } u ,t d u ,t } ` } µ ˚ β ˚ d S } ` } Φ n ´ E r Φ n s} ˘‰ ¨ } u l,t } . (A.17)According to our induction hypothesis, for any l P t , u , we are able to bound } e l,t } togetherwith } u l,t } at the same order with α , when t ď t ˚ . Thus, we replace } e l,t d e l,t } and } u l,t d u l,t } by C α and C α respectively in (A.16) and (A.17). Similar with our warm-up case, we assumethere exists a constant M such that | µ ˚ β ˚ i | ď M a log p { n when i P S . After further applyingLemma A.3, we obtain that } e l,t ` } ď ” ` η ´ C α ` M a log p { n ¯ı ¨ } e l,t } , and } u l,t ` } ď ” ` η ´ C α ` max t M, M u a log p { n ¯ı ¨ } u l,t } , hold with probability 1 ´ p ´ for any t ď t ˚ and l P t , u . By our assumption on α in Theorem3.2, there exists a constant c ą t C α , C α u ď c c log pn holds. If we set c as c “ {p c ` max t M, M uq and let a “ c log p min t C , C uq in our expressionof T with T “ a γ { η , combining (A.16) and (A.17), we then have ›› e l,t ˚ ` ›› ď “ ` η {p c γ q ‰ t ˚ ` ¨ } e l, } ď exp ` T log ` ` η {p c γ q ˘˘ ¨ α ď exp p log p C qq ¨ α À { p, and ›› u l,t ˚ ` ›› ď “ ` η {p c γ q ‰ t ˚ ` ¨ } u l, } ď exp ` T log ` ` η {p c γ q ˘˘ ¨ α ď exp p log p C qq ¨ α À { p, for any l P t , u . Thus, our induction hypothesis also holds for t ˚ `
1. In addition, as $t^*$ is arbitrarily chosen, we claim our conclusion for Lemma A.4.

A.2.2 Proof of Lemma A.5

Proof.
Following (A.12) and (A.13), the dynamics of β p q t : “ s ,t d s ,t “ S d w t ` d w t ` ,β p q t : “ s ,t d s ,t “ S d v t ` d v t ` and β t,S : “ S d β t are obtained as β p q t ` “ “ ´ η ` β t,S ´ µ ˚ β ˚ d S ` µ ˚ β ˚ d S ´ Φ n d S ˘‰ d β ,t , (A.18) β p q t ` “ “ ` η ` β t,S ´ µ ˚ β ˚ d S ` µ ˚ β ˚ d S ´ Φ n d S ˘‰ d β ,t , (A.19) β t ` ,S “ β p q t ` ´ β p q t ` . Like the case of non-negative signals, we also denote the i -th entry of µ ˚ β ˚ ´ Φ n as ξ i , we then get | ξ i | ď M a log n { n holds for any i P S simultaneously with probability 1 ´ n ´ by our LemmaA.3. Without loss of generality, here we just analyze entries i P S with µ ˚ β ˚ i ` ξ i ą
0. Similarly,we also divide our analysis into several steps.
Step I.
When we have 0 ď β t,i ď p µ ˚ β ˚ i ` ξ i q{ , i P S , we will get geometric increment of β p q t,i anddecrement of β p q t,i respectively β p q t ` ,i “ w t ` ,i ě „ ` η p µ ˚ β ˚ i ` ξ i q ¨ w t,i , β p q t ` ,i “ v t ` ,i ď „ ´ η p µ ˚ β ˚ i ` ξ i q ¨ v t,i . This first stage ends when our β t,i exceeds p µ ˚ β ˚ ` ξ i q{
2, so we estimate the time order of t ,i thatsatisfies β t,i ě „ ` η p µ ˚ β ˚ i ` ξ i q t ,i α ´ „ ´ η p µ ˚ β ˚ i ` ξ i q t ,i α ě µ ˚ β ˚ i ` ξ i . It could be hard for us to figure out the exact order of t ,i , instead, we find a sufficient conditionfor t ,i , i.e. when t ě t ,i , we must have β t,i ě p µ ˚ β ˚ i ` ξ i q{ . Observe that it is sufficient to solvethe following inequality for t ,i „ ` η p µ ˚ β ˚ i ` ξ i q t ,i α ě µ ˚ β ˚ ` ξ i ` α , which is equivalent to find t ,i satisfying t ,i ě T ,i : “
12 log ˆ µ ˚ β ˚ ` ξ i α ` ˙M log ˆ ` η p µ ˚ β ˚ ` ξ i q ˙ . Like the case of non-negative signals, we also denote β i : “ µ ˚ β ˚ i ` ξ i for simplicity and we obtain T ,i “
12 log ˆ β i α ` ˙M log ˆ ` ηβ i ˙ ď ˆ β i α ˙ ¨ ˆ ` ηβ i ˙M` ηβ i ˘ ď ˆ β i α ˙M` ηβ i ˘ . in which the second inequality follows from x log p x q ´ x ` ě , when x ě α . Thus, we set t ,i “ p β i { α q{ ηβ i such that for all t ě t ,i we get β t,i ě β i { i P S .37 tep II. If we have p ´ { m q β i ď β t,i ď p ´ { m ` q β i , for some 1 ď m ď m “ r log p β i { (cid:15) q s with (cid:15) “ M a log n { n , according to (A.18) and (A.19), we obtain w t ` ,i ě ˆ ` ηβ i m ` ˙ d w t,i , v t ` ,i ď ˆ ´ ηβ i m ` ˙ d v t,i . Here we also want to get a sufficient condition for t i,m with w t,i ˆ ` ηβ i m ` ˙ t i,m ´ ˆ ´ ηβ i m ` ˙ t i,m v t,i ě ˆ ´ m ` ˙ β i , which is equivalent to find t i,m that satisfies t i,m ě T i,m : “
12 log ˆ p ´ { m ` q β i ` α w t,i ˙M log ˆ ` ηβ i m ` ˙ . Here we have assume v t,i ď α , because we will demonstrate that β t,i will never exceed β i throughthe whole iteration in the following Step III , then v t,i will keep decreasing according to (A.19).Similar to the first stage, it is also sufficient for us to get an upper bound of T i,m . Underassumption w t,i ě β t,i ě p ´ { m q β i , we obtain T i,m “
12 log ˆ p ´ { m ` q β i ` α w t,i ˙M log ˆ ` ηβ i m ` ˙ ď
12 log ˆ p ´ { m ` q β i ` α p ´ { m q β i ˙Mˆ ηβ i { m ` ` ηβ i { m ` ˙ “
12 log ˆ ` { m ` ´ { m ` α p ´ { m q β i ˙ ¨ ˆ ` η m ` β i ˙Mˆ ηβ i m ` ˙ , where the second inequality follows from x log p x q ´ x ` ě , when x ě . By direct calculation,we further get T i,m ď ˆ { m ` ´ { m ` α p ´ { m q β i ˙Mˆ ηβ i m ` ˙ ď ηβ i ` m ` α ηβ i . (A.20)In order to control the term 2 m ` α {p ηβ i q , we need to find an upper bound of m . For the target (cid:15) “ M a log n { n , if we repeat Step II for m “ r log p β i { (cid:15) q s times, we have β i { m ď (cid:15) , meanwhile,we also have 2 m ` ď β i { (cid:15) À a n { log nβ i for any m ď m .By our assumption on the initial value α stated in Theorem 3.2, we have α À { p . Then webound 2 m ` α {p ηβ i q given in (A.20) as2 m ` α ηβ i À c n log n ¨ p ηβ i ď Cηβ i , (A.21)in which C is a universal constant, when m ď m . Combining our results from (A.20) and (A.21),we finally bound T i,m as T i,m ď
12 log ˆ p ´ { m ` q β i ` α w t,i ˙M log ˆ ` ηβ i m ` ˙ ď ` Cηβ i , m ď m . Thus, when t i,m ě p ` C q{ ηβ i , we have β t ` t i,m ,i ě p ´ { m ` q β i under the settingsof Step II with m ď m .As we have discussed above, for (cid:15) “ M a log n { n , we have β i { m ď (cid:15) with m “ r log p β i { (cid:15) q s .Thus, after finishing Step I and repeating
Step II for at most m “ r log p β i { (cid:15) q s times, we obtain β t,i ě β i ´ (cid:15) after T i ě C log ˆ β i α ˙L` ηβ i ˘ ě ˆ β i α ˙M` ηβ i ˘ ` p ` C q R log ˆ β i (cid:15) ˙VM` ηβ i ˘ according to our assumptions on α and (cid:15) .Our conclusion above is built upon the assumption on p ´ { m q β i ď β t,i ď p ´ { m ` q β i , for m ď m , with m “ r log p β i { (cid:15) q s . If there is a β t,i that exceeds β i , the assumptions in step p II q are violated. In addition, the dynamics of β t with β t ě p ´ { m q β i and m ą m “ r log p β i { (cid:15) q s isstill remains to be characterized.So in the next step of our analysis, we prove that β i,t keeps increasing through the wholeiteration and will never exceed β i , if we take stepsize small enough. Step III.
When we take stepsize η with η ď {p
48 max i µ ˚ β ˚ i q , we have β t ` ,i ě β t,i and β t,i ď β i for all t and i P S .First, we will prove that β t,i ď β t ` ,i for every t . Without loss of generality, for any i P S ,we assume β t,i ă β i , and there exists an m ě p ´ { m q β i ď β t,i ďp ´ { m ` q β i . Then by (A.18) and (A.19) we obtain a lower bound of β t ` ,i as β t ` ,i “ w t ` ,i ´ v t ` ,i ě ˆ ` ηβ t,i m ` ˙ w t,i ´ ˆ ´ ηβ t,i m ` ˙ v t,i “ w t,i ` ηβ t,i w t,i m ` η β t,i w t,i m ` ´ v t,i ` ηβ i v t,i m ´ η β i v t,i m ` ě w t,i ´ v t,i “ β t,i . The reason that we are able to get the last inequality is because our assumption on η satisfies ηβ i v t,i { m ´ η β i v t,i { m ` ě i P S .For the second part of Step III , we prove β t,i ď β i for every t ě , i P S by induction. First,for any i P S we know β ,i ă β i and we assume β t ,i ď β i for any 0 ď t ď t . Then, we will verifythis conclusion also holds for step t `
1. Without loss of generality, for the t -th iterate β t,i , weassume that ˆ ´ m ˙ β i ď β t,i ď ˆ ´ m ` ˙ β i , (A.22)holds for some m ě
0. According to equation β t,i “ w t,i ´ v t,i and (A.22), we further have ˆ ´ m ˙ β i ď w t,i ď ˆ ´ m ` ˙ β i ` v t,i . Following updates of w t,i and v t,i given in (A.18) and (A.19), we obtain an upper bound of w t ` ,i
39s well as a lower bound of v t ` ,i as w t ` ,i ď ˆ ` ηβ i m ˙ ¨ w t,i ď ˆ ` ηβ i m ´ ` η β i m ˙ ¨ „ˆ ´ m ` ˙ β i ` v t,i ,v t ` ,i ě ˆ ´ ηβ i m ˙ ¨ v t,i “ ˆ ´ ηβ i m ´ ` η β i m ˙ ¨ v t,i . Then we further get an upper bound of β t ` ,i as β t ` ,i “ w t ` ,i ´ v t ` ,i ď ˆ ´ m ` ˙ β i ` ηβ i m ´ ` ηβ i v t,i m ´ ` ˆ ´ m ` ˙ η β i m . (A.23)By the updating rule on v t,i given in (A.19), we obtain that as long as β t ,i ď β i for all t ď t ,we always have v t,i ď α . After setting η ď {p
24 max i β i q ď {p
48 max i µ ˚ β ˚ i q in (A.23), we have β t ` ,i ď β i for any i P S . Thus, we have finished our proof of the second part in Step III above.In conclusion, once we obtain β i ´ (cid:15) ď β t,i ď β i with (cid:15) “ M a log n { n, after proceeding Step I and
Step II , our iterates on β t,i with i P S will keep being in this region for any t ě T i . By the definition of β t , we further obtain µ ˚ β ˚ i ´ (cid:15) ď β t,i ď µ ˚ β i ` (cid:15), with (cid:15) “ M a log n { n after t ě max i T i Á log p s m { α q{ ηs m iterations with probability 1 ´ n ´ .This is equivalent with our claim in Proposition A.5: there exists a constant a such that } β t d S ´ µ ˚ β ˚ I d S } ď M c log nn holds for all t ě a log p s m { α q{p ηs m q with probability 1 ´ n ´ . A.3 Proof of Theorem 3.4
Proof.
In this subsection, we will prove our results on the MSE of kernel regression with gaussiancovariates. As a reminder, in § Z ˚ “ x T β ˚ , Z “ x T p β and Z i “ x T i p β , and event t Z, | Z ´ µ J p β | ď R u , where R “ ? log n and x is a new observation. We further define ourprediction function p g p Z q as p g p Z q “ $’&’% ř ni “ y i K h p Z ´ Z i q ř ni “ K h p Z ´ Z i q , | Z ´ µ J p β | ď R, , otherwise (A.24)in which we assume 0 { “
0. Note that Z ´ µ J p β is a random variable which follows standardGaussian distribution under our settings given in § Z as P ` | Z ´ µ J p β | ě t ˘ “ ` ´ t { ˘ (A.25)40n other words, by letting t “ ? log n in (A.25), with probability 1 ´ { n , we have | Z ´ µ J p β | ď ? log n. Next, we separate our prediction error into two parts E “ p p g p Z q ´ f p Z ˚ qq ‰ “ E ” p p g p Z q ´ f p Z ˚ qq ¨ I t| Z ´ µ J p β |ď R u ıloooooooooooooooooooooomoooooooooooooooooooooon p I q ` E ” p p g p Z q ´ f p Z ˚ qq ¨ I t| Z ´ µ J p β |ą R u ıloooooooooooooooooooooomoooooooooooooooooooooon p II q . For term p II q , by our definition of p g p Z q given in (A.24), we have E ” p p g p Z q ´ f p Z ˚ qq I t| Z ´ µ J p β |ą R u ı ď E ” f p Z ˚ q I t| Z ´ µ J p β |ą R u ı ď a E r f p Z ˚ q s b P p| Z ´ µ J p β | ą R q À σ f ¨ n , where the second inequality follows from Cauchy-Schwartz Theorem. In addition, the third in-equality above is given by our assumption on f p Z ˚ q , in which we assumed f p Z ˚ q is a sub-Gaussianrandom variable with variance proxy σ f .For term p I q , we further separate it into p III q and p IV q which are regarded as integrated meansquare error and approximation error respectively. p I q “ E ” p p g p Z q ´ g p Z qq I t| Z ´ µ J p β |ď R u ıloooooooooooooooooooomoooooooooooooooooooon p III q :(MSE) ` E ” p g p Z q ´ f p Z ˚ qq I t| Z ´ µ J p β |ď R u ı . loooooooooooooooooooooomoooooooooooooooooooooon p IV q :(Approximation error) (A.26)For p III q (MSE), we define g p Z q as g p Z q “ ř ni “ g p Z i q K h p Z ´ Z i q ř ni “ K h p Z ´ Z i q . Then we see p III q can also be controlled by two terms, namely variance and bias of our approxi-mation p III q ď E ” p p g p Z q ´ g p Z qq I t| Z ´ µ J p β |ď R u ılooooooooooooooooooooomooooooooooooooooooooon p V q :(Variance) ` E ” p g p Z q ´ g p Z qq I t| Z ´ µ J p β |ď R u ılooooooooooooooooooooomooooooooooooooooooooon p VI q :(Bias) . (A.27)Combining (A.26) and (A.27), we see that the (cid:96) -risk can be bounded by a sum of the approximationerror, bias, and variance. In the sequel, we bound these three terms separately. Step I: Approximation error.
By our settings in § Z ´ µ J p β and Z ˚ ´ µ J β ˚ arestandard Gaussian random variables. Moreover, we have Z ˚ “ µ J β ˚ ` x Σ { p β, Σ { β ˚ y ¨ p Z ´ µ J p β q ` b ´ x Σ { p β, Σ { β ˚ y ¨ ζ : “ cos α ¨ Z ` sin α ¨ ζ ` µ J β ˚ ´ cos α ¨ µ J p β, where α P r , π { s and ζ „ N p , q is independent of Z . In addition, by Assumption 3.1-(a) and(3.6), it holds that sin α “ ´ x Σ { p β, Σ { β ˚ y “ o p n ´ { q . Y “ f p Z ˚ q ` (cid:15), Z ˚ “ cos α ¨ p Z ´ µ J p β q ` sin α ¨ ζ ` µ J β ˚ . (A.28)For simplicity, we denote r Z p z q as r Z p z q “ cos α ¨ p z ´ µ J p β q ` µ J β ˚ . Then, according to (A.28), theregression function is given by g p z q “ E r Y | Z “ z s “ E “ f ` r Z p z q ` sin α ¨ ζ ˘ | Z “ z ‰ “ ż R f ` r Z p z q ` sin α ¨ ζ ˘ ¨ φ p ζ q d ζ, (A.29)where φ is the density of the standard Gaussian distribution. To bound the approximation error p IV q , we first use f p cos α ¨ p Z ´ µ J p β q ` µ J β ˚ q to approximate f p Z ˚ q as well as g p Z q . For simplicity,we denote r Z as r Z “ cos α ¨ p Z ´ µ J p β q ` µ J β ˚ with cos α “ x Σ { β ˚ , Σ { p β y , then the approximationerror is bounded as E ” p f p Z ˚ q ´ g p Z qq I t| Z ´ µ J p β |ď R u ı (A.30) ď E ” t f p Z ˚ q ´ f p r Z qu I t| Z ´ µ J p β |ď R u ı ` E ” p f p r Z q ´ g p Z qq I t| Z ´ µ J p β |ď R u ı , (A.31)For the first term on the right-hand side of (A.31), by Taylor expansion we have f p Z ˚ q ´ f p r Z q “ f p r Z ` sin α ¨ ζ q ´ f p r Z q “ f p r Z ` t sin α ¨ ζ q ¨ sin α ¨ ζ, which implies that E ” p f p Z ˚ q ´ f p r Z qq I t| Z ´ µ J p β |ď R u ı “ sin α ż | Z ´ µ J p β |ď R ż R f p r Z ` t p Z, ζ q sin α ¨ ζ q ζ φ p ζ q d ζ d F p Z q À sin α, (A.32)where t p Z, ζ q is a constant lines in r , s which depends on Z, ζ . For (A.32) given above, we utilizeAssumption 3.3. For the second term, by the definition of g given in (A.29) we have ˇˇˇ f p r Z q ´ g p Z q ˇˇˇ “ ˇˇˇˇ f p r Z q ´ ż R f p r Z ` sin α ¨ ζ q φ p ζ q d ζ ˇˇˇˇ “ ˇˇˇˇ sin α ¨ ż R f p r Z ` t p Z, ζ q sin α ¨ ζ q ζφ p ζ q d η ˇˇˇˇ , which implies that E ” p f p r Z q ´ g p Z qq I t| Z ´ µ J β |ď R u ı ď sin α ż | Z ´ µ J p β |ď R ˆż R f p r Z ` t p Z, ζ q sin α ¨ ζ q ζφ p ζ q d ζ ˙ d F p Z qď sin α ż | Z ´ µ J p β |ď R ż R f p r Z ` t p Z, ζ q sin α ¨ ζ q ζ φ p ζ q d ζ d F p Z q À sin α. (A.33)Combining (A.30), (A.32), and (A.33) we bound the approximation error term by p IV q “ E ” p f p Z ˚ q ´ g p Z qq I t| Z ´ µ J p β |ď R u ı À sin α À o p n ´ { q . (A.34)42ext, we control the strength of term p V q , which is regarded as the variance of our approximation. Step II: Variance control.
For term p V q , by definition, we obtain p V q “ ż | Z ´ µ J p β |ď R ż E “ p p g p Z q ´ g p Z qq | Z , . . . , Z n ‰ d F p Z , . . . , Z n q d F p Z q . For any fixed Z , we let B n p z q : “ t Z : nP n p B p Z, h qq ą u , where P n p B p Z, h qq “ n ř ni “ I p} Z i ´ Z } ď h q . Then we further have E “ p p g p Z q ´ g p Z qq | Z , . . . , Z n ‰ “ E «„ ř ni “ p y i ´ g p z i qq I t} Z i ´ Z } ď h u ř ni “ I t} Z i ´ Z } ď h u ˇˇˇ Z , . . . , Z n ff “ ř ni “ Var p Y i | Z i q I t} Z i ´ Z } ď h u n P n p B p Z, h qq ď σ n P n p B p Z, h qq ¨ I B n p Z q . For the last inequality, we have that Var p Y i | Z i q ď E r Y i | Z i s ď σ À polylog p n q holds by ourfollowing Lemma A.6- (ii) . Lemma A.6.
Under our settings given in § (i) . g p z q function defined in (A.29) is Lipschitz over area t| z | ď R u , whose Lipschitz constant L is bounded by poly p R q . (ii) . The variance of Y given Z “ z with | z ´ µ J p β | ď R ` h, h “ o p q is bounded by poly p R q . (iii) . sup | z ´ µ J p β |ď R g p z q ď ploy p R q . Proof.
The detailed proof is given in § A.3.1.So we obtain p V q ď ż | Z ´ µ J β |ď R ż σ I B n p Z q n P n p B p Z, h qq d F p Z , . . . , Z n q d F p Z q . As we have n P n p B p Z, h qq “ ř ni “ I p} Z i ´ Z } ď h q „ Binomial p n, q q , with q “ P p Z P B p Z, h qq , we thenobtain ż σ I B n p Z q n P n p B p Z, h qq d F p Z , . . . , Z n q “ ż σ I B n p Z q n P n p B p Z, h qq d F p Z , . . . , Z n q“ E „ σ I p n P n p B p Z,h qqą q n P n p B p Z, h qq ď σ nq . The last inequality follows from Lemma 4.1 in Gy¨orfi et al. (2002). Then we further get an upperbound for p V q as p V q ď ż | Z ´ µ J p β |ď R d F p Z q n P p Z P B p Z, h qq . t| Z ´ µ J p β | ď R u is a bounded area, we choose x , . . . , x m such that t| Z ´ µ J p β | ď R u is coveredby Y Mj “ B p x i , h { q with M ď cR { h . Then we finally bound term p V q as p V q ď σ ż | Z ´ µ J p β |ď R d F p Z q n P p B p Z, h qq ď M ÿ j “ σ ż I t Z P B p x j ,h { qu d F p Z q n P p B p Z, h qqď M ÿ j “ σ ż I t Z P B p x j ,h { qu d F p Z q n P p B p x j , h { qq ď σ Mn ď Cσ Rnh . (A.35)In the next step, we will get an upper bound for the bias term of our approximation.
Step III: Bias control.
For term p VI q , we first bound the difference between g p Z q and g p Z q| g p Z q ´ g p Z q| “ ˇˇˇˇ ř ni “ p g p Z i q ´ g p Z qq K h p Z ´ Z i q ř ni “ K n p Z ´ Z i q ˇˇˇˇ ď L h ` g p Z q ¨ I B n p Z q c , where the last inequality follows from Lemma A.6- (i) , which yields g is a Lipschitz function withLipschitz constant L bounded by polylog p n q . Then we obtain E ” | g p Z q ´ g p Z q| I t| Z ´ µ J p β |ď R u ı ď L h ` ż | Z ´ µ J p β |ď R g p Z q E “ I B n p Z q c ‰ d F p Z qď L h ` sup | Z ´ µ J β |ď R g p Z q ż | Z ´ µ J p β |ď R r ´ P p Z P B p Z, h qqs n d F p Z qď L h ` sup | Z ´ µ J p β |ď R g p Z q ż | Z ´ µ J p β |ď R exp p´ n P p Z P B p Z, h qqq ¨ n P p Z P B p Z, h qq n P p Z P B p Z, h qq d F p Z qď L h ` sup | Z ´ µ J p β |ď R g p Z q sup u t ue ´ u u ż | Z ´ µ J p β |ď R d F p Z q n P p B p Z, h qqď L h ` polylog p n q nh . (A.36)The last inequality (A.36) also follows from our Lemma A.6- (iii) . Thus, combining our conclusionsfrom (A.34), (A.35) and (A.36), and by letting h “ n ´ { , we bound the (cid:96) -error as E “ p p g p Z q ´ f p Z ˚ qq ‰ À polylog p n q n { , which concludes the proof of of Theorem 3.4. A.3.1 Proof of Lemma A.6
Proof.
For term (i) , by mean value theorem, we have | g p z q ´ g p z q|ď " ż R ˇˇ f ` cos α ¨ r z ´ µ J p β ` t p ζ q ¨ p z ´ z qs ` sin α ¨ ζ ` µ J β ˚ ˘ˇˇ ¨ φ p ζ q d ζ * ¨ | z ´ z | , (A.37)44here t p ζ q is a constant inside r , s that depends on ζ . Here, if α ď
1, the right hand side of(A.37) is bounded as " ż R | f p cos α ¨ r z ´ µ J p β ` t p ζ q ¨ p z ´ z qs ` sin α ¨ ζ ` µ J β ˚ q| φ p ζ q d ζ * ¨ | z ´ z |ď " ż R C ` | cos α ¨ r z ´ µ J p β ` t p ζ q ¨ p z ´ z qs ` sin α ¨ ζ ` µ J β ˚ | α φ p ζ q d ζ * ¨ | z ´ z |ď p C ` q ¨ | z ´ z | ` " ż R | cos α ¨ r z ´ µ J p β ` t p ζ q ¨ p z ´ z qs ` sin α ¨ ζ ` µ J β ˚ | φ p ζ q d ζ * ¨ | z ´ z |ď p C ` ` R ¨ | cos α | ` µ J β ˚ q ¨ | z ´ z | ` | sin α | ¨ | z ´ z | ¨ ż R | ζ | φ p ζ q d ζ ď p C ` R q ¨ | z ´ z | , in which C is a constant. In addition, if α ą , by convexity property of function f p x q “| x | α , α ě
1, we then have " ż R | f p cos α ¨ r z ´ µ J p β ` t p ζ q ¨ p z ´ z qs ` sin α ¨ ζ ` µ J β ˚ q| φ p ζ q d ζ * ¨ | z ´ z |ď p C ` ` α ´ | cos α | α R α q ¨ | z ´ z | ` α ´ | sin α | α ¨ | z ´ z | ¨ ż R | ζ | α φ p ζ q d ζ ď p C ` α ´ R α q ¨ | z ´ z | Thus, we claim that our g function is Lipschitz over area t| z | ď R u . For terms (ii) and (iii) , by definitions, we knowVar p Y | Z “ z q ď E r Y | Z “ z s ď ż R f p r Z p z q ` sin α ¨ ζ q φ p ζ q d ζ ` σ , and g p z q “ E r Y | Z “ z s “ ż R f p r Z ` sin α ¨ ζ q ¨ φ p ζ q d ζ. in which σ denotes the variance of (cid:15). By our assumption on f given in Assumption 3.3, afterfollowing similar procedures given by us of proving part (i) , we claim our conclusion for terms (ii) and (iii) . A.4 Proof of Theorem 3.6
Proof.
The proof of Theorem 3.6 is almost the same with the proof of Theorem 3.2. The majordifferences between them are two folds. Firstly, we need to replace the estimator Φ n : “ n ř ni “ y i x i by n ř ni “ q y i q S p x i q in (A.14)-(A.15) and (A.18)-(A.19). In addition, we establish a new concentrationinequality between n ř ni “ q y i q S p x i q and µ ˚ β ˚ in the following Lemma A.7. Lemma A.7.
The proof of Theorem 3.6 is almost the same as the proof of Theorem 3.2. The major differences are twofold. First, we need to replace the estimator $\Phi_n := n^{-1}\sum_{i=1}^n y_i x_i$ by $n^{-1}\sum_{i=1}^n \tilde y_i \tilde S(x_i)$ in (A.14)-(A.15) and (A.18)-(A.19). In addition, we establish a new concentration inequality between $n^{-1}\sum_{i=1}^n \tilde y_i \tilde S(x_i)$ and $\mu^*\beta^*$ in the following Lemma A.7.

Lemma A.7.
2, we have ›››› n n ÿ i “ q y i q S p x i q ´ µ ˚ β ˚ ›››› À c log pn (A.38)holds with probability 1 ´ { p . 45 roof. The detailed proof is given in § A.4.1.Then by our conclusion from Lemma A.7 , and following the proof procedure of Lemma A.4and A.5 above, there exist a constant a such that we obtain } e ,t } À α À p , } e ,t } À α À p , } u ,t } À α À p , } u ,t } À α À p , for any t ď T : “ a a n { log p . Similarly, for signal parts, there also exists a constant a such thatwhen t ě a log p s m α q{ ηs m we get ›› β t d S ´ µ ˚ β ˚ d S ›› À c log pn . Combining two conclusions above, we claim our proof of Theorem 3.6. Next, we will prove LemmaA.7 which we have applied in the process of proving Theorem 3.6.
A.4.1 Proof of Lemma A.7
Proof.
We separate the left hand side of (A.38) into two parts, namely ›››› n n ÿ i “ q y i q S p x i q ´ µ ˚ β ˚ ›››› “ ›››› n n ÿ i “ q y i q S p x i q ´ E r q y q S p x qs ` E r q y q S p x qs ´ E r y ¨ S p x qs ›››› ď ›››› n n ÿ i “ q y i q S p x i q ´ E r q y q S p x qs ›››› ` ››› E r q y q S p x qs ´ E r y ¨ S p x qs ››› . To simplify the notations, within this proof, we define1 n n ÿ i “ q y i q S p x i q ´ E r q y q S p x qs “ r Ψ , E r q y q S p x qs ´ E r y ¨ S p x qs “ r Φ . In addition, we define event C j as C j “ t| y | ď τ, | S p x q j | ď τ u , then we are able to control j -thentry of r Φ as r Φ j “ E “ q y q S p x q j ‰ ´ E “ y ¨ S p x q j ‰ ď E “ p| y | ´ τ q ¨ p| S p x q j | ´ τ q ¨ I C cj ‰ ď b E “ y S p x q j ‰ ¨ “ P p| y | ą τ q ` P p| S p x q| ą τ q ‰ ď (cid:32) E “ y ‰ ¨ E “ S p x q j ‰( { ¨ ? M { { τ ď ? M { τ . The third and fourth inequalities are established by Cauchy Schwartz inequality and Chebyshevinequality respectively. In addition, the last inequality follows from our Assumption 3.5. Note46hat the inequality above holds for any j P r d s so that we have } r Φ } ď ? M { τ . For term r Ψ, bydefinition, we know that | q y i q S p x i q j | ď τ and ř ni “ Var p q y i q S p x i q j q ď n ¨ M with j P r p s . After directlyapplying Bernstein inequality and we further obtain P ˜›››› n n ÿ i “ q y i q S p x i q ´ µ ˚ β ˚ ›››› ě ? Mτ ` t ¸ ď p ¨ exp ˆ ´ nt M ` τ t { ˙ . (A.39)We set t “ m a log p { n and τ “ m { p n { log p q { in (A.39), in which m and m are constantsthat we will specify later. We aim at establishing the following inequality2 p ¨ exp ˆ ´ nt M ` τ t { ˙ “ p ¨ exp ˆ ´ m log p M ` m m ˙ ď p . Then by setting m “ ? M and m “ ? M {
4, we obtain3 m M ` m m ě . Thus, we obtain that ›››› n n ÿ i “ q y i q S p x i q ´ µ ˚ β ˚ ›››› ď p ? ` q? M c log pn holds with probability 1 ´ { p , and we conclude the proof of Lemma A.7. A.5 Algorithm in § Algorithm 3:
Algorithm for Vector SIM with General Design
Data: training covariates $\{x_i\}_{i=1}^n$, response vector $\{y_i\}_{i=1}^n$, truncating parameter $\tau$, initial value $\alpha$, step size $\eta$;
Initialize variables $w_0 = \alpha\cdot 1_{p\times 1}$, $v_0 = \alpha\cdot 1_{p\times 1}$ and set the iteration number $t = 0$;
while $t < T$ do
  $w_{t+1} = w_t - \eta\big(w_t\odot w_t - v_t\odot v_t - \frac{1}{n}\sum_{i=1}^n \widetilde S(x_i)\,\widetilde y_i\big)\odot w_t$;
  $v_{t+1} = v_t + \eta\big(w_t\odot w_t - v_t\odot v_t - \frac{1}{n}\sum_{i=1}^n \widetilde S(x_i)\,\widetilde y_i\big)\odot v_t$;
  $\beta_{t+1} = w_{t+1}\odot w_{t+1} - v_{t+1}\odot v_{t+1}$;
  $t = t + 1$;
end
Result: Output the final estimate $\hat\beta = \beta_T$.
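A minimal NumPy translation of Algorithm 3 is given below; the elementwise clipping used for $\widetilde y_i$ and $\widetilde S(x_i)$ is our reading of the truncation step, and the vectorized score function is supplied by the caller.

import numpy as np

def algorithm3(X, y, score, tau, alpha, eta, T):
    # Vector SIM with general design: gradient descent on the over-parameterized
    # loss built from the truncated statistics S~(x_i) and y~_i.
    n, p = X.shape
    S_trunc = np.clip(score(X), -tau, tau)       # elementwise truncated scores
    y_trunc = np.clip(y, -tau, tau)              # truncated responses
    phi_n = S_trunc.T @ y_trunc / n              # robust estimate of mu* beta*
    w = np.full(p, alpha)
    v = np.full(p, alpha)
    for _ in range(T):
        residual = w * w - v * v - phi_n
        w = w - eta * residual * w
        v = v + eta * residual * v
    return w * w - v * v                         # beta_T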
B Proof of General Theorems in §4

B.1 Proof of Theorem 4.2

As we assume that $\mu^*\beta^* := \mathbb{E}[f'(\langle X, \beta^*\rangle)]\,\beta^*$ is symmetric in §
4, so we over-parameterize µ ˚ β as WW J ´ VV J , in which W and V are matrices with dimension d ˆ d. Then our loss function47elated to W , V becomesmin W , V L p W , V q : “ x WW J ´ VV J , WW J ´ VV J y ´ A WW J ´ VV J , n n ÿ i “ y i X i E . The gradient updates with respect to W , V and β are given by W t ` “ W t ´ η ´ W t W J t ´ V t V J t ´ n n ÿ i “ y i X i ´ n n ÿ i “ y i X J i ¯ W t , (B.1) V t ` “ V t ` η ´ W t W J t ´ V t V J t ´ n n ÿ i “ y i X i ´ n n ÿ i “ y i X J i ¯ V t , (B.2) β t ` “ W t ` W J t ` ´ V t ` V J t ` . (B.3)For simplicity, let M ˚ “ n ř ni “ y i X i ` n ř ni “ y i X J i , whose spectral decomposition is M ˚ : “ Q ˚ Σ ˚ Q ˚J . Here for identifiability, through this section, we always assume eigenvalues are sortedin order of decreasing value in the diagonal matrix for any spectral decomposition. We then define W ,t and V ,t as W ,t “ Q ˚J W t Q ˚ and V ,t “ Q ˚J V t Q ˚ , meanwhile, the corresponding gradientupdates with respect to W ,t and V ,t are given by W ,t ` “ W ,t ´ η ` W ,t W J ,t ´ V ,t V J ,t ´ Σ ˚ ˘ W ,t , V ,t ` “ V ,t ` η ` W ,t W J ,t ´ V ,t V J ,t ´ Σ ˚ ˘ V ,t ,β ,t ` “ W ,t ` W J ,t ` ´ V ,t ` V J ,t ` . If we initialize W , and V , as diagonal matrices, then all of their following updates will keepbeing diagonal matrices. In this case, our analysis on symmetric low rank matrices can be relaxedto the analysis on sparse vectors. Likewise, we also remind readers of the notations before formallyproving Theorem 4.2.Like the case of sparse vector, here we also divide eigenvalues of µ ˚ β ˚ into different groupsby their strengths. We let r ˚ i , i P r n s be the i -th eigenvalue of µ ˚ β ˚ . The support set R of oureigenvalues is defined as R : “ t i : | r ˚ i | ą u , in addition, the set R which contains strong signals isdefined as R : “ t i : | r ˚ i | Á log d a d log d { n u , and the set R : “ t i : 0 ă | r ˚ i | À a d log d { n u denotesthe collection of weak signals. Likewise, pure error parts of W ,t and V ,t can be denoted by E w,t : “ I R c W ,t and E v,t : “ I R c V ,t respectively. (Here, I R is not indicator function but diagonalmatrix with one in the index set R and zero otherwise). In addition, strong signal parts of W ,t and V ,t are denoted by S w,t “ I R W ,t and S v,t “ I R V ,t and at the same time, weak signalparts are written as U w,t : “ I R W ,t and U v,t : “ I R V ,t . The cardinality of set R and R aredenoted by r and r respectively. For simplicity, we denote γ ˚ as γ ˚ “ a n { d log d through ourproof in § B. Next, we will formally prove our Theorem 4.2.
Proof.
The proof idea behind Theorem 4.2 is similar with that of Theorem 3.2. We prove that thestrengths of pure error and weak signal parts of our eigenvalues are controlled according to thefollowing Lemma B.1. 48 emma B.1. (Error Dynamics) Under assumptions in Theorem 4.2, there exist a constant a suchthat for any t with 0 ď t ď T “ a γ ˚ { η we obtain } E w,t } op ď C ¨ α À d , } E v,t } op ď C ¨ α À d , } U w,t } op ď C ¨ α À d , } U v,t } op ď C ¨ α À d , with probability 1 ´ {p d q ´ { n , where C , C ą Proof.
The detailed proof can be found in § B.1.1.For the t-th iterate β t , we separate it into three parts, namely, Q ˚ I R β ,t Q ˚J , Q ˚ I R β ,t Q ˚J and Q ˚ I R c β ,t Q ˚J . By our conclusion from Lemma B.1, with probability 1 ´ {p d q ´ { n , weobtain ››› Q ˚ I R β ,t Q ˚J ` Q ˚ I R c β ,t Q ˚ T ››› op À d . (B.4)for all t with t ď T “ O p γ ˚ { η q . Next, we will analyze the dynamics of our signal components of t β t u t ě in the following Lemma B.2. Lemma B.2. (Signal Dynamics) Let the spectral decomposition of µ ˚ β ˚ be µ ˚ β ˚ “ P ˚ R ˚ P ˚J . We denote the minimum absolute value of our strong signals µ ˚ β ˚ as r m . Under assumptions inTheorem 4.2, if we further choose 0 ă η ď {p
48 max i | µ ˚ r ˚ i |q , there exist a constant a such thatfor all t ě a log p r m { α q{ ηr m , we obtain ››› Q ˚ β ,t I R Q ˚ T ´ P ˚ R ˚ P ˚J ››› op ď M c d log dn (B.5)with probability 1 ´ {p d q ´ { n . Proof.
The detailed proof can be found in § B.1.1.Combing (B.4) and (B.5) above, we control the difference between β t and µ ˚ β ˚ as ›› β t ´ µ ˚ β ˚ ›› F ď ››› Q ˚ I R β ,t Q ˚J ´ P ˚ R ˚ I R Y R P ˚J ››› F ` ››› Q ˚ I R Y R c β ,t Q ˚J ››› F À p r ` r q ¨ d log dn , ›› β t ´ µ ˚ β ˚ ›› ˚ “ ››› Q ˚ I R β ,t Q ˚J ´ P ˚ R ˚ I R Y R P ˚J ››› ˚ ` ››› Q ˚ I R Y R c β ,t Q ˚J ››› ˚ À p r ` r q c d log dn . The proof procedure of the concentration between our normalized signals is almost the same withthat in § A.1, so we just omit details about this part.In the following subsection, we will prove Lemma B.1 and B.2 respectively.49 .1.1 Proof of Lemma B.1 and Lemma B.2
Proof. As $E_{w,t}, E_{v,t}, U_{w,t}, U_{v,t}, S_{w,t}, S_{v,t}$ are all diagonal matrices, the proofs of Lemma B.1 and Lemma B.2 reduce to those of Lemma A.4 and Lemma A.5. The only difference between them lies in the concentration, in spectral norm, between $M^\ast := \tfrac{1}{2n}\sum_{i=1}^n y_iX_i + \tfrac{1}{2n}\sum_{i=1}^n y_iX_i^\top = Q^\ast\Sigma^\ast Q^{\ast\top}$ and the true signal $\mu^\ast\beta^\ast := P^\ast R^\ast P^{\ast\top}$. We characterize this concentration upper bound in the following Lemma B.3.

Lemma B.3. With probability $1 - 1/(2d) - 2/n$, we have
$$\Big\|\frac{1}{n}\sum_{i=1}^n X_iy_i - \mathbb{E}[X_iy_i]\Big\|_{\mathrm{op}} \le c_1\max\Big\{\sigma_z\sqrt{\frac{\log(2d)}{n}},\ c_2\,\sigma_y\sqrt{d\log n}\,\frac{\log(2d)}{n}\Big\},$$
in which $\sigma_z = c_3\,\sigma_y\sqrt{d}$ and $c_1, c_2, c_3$ are constants.

As we assume $d \ll n \ll d^2$ under our setting of the high-dimensional SIM with matrix covariates, combining the result of Lemma B.3 with Weyl's inequality, $\Sigma^\ast$ is an entrywise perturbation of $R^\ast$ with a perturbation upper bound of order $O(\sqrt{d\log d/n})$.

Then, for Lemma B.1, by using an induction hypothesis similar to the one in the proof of Lemma A.4, we verify that there exists a constant $a_1$ such that the error and weak signal components satisfy $\|E_{w,t}\|_{\mathrm{op}} \lesssim \alpha$, $\|E_{v,t}\|_{\mathrm{op}} \lesssim \alpha$, $\|U_{w,t}\|_{\mathrm{op}} \lesssim \alpha$ and $\|U_{v,t}\|_{\mathrm{op}} \lesssim \alpha$ for any $t \le T_1 = a_1\gamma^\ast/\eta$. This yields the conclusion of Lemma B.1.

For Lemma B.2, by following proof procedures similar to those of Lemma A.5 together with the definition of the set $R_1$, there exists a constant $a_2$ such that
$$\|Q^\ast\beta_{1,t}I_{R_1}Q^{\ast\top} - Q^\ast\Sigma^\ast I_{R_1}Q^{\ast\top}\|_{\mathrm{op}} = \|\beta_{1,t}I_{R_1} - \Sigma^\ast I_{R_1}\|_{\mathrm{op}} \lesssim \sqrt{\frac{d\log d}{n}} \qquad (B.6)$$
for any $t \ge a_2\log(r_m/\alpha)/(\eta r_m)$. We further have
$$\|Q^\ast\Sigma^\ast I_{R_1}Q^{\ast\top} - P^\ast R^\ast P^{\ast\top}\|_{\mathrm{op}} = \|Q^\ast\Sigma^\ast Q^{\ast\top} - P^\ast R^\ast P^{\ast\top} - Q^\ast\Sigma^\ast I_{R_2\cup R^c}Q^{\ast\top}\|_{\mathrm{op}} \le \|Q^\ast\Sigma^\ast Q^{\ast\top} - P^\ast R^\ast P^{\ast\top}\|_{\mathrm{op}} + \|Q^\ast\Sigma^\ast I_{R_2\cup R^c}Q^{\ast\top}\|_{\mathrm{op}} \lesssim \sqrt{\frac{d\log d}{n}}, \qquad (B.7)$$
where the last inequality follows from our Lemma B.3 and Weyl's inequality. After combining our results in (B.6)-(B.7), we complete the proof of Lemma B.2.

B.1.2 Proof of Lemma B.3
Proof.
For any fixed $n$ and $d$, we first define the event $C_i$, $i \in [n]$, as
$$C_i := \Big\{|y_i| \le c_1\sigma_y\sqrt{\log n},\ \ \|X_i\|_{\mathrm{op}} \le \tfrac{3}{2}\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(3/2)}\Big) + c_3\sqrt{\log n}\Big\}. \qquad (B.8)$$
To show that, with high probability, $|y_i|$ and $\|X_i\|_{\mathrm{op}}$ satisfy the bounds defining $C_i$ for all $i \in [n]$, we first introduce the following two lemmas, namely Lemma B.4 and Lemma B.5.

Lemma B.4. We have a union upper bound for $\{|y_i|\}_{i=1}^n$; to be more specific, with probability $1 - 1/n$ we obtain
$$\max_{i\in[n]} |y_i| \le c_1\sigma_y\sqrt{\log n}.$$

Proof.
The proof follows directly from the sub-Gaussian tail bound together with a union bound; see Proposition 2.5.2 in Vershynin (2018) for more details.
Lemma B.5.
For $n$ independent random matrices $X_i \in \mathbb{R}^{d\times d}$, $i \in [n]$, with independent standard normal entries, with probability $1 - 1/n$ we have
$$\max_{i\in[n]} \|X_i\|_{\mathrm{op}} \le \tfrac{3}{2}\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(3/2)}\Big) + c_3\sqrt{\log n}.$$

Proof.
By Corollary 3.11 in Bandeira and Handel (2016), we have
$$\mathbb{P}\Big[\|X_i\|_{\mathrm{op}} \ge (1+\epsilon)\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(1+\epsilon)}\Big) + t\Big] \le e^{-t^2/4},$$
for any $0 < \epsilon \le 1/2$, $t \ge 0$ and $i \in [n]$. Taking $\epsilon = 1/2$, we get a tail bound for $\max_{i\in[n]}\|X_i\|_{\mathrm{op}}$ as
$$\mathbb{P}\Big[\max_{i\in[n]}\|X_i\|_{\mathrm{op}} \ge \tfrac{3}{2}\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(3/2)}\Big) + t\Big] \le n\cdot\mathbb{P}\Big[\|X_i\|_{\mathrm{op}} \ge \tfrac{3}{2}\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(3/2)}\Big) + t\Big] \le n\cdot e^{-t^2/4} = e^{-t^2/4 + \log n}.$$
By choosing $t = c_3\sqrt{\log n}$ with $c_3$ a sufficiently large constant, we have
$$\max_{i\in[n]}\|X_i\|_{\mathrm{op}} \le \tfrac{3}{2}\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(3/2)}\Big) + c_3\sqrt{\log n}$$
with probability $1 - 1/n$, which completes the proof of Lemma B.5.

From Lemma B.4 and Lemma B.5 given above, we obtain
$$\mathbb{P}(C_i^c) \le \mathbb{P}\Big(\bigcup_i C_i^c\Big) \le \mathbb{P}\Big[\max_{i\in[n]}\|X_i\|_{\mathrm{op}} \ge \tfrac{3}{2}\Big(2\sqrt{d} + \frac{c_2\sqrt{\log d}}{\log(3/2)}\Big) + c_3\sqrt{\log n}\Big] + \mathbb{P}\Big(\max_{i\in[n]}|y_i| \ge c_1\sigma_y\sqrt{\log n}\Big) \le \frac{2}{n}. \qquad (B.9)$$
We further define the event $A$ as
$$A := \Big\{\Big\|\frac{1}{n}\sum_{i=1}^n X_iy_i\mathbb{1}_{C_i} - \mathbb{E}[Xy\,\mathbb{1}_{C}]\Big\|_{\mathrm{op}} \ge t\Big\}.$$
Then we have
$$\mathbb{P}\Big(\Big\|\frac{1}{n}\sum_{i=1}^n X_iy_i - \mathbb{E}[Xy]\Big\|_{\mathrm{op}} \ge 2t\Big) \le \mathbb{P}(A) + \mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^n X_iy_i\mathbb{1}_{C_i} \ne \frac{1}{n}\sum_{i=1}^n X_iy_i\Big) + \mathbb{P}\Big(\big\|\mathbb{E}[Xy\,\mathbb{1}_{C^c}]\big\|_{\mathrm{op}} \ge t\Big) := (I) + (II) + (III).$$
We first bound term (II) according to (B.9) as
$$(II) = \mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^n X_iy_i\mathbb{1}_{C_i} \ne \frac{1}{n}\sum_{i=1}^n X_iy_i\Big) = \mathbb{P}\Big(\bigcup_i C_i^c\Big) \le \frac{2}{n}. \qquad (B.10)$$
Next, in order to bound term (I) $= \mathbb{P}(A)$, we first derive a spectral-norm upper bound for the summands of
$$\frac{1}{n}\sum_{i=1}^n X_iy_i\mathbb{1}_{C_i} - \mathbb{E}[Xy\,\mathbb{1}_{C}].$$
By the definition of $C_i$ given in (B.8), for any fixed $n, d$, with probability $1$ we have
$$\|X_iy_i\mathbb{1}_{C_i}\|_{\mathrm{op}} \le U := c_4\,\sigma_y\Big(\sqrt{d\log n} + \frac{\sqrt{\log d\cdot\log n}}{\log(3/2)}\Big) + c_5\,\sigma_y\log n.$$
Denoting $Z_i = X_iy_i\mathbb{1}_{C_i} - \mathbb{E}[X_iy_i\mathbb{1}_{C_i}]$, we have $\|\mathbb{E}[Z_iZ_i^\top]\|_{\mathrm{op}} \lesssim \|\mathbb{E}[y_i^2X_iX_i^\top]\|_{\mathrm{op}} \lesssim \sigma_y^2 d$. Furthermore, by letting
$$\sigma_z = \max\Big\{\Big\|\frac{1}{n}\sum_{i=1}^n\mathbb{E}[Z_iZ_i^\top]\Big\|_{\mathrm{op}}^{1/2},\ \Big\|\frac{1}{n}\sum_{i=1}^n\mathbb{E}[Z_i^\top Z_i]\Big\|_{\mathrm{op}}^{1/2}\Big\},$$
we get $\sigma_z \lesssim \sigma_y\sqrt{d}$. Then, after applying the matrix Bernstein inequality from Proposition 1 in Koltchinskii et al. (2011), we have that
$$\Big\|\frac{1}{n}\sum_{i=1}^n X_iy_i\mathbb{1}_{C_i} - \mathbb{E}[Xy\,\mathbb{1}_{C}]\Big\|_{\mathrm{op}} \le c_6\max\Big\{\sigma_z\sqrt{\frac{\log(2d)}{n}},\ U\cdot\frac{\log(2d)}{n}\Big\} \qquad (B.11)$$
holds with probability $1 - 1/(2d)$.
For term (III), likewise, we first bound the spectral norm of $\mathbb{E}[Xy\,\mathbb{1}_{C^c}]$. For any unit vectors $u, v \in \mathbb{R}^d$, by the Cauchy-Schwarz inequality we have
$$\mathbb{E}\big[u^\top Xv\,y\,\mathbb{1}_{C^c}\big] \le \sqrt{\mathbb{E}\big[(u^\top Xv)^2\big]}\cdot\sqrt{\mathbb{E}\big[y^2\mathbb{1}_{C^c}\big]}.$$
As all elements of $X$ are independent standard Gaussian variables, we get
$$\mathbb{E}\big[(u^\top Xv)^2\big] = \sum_{i,j=1}^d u_i^2v_j^2 = \|u\|_2^2\cdot\|v\|_2^2 = 1.$$
In addition, as we have assumed that $\{y_i\}_{i=1}^n$ are i.i.d. sub-Gaussian random variables with sub-Gaussian norm $\sigma_y$, we obtain
$$\sqrt{\mathbb{E}\big[y^2\mathbb{1}_{C^c}\big]} \le \big(\mathbb{E}[y^4]\big)^{1/4}\cdot\mathbb{P}(C^c)^{1/4} \lesssim \frac{\sigma_y}{n^{1/4}}.$$
Next, after setting
$$t = c_6\max\Big\{\sigma_z\sqrt{\frac{\log(2d)}{n}},\ U\cdot\frac{\log(2d)}{n}\Big\}, \qquad (B.12)$$
we have $t \gg \sigma_y/n^{1/4}$. So for term (III) we obtain
$$\mathbb{P}\Big(\big\|\mathbb{E}[Xy\,\mathbb{1}_{C^c}]\big\|_{\mathrm{op}} \ge t\Big) = 0. \qquad (B.13)$$
For term (I), by (B.11) and the definition of $t$ given in (B.12), we get
$$\mathbb{P}(A) \le \frac{1}{2d}. \qquad (B.14)$$
Thus, combining our conclusions from (B.10), (B.13) and (B.14), we finally obtain that
$$\Big\|\frac{1}{n}\sum_{i=1}^n X_iy_i - \mathbb{E}[Xy]\Big\|_{\mathrm{op}} \le 2c_6\max\Big\{\sigma_z\sqrt{\frac{\log(2d)}{n}},\ U\cdot\frac{\log(2d)}{n}\Big\}$$
holds with probability $1 - 1/(2d) - 2/n$. By our assumption that $\log(n) \ll d \ll n \ll d^2$, this concludes the proof of Lemma B.3.
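As an informal numerical check of the rate in Lemma B.3 (not part of the proof), the following sketch draws Gaussian matrix covariates, generates responses through a hypothetical cubic link, and tracks the ratio of the operator-norm error of $\frac{1}{n}\sum_i y_iX_i$ to $\sqrt{d\log d/n}$; the link, dimensions and sample sizes are illustrative assumptions only, and the ratio should stay roughly constant as $n$ grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 20, 2
    B = rng.standard_normal((d, r))
    beta_star = B @ B.T                      # symmetric rank-r signal (illustrative)
    beta_star /= np.linalg.norm(beta_star)   # unit Frobenius norm

    def op_err(n):
        X = rng.standard_normal((n, d, d))
        z = np.einsum('ijk,jk->i', X, beta_star)        # <X_i, beta*>
        y = z**3 + 0.1 * rng.standard_normal(n)          # hypothetical cubic link plus noise
        M = np.einsum('i,ijk->jk', y, X) / n
        # Stein's identity gives E[y X] = E[f'(<X, beta*>)] beta* = 3 beta* for f(u) = u^3.
        return np.linalg.norm(M - 3 * beta_star, ord=2)

    for n in (2_000, 8_000, 32_000):
        print(n, op_err(n) / np.sqrt(d * np.log(d) / n))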
B.2 Proof of Theorem 4.4

Proof.
The proof of Theorem 4.4 is similar to the proof of Theorem 4.2. We need to replace $\frac{1}{n}\sum_{i=1}^n y_iX_i$ with $\frac{1}{n}\sum_{i=1}^n H(y_iS(X_i), \kappa)$ in (B.1)-(B.3); the definition of $H(y_iS(X_i), \kappa)$ is given in \S 4. We define $M^\ast$ as
$$M^\ast = \frac{1}{2n}\sum_{i=1}^n H(y_iS(X_i), \kappa) + \frac{1}{2n}\sum_{i=1}^n H(y_iS(X_i), \kappa)^\top$$
and write its spectral decomposition as $M^\ast := Q^\ast\Sigma^\ast Q^{\ast\top}$. We then let $W_{1,t} = Q^{\ast\top}W_tQ^\ast$ and $V_{1,t} = Q^{\ast\top}V_tQ^\ast$. The corresponding gradient updates with respect to $W_{1,t}$ and $V_{1,t}$ are given by
$$W_{1,t+1} = W_{1,t} - \eta\big(W_{1,t}W_{1,t}^\top - V_{1,t}V_{1,t}^\top - \Sigma^\ast\big)W_{1,t},$$
$$V_{1,t+1} = V_{1,t} + \eta\big(W_{1,t}W_{1,t}^\top - V_{1,t}V_{1,t}^\top - \Sigma^\ast\big)V_{1,t},$$
$$\beta_{1,t+1} = W_{1,t+1}W_{1,t+1}^\top - V_{1,t+1}V_{1,t+1}^\top.$$
By selecting $\kappa$ properly, the following Lemma B.6 gives a concentration bound between our new estimator and $\mu^\ast\beta^\ast$.

Lemma B.6.
Suppose $y_i = f(\langle X_i, \beta^\ast\rangle) + \epsilon_i$, $X_i \in \mathbb{R}^{d\times d}$, and the entries of $X_i$ are i.i.d. random variables with density function $p(x)$. Under the assumptions in Theorem 4.4 we have
$$\Big\|\frac{1}{n}\sum_{i=1}^n H(y_iS(X_i), \kappa) - \mathbb{E}[Y\cdot S(X)]\Big\|_{\mathrm{op}} \le c\sqrt{M}\cdot\sqrt{\frac{d\log(2d)}{n}}$$
with probability $1 - (2d)^{-1}$, where $c$ is a constant.

Proof.
Please see \S B.2.1 for the detailed proof.

Thus, after following the same proof procedures as in Lemma B.1 and Lemma B.2, we obtain the conclusion of Theorem 4.4. Next, we give a detailed proof of Lemma B.6.

B.2.1 Proof of Lemma B.6
Proof.
Before applying the results in Minsker (2018), we need an upper bound on $\|\mathbb{E}[(yS(X) - \mu^\ast\beta^\ast)(yS(X) - \mu^\ast\beta^\ast)^\top]\|_{\mathrm{op}}$, and for this it suffices to bound $\|\mathbb{E}[y^2\cdot S(X)S(X)^\top]\|_{\mathrm{op}}$. For any unit vector $u \in \mathbb{R}^{d}$ we have
$$\mathbb{E}\big[y^2 u^\top S(X)S(X)^\top u\big] = \mathbb{E}\Big[y^2\sum_{i=1}^d\big(u^\top S(X)_{[:,i]}\big)^2\Big] = \sum_{i=1}^d\mathbb{E}\Big[y^2\big(u^\top S(X)_{[:,i]}\big)^2\Big] \le \sum_{i=1}^d\sqrt{\mathbb{E}[y^4]}\cdot\sqrt{\mathbb{E}\Big[\big(u^\top S(X)_{[:,i]}\big)^4\Big]} \le d\cdot\sqrt{M}\cdot\sqrt{\mathbb{E}\Big[\Big(\sum_{k=1}^d u_kS(X)_{[k,1]}\Big)^4\Big]}.$$
In order to upper bound the term $\mathbb{E}[(\sum_{k=1}^d u_kS(X)_{[k,1]})^4]$, we take advantage of the independence of the entries of $X$, so that
$$\mathbb{E}\Big[\Big(\sum_{k=1}^d u_kS(X)_{[k,1]}\Big)^4\Big] \le 3\sum_{i,j=1}^d u_i^2u_j^2\,\mathbb{E}\big[S(X)_{[i,1]}^2S(X)_{[j,1]}^2\big] \le 3\sum_{i,j=1}^d u_i^2u_j^2\sqrt{\mathbb{E}\big[S(X)_{[i,1]}^4\big]\,\mathbb{E}\big[S(X)_{[j,1]}^4\big]} \le 3M\sum_{i,j=1}^d u_i^2u_j^2 = 3M.$$
The last inequality follows from our Assumption 4.3. We therefore obtain the upper bound
$$\big\|\mathbb{E}\big[y^2\cdot S(X)S(X)^\top\big]\big\|_{\mathrm{op}} \le 2d\cdot M.$$
Similarly, we also get the upper bound $\|\mathbb{E}[y^2\cdot S(X)^\top S(X)]\|_{\mathrm{op}} \le 2d\cdot M$. By applying Corollary 3.1 in Minsker (2018), we get the following inequality:
$$\mathbb{P}\Big(\Big\|\frac{1}{n}\sum_{i=1}^n H(y_iS(X_i), \kappa) - \mathbb{E}[yS(X)]\Big\|_{\mathrm{op}} \ge t\Big) \le 2d\exp\big(-\kappa t\, n + \kappa^2\sigma_n^2\big), \qquad (B.15)$$
where
$$\sigma_n^2 = \max\Big(\Big\|\sum_{i=1}^n\mathbb{E}\big[y_i^2\cdot S(X_i)S(X_i)^\top\big]\Big\|_{\mathrm{op}},\ \Big\|\sum_{j=1}^n\mathbb{E}\big[y_j^2\cdot S(X_j)^\top S(X_j)\big]\Big\|_{\mathrm{op}}\Big) \le 2d\cdot M\cdot n.$$
Here we choose $t = c\sqrt{(d\cdot M\log(2d))/n}$ and $\kappa = c'\sqrt{\log(2d)/(n\cdot d\cdot M)}$ in (B.15) for appropriate absolute constants $c, c'$, so that the right-hand side of (B.15) is at most $(2d)^{-1}$, and we obtain
$$\Big\|\frac{1}{n}\sum_{i=1}^n H(y_iS(X_i), \kappa) - \mathbb{E}[yS(X)]\Big\|_{\mathrm{op}} \le c\sqrt{\frac{d\,M\log(2d)}{n}}$$
with probability $1 - (2d)^{-1}$. This completes the proof of Lemma B.6.

B.3 Algorithm in \S 4

Algorithm 4:
Algorithm for Low Rank Matrix SIM with General Design
Data:
Training design matrices $X_i \in \mathbb{R}^{d\times d}$, $i \in [n]$, response variables $\{y_i\}_{i=1}^n$, truncation parameter $\kappa$, initial value $\alpha$ and step size $\eta$;
Initialize $W_0 = \alpha\cdot I_{d\times d}$, $V_0 = \alpha\cdot I_{d\times d}$ and set the iteration number $t = 0$;
while $t < T$ do
  $W_{t+1} = W_t - \eta\big(W_tW_t^\top - V_tV_t^\top - \frac{1}{2n}\sum_{i=1}^n H(y_iS(X_i),\kappa) - \frac{1}{2n}\sum_{i=1}^n H(y_iS(X_i),\kappa)^\top\big)W_t$;
  $V_{t+1} = V_t + \eta\big(W_tW_t^\top - V_tV_t^\top - \frac{1}{2n}\sum_{i=1}^n H(y_iS(X_i),\kappa) - \frac{1}{2n}\sum_{i=1}^n H(y_iS(X_i),\kappa)^\top\big)V_t$;
  $\beta_{t+1} = W_{t+1}W_{t+1}^\top - V_{t+1}V_{t+1}^\top$;
  $t = t + 1$;
end
Result: Output the final estimate $\hat{\beta} = \beta_T$.
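Algorithm 4 repeatedly applies the truncation $H(\cdot, \kappa)$ to $y_iS(X_i)$; its exact definition is given in the main text and is not reproduced here. Purely as an illustration of the kind of spectral truncation used to handle heavy-tailed data (in the spirit of Minsker, 2018), the following sketch applies a hypothetical scalar truncation $\psi(x) = \mathrm{sign}(x)\min(|x|, 1)$ to the singular values of $\kappa A$ and rescales; this specific choice of $\psi$ and the stated scaling of $\kappa$ are assumptions made for the example, not the paper's definition.

    import numpy as np

    def spectral_truncate(A, kappa):
        """Hypothetical truncation: cap the singular values of kappa * A at 1,
        then rescale by 1 / kappa, so the output has operator norm at most 1 / kappa."""
        U, s, Vt = np.linalg.svd(kappa * A, full_matrices=False)
        return (U * np.minimum(s, 1.0)) @ Vt / kappa

    # The proof of Lemma B.6 suggests a truncation level of order
    # kappa ~ sqrt(log(2 * d) / (n * d * M)) for d x d covariates and n samples.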
C Extension to One-bit Compressed Sensing

As a concrete example, in the following we consider the one-bit compressed sensing model (Jacques et al., 2013; Plan and Vershynin, 2013). The response variables and the covariates satisfy
$$y_i = \mathrm{sign}(\langle x_i, \beta^\ast\rangle) + \epsilon_i, \quad \forall\, i \in [n],$$
where $\mathrm{sign}(x) = 1$ for $x \ge 0$ and $\mathrm{sign}(x) = -1$ for
$x < 0$, and $n$ is the number of observations. Moreover, for both the vector and matrix settings, we assume that the entries of $x_i$ are i.i.d. $N(0,1)$ random variables and that $\{\epsilon_i\}_{i\in[n]}$ are i.i.d. sub-Gaussian random variables. As $\{y_i\}_{i\in[n]}$ do not convey any information about the length of the signal $\beta^\ast$, we are only able to recover the direction of $\beta^\ast$ from the measurements $\{x_i, y_i\}_{i\in[n]}$. Following the iterative procedures in Algorithm 5 and Algorithm 6, we summarize our theoretical results in the following Corollary C.1.

Corollary C.1.
In the scenario of the vector SIM, we let the initial value $\alpha$ satisfy $0 < \alpha \lesssim 1/p$ and set the stepsize $\eta$ as $0 < \eta \lesssim 1/(\max_i|\beta_i^\ast|)$ in our Algorithm 5. Then there exist constants $a_1, a_2$ such that for any $t \in [a_1\log(s_m/\alpha)/(\eta s_m),\ a_2\sqrt{n/\log p}/\eta]$, we obtain that
$$\Big\|\frac{\beta_t}{\|\beta_t\|_2} - \beta^\ast\Big\|_2^2 \lesssim \frac{s_1\log n}{n} + \frac{s_2\log p}{n} \quad\text{and}\quad \Big\|\frac{\beta_t}{\|\beta_t\|_2} - \beta^\ast\Big\|_1 \lesssim \sqrt{s_1+s_2}\cdot\sqrt{\frac{s_1\log n}{n} + \frac{s_2\log p}{n}}$$
hold with probability $1 - p^{-1} - n^{-1}$. In the case of low-rank matrix recovery, we choose $\alpha$ with $0 < \alpha \lesssim 1/d$ and a stepsize $\eta$ satisfying $0 < \eta \lesssim 1/(\max_i|r_i^\ast|)$ in our Algorithm 6. Then there exist constants $a_3, a_4$ such that for any $t \in [a_3\log(r_m/\alpha)/(\eta r_m),\ a_4\sqrt{n/(d\log d)}/\eta]$, we have that
$$\Big\|\frac{\beta_t}{\|\beta_t\|_{\mathrm{F}}} - \beta^\ast\Big\|_{\mathrm{F}}^2 \lesssim \frac{r\,d\log d}{n} \quad\text{and}\quad \Big\|\frac{\beta_t}{\|\beta_t\|_{\mathrm{F}}} - \beta^\ast\Big\|_{\ast} \lesssim r\sqrt{\frac{d\log d}{n}}$$
hold with probability $1 - 1/(2d) - 2/n$.

Proof. The proof of Corollary C.1 is straightforward, following the proof procedures of Theorem 3.2 and Theorem 4.2, so we omit the relevant details here. The only difference is that $Y\cdot X$ is an unbiased estimator of $\sqrt{2/\pi}\,\beta^\ast$ by properties of the standard Gaussian distribution, rather than by Stein's lemma, since $f(x) = \mathrm{sign}(x)$ is not a differentiable function. A proof of this property can be found in Lemma 4.1 in Plan and Vershynin (2012).

Compared with existing work on high-dimensional one-bit compressed sensing (Plan and Vershynin, 2013; Goldstein et al., 2018; Thrampoulidis and Rawat, 2018), instead of adding $\ell_1$-regularizers and tuning parameters, we are able to achieve minimax optimal (up to logarithmic terms) $\ell_1$- and $\ell_2$-statistical rates under both the sparse vector and low-rank matrix settings by simply running gradient descent on the over-parameterized loss functions (3.2) and (4.1) and adopting early stopping via out-of-sample prediction.
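The unbiasedness fact used in the proof above, namely $\mathbb{E}[y\,x] = \sqrt{2/\pi}\,\beta^\ast$ for the sign link with standard Gaussian covariates and $\beta^\ast$ normalized to unit norm, is easy to verify numerically. The sketch below does so with noiseless responses; the dimension and sample size are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(1)
    p, n = 50, 200_000
    beta_star = rng.standard_normal(p)
    beta_star /= np.linalg.norm(beta_star)     # only the direction is identifiable

    x = rng.standard_normal((n, p))
    y = np.sign(x @ beta_star)                 # noiseless one-bit responses
    m = x.T @ y / n                            # empirical estimate of E[y x]
    print(np.linalg.norm(m - np.sqrt(2 / np.pi) * beta_star))   # small for large n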
Algorithm 5: Algorithm for Vector SIM with Known Link Function
Data:
Training data $\{x_i, y_i\}_{i=1}^n$, testing data $\{\tilde{x}_i, \tilde{y}_i\}_{i=1}^n$, initial value $\alpha$, step size $\eta$ and maximal iteration number $T_m$;
Initialize $w_0 = \alpha\cdot\mathbf{1}_{p\times 1}$, $v_0 = \alpha\cdot\mathbf{1}_{p\times 1}$ and set the iteration number $t = 0$;
while $t < T_m$ do
  $w_{t+1} = w_t - \eta\big(w_t\odot w_t - v_t\odot v_t - \frac{1}{n}\sum_{i=1}^n x_iy_i\big)\odot w_t$;
  $v_{t+1} = v_t + \eta\big(w_t\odot w_t - v_t\odot v_t - \frac{1}{n}\sum_{i=1}^n x_iy_i\big)\odot v_t$;
  $\beta_{t+1} = w_{t+1}\odot w_{t+1} - v_{t+1}\odot v_{t+1}$;
  $t = t + 1$;
end
Result: Choose $\tilde{t}$ such that $\frac{1}{n}\sum_{i=1}^n\big[\tilde{y}_i - f\big(\tilde{x}_i^\top\beta_t/\|\beta_t\|_2\big)\big]^2 < \frac{1}{n}\sum_{i=1}^n\big[\tilde{y}_i - f\big(\tilde{x}_i^\top\beta_{t+1}/\|\beta_{t+1}\|_2\big)\big]^2$ first holds, or such that $\frac{1}{n}\sum_{i=1}^n\big[\tilde{y}_i - f\big(\tilde{x}_i^\top\beta_t/\|\beta_t\|_2\big)\big]^2$ is minimized over all iterations; then output the final estimate $\hat{\beta} = \beta_{\tilde{t}}$.
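A compact NumPy sketch of Algorithm 5 follows; the Hadamard-product parameterization and the out-of-sample stopping rule mirror the pseudocode above, while the default values of $\alpha$, $\eta$ and $T_m$ and the one-bit usage example in the final comment are illustrative assumptions.

    import numpy as np

    def vector_sim_gd(x_tr, y_tr, x_te, y_te, f, alpha=1e-3, eta=0.1, T_m=2000):
        """Algorithm 5 sketch: gradient descent on beta = w * w - v * v with early
        stopping chosen by the out-of-sample prediction error on (x_te, y_te)."""
        n, p = x_tr.shape
        g = x_tr.T @ y_tr / n                    # (1/n) sum_i x_i y_i
        w = alpha * np.ones(p)
        v = alpha * np.ones(p)
        best_err, best_beta = np.inf, np.zeros(p)
        for _ in range(T_m):
            r = w * w - v * v - g                # shared factor in the updates
            w, v = w - eta * r * w, v + eta * r * v
            beta = w * w - v * v
            direction = beta / (np.linalg.norm(beta) + 1e-12)
            err = np.mean((y_te - f(x_te @ direction)) ** 2)
            if err < best_err:                   # keep the best iterate seen so far
                best_err, best_beta = err, beta
        return best_beta

    # With the one-bit model of Appendix C, the known link is f = np.sign:
    # beta_hat = vector_sim_gd(x_tr, y_tr, x_te, y_te, f=np.sign)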
Algorithm 6: Algorithm for Low Rank Matrix SIM with Known Link Function

Data:
Training data $X_i \in \mathbb{R}^{d\times d}$, $i \in [n]$, $y \in \mathbb{R}^n$, testing data $\tilde{X}_i \in \mathbb{R}^{d\times d}$, $i \in [n]$, $\tilde{y} \in \mathbb{R}^n$, initial value $\alpha$, step size $\eta$ and maximal iteration number $T_m$;
Initialize $W_0 = \alpha\cdot I_{d\times d}$, $V_0 = \alpha\cdot I_{d\times d}$ and set the iteration number $t = 0$;
while $t < T_m$ do
  $W_{t+1} = W_t - \eta\big(W_tW_t^\top - V_tV_t^\top - \frac{1}{2n}\sum_{i=1}^n X_iy_i - \frac{1}{2n}\sum_{i=1}^n X_i^\top y_i\big)W_t$;
  $V_{t+1} = V_t + \eta\big(W_tW_t^\top - V_tV_t^\top - \frac{1}{2n}\sum_{i=1}^n X_iy_i - \frac{1}{2n}\sum_{i=1}^n X_i^\top y_i\big)V_t$;
  $\beta_{t+1} = W_{t+1}W_{t+1}^\top - V_{t+1}V_{t+1}^\top$;
  $t = t + 1$;
end
Result: Choose $\tilde{t}$ such that $\frac{1}{n}\sum_{i=1}^n\big[\tilde{y}_i - f\big(\mathrm{tr}\big(\tilde{X}_i^\top\beta_t/\|\beta_t\|_{\mathrm{F}}\big)\big)\big]^2 < \frac{1}{n}\sum_{i=1}^n\big[\tilde{y}_i - f\big(\mathrm{tr}\big(\tilde{X}_i^\top\beta_{t+1}/\|\beta_{t+1}\|_{\mathrm{F}}\big)\big)\big]^2$ first holds, or such that $\frac{1}{n}\sum_{i=1}^n\big[\tilde{y}_i - f\big(\mathrm{tr}\big(\tilde{X}_i^\top\beta_t/\|\beta_t\|_{\mathrm{F}}\big)\big)\big]^2$ is minimized over all iterations; then output the final estimate $\hat{\beta} = \beta_{\tilde{t}}$.

References

Allen-Zhu, Z., Li, Y. and Liang, Y. (2019a). Learning and generalization in overparameterized neural networks, going beyond two layers. In
Advances in Neural Information Processing Systems .Allen-Zhu, Z., Li, Y. and Song, Z. (2019b). A convergence theory for deep learning via over-parameterization. In
International Conference on Machine Learning .Arora, S., Cohen, N., Hu, W. and Luo, Y. (2019a). Implicit regularization in deep matrix factor-ization. In
Advances in Neural Information Processing Systems .Arora, S., Du, S., Hu, W., Li, Z. and Wang, R. (2019b). Fine-grained analysis of optimization andgeneralization for overparameterized two-layer neural networks. In
International Conference onMachine Learning .Arulkumaran, K., Deisenroth, M. P., Brundage, M. and Bharath, A. A. (2017). Deep reinforcementlearning: A brief survey.
IEEE Signal Processing Magazine , arXiv preprintarXiv:1906.03830 .Babichev, D., Bach, F. et al. (2018). Slice inverse regression with score functions. Electronic Journalof Statistics , arXiv preprint arXiv:1910.01619 .Balasubramanian, K., Fan, J. and Yang, Z. (2018). Tensor methods for additive index models underdiscordance and heterogeneity. arXiv preprint arXiv:1807.06693 .Bandeira, A. S. and Handel, R. v. (2016). Sharp nonasymptotic bounds on the norm of randommatrices with independent entries. Annals of Probability , arXiv preprint arXiv:1906.11300 .Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2018). Reconciling modern machine learning practiceand the bias-variance trade-off. arXiv preprint arXiv:1812.11118 .Belkin, M., Hsu, D. and Xu, J. (2019). Two models of double descent for weak features. arXivpreprint arXiv:1903.07571 .Brillinger, D. R. (1982). A generalized linear model with “Gaussian” regressor variables. AFestschrift For Erich L. Lehmann
Annals of Statistics , Comptes rendus-Mathematique , SIAM review , arXiv preprint arXiv:1902.01384 .Carroll, R. J., Fan, J., Gijbels, I. and Wand, M. P. (1997). Generalized partially linear single-indexmodels. Journal of the American Statistical Association , Annales de l’Institut Henri Poincar´e, Probabilit´es et Statistiques , Advances in Neural Information ProcessingSystems .Chizat, L. and Bach, F. (2020). Implicit bias of gradient descent for wide two-layer neural networkstrained with the logistic loss. arXiv preprint arXiv:2002.04486 .Chizat, L., Oyallon, E. and Bach, F. (2019). On lazy training in differentiable programming. In
Advances in Neural Information Processing Systems .Cook, R. D. (1998). Principal Hessian directions revisited.
Journal of the American StatisticalAssociation , Journal of theAmerican Statistical Association , Journal of the American Statistical Association , Advances in Neural Information Processing Systems .Deng, Z., Kammoun, A. and Thrampoulidis, C. (2019). A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822 .Derezi´nski, M., Liang, F. and Mahoney, M. W. (2019). Exact expressions for double descent andimplicit regularization via surrogate random design. arXiv preprint arXiv:1912.04533 .Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019a). Gradient descent finds global minima ofdeep neural networks. In
International Conference on Machine Learning.
Du, S. S., Hu, W. and Lee, J. D. (2018). Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In
Advances in Neural Information Processing Sys-tems .Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2019b). Gradient descent provably optimizes over-parameterized neural networks. In
International Conference on Learning Representations .Duan, N., Li, K.-C. et al. (1991). Slicing regression: A link-free regression method.
Annals ofStatistics , Journal of the American Statistical Association , Annals of Statistics , arXiv preprintarXiv:1904.05526 .Fan, J., Wang, K., Zhong, Y. and Zhu, Z. (2020a). Robust high dimensional factor models withapplications to statistical machine learning. Statist. Sci. to appear.Fan, J., Wang, W. and Zhong, Y. (2019b). Robust covariance estimation for approximate factormodels.
Journal of Econometrics , Annals of Statistics to appear.Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized estima-tion.
Annals of Statistics , IEEE Transactions on Information Theory , Advances in Neural Information Processing Systems .Goldstein, L., Minsker, S. and Wei, X. (2018). Structured signal recovery from non-linear andheavy-tailed measurements.
IEEE Transactions on Information Theory , Information and Inference: A Journal of the IMA , Deep learning . MIT press.Gunasekar, S., Lee, J., Soudry, D. and Srebro, N. (2018a). Characterizing implicit bias in terms ofoptimization geometry. In
International Conference on Machine Learning.
Gunasekar, S., Lee, J. D., Soudry, D. and Srebro, N. (2018b). Implicit bias of gradient descent on linear convolutional networks. In
Advances in Neural Information Processing Systems .Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B. and Srebro, N. (2017). Implicitregularization in matrix factorization. In
Advances in Neural Information Processing Systems .Gy¨orfi, L., Krzy˙zak, A., Kohler, M. and Walk, H. (2002).
A distribution-free theory of nonpara-metric regression . Springer.Han, A. K. (1987). Non-parametric analysis of a generalized regression model: The maximum rankcorrelation estimator.
Journal of Econometrics , AOS arXiv preprint arXiv:1903.08560 .Hoff, P. D. (2017). Lasso, fractional norm and structured sparse estimation using a Hadamardproduct parametrization.
Computational Statistics & Data Analysis , Semiparametric and nonparametric methods in econometrics , vol. 12.Springer.Huang, K., Wang, Y., Tao, M. and Zhao, T. (2020). Why do deep residual networks generalizebetter than deep feedforward networks?–A neural tangent kernel perspective. arXiv preprintarXiv:2002.06262 .Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generaliza-tion in neural networks. In
Advances in Neural Information Processing Systems .Jacques, L., Laska, J. N., Boufounos, P. T. and G., B. R. (2013). Robust 1-bit compressive sensingvia binary stable embeddings of sparse vectors.
IEEE Transactions on Information Theory , International Conference on Learning Representations .Ji, Z. and Telgarsky, M. (2019b). The implicit bias of gradient descent on nonseparable data. In
Conference on Learning Theory .Ji, Z. and Telgarsky, M. (2019c). A refined primal-dual analysis of the implicit bias. arXiv preprintarXiv:1906.04540 .Jiang, B., Liu, J. S. et al. (2014). Variable selection for general index models via sliced inverseregression.
Annals of Statistics , Statistical Science , International Con-ference on Learning Representations .Kini, G. and Thrampoulidis, C. (2020). Analytic study of double descent in binary classification:The impact of loss. arXiv preprint arXiv:2001.11572 .Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimalrates for noisy low-rank matrix completion.
Annals of Statistics , nature , Advances in Neural Information Processing Systems .Li, K.-C. (1991). Sliced inverse regression for dimension reduction.
Journal of the American Sta-tistical Association , Journal of the American Statistical Association , Annals of Statistics , arXiv preprint arXiv:1701.07274 .Li, Y., Ma, T. and Zhang, H. (2018). Algorithmic regularization in over-parameterized matrixsensing and neural networks with quadratic activations. In COLT .Liang, T. and Rakhlin, A. (2018). Just interpolate: Kernel ”ridgeless” regression can generalize. arXiv preprint arXiv:1808.00387 .Lin, Q., Zhao, Z. and Liu, J. S. (2019). Sparse sliced inverse regression via Lasso.
Journal of theAmerican Statistical Association
Annals of Statistics, International Conference on Learning Representations.
Ma, C., Wang, K., Chi, Y. and Chen, Y. (2020). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution.
Foundations of Computational Mathematics , arXiv preprint arXiv:1912.06987 .McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models . Chapman and Hall.Mei, S., Misiakiewicz, T. and Montanari, A. (2019). Mean-field theory of two-layers neural net-works: Dimension-free bounds and kernel limit. In
Conference on Learning Theory .Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Preciseasymptotics and double descent curve. arXiv preprint arXiv:1908.05355 .Mei, S., Montanari, A. and Nguyen, P.-M. (2018). A mean field view of the landscape of two-layerneural networks.
Proceedings of the National Academy of Sciences , E7665–E7671.Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailedentries.
Annals of Statistics , Bernoulli , arXiv preprintarXiv:1911.01544 .Muthukumar, V., Vodrahalli, K., Subramanian, V. and Sahai, A. (2020). Harmless interpolationof noisy data in regression. IEEE Journal on Selected Areas in Information Theory .Na, S., Yang, Z., Wang, Z. and Kolar, M. (2019). High-dimensional varying index coefficient modelsvia stein’s identity.
Journal of Machine Learning Research, International Conference on Artificial Intelligence and Statistics.
Neykov, M., Liu, J. S. and Cai, T. (2016a). $\ell_1$-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, Advances in Neural Information Processing Systems.
Neyshabur, B., Tomioka, R., Salakhutdinov, R. and Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071.
Neyshabur, B., Tomioka, R. and Srebro, N. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.
Otter, D. W., Medina, J. R. and Kalita, J. K. (2020). A survey of the usages of deep learning in natural language processing.
IEEE Transactions on Neural Networks and Learning Systems, arXiv preprint arXiv:1812.10004.
Plan, Y. and Vershynin, R. (2012). Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach.
IEEE Transactions on Information Theory , .Plan, Y. and Vershynin, R. (2013). One-bit compressed sensing by linear programming. Commu-nications on Pure and Applied Mathematics , .Plan, Y. and Vershynin, R. (2016). The generalized Lasso with non-linear observations. IEEETransactions on information theory , Information and Inference: A Journal of the IMA , arXivpreprint arXiv:1801.00173 .Qian, W., Ding, S. and Cook, R. D. (2019). Sparse minimum discrepancy approach to sufficientdimension reduction with simultaneous variable selection in ultrahigh dimension. Journal of theAmerican Statistical Association , Advances inNeural Information Processing Systems .Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over (cid:96) q -balls. TIT , SIAM Review , Annalsof Statistics , arXiv preprint arXiv:1805.00915 .Sirignano, J. and Spiliopoulos, K. (2018). Mean field analysis of neural networks. arXiv preprintarXiv:1805.01053 . 64oudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S. and Srebro, N. (2018). The implicit bias ofgradient descent on separable data. Journal of Machine Learning Research , Stein’s Method . Institute of Mathematical Statistics, 1–25.Stein, C. et al. (1972). A bound for the error in the normal approximation to the distributionof a sum of dependent random variables. In
Proceedings of the Sixth Berkeley Symposium onMathematical Statistics and Probability , vol. 2. The Regents of the University of California.Swirszcz, G., Czarnecki, W. M. and Pascanu, R. (2016). Local minima in training of neural net-works. In
International Conference on Learning Representations .Tan, K. M., Wang, Z., Zhang, T., Liu, H. and Cook, R. D. (2018). A convex formulation for high-dimensional sparse sliced inverse regression.
Biometrika , Advances in Neural Information Processing Systems .Thrampoulidis, C. and Rawat, A. S. (2018). The generalized Lasso for sub-Gaussian observationswith dithered quantization.
Allerton Conference on Communication, Control, and Computing arXiv preprint arXiv:2003.01200 .Tsybakov, A. B. (2008).
Introduction to Nonparametric Estimation . Springer.Vaˇskeviˇcius, T., Kanade, V. and Rebeschini, P. (2019). Implicit regularization for optimal sparserecovery. In
Advances in Neural Information Processing Systems .Vershynin, R. (2018).
High-Dimensional Probability: An Introduction with Applications in DataScience . Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge UniversityPress.Voulodimos, A., Doulamis, N., Doulamis, A. and Protopapadakis, E. (2018). Deep learning forcomputer vision: A brief review.
Computational Intelligence and Neuroscience , .Wei, C., Lee, J. D., Liu, Q. and Ma, T. (2019). Regularization matters: Generalization and opti-mization of neural nets vs their induced kernel. In Advances in Neural Information ProcessingSystems .Wei, X. (2018). Structured recovery with heavy-tailed measurements: A thresholding procedureand optimal rates. arXiv preprint arXiv:1804.05959 .Wei, X. and Minsker, S. (2017). Estimation of the covariance structure of heavy-tailed distributions.In
Advances in Neural Information Processing Systems .65einan, E., Ma, C. and Wu, L. (2019). A comparative analysis of optimization and generaliza-tion properties of two-layer neural network and random feature models under gradient descentdynamics.
Science China Mathematics
Advances in Neural Information ProcessingSystems .Xia, Y. (2008). A multiple-index model and dimension reduction.
Journal of the American Statis-tical Association , Biometrika , arXiv preprint arXiv:1806.04339 .Yang, Z., Balasubramanian, K. and Liu, H. (2017a). High-dimensional non-Gaussian single in-dex models via thresholded score function estimation. In International Conference on MachineLearning . JMLR. org.Yang, Z., Balasubramanian, K., Wang, Z. and Liu, H. (2017b). Estimating high-dimensional non-Gaussian multiple index models via Stein’s lemma. In
Advances in Neural Information ProcessingSystems .Yang, Z., Yang, L. F., Fang, E. X., Zhao, T., Wang, Z. and Neykov, M. (2019). Misspecified non-convex statistical optimization for sparse phase retrieval.
Mathematical Programming , Advances in Neural Information Processing Systems .Yun, C., Sra, S. and Jadbabaie, A. (2019). Small nonlinearities in activation functions create badlocal minima in neural networks. In
International Conference on Learning Representations .Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017). Understanding deep learningrequires rethinking generalization.
International Conference on Learning Representations .Zhang, C.-H. et al. (2010). Nearly unbiased variable selection under minimax concave penalty.
Annals of Statistics, International Conference on Machine Learning.
Zhao, P., Yang, Y. and He, Q.-C. (2019). Implicit regularization via Hadamard product over-parametrization in high-dimensional linear regression. arXiv preprint arXiv:1903.09367.
Zhu, Z. (2017). Taming the heavy-tailed features by shrinkage and clipping. arXiv preprint arXiv:1710.09020.
Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.