When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?
Niladri S. ChatterjiUniversity of California, Berkeley [email protected]
Philip M. LongGoogle [email protected]
Peter L. BartlettUniversity of California, Berkeley & Google [email protected]
February 10, 2021
Abstract
We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
1 Introduction

Interest in the properties of interpolating deep learning models trained with first-order optimization methods is surging [Zha+17a; Bel+19]. One important question is to understand how gradient descent with appropriate random initialization routinely finds interpolating (near-zero training loss) solutions to these non-convex optimization problems.

In this paper our focus is to understand when gradient descent drives the logistic loss to zero when applied to fixed-width deep networks using smooth approximations to the ReLU activation function. We derive upper bounds on the rate of convergence under two conditions. The first result only requires that the initial loss is small, but does not require any assumption about the width of the network. It guarantees that if the initial loss is small then gradient descent drives the logistic loss down to zero. The second result is under a separation condition on the data. Under this assumption we demonstrate that the loss decreases adequately in the initial iterations so that the first result applies.

A few ideas that facilitate our analysis are as follows: under the first set of assumptions, when the loss is small, we show that the negative gradient aligns with the weights of the network. This lower bounds the norm of the gradient at the beginning of the gradient step and implies that the loss decreases quickly at the beginning of the step. We then show that the operator norm of the Hessian of the loss remains bounded throughout the gradient step. To do this, inspired by Allen-Zhu, Li, and Song [ALS19], instead of directly analyzing the operator norm of the Hessian, we analyze the difference between the gradients associated with neighboring parameter vectors. The smoothness of the loss combined with the lower bound on the norm of the gradient at the beginning of the step implies that the loss decreases throughout the gradient step when the step-size is small enough.

The second sufficient condition is that the data is separable by a margin using the features obtained from the gradient of the neural network at initialization (see Assumption 3.2). This assumption has previously been studied by Chen et al. [Che+21]. Intuitively, it is weaker than an assumption that the training examples are not too close, as we discuss after its definition. Under this assumption we use a neural tangent kernel (NTK) analysis to show that the loss decreases sufficiently in the first stage of optimization so that we can invoke our first result to guarantee that the loss decreases thereafter in the second stage. To analyze this first stage we borrow ideas from [ALS19; Zou+20], because the formulation of their results was most closely aligned with our needs. However, we note that their results do not directly apply, since they study networks with ReLU activations while we study smooth approximations to the ReLU. In addition to adapting their proofs to our setting, we also worked out some details in the original proofs.

Our first result could be viewed as a tool to establish convergence under a wide variety of conditions. Our second result is one example of how it may be applied. Other separation assumptions on the data, like the ones studied by Ji and Telgarsky [JT19b], Chen et al. [Che+21], and Zou et al.
[Zou+20], could also be used in conjunction with our first result to establish convergence to zero training loss.

Recently, Chatterji, Long, and Bartlett [CLB20] showed that gradient descent applied to two-layer neural networks drives the logistic loss to zero when the initial loss is small and the activation functions are Huberized ReLUs. Our work can be viewed as a generalization of their result to the case of deep networks. Previously, Lyu and Li [LL20] proved that gradient descent applied to deep networks drives the training logistic loss to zero, using the alignment between the gradient and the weight vectors. However, their result requires the neural network to be both positive homogeneous and smooth, which rules out the ReLU and close approximations to it like Swish [RZL18] or the Huberized ReLU [Tat+20] that are widely used in practice. Their results do apply in the case that the ReLU is raised to a power strictly greater than two.

Prior work has shown that gradient descent drives the squared loss of fixed-width deep networks to zero [Du+18; Du+19; ALS19; OS20], using the NTK perspective [JGH18; COB19]. The logistic loss, however, is qualitatively different. Driving the logistic loss to zero requires the weights to go to infinity, far from their initial values. This means that a Taylor approximation around the initial values cannot be applied. While the NTK framework has also been applied to analyze training with the logistic loss, a typical result [LL18; ALS19; Zou+20] is that after poly(1/ε) updates, a network of size or width poly(1/ε) achieves ε loss. Thus, to guarantee loss very close to zero, these analyses require larger and larger networks. The reason for this appears to be that a key part of these analyses is to show that a wider network can achieve a certain fixed loss by traveling a shorter distance in parameter space. Since it seems that, to drive the logistic loss to zero with a fixed-width network, the parameters must travel an unbounded distance, the NTK approach cannot be applied to obtain the results of this paper.

The remainder of the paper is organized as follows. In Section 2 we introduce notation and definitions. In Section 3 we present our main theorems. We provide a proof of our first result, Theorem 3.1, in Section 4. We conclude with a discussion in Section 5. Appendix A points to other related work. The proof of our second result, Theorem 3.3, and other technical details are presented in the appendix.

2 Preliminaries

This section includes notational conventions and a description of the setting.
Given a vector $v$, let $\|v\|$ denote its Euclidean norm, $\|v\|_p$ denote its $\ell_p$-norm for any $p \ge 1$, $\|v\|_0$ denote the number of non-zero entries, and $\mathrm{diag}(v)$ denote a diagonal matrix with $v$ along the diagonal. We say a vector $v$ is $k$-sparse if $\|v\|_0 \le k$. Given a matrix $M$, let $\|M\|$ denote its Frobenius norm, $\|M\|_{op}$ denote its operator norm and $\|M\|_0$ denote the number of non-zero entries in the matrix. Given either a matrix or a tensor we let $\mathrm{vec}(\cdot)$ be its vectorization. Given a tensor $T$, let $\|T\| = \|\mathrm{vec}(T)\|$; we will sometimes call this the Frobenius norm of $T$. If, for matrices $T_1, \ldots, T_{L+1}$ of different shapes, we refer to them collectively as $T$, we define $\|T\|$ analogously. Given two tensors $A$ and $B$, let $A \cdot B$ denote the element-wise dot product $\mathrm{vec}(A) \cdot \mathrm{vec}(B)$. We use the standard "big Oh notation" [see, e.g., Cor+09]. For any $k \in \mathbb{N}$, we denote the set $\{1, \ldots, k\}$ by $[k]$. For a number $p$ of inputs, we denote the set of unit-length vectors in $\mathbb{R}^p$ by $\mathbb{S}^{p-1}$. We will use $c, c', c_1, \ldots$ to denote constants, which may take different values in different contexts.

We will analyze gradient descent applied to minimize the training loss of a multi-layer network. We assume that the number of inputs is equal to the number of hidden nodes per layer to simplify the presentation of our results. Our techniques can easily extend to the case where there are different numbers of hidden nodes in different layers. Let $p$ denote the number of inputs and the number of hidden nodes per layer, and let $L$ denote the number of hidden layers.

We will denote the activation function by $\phi$. Given a vector $v$, let $\phi(v)$ denote the vector obtained by applying the activation function to each coordinate. We study activation functions that are similar to the ReLU activation function but are smooth.

Definition 2.1.
An activation function $\phi$ is $h$-smoothly approximately ReLU if

- $\phi(0) = 0$;
- for all $z \in \mathbb{R}$: $|\phi'(z)\, z - \phi(z)| \le h$, $|\phi'(z)| \le 1$ and $|\phi''(z)| \le 1/h$.

It is easy to verify that such activation functions $\phi$ are contractive with respect to the Euclidean norm. That is, for any $v_1, v_2 \in \mathbb{R}^p$, $\|\phi(v_1) - \phi(v_2)\| \le \|v_1 - v_2\|$. See Lemma B.8 for a proof of this fact. Here are a couple of examples of activation functions that are $h$-smoothly approximately ReLU.

1. Huberized ReLU [Tat+20]:
$$\phi(z) := \begin{cases} 0 & \text{if } z < 0, \\ \frac{z^2}{2h} & \text{if } z \in [0, h], \\ z - \frac{h}{2} & \text{otherwise.} \end{cases} \qquad (1)$$

2. Scaled Swish [RZL18]: $\phi(z) = \frac{1}{1.1}\, z\, \sigma(z/h)$, where $\sigma$ is the logistic sigmoid. The scaling factor $1/1.1$ ensures that $|\phi'(z)| \le 1$.

For $i \in \{1, \ldots, L\}$, let $V_i \in \mathbb{R}^{p \times p}$ be the weight matrix of the $i$th layer and let $V_{L+1} \in \mathbb{R}^{1 \times p}$ be the weight vector corresponding to the outer layer. Let $V = (V_1, \ldots, V_{L+1})$ consist of all of the trainable parameters in the network. Let $f_V$ denote the function computed by the network, which maps $x$ to
$$f_V(x) = V_{L+1}\, \phi(V_L \cdots \phi(V_1 x)).$$

Consider a training set $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{S}^{p-1} \times \{-1, 1\}$. For any sample $s \in [n]$, define $u^V_{0,s} = x^V_{0,s} := x_s$ and, for all $\ell \in [L]$, define
$$u^V_{\ell,s} := V_\ell\, x^V_{\ell-1,s} \quad \text{and} \quad x^V_{\ell,s} := \phi\big(V_\ell\, x^V_{\ell-1,s}\big),$$
that is, $u^V_{\ell,s}$ refers to the pre-activation features in layer $\ell$, while $x^V_{\ell,s}$ corresponds to the features after applying the activation function in the $\ell$th layer. Also, for any $\ell \in [L]$ and $s \in [n]$, let
$$\Sigma^V_{\ell,s} := \mathrm{diag}\big(\phi'(u^V_{\ell,s})\big) = \mathrm{diag}\big(\phi'(V_\ell\, x^V_{\ell-1,s})\big).$$

Define the training loss (empirical risk with respect to the logistic loss) $J$ by
$$J(V) := \frac{1}{n} \sum_{s=1}^{n} \log\big(1 + \exp(-y_s f_V(x_s))\big),$$
and refer to the loss on example $s$ by $J(V; x_s, y_s) := \log(1 + \exp(-y_s f_V(x_s)))$. The gradient of the loss evaluated at $V$ is
$$\nabla_V J(V) = \frac{1}{n} \sum_{s=1}^{n} \frac{-y_s\, \nabla_V f_V(x_s)}{1 + \exp(y_s f_V(x_s))},$$
and the partial gradient of $f_V$ with respect to $V_\ell$ has the formula [see, e.g., Zou+20]
$$\frac{\partial f_V(x_s)}{\partial V_\ell} = \left[\Sigma^V_{\ell,s} \prod_{j=\ell+1}^{L} \big(V_j^\top \Sigma^V_{j,s}\big)\, V_{L+1}^\top\right] x^{V\top}_{\ell-1,s}, \quad \text{when } \ell \in [L], \qquad (2a)$$
$$\frac{\partial f_V(x_s)}{\partial V_{L+1}} = x^{V\top}_{L,s}. \qquad (2b)$$

We analyze the iterates of gradient descent $V^{(1)}, V^{(2)}, \ldots$ defined by
$$V^{(t+1)} := V^{(t)} - \alpha_t\, \nabla_V J\big|_{V = V^{(t)}}$$
in terms of the properties of $V^{(1)}$.
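To make Definition 2.1 concrete, here is a small numerical sketch (ours, not from the paper) of the two activations above, together with a finite-difference check of the two quantities $|\phi'(z)z - \phi(z)|$ and $|\phi'(z)|$ from the definition. The grid and the value $h = 0.1$ are arbitrary illustrative choices, and the scaled-Swish form follows the description given above.

```python
import numpy as np

h = 0.1  # smoothing parameter (illustrative choice)

def huberized_relu(z, h=h):
    # 0 for z < 0, z^2/(2h) on [0, h], z - h/2 afterwards (Equation (1))
    return np.where(z < 0, 0.0, np.where(z <= h, z**2 / (2 * h), z - h / 2))

def scaled_swish(z, h=h):
    # z * sigmoid(z / h), scaled by 1/1.1 so that the derivative stays <= 1
    return z / 1.1 / (1.0 + np.exp(-z / h))

def num_derivative(phi, z, eps=1e-6):
    # central finite-difference approximation of phi'(z)
    return (phi(z + eps) - phi(z - eps)) / (2 * eps)

z = np.linspace(-1.0, 1.0, 20001)
for name, phi in [("huberized ReLU", huberized_relu), ("scaled Swish", scaled_swish)]:
    d = num_derivative(phi, z)
    print(name,
          "max |phi'(z) z - phi(z)| =", np.max(np.abs(d * z - phi(z))),
          "max |phi'(z)| =", np.max(np.abs(d)))
```

For both activations the first printed quantity stays below $h$ and the second below $1$ on this grid, matching the requirements of Definition 2.1.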
Definition 2.2.
For all iterates $t$, define $J^s_t := J(V^{(t)}; x_s, y_s)$ and let $J_t := \frac{1}{n}\sum_{s=1}^{n} J^s_t$. Additionally, for all $t$, define $\nabla J_t := \nabla_V J\big|_{V = V^{(t)}}$.

3 Main results
In this section we present our theorems and discuss their implications.
Given the initial weight matrix $V^{(1)}$, width $p$, depth $L$, and training data $\{(x_s, y_s)\}_{s \in [n]}$, define $h_{\max}$, $\alpha_{\max}$ and $\widetilde{Q}$ below:
$$h_{\max} := \min\left\{ \frac{L^{L-1}\, \log(1/J_1)}{6\sqrt{p}\, \|V^{(1)}\|^{L}},\ 1 \right\}, \qquad (3a)$$
$$\alpha_{\max}(h) := \min\left\{ \frac{h}{832\,(L+1)\sqrt{p}\, J_1\, \|V^{(1)}\|^{L+5}},\ \frac{\|V^{(1)}\|^2}{2L(L+1)\, J_1 \log^{2/L}(1/J_1)} \right\}, \quad \text{and} \qquad (3b)$$
$$\widetilde{Q}(\alpha) := \frac{L(L+1)\, \alpha J_1 \log^{2/L}(1/J_1)}{\|V^{(1)}\|^2}. \qquad (3c)$$

Theorem 3.1.
For any $L \ge 1$, for all $n \ge 3$, for all $p \ge 1$, for any initial parameters $V^{(1)}$ and dataset $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{S}^{p-1} \times \{-1, 1\}$, for any $h$-smoothly approximately ReLU activation function with $h < h_{\max}$, any positive $\alpha \le \alpha_{\max}(h)$ and positive $Q \le \widetilde{Q}(\alpha)$, the following holds. If each step-size $\alpha_t = \alpha$, and if $J_1 < 1/n^{24L}$, then, for all $t \ge 1$,
$$J_t \le \frac{J_1}{Q \cdot (t-1) + 1}.$$

We reiterate that this theorem makes no assumption on the width $p$ of the network and makes a very mild assumption on the number of samples required ($n \ge 3$). The only assumption is that the initial loss is less than $1/n^{24L}$. We pick the step-size to be a constant, which leads to a rate that scales with $1/t$.
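As an informal illustration of the setting of Theorem 3.1, the sketch below (ours, not the authors' code) trains a small fixed-width network with Huberized ReLU activations by full-batch gradient descent with a constant step size on the logistic loss, computing the gradient of $f_V$ by backpropagation, which is algebraically the product form in (2a)-(2b). The data, width, depth, initialization scale, and step size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, L, n, h, alpha, steps = 20, 3, 10, 0.1, 0.05, 2000

def phi(z):   # Huberized ReLU, Equation (1)
    return np.where(z < 0, 0.0, np.where(z <= h, z**2 / (2 * h), z - h / 2))

def dphi(z):  # its derivative
    return np.where(z < 0, 0.0, np.where(z <= h, z / h, 1.0))

# unit-norm inputs and +/-1 labels
X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

# V_1,...,V_L are p x p, V_{L+1} is 1 x p
V = [rng.standard_normal((p, p)) / np.sqrt(p) for _ in range(L)] + [rng.standard_normal((1, p))]

def forward(x):
    us, xs = [], [x]
    for l in range(L):
        u = V[l] @ xs[-1]; us.append(u); xs.append(phi(u))
    return float(V[L] @ xs[-1]), us, xs

def grad_f(us, xs):
    # backpropagation; matches the product form in (2a)-(2b)
    grads = [None] * (L + 1)
    grads[L] = xs[L].reshape(1, -1)            # (2b): d f / d V_{L+1} = x_L^T
    back = V[L].reshape(-1)                    # V_{L+1}^T
    for l in range(L - 1, -1, -1):
        d_u = dphi(us[l]) * back               # Sigma_l times the backward vector
        grads[l] = np.outer(d_u, xs[l])        # (2a): column vector times x_{l-1}^T
        back = V[l].T @ d_u
    return grads

for t in range(steps):
    loss, gV = 0.0, [np.zeros_like(M) for M in V]
    for s in range(n):
        f, us, xs = forward(X[s])
        loss += np.log1p(np.exp(-y[s] * f)) / n
        w = -y[s] / (1.0 + np.exp(y[s] * f)) / n     # logistic-loss weight on example s
        for l, g in enumerate(grad_f(us, xs)):
            gV[l] += w * g
    if t % 200 == 0:
        print(f"t={t}  J_t={loss:.6f}")              # monitor the training loss J_t
    V = [M - alpha * G for M, G in zip(V, gV)]
```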
Next we provide an example where we show that it is possible to arrive at a small-loss solution using gradient descent starting from randomly initialized weight matrices. In this subsection, assume that the entries of the initial weight matrices for the layers $\ell \in \{1, \ldots, L\}$ are drawn independently from $N(0, 1/p)$, and the entries of $V^{(1)}_{L+1}$ are drawn independently from $N(0, 1)$. In this section we also specialize to the case where the activation function is the Huberized ReLU (see its definition in Equation (1)). We make the following assumption on the training data.

Assumption 3.2.
With probability $1 - \delta$ over the random initialization, there exists a collection of matrices $W^\star = (W^\star_1, \ldots, W^\star_{L+1})$ with $\|W^\star\| = 1$, such that for all samples $s \in [n]$,
$$y_s \big( \nabla f_{V^{(1)}}(x_s) \cdot W^\star \big) \ge \sqrt{p}\, \gamma,$$
for some $\gamma > 0$.

The factor $\sqrt{p}$ on the right-hand side balances the scale of the norm of the gradient at initialization, which itself scales with $\sqrt{p}$; this is because the entries of the final layer $V^{(1)}_{L+1}$ are drawn independently from $N(0, 1)$. This assumption is inspired by Assumption 4.1 made by Chen et al. [Che+21]. It can be seen to be implied by stronger conditions that simply require that the training examples are not too close, as employed in [ALS19; Zou+20]. Here is some rough intuition why. The components of $\nabla f_{V^{(1)}}(x_s)$ include the values computed at the last hidden layer when $x_s$ is processed using $V^{(1)}$ (that is, $\nabla_{V_{L+1}} f_{V^{(1)}}(x_s) = x^{V^{(1)}}_{L,s}$). For wide networks with Huberized ReLU activations, if the values of $x_s$ in the training examples do not have duplicates, their embeddings into the last hidden layer of nodes are in general position with high probability. In fact, the Gaussian process analysis of infinitely wide deep networks at initialization [Mat+18a; Mat+18b] suggests that, for wide networks, the embeddings will not even be close to failing to be in general position. If the width $p \gg n$, results from [Cov65] show that they will be linearly separable. The anti-concentration conferred by the Gaussian initialization promotes larger (though not necessarily constant) margins. Assumption 3.2 is more refined than a separation condition, since it captures a sense in which the data is amenable to treatment with neural networks, and this enables us to provide stronger guarantees in such cases. Furthermore, in Appendix C we show that Assumption 3.2 is satisfied with a constant margin $\gamma$ by two-layer networks with Huberized ReLUs for data satisfying a clustering condition. Finally, we note that we could also use other assumptions on the data that have been studied in the literature [for example, by JT19b] to guarantee that the loss drops below $1/n^{24L}$, as required to invoke Theorem 3.1. However, we provide guarantees only under this assumption in the interest of simplicity.

Define
$$\rho := \frac{c}{\sqrt{p}\, \gamma} \left[ \sqrt{\log\left(\frac{n}{\delta}\right)} + \log\left(n^{(2+24L)}\right) \right], \qquad (4)$$
where $c \ge 1$ is a large enough absolute constant. Also set the value of
$$h = h_{NT} := \left( \frac{(1 + 24L) \log(n)}{\sqrt{p}} \right)^{\frac{L+1}{2L}}. \qquad (5)$$
With these choices of $\rho$ and $h$ we are now ready to state our convergence result under Assumption 3.2. The proof of this theorem is presented in Appendix D.

Theorem 3.3.
Consider a network with Huberized ReLU activations. There exists $r(n, L, \delta) = \mathrm{poly}\big(L, \log(n/\delta)\big)$ such that for any $L \ge 1$, $n \ge 3$, $\delta > 0$, under Assumption 3.2 with $\gamma \in (0, 1]$, if $h = h_{NT}$ and $p \ge \frac{r(n, L, \delta)}{\gamma^2}$, then both of the following hold with probability at least $1 - \delta$ over the random initialization:

1. For all $t \in [T]$ set the step-size $\alpha_t = \alpha_{NT} = \Theta\left(\frac{1}{pL}\right)$, where $T = \left\lceil \frac{(L+1)\rho^2 n^{24L}}{\alpha_{NT}} \right\rceil$; then
$$\min_{t \in [T]} J_t < \frac{1}{n^{24L}}.$$

2. Set $V^{(T+1)} = V^{(s)}$, where $s \in \arg\min_{s \in [T]} J(V^{(s)})$, and for all $t \ge T+1$ set the step-size $\alpha_t = \alpha_{\max}(h)$; then for all $t \ge T+1$,
$$J_t \le O\left( \frac{L^{\frac{L+11}{2}}\, (6p)^{L+5}}{n^{24L}\, (t - T - 1)} \right).$$
We invite the reader to interpret the result of this theorem in two scenarios. The first is where the depth $L$ is a constant and the margin $\gamma \ge \big( p^{\omega}\, \mathrm{poly}(n, \log(1/\delta)) \big)^{-1}$ for some constant $\omega \in [0, 1/2)$. In this case it suffices if $p = \mathrm{poly}(n, \log(1/\delta))$ for a large enough polynomial. Under this choice of the width $p$, the rate of convergence in the second stage is
$$J_t \le O\left( \frac{L^{\frac{L+11}{2}}\, (6p)^{L+5}}{n^{24L}\, (t - T - 1)} \right) \le \frac{\mathrm{poly}(n, \log(1/\delta))}{t}.$$
Another scenario is where the margin $\gamma$ is at least a constant. Here it suffices for the width $p \ge \mathrm{poly}\big(L, \log(n/\delta)\big)$. Thus, if the number of samples $n \ge \big[ L^{\frac{L+11}{2}} (6p)^{L+5} \big]^{\frac{1}{24L}}$, then the rate of convergence in this second stage is
$$J_t \le O\left( \frac{L^{\frac{L+11}{2}}\, (6p)^{L+5}}{n^{24L}\, (t - T - 1)} \right) = O\left( \frac{1}{t - T - 1} \right).$$

4 Proof of Theorem 3.1

In this section, we prove Theorem 3.1.
In this subsection we assemble several technical tools required to prove Theorem 3.1. Their proofs (which in turn depend on additional, more basic, lemmas) can be found in Appendix B. We start with the following lemma, which is a slight variant of a standard inequality, and provides a bound on the loss after a step of gradient descent when the loss function is locally smooth.
Lemma 4.1.
For $\alpha > 0$, let $V^{(t+1)} = V^{(t)} - \alpha \nabla J_t$. If, for all convex combinations $W$ of $V^{(t)}$ and $V^{(t+1)}$, $\|\nabla^2_W J\|_{op} \le M$, then if $\alpha \le \frac{2}{(L+1)M}$, we have
$$J_{t+1} \le J_t - \frac{\alpha L \|\nabla J_t\|^2}{L+1}.$$

To apply Lemma 4.1 we need to show that the loss $J$ is smooth in a neighborhood of $V^{(t)}$. The following lemma controls the Hessian of the loss in terms of the current loss.

Lemma 4.2. If $h \le 1$, for any weight matrix $V$ such that $J(V) < 1/n^{24L}$ we have
$$\|\nabla^2_V J(V)\|_{op} \le \frac{(L+1)\sqrt{p}\, \|V\|^{L+5}\, J(V)}{h}.$$

Next, we show that $J$ changes slowly in general, and especially slowly when it is small.

Lemma 4.3.
For any weight matrix $V$, if $\|V\| > 1$ then
$$\|\nabla_V J(V)\| \le \sqrt{(L+1)p}\; \|V\|^{L+1} \min\{J(V), 1\}.$$

The following lemma applies Lemma 4.1 (along with Lemma 4.2) to show that if the step-size at step $t$ is small enough then the loss decreases by an amount that is proportional to the squared norm of the gradient.

Lemma 4.4. If $h \le 1$, $J_t < 1/n^{24L}$, and
$$\alpha J_t \le \frac{h}{832\, (L+1)\sqrt{p}\, \|V^{(t)}\|^{L+5}},$$
then
$$J_{t+1} \le J_t - \frac{\alpha L \|\nabla J_t\|^2}{L+1}.$$

The next lemma establishes a lower bound on the norm of the gradient at any iteration in terms of the loss $J_t$ and the norm of the weight matrix $V^{(t)}$.

Lemma 4.5.
For all $L \in \mathbb{N}$, if $h \le h_{\max}$, $J_t < 1/n^{24L}$, and $\|V^{(t)}\|^L \le \frac{\log(1/J_t)\, \|V^{(1)}\|^L}{\log(1/J_1)}$, then
$$\|\nabla J_t\| \ge \frac{(L+1)\, J_t \log(1/J_t)}{\|V^{(t)}\|}. \qquad (6)$$

The lower bound on the gradient is proved by showing that the alignment between the negative gradient $-\nabla J_t$ and $V^{(t)}$ is large when the loss is small. The proof proceeds by showing that when $h$ is sufficiently small and the norm of $V^{(t)}$ is not too large, then the inner product between $-\nabla J_t$ and $V^{(t)}$ can be lower bounded by a function of the loss $J_t$.

As stated above, the proof goes through for any positive $h \le h_{\max}$, step-size $\alpha \le \alpha_{\max}(h)$ and any $Q \le \widetilde{Q}(\alpha)$ (recall the definitions of $h_{\max}$, $\alpha_{\max}$ and $\widetilde{Q}$ in Equations (3a)-(3c)). We will use the following multi-part inductive hypothesis:

(I1) $J_t \le \frac{J_1}{Q \cdot (t-1) + 1}$;
(I2) $\frac{\log(1/J_t)}{\|V^{(t)}\|^L} \ge \frac{\log(1/J_1)}{\|V^{(1)}\|^L}$;
(I3) $\alpha J_t \le \frac{h}{832\,(L+1)\sqrt{p}\, \|V^{(t)}\|^{L+5}}$.

The first part of the inductive hypothesis will be used to ensure that the loss decreases at the prescribed rate, the second part helps establish a lower bound on the norm of the gradient in light of Lemma 4.5, and the third part will ensure that the step-size is small enough to apply Lemma 4.4 and also allows us to make several useful approximations in our proofs.

The base case is trivially true for the first and second parts of the inductive hypothesis. It is true for the third part since the step-size $\alpha \le \alpha_{\max}(h) \le \frac{h}{832\,(L+1)\sqrt{p}\, J_1\, \|V^{(1)}\|^{L+5}}$. Now let us assume that the inductive hypothesis holds for a step $t \ge 1$ and prove that it holds for the next step $t+1$. We start with Part I1.

Lemma 4.6.
If the inductive hypothesis holds at step $t$, then
$$J_{t+1} \le \frac{J_1}{Qt + 1}.$$

Proof. Since $\alpha J_t \le \frac{h}{832(L+1)\sqrt{p}\,\|V^{(t)}\|^{L+5}}$ and $J_t \le J_1 < 1/n^{24L}$, by invoking Lemma 4.4,
$$J_{t+1} \le J_t - \frac{L\alpha}{L+1}\,\|\nabla J_t\|^2.$$
Additionally, since $h \le h_{\max}$ and, by Part (I2) of the inductive hypothesis, $\|V^{(t)}\|^L \le \frac{\log(1/J_t)\,\|V^{(1)}\|^L}{\log(1/J_1)}$, we use the lower bound on the norm of the gradient established in Lemma 4.5 to get
$$J_{t+1} \le J_t - \frac{L(L+1)\,\alpha J_t^2 \log^2(1/J_t)}{\|V^{(t)}\|^2}
\overset{(i)}{\le} J_t\left(1 - \frac{L(L+1)\,\alpha J_t \log^{2-2/L}(1/J_t)\,\log^{2/L}(1/J_1)}{\|V^{(1)}\|^2}\right)
\overset{(ii)}{\le} J_t\left(1 - \frac{L(L+1)\,\alpha J_t \log^{2/L}(1/J_1)}{\|V^{(1)}\|^2}\right), \qquad (7)$$
where $(i)$ follows by Part (I2) of the inductive hypothesis, and $(ii)$ follows since $L \ge 1$ and $J_t \le J_1 < 1/n^{24L}$, therefore $\log^{2-2/L}(1/J_t) \ge 1$.

For any $z \ge 0$, the quadratic function $z - z^2 \cdot \frac{L(L+1)\alpha \log^{2/L}(1/J_1)}{\|V^{(1)}\|^2}$ is monotonically increasing on the interval $\left[0,\ \frac{\|V^{(1)}\|^2}{2L(L+1)\alpha \log^{2/L}(1/J_1)}\right]$. Thus, because $J_t \le \frac{J_1}{Q(t-1)+1}$, if $\frac{J_1}{Q(t-1)+1} \le \frac{\|V^{(1)}\|^2}{2L(L+1)\alpha \log^{2/L}(1/J_1)}$, the right-hand side of (7) is bounded above by its value when $J_t = \frac{J_1}{Q(t-1)+1}$. But this is easy to check: by our choice of step-size $\alpha$ we have
$$\alpha \le \alpha_{\max} \le \frac{\|V^{(1)}\|^2}{2L(L+1)J_1 \log^{2/L}(1/J_1)}
\;\Rightarrow\; J_1 \le \frac{\|V^{(1)}\|^2}{2L(L+1)\alpha \log^{2/L}(1/J_1)}
\;\Rightarrow\; \frac{J_1}{Q(t-1)+1} \le \frac{\|V^{(1)}\|^2}{2L(L+1)\alpha \log^{2/L}(1/J_1)}.$$
Plugging $J_t = \frac{J_1}{Q(t-1)+1}$ into (7), we get
$$J_{t+1} \le \frac{J_1}{Q(t-1)+1} - \left(\frac{J_1}{Q(t-1)+1}\right)^2 \frac{L(L+1)\alpha \log^{2/L}(1/J_1)}{\|V^{(1)}\|^2}
= \frac{J_1}{Qt+1}\left(\frac{Qt+1}{Q(t-1)+1} - \frac{(Qt+1)\,Q}{(Q(t-1)+1)^2}\cdot\frac{L(L+1)\alpha J_1 \log^{2/L}(1/J_1)}{Q\,\|V^{(1)}\|^2}\right)$$
$$= \frac{J_1}{Qt+1}\left(1 + \frac{Q}{Q(t-1)+1} - \frac{(Qt+1)\,Q}{(Q(t-1)+1)^2}\cdot\frac{L(L+1)\alpha J_1 \log^{2/L}(1/J_1)}{Q\,\|V^{(1)}\|^2}\right)
\le \frac{J_1}{Qt+1}\left(1 + \frac{Q}{Q(t-1)+1} - \frac{Q}{Q(t-1)+1}\right) \le \frac{J_1}{Qt+1},$$
where the last line uses $\frac{Qt+1}{Q(t-1)+1} \ge 1$ together with the fact that $Q \le \widetilde{Q}(\alpha) = \frac{L(L+1)\alpha J_1 \log^{2/L}(1/J_1)}{\|V^{(1)}\|^2}$. This establishes the desired upper bound on the loss at step $t+1$.

In the next lemma we shall establish that the second part of the inductive hypothesis holds.

Lemma 4.7.
Under the setting of Theorem 3.1, if the induction hypothesis holds at step $t$, then
$$\frac{\log(1/J_{t+1})}{\|V^{(t+1)}\|^L} \ge \frac{\log(1/J_1)}{\|V^{(1)}\|^L}.$$

Proof
We know from Lemma 4.4 that J t +1 ≤ J t − Lα k∇ J t k ( L + ) J t ! , and by the triangle inequality k V ( t +1) k ≤ k V ( t ) k + α k∇ J t k , hence log (cid:16) J t +1 (cid:17) k V ( t +1) k L ≥ log J t (cid:18) − Lα ( L + 12 ) Jt k∇ J t k (cid:19) (cid:0) k V ( t ) k + α k∇ J t k (cid:1) L = log (cid:16) J t (cid:17) + log (cid:18) − Lα ( L + 12 ) Jt k∇ J t k (cid:19) (cid:0) k V ( t ) k + α k∇ J t k (cid:1) L = log (cid:16) J t (cid:17) − log (cid:18) − Lα ( L + 12 ) Jt k∇ J t k (cid:19) log (cid:16) Jt (cid:17) k V ( t ) k L (cid:16) α k∇ J t kk V ( t ) k (cid:17) L ( i ) ≥ log (cid:16) J t (cid:17) k V ( t ) k L Lα k∇ J t k ( L + ) J t log (cid:16) Jt (cid:17) !(cid:16) α k∇ J t kk V ( t ) k (cid:17) L (8)10here ( i ) follows since log(1 − z ) ≤ − z for all z ∈ (0 , and because Lα ( L + ) J t k∇ J t k ≤ Lα ( L + ) h ( L + 1) pJ t k V ( t ) k L +1) i (by Lemma 4.3) = αJ t " L ( L + 1) p k V ( t ) k L +1) L + < αJ t "
832 ( L + 1) p k V ( t ) k L +5 h ( k V ( t ) k > by Lemma B.5, and h ≤ ) ≤ (by Part I3 of the IH) . We want the term in curly brackets in Inequality (8) to be at least 1, that is, Lα k∇ J t k ( L + ) J t log (cid:16) J t (cid:17) ≥ (cid:18) α k∇ J t kk V ( t ) k (cid:19) L . (9)To show that this inequality holds, first note that α k∇ J t kk V ( t ) k ≤ αJ t p ( L + 1) p k V ( t ) k L (by Lemma 4.3) ≤ L · αJ t "
832 ( L + 1) p k V ( t ) k L +5 h (since k V ( t ) k > by Lemma B.5) ≤ L (by Part I3 of the IH) . For any positive z ≤ L we have the inequality that (1 + z ) L ≤ Lz , therefore to show thatInequality (9) holds it instead suffices to show that Lα k∇ J t k ( L + ) J t log (cid:16) J t (cid:17) ≥ Lα k∇ J t kk V ( t ) k⇐ k∇ J t k ≥ ( L + ) J t log (cid:16) J t (cid:17) k V ( t ) k , which follows from Lemma 4.5 that guarantees k∇ J t k ≥ ( L + ) J t log(1 /J t ) k V ( t ) k . Thus we have provedthat the term in the curly brackets in Inequality (8) is at least and hence log (cid:16) J t +1 (cid:17) k V ( t +1) k L ≥ log (cid:16) J t (cid:17) k V ( t ) k L ≥ log (cid:16) J (cid:17) k V (1) k L . This proves that the ratio is lower bounded at step t + 1 by its initial value and establishesour claim.Finally we ensure that the third part of the inductive hypothesis holds. This allows us to applyLemma 4.4 in the next step t + 1 . Lemma 4.8.
Under the setting of Theorem 3.1, if the induction hypothesis holds at step $t$, then
$$\alpha J_{t+1} \le \frac{h}{832\,(L+1)\sqrt{p}\, \|V^{(t+1)}\|^{L+5}}.$$

Proof. We want to show that $\alpha J_{t+1} \|V^{(t+1)}\|^{L+5} \le \frac{h}{832(L+1)\sqrt{p}}$, but we know by Lemma 4.7 that $\|V^{(t+1)}\|^L \le \frac{\log(1/J_{t+1})\,\|V^{(1)}\|^L}{\log(1/J_1)}$, so it instead suffices to prove that
$$\alpha J_{t+1} \log^{\frac{L+5}{L}}\!\left(\frac{1}{J_{t+1}}\right) \le \frac{h \log^{\frac{L+5}{L}}(1/J_1)}{832\,(L+1)\sqrt{p}\, \|V^{(1)}\|^{L+5}}. \qquad (10)$$
Lemma 4.6 establishes that $J_{t+1} \le J_1 < 1/n^{24L}$. The function $z \log^{\frac{L+5}{L}}(1/z)$ is increasing over the interval $(0, e^{-\frac{L+5}{L}})$. Recall that $n \ge 3$; therefore $J_{t+1} \le J_1 < 1/3^{24L} < e^{-\frac{L+5}{L}}$. Thus, the left-hand side of (10) is maximized at $J_1$:
$$\alpha J_{t+1} \log^{\frac{L+5}{L}}\!\left(\frac{1}{J_{t+1}}\right) \le \alpha J_1 \log^{\frac{L+5}{L}}\!\left(\frac{1}{J_1}\right) \le \frac{h \log^{\frac{L+5}{L}}(1/J_1)}{832\,(L+1)\sqrt{p}\, \|V^{(1)}\|^{L+5}},$$
where the final inequality holds by the choice of the step-size $\alpha$. This completes the proof.

Combining the results of Lemmas 4.6, 4.7 and 4.8 completes the proof of the theorem.

5 Discussion

We have shown that deep networks with smoothed ReLU activations trained by gradient descent with the logistic loss achieve training loss approaching zero if the loss is initially small enough. We also established conditions under which this happens that formalize the idea that the NTK features are useful. Our analysis applies in the case of networks using the increasingly popular Swish activation function.

While, to simplify our treatment, we concentrated on the case that the number of hidden nodes in each layer is equal to the number of inputs, our analysis should easily be adapted to the case of varying numbers of hidden units.

Analysis of architectures such as Residual Networks and Transformers would be a potentially interesting next step.
Acknowledgements
We gratefully acknowledge the support of the NSF through grants DMS-2031883 and DMS-2023505 and the Simons Foundation through award 814639.

Contents
B.1 Additional definitions
B.2 Basic lemmas
B.3 Proof of Lemma 4.1
B.4 Proof of Lemma 4.2
B.5 Proof of Lemma 4.3
B.6 Proof of Lemma 4.4
B.7 Proof of Lemma 4.5

C An example where the margin in Assumption 3.2 is constant

D Omitted proofs from Section 3.2
D.1 Additional definitions and notation
D.2 Technical tools required for the neural tangent kernel proofs
D.3 Proof of Theorem 3.3

E Proof of Lemma D.7
E.1 Properties at initialization
E.1.1 Proof of Part (a)
E.1.2 Proof of Part (b)
E.1.3 Proof of Part (c)
E.1.4 Proof of Part (d)
E.1.5 Proof of Part (e)
E.1.6 Proof of Part (f)
E.1.7 Proof of Part (g)
E.1.8 Proof of Part (h)
E.1.9 Other useful concentration lemmas
E.2 Useful properties in a neighborhood around the initialization
E.3 The proof

F Probabilistic tools
A Additional related work
Building on the work of Lyu and Li [LL20], Ji and Telgarsky [JT20] study finite-width deep ReLU neural networks and show that, starting from a small loss, gradient flow with the logistic loss leads to convergence of the directions of the parameter vectors. They also demonstrate alignment between the parameter vector directions and the negative gradient. However, they do not prove that the training loss converges to zero.

Using mean-field techniques, Chizat and Bach [CB20], building on [CB18; MMM19], show that infinitely wide two-layer ReLU networks trained with gradient flow on the logistic loss lead to a max-margin classifier in a particular non-Hilbertian space of functions. (See also the video of a talk about this work [Chi20].) Chen et al. [Che+20] analyzed regularized training with gradient flow on infinitely wide networks. When training is regularized, the weights may also travel far from their initial values. Previously, Brutzkus et al. [Bru+18] studied finite-width two-layer leaky ReLU networks and showed that when the data is linearly separable, these networks can be trained up to zero loss using stochastic gradient descent with the hinge loss.

Our study is motivated in part by the line of work that has emerged which emphasizes the need to understand the behavior of interpolating (zero training loss/error) classifiers and regressors. A number of recent papers have analyzed the properties of interpolating methods in linear regression [Has+19; Bar+20; Mut+20b; TB20; BL20], linear classification [Mon+19; CL20; LS20; Mut+20a; HMX20], kernel regression [LR20; MM19; LRZ20] and simplicial nearest neighbor methods [BHM18].

There are also many related papers that characterize the implicit bias of the solution obtained by first-order methods [NTS15; Sou+18; JT19c; Gun+18a; Gun+18b; LMZ18; Aro+19a; JT19a].

Finally, we note that a number of other recent papers also theoretically study the optimization of neural networks, including [And+14; LY17; Zho+17; Zha+17b; GLM18; PSZ18; Du+18; SS18; Zha+19; Aro+19b; BG19; Wei+19; JT19b; NCS19; SY19; ZG19].
B Omitted proofs from Section 4.1
In this section we present the proofs of Lemmas 4.1-4.5.
B.1 Additional definitions
Definition B.1.
For any weight matrix $V$, define
$$g_s(V) := \frac{1}{1 + \exp(y_s f_V(x_s))}.$$
We will often use $g_s$ as shorthand for $g_s(V)$ when $V$ can be determined from context. Further, for all $t \in \{1, 2, \ldots\}$, define $g^t_s := g_s(V^{(t)})$. Informally, $g_s(V)$ is the size of the contribution of example $s$ to the gradient.

Definition B.2. For all iterates $t$, all $\ell \in [L+1]$ and all $s \in [n]$, define $x^{(t)}_{\ell,s} := x^{V^{(t)}}_{\ell,s}$, $u^{(t)}_{\ell,s} := u^{V^{(t)}}_{\ell,s}$ and $\Sigma^{(t)}_{\ell,s} := \Sigma^{V^{(t)}}_{\ell,s}$.

B.2 Basic lemmas
To prove Lemmas 4.1-4.5, we will need some more basic lemmas, which we first prove.
Lemma B.3.
For any $x \in \mathbb{R}^p$ and $y \in \{-1, 1\}$ and any weight matrix $V$ we have the following:

1. $\dfrac{1}{1 + \exp(y f_V(x))} \le \log(1 + \exp(-y f_V(x))) = J(V; x, y)$.

2. $\dfrac{\exp(y f_V(x))}{(1 + \exp(y f_V(x)))^2} \le \dfrac{1}{1 + \exp(y f_V(x))} \le J(V; x, y)$.

Proof
Part 1 follows since for any $z \in \mathbb{R}$, we have the inequality $(1 + \exp(z))^{-1} \le \log(1 + \exp(-z))$. Part 2 follows since for any $z \in \mathbb{R}$, we have the inequality $\exp(z)/(1 + \exp(z))^2 \le (1 + \exp(z))^{-1}$.

The following lemma is useful for establishing a relatively simple lower bound on a sum of applications of a concave function.

Lemma B.4. Suppose $\psi : [0, M] \to \mathbb{R}_+$ is a concave function with $\psi(0) = 0$. Then the minimum of $\sum_{i=1}^n \psi(z_i)$ subject to $z_1, \ldots, z_n \ge 0$ and $\sum_{i=1}^n z_i = M$ is $\psi(M)$.

Proof
Let $z_1, \ldots, z_n$ be any solution, and let $i > 1$ be the least index such that $z_i > 0$. Then, since $\psi$ is concave and non-negative, we have
$$\psi(z_1 + z_i) + \psi(0) = \psi(z_1 + z_i) \le \psi(z_1) + \psi(z_i).$$
Thus, replacing $z_1$ with $z_1 + z_i$, and replacing $z_i$ with $0$, produces a solution with one fewer nonzero entry that is at least as good. Repeating this for each $i > 1$ implies that the solution with $z_1 = M$ and $z_2 = \ldots = z_n = 0$ is optimal.
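As a quick numerical illustration of Lemma B.4 (ours, not part of the original text), the snippet below compares $\sum_i \psi(z_i)$ over random feasible allocations with the concentrated allocation $z_1 = M$, using the concave function $\psi(z) = \log(1+z)$ as an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 5.0, 8
psi = lambda z: np.log1p(z)  # concave on [0, M] with psi(0) = 0

best = np.inf
for _ in range(10000):
    z = rng.dirichlet(np.ones(n)) * M     # random feasible point: z_i >= 0, sum z_i = M
    best = min(best, psi(z).sum())

print("smallest random objective:", best)   # stays above psi(M)
print("concentrated allocation:  ", psi(M)) # the minimum predicted by Lemma B.4
```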
The next lemma shows that large weights are needed to achieve small loss.

Lemma B.5. For any $L \in \mathbb{N}$ and any weight matrix $V$, if $J(V) < 1/n^{24L}$ then:

1. $\|V\| > \sqrt{2}$, and

2. $\max_{k \in [L]} \prod_{j=k+1}^{L+1} \|V_j\|_{op} \le \left(\frac{\|V\|}{\sqrt{L}}\right)^{L} \le \|V\|^{L}$.

Proof. Proof of Part 1:
Since φ is -Lipschitz and φ (0) = 0 , for all z , | φ ( z ) | ≤ | z | , and thus,given any sample s , J ( V ; x s , y s ) = log (1 + exp ( − y s V L +1 φ ( V L · · · φ ( V x )))) ≥ log − L +1 Y j =1 k V j k op k x s k ≥ log − L +1 Y j =1 k V j k op (since k x s k = 1 ) ≥ log − L +1 Y j =1 k V j k . By the AM-GM inequality L +1 Y j =1 k V j k L +1 ≤ P L +1 j =1 k V j k L + 1 ≤ k V k L + 1 . Therefore J ( V ; x s , y s ) ≥ log − (cid:18) k V k√ L + 1 (cid:19) L +1 !! . Now we know that n L > J ( V ) = 1 n X s ∈ [ n ] J ( V ; x s , y s ) ≥ log − (cid:18) k V k√ L + 1 (cid:19) L +1 !! . Solving for k V k leads to the implication √ L + 1 log L +1 (cid:0) n L (cid:1) − ! < k V k . Since for any z ∈ [0 , , exp( z ) ≤ z and n ≥ , hence k V k > √ L + 1 log L +1 (cid:18) n L (cid:19) ≥ √ L + 1 log L +1 (cid:18) L (cid:19) > √ L + 1 log L +1 (cid:0) L (cid:1) = √ L + 1(24 L ) L +1 log L +1 (3) ≥ √ . This proves Part 1 of the lemma.
Proof of Part 2:
Let η = k V k . Then for any k ∈ [ L ] , L +1 Y j = k +1 k V j k
16s maximized subject to P L +1 j = k +1 k V j k ≤ η when every k V j k = η / ( L − k ) , this follows bythe AM-GM inequality. At the maximum it takes the value max k ∈ [ L ] L +1 Y j = k +1 k V j k op ≤ max k ∈ [ L ] (cid:18) η √ L − k (cid:19) L − k ≤ (cid:18) η √ L (cid:19) L where the second inequality above holds since η = k V k ≥ by Part 1 of this lemma.The next lemma bounds the product of the operator norms of matrices in terms of a “collectiveFrobenius norm”. Lemma B.6.
For matrices A , . . . , A L +1 and M , . . . , M L +1 , let A = ( A , . . . , A L +1 ) . If k A k > and for all i ∈ [ L + 1] , k M i k op ≤ . Then, for any nonempty I ⊆ [ L + 1] Y i ∈I k A i k op k M i k op ≤ k A k L +1 ( L + 1) L +12 ≤ k A k L +1 . Proof
We know that for all i ∈ [ L + 1] , k A i k op ≤ k A i k , therefore, by the AM-GM inequality Y i ∈I (cid:0) k A i k op k M i k op (cid:1) ≤ Y i ∈I k A i k op ≤ Y i ∈I k A i k ≤ (cid:18) P i ∈I k A i k |I| (cid:19) |I| ≤ (cid:18) k A k |I| (cid:19) |I| ≤ k A k L +1) ( L + 1) L +1 , where the last inequality follows by our assumption that k A k > . Taking square roots com-pletes the proof.The next lemma bounds bounds how much perturbing the factors changes a product of ma-trices. Lemma B.7.
Let A , . . . , A L +1 , B , . . . , B L +1 , M , . . . , M L +1 and N , . . . , N L +1 be matrices,and let A = ( A , . . . , A L +1 ) and B = ( B , . . . , B L +1 ) . Assume • k A k > and k B k > , • for all i ∈ [ L + 1] , k M i k op ≤ and k N i k op ≤ and • for all i ∈ [ L + 1] , k M i − N i k op ≤ κ ,then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L +1 Y i =1 ( A i M i ) − L +1 Y i =1 ( B i N i ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ ( k A k + k A − B k ) L +1 ( κ k A k + k A − B k ) . roof By the triangle inequality (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L +1 Y i =1 ( A i M i ) − L +1 Y i =1 ( B i N i ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L +1 X j =1 j Y i =1 A i M i ! L +1 Y i = j +1 B i N i − j − Y i =1 A i M i ! L +1 Y i = j B i N i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ L +1 X j =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) j Y i =1 A i M i ! L +1 Y i = j +1 B i N i − j − Y i =1 A i M i ! L +1 Y i = j B i N i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = L +1 X j =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( A j M j − B j N j ) j − Y i =1 A i M i ! L +1 Y i = j +1 B i N i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ L +1 X j =1 k A j M j − B j N j k op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) j − Y i =1 A i M i ! L +1 Y i = j +1 B i N i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op . (11)For some j , consider T ( j ) := ( A , . . . , A j − , B j +1 , B L +1 ) . By the triangle inequality, k T ( j ) k ≤ k A k + k A − B k . Thus, Lemma B.6 implies (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) j − Y i =1 A i M i ! L +1 Y i = j +1 B i N i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ j − Y i =1 k A i M i k op ! L +1 Y i = j +1 k B i N i k op ≤ j − Y i =1 k A i k op ! L +1 Y i = j +1 k B i k op ≤ ( k A k + k A − B k ) L +1 ( L + 1) L +12 . (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L +1 Y i =1 A i M i − L +1 Y i =1 B i N i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ ( k A k + k A − B k ) L +1 ( L + 1) L +12 L +1 X j =1 k A j M j − B j N j k op = ( k A k + k A − B k ) L +1 ( L + 1) L +12 L +1 X j =1 k A j M j − A j N j + A j N j − B j N j k op ≤ ( k A k + k A − B k ) L +1 ( L + 1) L +12 L +1 X j =1 (cid:16) k A j ( M j − N j ) k op + k A j − B j k op (cid:17) ≤ ( k A k + k A − B k ) L +1 ( L + 1) L +12 L +1 X j =1 k A j k op k M j − N j k op + L +1 X j =1 k A j − B j k op ≤ ( k A k + k A − B k ) L +1 ( L + 1) L +12 κ L +1 X j =1 k A j k + L +1 X j =1 k A j − B j k ≤ ( k A k + k A − B k ) L +1 ( L + 1) L +12 h √ L + 1 ( κ k A k + k A − B k ) i = ( k A k + k A − B k ) L +1 ( L + 1) L ( κ k A k + k A − B k ) ≤ ( k A k + k A − B k ) L +1 ( κ k A k + k A − B k ) , which completes the proof.The next lemma shows that h -smoothly approximately ReLU activations are contractivemaps. Lemma B.8.
Given an $h$-smoothly approximately ReLU activation $\phi$, for any $v_1, v_2 \in \mathbb{R}^p$ we have $\|\phi(v_1) - \phi(v_2)\| \le \|v_1 - v_2\|$. That is, $\phi$ is a contractive map with respect to the Euclidean norm.

Proof
Let $(v)_j$ denote the $j$th coordinate of a vector $v$. For each $j \in [p]$, by Taylor's theorem, for some $\tilde{v}_j \in [(v_1)_j, (v_2)_j]$,
$$(\phi(v_1) - \phi(v_2))_j = \phi'(\tilde{v}_j)\,(v_1 - v_2)_j.$$
Thus,
$$\|\phi(v_1) - \phi(v_2)\|^2 = \sum_{j \in [p]} \big(\phi'(\tilde{v}_j)(v_1 - v_2)_j\big)^2 = \sum_{j \in [p]} \big(\phi'(\tilde{v}_j)\big)^2 (v_1 - v_2)_j^2 \overset{(i)}{\le} \sum_{j \in [p]} (v_1 - v_2)_j^2 = \|v_1 - v_2\|^2,$$
where $(i)$ follows because $|\phi'(z)| \le 1$ for all $z \in \mathbb{R}$ for $h$-smoothly approximately ReLU activations. Taking square roots completes the proof.
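A quick numerical spot-check of this contraction property (ours, not part of the paper), using the Huberized ReLU from Equation (1) with an arbitrary choice of $h$ and random pairs of vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
h, p = 0.1, 50

def phi(z):  # Huberized ReLU, Equation (1)
    return np.where(z < 0, 0.0, np.where(z <= h, z**2 / (2 * h), z - h / 2))

worst = 0.0
for _ in range(1000):
    v1, v2 = rng.standard_normal(p), rng.standard_normal(p)
    ratio = np.linalg.norm(phi(v1) - phi(v2)) / np.linalg.norm(v1 - v2)
    worst = max(worst, ratio)

print("largest observed ratio ||phi(v1)-phi(v2)|| / ||v1-v2|| =", worst)  # stays <= 1
```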
B.3 Proof of Lemma 4.1

Lemma 4.1. For $\alpha > 0$, let $V^{(t+1)} = V^{(t)} - \alpha \nabla J_t$. If, for all convex combinations $W$ of $V^{(t)}$ and $V^{(t+1)}$, $\|\nabla^2_W J\|_{op} \le M$, then if $\alpha \le \frac{2}{(L+1)M}$, we have
$$J_{t+1} \le J_t - \frac{\alpha L \|\nabla J_t\|^2}{L+1}.$$

Proof
Along the line segment joining $V^{(t)}$ to $V^{(t+1)}$, the function $J(\cdot)$ is $M$-smooth; therefore,
$$J_{t+1} \le J_t + \nabla J_t \cdot (V^{(t+1)} - V^{(t)}) + \frac{M}{2}\|V^{(t+1)} - V^{(t)}\|^2 = J_t - \alpha \|\nabla J_t\|^2 + \frac{\alpha^2 M}{2}\|\nabla J_t\|^2 = J_t - \alpha\left(1 - \frac{\alpha M}{2}\right)\|\nabla J_t\|^2 \le J_t - \frac{L}{L+1}\,\alpha\|\nabla J_t\|^2.$$
This completes the proof.
B.4 Proof of Lemma 4.2
The proof of Lemma 4.2 is built up in stages, through a series of lemmas. The first lemma bounds the norm of the difference between the pre-activation ($u^V_{j,s}$) and post-activation features ($x^V_{j,s}$) at any layer $j$, when the weight matrix of a single layer is swapped. It also provides a bound on the norm of the pre-activation and post-activation features at any layer in terms of the norm of the weight matrix.

Lemma B.9.
Consider V = ( V , . . . , V L +1 ) and W = ( W , . . . , W L +1 ) , and ℓ ∈ [ L + 1] .Suppose that V j = W j for all j = ℓ , and k V k , k W k > . Then, for all examples s and all layers j , 1. k u Vj,s k ≤ k V k L +1 ;2. k x Vj,s k ≤ k V k L +1 ;3. k u Vj,s − u Wj,s k ≤ k V ℓ − W ℓ k op k V k L +1 ; and4. k x Vj,s − x Wj,s k ≤ k V ℓ − W ℓ k op k V k L +1 . Proof
Proof of Parts 1 and 2:
For any sample s and layer j we have k u Vj,s k = k V j φ ( u Vj − ,s ) k ≤ k V j k op k φ ( u Vj − ,s ) k ( i ) ≤ k V j k op k u Vj − ,s k ≤ j Y k =1 k V k k op k x s k ( ii ) ≤ j Y k =1 k V k k op ( iii ) ≤ k V k L +1 , ( i ) follows since φ is contractive (Lemma B.8), ( ii ) is because k x s k = 1 and ( iii ) is byLemma B.6. This completes the proof of Part 1 of this lemma. Again since φ is contractive, k x Vj,s k = k φ ( u Vj,s ) k ≤ k V k L +1 , which establishes the second part of the lemma. Proof of Parts 3 and 4:
For any j < ℓ , u Vj,s = u Wj,s and x Vj,s = x Wj,s , since V j = W j for all j = ℓ . For j = ℓ we have k u Vℓ,s − u Wℓ,s k = k V ℓ x Vℓ − ,s − W ℓ x Wℓ − ,s k = k ( V ℓ − W ℓ ) x Vℓ − ,s k ≤ k V ℓ − W ℓ k op k x Vℓ − ,s k = k V ℓ − W ℓ k op k φ (cid:0) V ℓ − x Vℓ − ,s (cid:1) k ( i ) ≤ k V ℓ − W ℓ k op Y k<ℓ k V k k op ≤ k V ℓ − W ℓ k op Y k<ℓ k V k k ( ii ) ≤ k V ℓ − W ℓ k op k V k L +1 where ( i ) follows since φ is a contractive map (Lemma B.8) and because k x s k = 1 , and (ii)follows by applying Lemma B.6. Since φ is contractive we also have k x Vℓ,s − x Wℓ,s k ≤ k V ℓ − W ℓ k op k V k L +1 . When j > ℓ , it is possible to establish our claim by mirroring the argument in the j = ℓ casewhich completes the proof of the last two parts.The next lemma upper bounds difference between the Σ Vj,s and Σ Wj,s , when the weight matricesdiffer in a single layer.
Lemma B.10.
Consider V = ( V , . . . , V L ) and W = ( W , . . . , W L ) , and ℓ ∈ [ L ] . Suppose that V j = W j for all j = ℓ , and k V k , k W k > . Then, for all examples s and all layers j , k Σ Vj,s − Σ Wj,s k op ≤ k V ℓ − W ℓ k op k V k L +1 h . Proof
For any j ∈ [ L ] and any s ∈ [ n ] , Σ Vj,s and Σ Wj,s are both diagonal matrices, and hence k Σ Vj,s − Σ Wj,s k op = k φ ′ ( u Vj,s ) − φ ′ ( u Wj,s ) k ∞ ≤ k u Vj − u Wj k ∞ h (since φ ′ is (1 /h ) -Lipschitz) ≤ k V ℓ − W ℓ k op k V k L +1 h , by Lemma B.9.The following lemma bounds the difference between g s ( V ) and g s ( W ) for any sample s whenthe weight matrices V and W differ in a single layer. Lemma B.11.
Consider V = ( V , . . . , V L +1 ) and W = ( W , . . . , W L +1 ) , and ℓ ∈ [ L + 1] .Suppose that V j = W j for all j = ℓ , with k V ℓ − W ℓ k op k V k L +1 ≤ , and k V k , k W k > .Also suppose that for all examples s , for all convex combinations f W of V and W , we have if J s ( f W ) ≤ J s ( V ) , then, | g s ( V ) − g s ( W ) | ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1 . roof By Taylor’s theorem applied to the function / (1 + exp( z )) we can bound | g s ( V ) − g s ( W ) | = (cid:12)(cid:12)(cid:12)(cid:12)
11 + exp ( y s f V ( x s )) −
11 + exp ( y s f W ( x s )) (cid:12)(cid:12)(cid:12)(cid:12) ≤ exp ( y s f V ( x s ))(1 + exp ( y s f V ( x s ))) | y s f W ( x s ) − y s f V ( x s ) | | {z } =:Ξ + ( y s f W ( x s ) − y s f V ( x s )) f W ∈ [ V,W ] (cid:12)(cid:12)(cid:12)(cid:12) y s f f W ( x s ))(exp( y s f f W ( x s )) + 1) − exp( y s f f W ( x s ))(exp( y s f f W ( x s )) + 1) (cid:12)(cid:12)(cid:12)(cid:12)| {z } =:Ξ . (12)The first term Ξ can be bounded as Ξ = (cid:12)(cid:12)(cid:12)(cid:12) exp ( y s f V ( x s ))(1 + exp ( y s f V ( x s ))) ( y s f W ( x s ) − y s f V ( x s )) (cid:12)(cid:12)(cid:12)(cid:12) = 1(1 + exp ( y s f V ( x s ))) exp ( y s f V ( x s ))1 + exp ( y s f V ( x s )) | y s f W ( x s ) − y s f V ( x s ) |≤ y s f V ( x s ))) | f W ( x s ) − f V ( x s ) | = g s ( V ) (cid:13)(cid:13) u WL +1 ,s − u VL +1 ,s (cid:13)(cid:13) ( i ) ≤ g s ( V ) k V ℓ − W ℓ k op k V k L +1 ( ii ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1 , where ( i ) follows by applying Lemma B.9 and ( ii ) follows since g s ( V ) ≤ J s ( V ) by Lemma B.3.The second term Ξ Ξ = ( y s f W ( x s ) − y s f V ( x s )) f W ∈ [ V,W ] (cid:12)(cid:12)(cid:12)(cid:12) y s f f W ( x s ))(exp( y s f f W ( x s )) + 1) − exp( y s f f W ( x s ))(exp( y s f f W ( x s )) + 1) (cid:12)(cid:12)(cid:12)(cid:12) ( i ) ≤ ( f W ( x s ) − f V ( x s )) f W ∈ [ V,W ] log(1 + exp( − y s f f W ( x s )))= ( f W ( x s ) − f V ( x s )) f W ∈ [ V,W ] J s ( f W ) ( ii ) ≤ J s ( V ) ( f W ( x s ) − f V ( x s )) = J s ( V ) (cid:0) u VL +1 ,s − u WL +1 ,s (cid:1) iii ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) ( iv ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1 , where ( i ) follows since for every z ∈ R (cid:12)(cid:12)(cid:12)(cid:12) z )(exp( z ) + 1) − exp( z )(exp( z ) + 1) (cid:12)(cid:12)(cid:12)(cid:12) ≤ log(1 + exp( − z )) , ( ii ) is by our assumption that for any f W ∈ [ V, W ] , J s ( f W ) ≤ J s ( V ) , ( iii ) follows by invokingLemma B.9 and finally ( iv ) is by the assumption that k V ℓ − W ℓ k op k V k L +1 ≤ . By using ourbounds on Ξ and Ξ in conjunction with Inequality (12) we obtain the bound | g s ( V ) − g s ( W ) | ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1 V and W , when these weight matrices differ in a single layer. Lemma B.12.
Let h ≤ , and consider V = ( V , . . . , V L +1 ) and W = ( W , . . . , W L +1 ) , and ℓ ∈ [ L ] . Suppose that V j = W j for all j = ℓ , and if • k V ℓ − W ℓ k op k V k L +1 ≤ ; • k V − W k ≤ k V k L +1 ; • k V k > and k W k > ; • for all s and all convex combinations f W of V and W , J s ( f W ) ≤ J s ( V ) .Then, k∇ V J s ( V ) − ∇ W J s ( W ) k ≤ p ( L + 1) pJ s ( V ) k V k L +5 k V ℓ − W ℓ k h . Proof
Since k∇ V J s ( W ) − ∇ W J s ( V ) k = L +1 X k =1 k∇ V k J s ( W ) − ∇ W k J s ( V ) k . (13)First we seek a bound on k∇ V k J s ( V ) − ∇ W k J s ( W ) k op when k ∈ [ L ] . We have k∇ V k J s ( V ) − ∇ W k J s ( W ) k op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) + Σ
Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) + Σ
Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) × W ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s + x V ⊤ k − ,s ) − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op . k∇ V k J s ( V ) − ∇ W k J s ( W ) k op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } =:Ξ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } =:Ξ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } =:Ξ . (14)We will control each of these three terms separately in lemmas below. First in Lemma B.13we establish that Ξ ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) , then in Lemma B.14 we prove Ξ ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op , and in Lemma B.15 we establish Ξ ≤ J s ( V ) k V k L +5 k V ℓ − W ℓ k h . These three bound combined with the decomposition in (14) tells us that for any k ∈ [ L ] k∇ V k J s ( V ) − ∇ W k J s ( W ) k op ≤ J s ( W ) k V ℓ − W ℓ k op k V k L +1) + 8 J s ( V ) k V k L +1) k V ℓ − W ℓ k op + 40 J s ( V ) k V k L +5 k V ℓ − W ℓ k h ≤ J s ( V ) k V k L +5 k V ℓ − W ℓ k h , where the previous inequality follows since h < and k V k > . Since V k and W k are a p × p -dimensional matrices, we find k∇ V k J s ( V ) − ∇ W k J s ( W ) k ≤ √ p k∇ V k J s ( V ) − ∇ W k J s ( W ) k op ≤ √ pJ s ( V ) k V k L +5 k V ℓ − W ℓ k h . (15)24or the final layer we know that k∇ V L +1 J s ( V ) − ∇ W L +1 J s ( W ) k = k g s ( V ) x VL,s − g s ( W ) x WL,s k = k ( g s ( V ) − g s ( W ) + g s ( W )) x VL,s − g s ( W ) x WL,s k≤ | g s ( V ) − g s ( W ) |k x VL,s k + g s ( W ) k x VL,s − x WL,s k ( i ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) + g s ( W ) k V ℓ − W ℓ k op k V k L +1( ii ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) + 2 J s ( V ) k V ℓ − W ℓ k op k V k L +1( iii ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) (16)where ( i ) follows by invoking Lemma B.11 and Lemma B.9, ( ii ) follows since g s ( W ) ≤ J s ( W ) by Lemma B.3 and because by assumption J s ( W ) ≤ J s ( V ) , and ( iii ) follows since k V k > .This previous inequality along with (15) and (13) yield k∇ V J s ( W ) − ∇ W J s ( V ) k ≤ L √ pJ s ( V ) k V k L +5 k V ℓ − W ℓ k h ! + (cid:16) J s ( V ) k V ℓ − W ℓ k op k V k L +1) (cid:17) ≤ ( L + 1) √ pJ s ( V ) k V k L +5 k V ℓ − W ℓ k h ! . Taking square roots completes the proof.As promised in the proof of Lemma B.12 we now bound Ξ . Lemma B.13.
Borrowing the setting the setting and notation of Lemma B.12, if Ξ is asdefined in (14) then Ξ ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) . roof Unpacking using the definition of Ξ Ξ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) ( W ⊤ L +1 − V ⊤ L +1 + V ⊤ L +1 )( x W ⊤ k − ,s − x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ g s ( W ) (cid:13)(cid:13) Σ Vk,s (cid:13)(cid:13) op L Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13) Σ Vj,s (cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) W ⊤ L +1 − V ⊤ L +1 + V ⊤ L +1 (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) x W ⊤ k − ,s − x V ⊤ k − ,s (cid:13)(cid:13)(cid:13) ( i ) ≤ g s ( W ) L Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) W ⊤ L +1 − V ⊤ L +1 + V ⊤ L +1 (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) x W ⊤ k − ,s − x V ⊤ k − ,s (cid:13)(cid:13)(cid:13) ≤ g s ( W ) L Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op (cid:18)(cid:13)(cid:13)(cid:13) W ⊤ L +1 − V ⊤ L +1 (cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13) V ⊤ L +1 (cid:13)(cid:13)(cid:13) op (cid:19) (cid:13)(cid:13)(cid:13) x W ⊤ k − ,s − x V ⊤ k − ,s (cid:13)(cid:13)(cid:13) ( ii ) ≤ g s ( W ) k V ℓ − W ℓ k op k V k L +1 L Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op (cid:18)(cid:13)(cid:13)(cid:13) W ⊤ L +1 − V ⊤ L +1 (cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13) V ⊤ L +1 (cid:13)(cid:13)(cid:13) op (cid:19) = g s ( W ) k V ℓ − W ℓ k op k V k L +1 k W ℓ − V ℓ k op L Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op + L +1 Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op ( iii ) ≤ g s ( W ) k V ℓ − W ℓ k op k V k L +1 (cid:16) k W ℓ − V ℓ k op k V k L +1 + k V k L +1 (cid:17) ( iv ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) (cid:16) k W ℓ − V ℓ k op + 1 (cid:17) ( v ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) , where ( i ) follows since k Σ Vk,s k op ≤ , ( ii ) follows from invoking Lemma B.9, ( iii ) is byLemma B.6, ( iv ) follows since g s ( W ) ≤ J s ( W ) by Lemma B.3 and because by assumption J s ( W ) ≤ J s ( V ) . Finally ( v ) follows since by assumption k V ℓ − W ℓ k op ≤ / k V k L +1 and k V k > , therefore k V ℓ − W ℓ k op ≤ .We continue and now bound Ξ which as defined in the proof of Lemma B.12. Lemma B.14.
Borrowing the setting the setting and notation of Lemma B.12, if Ξ is asdefined in (14) then Ξ ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op . roof Unpacking the term Ξ Ξ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) ( W ⊤ L +1 − V ⊤ L +1 + V ⊤ L +1 ) x W ⊤ k − ,s − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) ( W ⊤ L +1 − V ⊤ L +1 ) x W ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x W ⊤ k − ,s − g s ( V ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) ( W ⊤ L +1 − V ⊤ L +1 )( x W ⊤ k − ,s − x V ⊤ k − ,s + x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s )+( g s ( W ) − g s ( V )) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) ( W ⊤ L +1 − V ⊤ L +1 )( x W ⊤ k − ,s − x V ⊤ k − ,s + x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } ♠ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } ♣ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( g s ( W ) − g s ( V )) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } =: ♥ . 
(17)27he first term ♠ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) ( W ⊤ L +1 − V ⊤ L +1 )( x W ⊤ k − ,s − x V ⊤ k − ,s + x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ g s ( W ) (cid:13)(cid:13) Σ Vk,s (cid:13)(cid:13) op L Y j = k +1 k V ⊤ j k op k Σ Vj,s k op k W ⊤ L +1 − V ⊤ L +1 k op ( k x W ⊤ k − ,s − x V ⊤ k − ,s k + k x V ⊤ k − ,s k ) ( i ) ≤ g s ( W ) L Y j = k +1 k V ⊤ j k op k W ⊤ L +1 − V ⊤ L +1 k op ( k x W ⊤ k − ,s − x V ⊤ k − ,s k + k x V ⊤ k − ,s k ) ( ii ) ≤ g s ( W ) k V k L +1 k W ⊤ L +1 − V ⊤ L +1 k op ( k x W ⊤ k − ,s − x V ⊤ k − ,s k + k x V ⊤ k − ,s k ) ≤ g s ( W ) k V k L +1 k V ℓ − W ℓ k op ( k x W ⊤ k − ,s − x V ⊤ k − ,s k + k x V ⊤ k − ,s k ) ( iii ) ≤ g s ( W ) k V k L +1 k V ℓ − W ℓ k op ( k V ℓ − W ℓ k op k V k L +1 + k V k L +1 )= g s ( W ) k V k L +1) k V ℓ − W ℓ k op ( k V ℓ − W ℓ k op + 1) ( iv ) ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op ( k V ℓ − W ℓ k op + 1) ( v ) ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op , where ( i ) follows since k Σ Vk,s k op ≤ , ( ii ) is by invoking Lemma B.6, ( iii ) follows due toLemma B.9, ( iv ) is because g s ( W ) ≤ J s ( W ) by Lemma B.3 and by the assumption J s ( W ) ≤ J s ( V ) , and finally ( v ) follows by the assumption that k V ℓ − W ℓ k op ≤ / k V k L +1 ≤ .Moving on to ♣ , again since g s ( W ) ≤ J s ( W ) (by Lemma B.3) and J s ( W ) ≤ J s ( V ) (byassumption), ♣ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( V ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 ( x W ⊤ k − ,s − x V ⊤ k − ,s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( V ) (cid:13)(cid:13) Σ Vk,s (cid:13)(cid:13) op L Y j = k +1 k V ⊤ j k op k Σ Vj,s k op k V ⊤ L +1 k op k x W ⊤ k − ,s − x V ⊤ k − ,s k≤ J s ( V ) L +1 Y j = k +1 k V ⊤ j k op k x W ⊤ k − ,s − x V ⊤ k − ,s k ( i ) ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op , ( i ) follows by invoking Lemmas B.6 and B.9. Finally let us control ♥ ♥ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( g s ( W ) − g s ( V )) Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ | g s ( W ) − g s ( V ) |k Σ Vk,s k op L Y j = k +1 k V ⊤ j k op k Σ Vj,s k op k V ⊤ L +1 k op k x V ⊤ k − ,s k≤ | g s ( W ) − g s ( V ) | L +1 Y j = k +1 k V ⊤ j k op k x V ⊤ k − ,s k ( i ) ≤ | g s ( W ) − g s ( V ) |k V k L +1 k x V ⊤ k − ,s k ( ii ) ≤ | g s ( W ) − g s ( V ) |k V k L +1)( iii ) ≤ J s ( V ) k V ℓ − W ℓ k op k V k L +1) , where ( i ) follows by Lemma B.6, ( ii ) is by Lemma B.9 and ( iii ) is by invoking Lemma B.11.Combining the bounds on ♠ , ♣ and ♥ along with (17) we find Ξ ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op + 2 J s ( V ) k V k L +1) k V ℓ − W ℓ k op ≤ J s ( V ) k V k L +1) k V ℓ − W ℓ k op , where the previous inequality follows since k V k > .Finally we bound Ξ which as defined in the proof of Lemma B.12. Lemma B.15.
Borrowing the setting the setting and notation of Lemma B.12, if Ξ is asdefined in (14) then Ξ ≤ J s ( V ) k V k L +5 k V ℓ − W ℓ k h . roof Since g s ( W ) ≤ J s ( W ) and J s ( W ) ≤ J s ( V ) (by assumption) we have Ξ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) g s ( W ) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( W ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) W ⊤ L +1 x W ⊤ k − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − Σ Vk,s L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = 2 J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − (Σ Vk,s − Σ Wk,s + Σ
Wk,s ) L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } =: ♠ + 2 J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (Σ Vk,s − Σ Wk,s ) L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op | {z } =: ♣ . (18)Before we bound ♠ and ♣ , let us establish a few useful bounds. First note that for anylayer j by Lemma B.10 k Σ Vj,s − Σ Wj,s k op ≤ k V ℓ − W ℓ k op k V k L +1 h . (19)Also we know that k x W ⊤ k − ,s k ≤ k x V ⊤ k − ,s k + k x W ⊤ k − ,s − x V ⊤ k − ,s k ≤ k V k L +1 + k V k L +1 k V ℓ − W ℓ k op ≤ k V k L +1 (1 + k V ℓ − W ℓ k op ) ≤ k V k L +1 . (20)Finally, k W L +1 k op ≤ k V L +1 k op + k V L +1 − W L +1 k op ≤ k V k + k V ℓ − W ℓ k op ≤ k V k , (21)where the last inequality follows by our assumptions that k V ℓ − W ℓ k op ≤ k V k L +1 and k V k > .30ith these bounds in place we are ready to bound ♠ : ♠ = 2 J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Wk,s L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13) Σ Wk,s (cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ J s ( V ) k V k L +2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = k +1 (cid:16) W ⊤ j Σ Wj,s (cid:17) − L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( ii ) ≤ J s ( V ) k V k L +2 ( k V k + k V − W k ) L +1 (cid:18)(cid:18) k V ℓ − W ℓ k op k V k L +1 h (cid:19) k V k + k V − W k (cid:19) = 8 J s ( V ) k V k L +5 h (cid:18) k V − W kk V k (cid:19) L +1 (cid:18) k V ℓ − W ℓ k op k V − W k + h k V k L +2 (cid:19) ( iii ) ≤ J s ( V ) k V k L +5 k V − W k h (cid:18) k V − W kk V k (cid:19) L +1( iv ) ≤ J s ( V ) k V k L +5 k V − W k h (cid:18) L + 1) k V − W kk V k (cid:19) ( v ) ≤ J s ( V ) k V k L +5 k V − W k h (22)where ( i ) follows by using the bounds in (20) and (21), ( ii ) follows by invoking Lemma B.7and using (19), ( iii ) follows since h ≤ and k V k ≥ by assumption, and therefore k V ℓ − W ℓ k op k V − W k + h k V k L +2 ≤ , Inequality ( iv ) follows since for any < z < L , (1+ ( L + 1) z ) L +1 ≤
1+ ( L + 1) z and because byassumption k V − W k ≤ k V k / ( L + 1) , and finally ( v ) is again because k V − W k ≤ k V k / ( L + 1) .Let’s turn our attention to ♣ . ♣ = 2 J s ( V ) k W ⊤ L +1 k op k x W ⊤ k − ,s k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (Σ Wk,s − Σ Vk,s ) L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ J s ( V ) k V k L +2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (Σ Wk,s − Σ Vk,s ) L Y j = k +1 (cid:16) V ⊤ j Σ Vj,s (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ J s ( V ) k V k L +2 (cid:13)(cid:13) Σ Wk,s − Σ Vk,s (cid:13)(cid:13) op L Y j = k +1 k V ⊤ j k op k Σ Vj,s k op ( ii ) ≤ J s ( V ) k V k L +3 (cid:13)(cid:13) Σ Wk,s − Σ Vk,s (cid:13)(cid:13) op ( iii ) ≤ J s ( V ) k V k L +4 k V ℓ − W ℓ k op h , (23)where ( i ) follows from the bounds in (20) and (21), ( ii ) follows by invoking Lemma B.6 and ( iii ) is by (19). 31y combining the bounds in (22) and (23) we have a bound on Ξ . Ξ ≤ J s ( V ) k V k L +5 h + 8 J s ( V ) k V k L +4 k V ℓ − W ℓ k op h ≤ J s ( V ) k V k L +5 k V − W k h = 40 J s ( V ) k V k L +5 k V ℓ − W ℓ k h , which completes the proof.Lemma B.12 provides a bound on the norm of the difference between ∇ V J s ( V ) and ∇ V J s ( W ) ,when the weight matrices V and W differ only at a single layer. The next lemma invokesLemma B.12 ( L + 1) times to bound the norm of the difference between the gradients of theloss at V and W when they potentially differ in all of the layers. Lemma B.16.
Let h ≤ , and consider V = ( V , . . . , V L +1 ) and W = ( W , . . . , W L +1 ) , suchthat the following are satisfied for all j ∈ [ L + 1] : • k V j − W j k op k V k L +1 ≤ ; • k V − W k ≤ k V k L +5 ; • k V k > and k W k > .For every j ∈ { , . . . , L + 1 } define T ( j ) := ( W , W , . . . , W j , V j +1 , . . . , V L +1 ) . Suppose thatfor all j ∈ [ L + 1] , for all examples s , and for all convex combinations f W of T ( j ) and T ( j + 1) , J s ( f W ) ≤ J s ( T ( j )) ≤ J s ( V ) . Then k∇ V J ( V ) − ∇ W J ( W ) k ≤ L + 1) √ pJ ( V ) k V k L +5 k V − W k h . Proof
We may transform V into W by swapping one layer at a time. For any s ∈ [ n ] Lemma B.12 bounds the norm of difference in each swap, thus, k∇ V J s ( V ) − ∇ W J s ( W ) k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L X k =0 (cid:0) ∇ T ( k ) J s ( T ( k )) − ∇ T ( k +1) J s ( T ( k + 1)) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ L X k =0 (cid:13)(cid:13) ∇ T ( k ) J s ( T ( k )) − ∇ T ( k +1) J s ( T ( k + 1)) (cid:13)(cid:13) ≤ L +1 X k =1 p ( L + 1) pJ s ( T ( k )) k T ( k ) k L +5 k V k − W k k h = 52 p ( L + 1) ph L +1 X k =1 J s ( T ( k )) k T ( k ) k L +5 k V k − W k k≤ p ( L + 1) pJ s ( V ) h L +1 X k =1 k T ( k ) k L +5 k V k − W k k , (24)32here the final inequality follows from the assumption that J s ( T ( k )) ≤ J s ( V ) . For any k ∈ [ L + 1] k T ( k ) k L +5 = k V k L +5 (cid:18) k T ( k ) kk V k (cid:19) L +5 = k V k L +5 (cid:18) k T ( k ) − V + V kk V k (cid:19) L +5 ≤ k V k L +5 (cid:18) k T ( k ) − V kk V k (cid:19) L +5 ≤ k V k L +5 (cid:18) k W − V kk V k (cid:19) L +5( i ) ≤ k V k L +5 (cid:18) L + 5) k W − V kk V k (cid:19) ( ii ) ≤ k V k L +5 , where ( i ) follows since for any z < L +5 , (1+ z ) L +5 ≤ L +5) z and because by assumption k V − W k / k V k ≤ L +5 , and ( ii ) again follows by our assumption that k V − W k / k V k ≤ L +5 .Using this bound in Inequality (24) k∇ V J s ( V ) − ∇ W J s ( W ) k ≤ p ( L + 1) pJ s ( V ) k V k L +5 h L +1 X k =1 k V k − W k k≤ p ( L + 1) pJ s ( V ) k V k L +5 h (cid:16) √ L + 1 k V − W k (cid:17) = 208( L + 1) √ pJ s ( V ) k V k L +5 k V − W k h . Thus, k∇ V J ( V ) − ∇ W J ( W ) k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X s ∈ [ n ] ∇ V J s ( V ) − ∇ W J s ( W ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ n X s ∈ [ n ] k∇ V J s ( V ) − ∇ W J s ( W ) k≤ n X s ∈ [ n ] L + 1) √ pJ s ( V ) k V k L +5 k V − W k h = 208( L + 1) √ pJ ( V ) k V k L +5 k V − W k h completing the proof. Lemma 4.2. If h ≤ , for any weight matrix V such that J ( V ) < /n L we have k∇ V J ( V ) k op ≤ L + 1) √ p k V k L +5 J ( V ) h . Proof
First since J ( V ) < n L , Lemma B.5 implies that k V k > . Given the Hessian ∇ J at a point V , and let λ correspond to the maximum eigenvalue of the Hessian at this point33nd let W be the corresponding normalized eigenvector (with k W k = 1 ). Then we know that k∇ J k op = λ . k∇ J k op = λ = lim η → lim η → J ( V + η W + η W ) − J ( V + η W ) η − lim η → J ( V + η W ) − J ( V ) η η = lim η → vec( ∇ J ( V + η W )) · vec( W ) − vec( ∇ J ( V )) · vec( W ) η ≤ lim η → k∇ J ( V + η W ) − ∇ J ( V ) kk W k η = lim η → k∇ J ( V + η W ) − ∇ J ( V ) k η . Since the function J ( · ) is continuous, for all small enough η the assumptions of Lemma B.16are satisfied. Hence by applying Lemma B.16 k∇ J k op ≤ lim η → k∇ J ( V + η W ) − ∇ J ( V ) k η = lim η → k∇ J ( V + η W ) − ∇ J ( V ) kk V + η W − V k≤ L + 1) √ pJ ( V ) k V k L +5 h as claimed. B.5 Proof of Lemma 4.3
Lemma 4.3.
For any weight matrix V , if k V k > then k∇ V J ( V ) k ≤ p ( L + 1) p k V k L +1 min { J ( V ) , } . Proof
First note that since J ( V ) < n L therefore k V k > by Lemma B.5. Now for any ℓ ∈ [ L ] the formula for the gradient of the loss with respect to V ℓ is given by (see Equation (2a)) ∂J ( V ; x s , y s ) ∂V ℓ = g s ( V ) Σ Vℓ,s L Y j = ℓ +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ ℓ − ,s , therefore its operator norm (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ; x s , y s ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) op = g s ( V ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Vℓ,s L Y j = ℓ +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ ℓ − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ g s ( V ) k Σ Vℓ,s k op L Y j = ℓ +1 k V ⊤ j k op k Σ Vj,s k op k V ⊤ L +1 k op k x V ⊤ ℓ − ,s k≤ g s ( V ) L +1 Y j = ℓ +1 k V j k op k x Vℓ − ,s k (25)34here the last step follows since k Σ Vj,s k op ≤ max z | φ ′ ( z ) | < . By its definition k x Vℓ − ,s k = k φ ( V ℓ − φ ( · · · φ ( V x s ))) k ( i ) ≤ k V ℓ − φ ( · · · φ ( V x s )) k≤ k V ℓ − k op k φ ( · · · φ ( V x )) k≤ ℓ − Y j =1 k V j k op k x s k ( ii ) ≤ ℓ − Y j =1 k V j k op , where ( i ) follows since φ is contractive (Lemma B.8) and ( ii ) is because k x s k = 1 . Along withInequality (25) this implies (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ; x s , y s ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) op ≤ g s ( V ) Y j = ℓ k V j k op ≤ g s ( V ) Y j = ℓ k V j k ≤ g s ( V ) k V k L +1 , where the last inequality follows from Lemma B.6. Therefore we have (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n X s ∈ [ n ] ∂J ( V ; x s , y s ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ n X s ∈ [ n ] (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ; x s , y s ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) op ≤ k V k L +1 n X s ∈ [ n ] g s ( V ) . We know that g s ( V ) ≤ J s ( V ) by Lemma B.3 and also that g s ( V ) < . Therefore, (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) op ≤ k V k L +1 n min (X s J s ( V ) , n ) ≤ k V k L +1 min { J ( V ) , } . Given that V ℓ is a p × p matrix we infer (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) ≤ √ p (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) op ≤ √ p k V k L +1 min { J ( V ) , } . (26)When ℓ = L + 1 ∂J ( V ; x s , y s ) ∂V L +1 = g s ( V ) x V ⊤ L,s , by using the same chain of logic as in the case of ℓ < L + 1 we can obtain the bound (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) ≤ √ p k V k L +1 min { J ( V ) , } . Summing up over all layers k∇ J ( V ) k = L +1 X ℓ =1 (cid:13)(cid:13)(cid:13)(cid:13) ∂J ( V ) ∂V ℓ (cid:13)(cid:13)(cid:13)(cid:13) ≤ ( L + 1) p k V k L +1) (min { J ( V ) , } ) , hence, taking squaring roots completes the proof.35 .6 Proof of Lemma 4.4 Lemma 4.4. If h ≤ , J t < n L , and αJ t ≤ h
832 ( L + 1) √ p k V ( t ) k L +5 , then J t +1 ≤ J t − αL k∇ J t k L + . Proof
Since, by assumption, J t < n L , Lemma B.5 implies k V ( t ) k > . We would liketo apply Lemmas 4.2 and 4.3. To apply these lemmas we first upper and lower bound thenorm of all convex combinations of V ( t ) and V ( t +1) . Consider W = ηV (1) + (1 − η ) V ( t +1) = V ( t ) − (1 − η ) α ∇ J t for any η ∈ [0 , . An upper bound on the norm raised to the L + 5 thpower is k W k L +5 = k V ( t ) − (1 − η ) α ∇ J t k L +5 = k V ( t ) k L +5 k V ( t ) − (1 − η ) α ∇ J t kk V ( t ) k ! L +5 ≤ k V ( t ) k L +5 k V ( t ) k + α k∇ J t kk V ( t ) k ! L +5 = k V ( t ) k L +5 (cid:18) α k∇ J t kk V ( t ) k (cid:19) L +5( i ) ≤ k V ( t ) k L +5 α ( p ( L + 1) pJ t k V ( t ) k L +1 ) k V ( t ) k ! L +5 = k V ( t ) k L +5 (cid:16) α ( p ( L + 1) pJ t k V ( t ) k L ) (cid:17) L +5( ii ) ≤ k V ( t ) k L +5 (cid:16) L + 5)( p ( L + 1) pαJ t k V ( t ) k L (cid:17) ≤ k V ( t ) k L +5 (27)where ( i ) follows by invoking Lemma 4.3 and ( ii ) follows since for any z > , (1 + z ) L +5 ≤ L + 5) z for all z ≤ / (3 L + 5) and because the step-size α is chosen such that α ( p ( L + 1) pJ t k V ( t ) k L ) ≤ h L + 1) √ pJ t k V ( t ) k L +5 · p ( L + 1) pJ t k V ( t ) k L = h L + 1) / k V ( t ) k L +5 ≤ L + 5 . W ∈ [ V ( t ) , V ( t +1) ] raised to the L + 5 th power isbounded by k V ( t ) k L +5 . Next we lower bound the norm of W , k W k = k V ( t ) − (1 − η ) α ∇ J t k ≥ k V ( t ) k (cid:18) − α k∇ J t kk V ( t ) k (cid:19) ( i ) ≥ k V ( t ) k (cid:16) − α p ( L + 1) pJ t k V ( t ) k L (cid:17) ( ii ) ≥ √ (cid:16) − α p ( L + 1) pJ t k V ( t ) k L (cid:17) ( iii ) > , where ( i ) follows by again invoking Lemma 4.3, ( ii ) is by Lemma B.5 that guarantees that k V ( t ) k > √ since J t < n L and ( iii ) is by the logic above that guarantees that α (cid:16)p ( L + 1) pJ t k V ( t ) k L (cid:17) ≤ L +5 . Thus we have also shown that k W k > for any W ∈ [ V ( t ) , V ( t +1) ] .In order to apply Lemma 4.1 (that shows that the loss decreases along a gradient stepwhen the loss is smooth along the path), we would like to bound k∇ W J k op for all convexcombinations W of V ( t ) and V ( t +1) . For N = (cid:24) √ ( L +1) p k V ( t ) k L +1 k V ( t +1) − V ( t ) k J t (cid:25) , (similarly tothe proof of Lemma E.8 of [LL20]) we will prove the following by inductionFor all s ∈ { , . . . , N } , for all η ∈ [0 , s/N ] , for W = ηV ( t +1) + (1 − η ) V ( t ) , k∇ W J k op ≤ L +1) √ pJ t k V ( t ) k L +5 h .The base case, where s = 0 follows directly from Lemma 4.2. Now, assume that the inductivehypothesis holds from some s , and, for η ∈ ( s/N, ( s +1) /N ] , consider W = ηV ( t +1) +(1 − η ) V ( t ) .Let f W = ( s/N ) V ( t +1) + (1 − s/N ) V ( t ) . Since the step-size α is small enough, by applyingLemma 4.1 along with the inductive hypothesis yields J ( f W ) ≤ J t . On applying Lemma 4.3(which provides a bound on the Lipschitz constant of J ) J ( W ) ≤ J ( f W ) + ( p ( L + 1) p max ¯ W ∈ [ W, f W ] k ¯ W k L +1 ) k W − f W k ( i ) ≤ J ( f W ) + (2 p ( L + 1) p ) k V ( t ) k L +1 k W − f W k≤ J ( f W ) + (2 p ( L + 1) p ) k V ( t ) k L +1 k V ( t +1) − V ( t ) k N = J ( f W ) + J t ≤ J t , where ( i ) follows since max ¯ W ∈ [ W, f W ] k ¯ W k L +1 ≤ k V ( t ) k L +1 by using the same logic used toarrive at Inequality (27). 
Applying Lemma 4.2, this implies that for any W ∈ [ V ( t ) , V ( t +1) ] k∇ W J k op ≤ L + 1) √ pJ ( W ) k W k L +5 h ≤ L + 1) √ pJ t k V ( t ) k L +5 h , completing the proof of the inductive step.So, now we know that, for all convex combinations W of V ( t ) and V ( t +1) , k∇ W J k op ≤ L +1) √ pJ t k V ( t ) k L +5 h . By our choice of α < L + · h L +1) √ pJ t k V ( t ) k L +5 , by applying Lemma 4.1,37e have J t +1 ≤ J t − LL + α k∇ J t k which is the desired result. B.7 Proof of Lemma 4.5
Lemma 4.5.
For all L ∈ N if h ≤ h max , J t < n L , and k V ( t ) k L ≤ log(1 /J t ) k V (1) k L log(1 /J ) then k∇ J t k ≥ ( L + ) J t log(1 /J t ) k V ( t ) k . (6) Proof
We have k∇ J t k = sup a : k a k =1 ( ∇ J t · a ) ≥ ( ∇ J t ) · − V ( t ) k V ( t ) k ! = 1 k V ( t ) k X ℓ ∈ [ L +1] ∇ V ℓ J t · (cid:16) − V ( t ) ℓ (cid:17) . (28)Note that by definition, ∇ V ℓ J t · (cid:16) − V ( t ) ℓ (cid:17) = 1 n X s ∈ [ n ] ∇ V ℓ J ts · (cid:16) − V ( t ) ℓ (cid:17) . (29)Consider two cases. Case 1: (When ℓ = L + 1 ) In this case, for any s ∈ [ n ] by the formula for the gradient in(2b) we have ∇ V L +1 J ts · (cid:16) − V ( t ) L +1 (cid:17) = g ts y s V ( t ) L +1 x ( t ) L,s = g ts y s f V ( t ) ( x s ) . and therefore ∇ V L +1 J t · (cid:16) − V ( t ) L +1 (cid:17) = 1 n X s ∈ [ n ] g ts y s f V ( t ) ( x s ) . (30) Case 2: (When ℓ ∈ [ L ] ) Below we will prove the claim (in Lemma B.17) that for any ℓ ∈ [ L ] ∇ V ( t ) ℓ J t · (cid:16) − V ( t ) ℓ (cid:17) ≥ n X s ∈ [ n ] g ts " y s f V ( t ) ( x s ) − √ ph k V ( t ) k L L L − . (31)By combining this with the results of inequalities (28) and (30) k∇ J t k ≥ L + 1 n k V ( t ) k X s ∈ [ n ] g ts y s f V ( t ) ( x s ) − L √ ph k V ( t ) k L L L − k V ( t ) k n X s ∈ [ n ] g ts ( i ) ≥ L + 1 n k V ( t ) k X s ∈ [ n ] g ts y s f V ( t ) ( x s ) − L √ ph k V ( t ) k L L L − k V ( t ) k J t ( ii ) ≥ L + 1 n k V ( t ) k X s ∈ [ n ] g ts y s f V ( t ) ( x s ) − L √ ph k V (1) k L log(1 /J t )2 log(1 /J ) L L − k V ( t ) k J t (32)38here ( i ) follows because g ts ≤ J ts by Lemma B.3 and ( ii ) follows by our assumption on k V ( t ) k .For every sample s , J ts = log (1 + exp ( − y s f V ( t ) ( x s ))) which implies y s f V ( t ) ( x s ) = log (cid:18) J ts ) − (cid:19) and g ts = 11 + exp( y s f V ( t ) ( x s )) = 1 − exp ( − J ts ) . Plugging this into (32) we derive, k∇ J t k ≥ L + 1 n k V k n X s =1 (1 − exp ( − J ts )) log (cid:18) J ts ) − (cid:19) − L √ ph k V (1) k L /J ) L L − k V k J t log(1 /J t ) . Observe that the function (1 − exp( − z )) log (cid:16) z ) − (cid:17) is continuous and concave with lim z → + (1 − exp( − z )) log (cid:16) z ) − (cid:17) = 0 . Also recall that P s J ts = J t n . Therefore applying Lemma B.4,to the function ψ with ψ (0) = 0 and ψ ( z ) = (1 − exp( − z )) log (cid:16) z ) − (cid:17) for z > , we get k∇ J t k ≥ L + 1 k V ( t ) k (cid:20) − exp( − J t n ) n log (cid:18) J t n ) − (cid:19)(cid:21) − L √ ph k V (1) k L /J ) L L − k V k J t log(1 /J t ) . (33)We know that for any z ∈ [0 , z ) ≤ z and exp( − z ) ≤ − z + z . Since J t < n L and n ≥ , these bounds on the exponential function combined with In-equality (33) yields k∇ J t k≥ L + 1 k V ( t ) k (cid:20) ( J t − nJ t ) log (cid:18) J t n (cid:19)(cid:21) − L √ ph k V (1) k L /J ) L L − k V k J t log(1 /J t )= ( L + 1) J t log(1 /J t ) k V ( t ) k " − nJ t − log(2 n )log(1 /J t ) − √ ph k V (1) k L /J ) L L − = ( L + 1 / J t log(1 /J t ) k V ( t ) k " L + ) − nJ t − log(2 n )log(1 /J t ) − √ ph k V (1) k L /J ) L L − . (34)By the choice of h ≤ h max we have √ ph k V (1) k L /J ) L L − ≤ L .
Next, since J t < n L and n ≥ n )log(1 /J t ) ≤ log(2) + log( n )(1 + 24 L ) log( n ) ≤ L . nJ t < L ≤ L .
Therefore, using these three bounds in conjunction with Inequality (34) yields k∇ J t k ≥ ( L + 1 / J t log(1 /J t ) k V ( t ) k " L + ) − L (cid:21) ≥ ( L + ) J t log(1 /J t ) k V ( t ) k , which establishes the desired bound.As promised above we now lower bound the inner product between the gradient of the losswith respect to V ( t ) ℓ and the weight matrix for any ℓ ∈ [ L ] . Lemma B.17.
Under the conditions of Lemma 4.5 and borrowing all notation from the proofof Lemma 4.5 above, for all ℓ ∈ [ L ] ∇ V ℓ J t · (cid:16) − V ( t ) ℓ (cid:17) ≥ n X s ∈ [ n ] g ts " y s f V ( t ) ( x s ) − √ ph k V ( t ) k L L L − . Proof
To ease notation, let us drop the ( t ) in the superscript and refer to V ( t ) as V . Recallthat for any matrices A and B , A · B = vec( A ) · vec( B ) = Tr( A ⊤ B ) . Also recall the formulafor the gradient of the loss in (2a), therefore, for any s ∈ [ n ] ∇ V ℓ J ts · ( − V ℓ )= − Tr (cid:16) V ⊤ ℓ ∇ V ℓ J ts (cid:17) = g ts y s Tr V ⊤ ℓ Σ ℓ,s L Y j = ℓ +1 V ⊤ j Σ j,s V ⊤ L +1 x ⊤ ℓ − ,s = g ts y s Tr L Y j = ℓ (cid:16) V ⊤ j Σ j,s (cid:17) V ⊤ L +1 x ⊤ ℓ − ,s ( i ) = g ts y s Tr x ⊤ ℓ − ,s L Y j = ℓ (cid:16) V ⊤ j Σ j,s (cid:17) V ⊤ L +1 = g ts y s x ⊤ ℓ − ,s L Y j = ℓ (cid:16) V ⊤ j Σ j,s (cid:17) V ⊤ L +1( ii ) = g ts y s x ⊤ L,s V ⊤ L +1 + g ts y s L − X k = ℓ (cid:16) x ⊤ k − ,s V ⊤ k Σ k,s − x ⊤ k,s (cid:17) L Y j = k +1 (cid:16) V ⊤ j Σ j,s (cid:17) V ⊤ L +1 + g ts y s (cid:16) x ⊤ L − ,s V ⊤ L Σ L,s − x ⊤ L,s (cid:17) V ⊤ L +1( iii ) = g ts y s f V ( t ) ( x s ) + g ts y s L − X k = ℓ (cid:16) x ⊤ k − ,s V ⊤ k Σ k,s − x ⊤ k,s (cid:17) L Y j = k +1 (cid:16) V ⊤ j Σ j,s (cid:17) V ⊤ L +1 + g ts y s (cid:16) x ⊤ L − ,s V ⊤ L Σ L,s − x ⊤ L,s (cid:17) V ⊤ L +1 , ( i ) follows by the cyclic property of the trace, and ( ii ) follows since the second term andthird term in the equation form a telescoping sum, and ( iii ) is because f V ( t ) ( x s ) = V L +1 x L,s by definition. By the property of h -smoothly approximately ReLU activations, for any z ∈ R we know that | φ ′ ( z ) z − φ ( z ) | ≤ h . Therefore for any k ∈ [ L ] , k x ⊤ k − ,s V ⊤ k Σ k,s − x ⊤ k,s k ∞ ≤ h and hence k x ⊤ k − ,s V ⊤ k Σ k,s − x ⊤ k,s k ≤ √ ph . Continuing from the previous displayed equation, byapplying the Cauchy-Schwarz inequality we find ∇ V ℓ J ts · ( − V ℓ ) ≥ g ts y s f V ( t ) ( x s ) − g ts L − X k = ℓ k x ⊤ k − ,s V ⊤ k Σ k,s − x ⊤ k,s k L Y j = k +1 (cid:13)(cid:13)(cid:13) V ⊤ j (cid:13)(cid:13)(cid:13) op k Σ j,s k op (cid:13)(cid:13)(cid:13) V ⊤ L +1 (cid:13)(cid:13)(cid:13) − g ts k x ⊤ L − ,s V ⊤ L Σ L,s − x ⊤ L,s k (cid:13)(cid:13)(cid:13) V ⊤ L +1 (cid:13)(cid:13)(cid:13) ≥ g ts y s f V ( t ) ( x s ) − √ phg ts L X k = ℓ L +1 Y j = k +1 k V j k op k Σ j,s k op ( i ) ≥ g ts y s f V ( t ) ( x s ) − √ phg ts L X k = ℓ L +1 Y j = k +1 k V j k op ≥ g ts y s f V ( t ) ( x s ) − √ pLhg ts k ∈ [ L ] L +1 Y j = k +1 k V j k op ( ii ) ≥ g ts y s f V ( t ) ( x s ) − √ pLhg ts k ∈ [ L ] L +1 Y j = k +1 k V j k ( iii ) ≥ g ts y s f V ( t ) ( x s ) − √ ph k V k L g ts L L − (35)where ( i ) follows since φ ′ ( · ) ≤ and therefore k Σ j,s k op ≤ , ( ii ) follows since for any matrix M , k M k op ≤ k M k and Inequality ( iii ) follows by invoking Part 2 of Lemma B.5. The previousdisplay along with (29) yields ∇ V ℓ J t · ( − V ℓ ) ≥ n X s ∈ [ n ] g ts " y s f V ( t ) ( x s ) − √ ph k V k L L L − which completes our proof of this claim.Now that we have proved all the lemmas stated in Section 4.1, the reader can next jump toSection 4.2. C An example where the margin in Assumption 3.2 is constant
In this section we provide an example where the margin $\gamma$ in Assumption 3.2 is constant. Consider a two-layer Huberized ReLU network. In this section we always let $\phi$ denote the Huberized ReLU activation (see its definition in Equation (1)). Since here we are only concerned with the properties of the network at initialization, let $V^{(1)}$ be denoted simply by $V$. The first layer $V_1 \in \mathbb{R}^{p \times p}$ has its entries drawn independently from $\mathcal{N}(0, \tfrac{2}{p})$ and $V_2 \in \mathbb{R}^{1 \times p}$ has its entries drawn independently from $\mathcal{N}(0, 1)$.

Let $V_{1,i}$ denote the $i$th row of $V_1$ and let $V_{2,i}$ denote the $i$th coordinate of $V_2$. The network computed by these weights is $f_V(x) = V_2\, \phi(V_1 x)$.

Consider data in which examples of each class are clustered. There is a unit vector $\mu \in \mathbb{S}^{p-1}$ such that, for all $s$ with $y_s = 1$, $\|x_s - \mu\| \le r$, and, for all $s$ with $y_s = -1$, $\|x_s - (-\mu)\| \le r$. Let us say that such data is $r$-clustered. (Recall that $\|x_s\| = 1$ for all $s$.)

Proposition C.1.
For any δ > , suppose that h ≤ √ π p , r ≤ min ( , √ phc ′ q log ( pnδ ) ) , and p ≥ log c ′ ( n/δ ) for a large enough constant c ′ > . If the data r -clustered then, with probability − δ there exists W ⋆ = ( W ⋆ , W ⋆ ) with k W ⋆ k = 1 such thatfor all s ∈ [ n ] , y s ( ∇ V f V ( x s ) · W ⋆ ) ≥ c √ p where c is a positive absolute constant. Proof
Define a set S := (cid:26) i ∈ [ p ] : 12 ≤ | V ,i | ≤ (cid:27) , and also define S + := { i ∈ S : V ,i · µ ≥ h } and S − := { i ∈ S : − V ,i · µ ≥ h } . Consider an event E margin such that all of the following simultaneously occur:(a) p (cid:0) − o p (1) (cid:1) ≤ |S + | ≤ p (cid:0) + o p (1) (cid:1) ;(b) p (cid:0) − o p (1) (cid:1) ≤ |S − | ≤ p (cid:0) + o p (1) (cid:1) ;(c) for all s ∈ [ n ] and i ∈ [ p ] , | V ,i · ( x s − y s µ ) | ≤ h .Let us assume that the event E margin holds for the remainder of the proof. Using simple con-centration arguments in Lemma C.2 below we will show that P [ E margin ] ≥ − δ .The gradient of f with respect to V ,i is ∇ V ,i f V ( x ) = x (cid:0) V ,i φ ′ ( V ,i · x ) (cid:1) . Consider a sample with index s with y s = 1 . For any i ∈ S + sign( V ,i ) (cid:0) µ · ∇ V ,i f V ( x s ) (cid:1) = µ · x s (cid:0) | V ,i | φ ′ ( V ,i · x s ) (cid:1) = (cid:0) | V ,i | φ ′ ( V ,i · x s ) (cid:1) + µ · ( x s − µ ) (cid:0) | V ,i | φ ′ ( V ,i · x s ) (cid:1) ( i ) ≥ φ ′ ( V ,i · µ + V ,i · ( x s − µ )) − µ · ( x s − µ ) φ ′ ( V ,i · x s ) ( ii ) ≥ φ ′ ( V ,i · µ + V ,i · ( x s − µ )) − ( iii ) ≥ φ ′ (2 h )2 − ( iv ) = 12 −
18 = 38 (36)42here ( i ) follows since ≤ | V ,i | ≤ when i ∈ S + . Inequality ( ii ) follows since φ ′ is boundedby and because k x s − y s µ k ≤ r ≤ / . Inequality ( iii ) follows since i ∈ S + and therefore ( V ,i ) · µ ≥ h , under event E margin , ( V ,i ) · ( x s − µ ) ≥ − h , and since φ ′ is a monotonicallyincreasing function. Equation ( iv ) follows since φ ′ (2 h ) = 1 . On the other hand, for any i ∈ S − : sign( V ,i ) µ · ∇ V ,i f V ( x s ) = µ · x s (cid:0) | V ,i | φ ′ ( V ,i · x s ) (cid:1) = | V ,i | φ ′ ( V ,i · x s ) + µ · ( x s − µ ) (cid:0) | V ,i | φ ′ ( V ,i · x s ) (cid:1) ( i ) ≥ − µ · ( x s − µ ) φ ′ ( V ,i · x s ) ( ii ) ≥ − (37)where ( i ) follows since | V ,i | ≤ when i ∈ S − and φ ′ is always non-negative. Inequality ( ii ) again follows since φ ′ is bounded by and because k x s − y s µ k ≤ r ≤ / .Similarly we can also show that for a sample s with y s = − , for any i ∈ S − sign( V ,i ) µ · ∇ V ,i f V ( x s ) ≤ − (38)and for any i ∈ S + sign( V ,i ) µ · ∇ V ,i f V ( x s ) ≤ . (39)With these calculations in place let us construct W ⋆ = ( W ⋆ , W ⋆ ) where, W ⋆ ∈ R p × p , W ⋆ ∈ R × p and k W ⋆ k = 1 . Set W ⋆ = 0 . For all i ∈ S + ∪ S − set W ⋆ ,i = sign( V ,i ) µ p |S + | + |S − | and for all i
6∈ S + ∪ S − , set W ⋆ ,i = 0 . We can easily check that k W ⋆ k = 1 (since k µ k = 1 ).Thus, for any sample s with y s = 1 y s ( ∇ V f V ( x s ) · W ⋆ )= ∇ V f V ( x s ) · W ⋆ = X i ∈S + ∇ V ,i f V ( x s ) · W ⋆ ,i + X i ∈S − ∇ V ,i f V ( x s ) · W ⋆ ,i = 1 p |S + | + |S − | X i ∈S + sign( V ,i ) µ · ∇ V ,i f V ( x s ) + X i ∈S − sign( V ,i ) µ · ∇ V ,i f V ( x s ) ( i ) ≥ p |S + | + |S − | (cid:20) |S + | − |S − | (cid:21) = 18 p |S + | + |S − | [3 |S + | − |S − | ] ( ii ) ≥ p p (1 + o p (1)) (cid:20) p (cid:18) − o p (1) (cid:19) − p (cid:18)
12 + o p (1) (cid:19)(cid:21) ≥ c √ p where ( i ) follows by using inequalities (36) and (37) and ( ii ) follows by Parts (a) and (b)of the event E margin . The final inequality follows since we assume that p is greater than aconstant. This shows that it is possible to achieve a margin of c √ p on the positive examples.43y mirroring the logic above and using Inequalities (38) and (39) we can show that a marginof c √ p can also be attained on the negative examples. This completes our proof.As promised we now show that the event E margin defined above occurs with probability at least − δ . Lemma C.2.
For the event $\mathcal{E}_{\mathrm{margin}}$ defined in the proof of Proposition C.1 above, $\mathbb{P}[\mathcal{E}_{\mathrm{margin}}] \ge 1 - \delta$. Proof
We shall show that each of the three sub-events in the definition of the event $\mathcal{E}_{\mathrm{margin}}$ occurs with probability at least $1 - \delta/3$. Then a union bound establishes the statement of the lemma. Proof of Part (a):
Recall the definition of the set SS := (cid:26) i ∈ [ p ] : 12 ≤ | V ,i | ≤ (cid:27) , and also the definition of the set S + S + := { i ∈ S : V ,i · µ ≥ h } . We will first derive a high probability bound the size of the set S , and then use this bound tocontrol the size of S + . A trivial upper bound is |S| ≤ p . Let us derive a lower bound on itssize. Define the random variable ζ i = I (cid:2) ≤ | V ,i | ≤ (cid:3) . It is easy to check that |S| = P i ∈ [ p ] ζ i .The expected value of this random variable E [ ζ i ] = 1 − P (cid:20) | V ,i | ≤ (cid:21) − P [ | V ,i | ≥ ( i ) ≥ − / − ( − / √ π − P [ | V ,i | ≥ ( ii ) ≥ − √ π − exp ( − √ π > where ( i ) follows since V ,i ∼ N (0 , so its density is upper bounded bounded by / √ π , and ( ii ) follows by a Mill’s ratio bound to upper bound P [ | V ,i | ≥ z ] ≤ × exp( − z / √ πz . A Hoeffdingbound (see Theorem F.5) implies that for any η ≥ P h |S| ≥ p E [ ζ i ] − ηp i ≥ − exp (cid:0) − c η p (cid:1) . Setting η = 1 /p / we get P (cid:20) |S| ≥ p (cid:18) − p / (cid:19)(cid:21) ≥ − exp ( − c √ p ) . (40)We now will bound |S + | conditioned on the event in the previous display: p (cid:16) − p / (cid:17) ≤ |S| ≤ p . For each i ∈ S , the random variable V ,i · µ ∼ N (cid:16) , p (cid:17) since each entry of V ,i is drawn inde-pendently from N (cid:16) , p (cid:17) and because k µ k = 1 . Define a random variable ξ i := I [ V ,i · µ ≥ h ] .It is easy to check that |S + | = P i ∈S ξ i . The expected value of ξ i (cid:12)(cid:12)(cid:12)(cid:12) E [ ξ i ] − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) P [ V ,i · µ ≥ h ] − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) P [ V ,i · µ ≥ − P [ V ,i · µ ∈ [0 , h )] − (cid:12)(cid:12)(cid:12)(cid:12) = P [ V ,i · µ ∈ [0 , h )] ( i ) ≤ h √ p √ π √ ( ii ) ≤ √ p ( i ) follows since the density of this Gaussian is upper bounded by √ π × (cid:16) √ √ p (cid:17) and ( ii ) isby the assumption that h ≤ √ π p . Thus we have shown that − √ p ≤ E [ ξ i ] ≤ + √ p . Again aHoeffding bound (see Theorem F.5) implies that for any η ≥ P "(cid:12)(cid:12)(cid:12) X i ∈S ξ i − |S| E [ ξ i ] (cid:12)(cid:12)(cid:12) ≤ ηp (cid:12)(cid:12)(cid:12)(cid:12) p (cid:18) − p / (cid:19) ≤ |S| ≤ p ≥ − (cid:0) − c η p (cid:1) . By setting η = 1 /p / we get that P (cid:20)(cid:12)(cid:12)(cid:12) |S + | − |S| E [ ξ i ] (cid:12)(cid:12)(cid:12) ≤ p / (cid:12)(cid:12)(cid:12)(cid:12) p (cid:18) − p / (cid:19) ≤ |S| ≤ p (cid:21) ≥ − − c √ p ) . (41)By a union bound over the events in (40) and (41) we get that P (cid:20) p (cid:18) − o p (1) (cid:19) ≤ |S + | ≤ p (cid:18)
12 + o p (1) (cid:19)(cid:21) ≥ − exp ( − c √ p ) − − c √ p ) . By assumption p ≥ log c ′ ( n/δ ) for a large enough constant c ′ , thus P (cid:20) p (cid:18) − o p (1) (cid:19) ≤ |S + | ≤ p (cid:18)
12 + o p (1) (cid:19)(cid:21) ≥ − δ/ which completes our proof of the first part. Proof of Part (b):
The proof of this second part follows by exactly the same logic as Part (a).
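Before turning to Part (c), the following Monte-Carlo sketch (an illustration, not part of the argument) traces the construction from the proof of Proposition C.1 on $r$-clustered data: it draws the two-layer initialization, forms the sets $\mathcal{S}, \mathcal{S}_+, \mathcal{S}_-$, builds $W^\star$ from $\mathrm{sign}(V_{2,i})\,\mu$ rows, and measures the resulting margin relative to $\sqrt{p}$. The Huberized-ReLU derivative and the numerical thresholds defining the sets are assumed stand-ins for Equation (1) and for the constants in the proof above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, h, r = 2000, 50, 1e-3, 0.05

def dphi(z, h):
    # derivative of an assumed Huberized ReLU: 0 for z <= 0, z/h on [0, h], 1 for z >= h
    return np.clip(z / h, 0.0, 1.0)

# r-clustered unit-norm data around +/- mu
mu = np.zeros(p); mu[0] = 1.0
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu[None, :] + (r / np.sqrt(p)) * rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)

V1 = np.sqrt(2.0 / p) * rng.standard_normal((p, p))   # first-layer initialization
V2 = rng.standard_normal(p)                           # second-layer initialization

S = (np.abs(V2) >= 0.5) & (np.abs(V2) <= 2.0)         # representative thresholds
Splus = S & (V1 @ mu >= 2 * h)
Sminus = S & (-(V1 @ mu) >= 2 * h)

# W*_1: rows sign(V2_i) * mu / sqrt(|S+| + |S-|) on S+ u S-, zero elsewhere (W*_2 = 0)
scale = 1.0 / np.sqrt(Splus.sum() + Sminus.sum())
W1 = np.zeros((p, p))
W1[Splus | Sminus] = scale * np.sign(V2[Splus | Sminus])[:, None] * mu[None, :]

# margin y_s * (grad_{V1} f_V(x_s) . W*_1), using grad_{V1,i} f_V(x) = V2_i phi'(V1_i . x) x
margins = []
for s in range(n):
    coeff = V2 * dphi(V1 @ X[s], h)
    margins.append(y[s] * float(np.sum(coeff * (W1 @ X[s]))))

print("min margin / sqrt(p):", min(margins) / np.sqrt(p))
print("|S+|/p, |S-|/p:", Splus.mean(), Sminus.mean())
```

On typical draws the printed ratio is a positive constant, which is the behaviour that Proposition C.1 formalizes.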
Proof of Part (c):
Fix any i ∈ [ p ] and s ∈ [ n ] . Recall that V ,i ∼ N (cid:16) , p I (cid:17) and byassumption k x s − y s µ k ≤ r . Thus the random variable V ,i · ( x s − y s µ ) is a zero-mean Gaussianrandom variable with variance at most r p . A standard Gaussian concentration bound impliesthat P [ | V ,i · ( x s − y s µ ) | ≤ h ] ≥ − (cid:18) − c ph r (cid:19) . (42)By a union bound over all i ∈ [ p ] and all s ∈ [ n ] we get P [ ∃ i ∈ [ p ] , s ∈ [ n ] : | ( V ,i ) · ( x s − y s µ ) | ≤ h ] ≥ − np exp (cid:18) − c ph r (cid:19) ≥ − δ where the last inequality follows since r ≤ ph ( c ′ ) log ( pnδ ) and because p ≥ log c ′ ( n/δ ) for a largeenough constant c ′ > . This completes our proof. D Omitted proofs from Section 3.2
In this section we prove Theorem 3.3. We largely follow the high-level analysis strategy presented in [Che+21] to prove that, with high probability, if the width of the network is large enough then gradient descent drives the loss below a sufficiently small threshold under Assumption 3.2. After that we use our general result, Theorem 3.1, to prove that gradient descent continues to reduce the loss beyond this point. We begin by introducing some definitions that are useful in our proofs in this section. All the results in this section are specialized to the case of the Huberized ReLU activation function (see its definition in Equation (1)).

D.1 Additional definitions and notation

Following Chen et al. [Che+21] we define the Neural Tangent random features (henceforth NT) function class. These definitions depend on the initial weights $V^{(1)}$ and radii $\tau, \rho > 0$. We shall choose the values of these radii in terms of problem parameters in the sequel. Define a ball around the initial parameters.

Definition D.1.
For any $V^{(1)}$ and $\rho > 0$, define a ball around this weight matrix as
\[
\mathcal{B}(V^{(1)}, \rho) := \Big\{ V : \max_{\ell \in [L+1]} \big\|V_\ell - V^{(1)}_\ell\big\| \le \rho \Big\}.
\]
We then define the neural tangent kernel function class.
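As a small aside (not from the paper), membership in the ball of Definition D.1 is a layer-by-layer check; the helper below assumes the unadorned norm is the Frobenius norm.

```python
import numpy as np

def max_layer_deviation(V, V_init):
    """max over layers of ||V_l - V^(1)_l|| (Frobenius norm assumed)."""
    return max(np.linalg.norm(Vl - V0l) for Vl, V0l in zip(V, V_init))

def in_ball(V, V_init, rho):
    """Membership test for B(V^(1), rho) from Definition D.1."""
    return max_layer_deviation(V, V_init) <= rho

# toy usage
rng = np.random.default_rng(1)
V_init = [rng.standard_normal((4, 4)) for _ in range(3)]
V = [V0l + 0.01 * rng.standard_normal(V0l.shape) for V0l in V_init]
print(in_ball(V, V_init, rho=0.1))
```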
Definition D.2.
Given initial weights $V^{(1)}$, define the function
\[
F_{V^{(1)},V}(x) := f_{V^{(1)}}(x) + \big(\nabla f_{V^{(1)}}(x)\big) \cdot \big(V - V^{(1)}\big);
\]
then the NT function class with radius $\rho > 0$ is
\[
\mathcal{F}(V^{(1)}, \rho) := \big\{ F_{V^{(1)},V}(x) : V \in \mathcal{B}(V^{(1)}, \rho) \big\}.
\]
We continue by defining the minimal error achievable by any function in this NT function class.
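Before the next definition, a brief illustrative aside: for a tiny two-layer network the linearized predictor $F_{V^{(1)},V}$ can be assembled directly from a numerical gradient taken at initialization. The network, the Huberized-ReLU form, the finite-difference gradient, and all dimensions below are toy assumptions rather than the paper's construction.

```python
import numpy as np

h = 0.05
def phi(z):
    # assumed Huberized ReLU: quadratic on [0, h], linear beyond
    return np.where(z <= 0, 0.0, np.where(z <= h, z ** 2 / (2 * h), z - h / 2))

def f(params, x):
    V1, V2 = params
    return float(V2 @ phi(V1 @ x))

def num_grad(params, x, eps=1e-5):
    """Finite-difference gradient of f with respect to every weight entry."""
    grads = []
    for P in params:
        G = np.zeros_like(P)
        for idx in np.ndindex(P.shape):
            old = P[idx]
            P[idx] = old + eps; up = f(params, x)
            P[idx] = old - eps; down = f(params, x)
            P[idx] = old
            G[idx] = (up - down) / (2 * eps)
        grads.append(G)
    return grads

def F_lin(params_init, params, x):
    """F_{V^(1),V}(x) = f_{V^(1)}(x) + grad f_{V^(1)}(x) . (V - V^(1))."""
    g = num_grad(params_init, x)
    corr = sum(np.sum(gl * (Pl - P0l)) for gl, Pl, P0l in zip(g, params, params_init))
    return f(params_init, x) + corr

rng = np.random.default_rng(0)
p = 20
x = rng.standard_normal(p); x /= np.linalg.norm(x)
init = [np.sqrt(2.0 / p) * rng.standard_normal((p, p)), rng.standard_normal(p)]
pert = [P + 1e-2 * rng.standard_normal(P.shape) for P in init]
print("network output at perturbed weights:", f(pert, x))
print("linearized (NT) prediction        :", F_lin(init, pert, x))
```

For small perturbations the two printed numbers agree closely; Definition D.4 below quantifies the worst-case gap.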
Definition D.3.
For any $V^{(1)}$ and any $\rho > 0$, define
\[
\varepsilon_{\mathrm{NT}}(V^{(1)}, \rho) := \min_{V \in \mathcal{B}(V^{(1)},\rho)} \frac{1}{n} \sum_{s=1}^{n} \log\big(1 + \exp(-y_s F_{V^{(1)},V}(x_s))\big),
\]
that is, the minimal training loss achievable by functions in the NT function class centered at $V^{(1)}$. Also let $V^{\star}(V^{(1)}, \rho) \in \mathcal{B}(V^{(1)}, \rho)$ be an arbitrary minimizer:
\[
V^{\star} \in \operatorname*{arg\,min}_{V \in \mathcal{B}(V^{(1)},\rho)} \frac{1}{n} \sum_{s=1}^{n} \log\big(1 + \exp(-y_s F_{V^{(1)},V}(x_s))\big).
\]
We will be concerned with the maximum approximation error of this tangent kernel over a ball around the initial weight matrix.
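As an aside, since $F_{V^{(1)},V}$ is affine in $V$, the minimization defining $\varepsilon_{\mathrm{NT}}$ is a convex logistic-regression problem over a ball, so projected gradient descent solves it. In the sketch below the features stand in for $\nabla f_{V^{(1)}}(x_s)$ and the offsets for $f_{V^{(1)}}(x_s)$; both are random placeholders, and for simplicity the projection is onto a single Euclidean ball rather than the layer-wise ball of Definition D.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 40, 200, 1.0
y = rng.choice([-1.0, 1.0], size=n)
g = rng.standard_normal((n, d))      # placeholder for the NT features grad f_{V^(1)}(x_s)
f0 = 0.1 * rng.standard_normal(n)    # placeholder for f_{V^(1)}(x_s)

def loss_and_grad(delta):
    """Logistic loss of the affine model and its gradient; delta plays the role of V - V^(1)."""
    margins = y * (f0 + g @ delta)
    loss = float(np.mean(np.log1p(np.exp(-margins))))
    coeff = -y / (1.0 + np.exp(margins)) / n
    return loss, g.T @ coeff

def project(delta, rho):
    nrm = np.linalg.norm(delta)
    return delta if nrm <= rho else (rho / nrm) * delta

delta = np.zeros(d)
for _ in range(500):                 # projected gradient descent
    _, grad = loss_and_grad(delta)
    delta = project(delta - 0.5 * grad, rho)

print("approximate eps_NT for this toy instance:", loss_and_grad(delta)[0])
```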
Definition D.4.
For any $V^{(1)}$ and any $\tau > 0$, define
\[
\varepsilon_{\mathrm{app}}(V^{(1)}, \tau) := \sup_{s \in [n]} \; \sup_{\hat{V}, \widetilde{V} \in \mathcal{B}(V^{(1)},\tau)} \Big| f_{\hat{V}}(x_s) - f_{\widetilde{V}}(x_s) - \nabla f_{\widetilde{V}}(x_s) \cdot \big(\hat{V} - \widetilde{V}\big) \Big|.
\]
Finally, we define the maximum norm of the gradient with respect to the weights of any layer.
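As a further aside, $\varepsilon_{\mathrm{app}}$ can be probed empirically by sampling pairs of weight settings inside the ball of radius $\tau$ and measuring the linearization error; random search only lower-bounds the supremum. The two-layer network, its closed-form gradient, and the activation form below are illustrative assumptions.

```python
import numpy as np

h, p, tau = 0.05, 30, 0.05
def phi(z):  return np.where(z <= 0, 0.0, np.where(z <= h, z ** 2 / (2 * h), z - h / 2))
def dphi(z): return np.clip(z / h, 0.0, 1.0)

def f(V1, V2, x):
    return float(V2 @ phi(V1 @ x))

def grad(V1, V2, x):
    """Closed-form gradient of the two-layer f with respect to (V1, V2)."""
    pre = V1 @ x
    return (V2 * dphi(pre))[:, None] * x[None, :], phi(pre)

def rand_pert(shape, radius, rng):
    D = rng.standard_normal(shape)
    return (radius * rng.uniform() / np.linalg.norm(D)) * D

rng = np.random.default_rng(0)
x = rng.standard_normal(p); x /= np.linalg.norm(x)
V1 = np.sqrt(2.0 / p) * rng.standard_normal((p, p))
V2 = rng.standard_normal(p)

worst = 0.0
for _ in range(200):
    D1a, D2a = rand_pert((p, p), tau, rng), rand_pert(p, tau, rng)
    D1b, D2b = rand_pert((p, p), tau, rng), rand_pert(p, tau, rng)
    g1, g2 = grad(V1 + D1b, V2 + D2b, x)
    err = abs(f(V1 + D1a, V2 + D2a, x) - f(V1 + D1b, V2 + D2b, x)
              - float(np.sum(g1 * (D1a - D1b)) + np.sum(g2 * (D2a - D2b))))
    worst = max(worst, err)
print("empirical lower bound on eps_app:", worst)
```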
Definition D.5.
For any initial weights $V^{(1)}$ and any $\tau > 0$, define
\[
\Gamma(V^{(1)}, \tau) := \sup_{s \in [n]} \; \sup_{\ell \in [L+1]} \; \sup_{V \in \mathcal{B}(V^{(1)},\tau)} \big\|\nabla_{V_\ell} f_V(x_s)\big\|.
\]

D.2 Technical tools required for the neural tangent kernel proofs

We borrow [Che+21, Lemma 5.1], which bounds the average empirical risk over the first $T$ iterations when the iterates remain in a ball around the initial weight matrix. We have translated the lemma into our notation.

Lemma D.6.
Set the step-size α t = α = O (cid:16) L Γ( V (1) ,τ ) (cid:17) for all t ∈ [ T ] . Suppose that givenan initialization V (1) and radius ρ > we pick τ > such that V ⋆ ∈ B ( V (1) , τ ) and V ( t ) ∈B ( V (1) , τ ) for all t ∈ [ T ] . Then it holds that T T X t =1 J ( V ( t ) ) ≤ k V (1) − V ⋆ k − k V ( T +1) − V ⋆ k + 2 T αε NT ( V (1) , ρ ) T α (cid:0) − ε app ( V (1) , τ ) (cid:1) . Technically the setting studied by Chen et al. [Che+21] differs from the setting that we studyin our paper. They deal with neural networks with ReLU activations instead of HuberizedReLU activations that we consider here. However, it is easy to scan through the proof of theirlemma to verify that it does not rely on any specific properties of ReLUs.The next lemma bounds the approximation error of the neural tangent kernel in a neigh-bourhood around the initial weight matrix and provides a bound on the maximum norm of thegradient. The proof of this lemma below relies on several different lemmas that are collectedand proved in Appendix E.
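To make the quantities appearing in Lemma D.6 concrete, the following toy sketch runs gradient descent with a constant step size on a small two-layer Huberized-ReLU network with logistic loss and records the running average of $J(V^{(t)})$ together with the maximum per-layer deviation from initialization. The dimensions, step size, and activation form are illustrative assumptions and are far from the regime the lemma requires.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, h, alpha, T = 30, 20, 0.05, 0.5, 200

def phi(z):  return np.where(z <= 0, 0.0, np.where(z <= h, z ** 2 / (2 * h), z - h / 2))
def dphi(z): return np.clip(z / h, 0.0, 1.0)

X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

V1 = np.sqrt(2.0 / p) * rng.standard_normal((p, p)); V2 = rng.standard_normal(p)
V1_0, V2_0 = V1.copy(), V2.copy()

def loss_and_grads(V1, V2):
    """Logistic loss J(V) and its gradients for the two-layer network."""
    J, G1, G2 = 0.0, np.zeros_like(V1), np.zeros_like(V2)
    for s in range(n):
        pre = V1 @ X[s]
        m = y[s] * float(V2 @ phi(pre))
        J += np.log1p(np.exp(-m)) / n
        c = -y[s] / (1.0 + np.exp(m)) / n
        G2 += c * phi(pre)
        G1 += c * (V2 * dphi(pre))[:, None] * X[s][None, :]
    return J, G1, G2

avg_loss = 0.0
for t in range(1, T + 1):
    J, G1, G2 = loss_and_grads(V1, V2)
    avg_loss += (J - avg_loss) / t
    V1 -= alpha * G1; V2 -= alpha * G2

dev = max(np.linalg.norm(V1 - V1_0), np.linalg.norm(V2 - V2_0))
print(f"average loss over {T} steps: {avg_loss:.4f}, max layer deviation: {dev:.4f}")
```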
Lemma D.7.
For any δ > , suppose that τ = Ω (cid:18) log ( nLδ ) p L (cid:19) and τ = O (cid:18) L log ( p ) (cid:19) , h ≤ τ √ p and p = poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for some sufficiently large polynomial. Then with probability at least − δ over the random initialization V (1) we have(a) ε app ( V (1) , τ ) ≤ O ( p p log( p ) L τ / ) and,(b) Γ( V (1) , τ ) ≤ O ( √ pL ) . Having provided a bound on the approximation error let us continue and show that gradientdescent reaches a weight matrix whose error is comparable to ε NT . Lemma D.8.
For any L ∈ N , δ > , τ = Ω log (cid:0) nLδ (cid:1) p L ! and τ ≤ c ( p log( p )) L , where c is a large enough positive constant, ρ = τ L , h ≤ τ √ p , and p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial, if we run gradient descent with a constant step-size α t = α =Θ (cid:16) pL (cid:17) , for T = l ( L +1) ρ α · ε NT ( V (1) ,ρ ) m iterations, with probability − δ over the random initialization min t ∈ [ T ] J ( V ( t ) ) ≤ ε NT ( V (1) , ρ ) . Our proof closely follows the proof of [Che+21, Theorem 3.3].
Proof
Recall the definition of
\[
V^{\star} \in \operatorname*{arg\,min}_{V \in \mathcal{B}(V^{(1)},\rho)} \frac{1}{n} \sum_{s=1}^{n} \log\big(1 + \exp(-y_s F_{V^{(1)},V}(x_s))\big).
\]
47e would like to apply Lemma D.6 to show that the average loss of the iterates of gradientdescent decreases. To do so we must first ensure that all iterates V ( t ) and V ⋆ remain in a ballof radius τ around initialization.We have assumed that τ ≤ c ( p log( p )) L and that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enoughpolynomial. Therefore if this polynomial is large enough we have τ ≤ O (cid:18) L log ( p ) (cid:19) . Thismeans by we can invoke Lemma D.7 which guarantees that with probability at least − δ ,the approximation error ε app ( V (1) , τ ) ≤ O ( √ pL τ / ) and the maximum norm of the gradient Γ( V (1) , τ ) ≤ O ( √ pL ) . Again recall that τ ≤ c ( p log( p )) L , where c is a large enough positiveconstant. Thus for a large enough value of c the approximation error ε app ( V (1) , τ ) ≤ . Let usassume that this is case going forward.Since ρ = τ L ≤ τ , V ⋆ is clearly in B ( V (1) , τ ) . We will now show that the iterates { V ( t ) } t ∈ [ T ] also lie in this ball by induction. The base case when t = 1 is trivially true. So now assumethat V (1) , . . . , V ( t − lie in this ball and we will proceed to show that V ( t ) also lies in this ball.Since ε app ( V (1) , τ ) ≤ / , by Lemma D.6 t − t − X t ′ =1 J ( V ( t ′ ) ) ≤ k V (1) − V ⋆ k − k V ( t ) − V ⋆ k + 2( t − αε NT ( V (1) , ρ )( t − α which implies X ℓ ∈ [ L +1] k V ( t ) ℓ − V ⋆ℓ k = k V ( t ) − V ⋆ k ≤ k V (1) − V ⋆ k + 2 α ( t − ε NT ( V (1) , ρ ) − α t − X t ′ =1 J ( V ( t ′ ) ) ≤ k V (1) − V ⋆ k + 2 α ( t − ε NT ( V (1) , ρ ) ( i ) ≤ ρ + ( L + 1) ρ ≤ ( L + 2) ρ ≤ Lρ where ( i ) follows since V ∗ ∈ B ( V (1) , ρ ) and t ≤ T = l ( L +1) ρ αε NT ( V (1) ,ρ ) m . Taking square roots impliesthat for each ℓ ∈ [ L + 1] , k V ( t ) ℓ − V ⋆ℓ k ≤ √ Lρ . By the triangle inequality for any ℓ ∈ [ L + 1] k V ( t ) ℓ − V (1) ℓ k ≤ k V ( t ) ℓ − V ⋆ℓ k + k V ⋆ℓ − V (1) ℓ k ≤ √ Lρ + ρ < Lρ = τ. This shows that V ( t ) ℓ ∈ B ( V (1) , τ ) and completes the induction.Now that we have established that V ⋆ and V ( t ) are all in a ball of radius τ around V (1) we canagain invoke Lemma D.6 (recall from above that ε app ( V (1) , τ ) ≤ and Γ( V (1) , τ ) ≤ O ( √ pL ) )to infer that min t ∈ [ T ] J ( V ( t ) ) ≤ T T X t =1 J ( V ( t ) ) ≤ k V (1) − V ⋆ k − k V ( T +1) − V ⋆ k + 2 T αε NT ( V (1) , ρ ) T α ≤ k V (1) − V ⋆ k + 2 T αε NT ( V (1) , ρ ) T α = 2 ε NT ( V (1) , ρ ) + k V (1) − V ⋆ k T α ≤ ε NT ( V (1) , ρ ) , where the last inequality follows since V ⋆ ∈ B ( V (1) , ρ ) , therefore k V (1) − V ⋆ k ≤ ( L + 1) ρ and because T = l ( L +1) ρ αε NT ( V (1) ,ρ ) m . This completes our proof.48inally we shall show that under Assumption 3.2 the error ε NT ( V (1) , ρ ) is bounded withhigh probability. Recall the assumption on the data. Assumption 3.2.
With probability $1 - \delta$ over the random initialization, there exists a collection of matrices $W^{\star} = (W^{\star}_1, \ldots, W^{\star}_{L+1})$ with $\|W^{\star}\| = 1$, such that for all samples $s \in [n]$,
\[
y_s \big( \nabla f_{V^{(1)}}(x_s) \cdot W^{\star} \big) \ge \sqrt{p}\,\gamma,
\]
for some $\gamma > 0$.

Lemma D.9.
Under the Assumption 3.2, for any ε, δ > , if the radius ρ ≥ c hp log( n/δ ) + log (cid:16) ε ) − (cid:17)i √ pγ for some large enough positive absolute constant c then, with probability − δ over the ran-domness in the initialization ε NT ( V (1) , ρ ) = min V ∈B ( V (1) ,ρ ) n n X s =1 log(1 + exp( − y s F V (1) ,V ( x s ))) ≤ ε. Proof
Recall that, by definition, F V (1) ,V ( x ) = f V (1) ( x ) + ( ∇ f V (1) ( x )) · ( V − V (1) ) . By Assumption 3.2 we know that, with probability − δ , there exists W ⋆ with k W ⋆ k = 1 ,such that for all s ∈ [ n ] y i ( ∇ f V (1) ( x s ) · W ⋆ ) ≥ √ pγ. (43)By Lemma E.11 proved below with know that P h | f V (1) ( x s ) | ≤ c p log( n/δ ) i ≥ − δ. (44)For the remainder of the proof let’s assume of that both events in (43) and (44) occur. Thishappens with probability at least − δ . Thus, for any positive λy i [ f V (1) ( x s ) + λ ∇ V f V (1) ( x s ) · W ⋆ ] ≥ λ √ pγ − c p log( n/δ ) . Setting λ = c √ log( n/δ )+log (cid:16) ε ) − (cid:17) √ pγ we infer that y i [ f V (1) ( x s ) + λ ( ∇ V f V (1) ( x s ) · W ⋆ )] ≥ λ √ pγ − c p log( n/δ ) = log (cid:18) ε ) − (cid:19) . (45)Set V = V (1) + λW ⋆ . The neural tangent kernel function at this weight vector is F V (1) ,V ( x ) = f (1) V ( x ) + ∇ V f V (1) ( x ) · ( V − V (1) ) = f (1) V ( x ) + λ ∇ V f V (1) ( x ) · W ⋆ . n n X s =1 log (cid:16) (cid:16) − y i F V (1) ,V ( x s ) (cid:17)(cid:17) ≤ n n X s =1 log (cid:18) (cid:18) − log (cid:18) ε ) − (cid:19)(cid:19)(cid:19) ≤ ε. We can conclude that if we choose the radius ρ ≥ λ k W ⋆ k = λ = c √ log( n/δ )+log (cid:16) ε ) − (cid:17) √ pγ (since k W ⋆ k = 1 by assumption) then there exists a function in the NT function class with trainingerror at most ε . This completes our proof. D.3 Proof of Theorem 3.3
Theorem 3.3.
Consider a network with Huberized ReLU activations. There exists r ( n, L, δ ) =poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) such that for any L ≥ , n ≥ , δ > , under Assumption 3.2 with γ ∈ (0 , if h = h NT and p ≥ r ( n,L,δ ) γ then both of the following hold with probability at least − δ overthe random initialization:1. For all t ∈ [ T ] set the step-size α t = α NT = Θ (cid:16) pL (cid:17) , where T = l ( L +1) ρ n L α NT m then min t ∈ [ T ] J t < n L .
2. Set V ( T +1) = V ( s ) , where s ∈ arg min s ∈ [ T ] J ( V ( s ) ) , and for all t ≥ T + 1 set the step-size α t = α max ( h ) then for all t ≥ T + 1 , J t ≤ O L L +112 (6 p ) L +5 n L · ( t − T − ! . Proof
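Before the proof, a schematic sketch (not from the paper) of the two-stage schedule the theorem analyzes: a constant step size $\alpha_{\mathrm{NT}}$ for $T$ steps, selection of the loss-minimizing iterate, and a restart with step size $\alpha_{\max}(h)$. The objective below is a generic smooth toy problem and every numerical value is a hypothetical placeholder; the theorem specifies the actual $\alpha_{\mathrm{NT}}$, $T$, and $\alpha_{\max}(h)$.

```python
import numpy as np

def two_stage_gd(v0, loss, grad, alpha_nt, T, alpha_max, T2):
    """Stage 1: constant step size alpha_nt, keeping the best iterate.
       Stage 2: restart from the best iterate with step size alpha_max."""
    v, best_v, best_loss = v0.copy(), v0.copy(), loss(v0)
    for _ in range(T):
        v = v - alpha_nt * grad(v)
        if loss(v) < best_loss:
            best_v, best_loss = v.copy(), loss(v)
    v = best_v
    for _ in range(T2):
        v = v - alpha_max * grad(v)
    return v

# toy usage on a smooth convex objective standing in for the training loss
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
loss = lambda v: float(np.sum((A @ v - b) ** 2)) / 50
grad = lambda v: 2.0 * A.T @ (A @ v - b) / 50
v = two_stage_gd(np.zeros(10), loss, grad, alpha_nt=0.05, T=100, alpha_max=0.2, T2=100)
print("final toy loss:", loss(v))
```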
Proof of Part 1:
Define two events E a := (cid:26) ε NT ( V (1) , ρ ) ≤ n L (cid:27) and E b := (cid:26) min t ∈ [ T ] J ( V ( t ) ) ≤ ε NT ( V (1 , ρ ) (cid:27) . We will show that the E := E a ∩ E b occurs with probability at least − δ . That is, P (cid:20) E = (cid:26) min t ∈ [ T ] J ( V ( t ) ) ≤ n L (cid:27)(cid:21) ≥ − δ. (46)The value of ρ is set to be (this was done in Equation (4)) ρ = c √ pγ (cid:20)r log (cid:16) nδ (cid:17) + log (cid:16) n (2+24 L ) (cid:17)(cid:21) > c √ pγ r log (cid:16) nδ (cid:17) + log (cid:16) n (2+24 L ) (cid:17) − (since e z ≤ z when z ∈ [0 , ).50ith this choice of ρ , since c is a large enough absolute constant, Lemma D.9 guarantees that P [ E a ] ≥ − δ, (47)where the probability is over the randomness in the initialization. Continue by setting τ = 3 Lρ = 3 c L √ pγ (cid:20)r log (cid:16) nδ (cid:17) + log (cid:16) n (2+24 L ) (cid:17)(cid:21) . Since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) /γ for a large enough polynomial it is guaranteed that τ = Ω log (cid:0) nLδ (cid:1) p L ! and τ ≤ c ( p log( p )) L where c is the positive absolute constant from the statement of Lemma D.8. Also recall thevalue of h = h NT from Equation (5) h = h NT = (1 + 24 L ) log( n ) √ p ) L +12 L i ) ≤ c L hp log( n/δ ) + log (cid:0) n (2+24 L ) (cid:1)i pγ = τ √ p , where ( i ) follows since γ ∈ (0 , by assumption. Under these choices of τ and h along withthe choice of the step-size α t = Θ (cid:16) pL (cid:17) , and number of steps T , Lemma D.8 guarantees that P [ E b ] ≥ − δ. (48)A union bound over the events (47) and (48) proves the Claim (46), which completes the proofof this first part. Proof of Part 2:
To prove this part of the lemma, we will invoke Theorem 3.1 to guaranteethat the loss decreases in the steps t ∈ { T + 1 , . . . } . We defined V ( T +1) = V ( s ) , where s ∈ arg min t ∈ [ T ] J ( V ( t ) ) , thus we are guaranteed to have J ( V ( T +1) ) ≤ n L < n L , if event E defined above occurs. Define another event E := n k V (1) k ≤ p pL o . Lemma E.12 guarantees that P [ E ] ≥ − δ . Define the “good event” E := E ∩ E . A simpleunion bound shows that P [ E ] ≥ − δ. Assume that this event E occurs for the remainder of this proof. This also establishes thatthe success probability of gradient descent is at least − δ as mentioned in the theoremstatement.To invoke Theorem 3.1 we need to ensure that h < h max . Recall that, in Equation (3a), wedefined h max := min ( L L − log(1 /J T +1 )6 √ p k V ( T +1) k L , ) . ℓ ∈ [ L + 1] , k V ( T +1) ℓ − V (1) k ≤ τ (this fact is implicit in the proof of Lemma D.8). Bythe triangle inequality k V ( T +1) k ≤ k V (1) k + k V ( T +1) − V (1) k ≤ p pL + √ L + 1 τ ≤ p pL by the choice of τ above and since p ≥ poly ( L, log ( nδ )) γ for a large enough polynomial. Thismeans that h max ≥ L L − (1 + 24 L ) log( n )6 √ p ( √ pL ) L = (1 + 24 L ) log( n )6 L +1 p L +12 L = h NT = h. Thus, our choice of h is valid. In this second stage the step-size is chosen to be α max ( h ) = min ( h
832 ( L + 1) pJ T +1 k V ( T +1) k L +5 , k V ( T +1) k L ( L + ) J T +1 log /L (1 /J T +1 ) ) = h
832 ( L + 1) pJ T +1 k V ( T +1) k L +5 , where the first term of the minima wins out above by our choice of h and because k V ( T +1) k ≤√ pL . Thus Theorem 3.1 guarantees that J ( V ( t ) ) ≤ J ( V ( T +1) ) e Q ( α max ( h )) · ( t − T −
1) + 1 where e Q ( · ) was defined in Equation (3c). Thus, e Q ( α max ( h )) = L ( L + ) α max ( h ) J T +1 log /L (1 /J T +1 ) k V ( T +1) k = L ( L + ) J T +1 log /L (1 /J T +1 ) k V ( T +1) k × h
832 ( L + 1) pJ T +1 k V ( T +1) k L +5 = hL ( L + )832( L + 1) p k V ( T +1) k L +7 ≥ (1 + 24 L )( L + ) log( n )832 √ L ( L + 1) p (6 p ) L +12 (6 pL ) L +72 ≥ n )104 √ L L +72 ( L + 1) (6 p ) L +102 ≥ √ L L +72 ( L + 1) (6 p ) L +102 t ≥ T + 1 J ( V ( t ) ) ≤ J ( V ( T +1) ) Q · ( t − T −
1) + 1 ≤ n L Q · ( t − T −
1) + 1 ≤ n L √ L L +72 ( L + 1) (6 p ) L +102 ( t − T −
1) + 104 L L +72 ( L + 1) (6 p ) L +102 < n L √ L L +72 ( L + 1) (6 p ) L +102 ( t − T − O L L +112 (6 p ) L +5 n L · ( t − T − ! , this completes the proof. E Proof of Lemma D.7
In this section we prove Lemma D.7 that controls the approximation error ε app ( V (1) , τ ) andestablishes a bound on the maximum norm of the gradient Γ( V (1) , τ ) . The proof of thislemma requires analogs of several lemmas from [ALS19; Zou+20] adapted to our setting. InAppendix E.1 we prove that several useful properties hold at initialization with high proba-bility. In Appendix E.2 we show that some of these properties extend to weight matrices closeto initialization and in Appendix E.3 we prove Lemma D.7.Throughout this section we analyze the initialization scheme described in Section 3.2. Thisscheme is as follows: for all ℓ ∈ [ L ] the entries of V (1) ℓ are drawn independently from N (cid:16) , p (cid:17) and the entries of V (1) L +1 are drawn independently from N (0 , . Again, the results of thisappendix apply only to the Huberized ReLU (see definition in (1)). E.1 Properties at initialization
In the next lemma we show that several useful properties hold with high probability at ini-tialization.
Lemma E.1.
For any δ > , suppose that h < √ pL , p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enoughpolynomial and τ = Ω (cid:18) log ( nLδ ) p L (cid:19) . Then with probability at least − δ over the randomness in V (1) we have the following:(a) For all s ∈ [ n ] and all ℓ ∈ [ L ] : k x V (1) ℓ,s k ∈ (cid:20) , (cid:21) . (b) For all all ℓ ∈ [ L ] , k V (1) ℓ k op ≤ O (1) , and k V (1) L +1 k ≤ O ( √ p ) . c) For all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) op ≤ O ( L ) , and for any ≤ ℓ ≤ L (cid:13)(cid:13)(cid:13) V (1) L +1 Σ V (1) L,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) op ≤ O ( √ pL ) . (d) For all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ a (cid:13)(cid:13)(cid:13) ≤ k a k for all vectors a with k a k ≤ k = cp log( p ) L , where c is a small enough positive absoluteconstant.(e) For all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13) a ⊤ V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) ≤ O ( k a k ) for all vectors a with k a k ≤ k = cp log( p ) L , where c is a small enough positive absoluteconstant.(f ) For all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , | a ⊤ V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ b | ≤ O k a kk b k p k log( p ) √ p ! for all vectors a, b with k a k , k b k ≤ k = cp log( p ) L , where c is a small enough positiveabsolute constant.(g) For all s ∈ [ n ] and all ≤ ℓ ≤ L , | V (1) L +1 Σ V (1) L,s · · · Σ V (1) ℓ ,s V (1) ℓ a | ≤ O (cid:16) k a k p k log( p ) (cid:17) for all vectors a with k a k ≤ k = cp log( p ) L , where c is a small enough positive absoluteconstant.(h) For β = O (cid:16) L τ / √ p (cid:17) and S ℓ,s ( β ) := n j ∈ [ p ] : | V (1) ℓ,j x V (1) ℓ,s | ≤ β o , where V (1) ℓ,j refers to the j th row of V (1) ℓ , for all ℓ ∈ [ L ] and all s ∈ [ n ] : |S ℓ,s ( β ) | ≤ O ( p / β ) = O ( pL τ / ) . We will prove this lemma part by part and show that each of the eight properties holds withprobability at least − δ/ and take a union bound at the end. We show that each of the partholds with this probability in the eight lemmas (Lemmas E.2-E.9) that follow.54 .1.1 Proof of Part (a)Lemma E.2. For any δ > , suppose that h < √ pL and p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a largeenough polynomial, then with probability at least − δ/ over the randomness in V (1) we havethat for all s ∈ [ n ] and all ℓ ∈ [ L ] : k x V (1) ℓ,s k ∈ (cid:20) , (cid:21) . Proof
Fix any layer ℓ ∈ [ L ] and any sample s ∈ [ n ] . We will prove the result for this layerand sample, and apply a union bound at the end. To ease notation we drop V (1) from thesuperscript of x V (1) ℓ,s and refer to V (1) ℓ as simply V ℓ .By definition x ℓ,s = φ ( V ℓ x ℓ − ,s ) . Conditioned on x ℓ − ,s , each coordinate of V ℓ x ℓ − ,s is distributed as N (cid:16) , k x ℓ − ,s k p (cid:17) , sinceeach entry of V ℓ is drawn independently from N (0 , p ) . Let ¯ φ ( z ) = max { , z } denote the ReLUactivation function. Then we know that ¯ φ ( z ) − h ≤ φ ( z ) ≤ ¯ φ ( z ) for any z ∈ R . Let ( x ℓ,s ) i denote the i th coordinate of x ℓ and let V ℓ,i denote the i th row of V ℓ . Therefore conditioned on x ℓ − ,s E (cid:2) ( x ℓ,s ) i (cid:12)(cid:12) x ℓ − ,s (cid:3) = E (cid:2) φ ( V ℓ,i x ℓ − ,s ) (cid:12)(cid:12) x ℓ − ,s (cid:3) ≥ E (cid:2) ¯ φ ( V ℓ,i x ℓ − ,s ) (cid:12)(cid:12) x ℓ − ,s (cid:3) − h E (cid:2) ¯ φ ( V ℓ,i x ℓ − ,s ) (cid:12)(cid:12) x ℓ − ,s (cid:3) + h ( i ) = 12 E h ( V ℓ,i x ℓ − ,s ) (cid:12)(cid:12) x ℓ − ,s i − h E (cid:2) | V ℓ,i x ℓ − ,s | (cid:12)(cid:12) x ℓ − ,s (cid:3) h k x ℓ − ,s k p − h k x ℓ − ,s k√ pπ + h , where ( i ) follows since ¯ φ ( z ) = 0 if z < and the distribution of V ℓ,i x ℓ − ,s is symmetric aboutthe origin. Therefore summing up over all i ∈ [ p ] we find E (cid:2) k x ℓ,s k | x ℓ − ,s (cid:3) = X i ∈ [ p ] E (cid:2) ( x ℓ,s ) i | x ℓ − ,s (cid:3) ≥ k x ℓ − ,s k − √ ph k x ℓ − ,s k√ π + h p ≥ (cid:18) k x ℓ − ,s k − h √ p (cid:19) . (49)Similarly we can also demonstrate an upper bound of E (cid:2) k x ℓ,s k | x ℓ − ,s (cid:3) ≤ k x ℓ − ,s k since φ ( z ) ≤ ¯ φ ( z ) for any z as stated previously.Let k·k ψ denote the sub-Gaussian norm of a random variable (see Definition F.1) and let k·k ψ denote the sub-exponential norm (see Definition F.2). Since the function φ is -Lipschitz,conditioned on x ℓ − ,s , k ( x ℓ,s ) i k ψ = k φ ( V ℓ,i x ℓ − ,s ) k ψ ≤ k φ ( V ℓ,i x ℓ − ,s ) − E [ φ ( V ℓ,i x ℓ − ,s ) | x ℓ − ,s ] k ψ + k E [ φ ( V ℓ,i x ℓ − ,s ) | x ℓ − ,s ] k ψ ( i ) ≤ c k x ℓ − ,s k√ p + k E [ φ ( V ℓ,i x ℓ − ,s ) | x ℓ − ,s ] k ψ ( ii ) ≤ c k x ℓ − ,s k√ p (50)55here ( i ) follows by invoking Lemma F.4, and ( ii ) follows since we showed above that k E [ φ ( V ℓ,i x ℓ − ,s ) | x ℓ − ,s ] k ψ = | E [ φ ( V ℓ,i x ℓ − ,s ) | x ℓ − ,s ] | ≤ q E [ φ ( V ℓ,i x ℓ − ,s ) | x ℓ − ,s ] ≤ k x ℓ − ,s k√ p . Therefore k ( x ℓ,s ) i k ψ ≤ k ( x ℓ,s ) i k ψ ≤ c k x ℓ − ,s k p by Lemma F.3. Since the random variables ( x ℓ,s ) , . . . , ( x ℓ,s ) p are conditionally independent given x ℓ − ,s , applying Bernstein’s inequality(see Theorem F.6) we get that for any η ∈ (0 , P (cid:16)(cid:12)(cid:12)(cid:12) k x ℓ,s k − E (cid:2) k x ℓ,s k | x ℓ − ,s (cid:3) (cid:12)(cid:12)(cid:12) ≤ η k x ℓ − ,s k (cid:12)(cid:12)(cid:12) x ℓ − ,s (cid:17) ≥ − − c min ( η k x ℓ − ,s k p × (cid:0) c k x ℓ − ,s k /p (cid:1) , η k x ℓ − ,s k c k x ℓ − ,s k /p )! ≥ − (cid:0) − c min (cid:8) η p, ηp (cid:9)(cid:1) ≥ − (cid:0) − c pη (cid:1) . We established above that the expected value (cid:18) k x ℓ − ,s k − h √ p (cid:19) ≤ E (cid:2) k x ℓ,s k | x ℓ − ,s (cid:3) ≤ k x ℓ − ,s k . Thus P k x ℓ,s k ∈ "(cid:18) k x ℓ − ,s k − h √ p (cid:19) − η k x ℓ − ,s k , k x ℓ − ,s k (1 + η ) x ℓ − ,s ! ≥ − (cid:0) − c pη (cid:1) . Taking a union bound over all samples and all hidden layers we find that: for all s ∈ [ n ] andall ℓ ∈ [ L ] P k x ℓ,s k ∈ "(cid:18) k x ℓ − ,s k − h √ p (cid:19) − η k x ℓ − ,s k , k x ℓ − ,s k (1 + η ) ≥ − nL exp (cid:0) − c pη (cid:1) . 
This implies that, for all s ∈ [ n ] and ℓ ∈ [ L ] P (cid:18)(cid:12)(cid:12)(cid:12) k x ℓ,s k − k x ℓ − ,s k (cid:12)(cid:12)(cid:12) ≤ η k x ℓ − ,s k + h √ p k x ℓ − ,s k + h p (cid:19) ≥ − nL exp (cid:0) − c pη (cid:1) . Setting η = L and because by assumption h √ p < L = η we get that for all s and ℓ ∈ [ L ] : P (cid:18)(cid:12)(cid:12)(cid:12) k x ℓ,s k − k x ℓ − ,s k (cid:12)(cid:12)(cid:12) ≤ η k x ℓ − ,s k + η k x ℓ − ,s k + η (cid:19) ≥ − nL exp (cid:16) − c pL (cid:17) (51)Let us assume that the event of (51) holds for the rest of this proof. Starting with ℓ = 1 weknow that k x ,s k = k x s k = 1 , thus if the event in the previous display holds then by the choiceof η = 1 / (50 L ) we have that k x ,s k ∈ [1 − η, η ] . z ∈ [0 , we have that (1 + z ) / ≤ z and (1 − z ) / ≥ − z . Thus, by takingsquare roots k x ,s k ∈ [1 − η, η ] . We will now prove that k x ℓ,s k ∈ [1 − ℓη, ℓη ] using an inductive argument over ℓ = 1 , . . . , L .The base case when ℓ = 1 of course holds by the display above. Now let us prove it for a layer ℓ > assuming it holds at layer ℓ − .Let us first prove the upper bound on k x ℓ,s k , the lower bound will follow by same exactsame logic. If the event in (51) holds then we know that k x ℓ,s k − k x ℓ − ,s k ≤ η k x ℓ − ,s k + η k x ℓ − ,s k + η which implies that k x ℓ,s k ≤ k x ℓ − ,s k (1 + η ) + η k x ℓ − ,s k + η k x ℓ − ,s k (cid:18) η + η k x ℓ − ,s k + η k x ℓ − ,s k (cid:19) ( i ) ≤ k x ℓ − ,s k (cid:18) η + 10 η η (cid:19) = k x ℓ − ,s k (cid:18) η η (cid:19) ( ii ) ≤ k x ℓ − ,s k (cid:18) η (cid:19) where ( i ) follows since by the inductive hypothesis k x ℓ − ,s k ≥ − ℓ − η and because η = L , therefore k x ℓ − ,s k ≥ − ℓ − L ≥ > , and ( ii ) again follows because η = L and L ≥ . Taking square roots we find k x ℓ,s k ≤ k x ℓ − ,s k r η ≤ (1 + 3( ℓ − η ) r η (by the IH) ( i ) ≤ (1 + 3( ℓ − η ) (cid:18) η (cid:19) = 1 + 3 ( ℓ − η + 20 η ℓ − η ( ii ) ≤ ℓη where ( i ) follows since √ z ≤ z and ( ii ) follows since η = L and L ≥ . This establishesthe desired upper bound on k x ℓ,s k . As mentioned above, the lower bound (1 − ℓη ) ≤ k x ℓ,s k follows by mirroring the logic. This completes our induction and proves that for all s and all ℓ with probability at least − δ/ k x ℓ,s k ∈ [1 − ℓη, ℓη ] . Our choice of η = L establishes that k x ℓ,s k ∈ (cid:20) , (cid:21) (52)for all s ∈ [ n ] and ℓ ∈ [ L ] with probability at least − nL exp (cid:0) − c pL (cid:1) ≥ − δ/ , which followssince p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial. This wraps up our proof.57 .1.2 Proof of Part (b)Lemma E.3. For any δ > suppose that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial,then with probability at least − δ/ over the randomness in V (1) :for all ℓ ∈ [ L ] , k V (1) ℓ k op ≤ O (1) , and k V (1) L +1 k ≤ O ( √ p ) . Proof
For any fixed ℓ ∈ [ L ] recall that each entry of V (1) ℓ is drawn independently from N (cid:16) , p (cid:17) . Thus, by invoking [Ver18, Theorem 4.4.5] we know that k V (1) ℓ k op ≤ O (1) with probability at least − exp( − Ω( p )) . The entries of V (1) L +1 are drawn from N (0 , , thereforeby Theorem F.7 we find that k V (1) L +1 k ≤ p with probability − exp( − Ω( p )) . By a union bound over the L + 1 layers and noting that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) yields thatfor all ℓ ∈ [ L ] , k V (1) ℓ k op ≤ O (1) , and k V (1) L +1 k ≤ O ( √ p ) with probability at least − δ/ as claimed. E.1.3 Proof of Part (c)Lemma E.4.
For any δ > suppose that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial,then with probability at least − δ/ over the randomness in V (1) we have that for all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) op ≤ O ( L ) , and all ≤ ℓ ≤ L (cid:13)(cid:13)(cid:13) V (1) L +1 Σ V (1) L,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) op ≤ O ( √ pL ) . Proof
We begin by analyzing the case where ℓ < L + 1 . A similar analysis works to prove theclaim when ℓ = L + 1 . This is because the variance of each entry of V (1) L +1 is , whereas when ℓ < L + 1 the variance of each entry of V (1) ℓ is /p . Therefore the bound is simply multipliedby a factor of √ p in the case when ℓ = L + 1 .Fix the layers ≤ ℓ ≤ ℓ ≤ L and fix the sample index s . At the end of the proof we shalltake a union bound over all pairs of layers and all samples. Now to ease notation let us denote V (1) ℓ by simply V ℓ and let Σ V (1) ℓ,s be denoted by Σ ℓ,s .To bound the operator norm (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) op = sup a : k a k =1 (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ a (cid:13)(cid:13)(cid:13) (53)58e will first consider a supremum over vectors that are non-zero only on an arbitrary fixedsubset S ⊆ [ p ] with cardinality | S | ≤ (cid:4) c pL (cid:5) , where c is small enough absolute constant. Thatis, we shall bound Ξ := sup a : k a k =1 , supp( a ) ⊆ S (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ a (cid:13)(cid:13)(cid:13) . Using this we will then bound the operator norm in (53) by decomposing any unit vector a into p j c pL k vectors that are non-zero only on subsets of size at most (cid:4) c pL (cid:5) .Let us begin by first bounding Ξ . Recall that for a fixed unit vector z ∈ S p − by Part (b) ofLemma E.10 that is proved below we get k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z k ≤ with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) .We take an / -net (see the definition of an ε -net in Definition F.8) of unit vectors { a i } mi =1 whose coordinates are non-zero only on this particular subset S , with respect to the Euclideannorm. There exists such an / -net of size m = 9 c p/L (see Lemma F.9). By a union bound, ∀ i ∈ [ m ] , k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a i k ≤ (54)with probability at least − O ( nL · c p/L ) e − Ω (cid:16) pL (cid:17) = 1 − e − Ω (cid:16) pL (cid:17) since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial, and because c is a small enough constant. We will now proceedto show that if the “good event” (54) regarding the / -net holds then we can use it to establishguarantees for all unit vectors a that are only non-zero on this subset S . To see this, if ζ ( a ) maps each unit vector a with support contained in S to its nearest neighbor in { a , . . . , a m } ,then if the event in (54) holds then Ξ = sup a : k a k =1 , supp( a ) ⊆ S k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a k = sup a : k a k =1 , supp( a ) ⊆ S k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( a − ζ ( a ) + ζ ( a )) k≤ sup j ∈ [ m ] k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a j k + sup a : k a k =1 , supp( a ) ⊆ S k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( a − ζ ( a )) k ( i ) ≤ sup j ∈ [ m ] k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a j k + 14 sup a : k a k =1 , supp( a ) ⊆ S (cid:13)(cid:13)(cid:13)(cid:13) V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( a − ζ ( a )) k a − ζ ( a ) k (cid:13)(cid:13)(cid:13)(cid:13) ( ii ) ≤ where and ( i ) follows since k a − ζ ( a ) k ≤ / , Inequality ( ii ) follows since we assumed theevent (54) to hold and by the definition of Ξ . By rearranging terms we find that for any unitvector a that is only non-zero on subset S we have that k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a k ≤ − × < . (55)with the same probability that is at least − e − Ω (cid:16) pL (cid:17) .59s mentioned above we will now consider a partition of [ p ] = S ∪ . . . 
∪ S q , such that forall i ∈ [ q ] , | S i | ≤ (cid:4) c pL (cid:5) and the number of the sets in the partition q ≤ p j c pL k = l L c m . Givenan arbitrary unit vector b ∈ S p − , we can decompose it as b = u + . . . + u q , where each u i isnon-zero only on the set S i . Invoking the triangle inequality k V ℓ Σ ℓ − ,s · · · Σ ℓ V ℓ b k ≤ q X i =1 k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ u i k = q X i =1 (cid:13)(cid:13)(cid:13)(cid:13) V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ u i k u i k (cid:13)(cid:13)(cid:13)(cid:13) k u i k . By applying the result of (55) to each term in the sum above along with a union bound overthe q sets S , . . . , S q we find that: for all unit vectors b ∈ S p − k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b k ≤ q X i =1 k u i k ≤ √ q q X i =1 k u i k ! / = 3 √ q = O ( L ) with probability at least − qe − Ω (cid:16) pL (cid:17) = 1 − O ( L ) e − Ω (cid:16) pL (cid:17) . The definition of the operatornorm of a matrix k A k op = sup v : k v k =1 k Av k along with the previous display establishes theclaim for this particular pair of layers ℓ and ℓ and sample s . A union bound over pairs oflayers and all samples to establish that: for all pairs ≤ ℓ ≤ ℓ ≤ L and all s ∈ [ n ] k V ℓ Σ ℓ − · · · Σ ℓ V ℓ k op ≤ O ( L ) (56)with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) . As claimed above, a similar analysis shows thatfor all s ∈ [ n ] and all ℓ ∈ [ L ] k V L +1 Σ L · · · Σ ℓ V ℓ k op ≤ O ( √ pL ) (57)with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) . Since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) we can ensure thatboth events in (56) and (57) occur simultaneously with probability at least − δ/ . E.1.4 Proof of Part (d)Lemma E.5.
For any δ > , suppose that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial,then with probability at least − δ/ over the randomness in V (1) we have that for all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ a (cid:13)(cid:13)(cid:13) ≤ k a k for all vectors a with k a k ≤ k = cp log( p ) L , where c is a small enough positive absolute constant. Proof
We fix the layers ≤ ℓ ≤ ℓ ≤ L and fix the sample index s . At the end of the proofwe shall take a union bound over all pairs of layers and all samples. Again, to ease notation,let us denote V (1) ℓ by simply V ℓ and let Σ V (1) ℓ,s be denoted by Σ ℓ,s .60or a fixed unit vector z ∈ S p − by Part (b) of Lemma E.10 that is proved below we have k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z k ≤ (58)with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) . Consider a / -net of k -sparse unit vectors { a i } mi =1 , where m = (cid:0) pk (cid:1) k (such a net exists, see Lemma F.10).Using (58) and taking a union bound, we find that for all vectors { a i } mi =1 k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a i k ≤ with probability at least − O (cid:0)(cid:0) pk (cid:1) k · nL (cid:1) e − Ω (cid:16) pL (cid:17) .Now by mirroring the logic that lead from inequality (54) to inequality (55) in the proof ofthe previous lemma, we can establish that for any vector a that is k -sparse (cid:13)(cid:13)(cid:13)(cid:13) V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a k a k (cid:13)(cid:13)(cid:13)(cid:13) ≤ again with probability that is at least − (cid:0) pk (cid:1) k · O ( nL ) e − Ω (cid:16) pL (cid:17) . A union bound over all pairsof layers and all samples we find that: for all ≤ ℓ ≤ ℓ ≤ L , for all s ∈ [ n ] and for all vectors a that are k -sparse (cid:13)(cid:13)(cid:13)(cid:13) V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ a k a k (cid:13)(cid:13)(cid:13)(cid:13) ≤ with probability at least − O (cid:18)(cid:18) pk (cid:19) k · n L (cid:19) e − Ω (cid:16) pL (cid:17) ≥ − O (cid:18)(cid:16) epk (cid:17) k k · n L (cid:19) e − Ω (cid:16) pL (cid:17) (since (cid:0) pk (cid:1) ≤ (cid:0) epk (cid:1) k ) = 1 − O (cid:18) epk (cid:19) k · n L ! e − Ω (cid:16) pL (cid:17) = 1 − O (cid:0) n L (cid:1) e − Ω (cid:16) pL − k log(9 ep ) (cid:17) ≥ − δ/ where the last inequality follows since k ≤ cp log( p ) L where c is a small enough absolute constantand p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial. This completes the proof. E.1.5 Proof of Part (e)Lemma E.6.
For any δ > , suppose that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial,then with probability at least − δ/ over the randomness in V (1) we have that for all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13) a ⊤ V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ (cid:13)(cid:13)(cid:13) ≤ O ( k a k ) for all vectors a with k a k ≤ k = cp log( p ) L , where c is a small enough positive absolute constant. roof We fix the layers ≤ ℓ ≤ ℓ ≤ L and fix the sample index s . At the end of the proofwe shall take a union bound over all pairs of layers and all samples. In the proof let us denote V (1) ℓ by simply V ℓ and let Σ V (1) ℓ,s be denoted by Σ ℓ,s .For any fixed vector z we know from Part (a) of Lemma E.10 that with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) over the randomness in ( V ℓ − , . . . , V ) k Σ ℓ − ,s V ℓ − . . . Σ ℓ ,s V ℓ z k ≤ k z k . (59)Recall that the entries of V ℓ are drawn independently from N (0 , p ) . Thus, conditioned onthis event above, for any fixed vector w the random variable w ⊤ V ℓ (Σ ℓ − ,s · · · Σ ℓ V ℓ z ) is amean-zero Gaussian with variance at most k w k k z k p . Thus over the randomness in V ℓ P (cid:18)(cid:12)(cid:12)(cid:12) w ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z (cid:12)(cid:12)(cid:12) ≤ L k w kk z k (cid:12)(cid:12)(cid:12) V ℓ − , . . . , V (cid:19) ≥ − e − Ω (cid:16) pL (cid:17) . (60)By union bound over the events in (59) and (60) we have P (cid:18)(cid:12)(cid:12)(cid:12) w ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z (cid:12)(cid:12)(cid:12) ≤ L k w kk z k (cid:19) ≥ − O ( nL ) e − Ω (cid:16) pL (cid:17) . (61)Similar to the proof of Lemma E.4 our strategy will be to first bound sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , supp( b ) ∈⊆ S (cid:12)(cid:12)(cid:12) a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b (cid:12)(cid:12)(cid:12) where S is a fixed subset of [ p ] with | S | ≤ c pL , where c is a small enough absolute constant. Let { z i } ri =1 be an / -net of unit vectors with respect to the Euclidean norm whose coordinatesare non-zero only on this subset S . There exists such an / -net of size r = 9 c p/L (seeLemma F.9). Let { w i } mi =1 be a / -net of k -sparse unit vectors in Euclidean norm of size m = (cid:0) pk (cid:1) k (Lemma F.10 guarantees the existence of such a net). Therefore by using (61) andtaking a union bound we get that ∀ i ∈ [ r ] , j ∈ [ m ] (cid:12)(cid:12)(cid:12) w ⊤ j V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z i (cid:12)(cid:12)(cid:12) ≤ L (62)with probability at least − mrO ( nL ) e − Ω (cid:16) pL (cid:17) = 1 − O (9 c p/L (cid:0) pk (cid:1) k nL ) e − Ω (cid:16) pL (cid:17) = 1 − e − Ω (cid:16) pL (cid:17) , since k = cp log( p ) L where both c and c are small enough absolute constants andbecause p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial.We will now demonstrate that if the “good event” in (62) holds then we can use this toestablish a similar guarantee for all k -sparse unit vectors a and all unit vectors b that are onlynon-zero on the subset S . To see this, as before, suppose ζ maps any unit-length vector withsupport in S to its nearest neighbor in { z , . . . , z r } and λ maps any k -sparse unit vector to its62earest neighbor in { w , . . . , w m } . 
Then if the event in (62) holds, we have Ξ := sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , supp( b ) ⊆ S (cid:12)(cid:12)(cid:12) a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b (cid:12)(cid:12)(cid:12) = sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , supp( b ) ⊆ S (cid:12)(cid:12)(cid:12) ( a − λ ( a ) + λ ( a )) ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( b − ζ ( b ) + ζ ( b )) (cid:12)(cid:12)(cid:12) ≤ sup i ∈ [ m ] ,j ∈ [ r ] (cid:12)(cid:12)(cid:12) w ⊤ i V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z j (cid:12)(cid:12)(cid:12) + sup a : k a k =1 , k a k ≤ k,j ∈ [ r ] (cid:12)(cid:12)(cid:12) ( a − λ ( a )) ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z j (cid:12)(cid:12)(cid:12) + sup i ∈ [ m ] ,b : k b k =1 , supp( b ) ⊆ S (cid:12)(cid:12)(cid:12) w ⊤ i V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( b − ζ ( b )) (cid:12)(cid:12)(cid:12) + sup a : k a k =1 , k a k ≤ k sup i ∈ [ m ] ,b : k b k =1 , supp( b ) ⊆ S (cid:12)(cid:12)(cid:12) ( a − λ ( a )) ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( b − ζ ( b )) (cid:12)(cid:12)(cid:12) ( i ) ≤ L + Ξ4 + Ξ4 + Ξ16 ≤ L + 916 Ξ (63)where ( i ) follows by the definition of Ξ , because we assume that the event in (62) holds, andalso because k a − λ ( a ) k ≤ / and k b − ζ ( b ) k ≤ / .By rearranging terms in the previous display we can infer that Ξ := sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , supp( b ) ⊆ S (cid:12)(cid:12)(cid:12) a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b (cid:12)(cid:12)(cid:12) ≤ (cid:0) − (cid:1) L < L (64)with probability at least − e − Ω (cid:16) pL (cid:17) .Finally, when b is an arbitrary unit vector we can partition [ p ] = S ∪ . . . ∪ S m , such that forall i ∈ [ m ] , | S i | ≤ (cid:4) c pL (cid:5) and the number of the sets in the partition q ≤ p j c pL k = l L c m . Thus,given an arbitrary unit vector b ∈ S p − , we can decompose it as b = u + . . . + u q , where each u i is non-zero only on the set S i . By invoking the triangle inequality | a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ V ℓ b | ≤ q X i =1 | a ⊤ V ℓ Σ ℓ − · · · Σ ℓ ,s V ℓ u i | . By applying the result of (64) to each term in the sum above we find that: for all k -sparseunit vectors a and all unit vectors b ∈ S p − | a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b | ≤ L q X i =1 k u i k ≤ L √ q q X i =1 k u i k ! / = 10 √ qL = O (1) with probability at least − qe − Ω (cid:16) pL (cid:17) = 1 − O ( L ) e − Ω (cid:16) pL (cid:17) . In other words for all k -sparseunit vectors a k a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ k = sup b : k b k =1 | a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b | ≤ O (1) with the same probability that is at least − O ( L ) e − Ω (cid:16) pL (cid:17) . By a union bound over the pairsof layers ℓ and ℓ and all samples s ∈ [ n ] we establish that: for all pairs ≤ ℓ ≤ ℓ ≤ L , all63 ∈ [ n ] and all k -sparse vectors a (cid:13)(cid:13)(cid:13)(cid:13) a ⊤ k a k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ (cid:13)(cid:13)(cid:13)(cid:13) ≤ O (1) with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) . Since, p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) we can ensure thatthis happens with probability at least − δ/ which completes the proof. E.1.6 Proof of Part (f )Lemma E.7.
For any δ > , suppose that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial,then with probability at least − δ/ over the randomness in V (1) we have that for all s ∈ [ n ] and all ≤ ℓ ≤ ℓ ≤ L , | a ⊤ V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ b | ≤ O k a kk b k s k log( p ) p ! for all vectors a, b with k a k , k b k ≤ k = cp log( p ) L , where c is a small enough positive absoluteconstant. Proof
Fix the layers ≤ ℓ ≤ ℓ ≤ L and the sample index s . At the end of the proof we shalltake a union bound over all pairs of layers and all samples. In the proof let us denote V (1) ℓ bysimply V ℓ and let Σ V (1) ℓ,s be denoted by Σ ℓ,s .For any fixed vector z we know from Part (a) of Lemma E.10 that with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) over the randomness in ( V ℓ − , . . . , V ) k Σ ℓ − ,s V ℓ − . . . Σ ℓ ,s V ℓ z k ≤ k z k . (65)Recall that the entries of V ℓ are drawn independently from N (0 , p ) . Thus, conditioned onthis event above, for any fixed vector w the random variable w ⊤ V ℓ (Σ ℓ − ,s · · · Σ ℓ V ℓ z ) is amean-zero Gaussian with variance at most k w k k z k p . Therefore over the randomness in V ℓ P (cid:12)(cid:12)(cid:12) w ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z (cid:12)(cid:12)(cid:12) ≥ c s k log( p ) p k w kk z k (cid:12)(cid:12)(cid:12) V ℓ − , . . . , V ! ≤ e − k log( p )128 c , (66)where c is a small enough positive absolute constant that will be chosen only as a functionof the constant c . A union bound over the events in (65) and (66) yields P (cid:12)(cid:12)(cid:12) w ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z (cid:12)(cid:12)(cid:12) ≤ c s k log( p ) p k w kk z k ! ≥ − O ( nL ) e − Ω (cid:16) pL (cid:17) − e − k log( p )128 c = 1 − e − Ω (cid:16) pL (cid:17) − e − k log( p )128 c (67)where the last equality holds since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial.64et { w i } mi =1 be a / -net of k -sparse unit vectors in Euclidean norm of size m = (cid:0) pk (cid:1) k (sucha net exists, see Lemma F.10). Therefore by using (67) and taking a union bound we find that ∀ i, j ∈ [ m ] (cid:12)(cid:12)(cid:12) w ⊤ i V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ w j (cid:12)(cid:12)(cid:12) ≤ c s k log( p ) p (68)with probability at least − m (cid:18) e − Ω( p/L ) + e − k log( p )128 c (cid:19) = 1 − O (cid:18)(cid:18) pk (cid:19) k (cid:19) ! ( e − Ω (cid:16) pL (cid:17) + e − k log( p )128 c ) ( i ) ≥ − O (cid:18) epk (cid:19) k ! ( e − Ω( p/L ) + e − k log( p )128 c )= 1 − (cid:18) e − Ω (cid:16) pL (cid:17) +2 k log(9 ep ) + e − k log( p )128 c +2 k log(9 ep ) (cid:19) ( ii ) = 1 − (cid:18) e − Ω (cid:16) pL (cid:17) + e − Ω( k log( p )) (cid:19) ( iii ) = 1 − e − Ω (cid:16) pL (cid:17) where ( i ) follows since (cid:0) pk (cid:1) ≤ (cid:0) epk (cid:1) k , ( ii ) follows since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) , k = cp log( p ) L andbecause c is a small enough absolute constant (which can be chosen given the constant c ),and ( iii ) again follows since k = cp log( p ) L .We will now demonstrate that if the “good event” in (68) holds then we can use this toestablish a similar guarantee for all k -sparse unit vectors a and b . Suppose, for each k -sparseunit vector w , that ζ ( w ) is its nearest neighbor in { w , . . . , w m } . 
Then, if the event in (68)holds, Ξ := sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , k b k ≤ k (cid:12)(cid:12)(cid:12) a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b (cid:12)(cid:12)(cid:12) = sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , k b k ≤ k (cid:12)(cid:12)(cid:12) ( a − ζ ( a ) + ζ ( a )) ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( b − ζ ( b ) + ζ ( b )) (cid:12)(cid:12)(cid:12) ≤ sup i,j ∈ [ m ] (cid:12)(cid:12)(cid:12) w ⊤ i V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ w j (cid:12)(cid:12)(cid:12) + sup a : k a k =1 , k a k ≤ k,j ∈ [ m ] (cid:12)(cid:12)(cid:12) ( a − ζ ( a )) ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ w j (cid:12)(cid:12)(cid:12) + sup i ∈ [ m ] ,b : k b k =1 , k b k ≤ k (cid:12)(cid:12)(cid:12) w ⊤ i V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( b − ζ ( b )) (cid:12)(cid:12)(cid:12) + sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , k b k ≤ k (cid:12)(cid:12)(cid:12) ( a − ζ ( a )) ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ( b − ζ ( b )) (cid:12)(cid:12)(cid:12) ( i ) ≤ c s k log( p ) p + Ξ4 + Ξ4 + Ξ16 = 1 c s k log( p ) p + 916 Ξ where ( i ) follows by the definition of Ξ , because we assume that the event in (68) holds, andalso since k a − ζ ( a ) k ≤ / and k b − ζ ( b ) k ≤ / .65y rearranging terms in the previous display we can infer that Ξ = sup a : k a k =1 , k a k ≤ k sup b : k b k =1 , k b k ≤ k (cid:12)(cid:12)(cid:12) a ⊤ V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b (cid:12)(cid:12)(cid:12) ≤ (cid:0) − (cid:1) c s k log( p ) p = O s k log( p ) p ! with probability at least − e − Ω (cid:16) pL (cid:17) . Taking a union bound over all pairs of layers and allsample we find that, for all ≤ ℓ ≤ ℓ ≤ L , for all s ∈ [ n ] and all k -sparse vectors a and b (cid:12)(cid:12)(cid:12)(cid:12) a ⊤ k a k V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ b k b k (cid:12)(cid:12)(cid:12)(cid:12) = O s k log( p ) p ! (69)with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) . Since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enoughpolynomial we can ensure that this probability is at least − δ/ which completes our proof. E.1.7 Proof of Part (g)Lemma E.8.
For any δ > , suppose that p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial,then, with probability at least − δ/ over the randomness in V (1) , for all s ∈ [ n ] and all ≤ ℓ ≤ L , | V (1) L +1 Σ V (1) L,s · · · Σ V (1) ℓ,s V (1) ℓ a | ≤ O (cid:16) k a k p k log( p ) (cid:17) for all vectors a with k a k ≤ k = cp log( p ) L , where c is a small enough positive absolute constant. Proof
Fix the layer ≤ ℓ ≤ L and the sample index s . At the end of the proof we shall takea union bound over all layers and all samples. Let us denote V (1) ℓ by simply V ℓ and let Σ V (1) ℓ,s be denoted by Σ ℓ,s .For any fixed vector z we know from Part (a) of Lemma E.10 that with probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) over the randomness in ( V L , . . . , V ) k Σ L,s V L . . . Σ ℓ,s V ℓ z k ≤ k z k . (70)Recall that the entries of V L +1 are drawn independently from N (0 , . Thus, conditioned onthis event above, for any fixed vector w the random variable w ⊤ V L +1 (Σ L,s · · · Σ ℓ V ℓ z ) is amean-zero Gaussian with variance at most k z k . Therefore over the randomness in V L +1 P | V L +1 Σ L,s · · · Σ ℓ,s V ℓ z | ≥ p k log( p ) c k z k (cid:12)(cid:12)(cid:12) V L , . . . , V ! ≤ e − k log( p )32 c , (71)where c is a small enough positive absolute constant that will be chosen only as a functionof the constant c . A union bound over the events in (70) and (71) yields P | V L +1 Σ L,s · · · Σ ℓ,s V ℓ z | ≤ p k log( p ) c k z k ! ≥ − O ( nL ) e − Ω (cid:16) pL (cid:17) − e − k log( p )32 c = 1 − e − Ω (cid:16) pL (cid:17) − e − k log( p )32 c (72)66here the last equality holds since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial.Let { z i } mi =1 be a / -net of k -sparse unit vectors in Euclidean norm of size m = (cid:0) pk (cid:1) k (sucha net exists, see Lemma F.10). Therefore by using (72) and taking a union bound we find that ∀ i ∈ [ m ] , | V L +1 Σ L,s · · · Σ ℓ,s V ℓ z i | ≤ p k log( p ) c (73)with probability at least − m (cid:18) e − Ω( p/L ) + e − k log( p )128 c (cid:19) = 1 − O (cid:18)(cid:18) pk (cid:19) k (cid:19) ( e − Ω (cid:16) pL (cid:17) + e − k log( p )32 c ) ( i ) ≥ − O (cid:18) epk (cid:19) k ! ( e − Ω( p/L ) + e − k log( p )32 c )= 1 − (cid:18) e − Ω (cid:16) pL (cid:17) + k log(9 ep ) + e − k log( p )32 c + k log(9 ep ) (cid:19) ( ii ) = 1 − (cid:18) e − Ω (cid:16) pL (cid:17) + e − Ω( k log( p )) (cid:19) ( iii ) = 1 − e − Ω (cid:16) pL (cid:17) where ( i ) follows since (cid:0) pk (cid:1) ≤ (cid:0) epk (cid:1) k , ( ii ) follows since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) , k = cp log( p ) L andbecause c is a small enough absolute constant (which can be chosen given the constant c ),and ( iii ) follows again since k = cp log( p ) L .We will now demonstrate that if the “good event” (73) holds then we can use this to establisha similar guarantee for all k -sparse unit vectors a . To see this, as before, suppose ζ maps anyunit-length k -sparse vector to its nearest neighbor in { z , . . . , z r } . 
Suppose that the event in(73) holds then Ξ := sup a : k a k =1 , k a k ≤ k | V L +1 Σ L,s · · · Σ ℓ,s V ℓ a | = sup a : k a k =1 , k a k ≤ k | V L +1 Σ L,s · · · Σ ℓ,s V ℓ ( a − ζ ( a ) + ζ ( a )) |≤ sup i ∈ [ m ] | V L +1 Σ L,s · · · Σ ℓ,s V ℓ z j | + sup a : k a k =1 , k a k ≤ k | V L +1 Σ L,s · · · Σ ℓ,s V ℓ ( a − ζ ( a )) | ( i ) ≤ p k log( p ) c + 14 sup a : k a k =1 , k a k ≤ k (cid:12)(cid:12)(cid:12)(cid:12) V L +1 Σ L,s · · · Σ ℓ,s V ℓ a − ζ ( a ) k a − ζ ( a ) k (cid:12)(cid:12)(cid:12)(cid:12) ( ii ) ≤ p k log( p ) c + Ξ4 where ( i ) holds because we assume that the event in (73) holds and since k a − ζ ( a ) k ≤ / ,and ( ii ) follows by the definition of Ξ .By rearranging terms in the previous display we infer that Ξ = sup a : k a k =1 , k a k ≤ k | V L +1 Σ L,s · · · Σ ℓ,s V ℓ a | ≤ (cid:0) − (cid:1) p k log( p ) c = O (cid:16)p k log( p ) (cid:17) with probability at least − e − Ω (cid:16) pL (cid:17) . Taking a union bound over all layers and all sample wefind that, for all ≤ ℓ ≤ L , for all s ∈ [ n ] and all k -sparse vectors a (cid:12)(cid:12)(cid:12)(cid:12) V L +1 Σ L,s · · · Σ ℓ,s V ℓ a k a k (cid:12)(cid:12)(cid:12)(cid:12) = O (cid:16)p k log( p ) (cid:17) (74)67ith probability at least − O ( nL ) e − Ω (cid:16) pL (cid:17) . Since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enoughpolynomial we can ensure that this probability is at least − δ/ which completes our proof. E.1.8 Proof of Part (h)Lemma E.9.
For any δ > , suppose that h < √ pL , p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enoughpolynomial and τ = Ω (cid:18) log ( nLδ ) p L (cid:19) . For β = O (cid:16) L τ / √ p (cid:17) , if S ℓ,s ( β ) := n j ∈ [ p ] : | V (1) ℓ,j x V (1) ℓ,s | ≤ β o where V (1) ℓ,j refers to the j th row of V (1) ℓ , then with probability at least − δ/ over the ran-domness in V (1) we have that for all ℓ ∈ [ L ] and all s ∈ [ n ] : |S ℓ,s ( β ) | ≤ O ( p / β ) = O ( pL τ / ) . Proof
To ease notation let us refer to V (1) ℓ as V ℓ and x V (1) ℓ,s as x ℓ,s . For a fixed ℓ ∈ [ L ] andsample s ∈ [ n ] define Z ( ℓ, j, s ) := I [ | V ℓ,j x ℓ,s | ≤ β ] so that |S ℓ,s ( β ) | = P pj =1 Z ( j, ℓ, s ) . Define E to be the event that k x ℓ − ,s k ≥ . By Inequal-ity (52) in the proof of Lemma E.2 above P [ E ] ≥ − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . (75)Conditioned on x ℓ − ,s since each entry of V ℓ,j is drawn independently from N (0 , p ) we knowthat the distribution of V ℓ,j x ℓ − ,s ∼ N (cid:16) , k x ℓ − ,s k p (cid:17) . Thus, conditioned on the event E , whichis determined by the random weights before layer ℓ , we have E (cid:2) Z ( j, ℓ, s ) (cid:12)(cid:12) E (cid:3) = P (cid:2) j ∈ S ℓ,s ( β ) (cid:12)(cid:12) E (cid:3) = r p π k x ℓ − ,s k Z β − β exp (cid:18) − x p k x ℓ − ,s k (cid:19) d x ≤ r pπ Z β − β exp (cid:18) − x p k x ℓ − ,s k (cid:19) d x ≤ β r pπ . On applying Hoeffding’s inequality (see Theorem F.5) we find P |S ℓ,s ( β ) | ≤ E p X j =1 Z ( j, ℓ, s ) (cid:12)(cid:12)(cid:12) E + p / β ≤ p (cid:18) β r pπ (cid:19) + p / β ≤ p / β (cid:12)(cid:12)(cid:12)(cid:12) E ≥ − exp( − Ω( p / β )) . (76)Taking a union bound over the events in (75) and (76) we find that P h |S ℓ,s ( β ) | ≤ p / β i ≥ − exp( − Ω( p / β )) − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . ℓ ∈ [ L ] and all s ∈ [ n ] |S ℓ,s ( β ) | ≤ p / β = O ( pL τ / ) with probability at least − O ( nL ) exp( − Ω( p / β )) − O ( n L ) exp (cid:0) − Ω (cid:0) pL (cid:1)(cid:1) . We shall nowdemonstrate that this probability of success is at least − δ/ . On substituting the value of β = O ( L / τ / √ p ) we find that this probability is at least − O ( nL ) exp( − Ω( p / β )) − O ( n L ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) = 1 − O ( nL ) exp( − Ω( pL τ / )) − O ( n L ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) ( i ) = 1 − O ( nL ) exp( − Ω( pL τ / )) − exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) ( ii ) = 1 − O ( nL ) exp (cid:18) − Ω (cid:18) log (cid:18) nLδ (cid:19)(cid:19)(cid:19) − exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) ≥ − δ/ , where ( i ) follows since p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial and ( ii ) follows byassumption that τ = Ω (cid:18) log ( nLδ ) p L (cid:19) . This completes our proof. E.1.9 Other useful concentration lemmas
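Before stating these lemmas, it may help to see the kind of quantity they control. The sketch below (purely illustrative, not part of the argument) propagates a fixed unit vector through alternating products of activation-derivative matrices and random weight matrices at initialization, and checks that its norm, like that of the forward features, stays close to one. The width, depth, Huberization parameter, and the $N(0, 2/p)$ hidden-layer variance used here are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
p, L, h = 2000, 10, 1e-2   # width, depth, Huberization parameter; illustrative values only

def huber_relu(u, h):
    # Huberized ReLU: 0 for u < 0, u^2/(2h) on [0, h], u - h/2 for u > h.
    return np.where(u < 0, 0.0, np.where(u <= h, u ** 2 / (2 * h), u - h / 2))

def huber_relu_deriv(u, h):
    # Its derivative: 0 for u < 0, u/h on [0, h], 1 for u > h.
    return np.clip(u / h, 0.0, 1.0)

x = rng.standard_normal(p)
x /= np.linalg.norm(x)   # unit-norm input, as in the setup
z = rng.standard_normal(p)
z /= np.linalg.norm(z)   # arbitrary fixed unit vector

for ell in range(1, L + 1):
    V = rng.standard_normal((p, p)) * np.sqrt(2.0 / p)   # assumed N(0, 2/p) hidden-layer entries
    pre = V @ x
    sigma = huber_relu_deriv(pre, h)   # diagonal of Sigma_l along the data path
    z = sigma * (V @ z)                # z_l = Sigma_l V_l z_{l-1}
    x = huber_relu(pre, h)             # forward pass: x_l = phi(V_l x_{l-1})
    print(f"layer {ell:2d}:  ||x_l|| = {np.linalg.norm(x):.3f}   ||z_l|| = {np.linalg.norm(z):.3f}")
```

For widths of this order both printed norms should remain close to $1$ across all layers, which is the qualitative content of the norm-preservation estimates in this appendix.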
The following lemma is useful in the proofs of Lemmas E.4–E.8. It bounds the norm of an arbitrary unit vector $z$ that is multiplied by alternating weight matrices $V^{(1)}_\ell$ and the corresponding diagonal matrices $\Sigma^{V^{(1)}}_{\ell,s}$.

Lemma E.10.
Suppose that p ≥ poly( L, log( n )) for a large enough polynomial, then givenan arbitrary unit vector z ∈ S p − with probability at least − O ( nL ) exp (cid:0) − Ω (cid:0) pL (cid:1)(cid:1) over therandomness in V (1) for all ≤ ℓ ≤ ℓ ≤ L and for all s ∈ [ n ] (a) (cid:13)(cid:13)(cid:13) Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ z (cid:13)(cid:13)(cid:13) ≤ , and(b) (cid:13)(cid:13)(cid:13) V (1) ℓ Σ V (1) ℓ − ,s · · · Σ V (1) ℓ ,s V (1) ℓ z (cid:13)(cid:13)(cid:13) ≤ . Proof
To ease notation let us denote $V^{(1)}_\ell$ by simply $V_\ell$, let $\Sigma^{V^{(1)}}_{\ell,s}$ be denoted by $\Sigma_{\ell,s}$, and let $x^{V^{(1)}}_{\ell,s}$ be denoted by $x_{\ell,s}$.

Proof of Part (a):
For any layer ℓ ∈ { ℓ , . . . , ℓ − } define z ℓ,s := Σ ℓ,s V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ z with the convention that z ℓ − ,s := z .Conditioned on z ℓ − ,s the distribution of V ℓ z ℓ − ,s ∼ N (cid:16) , k z ℓ − ,s k Ip (cid:17) , since each entry of V ℓ is drawn independently from N (0 , p ) . We begin by evaluating the expected value of its69quared norm conditioned on the randomness in V ℓ − , . . . , V . Let V ℓ,j denote the j th row of V ℓ and let (Σ ℓ,s ) jj denote the j th element on the diagonal of Σ ℓ,s , then E (cid:2) k z ℓ,s k | V ℓ − , . . . , V (cid:3) = E (cid:2) k Σ ℓ,s V ℓ z ℓ − ,s k | V ℓ − , . . . , V (cid:3) = E p X j =1 ((Σ ℓ,s ) jj V ℓ,j z ℓ − ,s ) (cid:12)(cid:12) V ℓ − , . . . , V . By the definition of the Huberized ReLU observe that each entry (Σ ℓ,s ) jj = φ ′ ( V ℓ,j x ℓ,s ) ≤ I [ V ℓ,j x ℓ − ,s ≥ and therefore E (cid:2) k z ℓ,s k | V ℓ − , . . . , V (cid:3) ≤ E p X j =1 I [ V ℓ,j x ℓ − ,s ≥
0] ( V ℓ,j z ℓ − ,s ) (cid:12)(cid:12)(cid:12) V ℓ − , . . . , V . (77)Let us decompose the vector z ℓ − ,s into its component in the x ℓ − ,s direction, which we referto as z k ℓ − ,s , and a component that is perpendicular to x ℓ − ,s , which we refer to as z ⊥ ℓ − ,s . Thatis, define z k ℓ − ,s := (cid:18) z ℓ − ,s · x ℓ − ,s k x ℓ − ,s k (cid:19) x ℓ − ,s k x ℓ − ,s k and, z ⊥ ℓ − ,s := z ℓ − ,s − z k ℓ − ,s . These vectors are orthogonal and hence, continuing from Inequality (77) by applying Lemma F.11to each term in the sum E (cid:2) k z ℓ,s k | V ℓ − , . . . , V (cid:3) ≤ E p X j =1 I h V ℓ,j x V (1) ℓ − ,s ≥ i (cid:16) V ℓ,j z k ℓ − ,s (cid:17) (cid:12)(cid:12) V ℓ − , . . . , V + E p X j =1 I h V ℓ,j x V (1) ℓ − ,s ≥ i (cid:16) V ℓ,j z ⊥ ℓ − ,s (cid:17) (cid:12)(cid:12) V ℓ − , . . . , V . (78)We begin by evaluating the term involving the parallel components. For any j , conditionedon V ℓ − , . . . , V , recalling that V ℓ,j is the j th row of V ℓ , the random variable V ℓ,j x ℓ − ,s ∼N (0 , k x ℓ − ,s k p ) , therefore E (cid:20) I [ V ℓ,j x ℓ − ,s ≥ (cid:16) V ℓ,j z k ℓ − ,s (cid:17) (cid:12)(cid:12) V ℓ − , . . . , V (cid:21) = (cid:18) z ℓ − ,s · (cid:18) x ℓ − ,s k x ℓ − ,s k (cid:19)(cid:19) E h I [ V ℓ,j x ℓ − ,s ≥
0] ( V ℓ,j x ℓ − ,s ) (cid:12)(cid:12) V ℓ − , . . . , V i = (cid:18) z ℓ − ,s · (cid:18) x ℓ − ,s k x ℓ − ,s k (cid:19)(cid:19) × E h ( V ℓ,j x ℓ − ,s ) (cid:12)(cid:12) V ℓ − , . . . , V i = (cid:18) z ℓ − ,s · (cid:18) x ℓ − ,s k x ℓ − ,s k (cid:19)(cid:19) × k x ℓ − ,s k p = k z k ℓ − ,s k p . (79)70or the perpendicular component, notice that conditioned on ( V ℓ − , . . . , V ) , the distributionof the random variable V ℓ,j z ⊥ ℓ − ,s is symmetric about the hyperplane defined by V ℓ,j x ℓ − ,s ≥ ,and hence E (cid:20) I [ V ℓ,j x ℓ − ,s ≥ (cid:16) V ℓ,j z ⊥ ℓ − ,s (cid:17) (cid:12)(cid:12) V ℓ − , . . . , V (cid:21) = 12 E (cid:20)(cid:16) V ℓ,j z ⊥ ℓ − ,s (cid:17) (cid:12)(cid:12) V ℓ − , . . . , V (cid:21) = 12 × k z ⊥ ℓ − ,s k p = k z ⊥ ℓ − ,s k p . (80)By combining the results of (78)-(80) we find that E (cid:2) k z ℓ,s k | V ℓ − , . . . , V (cid:3) ≤ p k z ⊥ ℓ − ,s k + k z k ℓ − ,s k p = k z ℓ − ,s k . (81)By symmetry among the p coordinates we can also infer that E (cid:2) ( z ℓ,s ) i | V ℓ − , . . . , V (cid:3) ≤ k z ℓ − ,s k p for each i ∈ [ p ] . Thus, by the same argument as we used in Lemma E.2 to arrive at (50)we can show that conditioned on V ℓ − , . . . , V the sub-Gaussian norm k ( z ℓ,s ) i k ψ is at most c k z ℓ − ,s k / √ p and hence the sub-exponential norm k ( z ℓ,s ) i k ψ ≤ k ( z ℓ,s ) i k ψ ≤ c k z ℓ − ,s k /p (by Lemma F.3). Therefore by Bernstein’s inequality (see Theorem F.6) for any η ∈ (0 , P (cid:2) k z ℓ,s k ≤ k z ℓ − ,s k (1 + η ) (cid:12)(cid:12) V ℓ − , . . . , V (cid:3) ≥ − exp (cid:0) − c pη (cid:1) . Setting η = L and taking a union bound we infer that for all s ∈ [ n ] and all ℓ ∈ { ℓ − , . . . , ℓ } P h k z ℓ,s k ≤ k z ℓ − ,s k p η i ≥ − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . (82)We will now show by an inductive argument for the layers that if the “good event” in (82)holds then k z ℓ,s k ≤ ℓ − ℓ + 1) η , for all ℓ ∈ { ℓ − , . . . , ℓ − } and all s ∈ [ n ] . The basecase holds at ℓ − since by definition k z ℓ − ,s k = k z k = 1 . Now assume that the inductiveargument holds at any layers ℓ , . . . , ℓ − . Then if the event in (82) holds we have k z ℓ,s k ≤ k z ℓ − ,s k p η ≤ (1 + 3( ℓ − ℓ ) η ) (1 + η ) (by the IH and because √ η ≤ η ) = 1 + 3 (cid:18) ℓ − ℓ + 53 (cid:19) η + 3( ℓ − ℓ − η ≤ ℓ − ℓ + 1) η (since η = L and L ≥ ).This completes the induction. Hence we have shown that for all s ∈ [ n ] P (cid:20) k z ℓ − ,s k ≤ ℓ − ℓ + 1)50 L (cid:21) ≥ − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . (83)Recall that z ℓ − ,s := Σ ℓ − · · · Σ ℓ ,s V ℓ z , therefore taking union bound over all pairs of layerswe get that for all ≤ ℓ ≤ ℓ ≤ L + 1 and all s ∈ [ n ] P (cid:20) k z ℓ − ,s k ≤ ℓ − ℓ + 1)50 L (cid:21) ≥ − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . Proof of Part (b):
For a fixed s ∈ [ n ] we condition on z ℓ − ,s and consider the randomvariable a s = V ℓ z ℓ − ,s . Since ℓ ∈ [ L ] each entry of V ℓ is drawn independently from N (0 , p ) .The distribution of each entry of a s conditioned on z ℓ − ,s is N (0 , k z ℓ − ,s k p ) . Therefore bythe Gaussian-Lipschitz concentration inequality (see Theorem F.7) for any η ′ > P h k a s k ≤ √ k z ℓ − ,s k (1 + η ′ ) (cid:12)(cid:12) z ℓ − ,s i ≥ − exp (cid:0) − c pη ′ (cid:1) . Setting η ′ = L and taking a union bound over all samples we get that for all s ∈ [ n ] P (cid:20) k a s k ≤ √ k z ℓ − ,s k (cid:18) L (cid:19) (cid:12)(cid:12) z ℓ − ,s (cid:21) ≥ − n exp (cid:16) − c pL (cid:17) . (84)By a union bound over the events in (83) and (84) we find that for all s ∈ [ n ] : P (cid:20) k a s k ≤ √ (cid:18) ℓ − ℓ + 1)50 L (cid:19) (cid:18) L (cid:19)(cid:21) ≥ − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . The definition of a s = V ℓ z ℓ − = V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ,s and the previous display above yieldsthat for all s ∈ [ n ] P (cid:20) k V ℓ z ℓ − = V ℓ Σ ℓ − ,s · · · Σ ℓ ,s V ℓ ,s k ≤ √ (cid:18) L L (cid:19) (cid:18) L (cid:19) ≤ (cid:21) ≥ − O ( nL ) exp (cid:16) − Ω (cid:16) pL (cid:17)(cid:17) . Finally a union bound over all pairs of ≤ ℓ ≤ ℓ ≤ L completes the proof of the secondpart.The next lemma bounds the magnitude of the initial function values with high probability. Lemma E.11.
For any $\delta > 0$, suppose that $p \ge \mathrm{poly}\big(L, \log\big(\tfrac{n}{\delta}\big)\big)$ for a large enough polynomial. Then, with probability at least $1-\delta$ over the randomness in $V^{(1)}$, for all $s \in [n]$,
$$|f_{V^{(1)}}(x_s)| \le c\sqrt{\log(2n/\delta)}.$$

Proof
By Lemma E.2: with probability at least − δ/ k x V (1) L,s k ≤ (85)for all s ∈ [ n ] . Fix a sample with index s ∈ [ n ] . Conditioned on x V (1) L,s , the random variable V (1) L +1 x V (1) L,s ∼ N (0 , k x V (1) L,s k ) since each entry of V (1) L +1 is drawn independently from N (0 , .Therefore for any η > P h | f V (1) ( x s ) | ≤ η k x V (1) L,s k (cid:12)(cid:12) x V (1) L,s i ≥ − (cid:0) − c η (cid:1) . A union bound over all samples implies P h ∀ s ∈ [ n ] : | f V (1) ( x s ) | ≤ η k x V (1) L,s k (cid:12)(cid:12) x V (1) L,s i ≥ − n exp (cid:0) − c η (cid:1) . η = c p log( n/δ ) where c is a large enough absolute constant we get that P h ∀ s ∈ [ n ] : | f V (1) ( x s ) | ≤ c p log( n/δ ) k x V (1) L,s k (cid:12)(cid:12) x V (1) L,s i ≥ − δ . (86)Taking union bound over the events in (85) and (86) we find that P h ∀ s ∈ [ n ] : | f V (1) ( x s ) | ≤ c p log( n/δ ) i ≥ − δ which completes the proof.Lastly we prove a lemma that bounds the norm of initial weight matrix with high probability. Lemma E.12.
For any $\delta > 0$, suppose that $p \ge \mathrm{poly}\big(L, \log\big(\tfrac{n}{\delta}\big)\big)$ for a large enough polynomial. Then, with probability at least $1-\delta$ over the randomness in $V^{(1)}$,
$$\|V^{(1)}\| \le \sqrt{10pL}.$$

Proof
By definition,
$$\|V^{(1)}\|^2 = \sum_{\ell \in [L+1]} \|V^{(1)}_\ell\|^2.$$
When $\ell \in [L]$, the matrix $V^{(1)}_\ell$ has its entries drawn from $N\big(0, \tfrac{2}{p}\big)$, so its expected squared Frobenius norm is $2p$. Therefore, by applying Theorem F.7, we find that for any fixed $\ell \in [L]$,
$$\mathbb{P}\Big[\|V^{(1)}_\ell\|^2 \ge 5p\Big] \le \exp(-\Omega(p)).$$
When $\ell = L+1$, the $p$-dimensional vector $V^{(1)}_{L+1}$ has its entries drawn from a centered Gaussian with constant variance, so again applying Theorem F.7 we get
$$\mathbb{P}\Big[\|V^{(1)}_{L+1}\|^2 \ge 5p\Big] \le \exp(-\Omega(p)).$$
Taking a union bound over all $L+1$ layers we find that
$$\mathbb{P}\Big[\forall\, \ell \in [L+1]:\ \|V^{(1)}_\ell\|^2 \le 5p\Big] \ge 1 - (L+1)\exp(-\Omega(p)) \ge 1 - \delta,$$
where the last inequality follows since $p \ge \mathrm{poly}\big(L, \log\big(\tfrac{n}{\delta}\big)\big)$. Therefore,
$$\|V^{(1)}\|^2 \le (L+1) \times 5p \le 10pL$$
with probability at least $1-\delta$. Taking square roots establishes the claim.

E.2 Useful properties in a neighborhood around the initialization

In the next two lemmas we shall assume that the “good event” described in Lemma E.1 holds. We shall show that when the initial weight matrices satisfy those properties, we can also extend some of these properties to matrices in a neighborhood around the initial parameters.
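Before turning to these perturbation lemmas, a small numerical sketch (purely illustrative, not part of the argument) can be used to sanity-check the initialization-scale bounds established above, namely the per-layer operator-norm bound of Lemma E.3 and the Frobenius-norm bound of Lemma E.12. The width, depth, and entry variances used below (in particular $2/p$ for the hidden layers and a constant-variance output layer) are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
p, L = 1000, 8   # width and depth; illustrative values only

# Assumed initialization for this sketch: hidden layers with i.i.d. N(0, 2/p)
# entries, output layer with i.i.d. entries of constant variance.
hidden = [rng.standard_normal((p, p)) * np.sqrt(2.0 / p) for _ in range(L)]
output = rng.standard_normal(p) * np.sqrt(2.0)

# Lemma E.3-style check: each hidden layer has O(1) operator norm
# (for this scaling the largest singular value concentrates near 2*sqrt(2)).
op_norms = [np.linalg.norm(V, ord=2) for V in hidden]
print("max hidden-layer operator norm:", round(max(op_norms), 3))
print("||V_{L+1}|| / sqrt(p):", round(np.linalg.norm(output) / np.sqrt(p), 3))

# Lemma E.12-style check: the stacked parameters have norm at most sqrt(10 p L).
frob_sq = sum(np.linalg.norm(V) ** 2 for V in hidden) + np.linalg.norm(output) ** 2
print("||V^{(1)}|| =", round(np.sqrt(frob_sq), 1), "  sqrt(10 p L) =", round(np.sqrt(10 * p * L), 1))
```

Under the assumed $2/p$ scaling the squared Frobenius norm is about $2p$ per layer, so the $\sqrt{10pL}$ bound above holds with a comfortable constant-factor margin.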
Lemma E.13.
Let the event in Lemma E.1 hold and suppose that the conditions on h , p and τ described in that lemma hold. Let e V be a weight matrix such that k e V ℓ − V (1) k op ≤ τ for all ℓ ∈ [ L ] . For all ℓ ∈ [ L ] and s ∈ [ n ] , let e Σ ℓ,s be diagonal matrices such that k e Σ ℓ,s − Σ V (1) ℓ,s k ≤ k ,and ( e Σ ℓ,s ) jj ∈ [ − , for all j ∈ [ p ] . If τ ≤ q k log( p ) p ≤ O (cid:0) L (cid:1) then, for all ≤ ℓ ≤ ℓ ≤ L , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ Y j = ℓ e V ⊤ j e Σ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ O ( L ) . Proof
Fix an arbitrary sample index s . To ease notation let us refer to V (1) as V , Σ V (1) ℓ,s as Σ ℓ , and e Σ ℓ,s as e Σ ℓ . Note that for any j ∈ [ L ] e V ⊤ j e Σ j = V ⊤ j Σ j + V ⊤ j (cid:16)e Σ j − Σ j (cid:17)| {z } =:Γ j + ( e V j − V j ) ⊤ e Σ j | {z } =:∆ j . (87)Let us refer to Γ j and ∆ j as “flip matrices”. Then, if we define the set A j = { V ⊤ j Σ j , Γ j , ∆ j },expanding the product into a sum of terms yields ℓ Y j = ℓ e V ⊤ j e Σ j = ℓ Y j = ℓ (cid:16) V ⊤ j Σ j,s + Γ j,s + ∆ j,s (cid:17) = X A ℓ ∈A ℓ ,...,A ℓ ∈A ℓ ℓ Y j = ℓ A j . (88)To bound the operator norm of any term in the sum above we will employ the followingstrategy. We will begin by bounding the operator norm of terms that have at most one flipmatrix. To bound the operator norm of some other term with two or more flip matrices we candecompose it into subsequences that have exactly two flips and end in a flip, and subsequencesthat have at most one flip. In the calculations that follows the indices q , q and q satisfy: ≤ ℓ ≤ q ≤ q ≤ q ≤ ℓ ≤ L . Subsequences with no flip matrices:
First, the term in which A j = V ⊤ j Σ j for all j can bebounded by Part (c) of Lemma E.1: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q Y j = q V ⊤ j Σ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q Y j = q Σ j V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ O ( L ) . (89)74 ubsequences with one flip matrix: There will be two types of sub-sequences with just oneflip matrix. First, let us consider the following type of subsequence: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∆ ⊤ q q Y j = q − Σ j V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13) ∆ ⊤ q (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q Y j = q − Σ j V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ O ( L ) (cid:13)(cid:13)(cid:13) ∆ ⊤ q (cid:13)(cid:13)(cid:13) op = O ( L ) (cid:13)(cid:13)(cid:13) e Σ q ( e V q − V q ) (cid:13)(cid:13)(cid:13) op ( ii ) ≤ O ( L ) (cid:13)(cid:13)(cid:13) e V q − V q (cid:13)(cid:13)(cid:13) op ( iii ) ≤ O ( τ L ) ( iv ) ≤ O (1) , (90)where ( i ) follows by again invoking Part (c) of Lemma E.1, ( ii ) follows since by assumption thediagonal matrix e Σ q has its entries bounded between [ − , , ( iii ) follows since by assumption (cid:13)(cid:13)(cid:13) e V q − V q (cid:13)(cid:13)(cid:13) op ≤ τ and ( iv ) follows since by assumption τ = O (1 /L ) .Next, let us consider the second type of subsequence with just one flip matrix: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j Γ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j V ⊤ q (cid:16)e Σ q − Σ q (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = sup a : k a k =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j V ⊤ q (cid:16)e Σ q − Σ q (cid:17) a (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = sup a : k a k =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) a ⊤ (cid:16)e Σ q − Σ q (cid:17) V q q Y j = q − Σ j V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) For each a let’s define b = (cid:16)e Σ q − Σ q (cid:17) a . Since k e Σ q − Σ q k ≤ k , therefore b is k -sparse. Alsosince the diagonal matrix e Σ q has entries in [ − , and Σ q has entries in [0 , , therefore theentries of e Σ q − Σ q lie in [ − , . This implies that k b k ≤ k a k ≤ . Applying Part (e) ofLemma E.1, we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j Γ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ O ( k b k ) = O (1) . (91) Subsequences with two flip matrices:
Now we continue to subsequences with two flip matrices.There shall be four types of such subsequences. We begin by consider subsequences of the type (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j ∆ q q − Y j = q +1 V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q +1 V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q +1 V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ O ( τ L ) ( ii ) ≤ O ( τ L ) (92)where ( i ) follows by (90) and ( ii ) follows since τ = O (1 /L ) .75ext, we bound the operator norm of a subsequence of the type (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j ∆ q q − Y j = q +1 V ⊤ j Σ j Γ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q +1 V ⊤ j Σ j Γ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ O ( τ L ) (93)where ( i ) follows by invoking inequalities (90) and (91).We continue to bound the operator norm of subsequences (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j Γ q q − Y j = q +1 V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j Γ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j ∆ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ O ( τ L ) (94)where ( i ) follows again invoking inequalities (90) and (91).Finally we bound the operator norm of subsequences of the type (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j Γ q q − Y j = q +1 V ⊤ j Σ j Γ q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) q − Y j = q V ⊤ j Σ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V ⊤ q (cid:16)e Σ q − Σ q (cid:17) q − Y j = q +1 V ⊤ j Σ j V ⊤ q (cid:16)e Σ q − Σ q (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ O ( L ) sup a : k a k =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V ⊤ q (cid:16)e Σ q − Σ q (cid:17) q − Y j = q +1 V ⊤ j Σ j V ⊤ q (cid:16)e Σ q − Σ q (cid:17) a (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O ( L ) sup a : k a k =1 sup b : k b k =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) b ⊤ V ⊤ q (cid:16)e Σ q − Σ q (cid:17) q − Y j = q +1 V ⊤ j Σ j V ⊤ q (cid:16)e Σ q − Σ q (cid:17) a (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O ( L ) sup a : k a k =1 sup b : k b k =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) a ⊤ (cid:16)e Σ q − Σ q (cid:17) V q q +1 Y j = q − Σ j V j (cid:16)e Σ q − Σ q (cid:17) V q b (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O ( L ) sup a : k a k =1 sup b : k b k =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) a ⊤ (cid:16)e Σ q − Σ q (cid:17) V q q +1 Y j = q − Σ j V j (cid:16)e Σ q − Σ q (cid:17) V q b k V q b k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k V q b k ( ii ) ≤ O ( L ) s k log pp sup b : k b k =1 k V q b k = O ( L ) s k log pp k V q k op ( iii ) ≤ O ( L ) s k log pp (95)where ( i ) follows by invoking Part (c) of Lemma E.1. 
Inequality ( ii ) follows since the vectors a ⊤ (cid:16)e Σ q − Σ q (cid:17) and (cid:16)e Σ q − Σ q (cid:17) V q b k V q b k are k -sparse and both have norm less than or equalto 4, thus we can apply Part (f) of Lemma E.1, and ( iii ) follows by applying Part (b) ofLemma E.1.As stated above we can decompose each product in (88) that has at least two flips intosubsequences that end in a flip and have exactly two flips, and subsequences that have at most76ne flip. The subsequences that have at most one flip have operator norm at most O ( L ) (byinequalities (89)-(91)).The above logic (92)-(94) implies that subsequences with exactly two flips that have atleast one ∆ flip have operator norm at most O ( τ L ) ≤ (cid:16)q k log pp L (cid:17) (since τ ≤ q k log pp byassumption). Subsequences with two Γ flips have operator norm at most O (cid:16)q k log pp L (cid:17) . Define ψ := O (cid:16)q k log pp L (cid:17) . So, if a sequence has r flip matrices then its operator norm is boundedby O (cid:16) ψ ⌊ r/ ⌋ × L (cid:17) So, putting it together, by recalling the decomposition in (88) we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ Y j = ℓ e V ⊤ j e Σ j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X A ℓ ∈A ℓ ,...,A ℓ ∈A ℓ ℓ Y j = ℓ A j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ (1 + 2 L ) × O ( L ) + X A ℓ ∈A ℓ ,...,A ℓ ∈A ℓ : ≥ flips (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ Y j = ℓ A j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( ii ) ≤ O ( L ) + L X r =2 (cid:18) Lr (cid:19) r O (cid:16) Lψ ⌊ r/ ⌋ (cid:17) = O ( L ) + L L X r =2 (cid:18) Lr (cid:19) O (cid:16) (4 ψ ) ⌊ r/ ⌋ (cid:17) ≤ O ( L ) + L L X r =0 (cid:18) Lr (cid:19) O (cid:16) (4 ψ ) ⌊ r/ ⌋ (cid:17) ( iii ) ≤ O ( L ) + L (cid:16) O (cid:16)p ψ (cid:17)(cid:17) L ( iv ) = O ( L ) + L O √ L (cid:18) k log( p ) p (cid:19) / !! L ( v ) = O ( L ) + L (cid:18) O (cid:18) L (cid:19)(cid:19) L ( vi ) = O ( L ) where ( i ) follows since the number of terms with at most one flip matrix is (1 + 2 L ) andthe operator norm of each such term is upper bounded by O ( L ) by inequalities (89)-(91).Inequality ( ii ) is because the number of terms with r flip matrices is (cid:0) Lr (cid:1) r , ( iii ) by the Bi-nomial theorem, ( iv ) is by our definition of ψ . Inequality ( v ) follows since by assumption q k log( p ) p ≤ O (cid:0) L (cid:1) and ( vi ) follows since for any ≤ z < L , (1 + z ) L ≤ Lz . This completesour proof.The following lemma bounds the difference between post-activation features at the ℓ th layerwhen the weight matrix is perturbed from its initial value.77 emma E.14. Let the event in Lemma E.1 hold and suppose that the conditions on h , p and τ described in that lemma hold with the additional assumptions that τ = O (cid:18) L log ( p ) (cid:19) and h ≤ τ √ p . Let V (1) be the initial weight matrix and e V , ˆ V are weight matrices such that k e V ℓ − V (1) ℓ k op , k ˆ V ℓ − V (1) ℓ k op ≤ τ for all ℓ ∈ [ L ] . Then1. k Σ e Vℓ,s − Σ ˆ Vℓ,s k ≤ O ( pL τ / ) ;2. k x e Vℓ,s − x ˆ Vℓ,s k ≤ O ( L τ ) ;for all ℓ ∈ [ L ] and all s ∈ [ n ] . Proof
Fix the sample s , let us remove its index from all subscripts. To further simplify nota-tion let us refer to Σ V (1) ℓ,s , Σ e Vℓ,s , Σ ˆ Vℓ,s , x V (1) ℓ,s , x e Vℓ,s and x ˆ Vℓ,s as Σ ℓ , e Σ ℓ , ˆΣ ℓ , x ℓ , ˜ x ℓ and ˆ x ℓ respectively.Before the first layer (at layer ) define Σ = e Σ = ˆΣ = I and recall that by definition forany sample s ∈ [ n ] , x V (1) ,s = x e V ,s = x ˆ V ,s = x s .We will inductively show that1. k e Σ ℓ − ˆΣ ℓ k , k Σ ℓ − e Σ ℓ k , k Σ ℓ − ˆΣ ℓ k ≤ O ( pL τ / ) , and2. k ˜ x ℓ − ˆ x ℓ k , k x ℓ − ˜ x ℓ k , k x ℓ − ˆ x ℓ k ≤ O ( L τ ) .The base case, when ℓ = 0 is trivially true since x = ˆ x = ˜ x and Σ = e Σ = ˆΣ .Now let us assume that the inductive hypothesis holds at for all layers r = 1 , . . . , ℓ − . Weshall prove that the inductive hypothesis holds at layer ℓ in two steps. Step 1:
By the triangle inequality k e Σ ℓ − ˆΣ ℓ k ≤ k Σ ℓ − e Σ ℓ k + k Σ ℓ − ˆΣ ℓ k . (96)Thus by showing that k Σ ℓ − e Σ ℓ k and k Σ ℓ − ˆΣ ℓ k are at most O ( pL τ / ) also proves the claimthat k e Σ ℓ − ˆΣ ℓ k ≤ O ( pL τ / ) .We begin by bounding k Σ ℓ − e Σ ℓ k . Recall that by definition the diagonal matrix (Σ ℓ ) jj =( φ ′ ( V ℓ x ℓ − )) j . So to bound the difference between Σ ℓ − e Σ ℓ we characterize the differencebetween V ℓ x ℓ − − e V ℓ ˜ x ℓ − = ( e V ℓ − V ℓ ) x ℓ − + e V ℓ (˜ x ℓ − − x ℓ − ) . By assumption k e V ℓ − V ℓ k op ≤ τ , k x ℓ − k ≤ by Part (a) of Lemma E.1, and k ˜ x ℓ − − x ℓ − k ≤ O ( L τ ) by the inductive hypothesis. Therefore, k V ℓ x ℓ − − e V ℓ ˜ x ℓ − k ≤ k ( e V ℓ − V ℓ ) x ℓ − k + k e V ℓ (˜ x ℓ − − x ℓ − ) k≤ k e V ℓ − V ℓ k op k x ℓ − k + k e V ℓ k op k ˜ x ℓ − − x ℓ − k≤ τ + O ( L τ ) k e V ℓ k op ≤ τ + O ( L τ ) (cid:16) k e V ℓ − V ℓ k op + k V ℓ k op (cid:17) ( i ) ≤ τ + O ( L τ ) ( τ + O (1)) ( ii ) = O ( L τ ) ( i ) follows since k V ℓ k op ≤ O (1) by Part (b) of Lemma E.1 and ( ii ) follows since τ = O (1) by assumption.Let β = O (cid:18) L τ √ p (cid:19) > h > . This particular choice for the value of β shall become clearshortly, and h ≤ β/ since h ≤ τ √ p by assumption. Define the set: S ℓ ( β ) := { j ∈ [ p ] : | V ℓ,j x ℓ − | ≤ β } where V ℓ,j refers to the j th row of V ℓ . Also define s (1) ℓ ( β ) := (cid:12)(cid:12)(cid:12) { j ∈ S ℓ ( β ) : (Σ ℓ ) jj = ( e Σ ℓ ) jj } (cid:12)(cid:12)(cid:12) and, s (2) ℓ ( β ) := (cid:12)(cid:12)(cid:12) { j ∈ S cℓ ( β ) : (Σ ℓ ) jj = ( e Σ ℓ ) jj } (cid:12)(cid:12)(cid:12) . Clearly we must have that k Σ ℓ − e Σ ℓ k = s (1) ℓ ( β ) + s (2) ℓ ( β ) . To bound s (1) ℓ ( β ) we note that s (1) ℓ ( β ) ≤ |S ℓ ( β ) | ≤ O ( p / β ) by Part (h) of Lemma E.1. Wefocus on s (2) ℓ ( β ) . For a j ∈ S cℓ ( β ) by the definition of the Huberized ReLU if (Σ ℓ ) jj = ( e Σ ℓ ) jj then we must have that (cid:12)(cid:12)(cid:12) e V ℓ,j ˜ x ℓ − − V ℓ,j x ℓ − (cid:12)(cid:12)(cid:12) ≥ β − h. This further implies that ( β − h ) s (2) ℓ ( β ) ≤ X j ∈S cℓ ( β ):(Σ ℓ ) jj =( e Σ ℓ ) jj (cid:12)(cid:12)(cid:12) e V ℓ,j ˜ x ℓ − − V ℓ,j x ℓ − (cid:12)(cid:12)(cid:12) ≤ k V ℓ x ℓ − − e V ℓ ˜ x ℓ − k ≤ O ( L τ ) . Therefore, we find that k Σ ℓ − e Σ ℓ k = s (1) ℓ ( β ) + s (2) ℓ ( β ) ≤ O ( L τ )( β − h ) + O ( p / β ) . Balancing both of these terms on the RHS leads to the choice β = O ( L τ / √ p ) . This choice of β shows that k Σ ℓ − e Σ ℓ k ≤ O ( pL τ / ) . Similarly we can also show that k Σ ℓ − ˆΣ ℓ k ≤ O ( pL τ / ) . These two bounds along with (96)proves the first part of the inductive hypothesis. Step 2:
Now, for the second part, we want to show that $\|\tilde{x}_\ell - \hat{x}_\ell\|$ remains bounded. We can also show that $\|x_\ell - \tilde{x}_\ell\|$ and $\|x_\ell - \hat{x}_\ell\|$ remain bounded by mirroring the logic that follows. Define a diagonal matrix
$$(\check{\Sigma}_\ell)_{jj} := (\hat{\Sigma}_\ell - \widetilde{\Sigma}_\ell)_{jj}\left[\frac{\widetilde{V}_{\ell,j}\,\tilde{x}_{\ell-1}}{\hat{V}_{\ell,j}\,\hat{x}_{\ell-1} - \widetilde{V}_{\ell,j}\,\tilde{x}_{\ell-1}}\right], \qquad \text{for all } j \in [p].$$
79n the definition above we use the convention that / . We will show that for any j ∈ [ p ] | ( ˇΣ ℓ ) jj | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( ˆΣ ℓ − e Σ ℓ ) jj " e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − ≤ . Firstly observe that the matrices e Σ ℓ and ˆΣ ℓ have entries between [0 , , therefore e Σ ℓ − Σ ℓ hasentries between − and . Also recall that by the definition of the Huberized ReLU, ( ˆΣ ℓ ) jj = if ˆ V ℓ,j ˆ x ℓ − > h, ˆ V ℓ,j ˆ x ℓ − h if ˆ V ℓ,j ˆ x ℓ − ∈ [0 , h ] , if ˆ V ℓ,j ˆ x ℓ − < . Now we will analyze a few cases and show that the absolute values of the entries of ˇΣ ℓ aresmaller than in each case.If the signs of e V ℓ,j ˜ x ℓ − and ˆ V ℓ,j ˆ x ℓ − are opposite then we must have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( ˆΣ ℓ − e Σ ℓ ) jj e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( i ) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = | e V ℓ,j ˜ x ℓ − || ˆ V ℓ,j ˆ x ℓ − | + | e V ℓ,j ˜ x ℓ − | ≤ , where ( i ) follows since | ( ˆΣ ℓ − e Σ ℓ ) jj | ≤ . If they have the same sign and are both negativethen ( e Σ ℓ − Σ ℓ ) jj = 0 in this case. The same is true when they are both positive and are biggerthan h . Therefore, we are only left with the case when both are positive and one of them issmaller than h . If e V ℓ,j ˜ x ℓ − > h and ˆ V ℓ,j ˆ x ℓ − ∈ [0 , h ] we have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( ˆΣ ℓ − e Σ ℓ ) jj e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) ˆ V ℓ,j ˆ x ℓ − h − (cid:17) e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) V ℓ,j x ℓ − h − ˆ V ℓ,j ˆ x ℓ − e V ℓ,j ˜ x ℓ − − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ where the last inequality follows since e V ℓ,j ˜ x ℓ − > h . And finally in the case where ˆ V ℓ,j ˆ x ℓ − > h and e V ℓ,j ˜ x ℓ − ∈ [0 , h ] we have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( ˆΣ ℓ − e Σ ℓ ) jj e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − e V ℓ,j ˜ x ℓ − h ! e V ℓ,j ˜ x ℓ − ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:16) h − e V ℓ,j ˜ x ℓ − (cid:17) e V ℓ,j ˜ x ℓ − h (cid:16) ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:17) . To show that this term of the RHS above is smaller than it is sufficient to show that ( h − e V ℓ,j ˜ x ℓ − ) e V ℓ,j ˜ x ℓ − ≤ h ( ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − ) , in our case where ≤ e V ℓ,j ˜ x ℓ − ≤ h < ˆ V ℓ,j ˆ x ℓ − . Consider the change of variables a = e V ℓ,j ˜ x ℓ − and b = ˆ V ℓ,j ˆ x ℓ − , then it suffices to show that ( h − a ) a ≤ h ( b − a ) ⇐ ≤ a − ah + hb. The derivative of a − ah + hb with respect to a is a − h ) , which is non-positive when a ≤ h .Therefore the minima of the quadratic when a ∈ [0 , h ] is at a = h and the minimum value is h − h + hb = hb − h = h ( b − h ) > . 
This proves that | ( ˇΣ) jj | ≤ in this final case as well.80ith this established we note that e ℓ := ˆ x ℓ − ˜ x ℓ = φ ( ˆ V ℓ ˆ x ℓ − ) − φ ( e V ℓ ˜ x ℓ − )= ˆΣ ℓ ˆ V ℓ ˆ x ℓ − − e Σ ℓ e V ℓ ˜ x ℓ − + φ ( ˆ V ℓ ˆ x ℓ − ) − ˆΣ ℓ ˆ V ℓ ˆ x ℓ − − φ ( e V ℓ ˜ x ℓ − ) + e Σ ℓ e V ℓ ˜ x ℓ − | {z } =: χ ℓ ( i ) = (cid:16) ˆΣ ℓ + ˇΣ ℓ (cid:17) (cid:16) ˆ V ℓ ˆ x ℓ − − e V ℓ ˜ x ℓ − (cid:17) + χ ℓ = (cid:16) ˆΣ ℓ + ˇΣ ℓ (cid:17) ˆ V ℓ | {z } =: A ℓ (ˆ x ℓ − − ˜ x ℓ − ) | {z } =: e ℓ − + (cid:16) ˆΣ ℓ + ˇΣ ℓ (cid:17) (cid:16) e V ℓ − V ℓ (cid:17) ˜ x ℓ − + χ ℓ | {z } =: b ℓ = A ℓ e ℓ − + b ℓ (97)where ( i ) follows because by the definition of the matrix ˇΣ ℓ for each coordinate j we have (cid:16) ˆΣ ℓ + ˇΣ ℓ (cid:17) jj (cid:16) ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:17) = (cid:16) ˆΣ ℓ (cid:17) jj (cid:16) ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:17) + (cid:0) ˇΣ ℓ (cid:1) jj (cid:16) ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:17) = (cid:16) ˆΣ ℓ (cid:17) jj (cid:16) ˆ V ℓ,j ˆ x ℓ − − e V ℓ,j ˜ x ℓ − (cid:17) + ( ˆΣ ℓ − e Σ ℓ ) jj e V ℓ,j ˜ x ℓ − = ( ˆΣ ℓ ) jj ˆ V ℓ,j ˆ x ℓ − − ( e Σ ℓ ) jj e V ℓ,j ˜ x ℓ − . In Equation (97) above we have expressed the difference between the post-activation featuresat layer ℓ in terms of the difference at layer ℓ − plus some error terms. Repeating this ℓ − more times yields e ℓ = A ℓ e ℓ − + b ℓ = A ℓ ( A ℓ − e ℓ − + b ℓ − ) + b ℓ = Y j = ℓ A j e + ℓ − X r =1 r +1 Y j = ℓ A j b r + b ℓ . Since e = k ˆ x − ˜ x k = 0 , by re-substituting the values of A ℓ and b ℓ we find ˆ x ℓ − ˜ x ℓ = ℓ − X r =1 r +1 Y j = ℓ (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j ( ˆΣ r + ˇΣ r )( ˆ V r − e V r )˜ x r − + (cid:16) ˆΣ ℓ + ˇΣ ℓ (cid:17) (cid:16) e V ℓ − ˆ V ℓ (cid:17) ˜ x ℓ − + ℓ − X r =1 r +1 Y j = ℓ ( ˆΣ j + ˇΣ j ) ˆ V j χ r + χ ℓ , (98)81nd therefore by the triangle inequality k ˆ x ℓ − ˜ x ℓ k≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ − X r =1 r +1 Y j = ℓ (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j ( ˆΣ r + ˇΣ r )( ˆ V r − e V r )˜ x r − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:16) ˆΣ ℓ + ˇΣ ℓ (cid:17) (cid:16) e V ℓ − ˆ V ℓ (cid:17) ˜ x ℓ − (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ − X r =1 r +1 Y j = ℓ ( ˆΣ j + ˇΣ j ) V j χ r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + k χ ℓ k≤ ℓ − X r =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r +1 Y j = ℓ (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ˆΣ r + ˇΣ r (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ˆ V r − e V r (cid:13)(cid:13)(cid:13) op k ˜ x r − k + (cid:13)(cid:13)(cid:13) ˆΣ ℓ + ˇΣ ℓ (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) e V ℓ − ˆ V ℓ (cid:13)(cid:13)(cid:13) op k ˜ x ℓ − k + ℓ − X r =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) r +1 Y j = ℓ ( ˆΣ j + ˇΣ j ) V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k χ r k + k χ ℓ k Recall that the diagonal matrices (Σ ℓ − ˆΣ ℓ − ˇΣ ℓ ) are O ( pL τ / ) sparse by the inductivehypothesis. Also the matrices (Σ ℓ − ˆΣ ℓ − ˇΣ ℓ ) have entries in [ − , . 
Therefore by applyingLemma E.13 (note that since τ = O (cid:18) L log ( p ) (cid:19) therefore Lemma E.13 applies at this levelof sparsity) we find k ˆ x ℓ − ˜ x ℓ k ≤ O ( L ) " ℓ X r =1 k ˆ V r − e V r k op k ˜ x r − k + ℓ X r =1 k χ r k ( i ) ≤ O ( L ) " ℓ X r =1 k ˆ V r − e V r k op k ˜ x r − k + ℓh √ p ≤ O ( L ) " ℓ X r =1 k ˆ V r − e V r k op ( k ˜ x r − − x r − k + k x r − k ) + ℓh √ p ( ii ) ≤ O ( L ) " ℓ X r =1 k ˆ V r − e V r k op (cid:0) O ( L τ ) + 2 (cid:1) + ℓh √ p ( iii ) ≤ O ( L ) " ℓ X r =1 k ˆ V r − e V r k op + Lτ ≤ O ( L τ ) , Inequality ( i ) follows since by definition of the Huberized ReLU for any z ∈ R we have that φ ( z ) ≤ φ ′ ( z ) z ≤ φ ( z ) + h/ , therefore k χ r k ∞ = (cid:13)(cid:13)(cid:13) φ ( V ℓ x ℓ − ) − Σ ℓ V ℓ x ℓ − − φ ( e V ℓ ˜ x ℓ − ) + e Σ ℓ e V ℓ ˜ x ℓ − (cid:13)(cid:13)(cid:13) ∞ ≤ · h h (99)which implies that k χ r k ≤ h √ p . Next ( ii ) follows by bound on k ˜ x r − − x r − k due to theinductive hypothesis and because k x r − k ≤ by Part (a) of Lemma E.1. Finally ( iii ) followsby assumption τ ≤ O (1 /L ) and h < τ √ p . This establishes a bound on k ˆ x ℓ − ˜ x ℓ k . We can alsomirror the logic to bound k x ℓ − ˜ x ℓ k and k x ℓ − ˜ x ℓ k . This completes the induction and the proof82f the lemma. Lemma E.15.
Let the event in Lemma E.1 hold and suppose that the conditions on h , p and τ described in that lemma hold with the additional assumptions that τ = O (cid:18) L log ( p ) (cid:19) and h ≤ τ √ p . Let V (1) be the initial weight matrix and e V , ˆ V be weight matrices such that k e V ℓ − V (1) ℓ k op , k ˆ V ℓ − V (1) ℓ k op ≤ τ for all ℓ ∈ [ L ] . Also let ¯Σ ℓ,s be O ( pL τ / ) -sparse diagonalmatrices with entries in [ − , for all ℓ ∈ [ L ] and s ∈ [ n ] . Then (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) e V L +1 ℓ Y r = L (cid:16) Σ e Vr,s + ¯Σ r,s (cid:17) e V r − ˆ V L +1 ℓ Y r = L Σ ˆ Vr,s ˆ V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ O (cid:16)p p log( p ) L τ / (cid:17) . for all ℓ ∈ [ L ] and all s ∈ [ n ] . Proof
We want to bound the operator norm of e V L +1 ℓ Y r = L (cid:16) Σ e Vr,s + ¯Σ r,s (cid:17) e V r − ˆ V L +1 ℓ Y r = L Σ ˆ Vr,s ˆ V r = e V L +1 ℓ Y r = L (cid:16) Σ e Vr,s + ¯Σ r,s (cid:17) e V r − V (1) L +1 ℓ Y r = L Σ V (1) r,s V (1) r | {z } =: χ + V (1) L +1 ℓ Y r = L Σ V (1) r,s V (1) r − ˆ V L +1 ℓ Y r = L ˆΣ Vr,s ˆ V r | {z } =: χ . (100)We shall instead bound the operator norm of χ and χ . Let us proceed to bound the operatornorm of χ (the bound on χ will hold using exactly the same logic). Now to ease notationlet us fix a sample index s ∈ [ n ] and drop it from all subscripts. Also to simplify notation letus refer to Σ e Vr,s as e Σ r , Σ ˆ Vr,s as ˆΣ r , ¯Σ r,s as ¯Σ r , and Σ V (1) r,s as Σ r . We shall also refer to V (1) assimply V .By assumption the diagonal matrix ¯Σ r is O ( pL τ / ) -sparse with entries in [ − , . Also thematrix Σ r − e Σ r is O ( pL τ / ) -sparse by Lemma E.14. Therefore the matrix ˇΣ r := e Σ r + ¯Σ r − Σ r is also O ( pL τ / ) -sparse and has entries in [ − , . Thus, χ = e V L +1 ℓ Y r = L (cid:16)e Σ r + ¯Σ r (cid:17) e V r − V L +1 ℓ Y r = L Σ r V r = e V L +1 ℓ Y r = L (cid:0) Σ r + ˇΣ r (cid:1) e V r − V L +1 ℓ Y r = L Σ r V r = ( e V L +1 − V L +1 ) ℓ Y r = L (cid:0) Σ r + ˇΣ r (cid:1) e V r | {z } =: ♠ + V L +1 ℓ Y r = L (cid:0) Σ r + ˇΣ r (cid:1) e V r − ℓ Y r = L Σ r V r !| {z } =: ♣ . (101)83he operator norm of ♠ is easy to bound by invoking Lemma E.13 k♠k op ≤ k e V L +1 − V L +1 k op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ Y r = L (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ O ( τ L ) . (102)To bound the operator norm of ♣ we will decompose the difference of the products of matricesterms into a sum. Each term in this sum correspond to either a flip from V r to ˜ V r or from Σ r to Σ r + ˇΣ r . That is, ♣ = − V L +1 ℓ Y r = L Σ r V r − V L +1 ℓ Y r = L (cid:0) Σ r + ˇΣ r (cid:1) e V r ! = − ℓ X q = L V L +1 =: ω ,q z }| { q +1 Y r = L (Σ r V r ) ! (cid:0) ˇΣ q (cid:1) e V q =: ω ,q z }| { ℓ Y r = q − (cid:0) Σ r + ˇΣ r (cid:1) e V r | {z } =: ♦ q − ℓ X q = L V L +1 =: ω ,q z }| { q +1 Y r = L (Σ r V r ) ! Σ q (cid:16) V q − e V q (cid:17) =: ω ,q z }| { ℓ Y r = q − (cid:0) Σ r + ˇΣ r (cid:1) e V r | {z } =: ♥ q (103)where in the previous equality above, the indices in the products “count down”, so that casesin which q = L include “empty products”, and we adopt the convention that, in such cases, ω ,q = ω ,q = I, and when q = ℓ ω ,q = ω ,q = I.
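The sums defining $\Diamond_q$ and $\heartsuit_q$ are instances of the standard telescoping identity for a difference of matrix products, in which exactly one factor is "flipped" per term. A minimal numerical sketch of that identity (ours; the depth, dimension, and random matrices are arbitrary and not tied to the network):

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 5, 4                                    # illustrative depth and dimension
A = [rng.normal(size=(d, d)) for _ in range(L)]
B = [rng.normal(size=(d, d)) for _ in range(L)]

def prod(mats):
    out = np.eye(d)
    for m in mats:
        out = out @ m
    return out

# Difference of products written as a sum in which exactly one factor is
# flipped from A_q to B_q per term; empty products are the identity.
lhs = prod(A) - prod(B)
rhs = sum(prod(A[:q]) @ (A[q] - B[q]) @ prod(B[q + 1:]) for q in range(L))
print(np.allclose(lhs, rhs))                   # True
```

In the proof, the flipped factor is either $\check{\Sigma}_q$ (exploited through its sparsity) or the difference $V_q - \tilde{V}_q$ (exploited through its small operator norm), while the remaining products are controlled by Lemma E.13 term by term.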
84e begin by bounding the operator norm of ♦ q (for a q that is not ℓ or L , the exact samebound follows in these boundary cases): k ♦ q k op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) (cid:0) ˇΣ q (cid:1) e V q ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r,s (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) Σ / q ˇΣ q Σ / q e V ℓ ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) Σ / q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k ˇΣ q k op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ / q e V ℓ ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( ii ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) Σ / q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ / q e V ℓ ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( iii ) ≤ O (cid:18)q pL τ / log( p ) (cid:19) (cid:13)(cid:13)(cid:13) Σ / q (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) e V ℓ ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( iv ) ≤ O (cid:18)q pL τ / log( p ) (cid:19) × O ( L ) = O (cid:16)p p log( p ) L τ / (cid:17) (104)where in ( i ) we define Σ / q to be a diagonal matrix with (Σ / q ) jj := I h ( e Σ q,s ) jj = 0 i . Note thatsince ˇΣ q is O ( pL τ / ) sparse, therefore Σ / q is also O ( pL τ / ) sparse. Inequality ( ii ) followssince the entries of ˇΣ q,s lie between [ − , , ( iii ) follows by applying Part (g) of Lemma E.1.Finally, ( iv ) by applying Lemma E.13 since the matrix ˇΣ r is O ( pL τ / ) -sparse and has entriesin [ − , .To control the operator norm of ♥ q (again for a q = ℓ or L , the exact same bound followsin these boundary cases): k ♥ q k op = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) Σ q (cid:16) e V q − V q (cid:17) ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k Σ q k op (cid:13)(cid:13)(cid:13) e V q − V q (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ≤ τ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ Y r = q +1 (cid:0) Σ r + ˇΣ r (cid:1) e V r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( i ) ≤ O ( τ L ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) V L +1 q − Y r = L (Σ r V r ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( ii ) ≤ O ( √ pτ L ) where ( i ) follows by applying Lemma E.13 and ( ii ) follows by Part (c) of Lemma E.1.With these bounds on ♦ q and ♥ q along with the decomposition in (103) we find that k♣k op ≤ L × (cid:16) O (cid:16)p p log( p ) L τ / (cid:17) + O ( √ pτ L ) (cid:17) ≤ O (cid:16)p p log( p ) L τ / (cid:17) . k♣k op along with (101) and (102) we get that k χ k op ≤ O ( τ L ) + O (cid:16)p p log( p ) L τ / (cid:17) = O (cid:16)p p log( p ) L τ / (cid:17) . 
As mentioned above, we can bound $\|\chi_2\|_{op}$ using exactly the same logic, obtaining the same bound as for $\|\chi_1\|_{op}$. Thus, the decomposition in (100), together with the triangle inequality, proves the claim of the lemma.
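Before turning to the proof of Lemma D.7, it may also help to see, in isolation, the unrolling of the recursion $e_\ell = A_\ell e_{\ell-1} + b_\ell$ used in Equations (97) and (98) of Lemma E.14. A toy check of the closed form (ours; the depth and dimension are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 6, 3
A = [None] + [rng.normal(size=(d, d)) for _ in range(L)]   # 1-indexed: A[1..L]
b = [None] + [rng.normal(size=d) for _ in range(L)]        # 1-indexed: b[1..L]

# Forward recursion e_l = A_l e_{l-1} + b_l with e_0 = 0.
e = np.zeros(d)
for l in range(1, L + 1):
    e = A[l] @ e + b[l]

# Unrolled form: e_L = sum_{r=1}^{L-1} (A_L A_{L-1} ... A_{r+1}) b_r + b_L.
def prod_down(hi, lo):
    out = np.eye(d)
    for j in range(hi, lo - 1, -1):
        out = out @ A[j]
    return out

closed = sum(prod_down(L, r + 1) @ b[r] for r in range(1, L)) + b[L]
print(np.allclose(e, closed))                              # True
```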
E.3 The proof
With these various lemmas in place, we are now finally ready to prove Lemma D.7.
Lemma D.7.
For any δ > , suppose that τ = Ω (cid:18) log ( nLδ ) p L (cid:19) and τ = O (cid:18) L log ( p ) (cid:19) , h ≤ τ √ p and p = poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for some sufficiently large polynomial. Then with probability at least − δ over the random initialization V (1) we have(a) ε app ( V (1) , τ ) ≤ O ( p p log( p ) L τ / ) and,(b) Γ( V (1) , τ ) ≤ O ( √ pL ) . Proof
Note that since τ = Ω (cid:18) log ( nLδ ) p L (cid:19) and τ ≤ O (cid:18) L log ( p ) (cid:19) , h ≤ τ √ p ≤ √ pL , andbecause p ≥ poly (cid:0) L, log (cid:0) nδ (cid:1)(cid:1) for a large enough polynomial all the conditions required toinvoke Lemma E.1 are satisfied. Let us assume that the event in Lemma E.1 which occurswith probability at least − δ holds in the rest of this proof. Proof of Part (a):
Recall the definition of the approximation error ε app ( V (1) , τ ) := sup s ∈ [ n ] sup ˆ V, e V ∈B ( V (1) ,τ ) (cid:12)(cid:12)(cid:12) f ˆ V ( x s ) − f e V ( x s ) − ∇ f e V ( x s ) · (cid:16) ˆ V − e V (cid:17)(cid:12)(cid:12)(cid:12) . Fix a ˆ V, e V ∈ B ( V (1) , τ ) and a sample s ∈ [ n ] . To ease notation denote Σ ˆ Vℓ,s by ˆΣ ℓ , Σ e Vℓ,s by e Σ ℓ , x ˆ Vℓ,s by ˆ x ℓ , x e Vℓ,s by ˜ x ℓ and x V (1) ℓ,s by x ℓ,s . We know that f e V ( x s ) = e V L +1 ˜ x L and f ˆ V ( x s ) = ˆ V L +1 ˆ x L .Also since ∇ ˆ V L +1 f ˆ V ( x s ) = ˆ x L we have f ˆ V ( x s ) − f e V ( x s ) − ∇ f e V ( x s ) · (cid:16) ˆ V − e V (cid:17) = ˆ V L +1 ˆ x L − e V L +1 ˜ x L − ( ˆ V L +1 − e V L +1 ) · ˜ x L − L X ℓ =1 ∇ e V ℓ f e V ( x s ) · ( ˆ V ℓ − e V ℓ )= ˆ V L +1 (ˆ x L − ˜ x L ) − L X ℓ =1 ∇ e V ℓ f e V ( x s ) · ( ˆ V ℓ − e V ℓ ) . (105)86y Equation (98) from the proof of Lemma E.14 above we can decompose the difference asfollows, ˆ x L − ˜ x L = L − X ℓ =1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j ( ˆΣ ℓ + ˇΣ ℓ )( ˆ V ℓ − e V ℓ )˜ x ℓ − + (cid:16) ˆΣ L + ˇΣ L (cid:17) (cid:16) ˆ V L − e V L (cid:17) ˜ x ℓ − + L − X ℓ =1 ℓ +1 Y j = L ( ˆΣ j + ˇΣ j ) ˆ V j χ ℓ + χ L , (106)where the diagonal matrix ˇΣ j,s is O ( pL τ / ) -sparse and has entries in [ − , , and the p -dimensional vectors χ ℓ have infinity norm at most h (see Inequality (99)). Now when ℓ ∈ [ L ] ,the formula for the gradient given in (2a), using this formula and because given two matrices A and B , A · B = Tr( A ⊤ B ) we get ∇ e V ℓ f e V ( x s ) · (cid:16) ˆ V ℓ − e V ℓ (cid:17) = Tr h ∇ e V ℓ f e V ( x s ) ⊤ (cid:16) ˆ V ℓ − e V ℓ (cid:17)i = Tr e Σ ℓ L Y j = ℓ +1 e V ⊤ j e Σ j e V ⊤ L +1 ˜ x ⊤ ℓ − ⊤ (cid:16) ˆ V ℓ − e V ℓ (cid:17) = Tr ˜ x ℓ − e V L +1 ℓ +1 Y j = L e Σ j e V j e Σ ℓ (cid:16) ˆ V ℓ − e V ℓ (cid:17) = e V L +1 ℓ +1 Y j = L e Σ j,s e V j e Σ ℓ,s (cid:16) ˆ V ℓ − e V ℓ (cid:17) ˜ x ℓ − . (107)87sing (105)-(107), and noting that, here Q ℓ +1 j = L A j denotes A L A L − . . . A ℓ +1 , i.e. the indices“count down”, we find f ˆ V ( x s ) − f e V ( x s ) − ∇ f e V ( x s ) · (cid:16) ˆ V − e V (cid:17) ( i ) = L X ℓ =1 ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) e V j ( ˆΣ ℓ + ˇΣ ℓ )( ˆ V ℓ − e V ℓ )˜ x ℓ − − e V L +1 ℓ +1 Y j = L e Σ j e V j e Σ ℓ (cid:16) ˆ V ℓ − e V ℓ (cid:17) ˜ x ℓ − + L X ℓ =1 ℓ +1 Y j = L ( ˆΣ j + ˇΣ j ) ˆ V j χ ℓ = L X ℓ =1 ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j ( ˆΣ ℓ + ˇΣ ℓ ) − e V L +1 ℓ +1 Y j = L e Σ j e V j e Σ ℓ ( ˆ V ℓ − e V ℓ )˜ x ℓ − + L X ℓ =1 ℓ +1 Y j = L ( ˆΣ j + ˇΣ j ) ˆ V j χ ℓ = L X ℓ =1 ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j − e V L +1 ℓ +1 Y j = L e Σ j e V j e Σ ℓ ( ˆ V ℓ − e V ℓ )˜ x ℓ − | {z } =: ♠ ℓ + L X ℓ =1 ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j (cid:16) ˆΣ ℓ + ˇΣ ℓ − e Σ ℓ (cid:17) ( ˆ V ℓ − e V ℓ )˜ x ℓ − | {z } =: ♣ ℓ + L X ℓ =1 ℓ +1 Y j = L ( ˆΣ j + ˇΣ j ) ˆ V j χ ℓ | {z } =: ♥ ℓ (108)where in ( i ) , we adopt the convention that when ℓ = L , the “empty products” Q ℓ +1 j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j and Q ℓ +1 j = L e Σ j e V j are interpreted as I . 
Let us bound the norm of ♠ ℓ in the case where ℓ = L (the bound in the boundary case when ℓ = L follows by exactly the same logic): k♠ ℓ k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j − e V L +1 ℓ +1 Y j = L e Σ j e V j e Σ ℓ ( ˆ V ℓ − e V ℓ )˜ x ℓ − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j − e V L +1 ℓ +1 Y j = L e Σ j e V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)e Σ ℓ ( ˆ V ℓ − e V ℓ )˜ x ℓ − (cid:13)(cid:13)(cid:13) ( i ) ≤ O ( p p log( p ) L τ / ) (cid:13)(cid:13)(cid:13)e Σ ℓ (cid:13)(cid:13)(cid:13) op k ˆ V ℓ − e V ℓ kk ˜ x ℓ − k ( ii ) ≤ O ( p p log( p ) L τ / ) ( k ˜ x ℓ − − x ℓ − k + k x ℓ − k ) ( iii ) ≤ O ( p p log( p ) L τ / ) (cid:0) O ( L τ ) (cid:1) ≤ O ( p p log( p ) L τ / ) (109)88here ( i ) follows by invoking Lemma E.15, ( ii ) is because the entries of e Σ ℓ lie between and and because k ˆ V ℓ − e V ℓ k ≤ τ since both ˆ V and e V are in B ( V (1) , τ ) . Inequality ( iii ) is because k ˜ x ℓ − − x ℓ − k ≤ O ( L τ ) by Lemma E.14 and k x ℓ − k ≤ by Part (a) of Lemma E.1.Moving on to ♣ ℓ (again consider the case where ℓ = L , the bound in the boundary casewhen ℓ = L follows by exactly the same logic), k♣ ℓ k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j (cid:16) ˆΣ ℓ + ˇΣ ℓ − e Σ ℓ (cid:17) ( ˆ V ℓ − e V ℓ )˜ x ℓ − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˆ V L +1 ℓ +1 Y j = L (cid:16) ˆΣ j + ˇΣ j (cid:17) ˆ V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ˆΣ ℓ + ˇΣ ℓ − e Σ ℓ (cid:13)(cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ˆ V ℓ − e V ℓ (cid:13)(cid:13)(cid:13) k ˜ x ℓ − k ( i ) ≤ O ( L ) × τ × (cid:0) O ( L τ ) (cid:1) = O ( τ L ) (110)where ( i ) follows by invoking Lemma E.13, since the diagonal matrix ˆΣ ℓ + ˇΣ ℓ − e Σ ℓ have entriesbetween − and and by bounding k ˜ x ℓ − k as we did above. Finally we bound the norm of ♥ ℓ (again in the case where ℓ = L , the bound when ℓ = L follows by exactly the same logic) k ♥ ℓ k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ +1 Y j = L ( ˆΣ j + ˇΣ j ) ˆ V j χ ℓ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ℓ +1 Y j = L ( ˆΣ j + ˇΣ j ) ˆ V j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k χ ℓ k ( i ) ≤ O ( L ) k χ ℓ k ≤ O ( L ) √ p k χ ℓ k ∞ ( ii ) ≤ O ( L ) √ ph ( iii ) ≤ O ( τ L ) (111)where ( i ) is by invoking Lemma E.13, ( ii ) is due to a bound on the k χ ℓ k ∞ ≤ h derived inInequality 99 and ( iii ) is by the assumption that h < τ √ p . The bounds on the norms of ♠ ℓ , ♣ ℓ and ♥ ℓ along with the decomposition in (108) reveals that for any s ∈ [ n ] , ˆ V, e V ∈ B ( V (1) , τ ) : (cid:12)(cid:12)(cid:12) f ˆ V ( x s ) − f e V ( x s ) − ∇ f e V ( x s ) · (cid:16) ˆ V − e V (cid:17)(cid:12)(cid:12)(cid:12) ≤ L (cid:16) O ( p p log( p ) L τ / ) + O ( τ L ) (cid:17) ≤ O ( p p log( p ) L τ / ) . This completes the proof of the first part.
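To make the quantity bounded in Part (a) concrete, the following toy sketch (ours) estimates the linearization error $|f_{\hat{V}}(x) - f_{\tilde{V}}(x) - \nabla f_{\tilde{V}}(x) \cdot (\hat{V} - \tilde{V})|$ for a small two-layer network with a Huberized-ReLU hidden layer. The width, smoothing level, radius $\tau$, and initialization scales are illustrative, and the closed form of the activation ($z^2/2h$ on $[0,h]$) is the standard Huberization assumed here, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(3)
p, d, h, tau = 50, 10, 0.1, 0.05          # width, input dim, smoothing, ball radius (illustrative)

def phi(z):                                # assumed Huberized ReLU: 0, z^2/(2h), z - h/2
    return np.where(z <= 0, 0.0, np.where(z <= h, z ** 2 / (2 * h), z - h / 2))

def dphi(z):                               # its derivative, matching the Sigma entries in the proofs
    return np.clip(z / h, 0.0, 1.0)

def f(V, w, x):                            # two-layer network output
    return w @ phi(V @ x)

def grads(V, w, x):                        # analytic gradients of f with respect to V and w
    z = V @ x
    return np.outer(w * dphi(z), x), phi(z)

x = rng.normal(size=d)
x /= np.linalg.norm(x)
V0 = rng.normal(size=(p, d)) / np.sqrt(d)
w0 = rng.normal(size=p) / np.sqrt(p)

eps_app = 0.0
for _ in range(200):
    Vh = V0 + tau * rng.normal(size=(p, d)) / np.sqrt(p)   # weight settings perturbed by roughly tau
    Vt = V0 + tau * rng.normal(size=(p, d)) / np.sqrt(p)
    wh = w0 + tau * rng.normal(size=p) / np.sqrt(p)
    wt = w0 + tau * rng.normal(size=p) / np.sqrt(p)
    gV, gw = grads(Vt, wt, x)
    lin = f(Vt, wt, x) + np.sum(gV * (Vh - Vt)) + gw @ (wh - wt)   # A . B = Tr(A^T B)
    eps_app = max(eps_app, abs(f(Vh, wh, x) - lin))

print(f"empirical linearization error over sampled pairs: {eps_app:.3e}")
```

The printed value is typically far smaller than $\tau$, which is the qualitative behaviour the lemma quantifies for deep networks.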
Proof of Part (b):
Recall the definition of Γ( V (1) , τ )Γ( V (1) , τ ) = sup s ∈ [ n ] sup ℓ ∈ [ L +1] sup V ∈B ( V (1) ,τ ) k∇ V ℓ f V ( x s ) k . s ∈ [ n ] . First let us bound the Frobenius norm of the gradient when ℓ ∈ [ L ] . Bythe formula in (2a) we have k∇ V ℓ f V ( x s ) k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Vℓ,s L Y j = ℓ +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 x V ⊤ ℓ − ,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Σ Vℓ,s L Y j = ℓ +1 (cid:16) V ⊤ j Σ Vj,s (cid:17) V ⊤ L +1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k x V ⊤ ℓ − ,s k≤ (cid:13)(cid:13) Σ Vℓ,s (cid:13)(cid:13) op (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = ℓ +1 V ⊤ j Σ Vj,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k V L +1 kk x ℓ − ,s k≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = ℓ +1 V ⊤ j Σ Vj,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op k V L +1 kk x ℓ − ,s k (since k Σ Vℓ,s k op ≤ ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = ℓ +1 V ⊤ j Σ Vj,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op (cid:16) k V (1) L +1 k + k V (1) L +1 − V L +1 k (cid:17) (cid:16) k x V (1) ℓ − ,s k + k x Vℓ − ,s − x V (1) ℓ − ,s k (cid:17) ( i ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = ℓ +1 V ⊤ j Σ Vj,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( O ( √ p ) + τ ) (cid:0) O ( L τ ) (cid:1) ( ii ) ≤ O ( √ p ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) L Y j = ℓ +1 V ⊤ j Σ Vj,s (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) op ( iii ) ≤ O ( √ pL ) (112)where ( i ) follows since k V (1) L +1 k ≤ O ( √ p ) by Part (b) of Lemma E.1, k V (1) L +1 − V L +1 k ≤ τ , k x V (1) ℓ − ,s k ≤ by Part (a) of Lemma E.1 and k x Vℓ − ,s − x V (1) ℓ − ,s k ≤ O ( L τ ) by Lemma E.14. Next ( ii ) follows since τ = O (1 /L ) . Finally, ( iii ) follows since the matrix Σ Vj,s − Σ V (1) j,s is O ( pL τ / ) sparse by Lemma E.14, therefore we can apply Lemma E.13 to bound the operator norm ofthe product of the matrices (since τ = O (cid:18) L log ( p ) (cid:19) , that Lemma applies that this level ofsparsity).If ℓ = L + 1 , then the gradient at V is x VL,s , therefore sup s ∈ [ n ] sup V ∈B ( V (1) ,τ ) k∇ V L +1 f V ( x s ) k = sup s ∈ [ n ] sup V ∈B ( V (1) ,τ ) k x VL,s k≤ sup s ∈ [ n ] sup V ∈B ( V (1) ,τ ) (cid:16) k x V (1) L,s k + k x V (1) L,s − x VL,s k (cid:17) ≤ sup s ∈ [ n ] sup V ∈B ( V (1) ,τ ) (cid:0) O ( L τ ) (cid:1) ≤ O (1) , where above we used the fact that k x V (1) L,s k ≤ by Part (a) of Lemma E.1 and k x V (1) L,s − x VL,s k ≤ O ( L τ ) by Lemma E.14 along with the fact that τ ≤ O (1 /L ) . Combining the conclusions inthe two cases when ℓ ∈ [ L ] and ℓ ∈ [ L + 1] establishes our second claim.Now that we have proved Lemma D.7, the reader can next jump to Appendix D.2.90 Probabilistic tools
For an excellent reference on sub-Gaussian and sub-exponential concentration inequalities, we refer the reader to Vershynin [Ver18]. We begin by defining sub-Gaussian and sub-exponential random variables.
Definition F.1.
A random variable $\theta$ is sub-Gaussian if $\|\theta\|_{\psi_2} := \inf\{t > 0 : \mathbb{E}[\exp(\theta^2/t^2)] \le 2\}$ is finite. Further, $\|\theta\|_{\psi_2}$ is defined to be its sub-Gaussian norm.

Definition F.2.
A random variable $\theta$ is said to be sub-exponential if $\|\theta\|_{\psi_1} := \inf\{t > 0 : \mathbb{E}[\exp(|\theta|/t)] \le 2\}$ is finite. Further, $\|\theta\|_{\psi_1}$ is defined to be its sub-exponential norm.
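As a concrete illustration of Definition F.1 (ours, not from the text): for $\theta \sim N(0,1)$ the Orlicz moment has the closed form $\mathbb{E}[\exp(\theta^2/t^2)] = (1 - 2/t^2)^{-1/2}$ whenever $t^2 > 2$, so setting it equal to $2$ gives $\|\theta\|_{\psi_2} = \sqrt{8/3}$. A short numerical check by quadrature:

```python
import numpy as np

z = np.linspace(-40.0, 40.0, 400_001)         # quadrature grid for the standard normal
dz = z[1] - z[0]

def orlicz_moment(t):
    """E[exp(theta^2/t^2)] for theta ~ N(0,1), via a Riemann sum with a stable combined exponent."""
    return float(np.sum(np.exp(z ** 2 / t ** 2 - z ** 2 / 2)) * dz / np.sqrt(2 * np.pi))

t_star = np.sqrt(8.0 / 3.0)
print(orlicz_moment(t_star))                  # ~ 2.0, so ||theta||_{psi_2} = sqrt(8/3) for N(0,1)
print((1.0 - 2.0 / t_star ** 2) ** -0.5)      # closed form (1 - 2/t^2)^(-1/2), also 2.0
```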
Next we state a few well-known facts about sub-Gaussian random variables.

Lemma F.3. [Ver18, Lemma 2.7.6] If a random variable $\theta$ is sub-Gaussian, then $\theta^2$ is sub-exponential with $\|\theta^2\|_{\psi_1} = \|\theta\|_{\psi_2}^2$.

Lemma F.4. [Ver18, Theorem 5.2.2] If a random variable $\theta \sim N(0,1)$ and $g$ is a $1$-Lipschitz function, then $\|g(\theta) - \mathbb{E}[g(\theta)]\|_{\psi_2} \le c$, for some absolute positive constant $c$.

Let us state Hoeffding's inequality [see, e.g., Ver18, Theorem 2.6.2], a concentration inequality for a sum of independent sub-Gaussian random variables.
Theorem F.5.
For independent mean-zero sub-Gaussian random variables $\theta_1, \ldots, \theta_m$, for every $\eta > 0$, we have
\[
\mathbb{P}\left[\Big|\sum_{i=1}^m \theta_i\Big| \ge \eta\right] \;\le\; 2\exp\left(-\frac{c\,\eta^2}{\sum_{i=1}^m \|\theta_i\|_{\psi_2}^2}\right),
\]
where $c$ is a positive absolute constant.
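A small simulation in the spirit of Theorem F.5 (ours), using $\pm 1$ signs, which are sub-Gaussian with an absolute-constant $\psi_2$ norm. For this bounded case the classical Hoeffding bound $2\exp(-\eta^2/2m)$ plays the role of the right-hand side, with the unspecified absolute constant $c$ folded into the exponent:

```python
import numpy as np

rng = np.random.default_rng(5)
m, trials = 100, 200_000
# 2*Binomial(m, 1/2) - m has the law of a sum of m independent Rademacher (+/-1) signs.
sums = 2.0 * rng.binomial(m, 0.5, size=trials) - m

for eta in (10.0, 20.0, 30.0):
    empirical = float(np.mean(np.abs(sums) >= eta))
    bound = 2.0 * np.exp(-eta ** 2 / (2.0 * m))
    print(f"eta = {eta:4.0f}   empirical tail = {empirical:.4f}   2*exp(-eta^2/2m) = {bound:.4f}")
```

The empirical tail sits below the bound at every threshold, as the theorem guarantees.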
Theorem F.6.
For independent mean-zero sub-exponential random variables $\theta_1, \ldots, \theta_m$, for every $\eta > 0$, we have
\[
\mathbb{P}\left[\Big|\sum_{i=1}^m \theta_i\Big| \ge \eta\right] \;\le\; 2\exp\left(-c\,\min\left\{\frac{\eta^2}{\sum_{i=1}^m \|\theta_i\|_{\psi_1}^2},\; \frac{\eta}{\max_i \|\theta_i\|_{\psi_1}}\right\}\right),
\]
where $c$ is a positive absolute constant.

Next is the Gaussian-Lipschitz contraction inequality applied to control the squared norm of a Gaussian random vector [see, e.g., Wai19, Example 2.28].

Theorem F.7.
Let $\theta_1, \ldots, \theta_m$ be drawn i.i.d. from $N(0, \sigma^2)$. Then, for every $\eta > 0$, we have
\[
\mathbb{P}\left[\sum_{i=1}^m \theta_i^2 \ge \sigma^2 m (1+\eta)\right] \;\le\; \exp\left(-c\,m\,\eta^2\right),
\]
where $c$ is a positive absolute constant.

Let us continue by defining an $\varepsilon$-net with respect to the Euclidean distance.

Definition F.8.
Let $S \subseteq \mathbb{R}^p$. A subset $K$ is called an $\varepsilon$-net of $S$ if every point in $S$ is within a distance $\varepsilon$ (in Euclidean distance) of some point in $K$.

The following lemma bounds the size of a $1/2$-net of unit vectors in $\mathbb{R}^p$.

Lemma F.9.
Let $S$ be the set of all unit vectors in $\mathbb{R}^p$. Then there exists a $1/2$-net of $S$ of size $5^p$.

Proof
Follows immediately by invoking [Ver18, Corollary 4.2.13] with $\varepsilon = 1/2$.

Here is a bound on the size of a $1/2$-net of $k$-sparse unit vectors.

Lemma F.10.
Let $S$ be the set of all $k$-sparse unit vectors in $\mathbb{R}^p$. Then there exists a $1/2$-net of $S$ of size $\binom{p}{k} 5^k$.

Proof
We construct a $1/2$-net as follows. The number of distinct $k$-element subsets of $[p]$ is $\binom{p}{k}$. For each of these subsets, build a $1/2$-net of the unit vectors supported on it, of size $5^k$; this is guaranteed by the preceding lemma. Taking the union of these nets yields a $1/2$-net of the $k$-sparse unit vectors of size $\binom{p}{k} 5^k$, as claimed.
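The gain from sparsity in Lemma F.10 is easy to tabulate. The sketch below (ours) compares the $\binom{p}{k} 5^k$ bound with the $5^p$ bound for the full sphere, using the covering-number bound $(2/\varepsilon + 1)^k \le 5^k$ at $\varepsilon = 1/2$ from [Ver18, Corollary 4.2.13]:

```python
from math import comb, log10

p = 1000
print(f"full sphere (Lemma F.9):  log10 net size ~ {p * log10(5):.1f}")
for k in (5, 10, 20):
    sparse_net = comb(p, k) * 5 ** k    # Lemma F.10: (p choose k) supports, a 5^k net on each
    print(f"k-sparse,  k = {k:2d}:       log10 net size ~ {log10(sparse_net):6.1f}")
```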
Here is a simple fact about orthogonal projections of Gaussian random vectors.

Lemma F.11. Given $\sigma > 0$, let $v \in \mathbb{R}^p$ be a random vector whose entries are drawn independently from $N(0, \sigma^2)$. Then, given any two orthogonal vectors $a$ and $b$,
\[
\mathbb{E}\left[\left(v^\top (a+b)\right)^2\right] = \mathbb{E}\left[\left(v^\top a\right)^2\right] + \mathbb{E}\left[\left(v^\top b\right)^2\right].
\]

Proof
Observe that
\[
\mathbb{E}\left[\left(v^\top (a+b)\right)^2\right] = \mathbb{E}\left[\left(v^\top a\right)^2\right] + \mathbb{E}\left[\left(v^\top b\right)^2\right] + 2\,\mathbb{E}\left[(v^\top a)(v^\top b)\right].
\]
Let us demonstrate that the cross-term is zero:
\[
\mathbb{E}\left[(v^\top a)(v^\top b)\right] = \mathbb{E}\left[a^\top v v^\top b\right] = a^\top \mathbb{E}\left[v v^\top\right] b = \sigma^2 (a^\top b) = 0,
\]
which proves our claim.
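A quick Monte Carlo confirmation of Lemma F.11 (ours; the dimension, variance, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
p, sigma, trials = 8, 0.7, 500_000

a = rng.normal(size=p)
b = rng.normal(size=p)
b -= (a @ b) / (a @ a) * a                     # make b orthogonal to a
V = sigma * rng.standard_normal(size=(trials, p))

lhs = float(np.mean((V @ (a + b)) ** 2))
rhs = float(np.mean((V @ a) ** 2) + np.mean((V @ b) ** 2))
print(f"{lhs:.4f}  vs  {rhs:.4f}   (both ~ sigma^2 (||a||^2 + ||b||^2) = {sigma**2 * (a @ a + b @ b):.4f})")
```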
References

[ALS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. "A convergence theory for deep learning via over-parameterization". In: International Conference on Machine Learning. 2019, pp. 242–252 (Cited on pages 2, 6, 53).
[And+14] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. "Learning polynomials with neural networks". In: International Conference on Machine Learning. 2014, pp. 1908–1916 (Cited on page 14).
[Aro+19a] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. "Implicit regularization in deep matrix factorization". In: Advances in Neural Information Processing Systems. 2019, pp. 7413–7424 (Cited on page 14).
[Aro+19b] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. "Fine-Grained Analysis of Optimization and Generalization for overparameterized Two-Layer Neural Networks". In: International Conference on Machine Learning. 2019, pp. 322–332 (Cited on page 14).
[BL20] Peter L Bartlett and Philip M Long. "Failures of model-dependent generalization bounds for least-norm interpolation". In: arXiv preprint arXiv:2010.08479 (2020) (Cited on page 14).
[Bar+20] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. "Benign overfitting in linear regression". In: Proceedings of the National Academy of Sciences (2020) (Cited on page 14).
[Bel+19] Mikhail Belkin, Daniel J Hsu, Siyuan Ma, and Soumik Mandal. "Reconciling modern machine-learning practice and the classical bias–variance trade-off". In: Proceedings of the National Academy of Sciences (2019).
Advances in Neural Information Processing Systems. 2018, pp. 2300–2311 (Cited on page 14).
[BG19] Alon Brutzkus and Amir Globerson. "Why do larger models generalize better? A theoretical perspective via the XOR problem". In: International Conference on Machine Learning. 2019, pp. 822–830 (Cited on page 14).
[Bru+18] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. "SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data". In: International Conference on Learning Representations. 2018 (Cited on page 14).
[CL20] Niladri S Chatterji and Philip M Long. "Finite-sample analysis of interpolating linear classifiers in the overparameterized regime". In: arXiv preprint arXiv:2004.12019 (2020) (Cited on page 14).
[CLB20] Niladri S Chatterji, Philip M Long, and Peter L Bartlett. "When does gradient descent with logistic loss find interpolating two-layer networks?" In: arXiv preprint arXiv:2012.02409 (2020) (Cited on page 2).
[Che+20] Zixiang Chen, Yuan Cao, Quanquan Gu, and Tong Zhang. "A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks". In: Advances in Neural Information Processing Systems. 2020 (Cited on page 14).
[Che+21] Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. "How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?" In: International Conference on Learning Representations. 2021 (Cited on pages 2, 6, 45–47).
[Chi20] Lénaïc Chizat. Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks. Talk at MSRI. 2020. url: (Cited on page 14).
[CB18] Lénaïc Chizat and Francis Bach. "On the global convergence of gradient descent for over-parameterized models using optimal transport". In: Advances in Neural Information Processing Systems. 2018, pp. 3036–3046 (Cited on page 14).
[CB20] Lénaïc Chizat and Francis Bach. "Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss". In: Conference on Learning Theory. 2020 (Cited on page 14).
[COB19] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. "On lazy training in differentiable programming". In: Advances in Neural Information Processing Systems. 2019, pp. 2937–2947 (Cited on page 2).
[Cor+09] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2009 (Cited on page 3).
[Cov65] Thomas M Cover. "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition". In: IEEE Transactions on Electronic Computers (1965).
International Conference on Machine Learning. 2019, pp. 1675–1685 (Cited on page 2).
[Du+18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. "Gradient Descent Provably Optimizes Over-parameterized Neural Networks". In: International Conference on Learning Representations. 2018 (Cited on pages 2, 14).
[GLM18] Rong Ge, Jason D Lee, and Tengyu Ma. "Learning One-hidden-layer Neural Networks with Landscape Design". In: International Conference on Learning Representations (2018) (Cited on page 14).
[Gun+18a] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nathan Srebro. "Characterizing Implicit Bias in Terms of Optimization Geometry". In: International Conference on Machine Learning. 2018, pp. 1832–1841 (Cited on page 14).
[Gun+18b] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nathan Srebro. "Implicit bias of gradient descent on linear convolutional networks". In: Advances in Neural Information Processing Systems. 2018, pp. 9461–9471 (Cited on page 14).
[Has+19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. "Surprises in high-dimensional ridgeless least squares interpolation". In: arXiv preprint arXiv:1903.08560 (2019) (Cited on page 14).
[HMX20] Daniel J Hsu, Vidya Muthukumar, and Ji Xu. "On the proliferation of support vectors in high dimensions". In: arXiv preprint arXiv:2009.10670 (2020) (Cited on page 14).
[JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. "Neural tangent kernel: Convergence and generalization in neural networks". In: Advances in Neural Information Processing Systems. 2018, pp. 8571–8580 (Cited on page 2).
[JT19a] Ziwei Ji and Matus Telgarsky. "Gradient descent aligns the layers of deep linear networks". In: International Conference on Learning Representations. 2019 (Cited on page 14).
[JT19b] Ziwei Ji and Matus Telgarsky. "Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks". In: International Conference on Learning Representations. 2019 (Cited on pages 2, 6, 14).
[JT19c] Ziwei Ji and Matus Telgarsky. "The implicit bias of gradient descent on nonseparable data". In: Conference on Learning Theory. 2019, pp. 1772–1798 (Cited on page 14).
[JT20] Ziwei Ji and Matus Telgarsky. "Directional convergence and alignment in deep learning". In: Advances in Neural Information Processing Systems. 2020 (Cited on page 14).
[LL18] Yuanzhi Li and Yingyu Liang. "Learning overparameterized neural networks via stochastic gradient descent on structured data". In: Advances in Neural Information Processing Systems. 2018, pp. 8157–8166 (Cited on page 2).
[LMZ18] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. "Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations". In: Conference On Learning Theory. 2018, pp. 2–47 (Cited on page 14).
[LY17] Yuanzhi Li and Yang Yuan. "Convergence analysis of two-layer neural networks with ReLU activation". In: Advances in Neural Information Processing Systems. 2017, pp. 597–607 (Cited on page 14).
[LR20] Tengyuan Liang and Alexander Rakhlin. "Just interpolate: Kernel "ridgeless" regression can generalize". In: Annals of Statistics (2020).
Conference on Learning Theory. 2020, pp. 2683–2711 (Cited on page 14).
[LS20] Tengyuan Liang and Pragya Sur. "A precise high-dimensional asymptotic theory for boosting and min-ℓ1-norm interpolated classifiers". In: arXiv preprint arXiv:2002.01586 (2020) (Cited on page 14).
[LL20] Kaifeng Lyu and Jian Li. "Gradient Descent Maximizes the Margin of Homogeneous Neural Networks". In: International Conference on Learning Representations. 2020 (Cited on pages 2, 14, 37).
[Mat+18a] Alexander Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. "Gaussian process behaviour in wide deep neural networks". In: International Conference on Learning Representations. 2018 (Cited on page 6).
[Mat+18b] Alexander Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. "Gaussian process behaviour in wide deep neural networks". In: arXiv preprint arXiv:1804.11271 (2018) (Cited on page 6).
[MMM19] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. "Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit". In: arXiv preprint arXiv:1902.06015 (2019) (Cited on page 14).
[MM19] Song Mei and Andrea Montanari. "The generalization error of random features regression: Precise asymptotics and double descent curve". In: arXiv preprint arXiv:1908.05355 (2019) (Cited on page 14).
[Mon+19] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. "The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime". In: arXiv preprint arXiv:1911.01544 (2019) (Cited on page 14).
[Mut+20a] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel J Hsu, and Anant Sahai. "Classification vs regression in overparameterized regimes: Does the loss function matter?" In: arXiv preprint arXiv:2005.08054 (2020) (Cited on page 14).
[Mut+20b] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. "Harmless interpolation of noisy data in regression". In: IEEE Journal on Selected Areas in Information Theory (2020) (Cited on page 14).
[NTS15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. "In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning." In: International Conference on Learning Representations (Workshop). 2015 (Cited on page 14).
[NCS19] Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. "Gradient descent can learn less over-parameterized two-layer neural networks on classification problems". In: arXiv preprint arXiv:1905.09870 (2019) (Cited on page 14).
[OS20] Samet Oymak and Mahdi Soltanolkotabi. "Towards moderate overparameterization: global convergence guarantees for training shallow neural networks". In: IEEE Journal on Selected Areas in Information Theory (2020) (Cited on page 2).
[PSZ18] Rina Panigrahy, Sushant Sachdeva, and Qiuyi Zhang. "Convergence results for neural networks via electrodynamics". In: Innovations in Theoretical Computer Science. 2018 (Cited on page 14).
[RZL18] Prajit Ramachandran, Barret Zoph, and Quoc V Le. "Searching for activation functions". In: International Conference on Learning Representations (Workshop). 2018 (Cited on pages 2, 4).
[SS18] Itay Safran and Ohad Shamir. "Spurious local minima are common in two-layer ReLU neural networks". In: International Conference on Machine Learning. 2018, pp. 4433–4441 (Cited on page 14).
[SY19] Zhao Song and Xin Yang. "Quadratic suffices for over-parametrization via matrix chernoff bound". In: arXiv preprint arXiv:1906.03593 (2019) (Cited on page 14).
[Sou+18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. "The implicit bias of gradient descent on separable data". In: Journal of Machine Learning Research (2018).
Advances in Neural Information Processing Systems. 2020 (Cited on pages 2, 3).
[TB20] Alexander Tsigler and Peter L Bartlett. "Benign overfitting in ridge regression". In: arXiv preprint arXiv:2009.14286 (2020) (Cited on page 14).
[Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Vol. 47. Cambridge University Press, 2018 (Cited on pages 58, 91, 92).
[Wai19] Martin Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Vol. 48. Cambridge University Press, 2019 (Cited on page 91).
[Wei+19] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. "Regularization matters: Generalization and optimization of neural nets vs their induced kernel". In: Advances in Neural Information Processing Systems. 2019, pp. 9712–9724 (Cited on page 14).
[Zha+17a] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. "Understanding deep learning requires rethinking generalization". In: International Conference on Learning Representations. 2017 (Cited on page 1).
[Zha+19] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. "Learning one-hidden-layer ReLU networks via gradient descent". In: International Conference on Artificial Intelligence and Statistics. 2019, pp. 1524–1534 (Cited on page 14).
[Zha+17b] Yuchen Zhang, Jason D Lee, Martin Wainwright, and Michael Jordan. "On the learnability of fully-connected neural networks". In: International Conference on Artificial Intelligence and Statistics. 2017, pp. 83–91 (Cited on page 14).
[Zho+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S. Dhillon. "Recovery Guarantees for One-hidden-layer Neural Networks". In: International Conference on Machine Learning. 2017, pp. 4140–4149 (Cited on page 14).
[Zou+20] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. "Gradient descent optimizes over-parameterized deep ReLU networks". In: Machine Learning (2020).
arXiv preprint arXiv:1906.04688.