An Elementary Approach to Convergence Guarantees of Optimization Algorithms for Deep Networks
Vincent Roulet, Zaid Harchaoui
Department of Statistics, University of Washington, Seattle, USA
Abstract
We present an approach to obtain convergence guarantees of optimization algorithms for deep networks based on elementary arguments and computations. The convergence analysis revolves around the analytical and computational structures of optimization oracles central to the implementation of deep networks in modern machine learning software. We provide a systematic way to compute estimates of the smoothness constants that govern the convergence behavior of first-order optimization algorithms used to train deep networks. Diverse examples related to modern deep networks are interspersed within the text to illustrate the approach.
Deep networks have achieved remarkable performance in several application domains such as computer vision, natural language processing and genomics [15, 20, 7]. The input-output mapping implemented by a deep neural network is a chain of compositions of modules, where each module is typically a composition of a non-linear mapping, called an activation function, and an affine mapping. The last module in the chain is usually task-specific in that it relates to a performance accuracy for a specific task. This module can be expressed either explicitly in analytical form as in supervised classification or implicitly as a solution of an optimization problem as in dimension reduction or unsupervised clustering.

The optimization problem arising when training a deep network is often framed as a non-convex optimization problem, dismissing the structure of the objective that is yet central to the software implementation. Indeed, optimization algorithms used to train deep networks proceed by making calls to first-order (or second-order) oracles relying on dynamic programming, such as gradient back-propagation [25, 22, 16, 5, 2, 23, 9]. Gradient back-propagation is now part of modern machine learning software [1, 19]. We highlight here the elementary yet important fact that the chain-compositional structure of the objective naturally emerges through the smoothness constants governing the convergence guarantee of a gradient-based optimization algorithm. This provides a reference frame to relate the network architecture and the convergence rate through the smoothness constants. This also brings to light the benefit of specific modules popular among practitioners to improve the convergence.

In Sec. 2, we define the parameterized input-output map implemented by a deep network as a chain of compositions of modules and write the corresponding optimization objective consisting in learning the parameters of this map. In Sec. 3, we detail the implementation of first-order and second-order oracles by dynamic programming; the classical gradient back-propagation algorithm is recovered as a canonical example. Gauss-Newton steps can also be simply stated in terms of calls to an automatic-differentiation oracle implemented in modern machine learning software libraries. In Sec. 4, we present the computation of the smoothness constants of a chain of computations given its components and the resulting convergence guarantees for gradient descent. Finally, in Sec. 5, we present the application of the approach to derive the smoothness constants for the VGG architecture and illustrate how our approach can be used to identify the benefits of batch-normalization [24, 13]. In the Appendix, we estimate the smoothness constants related to the VGG architecture and we investigate batch-normalization in the light of our approach [24, 13]. All the proofs and the notations are also provided in the Appendix.

Figure 1: Deep network architecture.
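The chain-compositional structure described above can be made concrete in a few lines of code. The sketch below, written in plain NumPy purely for illustration, builds a toy feed-forward network in which each layer applies an affine map followed by an element-wise non-linearity, matching the decomposition $\phi_t(x_{t-1}, u_t) = a_t(b_t(x_{t-1}, u_t))$ used throughout the paper; the layer sizes and the ReLU activation are arbitrary choices, not part of the paper's setup.

```python
import numpy as np

def relu(z):
    # Element-wise activation a_t.
    return np.maximum(z, 0.0)

def layer(x_prev, W, c):
    # One layer phi_t: affine map b_t followed by the activation a_t.
    return relu(W @ x_prev + c)

def chain(x0, params):
    # f(x0, u): compose the layers and return all intermediate states x_1, ..., x_tau.
    states = [x0]
    for (W, c) in params:
        states.append(layer(states[-1], W, c))
    return states

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dims = [4, 8, 8, 3]  # d_0, ..., d_tau (toy sizes)
    params = [(rng.standard_normal((dims[t + 1], dims[t])), np.zeros(dims[t + 1]))
              for t in range(len(dims) - 1)]
    x0 = rng.standard_normal(dims[0])
    x_tau = chain(x0, params)[-1]   # network output f_tau(x0, u)
    print(x_tau.shape)              # (3,)
```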
A feed-forward deep network of depth $\tau$ can be described as a transformation of an input $x_0$ into an output $x_\tau$ through the composition of $\tau$ blocks, called layers, illustrated in Fig. 1. Each layer is defined by a set of parameters. In general (see Sec. 2.3 for a detailed decomposition), these parameters act on the input of the layer through an affine operation followed by a non-linear operation. Formally, the $t$-th layer can be described as a function of its parameters $u_t$ and a given input $x_{t-1}$ that outputs $x_t$ as
$$x_t = \phi_t(x_{t-1}, u_t) = a_t(b_t(x_{t-1}, u_t)), \qquad (1)$$
where $b_t$ is generally linear in $u_t$ and affine in $x_{t-1}$, and $a_t$ is non-linear.

Learning a deep network consists in minimizing, with respect to its parameters, an objective involving $n$ inputs $\bar x^{(1)}, \dots, \bar x^{(n)} \in \mathbb{R}^{\delta}$. Formally, the problem is written
$$\min_{(u_1, \dots, u_\tau) \in \mathbb{R}^{p_1} \times \dots \times \mathbb{R}^{p_\tau}} \; h\big(x_\tau^{(1)}, \dots, x_\tau^{(n)}\big) + r(u_1, \dots, u_\tau)$$
$$\text{subject to } x_t^{(i)} = \phi_t\big(x_{t-1}^{(i)}, u_t\big) \text{ for } t = 1, \dots, \tau, \; i = 1, \dots, n, \qquad x_0^{(i)} = \bar x^{(i)} \text{ for } i = 1, \dots, n, \qquad (2)$$
where $u_t \in \mathbb{R}^{p_t}$ is the set of parameters at layer $t$, whose dimension $p_t$ can vary among layers, and $r$ is a regularization on the parameters of the network.

We are interested in the influence of the architecture on the optimization complexity of the problem. The architecture translates into a structure of the chain of computations involved in the optimization problem.

Definition 2.1.
A chain of $\tau$ computations $\phi_t : \mathbb{R}^{d_{t-1}}\times\mathbb{R}^{p_t}\to\mathbb{R}^{d_t}$ is defined as $f : \mathbb{R}^{d_0}\times\mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}^{\sum_{t=1}^\tau d_t}$ such that for $x_0\in\mathbb{R}^{d_0}$ and $u = (u_1;\dots;u_\tau)\in\mathbb{R}^{\sum_{t=1}^\tau p_t}$ we have $f(x_0, u) = (f_1(x_0,u);\dots;f_\tau(x_0,u))$ with
$$f_t(x_0, u) = \phi_t(f_{t-1}(x_0,u), u_t) \quad \text{for } t = 1,\dots,\tau, \qquad (3)$$
and $f_0(x_0,u) = x_0$. We denote $f_{t,x_0}(u) = f_t(x_0,u)$ and $f_{t,u}(x_0) = f_t(x_0,u)$.

Denoting then by $f$ the chain of computations associated to the layers of a deep network and considering the concatenation of the transformations of each input as a single transformation, i.e., $f_t(\bar x, u) = (f_t(\bar x^{(1)}, u);\dots;f_t(\bar x^{(n)}, u))$ for $t\in\{0,\dots,\tau\}$ and $\bar x = (\bar x^{(1)};\dots;\bar x^{(n)})$, the objective in (2) can be written as
$$\min_{u\in\mathbb{R}^{\sum_{t=1}^\tau p_t}} \; h(f_\tau(\bar x, u)) + r(u), \qquad (4)$$
where $f_\tau : \mathbb{R}^{n d_0}\times\mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}^{n d_\tau}$ is the output of a chain of $\tau$ computations with $d_0 = \delta$, $r : \mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}$ is typically a decomposable differentiable function such as $r(u) = \lambda\sum_{t=1}^\tau\|u_t\|_2^2$ for $\lambda\ge 0$, and we present examples of learning objectives $h : \mathbb{R}^{n d_\tau}\to\mathbb{R}$ below. Assumptions on differentiability and smoothness of the objective are detailed in Sec. 4.

2.2 Objectives

In the following, we consider the output of the chain of computations on the $n$ samples to be given as $\hat y = (\hat y^{(1)};\dots;\hat y^{(n)}) = (f_\tau(\bar x^{(1)}, u);\dots;f_\tau(\bar x^{(n)}, u)) = f_\tau(\bar x, u)$ for $\bar x = (\bar x^{(1)};\dots;\bar x^{(n)})$.

Supervised learning.
For supervised learning, the objective can be decomposed as a finite sum
$$h(\hat y) = \frac{1}{n}\sum_{i=1}^n h^{(i)}(\hat y^{(i)}), \qquad (5)$$
where the $h^{(i)}$ are losses on the labels predicted by the chain of computations, i.e., $h^{(i)}(\hat y^{(i)}) = L(\hat y^{(i)}, y^{(i)})$ with $y^{(i)}$ the label of $\bar x^{(i)}$, and $L$ is a given loss such as the squared loss or the logistic loss (see Appendix D.1).

Unsupervised learning.
In unsupervised learning tasks the labels are unknown. The objective itself is defined through a minimization problem rather than through an explicit loss function. For example, a convex clustering objective [11] is written
$$h(\hat y) = \min_{y^{(1)},\dots,y^{(n)}\in\mathbb{R}^q} \; \sum_{i=1}^n \|y^{(i)} - \hat y^{(i)}\|_2^2 + \sum_{i<j} \|y^{(i)} - y^{(j)}\|_2,$$
which seeks centroids $y^{(i)}$ close to the outputs while being fused together by the second penalty.
Lemma 2.2.
Let $\zeta : (\alpha,\beta)\mapsto\zeta(\alpha,\beta)\in\mathbb{R}$ for $\alpha\in\mathbb{R}^a$, $\beta\in\mathbb{R}^b$ be such that $\zeta(\alpha,\cdot)$ is $\mu_\zeta$-strongly convex for any $\alpha$, and denote $\xi(\alpha,\beta) = \nabla_\beta\zeta(\alpha,\beta)$. Denote $g(\alpha) = \arg\min_{\beta\in\mathbb{R}^b}\zeta(\alpha,\beta)$ and let $\hat g(\alpha) \approx \arg\min_{\beta\in\mathbb{R}^b}\zeta(\alpha,\beta)$ be an approximate minimizer. Provided that $\zeta$ has an $L_\zeta$-Lipschitz gradient and an $H_\zeta$-Lipschitz Hessian, the approximation error of using $\widehat\nabla\hat g(\alpha) = -\nabla_\alpha\xi(\alpha,\hat g(\alpha))\,\nabla_\beta\xi(\alpha,\hat g(\alpha))^{-1}$ instead of $\nabla g(\alpha)$ is bounded as
$$\|\widehat\nabla\hat g(\alpha) - \nabla g(\alpha)\| \le H_\zeta\,\mu_\zeta^{-1}\big(1 + L_\zeta\,\mu_\zeta^{-1}\big)\,\|\hat g(\alpha) - g(\alpha)\|.$$

For each class of optimization algorithms considered (gradient descent, Gauss-Newton, Newton), we define the appropriate optimization oracle called at each step of the optimization algorithm, which can be efficiently computed through a dynamic programming procedure. For a gradient step, we retrieve the gradient back-propagation algorithm. The gradient back-propagation algorithm then forms the basis of automatic-differentiation procedures.
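Lemma 2.2 can be checked numerically on a toy problem. The hypothetical NumPy sketch below takes a strongly convex quadratic $\zeta(\alpha,\beta) = \tfrac{1}{2}\|\beta - A\alpha\|^2 + \tfrac{\mu}{2}\|\beta\|^2$, for which the minimizer $g(\alpha)$ and its gradient are known in closed form, and compares $\nabla g(\alpha)$ with the implicit-differentiation formula $-\nabla_\alpha\xi(\alpha,\beta)\,\nabla_\beta\xi(\alpha,\beta)^{-1}$ evaluated after running a few gradient steps to obtain an approximate minimizer; the specific quadratic and sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a_dim, b_dim, mu = 3, 4, 0.5
A = rng.standard_normal((b_dim, a_dim))
alpha = rng.standard_normal(a_dim)

# zeta(alpha, beta) = 0.5*||beta - A alpha||^2 + 0.5*mu*||beta||^2 is mu-strongly convex in beta.
# Exact minimizer and its gradient (transpose of the Jacobian, as in the paper):
g_exact = A @ alpha / (1.0 + mu)
grad_g_exact = A.T / (1.0 + mu)              # shape (a_dim, b_dim)

# Approximate minimizer obtained by a few gradient steps on beta.
beta = np.zeros(b_dim)
for _ in range(50):
    beta -= 0.5 * ((1.0 + mu) * beta - A @ alpha)   # step on xi = grad_beta zeta

# Implicit-differentiation estimate of Lemma 2.2: -grad_alpha xi . (grad_beta xi)^{-1}.
# Here the gradients of xi do not depend on beta (the Hessian of zeta is constant, H_zeta = 0),
# so the estimate matches the exact gradient, consistently with the bound of the lemma.
grad_alpha_xi = -A.T                          # shape (a_dim, b_dim)
grad_beta_xi = (1.0 + mu) * np.eye(b_dim)     # shape (b_dim, b_dim)
grad_g_hat = -grad_alpha_xi @ np.linalg.inv(grad_beta_xi)

print(np.allclose(grad_g_hat, grad_g_exact))  # True
```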
In the following, we use the notations presented in Sec. A for gradients, Hessians and tensors. Briefly, $\nabla f(x)$ denotes the gradient of a function $f$ at $x$, which, if $f : \mathbb{R}^p\to\mathbb{R}^d$ is multivariate, is the transpose of the Jacobian, i.e., $\nabla f(x)\in\mathbb{R}^{p\times d}$. For a multivariate function $f : \mathbb{R}^p\to\mathbb{R}^d$, its second-order information at $x$ is represented by a tensor $\nabla^2 f(x)\in\mathbb{R}^{p\times p\times d}$, and we denote for example $\nabla^2 f(x)[y,y,\cdot] = (y^\top\nabla^2 f^{(1)}(x)y;\dots;y^\top\nabla^2 f^{(d)}(x)y)\in\mathbb{R}^d$. For a function $f:\mathbb{R}^p\to\mathbb{R}^d$, we define, provided that $\nabla f(x)$ and $\nabla^2 f(x)$ are defined,
$$\ell_f^x(y) = \nabla f(x)^\top y, \qquad q_f^x(y) = \nabla f(x)^\top y + \tfrac{1}{2}\nabla^2 f(x)[y,y,\cdot], \qquad (14)$$
such that the linear and quadratic approximations of $f$ around $x$ are $f(x+y)\approx f(x) + \ell_f^x(y)$ and $f(x+y)\approx f(x) + q_f^x(y)$ respectively.

We consider optimization oracles as procedures that compute either the next step of an optimization method or a descent direction along which the next step of an optimization method is taken. Formally, the optimization oracles for an objective $f$ are defined by a model $m_f^u$ that approximates the objective around the current point $u$ as $f(u+v)\approx f(u) + m_f^u(v)$. The models can be minimized with an additional proximal term that ensures that the minimizer lies in a region where the model approximates the objective well, as
$$v_\gamma^* = \arg\min_{v\in\mathbb{R}^p}\; m_f^u(v) + \tfrac{1}{2\gamma}\|v\|_2^2, \qquad u^{\text{new}} = u + v_\gamma^*.$$
The parameter $\gamma$ acts as a stepsize that controls how large the step should be (the smaller the $\gamma$, the smaller the $v_\gamma^*$). Alternatively, the model can be minimized directly, providing a descent direction along which the next iterate is taken as
$$v^* = \arg\min_{v\in\mathbb{R}^p}\; m_f^u(v), \qquad u^{\text{new}} = u + \gamma v^*,$$
where $\gamma$ is found by a line-search using e.g. an Armijo condition [18].

At a point $u\in\mathbb{R}^p$, given a regularization $\kappa$, for an objective of the form $h\circ\psi + r : \mathbb{R}^p\to\mathbb{R}$,
(i) a gradient oracle is defined as $v^* = \arg\min_{v\in\mathbb{R}^p} \ell_{h\circ\psi}^u(v) + \ell_r^u(v) + \tfrac{\kappa}{2}\|v\|_2^2$, (15)
(ii) a (regularized) Gauss-Newton oracle is defined as $v^* = \arg\min_{v\in\mathbb{R}^p} q_h^{\psi(u)}(\ell_\psi^u(v)) + q_r^u(v) + \tfrac{\kappa}{2}\|v\|_2^2$, (16)
(iii) a (regularized) Newton oracle is defined as $v^* = \arg\min_{v\in\mathbb{R}^p} q_{h\circ\psi}^u(v) + q_r^u(v) + \tfrac{\kappa}{2}\|v\|_2^2$. (17)

Proposition 3.1.
Let $f$ be a chain of $\tau$ computations $\phi_t : \mathbb{R}^{d_{t-1}}\times\mathbb{R}^{p_t}\to\mathbb{R}^{d_t}$, $u = (u_1;\dots;u_\tau)$ and $x_0\in\mathbb{R}^{d_0}$. Denote $\psi = f_{x_0,\tau}$ and $f(x_0,u) = (x_1;\dots;x_\tau)$. Assume $r$ to be decomposable as $r(u) = \sum_{t=1}^\tau r_t(u_t)$. Gradient (15), Gauss-Newton (16) and Newton (17) oracles on $h\circ\psi + r$ are the solutions $v^* = (v_1^*;\dots;v_\tau^*)$ of problems of the form
$$\min_{\substack{v_1,\dots,v_\tau\in\mathbb{R}^{p_1}\times\dots\times\mathbb{R}^{p_\tau}\\ y_1,\dots,y_\tau\in\mathbb{R}^{d_1}\times\dots\times\mathbb{R}^{d_\tau}}} \; \sum_{t=1}^\tau \tfrac{1}{2}y_t^\top P_t y_t + p_t^\top y_t + y_{t-1}^\top R_t v_t + \tfrac{1}{2}v_t^\top Q_t v_t + q_t^\top v_t + \tfrac{\kappa}{2}\|v_t\|_2^2 \qquad (18)$$
$$\text{subject to } y_t = A_t y_{t-1} + B_t v_t \text{ for } t\in\{1,\dots,\tau\}, \qquad y_0 = 0,$$
where
$$A_t = \nabla_{x_{t-1}}\phi_t(x_{t-1},u_t)^\top, \quad B_t = \nabla_{u_t}\phi_t(x_{t-1},u_t)^\top, \quad p_\tau = \nabla h(\psi(u)), \quad p_t = 0 \text{ for } t\neq\tau, \quad q_t = \nabla r_t(u_t),
$$
1. for gradient oracles (15), $P_t = 0$, $R_t = 0$, $Q_t = 0$,
2. for Gauss-Newton oracles (16), $P_\tau = \nabla^2 h(\psi(u))$, $P_t = 0$ for $t\neq\tau$, $R_t = 0$, $Q_t = \nabla^2 r_t(u_t)$,

3. for Newton oracles (17), defining $\lambda_\tau = \nabla h(\psi(u))$ and $\lambda_{t-1} = \nabla_{x_{t-1}}\phi_t(x_{t-1},u_t)\lambda_t$ for $t\in\{1,\dots,\tau\}$, we have
$$P_\tau = \nabla^2 h(\psi(u)), \qquad P_{t-1} = \nabla^2_{x_{t-1}x_{t-1}}\phi_t(x_{t-1},u_t)[\cdot,\cdot,\lambda_t] \;\text{ for } t\in\{1,\dots,\tau\},$$
$$R_t = \nabla^2_{x_{t-1}u_t}\phi_t(x_{t-1},u_t)[\cdot,\cdot,\lambda_t], \qquad Q_t = \nabla^2 r_t(u_t) + \nabla^2_{u_t u_t}\phi_t(x_{t-1},u_t)[\cdot,\cdot,\lambda_t].$$

Problems of the form
$$\min_{\substack{u_1,\dots,u_\tau\in\mathbb{R}^{p_1}\times\dots\times\mathbb{R}^{p_\tau}\\ x_1,\dots,x_\tau\in\mathbb{R}^{d_1}\times\dots\times\mathbb{R}^{d_\tau}}} \; \sum_{t=1}^\tau h_t(x_t) + \sum_{t=1}^\tau g_t(u_t) \qquad (19)$$
$$\text{subject to } x_t = \phi_t(x_{t-1},u_t) \text{ for } t\in\{1,\dots,\tau\}, \qquad x_0 = \hat x_0,$$
can be tackled by dynamic programming, which defines the cost-to-go of a state $\hat x_t$ at time $t$ by
$$\mathrm{cost}_t(\hat x_t) = \min_{\substack{u_{t+1},\dots,u_\tau\\ x_t,\dots,x_\tau}} \; \sum_{t'=t}^\tau h_{t'}(x_{t'}) + \sum_{t'=t+1}^\tau g_{t'}(u_{t'}) \quad \text{subject to } x_{t'} = \phi_{t'}(x_{t'-1},u_{t'}) \text{ for } t'\in\{t+1,\dots,\tau\}, \; x_t = \hat x_t,$$
such that the cost-to-go functions follow the recursive relation
$$\mathrm{cost}_t(\hat x_t) = \min_{u_{t+1}\in\mathbb{R}^{p_{t+1}}} \big\{ h_t(\hat x_t) + g_{t+1}(u_{t+1}) + \mathrm{cost}_{t+1}(\phi_{t+1}(\hat x_t, u_{t+1})) \big\}. \qquad (20)$$
This principle cannot be used directly on the original problem, since Eq. (20) cannot be solved analytically for generic problems of the form (19). However, for quadratic problems with linear compositions of the form (18), this principle can be used to solve problems (18) by dynamic programming [4]. Therefore, as a corollary of Prop. 3.1, the complexity of all optimization steps given in (15), (16), (17) is linear with respect to the length $\tau$ of the chain. Precisely, Prop. 3.1 shows that each optimization step amounts to reducing the recursive relation (20) to an analytic problem. In particular, while the Hessian of the objective has size $\sum_{t=1}^\tau p_t \times \sum_{t=1}^\tau p_t$, a Newton step has a linear, and not cubic, complexity with respect to $\tau$. We present in Appendix B the detailed computation of a Newton step; alternative derivations were first proposed in the control literature [6]. This involves the inversion of intermediate quadratic costs at each layer. Gauss-Newton steps can also be solved by dynamic programming and can be more efficiently implemented using an automatic-differentiation oracle, as we explain below.

As explained in the last subsection and shown in Appendix B, a gradient step can naturally be derived as a dynamic programming procedure applied to the subproblem (18). However, the implementation of the gradient step itself provides a different kind of oracle on the chain of computations, as defined below.
Definition 3.2.
Given a chain of computations $f : \mathbb{R}^{d_0}\times\mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}^{\sum_{t=1}^\tau d_t}$ as defined in Def. 2.1, $u\in\mathbb{R}^{\sum_{t=1}^\tau p_t}$ and $x_0\in\mathbb{R}^{d_0}$, an automatic differentiation oracle is a procedure that gives access to $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ for any $\mu\in\mathbb{R}^{d_\tau}$.

The subtle difference is that we have access to $\nabla f_{x_0,\tau}(u)$ not as a matrix but as a linear operator. The matrix $\nabla f_{x_0,\tau}(u)$ could also be computed and stored to perform gradient-vector products. Yet this requires a surplus of storage and of computations that is generally not necessary for our purposes. The only quantities that need to be stored are given by the forward pass. Then, these quantities can be used to compute any gradient-vector product directly.

The implementation of an automatic differentiation oracle is composed of two steps:
1. a forward pass that computes $f_{x_0,\tau}(u)$ and stores the information necessary to compute gradient-vector products,
2. the compilation of a backward pass that computes $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ for any $\mu\in\mathbb{R}^{d_\tau}$ given the information collected in the forward pass.
Note that the two aforementioned passes are decoupled in the sense that the forward pass does not require the knowledge of the slope $\mu$ for which $\nabla f_{x_0,\tau}(u)\mu$ is computed.

We present in Algo. 1 and Algo. 2 the classical forward and backward passes used in modern automatic-differentiation libraries. The implementation of the automatic differentiation oracle as a procedure that computes both the value of the chain $f_{x_0,\tau}(u)$ and the linear operator $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ is then presented in Algo. 3 and illustrated in Fig. 2. Computing the gradient $g = \nabla(h\circ f_{x_0,\tau})(u)$ at $u\in\mathbb{R}^p$ then amounts to
1. computing, with Algo. 3, $f_{x_0,\tau}(u)$ and $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu = \mathrm{Autodiff}(f, u)$,
2. computing $\mu = \nabla h(f_{x_0,\tau}(u))$ and then $g = \nabla f_{x_0,\tau}(u)\mu$.

Algorithm 1 Forward pass
Inputs: Chain of computations $f$ defined by $(\phi_t)_{t=1,\dots,\tau}$, input $x_0$ as in Def. 2.1, variable $u = (u_1;\dots;u_\tau)$
Initialize $x_0$ with the given input
for $t = 1,\dots,\tau$ do
  Compute $x_t = \phi_t(x_{t-1}, u_t)$
  Store $\nabla\phi_t(x_{t-1}, u_t)$
end for
Output: $x_\tau$, $\nabla\phi_t(x_{t-1}, u_t)$ for $t\in\{1,\dots,\tau\}$.

Algorithm 2 Backward pass
Inputs: Slope $\mu$, intermediate gradients $\nabla\phi_t(x_{t-1}, u_t)$ for $t\in\{1,\dots,\tau\}$
Initialize $\lambda_\tau = \mu$
for $t = \tau,\dots,1$ do
  Compute $\lambda_{t-1} = \nabla_{x_{t-1}}\phi_t(x_{t-1}, u_t)\lambda_t$
  Store $g_t = \nabla_{u_t}\phi_t(x_{t-1}, u_t)\lambda_t$
end for
Output: $(g_1,\dots,g_\tau) = \nabla f_{x_0,\tau}(u)\mu$

Algorithm 3 Chain of computations with automatic-differentiation oracle (Autodiff)
Inputs: Chain of computations $f$ defined by $(\phi_t)_{t=1,\dots,\tau}$, input $x_0$ as in Def. 2.1, variable $u = (u_1;\dots;u_\tau)$
Compute, using Algo. 1, $(x_\tau, (\nabla\phi_t(x_{t-1}, u_t))_{t=1}^\tau) = \mathrm{Forward}(f, u)$, which gives $f_{x_0,\tau}(u) = x_\tau$
Define $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ as $\mu\mapsto\mathrm{Backward}(\mu, (\nabla\phi_t(x_{t-1}, u_t))_{t=1}^\tau)$ according to Algo. 2.
Output: $f_{x_0,\tau}(u)$, $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$

Figure 2: Automatic differentiation of a chain of computations.
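As an illustration of Algorithms 1-3, the following sketch implements the forward and backward passes for a toy chain of affine layers with a smooth activation. It is a minimal NumPy mock-up of an automatic-differentiation oracle, not the implementation used by the libraries [1, 19]; the tanh activation, the layer sizes and the quadratic objective are assumptions made for the example, and the gradient is checked against a finite difference.

```python
import numpy as np

def forward(x0, params):
    # Algorithm 1: compute the chain and store what the backward pass needs.
    xs, zs = [x0], []
    for (W, c) in params:
        z = W @ xs[-1] + c          # affine part b_t
        zs.append(z)
        xs.append(np.tanh(z))       # activation a_t
    return xs, zs

def backward(mu, params, xs, zs):
    # Algorithm 2: propagate the slope mu backwards and collect the gradients g_t.
    lam, grads = mu, []
    for (W, c), x_prev, z in zip(reversed(params), reversed(xs[:-1]), reversed(zs)):
        delta = (1.0 - np.tanh(z) ** 2) * lam           # slope through the activation
        grads.append((np.outer(delta, x_prev), delta))  # gradients w.r.t. (W, c)
        lam = W.T @ delta                               # lambda_{t-1}
    return list(reversed(grads))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dims = [5, 4, 3]
    params = [(rng.standard_normal((dims[t + 1], dims[t])), rng.standard_normal(dims[t + 1]))
              for t in range(2)]
    x0 = rng.standard_normal(dims[0])

    # Gradient of h(f(x0, u)) with h = 0.5*||.||^2, i.e. slope mu = f(x0, u).
    xs, zs = forward(x0, params)
    grads = backward(xs[-1], params, xs, zs)

    # Central finite-difference check on one weight entry of the first layer.
    eps, (W0, c0) = 1e-6, params[0]
    Wp = W0.copy(); Wp[0, 0] += eps
    Wm = W0.copy(); Wm[0, 0] -= eps
    hp = 0.5 * np.sum(forward(x0, [(Wp, c0)] + params[1:])[0][-1] ** 2)
    hm = 0.5 * np.sum(forward(x0, [(Wm, c0)] + params[1:])[0][-1] ** 2)
    print(np.isclose((hp - hm) / (2 * eps), grads[0][0][0, 0], rtol=1e-4))  # True
```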
3.2.2 Complexity

Without additional information on the structure of the layers, the space and time complexities of the forward-backward algorithm are of the order of
$$\mathcal{S}_{\mathrm{FB}} \le \sum_{t=1}^\tau (p_t + d_{t-1})\,d_t, \qquad \mathcal{T}_{\mathrm{FB}} \le \sum_{t=1}^\tau \mathcal{T}(\phi_t, \nabla\phi_t) + 2\sum_{t=1}^\tau (d_{t-1}d_t + p_t d_t),$$
respectively, where $\mathcal{T}(\phi_t,\nabla\phi_t)$ is the time complexity of computing $\phi_t, \nabla\phi_t$ during the backward pass. The units chosen are, for the space complexity, the cost of storing one digit and, for the time complexity, the cost of performing an addition or a multiplication.

Provided that for all $t\in\{1,\dots,\tau\}$,
$$\mathcal{T}(\phi_t,\nabla\phi_t) + 2(d_{t-1}d_t + p_t d_t) \le Q\,\mathcal{T}(\phi_t), \qquad (21)$$
where $\mathcal{T}(\phi_t)$ is the time complexity of computing $\phi_t$ and $Q\ge 1$ is a constant, we get that $\mathcal{T}_{\mathrm{FB}} \le Q\,\mathcal{T}(f)$, where $\mathcal{T}(f)$ is the complexity of computing the chain of computations [14]. We retrieve Baur-Strassen's theorem, which states that the complexity of computing the derivative of a function formulated as a chain of computations is of the order of the complexity of computing the function itself [3, 10].

For chains of computations of the form (7), this cost can be refined as shown in Appendix B. Specifically, for a chain of fully-connected layers with element-wise activation function, no normalization or pooling, the cost of the backward pass is of the order of $O\big(\sum_{t=1}^\tau m\delta_t(\delta_{t-1}+1)\big)$ elementary operations. For a chain of convolutional layers with element-wise activation function, no normalization or pooling, the cost of the backward pass is of the order of $O\big(\sum_{t=1}^\tau (2 n^p_t n^f_t s^f_t + n^p_t n^f_t + \delta_t)\,m\big)$ elementary operations.

The Gauss-Newton step can also be solved by making calls to an automatic differentiation oracle.
Proposition 3.3.
Consider the Gauss-Newton oracle (16) on $u = (u_1;\dots;u_\tau)$ for a convex objective $h$, a convex decomposable regularization $r(u) = \sum_{t=1}^\tau r_t(u_t)$ and a differentiable chain of computations $f$ with output $\psi = f_{x_0,\tau}$ on some input $x_0$. We have that
1. the Gauss-Newton oracle amounts to solving
$$\min_{\mu\in\mathbb{R}^{d_\tau}} \; \big(q_h^{\psi(u)}\big)^\star(\mu) + \big(q_r^u + \kappa\|\cdot\|_2^2/2\big)^\star(-\nabla\psi(u)\mu), \qquad (22)$$
where for a function $f$ we denote by $f^\star$ its convex conjugate,
2. the Gauss-Newton oracle is $v^* = \nabla\big(q_r^u + \kappa\|\cdot\|_2^2/2\big)^\star(-\nabla\psi(u)\mu^*)$, where $\mu^*$ is the solution of (22),
3. the dual problem (22) can be solved by $d_\tau + 1$ calls to an automatic differentiation procedure.

Proposition 3.3 shows that a Gauss-Newton step is only $d_\tau + 1$ times more expensive than a gradient step. Precisely, for a deep network with a supervised objective, we have $d_\tau = nk$, where $n$ is the number of samples and $k$ is the number of classes. A gradient step then makes one call to an automatic differentiation procedure to get the gradient of the batch, and the Gauss-Newton method will then make $nk + 1$ more calls. If mini-batch Gauss-Newton steps are considered, then the cost reduces to $mk + 1$ calls to an automatic differentiation oracle, where $m$ is the size of the mini-batch.
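The practical point of Proposition 3.3 is that a Gauss-Newton direction only requires products with $\nabla\psi(u)$ and its transpose, both available from an automatic-differentiation oracle. The NumPy sketch below illustrates this matrix-free principle on a single tanh layer: instead of the dual problem (22), it solves the equivalent regularized normal equations $(J^\top\nabla^2 h\,J + \kappa I)v = -J^\top\nabla h$ by conjugate gradient, using only Jacobian-vector and transposed-Jacobian-vector products and never forming the Jacobian. The one-layer chain, the quadratic loss, $r = 0$ and the value of $\kappa$ are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, kappa = 6, 4, 1e-1
W = rng.standard_normal((d_out, d_in))
x0 = rng.standard_normal(d_in)
y = rng.standard_normal(d_out)

z = W @ x0
psi = np.tanh(z)                       # network output psi(u), a single layer for brevity
act = 1.0 - np.tanh(z) ** 2            # derivative of the activation at z

def jvp(v):
    # Product J v, J being the Jacobian of psi with respect to vec(W).
    return act * (v.reshape(d_out, d_in) @ x0)

def vjp(w):
    # Product J^T w, returned as a flat vector.
    return np.outer(act * w, x0).ravel()

def matvec(v):
    # (J^T J + kappa I) v: Gauss-Newton system for h = 0.5*||. - y||^2 and r = 0.
    return vjp(jvp(v)) + kappa * v

# Right-hand side -J^T grad h(psi(u)) and a plain conjugate-gradient loop.
b = -vjp(psi - y)
v = np.zeros_like(b)
res = b - matvec(v)
p = res.copy()
for _ in range(len(b)):
    Ap = matvec(p)
    alpha = res @ res / (p @ Ap)
    v += alpha * p
    new_res = res - alpha * Ap
    if np.linalg.norm(new_res) < 1e-10:
        break
    p = new_res + (new_res @ new_res) / (res @ res) * p
    res = new_res

# Check against an explicit solve with the full Jacobian.
J = act[:, None] * np.kron(np.eye(d_out), x0[None, :])
v_exact = np.linalg.solve(J.T @ J + kappa * np.eye(d_out * d_in), -J.T @ (psi - y))
print(np.allclose(v, v_exact, atol=1e-8))  # True
```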
4 Optimization complexity

We present smoothness properties with respect to the Euclidean norm $\|\cdot\|_2$, whose operator norm is denoted $\|\cdot\|_{2,2}$. In the following, for a function $f:\mathbb{R}^d\to\mathbb{R}^n$ and a set $C\subset\mathrm{dom} f\subset\mathbb{R}^d$, we denote by
$$m_f^C = \sup_{x\in C}\|f(x)\|_2, \qquad \ell_f^C = \sup_{\substack{x,y\in C\\ x\neq y}}\frac{\|f(x)-f(y)\|_2}{\|x-y\|_2}, \qquad L_f^C = \sup_{\substack{x,y\in C\\ x\neq y}}\frac{\|\nabla f(x)-\nabla f(y)\|_{2,2}}{\|x-y\|_2},$$
a bound of $f$ on $C$, the Lipschitz-continuity parameter of $f$ on $C$, and the smoothness parameter of $f$ on $C$ (i.e., the Lipschitz-continuity parameter of its gradient, if it exists), all with respect to $\|\cdot\|_2$. Note that if $x = \mathrm{Vec}(X)$ for a given matrix $X$, then $\|x\|_2 = \|X\|_F$. We denote by $m_f, \ell_f, L_f$ the same quantities defined on the domain of $f$, e.g., $m_f = m_f^{\mathrm{dom} f}$. We denote by $\mathcal{C}_{m,\ell,L}$ the class of functions $f$ such that $m_f = m$, $\ell_f = \ell$, $L_f = L$. In the following, we allow the quantities $m_f, \ell_f, L_f$ to be infinite, if for example the function is unbounded or the smoothness constant is not defined. The procedures presented below output infinite estimates if the combination of the smoothness properties does not allow for finite estimates. On the other hand, they provide finite estimates automatically if they are available. In the following, we denote $\bigotimes_{t=1}^\tau B_{R_t}(\mathbb{R}^{p_t}) = \{u = (u_1;\dots;u_\tau)\in\mathbb{R}^{\sum_{t=1}^\tau p_t} : u_t\in\mathbb{R}^{p_t}, \|u_t\|_2\le R_t\}$.

We recall the convergence rate to a stationary point of a gradient descent and a stochastic gradient descent on constrained problems.
Theorem 4.1 ([8, Theorems 1 and 2]). Consider problems of the form
$$(i)\;\; \min_{u\in\mathbb{R}^p}\{F(u) := h(\psi(u)) + r(u)\} \;\text{ subject to } u\in C, \qquad (ii)\;\; \min_{u\in\mathbb{R}^p}\Big\{F(u) := \frac{1}{n}\sum_{i=1}^n h_i(\psi_i(u)) + r(u)\Big\} \;\text{ subject to } u\in C,$$
where $C$ is a closed convex set and $F$ is $L_F^C$-smooth on $C$. For problem (ii), consider that we have access to an unbiased estimate $\widehat\nabla F(u)$ of $\nabla F(u)$ with a variance bounded as $\mathbb{E}\big(\|\widehat\nabla F(u) - \nabla F(u)\|_2^2\big)\le\sigma^2$.

A projected gradient descent applied on problem (i) with step-size $\gamma = (L_F^C)^{-1}$ converges to an $\varepsilon$-stationary point in at most
$$O\Big(\frac{L_F^C\,(F(u_0) - F^*)}{\varepsilon^2}\Big)$$
iterations, where $u_0$ is the initial point and $F^* = \min_{u\in C}F(u)$.

A stochastic projected gradient descent applied on problem (ii) with step-size $\gamma = (2L_F^C)^{-1}$ converges in expectation to an $(\varepsilon + \sigma)$-stationary point in at most
$$O\Big(\frac{L_F^C\,(F(u_0) - F^*)}{\varepsilon^2}\Big)$$
iterations, where $u_0$ is the initial point and $F^* = \min_{u\in C}F(u)$.

Remarks.
1. Since gradient descent is monotonically decreasing, a gradient descent applied to the unconstrained problem converges to an $\varepsilon$-stationary point in at most
$$O\Big(\frac{L_F^{S_0}\,(F(u_0) - F^*)}{\varepsilon^2}\Big)$$
iterations, where $S_0 = \{u\in\mathbb{R}^p : F(u)\le F(u_0)\}$ is the initial sub-level set.

2. Tighter rates of convergence can be obtained for the finite-sum problem (ii) by using variance reduction methods and by varying mini-batch sizes [17]. They would then depend on the smoothness constants of the objective or the maximal smoothness of the components on $C$, i.e., $\max_{i=1,\dots,n} L^C_{h_i\circ\psi_i + r}$.

The smoothness of the objectives $F$ defined in Theorem 4.1 can be derived from the smoothness properties of their components.

Proposition 4.2.
Consider a closed convex set $C\subset\mathbb{R}^p$, $\psi\in\mathcal{C}^C_{m_\psi^C,\ell_\psi^C,L_\psi^C}$, $r\in\mathcal{C}_{L_r}$ and $h\in\mathcal{C}_{\ell_h,L_h}$ with $\ell_h = +\infty$ if $h$ is not Lipschitz-continuous. The smoothness of $F = h\circ\psi + r$ on $C$ is bounded as
$$L_F^C \le L_\psi^C\,\tilde\ell_h^C + \big(\ell_\psi^C\big)^2 L_h + L_r, \qquad \text{where } \tilde\ell_h^C = \min\Big\{\ell_h,\; \min_{z\in\psi(C)}\|\nabla h(z)\|_2 + L_h\,\ell_\psi^C\,D_C\Big\},$$
and $D_C = \sup_{x,y\in C}\|x-y\|_2$.

What remains to characterize are the smoothness properties of a chain of computations.
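Proposition 4.2 translates directly into a few lines of bookkeeping that turn the constants of $h$, $\psi$ and $r$ into a step-size for Theorem 4.1. The helper below is a small sketch of that computation; the function name and the example constants are ours, and `math.inf` is used for constants that are unavailable, mirroring the convention of the text.

```python
import math

def smoothness_of_composition(l_psi, L_psi, l_h, L_h, L_r, grad_h_min, D_C):
    # Bound on L_F for F = h o psi + r on a set C of diameter D_C (Prop. 4.2):
    # L_F <= L_psi * min(l_h, grad_h_min + L_h * l_psi * D_C) + l_psi**2 * L_h + L_r.
    l_h_tilde = min(l_h, grad_h_min + L_h * l_psi * D_C)
    return L_psi * l_h_tilde + l_psi ** 2 * L_h + L_r

if __name__ == "__main__":
    # Toy constants, not tied to any particular network.
    L_F = smoothness_of_composition(l_psi=3.0, L_psi=5.0, l_h=math.inf,
                                    L_h=1.0, L_r=0.1, grad_h_min=0.5, D_C=2.0)
    print(L_F, "-> gradient step-size 1/L_F =", 1.0 / L_F)
```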
We present the smoothness computations for a deep network. Generic estimations of the smoothness properties of a chain of computations are presented in Appendix C. The propositions below give upper bounds on the smoothness constants of the function achieved through chain-composition. For a trivial composition such as $f\circ f^{-1}$, the upper bound is clearly loose. The upper bounds we present here are informative for non-trivial architectures. The estimation is done by a forward pass on the network, as illustrated in Fig. 3. The reasoning is based on the following lemma.

Lemma 4.3.
Consider a chain $f$ of $\tau$ computations $\phi_t\in\mathcal{C}_{\ell_{\phi_t},L_{\phi_t}}$ initialized at some $x_0\in\mathbb{R}^{d_0}$.
(i) We have $\ell_{f_{\tau,x_0}}\le\ell_\tau$, where $\ell_0 = 0$ and $\ell_t = \ell_{\phi_t} + \ell_{t-1}\ell_{\phi_t}$ for $t\in\{1,\dots,\tau\}$.
(ii) We have $L_{f_{\tau,x_0}}\le L_\tau$, where $L_0 = 0$ and $L_t = L_{t-1}\ell_{\phi_t} + L_{\phi_t}(1+\ell_{t-1})^2$ for $t\in\{1,\dots,\tau\}$.

In the case of deep networks, the computations are not Lipschitz-continuous due to the presence of bi-affine functions. Yet, provided that the inputs of the computations are bounded and that we have access to the smoothness of the computations, we can estimate the Lipschitz-continuity of the computations restricted to these bounded sets.
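Lemma 4.3 is easy to turn into a forward recursion over the layers. The sketch below assumes the per-layer constants $(\ell_{\phi_t}, L_{\phi_t})$ are given and simply accumulates the estimates $\ell_\tau$ and $L_\tau$; the constants used in the example are arbitrary.

```python
def chain_smoothness(layer_constants):
    # layer_constants: list of (l_phi, L_phi) pairs, one per layer t = 1, ..., tau.
    # Returns the estimates (l_tau, L_tau) of Lemma 4.3 by a forward pass.
    l, L = 0.0, 0.0
    for (l_phi, L_phi) in layer_constants:
        L = L * l_phi + L_phi * (1.0 + l) ** 2   # smoothness estimate L_t (uses the old l_{t-1})
        l = l_phi * (1.0 + l)                    # Lipschitz estimate l_t
    return l, L

if __name__ == "__main__":
    # Three layers with hypothetical constants.
    print(chain_smoothness([(1.0, 0.5), (2.0, 0.5), (1.0, 0.0)]))
```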
Corollary 4.4.
Consider a chain $f$ of $\tau$ computations $\phi_t\in\mathcal{C}_{m_{\phi_t},\ell_{\phi_t},L_{\phi_t}}$ initialized at some $x_0\in\mathbb{R}^{d_0}$, and consider $C = \bigotimes_{t=1}^\tau B_{R_t}(\mathbb{R}^{p_t})$. Then the smoothness of the output of the chain $f_{\tau,x_0}$ on $C$ can be estimated as in Lemma 4.3 by replacing $\ell_{\phi_t}$ with $\tilde\ell_{\phi_t}$ defined by
$$\tilde\ell_{\phi_t} = \min\{\ell_{\phi_t},\; L_{\phi_t}(m_{t-1}+R_t) + \|\nabla\phi_t(0,0)\|_{2,2}\}, \qquad m_t = \min\{m_{\phi_t},\; \tilde\ell_{\phi_t}(m_{t-1}+R_t) + \|\phi_t(0,0)\|_2\},$$
for $t\in\{1,\dots,\tau\}$, with $m_0 = \|x_0\|_2$.

We specialize the result to deep networks. We denote by $\mathcal{B}_{L,l_u,l_x}$ the set of $L$-smooth bi-affine functions $b$ such that $\|\nabla_u b(0,0)\|_{2,2} = l_u$, $\|\nabla_x b(0,0)\|_{2,2} = l_x$, i.e., functions of the form
$$b(x,u) = \beta(x,u) + \beta_u(u) + \beta_x(x) + \beta_0,$$
with $\beta$ bilinear and $L$-smooth, $\beta_u, \beta_x$ linear and $l_u$-, $l_x$-Lipschitz-continuous respectively, and $\beta_0$ a constant vector.

Proposition 4.5.
Consider a chain $f$ of $\tau$ computations whose layers $\phi_t$ are defined by
$$\phi_t(x_{t-1}, u_t) = a_t(b_t(x_{t-1}, u_t)) \quad \text{for } t\in\{1,\dots,\tau\},$$
where $b_t\in\mathcal{B}_{L_{b_t},l^u_{b_t},l^x_{b_t}}$ and $a_t$ is decomposed as $a_t = a_{t,k_t}\circ\dots\circ a_{t,1}$ with $a_{t,i}\in\mathcal{C}_{m_{a_{t,i}},\ell_{a_{t,i}},L_{a_{t,i}}}$. Consider $C = \bigotimes_{t=1}^\tau B_{R_t}(\mathbb{R}^{p_t})$. The outputs $m_\tau$, $\ell_\tau$ and $L_\tau$ of Algo. 4 satisfy $m^C_{f_{\tau,x_0}}\le m_\tau$, $\ell^C_{f_{\tau,x_0}}\le\ell_\tau$, $L^C_{f_{\tau,x_0}}\le L_\tau$.

Figure 3: Smoothness estimates computations.
Algorithm 4
Automatic smoothness computations
Inputs:
1. Chain of computations $f$ defined by $\phi_t = a_t\circ b_t$ for $t\in\{1,\dots,\tau\}$ with $a_t = a_{t,k_t}\circ\dots\circ a_{t,1}$
2. Smoothness properties $L_{b_t}, l^u_{b_t}, l^x_{b_t}$ of the bi-affine functions $b_t\in\mathcal{B}_{L_{b_t},l^u_{b_t},l^x_{b_t}}$
3. Smoothness properties $m_{a_{t,i}}, \ell_{a_{t,i}}, L_{a_{t,i}}$ of the nonlinear functions $a_{t,i}\in\mathcal{C}_{m_{a_{t,i}},\ell_{a_{t,i}},L_{a_{t,i}}}$
4. Initial point $x_0$
5. Bounds $R_t$ on the parameters
Initialize $m_0 = \|x_0\|_2$, $\ell_0 = 0$, $L_0 = 0$
for $t = 1,\dots,\tau$ do
  $\ell^x_{t,0} = L_{b_t}R_t + l^x_{b_t}$, $\quad \ell^u_{t,0} = L_{b_t}m_{t-1} + l^u_{b_t}$, $\quad \ell_{t,0} = 1$
  $m_{t,0} = \ell^x_{t,0}m_{t-1} + \ell^u_{t,0}R_t + \|b_t(0,0)\|_2$
  $L_{t,0} = 0$
  for $j = 1,\dots,k_t$ do
    $\tilde\ell_{a_{t,j}} = \min\{\ell_{a_{t,j}},\; \|\nabla a_{t,j}(0)\|_{2,2} + L_{a_{t,j}}m_{t,j-1}\}$
    $m_{t,j} = \min\{m_{a_{t,j}},\; \|a_{t,j}(0)\|_2 + \tilde\ell_{a_{t,j}}m_{t,j-1}\}$
    $\ell_{t,j} = \tilde\ell_{a_{t,j}}\ell_{t,j-1}$
    $L_{t,j} = L_{t,j-1}\ell_{a_{t,j}} + L_{a_{t,j}}(\ell_{t,j-1})^2$
  end for
  $m_t = m_{t,k_t}$
  $\ell_t = \ell^x_{t,0}\ell_{t,k_t}\ell_{t-1} + \ell^u_{t,0}\ell_{t,k_t}$
  $L_t = L_{t-1}\ell^x_{t,0}\ell_{t,k_t} + (L_{b_t}R_t + l^x_{b_t})^2 L_{t,k_t}\ell_{t-1}^2 + 2\big((L_{b_t}m_{t-1} + l^u_{b_t})(L_{b_t}R_t + l^x_{b_t})L_{t,k_t} + L_{b_t}\ell_{t,k_t}\big)\ell_{t-1} + (L_{b_t}m_{t-1} + l^u_{b_t})^2 L_{t,k_t}$
end for
Output: $m_\tau$, $\ell_\tau$, $L_\tau$
Corollary 4.6. Consider a chain $f$ of $\tau$ computations as defined in Prop. 4.5 and $u^* = (u_1^*;\dots;u_\tau^*)\in\mathbb{R}^p$. The smoothness properties of $f$ on $C' = \{u = (u_1;\dots;u_\tau)\in\mathbb{R}^p : \forall t\in\{1,\dots,\tau\}, \|u_t - u_t^*\|_2\le R'_t\}$ are given as in Prop. 4.5 by considering $R'_t$ in place of $R_t$, $l^x_{\beta_t} + L_{\beta_t}\|u_t^*\|_2$ in place of $l^x_{\beta_t}$, and $\|\beta_{0,t}\|_2 + l^u_{\beta_t}\|u_t^*\|_2$ in place of $\|\beta_{0,t}\|_2$.

We apply our framework to assess the smoothness properties of the Visual Geometry Group (VGG) deep network used for image classification [24].
The VGG network is a benchmark network for image classification with deep networks. The objective is to classify images among 1000 classes. Its architecture is composed of 16 layers, described in Appendix F. We consider in the following the smoothness properties for mini-batches of size $m = 128$, i.e., by concatenating $m$ chains of computations $f^{(i)}$, each defined by a different input. This highlights the impact of the size of the mini-batch for batch-normalization.

Smoothness computations.
To compute the Lipschitz-continuity and smoothness parameters, we recall the list of Lipschitz-continuity and smoothness constants of each layer of interest. For the bilinear and linear operations, we denote by $L$ the smoothness of the bilinear operation $\beta$ and by $\ell$ the Lipschitz-continuity of the linear operation $\beta_u$. The smoothness constants of interest are
1. $\ell_{\mathrm{conv}} = \sqrt{m}\,\lceil k/s\rceil$, $L_{\mathrm{conv}} = \lceil k/s\rceil$, where the patch is of size $k\times k$ and the stride is $s$,
2. $\ell_{\mathrm{full}} = \sqrt{m}$, $L_{\mathrm{full}} = 1$,
3. $\ell_{\mathrm{ReLU}} = 1$, $L_{\mathrm{ReLU}}$ not defined,
4. $\ell_{\mathrm{softmax}} = 2$, $L_{\mathrm{softmax}} = 4$,
5. $\ell_{\mathrm{maxpool}} = 1$, $L_{\mathrm{maxpool}}$ not defined,
6. $\ell_{\mathrm{log}} = 2$, $L_{\mathrm{log}} = 2$.
A Lipschitz-continuity estimate of this architecture can then be computed using Prop. 4.5 on a Cartesian product of balls $C = \{u = (u_1;\dots;u_{16}) : \|u_t\|_2\le R\}$ for $R = 1$, for example.
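To make the bookkeeping concrete, here is a hypothetical sketch that chains per-layer constants through a forward recursion in the spirit of Lemma 4.3 and Prop. 4.5, for a small VGG-like prefix (two convolutions, a max-pooling and a fully-connected layer). It only tracks the bound $m_t$ and the Lipschitz estimate $\ell_t$ (the smoothness estimate of Algo. 4 is omitted), ignores bias terms and the $\sqrt{m}$ factors, and uses an illustrative radius; the full and rigorous computation is Algo. 4.

```python
import math

def affine_layer(m_prev, l_prev, L_b, l_u, l_x, R, l_act=1.0):
    # One layer a_t(b_t(x, u)) on the ball ||u_t|| <= R, tracking only the bound m_t
    # and the Lipschitz estimate l_t (offsets b_t(0,0) are assumed to be zero).
    l_x0 = L_b * R + l_x            # Lipschitz constant w.r.t. the input x on the ball
    l_u0 = L_b * m_prev + l_u       # Lipschitz constant w.r.t. the parameters u
    m_t = l_act * (l_x0 * m_prev + l_u0 * R)
    l_t = l_act * (l_x0 * l_prev + l_u0)
    return m_t, l_t

def parameter_free_layer(m_prev, l_prev, l_act):
    # Activation or pooling layer: no parameters, l_act-Lipschitz, preserving 0.
    return l_act * m_prev, l_act * l_prev

if __name__ == "__main__":
    R = 1.0
    k, s = 3, 1                               # 3x3 convolutions with stride 1
    L_conv = math.ceil(k / s)                 # bilinear smoothness from the list above
    m_t, l_t = 1.0, 0.0                       # ||x_0|| = 1, l_0 = 0
    m_t, l_t = affine_layer(m_t, l_t, L_conv, 0.0, 0.0, R)   # conv + ReLU
    m_t, l_t = affine_layer(m_t, l_t, L_conv, 0.0, 0.0, R)   # conv + ReLU
    m_t, l_t = parameter_free_layer(m_t, l_t, 1.0)           # max-pooling
    m_t, l_t = affine_layer(m_t, l_t, 1.0, 0.0, 0.0, R)      # fully-connected + ReLU
    print("bound on the output:", m_t, " Lipschitz estimate:", l_t)
```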
Smooth VGG.

First, the VGG architecture can be made continuously differentiable by considering the soft-plus activation instead of the ReLU activation and average pooling instead of the max-pooling operation. As shown in Appendix D, we have
1. $\ell_{\mathrm{avgpool}} = 1$, $L_{\mathrm{avgpool}} = 0$,
2. $\ell_{\mathrm{softplus}} = 1$, $L_{\mathrm{softplus}} = 1/4$.
Denoting $\ell_{\mathrm{VGG}}$ and $\ell_{\mathrm{VGG\text{-}smooth}}$ the Lipschitz-continuity estimates of the original VGG network and of its smoothed modification on a Cartesian product of balls $C = \{u = (u_1;\dots;u_{16}) : \|u_t\|_2\le 1\}$ with $\|x_0\|_2 = 1$, we get using Prop. 4.5 that the relative gap $|\ell_{\mathrm{VGG}} - \ell_{\mathrm{VGG\text{-}smooth}}|/\ell_{\mathrm{VGG}} \ll 1$.

Batch-normalization effect. We can also compare the smoothness properties of the smoothed network with the same network modified by adding a batch-normalization layer, with $m$ inputs and normalization parameter $\varepsilon$, at each convolutional layer. As shown in Appendix D, the batch-normalization satisfies
$$m_{\mathrm{batch}} = \delta m, \qquad \ell_{\mathrm{batch}} = 2\varepsilon^{-1/2}, \qquad L_{\mathrm{batch}} = 2m^{-1/2}\varepsilon^{-1}.$$
Denoting $\ell_{\mathrm{VGG\text{-}smooth}}, L_{\mathrm{VGG\text{-}smooth}}$ and $\ell_{\mathrm{VGG\text{-}batch}}, L_{\mathrm{VGG\text{-}batch}}$ the Lipschitz-continuity and smoothness estimates of the smoothed VGG network without and with batch-normalization respectively, on a Cartesian product of balls $C = \{u = (u_1;\dots;u_{16}) : \|u_t\|_2\le 1\}$ with $\|x_0\|_2 = 1$, we get using Prop. 4.5:
for $\varepsilon = 10^{-5}$: $\;\ell_{\mathrm{VGG\text{-}smooth}}\le\ell_{\mathrm{VGG\text{-}batch}}$ and $L_{\mathrm{VGG\text{-}smooth}}\le L_{\mathrm{VGG\text{-}batch}}$;
for $\varepsilon = 10$: $\;\ell_{\mathrm{VGG\text{-}smooth}}\ge\ell_{\mathrm{VGG\text{-}batch}}$ and $L_{\mathrm{VGG\text{-}smooth}}\ge L_{\mathrm{VGG\text{-}batch}}$.
Intuitively, the batch-norm bounds the output of each layer, mitigating the increase of $m_t$ in the computation of the smoothness estimates in lines 8 and 9 of Algo. 4. Yet, for a small $\varepsilon$, this effect is balanced by the non-smoothness of the batch-norm layer (which for $\varepsilon\to 0$ tends to have an infinite slope around 0).

Acknowledgments.
This work was supported by NSF CCF-1740551, NSF DMS-1839371, the program "Learning in Machines and Brains", and faculty research awards.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[3] W. Baur and V. Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22(3):317-330, 1983.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 3rd edition, 2005.
[5] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2012.
[6] J. C. Dunn and D. P. Bertsekas. Efficient dynamic programming implementations of Newton's method for unconstrained optimal control problems. Journal of Optimization Theory and Applications, 63(1):23-38, 1989.
[7] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, 2015.
[8] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.
[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, 2016.
[10] A. Griewank. Who invented the reverse mode of differentiation? Documenta Mathematica, Optimization stories:389-400, 2012.
[11] T. D. Hocking, A. Joulin, F. Bach, and J.-P. Vert. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[12] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448-456, 2015.
[14] K. V. Kim, Y. E. Nesterov, and B. Cherkasskii. An estimate of the effort in computing the gradient. In Doklady Akademii Nauk, volume 275, pages 1306-1309. Russian Academy of Sciences, 1984.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 2012.
[16] Y. LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, 1988.
[17] Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564-5574, 2018.
[18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 2019.
[20] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
[21] V. Roulet, S. Srinivasa, D. Drusvyatskiy, and Z. Harchaoui. Iterative linearized control: stable algorithms and complexity guarantees. In Proceedings of the 36th International Conference on Machine Learning, 2019. Long version.
[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
[23] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[25] P. Werbos. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience, 1994.
A Notations
A.1 Matrices
For a matrix $M\in\mathbb{R}^{d\times n}$, we denote by $\mathrm{Vec}(M)$ the concatenation of the columns of $M$. We denote $\|M\|_{2,2} = \sup_{x\neq 0, y\neq 0}\frac{x^\top M y}{\|x\|_2\|y\|_2}$ its norm induced by the Euclidean norm, and $\|M\|_F = \sqrt{\sum_{ij}M_{ij}^2}$ its Frobenius norm.

A.2 Tensors
A tensor $\mathcal{A} = (a_{ijk})_{i\in\{1,\dots,d\},\,j\in\{1,\dots,n\},\,k\in\{1,\dots,p\}}\in\mathbb{R}^{d\times n\times p}$ is represented as a list of matrices $\mathcal{A} = (A_1,\dots,A_p)$, where $A_k = (a_{ijk})_{i\in\{1,\dots,d\},\,j\in\{1,\dots,n\}}\in\mathbb{R}^{d\times n}$ for $k\in\{1,\dots,p\}$.

Tensor-matrix product.
Given matrices $P\in\mathbb{R}^{d\times d'}$, $Q\in\mathbb{R}^{n\times n'}$, $R\in\mathbb{R}^{p\times p'}$, we denote
$$\mathcal{A}[P, Q, R] = \Big(\sum_{k=1}^p R_{k,1}\,P^\top A_k Q,\;\dots,\;\sum_{k=1}^p R_{k,p'}\,P^\top A_k Q\Big)\in\mathbb{R}^{d'\times n'\times p'}.$$
If $P$, $Q$ or $R$ are identity matrices, we use the symbol "$\cdot$" in place of the identity matrix. For example, we denote $\mathcal{A}[P, Q, \mathrm{I}_p] = \mathcal{A}[P, Q, \cdot] = (P^\top A_1 Q, \dots, P^\top A_p Q)$.
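The tensor-matrix product above maps directly to an einsum. The snippet below is a small NumPy illustration of our own (not part of the paper) that checks the definition of $\mathcal{A}[P,Q,R]$ against a loop over the slices $A_k$.

```python
import numpy as np

def tensor_matrix_product(A, P, Q, R):
    # A has shape (d, n, p); returns A[P, Q, R] with shape (d', n', p').
    return np.einsum('dnp,de,nf,pg->efg', A, P, Q, R)

rng = np.random.default_rng(3)
d, n, p, d2, n2, p2 = 3, 4, 5, 2, 3, 2
A = rng.standard_normal((d, n, p))
P, Q, R = (rng.standard_normal(s) for s in [(d, d2), (n, n2), (p, p2)])

# Slice-by-slice version of the definition: sum_k R[k, l] * P^T A_k Q for each output slice l.
slices = [sum(R[k, l] * P.T @ A[:, :, k] @ Q for k in range(p)) for l in range(p2)]
print(np.allclose(tensor_matrix_product(A, P, Q, R), np.stack(slices, axis=-1)))  # True
```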
Fact A.1. Let $\mathcal{A}\in\mathbb{R}^{d\times p\times n}$, $P\in\mathbb{R}^{d\times d'}$, $Q\in\mathbb{R}^{p\times p'}$, $R\in\mathbb{R}^{n\times n'}$, and $S\in\mathbb{R}^{d'\times d''}$, $T\in\mathbb{R}^{p'\times p''}$, $U\in\mathbb{R}^{n'\times n''}$. Denote $\mathcal{A}' = \mathcal{A}[P,Q,R]\in\mathbb{R}^{d'\times p'\times n'}$. Then $\mathcal{A}'[S,T,U] = \mathcal{A}[PS, QT, RU]\in\mathbb{R}^{d''\times p''\times n''}$.

Flat tensors. If $P$, $Q$ or $R$ are vectors, we consider the flattened object. In particular, for $x\in\mathbb{R}^d$, $y\in\mathbb{R}^n$, we denote $\mathcal{A}[x,y,\cdot] = (x^\top A_1 y;\dots;x^\top A_p y)\in\mathbb{R}^p$, rather than having $\mathcal{A}[x,y,\cdot]\in\mathbb{R}^{1\times 1\times p}$. Similarly, for $z\in\mathbb{R}^p$, we have $\mathcal{A}[\cdot,\cdot,z] = \sum_{k=1}^p z_k A_k\in\mathbb{R}^{d\times n}$.

Transpose.
For a tensor $\mathcal{A} = (A_1,\dots,A_p)\in\mathbb{R}^{d\times n\times p}$, we denote $\mathcal{A}^t = (A_1^\top,\dots,A_p^\top)\in\mathbb{R}^{n\times d\times p}$.

Outer product.
We denote the outer product of three vectors $x\in\mathbb{R}^d$, $y\in\mathbb{R}^n$, $z\in\mathbb{R}^p$ as $x\boxtimes y\boxtimes z\in\mathbb{R}^{d\times n\times p}$, such that $(x\boxtimes y\boxtimes z)_{ijk} = x_i y_j z_k$.

Tensor norm.
We define the norm of a tensor A induced by the Euclidean norm as follows. Definition A.2.
The norm of a tensor $\mathcal{A}$ induced by the Euclidean norm is defined as
$$\|\mathcal{A}\|_{2,2,2} = \sup_{x\neq 0,\,y\neq 0,\,z\neq 0}\frac{\mathcal{A}[x,y,z]}{\|x\|_2\|y\|_2\|z\|_2}. \qquad (23)$$

Fact A.3.
The tensor norm satisfies the following properties, for a given tensor
$\mathcal{A}\in\mathbb{R}^{d\times n\times p}$:
1. $\|\mathcal{A}\|_{2,2,2} = \|\mathcal{A}^t\|_{2,2,2}$,
2. $\|\mathcal{A}[P,Q,R]\|_{2,2,2}\le\|\mathcal{A}\|_{2,2,2}\|P\|_{2,2}\|Q\|_{2,2}\|R\|_{2,2}$ for $P, Q, R$ with appropriate sizes,
3. $\|\mathcal{A}\|_{2,2,2} = \sup_{z\neq 0}\frac{\|\sum_{k=1}^p z_k A_k\|_{2,2}}{\|z\|_2}$.

A.3 Gradients

For a multivariate function $f:\mathbb{R}^d\to\mathbb{R}^n$ composed of $n$ real functions $f^{(j)}$ for $j\in\{1,\dots,n\}$, we denote $\nabla f(x) = (\nabla f^{(1)}(x),\dots,\nabla f^{(n)}(x))\in\mathbb{R}^{d\times n}$, that is, the transpose of its Jacobian at $x$, $\nabla f(x) = (\partial f^{(j)}/\partial x_i(x))_{i\in\{1,\dots,d\},\,j\in\{1,\dots,n\}}\in\mathbb{R}^{d\times n}$. We represent its second-order information by a tensor $\nabla^2 f(x) = (\nabla^2 f^{(1)}(x),\dots,\nabla^2 f^{(n)}(x))\in\mathbb{R}^{d\times d\times n}$.

Fact A.4.
We have for f : R d → R n , twice differentiable, and C ⊂ dom f convex, ℓ Cf = sup x,y ∈ Cx = y k f ( x ) − f ( y ) k k x − y k = sup x ∈ C k∇ f ( x ) k , , L Cf = sup x,y ∈ Cx = y k∇ f ( x ) − ∇ f ( y ) k , k x − y k = sup x ∈ C k∇ f ( x ) k , , . where k∇ f ( x ) k , denotes the operator norm of ∇ f ( x ) and k∇ f ( x ) k , , denotes the tensor norm of ∇ f ( x ) bothwith respect to the Euclidean norm.Proof. We have for x, y ∈ C , k f ( x ) − f ( y ) k = k Z ∇ f ( x + t ( y − x )) ⊤ ( y − x ) dt k ≤ sup x ∈ C k∇ f ( x ) k , k x − y k , k∇ f ( x ) − ∇ f ( y ) k , = k Z ∇ f ( x + t ( y − x ))[ y − x, · , · ] dt k , ≤ sup x ∈ C k∇ f ( x ) k , , k x − y k , which gives ℓ Cf ≤ sup x ∈ C k∇ f ( x ) k , and L Cf ≤ sup x ∈ C k∇ f ( x ) k , , . The equalities come from the definitionsof the gradient and the Hessian.For a real function, f : R d × R p R , whose value is denoted f ( x, y ) , we decompose its gradient ∇ f ( x, y ) ∈ R d + p on ( x, y ) ∈ R d × R p as ∇ f ( x, y ) = (cid:18) ∇ x f ( x, y ) ∇ y f ( x, y ) (cid:19) with ∇ x f ( x, y ) ∈ R d , ∇ y f ( x, y ) ∈ R p . Similarly we decompose its Hessian ∇ f ( x, y ) ∈ R ( d + p ) × ( d + p ) on blocks that correspond to the variables ( x, y ) asfollows ∇ f ( x, y ) = (cid:18) ∇ xx f ( x, y ) ∇ xy f ( x, y ) ∇ yx f ( x, y ) ∇ yy f ( x, y ) (cid:19) with ∇ xx f ( x, y ) ∈ R d × d , ∇ yy f ( x, y ) ∈ R p × p , ∇ xy f ( x, y ) = ∇ yx f ( x, y ) ⊤ ∈ R d × p . Given a function f : R d + p R n and ( x, y ) ∈ R d × R p , we denote ∇ x f ( x, y ) = ( ∇ x f (1) ( x, y ) , . . . , ∇ x f ( n ) ( x, y )) ∈ R d × n and we define similarly ∇ y f ( x, y ) ∈ R p × n .For its second order information we define ∇ xx f ( x, y ) = ( ∇ xx f (1) ( x, y ) , . . . , ∇ xx f ( n ) ( x, y )) , similarly for ∇ xx f ( x, y ) . Dimension of these definitions are ∇ x f ( x, y ) ∈ R d × n , ∇ y f ( x, y ) ∈ R p × n , ∇ xx f ( x, y ) ∈ R d × d × n , ∇ yy f ( x, y ) ∈ R p × p × n , ∇ xy f ( x, y ) ∈ R d × p × n , ∇ yx f ( x, y ) ∈ R p × d × n . A.4 Matrix functions
For a differentiable matrix-valued multivariate function g : R d → R p × n such that g ( x ) = ( g j,k ( x ) ≤ j ≤ p, ≤ k ≤ n ) , wedenote its first order information as a tensor ∇ g ( x ) = (cid:18) ∂g j,k ( x ) ∂x i (cid:19) ≤ i ≤ d, ≤ j ≤ p, ≤ k ≤ n ∈ R d × p × n . This notation is consistent with previous ones, i.e., for f : R d + p R n and g ( y ) = ∇ y f ( x, y ) ∈ R p × n , then ∇ g ( y ) = ∇ xy f ( x, y ) ∈ R d × p × n . From previous definitions, we have the following fact.18 act A.5. For a differentiable matrix-valued multivariate function g : R d → R p × n , A ∈ R p ′ × p , B ∈ R n × n ′ , denoting h ( x ) = Ag ( x ) B ∈ R p ′ × n ′ , we have ∇ h ( x ) = ∇ g ( x )[ · , A ⊤ , B ] ∈ R d × p ′ × n ′ . A.5 Bilinear functions
Definition A.6.
A function β : R d × R n → R p is bilinear if it is represented by a tensor B ∈ R d × n × p such that forany x ∈ R d , y ∈ R n , β ( x, y ) = B [ x, y, · ] . The gradient of a bilinear function β represented by a tensor B ∈ R d × n × p at a point x, y is given by ∇ x β ( x, y ) = B [ · , y, · ] ∈ R d × p , ∇ y β ( x, y ) = B [ x, · , · ] ∈ R n × p . (24)The Hessian of the bilinear function is given ∇ xx β ( x, y ) = 0 , ∇ yy β ( x, y ) = 0 , ∇ xy β ( x, y ) = B , ∇ yx β ( x, y ) = B t . (25)A bilinear function is not Lipschitz continuous as can be seen from Eq. (24). It is smooth w.r.t. the Euclidean normwith a smoothness constant given by the tensor norm of B as shown in the following proposition. Lemma A.7.
The smoothness of a bilinear function β defined by a tensor B is upper bounded as L β ≤ kBk , , .Proof. We have k∇ β ( x, y ) k , , = sup z =0 k P pk =1 z k ˜ B k k , k z k , where ∇ β ( x, y ) = ( ˜ B , . . . , ˜ B p ) . We have by Eq. (25) that P pk =1 z k ˜ B k is of the form p X k =1 z k ˜ B k = (cid:18) P pk =1 z k B k P pk =1 z k B ⊤ k (cid:19) where B = ( B , . . . , B p ) . Therefore we get k P pk =1 z k ˜ B k k , = k P pk =1 z k B k k , , see [12, Theorem 7.3.3.]. There-fore k∇ β ( x, y ) k = sup z =0 k P pk =1 z k ˜ B k k , k z k = sup z =0 k P pk =1 z k B k k , k z k = kBk , , . B Oracle arithmetic complexity proofs
B.1 Feasibility of the optimization oracle steps
The gradient step (15) is always feasible for any $\kappa > 0$, the Gauss-Newton step (16) is feasible for any $\kappa > 0$ if $h, r$ are convex, and the Newton step is feasible for any $\kappa > 0$ if $h\circ f$ and $r$ are convex. A sufficient condition for the Newton step to be feasible when $h\circ f$ is not convex, but $h\circ f$ and $r$ are smooth, is to choose $\kappa^{-1} < (L_{h\circ f} + L_r)^{-1}$. In other words, the step-size must be chosen small enough such that the Newton step is a convex problem.

B.2 Optimization oracles as linear quadratic problems
Lemma B.1.
Let f be a chain of τ computations φ t . Let u = ( u ; . . . ; u τ ) ∈ R P τt =1 p t , x ∈ R d , denote x t = f t ( x , u ) and E ⊤ t = (0 p t d , p t p , . . . , I p t p t , . . . , p t p τ ) ∈ R p t × ( d + P τt =1 p t ) such that E ⊤ t ( x ; u ) = u t for t ∈{ , . . . , τ } . . If φ t are differentiable, then ∇ f t ( x , u ) = ∇ f t − ( x , u ) ∇ x t − φ t ( x t − , u t ) + E t ∇ u t φ t ( x t − , u t ) (26) with ∇ f ( x , u ) = E , with E ⊤ = (I d d , d p , . . . , . . . , d p τ ) ∈ R d × ( d + P τt =1 p t ) such that E ⊤ ( x ; u ) = x .2. If φ t are twice differentiable, ∇ f t ( x , u ) = ∇ f t − ( x , u )[ · , · , ∇ x t − φ t ( x t − , u t )] (27) + ∇ x t − x t − φ t ( x t − , u t )[ ∇ f t − ( x , u ) ⊤ , ∇ f t − ( x , u ) ⊤ , · ]+ ∇ x t − u t φ t ( x t − , u t )[ ∇ f t − ( x , u ) ⊤ , E ⊤ t , · ]+ ∇ u t x t − φ t ( x t − , u t )[ E ⊤ t , ∇ f t − ( x , u ) ⊤ , · ]+ ∇ u t u t φ t ( x t − , u t )[ E ⊤ t , E ⊤ t , · ] with ∇ f ( x , u ) = 0 .Proof. It follows from the definition of the chain of computations and the notations used for tensors. Precisely wehave that f t ( x , u ) = φ t ( f t − ( x , u ) , E ⊤ t ( x ; u )) , hence the first result that can be written ∇ f t ( x , u ) = ∇ f t − ( x , u ) ∇ x t − φ t ( f t − ( x , u ) , E ⊤ t ( x ; u )) + E t ∇ u t φ t ( f t − ( x , u ) , E ⊤ t ( x ; u )) , hence the second result using the tensor notations. Proposition 3.1.
Let f be a chain of τ computations φ t : R d t − × R p t → R d t , u = ( u ; . . . ; u τ ) and x ∈ R d . Denote ψ = f x ,τ and f ( x , u ) = ( x ; . . . ; x τ ) . Assume r to be decomposable as r ( u ) = P τt =1 r t ( u t ) . Gradient (15) ,Gauss-Newton (16) and Newton (17) oracles on h ◦ ψ + r are the solutions v ∗ = ( v ∗ ; . . . ; v ∗ τ ) of problems of the form min v ,...,v τ ∈ R p × ... × R pτ y ,...,y τ ∈ R d × ... × R dτ τ X t =1 y ⊤ t P t y t + p ⊤ t y t + y ⊤ t − R t v t + 12 v ⊤ t Q t v t + q ⊤ t v t + κ k v t k (18) subject to y t = A t y t − + B t v t for t ∈ { , . . . , τ } ,y = 0 , where A t = ∇ x t − φ t ( x t − , u t ) ⊤ , B t = ∇ u t φ t ( x t − , u t ) ⊤ ,p τ = ∇ h ( ψ ( u )) , p t = 0 for t = τ ,q t = ∇ r t ( u t ) ,
1. for gradient oracles (15) , P t = 0 , R t = 0 , Q t = 0 ,
2. for Gauss-Newton oracles (16) , P τ = ∇ h ( ψ ( u )) , P t = 0 for t = τ, R t = 0 , Q t = ∇ r t ( u t ) , ,3. for Newton oracles (17) , defining λ τ = ∇ h ( ψ ( u )) , λ t − = ∇ x t − φ t ( x t − , u t ) λ t for t ∈ { , . . . , τ } , we have P τ = ∇ h ( ψ ( u )) , P t − = ∇ x t − x t − φ t ( x t − , u t )[ · , · , λ t ] for t ∈ { , . . . , τ } ,R t = ∇ x t − u t φ t ( x t − , u t )[ · , · , λ t ] , Q t = ∇ r t ( u t ) + ∇ u t u t φ t ( x t − , u t )[ · , · , λ t ] . roof. To reformulate the optimization oracle problems as quadratic problems with linear dynamics we reformulate ∇ ψ ( u ) ⊤ v as a linear chain of compositions and ∇ ψ ( u )[ v, v, ∇ h ( ψ ( u ))] as a quadratic on the linear trajectory definedby the gradient and the parameters using Lemma B.1. Precisely, for u = ( u , . . . ; u τ ) and v = ( v ; . . . ; v τ ) , denoting f x ( u ) = ( x ; . . . ; x τ ) and ∇ f x ( u ) ⊤ v = ( y ; . . . ; y τ ) with ∇ f x ,t ( u ) ⊤ v = y t , we have from (26) y t = ∇ x t − φ t ( x t − , u t ) ⊤ y t − + ∇ u t φ t ( x t − , u t ) ⊤ v t , for t ∈ { , . . . , τ } (28) y = 0 , and ∇ ψ ( u ) ⊤ v = ∇ f x ,τ ( u ) ⊤ v = y τ . For the second order derivatives, from (27), we have for v = ( v ; . . . ; v τ ) and λ t ∈ R d t , ∇ f x ,t ( u )[ v, v, λ t ] = ∇ f x ,t − ( u )[ v, v, ∇ x t − φ t ( x t − , u t ) λ t ]+ 12 ∇ x t − x t − φ t ( x t − , u t )[ y t − , y t − , λ t ]+ ∇ x t − u t φ t ( x t − , u t )[ y t − , v t , λ t ]+ 12 ∇ u t u t φ t ( x t − , u t )[ v t , v t , λ t ] Hence we get ∇ ψ ( u )[ v, v, ∇ h ( ψ ( u ))] = τ X t =1 ∇ x t − x t − φ t ( x t − , u t )[ y t − , y t − , λ t ] (29) + ∇ x t − u t φ t ( x t − , u t )[ y t − , v t , λ t ]+ 12 ∇ u t u t φ t ( x t − , u t )[ v t , v t , λ t ] where y t are given in (28) and λ t are defined by λ τ = ∇ h ( ψ ( u )) , λ t − = ∇ x t − φ t ( x t − , u t ) λ t for t ∈ { , . . . , τ } . The results follow by using the decomposability of r and inserting (28) and (29).We present the resolution of the Newton step by dynamic programming in Algo. 5 whose implementation isjustified in Proposition B.2. Note that the gradient is computed during the first backward pass which can reducethe computations by factorizing those computations. For the Gauss-Newton steps the same dynamic programmingapproach can be applied, however it is less computationally expansive to use automatic differentiation procedures aspresented in Sec. 3. Proposition B.2.
Consider problem (18) and assume it is bounded below, then the cost-to-go functions defined for t ∈ { , . . . , τ } and x t ∈ R δ t as cost t ( x t ) = min v t +1 ,...,v τ y t ,...,y τ τ X t ′ = t y ⊤ t ′ P t ′ y t ′ + p ⊤ t ′ y t ′ + τ X t ′ = t +1 y ⊤ t ′ − R t ′ v t ′ + 12 v ⊤ t ′ Q t ′ v t ′ + q ⊤ t ′ v t ′ + κ k v t ′ k (30) subject to y t ′ = A t ′ y t ′ − + B t ′ v t ′ for t ′ ∈ { t + 1 , . . . , τ } ,y t = x t , where P = 0 , p = 0 , are quadratics of the form cost t ( x t ) = 12 x ⊤ t C t x t + c ⊤ t x t + cste , (31) where C t , c t are defined recursively in line 20 Algo. 5 with C t = C ⊤ t and cste is a constant. The solution of (18) isgiven by, starting from y = 0 , v ∗ t = K t y t − + k t y t = A t y t − + B t v ∗ t , where K t and k t are defined in line 21 of Algo. 5. roof. The cost-to-go functions satisfy the recursive equation (20) for t ∈ { , . . . , τ } cost t − ( x t − ) = 12 x ⊤ t − P t − x t − + p ⊤ t − x t − + min v t ∈ R ρt (cid:26) x ⊤ t − R t v t + 12 v ⊤ t Q t v t + q ⊤ t v t + κ k v t k + cost t ( A t x t − + B t v t ) (cid:27) , starting from cost τ ( x τ ) = x ⊤ τ P τ x τ + p ⊤ τ x τ so we get C τ = P τ and c τ = p τ . Assume that the cost-to-go function cost t has the form (31) for t ∈ { , . . . , τ } then the recursive equation (20) reads cost t − ( x t − ) = 12 x ⊤ t − ( P t − + A ⊤ t C t A t ) x t − + ( A ⊤ t c t + p t − ) ⊤ x t − + min v t ∈ R ρt n v ⊤ t ( R ⊤ t x t − + q t + B ⊤ t ( C t A t x t − + c t )) + 12 v ⊤ t ( κ I + Q t + B ⊤ t C t B t ) v t o . If κ I + Q t + B ⊤ t C t B t , then the minimization problem is unbounded below and so is the original objective. If κ I + Q t + B ⊤ t C t B t ≻ , then the minimization gives us cost t − as a quadratic and the corresponding minimizer v ∗ t ( x t − ) for a given x t − , i.e. C t − = P t − + A ⊤ t C t A t − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( R ⊤ t + B ⊤ t C t A t ) ,c t − = A ⊤ t c t + p t − − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( q t + B ⊤ t c t ) ,v ∗ t ( x t − ) = − ( κ I + Q t + B ⊤ t C t B t ) − (( R ⊤ t + B ⊤ t C t A t ) x t − + q t + B ⊤ t c t ) . The solution of (18) is given by computing cost (0) which amounts to compute, starting from y = 0 , v ∗ t = arg min v ∈ R ρt (cid:26) v ⊤ Q t v + q ⊤ t v + y ⊤ t − R t v + cost t +1 ( A t y t − + B t v ) (cid:27) = v ∗ t ( x t − ) ,y t = A t y t − + B t v ∗ t . Finally we present the derivation of a gradient step, i.e., gradient back-propagation, as a dynamic programmingprocedure, which gives the forward-backward algorithm presented in Sec. 3 by taking r = 0 , κ = − . Proposition B.3.
Consider the gradient step (15) as formulated in (18) with κ = 1 /γ . The cost-to-go functionsdefined as in (30) are linear of the form cost t ( x t ) = λ ⊤ t x t + cste , (32) where λ τ = ∇ h ( ψ ( u )) λ t − = ∇ x t − φ t ( x t − , u t ) λ t for t ∈ { , . . . , τ } and the solution of the step is given by v ∗ t = − γ ( ∇ r ( u t ) + ∇ u t φ t ( x t − , u t ) λ t ) . Proof.
The cost-to-go function defined in (30) for a gradient step reads for t = τ , cost τ ( x τ ) = p ⊤ τ x τ , so we getEq. (32) for t = τ with λ τ = ∇ h ( ψ ( u )) . Assume that the cost-to-go function has the form (32) for t ∈ { , . . . , τ } ,then the recursive equation (20) reads cost t − ( x t − ) = min v t ∈ R ρt (cid:26) v ⊤ t q t + λ ⊤ t ( A t x t − + B t v t ) + 12 γ k v t k (cid:27) So we get that cost t − is a linear function defined by λ t − = A ⊤ t λ t and that the optimal corresponding parameter isindependent of x t − and reads v ∗ t = − γ ( q t + B ⊤ t λ t ) . Plugging the values of A t , B t , q t into the solutions give the results.22 lgorithm 5 Newton oracle by dynamic programming Inputs:
Chain of computations f defined by φ t , objective h , regularization r , regularization for the step κ , currentweights u = ( u ; . . . ; u τ ) Forward pass: for t = 1 , . . . , τ do Compute x t = φ t ( x t − , u t ) Store A t = ∇ x t − φ t ( x t − , u t ) ⊤ , B t = ∇ u t φ t ( x t − , u t ) ⊤ and ∇ u t u t φ t ( x t − , u t ) , ∇ u t x t − φ t ( x t − , u t ) , ∇ x t − x t − φ t ( x t − , u t ) end for Initialize λ τ = ∇ h ( x τ ) , P τ = ∇ h ( x τ ) , p τ = ∇ h ( x τ ) for t = τ, . . . , do Compute P t − = ∇ x t − x t − φ t ( x t − , u t )[ · , · , λ t ] p t − = 0 Q t = ∇ u t u t φ t ( x t − , u t )[ · , · , λ t ] + ∇ r t ( u t ) q t = ∇ r t ( u t ) R t = ∇ x t − u t φ t ( x t − , u t )[ · , · , λ t ] Compute λ t − = ∇ x t − φ t ( x t − , u t ) λ t end for Initialize C τ = P τ , c τ = p τ , feasible = True for t = τ, . . . , do if κ I + Q t + B ⊤ t C t B t then feasible = False break end if
Compute C t − = P t − + A ⊤ t C t A t − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( R ⊤ t + B ⊤ t C t A t ) c t − = A ⊤ t c t + p t − − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( q t + B ⊤ t c t ) Store K t = − ( κ I + Q t + B ⊤ t C t B t ) − ( R ⊤ t + B ⊤ t C t A t ) k t = − ( κ I + Q t + B ⊤ t C t B t ) − ( q t + B ⊤ t c t ) end for if feasible = False then
Re-do 2nd backward pass with κ := 2 · κ end if Rollout:
Initialize y = 0 for t = 1 , . . . , τ do v ∗ t = K t y t − + k t , y t = A t y t − + B t v t end for Output: ( v ∗ ; . . . ; v ∗ τ ) .3 Detailed complexities of forward and backward passes Definition B.4 (Sparsity of the operations) . We define the sparsity s β of a bilinear operation β as the number ofnon-zero elements in its corresponding tensor.We define the sparsity s α of a function α as the sparsity of its gradient, i.e., the maximal number of its non-zeroelements for any inputs. The sparsity of a bilinear operation amounts to the number of multiplications needed to compute B [ x, y, z ] , B [ · , y, z ] , B [ x, · , z ] or B [ x, y, · ] , which gives us the sparsity of the two bilinear operations studied in this paper. Fact B.5.
For a matrix-product as in (8) , we have s β = m ˜ δδ . For a convolution as in (9) , we have s β = mn p n f s f . We considered Π k Z t − as the extraction of coordinates and not a matrix-vector product. Note that the sparsityof the bilinear operation defines also the number of multiplications needed to compute gradient vector products like ∇ x t − β t ( x t − , u t ) λ t +1 or ∇ u t β t ( x t − , u t ) λ t +1 for λ t +1 ∈ R δ t .The sparsity of a function f ∈ R d → R n naturally gives the number of multiplications needed to compute gradient-vector products ∇ f ( x ) λ for any x ∈ R d , λ ∈ R n . For element-wise activation functions as in (10), we have s α = mδ ,where we consider the input of the activation function to be z = Vec( Z ) for Z ∈ R m × δ . Note that the sparsity of anactivation function as defined here does not directly give the cost of computing it, neither its gradient. Forward-backward detailed complexity.
We present in the next proposition the cost of computing only the back-ward pass to compute the whole gradient. The cost of computing the function and the gradients of the layers in theforward pass can be further detailed using the sparsity of the bilinear operation and the cost of computing the activationfunction and its derivatives. The detailed complexities given in Sec. 3 follow.
Proposition B.6.
Consider a chain f of τ layers as defined in Def. 2.1 whose layers φ t are defined by a t , b t as in (7) .Then the cost of the backward pass defined in Algo. 2 is of the order of O τ X t =1 s a t + 2 s β t + s β ut + s β xt ! elementary operations.Proof. If the chain of layers has the form (7), the gradient vector products during the backward pass read ∇ x t − φ t ( x t − , u t ) λ t +1 = ∇ x t − b t ( x t − , u t ) ∇ a t ( ω t ) λ t +1 = ( B t [ · , u t , · ] + ∇ β zt ( x t − )) ∇ a t ( ω t ) λ t +1 , ∇ u t φ t ( x t − , u t ) λ t +1 = ∇ u t b t ( x t − , u t ) ∇ a t ( ω t ) λ t +1 = ( B t [ x t − , · , · ] + ∇ β ut ( u t )) ∇ a t ( ω t ) λ t +1 , where ω t = b t ( x t − , u t ) . The definitions of the sparsity of bilinear or general operations give the result by looking ateach operation starting from the right. B.4 Gauss-Newton by axutomatic differentiation
Derivatives of the gradient vector product can then be computed themselves by back-propagation as recalled in thefollowing lemma.
Lemma B.7 ([21, Lemma 3.4]) . Consider a differentiable chain of composition f and an input x ∈ R d suchthat ψ = f x ,τ : R P τt =1 p t → R d τ . Given a variable u ∈ R P τt =1 p t and a decomposable differentiable function g : R P τt =1 p t → R such that g ( u ) = P τt =1 g t ( u t ) for u = ( u ; . . . ; u τ ) , computing the derivative of µ → g ( ∇ ψ ( u ) µ ) requires two calls to an automatic-differentiation procedure. Proposition 3.3.
Consider the Gauss-Newton oracle (16) on u = ( u ; . . . ; u τ ) for a convex objective h , a convexdecomposable regularization r ( u ) = P τt =1 r t ( u t ) and a differentiable chain of computations f with output ψ = f x ,τ on some input x . We have that . the Gauss-Newton oracle amounts to solving min µ ∈ R dτ (cid:16) q ψ ( u ) h (cid:17) ⋆ ( µ ) + (cid:0) q ur + κ k · k / (cid:1) ⋆ ( −∇ ψ ( u ) µ ) , (22) where for a function f we denote by f ⋆ its convex conjugate,2. the Gauss-Newton oracle is v ∗ = ∇ (cid:0) q ur + κ k · k / (cid:1) ⋆ ( −∇ ψ ( u ) µ ∗ ) where µ ∗ is the solution of (22) ,3. the dual problem (22) can be solved by d τ + 1 calls to an automatic differentiation procedure.Proof. The first and second claims follow from standard duality computations applied to (16), they require convexityof h and r . The third claim comes from the fact that (22) is a quadratic convex problem that can be solved in at most d τ iterations of a conjugate gradient descent. Each iteration requires to compute the gradient of µ → ( q ur + κ k ·k / ⋆ ( −∇ ψ ( u ) µ ) which requires two calls to an automatic differentiation procedure by Lemma B.7 and using that r ∗ is also decomposable. A last call to an automatic differentiation procedure is needed to compute ∇ ψ ( u ) µ ∗ . Thecosts of computing ∇ ( q ψ ( u ) h ) ⋆ ( µ ) for µ ∈ R d τ and ∇ ( q ur + κ k · k / ⋆ ( u ) for u ∈ R P τt =1 p t are ignored since they donot involve a chain of compositions and are assumed to be easily accessible. C Smoothness computations
C Smoothness computations
C.1 Elementary operations
Univariate functions.
Lemma C.1.
Let $\alpha_i \in \mathcal{C}_{\ell_i, L_i}$ for $i = 1, \ldots, n$. Denote $\ell = (\ell_i)_{i=1}^n$ and $L = (L_i)_{i=1}^n$.
1. Assume $\alpha_i: \mathbb{R}^{d_i} \to \mathbb{R}^{m_i}$; then $a: x = (x_1; \ldots; x_n) \in \mathbb{R}^{\sum_i d_i} \mapsto (\alpha_1(x_1); \ldots; \alpha_n(x_n)) \in \mathbb{R}^{\sum_i m_i}$ is $\|\ell\|_2$-Lipschitz continuous and $\|L\|_\infty$-smooth.
2. Assume $\alpha_i: \mathbb{R}^{d_i} \to \mathbb{R}^m$; then $a: x = (x_1; \ldots; x_n) \mapsto \sum_{i=1}^n \alpha_i(x_i)$ is $\|\ell\|_2$-Lipschitz continuous and $\|L\|_\infty$-smooth.
3. Assume $\alpha_i: \mathbb{R}^d \to \mathbb{R}^{m_i}$; then $a: x \mapsto (\alpha_1(x); \ldots; \alpha_n(x))$ is $\|\ell\|_2$-Lipschitz continuous and $\|L\|_2$-smooth.
4. Assume $\alpha_i: \mathbb{R}^d \to \mathbb{R}^m$; then $a: x \mapsto \sum_{i=1}^n \alpha_i(x)$ is $\|\ell\|_1$-Lipschitz continuous and $\|L\|_1$-smooth.
Proof.
1. We have for x = ( x ; . . . ; x n ) ∈ R P ni =1 d i and z = ( z ; . . . ; z n ) ∈ R P ni =1 m i , k∇ a ( x ) z k = k n X i =1 ∇ α i ( x i ) z i k ≤ n X i =1 k z i k k∇ α i ( x i ) k , ≤ k z k vuut n X i =1 ℓ i , a . For x = ( x ; . . . ; x n ) , y = ( y ; . . . ; y n ) ∈ R P ni =1 d i , we have k ( ∇ a ( x ) −∇ a ( y )) z k ≤ n X i =1 k z i k k∇ α i ( x i ) −∇ α i ( y i ) k , ≤ n X i =1 k z i k k x i − y i k L i ≤ k z k k x − y k max i ∈{ ,...,n } L i . Hence k∇ a ( x ) − ∇ a ( y ) k , ≤ k x − y k max i ∈{ ,...,n } L i which gives an upper bound on the smoothness of a .2. We have for x = ( x ; . . . ; x n ) ∈ R P ni =1 d i and z ∈ R m , k∇ a ( x ) z k = n X i =1 k∇ α i ( x i ) z k ≤ n X i =1 ℓ i k z k , which gives the Lipschitz-continuity parameter. Similarly we have for x = ( x ; . . . ; x n ) , y = ( y ; . . . ; y n ) ∈ R P ni =1 d i , k ( ∇ a ( x ) −∇ a ( y )) z k = n X i =1 k ( ∇ α i ( x i ) −∇ α i ( y i )) z k ≤ n X i =1 L i k x i − y i k k z k ≤ max i ∈{ ,...,n } L i k x − y k k z k , which gives the smoothness constant of a .3. The bound on the Lipschitz-continuity parameter follows from the same argument as in 1. For the smoothnessparameter, we have for x, y ∈ R d and z = ( z ; . . . ; z n ) ∈ R P ni =1 m i , k ( ∇ a ( x ) − ∇ a ( y )) z k ≤ n X i =1 k z i k k∇ α i ( x ) − ∇ α i ( y ) k , ≤ n X i =1 k z i k k x − y k L i ≤ k z k k x − y k vuut n X i =1 L i . Hence the result as in 1.4. Clear by linearity of the gradient and triangular inequality.
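As a quick numerical illustration of item 1 (a sketch of our own, with arbitrary maps and dimensions), the empirical Lipschitz ratio of a block-wise map stays below the $\|\ell\|_2$ bound of the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)

def a(x):
    # a(x) = (alpha_1(x_1); alpha_2(x_2)) with tanh and sin, both 1-Lipschitz, so ||l||_2 = sqrt(2).
    return np.concatenate([np.tanh(x[:3]), np.sin(x[3:])])

ratio = 0.0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    ratio = max(ratio, np.linalg.norm(a(x) - a(y)) / np.linalg.norm(x - y))
print(ratio, "<=", np.sqrt(2))    # empirical ratio below the bound of Lemma C.1
```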
Bilinear functions.
Lemma C.2.
Consider s × t bilinear functions β i +( j − s : R d i × R p j → R m i +( j − s for i ∈ { , . . . s } , j ∈ { , . . . t } then β : ( R P si =1 d i × R P tj =1 p j → R P stk =1 m k ( x, u ) → ( β ( x , u ); . . . ; β s ( x s , u ); β s +1 ( x , u ); . . . ; β st ( x s , u t )) is L β = max k ∈{ ,...,st } L β k smooth.Proof. By Lemma A.7, we have that L β = sup x,u k β ( x, u ) k / k x k k u k . Now k β ( x, u ) k = s X i =1 t X j =1 k β i + s ( j − ( x i , u j ) k ≤ s X i =1 t X j =1 L β i + s ( j − k x i k k u j k ≤ max k ∈{ ,...,st } L β k k x k k u k . C.2 Compositions
The smoothness properties of the functions can be derived by bounding appropriately their first and second order information. Even if, e.g., the functions are not twice differentiable, the same results would apply by decomposing carefully the terms; we directly use the second order information as it directly gives what we are interested in. For a smooth function, an upper bound on the Lipschitz continuity of the function on a bounded set can be estimated even if the function is not Lipschitz continuous. Similarly, a bound on the function on a bounded set can be refined as defined below.
Fact C.3.
For a function $f \in \mathcal{C}_{m_f, \ell_f, L_f}$ and $R > 0$, denoting $B_R = \{x \in \operatorname{dom} f: \|x\| \leq R\}$, we have
$$\ell_f^{B_R} \leq \ell_f(R) := \min\{\ell_f,\ \|\nabla f(0)\|_{2,2} + R\, L_f\}, \qquad m_f^{B_R} \leq m_f(R) := \min\{m_f,\ \|f(0)\| + R\, \ell_f(R)\}.$$
For a sequence of compositions we have the following result.
Lemma C.4.
Consider $a = a_k \circ \ldots \circ a_1$ with $a_j \in \mathcal{C}_{m_{a_j}, \ell_{a_j}, L_{a_j}}$ for $j \in \{1, \ldots, k\}$ and $a: \mathbb{R}^d \to \mathbb{R}^n$. Denote $B_R = \{x \in \mathbb{R}^d: \|x\| \leq R\}$ and, for $j \in \{1, \ldots, k\}$,
$$m_j = m_{a_j}(m_{j-1}), \qquad \ell_j = \ell_{j-1}\, \ell_{a_j}(m_{j-1}), \qquad L_j = L_{a_j} \ell_{j-1}^2 + L_{j-1}\, \ell_{a_j}(m_{j-1}),$$
with $m_0 = R$, $\ell_0 = 1$, $L_0 = 0$. We have
$$m_a^{B_R} \leq m_k, \qquad \ell_a^{B_R} \leq \ell_k = \prod_{j=1}^k \ell_{a_j}(m_{j-1}), \qquad L_a^{B_R} \leq L_k = \sum_{j=1}^k L_{a_j} \Big(\prod_{i=1}^{j-1} \ell_{a_i}(m_{i-1})\Big)^2 \prod_{i=j+1}^k \ell_{a_i}(m_{i-1}).$$
Proof.
The bound on the output is a direct iterative application of Fact C.3. We have for x ∈ R d , ∇ a ( x ) = k Y j =1 g j ( x ) , where g j ( x ) = ∇ a j ( a j − ◦ . . . ◦ a ( x )) for j ∈ { , . . . , k } . We have sup x ∈ R d : k x k ≤ R k g j ( x ) k , ≤ min { ℓ a j , k∇ a j (0) k , + L a j m B R a j − ◦ ... ◦ a } . Therefore ℓ B R a ≤ k Y j =1 ℓ a j ( m j − ) . We have for x ∈ R d , ∇ a ( x ) = k X j =1 ∇ a j ( x ) j − Y i =1 g i ( x ) ! ⊤ , j − Y i =1 g i ( x ) ! ⊤ , k Y i = j +1 g i ( x ) . Therefore L B R a ≤ k X j =1 L a j j − Y i =1 ℓ a i ( m i − ) ! k Y i = j +1 ℓ a i ( m i − ) . Lemma C.4 can be used to estimate the smoothness of a chain of computations with respect to its input for fixedparameters.
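The recursion of Lemma C.4 is mechanical and can be tabulated layer by layer; as a parenthetical illustration, the sketch below (our own, using global constants in place of the refined bounds $\ell_{a_j}(m_{j-1})$ of Fact C.3 for brevity) shows the bookkeeping.

```python
def compose_bounds(layers, R):
    # layers: list of dicts with global constants m (output bound), l (Lipschitz), L (smoothness)
    # and f0 (norm of the value at 0). Returns bounds on the composition over the ball of radius R.
    m, l, L = R, 1.0, 0.0
    for a in layers:
        m_next = min(a["m"], a["f0"] + a["l"] * m)   # output bound, Fact C.3
        L = a["L"] * l ** 2 + L * a["l"]             # smoothness recursion of Lemma C.4
        l = l * a["l"]                               # Lipschitz recursion of Lemma C.4
        m = m_next
    return m, l, L

# Three hypothetical 1-Lipschitz, 1-smooth maps vanishing at 0, over a ball of radius 10.
print(compose_bounds([{"m": float("inf"), "l": 1.0, "L": 1.0, "f0": 0.0}] * 3, R=10.0))
```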
Corollary C.5.
Consider a chain $f$ of $\tau$ computations $\phi_t \in \mathcal{C}_{m_{\phi_t}, \ell_{\phi_t}, L_{\phi_t}}$ with given parameters $u = (u_1; \ldots; u_\tau)$. Denote $\phi_{t, u_t} = \phi_t(\cdot, u_t)$, $B_R = \{x \in \mathbb{R}^d: \|x\| \leq R\}$ and, for $t \in \{1, \ldots, \tau\}$,
$$m_t = m_{\phi_t(\cdot, u_t)}(m_{t-1}), \qquad \ell_t = \ell_{t-1}\, \ell_{\phi_t(\cdot, u_t)}(m_{t-1}), \qquad L_t = L_{\phi_t(\cdot, u_t)} \ell_{t-1}^2 + L_{t-1}\, \ell_{\phi_t(\cdot, u_t)}(m_{t-1}),$$
with $m_0 = R$, $\ell_0 = 1$, $L_0 = 0$. We have
$$m_{f_{\tau, u}}^{B_R} \leq m_\tau, \qquad \ell_{f_{\tau, u}}^{B_R} \leq \ell_\tau = \prod_{t=1}^\tau \ell_{\phi_t(\cdot, u_t)}(m_{t-1}), \qquad L_{f_{\tau, u}}^{B_R} \leq L_\tau = \sum_{t=1}^\tau L_{\phi_t(\cdot, u_t)} \Big(\prod_{s=1}^{t-1} \ell_{\phi_s(\cdot, u_s)}(m_{s-1})\Big)^2 \prod_{s=t+1}^\tau \ell_{\phi_s(\cdot, u_s)}(m_{s-1}).$$
We have the following result for smooth and Lipschitz continuous chains of computations.
Lemma 4.3.
Consider a chain $f$ of $\tau$ computations $\phi_t \in \mathcal{C}_{\ell_{\phi_t}, L_{\phi_t}}$ initialized at some $x \in \mathbb{R}^d$.
(i) We have $\ell_{f_{\tau,x}} \leq \ell_\tau$, where $\ell_0 = 0$ and $\ell_t = \ell_{\phi_t} + \ell_{t-1} \ell_{\phi_t}$ for $t \in \{1, \ldots, \tau\}$.
(ii) We have $L_{f_{\tau,x}} \leq L_\tau$, where $L_0 = 0$ and $L_t = L_{t-1} \ell_{\phi_t} + L_{\phi_t} (1 + \ell_{t-1})^2$ for $t \in \{1, \ldots, \tau\}$.
Proof. The first claim follows directly from Lemma B.1. For the second claim, (27) gives $L_{f_t} \leq L_{f_{t-1}} \ell_{\phi_t} + L_{\phi_t} \ell_{f_{t-1}}^2 + 2 L_{\phi_t} \ell_{f_{t-1}} + L_{\phi_t}$, which simplifies to give the result.
For a bivariate function $\phi(x, u): \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^\eta$, we define
$$\ell^u_\phi = \sup_{u \in \mathbb{R}^p,\, x \in \mathbb{R}^d} \ell_{\phi(x, u + \cdot)}, \qquad \ell^x_\phi = \sup_{u \in \mathbb{R}^p,\, x \in \mathbb{R}^d} \ell_{\phi(x + \cdot, u)}.$$
Moreover, if the function is continuously differentiable, we define
$$L^{uu}_\phi = \sup_{u, x} \ell_{\nabla_u \phi(x, u + \cdot)}, \quad L^{xu}_\phi = \sup_{u, x} \ell_{\nabla_u \phi(x + \cdot, u)}, \quad L^{ux}_\phi = \sup_{u, x} \ell_{\nabla_x \phi(x, u + \cdot)}, \quad L^{xx}_\phi = \sup_{u, x} \ell_{\nabla_x \phi(x + \cdot, u)}.$$
For a bivariate continuously differentiable function $\phi(x, u): \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^\eta$, we have
$$\ell^u_\phi = \sup_{u, x} \|\nabla_u \phi(x, u)\|_{2,2}, \qquad \ell^x_\phi = \sup_{u, x} \|\nabla_x \phi(x, u)\|_{2,2}.$$
If the function $\phi$ is twice continuously differentiable, we have
$$L^{uu}_\phi = \sup_{u, x} \|\nabla_{uu} \phi(x, u)\|_{2,2,2}, \quad L^{xu}_\phi = \sup_{u, x} \|\nabla_{xu} \phi(x, u)\|_{2,2,2}, \quad L^{ux}_\phi = \sup_{u, x} \|\nabla_{ux} \phi(x, u)\|_{2,2,2}, \quad L^{xx}_\phi = \sup_{u, x} \|\nabla_{xx} \phi(x, u)\|_{2,2,2}.$$
Finally, for $R_x \geq 0$, $R_u \geq 0$, we have
$$\sup_{(x, u) \in B_{R_x} \times B_{R_u}} \|\nabla_u \phi(x, u)\|_{2,2} \leq \ell^u_\phi(R_x, R_u) := \min\{\ell^u_\phi,\ \|\nabla_u \phi(0, 0)\|_{2,2} + L^{uu}_\phi R_u + L^{xu}_\phi R_x\}, \qquad (33)$$
$$\sup_{(x, u) \in B_{R_x} \times B_{R_u}} \|\nabla_x \phi(x, u)\|_{2,2} \leq \ell^x_\phi(R_x, R_u) := \min\{\ell^x_\phi,\ \|\nabla_x \phi(0, 0)\|_{2,2} + L^{xx}_\phi R_x + L^{ux}_\phi R_u\}.$$
We then have the following.
Lemma C.6.
Let f be a chain of τ computations φ t ∈ C m φt ,ℓ uφt ,ℓ xφt ,L uuφt ,L xuφt ,L xxφt , initialized at some x such that k x k ≤ R . Let C = N τt =1 B R t ( R p t ) = { u = ( u ; . . . ; u τ ) ∈ R P τt =1 p t : u t ∈ R p t , k u t k ≤ R t } . Define for t ∈ { , . . . , τ } , m t = min { m φ t , k φ t (0 , k + ℓ uφ t ( m t − , R t ) R t + ℓ xφ t ( m t − , R t ) m t − } ,ℓ t = ℓ uφ t ( m t − , R t ) + ℓ t − ℓ xφ t ( m t − , R t ) ,L t = L t − ℓ xφ t ( m t − , R t ) + L xxφ t ℓ t − + ( L xuφ t + L uxφ t ) ℓ t − + L uuφ t . with m = R , ℓ = 0 , L = 0 . We have that m Cf τ ,x ≤ m τ , ℓ Cf τ ,x ≤ ℓ τ , L Cf τ ,x ≤ L τ . Proof.
The result directly follows from Lemma B.1, with the Lipschitz-continuity constants derived in (33).
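In the same spirit as the sketch for Lemma C.4, the recursion of Lemma C.6 can be tabulated directly; the following sketch (our own naming, with placeholder constants) propagates the bounds $(m_t, \ell_t, L_t)$ over the product of balls.

```python
def chain_bounds(layers, R0, Rs):
    # layers: per-layer constants; l_u and l_x are callables implementing the refined bounds (33).
    m, l, L = R0, 0.0, 0.0
    for phi, R_t in zip(layers, Rs):
        l_u, l_x = phi["l_u"](m, R_t), phi["l_x"](m, R_t)
        L = L * l_x + phi["L_xx"] * l ** 2 + (phi["L_xu"] + phi["L_ux"]) * l + phi["L_uu"]
        m = min(phi["m"], phi["phi00"] + l_u * R_t + l_x * m)   # bound on the layer output
        l = l_u + l * l_x                                       # Lipschitz w.r.t. all parameters
    return m, l, L

# A plain bilinear layer phi(x, u) = B[x, u, .] with L_B = 1 (so l_u ~ ||x||, l_x ~ ||u||).
bilinear = {"m": float("inf"), "phi00": 0.0, "l_u": lambda m, R: m, "l_x": lambda m, R: R,
            "L_xx": 0.0, "L_uu": 0.0, "L_xu": 1.0, "L_ux": 1.0}
print(chain_bounds([dict(bilinear)] * 3, R0=1.0, Rs=[1.0, 1.0, 1.0]))
```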
Proof.
The proof relies on Lemma C.6, where the smoothness of the inner compositions are computed according toLemma C.4. Namely, we have ℓ uφ t ( R x , R u ) ≤ ℓ a t ( m b t ( R x , R u )) ℓ ub t ( R x , R u ) , ℓ xφ t ( R x , R u ) ≤ ℓ a t ( m b t ( R x , R u )) ℓ xb t ( R x , R u ) , with ℓ ub t ( R x , R u ) = L b t R x + l ub t , ℓ xb t ( R x , R u ) = L b t R u + l xb t ,m b t ( R x , R u ) = ℓ ub t ( R x , R u ) R u + ℓ xb t ( R x , R u ) R x + k b t (0 , k , and ℓ a t can be computed as in Lemma. C.4. On the other hand, denoting L xxφ t ( R x , R u ) = sup ( x,u ) ∈ B Rx × B Ru k∇ xx φ t ( x, u ) k , , (and similarly for L uuφ t , L uxφ t , L xuφ t ), we have L xxφ t ( R x , R u ) ≤ L a t ( m b t ( R x , R u )) ℓ xb t ( R x , R u ) L uuφ t ( R x , R u ) ≤ L a t ( m b t ( R x , R u )) ℓ ub t ( R x , R u ) L xu ( R x , R u ) = L ux ( R x , R u ) = L b t ℓ a t ( m b t ( R x , R u )) + L a t ( m b t ( R x , R u )) ℓ ub t ( R x , R u ) ℓ xb t ( R x , R u ) , where L a t ( m b t ( R x , R u )) is computed by Lemma C.4. D Smoothness of objectives and layers
D.1 Supervised objectives
For supervised objectives $h: \mathbb{R}^{n d_\tau} \to \mathbb{R}$ that read, for $\hat y = (\hat y_1; \ldots; \hat y_n)$ with $\hat y_i \in \mathbb{R}^{d_\tau}$, $h(\hat y) = \frac{1}{n} \sum_{i=1}^n h_i(\hat y_i)$, we only need to compute the smoothness of each $h_i$ (see Lemma C.1), which is usually defined by a loss $h_i(\hat y_i) = \mathcal{L}(\hat y_i, y_i)$. We are interested in this section in the smoothness $L_h(C)$ and Lipschitz continuity $\ell_h(C)$ of the objective $h$ on a set $C$. We omit the dependency on the set $C$ if the Lipschitz continuity or smoothness properties of the functions hold on their whole domain.
Square loss.
Assume that the labels belong to a compact set $\mathcal{Y}$. The square loss is defined by $h(\hat y) = \mathcal{L}_{\mathrm{sq}}(\hat y, y) = \|\hat y - y\|^2/2$. We have then
$$\ell_{\mathrm{sq}}(C) = \rho_C + \rho_{\mathcal{Y}}, \qquad L_{\mathrm{sq}} = 1,$$
where $\rho_C = \max_{x \in C} \|x\|$ and $\rho_{\mathcal{Y}} = \max_{y \in \mathcal{Y}} \|y\|$.
Logistic loss. Consider $y \in \{0, 1\}^q$; the logistic loss is defined as $h(\hat y) = \mathcal{L}_{\log}(\hat y, y) = -y^\top \hat y + \log\big(\sum_{j=1}^q \exp(\hat y_j)\big)$. We have then, denoting $\exp(\hat y) = (\exp(\hat y_i))_{i=1,\ldots,q}$,
$$\nabla h(\hat y) = -y + \frac{\exp(\hat y)}{\exp(\hat y)^\top \mathbf{1}_q}, \qquad \nabla^2 h(\hat y) = \frac{\operatorname{diag}(\exp(\hat y))}{\exp(\hat y)^\top \mathbf{1}_q} - \frac{\exp(\hat y)\exp(\hat y)^\top}{(\exp(\hat y)^\top \mathbf{1}_q)^2}.$$
Therefore, using that $y \in \{0, 1\}^q$ and that $\|\exp(\hat y)\|_2 \leq \|\exp(\hat y)\|_1$,
$$\ell_{\log} \leq 2, \qquad L_{\log} \leq 2.$$
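These bounds are easy to sanity-check numerically; the sketch below (ours, not part of the derivation, and assuming one-hot labels) evaluates the gradient and Hessian of the logistic loss at random points and verifies that their norms stay below 2.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 10
for _ in range(100):
    yhat = rng.normal(size=q)
    y = np.eye(q)[rng.integers(q)]                 # one-hot label
    p = np.exp(yhat) / np.exp(yhat).sum()          # soft-max probabilities
    grad = -y + p
    hess = np.diag(p) - np.outer(p, p)
    assert np.linalg.norm(grad) <= 2.0
    assert np.linalg.norm(hess, 2) <= 2.0
print("gradient and Hessian norms of the logistic loss are within the stated bounds")
```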
D.2 Unsupervised objectives
For the k-means and spectral clustering objectives, we consider the outputs of the chains of computations to form a matrix $F(\bar x, u) = (f(\bar x^{(1)}, u), \ldots, f(\bar x^{(n)}, u)) \in \mathbb{R}^{q \times n}$, where $q = d_\tau$ and $n$ is the number of samples. The objectives are then $h: \mathbb{R}^{q \times n} \to \mathbb{R}$ and we denote by $Z \in \mathbb{R}^{q \times n}$ their variable. The overall objective is $h(F(\bar x, u))$ for $\bar x = (\bar x^{(1)}; \ldots; \bar x^{(n)})$. We denote by $k$ the number of classes that the unsupervised objective aims to cluster and $\mathcal{Y} = \{Y = (y_1, \ldots, y_n)^\top \in \{0, 1\}^{n \times k} \text{ s.t. } Y \mathbf{1}_k = \mathbf{1}_n\}$.
K-means clustering.
The K-means clustering objective reads
$$h(Z) = \min_{Y \in \mathcal{Y},\, C \in \mathbb{R}^{q \times k}} \sum_{i=1}^n \|C y_i - z_i\|^2$$
for $Z = (z_1, \ldots, z_n) \in \mathbb{R}^{q \times n}$. Minimization in $C$ can be performed analytically such that the problem can be rewritten as
$$h(Z) = \min_{N \in \mathcal{N}} \operatorname{Tr}\big((\mathrm{I}_n - N) Z^\top Z\big),$$
where $\mathcal{N} = \{N = Y(Y^\top Y)^{-1} Y^\top \in \mathbb{R}^{n \times n} \text{ for } Y \in \mathcal{Y},\ Y^\top Y \succ 0\}$ is the set of normalized equivalence matrices.
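The two formulations agree for any fixed assignment, which a short computation confirms; the sketch below (our own, assuming every cluster is non-empty so that $Y^\top Y$ is invertible) evaluates both.

```python
import numpy as np

def kmeans_objective(Z, labels, k):
    # Z: (q, n) matrix of network outputs, labels: cluster index of each column.
    q, n = Z.shape
    Y = np.zeros((n, k)); Y[np.arange(n), labels] = 1.0        # assignment matrix in the set Y
    C = Z @ Y @ np.linalg.inv(Y.T @ Y)                         # optimal centroids for this Y
    direct = ((Z - C @ Y.T) ** 2).sum()                        # sum_i ||C y_i - z_i||^2
    N = Y @ np.linalg.inv(Y.T @ Y) @ Y.T                       # normalized equivalence matrix
    via_trace = np.trace((np.eye(n) - N) @ Z.T @ Z)
    return direct, via_trace                                   # equal up to rounding errors

rng = np.random.default_rng(0)
print(kmeans_objective(rng.normal(size=(4, 12)), np.arange(12) % 3, k=3))
```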
Spectral clustering. A natural relaxation of K-means is spectral clustering, which considers $\mathcal{P} = \{P \in \mathbb{R}^{n \times n} \text{ s.t. } P \succeq 0,\ P^2 = P,\ \operatorname{Rank}(P) = k\} \supset \mathcal{N}$ instead of the set of normalized equivalence matrices. The solution of $h(Z) = \min_{P \in \mathcal{P}} \operatorname{Tr}\big((\mathrm{I}_n - P) Z^\top Z\big)$ is then given by finding the $k$ largest eigenvectors of the Gram matrix $Z^\top Z$. Formally, the objective is written
$$h(Z) = \sum_{i=k+1}^n \sigma_i(Z)^2,$$
where for a matrix $A$, $\sigma_1(A) \geq \ldots \geq \sigma_n(A)$ are the singular values of $A$ in decreasing order. The objective $h$ is then a spectral function of the matrix $Z$.
Convex clustering.
The convex clustering objective reads for ˆ y = (ˆ y ; . . . ; ˆ y n ) ∈ R qn h (ˆ y ) = min y (1) ,...,y ( n ) ∈ R q n X i =1 k y ( i ) − ˆ y ( i ) k + X i 30s the vector corresponding to the group g of size s g . Here the groups are defined by all possible differences for i < j in Eq. (34). Proposition D.1. The convex-clustering objective h (ˆ y ) = min y ∈ R qn k y − ˆ y k + k Dy k G is convex, Lipschitz-continuous and smooth with parameters ℓ cvx-cluster ≤ n ( n − , L cvx-cluster ≤ . Proof. The convex clustering objective h is the Moreau envelope of the function Ω : y → k Dy k G . It is thereforeconvex and -smooth, i.e., L h = 1 . Moreover, the Moreau envelope can be rewritten h (ˆ y ) = sup z ∈ dom(Ω ∗ ) ˆ y ⊤ z − Ω ∗ ( z ) − k z k , where Ω ∗ is the convex conjugate of Ω . Therefore ∇ h (ˆ y ) ∈ dom(Ω ∗ ) . We have that Ω ∗ ( z ) = sup y ∈ R q z ⊤ y − k Dy k G ≥ sup y ∈ R q z ⊤ y − n ( n − k y k , such that the supremum is finite only if k z k > n ( n − . Therefore ∇ h (ˆ y ) ∈ dom(Ω ∗ ) ⊂ B (cid:18) , n ( n − (cid:19) , where B (0 , n ( n − ) is the Euclidean ball centered at with radius n ( n − . D.3 Bilinear and linear layers Vectorized matrix-products as a bilinear operation. Given two matrices A ∈ R n × d and B ∈ R d × p , the matrixproduct AB is defined by a tensor M = ((I d ⊗ e ( q mod n )+1 )( f ⊤⌈ q/n ⌉ ⊗ I d )) q =1 ,...,np ∈ R nd × dp × np where e i is the i th canonical vector in R n and f j is the j th canonical vector in R p such that Vec( AB ) = M [Vec( A ) , Vec( B ) , · ] . (35)This can be checked as for q = i + n ( j − ∈ { , . . . , np } , with i ∈ { , . . . n } , j ∈ { , . . . , p } , Vec( AB ) q = ( AB ) ij = Vec( e ⊤ i A ) ⊤ Vec( Bf j )= Vec( A ) ⊤ (I d ⊗ e i )( f ⊤ j ⊗ I d ) Vec( B )= ( M [Vec( A ) , Vec( B ) , · ]) q . Convolutional layer detailed. For completeness we detail the convolution for an image. For a convolutional layer,the input is an image I ∈ R C × H × B with C channels each composed of a matrix of height H and breadth B theweights are given by ˜ C filters F , . . . , F ˜ C ∈ R C × K × K of patch size K and the biases are given by b ∈ R ˜ C . Theconvolution of the image by a filter F ˜ c , with ˜ c ∈ { , . . . , ˜ C } with additional bias b ˜ c , is given at point i, j as C ˜ c,i,j = C X c =1 hF ˜ c [ c, · , · ] , E ⊤ row,i I [ c, · , · ] E col,j i + b ˜ c , where F ˜ c [ c, · , · ] is the filter of size K × K in channel c of filter F ˜ c and I [ c, · , · ] is the image in channel c .The matrices E row,i ∈ R H × K and E col,j ∈ R B × K extract rows and columns of I [ c, · , · ] . They are bands witha diagonal of K ones centered at positions i or j . If the pattern of the patch is given as P = K ⊤ K , the extractionmatrices read E row,i = e i ⊗ ⊤ K ∈ R H × K , e i ∈ R H for i ∈ { , . . . H } , similarly E col,j = e j ⊗ ⊤ K ∈ R W × K . Theysatisfy E ⊤ row,i E row,i = I K and E row,i E ⊤ row,i ∈ R H × H is a projector. Similarly facts apply for E col,j except that one31eplaces H by B . The output of the convolution with all filters is then a tensor C ∈ R ˜ H × ˜ B × ˜ C where ˜ H and ˜ B dependon the choices of the stride chosen in the convolution. Smoothness of fully-connected layer. For a fully connected layer, the bilinear function β ( x, u ) → U ⊤ X for u =Vec( U ) , x = Vec( X ) is clearly 1-smooth (because k U ⊤ X k F ≤ k U k F k X k F ). The linear part β u is clearly -Lipschitz continuous. So we get L full = 1 , l ufull = 1 . Smoothness of convolutional layer. 
For a convolution, by Lemma C.2, we only need to compute the smoothness ofthe convolution of an image with one filter. This is done by the following Lemma. Lemma D.2. Consider p subsets S k of { , . . . , n } of size | S k | = d . Denote Π k ∈ { , } d × n the linear form thatextracts the S k coordinates of a vector of size n , i.e., Π k z = z S k for z ∈ R n . The convolution of z ∈ R n by w ∈ R d through the p subsets S k defined as β ( z, w ) = ( w ⊤ Π z ; . . . ; w ⊤ Π p z ) is L β = p max i =1 ,...,n | V i | -smooth where V i = { S j : i ∈ S j } .Proof. We have k β ( z, w ) k = p X j =1 ( w ⊤ Π j z ) ≤ p X j =1 k w k k z S j k = k w k d X i =1 X S j ∈ V i z i ≤ k w k max i =1 ,...,n | V i |k z k . Concretely, for a convolution such that at most p patches contain a coordinate i the convolution is √ p -smooth. Ifthe patches do not overlap then the convolution is -smooth. If the convolution has a stride of 1 and the operation isnormalized by the size of the filters then the convolution has again a smoothness constant of 1. Generally for a 2dconvolution with a kernel of size k × k and a stride of s , we have max i =1 ,...,n | V i | = (cid:6) ks (cid:7) and so L conv = (cid:24) ks (cid:25) , l conv = (cid:24) ks (cid:25) . Batch of inputs. For batch of inputs, the smoothness constants of the non-linear and bilinear parts do not change byLemmas C.1 and C.2. The Lipschitz-constant of the linear part of the biaffine function is modified using Lemma C.1item 3. Namely for a batch of size m , the fully connected layers or the convolutional layers have a linear part whoseLipschitz constant is given by l b = √ m . D.4 Activation functions The Lipschitz and smoothness constants of an element-wise activation α t function are defined by the Lipschitz andsmoothness constant of the scalar function ¯ α t from which it is defined. Denote by f ( x ) := log(1 + exp( x )) , we have f ′ ( x ) = (1 + exp( − x )) − , f ′′ ( x ) = (2 + 2 cosh( x )) − , f ′′′ ( x ) = − sinh( x ) / (2(1 + cosh( x ) )) . Soft-plus. For α defined by element-wise application of ¯ α ( x ) = f ( x ) , we get ℓ softplus = 1 , L softplus = 1 / . Sigmoid. For α defined by the element-wise application of ¯ α ( x ) = f ′ ( x ) , we get ℓ sig = 1 / , L sig = 1 / . eLU. For α defined by the element-wise application of ¯ α ( x ) = max(0 , x ) , we get ℓ ReLu = 1 , L ReLu not defined , since the function is not continuously differentiable. Soft-max layer. A soft-max layer takes as input x ∈ R d and outputs f ( x ) = exp( x ) / (exp( x ) ⊤ d ) where exp( x ) isthe element-wise application of exp . Its gradient is given by ∇ f ( x ) = diag (exp(ˆ y ))exp(ˆ y ) ⊤ q − exp(ˆ y ) exp(ˆ y ) ⊤ (exp(ˆ y ) ⊤ q ) . Its second-order information can be computed as for the batch-normalization layer, we get then ℓ softmax = 2 , L softmax = 4 . D.5 Normalization layers Proposition D.3. The batch normalization operation ν batch : R δm → R δm defined as in (11) is(i) bounded by m batch = δm ,(ii) Lipschitz-continuous with a constant ℓ batch = 2 ǫ − / ,(iii) smooth with a constant L batch = 2 δm − / ǫ − .Proof. The batch-normalization layer as defined in (11) is the composition ν = ν ◦ ν of a centering step ν ( x ) = Vec (cid:18) Z − Z m ⊤ m m (cid:19) and a normalization step ν (˜ x ) = Vec diag (cid:18) m diag ( ˜ Z ˜ Z ⊤ ) + ǫ δ (cid:19) − / ! ˜ Z ! 
, where here and thereafter Z, ˜ Z ∈ R δ × m , x = Vec( Z ) , ˜ x = Vec( ˜ Z ) .The centering step is an orthonormal projection, i.e., ν ( x ) = Vec( Z Π m ) = (Π m ⊗ I δ ) x where Π m = I m − m ⊤ m m is an orthonormal projector and so is (Π m ⊗ I m ) . Therefore we have ℓ ν ≤ and L ν = 0 . For the normalizationsstep denote for x ∈ R m , and ¯ x = ( x ; . . . ; x δ ) ∈ R mδ with x i ∈ R m , f ( x ) = r m k x k + ǫ, g ( x ) = (cid:18) x i f ( x ) (cid:19) i =1 ,...,m , ¯ g (¯ x ) = ( g ( x ); . . . ; g ( x δ )) ∈ R mδ , such that ν (˜ x ) = T m,d ¯ g ( T d,m ˜ x ) , where T d,m is the linear operator such that T d,m Vec( Z ) = Vec( Z ⊤ ) for any Z ∈ R d × m . First we have that k ¯ g (¯ x ) k ≤ δ max i ∈{ ,...,d } k g ( x i ) k ≤ δm / , such that m ν ≤ δm / . Then the gradients can be computed as ∇ f ( x ) = xmf ( x ) = g ( x ) m ∈ R m , ∇ g ( x ) = f ( x ) I m −∇ f ( x ) x ⊤ f ( x ) = mf ( x ) I m − xx ⊤ mf ( x ) ∈ R m × m , ∇ ¯ g (¯ x ) = diag ( ∇ g ( x ) , . . . , ∇ g ( x m )) ∈ R md × md , X , . . . X τ ∈ R d × p we denote by diag ( X , . . . , X τ ) = X . . . . . . . . . ...... . . . . . . . . . X τ ∈ R dτ × pτ , the corresponding block diagonal matrix. Therefore we get k∇ g ( x ) k , ≤ mf ( x ) + k x k mf ( x ) ≤ m − k x k + ǫ ( m − k x k , + ǫ ) / ≤ cǫ − / , k∇ ¯ g (¯ x ) k ≤ cǫ − / , where c = 2 / (3 / / ≈ . and we used that the spectral norm of the block-diagonal matrix is given by the maximalspectral norm of its block diagonal components. Since T m,d , T d,m are orthonormal operators, we get ℓ ν ≤ ǫ − / . The second order tensor of g reads ∇ g ( x ) = 3 m f ( x ) x ⊠ x ⊠ x − mf ( x ) m X i =1 x ⊠ e i ⊠ e i + e i ⊠ x ⊠ e i + e i ⊠ e i ⊠ x ! ∈ R m × m × m , ∇ ¯ g (¯ x ) = diag ( ∇ g ( x ) , . . . , ∇ g ( x d )) , where e i ∈ R m is the i th canonical vector in R m and for a sequence of tensors X , . . . , X d we denote by X = diag ( X , . . . , X d ) ∈ R dm × dm × dm the tensor whose diagonal is composed of the tensors X , . . . X d such that X i +( m − p,j +( m − p,k +( m − p = ( X p ) ijk and 0 outside the diagonal. We get then by definition of the tensor norm, k∇ g ( x ) k , , ≤ k x k m f ( x ) + 3 k x k mf ( x ) = 3 k x k ( k x k + mf ( x ) ) m f ( x ) = 3 k x k (2 k x k + mǫ ) m ( m − k x k + ǫ ) / ≤ m − / √ c (2 c + 1)( c + 1) / ( mǫ ) − , where c = (1 + √ / such that √ c (2 c +1)( c +1) / ≈ . . Therefore we get k∇ ¯ g (¯ x ) k , , ≤ δ max i ∈{ ,...,δ } k∇ g ( x i ) k , , and L ν ≤ δm − / ǫ − . D.6 Pooling layers We consider pooling layers for which the patches do not coincide such that they amount to a (potentially non-linear)projection. Average pooling. The average pooling layer is a linear operation. If the patches do not coincide, it is a projectionwith Lipschitz constant one. ℓ avg = 1 , L avg = 0 . Max-pooling. Given an image I ∈ R C × H × B with C channels each composed of a matrix of height H and breadth B , the max pooling layer extracts n p patches of the form P i,j = E ⊤ row,i I [ c, · , · ] E col,j where E row,i ∈ R H × K and E col,j ∈ R B × K extract rows and columns of I [ c, · , · ] respectively. On each of this patch their maximum value is takenas the output, namely, the output image reads ˜ I c,i,j = max k,l P i,jk,l . It is naturally non-continuously differentiable and34t is -Lipschitz continuous if the patches do not coincide. ℓ maxpool = 1 L maxpool not defined . D.7 Auto-encoders, composition of chains of computations For τ vectors ( u ; . . . ; u τ ) ∈ R P τt =1 p t and ≤ s ≤ t ≤ τ , we denote u s : t = ( u s ; . . . ; u t ) ∈ R P tr = s p r . 
For τ functions φ t : R d t − × R p t → R d t , we can split the chain of computations of the τ functions φ t into smaller chains ofcomputations. Namely, for ≤ s ≤ t ≤ τ , we denote the output of the chain of computations defined by φ s , . . . φ t as φ s → t ( x s − , u s : t ) = x t s.t. x r = φ r ( x r − , u r ) for r ∈ { s, . . . , t } . In particular, we have φ t = φ t → t . The output of the chain of computations of the τ fucntions φ t can then be split as φ → τ ( x , u τ ) = φ t +1 → τ ( φ → t ( x , u t ) , u t +1: τ ) for any t ∈ { , . . . τ − } . On the other hand, the composition of two chains of computations can readily be seen as a chain of computations.Namely, for two chains of computations f and g with computations ( φ ft ) τ f t =1 and ( ψ gt ) τ g t =1 , parameters u and v respec-tively, the composition of f and g is h ( x , u ) = g ( f ( x , u ) , v ) . It is a chain of τ f + τ g computations χ t = ( φ t for t ∈ { , . . . , τ f } ψ t − τ f for t ∈ { τ f + 1 , τ f + τ g } , with input x and parameters w = ( u ; v ) ∈ R P τft =1 p ft + P τgt =1 p gt such that w t = ( u t for t ∈ { , . . . , τ f } v t − τ f for t ∈ { τ f + 1 , τ f + τ g } . D.8 Residual Networks Recall the architecture of a residual network x t = a t ( b t ( x t − , u t ) + x t − ) for t = 1 , . . . , τx = x, x − = 0 , where we assume b t : R d t − × R p t → R η t such that x t − ∈ R d t − , x t − ∈ R η t and b t ( x t − , u t ) = B t [ x t − , u t , · ] + B ut u t + B xt x t − + β t , where B = ( B t, , . . . , B t,η t ) is a tensor. They can be expressed in terms of the variable ¯ x t = ( x t , x t − ) as ¯ φ t (¯ x t − , u t ) = ¯ a t (¯ b t (¯ x t − , u t )) , (36)35here ¯ b t is defined as ¯ b t (¯ x t − , u t ) = ¯ β t (¯ x t − , u t ) + ¯ β tu ( u t ) + ¯ β tx (¯ x t − ) + ¯ β t = ¯ B t [¯ x t − , u t , · ] + ¯ B ut u t + ¯ B xt ¯ x t − + ¯ β t , ¯ B t = ( ¯ B t, , . . . , ¯ B t,η t , p t × ( d t − + η t ) , . . . , p t × ( d t − + η t ) | {z } d t − ) , ¯ B t,j = (cid:0) B t,j , p t × η t (cid:1) for j ∈ { , . . . , d t − } , ¯ B ut = (cid:18) B ut d t − × p t (cid:19) , ¯ B xt = (cid:18) B xt I η t I d t − d t − × η t (cid:19) , ¯ β t = (cid:18) β t d t − (cid:19) . Denoting ¯ ω t = ( ω t, , ω t, ) = ¯ b t (¯ x t − , u t ) , we have ¯ a t (¯ ω t ) = ( a ( ω t, ) , ω t, ) . We can derive the smoothness constants of the layers of a residual network expressed as in (36) as L ¯ β t = L β t , l ¯ β ut = l β ut , l ¯ β xt ≤ l β xt + 1 , k ¯ β t k = k β t k ,m ¯ a t ≤ (1 + m a t ) , ℓ ¯ a t ≤ max(1 , ℓ a t ) , L ¯ a t = L a t . Proposition 4.5 can then be applied in this setting. D.9 Implicit functions The smoothness constants of an implicit function are given in the following lemma. They can easily be refined byconsidering smoothness properties w.r.t. to each of the variables α and β of the function ζ defining the problem. Lemma D.4. Let ζ : ( α, β ) → ζ ( α, β ) ∈ R for α ∈ R a , β ∈ R b be s.t. ζ ( α, · ) is µ ζ -strongly convex for any α .Denote g ( α ) = arg min β ∈ R b ζ ( α, β ) . Provided that ζ has a L ζ -Lipschitz gradient and a H ζ -Lipschitz Hessian, thesmoothness constants of g are bounded as ℓ g ≤ L ζ µ − ζ , L g ≤ H ζ µ − ζ (1 + ℓ g )(1 + L ζ µ − ζ ) ≤ H ζ µ − ζ (1 + L ζ µ − ζ ) . Proof. By the implicit function theorem, g ( α ) is uniquely defined and its gradient is given by ∇ g ( α ) = −∇ α ξ ( α, g ( α )) ∇ β ξ ( α, g ( α )) − = −∇ α,β ζ ( α, g ( α )) ∇ β,β ζ ( α, g ( α )) − , where ξ ( α, β ) = ∇ β ζ ( α, β ) . The Lipschitz constant of g follows from that. 
For the smoothness we compute itssecond order information and bound the corresponding tensors. Note that the same results can be obtained by simplysplitting the functions in appropriate terms. We have ∇ g ( α ) = h ( α, g ( α )) , where h ( α, β ) = −∇ α ξ ( α, β ) ∇ β ξ ( α, β ) − = −∇ α,β ζ ( α, β ) ∇ β,β ζ ( α, β ) − . Using Lemma D.5, we get ∇ g ( α ) = ∇ α h ( α, g ( α )) + ∇ β h ( α, g ( α ))[ ∇ g ( α ) , · , · ]= − ∇ αα ξ ( α, g ( α ))[ · , · , ∇ β ξ ( α, g ( α )) − ] − ∇ ξ αβ ( α, g ( α ))[ · , ∇ β ξ ( α, g ( α )) − ∇ α ξ ( α, g ( α )) ⊤ , ∇ β ξ ( α, g ( α )) − ] − ∇ βα ξ ( α, g ( α ))[ ∇ g ( α ) , · , ∇ β ξ ( α, g ( α )) − ] − ∇ ξ ββ ( α, g ( α ))[ ∇ g ( α ) , ∇ β ξ ( α, g ( α )) − ∇ α ξ ( α, g ( α )) ⊤ , ∇ β ξ ( α, g ( α )) − ] . ξ correspond to third derivativesof ζ , whose norms are bounded by H ζ by assumption. Moreover we have that k∇ β ξ ( α, g ( α )) − k = k∇ β,β ζ ( α, β ) − k ≤ µ − ζ by assumption.The approximation error of the gradient when using an approximate minimizer inside the expression of the gradientis provided in the following lemma. It follows from smoothness considerations. Lemma 2.2. Let ζ : ( α, β ) → ζ ( α, β ) ∈ R for α ∈ R a , β ∈ R b be s.t. ζ ( α, · ) is µ ζ -strongly convex for any α and denote ξ ( α, β ) = ∇ β ζ ( α, β ) . Denote g ( α ) = arg min β ∈ R b ζ ( α, β ) and ˆ g ( α ) ≈ arg min β ∈ R b ζ ( α, β ) be anapproximate minimizer. Provided that ζ has a L ζ -Lipschitz gradient and a H ζ -Lipschitz Hessian, the approximationerror of using b ∇ ˆ g ( α ) = −∇ α ξ ( α, ˆ g ( α )) ∇ β ξ ( α, ˆ g ( α )) − instead of ∇ g ( α ) is bounded as k b ∇ ˆ g ( α ) − ∇ g ( α ) k ≤ H ζ µ − ζ (1 + L ζ µ − ζ ) k ˆ g ( α ) − g ( α ) k . Proof. Denote h ( α, β ) = −∇ α ξ ( α, β ) ∇ β ξ ( α, β ) − = −∇ α,β ζ ( α, β ) ∇ β,β ζ ( α, β ) − such that b ∇ ˆ g ( α ) = h ( α, ˆ g ( α )) and ∇ g ( α ) = h ( α, g ( α )) . The approximation error is given by computing the smoothness constant of h ( α, · ) for any α . We bound the gradient of h ( α, · ) (same results can be obtained by considering differences of the functions). FromLemma D.5, we have ∇ β h ( α, β ) = −∇ βα ξ ( α, β )[ · , · , ∇ β ξ ( α, β ) − ] − ∇ ξ ββ ( α, β )[ · , ∇ β ξ ( α, β ) − ∇ α ξ ( α, g ( α )) ⊤ , ∇ β ξ ( α, β ) − ] . The result follows by using Facts A.4 and A.3. We observe that second derivatives of ξ correspond to third derivativesof ζ , whose norms are bounded by H ζ by assumption. Moreover we have that k∇ β ξ ( α, g ( α )) − k = k∇ β,β ζ ( α, β ) − k ≤ µ − ζ by assumption. Lemma D.5. Let ξ : ( α, β ) → ξ ( α, β ) ∈ R b for α ∈ R a , β ∈ R b such that ∇ β ξ ( α, β ) ∈ R b × b is positive definite forall α ∈ R a , β ∈ R b . Denoting h ( α, β ) = ∇ α ξ ( α, β ) ∇ β ξ ( α, β ) − ∈ R a × b we have ∇ α h ( α, β ) = ∇ αα ξ ( α, β )[ · , · , ∇ β ξ ( α, β ) − ] + ∇ ξ αβ ( α, β )[ · , ∇ β ξ ( α, β ) − ∇ α ξ ( α, β ) ⊤ , ∇ β ξ ( α, β ) − ] , ∇ β h ( α, β ) = ∇ βα ξ ( α, β )[ · , · , ∇ β ξ ( α, β ) − ] + ∇ ξ ββ ( α, β )[ · , ∇ β ξ ( α, β ) − ∇ α ξ ( α, β ) ⊤ , ∇ β ξ ( α, β ) − ] . Proof. This follows from the product rule, Fact A.5, Lemma D.6 and Fact A.1. Lemma D.6. Let g : R d → S ++ n be differentiable and h ( x ) = ( g ( x )) − . Then ∇ h ( x ) = ∇ g ( x )[ · , g ( x ) − , g ( x ) − ] .Proof. Let x ∈ R d and δ ∈ R d . Consider first d = 1 , such that ∇ g ( x ) ∈ R n × n . h ( x + δ ) = ( g ( x ) + δ ∇ g ( x ) + o ( δ )) − = g ( x ) − − δg ( x ) − ∇ g ( x ) g ( x ) − + o ( δ ) . So in this case ∇ h ( x ) = g ( x ) − ∇ g ( x ) g ( x ) − ∈ R n × n . 
The result follows for n = d by concatenating this resultin a tensor such that for d > , ∇ h ( x ) = ∇ g ( x )[ · , g ( x ) − , g ( x ) − ] . Alternatively it can directly be seen from thefollowing first order approximation for d > , h ( x + δ ) = ( g ( x ) + ∇ g ( x )[ δ, · , · ] + o ( k δ k )) − = g ( x ) − − g ( x ) − ∇ g ( x )[ δ, · , · ] g ( x ) − + o ( k δ k ) . E Optimization complexity proofs E.1 Smoothness of the objective Proposition 4.2. Consider a closed convex set C ⊂ R p , ψ ∈ C Cm Cψ ,ℓ Cψ ,L Cψ , r ∈ C L r and h ∈ C ℓ h ,L h with ℓ h = + ∞ if h is not Lipschitz-continuous. The smoothness of F = h ◦ ψ + r on C is bounded as L CF ≤ L Cψ ˜ ℓ Ch + (cid:0) ℓ Cψ (cid:1) L h + L r , here ˜ ℓ Ch = min { ℓ h , min z ∈ ψ ( C ) k∇ h ( z ) k + L h ℓ Cψ D C } , where D C = sup x,y ∈ C k x − y k .Proof. Consider h, f to be twice differentiable. Same results can be obtained by considering differences of gradients.We get for u ∈ R p , ∇ ( h ◦ f )( u ) = ∇ f ( u )[ · , · , ∇ h ( f ( u ))] + ∇ f ( u ) ∇ h ( f ( u )) ∇ f ( u ) ⊤ . The norm of ∇ h ( f ( u )) can either be directly bounded by ℓ h or by using that for any u, u ′ ∈ C , k∇ h ( f ( u )) k ≤k∇ h ( f ( u ′ )) k + L h k f ( u ) − f ( u ′ ) k . By choosing u ′ ∈ arg min u ∈ C k∇ h ( f ( u )) k and bounding the second term bythe diameter of C , we get a bound on sup u ∈ C k∇ h ( f ( u )) k . The result follows using Fact. A.4 and the definitions ofthe norms used to bound ℓ f , L f for a given function f . Corollary 4.6. Consider a chain f of τ computations as defined in Prop. 4.5 and u ∗ = ( u ∗ ; . . . , u ∗ τ ) ∈ R p . Thesmoothness properties of f on C ′ = { u = ( u ; . . . ; u τ ) ∈ R p : ∀ t ∈ { , . . . , τ } , k u t − u ∗ t k ≤ R ′ t } are given as inProp. 4.5 by considering R ′ t in place of R t ,l β xt + L β t k u ∗ t k in place of l β xt , k β t k + l β ut k u ∗ t k in place of k β t k . Proof. The smoothness properties of f x on C ′ are given by considering ˆ f x (∆) = f x ( u ∗ + ∆) = f x ( u ) where ∆ = u − u ∗ with k ∆ t k ≤ R ′ . The shifted chain of computations is given by ˆ f x ,t (∆) = a t ( b t ( ˆ f x ,t − (∆) , u ∗ t + ∆ t )) This means that ˆ f x (∆) is a chain of compositions defined by the same non-linearities a t and bi-affine functions ˆ b t modified as ˆ b t ( x t − , ∆) = b t ( x t − , u ∗ t + ∆ t ) = β t ( x t − , ∆ t ) + β ut (∆ t ) + ˆ β xt ( x t − ) + ˆ β t , where ˆ β xt ( x t − ) = β xt ( x t − ) + β t ( u ∗ t , x t − ) ˆ β t = β t + β ut ( u ∗ t ) . F Detailed network VGG network. The VGG Network is a benchmark network for image classification with deep networks. Theobjective is to classify images among classes. Its architecture is composed of 16 layers described below. Wedrop the dependency to the layers in their detailed formulation. We precise the number of patches n p of the pooling orconvolution operation, which, multiplied by the number of filters n f gives the output dimension of these operations.For a fully connected layer we precise the output dimension δ out .0. x i ∈ R n p n f with n p = 224 × and n f = 3 ,1. φ ( x, u ) = α ReLu ( b conv ( x, u )) with n p conv = 224 × , n f conv = 64 ,2. φ ( x, u ) = π maxpool ( α ReLu ( b conv ( x, u ))) with n p conv = 224 × , n f conv = 64 , n p maxpool = 112 × , n f maxpool = 64 ,3. φ ( x, u ) = α ReLu ( b conv ( x, u )) with n p conv = 112 × , n f conv = 128 φ ( x, u ) = π maxpool ( α ReLu ( b conv ( x, u ))) with n p conv = 112 × , n f conv = 128 , n p maxpool = 56 × , n f maxpool = 128 ,5. 
φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 56 × 56, n_f conv = 256,
6. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 56 × 56, n_f conv = 256,
7. φ(x, u) = π_maxpool(α_ReLU(b_conv(x, u))) with n_p conv = 56 × 56, n_f conv = 256, n_p maxpool = 28 × 28, n_f maxpool = 256,
8. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 28 × 28, n_f conv = 512,
9. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 28 × 28, n_f conv = 512,
10. φ(x, u) = π_maxpool(α_ReLU(b_conv(x, u))) with n_p conv = 28 × 28, n_f conv = 512, n_p maxpool = 14 × 14, n_f maxpool = 512,
11. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 14 × 14, n_f conv = 512,
12. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 14 × 14, n_f conv = 512,
13. φ(x, u) = π_maxpool(α_ReLU(b_conv(x, u))) with n_p conv = 14 × 14, n_f conv = 512, n_p maxpool = 7 × 7, n_f maxpool = 512,
14. φ(x, u) = α_ReLU(b_full(x, u)) with δ_out = 4096,
15. φ(x, u) = α_ReLU(b_full(x, u)) with δ_out = 4096,
16. φ(x, u) = α_softmax(b_full(x, u)) with δ_out = 1000,
17. h(ŷ) = Σ_{i=1}^n L_log(ŷ_i, y_i)/n for k = 1000 classes.
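To close the loop with Sec. 4, the per-layer constants listed in this appendix can be chained numerically. The rough sketch below (with placeholder radii and input bound, and pooling layers folded into the 1-Lipschitz activations) only illustrates the bookkeeping behind the VGG estimate, not the exact constants of the paper.

```python
import math

def conv_smoothness(k=3, s=1):
    # Smoothness constant of a 2-d convolution with a k x k kernel and stride s (Lemma D.2).
    return math.ceil(k / s)

# (bilinear smoothness L_b, activation Lipschitz constant l_a) per layer of the chain above.
layers = [(conv_smoothness(), 1.0)] * 13 + [(1.0, 1.0)] * 2 + [(1.0, 2.0)]

R_u = 1.0                 # hypothetical bound on each layer's parameters ||u_t||
m_x, lip = 1.0, 0.0       # hypothetical bound on the input norm, running Lipschitz bound
for L_b, l_a in layers:
    l_u = l_a * L_b * m_x           # sensitivity to the layer's own parameters
    l_x = l_a * L_b * R_u           # sensitivity to the layer's input
    lip = l_u + lip * l_x           # recursion of Lemma 4.3 / Prop. 4.5
    m_x = l_u * R_u                 # crude bound on the norm of the next layer's input
print(f"Lipschitz bound w.r.t. all parameters: {lip:.3g}")
```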