An Elementary Approach to Convergence Guarantees of Optimization Algorithms for Deep Networks
Vincent Roulet, Zaid Harchaoui
Department of Statistics, University of Washington, Seattle, USA
Abstract
We present an approach to obtain convergence guarantees of optimization algorithms for deep networks based on elementary arguments and computations. The convergence analysis revolves around the analytical and computational structures of optimization oracles central to the implementation of deep networks in modern machine learning software. We provide a systematic way to compute estimates of the smoothness constants that govern the convergence behavior of first-order optimization algorithms used to train deep networks. Diverse examples related to modern deep networks are interspersed within the text to illustrate the approach.
Deep networks have achieved remarkable performance in several application domains such as computer vision, natural language processing and genomics [15, 20, 7]. The input-output mapping implemented by a deep neural network is a chain of compositions of modules, where each module is typically a composition of a non-linear mapping, called an activation function, and an affine mapping. The last module in the chain is usually task-specific in that it relates to a performance accuracy for a specific task. This module can be expressed either explicitly in analytical form as in supervised classification or implicitly as a solution of an optimization problem as in dimension reduction or unsupervised clustering.

The optimization problem arising when training a deep network is often framed as a non-convex optimization problem, dismissing the structure of the objective that is yet central to the software implementation. Indeed, optimization algorithms used to train deep networks proceed by making calls to first-order (or second-order) oracles relying on dynamic programming, such as gradient back-propagation [25, 22, 16, 5, 2, 23, 9]. Gradient back-propagation is now part of modern machine learning software [1, 19]. We highlight here the elementary yet important fact that the chain-compositional structure of the objective naturally emerges through the smoothness constants governing the convergence guarantee of a gradient-based optimization algorithm. This provides a reference frame to relate the network architecture and the convergence rate through the smoothness constants. This also brings to light the benefit of specific modules popular among practitioners to improve the convergence.

In Sec. 2, we define the parameterized input-output map implemented by a deep network as a chain of compositions of modules and write the corresponding optimization objective consisting in learning the parameters of this map. In Sec. 3, we detail the implementation of first-order and second-order oracles by dynamic programming; the classical gradient back-propagation algorithm is recovered as a canonical example. Gauss-Newton steps can also be simply stated in terms of calls to an automatic-differentiation oracle implemented in modern machine learning software libraries. In Sec. 4, we present the computation of the smoothness constants of a chain of computations given its components and the resulting convergence guarantees for gradient descent. Finally, in Sec. 5, we present the application of the approach to derive the smoothness constants for the VGG architecture and illustrate how our approach can be used to identify the benefits of batch-normalization [24, 13]. In the Appendix, we estimate the smoothness constants related to the VGG architecture and we investigate batch-normalization in the light of our approach [24, 13]. All the proofs and the notations are also provided in the Appendix.

Figure 1: Deep network architecture.
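The chain-compositional structure described above can be made concrete in a few lines of code. The sketch below, written in plain NumPy purely for illustration, builds a toy feed-forward network in which each layer applies an affine map followed by an element-wise non-linearity, matching the decomposition $\phi_t(x_{t-1}, u_t) = a_t(b_t(x_{t-1}, u_t))$ used throughout the paper; the layer sizes and the ReLU activation are arbitrary choices, not part of the paper's setup.

```python
import numpy as np

def relu(z):
    # Element-wise activation a_t.
    return np.maximum(z, 0.0)

def layer(x_prev, W, c):
    # One layer phi_t: affine map b_t followed by the activation a_t.
    return relu(W @ x_prev + c)

def chain(x0, params):
    # f(x0, u): compose the layers and return all intermediate states x_1, ..., x_tau.
    states = [x0]
    for (W, c) in params:
        states.append(layer(states[-1], W, c))
    return states

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dims = [4, 8, 8, 3]  # d_0, ..., d_tau (toy sizes)
    params = [(rng.standard_normal((dims[t + 1], dims[t])), np.zeros(dims[t + 1]))
              for t in range(len(dims) - 1)]
    x0 = rng.standard_normal(dims[0])
    x_tau = chain(x0, params)[-1]   # network output f_tau(x0, u)
    print(x_tau.shape)              # (3,)
```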
A feed-forward deep network of depth $\tau$ can be described as a transformation of an input $x_0$ into an output $x_\tau$ through the composition of $\tau$ blocks, called layers, illustrated in Fig. 1. Each layer is defined by a set of parameters. In general (see Sec. 2.3 for a detailed decomposition), these parameters act on the input of the layer through an affine operation followed by a non-linear operation. Formally, the $t$-th layer can be described as a function of its parameters $u_t$ and a given input $x_{t-1}$ that outputs $x_t$ as
$$x_t = \phi_t(x_{t-1}, u_t) = a_t(b_t(x_{t-1}, u_t)), \qquad (1)$$
where $b_t$ is generally linear in $u_t$ and affine in $x_{t-1}$, and $a_t$ is non-linear.

Learning a deep network consists in minimizing, with respect to its parameters, an objective involving $n$ inputs $\bar x^{(1)}, \dots, \bar x^{(n)} \in \mathbb{R}^{\delta}$. Formally, the problem is written
$$\min_{(u_1, \dots, u_\tau) \in \mathbb{R}^{p_1} \times \dots \times \mathbb{R}^{p_\tau}} \; h\big(x_\tau^{(1)}, \dots, x_\tau^{(n)}\big) + r(u_1, \dots, u_\tau)$$
$$\text{subject to } x_t^{(i)} = \phi_t\big(x_{t-1}^{(i)}, u_t\big) \text{ for } t = 1, \dots, \tau, \; i = 1, \dots, n, \qquad x_0^{(i)} = \bar x^{(i)} \text{ for } i = 1, \dots, n, \qquad (2)$$
where $u_t \in \mathbb{R}^{p_t}$ is the set of parameters at layer $t$, whose dimension $p_t$ can vary among layers, and $r$ is a regularization on the parameters of the network.

We are interested in the influence of the architecture on the optimization complexity of the problem. The architecture translates into a structure of the chain of computations involved in the optimization problem.

Definition 2.1.
A chain of $\tau$ computations $\phi_t : \mathbb{R}^{d_{t-1}}\times\mathbb{R}^{p_t}\to\mathbb{R}^{d_t}$ is defined as $f : \mathbb{R}^{d_0}\times\mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}^{\sum_{t=1}^\tau d_t}$ such that for $x_0\in\mathbb{R}^{d_0}$ and $u = (u_1;\dots;u_\tau)\in\mathbb{R}^{\sum_{t=1}^\tau p_t}$ we have $f(x_0, u) = (f_1(x_0,u);\dots;f_\tau(x_0,u))$ with
$$f_t(x_0, u) = \phi_t(f_{t-1}(x_0,u), u_t) \quad \text{for } t = 1,\dots,\tau, \qquad (3)$$
and $f_0(x_0,u) = x_0$. We denote $f_{t,x_0}(u) = f_t(x_0,u)$ and $f_{t,u}(x_0) = f_t(x_0,u)$.

Denoting then by $f$ the chain of computations associated to the layers of a deep network and considering the concatenation of the transformations of each input as a single transformation, i.e., $f_t(\bar x, u) = (f_t(\bar x^{(1)}, u);\dots;f_t(\bar x^{(n)}, u))$ for $t\in\{0,\dots,\tau\}$ and $\bar x = (\bar x^{(1)};\dots;\bar x^{(n)})$, the objective in (2) can be written as
$$\min_{u\in\mathbb{R}^{\sum_{t=1}^\tau p_t}} \; h(f_\tau(\bar x, u)) + r(u), \qquad (4)$$
where $f_\tau : \mathbb{R}^{n d_0}\times\mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}^{n d_\tau}$ is the output of a chain of $\tau$ computations with $d_0 = \delta$, $r : \mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}$ is typically a decomposable differentiable function such as $r(u) = \lambda\sum_{t=1}^\tau\|u_t\|_2^2$ for $\lambda\ge 0$, and we present examples of learning objectives $h : \mathbb{R}^{n d_\tau}\to\mathbb{R}$ below. Assumptions on differentiability and smoothness of the objective are detailed in Sec. 4.

2.2 Objectives

In the following, we consider the output of the chain of computations on the $n$ samples to be given as $\hat y = (\hat y^{(1)};\dots;\hat y^{(n)}) = (f_\tau(\bar x^{(1)}, u);\dots;f_\tau(\bar x^{(n)}, u)) = f_\tau(\bar x, u)$ for $\bar x = (\bar x^{(1)};\dots;\bar x^{(n)})$.

Supervised learning.
For supervised learning, the objective can be decomposed as a finite sum
$$h(\hat y) = \frac{1}{n}\sum_{i=1}^n h^{(i)}(\hat y^{(i)}), \qquad (5)$$
where the $h^{(i)}$ are losses on the labels predicted by the chain of computations, i.e., $h^{(i)}(\hat y^{(i)}) = L(\hat y^{(i)}, y^{(i)})$ with $y^{(i)}$ the label of $\bar x^{(i)}$, and $L$ is a given loss such as the squared loss or the logistic loss (see Appendix D.1).

Unsupervised learning.
In unsupervised learning tasks the labels are unknown. The objective itself is defined through a minimization problem rather than through an explicit loss function. For example, a convex clustering objective [11] is written
$$h(\hat y) = \min_{y^{(1)},\dots,y^{(n)}\in\mathbb{R}^q} \; \sum_{i=1}^n \|y^{(i)} - \hat y^{(i)}\|_2^2 + \sum_{i<j} \|y^{(i)} - y^{(j)}\|_2,$$
which seeks centroids $y^{(i)}$ close to the outputs while being fused together by the second penalty.
Lemma 2.2.
Let $\zeta : (\alpha,\beta)\mapsto\zeta(\alpha,\beta)\in\mathbb{R}$ for $\alpha\in\mathbb{R}^a$, $\beta\in\mathbb{R}^b$ be such that $\zeta(\alpha,\cdot)$ is $\mu_\zeta$-strongly convex for any $\alpha$, and denote $\xi(\alpha,\beta) = \nabla_\beta\zeta(\alpha,\beta)$. Denote $g(\alpha) = \arg\min_{\beta\in\mathbb{R}^b}\zeta(\alpha,\beta)$ and let $\hat g(\alpha) \approx \arg\min_{\beta\in\mathbb{R}^b}\zeta(\alpha,\beta)$ be an approximate minimizer. Provided that $\zeta$ has an $L_\zeta$-Lipschitz gradient and an $H_\zeta$-Lipschitz Hessian, the approximation error of using $\widehat\nabla\hat g(\alpha) = -\nabla_\alpha\xi(\alpha,\hat g(\alpha))\,\nabla_\beta\xi(\alpha,\hat g(\alpha))^{-1}$ instead of $\nabla g(\alpha)$ is bounded as
$$\|\widehat\nabla\hat g(\alpha) - \nabla g(\alpha)\| \le H_\zeta\,\mu_\zeta^{-1}\big(1 + L_\zeta\,\mu_\zeta^{-1}\big)\,\|\hat g(\alpha) - g(\alpha)\|.$$

For each class of optimization algorithms considered (gradient descent, Gauss-Newton, Newton), we define the appropriate optimization oracle called at each step of the optimization algorithm, which can be efficiently computed through a dynamic programming procedure. For a gradient step, we retrieve the gradient back-propagation algorithm. The gradient back-propagation algorithm then forms the basis of automatic-differentiation procedures.
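Lemma 2.2 can be checked numerically on a toy problem. The hypothetical NumPy sketch below takes a strongly convex quadratic $\zeta(\alpha,\beta) = \tfrac{1}{2}\|\beta - A\alpha\|^2 + \tfrac{\mu}{2}\|\beta\|^2$, for which the minimizer $g(\alpha)$ and its gradient are known in closed form, and compares $\nabla g(\alpha)$ with the implicit-differentiation formula $-\nabla_\alpha\xi(\alpha,\beta)\,\nabla_\beta\xi(\alpha,\beta)^{-1}$ evaluated after running a few gradient steps to obtain an approximate minimizer; the specific quadratic and sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a_dim, b_dim, mu = 3, 4, 0.5
A = rng.standard_normal((b_dim, a_dim))
alpha = rng.standard_normal(a_dim)

# zeta(alpha, beta) = 0.5*||beta - A alpha||^2 + 0.5*mu*||beta||^2 is mu-strongly convex in beta.
# Exact minimizer and its gradient (transpose of the Jacobian, as in the paper):
g_exact = A @ alpha / (1.0 + mu)
grad_g_exact = A.T / (1.0 + mu)              # shape (a_dim, b_dim)

# Approximate minimizer obtained by a few gradient steps on beta.
beta = np.zeros(b_dim)
for _ in range(50):
    beta -= 0.5 * ((1.0 + mu) * beta - A @ alpha)   # step on xi = grad_beta zeta

# Implicit-differentiation estimate of Lemma 2.2: -grad_alpha xi . (grad_beta xi)^{-1}.
# Here the gradients of xi do not depend on beta (the Hessian of zeta is constant, H_zeta = 0),
# so the estimate matches the exact gradient, consistently with the bound of the lemma.
grad_alpha_xi = -A.T                          # shape (a_dim, b_dim)
grad_beta_xi = (1.0 + mu) * np.eye(b_dim)     # shape (b_dim, b_dim)
grad_g_hat = -grad_alpha_xi @ np.linalg.inv(grad_beta_xi)

print(np.allclose(grad_g_hat, grad_g_exact))  # True
```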
In the following, we use the notations presented in Sec. A for gradients, Hessians and tensors. Briefly, $\nabla f(x)$ denotes the gradient of a function $f$ at $x$, which, if $f : \mathbb{R}^p\to\mathbb{R}^d$ is multivariate, is the transpose of the Jacobian, i.e., $\nabla f(x)\in\mathbb{R}^{p\times d}$. For a multivariate function $f : \mathbb{R}^p\to\mathbb{R}^d$, its second-order information at $x$ is represented by a tensor $\nabla^2 f(x)\in\mathbb{R}^{p\times p\times d}$, and we denote for example $\nabla^2 f(x)[y,y,\cdot] = (y^\top\nabla^2 f^{(1)}(x)y;\dots;y^\top\nabla^2 f^{(d)}(x)y)\in\mathbb{R}^d$. For a function $f:\mathbb{R}^p\to\mathbb{R}^d$, we define, provided that $\nabla f(x)$ and $\nabla^2 f(x)$ are defined,
$$\ell_f^x(y) = \nabla f(x)^\top y, \qquad q_f^x(y) = \nabla f(x)^\top y + \tfrac{1}{2}\nabla^2 f(x)[y,y,\cdot], \qquad (14)$$
such that the linear and quadratic approximations of $f$ around $x$ are $f(x+y)\approx f(x) + \ell_f^x(y)$ and $f(x+y)\approx f(x) + q_f^x(y)$ respectively.

We consider optimization oracles as procedures that compute either the next step of an optimization method or a descent direction along which the next step of an optimization method is taken. Formally, the optimization oracles for an objective $f$ are defined by a model $m_f^u$ that approximates the objective around the current point $u$ as $f(u+v)\approx f(u) + m_f^u(v)$. The models can be minimized with an additional proximal term that ensures that the minimizer lies in a region where the model approximates the objective well, as
$$v_\gamma^* = \arg\min_{v\in\mathbb{R}^p}\; m_f^u(v) + \tfrac{1}{2\gamma}\|v\|_2^2, \qquad u^{\text{new}} = u + v_\gamma^*.$$
The parameter $\gamma$ acts as a stepsize that controls how large the step should be (the smaller the $\gamma$, the smaller the $v_\gamma^*$). Alternatively, the model can be minimized directly, providing a descent direction along which the next iterate is taken as
$$v^* = \arg\min_{v\in\mathbb{R}^p}\; m_f^u(v), \qquad u^{\text{new}} = u + \gamma v^*,$$
where $\gamma$ is found by a line-search using e.g. an Armijo condition [18].

At a point $u\in\mathbb{R}^p$, given a regularization $\kappa$, for an objective of the form $h\circ\psi + r : \mathbb{R}^p\to\mathbb{R}$,
(i) a gradient oracle is defined as $v^* = \arg\min_{v\in\mathbb{R}^p} \ell_{h\circ\psi}^u(v) + \ell_r^u(v) + \tfrac{\kappa}{2}\|v\|_2^2$, (15)
(ii) a (regularized) Gauss-Newton oracle is defined as $v^* = \arg\min_{v\in\mathbb{R}^p} q_h^{\psi(u)}(\ell_\psi^u(v)) + q_r^u(v) + \tfrac{\kappa}{2}\|v\|_2^2$, (16)
(iii) a (regularized) Newton oracle is defined as $v^* = \arg\min_{v\in\mathbb{R}^p} q_{h\circ\psi}^u(v) + q_r^u(v) + \tfrac{\kappa}{2}\|v\|_2^2$. (17)

Proposition 3.1.
Let $f$ be a chain of $\tau$ computations $\phi_t : \mathbb{R}^{d_{t-1}}\times\mathbb{R}^{p_t}\to\mathbb{R}^{d_t}$, $u = (u_1;\dots;u_\tau)$ and $x_0\in\mathbb{R}^{d_0}$. Denote $\psi = f_{x_0,\tau}$ and $f(x_0,u) = (x_1;\dots;x_\tau)$. Assume $r$ to be decomposable as $r(u) = \sum_{t=1}^\tau r_t(u_t)$. Gradient (15), Gauss-Newton (16) and Newton (17) oracles on $h\circ\psi + r$ are the solutions $v^* = (v_1^*;\dots;v_\tau^*)$ of problems of the form
$$\min_{\substack{v_1,\dots,v_\tau\in\mathbb{R}^{p_1}\times\dots\times\mathbb{R}^{p_\tau}\\ y_1,\dots,y_\tau\in\mathbb{R}^{d_1}\times\dots\times\mathbb{R}^{d_\tau}}} \; \sum_{t=1}^\tau \tfrac{1}{2}y_t^\top P_t y_t + p_t^\top y_t + y_{t-1}^\top R_t v_t + \tfrac{1}{2}v_t^\top Q_t v_t + q_t^\top v_t + \tfrac{\kappa}{2}\|v_t\|_2^2 \qquad (18)$$
$$\text{subject to } y_t = A_t y_{t-1} + B_t v_t \text{ for } t\in\{1,\dots,\tau\}, \qquad y_0 = 0,$$
where
$$A_t = \nabla_{x_{t-1}}\phi_t(x_{t-1},u_t)^\top, \quad B_t = \nabla_{u_t}\phi_t(x_{t-1},u_t)^\top, \quad p_\tau = \nabla h(\psi(u)), \quad p_t = 0 \text{ for } t\neq\tau, \quad q_t = \nabla r_t(u_t),
$$
1. for gradient oracles (15), $P_t = 0$, $R_t = 0$, $Q_t = 0$,
2. for Gauss-Newton oracles (16), $P_\tau = \nabla^2 h(\psi(u))$, $P_t = 0$ for $t\neq\tau$, $R_t = 0$, $Q_t = \nabla^2 r_t(u_t)$,

3. for Newton oracles (17), defining $\lambda_\tau = \nabla h(\psi(u))$ and $\lambda_{t-1} = \nabla_{x_{t-1}}\phi_t(x_{t-1},u_t)\lambda_t$ for $t\in\{1,\dots,\tau\}$, we have
$$P_\tau = \nabla^2 h(\psi(u)), \qquad P_{t-1} = \nabla^2_{x_{t-1}x_{t-1}}\phi_t(x_{t-1},u_t)[\cdot,\cdot,\lambda_t] \;\text{ for } t\in\{1,\dots,\tau\},$$
$$R_t = \nabla^2_{x_{t-1}u_t}\phi_t(x_{t-1},u_t)[\cdot,\cdot,\lambda_t], \qquad Q_t = \nabla^2 r_t(u_t) + \nabla^2_{u_t u_t}\phi_t(x_{t-1},u_t)[\cdot,\cdot,\lambda_t].$$

Problems of the form
$$\min_{\substack{u_1,\dots,u_\tau\in\mathbb{R}^{p_1}\times\dots\times\mathbb{R}^{p_\tau}\\ x_1,\dots,x_\tau\in\mathbb{R}^{d_1}\times\dots\times\mathbb{R}^{d_\tau}}} \; \sum_{t=1}^\tau h_t(x_t) + \sum_{t=1}^\tau g_t(u_t) \qquad (19)$$
$$\text{subject to } x_t = \phi_t(x_{t-1},u_t) \text{ for } t\in\{1,\dots,\tau\}, \qquad x_0 = \hat x_0,$$
can be tackled by dynamic programming, which defines the cost-to-go of a state $\hat x_t$ at time $t$ by
$$\mathrm{cost}_t(\hat x_t) = \min_{\substack{u_{t+1},\dots,u_\tau\\ x_t,\dots,x_\tau}} \; \sum_{t'=t}^\tau h_{t'}(x_{t'}) + \sum_{t'=t+1}^\tau g_{t'}(u_{t'}) \quad \text{subject to } x_{t'} = \phi_{t'}(x_{t'-1},u_{t'}) \text{ for } t'\in\{t+1,\dots,\tau\}, \; x_t = \hat x_t,$$
such that the cost-to-go functions follow the recursive relation
$$\mathrm{cost}_t(\hat x_t) = \min_{u_{t+1}\in\mathbb{R}^{p_{t+1}}} \big\{ h_t(\hat x_t) + g_{t+1}(u_{t+1}) + \mathrm{cost}_{t+1}(\phi_{t+1}(\hat x_t, u_{t+1})) \big\}. \qquad (20)$$
This principle cannot be used directly on the original problem, since Eq. (20) cannot be solved analytically for generic problems of the form (19). However, for quadratic problems with linear compositions of the form (18), this principle can be used to solve problems (18) by dynamic programming [4]. Therefore, as a corollary of Prop. 3.1, the complexity of all optimization steps given in (15), (16), (17) is linear with respect to the length $\tau$ of the chain. Precisely, Prop. 3.1 shows that each optimization step amounts to reducing the recursive relation (20) to an analytic problem. In particular, while the Hessian of the objective has size $\sum_{t=1}^\tau p_t \times \sum_{t=1}^\tau p_t$, a Newton step has a linear, and not cubic, complexity with respect to $\tau$. We present in Appendix B the detailed computation of a Newton step; alternative derivations were first proposed in the control literature [6]. This involves the inversion of intermediate quadratic costs at each layer. Gauss-Newton steps can also be solved by dynamic programming and can be more efficiently implemented using an automatic-differentiation oracle, as we explain below.

As explained in the last subsection and shown in Appendix B, a gradient step can naturally be derived as a dynamic programming procedure applied to the subproblem (18). However, the implementation of the gradient step itself provides a different kind of oracle on the chain of computations, as defined below.
Definition 3.2.
Given a chain of computations $f : \mathbb{R}^{d_0}\times\mathbb{R}^{\sum_{t=1}^\tau p_t}\to\mathbb{R}^{\sum_{t=1}^\tau d_t}$ as defined in Def. 2.1, $u\in\mathbb{R}^{\sum_{t=1}^\tau p_t}$ and $x_0\in\mathbb{R}^{d_0}$, an automatic differentiation oracle is a procedure that gives access to $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ for any $\mu\in\mathbb{R}^{d_\tau}$.

The subtle difference is that we have access to $\nabla f_{x_0,\tau}(u)$ not as a matrix but as a linear operator. The matrix $\nabla f_{x_0,\tau}(u)$ could also be computed and stored to perform gradient-vector products. Yet this requires a surplus of storage and of computations that is generally not necessary for our purposes. The only quantities that need to be stored are given by the forward pass. Then, these quantities can be used to compute any gradient-vector product directly.

The implementation of an automatic differentiation oracle is composed of two steps:
1. a forward pass that computes $f_{x_0,\tau}(u)$ and stores the information necessary to compute gradient-vector products,
2. the compilation of a backward pass that computes $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ for any $\mu\in\mathbb{R}^{d_\tau}$ given the information collected in the forward pass.
Note that the two aforementioned passes are decoupled in the sense that the forward pass does not require the knowledge of the slope $\mu$ for which $\nabla f_{x_0,\tau}(u)\mu$ is computed.

We present in Algo. 1 and Algo. 2 the classical forward and backward passes used in modern automatic-differentiation libraries. The implementation of the automatic differentiation oracle as a procedure that computes both the value of the chain $f_{x_0,\tau}(u)$ and the linear operator $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ is then presented in Algo. 3 and illustrated in Fig. 2. Computing the gradient $g = \nabla(h\circ f_{x_0,\tau})(u)$ at $u\in\mathbb{R}^p$ then amounts to
1. computing, with Algo. 3, $f_{x_0,\tau}(u)$ and $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu = \mathrm{Autodiff}(f, u)$,
2. computing $\mu = \nabla h(f_{x_0,\tau}(u))$ and then $g = \nabla f_{x_0,\tau}(u)\mu$.

Algorithm 1 Forward pass
Inputs: Chain of computations $f$ defined by $(\phi_t)_{t=1,\dots,\tau}$, input $x_0$ as in Def. 2.1, variable $u = (u_1;\dots;u_\tau)$
Initialize $x_0$ with the given input
for $t = 1,\dots,\tau$ do
  Compute $x_t = \phi_t(x_{t-1}, u_t)$
  Store $\nabla\phi_t(x_{t-1}, u_t)$
end for
Output: $x_\tau$, $\nabla\phi_t(x_{t-1}, u_t)$ for $t\in\{1,\dots,\tau\}$.

Algorithm 2 Backward pass
Inputs: Slope $\mu$, intermediate gradients $\nabla\phi_t(x_{t-1}, u_t)$ for $t\in\{1,\dots,\tau\}$
Initialize $\lambda_\tau = \mu$
for $t = \tau,\dots,1$ do
  Compute $\lambda_{t-1} = \nabla_{x_{t-1}}\phi_t(x_{t-1}, u_t)\lambda_t$
  Store $g_t = \nabla_{u_t}\phi_t(x_{t-1}, u_t)\lambda_t$
end for
Output: $(g_1,\dots,g_\tau) = \nabla f_{x_0,\tau}(u)\mu$

Algorithm 3 Chain of computations with automatic-differentiation oracle (Autodiff)
Inputs: Chain of computations $f$ defined by $(\phi_t)_{t=1,\dots,\tau}$, input $x_0$ as in Def. 2.1, variable $u = (u_1;\dots;u_\tau)$
Compute, using Algo. 1, $(x_\tau, (\nabla\phi_t(x_{t-1}, u_t))_{t=1}^\tau) = \mathrm{Forward}(f, u)$, which gives $f_{x_0,\tau}(u) = x_\tau$
Define $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$ as $\mu\mapsto\mathrm{Backward}(\mu, (\nabla\phi_t(x_{t-1}, u_t))_{t=1}^\tau)$ according to Algo. 2.
Output: $f_{x_0,\tau}(u)$, $\mu\mapsto\nabla f_{x_0,\tau}(u)\mu$

Figure 2: Automatic differentiation of a chain of computations.
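As an illustration of Algorithms 1-3, the following sketch implements the forward and backward passes for a toy chain of affine layers with a smooth activation. It is a minimal NumPy mock-up of an automatic-differentiation oracle, not the implementation used by the libraries [1, 19]; the tanh activation, the layer sizes and the quadratic objective are assumptions made for the example, and the gradient is checked against a finite difference.

```python
import numpy as np

def forward(x0, params):
    # Algorithm 1: compute the chain and store what the backward pass needs.
    xs, zs = [x0], []
    for (W, c) in params:
        z = W @ xs[-1] + c          # affine part b_t
        zs.append(z)
        xs.append(np.tanh(z))       # activation a_t
    return xs, zs

def backward(mu, params, xs, zs):
    # Algorithm 2: propagate the slope mu backwards and collect the gradients g_t.
    lam, grads = mu, []
    for (W, c), x_prev, z in zip(reversed(params), reversed(xs[:-1]), reversed(zs)):
        delta = (1.0 - np.tanh(z) ** 2) * lam           # slope through the activation
        grads.append((np.outer(delta, x_prev), delta))  # gradients w.r.t. (W, c)
        lam = W.T @ delta                               # lambda_{t-1}
    return list(reversed(grads))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dims = [5, 4, 3]
    params = [(rng.standard_normal((dims[t + 1], dims[t])), rng.standard_normal(dims[t + 1]))
              for t in range(2)]
    x0 = rng.standard_normal(dims[0])

    # Gradient of h(f(x0, u)) with h = 0.5*||.||^2, i.e. slope mu = f(x0, u).
    xs, zs = forward(x0, params)
    grads = backward(xs[-1], params, xs, zs)

    # Central finite-difference check on one weight entry of the first layer.
    eps, (W0, c0) = 1e-6, params[0]
    Wp = W0.copy(); Wp[0, 0] += eps
    Wm = W0.copy(); Wm[0, 0] -= eps
    hp = 0.5 * np.sum(forward(x0, [(Wp, c0)] + params[1:])[0][-1] ** 2)
    hm = 0.5 * np.sum(forward(x0, [(Wm, c0)] + params[1:])[0][-1] ** 2)
    print(np.isclose((hp - hm) / (2 * eps), grads[0][0][0, 0], rtol=1e-4))  # True
```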
3.2.2 Complexity

Without additional information on the structure of the layers, the space and time complexities of the forward-backward algorithm are of the order of
$$\mathcal{S}_{\mathrm{FB}} \le \sum_{t=1}^\tau (p_t + d_{t-1})\,d_t, \qquad \mathcal{T}_{\mathrm{FB}} \le \sum_{t=1}^\tau \mathcal{T}(\phi_t, \nabla\phi_t) + 2\sum_{t=1}^\tau (d_{t-1}d_t + p_t d_t),$$
respectively, where $\mathcal{T}(\phi_t,\nabla\phi_t)$ is the time complexity of computing $\phi_t, \nabla\phi_t$ during the backward pass. The units chosen are, for the space complexity, the cost of storing one digit and, for the time complexity, the cost of performing an addition or a multiplication.

Provided that for all $t\in\{1,\dots,\tau\}$,
$$\mathcal{T}(\phi_t,\nabla\phi_t) + 2(d_{t-1}d_t + p_t d_t) \le Q\,\mathcal{T}(\phi_t), \qquad (21)$$
where $\mathcal{T}(\phi_t)$ is the time complexity of computing $\phi_t$ and $Q\ge 1$ is a constant, we get that $\mathcal{T}_{\mathrm{FB}} \le Q\,\mathcal{T}(f)$, where $\mathcal{T}(f)$ is the complexity of computing the chain of computations [14]. We retrieve Baur-Strassen's theorem, which states that the complexity of computing the derivative of a function formulated as a chain of computations is of the order of the complexity of computing the function itself [3, 10].

For chains of computations of the form (7), this cost can be refined as shown in Appendix B. Specifically, for a chain of fully-connected layers with element-wise activation function, no normalization or pooling, the cost of the backward pass is of the order of $O\big(\sum_{t=1}^\tau m\delta_t(\delta_{t-1}+1)\big)$ elementary operations. For a chain of convolutional layers with element-wise activation function, no normalization or pooling, the cost of the backward pass is of the order of $O\big(\sum_{t=1}^\tau (2 n^p_t n^f_t s^f_t + n^p_t n^f_t + \delta_t)\,m\big)$ elementary operations.

The Gauss-Newton step can also be solved by making calls to an automatic differentiation oracle.
Proposition 3.3.
Consider the Gauss-Newton oracle (16) on $u = (u_1;\dots;u_\tau)$ for a convex objective $h$, a convex decomposable regularization $r(u) = \sum_{t=1}^\tau r_t(u_t)$ and a differentiable chain of computations $f$ with output $\psi = f_{x_0,\tau}$ on some input $x_0$. We have that
1. the Gauss-Newton oracle amounts to solving
$$\min_{\mu\in\mathbb{R}^{d_\tau}} \; \big(q_h^{\psi(u)}\big)^\star(\mu) + \big(q_r^u + \kappa\|\cdot\|_2^2/2\big)^\star(-\nabla\psi(u)\mu), \qquad (22)$$
where for a function $f$ we denote by $f^\star$ its convex conjugate,
2. the Gauss-Newton oracle is $v^* = \nabla\big(q_r^u + \kappa\|\cdot\|_2^2/2\big)^\star(-\nabla\psi(u)\mu^*)$, where $\mu^*$ is the solution of (22),
3. the dual problem (22) can be solved by $d_\tau + 1$ calls to an automatic differentiation procedure.

Proposition 3.3 shows that a Gauss-Newton step is only $d_\tau + 1$ times more expensive than a gradient step. Precisely, for a deep network with a supervised objective, we have $d_\tau = nk$, where $n$ is the number of samples and $k$ is the number of classes. A gradient step then makes one call to an automatic differentiation procedure to get the gradient of the batch, and the Gauss-Newton method will then make $nk + 1$ more calls. If mini-batch Gauss-Newton steps are considered, then the cost reduces to $mk + 1$ calls to an automatic differentiation oracle, where $m$ is the size of the mini-batch.
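The practical point of Proposition 3.3 is that a Gauss-Newton direction only requires products with $\nabla\psi(u)$ and its transpose, both available from an automatic-differentiation oracle. The NumPy sketch below illustrates this matrix-free principle on a single tanh layer: instead of the dual problem (22), it solves the equivalent regularized normal equations $(J^\top\nabla^2 h\,J + \kappa I)v = -J^\top\nabla h$ by conjugate gradient, using only Jacobian-vector and transposed-Jacobian-vector products and never forming the Jacobian. The one-layer chain, the quadratic loss, $r = 0$ and the value of $\kappa$ are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, kappa = 6, 4, 1e-1
W = rng.standard_normal((d_out, d_in))
x0 = rng.standard_normal(d_in)
y = rng.standard_normal(d_out)

z = W @ x0
psi = np.tanh(z)                       # network output psi(u), a single layer for brevity
act = 1.0 - np.tanh(z) ** 2            # derivative of the activation at z

def jvp(v):
    # Product J v, J being the Jacobian of psi with respect to vec(W).
    return act * (v.reshape(d_out, d_in) @ x0)

def vjp(w):
    # Product J^T w, returned as a flat vector.
    return np.outer(act * w, x0).ravel()

def matvec(v):
    # (J^T J + kappa I) v: Gauss-Newton system for h = 0.5*||. - y||^2 and r = 0.
    return vjp(jvp(v)) + kappa * v

# Right-hand side -J^T grad h(psi(u)) and a plain conjugate-gradient loop.
b = -vjp(psi - y)
v = np.zeros_like(b)
res = b - matvec(v)
p = res.copy()
for _ in range(len(b)):
    Ap = matvec(p)
    alpha = res @ res / (p @ Ap)
    v += alpha * p
    new_res = res - alpha * Ap
    if np.linalg.norm(new_res) < 1e-10:
        break
    p = new_res + (new_res @ new_res) / (res @ res) * p
    res = new_res

# Check against an explicit solve with the full Jacobian.
J = act[:, None] * np.kron(np.eye(d_out), x0[None, :])
v_exact = np.linalg.solve(J.T @ J + kappa * np.eye(d_out * d_in), -J.T @ (psi - y))
print(np.allclose(v, v_exact, atol=1e-8))  # True
```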
4 Optimization complexity

We present smoothness properties with respect to the Euclidean norm $\|\cdot\|_2$, whose operator norm is denoted $\|\cdot\|_{2,2}$. In the following, for a function $f:\mathbb{R}^d\to\mathbb{R}^n$ and a set $C\subset\mathrm{dom} f\subset\mathbb{R}^d$, we denote by
$$m_f^C = \sup_{x\in C}\|f(x)\|_2, \qquad \ell_f^C = \sup_{\substack{x,y\in C\\ x\neq y}}\frac{\|f(x)-f(y)\|_2}{\|x-y\|_2}, \qquad L_f^C = \sup_{\substack{x,y\in C\\ x\neq y}}\frac{\|\nabla f(x)-\nabla f(y)\|_{2,2}}{\|x-y\|_2},$$
a bound of $f$ on $C$, the Lipschitz-continuity parameter of $f$ on $C$, and the smoothness parameter of $f$ on $C$ (i.e., the Lipschitz-continuity parameter of its gradient, if it exists), all with respect to $\|\cdot\|_2$. Note that if $x = \mathrm{Vec}(X)$ for a given matrix $X$, then $\|x\|_2 = \|X\|_F$. We denote by $m_f, \ell_f, L_f$ the same quantities defined on the domain of $f$, e.g., $m_f = m_f^{\mathrm{dom} f}$. We denote by $\mathcal{C}_{m,\ell,L}$ the class of functions $f$ such that $m_f = m$, $\ell_f = \ell$, $L_f = L$. In the following, we allow the quantities $m_f, \ell_f, L_f$ to be infinite, if for example the function is unbounded or the smoothness constant is not defined. The procedures presented below output infinite estimates if the combination of the smoothness properties does not allow for finite estimates. On the other hand, they provide finite estimates automatically if they are available. In the following, we denote $\bigotimes_{t=1}^\tau B_{R_t}(\mathbb{R}^{p_t}) = \{u = (u_1;\dots;u_\tau)\in\mathbb{R}^{\sum_{t=1}^\tau p_t} : u_t\in\mathbb{R}^{p_t}, \|u_t\|_2\le R_t\}$.

We recall the convergence rate to a stationary point of a gradient descent and a stochastic gradient descent on constrained problems.
Theorem 4.1 ([8, Theorems 1 and 2]). Consider problems of the form
$$(i)\;\; \min_{u\in\mathbb{R}^p}\{F(u) := h(\psi(u)) + r(u)\} \;\text{ subject to } u\in C, \qquad (ii)\;\; \min_{u\in\mathbb{R}^p}\Big\{F(u) := \frac{1}{n}\sum_{i=1}^n h_i(\psi_i(u)) + r(u)\Big\} \;\text{ subject to } u\in C,$$
where $C$ is a closed convex set and $F$ is $L_F^C$-smooth on $C$. For problem (ii), consider that we have access to an unbiased estimate $\widehat\nabla F(u)$ of $\nabla F(u)$ with a variance bounded as $\mathbb{E}\big(\|\widehat\nabla F(u) - \nabla F(u)\|_2^2\big)\le\sigma^2$.

A projected gradient descent applied on problem (i) with step-size $\gamma = (L_F^C)^{-1}$ converges to an $\varepsilon$-stationary point in at most
$$O\Big(\frac{L_F^C\,(F(u_0) - F^*)}{\varepsilon^2}\Big)$$
iterations, where $u_0$ is the initial point and $F^* = \min_{u\in C}F(u)$.

A stochastic projected gradient descent applied on problem (ii) with step-size $\gamma = (2L_F^C)^{-1}$ converges in expectation to an $(\varepsilon + \sigma)$-stationary point in at most
$$O\Big(\frac{L_F^C\,(F(u_0) - F^*)}{\varepsilon^2}\Big)$$
iterations, where $u_0$ is the initial point and $F^* = \min_{u\in C}F(u)$.

Remarks.
1. Since gradient descent is monotonically decreasing, a gradient descent applied to the unconstrained problem converges to an $\varepsilon$-stationary point in at most
$$O\Big(\frac{L_F^{S_0}\,(F(u_0) - F^*)}{\varepsilon^2}\Big)$$
iterations, where $S_0 = \{u\in\mathbb{R}^p : F(u)\le F(u_0)\}$ is the initial sub-level set.

2. Tighter rates of convergence can be obtained for the finite-sum problem (ii) by using variance reduction methods and by varying mini-batch sizes [17]. They would then depend on the smoothness constants of the objective or the maximal smoothness of the components on $C$, i.e., $\max_{i=1,\dots,n} L^C_{h_i\circ\psi_i + r}$.

The smoothness of the objectives $F$ defined in Theorem 4.1 can be derived from the smoothness properties of their components.

Proposition 4.2.
Consider a closed convex set $C\subset\mathbb{R}^p$, $\psi\in\mathcal{C}^C_{m_\psi^C,\ell_\psi^C,L_\psi^C}$, $r\in\mathcal{C}_{L_r}$ and $h\in\mathcal{C}_{\ell_h,L_h}$ with $\ell_h = +\infty$ if $h$ is not Lipschitz-continuous. The smoothness of $F = h\circ\psi + r$ on $C$ is bounded as
$$L_F^C \le L_\psi^C\,\tilde\ell_h^C + \big(\ell_\psi^C\big)^2 L_h + L_r, \qquad \text{where } \tilde\ell_h^C = \min\Big\{\ell_h,\; \min_{z\in\psi(C)}\|\nabla h(z)\|_2 + L_h\,\ell_\psi^C\,D_C\Big\},$$
and $D_C = \sup_{x,y\in C}\|x-y\|_2$.

What remains to characterize are the smoothness properties of a chain of computations.
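Proposition 4.2 translates directly into a few lines of bookkeeping that turn the constants of $h$, $\psi$ and $r$ into a step-size for Theorem 4.1. The helper below is a small sketch of that computation; the function name and the example constants are ours, and `math.inf` is used for constants that are unavailable, mirroring the convention of the text.

```python
import math

def smoothness_of_composition(l_psi, L_psi, l_h, L_h, L_r, grad_h_min, D_C):
    # Bound on L_F for F = h o psi + r on a set C of diameter D_C (Prop. 4.2):
    # L_F <= L_psi * min(l_h, grad_h_min + L_h * l_psi * D_C) + l_psi**2 * L_h + L_r.
    l_h_tilde = min(l_h, grad_h_min + L_h * l_psi * D_C)
    return L_psi * l_h_tilde + l_psi ** 2 * L_h + L_r

if __name__ == "__main__":
    # Toy constants, not tied to any particular network.
    L_F = smoothness_of_composition(l_psi=3.0, L_psi=5.0, l_h=math.inf,
                                    L_h=1.0, L_r=0.1, grad_h_min=0.5, D_C=2.0)
    print(L_F, "-> gradient step-size 1/L_F =", 1.0 / L_F)
```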
We present the smoothness computations for a deep network. Generic estimations of the smoothness properties of a chain of computations are presented in Appendix C. The propositions below give upper bounds on the smoothness constants of the function achieved through chain-composition. For a trivial composition such as $f\circ f^{-1}$, the upper bound is clearly loose. The upper bounds we present here are informative for non-trivial architectures. The estimation is done by a forward pass on the network, as illustrated in Fig. 3. The reasoning is based on the following lemma.

Lemma 4.3.
Consider a chain $f$ of $\tau$ computations $\phi_t\in\mathcal{C}_{\ell_{\phi_t},L_{\phi_t}}$ initialized at some $x_0\in\mathbb{R}^{d_0}$.
(i) We have $\ell_{f_{\tau,x_0}}\le\ell_\tau$, where $\ell_0 = 0$ and $\ell_t = \ell_{\phi_t} + \ell_{t-1}\ell_{\phi_t}$ for $t\in\{1,\dots,\tau\}$.
(ii) We have $L_{f_{\tau,x_0}}\le L_\tau$, where $L_0 = 0$ and $L_t = L_{t-1}\ell_{\phi_t} + L_{\phi_t}(1+\ell_{t-1})^2$ for $t\in\{1,\dots,\tau\}$.

In the case of deep networks, the computations are not Lipschitz-continuous due to the presence of bi-affine functions. Yet, provided that the inputs of the computations are bounded and that we have access to the smoothness of the computations, we can estimate the Lipschitz-continuity of the computations restricted to these bounded sets.
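Lemma 4.3 is easy to turn into a forward recursion over the layers. The sketch below assumes the per-layer constants $(\ell_{\phi_t}, L_{\phi_t})$ are given and simply accumulates the estimates $\ell_\tau$ and $L_\tau$; the constants used in the example are arbitrary.

```python
def chain_smoothness(layer_constants):
    # layer_constants: list of (l_phi, L_phi) pairs, one per layer t = 1, ..., tau.
    # Returns the estimates (l_tau, L_tau) of Lemma 4.3 by a forward pass.
    l, L = 0.0, 0.0
    for (l_phi, L_phi) in layer_constants:
        L = L * l_phi + L_phi * (1.0 + l) ** 2   # smoothness estimate L_t (uses the old l_{t-1})
        l = l_phi * (1.0 + l)                    # Lipschitz estimate l_t
    return l, L

if __name__ == "__main__":
    # Three layers with hypothetical constants.
    print(chain_smoothness([(1.0, 0.5), (2.0, 0.5), (1.0, 0.0)]))
```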
Corollary 4.4.
Consider a chain $f$ of $\tau$ computations $\phi_t\in\mathcal{C}_{m_{\phi_t},\ell_{\phi_t},L_{\phi_t}}$ initialized at some $x_0\in\mathbb{R}^{d_0}$, and consider $C = \bigotimes_{t=1}^\tau B_{R_t}(\mathbb{R}^{p_t})$. Then the smoothness of the output of the chain $f_{\tau,x_0}$ on $C$ can be estimated as in Lemma 4.3 by replacing $\ell_{\phi_t}$ with $\tilde\ell_{\phi_t}$ defined by
$$\tilde\ell_{\phi_t} = \min\{\ell_{\phi_t},\; L_{\phi_t}(m_{t-1}+R_t) + \|\nabla\phi_t(0,0)\|_{2,2}\}, \qquad m_t = \min\{m_{\phi_t},\; \tilde\ell_{\phi_t}(m_{t-1}+R_t) + \|\phi_t(0,0)\|_2\},$$
for $t\in\{1,\dots,\tau\}$, with $m_0 = \|x_0\|_2$.

We specialize the result to deep networks. We denote by $\mathcal{B}_{L,l_u,l_x}$ the set of $L$-smooth bi-affine functions $b$ such that $\|\nabla_u b(0,0)\|_{2,2} = l_u$, $\|\nabla_x b(0,0)\|_{2,2} = l_x$, i.e., functions of the form
$$b(x,u) = \beta(x,u) + \beta_u(u) + \beta_x(x) + \beta_0,$$
with $\beta$ bilinear and $L$-smooth, $\beta_u, \beta_x$ linear and $l_u$-, $l_x$-Lipschitz-continuous respectively, and $\beta_0$ a constant vector.

Proposition 4.5.
Consider a chain $f$ of $\tau$ computations whose layers $\phi_t$ are defined by
$$\phi_t(x_{t-1}, u_t) = a_t(b_t(x_{t-1}, u_t)) \quad \text{for } t\in\{1,\dots,\tau\},$$
where $b_t\in\mathcal{B}_{L_{b_t},l^u_{b_t},l^x_{b_t}}$ and $a_t$ is decomposed as $a_t = a_{t,k_t}\circ\dots\circ a_{t,1}$ with $a_{t,i}\in\mathcal{C}_{m_{a_{t,i}},\ell_{a_{t,i}},L_{a_{t,i}}}$. Consider $C = \bigotimes_{t=1}^\tau B_{R_t}(\mathbb{R}^{p_t})$. The outputs $m_\tau$, $\ell_\tau$ and $L_\tau$ of Algo. 4 satisfy $m^C_{f_{\tau,x_0}}\le m_\tau$, $\ell^C_{f_{\tau,x_0}}\le\ell_\tau$, $L^C_{f_{\tau,x_0}}\le L_\tau$.

Figure 3: Smoothness estimates computations.
Algorithm 4
Automatic smoothness computations
Inputs:
1. Chain of computations $f$ defined by $\phi_t = a_t\circ b_t$ for $t\in\{1,\dots,\tau\}$ with $a_t = a_{t,k_t}\circ\dots\circ a_{t,1}$
2. Smoothness properties $L_{b_t}, l^u_{b_t}, l^x_{b_t}$ of the bi-affine functions $b_t\in\mathcal{B}_{L_{b_t},l^u_{b_t},l^x_{b_t}}$
3. Smoothness properties $m_{a_{t,i}}, \ell_{a_{t,i}}, L_{a_{t,i}}$ of the nonlinear functions $a_{t,i}\in\mathcal{C}_{m_{a_{t,i}},\ell_{a_{t,i}},L_{a_{t,i}}}$
4. Initial point $x_0$
5. Bounds $R_t$ on the parameters
Initialize $m_0 = \|x_0\|_2$, $\ell_0 = 0$, $L_0 = 0$
for $t = 1,\dots,\tau$ do
  $\ell^x_{t,0} = L_{b_t}R_t + l^x_{b_t}$, $\quad \ell^u_{t,0} = L_{b_t}m_{t-1} + l^u_{b_t}$, $\quad \ell_{t,0} = 1$
  $m_{t,0} = \ell^x_{t,0}m_{t-1} + \ell^u_{t,0}R_t + \|b_t(0,0)\|_2$
  $L_{t,0} = 0$
  for $j = 1,\dots,k_t$ do
    $\tilde\ell_{a_{t,j}} = \min\{\ell_{a_{t,j}},\; \|\nabla a_{t,j}(0)\|_{2,2} + L_{a_{t,j}}m_{t,j-1}\}$
    $m_{t,j} = \min\{m_{a_{t,j}},\; \|a_{t,j}(0)\|_2 + \tilde\ell_{a_{t,j}}m_{t,j-1}\}$
    $\ell_{t,j} = \tilde\ell_{a_{t,j}}\ell_{t,j-1}$
    $L_{t,j} = L_{t,j-1}\ell_{a_{t,j}} + L_{a_{t,j}}(\ell_{t,j-1})^2$
  end for
  $m_t = m_{t,k_t}$
  $\ell_t = \ell^x_{t,0}\ell_{t,k_t}\ell_{t-1} + \ell^u_{t,0}\ell_{t,k_t}$
  $L_t = L_{t-1}\ell^x_{t,0}\ell_{t,k_t} + (L_{b_t}R_t + l^x_{b_t})^2 L_{t,k_t}\ell_{t-1}^2 + 2\big((L_{b_t}m_{t-1} + l^u_{b_t})(L_{b_t}R_t + l^x_{b_t})L_{t,k_t} + L_{b_t}\ell_{t,k_t}\big)\ell_{t-1} + (L_{b_t}m_{t-1} + l^u_{b_t})^2 L_{t,k_t}$
end for
Output: $m_\tau$, $\ell_\tau$, $L_\tau$
Corollary 4.6. Consider a chain $f$ of $\tau$ computations as defined in Prop. 4.5 and $u^* = (u_1^*;\dots;u_\tau^*)\in\mathbb{R}^p$. The smoothness properties of $f$ on $C' = \{u = (u_1;\dots;u_\tau)\in\mathbb{R}^p : \forall t\in\{1,\dots,\tau\}, \|u_t - u_t^*\|_2\le R'_t\}$ are given as in Prop. 4.5 by considering $R'_t$ in place of $R_t$, $l^x_{\beta_t} + L_{\beta_t}\|u_t^*\|_2$ in place of $l^x_{\beta_t}$, and $\|\beta_{0,t}\|_2 + l^u_{\beta_t}\|u_t^*\|_2$ in place of $\|\beta_{0,t}\|_2$.

We apply our framework to assess the smoothness properties of the Visual Geometry Group (VGG) deep network used for image classification [24].
The VGG network is a benchmark network for image classification with deep networks. The objective is to classify images among 1000 classes. Its architecture is composed of 16 layers, described in Appendix F. We consider in the following the smoothness properties for mini-batches of size $m = 128$, i.e., by concatenating $m$ chains of computations $f^{(i)}$, each defined by a different input. This highlights the impact of the size of the mini-batch for batch-normalization.

Smoothness computations.
To compute the Lipschitz-continuity and smoothness parameters, we recall the list of Lipschitz-continuity and smoothness constants of each layer of interest. For the bilinear and linear operations, we denote by $L$ the smoothness of the bilinear operation $\beta$ and by $\ell$ the Lipschitz-continuity of the linear operation $\beta_u$. The smoothness constants of interest are
1. $\ell_{\mathrm{conv}} = \sqrt{m}\,\lceil k/s\rceil$, $L_{\mathrm{conv}} = \lceil k/s\rceil$, where the patch is of size $k\times k$ and the stride is $s$,
2. $\ell_{\mathrm{full}} = \sqrt{m}$, $L_{\mathrm{full}} = 1$,
3. $\ell_{\mathrm{ReLU}} = 1$, $L_{\mathrm{ReLU}}$ not defined,
4. $\ell_{\mathrm{softmax}} = 2$, $L_{\mathrm{softmax}} = 4$,
5. $\ell_{\mathrm{maxpool}} = 1$, $L_{\mathrm{maxpool}}$ not defined,
6. $\ell_{\mathrm{log}} = 2$, $L_{\mathrm{log}} = 2$.
A Lipschitz-continuity estimate of this architecture can then be computed using Prop. 4.5 on a Cartesian product of balls $C = \{u = (u_1;\dots;u_{16}) : \|u_t\|_2\le R\}$ for $R = 1$, for example.
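To make the bookkeeping concrete, here is a hypothetical sketch that chains per-layer constants through a forward recursion in the spirit of Lemma 4.3 and Prop. 4.5, for a small VGG-like prefix (two convolutions, a max-pooling and a fully-connected layer). It only tracks the bound $m_t$ and the Lipschitz estimate $\ell_t$ (the smoothness estimate of Algo. 4 is omitted), ignores bias terms and the $\sqrt{m}$ factors, and uses an illustrative radius; the full and rigorous computation is Algo. 4.

```python
import math

def affine_layer(m_prev, l_prev, L_b, l_u, l_x, R, l_act=1.0):
    # One layer a_t(b_t(x, u)) on the ball ||u_t|| <= R, tracking only the bound m_t
    # and the Lipschitz estimate l_t (offsets b_t(0,0) are assumed to be zero).
    l_x0 = L_b * R + l_x            # Lipschitz constant w.r.t. the input x on the ball
    l_u0 = L_b * m_prev + l_u       # Lipschitz constant w.r.t. the parameters u
    m_t = l_act * (l_x0 * m_prev + l_u0 * R)
    l_t = l_act * (l_x0 * l_prev + l_u0)
    return m_t, l_t

def parameter_free_layer(m_prev, l_prev, l_act):
    # Activation or pooling layer: no parameters, l_act-Lipschitz, preserving 0.
    return l_act * m_prev, l_act * l_prev

if __name__ == "__main__":
    R = 1.0
    k, s = 3, 1                               # 3x3 convolutions with stride 1
    L_conv = math.ceil(k / s)                 # bilinear smoothness from the list above
    m_t, l_t = 1.0, 0.0                       # ||x_0|| = 1, l_0 = 0
    m_t, l_t = affine_layer(m_t, l_t, L_conv, 0.0, 0.0, R)   # conv + ReLU
    m_t, l_t = affine_layer(m_t, l_t, L_conv, 0.0, 0.0, R)   # conv + ReLU
    m_t, l_t = parameter_free_layer(m_t, l_t, 1.0)           # max-pooling
    m_t, l_t = affine_layer(m_t, l_t, 1.0, 0.0, 0.0, R)      # fully-connected + ReLU
    print("bound on the output:", m_t, " Lipschitz estimate:", l_t)
```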
Smooth VGG.

First, the VGG architecture can be made continuously differentiable by considering the soft-plus activation instead of the ReLU activation and average pooling instead of the max-pooling operation. As shown in Appendix D, we have
1. $\ell_{\mathrm{avgpool}} = 1$, $L_{\mathrm{avgpool}} = 0$,
2. $\ell_{\mathrm{softplus}} = 1$, $L_{\mathrm{softplus}} = 1/4$.
Denoting $\ell_{\mathrm{VGG}}$ and $\ell_{\mathrm{VGG\text{-}smooth}}$ the Lipschitz-continuity estimates of the original VGG network and of its smoothed modification on a Cartesian product of balls $C = \{u = (u_1;\dots;u_{16}) : \|u_t\|_2\le 1\}$ with $\|x_0\|_2 = 1$, we get using Prop. 4.5 that the relative gap $|\ell_{\mathrm{VGG}} - \ell_{\mathrm{VGG\text{-}smooth}}|/\ell_{\mathrm{VGG}} \ll 1$.

Batch-normalization effect. We can also compare the smoothness properties of the smoothed network with the same network modified by adding a batch-normalization layer, with $m$ inputs and normalization parameter $\varepsilon$, at each convolutional layer. As shown in Appendix D, the batch-normalization satisfies
$$m_{\mathrm{batch}} = \delta m, \qquad \ell_{\mathrm{batch}} = 2\varepsilon^{-1/2}, \qquad L_{\mathrm{batch}} = 2m^{-1/2}\varepsilon^{-1}.$$
Denoting $\ell_{\mathrm{VGG\text{-}smooth}}, L_{\mathrm{VGG\text{-}smooth}}$ and $\ell_{\mathrm{VGG\text{-}batch}}, L_{\mathrm{VGG\text{-}batch}}$ the Lipschitz-continuity and smoothness estimates of the smoothed VGG network without and with batch-normalization respectively, on a Cartesian product of balls $C = \{u = (u_1;\dots;u_{16}) : \|u_t\|_2\le 1\}$ with $\|x_0\|_2 = 1$, we get using Prop. 4.5:
for $\varepsilon = 10^{-5}$: $\;\ell_{\mathrm{VGG\text{-}smooth}}\le\ell_{\mathrm{VGG\text{-}batch}}$ and $L_{\mathrm{VGG\text{-}smooth}}\le L_{\mathrm{VGG\text{-}batch}}$;
for $\varepsilon = 10$: $\;\ell_{\mathrm{VGG\text{-}smooth}}\ge\ell_{\mathrm{VGG\text{-}batch}}$ and $L_{\mathrm{VGG\text{-}smooth}}\ge L_{\mathrm{VGG\text{-}batch}}$.
Intuitively, the batch-norm bounds the output of each layer, mitigating the increase of $m_t$ in the computation of the smoothness estimates in lines 8 and 9 of Algo. 4. Yet, for a small $\varepsilon$, this effect is balanced by the non-smoothness of the batch-norm layer (which for $\varepsilon\to 0$ tends to have an infinite slope around 0).

Acknowledgments.
This work was supported by NSF CCF-1740551, NSF DMS-1839371, the program "Learning in Machines and Brains", and faculty research awards.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[2] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[3] W. Baur and V. Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22(3):317-330, 1983.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 3rd edition, 2005.
[5] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2012.
[6] J. C. Dunn and D. P. Bertsekas. Efficient dynamic programming implementations of Newton's method for unconstrained optimal control problems. Journal of Optimization Theory and Applications, 63(1):23-38, 1989.
[7] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, 2015.
[8] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.
[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. The MIT Press, 2016.
[10] A. Griewank. Who invented the reverse mode of differentiation? Documenta Mathematica, Optimization stories:389-400, 2012.
[11] T. D. Hocking, A. Joulin, F. Bach, and J.-P. Vert. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[12] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448-456, 2015.
[14] K. V. Kim, Y. E. Nesterov, and B. Cherkasskii. An estimate of the effort in computing the gradient. In Doklady Akademii Nauk, volume 275, pages 1306-1309. Russian Academy of Sciences, 1984.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 2012.
[16] Y. LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, 1988.
[17] Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564-5574, 2018.
[18] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 2019.
[20] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
[21] V. Roulet, S. Srinivasa, D. Drusvyatskiy, and Z. Harchaoui. Iterative linearized control: stable algorithms and complexity guarantees. In Proceedings of the 36th International Conference on Machine Learning, 2019. Long version.
[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
[23] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[25] P. Werbos. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience, 1994.
A Notations
A.1 Matrices
For a matrix $M\in\mathbb{R}^{d\times n}$, we denote by $\mathrm{Vec}(M)$ the concatenation of the columns of $M$. We denote $\|M\|_{2,2} = \sup_{x\neq 0, y\neq 0}\frac{x^\top M y}{\|x\|_2\|y\|_2}$ its norm induced by the Euclidean norm, and $\|M\|_F = \sqrt{\sum_{ij}M_{ij}^2}$ its Frobenius norm.

A.2 Tensors
A tensor $\mathcal{A} = (a_{ijk})_{i\in\{1,\dots,d\},\,j\in\{1,\dots,n\},\,k\in\{1,\dots,p\}}\in\mathbb{R}^{d\times n\times p}$ is represented as a list of matrices $\mathcal{A} = (A_1,\dots,A_p)$, where $A_k = (a_{ijk})_{i\in\{1,\dots,d\},\,j\in\{1,\dots,n\}}\in\mathbb{R}^{d\times n}$ for $k\in\{1,\dots,p\}$.

Tensor-matrix product.
Given matrices $P\in\mathbb{R}^{d\times d'}$, $Q\in\mathbb{R}^{n\times n'}$, $R\in\mathbb{R}^{p\times p'}$, we denote
$$\mathcal{A}[P, Q, R] = \Big(\sum_{k=1}^p R_{k,1}\,P^\top A_k Q,\;\dots,\;\sum_{k=1}^p R_{k,p'}\,P^\top A_k Q\Big)\in\mathbb{R}^{d'\times n'\times p'}.$$
If $P$, $Q$ or $R$ are identity matrices, we use the symbol "$\cdot$" in place of the identity matrix. For example, we denote $\mathcal{A}[P, Q, \mathrm{I}_p] = \mathcal{A}[P, Q, \cdot] = (P^\top A_1 Q, \dots, P^\top A_p Q)$.
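The tensor-matrix product above maps directly to an einsum. The snippet below is a small NumPy illustration of our own (not part of the paper) that checks the definition of $\mathcal{A}[P,Q,R]$ against a loop over the slices $A_k$.

```python
import numpy as np

def tensor_matrix_product(A, P, Q, R):
    # A has shape (d, n, p); returns A[P, Q, R] with shape (d', n', p').
    return np.einsum('dnp,de,nf,pg->efg', A, P, Q, R)

rng = np.random.default_rng(3)
d, n, p, d2, n2, p2 = 3, 4, 5, 2, 3, 2
A = rng.standard_normal((d, n, p))
P, Q, R = (rng.standard_normal(s) for s in [(d, d2), (n, n2), (p, p2)])

# Slice-by-slice version of the definition: sum_k R[k, l] * P^T A_k Q for each output slice l.
slices = [sum(R[k, l] * P.T @ A[:, :, k] @ Q for k in range(p)) for l in range(p2)]
print(np.allclose(tensor_matrix_product(A, P, Q, R), np.stack(slices, axis=-1)))  # True
```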
Fact A.1. Let $\mathcal{A}\in\mathbb{R}^{d\times p\times n}$, $P\in\mathbb{R}^{d\times d'}$, $Q\in\mathbb{R}^{p\times p'}$, $R\in\mathbb{R}^{n\times n'}$, and $S\in\mathbb{R}^{d'\times d''}$, $T\in\mathbb{R}^{p'\times p''}$, $U\in\mathbb{R}^{n'\times n''}$. Denote $\mathcal{A}' = \mathcal{A}[P,Q,R]\in\mathbb{R}^{d'\times p'\times n'}$. Then $\mathcal{A}'[S,T,U] = \mathcal{A}[PS, QT, RU]\in\mathbb{R}^{d''\times p''\times n''}$.

Flat tensors. If $P$, $Q$ or $R$ are vectors, we consider the flattened object. In particular, for $x\in\mathbb{R}^d$, $y\in\mathbb{R}^n$, we denote $\mathcal{A}[x,y,\cdot] = (x^\top A_1 y;\dots;x^\top A_p y)\in\mathbb{R}^p$, rather than having $\mathcal{A}[x,y,\cdot]\in\mathbb{R}^{1\times 1\times p}$. Similarly, for $z\in\mathbb{R}^p$, we have $\mathcal{A}[\cdot,\cdot,z] = \sum_{k=1}^p z_k A_k\in\mathbb{R}^{d\times n}$.

Transpose.
For a tensor $\mathcal{A} = (A_1,\dots,A_p)\in\mathbb{R}^{d\times n\times p}$, we denote $\mathcal{A}^t = (A_1^\top,\dots,A_p^\top)\in\mathbb{R}^{n\times d\times p}$.

Outer product.
We denote the outer product of three vectors $x\in\mathbb{R}^d$, $y\in\mathbb{R}^n$, $z\in\mathbb{R}^p$ as $x\boxtimes y\boxtimes z\in\mathbb{R}^{d\times n\times p}$, such that $(x\boxtimes y\boxtimes z)_{ijk} = x_i y_j z_k$.

Tensor norm.
We define the norm of a tensor A induced by the Euclidean norm as follows. Definition A.2.
The norm of a tensor $\mathcal{A}$ induced by the Euclidean norm is defined as
$$\|\mathcal{A}\|_{2,2,2} = \sup_{x\neq 0,\,y\neq 0,\,z\neq 0}\frac{\mathcal{A}[x,y,z]}{\|x\|_2\|y\|_2\|z\|_2}. \qquad (23)$$

Fact A.3.
The tensor norm satisfies the following properties, for a given tensor
$\mathcal{A}\in\mathbb{R}^{d\times n\times p}$:
1. $\|\mathcal{A}\|_{2,2,2} = \|\mathcal{A}^t\|_{2,2,2}$,
2. $\|\mathcal{A}[P,Q,R]\|_{2,2,2}\le\|\mathcal{A}\|_{2,2,2}\|P\|_{2,2}\|Q\|_{2,2}\|R\|_{2,2}$ for $P, Q, R$ with appropriate sizes,
3. $\|\mathcal{A}\|_{2,2,2} = \sup_{z\neq 0}\frac{\|\sum_{k=1}^p z_k A_k\|_{2,2}}{\|z\|_2}$.

A.3 Gradients

For a multivariate function $f:\mathbb{R}^d\to\mathbb{R}^n$ composed of $n$ real functions $f^{(j)}$ for $j\in\{1,\dots,n\}$, we denote $\nabla f(x) = (\nabla f^{(1)}(x),\dots,\nabla f^{(n)}(x))\in\mathbb{R}^{d\times n}$, that is, the transpose of its Jacobian at $x$, $\nabla f(x) = (\partial f^{(j)}/\partial x_i(x))_{i\in\{1,\dots,d\},\,j\in\{1,\dots,n\}}\in\mathbb{R}^{d\times n}$. We represent its second-order information by a tensor $\nabla^2 f(x) = (\nabla^2 f^{(1)}(x),\dots,\nabla^2 f^{(n)}(x))\in\mathbb{R}^{d\times d\times n}$.

Fact A.4.
We have for f : R d → R n , twice differentiable, and C ⊂ dom f convex, ℓ Cf = sup x,y ∈ Cx = y k f ( x ) − f ( y ) k k x − y k = sup x ∈ C k∇ f ( x ) k , , L Cf = sup x,y ∈ Cx = y k∇ f ( x ) − ∇ f ( y ) k , k x − y k = sup x ∈ C k∇ f ( x ) k , , . where k∇ f ( x ) k , denotes the operator norm of ∇ f ( x ) and k∇ f ( x ) k , , denotes the tensor norm of ∇ f ( x ) bothwith respect to the Euclidean norm.Proof. We have for x, y ∈ C , k f ( x ) − f ( y ) k = k Z ∇ f ( x + t ( y − x )) ⊤ ( y − x ) dt k ≤ sup x ∈ C k∇ f ( x ) k , k x − y k , k∇ f ( x ) − ∇ f ( y ) k , = k Z ∇ f ( x + t ( y − x ))[ y − x, · , · ] dt k , ≤ sup x ∈ C k∇ f ( x ) k , , k x − y k , which gives ℓ Cf ≤ sup x ∈ C k∇ f ( x ) k , and L Cf ≤ sup x ∈ C k∇ f ( x ) k , , . The equalities come from the definitionsof the gradient and the Hessian.For a real function, f : R d × R p R , whose value is denoted f ( x, y ) , we decompose its gradient ∇ f ( x, y ) ∈ R d + p on ( x, y ) ∈ R d × R p as ∇ f ( x, y ) = (cid:18) ∇ x f ( x, y ) ∇ y f ( x, y ) (cid:19) with ∇ x f ( x, y ) ∈ R d , ∇ y f ( x, y ) ∈ R p . Similarly we decompose its Hessian ∇ f ( x, y ) ∈ R ( d + p ) × ( d + p ) on blocks that correspond to the variables ( x, y ) asfollows ∇ f ( x, y ) = (cid:18) ∇ xx f ( x, y ) ∇ xy f ( x, y ) ∇ yx f ( x, y ) ∇ yy f ( x, y ) (cid:19) with ∇ xx f ( x, y ) ∈ R d × d , ∇ yy f ( x, y ) ∈ R p × p , ∇ xy f ( x, y ) = ∇ yx f ( x, y ) ⊤ ∈ R d × p . Given a function f : R d + p R n and ( x, y ) ∈ R d × R p , we denote ∇ x f ( x, y ) = ( ∇ x f (1) ( x, y ) , . . . , ∇ x f ( n ) ( x, y )) ∈ R d × n and we define similarly ∇ y f ( x, y ) ∈ R p × n .For its second order information we define ∇ xx f ( x, y ) = ( ∇ xx f (1) ( x, y ) , . . . , ∇ xx f ( n ) ( x, y )) , similarly for ∇ xx f ( x, y ) . Dimension of these definitions are ∇ x f ( x, y ) ∈ R d × n , ∇ y f ( x, y ) ∈ R p × n , ∇ xx f ( x, y ) ∈ R d × d × n , ∇ yy f ( x, y ) ∈ R p × p × n , ∇ xy f ( x, y ) ∈ R d × p × n , ∇ yx f ( x, y ) ∈ R p × d × n . A.4 Matrix functions
For a differentiable matrix-valued multivariate function g : R d → R p × n such that g ( x ) = ( g j,k ( x ) ≤ j ≤ p, ≤ k ≤ n ) , wedenote its first order information as a tensor ∇ g ( x ) = (cid:18) ∂g j,k ( x ) ∂x i (cid:19) ≤ i ≤ d, ≤ j ≤ p, ≤ k ≤ n ∈ R d × p × n . This notation is consistent with previous ones, i.e., for f : R d + p R n and g ( y ) = ∇ y f ( x, y ) ∈ R p × n , then ∇ g ( y ) = ∇ xy f ( x, y ) ∈ R d × p × n . From previous definitions, we have the following fact.18 act A.5. For a differentiable matrix-valued multivariate function g : R d → R p × n , A ∈ R p ′ × p , B ∈ R n × n ′ , denoting h ( x ) = Ag ( x ) B ∈ R p ′ × n ′ , we have ∇ h ( x ) = ∇ g ( x )[ · , A ⊤ , B ] ∈ R d × p ′ × n ′ . A.5 Bilinear functions
Definition A.6.
A function β : R d × R n → R p is bilinear if it is represented by a tensor B ∈ R d × n × p such that forany x ∈ R d , y ∈ R n , β ( x, y ) = B [ x, y, · ] . The gradient of a bilinear function β represented by a tensor B ∈ R d × n × p at a point x, y is given by ∇ x β ( x, y ) = B [ · , y, · ] ∈ R d × p , ∇ y β ( x, y ) = B [ x, · , · ] ∈ R n × p . (24)The Hessian of the bilinear function is given ∇ xx β ( x, y ) = 0 , ∇ yy β ( x, y ) = 0 , ∇ xy β ( x, y ) = B , ∇ yx β ( x, y ) = B t . (25)A bilinear function is not Lipschitz continuous as can be seen from Eq. (24). It is smooth w.r.t. the Euclidean normwith a smoothness constant given by the tensor norm of B as shown in the following proposition. Lemma A.7.
The smoothness of a bilinear function β defined by a tensor B is upper bounded as L β ≤ kBk , , .Proof. We have k∇ β ( x, y ) k , , = sup z =0 k P pk =1 z k ˜ B k k , k z k , where ∇ β ( x, y ) = ( ˜ B , . . . , ˜ B p ) . We have by Eq. (25) that P pk =1 z k ˜ B k is of the form p X k =1 z k ˜ B k = (cid:18) P pk =1 z k B k P pk =1 z k B ⊤ k (cid:19) where B = ( B , . . . , B p ) . Therefore we get k P pk =1 z k ˜ B k k , = k P pk =1 z k B k k , , see [12, Theorem 7.3.3.]. There-fore k∇ β ( x, y ) k = sup z =0 k P pk =1 z k ˜ B k k , k z k = sup z =0 k P pk =1 z k B k k , k z k = kBk , , . B Oracle arithmetic complexity proofs
B.1 Feasibility of the optimization oracle steps
The gradient step (15) is always feasible for any $\kappa > 0$, the Gauss-Newton step (16) is feasible for any $\kappa > 0$ if $h, r$ are convex, and the Newton step is feasible for any $\kappa > 0$ if $h\circ f$ and $r$ are convex. A sufficient condition for the Newton step to be feasible when $h\circ f$ is not convex, but $h\circ f$ and $r$ are smooth, is to choose $\kappa^{-1} < (L_{h\circ f} + L_r)^{-1}$. In other words, the step-size must be chosen small enough such that the Newton step is a convex problem.

B.2 Optimization oracles as linear quadratic problems
Lemma B.1.
Let f be a chain of τ computations φ t . Let u = ( u ; . . . ; u τ ) ∈ R P τt =1 p t , x ∈ R d , denote x t = f t ( x , u ) and E ⊤ t = (0 p t d , p t p , . . . , I p t p t , . . . , p t p τ ) ∈ R p t × ( d + P τt =1 p t ) such that E ⊤ t ( x ; u ) = u t for t ∈{ , . . . , τ } . . If φ t are differentiable, then ∇ f t ( x , u ) = ∇ f t − ( x , u ) ∇ x t − φ t ( x t − , u t ) + E t ∇ u t φ t ( x t − , u t ) (26) with ∇ f ( x , u ) = E , with E ⊤ = (I d d , d p , . . . , . . . , d p τ ) ∈ R d × ( d + P τt =1 p t ) such that E ⊤ ( x ; u ) = x .2. If φ t are twice differentiable, ∇ f t ( x , u ) = ∇ f t − ( x , u )[ · , · , ∇ x t − φ t ( x t − , u t )] (27) + ∇ x t − x t − φ t ( x t − , u t )[ ∇ f t − ( x , u ) ⊤ , ∇ f t − ( x , u ) ⊤ , · ]+ ∇ x t − u t φ t ( x t − , u t )[ ∇ f t − ( x , u ) ⊤ , E ⊤ t , · ]+ ∇ u t x t − φ t ( x t − , u t )[ E ⊤ t , ∇ f t − ( x , u ) ⊤ , · ]+ ∇ u t u t φ t ( x t − , u t )[ E ⊤ t , E ⊤ t , · ] with ∇ f ( x , u ) = 0 .Proof. It follows from the definition of the chain of computations and the notations used for tensors. Precisely wehave that f t ( x , u ) = φ t ( f t − ( x , u ) , E ⊤ t ( x ; u )) , hence the first result that can be written ∇ f t ( x , u ) = ∇ f t − ( x , u ) ∇ x t − φ t ( f t − ( x , u ) , E ⊤ t ( x ; u )) + E t ∇ u t φ t ( f t − ( x , u ) , E ⊤ t ( x ; u )) , hence the second result using the tensor notations. Proposition 3.1.
Let f be a chain of τ computations φ t : R d t − × R p t → R d t , u = ( u ; . . . ; u τ ) and x ∈ R d . Denote ψ = f x ,τ and f ( x , u ) = ( x ; . . . ; x τ ) . Assume r to be decomposable as r ( u ) = P τt =1 r t ( u t ) . Gradient (15) ,Gauss-Newton (16) and Newton (17) oracles on h ◦ ψ + r are the solutions v ∗ = ( v ∗ ; . . . ; v ∗ τ ) of problems of the form min v ,...,v τ ∈ R p × ... × R pτ y ,...,y τ ∈ R d × ... × R dτ τ X t =1 y ⊤ t P t y t + p ⊤ t y t + y ⊤ t − R t v t + 12 v ⊤ t Q t v t + q ⊤ t v t + κ k v t k (18) subject to y t = A t y t − + B t v t for t ∈ { , . . . , τ } ,y = 0 , where A t = ∇ x t − φ t ( x t − , u t ) ⊤ , B t = ∇ u t φ t ( x t − , u t ) ⊤ ,p τ = ∇ h ( ψ ( u )) , p t = 0 for t = τ ,q t = ∇ r t ( u t ) ,
1. for gradient oracles (15) , P t = 0 , R t = 0 , Q t = 0 ,
2. for Gauss-Newton oracles (16) , P τ = ∇ h ( ψ ( u )) , P t = 0 for t = τ, R t = 0 , Q t = ∇ r t ( u t ) , ,3. for Newton oracles (17) , defining λ τ = ∇ h ( ψ ( u )) , λ t − = ∇ x t − φ t ( x t − , u t ) λ t for t ∈ { , . . . , τ } , we have P τ = ∇ h ( ψ ( u )) , P t − = ∇ x t − x t − φ t ( x t − , u t )[ · , · , λ t ] for t ∈ { , . . . , τ } ,R t = ∇ x t − u t φ t ( x t − , u t )[ · , · , λ t ] , Q t = ∇ r t ( u t ) + ∇ u t u t φ t ( x t − , u t )[ · , · , λ t ] . roof. To reformulate the optimization oracle problems as quadratic problems with linear dynamics we reformulate ∇ ψ ( u ) ⊤ v as a linear chain of compositions and ∇ ψ ( u )[ v, v, ∇ h ( ψ ( u ))] as a quadratic on the linear trajectory definedby the gradient and the parameters using Lemma B.1. Precisely, for u = ( u , . . . ; u τ ) and v = ( v ; . . . ; v τ ) , denoting f x ( u ) = ( x ; . . . ; x τ ) and ∇ f x ( u ) ⊤ v = ( y ; . . . ; y τ ) with ∇ f x ,t ( u ) ⊤ v = y t , we have from (26) y t = ∇ x t − φ t ( x t − , u t ) ⊤ y t − + ∇ u t φ t ( x t − , u t ) ⊤ v t , for t ∈ { , . . . , τ } (28) y = 0 , and ∇ ψ ( u ) ⊤ v = ∇ f x ,τ ( u ) ⊤ v = y τ . For the second order derivatives, from (27), we have for v = ( v ; . . . ; v τ ) and λ t ∈ R d t , ∇ f x ,t ( u )[ v, v, λ t ] = ∇ f x ,t − ( u )[ v, v, ∇ x t − φ t ( x t − , u t ) λ t ]+ 12 ∇ x t − x t − φ t ( x t − , u t )[ y t − , y t − , λ t ]+ ∇ x t − u t φ t ( x t − , u t )[ y t − , v t , λ t ]+ 12 ∇ u t u t φ t ( x t − , u t )[ v t , v t , λ t ] Hence we get ∇ ψ ( u )[ v, v, ∇ h ( ψ ( u ))] = τ X t =1 ∇ x t − x t − φ t ( x t − , u t )[ y t − , y t − , λ t ] (29) + ∇ x t − u t φ t ( x t − , u t )[ y t − , v t , λ t ]+ 12 ∇ u t u t φ t ( x t − , u t )[ v t , v t , λ t ] where y t are given in (28) and λ t are defined by λ τ = ∇ h ( ψ ( u )) , λ t − = ∇ x t − φ t ( x t − , u t ) λ t for t ∈ { , . . . , τ } . The results follow by using the decomposability of r and inserting (28) and (29).We present the resolution of the Newton step by dynamic programming in Algo. 5 whose implementation isjustified in Proposition B.2. Note that the gradient is computed during the first backward pass which can reducethe computations by factorizing those computations. For the Gauss-Newton steps the same dynamic programmingapproach can be applied, however it is less computationally expansive to use automatic differentiation procedures aspresented in Sec. 3. Proposition B.2.
Consider problem (18) and assume it is bounded below, then the cost-to-go functions defined for t ∈ { , . . . , τ } and x t ∈ R δ t as cost t ( x t ) = min v t +1 ,...,v τ y t ,...,y τ τ X t ′ = t y ⊤ t ′ P t ′ y t ′ + p ⊤ t ′ y t ′ + τ X t ′ = t +1 y ⊤ t ′ − R t ′ v t ′ + 12 v ⊤ t ′ Q t ′ v t ′ + q ⊤ t ′ v t ′ + κ k v t ′ k (30) subject to y t ′ = A t ′ y t ′ − + B t ′ v t ′ for t ′ ∈ { t + 1 , . . . , τ } ,y t = x t , where P = 0 , p = 0 , are quadratics of the form cost t ( x t ) = 12 x ⊤ t C t x t + c ⊤ t x t + cste , (31) where C t , c t are defined recursively in line 20 Algo. 5 with C t = C ⊤ t and cste is a constant. The solution of (18) isgiven by, starting from y = 0 , v ∗ t = K t y t − + k t y t = A t y t − + B t v ∗ t , where K t and k t are defined in line 21 of Algo. 5. roof. The cost-to-go functions satisfy the recursive equation (20) for t ∈ { , . . . , τ } cost t − ( x t − ) = 12 x ⊤ t − P t − x t − + p ⊤ t − x t − + min v t ∈ R ρt (cid:26) x ⊤ t − R t v t + 12 v ⊤ t Q t v t + q ⊤ t v t + κ k v t k + cost t ( A t x t − + B t v t ) (cid:27) , starting from cost τ ( x τ ) = x ⊤ τ P τ x τ + p ⊤ τ x τ so we get C τ = P τ and c τ = p τ . Assume that the cost-to-go function cost t has the form (31) for t ∈ { , . . . , τ } then the recursive equation (20) reads cost t − ( x t − ) = 12 x ⊤ t − ( P t − + A ⊤ t C t A t ) x t − + ( A ⊤ t c t + p t − ) ⊤ x t − + min v t ∈ R ρt n v ⊤ t ( R ⊤ t x t − + q t + B ⊤ t ( C t A t x t − + c t )) + 12 v ⊤ t ( κ I + Q t + B ⊤ t C t B t ) v t o . If κ I + Q t + B ⊤ t C t B t , then the minimization problem is unbounded below and so is the original objective. If κ I + Q t + B ⊤ t C t B t ≻ , then the minimization gives us cost t − as a quadratic and the corresponding minimizer v ∗ t ( x t − ) for a given x t − , i.e. C t − = P t − + A ⊤ t C t A t − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( R ⊤ t + B ⊤ t C t A t ) ,c t − = A ⊤ t c t + p t − − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( q t + B ⊤ t c t ) ,v ∗ t ( x t − ) = − ( κ I + Q t + B ⊤ t C t B t ) − (( R ⊤ t + B ⊤ t C t A t ) x t − + q t + B ⊤ t c t ) . The solution of (18) is given by computing cost (0) which amounts to compute, starting from y = 0 , v ∗ t = arg min v ∈ R ρt (cid:26) v ⊤ Q t v + q ⊤ t v + y ⊤ t − R t v + cost t +1 ( A t y t − + B t v ) (cid:27) = v ∗ t ( x t − ) ,y t = A t y t − + B t v ∗ t . Finally we present the derivation of a gradient step, i.e., gradient back-propagation, as a dynamic programmingprocedure, which gives the forward-backward algorithm presented in Sec. 3 by taking r = 0 , κ = − . Proposition B.3.
Consider the gradient step (15) as formulated in (18) with κ = 1 /γ . The cost-to-go functionsdefined as in (30) are linear of the form cost t ( x t ) = λ ⊤ t x t + cste , (32) where λ τ = ∇ h ( ψ ( u )) λ t − = ∇ x t − φ t ( x t − , u t ) λ t for t ∈ { , . . . , τ } and the solution of the step is given by v ∗ t = − γ ( ∇ r ( u t ) + ∇ u t φ t ( x t − , u t ) λ t ) . Proof.
The cost-to-go function defined in (30) for a gradient step reads for t = τ , cost τ ( x τ ) = p ⊤ τ x τ , so we getEq. (32) for t = τ with λ τ = ∇ h ( ψ ( u )) . Assume that the cost-to-go function has the form (32) for t ∈ { , . . . , τ } ,then the recursive equation (20) reads cost t − ( x t − ) = min v t ∈ R ρt (cid:26) v ⊤ t q t + λ ⊤ t ( A t x t − + B t v t ) + 12 γ k v t k (cid:27) So we get that cost t − is a linear function defined by λ t − = A ⊤ t λ t and that the optimal corresponding parameter isindependent of x t − and reads v ∗ t = − γ ( q t + B ⊤ t λ t ) . Plugging the values of A t , B t , q t into the solutions give the results.22 lgorithm 5 Newton oracle by dynamic programming Inputs:
Chain of computations f defined by φ t , objective h , regularization r , regularization for the step κ , currentweights u = ( u ; . . . ; u τ ) Forward pass: for t = 1 , . . . , τ do Compute x t = φ t ( x t − , u t ) Store A t = ∇ x t − φ t ( x t − , u t ) ⊤ , B t = ∇ u t φ t ( x t − , u t ) ⊤ and ∇ u t u t φ t ( x t − , u t ) , ∇ u t x t − φ t ( x t − , u t ) , ∇ x t − x t − φ t ( x t − , u t ) end for Initialize λ τ = ∇ h ( x τ ) , P τ = ∇ h ( x τ ) , p τ = ∇ h ( x τ ) for t = τ, . . . , do Compute P t − = ∇ x t − x t − φ t ( x t − , u t )[ · , · , λ t ] p t − = 0 Q t = ∇ u t u t φ t ( x t − , u t )[ · , · , λ t ] + ∇ r t ( u t ) q t = ∇ r t ( u t ) R t = ∇ x t − u t φ t ( x t − , u t )[ · , · , λ t ] Compute λ t − = ∇ x t − φ t ( x t − , u t ) λ t end for Initialize C τ = P τ , c τ = p τ , feasible = True for t = τ, . . . , do if κ I + Q t + B ⊤ t C t B t then feasible = False break end if
Compute C t − = P t − + A ⊤ t C t A t − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( R ⊤ t + B ⊤ t C t A t ) c t − = A ⊤ t c t + p t − − ( R t + A ⊤ t C t B t )( κ I + Q t + B ⊤ t C t B t ) − ( q t + B ⊤ t c t ) Store K t = − ( κ I + Q t + B ⊤ t C t B t ) − ( R ⊤ t + B ⊤ t C t A t ) k t = − ( κ I + Q t + B ⊤ t C t B t ) − ( q t + B ⊤ t c t ) end for if feasible = False then
Re-do 2nd backward pass with κ := 2 · κ end if Rollout:
Initialize y = 0 for t = 1 , . . . , τ do v ∗ t = K t y t − + k t , y t = A t y t − + B t v t end for Output: ( v ∗ ; . . . ; v ∗ τ ) .3 Detailed complexities of forward and backward passes Definition B.4 (Sparsity of the operations) . We define the sparsity s β of a bilinear operation β as the number ofnon-zero elements in its corresponding tensor.We define the sparsity s α of a function α as the sparsity of its gradient, i.e., the maximal number of its non-zeroelements for any inputs. The sparsity of a bilinear operation amounts to the number of multiplications needed to compute B [ x, y, z ] , B [ · , y, z ] , B [ x, · , z ] or B [ x, y, · ] , which gives us the sparsity of the two bilinear operations studied in this paper. Fact B.5.
For a matrix-product as in (8) , we have s β = m ˜ δδ . For a convolution as in (9) , we have s β = mn p n f s f . We considered Π k Z t − as the extraction of coordinates and not a matrix-vector product. Note that the sparsityof the bilinear operation defines also the number of multiplications needed to compute gradient vector products like ∇ x t − β t ( x t − , u t ) λ t +1 or ∇ u t β t ( x t − , u t ) λ t +1 for λ t +1 ∈ R δ t .The sparsity of a function f ∈ R d → R n naturally gives the number of multiplications needed to compute gradient-vector products ∇ f ( x ) λ for any x ∈ R d , λ ∈ R n . For element-wise activation functions as in (10), we have s α = mδ ,where we consider the input of the activation function to be z = Vec( Z ) for Z ∈ R m × δ . Note that the sparsity of anactivation function as defined here does not directly give the cost of computing it, neither its gradient. Forward-backward detailed complexity.
We present in the next proposition the cost of computing only the back-ward pass to compute the whole gradient. The cost of computing the function and the gradients of the layers in theforward pass can be further detailed using the sparsity of the bilinear operation and the cost of computing the activationfunction and its derivatives. The detailed complexities given in Sec. 3 follow.
Proposition B.6.
Consider a chain f of τ layers as defined in Def. 2.1 whose layers φ t are defined by a t , b t as in (7) .Then the cost of the backward pass defined in Algo. 2 is of the order of O τ X t =1 s a t + 2 s β t + s β ut + s β xt ! elementary operations.Proof. If the chain of layers has the form (7), the gradient vector products during the backward pass read ∇ x t − φ t ( x t − , u t ) λ t +1 = ∇ x t − b t ( x t − , u t ) ∇ a t ( ω t ) λ t +1 = ( B t [ · , u t , · ] + ∇ β zt ( x t − )) ∇ a t ( ω t ) λ t +1 , ∇ u t φ t ( x t − , u t ) λ t +1 = ∇ u t b t ( x t − , u t ) ∇ a t ( ω t ) λ t +1 = ( B t [ x t − , · , · ] + ∇ β ut ( u t )) ∇ a t ( ω t ) λ t +1 , where ω t = b t ( x t − , u t ) . The definitions of the sparsity of bilinear or general operations give the result by looking ateach operation starting from the right. B.4 Gauss-Newton by axutomatic differentiation
Derivatives of the gradient vector product can then be computed themselves by back-propagation as recalled in thefollowing lemma.
Lemma B.7 ([21, Lemma 3.4]) . Consider a differentiable chain of composition f and an input x ∈ R d suchthat ψ = f x ,τ : R P τt =1 p t → R d τ . Given a variable u ∈ R P τt =1 p t and a decomposable differentiable function g : R P τt =1 p t → R such that g ( u ) = P τt =1 g t ( u t ) for u = ( u ; . . . ; u τ ) , computing the derivative of µ → g ( ∇ ψ ( u ) µ ) requires two calls to an automatic-differentiation procedure. Proposition 3.3.
Consider the Gauss-Newton oracle (16) on u = ( u ; . . . ; u τ ) for a convex objective h , a convexdecomposable regularization r ( u ) = P τt =1 r t ( u t ) and a differentiable chain of computations f with output ψ = f x ,τ on some input x . We have that . the Gauss-Newton oracle amounts to solving min µ ∈ R dτ (cid:16) q ψ ( u ) h (cid:17) ⋆ ( µ ) + (cid:0) q ur + κ k · k / (cid:1) ⋆ ( −∇ ψ ( u ) µ ) , (22) where for a function f we denote by f ⋆ its convex conjugate,2. the Gauss-Newton oracle is v ∗ = ∇ (cid:0) q ur + κ k · k / (cid:1) ⋆ ( −∇ ψ ( u ) µ ∗ ) where µ ∗ is the solution of (22) ,3. the dual problem (22) can be solved by d τ + 1 calls to an automatic differentiation procedure.Proof. The first and second claims follow from standard duality computations applied to (16), they require convexityof h and r . The third claim comes from the fact that (22) is a quadratic convex problem that can be solved in at most d τ iterations of a conjugate gradient descent. Each iteration requires to compute the gradient of µ → ( q ur + κ k ·k / ⋆ ( −∇ ψ ( u ) µ ) which requires two calls to an automatic differentiation procedure by Lemma B.7 and using that r ∗ is also decomposable. A last call to an automatic differentiation procedure is needed to compute ∇ ψ ( u ) µ ∗ . Thecosts of computing ∇ ( q ψ ( u ) h ) ⋆ ( µ ) for µ ∈ R d τ and ∇ ( q ur + κ k · k / ⋆ ( u ) for u ∈ R P τt =1 p t are ignored since they donot involve a chain of compositions and are assumed to be easily accessible. C Smoothness computations
C Smoothness computations
C.1 Elementary operations
Univariate functions.
Lemma C.1.
Let $\alpha_i \in \mathcal{C}_{\ell_i, L_i}$ for $i = 1, \ldots, n$. Denote $\ell = (\ell_i)_{i=1}^n$ and $L = (L_i)_{i=1}^n$.
1. Assume $\alpha_i: \mathbb{R}^{d_i} \to \mathbb{R}^{m_i}$; then $a: x = (x_1; \ldots; x_n) \in \mathbb{R}^{\sum_i d_i} \mapsto (\alpha_1(x_1); \ldots; \alpha_n(x_n)) \in \mathbb{R}^{\sum_i m_i}$ is $\|\ell\|_2$-Lipschitz continuous and $\|L\|_\infty$-smooth.
2. Assume $\alpha_i: \mathbb{R}^{d_i} \to \mathbb{R}^m$; then $a: x = (x_1; \ldots; x_n) \mapsto \sum_{i=1}^n \alpha_i(x_i)$ is $\|\ell\|_2$-Lipschitz continuous and $\|L\|_\infty$-smooth.
3. Assume $\alpha_i: \mathbb{R}^d \to \mathbb{R}^{m_i}$; then $a: x \mapsto (\alpha_1(x); \ldots; \alpha_n(x))$ is $\|\ell\|_2$-Lipschitz continuous and $\|L\|_2$-smooth.
4. Assume $\alpha_i: \mathbb{R}^d \to \mathbb{R}^m$; then $a: x \mapsto \sum_{i=1}^n \alpha_i(x)$ is $\|\ell\|_1$-Lipschitz continuous and $\|L\|_1$-smooth.
Proof.
1. We have for x = ( x ; . . . ; x n ) ∈ R P ni =1 d i and z = ( z ; . . . ; z n ) ∈ R P ni =1 m i , k∇ a ( x ) z k = k n X i =1 ∇ α i ( x i ) z i k ≤ n X i =1 k z i k k∇ α i ( x i ) k , ≤ k z k vuut n X i =1 ℓ i , a . For x = ( x ; . . . ; x n ) , y = ( y ; . . . ; y n ) ∈ R P ni =1 d i , we have k ( ∇ a ( x ) −∇ a ( y )) z k ≤ n X i =1 k z i k k∇ α i ( x i ) −∇ α i ( y i ) k , ≤ n X i =1 k z i k k x i − y i k L i ≤ k z k k x − y k max i ∈{ ,...,n } L i . Hence k∇ a ( x ) − ∇ a ( y ) k , ≤ k x − y k max i ∈{ ,...,n } L i which gives an upper bound on the smoothness of a .2. We have for x = ( x ; . . . ; x n ) ∈ R P ni =1 d i and z ∈ R m , k∇ a ( x ) z k = n X i =1 k∇ α i ( x i ) z k ≤ n X i =1 ℓ i k z k , which gives the Lipschitz-continuity parameter. Similarly we have for x = ( x ; . . . ; x n ) , y = ( y ; . . . ; y n ) ∈ R P ni =1 d i , k ( ∇ a ( x ) −∇ a ( y )) z k = n X i =1 k ( ∇ α i ( x i ) −∇ α i ( y i )) z k ≤ n X i =1 L i k x i − y i k k z k ≤ max i ∈{ ,...,n } L i k x − y k k z k , which gives the smoothness constant of a .3. The bound on the Lipschitz-continuity parameter follows from the same argument as in 1. For the smoothnessparameter, we have for x, y ∈ R d and z = ( z ; . . . ; z n ) ∈ R P ni =1 m i , k ( ∇ a ( x ) − ∇ a ( y )) z k ≤ n X i =1 k z i k k∇ α i ( x ) − ∇ α i ( y ) k , ≤ n X i =1 k z i k k x − y k L i ≤ k z k k x − y k vuut n X i =1 L i . Hence the result as in 1.4. Clear by linearity of the gradient and triangular inequality.
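As a quick numerical illustration of item 1 (a sketch of our own, with arbitrary maps and dimensions), the empirical Lipschitz ratio of a block-wise map stays below the $\|\ell\|_2$ bound of the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)

def a(x):
    # a(x) = (alpha_1(x_1); alpha_2(x_2)) with tanh and sin, both 1-Lipschitz, so ||l||_2 = sqrt(2).
    return np.concatenate([np.tanh(x[:3]), np.sin(x[3:])])

ratio = 0.0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    ratio = max(ratio, np.linalg.norm(a(x) - a(y)) / np.linalg.norm(x - y))
print(ratio, "<=", np.sqrt(2))    # empirical ratio below the bound of Lemma C.1
```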
Bilinear functions.
Lemma C.2.
Consider s × t bilinear functions β i +( j − s : R d i × R p j → R m i +( j − s for i ∈ { , . . . s } , j ∈ { , . . . t } then β : ( R P si =1 d i × R P tj =1 p j → R P stk =1 m k ( x, u ) → ( β ( x , u ); . . . ; β s ( x s , u ); β s +1 ( x , u ); . . . ; β st ( x s , u t )) is L β = max k ∈{ ,...,st } L β k smooth.Proof. By Lemma A.7, we have that L β = sup x,u k β ( x, u ) k / k x k k u k . Now k β ( x, u ) k = s X i =1 t X j =1 k β i + s ( j − ( x i , u j ) k ≤ s X i =1 t X j =1 L β i + s ( j − k x i k k u j k ≤ max k ∈{ ,...,st } L β k k x k k u k . C.2 Compositions
The smoothness properties of the functions can be derived by bounding appropriately their first and second order information. Even if, e.g., the functions are not twice differentiable, the same results would apply by decomposing carefully the terms; we directly use the second order information as it directly gives what we are interested in. For a smooth function, an upper bound on the Lipschitz continuity of the function on a bounded set can be estimated even if the function is not Lipschitz continuous. Similarly, a bound on the function on a bounded set can be refined as defined below.
Fact C.3.
For a function $f \in \mathcal{C}_{m_f, \ell_f, L_f}$ and $R > 0$, denoting $B_R = \{x \in \operatorname{dom} f: \|x\| \leq R\}$, we have
$$\ell_f^{B_R} \leq \ell_f(R) := \min\{\ell_f,\ \|\nabla f(0)\|_{2,2} + R\, L_f\}, \qquad m_f^{B_R} \leq m_f(R) := \min\{m_f,\ \|f(0)\| + R\, \ell_f(R)\}.$$
For a sequence of compositions we have the following result.
Lemma C.4.
Consider $a = a_k \circ \ldots \circ a_1$ with $a_j \in \mathcal{C}_{m_{a_j}, \ell_{a_j}, L_{a_j}}$ for $j \in \{1, \ldots, k\}$ and $a: \mathbb{R}^d \to \mathbb{R}^n$. Denote $B_R = \{x \in \mathbb{R}^d: \|x\| \leq R\}$ and, for $j \in \{1, \ldots, k\}$,
$$m_j = m_{a_j}(m_{j-1}), \qquad \ell_j = \ell_{j-1}\, \ell_{a_j}(m_{j-1}), \qquad L_j = L_{a_j} \ell_{j-1}^2 + L_{j-1}\, \ell_{a_j}(m_{j-1}),$$
with $m_0 = R$, $\ell_0 = 1$, $L_0 = 0$. We have
$$m_a^{B_R} \leq m_k, \qquad \ell_a^{B_R} \leq \ell_k = \prod_{j=1}^k \ell_{a_j}(m_{j-1}), \qquad L_a^{B_R} \leq L_k = \sum_{j=1}^k L_{a_j} \Big(\prod_{i=1}^{j-1} \ell_{a_i}(m_{i-1})\Big)^2 \prod_{i=j+1}^k \ell_{a_i}(m_{i-1}).$$
Proof.
The bound on the output is a direct iterative application of Fact C.3. We have for x ∈ R d , ∇ a ( x ) = k Y j =1 g j ( x ) , where g j ( x ) = ∇ a j ( a j − ◦ . . . ◦ a ( x )) for j ∈ { , . . . , k } . We have sup x ∈ R d : k x k ≤ R k g j ( x ) k , ≤ min { ℓ a j , k∇ a j (0) k , + L a j m B R a j − ◦ ... ◦ a } . Therefore ℓ B R a ≤ k Y j =1 ℓ a j ( m j − ) . We have for x ∈ R d , ∇ a ( x ) = k X j =1 ∇ a j ( x ) j − Y i =1 g i ( x ) ! ⊤ , j − Y i =1 g i ( x ) ! ⊤ , k Y i = j +1 g i ( x ) . Therefore L B R a ≤ k X j =1 L a j j − Y i =1 ℓ a i ( m i − ) ! k Y i = j +1 ℓ a i ( m i − ) . Lemma C.4 can be used to estimate the smoothness of a chain of computations with respect to its input for fixedparameters.
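The recursion of Lemma C.4 is mechanical and can be tabulated layer by layer; as a parenthetical illustration, the sketch below (our own, using global constants in place of the refined bounds $\ell_{a_j}(m_{j-1})$ of Fact C.3 for brevity) shows the bookkeeping.

```python
def compose_bounds(layers, R):
    # layers: list of dicts with global constants m (output bound), l (Lipschitz), L (smoothness)
    # and f0 (norm of the value at 0). Returns bounds on the composition over the ball of radius R.
    m, l, L = R, 1.0, 0.0
    for a in layers:
        m_next = min(a["m"], a["f0"] + a["l"] * m)   # output bound, Fact C.3
        L = a["L"] * l ** 2 + L * a["l"]             # smoothness recursion of Lemma C.4
        l = l * a["l"]                               # Lipschitz recursion of Lemma C.4
        m = m_next
    return m, l, L

# Three hypothetical 1-Lipschitz, 1-smooth maps vanishing at 0, over a ball of radius 10.
print(compose_bounds([{"m": float("inf"), "l": 1.0, "L": 1.0, "f0": 0.0}] * 3, R=10.0))
```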
Corollary C.5.
Consider a chain $f$ of $\tau$ computations $\phi_t \in \mathcal{C}_{m_{\phi_t}, \ell_{\phi_t}, L_{\phi_t}}$ with given parameters $u = (u_1; \ldots; u_\tau)$. Denote $\phi_{t, u_t} = \phi_t(\cdot, u_t)$, $B_R = \{x \in \mathbb{R}^d: \|x\| \leq R\}$ and, for $t \in \{1, \ldots, \tau\}$,
$$m_t = m_{\phi_t(\cdot, u_t)}(m_{t-1}), \qquad \ell_t = \ell_{t-1}\, \ell_{\phi_t(\cdot, u_t)}(m_{t-1}), \qquad L_t = L_{\phi_t(\cdot, u_t)} \ell_{t-1}^2 + L_{t-1}\, \ell_{\phi_t(\cdot, u_t)}(m_{t-1}),$$
with $m_0 = R$, $\ell_0 = 1$, $L_0 = 0$. We have
$$m_{f_{\tau, u}}^{B_R} \leq m_\tau, \qquad \ell_{f_{\tau, u}}^{B_R} \leq \ell_\tau = \prod_{t=1}^\tau \ell_{\phi_t(\cdot, u_t)}(m_{t-1}), \qquad L_{f_{\tau, u}}^{B_R} \leq L_\tau = \sum_{t=1}^\tau L_{\phi_t(\cdot, u_t)} \Big(\prod_{s=1}^{t-1} \ell_{\phi_s(\cdot, u_s)}(m_{s-1})\Big)^2 \prod_{s=t+1}^\tau \ell_{\phi_s(\cdot, u_s)}(m_{s-1}).$$
We have the following result for smooth and Lipschitz continuous chains of computations.
Lemma 4.3.
Consider a chain $f$ of $\tau$ computations $\phi_t \in \mathcal{C}_{\ell_{\phi_t}, L_{\phi_t}}$ initialized at some $x \in \mathbb{R}^d$.
(i) We have $\ell_{f_{\tau,x}} \leq \ell_\tau$, where $\ell_0 = 0$ and $\ell_t = \ell_{\phi_t} + \ell_{t-1} \ell_{\phi_t}$ for $t \in \{1, \ldots, \tau\}$.
(ii) We have $L_{f_{\tau,x}} \leq L_\tau$, where $L_0 = 0$ and $L_t = L_{t-1} \ell_{\phi_t} + L_{\phi_t} (1 + \ell_{t-1})^2$ for $t \in \{1, \ldots, \tau\}$.
Proof. The first claim follows directly from Lemma B.1. For the second claim, (27) gives $L_{f_t} \leq L_{f_{t-1}} \ell_{\phi_t} + L_{\phi_t} \ell_{f_{t-1}}^2 + 2 L_{\phi_t} \ell_{f_{t-1}} + L_{\phi_t}$, which simplifies to give the result.
For a bivariate function $\phi(x, u): \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^\eta$, we define
$$\ell^u_\phi = \sup_{u \in \mathbb{R}^p,\, x \in \mathbb{R}^d} \ell_{\phi(x, u + \cdot)}, \qquad \ell^x_\phi = \sup_{u \in \mathbb{R}^p,\, x \in \mathbb{R}^d} \ell_{\phi(x + \cdot, u)}.$$
Moreover, if the function is continuously differentiable, we define
$$L^{uu}_\phi = \sup_{u, x} \ell_{\nabla_u \phi(x, u + \cdot)}, \quad L^{xu}_\phi = \sup_{u, x} \ell_{\nabla_u \phi(x + \cdot, u)}, \quad L^{ux}_\phi = \sup_{u, x} \ell_{\nabla_x \phi(x, u + \cdot)}, \quad L^{xx}_\phi = \sup_{u, x} \ell_{\nabla_x \phi(x + \cdot, u)}.$$
For a bivariate continuously differentiable function $\phi(x, u): \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^\eta$, we have
$$\ell^u_\phi = \sup_{u, x} \|\nabla_u \phi(x, u)\|_{2,2}, \qquad \ell^x_\phi = \sup_{u, x} \|\nabla_x \phi(x, u)\|_{2,2}.$$
If the function $\phi$ is twice continuously differentiable, we have
$$L^{uu}_\phi = \sup_{u, x} \|\nabla_{uu} \phi(x, u)\|_{2,2,2}, \quad L^{xu}_\phi = \sup_{u, x} \|\nabla_{xu} \phi(x, u)\|_{2,2,2}, \quad L^{ux}_\phi = \sup_{u, x} \|\nabla_{ux} \phi(x, u)\|_{2,2,2}, \quad L^{xx}_\phi = \sup_{u, x} \|\nabla_{xx} \phi(x, u)\|_{2,2,2}.$$
Finally, for $R_x \geq 0$, $R_u \geq 0$, we have
$$\sup_{(x, u) \in B_{R_x} \times B_{R_u}} \|\nabla_u \phi(x, u)\|_{2,2} \leq \ell^u_\phi(R_x, R_u) := \min\{\ell^u_\phi,\ \|\nabla_u \phi(0, 0)\|_{2,2} + L^{uu}_\phi R_u + L^{xu}_\phi R_x\}, \qquad (33)$$
$$\sup_{(x, u) \in B_{R_x} \times B_{R_u}} \|\nabla_x \phi(x, u)\|_{2,2} \leq \ell^x_\phi(R_x, R_u) := \min\{\ell^x_\phi,\ \|\nabla_x \phi(0, 0)\|_{2,2} + L^{xx}_\phi R_x + L^{ux}_\phi R_u\}.$$
We then have the following.
Lemma C.6.
Let f be a chain of τ computations φ t ∈ C m φt ,ℓ uφt ,ℓ xφt ,L uuφt ,L xuφt ,L xxφt , initialized at some x such that k x k ≤ R . Let C = N τt =1 B R t ( R p t ) = { u = ( u ; . . . ; u τ ) ∈ R P τt =1 p t : u t ∈ R p t , k u t k ≤ R t } . Define for t ∈ { , . . . , τ } , m t = min { m φ t , k φ t (0 , k + ℓ uφ t ( m t − , R t ) R t + ℓ xφ t ( m t − , R t ) m t − } ,ℓ t = ℓ uφ t ( m t − , R t ) + ℓ t − ℓ xφ t ( m t − , R t ) ,L t = L t − ℓ xφ t ( m t − , R t ) + L xxφ t ℓ t − + ( L xuφ t + L uxφ t ) ℓ t − + L uuφ t . with m = R , ℓ = 0 , L = 0 . We have that m Cf τ ,x ≤ m τ , ℓ Cf τ ,x ≤ ℓ τ , L Cf τ ,x ≤ L τ . Proof.
The result directly follows from Lemma B.1, with the Lipschitz-continuity constants derived in (33).
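In the same spirit as the sketch for Lemma C.4, the recursion of Lemma C.6 can be tabulated directly; the following sketch (our own naming, with placeholder constants) propagates the bounds $(m_t, \ell_t, L_t)$ over the product of balls.

```python
def chain_bounds(layers, R0, Rs):
    # layers: per-layer constants; l_u and l_x are callables implementing the refined bounds (33).
    m, l, L = R0, 0.0, 0.0
    for phi, R_t in zip(layers, Rs):
        l_u, l_x = phi["l_u"](m, R_t), phi["l_x"](m, R_t)
        L = L * l_x + phi["L_xx"] * l ** 2 + (phi["L_xu"] + phi["L_ux"]) * l + phi["L_uu"]
        m = min(phi["m"], phi["phi00"] + l_u * R_t + l_x * m)   # bound on the layer output
        l = l_u + l * l_x                                       # Lipschitz w.r.t. all parameters
    return m, l, L

# A plain bilinear layer phi(x, u) = B[x, u, .] with L_B = 1 (so l_u ~ ||x||, l_x ~ ||u||).
bilinear = {"m": float("inf"), "phi00": 0.0, "l_u": lambda m, R: m, "l_x": lambda m, R: R,
            "L_xx": 0.0, "L_uu": 0.0, "L_xu": 1.0, "L_ux": 1.0}
print(chain_bounds([dict(bilinear)] * 3, R0=1.0, Rs=[1.0, 1.0, 1.0]))
```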
Proof.
The proof relies on Lemma C.6, where the smoothness of the inner compositions are computed according toLemma C.4. Namely, we have ℓ uφ t ( R x , R u ) ≤ ℓ a t ( m b t ( R x , R u )) ℓ ub t ( R x , R u ) , ℓ xφ t ( R x , R u ) ≤ ℓ a t ( m b t ( R x , R u )) ℓ xb t ( R x , R u ) , with ℓ ub t ( R x , R u ) = L b t R x + l ub t , ℓ xb t ( R x , R u ) = L b t R u + l xb t ,m b t ( R x , R u ) = ℓ ub t ( R x , R u ) R u + ℓ xb t ( R x , R u ) R x + k b t (0 , k , and ℓ a t can be computed as in Lemma. C.4. On the other hand, denoting L xxφ t ( R x , R u ) = sup ( x,u ) ∈ B Rx × B Ru k∇ xx φ t ( x, u ) k , , (and similarly for L uuφ t , L uxφ t , L xuφ t ), we have L xxφ t ( R x , R u ) ≤ L a t ( m b t ( R x , R u )) ℓ xb t ( R x , R u ) L uuφ t ( R x , R u ) ≤ L a t ( m b t ( R x , R u )) ℓ ub t ( R x , R u ) L xu ( R x , R u ) = L ux ( R x , R u ) = L b t ℓ a t ( m b t ( R x , R u )) + L a t ( m b t ( R x , R u )) ℓ ub t ( R x , R u ) ℓ xb t ( R x , R u ) , where L a t ( m b t ( R x , R u )) is computed by Lemma C.4. D Smoothness of objectives and layers
D.1 Supervised objectives
For supervised objectives $h: \mathbb{R}^{n d_\tau} \to \mathbb{R}$ that read, for $\hat y = (\hat y_1; \ldots; \hat y_n)$ with $\hat y_i \in \mathbb{R}^{d_\tau}$, $h(\hat y) = \frac{1}{n} \sum_{i=1}^n h_i(\hat y_i)$, we only need to compute the smoothness of each $h_i$ (see Lemma C.1), which is usually defined by a loss $h_i(\hat y_i) = \mathcal{L}(\hat y_i, y_i)$. We are interested in this section in the smoothness $L_h(C)$ and Lipschitz continuity $\ell_h(C)$ of the objective $h$ on a set $C$. We omit the dependency on the set $C$ if the Lipschitz continuity or smoothness properties of the functions hold on their whole domain.
Square loss.
Assume that the labels belong to a compact set $\mathcal{Y}$. The square loss is defined by $h(\hat y) = \mathcal{L}_{\mathrm{sq}}(\hat y, y) = \|\hat y - y\|^2/2$. We have then
$$\ell_{\mathrm{sq}}(C) = \rho_C + \rho_{\mathcal{Y}}, \qquad L_{\mathrm{sq}} = 1,$$
where $\rho_C = \max_{x \in C} \|x\|$ and $\rho_{\mathcal{Y}} = \max_{y \in \mathcal{Y}} \|y\|$.
Logistic loss. Consider $y \in \{0, 1\}^q$; the logistic loss is defined as $h(\hat y) = \mathcal{L}_{\log}(\hat y, y) = -y^\top \hat y + \log\big(\sum_{j=1}^q \exp(\hat y_j)\big)$. We have then, denoting $\exp(\hat y) = (\exp(\hat y_i))_{i=1,\ldots,q}$,
$$\nabla h(\hat y) = -y + \frac{\exp(\hat y)}{\exp(\hat y)^\top \mathbf{1}_q}, \qquad \nabla^2 h(\hat y) = \frac{\operatorname{diag}(\exp(\hat y))}{\exp(\hat y)^\top \mathbf{1}_q} - \frac{\exp(\hat y)\exp(\hat y)^\top}{(\exp(\hat y)^\top \mathbf{1}_q)^2}.$$
Therefore, using that $y \in \{0, 1\}^q$ and that $\|\exp(\hat y)\|_2 \leq \|\exp(\hat y)\|_1$,
$$\ell_{\log} \leq 2, \qquad L_{\log} \leq 2.$$
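These bounds are easy to sanity-check numerically; the sketch below (ours, not part of the derivation, and assuming one-hot labels) evaluates the gradient and Hessian of the logistic loss at random points and verifies that their norms stay below 2.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 10
for _ in range(100):
    yhat = rng.normal(size=q)
    y = np.eye(q)[rng.integers(q)]                 # one-hot label
    p = np.exp(yhat) / np.exp(yhat).sum()          # soft-max probabilities
    grad = -y + p
    hess = np.diag(p) - np.outer(p, p)
    assert np.linalg.norm(grad) <= 2.0
    assert np.linalg.norm(hess, 2) <= 2.0
print("gradient and Hessian norms of the logistic loss are within the stated bounds")
```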
D.2 Unsupervised objectives
For the k-means and spectral clustering objectives, we consider the outputs of the chains of computations to form a matrix $F(\bar x, u) = (f(\bar x^{(1)}, u), \ldots, f(\bar x^{(n)}, u)) \in \mathbb{R}^{q \times n}$, where $q = d_\tau$ and $n$ is the number of samples. The objectives are then $h: \mathbb{R}^{q \times n} \to \mathbb{R}$ and we denote by $Z \in \mathbb{R}^{q \times n}$ their variable. The overall objective is $h(F(\bar x, u))$ for $\bar x = (\bar x^{(1)}; \ldots; \bar x^{(n)})$. We denote by $k$ the number of classes that the unsupervised objective aims to cluster and $\mathcal{Y} = \{Y = (y_1, \ldots, y_n)^\top \in \{0, 1\}^{n \times k} \text{ s.t. } Y \mathbf{1}_k = \mathbf{1}_n\}$.
K-means clustering.
The K-means clustering objective reads
$$h(Z) = \min_{Y \in \mathcal{Y},\, C \in \mathbb{R}^{q \times k}} \sum_{i=1}^n \|C y_i - z_i\|^2$$
for $Z = (z_1, \ldots, z_n) \in \mathbb{R}^{q \times n}$. Minimization in $C$ can be performed analytically such that the problem can be rewritten as
$$h(Z) = \min_{N \in \mathcal{N}} \operatorname{Tr}\big((\mathrm{I}_n - N) Z^\top Z\big),$$
where $\mathcal{N} = \{N = Y(Y^\top Y)^{-1} Y^\top \in \mathbb{R}^{n \times n} \text{ for } Y \in \mathcal{Y},\ Y^\top Y \succ 0\}$ is the set of normalized equivalence matrices.
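The two formulations agree for any fixed assignment, which a short computation confirms; the sketch below (our own, assuming every cluster is non-empty so that $Y^\top Y$ is invertible) evaluates both.

```python
import numpy as np

def kmeans_objective(Z, labels, k):
    # Z: (q, n) matrix of network outputs, labels: cluster index of each column.
    q, n = Z.shape
    Y = np.zeros((n, k)); Y[np.arange(n), labels] = 1.0        # assignment matrix in the set Y
    C = Z @ Y @ np.linalg.inv(Y.T @ Y)                         # optimal centroids for this Y
    direct = ((Z - C @ Y.T) ** 2).sum()                        # sum_i ||C y_i - z_i||^2
    N = Y @ np.linalg.inv(Y.T @ Y) @ Y.T                       # normalized equivalence matrix
    via_trace = np.trace((np.eye(n) - N) @ Z.T @ Z)
    return direct, via_trace                                   # equal up to rounding errors

rng = np.random.default_rng(0)
print(kmeans_objective(rng.normal(size=(4, 12)), np.arange(12) % 3, k=3))
```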
Spectral clustering. A natural relaxation of K-means is spectral clustering, which considers $\mathcal{P} = \{P \in \mathbb{R}^{n \times n} \text{ s.t. } P \succeq 0,\ P^2 = P,\ \operatorname{Rank}(P) = k\} \supset \mathcal{N}$ instead of the set of normalized equivalence matrices. The solution of $h(Z) = \min_{P \in \mathcal{P}} \operatorname{Tr}\big((\mathrm{I}_n - P) Z^\top Z\big)$ is then given by finding the $k$ largest eigenvectors of the Gram matrix $Z^\top Z$. Formally, the objective is written
$$h(Z) = \sum_{i=k+1}^n \sigma_i(Z)^2,$$
where for a matrix $A$, $\sigma_1(A) \geq \ldots \geq \sigma_n(A)$ are the singular values of $A$ in decreasing order. The objective $h$ is then a spectral function of the matrix $Z$.
Convex clustering.
The convex clustering objective reads for ˆ y = (ˆ y ; . . . ; ˆ y n ) ∈ R qn h (ˆ y ) = min y (1) ,...,y ( n ) ∈ R q n X i =1 k y ( i ) − ˆ y ( i ) k + X i 30s the vector corresponding to the group g of size s g . Here the groups are defined by all possible differences for i < j in Eq. (34). Proposition D.1. The convex-clustering objective h (ˆ y ) = min y ∈ R qn k y − ˆ y k + k Dy k G is convex, Lipschitz-continuous and smooth with parameters ℓ cvx-cluster ≤ n ( n − , L cvx-cluster ≤ . Proof. The convex clustering objective h is the Moreau envelope of the function Ω : y → k Dy k G . It is thereforeconvex and -smooth, i.e., L h = 1 . Moreover, the Moreau envelope can be rewritten h (ˆ y ) = sup z ∈ dom(Ω ∗ ) ˆ y ⊤ z − Ω ∗ ( z ) − k z k , where Ω ∗ is the convex conjugate of Ω . Therefore ∇ h (ˆ y ) ∈ dom(Ω ∗ ) . We have that Ω ∗ ( z ) = sup y ∈ R q z ⊤ y − k Dy k G ≥ sup y ∈ R q z ⊤ y − n ( n − k y k , such that the supremum is finite only if k z k > n ( n − . Therefore ∇ h (ˆ y ) ∈ dom(Ω ∗ ) ⊂ B (cid:18) , n ( n − (cid:19) , where B (0 , n ( n − ) is the Euclidean ball centered at with radius n ( n − . D.3 Bilinear and linear layers Vectorized matrix-products as a bilinear operation. Given two matrices A ∈ R n × d and B ∈ R d × p , the matrixproduct AB is defined by a tensor M = ((I d ⊗ e ( q mod n )+1 )( f ⊤⌈ q/n ⌉ ⊗ I d )) q =1 ,...,np ∈ R nd × dp × np where e i is the i th canonical vector in R n and f j is the j th canonical vector in R p such that Vec( AB ) = M [Vec( A ) , Vec( B ) , · ] . (35)This can be checked as for q = i + n ( j − ∈ { , . . . , np } , with i ∈ { , . . . n } , j ∈ { , . . . , p } , Vec( AB ) q = ( AB ) ij = Vec( e ⊤ i A ) ⊤ Vec( Bf j )= Vec( A ) ⊤ (I d ⊗ e i )( f ⊤ j ⊗ I d ) Vec( B )= ( M [Vec( A ) , Vec( B ) , · ]) q . Convolutional layer detailed. For completeness we detail the convolution for an image. For a convolutional layer,the input is an image I ∈ R C × H × B with C channels each composed of a matrix of height H and breadth B theweights are given by ˜ C filters F , . . . , F ˜ C ∈ R C × K × K of patch size K and the biases are given by b ∈ R ˜ C . Theconvolution of the image by a filter F ˜ c , with ˜ c ∈ { , . . . , ˜ C } with additional bias b ˜ c , is given at point i, j as C ˜ c,i,j = C X c =1 hF ˜ c [ c, · , · ] , E ⊤ row,i I [ c, · , · ] E col,j i + b ˜ c , where F ˜ c [ c, · , · ] is the filter of size K × K in channel c of filter F ˜ c and I [ c, · , · ] is the image in channel c .The matrices E row,i ∈ R H × K and E col,j ∈ R B × K extract rows and columns of I [ c, · , · ] . They are bands witha diagonal of K ones centered at positions i or j . If the pattern of the patch is given as P = K ⊤ K , the extractionmatrices read E row,i = e i ⊗ ⊤ K ∈ R H × K , e i ∈ R H for i ∈ { , . . . H } , similarly E col,j = e j ⊗ ⊤ K ∈ R W × K . Theysatisfy E ⊤ row,i E row,i = I K and E row,i E ⊤ row,i ∈ R H × H is a projector. Similarly facts apply for E col,j except that one31eplaces H by B . The output of the convolution with all filters is then a tensor C ∈ R ˜ H × ˜ B × ˜ C where ˜ H and ˜ B dependon the choices of the stride chosen in the convolution. Smoothness of fully-connected layer. For a fully connected layer, the bilinear function β ( x, u ) → U ⊤ X for u =Vec( U ) , x = Vec( X ) is clearly 1-smooth (because k U ⊤ X k F ≤ k U k F k X k F ). The linear part β u is clearly -Lipschitz continuous. So we get L full = 1 , l ufull = 1 . Smoothness of convolutional layer. 
For a convolution, by Lemma C.2, we only need to compute the smoothness ofthe convolution of an image with one filter. This is done by the following Lemma. Lemma D.2. Consider p subsets S k of { , . . . , n } of size | S k | = d . Denote Π k ∈ { , } d × n the linear form thatextracts the S k coordinates of a vector of size n , i.e., Π k z = z S k for z ∈ R n . The convolution of z ∈ R n by w ∈ R d through the p subsets S k defined as β ( z, w ) = ( w ⊤ Π z ; . . . ; w ⊤ Π p z ) is L β = p max i =1 ,...,n | V i | -smooth where V i = { S j : i ∈ S j } .Proof. We have k β ( z, w ) k = p X j =1 ( w ⊤ Π j z ) ≤ p X j =1 k w k k z S j k = k w k d X i =1 X S j ∈ V i z i ≤ k w k max i =1 ,...,n | V i |k z k . Concretely, for a convolution such that at most p patches contain a coordinate i the convolution is √ p -smooth. Ifthe patches do not overlap then the convolution is -smooth. If the convolution has a stride of 1 and the operation isnormalized by the size of the filters then the convolution has again a smoothness constant of 1. Generally for a 2dconvolution with a kernel of size k × k and a stride of s , we have max i =1 ,...,n | V i | = (cid:6) ks (cid:7) and so L conv = (cid:24) ks (cid:25) , l conv = (cid:24) ks (cid:25) . Batch of inputs. For batch of inputs, the smoothness constants of the non-linear and bilinear parts do not change byLemmas C.1 and C.2. The Lipschitz-constant of the linear part of the biaffine function is modified using Lemma C.1item 3. Namely for a batch of size m , the fully connected layers or the convolutional layers have a linear part whoseLipschitz constant is given by l b = √ m . D.4 Activation functions The Lipschitz and smoothness constants of an element-wise activation α t function are defined by the Lipschitz andsmoothness constant of the scalar function ¯ α t from which it is defined. Denote by f ( x ) := log(1 + exp( x )) , we have f ′ ( x ) = (1 + exp( − x )) − , f ′′ ( x ) = (2 + 2 cosh( x )) − , f ′′′ ( x ) = − sinh( x ) / (2(1 + cosh( x ) )) . Soft-plus. For α defined by element-wise application of ¯ α ( x ) = f ( x ) , we get ℓ softplus = 1 , L softplus = 1 / . Sigmoid. For α defined by the element-wise application of ¯ α ( x ) = f ′ ( x ) , we get ℓ sig = 1 / , L sig = 1 / . eLU. For α defined by the element-wise application of ¯ α ( x ) = max(0 , x ) , we get ℓ ReLu = 1 , L ReLu not defined , since the function is not continuously differentiable. Soft-max layer. A soft-max layer takes as input x ∈ R d and outputs f ( x ) = exp( x ) / (exp( x ) ⊤ d ) where exp( x ) isthe element-wise application of exp . Its gradient is given by ∇ f ( x ) = diag (exp(ˆ y ))exp(ˆ y ) ⊤ q − exp(ˆ y ) exp(ˆ y ) ⊤ (exp(ˆ y ) ⊤ q ) . Its second-order information can be computed as for the batch-normalization layer, we get then ℓ softmax = 2 , L softmax = 4 . D.5 Normalization layers Proposition D.3. The batch normalization operation ν batch : R δm → R δm defined as in (11) is(i) bounded by m batch = δm ,(ii) Lipschitz-continuous with a constant ℓ batch = 2 ǫ − / ,(iii) smooth with a constant L batch = 2 δm − / ǫ − .Proof. The batch-normalization layer as defined in (11) is the composition ν = ν ◦ ν of a centering step ν ( x ) = Vec (cid:18) Z − Z m ⊤ m m (cid:19) and a normalization step ν (˜ x ) = Vec diag (cid:18) m diag ( ˜ Z ˜ Z ⊤ ) + ǫ δ (cid:19) − / ! ˜ Z ! 
, where here and thereafter Z, ˜ Z ∈ R δ × m , x = Vec( Z ) , ˜ x = Vec( ˜ Z ) .The centering step is an orthonormal projection, i.e., ν ( x ) = Vec( Z Π m ) = (Π m ⊗ I δ ) x where Π m = I m − m ⊤ m m is an orthonormal projector and so is (Π m ⊗ I m ) . Therefore we have ℓ ν ≤ and L ν = 0 . For the normalizationsstep denote for x ∈ R m , and ¯ x = ( x ; . . . ; x δ ) ∈ R mδ with x i ∈ R m , f ( x ) = r m k x k + ǫ, g ( x ) = (cid:18) x i f ( x ) (cid:19) i =1 ,...,m , ¯ g (¯ x ) = ( g ( x ); . . . ; g ( x δ )) ∈ R mδ , such that ν (˜ x ) = T m,d ¯ g ( T d,m ˜ x ) , where T d,m is the linear operator such that T d,m Vec( Z ) = Vec( Z ⊤ ) for any Z ∈ R d × m . First we have that k ¯ g (¯ x ) k ≤ δ max i ∈{ ,...,d } k g ( x i ) k ≤ δm / , such that m ν ≤ δm / . Then the gradients can be computed as ∇ f ( x ) = xmf ( x ) = g ( x ) m ∈ R m , ∇ g ( x ) = f ( x ) I m −∇ f ( x ) x ⊤ f ( x ) = mf ( x ) I m − xx ⊤ mf ( x ) ∈ R m × m , ∇ ¯ g (¯ x ) = diag ( ∇ g ( x ) , . . . , ∇ g ( x m )) ∈ R md × md , X , . . . X τ ∈ R d × p we denote by diag ( X , . . . , X τ ) = X . . . . . . . . . ...... . . . . . . . . . X τ ∈ R dτ × pτ , the corresponding block diagonal matrix. Therefore we get k∇ g ( x ) k , ≤ mf ( x ) + k x k mf ( x ) ≤ m − k x k + ǫ ( m − k x k , + ǫ ) / ≤ cǫ − / , k∇ ¯ g (¯ x ) k ≤ cǫ − / , where c = 2 / (3 / / ≈ . and we used that the spectral norm of the block-diagonal matrix is given by the maximalspectral norm of its block diagonal components. Since T m,d , T d,m are orthonormal operators, we get ℓ ν ≤ ǫ − / . The second order tensor of g reads ∇ g ( x ) = 3 m f ( x ) x ⊠ x ⊠ x − mf ( x ) m X i =1 x ⊠ e i ⊠ e i + e i ⊠ x ⊠ e i + e i ⊠ e i ⊠ x ! ∈ R m × m × m , ∇ ¯ g (¯ x ) = diag ( ∇ g ( x ) , . . . , ∇ g ( x d )) , where e i ∈ R m is the i th canonical vector in R m and for a sequence of tensors X , . . . , X d we denote by X = diag ( X , . . . , X d ) ∈ R dm × dm × dm the tensor whose diagonal is composed of the tensors X , . . . X d such that X i +( m − p,j +( m − p,k +( m − p = ( X p ) ijk and 0 outside the diagonal. We get then by definition of the tensor norm, k∇ g ( x ) k , , ≤ k x k m f ( x ) + 3 k x k mf ( x ) = 3 k x k ( k x k + mf ( x ) ) m f ( x ) = 3 k x k (2 k x k + mǫ ) m ( m − k x k + ǫ ) / ≤ m − / √ c (2 c + 1)( c + 1) / ( mǫ ) − , where c = (1 + √ / such that √ c (2 c +1)( c +1) / ≈ . . Therefore we get k∇ ¯ g (¯ x ) k , , ≤ δ max i ∈{ ,...,δ } k∇ g ( x i ) k , , and L ν ≤ δm − / ǫ − . D.6 Pooling layers We consider pooling layers for which the patches do not coincide such that they amount to a (potentially non-linear)projection. Average pooling. The average pooling layer is a linear operation. If the patches do not coincide, it is a projectionwith Lipschitz constant one. ℓ avg = 1 , L avg = 0 . Max-pooling. Given an image I ∈ R C × H × B with C channels each composed of a matrix of height H and breadth B , the max pooling layer extracts n p patches of the form P i,j = E ⊤ row,i I [ c, · , · ] E col,j where E row,i ∈ R H × K and E col,j ∈ R B × K extract rows and columns of I [ c, · , · ] respectively. On each of this patch their maximum value is takenas the output, namely, the output image reads ˜ I c,i,j = max k,l P i,jk,l . It is naturally non-continuously differentiable and34t is -Lipschitz continuous if the patches do not coincide. ℓ maxpool = 1 L maxpool not defined . D.7 Auto-encoders, composition of chains of computations For τ vectors ( u ; . . . ; u τ ) ∈ R P τt =1 p t and ≤ s ≤ t ≤ τ , we denote u s : t = ( u s ; . . . ; u t ) ∈ R P tr = s p r . 
For τ functions φ t : R d t − × R p t → R d t , we can split the chain of computations of the τ functions φ t into smaller chains ofcomputations. Namely, for ≤ s ≤ t ≤ τ , we denote the output of the chain of computations defined by φ s , . . . φ t as φ s → t ( x s − , u s : t ) = x t s.t. x r = φ r ( x r − , u r ) for r ∈ { s, . . . , t } . In particular, we have φ t = φ t → t . The output of the chain of computations of the τ fucntions φ t can then be split as φ → τ ( x , u τ ) = φ t +1 → τ ( φ → t ( x , u t ) , u t +1: τ ) for any t ∈ { , . . . τ − } . On the other hand, the composition of two chains of computations can readily be seen as a chain of computations.Namely, for two chains of computations f and g with computations ( φ ft ) τ f t =1 and ( ψ gt ) τ g t =1 , parameters u and v respec-tively, the composition of f and g is h ( x , u ) = g ( f ( x , u ) , v ) . It is a chain of τ f + τ g computations χ t = ( φ t for t ∈ { , . . . , τ f } ψ t − τ f for t ∈ { τ f + 1 , τ f + τ g } , with input x and parameters w = ( u ; v ) ∈ R P τft =1 p ft + P τgt =1 p gt such that w t = ( u t for t ∈ { , . . . , τ f } v t − τ f for t ∈ { τ f + 1 , τ f + τ g } . D.8 Residual Networks Recall the architecture of a residual network x t = a t ( b t ( x t − , u t ) + x t − ) for t = 1 , . . . , τx = x, x − = 0 , where we assume b t : R d t − × R p t → R η t such that x t − ∈ R d t − , x t − ∈ R η t and b t ( x t − , u t ) = B t [ x t − , u t , · ] + B ut u t + B xt x t − + β t , where B = ( B t, , . . . , B t,η t ) is a tensor. They can be expressed in terms of the variable ¯ x t = ( x t , x t − ) as ¯ φ t (¯ x t − , u t ) = ¯ a t (¯ b t (¯ x t − , u t )) , (36)35here ¯ b t is defined as ¯ b t (¯ x t − , u t ) = ¯ β t (¯ x t − , u t ) + ¯ β tu ( u t ) + ¯ β tx (¯ x t − ) + ¯ β t = ¯ B t [¯ x t − , u t , · ] + ¯ B ut u t + ¯ B xt ¯ x t − + ¯ β t , ¯ B t = ( ¯ B t, , . . . , ¯ B t,η t , p t × ( d t − + η t ) , . . . , p t × ( d t − + η t ) | {z } d t − ) , ¯ B t,j = (cid:0) B t,j , p t × η t (cid:1) for j ∈ { , . . . , d t − } , ¯ B ut = (cid:18) B ut d t − × p t (cid:19) , ¯ B xt = (cid:18) B xt I η t I d t − d t − × η t (cid:19) , ¯ β t = (cid:18) β t d t − (cid:19) . Denoting ¯ ω t = ( ω t, , ω t, ) = ¯ b t (¯ x t − , u t ) , we have ¯ a t (¯ ω t ) = ( a ( ω t, ) , ω t, ) . We can derive the smoothness constants of the layers of a residual network expressed as in (36) as L ¯ β t = L β t , l ¯ β ut = l β ut , l ¯ β xt ≤ l β xt + 1 , k ¯ β t k = k β t k ,m ¯ a t ≤ (1 + m a t ) , ℓ ¯ a t ≤ max(1 , ℓ a t ) , L ¯ a t = L a t . Proposition 4.5 can then be applied in this setting. D.9 Implicit functions The smoothness constants of an implicit function are given in the following lemma. They can easily be refined byconsidering smoothness properties w.r.t. to each of the variables α and β of the function ζ defining the problem. Lemma D.4. Let ζ : ( α, β ) → ζ ( α, β ) ∈ R for α ∈ R a , β ∈ R b be s.t. ζ ( α, · ) is µ ζ -strongly convex for any α .Denote g ( α ) = arg min β ∈ R b ζ ( α, β ) . Provided that ζ has a L ζ -Lipschitz gradient and a H ζ -Lipschitz Hessian, thesmoothness constants of g are bounded as ℓ g ≤ L ζ µ − ζ , L g ≤ H ζ µ − ζ (1 + ℓ g )(1 + L ζ µ − ζ ) ≤ H ζ µ − ζ (1 + L ζ µ − ζ ) . Proof. By the implicit function theorem, g ( α ) is uniquely defined and its gradient is given by ∇ g ( α ) = −∇ α ξ ( α, g ( α )) ∇ β ξ ( α, g ( α )) − = −∇ α,β ζ ( α, g ( α )) ∇ β,β ζ ( α, g ( α )) − , where ξ ( α, β ) = ∇ β ζ ( α, β ) . The Lipschitz constant of g follows from that. 
For the smoothness we compute itssecond order information and bound the corresponding tensors. Note that the same results can be obtained by simplysplitting the functions in appropriate terms. We have ∇ g ( α ) = h ( α, g ( α )) , where h ( α, β ) = −∇ α ξ ( α, β ) ∇ β ξ ( α, β ) − = −∇ α,β ζ ( α, β ) ∇ β,β ζ ( α, β ) − . Using Lemma D.5, we get ∇ g ( α ) = ∇ α h ( α, g ( α )) + ∇ β h ( α, g ( α ))[ ∇ g ( α ) , · , · ]= − ∇ αα ξ ( α, g ( α ))[ · , · , ∇ β ξ ( α, g ( α )) − ] − ∇ ξ αβ ( α, g ( α ))[ · , ∇ β ξ ( α, g ( α )) − ∇ α ξ ( α, g ( α )) ⊤ , ∇ β ξ ( α, g ( α )) − ] − ∇ βα ξ ( α, g ( α ))[ ∇ g ( α ) , · , ∇ β ξ ( α, g ( α )) − ] − ∇ ξ ββ ( α, g ( α ))[ ∇ g ( α ) , ∇ β ξ ( α, g ( α )) − ∇ α ξ ( α, g ( α )) ⊤ , ∇ β ξ ( α, g ( α )) − ] . ξ correspond to third derivativesof ζ , whose norms are bounded by H ζ by assumption. Moreover we have that k∇ β ξ ( α, g ( α )) − k = k∇ β,β ζ ( α, β ) − k ≤ µ − ζ by assumption.The approximation error of the gradient when using an approximate minimizer inside the expression of the gradientis provided in the following lemma. It follows from smoothness considerations. Lemma 2.2. Let ζ : ( α, β ) → ζ ( α, β ) ∈ R for α ∈ R a , β ∈ R b be s.t. ζ ( α, · ) is µ ζ -strongly convex for any α and denote ξ ( α, β ) = ∇ β ζ ( α, β ) . Denote g ( α ) = arg min β ∈ R b ζ ( α, β ) and ˆ g ( α ) ≈ arg min β ∈ R b ζ ( α, β ) be anapproximate minimizer. Provided that ζ has a L ζ -Lipschitz gradient and a H ζ -Lipschitz Hessian, the approximationerror of using b ∇ ˆ g ( α ) = −∇ α ξ ( α, ˆ g ( α )) ∇ β ξ ( α, ˆ g ( α )) − instead of ∇ g ( α ) is bounded as k b ∇ ˆ g ( α ) − ∇ g ( α ) k ≤ H ζ µ − ζ (1 + L ζ µ − ζ ) k ˆ g ( α ) − g ( α ) k . Proof. Denote h ( α, β ) = −∇ α ξ ( α, β ) ∇ β ξ ( α, β ) − = −∇ α,β ζ ( α, β ) ∇ β,β ζ ( α, β ) − such that b ∇ ˆ g ( α ) = h ( α, ˆ g ( α )) and ∇ g ( α ) = h ( α, g ( α )) . The approximation error is given by computing the smoothness constant of h ( α, · ) for any α . We bound the gradient of h ( α, · ) (same results can be obtained by considering differences of the functions). FromLemma D.5, we have ∇ β h ( α, β ) = −∇ βα ξ ( α, β )[ · , · , ∇ β ξ ( α, β ) − ] − ∇ ξ ββ ( α, β )[ · , ∇ β ξ ( α, β ) − ∇ α ξ ( α, g ( α )) ⊤ , ∇ β ξ ( α, β ) − ] . The result follows by using Facts A.4 and A.3. We observe that second derivatives of ξ correspond to third derivativesof ζ , whose norms are bounded by H ζ by assumption. Moreover we have that k∇ β ξ ( α, g ( α )) − k = k∇ β,β ζ ( α, β ) − k ≤ µ − ζ by assumption. Lemma D.5. Let ξ : ( α, β ) → ξ ( α, β ) ∈ R b for α ∈ R a , β ∈ R b such that ∇ β ξ ( α, β ) ∈ R b × b is positive definite forall α ∈ R a , β ∈ R b . Denoting h ( α, β ) = ∇ α ξ ( α, β ) ∇ β ξ ( α, β ) − ∈ R a × b we have ∇ α h ( α, β ) = ∇ αα ξ ( α, β )[ · , · , ∇ β ξ ( α, β ) − ] + ∇ ξ αβ ( α, β )[ · , ∇ β ξ ( α, β ) − ∇ α ξ ( α, β ) ⊤ , ∇ β ξ ( α, β ) − ] , ∇ β h ( α, β ) = ∇ βα ξ ( α, β )[ · , · , ∇ β ξ ( α, β ) − ] + ∇ ξ ββ ( α, β )[ · , ∇ β ξ ( α, β ) − ∇ α ξ ( α, β ) ⊤ , ∇ β ξ ( α, β ) − ] . Proof. This follows from the product rule, Fact A.5, Lemma D.6 and Fact A.1. Lemma D.6. Let g : R d → S ++ n be differentiable and h ( x ) = ( g ( x )) − . Then ∇ h ( x ) = ∇ g ( x )[ · , g ( x ) − , g ( x ) − ] .Proof. Let x ∈ R d and δ ∈ R d . Consider first d = 1 , such that ∇ g ( x ) ∈ R n × n . h ( x + δ ) = ( g ( x ) + δ ∇ g ( x ) + o ( δ )) − = g ( x ) − − δg ( x ) − ∇ g ( x ) g ( x ) − + o ( δ ) . So in this case ∇ h ( x ) = g ( x ) − ∇ g ( x ) g ( x ) − ∈ R n × n . 
The result follows for n = d by concatenating this resultin a tensor such that for d > , ∇ h ( x ) = ∇ g ( x )[ · , g ( x ) − , g ( x ) − ] . Alternatively it can directly be seen from thefollowing first order approximation for d > , h ( x + δ ) = ( g ( x ) + ∇ g ( x )[ δ, · , · ] + o ( k δ k )) − = g ( x ) − − g ( x ) − ∇ g ( x )[ δ, · , · ] g ( x ) − + o ( k δ k ) . E Optimization complexity proofs E.1 Smoothness of the objective Proposition 4.2. Consider a closed convex set C ⊂ R p , ψ ∈ C Cm Cψ ,ℓ Cψ ,L Cψ , r ∈ C L r and h ∈ C ℓ h ,L h with ℓ h = + ∞ if h is not Lipschitz-continuous. The smoothness of F = h ◦ ψ + r on C is bounded as L CF ≤ L Cψ ˜ ℓ Ch + (cid:0) ℓ Cψ (cid:1) L h + L r , here ˜ ℓ Ch = min { ℓ h , min z ∈ ψ ( C ) k∇ h ( z ) k + L h ℓ Cψ D C } , where D C = sup x,y ∈ C k x − y k .Proof. Consider h, f to be twice differentiable. Same results can be obtained by considering differences of gradients.We get for u ∈ R p , ∇ ( h ◦ f )( u ) = ∇ f ( u )[ · , · , ∇ h ( f ( u ))] + ∇ f ( u ) ∇ h ( f ( u )) ∇ f ( u ) ⊤ . The norm of ∇ h ( f ( u )) can either be directly bounded by ℓ h or by using that for any u, u ′ ∈ C , k∇ h ( f ( u )) k ≤k∇ h ( f ( u ′ )) k + L h k f ( u ) − f ( u ′ ) k . By choosing u ′ ∈ arg min u ∈ C k∇ h ( f ( u )) k and bounding the second term bythe diameter of C , we get a bound on sup u ∈ C k∇ h ( f ( u )) k . The result follows using Fact. A.4 and the definitions ofthe norms used to bound ℓ f , L f for a given function f . Corollary 4.6. Consider a chain f of τ computations as defined in Prop. 4.5 and u ∗ = ( u ∗ ; . . . , u ∗ τ ) ∈ R p . Thesmoothness properties of f on C ′ = { u = ( u ; . . . ; u τ ) ∈ R p : ∀ t ∈ { , . . . , τ } , k u t − u ∗ t k ≤ R ′ t } are given as inProp. 4.5 by considering R ′ t in place of R t ,l β xt + L β t k u ∗ t k in place of l β xt , k β t k + l β ut k u ∗ t k in place of k β t k . Proof. The smoothness properties of f x on C ′ are given by considering ˆ f x (∆) = f x ( u ∗ + ∆) = f x ( u ) where ∆ = u − u ∗ with k ∆ t k ≤ R ′ . The shifted chain of computations is given by ˆ f x ,t (∆) = a t ( b t ( ˆ f x ,t − (∆) , u ∗ t + ∆ t )) This means that ˆ f x (∆) is a chain of compositions defined by the same non-linearities a t and bi-affine functions ˆ b t modified as ˆ b t ( x t − , ∆) = b t ( x t − , u ∗ t + ∆ t ) = β t ( x t − , ∆ t ) + β ut (∆ t ) + ˆ β xt ( x t − ) + ˆ β t , where ˆ β xt ( x t − ) = β xt ( x t − ) + β t ( u ∗ t , x t − ) ˆ β t = β t + β ut ( u ∗ t ) . F Detailed network VGG network. The VGG Network is a benchmark network for image classification with deep networks. Theobjective is to classify images among classes. Its architecture is composed of 16 layers described below. Wedrop the dependency to the layers in their detailed formulation. We precise the number of patches n p of the pooling orconvolution operation, which, multiplied by the number of filters n f gives the output dimension of these operations.For a fully connected layer we precise the output dimension δ out .0. x i ∈ R n p n f with n p = 224 × and n f = 3 ,1. φ ( x, u ) = α ReLu ( b conv ( x, u )) with n p conv = 224 × , n f conv = 64 ,2. φ ( x, u ) = π maxpool ( α ReLu ( b conv ( x, u ))) with n p conv = 224 × , n f conv = 64 , n p maxpool = 112 × , n f maxpool = 64 ,3. φ ( x, u ) = α ReLu ( b conv ( x, u )) with n p conv = 112 × , n f conv = 128 φ ( x, u ) = π maxpool ( α ReLu ( b conv ( x, u ))) with n p conv = 112 × , n f conv = 128 , n p maxpool = 56 × , n f maxpool = 128 ,5. 
φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 56 × 56, n_f conv = 256,
6. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 56 × 56, n_f conv = 256,
7. φ(x, u) = π_maxpool(α_ReLU(b_conv(x, u))) with n_p conv = 56 × 56, n_f conv = 256, n_p maxpool = 28 × 28, n_f maxpool = 256,
8. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 28 × 28, n_f conv = 512,
9. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 28 × 28, n_f conv = 512,
10. φ(x, u) = π_maxpool(α_ReLU(b_conv(x, u))) with n_p conv = 28 × 28, n_f conv = 512, n_p maxpool = 14 × 14, n_f maxpool = 512,
11. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 14 × 14, n_f conv = 512,
12. φ(x, u) = α_ReLU(b_conv(x, u)) with n_p conv = 14 × 14, n_f conv = 512,
13. φ(x, u) = π_maxpool(α_ReLU(b_conv(x, u))) with n_p conv = 14 × 14, n_f conv = 512, n_p maxpool = 7 × 7, n_f maxpool = 512,
14. φ(x, u) = α_ReLU(b_full(x, u)) with δ_out = 4096,
15. φ(x, u) = α_ReLU(b_full(x, u)) with δ_out = 4096,
16. φ(x, u) = α_softmax(b_full(x, u)) with δ_out = 1000,
17. h(ŷ) = Σ_{i=1}^n L_log(ŷ_i, y_i)/n for k = 1000 classes.
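To close the loop with Sec. 4, the per-layer constants listed in this appendix can be chained numerically. The rough sketch below (with placeholder radii and input bound, and pooling layers folded into the 1-Lipschitz activations) only illustrates the bookkeeping behind the VGG estimate, not the exact constants of the paper.

```python
import math

def conv_smoothness(k=3, s=1):
    # Smoothness constant of a 2-d convolution with a k x k kernel and stride s (Lemma D.2).
    return math.ceil(k / s)

# (bilinear smoothness L_b, activation Lipschitz constant l_a) per layer of the chain above.
layers = [(conv_smoothness(), 1.0)] * 13 + [(1.0, 1.0)] * 2 + [(1.0, 2.0)]

R_u = 1.0                 # hypothetical bound on each layer's parameters ||u_t||
m_x, lip = 1.0, 0.0       # hypothetical bound on the input norm, running Lipschitz bound
for L_b, l_a in layers:
    l_u = l_a * L_b * m_x           # sensitivity to the layer's own parameters
    l_x = l_a * L_b * R_u           # sensitivity to the layer's input
    lip = l_u + lip * l_x           # recursion of Lemma 4.3 / Prop. 4.5
    m_x = l_u * R_u                 # crude bound on the norm of the next layer's input
print(f"Lipschitz bound w.r.t. all parameters: {lip:.3g}")
```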