Doubly infinite residual networks: a diffusion process approach
Stefano Peluchetti, Cogent Labs, [email protected]
Stefano Favaro, University of Torino and Collegio Carlo Alberto, [email protected]
July 8, 2020
Abstract
When a neural network's parameters are initialized as i.i.d., the network exhibits undesirable forward and backward properties as the number of layers increases, e.g., a vanishing dependency on the input and perfectly correlated outputs for any two inputs. To overcome these drawbacks, Peluchetti and Favaro (2020) considered fully connected residual networks (ResNets) with parameter distributions that shrink as the number of layers increases. In particular, they established an interplay between infinitely deep ResNets and solutions to stochastic differential equations, i.e. diffusion processes, showing that infinitely deep ResNets do not suffer from undesirable forward properties. In this paper, we review the forward-propagation results of Peluchetti and Favaro (2020), extending them to the setting of convolutional ResNets. Then, we study analogous backward-propagation results, which directly relate to the problem of training deep ResNets. Finally, we extend our study to the doubly infinite regime where both the network's width and depth grow unboundedly. Within this novel regime the dynamics of quantities of interest converge, at initialization, to deterministic limits. This allows us to provide analytical expressions for inference, both in the case of weakly trained and of fully trained networks. These results point to a limited expressive power of doubly infinite ResNets when the unscaled parameters are i.i.d. and the residual blocks are shallow.
1 Introduction

Modern neural networks featuring a large number of layers (depth) and features per layer (width) have achieved remarkable performance across many domains (LeCun et al., 2015). It is well known (Neal, 1995; Matthews et al., 2018) that, as the network's width goes to infinity, neural networks whose parameters are appropriately distributed converge to Gaussian processes. The interplay between infinitely wide neural networks and Gaussian stochastic processes has contributed remarkably to the study of properties of very wide networks and, more recently, it formed the basis for the introduction of inferential algorithms directly targeting the infinite-dimensional setting (Lee et al., 2018; Garriga-Alonso et al., 2019; Lee et al., 2019a; Arora et al., 2019). Based on these recent studies, it seems natural to ask whether there exists an analogous interplay between infinitely deep neural networks and classes of stochastic processes. At a first glance, this interplay might prove elusive. Indeed there exists a duality between initialization schemes and Bayesian neural networks: an initialization scheme can be seen as a prior distribution on the model parameters, thus inducing a prior on the neural network. A neural network at initialization may thus be viewed as a suitable stochastic process indexed by depth, whose distribution is defined by a sequence of conditional distributions mapping from each layer to the next layer.

Figure 1: Function samples of a given pre-activation (number 1) of the last layer, x_{last,1}, of a fully connected feedforward network with 500 layers of 500 units over a 1-dimensional input z; tanh and ReLU activation functions, with parameters initialized on the edge of chaos; 5 draws are displayed in blue in each figure; for each input, marginal quantiles are shown in orange.

Early works focused on stabilizing the variance of key quantities of interest across the layers of deep neural networks (Glorot and Bengio, 2010; He et al., 2015). More recent works (Poole et al., 2016; Schoenholz et al., 2017; Hayou et al., 2019a) considered the impact of initialization schemes on the propagation of the input signal. Even when initialized on the edge of chaos (EOC) for optimal signal propagation (Hayou et al., 2019a), feedforward networks with an independent and identically distributed (i.i.d.) initialization progressively exhibit pathological properties as their total depth increases. Two common pathological properties are: i) the dependency on the input eventually vanishes for most activation functions (Neal, 1995; Poole et al., 2016; Schoenholz et al., 2017); ii) the layers, when viewed as random functions on the input space, eventually concentrate on restrictive families including constant functions (Hayou et al., 2019a). As an illustrative example, in Figure 1 we show function samples from the last layer of a deep feedforward neural network for two activation functions under EOC initialization. For the hyperbolic tangent (tanh) activation, i.e. φ(x) = tanh(x), the input has no discernible impact on the output, as can be seen by the constant marginal distributions, and the sampled functions are almost constant. This behavior is representative of most smooth activation functions used in practice (Hayou et al., 2019a). For the rectified linear unit (ReLU) activation, i.e. φ(x) = max(0, x), the input affects the variance of the output and the function samples are piece-wise linear. In both cases, the outputs corresponding to any two inputs end up perfectly correlated. While this analysis applies to feedforward networks, very deep residual networks suffer from similar issues (Yang and Schoenholz, 2017), with the additional complication that the variance of the Gaussian-distributed pre-activations may grow unbounded over the layers.

The critical difficulties discussed so far are determined by the fact that typical prior distributions on model parameters introduce a constant level of randomness over each hidden network layer. To overcome these difficulties, Peluchetti and Favaro (2020) introduced a class of prior distributions that depend on the number of layers, in such a way that the distribution of the parameters shrinks as the number of layers increases. Under this novel prior setting, Peluchetti and Favaro (2020) showed that fully connected residual neural networks (ResNets) converge, as the number of layers increases and jointly over multiple inputs, to diffusion processes on a finite time interval. The conditions required for attaining convergence to diffusion processes then provide a guideline for selecting compatible neural network architectures, activation functions and parameter distributions.
The resulting limiting diffusion processes satisfy stochastic differential equations (SDEs) that describe the evolution of infinitely deep neural network layers over time (depth); the connection with SDEs sheds light on properties of very deep neural networks in a general framework, which includes finitely wide neural networks and correlated parameter distributions. In particular, Peluchetti and Favaro (2020) showed that the limiting diffusion process is a well-behaved stochastic process in the sense that: i) it retains dependency on the input; ii) it does not suffer from the perfect correlation constraint; iii) it does not collapse to a deterministic function nor does it diverge.

In this paper, we review the forward-propagation results introduced in Peluchetti and Favaro (2020), and we extend these results to the setting of convolutional ResNets. Then, we study analogous backward-propagation results, which directly relate to the fundamental problem of training ResNets. Stochastic gradient descent (SGD) is arguably the most common paradigm for training neural networks (Robbins and Monro, 1951; Bottou et al., 2018). Focusing on gradient backward-propagation, in deep neural network models we face a large number of Jacobian matrix multiplications needed to compute the gradients with respect to the parameters of the lowest layers. This may result in the vanishing (or exploding) gradients problem, where the magnitude of the gradients of the parameters in the lowest layers goes to zero (grows unbounded) as the total number of layers increases. As SGD relies on such gradients to perform the updates of the parameters, the vanishing (or exploding) gradient issue is detrimental to the training performance of deep neural networks. The information propagation literature covers this setting too, with results qualitatively similar to the forward signal propagation analysis (Schoenholz et al., 2017; Hayou et al., 2019a; Yang and Schoenholz, 2017). Our backward-propagation study is directly relevant for training performance. In particular, for a class of ResNets, we show that the Jacobian matrix of any layer with respect to the input layer converges to a matrix diffusion process, which is the solution of a corresponding matrix SDE. Moreover, under appropriate conditions such a limiting matrix diffusion is shown to be invertible with dynamics given by a related matrix SDE. These results imply that in the limit of infinite total depth the Jacobian of the final layer with respect to any layer is again well-behaved and that exploding gradients are not possible.

We conclude our study by extending the forward-propagation results of Peluchetti and Favaro (2020), as well as their corresponding backward-propagation results for ResNets, to the context of doubly infinite neural networks, namely infinitely deep and infinitely wide neural networks. In such a novel context we assume the model parameters to be initialized as fully i.i.d., and we also assume a more restricted class of activation functions. Regarding forward-propagation, we show that the neural network dynamics simplify and many quantities of interest are either analytically available or can be efficiently approximated numerically. Furthermore, we show that the distribution in function space of ResNets converges to Gaussian processes with an affine kernel.
Regarding backward-propagation, the recent literature on Neural Tangent Kernel (NTK)studies the solution obtained by gradient descent with infinitesimally small learning rate andquadratic loss for infinitely wide networks (Jacot et al., 2018; Arora et al., 2019). We establish aconnection with this line of research by showing that in the doubly infinite setting the NTK atinitialization converges again to an affine kernel. Under the assumption considered, these resultsimply that both weakly and fully trained neural networks which are both very wide and verydeep collapse to linear regression.The paper is structured as follows. Section 2 contains some notation and preliminary resultson diffusion limits of discrete-time stochastic processes. In Section 3 we review the forward-propagation results introduced in Peluchetti and Favaro (2020) for fully connected ResNet,and we extend these results to the setting of convolutional ResNets. Section 4 contains ourbackward-propagation result: the Jacobian matrix of any network’s layer of a class of ResNets3onverges to a matrix diffusion process solution of a corresponding SDE. In Section 5 we performa preliminary analysis of doubly infinite neural networks, where both depth and width growsunbounded. Section 6 contains numerical experiments, whereas in Section 7 we discuss ourresults. Proofs are deferred to Appendix A, and additional plots to Appendix B. We start by setting the notation to be used throughout the paper: tensors (matrices, vectors)are indexed via subscripts ( x i , h i,j , . . . ), and we make use of • to index all elements and of : to index ranges; there is no distinction between vectors and n × matrices, i.e. vectors arecolumn vectors; for a matrix h , h (cid:62) is its transpose, and if h is square diag( h ) is its diagonalvector and Tr( h ) is its trace; the norm of a vector x is (cid:107) x (cid:107) = √ x (cid:62) x ; if y is another vector theirinner product is (cid:104) x, y (cid:105) = x (cid:62) y ; the norm of a matrix h is (cid:107) h (cid:107) = (cid:112) Tr( h (cid:62) h ) ; for two matrices h and g , hg stands for matrix multiplication, h ⊗ g for Kronecker product and h (cid:12) g element-wiseproduct; we assume that matrix multiplication has higher precedence than element-wise product;for a tensor u , vec( u ) is its vectorization (row-wise for matrices, in general the elements aretraversed starting from the last dimension); we make use of I for the identity matrix and of fora vector of ones; for random variables z , w , var[ z ] is the variance of z , cov[ z, w ] is the covariancebetween z and w and ρ [ z, w ] is their correlation; for two random vectors x ∈ R r , y ∈ R c the r × c cross-covariance matrix C [ x, y ] is given by C [ x, y ] i,j = cov[ x i , x j ] ; the r × r covariance matrix of x is thus V [ x ] = C [ x, x ] ; the expectation E [ u ] of a random tensor u is the tensor of the expectationsof its elements; for two D -dimensional stochastic processes x t and y t , we make use of [ x ] t for thequadratic variation, which is a D -dimensional vector, and of [ x, y ] t for the quadratic covariation,which is a D × D -dimensional matrix; for a differentiable function f : R k → R , ∇ x f ( x ) ∈ R k is its corresponding gradient vector; if f : R k → R m , J ( f ( x ) , x ) ∈ R m × k is its correspondingJacobian matrix.The remaining part of this section is devoted to recall assumptions and results for diffusionapproximations of discrete-time stochastic processes. 
This paves the way to Section 3, where we review the results of Peluchetti and Favaro (2020) on fully connected ResNets, and we extend them to the setting of convolutional ResNets. Let x_l, for l = 1, ..., L, denote the l-th layer of a neural network with L layers, and let x_0 be the network input; we refer to the next section for a precise explanation of what x_l, for l = 1, ..., L, represents in the context of neural networks, and in particular in the context of neural networks with residual architecture (ResNets). As we consider continuous-time stochastic process limits, we re-index x_0, x_1, ..., x_L on a discrete time scale. In particular, let T > 0 denote a terminal time and ∆t = T/L; for each L we establish a correspondence between discrete indices l ∈ Z_+ and discrete times t ∈ R_+ by l = 0, 1, ..., L ↔ t = 0, ∆t, 2∆t, ..., T. Without loss of generality, we consider a neural network with input x_0 and layers x_∆t, ..., x_T, with x_t being a generic layer.

Let p(x_T | x_0) denote the conditional distribution of the network's output given the input for a neural network at initialization. The strategy of Peluchetti and Favaro (2020) to enforce desirable properties on p(x_T | x_0) consists in having the deep neural network converge, as the number of layers L goes to infinity, i.e. ∆t ↓ 0, to a continuous-time stochastic process on the finite time interval [0, T]. In this case, for L large enough, the conditional distribution p(x_T | x_0) will be close to the distribution of the limiting process at terminal time T given the same x_0, and such a limiting process should be chosen to make this transition density well behaved. There are many approaches to construct continuous-time stochastic processes as limiting dynamics of discrete-time stochastic processes, and in this work we consider the simplest case where the limiting process has continuous paths. In all the neural network architectures considered in the present paper, each network layer depends exclusively on the previous one, hence x_t has the Markov property. These two conditions identify a broad class of diffusion processes (Stroock and Varadhan, 2006), which are continuous-time Markov processes with continuous paths, as natural candidates for the limiting process.

Let x_t denote a generic D-dimensional discrete-time Markov process and let ∆x_t = x_{t+∆t} − x_t define the corresponding forward increments of the stochastic process. Hereafter we report a set of general assumptions that imply the convergence of the stochastic process x_t to the solution of a limiting stochastic differential equation (SDE). In particular, it is implicit that the conditional distribution p(x_{t+∆t} | x_t) depends on ∆t for the limits to exist as required.

Assumption 2.1 (convergence of instantaneous mean and covariance functions). There exist µ_x(x) : R^D → R^D and σ_x(x) : R^D → R^{D×D} such that:

lim_{∆t↓0} E[∆x_t | x_t] / ∆t = µ_x(x_t)   (1)
lim_{∆t↓0} V[∆x_t | x_t] / ∆t = σ_x(x_t)   (2)
lim_{∆t↓0} E[(∆x_t)^{2+δ} | x_t] / ∆t = 0   (3)

for some δ > 0, where: i) all convergences are uniform on compact sets of R^D for each component; ii) µ_x(x) and σ_x(x) are continuous functions; iii) σ_x(x) is positive semi-definite, i.e. σ_x(x) = σ_x(x)^{1/2} (σ_x(x)^{1/2})^⊤ for some σ_x(x)^{1/2} : R^D → R^{D×D}.
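As an informal numerical illustration of Assumption 2.1 (our own, not part of the original analysis), the following sketch simulates one-step increments of a toy one-dimensional chain with drift coefficient µ and instantaneous variance σ, and checks that the empirical instantaneous mean and variance recover µ and σ as ∆t ↓ 0; the specific coefficient choices are arbitrary.

```python
import numpy as np

# Illustrative drift and (scalar) instantaneous variance; arbitrary choices.
mu = lambda x: -0.5 * x
sigma = lambda x: 1.0 + 0.1 * x ** 2

rng = np.random.default_rng(0)
x0, n_samples = 0.7, 200_000

for dt in [1e-1, 1e-2, 1e-3]:
    eps = rng.standard_normal(n_samples)
    dx = mu(x0) * dt + np.sqrt(sigma(x0) * dt) * eps   # one-step increments from x0
    print(f"dt={dt:.0e}  E[dx|x]/dt={dx.mean() / dt:+.3f} (target {mu(x0):+.3f})  "
          f"V[dx|x]/dt={dx.var() / dt:.3f} (target {sigma(x0):.3f})")
```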
The infinitesimal evolution of the diffusion processes considered in the present work is completely characterized by their instantaneous mean vector (1) and instantaneous covariance matrix (2). That is, the first two limits in Assumption 2.1 pinpoint the form of the limiting SDE. The condition (3) is a technical condition, in the sense that it allows us to consider the limits (1) and (2) instead of their truncated versions. We refer to Nelson (1990) for additional details on Assumption 2.1 and related assumptions. The next theorem establishes that, under additional assumptions, in the limit x_t can be embedded in the solution of an SDE.

Theorem 2.1.
Under Assumption 2.1, extend x_t to a continuous-time process x̄_t on t ∈ [0, T] by continuous-on-right step-wise-constant interpolation of x_t, i.e. x̄_t = x_u for u ≤ t < u + ∆t. Then, under the additional assumptions below, as ∆t ↓ 0 the interpolation x̄_t converges in law to the solution on [0, T] of the SDE

dx_t = µ_x(x_t) dt + σ_x(x_t)^{1/2} dB_t,   x_0 = x_0,   (5)

where B_t is a D-dimensional Brownian motion (BM) with independent components.

Assumption 2.3 (linear growth). There exists a constant C > 0 such that for each x ∈ R^D: ‖µ_x(x)‖ + ‖σ_x(x)‖ ≤ C(1 + ‖x‖).

When Assumption 2.1 and Assumption 2.2 hold true and Assumption 2.3 does not hold, we still obtain convergence to the solution of the SDE (5). However, the stochastic process x_t might diverge to infinity with positive probability on any time interval. We will return to this problem in detail. Peluchetti and Favaro (2020) studied the implications of Assumption 2.1, Assumption 2.2 and Assumption 2.3 in the context of fully connected ResNets, thus establishing a novel interplay between infinitely deep ResNets and solutions to classes of stochastic differential equations, i.e. classes of diffusion processes. In this section, we review the results of Peluchetti and Favaro (2020) on fully connected ResNets, and we extend them to the setting of convolutional ResNets.
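Before specializing to ResNets, here is a minimal Euler–Maruyama sketch (ours, with illustrative coefficients) of the kind of limiting SDE (5) appearing in Theorem 2.1, dx_t = µ_x(x_t) dt + σ_x(x_t)^{1/2} dB_t on [0, T]; it is the simulation counterpart of the interpolation argument above, not an implementation from the paper.

```python
import numpy as np

def euler_maruyama(x0, mu, sigma_sqrt, T=1.0, n_steps=2000, seed=0):
    """Simulate dx_t = mu(x_t) dt + sigma_sqrt(x_t) dB_t on [0, T] by Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.asarray(x0, dtype=float).copy()
    path = [x.copy()]
    for _ in range(n_steps):
        dB = rng.standard_normal(x.shape[0]) * np.sqrt(dt)
        x = x + mu(x) * dt + sigma_sqrt(x) @ dB
        path.append(x.copy())
    return np.stack(path)

# Illustrative 2-dimensional example: linear drift, state-dependent isotropic diffusion.
D = 2
drift = lambda x: -x
diff_sqrt = lambda x: np.sqrt(1.0 + 0.1 * x @ x) * np.eye(D)
path = euler_maruyama(np.ones(D), drift, diff_sqrt, T=1.0)
print(path.shape, np.round(path[-1], 3))
```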
Let consider unmodified, albeit simplified, standard neural network architectures, which is in linewith the research area of information propagation (Poole et al., 2016; Schoenholz et al., 2017;6ayou et al., 2019a). First of all, the stochastic process x t needs to be of constant dimensionality,otherwise ∆ x t is undefined. Consistently with the previous section we assume x t ∈ R D . ForAssumption 2.1 to hold we need Pr( (cid:107) ∆ x t (cid:107) > ε | x t ) ↓ as ∆ t ↓ for any ε > , i.e. we requirethe increments to vanish eventually. Intuitively this is due to the continuity of the paths of thelimiting diffusion process. A fully connected feedforward neural network is expressed by x t +∆ t = f t ( x t ) = φ ( A t x t + a t ) , for a nonlinear activation function φ : R → R applied element-wise. We refer to A t ∈ R D × D asweights and to a t ∈ R D as biases. Hence ∆ x t = φ ( A t x t + a t ) − x t . Shrinking increments wouldimply that for all x , φ ( A t x + a t ) can be made arbitrarily concentrated around x with a suitablechoice of distributions for ( A t , a t ) . This cannot be achieved unless φ is linear or the distributionof ( A t , a t ) depends on x . Indeed, fixing x determines the values around which ( A t , a t ) need toconcentrate for the increments to vanish (if any), hence the increments will not vanish for adifferent x (cid:48) (cid:54) = x , a fact that is most easily seen in the specific case where ( A t , a t ) are scalars.The same lines of reasoning rules out the residual network architecture (ResNet), originallyintroduced in the work of He et al. (2016a). In particular, in the ResNet architecture we write x t +∆ t = f t ( x t + r t ( x t )) . This leaves us with the identity ResNet of He et al. (2016b) where wewrite x t +∆ t = x t + r t ( x t ) (7)for some choice of r t , the residual blocks, which we require to eventually vanish. Each r t resultsfrom an interleaved application of affine transforms and non-linear activation functions. Peluchettiand Favaro (2020) considered the case of shallow residual blocks, such that (7) becomes x t +∆ t = x t + φ ( A t ψ ( x t ) + a t ) (8)for two activation functions φ : R → R , ψ : R → R which are applied element-wise. Weremark that the non-standard approach of using of 2 activation functions, i.e. φ , ψ , is appliedto cover the case of shallow residual blocks in full generality. For a shallow residual block r t ,the vanishing increments requirement is satisfied by having the distributions of weights A t andbiases a t both concentrate around 0 provided that φ (0) = 0 . Furthermore, it proves to beadvantageous to consider weights and biases given by increments of diffusions corresponding tosolvable SDEs. Notice that the use of increments implies independence across layers, and thesimplest parametrization (16) corresponds to typical fully i.i.d. initializations used in practice. Assumption 3.1 (parameters distribution and scaling) . Let W t and b t be the diffusion processeswith values in R D × D and R D , respectively, solutions of: dW t = µ W dt + d (cid:102) W t ; d vec( (cid:102) W t ) = σ W d vec( B Wt ) (9) db t = µ b dt + σ b dB bt , (10) where B Wt and B bt are independent BMs with independent components respectively with valuesin R D × D and R D , µ W ∈ R D × D , µ b ∈ R D , σ W ∈ R D × D , σ b ∈ R D × D , and Σ W = σ W σ W (cid:62) , Σ b = σ b σ b (cid:62) are positive semi-definite. 
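To make the shrinking-increments requirement concrete, the sketch below simulates the identity-ResNet recursion with shallow residual blocks and layer-wise independent Gaussian parameters scaled by √∆t, i.e. a zero-drift, fully i.i.d. special case of Assumption 3.1 (anticipating the discretization (11) given below); widths, variances and activation choices are purely illustrative.

```python
import numpy as np

def resnet_forward(x0, L=2000, T=1.0, sigma_w=1.0, sigma_b=0.1,
                   phi=np.tanh, psi=lambda a: a, seed=0):
    """Identity ResNet x_{t+dt} = x_t + phi(dW_t psi(x_t) + db_t) with layer-wise
    independent Gaussian parameters scaled by sqrt(dt), dt = T / L (zero-drift,
    fully i.i.d. special case; purely illustrative)."""
    rng = np.random.default_rng(seed)
    D, dt = x0.shape[0], T / L
    x = x0.astype(float).copy()
    for _ in range(L):
        dW = rng.standard_normal((D, D)) * sigma_w / np.sqrt(D) * np.sqrt(dt)
        db = rng.standard_normal(D) * sigma_b * np.sqrt(dt)
        x = x + phi(dW @ psi(x) + db)   # one shallow residual block
    return x

print(np.round(resnet_forward(np.linspace(-1.0, 1.0, 16)), 3))
```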
That is, W t and b t are matrix and vector valued diffu-sions, solutions of SDEs with arbitrary time-homogeneous and deterministic drift and diffusioncoefficients. Now, let consider the setting of Assumption 3.1. The discretizations of the diffusion processes W t and b t displayed in (9) and (10), respectively, admit exact representations as follows ∆ W t = µ W ∆ t + ε Wt √ ∆ t ; ∆ b t = µ b ∆ t + ε bt √ ∆ t vec( ε Wt ) i.i.d. ∼ N D (cid:0) , Σ W (cid:1) ; ε bt i.i.d. ∼ N D (cid:0) , Σ b (cid:1) , t = ∆ t, . . . , T , where N stands for the multivariate Gaussian distribution. Accordingly, weconsider residual blocks where A t = ∆ W t and a t = ∆ b t , and we write the ResNet as follows x t +∆ t = x t + φ (∆ W t ψ ( x t ) + ∆ b t ) . (11)Thus Assumption 3.1 covers the case where the parameters are i.i.d. across layers according to anarbitrary multivariate Gaussian distribution, up to the required scaling which is necessary to obtainthe desired diffusion limit. By considering deterministic but time-dependent µ Wt , µ bt , Σ Wt , Σ bt theextension to layer-dependent distributions is immediate. More generally, we can consider W t and b t driven by arbitrary SDEs. Moreover, dependencies across the parameters of different layerscan be accommodated by introducing additional SDE-driven processes, commonly driving theevolution of W t and b t . We do not pursue further these directions in the present work. As forthe activation functions, we will require the following assumption. Assumption 3.2 (activation functions regularity) . The function φ : R → R satisfies: φ (0) = 0 , φ is continuously differentiable three times on R , its second and third derivatives have at mostexponential tails growth, i.e. for some k > : lim | x |↑∞ | φ (cid:48)(cid:48) ( x ) | e k | x | + lim | x |↑∞ | φ (cid:48)(cid:48)(cid:48) ( x ) | e k | x | < ∞ . The function ψ : R → R is locally bounded and continuously differentiable two times on R . Interestingly, φ (0) = 0 and a smooth φ have been shown to be key requirements to achieve goodsignal propagation (Hayou et al., 2019a,b). On the basis of Assumption 3.1 and Assumption 3.2,we now report the main result of Peluchetti and Favaro (2020) regarding the ResNet (11). Theorem 3.1.
Under Assumption 3.1 and Assumption 3.2, Assumption 2.1 with δ = 2 holdstrue for the ResNet x t defined in (11). In particular, one has µ x ( x ) = φ (cid:48) (0)( µ b + µ W ψ ( x )) + 12 φ (cid:48)(cid:48) (0) diag( V [ ε Wt ψ ( x ) + ε bt | x ]) σ x ( x ) = φ (cid:48) (0) V [ ε Wt ψ ( x ) + ε bt | x ] . Furthermore, Assumption 2.2 is satisfied, and by Theorem 2.1 the continuous-time interpolation x t of the ResNet x t converges in law to the solution on [0 , T ] of dx t = φ (cid:48) (0) (cid:0) V [ ε Wt ψ ( x t ) + ε bt | x t ] (cid:1) / dB t (12) + (cid:0) φ (cid:48) (0)( µ b + µ W ψ ( x t )) + 12 φ (cid:48)(cid:48) (0) diag( V [ ε Wt ψ ( x t ) + ε bt | x t ]) (cid:1) dt with initial value x = x where B t is a D -dimensional BM vector with independent components. Theorem 3.1 does not establish a direct connection between x t and the driving sources ofstochasticity provided by W t and b t . As we are interested in the properties of deep ResNets in thefunction space, i.e. over multiple inputs, a brute force approach would require to establish diffusionlimits as in Theorem 3.1 for an enlarged x t = [ x (1) t · · · x ( N ) t ] ∈ R DN corresponding to N initialvalues x = [ x (1)0 · · · x ( N )0 ] . Instead, Peluchetti and Favaro (2020) showed that the limiting SDEis equivalent in law to the solution of another SDE which preserves the dependency on the drivingsources of stochasticity. From here on, let x ( i ) t and x ( j ) t denote ResNets corresponding to twoinitial values x ( i )0 and x ( j )0 , respectively, and let x ( i ) t and x ( j ) t denote diffusion limits correspondingto the same two initial values, i.e. x ( i )0 = x ( i )0 and x ( j )0 = x ( j )0 respectively. Hereafter we willcontinue to use x t for x ( i ) t and x t for x ( i ) t when no confusion arises.8 orollary 3.1. Under Assumption 3.1 and Assumption 3.2, the limiting diffusion process x ( i ) t of the ResNet x ( i ) t is obtained as the solution on [0 , T ] of: dx ( i ) t = φ (cid:48) (0)( dW t ψ ( x ( i ) t ) + db t ) + 12 φ (cid:48)(cid:48) (0)( d [ W ψ ( x ( i ) )] t + d [ b ] t ) , (13) and d [ x ( i ) , x ( j ) ] t = φ (cid:48) (0) ( d [ W ψ ( x ( i ) ) , W ψ ( x ( j ) )] t + d [ b, b ] t ) . (14)An immediate consequence of Theorem 3.1 is that the distribution of the ResNet output giventhe input, i.e. p ( x T | x ) , converges to the transition density p ( x T | x ) of the solution of (13). Inparticular, as T is finite, the dependency on the input does not vanish in the limit of infinitetotal depth L and can be controlled via the parameter distributions and the integration time T . The representations (12) and (13) are complementary: depending on the situation it willprove advantageous to use one or the other. Theorem 3.1 and Corollary 3.1 are general in thesense that we allow for an arbitrary covariance structure between the elements of ε Wt , i.e. anarbitrary (constant and deterministic) quadratic covariation for W t . This makes it difficult toderive more explicit results, and is also an impractical approach as the parametrization requires O ( D ) elements. Peluchetti and Favaro (2020) then considered more restrictive distributionassumptions with a more manageable O ( D ) parametrization cost. Assumption 3.3 (matrix Gaussian weights) . Let b t , µ b , σ b , B bt , µ W , B Wt be defined as in As-sumption 3.1. 
Let W_t be the matrix-valued diffusion with values in R^{D×D} solution of:

dW_t = µ^W dt + σ^{WO} dB^W_t σ^{WI},

where σ^{WO}, σ^{WI} ∈ R^{D×D} and Σ^{WO} = σ^{WO} σ^{WO⊤}, Σ^{WI} = σ^{WI⊤} σ^{WI} are positive semi-definite. Let us consider the setting of Assumption 3.1. Under this setting the discretization of W_t satisfies

ε^W_t i.i.d. ∼ MN_{D,D}(0, Σ^{WO}, Σ^{WI})

for t = ∆t, ..., T, where MN stands for the matrix Gaussian distribution. This is an immediate consequence of the following fact: if ζ ∼ MN(0, I, I) then AζB ∼ MN(0, AA^⊤, B^⊤B). The reader is referred to the monograph by Gupta and Nagar (1999) for a comprehensive and stimulating treatment of matrix Gaussian variates and their properties. The fundamental property of the MN distribution is that the covariance factorizes as follows: cov(ε^W_{o,i}, ε^W_{o′,i′}) = Σ^{WO}_{o,o′} Σ^{WI}_{i,i′}.

Corollary 3.2.
Under Assumption 3.2 and Assumption 3.3, the limiting diffusion process x ( i ) t of the ResNet x ( i ) t is obtained as the solution on [0 , T ] of: dx ( i ) t = φ (cid:48) (0) (cid:0) ( µ W ψ ( x ( i ) t ) + µ b ) dt + σ W O dB Wt σ W I ψ ( x ( i ) t ) + σ b dB bt (cid:1) (15) + 12 φ (cid:48)(cid:48) (0) diag (cid:0) Σ b + Σ W O ( ψ ( x ( i ) t ) (cid:62) Σ W I ψ ( x ( i ) t )) (cid:1) dt, and d [ x ( i ) , x ( j ) ] t = φ (cid:48) (0) (cid:0) Σ b + Σ W O ψ ( x ( i ) t ) (cid:62) Σ W I ψ ( x ( j ) t ) (cid:1) dt. Under Assumption 3.3, i.e. matrix Gaussian weights, we have that V [ ε Wt ψ ( x t ) + ε bt | x t ] is givenby Σ b + Σ W O ( ψ ( x t ) (cid:62) Σ W I ψ ( x t )) . In particular, the dependency on the state x t in Equation (12)goes through a linear transformation and a weighted inner product. This fact sheds some lighton the impact of introducing dependencies among row and columns of the weight parameters9 t = ∆ W t . Specifically, the matrix Σ W I defines the structure of the inner weighted product,while the matrix Σ W O defines how such transforms affect each dimension d ∈ D .Peluchetti and Favaro (2020) completed their study by considering the simplest fully i.i.d.setting with the assumption of centered distributions for W t and b t . Fully i.i.d. initializationsare commonly used in training of neural networks. A scaling of the weights by D − / is alsointroduced; this is the same scaling used to obtain Gaussian process limits in infinitely widenetworks (Neal, 1995; Lee et al., 2018). We will see in Section 5 that this scaling allows to studythe case D ↑ ∞ . Assumption 3.4 (fully i.i.d. parameters) . Let W t and b t be diffusion processes with values in R D × D and R D , respectively, and solutions of dW t = σ w √ D dB Wt db t = σ b dB bt for B Wt , B bt independent BMs respectively with values in R D × D , R D and scalars σ w > , σ b > . Let consider the setting of Assumption 3.4. Under this setting the discretizations of W t and b t satisfy: ∆ W t = ε Wt σ w √ D √ ∆ t ; ∆ b t = ε bt σ b √ ∆ t (16) ε Wt i.i.d. ∼ MN
D,D (cid:0) , I D , I D (cid:1) ; ε bt i.i.d. ∼ N D (cid:0) , I D (cid:1) . (17) Corollary 3.3.
Under Assumption 3.2 and Assumption 3.4, the limiting diffusion process x ( i ) t of the ResNet x ( i ) t is obtained as the solution on [0 , T ] of: dx ( i ) t = φ (cid:48) (0) (cid:0) σ w √ D (cid:107) ψ ( x ( i ) t ) (cid:107) dB Wt + σ b dB bt (cid:1) (18) + 12 φ (cid:48)(cid:48) (0) (cid:0) σ b + σ w D (cid:107) ψ ( x ( i ) t ) (cid:107) ) (cid:1) I D dt, and d [ x ( i ) , x ( j ) ] t = φ (cid:48) (0) (cid:0) σ b + σ w D (cid:104) ψ ( x ( i ) t ) , ψ ( x ( j ) t ) (cid:105) (cid:1) I D dt. Observe that in all cases the activation function φ only impacts the dynamics through itslocal behavior at the origin, while this is not the case for the activation ψ . Under the settingof Assumption 3.4, i.e. fully i.i.d. parameters, we have that V [ ε Wt ψ ( x t ) + ε bt | x t ] is given by σ b + σ w D (cid:107) ψ ( x t ) (cid:107) . In particular, the dependency on the state x t in (12) goes only through thenorm of x t , which is permutation invariant in d ∈ D . Accordingly, the distribution of thestochastic processes x t,d is exchangeable across d ∈ D if the distribution of x ,d is so. We willshow in Section 5.2 that, under Assumption 3.4, as D ↑ ∞ x t,d will become i.i.d. over d if x ,d isso. Remark 3.1.
From (13) and (14) we see that the joint evolution of x ( i ) t and x ( j ) t correspondingto 2 inputs x ( i )0 and x ( j )0 , respectively, is not perfectly correlated (unless there are no weightparameters, a not very relevant case). This remains true also in the parameterizations ofAssumption 3.3 and Assumption 3.4. Thus in the limit of infinite total depth L the distributionin function space does not suffer from the perfect correlation problem. The joint distribution p ( x ( i ) T , x ( j ) T | x ( i )0 , x ( j )0 ) is not Gaussian. We will recover the Gaussian case as D ↑ ∞ under theparametrization of Assumption 3.4 in Section 5.2. emark 3.2. A standard time-change result for SDEs (Revuz and Yor, 1999) implies that thetime-scaling of a SDE is equivalent to multiplying the drift and the diffusion coefficients by thescaling constant and by the square root of the scaling constant, respectively, as it can be intuitivelyseen from (6). Furthermore, from (12) we see that it is possible to compensate changes in theintegration time T with changes in the “hyper-parameters” µ b , µ W , Σ b , Σ W in Assumption 3.1 toleave the dynamics of (12) invariant. These observations remain true also in the parameterizationsof Assumption 3.3 and Assumption 3.4. Accordingly, we can restrict to T = 1 without loss ofgenerality. Remark 3.3.
Without further assumptions, the solutions to the limiting SDEs that we haveobtained can be explosive solutions. In particular, from (12) we see that the potentially troublesometerm is represented by the variance matrix included in the drift (15). Assumption 2.3 is satisfiedunder all considered parameter distribution assumptions if either: i) ψ exhibits at most square-rootgrowth, in particular ψ is bounded; or ii) ψ exhibits at most linear growth, in particular ψ is theidentity function, and φ (cid:48)(cid:48) (0) = 0 , in particular φ = tanh . We will show in Section 5.2 that, underAssumption 3.4, as D ↑ ∞ in the case of φ (cid:48)(cid:48) (0) (cid:54) = 0 with ψ the identity function, the explosiontime becomes deterministic. Remark 3.4.
The diffusion limits that we have obtained are based on sufficiently smooth functions φ per Assumption 3.2. Given the popularity of the ReLU activation function φ(a) = max(0, a), we consider here a brief analysis which includes it. In particular, let us assume that φ(a) is positively homogeneous, i.e. φ(αa) = αφ(a) for α > 0; if h is a random variable and γ > 0, then: E[φ(h∆t^γ)]/∆t = E[φ(h)]∆t^{γ−1} and E[φ(h∆t^γ)²]/∆t = E[φ(h)²]∆t^{2γ−1}. Comparing these results with (1) and (2), we see that unless E[φ(h)] = 0, choosing γ = 1/2 would result in the drift term blowing up. The alternative of choosing γ = 1 recovers a non-stochastic limit which can be interpreted as a particular form of Chen et al. (2018). Due to the positive homogeneity of ReLU, scaling the parameters by ∆t^γ is equivalent to scaling the residual block itself, so it is immaterial whether we modify the recursion or reparametrize the parameters.
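A quick numerical check of this remark (our own illustration): for ReLU, E[φ(h)] ≠ 0, so with the γ = 1/2 scaling the drift-type term E[φ(h√∆t)]/∆t grows like ∆t^{−1/2}, while the variance-type term stays bounded; for an odd activation such as tanh, the corresponding drift term stays near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(1_000_000)          # generic Gaussian pre-activation
relu = lambda a: np.maximum(0.0, a)

for dt in [1e-1, 1e-2, 1e-3]:
    drift_relu = relu(h * np.sqrt(dt)).mean() / dt          # gamma = 1/2: grows like dt^(-1/2)
    var_relu = (relu(h * np.sqrt(dt)) ** 2).mean() / dt     # stays close to E[relu(h)^2]
    drift_tanh = np.tanh(h * np.sqrt(dt)).mean() / dt       # ~ 0 since tanh is odd
    print(f"dt={dt:.0e}  ReLU drift={drift_relu:8.2f}  ReLU var={var_relu:.3f}  "
          f"tanh drift={drift_tanh:+.4f}")
```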
So far we have considered x ∈ R D to be the input of the ResNet. A neural network acts as afunction approximator to be fitted to some dataset D = ( Z , Y ) = { ( z ( i ) , y ( i ) ) } Ni =1 of size N where z ( i ) ∈ R Z represents an input and y ( i ) ∈ R Y represents the corresponding output. In particular,classification problems can be framed in this setting if we use a one-hot representation for y ( i ) .In general, there can be a mismatch between D, Z and Y , making it is necessary to introduceadaptation layers z ( i ) (cid:55)→ x ( i )0 and x ( i ) T (cid:55)→ (cid:98) y ( i ) where (cid:98) y ( i ) is the network prediction for z ( i ) . As for x t , we will denote a single data-point ( z ( i ) , y ( i ) ) with ( z, y ) when no confusion arises. In this section, we extend the main results of Peluchetti and Favaro (2020) to the setting ofconvolutional residual neural networks (CNN). Such extension relies on the equivalence betweenconvolutional transformations (either at a given position, or over all positions) and specific formsof matrix multiplication. We restrict our attention to 2D convolutions and square filters forsimplicity of exposition, everything carries over to the more general settings with the intuitivemodifications. Convolutional neural networks are best described by keeping the features, heightand width dimensions separated, in which case x t is a three-dimensional tensor. This doesnot cause issues: we can always consider the vectorization vec( x t ) which allows us to refer todefinitions and results of Section 2, as we did for instance in Assumption 3.1 for W t . We denotethe input image to the convolutional neural network and its layers with x t , t = 0 , ∆ t, . . . , T . Asbefore x t needs to be of fixed dimensionality: x t ∈ R U × V × D , D being the number of channels,and U and V being respectively the height and the width.11e consider square filters of spatial length K , with K being odd, in which case the off-center rangeof the filter is E = ( K − / . Assuming unitary strides in both height and width dimensions,constant dimensionality is achieved by padding the width and height dimensions of x t , t > ,with E pixels borders. The padding can be performed arbitrarily here: typically the valuesnext to the boarder are copied or paddings have the same value of a background reference level.We enumerate the set of P = U V positions, where positions are ordered in row-wise manner(the ordering does not affect the results as long as it is the same everywhere). A convolutionaltransform x ∈ R ( U × V × D ) (cid:55)→ y ∈ R ( U × V × D ) is obtained by applying (convolving) the samefilter W ∈ R D × ( U × V × D ) to the extracted patches x (cid:70) p ∈ R ( U × V × D ) by matrix multiplication: y p = W x (cid:70) p , y p ∈ R D for each p = 1 , . . . , P . Parentheses indicate how the dimensions areflattened (vectorized), and each patch is given by x (cid:70) p = x (cid:70) p, • = x u − E : u + E,v − E : v + E, • for position p = ( u, v ) , u = 1 , . . . , U , v = 1 , . . . , V . We incorporate the padding in the patch extractionoperation: indexing outside the allowed ranges (which happens for positions at the boarders)returns the padded values. More generally a bias term b ∈ R D can be included resulting in y p = W x (cid:70) p + b . See Dumoulin and Visin (2016) and references therein for a comprehensiveaccount. For convenience let F = U V D denote the extracted patch size. 
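The patch-extraction view of a convolution described above can be sketched as follows (a naive reference implementation, not the paper's code; zero padding and the filter shapes are illustrative): each spatial position p is mapped to a flattened K·K·D patch and the filter matrix acts by ordinary matrix multiplication, y_p = W x_{⋆p} + b.

```python
import numpy as np

def conv_as_matmul(x, W, b=None):
    """Apply a (D_out x K*K*D) filter matrix W at every spatial position of
    x in R^{U x V x D} via patch extraction + matrix multiplication; zero padding
    keeps the spatial size constant (other padding schemes are possible)."""
    U, V, D = x.shape
    D_out, F = W.shape
    K = int(round(np.sqrt(F / D)))
    E = (K - 1) // 2
    x_pad = np.pad(x, ((E, E), (E, E), (0, 0)))
    y = np.empty((U, V, D_out))
    for u in range(U):
        for v in range(V):
            patch = x_pad[u:u + K, v:v + K, :].reshape(-1)   # flattened K x K x D patch
            y[u, v] = W @ patch + (0.0 if b is None else b)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))               # U = V = 8 positions, D = 3 channels
W = rng.standard_normal((3, 3 * 3 * 3))          # K = 3 filter, D_out = 3
print(conv_as_matmul(x, W).shape)                # -> (8, 8, 3)
```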
We begin with the mostgeneric parametrization for CNNs covered in this work, which corresponds to Assumption 3.1 forthe fully connected case.
Assumption 3.5 (CNN parameters distribution and scaling) . Let W t and b t be the diffusionprocesses respectively with values in R D × K × K × D and R D solutions of: dW t = µ W dt + d (cid:102) W t d vec( (cid:102) W t ) = σ W d vec( B Wt ) db t = µ b dt + σ b dB bt where B Wt and B bt are independent BMs with independent components respectively with valuesin R D × K × K × D and R D , µ W ∈ R D × K × K × D , µ b ∈ R D , σ W ∈ R DF × DF , σ b ∈ R D × D , and Σ W = σ W σ W (cid:62) , Σ b = σ b σ b (cid:62) are positive semi-definite. That is, W t and b t are tensor and vectorvalued diffusions, solutions of SDEs with arbitrary time-homogeneous and deterministic drift anddiffusion coefficients. Under this setting the discretizations of W t and b t satisfy: ∆ W t = µ W ∆ t + ε Wt √ ∆ t (19) ∆ b t = µ b ∆ t + ε bt √ ∆ t vec( ε Wt ) i.i.d. ∼ N DF (cid:0) , Σ W (cid:1) ε bt i.i.d. ∼ N D (cid:0) , Σ b (cid:1) for t = ∆ t, . . . , T where N stands for the multivariate Gaussian distribution. Again, we considershallow residual blocks and two activation functions, leading to the following recursion: x t +∆ t,p = x t,p + φ (∆ W t ψ ( x t, (cid:70) p ) + ∆ b t ) ( p = 1 , . . . , P ) (20)where ∆ W t ψ ( x t, (cid:70) p ) is computed by matrix multiplication as explained above. The next theoremstates our main convergence (diffusion limit) result in the setting of convolutional ResNets. Thisis the convolutional ResNets counterpart of Theorem 3.1 for fully connected ResNets. Theorem 3.2.
Under Assumption 3.5 and Assumption 3.2, Assumption 2.1 holds true for ec( x t ) ∈ R P D with x t defined in (20) with δ = 2 and: µ x ( x ) = µ (cid:70) ( x (cid:70) ) ... µ (cid:70) ( x (cid:70) P ) µ (cid:70) ( x (cid:70) p ) = φ (cid:48) (0)( µ b + µ W ψ ( x (cid:70) p )) + 12 φ (cid:48)(cid:48) (0) diag( V [ ε Wt ψ ( x (cid:70) p ) + ε bt | x (cid:70) p ]) σ x ( x ) = φ (cid:48) (0) σ (cid:70) ( x (cid:70) , x (cid:70) ) · · · σ (cid:70) ( x (cid:70) , x (cid:70) P ) ... ... ... σ (cid:70) ( x (cid:70) P , x (cid:70) ) · · · σ (cid:70) ( x (cid:70) P , x (cid:70) P ) σ (cid:70) ( x (cid:70) p , x (cid:70) p (cid:48) ) = C [ ε Wt ψ ( x (cid:70) p ) + ε bt , ε Wt ψ ( x (cid:70) p (cid:48) ) + ε bt | x (cid:70) p , x (cid:70) p (cid:48) ] Assumption 2.2 is satisfied, and by Theorem 2.1 the continuous-time interpolation vec( x t ) of vec( x t ) converges in law to the solution on [0 , T ] of d vec( x t ) (21) = φ (cid:48) (0) σ (cid:70) ( x t, (cid:70) , x t, (cid:70) ) · · · σ (cid:70) ( x t, (cid:70) , x t, (cid:70) P ) ... ... ... σ (cid:70) ( x t, (cid:70) P , x t, (cid:70) ) · · · σ (cid:70) ( x t, (cid:70) P , x t, (cid:70) P ) / dB t + µ (cid:70) ( x t, (cid:70) ) ... µ (cid:70) ( x t, (cid:70) P ) dt with initial value x = x where B t is a P D -dimensional BM vector with independent components.
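For concreteness, here is a minimal sketch of one step of the convolutional residual recursion (20) whose interpolation Theorem 3.2 characterizes, using a fully i.i.d. Gaussian filter increment scaled by √∆t (the CNN analogue of the fully i.i.d. parametrization introduced later in Assumption 3.7); all sizes, variances and activations are illustrative.

```python
import numpy as np

def cnn_resnet_step(x, dt, sigma_w=1.0, sigma_b=0.1, K=3,
                    phi=np.tanh, psi=lambda a: a, rng=None):
    """One step x_{t+dt,p} = x_{t,p} + phi(dW_t psi(x_t)_patch(p) + db_t), with an
    i.i.d. Gaussian filter increment of scale sigma_w / sqrt(D) * sqrt(dt)."""
    rng = np.random.default_rng(0) if rng is None else rng
    U, V, D = x.shape
    E = (K - 1) // 2
    dW = rng.standard_normal((D, K * K * D)) * sigma_w / np.sqrt(D) * np.sqrt(dt)
    db = rng.standard_normal(D) * sigma_b * np.sqrt(dt)
    x_pad = np.pad(psi(x), ((E, E), (E, E), (0, 0)))
    out = x.copy()
    for u in range(U):
        for v in range(V):
            patch = x_pad[u:u + K, v:v + K, :].reshape(-1)   # flattened K x K x D patch
            out[u, v] += phi(dW @ patch + db)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 3))
L, T = 200, 1.0
for _ in range(L):
    x = cnn_resnet_step(x, dt=T / L, rng=rng)
print(round(float(np.mean(x ** 2)), 3))
```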
Hereafter we omit all proofs for CNN-related results. Proofs of these results are obtained alonglines similar to the proofs of corresponding results for fully connected neural networks, whilebeing more cumbersome due to the extra spacial dimensions. Notice that the dimensionality ofthe driving Brownian motion depends on
U, V . As in Section 3.1 we can restate Theorem 3.2by making explicit the dependency on the driving sources of randomness. In particular, thisallows us to formulate the dynamics of x t as integration with respect to Brownian motions whosedimensionality does not depend on the number of inputs, nor their spatial sizes U, V . Corollary 3.4.
Under the same assumptions of Theorem 3.2 the limiting process is also givenby the solution on [0 , T ] of the system: dx ( i ) t,p = φ (cid:48) (0)( dW t ψ ( x ( i ) t, (cid:70) p ) + db t ) + 12 φ (cid:48)(cid:48) (0)( d [ W ψ ( x ( i ) (cid:70) p )] t + d [ b ] t ) (22) for p = 1 , . . . , P where dW t and db t are defined in Assumption 3.5 and over two initial valuesand two positions we have: d [ x ( i ) p , x ( j ) p (cid:48) ] t = φ (cid:48) (0) ( d [ W ψ ( x ( i ) (cid:70) p ) , W ψ ( x ( j ) (cid:70) p (cid:48) )] t + d [ b, b ] t ) (23)The parametrization of Assumption 3.5 is O ( D F ) . Hereafter we introduce a more parsimoniousparameterization which is based on tensor Gaussian distributions; this is a natural generalizationof the matrix Gaussian distribution Gupta and Nagar (1999). The use of Kronecker productsallows us to cover this parametrization with a compact notation. We also introduce a fully i.i.d.initialization with the same scaling with D as in the fully connected case. Assumption 3.6 (CNN tensor Gaussian weights) . Let b t , µ b , σ b , B bt , µ W , B Wt be defined as inAssumption 3.5. Let W t be the diffusion tensor with values in R D × ( K × K × D ) solution of: dW t = µ W dt + σ W O dB Wt ( σ W U ⊗ σ W V ⊗ σ W I ) , where σ W O , σ W I ∈ R D × D , σ W U , σ W V ∈ R K × K and Σ W O = σ W O σ W O (cid:62) , Σ W U = σ W U σ W U (cid:62) , Σ W V = σ W V σ W V (cid:62) , Σ W I = σ W I (cid:62) σ W I are positive semi-definite. W t satisfies: ε Wt i.i.d. ∼ T N
D,K,K,D (cid:0) , Σ W O , Σ W U , Σ W V , Σ W I (cid:1) for t = ∆ t, . . . , T , where T N stands for the tensor Gaussian distribution, and we have cov( ε Wo,u,v,i , ε Wo (cid:48) ,u (cid:48) ,v (cid:48) ,i (cid:48) ) = Σ W O o,o (cid:48) Σ W U u,u (cid:48) Σ W V v,v (cid:48) Σ W I i,i (cid:48) . Corollary 3.5.
Under the same assumptions of Theorem 3.2, if W t is distributed according toAssumption 3.6, (22) and (23) are given by: dx ( i ) t,p = φ (cid:48) (0) (cid:0) ( µ W ψ ( x ( i ) t, (cid:70) p ) + µ b ) dt (24) + σ W O dB Wt ( σ W U ⊗ σ W V ⊗ σ W I ) ψ ( x ( i ) t, (cid:70) p ) + σ b dB bt (cid:1) + 12 φ (cid:48)(cid:48) (0) diag (cid:0) Σ b + Σ W O ( ψ ( x ( i ) t, (cid:70) p ) (cid:62) (Σ W U ⊗ Σ W V ⊗ Σ W I ) ψ ( x ( i ) t, (cid:70) p )) (cid:1) dtd [ x ( i ) p , x ( j ) p (cid:48) ] t = φ (cid:48) (0) (cid:0) Σ b + Σ W O ψ ( x ( i ) t, (cid:70) p ) (cid:62) (Σ W U ⊗ Σ W V ⊗ Σ W I ) ψ ( x ( j ) t, (cid:70) p (cid:48) ) (cid:1) dt Assumption 3.7 (CNN fully i.i.d. parameters) . Let W t and b t be the diffusion processesrespectively with values in R D × K × K × D and R D solutions of: dW t = σ w √ D dB Wt db t = σ b dB bt for B Wt , B bt independent BMs respectively with values in R D × D , R D and scalars σ w > , σ b > . Let consider the setting of Assumption 3.4. Under this setting the discretizations of W t and b t satisfy: ∆ W t = ζ Wt σ w √ D √ ∆ t ∆ b t = ζ bt σ b √ ∆ tζ Wt i.i.d. ∼ T N
D,K,K,D (cid:0) , I D , I K , I K , I D (cid:1) ζ bt i.i.d. ∼ N D (cid:0) , I D (cid:1) Corollary 3.6.
Under the same assumptions of Theorem 3.2, if W_t and b_t are distributed according to Assumption 3.7, (22) and (23) are given by:

dx^{(i)}_{t,p} = φ′(0)((σ_w/√D) ‖ψ(x^{(i)}_{t,⋆p})‖ dB^W_t + σ_b dB^b_t) + (1/2) φ′′(0)(σ_b² + (σ_w²/D) ‖ψ(x^{(i)}_{t,⋆p})‖²) I_D dt   (25)

d[x^{(i)}_p, x^{(j)}_{p′}]_t = φ′(0)² (σ_b² + (σ_w²/D) ⟨ψ(x^{(i)}_{t,⋆p}), ψ(x^{(j)}_{t,⋆p′})⟩) I_D dt.

In view of the results obtained in this section, all the remarks of Section 3.1 have a corresponding remark that applies to infinitely deep convolutional ResNets. Namely, the main qualitative conclusions continue to hold: the stochastic process limit is well-behaved, perfect-correlation problems are avoided, and explosive solutions are possible whenever φ′′(0) ≠ 0.

4 ResNets gradient diffusions
In this section we consider the trainability at initialization of very deep ResNets which are finitelywide. In a generic setting, gradient descent (GD) iterations with a fixed learning rate η are ofthe form θ ( b + 1) = θ ( b ) − η ∇R ( θ ( b ))) for b = 0 , , . . . , where θ ( b ) ∈ R Θ is the generic iteration of the parameters of interest, and R ( θ ) is a smooth real-valued loss function to be minimized. Differently from the GD, the stochasticgradient descent (SGD) relies on unbiased estimates of the gradient of the loss function of interest.In particular, for E [ ∇R b ( θ )] = ∇R ( θ ) , SGD iterations with a fixed learning rate η are of theform θ ( b + 1) = θ ( b ) − η ∇R b ( θ ( b )) . Both R ( θ ) and R b ( θ ) are obtained by summing or averaging terms of the form R ( (cid:98) y ( z ) , y ) with R : R Y × R Y → R being the loss function for 1 data-point ( z, y ) and (cid:98) y ( z ) being the prediction ofthe neural network for z . For the rest of this section we consider a single data point and smooth R . A key difficulty in training very deep neural networks is that the gradients with respect tolower layers, i.e. small t for large L in our setting, might vanish or explode. This phenomenonresults in negligible or diverging parameter updates and ultimately in bad training performance.This intuition can be made rigorous by linking the norm of the gradients, or their expectations,to loss function decrements (Bottou et al., 2018).For each t , let θ t denote the weight, or the bias, at layer t , either in the “standard” form (∆ W t , ∆ b t ) or in the “reparametrized” form ( ε Wt , ε bt ) . Then, we can write the following equations ( ∇ θ t − ∆ t R ) (cid:62) = J ( R, x T ) J ( x T , x t ) J ( x t , θ t − ∆ t ) J ( x T , x t ) = J ( x T , x T − ∆ t ) J ( x T − ∆ t , x T − t ) · · · J ( x t +∆ t , x t ) . The problematic term is represented by the Jacobian matrix J ( x T , x t ) . Indeed the matrix J ( x T , x t ) involves a large (infinite in the limit L ↑ ∞ ) number of matrix multiplications forthe lower layers of deep networks, where t ≈ . Observe that J ( x T , x t ) is closely related to J ( x t , x ) as, provided that J ( x t , x ) is invertible, J ( x T , x t ) can be obtained as J ( x T , x t ) = J ( x T , x ) J ( x t , x ) − . In any case, the properties of J ( x t , x ) are most closely related to theproblem of a vanishing/exploding gradient. Hereafter we show that the for infinitely deep ResNets,when φ (cid:48)(cid:48) (0) = 0 , the problem of an exploding gradient is avoided. Moreover, we show that thelimiting process is always invertible. Construction of invertible networks is the main focus ofrecent research (Behrmann et al., 2019), and the invertibility of residual networks has beenempirically shown to be related to model robustness (Engstrom et al., 2019).Let x t follow the ResNet (11), with ψ being the identity function and φ being differentiable on R , i.e., x t +∆ t = x t + φ (∆ W t x t + ∆ b t ) . (26)Let g t = J ( x t , x ) , hence g t +∆ t = J ( x t +∆ t , x t ) g t , and by direct computation we can write thefollowing ∆ g t = (cid:0) φ (cid:48) (∆ W t x t + ∆ b t )1 D (cid:62) (cid:12) ∆ W t (cid:1) g t . Now, we show that the Jacobian matrix J ( x t , x ) is well behaved in the sense that it convergesto the solution J ( x t , x ) of a matrix SDE as L ↑ ∞ . As in the case of x t , we can derive a limitingSDE to which g t converges, as L ↑ ∞ , by establishing the convergence of the correspondinginstantaneous mean and covariance of g t . 
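The Jacobian recursion just described is straightforward to propagate alongside the forward pass; the sketch below (ours, for the fully i.i.d. parametrization with φ = tanh, so φ′′(0) = 0) tracks g_t = J(x_t, x_0) through g_{t+∆t} = (I + diag(φ′(∆W_t x_t + ∆b_t)) ∆W_t) g_t and reports the Frobenius norm of the final Jacobian, which remains of moderate size even for thousands of layers.

```python
import numpy as np

def forward_with_jacobian(x0, L=2000, T=1.0, sigma_w=1.0, sigma_b=0.1, seed=0):
    """Propagate x_t and g_t = J(x_t, x_0) through the ResNet (26) with phi = tanh
    (so phi''(0) = 0) and fully i.i.d. sqrt(dt)-scaled parameters (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    D, dt = x0.shape[0], T / L
    x, g = x0.astype(float).copy(), np.eye(D)
    for _ in range(L):
        dW = rng.standard_normal((D, D)) * sigma_w / np.sqrt(D) * np.sqrt(dt)
        db = rng.standard_normal(D) * sigma_b * np.sqrt(dt)
        pre = dW @ x + db
        # g_{t+dt} = J(x_{t+dt}, x_t) g_t = (I + diag(phi'(pre)) dW) g_t
        g = g + (1.0 - np.tanh(pre) ** 2)[:, None] * dW @ g
        x = x + np.tanh(pre)
    return x, g

x_T, g_T = forward_with_jacobian(np.linspace(-1.0, 1.0, 32))
print("||J(x_T, x_0)||_F =", round(float(np.linalg.norm(g_T)), 3))
```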
Let us denote this limiting SDE by g_t = J(x_t, x_0). Subsequently we can link g_t with W_t and b_t by showing the equivalence in law between g_t and the solution to another SDE. The next theorem directly states the final result.

Theorem 4.1.
Under Assumption 3.1 and Assumption 3.2, the continuous-time interpolation g t of g t converges in law to the solution of the following matrix SDE: dg t = (cid:0) φ (cid:48) (0) dW t + φ (cid:48)(cid:48) (0) d [ W x D (cid:62) (cid:12) W ] t (cid:1) g t (27)When φ (cid:48)(cid:48) (0) = 0 , the dynamics of (27) are known as the (right, time-changed) stochasticexponential of W (Protter, 2005) which here defines a non-explosive process. Moreover, it turnsout that J ( x t , x ) is invertible and we can find the SDE determining the evolution of its inverse. Corollary 4.1.
Under Assumption 3.1 and Assumption 3.2, g t satisfying the matrix SDE (27)is invertible and its inverse satisfies the following matrix SDE: dg − t = g − t (cid:0) − φ (cid:48) (0) dW t − φ (cid:48)(cid:48) (0) d [ W x D (cid:62) (cid:12) W ] t + φ (cid:48) (0) d [ W ] t (cid:1) . (28)Hence, J ( x T , x t ) can be obtained as J ( x T , x t ) = g T g − t by integrating (27) and (28), whichare driven by the same process W . Theorem 4.1 and Corollary 4.1 have two fundamentalconsequences:i) as g t is the Jacobian of the last layer with respect of the first layer of the limiting process x t ,it follows that when φ (cid:48)(cid:48) (0) = 0 the exploding gradient problem is avoided;ii) by the inverse function theorem the limiting process x t is invertible.Note that the results of this section hold for all the parametrizations discussed in Section 3.1. In this section we study ResNets where both the depth L and the dimension D grow unboundedly.In particular, it is assumed that first L ↑ ∞ , and then D ↑ ∞ . Although some results presentedin this section are not completely rigorous, numerical experiments reported in Section 6 supporttheir correctness. Moreover, numerical experiments reported in Section 6 support the conjecturethat analogous results hold when D and L grows unbounded jointly for the smooth activationfunctions here considered. Hereafter we consider the setting of Corollary 3.3, i.e. fully i.i.d.parametrization, with the additional assumption that ψ is the identity function, that is we assumethe ResNet displayed in (26). Then SDE (18) is equivalent, in distribution, to the representationwhere each data point i has an associated D -dimensional BM B ( i ) t , { B ( i ) t } Ni =1 are dependent over i , and each B ( i ) t corresponds to both the weights and biases sources of stochasticity. That is, theSDE (18) is equivalent to the following dx ( i ) t = φ (cid:48) (0)( σ b + σ w q ( i ) t ) / dB ( i ) t + 12 φ (cid:48)(cid:48) (0)( σ b + σ w q ( i ) t ) I D dt (29) d [ B ( i ) , B ( j ) ] t = σ b + σ w λ ( i,j ) t (cid:0) ( σ b + σ w q ( i ) t )( σ b + σ w q ( j ) t ) (cid:1) / I D dt with λ ( i,j ) t = λ ( x ( i ) t , x ( j ) t ) = (cid:104) x ( i ) t , x ( j ) t (cid:105) /D and q ( i ) t = λ ( i,i ) t = q ( x ( i ) t ) = (cid:107) x ( i ) t (cid:107) /D . We additionallydefine m ( i ) t = m ( x ( i ) t ) = 1 /D (cid:80) Dd =1 x ( i ) t,d . As a starting point we need to ensure the well-posednessof (29) for small t > as D ↑ ∞ . Therefore, hereafter we assume that the following limitsexist and are finite: m ( i ) , ∞ = lim D ↑∞ m ( i )0 , q ( i ) , ∞ = lim D ↑∞ q ( i )0 , and λ ( i,j ) , ∞ = lim D ↑∞ λ ( i,j )0 (thenotation does not convey explicitly the dependence of x t , and hence of q t , λ t , m t on D ).16 .1 Weak and full training The connection between Gaussian processes and infinitely wide (finitely) deep neural neuralnetworks is well understood in the literature (Neal, 1995; Lee et al., 2018; Garriga-Alonso et al.,2019). In Section 5.2 we show that similar results hold true in the context of infinitely deep andinfinitely wide ResNets, thus obtaining convergence to a Gaussian process prior. For infinitewide neural networks, the Neural Tangent Kernel (NTK, Jacot et al. (2018); Arora et al. (2019);Lee et al. (2019a)) allows for computing the solution obtained by fully training a neural networkaccording to continuous-time, i.e. infinitesimal learning rate, GD under the assumption ofa quadratic loss. 
The “reparametrized” gradients used by GD are computed with respect toparameters which are i.i.d. distributed according to standard Gaussian distributions and anyscaling is expressed via multiplication, not via the Gaussian distribution’s variance. In our contextthis corresponds to gradients with respect to ( ε Wt , ε bt ) in (17). In particular, if (cid:98) y ( i ) , (cid:98) y ( j ) ∈ R areoutputs of a neural network corresponding to two data points, and if θ is the vector of all modelparameters, the corresponding NTK is defined as K ( i,j ) = (cid:104)∇ θ (cid:98) y ( i ) , ∇ θ (cid:98) y ( j ) (cid:105) (30)As the width of a neural network goes to infinity, the stochastic quantity K ( i,j ) converges to adeterministic limit K ( i,j, ∞ ) for each pair of points under the mentioned assumptions. Buildingon this results, it is possible to establish the equivalence between the solution obtained by fullytraining a neural network via continuous-time GD and kernel regression via the K ( i,j, ∞ ) kernel.We show in Section 5.3 that similar results hold for the case of infinitely deep and infinitely wideResNets: (30) at initialization converges to a deterministic limit. Moreover, it is known (Aroraet al., 2019) that in the aforementioned setting, training only the last output of a neural networkunder the same conditions corresponds to performing Bayesian inference under the Gaussianprocess prior arising in the infinite wide limit. We will thus talk equivalently of Bayesian inferenceand weak training, and we will refer to the standard NTK setting as full training.All the results of Section 5.2 and Section 5.3 concern with a ResNet x T with input x . Aspreviously mentioned it is necessary to complete the ResNet with an input layer adapting theinfinitely wide ResNet to finite-dimensional inputs. Moreover, to more closely resemble neuralnetworks used in practice, an output adaptation layer is commonly introduced as well. InSection 5.4 we investigate the implications to the training of completing the ResNet with inputand output layers. First, let observe that the evolution of (29) is directly governed by q ( i ) t and λ ( i,j ) t . In Lemma A.2in Appendix A we derive the corresponding SDEs that they follow, and we show that as D increases q ( i ) t and λ ( i,j ) t converge to deterministic limits which are obtained as solutions of ordinarydifferential equations (ODE). The drift, diffusion, and correlation coefficients driving (29) convergeto deterministic limit too which results in i.i.d. processes across the dimensions.17 roposition 5.1. As D ↑ ∞ m ( i ) t , q ( i ) t , λ ( i,j ) t converge to the solutions of the ODEs dm ( i ) , ∞ t = 12 φ (cid:48)(cid:48) (0) (cid:0) σ b + σ w q ( i ) , ∞ t (cid:1) dtdq ( i ) , ∞ t = (cid:0) φ (cid:48)(cid:48) (0) m ( i ) , ∞ t + φ (cid:48) (0) (cid:1)(cid:0) σ b + σ w q ( i ) , ∞ t (cid:1) dtdλ ( i,j ) , ∞ t = (cid:0) φ (cid:48)(cid:48) (0)(( σ b + σ w q ( i ) , ∞ t ) m j, ∞ t + ( σ b + σ w q ( j ) , ∞ t ) m ( i ) , ∞ t )+ φ (cid:48) (0) (cid:0) σ b + σ w λ ( i,j ) , ∞ t ) (cid:1) dt. Under the assumption φ (cid:48)(cid:48) (0) = 0 , the solutions for m ( i ) , ∞ T , q ( i ) , ∞ T and λ ( i,j ) , ∞ T are m ( i ) , ∞ T = m ( i ) , ∞ q ( i ) , ∞ T = q ( i ) , ∞ + (cid:18) q ( i ) , ∞ + σ b σ w (cid:19) (cid:16) e φ (cid:48) (0) σ w T − (cid:17) λ ( i,j ) , ∞ T = λ ( i,j ) , ∞ + (cid:18) λ ( i,j ) , ∞ + σ b σ w (cid:19) (cid:16) e φ (cid:48) (0) σ w T − (cid:17) , respectively. 
Under the assumption $\phi''(0) \neq 0$, let $c_1$ and $c_2$ be two constants, and let $C = -\phi'(0)^4 \sigma_w^2 + \phi''(0)^2\big(\sigma_b^2 + \sigma_w^2 c_1\big)$. Then, the solutions for $m^{(i),\infty}_T$ and $q^{(i),\infty}_T$ are
$$ m^{(i),\infty}_T = \frac{1}{\phi''(0)}\left\{ -\phi'(0)^2 + \frac{\sqrt{C}}{\sigma_w}\, \tan\!\left( \frac{\sigma_w \sqrt{C}}{2}\,(T + 2c_2) \right) \right\} \qquad (31) $$
$$ q^{(i),\infty}_T = \frac{1}{\phi''(0)^2 \sigma_w^2}\left\{ -\phi''(0)^2 \sigma_b^2 + C\, \sec^2\!\left( \frac{\sigma_w \sqrt{C}}{2}\,(T + 2c_2) \right) \right\}, \qquad (32) $$
respectively.

Proposition 5.2. As $D \uparrow \infty$ each $x^{(i)}_t$ converges to $x^{(i),\infty}_t$, with each $x^{(i),\infty}_t$ i.i.d. across the dimensions. For $x^{(i),\infty}_t = (x^{(i),\infty}_{t,1}, x^{(i),\infty}_{t,2}, \dots)$, and $d, u \ge 1$,
$$ dx^{(i),\infty}_{t,d} = \phi'(0)\big(\sigma_b^2 + \sigma_w^2 q^{(i),\infty}_t\big)^{1/2}\, dB^{(i),\infty}_{t,d} + \tfrac{1}{2}\phi''(0)\big(\sigma_b^2 + \sigma_w^2 q^{(i),\infty}_t\big)\, dt \qquad (33) $$
$$ d[B^{(i),\infty}_d, B^{(j),\infty}_u]_t = \frac{\sigma_b^2 + \sigma_w^2 \lambda^{(i,j),\infty}_t}{\big( (\sigma_b^2 + \sigma_w^2 q^{(i),\infty}_t)(\sigma_b^2 + \sigma_w^2 q^{(j),\infty}_t) \big)^{1/2}}\, \delta_{d,u}\, dt $$
where $B^{(i),\infty}_{t,1}, B^{(i),\infty}_{t,2}, \dots$ are scalar BMs dependent over $i$, and $q^{(i),\infty}_t$, $\lambda^{(i,j),\infty}_t$ are obtained by solving the ODEs in Proposition 5.1. Over the two data points indexed by $i, j$ this is a 2-dimensional SDE with time-dependent and deterministic drift and diffusion coefficients which admits a bivariate Gaussian transition density:
$$ p\big( x^{(i),\infty}_{T,d}, x^{(j),\infty}_{T,d} \,\big|\, x^{(i)}_{0,d}, x^{(j)}_{0,d} \big) = N\!\left( \begin{bmatrix} x^{(i)}_{0,d} + m^{(i),\infty}_T - m^{(i),\infty}_0 \\ x^{(j)}_{0,d} + m^{(j),\infty}_T - m^{(j),\infty}_0 \end{bmatrix}, \begin{bmatrix} v^{(i),\infty}_T - v^{(i),\infty}_0 & c^{(i,j),\infty}_T - c^{(i,j),\infty}_0 \\ c^{(i,j),\infty}_T - c^{(i,j),\infty}_0 & v^{(j),\infty}_T - v^{(j),\infty}_0 \end{bmatrix} \right), \qquad (34) $$
where $v^{(i),\infty}_t = q^{(i),\infty}_t - (m^{(i),\infty}_t)^2$ and $c^{(i,j),\infty}_t = \lambda^{(i,j),\infty}_t - m^{(i),\infty}_t m^{(j),\infty}_t$. In the doubly infinite limit the last layer is thus, conditionally on the first layer, Gaussian with kernel $K^{(i,j),\infty} = c^{(i,j),\infty}_T - c^{(i,j),\infty}_0$ and mean function $M^{(i),\infty}_d = x^{(i)}_{0,d} + m^{(i),\infty}_T - m^{(i),\infty}_0$. That is,
i) when $\phi''(0) = 0$, we have $M^{(i),\infty}_d = x^{(i)}_{0,d}$ and $K^{(i,j),\infty} = \big( \lambda^{(i,j),\infty}_0 + \frac{\sigma_b^2}{\sigma_w^2} \big)\big( e^{\phi'(0)^2 \sigma_w^2 T} - 1 \big)$;
ii) when $\phi''(0) \neq 0$, from (31)-(32) we obtain the deterministic explosion time of $x^{(i),\infty}_{t,d}$ by solving $\frac{\sigma_w \sqrt{C}}{2}(T + 2c_2) = \frac{\pi}{2}$ in $T$; the constants $c_1, c_2$ depend on $m^{(i),\infty}_0, q^{(i),\infty}_0$ and have to be determined numerically.
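As a small illustration of Proposition 5.1 and Proposition 5.2, the sketch below evaluates the closed-form solutions in the $\phi''(0) = 0$ case and draws samples from the bivariate Gaussian transition density (34). Function names and default values are ours; the formulas follow our reconstruction of the expressions above.

```python
import numpy as np

def limit_moments(m0, q0, lam0, T=1.0, sb2=1.0, sw2=1.0, dphi0=1.0):
    """Closed-form solutions of the ODEs in Proposition 5.1 when phi''(0) = 0."""
    E = np.exp(dphi0**2 * sw2 * T)
    qT = q0 + (q0 + sb2 / sw2) * (E - 1.0)
    lamT = lam0 + (lam0 + sb2 / sw2) * (E - 1.0)
    return m0, qT, lamT                                     # m_T = m_0 in this case

def sample_transition(x0_i, x0_j, q0_i, q0_j, lam0, m0_i=0.0, m0_j=0.0,
                      T=1.0, sb2=1.0, sw2=1.0, dphi0=1.0, n=10000, seed=0):
    """Draws from the bivariate Gaussian transition density (34), phi''(0) = 0 case."""
    rng = np.random.default_rng(seed)
    _, qTi, lamT = limit_moments(m0_i, q0_i, lam0, T, sb2, sw2, dphi0)
    _, qTj, _ = limit_moments(m0_j, q0_j, lam0, T, sb2, sw2, dphi0)
    v_i = (qTi - m0_i**2) - (q0_i - m0_i**2)                # v_T - v_0 (means are constant)
    v_j = (qTj - m0_j**2) - (q0_j - m0_j**2)
    c = (lamT - m0_i * m0_j) - (lam0 - m0_i * m0_j)         # c_T - c_0
    mean = np.array([x0_i, x0_j])                           # M_d = x_{0,d} when phi''(0) = 0
    cov = np.array([[v_i, c], [c, v_j]])
    return rng.multivariate_normal(mean, cov, size=n)
```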
Let $\theta = \{\varepsilon^W_t, \varepsilon^b_t\}_{t=0}^{T-\Delta t}$ denote the “reparametrized” collection of parameters with respect to which we compute the NTK. In the rest of the section we establish the convergence of $K^{(i,j)}$ to a deterministic limit as $L \uparrow \infty$ and then $D \uparrow \infty$. In particular, we operate under the following assumptions: i) $x_t$ follows (26); ii) the parameters follow Assumption 3.4; iii) $\phi''(0) = 0$; iv) $\hat y = x_{T,1}$.
We have $K^{(i,j)} = K^{(i,j)}_W + K^{(i,j)}_b$, where $K^{(i,j)}_W = \sum_{t=\Delta t}^{T} K^{(i,j)}_{W,t}$, $K^{(i,j)}_b = \sum_{t=\Delta t}^{T} K^{(i,j)}_{b,t}$, and
$$ K^{(i,j)}_{W,t} = J(\hat y^{(i)}, x^{(i)}_T)\, J(x^{(i)}_T, x^{(i)}_t)\, J(x^{(i)}_t, \Delta W_{t-\Delta t}) \Big( J(\hat y^{(j)}, x^{(j)}_T)\, J(x^{(j)}_T, x^{(j)}_t)\, J(x^{(j)}_t, \Delta W_{t-\Delta t}) \Big)^{\!\top} \frac{\sigma_w^2 \Delta t}{D} $$
$$ K^{(i,j)}_{b,t} = J(\hat y^{(i)}, x^{(i)}_T)\, J(x^{(i)}_T, x^{(i)}_t)\, J(x^{(i)}_t, \Delta b_{t-\Delta t}) \Big( J(\hat y^{(j)}, x^{(j)}_T)\, J(x^{(j)}_T, x^{(j)}_t)\, J(x^{(j)}_t, \Delta b_{t-\Delta t}) \Big)^{\!\top} \sigma_b^2 \Delta t, $$
as $J(x_t, \varepsilon^W_{t-\Delta t}) = J(x_t, \Delta W_{t-\Delta t})\, \sigma_w \sqrt{\Delta t}/\sqrt{D}$ and $J(x_t, \varepsilon^b_{t-\Delta t}) = J(x_t, \Delta b_{t-\Delta t})\, \sigma_b \sqrt{\Delta t}$. Let us recall from our study in Section 4 that, as $L \to \infty$, we have $J(x^{(i)}_T, x^{(i)}_t) \to g_T g_t^{-1}$ and $J(x^{(j)}_T, x^{(j)}_t) \to g_T g_t^{-1}$, as the evolution of $g_t$ does not depend on $x_t$ when $\phi''(0) = 0$. Furthermore, observe that $J(x_t, \Delta W_{t-\Delta t})_{d,(i,j)} \to \phi'(0)\, \delta_{d,i}\, x_{t,j}$ and $J(x_t, \Delta b_{t-\Delta t})_{d,i} \to \phi'(0)\, \delta_{d,i}$. By combining these results, and by assuming that the interchange of limits is justified, we write
$$ K^{(i,j)}_W \to \phi'(0)^2 \sigma_w^2\, J(\hat y^{(i)}, x^{(i)}_T)\, g_T \left[ \int_0^T \frac{\langle x^{(i)}_t, x^{(j)}_t \rangle}{D}\, g_t^{-1} (g_t^{-1})^\top dt \right] g_T^\top\, J(\hat y^{(j)}, x^{(j)}_T)^\top $$
$$ K^{(i,j)}_b \to \phi'(0)^2 \sigma_b^2\, J(\hat y^{(i)}, x^{(i)}_T)\, g_T \left[ \int_0^T g_t^{-1} (g_t^{-1})^\top dt \right] g_T^\top\, J(\hat y^{(j)}, x^{(j)}_T)^\top. $$
Now, $g_t^{-1}(g_t^{-1})^\top = (g_t^\top g_t)^{-1}$. Accordingly, by an application of Ito's formula for matrix SDE products (Protter, 2005, Chapter V, Theorem 47), for $U_t = g_t^\top g_t$ we obtain the following SDE
$$ dU_t = \frac{\phi'(0)\sigma_w}{\sqrt{D}}\, g_t^\top \big( dB^W_t + (dB^W_t)^\top \big)\, g_t + \phi'(0)^2 \sigma_w^2\, U_t\, dt, $$
where $U_0 = I_D$, and whose quadratic variation (a matrix, in this particular case) is of the form
$$ d[U]_t = \frac{\phi'(0)^2 \sigma_w^2}{D} \big( g_t^\top \odot g_t^\top \big)\big( g_t \odot g_t \big)\, dt, $$
which vanishes as $D \to \infty$. Therefore, $U_t \to U^\infty_t$ where $dU^\infty_t = \phi'(0)^2 \sigma_w^2\, U^\infty_t\, dt$. Thus, as $D \to \infty$ the term $g_t^\top g_t$ is an infinite-dimensional diagonal matrix with constant element $u^\infty_t$ computable by solving the ODE $du^\infty_t = \phi'(0)^2 \sigma_w^2\, u^\infty_t\, dt$ with initial value $u^\infty_0 = 1$, i.e. $u^\infty_t = \exp(\phi'(0)^2 \sigma_w^2 t)$. Observe that: i) the matrix $\int_0^T (g_t^\top g_t)^{-1} dt$ is asymptotically diagonal with constant element $(1 - \exp(-\phi'(0)^2 \sigma_w^2 T))/(\phi'(0)^2 \sigma_w^2)$; ii) the matrix $g_T g_T^\top$ is asymptotically diagonal with constant element $\exp(\phi'(0)^2 \sigma_w^2 T)$. Therefore, the matrix $g_T \big[ \int_0^T g_t^{-1}(g_t^{-1})^\top dt \big] g_T^\top$ is asymptotically diagonal with value $(\exp(\phi'(0)^2 \sigma_w^2 T) - 1)/(\phi'(0)^2 \sigma_w^2)$. Observe that we rely on the assumption that the approximation errors due to considering each term separately vanish as $D \uparrow \infty$, or at least that they cancel out. Finally, $\hat y = x_{T,1}$ corresponds to selecting the first element of this diagonal matrix. If $E = e^{\phi'(0)^2 \sigma_w^2 T}$, then
$$ K^{(i,j)}_b \to K^{(i,j),\infty}_b = \frac{\sigma_b^2}{\sigma_w^2}\,(E - 1). $$
A small Monte Carlo check of the concentration of $g_t^\top g_t$ is sketched below.
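The concentration of $g_t^\top g_t$ on $u^\infty_t I_D$ can be checked numerically. The sketch below assumes our reading of the matrix SDE for $g_t$ in the $\phi''(0) = 0$ case, namely $dg_t = \phi'(0)(\sigma_w/\sqrt{D})\, dB_t\, g_t$ with $g_0 = I_D$, which is consistent with the SDE for $U_t$ displayed above; it is an illustrative check, not the authors' code.

```python
import numpy as np

def gram_of_jacobian(D=200, L=400, T=1.0, sw2=1.0, dphi0=1.0, seed=0):
    """Euler sketch of dg_t = phi'(0)(sigma_w/sqrt(D)) dB_t g_t, g_0 = I_D, returning U_T = g_T^T g_T."""
    rng = np.random.default_rng(seed)
    dt = T / L
    g = np.eye(D)
    for _ in range(L):
        dB = rng.normal(scale=np.sqrt(dt), size=(D, D))
        g = g + dphi0 * np.sqrt(sw2 / D) * dB @ g
    return g.T @ g

U = gram_of_jacobian()
predicted = np.exp(1.0)                       # u_T = exp(phi'(0)^2 sigma_w^2 T) with the defaults above
print(np.diag(U).mean(), predicted)           # the diagonal is close to the predicted value
print(np.abs(U - np.diag(np.diag(U))).max())  # off-diagonal entries are comparatively small
```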
Along similar lines we obtain the deterministic limit to which $K^{(i,j)}_{W}$ converges as $D \uparrow \infty$, i.e.,
$$ K^{(i,j)}_W \to K^{(i,j),\infty}_W = \lambda^{(i,j),\infty}_0\, \phi'(0)^2 \sigma_w^2 T E + \frac{\sigma_b^2}{\sigma_w^2}\Big[ \phi'(0)^2 \sigma_w^2 T E - (E - 1) \Big], $$
hence obtaining the main NTK convergence result:
$$ K^{(i,j)} \to K^{(i,j),\infty} = \lambda^{(i,j),\infty}_0\, C E + \frac{\sigma_b^2}{\sigma_w^2}\, C E, $$
where $E = \exp(C)$ and $C = \phi'(0)^2 \sigma_w^2 T$. This can be contrasted with the main result of Section 5.2, where we have shown that the (standard) kernel corresponding to $D \uparrow \infty$ for $\phi''(0) = 0$ is given by
$$ K^{(i,j)} \to K^{(i,j),\infty} = \lambda^{(i,j),\infty}_0\,(E - 1) + \frac{\sigma_b^2}{\sigma_w^2}\,(E - 1). $$
Observe that the two kernels correspond to two different training regimes: i) training all layers of the neural network; ii) training only the output layer of the neural network. However, the two kernels are qualitatively similar. In particular, both kernels depend linearly on $\lambda^{(i,j),\infty}_0$. The only difference lies in the behavior of $(E-1)$ compared with $CE$ as a function of $C$.
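A short numerical comparison of the two limiting kernels, using the expressions reconstructed above, makes the point explicit: both are affine in $\lambda^{(i,j),\infty}_0$ and differ only through the factor $(E-1)$ versus $CE$. Function names and the example values are ours.

```python
import numpy as np

def weak_kernel(lam0, T=1.0, sb2=1.0, sw2=1.0, dphi0=1.0):
    """Weak-training (Gaussian process) kernel: (lam0 + sb^2/sw^2)(E - 1)."""
    E = np.exp(dphi0**2 * sw2 * T)
    return (lam0 + sb2 / sw2) * (E - 1.0)

def ntk_limit_kernel(lam0, T=1.0, sb2=1.0, sw2=1.0, dphi0=1.0):
    """Full-training (NTK) limit: (lam0 + sb^2/sw^2) C E, with C = phi'(0)^2 sw^2 T and E = exp(C)."""
    C = dphi0**2 * sw2 * T
    return (lam0 + sb2 / sw2) * C * np.exp(C)

for T in (0.25, 0.5, 1.0, 2.0):
    print(T, weak_kernel(0.5, T), ntk_limit_kernel(0.5, T))
```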
Results presented in Section 5.2 and Section 5.3 entail a neural network with an infinite-dimensional input. Let $z, z' \in \mathbb{R}^Z$ be two inputs of the neural network. Let us consider a linear adaptation layer, i.e. an embedding, of the form $x_0 = Az$ where, in line with Section 5.2 and Section 5.3, the elements of $A \in \mathbb{R}^{D \times Z}$ are i.i.d. as $N(0, \sigma_Z^2)$. It follows that across $d$ we have $(x_{0,d}, x'_{0,d}) \overset{\text{i.i.d.}}{\sim} N(0, \Sigma_Z(z, z'))$, where
$$ \Sigma_Z(z, z') = \sigma_Z^2 \begin{bmatrix} \|z\|^2 & \langle z, z' \rangle \\ \langle z', z \rangle & \|z'\|^2 \end{bmatrix}. $$
(For convenience we use in this section the $z, z'$ notation instead of $z^{(i)}, z^{(j)}$, and proceed in the same way for all other quantities depending on $i, j$.)
By the strong law of large numbers, $\lambda_0 = D^{-1}\langle x_0, x'_0 \rangle \to \lambda_0^{\infty} = \mathbb{E}[x_{0,1} x'_{0,1}] = \sigma_Z^2 \langle z, z' \rangle$ as $D \uparrow \infty$; hence $q_0^{\infty} = \sigma_Z^2 \|z\|^2$ and $q'^{\,\infty}_0 = \sigma_Z^2 \|z'\|^2$. In the weak training setting, which is equivalent to Bayesian inference with a Gaussian process prior, we know from Section 5.2 that across $d$ we have
$$ \big( x^{\infty}_{T,d}, x'^{\,\infty}_{T,d} \,\big|\, x_{0,d}, x'_{0,d} \big) \overset{\text{i.i.d.}}{\sim} N\!\left( \begin{bmatrix} x_{0,d} \\ x'_{0,d} \end{bmatrix}, \Sigma_{\mathrm{weak}}(z, z') \right) $$
where
$$ \Sigma_{\mathrm{weak}}(z, z') = \sigma_Z^2 (E - 1) \begin{bmatrix} \|z\|^2 & \langle z, z' \rangle \\ \langle z', z \rangle & \|z'\|^2 \end{bmatrix} + \frac{\sigma_b^2}{\sigma_w^2}(E - 1) $$
with $E = e^{\phi'(0)^2 \sigma_w^2 T}$. Then, by direct computation we obtain the following Gaussian distribution
$$ \big( x^{\infty}_{T,d}, x'^{\,\infty}_{T,d} \big) \overset{\text{i.i.d.}}{\sim} N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \Sigma_Z(z, z') + \Sigma_{\mathrm{weak}}(z, z') \right) = N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \sigma_Z^2 E \begin{bmatrix} \|z\|^2 & \langle z, z' \rangle \\ \langle z', z \rangle & \|z'\|^2 \end{bmatrix} + \frac{\sigma_b^2}{\sigma_w^2}(E - 1) \right). $$
That is, the prior distribution induced by a doubly infinite ResNet with the input adaptation layer is i.i.d. across the dimensions $d$, and distributed as a centered Gaussian process with kernel
$$ K(z, z') = \sigma_Z^2 E \langle z, z' \rangle + \frac{\sigma_b^2}{\sigma_w^2}(E - 1). \qquad (35) $$
We also augment the neural network with an output adaptation layer $\hat y = G x_T$, where the elements of $G \in \mathbb{R}^{1 \times D}$ are i.i.d. as $N(0, \sigma_Y^2 / D)$. Then, it follows that the doubly infinite ResNet with both input and output adaptation layers still follows a Gaussian process, whose kernel is
$$ K_Y(z, z') = \sigma_Y^2\, K(z, z'). \qquad (36) $$
Now, consider the Bayesian noiseless linear model with fully independent prior distributions formulated by $\hat y = \alpha + \beta^\top z$ where $\alpha \in \mathbb{R}$, $\alpha \sim N(0, \sigma_\alpha^2)$, $\beta \in \mathbb{R}^Z$, $\beta_i \sim N(0, \sigma_\beta^2)$ for $i = 1, \dots, Z$; then
$$ (\hat y, \hat y') \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \sigma_\beta^2 \begin{bmatrix} \|z\|^2 & \langle z, z' \rangle \\ \langle z', z \rangle & \|z'\|^2 \end{bmatrix} + \sigma_\alpha^2 \right). $$
Thus, according to (35) it follows that, within the doubly infinite limit, the completed ResNet prior model collapses to a noiseless Bayesian linear regression prior where $\sigma_\alpha^2 = \sigma_b^2 \sigma_Y^2 (E - 1)/\sigma_w^2$ and $\sigma_\beta^2 = \sigma_Y^2 \sigma_Z^2 E$.
Under the fully trained setting, in Section 5.3 we have established the convergence of the NTK. Now, we consider directly the doubly infinite ResNet augmented with both input and output layers as previously defined. Recall that in the NTK literature the input layer is sometimes not trained, and the output layer is sometimes omitted (Arora et al., 2019). Hereafter we report only the result for the special case in which all layers are present and trained, as it most closely resembles standard practice for finitely-sized networks:
$$ K_{\mathrm{NTK}}(z, z') = \sigma_Y^2\left( \sigma_Z^2 (C + 1) E \langle z, z' \rangle + \frac{\sigma_b^2}{\sigma_w^2} C E \right) + K_Y(z, z') = \sigma_Y^2\left( \sigma_Z^2 (C + 2) E \langle z, z' \rangle + \frac{\sigma_b^2}{\sigma_w^2}\big( C E + E - 1 \big) \right). \qquad (37) $$
For more general cases, the results follow along lines similar to the steps detailed in Section 5.3. In particular, in all cases the kernel remains affine in $\langle z, z' \rangle$, and only the coefficients are affected. Establishing the equivalence between kernel regression and fully trained neural networks requires the following steps: i) establishing the NTK convergence at initialization, as we did in Section 5.3, and ii) bounding the NTK fluctuations during training, see Arora et al. (2019); Lee et al. (2019b). To the best of our knowledge, the second step has not been formally established for architectures which are not feedforward, i.e. for ResNets in particular. Assuming such a result, (37) shows that in the doubly infinite limit fully trained ResNets correspond to noiseless (kernel) linear regression. Kernel regression is equivalent to the posterior predictive mean of a Gaussian process with the same kernel. Hence, as far as point predictions are concerned, both weakly and fully trained doubly infinite networks collapse to a linear model. A minimal numerical sketch of these limiting kernels is given below.
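The following sketch evaluates the limiting kernels (36) and (37) and the corresponding Gaussian process posterior mean (equivalently, kernel regression). Names and default hyper-parameter values are ours.

```python
import numpy as np

def k_weak(z1, z2, sb2=1.0, sw2=1.0, sz2=0.1, sy2=1.0, T=1.0, dphi0=1.0):
    """Kernel (36): weakly trained doubly infinite ResNet with input/output adaptation layers."""
    E = np.exp(dphi0**2 * sw2 * T)
    return sy2 * (sz2 * E * np.dot(z1, z2) + (sb2 / sw2) * (E - 1.0))

def k_full(z1, z2, sb2=1.0, sw2=1.0, sz2=0.1, sy2=1.0, T=1.0, dphi0=1.0):
    """Kernel (37): fully trained (NTK) doubly infinite ResNet, all layers trained."""
    C = dphi0**2 * sw2 * T
    E = np.exp(C)
    return sy2 * (sz2 * (C + 2.0) * E * np.dot(z1, z2) + (sb2 / sw2) * (C * E + E - 1.0))

def posterior_mean(kernel, Z_train, y_train, Z_test, jitter=1e-6):
    """Gaussian process posterior predictive mean, i.e. kernel regression, with the chosen kernel."""
    K = np.array([[kernel(a, b) for b in Z_train] for a in Z_train])
    Ks = np.array([[kernel(a, b) for b in Z_train] for a in Z_test])
    alpha = np.linalg.solve(K + jitter * np.eye(len(Z_train)), y_train)
    return Ks @ alpha
```

Because both kernels are an inner-product term plus a constant, the resulting predictor is an affine function of $z$; this is the sense in which the doubly infinite limit collapses to (kernel) linear regression.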
We start by introducing all the neural network models considered in this section. In all the experiments we set $\psi$ to the identity function and, without loss of generality, we assume $T = 1$. Regarding fully connected networks, we consider the fully i.i.d. parametrization of Assumption 3.4. When $Z = 1$, i.e. for 1-dimensional inputs, we can opt for copying the input across all dimensions: $x_{0,\bullet} = z$ for an input $z$, i.e. $x_{0,d} = z$ for each $d = 1, \dots, D$. We refer to this model as $F_{\tanh}$ when $\phi = \tanh$ and as $F_{\mathrm{swish}}$ when $\phi = \mathrm{swish}$. The swish activation function ($\mathrm{swish}(x) = x\, \mathrm{sigmoid}(x)$) has been shown empirically (Ramachandran et al., 2017) and theoretically (Hayou et al., 2019a) to be competitive. More generally, for any input dimension $Z$, we complete the model with input and output adaptation layers as defined in Section 5.4, choosing $\sigma_Z^2 = 1/Z$ and $\sigma_Y^2 = 1/D$. We refer to such completed models as $\bar F_{\tanh}$ and $\bar F_{\mathrm{swish}}$.
Regarding convolutional networks, we consider the fully i.i.d. parametrization of Assumption 3.7. A generic input here has dimension $U \times V \times C$, with $U, V, C$ representing the input height, width and number of channels. The adaptation layer is a 1-by-1 convolution adapting the number of channels to the model dimension $D$. More precisely, for each $p$, $x_{0,p} = A z_p$, where $p$ indexes the $UV$ positions and the elements of $A \in \mathbb{R}^{D \times C}$ are i.i.d. as $N(0, 1/C)$. The output layer is composed again of a 1-by-1 convolution, followed by global spatial averaging. That is, $\hat y = \frac{1}{UV}\sum_{p=1}^{UV} G\, x_{T,p}$, where again $p$ indexes the $UV$ positions and the elements of $G \in \mathbb{R}^{Y \times D}$ are i.i.d. as $N(0, 1/D)$. We refer to this convolutional model with $\phi = \tanh$ as $C_{\tanh}$.
We start with a numerical study of the correctness of the results of Section 3, Section 4 and Section 5. We consider $F_{\tanh}$ with $\sigma_w^2 = \sigma_b^2 = 1$ and two 1-dimensional inputs $z^{(1)} = 0$, $z^{(2)} = 1$, hence $x^{(1)}_{0,\bullet} = z^{(1)}$, $x^{(2)}_{0,\bullet} = z^{(2)}$, and simulate a large number of draws of the first dimension ($d = 1$) of:
a) $x^{(1)}_T$, $x^{(2)}_T$ via the ResNet recursion (8);
b) $x^{(1)}_T$, $x^{(2)}_T$ via the discretization (6) of the limiting SDE (29);
c) $x^{(1),\infty}_T$, $x^{(2),\infty}_T$ via the analytical transition density (34);
for $L = D = 500$. We only consider the first dimension because, as observed in Section 5.2, in the limit of $D \uparrow \infty$ the dimensions are i.i.d. Our analysis implies that a) and b) are equivalent when $L \uparrow \infty$, and that c) is equivalent to b) when additionally $D \uparrow \infty$. As both $D$ and $L$ are large, we expect good agreement between the distributions corresponding to a), b) and c). Numerical results are reported in Figure 2, where a good agreement with the theory is indeed observed; a minimal sketch of the finite-network simulation in a) follows the figure caption.

Figure 2: For model $F_{\tanh}$: 2D KDE (kernel density estimator) plot for $(\hat y_1(z^{(1)}), \hat y_1(z^{(2)}))$ (left), 1D KDE and histogram plots for $\hat y_1(z^{(1)})$ (center) and $\hat y_1(z^{(2)})$ (right) when $\hat y$ is sampled from a ResNet (resnet), from the Euler discretization of its limiting SDE for $L \uparrow \infty$ (sde), and from the analytical SDE transition density for $L, D \uparrow \infty$ (analytical); $\hat y$ denotes a generic model output, hence $\hat y_1$ is its first dimension.
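The sketch below encodes our reading of recursion (8) for the fully i.i.d. parametrization with $\psi$ the identity, i.e. a residual update $x \mapsto x + \phi(Wx + b)$ with per-layer variances $\sigma_w^2 \Delta t / D$ and $\sigma_b^2 \Delta t$, $\Delta t = T/L$; it is illustrative only, and the names are ours.

```python
import numpy as np

def resnet_forward(z, L=500, D=500, sb2=1.0, sw2=1.0, phi=np.tanh, T=1.0, seed=0):
    """Residual update x <- x + phi(W x + b) with W ~ N(0, sw2*dt/D), b ~ N(0, sb2*dt), dt = T/L;
    the scalar input z is copied across all D dimensions, as in the F_tanh model."""
    rng = np.random.default_rng(seed)
    dt = T / L
    x = np.full(D, float(z))
    for _ in range(L):
        W = rng.normal(scale=np.sqrt(sw2 * dt / D), size=(D, D))
        b = rng.normal(scale=np.sqrt(sb2 * dt), size=D)
        x = x + phi(W @ x + b)
    return x

# joint draws of the first output dimension for the two inputs of Figure 2;
# the same seed gives the same weights for both inputs, as required for their joint law
# (reduce L and D for a quick run)
pairs = [(resnet_forward(0.0, seed=s)[0], resnet_forward(1.0, seed=s)[0]) for s in range(100)]
```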
For the same neural network model $F_{\tanh}$, Figure 3 displays the convergence of the neural tangent kernels $K^{(1,2)}_W$, $K^{(1,2)}_b$ to their limits $K^{(1,2),\infty}_W$, $K^{(1,2),\infty}_b$ for $z^{(1)} = 1$, $z^{(2)} = 2$. The convergence is assessed in the setting where both the depth $L$ and the dimension $D$ grow unboundedly and jointly. Results displayed in Figure 3 support the analysis of Section 5.3. Results also support the conjecture that the order in which the limits are taken does not impact the results for the smooth activation functions considered in the present work.

Figure 3: For model $F_{\tanh}$: neural tangent kernels $K^{(1,2)}_W$, $K^{(1,2)}_b$ as a function of $L = D$ and their analytical limits $K^{(1,2),\infty}_W$, $K^{(1,2),\infty}_b$ corresponding to $D, L \uparrow \infty$; the empirical average is plotted with a solid line, and shaded areas correspond to empirical standard deviations around it.

We show empirically that the dependency on the input is retained, and that the output distribution does not exhibit perfect correlation, for very deep residual networks constructed as in the present paper. Again, we consider the neural network model $F_{\tanh}$ with $\sigma_w^2 = \sigma_b^2 = 1$. Figure 2 shows that $x^{(1)}_{T,1}$ and $x^{(2)}_{T,1}$ have different distributions. This means that the input dependency is retained in the neural network.
Furthermore, from the left plot of Figure 2 we see that $x^{(1)}_{T,1}$ and $x^{(2)}_{T,1}$ are not perfectly correlated, otherwise the 2D KDE would collapse to a straight line.
Figure 4 (top panels), which can be contrasted with Figure 1, displays samples of $x_{T,1}$ from $F_{\tanh}$ in function space for different combinations of $L$ and $D$. More specifically, we approximate function draws by considering 400 inputs $z^{(i)}$ equally spaced on $[-2, 2]$. Using the ResNet recursion (8) we obtain 400 output values $x^{(i)}_{T,1}$. We repeat this procedure to obtain 10,000 function draws and report the results in Figure 4 (top). For $L = D = 500$, i.e. for jointly large width and depth, the function draws are close to linear, in agreement with Section 5.3. We then replicate this experiment for $F_{\mathrm{swish}}$ and report the results in Figure 4 (bottom panels). In this case $\phi'(0) = \phi''(0) = 1/2$ and Assumption 2.3 is not satisfied, but in this specific instance we did not observe divergent trajectories for the 10,000 function draws. The impact of adding an input adaptation layer is limited to symmetrizing the function-space distributions around the origin, while $F_{\tanh}$ and $F_{\mathrm{swish}}$ trend upward with $z$. Hence, we do not include additional plots for this case as they add little information. Appendix B contains additional 2D plots of samples of $x_{T,1}$ for both $\bar F_{\tanh}$ and $\bar F_{\mathrm{swish}}$.

Figure 4: Function samples of $x_{T,1}$ for $F_{\tanh}$ (top) and $F_{\mathrm{swish}}$ (bottom); see Figure 1 for the description of the plotted quantities.

Finally, Figure 5 (top panels) displays the correlations $\rho[x^{(1)}_{T,1}, x^{(2)}_{T,1}]$ for the neural network's inputs $(z^{(1)}, z^{(2)})$ in the range $[-2, 2] \times [-2, 2]$, for the tanh and swish activation functions: for different inputs, the corresponding output correlations are far from 1. Let us refer to the model of Figure 1 with tanh activation as $E_{\tanh}$, and to the model of Figure 1 with ReLU activation as $E_{\mathrm{ReLU}}$. For the sake of comparison, we show in Figure 5 (bottom panels) the correlations $\rho[x^{(1)}_{\mathrm{last},1}, x^{(2)}_{\mathrm{last},1}]$ for pre-activation 1 of $E_{\tanh}$ and $E_{\mathrm{ReLU}}$: all correlations are close to 1.

Figure 5: Output correlation heatmap for $F_{\tanh}$ (top-left), $F_{\mathrm{swish}}$ (top-right), $E_{\tanh}$ (bottom-left), $E_{\mathrm{ReLU}}$ (bottom-right).
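To make the comparison of Figure 5 concrete, the sketch below estimates the output correlation $\rho[x^{(1)}_{T,1}, x^{(2)}_{T,1}]$ over a small grid of input pairs by Monte Carlo, using the same residual recursion as in the earlier sketch (shared seeds give shared weights across the two inputs). Grid, sizes and draw counts are illustrative choices of ours.

```python
import numpy as np

def forward_first_dim(z, L=100, D=100, sb2=1.0, sw2=1.0, T=1.0, seed=0):
    """First output dimension of the F_tanh residual recursion (same sketch as above)."""
    rng = np.random.default_rng(seed)
    dt = T / L
    x = np.full(D, float(z))
    for _ in range(L):
        W = rng.normal(scale=np.sqrt(sw2 * dt / D), size=(D, D))
        b = rng.normal(scale=np.sqrt(sb2 * dt), size=D)
        x = x + np.tanh(W @ x + b)
    return x[0]

def output_correlation(z1, z2, n_draws=300):
    """Monte Carlo estimate of rho[x_{T,1}(z1), x_{T,1}(z2)]; equal seeds share the weights."""
    a = np.array([forward_first_dim(z1, seed=s) for s in range(n_draws)])
    b = np.array([forward_first_dim(z2, seed=s) for s in range(n_draws)])
    return np.corrcoef(a, b)[0, 1]

grid = np.linspace(-2.0, 2.0, 5)        # coarse grid in the spirit of Figure 5 (top-left)
heatmap = [[output_correlation(u, v) for v in grid] for u in grid]
```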
We consider the MNIST dataset (LeCun, 1998). In particular, each observation $(z, k)$ is composed of an image $z$ and a target $k$ among 10 classes representing the digits 0 to 9. We flatten the images, obtaining $z \in \mathbb{R}^{784}$, and, as is common, we rescale each $z$ as $z/255$ to bound the inputs to $[0, 1]$. We consider the neural network $\bar F_{\tanh}$ trained via full-batch GD with the average MSE loss. In order to frame classification as a regression problem, we use 1-hot encoding: each class $k = 1, \dots, 10$ is encoded as $y_k \in \mathbb{R}^{10}$, which has the $k$-th component equal to 1 and all other components equal to 0. The gradients are computed with respect to $(\varepsilon^W_t, \varepsilon^b_t)$ in (17).
We consider $\bar F_{\tanh}$ with $\sigma_w^2 = 1$ and a smaller bias variance $\sigma_b^2$; the use of a smaller bias variance is common in the NTK literature (Arora et al., 2019). From Section 5.4 we know that as $L$ and $D$ increase the fully trained $\bar F_{\tanh}$ collapses to noiseless Bayesian linear regression. We consider 20,000 randomly sampled observations from the training portion of the MNIST dataset, and we compute the test accuracy on the test portion of the MNIST dataset, which is composed of 10,000 observations. Using the 1-hot encoding, we perform kernel regression with the kernel (37), via the standard expression (Williams and Rasmussen, 2006) for the predictive posterior mean of Gaussian processes. For numerical stability the model is augmented with a small noise variance, and the resulting test accuracy is the limiting value reported in Figure 6 (dashed gray). We compare this accuracy with test accuracies computed for $\bar F_{\tanh}$ under different values of $D = L$, fully trained for 120 epochs. We use a single learning rate tuned to optimize final test accuracy.
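A compact sketch of the kernel-regression baseline is given below: it builds the Gram matrix of kernel (37), adds a small jitter for numerical stability, and predicts the class with the largest component of the posterior mean of the 1-hot targets. Hyper-parameter values and the jitter are placeholders of ours, not the exact values used in the experiments.

```python
import numpy as np

def ntk_gram(Z1, Z2, sb2=1.0, sw2=1.0, sz2=None, sy2=1.0, T=1.0, dphi0=1.0):
    """Gram matrix of kernel (37); affine in <z, z'>, so it is Z1 Z2^T rescaled plus a constant."""
    if sz2 is None:
        sz2 = 1.0 / Z1.shape[1]
    C = dphi0**2 * sw2 * T
    E = np.exp(C)
    return sy2 * (sz2 * (C + 2.0) * E * (Z1 @ Z2.T) + (sb2 / sw2) * (C * E + E - 1.0))

def kernel_regression_classifier(Z_train, labels, Z_test, n_classes=10, jitter=1e-4):
    """Kernel regression on 1-hot targets; the predicted class is the arg-max of the posterior mean."""
    Y = np.eye(n_classes)[labels]                              # 1-hot encoding of the targets
    K = ntk_gram(Z_train, Z_train)
    Ks = ntk_gram(Z_test, Z_train)
    alpha = np.linalg.solve(K + jitter * np.eye(len(Z_train)), Y)
    return np.argmax(Ks @ alpha, axis=1)
```

Applied to flattened, rescaled MNIST images and their 1-hot targets, this is the kind of limiting baseline against which the trained networks are compared.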
Figure 6: Final MNIST test accuracy after 120 epochs of training $\bar F_{\tanh}$ via GD (left, blue), SGD (left, orange), full-batch Adam (right, blue), and mini-batch Adam (right, orange), compared with the theoretical limiting value corresponding to Bayesian linear regression (dashed gray).

In practice, the training of neural networks is typically performed via SGD, or via other stochastic variants of GD, as full-batch training is prohibitively expensive for large datasets. Accordingly, here we perform SGD training of $\bar F_{\tanh}$, with batches of 200 observations each. Again, we consider 120 epochs and different values of $D = L$. The same learning rate is used. In both experiments no further adjustments, such as gradient clipping, are performed. The results are reported in Figure 6 (left). We observe that there is strong agreement between the limiting theoretical test accuracy and the final test accuracy of $\bar F_{\tanh}$ fully trained with GD, which is the case covered by our theory. Moreover, this result empirically extends to SGD.
For completeness, we consider the same training setting with Adam (Kingma and Ba, 2015), a popular adaptive stochastic optimizer, and we report the results in Figure 6 (right) for both the full-batch and mini-batch variants. While there is no strong consensus on whether Adam outperforms or underperforms SGD when a carefully tuned learning rate is used (Wilson et al., 2017; Choi et al., 2019), Adam is known to be less sensitive to learning rate specifications and to exhibit more robust behavior in difficult optimization problems. In particular, the proposed experiment provides an alternative viewpoint: Adam (with mini-batching, as standard) is able to "escape" the domain of attraction of linear model solutions, at least up to the largest model size considered here. We suspect that more complex neural network architectures might exhibit analogous pathologies at initialization when the number of parameters is very large, and Adam seems more robust to these issues. In any case, a formal investigation would require new results in the NTK literature to cover adaptive optimizers.
While a theoretical investigation of the backward properties of CNNs is beyond the scope of the present paper, in this section we empirically investigate to what extent the observations of Section 6.3 extend to convolutional neural networks. In particular, we consider $C_{\tanh}$ with $\sigma_w^2 = 1$ and the same small bias variance $\sigma_b^2$ as in Section 6.3. The setting is otherwise the same as that of Section 6.3, with the exception that the input images are not flattened. We consider training under the MSE loss for 120 epochs with both SGD and Adam. For computational reasons we restrict the maximum model size to $D = L = 150$ and do not investigate full-batch training. We report the results of this experiment in Figure 7.

Figure 7: Final MNIST test accuracy after 120 epochs of training $C_{\tanh}$ with SGD (blue) and Adam (orange).

These results suggest convergence as $D = L$ grows, which is reminiscent of what is observed in Figure 6. Similarly to Figure 6, Adam exhibits superior performance for large $D = L$.

We established the convergence of identity ResNets (He et al., 2016b), and of their corresponding Jacobian matrices, to solutions of SDEs as the number of layers goes to infinity. Our results rely on smooth activation functions and on parameter distributions which shrink as the total depth increases; further conditions on the activation functions are obtained by restricting the limiting SDEs to be non-explosive.
While we covered in full detail the case of fully connected residual networks, there are no theoretical impediments to extending our results to convolutional architectures, as we did in Section 3.2 for the forward-propagation results. Building on our connection between infinitely deep networks and diffusion processes, we showed that both forward and backward dynamics are well-behaved. More precisely, regarding forward propagation we showed that as total depth grows unboundedly: i) the dependency of the last layer on the input does not vanish; ii) the last layer, as a stochastic function on input space, remains flexible without collapsing to restrictive families of distributions; iii) the last layer does not collapse to a deterministic limit, nor does it diverge to infinity, i.e. it converges to a non-degenerate conditional probability distribution. All these results hold for both fully-connected and convolutional ResNets. Moreover, we showed that the activation function needs to satisfy $\phi''(0) = 0$ in order to rule out explosive dynamics over the layers.
With regard to backward propagation, and limited to the case of fully-connected networks without the second (extra) activation function, we showed that the Jacobian of the final layer with respect to any layer can be expressed as the product of two matrix diffusions which satisfy the same desiderata i), ii) and iii) in the limit of infinite total depth, and is hence similarly well-behaved. Moreover, we addressed the problem of the trainability at initialization of such neural networks, showing that exploding gradients are not possible in the limit of infinite depth and that the ResNet is invertible. In contrast to the information propagation approach, our analysis covers finitely wide neural networks and correlated parameters.
Finally, limited to the case of fully i.i.d. parameter distributions, we investigated the case of doubly infinite ResNets where the width grows unbounded as well. The attractiveness of the doubly infinite setting is mainly related to the potential of obtaining analytical results. We showed that doubly infinite ResNets converge to Gaussian processes whose kernels can be computed by solving systems of ODEs. In particular, when $\phi''(0) = 0$ and the model is completed with a fully i.i.d. input adaptation layer, we showed that the doubly infinite ResNet collapses to a Bayesian linear model with a fully factorized prior distribution. To conclude, we obtained the form of the NTK that corresponds to full training, with continuous-time GD and a quadratic loss, of doubly infinite ResNets. In particular, we observed that such a kernel is qualitatively identical to the kernel of the Gaussian process arising in this doubly infinite limit, thus implying that fully trained doubly infinite networks are again equivalent to performing linear regression. Numerical experiments support the validity of the proposed derivations.
The present work illustrates the many pitfalls that must be overcome in order to derive non-trivial limits as depth and width grow unbounded in neural networks. Architectures, parameters and activation functions need to satisfy precise conditions. However, even under these conditions, the resulting limiting behavior can be very unexpressive.
While this is an undesirable result if inference via the limiting process is the goal, the connection to very simple models introduces the possibility of performing hyper-parameter optimization for the finitely-sized neural network via empirical Bayes on the corresponding linear model. Moreover, it is likely that an NTK limit can be established for doubly infinite convolutional ResNets. Such a limiting model could offer efficient classification for images in settings in which computational restrictions or a limited amount of training data do not warrant the use of a deep learning solution.
To overcome the present limitations one could narrow the fundamental gap between theory and practice by considering more realistic residual blocks consisting of multiple layers. Such deep residual blocks could be approached either via fractional Brownian motions (Biagini et al., 2008) or via re-scaled Brownian motions. Moreover, such an extension would allow us to consider neural networks which are infinitely wide only in the residual blocks' internal dimension. The fields of diffusion processes and SDEs are mature and rich (Øksendal, 2003; Karatzas and Shreve, 1999; Revuz and Yor, 1999; Kloeden and Platen, 1992; Stroock and Varadhan, 2006), with a vast range of theoretical results and simulation methods. We envision that examining neural network properties from the point of view of SDEs will bring further insights.

A Proofs
This appendix contains all the proofs of the theorems stated in the main text and the lemmasrequired to prove them.
Proof of Theorem 2.1.
This is (Nelson, 1990, Theorem 2.2): Assumption 2.1 and the postulated weakly unique and non-explosive weak solution satisfy all the conditions required for the application of (Nelson, 1990, Theorem 2.2). Note that we use a stronger non-explosivity condition (Øksendal (2003)). Alternatively, for this standard result the reader can refer to the monograph Stroock and Varadhan (2006), on which Nelson (1990) is based; yet another reference is Ethier and Kurtz (2009).

Lemma A.1. If $\phi$ satisfies Assumption 3.2, $\epsilon \sim N(0, \sigma^2)$ with $\sigma \le \sigma_*$, and $\alpha > 0$, then we can find $M_1(\alpha, \sigma_*) < \infty$ and $M_2(\alpha, \sigma_*) < \infty$ such that:
$$ \mathbb{E}\big[ |\phi''(\epsilon)|^\alpha \big] \le M_1(\alpha, \sigma_*), \qquad \mathbb{E}\big[ |\phi'''(\epsilon)|^\alpha \big] \le M_2(\alpha, \sigma_*). $$

Proof.
We prove the result only for $\phi''(\epsilon)$, the case of $\phi'''(\epsilon)$ being identical. Let $L$ be large enough so that $|\phi''(x)| \le K_1 e^{K_2 |x|}$ for $|x| \ge L$; then:
$$ \mathbb{E}\big[ |\phi''(\epsilon)|^\alpha \big] = \mathbb{E}\big[ |\phi''(\epsilon)|^\alpha \mathbf{1}_{|\epsilon| \le L} \big] + \mathbb{E}\big[ |\phi''(\epsilon)|^\alpha \mathbf{1}_{|\epsilon| > L} \big] \le \sup_{|x| \le L} |\phi''(x)|^\alpha + K_1^\alpha\, \mathbb{E}\big[ e^{K_2 \alpha |\epsilon|} \big]. $$
The first term is finite; that the second one can be bounded by a finite and increasing function of $\sigma$ follows from the symmetry in law of $\epsilon$ and the form of its moment generating function.

Proof of Theorem 3.1.
We suppress the dependency on t of vector and matrices and the condi-tioning in expectations and covariances in this proof to ease the notation. We also drop theboldness of x t as no confusion arises in this setting. We instead reserve subscripts for indexing:for example x d denotes the d -th element of a vector x .Let h = ( µ W √ ∆ t + ε W ) ψ ( x ) + ( µ b √ ∆ t + ε b ) so that h √ ∆ t = ∆ W ψ ( x ) + ∆ b . By second orderTaylor expansion of φ around 0 we have for d = 1 , . . . , D ∆ x d ∆ t = φ ( h d √ ∆ t )∆ t = φ (cid:48) (0) h d ∆ t − / + 12 φ (cid:48)(cid:48) (0) h d + 16 φ (cid:48)(cid:48)(cid:48) ( ϑ d ) h d ∆ t / with ϑ d ∈ ( − h d √ ∆ t, h d √ ∆ t ) . To prove (1) we want to show that ∀ R > ∆ t ↓ sup (cid:107) x (cid:107) Notice that d [ W ψ ( x )] t + d [ b ] t = d [ W ψ ( x ) + b ] t = diag( V [ ε Wt ψ ( x t ) + ε bt | x t ]) dt dW t and db t in (13) shows that the drift terms are matched between (12) and(13). The quadratic variation of (12) is φ (cid:48) (0) diag( V [ ε Wt ψ ( x t ) + ε bt | x t ]) dt which is equal to the quadratic variation of (13) as it is computed as d [ x ] t = d [ φ (cid:48) (0)( W ψ ( x ) + b )] t = φ (cid:48) (0) d [ W ψ ( x ) + b ] t This shows the equivalence in law between the solution of (12) and the solution of (13). Then(14) immediately follows by direct computation. Proof of Corollary 3.2 and Corollary 3.3. Notice that d [ W ψ ( x ( i ) ) + b, W ψ ( x ( j ) ) + b ] t = C [ ε Wt ψ ( x ( i ) t ) + ε bt , ε Wt ψ ( x ( j ) t ) + ε bt | x ( i ) t , x ( j ) t ] dt = (cid:0) Σ b + C [ ε Wt ψ ( x ( i ) t ) , ε Wt ψ ( x ( j ) t ) | x ( i ) t , x ( j ) t ] (cid:1) dt and C [ ε Wt ψ ( x ( i ) t ) , ε Wt ψ ( x ( j ) t ) | x ( i ) t , x ( j ) t ] r,c = E [( ε Wt,r, • ψ ( x ( i ) t ))( ε Wt,c, • ψ ( x ( j ) t )) | x ( i ) t , x ( j ) t ]= D (cid:88) d,u =1 ψ ( x ( i ) t,d ) ψ ( x ( j ) t,u ) E [ W r,d W c,u ]= Σ W O r,c D (cid:88) d,u =1 ψ ( x ( i ) t,d ) ψ ( x ( j ) t,u )Σ W I d,u = Σ W O r,c ( ψ ( x ( i ) t ) (cid:62) Σ W I ψ ( x ( j ) t )) . This proves Corollary 3.2. Corollary 3.3 follows by setting σ b = σ b I D , σ W I = I D and σ W O = σ w D − / I D . Proof of Theorem 4.1. Here g t is a D × D matrix-valued SDE instead of standard D -dimensional(vector) SDEs. All the theory presented in Section 2 continues to hold with the obviousmodifications by working on the vectorization of matrix-valued processes. When establishing thelimits for g t in Assumption 2.1 the conditioning is both on g t and on x t , indeed the convergenceto the limiting process is obtained jointly in x t and g t . This proof follows the exact same path ofthe proofs of Theorem 3.1 and Corollary 3.1 so we highlight the main steps only. And once againwe suppress the dependency on t of vector and matrices, the conditioning in expectations andcovariances, and the boldness of x t and g t as no confusion arises in this setting: ∆ g = (cid:0) φ (cid:48) (∆ W x + ∆ b )1 D (cid:62) (cid:12) ∆ W (cid:1) g Let h = ( µ W √ ∆ t + ε W ) x + ( µ b √ ∆ t + ε b ) so that h √ ∆ t = ∆ W x + ∆ b . By second order Taylorexpansion of φ (cid:48) around 0 we have for d = 1 , . . . , Dφ (cid:48) ( h d √ ∆ t ) = φ (cid:48) (0) + φ (cid:48)(cid:48) (0) h d √ ∆ t + 12 φ (cid:48)(cid:48)(cid:48) ( ϑ d ) h d ∆ t ϑ d ∈ ( − h d √ ∆ t, h d √ ∆ t ) . 
Then with ϑ = (cid:2) ϑ · · · ϑ D (cid:3) ∆ g = (cid:0) ( φ (cid:48) (0)1 D √ ∆ t + φ (cid:48)(cid:48) (0) h ∆ t + 12 φ (cid:48)(cid:48)(cid:48) ( ϑ ) h ∆ t / )1 D (cid:62) (cid:12) ( µ W √ ∆ t + ε W ) (cid:1) g In order to obtain the instantaneous mean of g t we need to compute E (cid:20) ∆ g ∆ t (cid:21) = φ (cid:48) (0) µ W g + φ (cid:48)(cid:48) (0) E [ ε W x (cid:62) D (cid:12) ε W ] + r g,µ ( g, x, ∆ t )= µ g ( g, x ) + r g,µ ( g, x, ∆ t ) where r µ ( g, x, ∆ t ) is a reminder term and we want to show that for each R > ∆ t ↓ sup (cid:107) g (cid:107) + (cid:107) x (cid:107) Assume that x t follows (29) and let (cid:101) µ x ( x ) = φ (cid:48)(cid:48) (0)( σ b + σ w q ( x )) , (cid:101) σ x ( x ) = φ (cid:48) (0)( σ b + σ w q ( x )) / , (cid:101) σ xx ( x, y ) = φ (cid:48) (0)( σ b + σ w λ ( x, y )) / . Then the processes m ( i ) t , q ( i ) t , λ ( i,j ) t follow the SDEs: dm ( i ) t = (cid:101) µ x ( x ( i ) t ) dt + (cid:101) σ x ( x ( i ) t ) 1 D D (cid:88) d =1 dB ( i ) t,d dq ( i ) t = (cid:16) (cid:101) µ x ( x ( i ) t ) m ( i ) t + (cid:101) σ x ( x ( i ) t ) (cid:17) dt + 2 (cid:101) σ x ( x ( i ) t ) 1 D D (cid:88) d =1 x ( i ) t,d dB ( i ) t,d dλ ( i,j ) t = (cid:16)(cid:101) µ x ( x ( i ) t ) m ( j ) t + (cid:101) µ x ( x ( j ) t ) m ( i ) t + (cid:101) σ xx ( x ( i ) t , x ( j ) t ) (cid:17) dt + 1 D (cid:101) σ x ( x ( i ) t ) D (cid:88) d =1 x ( j ) t,d dB ( i ) t,d + 1 D (cid:101) σ x ( x ( j ) t ) D (cid:88) d =1 x ( i ) t,d dB ( j ) t,d Proof. The result is obtained by a straightforward but tedious application of multi-dimensionalIto’s formula (Øksendal (2003)). Heuristic for Proposition 5.1. From Lemma A.2 we have [ m ( i ) ] T = 1 D (cid:90) T (cid:101) σ x ( x ( i ) t ) dt [ q i ] T = 1 D (cid:90) T (cid:101) σ x ( x ( i ) t ) q ( i ) t dt [ λ ( i,j ) ] T = 1 D (cid:90) T (cid:101) σ x ( x ( i ) t ) q ( j ) t + (cid:101) σ x ( x ( j ) t ) q ( i ) t dt where (cid:101) σ x ( x ) = φ (cid:48) (0) ( σ b + σ w q ( x )) . Assuming that q ( i ) t can be controlled (for instance boundson SDE solutions can be used to bound E [sup ≤ t ≤ T q ( i ) t ] when φ (cid:48)(cid:48) (0) = 0 ) all the quadraticvariations can be shown to converge to leaving out only the deterministic component. The restof Proposition 5.1 follows by assuming that the small noise limit of the SDEs is given by thecorresponding ODEs, and by computing the ODEs solutions. Heuristic for Proposition 5.2. We know from Proposition 5.1 that φ (cid:48) (0)( σ b + σ w q ( i ) t ) → φ (cid:48) (0)( σ b + σ w q ( i ) , ∞ t )12 φ (cid:48)(cid:48) (0)( σ b + σ w q ( i ) t ) → φ (cid:48)(cid:48) (0)( σ b + σ w q ( i ) , ∞ t ) σ b + σ w λ ( i,j ) t (cid:0) ( σ b + σ w q ( i ) t )( σ b + σ w q ( j ) t ) (cid:1) / → σ b + σ w λ ( i,j ) , ∞ t (cid:0) ( σ b + σ w q ( i ) , ∞ t )( σ b + σ w q ( j ) , ∞ t ) (cid:1) / D ↑ ∞ . Then Proposition 5.2 follows by assuming that the solution of (29) converges to thesolution of (33) as D ↑ ∞ and by computing the transition density of (33). B Additional plots In Figure 8 we plot 2D function samples of x T, for F tanh and F swish to complement thevisualizations of Section 4.2. activation: tanh activation: swish Figure 8: Function samples of x T, for F tanh (left) and F swish (right) for L = 100 and D = 100 on the rectangle [ − , × [ − , . 
Acknowledgements Stefano Favaro received funding from the European Research Council (ERC) under the EuropeanUnion’s Horizon 2020 research and innovation programme under grant agreement No 817257.Stefano Favaro gratefully acknowledge the financial support from the Italian Ministry of Education,University and Research (MIUR), “Dipartimenti di Eccellenza” grant 2018-2022.34 eferences Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. (2019). On exactcomputation with an infinitely wide neural net. In Advances in Neural Information ProcessingSystems 32 .Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. (2019). Invertibleresidual networks. In International Conference on Machine Learning , pages 573–582.Biagini, F., Hu, Y., Øksendal, B., and Zhang, T. (2008). Stochastic calculus for fractionalBrownian motion and applications . Springer Science & Business Media.Billingsley, P. (1999). Convergence of Probability Measures . Wiley-Interscience, 2nd edition.Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machinelearning. Siam Review , 60(2):223–311.Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinarydifferential equations. In Advances in Neural Information Processing Systems 31 , pages6571–6583.Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., and Dahl, G. E. (2019). On empiricalcomparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446 .Dumoulin, V. and Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXivpreprint arXiv:1603.07285 .Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. (2019). Adversarialrobustness as a prior for learned representations.Ethier, S. N. and Kurtz, T. G. (2009). Markov processes: characterization and convergence .Wiley-Interscience.Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. (2019). Deep convolutional networks asshallow gaussian processes. In International Conference on Learning Representations .Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neuralnetworks. In Proceedings of the thirteenth international conference on artificial intelligenceand statistics , pages 249–256.Gupta, A. K. and Nagar, D. K. (1999). Matrix variate distributions . Chapman and Hall/CRC,1st edition.Hayou, S., Doucet, A., and Rousseau, J. (2019a). On the impact of the activation function ondeep neural networks training. In Proceedings of the 36th International Conference on MachineLearning , pages 2672–2680.Hayou, S., Doucet, A., and Rousseau, J. (2019b). Training dynamics of deep networks usingstochastic gradient descent via neural tangent kernel. arXiv preprint arXiv:1905.13654 .He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassinghuman-level performance on imagenet classification. In Proceedings of the IEEE internationalconference on computer vision , pages 1026–1034.He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778.35e, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks.In European conference on computer vision , pages 630–645. Springer.Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence andgeneralization in neural networks. 
In Advances in Neural Information Processing Systems 31 ,pages 8571–8580.Karatzas, I. and Shreve, S. (1999). Brownian Motion and Stochastic Calculus . Springer, 2ndedition.Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In InternationalConference on Learning Representations .Kloeden, P. E. and Platen, E. (1992). Numerical Solution of Stochastic Differential Equations .Springer, corrected edition.LeCun, Y. (1998). The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/ .LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature , 521(7553):436.Lee, J., Sohl-dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. (2018). Deepneural networks as gaussian processes. In International Conference on Learning Representations .Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. (2019a).Wide neural networks of any depth evolve as linear models under gradient descent. In Advancesin Neural Information Processing Systems 32 .Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. (2019b).Wide neural networks of any depth evolve as linear models under gradient descent.Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018).Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271 .Neal, R. M. (1995). Bayesian Learning for Neural Networks . PhD thesis, University of Toronto.Nelson, D. B. (1990). Arch models as diffusion approximations. Journal of econometrics ,45(1-2):7–38.Øksendal, B. (2003). Stochastic Differential Equations: An Introduction with Applications .Springer, 6th edition.Peluchetti, S. and Favaro, S. (2020). Infinitely deep neural networks as diffusion processes. In Proceedings of the twenty-third international conference on artificial intelligence and statistics .Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. (2016). Exponentialexpressivity in deep neural networks through transient chaos. In Advances in Neural InformationProcessing Systems 29 , pages 3360–3368.Protter, P. E. (2005). Stochastic integration and differential equations . Springer.Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXivpreprint arXiv:1710.05941 .Revuz, D. and Yor, M. (1999). Continuous Martingales and Brownian Motion . Springer, 3rdedition.Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathe-matical statistics , pages 400–407. 36choenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. (2017). Deep informationpropagation. In International Conference on Learning Representations .Stroock, D. W. and Varadhan, S. S. (2006). Multidimensional diffusion processes . Springer, 2006edition.Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning , volume 2.MIT press Cambridge, MA.Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. (2017). The marginal value ofadaptive gradient methods in machine learning. In Advances in Neural Information ProcessingSystems , pages 4148–4158.Yang, G. and Schoenholz, S. (2017). Mean field residual networks: On the edge of chaos. InGuyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett,R., editors,