Large-width functional asymptotics for deep Gaussian neural networks
Daniele Bracale, Stefano Favaro, Sandra Fortini, Stefano Peluchetti
University of Torino, Collegio Carlo Alberto, Bocconi University, Cogent Labs

Published as a conference paper at ICLR 2021

ABSTRACT
In this paper, we consider fully-connected feed-forward deep neural networks where weights and biases are independent and identically distributed according to Gaussian distributions. Extending previous results (Matthews et al., 2018a;b; Yang, 2019), we adopt a function-space perspective, i.e. we look at neural networks as infinite-dimensional random elements on the input space R^I. Under suitable assumptions on the activation function we show that: i) a network defines a continuous stochastic process on the input space R^I; ii) a network with re-scaled weights converges weakly to a continuous Gaussian Process in the large-width limit; iii) the limiting Gaussian Process has almost surely locally γ-Hölder continuous paths, for 0 < γ < 1. Our results contribute to recent theoretical studies on the interplay between infinitely-wide deep neural networks and Gaussian Processes by establishing weak convergence in function-space with respect to a stronger metric.

1 INTRODUCTION
The interplay between infinitely-wide deep neural networks and classes of Gaussian Processes has its origins in the seminal work of Neal (1995), and it has been the subject of several theoretical studies. See, e.g., Der & Lee (2006), Lee et al. (2018), Matthews et al. (2018a;b), Yang (2019) and references therein. Let us consider a fully-connected feed-forward neural network with re-scaled weights composed of L ≥ 1 layers of widths n_1, ..., n_L, i.e.

f_i^{(1)}(x) = Σ_{j=1}^{I} w_{i,j}^{(1)} x_j + b_i^{(1)},   i = 1, ..., n_1,
f_i^{(l)}(x) = (1/√n_{l−1}) Σ_{j=1}^{n_{l−1}} w_{i,j}^{(l)} φ(f_j^{(l−1)}(x)) + b_i^{(l)},   l = 2, ..., L,  i = 1, ..., n_l,   (1)

where φ is a non-linearity and x ∈ R^I is a real-valued input of dimension I ∈ N. Neal (1995) considered the case L = 2, a finite number k ∈ N of fixed distinct inputs (x^{(1)}, ..., x^{(k)}), with each x^{(r)} ∈ R^I, and weights w_{i,j}^{(l)} and biases b_i^{(l)} independently and identically distributed (iid) as Gaussian distributions. Under appropriate assumptions on the activation φ, Neal (1995) showed that: i) for a fixed unit i, the k-dimensional random vector (f_i^{(2)}(x^{(1)}), ..., f_i^{(2)}(x^{(k)})) converges in distribution, as the width n_1 goes to infinity, to a k-dimensional Gaussian random vector; ii) the large-width convergence holds jointly over finite collections of i's, and the limiting k-dimensional Gaussian random vectors are independent across the index i. These results concern neural networks with a single hidden layer, but Neal (1995) also includes preliminary considerations on infinitely-wide deep neural networks. More recent works, such as Lee et al. (2018), established convergence results corresponding to Neal (1995)'s results i) and ii) for deep neural networks under the assumption that the widths n_1, ..., n_L go to infinity sequentially over network layers. Matthews et al. (2018a;b) extended the work of Neal (1995); Lee et al.
(2018) by assuming that the width n grows to infinity jointly over network layers, instead of sequentially, and by establishing joint convergence over all i and countable distinct inputs. The joint growth over the layers is certainly more realistic than the sequential growth, since the infinite Gaussian limit is considered as an approximation of a very wide network. We operate in the same setting of Matthews et al. (2018b), hence from here onward n ≥ 1 denotes the common layer width, i.e. n_1 = ... = n_L = n. Finally, similar large-width limits have been established for a great variety of neural network architectures; see for instance Yang (2019).

The assumption of a countable number of fixed distinct inputs is the common trait of the literature on large-width asymptotics for deep neural networks. Under this assumption, the large-width limit of a network boils down to the study of the large-width asymptotic behavior of the k-dimensional random vector (f_i^{(l)}(x^{(1)}), ..., f_i^{(l)}(x^{(k)})) over i ≥ 1 for finite k. Such limiting finite-dimensional distributions describe the large-width distribution of a neural network a priori over any dataset, which is finite by definition. When the limiting distribution is Gaussian, as it often is, this immediately paves the way to Bayesian inference for the limiting network. Such an approach is competitive with the more standard stochastic gradient descent training for the fully-connected architectures object of our study (Lee et al., 2020). However, knowledge of the limiting finite-dimensional distributions is not enough to infer properties of the limiting neural network which are inherently uncountable, such as the continuity of the limiting neural network, or the distribution of its maximum over a bounded interval.
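As a concrete companion to the recursion (1), the following sketch samples one draw of such a network in NumPy. It is a minimal illustration under the paper's Gaussian initialization; the function name, the defaults, and the parameters `sigma_w`, `sigma_b` (the standard deviations of weights and biases) are our own conventions, not the paper's.

```python
import numpy as np

def sample_network(x, widths, sigma_w=1.0, sigma_b=1.0, phi=np.tanh, rng=None):
    """One draw of the network in (1): iid Gaussian weights and biases,
    with the 1/sqrt(n_{l-1}) re-scaling applied from the second layer on."""
    rng = np.random.default_rng() if rng is None else rng
    I = len(x)
    # layer 1: f_i = sum_{j <= I} w_ij x_j + b_i (no re-scaling)
    f = rng.normal(0.0, sigma_w, size=(widths[0], I)) @ x \
        + rng.normal(0.0, sigma_b, size=widths[0])
    # layers l = 2, ..., L: f_i = n_{l-1}^{-1/2} sum_j w_ij phi(f_j) + b_i
    for n_l, n_prev in zip(widths[1:], widths[:-1]):
        f = rng.normal(0.0, sigma_w, size=(n_l, n_prev)) @ phi(f) / np.sqrt(n_prev) \
            + rng.normal(0.0, sigma_b, size=n_l)
    return f
```

For instance, `sample_network(np.ones(3), [100, 100, 1])` returns the single unit of the third layer of a width-100 network evaluated at x = (1, 1, 1).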
Results in this direction give a more complete understanding of the assumptions being made a priori, and hence of whether a given model is appropriate for a specific application. For instance, Van Der Vaart & Van Zanten (2011) show that for Gaussian Processes the function smoothness under the prior should match the smoothness of the target function for satisfactory inference performance.

In this paper we thus consider a novel, and more natural, perspective on the study of large-width limits of deep neural networks. This is an infinite-dimensional perspective where, instead of fixing a countable number of distinct inputs, we look at f_i^{(l)}(x, n) as a stochastic process over the input space R^I. Under this perspective, establishing large-width limits requires considerable care and, in addition, it requires showing the existence of both the stochastic process induced by the neural network and its large-width limit. We start by proving the existence of: i) a continuous stochastic process, indexed by the network width n, corresponding to the fully-connected feed-forward deep neural network; ii) a continuous Gaussian Process corresponding to the infinitely-wide limit of the deep neural network. Then, we prove that the stochastic process i) converges weakly, as the width n goes to infinity, to the Gaussian Process ii) jointly over all units i. As a by-product of our results, we show that the limiting Gaussian Process has almost surely locally γ-Hölder continuous paths, for 0 < γ < 1. To make the exposition self-contained we include an alternative proof of the main result of Matthews et al. (2018a;b), i.e. the finite-dimensional limit for fully-connected neural networks. The major difference between our proof and that of Matthews et al. (2018b) is the use of the characteristic function to establish convergence in distribution, instead of relying on a CLT for exchangeable sequences (Blum et al., 1958).

The paper is structured as follows.
In Section 2 we introduce the setting under which we operate, whereas in Section 3 we present a high-level overview of the approach taken to establish our results. Section 4 contains the core arguments of the proof of our large-width functional limit for deep Gaussian neural networks, which are spelled out in detail in the supplementary material (SM). We conclude in Section 5.

2 SETTING
Let (Ω, H, P) be the probability space on which all random elements of interest are defined. Furthermore, let N(µ, σ²) denote a Gaussian distribution with mean µ ∈ R and strictly positive variance σ² ∈ R⁺, and let N_k(m, Σ) be a k-dimensional Gaussian distribution with mean m ∈ R^k and covariance matrix Σ ∈ R^{k×k}. In particular, R^k is equipped with ‖·‖_{R^k}, the Euclidean norm induced by the inner product ⟨·,·⟩_{R^k}, and R^∞ = ×_{i=1}^∞ R is equipped with ‖·‖_{R^∞}, the norm induced by the distance d_∞(a, b) = Σ_{i≥1} ξ(|a_i − b_i|)/2^i for a, b ∈ R^∞ (Theorem 3.38 of Aliprantis & Border (2006)), where ξ(t) = t/(1 + t) for all real values t ≥ 0. Note that (R, |·|) and (R^∞, ‖·‖_{R^∞}) are Polish spaces, i.e. separable and complete metric spaces (Corollary 3.39 of Aliprantis & Border (2006)). We choose d_∞ since it generates a topology that coincides with the product topology (line 5 of the proof of Theorem 3.36 of Aliprantis & Border (2006)). The space (S, d) will indicate a generic Polish space such as R or R^∞ with the associated distance. We indicate with S^{R^I} the space of functions from R^I into S, and with C(R^I; S) ⊂ S^{R^I} the space of continuous functions from R^I into S.

Let ω_{i,j}^{(l)} be the random weights of the l-th layer, and assume that they are iid as N(0, σ_ω²), i.e.

ϕ_{ω_{i,j}^{(l)}}(t) = E[e^{itω_{i,j}^{(l)}}] = e^{−σ_ω² t²/2}   (2)

is the characteristic function of ω_{i,j}^{(l)}, for i ≥ 1, j = 1, ..., n and l ≥ 1. Let b_i^{(l)} be the random biases of the l-th layer, and assume that they are iid as N(0, σ_b²), i.e.

ϕ_{b_i^{(l)}}(t) = E[e^{itb_i^{(l)}}] = e^{−σ_b² t²/2}   (3)

is the characteristic function of b_i^{(l)}, for i ≥ 1 and l ≥ 1. Weights ω_{i,j}^{(l)} are independent of biases b_i^{(l)}, for any i ≥ 1, j = 1, ..., n and l ≥ 1. Let φ : R → R denote a continuous non-linearity.
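For intuition, the metric d_∞ can be evaluated on truncated sequences; the helper below is a sketch (the name `d_inf` is ours), and since ξ < 1 the neglected tail after N terms is at most 2^{−N}.

```python
def d_inf(a, b):
    """d_inf(a, b) = sum_{i >= 1} xi(|a_i - b_i|) / 2^i with xi(t) = t/(1+t),
    evaluated on the common finite prefix of a and b; since xi(t) < 1 the
    truncation error after N terms is at most 2**(-N)."""
    xi = lambda t: t / (1.0 + t)
    return sum(xi(abs(ai - bi)) / 2.0 ** i
               for i, (ai, bi) in enumerate(zip(a, b), start=1))
```

For example, d_inf([0, 0], [1, 1]) = ξ(1)/2 + ξ(1)/4 = 3/8, and the whole series is bounded by 1, which reflects why d_∞ metrizes the product topology rather than a uniform, coordinate-wise notion of closeness.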
For the finite-dimensional limit we will assume the polynomial envelope condition

|φ(s)| ≤ a + b|s|^m,   (4)

for any s ∈ R and some real values a, b > 0 and m ≥ 1. For the functional limit we will use a stronger assumption on φ, assuming φ to be Lipschitz on R with Lipschitz constant L_φ.

Let Z be a stochastic process on R^I, i.e. for each x ∈ R^I, Z(x) is defined on (Ω, H, P) and it takes values in S. For any k ∈ N and x_1, ..., x_k ∈ R^I, let P^Z_{x_1,...,x_k}(A_1 × ... × A_k) = P(Z(x_1) ∈ A_1, ..., Z(x_k) ∈ A_k), with A_1, ..., A_k ∈ B(S). Then, the family of finite-dimensional distributions of Z is defined as the family of distributions {P^Z_{x_1,...,x_k} : x_1, ..., x_k ∈ R^I and k ∈ N}. See, e.g., Billingsley (1995). In Definition 1 and Definition 2 we look at the deep neural network (1) as a stochastic process on the input space R^I, that is, a stochastic process whose finite-dimensional distributions are determined by a finite number k ∈ N of fixed distinct inputs (x^{(1)}, ..., x^{(k)}), with each x^{(r)} ∈ R^I. The existence of the stochastic processes of Definition 1 and Definition 2 will be thoroughly discussed in Section 3.

Definition 1.
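Standard activations satisfy the envelope condition (4); the following grid check is only a quick necessary-condition sketch (the constants a, b, m are chosen by hand, and a finite grid cannot prove the bound on all of R):

```python
import numpy as np

def envelope_holds_on_grid(phi, a, b, m, lo=-50.0, hi=50.0, num=10001):
    """Check |phi(s)| <= a + b |s|^m on a finite grid: a necessary (not
    sufficient) check of the polynomial envelope condition (4)."""
    s = np.linspace(lo, hi, num)
    return bool(np.all(np.abs(phi(s)) <= a + b * np.abs(s) ** m))
```

For instance, tanh is bounded, so any a ≥ 1 works, while ReLU satisfies |φ(s)| ≤ |s|, i.e. the envelope with m = 1; the exponential, by contrast, violates every polynomial envelope.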
For any fixed l ≥ 1 and i ≥ 1, let (f_i^{(l)}(n))_{n≥1} be a sequence of stochastic processes on R^I. That is, f_i^{(l)}(n) : R^I → R, with x ↦ f_i^{(l)}(x, n), is a stochastic process on R^I whose finite-dimensional distributions are the laws, for any k ∈ N and x^{(1)}, ..., x^{(k)} ∈ R^I, of the k-dimensional random vectors

f_i^{(1)}(X, n) = f_i^{(1)}(X) = [f_i^{(1)}(x^{(1)}, n), ..., f_i^{(1)}(x^{(k)}, n)]^T = Σ_{j=1}^I ω_{i,j}^{(1)} x_j + b_i^{(1)} 1   (5)

f_i^{(l)}(X, n) = [f_i^{(l)}(x^{(1)}, n), ..., f_i^{(l)}(x^{(k)}, n)]^T = (1/√n) Σ_{j=1}^n ω_{i,j}^{(l)} (φ • f_j^{(l−1)}(X, n)) + b_i^{(l)} 1   (6)

where X = [x^{(1)}, ..., x^{(k)}] ∈ R^{I×k} is an I × k input matrix of k distinct inputs x^{(r)} ∈ R^I, 1 denotes a vector of dimension k × 1 of 1's, x_j denotes the j-th row of the input matrix, and φ • X is the element-wise application of φ to the matrix X. Let f_{r,i}^{(l)}(X, n) = 1_r^T f_i^{(l)}(X, n) = f_i^{(l)}(x^{(r)}, n) denote the r-th component of the k × 1 vector f_i^{(l)}(X, n), 1_r being a vector of dimension k × 1 with 1 in the r-th entry and 0 elsewhere.

Remark: in contrast to (1), we have defined (5)-(6) over an infinite number of units i ≥ 1 in each layer l, but the dependency on each previous layer l − 1 remains limited to the first n components.

Definition 2.
For any fixed l ≥ 1, let (F^{(l)}(n))_{n≥1} be a sequence of stochastic processes on R^I. That is, F^{(l)}(n) : R^I → R^∞, with x ↦ F^{(l)}(x, n), is a stochastic process on R^I whose finite-dimensional distributions are the laws, for any k ∈ N and x^{(1)}, ..., x^{(k)} ∈ R^I, of the random arrays

F^{(1)}(X) = [f_1^{(1)}(X), f_2^{(1)}(X), ...]^T,   F^{(l)}(X, n) = [f_1^{(l)}(X, n), f_2^{(l)}(X, n), ...]^T.

Remark: for k inputs, the vector F^{(l)}(X, n) is an ∞ × k array, and for a single input x^{(r)}, F^{(l)}(x^{(r)}, n) can be written as [f_1^{(l)}(x^{(r)}, n), f_2^{(l)}(x^{(r)}, n), ...]^T ∈ R^{∞×1}. We define F_r^{(l)}(X, n) = F^{(l)}(x^{(r)}, n), the r-th column of F^{(l)}(X, n). When we write ⟨F^{(l−1)}(x, n), F^{(l−1)}(y, n)⟩_{R^n} (see (8)) we treat F^{(l−1)}(x, n) and F^{(l−1)}(y, n) as elements in R^n and not in R^∞, i.e. we consider only the first n components of F^{(l−1)}(x, n) and F^{(l−1)}(y, n).

3 PLAN SKETCH
We start by recalling the notion of convergence in law, also referred to as convergence in distribution or weak convergence, for a sequence of stochastic processes. See Billingsley (1995) for a comprehensive account.
Definition 3 (convergence in distribution). Suppose that f and (f(n))_{n≥1} are random elements in a topological space C. Then, (f(n))_{n≥1} is said to converge in distribution to f if E[h(f(n))] → E[h(f)] as n → ∞ for every bounded and continuous function h : C → R. In that case we write f(n) →d f.

In this paper, we deal with continuous and real-valued stochastic processes. More precisely, we consider random elements defined on C(R^I; S), with (S, d) a Polish space. Our aim is to study in C(R^I; S) the convergence in distribution, as the width n goes to infinity, of:

i) the sequence (f_i^{(l)}(n))_{n≥1} for fixed l ≥ 1 and i ≥ 1 with (S, d) = (R, |·|), i.e. the neural network process for a single unit;

ii) the sequence (F^{(l)}(n))_{n≥1} for fixed l ≥ 1 with (S, d) = (R^∞, ‖·‖_{R^∞}), i.e. the neural network process for all units.

Since applying Definition 3 in a function space is not easy, we need the following proposition, proved in SM F.

Proposition 1 (convergence in distribution in C(R^I; S), (S, d) Polish). Suppose that f and (f(n))_{n≥1} are random elements in C(R^I; S) with (S, d) a Polish space. Then, f(n) →d f if: i) f(n) →fd f and ii) the sequence (f(n))_{n≥1} is uniformly tight.

We denote by →fd the convergence in law of the finite-dimensional distributions of a sequence of stochastic processes. The notion of tightness formalizes the concept that the probability mass is not allowed to "escape at infinity": a single random element f in a topological space C is said to be tight if for each ε > 0 there exists a compact T ⊂ C such that P[f ∈ C \ T] < ε. If a metric space (C, ρ) is Polish, any random element on the Borel σ-algebra of C is tight.
A sequence of random elements (f(n))_{n≥1} in a topological space C is said to be uniformly tight if for every ε > 0 there exists a compact T ⊂ C such that P[f(n) ∈ C \ T] < ε for all n.¹

According to Proposition 1, to achieve convergence in distribution in function spaces we need the following Steps A-D:

Step A) Establish the existence of the finite-dimensional weak limit f on R^I. We will rely on Theorem 5.3 of Kallenberg (2002), known as Lévy's theorem.

Step B) Establish the existence of the stochastic processes f and (f(n))_{n≥1} as elements of S^{R^I}, the space of functions from R^I into S. We make use of the Daniell-Kolmogorov criterion (Kallenberg, 2002, Theorem 6.16): given a family of multivariate distributions {P_I : P_I a probability measure on R^{dim(I)}, I ⊂ {x^{(1)}, ..., x^{(k)}}, x^{(z)} ∈ R^I, k ∈ N}, there exists a stochastic process with {P_I} as finite-dimensional distributions if {P_I} satisfies the projective property: P_J(· × R^{J∖I}) = P_I(·) for I ⊂ J ⊂ {x^{(1)}, ..., x^{(k)}}, x^{(z)} ∈ R^I, k ∈ N. That is, consistency with respect to marginalization over arbitrary components is required. In this step we also suppose, for a moment, that the stochastic processes (f(n))_{n≥1} and f belong to C(R^I; S), and we establish the existence of such stochastic processes in C(R^I; S) endowed with a σ-algebra and a probability measure to be defined.

Step C) Show that the stochastic processes (f(n))_{n≥1} and f belong to C(R^I; S) ⊂ S^{R^I}. With regards to (f(n))_{n≥1} this is a direct consequence of (5)-(6) and the continuity of φ. With regards to the limiting process f, under an additional Lipschitz assumption on φ, we rely on the following Kolmogorov-Chentsov criterion (Kallenberg, 2002, Theorem 3.23):

¹Kallenberg (2002) uses the same term "tightness" for both cases of a single random element and of sequences of random elements; we find that the introduction of "uniform tightness" brings more clarity.
Proposition 2 (continuous version and local Hölderianity, (S, d) complete). Let f be a process on R^I with values in a complete metric space (S, d), and assume that there exist a, b, H > 0 such that

E[d(f(x), f(y))^a] ≤ H ‖x − y‖^{I+b},   x, y ∈ R^I.

Then f has a continuous version (i.e. f belongs to C(R^I; S)), and the latter is a.s. locally Hölder continuous with exponent c for any c ∈ (0, b/a).

Step D) Establish the uniform tightness of (f(n))_{n≥1} in C(R^I; S). We rely on an extension of the Kolmogorov-Chentsov criterion (Kallenberg, 2002, Corollary 16.9), which is stated in the following proposition.

Proposition 3 (uniform tightness in C(R^I; S), (S, d) Polish). Suppose that (f(n))_{n≥1} are random elements in C(R^I; S) with (S, d) a Polish space. Assume that (f(0_{R^I}, n))_{n≥1} (i.e. f(n) evaluated at the origin) is uniformly tight in S and that there exist a, b, H > 0 such that

E[d(f(x, n), f(y, n))^a] ≤ H ‖x − y‖^{I+b},   x, y ∈ R^I, n ∈ N,

uniformly in n. Then (f(n))_{n≥1} is uniformly tight in C(R^I; S).

4 LARGE-WIDTH FUNCTIONAL LIMITS
4.1 LIMIT ON C(R^I; S), WITH (S, d) = (R, |·|), FOR A FIXED UNIT i ≥ 1 AND LAYER l

Lemma 1 (finite-dimensional limit). If φ satisfies (4) then there exists a stochastic process f_i^{(l)} : R^I → R such that (f_i^{(l)}(n))_{n≥1} →fd f_i^{(l)} as n → ∞.

Proof. Fix l ≥ 1 and i ≥ 1. Fixing k inputs X = [x^{(1)}, ..., x^{(k)}], we show that, as n → +∞,

f_i^{(l)}(X, n) →d N_k(0, Σ(l)),   (7)

where Σ(l) denotes the k × k covariance matrix, which can be computed through the recursion:

Σ(1)_{i,j} = σ_b² + σ_ω² ⟨x^{(i)}, x^{(j)}⟩_{R^I},   Σ(l)_{i,j} = σ_b² + σ_ω² ∫ φ(f_i) φ(f_j) q^{(l−1)}(df),

where q^{(l−1)} = N_k(0, Σ(l−1)). By means of (2), (3), (5) and (6),

f_i^{(1)}(X) =d N_k(0, Σ(1)),   Σ(1)_{i,j} = σ_b² + σ_ω² ⟨x^{(i)}, x^{(j)}⟩_{R^I},
f_i^{(l)}(X, n) | f^{(l−1)}_{1,...,n} =d N_k(0, Σ(l, n)), for l ≥ 2,   Σ(l, n)_{i,j} = σ_b² + (σ_ω²/n) ⟨φ • F_i^{(l−1)}(X, n), φ • F_j^{(l−1)}(X, n)⟩_{R^n}.   (8)

We prove (7) using Lévy's theorem, that is, the point-wise convergence of the sequence of characteristic functions of (8). We defer to SM A for the complete proof.

Lemma 1 proves Step A. Its proof gives an alternative and self-contained proof of the main result of Matthews et al. (2018b), under the more general assumption that the activation function φ satisfies the polynomial envelope (4). Now we prove Step B, i.e. the existence of the stochastic processes f_i^{(l)}(n) and f_i^{(l)} on the space R^{R^I}, for each layer l ≥ 1, unit i ≥ 1 and n ∈ N. In SM E.1 we show that the finite-dimensional distributions of f_i^{(l)}(n) satisfy the Daniell-Kolmogorov criterion (Kallenberg, 2002, Theorem 6.16), and hence the stochastic process f_i^{(l)}(n) exists. In SM E.2 we prove a similar result for the finite-dimensional distributions of the limiting process f_i^{(l)}.
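The recursion for Σ(l) in Lemma 1 is straightforward to evaluate numerically; the sketch below approximates the Gaussian integral at each layer by Monte Carlo (the function name, sample size and defaults are illustrative choices of ours):

```python
import numpy as np

def limiting_covariance(X, L, sigma_w2=1.0, sigma_b2=1.0, phi=np.tanh,
                        n_mc=100_000, seed=0):
    """Limiting k x k covariance Sigma(L) of Lemma 1:
      Sigma(1)_{ij} = sigma_b^2 + sigma_w^2 <x^(i), x^(j)>,
      Sigma(l)_{ij} = sigma_b^2 + sigma_w^2 E[phi(f_i) phi(f_j)],
    with f ~ N_k(0, Sigma(l-1)); the expectation is Monte Carlo-approximated.
    X is the I x k matrix of the k inputs."""
    rng = np.random.default_rng(seed)
    Sigma = sigma_b2 + sigma_w2 * (X.T @ X)
    for _ in range(L - 1):
        f = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n_mc)
        Sigma = sigma_b2 + sigma_w2 * (phi(f).T @ phi(f)) / n_mc
    return Sigma
```

For L = 1 the function returns σ_b² + σ_ω² X^T X exactly; deeper layers inherit a Monte Carlo error of order n_mc^{−1/2}.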
In SM E.3 we prove that, if these stochastic processes are continuous, they are naturally defined in C(R^I; R). In order to prove the continuity, i.e. Step C, note that f_i^{(1)}(x) = Σ_{j=1}^I ω_{i,j}^{(1)} x_j + b_i^{(1)} is continuous by construction; thus, by induction on l, if the f_j^{(l−1)}(n) are continuous for each j ≥ 1 and n, then f_i^{(l)}(x, n) = (1/√n) Σ_{j=1}^n ω_{i,j}^{(l)} φ(f_j^{(l−1)}(x, n)) + b_i^{(l)} is continuous, being a composition of continuous functions. For the limiting process f_i^{(l)} we assume φ to be Lipschitz with Lipschitz constant L_φ. In particular we have the following:

Lemma 2 (continuity). If φ is Lipschitz on R then f_i^{(l)}(1), f_i^{(l)}(2), ... are P-a.s. Lipschitz on R^I, while the limiting process f_i^{(l)} is P-a.s. continuous on R^I and locally γ-Hölder continuous for each 0 < γ < 1.

Proof.
Here we present a sketch of the proof, and we defer to SM B.1 and SM B.2 for the complete proof. For (f_i^{(l)}(n))_{n≥1} it is straightforward to show that, for each n,

|f_i^{(l)}(x, n) − f_i^{(l)}(y, n)| ≤ H_i^{(l)}(n) ‖x − y‖_{R^I},   x, y ∈ R^I, P-a.s.,   (9)

where H_i^{(l)}(n) denotes a suitable random variable, defined by the following recursion over l:

H_i^{(1)}(n) = Σ_{j=1}^I |ω_{i,j}^{(1)}|,   H_i^{(l)}(n) = (L_φ/√n) Σ_{j=1}^n |ω_{i,j}^{(l)}| H_j^{(l−1)}(n).   (10)

To establish the continuity of the limiting process f_i^{(l)} we rely on Proposition 2. Take two inputs x, y ∈ R^I. From (7) we get that [f_i^{(l)}(x), f_i^{(l)}(y)] ∼ N_2(0, Σ(l)), where

Σ(1) = σ_b² [1 1; 1 1] + σ_ω² [‖x‖²_{R^I} ⟨x, y⟩_{R^I}; ⟨x, y⟩_{R^I} ‖y‖²_{R^I}],
Σ(l) = σ_b² [1 1; 1 1] + σ_ω² ∫ [|φ(u)|² φ(u)φ(v); φ(u)φ(v) |φ(v)|²] q^{(l−1)}(du, dv),

where q^{(l−1)} = N_2(0, Σ(l−1)). Defining a^T = [1, −1], from (7) we know that f_i^{(l)}(y) − f_i^{(l)}(x) ∼ N(a^T 0, a^T Σ(l) a). Thus |f_i^{(l)}(y) − f_i^{(l)}(x)|^{2θ} ∼ |√(a^T Σ(l) a) N(0, 1)|^{2θ} ∼ (a^T Σ(l) a)^θ |N(0, 1)|^{2θ}. We proceed by induction over the layers. For l = 1,

E[|f_i^{(1)}(y) − f_i^{(1)}(x)|^{2θ}] = C_θ (a^T Σ(1) a)^θ = C_θ (σ_ω² ‖y‖²_{R^I} − 2σ_ω² ⟨y, x⟩_{R^I} + σ_ω² ‖x‖²_{R^I})^θ = C_θ (σ_ω²)^θ (‖y‖²_{R^I} − 2⟨y, x⟩_{R^I} + ‖x‖²_{R^I})^θ = C_θ (σ_ω²)^θ ‖y − x‖^{2θ}_{R^I},

where C_θ = E[|N(0, 1)|^{2θ}]. By the induction hypothesis there exists a constant H^{(l−1)} > 0 such that ∫ |u − v|^{2θ} q^{(l−1)}(du, dv) ≤ H^{(l−1)} ‖y − x‖^{2θ}_{R^I}.
Then,

|f_i^{(l)}(y) − f_i^{(l)}(x)|^{2θ} ∼ |N(0, 1)|^{2θ} (a^T Σ(l) a)^θ
= |N(0, 1)|^{2θ} (σ_ω² ∫ [|φ(u)|² − 2φ(u)φ(v) + |φ(v)|²] q^{(l−1)}(du, dv))^θ
≤ |N(0, 1)|^{2θ} (σ_ω² L_φ²)^θ (∫ |u − v|² q^{(l−1)}(du, dv))^θ
≤ |N(0, 1)|^{2θ} (σ_ω² L_φ²)^θ ∫ |u − v|^{2θ} q^{(l−1)}(du, dv)
≤ |N(0, 1)|^{2θ} (σ_ω² L_φ²)^θ H^{(l−1)} ‖y − x‖^{2θ}_{R^I},

where we used |φ(u)|² − 2φ(u)φ(v) + |φ(v)|² = |φ(u) − φ(v)|² ≤ L_φ² |u − v|² and the Jensen inequality. Thus,

E[|f_i^{(l)}(y) − f_i^{(l)}(x)|^{2θ}] ≤ H^{(l)} ‖y − x‖^{2θ}_{R^I},   (11)

where the constant H^{(l)} can be explicitly derived by solving the following system:

H^{(1)} = C_θ (σ_ω²)^θ,   H^{(l)} = C_θ (σ_ω² L_φ²)^θ H^{(l−1)}.   (12)

It is easy to get H^{(l)} = C_θ^l (σ_ω²)^{lθ} (L_φ²)^{(l−1)θ}. Observe that H^{(l)} does not depend on i (this will be helpful in establishing the uniform tightness of (f_i^{(l)}(n))_{n≥1} and the continuity of F^{(l)}). By Proposition 2, setting α = 2θ and β = 2θ − I (since β needs to be positive, it is sufficient to choose θ > I/2), we get that f_i^{(l)} has a continuous version, and the latter is P-a.s. locally γ-Hölder continuous for every 0 < γ < 1 − I/(2θ), for each θ > I/2. Taking the limit as θ → +∞ we conclude the proof.

Lemma 3 (uniform tightness). If φ is Lipschitz on R then (f_i^{(l)}(n))_{n≥1} is uniformly tight in C(R^I; R).

Proof. We defer to SM B.3 for details. Fix i ≥ 1 and l ≥ 1. We apply Proposition 3 to show the uniform tightness of the sequence (f_i^{(l)}(n))_{n≥1} in C(R^I; R). By Lemma 2, f_i^{(l)}(1), f_i^{(l)}(2), ... are random elements in C(R^I; R). Since (R, |·|) is Polish, every probability measure is tight; hence f_i^{(l)}(0_{R^I}, n) is tight in R for every n.
Moreover, by Lemma 1, (f_i^{(l)}(0_{R^I}, n))_{n≥1} →d f_i^{(l)}(0_{R^I}); therefore, by (Dudley, 2002, Theorem 11.5.3), (f_i^{(l)}(0_{R^I}, n))_{n≥1} is uniformly tight in R. It remains to show that there exist two values α > 0 and β > 0, and a constant H^{(l)} > 0, such that

E[|f_i^{(l)}(y, n) − f_i^{(l)}(x, n)|^α] ≤ H^{(l)} ‖y − x‖^{I+β}_{R^I},   x, y ∈ R^I, n ∈ N,

uniformly in n. Take two points x, y ∈ R^I. From (8) we know that f_i^{(l)}(y, n) | f^{(l−1)}_{1,...,n} ∼ N(0, σ_y²(l, n)) and f_i^{(l)}(x, n) | f^{(l−1)}_{1,...,n} ∼ N(0, σ_x²(l, n)) with joint distribution N_2(0, Σ(l, n)), where

Σ(1) = [σ_x²(1) Σ(1)_{x,y}; Σ(1)_{x,y} σ_y²(1)],   Σ(l, n) = [σ_x²(l, n) Σ(l, n)_{x,y}; Σ(l, n)_{x,y} σ_y²(l, n)],

with

σ_x²(1) = σ_b² + σ_ω² ‖x‖²_{R^I},   σ_y²(1) = σ_b² + σ_ω² ‖y‖²_{R^I},   Σ(1)_{x,y} = σ_b² + σ_ω² ⟨x, y⟩_{R^I},
σ_x²(l, n) = σ_b² + (σ_ω²/n) Σ_{j=1}^n |φ(f_j^{(l−1)}(x, n))|²,   σ_y²(l, n) = σ_b² + (σ_ω²/n) Σ_{j=1}^n |φ(f_j^{(l−1)}(y, n))|²,
Σ(l, n)_{x,y} = σ_b² + (σ_ω²/n) Σ_{j=1}^n φ(f_j^{(l−1)}(x, n)) φ(f_j^{(l−1)}(y, n)).

Defining a^T = [1, −1], we have that f_i^{(l)}(y, n) | f^{(l−1)}_{1,...,n} − f_i^{(l)}(x, n) | f^{(l−1)}_{1,...,n} is distributed as N(a^T 0, a^T Σ(l, n) a), where a^T Σ(l, n) a = σ_y²(l, n) − 2Σ(l, n)_{x,y} + σ_x²(l, n). Consider α = 2θ with θ an integer. Thus

|f_i^{(l)}(y, n) − f_i^{(l)}(x, n)|^{2θ} | f^{(l−1)}_{1,...,n} ∼ |√(a^T Σ(l, n) a) N(0, 1)|^{2θ} ∼ (a^T Σ(l, n) a)^θ |N(0, 1)|^{2θ}.

As in the previous theorem, for l = 1 we get E[|f_i^{(1)}(y, n) − f_i^{(1)}(x, n)|^{2θ}] = C_θ (σ_ω²)^θ ‖y − x‖^{2θ}_{R^I}, where C_θ = E[|N(0, 1)|^{2θ}]. Set H^{(1)} = C_θ (σ_ω²)^θ and, by the induction hypothesis, suppose that for every j ≥ 1,

E[|f_j^{(l−1)}(y, n) − f_j^{(l−1)}(x, n)|^{2θ}] ≤ H^{(l−1)} ‖y − x‖^{2θ}_{R^I}.
Since φ is Lipschitz,

E[|f_i^{(l)}(y, n) − f_i^{(l)}(x, n)|^{2θ} | f^{(l−1)}_{1,...,n}] = C_θ (a^T Σ(l, n) a)^θ = C_θ (σ_y²(l, n) − 2Σ(l, n)_{x,y} + σ_x²(l, n))^θ
= C_θ ((σ_ω²/n) Σ_{j=1}^n |φ(f_j^{(l−1)}(y, n)) − φ(f_j^{(l−1)}(x, n))|²)^θ
≤ C_θ ((σ_ω² L_φ²/n) Σ_{j=1}^n |f_j^{(l−1)}(y, n) − f_j^{(l−1)}(x, n)|²)^θ
= C_θ ((σ_ω² L_φ²)^θ / n^θ) (Σ_{j=1}^n |f_j^{(l−1)}(y, n) − f_j^{(l−1)}(x, n)|²)^θ
≤ C_θ ((σ_ω² L_φ²)^θ / n) Σ_{j=1}^n |f_j^{(l−1)}(y, n) − f_j^{(l−1)}(x, n)|^{2θ},

where the last step follows from the Jensen inequality. Hence

E[|f_i^{(l)}(y, n) − f_i^{(l)}(x, n)|^{2θ}] = E[E[|f_i^{(l)}(y, n) − f_i^{(l)}(x, n)|^{2θ} | f^{(l−1)}_{1,...,n}]]
≤ C_θ ((σ_ω² L_φ²)^θ / n) Σ_{j=1}^n E[|f_j^{(l−1)}(y, n) − f_j^{(l−1)}(x, n)|^{2θ}] ≤ C_θ (σ_ω² L_φ²)^θ H^{(l−1)} ‖y − x‖^{2θ}_{R^I}.

We can obtain the constant H^{(l)} by solving the same system as (12), obtaining H^{(l)} = C_θ^l (σ_ω²)^{lθ} (L_φ²)^{(l−1)θ}, which does not depend on n. By Proposition 3, setting α = 2θ and β = 2θ − I (since β must be positive, it is sufficient to take θ > I/2), this concludes the proof.

Note that Lemma 3 provides the last Step D, which allows us to prove the desired result, stated in the theorem that follows:
Theorem 1 (functional limit). If φ is Lipschitz on R then f_i^{(l)}(n) →d f_i^{(l)} on C(R^I; R).

Proof. We apply Proposition 1 to (f_i^{(l)}(n))_{n≥1}. By Lemma 2, we have that f_i^{(l)} and (f_i^{(l)}(n))_{n≥1} belong to C(R^I; R). From Lemma 1 we have the convergence of the finite-dimensional distributions of (f_i^{(l)}(n))_{n≥1}, and from Lemma 3 we have the uniform tightness of (f_i^{(l)}(n))_{n≥1}.

4.2 LIMIT ON C(R^I; S), WITH (S, d) = (R^∞, ‖·‖_{R^∞}), FOR A FIXED LAYER l

As in the previous section, we prove Steps A-D for the sequence (F^{(l)}(n))_{n≥1}. Remark that each stochastic process F^{(l)}, F^{(l)}(1), F^{(l)}(2), ... defines on C(R^I; R^∞) a joint measure whose i-th marginal is the measure induced respectively by f_i^{(l)}, f_i^{(l)}(1), f_i^{(l)}(2), ... (see SM E.1-SM E.4). Let F^{(l)} =d ⊗_{i=1}^∞ f_i^{(l)}, where ⊗ denotes the product measure.

Lemma 4 (finite-dimensional limit). If φ satisfies (4) then F^{(l)}(n) →fd F^{(l)} as n → ∞.

Proof. The proof follows by Lemma 1 and the Cramér-Wold theorem for finite-dimensional projections of F^{(l)}(n): it is sufficient to establish the large-n asymptotics of linear combinations of the f_i^{(l)}(X, n)'s for i ∈ L ⊂ N. In particular, we show that for any choice of input elements X, as n → +∞,

F^{(l)}(X, n) →d ⊗_{i=1}^∞ N_k(0, Σ(l)),   (13)

where Σ(l) is defined in (7). The proof is reported in SM C.

Lemma 5 (continuity). If φ is Lipschitz on R then F^{(l)} and (F^{(l)}(n))_{n≥1} belong to C(R^I; R^∞). More precisely, F^{(l)}(1), F^{(l)}(2), ... are P-a.s. Lipschitz on R^I, while the limiting process F^{(l)} is P-a.s. continuous on R^I and locally γ-Hölder continuous for each 0 < γ < 1.

Proof. It derives immediately from Lemma 2. We defer to SM D.1 and SM D.2 for details.
The continuity of each process in the sequence immediately follows from the Lipschitzianity of each component in (9), while the continuity of the limiting process F^{(l)} is proved by applying Proposition 2. Take two inputs x, y ∈ R^I and fix an even integer α ≥ 2. Since ξ(t) ≤ t for all t ≥ 0, and by the Jensen inequality (the weights 1/2^i sum to 1),

d_∞(F^{(l)}(x), F^{(l)}(y))^α ≤ (Σ_{i=1}^∞ (1/2^i) |f_i^{(l)}(x) − f_i^{(l)}(y)|)^α ≤ Σ_{i=1}^∞ (1/2^i) |f_i^{(l)}(x) − f_i^{(l)}(y)|^α.

Thus, by applying the monotone convergence theorem to the positive increasing sequence g(N) = Σ_{i=1}^N (1/2^i) |f_i^{(l)}(x) − f_i^{(l)}(y)|^α (which allows us to exchange E and Σ_{i=1}^∞), we get

E[d_∞(F^{(l)}(x), F^{(l)}(y))^α] ≤ E[Σ_{i=1}^∞ (1/2^i) |f_i^{(l)}(x) − f_i^{(l)}(y)|^α] = lim_{N→∞} E[Σ_{i=1}^N (1/2^i) |f_i^{(l)}(x) − f_i^{(l)}(y)|^α]
= Σ_{i=1}^∞ (1/2^i) E[|f_i^{(l)}(x) − f_i^{(l)}(y)|^α] = Σ_{i=1}^∞ (1/2^i) H^{(l)} ‖x − y‖^α_{R^I} = H^{(l)} ‖x − y‖^α_{R^I},

where we used (11) and the fact that H^{(l)} does not depend on i (see (12)). Therefore, by Proposition 2, setting β = α − I (since β needs to be positive, it is sufficient to choose α > I), F^{(l)} has a continuous version which is P-a.s. locally γ-Hölder continuous for every 0 < γ < 1 − I/α. Letting α → ∞ we conclude.

Theorem 2 (functional limit). If φ is Lipschitz on R then (F^{(l)}(n))_{n≥1} →d F^{(l)} as n → ∞ on C(R^I; R^∞).

Proof. This is Proposition 1 applied to (F^{(l)}(n))_{n≥1}. Given Lemma 4 and Lemma 5, it remains to show the uniform tightness of the sequence (F^{(l)}(n))_{n≥1} in C(R^I; R^∞). Let ε > 0 and let (ε_i)_{i≥1} be a positive sequence such that Σ_{i=1}^∞ ε_i = ε/2.
We have established the uniform tightness of each component (Lemma 3). Therefore for each $i \in \mathbb{N}$ there exists a compact $K_i \subset C(\mathbb{R}^I;\mathbb{R})$ such that $\mathbb{P}[f^{(l)}_i(n) \in C(\mathbb{R}^I;\mathbb{R})\setminus K_i] < \epsilon_i$ for each $n \in \mathbb{N}$ (such a compact set depends on $\epsilon_i$). Set $K = \times_{i=1}^\infty K_i$, which is compact by Tychonoff's theorem. Note that $K$ is compact in the product space $\times_{i=1}^\infty C(\mathbb{R}^I;\mathbb{R})$ with the associated product topology, and it is also compact in $C(\mathbb{R}^I;\mathbb{R}^\infty)$ (see SM E.4). Then
$$\mathbb{P}\big[F^{(l)}(n) \in C(\mathbb{R}^I;\mathbb{R}^\infty)\setminus K\big] = \mathbb{P}\Big[\bigcup_{i=1}^\infty\big\{f^{(l)}_i(n) \in C(\mathbb{R}^I;\mathbb{R})\setminus K_i\big\}\Big] \le \sum_{i=1}^\infty \mathbb{P}\big[f^{(l)}_i(n) \in C(\mathbb{R}^I;\mathbb{R})\setminus K_i\big] \le \sum_{i=1}^\infty \epsilon_i < \epsilon,$$
which concludes the proof.

DISCUSSION
We looked at deep Gaussian neural networks as stochastic processes, i.e. infinite-dimensional random elements, on the input space $\mathbb{R}^I$, and we showed that: i) a network defines a stochastic process on the input space $\mathbb{R}^I$; ii) under suitable assumptions on the activation function, a network with re-scaled weights converges weakly to a Gaussian Process in the large-width limit. These results extend previous works (Neal, 1995; Der & Lee, 2006; Lee et al., 2018; Matthews et al., 2018a;b; Yang, 2019) that investigate the limiting distribution of neural networks over a countable number of distinct inputs. From the point of view of applications, convergence in distribution is the starting point for convergence of expectations. Let us consider a continuous function $g: C(\mathbb{R}^I;\mathbb{R}^\infty)\to\mathbb{R}$. By the continuous mapping theorem (Billingsley, 1999, Theorem 2.7), we have $g(F^{(l)}(n)) \stackrel{d}{\to} g(F^{(l)})$ as $n\to+\infty$, and under uniform integrability (Billingsley, 1999, Section 3), we have (Billingsley, 1999, Theorem 3.5) $\mathbb{E}[g(F^{(l)}(n))] \to \mathbb{E}[g(F^{(l)})]$ as $n\to+\infty$. See also Dudley (2002) and references therein.

As a by-product of our results we showed that, under a Lipschitz activation function, the limiting Gaussian Process has almost surely locally $\gamma$-Hölder continuous paths, for $0 < \gamma < 1$. This raises the question of whether it is possible to strengthen our results to cover the case $\gamma = 1$, or even the case of local Lipschitzianity of the paths of the limiting process. In addition, if the activation function is differentiable, does this property transfer to the limiting process? We leave these questions to future research. Finally, while fully-connected deep neural networks represent an ideal starting point for theoretical analysis, modern neural network architectures are composed of a much richer class of layers which includes convolutional, residual, recurrent and attention components.
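Returning to the convergence of expectations noted above, a crude numerical illustration is possible: the sketch below (numpy assumed; all sizes and parameter values are hypothetical choices, with $\sigma_b^2 = \sigma_\omega^2 = 1$) takes $g$ to be the second moment of a single output $f^{(2)}_1(x)$ at a fixed input and compares its empirical value over independent finite-width networks with the value under the limiting Gaussian law obtained from the covariance recursion.

```python
import numpy as np

rng = np.random.default_rng(0)
I, n, reps = 3, 200, 20_000       # input dim, width, number of independent networks
phi = np.tanh                     # Lipschitz activation
x = rng.normal(size=I)

# variance of the limiting Gaussian at layer 2:
# sigma_b^2 + sigma_w^2 * E[phi(u)^2], with u ~ N(0, sigma_b^2 + sigma_w^2 * ||x||^2)
u = rng.normal(scale=np.sqrt(1.0 + x @ x), size=1_000_000)
var_limit = 1.0 + np.mean(phi(u) ** 2)

# empirical second moment of f^(2)_1(x) over independent width-n networks
vals = np.empty(reps)
for r in range(reps):
    h1 = rng.normal(size=(n, I)) @ x + rng.normal(size=n)   # first layer, n units
    w2, b2 = rng.normal(size=n), rng.normal()
    vals[r] = w2 @ phi(h1) / np.sqrt(n) + b2                # re-scaled second layer
var_emp = np.mean(vals ** 2)

assert abs(var_emp - var_limit) < 0.1
```

Note that this particular $g$ is a weak check: for iid weights the second moment of $f^{(2)}_1(x)$ already matches the limit at any finite width, and Gaussianity of the law is what the large-width limit adds on top.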
The technical arguments followed in this paper are amenable to extension to more complex network architectures. Providing a mathematical formulation of network architectures and convergence results in a way that allows for extensions to arbitrary architectures, instead of providing an ad-hoc proof for each specific case, is a fundamental research problem. Greg Yang's work on Tensor Programs (Yang, 2019) constitutes an important step in this direction.

REFERENCES
Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer, 2006.

Patrick Billingsley. Probability and Measure. John Wiley & Sons, 3rd edition, 1995.

Patrick Billingsley. Convergence of Probability Measures. Wiley-Interscience, 2nd edition, 1999.

J. R. Blum, H. Chernoff, M. Rosenblatt, and H. Teicher. Central Limit Theorems for Interchangeable Processes. Canadian Journal of Mathematics, 10:222-229, 1958.

Ricky Der and Daniel D. Lee. Beyond Gaussian Processes: On the Distributions of Infinite Networks. In Advances in Neural Information Processing Systems, pp. 275-282, 2006.

R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002.

Olav Kallenberg. Foundations of Modern Probability. Springer, 2nd edition, 2002.

Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep Neural Networks as Gaussian Processes. In International Conference on Learning Representations, 2018.

Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite Versus Infinite Neural Networks: an Empirical Study. In Advances in Neural Information Processing Systems, volume 33, 2020.

Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian Process Behaviour in Wide Deep Neural Networks. In International Conference on Learning Representations, 2018a.

Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Gaussian Process Behaviour in Wide Deep Neural Networks, 2018b.

Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

Aad van der Vaart and Harry van Zanten. Information Rates of Nonparametric Gaussian Process Methods. Journal of Machine Learning Research, 12(6), 2011.

Greg Yang. Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. In Advances in Neural Information Processing Systems, volume 32, 2019.
SM A
The following inequalities will be used several times in the proofs without explicit mention:

1) For any real values $a_1,\dots,a_n \ge 0$ and $s \ge 1$, $(a_1+\dots+a_n)^s \le n^{s-1}(a_1^s+\dots+a_n^s)$. It follows immediately by applying the convex function $x \mapsto x^s$ to the weighted sum $(a_1+\dots+a_n)/n$.

2) For any real values $a_1,\dots,a_n \in \mathbb{R}$ and $0 < s < 1$, $|a_1+\dots+a_n|^s \le |a_1|^s+\dots+|a_n|^s$. It follows immediately by studying the $s$-Hölder function $x \mapsto |x|^s$.

By means of (2), (3) and (5), we can write for $i \ge 1$
$$\varphi_{f^{(1)}_i(\mathbf{X})}(\mathbf{t}) = \mathbb{E}\big[e^{\mathrm{i}\mathbf{t}^T f^{(1)}_i(\mathbf{X})}\big] = \mathbb{E}\Big[\exp\Big\{\mathrm{i}\mathbf{t}^T\Big[\sum_{j=1}^I \omega^{(1)}_{i,j}\mathbf{x}_j + b^{(1)}_i\mathbf{1}\Big]\Big\}\Big] = \mathbb{E}\big[\exp\{\mathrm{i}(\mathbf{t}^T\mathbf{1})\,b^{(1)}_i\}\big]\prod_{j=1}^I \mathbb{E}\big[\exp\{\mathrm{i}(\mathbf{t}^T\mathbf{x}_j)\,\omega^{(1)}_{i,j}\}\big]$$
$$= \exp\Big\{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2\Big\}\prod_{j=1}^I \exp\Big\{-\frac{\sigma_\omega^2}{2}(\mathbf{t}^T\mathbf{x}_j)^2\Big\} = \exp\Big\{-\frac{1}{2}\Big[\sigma_b^2(\mathbf{t}^T\mathbf{1})^2 + \sigma_\omega^2\sum_{j=1}^I(\mathbf{t}^T\mathbf{x}_j)^2\Big]\Big\} = \exp\Big\{-\frac{1}{2}\mathbf{t}^T\Sigma(1)\mathbf{t}\Big\},$$
i.e. $f^{(1)}_i(\mathbf{X}) \stackrel{d}{=} N_k(\mathbf{0},\Sigma(1))$, with $k\times k$ covariance matrix whose element in the $i$-th row and $j$-th column is
$$\Sigma(1)_{i,j} = \sigma_b^2 + \sigma_\omega^2\langle x^{(i)},x^{(j)}\rangle_{\mathbb{R}^I}.$$
Observe that we can also determine the marginal distributions,
$$f^{(1)}_{r,i}(\mathbf{X}) \sim N(0,\Sigma(1)_{r,r}), \qquad (14)$$
where $\Sigma(1)_{r,r} = \sigma_b^2 + \sigma_\omega^2\|x^{(r)}\|^2_{\mathbb{R}^I}$.
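The first-layer covariance has the closed form $\Sigma(1)_{i,j} = \sigma_b^2 + \sigma_\omega^2\langle x^{(i)},x^{(j)}\rangle_{\mathbb{R}^I}$, so it is straightforward to evaluate and to check against sampled first-layer units. A minimal sketch (numpy assumed; function and variable names are hypothetical, $\sigma_b^2 = \sigma_\omega^2 = 1$):

```python
import numpy as np

def sigma1(X, sb2=1.0, sw2=1.0):
    """Sigma(1)_{rs} = sb2 + sw2 * <x^(r), x^(s)> for inputs stacked as rows of X."""
    return sb2 + sw2 * X @ X.T

rng = np.random.default_rng(0)
k, I = 3, 5
X = rng.normal(size=(k, I))               # k inputs in R^I
S1 = sigma1(X)

# empirical check: each unit f^(1)_i evaluated at the k inputs is an iid
# draw from N_k(0, Sigma(1)); average outer products over many units
n_units = 200_000
W = rng.normal(size=(n_units, I))          # weights, one row per unit
b = rng.normal(size=(n_units, 1))          # biases
F = W @ X.T + b                            # shape (n_units, k)
S1_hat = F.T @ F / n_units                 # empirical covariance across units

assert np.max(np.abs(S1 - S1_hat)) < 0.2
```

The closing assertion is a loose Monte Carlo tolerance; the exact and empirical matrices agree entrywise at the usual $O(n_{\text{units}}^{-1/2})$ rate.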
For $i \ge 1$ and $l \ge 2$, by means of (2), (3) and (6) we can write
$$\varphi_{f^{(l)}_i(\mathbf{X},n)\,|\,f^{(l-1)}_{1,\dots,n}}(\mathbf{t}) = \mathbb{E}\big[e^{\mathrm{i}\mathbf{t}^T f^{(l)}_i(\mathbf{X},n)}\,\big|\,f^{(l-1)}_{1,\dots,n}\big] = \mathbb{E}\Big[\exp\Big\{\mathrm{i}\mathbf{t}^T\Big[\frac{1}{\sqrt{n}}\sum_{j=1}^n \omega^{(l)}_{i,j}(\phi\bullet f^{(l-1)}_j(\mathbf{X},n)) + b^{(l)}_i\mathbf{1}\Big]\Big\}\,\Big|\,f^{(l-1)}_{1,\dots,n}\Big]$$
$$= \mathbb{E}\big[\exp\{\mathrm{i}(\mathbf{t}^T\mathbf{1})\,b^{(l)}_i\}\big]\prod_{j=1}^n \mathbb{E}\Big[\exp\Big\{\mathrm{i}\,\omega^{(l)}_{i,j}\Big(\frac{1}{\sqrt{n}}\mathbf{t}^T(\phi\bullet f^{(l-1)}_j(\mathbf{X},n))\Big)\Big\}\,\Big|\,f^{(l-1)}_{1,\dots,n}\Big]$$
$$= \exp\Big\{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2\Big\}\prod_{j=1}^n \exp\Big\{-\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet f^{(l-1)}_j(\mathbf{X},n))\big)^2\Big\} = \exp\Big\{-\frac{1}{2}\Big[\sigma_b^2(\mathbf{t}^T\mathbf{1})^2 + \frac{\sigma_\omega^2}{n}\sum_{j=1}^n\big(\mathbf{t}^T(\phi\bullet f^{(l-1)}_j(\mathbf{X},n))\big)^2\Big]\Big\} = \exp\Big\{-\frac{1}{2}\mathbf{t}^T\Sigma(l,n)\mathbf{t}\Big\},$$
i.e. $f^{(l)}_i(\mathbf{X},n)\,|\,f^{(l-1)}_{1,\dots,n} \stackrel{d}{=} N_k(\mathbf{0},\Sigma(l,n))$, with $k\times k$ covariance matrix whose element in the $i$-th row and $j$-th column is
$$\Sigma(l,n)_{i,j} = \sigma_b^2 + \frac{\sigma_\omega^2}{n}\big\langle\phi\bullet F^{(l-1)}_i(\mathbf{X},n),\,\phi\bullet F^{(l-1)}_j(\mathbf{X},n)\big\rangle_{\mathbb{R}^n}.$$
Observe that we can also determine the marginal distributions,
$$f^{(l)}_{r,i}(\mathbf{X},n)\,|\,f^{(l-1)}_{1,\dots,n} \sim N(0,\Sigma(l,n)_{r,r}), \qquad (15)$$
where $\Sigma(l,n)_{r,r} = \sigma_b^2 + \frac{\sigma_\omega^2}{n}\|\phi\bullet F^{(l-1)}_r(\mathbf{X},n)\|^2_{\mathbb{R}^n}$.

SM A.1: ASYMPTOTICS FOR THE $i$-TH COORDINATE
First of all, from Definition 1, note that since $f^{(1)}_i(\mathbf{X})$ does not depend on $n$, we consider the limit as $n\to\infty$ only for $f^{(l)}_i(\mathbf{X},n)$ with $l \ge 2$. It follows directly from Equation (6) that, for every fixed $l$ and $n$, the sequence $(f^{(l)}_i(\mathbf{X},n))_{i\ge 1}$ is exchangeable. In particular, let $p^{(l)}_n$ denote the de Finetti (random) probability measure of the exchangeable sequence $(f^{(l)}_i(\mathbf{X},n))_{i\ge 1}$. That is, by the celebrated de Finetti representation theorem, conditionally on $p^{(l)}_n$ the $f^{(l)}_i(\mathbf{X},n)$'s are iid with distribution $p^{(l)}_n$. Now, consider the induction hypothesis that $p^{(l-1)}_n \stackrel{d}{\to} q^{(l-1)}$ as $n\to+\infty$, where $q^{(l-1)} = N_k(\mathbf{0},\Sigma(l-1))$. To establish the convergence in distribution we rely on Theorem 5.3 of Kallenberg (2002), known as Lévy's theorem, taking into account the point-wise convergence of the characteristic functions. Therefore we can write the following expression:
$$\varphi_{f^{(l)}_i(\mathbf{X},n)}(\mathbf{t}) = \mathbb{E}\big[e^{\mathrm{i}\mathbf{t}^T f^{(l)}_i(\mathbf{X},n)}\big] = \mathbb{E}\big[\mathbb{E}[e^{\mathrm{i}\mathbf{t}^T f^{(l)}_i(\mathbf{X},n)}\,|\,f^{(l-1)}_{1,\dots,n}]\big] = \mathbb{E}\Big[\exp\Big\{-\frac{1}{2}\mathbf{t}^T\Sigma(l,n)\mathbf{t}\Big\}\Big]$$
$$= e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}\Big[\exp\Big\{-\frac{\sigma_\omega^2}{2n}\sum_{j=1}^n\big(\mathbf{t}^T(\phi\bullet f^{(l-1)}_j(\mathbf{X},n))\big)^2\Big\}\Big] = e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}\Big[\mathbb{E}\Big[\exp\Big\{-\frac{\sigma_\omega^2}{2n}\sum_{j=1}^n\big(\mathbf{t}^T(\phi\bullet f^{(l-1)}_j(\mathbf{X},n))\big)^2\Big\}\,\Big|\,p^{(l-1)}_n\Big]\Big]$$
$$= e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}\Big[\prod_{j=1}^n\mathbb{E}\Big[\exp\Big\{-\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet f^{(l-1)}_j(\mathbf{X},n))\big)^2\Big\}\,\Big|\,p^{(l-1)}_n\Big]\Big] = e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}\Big[\Big(\int\exp\Big\{-\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\Big\}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^n\Big].$$
Observe that the last integral is with respect to $k$ coordinates, i.e. $\mathrm{d}\mathbf{f} = (\mathrm{d}f_1,\dots,\mathrm{d}f_k)$. Denote by $\stackrel{p}{\to}$ convergence in probability. We will prove the following lemmas:

L1) for each $l \ge 2$ and $s \ge 1$, $\mathbb{P}[p^{(l-1)}_n \in Y_s] = 1$, where $Y_s = \{p : \int\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,p(\mathrm{d}\mathbf{f}) < +\infty\}$;

L2) $\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \stackrel{p}{\to} \int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,q^{(l-1)}(\mathrm{d}\mathbf{f})$ as $n\to+\infty$;

L3) $\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\big[1 - \exp\big\{-\theta\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\big\}\big]\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \stackrel{p}{\to} 0$ as $n\to+\infty$, for every $\theta \in (0,1)$.

SM A.1.1: PROOF OF L1
In order to prove the three lemmas, we will use the envelope condition (4) many times without explicit mention. For $l = 2$ we have
$$\mathbb{E}\big[\|\phi\bullet f^{(1)}_i(\mathbf{X})\|^s_{\mathbb{R}^k}\big] \le \mathbb{E}\Big[\Big(\sum_{r=1}^k|\phi\circ f^{(1)}_{r,i}(\mathbf{X})|^2\Big)^{s/2}\Big] \le \mathbb{E}\Big[\Big(\sum_{r=1}^k|\phi\circ f^{(1)}_{r,i}(\mathbf{X})|\Big)^{s}\Big] \le \mathbb{E}\Big[k^{s-1}\sum_{r=1}^k|\phi\circ f^{(1)}_{r,i}(\mathbf{X})|^s\Big]$$
$$= k^{s-1}\sum_{r=1}^k\mathbb{E}\big[|\phi\circ f^{(1)}_{r,i}(\mathbf{X})|^s\big] \le k^{s-1}\sum_{r=1}^k\mathbb{E}\big[(a + b|f^{(1)}_{r,i}(\mathbf{X})|^m)^s\big] \le (2k)^{s-1}\sum_{r=1}^k\Big(a^s + b^s\,\mathbb{E}\big[|f^{(1)}_{r,i}(\mathbf{X})|^{sm}\big]\Big) < +\infty,$$
since $f^{(1)}_{r,i}(\mathbf{X}) \sim N(0,\sigma_b^2 + \sigma_\omega^2\|x^{(r)}\|^2_{\mathbb{R}^I})$ and then $\mathbb{E}[|f^{(1)}_{r,i}(\mathbf{X})|^{sm}] = M_{sm}(\sigma_b^2 + \sigma_\omega^2\|x^{(r)}\|^2_{\mathbb{R}^I})^{sm/2}$, where $M_c$ is the $c$-th moment of $|N(0,1)|$. Now assume that L1 is true for $(l-2)$, i.e. for each $s \ge 1$ it holds that $\int\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,p^{(l-2)}_n(\mathrm{d}\mathbf{f}) < +\infty$ uniformly in $n$, and we prove that it is true also for $(l-1)$:
$$\mathbb{E}\big[\|\phi\bullet f^{(l-1)}_i(\mathbf{X},n)\|^s_{\mathbb{R}^k}\,\big|\,f^{(l-2)}_{1,\dots,n}\big] \le \mathbb{E}\Big[k^{s-1}\sum_{r=1}^k|\phi\circ f^{(l-1)}_{r,i}(\mathbf{X},n)|^s\,\Big|\,f^{(l-2)}_{1,\dots,n}\Big] \le (2k)^{s-1}\sum_{r=1}^k\Big(a^s + b^s\,\mathbb{E}\big[|f^{(l-1)}_{r,i}(\mathbf{X},n)|^{ms}\,\big|\,f^{(l-2)}_{1,\dots,n}\big]\Big)$$
$$\le D(a,k,s) + D(b,k,s)\sum_{r=1}^k\mathbb{E}\big[|f^{(l-1)}_{r,i}(\mathbf{X},n)|^{ms}\,\big|\,f^{(l-2)}_{1,\dots,n}\big].$$
From (15) we get
$$\mathbb{E}\big[|f^{(l-1)}_{r,i}(\mathbf{X},n)|^{ms}\,\big|\,f^{(l-2)}_{1,\dots,n}\big] = M_{ms}\Big(\sigma_b^2 + \frac{\sigma_\omega^2}{n}\|\phi\bullet F^{(l-2)}_r(\mathbf{X},n)\|^2_{\mathbb{R}^n}\Big)^{sm/2} \le M_{ms}2^{\frac{sm}{2}-1}\Big(\sigma_b^{sm} + \frac{\sigma_\omega^{sm}}{n^{sm/2}}\|\phi\bullet F^{(l-2)}_r(\mathbf{X},n)\|^{sm}_{\mathbb{R}^n}\Big).$$
Thus we have
$$\mathbb{E}\big[\|\phi\bullet f^{(l-1)}_i(\mathbf{X},n)\|^s_{\mathbb{R}^k}\,\big|\,p^{(l-2)}_n\big] \le D(a,k,s) + D(b,k,s,m)\sum_{r=1}^k\Big(\sigma_b^{sm} + \frac{\sigma_\omega^{sm}}{n^{sm/2}}\,\mathbb{E}\big[\|\phi\bullet F^{(l-2)}_r(\mathbf{X},n)\|^{sm}_{\mathbb{R}^n}\,\big|\,p^{(l-2)}_n\big]\Big),$$
where
$$\mathbb{E}\big[\|\phi\bullet F^{(l-2)}_r(\mathbf{X},n)\|^{sm}_{\mathbb{R}^n}\,\big|\,p^{(l-2)}_n\big] \le \mathbb{E}\Big[n^{\frac{sm}{2}-1}\sum_{i=1}^n|\phi\circ f^{(l-2)}_{r,i}(\mathbf{X},n)|^{sm}\,\Big|\,p^{(l-2)}_n\Big] \le D(s,m)\,n^{sm/2}\int|\phi(f_r)|^{sm}\,p^{(l-2)}_n(\mathrm{d}f_r) \le D(s,m)\,n^{sm/2}\int\|\phi\bullet\mathbf{f}\|^{sm}_{\mathbb{R}^k}\,p^{(l-2)}_n(\mathrm{d}\mathbf{f}),$$
where the last inequality is due to the fact that $|\phi(f_r)|^{sm} \le \big(\sum_{r=1}^k|\phi(f_r)|^2\big)^{sm/2}$ and then
$$\int|\phi(f_r)|^{sm}\,p^{(l-2)}_n(\mathrm{d}f_r) \le \int\Big(\sum_{r=1}^k|\phi(f_r)|^2\Big)^{sm/2}p^{(l-2)}_n(\mathrm{d}f_1,\dots,\mathrm{d}f_k) = \int\|\phi\bullet\mathbf{f}\|^{sm}_{\mathbb{R}^k}\,p^{(l-2)}_n(\mathrm{d}\mathbf{f}).$$
So, we proved that
$$\mathbb{E}\big[\|\phi\bullet f^{(l-1)}_i(\mathbf{X},n)\|^s_{\mathbb{R}^k}\,\big|\,p^{(l-2)}_n\big] \le D(a,k,s) + D(b,k,s,m)\sum_{r=1}^k\Big(\sigma_b^{sm} + \sigma_\omega^{sm}D(s,m)\int\|\phi\bullet\mathbf{f}\|^{sm}_{\mathbb{R}^k}\,p^{(l-2)}_n(\mathrm{d}\mathbf{f})\Big), \qquad (16)$$
which is finite by the induction hypothesis, uniformly in $n$. To conclude, since conditionally on $p^{(l-1)}_n$ the $f^{(l-1)}_i(\mathbf{X},n)$'s are iid with distribution $p^{(l-1)}_n$, we get
$$\int\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) = \mathbb{E}\big[\|\phi\bullet f^{(l-1)}_i(\mathbf{X},n)\|^s_{\mathbb{R}^k}\,\big|\,p^{(l-1)}_n\big] = \mathbb{E}\big[\mathbb{E}[\|\phi\bullet f^{(l-1)}_i(\mathbf{X},n)\|^s_{\mathbb{R}^k}\,|\,p^{(l-2)}_n]\,\big|\,p^{(l-1)}_n\big] \le \mathrm{const}(a,k,s,m) < \infty, \qquad (17)$$
which is bounded uniformly in $n$ since the inner expectation is bounded uniformly in $n$ by (16).

Remark: $Y_s$ is a measurable set with respect to the weak topology for each $s \ge 1$. Indeed, for each $R \in \mathbb{N}$ define the map $T_R : U \to \mathbb{R}$,
$$T_R(p) = \int_{B_R(0)}\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,p(\mathrm{d}\mathbf{f}) = \int_{\mathbb{R}^k}\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,\mathcal{X}_{B_R(0)}(\mathbf{f})\,p(\mathrm{d}\mathbf{f}),$$
where $U := \{p : p \text{ is the distribution of a random variable } X : \Omega \to \mathbb{R}^k\}$ is endowed with the weak topology. Since $\bigcap_{R\in\mathbb{N}} T_R^{-1}((0,\infty)) = Y_s$ and $(0,\infty)$ is open, it is sufficient to prove that $T_R$ is continuous. Let $(p_m) \subset U$ be such that $p_m$ converges to $p$ with respect to the weak topology; then by Definition 3
$$|T_R(p_m) - T_R(p)| = \Big|\int\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,\mathcal{X}_{B_R(0)}(\mathbf{f})\,p_m(\mathrm{d}\mathbf{f}) - \int\|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,\mathcal{X}_{B_R(0)}(\mathbf{f})\,p(\mathrm{d}\mathbf{f})\Big| \to 0,$$
because the function $\mathbf{f} \mapsto \|\phi\bullet\mathbf{f}\|^s_{\mathbb{R}^k}\,\mathcal{X}_{B_R(0)}(\mathbf{f})$ is continuous (by composition of the continuous functions $\phi$ and $\|\cdot\|^s$) and bounded by the Weierstrass theorem.

SM A.1.2: PROOF OF L2
By induction hypothesis, $p^{(l-1)}_n$ converges weakly to a $p^{(l-1)}$ with respect to the weak topology, and the limit is degenerate, in the sense that it equals a.s. the distribution $q^{(l-1)}$. Then $p^{(l-1)}_n$ converges in probability to $p^{(l-1)}$, so for every subsequence $n'$ there exists a further subsequence $n''$ such that $p^{(l-1)}_{n''}$ converges a.s. to $p^{(l-1)}$. By induction hypothesis, $p^{(l-1)}$ is absolutely continuous with respect to the Lebesgue measure. Since $\phi$ is a.s. continuous and the sequence $\big((\mathbf{t}^T(\phi\bullet\mathbf{f}))^2\big)_{n\ge 1}$ is uniformly integrable with respect to $p^{(l-1)}_n$ (by the Cauchy-Schwarz inequality and L1, $\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^{2s}p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \le \|\mathbf{t}\|^{2s}_{\mathbb{R}^k}\int\|\phi\bullet\mathbf{f}\|^{2s}_{\mathbb{R}^k}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) < \infty$, thus the sequence is $L^s$-bounded for each $s \ge 1$, and so uniformly integrable), we can write
$$\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,p^{(l-1)}_{n''}(\mathrm{d}\mathbf{f}) \stackrel{\text{a.s.}}{\to} \int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,q^{(l-1)}(\mathrm{d}\mathbf{f}).$$
Thus, as $n\to+\infty$,
$$\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \stackrel{p}{\to} \int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,q^{(l-1)}(\mathrm{d}\mathbf{f}).$$

SM A.1.3:
PROOF OF L3

Let $p \ge 1$ and $q \ge 1$ be such that $\frac{1}{p} + \frac{1}{q} = 1$. By means of Hölder's inequality,
$$\int\|\phi\bullet\mathbf{f}\|^2_{\mathbb{R}^k}\Big(1 - e^{-\frac{\sigma_\omega^2}{2n}(\mathbf{t}^T(\phi\bullet\mathbf{f}))^2}\Big)\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \le \Big(\int\|\phi\bullet\mathbf{f}\|^{2p}_{\mathbb{R}^k}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{1/p}\Big(\int\Big(1 - e^{-\frac{\sigma_\omega^2}{2n}(\mathbf{t}^T(\phi\bullet\mathbf{f}))^2}\Big)^q\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{1/q}.$$
Since $q \ge 1$, for every $y \ge 0$ we have $0 \le 1 - e^{-y} < 1$, and then $(1-e^{-y})^q \le (1-e^{-y}) \le y$. This implies
$$\int\|\phi\bullet\mathbf{f}\|^2_{\mathbb{R}^k}\Big(1 - e^{-\frac{\sigma_\omega^2}{2n}(\mathbf{t}^T(\phi\bullet\mathbf{f}))^2}\Big)\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \le \Big(\int\|\phi\bullet\mathbf{f}\|^{2p}_{\mathbb{R}^k}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{1/p}\Big(\int\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{1/q}$$
$$\le \Big(\int\|\phi\bullet\mathbf{f}\|^{2p}_{\mathbb{R}^k}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{1/p}\Big(\|\mathbf{t}\|^2_{\mathbb{R}^k}\frac{\sigma_\omega^2}{2n}\int\|\phi\bullet\mathbf{f}\|^2_{\mathbb{R}^k}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{1/q} \to 0,$$
as $n\to+\infty$, since by L1 the two integrals are bounded uniformly in $n$. Thus, since for every $y > 0$ and $\theta \in (0,1)$ we have $e^{-\theta y} \ge e^{-y}$, and hence $0 \le 1 - e^{-\theta y} \le 1 - e^{-y} \le 1$, we get
$$0 \le \int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\Big[1 - \exp\Big\{-\theta\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\Big\}\Big]\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \le \|\mathbf{t}\|^2_{\mathbb{R}^k}\int\|\phi\bullet\mathbf{f}\|^2_{\mathbb{R}^k}\Big[1 - \exp\Big\{-\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\Big\}\Big]\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}) \to 0,$$
as $n\to+\infty$.

SM A.1.4: COMBINATION OF THE LEMMAS
We conclude in two steps.
Step 1: uniform integrability.
Define $y = y_n(\mathbf{f}) = \frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2$. Thus
$$\varphi_{f^{(l)}_i(\mathbf{X},n)}(\mathbf{t}) = e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}\Big[\Big(\int e^{-y_n(\mathbf{f})}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^n\Big] = e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}[A_n],$$
where $A_n = \big(\int e^{-y_n(\mathbf{f})}\,p^{(l-1)}_n(\mathrm{d}\mathbf{f})\big)^n$. The sequence $(A_n)_{n\ge 1}$ is uniformly integrable because it is $L^s$-bounded for all $s \ge 1$. Indeed, since $0 < e^{-y_n(\mathbf{f})} \le 1$,
$$\mathbb{E}[A_n^s] \le \mathbb{E}\Big[\Big(\int p^{(l-1)}_n(\mathrm{d}\mathbf{f})\Big)^{ns}\Big] = \mathbb{E}[1] = 1.$$

Step 2: convergence in probability.
By the Lagrange (mean value) theorem, for $y > 0$ there exists $\theta \in (0,1)$ such that $e^{-y} = 1 - y + y(1 - e^{-y\theta})$. Then for every $n$ there exists a real value $\theta_n \in (0,1)$ such that the following equality holds:
$$A_n = \Big(1 - \frac{\sigma_\omega^2}{2n}\big[A'_n - A''_n\big]\Big)^n, \quad \text{where} \quad \begin{cases} A'_n = \int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}),\\[4pt] A''_n = \int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\Big[1 - \exp\Big\{-\theta_n\frac{\sigma_\omega^2}{2n}\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\Big\}\Big]\,p^{(l-1)}_n(\mathrm{d}\mathbf{f}).\end{cases}$$
Using the definition of the exponential function, i.e. $e^x = \lim_{n\to\infty}(1+\frac{x}{n})^n$, together with L2 and L3, we get
$$A_n \stackrel{p}{\to} \exp\Big\{-\frac{\sigma_\omega^2}{2}\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,q^{(l-1)}(\mathrm{d}\mathbf{f})\Big\}, \quad \text{as } n\to\infty.$$
Conclusion: since convergence in probability together with uniform integrability implies convergence in mean, by the two steps above we get
$$\varphi_{f^{(l)}_i(\mathbf{X},n)}(\mathbf{t}) = e^{-\frac{\sigma_b^2}{2}(\mathbf{t}^T\mathbf{1})^2}\,\mathbb{E}[A_n] \to \exp\Big\{-\frac{1}{2}\Big[\sigma_b^2(\mathbf{t}^T\mathbf{1})^2 + \sigma_\omega^2\int\big(\mathbf{t}^T(\phi\bullet\mathbf{f})\big)^2\,q^{(l-1)}(\mathrm{d}\mathbf{f})\Big]\Big\} = \exp\Big\{-\frac{1}{2}\mathbf{t}^T\Sigma(l)\mathbf{t}\Big\},$$
where $\Sigma(l)$ is the $k\times k$ matrix with elements
$$\Sigma(l)_{i,j} = \sigma_b^2 + \sigma_\omega^2\int\phi(f_i)\phi(f_j)\,q^{(l-1)}(\mathrm{d}\mathbf{f}),$$
with $q^{(l-1)} = N_k(\mathbf{0},\Sigma(l-1))$. Then the limiting distribution of $f^{(l)}_i(\mathbf{X},n)$ is a $k$-dimensional Gaussian with mean $\mathbf{0}$ and covariance matrix $\Sigma(l)$, i.e. as $n\to+\infty$, $f^{(l)}_i(\mathbf{X},n) \stackrel{d}{\to} N_k(\mathbf{0},\Sigma(l))$.
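The limiting covariance can be computed layer by layer from the recursion $\Sigma(l)_{i,j} = \sigma_b^2 + \sigma_\omega^2\int\phi(f_i)\phi(f_j)\,q^{(l-1)}(\mathrm{d}\mathbf{f})$ with $q^{(l-1)} = N_k(\mathbf{0},\Sigma(l-1))$; the Gaussian integral has no closed form for general $\phi$, so the sketch below (numpy assumed; the function name, parameter defaults, and jitter are hypothetical choices) approximates it by Monte Carlo:

```python
import numpy as np

def limiting_kernel(X, L, phi=np.tanh, sb2=1.0, sw2=1.0, n_mc=500_000, seed=0):
    """Approximate Sigma(L) for inputs stacked as rows of X.

    Sigma(1) = sb2 + sw2 * X X^T (exact); for l >= 2,
    Sigma(l)_{ij} = sb2 + sw2 * E[phi(f_i) phi(f_j)], f ~ N_k(0, Sigma(l-1)),
    with the Gaussian expectation estimated from n_mc samples.
    """
    rng = np.random.default_rng(seed)
    k = X.shape[0]
    S = sb2 + sw2 * X @ X.T
    for _ in range(L - 1):
        C = np.linalg.cholesky(S + 1e-9 * np.eye(k))   # jitter for stability
        f = rng.normal(size=(n_mc, k)) @ C.T           # samples from N_k(0, S)
        g = phi(f)
        S = sb2 + sw2 * (g.T @ g) / n_mc
    return S
```

For a bounded activation such as tanh, every entry of $\Sigma(l)$ with $l \ge 2$ lies in $[\sigma_b^2 - \sigma_\omega^2, \sigma_b^2 + \sigma_\omega^2]$, so the recursion stays numerically well behaved as $L$ grows.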
SM B.1

Fix $i \ge 1$, $l \ge 1$, $n \in \mathbb{N}$. We prove that there exists a random variable $H^{(l)}_i(n)$ such that
$$|f^{(l)}_i(x,n) - f^{(l)}_i(y,n)| \le H^{(l)}_i(n)\,\|x-y\|_{\mathbb{R}^I}, \quad x,y \in \mathbb{R}^I, \quad \mathbb{P}\text{-a.s.},$$
i.e. for $\mathbb{P}$-almost every $\xi \in \Omega$ the function $x \mapsto f^{(l)}_i(x,n)(\xi)$ is Lipschitz. We proceed by induction on the layers. Fix $x,y \in \mathbb{R}^I$. For the first layer, by (5) we get
$$|f^{(1)}_i(x,n)(\xi) - f^{(1)}_i(y,n)(\xi)| = \Big|\sum_{j=1}^I \omega^{(1)}_{i,j}(\xi)x_j + b^{(1)}_i(\xi) - \Big(\sum_{j=1}^I \omega^{(1)}_{i,j}(\xi)y_j + b^{(1)}_i(\xi)\Big)\Big| = \Big|\sum_{j=1}^I \omega^{(1)}_{i,j}(\xi)(x_j - y_j)\Big| \le \sum_{j=1}^I\big|\omega^{(1)}_{i,j}(\xi)\big|\,|x_j - y_j| \le \|x-y\|_{\mathbb{R}^I}\sum_{j=1}^I\big|\omega^{(1)}_{i,j}(\xi)\big|,$$
where we used that $|x_j - y_j| \le \|x-y\|_{\mathbb{R}^I}$. Set $H^{(1)}_i(n) = \sum_{j=1}^I|\omega^{(1)}_{i,j}|$. Suppose by induction hypothesis that for each $j \ge 1$ there exists a random variable $H^{(l-1)}_j(n)$ such that
$$|f^{(l-1)}_j(x,n)(\xi) - f^{(l-1)}_j(y,n)(\xi)| \le H^{(l-1)}_j(n)(\xi)\,\|x-y\|_{\mathbb{R}^I},$$
and let $L_\phi$ be the Lipschitz constant of $\phi$.
Then by (6) we get
$$|f^{(l)}_i(x,n)(\xi) - f^{(l)}_i(y,n)(\xi)| = \Big|\frac{1}{\sqrt{n}}\sum_{j=1}^n \omega^{(l)}_{i,j}(\xi)\,\phi(f^{(l-1)}_j(x,n)) - \frac{1}{\sqrt{n}}\sum_{j=1}^n \omega^{(l)}_{i,j}(\xi)\,\phi(f^{(l-1)}_j(y,n))\Big|$$
$$\le \frac{1}{\sqrt{n}}\sum_{j=1}^n\big|\omega^{(l)}_{i,j}(\xi)\big|\,\big|\phi(f^{(l-1)}_j(x,n)) - \phi(f^{(l-1)}_j(y,n))\big| \le \frac{L_\phi}{\sqrt{n}}\sum_{j=1}^n\big|\omega^{(l)}_{i,j}(\xi)\big|\,\big|f^{(l-1)}_j(x,n) - f^{(l-1)}_j(y,n)\big| \le \|x-y\|_{\mathbb{R}^I}\,\frac{L_\phi}{\sqrt{n}}\sum_{j=1}^n\big|\omega^{(l)}_{i,j}(\xi)\big|\,H^{(l-1)}_j(n)(\xi).$$
Set
$$H^{(l)}_i(n) = \frac{L_\phi}{\sqrt{n}}\sum_{j=1}^n\big|\omega^{(l)}_{i,j}\big|\,H^{(l-1)}_j(n).$$
Thus we proved that, for fixed $l \ge 1$ and $i \ge 1$, for each $n \in \mathbb{N}$,
$$\mathbb{P}\Big[\big\{\xi \in \Omega : |f^{(l)}_i(x,n)(\xi) - f^{(l)}_i(y,n)(\xi)| \le H^{(l)}_i(n)(\xi)\,\|x-y\|_{\mathbb{R}^I}\big\}\Big] = 1.$$
Thus each process $f^{(l)}_i(1), f^{(l)}_i(2), \dots$ is $\mathbb{P}$-a.s. Lipschitz, in particular $\mathbb{P}$-a.s. continuous, i.e. it belongs to $C(\mathbb{R}^I;\mathbb{R})$. In order to prove the continuity of $f^{(l)}_i$ we cannot just take the limit as $n\to+\infty$ of (9), because the quantity on the left converges to $|f^{(l)}_i(x) - f^{(l)}_i(y)|$ only in distribution and not $\mathbb{P}$-a.s.; instead we prove continuity by applying Proposition 2, as we will show in SM B.2.
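The pathwise bound just derived, with $H^{(1)}_i(n) = \sum_j|\omega^{(1)}_{i,j}|$ and $H^{(l)}_i(n) = \frac{L_\phi}{\sqrt{n}}\sum_j|\omega^{(l)}_{i,j}|H^{(l-1)}_j(n)$, holds deterministically for every realization of the weights, so it can be checked directly on a sampled network. A small sketch (numpy assumed; sizes are hypothetical, and $L_\phi = 1$ since tanh is 1-Lipschitz):

```python
import numpy as np

rng = np.random.default_rng(1)
I, n = 4, 50                     # input dimension and width
phi, L_phi = np.tanh, 1.0        # tanh is 1-Lipschitz

W1 = rng.normal(size=(n, I)); b1 = rng.normal(size=n)
W2 = rng.normal(size=(n, n)); b2 = rng.normal(size=n)

def network(x):
    """Two-layer network with re-scaled second-layer weights, as in (1)."""
    return W2 @ phi(W1 @ x + b1) / np.sqrt(n) + b2

# Lipschitz constants from the recursion in SM B.1
H1 = np.abs(W1).sum(axis=1)                    # H^(1)_j(n)
H2 = (L_phi / np.sqrt(n)) * np.abs(W2) @ H1    # H^(2)_i(n)

for _ in range(100):
    x, y = rng.normal(size=I), rng.normal(size=I)
    gap = np.abs(network(x) - network(y))
    assert np.all(gap <= H2 * np.linalg.norm(x - y) + 1e-9)
```

The assertion never fails for any draw of weights or inputs, reflecting that (9) is an almost-sure statement rather than a bound in expectation.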
SM B.2

Fix $i \ge 1$, $l \ge 2$. We show the continuity of the limiting process $f^{(l)}_i$ by applying Proposition 2. Take two inputs $x,y \in \mathbb{R}^I$. From (7) we know that $[f^{(l)}_i(x), f^{(l)}_i(y)] \sim N_2(\mathbf{0},\Sigma(l))$, where
$$\Sigma(1) = \sigma_b^2\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix} + \sigma_\omega^2\begin{bmatrix}\|x\|^2_{\mathbb{R}^I} & \langle x,y\rangle_{\mathbb{R}^I}\\ \langle x,y\rangle_{\mathbb{R}^I} & \|y\|^2_{\mathbb{R}^I}\end{bmatrix}, \qquad \Sigma(l) = \sigma_b^2\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix} + \sigma_\omega^2\int\begin{bmatrix}|\phi(u)|^2 & \phi(u)\phi(v)\\ \phi(u)\phi(v) & |\phi(v)|^2\end{bmatrix}q^{(l-1)}(\mathrm{d}u,\mathrm{d}v),$$
where $q^{(l-1)} = N_2(\mathbf{0},\Sigma(l-1))$. We want to find two values $\alpha > 0$ and $\beta > 0$, and a constant $H^{(l)} > 0$, such that
$$\mathbb{E}\big[|f^{(l)}_i(y) - f^{(l)}_i(x)|^\alpha\big] \le H^{(l)}\|y-x\|^{I+\beta}_{\mathbb{R}^I}.$$
Defining $\mathbf{a}^T = [1,-1]$ we have $f^{(l)}_i(y) - f^{(l)}_i(x) \sim N(\mathbf{a}^T\mathbf{0},\mathbf{a}^T\Sigma(l)\mathbf{a})$. Consider $\alpha = 2\theta$ with $\theta$ an integer. Thus $|f^{(l)}_i(y) - f^{(l)}_i(x)|^{2\theta} \sim \big|\sqrt{\mathbf{a}^T\Sigma(l)\mathbf{a}}\,N(0,1)\big|^{2\theta} \sim (\mathbf{a}^T\Sigma(l)\mathbf{a})^\theta\,|N(0,1)|^{2\theta}$. We proceed by induction over the layers. For $l = 1$,
$$\mathbb{E}\big[|f^{(1)}_i(y) - f^{(1)}_i(x)|^{2\theta}\big] = C_\theta(\mathbf{a}^T\Sigma(1)\mathbf{a})^\theta = C_\theta\big(\sigma_\omega^2\|y\|^2_{\mathbb{R}^I} - 2\sigma_\omega^2\langle y,x\rangle_{\mathbb{R}^I} + \sigma_\omega^2\|x\|^2_{\mathbb{R}^I}\big)^\theta = C_\theta(\sigma_\omega^2)^\theta\|y-x\|^{2\theta}_{\mathbb{R}^I},$$
where $C_\theta$ is the $\theta$-th moment of the chi-square distribution with one degree of freedom. By hypothesis $\phi$ is Lipschitz, and by induction hypothesis,
$$\int|u-v|^{2\theta}\,q^{(l-1)}(\mathrm{d}u,\mathrm{d}v) \le H^{(l-1)}\|y-x\|^{2\theta}_{\mathbb{R}^I}.$$
Then,
$$|f^{(l)}_i(y) - f^{(l)}_i(x)|^{2\theta} \sim |N(0,1)|^{2\theta}(\mathbf{a}^T\Sigma(l)\mathbf{a})^\theta = |N(0,1)|^{2\theta}\Big(\sigma_\omega^2\int\big[|\phi(u)|^2 - 2\phi(u)\phi(v) + |\phi(v)|^2\big]\,q^{(l-1)}(\mathrm{d}u,\mathrm{d}v)\Big)^\theta$$
$$= |N(0,1)|^{2\theta}\Big(\sigma_\omega^2\int|\phi(u)-\phi(v)|^2\,q^{(l-1)}(\mathrm{d}u,\mathrm{d}v)\Big)^\theta \le |N(0,1)|^{2\theta}(\sigma_\omega^2 L_\phi^2)^\theta\Big(\int|u-v|^2\,q^{(l-1)}(\mathrm{d}u,\mathrm{d}v)\Big)^\theta$$
$$\le |N(0,1)|^{2\theta}(\sigma_\omega^2 L_\phi^2)^\theta\int|u-v|^{2\theta}\,q^{(l-1)}(\mathrm{d}u,\mathrm{d}v) \le |N(0,1)|^{2\theta}(\sigma_\omega^2 L_\phi^2)^\theta H^{(l-1)}\|y-x\|^{2\theta}_{\mathbb{R}^I}.$$
Thus we conclude
$$\mathbb{E}\big[|f^{(l)}_i(y) - f^{(l)}_i(x)|^{2\theta}\big] \le H^{(l)}\|y-x\|^{2\theta}_{\mathbb{R}^I},$$
where the constant $H^{(l)}$ can be explicitly derived by solving the following system:
$$\begin{cases}H^{(1)} = C_\theta(\sigma_\omega^2)^\theta,\\ H^{(l)} = C_\theta(\sigma_\omega^2 L_\phi^2)^\theta H^{(l-1)}.\end{cases}$$
It is easy to get $H^{(l)} = C_\theta^l(\sigma_\omega^2)^{l\theta}(L_\phi^2)^{(l-1)\theta}$. Notice that this quantity does not depend on $i$. Therefore, by Proposition 2, setting $\alpha = 2\theta$ and $\beta = 2\theta - I$, for every $\theta > I/2$ ($\beta$ needs to be positive, so we take $\theta > I/2$) there exists a continuous version $f^{(l)(\theta)}_i$ of the process $f^{(l)}_i$ with $\mathbb{P}$-a.s. locally $\gamma$-Hölder paths for every $0 < \gamma < 1 - \frac{I}{2\theta}$.

• Thus $f^{(l)(\theta)}_i$ and $f^{(l)}_i$ are indistinguishable (same trajectories), i.e. there exists $\Omega^{(\theta)} \subset \Omega$ with $\mathbb{P}(\Omega^{(\theta)}) = 1$ such that for each $\omega \in \Omega^{(\theta)}$, $x \mapsto f^{(l)}_i(x)(\omega)$ is locally $\gamma$-Hölder for each $0 < \gamma < 1 - \frac{I}{2\theta}$.

• Define $\Omega^\star = \bigcap_{\theta > I/2}\Omega^{(\theta)}$; then for each $0 < \delta < 1$ there exists $\theta$ such that $\delta < 1 - \frac{I}{2\theta} < 1$, thus for each $\omega \in \Omega^\star \subset \Omega^{(\theta)}$ the trajectory $x \mapsto f^{(l)}_i(x)(\omega)$ is locally $\delta$-Hölder continuous.

By Proposition 2 we can conclude that $f^{(l)}_i$ has a continuous version, and the latter is $\mathbb{P}$-a.s. locally $\gamma$-Hölder continuous for every $0 < \gamma < 1$.

SM B.3
Fix $i \ge 1$, $l \ge 2$. We apply Proposition 3 to show the uniform tightness of the sequence $(f^{(l)}_i(n))_{n\ge 1}$ in $C(\mathbb{R}^I;\mathbb{R})$. By Lemma 2, $f^{(l)}_i(1), f^{(l)}_i(2), \dots$ are random elements in $C(\mathbb{R}^I;\mathbb{R})$. First we show that the sequence $(f^{(l)}_i(\mathbf{0}_{\mathbb{R}^I},n))_{n\ge 1}$ is uniformly tight in $\mathbb{R}$. We use the following statement from (Dudley, 2002, Theorem 11.5.3).

Proposition 4.
Let $(C,\rho)$ be a metric space and suppose $f(n) \stackrel{d}{\to} f$, where $f(n)$ is tight for all $n$. Then $(f(n))_{n\ge 1}$ is uniformly tight.

Since $(\mathbb{R},|\cdot|)$ is Polish, every probability measure on it is tight; then $f^{(l)}_i(\mathbf{0}_{\mathbb{R}^I},n)$ is tight in $\mathbb{R}$ for every $n$. Moreover, by Lemma 1, $f^{(l)}_i(\mathbf{0}_{\mathbb{R}^I},n) \stackrel{d}{\to} f^{(l)}_i(\mathbf{0}_{\mathbb{R}^I})$, so by Proposition 4 the sequence $(f^{(l)}_i(\mathbf{0}_{\mathbb{R}^I},n))_{n\ge 1}$ is uniformly tight in $\mathbb{R}$. In order to apply Proposition 3 it remains to show that there exist two values $\alpha > 0$ and $\beta > 0$, and a constant $H^{(l)} > 0$, such that
$$\mathbb{E}\big[|f^{(l)}_i(y,n) - f^{(l)}_i(x,n)|^\alpha\big] \le H^{(l)}\|y-x\|^{I+\beta}_{\mathbb{R}^I}, \quad x,y \in \mathbb{R}^I,\; n \in \mathbb{N},$$
uniformly in $n$. A first idea could be to try to bound (uniformly in $n$) the expected value of $H^{(l)}_i(n)$ obtained in (10), but this turns out to be very difficult. Thus we choose another way. Take two points $x,y \in \mathbb{R}^I$. From (8) we know that $f^{(l)}_i(y,n)\,|\,f^{(l-1)}_{1,\dots,n} \sim N(0,\sigma^2_y(l,n))$ and $f^{(l)}_i(x,n)\,|\,f^{(l-1)}_{1,\dots,n} \sim N(0,\sigma^2_x(l,n))$, with joint distribution $N_2(\mathbf{0},\Sigma(l,n))$, where
$$\Sigma(1) = \begin{bmatrix}\sigma^2_x(1) & \Sigma(1)_{x,y}\\ \Sigma(1)_{x,y} & \sigma^2_y(1)\end{bmatrix}, \qquad \Sigma(l,n) = \begin{bmatrix}\sigma^2_x(l,n) & \Sigma(l,n)_{x,y}\\ \Sigma(l,n)_{x,y} & \sigma^2_y(l,n)\end{bmatrix},$$
with
$$\sigma^2_x(1) = \sigma_b^2 + \sigma_\omega^2\|x\|^2_{\mathbb{R}^I}, \qquad \sigma^2_y(1) = \sigma_b^2 + \sigma_\omega^2\|y\|^2_{\mathbb{R}^I}, \qquad \Sigma(1)_{x,y} = \sigma_b^2 + \sigma_\omega^2\langle x,y\rangle_{\mathbb{R}^I},$$
$$\sigma^2_x(l,n) = \sigma_b^2 + \frac{\sigma_\omega^2}{n}\sum_{j=1}^n|\phi\circ f^{(l-1)}_j(x,n)|^2, \qquad \sigma^2_y(l,n) = \sigma_b^2 + \frac{\sigma_\omega^2}{n}\sum_{j=1}^n|\phi\circ f^{(l-1)}_j(y,n)|^2, \qquad \Sigma(l,n)_{x,y} = \sigma_b^2 + \frac{\sigma_\omega^2}{n}\sum_{j=1}^n\phi(f^{(l-1)}_j(x,n))\,\phi(f^{(l-1)}_j(y,n)).$$
Defining $\mathbf{a}^T = [1,-1]$, we have that $f^{(l)}_i(y,n) - f^{(l)}_i(x,n)$, conditionally on $f^{(l-1)}_{1,\dots,n}$, is distributed as $N(\mathbf{a}^T\mathbf{0},\mathbf{a}^T\Sigma(l,n)\mathbf{a})$, where $\mathbf{a}^T\Sigma(l,n)\mathbf{a} = \sigma^2_y(l,n) - 2\Sigma(l,n)_{x,y} + \sigma^2_x(l,n)$. Consider $\alpha = 2\theta$ with $\theta$ an integer.
Thus
$$\big|f^{(l)}_i(y,n) - f^{(l)}_i(x,n)\big|^{2\theta}\,\big|\,f^{(l-1)}_{1,\dots,n} \sim \big|\sqrt{\mathbf{a}^T\Sigma(l,n)\mathbf{a}}\,N(0,1)\big|^{2\theta} \sim (\mathbf{a}^T\Sigma(l,n)\mathbf{a})^\theta\,|N(0,1)|^{2\theta}.$$
Start first with the case $l = 1$:
$$\mathbb{E}\big[|f^{(1)}_i(y,n) - f^{(1)}_i(x,n)|^{2\theta}\big] = C_\theta(\mathbf{a}^T\Sigma(1)\mathbf{a})^\theta = C_\theta\big(\sigma_\omega^2\|y\|^2_{\mathbb{R}^I} - 2\sigma_\omega^2\langle y,x\rangle_{\mathbb{R}^I} + \sigma_\omega^2\|x\|^2_{\mathbb{R}^I}\big)^\theta = C_\theta(\sigma_\omega^2)^\theta\|y-x\|^{2\theta}_{\mathbb{R}^I},$$
where $C_\theta$ is the $\theta$-th moment of the chi-square distribution with one degree of freedom. Set $H^{(1)} = C_\theta(\sigma_\omega^2)^\theta$. By induction hypothesis, suppose that for every $j \ge 1$
$$\mathbb{E}\big[|f^{(l-1)}_j(y,n) - f^{(l-1)}_j(x,n)|^{2\theta}\big] \le H^{(l-1)}\|y-x\|^{2\theta}_{\mathbb{R}^I}.$$
By hypothesis $\phi$ is Lipschitz; then
$$\mathbb{E}\big[|f^{(l)}_i(y,n) - f^{(l)}_i(x,n)|^{2\theta}\,\big|\,f^{(l-1)}_{1,\dots,n}\big] = C_\theta(\mathbf{a}^T\Sigma(l,n)\mathbf{a})^\theta = C_\theta\Big(\frac{\sigma_\omega^2}{n}\sum_{j=1}^n\big|\phi\circ f^{(l-1)}_j(y,n) - \phi\circ f^{(l-1)}_j(x,n)\big|^2\Big)^\theta$$
$$\le C_\theta\Big(\frac{\sigma_\omega^2 L_\phi^2}{n}\sum_{j=1}^n\big|f^{(l-1)}_j(y,n) - f^{(l-1)}_j(x,n)\big|^2\Big)^\theta = \frac{C_\theta(\sigma_\omega^2 L_\phi^2)^\theta}{n^\theta}\Big(\sum_{j=1}^n\big|f^{(l-1)}_j(y,n) - f^{(l-1)}_j(x,n)\big|^2\Big)^\theta \le \frac{C_\theta(\sigma_\omega^2 L_\phi^2)^\theta}{n}\sum_{j=1}^n\big|f^{(l-1)}_j(y,n) - f^{(l-1)}_j(x,n)\big|^{2\theta}.$$
Using the induction hypothesis,
$$\mathbb{E}\big[|f^{(l)}_i(y,n) - f^{(l)}_i(x,n)|^{2\theta}\big] = \mathbb{E}\Big[\mathbb{E}\big[|f^{(l)}_i(y,n) - f^{(l)}_i(x,n)|^{2\theta}\,\big|\,f^{(l-1)}_{1,\dots,n}\big]\Big] \le \frac{C_\theta(\sigma_\omega^2 L_\phi^2)^\theta}{n}\sum_{j=1}^n\mathbb{E}\big[|f^{(l-1)}_j(y,n) - f^{(l-1)}_j(x,n)|^{2\theta}\big] \le C_\theta(\sigma_\omega^2 L_\phi^2)^\theta H^{(l-1)}\|y-x\|^{2\theta}_{\mathbb{R}^I}.$$
We can get the constant $H^{(l)}$ by solving the same system as in (12), obtaining $H^{(l)} = C_\theta^l(\sigma_\omega^2)^{l\theta}(L_\phi^2)^{(l-1)\theta}$, which does not depend on $n$. By Proposition 3, setting $\alpha = 2\theta$ and $\beta = 2\theta - I$ (since $\beta$ must be positive, it is sufficient to take $\theta > I/2$), this concludes the proof.

SM C
Fix k inputs X = [ x (1) , . . . , x ( k ) ] and a layer l . We show that as n → + ∞ (cid:16) f ( l ) i ( X , n ) (cid:17) i ≥ d → ∞ (cid:79) i =1 N k ( , Σ( l )) where (cid:78) denotes the product measure and with Σ( l ) as in (7). We prove this statement by provingthe n large asymptotic behaviour of any finite linear combination of the f ( l ) i ( X , n ) ’s, for i ∈ L ⊂ N .20ublished as a conference paper at ICLR 2021See, e.g. Billingsley (1999) for details. Following the notation of Matthews et al. (2018b), consider afinite linear combination of the function values without the bias, i.e., T ( l ) ( L , p, X , n ) = (cid:88) i ∈L p i [ f ( l ) i ( X , n ) − b ( l ) i ] . Then for the first layer we write T (1) ( L , p, X ) = (cid:88) i ∈L p i (cid:104) I (cid:88) j =1 ω (1) i,j x j (cid:105) = I (cid:88) j =1 γ (1) j ( L , p, X ) , where γ (1) j ( L , p, X ) = (cid:88) i ∈L p i ω ( l ) i,j x j . and for any l ≥ T ( l ) ( L , p, X , n ) = (cid:88) i ∈L p i (cid:104) √ n n (cid:88) j =1 ω ( l ) i,j ( φ • f ( l − j ( X , n )) (cid:105) = 1 √ n n (cid:88) j =1 γ ( l ) j ( L , p, X , n ) , where γ ( l ) j ( L , p, X , n ) = (cid:88) i ∈L p i ω ( l ) i,j ( φ • f ( l − j ( X , n )) . For the first layer we get ϕ T (1) ( L ,p, X ) ( t ) = E (cid:104) e i t T T (1) ( L ,p, X ) (cid:105) = E (cid:104) exp (cid:110) i t T (cid:104) I (cid:88) j =1 (cid:88) i ∈L p i ω (1) i,j x j (cid:105)(cid:111)(cid:105) = I (cid:89) j =1 (cid:89) i ∈L E (cid:104) exp (cid:110) i t T (cid:104) p i ω (1) i,j x j (cid:105)(cid:111)(cid:105) = I (cid:89) j =1 (cid:89) i ∈L exp (cid:110) − σ ω p i (cid:16) t T x j (cid:17) (cid:111) = exp (cid:110) − σ ω n (cid:88) i ∈L p i n (cid:88) j =1 (cid:16) t T x j (cid:17) (cid:111) = exp (cid:110) − t T Θ( L , p, t (cid:111) , i.e. T (1) ( L , p, X ) d = N k ( , Θ( L , p, , with k × k covariance matrix with element in the i -th row and j -th column as follows Θ i,j ( L , p,
$1)=p^{T}p\,\sigma^2_{\omega}\langle x^{(i)},x^{(j)}\rangle_{\mathbb{R}^I}$, where $p^{T}p=\sum_{i\in\mathcal{L}}p_i^2$. For $l\ge 2$ we get
\[
\varphi_{T^{(l)}(\mathcal{L},p,X,n)\mid f^{(l-1),\dots,n}}(t)
=\mathbb{E}\big[e^{\mathrm{i}t^{T}T^{(l)}(\mathcal{L},p,X,n)}\mid f^{(l-1),\dots,n}\big]
=\mathbb{E}\Big[\exp\Big\{\mathrm{i}t^{T}\frac{1}{\sqrt{n}}\sum_{j=1}^{n}\sum_{i\in\mathcal{L}}p_i\,\omega^{(l)}_{i,j}(\phi\circ f^{(l-1)}_j(X,n))\Big\}\,\Big|\,f^{(l-1),\dots,n}\Big]
=\prod_{j=1}^{n}\prod_{i\in\mathcal{L}}\mathbb{E}\Big[\exp\Big\{\mathrm{i}t^{T}\frac{1}{\sqrt{n}}p_i\,\omega^{(l)}_{i,j}(\phi\circ f^{(l-1)}_j(X,n))\Big\}\,\Big|\,f^{(l-1),\dots,n}\Big]
=\prod_{j=1}^{n}\prod_{i\in\mathcal{L}}\exp\Big\{-\frac{\sigma^2_{\omega}}{2n}p_i^2\big(t^{T}(\phi\circ f^{(l-1)}_j(X,n))\big)^2\Big\}
=\exp\Big\{-\frac{\sigma^2_{\omega}}{2n}\sum_{i\in\mathcal{L}}p_i^2\sum_{j=1}^{n}\big(t^{T}(\phi\circ f^{(l-1)}_j(X,n))\big)^2\Big\}
=\exp\Big\{-\frac{1}{2}t^{T}\Theta(\mathcal{L},p,l,n)t\Big\},
\]
i.e. $T^{(l)}(\mathcal{L},p,X,n)\mid f^{(l-1),\dots,n}\overset{d}{=}\mathcal{N}_k(0,\Theta(\mathcal{L},p,l,n))$, with $k\times k$ covariance matrix whose element in the $i$-th row and $j$-th column is
\[
\Theta_{i,j}(\mathcal{L},p,l,n)=p^{T}p\,\frac{\sigma^2_{\omega}}{n}\big\langle(\phi\circ F^{(l-1)}(x^{(i)},n)),(\phi\circ F^{(l-1)}(x^{(j)},n))\big\rangle_{\mathbb{R}^n},
\]
where $p^{T}p=\sum_{i\in\mathcal{L}}p_i^2$. Thus, along lines similar to the proof of the large-$n$ asymptotics for the $i$-th coordinate (just replacing $\sigma^2_b\leftarrow 0$ and $\sigma^2_{\omega}\leftarrow p^{T}p\,\sigma^2_{\omega}$), we have that for any $l\ge 2$, as $n\to+\infty$,
\[
\varphi_{T^{(l)}(\mathcal{L},p,X,n)}(t)\ \to\ \exp\Big\{-\frac{p^{T}p\,\sigma^2_{\omega}}{2}\int\big(t^{T}(\phi\circ f)\big)^2\,q^{(l-1)}(\mathrm{d}f)\Big\}
=\exp\Big\{-\frac{1}{2}t^{T}\Theta(\mathcal{L},p,l)t\Big\},
\]
i.e. $T^{(l)}(\mathcal{L},p,X,n)$ converges weakly to a $k$-dimensional Gaussian distribution with mean $0$ and $k\times k$ covariance matrix $\Theta(\mathcal{L},p,l)$ with elements
\[
\Theta_{i,j}(\mathcal{L},p,l)=p^{T}p\,\sigma^2_{\omega}\int\phi(f_i)\phi(f_j)\,q^{(l-1)}(\mathrm{d}f),
\]
where $q^{(l-1)}(\mathrm{d}f)=q^{(l-1)}(\mathrm{d}f_1,\dots,\mathrm{d}f_k)=\mathcal{N}_k(0,\Sigma(l-1))(\mathrm{d}f)$. To complete the proof just observe that $\Theta(\mathcal{L},p,l)=p^{T}p\,\Sigma(l)$.

SM D.1
We will use, without explicit mention, that the series $\sum_{i=1}^{\infty}q^i$ converges when $|q|<1$; in particular, when $q=1/2$ the series sums to $1$. Fix $l\ge 1$ and $n\in\mathbb{N}$. We prove that there exists a random variable $H^{(l)}(n)$ such that
\[
d\big(F^{(l)}(x,n),F^{(l)}(y,n)\big)_{\infty}\le H^{(l)}(n)\,\|x-y\|_{\mathbb{R}^I},\qquad x,y\in\mathbb{R}^I,\ \mathbb{P}\text{-a.s.}
\]
This follows immediately from the Lipschitzianity of each component: by (9) we get
\[
d\big(F^{(l)}(x,n),F^{(l)}(y,n)\big)_{\infty}
=\sum_{i=1}^{\infty}\frac{1}{2^i}\frac{|f^{(l)}_i(x,n)-f^{(l)}_i(y,n)|}{1+|f^{(l)}_i(x,n)-f^{(l)}_i(y,n)|}
\le\sum_{i=1}^{\infty}\frac{1}{2^i}\big|f^{(l)}_i(x,n)-f^{(l)}_i(y,n)\big|
\le\|x-y\|_{\mathbb{R}^I}\sum_{i=1}^{\infty}\frac{1}{2^i}H^{(l)}_i(n).
\]
It remains to prove that $\sum_{i=1}^{\infty}\frac{1}{2^i}H^{(l)}_i(n)$ converges almost surely. By (10) we get
\[
\sum_{i=1}^{\infty}\frac{1}{2^i}H^{(l)}_i(n)
=\sum_{i=1}^{\infty}\frac{1}{2^i}\frac{L_{\phi}}{\sqrt{n}}\sum_{j=1}^{n}\big|\omega^{(l)}_{i,j}\big|H^{(l-1)}_j(n)
=\frac{L_{\phi}}{\sqrt{n}}\sum_{j=1}^{n}H^{(l-1)}_j(n)\sum_{i=1}^{\infty}\frac{|\omega^{(l)}_{i,j}|}{2^i}.
\]
It remains to show the almost-sure convergence of the series $\sum_{i=1}^{\infty}|\omega^{(l)}_{i,j}|/2^i$. We apply the Kolmogorov three-series criterion (Kallenberg, 2002, Theorem 4.18). Set $X_i:=|\omega^{(l)}_{i,j}|/2^i$.
• By the Markov inequality, $\mathbb{P}(X_i>1)\le\mathbb{E}[X_i]=\mathbb{E}[|\mathcal{N}(0,\sigma^2_{\omega})|]/2^i$, thus $\sum_{i=1}^{\infty}\mathbb{P}(X_i>1)\le\mathbb{E}[|\mathcal{N}(0,\sigma^2_{\omega})|]<\infty$.
• Set $Y_i=X_i\mathbb{1}\{X_i\le 1\}\le X_i$. Then $\sum_{i=1}^{\infty}\mathbb{E}[Y_i]\le\sum_{i=1}^{\infty}\mathbb{E}[X_i]=\sum_{i=1}^{\infty}\mathbb{E}[|\mathcal{N}(0,\sigma^2_{\omega})|]/2^i=\mathbb{E}[|\mathcal{N}(0,\sigma^2_{\omega})|]<\infty$.
• $\mathbb{V}(Y_i)=\mathbb{E}[Y_i^2]-\mathbb{E}[Y_i]^2$, thus $\sum_{i=1}^{\infty}\mathbb{V}(Y_i)=\sum_{i=1}^{\infty}\mathbb{E}[Y_i^2]-\sum_{i=1}^{\infty}\mathbb{E}[Y_i]^2$.
The first series converges since $\mathbb{E}[Y_i^2]\le\mathbb{E}[X_i^2]=\sigma^2_{\omega}\mathbb{E}[\chi^2(1)]/4^i=\sigma^2_{\omega}/4^i$ (hence $\sum_{i=1}^{\infty}\mathbb{E}[Y_i^2]\le\sigma^2_{\omega}\sum_{i=1}^{\infty}4^{-i}<\infty$), and the other series converges since $0\le\mathbb{E}[Y_i]\le\mathbb{E}[X_i]$ implies $\mathbb{E}[Y_i]^2\le\mathbb{E}[X_i]^2=\mathbb{E}[|\mathcal{N}(0,\sigma^2_{\omega})|]^2/4^i$ (hence $\sum_{i=1}^{\infty}\mathbb{E}[Y_i]^2\le\mathbb{E}[|\mathcal{N}(0,\sigma^2_{\omega})|]^2\sum_{i=1}^{\infty}4^{-i}<\infty$). Denoting $Q^{(l)}_j=\sum_{i=1}^{\infty}|\omega^{(l)}_{i,j}|/2^i$ and setting $H^{(l)}(n):=\frac{L_{\phi}}{\sqrt{n}}\sum_{j=1}^{n}H^{(l-1)}_j(n)\,Q^{(l)}_j$, we complete the proof.

SM D.2
Fix $l\ge 1$. We show the continuity of the limiting process $F^{(l)}$ by applying Proposition 2. We will use, without explicit mention, that the function $r\mapsto\frac{r}{1+r}$ is bounded by $1$ for $r>0$. Take two inputs $x,y\in\mathbb{R}^I$ and fix an even integer $\alpha\ge 2$. Since $\sum_{i=1}^{\infty}\frac{1}{2^i}\frac{|f^{(l)}_i(x)-f^{(l)}_i(y)|}{1+|f^{(l)}_i(x)-f^{(l)}_i(y)|}<\sum_{i=1}^{\infty}\frac{1}{2^i}=1$ and, by Jensen's inequality applied to the probability weights $1/2^i$, also $\sum_{i=1}^{\infty}\frac{1}{2^i}\Big(\frac{|f^{(l)}_i(x)-f^{(l)}_i(y)|}{1+|f^{(l)}_i(x)-f^{(l)}_i(y)|}\Big)^{\alpha}<\sum_{i=1}^{\infty}\frac{1}{2^i}=1$, we get
\[
d\big(F^{(l)}(x),F^{(l)}(y)\big)^{\alpha}_{\infty}
=\Big(\sum_{i=1}^{\infty}\frac{1}{2^i}\frac{|f^{(l)}_i(x)-f^{(l)}_i(y)|}{1+|f^{(l)}_i(x)-f^{(l)}_i(y)|}\Big)^{\alpha}
\le\sum_{i=1}^{\infty}\frac{1}{2^i}\Big(\frac{|f^{(l)}_i(x)-f^{(l)}_i(y)|}{1+|f^{(l)}_i(x)-f^{(l)}_i(y)|}\Big)^{\alpha}
\le\sum_{i=1}^{\infty}\frac{1}{2^i}\big|f^{(l)}_i(x)-f^{(l)}_i(y)\big|^{\alpha}.
\]
Thus, applying the monotone convergence theorem to the positive increasing sequence $g(N)=\sum_{i=1}^{N}\frac{1}{2^i}|f^{(l)}_i(x)-f^{(l)}_i(y)|^{\alpha}$ (which allows us to exchange $\mathbb{E}$ and $\sum_{i=1}^{\infty}$), we get
\[
\mathbb{E}\Big[d\big(F^{(l)}(x),F^{(l)}(y)\big)^{\alpha}_{\infty}\Big]
\le\mathbb{E}\Big[\sum_{i=1}^{\infty}\frac{1}{2^i}|f^{(l)}_i(x)-f^{(l)}_i(y)|^{\alpha}\Big]
=\lim_{N\to\infty}\mathbb{E}\Big[\sum_{i=1}^{N}\frac{1}{2^i}|f^{(l)}_i(x)-f^{(l)}_i(y)|^{\alpha}\Big]
=\sum_{i=1}^{\infty}\frac{1}{2^i}\mathbb{E}\big[|f^{(l)}_i(x)-f^{(l)}_i(y)|^{\alpha}\big]
=\sum_{i=1}^{\infty}\frac{1}{2^i}H^{(l)}\|x-y\|^{\alpha}_{\mathbb{R}^I}
=H^{(l)}\|x-y\|^{\alpha}_{\mathbb{R}^I},
\]
where we used (11) and the fact that $H^{(l)}$ does not depend on $i$ (see (12)). Therefore, by Proposition 2, for each $\alpha>I$, setting $\beta=\alpha-I$ (since $\beta$ needs to be positive, it is sufficient to choose $\alpha>I$), $F^{(l)}$ has a continuous version $F^{(l)(\alpha)}$ and the latter is $\mathbb{P}$-a.s. locally $\gamma$-Hölder continuous for every $0<\gamma<1-\frac{I}{\alpha}$. In particular:
•
Thus $F^{(l)(\alpha)}$ and $F^{(l)}$ are indistinguishable (same trajectories), i.e. there exists $\Omega^{(\alpha)}\subset\Omega$ with $\mathbb{P}(\Omega^{(\alpha)})=1$ such that for each $\omega\in\Omega^{(\alpha)}$ the trajectory $x\mapsto F^{(l)}(x)(\omega)$ is locally $\gamma$-Hölder continuous for each $0<\gamma<1-\frac{I}{\alpha}$.
• Define $\Omega^{\star}=\bigcap_{\alpha>I}\Omega^{(\alpha)}$, the intersection being over even integers $\alpha$; then for each $0<\delta<1$ there exists $\alpha$ such that $\delta<1-\frac{I}{\alpha}<1$, and thus for each $\omega\in\Omega^{\star}\subset\Omega^{(\alpha)}$ the trajectory $x\mapsto F^{(l)}(x)(\omega)$ is locally $\delta$-Hölder continuous.
By Proposition 2 we can conclude that $F^{(l)}$ has a continuous version, and the latter is $\mathbb{P}$-a.s. locally $\gamma$-Hölder continuous for every $0<\gamma<1$.

SM E: GENERAL INTRODUCTION TO THE DANIELL–KOLMOGOROV EXTENSION THEOREM
Let $X$ be a set of indexes and $\{(E_x,\mathcal{E}_x)\}_{x\in X}$ measurable spaces. On $E:=\times_{x\in X}E_x$ we can consider the $\sigma$-algebra $\mathcal{E}:=\bigotimes_{x\in X}\mathcal{E}_x$, that is,
\[
\mathcal{E}=\sigma(\pi_x,\ x\in X)=\sigma\Big(\bigcup_{x\in X}\pi_x^{-1}(\mathcal{E}_x)\Big),
\]
where for each $x\in X$, $\pi_x:E\to E_x$, $\omega:=(\omega_x)_{x\in X}\mapsto\pi_x(\omega)=\omega_x$. $\mathcal{E}$ is generated by measurable rectangles: a measurable rectangle $A$ is of the form $A:=\times_{x\in X}A_x$ such that only a finite number of the $A_x\in\mathcal{E}_x$ are different from $E_x$.

$\sigma$-ALGEBRA ON THE SPACE OF FUNCTIONS
Fix $X=\mathbb{R}^I$ and $(S,d)$ a Polish space. We consider the measurable spaces $\{(E_x,\mathcal{E}_x)\}_{x\in X}=\{(S,\mathcal{B}(S))\}_{x\in\mathbb{R}^I}$; thus we can construct a measurable space
\[
(E,\mathcal{E})=\Big(\times_{x\in X}E_x,\ \bigotimes_{x\in X}\mathcal{E}_x\Big)=\Big(S^{\mathbb{R}^I},\mathcal{B}(S^{\mathbb{R}^I})\Big),
\]
where $S^{\mathbb{R}^I}=\times_{x\in\mathbb{R}^I}S$ is the set of all functions from $\mathbb{R}^I$ into $S$ and
\[
\mathcal{B}(S^{\mathbb{R}^I}):=\bigotimes_{x\in\mathbb{R}^I}\mathcal{B}(S)
=\sigma\Big(\bigcup_{x\in\mathbb{R}^I}\pi_x^{-1}(\mathcal{B}(S))\Big)
=\sigma\Big(\big\{A:=\times_{x\in X}A_x\ \text{such that only a finite number of the }A_x\ \text{are different from }S\big\}\Big).
\]
An example of a measurable rectangle is $A=S\times A_{x^{(1)}}\times S\times A_{x^{(2)}}\times S\times S\times\cdots\times A_{x^{(k)}}\times S\times S\times\cdots$, where $k\in\mathbb{N}$ and only for $x^{(1)},\dots,x^{(k)}$ are the Cartesian factors different from $S$. Denote by $Z=(Z_x)_{x\in\mathbb{R}^I}$, $Z_x:(\Omega,\mathcal{H},\mathbb{P})\to S$, any stochastic process of interest, such as $f^{(l)}_i(n)$ or $f^{(l)}_i$ for some $l\ge 1$, $i\ge 1$ and $n\ge 1$ when $S=(\mathbb{R},|\cdot|)$, or $F^{(l)}(n)$ or $F^{(l)}$ for $l\ge 1$ and $n\ge 1$ when $S=(\mathbb{R}^{\infty},\|\cdot\|_{\infty})$. Consider the finite-dimensional distributions of $Z$:
\[
\Lambda=\big\{\mathbb{P}^{Z}_{x^{(1)},\dots,x^{(k)}}\ \text{on}\ \mathcal{B}(S^k)\ \big|\ x^{(j)}\in\mathbb{R}^I,\ j\in\{1,\dots,k\},\ k\in\mathbb{N}\big\}.
\]
If $\Lambda$ is consistent in the sense of the Kolmogorov theorem, then there exists a unique probability measure $\mathbb{P}'$ on $(S^{\mathbb{R}^I},\mathcal{B}(S^{\mathbb{R}^I}))$ such that the canonical process $Z'=(Z'_x)_{x\in\mathbb{R}^I}$, $Z'_x:S^{\mathbb{R}^I}\to S$, $\omega\mapsto Z'_x(\omega)=\omega(x)$, on $(S^{\mathbb{R}^I},\mathcal{B}(S^{\mathbb{R}^I}),\mathbb{P}')$ has finite-dimensional distributions that coincide with $\Lambda$.

SM E.1: EXISTENCE OF A PROBABILITY MEASURE ON $S^{\mathbb{R}^I}$ FOR THE SEQUENCE PROCESSES
Fix $S=\mathbb{R}$. Fix a layer $l$, a unit $i\ge 1$ on that layer and $n\in\mathbb{N}$. We want to prove that there exists a probability measure $\mathbb{P}^{(i,l,n)}$ on $(\mathbb{R}^{\mathbb{R}^I},\mathcal{B}(\mathbb{R}^{\mathbb{R}^I}))$ such that the associated canonical process $\Theta^{(i,l,n)}_x:\mathbb{R}^{\mathbb{R}^I}\to\mathbb{R}$, $\omega\mapsto\omega(x)$, has finite-dimensional distributions that coincide with $\Lambda^{(i,l,n)}=\{\mathbb{P}^{(i,l,n)}_{x^{(1)},\dots,x^{(k)}}\}_{k\in\mathbb{N}}$, where $\mathbb{P}^{(i,l,n)}_{x^{(1)},\dots,x^{(k)}}$ is the distribution of $f^{(l)}_i(X,n)$. We do not know the exact form of this distribution, but we know the distribution of the conditioned random variable $f^{(l)}_i(X,n)\mid f^{(l-1),\dots,n}$ (see (8)). Thus, since from (8) the distribution of $f^{(1)}_i(X)$ is well known, proceeding by induction it is sufficient to prove the existence of two probability measures $\mathbb{P}^{(i,1,n)}$ and $\mathbb{P}^{(i,l,n)\mid l-1}$ on $(\mathbb{R}^{\mathbb{R}^I},\mathcal{B}(\mathbb{R}^{\mathbb{R}^I}))$ such that the associated canonical processes $\Theta^{(i,1,n)}_x$ and $\Theta^{(i,l,n)\mid l-1}_x$ have finite-dimensional distributions that coincide respectively with $\Lambda^{(i,1,n)}:=\{\mathbb{P}^{(i,1,n)}_{x^{(1)},\dots,x^{(k)}}\}_{k\in\mathbb{N}}$ and $\Lambda^{(i,l,n)\mid l-1}:=\{\mathbb{P}^{(i,l,n)\mid l-1}_{x^{(1)},\dots,x^{(k)}}\}_{k\in\mathbb{N}}$, where $\mathbb{P}^{(i,1,n)}_{x^{(1)},\dots,x^{(k)}}=\mathcal{N}_k(0,\Sigma(1,X))$ and $\mathbb{P}^{(i,l,n)\mid l-1}_{x^{(1)},\dots,x^{(k)}}=\mathcal{N}_k(0,\Sigma(l,n,X))$ are defined on $\mathcal{B}(\mathbb{R}^k)$. Observe that, for simplicity of notation, we have so far avoided writing the dependence of the covariance matrix on the input matrix $X$, but in this case it is important to emphasize it. For the proof we defer to the limit case in the next subsection, since the proof is the same step by step.
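The Kolmogorov consistency driving these existence arguments reduces, for Gaussian finite-dimensional distributions, to a linear-algebra fact: deleting the $z$-th row and column of the covariance matrix yields exactly the covariance matrix built from the reduced collection of inputs. A minimal numerical sketch follows; the covariance form `sigma1` below, $\sigma^2_b+\sigma^2_{\omega}\langle x^{(r)},x^{(s)}\rangle$, is an illustrative stand-in for $\Sigma(1,X)$, since the exact constants in (7) are immaterial for the consistency check:

```python
import numpy as np

def sigma1(Xs, sb2=1.0, sw2=1.0):
    # Illustrative first-layer covariance over a list of inputs:
    # Sigma[r, s] = sb2 + sw2 * <x^(r), x^(s)>  (each entry depends only
    # on the pair of inputs, which is what makes consistency automatic).
    return np.array([[sb2 + sw2 * float(np.dot(xr, xs)) for xs in Xs] for xr in Xs])

rng = np.random.default_rng(0)
k, I, z = 5, 3, 2
Xs = [rng.standard_normal(I) for _ in range(k)]

S_full = sigma1(Xs)
# Marginalizing a Gaussian = deleting the z-th row and column of its covariance.
S_marg = np.delete(np.delete(S_full, z, axis=0), z, axis=1)
# Consistency: this equals the covariance built directly from the reduced
# input collection (x^(1), ..., x^(z-1), x^(z+1), ..., x^(k)).
S_red = sigma1(Xs[:z] + Xs[z + 1:])
assert np.allclose(S_marg, S_red)
```

This is the matrix counterpart of integrating out one coordinate of $\mathcal{N}_k(0,\Sigma)$: the remaining coordinates are again Gaussian, indexed by the remaining inputs.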
When $S=\mathbb{R}^{\infty}$, recall that, given a sequence of probability spaces $\{(\mathbb{R}^{\mathbb{R}^I},\mathcal{B}(\mathbb{R}^{\mathbb{R}^I}),\mathbb{P}^{(i,l,n)})\}_{i\ge 1}$, there exists a unique probability measure $\mathbb{P}^{(l,n)}$ on $(\times_{i=1}^{\infty}\mathbb{R}^{\mathbb{R}^I},\bigotimes_{i=1}^{\infty}\mathcal{B}(\mathbb{R}^{\mathbb{R}^I}))=\big((\mathbb{R}^{\infty})^{\mathbb{R}^I},\mathcal{B}((\mathbb{R}^{\infty})^{\mathbb{R}^I})\big)$ such that, for each measurable rectangle $A=\times_{i=1}^{\infty}A_i$ where only a finite number of the $A_i$ are different from $\mathbb{R}^{\mathbb{R}^I}$, we have $\mathbb{P}^{(l,n)}(A)=\prod_{i=1}^{\infty}\mathbb{P}^{(i,l,n)}(A_i)$. This probability measure is denoted by $\mathbb{P}^{(l,n)}=:\bigotimes_{i=1}^{\infty}\mathbb{P}^{(i,l,n)}$. This means that the existence of the stochastic processes $f^{(l)}_i(n)$ implies the existence of the stochastic process $F^{(l)}(n)$.

SM E.2: EXISTENCE OF A PROBABILITY MEASURE ON $S^{\mathbb{R}^I}$ FOR THE LIMIT PROCESS
Note that, as observed in the previous section, the existence of the stochastic processes $f^{(l)}_i$ on $(\mathbb{R}^{\mathbb{R}^I},\mathcal{B}(\mathbb{R}^{\mathbb{R}^I}))$ implies the existence of the stochastic process $F^{(l)}$ on $\big((\mathbb{R}^{\infty})^{\mathbb{R}^I},\mathcal{B}((\mathbb{R}^{\infty})^{\mathbb{R}^I})\big)$. We therefore focus on the proof when $S=\mathbb{R}$. Fix a layer $l$ and a unit $i\ge 1$ on that layer. We want to prove that there exists a probability measure $\mathbb{P}^{(i,l)}$ on $(\mathbb{R}^{\mathbb{R}^I},\mathcal{B}(\mathbb{R}^{\mathbb{R}^I}))$ such that the canonical process $\Theta^{(i,l)}_x:\mathbb{R}^{\mathbb{R}^I}\to\mathbb{R}$, $\omega\mapsto\omega(x)$, has finite-dimensional distributions that coincide with $\Lambda^{(i,l)}=\{\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}\}_{k\in\mathbb{N}}$, where the $\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}$ are the finite-dimensional distributions of $f^{(l)}_i$ determined in (7), i.e. $\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}=\mathcal{N}_k(0,\Sigma(l,X))$ defined on $\mathcal{B}(\mathbb{R}^k)$. By the Daniell–Kolmogorov existence result (Kallenberg, 2002, Theorem 6.16) it is sufficient to prove that for each $k\in\mathbb{N}$ and all $x^{(1)},\dots,x^{(k)}\in\mathbb{R}^I$,
\[
\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(z)},\dots,x^{(k)}}\big(B^{(1)}\times\cdots\times B^{(z-1)}\times\mathbb{R}\times B^{(z+1)}\times\cdots\times B^{(k)}\big)
=\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(z-1)},x^{(z+1)},\dots,x^{(k)}}\big(B^{(1)}\times\cdots\times B^{(z-1)}\times B^{(z+1)}\times\cdots\times B^{(k)}\big),\tag{18}
\]
for every $z\in\{1,\dots,k\}$ and every $B^{(j)}\in\mathcal{B}(\mathbb{R})$, $j=1,\dots,k$, $j\ne z$. Fix $k\in\mathbb{N}$, $k$ inputs $x^{(1)},\dots,x^{(k)}$, $z\in\{1,\dots,k\}$ and $B^{(j)}\in\mathcal{B}(\mathbb{R})$ for all $j\ne z$. Define the projection $\pi_{[z]}:\mathbb{R}^k\to\mathbb{R}^{k-1}$ such that $\pi_{[z]}(y_1,\dots,y_k)=[y_1,\dots,y_{z-1},y_{z+1},\dots,y_k]^{T}$. Thus, condition (18) is equivalent to the following: $\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}\circ\pi_{[z]}^{-1}=\mathbb{P}^{(i,l)}_{\pi_{[z]}(x^{(1)},\dots,x^{(k)})}$, where on the left we have the image measure of $\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}$ under $\pi_{[z]}$. We prove this by showing that the respective Fourier transforms coincide. In the following calculations we define $y=[y_1,\dots,y_k]^{T}$, $y_{[z]}=[y_1,\dots,y_{z-1},y_{z+1},\dots$
$,y_k]^{T}$ and $t=[t_1,\dots,t_k]^{T}$, $t_{[z]}=[t_1,\dots,t_{z-1},t_{z+1},\dots,t_k]^{T}$; then by definition of the image measure we get
\[
\varphi_{\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}\circ\pi_{[z]}^{-1}}(t_{[z]})
=\int_{\mathbb{R}^{k-1}}e^{\mathrm{i}t_{[z]}^{T}y_{[z]}}\big(\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}\circ\pi_{[z]}^{-1}\big)(\mathrm{d}y_{[z]})
=\int_{\mathbb{R}^{k}}e^{\mathrm{i}t_{[z]}^{T}\pi_{[z]}(y)}\,\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}(\mathrm{d}y).
\]
Now, recalling that $e_j$ is the $k\times 1$ vector with $1$ in the $j$-th position and $0$ otherwise, since $\pi_{[z]}(y)=y_{[z]}$, defining $\pi^{\star}_{[z]}(t)=\sum_{j=1,\,j\ne z}^{k}e_j t_j$ we get $t_{[z]}^{T}\pi_{[z]}(y)=y^{T}\pi^{\star}_{[z]}(t)$. Then
\[
\varphi_{\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}\circ\pi_{[z]}^{-1}}(t_{[z]})
=\int_{\mathbb{R}^{k}}e^{\mathrm{i}y^{T}\pi^{\star}_{[z]}(t)}\,\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}(\mathrm{d}y)
=\varphi_{\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}}\big(\pi^{\star}_{[z]}(t)\big)
=\varphi_{\mathcal{N}_k(0,\Sigma(l,X))}\big(\pi^{\star}_{[z]}(t)\big)
=\exp\big\{-\tfrac{1}{2}\pi^{\star}_{[z]}(t)^{T}\Sigma(l,X)\pi^{\star}_{[z]}(t)\big\}
=\exp\big\{-\tfrac{1}{2}t_{[z]}^{T}\widehat{\Sigma}(l,X)t_{[z]}\big\},
\]
where $\widehat{\Sigma}(l,X)$ is the matrix $\Sigma(l,X)$ without the $z$-th row and the $z$-th column. But since $\widehat{\Sigma}(l,X)=\Sigma(l,\pi_{[z]}(x^{(1)},\dots,x^{(k)}))$, we get $\varphi_{\mathbb{P}^{(i,l)}_{x^{(1)},\dots,x^{(k)}}\circ\pi_{[z]}^{-1}}(t_{[z]})=\varphi_{\mathbb{P}^{(i,l)}_{\pi_{[z]}(x^{(1)},\dots,x^{(k)})}}(t_{[z]})$ for each $t_{[z]}$, and thus the two Fourier transforms coincide, as we wanted to prove.

SM E.3: EXISTENCE OF A PROBABILITY MEASURE ON $C(\mathbb{R}^I;\mathbb{R})$

If $Z$ is, in addition, a continuous stochastic process, then we will show that there exists a probability measure $\mathbb{P}_Z$ on $C(\mathbb{R}^I;\mathbb{R})\subset\mathbb{R}^{\mathbb{R}^I}$, endowed with a $\sigma$-algebra $\mathcal{G}\subset\mathcal{B}(\mathbb{R}^{\mathbb{R}^I})$, such that the finite-dimensional distributions of $Z'$ and $Z$ coincide.
As suggested by Kallenberg (2002) (page 311), we consider $C(\mathbb{R}^I;\mathbb{R})$ with the topology of uniform convergence on compacts, metrized by
\[
\rho(\omega_1,\omega_2)=\sum_{R=1}^{\infty}\frac{1}{2^R}\sup_{x\in B_R(0)}\xi\big(|\omega_1(x)-\omega_2(x)|\big),\qquad \omega_1,\omega_2\in C(\mathbb{R}^I;\mathbb{R}),\tag{19}
\]
where $\xi(r)=\frac{r}{1+r}$ (as in the definition of $d(\cdot,\cdot)_{\infty}$) and $B_R(0)$ denotes the ball of radius $R$ centred at the origin. The Borel $\sigma$-field $\mathcal{G}:=\mathcal{B}(C(\mathbb{R}^I;\mathbb{R}),\rho)$ is generated by the evaluation maps $\pi_x$; thus it coincides with the product $\sigma$-field, i.e. $\mathcal{G}=\sigma(\Gamma)$, where
\[
\Gamma=\big\{\Gamma_{x^{(1)},\dots,x^{(k)}}(A)\ \big|\ A=A_{x^{(1)}}\times\cdots\times A_{x^{(k)}},\ A_{x^{(j)}}\in\mathcal{B}(\mathbb{R}),\ x^{(j)}\in\mathbb{R}^I,\ j\in\{1,\dots,k\},\ k\in\mathbb{N}\big\}
\]
and $\Gamma_{x^{(1)},\dots,x^{(k)}}(A)=\{\omega\in C(\mathbb{R}^I;\mathbb{R})\ |\ \omega(x^{(1)})\in A_{x^{(1)}},\dots,\omega(x^{(k)})\in A_{x^{(k)}}\}$. Note that since $\sigma(\Gamma)\subset\mathcal{B}(\mathbb{R}^{\mathbb{R}^I})$, then $\mathcal{G}=\sigma(\Gamma)\subset\mathcal{B}(\mathbb{R}^{\mathbb{R}^I})$.

Theorem 3.
There exists a unique probability measure $\mathbb{P}_Z$ on $(C(\mathbb{R}^I;\mathbb{R}),\mathcal{G})$ such that the canonical process $Z'$ restricted to $(C(\mathbb{R}^I;\mathbb{R}),\mathcal{G})$ has finite-dimensional distributions that coincide with those of $Z$.

For the existence of $\mathbb{P}_Z$ consider the following

Lemma 6.
Let $(Z_x)_{x\in\mathbb{R}^I}$ be an $\mathbb{R}$-valued continuous stochastic process defined on $(\Omega,\mathcal{H},\mathbb{P})$. Then
\[
\mathcal{Z}:\Omega\to C(\mathbb{R}^I;\mathbb{R}),\qquad\omega\mapsto\mathcal{Z}(\omega)=(Z_x(\omega))_{x\in\mathbb{R}^I},
\]
is a random variable, i.e. measurable from $(\Omega,\mathcal{H})$ into $(C(\mathbb{R}^I;\mathbb{R}),\mathcal{G})$.

Proof. By the previous proposition $\mathcal{G}=\sigma(\Gamma)$; then, taking $\mathcal{O}\in\sigma(\Gamma)$ of the form $\mathcal{O}=\Gamma_{x^{(1)},\dots,x^{(k)}}(A)$ for some $k\in\mathbb{N}$, $\{x^{(1)},\dots,x^{(k)}\}\subset\mathbb{R}^I$ and $A=A_{x^{(1)}}\times\cdots\times A_{x^{(k)}}$, $A_{x^{(j)}}\in\mathcal{B}(\mathbb{R})$, we get
\[
\{\omega\in\Omega\ |\ \mathcal{Z}(\omega)\in\mathcal{O}\}
=\{\omega\in\Omega\ |\ Z_{x^{(1)}}(\omega)\in A_{x^{(1)}},\dots,Z_{x^{(k)}}(\omega)\in A_{x^{(k)}}\}
=\bigcap_{j=1}^{k}\{Z_{x^{(j)}}\in A_{x^{(j)}}\}\in\mathcal{H},
\]
where we used that the $Z_{x^{(j)}}$ are random variables from $(\Omega,\mathcal{H})$ into $(\mathbb{R},\mathcal{B}(\mathbb{R}))$.

Then we can define a probability measure $\mathbb{P}_Z$ on $(C(\mathbb{R}^I;\mathbb{R}),\mathcal{G})$ as the image measure of $\mathcal{Z}$ under $\mathbb{P}$, that is, $\forall\mathcal{O}\in\mathcal{G}$, $\mathbb{P}_Z(\mathcal{O})=\mathbb{P}(\mathcal{Z}\in\mathcal{O})$. Now we prove that the finite-dimensional distributions of $Z'$ coincide with those of $Z$. It is sufficient to prove the following

Lemma 7. $\mathbb{P}_Z$ coincides with the image measure of the canonical process $Z'$ under $\mathbb{P}'$ restricted to $(C(\mathbb{R}^I;\mathbb{R}),\mathcal{G})$.

Proof. Fix
$\mathcal{O}\in\mathcal{G}=\sigma(\Gamma)$, $\mathcal{O}=\Gamma_{x^{(1)},\dots,x^{(k)}}(A)$ for some $k\in\mathbb{N}$, $\{x^{(1)},\dots,x^{(k)}\}\subset\mathbb{R}^I$ and $A=A_{x^{(1)}}\times\cdots\times A_{x^{(k)}}$, $A_{x^{(j)}}\in\mathcal{B}(\mathbb{R})$. By definition of $\mathbb{P}_Z$,
\[
\mathbb{P}_Z(\mathcal{O})=\mathbb{P}(\mathcal{Z}\in\Gamma_{x^{(1)},\dots,x^{(k)}}(A))
=\mathbb{P}(\{\omega\in\Omega\ |\ \mathcal{Z}(\omega)\in\mathcal{O}\})
=\mathbb{P}(\{\omega\in\Omega\ |\ Z_{x^{(1)}}(\omega)\in A_{x^{(1)}},\dots,Z_{x^{(k)}}(\omega)\in A_{x^{(k)}}\})
=\mathbb{P}^{Z}_{x^{(1)},\dots,x^{(k)}}(A).
\]
By the Daniell–Kolmogorov extension theorem the finite-dimensional distributions of $Z$ coincide with those of the canonical process $Z'$ under $\mathbb{P}'$, hence $\mathbb{P}^{Z}_{x^{(1)},\dots,x^{(k)}}(A)=\mathbb{P}'(Z'\in\mathcal{O})$. The uniqueness of $\mathbb{P}_Z$ follows from the uniqueness of $\mathbb{P}'$.

SM E.4: $\sigma(\times_{i=1}^{\infty}C(\mathbb{R}^I;\mathbb{R}))\subset\sigma(C(\mathbb{R}^I;\mathbb{R}^{\infty}))$

First, note that $\times_{i=1}^{\infty}C(\mathbb{R}^I;\mathbb{R})\simeq C(\mathbb{R}^I;\mathbb{R}^{\infty})$: indeed, the map
\[
\Xi:C(\mathbb{R}^I;\mathbb{R}^{\infty})\to\times_{i=1}^{\infty}C(\mathbb{R}^I;\mathbb{R}),\qquad\omega\mapsto(\omega_1,\omega_2,\dots),
\]
is an isomorphism, because it is linear and bijective: $\omega$ is $\|\cdot\|_{\infty}$-continuous if and only if each component $\omega_i$ is $|\cdot|$-continuous. This means that each element of one space can be seen as an element of the other and vice versa, but different topologies are defined on these spaces. We now prove that the $\sigma$-algebra generated by the product topology on $\times_{i=1}^{\infty}C(\mathbb{R}^I;\mathbb{R})$ is contained in the $\sigma$-algebra generated by the topology of uniform convergence on compact sets on $C(\mathbb{R}^I;\mathbb{R}^{\infty})$.
For each $f,g\in C(\mathbb{R}^I;\mathbb{R}^{\infty})$ we have the following distances:
\[
\rho_{\mathrm{prod}}(f,g)=\sum_{i=1}^{\infty}\frac{1}{2^i}\,\xi\Big(\sum_{R=1}^{\infty}\frac{1}{2^R}\sup_{x\in B_R(0)}\xi\big(|f_i(x)-g_i(x)|\big)\Big)\quad\text{on }\times_{i=1}^{\infty}C(\mathbb{R}^I;\mathbb{R}),
\]
\[
\rho_{\mathrm{unif}}(f,g)=\sum_{R=1}^{\infty}\frac{1}{2^R}\sup_{x\in B_R(0)}\xi\Big(\sum_{i=1}^{\infty}\frac{1}{2^i}\,\xi\big(|f_i(x)-g_i(x)|\big)\Big)\quad\text{on }C(\mathbb{R}^I;\mathbb{R}^{\infty}).\tag{20}
\]
Using that $\xi$ is increasing and continuous and that $\sup_x\big(\sum_i h_i(x)\big)\le\sum_i\sup_x h_i(x)$, it can be proved that there exists a constant $C>0$ such that $\rho_{\mathrm{unif}}(f,g)\le C\,\rho_{\mathrm{prod}}(f,g)$. This means that if $h\in B^{\mathrm{prod}}_{\epsilon}(f)=\{g:\rho_{\mathrm{prod}}(f,g)<\epsilon\}$ then $h\in B^{\mathrm{unif}}_{C\epsilon}(f)=\{g:\rho_{\mathrm{unif}}(f,g)<C\epsilon\}$, that is, $B^{\mathrm{prod}}_{\epsilon}(f)\subset B^{\mathrm{unif}}_{C\epsilon}(f)$, which implies $\sigma(\rho_{\mathrm{prod}})\subset\sigma(\rho_{\mathrm{unif}})$. In particular, each compact set with respect to $\rho_{\mathrm{prod}}$ is compact with respect to $\rho_{\mathrm{unif}}$: indeed, considering a $\rho_{\mathrm{prod}}$-compact $K$, for every sequence $(k_i)\subset K$ there exist a subsequence $(k_{i_j})\subset K$ and $k\in K$ such that $\rho_{\mathrm{prod}}(k_{i_j},k)\to 0$; moreover $\rho_{\mathrm{unif}}(k_{i_j},k)\le C\,\rho_{\mathrm{prod}}(k_{i_j},k)\to 0$, i.e. $K$ is compact with respect to $\rho_{\mathrm{unif}}$.

SM F
In this section we prove Proposition 1.
Proof.