Noisy Recurrent Neural Networks
Soon Hoe Lim
Nordita, KTH Royal Institute of Technology and Stockholm University
[email protected]
N. Benjamin Erichson
ICSI and Department of Statistics, UC Berkeley
[email protected]
Liam Hodgkinson
ICSI and Department of Statistics, UC Berkeley
[email protected]
Michael W. Mahoney
ICSI and Department of Statistics, UC Berkeley
[email protected]
Abstract
We provide a general framework for studying recurrent neural networks (RNNs) trained by injecting noise into hidden states. Specifically, we consider RNNs that can be viewed as discretizations of stochastic differential equations driven by input data. This framework allows us to study the implicit regularization effect of general noise injection schemes by deriving an approximate explicit regularizer in the small noise regime. We find that, under reasonable assumptions, this implicit regularization promotes flatter minima; it biases towards models with more stable dynamics; and, in classification tasks, it favors models with larger classification margin. Sufficient conditions for global stability are obtained, highlighting the phenomenon of stochastic stabilization, where noise injection can improve stability during training. Our theory is supported by empirical results which demonstrate improved robustness with respect to various input perturbations, while maintaining state-of-the-art performance.
Introduction

Viewing recurrent neural networks (RNNs) as discretizations of ordinary differential equations (ODEs) driven by input data has recently gained attention [10, 35, 19, 63]. The “formulate in continuous time, and then discretize” approach [50, 78] motivates novel architecture designs before experimentation, and it provides a useful interpretation as a dynamical system. This, in turn, has led to gains in reliability and robustness to data perturbations.

Recent efforts have shown how adding noise can also improve stability during training, and consequently improve robustness [46]. In this work, we consider discretizations of the corresponding stochastic differential equations (SDEs) obtained from ODE formulations of RNNs through the addition of a diffusion term. We refer to these as
Noisy RNNs (NRNNs). By dropping these random elements at inference time, NRNNs become a stochastic learning strategy which, as we shall prove, has a number of important benefits. In particular, stochastic learning strategies (including dropout) are often used as natural regularizers, favoring solutions in regions of the loss landscape with desirable properties (often improved generalization and/or robustness). This mechanism is commonly referred to as implicit regularization [52, 51, 66], differing from explicit regularization where the loss is explicitly modified. For neural network models, implicit regularization towards wider minima is conjectured to be a prominent ingredient in the success of stochastic optimization [84, 37]. Indeed, implicit regularization has been linked to increases in classification margins [61, 49], which can lead to improved generalization performance [67]. A common approach to identify and study implicit regularization is to approximate the implicit regularization by an appropriate explicit regularizer [1, 9, 26]. Doing so, we will see that NRNNs favor wide minima (like SGD); more stable dynamics; and classifiers with a large classification margin, keeping generalization error small.

SDEs have also seen recent appearances in neural SDEs [74, 30], stochastic generalizations of neural ODEs [12]. These can be seen as an analogue of NRNNs for non-sequential data, with a similar relationship to NRNNs as feedforward neural networks have to RNNs. They have been shown to be robust in practice [46]. Analogously, we show that the NRNN framework leads to more reliable and robust RNN classifiers, whose promise is demonstrated by experiments on benchmark data sets.
Contributions. For the class of NRNNs (formulated first as a continuous-time model, which is then discretized):
• we identify the form of the implicit regularization for NRNNs through a corresponding (data-dependent) explicit regularizer in the small noise regime (see Theorem 1);
• we focus on its effect in classification tasks, providing bounds for the classification margin and generalization error for the deterministic RNN classifiers (see Theorem 2);
• we show that noise injection can also lead to improved stability (see Theorem 3) via a Lyapunov stability analysis of continuous-time NRNNs; and
• we demonstrate via empirical results on benchmark data sets that NRNN classifiers are more robust to data perturbations, when compared to other recurrent models, while retaining state-of-the-art performance for clean data.
Research code is shared via https://github.com/erichson/NoisyRNN.
Overview. This paper is organized as follows. In Section 2, we discuss some related work. In Section 3, we introduce the NRNNs. In Section 4, we investigate the effect of noise injections in NRNNs through the lens of implicit regularization. In Section 5, we study the effect in classification tasks. In Section 6, we provide sufficient conditions for noise injections to stabilize the NRNNs. In Section 7, we provide an empirical analysis on benchmark data sets to support our theory. In Section 8, we conclude the paper and provide future directions. We defer the relevant technical details and proofs to the Supplementary Material (SM).
Notation. We use ‖v‖ := ‖v‖_2 to denote the Euclidean norm of the vector v, and ‖A‖ and ‖A‖_F to denote the spectral norm and Frobenius norm of the matrix A, respectively. The i-th element of a vector v is denoted by v_i or [v]_i, and the (i, j)-entry of a matrix A by A_ij or [A]_ij. For a vector v = (v_1, . . . , v_d), diag(v) denotes the diagonalization of v with diag(v)_ii = v_i. I denotes the identity matrix (with dimension clear from context), while superscript T denotes transposition. For a matrix M, M_sym = (M + M^T)/2 denotes its symmetric part, λ_min(M) and λ_max(M) denote its minimum and maximum eigenvalue respectively, and σ_max(M) denotes its maximum singular value. For a function f : R^n → R^m such that each of its first-order partial derivatives (with respect to x) exists, ∂f/∂x ∈ R^{m×n} is the Jacobian matrix of f. For a scalar-valued function g : R^n → R, ∇_h g is the gradient of g with respect to the variable h ∈ R^n, and H_h g is the Hessian of g with respect to h. C(I; J) denotes the space of continuous J-valued functions defined on I.

Related Work
Dynamical Systems and Machine Learning. There are various interesting connections between machine learning and dynamical systems. Formulating machine learning in the framework of continuous-time dynamical systems was recently popularized by [77]. Subsequent efforts focus on constructing learning models by approximating continuous-time dynamical systems [12, 39, 62] and studying them using tools from numerical analysis [47, 81, 86, 85]. On the other hand, dynamical systems theory provides useful theoretical tools for analyzing neural networks (NNs), including RNNs [75, 18, 45, 10, 19], and useful principles for designing NNs [28, 70]. Other examples of dynamical systems inspired models include the learning of invariant quantities via their Hamiltonian or Lagrangian representations [48, 27, 13, 88, 73]. Another class of models is inspired by Koopman theory, yielding models where the evolution operator is linear [72, 57, 20, 59, 44, 6, 5, 16].
Stochastic Training and Regularization Strategies.
Regularization techniques such as noise injection and dropout can help to prevent overfitting in neural networks. Following the classical work [7] that studies the regularizing effects of noise injection on data, several works study the effects of noise injection into different parts of networks for various architectures [31, 58, 46, 70, 4, 34, 83, 76]. In particular, [9] recently studied the regularizing effect of isotropic Gaussian noise injection into the layers of feedforward networks. For RNNs, [15] shows that noise additions on the hidden states outperform Bernoulli dropout in terms of performance and bias. Some specific formulations of RNNs as SDEs were also considered in Chapter 10 of [55]. Implicit regularization has also been studied more generally [52, 51, 25, 14, 66].
Noisy Recurrent Neural Networks
We formulate continuous-time recurrent neural networks (CT-RNNs) at full generality as a system of input-driven ODEs: for a terminal time T > 0 and an input signal x = (x_t)_{t∈[0,T]} ∈ C([0, T]; R^{d_x}), the output y_t ∈ R^{d_y}, for t ∈ [0, T], is a linear map of hidden states h_t ∈ R^{d_h} satisfying

dh_t = f(h_t, x_t) dt,   y_t = V h_t,   (1)

where V ∈ R^{d_y×d_h}, and f : R^{d_h} × R^{d_x} → R^{d_h} is typically Lipschitz continuous, guaranteeing existence and uniqueness of solutions to (1).

A natural stochastic variant of CT-RNNs arises by replacing the ODE in (1) by an Itô SDE, that is,

dh_t = f(h_t, x_t) dt + σ(h_t, x_t) dB_t,   y_t = V h_t,   (2)

where σ : R^{d_h} × R^{d_x} → R^{d_h×r} and (B_t)_{t≥0} is an r-dimensional Brownian motion. The functions f and σ are referred to as the drift and diffusion coefficients, respectively. Intuitively, (2) amounts to a noisy perturbation of the corresponding deterministic CT-RNN (1). At full generality, we refer to the system (2) as a continuous-time noisy RNN (CT-NRNN). To guarantee the existence of a unique solution to (2), in the sequel, we assume that {f(·, x_t)}_{t∈[0,T]} and {σ(·, x_t)}_{t∈[0,T]} are uniformly Lipschitz continuous, and t ↦ f(h, x_t), t ↦ σ(h, x_t) are bounded in t ∈ [0, T] for each fixed h ∈ R^{d_h}. For further details, see SM, Section B.

While much of our theoretical analysis will focus on this general formulation of CT-NRNNs, our empirical and stability analyses focus on the choice of drift function

f(h, x) = A h + a(W h + U x + b),   (3)

where a : R → R is a Lipschitz continuous scalar activation function extended to act on vectors pointwise, A, W ∈ R^{d_h×d_h}, U ∈ R^{d_h×d_x} and b ∈ R^{d_h}. Typical examples of activation functions include a(x) = tanh(x). The matrices A, W, U, V, b are all assumed to be trainable parameters. This particular choice of drift dates back to the early Cohen-Grossberg formulation of CT-RNNs, and it was recently reconsidered in [19].
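As a concrete illustration, the drift (3) can be written as a small PyTorch module. This is a minimal sketch rather than the released implementation; the initialization scale, the tanh activation, and the batched tensor shapes below are our own choices.

```python
import torch
import torch.nn as nn

class LipschitzDrift(nn.Module):
    """Drift f(h, x) = A h + a(W h + U x + b) from Eq. (3), with a = tanh."""

    def __init__(self, d_h: int, d_x: int):
        super().__init__()
        self.A = nn.Parameter(0.01 * torch.randn(d_h, d_h))
        self.W = nn.Parameter(0.01 * torch.randn(d_h, d_h))
        self.U = nn.Parameter(0.01 * torch.randn(d_h, d_x))
        self.b = nn.Parameter(torch.zeros(d_h))

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # h has shape (batch, d_h) and x has shape (batch, d_x).
        return h @ self.A.T + torch.tanh(h @ self.W.T + x @ self.U.T + self.b)
```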
While precise choices of drift functions f are the subject of existing deterministic RNN theory, good choices of the diffusion coefficient σ are less clear. Here, we shall consider a parametric class of diffusion coefficients given by:

σ(h, x) ≡ ε(σ_1 I + σ_2 diag(f(h, x))),   (4)

where the noise level ε > 0 is small, and σ_1 ≥ 0 and σ_2 ≥ 0 are tunable parameters describing the relative strength of additive noise and a multiplicative noise respectively.

While the stochastic component is an important part of the model, one can set ε ≡ 0 at inference time. In doing so, noise injections in NRNNs may be viewed as a learning strategy. A similar stance is considered in [46] for treating neural SDEs. From this point of view, we may relate noise injections generally to regularization mechanisms considered in previous works. For example, additive noise injection was studied in the context of feedforward NNs in [9], in which case a Gaussian noise is injected to the activation function at each layer of the NN. Furthermore, multiplicative noise injections include stochastic depth and dropout strategies as special cases [47, 46]. By taking a Gaussian approximation to Bernoulli noise and taking a continuous-time limit, NNs with stochastic dropout can be weakly approximated by an SDE with appropriate multiplicative noise; see [47]. All of these works highlight various advantages of noise injection for training NNs.

As in the deterministic case, exact simulation of the SDE in (2) is infeasible in practice, and so one must specify a numerical integration scheme. We will focus on the explicit Euler-Maruyama (E-M) integrators, which are the stochastic analogues of Euler-type integration schemes for ODEs.

Let 0 =: t_0 < t_1 < · · · < t_M := T be a partition of the interval [0, T]. Denote δ_m := t_{m+1} − t_m for each m = 0, 1, . . . , M − 1, and δ := (δ_m). The E-M scheme provides a family (parametrized by δ) of approximations to the solution of the SDE in (2):

h^δ_{m+1} = h^δ_m + f(h^δ_m, x̂_m) δ_m + σ(h^δ_m, x̂_m) √δ_m ξ_m,   (5)

for m = 0, 1, . . . , M − 1, where (x̂_m)_{m=0,...,M−1} is the given sequential data, the ξ_m ∼ N(0, I) are independent r-dimensional standard normal random vectors, and h^δ_0 = h_0. As Δ := max_m δ_m → 0, the family of approximations (h^δ_m) converges strongly to the Itô process (h_t) satisfying (2) (at rate O(√Δ) when the step sizes are uniform; see Theorem 10.2.2 in [40]). See Section C in SM for details on the general case.

It is worth mentioning that while higher-order integrators are also possible to consider, the presence of Itô white noise poses a significant challenge over the standard ODE case. Generally speaking, implementations of higher-order schemes require additional computational effort which may outweigh the benefit of using them. For instance, in implicit E-M schemes, the zero of a nonlinear equation has to be determined in each time step [32]. In Milstein and stochastic Runge-Kutta schemes, there is an extra computational cost in simulating the Lévy area [53]. Similar challenges arise for other multistep schemes and higher-order schemes.
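The following sketch shows one E-M update (5) with the diffusion coefficient (4), reusing the LipschitzDrift module above. The step size delta, noise level eps, and the weights sigma1 and sigma2 are hypothetical settings; setting eps = 0 (or training=False) recovers the deterministic update used at inference time.

```python
import torch

def nrnn_step(drift, h, x, delta=0.1, eps=0.05, sigma1=1.0, sigma2=1.0, training=True):
    """One explicit Euler-Maruyama step of Eq. (5) with diffusion (4)."""
    f = drift(h, x)                              # drift f(h, x), shape (batch, d_h)
    h_next = h + delta * f                       # deterministic Euler part
    if training and eps > 0:
        xi = torch.randn_like(h)                 # xi_m ~ N(0, I)
        # sigma(h, x) = eps * (sigma1 * I + sigma2 * diag(f)); applied to xi this
        # is eps * (sigma1 * xi + sigma2 * f * xi), elementwise.
        h_next = h_next + (delta ** 0.5) * eps * (sigma1 * xi + sigma2 * f * xi)
    return h_next

def nrnn_forward(drift, V, x_seq, h0, **kw):
    """Iterate the update over a sequence (x_hat_m) and return the output y_M = V h_M."""
    h = h0
    for x in x_seq:                              # x_seq: iterable of (batch, d_x) inputs
        h = nrnn_step(drift, h, x, **kw)
    return h @ V.T
```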
Implicit Regularization

To highlight the advantages of NRNNs over their deterministic counterpart, we show that, under reasonable assumptions, NRNNs exhibit a natural form of implicit regularization. By this, we mean regularization imposed implicitly by the stochastic learning strategy, without explicitly modifying the loss, but that, e.g., may promote flatter minima. Our goal is achieved by deriving an appropriate explicit regularizer through a perturbation analysis in the small noise regime. This becomes useful when considering NRNNs as a learning strategy, since we can precisely determine the effect of the noise injection as a regularization mechanism.

The study of discrete-time NRNNs is of practical interest and is our focus here. Nevertheless, analogous results for continuous-time NRNNs are also valuable for exploring other discretization schemes. For this reason, we also study the continuous-time case in Section D in SM. Our analysis covers general NRNNs, not necessarily those with the drift term (3) and diffusion term (4), that satisfy the following assumption, which is typically reasonable in practice.
Assumption A. The drift f and diffusion coefficient σ of the SDE in (2) satisfy the following:
(i) for all t ∈ [0, T] and x ∈ R^{d_x}, h ↦ f(h, x) and h ↦ σ_ij(h, x) have Lipschitz continuous partial derivatives in each coordinate up to order three (inclusive);
(ii) for any h ∈ R^{d_h}, t ↦ f(h, x_t) and t ↦ σ(h, x_t) are bounded and Borel measurable on [0, T].

We consider a rescaling of the noise σ ↦ εσ in (2), where ε > 0 is assumed to be a small parameter, in line with our noise injection strategies in Subsection 3.1.

In the sequel, we let h̄^δ_m denote the hidden states of the corresponding deterministic RNN model, satisfying

h̄^δ_{m+1} = h̄^δ_m + δ_m f(h̄^δ_m, x̂_m),   m = 0, 1, . . . , M − 1,   (6)

with h̄^δ_0 = h_0. Let Δ := max_{m∈{0,...,M−1}} δ_m, and denote the state-to-state Jacobians by

Ĵ_m = I + δ_m (∂f/∂h)(h̄^δ_m, x̂_m).   (7)

For m, k = 0, . . . , M − 1, also let

Φ̂_{m,k} = Ĵ_m Ĵ_{m−1} · · · Ĵ_k,   (8)

where the empty product is assumed to be the identity. Note that the Φ̂_{m,k} are products of the state-to-state Jacobian matrices, important for analyzing signal propagation in RNNs [11]. For the sake of brevity, we denote f_m = f(h̄^δ_m, x̂_m) and σ_m = σ(h̄^δ_m, x̂_m) for m = 0, 1, . . . , M.

The following result, which is our first main result, relates the loss function, averaged over realizations of the injected noise, used for training the NRNN to that for training the deterministic RNN in the small noise regime.

Theorem 1 (Implicit regularization induced by noise injection). Under Assumption A,

E ℓ(h^δ_M) = ℓ(h̄^δ_M) + ε² [Q̂(h̄^δ) + R̂(h̄^δ)] + O(ε⁴),   (9)

as ε → 0, where the terms Q̂ and R̂ are given by

Q̂(h̄^δ) = ∇ℓ(h̄^δ_M)^T Σ_{k=1}^{M} δ_{k−1} Φ̂_{M−1,k} Σ_{m=1}^{M−1} δ_{m−1} v_m,   (10)

R̂(h̄^δ) = (1/2) Σ_{m=1}^{M} δ_{m−1} tr(σ^T_{m−1} Φ̂^T_{M−1,m} H_{h̄^δ_M}ℓ Φ̂_{M−1,m} σ_{m−1}),   (11)

with v_m a vector with p-th component (p = 1, . . . , d_h):

[v_m]_p = tr(σ^T_{m−1} Φ̂^T_{M−1,m} H_{h̄^δ_M}[f_M]_p Φ̂_{M−1,m} σ_{m−1}).   (12)

Moreover,

|Q̂(h̄^δ)| ≤ C_Q Δ²,   |R̂(h̄^δ)| ≤ C_R Δ,   (13)

for C_Q, C_R > 0 independent of Δ.

If the loss is convex, then R̂ is non-negative, but Q̂ need not be. However, Q̂ can be made negligible relative to R̂ provided that Δ is taken sufficiently small. This also ensures that the E-M approximations are accurate. Theorem 1 implies that the injection of noise into the hidden states of the deterministic RNN is, on average, approximately equivalent to a regularized objective functional. Moreover, the explicit regularizer is solely determined by the discrete-time flow generated by the Jacobians ∂f_m/∂h̄(h̄^δ_m), the diffusion coefficients σ_m, and the Hessian of the loss function, all evaluated along the dynamics of the deterministic RNN.

The Hessian of the loss function commonly appears in implicit regularization analyses, and it suggests a preference towards wider minima in the loss landscape. Commonly considered a positive attribute [37], this, in turn, suggests a degree of robustness in the loss to perturbations in the hidden states [82]. More interesting, however, is the appearance of the Jacobians, which is indicative of a preference towards slower, more stable dynamics. Both of these attributes suggest NRNNs could exhibit a strong tendency towards models which are less sensitive to input perturbations. In the next section, we will take a look at the effect on generalization performance.
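To make the objects appearing in Theorem 1 concrete, the sketch below assembles the per-step Jacobians Ĵ_m of (7) with automatic differentiation and accumulates the products Φ̂_{M−1,m} of (8) along the deterministic trajectory (6). It is a toy, unbatched illustration with dense Jacobians, not an efficient or official implementation.

```python
import torch
from torch.autograd.functional import jacobian

def jacobian_products(drift, x_seq, h0, delta=0.1):
    """Return [Phi_{M-1,m} for m = 0..M-1] along the deterministic trajectory (6)."""
    d_h = h0.numel()
    I = torch.eye(d_h)
    h = h0.clone()
    Js = []
    for x in x_seq:                                   # x: (d_x,) input, unbatched for clarity
        J = jacobian(lambda z: drift(z.unsqueeze(0), x.unsqueeze(0)).squeeze(0), h)
        Js.append(I + delta * J)                      # J_m = I + delta_m * df/dh, Eq. (7)
        h = h + delta * drift(h.unsqueeze(0), x.unsqueeze(0)).squeeze(0)   # Eq. (6)
    Phis = []
    Phi = I.clone()
    for J in reversed(Js):                            # Phi_{M-1,m} = J_{M-1} ... J_m, Eq. (8)
        Phi = Phi @ J
        Phis.append(Phi.clone())
    return list(reversed(Phis))                       # Phis[m] = Phi_{M-1,m}
```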
We remark that Theorem 1 resembles a discrete-time analogue of Theorem 6 in SM for CT-RNNs, except that, unlike the corresponding term Q appearing there, Q̂ has no explicit dependence on the noise coefficient. Therefore, as is common in stochastic calculus, a direct discretization of the result in Theorem 6 would not provide the correct explicit regularizer for discrete-time NRNNs.

Assuming that Q̂ is negligible and ignoring higher-order terms, we can interpret training with the NRNN as an approximation of the following optimal control problem [77], with the running cost Ĉ_{m−1} := tr(σ^T_{m−1} Φ̂^T_{M−1,m} H_{h̄^δ_M}ℓ Φ̂_{M−1,m} σ_{m−1}):

min E_{(x,y)∼μ} [ ℓ(h̄^δ_M) + (ε²/2) Σ_{m=1}^{M} δ_{m−1} Ĉ_{m−1} ]   (14)
s.t. h̄^δ_{m+1} = h̄^δ_m + δ_m f(h̄^δ_m, x̂_m), m = 0, 1, . . . , M − 1, and h̄^δ_0 = h_0,

where (x := (x̂_m)_{m=0,1,...,M−1}, y) denotes a training example drawn from the distribution μ and the minimization is with respect to the parameters (controls) in the corresponding deterministic RNN. On the other hand, we can interpret training with the deterministic RNN as the above optimal control problem, but with zero running cost or regularization. Note that if the Hessian matrix is positive semi-definite, then the Ĉ_m, m = 0, 1, . . . , M − 1, are quadratic forms with the associated metric tensor M̂^T_m M̂_m := H_{h̄^δ_m}ℓ and

Ĉ_m = ⟨Φ̂_{M−1,m+1} σ_m, Φ̂_{M−1,m+1} σ_m⟩_{M̂_m} = ‖M̂_m Φ̂_{M−1,m+1} σ_m‖²_F ≤ ‖σ_m‖²_F ‖M̂_m‖²_F ‖Φ̂_{M−1,m+1}‖²_F.

Overall, we can see that the use of NRNNs as a regularization mechanism reduces the state-to-state Jacobians and Hessian of the loss function according to the noise level ε.
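Because the regularizer is only implicitly present, its size can also be estimated directly by Monte Carlo: average the loss over independent noise draws and subtract the deterministic loss. The sketch below assumes the nrnn_forward helper sketched in the discretization discussion above and a generic loss_fn; it illustrates the interpretation in this section and is not part of the paper's training procedure.

```python
import torch

def implicit_regularization_gap(drift, V, x_seq, h0, y, loss_fn,
                                eps=0.05, n_samples=256, **kw):
    """Monte Carlo estimate of E[loss(noisy)] - loss(deterministic), cf. Eq. (9)."""
    with torch.no_grad():
        # Deterministic forward pass (eps = 0 recovers the underlying RNN).
        y_det = nrnn_forward(drift, V, x_seq, h0, eps=0.0, training=False, **kw)
        det_loss = loss_fn(y_det, y)
        # Average the loss over independent realizations of the injected noise.
        noisy_losses = []
        for _ in range(n_samples):
            y_noisy = nrnn_forward(drift, V, x_seq, h0, eps=eps, training=True, **kw)
            noisy_losses.append(loss_fn(y_noisy, y))
        noisy_loss = torch.stack(noisy_losses).mean()
    # For small eps, Theorem 1 predicts this gap is approximately eps^2 [Q_hat + R_hat].
    return (noisy_loss - det_loss).item()
```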
Implications in Classification Tasks

Our focus now turns to an investigation of the benefits of NRNNs over their deterministic counterparts for classification tasks. From Theorem 1, it is clear that adding noise to a deterministic RNN implicitly regularizes the state-to-state Jacobians. Here, we show that doing so also enhances an implicit tendency towards classifiers with large classification margin, improving generalization with respect to the deterministic model. Our analysis here covers general deterministic RNNs, although we also apply our results to obtain explicit expressions for Lipschitz RNNs.

Let S_N denote a set of training samples, s_n := (x_n, y_n), for n = 1, . . . , N, where each input sequence x_n = (x_{n,0}, x_{n,1}, . . . , x_{n,M−1}) ∈ X ⊂ R^{d_x M} has a corresponding class label y_n ∈ Y = {1, . . . , d_y}. Following the statistical learning framework, these samples are assumed to be independently drawn from an underlying probability distribution μ on the sample space S = X × Y. An RNN-based classifier g^δ(x) is constructed in the usual way by taking

g^δ(x) = argmax_{i=1,...,d_y} p_i(V h̄^δ_M[x]),   (15)

where p_i(x) = e^{x_i} / Σ_j e^{x_j} is the softmax function. Letting ℓ denote the cross-entropy loss, such a classifier is trained from S_N by minimizing the empirical risk (training error)

R_N(g^δ) := (1/N) Σ_{n=1}^{N} ℓ(g^δ(x_n), y_n)   (16)

as a proxy for the true (population) risk (test error) R(g^δ) = E_{(x,y)∼μ} ℓ(g^δ(x), y), with (x, y) ∈ S. The measure used to quantify the prediction quality is the generalization error (or estimation error), which is the difference between the empirical risk of the classifier on the training set and the true risk:

GE(g^δ) := |R(g^δ) − R_N(g^δ)|.   (17)

The classifier is a function of the output of the deterministic RNN, which is an Euler discretization of the ODE (1) with step sizes δ = (δ_m). In particular, for the Lipschitz RNN,

Φ̂_{m,k} = Ĵ_m Ĵ_{m−1} · · · Ĵ_k,   (18)

where Ĵ_l = I + δ_l(A + D_l W), with D^{ij}_l = a′([W h̄^δ_l + U x̂_l + b]_i) e_ij.

In the following, we let conv(X) denote the convex hull of X. We let x̂^m := (x̂_0, . . . , x̂_m) so that x̂ = x̂^{M−1}, and use the notation f[x] to indicate the dependence of the function f on the vector x. Our result will depend on two characterizations of a training sample s_i = (x_i, y_i).

Definition 1 (Classification margin). The classification margin of a training sample s_i = (x_i, y_i) measured by the Euclidean metric d is defined as the radius of the largest d-metric ball in X centered at x_i that is contained in the decision region associated with the class label y_i, i.e., it is:

γ_d(s_i) = sup{a : d(x_i, x) ≤ a ⇒ g^δ(x) = y_i ∀x}.

Intuitively, a larger classification margin allows a classifier to associate a larger region centered on a point x_i in the input space to the same class. This makes the classifier less sensitive to input perturbations, and a perturbation of x_i is still likely to fall within this region, keeping the classifier prediction. In this sense, the classifier becomes more robust.

In our case, the networks are trained by a loss (cross-entropy) that promotes separation of different classes in the network output. This, in turn, maximizes a certain notion of score of each training sample.
Definition 2 (Score). For a training sample s_i = (x_i, y_i), we define its score as

o(s_i) = min_{j≠y_i} √2 (e_{y_i} − e_j)^T S^δ[x_i] ≥ 0,

where e_i ∈ R^{d_y} is the Kronecker delta vector with e_{ii} = 1 and e_{ji} = 0 for i ≠ j, and S^δ[x_i] := p(V h̄^δ_M[x_i]) with h̄^δ_M[x_i] denoting the hidden state of the RNN, driven by the input sequence x_i, at terminal index M.

Recall that the classifier is g^δ(x) = argmax_{i∈{1,...,d_y}} [S^δ]_i[x] and the decision boundary between class i and class j in the feature space is given by the hyperplane {z = S^δ : z_i = z_j}. A positive score implies that at the network output, classes are separated by a margin that corresponds to the score. However, a large score may not imply a large classification margin.

The following result, which is our second main result, follows the approach of [68] (which exploits the algorithmic robustness framework in [80]) to provide bounds for the classification margin and generalization error for the deterministic RNN classifiers g^δ (see Section E in SM for background and proof). In particular, it shows that the generalization error is expected to diminish with the spectral norm of the state-to-state Jacobians.

Theorem 2 (Classification margin and generalization bound for the deterministic RNN). Suppose that Assumption A holds. Assume that the score o(s_i) > 0 and

γ(s_i) := o(s_i) / ( C Σ_{m=0}^{M−1} δ_m sup_{x̂∈conv(X)} ‖Φ̂_{M−1,m+1}[x̂]‖ ) > 0,   (19)

where C = ‖V‖ ( max_{m=0,1,...,M−1} ‖∂f(h̄^δ_m, x̂_m)/∂x̂_m‖ ) > 0 is independent of s_i (in particular, C = ‖V‖ [max_{m=0,...,M−1} ‖D_m U‖] for Lipschitz RNNs), the Φ̂_{m,k} are defined in (18) and the δ_m are the step sizes. Then, the classification margin for the training sample s_i is

γ_d(s_i) ≥ γ(s_i).   (20)

Moreover, if we further assume that X is a (subset of) a k-dimensional manifold with k ≤ d_x M, γ := min_{s_i∈S_N} γ(s_i) > 0, and ℓ(g^δ(x), y) ≤ L_g for all s ∈ S, then for any δ′ > 0, with probability at least 1 − δ′,

GE(g^δ) ≤ L_g ( γ^{−k/2} √( d_y C_M^k 2^{k+1} log 2 / N ) + √( 2 log(1/δ′) / N ) ),   (21)

where C_M > 0 is a constant that measures the complexity of X, N is the number of training examples and d_y is the number of label classes.

The above theorem bounds the generalization error in terms of the classification margin, which is independent of the network size, but which takes into account the structure of the data (via its covering number). All dependence on network architecture, in particular the step sizes δ, is captured in γ.

Recalling from Section 4 that, up to O(ε⁴) and under the assumption that Q̂ vanishes, the loss minimized by the NRNN classifier is, on average, ℓ(h̄^δ_M) + ε² R̂(h̄^δ), as ε → 0, where the regularizer is

R̂(h̄^δ) = (1/2) Σ_{m=1}^{M} δ_{m−1} ‖M̂_M Φ̂_{M−1,m} σ_{m−1}‖²_F,   (22)

where M̂^T_M M̂_M := H_{h̄^δ_M}ℓ is the Cholesky decomposition of the Hessian matrix of the convex cross-entropy loss. The appearance of the state-to-state Jacobians in Φ̂_{m,k} in both the regularizer (22) and the lower bound (19) suggests that noise injection implicitly aids generalization performance. More precisely, in the small noise regime and on average, NRNNs promote classifiers with large classification margin, an attribute linked to both improved robustness and generalization [80]. In this sense, training with NRNN classifiers is a stochastic strategy to improve generalization over deterministic RNN classifiers, particularly in learning tasks where the given data is corrupted (c.f. the caveats pointed out in [69]).
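The score of Definition 2 is easy to read off from the softmax output of the deterministic classifier. The sketch below transcribes the definition directly; s_delta stands for the softmax vector S^δ[x_i] and y for the integer class label.

```python
import math
import torch

def score(s_delta: torch.Tensor, y: int) -> float:
    """o(s_i) = min_{j != y} sqrt(2) * (e_y - e_j)^T S_delta[x_i]  (Definition 2)."""
    diffs = s_delta[y] - s_delta                  # (e_y - e_j)^T S_delta for every j
    diffs[y] = float("inf")                       # exclude j = y from the minimum
    return math.sqrt(2.0) * diffs.min().item()

# Usage: s_delta = softmax(V @ h_M) for one input sequence.  A positive score
# means the sample is classified correctly with a separation of o(s_i) at the
# network output; a large score does not by itself imply a large margin.
```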
Theorem 2 implies that the upper bound for the generalization error is determined by the spectrum of the Φ̂_{M−1,m}. To make the upper bound small, keeping δ_m and M fixed, the spectral norm of the Φ̂_{M−1,m} should be made small. Doing so improves stability of the RNN, but it may also lead to vanishing gradients, hindering the capacity of the model to learn. Therefore, despite the expected benefit of robustness due to appropriate noise injection, there is a fundamental tradeoff between trainability and generalization in RNN classifiers (c.f. [79]). To tighten the upper bound of the generalization error while avoiding the vanishing gradient problem, one should tune the numerical step sizes δ_m and noise level ε in the NRNN appropriately. RNN architectures for the drift which help to ensure moderate Jacobians (e.g. ‖Φ̂_{M−1,m}‖ ≈ 1 for all m [11]) also remain valuable in this respect.
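One practical way to act on this tradeoff is to monitor the spectral norms of the Jacobian products during training. The sketch below reuses jacobian_products from the sketch following Theorem 1; treating it as a periodic diagnostic is our suggestion and not a procedure prescribed by the paper.

```python
import torch

def jacobian_spectral_norms(drift, x_seq, h0, delta=0.1):
    """Spectral norms ||Phi_{M-1,m}|| along one input sequence (values near 1 are desirable)."""
    Phis = jacobian_products(drift, x_seq, h0, delta=delta)
    return [torch.linalg.matrix_norm(Phi, ord=2).item() for Phi in Phis]

# Usage: log the max/min of these norms every few epochs.  Values collapsing
# towards 0 signal vanishing gradients, while values growing with the sequence
# length signal unstable dynamics.
```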
Stability and Noise-Induced Stabilization

Here, we obtain sufficient conditions to guarantee stochastic stability of CT-NRNNs. This will also provide another lens to highlight their potential for improved robustness, in the vein of [46]. A dynamical system is considered stable if trajectories which are close to each other initially remain close at subsequent times. As observed in [60, 56, 10], stability plays an essential role in the study of RNNs to avoid the exploding gradient problem, a property of unstable systems where the gradient increases in magnitude with the depth. While gradient clipping during training can somewhat alleviate this issue, better performance and robustness is achieved by enforcing stability in the model itself.

Our stability analysis will focus on establishing almost sure exponential stability (for other notions of stability, see Section F in SM) for CT-NRNNs with the drift function (3). To preface the definition, consider initializing the SDE at two different random variables h_0 and h′_0 := h_0 + ε_0, where ε_0 ∈ R^{d_h} is a constant non-random perturbation with ‖ε_0‖ ≤ δ. The resulting hidden states, h_t and h′_t, are set to satisfy (2) with the same Brownian motion B_t, starting from their initial values h_0 and h′_0, respectively. The evolution of ε_t = h′_t − h_t satisfies

dε_t = A ε_t dt + Δa_t(ε_t) dt + Δσ_t(ε_t) dB_t,   (23)

where Δa_t(ε_t) = a(W h′_t + U x_t + b) − a(W h_t + U x_t + b) and Δσ_t(ε_t) = σ(h_t + ε_t, x_t) − σ(h_t, x_t). Since Δa_t(0) = 0 and Δσ_t(0) = 0 for all t ∈ [0, T], ε_t = 0 admits a trivial equilibrium for (23). Our objective is to analyze the stability of the solution ε_t = 0, that is, to see how the final state ε_T (and hence the output y′_T − y_T = V ε_T of the RNN) changes for an arbitrarily small initial perturbation ε_0 ≠ 0. To this end, we consider an extension of the Lyapunov exponent to SDEs at the level of sample paths [55].

Definition 3 (Almost sure global exponential stability). The sample (or pathwise) Lyapunov exponent of the trivial solution of (23) is
Λ = lim sup_{t→∞} t^{−1} log ‖ε_t‖. The trivial solution ε_t = 0 is almost surely globally exponentially stable if Λ is almost surely negative for all ε_0 ∈ R^{d_h}.

For the sample Lyapunov exponent Λ(ω), there is a constant C > 0 and a random variable 0 ≤ τ(ω) < ∞ such that for all t > τ(ω), ‖ε_t‖ = ‖h′_t − h_t‖ ≤ C e^{Λt} almost surely. Therefore, almost sure exponential stability implies that almost all sample paths of (23) will tend to the equilibrium solution ε = 0 exponentially fast.

The following result, which is our third main result, uses this definition to provide our primary stability result.

Theorem 3 (Bounds for sample Lyapunov exponent for the CT-NRNN). Assume that a is monotone non-decreasing, and 0 ≤ σ_1 ‖ε‖ ≤ ‖Δσ_t(ε)‖_F ≤ σ_2 ‖ε‖ for all ε ∈ R^{d_h}, t ∈ [0, T]. Then for any ε_0 ∈ R^{d_h}, with probability one,

φ + λ_min(A_sym) ≤ Λ ≤ ψ + L_a σ_max(W) + λ_max(A_sym),   (24)

with φ = −σ_2² + σ_1²/2 and ψ = −σ_1² + σ_2²/2, where L_a is the Lipschitz constant of a.

In the special case without noise (σ_1 = σ_2 = 0), we recover case (a) of Theorem 1 in [19]: when A_sym is negative definite and σ_min(A_sym) > L_a σ_max(W), Theorem 3 implies that (2) is exponentially stable. Even in the additive noise setting, however, the Lyapunov exponents of the CT-NRNN driven by additive noise are not generally the same as those of the corresponding deterministic CT-RNN. Oseledets' multiplicative ergodic theorem implies they will be the same if the data generating process x_t is ergodic [2]. Characterizing Lyapunov exponents for SDEs is a non-trivial affair in general; we refer to, for instance, [3] for details on this.

Most strikingly, and similar to [46], Theorem 3 implies that even if the deterministic CT-RNN is not exponentially stable, it can be stabilized through a stochastic perturbation. Consequently, injecting noise appropriately can improve training performance.
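Definition 3 and Theorem 3 can be explored numerically by integrating two E-M trajectories that share the same Gaussian increments and measuring the exponential growth rate of their separation. The sketch below is a finite-horizon proxy for Λ (the definition takes t → ∞); the step size and noise parameters are hypothetical, and a negative estimate indicates that the perturbation contracts.

```python
import torch

def sample_lyapunov_estimate(drift, x_seq, h0, eps0, delta=0.1,
                             eps=0.05, sigma1=1.0, sigma2=1.0):
    """Finite-horizon estimate of (1/T) log(||eps_T|| / ||eps_0||), cf. Definition 3."""
    h, h_pert = h0.clone(), h0 + eps0
    T = delta * len(x_seq)
    for x in x_seq:
        xi = torch.randn_like(h)                           # shared Brownian increment
        new_states = []
        for state in (h, h_pert):
            f = drift(state.unsqueeze(0), x.unsqueeze(0)).squeeze(0)
            noise = eps * (sigma1 * xi + sigma2 * f * xi)  # diffusion (4) applied to xi
            new_states.append(state + delta * f + (delta ** 0.5) * noise)
        h, h_pert = new_states
    return (torch.log(torch.norm(h_pert - h) / torch.norm(eps0)) / T).item()
```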
Empirical Results

The evaluation of robustness of neural networks (RNNs in particular) is an often neglected yet crucial aspect of model development and validation. In this section, we investigate the robustness of NRNNs and compare their performance to other recently introduced state-of-the-art models on both clean and corrupted data. We refer to Section G in SM for details of our experiments.

Our task of choice is pixel-by-pixel MNIST classification [42]. This task sequentially presents pixels to the model and uses the final hidden state to predict the class membership probability of the input image.

Table 1: Robustness with respect to white noise (σ) and S&P (α) perturbations on the ordered MNIST task.

Name | clean | σ = 0. | σ = 0. | σ = 0. | α = 0. | α = 0. | α = 0.
Antisymmetric RNN [10] | 97.5% | 45.7% | 22.3% | 17.0% | 77.1% | 63.9% | 42.6%
CoRNN [63] | 99.1% | 96.6% | 61.9% | 32.1% | 95.6% | 88.1% | 58.9%
Exponential RNN [43] | 96.7% | 86.7% | 58.1% | 33.3% | 83.6% | 70.7% | 43.4%
Lipschitz RNN [19] | % | 98.4% | 78.9% | 47.1% | 97.6% | 93.4% | 73.5%
NRNN (mult. noise: 0.02 / add. noise: 0.02) | 99.1% | % | 88.4% | 62.9% | 98.3% | 95.6% | 78.7%
NRNN (mult. noise: 0.02 / add. noise: 0.05) | 99.1% | % | % | % | % | % | %

Table 2: Robustness with respect to white noise (σ) and S&P (α) perturbations on the permuted MNIST task.

Name | clean | σ = 0. | σ = 0. | σ = 0. | α = 0. | α = 0. | α = 0.
Antisymmetric RNN [10] | 92.8% | 92.4% | 89.5% | 81.9% | 90.5% | 87.9% | 72.6%
CoRNN [63] | 93.8% | 26.2% | 17.7% | 15.3% | 79.6% | 67.7% | 45.2%
Exponential RNN [43] | 93.3% | 90.6% | 78.4% | 61.6% | 80.4% | 70.6% | 51.6%
Lipschitz RNN [19] | % | 95.4% | 93.5% | 83.7% | 93.7% | 90.2% | 70.8%
NRNN (mult. noise: 0.02 / add. noise: 0.02) | 94.9% | % | % | 94.3% | % | 93.1% | 88.6%
NRNN (mult. noise: 0.02 / add. noise: 0.05) | 94.7% | 94.6% | % | % | % | % | %

Table 3: Robustness with respect to adversarial perturbations on the ordered MNIST task.
Name | r = 0. | r = 0. | r = 0. | r = 0.
Antisymmetric RNN [10] | 79.4% | 24.7% | 11.4% | 10.2%
CoRNN [63] | 97.5% | 85.5% | 55.9% | 35.1%
Exponential RNN [43] | 94.5% | 59.3% | 19.7% | 14.3%
Lipschitz RNN [19] | 98.1% | 85.7% | 58.9% | 37.1%
NRNN (multiplicative noise: 0.02 / additive noise: 0.02) | % | 94.3% | 79.6% | 58.3%
NRNN (multiplicative noise: 0.02 / additive noise: 0.05) | % | % | % | %

While relatively straightforward when the ordered sequence is presented to the model, using instead a fixed random permutation of the input sequence remains a challenging task for recurrent models.

In addition to the NRNN, we consider three other RNNs derived from continuous-time models, including the Lipschitz RNN [19] (the deterministic counterpart to our NRNN), the coupled oscillatory RNN (coRNN) [63] and the antisymmetric RNN [10]. We also consider the exponential RNN [43], a discrete-time model that uses orthogonal recurrent weights. We train each model with the prescribed tuning parameters for the ordered and permuted MNIST task, using ten different seed values. For comparison, we train all models with hidden-to-hidden weight matrices of dimension d_h = 128. Then we study the sensitivity of these models with respect to a sequence of perturbed inputs during inference time. Three different types of perturbations are considered: (a) white noise; (b) salt and pepper (S&P); and (c) adversarial perturbations. To be more concrete, let X ∈ R^{28×28} be a thumbnail. The perturbations in consideration are as follows (a short sketch of how they can be generated follows the list).

• Additive white noise perturbations are constructed as X̃ = X + ΔX, where the additive noise is drawn from a Gaussian distribution ΔX ∼ N(0, σ²). This perturbation strategy emulates measurement errors that can result from data acquisition with poor sensors. The parameter σ can be used to vary the strength of these errors.
• Salt and pepper perturbations emulate defective pixels that result from converting analog signals to digital signals. The noise model takes the form P(X̃ = X) = 1 − α and P(X̃ = max) = P(X̃ = min) = α/2, where X̃(i, j) denotes the corrupted image and min and max denote the minimum and maximum pixel values. The parameter α controls the proportion of defective pixels.
• Adversarial perturbations are “worst-case” non-random perturbations maximizing the loss ℓ(g^δ(X + ΔX), y) subject to the constraint that the norm of the perturbation satisfies ‖ΔX‖ ≤ r. We consider the fast gradient sign method for constructing these perturbations [71]. The parameter r controls the strength of these perturbations.
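The following sketch generates the three perturbation types for a single image tensor X. The parameters sigma, alpha, and r mirror the descriptions above; the FGSM variant assumes a differentiable loss closure loss_of(X) so that it stays model-agnostic. This is an illustrative implementation, not the exact evaluation script used for the tables.

```python
import torch

def white_noise(X, sigma):
    """Additive Gaussian perturbation: X_tilde = X + dX with dX ~ N(0, sigma^2)."""
    return X + sigma * torch.randn_like(X)

def salt_and_pepper(X, alpha, lo=0.0, hi=1.0):
    """Each pixel is set to min/max with probability alpha/2 each, kept otherwise."""
    u = torch.rand_like(X)
    X_tilde = X.clone()
    X_tilde[u < alpha / 2] = hi                      # salt
    X_tilde[(u >= alpha / 2) & (u < alpha)] = lo     # pepper
    return X_tilde

def fgsm(X, loss_of, r):
    """Fast gradient sign method: one signed-gradient step of size r [71]."""
    X_adv = X.clone().requires_grad_(True)
    loss = loss_of(X_adv)                            # e.g. cross-entropy of the classifier
    loss.backward()
    with torch.no_grad():
        return X + r * X_adv.grad.sign()
```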
Figure 1: Test accuracy for the ordered MNIST task as a function of the strength of input perturbations. (a) White noise perturbations. (b) Salt and pepper perturbations.

Table 1 shows the average test accuracy (evaluated for models that are trained with 10 different seed values) for the ordered pixel-by-pixel MNIST task. Here we present results for both white noise and salt and pepper input perturbations for varying noise levels. While the Lipschitz RNN performs best on clean input sequences, the NRNNs show an improved resilience to input perturbations. Here, we consider two different configurations for the NRNN. In both cases, we set the multiplicative noise level to 0.02, whereas we consider the additive noise levels 0.02 and 0.05. We chose these configurations as they appear to provide a good trade-off between accuracy and robustness. Note that the predictive accuracy on clean inputs starts to drop when the noise injections during training are chosen to be too large.

Figure 2: Test accuracy for the permuted MNIST task as a function of the strength of input perturbations. (a) White noise perturbations. (b) Salt and pepper perturbations.

Table 2 shows results for the permuted pixel-by-pixel MNIST task. Again, it can be seen that the NRNN shows an improved resilience to both white noise and S&P input perturbations. Interestingly, it can also be seen that most models that are trained on the permuted MNIST task tend to be more robust as compared to the same architecture that is trained on the ordered MNIST task.

Table 3 shows the average test accuracy for the ordered MNIST task for adversarial perturbations. Again, the NRNNs show a superior resilience even to large perturbations, whereas the Antisymmetric and Exponential RNN appear to be sensitive even to small perturbations.

Figures 1 and 2 visualize the performance of different recurrent models with respect to white noise and salt and pepper perturbations. The colored bands indicate ± one standard deviation around the average performance. In all cases, the NRNN appears to be less sensitive to input perturbations as compared to the other models, while maintaining state-of-the-art performance for clean inputs. Hence, these results support our theoretical insights and support our claim that noise injection helps to improve the robustness with respect to input perturbations.

Conclusion

In this paper, we have studied RNNs trained by injecting noise into the hidden states. Within the framework of SDEs, we study the regularizing effects of general noise injection schemes. Our empirical results are in agreement with our theory and its implications, finding that NRNNs achieve superior robustness to input perturbations, while maintaining state-of-the-art generalization performance. We believe our framework can be used to guide the principled design of a class of reliable and robust RNN classifiers.

Our work opens up a range of interesting future directions. In particular, for deterministic RNNs, it has been shown that the models learn optimally near the edge of stability [11]. One could extend these analyses to NRNNs with the ultimate goal of improving their performance. On the other hand, there are a few drawbacks. As discussed in Section 5, although the noise is shown here to implicitly stabilize RNNs, it could negatively impact capacity for long-term memory [56, 87]. Furthermore, we have not taken into account the implicit bias due to the stochastic optimization procedure [66, 17]. Extending other recent analyses to account for these is the subject of future work.
Acknowledgements
We are grateful for the generous support from Amazon AWS. S. H. Lim would like to acknowledge Nordita Fellowship 2018-2021 for providing support of this work. N. B. Erichson, L. Hodgkinson, and M. W. Mahoney would like to acknowledge the IARPA (contract W911NF20C0035), ARO, NSF, and ONR via its BRC on RandNLA for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.
References

[1] Alnur Ali, Edgar Dobriban, and Ryan Tibshirani. The implicit regularization of stochastic gradient flow for least squares. In International Conference on Machine Learning, pages 233–244. PMLR, 2020.
[2] Ludwig Arnold, W Kliemann, and E Oeljeklaus. Lyapunov exponents of linear stochastic systems. In Lyapunov exponents, pages 85–125. Springer, 1986.
[3] Ludwig Arnold and Wolfgang Kliemann. Large deviations of linear stochastic differential equations. In Stochastic differential systems, pages 115–151. Springer, 1987.
[4] Raman Arora, Peter Bartlett, Poorya Mianjy, and Nathan Srebro. Dropout: Explicit forms and capacity control. arXiv preprint arXiv:2003.03397, 2020.
[5] Omri Azencot, N Benjamin Erichson, Vanessa Lin, and Michael Mahoney. Forecasting sequential data using consistent Koopman autoencoders. In International Conference on Machine Learning, pages 475–485. PMLR, 2020.
[6] Kaushik Balakrishnan and Devesh Upadhyay. Deep adversarial Koopman model for reaction-diffusion systems. arXiv preprint arXiv:2006.05547, 2020.
[7] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
[8] Yurii Nikolaevich Blagoveshchenskii and Mark Iosifovich Freidlin. Some properties of diffusion processes depending on a parameter. In Doklady Akademii Nauk, volume 138, pages 508–511. Russian Academy of Sciences, 1961.
[9] Alexander Camuto, Matthew Willetts, Umut Şimşekli, Stephen Roberts, and Chris Holmes. Explicit regularisation in Gaussian noise injections. arXiv preprint arXiv:2007.07368, 2020.
[10] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2019.
[11] Minmin Chen, Jeffrey Pennington, and Samuel S Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394, 2018.
[12] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[13] Zhengdao Chen, Jianyu Zhang, Martin Arjovsky, and Léon Bottou. Symplectic recurrent neural networks. In International Conference on Learning Representations, 2019.
[14] M. Dereziński, F. Liang, and M. W. Mahoney. Exact expressions for double descent and implicit regularization via surrogate random design. Technical report, 2019. Preprint: arXiv:1912.04533.
[15] Adji B Dieng, Rajesh Ranganath, Jaan Altosaar, and David M Blei. Noisin: Unbiased regularization for recurrent neural networks. arXiv preprint arXiv:1805.01500, 2018.
[16] Akshunna S Dogra and William T Redman. Optimizing neural networks via Koopman operator theory. arXiv preprint arXiv:2006.02361, 2020.
[17] Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, and Alyson K Fletcher. Implicit bias of linear RNNs. arXiv preprint arXiv:2101.07833, 2021.
[18] Rainer Engelken, Fred Wolf, and LF Abbott. Lyapunov spectra of chaotic recurrent neural networks. arXiv preprint arXiv:2006.02427, 2020.
[19] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. Lipschitz recurrent neural networks. In International Conference on Learning Representations, 2021.
[20] N Benjamin Erichson, Michael Muehlebach, and Michael W Mahoney. Physics-informed autoencoders for Lyapunov-stable fluid flow prediction. arXiv preprint arXiv:1905.10866, 2019.
[21] Wei Fang. Adaptive timestepping for SDEs with non-globally Lipschitz drift. PhD thesis, University of Oxford, 2019.
[22] Wei Fang, Michael B Giles, et al. Adaptive Euler–Maruyama method for SDEs with nonglobally Lipschitz drift. Annals of Applied Probability, 30(2):526–560, 2020.
[23] Mark Iosifovich Freidlin and Alexander D Wentzell. Random Perturbations of Dynamical Systems. Springer, 1998.
[24] J.F.L. Gall. Brownian Motion, Martingales, and Stochastic Calculus. Graduate Texts in Mathematics. Springer International Publishing, 2016.
[25] D. F. Gleich and M. W. Mahoney. Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow. In Proceedings of the 31st International Conference on Machine Learning, pages 1018–1025, 2014.
[26] Chengyue Gong, Tongzheng Ren, Mao Ye, and Qiang Liu. MaxUp: A simple way to improve generalization of neural network training. arXiv preprint arXiv:2002.09024, 2020.
[27] Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
[28] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
[29] Desmond J Higham, Xuerong Mao, and Andrew M Stuart. Strong convergence of Euler-type methods for nonlinear stochastic differential equations. SIAM Journal on Numerical Analysis, 40(3):1041–1063, 2002.
[30] Liam Hodgkinson, Chris van der Heide, Fred Roosta, and Michael W Mahoney. Stochastic normalizing flows. arXiv preprint arXiv:2002.09547, 2020.
[31] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[32] Martin Hutzenthaler and Arnulf Jentzen. Numerical approximations of stochastic differential equations with non-globally Lipschitz continuous coefficients, volume 236. American Mathematical Society, 2015.
[33] Martin Hutzenthaler, Arnulf Jentzen, Peter E Kloeden, et al. Strong convergence of an explicit numerical method for SDEs with nonglobally Lipschitz continuous coefficients. The Annals of Applied Probability, 22(4):1611–1641, 2012.
[34] Kam-Chuen Jim, C Lee Giles, and Bill G Horne. An analysis of noise in recurrent neural networks: Convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, 1996.
[35] Anil Kag, Ziming Zhang, and Venkatesh Saligrama. RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients? In International Conference on Learning Representations, 2020.
[36] Ioannis Karatzas and Steven E Shreve. Brownian Motion. Springer, 1998.
[37] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[38] Hassan K Khalil and Jessy W Grizzle. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.
[39] Patrick Kidger, James Morrill, James Foster, and Terry Lyons. Neural controlled differential equations for irregular time series. arXiv preprint arXiv:2005.08926, 2020.
[40] Peter E Kloeden and Eckhard Platen. Numerical Solution of Stochastic Differential Equations, volume 23. Springer Science & Business Media, 2013.
[41] Hiroshi Kunita. Stochastic differential equations and stochastic flows of diffeomorphisms. In École d'été de Probabilités de Saint-Flour XII-1982, pages 143–303. Springer, 1984.
[42] Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.
[43] Mario Lezcano-Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. In International Conference on Machine Learning, pages 3794–3803, 2019.
[44] Yunzhu Li, Hao He, Jiajun Wu, Dina Katabi, and Antonio Torralba. Learning compositional Koopman operators for model-based control. In International Conference on Learning Representations, 2019.
[45] Soon Hoe Lim. Understanding recurrent neural networks using nonequilibrium response theory. arXiv preprint arXiv:2006.11052, 2020.
[46] Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. How does noise help robustness? Explanation and exploration under the neural SDE framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 282–290, 2020.
[47] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, pages 3276–3285. PMLR, 2018.
[48] Michael Lutter, Christian Ritter, and Jan Peters. Deep Lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.
[49] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2020.
[50] Chao Ma, Stephan Wojtowytsch, Lei Wu, et al. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don't. arXiv preprint arXiv:2009.10713, 2020.
[51] M. W. Mahoney. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st ACM Symposium on Principles of Database Systems, pages 143–154, 2012.
[52] M. W. Mahoney and L. Orecchia. Implementing regularization implicitly via approximate eigenvector computation. In Proceedings of the 28th International Conference on Machine Learning, pages 121–128, 2011.
[53] Simon JA Malham and Anke Wiese. An introduction to SDE simulation. arXiv preprint arXiv:1004.0646, 2010.
[54] Xuerong Mao. Exponential Stability of Stochastic Differential Equations. Marcel Dekker, 1994.
[55] Xuerong Mao. Stochastic Differential Equations and Applications. Elsevier, 2007.
[56] John Miller and Moritz Hardt. Stable recurrent models. arXiv preprint arXiv:1805.10369, 2018.
[57] Jeremy Morton, Freddie D Witherden, and Mykel J Kochenderfer. Deep variational Koopman models: Inferring Koopman observations for uncertainty-aware dynamics modeling and control. arXiv preprint arXiv:1902.09742, 2019.
[58] Hyeonwoo Noh, Tackgeun You, Jonghwan Mun, and Bohyung Han. Regularizing deep neural networks by noise: Its interpretation and optimization. In Advances in Neural Information Processing Systems, pages 5109–5118, 2017.
[59] Shaowu Pan and Karthik Duraisamy. Physics-informed probabilistic learning of linear embeddings of nonlinear dynamics with guaranteed stability. SIAM Journal on Applied Dynamical Systems, 19(1):480–509, 2020.
[60] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.
[61] Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, and Hrushikesh Mhaskar. Theory of deep learning III: Explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173, 2017.
[62] Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
[63] T. Konstantin Rusch and Siddhartha Mishra. Coupled oscillatory recurrent neural network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies. In International Conference on Learning Representations, 2021.
[64] Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations, volume 10. Cambridge University Press, 2019.
[65] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[66] Samuel L Smith, Benoit Dherin, David GT Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176, 2021.
[67] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Generalization error of deep neural networks: Role of classification margin and data structure. In , pages 147–151. IEEE, 2017.
[68] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
[69] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019.
[70] Qi Sun, Yunzhe Tao, and Qiang Du. Stochastic training of residual networks: A differential equation viewpoint. arXiv preprint arXiv:1812.00174, 2018.
[71] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[72] Naoya Takeishi, Yoshinobu Kawahara, and Takehisa Yairi. Learning Koopman invariant subspaces for dynamic mode decomposition. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1130–1140, 2017.
[73] Peter Toth, Danilo J Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, and Irina Higgins. Hamiltonian generative networks. In International Conference on Learning Representations, 2019.
[74] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
[75] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov exponents for RNNs: Understanding information propagation using dynamical systems tools. arXiv preprint arXiv:2006.14123, 2020.
[76] Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. arXiv preprint arXiv:2002.12915, 2020.
[77] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
[78] E Weinan, Chao Ma, and Lei Wu. Machine learning from a continuous viewpoint, I. Science China Mathematics, pages 1–34, 2020.
[79] Lechao Xiao, Jeffrey Pennington, and Samuel S Schoenholz. Disentangling trainability and generalization in deep learning. arXiv preprint arXiv:1912.13053, 2019.
[80] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.
[81] Yibo Yang, Jianlong Wu, Hongyang Li, Xia Li, Tiancheng Shen, and Zhouchen Lin. Dynamical system inspired adaptive time stepping controller for residual network families. arXiv preprint arXiv:1911.10305, 2019.
[82] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. arXiv preprint arXiv:1802.08241, 2018.
[83] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[84] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[85] Huishuai Zhang, Da Yu, Mingyang Yi, Wei Chen, and Tie-yan Liu. Stability and convergence theory for learning ResNet: A full characterization. 2019.
[86] Jingfeng Zhang, Bo Han, Laura Wynter, Kian Hsiang Low, and Mohan Kankanhalli. Towards robust ResNet: A small step but a giant leap. arXiv preprint arXiv:1902.10887, 2019.
[87] Jingyu Zhao, Feiqing Huang, Jia Lv, Yanjie Duan, Zhen Qin, Guodong Li, and Guangjian Tian. Do RNN and LSTM have long memory? In International Conference on Machine Learning, pages 11365–11375. PMLR, 2020.
[88] Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. In International Conference on Learning Representations, 2019.

Supplementary Material (SM)
A Notation and Background
We begin by introducing some notations that will be used in this SM.
• ‖·‖_F denotes the Frobenius norm, and ‖·‖_p denotes the p-norm (p > 0) of a vector/matrix (in particular, ‖v‖ := ‖v‖_2 denotes the Euclidean norm of the vector v and ‖A‖ denotes the spectral norm of the matrix A).
• The i-th element of a vector v is denoted as v_i or [v]_i and the (i, j)-entry of a matrix A is denoted as A_ij or [A]_ij.
• I denotes the identity matrix (the dimension should be clear from the context).
• tr denotes trace, the superscript T denotes transposition, and R_+ := (0, ∞).
• For a function f : R^n → R^m such that each of its first-order partial derivatives (with respect to x) exists on R^n, ∂f/∂x ∈ R^{m×n} denotes the Jacobian matrix of f.
• For a scalar-valued function g : R^n → R, ∇_h g denotes the gradient of g with respect to the variable h ∈ R^n and H_h g denotes the Hessian of g with respect to h.
• The notation a.s. means P-almost surely and E is expectation with respect to P, where P is an underlying probability measure.
• For a matrix M, M_sym = (M + M^T)/2 denotes its symmetric part, λ_min(M) and λ_max(M) denote its minimum and maximum eigenvalue respectively, and σ_min(M) and σ_max(M) denote its minimum and maximum singular value respectively.
• For a vector v = (v_1, . . . , v_d), diag(v) denotes the diagonal matrix with the i-th diagonal entry equal to v_i.
• 1 denotes a vector with all entries equal to one.
• e_ij denotes the Kronecker delta.
• C(I; J) denotes the space of continuous J-valued functions defined on I.
• C^{2,1}(D × I; J) denotes the space of all J-valued functions V(x, t) defined on D × I which are continuously twice differentiable in x ∈ D and once differentiable in t ∈ I.

Next, we recall the RNN models considered in the main paper.

Continuous-Time NRNNs.
For a terminal time
$T > 0$ and an input signal $x = (x_t)_{t \in [0,T]} \in C([0,T]; \mathbb{R}^{d_x})$, the output $y_t \in \mathbb{R}^{d_y}$, for $t \in [0,T]$, is a linear map of the hidden states $h_t \in \mathbb{R}^{d_h}$ satisfying the Itô stochastic differential equation (SDE):
$$dh_t = f(h_t, x_t)\,dt + \sigma(h_t, x_t)\,dB_t, \qquad y_t = V h_t, \qquad (25)$$
where $V \in \mathbb{R}^{d_y \times d_h}$, $f: \mathbb{R}^{d_h} \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_h}$, $\sigma: \mathbb{R}^{d_h} \times \mathbb{R}^{d_x} \to \mathbb{R}^{d_h \times r}$ and $(B_t)_{t \geq 0}$ is an $r$-dimensional Wiener process.
In particular, as an example and for the empirical experiments, we focus on the choice of drift function:
$$f(h, x) = Ah + a(Wh + Ux + b), \qquad (26)$$
where $a: \mathbb{R} \to \mathbb{R}$ is a Lipschitz continuous scalar activation function (such as $\tanh$) extended to act on vectors pointwise, $A, W \in \mathbb{R}^{d_h \times d_h}$, $U \in \mathbb{R}^{d_h \times d_x}$ and $b \in \mathbb{R}^{d_h}$, and the choice of diffusion coefficient:
$$\sigma(h, x) = \epsilon\,(\sigma_1 I + \sigma_2\,\mathrm{diag}(f(h, x))), \qquad (27)$$
where the noise level $\epsilon > 0$ is small, and $\sigma_1 \geq 0$ and $\sigma_2 \geq 0$ are tunable parameters describing the relative strength of the additive noise and the multiplicative noise respectively.
We consider the following NRNN models by discretizing the SDE (25), as discussed in detail in the main paper.

Discrete-Time NRNNs.
Let $t_0 < t_1 < \cdots < t_M := T$ be a partition of the interval $[0,T]$. Denote $\delta_m := t_{m+1} - t_m$ for each $m = 0, 1, \ldots, M-1$, and $\delta := (\delta_m)$. The Euler-Maruyama (E-M) scheme provides a family (parametrized by $\delta$) of approximations to the solution of the SDE in (2):
$$h^\delta_{m+1} = h^\delta_m + f(h^\delta_m, \hat{x}_m)\,\delta_m + \sigma(h^\delta_m, \hat{x}_m)\sqrt{\delta_m}\,\xi_m, \qquad (28)$$
for $m = 0, 1, \ldots, M-1$, where $(\hat{x}_m)_{m=0,\ldots,M-1}$ is a given sequential data, the $\xi_m \sim \mathcal{N}(0, I)$ are independent $r$-dimensional standard normal random vectors, and $h^\delta_0 = h_0$. Eq. (28) describes the update equation of our NRNN models, an example of which is when $f$ and $\sigma$ are taken to be (26) and (27) respectively (see also the experiments in the main paper). In the special case when $\epsilon := 0$ in this example, we recover the Lipschitz RNN of [19].
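To make the discrete-time update (28) with the drift (26) and diffusion (27) concrete, the following is a minimal NumPy sketch of a single NRNN forward pass. The dimensions, the tanh activation, the step size, and the random toy weights below are illustrative assumptions, not the settings used in the paper's experiments.

```python
import numpy as np

def nrnn_forward(xs, A, W, U, b, V, eps=0.05, sigma1=1.0, sigma2=1.0,
                 delta=0.1, rng=None):
    """One forward pass of the Euler-Maruyama NRNN update (28) with
    drift (26) and diffusion (27); returns hidden states and outputs."""
    rng = np.random.default_rng() if rng is None else rng
    d_h = A.shape[0]
    h = np.zeros(d_h)                                      # h_0
    hs, ys = [], []
    for x in xs:                                           # xs: sequence of d_x-dim inputs
        f = A @ h + np.tanh(W @ h + U @ x + b)             # drift (26)
        sigma = eps * (sigma1 * np.eye(d_h)
                       + sigma2 * np.diag(f))              # diffusion (27), here r = d_h
        xi = rng.standard_normal(d_h)                      # xi_m ~ N(0, I)
        h = h + f * delta + sigma @ xi * np.sqrt(delta)    # E-M step (28)
        hs.append(h)
        ys.append(V @ h)                                   # y_m = V h_m
    return np.array(hs), np.array(ys)

# Example usage with arbitrary small dimensions.
rng = np.random.default_rng(0)
d_x, d_h, d_y, M = 3, 8, 2, 20
xs = rng.standard_normal((M, d_x))
A = -0.1 * np.eye(d_h)
W = 0.1 * rng.standard_normal((d_h, d_h))
U = 0.1 * rng.standard_normal((d_h, d_x))
b, V = np.zeros(d_h), rng.standard_normal((d_y, d_h))
hs, ys = nrnn_forward(xs, A, W, U, b, V)
```

Setting `eps=0.0` recovers the deterministic Lipschitz-RNN-type recurrence, which is also how the noise is dropped at inference time.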
Organizational Details.
This SM is organized as follows.
• In Section B, we provide results that guarantee existence and uniqueness of solutions to the SDE defining our continuous-time NRNNs.
• In Section C, we provide results that guarantee stability and convergence of our discrete-time NRNNs. These results are in fact very general and may be of independent interest.
• In Section D, we provide results on implicit regularization due to noise injection in both continuous-time and discrete-time NRNNs, in particular the proof of Theorem 1 in the main paper.
• In Section E, we provide some background and results to study the classification margin and generalization bound of the corresponding discrete-time deterministic RNNs, in particular the proof of Theorem 2 in the main paper.
• In Section F, we discuss stability of continuous-time NRNNs and the noise-induced stabilization phenomenon, and provide conditions that guarantee almost sure exponential stability of the NRNNs, in particular the proof of Theorem 3 in the main paper.
• In Section G, we provide details on the empirical results in the main paper.

B Existence and Uniqueness of Solutions
Essential to any discussion concerning SDEs is the existence and uniqueness of solutions; in this case, we are interested in strong solutions [36].
In the following, we fix a complete filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \geq 0}, \mathbb{P})$ which satisfies the usual conditions [36] and on which there is defined an $r$-dimensional Wiener process $(B_t)_{t \geq 0}$. We also fix a $T > 0$ and denote $f(h_t, t) := f(h_t, x_t)$, $\sigma(h_t, t) := \sigma(h_t, x_t)$ to emphasize the explicit dependence of the functions on time $t$ through the input $x_t$.
We start with the following assumptions on the SDE (25).

Assumption B. (a) (Global Lipschitz condition) The coefficients $f$ and $\sigma$ are $L$-Lipschitz, i.e., there exists a constant $L > 0$ such that
$$\|f(h,t) - f(h',t)\| + \|\sigma(h,t) - \sigma(h',t)\|_F \leq L\,\|h - h'\| \qquad (29)$$
for all $h, h' \in \mathbb{R}^{d_h}$ and $t \in [0,T]$.
(b) (Linear growth condition) $f$ and $\sigma$ satisfy the following linear growth condition, i.e., there exists a constant $K > 0$ such that
$$\|f(h,t)\|^2 + \|\sigma(h,t)\|_F^2 \leq K(1 + \|h\|^2) \qquad (30)$$
for all $h \in \mathbb{R}^{d_h}$ and $t \in [0,T]$.
Under Assumption B, it is a standard result from stochastic analysis that the SDE (2) has a unique solution (which is a continuous and adapted process $(h_t)_{t \in [0,T]}$ satisfying the integral equation $h_t = h_0 + \int_0^t f(h_s, s)\,ds + \int_0^t \sigma(h_s, s)\,dB_s$) for every initial value $h_0 \in \mathbb{R}^{d_h}$, for $t \in [0,T]$ (see, for instance, Theorem 3.1 in Section 2.3 of [55]). The uniqueness is in the sense that for any other solution $h'_t$ satisfying the SDE,
$$\mathbb{P}[h_t = h'_t \text{ for all } t \in [0,T]] = 1. \qquad (31)$$
For our purpose, the following conditions suffice to satisfy Assumption B.

Assumption C.
The function $a: \mathbb{R} \to \mathbb{R}$ is an activation function (i.e., a non-constant and Lipschitz continuous function), and $\sigma$ is $L_\sigma$-Lipschitz for some $L_\sigma > 0$.

Lemma 1.
Consider the SDE (2) defining our CT-NRNN. Then, under Assumption C, Assumption B is satisfied.

Proof. Note that $f(h,t) = Ah + a(Wh + Ux_t + b)$, where $a$ is an activation function. For any $t \in [0,T]$,
$$\|f(h,t) - f(h',t)\| \leq \|A(h - h')\| + \|a(Wh + Ux_t + b) - a(Wh' + Ux_t + b)\| \qquad (32)$$
$$\leq \|A\|\,\|h - h'\| + L_a\,\|W(h - h')\| \qquad (33)$$
$$\leq (\|A\| + L_a\|W\|)\,\|h - h'\|, \qquad (34)$$
for all $h, h' \in \mathbb{R}^{d_h}$, where $L_a > 0$ is the Lipschitz constant of the (non-constant) activation function $a$. Therefore, condition (a) in Assumption B is satisfied since, by our assumption, $\sigma$ is $L_\sigma$-Lipschitz for some constant $L_\sigma > 0$. In this case one can take $L = \max(\|A\| + L_a\|W\|, L_\sigma)$ in Eq. (29).
Since $f$ and $\sigma$ are $L$-Lipschitz, they satisfy the linear growth condition (b) in Assumption B. Indeed, if $f$ is $L$-Lipschitz, then for $t \in [0,T]$,
$$\|f(h,t)\| = \|f(h,t) - f(0,t) + f(0,t)\| \leq L\|h\| + \|f(0,t)\| \leq L\|h\| + C_f, \qquad (35)$$
for some constant $C_f \in (0, \infty)$, where we have used the fact that $f(0,t) = a(Ux_t + b)$ is bounded for $t \in [0,T]$ (since continuous functions on compact sets are bounded). So,
$$\|f(h,t)\|^2 \leq (L\|h\| + C_f)^2 \leq L^2\|h\|^2 + 2LC_f\|h\| + C_f^2. \qquad (36)$$
For $\|h\| \geq 1$, we have:
$$\|f(h,t)\|^2 \leq (L^2 + 2LC_f)\|h\|^2 + C_f^2 \leq (L^2 + 2LC_f + C_f^2)(1 + \|h\|^2). \qquad (37)$$
For $\|h\| < 1$, we have:
$$\|f(h,t)\|^2 \leq (L^2 + 2LC_f)\|h\| + C_f^2 \leq (L^2 + 2LC_f) + C_f^2 \leq (L^2 + 2LC_f + C_f^2)(1 + \|h\|^2). \qquad (38)$$
Choosing $K = L^2 + 2LC_f + C_f^2$ gives us the linear growth condition for $f$. Similarly, one can show that $\sigma$ satisfies the linear growth condition. The proof is done.

Throughout the paper, we work with SDEs satisfying Assumption C. The following additional assumption on the SDEs will be needed and invoked.

Assumption D.
For $t \in [0,T]$, the partial derivatives of the coefficients $f_i(h,t)$, $\sigma_{ij}(h,t)$ with respect to $h$ up to order three (inclusive) exist. Moreover, the coefficients $f_i(h,t)$, $\sigma_{ij}(h,t)$ and all these partial derivatives are:
(i) bounded and Borel measurable in $t$, for fixed $h \in \mathbb{R}^{d_h}$;
(ii) Lipschitz continuous in $h$, for fixed $t \in [0,T]$.
In particular, Assumption D implies that these partial derivatives (with respect to $h$) of $f$ and $\sigma$ satisfy (a)-(b) in Assumption B. Assumption D holds for SDEs with commonly used activation functions such as the hyperbolic tangent. We remark that Assumptions C-D may be weakened in various directions (for instance, to locally Lipschitz coefficients), but for the purpose of this paper we need not go beyond these assumptions.

C Stability and Convergence of the Euler-Maruyama Schemes
We provide stability and strong convergence results for the explicit Euler-Maruyama (E-M) approximations of the SDE (25), which is time-inhomogeneous due to the dependence of the drift and possibly the diffusion coefficient on a time-varying input. Intuitively, strong convergence results ensure that the approximated path follows the continuous path accurately, in contrast to weak convergence results, which can only guarantee this at the level of probability distributions. The latest version of strong convergence results for time-homogeneous SDEs can be found in [21, 22]. The results for our time-inhomogeneous SDEs can be obtained by adapting the proof in [21] without much difficulty. Since we cannot find them in the literature, we provide them in this section.
First, we recall the discretization scheme. Let $t_0 < t_1 < \cdots < t_M := T$ and $t_{m+1} = t_m + \delta_m$, for $m = 0, 1, \ldots, M-1$ and some time step $\delta_m > 0$. Note that we work in full generality here since the step sizes $\delta_m$ are not necessarily uniform and may even depend on the numerical solution, i.e., $\delta_m = \delta(h^\delta_m)$ (see Example 1). The general results will be of independent interest, in particular for further explorations in designing other variants of NRNNs.
For $m = 0, 1, \ldots, M-1$, consider
$$h^\delta_{m+1} = h^\delta_m + f(h^\delta_m, \hat{x}_m)\,\delta_m + \sigma(h^\delta_m, \hat{x}_m)\,\Delta B_m, \qquad (39)$$
where $\Delta B_m := B_{t_{m+1}} - B_{t_m}$, $(\hat{x}_m)_{m=0,1,\ldots,M-1}$ is a given input sequential data, and $h^\delta_0 = h_0$.
Let $\underline{t} = \max\{t_m : t_m \leq t\}$ and $m_t = \max\{m : t_m \leq t\}$ denote the nearest time point before time $t$ and its index. Denote the piecewise constant interpolant process $\bar{h}_t = h^\delta_{m_t}$. It is convenient to use continuous-time approximations, so we consider the continuous interpolant that satisfies:
$$h^\delta_t = \bar{h}_t + f(\bar{h}_t, x_{\underline{t}})(t - \underline{t}) + \sigma(\bar{h}_t, x_{\underline{t}})(B_t - B_{\underline{t}}), \qquad (40)$$
so that $h^\delta_t$ is the solution of the SDE:
$$dh^\delta_t = f(\bar{h}_t, x_{\underline{t}})\,dt + \sigma(\bar{h}_t, x_{\underline{t}})\,dB_t. \qquad (41)$$
We make the following assumptions about the time step.

Assumption E.
The (possibly adaptive) time step function $\delta: \mathbb{R}^{d_h} \to \mathbb{R}_+$ is continuous and strictly positive, and there exist constants $\alpha, \beta > 0$ such that for all $h \in \mathbb{R}^{d_h}$, $\delta$ satisfies
$$\langle h, f(h,t)\rangle + \tfrac{1}{2}\,\delta(h)\,\|f(h,t)\|^2 \leq \alpha\|h\|^2 + \beta \qquad (42)$$
for every $t \in [0,T]$.
Note that if another time step function $\delta^\epsilon(h)$ is smaller than $\delta(h)$, then $\delta^\epsilon(h)$ also satisfies Assumption E. A simple adaptation of the proof of Theorem 2.1.1 in [21] to our case of time-inhomogeneous SDEs gives the following result.

Proposition 1 (Finite-time stability). Under Assumption B and Assumption E, $T$ is a.s. attainable (i.e., for $\omega \in \Omega$, $\mathbb{P}[\exists N(\omega) < \infty \text{ s.t. } t_{N(\omega)} \geq T] = 1$) and for all $p > 0$ there exists a constant $C > 0$ (depending only on $p$ and $T$) such that
$$\mathbb{E}\Big[\sup_{t \in [0,T]} \|h^\delta_t\|^p\Big] \leq C. \qquad (43)$$
This is the discrete-time analogue of the result that $\mathbb{E}[\sup_{t \in [0,T]} \|h_t\|^p] < \infty$ for all $p > 0$, which can be proven by simply adapting the proof of Lemma 2.1.1 in [21].
In the case where the time step is adaptive, we take the following lower bound on the time step to bound the expected number of time steps (how quickly $\delta(h) \to 0$ as $\|h\| \to \infty$).

Assumption F.
There exist constants $a, b, q > 0$ such that the adaptive time step function satisfies:
$$\delta(h) \geq \frac{1}{a\|h\|^q + b}. \qquad (44)$$
Next, we provide a strong convergence result for the numerical approximation with the time step $\delta$. When the time step $\delta$ is adaptive, one needs to rescale the time step function by a small scalar-valued magnitude $\epsilon > 0$ and then consider the limit as $\epsilon \to 0$. Following [21], we make the following assumption.

Assumption G.
The rescaled time step function $\delta^\epsilon$ satisfies
$$\epsilon \min(T, \delta(h)) \leq \delta^\epsilon(h) \leq \min(\epsilon T, \delta(h)), \qquad (45)$$
where $\delta$ satisfies Assumptions E-F.
Under this additional assumption, we have the following convergence result, which can be proven by adapting the proof of Theorem 2.1.2 in [21] to our time-inhomogeneous SDE case. The proof is based on the argument used for the uniform time step analysis (see Theorem 2.2 in [29]), taking into account the adaptive nature of the time step appropriately.

Theorem 4 (Strong convergence). Let the SDE (2) satisfy Assumption B and the time step function satisfy Assumption G. Then, for all $p > 0$,
$$\lim_{\epsilon \to 0} \mathbb{E}\Big[\sup_{t \in [0,T]} \|h^{\delta^\epsilon}_t - h_t\|^p\Big] = 0, \qquad (46)$$
where $h^\delta_t$ is the continuous interpolant satisfying (40) and $h_t$ satisfies the SDE (2).

In particular, the non-adaptive time stepping scheme satisfies the above assumptions. Therefore, stability and strong convergence of the schemes are guaranteed by the above results.
Under stronger assumptions on the drift $f$ we can obtain the order of strong convergence for the numerical schemes; see Theorem 2.1.3 in [21] for the case of time-homogeneous SDEs. This result can be adapted to our case to obtain order-$1/2$ strong convergence, which is also obtained in the special case when the step sizes are uniform (see Theorem 10.2.2 in [40]).

Theorem 5 (Strong convergence rate). Assume that $f$ satisfies the following one-sided Lipschitz condition, i.e., there exists a constant $\alpha > 0$ such that for all $h, h' \in \mathbb{R}^{d_h}$,
$$\langle h - h', f(h,t) - f(h',t)\rangle \leq \alpha\|h - h'\|^2 \qquad (47)$$
for all $t \in [0,T]$, and the following locally polynomial growth Lipschitz condition, i.e., there exist $\gamma, \mu, q > 0$ such that for all $h, h' \in \mathbb{R}^{d_h}$,
$$\|f(h,t) - f(h',t)\| \leq \big(\gamma(\|h\|^q + \|h'\|^q) + \mu\big)\,\|h - h'\|, \qquad (48)$$
for all $t \in [0,T]$. Moreover, assume that $\sigma$ is globally Lipschitz and the time step function satisfies Assumption G. Then, for all $p > 0$, there exists a constant $C > 0$ such that
$$\mathbb{E}\Big[\sup_{t \in [0,T]} \|h^{\delta^\epsilon}_t - h_t\|^p\Big] \leq C\epsilon^{p/2}. \qquad (49)$$
Lastly, it is worth mentioning the following adaptive scheme, which may be a useful option when designing NRNNs.

Example 1 (Adaptive E-M). Under the same setup as the classical E-M setting, we may also introduce an adaptive step size scheme through a sequence of random vectors $d_m$. In this case,
$$h^\delta_{m+1} = h^\delta_m + d_m \odot f(h^\delta_m, \hat{x}_m)\,\Delta t_m + \sigma(h^\delta_m, \hat{x}_m)(\Delta t_m)^{1/2}\,\xi_m, \qquad (50)$$
where $\odot$ denotes the pointwise (Hadamard) product, and each $d_m$ may depend on $h^\delta_i$ for $i \leq m$. Provided that $d_m \to (1, \ldots, 1)$ uniformly almost surely as $\Delta t \to 0$, one could also obtain the same convergence as in the classical E-M case. The adaptive setting allows for potentially better approximations by shrinking step sizes in places where the solution changes rapidly. An intuitive explanation for the instability of the standard E-M approximation of SDEs is that there is always a very small probability of a large Brownian increment which causes the approximation to produce a solution with undesirable growth. Using an adaptive time step eliminates this problem.
Moreover, this scheme includes, in an appropriate sense, the stochastic depth in [47] (see page 9 there) and the dropout in [46] as special cases upon choosing an appropriate $d_m$ and $\sigma$.
In particular, one can consider the following drift-tamed E-M scheme, where all components, $d^i_m$, of the elements of the sequence are generated as a function of $h^\delta_m$, i.e.,
$$d^i_m = \frac{1}{\max\{1,\; c_1\|h^\delta_m\| + c_2\}}, \qquad (51)$$
for some $c_1, c_2 > 0$. In this way, the drift term is "tamed" by a solution-dependent multiplicative factor no larger than one, which prevents the hidden state in the next time step from becoming too large. This adaptive scheme is related to the one introduced in [33] to provide an explicit numerical method that displays strong convergence in circumstances where the standard E-M method does not. Under certain conditions, strong convergence of this scheme can be proven (even for SDEs with superlinearly growing drift coefficients). Other adaptive schemes include the increment-tamed scheme of [32] and many others.
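As a concrete illustration of the drift-tamed update (50)-(51), here is a hedged NumPy sketch. The toy drift, the constants c1 and c2, and all dimensions are arbitrary choices for illustration, not values used or recommended in the paper.

```python
import numpy as np

def tamed_em_step(h, x, f, sigma, dt, c1=1.0, c2=0.1, rng=None):
    """One drift-tamed E-M step: the drift is scaled by the factor
    d_m = 1 / max(1, c1*||h|| + c2) <= 1, as in Eq. (50)-(51)."""
    rng = np.random.default_rng() if rng is None else rng
    d = 1.0 / max(1.0, c1 * np.linalg.norm(h) + c2)   # taming factor d_m
    xi = rng.standard_normal(h.shape[0])
    return h + d * f(h, x) * dt + sigma(h, x) @ xi * np.sqrt(dt)

# Toy drift with superlinear growth, where plain E-M with this step size
# can blow up while the tamed iterates stay bounded in practice.
f = lambda h, x: -h**3 + x                 # elementwise cubic drift (illustrative)
sigma = lambda h, x: 0.1 * np.eye(h.shape[0])

rng = np.random.default_rng(1)
h = 5.0 * np.ones(4)
for m in range(100):
    h = tamed_em_step(h, np.zeros(4), f, sigma, dt=0.1, rng=rng)
print(np.linalg.norm(h))                   # remains O(1) instead of diverging
```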
D Implicit Regularization in NRNNs

As discussed in the main paper, although the learning is carried out in discrete time, it is worth studying the continuous-time setting. The results for the continuous-time case may provide alternative perspectives and, more importantly, will be useful as a reference for exploring other discretization schemes for the CT-NRNNs. In Subsection D.1, we study implicit regularization for the continuous-time NRNNs. In Subsection D.2, we study implicit regularization for discrete-time NRNNs and comment on the difference between the continuous-time and discrete-time cases.
D.1 Continuous-Time Setting

Main Result and Discussions.
For the sake of brevity, we denote $f_t(\cdot) := f(\cdot, x_t)$ and $\sigma_t(\cdot) := \sigma(\cdot, x_t)$ for $t \in [0,T]$ in the following.
To begin, consider the process $(\bar{h}_t)_{t \in [0,T]}$ satisfying the following initial value problem (IVP):
$$d\bar{h}_t = f_t(\bar{h}_t)\,dt, \qquad \bar{h}_0 = h_0. \qquad (52)$$
Let $\Psi$ denote the unique fundamental matrix satisfying the following properties: for $0 \leq s \leq u \leq t \leq T$,
(a) $\dfrac{\partial \Psi(t,s)}{\partial t} = \dfrac{\partial f_t}{\partial \bar{h}}(\bar{h}_t)\,\Psi(t,s)$ and $\dfrac{\partial \Psi(t,s)}{\partial s} = -\Psi(t,s)\,\dfrac{\partial f_s}{\partial \bar{h}}(\bar{h}_s)$; $\qquad (53)$
(b) $\Psi(t,s) = \Psi(t,u)\Psi(u,s)$;
(c) $\Psi(t,s) = \Psi^{-1}(s,t)$;
(d) $\Psi(s,s) = I$.
Also, let $\Sigma(t,s) := \Psi(t,s)\,\sigma_s(\bar{h}_s)$ for $0 \leq s \leq t \leq T$.
The following result links the expected loss function used for training CT-NRNNs to that for training deterministic CT-RNNs when the noise amplitude is small.

Theorem 6 (Explicit regularization induced by noise injection for CT-NRNNs). Under Assumption A in the main paper,
$$\mathbb{E}\,\ell(h_T) = \ell(\bar{h}_T) + \epsilon^2\big[Q(\bar{h}_0) + R(\bar{h}_0)\big] + O(\epsilon^3), \qquad (54)$$
as $\epsilon \to 0$, where $Q$ and $R$ are given by
$$Q(\bar{h}_0) = (\nabla \ell(\bar{h}_T))^T \int_0^T ds\, \Psi(T,s) \int_0^s du\, v(u) + (\nabla \ell(\bar{h}_T))^T \int_0^T ds\, w(s), \qquad (55)$$
$$R(\bar{h}_0) = \frac{1}{2}\int_0^T ds\, \mathrm{tr}\big(\Sigma(T,s)\Sigma(T,s)^\top \nabla^2 \ell(\bar{h}_T)\big), \qquad (56)$$
with $v(u)$ a vector with the $p$th component ($p = 1, 2, \ldots, d_h$):
$$v_p(u) = \mathrm{tr}\big(\Sigma(s,u)\Sigma^T(s,u)\, \nabla^2 [f_s]_p(\bar{h}_s)\big), \qquad (57)$$
and $w(s)$ a vector with the $q$th component ($q = 1, 2, \ldots, d_h$):
$$w_q(s) = \sum_{k=1}^{r} \sum_{j,l=1}^{d_h} \Psi_{qj}(T,s)\, \partial_l \sigma^{jk}_s(\bar{h}_s)\, \sigma^{lk}_s(\bar{h}_s). \qquad (58)$$
Therefore, to study the difference between the CT-NRNNs and their deterministic version, it remains to investigate the role of $Q$ and $R$ in Theorem 6. If the Hessian is positive semi-definite, then $R(\bar{h}_0)$ is non-negative and thus a viable regularizer. On the other hand, $Q(\bar{h}_0)$ need not be non-negative. However, by assuming that $\nabla^2 f$ and $\nabla \sigma_{ij}$ are small (that is, $f$ is approximately linear and $\sigma$ relatively independent of $\bar{h}$), $Q$ can be treated as negligible and we may focus predominantly on $R$. An argument of this kind was used in [9] in the context of Gauss-Newton Hessian approximations. In particular, $Q = 0$ for linear NRNNs with additive noise. Therefore, Theorem 6 essentially tells us that injecting noise into a deterministic RNN is approximately equivalent to considering a regularized objective functional. Moreover, the explicit regularizer is solely determined by the flow generated by the Jacobian $\frac{\partial f_t}{\partial \bar{h}}(\bar{h}_t)$, the diffusion coefficient $\sigma_t$ and the Hessian of the loss function, all evaluated along the dynamics of the deterministic RNN.
Under these assumptions, ignoring higher-order terms and bounding the Frobenius inner product in (56), we can interpret training with the CT-NRNN as an approximation of the following optimal control problem [77] with running cost $C(t) := \frac{1}{2}\,\mathrm{tr}\big(\sigma_t(\bar{h}_t)^T \Psi(T,t)^T \nabla^2\ell(\bar{h}_t)\, \Psi(T,t)\,\sigma_t(\bar{h}_t)\big)$:
$$\min\; \mathbb{E}_{(x,y)\sim\mu}\Big[\ell(\bar{h}_T) + \epsilon^2 \int_0^T C(t)\,dt\Big] \qquad (59)$$
$$\text{s.t.}\quad d\bar{h}_t = f_t(\bar{h}_t)\,dt, \quad t \in [0,T], \quad \bar{h}_0 = h_0, \qquad (60)$$
where $(x := (x_t)_{t \in [0,T]}, y)$ denotes a training example drawn from the distribution $\mu$ and the minimization is with respect to the parameters (controls) in the corresponding deterministic RNN. On the other hand, we can interpret training with the deterministic RNN as the above optimal control problem with zero running cost or regularization. Note that if the Hessian matrix is symmetric positive semi-definite, then $C(t)$ is a quadratic form with the associated metric tensor $M_t^T M_t := \nabla^2 \ell(\bar{h}_t)$ and
$$C(t) = \tfrac{1}{2}\langle \Psi(T,t)\sigma_t, \Psi(T,t)\sigma_t\rangle_{M_t} = \tfrac{1}{2}\|M_t \Psi(T,t)\sigma_t\|_F^2 \leq \tfrac{1}{2}\|\sigma_t\|_F^2\,\|M_t\|_F^2\,\|\Psi(T,t)\|_F^2. \qquad (61)$$
Overall, we can see that the use of NRNNs as a regularization mechanism penalizes the fundamental matrices $\Psi(T,s)$ according to the magnitude of the elements of $\sigma_t$.

Proof of Theorem 6.
Next, we prove Theorem 6. We will need some auxiliary results before doing so.For a small perturbation parameter (cid:15) > , the hidden states now satisfy the SDE d h t = f t ( h t )d t + (cid:15)σ t ( h t )d B t , where we have used the shorthand f t ( · ) = f ( · , x t ) and σ t ( · ) = σ ( · , x t ) . To investigate the effect of theperturbation, consider the following hierarchy of differential equations: d h (0) t = f t ( h (0) t )d t, (62) d h (1) t = ∂f t ∂h ( h (0) t ) h (1) t d t + σ t ( h (0) t )d B t , (63) d h (2) t = ∂f t ∂h ( h (0) t ) h (2) t d t + Φ (1) t ( h (0) t , h (1) t )d t + Φ (2) t ( h (0) t , h (1) t )d B t , (64)with h (0)0 = h , h (1)0 = 0 , and h (2)0 = 0 , and where Φ (1) t ( h , h ) = 12 (cid:88) i,j ∂ f t ∂h i ∂h j ( h ) h i h j (65) Φ (2) t ( h , h ) = (cid:88) i ∂σ t ∂h i ( h ) h i . (66)In the sequel, we will suppose Assumption A in the main paper (which is equivalent to Assumption B andAssumption D) holds. Under this assumption, each of these initial value problems have a unique solution for t ∈ [0 , T ] . The processes h (0) t , h (1) t and h (2) t denote the zeroth-, first-, and second-order terms in an expansionof h t about (cid:15) = 0 . This can be easily seen using Kunita’s theory of stochastic flows. In particular, by Theorem3.1 in [41], letting h (1) (cid:15),t = ∂h t ∂(cid:15) , we find that d h (1) (cid:15),t = ∂f t ∂h ( h t ) h (1) (cid:15),t d t + (cid:16) σ t ( h t ) + (cid:15) Φ (2) t ( h t , h (1) (cid:15),t ) (cid:17) d B t , and so we find that h (1)0 ,t = h (1) t . Similarly, h (2) (cid:15),t = ∂ h t ∂(cid:15) = ∂h (1) (cid:15),t ∂(cid:15) can be shown to satisfy d h (2) (cid:15),t = ∂f t ∂h ( h t ) h (2) (cid:15),t d t + 2Φ (1) t ( h t , h (1) (cid:15),t )d t + (cid:32) (2) t ( h t , h (1) (cid:15),t ) + (cid:15) (cid:88) k [ h (1) (cid:15),t ] k ∂∂h k Φ (2) t ( h t , h (1) (cid:15),t ) (cid:33) d B t . ( h t , h (1) (cid:15),t ) with respect to (cid:15) and projecting to the second coordinate. Taking (cid:15) = 0 , we find that h (2)0 ,t = 2 h (2) t . Therefore,informally, a pathwise second-order Taylor expansion about (cid:15) = 0 reveals that h t = h (0) t + (cid:15)h (1) t + (cid:15) h (2) t + O ( (cid:15) ) .To formalize this statement, we will later bound the third-order error term in Lemma 3.While the equation for h (0) t is not explicitly solvable, both h (1) t and h (2) t are. In particular, for t ∈ [0 , T ] (see Eq.(4.28) in [64]): h (1) t = (cid:90) t Ψ( t, s ) σ s ( h (0) s )d B s = (cid:90) t Σ( t, s )d B s , (67) h (2) t = (cid:90) t Ψ( t, s )Φ (1) s ( h (0) s , h (1) s )d s + (cid:90) t Ψ( t, s )Φ (2) s ( h (0) s , h (1) s )d B s . (68)The key result needed to prove Theorem 6 is contained in the following theorem. In the sequel, big O notation isto be understood in the almost sure sense. Theorem 7.
For a scalar-valued loss function (cid:96) ∈ C ( R d h ) , for t ∈ [0 , T ] , (cid:96) ( h t ) = (cid:96) ( h (0) t ) + (cid:15) ∇ (cid:96) ( h (0) t ) · h (1) t + (cid:15) (cid:18) ∇ (cid:96) ( h (0) t ) · h (2) t + 12 ( h (1) t ) (cid:62) ∇ (cid:96) ( h (0) t )( h (1) t ) (cid:19) + O ( (cid:15) ) , as (cid:15) → . We now prove Theorem 7. The proof relies on two lemmas. The first bounds the solutions h ( i ) over [0 , T ] . Lemma 2.
For any p > , sup s ∈ [0 ,T ] (cid:107) h (0) s (cid:107) p < ∞ and E sup s ∈ [0 ,T ] (cid:107) h ( i ) s (cid:107) p < ∞ for i = 1 , .Proof. For s ∈ [0 , T ] , h (0) s = h + (cid:82) s f u ( h (0) u )d u , so recalling ( x + y ) ≤ x + 2 y and (cid:107) f t ( h ) (cid:107) ≤ K (1 + (cid:107) h (cid:107) ) , (cid:107) h (0) s (cid:107) ≤ (cid:107) h (cid:107) + 2 (cid:90) s (cid:107) f u ( h (0) u ) (cid:107) d u ≤ (cid:18) (cid:107) h (cid:107) + K s + K (cid:90) s (cid:107) h (0) u (cid:107) d u (cid:19) . Therefore, by Gronwall’s inequality, (cid:107) h (0) s (cid:107) ≤ (cid:107) h (cid:107) + K s ) e K s ≤ (cid:107) h (cid:107) + K T ) e K T < + ∞ , and so sup s ∈ [0 ,T ] (cid:107) h (0) s (cid:107) < ∞ . Similarly, for s ∈ [0 , T ] , h (1) s = (cid:90) s ∂f u ∂h ( h (0) u ) h (1) u d u + (cid:90) s σ u ( h (0) u )d B u . Therefore, for p ≥ (since ( x + y ) p ≤ p − ( x p + y p ) by Jensen’s inequality): (cid:107) h (1) s (cid:107) p ≤ p − (cid:90) s (cid:13)(cid:13)(cid:13)(cid:13) ∂f u ∂h ( h (0) u ) (cid:13)(cid:13)(cid:13)(cid:13) p (cid:107) h (1) u (cid:107) p d u + 2 p − (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) s σ u ( h (0) u )d B u (cid:13)(cid:13)(cid:13)(cid:13) p . Because the Itˆo integral is a continuous martingale, the Burkholder-Davis-Gundy inequality (see Theorem3.28 in [36]) implies that for positive constants C p depending only on p (but not necessarily the same in eachappearance), E sup s ∈ [0 ,T ] (cid:107) h (1) s (cid:107) p ≤ C p (cid:90) s E sup s ∈ [0 ,u ] (cid:107) h (1) s (cid:107) p d u + C p (cid:18)(cid:90) t (cid:107) σ u ( h (0) u ) (cid:107) d u (cid:19) p/ An application of Gronwall’s inequality yields E sup s ∈ [0 ,T ] (cid:107) h (1) s (cid:107) p ≤ C p (cid:32)(cid:90) T (cid:107) σ u ( h (0) u ) (cid:107) d u (cid:33) p/ e C p T . E sup s ∈ [0 ,T ] (cid:107) h (1) s (cid:107) p < ∞ for all p ≥ . The p ∈ (0 , case follows from H¨older’s inequality.Repeating this same approach for h (2) s completes the proof.The second of our two critical lemmas provides a pathwise expansion of h t about (cid:15) in the vein of [8]. Doing socharacterizes the response of the NRNN hidden states to small noise perturbations at the sample path level. Itcan be seen as a strengthening of Theorem 2.2 in [23] for our time-inhomogeneous SDEs. Lemma 3.
For a fixed (cid:15) > , and any < (cid:15) ≤ (cid:15) , with probability one, h t = h (0) t + (cid:15)h (1) t + (cid:15) h (2) t + (cid:15) R (cid:15) ( t ) , where for any p > , sup (cid:15) ∈ (0 ,(cid:15) ) E sup t ∈ [0 ,T ] (cid:107) R (cid:15) ( t ) (cid:107) p < ∞ . (69) Proof.
It suffices to show that sup (cid:15) ∈ (0 ,(cid:15) ) E sup t ∈ [0 ,T ] (cid:107) R (cid:15) ( t ) (cid:107) p < ∞ for p ≥ — the p ∈ (0 , case followsfrom H¨older’s inequality. In the sequel, we shall let K denote a finite number (not necessarily the same in eachappearance) depending only on f, σ, T, (cid:15) , and p , and therefore independent of t, (cid:15) .For (cid:15) > , let h (cid:15)t = h (0) t + (cid:15)h (1) t + (cid:15) h (2) t and R ( t ) = (cid:15) − ( h t − h (cid:15)t ) , where h t , h (1) t , h (2) t are coupled togetherthrough the same Brownian motion. Then (cid:15) R ( t ) = (cid:90) t (cid:18) f s ( h s ) − f s ( h (0) s ) − (cid:15) ∂f s ∂h ( h (0) s ) h (1) s − (cid:15) ∂f s ∂h ( h (0) s ) h (2) s − (cid:15) Φ (1) s ( h (0) s , h (1) s ) (cid:19) d s + (cid:15) (cid:90) t σ s ( h s ) − σ s ( h (0) s ) − (cid:15) Φ (2) s ( h (0) s , h (1) s )d B s . To simplify, we decompose (cid:15) R ( t ) into the sum of four random variables θ i ( t ) , i = 1 , . . . , , given by θ ( t ) = (cid:90) t [ f s ( h s ) − f s ( h (cid:15)s )] d sθ ( t ) = (cid:90) t (cid:20) f s ( h (cid:15)s ) − f s ( h (0) s ) − (cid:15) ∂f s ∂h ( h (0) s ) h (1) s − (cid:15) ∂f s ∂h ( h (0) s ) h (2) s − (cid:15) Φ (1) s ( h (0) s , h (1) s ) (cid:21) d sθ ( t ) = (cid:15) (cid:90) t (cid:104) σ s ( h s ) − σ s ( h (0) s + (cid:15)h (1) s ) (cid:105) d B s θ ( t ) = (cid:15) (cid:90) t (cid:104) σ s ( h (0) s + (cid:15)h (1) s ) − σ s ( h (0) s ) − (cid:15) Φ (2) s ( h (0) s , h (1) s ) (cid:105) d B s . Beginning with the more straightforward terms θ ( t ) , θ ( t ) , by Lipschitz continuity of f , (cid:107) f s ( h s ) − f s ( h (cid:15)s ) (cid:107) ≤ L f (cid:15) (cid:107) R ( s ) (cid:107) , and so E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p ≤ K(cid:15) p (cid:90) t E sup s ∈ [0 ,u ] (cid:107) R ( s ) (cid:107) d u. In the same way, (cid:107) σ s ( h s ) − σ s ( h (0) s + (cid:15)h (1) s ) (cid:107) ≤ L σ (cid:15) (cid:107) h (2) s + (cid:15)R ( s ) (cid:107) . Recall that ( (cid:82) t g ( s )d s ) p ≤ t p − (cid:82) t g ( s ) p d s by Jensen’s inequality. Now, θ ( s ) is a continuous martingale, and hence, the Burkholder-Davis-Gundy inequality (see Theorem 3.28 in [36]) implies that for some constant C p > depending only on p , E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p ≤ (cid:15) p C p (cid:18)(cid:90) t E (cid:13)(cid:13)(cid:13) σ s ( h s ) − σ s ( h (0) s + (cid:15)h (1) s ) (cid:13)(cid:13)(cid:13) d s (cid:19) p/ ≤ C p L pσ (cid:15) p (cid:18)(cid:90) t E (cid:107) h (2) s + (cid:15)R ( s ) (cid:107) d s (cid:19) p/ ≤ C p L pσ (cid:15) p T p/ − (cid:90) t E (cid:107) h (2) s + (cid:15)R ( s ) (cid:107) p d s ≤ C p L pσ (cid:15) p p − T p/ − (cid:18)(cid:90) t E (cid:107) h (2) s (cid:107) p d s + (cid:15) p (cid:90) t E (cid:107) R ( s ) (cid:107) p d s (cid:19) . E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p ≤ K(cid:15) p (cid:32) (cid:15) p (cid:90) t E sup s ∈ [0 ,u ] (cid:107) R ( s ) (cid:107) p d u (cid:33) . Treating the θ term next, for each s ∈ [0 , t ] , by Taylor’s theorem, there exists some (cid:15) s ∈ (0 , (cid:15) ) such that f s ( h (cid:15)s ) − f s ( h ) − (cid:15) ∂f s ∂h ( h ) h = (cid:15) ∂f s ∂h ( h (cid:15) s s ) h + (cid:15) Φ (1) s ( h (cid:15) s s , h ) . 
Therefore, by Lipschitz continuity of the derivatives of f , θ ( t ) = (cid:15) (cid:90) t (cid:18) ∂f s ∂h ( h (cid:15) s s ) h + Φ (1) s ( h (cid:15) s s , h ) − ∂f s ∂h ( h ) h − Φ (1) s ( h , h ) (cid:19) d s ≤ K(cid:15) (cid:90) t (cid:107) h (cid:15) s s − h (cid:107) d s ≤ K(cid:15) (cid:90) t (cid:107) h (1) s (cid:107) + (cid:15) (cid:107) h (2) s (cid:107) d s. From Lemma 2, it follows that E sup s ∈ [0 ,T ] (cid:107) θ ( s ) (cid:107) p ≤ K(cid:15) p . Similarly, by Taylor’s theorem, there exists (cid:15) s ∈ (0 , (cid:15) ) such that σ s ( h (0) s + (cid:15)h (1) s ) − σ s ( h (0) s ) = (cid:15) Φ (2) s ( h (0) s + (cid:15) s h (1) s , h (1) s ) , and so for p ≥ , by the Burkholder-Davis-Gundy inequality and Lipschitz continuity of the derivatives of σ , E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p ≤ C p (cid:15) p (cid:18)(cid:90) t E (cid:13)(cid:13)(cid:13) Φ (2) s ( h (0) s + (cid:15) s h (1) s , h (1) s ) − Φ (2) s ( h (0) s , h (1) s ) (cid:13)(cid:13)(cid:13) d s (cid:19) p/ ≤ KC p (cid:15) p (cid:18)(cid:90) t E (cid:107) h (1) s (cid:107) d s (cid:19) p/ ≤ KC p T p/ (cid:15) p E sup s ∈ [0 ,T ] (cid:107) h (1) s (cid:107) p ≤ K(cid:15) p . Combining estimates for θ , θ , θ , θ , E sup s ∈ [0 ,t ] (cid:107) R ( s ) (cid:107) p = 4 p − (cid:15) − p (cid:18) E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p + E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p + E sup s ∈ [0 ,t ] (cid:107) θ ( s ) (cid:107) p + E sup s ∈ [0 ,T ] (cid:107) θ ( s ) (cid:107) p (cid:19) ≤ K (cid:32) (cid:90) t E sup s ∈ [0 ,u ] (cid:107) R ( s ) (cid:107) p d u (cid:33) , and so by Gronwall’s inequality, E sup s ∈ [0 ,t ] (cid:107) R ( s ) (cid:107) p ≤ Ke Kt . Since K is independent of t ≤ T and (cid:15) ≤ (cid:15) ,it follows that sup (cid:15) ∈ (0 ,(cid:15) ) E sup s ∈ [0 ,t ] (cid:107) R ( s ) (cid:107) p ≤ Ke Kt < + ∞ , and the result follows.We remark that perturbative techniques such as the one used to obtain Theorem 3 are standard in the theory ofstochastic flows.Theorem 7 now follows in a straightforward fashion from Lemma 3 by taking a second-order Taylor expansionof (cid:96) ( h (0) t + (cid:15)h (1) t + (cid:15) h (2) t + O ( (cid:15) )) about (cid:15) = 0 .We are now in a position to prove Theorem 6 using Theorem 7.25 roof of Theorem 6. From Theorem 7, we have, upon taking expectation: E (cid:96) ( h t ) = (cid:96) ( h (0) t ) + (cid:15) ( ∇ h (0) (cid:96) ) T E h (1) t + (cid:15) (cid:18) ( ∇ h (0) (cid:96) ) T E h (2) t + 12 E ( h (1) t ) T ( H h (0) (cid:96) ) h (1) t (cid:19) + O ( (cid:15) ) , (70)for t ∈ [0 , T ] , as (cid:15) → , where H h (0) denotes Hessian operator and the h ( i ) t satisfy Eq. (62)-(64).Since ∇ f t and its derivative are bounded and are thus Lipschitz continuous, by Picard’s theorem the IVP has aunique solution. Moreover, it follows from our assumptions that the solution to the IVP is square-integrable (i.e., (cid:82) t (cid:107) Ψ( t, s ) (cid:107) F ds < ∞ for any t ∈ [0 , T ] ). Therefore, the solution h (1) t to Eq. (63) can be uniquely representedas the following Itˆo integral: h (1) t = (cid:90) t Ψ( t, s ) σ ( h (0) s , s ) dB s , (71)where Ψ( t, s ) is the (deterministic) fundamental matrix solving the IVP (53). We have E h (1) t = 0 and E (cid:107) h (1) t (cid:107) = (cid:90) t (cid:107) Ψ( t, s ) σ ( h (0) s , s ) (cid:107) F ds < ∞ . (72)Similar argument together with Assumption D shows that the solution h (2) t to Eq. 
(64) admits the followingunique integral representation, with the i th component: h (2) it = 12 (cid:90) t Ψ ij ( t, s )[ h (1) s ] l ∂ b j ∂ [ h (0) s ] l ∂ [ h (0) s ] k [ h (1) s ] k ds + (cid:90) t Ψ ij ( t, s ) ∂σ jk ∂ [ h (0) s ] l [ h (1) s ] l dB ks , (73)where the last integral above is a uniquely defined Itˆo integral.Plugging Eq. (71) into the above expression and then taking expectation, we have: E h (2) it = 12 E (cid:90) t ds Ψ ij ( t, s ) ∂ b j ∂ [ h (0) s ] l ∂ [ h (0) s ] k (cid:90) s dB l u (cid:90) s dB k u Ψ ll ( s, u ) σ l l σ k k Ψ kk ( s, u )+ 12 E (cid:90) t dB ks Ψ ij ( t, s ) ∂σ jk ∂ [ h (0) s ] l (cid:90) t dB l u Ψ ll ( s, u ) σ l l ( h (0) u , u ) , (74)where we have performed change of variable to arrive at the last double integral above.Using the semigroup property of Ψ , we have Ψ( t, s ) = Ψ( t, − ( s, for any s ≤ t (and so Ψ ij ( t, s ) =Ψ ij ( t, − ) j j ( s, and Ψ ll ( s, u ) = Ψ ll ( s, − ) l l ( u, etc.). Using this property in (74) and thenevaluating the resulting expression using properties of moments of stochastic integrals (applying Eq. (5.7) andProposition 4.16 in [24] – note that Itˆo isometry follows from Eq. (5.7) there), we obtain ( ∇ h (0) (cid:96) ) T E h (2) t = Q ( h (0) ) , where Q satisfies Eq. (55).Similarly, plugging Eq. (71) into E h (1) Tt ( H h (0) (cid:96) ) h (1) t , and then proceeding as above and applying the cyclicproperty of trace, give E ( h (1) t ) T ( H h (0) (cid:96) ) h (1) t = R ( h (0) ) , where R satisfies Eq. (56). The proof is done. D.2 Discrete-Time Setting: Proof of Theorem 1 in the Main Paper
The goal in this subsection is to prove Theorem 1 in the main paper, the discrete-time analogue of Theorem 6.We recall the theorem in the following.
Theorem 8 (Explicit regularization induced by noise injection for discrete-time NRNNs – Theorem 1 in themain paper) . Under Assumption A in the main paper, E (cid:96) ( h δM ) = (cid:96) (¯ h δM ) + (cid:15) Q (¯ h δ ) + ˆ R (¯ h δ )] + O ( (cid:15) ) , (75) as (cid:15) → , where the terms ˆ Q and ˆ R are given by ˆ Q (¯ h δ ) = ( ∇ l (¯ h δM )) T M (cid:88) k =1 δ k − ˆΦ M − ,k M − (cid:88) m =1 δ m − v m , (76) ˆ R (¯ h δ ) = M (cid:88) m =1 δ m − tr( σ Tm − ˆΦ TM − ,m H ¯ h δ l ˆΦ M − ,m σ m − ) , (77)26 ith v m a vector with the p th component ( p = 1 , . . . , d h ) : [ v m ] p = tr( σ Tm − ˆΦ TM − ,m H ¯ h δ [ f M ] p ˆΦ M − ,m σ m − ) . Moreover, | ˆ Q (¯ h δ ) | ≤ C Q ∆ , | ˆ R (¯ h δ ) | ≤ C R ∆ , (78) for C Q , C R > independent of ∆ . To prove Theorem 8, the key idea is to first obtain a discretized version of the loss function in Theorem 7 byeither discretizing the results in Theorem 7 or by proving directly from the discretized equations (28). It thenremains to compute the expectation of this loss as functional of the discrete-time process. The first part isstraightforward while the second part involves some tedious recursive computations.Let t < t < · · · < t M := T be a partition of the interval [0 , T ] and let δ m = t m +1 − t m for each m = 0 , , . . . , M − . For small parameter (cid:15) > , the E-M scheme is given by: h δm +1 = h δm + f ( h δm , ˆ x m ) δ m + (cid:15)σ ( h δm , ˆ x m ) (cid:112) δ m ξ m , (79)where (ˆ x m ) m =0 ,...,M − is a given sequential data, each ξ m ∼ N (0 , I ) is an independent r -dimensional standardnormal random vector, and h δ = h .Consider the following hierarchy of recursive equations. For the sake of notation cleanliness, we replace thesuperscript δ by hat when denoting the δ -dependent approximating solutions in the following.For m = 0 , , . . . , M − : ˆ h (0) m +1 = ˆ h (0) m + δ m f (ˆ h (0) m , ˆ x m ) , ˆ h (0)0 = h , (80) ˆ h (1) m +1 = ˆ J m ˆ h (1) m + (cid:112) δ m σ (ˆ h (0) m , ˆ x m ) ξ m , ˆ h (1)0 = 0 , (81) ˆ h (2) m +1 = ˆ J m ˆ h (2) m + (cid:112) δ m + δ m Ψ (ˆ h (0) m , ˆ h (1) m ) + δ m Ψ (ˆ h (0) m , ˆ h (1) m ) ξ m , ˆ h (2)0 = 0 , (82)where the ˆ J m = I + δ m f (cid:48) (ˆ h (0) m , ˆ x m ) (83)are the state-to-state Jacobians and Ψ ( h , h ) = 12 (cid:88) i,j ∂ f m ∂h i ∂h j ( h ) h i h j , (84) Ψ ( h , h ) = (cid:88) i ∂σ m ∂h i ( h ) h i . (85)Note that the above equations can also be obtained by E-M discretization of Eq. (62)-(63).The following theorem is a discrete-time analogue of Theorem 7. Recall that the big O notation is to beunderstood in the almost sure sense. Theorem 9.
Under the same assumption as before, for a scalar-valued loss function (cid:96) ∈ C ( R d h ) , for m = 0 , , . . . , M − , we have (cid:96) (ˆ h m +1 ) = (cid:96) (ˆ h (0) m ) + (cid:15) ∇ (cid:96) (ˆ h (0) m ) · ˆ h (1) m + (cid:15) (cid:18) ∇ (cid:96) (ˆ h (0) m ) · ˆ h (2) m + 12 (ˆ h (1) m ) (cid:62) ∇ (cid:96) (ˆ h (0) m )(ˆ h (1) m ) (cid:19) + O ( (cid:15) ) , (86) as (cid:15) → , where the ˆ h ( i ) m , i = 0 , , , satisfy Eq. (80) - (82) .Proof. The proof is analogous to the one for continuous-time case, working with the discrete-time process (79)instead of continuous-time process.We begin by recalling a remark from the main text. 27 emark 1.
Interestingly, Theorem 8 looks like discrete-time analogue of Theorem 6 for CT-RNN, except that,unlike the term Q there, the term ˆ Q for the discrete-time case has no explicit dependence on the noise coefficient.Therefore, a direct discretization of the result in Theorem 6 would not give us the correct explicit regularizerfor discrete-time NRNNs. This remark highlights the difference between learning in the practical discrete-timesetting versus learning in the idealized continuous-time setting with NRNNs. This also means that we need towork out an independently crafted proof for the discrete-time case.The proof of Theorem 8 involves some tedious, albeit technically straightforward, computations. The keyingredients are the recursive relations (80)-(82) and the property of standard Gaussian random vectors that E ξ lp ξ jq = e pq e lj , (87)where the e pq denote the Kronecker delta.To organize our proof, we begin by introducing some notation and proving a lemma. Notation.
For m = 1 , . . . , M − , let us denote f (cid:48) m := f (cid:48) (ˆ h (0) m , ˆ x m ) , σ m := σ (ˆ h (0) m , ˆ x m ) , H lj f im := ∂ f i (ˆ h (0) m , ˆ x m ) ∂ [ˆ h (0) m ] l ∂ [ˆ h (0) m ] j , (88) D l σ ijm := ∂σ ij (ˆ h (0) m , ˆ x m ) ∂ [ˆ h (0) m ] l , (89)and ˆΦ m,k := J m J m − · · · J k , ˆΦ k,k +1 = I, (90)for k = 1 , . . . , m . For computational convenience, we are using Einstein’s summation notation for repeatedindices in the following. Lemma 4.
For m = 0 , , . . . , M , E ˆ h (1) m = 0 and E [ˆ h (1) m ] l [ˆ h (1) m ] j = δ m − σ ll m − σ jl m − + m − (cid:88) k =1 δ k − ˆΦ ll m − ,k ˆΦ jj m − ,k σ l l k − σ j l k − . (91) Proof.
From Eq. (81), we have ˆ h (1)0 = 0 , ˆ h (1)1 = √ δ σ t ξ and, upon iterating, for m = 1 , . . . , M − , ˆ h (1) m +1 = (cid:112) δ m σ m ξ m + m (cid:88) k =1 (cid:112) δ k − ˆΦ m,k σ k − ξ k − . (92)The first equality in the lemma follows from taking expectation of Eq. (92) and using the fact that the ξ k are (mean zero) standard Gaussian random variables. The second equality in the lemma follows from takingexpectation of a product of components of the ˆ h (1) m +1 in Eq. (92) and applying the property (87). Proof of Theorem 8.
Iterating Eq. (82), we obtain ˆ h (2)0 = 0 , ˆ h (2)1 = δ Ψ ( h ,
0) + √ δ Ψ ( h , ξ and, for m = 1 , . . . , M − , ˆ h (2) m +1 = δ m Ψ (ˆ h (0) m , ˆ h (1) m ) + (cid:112) δ m Ψ (ˆ h (0) m , ˆ h (1) m ) + m (cid:88) k =1 δ k − ˆΦ m,k Ψ (ˆ h (0) k − , ˆ h (1) k − )+ m (cid:88) k =1 (cid:112) δ k − ˆΦ m,k Ψ (ˆ h (0) k − , ˆ h (1) k − ) ξ k − . (93)28ubstituting in the formulae (84)-(85) in the right hand side above and then using Eq. (92): [ˆ h (2) m +1 ] i = δ m h (1) m ] l H lj f im [ˆ h (1) m ] j + m (cid:88) k =1 δ k − ipm,k [ˆ h (1) k − ] l H lj f pk − [ˆ h (1) k − ] j + (cid:112) δ m D l σ ijm [ˆ h (1) m ] l ξ jm + m (cid:88) k =1 (cid:112) δ k − ˆΦ iqm,k D l σ qrk − [ˆ h (1) k − ] l ξ rk − (94) = δ m h (1) m ] l H lj f im [ˆ h (1) m ] j + m (cid:88) k =1 δ k − ipm,k [ˆ h (1) k − ] l H lj f pk − [ˆ h (1) k − ] j + (cid:112) δ m D l σ ijm ξ jm (cid:32)(cid:112) δ m − σ ll m − ξ l m − + m − (cid:88) k =1 (cid:112) δ k − ˆΦ ll m − ,k σ l l k − ξ l k − (cid:33) + (cid:112) δ ˆΦ iqm, D l σ qr ξ r ( (cid:112) δ σ ll ξ l ) (95) + m (cid:88) k =3 (cid:112) δ k − ˆΦ iqm,k D l σ qrk − ξ rk − (cid:32)(cid:112) δ k − σ lp k − ξ p k − + k − (cid:88) k (cid:48) =1 (cid:112) δ k (cid:48) − ˆΦ lp k − ,k (cid:48) σ p p k (cid:48) − ξ p k (cid:48) − (cid:33) , where we have made use of the fact that ˆ h (1)0 = 0 and ˆ h (1)1 = √ δ σ ξ in the last two lines above to rewrite thesummation (so that the summation over k in the last line above starts at k = 3 ).Therefore, using the above result, Lemma 4 and Eq. (87), we compute the expectation of [ˆ h (2) m +1 ] i : E [ˆ h (2) m +1 ] i = 12 m +1 (cid:88) k =1 δ k − ˆΦ ipm,k H lj f pm m (cid:88) k =1 δ k − ˆΦ ll m − ,k σ l l k − σ j l k − ˆΦ jj m − ,k . (96)Moreover, using Lemma 4, we obtain, for m = 1 , , . . . , M − , E [ˆ h (1) m +1 ] l [ H ˆ h (0) l ] lj [ˆ h (1) m +1 ] j = m +1 (cid:88) k =1 δ k − σ l l k − ˆΦ ll m,k [ H ˆ h (0) l ] lj ˆΦ jj m,k σ j l k − . (97)The first statement of the theorem then follows from Theorem 9 and Eq. (96)-(97) (with m := M − ): ˆ Q (¯ h δ ) = ∂ i l (¯ h δM ) M (cid:88) k =1 δ k − ˆΦ ipM − ,k M − (cid:88) m =1 δ m − ∂ lj [ f M ] p ˆΦ ll M − ,m σ l l m − σ j l m − ˆΦ jj M − ,m , (98) ˆ R (¯ h δ ) = M (cid:88) m =1 δ m − σ l l m − ˆΦ ll M − ,m [ H ¯ h δ l ] lj ˆΦ jj M − ,m σ j l m − . (99)The last statement of the theorem follows from taking straightforward bounds. Remark 2.
We remark that the computed ˆ h (2) m (a key step in the above proof), like that for h (2) t in the continuous-time case, has explicit dependence on the noise coefficient. It is only upon taking the expectation (see Eq. (96))that the dependence on the noise coefficient vanishes (whereas E h (2) t (cid:54) = 0 retains its dependence on the noisecoefficient). This fully reconciles with Remark 1. Remark 3.
Moreover, one can compute the variance of $\ell(\hat{h}_M)$ to be $\epsilon^2\,(\nabla \ell(\hat{h}^{(0)}_M))^T C\, \nabla \ell(\hat{h}^{(0)}_M) + O(\epsilon^3)$, as $\epsilon \to 0$, where $C$ is a PSD matrix whose $(l,j)$-entry is given by Eq. (91) with $m := M$. So we see that the spread of $\ell(\hat{h}_M)$ about its average is $O(\epsilon)$ as $\epsilon \to 0$.
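Before moving on, a quick numerical sanity check of the $\epsilon^2$ scaling in Theorem 8 can be instructive: for a fixed deterministic RNN and loss, the gap between the noise-averaged loss and the deterministic loss should shrink roughly by a factor of four when $\epsilon$ is halved. The sketch below is only a rough Monte Carlo check under arbitrary toy parameters (additive noise only, i.e. $\sigma_1 = 1$, $\sigma_2 = 0$ in (27), and a simple quadratic loss); it is not part of the paper's experiments.

```python
import numpy as np

def loss_gap(eps, M=20, d_x=3, d_h=8, n_mc=4000, delta=0.1, seed=0):
    """Monte Carlo estimate of E[loss(noisy h_M)] - loss(deterministic h_M)
    for a toy NRNN with drift (26) and additive noise; all parameters are
    arbitrary illustrative choices."""
    rng = np.random.default_rng(seed)
    xs = rng.standard_normal((M, d_x))
    A = -0.1 * np.eye(d_h)
    W = 0.1 * rng.standard_normal((d_h, d_h))
    U = 0.1 * rng.standard_normal((d_h, d_x))
    b = np.zeros(d_h)
    loss = lambda h: 0.5 * np.sum(h ** 2)

    def forward(noise_rng=None):
        h = np.zeros(d_h)
        for x in xs:
            f = A @ h + np.tanh(W @ h + U @ x + b)
            h = h + f * delta
            if noise_rng is not None:        # additive E-M noise term
                h = h + eps * np.sqrt(delta) * noise_rng.standard_normal(d_h)
        return h

    det_loss = loss(forward())
    noise_rng = np.random.default_rng(seed + 1)
    mc = np.mean([loss(forward(noise_rng)) for _ in range(n_mc)])
    return mc - det_loss

for eps in (0.2, 0.1, 0.05):
    print(eps, loss_gap(eps))   # the gap should scale roughly like eps**2
```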
E Classification Margin and Generalization Bound for Deterministic RNNs: Proof of Theorem 2 in the Main Paper

We recall the setting considered in the main paper before providing proofs of the results presented there.
Let $\mathcal{S}_N$ denote a set of training samples $s_n := (x_n, y_n)$ for $n = 1, \ldots, N$, where each input sequence $x_n = (x_{n,0}, x_{n,1}, \ldots, x_{n,M-1}) \in \mathcal{X} \subset \mathbb{R}^{d_x M}$ has a corresponding class label $y_n \in \mathcal{Y} = \{1, \ldots, d_y\}$. Following the statistical learning framework, these samples are assumed to be independently drawn from an underlying probability distribution $\mu$ on the sample space $\mathcal{S} = \mathcal{X} \times \mathcal{Y}$. An RNN-based classifier $g^\delta(x)$ is constructed in the usual way by taking
$$g^\delta(x) = \mathop{\mathrm{argmax}}_{i=1,\ldots,d_y} p_i(V\bar{h}^\delta_M[x]), \qquad (100)$$
where $p_i(x) = e^{x_i}/\sum_j e^{x_j}$ is the softmax function. Letting $\ell$ denote the cross-entropy loss, such a classifier is trained from $\mathcal{S}_N$ by minimizing the empirical risk (training error)
$$\mathcal{R}_N(g^\delta) := \frac{1}{N}\sum_{n=1}^N \ell(g^\delta(x_n), y_n)$$
as a proxy for the true (population) risk (testing error) $\mathcal{R}(g^\delta) = \mathbb{E}_{(x,y)\sim\mu}\,\ell(g^\delta(x), y)$ with $(x,y) \in \mathcal{S}$.
The measure used to quantify the prediction quality is the generalization error (or estimation error), which is the difference between the empirical risk of the classifier on the training set and the true risk:
$$\mathrm{GE}(g^\delta) := |\mathcal{R}(g^\delta) - \mathcal{R}_N(g^\delta)|. \qquad (101)$$
The classifier is a function of the output of the deterministic RNN, which is an Euler discretization of the ODE (1) with step sizes $\delta = (\delta_m)$. In particular, for the Lipschitz RNN,
$$\hat{\Phi}_{m,k} = \hat{J}_m \hat{J}_{m-1} \cdots \hat{J}_k, \qquad (102)$$
where $\hat{J}_l = I + \delta_l(A + D_l W)$, with $D^{ij}_l = a'([W\bar{h}^\delta_l + U\hat{x}_l + b]_i)\,e_{ij}$.
In the following, we let $\mathrm{conv}(\mathcal{X})$ denote the convex hull of $\mathcal{X}$. We denote $\hat{x}^m := (\hat{x}_0, \ldots, \hat{x}_m)$ so that $\hat{x} = \hat{x}^{M-1}$, and use the notation $f[x]$ to indicate the dependence of the function $f$ on the vector $x$. Moreover, we will need the following two definitions to characterize a training sample $s_i = (x_i, y_i)$.
Working in the above setting, we now recall and prove the second main result in the main paper, providing bounds on the classification margin and generalization error for the deterministic RNN classifiers $g^\delta$.

Theorem 10 (Classification margin and generalization bound for the deterministic RNN – Theorem 2 in the main paper). Suppose that Assumption A in the main paper holds. Assume that $o(s_i) > 0$ and
$$\gamma(s_i) := \frac{o(s_i)}{C\sum_{m=0}^{M-1}\delta_m \sup_{\hat{x} \in \mathrm{conv}(\mathcal{X})} \|\hat{\Phi}_{M,m+1}[\hat{x}]\|} > 0, \qquad (103)$$
where
$$C = \|V\|\Big(\max_{m=0,1,\ldots,M-1}\Big\|\frac{\partial f(\bar{h}^\delta_m, \hat{x}_m)}{\partial \hat{x}_m}\Big\|\Big) > 0$$
is a constant (in particular, $C = \|V\|(\max_{m=0,\ldots,M-1}\|D_m U\|)$ for Lipschitz RNNs), the $\hat{\Phi}_{m,k}$ are defined in (102) and the $\delta_m$ are the step sizes. Then, we have the following lower bound on the classification margin for the training sample $s_i$:
$$\gamma_d(s_i) \geq \gamma(s_i). \qquad (104)$$
Moreover, if we further assume that $\mathcal{X}$ is a (subset of a) $k$-dimensional manifold with $k \leq d_x M$, $\gamma := \min_{s_i \in \mathcal{S}_N} \gamma(s_i) > 0$, and $\ell(g^\delta(x), y) \leq L_g$ for all $s \in \mathcal{S}$, then for any $\delta' > 0$, with probability at least $1 - \delta'$,
$$\mathrm{GE}(g^\delta) \leq L_g\Bigg(\gamma^{-k/2}\sqrt{\frac{d_y C_M^k\, 2^{k+1}\log 2}{N}} + \sqrt{\frac{2\log(1/\delta')}{N}}\Bigg), \qquad (105)$$
where $C_M > 0$ is a constant that measures the complexity of $\mathcal{X}$, $N$ is the number of training examples and $d_y$ is the number of label classes.
In order to prove Theorem 10, we place ourselves in the algorithmic robustness framework of [80]. This framework provides bounds for the generalization error based on the robustness of a learning algorithm that learns a classifier $g$ by exploiting the structure of the training set $\mathcal{S}_N$. Robustness is, roughly speaking, the desirable property for a learning algorithm that if a testing sample is "similar" to a training sample, then the testing error is close to the training error (i.e., the algorithm is insensitive to small perturbations in the training data).
To ensure that our exposition is self-contained, we recall important definitions and results from [80, 67] to formalize the previous statement in the context of our deterministic RNNs in the following.

Definition 4.
Let S N be a training set and S the sample space. A learning algorithm is ( K, (cid:15) ( S N )) -robust if S can be partitioned into K disjoint sets denoted by K k , k = 1 , . . . , K : K k ⊂ S , k = 1 , . . . , K, (106) S = ∪ Kk =1 K k , and K k ∩ K k (cid:48) = ∅ , ∀ k (cid:54) = k (cid:48) , (107)such that for all s i ∈ S N and all s ∈ S , s i = ( x i , y i ) ∈ K k ∧ s = ( x , y ) ∈ K k = ⇒ | (cid:96) ( g ( x i ) , y i ) − (cid:96) ( g ( x ) , y ) | ≤ (cid:15) ( S N ) . (108)The above definition says that a robust learning algorithm selects a classifier g for which the losses of any s and s i in the same partition K k are close.The following result from Theorem 1 in [80] will be critical to the proof of Theorem 10. It provides ageneralization bound for robust algorithms. Theorem 11.
If a learning algorithm is ( K, (cid:15) ( S N )) -robust and (cid:96) ( g ( x ) , y ) ≤ M for all s = ( x , y ) ∈ S , forsome constant M > , then for any δ > , with probability at least − δ , GE ( g ) ≤ (cid:15) ( S N ) + M (cid:114) K log(2) + 2 log(1 /δ ) m . (109)Note that the above generalization bound is data-dependent, in contrast to bounds obtained via approaches basedon complexity or stability arguments that give bounds in terms of data agnostic measures such as the Rademachercomplexity or the VC dimension, which are found not sufficient for explaining the good generalization propertiesof deep neural networks.The number of partition K in the above can be bounded in terms of the covering number of the sample space S , which gives a way to measure the complexity of sets. We recall the definition of covering number in thefollowing. Definition 5 (Covering) . Let A be a set. We say that A is ρ -covered by a set A (cid:48) , with respect to the (pseudo-)metric d , if for all a ∈ A , there exists a (cid:48) ∈ A (cid:48) with d ( a, a (cid:48) ) ≤ ρ . We call the cardinality of the smallest A (cid:48) that ρ -covers A covering number, denoted by N ( S ; d, ρ ) .The covering number is the smallest number of (pseudo-)metric balls of radius ρ needed to cover S and wedenote it by N ( S ; d, ρ ) , where d denotes the (pseudo-)metric. The choice of metric d determines how efficientlyone may cover X . For example, the Euclidean metric d ( x, x (cid:48) ) = (cid:107) x − x (cid:48) (cid:107) for x, x (cid:48) ∈ X . The covering numberof many structured low-dimensional data models can be bounded in terms of their intrinsic properties. Since inour case the space S = X × Y , we write N ( S ; d, ρ ) ≤ d y · N ( X ; d, ρ ) , where d y is the number of label classes.We take d to be the Euclidean metric: d ( x , x (cid:48) ) = (cid:107) x − x (cid:48) (cid:107) for x , x (cid:48) ∈ X , unless stated otherwise. Lemma 5 (Example 27.1 from [65]) . Assume that
X ⊂ R m lies in a k -dimensional subspace of R m . Let c = max x ∈X (cid:107) x (cid:107) and take d to be the Euclidean metric. Then N ( X ; d, ρ ) ≤ (2 c √ k/ρ ) k . In other words, a subset, X , of a k -dimensional manifold has the covering number ( C M /ρ ) k , where C M > isa constant. We remark that other complexity measures such as Rademacher complexity can be bounded basedon the covering number (see [65] for details).The class of robust learning algorithms that is of interest to us is the large margin classifiers. We defineclassification margin in the following. Definition 6 (Classification margin) . The classification margin of a training sample s i = ( x i , y i ) measured by ametric d is defined as the radius of the largest d -metric ball in X centered at x i that is contained in the decisionregion associated with the class label y i , i.e., it is: γ d ( s i ) = sup { a : d ( x i , x ) ≤ a = ⇒ g ( x ) = y i ∀ x } . (110)31ntuitively, a larger classification margin allows a classifier to associate a larger region centered on a point x i in the input space to the same class. This makes the classifier less sensitive to input perturbations and a noisyperturbation of x i is still likely to fall within this region, keeping the classifier prediction. In this sense, theclassifier becomes more robust.The following result follows from Example 9 in [80]. Proposition 2.
If there exists a $\gamma > 0$ such that $\gamma_d(s_i) > \gamma$ for all $s_i \in \mathcal{S}_N$, then the classifier $g$ is $(d_y \cdot \mathcal{N}(\mathcal{X}; d, \gamma/2),\, 0)$-robust.

In our case the networks are trained by a loss (cross-entropy) that promotes separation of different classes at the network output. The training aims at maximizing a certain notion of score of each training sample.
Definition 7 (Score) . For a training sample s i = ( x i , y i ) , we define its score as o ( s i ) = min j (cid:54) = y i √ e y i − e j ) T S δ [ x i ] ≥ , (111)where e i ∈ R d y is the Kronecker delta vector with e ii = 1 and e ji = 0 for i (cid:54) = j , S δ [ x i ] := p ( V ¯ h δM [ x i ]) with ¯ h δM [ x i ] denoting the hidden state of the RNN, driven by the input sequence x i , at terminal index M .The RNN classifier g δ is defined as g δ ( x ) = arg max i ∈{ ,...,d y } S i [ x ] , (112)and the decision boundary between class i and class j in the output space is given by the hyperplane { z = p ( V ¯ h δM ) : z i = z j } . A positive score implies that at the network output, classes are separated by a margin thatcorresponds to the score. However, a large score may not imply a large classification margin – recall that theclassification margin is a function of the decision boundary in the input space, whereas the training algorithmaims at optimizing the decision boundary at the network output in the output space.We need the following lemma relating a pair of vectors in the input space and the output space. Lemma 6.
For any x , x (cid:48) ∈ X ⊂ R d x M , and a given RNN output functional F [ · ] , (cid:107)F [ x ] − F [ x (cid:48) ] (cid:107) ≤ sup ¯ x ∈ conv( X ) (cid:107) J [¯ x ] (cid:107) · (cid:107) x − x (cid:48) (cid:107) , (113) where J [ x ] = d F [ x ] /d x is the input-output Jacobian of the RNN output functional.Proof. Let t ∈ [0 , and define the function F ( t ) = F [ x + t ( x (cid:48) − x )] . Note that dF ( t ) dt = J [ x + t ( x (cid:48) − x )]( x (cid:48) − x ) . (114)Therefore, F [ x (cid:48) ] − F [ x ] = F (1) − F (0) = (cid:90) dF ( t ) dt dt = (cid:18)(cid:90) J [ x + t ( x (cid:48) − x )] dt (cid:19) ( x (cid:48) − x ) , (115)where we have used the fundamental theorem of calculus.Now, (cid:107)F [ x ] − F [ x (cid:48) ] (cid:107) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) J [ x + t ( x (cid:48) − x )] dt (cid:13)(cid:13)(cid:13)(cid:13) · (cid:107) ( x (cid:48) − x ) (cid:107) (116) ≤ sup x , x (cid:48) ∈X ,t ∈ [0 , (cid:107) J [ x + t ( x (cid:48) − x )] (cid:107) · (cid:107) ( x (cid:48) − x ) (cid:107) (117) ≤ sup ¯ x ∈ conv( X ) (cid:107) J [¯ x ] (cid:107) · (cid:107) x − x (cid:48) (cid:107) , (118)where we have used the fact that x + t ( x (cid:48) − x ) ∈ conv( X ) for all t ∈ [0 , to arrive at the last line. The proofis done. 32he following proposition provides classification margin bounds in terms of the score and input-output Jacobianassociated to the RNN classifier. Proposition 3.
Assume that a RNN classifier g δ ( x ) , defined in (112) , classifies a training sample x i with thescore o ( s i ) > . Then we have the following lower bound for the classification margin: γ d ( s i ) ≥ o ( s i )sup x ∈ conv( X ) (cid:107) J [ x ] (cid:107) , (119) where conv( X ) denotes the convex hull of X and J [ x ] = d F [ x ] /d x , with F [ x ] = p ( V ¯ h δM [ x ]) , is the input-output Jacobian associated to the RNN.Proof. The proof is essentially identical to that of Theorem 4 in [67]. We provide the full detail here forcompleteness.Denote o ( s i ) = o ( x ( i ) , y ( i ) ) , where x ( i ) := ( x ( i )0 , · · · , x ( i ) M − ) ∈ X ⊂ R d x M , and v ij = √ e i − e j ) , where e i ∈ R d y denotes the Kronecker delta vector.The classification margin of the training sample s i is: γ d ( s i ) = sup { a : (cid:107) x ( i ) − x (cid:107) ≤ a = ⇒ g δ ( x ) = y ( i ) ∀ x } (120) = sup { a : (cid:107) x ( i ) − x (cid:107) ≤ a = ⇒ o ( x , y ( i ) ) > ∀ x } . (121)By Definition 7, o ( x , y ( i ) ) > if and only if min j (cid:54) = y ( i ) v Ty ( i ) j F [ x ] > .On the other hand, min j (cid:54) = y ( i ) v Ty ( i ) j F [ x ] = min j (cid:54) = y ( i ) ( v Ty ( i ) j F [ x ( i ) ] + v Ty ( i ) j ( F [ x ] − F [ x ( i ) ])) (122) ≥ min j (cid:54) = y ( i ) v Ty ( i ) j F [ x ( i ) ] + min j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ] − F [ x ( i ) ]) (123) = o ( x ( i ) , y ( i ) ) + min j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ] − F [ x ( i ) ]) . (124)Therefore, o ( x ( i ) , y ( i ) ) + min j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ] − F [ x ( i ) ]) > implies that o ( x , y ( i ) ) > and so γ d ( s i ) ≥ sup (cid:26) a : (cid:107) x ( i ) − x (cid:107) ≤ a = ⇒ o ( x ( i ) , y ( i ) ) + min j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ] − F [ x ( i ) ]) > ∀ x (cid:27) (125) = sup (cid:26) a : (cid:107) x ( i ) − x (cid:107) ≤ a = ⇒ o ( x ( i ) , y ( i ) ) − max j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ( i ) ] − F [ x ]) > ∀ x (cid:27) (126) = sup (cid:26) a : (cid:107) x ( i ) − x (cid:107) ≤ a = ⇒ o ( x ( i ) , y ( i ) ) > max j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ( i ) ] − F [ x ]) ∀ x (cid:27) . (127)Now, using the fact that (cid:107) v y ( i ) j (cid:107) = 1 and Lemma 6, we have: max j (cid:54) = y ( i ) v Ty ( i ) j ( F [ x ( i ) ] − F [ x ]) ≤ sup ¯ x ∈ conv( X ) (cid:107) J [¯ x ] (cid:107) · (cid:107) x ( i ) − x (cid:107) . (128)Using this inequality gives: γ d ( s i ) ≥ sup (cid:40) a : (cid:107) x ( i ) − x (cid:107) ≤ a = ⇒ o ( x ( i ) , y ( i ) ) > sup ¯ x ∈ conv( X ) (cid:107) J [¯ x ] (cid:107) · (cid:107) x ( i ) − x (cid:107) ∀ x (cid:41) (129) ≥ o ( x ( i ) , y ( i ) )sup ¯ x ∈ conv( X ) (cid:107) J [¯ x ] (cid:107) . (130)The proof is done.We now have all the needed ingredients to prove Theorem 10.33 roof of Theorem 10. By Proposition 2, Lemma 5, and our assumption on complexity of the sample space, theRNN classifier is ( d y · (2 C M /γ ) k , -robust, for some constant C M > . Due to Theorem 11 (with M := L g there), it remains to prove the upper bound (104) for the classification margin of a training sample to completethe proof. The inequality (105) then follows immediately from Theorem 11.By Proposition 3, we have γ d ( s i ) ≥ o ( s i )sup ˆ x ∈ conv( X ) (cid:107) J [ˆ x ] (cid:107) , (131)where J [ˆ x ] := d F [ˆ x ] /d ˆ x is the input-output Jacobian associated to the RNN. 
We now have all the needed ingredients to prove Theorem 10.

Proof of Theorem 10. By Proposition 2, Lemma 5, and our assumption on the complexity of the sample space, the RNN classifier is $(d_y \cdot (2 C_M/\gamma)^{k}, 0)$-robust, for some constant $C_M > 0$. Due to Theorem 11 (with $M := L_g$ there), it remains to prove the upper bound (104) for the classification margin of a training sample to complete the proof. The inequality (105) then follows immediately from Theorem 11.

By Proposition 3, we have
$$\gamma_d(s_i) \geq \frac{o(s_i)}{\sup_{\hat{x} \in \mathrm{conv}(\mathcal{X})} \|J[\hat{x}]\|}, \qquad (131)$$
where $J[\hat{x}] := d\mathcal{F}[\hat{x}]/d\hat{x}$ is the input-output Jacobian associated to the RNN. Therefore, to complete the proof it suffices to show that
$$\|J[\hat{x}]\| \leq C \sum_{m=0}^{M-1} \delta_m \|\hat{\Phi}_{M,m+1}[\hat{x}]\|, \qquad (132)$$
where $C$ is the constant from the theorem and $\hat{\Phi}_{m+1,k}$, $0 \leq k \leq m \leq M-1$, satisfies:
$$\hat{\Phi}_{k,k} = I, \qquad (133)$$
$$\hat{\Phi}_{m+1,k} = \hat{J}_m \hat{\Phi}_{m,k}, \qquad (134)$$
where $\hat{J}_m = I + \delta_m f'(\hat{h}^{(0)}_m, \hat{x}_m)$ (with the $\hat{h}^{(0)}_m$ satisfying Eq. (80); recall that we replace the superscript $\delta$ by a hat when denoting the $\delta$-dependent approximating solutions for the sake of notational cleanliness) and the $\delta_m > 0$ are the step sizes.

Iterating (134) up to the $(m+1)$th step, for $m \geq k$, gives:
$$\hat{\Phi}_{m+1,k} = \hat{J}_m \hat{J}_{m-1} \cdots \hat{J}_k =: \prod_{l=k}^{m} \hat{J}_l. \qquad (135)$$
Note that
$$\hat{\Phi}_{m+1,k} = \frac{\partial \hat{h}^{(0)}_{m+1}}{\partial \hat{h}^{(0)}_m} \frac{\partial \hat{h}^{(0)}_m}{\partial \hat{h}^{(0)}_{m-1}} \cdots \frac{\partial \hat{h}^{(0)}_{k+1}}{\partial \hat{h}^{(0)}_k} = \frac{d \hat{h}^{(0)}_{m+1}}{d \hat{h}^{(0)}_k}. \qquad (136)$$
Now, applying the chain rule:
$$J[\hat{x}] = \frac{\partial p(V \hat{h}^{(0)}_M)}{\partial \hat{h}^{(0)}_M} \sum_{j=0}^{M-1} \frac{\partial \hat{h}^{(0)}_M}{\partial \hat{h}^{(0)}_{M-1}} \cdots \frac{\partial \hat{h}^{(0)}_{j+2}}{\partial \hat{h}^{(0)}_{j+1}} \frac{\partial \hat{h}^{(0)}_{j+1}}{\partial \hat{x}_j}, \qquad (137)$$
where $p$ is the softmax function. We compute:
$$\frac{\partial p(V \hat{h}^{(0)}_M)}{\partial \hat{h}^{(0)}_M} = E V, \qquad (138)$$
where $E_{ij} = p_i(e_{ij} - p_j)$. From (136), we have
$$\frac{\partial \hat{h}^{(0)}_M}{\partial \hat{h}^{(0)}_{M-1}} \cdots \frac{\partial \hat{h}^{(0)}_{j+2}}{\partial \hat{h}^{(0)}_{j+1}} = \hat{\Phi}_{M,j+1}. \qquad (139)$$
On the other hand,
$$\frac{\partial \hat{h}^{(0)}_{j+1}}{\partial \hat{x}_j} = \delta_j \frac{\partial f(\hat{h}^{(0)}_j, \hat{x}_j)}{\partial \hat{x}_j}, \qquad (140)$$
for $j = 0, 1, \dots, M-1$. Note that for Lipschitz RNNs, we have $\frac{\partial \hat{h}^{(0)}_{j+1}}{\partial \hat{x}_j} = \delta_j D^j U$, where $D^l_{ij} = a'([W \hat{h}^{(0)}_l + U \hat{x}_l + b]_i)\, e_{ij}$ and $e_{ij}$ denotes the Kronecker delta.

Using the results of the above computations gives:
$$J[\hat{x}] = E V \sum_{m=0}^{M-1} \delta_m \hat{\Phi}_{M,m+1} \frac{\partial f(\hat{h}^{(0)}_m, \hat{x}_m)}{\partial \hat{x}_m}. \qquad (141)$$
Therefore,
$$\|J[\hat{x}]\| \leq \|E V\| \sum_{m=0}^{M-1} \delta_m \|\hat{\Phi}_{M,m+1}\| \left\| \frac{\partial f(\hat{h}^{(0)}_m, \hat{x}_m)}{\partial \hat{x}_m} \right\| \qquad (142)$$
$$\leq \|V\| \left( \max_{m=0,1,\dots,M-1} \left\| \frac{\partial f(\hat{h}^{(0)}_m, \hat{x}_m)}{\partial \hat{x}_m} \right\| \right) \sum_{m=0}^{M-1} \delta_m \|\hat{\Phi}_{M,m+1}\| = C \sum_{m=0}^{M-1} \delta_m \|\hat{\Phi}_{M,m+1}\|, \qquad (143)$$
where we have used $\|E\| \leq 1$. For the Lipschitz RNN, we have $C := \|V\| \left( \max_{m=0,1,\dots,M-1} \|D^m U\| \right)$. The proof is done.
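The recursion (133)-(134) and the bound (143) are easy to instantiate numerically. The sketch below does so for a toy explicit-Euler tanh RNN (the softmax Jacobian $E$ and the stabilizing $Ah$ drift of the Lipschitz RNN are omitted for brevity, which only loosens the comparison); all dimensions and weights are assumptions of this illustration.

```python
# Sketch of the bound (143): build Phi_hat_{M,m+1} via the recursion (133)-(134)
# and compare C * sum_m delta_m ||Phi_hat_{M,m+1}|| with the spectral norm of the
# exact input-output Jacobian assembled as in (141) (with the softmax factor dropped).
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h, d_y, M = 3, 8, 4, 12
W = 0.2 * rng.normal(size=(d_h, d_h))
U = 0.2 * rng.normal(size=(d_h, d_x))
V = 0.2 * rng.normal(size=(d_y, d_h))
delta = np.full(M, 0.1)                       # step sizes delta_m
x = rng.normal(size=(M, d_x))

h = np.zeros(d_h)                             # deterministic hidden state h_hat^(0)
J_h, DU = [], []
for m in range(M):
    z = W @ h + U @ x[m]
    D = np.diag(1.0 - np.tanh(z) ** 2)        # D^m = diag(a'(W h_m + U x_m))
    J_h.append(np.eye(d_h) + delta[m] * D @ W)    # J_hat_m = I + delta_m * df/dh
    DU.append(D @ U)                              # df/dx at step m
    h = h + delta[m] * np.tanh(z)

# Phi_hat_{M,m+1} = J_hat_{M-1} ... J_hat_{m+1}  (with Phi_hat_{M,M} = I)
Phi = [np.eye(d_h)] * M
for m in range(M - 2, -1, -1):
    Phi[m] = Phi[m + 1] @ J_h[m + 1]

J_exact = V @ sum(delta[m] * Phi[m] @ DU[m] for m in range(M))
C = np.linalg.norm(V, 2) * max(np.linalg.norm(du, 2) for du in DU)
bound = C * sum(delta[m] * np.linalg.norm(Phi[m], 2) for m in range(M))
print(np.linalg.norm(J_exact, 2), "<=", bound)
```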
It follows immediately from Eq. (143) that the following condition on the stability of the hidden states of the deterministic RNN is sufficient to guarantee stability with respect to the input sequence.

Corollary 1. Fix $M$ and assume that $C \sum_{m=0}^{M-1} \delta_m < 1$. Then, $\|\hat{\Phi}_{M,m+1}\| \leq 1$ for $m = 0, \dots, M-1$ implies that $\|J[\hat{x}]\| < 1$.

F Stability and Noise-Induced Stabilization for NRNNs: Proof of Theorem 3 in the Main Paper
We begin by discussing stochastic stability for SDEs, which are the underlying continuous-time models for our NRNNs. Although the additional complexities of SDEs over ODEs often necessitate more involved analyses, many of the same ideas typically carry across. This is also true for stability. A typical approach for proving stability of ODEs involves Lyapunov functions; in Chapter 4 of [55], such approaches are extended to SDEs. This gives way to three notions of stability: (1) stability in probability; (2) moment stability; and (3) almost sure stability. Their definitions are provided in Definitions 4.2.1, 4.3.1, and 4.4.1 in [55], and are repeated below for convenience.

To preface the definition, consider initializing (25) at two different random variables $h_0$ and $h_0' := h_0 + \epsilon_0$, where $\epsilon_0 \in \mathbb{R}^{d_h}$ is a constant non-random perturbation with $\|\epsilon_0\| \leq \delta$. The resulting hidden states, $h_t$ and $h_t'$, are set to satisfy (25) with the same Brownian motion $B_t$, starting from their initial values $h_0$ and $h_0'$, respectively. The evolution of $\epsilon_t = h_t' - h_t$ satisfies
$$d\epsilon_t = A\epsilon_t\, dt + \Delta a_t(\epsilon_t)\, dt + \Delta\sigma_t(\epsilon_t)\, dB_t, \qquad (144)$$
where $\Delta a_t(\epsilon_t) = a(W h_t' + U x_t + b) - a(W h_t + U x_t + b)$ and $\Delta\sigma_t(\epsilon_t) = \sigma_t(h_t + \epsilon_t) - \sigma_t(h_t)$. Since $\Delta a_t(0) = 0$ and $\Delta\sigma_t(0) = 0$ for all $t \in [0,T]$, $\epsilon_t = 0$ is a trivial equilibrium of (144).

Definition 8 (Stability for SDEs). The trivial solution of the SDE (144) is
(i) stochastically stable (or, stable in probability) if for every $\epsilon \in (0,1)$ and $r > 0$, there exists a $\delta = \delta(\epsilon, r) > 0$ such that $\mathbb{P}(\|\epsilon_t\| < r \text{ for all } t \geq 0) \geq 1 - \epsilon$ whenever $\|\epsilon_0\| < \delta$.
(ii) stochastically asymptotically stable if it is stochastically stable and, moreover, for every $\epsilon \in (0,1)$, there exists a $\delta = \delta(\epsilon) > 0$ such that $\mathbb{P}(\lim_{t\to\infty} \epsilon_t = 0) \geq 1 - \epsilon$ whenever $\|\epsilon_0\| < \delta$.
(iii) almost surely exponentially stable if $\limsup_{t\to\infty} t^{-1} \log \|\epsilon_t\| < 0$ with probability one whenever $\|\epsilon_0\| < \delta$.
(iv) $p$-th moment exponentially stable if there exist $\lambda, C > 0$ such that $\mathbb{E}\|\epsilon_t\|^p \leq C \|\epsilon_{t_0}\|^p e^{-\lambda(t - t_0)}$ for all $t \geq t_0$.

The properties in Definition 8 are said to hold globally if they also hold under no restrictions on $\epsilon_0$. Stability in probability does not quantify rates of convergence, and is implied by almost sure exponential stability. On the other hand, for our class of SDEs, $p$-th moment exponential stability implies almost sure exponential stability (see Theorem 4.2 in [55]).

One critical difference between the Lyapunov stability theory for ODEs and that for SDEs lies in the stochastic stabilization phenomenon. Let $\mathcal{L}$ be the infinitesimal generator (for a given input signal $x_t$) of the diffusion process described by the SDE (144):
$$\mathcal{L} = \frac{\partial}{\partial t} + \sum_i \left( (A\epsilon)_i + \Delta a^i_t(\epsilon) \right) \frac{\partial}{\partial \epsilon_i} + \frac{1}{2} \sum_{i,j} \left[ \Delta\sigma_t(\epsilon) \Delta\sigma_t(\epsilon)^\top \right]_{ij} \frac{\partial^2}{\partial \epsilon_i \partial \epsilon_j}. \qquad (145)$$
The generator for the corresponding ODE arises by taking $\Delta\sigma_t \equiv 0$. In classical Lyapunov theory for ODEs, the existence of a non-negative Lyapunov function $V$ satisfying $\mathcal{L}V \leq 0$ in some neighbourhood of the equilibrium is both necessary and sufficient for stability (see Chapter 4 of [38]).
For SDEs, it has been shown that this condition is sufficient, but no longer necessary [55, 54]. This is due to the nature of stochastic stabilization: the addition of noise can have the surprising effect of increased stability over the deterministic counterpart. Of course, this is not universally the case, as some forms of noise can be sufficiently extreme to induce instability; see Section 4.5 in [55].

Identifying sufficient conditions which quantify the stochastic stabilization phenomenon is especially useful in our setting, and, as it turns out (see also [46]), these are most easily obtained for almost sure exponential stability. Therefore, our stability analysis will focus on establishing almost sure exponential stability. The objective is to analyze such stability of the solution $\epsilon_t = 0$, that is, to see how the final state $\epsilon_T$ (and hence the output $y_T' - y_T = V\epsilon_T$ of the RNN) changes for an arbitrarily small initial perturbation $\epsilon_0 \neq 0$.

To this end, we consider an extension of the Lyapunov exponent to SDEs at the level of sample paths [55].

Definition 9 (Almost sure global exponential stability). The sample (or pathwise) Lyapunov exponent of the trivial solution of (144) is
$$\Lambda = \limsup_{t\to\infty} t^{-1} \log \|\epsilon_t\|.$$
The trivial solution $\epsilon_t = 0$ is almost surely globally exponentially stable if $\Lambda$ is almost surely negative for all $\epsilon_0 \in \mathbb{R}^{d_h}$.

For the sample Lyapunov exponent $\Lambda(\omega)$, there exist a constant $C > 0$ and a random variable $0 \leq \tau(\omega) < \infty$ such that for all $t > \tau(\omega)$, $\|\epsilon_t\| = \|h_t' - h_t\| \leq C e^{\Lambda t}$ almost surely. Therefore, almost sure exponential stability implies that almost all sample paths of (144) tend to the equilibrium solution $\epsilon = 0$ exponentially fast. With this definition in hand, we state and prove our primary stability result. This is equivalent to Theorem 3 in the main paper.

Theorem 12 (Bounds for the sample Lyapunov exponent of the trivial solution). Assume that Assumption C holds. Suppose that $a$ is $L_a$-Lipschitz, $0 \leq a_\Delta(\epsilon, t)^T \epsilon \leq L_a \|\epsilon\|^2$ and $\sigma_1^2 \|\epsilon\|^2 \leq \mathrm{tr}(\sigma_\Delta(\epsilon, t)^T \sigma_\Delta(\epsilon, t)) \leq \sigma_2^2 \|\epsilon\|^2$ for all $\epsilon \in \mathbb{R}^{d_h}$ and $t \in [0,T]$, for some $L_a > 0$ and $\sigma_1, \sigma_2 \geq 0$. Then, with probability one,
$$-\sigma_2^2 + \frac{\sigma_1^2}{2} + \lambda_{\min}(A_{\mathrm{sym}}) \leq \Lambda \leq -\sigma_1^2 + \frac{\sigma_2^2}{2} + L_a \sigma_{\max}(W) + \lambda_{\max}(A_{\mathrm{sym}}), \qquad (146)$$
for any $\epsilon_0 \in \mathbb{R}^{d_h}$.

To establish the bounds in Theorem 12, we appeal to the following theorem, which arises from combining Theorems 4.3.3 and 4.3.5 in [55] in the case $p = 2$. Here, for a function $V$, we let $V_\epsilon = \partial V / \partial \epsilon$.

Theorem 13 (Stochastic Lyapunov theorem). If there exist a function $V \in C^{2,1}(\mathbb{R}^{d_h} \times \mathbb{R}_+; \mathbb{R}_+)$ and constants $c_1, C_1 > 0$, $c_2, C_2 \in \mathbb{R}$, $c_3, C_3 \geq 0$ such that for all $\epsilon \neq 0$ and $t \geq t_0$,
(i) $c_1 \|\epsilon\|^2 \leq V(\epsilon, t) \leq C_1 \|\epsilon\|^2$,
(ii) $c_2 V(\epsilon, t) \leq \mathcal{L}V(\epsilon, t) \leq C_2 V(\epsilon, t)$, and
(iii) $c_3 V(\epsilon, t)^2 \leq \|V_\epsilon(\epsilon, t) \Delta\sigma_t(\epsilon)\|_F^2 \leq C_3 V(\epsilon, t)^2$,
then, with probability one, the Lyapunov exponent $\Lambda$ lies in the interval
$$\frac{c_2}{2} - \frac{C_3}{4} \leq \Lambda \leq \frac{C_2}{2} - \frac{c_3}{4}. \qquad (147)$$
The proof of Theorem 13 involves the Itô formula, an exponential martingale inequality, and a Borel-Cantelli type argument. The functions $V$ above are called stochastic Lyapunov functions, and the use of the theorem involves the construction of such functions. We are now in a position to prove Theorem 12, and will find that the choice $V(\epsilon, t) = \|\epsilon\|^2$ suffices.

Proof of Theorem 12. It suffices to verify the conditions of Theorem 13 with $V(\epsilon, t) = V(\epsilon) = \|\epsilon\|^2$. Clearly (i) is satisfied. To show (iii), by the conditions on $\Delta\sigma_t$, we have that $2\sigma_1 \|\epsilon\|^2 \leq \|V_\epsilon(\epsilon) \Delta\sigma_t(\epsilon)\|_F \leq 2\sigma_2 \|\epsilon\|^2$. It remains only to show (ii). Observe that
$$\mathcal{L}V(\epsilon) = \epsilon^\top (A + A^\top)\epsilon + 2\Delta a_t(\epsilon)^\top \epsilon + \mathrm{tr}(\Delta\sigma_t(\epsilon) \Delta\sigma_t(\epsilon)^\top).$$
Since $0 \leq \Delta a_t(\epsilon)^\top \epsilon$ and
$$|\Delta a_t(\epsilon)^\top \epsilon| \leq \|a(W h_t' + U x_t + b) - a(W h_t + U x_t + b)\| \|\epsilon\| \leq L_a \|W\epsilon\| \|\epsilon\| \leq L_a \sigma_{\max}(W) \|\epsilon\|^2,$$
it follows that
$$\mathcal{L}V(\epsilon) \leq \left(2\lambda_{\max}(A_{\mathrm{sym}}) + 2 L_a \sigma_{\max}(W) + \sigma_2^2\right)\|\epsilon\|^2, \quad \text{and} \quad \mathcal{L}V(\epsilon) \geq \left(2\lambda_{\min}(A_{\mathrm{sym}}) + \sigma_1^2\right)\|\epsilon\|^2.$$
The bound (146) now follows from (147) in Theorem 13 with $c_1 = C_1 = 1$, $c_2 = 2\lambda_{\min}(A_{\mathrm{sym}}) + \sigma_1^2$, $C_2 = 2\lambda_{\max}(A_{\mathrm{sym}}) + 2 L_a \sigma_{\max}(W) + \sigma_2^2$, $c_3 = 4\sigma_1^2$, and $C_3 = 4\sigma_2^2$. The proof is done.
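As a numerical illustration of Theorem 12 (not part of the paper's experiments), the following sketch integrates the perturbation SDE (144) by Euler-Maruyama, estimates the sample Lyapunov exponent $t^{-1}\log\|\epsilon_t\|$, and compares it with the interval (146). The concrete choices in the sketch, listed in the leading comment, are assumptions made so that the hypotheses of the theorem hold by construction.

```python
# Monte Carlo illustration of Theorem 12.  Assumed setup: a = tanh, A = -0.8*I,
# W = w*I with w > 0 (so Delta a_t(eps)^T eps >= 0), a constant bias b, zero input,
# and Delta sigma_t(eps) = sigma*eps (so sigma_1 = sigma_2 = sigma).
import numpy as np

rng = np.random.default_rng(0)
d_h, w, sigma, b = 10, 0.3, 0.6, 0.6
dt, T = 1e-3, 200.0
A = -0.8 * np.eye(d_h)
W = w * np.eye(d_h)
L_a = 1.0                                     # tanh is 1-Lipschitz

A_sym_eigs = np.linalg.eigvalsh(0.5 * (A + A.T))
lower = A_sym_eigs.min() + 0.5 * sigma**2 - sigma**2
upper = A_sym_eigs.max() + L_a * w + 0.5 * sigma**2 - sigma**2   # sigma_max(W) = w here

h = rng.normal(size=d_h)                      # base hidden trajectory h_t
eps = 1e-3 * rng.normal(size=d_h)             # small initial perturbation eps_0
eps0_norm = np.linalg.norm(eps)
for _ in range(int(T / dt)):
    dB = np.sqrt(dt) * rng.normal()           # shared scalar Brownian increment
    delta_a = np.tanh(W @ (h + eps) + b) - np.tanh(W @ h + b)
    h = h + dt * (A @ h + np.tanh(W @ h + b)) + sigma * h * dB
    eps = eps + dt * (A @ eps + delta_a) + sigma * eps * dB

Lambda_hat = np.log(np.linalg.norm(eps) / eps0_norm) / T
# For this configuration the finite-time estimate typically lands inside (146).
print(f"{lower:.2f} <= {Lambda_hat:.2f} <= {upper:.2f}")
```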
Remark 4. To see whether the bounds in Theorem 12 are indeed sharp (at least in certain cases), consider the linear SDE $dH_t = A H_t\, dt + B H_t\, dW_t$, where $A \in \mathbb{R}^{d_h \times d_h}$, $B = \sigma I$, $\sigma \in \mathbb{R}$, and $W_t$ is a scalar Wiener process. Then, since $A$ and $B$ commute, they can be simultaneously diagonalized, and so the linear SDE can be reduced via a transformation to a set of independent one-dimensional linear SDEs. In particular, one can show that $H_t = \exp\left((A - B^2/2)t + B W_t\right) H_0$, and the Lyapunov exponents $\Lambda$ of this system are the real parts of the eigenvalues of $A - B^2/2$. Note that $\lambda_{\min}(A_{\mathrm{sym}} - B^2/2) \leq \Lambda \leq \lambda_{\max}(A_{\mathrm{sym}} - B^2/2)$ a.s. Since $B = \sigma I$, this inequality implies Eqn. (146) with $L_a := 0$. The bounds are tight in the scalar case ($d_h = 1$ and $A$ a scalar), with the inequality becoming an equality.
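The closed form in Remark 4 also makes the stabilization effect easy to inspect numerically; the short snippet below (with an assumed $2 \times 2$ matrix $A$) shows the exponents $\mathrm{Re}\,\lambda_i(A) - \sigma^2/2$ turning negative once the multiplicative noise is strong enough.

```python
# Closed-form check of Remark 4: for dH_t = A H_t dt + sigma H_t dW_t the sample
# Lyapunov exponents are the real parts of the eigenvalues of A - (sigma^2/2) I,
# so strong enough multiplicative noise stabilizes a deterministically unstable A.
import numpy as np

A = np.array([[0.1, 1.0],
              [-1.0, 0.1]])                   # Re(eig(A)) = 0.1 > 0: dH = A H dt is unstable
for sigma in (0.0, 0.3, 0.5, 0.7):
    exponents = np.linalg.eigvals(A - 0.5 * sigma**2 * np.eye(2)).real
    print(f"sigma = {sigma:.1f}: exponents = {np.round(exponents, 3)}")
# Once sigma^2/2 > 0.1 (sigma above roughly 0.447), both exponents are negative and
# the trivial solution is almost surely exponentially stable: stochastic stabilization.
```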
G Experimental Details

Following [19], we construct the hidden-to-hidden weight matrices $A$ and $W$ as
$$A = T(B, \beta_a, \gamma_a) := (1 - \beta_a) \cdot (B + B^T) + \beta_a \cdot (B - B^T) - \gamma_a I, \qquad (148)$$
$$W = T(C, \beta_w, \gamma_w) := (1 - \beta_w) \cdot (C + C^T) + \beta_w \cdot (C - C^T) - \gamma_w I. \qquad (149)$$
Here, $B$ and $C$ denote weight matrices that have the same dimensions as $A$ and $W$. The tuning parameters $\gamma_a$ and $\gamma_w$ can be used to increase dampening (see the code sketch further below). We initialize the weight matrices by sampling weights from the normal distribution $\mathcal{N}(0, \sigma_{\mathrm{init}}^2)$, where $\sigma_{\mathrm{init}}^2$ is the variance. Table 4 summarizes the tuning parameters that we have used in our experiments. We train our models for 100 epochs, with a scheduled learning rate decay at epoch 90. We use Adam with default parameters for minimizing the objective.

Table 4: Tuning parameters used to train the NRNN.

Name          | d_h | lr    | decay | β    | γ_a   | γ_w   | ε    | σ_init | add. noise | mult. noise
Ordered MNIST | 128 | 0.001 | 0.1   | 0.75 | 0.001 | 0.001 | 0.01 | -      | -          | -

We performed a random search to obtain the tuning parameters. Since our model is closely related to the Lipschitz RNN, we started with the tuning parameters proposed in [19]. We evaluated different noise levels, both for multiplicative and additive noise, over a range of small values. We tuned the levels of noise injection so that the models achieve state-of-the-art performance on clean input data. Further, we observed that the robustness of the model does not improve significantly when it is trained with increased levels of noise injection. Overall, our experiments indicated that the model is relatively insensitive to the particular amount of additive and multiplicative noise in the small noise regime.

Further, we note that we only considered models that used a combination of additive and multiplicative noise injections. One could also train models using either only additive or only multiplicative noise injections; we did not investigate in detail the trade-offs between the different strategies. The motivation for our experiments was to demonstrate that (i) models trained with noise injections can achieve state-of-the-art performance on clean input data, and (ii) such models are also more resilient to input perturbations.
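A minimal PyTorch sketch of the parametrization (148)-(149) is given below; the class name is ours, and the default initialization scale is left as a free argument since the exact value of $\sigma_{\mathrm{init}}$ is a tuning parameter.

```python
# Minimal sketch of the hidden-to-hidden weight construction (148)-(149).
# B (resp. C) is a free d_h x d_h parameter; beta trades off the symmetric and
# skew-symmetric parts, and gamma adds dampening on the diagonal.
import torch
import torch.nn as nn

class HiddenWeight(nn.Module):
    def __init__(self, d_h: int, beta: float, gamma: float, sigma_init: float = 0.1):
        super().__init__()
        # sigma_init is an illustrative default; the paper treats it as a tuning parameter.
        self.B = nn.Parameter(sigma_init * torch.randn(d_h, d_h))
        self.beta, self.gamma, self.d_h = beta, gamma, d_h

    def forward(self) -> torch.Tensor:
        sym = self.B + self.B.T
        skew = self.B - self.B.T
        return ((1.0 - self.beta) * sym + self.beta * skew
                - self.gamma * torch.eye(self.d_h, device=self.B.device))

# A = T(B, beta_a, gamma_a) and W = T(C, beta_w, gamma_w) with Table 4 values.
A = HiddenWeight(d_h=128, beta=0.75, gamma=0.001)
W = HiddenWeight(d_h=128, beta=0.75, gamma=0.001)
A_mat, W_mat = A(), W()    # materialize the weight matrices for the recurrent update
```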
For establishing a fair set of baselines, we used the following implementations and prescribed tuning parameters for the other models that we considered.

• Exponential RNN. We used the following implementation: https://github.com/Lezcano/expRNN. We used the default parameters. We trained the model, with hidden dimension $d_h = 128$, for 100 epochs.
• CoRNN. We used the following implementation, provided as part of the Supplementary Material: https://openreview.net/forum?id=F3s69XzWOia. We used the default parameters proposed by the authors for training the model with hidden dimension $d_h = 128$. We trained the model for 100 epochs, with a learning rate decay at epoch 90.
• Lipschitz RNN. We used the following implementation, provided as part of the Supplementary Material: https://openreview.net/forum?id=-N7PBXqOUJZ. We used the default parameters proposed by the authors for training the model with hidden dimension $d_h = 128$. We trained the model for 100 epochs, with a learning rate decay at epoch 90.
• Antisymmetric RNN. To the best of our knowledge, there is no public implementation by the authors for this model. However, the Antisymmetric RNN can be seen as a special case of the Lipschitz RNN or the NRNN, without the stabilizing term $A$ and without noise injection. We trained this model using our own implementation and the following tuning parameters: $\beta = 1.0$, $\gamma = 0.$ , lr $= 0.$ , $\epsilon = 0.$ . We trained the model for 100