RNN-based Online Learning: An Efficient First-Order Optimization Algorithm with a Convergence Guarantee
N. Mert Vural, Selim F. Yilmaz, Fatih Ilhan, and Suleyman S. Kozat, Senior Member, IEEE
Abstract—We investigate online nonlinear regression with continually running recurrent neural networks (RNNs), i.e., RNN-based online learning. For RNN-based online learning, we introduce an efficient first-order training algorithm that is theoretically guaranteed to converge to the optimum network parameters. Our algorithm is truly online in that it does not make any assumption on the learning environment to guarantee convergence. Through numerical simulations, we verify our theoretical results and illustrate significant performance improvements achieved by our algorithm with respect to state-of-the-art RNN training methods.
Index Terms—Online learning, neural network training, recurrent neural networks, sequential learning, regression, online gradient descent.

This work is supported in part by TUBITAK Contract No. 117E153. N. M. Vural, S. F. Yilmaz, F. Ilhan and S. S. Kozat are with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey, e-mail: [email protected], [email protected], fi[email protected], [email protected].
I. INTRODUCTION
Prediction of individual sequences is one of the main subjects of interest in the contemporary online learning literature [1]. In this problem, we sequentially receive a data sequence related to a desired signal to predict the signal's next value [2]. This problem is also known as online regression and is extensively studied in the neural network [3], [4], signal processing [5], [6], and machine learning literatures [1], [2]. In these studies, nonlinear approaches are generally employed since linear modeling is inadequate for a wide range of applications due to the constraint of linearity [3], [6].

For online regression, there exists a wide range of nonlinear approaches in the fields of machine learning and signal processing [6]–[8]. However, these approaches usually suffer from prohibitive computational requirements and may provide poor performance due to overfitting and stability issues [5]. Adopting neural networks is another method for online nonlinear regression due to their success in approximating nonlinear functions [3], [9]. However, neural network-based regression algorithms are shown to be prone to issues such as overfitting or inadequate performance in certain applications [10], [11]. To overcome the limitations of shallow networks, neural networks composed of multiple layers, i.e., deep neural networks (DNNs), have recently been introduced. In DNNs, each layer performs a feature extraction based on the previous layers, which enables them to model highly nonlinear structures [12], [13]. However, this layered structure performs poorly in capturing time dependencies of the data due to its lack of temporal memory. Therefore, DNNs provide only limited performance in processing temporal data and modeling time series [14]. To remedy this issue, recurrent neural networks (RNNs) are used, as these networks have a feedback connection that enables them to store past information [7], [15]. Through their many variants, recurrent neural networks have seen remarkable empirical success in a broad range of sequential learning domains [4], [16]–[18]. In this study, we are interested in online nonlinear regression with RNN-based networks due to their superior performance in capturing time dependencies.

For RNNs, there exists a wide range of online training methods to learn network parameters [4], [19]–[23]. Among them, the first-order methods are commonly preferred in practice due to their computational efficiency [19], [22], [23]. However, using gradient-based optimization methods for RNN-based online learning is challenging due to the divergence problems caused by the exploding gradient problem [24]. In addition, finding an effective learning rate for first-order methods requires time-consuming search algorithms [25], which costs a significant amount of time and effort in practical applications.

To resolve these problems, several heuristic methods have been proposed. For the divergence problems: Krueger et al. [26] penalized the squared distance between the norms of successive hidden states to keep the RNN model stable during training. Mikolov et al. [26] and Pascanu et al. [27] showed that gradient clipping helps to reduce exploding gradients in practice. To overcome the learning rate-related problems: Blier et al. [28] applied multiple learning rates with adaptive randomization, which is shown to perform close to stochastic gradient descent (SGD) with the optimal learning rate. As an alternative approach, Orabona et al. [29] randomly performed gradient descent updates with a fixed learning rate, which is again empirically shown to be successful in neural network training.

While these heuristics are reasonably successful in practice, they are ad hoc and based on empirical observations, which may not necessarily be applicable in online settings. There are also mathematical studies in the literature that provide theoretical performance guarantees for RNN-based learning: Hardt et al. [30] showed that the gradient descent algorithm learns the globally optimum weights when the learning model is a single input single output linear dynamical system. Oymak et al. [31] extended this result to contractive nonlinear dynamical systems by assuming that the ground truth hidden state vectors are observed. Additionally, Allen-Zhu et al. [32] showed that the Elman network trained with SGD is capable of minimizing the regression loss with the assumption that the number of neurons is polynomial in the training data size. Although these studies provide definite solutions for potential problems of the first-order methods, the conditions of their results are restrictive for online settings. Therefore, their results are usually inapplicable to practical online learning scenarios.
In this study, differing from the previous works, we introduce a first-order optimization algorithm that theoretically guarantees to learn the locally optimum parameters without any assumption on the input/output sequences (except their boundedness) or the system dynamics of the model. We emphasize that our convergence guarantee is valid when our algorithm is used with commonly used RNN models, e.g., Elman networks [33] and LSTMs [34].

To obtain this result, we model RNN optimization as a sequential learning problem and treat each time step as a separate loss function, whose argument is the network weights (detailed in the following). By using the Lipschitz characteristics of these loss functions, we develop an online gradient descent algorithm that is guaranteed to converge to the weights with zero derivatives. In the simulations, we verify our theoretical results and show that our algorithm improves the error performance of the state-of-the-art methods [19], [22], [23] on several real-life datasets. Therefore, in this paper, we introduce a both practical and theoretically justified algorithm that can be used safely in RNN-based online learning settings.

Our contributions can be summarized as follows:

• To the best of our knowledge, we introduce, for the first time in the literature, an online first-order optimization algorithm that guarantees to learn the locally optimum parameters when used with practical RNN models. We note that previous studies in the literature either make ad hoc assumptions in their results or give a theoretical justification for restricted settings that do not sufficiently describe the practical scenarios.

• Our algorithm is truly online in that it does not make any assumption on the desired data sequence to guarantee convergence. Therefore, it can be safely used in any practical application.

• Through simulations involving real-life datasets, we illustrate significant performance gains achieved by our algorithm with respect to the state-of-the-art methods [19], [22], [23].

This paper is organized as follows: In Section II, we formally introduce the online regression problem and describe our RNN model. In Section III, we develop a first-order optimization algorithm with a theoretical convergence guarantee. In Section IV, we verify our results and demonstrate the performance of our algorithm with numerical simulations. In Section V, we conclude the paper with final remarks.
II. MODEL AND PROBLEM DEFINITION
All vectors are column vectors and denoted by boldface lower case letters. Matrices are represented by boldface capital letters. We use ‖·‖ (respectively ‖·‖_∞) to denote the ℓ₂ (respectively ℓ_∞) vector or matrix norms depending on the argument. We use the bracket notation [n] to denote the set of the first n positive integers, i.e., [n] = {1, ···, n}.

We study online regression, where we observe a desired signal {d_t}_{t≥1} and regression vectors {x_t}_{t≥1} to sequentially estimate d_t with d̂_t. For mathematical convenience in the proofs, we assume d_t ∈ [−√n_h, √n_h], with a user-dependent n_h ∈ N, and x_t ∈ [−1, 1]^{n_x}. However, our derivations hold for any bounded input/output sequence after proper shifting and scaling. Given our estimate d̂_t, which is a function of {···, x_{t−1}, x_t} and {···, d_{t−2}, d_{t−1}}, we suffer the loss ℓ(d_t, d̂_t). Our aim is to optimize the network with respect to the loss function ℓ(·,·). In this study, we particularly work with the squared loss, i.e., ℓ(d̂_t, d_t) = 0.5 (d_t − d̂_t)^2. However, since our work uses a generic approach for gradient-based non-convex optimization, it can be extended to any continuous cost function. An extension for the cross-entropy loss is given in Appendix C.

In this paper, we study online regression with continually running RNNs [35]. For this task, we use the Elman network model, i.e.,

h_{t+1} = tanh(W h_t + U x_t)   (1)
d̂_t = c^T h_t.   (2)

Here, as the weight matrices, we have W ∈ R^{n_h×n_h}, U ∈ R^{n_h×n_x} and c ∈ R^{n_h}, with ‖c‖ ≤ 1. Moreover, h_t ∈ [−1, 1]^{n_h} is the hidden state vector, x_t ∈ [−1, 1]^{n_x} is the input vector, d̂_t ∈ [−√n_h, √n_h] is our estimation, and tanh applies to vectors point-wise. We note that we use the Elman network model with tanh and linear activations in (1)-(2) due to the wide use and simplicity of this model [36]. However, our derivations can be extended to any differentiable neural network architecture, given that the Lipschitz properties of the architecture can be explicitly derived. A sketch of such an extension for LSTMs is given in Appendix D. Additionally, although we do not explicitly write the bias terms, they can be included in (1)-(2) by augmenting the input vectors with a constant dimension.
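To make the model concrete, the following is a minimal NumPy sketch of the forward recursion in (1)-(2), following the unfolding in Fig. 1. The function name, the zero initial state, and the input format are illustrative assumptions of the example rather than details specified in the paper.

```python
import numpy as np

def elman_forward(W, U, c, x_seq, h_init=None):
    """Run the Elman model in (1)-(2) over an input sequence.

    W: (n_h, n_h) hidden weight matrix, U: (n_h, n_x) input weight matrix,
    c: (n_h,) output layer weights, x_seq: iterable of (n_x,) input vectors.
    Returns the predictions d_hat_t = c^T h_t and the final hidden state.
    """
    n_h = W.shape[0]
    h = np.zeros(n_h) if h_init is None else h_init  # weight-independent initial state
    preds = []
    for x in x_seq:
        h = np.tanh(W @ h + U @ x)   # state update, Eq. (1)
        preds.append(c @ h)          # linear output layer, Eq. (2)
    return np.array(preds), h
```

In the sequential setting developed below, h_t(θ_t, µ_t) is (conceptually) obtained by re-running exactly this forward pass from the initial state with the current weights, as depicted in Fig. 1.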
III. ALGORITHM DEVELOPMENT

In this section, we develop an online first-order algorithm that guarantees convergence to the locally optimum weights. We develop our algorithm in three subsections. In the first subsection, we describe our sequential learning approach and make definitions for the following analysis. In the second subsection, we present some auxiliary results that will be used in the development of the main algorithm. In the last subsection, we introduce our algorithm and mathematically prove its convergence guarantee.
Fig. 1: Unfolded version of the RNN model in (1)-(2) over all the time steps up to the current time step t; all forward passes share the same parameters, i.e., θ_t and µ_t. In this figure, we visually describe h_t(θ_t, µ_t), where h_t(θ_t, µ_t) is defined as the hidden state vector obtained by running the model in (1)-(2) with θ_t and µ_t from the initial time step up to the current time step t. The RNN sequence is initialized with a predetermined initial state h, which is independent of the network weights. Then, the same RNN forward pass (given in (1)) is repeatedly applied to the input sequence {x_t}_{t≥1}, where all the iterations are parametrized by θ_t and µ_t. The resulting hidden vector after t iterations is defined to be h_t(θ_t, µ_t). Here, we note that the dependence of h_t(·,·) on t is due to the increased length of the recursion at each time step.

A. Sequential Learning Approach
We investigate the RNN-based regression problem in the sequential learning framework. Here, we make no statistical assumptions on the data in order to model chaotic, non-stationary or even adversarial environments. Hence, we model our problem as a game between a learner and an adversary, where the learner is tasked with predicting the weight matrices from some convex sets in each round t. In the following, we use the vectorized forms of the weight matrices, i.e., θ_t = vec(W_t) and µ_t = vec(U_t), for mathematical convenience. Therefore, we construct the game as follows: At each round t, the learner declares his prediction θ_t and µ_t; concurrently, the adversary chooses a target value d_t ∈ [−√n_h, √n_h], an input x_t ∈ [−1, 1]^{n_x}, and a weight vector c_t with ‖c_t‖ ≤ 1 (the boundedness of c will be required in our proofs; our algorithm guarantees to keep c bounded with a proper projection onto a convex set, and we assume in particular ‖c‖ ≤ 1 for mathematical convenience); then, the learner observes the loss function

ℓ_t(θ_t, µ_t) := 0.5 (d_t − c_t^T h_t(θ_t, µ_t))^2,   (3)

where c_t^T h_t(θ_t, µ_t) = d̂_t, and suffers the loss ℓ_t(θ_t, µ_t) (we scale the squared loss with 0.5 for mathematical convenience when computing the derivatives). Here, h_t(θ_t, µ_t) denotes the hidden state vector obtained by running the model in (1)-(2) with θ_t and µ_t from the initial time step up to the current time step t (for a detailed description of h_t(θ_t, µ_t), see Fig. 1). This procedure of play is repeated across T rounds, where T is the total number of input instances. We note that we constructed our setting for adversarial c_t selections for mathematical convenience in our proofs. However, since the selected ℓ_t(θ_t, µ_t) is convex with respect to c_t, we will use the online gradient descent algorithm [37] to learn the optimal output layer weights (simultaneously with the hidden layer weights) during the training.

Since we make no statistical assumptions on the input/output sequences, we analyze our performance with the notion of regret. However, we note that the standard regret definition for convex problems is intractable in non-convex settings due to the NP-hardness of non-convex global optimization [38], [39]. Therefore, we use the notion of local regret recently introduced by Hazan et al. [39], which quantifies the objective of predicting points with a small gradient on average.

To formulate the local regret for our setting, we first define the projected partial derivatives of ℓ_t(θ, µ) with respect to θ and µ as follows:

∂^{K_θ, η_θ} ℓ_t(θ, µ) / ∂θ := (1/η_θ) (θ − Π_{K_θ}[θ − η_θ ∂ℓ_t(θ, µ)/∂θ]),   (4)

∂^{K_µ, η_µ} ℓ_t(θ, µ) / ∂µ := (1/η_µ) (µ − Π_{K_µ}[µ − η_µ ∂ℓ_t(θ, µ)/∂µ]).   (5)

Here, ∂^{K, η} denotes the projected partial derivative operator defined with some convex set K and some learning rate η, the learning rates η_θ and η_µ are used to update θ_t and µ_t, and the operators Π_{K_θ}[·] and Π_{K_µ}[·] denote the orthogonal projections onto K_θ and K_µ.

We define the time-smoothed loss at time t, parametrized by some window size w ∈ [T], as

L_{t,w}(θ, µ) := (1/w) Σ_{i=0}^{w−1} ℓ_{t−i}(θ, µ).   (6)
Then, we define the local regret as

R_w(T) := Σ_{t=1}^{T} ( ‖∂^{K_θ, η_θ} L_{t,w}(θ_t, µ_t)/∂θ‖^2 + ‖∂^{K_µ, η_µ} L_{t,w}(θ_t, µ_t)/∂µ‖^2 ).   (7)

Our aim is to derive a sublinear upper bound for R_w(T) in order to ensure the convergence of our algorithm to the locally optimum weights. However, before the derivations, we first present some auxiliary results, i.e., the Lipschitz and smoothness properties of L_{t,w}(θ, µ), which will be used in the convergence proof of our algorithm.
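As an illustration of the quantities in (4)-(7), the following sketch computes the time-smoothed loss, a projected partial derivative for a generic projection operator, and one summand of the local regret. The helper names and the flattened-parameter interface are assumptions made for the example.

```python
import numpy as np

def time_smoothed_loss(loss_history, w):
    """Time-smoothed loss (6): average of the last w instantaneous losses;
    losses at non-positive times are treated as zero, so we always divide by w."""
    window = loss_history[-w:]
    return sum(window) / w

def projected_partial_derivative(param, grad, eta, project):
    """Projected partial derivative (4)-(5):
    (1/eta) * (param - Pi_K[param - eta * grad]),
    where `project` implements the orthogonal projection Pi_K onto the convex set K."""
    return (param - project(param - eta * grad)) / eta

def local_regret_increment(pg_theta, pg_mu):
    """One summand of the local regret (7): squared norms of the two
    projected partial derivatives of L_{t,w} at the current iterate."""
    return np.dot(pg_theta, pg_theta) + np.dot(pg_mu, pg_mu)
```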
B. Lipschitz and Smoothness Properties

In this section, we derive the Lipschitz and smoothness properties of the time-smoothed loss function L_{t,w}(θ, µ). We note that L_{t,w}(θ, µ) is defined as the average of the most recent w instant loss functions (see (6)), where the loss function ℓ_t(θ, µ) recursively depends on θ and µ due to h_t(θ, µ) (see (3) and Fig. 1). We emphasize that since we are interested in online learning, this recursion can be infinitely long, which might cause L_{t,w}(θ, µ) to have unboundedly large derivatives. On the other hand, online algorithms naturally require loss functions with bounded gradients to guarantee convergence [40]. Therefore, in the following, we first analyze the (potentially infinite) recursive dependency of L_{t,w}(θ, µ) on µ and θ, where we derive sufficient conditions for its derivatives to be bounded. Then, we use the results of this analysis to find the explicit formulations of the Lipschitz and smoothness constants of L_{t,w}(θ, µ) in terms of the model parameters defining our RNN model in (1)-(2).

To analyze the effect of recursion, we first write the hidden state update in (1) with the vectorized weight matrices as

h_{t+1} = tanh(H_t θ + X_t µ),   (8)

where H_t = I ⊗ h_t^T, X_t = I ⊗ x_t^T, and ⊗ is the Kronecker product.

We then provide the following lemma, where we derive the Lipschitz and smoothness properties of the single RNN iteration defined in (8).

Lemma 1.
Let W and U the hidden layer weight matricesin the model (1)-(2), which satisfy (cid:107) W (cid:107) ≤ λ , and (cid:107) U (cid:107) ≤ λ for some λ ∈ R . By using the equivalent hidden state updateformula in (8), the Lipschitz and smoothness properties of thesingle RNN iteration can be written as: (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ h t (cid:13)(cid:13)(cid:13) ≤ λ, (9) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ h t (cid:13)(cid:13)(cid:13) ≤ λ , (10) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ h t ∂ θ (cid:13)(cid:13)(cid:13) ≤ λ √ n h , (11) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ h t ∂ µ (cid:13)(cid:13)(cid:13) ≤ λ √ n x , (12) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ ∂ µ (cid:13)(cid:13)(cid:13) ≤ √ n x √ n h . (13) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ (cid:13)(cid:13)(cid:13) ≤ √ n h , (14) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ µ (cid:13)(cid:13)(cid:13) ≤ √ n x , (15) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ (cid:13)(cid:13)(cid:13) ≤ n h , (16) (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ µ (cid:13)(cid:13)(cid:13) ≤ n x , (17) Proof.
See Appendix A.

In the following lemma and remark, we derive the Lipschitz properties of h_t(θ, µ), and observe the effect of infinitely long recursion on the derivatives of L_{t,w}(θ, µ).

Lemma 2.
Let W, W′, U, U′ satisfy ‖W‖, ‖W′‖ ≤ λ, and ‖U‖, ‖U′‖ ≤ λ. By using the notation in (8), let h_t(θ, µ) and h_t(θ′, µ′) be the state vectors obtained at time t by running the model in (1)-(2) with the matrices W, U and W′, U′ on the common input sequence {x_1, x_2, ···, x_{t−1}}, respectively. If the two models are initialized with the same initial state, then

‖h_t(θ, µ) − h_t(θ′, µ′)‖ ≤ Σ_{i=0}^{t} λ^i (√n_h ‖θ − θ′‖ + √n_x ‖µ − µ′‖).   (18)

Proof.
See Appendix A.
Remark 1.
We note that, by (18), to ensure that h_t(θ, µ) has a bounded gradient with respect to θ and µ in an infinite time horizon, λ should be in [0, 1), i.e., λ ∈ [0, 1). In this case, the right hand side of (18) becomes bounded, i.e.,

‖h_t(θ, µ) − h_t(θ′, µ′)‖ ≤ (√n_h / (1 − λ)) ‖θ − θ′‖ + (√n_x / (1 − λ)) ‖µ − µ′‖   (19)

for any t ∈ [T].

Recall that L_{t,w}(θ, µ) is dependent on h_t(θ, µ) due to (6) and (3). Hence, to ensure that the derivatives of L_{t,w}(θ, µ) stay bounded, we need to constrain our parameter space as K_θ = {vec(W) : ‖W‖ ≤ λ} and K_µ = {vec(U) : ‖U‖ ≤ λ} for some λ ∈ [0, 1). We note that since K_θ and K_µ are convex sets for any λ ∈ [0, 1), our constraint does not violate the setting described in the previous subsection.
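Since K_θ and K_µ are spectral-norm balls, the orthogonal projections Π_{K_θ} and Π_{K_µ} in (4)-(5) can be carried out by clipping singular values, as later noted in Remark 2. A minimal sketch follows; the function name and the matrix-valued interface are assumptions of the example.

```python
import numpy as np

def project_spectral_norm(M, lam):
    """Project a weight matrix onto {M : ||M||_2 <= lam} (the sets K_theta, K_mu
    of Remark 1, viewed in matrix form) by clipping its singular values at lam."""
    U_svd, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U_svd @ np.diag(np.minimum(s, lam)) @ Vt
```

Clipping the singular values is the Euclidean (Frobenius-norm) projection onto the spectral-norm ball, so it realizes the orthogonal projections used in (4)-(5).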
Now that we have found that λ ∈ [0, 1) is sufficient for L_{t,w}(θ, µ) to have bounded derivatives at any t ∈ [T], in the following theorem we provide its Lipschitz and smoothness constants.

Theorem 1. Let θ = vec(W) and µ = vec(U), where W and U satisfy ‖W‖ ≤ λ and ‖U‖ ≤ λ for some λ ∈ [0, 1). Then, L_{t,w}(θ, µ) has the following Lipschitz and smoothness properties:

‖∂L_{t,w}(θ, µ)/∂θ‖ ≤ R_θ, where R_θ = 2 n_h / (1 − λ),   (20)

‖∂L_{t,w}(θ, µ)/∂µ‖ ≤ R_µ, where R_µ = 2 √(n_h n_x) / (1 − λ),   (21)

‖∂^2 L_{t,w}(θ, µ)/∂θ^2‖ ≤ β_θ, where β_θ = 4 n_h √n_h / (1 − λ)^3,   (22)

‖∂^2 L_{t,w}(θ, µ)/∂µ^2‖ ≤ β_µ, where β_µ = 4 n_x √n_h / (1 − λ)^3,   (23)

‖∂^2 L_{t,w}(θ, µ)/∂θ ∂µ‖ ≤ β_θµ, where β_θµ = 4 n_h √n_x / (1 − λ)^3.   (24)

Proof.
See Appendix B.

In the following section, we use these properties to derive an RNN learning algorithm with a convergence guarantee.
C. Main Algorithm
Here, we present our main algorithm, namely the Windowed Online Gradient Descent algorithm (WOGD), shown in Algorithm 1.

In the algorithm, we take the window size w ∈ [T] and λ ∈ [0, 1) as the inputs. We then define the parameter spaces K_θ, K_µ, and K_c in line 3. Here, we define K_θ and K_µ as given in Remark 1 to ensure that the derivatives of the loss functions are bounded. Furthermore, we define K_c as K_c = {c : ‖c‖ ≤ 1} to satisfy our assumption of ‖c‖ ≤ 1.

Algorithm 1 Windowed Online Gradient Descent (WOGD)
1: Parameters: w ∈ [T], λ ∈ [0, 1).
2: Initialize θ_1, µ_1, c_1 and h.
3: Let
     K_θ = {vec(W) : ‖W‖ ≤ λ}
     K_µ = {vec(U) : ‖U‖ ≤ λ}
     K_c = {c : ‖c‖ ≤ 1}
4: for t = 1 to T do
5:   Predict θ_t, µ_t and c_t.
6:   Receive x_t and generate d̂_t.
7:   Observe d_t and the cost function ℓ_t(θ_t, µ_t).
8:   Updates:
       c_{t+1} = Π_{K_c}[c_t − (1/√t) ∂ℓ_t(θ_t, µ_t)/∂c_t]   (25)
       θ_{t+1} = θ_t − η_θ ∂^{K_θ, η_θ} L_{t,w}(θ_t, µ_t)/∂θ   (26)
       µ_{t+1} = µ_t − η_µ ∂^{K_µ, η_µ} L_{t,w}(θ_t, µ_t)/∂µ   (27)
9: end for

In the learning part, we first predict the hidden layer weight matrices in their vectorized forms, i.e., θ_t and µ_t, and the output layer weights, i.e., c_t (see line 5). Then, we receive the input vector x_t and generate d̂_t by running the model in (1)-(2). We next observe the ground truth value d_t and the loss function ℓ_t(θ_t, µ_t) in line 7. Having observed the label, we update the weight matrices in line 8 (or in (25)-(27)). Here, we update the output layer weights c_t with the projected online gradient descent algorithm [37], since ℓ_t(θ_t, µ_t) is convex with respect to c_t. We update the hidden weights in (26)-(27) by using their projected partial derivatives defined with (K_θ, η_θ) and (K_µ, η_µ).

We note that since we constructed our setting for adversarial c_t selections, the update rule for the output layer in (25) does not contradict our analysis. Moreover, by using [37, Theorem 1], we can prove that this update rule converges to the best possible output layer weights satisfying ‖c‖ ≤ 1. Therefore, in the following theorem, we provide the convergence guarantee of WOGD specifically for the hidden layer weights.

Theorem 2.
Let ℓ_t(θ, µ) and L_{t,w}(θ, µ) be the loss and time-smoothed loss functions defined in (3) and (6), respectively. Moreover, let β_θ and β_µ be the smoothness parameters defined in (22)-(23). Then, if WOGD is run with the parameters

0 < η_θ ≤ 1/(2 β_θ) and 0 < η_µ ≤ 1/(2 β_µ),   (28)

it ensures that

R_w(T) ≤ (8 √n_h / min{η_θ, η_µ}) (T / w),   (29)

where R_w(T) is the local regret defined in (7). By selecting a window size w such that T/w = o(T), one can bound R_w(T) with a sublinear bound, and hence guarantee convergence of the hidden layer weights to the locally optimum parameters. (We use little-o notation, i.e., g(x) = o(f(x)), to describe an upper bound that cannot be tight, i.e., lim_{x→∞} g(x)/f(x) = 0.)

Proof.
See Appendix B.

Theorem 2 shows that with appropriate parameter selections, WOGD is guaranteed to converge to the locally optimum hidden layer weights without any assumption on the input/output sequences and output layer weights c_t. Additionally, recall that by [37, Theorem 1], the output layer weights also converge to the best weights in hindsight. Therefore, we conclude that WOGD with the learning rates in (28) guarantees to learn the locally optimum RNN parameters for any bounded input/output sequences.

Now that we have proved the convergence guarantee of WOGD, in the following remark, we investigate the computational requirement of WOGD and comment on the window size selection for the algorithm.
Remark 2.
We note that the most expensive operation of WOGD is the computation of the projected partial derivatives in (26)-(27), which requires the computation of the partial derivatives of L_{t,w}(θ, µ) with respect to θ and µ, and their corresponding projections onto K_θ and K_µ. To compute the partial derivatives, we use the Truncated Backpropagation Through Time algorithm [41], which has O(h n_h (n_h + n_x)) computational complexity with a truncation length h. Since WOGD uses the partial derivatives of the last w losses, we can approximate these partial derivatives with a single backpropagation by using a truncation length h = w + n for some n ∈ N, which results in an O(w n_h (n_h + n_x)) computational requirement for computing the partial derivatives. In addition, the projection step can be performed by computing the singular value decomposition (SVD) of the weight matrices W and U, and clipping their singular values with λ. Here, since performing the SVD requires O(min{n_h, n_x} n_h (n_h + n_x)), the computational requirement of WOGD becomes O((w + min{n_h, n_x}) n_h (n_h + n_x)) per time step.

We highlight that WOGD introduces a trade-off between the convergence speed and computational complexity. For example, one can choose a large window size w to ensure fast convergence (see (29)) by increasing the computational requirement of WOGD, or vice versa. However, as we will show in the following section, selecting w = ⌈√T⌉ usually provides the best trade-off between performance and efficiency. A compact sketch of one WOGD update, combining the windowed gradient step and the SVD-based projection, is given below.
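The sketch below writes the update (25)-(27) directly in matrix form. The gradients are assumed to be supplied by an external (truncated) backpropagation routine, and the 1/√t output-layer step size and the radius of K_c are treated as illustrative choices; note that the step θ_t − η · (projected partial derivative) in (26)-(27) coincides with the projected gradient step Π_K[θ_t − η · gradient].

```python
import numpy as np

def clip_singular_values(M, lam):
    """Projection onto the spectral-norm ball {M : ||M||_2 <= lam} via SVD clipping."""
    U_svd, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U_svd @ np.diag(np.minimum(s, lam)) @ Vt

def wogd_step(W, U, c, grad_W, grad_U, grad_c, t, eta_theta, eta_mu, lam, c_radius=1.0):
    """One WOGD update, Eqs. (25)-(27) of Algorithm 1, in matrix form.

    grad_W, grad_U: gradients of the time-smoothed loss L_{t,w} w.r.t. W and U
    (e.g., from truncated BPTT over roughly the last w steps); grad_c: gradient of
    the instantaneous loss ell_t w.r.t. c.
    """
    # Output layer: projected online gradient descent with an O(1/sqrt(t)) step, Eq. (25).
    c_new = c - grad_c / np.sqrt(t)
    c_norm = np.linalg.norm(c_new)
    if c_norm > c_radius:                          # projection onto K_c = {c : ||c|| <= c_radius}
        c_new = c_new * (c_radius / c_norm)
    # Hidden layer: projected gradient steps onto the spectral-norm balls, Eqs. (26)-(27).
    W_new = clip_singular_values(W - eta_theta * grad_W, lam)
    U_new = clip_singular_values(U - eta_mu * grad_U, lam)
    return W_new, U_new, c_new
```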
In the next remark, we discuss the effect of choosing higher learning rates than the theoretically guaranteed ones given in Theorem 2.

Remark 3.
We note that WOGD is constructed by assuming the worst-case Lipschitz constants derived in Theorem 1. On the other hand, our experiments suggest that in practice, the landscape of the objective function is generally nicer than what is predicted by our theoretical development. For example, in the simulations, it is observed that the smoothness parameters of the error surface, i.e., β_θ and β_µ, are usually much smaller than the values given in (22)-(23). Therefore, it is practically possible to obtain vanishing gradients with WOGD by using much higher learning rates than the theoretically guaranteed learning rates in (28). As the regret bound of WOGD is inversely proportional to the learning rates (see (29)), in the following, we use WOGD with higher learning rates than suggested in Theorem 2 to obtain faster convergence.

IV. SIMULATIONS
In this section, we verify our theoretical analysis and illustrate the performance of our algorithm on different real-life datasets. In particular, we consider the regression performance for the elevators [42] and pumadyn [43] datasets. To demonstrate the performance improvements of our algorithm, we compare it with three widely used first-order optimization algorithms: Adam [19], RmsProp [22] and SGD [23].

For all the simulations, we randomly draw the initial RNN weights from a zero-mean Gaussian distribution with a small standard deviation, and set the initial values of all internal state variables to 0. For a fair comparison, in each experiment, we choose the hyper-parameters such that all the compared algorithms reach their maximum performance in that setup. We run each experiment multiple times and provide the mean performances.
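The evaluation follows the standard online protocol of Section III-A: at each step the model predicts, the loss is recorded, and only then are the weights updated; the reported curves are averages over repeated runs with fresh random initializations. A small sketch of such a harness follows; the learner interface with predict/update methods and the default number of runs are assumptions of the example.

```python
import numpy as np

def run_online_experiment(make_learner, x_seq, d_seq, n_runs=10, seed=0):
    """Online regression protocol: predict, record the squared loss, then update.
    Returns the per-step loss averaged over n_runs independent runs, where each
    run re-initializes the learner's weights with a fresh random generator."""
    T = len(d_seq)
    losses = np.zeros((n_runs, T))
    for r in range(n_runs):
        learner = make_learner(np.random.default_rng(seed + r))  # fresh random init
        for t in range(T):
            d_hat = learner.predict(x_seq[t])           # prediction before seeing d_t
            losses[r, t] = 0.5 * (d_seq[t] - d_hat) ** 2
            learner.update(x_seq[t], d_seq[t])          # e.g., a WOGD / Adam / SGD step
    return losses.mean(axis=0)
```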
A. Elevators Dataset

In the first simulation, we consider the regression performance on the elevators dataset [42], which includes T = 10000 input/output pairs obtained from a procedure that is related to controlling an F16 aircraft. Here, our aim is to predict the scalar output that expresses the actions of the F16 aircraft. For this, we use the 18-dimensional input vectors of the dataset with an additional bias dimension, i.e., n_x = 19. To get small loss values with relatively lower run-time, we use 16-dimensional state vectors in the RNNs, i.e., n_h = 16. In WOGD, we use η_θ = 0. , η_µ = 0. , λ = 0. , and w = 100. In Adam, RmsProp and SGD, we respectively choose the learning rates as 0. , 0. and 0. . We note that we do not use momentum in SGD.

In Fig. 2a, we demonstrate the temporal loss of the compared algorithms for the elevators dataset. Here, we observe that despite the differences in the performances at the beginning, RmsProp, Adam and SGD converge to similar loss values. We also observe that since WOGD obtains small loss values much faster than the other algorithms, it can track the desired data sequence with lower error values throughout the simulations.

To observe how the window size affects the performance, we run WOGD with six different window sizes and plot their resulting temporal performance in Fig. 2b. Here, we see that, consistent with our regret bound in Theorem 2, the final loss values obtained by the algorithms are inversely proportional to their window sizes. Moreover, we observe that in the initial part of the experiment, the algorithms with larger window sizes tend to suffer larger losses, as averaging gradients over a large window prevents them from overfitting to the small amount of data observed in the earlier stage. From the figure, we can observe that WOGD with w = 100 provides comparable performance with the others at both the initial and final parts of the experiments. As the computational requirement of our algorithm is linear in w (see Remark 2), selecting w = 100 provides a highly preferable trade-off for practical purposes.

In Figs. 2c and 2d, we compare the empirical smoothness parameters of the smoothed loss functions with the theoretical upper bounds derived in Theorem 1. To observe the behaviour of the empirical smoothness parameters without calculating the exact Hessian matrix, we use the empirical Lipschitz constants, i.e.,

β^emp_{θ,t} = ‖∂L_{t,w}(θ_{t+1}, µ_{t+1})/∂θ − ∂L_{t,w}(θ_t, µ_t)/∂θ‖ / ‖θ_{t+1} − θ_t‖,   (30)

β^emp_{µ,t} = ‖∂L_{t,w}(θ_{t+1}, µ_{t+1})/∂µ − ∂L_{t,w}(θ_t, µ_t)/∂µ‖ / ‖µ_{t+1} − µ_t‖.   (31)

In the figures, we observe that β^emp_{θ,t} and β^emp_{µ,t} are practically much lower than the theoretical upper bounds, where the theoretical values are given in the titles of the plots. We note that the difference between the theoretical upper bounds and the empirical smoothness parameters is expected, since we derived the worst-case upper bounds by considering the saturation region of RNNs, which is rarely encountered in practice due to variations in real-world datasets. Furthermore, in Figs. 2b and 2c, we see that the learning rates we used satisfy the requirement of Theorem 2 stated in (28).

To verify our theoretical analysis in Theorem 2, we plot the normalized regret of WOGD, i.e., R_w(t)/t for t ∈ [T], and the regret bound in (29) (scaled by a constant factor) in Fig. 2d. Here, we see that, consistent with our theoretical derivation, the normalized regret vanishes.
Moreover, it is bounded from above by the scaled regret bound. We believe that the gap between the normalized regret and the actual regret bound is due to our adversarial model for the output layer weights, which is mainly used for mathematical convenience. Since we update the output layer weights simultaneously with the hidden layer weights, deriving a tighter regret bound requires a cooperative learning model for neural network optimization, which, to the best of our knowledge, has not been studied in the literature. The construction of such a model and its analysis is left as an open problem for future studies.
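The empirical smoothness parameters in (30)-(31) only require successive gradients and iterates, so they can be tracked during training; a minimal sketch (the function name is assumed):

```python
import numpy as np

def empirical_smoothness(grad_next, grad_curr, param_next, param_curr):
    """Empirical Lipschitz constant of the gradient, as in (30)-(31):
    ||grad L_{t,w}(next) - grad L_{t,w}(curr)|| / ||param_next - param_curr||."""
    denom = np.linalg.norm(param_next - param_curr)
    if denom == 0.0:
        return 0.0   # no parameter movement: report zero instead of dividing by zero
    return np.linalg.norm(grad_next - grad_curr) / denom
```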
Fig. 2: (a) Sequential prediction performances of the algorithms for the elevators dataset. (b)-(c) Comparison between the empirical smoothness parameters of L_{t,w}(θ, µ) and their theoretical upper bounds given in (22) and (23). (d) Comparison between the normalized regret of WOGD, i.e., R_w(t)/t for t ∈ [T], and its theoretical upper bound given in (29).

B. Pumadyn Dataset

In our second simulation, we consider the pumadyn dataset [43], which includes T = 8000 input/output pairs obtained from the simulation of a Unimation Puma 560 robotic arm. Here, our aim is to estimate the angular acceleration of the arm by using the angular positions and angular velocities of the links. For this simulation, we use the 8-dimensional input vectors of the dataset with an additional bias dimension, i.e., n_x = 9, and 16-dimensional state vectors in the RNN, i.e., n_h = 16. In WOGD, we use η_θ = 0. , η_µ = 0. , λ = 0. , and w = 90. In Adam, RmsProp, and SGD, we choose the learning rates as 0. , 0. , and 0. , respectively. As in the previous experiment, we do not use momentum in SGD.

In Fig. 3a, we demonstrate the temporal loss of the compared algorithms for the pumadyn dataset. Here, we observe that while all algorithms provide comparable performances, WOGD enjoys a smaller error value at the end of the simulation. In Figs. 3b and 3c, we compare the empirical smoothness parameters (as defined in (30)-(31)) and their theoretical upper bounds derived in Theorem 1. As in the previous experiment, we observe that the empirical smoothness parameters, namely β^emp_{θ,t} and β^emp_{µ,t}, are considerably smaller than the theoretical upper bounds, and our learning rate selection satisfies the requirement in (28). In Fig. 3d, we plot the normalized regret of WOGD, i.e., R_w(t)/t for t ∈ [T], and the regret bound in (29) scaled by a constant factor. As in the previous experiment, here also, we see that the normalized regret vanishes and the scaled regret bound bounds the normalized regret from above, which is parallel with our derivations in Theorem 2. Since our algorithm enjoys this guarantee while providing comparable performance with the state-of-the-art methods, it can be a theoretically grounded alternative to the widely used heuristics [19], [22], [23] in RNN-based online learning settings.

V. CONCLUSION
We studied online nonlinear regression with continually running RNNs, i.e., RNN-based online learning. For this problem, we introduced a first-order gradient-based optimization algorithm that provides a convergence guarantee to the locally optimum parameters. We emphasize that unlike previous theoretical results, which hold in restricted settings [30], [31], our algorithm is generic in that it guarantees convergence with any smooth RNN architecture, e.g., the Elman networks [33] or LSTMs [34], without any assumption on the input/output sequences.

To achieve this result, we modeled the RNN-based online regression problem as a sequential learning problem, where we treated each time step as a separate loss function assigned by an adversary. We characterized the Lipschitz properties of these loss functions with respect to the network weights and derived sufficient conditions for our model to have bounded derivatives. Then, by using these results, we introduced an online gradient descent algorithm that is guaranteed to converge to the locally optimum parameters. In the simulations, we verified our theoretical analysis. Here, we also demonstrated that our algorithm achieves considerable performance improvements with respect to the state-of-the-art gradient descent methods [19], [22], [23].

Fig. 3: (a) Sequential prediction performances of the algorithms for the pumadyn dataset. (b)-(c) Comparison between the empirical smoothness parameters of L_{t,w}(θ, µ) and their theoretical upper bounds given in (22) and (23). (d) Comparison between the normalized regret of WOGD, i.e., R_w(t)/t for t ∈ [T], and its theoretical upper bound given in (29).

APPENDIX A
For the following, we denote the derivative of tanh as tanh′, where tanh′(x) = 1 − tanh²(x). Due to space restrictions, we denote the elementary row scaling operation with ⊙, i.e., x ⊙ W = diag(x) W, where x ∈ R^n, W ∈ R^{n×m} and diag(x) ∈ R^{n×n} is the elementary scaling matrix whose diagonal elements are the components of x. Before the proofs, we give three inequalities that will be used frequently in the following:

Lemma 3.
For any x , y ∈ R n , W ∈ R n × m , where n, m ∈ N ,the following statements hold: (cid:107) x (cid:12) W (cid:107) ≤ (cid:107) x (cid:107) ∞ (cid:107) W (cid:107) (32) (cid:107) tanh (cid:48) ( x ) − tanh (cid:48) ( y ) (cid:107) ∞ ≤ (cid:107) x − y (cid:107) (33) (cid:107) I ⊗ x T (cid:107) = (cid:107) x (cid:107) . (34) Proof.
1) Since x (cid:12) W = diag ( x ) W , we have (cid:107) x (cid:12) W (cid:107) ≤(cid:107) diag ( x ) (cid:107)(cid:107) W (cid:107) , where we use the Cauchy-Schwarz in-equality for bounding. Since by definition (cid:107) diag ( x ) (cid:107) = (cid:107) x (cid:107) ∞ , (cid:107) x (cid:12) W (cid:107) ≤ (cid:107) diag ( x ) (cid:107)(cid:107) W (cid:107) = (cid:107) x (cid:107) ∞ (cid:107) W (cid:107) .2) Recall that tanh (cid:48) ( x ) = 1 − tanh( x ) . Since tanh( x ) ∈ [ − , , tanh is -Lipschitz -smooth. Then, by using (cid:107) x (cid:107) ∞ ≤ (cid:107) x (cid:107) for any x ∈ R n , we have (cid:107) tanh (cid:48) ( x ) − tanh (cid:48) ( y ) (cid:107) ∞ ≤ (cid:107) tanh (cid:48) ( x ) − tanh (cid:48) ( y ) (cid:107) . Since tanh is -smooth, we have (cid:107) tanh (cid:48) ( x ) − tanh (cid:48) ( y ) (cid:107) ≤ (cid:107) x − y (cid:107) .3) See [44, Theorem 8]. Proof of Lemma 1.
In the following, we prove each statementseparately: 1)
Proof of (9) : We note that (1) and (8) are equivalent. Byusing (32) and tanh (cid:48) ( x ) ≤ on (1), we write (cid:13)(cid:13)(cid:13) ∂ tanh( Wh t + Ux t ) ∂ h t (cid:13)(cid:13)(cid:13) = (cid:107) tanh (cid:48) ( Wh t + Ux t ) (cid:12) W (cid:107)≤ (cid:107) tanh (cid:48) ( Wh t + Ux t ) (cid:107) ∞ (cid:107) W (cid:107) ≤ λ. Proof of (10) : By using (32) and (33), we write (cid:13)(cid:13)(cid:13) ∂ tanh( Wh t + Ux t ) ∂ h t − ∂ tanh( W (cid:48) h t + Ux t ) ∂ h t (cid:13)(cid:13)(cid:13) = (cid:107) tanh (cid:48) ( Wh t + Ux t ) (cid:12) W − tanh (cid:48) ( Wh (cid:48) t + Ux t ) (cid:12) W (cid:107)≤ (cid:107) tanh (cid:48) ( Wh t + Ux t ) − tanh (cid:48) ( Wh (cid:48) t + Ux t ) (cid:107) ∞ (cid:107) W (cid:107)≤ (cid:107) W (cid:107)(cid:107) h t − h (cid:48) t (cid:107)(cid:107) W (cid:107) ≤ λ (cid:107) h t − h (cid:48) t (cid:107) . Proof of (11) : We note that since tanh is twice differen-tiable, the order of partial derivatives is not important. Then,y using (32), (33), (34), and (cid:107) h t (cid:107) ≤ √ n h , we write (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ − ∂ tanh( H (cid:48) t θ + X t µ ) ∂ θ (cid:13)(cid:13)(cid:13) = (cid:107) tanh (cid:48) ( H t θ + X t µ ) (cid:12) H t − tanh (cid:48) ( H (cid:48) t θ + X t µ ) (cid:12) H (cid:48) t (cid:107) (35) ≤ (cid:107) tanh (cid:48) ( H t θ + X t µ ) − tanh (cid:48) ( H (cid:48) t θ + X t µ ) (cid:107) ∞ (cid:107) H t (cid:107) + (cid:107) tanh (cid:48) ( H (cid:48) t θ + X t µ ) (cid:107) ∞ (cid:107) H t − H (cid:48) t (cid:107) (36) ≤ (cid:107) tanh (cid:48) ( Wh t + Ux t ) − tanh (cid:48) ( Wh (cid:48) t + Ux t ) (cid:107) ∞ (cid:107) H t (cid:107) + (cid:107) h t − h (cid:48) t (cid:107) (37) ≤ (cid:0) (cid:107) W (cid:107)√ n h +1 (cid:1) (cid:107) h t − h (cid:48) t (cid:107) ≤ (2 λ √ n h +1) (cid:107) h t − h (cid:48) t (cid:107) , where we add ± tanh (cid:48) ( H (cid:48) t θ + X t µ ) (cid:12) H t inside of the normin (35), and use the triangle inequality for (36). Here, weomit +1 term for mathematical convenience in the followingderivations.4) Proof of (12) : This can be obtained by repeating the stepsin the proof of (11) for µ and h t .5) Proof of (13) : By using (32), (33), (34), (cid:107) h t (cid:107) ≤ √ n h and (cid:107) x t (cid:107) ≤ √ n x , we write (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ − ∂ tanh( H t θ + X t µ (cid:48) ) ∂ θ (cid:13)(cid:13)(cid:13) = (cid:107) tanh (cid:48) ( H t θ + X t µ ) (cid:12) H t − tanh (cid:48) ( H t θ + X t µ (cid:48) ) (cid:12) H t (cid:107)≤ (cid:107) tanh (cid:48) ( H t θ + X t µ ) − tanh (cid:48) ( H t θ + X t µ (cid:48) ) (cid:107) ∞ (cid:107) H t (cid:107)≤ (cid:107) X t (cid:107)(cid:107) µ − µ (cid:48) (cid:107)(cid:107) H t (cid:107) ≤ √ n h √ n x (cid:107) µ − µ (cid:48) (cid:107) . Proof of (14) : By using (32), tanh (cid:48) ( x ) ≤ , and (cid:107) h t (cid:107) ≤√ n h , (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ (cid:13)(cid:13)(cid:13) = (cid:107) tanh (cid:48) ( H t θ + X t µ ) (cid:12) H t (cid:107)≤ (cid:107) tanh (cid:48) ( H t θ + X t µ ) (cid:107) ∞ (cid:107) H t (cid:107) ≤ √ n h . 
Proof of (15) : This can be obtained by repeating the stepsin the proof of (14) for µ .8) Proof of (16) : By using (32), (33), (34), and (cid:107) h t (cid:107) ≤ √ n h ,we write (cid:13)(cid:13)(cid:13) ∂ tanh( H t θ + X t µ ) ∂ θ − ∂ tanh( H t θ (cid:48) + X t µ ) ∂ θ (cid:13)(cid:13)(cid:13) = (cid:107) tanh (cid:48) ( H t θ + X t µ ) (cid:12) H t − tanh (cid:48) ( H t θ (cid:48) + X t µ ) (cid:12) H t (cid:107)≤ (cid:107) tanh (cid:48) ( H t θ + X t µ ) − tanh (cid:48) ( H t θ (cid:48) + X t µ ) (cid:107) ∞ (cid:107) H t (cid:107)≤ (cid:107) H t (cid:107)(cid:107) θ − θ (cid:48) (cid:107)(cid:107) H t (cid:107) ≤ n h (cid:107) θ − θ (cid:48) (cid:107) . Proof of (17) : This can be obtained by repeating the stepsin the proof of (16) for µ . Proof of Lemma 2.
Before the proof, let h t ( θ (cid:48) , µ ) be the statevector obtained at time t by running the model in (1) with thematrices W (cid:48) , U , input sequence { x , x , · · · , x t − } , and theinitial condition h ( θ (cid:48) , µ ) = h ( θ , µ ) = h ( θ (cid:48) , µ (cid:48) ) . Then, (cid:107) h t ( θ , µ ) − h t ( θ (cid:48) , µ (cid:48) ) (cid:107) (38) ≤ (cid:107) h t ( θ , µ ) − h t ( θ (cid:48) , µ ) (cid:107) + (cid:107) h t ( θ (cid:48) , µ ) − h t ( θ (cid:48) , µ (cid:48) ) (cid:107) , (39)where we add ± h t ( θ (cid:48) , µ ) inside of the norm in (38), anduse the triangle inequality. We will bound the terms in (39)separately. We begin with the first term. Since h t ( θ , µ ) and h t ( θ (cid:48) , µ ) share the same µ , in the following (between (40)-(44)), we abbreviate them as h t ( θ ) and h t ( θ (cid:48) ) : (cid:107) h t ( θ ) − h t ( θ (cid:48) ) (cid:107) = (40) = (cid:107) tanh( Wh t − ( θ ) + Ux t ) − tanh( W (cid:48) h t − ( θ (cid:48) ) + Ux t ) (cid:107) (41) ≤ (cid:107) tanh( Wh t − ( θ ) + Ux t ) − tanh( Wh t − ( θ (cid:48) ) + Ux t ) (cid:107) + (cid:107) tanh( Wh t − ( θ (cid:48) ) + Ux t ) − tanh( W (cid:48) h t − ( θ (cid:48) ) + Ux t ) (cid:107) (42) ≤ λ (cid:107) h t − ( θ ) − h t − ( θ (cid:48) ) (cid:107) + √ n h (cid:107) θ − θ (cid:48) (cid:107) (43) ≤ t − (cid:88) i =0 (cid:16) λ i √ n h (cid:107) θ − θ (cid:48) (cid:107) (cid:17) (44)(45)Here, to obtain (42), we add ± tanh( Wh t − ( θ (cid:48) )+ Ux t ) insideof the norm in (41), and use the triangle inequality. Then, weuse (9) and (14) to get (43). Until we reach (44), we repeatedlyapply the same bounding technique to bound the norm of thedifferences between the state vectors.Now, we bound the second term in (39). Since h t ( θ (cid:48) , µ ) and h t ( θ (cid:48) , µ (cid:48) ) share the same µ (cid:48) , in the following (between(46)-(50)), we abbreviate them as h t ( µ ) and h t ( µ (cid:48) ) : (cid:107) h t ( µ ) − h t ( µ (cid:48) ) (cid:107) = (46) = (cid:107) tanh( W (cid:48) h t − ( µ )+ Ux t ) − tanh( W (cid:48) h t − ( µ (cid:48) )+ U (cid:48) x t ) (cid:107) (47) ≤ (cid:107) tanh( W (cid:48) h t − ( µ )+ Ux t ) − tanh( W (cid:48) h t − ( µ (cid:48) )+ Ux t ) (cid:107) + (cid:107) tanh( W (cid:48) h t − ( µ (cid:48) )+ Ux t ) − tanh( W (cid:48) h t − ( µ (cid:48) )+ U (cid:48) x t ) (cid:107) (48) ≤ λ (cid:107) h t − ( µ ) − h t − ( µ (cid:48) ) (cid:107) + √ n x (cid:107) µ − µ (cid:48) (cid:107) (49) ≤ t − (cid:88) i =0 (cid:16) λ i √ n x (cid:107) µ − µ (cid:48) (cid:107) (cid:17) (50)(51)Here, for (48), we add ± tanh( W (cid:48) h t − ( µ (cid:48) ) + Ux t ) and usethe triangle inequality. We, then, use (9) and (15) to get (49).Until we reach (50), we repeatedly apply the same techniqueto bound the norm of the differences between state vectors. Inthe end, we use (44) and (50) to bound (39), which yields thestatement in the lemma.A PPENDIX B Proof of Theorem 1.
Recall that L t,w ( θ , µ ) = w (cid:80) w − i =0 (cid:96) t − i ( θ , µ ) . Hence, if we can bound the derivativeof (cid:96) t ( θ , µ ) for an arbitrary t ∈ [ T ] , the resulting boundwill be valid for L t,w ( θ , µ ) . Therefore, in the following, weanalyse the Lipschitz properties of (cid:96) t ( θ , µ ) for an arbitrary t ∈ [ T ] and extend the result for L t,w ( θ , µ ) . In addition, inthe following, we note that since d t , ˆ d t ∈ [ −√ n h , √ n h ] and (cid:107) c t (cid:107) ≤ , the (cid:96) norm of − ( d t − ˆ d t ) c is bounded by √ n h ,i.e., (cid:107)− ( d t − ˆ d t ) c (cid:107) ≤ √ n h .) We write (cid:13)(cid:13)(cid:13) ∂(cid:96) t ( θ , µ ) ∂ θ (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) − ( d t − ˆ d t ) ∂ ˆ d t ∂ θ (cid:13)(cid:13)(cid:13) (52) = (cid:13)(cid:13)(cid:13) − ( d t − ˆ d t ) ∂ ˆ d t ∂ h t (cid:16) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ θ (cid:17)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) − ( d t − ˆ d t ) c (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ θ (cid:13)(cid:13)(cid:13) ≤ √ n h (cid:13)(cid:13)(cid:13) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ θ (cid:13)(cid:13)(cid:13) = 2 √ n h (cid:13)(cid:13)(cid:13) t (cid:88) τ =1 (cid:16) t (cid:89) i = τ +1 ∂ h i ∂ h i − (cid:17) ∂ h τ ∂ θ (cid:13)(cid:13)(cid:13) ≤ √ n h t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) t (cid:89) i = τ +1 ∂ h i ∂ h i − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h τ ∂ θ (cid:13)(cid:13)(cid:13) ≤ √ n h √ n h t (cid:88) τ =1 λ t − τ (53) ≤ n h − λ , (54)where we use (9) and (14) to get (53). By realizing that (54)holds for an arbitrary t , the statement in the theorem can beobtained.2) By using similar steps in (52)-(54), we write (cid:13)(cid:13)(cid:13) ∂(cid:96) t ( θ , µ ) ∂ µ (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) − ( d t − ˆ d t ) ∂ ˆ d t ∂ µ (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) − ( d t − ˆ d t ) ∂ ˆ d t ∂ h t (cid:16) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ µ (cid:17)(cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) − ( d t − ˆ d t ) c (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ µ (cid:13)(cid:13)(cid:13) ≤ √ n h √ n x t (cid:88) τ =1 λ t − τ (55) ≤ √ n h n x − λ , (56)where we use (9) and (15) to get (55). By realizing that (56)holds for an arbitrary t , the statement in the theorem can beobtained.3) Let us use h t and h (cid:48) t for the state vectors obtained byrunning the model in (1) from the initial step up to current timestep t with the same initial condition, same input layer matrix µ , common input sequence { x , · · · , x t − } but different θ and θ (cid:48) , respectively. Let us also say (cid:96) t ( θ (cid:48) , µ ) = 0 . d t − ˆ d (cid:48) t ) ,where ˆ d (cid:48) t is the prediction of the second model producing h (cid:48) t . 
Then, (cid:13)(cid:13)(cid:13) ∂(cid:96) t ( θ , µ ) ∂ θ − ∂(cid:96) t ( θ (cid:48) , µ ) ∂ θ (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ( d t − ˆ d (cid:48) t ) c T (cid:16) t (cid:88) τ =1 ∂ h (cid:48) t ∂ h (cid:48) τ ∂ h (cid:48) τ ∂ θ (cid:17) − ( d t − ˆ d t ) c T (cid:16) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ θ (cid:17)(cid:13)(cid:13)(cid:13) (57) ≤ √ n h t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ ∂ h τ ∂ θ − ∂ h (cid:48) t ∂ h (cid:48) τ ∂ h (cid:48) τ ∂ θ (cid:13)(cid:13)(cid:13) (58) ≤ √ n h t (cid:88) τ =1 (cid:16)(cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h τ ∂ θ − ∂ h (cid:48) τ ∂ θ (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ∂ h (cid:48) τ ∂ θ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13)(cid:17) (59) ≤ √ n h t (cid:88) τ =1 λ t − τ (cid:13)(cid:13)(cid:13) ∂ h τ ∂ θ − ∂ h (cid:48) τ ∂ θ (cid:13)(cid:13)(cid:13) +2 n h t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) . (60)Here, to get (59) from (58), we add ± ∂ h t ∂ h τ ∂ h (cid:48) τ ∂ θ inside the norm,and use the triangle inequality. To get (60), we use (9) and(14).In the following, we will bound the terms in (60) separately.We begin with the first term. Note that h τ = tanh( Wh τ − + Ux τ − ) , and h (cid:48) τ = tanh( W (cid:48) h (cid:48) τ − + Ux τ − ) . Then, √ n h t (cid:88) τ =1 λ t − τ (cid:13)(cid:13)(cid:13) ∂ h τ ∂ θ − ∂ h (cid:48) τ ∂ θ (cid:13)(cid:13)(cid:13) (61) ≤ √ n h t (cid:88) τ =1 λ t − τ (cid:0) λ √ n h (cid:107) h τ − − h (cid:48) τ − (cid:107) +2 n h (cid:107) θ − θ (cid:48) (cid:107) (cid:1) (62) ≤ √ n h (cid:16) t (cid:88) τ =1 λ t − τ (cid:17)(cid:16) λn h − λ + 2 n h (cid:17) (cid:107) θ − θ (cid:48) (cid:107) (63) ≤ n h √ n h − λ (cid:16) λ − λ + 1 (cid:17) (cid:107) θ − θ (cid:48) (cid:107) , (64)where we add ± ∂ tanh( Wh (cid:48) τ − + Ux τ − ) ∂ θ , and use the triangleinequality, (11) and (16) for (62). We use Lemma 2 for (63).Now, we bound the second term in (60). To bound the term,we first focus on the term inside of the sum, i.e., (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) : (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) ∂ h t − ∂ h τ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h t ∂ h t − − ∂ h (cid:48) t ∂ h (cid:48) t − (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ∂ h (cid:48) t ∂ h (cid:48) t − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h t − ∂ h τ − ∂ h (cid:48) t − ∂ h τ (cid:13)(cid:13)(cid:13) (65) ≤ λ t − τ − (cid:0) λ (cid:107) h t − − h (cid:48) t − (cid:107) +2 λ √ n h (cid:107) θ − θ (cid:48) (cid:107) (cid:1) + λ (cid:13)(cid:13)(cid:13) ∂ h t − ∂ h τ − ∂ h (cid:48) t − ∂ h τ (cid:13)(cid:13)(cid:13) (66) ≤ λ t − τ − (cid:16) λ √ n h − λ +2 λ √ n h (cid:17) (cid:107) θ − θ (cid:48) (cid:107) + λ (cid:13)(cid:13)(cid:13) ∂ h t − ∂ h τ − ∂ h (cid:48) t − ∂ h τ (cid:13)(cid:13)(cid:13) (67) ≤ ( t − τ ) λ t − τ − (cid:16) λ √ n h − λ +2 λ √ n h (cid:17) (cid:107) θ − θ (cid:48) (cid:107) , (68)where we add ± ∂ h (cid:48) t ∂ h (cid:48) t − ∂ h t − ∂ h τ , and use the triangle inequality for(65), utilize (10) and (11) for (66), and Lemma 2 for (67). We,then, repeat the same manipulations in (65)-(67) to bound theerms with partial derivatives. 
Then, the second term in (60)can be bound as: n h t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) (69) ≤ n h t (cid:88) τ =1 ( t − τ ) λ t − τ − (cid:16) λ √ n h − λ + 2 λ √ n h (cid:17) (cid:107) θ − θ (cid:48) (cid:107) (70) = 4 n h √ n h − λ (cid:16) λ (1 − λ ) + λ − λ (cid:17) (cid:107) θ − θ (cid:48) (cid:107) , (71)where we use the upper bound of the series (cid:80) tτ =1 ( t − τ ) λ t − τ − , i.e., / (1 − λ ) , to get (71). Then, by using (64)and (71), we can bound (60) as (cid:13)(cid:13)(cid:13) ∂(cid:96) t ( θ , µ ) ∂ θ − ∂(cid:96) t ( θ (cid:48) , µ ) ∂ θ (cid:13)(cid:13)(cid:13) (72) ≤ n h √ n h − λ (cid:16) λ (1 − λ ) + 2 λ − λ + 1 (cid:17) (cid:107) θ − θ (cid:48) (cid:107) (73) = 4 n h √ n h (1 − λ ) (cid:107) θ − θ (cid:48) (cid:107) . (74)By realizing that (74) holds for an arbitrary t , the statementin the theorem can be obtained.4) This part can be obtained by adapting the steps in theprevious proof for µ , and use the Lipschitz conditions in (12),(15), (17) and Lemma 2 accordingly.5) We use the same notation in the proof of the 3rd statement.Then, (cid:13)(cid:13)(cid:13) ∂(cid:96) t ( θ , µ ) ∂ µ − ∂(cid:96) t ( θ (cid:48) , µ ) ∂ µ (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) ( d t − ˆ d (cid:48) t ) c T (cid:16) t (cid:88) τ =1 ∂ h (cid:48) t ∂ h (cid:48) τ ∂ h (cid:48) τ ∂ µ (cid:17) − ( d t − ˆ d t ) c T (cid:16) t (cid:88) τ =1 ∂ h t ∂ h τ ∂ h τ ∂ µ (cid:17)(cid:13)(cid:13)(cid:13) (75) ≤ √ n h t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ ∂ h τ ∂ µ − ∂ h (cid:48) t ∂ h (cid:48) τ ∂ h (cid:48) τ ∂ µ (cid:13)(cid:13)(cid:13) (76) ≤ √ n h t (cid:88) τ =1 (cid:16)(cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h τ ∂ µ − ∂ h (cid:48) τ ∂ µ (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ∂ h (cid:48) τ ∂ µ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13)(cid:17) (77) ≤ √ n h t (cid:88) τ =1 λ t − τ (cid:13)(cid:13)(cid:13) ∂ h τ ∂ µ − ∂ h (cid:48) τ ∂ µ (cid:13)(cid:13)(cid:13) +2 √ n h n x t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) . (78)Here, to get (77) from (76), we add ± ∂ h t ∂ h τ ∂ h (cid:48) τ ∂ µ inside the norm,and use the triangle inequality. To get (78), we use (9) and(15).We bound the terms in (80) separately. We begin with thefirst term. Note that h τ = tanh( Wh τ − + Ux τ − ) , and h (cid:48) τ = tanh( W (cid:48) h (cid:48) τ − + Ux τ − ) . Then, √ n h t (cid:88) τ =1 λ t − τ (cid:13)(cid:13)(cid:13) ∂ h τ ∂ µ − ∂ h (cid:48) τ ∂ µ (cid:13)(cid:13)(cid:13) (79) ≤ √ n h t (cid:88) τ =1 λ t − τ (cid:0) λ √ n x (cid:107) h τ − − h (cid:48) τ − (cid:107) +2 √ n x n h (cid:107) θ − θ (cid:48) (cid:107) (cid:1) (80) ≤ √ n h (cid:16) t (cid:88) τ =1 λ t − τ (cid:17)(cid:16) λ √ n x n h − λ + 2 √ n x n h (cid:17) (cid:107) θ − θ (cid:48) (cid:107) (81) ≤ n h √ n x − λ (cid:16) λ − λ + 1 (cid:17) (cid:107) θ − θ (cid:48) (cid:107) , (82)where we add ± ∂ tanh( Wh (cid:48) τ − + Ux τ − ) ∂ θ , and use the triangleinequality, (12) and (13) for (80). 
We use Lemma 2 for (81).Now, we bound the second term in (78): √ n h n x t (cid:88) τ =1 (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) (83) ≤ √ n h n x t (cid:88) τ =1 ( t − τ ) λ t − τ − (cid:16) λ √ n h − λ + 2 λ √ n h (cid:17) (cid:107) θ − θ (cid:48) (cid:107) (84) = 4 n h √ n x − λ (cid:16) λ (1 − λ ) + λ − λ (cid:17) (cid:107) θ − θ (cid:48) (cid:107) , (85)where we use (68) to bound the terms (cid:13)(cid:13)(cid:13) ∂ h t ∂ h τ − ∂ h (cid:48) t ∂ h (cid:48) τ (cid:13)(cid:13)(cid:13) . Then, byusing (82) and (85), we bound (78) as follows: (cid:13)(cid:13)(cid:13) ∂(cid:96) t ( θ , µ ) ∂ µ − ∂(cid:96) t ( θ (cid:48) , µ ) ∂ µ (cid:13)(cid:13)(cid:13) (86) ≤ n h √ n x − λ (cid:16) λ (1 − λ ) + 2 λ − λ + 1 (cid:17) (cid:107) θ − θ (cid:48) (cid:107) (87) = 4 n h √ n x (1 − λ ) (cid:107) θ − θ (cid:48) (cid:107) . (88)By realizing that (88) holds for an arbitrary t , the statementin the theorem can be obtained. Proof of Theorem 2.
In the following, we use (cid:104)· , ·(cid:105) to denotethe inner product. Due to the space constraints, we omit thearguments in the partial derivative terms, i.e., ∂L t,w ∂ θ := ∂(cid:96) t ( θ , µ ) ∂ θ , ∂ K θ ,η θ L t,w ∂ θ := ∂ K θ ,η θ L t,w ( θ t , µ t ) ∂ θ ,∂L t,w ∂ µ := ∂(cid:96) t ( θ , µ ) ∂ µ , ∂ K θ ,η θ L t,w ∂ µ := ∂ K µ ,η µ L t,w ( θ t , µ t ) ∂ µ . We start our proof by bounding the term L t,w ( θ t +1 , µ t +1 ) − t,w ( θ t , µ t ) : L t,w ( θ t +1 , µ t +1 ) − L t,w ( θ t , µ t ) ≤ (cid:68) ∂L t,w ∂ θ , θ t +1 − θ t (cid:69) + β θ (cid:107) θ t +1 − θ t (cid:107) (cid:68) ∂L t,w ∂ µ , µ t +1 − µ t (cid:69) + β µ (cid:107) µ t +1 − µ t (cid:107) + β θµ (cid:107) θ t +1 − θ t (cid:107)(cid:107) µ t +1 − µ t (cid:107) (89) = − η θ (cid:68) ∂L t,w ∂ θ , ∂ K θ ,η θ L t,w ∂ θ (cid:69) − η µ (cid:68) ∂L t,w ∂ µ , ∂ K θ ,η θ L t,w ∂ µ (cid:69) + β θ η θ (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ θ (cid:13)(cid:13)(cid:13) + β µ η µ (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ µ (cid:13)(cid:13)(cid:13) + β θµ η θ η µ (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ θ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ µ (cid:13)(cid:13)(cid:13) (90) = − η θ (cid:68) ∂L t,w ∂ θ , ∂ K θ ,η θ L t,w ∂ θ (cid:69) − η µ (cid:68) ∂L t,w ∂ µ , ∂ K θ ,η θ L t,w ∂ µ (cid:69) +0 . (cid:16) η θ (cid:112) β θ (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ θ (cid:13)(cid:13)(cid:13) + η µ (cid:112) β µ (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ µ (cid:13)(cid:13)(cid:13)(cid:17) (91) ≤ − (cid:0) η θ − η θ β θ (cid:1)(cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ θ (cid:13)(cid:13)(cid:13) − (cid:0) η µ − η µ β µ (cid:1)(cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ µ (cid:13)(cid:13)(cid:13) , (92)where we use [45, Lemma 3.4] for (89), the update rules in(26)-(27) for (90), and the fact that β θµ = (cid:112) β θ β µ for (91)(see (22)-(24)). Moreover, for (92), we use [39, Lemma 3.2]and the fact a + b ) ≤ ( a + b ) for any a, b ∈ R .For the following, we define l t ( θ , µ ) = 0 for t ≤ .We continue our proof by bounding L T +1 ,w ( θ T +1 , µ T +1 ) asfollows: L T +1 ,w ( θ T +1 , µ T +1 )= T (cid:88) t =0 L t +1 ,w ( θ t +1 , µ t +1 ) − L t,w ( θ t , µ t ) (93) = T (cid:88) t =1 (cid:16) L t,w ( θ t +1 , µ t +1 ) − L t,w ( θ t , µ t ) (cid:17) + T (cid:88) t =0 (cid:16) L t +1 ,w ( θ t +1 , µ t +1 ) − L t,w ( θ t +1 , µ t +1 ) (cid:17) (94) ≤ − (cid:0) η θ − η θ β θ (cid:1) T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ θ (cid:13)(cid:13)(cid:13) − (cid:0) η µ − η µ β µ (cid:1) T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ µ (cid:13)(cid:13)(cid:13) + 4 √ n h Tw , (95)where we add ± L t,w ( θ t +1 , µ t +1 ) to (93) to obtain (94), anduse (93) and d t , ˆ d t ∈ [ −√ n h , √ n h ] for any t ∈ [ T ] , whichimplies L t +1 ,w ( θ t +1 , µ t +1 ) − L t,w ( θ t +1 , µ t +1 ) ≤ √ n h /w ,to obtain (95).We note that since the (cid:96) t ( θ t , µ t ) is defined as thesquare loss between the ground truth value and our pre-diction, it is non-negative for all t ∈ [ T ] , which implies L T +1 ,w ( θ T +1 , µ T +1 ) ≥ . Then, by using (95), we write (cid:0) η θ − η θ β θ (cid:1) T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ θ (cid:13)(cid:13)(cid:13) + (cid:0) η µ − η µ β µ (cid:1) T (cid:88) t =1 (cid:13)(cid:13)(cid:13) ∂ K θ ,η θ L t,w ∂ µ (cid:13)(cid:13)(cid:13) ≤ √ n h Tw . (96)By choosing < η θ ≤ / (2 β θ ) , < η µ ≤ / (2 β µ ) anddividing both sides of (96) with min { η θ , η µ } / , we can obtainthe statement in the theorem.A PPENDIX
APPENDIX C

In this part, we describe how to extend our work to the cross-entropy loss, which is denoted as $RE(\cdot\|\cdot)$. Since the cross-entropy loss is mainly used for classification, we describe our extension by using the following RNN architecture:
\begin{align*}
h_{t+1} &= \tanh(W h_t + U x_t)\\
\hat d_t &= f(c^T h_t)\\
E_t &= RE(d_t\|\hat d_t).
\end{align*}
Here, $f$ is assumed to be the sigmoid or softmax function, depending on the dimension of the desired sequence. As in (1)-(2), $h_t\in[-1,1]^{n_h}$ is the hidden state vector, $x_t\in[-1,1]^{n_x}$ is the input vector, and $d_t,\hat d_t\in[0,1]$, where $\hat d_t$ is our estimation. Moreover, $E_t$ denotes the cross-entropy loss at time instance $t$.

As with the squared loss, the cross-entropy loss is convex with respect to the output layer weights, i.e., $c$. Therefore, we can use projected online gradient descent -- as in (25) -- to ensure the convergence of the output layer update. Moreover, since the derivative of the cross-entropy loss with respect to $c$ has the same form as that of the squared loss, i.e.,
$$\frac{\partial\, 0.5(d_t-\hat d_t)^2}{\partial c} = \frac{\partial RE(d_t\|\hat d_t)}{\partial c} = (\hat d_t-d_t)\,h_t,$$
the Lipschitz properties derived in Theorem 1 apply to the cross-entropy loss as well. Since Theorem 2 uses only the Lipschitz properties of the RNN architecture to prove convergence, it can be extended to the cross-entropy loss with the same update-projection steps in (26)-(27). Therefore, Algorithm 1 can be used for the cross-entropy case without any change as well.
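To make the shared gradient form concrete, below is a minimal NumPy sketch of a single projected update of the output weights $c$ in the sigmoid/cross-entropy setting. It is illustrative only: the function and variable names (e.g., `output_layer_step`, `radius`) are ours, not the paper's, and the projection set is taken to be a norm ball as an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer_step(c, h_t, d_t, eta_mu=0.1, radius=1.0):
    """One illustrative projected-gradient step on the output weights c."""
    d_hat = sigmoid(c @ h_t)          # estimate \hat{d}_t = f(c^T h_t), f = sigmoid
    grad_c = (d_hat - d_t) * h_t      # gradient of the cross-entropy loss w.r.t. c
    c = c - eta_mu * grad_c           # gradient-descent step
    norm = np.linalg.norm(c)          # projection onto the ball {c : ||c|| <= radius}
    if norm > radius:
        c = c * (radius / norm)
    return c, d_hat

# Toy usage with n_h = 8.
rng = np.random.default_rng(0)
c = 0.1 * rng.standard_normal(8)
h_t = np.tanh(rng.standard_normal(8))  # hidden state lies in [-1, 1]^{n_h}
c, d_hat = output_layer_step(c, h_t, d_t=1.0)
```

The same step with $\hat d_t = c^T h_t$ (no sigmoid) recovers the squared-loss update, which is the sense in which the two losses share the same output-layer gradient form.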
APPENDIX D

In this part, we describe how to extend our work to LSTMs. The LSTM equations and the loss function are given as:
\begin{align}
z_t &= \tanh(W_1 y_{t-1} + U_1 x_t) \tag{97}\\
i_t &= \sigma(W_2 y_{t-1} + U_2 x_t) \tag{98}\\
f_t &= \sigma(W_3 y_{t-1} + U_3 x_t) \tag{99}\\
c_t &= i_t\odot z_t + f_t\odot c_{t-1} \tag{100}\\
o_t &= \sigma(W_4 y_{t-1} + U_4 x_t) \tag{101}\\
y_t &= o_t\odot\tanh(c_t) \tag{102}\\
\hat d_t &= c^T y_t \tag{103}\\
E_t &= 0.5(d_t-\hat d_t)^2, \tag{104}
\end{align}
where $\odot$ denotes the element-wise multiplication, $c_t\in\mathbb{R}^{n_h}$ is the state vector, $x_t\in[-1,1]^{n_x}$ is the input vector, $y_t\in\mathbb{R}^{n_h}$ is the output vector, and $\hat d_t\in[-\sqrt{n_h},\sqrt{n_h}]$ is our final estimation. Furthermore, the sigmoid function $\sigma(\cdot)$ and the hyperbolic tangent function $\tanh(\cdot)$ apply pointwise to the vector elements. The weight matrices are $W_i\in\mathbb{R}^{n_h\times n_h}$ and $U_i\in\mathbb{R}^{n_h\times n_x}$ for $i=1,\cdots,4$, and $c\in\mathbb{R}^{n_h}$ with $\|c\|\le 1$. As in the vanilla RNN case, the boundedness of $c$ can be guaranteed with a proper projection onto a convex set. Note that although we do not explicitly write the bias terms, they can be included in (97)-(103) by augmenting the input vector with a constant dimension.

Similar to the vanilla RNN case, the loss function $E_t$ is convex with respect to the output weights $c$. Therefore, we can use projected gradient descent to ensure the convergence of the output layer update. For the hidden layer weights, note that we define the projected gradients as in (4)-(5) and the regret as in (7) to ensure finding a stationary point with the gradient-descent updates. Then, by using the same intuition, we can extend the projected gradient definition to LSTMs as
\begin{align*}
\frac{\partial^{K_\theta,\eta_\theta}\ell_t(\theta_i,\mu_i)}{\partial \theta_i} &:= \frac{1}{\eta_\theta}\Big(\theta_i-\Pi_{K_\theta}\Big[\theta_i-\eta_\theta\frac{\partial \ell_t(\theta_i,\mu_i)}{\partial \theta_i}\Big]\Big),\\
\frac{\partial^{K_\mu,\eta_\mu}\ell_t(\theta_i,\mu_i)}{\partial \mu_i} &:= \frac{1}{\eta_\mu}\Big(\mu_i-\Pi_{K_\mu}\Big[\mu_i-\eta_\mu\frac{\partial \ell_t(\theta_i,\mu_i)}{\partial \mu_i}\Big]\Big),
\end{align*}
and the regret definition as
\begin{align}
R_w(T) := \sum_{t=1}^{T}\sum_{i=1}^{4}\Big(\Big\|\frac{\partial^{K_\theta,\eta_\theta} L_{t,w}(\theta_{i,t},\mu_{i,t})}{\partial \theta_i}\Big\|^2 + \Big\|\frac{\partial^{K_\mu,\eta_\mu} L_{t,w}(\theta_{i,t},\mu_{i,t})}{\partial \mu_i}\Big\|^2\Big), \tag{105}
\end{align}
where $\theta_{i,t}$ and $\mu_{i,t}$ are the vectorized forms of the weight matrices $W_{i,t}$ and $U_{i,t}$ at time $t$, i.e., $\theta_{i,t}=\mathrm{vec}(W_{i,t})$ and $\mu_{i,t}=\mathrm{vec}(U_{i,t})$.

We note that the convergence analysis of our algorithm requires the Lipschitz properties of the architecture of interest. To this end, we can use the Lipschitz properties of LSTMs derived in [46, Proposition 2]. Then, we can use these results in Theorem 2 to upper-bound the regret defined in (105). Accordingly, Algorithm 1 can be extended to LSTM optimization.
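For readers who prefer code, below is a small NumPy sketch of the forward recursion (97)-(103). It is a toy illustration under our own naming (`lstm_step`, `c_out` for the output weights $c$), not the training algorithm itself; training would proceed as described above, with projected gradient updates on each $\mathrm{vec}(W_i)$ and $\mathrm{vec}(U_i)$ and the projected output-layer update of Appendix C.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W, U, c_out, x_t, y_prev, c_prev):
    """One forward step of the LSTM in (97)-(103); W and U map i=1..4 to matrices."""
    z_t = np.tanh(W[1] @ y_prev + U[1] @ x_t)   # (97) candidate state
    i_t = sigmoid(W[2] @ y_prev + U[2] @ x_t)   # (98) input gate
    f_t = sigmoid(W[3] @ y_prev + U[3] @ x_t)   # (99) forget gate
    c_t = i_t * z_t + f_t * c_prev              # (100) cell state
    o_t = sigmoid(W[4] @ y_prev + U[4] @ x_t)   # (101) output gate
    y_t = o_t * np.tanh(c_t)                    # (102) hidden output
    d_hat = c_out @ y_t                         # (103) final estimate
    return y_t, c_t, d_hat

# Toy usage with n_h = 4, n_x = 3.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
W = {i: 0.1 * rng.standard_normal((n_h, n_h)) for i in range(1, 5)}
U = {i: 0.1 * rng.standard_normal((n_h, n_x)) for i in range(1, 5)}
c_out = rng.standard_normal(n_h)
c_out /= max(1.0, np.linalg.norm(c_out))        # keep ||c|| <= 1 by projection
y, c = np.zeros(n_h), np.zeros(n_h)
y, c, d_hat = lstm_step(W, U, c_out, rng.uniform(-1, 1, n_x), y, c)
```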
REFERENCES

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. New York, NY, USA: Cambridge University Press, 2006.
[2] V. G. Vovk, "Aggregating strategies," in Proceedings of the Third Annual Workshop on Computational Learning Theory, ser. COLT '90. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990, pp. 371-386.
[3] D. F. Specht, "A general regression neural network," IEEE Transactions on Neural Networks, vol. 2, no. 6, pp. 568-576, Nov 1991.
[4] T. Ergen and S. S. Kozat, "Efficient online learning algorithms based on LSTM neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3772-3783, Aug 2018.
[5] A. C. Singer, G. W. Wornell, and A. V. Oppenheim, "Nonlinear autoregressive modeling and estimation in the presence of noise," Digital Signal Processing.
[6] IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275-2285, Aug 2004.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1998.
[8] T. Lin, B. G. Horne, P. Tino, and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks," IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1329-1338, Nov 1996.
[9] J. Y. Goulermas, P. Liatsis, X. Zeng, and P. Cook, "Density-driven generalized regression neural networks (DD-GRNN) for function approximation," IEEE Transactions on Neural Networks, vol. 18, no. 6, pp. 1683-1696, Nov 2007.
[10] N. D. Vanli, M. O. Sayin, I. Delibalta, and S. S. Kozat, "Sequential nonlinear learning for distributed multiagent systems via extreme learning machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 546-558, March 2017.
[11] S. Lawrence and C. L. Giles, "Overfitting and neural networks: conjugate gradient and backpropagation," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000). Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 1, July 2000, pp. 114-119.
[12] I. Arel, D. C. Rose, and T. P. Karnowski, "Deep machine learning - a new frontier in artificial intelligence research [research frontier]," IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, Nov 2010.
[13] L. Shao, D. Wu, and X. Li, "Learning deep and wide: A spectral method for learning deep networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 12, pp. 2303-2308, Dec 2014.
[14] M. Hermans and B. Schrauwen, "Training and analysing deep recurrent neural networks," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 190-198.
[15] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, March 1990.
[16] P. Liu, X. Qiu, and X. Huang, "Recurrent neural network for text classification with multi-task learning," 2016.
[17] N. Laptev, J. Yosinski, L. E. Li, and S. Smyl, "Time-series extreme event forecasting with neural networks at Uber," in International Conference on Machine Learning, no. 34, 2017, pp. 1-5.
[18] W. D. Mulder, S. Bethard, and M.-F. Moens, "A survey on the application of recurrent neural networks to statistical language modeling," Computer Speech & Language, vol. 30, no. 1, pp. 61-98, 2015.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[20] J. Martens and I. Sutskever, "Learning recurrent neural networks with hessian-free optimization," in Proceedings of the 28th International Conference on Machine Learning, ser. ICML'11. USA: Omnipress, 2011, pp. 1033-1040. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104482.3104612
[21] G. V. Puskorius and L. A. Feldkamp, "Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 279-297, March 1994.
[22] G. Hinton, "Neural networks for machine learning," 2012.
[23] S. Ruder, "An overview of gradient descent optimization algorithms," 2016.
[24] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, March 1994.
[25] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," CoRR, vol. abs/1503.04069, 2015. [Online]. Available: http://arxiv.org/abs/1503.04069
[26] D. Krueger and R. Memisevic, "Regularizing RNNs by stabilizing activations," 2015.
[27] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," 2012.
[28] L. Blier, P. Wolinski, and Y. Ollivier, "Learning with random learning rates," 2018.
[29] F. Orabona and T. Tommasi, "Training deep networks without learning rates through coin betting," 2017.
[30] M. Hardt, T. Ma, and B. Recht, "Gradient descent learns linear dynamical systems," 2016.
[31] S. Oymak, "Stochastic gradient descent learns state equations with nonlinear activations," 2018.
[32] Z. Allen-Zhu, Y. Li, and Z. Song, "On the convergence rate of training recurrent neural networks," CoRR, vol. abs/1810.12065, 2018. [Online]. Available: http://arxiv.org/abs/1810.12065
[33] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.
[35] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270-280, Jun. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.2.270
[36] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," 2015.
[37] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," vol. 2, 2003.
[38] S. Aydore, L. Dicker, and D. Foster, "A local regret in nonconvex online learning," 2018.
[39] E. Hazan, K. Singh, and C. Zhang, "Efficient regret minimization in non-convex games," CoRR, vol. abs/1708.00075, 2017. [Online]. Available: http://arxiv.org/abs/1708.00075
[40] S. Shalev-Shwartz, "Online learning and online convex optimization," Foundations and Trends in Machine Learning, vol. 4, no. 2, pp. 107-194, 2012.
[41] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Backpropagation, Y. Chauvin and D. E. Rumelhart, Eds. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1995, pp. 433-486. [Online]. Available: http://dl.acm.org/citation.cfm?id=201784.201801
[42] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, "KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework," Multiple-Valued Logic and Soft Computing.
[43] ∼delve/data/datasets.html, accessed: 2019-07-21.
[44] P. Lancaster and H. Farahat, "Norms on direct sums and tensor products," Mathematics of Computation, vol. 26, 1972.
[45] S. Bubeck, "Convex optimization: Algorithms and complexity," Foundations and Trends in Machine Learning, vol. 8, no. 3-4, pp. 231-357, Nov. 2015. [Online]. Available: http://dx.doi.org/10.1561/2200000050
[46] J. Miller and M. Hardt, "Stable recurrent models," in