Faster Training of Very Deep Networks Via p-Norm Gates
Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh
Center for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
Email: {phtra, truyen.tran, dinh.phung, svetha.venkatesh}@deakin.edu.au
Abstract—A major contributing factor to the recent advances in deep neural networks is structural units that let sensory information and gradients propagate easily. Gating is one such structure that acts as a flow control. Gates are employed in many recent state-of-the-art recurrent models such as LSTM and GRU, and feedforward models such as Residual Nets and Highway Networks. This enables learning in very deep networks with hundreds of layers and helps achieve record-breaking results in vision (e.g., ImageNet with Residual Nets) and NLP (e.g., machine translation with GRU). However, there is limited work analysing the role of gating in the learning process. In this paper, we propose a flexible p-norm gating scheme, which allows user-controllable flow and, as a consequence, improves the learning speed. This scheme subsumes other existing gating schemes, including those in GRU, Highway Networks and Residual Nets, as special cases. Experiments on large sequence and vector datasets demonstrate that the proposed gating scheme helps improve the learning speed significantly without extra overhead.

I. INTRODUCTION
Deep neural networks are becoming the method of choice in vision [1], speech recognition [2] and NLP [3], [4]. Deep nets represent complex data more efficiently than shallow ones [5]. With more non-linear hidden layers, deep networks can theoretically model functions with higher complexity and nonlinearity [6]. However, learning standard feedforward networks with many hidden layers is notoriously difficult [7]. Likewise, standard recurrent networks suffer from vanishing gradients for long sequences [8], making gradient-based learning ineffective. A major reason is that many layers of non-linear transformation prevent the data signals and gradients from flowing easily through the network. In the forward direction from data to outcome, a change in data signals may not lead to any change in outcome, leading to the poor credit assignment problem. In the backward direction, a large error gradient at the outcome may not be propagated back to the data signals. As a result, learning stops prematurely without returning an informative mapping from data to outcome.

There have been several effective methods to tackle the problem. The first line of work is to use non-saturated non-linear transforms such as rectified linear units (ReLUs) [9], [1], [10], whose gradients are non-zero for a large portion of the input space. Another approach that also increases the level of linearity of the information propagation is gating [11], [12]. Gates are extra control neural units that let part of the information pass through a channel. They are learnable and have played an important role in state-of-the-art feedforward architectures such as Highway Networks [13] and Residual Networks [14], and recurrent architectures such as Long Short-Term Memory (LSTM) [11], [12] and the Gated Recurrent Unit (GRU) [15].

Although the details of these architectures differ, they share a common gating scheme. More specifically, let $\mathbf{h}_t$ be the activation vector of size $K$ (or memory cells, in the case of LSTM) at computational step $t$, where $t$ can be the index of the hidden layer in feedforward networks, or the time step in recurrent networks. The updating of $\mathbf{h}_t$ follows the rule:
$$\mathbf{h}_t \leftarrow \alpha_1 * \tilde{\mathbf{h}}_t + \alpha_2 * \mathbf{h}_{t-1} \quad (1)$$
where $\tilde{\mathbf{h}}_t$ is the non-linear transformation of $\mathbf{h}_{t-1}$ (and the input at $t$ if given), $\alpha_1, \alpha_2 \in [0,1]^K$ are gates, and $*$ is point-wise multiplication. When $\alpha_2 > 0$, a part of the previous activation vector is copied into the new vector. Thus the update has a nonlinear part (controlled by $\alpha_1$) and a linear part (controlled by $\alpha_2$). The nonlinear part keeps transforming the input into more complex output, whilst the linear part retains a part of the input so that it passes across layers much more easily. The linear part effectively prevents the gradient from vanishing even if there are hundreds of layers. For example, Highway Networks can be trained with more than 1000 layers [13], which was previously impossible for feedforward networks.

This updating rule opens room to study the relationship between the two gates $\alpha_1$ and $\alpha_2$, and there has been limited work in this direction. Existing work includes Residual Networks with $\alpha_1 = \alpha_2 = 1$, hence $\tilde{\mathbf{h}}_t$ plays the role of the residual. For the LSTM, there is no explicit relation between the two gates. The GRU and the work reported in [13] use $\alpha_1 + \alpha_2 = 1$, which leads to fewer parameters compared to the LSTM. This paper focuses on the latter, and aims to address an inherent drawback of this linear relationship. In particular, when $\alpha_1$ approaches 1 at rate $\lambda$, $\alpha_2$ approaches 0 at the same rate, and this may shut off the linear path and prevent information from passing through too early.
To this end we propose a more flexible p-norm gating scheme, in which the following relationship holds: $(\alpha_1^p + \alpha_2^p)^{1/p} = 1$ for $p > 0$, where the norm is applied element-wise. This introduces just one extra controlling hyperparameter $p$. When $p = 1$, the scheme reduces to the original gating in Highway Networks and GRUs. We evaluate this p-norm gating scheme in two settings: the traditional classification of vector data under Highway Networks, and sequential language modelling under GRUs. Extensive experiments demonstrate that with $p > 1$, the learning speed is significantly higher than with existing gating schemes where $p = 1$. Compared with the original gating, learning with $p > 1$ is 2 to 3 times faster for vector data and more than 15% faster for sequential data.

The paper is organized as follows. Section 2 presents Highway Networks, GRUs and the p-norm gating mechanism. Experiments and results with the two models are reported in Section 3. Finally, Section 4 discusses the findings further and concludes the paper.

II. METHODS
In this section, we propose our p-norm gating scheme. To aid the exposition, we first briefly review the two state-of-the-art models that use gating mechanisms: Highway Networks [13] (a feedforward architecture for vector-to-vector mapping) and Gated Recurrent Units [15] (a recurrent architecture for sequence-to-sequence mapping) in Sections II-A and II-B, respectively.

Notational convention:
We use bold lowercase letters for vectors and capital letters for matrices. The sigmoid function of a scalar $x$ is defined as $\sigma(x) = [1 + \exp(-x)]^{-1}$, $x \in \mathbb{R}$. With a slight abuse of notation, we use $\sigma(\mathbf{x})$, where $\mathbf{x} = (x_1, x_2, ..., x_n)$ is a vector, to denote the vector $(\sigma(x_1), ..., \sigma(x_n))$. The same rule applies to other functions of vectors $g(\mathbf{x})$. The operator $*$ is used to denote element-wise multiplication. For both feedforward networks and recurrent networks, we use the index $t$ to denote the computational step, which can be a layer in feedforward networks or a time step in recurrent networks. As shown in Fig. 1, the two architectures are quite similar except that an extra input $\mathbf{x}_t$ is available at each step in the recurrent case.

Figure 1. Gating mechanisms in Highway Networks and GRUs. (a) Highway Network layer. (b) Gated Recurrent Unit. The current hidden state $\mathbf{h}_t$ is the sum of the candidate hidden state $\tilde{\mathbf{h}}_t$ moderated by $\alpha_1$ and the previous hidden state $\mathbf{h}_{t-1}$ moderated by $\alpha_2$.

A. Highway Networks
A Highway Network is a feedforward neural network which maps an input vector $\mathbf{x}$ to an outcome $y$. A standard feedforward network consists of $T$ hidden layers where the activation $\mathbf{h}_t \in \mathbb{R}^{k_t}$ at the $t$-th layer ($t = 1, ..., T$) is a non-linear function of the layer below it:
$$\mathbf{h}_t = g(W_t \mathbf{h}_{t-1} + \mathbf{b}_t)$$
where $W_t$ and $\mathbf{b}_t$ are the parameter matrix and bias vector at the $t$-th layer, and $g(\cdot)$ is an element-wise non-linear transform. At the bottom layer, $\mathbf{h}_0$ is the input $\mathbf{x}$. The top hidden layer $\mathbf{h}_T$ is connected to the outcome.

Training very deep feedforward networks remains difficult for several reasons. First, the number of parameters grows with the depth of the network, which leads to overfitting. Second, the stack of multiple non-linear functions makes it difficult for the information and the gradients to pass through.

In Highway Networks, two modifications resolve these problems: (i) parameters are shared between layers, leading to a compact model, and (ii) the activation function is modified by adding sigmoid gates that let information from lower layers pass through linearly. Fig. 1(a) illustrates a Highway Network layer. The first modification requires all hidden layers to have the same number of hidden units $k$. The bottom layer is identical to that of standard feedforward networks. The second modification defines a candidate hidden state $\tilde{\mathbf{h}}_t \in \mathbb{R}^k$ as the usual non-linear transform:
$$\tilde{\mathbf{h}}_t = g(W \mathbf{h}_{t-1} + \mathbf{b})$$
where $W$ and $\mathbf{b}$ are the parameter matrix and bias vector shared among all hidden layers. Finally, the hidden state is gated by two gates $\alpha_1, \alpha_2 \in [0,1]^k$ as follows:
$$\mathbf{h}_t = \alpha_1 * \tilde{\mathbf{h}}_t + \alpha_2 * \mathbf{h}_{t-1} \quad (2)$$
for $t \geq 1$. The two gates $\alpha_1$ and $\alpha_2$ are sigmoid functions and can be independent, where $\alpha_1 = \sigma(U_1 \mathbf{h}_{t-1} + \mathbf{c}_1)$ and $\alpha_2 = \sigma(U_2 \mathbf{h}_{t-1} + \mathbf{c}_2)$, or constrained to sum to one element-wise, i.e., $1 = \alpha_1 + \alpha_2$. The latter option was used in the Highway Networks paper [13].

The part $\alpha_2 * \mathbf{h}_{t-1}$, called the carry behavior, makes the information from layers below pass easily through the network. This behavior also allows back-propagation to compute the gradient more directly with respect to the input. The net effect is that the networks can be very deep (up to a thousand layers).
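To make the shared-parameter layer above concrete, here is a minimal NumPy sketch of the gated update in Eq. (2) with the coupled carry gate $\alpha_2 = 1 - \alpha_1$ used in [13]. The parameter names (W, b, U1, c1) and the choice of tanh for $g$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(h_prev, W, b, U1, c1, g=np.tanh):
    """One Highway layer: Eq. (2) with the coupled carry gate alpha_2 = 1 - alpha_1.

    h_prev : (k,) activation of the layer below
    W, b   : shared candidate-state parameters
    U1, c1 : shared transform-gate parameters
    """
    h_cand = g(W @ h_prev + b)              # candidate state  h~_t
    alpha1 = sigmoid(U1 @ h_prev + c1)      # transform gate   alpha_1
    alpha2 = 1.0 - alpha1                   # carry gate       alpha_2  (p = 1)
    return alpha1 * h_cand + alpha2 * h_prev

# A 10-layer stack with shared parameters, mirroring the setup in Section III-A.
k = 50
rng = np.random.default_rng(0)
W, U1 = 0.1 * rng.standard_normal((k, k)), 0.1 * rng.standard_normal((k, k))
b, c1 = np.zeros(k), np.zeros(k)
h = rng.standard_normal(k)                  # stand-in input, assumed already of size k
for _ in range(10):
    h = highway_layer(h, W, b, U1, c1)
```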
B. Gated Recurrent Units

Recurrent neural networks: A recurrent neural network (RNN) is an extension of feedforward networks for mapping a variable-length input sequence $\mathbf{x}_1, ..., \mathbf{x}_T$ to an output sequence $\mathbf{y}_1, ..., \mathbf{y}_T$. An RNN allows self-loop connections and shares parameters across all steps of the sequence. For vanilla RNNs, the activation (also called the hidden state) $\mathbf{h}_t$ is a function of the current input and the previous hidden state $\mathbf{h}_{t-1}$:
$$\mathbf{h}_t = g(W \mathbf{x}_t + U \mathbf{h}_{t-1} + \mathbf{b}) \quad (3)$$
where $W$, $U$ and $\mathbf{b}$ are parameters shared among all steps. However, vanilla RNNs suffer from vanishing gradients for large $T$, thus preventing their use for long sequences [8].

Gated Recurrent Units: A Gated Recurrent Unit (GRU) [16], [15] is an extension of the vanilla RNN (see Fig. 1(b) for an illustration) that does not suffer from the vanishing gradient problem. At each step $t$, we first compute a candidate hidden state $\tilde{\mathbf{h}}_t$ as follows:
$$\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1} + \mathbf{b}_r)$$
$$\tilde{\mathbf{h}}_t = \tanh(W_h \mathbf{x}_t + U_h (\mathbf{r}_t * \mathbf{h}_{t-1}) + \mathbf{b}_h)$$
where $\mathbf{r}_t$ is a reset gate that controls the information flow from the previous state to the candidate hidden state. When $\mathbf{r}_t$ is close to 0, the previous hidden state is ignored and the candidate hidden state is reset with the current input.

The GRU then updates the hidden state $\mathbf{h}_t$ using the same rule as in Eq. (2). The difference is in the gate function: $\alpha_1 = \sigma(W_\alpha \mathbf{x}_t + U_\alpha \mathbf{h}_{t-1} + \mathbf{b}_\alpha)$, where the current input $\mathbf{x}_t$ is used. A linear relationship between the two gates is assumed: $\alpha_2 = 1 - \alpha_1$. This relationship enables the hidden state from the previous step to be partly copied into the current step. Hence $\mathbf{h}_t$ is a linear interpolation of the candidate hidden state and all previous hidden states. This prevents the gradients from vanishing and captures longer dependencies in the input sequence.
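The GRU step can be sketched in the same style. The parameter names (Wr, Ur, Wh, Uh, Wa, Ua and the biases) are again illustrative assumptions, and the carry gate is coupled as in the original GRU (p = 1).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU update: reset gate, candidate state, then Eq. (2) with alpha_2 = 1 - alpha_1."""
    Wr, Ur, br, Wh, Uh, bh, Wa, Ua, ba = params
    r      = sigmoid(Wr @ x_t + Ur @ h_prev + br)          # reset gate   r_t
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)    # candidate    h~_t
    alpha1 = sigmoid(Wa @ x_t + Ua @ h_prev + ba)          # update gate  alpha_1
    alpha2 = 1.0 - alpha1                                   # coupled carry gate (p = 1)
    return alpha1 * h_cand + alpha2 * h_prev
```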
Remark: Both Highway Networks and GRUs can be considered as simplified versions of Long Short-Term Memory [11]. With the linear interpolation between consecutive states, GRUs have fewer parameters. Empirical experiments revealed that GRUs are comparable to LSTMs and more efficient in training [17].

C. p-norm Gates

As described in the two sections above, the gates of the non-linear and linear parts in both Highway Networks (the version empirically validated in [13]) and GRUs use the same linear constraint:
$$\alpha_1 + \alpha_2 = 1, \quad \text{s.t.}\ \alpha_1, \alpha_2 \in (0,1)^k$$
where $\alpha_1$ plays the updating role and $\alpha_2$ plays the forgetting role in the computational sequence. Since the relationship is linear, when $\alpha_1$ gets closer to 1, $\alpha_2$ gets closer to 0 at the same rate. As the gates become more specialized and discriminative during learning, this same-rate convergence may block information from the lower layers from passing through at a high rate. The learning speed may suffer as a result.

We propose to relax this scheme by using the following p-norm scheme: $(\alpha_1^p + \alpha_2^p)^{1/p} = 1$, or equivalently:
$$\alpha_2 = (1 - \alpha_1^p)^{1/p} \quad (4)$$
for $p > 0$, where the norm is applied element-wise.

The dynamics of the relationship between the two gates as a function of $p$ is interesting. For $p > 1$ we have $\alpha_1 + \alpha_2 > 1$, which increases the amount of information passing through the linear part. To be more concrete, consider a fixed value of $\alpha_1$. Under the linear relationship ($p = 1$), a portion $\alpha_2 = 1 - \alpha_1$ of the old information passes through at each step; for $p = 2$ the passing portion grows to $(1 - \alpha_1^2)^{1/2}$, and for $p = 5$ to $(1 - \alpha_1^5)^{1/5}$, which is close to 1 unless $\alpha_1$ itself is close to 1 (concrete values for an illustrative gate value are computed in the sketch at the end of this section). When $p \to \infty$, $\alpha_2 \to 1$ regardless of $\alpha_1$, as long as $\alpha_1 < 1$. This is achievable since $\alpha_1$ is often modelled as a logistic function. When $p \to \infty$ and $\alpha_2 \to 1$, the activation of the final hidden layer loads all the information of the past without forgetting. Note that the ImageNet winner of 2015, the Residual Network [14], is a special case with $\alpha_1, \alpha_2 \to 1$. On the other hand, $p < 1$ implies $\alpha_1 + \alpha_2 < 1$: the linear gates close at a faster rate, which may prevent information and gradients from flowing easily through the layers.
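The following short sketch evaluates the p-norm carry gate of Eq. (4) for an illustrative transform-gate value of 0.5 (a value chosen purely for illustration, not taken from the paper), showing how the carry gate widens as p grows.

```python
import numpy as np

def carry_gate(alpha1, p):
    """p-norm carry gate alpha_2 = (1 - alpha_1^p)^(1/p), applied element-wise (Eq. 4)."""
    return (1.0 - np.asarray(alpha1) ** p) ** (1.0 / p)

alpha1 = 0.5                        # purely illustrative transform-gate value
for p in [0.5, 1, 2, 3, 5]:
    print(p, float(np.round(carry_gate(alpha1, p), 4)))
# p = 1 recovers the coupled gate alpha_2 = 1 - alpha_1 = 0.5; larger p pushes
# alpha_2 toward 1 (about 0.866 for p = 2 and 0.9937 for p = 5), while p < 1
# closes it faster (about 0.0858 for p = 0.5).
```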
III. EXPERIMENTS

In this section, we empirically study the behavior of the p-norm gates in feedforward networks (in particular, the Highway Networks presented in Section II-A) and recurrent networks (the Gated Recurrent Units in Section II-B).

A. Vector Data with Highway Networks
We used vector-data classification tasks to evaluate Highway Networks under p-norm gates. We used 10 hidden layers of 50 dimensions each. The models were trained using standard stochastic gradient descent (SGD) for 100 epochs with a mini-batch size of 20.

Datasets:
We used two large UCI datasets: MiniBooNE particle identification (MiniBoo, https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification) and Sensorless Drive Diagnosis (Sensorless, https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis). The first is a binary classification task where data were taken from the MiniBooNE experiment and used to classify electron neutrinos (signal) from muon neutrinos (background). The second dataset was extracted from motor current signals and has 11 different classes. Table I reports the data statistics.
Table I. Datasets for testing Highway Networks.

Dataset      Dimens.   Classes   Train. set   Valid. set
MiniBoo      50        2         48,700       12,200
Sensorless   48        11        39,000       9,800
Training curves:
Fig. 2 shows the training curves on the training sets. The loss is measured by negative log-likelihood. The training costs with p = 2 and p = 3 decrease and converge much faster than those with p = 0.8 and p = 1. On the MiniBoo dataset, training with p = 2 and p = 3 needs only 20 epochs to reach 0.3 nats, while p = 1 needs nearly 100 epochs and p = 0.8 does not reach that value. The pattern is similar on the Sensorless dataset: the training loss for p = 1 is 0.023 after 100 epochs, while for p = 2 and p = 3 the losses reach that value after 53 and 44 epochs, respectively. Training with even smaller p was largely unsuccessful, so we do not report it here.

Prediction:
The prediction results on the validation sets are reported in Table II. To evaluate the learning speed, we report the number of training epochs needed to reach a certain benchmark for different values of p. We also report the results after 100 epochs. For the MiniBoo dataset (Table II(a)), p = 0.8 does not reach the benchmark of 89% F1-score, and p = 1 needs 94 epochs, while both p = 2 and p = 3 need 33 epochs, nearly 3 times faster. For the Sensorless dataset (Table II(b)), p = 3 has the best result and needs only 35 epochs to achieve 99% macro F1-score, while p = 1 and p = 2 need 77 and 41 epochs, respectively.

Table II. Results on validation sets. The second column is the number of epochs to reach a benchmark measured by F1-score (%) for MiniBoo and macro F1-score (%) for Sensorless. The third column is the result after running 100 epochs.

(a) MiniBoo dataset
p     epochs to 89% F1-score   F1-score (%)
0.8   N/A                      88.5
1     94                       89.1
2     33                       90.4
3     33

(b) Sensorless dataset
p     epochs to 99% F1-score   macro F1-score (%)
0.8   92                       99.1
1     77                       99.4
2     41                       99.7
3     35                       99.7

Figure 2. Learning curves (negative log-likelihood per epoch) on the training sets. (a) MiniBoo dataset. (b) Sensorless dataset.

Visualization:
Fig. 3 illustrates how the 50 channels of the two gates open across the 10 layers for different values of p, for a randomly chosen data instance in the test set of MiniBoo. Recall from Sec. II-C that $\alpha_1$ and $\alpha_2$ control the amount of information in the non-linear part and the linear part, respectively, and $\alpha_1^p + \alpha_2^p = 1$. It is clear that the larger the value of p, the more open the two gates are. Interestingly, the values of most channels in the gate $\alpha_2$ are larger than those in the gate $\alpha_1$ for all values of p. The model seems to prefer the linear part. More interestingly, there is a gradual change in the gates over the layers, although gates between layers are not directly linked. At the lower layers, the gates are more uniform, but they get more informative near the top (which is closer to the outcome).

Figure 3. The dynamics of the 50 channels of the two gates $\alpha_1$ and $\alpha_2$ across the 10 layers for different values of p.

B. Sequential Data with GRUs
To evaluate the p-norm with GRUs, we compare the results for different values of p on the task of character-level language modeling [12], [18], [19]. This modeling level has attracted great interest recently due to its generalizability over language usage, a very compact state space (which equals the alphabet size, hence fast learning), and the ability to capture sub-word structures. More specifically, given a sequence of characters $c_1, ..., c_{t-1}$, the GRU models the probability of the next character $c_t$: $P(c_t \mid c_{t-1}, ..., c_1)$. The quality of the model is measured by bits-per-character on the test set, i.e., the average of $-\log_2 P(c_t \mid c_{t-1}, ..., c_1)$ over the test characters. The model is trained by maximizing the log-likelihood $\sum_{t=1}^{T} \log P(c_t \mid c_{t-1}, ..., c_1)$.
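As a quick illustration of the evaluation metric (not code from the paper), bits-per-character can be computed from the model's per-character predictive probabilities as follows.

```python
import numpy as np

def bits_per_character(char_probs):
    """Average of -log2 P(c_t | c_{t-1}, ..., c_1) over a character sequence.

    char_probs : iterable of the model's predicted probabilities for the
                 character that actually occurred at each step.
    """
    char_probs = np.asarray(char_probs, dtype=float)
    return float(np.mean(-np.log2(char_probs)))

# A model assigning probability 0.25 to every observed character scores 2.0 BPC.
print(bits_per_character([0.25, 0.25, 0.25]))  # 2.0
```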
Dataset: We used the UCI Reuters dataset (https://archive.ics.uci.edu/ml/datasets/Reuter_50_50), which contains articles of 50 authors from online Writeprint. We randomly chose sentences for the training set and the validation set. For sentences longer than 100 characters, we used only the first 100 characters. The model was trained with 400 hidden units for 50 epochs, with 32 sentences per mini-batch.
Results: Fig. 4 reports (a) the training curves and (b) the results on the validation set over epochs. It is clear from the two figures that the model with p = 3 performs best among the choices p ∈ {0.5, 1, 2, 3}, both in learning speed and model quality. To give a more concrete example, as indicated by the horizontal lines in Figs. 4(a,b), learning with p = 3 reaches a training loss of 1.5 nats after 34 epochs, while learning with p = 1 reaches that loss only after 43 epochs. For model quality on test data, the model with p = 3 achieves 2.0 bits-per-character after 41 epochs, faster than the model with p = 1, which needs 50 epochs.

Figure 4. (a) Learning curves on the training set. (b) Results on the validation set measured by bits-per-character.

C. Evaluating the Effectiveness of p

We have demonstrated on both Highway Nets and GRUs that training is faster with p > 1. However, does a larger value of p always imply better results and faster training? For example, as $p \to \infty$, we have $\alpha_2 \to 1$ and the activation at the final hidden layer contains a copy of the first layer and all other candidate states: $\mathbf{h}_T = \mathbf{h}_1 + \sum_{t=2}^{T} \alpha_{1t} * \tilde{\mathbf{h}}_t$. This makes the magnitude of the hidden states hard to control in very deep networks. To evaluate the effectiveness of p, we conducted experiments on the MiniBoo dataset with p ranging from 0.8 to 8 and networks with depths of 10, 20 and 30. We found that the model works very well for all values of p with 10 hidden layers. When the number of layers increases (say to 20 or 30), the model only works well with p = 2 and p = 3. This suggests that a proper control of the hidden state norms may be needed for very deep networks with widely open gates.

Figure 5. Results on the MiniBoo dataset with different p for networks with 10, 20 and 30 layers.

IV. DISCUSSION AND CONCLUSION
A. Discussion
Gating is a method for controlling the linearity of the functions approximated by deep networks. Another method is to use piece-wise linear units, such as those in the ReLU family [9], [1], [10]. Still, partial or piece-wise linearity retains the nonlinearity desirable for complex functions. At the same time, it helps to prevent activation units from saturating and the gradients from vanishing, making gradient-based learning possible for very deep networks [13], [11]. The main idea of p-norm gates is to allow a greater flow of data signals and gradients through many computational steps. This leads to faster learning, as we have demonstrated through experiments.

It remains less clear what the dynamics of the relationship between the linearity gate $\alpha_2$ and the nonlinearity gate $\alpha_1$ should be. We hypothesize that, at least during the earlier stages of learning, larger gates help to improve credit assignment by allowing easier gradient communication from the outcome error to each unit. Since the gates are learnable, the amount of linearity in the function approximator is controlled automatically.

B. Conclusion
In this paper, we have introduced p-norm gates, a flexible gating scheme that relaxes the relationship between the nonlinearity and linearity gates in state-of-the-art deep networks such as Highway Networks, Residual Networks and GRUs. The p-norm gates make the gates generally wider for larger p, and thus increase the amount of information and gradient flow passing through the networks. We have demonstrated the p-norm gates in two major settings: vector classification tasks with Highway Networks and sequence modelling with GRUs. The extensive experiments consistently demonstrated that learning is faster with p > 1.

There may be other ways to control linearity through the relationship between the linearity gate $\alpha_2$ and the nonlinearity gate $\alpha_1$. A possible scheme could be a monotonic relationship between the two gates. It also remains open to validate this idea on LSTM memory cells, which may lead to a more compact model with one fewer gate parameter set. Another open direction is to modify the internal workings of the gates to make them more informative [20], and to assist in regularizing the hidden states, following the findings in Sec. III-C and also a recent work [21].

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[4] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher, "Ask me anything: Dynamic memory networks for natural language processing," arXiv preprint arXiv:1506.07285, 2015.
[5] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[6] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, "The loss surfaces of multilayer networks," in International Conference on Artificial Intelligence and Statistics, 2015, pp. 192-204.
[7] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," The Journal of Machine Learning Research, vol. 10, pp. 1-40, 2009.
[8] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.
[9] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier networks," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP vol. 15, 2011, pp. 315-323.
[10] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1319-1327.
[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[12] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
[13] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," arXiv preprint arXiv:1507.06228, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[15] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[16] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014, pp. 1724-1734.
[17] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[18] T. Mikolov, I. Sutskever, A. Deoras, H.-S. Le, S. Kombrink, and J. Cernocky, "Subword language modeling with neural networks," 2012.
[19] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.
[20] T. Pham, T. Tran, D. Phung, and S. Venkatesh, "DeepCare: A deep dynamic memory model for predictive medicine," PAKDD'16, arXiv preprint arXiv:1602.00357, 2016.
[21] D. Krueger and R. Memisevic, "Regularizing RNNs by stabilizing activations," arXiv preprint arXiv:1511.08400, 2015.