ACCEPTED FOR PRESENTATION AT 2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP)
RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS
Simone Scardapane∗†, Steven Van Vaerenbergh‡, Danilo Comminiello†, Simone Totaro†, and Aurelio Uncini†
† Sapienza University of Rome, Italy    ‡ University of Cantabria, Spain
∗ Corresponding author e-mail: [email protected].
ABSTRACT
Gated recurrent neural networks have achieved remarkable results in the analysis of sequential data. Inside these networks, gates are used to control the flow of information, allowing the network to model even very long-term dependencies in the data. In this paper, we investigate whether the original gate equation (a linear projection followed by an element-wise sigmoid) can be improved. In particular, we design a more flexible architecture, with a small number of adaptable parameters, which is able to model a wider range of gating functions than the classical one. To this end, we replace the sigmoid function in the standard gate with a non-parametric formulation extending the recently proposed kernel activation function (KAF), with the addition of a residual skip-connection. A set of experiments on sequential variants of the MNIST dataset shows that the adoption of this novel gate allows us to improve accuracy with a negligible cost in terms of computational power, and with a large speed-up in the number of training iterations.
Index Terms — Recurrent network, LSTM, GRU, gate, kernel activation function
1. INTRODUCTION
Recurrent neural networks (RNNs) have recently gained large popularity in the analysis of sequential data, following the more widespread success of the field of deep learning [1]. Among all possible RNNs, gated architectures (originating from the seminal work in [2]) have proven to be particularly suitable for handling long, or very long, temporal dependencies in the data. While the original long short-term memory (LSTM) network dates back twenty years, recent advances in computational power have made it possible to scale these models to multi-layered and sequence-to-sequence configurations [3], achieving significant breakthroughs in multiple fields, e.g., neural machine translation [4].

Fundamentally, a gate is a multiplicative layer that learns to perform a 'soft selection' of some content (e.g., the hidden state of the RNN), allowing the gradient and the information to flow more easily through multiple time steps while avoiding vanishing or exploding gradients. Despite their importance, however, the role and the use of gates inside RNNs remain open research questions. The original LSTM network was designed with two gates in order to have a unitary derivative [2], which were later increased to three with the inclusion of a forget gate [5]. Subsequent research has experimented with a wide range of different configurations, including the gated recurrent unit (GRU) [4], which merges two gates into a single update gate, or even simpler architectures having a single gate, such as the minimally gated unit [6] or the JANET model [7] (see also [8] for a large comparison of feasible variations). From a theoretical perspective, [9] has recently shown that gates naturally arise if we assume (axiomatically) a (quasi-)invariance to time transformations of the input data.

Note that, even considering this wide range of alternative formulations (mostly in terms of how many gates are needed for an optimal architecture), the basic design of a single gate has remained more or less constant, i.e., each gate is obtained by applying an element-wise sigmoid nonlinearity to a linear projection of the inputs and/or hidden states. Only a handful of works have explored alternative designs for this component, such as the inclusion of hidden layers [10], or skip-connections through the gates of different layers [11]. Motivated by the possibility of improving the performance of RNNs, in this paper we propose an extended gate architecture, which is endowed with a larger expressiveness than the standard formulation. At the same time, we try to keep the computational overhead as small as possible. To this end, we focus on replacing the sigmoid operation, extending it with a non-parametric form that is adapted independently for each cell (and for each gate) inside the gated RNN.

Our starting point is the observation that a lot of work has been done in the deep learning literature on designing flexible activation functions that could replace standard hyperbolic tangents or rectified linear units (ReLUs). These include simple parametric schemes like the parametric ReLU [12], or more elaborate formulations where the flexibility of the functions can be determined as a hyper-parameter. The latter case includes maxout networks [13], adaptive piecewise linear (APL) units [14], and kernel activation functions (KAFs) [15]. There is a good consensus that endowing the functions with this flexibility can enhance the performance of the network, possibly allowing to simplify the architecture of the neural network itself significantly [14]. However, the sigmoid function used inside a gate is different from a standard activation function, in that its behavior cannot be unrestricted (e.g., by taking negative values). Due to this, none of these proposals can be applied straightforwardly to the case of gates inside RNNs: for example, all functions based on rectifiers (such as the APL) are unbounded over their domain [14].

To this end, in this paper we propose an extension of the basic KAF model. A KAF is a non-parametric activation function defined in terms of a kernel expansion over a fixed dictionary [15]. Here, we combine it with a bounded nonlinearity and a residual connection to make its behavior consistent with that of a gating function (more details in Section 3). As a result, our proposed flexible gate mimics exactly a standard sigmoid at the beginning of the optimization process, but, thanks to the addition of a small number of adaptable parameters, it can adapt itself based on the training data to a much larger family of shapes (see Fig. 1 for some examples).

We evaluate the proposed model on a set of standard benchmarks involving sequential formulations of the MNIST dataset (e.g., where each image is processed pixel-by-pixel). We show that a gated RNN with our proposed flexible gate can achieve higher accuracy with a faster rate of convergence, while at the same time having a small computational overhead with respect to a standard formulation.

Paper outline
The rest of the paper is organized as follows. In Section 2 we introduce the GRU model (as a representative example of gated RNNs). Next, the proposed gate with flexible sigmoids is described in Section 3. We evaluate the proposal in Section 4, before concluding in Section 5.
2. GATED RECURRENT NEURAL NETWORKS

2.1. Update equations
Consider a generic sequential task, where at each time step t we receive a new input x_t ∈ R^d. The evolution of a generic RNN can be described by the following equation:

h_t = \phi(x_t, h_{t-1}; \theta),   (1)

where h_t represents the internal state of the RNN, \theta is the set of adaptable parameters, and \phi(\cdot) is a generic update rule. Gated RNNs implement \phi(\cdot) with one or more gating functions, which control the flow of information between time steps. As stated in Section 1, different types of gated RNNs, with different numbers of gates, exist in the literature. For brevity, in the rest of the paper we focus on the case of GRUs [4], although our method extends immediately to LSTMs and any other gated network described in the previous section. GRUs, however, represent a good compromise between accuracy and number of gates (two compared to three in the LSTM), which is why we choose them here.

A GRU cell updates its internal state h_{t-1} as follows:

u_t = \sigma(W_u x_t + V_u h_{t-1} + b_u),   (2)
r_t = \sigma(W_r x_t + V_r h_{t-1} + b_r),   (3)
h_t = (1 - u_t) \circ h_{t-1} + u_t \circ \tanh(W_h x_t + V_h (r_t \circ h_{t-1}) + b_h),   (4)

where (2) and (3) are, respectively, the update gate and the reset gate, \circ is the element-wise multiplication, \sigma(\cdot) is the standard sigmoid function, and the cell has adaptable parameters given by \theta = \{W_u, W_r, W_h, V_u, V_r, V_h, b_u, b_r, b_h\}. Note that, while the \tanh(\cdot) function in (4) can be changed freely (e.g., to a ReLU function), the sigmoid function in the two gates is essential for a correct behavior, i.e., the update vector u_t and the reset vector r_t should always remain bounded in [0, 1].

GRUs can be used for a variety of tasks by properly manipulating the sequence of their internal states h_1, h_2, \ldots. Since in our experiments we consider the problem of classifying each sequence of data, we briefly describe here the details of the optimization approach. We underline, however, that the method we propose in the next section is agnostic to the actual task, as it acts on the basic GRU formulation.

Suppose we are given N different sequences \{x_t^i\}_{i=1}^{N}, and for each of them a single class label y_i = 1, \ldots, C. Denote by h^i the internal state of the GRU after processing the i-th sequence. To obtain a classification, this is fed through another layer with a softmax activation function:

\hat{y}^i = \mathrm{softmax}(A h^i + b),   (5)

with \hat{y}^i having values over the C-dimensional simplex. The network is trained by minimizing the average cross-entropy between the real classes and the predicted classes:

J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} [y_i = c] \log(\hat{y}^i_c),   (6)

where \hat{y}^i_c is the c-th element of the prediction vector \hat{y}^i, and [\cdot] is 1 if its argument is true, 0 otherwise. Minimization of (6) is obtained by unrolling the network over all time steps via back-propagation through time (BPTT) [1]. While this covers the basic mathematical framework, in practice several methods can be used to stabilize and improve the convergence of BPTT, including gradient clipping [3], multiple variations of dropout [16], or appropriately regularizing the weights and/or the changes in activations during training [17].

Fig. 1. Random samples of the proposed flexible gates with different bandwidths: (a) γ = 1.0, (b) γ = 0.5, (c) γ = 0.1. In all cases the dictionary points are sampled uniformly on the x-axis, while the mixing coefficients are sampled from a normal distribution; the y-axis always ranges from 0 to 1.
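As a concrete reference, the PyTorch sketch below implements the update equations (2)-(4), together with the softmax readout (5) and the cross-entropy loss (6). It is only an illustration of the formulas above: the class name NaiveGRUCell and the toy dimensions are our own choices, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class NaiveGRUCell(nn.Module):
    """Direct implementation of the GRU update in Eqs. (2)-(4)."""

    def __init__(self, d, n_h):
        super().__init__()
        # One linear map per component, acting on the concatenation [x_t, h_{t-1}]
        self.update = nn.Linear(d + n_h, n_h)      # W_u, V_u, b_u
        self.reset = nn.Linear(d + n_h, n_h)       # W_r, V_r, b_r
        self.candidate = nn.Linear(d + n_h, n_h)   # W_h, V_h, b_h

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        u_t = torch.sigmoid(self.update(xh))                 # Eq. (2), update gate
        r_t = torch.sigmoid(self.reset(xh))                  # Eq. (3), reset gate
        xrh = torch.cat([x_t, r_t * h_prev], dim=-1)
        h_tilde = torch.tanh(self.candidate(xrh))            # candidate state
        return (1.0 - u_t) * h_prev + u_t * h_tilde          # Eq. (4)

# Toy usage: classify a batch of random sequences with a softmax readout, Eq. (5)
d, n_h, C, T, batch = 28, 64, 10, 28, 8                      # illustrative sizes
cell, readout = NaiveGRUCell(d, n_h), nn.Linear(n_h, C)
x = torch.randn(batch, T, d)
h = torch.zeros(batch, n_h)
for t in range(T):
    h = cell(x[:, t, :], h)
logits = readout(h)
# cross_entropy applies the softmax internally, so this matches Eq. (6)
loss = nn.functional.cross_entropy(logits, torch.randint(0, C, (batch,)))
```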
3. PROPOSED GATE WITH FLEXIBLE SIGMOID
What we propose in this paper is to replace the sigmoid in (2) and (3) with another (scalar) function with higher flexibility, while (a) keeping the overall 'sigmoid-like' behavior, and (b) maintaining a low computational overhead. Our proposal builds upon the KAF [15], which was originally designed as a replacement for standard activation functions, e.g., the ReLU. In this section we briefly describe the KAF formulation before introducing our extension.
A KAF is defined as a one-dimensional kernel expansion:

\mathrm{KAF}(s) = \sum_{i=1}^{D} \alpha_i \, \kappa(s, d_i),   (7)

where s is a generic input to the activation function, \kappa(\cdot,\cdot) : R × R → R is a valid kernel function, \{\alpha_i\}_{i=1}^{D} are called the mixing coefficients, and \{d_i\}_{i=1}^{D} are called the dictionary elements. To make back-propagation tractable, differently from a standard kernel method, the D elements of the dictionary are fixed beforehand and shared across the entire network. In particular, we let D be a user-chosen hyper-parameter, and we sample the D values over the x-axis, uniformly around zero. Basically, higher values of D correspond to an increased flexibility of the function, together with an increase in the number of free parameters. The mixing coefficients are adapted through standard back-propagation, independently for every neuron, which can be realized efficiently through vectorized operations [15].

The kernel function \kappa(\cdot,\cdot) only needs to respect the positive semi-definiteness property, and for our experiments we use the 1D Gaussian kernel defined as:

\kappa(s, d_i) = \exp\{-\gamma (s - d_i)^2\},   (8)

where \gamma ∈ R is a kernel parameter, i.e., the inverse bandwidth. The parameter \gamma > 0 defines the range of influence of each \alpha_i element. For selecting it, we adopt the rule-of-thumb proposed in [15]:

\gamma = \frac{1}{6\Delta^2},   (9)

where \Delta is the resolution of the dictionary elements. Additionally, we have found it beneficial to let \gamma adapt independently for each KAF, always through back-propagation.

We cannot use (7) straightforwardly because (a) it is unbounded, and (b) using the Gaussian kernel, it goes to 0 for s → ±∞ in both directions. We propose to alleviate these problems by using the following formulation for the flexible gate:

\sigma_{\mathrm{KAF}}(s) = \sigma\left(\mathrm{KAF}(s) + \frac{1}{2}s\right),   (10)

where the sigmoid \sigma on the right-hand side preserves the boundedness of the function, while the addition of the residual term ensures that (10) behaves like a standard sigmoid outside the range of the dictionary. In Fig. 1 we show some realizations of (10) for different choices of the mixing coefficients and \gamma. It can be seen that the functions can represent a wide array of different shapes, all consistent with the general behavior of a gating function.

In fact, to simplify optimization, we also initialize the mixing coefficients to approximate the identity function, so that the flexible gate behaves as a sigmoid in the initial stage of training. In order to do this, we apply kernel ridge regression on the dictionary to select the initial values of the mixing coefficients:

\alpha = (K + \varepsilon I)^{-1} d,   (11)

where \alpha is the vector of mixing coefficients, K is the kernel matrix computed over the dictionary elements, d is the vector of dictionary elements, I is the identity matrix of appropriate size, and \varepsilon > 0 is a small scalar coefficient ensuring the stability of the matrix inversion (a small fixed value is used in the experiments). By using this initialization, all gates behave identically to a standard sigmoid at the beginning (an example of initialization is shown in Fig. 2). In our proposed GRU, we then use a different set of mixing coefficients for each update gate and each reset gate.

Fig. 2. Example of the proposed KAF gate when initialized as a standard sigmoid. The dashed line is KAF(s) in (10); the markers show its mixing coefficients; the solid green line is the final output of the gate.
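To make the construction concrete, the sketch below shows one possible PyTorch implementation of the flexible gate, following (7)-(11) as reconstructed above: a fixed dictionary shared by all cells, per-cell mixing coefficients and bandwidths adapted by back-propagation, the residual term of (10), and the kernel ridge regression initialization of (11). The class name KAFGate, the dictionary range, and the default number of elements D are assumptions made here for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class KAFGate(nn.Module):
    """Flexible sigmoid gate: sigma_KAF(s) = sigma(KAF(s) + 0.5 * s), Eq. (10)."""

    def __init__(self, n_cells, D=20, boundary=3.0, eps=1e-6):
        super().__init__()
        # Fixed dictionary, shared by all cells: D points equispaced around zero
        d = torch.linspace(-boundary, boundary, D)
        self.register_buffer("dict", d)
        # Rule-of-thumb inverse bandwidth (Eq. (9)), then adapted per cell
        delta = d[1] - d[0]
        gamma0 = 1.0 / (6.0 * delta ** 2)
        self.log_gamma = nn.Parameter(torch.log(gamma0) * torch.ones(n_cells, 1))
        # Identity-like initialization of the mixing coefficients via
        # kernel ridge regression on the dictionary, Eq. (11)
        K = torch.exp(-gamma0 * (d.view(-1, 1) - d.view(1, -1)) ** 2)
        alpha0 = torch.linalg.solve(K + eps * torch.eye(D), d.view(-1, 1)).view(1, -1)
        self.alpha = nn.Parameter(alpha0.repeat(n_cells, 1))   # one set per cell

    def forward(self, s):
        # s has shape (batch, n_cells); expand to (batch, n_cells, D)
        diff = s.unsqueeze(-1) - self.dict.view(1, 1, -1)
        gamma = torch.exp(self.log_gamma).unsqueeze(0)          # positive bandwidths
        kaf = (self.alpha.unsqueeze(0) * torch.exp(-gamma * diff ** 2)).sum(dim=-1)  # Eq. (7)
        return torch.sigmoid(kaf + 0.5 * s)                     # Eq. (10)

# Usage: one such module would replace sigma(.) in each of (2) and (3)
gate = KAFGate(n_cells=64)
s = torch.randn(8, 64)            # pre-activations, e.g., W_u x_t + V_u h_{t-1} + b_u
print(gate(s).shape)              # torch.Size([8, 64]), values in (0, 1)
```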
4. EXPERIMENTAL RESULTS

4.1. Experimental setup
To evaluate the proposed algorithm, we compare a standard GRU with a GRU endowed with flexible gates as in (10). We use a standard set of sequential benchmarks constructed from the MNIST dataset, which are commonly used for testing long-term dependencies in gated RNNs [18]. MNIST is an image classification dataset composed of 60000 images for training (and 10000 for testing), each belonging to one out of ten classes. Each image has dimension 28 × 28, with black-and-white pixels. From this, we construct three sequential problems:

Row-wise MNIST (R-MNIST)
Each image is processed sequentially, row-by-row, i.e., we have sequences of length 28, each element represented by the values of 28 pixels.

Pixel-wise MNIST (P-MNIST)
Each image is represented as a sequence of 784 pixels, read from left to right and from top to bottom in the original image.
Permuted P-MNIST (PP-MNIST)
Similar to P-MNIST, but the order of the pixels is shuffled using a (fixed) permutation matrix.

The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/. A short sketch of how the three sequential variants can be constructed is given below.
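The three variants only require a few reshaping operations on the original images. The snippet below is a minimal illustration using torchvision; the helper name to_sequence and the permutation seed are our own choices.

```python
import torch
from torchvision import datasets, transforms

mnist = datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor())

# A fixed permutation of the 784 pixel positions, shared by all images
perm = torch.randperm(28 * 28, generator=torch.Generator().manual_seed(0))

def to_sequence(img, variant):
    """Turn a (1, 28, 28) image into a (T, d) sequence for the three benchmarks."""
    if variant == "R-MNIST":        # 28 steps, 28 pixels per step
        return img.view(28, 28)
    flat = img.view(28 * 28, 1)     # 784 steps, 1 pixel per step
    if variant == "P-MNIST":
        return flat
    return flat[perm]               # PP-MNIST: same pixels, fixed shuffled order

img, label = mnist[0]
print(to_sequence(img, "R-MNIST").shape, to_sequence(img, "PP-MNIST").shape)
```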
Table 1. Average test accuracy obtained by a standard GRU compared with a GRU endowed with the proposed flexible gates, on R-MNIST, P-MNIST, and PP-MNIST (standard deviations are also reported).

P-MNIST and PP-MNIST are particularly challenging because of the need to process relatively long-term dependencies in the data.

The GRUs have a fixed internal state dimensionality, and we include an additional batch normalization step [1] before the output layer in (5) to stabilize training in the presence of long sequences. We train using the Adam optimization algorithm with BPTT on mini-batches, clipping all gradient updates in norm. For the proposed gating function, we initialize the dictionary with elements equispaced around zero. Early stopping is used to decide when to finish the optimization procedure: we keep the last 10000 elements of the training set as a validation part, and we compute the average accuracy of the model at regular intervals, stopping whenever the accuracy has not improved for a given number of iterations. All the code is written in PyTorch, and it is run using a Tesla K80 GPU on the Google Colaboratory platform.

The results of the experiments, averaged over multiple runs, are given in Table 1. We can see that the proposed GRU achieves higher test accuracy in all three cases. Interestingly, this difference is particularly significant for the datasets with long temporal dependencies (P-MNIST and PP-MNIST). Here, the standard GRU also exhibits high standard deviations, due to it converging to poorer minima in some cases. The proposed GRU is able to achieve high accuracy consistently over all runs.

We conjecture that this last result is also due to the higher flexibility allowed to the optimization procedure during training. To test this, we visualize in Fig. 3 the average loss and validation accuracy of the two algorithms on the P-MNIST dataset. We can see that the proposed GRU converges faster and more steadily, especially in the first half of training. This is consistent with the behavior found when using KAFs as activation functions, e.g., [15].

Fig. 3. Convergence results on the P-MNIST dataset for a standard GRU and the proposed GRU: (a) loss evolution on the training dataset (per iteration); (b) validation accuracy (per epoch). The plots are focused on the first half of training; shaded areas represent the variance.

Fig. 4 shows a histogram of the values of γ in (8) after training the proposed GRU on the P-MNIST dataset. Interestingly, the optimal architectures can benefit from a wide range of different bandwidths for the kernel. Looking back at Fig. 1, this translates into functions ranging from almost-linear to highly nonlinear behaviors.

Fig. 4. Sample histogram of the values of γ in (8), after training, for the reset gate of the GRU.

To conclude our experimental section, we also perform a simple ablation study on the R-MNIST dataset, by training our proposed GRU with two modifications:

• Rand: we initialize the mixing coefficients randomly, instead of following the identity initialization in (11).

• No-Residual: we remove the residual connection from (10), leaving only σ_KAF(s) = σ(KAF(s)).

The results of this set of experiments are provided in Fig. 5, where we also show with a red line the average test accuracy obtained by the standard GRU. We can see that removing the residual connection vastly degrades the performance, possibly because the resulting gating functions revert to zero at their boundaries. Initializing the coefficients as the identity improves the accuracy by a smaller margin, and also stabilizes it by reducing the variation of the results.

Fig. 5. Average results (in terms of test accuracy) of the ablation study on the R-MNIST dataset. Rand: the mixing coefficients are initialized randomly. No-Residual: the residual connection in (10) is removed. A dashed red line shows the performance of a standard GRU.
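For completeness, the fragment below sketches the optimization procedure of Section 4.1 (Adam, gradient clipping in norm, and validation-based early stopping). Since the exact hyper-parameter values are not reproduced here, the model, the data, the learning rate, and the patience in the snippet are placeholders rather than the settings used in the experiments.

```python
import torch
import torch.nn as nn

# Stand-ins for the actual model and data (the real experiments use the GRU of
# Section 2, batch normalization before the readout, and sequential MNIST).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
train_data = [(torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))) for _ in range(5)]
val_data = [(torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))) for _ in range(2)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)      # placeholder learning rate
best_acc, patience, since_best = 0.0, 5, 0                     # placeholder patience

for step, (x, y) in enumerate(train_data):
    model.train()
    loss = nn.functional.cross_entropy(model(x), y)            # Eq. (6)
    optimizer.zero_grad()
    loss.backward()                                             # BPTT when the model is recurrent
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)           # clip gradient norm
    optimizer.step()

    model.eval()                                                # periodic validation
    with torch.no_grad():
        acc = sum((model(xv).argmax(-1) == yv).float().mean().item()
                  for xv, yv in val_data) / len(val_data)
    since_best = 0 if acc > best_acc else since_best + 1
    best_acc = max(best_acc, acc)
    if since_best >= patience:                                  # early stopping
        break
```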
5. CONCLUSIONS
In this paper, we proposed an extension of the standard gating component used in most gated RNNs, e.g., LSTMs and GRUs. Specifically, we replace the element-wise sigmoid operation with a per-cell function endowed with a small number of parameters, which can adapt to the training data. To this end, we extend the kernel activation function in order to make its shape always consistent with a sigmoid-like behavior. The resulting function can be implemented easily in most deep learning frameworks, has a smooth behavior over its entire domain, and imposes only a small computational overhead on the architecture. Experiments on a set of standard sequential problems with GRUs show that the proposed architecture achieves superior results (in terms of test accuracy), while at the same time converging faster (and more reliably) in terms of number of iterations, thanks to its increased flexibility.

Future research directions involve experimenting with other gated RNNs (possibly with different numbers of gates, layers, etc.), applications, and interpreting the resulting functions with respect to the task at hand. More generally, sigmoid-like functions are essential for many other deep learning components besides gated RNNs, including softmax functions for classification, attention-based architectures, and neural memories [19]. An interesting question is whether our extended formulation can benefit (both in terms of accuracy and speed) these other architectures.
6. REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.

[2] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[3] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[4] K. Cho et al., "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.

[5] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proc. IEEE International Joint Conference on Neural Networks (IJCNN), 2000, vol. 3, pp. 189–194.

[6] G.-B. Zhou, J. Wu, C.-L. Zhang, and Z.-H. Zhou, "Minimal gated unit for recurrent neural networks," International Journal of Automation and Computing, vol. 13, no. 3, pp. 226–234, 2016.

[7] J. van der Westhuizen and J. Lasenby, "The unreasonable effectiveness of the forget gate," arXiv preprint arXiv:1804.04849, 2018.

[8] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.

[9] C. Tallec and Y. Ollivier, "Can recurrent neural networks warp time?," in Proc. 2018 International Conference on Learning Representations (ICLR), 2018.

[10] Y. Gao and D. Glowacka, "Deep gate recurrent neural network," JMLR: Workshop and Conference Proceedings, vol. 63, pp. 350–365, 2016.

[11] K. Yao, T. Cohn, K. Vylomova, K. Duh, and C. Dyer, "Depth-gated recurrent neural networks," arXiv preprint arXiv:1508.03790, 2015.

[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.

[13] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proc. 30th International Conference on Machine Learning (ICML), 2013.

[14] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, "Learning activation functions to improve deep neural networks," arXiv preprint arXiv:1412.6830, 2014.

[15] S. Scardapane, S. Van Vaerenbergh, S. Totaro, and A. Uncini, "Kafnets: kernel-based non-parametric activation functions for neural networks," arXiv preprint arXiv:1707.04035, 2017.

[16] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 1019–1027.

[17] D. Krueger and R. Memisevic, "Regularizing RNNs by stabilizing activations," in Proc. 2016 International Conference on Learning Representations (ICLR), 2016.

[18] S. Zhang et al., "Architectural complexity measures of recurrent neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 1822–1830.

[19] A. Graves, G. Wayne, M. Reynolds, T. Harley, et al., "Hybrid computing using a neural network with dynamic external memory," Nature, vol. 538, no. 7626, pp. 471–476, 2016.