Fast Weight Long Short-Term Memory
T. Anderson Keller, Sharath Nittur Sridhar, Xin Wang
Intel AI Lab, Artificial Intelligence Products Group, Intel Corporation
{andy.a.keller, sharath.nittur.sridhar, xin3.wang}@intel.com

ABSTRACT
Associative memory using fast weights is a short-term memory mechanism that substantially improves the memory capacity and time scale of recurrent neural networks (RNNs). As recent studies introduced fast weights only to regular RNNs, it is unknown whether fast weight memory is beneficial to gated RNNs. In this work, we report a significant synergy between long short-term memory (LSTM) networks and fast weight associative memories. We show that this combination, in learning associative retrieval tasks, results in much faster training and lower test error, a performance boost most prominent at high memory task difficulties.
1 INTRODUCTION
RNNs are highly effective in learning sequential data. Simple RNNs maintain memory through hidden states that evolve over time. Keeping memory in this simple, transient manner has, among others, two shortcomings. First, memory capacity scales linearly with the dimensionality of recurrent representations, which can be limiting for complex tasks. Second, it is difficult to support memory at diverse time scales, which is particularly challenging for tasks that require information from a variably distant past.

Numerous differentiable memory mechanisms have been proposed to overcome the limitations of deep RNNs. Some of these mechanisms, e.g. attention, have become a universal practice in real-world applications such as machine translation (Bahdanau et al., 2014; Daniluk et al., 2017; Vaswani et al., 2017). One type of memory augmentation of RNNs includes mechanisms that employ long-term, generic key-value storages (Graves et al., 2014; Weston et al., 2015; Kaiser et al., 2017). Another kind of memory mechanism, inspired by early work on fast weights (Hinton & Plaut, 1987; Schmidhuber, 1992), uses auto-associative, recurrently adaptive weights for short-term memory storage (Ba et al., 2016a; Zhang & Zhou, 2017; Schlag & Schmidhuber, 2017).

Associative memory considerably ameliorates these limitations of RNNs. First, it liberates memory capacity from the linear scaling with respect to hidden state dimensions; in the case of auto-associative memory like fast weights, the scaling is quadratic (Ba et al., 2016a). Second, Neural Turing Machine (NTM)-style generic storage can support memory access at arbitrary temporal displacements, whereas fast weight-style memory has its own recurrent dynamics, potentially learnable as well (Zhang & Zhou, 2017).
Finally, if architected and parameterized carefully, some associative memory dynamics can also alleviate the vanishing/exploding gradient problem (Dangovski et al., 2017).

Besides memory augmentation, another, entirely distinct approach to overcoming regular RNNs' drawbacks is clever design of the recurrent network architecture. The earliest, and still among the most effective and widely adopted, designs are gated RNN cells such as the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997). Recent work has proposed ever more complex topologies involving hierarchy and nesting, e.g. Chung et al. (2016); Zilly et al. (2016); Ruben et al. (2017).

How do gated RNNs such as the LSTM interact with associative memory mechanisms like fast weights? Are they redundant, synergistic, or rather competitive with each other? This remains an open question, since all fast weight networks reported so far are based on regular, instead of gated, RNNs. Here we answer this question by revealing a strong synergy between fast weights and the LSTM.
2 RELATED WORK
Our present work builds upon results reported by Ba et al. (2016a), using the same fast weight mechanism. A number of studies subsequent to Ba et al. (2016a), though not applied to gated RNNs, proposed interesting mechanisms directly extending or closely related to fast weights. WeiNet (Zhang & Zhou, 2017) parameterized the fast weight update rule and learned it jointly with the network. Gated fast weights (Schlag & Schmidhuber, 2017) used a separate network to produce fast weights for the main RNN, and the entire network was trained end-to-end. The rotational unit of memory (Dangovski et al., 2017) is an associative memory mechanism related to, yet distinct from, fast weights: its memory matrix is updated with a norm-preserving operation between the input and a target. Danihelka et al. (2016) proposed an LSTM network augmented by an associative memory that leverages hyperdimensional vector arithmetic for key-value storage and retrieval. This is an NTM-style, non-recurrent memory mechanism and hence different from the fast weight short-term memory.

3 FAST WEIGHT LSTM
Our fast weight LSTM (FW-LSTM) network is defined by the following update equations for the cell state, hidden state, and fast weight matrix (Figure 1).
Figure 1: FW-LSTM diagram.

\[
\begin{pmatrix} \hat{i}_t \\ \hat{f}_t \\ \hat{o}_t \\ \hat{g}_t \end{pmatrix}
= \mathrm{LN}\!\left[
\begin{pmatrix} W^i & U^i \\ W^f & U^f \\ W^o & U^o \\ W^g & U^g \end{pmatrix}
\begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}
+ \begin{pmatrix} b^i \\ b^f \\ b^o \\ b^g \end{pmatrix}
\right] \tag{1}
\]
\[
(i_t, f_t, o_t, g_t) = \left( \sigma(\hat{i}_t),\; \sigma(\hat{f}_t),\; \sigma(\hat{o}_t),\; \mathrm{ReLU}(\hat{g}_t) \right) \tag{2}
\]
\[
A_t = \lambda A_{t-1} + \eta\, g_t g_t^{\top} \tag{3}
\]
\[
c_t = \mathrm{LN}\left[ f_t \odot c_{t-1} + i_t \odot \mathrm{ReLU}(\hat{g}_t + A_t g_t) \right] \tag{4}
\]
\[
h_t = o_t \odot \mathrm{ReLU}(c_t) \tag{5}
\]

Here $x_t \in \mathbb{R}^d$; $h_t, v_t, \hat{v}_t, b^v \in \mathbb{R}^h$; $W^v, A_t \in \mathbb{R}^{h \times h}$; and $U^v \in \mathbb{R}^{h \times d}$, where $v \in \{i, f, o, g\}$ and $t$ indexes time steps. $\odot$ denotes the Hadamard (element-wise) product, $\mathrm{LN}[\cdot]$ layer normalization, and $\sigma(\cdot)$ and $\mathrm{ReLU}(\cdot)$ are the sigmoid and rectified linear functions applied element-wise. We used ReLU(·) in place of tanh(·) for efficiency, as it did not make a significant difference in practice.

Our construction is identical to the standard LSTM cell except for the fast weight memory $A_t$, queried by the input activation $g_t$. Since $g_t$ is a function of both the network output $h_{t-1}$ and the new input $x_t$, this gives the network control over what to associate with each new input.

4 EXPERIMENTS
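For concreteness, a single FW-LSTM step following equations (1)–(5) can be sketched in NumPy as below. This is a toy illustration, not the authors' implementation: the parameter shapes follow the notation above, but the λ and η defaults and the initialization are placeholders, not the tuned values.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Layer normalization over the feature dimension (LN[.] above).
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fwlstm_step(x, h_prev, c_prev, A_prev, W, U, b, lam=0.9, eta=0.5):
    """One FW-LSTM step following Eqs. (1)-(5).

    W: (4h, h), U: (4h, d), b: (4h,) stack the i/f/o/g blocks row-wise.
    lam (fast weight decay) and eta (fast weight learning rate) are
    placeholder values, not the tuned hyperparameters."""
    h = h_prev.shape[0]
    pre = layer_norm(W @ h_prev + U @ x + b)            # Eq. (1)
    i_t = sigmoid(pre[0 * h:1 * h])                     # input gate
    f_t = sigmoid(pre[1 * h:2 * h])                     # forget gate
    o_t = sigmoid(pre[2 * h:3 * h])                     # output gate
    g_hat = pre[3 * h:4 * h]
    g_t = np.maximum(g_hat, 0.0)                        # Eq. (2): ReLU
    A_t = lam * A_prev + eta * np.outer(g_t, g_t)       # Eq. (3): fast weights
    c_t = layer_norm(f_t * c_prev
                     + i_t * np.maximum(g_hat + A_t @ g_t, 0.0))  # Eq. (4)
    h_t = o_t * np.maximum(c_t, 0.0)                    # Eq. (5)
    return h_t, c_t, A_t
```

Note that, starting from a zero memory, the outer-product update in equation (3) keeps $A_t$ symmetric, which is what makes the storage auto-associative.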
To study the performance of the FW-LSTM in comparison with the original fast weight RNN (FW-RNN) and the LSTM with layer normalization (LN-LSTM), we experimented with the associative retrieval task (ART) described in Ba et al. (2016a). Input sequences are composed of K key-value pairs followed by a separator ??, and then a query key; e.g. for K = 4, an example sequence is a1b2c3d4??b, whose target answer is 2. We experimented with sequence lengths much greater than the original K = 8, up to K = 30, similar to Zhang & Zhou (2017) and Dangovski et al. (2017).

We further devised a modified ART (mART) that is a re-arrangement of the input sequences of the original ART. In mART, all keys are presented first, followed by all values in the corresponding order; e.g. the mART equivalent of the above training example is abcd1234??b, again with target answer 2. In contrast to ART, where the temporal distance between associated pairs is constant and only the average retrieval distance grows with K, in mART the temporal distances of both association and retrieval scale linearly with K. This renders the task more difficult to learn than the original ART, and K can be used to control the difficulty of memory associations.

In all experiments, we augmented the FW-LSTM cell with a learned 100-dimensional embedding for the input x_t. Additionally, the network output at the end of the sequence was processed by another hidden layer with 100 ReLU units before the final softmax, identical to Ba et al. (2016a).

Note that the placement of layer normalizations is slightly different from the method described in the original paper (Ba et al., 2016b). We find that applying layer normalization to the hidden state and input activations simultaneously (rather than separately, as in the original model) works better for this fast weight architecture.

Figure 2: Validation accuracy during training on mART (K = 8) for FW-LSTMs and FW-RNNs with 20 and 50 hidden units.

All models were tuned as described in
the Appendix, and run for a minimum of 300 epochs.

The left half of Table 1 shows the performance of LN-LSTM, FW-RNN, and our FW-LSTM trained on ART with different sequence lengths and numbers of hidden units. The FW-LSTM has a slight advantage when the number of hidden units is low, but otherwise both the FW-RNN and the FW-LSTM solve the task perfectly.

The right half of Table 1 shows the performance of the same models trained on mART. Due to the significantly increased difficulty of the task, we instead show results for sequence lengths K = 8 and K = 16. In learning mART, the FW-LSTM outperformed the FW-RNN and LN-LSTM by a much greater margin, especially at the high memory difficulty K = 16, and it also converged much faster (Figure 2).

Table 1: Test accuracy (%) on the associative retrieval task (ART) and the modified associative retrieval task (mART) for different model sizes and sequence lengths K.
Task                    ART               mART
                    K = 8   K = 30    K = 8   K = 16   # params
h = 20   LN-LSTM     37.8    22.7      38.2    29.5      19k
         FW-RNN      98.7    95.7      55.5    30.3      12k
         FW-LSTM
h = 50   LN-LSTM     95.4    21.0      34.8    25.7      43k
         FW-RNN
h = 100  LN-LSTM     97.6    18.4      33.4    22.5      100k
         FW-RNN
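To make the two task formats concrete, the following sketch generates ART and mART examples in the style described above. This is a hypothetical data generator, not the authors' code; for simplicity, keys are drawn without replacement from the lowercase alphabet (so this sketch supports K ≤ 26) and values are single digits.

```python
import random
import string

def make_example(k, modified=False, rng=random):
    """Generate one (sequence, target) pair.

    ART  (modified=False): interleaved pairs, e.g. 'a1b2c3d4??b' -> '2'.
    mART (modified=True):  keys first, then values, e.g. 'abcd1234??b' -> '2'.
    Keys are distinct lowercase letters; values are single digits."""
    keys = rng.sample(string.ascii_lowercase, k)
    vals = [rng.choice(string.digits) for _ in range(k)]
    if modified:
        # mART: all keys, then all values in the corresponding order.
        body = "".join(keys) + "".join(vals)
    else:
        # ART: key-value pairs interleaved.
        body = "".join(kc + vc for kc, vc in zip(keys, vals))
    q = rng.randrange(k)  # query one key uniformly at random
    return body + "??" + keys[q], vals[q]
```

Both formats contain exactly the same information; only the temporal layout differs, which is what lets K control how far apart an association and its retrieval are in mART.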
5 CONCLUSIONS
We observed that the FW-LSTM trained significantly faster and achieved lower test error on the original ART. Further, in learning the harder mART, where input sequences are longer, we found that the FW-LSTM could still perform the task highly accurately, while both the FW-RNN and the LN-LSTM utterly failed. This was true even when the FW-LSTM had fewer trainable parameters. These results suggest that gated RNNs equipped with fast weight memory are a promising combination for associative learning of sequences.

Note that the parameters η and λ used for the FW-RNN here differ from those in Zhang & Zhou (2017), resulting in improved performance. The values used are listed in the Appendix.

REFERENCES
Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, and Catalin Ionescu. Using Fast Weights to Attend to the Recent Past. 2016a. URL http://arxiv.org/abs/1610.06258.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. 2016b. URL http://arxiv.org/abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014. URL http://arxiv.org/abs/1409.0473.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical Multiscale Recurrent Neural Networks. 2016. URL http://arxiv.org/abs/1609.01704.

Rumen Dangovski, Li Jing, and Marin Soljacic. Rotational Unit of Memory. 2017. URL http://arxiv.org/abs/1710.09537.

Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative Long Short-Term Memory. 2016. URL http://arxiv.org/abs/1602.03032.

Michał Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. Frustratingly Short Attention Spans in Neural Language Modeling. 2017. URL http://arxiv.org/abs/1702.04521.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. 2014. URL http://arxiv.org/abs/1410.5401.

Geoffrey E. Hinton and David C. Plaut. Using Fast Weights to Deblur Old Memories. Proceedings of the 9th Annual Conference of the Cognitive Science Society, pp. 177–186, 1987.

Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to Remember Rare Events. 2017. URL http://arxiv.org/abs/1703.03129.

Joel Ruben Antony Moniz and David Krueger. Nested LSTMs. 2017. URL http://arxiv.org/abs/1801.10308.

Imanol Schlag and Jürgen Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. NIPS Metalearning Workshop, 2017. URL http://metalearning.ml/papers/metalearn17_schlag.pdf.

Jürgen Schmidhuber. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation, 4(1):131–139, 1992.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. NIPS, 2017. URL http://arxiv.org/abs/1706.03762.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory Networks. ICLR, 2015. URL http://arxiv.org/abs/1410.3916.

Wei Zhang and Bowen Zhou. Learning to Update Auto-associative Memory in Recurrent Neural Networks for Improving Sequence Memorization. 2017. URL http://arxiv.org/abs/1709.06493.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent Highway Networks. 2016. URL http://arxiv.org/abs/1607.03474.

ACKNOWLEDGMENTS
We thank Drs. Tristan J. Webb, Marcel Nassar, and Amir Khosrowshahi for insightful discussions. We also thank Dr. Jason Knight for his assistance setting up the Kubernetes cluster used for training and tuning.
APPENDIX
ASSOCIATIVE RETRIEVAL HYPERPARAMETERS
All models in the associative retrieval experiments were tuned over the following hyperparameter ranges using standard grid search. The final models were selected based on the highest validation set accuracy from the following sets:

η ∈ { . , . , . , . , . } (6)
λ ∈ { . , . } (7)
grad clip ∈ { . , . } (8)
learning rate ∈ { − , − } (9)
anneal rate ∈ { , } (10)

where anneal rate is the number of epochs between which the learning rate is halved, and grad clip is the maximum L2-norm clipping value for the gradient.

The optimal hyperparameters were found to match for both the FW-RNN and the FW-LSTM on the simple associative retrieval tasks. They are as follows:

η = 1. (11)
λ = 0. (12)
grad clip = 5. (13)
learning rate = 10− (14)
anneal rate = 100 (15)
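The grid search itself is standard: every combination of the candidate values is trained and the configuration with the highest validation accuracy is kept. As a sketch, with illustrative placeholder ranges rather than the values actually searched in the paper, the candidate configurations can be enumerated with itertools.product:

```python
from itertools import product

# Placeholder ranges for illustration only; these are NOT the
# ranges searched in the paper's appendix.
grid = {
    "eta": [0.25, 0.5, 1.0],
    "lam": [0.9, 0.95],
    "grad_clip": [1.0, 5.0],
    "learning_rate": [1e-3, 1e-4],
    "anneal_rate": [100, 200],
}

def grid_configs(grid):
    """Yield one dict per hyperparameter combination (exhaustive grid search)."""
    keys = list(grid)
    for combo in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, combo))

configs = list(grid_configs(grid))  # 3 * 2 * 2 * 2 * 2 = 48 candidate settings
```

Each yielded dict would then parameterize one training run, with validation accuracy deciding the winner.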