Multi-timescale Representation Learning in LSTM Language Models
Shivangi Mahto, Vy A. Vo, Javier S. Turek, Alexander G. Huth
Department of Computer Science, The University of Texas at Austin
Brain-Inspired Computing Lab, Intel Labs
Department of Neuroscience, The University of Texas at Austin
[email protected], {vy.vo,javier.turek}@intel.com, [email protected]

Abstract
Although neural language models are effective at capturing statistics of natural language, their representations are challenging to interpret. In particular, it is unclear how these models retain information over multiple timescales. In this work, we construct explicitly multi-timescale language models by manipulating the input and forget gate biases in a long short-term memory (LSTM) network. The distribution of timescales is selected to approximate the power law statistics of natural language through a combination of exponentially decaying memory cells. We then empirically analyze the timescale of information routed through each part of the model using word ablation experiments and forget gate visualizations. These experiments show that the multi-timescale model successfully learns representations at the desired timescales, and that the distribution includes longer timescales than a standard LSTM. Further, information about high-, mid-, and low-frequency words is routed preferentially through units with the appropriate timescales. Thus we show how to construct language models with interpretable representations of different information timescales.
Effective language models should capture the statistical properties of natural language, including information that varies over multiple timescales. For example, syntactic effects evolve at the timescale of words, whereas semantics, emotions, and narratives can evolve over much longer timescales of tens to hundreds or thousands of words. The importance of long timescale information is evident in results showing that neural networks have outperformed classical n-gram models on many language modeling benchmarks (Melis et al., 2019; Krause et al., 2019; Dai et al., 2019). This difference is attributed to these networks' ability to capture long timescale dependencies that are impossible for n-gram models. Yet it is difficult to interpret how neural language models represent information at different timescales, and unclear how these timescale representations should be controlled to yield better or more interpretable models.

One popular architecture for neural language models is the recurrent neural network, in particular the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997; Merity et al., 2018; Melis et al., 2018). Efforts to interpret the representations learned by LSTMs using probing tasks have shown that LSTM language models are capable of learning both short timescale information about word order (i.e., syntactic information) (Adi et al., 2017; Linzen et al., 2016) and long timescale semantic information (Zhu et al., 2018; Conneau et al., 2018; Gulordava et al., 2018). Other methods have attempted to interpret the timescale of LSTM representations using predictive models of brain responses to natural language (Jain and Huth, 2018; Toneva and Wehbe, 2019). Yet the question of how and where information about different timescales is maintained in LSTM representations still does not have a satisfying answer.

One alternative to interpreting representations in existing models is to construct language models in which different layers or groups of units are explicitly constrained to operate at different timescales. Several approaches have been proposed for building such explicitly multi-timescale models, including updating different groups of units at different intervals (El Hihi and Bengio, 1996; Koutnik et al., 2014; Liu et al., 2015; Chung et al., 2017), gating units across layers (Chung et al., 2015), and including explicit control of the input and forget gates that determine how information is stored in and removed from memory (Xu et al., 2016; Shen et al., 2018; Tallec and Ollivier, 2018). These approaches ease interpretation by controlling the timescales represented by different units. Yet this raises a new concern: unlike standard LSTMs, explicitly multi-timescale models are unable to flexibly learn the statistics of natural language. This can decrease the performance of these models (Kádár et al., 2018) and diminish their utility. Thus, when constructing explicitly multi-timescale language models it is important to consider which timescales are present in natural language.

Lin and Tegmark (2016) quantified the distribution of timescales in natural language by measuring the mutual information between tokens as a function of the distance between them. They observed that mutual information decays as a power law, which is common to many hierarchical structures (Lin and Tegmark, 2016; Sainburg et al., 2019).
It would be desirable for a language model to retain temporal information that mimics these statistics. However, it is not clear how to attain power law decay using LSTMs, which are fundamentally designed to decay information exponentially across time (Tallec and Ollivier, 2018).

In this work, we present a method to control the timescale of information represented by each unit of an LSTM language model, resulting in interpretable multi-timescale representations. Building on the theoretical grounding of Tallec and Ollivier (2018), we quantify the timescale represented in each unit using forget gate activations. We use this framework to analyze an existing LSTM language model (Merity et al., 2018) and show how different layers of the model retain information across time. Next, we use this framework to construct explicitly multi-timescale language models in which the timescale of each LSTM unit is controlled by setting the forget and input gate biases. To determine the distribution of timescales within this model we used a prior that mimics the power law statistical properties of natural language (Lin and Tegmark, 2016) through a combination of exponential timescales. Finally, we show that this prior creates interpretable representations in which long and short timescale information is selectively routed into different parts of the network.

We are interested in understanding and quantifying the timescale of information in LSTM-based language models. The timescale is directly related to the memory mechanism employed by the LSTM, which involves the LSTM cell state $c_t$, input gate $i_t$, and forget gate $f_t$:
$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$$
$$\tilde{c}_t = \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
where $x_t$ is the input at time $t$, $h_t$ is the hidden state, $W_{ih}, W_{ix}, W_{fh}, W_{fx}, W_{ch}, W_{cx}$ are the different weight matrices, and $b_i, b_f, b_c$ are the respective biases. $\sigma(\cdot)$ and $\tanh(\cdot)$ denote element-wise sigmoid and hyperbolic tangent functions. The input and forget gates control the flow of information into and out of memory. The forget gate $f_t$ controls how much memory from the last time step $c_{t-1}$ is carried forward to the current state $c_t$. The input gate $i_t$ controls how much new information from the current timestep is stored in memory for subsequent timesteps.

To examine representational timescales, consider a "free input" regime in which there is only null input to the LSTM after timestep $t_0$, i.e., $x_t = 0$ for $t > t_0$. Ignoring information leakage through the hidden state, i.e., assuming $W_{ch} = 0$, $b_c = 0$, and $W_{fh} = 0$, the cell state update becomes $c_t = f_t \odot c_{t-1}$. For $t > t_0$, it can be further simplified as
$$c_t = f^{t - t_0} \odot c_0 = c_0 \odot e^{(\log f)(t - t_0)}, \quad (1)$$
where $c_0 = c_{t_0}$ is the cell state at $t_0$, and $f = \sigma(b_f)$ is the value of the forget gate, which here depends only on the forget gate bias $b_f$. Equation (1) shows that LSTM memory exhibits exponential decay with characteristic forgetting time
$$T = \frac{-1}{\log f} = \frac{1}{\log(1 + e^{-b_f})}. \quad (2)$$
That is, values in the cell state tend to shrink by a factor of $e$ every $T$ timesteps. We refer to the forgetting time in Equation (2) as the timescale of information represented by an LSTM unit.
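To make Equations (1) and (2) concrete, the following sketch (ours, not from the paper's codebase) converts a forget gate bias into its characteristic forgetting time and simulates the free-input decay of a single cell state value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def timescale_from_bias(b_f):
    # Eq. (2): T = -1 / log(sigmoid(b_f)) = 1 / log(1 + exp(-b_f))
    return 1.0 / np.log1p(np.exp(-b_f))

b_f = 2.0                          # example forget gate bias
f = sigmoid(b_f)                   # forget gate value in the free-input regime
T = timescale_from_bias(b_f)       # characteristic forgetting time

# Eq. (1): under null input, a cell state value decays as c_t = c_0 * f**t.
c0 = 1.0
t = np.arange(50)
c_t = c0 * f ** t

# After roughly T timesteps the value has shrunk by a factor of e.
print(f"f = {f:.3f}, T = {T:.2f}")
print(f"c at t = round(T): {c_t[int(round(T))]:.3f}  vs.  c0 / e = {c0 / np.e:.3f}")
```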
Figure 1: Estimated timescales across units in the model from Merity et al. (2018). The majority of units have estimated timescales that are less than 20. The model learned to process information over timescales as long as 150 timesteps.

Our definition of the timescale of an LSTM unit in Equation (2) applies in the free-input regime, which is dominated by the forget gate bias $b_f$. Beyond this simple case, we can estimate the timescale of an LSTM unit by measuring the average forget gate value over a set of real test sequences,
$$T_{est} = \frac{-1}{\log \bar{f}}, \quad (3)$$
where $\bar{f} = \frac{1}{KN} \sum_{j=1}^{N} \sum_{t=1}^{K} f_t^j$, and $f_t^j$ is the forget gate value of the unit at the $t$-th timestep of the $j$-th test sentence. $N$ is the total number of test sentences and $K$ is the length of the test sentences.

Figure 1 shows estimated timescales in the LSTM language model from Merity et al. (2018) trained on Penn Treebank (Marcus et al., 1999; Mikolov et al., 2011). These timescales lie between 0 and 150 timesteps, with more than 90% of timescales being less than 10 timesteps, indicating that this network skews its forget gates toward shorter timescales during training. This resembles the findings of Khandelwal et al. (2018), who showed that the model's sensitivity is reduced for information farther than 20 timesteps in the past. Ideally, we would like to control the timescale of each unit to counter this training effect and select globally a distribution that follows from natural language data.

Following the analysis in Section 2.1, the desired timescale $T_{desired}$ for an LSTM unit can be controlled by setting the forget gate bias to the value
$$b_f = -\log\left(e^{1/T_{desired}} - 1\right). \quad (4)$$
The balance between forgetting information from the previous timestep and adding new information from the current timestep is controlled by the relationship between the forget and input gates. To maintain this balance we set the input gate bias $b_i$ to the opposite value of the forget gate bias, i.e., $b_i = -b_f$. This ensures that the relation $i_t \approx 1 - f_t$ holds. Importantly, these bias values remain fixed (i.e., are not learned) during training, in order to keep the desired timescale distribution across the network.

Figure 2: Estimated mutual information of tokens in the Penn Treebank (PTB) and WikiText-2 datasets.

Ideally, we would like to select the distribution of timescales across LSTM units to match the statistics of natural language. It is well known that natural language contains a mixture of different types of dependence that evolve at different timescales (Lang et al., 1990; Daubechies, 1990; El Hihi and Bengio, 1996). All of these effects can be summarized by examining the mutual information between tokens as a function of their separation, which has been observed to approximately follow a power law decay (Lin and Tegmark, 2016). In Figure 2 we reproduce this result, showing the power law decay of mutual information for the Penn Treebank and WikiText-2 (Merity et al., 2017) datasets. This suggests that the distribution of timescales should approximate power law decay across time.

From Equation (1) we see that LSTM memory tends to decay exponentially. Thus, we will approximate the power law decay seen in natural language using a mixture of exponential functions. Let us assume that the timescale $T$ for each unit is drawn from a distribution $P(T)$. We want to define $P(T)$ such that the expected value over $T$ of the function $e^{-t/T}$ approximates a power law decay $t^{-d}$ for some constant $d$,
$$t^{-d} \propto \mathbb{E}_T\left[e^{-t/T}\right] = \int_0^{\infty} P(T)\, e^{-t/T}\, dT. \quad (5)$$

Figure 3: Word ablation experiment. For each input context sentence and each value of $k$ from $k = 1$ to $k = t - 1$, the word $k$ timesteps before the current timestep $t$ is replaced with UNK. The cell state vectors at $t$ are then extracted.
The impact of each word ablation is measured as the distance between the original cell state vector and the cell state vector after ablation (Eq. 6).

Noting the similarity between this equation and the Laplace transform, we can solve this problem to find that $P(T)$ is an inverse gamma distribution with shape parameter $\alpha = d$ and scale parameter $\beta = 1$. The probability density function of the inverse gamma distribution is given as
$$P(T; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \left(\frac{1}{T}\right)^{\alpha + 1} \exp(-\beta / T).$$
Selecting exponential timescales according to this distribution should thus enable us to approximate the power law temporal dependencies of natural language using an LSTM.
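As a sanity check of this derivation, the short sketch below (ours; it assumes SciPy is available) draws timescales from an inverse gamma distribution and verifies numerically that the resulting mixture of exponentials follows the closed form $(1+t)^{-d}$, which behaves as $t^{-d}$ for large $t$.

```python
import numpy as np
from scipy.stats import invgamma

d = 0.56                                   # power-law exponent (shape parameter alpha)
rng = np.random.default_rng(0)

# Draw unit timescales T ~ InverseGamma(alpha=d, beta=1).
T = invgamma.rvs(a=d, scale=1.0, size=200_000, random_state=rng)

t = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0])
# Monte Carlo estimate of E_T[exp(-t/T)], the mixture of exponential decays in Eq. (5).
mixture = np.array([np.exp(-ti / T).mean() for ti in t])
# The integral in Eq. (5) has the closed form (1 + t)**(-d), i.e. ~ t**(-d) for large t.
closed_form = (1.0 + t) ** (-d)

for ti, m, c in zip(t, mixture, closed_form):
    print(f"t = {ti:5.0f}   Monte Carlo = {m:.4f}   (1 + t)^(-d) = {c:.4f}")
```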
Controlling the timescale of each unit in an LSTM language model enables us to obtain interpretable representations. To understand the effects of our manipulations, we use several techniques to interpret and visualize the timescale of information passing through each unit.

Forget gate visualization.
For each LSTM unit, we compute the mean forget gate value across timesteps for a test sentence. We then sort the units according to their mean forget gate values. High mean forget gate values imply that a unit retains information from the past over longer timescales, while low mean forget gate values imply that a unit only maintains shorter timescale information. This visualization serves to check whether the assigned timescale is reflected in the forget gate values of each unit.
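A minimal sketch of this procedure, assuming the per-timestep forget gate activations of one layer have already been collected into an array of shape (sentences, timesteps, units); the collection code and variable names are ours, not the original implementation.

```python
import numpy as np

def estimate_timescales(forget_gates):
    """Eq. (3): T_est = -1 / log(f_bar), where f_bar is the mean forget gate value
    of each unit over N test sentences of length K.

    forget_gates: array of shape (N, K, n_units), values in (0, 1), collected from
    one LSTM layer of the trained model (the collection hook is not shown here).
    """
    f_bar = forget_gates.mean(axis=(0, 1))       # average over sentences and timesteps
    return -1.0 / np.log(f_bar)

# Toy stand-in for real activations, just to show the shapes involved.
rng = np.random.default_rng(0)
fg = rng.uniform(0.05, 0.999, size=(100, 70, 1150))   # N=100 sentences, K=70, 1150 units

T_est = estimate_timescales(fg)                  # one estimated timescale per unit
order = np.argsort(fg.mean(axis=(0, 1)))         # sort units by mean forget gate value

# For heat maps like Figure 5, group-average every 10 consecutive sorted units.
per_step = fg.mean(axis=0)[:, order]             # (K, n_units), units sorted
heatmap = per_step.reshape(70, -1, 10).mean(axis=2)   # (K, n_units // 10)
print(T_est.shape, heatmap.shape)
```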
Word ablation during inference.
The decay of information in the cell state of an LSTM layer is another indicator of the timescale of information passing through that layer. We visualize the rate of information decay by ablating words (i.e., replacing them with
UNK) during inference, and then measuring the effect on subsequent cell states. This procedure is depicted in Figure 3.

The impact of ablating word $w_{t-k}$ on the cell state of layer $i$ is called $\Delta c_k(i)$. Specifically, $\Delta c_k(i)$ is the normalized L2 norm of the difference between the cell states of layer $i$ with and without ablating the word $k$ timesteps away from the current timestep $t$ in the sentence,
$$\Delta c_k(i) = \frac{1}{L} \sum_{t=1}^{L} \frac{\left\| c_t^k(i) - c_t(i) \right\|}{\left\| c_t(i) \right\|}, \quad (6)$$
where $c_t^k(i)$ is the cell state vector of layer $i$ at word $t$ with word $t - k$ ablated, $c_t(i)$ is the cell state vector of layer $i$ without ablation, and $L$ is the length of the input test sentence. In our experiments, we estimate $\Delta c_k(i)$ for $k$ ranging from 0 to the maximum length of the input sentence and average it over all the test sentences.

We experimentally evaluated the effectiveness of our explicit multi-timescale language model on the Penn Treebank (PTB) (Marcus et al., 1999; Mikolov et al., 2011) and WikiText-2 (WT2) (Merity et al., 2017) datasets. PTB contains a vocabulary of 10K unique words, with 930K tokens in the training data, 200K in validation, and 82K in test data. WT2 is a larger dataset with a vocabulary size of 33K unique words, with almost 2M tokens in the training set, 220K in the validation set, and 240K in the test set.

We compared two language models: a standard stateful LSTM language model (Merity et al., 2018) as the baseline, and our multi-timescale language model. Both models comprise three LSTM layers with 1150 units in the first two layers and 400 units in the third layer, with an embedding size of 400. The input and output embeddings were tied. All models were trained using SGD followed by non-monotonically triggered ASGD for 1000 epochs. Training sequences were of length 70 with a probability of 0.95 and length 35 with a probability of 0.05. During inference, all test sentences were of length 70. For training, all embedding weights were uniformly initialized in the interval $[-0.1, 0.1]$. All weights and biases of the LSTM layers were uniformly initialized in $[-\frac{1}{\sqrt{H}}, \frac{1}{\sqrt{H}}]$, where $H$ is the output size of the respective layer.

Figure 4: Estimated timescale is highly correlated with assigned timescale, shown for all 1150 units in LSTM layer 2 of the multi-timescale language model.

Multi-timescale language models have the same architecture and training schedule as the baseline model, except for the forget and input gate bias values of the first two LSTM layers. In order to obtain representations of short timescale information in layer 1, we assigned timescale $T = 3$ to half of the units and timescale $T = 4$ to the rest. Using Equation (4), the corresponding forget and input gate bias values were fixed to approximately 0.93 and $-0.93$ for units with timescale $T = 3$, and to approximately 1.26 and $-1.26$ for units with timescale $T = 4$. For layer 2, we assigned timescales to each unit by selecting values from an inverse gamma distribution. These timescales were then used to compute the forget and input gate biases for each unit. We selected the best shape parameter $\alpha$ for the inverse gamma distribution (which is equal to the exponent $d$ in the corresponding power law) by testing several different values, and found that $\alpha = 0.56$ works best for our models. This parameter sets 80% of the layer 2 units to have timescales less than 20 and the rest to have higher timescales ranging up to the thousands. Biases in the third LSTM layer were not fixed, as we found that changes here had little effect on the network. Further details about the selection of these parameters are available in the supplementary material.
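The following PyTorch-style sketch illustrates how such a bias assignment could be implemented for one layer; the variable names and the gradient-hook trick for keeping the gate biases fixed are our own choices, and the original codebase may organize this differently.

```python
import torch
import torch.nn as nn

def bias_from_timescale(T):
    # Eq. (4): b_f = -log(exp(1/T) - 1); the input gate bias is then set to -b_f.
    return -torch.log(torch.expm1(1.0 / T))

hidden = 1150
lstm = nn.LSTM(input_size=400, hidden_size=hidden, batch_first=True)

# Layer-2 style assignment: one timescale per unit, drawn from InverseGamma(alpha, 1).
alpha = 0.56
T = 1.0 / torch.distributions.Gamma(concentration=alpha, rate=1.0).sample((hidden,))

b_f = bias_from_timescale(T)      # forget gate bias per unit
b_i = -b_f                        # input gate bias balances the forget gate

with torch.no_grad():
    # PyTorch packs the gate biases as [i, f, g, o]. Put the fixed values into
    # bias_ih and zero the matching chunks of bias_hh so the effective bias is b_i, b_f.
    lstm.bias_ih_l0[0 * hidden:1 * hidden] = b_i
    lstm.bias_ih_l0[1 * hidden:2 * hidden] = b_f
    lstm.bias_hh_l0[0 * hidden:2 * hidden] = 0.0

def freeze_input_forget_bias(grad):
    # Zero the gradients of the input and forget gate biases so they stay fixed.
    grad = grad.clone()
    grad[:2 * hidden] = 0.0
    return grad

lstm.bias_ih_l0.register_hook(freeze_input_forget_bias)
lstm.bias_hh_l0.register_hook(freeze_input_forget_bias)
```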
In Section 2.3, we showed that the forget gate bias controls the timescale of a unit, and derived a distribution of assigned timescales for the multi-timescale language model. After training this model, we tested whether this control was successful by estimating the empirical timescale of each unit from its mean forget gate value using Equation (2). Figure 4 shows that the assigned and estimated timescales in layer 2 are strongly correlated. This demonstrates that the timescale of an LSTM unit can be effectively controlled by the forget gate bias.
To further examine representational timescales, we next visualized the forget gate values of units from all three layers of both the multi-timescale and baseline language models, as described in Section 2.5. The goal is to compare the distribution of these forget gate values across the two language models, and to assess how these values change over time for a given input.

First, we sorted the LSTM units of each layer according to their mean forget gate values over a test sequence. For visualization purposes, we then downsampled these values by calculating the average forget gate value of every 10 consecutive sorted units at each timestep. Heat maps of these sorted and downsampled forget gate values are shown in Figure 5. The horizontal axis shows timesteps (words) across a sample test sequence, and the vertical axis shows different units. Units with average forget gate values close to 1.0 (bottom) retain information across many timesteps, i.e., they are capturing long timescale information. Figure 5 shows that the baseline language model contains fewer long timescale units than the multi-timescale language model, and that they are more evenly distributed across the layers than in the multi-timescale language model. Figure 5b also shows the (approximate) assigned timescales for units in the multi-timescale language model. As expected, layer 1 contains short timescales and layer 2 contains a range of both short and long timescales. Layer 1 units with short (assigned) timescales have smaller forget gate values across timesteps. In layer 2, we observe that units with large assigned timescales have higher mean forget gate values across timesteps; for example, the units with an assigned timescale of 362 in Figure 5b have forget gate values of almost 1.0 across all timesteps. Similar to the previous analysis, this demonstrates that our method is effective at controlling the timescale of each unit, and assigns a different distribution of timescales than the baseline model.
Figure 5: Forget gate values of LSTM units for a test sentence from the PTB dataset, for (a) the baseline language model and (b) the multi-timescale language model. Units are sorted top to bottom by increasing mean forget gate value, and averaged in groups of 10 units to enable visualization. Figure 5b also shows the average assigned timescale (rounded) of the units.
Figure 6: Change in cell state of all three layers for both the baseline and multi-timescale language models in the word ablation experiment, for (a) the PTB dataset and (b) the WikiText-2 dataset. A curve with a steep slope indicates that the cell state difference decays quickly over time, suggesting that the LSTM layer retains information over shorter timescales.
Another way to interpret the timescale of information retained by the layers is to visualize the decay of information in the cell state over time. We estimated this decay with word ablation experiments, as described in Section 2.5.

In Figure 6, we show the normalized cell state difference averaged across test sentences for all three layers of both the baseline (blue) and multi-timescale (orange) models. In the PTB dataset, information in layer 1 of the baseline model decays more slowly than in layer 2; in this case, layer 2 of the baseline model retains shorter timescale information than layer 1. In the WikiText-2 dataset, the difference between layer 1 and layer 2 of the baseline model is inverted, with layer 2 retaining longer timescale information. In the multi-timescale model, however, the trend is nearly identical for both datasets, with information in layer 2 decaying more slowly than in layer 1. This is expected for our multi-timescale model, which we designed to have short timescale dependencies in layer 1 and longer timescale dependencies in layer 2. Furthermore, the decay curves are very similar across datasets for the multi-timescale model, but not for the baseline model, demonstrating that controlling the timescales gives rise to predictable behavior across layers and datasets. Layer 3 has a similar cell state decay rate in both models. In both models, layer 3 is initialized randomly, and we expect its behavior to be largely driven by the language modeling task.

Next, we explored the rate of cell state decay across different groups of units in layer 2 of the multi-timescale language model. We first sorted the layer 2 units according to their assigned timescale and then partitioned these units into groups of 100 before estimating the cell state decay curve for each group. As can be seen in Figure 7, units with a shorter average timescale have faster decay rates, whereas units with a longer average timescale have slower information decay. While the previous section demonstrated that our manipulation could control the forget gate values, this result demonstrates that we can directly influence how information is retained in the LSTM cell states.
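A sketch of the ablation metric in Equation (6) for a single test sentence; `get_cell_states` is a hypothetical hook that runs the model on a token sequence and returns the chosen layer's cell state at every timestep.

```python
import numpy as np

def cell_state_ablation_curve(get_cell_states, token_ids, unk_id):
    """Eq. (6) for one test sentence: the normalized change in a layer's cell state
    at position t when the word k timesteps in the past is replaced with UNK,
    averaged over positions t.

    get_cell_states(token_ids) -> array (L, n_units): the chosen layer's cell state
    at every timestep (a hypothetical hook into the model, not shown here).
    """
    L = len(token_ids)
    base = get_cell_states(token_ids)                 # cell states without ablation
    curve = np.zeros(L)
    for k in range(L):
        diffs = []
        for t in range(k, L):
            ablated_ids = list(token_ids)
            ablated_ids[t - k] = unk_id               # ablate the word k timesteps back
            ablated = get_cell_states(ablated_ids)
            diffs.append(np.linalg.norm(ablated[t] - base[t]) / np.linalg.norm(base[t]))
        curve[k] = np.mean(diffs)
    return curve   # averaged over all test sentences in the actual experiment
```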
Dataset  Model            above 10K      1K-10K        100-1K        below 100       All tokens
PTB      Baseline         6.82           27.77         184.19        2252.50         61.40
         Multi-timescale  6.84           27.14         176.11        2100.89         59.69
         Mean diff.       -0.02          0.63          8.08          152.03          1.71
         95% CI           [-0.06, 0.02]  [0.38, 0.88]  [6.04, 10.2]  [119.1, 186.0]  [1.41, 2.02]
WT2      Baseline         7.49           49.70         320.59        4631.08         69.88
         Multi-timescale  7.46           48.52         308.43        4318.72         68.08
         Mean diff.       0.03           1.17          12.20         312.13          1.81
         95% CI           [0.01, 0.06]   [0.83, 1.49]  [9.96, 14.4]  [267.9, 356.3]  [1.61, 2.01]
Table 1: Perplexity of the multi-timescale and baseline models for tokens in different frequency bins, for the Penn Treebank (PTB) and WikiText-2 (WT2) test datasets. We also report the mean difference in perplexity (baseline - multi-timescale) across 10,000 bootstrapped samples, along with the 95% confidence interval (CI).

Figure 7: Change in cell states of 100-unit groups with different average timescales in layer 2 of the multi-timescale model, in the word ablation experiment on the PTB dataset. As the assigned timescale of a group decreases, the curve becomes steeper, indicating that the group retains information over shorter timescales.
One potential downside of constructing explicitly multi-timescale language models is that they may perform worse than ordinary language models, rendering their utility questionable. We attempted to address this issue by using a prior on the distribution of timescales that matches the statistical temporal dependencies of natural language. To test whether this effort was successful at building an effective model, we compared language modeling performance between the baseline and multi-timescale models by computing perplexity on the test portion of each dataset. Results are shown in Table 1. Here we see that the multi-timescale language model significantly outperforms the baseline model on both datasets, by an average margin of 1.75 perplexity, demonstrating that the explicitly multi-timescale language model is actually better.

From the earlier forget gate visualizations and word ablation tests, we saw that the multi-timescale model seemed to contain larger representations of very long timescales than did the baseline model. Thus the small performance advantage of the multi-timescale model might be due to it better capturing long timescale information. To test this, we investigated how language model performance differed when predicting words that appear with different frequencies. It has been shown that common words rely mostly on short timescale information, whereas rare words require longer timescale information (Khandelwal et al., 2018). Thus, if the multi-timescale model contains more long timescale information, it should give larger improvements in model performance for infrequent words.

We divided the words in the test dataset into 4 bins depending on their frequencies in the training data: a) greater than 10,000 occurrences; b) between 1,000 and 10,000; c) between 100 and 1,000; and d) extremely rare words with fewer than 100 occurrences. We then compared the performance of the models for test words belonging to each frequency bin in Table 1. The multi-timescale model performed significantly better than the baseline in both datasets for the 3 less frequent bins, with the difference increasing for less frequent words. This result suggests that the performance advantage of the multi-timescale model is highest for words that require very long timescale information.

We assessed the statistical significance of the differences in performance between models using a bootstrap procedure. Test data were divided into 100-word sequences and resampled with replacement 10,000 times. For each sample, we computed the difference in model perplexity (baseline - multi-timescale) for each word frequency bin and across all words. We report the 95% confidence intervals (CIs) of the perplexity differences in Table 1. The difference between models is significant at a level of p < 0.05 if the CI does not overlap with 0. The multi-timescale model thus has a significantly lower perplexity across all frequency bins, except for the highest frequency words in PTB.
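A sketch of this bootstrap procedure, assuming per-token negative log-likelihoods from both models on the test set have already been collected (the array names are ours).

```python
import numpy as np

def bootstrap_ppl_difference(nll_baseline, nll_multi, seq_len=100, n_boot=10_000, seed=0):
    """Bootstrap the perplexity difference (baseline - multi-timescale).

    nll_baseline, nll_multi: aligned per-token negative log-likelihoods of the two
    models on the test set (how they are collected from the models is not shown).
    """
    rng = np.random.default_rng(seed)
    n_seqs = len(nll_baseline) // seq_len
    base = np.asarray(nll_baseline)[:n_seqs * seq_len].reshape(n_seqs, seq_len)
    multi = np.asarray(nll_multi)[:n_seqs * seq_len].reshape(n_seqs, seq_len)

    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_seqs, size=n_seqs)    # resample sequences with replacement
        diffs[b] = np.exp(base[idx].mean()) - np.exp(multi[idx].mean())

    lo, hi = np.percentile(diffs, [2.5, 97.5])        # 95% confidence interval
    return diffs.mean(), (lo, hi)
```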
In separatetests we showed that these results are also robust torandom initialization when training the model 5.3. Our previous results showed that we were ableto successfully control the timescales of differentunits in our multi-timescale model, and that thisdid not cause model performance to suffer (in fact,performance improved). However, we have notyet shown that the representations that this modellearns for different timescales are interpretable.For these representations to be interpretable, wewould expect that different types of informationare routed through the units that were assigned dif-ferent timescales. To test this, we divided the testdata into word frequency bins for which differenttimescales of information should be important. Forexample, if long timescale information is particu-larly important for low frequency words, then wewould expect that information about those words is selectively routed through the units that were as-signed long timescales. We tested the importanceof LSTM units with different assigned timescalesfor words in each frequency bin by selectively ab-lating those units during inference, and then mea-suring the effect on prediction performance.We divided the LSTM units from layer 2 of themulti-timescale model into 23 groups of 50 con-secutive units, sorted by assigned timescale. Weablated one group of units at a time by explicitlysetting their output to 0, while keeping the restof the units active. We then computed the modelperplexity for different word frequency bins, andplotted the ratio of the perplexity with and withoutablation. If performance gets worse for a particularfrequency bin when ablating a particular group ofunits, it implies that the ablated units are routinginformation corresponding to that timescale.Figure 8 shows this ratio across all frequencybins and groups for the PTB dataset (similar resultsfor WikiText-2 are shown in the supplement). Ab-lating units with a long timescales (20-300 words)causes performance to degrade the most for lowfrequency words (below 100 and 1K-10K); ablat-ing units with medium timescales (5-10 words)worsens performance for medium frequency words(1k-10k); and ablating units with the shortesttimescales ( < In this paper, we presented a mechanism to inter-pret and control the timescale of information rout-ing through an LSTM unit via the input and forgetgate biases. We first examined the timescale ofinformation flowing through a standard LSTM lan-guage model and found that most units preferredshort timescale information. We designed a multi-timescale language model where timescales in themiddle layer were assigned based on an inversegamma distribution, a prior that we introduce basedon nature language statistics. We used several meth-ods including forget gate visualization, unit ab-lation, and word ablation to study and comparetimescales of information in our model and a stan-dard LSTM. The results showed that our model wassuccessful in learning the representations of vari-us timescales, including longer timescales thanthe standard model. These results demonstrate thatour explicit multi-timescale LSTM language modelcan be a useful tool for studying representations ofdifferent timescales in natural language.
References
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In Proceedings of the 5th International Conference on Learning Representations.

Junyoung Chung, Çağlar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated feedback recurrent neural networks. In ICML, pages 2067-2075.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126-2136.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978-2988.

Ingrid Daubechies. 1990. The wavelet transform, time-frequency localization and signal analysis. IEEE Transactions on Information Theory, 36(5):961-1005.

Salah El Hihi and Yoshua Bengio. 1996. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, pages 493-499.

Kristina Gulordava, Piotr Bojanowski, Édouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195-1205.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Shailee Jain and Alexander Huth. 2018. Incorporating context into language encoding models for fMRI. In Advances in Neural Information Processing Systems, pages 6628-6637.

Ákos Kádár, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Revisiting the hierarchical multiscale LSTM. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3215-3227, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284-294, Melbourne, Australia. Association for Computational Linguistics.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork RNN. In International Conference on Machine Learning, pages 1863-1871.

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2019. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378.

Kevin J. Lang, Alex H. Waibel, and Geoffrey E. Hinton. 1990. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23-43.

Henry W. Lin and Max Tegmark. 2016. Critical behavior from deep dynamics: a hidden dimension in natural language. arXiv preprint arXiv:1606.06737.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521-535.

Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuan-Jing Huang. 2015. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2326-2335.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3. Linguistic Data Consortium, Philadelphia, 14.

Gábor Melis, Tomáš Kočiský, and Phil Blunsom. 2019. Mogrifier LSTM. In International Conference on Learning Representations.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In International Conference on Learning Representations.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations.

Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černocký. 2011. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237, New Orleans, Louisiana. Association for Computational Linguistics.

Tim Sainburg, Brad Theilman, Marvin Thielk, and Timothy Q. Gentner. 2019. Parallels in the sequential organization of birdsong and human speech. Nature Communications, 10(1):1-11.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations.

Corentin Tallec and Yann Ollivier. 2018. Can recurrent neural networks warp time? In International Conference on Learning Representations.

Mariya Toneva and Leila Wehbe. 2019. Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain). In Advances in Neural Information Processing Systems, pages 14928-14938.

Jiacheng Xu, Danlu Chen, Xipeng Qiu, and Xuan-Jing Huang. 2016. Cached long short-term memory neural networks for document-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1660-1669.

Xunjie Zhu, Tingfeng Li, and Gerard De Melo. 2018. Exploring semantic properties of sentence embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 632-637.
We compared the performance of the multi-timescale language model for different shape parameters of the inverse gamma distribution.
Figure 9: Timescales assigned to the LSTM units of layer 2 of the multi-timescale language model for different values of the shape parameter α.

Figure 10: Performance of multi-timescale models for different shape parameters α on both the PTB and WikiText-2 datasets.

Figure 9 shows the timescales assigned to the LSTM units of layer 2 for different shape parameters. These shape parameters cover a wide range of possible timescale distributions over the units. Figure 10 shows that the multi-timescale models perform best for α = 0.56.

With the purpose of selecting proper timescales for each layer in Section 3.1, we conducted experiments designing LSTM language models with different combinations of timescales across the three layers. We found that layer 1 (the layer closest to the input) always prefers smaller timescales, within the range from 1 to 5. This is consistent with what has been observed in the literature: the first layer focuses more on syntactic information present at short timescales (Peters et al., 2018; Jain and Huth, 2018). We also observed that layer 3, i.e., the layer closest to the output, is not affected by the assigned timescale. Since we tied the encoder and decoder weights during training, layer 3 seems to learn a global word representation, with the timescale of information controlled by the training task (language modeling). The middle LSTM layer (layer 2) was more flexible, which allowed us to select specific distributions of timescales. Therefore, we construct the multi-timescale language model in Section 3.1 by setting layer 1 biases to small timescales, setting layer 2 biases to satisfy the inverse gamma distribution and thus aim for the power-law decay of mutual information, and leaving layer 3 free to learn the timescales required for the current task.
Dataset      Model            Performance
PTB          Baseline         . ± .
             Multi-timescale  . ± .
WikiText-2   Baseline         . ± .
             Multi-timescale  . ± .

Table 2: Perplexity of the baseline and multi-timescale models over 5 different training instances. Values are the mean and standard error over the training instances.
We quantified the variability in model performance due to stochastic differences in training with different random seeds. Table 2 shows the mean perplexity and standard error across 5 different training instances. The variance due to training is similar across the two models.