Top-down Tree Long Short-Term Memory Networks
Xingxing Zhang, Liang Lu and
Mirella Lapata
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
{x.zhang, liang.lu}@ed.ac.uk, [email protected]

Abstract
Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have been successfully applied to a variety of sequence modeling tasks. In this paper we develop Tree Long Short-Term Memory (TreeLSTM), a neural network model based on LSTM, which is designed to predict a tree rather than a linear sequence. TreeLSTM defines the probability of a sentence by estimating the generation probability of its dependency tree. At each time step, a node is generated based on the representation of the generated sub-tree. We further enhance the modeling power of TreeLSTM by explicitly representing the correlations between left and right dependents. Application of our model to the MSR sentence completion challenge achieves results beyond the current state of the art. We also report results on dependency parsing reranking, achieving competitive performance.
Neural language models have been gaining increasing attention as a competitive alternative to n-grams. The main idea is to represent each word using a real-valued feature vector capturing the contexts in which it occurs. The conditional probability of the next word is then modeled as a smooth function of the feature vectors of the preceding words and the next word. In essence, similar representations are learned for words found in similar contexts, resulting in similar predictions for the next word. Previous approaches have mainly employed feed-forward (Bengio et al., 2003; Mnih and Hinton, 2007) and recurrent neural networks (Mikolov et al., 2010; Mikolov, 2012) in order to map the feature vectors of the context words to the distribution for the next word. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997; Hochreiter, 1998) have emerged as a popular architecture due to their strong ability to capture long-term dependencies. LSTMs have been successfully applied to a variety of tasks ranging from machine translation (Sutskever et al., 2014), to speech recognition (Graves et al., 2013), and image description generation (Vinyals et al., 2015).

Despite superior performance in many applications, neural language models essentially predict sequences of words. Many NLP tasks, however, exploit syntactic information operating over tree structures (e.g., dependency or constituent trees). In this paper we develop a novel neural network model which combines the advantages of the LSTM architecture and syntactic structure. Our model estimates the probability of a sentence by estimating the generation probability of its dependency tree. Instead of explicitly encoding tree structure as a set of features, we use four LSTM networks to model four types of dependency edges which altogether specify how the tree is built. At each time step, one LSTM is activated which predicts the next word conditioned on the sub-tree generated so far. To learn the representations of the conditioning sub-tree, we force the four LSTMs to share their hidden layers. Our model is also capable of generating trees just by sampling from a trained model and can be seamlessly integrated with text generation applications.

Our approach is related to but ultimately different from recursive neural networks (Pollack, 1990), a class of models which operate on structured inputs. Given a (binary) parse tree, they recursively generate parent representations in a bottom-up fashion, by combining tokens to produce representations for phrases, and eventually the whole sentence. The learned representations can then be used in classification tasks such as sentiment analysis (Socher et al., 2011b) and paraphrase detection (Socher et al., 2011a). Tai et al. (2015) learn distributed representations over syntactic trees by generalizing the LSTM architecture to tree-structured network topologies. The key feature of our model is not so much that it can learn semantic representations of phrases or sentences, but its ability to predict tree structure and estimate its probability.

Syntactic language models have a long history in NLP, dating back to Chelba and Jelinek (2000) (see also Roark (2001) and Charniak (2001)). These models differ in how grammar structures in a parsing tree are used when predicting the next word.
Other work develops dependency-based language models for specific applications such as machine translation (Shen et al., 2008; Zhang, 2009; Sennrich, 2015), speech recognition (Chelba et al., 1997) or sentence completion (Gubbins and Vlachos, 2013). All instances of these models apply Markov assumptions on the dependency tree and adopt standard n-gram smoothing methods for reliable parameter estimation. Emami et al. (2003) and Sennrich (2015) estimate the parameters of a structured language model using feed-forward neural networks (Bengio et al., 2003). Mirowski and Vlachos (2015) re-implement the model of Gubbins and Vlachos (2013) with RNNs. They view sentences as sequences of words over a tree. While they ignore the tree structures themselves, we model them explicitly.

Our model shares with other structure-based language models the ability to take dependency information into account. It differs in the following respects: (a) it does not artificially restrict the depth of the dependencies it considers and can thus be viewed as an infinite order dependency language model; (b) it not only estimates the probability of a string but is also capable of generating dependency trees; (c) finally, contrary to previous dependency-based language models which encode syntactic information as features, our model takes tree structure into account more directly by representing different types of dependency edges explicitly using LSTMs. Therefore, there is no need to manually determine which dependency tree features should be used or how large the feature embeddings should be.

We evaluate our model on the MSR sentence completion challenge, a benchmark language modeling dataset. Our results outperform the best published results on this dataset. Since our model is a general tree estimator, we also use it to rerank the top K dependency trees from the (second order) MSTParser and obtain performance on par with recently proposed dependency parsers.

We seek to estimate the probability of a sentence by estimating the generation probability of its dependency tree. Syntactic information in our model is represented in the form of dependency paths. In the following, we first describe our definition of dependency path and, based on it, explain how the probability of a sentence is estimated.
Generally speaking, a dependency path is the path between ROOT and w, consisting of the nodes on the path and the edges connecting them. To represent dependency paths, we introduce four types of edges which essentially define the "shape" of a dependency tree. Let w denote a node in a tree and w_1, w_2, ..., w_n its left dependents. As shown in Figure 1, a LEFT edge is the edge between w and its first left dependent, denoted as (w, w_1). Let w_k (with 1 < k ≤ n) denote a non-first left dependent of w. The edge from w_{k−1} to w_k is an NX-LEFT edge (NX stands for NEXT), where w_{k−1} is the right adjacent sibling of w_k. Note that the NX-LEFT edge (w_{k−1}, w_k) replaces the edge (w, w_k) (illustrated with a dashed line in Figure 1) in the original dependency tree. The modification allows information to flow from w to w_k through w_1, ..., w_{k−1} rather than directly from w to w_k. RIGHT and NX-RIGHT edges are defined analogously for right dependents.

Figure 1: LEFT and NX-LEFT edges. Dotted lines between w_1 and w_{k−1} (also between w_k and w_n) indicate that there may be further nodes in between.

Given these four types of edges, dependency paths (denoted as D(w)) can be defined as follows, bearing in mind that the first right dependent of ROOT is its only dependent and that w_p denotes the parent of w. We use (...) to denote a sequence, where () is an empty sequence and ∥ is an operator for concatenating two sequences.
(1) if w is ROOT, then D(w) = ()

(2) if w is a left dependent of w_p
    (a) if w is the first left dependent, then D(w) = D(w_p) ∥ (⟨w_p, LEFT⟩)
    (b) if w is not the first left dependent and w_s is its right adjacent sibling, then D(w) = D(w_s) ∥ (⟨w_s, NX-LEFT⟩)

(3) if w is a right dependent of w_p
    (a) if w is the first right dependent, then D(w) = D(w_p) ∥ (⟨w_p, RIGHT⟩)
    (b) if w is not the first right dependent and w_s is its left adjacent sibling, then D(w) = D(w_s) ∥ (⟨w_s, NX-RIGHT⟩)

A dependency tree can be represented by the set of its dependency paths, which in turn can be used to reconstruct the original tree. Dependency paths for the first two levels of the tree in Figure 2 are as follows (ignoring for the moment the subscripts, which we explain in the next section): D(sold) = (⟨ROOT, RIGHT⟩) (see definitions (1) and (3a)), D(year) = D(sold) ∥ (⟨sold, LEFT⟩) (see (2a)), D(manufacturer) = D(year) ∥ (⟨year, NX-LEFT⟩) (see (2b)), D(cars) = D(sold) ∥ (⟨sold, RIGHT⟩) (see (3a)), D(in) = D(cars) ∥ (⟨cars, NX-RIGHT⟩) (according to (3b)). Throughout this paper we assume all dependency trees are projective.
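To make the recursive definition concrete, the sketch below computes D(w) for every node of a small tree. It is a minimal illustration rather than the authors' implementation; the Node class and its field names are invented for the example.

```python
class Node:
    def __init__(self, word, left=None, right=None):
        self.word = word
        self.left = left or []    # left dependents, closest to the head first
        self.right = right or []  # right dependents, closest to the head first

def dependency_paths(root):
    """Return a dict mapping each node to D(node), a list of
    (word, edge_type) tuples, following rules (1)-(3) above."""
    paths = {root: []}                      # rule (1): D(ROOT) = ()
    stack = [root]
    while stack:
        head = stack.pop()
        prev = None
        for dep in head.left:               # left dependents, closest first
            if prev is None:                # rule (2a): first left dependent
                paths[dep] = paths[head] + [(head.word, "LEFT")]
            else:                           # rule (2b): chain through the sibling
                paths[dep] = paths[prev] + [(prev.word, "NX-LEFT")]
            prev = dep
            stack.append(dep)
        prev = None
        for dep in head.right:              # right dependents, closest first
            if prev is None:                # rule (3a): first right dependent
                paths[dep] = paths[head] + [(head.word, "RIGHT")]
            else:                           # rule (3b)
                paths[dep] = paths[prev] + [(prev.word, "NX-RIGHT")]
            prev = dep
            stack.append(dep)
    return paths

# Tiny example mirroring the top of Figure 2:
sold = Node("sold",
            left=[Node("year"), Node("manufacturer")],
            right=[Node("cars"), Node("in")])
root = Node("ROOT", right=[sold])
for node, path in dependency_paths(root).items():
    print(node.word, path)
# e.g. D(cars) = [("ROOT", "RIGHT"), ("sold", "RIGHT")]
```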
Figure 2: Dependency tree of the sentence "The luxury auto manufacturer last year sold 1,214 cars in the U.S." Subscripts indicate breadth-first traversal.
ROOT has only one dependent (i.e., sold), which we view as its first right dependent.

The core problem in syntax-based language modeling is to estimate the probability of a sentence S given its corresponding tree T, P(S | T). We view the probability computation of a dependency tree as a generation process. Specifically, we assume dependency trees are constructed top-down, in a breadth-first manner. Generation starts at the ROOT node. For each node at each level, first its left dependents are generated from closest to farthest, and then the right dependents (again from closest to farthest). The same process is applied to the next node at the same level or a node at the next level. Figure 2 shows the breadth-first traversal of a dependency tree.

Under the assumption that each word w in a dependency tree is only conditioned on its dependency path, the probability of a sentence S given its dependency tree T is:

P(S | T) = ∏_{w ∈ BFS(T) \ ROOT} P(w | D(w))    (1)

where D(w) is the dependency path of w. Note that each word w is visited according to its breadth-first search order BFS(T), and the probability of ROOT is ignored since every tree has one. The role of ROOT in a dependency tree is the same as the beginning-of-sentence token (BOS) in a sentence. When computing P(S | T) (or P(S)), the probability of ROOT (or BOS) is ignored (we assume it always exists), but it is used to predict other words. We explain in the next section how TreeLSTM estimates P(w | D(w)).
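Given the dependency paths, Equation (1) turns sentence scoring into a sum of word log-probabilities in breadth-first order. A minimal sketch, assuming a hypothetical word_logprob(word, path) supplied by the model and the Node/dependency_paths helpers from the earlier sketch:

```python
from collections import deque

def sentence_logprob(root, word_logprob, dependency_paths):
    """log P(S | T) as the sum over BFS(T) \\ ROOT of log P(w | D(w)) (Equation 1).
    word_logprob(word, path) is a hypothetical function returning log P(w | D(w));
    dependency_paths maps nodes to D(w) as in the earlier sketch."""
    total = 0.0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is not root:        # the probability of ROOT is ignored
            total += word_logprob(node.word, dependency_paths[node])
        queue.extend(node.left)     # left dependents, closest first
        queue.extend(node.right)    # then right dependents
    return total
```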
Figure 3: Generation process of the left and right dependents of a tree node (top) using four LSTMs (GEN-L, GEN-R, GEN-NX-L and GEN-NX-R). The model can handle an arbitrary number of dependents due to GEN-NX-L and GEN-NX-R.

A dependency path D(w) is a subtree which we denote as a sequence of ⟨word, edge-type⟩ tuples. Our innovation is to learn the representation of D(w) using four LSTMs. The four LSTMs (GEN-L, GEN-R, GEN-NX-L and GEN-NX-R) are used to represent the four types of edges (LEFT, RIGHT, NX-LEFT and NX-RIGHT) introduced earlier; GEN, NX, L and R are shorthands for GENERATE, NEXT, LEFT and RIGHT. At each time step, an LSTM is chosen according to an edge type; the LSTM then takes a word as input and predicts/generates its dependent or sibling. This process can also be viewed as adding an edge and a node to a tree. Specifically, LSTMs GEN-L and GEN-R are used to generate the first left and right dependent of a node, so these two LSTMs are responsible for going deeper in a tree, while GEN-NX-L and GEN-NX-R generate the remaining left and right dependents and therefore go wider in a tree (see Figure 3). Note that the model can handle any number of left or right dependents by applying GEN-NX-L or GEN-NX-R multiple times.

We assume time steps correspond to the steps taken by the breadth-first traversal of the dependency tree and that the sentence has length n. At time step t (1 ≤ t ≤ n), let ⟨w_{t'}, z_t⟩ denote the last tuple in D(w_t), where the subscripts t and t' denote the breadth-first search order of w_t and w_{t'}, respectively, and z_t ∈ {LEFT, RIGHT, NX-LEFT, NX-RIGHT} is the edge type (see the definitions in Section 2.1). Let W_e ∈ R^{s×|V|} denote the word embedding matrix and W_ho ∈ R^{|V|×d} the output matrix of our model, where |V| is the vocabulary size, s the word embedding size, and d the hidden unit size. We use tied W_e and tied W_ho for the four LSTMs to reduce the number of parameters in our model. The four LSTMs also share their hidden states. Let H ∈ R^{d×(n+1)} denote the shared hidden states of all time steps and e(w_t) the one-hot vector of w_t. Then H[:, t] represents D(w_t) at time step t, and the computation is:

x_t = W_e · e(w_{t'})    (2a)
h_t = LSTM_{z_t}(x_t, H[:, t'])    (2b)
H[:, t] = h_t    (2c)
y_t = W_ho · h_t    (2d)

where the initial hidden state H[:, 0] is initialized to a vector of small values such as 0.01. According to Equation (2b), the model selects an LSTM based on the edge type z_t; we describe the details of LSTM_{z_t} below. The probability of w_t given its dependency path D(w_t) is estimated by a softmax function:

P(w_t | D(w_t)) = exp(y_{t, w_t}) / ∑_{k'=1}^{|V|} exp(y_{t, k'})    (3)

We must point out that although we use four jointly trained LSTMs to encode the hidden states, the training and inference complexity of our model is no different from a regular LSTM, since at each time step only one LSTM is working.
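As a rough illustration of Equations (2a)-(2d) and (3), the numpy sketch below performs one time step: it embeds the last word on the dependency path, picks one of the four LSTMs by edge type, updates the shared hidden-state matrix, and applies the tied output matrix followed by a softmax. The argument names and the lstm_cells mapping are hypothetical; the deep LSTM used for each transition is described below.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def tree_lstm_step(t, t_prev, w_prev_id, edge_type, H, We, Who, lstm_cells):
    """One time step of TreeLSTM (Equations 2a-2d and 3).

    t, t_prev  : breadth-first indices of the current word and of the last
                 word w_{t'} on its dependency path
    w_prev_id  : vocabulary index of w_{t'}
    edge_type  : "LEFT", "RIGHT", "NX-LEFT" or "NX-RIGHT"
    H          : d x (n+1) matrix of shared hidden states
    We, Who    : tied embedding (s x |V|) and output (|V| x d) matrices
    lstm_cells : dict mapping an edge type to its LSTM transition function
    """
    x_t = We[:, w_prev_id]                           # (2a) embed w_{t'}
    h_t = lstm_cells[edge_type](x_t, H[:, t_prev])   # (2b) pick the LSTM by edge type
    H[:, t] = h_t                                    # (2c) store the shared hidden state
    y_t = Who @ h_t                                  # (2d) output scores
    return softmax(y_t)                              # (3)  P(w_t | D(w_t))
```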
Figure 4: Generation of left and right dependents of node w according to LdTreeLSTM.

We implement LSTM_z in Equation (2b) using a deep LSTM (to simplify notation, from now on we write z instead of z_t). The inputs at time step t are x_t and h_{t'} (the hidden state of an earlier time step t'), and the output is h_t (the hidden state of the current time step). Let L denote the number of layers of LSTM_z and ĥ^l_t the internal hidden state of the l-th layer of LSTM_z at time step t, where x_t serves as ĥ^0_t and h_{t'} is ĥ^L_{t'}. The LSTM architecture introduces multiplicative gates and memory cells ĉ^l_t (at the l-th layer) in order to address the vanishing gradient problem, which makes it difficult for the standard RNN model to learn long-distance correlations in a sequence. Here, ĉ^l_t is a linear combination of the current input signal u_t and an earlier memory cell ĉ^l_{t'}. How much input information u_t will flow into ĉ^l_t is controlled by the input gate i_t, and how much of the earlier memory cell ĉ^l_{t'} will be forgotten is controlled by the forget gate f_t. This process is computed as follows (we ignore all bias terms for notational simplicity):

u_t = tanh(W^{z,l}_{ux} · ĥ^{l−1}_t + W^{z,l}_{uh} · ĥ^l_{t'})    (4a)
i_t = σ(W^{z,l}_{ix} · ĥ^{l−1}_t + W^{z,l}_{ih} · ĥ^l_{t'})    (4b)
f_t = σ(W^{z,l}_{fx} · ĥ^{l−1}_t + W^{z,l}_{fh} · ĥ^l_{t'})    (4c)
ĉ^l_t = f_t ⊙ ĉ^l_{t'} + i_t ⊙ u_t    (4d)

where W^{z,l}_{ux} ∈ R^{d×d} (W^{z,l}_{ux} ∈ R^{d×s} when l = 1) and W^{z,l}_{uh} ∈ R^{d×d} are weight matrices for u_t, W^{z,l}_{ix} and W^{z,l}_{ih} are weight matrices for i_t, and W^{z,l}_{fx} and W^{z,l}_{fh} are weight matrices for f_t. σ is the sigmoid function and ⊙ the element-wise product.

The output gate o_t controls how much information of the cell ĉ^l_t can be seen by other modules:

o_t = σ(W^{z,l}_{ox} · ĥ^{l−1}_t + W^{z,l}_{oh} · ĥ^l_{t'})    (5a)
ĥ^l_t = o_t ⊙ tanh(ĉ^l_t)    (5b)

Applying the above process to all L layers yields ĥ^L_t, which is h_t. Note that in implementation, all ĉ^l_t and ĥ^l_t (1 ≤ l ≤ L) at time step t are stored, although we only care about ĥ^L_t (h_t).
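Below is a minimal numpy sketch of the deep LSTM transition in Equations (4a)-(5b). Weight names mirror the matrices defined above; biases are omitted as in the text, and the dictionary-of-matrices layout is an illustrative choice rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer_step(W, h_below, h_prev, c_prev):
    """One layer of LSTM_z at one time step (Equations 4a-5b, biases omitted).

    W       : dict with matrices W['ux'], W['uh'], W['ix'], W['ih'],
              W['fx'], W['fh'], W['ox'], W['oh'] for this edge type and layer
    h_below : hidden state of the layer below (x_t for the first layer)
    h_prev  : hidden state of this layer at the earlier time step t'
    c_prev  : memory cell of this layer at the earlier time step t'
    """
    u = np.tanh(W['ux'] @ h_below + W['uh'] @ h_prev)   # (4a) input signal
    i = sigmoid(W['ix'] @ h_below + W['ih'] @ h_prev)   # (4b) input gate
    f = sigmoid(W['fx'] @ h_below + W['fh'] @ h_prev)   # (4c) forget gate
    c = f * c_prev + i * u                               # (4d) new memory cell
    o = sigmoid(W['ox'] @ h_below + W['oh'] @ h_prev)   # (5a) output gate
    h = o * np.tanh(c)                                   # (5b) new hidden state
    return h, c

def deep_lstm_step(layers, x_t, h_prev, c_prev):
    """Stack L layers: the first layer reads x_t, the top layer's h is h_t.
    layers is a list of weight dicts; h_prev and c_prev are per-layer lists."""
    h_below, hs, cs = x_t, [], []
    for l, W in enumerate(layers):
        h, c = lstm_layer_step(W, h_below, h_prev[l], c_prev[l])
        hs.append(h)
        cs.append(c)
        h_below = h
    return hs, cs    # hs[-1] is h_t; all layer states are kept, as in the text
```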
TreeLSTM computes P(w | D(w)) based on the dependency path D(w), which ignores the interaction between left and right dependents on the same level. In many cases, TreeLSTM will use a verb to predict its object directly without knowing its subject. For example, in Figure 2, TreeLSTM uses ⟨ROOT, RIGHT⟩ and ⟨sold, RIGHT⟩ to predict cars. This information is unfortunately not specific to cars (many things can be sold, e.g., chocolates, candy). Considering manufacturer, the left dependent of sold, would help predict cars more accurately.

In order to jointly take left and right dependents into account, we employ yet another LSTM, which goes from the furthest left dependent to the closest left dependent (LD is a shorthand for left dependent). As shown in Figure 4, the LD LSTM learns the representation of all left dependents of a node w; this representation is then used to predict the first right dependent of the same node. Non-first right dependents can also leverage the representation of left dependents, since this information is injected into the hidden state of the first right dependent and can percolate all the way. Note that in order to retain the generation capability of our model (Section 3.4), we only allow right dependents to leverage left dependents (they are generated before right dependents).

The computation of LdTreeLSTM is almost the same as in TreeLSTM, except when z_t = GEN-R. In this case, let v_t be the corresponding left dependent sequence with length K (illustrated in Figure 4). Then the hidden state q_k of v_t at each time step k is:

m_k = W_e · e(v_{t,k})    (6a)
q_k = LSTM_{LD}(m_k, q_{k−1})    (6b)

where q_K is the representation of all left dependents. The computation of the current hidden state then becomes (see Equation (2) for the original computation):

r_t = [W_e · e(w_{t'}); q_K]    (7a)
h_t = LSTM_{GEN-R}(r_t, H[:, t'])    (7b)

where q_K serves as additional input for LSTM_{GEN-R}. All other computational details are the same as in TreeLSTM (see Section 2.3).
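The LdTreeLSTM change is confined to the GEN-R case, as Equations (6)-(7) show. A sketch under the same assumptions as the earlier snippets (We, H and the LSTM transition callables are hypothetical stand-ins):

```python
import numpy as np

def ld_tree_lstm_gen_r_step(left_dep_ids, w_prev_id, t, t_prev,
                            H, We, ld_lstm, gen_r_lstm, d):
    """Hidden-state computation when the GEN-R LSTM is selected
    (Equations 6a-6b and 7a-7b).

    left_dep_ids : vocabulary indices of the node's left dependents, ordered
                   from farthest to closest (v_t in the text)
    ld_lstm      : LSTM_LD transition function, q_k = ld_lstm(m_k, q_{k-1})
    gen_r_lstm   : LSTM_GEN-R transition function
    d            : hidden unit size (used for the initial q_0)
    """
    q = np.zeros(d)                                   # initial state of the LD LSTM
    for vid in left_dep_ids:
        m = We[:, vid]                                # (6a) embed the left dependent
        q = ld_lstm(m, q)                             # (6b) summarise left dependents
    r_t = np.concatenate([We[:, w_prev_id], q])       # (7a) embedding plus q_K
    h_t = gen_r_lstm(r_t, H[:, t_prev])               # (7b) GEN-R with extra input
    H[:, t] = h_t
    return h_t
```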
On small scale datasets we employ the Negative Log-Likelihood (NLL) as our training objective for both TreeLSTM and LdTreeLSTM:

L_NLL(θ) = −(1/|S|) ∑_{S ∈ S} log P(S | T)    (8)

where S is a sentence in the training set S, T is the dependency tree of S, and P(S | T) is defined as in Equation (1).

On large scale datasets (e.g., with a vocabulary size of 65K), computing the output layer activations and the softmax function with NLL would become prohibitively expensive. Instead, we employ Noise Contrastive Estimation (NCE; Gutmann and Hyvärinen (2012), Mnih and Teh (2012)), which treats the normalization term Ẑ in P̂(w | D(w_t)) = exp(W_ho[w, :] · h_t) / Ẑ as constant. The intuition behind NCE is to discriminate between samples from the data distribution P̂(w | D(w_t)) and a known noise distribution P_n(w) via binary logistic regression. Assuming that noise words are k times more frequent than real words in the training set (Mnih and Teh, 2012), the probability of a word w being from our model, P_d(w, D(w_t)), is P̂(w | D(w_t)) / (P̂(w | D(w_t)) + k · P_n(w)). We apply NCE to large vocabulary models with the following training objective:

L_NCE(θ) = −(1/|S|) ∑_{T ∈ S} ∑_{t=1}^{|T|} ( log P_d(w_t, D(w_t)) + ∑_{j=1}^{k} log [1 − P_d(w̃_{t,j}, D(w_t))] )

where w̃_{t,j} is a word sampled from the noise distribution P_n(w). We use smoothed unigram frequencies (exponentiating by 0.75) as the noise distribution P_n(w) (Mikolov et al., 2013b). We initialize ln Ẑ to a constant that is kept fixed during training (Vaswani et al., 2013), and use a fixed number k of noise samples per data point.
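The NCE decision rule and the per-word term of the objective can be sketched as follows; score(w, h_t) stands for W_ho[w, :] · h_t and noise_prob for P_n(w), both hypothetical callables, with the normalizer treated as a constant as described above.

```python
import math

def nce_word_loss(w_true, noise_words, h_t, score, noise_prob, k, log_Z):
    """Per-word NCE term: -(log P_d(w_t) + sum_j log(1 - P_d(w~_{t,j}))).

    score(w, h_t) : unnormalised model score W_ho[w, :] . h_t (hypothetical)
    noise_prob(w) : noise distribution P_n(w), e.g. smoothed unigram frequencies
    k             : number of noise samples per data point
    log_Z         : log of the normaliser Z, treated as a constant
    """
    def p_data(w):
        p_model = math.exp(score(w, h_t) - log_Z)        # unnormalised P^(w | D(w_t))
        return p_model / (p_model + k * noise_prob(w))   # prob. w came from the data

    loss = -math.log(p_data(w_true))
    for w_noise in noise_words:                          # k samples drawn from P_n
        loss -= math.log(1.0 - p_data(w_noise))
    return loss
```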
We assess the performance of our model on two tasks: the Microsoft Research (MSR) sentence completion challenge (Zweig and Burges, 2012), and dependency parsing reranking. We also demonstrate the tree generation capability of our models. In the following, we first present details on model training and then present our results. We implemented our models using the Torch library (Collobert et al., 2011) and our code is available at https://github.com/XingxingZhang/td-treelstm.

We trained our model with back propagation through time (Rumelhart et al., 1988) on an Nvidia GPU card with a mini-batch size of 64. The objective (NLL or NCE) was minimized by stochastic gradient descent. Model parameters were uniformly initialized within a small interval centered at zero. We used the NCE objective on the MSR sentence completion task (due to the large size of this dataset) and the NLL objective on dependency parsing reranking. We used an initial learning rate of 1.0 for all experiments, and when there was no significant improvement in log-likelihood on the validation set, the learning rate was divided by 2 per epoch until convergence (Mikolov et al., 2010). To alleviate the exploding gradients problem, we rescaled the gradient g whenever its norm ||g|| exceeded a threshold, setting g ← (threshold / ||g||) · g (Pascanu et al., 2013; Sutskever et al., 2014). Dropout (Srivastava et al., 2014) was applied to the 2-layer TreeLSTM and LdTreeLSTM models. The word embedding size was set to s = d/2, where d is the hidden unit size.

The task in the MSR Sentence Completion Challenge (Zweig and Burges, 2012) is to select the correct missing word for 1,040 SAT-style test sentences when presented with five candidate completions. The training set contains 522 novels from Project Gutenberg, which we preprocessed as follows. After removing headers and footers from the files, we tokenized and parsed the dataset into dependency trees with the Stanford Core NLP toolkit (Manning et al., 2014). The resulting training set contained 49M words. We converted all words to lower case and replaced those occurring five times or less with UNK. The resulting vocabulary size was 65,346 words. We randomly sampled 4,000 sentences from the training set as our validation set.

The literature describes two main approaches to the sentence completion task, based on word vectors and on language models. In vector-based approaches, all words in the sentence and the five candidate words are represented by a vector; the candidate which has the highest average similarity with the sentence words is selected as the answer. For language model-based methods, the LM computes the probability of a test sentence with each of the five candidate words and picks the candidate completion which gives the highest probability. Our model belongs to this class of models.
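For completeness, here is what the language-model-based protocol looks like in code; tree_logprob is a hypothetical scorer that parses the filled-in sentence and returns log P(S | T) as in Equation (1).

```python
def complete_sentence(template, candidates, tree_logprob):
    """Pick the candidate whose filled-in sentence gets the highest probability
    under the model.  `template` contains a single '___' placeholder;
    `tree_logprob` is a hypothetical scorer returning log P(S | T)."""
    scored = []
    for cand in candidates:
        sentence = template.replace("___", cand)
        scored.append((tree_logprob(sentence), cand))
    return max(scored)[1]
```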
Model                      d     |θ|       Accuracy
Word vector based models
  LSA                      —     —         49.0
  Skip-gram                640   102M      48.0
  IV LBL                   600   96.0M     55.5
Language models
  KN5                      —     —         40.0
  UDepNgram                —     —         48.3
  LDepNgram                —     —         50.0
  RNN                      300   48.1M     45.0
  RNNME                    300   1120M     49.3
  depRNN+3gram             100   1014M     53.5
  ldepRNN+4gram            200   1029M     50.7
  LBL                      300   48.0M     54.7
  LSTM                     300   29.9M     55.00
  LSTM                     400   40.2M     57.02
  LSTM                     450   45.3M     55.96
  Bidirectional LSTM       200   33.2M     48.46
  Bidirectional LSTM       300   50.1M     49.90
  Bidirectional LSTM       400   67.3M     48.65
Model combinations
  RNNMEs                   —     —         55.4
  Skip-gram + RNNMEs       —     —         58.9
Our models
  TreeLSTM                 300   31.6M     55.29
  LdTreeLSTM               300   32.5M     57.79
  TreeLSTM                 400   43.1M     56.73
  LdTreeLSTM               400   44.7M

Table 1: Model accuracy on the MSR sentence completion task. The results of KN5, RNNME and RNNMEs are reported in Mikolov (2012), LSA and RNN in Zweig et al. (2012), UDepNgram and LDepNgram in Gubbins and Vlachos (2013), depRNN+3gram and ldepRNN+4gram in Mirowski and Vlachos (2015), LBL in Mnih and Teh (2012), Skip-gram and Skip-gram+RNNMEs in Mikolov et al. (2013a), and IV LBL in Mnih and Kavukcuoglu (2013); d is the hidden size and |θ| the number of parameters in a model.

Table 1 presents a summary of our results together with previously published results. The best performing word vector model is IV LBL (Mnih and Kavukcuoglu, 2013) with an accuracy of 55.5, while the best performing single language model is LBL (Mnih and Teh, 2012) with an accuracy of 54.7. Both approaches are based on the log-bilinear language model (Mnih and Hinton, 2007). A combination of several recurrent neural networks and the skip-gram model holds the state of the art with an accuracy of 58.9 (Mikolov et al., 2013b). To fairly compare with existing models, we restrict the layer size of our models to 1.
Parser             Development          Test
                   UAS      LAS         UAS      LAS
MSTParser-2nd      92.20    88.78       91.63    88.44
TreeLSTM           92.51    89.07       91.79    88.53
TreeLSTM*          92.64    89.09       91.97    88.69
LdTreeLSTM
NN parser*         92.00    89.70       91.80    89.60
S-LSTM*

Table 2: Performance of TreeLSTM and LdTreeLSTM on reranking the top dependency trees produced by the 2nd order MSTParser (McDonald and Pereira, 2006). Results for the NN and S-LSTM parsers are reported in Chen and Manning (2014) and Dyer et al. (2015), respectively. * indicates that the model is initialized with pre-trained word vectors.
We observe that LdTreeLSTM consistently outperforms TreeLSTM, which indicates the importance of modeling the interaction between left and right dependents. An LSTM with d = 400 outperforms its smaller counterpart. Note that the LSTMs and BiLSTMs were also trained with NCE (with s = d/2; hyperparameters were tuned on the development set). The word embedding and output matrices (W_e and W_ho) dominate the number of parameters in all neural models except for RNNME, depRNN+3gram and ldepRNN+4gram, which include an ME model that contains 1 billion sparse n-gram features (Mikolov, 2012; Mirowski and Vlachos, 2015). The number of parameters in TreeLSTM and LdTreeLSTM is not much larger compared to the LSTM due to the tied W_e and W_ho matrices.

In this section we demonstrate that our model can also be used for parse reranking. This is not possible for sequence-based language models since they cannot estimate the probability of a tree. We use our models to rerank the top K dependency trees produced by the second order MSTParser (McDonald and Pereira, 2006). We follow closely the experimental setup of Chen and Manning (2014) and Dyer et al. (2015). Specifically, we trained TreeLSTM and LdTreeLSTM on Penn Treebank sections 2–21. We used section 22 for development and section 23 for testing. We adopted the Stanford basic dependency representations (De Marneffe et al., 2006); part-of-speech tags were predicted with the Stanford Tagger (Toutanova et al., 2003). We trained TreeLSTM and LdTreeLSTM as language models (singletons were replaced with UNK) and did not use any POS tags, dependency labels or composition features, whereas these features are used in Chen and Manning (2014) and Dyer et al. (2015). We tuned d, the number of layers, and K on the development set.
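In code, reranking reduces to rescoring the K candidate trees and returning the best one. The sketch below is illustrative; tree_logprob is a hypothetical scorer returning log P(S | T), and the optional interpolation with the base parser score is an added illustration, not a setting reported in the paper.

```python
def rerank(candidate_trees, tree_logprob, parser_scores=None, alpha=1.0):
    """Return the best of the top-K trees proposed by the base parser.

    candidate_trees : K dependency trees from the base parser
    tree_logprob    : hypothetical function returning log P(S | T)
    parser_scores   : optional base-parser scores to interpolate with
    alpha           : interpolation weight (illustrative, not from the paper)
    """
    best, best_score = None, float("-inf")
    for i, tree in enumerate(candidate_trees):
        score = tree_logprob(tree)
        if parser_scores is not None:
            score = alpha * score + (1.0 - alpha) * parser_scores[i]
        if score > best_score:
            best, best_score = tree, score
    return best
```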
Table 2 reports unlabeled attachment scores (UAS) and labeled attachment scores (LAS) for the MSTParser, TreeLSTM and LdTreeLSTM (with d and K set to the values selected on the development set). TreeLSTM and LdTreeLSTM outperform the baseline MSTParser, with LdTreeLSTM performing best. We also initialized the word embedding matrix W_e with pre-trained GloVe vectors (Pennington et al., 2014). We obtained a slight improvement over TreeLSTM (TreeLSTM* in Table 2; K = 4) but no improvement over LdTreeLSTM. Finally, notice that LdTreeLSTM is slightly better than the NN parser in terms of UAS but worse than the S-LSTM parser. In the future, we would like to extend our model so that it takes labeled dependency information into account.
This section demonstrates how to use a trained LdTreeLSTM to generate tree samples. Generation starts at the ROOT node. At each time step t, for each node w_t, we add a new edge and node to the tree. Unfortunately, during generation we do not know which type of edge to add. We therefore use four binary classifiers (ADD-LEFT, ADD-RIGHT, ADD-NX-LEFT and ADD-NX-RIGHT) to predict whether we should add a LEFT, RIGHT, NX-LEFT or NX-RIGHT edge. When a classifier predicts true, we use the corresponding LSTM to generate a new node by sampling from the predicted word distribution in Equation (3). The four classifiers take the previous hidden state H[:, t'] and the output embedding of the current node W_ho · e(w_t) as features (the input embeddings have lower dimensions and result in slightly worse classifiers). It is possible to get rid of the four classifiers by adding START/STOP symbols when generating left and right dependents, as in Eisner (1996). We refrained from doing this for computational reasons: for a sentence with N words, this approach leads to 2N additional START/STOP symbols (one START and one STOP symbol for each word), so the computational cost and memory consumption during training would be three times as much, rendering our model less scalable.

Figure 5: Generated dependency trees with LdTreeLSTM trained on the PTB.

Specifically, we use a trained LdTreeLSTM to go through the training corpus and generate hidden states and embeddings as input features; the corresponding class labels (true and false) are "read off" the training dependency trees. We use two-layer rectifier networks (Glorot et al., 2011) as the four classifiers, with a hidden size of 300, and train them using AdaGrad (Duchi et al., 2011) with a learning rate of 0.01; we also measured the accuracy of each of the four classifiers. We use the same LdTreeLSTM model as in Section 3.3 to generate dependency trees.
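A high-level sketch of the sampling loop described above: nodes are expanded breadth-first, the four binary classifiers decide which edges to add, and the corresponding LSTM samples the new word. The interfaces are hypothetical and, for readability, the classifiers here receive the tree node itself rather than the hidden-state features listed above.

```python
from collections import deque

def generate_tree(root_word, classifiers, sample_word, max_nodes=50):
    """Sample a dependency tree top-down, breadth-first.

    classifiers : dict with keys "ADD-LEFT", "ADD-RIGHT", "ADD-NX-LEFT",
                  "ADD-NX-RIGHT"; each is a hypothetical predicate deciding
                  whether to add the corresponding edge at the current node
    sample_word : hypothetical function sampling a word from Equation (3)
                  given the chosen edge type and the current node
    """
    root = {"word": root_word, "left": [], "right": []}
    queue = deque([root])
    n = 1
    while queue and n < max_nodes:
        node = queue.popleft()
        # first left dependent, then possibly further left dependents
        if classifiers["ADD-LEFT"](node):
            child = {"word": sample_word("LEFT", node), "left": [], "right": []}
            node["left"].append(child); queue.append(child); n += 1
            while classifiers["ADD-NX-LEFT"](child) and n < max_nodes:
                child = {"word": sample_word("NX-LEFT", child), "left": [], "right": []}
                node["left"].append(child); queue.append(child); n += 1
        # first right dependent, then possibly further right dependents
        if classifiers["ADD-RIGHT"](node):
            child = {"word": sample_word("RIGHT", node), "left": [], "right": []}
            node["right"].append(child); queue.append(child); n += 1
            while classifiers["ADD-NX-RIGHT"](child) and n < max_nodes:
                child = {"word": sample_word("NX-RIGHT", child), "left": [], "right": []}
                node["right"].append(child); queue.append(child); n += 1
    return root
```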
In this paper we developed TreeLSTM (and LdTreeLSTM), a neural network model architecture which is designed to predict tree structures rather than linear sequences. Experimental results on the MSR sentence completion task show that LdTreeLSTM is superior to sequential LSTMs. Dependency parsing reranking experiments highlight our model's potential for dependency parsing. Finally, the ability of our model to generate dependency trees holds promise for text generation applications such as sentence compression and simplification (Filippova et al., 2015). Although our experiments have focused exclusively on dependency trees, there is nothing inherent in our formulation that disallows its application to other types of tree structure such as constituent trees or even taxonomies.
Acknowledgments
We would like to thank Adam Lopez, Frank Keller, Iain Murray, Li Dong, Brian Roark, and the NAACL reviewers for their valuable feedback. Xingxing Zhang gratefully acknowledges the financial support of the China Scholarship Council (CSC). Liang Lu is funded by the UK EPSRC Programme Grant EP/I031022/1, Natural Speech Technology (NST).
References

[Bengio et al. 2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

[Charniak 2001] Eugene Charniak. 2001. Immediate-head parsing for language models. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 124–131. Association for Computational Linguistics.

[Chelba and Jelinek 2000] Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 14(4):283–332.

[Chelba et al. 1997] Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, Andreas Stolcke, et al. 1997. Structure and performance of a dependency language model. In EUROSPEECH. Citeseer.

[Chen and Manning 2014] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.

[Chen et al. 2015] X. Chen, X. Liu, M. J. F. Gales, and P. C. Woodland. 2015. Recurrent neural network language model training with noise contrastive estimation for speech recognition. In 40th IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5401–5405, Brisbane, Australia.

[Collobert et al. 2011] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376.

[De Marneffe et al. 2006] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

[Duchi et al. 2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

[Dyer et al. 2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.

[Eisner 1996] Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 340–345. Association for Computational Linguistics.

[Emami et al. 2003] Ahmad Emami, Peng Xu, and Frederick Jelinek. 2003. Using a connectionist model in a syntactical based language model. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 372–375, Hong Kong, China.

[Filippova et al. 2015] Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In EMNLP, pages 360–368.

[Glorot et al. 2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.

[Graves et al. 2013] Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE.

[Gubbins and Vlachos 2013] Joseph Gubbins and Andreas Vlachos. 2013. Dependency language models for sentence completion. In EMNLP, pages 1405–1410, Seattle, Washington, USA, October. Association for Computational Linguistics.

[Gutmann and Hyvärinen 2012] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Hochreiter 1998] Sepp Hochreiter. 1998. Vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, 6(2):107–116.

[Manning et al. 2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

[McDonald and Pereira 2006] Ryan T. McDonald and Fernando C. N. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.

[Mikolov et al. 2010] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

[Mikolov et al. 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the 2013 International Conference on Learning Representations, Scottsdale, Arizona, USA.

[Mikolov et al. 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

[Mikolov 2012] Tomas Mikolov. 2012. Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.

[Mirowski and Vlachos 2015] Piotr Mirowski and Andreas Vlachos. 2015. Dependency recurrent neural language models for sentence completion. In ACL, pages 511–517, Beijing, China, July. Association for Computational Linguistics.

[Mnih and Hinton 2007] Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648.

[Mnih and Kavukcuoglu 2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems 26, pages 2265–2273.

[Mnih and Teh 2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, Edinburgh, Scotland.

[Pascanu et al. 2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1310–1318, Atlanta, Georgia, USA.

[Pennington et al. 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. EMNLP, 12:1532–1543.

[Pollack 1990] Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1–2):77–105.

[Roark 2001] Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276.

[Rumelhart et al. 1988] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling, 5:3.

[Sennrich 2015] Rico Sennrich. 2015. Modelling and optimizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics, 3:169–182.

[Shen et al. 2008] Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL-08: HLT, pages 577–585, Columbus, Ohio, USA.

[Socher et al. 2011a] Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Ng. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

[Socher et al. 2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151–161, Edinburgh, Scotland, UK.

[Srivastava et al. 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

[Sutskever et al. 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[Tai et al. 2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July. Association for Computational Linguistics.

[Toutanova et al. 2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

[Vaswani et al. 2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, Washington, USA.

[Vinyals et al. 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, USA.

[Zhang 2009] Ying Zhang. 2009. Structured language models for statistical machine translation. Ph.D. thesis, Johns Hopkins University.

[Zweig and Burges 2012] Geoffrey Zweig and Chris J. C. Burges. 2012. A challenge set for advancing language modeling. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 29–36, Montréal, Canada.

[Zweig et al. 2012] Geoffrey Zweig, John C. Platt, Christopher Meek, Christopher J. C. Burges, Ainur Yessenalina, and Qiang Liu. 2012. Computational approaches to sentence completion.