In-Order Chart-Based Constituent Parsing
Yang Wei, Yuanbin Wu, and Man Lan
School of Computer Science and Technology, East China Normal University
[email protected], {ybwu,mlan}@cs.ecnu.edu.cn

Abstract
We propose a novel in-order chart-based model for constituent parsing. Compared with previous CKY-style and top-down models, our model gains advantages from the in-order traversal of a tree (rich features, lookahead information and high efficiency) and makes better use of structural knowledge by encoding the history of decisions. Experiments on the Penn Treebank show that our model outperforms previous chart-based models and achieves competitive performance compared with other discriminative single models.
Introduction

Constituent parsing has achieved great progress in recent years. Advanced neural networks, such as recurrent neural networks (RNNs) (Vinyals et al., 2015) and self-attentive networks (Kitaev and Klein, 2018a), provide new techniques for extracting powerful features for sentences. At the same time, new decoding algorithms have been developed to help search for correct parse trees faster. Examples include new transition systems (Watanabe and Sumita, 2015; Dyer et al., 2016; Liu and Zhang, 2017a), sequence-to-sequence parsers (Gómez-Rodríguez and Vilares, 2018; Shen et al., 2018), and chart-based algorithms, which are the topic of this paper.

Traditionally, decoders of chart parsers are based on the CKY algorithm: they search for the optimal tree in a bottom-up manner with dynamic programming. Stern et al. (2017a) propose a simple top-down decoder as an alternative to the CKY decoder. It starts at the biggest constituent (i.e., the sentence), and recursively splits constituents to generate intermediate non-terminals. Compared with dynamic programming, the top-down parser decodes greedily (and is thus much faster) but does not guarantee to find the optimal tree. Furthermore, when predicting a non-terminal, the rich information about its subtrees is ignored, which makes the parser underperform the CKY decoder.

In this work, inspired by the in-order transition system (Liu and Zhang, 2017a), we propose an in-order chart parser to explore the local structures of a non-terminal. Before producing a new constituent, the in-order parser first tries to collect enough substructure information by predicting its left child. Compared with its transition-based counterpart (Liu and Zhang, 2017a), which only predicts the labels of constituents (by shifting a non-terminal symbol onto the stack), the in-order chart parser predicts both the labels and the boundaries of constituents. We argue that the additional supervision signals on boundaries can help to better utilize tree structures and also reduce the search space of the parser. We also provide a dynamic oracle which the in-order chart decoder can follow when it deviates from the gold decoding trajectory.

Next, since both in-order and top-down decoding can be seen as sequential decision tasks, we would also like to investigate whether and how previous decisions influence the current decision. We introduce a new RNN to track history predictions, and the decoder takes the RNN's hidden states into account in each constituent prediction.

We conduct experiments on the benchmark PTB dataset. The results show that the in-order chart parser achieves competitive performance with the CKY and top-down decoders, and that tracking history predictions is also useful for further improving decoding performance.

To summarize, our main contributions include:

• A new in-order chart-based parser which is based on the in-order traversal of a constituent tree (Section 3.1).
• A dynamic oracle for the in-order parsing algorithm which can provide supervision signals even in a wrong decoding state (Section 3.2).
• A mechanism based on RNNs to track history decisions in top-down and in-order parsing (Section 5).

Algorithm 1 Top-Down Parsing
1: function TOPDOWNPARSING(i, j)
2:     if j = i + 1 then
3:         ℓ̂ ← Label(i, j)
4:     else
5:         ℓ̂ ← Label(i, j)
6:         k̂ ← Split(i, j)
7:         TOPDOWNPARSING(i, k̂)
8:         TOPDOWNPARSING(k̂, j)
9:     end if
10: end function
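For concreteness, the greedy top-down decoding of Algorithm 1 can be sketched in a few lines of Python. The functions label_score and span_score below are illustrative stand-ins for the learned scorers s_label and s_span defined later; they are not part of our actual implementation.

```python
def top_down_parse(i, j, label_score, span_score, labels):
    """Greedy top-down decoding (Algorithm 1) for span (i, j).

    Returns a nested tuple (i, j, label, children). `label_score(i, j, l)` and
    `span_score(i, j)` are assumed scorer callables; `labels` may include the
    empty label for spans that are not real constituents.
    """
    # Lines 3 / 5: choose the best label for the current span.
    best_label = max(labels, key=lambda l: label_score(i, j, l))
    if j == i + 1:                      # length-one span: stop recursing
        return (i, j, best_label, [])
    # Line 6: choose the split point with the highest combined span scores.
    best_k = max(range(i + 1, j),
                 key=lambda m: span_score(i, m) + span_score(m, j))
    left = top_down_parse(i, best_k, label_score, span_score, labels)
    right = top_down_parse(best_k, j, label_score, span_score, labels)
    return (i, j, best_label, [left, right])
```

Decoding a sentence of length n then amounts to calling top_down_parse(0, n, ...), i.e., starting from the whole-sentence span.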
Model

Given a sentence (w_0, w_1, ..., w_{n−1}) of length n, its constituent tree T can be represented by a collection of labeled spans of the sentence,

T ≜ {(i, j, ℓ) | span (i, j) with label ℓ is a constituent},

where i and j − 1 are the left and right boundaries of a span, respectively. The parsing task is to identify the spans in T.

Typically, a neural constituent parser contains two components: an encoder which assigns scores to labeled spans, and a decoder which finds the best span collection. We first describe our encoder.

We represent each word w_i using three pieces of information: a randomly initialized word embedding e_i, a character-based embedding c_i obtained by a character-level LSTM, and a randomly initialized part-of-speech tag embedding p_i. We concatenate these three embeddings to generate a representation of word w_i, x_i = [e_i; c_i; p_i].

To build the representation s_ij of an unlabeled span (i, j), following Stern et al. (2017a), we first encode the sentence with a bidirectional LSTM. Let →h_i and ←h_i be the forward and backward hidden states at the i-th position. The representation of span (i, j) is the concatenation of the vector differences →h_j − →h_i and ←h_i − ←h_j,

s_ij = [→h_j − →h_i; ←h_i − ←h_j].
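The following small sketch illustrates how such span features can be assembled; the arrays fwd and bwd are hypothetical forward and backward BiLSTM state matrices (one vector per fencepost position), not names from our implementation.

```python
import numpy as np

def span_representation(fwd, bwd, i, j):
    """s_ij = [fwd[j] - fwd[i]; bwd[i] - bwd[j]].

    `fwd` and `bwd` are assumed to hold the forward/backward BiLSTM hidden
    states at each fencepost position (shape: [n + 1, hidden_dim]).
    """
    return np.concatenate([fwd[j] - fwd[i], bwd[i] - bwd[j]])

# Toy usage: random states for a 5-word sentence (6 fencepost positions).
rng = np.random.default_rng(0)
fwd = rng.standard_normal((6, 4))
bwd = rng.standard_normal((6, 4))
s_02 = span_representation(fwd, bwd, 0, 2)   # feature vector for span (0, 2)
```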
Given s_ij, the score functions of spans and labels are implemented as two-layer feedforward neural networks,

s_label(i, j, ℓ) = v_ℓ^T f(W_2^ℓ f(W_1^ℓ s_ij + b_1^ℓ) + b_2^ℓ),
s_span(i, j) = v_s^T f(W_2^s f(W_1^s s_ij + b_1^s) + b_2^s),    (1)

where f denotes a nonlinear function (ReLU), and v, W, b are model parameters.

We define the score of a tree to be the sum of its label scores and span scores:

s_tree(T) = Σ_{(i,j,ℓ)∈T} [s_label(i, j, ℓ) + s_span(i, j)].
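As an illustration of Equation (1) and the tree score, here is a minimal numpy sketch. The hidden sizes, label inventory, and the helper name mlp_scorer are illustrative assumptions rather than our actual configuration.

```python
import numpy as np

def mlp_scorer(in_dim, hid, out_dim, rng):
    """Scorer of the form v^T f(W2 f(W1 s + b1) + b2) with f = ReLU."""
    W1, b1 = 0.1 * rng.standard_normal((hid, in_dim)), np.zeros(hid)
    W2, b2 = 0.1 * rng.standard_normal((hid, hid)), np.zeros(hid)
    V = 0.1 * rng.standard_normal((out_dim, hid))
    relu = lambda x: np.maximum(x, 0.0)
    return lambda s: V @ relu(W2 @ relu(W1 @ s + b1) + b2)

rng = np.random.default_rng(0)
num_labels = 5                                      # illustrative label set size
label_scorer = mlp_scorer(8, 16, num_labels, rng)   # s_label(i, j, .)
span_scorer = mlp_scorer(8, 16, 1, rng)             # s_span(i, j)

def tree_score(tree_spans, span_repr):
    """Sum of label and span scores over labeled spans (i, j, l); l is a label index."""
    total = 0.0
    for (i, j, l) in tree_spans:
        s_ij = span_repr(i, j)
        total += label_scorer(s_ij)[l] + span_scorer(s_ij)[0]
    return total
```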
The goal of a decoder is to (approximately) find a tree with the highest score. There are many strategies for designing a decoder. From the perspective of tree traversal, we have two types of chart-based decoders.

The first one is the CKY decoder, which is based on dynamic programming. It follows the post-order traversal of a tree (first visiting the children, then the parent), and is able to find optimal trees at a relatively high time complexity of O(n^3).

The second one is the top-down decoder (Stern et al., 2017a), which is based on a greedy algorithm. It is executed according to the pre-order traversal of a tree (Algorithm 1). Given a span (i, j), the top-down decoder first chooses the best non-terminal label ℓ for (i, j), and predicts a split k to produce two new constituents (i, k) and (k, j). Then it recursively parses (i, k) and (k, j) until meeting spans of length one. The function Label in lines 3 and 5 is

Label(i, j) = argmax_ℓ s_label(i, j, ℓ),    (2)

and the function Split in line 6 is

Split(i, j) = argmax_{i < k < j} [s_span(i, k) + s_span(k, j)].

Algorithm 2 In-Order Parsing
1: function INORDERPARSING(i, j, R)
2:     if j = R then
3:         ℓ̂ ← Label(i, j)
4:     else
5:         ℓ̂ ← Label(i, j)
6:         k̂ ← Parent(i, j, R)
7:         INORDERPARSING(j, j + 1, k̂)
8:         INORDERPARSING(i, k̂, R)
9:     end if
10: end function

Figure 1: An execution of our in-order parsing algorithm (a) and the resulting constituent tree (b) for the sentence "She loves writing code." from Stern et al. (2017a). Beginning with the leftmost child span (0, 1), the algorithm predicts its label NP and the right boundary 5 of its parent span. Because span (0, 5) has reached the right bound 5, there is no need to predict its parent span. Then the algorithm recursively acts on the right subtree ranging from 1 to 5 with a right bound 5. The algorithm terminates when the parent spans of all spans have been predicted. The dotted arrows show the order in which the whole algorithm is executed, and the numbers in the circles indicate the order in which the right boundaries are predicted. Notice that the empty set label ∅ represents spans which do not really exist in the gold trees, and the unary label S-VP is predicted in a single step.
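The execution in Figure 1 can be reproduced with a short sketch of Algorithm 2. As with the earlier sketches, label_score and span_score are assumed scorer callables; the parent-boundary choice below maximizes the span scores of the candidate parent and sibling spans, mirroring the Parent function defined just below.

```python
def in_order_parse(i, j, R, label_score, span_score, labels, out):
    """Greedy in-order decoding (Algorithm 2); appends labeled spans to `out`."""
    # Label the current span (the label may be the empty label).
    best_label = max(labels, key=lambda l: label_score(i, j, l))
    out.append((i, j, best_label))
    if j == R:                       # span (i, j) has reached the right bound
        return
    # Parent(i, j, R): right boundary of the parent span of (i, j).
    k = max(range(j + 1, R + 1),
            key=lambda m: span_score(i, m) + span_score(j, m))
    in_order_parse(j, j + 1, k, label_score, span_score, labels, out)  # right sibling region
    in_order_parse(i, k, R, label_score, span_score, labels, out)      # parent span (i, k)

# Usage for a sentence of n words: spans = []; in_order_parse(0, 1, n, ..., spans),
# i.e., starting from the leftmost length-one span with right bound n.
```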
Our in-order decoder is executed according to the in-order traversal of a tree (Algorithm 2). Given a span (i, j) with a right bound R, it first chooses the best label for (i, j) with the same function Label, and then predicts the right boundary k of the parent span of (i, j),

Parent(i, j, R) = argmax_{j < k ≤ R} [s_span(i, k) + s_span(j, k)].    (3)

It then recursively parses the right sibling region starting at j (with right bound k) and the parent span (i, k) (with right bound R). In fact, the in-order chart parser can be seen as augmenting the action "shift a non-terminal" in the in-order transition system with span boundaries (e.g., "shift an NP together with the right bound of its parent span"). One rationale behind adding the boundary constraint is that the in-order label sequence alone is not sufficient to determine the full structure of a tree; we also need to know the boundaries of the corresponding non-terminals. Hence, we consider the supervision signals in the chart-based parser to be more accurate than those in the in-order transition-based parser.

Finally, the in-order decoder is also a greedy algorithm, and it enjoys similar parsing speed to top-down parsers.

Dynamic Oracle

During the training process, the decoder may output incorrect spans at some steps, and the following decoding process should be able to continue based on those incorrect intermediates. In this section, we develop a dynamic oracle for the in-order parser. It helps to provide supervision signals even in a wrong decoding state.

Formally, for any span (i, j), the dynamic oracle aims to find an oracle label and a set of oracle right boundaries S. If span (i, j) is in the gold tree, then by following the decisions in S the decoder should finally construct the full gold tree. Otherwise, the best potential tree after adopting any decision in S should be the same as the best one at the current step, which means that the decisions in S should not reduce the set of future reachable gold spans.

For label decisions, if a span is contained in the gold tree, the oracle label is simply its label in the gold tree. Otherwise the oracle label is the empty label ∅.

For right boundary decisions, given a span (i, j) and a right bound R, our goal is to find a set of right boundaries which are not greater than R. Besides, we have to make sure that the optimal reachable constituent tree generated after adopting these right boundaries is consistent with that at the current step.

Algorithm 3 Dynamic Oracle for Our In-Order Parser
Input: the span (i, j) to be analysed and the right bound R of its parent span
Output: the set S of oracle right boundaries of the parent span of (i, j), in which any element ĵ satisfies j < ĵ ≤ R
1: Identify the smallest enclosing gold constituent (i', j') of span (i, j) which is not equal to it
2: if j' = j then
3:     Let j* = R
4: else
5:     Let j* = min(j', R)
6: end if
7: Identify the smallest enclosing gold constituent (ĩ, j̃) of span (j, j*)
8: if ĩ + 1 = j̃ then
9:     Let S = {j̃}
10: else
11:     Let S = {k ∈ b(ĩ, j̃) | j < k ≤ j*}
12: end if
13: return S

Algorithm 3 shows our dynamic oracle for right boundary decisions. First, in line 1, for span (i, j), the algorithm identifies the smallest enclosing gold constituent (i', j') which is not equal to it. Then, in lines 2 to 6, it selects j* according to the values of j' and j. Next, in line 7, for span (j, j*), it identifies the smallest enclosing gold constituent (ĩ, j̃). Finally, in lines 8 to 13, if span (ĩ, j̃) is of length 1, it directly returns the right boundary of that span; otherwise it returns the set of right boundaries of its child spans which also lie inside span (j, j*), {k ∈ b(ĩ, j̃) | j < k ≤ j*}. Here b(ĩ, j̃) represents the set of right boundaries of the child spans of (ĩ, j̃). For example, given a span (1, 7) along with its child spans (1, 3), (3, 6) and (6, 7), we would have b(1, 7) = {3, 6, 7}.
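A minimal sketch of Algorithm 3 is given below. The gold tree is represented here as a hypothetical dictionary mapping each gold span to the list of its child spans (length-one spans map to empty lists); this representation and the helper smallest_enclosing are illustrative assumptions only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical gold-tree representation: gold span -> list of its child spans.
GoldTree = Dict[Tuple[int, int], List[Tuple[int, int]]]

def smallest_enclosing(gold: GoldTree, i: int, j: int, proper: bool) -> Tuple[int, int]:
    """Smallest gold span containing (i, j); strictly larger than (i, j) if `proper`."""
    candidates = [(a, b) for (a, b) in gold
                  if a <= i and j <= b and not (proper and (a, b) == (i, j))]
    return min(candidates, key=lambda span: span[1] - span[0])

def oracle_right_boundaries(gold: GoldTree, i: int, j: int, R: int) -> Set[int]:
    """Oracle right boundaries S for the parent span of (i, j), following Algorithm 3."""
    _, jp = smallest_enclosing(gold, i, j, proper=True)          # (i', j'), line 1
    j_star = R if jp == j else min(jp, R)                        # lines 2-6
    ti, tj = smallest_enclosing(gold, j, j_star, proper=False)   # (i~, j~), line 7
    if ti + 1 == tj:                                             # length-one gold span
        return {tj}
    rights = [b for (_, b) in gold[(ti, tj)]]                    # b(i~, j~)
    return {k for k in rights if j < k <= j_star}
```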
In our implementation, we choose the rightmost boundary in S as our oracle decision. This does not affect performance since different choices correspond to different binarizations of the original n-ary tree. The proof of correctness of our dynamic oracle is similar to that in Cross and Huang (2016). For better understanding, we present a more detailed explanation in the supplementary material.

Training

We use margin training to learn these models, which has been widely used in structured prediction (Taskar et al., 2005). For a span (i, j) in the gold constituent tree, let ℓ* represent its gold label and k* represent the gold right boundary of its parent span. Let ℓ̂ and k̂ represent the decisions made in Equations (2) and (3). If ℓ̂ ≠ ℓ*, we define the hinge loss as

max(0, 1 − s_label(i, j, ℓ*) + s_label(i, j, ℓ̂)).

Otherwise we define the loss to be zero. Similarly, if k̂ ≠ k*, we define the hinge loss as

max(0, 1 − s_parent(i, j, k*) + s_parent(i, j, k̂)),

where s_parent(i, j, k) is defined as

s_parent(i, j, k) = s_span(i, k) + s_span(j, k).

For a single training example, we accumulate the hinge losses at all decision points. Finally, we minimize the sum of the training objectives over all training examples.

Having defined the dynamic oracle for our in-order parsing model in Section 3.2, we can deal with all spans even if they are not in the gold tree. In our implementation, we can train with exploration to increase the number of incorrect samples and better handle them. More specifically, we follow the decisions predicted by the model instead of the gold decisions, as at testing time, and the dynamic oracle can still provide supervision.

Tracking History Decisions

Both top-down and in-order decoding can be seen as sequential decision processes. In the above setting, the prediction of labels and parent spans (splits) only considers features of the current span. We can also take previous decoding decisions into account. Here, we propose using an LSTM to track previous decoding results and to utilize that history information in the label and parent selection.

Suppose that span (i, j) with label ℓ is the t-th prediction of the decision sequence in our parser. We encode this information using an LSTM whose t-th input is [s_ij; E_ℓ] (i.e., the concatenation of the span representation s_ij and the label embedding E_ℓ of ℓ),

h_t = LSTM([s_ij; E_ℓ], h_{t−1}).

Figure 2: Two different types of LSTMs to encode the history of decisions in Figure 1: (a) chain-LSTM and (b) stack-LSTM. We only illustrate the decoding process of the phrase "loves writing code." here. Notice that the orange and blue input vectors denote the representations of spans and labels, respectively. The red and green output vectors denote the predictions of right boundaries and labels, respectively.

We replace s_ij in Equation (1) with [s_ij; h_t] to utilize the history decision information in the prediction of the next labeled span.
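The following sketch shows the history tracking in its chain form: after each labeled-span decision, the concatenation of the span representation and a label embedding is fed to an LSTM, and the new hidden state is appended to the span features for the next prediction. The LSTM cell, dimensions and label embeddings here are illustrative stand-ins, not our actual implementation (the stack variant, which restores an earlier state before predicting a parent span, is not shown).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class HistoryLSTM:
    """Minimal LSTM cell for h_t = LSTM([s_ij; E_l], h_{t-1})."""
    def __init__(self, in_dim, hid, rng):
        self.W = 0.1 * rng.standard_normal((4 * hid, in_dim + hid))
        self.b = np.zeros(4 * hid)
        self.h = np.zeros(hid)
        self.c = np.zeros(hid)

    def step(self, x):
        z = self.W @ np.concatenate([x, self.h]) + self.b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        self.c = f * self.c + i * g
        self.h = o * np.tanh(self.c)
        return self.h

rng = np.random.default_rng(0)
label_emb = {"NP": rng.standard_normal(4), "VP": rng.standard_normal(4)}  # illustrative
history = HistoryLSTM(in_dim=8 + 4, hid=16, rng=rng)

s_ij = rng.standard_normal(8)                         # span representation (toy dimension)
h_t = history.step(np.concatenate([s_ij, label_emb["NP"]]))
augmented = np.concatenate([s_ij, h_t])               # used in place of s_ij in Equation (1)
```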
There are two variants of the LSTM setting. The first one encodes the history decisions in the order of the in-order traversal. As shown in Figure 2(a), after predicting the labeled span (1, 4, VP), we use its representation as input to the LSTM and utilize the output to predict the next right boundary 3, and then its label ∅. We call it chain-LSTM because the whole LSTM is linear and has no branching in the middle. In the second variant, when predicting a labeled span, we ignore previous decisions from the right subtree of its left subtree. For example, in Figure 2(b), after inputting the representation of the labeled span (1, 4, VP), we first predict the next right boundary 3 with label ∅ and then predict its right subtree in the same way as the chain-LSTM. However, when we predict its parent span, we only use the history decisions before span (1, 4), without considering its right subtree ranging from 2 to 4. The output of the LSTM after inputting the representation of the labeled span (1, 4, VP) is therefore used twice, once for predicting its right subtree and once for its parent span. We call it stack-LSTM here because its structure is similar to the one described in Dyer et al. (2015).

Experiments

We use the standard benchmark of the WSJ sections in the PTB (Marcus et al., 1993) for our experiments, where Sections 2-21 are used as training data, Section 22 as development data and Section 23 as testing data. For words in the testing corpus but not in the training corpus, we replace them with a unique unknown-word label.

Model                  F1       Sents/Sec
CKY-style parser       91.79     5.55
Top-down parser        91.77    25.70
Top-down parser*       -         -

Table 1: Comparison of results and running time on the PTB test set. The in-order S-R parser represents the in-order shift-reduce transition-based parser. * represents adding the information of history decisions.

Table 2: Development set results of the in-order parsers with and without the information of history decisions. LR and LP represent labeled recall and precision, respectively.

Table 1 compares our in-order parser with the CKY-style and top-down models. We observe that our in-order model slightly outperforms both of them in F1 score. This confirms that our model can not only consider the lookahead information of the top-down model, but also benefit from the rich features of the CKY-style model. Next, we compare our model with the in-order transition-based model. Our model also outperforms theirs, because their model does not predict the boundaries together with the labels; this corresponds with our conjecture that the label sequence alone cannot uniquely determine a constituent tree. In addition, we add the information of history decisions to the top-down parser and our in-order parser. We observe that it improves the performance of the two parsers without losing too much efficiency. Finally, we compare the running time of these parsers. Since our in-order decoding is also based on a greedy algorithm, it has similar efficiency to the top-down parser.

In order to verify the validity of the LSTM encoding the history decisions, we remove it and parse the tree using only span representations. The comparison results are shown in Table 2. We observe that the in-order model indeed benefits from the chain-LSTM. However, adding the stack-LSTM reduces performance unexpectedly. We speculate that the stack-LSTM does not encode the whole history of decisions as the chain-LSTM does. It only utilizes part of the history information to guide the prediction of spans, which leads to performance degradation.

Next, we evaluate different configurations of the LSTM encoding history decisions, as shown in Table 3. First, we compare the results in lines 2 and 6. We observe that span representations give better results than label embeddings as the input of the LSTM.
The same results can be seen in lines 3 and 7, which means that span representations are more important as the input encoding of history decisions. Then, for the output of the LSTM, we find that span predictions do not benefit much from history decisions, as shown in lines 2 and 3. The best result occurs in line 5, which uses both label embeddings and span representations as input and utilizes the output to predict both labels and spans.

Model                        LR      LP      F1
LEmb, LPre                   91.51   92.90   92.20
LEmb, LPre + SPre            91.51   92.97   92.23
LEmb + SRep, LPre            91.56   92.97   92.26
LEmb + SRep, LPre + SPre     91.67   93.00   92.33
SRep, LPre                   91.27   93.27   92.26
SRep, LPre + SPre            91.56   93.02   92.28

Table 3: Development set results of the in-order parsers with different configurations of the LSTM encoding history decisions. "LEmb" and "SRep" represent using label embeddings and span representations as the input of the LSTM, respectively. "LPre" and "SPre" represent using the history decision information to predict labels and spans, respectively. All the models track the history decisions based on the stack-LSTM here.

The final test results are shown in Table 4. We achieve competitive results compared with other discriminative single models trained without external data. The best result, from Kitaev and Klein (2018a), is obtained by a self-attentive encoder, and the second best one, from Teng and Zhang (2018), is obtained by local predictions on spans and CFG rules. Additionally, Stern et al. (2017b) get a better result than ours by generative methods.

Model                        LR      LP      F1
Discriminative
Vinyals et al. (2015)        -       -       88.3
Zhu et al. (2013)            90.2    90.7    90.4
Cross and Huang (2016)       90.5    92.1    91.3
Liu and Zhang (2017b)        91.3    92.1    91.7
Stern et al. (2017a)         90.3    93.2    91.8
Liu and Zhang (2017a)        -       -       91.8
Shen et al. (2018)           92.0    91.7    91.8
Hong and Huang (2018)        91.5    92.5    92.0
Teng and Zhang (2018)        92.2    92.5    92.4
Kitaev and Klein (2018a)     93.2    93.9    93.55
Our in-order parser          91.1    93.0    92.0
Generative
Dyer et al. (2016)           -       -       89.8
Stern et al. (2017b)         92.5    92.5    92.5

Table 4: Final results on the PTB test set. Here we only compare with single-model parsers trained without external data.

Furthermore, we use pre-trained BERT (Devlin et al., 2018) to improve our in-order parser by simply concatenating the initial input embeddings and the BERT vectors. We do not use fine-tuned BERT since DyNet has no ready-made BERT implementation, and we will implement it in future work. Despite this, the result shows an improvement of 0.7 (92.7 F1) over the single model. The best result using fine-tuned BERT is Kitaev and Klein (2018b), which uses the same model as Kitaev and Klein (2018a). They use a Transformer (Vaswani et al., 2017) rather than an LSTM as the encoder and obtain a rather high F1 score of 95.7, which shows the powerful encoding capability of the Transformer.

We select three examples from the test corpus and show the predictions of the three chart-based parsers in Table 5.

Given the 5-th sentence, only the CKY-style parser gives a wrong prediction. We observe that the CKY-style parser combines the phrase "step up" and the following phrases into a VP because of its post-order traversal. However, the top-down and in-order parsers first assign ADVP to the word "up" and then combine it with the following phrases. This shows that the in-order parser can make the best of the lookahead information of the top-down parser.

Given the 1090-th sentence, the top-down parser predicts incorrectly. It generates a VP for the phrase "really got to ..." first. However, this phrase should be combined with the word "'ve"; it should not be reduced to a VP individually.
The CKY-style parser and the in-order parser give the right predictions because they do not assign a non-terminal to the phrase "really got to ..." at the beginning, and reduce the whole phrase only after all the phrases inside it have been reduced. This shows that our in-order parser can make better use of the rich features from local information than the top-down parser.

Given the 1938-th sentence, all three parsers give wrong predictions. The problems of the top-down parser and the in-order parser are the same: they treat the words "panic" and "bail" as juxtaposed verbs that are both modified by the prepositional phrase "out of ...". This can only be solved by considering the whole phrase after the two verbs. However, the CKY-style model also suffers from this problem and incorrectly predicts the label of the word "panic". We can speculate that the top-down parser and the in-order parser may give better predictions on unary chains than the CKY-style parser.

Sent (Line 5): ... step up to the plate to support the beleaguered floor traders ...
  Gold:       ... (VP step (ADVP up (PP to (NP the plate))) (S (VP to (VP support ...)))) ...
  CKY-style:  ... (VP step up (PP to (NP the plate)) (S (VP to (VP support ...)))) ...
  Top-down:   ... (VP step (ADVP up (PP to (NP the plate))) (S (VP to (VP support ...)))) ...
  In-order:   ... (VP step (ADVP up (PP to (NP the plate))) (S (VP to (VP support ...)))) ...

Sent (Line 1090): ... they 've really got to make the investment in people ...
  Gold:       ... (NP they) (VP 've (ADVP really) (VP got (S (VP to (VP make (NP ...)))))) ...
  CKY-style:  ... (NP they) (VP 've (ADVP really) (VP got (S (VP to (VP make (NP ...)))))) ...
  Top-down:   ... (NP they) (VP 've (VP (ADVP really) got (S (VP to (VP make (S ...)))))) ...
  In-order:   ... (NP they) (VP 've (ADVP really) (VP got (S (VP to (VP make (NP ...)))))) ...

Sent (Line 1938): They could still panic and bail out of the market .
  Gold:       ... (VP (VP panic) and (VP bail (PRT out) (PP of (NP the market)))) ...
  CKY-style:  ... (VP panic and bail (PP out (PP of (NP the market)))) ...
  Top-down:   ... (VP (VP panic) and bail (PP out (PP of (NP the market)))) ...
  In-order:   ... (VP (VP panic) and bail (PP out (PP of (NP the market)))) ...

Table 5: Predictions of the three chart-based parsers on three examples from the test corpus. Red words represent the wrong predictions.

Related Work

During recent years, many efforts have been made to balance the performance and efficiency of parsing models. Building on transition systems (Chen and Manning, 2014), the in-order model (Liu and Zhang, 2017a) was proposed to integrate the rich features of bottom-up models (Zhu et al., 2013; Cross and Huang, 2016) and the lookahead information of top-down models (Liu and Zhang, 2017b; Smith et al., 2017). In order to solve the exposure bias problem, Fernández-González and Gómez-Rodríguez (2018) propose dynamic oracles for top-down and in-order models, which achieve the best performance amongst transition-based models. Furthermore, Hong and Huang (2018) develop a new transition-based parser that searches over an exponentially large space, like CKY-style models, using beam search and cube pruning, and reduce the time complexity to linear.

Chart-based models usually have better performance but lower efficiency than transition-based models.
Different from the action sequence predictions in transition-based models, CKY-style and top-down inference are applied to chart-based models based on independent scoring of labels and spans (Stern et al., 2017a; Gaddy et al., 2018). However, CKY-style models have a high time complexity of O(n^3), and top-down models cannot take all states into account, which loses some performance. Vieira and Eisner (2017) improve CKY-style models both in accuracy and in runtime by learning pruning policies. Besides, a self-attentive encoder has also been used to obtain better lexical representations and achieves the best result so far (Kitaev and Klein, 2018a). Our model is inspired by Liu and Zhang (2017a): we also parse constituent trees in the order of the in-order traversal, which can efficiently make the best of lookahead information and local information at the same time. Besides, we use an LSTM to encode the history decisions and use its outputs to better predict future labeled spans.

In future work, we will replace the LSTM encoder with the more powerful Transformer and try fine-tuned BERT to further improve the performance.

Conclusion

We propose a novel in-order chart-based constituent parsing model which utilizes the information of history decisions to improve performance. The model not only achieves the high efficiency of the top-down models, but also performs better than the CKY-style models. Besides, we argue that history decisions are indeed helpful to the decoders of the top-down and in-order models. Our model achieves competitive results amongst discriminative single models and is superior to previous chart-based models.

References

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014, pages 740–750.

James Cross and Liang Huang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of EMNLP 2016, pages 1–11.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL-IJCNLP 2015, Volume 1: Long Papers, pages 334–343.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of NAACL-HLT 2016, pages 199–209.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2018. Dynamic oracles for top-down and in-order shift-reduce constituent parsing. In Proceedings of EMNLP 2018, pages 1303–1313.

David Gaddy, Mitchell Stern, and Dan Klein. 2018. What's going on in neural constituency parsers? An analysis. In Proceedings of NAACL-HLT 2018, Volume 1: Long Papers, pages 999–1010.
Carlos Gómez-Rodríguez and David Vilares. 2018. Constituent parsing as sequence labeling. In Proceedings of EMNLP 2018, pages 1314–1324.

Juneki Hong and Liang Huang. 2018. Linear-time constituency parsing with RNNs and dynamic programming. In Proceedings of ACL 2018, Volume 2: Short Papers, pages 477–483.

Nikita Kitaev and Dan Klein. 2018a. Constituency parsing with a self-attentive encoder. In Proceedings of ACL 2018, Volume 1: Long Papers, pages 2675–2685.

Nikita Kitaev and Dan Klein. 2018b. Multilingual constituency parsing with self-attention and pre-training. CoRR, abs/1812.11760.

Jiangming Liu and Yue Zhang. 2017a. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.

Jiangming Liu and Yue Zhang. 2017b. Shift-reduce constituent parsing with neural lookahead features. Transactions of the Association for Computational Linguistics, 5:45–58.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron C. Courville, and Yoshua Bengio. 2018. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of ACL 2018, Volume 1: Long Papers, pages 1171–1180.

Noah A. Smith, Chris Dyer, Miguel Ballesteros, Graham Neubig, Lingpeng Kong, and Adhiguna Kuncoro. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of EACL 2017, Volume 1: Long Papers, pages 1249–1258.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017a. A minimal span-based neural constituency parser. In Proceedings of ACL 2017, Volume 1: Long Papers, pages 818–827.

Mitchell Stern, Daniel Fried, and Dan Klein. 2017b. Effective inference for generative neural parsing. In Proceedings of EMNLP 2017, pages 1695–1700.

Benjamin Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of ICML 2005, pages 896–903.

Zhiyang Teng and Yue Zhang. 2018. Two local models for neural constituent parsing. In Proceedings of COLING 2018, pages 119–132.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6000–6010.

Tim Vieira and Jason Eisner. 2017. Learning to prune: Exploring the frontier of fast and accurate parsing. Transactions of the Association for Computational Linguistics, 5:263–278.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28 (NIPS 2015), pages 2773–2781.

Taro Watanabe and Eiichiro Sumita. 2015. Transition-based neural constituent parsing. In Proceedings of ACL-IJCNLP 2015, Volume 1: Long Papers, pages 1169–1179.

Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of ACL 2013, Volume 1: Long Papers.