Neural Ranking Models for Temporal Dependency Structure Parsing
Yuchen Zhang
Brandeis University [email protected]
Nianwen Xue
Brandeis University [email protected]
Abstract
We design and build the first neural temporal dependency parser. It utilizes a neural ranking model with minimal feature engineering, and parses time expressions and events in a text into a temporal dependency tree structure. We evaluate our parser on two domains: news reports and narrative stories. In a parsing-only evaluation setup where gold time expressions and events are provided, our parser reaches 0.81 and 0.70 f-score on unlabeled and labeled parsing respectively, a result that is very competitive against alternative approaches. In an end-to-end evaluation setup where time expressions and events are automatically recognized, our parser beats two strong baselines on both data domains. Our experimental results and discussions shed light on the nature of temporal dependency structures in different domains and provide insights that we believe will be valuable to future research in this area.
Temporal relation classification is important for a range of NLP applications such as story timeline construction, question answering, and summarization. Most work on temporal information extraction models the task as a pair-wise classification problem (Bethard et al., 2007; Chambers et al., 2007; Chambers and Jurafsky, 2008; Ning et al., 2018a): given an individual pair of time expressions and/or events, the system predicts whether they are temporally related and which specific relation holds between them. An alternative approach is to model the temporal relations in a text as a temporal dependency structure (TDS) for the entire text (Kolomiyets et al., 2012). Such a temporal dependency structure has the advantage that (1) it can be easily used to infer, via the transitivity properties of temporal relations, additional temporal relations between time expressions and/or events that are not directly connected, (2) it is computationally more efficient because a model does not need to consider all pairs of time expressions and events in a text, and (3) it is easier to use for downstream applications such as timeline construction.

However, most existing automatic systems are pair-wise models trained with traditional statistical classifiers using a large number of manually crafted features (Bethard et al., 2017). The few exceptions include the work of Kolomiyets et al. (2012), which describes a temporal dependency parser based on traditional feature-based classifiers, and Dligach et al. (2017), which describes a system using neural network based models to classify individual temporal relations. More recently, a semi-structured approach has also been proposed (Ning et al., 2018b).

In this work, taking advantage of a newly available data set annotated with temporal dependency structures, the Temporal Dependency Tree (TDT) Corpus (Zhang and Xue, 2018; https://github.com/yuchenz/structured_temporal_relations_corpus), we develop a neural temporal dependency structure parser using minimal hand-crafted linguistic features. One of the advantages of neural network based models is that they are easily adaptable to new domains, and we demonstrate this advantage by evaluating our temporal dependency parser on data from two domains: news reports and narrative stories. Our results show that our model beats a strong logistic regression baseline. Direct comparison with existing models is impossible because the only similar dataset used in previous work that we are aware of is not available to us (Kolomiyets et al., 2012), but we show that our models are competitive against similar systems reported in the literature.

The main contributions of this work are:

• We design and build the first end-to-end neural temporal dependency parser. The parser is based on a novel neural ranking model that takes a raw text as input, extracts events and time expressions, and arranges them in a temporal dependency structure.

• We evaluate the parser by performing experiments on data from two domains: news reports and narrative stories, and show that our parser is competitive against similar parsers. We also show that the two domains have very different temporal structural patterns, an observation that we believe will be very valuable to future temporal parser development.

The rest of the paper is organized as follows. Since temporal structure parsing is a relatively new task, we give a brief problem description in §2. We describe our end-to-end pipeline system in §3. Details of the neural ranking model are discussed in §4. The remaining sections present our experiments, error analysis, related work, and conclusions.
In this section we give a brief description of the temporal dependency parsing task (more details in Zhang and Xue (2018)). In a temporal structure parsing task, a text is parsed into a dependency tree structure that represents the inherent temporal relations among time expressions and events in the text. The nodes in this tree are mostly time expressions and events, which are represented as contiguous spans of words in the text. They can also be pre-defined meta nodes, which serve as reference times for other time expressions and events, and they constitute the top-most part of the tree. For example, Past Ref, Present Ref, Future Ref, and Document Creation Time (DCT) are all pre-defined meta nodes. The edges in the tree represent temporal relations between each parent-child pair. The temporal relations can be one of Includes, Before, Overlap, and After, or Depend-on, which holds between two time expressions.

Unlike syntactic dependency parsing, where each word in a sentence is a node in the dependency structure, in a temporal dependency structure only some of the words in a text are nodes in the structure. Therefore, this process naturally falls into two stages: first time expression and event recognition, and then temporal relation parsing. Figure 1 is an example temporal dependency tree for a news report paragraph.

Because different types of time expressions and events behave differently in terms of what can be their antecedents, and recognition of these types can be helpful for determining temporal relations, finer classifications of time expressions and events are also defined. Time expressions are further classified into Vague Time, Absolute Concrete Time, and Relative Concrete Time, according to whether or not the time expression can be located on the timeline, and whether or not the interpretation of its temporal location depends on another time expression. Events are further classified into Eventive Event, State, Habitual Event, Completed Event, Ongoing Event, Modalized Event, Generic Habitual, and Generic State, according to the eventuality type of the event. Our experiments will show that these fine-grained classifications are very helpful for the overall temporal structure parsing accuracy.
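To make the structure concrete, the sketch below shows one way such a temporal dependency tree could be represented in code. The class, field, and relation names, and the tiny fragment (loosely based on the Figure 1 example shown later), are illustrative assumptions rather than the corpus' actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TDTNode:
    """A node in a temporal dependency tree: a time expression, an event,
    or a pre-defined meta node. Field names here are illustrative."""
    node_id: int
    text: str                       # surface span, or the meta node's name
    category: str                   # e.g. "Meta", "Relative Concrete Time", "Eventive Event"
    parent: Optional["TDTNode"] = None
    relation: Optional[str] = None  # "Includes", "Before", "Overlap", "After", "Depend-on"
    children: List["TDTNode"] = field(default_factory=list)

def attach(child: TDTNode, parent: TDTNode, relation: str) -> None:
    """Attach `child` under `parent` with a temporal relation label."""
    child.parent, child.relation = parent, relation
    parent.children.append(child)

# A tiny fragment loosely modeled on the Figure 1 example below.
dct = TDTNode(0, "DCT", "Meta")                                  # Document Creation Time
died = TDTNode(1, "has died", "Eventive Event")
last_year = TDTNode(2, "last year", "Relative Concrete Time")
declared = TDTNode(3, "was declared", "Completed Event")

attach(died, dct, "Before")              # the dying event precedes the document creation time
attach(last_year, dct, "Depend-on")      # "last year" is interpreted relative to the DCT
attach(declared, last_year, "Includes")  # "last year" temporally includes the declaring event
```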
We build a two-stage pipeline system to tackle this temporal structure parsing problem. The first stage performs event and time expression identification. In this stage, given a text as input, spans of words that indicate events or time expressions are identified and categorized. We model this stage as a sequence labeling process. A standard Bi-LSTM sequence model coupled with BIO labels is applied here. Word representations are the concatenation of word and POS tag embeddings.

The second stage performs the actual temporal structure parsing by identifying the antecedent for each time expression and event, and identifying the temporal relation between them. In this stage, given the events and time expressions identified in the first stage as input, the model outputs a temporal dependency tree in which each child node is an event or time expression that is temporally related to another event, time expression, or pre-defined meta node as its parent node. This stage is modeled as a ranking process: for each node, a finite set of neighboring nodes are first selected as its candidate parents. These candidates are then ranked with a neural network model and the highest ranking candidate is selected as its parent. We use a ranking model because it is simple, more intuitive and easier to train than a traditional transition-based or graph-based model, and the learned model rarely makes mistakes that violate the structural constraint of a tree.

[Figure 1: Example text and its temporal dependency tree. DCT is Document Creation Time. The example news paragraph (from a news report in The Telegraph) reads: "Jorn Utzon, the Danish architect who designed the Sydney Opera House, has died(e1) in Copenhagen. Born(e2) in , Mr Utzon was inspired(e3) by Scandinavian functionalism in architecture, but made a number of inspirational trips(e4), including to Mexico and Morocco. In , Mr Utzon's now-iconic shell-like design for the Opera House unexpectedly won(e5) a state government competition for the site on Bennelong Point on Sydney Harbour. However, he left(e6) the project in . His plans for the interior of the building were not completed(s1). The Sydney Opera House is(s2) one of the world's most classic modern buildings and a landmark Australian structure. It was declared(e7) a UNESCO World Heritage site last year(t4)." The tree attaches these events and time expressions to pre-defined meta nodes such as DCT and Present_Ref.]

Since the model we use for Stage 1 is a very standard model with few modifications, we do not describe it in detail in this paper due to space limitations. Our neural ranking model for Stage 2 is described in detail in the next section.
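The paper describes Stage 1 only at a high level, so the following PyTorch sketch illustrates the kind of standard Bi-LSTM BIO tagger outlined above (concatenated word and POS embeddings, a Bi-LSTM, and per-token label scores). The dimensions, vocabulary sizes, and label inventory are placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Sketch of the Stage 1 sequence labeler: word + POS embeddings are
    concatenated, fed through a Bi-LSTM, and projected to BIO label scores."""
    def __init__(self, n_words, n_pos, n_labels, word_dim=256, pos_dim=32, lstm_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        self.bilstm = nn.LSTM(word_dim + pos_dim, lstm_dim,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * lstm_dim, n_labels)

    def forward(self, word_ids, pos_ids):
        # word_ids, pos_ids: (batch, seq_len) index tensors
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        h, _ = self.bilstm(x)        # (batch, seq_len, 2 * lstm_dim)
        return self.out(h)           # per-token scores over BIO labels

# Toy usage; the real label set covers BIO tags for all time/event types.
labels = ["O", "B-TIME", "I-TIME", "B-EVENT", "I-EVENT"]
model = BiLSTMTagger(n_words=10000, n_pos=40, n_labels=len(labels))
words = torch.randint(0, 10000, (1, 12))
pos = torch.randint(0, 40, (1, 12))
scores = model(words, pos)                        # (1, 12, 5)
gold = torch.randint(0, len(labels), (1, 12))
loss = nn.CrossEntropyLoss()(scores.view(-1, len(labels)), gold.view(-1))
```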
We use a neural ranking model for the parsing stage. For each time expression or event node i in a text, a group of candidate parent nodes (time expressions, events, or pre-defined meta nodes) is selected. In practice, we select a window from the beginning of the text to two sentences after node i, and select all nodes in this window and all pre-defined meta nodes as the candidate parents if node i is an event. Since the parent of a time expression can only be a pre-defined meta node or another time expression, as described in Zhang and Xue (2018), we select all time expressions in the same window and all pre-defined meta nodes as the candidate parents if node i is a time expression. Let y'_i be a candidate parent of node i; a score is then computed for each pair (i, y'_i). Through ranking, the candidate with the highest score is then selected as the final parent for node i.

The model architecture is shown in Figure 2. Word embeddings are used as word representations (e.g. w_k). A Bi-LSTM sequence layer is built over the entire text, computing a Bi-LSTM output vector for each word (e.g. w*_k). The node representation for each time expression or event is the summation of the Bi-LSTM output vectors of all words in its text span (e.g. x_i). The pair representation for node i and one of its candidates y'_i is the concatenation of the representations of these two nodes, g_{i,y'_i} = [x_i, x_{y'_i}], which is then sent through a Multi-Layer Perceptron to compute a score for this pair, s_{i,y'_i}. Finally, all pair scores of the current node i are concatenated into a vector c_i, and taking a softmax over it generates the final distribution o_i, which is the probability distribution over the candidates being the parent of node i.

[Figure 2: Neural Ranking Model Architecture. x_i is the current child node, and x_a, x_b, x_c, x_d are the candidate parent nodes for x_i. Arrows from the Bi-LSTM layer to x_a, x_b, x_c, x_d are not shown.]

Formally, the forward computation is:

w*_k = BiLSTM(w_k)
x_i = sum(w*_{k-1}, w*_k, w*_{k+1})   (shown here for a span covering words k-1 to k+1)
g_{i,y'_i} = [x_i, x_{y'_i}]
h_{i,y'_i} = tanh(W_1 · g_{i,y'_i} + b_1)
s_{i,y'_i} = W_2 · h_{i,y'_i} + b_2
c_i = [s_{i,1}, ..., s_{i,i-1}, s_{i,i+1}, ..., s_{i,i+t}]
o_i = softmax(c_i)

Let D be the training data set of K texts, N_k the number of nodes in text D_k, and y_i the gold parent for node i. Our neural model is trained to maximize P(y_1, ..., y_{N_k} | D_k) over the whole training set. More specifically, the cost function is defined as follows:

C = -log ∏_{k=1}^{K} P(y_1, ..., y_{N_k} | D_k)
  = -log ∏_{k=1}^{K} ∏_{i=1}^{N_k} P(y_i | D_k)
  = ∑_{k=1}^{K} ∑_{i=1}^{N_k} -log P(y_i | D_k)

For each training example, the cross-entropy loss is minimized:

L = -log P(y_i | D_k) = -log ( exp[s_{i,y_i}] / ∑_{y'_i} exp[s_{i,y'_i}] )

where s_{i,y'_i} is the score for the child-candidate pair (i, y'_i) as described above.

During decoding, the parser constructs the temporal dependency tree incrementally by identifying the parent node for each event or time expression in textual order. To ensure that the output parse is a valid dependency tree, two constraints are applied in the decoding process: (i) there can only be one parent for each node, and (ii) descendants of a node cannot be its parent, to avoid cycles. Candidates violating these constraints are omitted from the ranking process. (An alternative decoding approach would be to perform a global search for a Maximum Spanning Tree. However, due to the nature of temporal structures, our greedy decoding process rarely hits the constraints.)

The neural model described above generates an unlabeled temporal dependency tree, with each parent being the most salient reference time for the child. However, it does not model the specific temporal relation (e.g. "before", "overlap") between a parent and a child.
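Before turning to the labeled extension, here is a small PyTorch sketch of the scoring and ranking step defined by the equations above: each (child, candidate-parent) pair representation is passed through an MLP to get a scalar score, the scores are softmax-normalized over the candidates, and training minimizes the cross-entropy against the gold parent. Dimensions and variable names are placeholders, and the node vectors are random stand-ins for the summed Bi-LSTM span representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairScorer(nn.Module):
    """Scores candidate parents for one child node:
    h = tanh(W1 [x_i, x_y'] + b1), s = W2 h + b2, then softmax over candidates."""
    def __init__(self, node_dim=64, hidden_dim=32):
        super().__init__()
        self.hidden = nn.Linear(2 * node_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, x_child, x_candidates):
        # x_child: (node_dim,); x_candidates: (n_candidates, node_dim)
        g = torch.cat([x_child.expand_as(x_candidates), x_candidates], dim=-1)
        h = torch.tanh(self.hidden(g))
        return self.score(h).squeeze(-1)          # c_i: one scalar per candidate

node_dim = 64
scorer = PairScorer(node_dim)
x_child = torch.randn(node_dim)                   # stand-in for a span representation x_i
x_candidates = torch.randn(5, node_dim)           # meta nodes + preceding time/event nodes

scores = scorer(x_child, x_candidates)
o = F.softmax(scores, dim=-1)                     # o_i: distribution over candidate parents
predicted_parent = int(o.argmax())

# Training: L = -log softmax(c_i)[gold parent], matching the loss above.
gold = torch.tensor([2])
loss = F.cross_entropy(scores.unsqueeze(0), gold)

# The labeled variant described next would emit |L| scores per candidate,
# ranking (child, candidate, relation) tuples jointly.
```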
We extend this basic architecture to both identify parent-child pairs and predict their temporal relations. In this new model, instead of ranking child-candidate pairs (i, y'_i), we rank child-candidate-relation tuples (i, y'_i, l_k), where l_k is the k-th relation in the pre-defined set of possible temporal relation labels L. We compute this ranking by re-defining the pair score s_{i,y'_i}. Here, the pair score s_{i,y'_i} is no longer a scalar but a vector of size |L|, where s_{i,y'_i}[k] is the scalar score for y'_i being the parent of i with temporal relation l_k. Accordingly, the lengths of c_i and o_i are the number of candidates × |L|. Finally, the tuple (i, y'_i, l_k) associated with the highest score in o_i predicts that y'_i is the parent of i with temporal relation label l_k.

A variation of the basic neural model takes a few linguistic features as input explicitly. In this model, we extend the pair representation g_{i,y'_i} with local features: g_{i,y'_i} = [x_i, x_{y'_i}, φ_{i,y'_i}].

Time and event type feature:
Stage 1 of the pipeline not only extracts text spans that are time expressions or events, but also labels them with pre-defined categories of different types of time expressions and events. Readers are referred to Zhang and Xue (2018) for the full category list. Through a careful examination of the data, we notice that time expressions and events are selective as to what types of time expressions or events can be their parents. In other words, the category of the child time expression or event is a strong indication of which candidate can be its parent. For example, a time expression's parent can only be another time expression or a pre-defined meta node, and can never be an event; and an eventive event's parent is almost certainly another eventive event, and is highly unlikely to be a stative event. Therefore, we include the time expression and event type information predicted by Stage 1 in this model as a feature. More formally, we represent a time/event type as a fixed-length embedding t, and concatenate it to the pair representation: g_{i,y'_i} = [x_i, x_{y'_i}, t_i, t_{y'_i}].

Distance features:
Distance information can be useful for predicting the parent of a child. Intuitively, candidates that are closer to the child are more likely to be the actual parent. Through data examination, we also find that a high percentage of nodes have parents in close proximity. Therefore, we include two distance features in this model: the node distance between a candidate and the child, nd_{i,y'_i}, and whether they are in the same sentence, ss_{i,y'_i}. One-hot representations are used for both features, encoding the conditions listed in Table 1.

Conditions for feature nd_{i,y'_i}:
  i.node_id - y'_i.node_id = 1
  i.node_id - y'_i.node_id > 1 and i.sent_id = y'_i.sent_id
  i.node_id - y'_i.node_id > 1 and i.sent_id ≠ y'_i.sent_id
  i.node_id - y'_i.node_id < 1

Conditions for feature ss_{i,y'_i}:
  i.sent_id = y'_i.sent_id
  i.sent_id ≠ y'_i.sent_id

Table 1: Conditions for node distance and same sentence features.
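As a small illustration, the function below turns the Table 1 conditions into the one-hot node-distance and same-sentence features. The attribute names are ours, and the comparison thresholds follow the reading of Table 1 given above.

```python
def distance_features(child, cand):
    """One-hot encodings of the Table 1 conditions for a (child, candidate) pair.
    `child` and `cand` are assumed to expose .node_id and .sent_id attributes."""
    diff = child.node_id - cand.node_id
    nd = [0, 0, 0, 0]
    if diff == 1:
        nd[0] = 1                                      # candidate immediately precedes the child
    elif diff > 1 and child.sent_id == cand.sent_id:
        nd[1] = 1                                      # further back, same sentence
    elif diff > 1 and child.sent_id != cand.sent_id:
        nd[2] = 1                                      # further back, different sentence
    else:
        nd[3] = 1                                      # candidate does not precede the child
    ss = [1, 0] if child.sent_id == cand.sent_id else [0, 1]   # same-sentence indicator
    return nd, ss
```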
The final pair representation for our linguistically enriched model is as follows: g_{i,y'_i} = [x_i, x_{y'_i}, t_i, t_{y'_i}, nd_{i,y'_i}, ss_{i,y'_i}].

In the basic neural model, a straightforward sum-pooling is used as the multi-word time expression and event representation. However, multi-word event expressions usually have meaning-bearing head words. For example, in the event "took a trip", "trip" is more representative than "took" and "a". Therefore, we add an attention mechanism (Bahdanau et al., 2014) over the Bi-LSTM output vectors in each multi-word expression to learn a task-specific notion of headedness (Lee et al., 2017):

α_t = tanh(W_α · w*_t)
w_{i,t} = exp[α_t] / ∑_{k=START(i)}^{END(i)} exp[α_k]
x̂_i = ∑_{t=START(i)}^{END(i)} w_{i,t} · w*_t

where x̂_i is a weighted sum of the Bi-LSTM output vectors in span i. The weights w_{i,t} are automatically learned. The final pair representation for our attention model is as follows: g_{i,y'_i} = [x_i, x_{y'_i}, t_i, t_{y'_i}, nd_{i,y'_i}, ss_{i,y'_i}, x̂_i, x̂_{y'_i}].

This model variation is also beneficial in an end-to-end system, where time expression and event spans are automatically extracted in Stage 1. When extracted spans are not guaranteed to be correct time expressions and events, an attention layer over a slightly larger context around an extracted span has a better chance of finding representative head words than a sum-pooling layer restricted to the words within an event or time expression span.
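A minimal PyTorch sketch of this attention pooling, following the equations above (one learned scoring vector over the Bi-LSTM outputs of a span, a softmax within the span, and a weighted sum); the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SpanAttention(nn.Module):
    """Attention pooling over a span's Bi-LSTM outputs:
    alpha_t = tanh(W_a w*_t), weights = softmax over the span, output = weighted sum."""
    def __init__(self, lstm_dim=64):
        super().__init__()
        self.attn = nn.Linear(lstm_dim, 1)

    def forward(self, span_vectors):
        # span_vectors: (span_len, lstm_dim) Bi-LSTM outputs w*_t for one span
        alpha = torch.tanh(self.attn(span_vectors)).squeeze(-1)    # (span_len,)
        weights = torch.softmax(alpha, dim=0)
        return (weights.unsqueeze(-1) * span_vectors).sum(dim=0)   # x_hat_i

# Example: a three-word span such as "took a trip".
pool = SpanAttention(lstm_dim=64)
span = torch.randn(3, 64)
x_hat = pool(span)          # (64,) attention-pooled span representation
```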
All of our experiments are conducted on the datasets described in Zhang and Xue (2018). This is a temporal dependency structure corpus in Chinese. It covers two domains: news reports and narrative fairy tales. It consists of 115 news articles sampled from the Chinese TempEval2 datasets (Verhagen et al., 2010) and Chinese Wikipedia News (https://zh.wikinews.org), and 120 fairy tale stories sampled from Grimm Fairy Tales. 20% of this corpus, distributed evenly across both domains, is double annotated with high inter-annotator agreement. We use this part of the data as our development and test sets (10% of the documents for development and 10% for testing), and the remaining 80% as our training set.

We build two baseline systems to compare with our neural models. The first is a simple baseline which links every time expression or event to its immediately preceding time expression or event. According to our data, if only position information is considered, the most likely parent for a child is its immediately preceding time expression or event. This baseline uses the most common temporal relation edge label in the training datasets, i.e. "overlap" for news data and "before" for grimm data.

The second baseline is a more competitive baseline for Stage 2 in the pipeline. It takes the output of the first stage as input, and uses a similar ranking architecture but with logistic regression classifiers instead of neural classifiers. The purpose of this baseline is to compare our neural models against a traditional statistical model under otherwise similar settings. We conduct robust feature engineering on this logistic regression model to make sure it is a strong benchmark to compete against. Table 2 lists the features and feature combinations used in this model.

Time type and event type features:
  i.type and y'_i.type
  if i.type = absolute time and y'_i.type = root
  if i.type = time and y'_i.type = root
  whether i.type and y'_i.type are time, eventive, or stative
  whether i.type and y'_i.type are root, time, or event
  whether i.type and y'_i.type are root, time, eventive, or stative
  if i.type = y'_i.type = event and ŷ.type = state, for all ŷ between i and y'_i

Distance features:
  if i.sent_id = y'_i.sent_id
  i.node_id - y'_i.node_id
  if i.node_id - y'_i.node_id = 1

Combination features:
  if i.type = state and i.sent_id ≠ y'_i.sent_id
  if i.type = state and i.node_id - y'_i.node_id = 1
  if i.type = y'_i.type = event and i.node_id - y'_i.node_id = 1
  if i.type = state and y'_i.type = event and i.node_id - y'_i.node_id = 1 and i.node_id_in_sent = 1 and i.sent_id ≠ 1

Other features:
  if i and y'_i are in quotation marks

Table 2: Features in the logistic regression system.
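To make the baseline concrete, the sketch below assembles a few of the Table 2 features for one (child, candidate-parent) pair as a feature dictionary that a logistic regression classifier could consume. The attribute and feature names are our own, and only a subset of the listed features is shown.

```python
def baseline_features(child, cand, nodes_between):
    """Illustrative extraction of a subset of the Table 2 features for one
    (child, candidate-parent) pair. `nodes_between` holds the nodes between them."""
    f = {}
    # time/event type features
    f["child_type=" + child.type] = 1.0
    f["cand_type=" + cand.type] = 1.0
    f["abs_time_child_and_root_cand"] = float(
        child.type == "absolute time" and cand.type == "root")
    f["both_events_all_states_between"] = float(
        child.type == cand.type == "event"
        and all(n.type == "state" for n in nodes_between))
    # distance features
    f["same_sentence"] = float(child.sent_id == cand.sent_id)
    f["node_distance"] = float(child.node_id - cand.node_id)
    f["adjacent"] = float(child.node_id - cand.node_id == 1)
    # a combination feature
    f["state_child_cross_sentence"] = float(
        child.type == "state" and child.sent_id != cand.sent_id)
    return f
```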
We perform two types of evaluations for our systems: we evaluate each stage of the pipeline as well as the entire pipeline (i.e. end-to-end systems where both time expression and event recognition and the temporal dependency structures are automatically predicted), and we also evaluate temporal dependency parsing alone, with gold time expressions and events as input. Our models are compared against the two strong baselines described above. Our implementation is available at https://github.com/yuchenz/tdp_ranking.

For Stage 1, all models are trained with the Adam optimizer with early stopping and a learning rate of 0.001. The dimensions of word embeddings, POS tag embeddings, Bi-LSTM output vectors, and MLP hidden layers are tuned on the dev set to 256, 32, 256, and 256 respectively. POS tags in Stage 1 are acquired using the joint POS tagger from Wang and Xue (2014). The tagger is trained on Chinese Treebank 7.0 (Xue et al., 2010). For Stage 2, the dimensions of word embeddings, time/event type embeddings, Bi-LSTM output vectors, and MLP hidden layers are tuned on the dev set to 32, 16, 32, and 32 respectively. The optimizer is Adam with early stopping and a learning rate of 0.001.

                   news                 grimm
evaluated label    p    r    f          p    r    f
all spans          .81  .74  .78        .83  .74  .78
time               .83  .81  .82        .97  .62  .76
event              .81  .73  .77        .83  .74  .78

Table 3: Stage 1 cross-validation on span detection and binary time/event recognition.

time/event type      news   grimm
vague time           .77    .82
concrete absolute    .67    -
concrete relative    .75    -
event                .61    .77
state                .65    .61
completed            .62    .26
modalized            .46    .31

Table 4: Stage 1 (time/event type recognition) cross-validation f1-scores on the full label set.
Stage 1: Time Expression and Event Recognition

For Stage 1 in the pipeline, we perform BIO tagging with the full set of time expression and event types (i.e. an 11-way classification on all extracted spans). Extracted spans will be nodes in the final dependency tree, and the time/event types serve as features in the next stage. We evaluate Stage 1 performance using 10-fold cross-validation on the entire data set. We use the "exact match" evaluation metrics for BIO sequence labeling tasks, and compute precision, recall, and f-score for each label type.

We first ignore fine-grained time/event types and only evaluate unlabeled span detection and time/event binary classification, to show how well our system identifies events and time expressions, and how well it distinguishes time expressions from events. Table 3 shows the cross-validation results on these two evaluations. Span detection and event recognition show similar performance on both the news and narrative domains. Time expressions have a higher recognition rate than events in news data, which is consistent with the observation that time expressions usually have a more limited vocabulary and stricter lexical patterns. On the other hand, due to the scarcity of time expressions in the Grimm data, time expression recognition in this domain has very high precision but low recall, which results in a much lower f-score than in news.

Labeled full-set evaluation results on time/event type classification are reported in Table 4. Time expressions have higher recognition rates than events in both domains, and dominant event types ("event", "state", etc.) have higher and more stable recognition rates than other types. Event types with very few training instances achieve close to 0 recognition f-scores and are not reported in this table.
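For reference, the "exact match" span metric described above can be computed as in the following sketch, where a prediction only counts as correct if both its boundaries and its label match a gold span exactly; the span format is an assumption of this sketch.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision/recall/F1 over labeled spans,
    each span being a (start, end, label) triple."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# One gold time expression and one gold event; the predicted event boundary is wrong.
gold = [(3, 4, "TIME"), (7, 7, "EVENT")]
pred = [(3, 4, "TIME"), (6, 7, "EVENT")]
print(span_prf(gold, pred))   # (0.5, 0.5, 0.5)
```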
                                 news                          grimm
                       unlabeled f     labeled f      unlabeled f     labeled f
                       dev   test      dev   test     dev   test      dev   test
temporal relation parsing with gold spans:
  Baseline-simple      .64   .68       .47   .43      .78   .79       .39   .39
  Baseline-logistic    .81   .79       .63   .54      .74   .74       .60   .63
  Neural-basic         .78   .75       .67   .57      .72   .74       .60   .63
  Neural-enriched      .80   .78       .67   .59      .76   .77       .63   .65
  Neural-attention     .83   .81       .76   .70      .79   .79       .66   .68
end-to-end systems with automatic spans:
  Baseline-simple      .39   .40       .26   .25      .44   .47       .27   .25
  Baseline-logistic    .36   .34       .24   .22      .43   .49       .33   .37
  Neural-basic         .37   .36       .21   .23      .42   .45       .33   .35
  Neural-enriched      .51   .52       .32   .35      .44   .49       .33   .37
  Neural-attention     .54   .54       .36   .39      .44   .49       .35   .39

Table 5: Stage 2 results (f-scores) with gold spans and timex/event labels (top), and automatic spans and timex/event labels generated by Stage 1 (bottom).

Stage 2: Temporal Dependency Parsing
For Stage 2 in the pipeline, we conduct experiments on the five systems described above: a simple baseline, a logistic regression baseline, a basic neural model, a linguistically enriched neural model, and an attention neural model. All models are trained on automatically predicted spans of time expressions and events, and time/event types generated by Stage 1 using 10-fold cross-validation, with gold standard edges (and edge labels) mapped onto the automatic spans. Evaluations in Stage 2 are against gold standard spans and edges, and the evaluation metrics are precision, recall, and f-score on <child, parent> tuples for unlabeled trees, and <child, relation, parent> triples for labeled trees.

The bottom rows in Table 5 report the end-to-end performance of our five systems on both domains. On both labeled and unlabeled parsing, our basic neural model with only lexical input performs comparably to the logistic regression model. Our enriched neural model with only three simple linguistic features outperforms both the logistic regression model and the basic neural model on news, improving the performance by more than 10%. However, our models only slightly improve unlabeled parsing over the simple baseline on the narrative Grimm data. This is probably because (1) linking every node to its immediately preceding node is a very strong baseline, since linear temporal sequences are very common in narrative discourse; and (2) most events breaking the temporal linearity in a narrative discourse are implicit stative descriptions, which are harder to model with only lexical and distance features. Finally, the attention mechanism improves temporal relation labeling in both domains.
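The tuple-based metrics used here can be computed as in the sketch below, which scores <child, parent> pairs for unlabeled trees and <child, relation, parent> triples for labeled trees; the edge format is an assumption of this sketch.

```python
def attachment_prf(gold_edges, pred_edges, labeled=False):
    """F-score over <child, parent> tuples (unlabeled) or
    <child, relation, parent> triples (labeled).
    Each edge is a (child_id, relation, parent_id) triple."""
    keep = (lambda e: e) if labeled else (lambda e: (e[0], e[2]))
    gold = {keep(e) for e in gold_edges}
    pred = {keep(e) for e in pred_edges}
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: one edge has the right parent but the wrong relation label.
gold = [(1, "before", 0), (2, "overlap", 1)]
pred = [(1, "before", 0), (2, "before", 1)]
print(attachment_prf(gold, pred))                 # unlabeled: (1.0, 1.0, 1.0)
print(attachment_prf(gold, pred, labeled=True))   # labeled:   (0.5, 0.5, 0.5)
```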
To facilitate comparison with previous work where gold events are used as parser input, we report our results on temporal dependency parsing with gold time expression and event spans in Table 5 (top rows). These results are in the same ballpark as what is reported in previous work on temporal relation extraction. The best performance in Kolomiyets et al. (2012) is 0.84 and 0.65 f-score for unlabeled and labeled parses, achieved by temporal structure parsers trained and evaluated on narrative children's stories. Our best performing model (Neural-attention) reports 0.81 and 0.70 f-scores on unlabeled and labeled parses respectively, showing similar performance. It is important to note, however, that these two works use different data sets, and are not directly comparable. Finally, parsing accuracy with gold time/event spans as input is substantially higher than that with predicted spans, showing the effects of error propagation.
We perform error analysis on the output of our best model (Neural-attention) on the development data sets. We focus on analyzing our neural ranking model (i.e. Stage 2), with gold time expression and event spans and labels as input.

First, we look at errors by the position of the antecedent. Most events in both the news and grimm data depend on their immediately preceding event or time expression as their reference time parent: 71% of the events in the news data and 78% of the events in the Grimm data have the immediately preceding node as their antecedent. The confusion matrix in Table 6 illustrates how strongly this bias affects our models. Our model learns the bias and incorrectly links around half of the events (47% in news and 46% in grimm) to their immediately preceding node when the correct temporal dependency is further back in the text.
            news                  grimm
        pre   far   total     pre   far   total
pre     317   11    328       750   60    810
far     65    72    137       104   122   226
total   382   83    465       854   182   1036

Table 6: Parent node confusion matrix. Rows are gold parents and columns are automatically parsed parents. "pre" means the parent is the immediately preceding node of the child event; "far" means the parent is further back from the child event.
Second, we look at errors in temporal relation labels. Considering only correctly recognized parent-child pairs, we draw a confusion matrix as in Table 7. Our data has very few after relations in both domains, which explains why the model has difficulty identifying this relation. There are also very few include and depend-on relations in the Grimm data; however, they are identified with relatively high accuracy. This is probably because, according to the temporal dependency structure design (Zhang and Xue, 2018), these relations hold only between restricted pairs of parent and child: include requires a time expression parent and an event child, and depend-on requires that the parent be the root. The main confusion among temporal relations is between before and overlap. In the news data, with a high occurrence of overlap relations (60% overlap and 5% before), most before parents are wrongly recognized as overlap. The Grimm data has a more balanced distribution of these two temporal relations (46% overlap and 50% before); however, 13% of before and 17% of overlap relations are wrongly labeled as the other.

grimm        be    af    ov    in    de    total
before       367   0     55    0     0     422
after
overlap      74    1     314   0     0     389
include
depend-on
total        445   1     371   10    12    839

Table 7: Temporal relation confusion matrix. Rows are gold relation labels and columns are automatic relation labels. "be, af, ov, in, de" stand for "before, after, overlap, include, and depend-on".
There is a significant amount of research on temporal relation extraction (Bethard et al., 2007; Bethard, 2013; Chambers and Jurafsky, 2008; Chambers et al., 2014; Ning et al., 2018a). Most of the previous work models temporal relation extraction as pair-wise classification between individual pairs of events and/or time expressions. Some of the models also add a global reasoning step to the local pair-wise classification, typically using Integer Linear Programming, to exploit the transitivity property of temporal relations (Chambers and Jurafsky, 2008). Such a pair-wise classification approach is often dictated by the way the data is annotated. In most of the widely used temporal data sets, temporal relations between individual pairs of events and/or time expressions are annotated independently of one another (Pustejovsky et al., 2003; Chambers et al., 2014; Styler IV et al., 2014; O'Gorman et al., 2016; Mostafazadeh et al., 2016).

Our work is most closely related to that of Kolomiyets et al. (2012), which also treats temporal relation modeling as temporal dependency structure parsing. However, their dependency structure, as described in Bethard et al. (2012), is only over events, excluding time expressions, which are an important source of temporal information; it also excludes states (stative events), which makes the temporal dependency structure incomplete. Moreover, their corpus only consists of data in the narrative stories domain. We instead choose to develop our model based on the data set described in Zhang and Xue (2018), which introduces a more comprehensive and linguistically grounded annotation scheme for temporal dependency structures. This structure includes both events and time expressions, and uses the linguistic notion of temporal anaphora to guide the annotation of the temporal dependency structure. Since in this temporal dependency structure each parent-child pair is considered to be an instance of temporal anaphora, the parent is also called the antecedent and the child is also referred to as the anaphor. Their corpus consists of data from two domains: news reports and narrative stories.

More recently, Ning et al. (2018b) proposed a semi-structured approach to model temporal relations in a text. Based on the observation that not all pairs of events have well-defined temporal relations, they propose a multi-axis representation in which well-defined temporal relations only hold between events on the same axis. The temporal relations between events in a text then form multiple disconnected subgraphs. Like other work before them, their annotation scheme only covers events, to the exclusion of time expressions.
Most prior work on neural dependency parsing is aimed at syntactic dependency parsing, i.e. parsing a sentence into a dependency tree that represents the syntactic relations among the words. Recent work on dependency parsing typically uses transition-based or graph-based architectures combined with contextual vector representations learned with recurrent neural networks (e.g. Bi-LSTMs) (Kiperwasser and Goldberg, 2016).

Temporal dependency parsing is, however, different from syntactic dependency parsing. In temporal dependency parsing, for each event or time expression, there is more than one other event or time expression that can serve as its reference time, while the most closely related one is selected as the gold standard reference time parent. This naturally calls for a ranking process in which all possible reference times are ranked and the best is selected. In this sense our neural ranking model for temporal dependency parsing is closely related to the neural ranking model for coreference resolution described in Lee et al. (2017); both extract related spans of words (entity mentions for coreference resolution, and events or time expressions for temporal dependency parsing). However, our temporal dependency parsing model differs from Lee et al.'s coreference model in that the ranking model for coreference only needs to output the best candidate for each individual pairing and cluster all pairs that are coreferent to each other. In contrast, our ranking model for temporal dependency parsing needs to rank not only the candidate antecedents but also the temporal relations between the antecedent and the anaphor. In addition, the model also adds connectivity and acyclicity constraints in the decoding process to guarantee a tree-structured output.
In this paper, we present the first end-to-end neural temporal dependency parser. We evaluate the parser with both gold standard and automatically recognized time expressions and events. In both experimental settings, the parser outperforms two strong baselines and shows competitive results against prior temporal systems.

Our experimental results show that the model performance drops significantly when automatically predicted events and time expressions are used as input instead of gold standard ones, indicating an error propagation problem. Therefore, in future work we plan to develop joint models that simultaneously extract events and time expressions and parse their temporal dependency structure.

References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Steven Bethard. 2013. ClearTK-TimeML: A minimalist approach to TempEval 2013. In Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 2, pages 10-14.

Steven Bethard, Oleksandr Kolomiyets, and Marie-Francine Moens. 2012. Annotating story timelines as temporal dependency structures. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pages 2721-2726. ELRA.

Steven Bethard, James H. Martin, and Sara Klingenstein. 2007. Timelines from text: Identification of syntactic temporal relations. In Semantic Computing, 2007. ICSC 2007. International Conference on, pages 11-18. IEEE.

Steven Bethard, Guergana Savova, Martha Palmer, and James Pustejovsky. 2017. SemEval-2017 Task 12: Clinical TempEval. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 565-572, Vancouver, Canada. Association for Computational Linguistics.

Nathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014. Dense event ordering with a multi-pass architecture. Transactions of the Association for Computational Linguistics, 2:273-284.

Nathanael Chambers and Daniel Jurafsky. 2008. Jointly combining implicit constraints improves temporal ordering. In EMNLP-2008.

Nathanael Chambers, Shan Wang, and Daniel Jurafsky. 2007. Classifying temporal relations between events. In ACL-2007.

Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 746-751.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. arXiv preprint arXiv:1603.04351.

Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2012. Extracting narrative timelines as temporal dependency structures. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 88-97. Association for Computational Linguistics.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. arXiv preprint arXiv:1707.07045.

Nasrin Mostafazadeh, Alyson Grealish, Nathanael Chambers, James Allen, and Lucy Vanderwende. 2016. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In Proceedings of the 4th Workshop on EVENTS: Definition, Detection, Coreference, and Representation, San Diego, California. Association for Computational Linguistics.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Qiang Ning, Hao Wu, Haoruo Peng, and Dan Roth. 2018a. Improving temporal relation extraction with a globally acquired statistical resource. arXiv preprint arXiv:1804.06020.

Qiang Ning, Hao Wu, and Dan Roth. 2018b. A multi-axis annotation scheme for event temporal relations. arXiv preprint arXiv:1804.07828.

Tim O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer event description: Integrating event coreference with temporal, causal and bridging annotation. Computing News Storylines, page 47.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, et al. 2003. The TimeBank corpus. In Corpus Linguistics, volume 2003, page 40. Lancaster, UK.

William F. Styler IV, Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet C. de Groen, Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, et al. 2014. Temporal annotation in the clinical domain. Transactions of the Association for Computational Linguistics, 2:143-154.

Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57-62. Association for Computational Linguistics.

Zhiguo Wang and Nianwen Xue. 2014. Joint POS tagging and transition-based constituent parsing in Chinese with non-local features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 733-742.

Nianwen Xue, Zixin Jiang, Xiuhong Zhong, Martha Palmer, Fei Xia, Fu-Dong Chiou, and Meiyu Chang. 2010. Chinese Treebank 7.0.
Linguistic Data Consortium, Philadelphia.

Yuchen Zhang and Nianwen Xue. 2018. Structured interpretation of temporal relations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).