Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction
Rujun Han, Yichao Zhou, Nanyun Peng
Information Sciences Institute, University of Southern California
Department of Computer Science, University of Southern California
Department of Computer Science, University of California, Los Angeles
[email protected]; [email protected]; [email protected]
Abstract
Extracting event temporal relations is a critical task for information extraction and plays an important role in natural language understanding. Prior systems leverage deep learning and pre-trained language models to improve the performance of the task. However, these systems often suffer from two shortcomings: 1) when performing maximum a posteriori (MAP) inference based on neural models, previous systems only used structured knowledge that is assumed to be absolutely correct, i.e., hard constraints; 2) biased predictions on dominant temporal relations when training with a limited amount of data. To address these issues, we propose a framework that enhances deep neural networks with distributional constraints constructed by probabilistic domain knowledge. We solve the constrained inference problem via Lagrangian Relaxation and apply it to end-to-end event temporal relation extraction tasks. Experimental results show our framework is able to improve the baseline neural network models with strong statistical significance on two widely used datasets in the news and clinical domains.
Extracting event temporal relations from raw text data has attracted surging attention in the NLP research community in recent years, as it is a fundamental task for commonsense reasoning and natural language understanding. It facilitates various downstream applications, such as forecasting social events and tracking patients' medical history. Figure 1 shows an example of this task, where an event extractor first needs to identify events (buildup, say and stop) in the input, and then a relation classifier predicts all pairwise relations among them, resulting in a temporal ordering as illustrated in the figure. For example, say is BEFORE stop; buildup INCLUDES say; the temporal ordering between buildup and stop cannot be decided from the context, so the relation should be VAGUE.

Figure 1: An example of the event temporal ordering task. Text input is taken from the news dataset in our experiments. Solid lines / arrows between two highlighted events show their gold temporal relations, e.g. say BEFORE stop and buildup INCLUDES say, and the dashed line shows a wrong prediction, i.e., the VAGUE relation between buildup and say. In the table, Column Overall shows the relation distribution over the entire training corpus; Column Type Pair (P) shows the predicted relation distribution conditioned on the event pairs having types occurrence and reporting (such as buildup and say); Column Type Pair (G) shows the gold relation distribution conditioned on event pairs having the same types. Biased predictions of the VAGUE relation between buildup and say can be partially corrected by using the gold event type-relation statistics in Column Type Pair (G).

Predicting event temporal relations is inherently challenging, as it requires the system to understand each event's beginning and end times. However, these time anchors are often hard to specify within a complicated context, even for humans. As a result, there is usually a large amount of
VAGUE pairs (nearly 50% in the table of Figure 1) in an expert-annotated dataset, resulting in heavily class-imbalanced datasets. Moreover, expert annotations are often time-consuming to gather, so the sizes of existing datasets are relatively small. To cope with the class-imbalance problem and the small-dataset issue, recent research efforts adopt hard constraint-enhanced deep learning methods, leverage pre-trained language models (Ning et al., 2018c; Han et al., 2019b), and are able to establish reasonable baselines for the task.

The hard constraints used in the SOTA systems can only be constructed when they are nearly 100% correct, which makes the knowledge adoption restrictive. Temporal relation transitivity is a frequently used hard constraint requiring that if A BEFORE B and B BEFORE C, it must be that A BEFORE C. However, constraints are usually not deterministic in real-world applications. For example, a clinical treatment or test is more likely to happen AFTER a medical problem, but not always. Such probabilistic constraints cannot be encoded with the hard constraints as in the previous systems.

Furthermore, deep neural models make biased predictions on dominant classes, which is particularly concerning given the small and biased datasets in event temporal extraction. For example, in Figure 1, an event pair headed and say (with relation INCLUDES) is incorrectly predicted as VAGUE (Column Type Pair (P)) by our baseline neural model, partially due to the dominant percentage of the VAGUE label (Column Overall), and partially due to the complexity of the context. Using the domain knowledge that headed and say have event types of occurrence and reporting, respectively, we can find a new label probability distribution (Type Pair (G)) for this pair. The probability mass allocated to VAGUE would decrease by 10% and increase by 7.2% for INCLUDES, which significantly increases the chance of a correct label prediction.

We propose to improve deep structured neural networks by incorporating domain knowledge, such as corpus statistics, in the model inference, and by solving the constrained inference problem using Lagrangian Relaxation. This framework allows us to benefit from the strong contextual understanding of pre-trained language models while optimizing model outputs based on probabilistic structured knowledge that previous deep models fail to consider. Experimental results demonstrate the effectiveness of this framework.

We summarize our contributions below:
• We formulate the incorporation of probabilistic knowledge as a constrained inference problem and use it to optimize the outcomes from strong neural models.
• Novel applications of Lagrangian Relaxation to the end-to-end temporal relation extraction task with event-type and relation constraints.
• Our framework significantly outperforms baseline systems without knowledge adoption and achieves new SOTA results on two datasets in the news and clinical domains.
The problem we focus on is end-to-end event temporal relation extraction, which takes raw text as input, first identifies all events, and then classifies temporal relations for all predicted event pairs. The left column of Figure 2 shows an example. An end-to-end system is practical in a real-world setting, where events are not annotated in the input, and challenging, because temporal relations are harder to predict after noise is introduced during event extraction.
In this section, we first describe the details of our deep neural networks for an end-to-end event temporal relation extraction system, then show how to formulate domain knowledge between event types and relations as distributional constraints in Integer Linear Programming (ILP), and finally apply Lagrangian Relaxation to solve the constrained inference problem. Our base model is trained end-to-end with cross-entropy loss and multitask learning to obtain relation scores. We need to perform an additional inference step in order to incorporate domain knowledge as distributional constraints.

As illustrated in the left column of Figure 2, our end-to-end model shares a similar workflow as the pipeline model in Han et al. (2019b), where multi-task learning with a shared feature extractor is used to train the pipeline model. Let $\mathcal{E}$, $\mathcal{EE}$ and $\mathcal{R}$ denote the events, candidate event pairs and feasible relations, respectively, in an input instance $x^n$, where $n$ is the instance index. The combined training loss is $\mathcal{L} = c_E \mathcal{L}_E + \mathcal{L}_R$, where $\mathcal{L}_E$ and $\mathcal{L}_R$ are the losses for the event extractor and the relation module, respectively, and $c_E$ is a hyper-parameter balancing the two losses.

Feature Encoder.
Input instances are first sent to pre-trained language models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), and then to a Bi-LSTM layer, as in previous event temporal relation extraction work (Han et al., 2019a). The encoded features are used as inputs to the event extractor and the relation module below.

Figure 2: An overview of the proposed framework. The left column shows the end-to-end event temporal relation extraction workflow. The right column (in the dashed box) illustrates how we propose to enhance the end-to-end extraction system. The final MAP inference contains two components: scores from the relation module and distributional constraints constructed using domain knowledge and corpus statistics. The text input is a real example taken from the I2B2-TEMPORAL dataset. The MAP inference is able to push the predicted probability of the event type-relation triplet closer to the ground-truth (corpus statistics).
Event Extractor.
The event extractor first predicts scores over event classes for each input token and then detects event spans based on these scores. If an event spans more than one token, its beginning and ending vectors are concatenated as the final event representation. The event score is defined as the predicted probability distribution over event classes. Pairs predicted to include non-events are automatically labeled as NONE, whereas valid candidate event pairs are fed into the relation module to obtain their relation scores.
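The span-detection step can be sketched in plain Python. This is our own minimal illustration, not the authors' implementation: the function names, the toy label scheme, and the list-based vectors are ours, and we assume per-token class decisions have already been made.

```python
# Minimal sketch (not the paper's code) of the event extractor's span step:
# per-token event-class decisions -> contiguous event spans, with each span
# represented by concatenating its beginning and ending token vectors.

def detect_spans(token_labels):
    """Group contiguous non-"O" tokens into (start, end) spans (end inclusive)."""
    spans, start = [], None
    for i, lab in enumerate(token_labels):
        if lab != "O" and start is None:
            start = i
        elif lab == "O" and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(token_labels) - 1))
    return spans

def span_representation(token_vectors, span):
    """Concatenate the beginning and ending token vectors of a span."""
    start, end = span
    return token_vectors[start] + token_vectors[end]  # list concatenation

labels = ["O", "EVENT", "EVENT", "O", "EVENT"]
vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
spans = detect_spans(labels)               # [(1, 2), (4, 4)]
rep = span_representation(vecs, spans[0])  # [0.3, 0.4, 0.5, 0.6]
```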
Relation Module.
The relation module's input is a pair of events, which share the same encoded features as the event extractor. We simply concatenate the two event representations before feeding them into the relation module to produce relation scores $S(y^r_{i,j}, x^n)$, computed using the Softmax function, where $y^r_{i,j}$ is a binary indicator of whether an event pair $(i, j) \in \mathcal{EE}$ has relation $r \in \mathcal{R}$.

As shown in Figure 2, once the relation scores are computed via the relation module, a MAP inference is performed to incorporate distributional constraints, so that the structured knowledge can be used to adjust the neural baseline model's scores and optimize the final model outputs. We formulate our MAP inference with distributional constraints as a Lagrangian Relaxation (LR) problem and solve it with an iterative algorithm. Next, we explain the details of each component in our MAP inference.
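The relation scoring described above amounts to a softmax over per-relation logits for each candidate pair. A minimal sketch with made-up logits (our own illustration, not the paper's code; the relation label set follows TimeBank-Dense):

```python
import math

# Our own toy illustration of relation scores S(y^r_{i,j}, x): a softmax over
# per-relation logits for one event pair. The logits here are invented.

RELATIONS = ["BEFORE", "AFTER", "INCLUDES", "IS_INCLUDED", "SIMULTANEOUS", "VAGUE"]

def relation_scores(logits):
    """Softmax over relation logits -> a probability per relation label."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {r: e / total for r, e in zip(RELATIONS, exps)}

scores = relation_scores([2.0, 0.5, 0.1, -1.0, -1.0, 1.5])
best = max(scores, key=scores.get)         # "BEFORE" for these toy logits
```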
Much of the domain knowledge required for real-world problems is probabilistic in nature. In the task of event relation extraction, domain knowledge can be the prior probability of a specific event pair's occurrence acquired from large corpora or a knowledge base (Ning et al., 2018b); domain knowledge can also be event-property and relation distributions obtained from corpus statistics, as we study in this work. Previous work mostly leverages hard constraints for inference (Yoshikawa et al., 2009; Ning et al., 2017; Leeuwenberg and Moens, 2017; Ning et al., 2018a; Han et al., 2019a,b), where constraints such as transitivity and event-relation consistency are assumed to be absolutely correct. As we discuss in Section 1, hard constraints are rigid and thus cannot be used to model probabilistic domain knowledge.

The right column in Figure 2 illustrates how our work leverages corpus statistics to construct distributional constraints. Let $\mathcal{P}$ be a set of event properties, such as clinical types (e.g. treatment or problem). For a pair $(P_m, P_n)$ and a triplet $(P_m, P_n, r)$, where $P_m, P_n \in \mathcal{P}$ and $r \in \mathcal{R}$, we can retrieve their counts in the training corpus as

$C(P_m, P_n, r) = \sum_{(i,j) \in \mathcal{EE}} c(P_i = P_m; P_j = P_n; r_{i,j} = r)$

and

$C(P_m, P_n) = \sum_{(i,j) \in \mathcal{EE}} c(P_i = P_m; P_j = P_n)$.

Let $t = (P_m, P_n, r)$. The prior triplet probability can thus be defined as $p^*_t = C(P_m, P_n, r) / C(P_m, P_n)$. Let $\hat{p}_t$ denote the predicted triplet probability. Distributional constraints require that

$p^*_t - \theta \le \hat{p}_t \le p^*_t + \theta$   (1)

where $\theta$ is the tolerance margin between the prior and predicted probabilities.

We formulate our MAP inference as an ILP problem. Let $\mathcal{T}$ be the set of triplets whose predicted probabilities need to satisfy Equation 1. We can define our full ILP as maximizing

$\mathcal{L}(y) = \sum_{(i,j) \in \mathcal{EE}} \sum_{r \in \mathcal{R}} y^r_{i,j} \, S(y^r_{i,j}, x)$   (2)

s.t. $p^*_t - \theta \le \hat{p}_t \le p^*_t + \theta, \ \forall t \in \mathcal{T}$; $y^r_{i,j} \in \{0, 1\}$; and $\sum_{r \in \mathcal{R}} y^r_{i,j} = 1$,

where $S(y^r_{i,j}, x), \forall r \in \mathcal{R}$ is the scoring function obtained from the relation module. For $t = (P_m, P_n, r)$, we have

$\hat{p}_t = \frac{\sum_{(i: P_m, j: P_n) \in \mathcal{EE}} y^r_{i,j}}{\sum_{(i: P_m, j: P_n) \in \mathcal{EE}} \sum_{r' \in \mathcal{R}} y^{r'}_{i,j}}$.

The output of the MAP inference, $\hat{y}$, is a collection of optimal label assignments for all relation candidates in an input instance $x^n$. The constraint $\sum_{r \in \mathcal{R}} y^r_{i,j} = 1$ ensures that each event pair gets exactly one label assignment, and this is the only hard constraint we use.

To improve computational efficiency, we apply the heuristic of optimizing only the equality constraints $p^*_t = \hat{p}_t, \forall t \in \mathcal{T}$; our optimization algorithm terminates when $|p^*_t - \hat{p}_t| \le \theta$. This heuristic has been shown to work efficiently without hurting inference performance (Meng et al., 2019). For each triplet $t$, its equality constraint can be rewritten as

$F(t) = (1 - p^*_t) \sum_{(i: P_m, j: P_n) \in \mathcal{EE}} y^r_{i,j} - p^*_t \sum_{(i: P_m, j: P_n) \in \mathcal{EE}} \sum_{r' \ne r} y^{r'}_{i,j} = 0$.   (3)

The goal is to maximize the objective function defined by Eq. (2) while satisfying the equality constraints.
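The corpus statistics behind the constraints can be illustrated with a small sketch. `prior_probs` and `residual_F` are our own hypothetical helpers (not the paper's code), computing $p^*_t$ from gold counts and the equality-constraint residual $F(t)$ of Eq. (3):

```python
from collections import Counter

# Our own illustration of the statistics behind distributional constraints:
# prior triplet probabilities p*_t from gold counts, and the residual F(t),
# which is zero exactly when an assignment's p_hat_t matches the prior.

def prior_probs(annotated_pairs):
    """annotated_pairs: (type_i, type_j, relation) triplets -> {t: p*_t}."""
    triplet_counts = Counter(annotated_pairs)
    pair_counts = Counter((pm, pn) for pm, pn, _ in annotated_pairs)
    return {t: c / pair_counts[t[:2]] for t, c in triplet_counts.items()}

def residual_F(prior, assignment, t):
    """(1 - p*_t) * #{pairs labeled r} - p*_t * #{pairs labeled r' != r}."""
    pm, pn, r = t
    same = sum(1 for (a, b, rel) in assignment if (a, b) == (pm, pn) and rel == r)
    other = sum(1 for (a, b, rel) in assignment if (a, b) == (pm, pn) and rel != r)
    return (1 - prior[t]) * same - prior[t] * other

gold = [("occurrence", "reporting", "VAGUE"),
        ("occurrence", "reporting", "VAGUE"),
        ("occurrence", "reporting", "INCLUDES"),
        ("occurrence", "occurrence", "BEFORE")]
p_star = prior_probs(gold)  # p*_(occurrence, reporting, VAGUE) = 2/3
res = residual_F(p_star, gold, ("occurrence", "reporting", "VAGUE"))  # ~0: gold matches its own prior
```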
Algorithm 1
Gradient Ascent for LR
1: for t ∈ T do
2:   λ_t ← 0
3: k ← 0
4: while k < K do                ▷ K: max iteration
5:   ŷ^{k+1} ← arg max L(λ^k)
6:   for t ∈ T do
7:     Δ_t ← p*_t − p̂_t
8:     if |Δ_t| > θ then
9:       λ^{k+1}_t ← λ^k_t + α Δ_t
10:  if |Δ_t| ≤ θ, ∀t ∈ T then break
11:  k ← k + 1
12:  α ← γα                      ▷ γ: decay rate

Solving Eq. (2) is NP-hard. Thus, we reformulate it as a Lagrangian Relaxation problem by introducing a Lagrangian multiplier $\lambda_t$ for each distributional constraint. Lagrangian Relaxation has been applied in a variety of NLP tasks, as described by Rush and Collins (2011, 2012) and Zhao et al. (2017). The Lagrangian Relaxation problem can be written as

$\mathcal{L}(y, \lambda) = \sum_{(i,j) \in \mathcal{EE}} \sum_{r \in \mathcal{R}} y^r_{i,j} \, S(y^r_{i,j}, x) + \sum_{t \in \mathcal{T}} \lambda_t F(t)$.   (4)

We initialize $\lambda_t = 0$. Eq. (4) can be solved with the following iterative algorithm (Algorithm 1):
1. At each iteration $k$, obtain the best relation assignments per MAP inference, $\hat{y}^k = \arg\max_y \mathcal{L}(y, \lambda)$.
2. Update the Lagrangian multipliers in order to bring the predicted probabilities closer to the priors. Specifically, for each $t \in \mathcal{T}$:
• If $|p^*_t - \hat{p}_t| \le \theta$, then $\lambda^{k+1}_t = \lambda^k_t$.
• Otherwise, $\lambda^{k+1}_t = \lambda^k_t + \alpha (p^*_t - \hat{p}_t)$,
where $\alpha$ is the step size. We are solving a min-max problem: the first step chooses the maximum-likelihood assignments by fixing $\lambda$; the second step searches for $\lambda$ values that minimize the objective function.

This section explains how to construct our distributional constraints and the implementation details for inference with LR.

Constraint Triplets | Count | %
occurrence, occurrence, * | 124 | 19.7
occurrence, reporting, * | 50 | 7.9
occurrence, action, * | 44 | 7.0
reporting, occurrence, * | 41 | 6.5
action, occurrence, * | 40 | 6.4
action, action, * | 20 | 3.2
reporting, reporting, * | 18 | 2.9
action, reporting, * | 18 | 2.9
reporting, action, * | 17 | 2.7

Table 1: TimeBank-Dense: triplet prediction count and percentage in the development set (sample size = 629).
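Algorithm 1 can be sketched in plain Python. This is a toy version under simplifying assumptions: a single constraint triplet, and a hypothetical brute-force `make_toy_argmax` standing in for the ILP-based MAP step the paper actually uses.

```python
# Toy sketch (our own code, not the paper's implementation) of Algorithm 1:
# alternate a MAP step under the current multipliers with a gradient-ascent
# update on the multipliers, decaying the step size each iteration.

def lagrangian_ascent(argmax_L, constraints, p_star, theta=0.01,
                      alpha=0.5, gamma=0.9, max_iter=50):
    lam = {t: 0.0 for t in constraints}
    labels = []
    for _ in range(max_iter):
        labels, p_hat = argmax_L(lam)          # step 1: MAP under current lambda
        deltas = {t: p_star[t] - p_hat[t] for t in constraints}
        if all(abs(d) <= theta for d in deltas.values()):
            break                              # all constraints within tolerance
        for t, d in deltas.items():
            if abs(d) > theta:
                lam[t] += alpha * d            # step 2: push p_hat toward p*_t
        alpha *= gamma                         # decay the step size
    return labels, lam

def make_toy_argmax(scores, t, p_star_t):
    # lambda_t * F(t) is linear in the y variables, so the pairs decouple and
    # the MAP step reduces to an independent argmax per event pair.
    _, _, r = t
    def argmax_L(lam):
        labels = []
        for s in scores:
            adj = {rel: s[rel] + lam[t] * ((1 - p_star_t) if rel == r else -p_star_t)
                   for rel in s}
            labels.append(max(adj, key=adj.get))
        return labels, {t: labels.count(r) / len(labels)}
    return argmax_L

t = ("A", "B", "R1")                           # constrain p(R1 | pair type (A, B))
scores = [{"R1": 1.0, "R2": 0.9}, {"R1": 1.0, "R2": 0.1}]
labels, lam = lagrangian_ascent(make_toy_argmax(scores, t, 0.5), [t], {t: 0.5})
# unconstrained MAP picks R1 twice; with the prior p*_t = 0.5, the less
# confident pair flips to R2
```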
The selection of distributional constraints is crucial for our algorithm. If the probability of an event-type and relation triplet is unstable across different splits of the data, we may over-correct the predicted probability. We use the following search algorithm with heuristic rules to ensure constraint stability.
For TimeBank-Dense, we first sort candidate constraints by their corresponding values of $C(P_m, P_n) = \sum_{\hat{r} \in \mathcal{R}} C(P_m, P_n, \hat{r})$. We list the $C(P_m, P_n)$ with the largest prediction counts and their percentages in the development set in Table 1.

Next, we set 3% as our threshold for including constraints in our main experimental results. We found this number to work relatively well for both TimeBank-Dense and I2B2-TEMPORAL. We will show the impact of relaxing this threshold in the discussion section. In Table 1, the constraints in the bottom block are filtered out. Moreover, Eq. 3 implies that a constraint defined on one triplet $(P_m, P_n, r)$ has an impact on all $(P_m, P_n, r')$ for $r' \in \mathcal{R} \setminus r$. In other words, decreasing $\hat{p}_{(P_m, P_n, r)}$ is equivalent to increasing $\hat{p}_{(P_m, P_n, r')}$, and vice versa. Thus, we heuristically pick $(P_m, P_n, \text{VAGUE})$ as our default constraint triplets.

Finally, we adopt a greedy search rule to select the final set of constraints. We start with the top constraint triplet in Table 1 and then keep adding the next one as long as it does not hurt the grid-search F1 score on the development set. Eventually, four constraint triplets are selected; they can be found in Table 3. (Recall that our LR algorithm in Section 3.2.3 has three hyper-parameters: initial step size α, decay rate γ, and tolerance θ. We perform a grid search on the development set and use the best hyper-parameters on the test set.)
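The greedy selection rule can be sketched as follows; `dev_f1` is a hypothetical stand-in for running grid search plus constrained inference on the development set, and the toy scorer below is entirely invented.

```python
# Minimal sketch (our own illustration) of the greedy constraint selection:
# walk the candidates sorted by pair frequency and keep a triplet only if it
# does not hurt development F1.

def greedy_select(candidates, dev_f1):
    selected = []
    best = dev_f1(selected)
    for t in candidates:
        score = dev_f1(selected + [t])
        if score >= best:              # "does not hurt" the dev F1
            selected.append(t)
            best = score
    return selected

# Toy scorer: pretend helpful triplets each add 0.5 F1 and harmful ones cost 1.0.
cands = [("occ", "occ", "VAGUE"), ("bad", "bad", "VAGUE"), ("occ", "rep", "VAGUE")]
toy_f1 = lambda cs: 60.0 + sum(0.5 if t[0] != "bad" else -1.0 for t in cs)
chosen = greedy_select(cands, toy_f1)  # keeps only the two "occ" triplets
```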
I2B2-TEMPORAL

Similar to TimeBank-Dense, we use the 3% threshold to select candidate constraints. However, it is computationally expensive to apply the greedy search rule above with grid search, as the number of constraints that pass this threshold is large (15 of them), the development set sample size is more than 3 times that of TimeBank-Dense, and a large transformer is used for modeling. Therefore, we incorporate another two heuristic rules to directly select constraints:
1. We randomly split the training data into five subsets of equal size $\{s_1, s_2, s_3, s_4, s_5\}$. For a triplet $t$ to be selected, the cross-split deviation $\sum_{k=1}^{5} |p_{t,s_k} - p^*_t|$ must fall below a fixed threshold.
2. The gap $|\hat{p}_t - p^*_t|$ must exceed a fixed margin, where $\hat{p}_t$ is the predicted probability of $t$ on the development set.
The first rule ensures that a constraint triplet is stable over random splits of the data; the second ensures that the gap between the predicted and gold probabilities is large enough that we will not over-correct. Eventually, four constraints satisfy these rules; they can be found in Table 9, and we run only one final grid search for these constraints.

The ILP component in Sec. 3.2.2 is implemented using an off-the-shelf solver provided by the Gurobi optimizer. Hyper-parameter choices can be found in Table 6 in the Appendix.
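The first selection rule can be sketched as a split-stability check; this is our own illustration, and the threshold value used here (0.5) is a hypothetical placeholder, not the paper's tuned value.

```python
import random

# Minimal sketch (our own code) of the split-stability rule: re-estimate a
# triplet's probability on five random splits of the training pairs and
# require the total deviation from the full-corpus prior p*_t to stay small.

def triplet_prob(pairs, t):
    pm, pn, r = t
    pair_n = sum(1 for (a, b, _) in pairs if (a, b) == (pm, pn))
    trip_n = sum(1 for (a, b, rel) in pairs if (a, b, rel) == t)
    return trip_n / pair_n if pair_n else 0.0

def is_stable(pairs, t, threshold=0.5, n_splits=5, seed=0):
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    splits = [shuffled[k::n_splits] for k in range(n_splits)]
    p_star = triplet_prob(pairs, t)
    deviation = sum(abs(triplet_prob(s, t) - p_star) for s in splits)
    return deviation < threshold

# Toy corpus: (problem, test) pairs split evenly between AFTER and OVERLAP,
# so the triplet probability is ~0.5 on every random split.
pairs = ([("problem", "test", "AFTER")] * 250
         + [("problem", "test", "OVERLAP")] * 250)
stable = is_stable(pairs, ("problem", "test", "AFTER"))
```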
This section describes the two event temporal relation datasets used in this paper and then explains the evaluation metrics.
Temporal relation corpora such as TimeBank (Pustejovsky et al., 2003) and RED (O’Gorman et al., 2016) consist of expert annotations of news articles. The common issue with these corpora is missing annotations. Collecting densely annotated temporal relation corpora with all events and relations fully annotated is a challenging task, as annotators can easily overlook some facts (Bethard et al., 2007; Cassidy et al., 2014; Chambers et al., 2014; Ning et al., 2017).

Table 2: Overall experiment results: per McNemar's test, the improvements against the end-to-end baseline models from adding inference with distributional constraints are statistically significant for both TimeBank-Dense and I2B2-TEMPORAL. For I2B2-TEMPORAL, our end-to-end system is optimized for the F1 score on the gold pairs.

The TimeBank-Dense dataset mitigates this issue by forcing annotators to examine all pairs of events within the same or neighboring sentences, and the dataset has been widely evaluated on this task (Chambers et al., 2014; Ning et al., 2017; Cheng and Miyao, 2017; Meng and Rumshisky, 2018). Temporal relations consist of
BEFORE, AFTER, INCLUDES, IS INCLUDED, SIMULTANEOUS, and
VAGUE. Moreover, each event has several properties, e.g., type, tense, and polarity. Event types include occurrence, action, reporting, state, etc. Event pairs that are more than 2 sentences apart are not annotated.

I2B2-TEMPORAL. In the clinical domain, one of the earliest event temporal datasets was provided in the 2012 Informatics for Integrating Biology and the Bedside (i2b2) Challenge on NLP for Clinical Records (Sun et al., 2013). Clinical events are categorized into 6 types: treatment, problem, test, clinical-dept, occurrence, and evidential. The final data used in the challenge contains three temporal relations: BEFORE, AFTER, and
OVERLAP. The 2012 i2b2 challenge also had an end-to-end track, which we use as our feature-based system baseline. To mimic the input structure of TimeBank-Dense, we only consider event pairs that are within 3 consecutive sentences. Overall, 13% of the long-distance relations are excluded. (Over 80% of these long-distance pairs are event co-reference, i.e., simply predicting them as OVERLAP will achieve high performance.)

To be consistent with previous work, we adopt two different evaluation metrics. For TimeBank-Dense, we use standard micro-average scores, which are also used in the baseline system (Han et al., 2019b). Since the end-to-end system can predict a gold pair as NONE, we follow the convention of IE tasks and exclude such pairs from the evaluation. For I2B2-TEMPORAL, we adopt the TempEval evaluation metrics used in the 2012 i2b2 challenge. These metrics differ from standard F1 in that they compute the graph closure for both gold and predicted labels. Since I2B2-TEMPORAL contains roughly six times more missing annotations than gold pairs, we only evaluate performance on the gold pairs.

Both datasets contain three types of entities: events, time expressions, and document time. In this work, we focus on event-event relations and exclude all other relations from the evaluation.
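The graph-closure idea behind the TempEval metrics can be illustrated with a minimal sketch (our own toy code, not the official i2b2 scorer), shown here only for transitive BEFORE chains:

```python
# Our own illustration of the graph-closure step: expand BEFORE relations by
# transitivity before comparing gold and predicted edge sets, so that a
# prediction implied by the gold graph is not penalized.

def before_closure(edges):
    """Transitive closure of a set of (a, b) BEFORE edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

gold = {("e1", "e2"), ("e2", "e3")}
pred_extra = ("e1", "e3")
# pred's extra edge is implied by the gold closure, so it counts as correct
implied = pred_extra in before_closure(gold)
```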
Feature-based Benchmark. We use CAEVO (Chambers et al., 2014), a hybrid system of rules and a linguistic feature-based MaxEnt classifier, as our feature-based benchmark for TimeBank-Dense. The model implementation and performance are both provided by Han et al. (2019b). As for I2B2-TEMPORAL, we retrieve the predictions from the top end-to-end system provided by Yan et al. (2013) and report the performance according to the evaluation metrics specified in Section 5.2.
Neural Model Baselines.
We use the end-to-end systems described by Han et al. (2019b) as our neural network model benchmarks (Row 2 of Table 2). For TimeBank-Dense, the best global structured model's performance is reported by Han et al. (2019b). For I2B2-TEMPORAL, we re-implement the pipeline joint model (https://github.com/PlusLabNLP/JointEventTempRel). Note that this end-to-end model only predicts whether each token is an event, as well as each token pair's relation. Event spans are not predicted, so head tokens are used to represent events; event types are also not predicted. Therefore, we do not report Span F1 and Type Accuracy for this benchmark.

End-to-end Baseline.
For the TimeBank-Dense dataset, we use the pipeline joint (local) model with no global constraints, as presented by Han et al. (2019b). In contrast to the aforementioned neural baseline provided in the same paper, this end-to-end model does not use any inference techniques. Hence, it serves as a fair baseline for our method (with inference). For TimeBank-Dense, we build our framework on top of this model.

For the I2B2-TEMPORAL dataset, to be more comparable with the 2012 i2b2 challenge, we augment the event extractor illustrated in Figure 2 by allowing event type predictions; that is, for each input token, we not only predict whether it is an event, but also predict its event type. We follow the convention in the IE field by adding a "BIO" label to each token in the data. For example, the two tokens in "physical therapy" in Figure 2 are labeled as B-treatment and I-treatment, respectively. To be consistent with the partial-match method used in the 2012 i2b2 challenge, the event span detector looks for token predictions that start with either "B-" or "I-" and ensures that all tokens predicted within the same event span have only one event type.

RoBERTa-large is used as the base model, and cross-entropy loss is used to train it. We fine-tune the base model and conduct a grid search on a random hold-out set to pick the best hyper-parameters, such as c_E in the multitask learning loss and the weight w_E^pos for positive event types (i.e., B- and I-). The best hyper-parameter choices can be found in Table 6 in the Appendix.

Table 2 contains our main results. We discuss model performance on TimeBank-Dense and I2B2-TEMPORAL in this section.
All neural models outperform the feature-based system by more than 10% in relation F1 score. Our structured model outperforms the previous SOTA systems, which use hard constraints and joint event and relation training, by 1.1%. Compared with the end-to-end baseline model with no constraints, our system achieves a 2% absolute improvement, which is statistically significant per McNemar's test. (Code and data for TimeBank-Dense are published at: https://github.com/rujunhan/EMNLP-2020) This is strong evidence that leveraging Lagrangian Relaxation to incorporate domain knowledge can be extremely beneficial even for strong neural network models.

The ablation study in Table 3 shows how distributional constraints work and the constraints' individual contributions. The predicted probability gaps shrink by 0.15, 0.24, and 0.13, respectively, for the three constraints chosen, while providing 0.91%, 0.65%, and 0.44% improvements to the final F1 score for relation extraction. We also show the breakdown of performance for each relation class in Table 4. The overall F1 improvement is mainly driven by the recall scores on the positive relation classes (BEFORE, AFTER, and
INCLUDES), which have a much smaller sample size than VAGUE. These results are consistent with the ablation study in Table 3, where the end-to-end baseline model over-predicts VAGUE, and the LR algorithm corrects this by reassigning less confident VAGUE predictions to positive and minority classes according to their relation scores.

I2B2-TEMPORAL
All neural models outperform the feature-based system by more than 30% in relation F1 score. Our structured model with distributional constraints outperforms the neural pipeline joint models of Han et al. (2019b) by 2.5% on an absolute scale. Compared with our end-to-end baseline model, our system achieves a 0.77% absolute improvement in F1, which is statistically significant per McNemar's test. This result also shows that adding inference with distributional constraints can be helpful for strong neural baseline models.

Table 9 in Appendix Section C shows how distributional constraints work and their individual contributions. Predicted probability gaps shrink by 0.17, 0.16, 0.11, and 0.14, respectively, for the four constraints chosen, providing 0.19%, 0.25%, 0.22%, and 0.12% improvements to the final F1 scores for relation extraction. We also provide the breakdown of performance for each relation class in Table 8. The performance gain is caused mostly by the increase in recall scores on BEFORE and
AFTER. This is consistent with the results in Table 9, where the model over-predicts the OVERLAP class, possibly because of label imbalance. Inference is able to partially correct this mistake by leveraging distributional constraints constructed with event-type and relation corpus statistics.

Constraint Triplets | Prob. Gap | F1
occur., occur., VAGUE | -0.15 | +0.91%
occur., reporting, VAGUE | -0.24 | +0.65%
action, occur., VAGUE | -0.13 | +0.44%
reporting, occur., VAGUE* | — | —

Table 3: TimeBank-Dense ablation study: gap shrinkage of predicted probability and F1 contribution per constraint. * is selected per Sec. 4, but its probability gap is smaller than the tolerance on the test set, hence it has no impact on the F1 score.

Table 4: Model performance breakdown (precision, recall, F1) for TimeBank-Dense, comparing the end-to-end baseline and end-to-end inference. "-" indicates no predictions were made for that particular label, probably due to the small size of the training sample. BEFORE (B), AFTER (A), INCLUDES (I), IS INCLUDED (II), SIMULTANEOUS (S), VAGUE (V).

We can use the errors made by our structured neural model on TimeBank-Dense to guide potential directions for future research. There are 26 errors made by the structured model that are correctly predicted by the baseline model. In Table 5, we show the error breakdown by constraints. Our method works by leveraging corpus statistics to correct borderline errors made by the baseline model; however, when the baseline model makes borderline correct predictions, the inference can mistakenly change them to wrong labels. This can happen when the context is complicated or when the event time interval is confusing.

For the constraint (occur., occur., VAGUE), nearly all errors are cross-sentence event pairs with long context. In ex.1, the gold relation between responded and use is VAGUE because of the negation of use, but one could also argue that if use were to happen, responded is BEFORE use. This inherent annotation confusion can cause the baseline model to predict
VAGUE marginally over BEFORE. When informed by the constraint statistics that VAGUE is over-predicted, the inference algorithm revises the baseline prediction to BEFORE. Similarly, in ex.2 and ex.3, one could make strong cases that the relations between delving and acknowledged, and between opposed and expansion, are BEFORE rather than VAGUE given the context. This annotation ambiguity can contribute to the errors made by the proposed method.

occurrence, occurrence, VAGUE (57.7%)
ex.1: In a bit of television diplomacy, Iraq's deputy foreign minister responded from Baghdad in less than one hour, saying Washington would break international law by attacking without UN approval. The United States is not authorized to use force before going to the council.

occurrence, reporting, VAGUE (26.9%)
ex.2: A new Essex County task force began delving Thursday into the slayings of 14 black women over the last five years in the Newark area, as law-enforcement officials acknowledged that they needed to work harder...

action, occurrence, VAGUE (15.4%)
ex.3: The Russian leadership has staunchly opposed the western alliance's expansion into Eastern Europe.

Table 5: Error examples and breakdown by constraints.

Our analysis shows that, besides the necessity of creating high-quality data for event temporal relation extraction, it could be useful to incorporate additional information, such as discourse relations (particularly for (occur., occur., VAGUE)) and other prior knowledge about event properties, to resolve the ambiguity in event temporal reasoning.
In Sec. 4, we use a 3% threshold when selecting candidate constraints. In this section, we show the impact of relaxing this threshold on TimeBank-Dense. Table 1 shows three constraints that miss the 3% bar by 0.1-0.3%. In Figure 3, we show F1 scores on the development and test sets when these constraints are included. Recall that only constraints that do not hurt the development F1 score are used. Therefore, Top5 and Top6 on the chart both correspond to the results in Table 2. Top7 includes (reporting, reporting, VAGUE), Top8 includes (action, reporting, VAGUE), and Top9 includes (reporting, action, VAGUE).

Figure 3: Dev vs. test set performance (F1 score) after relaxing the triplet-count threshold for selecting constraints. All numbers are percentages.

We observe that the F1 score continues to improve on the development set, but on the test set the F1 score eventually falls. This appears to support our hypothesis that when the triplet count is small, the ratio calculated from that count is unreliable, as it can vary drastically between the development and test sets. Optimizing over the development set can then over-correct on the test set, and hence results in a performance drop.

As described in Sec 5.3, to ensure a fair comparison with the previous SOTA system (Han et al., 2019b), our baseline model for TimeBank-Dense does not predict event types. That is, when counting the triplet (P_m, P_n, r̂), we assume there is an oracle model that provides the event types P_m, P_n for the predicted relation r̂. One could potentially extend our work by training a similar multi-task learning model to predict both types and relations, as our model does for the I2B2-TEMPORAL dataset. We leave this as a future research direction.
News Domain.
Early work on temporal relation extraction uses local pair-wise classification with hand-engineered features (Mani et al., 2006; Verhagen et al., 2007; Chambers et al., 2007; Verhagen and Pustejovsky, 2008). Later efforts, such as ClearTK (Bethard, 2013), UTTime (Laokulrat et al., 2013), NavyTime (Chambers, 2013), and CAEVO (Chambers et al., 2014), improve earlier work with better linguistic and syntactic rules. Yoshikawa et al. (2009); Ning et al. (2017); Leeuwenberg and Moens (2017) explore structured learning for this task, and more recently, neural methods have also been shown effective (Tourille et al., 2017; Cheng and Miyao, 2017; Meng et al., 2017; Meng and Rumshisky, 2018). Ning et al. (2018c) and Han et al. (2019b) are the most recent work leveraging neural networks and pre-trained language models to build an end-to-end system. Our work differs from these prior works in that we build a structured neural model with distributional constraints that combines the benefits of both deep learning and domain knowledge.

Clinical Domain.
The 2012 i2b2 Challenge (Sun et al., 2013) is one of the earliest efforts to advance event temporal relation extraction on clinical data. The challenge hosted three tasks on event (and event property) classification, temporal relation extraction, and the end-to-end track. Following this early effort, a series of clinical event temporal relation challenges were created in the following years (Bethard et al., 2015, 2016, 2017). However, data in these challenges are relatively hard to acquire, and therefore they are not used in this paper. As in the news domain, traditional machine learning approaches (Lee et al., 2016; Chikka, 2016; Xu et al., 2013; Tang et al., 2013; Savova et al., 2010) that tackle the end-to-end event and temporal relation extraction problem require time-consuming feature engineering such as collecting lexical and syntactic features. Some recent work (Dligach et al., 2017; Leeuwenberg and Moens, 2017; Galvan et al., 2018) applies neural network-based methods to model temporal relations, but is not capable of incorporating prior knowledge about clinical events and temporal relations as proposed by our framework.
Conclusion

In conclusion, we propose a general framework that augments deep neural networks with distributional constraints constructed using probabilistic domain knowledge. We apply it in the setting of the end-to-end temporal relation extraction task with event-type and relation constraints and show that MAP inference with distributional constraints can significantly improve the final results. We plan to apply the proposed framework to various event reasoning tasks and to construct novel distributional constraints that could leverage domain knowledge beyond corpus statistics, such as larger unlabeled data and the rich information contained in knowledge bases.
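As a rough illustration of how MAP inference under a distributional constraint can be handled via Lagrangian Relaxation, the toy sketch below penalizes one label with a Lagrange multiplier that is updated by subgradient ascent until the predicted label ratio approaches a target. The function, hyper-parameters, and synthetic scores are all our own assumptions for demonstration, not the paper's implementation, which operates on event-type/relation triplets.

```python
import numpy as np

def constrained_map_inference(scores, target_ratio, label=0,
                              lr=0.2, n_iters=200):
    """Toy Lagrangian-relaxation decoder: softly enforce that at most
    `target_ratio` of instances are assigned `label`. `scores` is an
    (n_instances, n_labels) array of model scores. The multiplier `lam`
    penalizes `label` until the distributional constraint is
    (approximately) satisfied."""
    lam = 0.0
    pred = scores.argmax(axis=1)
    for _ in range(n_iters):
        adjusted = scores.copy()
        adjusted[:, label] -= lam          # penalize the constrained label
        pred = adjusted.argmax(axis=1)     # per-instance MAP decoding
        ratio = np.mean(pred == label)
        # Subgradient ascent on the dual; lam >= 0 for a "<=" constraint.
        lam = max(0.0, lam + lr * (ratio - target_ratio))
    return pred, lam

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 3))
scores[:, 0] += 1.0                        # label 0 dominates unconstrained MAP
pred, lam = constrained_map_inference(scores, target_ratio=0.3)
print(np.mean(pred == 0))
```

The appeal of this scheme, as in the paper, is that the inner decoding step stays unchanged; the constraint only reshapes the scores through the dual variable.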
Acknowledgments
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, and the US Defense Advanced Research Projects Agency (DARPA), via Contract W911NF-15-1-0543. The views expressed are those of the authors and do not reflect the Department of Defense's official policy or position or the U.S. Government.

References
Steven Bethard. 2013. ClearTK-TimeML: A minimalist approach to TempEval 2013. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 10–14. Association for Computational Linguistics.

Steven Bethard, Leon Derczynski, Guergana Savova, James Pustejovsky, and Marc Verhagen. 2015. SemEval-2015 task 6: Clinical TempEval. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 806–814, Denver, Colorado. Association for Computational Linguistics.

Steven Bethard, James H. Martin, and Sara Klingenstein. 2007. Timelines from text: Identification of syntactic temporal relations. In Proceedings of the International Conference on Semantic Computing, ICSC '07, pages 11–18, Washington, DC, USA. IEEE Computer Society.

Steven Bethard, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen. 2016. SemEval-2016 task 12: Clinical TempEval. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1052–1062, San Diego, California. Association for Computational Linguistics.

Steven Bethard, Guergana Savova, Martha Palmer, and James Pustejovsky. 2017. SemEval-2017 task 12: Clinical TempEval. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 565–572, Vancouver, Canada. Association for Computational Linguistics.

Taylor Cassidy, Bill McDowell, Nathanael Chambers, and Steven Bethard. 2014. An annotation framework for dense event ordering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 501–506. Association for Computational Linguistics.

Nate Chambers. 2013. NavyTime: Event and time ordering from raw text. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 73–77, Atlanta, Georgia, USA. Association for Computational Linguistics.

Nathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014. Dense event ordering with a multi-pass architecture. In ACL.

Nathanael Chambers, Shan Wang, and Dan Jurafsky. 2007. Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 173–176, Stroudsburg, PA, USA. Association for Computational Linguistics.

Fei Cheng and Yusuke Miyao. 2017. Classifying temporal relations by bidirectional LSTM over dependency paths. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–6.

Veera Raghavendra Chikka. 2016. CDE-IIITH at SemEval-2016 task 12: Extraction of temporal information from clinical documents using machine learning techniques. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1237–1240, San Diego, California. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 746–751, Valencia, Spain. Association for Computational Linguistics.

Diana Galvan, Naoaki Okazaki, Koji Matsuda, and Kentaro Inui. 2018. Investigating the challenges of temporal relation extraction from clinical text. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pages 55–64, Brussels, Belgium. Association for Computational Linguistics.

Rujun Han, I-Hung Hsu, Mu Yang, Aram Galstyan, Ralph Weischedel, and Nanyun Peng. 2019a. Deep structured neural network for event temporal relation extraction. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 666–106, Hong Kong, China. Association for Computational Linguistics.

Rujun Han, Qiang Ning, and Nanyun Peng. 2019b. Joint event and temporal relation extraction with shared representations and structured prediction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 434–444, Hong Kong, China. Association for Computational Linguistics.

Natsuda Laokulrat, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. 2013. UTTime: Temporal relation classification using deep syntactic features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 88–92, Atlanta, Georgia, USA. Association for Computational Linguistics.

Hee-Jin Lee, Hua Xu, Jingqi Wang, Yaoyun Zhang, Sungrim Moon, Jun Xu, and Yonghui Wu. 2016. UTHealth at SemEval-2016 task 12: An end-to-end system for temporal information extraction from clinical notes. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1292–1297, San Diego, California. Association for Computational Linguistics.

Artuur Leeuwenberg and Marie-Francine Moens. 2017. Structured learning for temporal relation extraction from clinical records. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1150–1158.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint, arXiv:1907.11692.

Inderjeet Mani, Marc Verhagen, Ben Wellner, Chong Min Lee, and James Pustejovsky. 2006. Machine learning of temporal relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 753–760, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tao Meng, Nanyun Peng, and Kai-Wei Chang. 2019. Target language-aware constrained inference for cross-lingual dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1117–1128, Hong Kong, China. Association for Computational Linguistics.

Yuanliang Meng and Anna Rumshisky. 2018. Context-aware neural model for temporal information extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Yuanliang Meng, Anna Rumshisky, and Alexey Romanov. 2017. Temporal information extraction for question answering using syntactic dependencies in an LSTM-based architecture. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 887–896.

Qiang Ning, Zhili Feng, and Dan Roth. 2017. A structured learning approach to temporal relation extraction. In EMNLP, Copenhagen, Denmark.

Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. 2018a. Joint reasoning for temporal and causal relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2278–2288. Association for Computational Linguistics.

Qiang Ning, Hao Wu, Haoruo Peng, and Dan Roth. 2018b. Improving temporal relation extraction with a globally acquired statistical resource. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 841–851, New Orleans, Louisiana. Association for Computational Linguistics.

Qiang Ning, Ben Zhou, Zhili Feng, Haoruo Peng, and Dan Roth. 2018c. CogCompTime: A tool for understanding time in natural language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 72–77, Brussels, Belgium. Association for Computational Linguistics.

Tim O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer event description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines, pages 47–56. Association for Computational Linguistics.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, and Lisa Ferro. 2003. The TimeBank corpus. In Corpus Linguistics, pages 647–656.

Alexander M. Rush and Michael Collins. 2011. Exact decoding of syntactic translation models through Lagrangian relaxation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 72–82, Portland, Oregon, USA. Association for Computational Linguistics.

Alexander M. Rush and Michael Collins. 2012. A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing. Journal of Artificial Intelligence Research, pages 305–362.

Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C. Kipper-Schuler, and Christopher G. Chute. 2010. Mayo clinical text analysis and knowledge extraction system (cTAKES): Architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.

Weiyi Sun, Anna Rumshisky, and Ozlem Uzuner. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 challenge.

Buzhou Tang, Yonghui Wu, Min Jiang, Yukun Chen, Joshua C. Denny, and Hua Xu. 2013. A hybrid system for temporal information extraction from clinical text. Journal of the American Medical Informatics Association, 20(5):828–835.

Julien Tourille, Olivier Ferret, Aurelie Neveol, and Xavier Tannier. 2017. Neural architecture for temporal relation extraction: A Bi-LSTM approach for detecting narrative containers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 224–230.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 task 15: TempEval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 75–80, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marc Verhagen and James Pustejovsky. 2008. Temporal processing with the TARSQI toolkit. In COLING '08, pages 189–192, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yan Xu, Yining Wang, Tianren Liu, Junichi Tsujii, and Eric I-Chao Chang. 2013. An end-to-end system to identify temporal relation in discharge summaries: 2012 i2b2 challenge. Journal of the American Medical Informatics Association, 20(5):849–858.

Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. 2009. Jointly identifying temporal relations with Markov logic. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 405–413. Association for Computational Linguistics.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, Copenhagen, Denmark. Association for Computational Linguistics.
Appendix

A Hyper-parameters

B Data Summary

C I2B2-Temporal Results
We show the breakdown performance and contributions of individual constraints for I2B2-Temporal in Table 8 and Table 9, respectively.
Table 6: Hyper-parameters (c_E, w^E_pos, lr, α, θ, γ) chosen using development data; for I2B2-Temporal, c_E = 1.0 and w^E_pos = 5.0. For TimeBank-Dense, the end-to-end baseline model is provided by Han et al. (2019b), so we do not train it from scratch.
                 TimeBank-Dense   I2B2-Temporal
Documents
  Train                22              190
  Dev                   5                -
  Test                  9              120
Event pairs
  Train              4032            11253
  Dev                 629                -
  Test               1427             8794

Table 7: Data overview. Note that we exclude event pairs whose sentence distance is longer than 3 in I2B2-Temporal, and there are 6 times more missing relations than gold-annotated ones in I2B2-Temporal, which explains why the number of pairs per document is smaller in I2B2-Temporal than in TimeBank-Dense.
D Reproducibility List

• Data and code used for TimeBank-Dense can be found in the project code base. However, due to the user confidentiality agreement, we are not able to provide data and data analysis code for I2B2-Temporal. Modeling code will be added to the project code base upon obtaining permission from the data owner.
• We use BERT-base-uncased and RoBERTa-large models implemented in Huggingface transformers. Additional parameters (such as LSTM and MLP) are negligible compared to those used in the pre-trained LMs.
• ILP is solved by an off-the-shelf solver provided by the Gurobi optimizer.
• Range of grid-search. c_E: (1.0, 2.0); w^E_pos: (1.0, 2.0, 5.0, 10.0); lr: (e−, e−, e−); α: (1.0, 2.0, 5.0, 10.0); θ: (0.2, 0.3, 0.5); γ: (0.7, 0.8, 0.9).
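The grid search over the hyper-parameter ranges listed above can be sketched as follows. This is an illustrative sketch, not the actual training pipeline: `grid_search`, `dev_f1`, and the toy scoring function are our own placeholders standing in for training a model and evaluating its development-set F1.

```python
from itertools import product

def grid_search(dev_f1, grid):
    """Exhaustively evaluate every hyper-parameter combination and
    return the configuration with the best development score."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = dev_f1(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Ranges from the reproducibility list above (lr omitted here).
grid = {"c_E": (1.0, 2.0),
        "w_E_pos": (1.0, 2.0, 5.0, 10.0),
        "alpha": (1.0, 2.0, 5.0, 10.0),
        "theta": (0.2, 0.3, 0.5),
        "gamma": (0.7, 0.8, 0.9)}

# Toy scoring function standing in for dev-set F1 (peaks at c_E=1.0, theta=0.3).
def toy_dev_f1(cfg):
    return -abs(cfg["c_E"] - 1.0) - abs(cfg["theta"] - 0.3)

best_cfg, best_score = grid_search(toy_dev_f1, grid)
print(best_cfg["c_E"], best_cfg["theta"])
```

Since the grid is small (a few hundred combinations here), exhaustive search is feasible; with the real training loop each evaluation is one full train-plus-validate run.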
Table 8: Model performance breakdown for I2B2-Temporal: precision (P), recall (R), and F1 of the end-to-end baseline v.s. end-to-end inference, for BEFORE (B), AFTER (A), OVERLAP (O), and the TempEval metric.

Constraint Triplet               Prob. Gap      F1
(occur., problem, OVERLAP)         -0.17      +0.19%
(occur., treatment, OVERLAP)       -0.16      +0.24%
(treatment, occur., OVERLAP)       -0.11      +0.22%
(treatment, problem, OVERLAP)      -0.14      +0.12%
Combined F1 Improvement                        0.77%
Table 9: I2B2-Temporal ablation study: gap shrinkage of predicted probability and F1 improvement per constraint.