Analogous Process Structure Induction for Sub-event Sequence Prediction

Hongming Zhang∗, Muhao Chen, Haoyu Wang, Yangqiu Song, & Dan Roth
Department of Computer Science and Engineering, HKUST
Department of Computer and Information Science, UPenn
{hzhangal, yqsong}@cse.ust.hk, {muhao, why16gzl, danroth}@seas.upenn.edu
Abstract
Computational and cognitive studies of event understanding suggest that identifying, comprehending, and predicting events depend on having structured representations of a sequence of events and on conceptualizing (abstracting) its components into (soft) event categories. Thus, knowledge about a known process such as "buying a car" can be used in the context of a new but analogous process such as "buying a house". Nevertheless, most event understanding work in NLP is still at the ground level and does not consider abstraction. In this paper, we propose an Analogous Process Structure Induction (APSI) framework, which leverages analogies among processes and conceptualization of sub-event instances to predict the whole sub-event sequence of previously unseen open-domain processes. As our experiments and analysis indicate, APSI supports the generation of meaningful sub-event sequences for unseen processes and can help predict missing events.

Understanding events has long been a challenging task in NLP, to which many efforts have been devoted by the community. However, most existing works focus on procedural (or horizontal) event prediction tasks. Examples include predicting the next event given an observed event sequence (Radinsky et al., 2012) and identifying the effect of a biological process (i.e., a sequence of events) on involved entities (Berant et al., 2014). These tasks mostly focus on predicting related events in a procedure based on their statistical correlations in previously observed text. As a result, understanding the meaning of an event might not be crucial for these horizontal tasks. For example, simply selecting the most frequently co-occurring event can offer acceptable performance on the event prediction task (Granroth-Wilding and Clark, 2016).

∗ This work was done when the first author was visiting the University of Pennsylvania. Code is available at: http://cogcomp.org/page/publication_view/910

Figure 1: An illustration of leveraging known processes to predict the sub-event sequence of a new process.

The original paper refers to the knowledge about processes and their sub-events as event schemata.

Computational and cognitive studies (Schank and Abelson, 1977; Zacks and Tversky, 2001) suggest that inducing and utilizing the hierarchical structure of events is a crucial component of how humans understand new events, and that it can help many of the aforementioned horizontal event prediction tasks. Consider the example in Figure 1. Assume that one has never bought a house, but is familiar with how to "buy a car" and "rent a house"; referring to analogous steps in these two relevant processes would still provide guidance for the target process of "buy a house". Motivated by this hypothesis, our work proposes to directly evaluate a model's event understanding ability. We define this as the
ability to identify vertical relations, that is, to predict the sub-event sequence of a new process. We require models to generate the sub-event sequence for a previously unobserved process, given observed processes along with their sub-event sequences, which we refer to as "the observed process graphs" in the rest of this paper. This task is more challenging than "conventional" event prediction tasks, since it requires the generation of a sub-event sequence given a new, previously unobserved, process definition.

Figure 2: Demonstration of the proposed APSI framework. Given a target process P, we first decompose its semantics into two dimensions (i.e., predicate and argument) by grouping processes that share a predicate or an argument. For each such group of processes, we then leverage the observed process graphs G to generate an abstract and probabilistic representation for their sub-event sequences. In the last step, we merge them with an instantiation module to produce the sub-event sequence of P.

To address this problem, we propose an Analogous Process Structure Induction (APSI) framework. Given a new process definition (e.g., 'buy a house'), we first decompose it into two dimensions: predicate and argument. For each of these, we collect a group of processes that share the same predicate (i.e., 'buy-ARG') or the same argument (i.e., 'PRE-house'), and then induce an abstract and probabilistic sub-event representation for each group. Our underlying assumption is that processes that share the same predicate or argument could be analogous to each other, and thus could share similar sub-event structures. Finally, we merge these two abstract representations, using an instantiation module, to predict the sub-event structure of the target process. By doing so, we only need a small number of analogous processes (as we show, 20 on average) to generate unseen sub-events for the target process. Intrinsic and extrinsic evaluations show that APSI outperforms all baseline methods and can generate meaningful sub-event sequences for unseen processes, which prove helpful for predicting missing events.

A process is a more coarse-grained event by itself. We use this term to distinguish it from sub-events.

Figure 3: Examples of Sub-Event Representations.

The rest of the paper is organized as follows. Section 2 introduces the Analogous Process Structure Induction (APSI) framework. Section 3 describes our intrinsic and extrinsic evaluations, demonstrating the effectiveness of APSI and the quality of the induced process knowledge. We discuss related works in Section 4 and conclude the paper in Section 5.
Figure 2 illustrates the details of the proposed APSI framework. Given an unseen process P, a target sub-event sequence length k, and a set of observed process graphs G, the task is to predict a k-step sub-event sequence [E'_1, E'_2, ..., E'_k] for P. Each process graph G ∈ G in the input contains a process definition P_G and an n-step temporally ordered sub-event sequence [E^G_1, E^G_2, ..., E^G_n]. We assume that each process P is described as a combination of a predicate and an argument (e.g., 'buy + house') and that each sub-event E is given as a verb-centric dependency graph, as used in (Zhang et al., 2020b) (see examples in Figure 3).

In APSI, we decompose the target process into two dimensions (i.e., predicate and argument). For each target process, we collect a group of observed process graphs that share either the predicate or the argument with the target process; we assume that processes in these groups carry sufficient information for predicting the structure of the target process. We then leverage an event conceptualization module to induce an abstract representation of each process group. Finally, we merge the two abstract, probabilistic representations and instantiate them to generate a ground sub-event sequence as the final prediction. Detailed descriptions of the APSI components are introduced as follows.

Each process definition P is given as a predicate and its argument, which we term below the two "dimensions" of the process definition. We then collect all process graphs in G that have the same predicate as P into G_p and those that have the same argument into G_a. We assume that these two sets provide the information needed to generate an abstract process representation that would guide the instantiation of the event steps for P.

The goal of the semantic abstraction step is to acquire abstract representations S_p and S_a for G_p and G_a respectively, to help transfer the knowledge from the grounded observed processes to the target new process.
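The processes and process graphs defined above can be captured with minimal data structures. The following is a sketch; the field names and the word-tuple event encoding are our own simplifications (the paper uses verb-centric dependency graphs), not the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A sub-event is simplified here to a tuple of words.
Event = Tuple[str, ...]

@dataclass
class ProcessGraph:
    """An observed process P_G with its temporally ordered sub-events."""
    predicate: str                                    # e.g. 'buy'
    argument: str                                     # e.g. 'car'
    sub_events: List[Event] = field(default_factory=list)

graphs = [ProcessGraph('buy', 'car', [('search', 'car'), ('pay', 'money')]),
          ProcessGraph('buy', 'book', [('search', 'book'), ('pay', 'money')])]

# Group observed graphs that share the target's predicate or argument.
target_pred, target_arg = 'buy', 'house'
G_p = [g for g in graphs if g.predicate == target_pred]
G_a = [g for g in graphs if g.argument == target_arg]
```

On this toy input, both observed graphs share the predicate 'buy' with the target, while no graph shares the argument 'house'.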
To do so, we first need to conceptualize the observed sub-events in G_p and G_a (e.g., "eat an apple") to a more abstract level (e.g., "eat fruit"). Clearly, each event could be conceptualized to multiple abstract events. For example, "eat an apple" can be conceptualized to "eat fruit" but also to "eat food", and the challenge is to determine the appropriate level of abstraction. On one hand, the conceptualized event cannot be too general, as we do not want to lose touch with the original event; on the other hand, if it is too specific, we will not aggregate enough instances of sub-events into it, and thus we will have difficulties transferring knowledge to the new unseen process. To automatically balance these conflicting requirements and select the best abstract event for each observed sub-event, we model it as a weighted mutually exclusive set cover problem (Lu and Lu, 2014) and propose an efficient algorithm, described below, to solve it. We then merge the repeated conceptualized events and determine their relative positions.

For each event E, we first identify all potential events that it can be conceptualized to. If two sub-events E_1 and E_2 can be conceptualized to the same event C, we place E_1 and E_2 into the set E_C. To quantitatively guide the abstraction process, we introduce below a notion of semantic loss that we incur as we move up to more abstract representations. To measure the semantic loss during the conceptualization, we assign a weight to each set:

W(E_C) = 1 / Σ_{E ∈ E_C} F(E, C),   (1)

where F(E, C) is a scoring function, defined below in Eq. 2, that captures the amount of "semantic details" preserved when abstracting from E to C. With this definition, the event conceptualization problem can be formalized as finding mutually exclusive sets E_C that cover all observed events with minimum total weight.
In the rest of this section, we first introduce how to collect potential conceptualized events for each E, how we define F, and how we solve this discrete optimization problem.

Identifying Potential Conceptualizations
Assume that sub-event E contains m words w^E_1, w^E_2, ..., w^E_m, each corresponding to a node in Figure 3; for each of these, we can retrieve a list of hypernym paths from WordNet (Miller, 1998). For example, given the word "house", WordNet returns two hypernym paths: (1) "house" → "building" → "structure" → ...; (2) "house" → "firm" → "business" → .... As a result, we can find ∏_{w ∈ E} L(w) potential conceptualized events for E, where L(w) is the number of w's hypernyms. We denote the potential conceptualized event set for E as C_E and the overall set as C.

No sub-event can appear in two selected sets. We omit the synset numbers for clarity.

Algorithm 1
Event Conceptualization

INPUT: Set of events E. Each E ∈ E is associated with a set of potential conceptualized events C_E. The overall conceptualized event set is C.

1:  Initialize event partition set P := ∅.
2:  while E ≠ ∅ do
3:    for each E ∈ E do
4:      for each C ∈ C_E do
5:        E_C := ∅.
6:        Compute F(E, C) using Eq. (2).
7:      end for
8:    end for
9:    for each C ∈ C do
10:     for each E ∈ E do
11:       if C ∈ C_E then
12:         E_C := E_C ∪ {E}.
13:       end if
14:     end for
15:     Compute W(E_C) using Eq. (1).
16:   end for
17:   Select E_Cmin with the minimum W score.
18:   E := E \ E_Cmin.
19:   P := P ∪ {E_Cmin}.
20: end while

OUTPUT: Partition of n event subsets P = {E_1, E_2, ..., E_n}, where each subset E_i corresponds to a unique conceptualized event C_i.

Conceptualization Scoring
As mentioned above, for each pair of a sub-event E and its potential conceptualization C, we propose a scoring function F(E, C) to measure how much "semantic information" is preserved after the conceptualization. Motivated by Budanitsky and Hirst (2006), and based on the assumption that the more abstract the conceptualized event is, the more semantic details are lost, we define F(E, C) to be:

F(E, C) = ∏_{i=1}^{m} w^{D(w^E_i, w^C_i)},   (2)

where D(w^E_i, w^C_i) is the depth from w^E_i to w^C_i on the taxonomy path, and w is a hyper-parameter measuring how much "semantics" is preserved following each step of the conceptualization.

Conceptualization Assignment
Now we are able to model the procedure of finding proper conceptualized events as a weighted mutually exclusive set cover problem. Note that this is an NP-complete problem, and obtaining the optimal solution requires a prohibitive computational cost (Karp, 1972). To obtain an efficient solution that is empirically sufficient for assigning conceptualized events with a reasonable number of instances, we develop the greedy procedure described in Algorithm 1. For each retrieved process graph set G_p or G_a, we collect all its sub-events as E and use it as the input to the conceptualization algorithm. In each iteration, we first compute the conceptualization score F for all (E, C) pairs and then compute the weight score for all conceptualization sets E_C. After selecting the set with minimum weight, E_Cmin, we remove all the events it covers from E and repeat the process until no event is left. After the conceptualization, we merge sub-events that are conceptualized to the same event and represent them with the resulting conceptualized event C, whose weight is defined to be W(C) = W(E_C). Compared with the naive algorithm, which first expands all possible subsets (i.e., it includes all subsets of E_C for all C) and then uses sorting and filtering to select the final subsets, we reduce the time complexity from O(|C| · 2^{|E|}) to O(n · |C| · |E|), where n is the number of conceptualized events and is typically much smaller than |E|.

In practice, we use two separate hyper-parameters w_v and w_n for verbs and nouns, respectively.
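As a concrete sketch of this step, the following toy code enumerates candidate conceptualizations, scores them with F (Eq. 2), and runs the greedy minimum-weight loop of Algorithm 1. It is a simplified illustration under our own assumptions: events are word tuples, and a small hand-coded hypernym table stands in for WordNet:

```python
from itertools import product

# Toy hypernym chains standing in for WordNet paths (an assumption for
# illustration; the paper retrieves real hypernym paths from WordNet).
HYPERNYMS = {'apple': ['fruit', 'food'], 'pear': ['fruit', 'food']}

def candidates(event):
    """All potential conceptualizations of an event (word tuple): each word
    either stays as-is or moves up its hypernym chain."""
    options = [[w] + HYPERNYMS.get(w, []) for w in event]
    return {tuple(c) for c in product(*options)}

def depth(word, concept):
    """Number of hypernym steps from `word` up to `concept` (0 if unchanged)."""
    chain = [word] + HYPERNYMS.get(word, [])
    return chain.index(concept)

def F(event, concept, w=0.6):
    """Semantic-preservation score, Eq. (2): prod_i w ** D(w_i^E, w_i^C).
    Here w = 0.6 makes one-step abstraction win over no abstraction."""
    score = 1.0
    for we, wc in zip(event, concept):
        score *= w ** depth(we, wc)
    return score

def greedy_conceptualize(events):
    """Algorithm 1 (sketch): greedily pick the conceptualized event C whose
    covered set E_C has minimum weight W(E_C) = 1 / sum_E F(E, C)."""
    remaining, partition = set(events), {}
    while remaining:
        groups = {}
        for e in remaining:
            for c in candidates(e):
                groups.setdefault(c, []).append(e)
        best = min(groups, key=lambda c: 1.0 / sum(F(e, c) for e in groups[c]))
        partition[best] = groups[best]
        remaining -= set(partition[best])
    return partition

part = greedy_conceptualize([('eat', 'apple'), ('eat', 'pear')])
```

On this toy input, both sub-events collapse into the single conceptualized event ('eat', 'fruit'): grouping two events at one abstraction step (weight 1/1.2) beats keeping either event ungrouped (weight 1) or abstracting two steps to 'food' (weight 1/0.72).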
After conceptualizing and merging all sub-events, we need to determine their (loosely temporal) order, e.g., whether they typically appear at the beginning or the end of the observed sub-event sequences. Let the set of selected conceptualized events be C*. For each C ∈ C*, we define its order score T(C), indicating how likely C is to appear first, as:

T(C) = Σ_{C' ∈ C*} θ( Σ_{E_C ∈ E_C} Σ_{E_C' ∈ E_C'} t(E_C, E_C') − t(E_C', E_C) ),   (3)

where θ is the unit step function and t(E_C, E_C') represents how many times E_C appears before E_C' in an observed process graph.

In the last step, we leverage the two abstract representations obtained for the predicate and the argument of the target process definition to predict its final sub-events. To do so, we propose the following instantiation procedure. We are given the abstract representations S_p and S_a, for the predicate and the argument, respectively. Each is a set of conceptualized events associated with weights and order scores. For each conceptualized event C_p ∈ S_p, using each event C_a ∈ S_a, we can generate a new instantiated event Ĉ_p. For example, if C_p is "cut fruit" and C_a is "buy an apple", then our model would create the new event "cut an apple". Specifically, for each w ∈ C_p, if we can find a word ŵ in C_a such that ŵ is a hyponym of w, we replace w with ŵ, and we repeat this process until no hyponym can be detected in C_p. We denote the generated event by Ĉ_p. To account for the semantic loss during the instantiation procedure, we define the weight and order score of Ĉ_p as follows:

Ŵ(Ĉ_p) = W(C_p) · F(Ĉ_p, C_p) · (Σ_{C'_a ∈ S_a} W(C'_a)) / W(C_a),   (4)

T̂(Ĉ_p) = T(C_p) · F(Ĉ_p, C_p) · (Σ_{C'_a ∈ S_a} W(C'_a)) / W(C_a).   (5)

Similarly, we apply the same procedure to C_a with C_p, and denote the resulting event Ĉ_a.
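The instantiation step just described can be sketched as follows. This is our own simplification: events are word tuples, a hand-coded `hyponym_of` oracle stands in for WordNet, and the replacement runs in a single pass rather than repeating until no hyponym remains:

```python
def instantiate(c_p, c_a, hyponym_of):
    """Instantiate abstract event c_p using reference event c_a: replace each
    word of c_p with a word from c_a that is its hyponym, if one exists."""
    return tuple(next((w2 for w2 in c_a if hyponym_of(w2, w)), w) for w in c_p)

# Toy hyponym oracle (an assumption for the demo; the paper uses WordNet).
hyponym_of = lambda a, b: (a, b) in {('apple', 'fruit')}

inst = instantiate(('cut', 'fruit'), ('buy', 'apple'), hyponym_of)
# mirrors the paper's running example: "cut fruit" + "buy an apple"
```

When no word of the reference is a hyponym of any word in the abstract event, the event is returned unchanged, so the procedure never invents words that appear in neither representation.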
We then repeatedly merge instantiated events, summing up their weights and averaging their order scores. In the end, we select the top k sub-events based on the weights and sort them by order score to obtain the sub-event sequence prediction.

In this section, we conduct intrinsic and extrinsic evaluations to show that APSI can generate meaningful sub-event sequences for unseen processes, which can help predict missing events.
We collect process graphs from the WikiHow website (Koupaee and Wang, 2018). In WikiHow, each process is associated with a sequence of temporally ordered, human-created steps. For each step, as shown in Figure 3, we use the tool released by ASER (Zhang et al., 2020b) to extract events and construct the process graphs. We select all processes where each step has one and only one event, and randomly split them into train and test data. As a result, we obtain 13,501 training process graphs and 1,316 test process graphs, whose average sub-event sequence length is 3.56.

We compare with the following baseline methods:
Sequence to sequence (Seq2seq):
One intuitive solution to the sub-event sequence prediction task is to model it as a sequence-to-sequence problem, where the process is treated as the input and the sub-event sequence as the output. Here we adopt the standard GRU-based encoder-decoder framework (Sutskever et al., 2014) as the base framework and change the generation unit from words to events. For each process or sub-event, we leverage pre-trained word embeddings (i.e., GloVe-6b-300d (Pennington et al., 2014)) or language models (i.e., RoBERTa-base (Liu et al., 2019)) as the representation; these variants are denoted as Seq2seq (GloVe) and Seq2seq (RoBERTa).

We do not need a development set because the proposed solution APSI is not a learning-based method.
Top One Similar Process:
Another baseline is the "top one similar process". For each new process, we can always find the most similar observed process and use its sub-event sequence as the prediction. We employ different methods (i.e., token-level Jaccard coefficient or cosine similarity of GloVe/RoBERTa process representations) to measure process similarity, denoted as Top one similar process (Jaccard), (GloVe), and (RoBERTa), respectively.

For each process, we also present a randomly generated sequence and a human-generated sequence as the lower bound and upper bound for sub-event sequence prediction models.

We first present the intrinsic evaluation to show the quality of the predicted sub-event sequences of unseen processes. For each test process, we provide the process name and the sub-event sequence length to the evaluated systems and ask them to generate a fixed-length sub-event sequence.

Motivated by the ROUGE score (Lin, 2004), we propose an event-based ROUGE (E-ROUGE) to evaluate the quality of the predicted sub-event sequence. Specifically, similar to ROUGE, which evaluates generation quality based on N-gram token occurrence, we evaluate what percentage of the sub-events and time-ordered sub-event pairs in the induced sequence is covered by the human-provided references. We denote the evaluation over single events and event pairs as E-ROUGE1 and E-ROUGE2, respectively. We also provide two covering standards to better understand the prediction quality: (1) "String Match": all words in the predicted event/pair must be the same as in the referent event/pair; (2) "Hypernym Allowed": the predicted and referent events must have the same dependency structure, and the words in the same graph position must be hypernyms of, or the same as, each other. For example, if the referent event is "eat apple" and the predicted event is "eat fruit", we still count it as a match. The "String Match" setting is stricter, but the "Hypernym Allowed" setting also has its unique value in helping to understand whether our system is predicting relevant sub-events.

The human-generated sequence is randomly selected from WikiHow and excluded during the evaluation. We select the majority length of all references.

Table 1: Intrinsic evaluation results of the induced process structures. On average, we have 1.7 human-generated sub-event sequences as references for each test process. Best-performing models are marked in bold.

(a) Basic Setting (for each sub-event, we only predict and evaluate the verb)

Model                                          String Match            Hypernym Allowed
                                               E-ROUGE1   E-ROUGE2    E-ROUGE1   E-ROUGE2
Random                                         2.9165     0.4664      23.5873    8.1089
Seq2seq (GloVe)                                5.0323     1.4965      27.8710    13.0946
Seq2seq (RoBERTa)                              4.5455     0.4831      28.0032    12.8502
Top one similar process (Jaccard)              8.8589     5.1000      28.6548    14.6231
Top one similar process (GloVe)                9.8797     5.1452      29.4203    13.6001
Top one similar process (RoBERTa)              9.2599     4.7390      30.6599    15.8417
Analogous Process Structure Induction (APSI)
Human                                          29.0189    15.2542     50.4647    29.4423

(b) Advanced Setting (for each sub-event, we predict and evaluate all words)

Model                                          String Match            Hypernym Allowed
                                               E-ROUGE1   E-ROUGE2    E-ROUGE1   E-ROUGE2
Random                                         0.0000     0.0000      0.5104     0.0903
Seq2seq (GloVe)                                0.1935     0.0534      0.9677     0.1069
Seq2seq (RoBERTa)                              0.4870     0.0000      1.7857     0.2899
Top one similar process (Jaccard)              0.6562     0.2257      2.4797     0.5867
Top one similar process (GloVe)                0.8750     0.2106      2.8801     0.7372
Top one similar process (RoBERTa)              0.9479     0.3009      3.2811     0.9929
Analogous Process Structure Induction (APSI)
Human                                          11.6351    5.5905      18.0034    8.2695
In terms of training, we set both w_v and w_n to 0.5 for our model. For the seq2seq baselines, we set the learning rate to 0.001 and train the models until they converge on the training data. All other hyper-parameters follow the original paper. In terms of evaluation, we provide two settings: (1) Basic: we follow previous works (Glavas et al., 2014) to predict and evaluate events based on verbs; (2) Advanced: we predict and evaluate events based on all words.

We show the results in Table 1. In general, there is still a notable gap between current models' performance and human performance, but the proposed APSI framework can indeed generate sufficiently relevant sub-events. Consider the verb-only setting: even under string match, 14.8% of the predicted events and 6.6% of the ordered event pairs are covered by the references, which is much better than random guessing and nearly half of human performance. If hypernyms are allowed, 36% of the predicted events and 19% of the event pairs are covered. If we take all words in the event into consideration, the task becomes more challenging. Specifically, even humans only achieve 11.63 E-ROUGE1 and 5.59 E-ROUGE2, which suggests that the low scores achieved by current models are probably due to limitations of the current dataset (e.g., on average, we only have 1.7 references for each test process). If more references were provided, the performance of all models would also increase. In the rest of the intrinsic evaluation, we present a more detailed analysis based on the advanced setting (string match) and a case study to help better understand the performance of APSI.
One key step in our framework is how to leverage the two abstract representations to predict the final sub-event sequence. In APSI, we propose an instantiation module, which jointly leverages the two representations to generate detailed events. To show its effect, we compare it with two other options: (1) Simple Merge: merge the two representations and select the top k sub-events based on the weight; (2) Normalized: first normalize the weights of all sub-events within each representation and then select the top k sub-events.

Figure 4: Hyper-parameter influence on the quality of APSI-generated sub-event sequences. For both w_v and w_n, 0 indicates no conceptualization, and the larger the value, the deeper the conceptualization. Best-performing ranges are marked with red boxes, indicating that a suitable conceptualization level is the key to APSI's success.

Table 2: Performance of different merging methods.

Model                   E-ROUGE1   E-ROUGE2
Simple Merge            2.5884     0.4062
Normalized              2.2238     0.3611
APSI (Instantiation)

From the results in Table 2, we can see that, due to the imbalanced distribution of the two representations, simply choosing the most heavily weighted sub-events is problematic. On average, we can collect 18.04 processes for each predicate, but only 1.92 processes for each argument. As a result, the sub-events in the predicate representation typically have larger weights, so if we simply merge the representations, most of the predicted sub-events will come from the predicate representation. Ideally, the "Normalized" method would eliminate the influence of this imbalance, but it also amplifies noise and achieves worse empirical performance. In contrast, the proposed instantiation module uses events in one representation as references to instantiate the events in the other. As a result, we jointly use the two representations to generate a group of detailed events and then select the top k generated events. By doing so, we not only refine the abstract representation into detailed events but also avoid the imbalanced-distribution issue.

Figure 5: Case Study. We mark the covered and not covered predictions with green and red colors.
In APSI, we use two hyper-parameters, w_v and w_n, to control the conceptualization and instantiation depth over verbs and nouns respectively. A value of 0 means no conceptualization, and larger values encourage deeper conceptualization. We show the performance of APSI with different hyper-parameter combinations in Figure 4, from which we can see that a suitable level of conceptualization is the key to the success of APSI. If no conceptualization is allowed, all predicted events are restricted to the observed sub-events; thus we cannot predict "search house" after seeing "search car" and some events about the house. On the other hand, if we do not restrict the depth of conceptualization, all sub-events will be conceptualized to overly general events; as a result, even with the instantiation module, we cannot predict the detailed sub-events we want.

Figure 6: Demonstration of the event masked LM. Pre-trained language models are trained to predict the masked event given other events as the context.

Figure 5 shows an example that we use to analyze the current limitations of APSI. We can see that APSI successfully predicts events like "identify symptoms", but fails to predict the event "identify causes"; instead, it predicts "take supplements". This is because APSI learns to predict such sequences from other processes in the observed process graphs, such as "treat diarrhea" or treating other diseases. Treating those diseases typically does not involve identifying the cause, which is not the case for treating pain. Moreover, treating diseases often involves taking medicines, which can be conceptualized to "take supplements". As no events about pain help instantiate "supplements", APSI simply predicts it.
As discussed by Rumelhart (1975), knowledge about processes and their sub-events can help understand event sequences. Thus, in this section, we investigate whether the induced process knowledge can help predict missing events. Given a sub-event sequence, for each event in the sequence, we use the rest of the sequence as the context and ask models to select the correct event against one negative example. To make the task challenging, instead of random sampling, we follow Zellers et al. (2019) and select similar but wrong negative candidates based on their representation (i.e., BERT (Devlin et al., 2019)) similarity. We use the same training and test split as in the intrinsic experiment; as a result, we obtain 13,501 training sequences and 7,148 test questions.

The baseline method we compare with is an event-based masked language model, whose demonstration is shown in Figure 6. We use pre-trained RoBERTa-base (Liu et al., 2019) to initialize the tokenizer and transformer layers, and use all sequences of training processes as the training data. To show the value of understanding the relationship between processes and their sub-event sequences, for each sub-event sequence in the test data, we first leverage the process name and the different structure prediction methods to predict sub-event sequences and use them as additional context to help the event masked LM predict the missing event.

On our dataset, the RoBERTa-based event LM outperforms existing LSTM-based event prediction models.

Table 3: Results on the event prediction task. † and ‡ indicate statistical significance over the baseline with p-values smaller than 0.01 and 0.001, respectively.

Model                        Accuracy     ∆
RoBERTa-based Event LM       73.59%       -
+ Seq2seq (GloVe)            73.06%       -0.53%
+ Seq2seq (RoBERTa)          72.33%       -1.26%
+ Top1 similar (Jaccard)     72.76%       -0.83%
+ Top1 similar (GloVe)       74.14%       0.55%
+ Top1 similar (RoBERTa)     74.16%       0.57%
+ APSI                       74.78% †‡
To show the upper bound of the effect of adding process knowledge, we also tried adding the process structure provided by human beings as the context, denoted as '+Human'. All models are evaluated based on accuracy.

From the results in Table 3, we can make the following observations. First, adding high-quality process knowledge (i.e., APSI and Human) significantly helps the baseline model, which indicates that knowledge about the process can help models better understand the event sequence. Second, the effect of the process knowledge is positively correlated with its quality as shown in Table 1; adding a low-quality process structure may hurt the performance of the baseline model due to the extra noise it introduces. Third, the current way of using process knowledge is still very simple, and there is room for better usage; as the research focus of this paper is predicting process structure rather than applying it, we leave that for future work.

We randomly select another sub-event sequence that describes the same process from WikiHow, which could be different from the currently tested sequence. As a result, adding such a sequence cannot help predict all missing events.
Related Works
Considering the importance of events in understanding human language (e.g., commonsense knowledge (Zhang et al., 2020a)), many efforts have been devoted to defining, representing, and understanding events. For example, VerbNet (Schuler, 2005) created a verb lexicon to represent the semantic relations among verbs. Later, FrameNet (Baker et al., 1998) proposed to represent event semantics with schemas, each of which has one predicate and several arguments. Apart from the structure of events, understanding events by predicting relations among them has also become a popular research topic (e.g., TimeBank (Pustejovsky et al., 2003) for temporal relations and Event2Mind (Rashkin et al., 2018) for causal relations). Different from these horizontal relations between events, in this paper we propose to understand events vertically, by treating each event as a process and trying to understand what happens (i.e., the sub-events) inside the target event. Such knowledge is also referred to as event schemata (Zacks and Tversky, 2001) and has been shown to be crucial to how humans understand events (Abbott et al., 1985). One line of related work in the NLP community extracts super-sub event relations from textual corpora (Hovy et al., 2013; Glavas et al., 2014). The difference between our work and theirs is that we try to understand events by directly generating the sub-event sequences rather than extracting such information from text. Another line of related work is narrative schema prediction (Chambers and Jurafsky, 2008), which also holds the assumption that event schemata can help understand events; however, its research focus is using the overall process implicitly to help predict future events, while this work tries to understand events by explicitly modeling the relation between processes and their sub-event sequences.
Conclusion

In this paper, we try to understand events vertically by viewing them as processes and predicting their sub-event sequences. Our APSI framework is motivated by the notion of analogous processes, and it attempts to transfer knowledge from (a very small number of) familiar processes to a new one. The intrinsic evaluation demonstrates the effectiveness of APSI and the quality of the predicted sub-event sequences. Moreover, the extrinsic evaluation shows that, even with a naive application method, the process knowledge can help better predict missing events.
Acknowledgements
This research is supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER Program, and by contract FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. This paper is also partially supported by the Early Career Scheme (ECS, No. 26206717), General Research Fund (GRF, No. 16211520), and Research Impact Fund (RIF, No. R6020-19) from the Research Grants Council (RGC) of Hong Kong.
References
Valerie Abbott, John B. Black, and Edward E. Smith. 1985. The representation of scripts in memory. Journal of Memory and Language, pages 179–199.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of COLING-ACL 1998, pages 86–90.

Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. 2014. Modeling biological processes for reading comprehension. In Proceedings of EMNLP 2014.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

Nathanael Chambers and Daniel Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL 2008, pages 789–797.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Goran Glavas, Jan Snajder, Marie-Francine Moens, and Parisa Kordjamshidi. 2014. HiEve: A corpus for extracting event hierarchies from news stories. In Proceedings of LREC 2014, pages 3678–3683.

Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Proceedings of AAAI 2016, pages 2727–2733.

Eduard H. Hovy, Teruko Mitamura, Felisa Verdejo, Jun Araki, and Andrew Philpot. 2013. Events are not simple: Identity, non-identity, and quasi-identity. In Proceedings of EVENTS@NAACL-HLT 2013, pages 21–28.

Richard M. Karp. 1972. Reducibility among combinatorial problems. In Proceedings of a Symposium on the Complexity of Computer Computations 1972, pages 85–103.

Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. CoRR, abs/1810.09305.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out 2004, pages 74–81.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Songjian Lu and Xinghua Lu. 2014. An exact algorithm for the weighed mutually exclusive maximum set cover problem. CoRR, abs/1401.6385.

George A. Miller. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, et al. 2003. The TimeBank corpus. In Corpus Linguistics, page 40.

Kira Radinsky, Sagie Davidovich, and Shaul Markovitch. 2012. Learning causality for news events prediction. In Proceedings of WWW 2012, pages 909–918.

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of ACL 2018, pages 463–473.

D. E. Rumelhart. 1975. Notes on a schema for stories. In Representation and Understanding, pages 211–236.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures.

Karin Kipper Schuler. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NeurIPS 2014, pages 3104–3112.

Jeffrey M. Zacks and Barbara Tversky. 2001. Event structure in perception and conception. Psychological Bulletin, 127(1):3.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of ACL 2019, pages 4791–4800.

Hongming Zhang, Daniel Khashabi, Yangqiu Song, and Dan Roth. 2020a. TransOMCS: From linguistic graphs to commonsense knowledge. In Proceedings of IJCAI 2020, pages 4004–4010.

Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. 2020b. ASER: A large-scale eventuality knowledge graph. In