Learning Semantic Script Knowledge with Event Embeddings
Ashutosh Modi (ashutosh@coli.uni-sb.de)
Saarland University, Saarbrücken, Germany

Ivan Titov (titov@uva.nl)
University of Amsterdam, Amsterdam, the Netherlands
Abstract
Induction of common sense knowledge about prototypical sequences of events has recently received much attention (e.g., Chambers & Jurafsky, 2008; Regneri et al., 2010). Instead of inducing this knowledge in the form of graphs, as in much of the previous work, in our method distributed representations of event realizations are computed based on distributed representations of predicates and their arguments, and these representations are then used to predict prototypical event orderings. The parameters of the compositional process for computing the event representations and the ranking component of the model are jointly estimated from texts. We show that this approach results in a substantial boost in ordering performance with respect to previous methods.
1. Introduction
It is generally believed that natural language understanding systems would benefit from incorporating common-sense knowledge about prototypical sequences of events and their participants. Early work focused on structured representations of this knowledge (called scripts (Schank & Abelson, 1977)) and on manual construction of script knowledge bases. However, these approaches do not scale to complex domains (Mueller, 1998; Gordon, 2001). More recently, automatic induction of script knowledge from text has started to attract attention: these methods exploit either natural texts (Chambers & Jurafsky, 2008; 2009) or crowdsourced data (Regneri et al., 2010) and, consequently, do not require expensive expert annotation. Given a text corpus, they extract structured representations (i.e., graphs), for example chains (Chambers & Jurafsky, 2008) or more general directed acyclic graphs (Regneri et al., 2010).
Accepted at the workshop track of the International Conference on Learning Representations (ICLR), 2014.

These graphs are scenario-specific: their nodes correspond to events (and are associated with sets of potential event mentions), and their arcs encode the temporal precedence relation. The graphs can then be used to inform NLP applications (e.g., question answering) by providing information on whether one event is likely to precede or succeed another. In this work we advocate constructing a statistical model which is capable of "answering" at least some of the questions these graphs can be used to answer, but without explicitly representing the knowledge as a graph. In our method, distributed representations (i.e., vectors of real numbers) of event realizations are computed based on distributed representations of predicates and their arguments, and the event representations are then used in a ranker to predict the expected ordering of events. Both the parameters of the compositional process for computing the event representations and the ranking component of the model are estimated from data.

To get an intuition for why the embedding approach may be attractive, consider a situation where a prototypical ordering of the events "the bus disembarked passengers" and "the bus drove away" needs to be predicted. An approach based on the frequency of predicate pairs (Chambers & Jurafsky, 2008) is unlikely to make the right prediction, as driving usually precedes disembarking. Similarly, an approach which treats the whole predicate-argument structure as an atomic unit (Regneri et al., 2010) will probably fail as well, as such a sparse model is unlikely to be effectively learnable even from large amounts of data. However, our embedding method would be expected to capture the relevant features of the verb frames, namely the transitive use of the predicate "disembark" and the effect of the particle "away", and these features will then be used by the ranking component to make the correct prediction.

Figure 1. Computation of an event representation (the bus disembarked passengers). [The figure shows the argument embeddings a = C(bus) and a = C(passenger) and the predicate embedding p = C(disembark) being linearly transformed by the matrices T (arguments) and R (predicate), summed into a hidden layer h, and mapped through the matrix A to the event embedding x.]

In previous work on learning inference rules (Berant et al., 2011), it has been shown that enforcing transitivity constraints on the inference rules results in significantly improved performance. The same is true for the event ordering task, as scripts have a largely linear structure, and observing that a ≺ b and b ≺ c is likely to imply a ≺ c. Interestingly, our approach implicitly learns a model which satisfies transitivity constraints, without the need for any explicit global optimization on a graph.

The approach is evaluated on the crowdsourced dataset of Regneri et al. (2010), and we demonstrate that using our model results in a 13.5% absolute improvement in F1 on event ordering with respect to their graph induction method (84% vs. 71%).
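The transitivity point can be made precise. The model introduced in Section 2 scores each event with a single scalar, s(e) = w^T x_e, and orders events by this score, so for any events a, b, c:

    s(a) > s(b)  and  s(b) > s(c)   implies   s(a) > s(c),

which holds by the transitivity of > on the reals; the predicted orderings therefore can never contain cycles, whereas independent pairwise frequency counts can produce cyclic, and hence inconsistent, preferences.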
2. Model
In this section we describe the model we use for computing event representations, as well as the ranking component of our model.
Learning and exploiting distributed word representations (i.e., vectors of real values, also known as embeddings) has been shown to be beneficial in many NLP applications (Bengio et al., 2001; Turian et al., 2010; Collobert et al., 2011). These representations encode semantic and syntactic properties of a word and are normally learned in the language modeling setting (i.e., learned to be predictive of local word context), though they can also be specialized by learning in the context of other NLP applications such as PoS tagging or semantic role labeling (Collobert et al., 2011). More recently, the area of compositional distributional semantics has started to emerge (Baroni & Zamparelli, 2011; Socher et al., 2012): these approaches focus on inducing representations of phrases by learning a compositional model. Such a model computes the representation of a phrase by starting with the embeddings of the individual words in the phrase; this composition process is often recursive and guided by some form of syntactic structure.
Algorithm 1: Learning Algorithm

Notation:
  w : ranking weight vector
  E_k : k-th sequence of events in temporal order
  t_k : array of model scores for the events in E_k (of length l = |E_k|)
  γ : fixed global margin for ranking

LearnWeights()
  for epoch = 1 to T
    for k = 1 to K                  [over event sequences]
      for i = 1 to |E_k|            [over events in the sequence]
        compute the embedding x_{e_i} for event e_i
        calculate the score s_{e_i} = w^T x_{e_i}
      end for
      collect the scores in t_k = [s_{e_1}, ..., s_{e_i}, ...]
      error = RankingError(t_k)
      back-propagate the error
      update all embedding parameters and w
    end for
  end for

RankingError(t_k)
  err = 0
  for rank = 1, ..., l
    for rankBefore = 1, ..., rank - 1
      if (t_k[rankBefore] - t_k[rank]) < γ
        err = err + 1
      end if
    end for
    for rankAfter = rank + 1, ..., l
      if (t_k[rank] - t_k[rankAfter]) < γ
        err = err + 1
      end if
    end for
  end for
  return err

In our work, we use a simple compositional model for representing the semantics of a verb frame (i.e., the predicate and its arguments). The model is shown in Figure 1. Each word w_i in the vocabulary is mapped to a real vector based on the corresponding lemma (the embedding function C).¹ The hidden layer is computed by summing linearly transformed predicate and argument embeddings and passing the result through the logistic sigmoid function. We use different transformation matrices for arguments and predicates, T and R, respectively. The event representation x is then obtained by applying another linear transformation (matrix A) followed by another application of the sigmoid function.

These event representations are learned in the context of event ranking: the transformation parameters, as well as the representations of words, are forced to be predictive of the temporal order of events. However, one important characteristic of neural network embeddings is that they can be induced in a multi-tasking scenario, and consequently can be learned to be predictive of different types of contexts, providing a general framework both for inducing different aspects of the (semantic) properties of events and for exploiting the same representations in different applications.

¹ Only the syntactic heads of arguments are used in this work. If an argument is "a coffee maker", we will use only the word "maker".

Table 1. Results on the crowdsourced data for the verb-frequency baseline (BL), the verb-only embedding model (EE_verb), Regneri et al. (2010) (MSA), Frermann et al. (2014) (BS) and the full model (EE).

              |          Precision (%)          |            Recall (%)           |             F1 (%)
Scenario      |  BL   EE_verb  MSA   BS    EE   |  BL   EE_verb  MSA   BS    EE   |  BL   EE_verb  MSA   BS    EE
Bus           | 70.1   81.9    80.0  76.0  85.1 | 71.3   75.8    80.0  76.0  91.9 | 70.7   78.8    80.0  76.0  88.4
Coffee        | 70.1   73.7    70.0  68.0  69.5 | 72.6   75.1    78.0  57.0  71.0 | 71.3   74.4    74.0  62.0  70.2
Fastfood      | 69.9   81.0    53.0  97.0  90.0 | 65.1   79.1    81.0  65.0  87.9 | 67.4   80.0    64.0  78.0  88.9
Ret. Food     | 74.0   94.1    48.0  87.0  92.4 | 68.6   91.4    75.0  72.0  89.7 | 71.0   92.8    58.0  79.0  91.0
Iron          | 73.4   80.1    78.0  87.0  86.9 | 67.3   69.8    72.0  69.0  80.2 | 70.2   69.8    75.0  77.0  83.4
Microwave     | 72.6   79.2    47.0  91.0  82.9 | 63.4   62.8    83.0  74.0  90.3 | 67.7   70.0    60.0  82.0  86.4
Scr. Eggs     | 72.7   71.4    67.0  77.0  80.7 | 68.0   67.7    64.0  59.0  76.9 | 70.3   69.5    66.0  67.0  78.7
Shower        | 62.2   76.2    48.0  85.0  80.0 | 62.5   80.0    82.0  84.0  84.3 | 62.3   78.1    61.0  85.0  82.1
Telephone     | 67.6   87.8    83.0  92.0  87.5 | 62.8   87.9    86.0  87.0  89.0 | 65.1   87.8    84.0  89.0  88.2
Vending       | 66.4   87.3    84.0  90.0  84.2 | 60.6   87.6    85.0  74.0  81.9 | 63.3   84.9    84.0  81.0  88.2
Average       |   -     -       -     -     -   |   -     -       -     -     -   |   -     -       -     -     -
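To make the computation in Figure 1 and the scoring step of Algorithm 1 concrete, below is a minimal NumPy sketch of the forward pass, h = sigmoid(R p + T a_1 + T a_2) and x = sigmoid(A h), together with the ranking score w^T x. This is a sketch under assumptions: the dimensionalities, the random initialization and the toy vocabulary are illustrative, not the settings used in the paper (which initializes the word vectors with SENNA embeddings).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative dimensionalities and random initialization (assumptions).
    d_word, d_hidden, d_event = 50, 30, 20
    rng = np.random.default_rng(0)

    # Embedding function C: lemma -> real vector (toy vocabulary).
    C = {lemma: rng.normal(size=d_word)
         for lemma in ("bus", "disembark", "passenger")}

    R = rng.normal(size=(d_hidden, d_word))   # transform for the predicate
    T = rng.normal(size=(d_hidden, d_word))   # transform for the arguments
    A = rng.normal(size=(d_event, d_hidden))  # transform to the event embedding
    w = rng.normal(size=d_event)              # ranking weight vector

    def event_embedding(predicate, arguments):
        # Hidden layer: sum of linearly transformed predicate and argument
        # embeddings, passed through the logistic sigmoid.
        h = sigmoid(R @ C[predicate] + sum(T @ C[a] for a in arguments))
        # Event representation: another linear transform and sigmoid.
        return sigmoid(A @ h)

    # "the bus disembarked passengers": predicate plus argument heads.
    x = event_embedding("disembark", ["bus", "passenger"])
    score = w @ x  # ranking score used to order events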
The task of learning the stereotyped order of events naturally corresponds to the standard ranking setting. Here, we assume that we are provided with sequences of events, and our goal is to capture this order. We discuss how we obtain this learning material in the next section. We learn a linear ranker (characterized by a vector w) which takes an event representation and returns a ranking score. Events are then ordered according to the score to yield the model prediction. Note that during the learning stage we estimate not only w but also the event representation parameters, i.e., the matrices T, R and A, and the word embedding function C. Note also that by casting the event ordering task as a global ranking problem we ensure that the model implicitly exploits the transitivity of the temporal relation, the property which is crucial for successful learning from a finite amount of data, as we argued in the introduction and will confirm in our experiments.

We use an online ranking algorithm based on the Perceptron Rank (PRank; Crammer & Singer, 2001) or, more accurately, its large-margin extension. One crucial difference, though, is that the error is computed not only with respect to w but is also propagated back through the structure of the neural network. The learning procedure is sketched in Algorithm 1. Additionally, we use a Gaussian prior on the weights, regularizing both the embedding parameters and the vector w. We initialize the word representations using the SENNA embeddings (Collobert et al., 2011).²

² When we kept the word representations fixed to the SENNA embeddings and learned only the matrices T, R and A, we obtained similar results (a 0.3% difference in the average F1 score).
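As an illustration of the learning procedure, here is a minimal sketch of a PRank-style large-margin step for a single event sequence, updating only the ranking vector w; in the full model the error is also backpropagated into T, R, A and the word embeddings C. The margin and learning-rate values are assumptions of the sketch.

    import numpy as np

    GAMMA = 1.0  # global ranking margin (assumed value)
    LR = 0.01    # learning rate (assumed value)

    def ranking_violations(scores, gamma=GAMMA):
        # Pairs (i, j) with i before j in the gold order whose margin is
        # violated; mirrors RankingError in Algorithm 1, where earlier
        # events should outscore later ones by at least gamma.
        return [(i, j)
                for i in range(len(scores))
                for j in range(i + 1, len(scores))
                if scores[i] - scores[j] < gamma]

    def prank_style_update(w, event_vectors, gamma=GAMMA, lr=LR):
        # One online large-margin step over a single event sequence.
        scores = [w @ x for x in event_vectors]
        for i, j in ranking_violations(scores, gamma):
            # Push the earlier event's score up and the later one's down.
            w = w + lr * (event_vectors[i] - event_vectors[j])
        return w

    # Toy usage with random "event embeddings" in gold temporal order.
    rng = np.random.default_rng(0)
    events = [rng.normal(size=20) for _ in range(4)]
    w = np.zeros(20)
    for _ in range(20):  # a few online passes over the sequence
        w = prank_style_update(w, events)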
3. Experiments
We evaluate our approach on the crowdsourced data collected for script induction by Regneri et al. (2010), though, in principle, the method is applicable in the arguably more general setting of Chambers & Jurafsky (2008).
Regneri et al. (2010) collected short textual descriptions (called event sequence descriptions, ESDs) of various types of human activities (e.g., going to a restaurant, ironing clothes) using crowdsourcing (Amazon Mechanical Turk); this dataset was also complemented by descriptions provided in the OMICS corpus (Gupta & Kochenderfer, 2004). The datasets are fairly small, containing 30 ESDs per activity type on average (we will refer to different activities as scenarios), but the collection can easily be extended given the low cost of crowdsourcing. The ESDs are written in a bullet-point style, and the annotators were asked to follow the temporal order in writing. Consider an example ESD for the scenario prepare coffee:

{go to coffee maker} → {fill water in coffee maker} → {place the filter in holder} → {place coffee in filter} → {place holder in coffee maker} → {turn on coffee maker}

Though individual ESDs may seem simple, the learning task is challenging because of the limited amount of training data, the variability in the vocabulary used, the optionality of events (e.g., going to the coffee machine may not be mentioned in an ESD), the different granularity of events, and the variability in the ordering (e.g., coffee may be put in the filter before placing it in the coffee maker).

Unlike our work, Regneri et al. (2010) rely on WordNet to provide an extra signal when using the Multiple Sequence Alignment (MSA) algorithm. As in their work, each description was preprocessed to extract the predicate and the heads of the argument noun phrases to be used in the model.

The methods are evaluated on human-annotated scenario-specific tests: the goal is to classify event pairs as appearing in a given stereotypical order or not (Regneri et al., 2010).³

³ The unseen event pairs do not come from the same ESDs, making the task harder: the events may not be in any temporal relation. This is also the reason for using the F1 score rather than accuracy, both in Regneri et al. (2010) and in our work.

The model was estimated as explained in Section 2, with the order of events in the ESDs treated as the gold standard. We used 4 held-out scenarios to choose the model parameters; no scenario-specific tuning was performed, and the 10 test scripts were not used to perform model selection. When testing, we predicted that the event pair (e1, e2) is in the stereotypical order (e1 ≺ e2) if the ranking score for e1 exceeded the ranking score for e2.

In our experiments, we compared our event embedding model (EE) against three baseline systems: BL, MSA and BS. MSA is the system of Regneri et al. (2010). BS is a hierarchical Bayesian system of Frermann et al. (2014). BL chooses the order of events based on the preferred order of the corresponding verbs in the training set: (e1, e2) is predicted to be in the stereotypical order if the number of times the corresponding verbs v1 and v2 appear in this order in the training ESDs exceeds the number of times they appear in the opposite order (not necessarily at adjacent positions); a coin is tossed to break ties (or if v1 and v2 are the same verb). We also compare to a version of our model which uses only verbs (EE_verb). Note that EE_verb is conceptually very similar to BL, as it essentially induces an ordering over verbs. However, this ordering can benefit from the implicit transitivity assumption used in EE_verb (and EE), as we discussed in the introduction.
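For concreteness, here is a small sketch of the BL baseline as just described: count how often verb v1 precedes verb v2 (not necessarily adjacently) across the training ESDs, and at test time compare the two directed counts, tossing a coin to break ties. The input format (one verb sequence per ESD) is an assumption of the sketch.

    import random
    from collections import Counter

    def train_verb_order_counts(esd_verb_sequences):
        # esd_verb_sequences: one list of verbs per ESD, in temporal order
        # (assumed input format). Counts, for every ordered pair (v1, v2),
        # how often v1 appears before v2, not necessarily adjacently.
        counts = Counter()
        for verbs in esd_verb_sequences:
            for i in range(len(verbs)):
                for j in range(i + 1, len(verbs)):
                    counts[(verbs[i], verbs[j])] += 1
        return counts

    def bl_predict(counts, v1, v2):
        # Predict that an event pair with verbs (v1, v2) is in stereotypical
        # order if v1 preceded v2 more often than the reverse in training;
        # toss a coin on ties (including when v1 == v2).
        before, after = counts[(v1, v2)], counts[(v2, v1)]
        if before == after:
            return random.random() < 0.5
        return before > after

    # Toy usage on two hypothetical "prepare coffee" verb sequences.
    counts = train_verb_order_counts([
        ["go", "fill", "place", "turn"],
        ["fill", "place", "turn"],
    ])
    print(bl_predict(counts, "go", "turn"))  # True: "go" always precedes "turn"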
The results are presented in Table 1. The first observation is that the full model improves substantially over the baseline and over the previous methods MSA and BS (13.5% and 6.5% improvement in F1, respectively); this improvement is largely due to an increase in recall, while precision is not negatively affected. We also observe a substantial improvement in all metrics from using transitivity, as seen by comparing the results of BL and EE_verb (11.3% improvement in F1); this simple approach already outperforms the pipelined MSA system. These results seem to support our hypothesis from the introduction that inducing graph representations from scripts may not be an optimal strategy from a practical perspective.
References
Baroni, Marco and Zamparelli, Roberto. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, 2011.

Bengio, Yoshua, Ducharme, Réjean, and Vincent, Pascal. A neural probabilistic language model. In Proceedings of NIPS, 2001.

Berant, Jonathan, Dagan, Ido, and Goldberger, Jacob. Global learning of typed entailment rules. In Proceedings of ACL, 2011.

Chambers, Nathanael and Jurafsky, Daniel. Unsupervised learning of narrative schemas and their participants. In Proceedings of ACL, 2009.

Chambers, Nathanael and Jurafsky, Daniel. Unsupervised learning of narrative event chains. In Proceedings of ACL, 2008.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

Crammer, Koby and Singer, Yoram. Pranking with ranking. In Proceedings of NIPS, 2001.

Frermann, Lea, Titov, Ivan, and Pinkal, Manfred. A hierarchical Bayesian model for unsupervised induction of script knowledge. In Proceedings of EACL, Gothenburg, Sweden, 2014.

Gordon, Andrew. Browsing image collections with representations of common-sense activities. JASIST, 52(11), 2001.

Gupta, Rakesh and Kochenderfer, Mykel J. Common sense data acquisition for indoor mobile robots. In Proceedings of AAAI, 2004.

Mueller, Erik T. Natural Language Processing with ThoughtTreasure. Signiform, 1998.

Regneri, Michaela, Koller, Alexander, and Pinkal, Manfred. Learning script knowledge with web experiments. In Proceedings of ACL, 2010.

Schank, R. C. and Abelson, R. P. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Potomac, Maryland, 1977.

Socher, Richard, Huval, Brody, Manning, Christopher D., and Ng, Andrew Y. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP, 2012.

Turian, Joseph, Ratinov, Lev, and Bengio, Yoshua. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, 2010.