Syntactic Search by Example
Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, Yoav Goldberg
Allen Institute for AI, Tel Aviv, Israel
Bar Ilan University, Ramat-Gan, Israel
{micahs,hillelt,shovals,yoavg}@allenai.org

Abstract
We present a system that allows a user to search a large linguistically annotated corpus using syntactic patterns over dependency graphs. In contrast to previous attempts to this effect, we introduce a light-weight query language that does not require the user to know the details of the underlying syntactic representations, and instead lets the user query the corpus by providing an example sentence coupled with simple markup. Search is performed at an interactive speed due to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development and refinement of syntax-based queries. We demonstrate the system using queries over two corpora: the English Wikipedia, and a collection of English PubMed abstracts. A demo of the Wikipedia system is available at: https://allenai.github.io/spike/

The introduction of neural-network based models into NLP brought with it a substantial increase in syntactic parsing accuracy. We can now produce accurate syntactically annotated corpora at scale. However, the produced representations themselves remain opaque to most users, and require substantial linguistic expertise to use. Patterns over syntactic dependency graphs can be very effective for interacting with linguistically-annotated corpora, either for linguistic retrieval or for information and relation extraction (Fader et al., 2011; Akbik et al., 2014; Valenzuela-Escárcega et al., 2015).

(In this paper, we very loosely use the term "syntactic" to refer to a linguistically motivated graph-based annotation over a piece of text, where the graph is directed and there is a path between any two nodes. While this usually implies syntactic dependency trees or graphs (and indeed, our system currently indexes Enhanced English Universal Dependency graphs (Nivre et al., 2016; Schuster and Manning, 2016)), the system can also work with more semantic annotation schemes, e.g., (Oepen et al., 2015), given the availability of an accurate enough parser for them.)

The main features of our system are: (1)
A light-weight query language that does not require in-depth familiarity with the underlying syntactic representation scheme, and instead lets the user specify their intent via a natural language example and lightweight markup. (2)
A fast, near-real-time response time due to efficient indexing, allowing for rapid experimentation.

Figure 1 (next page) shows the interface of our web-based system. The user issued the query:

⟨⟩founder:[e]Paul was a t:[w]founder of ⟨⟩entity:[e]Microsoft .

The query specifies a sentence (Paul was a founder of Microsoft) and three named captures: founder, t and entity. The founder and entity captures should have the same entity-type as the corresponding sentence words (PERSON for Paul and ORGANIZATION for Microsoft, indicated by [e]), and the t capture should have the same word form as the one in the sentence (founder), indicated by [w]. The syntactic relation between the captures should be the same as the one in the sentence, and the founder and entity captures should be expanded (indicated by ⟨⟩).

The query is translated into a graph-based query, which is shown below the query, each graph-node associated with the query word that triggered it. The system also returned a list of matched sentences. The matched tokens for each capture group (founder, t and entity) are highlighted. The user can then issue another query, browse the results, or download all the results as a tab-separated file.

Figure 1: Syntactic Search System

While several rich query languages over linguistic tree and graph structure exist, they require a substantial amount of expertise to use. The user needs to be familiar not only with the syntax of the query language itself, but also to be intimately familiar with the specific syntactic scheme used in the underlying linguistic annotations.
For example, in Odin (Valenzuela-Escárcega et al., 2015), a dedicated language for pattern-based information extraction, the same rule as above is expressed as:

```yaml
- label: Person
  type: token
  pattern: |
    [entity="PERSON"]+
- label: Organization
  type: token
  pattern: |
    [entity="ORGANIZATION"]+
- label: founded
  type: dependency
  pattern: |
    trigger = [word=founded]
    founder:Person = >nsubj
    entity:Organization = >nmod
```
The spaCy NLP toolkit (https://spacy.io/) also includes a pattern matcher over dependency trees, using a JSON-based syntax:

```json
[
  {"PATTERN": {"ORTH": "founder"},
   "SPEC": {"NODE_NAME": "t"}},
  {"PATTERN": {"ENT_TYPE": "PERSON"},
   "SPEC": {"NODE_NAME": "founder",
            "NBOR_RELOP": ">nsubj",
            "NBOR_NAME": "t"}},
  {"PATTERN": {"ENT_TYPE": "ORGANIZATION"},
   "SPEC": {"NODE_NAME": "entity",
            "NBOR_RELOP": ">nmod",
            "NBOR_NAME": "t"}}
]
```

(We focus here on systems that are based on dependency syntax, but note that many systems and query languages exist also for constituency-trees, e.g., TGREP/TGREP2, TigerSearch (Lezius et al., 2002), the Linguist's Search Engine (Resnik and Elkiss, 2005), and Fangorn (Ghodke and Bird, 2012).)

Stanford's CoreNLP package (Manning et al., 2014) includes a dependency matcher called SemGrex (https://nlp.stanford.edu/software/tregex.shtml), which uses a more concise syntax:

{ner:PERSON}=founder <nsubj ({word:founder}=t >nmod {ner:ORGANIZATION}=entity)

Finally, the dep_search system (Luotolahti et al., 2017) of the Turku university group is designed to provide a minimal syntax.
While the different systems vary in the verboseness and complexity of their own syntax (indeed, the Turku system's syntax is rather minimal), they all require the user to explicitly specify the dependency relations between the tokens, making it challenging and error-prone to write, read or edit. The challenge grows substantially as the complexity of the pattern increases beyond the very simple example we show here.

Closest in spirit to our proposal is the PropMiner system of Akbik et al. (2013), which lets the user enter a natural language sentence, mark spans as subject, predicate and object, and have a rule be generated automatically. However, the system is restricted to ternary subject-predicate-object patterns. Furthermore, the generated pattern is written in a path-expression SQL variant (SerQL; Broekstra and Kampman, 2003), which the user then needs to manually edit. For example, our query above will be translated to:

```sql
SELECT subject, predicate, object
FROM predicate.3 nsubj subject,
     predicate.3 nmod object
WHERE subject POS NNP
AND predicate.3 POS NN
AND object POS NNP
AND subject TEXT PAUL
AND predicate.3 TEXT founder
AND object TEXT Microsoft
AND subject FULL_ENTITY
AND object FULL_ENTITY
```
All these systems require the user to closely interact with linguistic concepts and explicitly specify graph-structures, posing a high barrier of entry for non-expert users. They also slow down expert users: formulating a complex query may require a few minutes. Furthermore, many of these query languages are designed to match against a provided sentence, and are not indexable. This requires iterating over all sentences in the corpus attempting to match each one, requiring substantial time to obtain matches from large corpora.

Augustinus et al. (2012) describe a system for syntactic search by example, which retrieves tree fragments and which is completely UI based. Our system takes a similar approach, but replaces the UI-only interface with an expressive textual query language, allowing for richer queries. We also return node matches rather than tree fragments.
We propose a substantially simplified language that has a minimal syntax and does not require the user to know the underlying syntactic schema upfront (though it does not completely hide it from the user, allowing for exposure over time, and allowing control for expert users who understand the underlying syntactic annotation scheme).

The query language is designed to be linguistically expressive, simple to use and amenable to efficient indexing and query. The simplicity and indexing requirements do come at a cost, though: we purposefully do not support some of the features available in existing languages. We expect these features to correlate with expertise. (An example of a query feature we do not support is quantification, i.e., "nodes a and b should be connected via a path that includes one or more 'conj' edges".) At the same time, we also seamlessly support expressing arbitrary sub-graphs, a task which is either challenging or impossible with many of the other systems.

The language is based on the following principles: (1)
The core of the query is a natural language sentence. (2)
A user can specify the tokens of interest and constraints on them via lightweight markup. (3)
While expert users can specify complex token constraints, effective constraints can be specified by pulling values from the query words.

The required syntactic knowledge from the user, both in terms of the syntax of the query language itself and in terms of the underlying linguistic formalism, remains minimal.
The language is structured around between-token relations and within-token constraints, where tokens can be captured.

Formally, our query G = (V, E) is a labeled directed graph, where each node v_i ∈ V corresponds to a token, and a labeled edge e = (v_i, v_j, ℓ) ∈ E between the nodes corresponds to a between-token syntactic constraint. This query graph is then matched against parsed target sentences, looking for a correspondence between query nodes and target-sentence nodes that adheres to the token and edge constraints.

For example, the following graph specifies three tokens, where the first and second are connected via an 'xcomp' relation, and the second and third via a 'dobj' relation. The first token is unconstrained, while the second token must have the POS-tag of VB, and the third token must be the word home. Sentences whose syntactic graph has a subgraph that aligns to the query graph and adheres to the constraints will be considered as matches. Examples of such matching sentences are:

- John wanted<w> to go<v> home<h> after lunch.
- It was a place she decided<w> to call<v> her home<h>.

The <w>, <v> and <h> marks on the nodes denote named captures. When matching a sentence, the sentence tokens corresponding to the graph-nodes will be bound to variables named 'w', 'v' and 'h', in our case {w=wanted, v=go, h=home} for the first sentence and {w=decided, v=call, h=home} for the second. Graph nodes can also be unnamed, in which case they must match sentence tokens but will not bind to any variable. The graph structure is not meant to be specified by hand, but rather to be inferred from the example-based query language described in the next section (an example query resulting in this graph is "They w:wanted to v:[tag]go h:[word]home").

Between-token constraints correspond to labeled directed edges in the sentence's syntactic graph.

Within-token constraints correspond to properties of individual sentence tokens. For each property we specify a list of possible values (a disjunction) and, if lists for several properties are provided, we require all of them to hold (a conjunction). For example, in the constraint tag=VBD|VBZ&lemma=buy we look for tokens with POS-tag of either
VBD or VBZ, and the lemma buy. The list of possible values for a property can be specified as a pipe-separated list (tag=VBD|VBZ|VBN) or as a regular expression (tag=/VB[DZN]/). (Currently supported properties are word-form (word), lemma (lemma), pos-tag (tag) and entity type (entity). Additional types can be easily added, provided that we have suitable linguistic annotators for them.)

The graph language described above is expressive enough to support many interesting queries, but it is also very tedious to specify query graphs G, especially for non-expert users. We propose a simple syntax that allows to easily specify a graph query G (constrained nodes connected by labeled edges) using a textual query q that takes the form of an example sentence and lightweight markup. (Indeed, we currently do not even expose a textual representation of the graph.)

Let s = w_1, ..., w_n be a proper English sentence. Let D be its dependency graph, with nodes w_i and labeled edges (w_i, w_j, ℓ). A corresponding textual query q takes the form q = q_1, ..., q_n, where each q_i is either a word q_i = w_i, or a marked word q_i = m(w_i). A marking of a word takes one of the forms: :word (unnamed capture), name:word (named capture), name:[constraints]word, or :[constraints]word.

Consider the query:

John w:wanted to v:[tag=VB]go h:[word=home]home

corresponding to the above graph query. The marked words are:

q_2 = w:wanted (unconstrained, name: w)
q_4 = v:[tag=VB]go (cnstr: tag=VB, name: v)
q_5 = h:[word=home]home (cnstr: word=home, name: h)
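Matching such a query graph against a parsed sentence amounts to a small backtracking search over assignments of query nodes to tokens. The sketch below is an illustrative toy, not the system's actual implementation; the token and edge representations are hypothetical stand-ins:

```python
def find_matches(query_nodes, query_edges, tokens, edges):
    """Enumerate assignments of query-node names to token indices such that
    every within-token constraint holds and every labeled query edge is
    present in the sentence's dependency graph."""
    names = list(query_nodes)
    results = []

    def extend(assignment):
        if len(assignment) == len(names):
            results.append(dict(assignment))
            return
        name = names[len(assignment)]
        for i, tok in enumerate(tokens):
            if i in assignment.values():      # one token per query node
                continue
            if not query_nodes[name](tok):    # within-token constraint
                continue
            assignment[name] = i
            # between-token constraints whose endpoints are both bound
            if all((assignment[h], assignment[d], lab) in edges
                   for h, d, lab in query_edges
                   if h in assignment and d in assignment):
                extend(assignment)
            del assignment[name]

    extend({})
    return results


# The query graph from the text: w (unconstrained) -xcomp-> v (tag VB)
# -dobj-> h (word "home"), matched against "John wanted to go home".
tokens = [{"word": "John", "tag": "NNP"}, {"word": "wanted", "tag": "VBD"},
          {"word": "to", "tag": "TO"}, {"word": "go", "tag": "VB"},
          {"word": "home", "tag": "NN"}]
edges = {(1, 0, "nsubj"), (1, 3, "xcomp"), (3, 2, "mark"), (3, 4, "dobj")}
query_nodes = {"w": lambda t: True,
               "v": lambda t: t["tag"] == "VB",
               "h": lambda t: t["word"] == "home"}
query_edges = [("w", "v", "xcomp"), ("v", "h", "dobj")]
matches = find_matches(query_nodes, query_edges, tokens, edges)
# matches == [{"w": 1, "v": 3, "h": 4}], i.e. w=wanted, v=go, h=home
```

The binding of {w=wanted, v=go, h=home} mirrors the named-capture semantics described above; the production system, of course, retrieves candidates from an index rather than scanning sentences one by one.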
Each of these marked words corresponds to a node v^q_i in the query graph above.

Let m be the set of marked query words, and m+ be a minimal connected subgraph of D that includes all the words in m. When translating q to G, each marked word w_i ∈ m is translated to a named query graph node v^q_i with the appropriate restriction. The additional words w_j ∈ m+ \ m are translated to unrestricted, unnamed nodes v^q_j. We add a query graph edge (v^q_i, v^q_j, ℓ) for each pair of nodes for which (w_i, w_j, ℓ) ∈ D.
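For a dependency tree, the minimal connected subgraph m+ described above is simply the union of tree paths from one marked word to each of the others. A toy construction, assuming the tree is given as a child-to-parent map (the function and variable names are hypothetical):

```python
def tree_path(parents, a, b):
    """Nodes on the path from a to b in a tree given as {child: parent},
    with the root mapped to None."""
    up_a, n = [], a
    while n is not None:
        up_a.append(n)
        n = parents[n]
    seen = set(up_a)
    up_b, n = [], b
    while n not in seen:          # climb until the lowest common ancestor
        up_b.append(n)
        n = parents[n]
    return up_a[:up_a.index(n) + 1] + list(reversed(up_b))


def minimal_connected_subgraph(parents, marked):
    """Union of paths from the first marked node to all others; for a tree
    this is exactly the smallest connected subgraph covering `marked`."""
    marked = list(marked)
    nodes = {marked[0]}
    for m in marked[1:]:
        nodes.update(tree_path(parents, marked[0], m))
    return nodes


# "John wanted to go home": wanted is the root.
parents = {"John": "wanted", "wanted": None, "to": "go",
           "go": "wanted", "home": "go"}
# Marking only John and home pulls in the unmarked words "wanted" and "go",
# which become the unnamed, unconstrained query nodes of the text.
span = minimal_connected_subgraph(parents, ["John", "home"])
# span == {"John", "wanted", "go", "home"}
```

The resulting node set, together with the dependency edges among its members, yields the query graph edges (v^q_i, v^q_j, ℓ).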
Further query simplifications. Consider the marked word h:[word=home]home. The constraint is redundant with the word. In such cases we allow the user to drop the value, which is then taken from the corresponding property of the query word. This allows us to replace the query:

John w:wanted to v:[tag=VB]go h:[word=home]home

with:

John w:wanted to v:[tag]go h:[word]home

This further drives the "by example" agenda, as the user does not need to know what the lemma, entity-type or POS-tag of a word are in order to specify them as a constraint. Full property names can be replaced with their shorthands w, l, t, e:

John w:wanted to v:[t]go h:[w]home

Finally, capture names can be omitted, in which case an automatic name is generated based on the corresponding word:
John :wanted to :[t]go :[w]home
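The constraint machinery of this section, value lists, regular expressions, and values pulled from the example word, can be sketched in a few lines. These are hypothetical helpers under a simplified grammar (no escaping), not the system's parser:

```python
import re

def parse_constraint(spec):
    """Compile a constraint such as 'tag=VBD|VBZ&lemma=buy' or
    'tag=/VB[DZN]/' into a predicate over token dicts."""
    clauses = []
    for part in spec.split("&"):                   # '&' = conjunction of properties
        prop, _, values = part.partition("=")
        if values.startswith("/") and values.endswith("/"):
            rx = re.compile(values[1:-1] + r"\Z")  # /.../ = regular expression
            clauses.append((prop, lambda v, rx=rx: rx.match(v) is not None))
        else:
            allowed = set(values.split("|"))       # '|' = disjunction of values
            clauses.append((prop, lambda v, ok=allowed: v in ok))
    return lambda token: all(test(token[prop]) for prop, test in clauses)


def fill_from_example(prop, query_token):
    """Expand a value-less constraint such as [t] or [w] by pulling the
    value from the corresponding property of the example word."""
    shorthands = {"w": "word", "l": "lemma", "t": "tag", "e": "entity"}
    prop = shorthands.get(prop, prop)
    return "{}={}".format(prop, query_token[prop])


bought = parse_constraint("tag=VBD|VBZ&lemma=buy")
go = {"word": "go", "lemma": "go", "tag": "VB", "entity": "O"}
# fill_from_example("t", go) == "tag=VB", so [t]go behaves like [tag=VB]go
```

Under this sketch, bought({"tag": "VBD", "lemma": "buy"}) holds while a bare VB form fails the tag disjunction, matching the semantics described above.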
Anchors. In some cases we want to add a node to the graph, without an explicit capture. In such cases we can use the anchor $ ($John). These are interpreted as having a default constraint of [w], which can be overridden by providing an alternative constraint ($[e]John), or an empty one ($[]John).
Expansions. When matching a query against a sentence, the graph nodes bind to sentence words. Sometimes, we may want the match to be expanded to a larger span of the sentence. For example, when matching a word which is part of an entity, we often wish to capture the entire entity rather than the word. This is achieved by prefixing the term with the "expansion diamond" ⟨⟩. The default behavior is to expand the match from the current word to the named entity boundary or NP-chunk that surrounds it, if it exists. We are currently investigating the option of providing additional expansion strategies.
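Under a BIO entity encoding, the default expansion to the surrounding entity boundary amounts to scanning outward over the tags. The following is an illustrative sketch; the actual system may compute spans differently:

```python
def expand_to_entity(i, bio_tags):
    """Expand token index i to the (start, end) span, inclusive, of the
    entity surrounding it, given BIO tags; a token outside any entity
    expands to itself."""
    if bio_tags[i] == "O":
        return (i, i)
    start = i
    while bio_tags[start].startswith("I-"):   # walk left to the B- tag
        start -= 1
    end = i
    while end + 1 < len(bio_tags) and bio_tags[end + 1].startswith("I-"):
        end += 1                              # walk right to the entity end
    return (start, end)


# A three-token ORG span: matching any word inside it expands the capture
# to the whole entity, as with the expansion diamond described above.
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O"]
# expand_to_entity(2, tags) == (1, 3)
```

NP-chunk expansion would follow the same pattern over chunk tags instead of entity tags.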
Summary. To summarize the query language from the point of view of the user: the user starts with a sentence w_1, ..., w_n, and marks some of the words for inclusion in the query graph. For each marked word, the user may specify a name, and optional constraints. The user query is then translated to a graph query as described above. The results list highlights the words corresponding to the marked query words. The user can choose for the results to highlight entire entities rather than single words.

An important aspect of the system is its interactivity. Users enter queries by writing a sentence and adding markup on some words, and can then refine them following feedback from the environment, as we demonstrate with a walk-through example.

A user interested in people who obtained degrees from higher education institutions may issue the following query:

subj:John obtained his d:[w]degree from inst:Harvard
Here, the person in the "subj" capture and the institution in the "inst" capture are placeholders for items to be captured, so the user uses generic names and leaves them unconstrained. The "degree" ("d") capture should match exactly, as the user specified the "w" constraint (exact word match). When pressing
Enter, the user is then shown the resulting query-graph and a result list. The user can then refine their queries based on either the query graph, the result list, or both. For the above query, the resulting graph is shown in the interface. Note that the query graph associates each graph node with the query word that triggered it. The word "obtained" resulted in a graph node even though it was not marked by the user as a capture. The user makes a note to themselves to go back to this word later. The user also notices that the word "from" is not part of the query.

Looking at the result list, things look weird. Maybe this is because the word from is not in the graph? Indeed, adding a non-capturing exact-word anchor on "from" solves this issue:

subj:John obtained his d:[w]degree $from inst:Harvard
However, the resulting list contains many non-names in the subj capture. Trying to resolve this, the user adds an "entity-type" constraint to the subj capture:

subj:[e]John obtained his d:[w]degree $from inst:Harvard
Note that the user didn't specify an exact type, yet the query graph correctly resolved PERSON. The user is interested in the full name of the person and organization, so they change from single-word capture to expanded capture, with the default expansion level (using the diamond prefix ⟨⟩):

⟨⟩subj:[e]John obtained his d:[w]degree $from ⟨⟩inst:Harvard
These are the kind of results the user expected, but now they are curious about degrees obtained by females, and their representation in the Wikipedia corpus. Adding the pronoun to the query, the user then issues the following two queries, saving the result-sets from each one as a CSV for further comparative analysis:

⟨⟩subj:[e]John obtained $his d:[w]degree $from ⟨⟩inst:Harvard

⟨⟩subj:[e]John obtained $her d:[w]degree $from ⟨⟩inst:Harvard
Our user now worries that they may be missing some results by focusing on the word degree. Maybe other things can be obtained from a university? The user then sets an exact-word constraint on "Harvard", adds a lemma constraint to "obtain" and clears the constraint from "degree":

⟨⟩subj:[e]John :[l]obtained his d:degree $from ⟨⟩inst:[w]Harvard

Browsing the results, the d capture includes words such as "BA, PhD, MBA, certificate". But the result list is rather short, suggesting that either Harvard or obtain are too restrictive. The user seeks to expand the "obtain" node's vocabulary, adding back the exact word constraint on "degree" while removing the one from "obtain":

⟨⟩subj:[e]John :[]obtained his d:[w]degree $from ⟨⟩inst:[w]Harvard

Looking at the result list in the o capture, the user chooses the lemmas "receive, complete, earn, obtain, get", adds them to the o constraint, and removes the degree constraint:

⟨⟩subj:[e]John o:[l=receive|complete|earn|obtain|get]obtained his d:degree $from ⟨⟩inst:[w]Harvard

The returned result-set is now much longer, and we select additional terms for the degree slot and remove the institution word constraint, resulting in the final query:

⟨⟩subj:[e]John o:[l=receive|complete|earn|obtain|get]obtained his d:[w=degree|MA|BA|MBA|doctorate|masters|PhD]degree $from ⟨⟩inst:Harvard
The result is a list of person names earning degrees from institutions, and the entire list can be downloaded as a tab-separated file which includes the named captures as well as the source sentences (over Wikipedia, this list has 6197 rows). (The list could be even more comprehensive had we selected additional degree words and obtain words, and considered also additional re-phrasings.) The query can also be further refined to capture which degree was obtained, e.g.:

⟨⟩subj:[e]John o:[l=...]obtained his kind:law d:[w=...]degree $from ⟨⟩inst:Harvard

capturing under kind words like law, chemistry, engineering and DLitt but also bachelors, masters and graduate. This concludes our walk-through.

To whet the reader's appetite, here is a sample of additional queries, showing different potential use-cases. Over Wikipedia:

- p:[e]Sam $[l=win|receive]won an $Oscar .
- ⟨⟩p:[e]Sam $[l=win|receive]won an $Oscar $for ⟨⟩thing:something
- $fish $such $as ⟨⟩fish:salmon
- ⟨⟩hero:[t]Spiderman $is a $superhero
- I like kind:coconut $oil
- kind:coconut $oil is $used for purpose:eating

Over a PubMed corpus, annotated with the SciSpacy (Neumann et al., 2019) pipeline:

- ⟨⟩x:[e]aspirin $inhibits ⟨⟩y:thing
- a $combination of ⟨⟩d1:[e]aspirin and ⟨⟩d2:[e]alcohol $[l]causes ⟨⟩t:thing
- ⟨⟩patients:[t]rats were $injected $with ⟨⟩what:drugs

The indexing is handled by Lucene. We currently use Odinson (Valenzuela-Escárcega et al., 2020), an open-source Lucene-based query engine developed at Lum.ai as a successor of Odin (Valenzuela-Escárcega et al., 2015), that allows to index syntactic graphs and issue efficient path queries on them. We translate our queries into an Odinson path query that corresponds to a longest path in our query graph.
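As a rough illustration of this indexing approach, an inverted index from token features (including incoming and outgoing edge labels) to sentence ids lets the engine retrieve a small candidate set before doing any graph matching. This is a toy stand-in, not the Lucene or Odinson code:

```python
from collections import defaultdict

class SentenceIndex:
    """Toy inverted index from token features to sentence ids, mimicking a
    retrieve-then-verify strategy over a parsed corpus."""

    def __init__(self):
        self.postings = defaultdict(set)   # feature -> set of sentence ids
        self.sentences = []

    def add(self, tokens, edges):
        sid = len(self.sentences)
        self.sentences.append((tokens, edges))
        for tok in tokens:
            for prop, value in tok.items():
                self.postings[(prop, value)].add(sid)
        # Index incoming/outgoing edge labels as if they were token
        # features, to make candidate retrieval more selective.
        for head, dep, label in edges:
            self.postings[("out", label)].add(sid)
            self.postings[("in", label)].add(sid)

    def candidates(self, required_features):
        """Sentences containing a token for every required feature; each
        candidate would then be verified with a full graph match."""
        sids = None
        for feature in required_features:
            posting = self.postings.get(feature, set())
            sids = posting if sids is None else sids & posting
        return sids if sids else set()


index = SentenceIndex()
index.add([{"word": "go", "tag": "VB"}, {"word": "home", "tag": "NN"}],
          [(0, 1, "dobj")])
index.add([{"word": "went", "tag": "VBD"}, {"word": "home", "tag": "NN"}],
          [(0, 1, "advmod")])
# index.candidates([("word", "home"), ("tag", "VB")]) == {0}
```

Only the surviving candidates need the expensive syntactic verification, which is what keeps response times interactive at corpus scale.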
We then iterate over the returned Odinson matches and verify the constraints that were not on the path. Conceptually, the Odinson system works by first using Lucene's reverse-index for retrieving sentences for which there is a token matching each of the specified token-constraints, and then verifying the syntactic between-token constraints. To improve the Lucene-query selectivity, tokens are indexed with incoming and outgoing syntactic edge label information, which is incorporated as additional token-constraints to the Lucene engine. The system easily supports millions of sentences, returning results at interactive speeds. (Lucene: https://lucene.apache.org; Odinson: https://github.com/lum-ai/odinson/.)

We introduce a simple query language that allows to pose complex syntax-based queries, and obtain results at interactive speed. A search interface over Wikipedia sentences is available at https://allenai.github.io/spike/. We intend to release the code as open source, as well as providing hosted open access to a PubMed-based corpus.

Acknowledgments

We thank the team at LUM.ai and the University of Arizona, in particular Mihai Surdeanu, Marco Valenzuela-Escárcega, Gus Hahn-Powell and Dane Bell, for fruitful discussion and their work on the Odinson system. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).
References
Alan Akbik, Oresti Konomi, and Michail Melnikov. 2013. Propminer: A workflow for interactive information extraction and exploration using dependency trees. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 157-162, Sofia, Bulgaria. Association for Computational Linguistics.

Alan Akbik, Thilo Michael, and Christoph Boden. 2014. Exploratory relation extraction in large text corpora. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2087-2096.

Liesbeth Augustinus, Vincent Vandeghinste, and Frank Van Eynde. 2012. Example-based treebank querying. In LREC.

Jeen Broekstra and Arjohn Kampman. 2003. SeRQL: A second generation RDF query language. In Proc. SWAD-Europe Workshop on Semantic Web Storage and Retrieval, pages 13-14.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535-1545. Association for Computational Linguistics.

Sumukh Ghodke and Steven Bird. 2012. Fangorn: A system for querying very large treebanks. In Proceedings of COLING 2012: Demonstration Papers, pages 175-182, Mumbai, India. The COLING 2012 Organizing Committee.

Wolfgang Lezius, Hannes Biesinger, and Ciprian-Virgil Gerstenberger. 2002. TIGERSearch manual.

Juhani Luotolahti, Jenna Kanerva, and Filip Ginter. 2017. Dep_search: Efficient search tool for large dependency parsebanks. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 255-258, Gothenburg, Sweden. Association for Computational Linguistics.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55-60.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659-1666.

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 915-926, Denver, Colorado. Association for Computational Linguistics.

Philip Resnik and Aaron Elkiss. 2005. The Linguist's Search Engine: An overview. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 33-36, Ann Arbor, Michigan. Association for Computational Linguistics.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2371-2378.

Marco A. Valenzuela-Escárcega, Özgün Babur, Gus Hahn-Powell, Dane Bell, Thomas Hicks, Enrique Noriega-Atala, Xia Wang, Mihai Surdeanu, Emek Demir, and Clayton T. Morrison. 2018. Large-scale automated machine reading discovers new cancer-driving mechanisms. Database, 2018.

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, and Dane Bell. 2020. Odinson: A fast rule-based information extraction framework. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France. European Language Resources Association (ELRA).

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu, and Thomas Hicks. 2015. A domain-independent rule-based framework for event extraction. In