Bootstrapping Relation Extractors using Syntactic Search by Examples
Matan Eyal, Asaf Amrami, Hillel Taub-Tabib, Yoav Goldberg
Allen Institute for AI, Tel Aviv, Israel
Bar Ilan University, Ramat-Gan, Israel
matane,asafa,hillelt,[email protected]
Abstract
The advent of neural networks in NLP brought with it substantial improvements in supervised relation extraction. However, obtaining a sufficient quantity of training data remains a key challenge. In this work we propose a process for bootstrapping training datasets which can be performed quickly by non-NLP-experts. We take advantage of search engines over syntactic graphs (such as Shlain et al. (2020)) which expose a friendly by-example syntax. We use these to obtain positive examples by searching for sentences that are syntactically similar to user input examples. We apply this technique to relations from TACRED and DocRED and show that the resulting models are competitive with models trained on manually annotated data and on data obtained from distant supervision. The models also outperform models trained using NLG data augmentation techniques. Extending the search-based approach with the NLG method further improves the results.
The goal of Relation Extraction (RE) is to find and classify instances of certain relations in raw text. We denote a binary relation instance, i.e. a relation instance with two arguments, with a tuple x = (s, e1, e2, r), where s = [w1 ... wn] is a sequence of sentence tokens, e1, e2 are entity mentions within s corresponding to the first and second relation argument, respectively, and r ∈ R ∪ {∅} is a relation label from a set of predefined relations of interest, or an indication of 'no-relation'. In binary classification our goal is to classify whether, according to s, the entity mentions e1 and e2 satisfy r, the relation label. For such classification we require a training dataset X, comprised of Xp, a set of positive examples representing the relation of interest, and Xn, a set of negative examples.

The success of recent papers (Soares et al., 2019; Murty et al., 2020) in supervised RE is fueled by advances in deep learning, but also, crucially, by the availability of a large training set such as TACRED (Zhang et al., 2017), containing tens of thousands of training examples. For most relations of interest, such training data is not available. In this work we examine methods to inexpensively construct Xp and Xn in cases where a training set is not available. We are especially interested in constructing the positive set, Xp.

In contrast to common NLP tasks like POS tagging, entity extraction and dependency parsing, the task of relation extraction exhibits a much larger degree of label sparsity. For some relations, even when considering only sentences with entities of the relevant types, the ratio between positive and negative examples is highly skewed toward the latter, and obtaining even a modest amount of positive examples requires a laborious annotation effort (see §3). While manual annotation of large datasets is a viable approach, it typically requires contracting a team of professional annotators (Doddington et al., 2004; Ellis et al., 2015) or crowd workers (Zhang et al., 2017; Yao et al., 2019) and is not well suited for smaller projects or for ad-hoc extraction tasks.

Our main contribution in this paper is a new methodology built on top of Shlain et al. (2020) for cheaply obtaining large datasets (§6). Shlain et al. (2020) proposed a syntactic search engine that, given a lightly annotated example sentence, retrieves new sentences with a similar syntactic structure from a pre-annotated dataset. Our syntactic search bootstrapping method requires a small number of manually curated positive example sentences. The search engine matches are then used as training data for ML models. We evaluate this approach by comparing it to human-annotated data of varying sizes.

While this method shows promising results with very few user input examples, we also test the impact on performance when more examples are used. One technique for obtaining an abundance of examples uses recent Natural Language Generation (NLG) models (§7.1). It has been shown in recent papers (Wei and Zou, 2019; Anaby-Tavor et al., 2019; Kumar et al., 2020; Amin-Nejad et al., 2020; Russo et al., 2020) that generating an abundance of training examples can improve classifier performance. We aim to check whether this can improve our syntactic search method as well.

We evaluate the proposed methodologies by training DL classifiers on the obtained data.
We show that:
(1) Syntactic patterns are competitive at bootstrapping training data for ML, even with as few as 3 patterns;
(2) Training DL models over the output of syntactic patterns can significantly improve both recall and F1 over a rule based approach which uses the patterns directly;
(3) Training ML models over the output of syntactic patterns performs better than training models over recently popular NLG data augmentation techniques;
(4) Augmenting the output of syntactic patterns using NLG techniques is often helpful;
(5) Different relations benefit from different strategies.

The code for all our experiments, alongside the generation outputs, is publicly available at github.com/mataney/BootstrappingRelationExtractors.

Distant Supervision.
Since its introduction, Distant Supervision (Mintz et al., 2009) has established itself as a viable alternative to manual annotation. Distant Supervision assumes the availability of a knowledge base (KB) of ⟨e1, r, e2⟩ triplets, where e1, e2 are entities known to satisfy relation r. To obtain training examples for a relation r, we sample sentences from a large background corpus: sentences which include entity pairs listed in the KB as satisfying r are labeled positive; the remaining sentences are labeled negative (potentially after satisfying additional constraints). While effective in some cases, the reliance on large pre-existing KBs is a significant limitation. Such KBs are not usually available and the cost of constructing them is high.

Bootstrapping from Rules, Snorkel.
To eliminate the reliance on external KBs, Angeli et al. (2015) used the predictions of a rule based extractor on a large corpus to train a first iteration of a statistical extractor. They then continued to refine the extractor through self-training.

Another system which can optionally utilize rules instead of external KBs is Snorkel (Ratner et al., 2017). Snorkel implements the data-programming paradigm (Ratner et al., 2016), where ML models are trained in three stages: (i) users write labeling functions that weakly label data points using arbitrary heuristics (e.g. extraction rules); (ii) the system learns a re-weighted combination of the labeling functions by explicitly modeling the actual distribution of each class, with results that are often precise but low-recall; and (iii) the system uses discriminative models to increase recall while preserving precision.

The techniques used by Angeli et al. (2015) and Snorkel can be effective in increasing the accuracy of the initial labeling rules, but coming up with "good enough" initial rules remains a major challenge. In this sense, the search-based methods suggested in this work for bootstrapping RE datasets are complementary and can be plugged in as a first step in these multi-step solutions.

Only a few papers can be directly compared to ours in that they use matches as training data for ML classifiers. One such paper is Angeli et al. (2013), which claims that training a classifier using search-based examples works better than traditional bootstrapping methods. See §6.2 for further comparison with Angeli et al. (2013).
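To make the labeling-function idea above concrete, here is a minimal, purely illustrative sketch in plain Python (it intentionally does not use Snorkel's actual API); the example dictionary layout is our assumption:

```python
# Weak labels in the spirit of data programming: POSITIVE / ABSTAIN votes from
# cheap heuristics, later re-weighted and used to train a discriminative model.
POSITIVE, ABSTAIN = 1, -1

def lf_founded_between(example):
    """example: {'tokens': [...], 'e1': (start, end) PERSON span,
    'e2': (start, end) ORG span} -- an assumed, illustrative format."""
    (s1, t1), (s2, t2) = example["e1"], example["e2"]
    lo, hi = (t1, s2) if s1 < s2 else (t2, s1)   # tokens strictly between the two spans
    between = {tok.lower() for tok in example["tokens"][lo:hi]}
    if {"founded", "co-founded", "founder"} & between:
        return POSITIVE
    return ABSTAIN
```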
Augmentation Through Generation.
Similarly to our Example Generation approach, recent papers (Anaby-Tavor et al., 2019; Kumar et al., 2020) suggest using pre-trained language models for data augmentation. In both these papers, the authors suggest prepending class labels to generative models in order to augment the number of instances for classes with a small number of examples. In contrast to these papers, we use language models in a zero-shot context, and rather than requiring existing labeled examples of the relevant relation, we propose to manually label the generated samples.
In contrast to linguistic annotation tasks such as parts-of-speech, syntactic trees or semantic roles, annotating data for relation extraction does not require special expertise. Annotation can be easily performed by a motivated native speaker of the language (in the case of "every-day" relations such as those available in TACRED and DocRED) or by a domain expert (in the case of "specialized" relations such as in biomedicine or law). Annotating a given sentence for a given relation takes roughly the time it takes to read and understand the sentence. So what stops us from obtaining large amounts of annotated data for ML?

The annotation challenge lies in relation sparsity in the wild. To get a perspective on this issue, let's consider the founded-by relation between a PERSON and an ORG, as attested in the TACRED corpus. Assuming we consider only sentences that contain both a person mention and an organization mention, how many sentences do we have to annotate before we reach, for example, 10 positive examples? The TACRED training set has 124 founded-by instances, as well as 6947 "negative" instances with matching entity types ("negative" examples are either other relations, or no-relation). This 1-out-of-57 ratio indicates that we will likely sample 56 "negative" sentences before hitting a positive instance. (See Appendix A for similar distributions over all relations.) This ratio is overly optimistic, as the annotations in the TACRED corpus are already very skewed in favor of positive examples. Even under this very optimistic scenario, we will need to annotate 570 sentences to recover 10 positive examples. The cost of annotation, then, is not in annotating each individual positive sentence, but in finding the sentences to annotate in the first place. Therefore, we should seek methods that point towards probable positive instances.

In this paper, we present two methods: the first returns close to a 1-out-of-1 positive ratio, although with low syntactic diversity, and the second has roughly a 1-out-of-3 positive ratio.
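The arithmetic behind this estimate can be made explicit with a short, purely illustrative computation (the numbers are the TACRED founded-by counts quoted above):

```python
# Back-of-the-envelope estimate of annotation effort for founded-by.
positives, negatives = 124, 6947          # TACRED instances with PERSON/ORG entity types
neg_per_pos = negatives / positives       # ~56 negatives per positive (1-out-of-57)
target_positives = 10
expected_annotations = target_positives * (neg_per_pos + 1)
print(round(neg_per_pos), round(expected_annotations))   # -> 56 570
```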
We are interested in the problem of obtaining a relation classifier for a binary relation when no a-priori annotated training data for this relation is available. We seek a methodology that allows creating an effective extractor using a minimal amount of data annotation effort. We compare four approaches – manual annotation, syntactic search, manual annotation over generated examples, and a combination of the last two – to be described in later sections. Here, we discuss the setup shared by all experiments.

In order to evaluate the methodology on multiple datasets with similar relations, we chose a set of relations that appear in both the TACRED (Zhang et al., 2017) and DocRED (Yao et al., 2019) datasets with at least 50 development examples. (The relations are org:country_of_headquarters, org:founded_by, per:children, per:city_of_death, per:date_of_death, per:origin, per:religion and per:spouse for TACRED, and correspondingly headquarters location, founded by, child, place of death, date of death, country of origin, religion and spouse for DocRED.) To quantify the performance of our methodology we assess it against varying amounts of manually annotated data. In our settings, large amounts of supervised examples represent an upper bound for our bootstrapping methods and are not expected to be available.

While relation extraction is often considered as a multi-class classification problem ("find the occurrences of any of these possible relations"), we instead treat the relations separately, training a binary classifier for each one. We believe this is more representative of a user who wishes to target a low number of relations, and who is likely to conduct data collection and evaluation for one relation at a time.

Obtaining Negative Examples
When training a binary classifier, it is required to include a set of negative examples alongside the list of positive examples. In all our experiments we obtain negative examples by looking for sentences that contain entity types that are compatible with the relation (i.e., for the founded-by relation we sample sentences that include both a PERSON and an ORG). In our syntactic based methods we sample from the same domain as our positive examples (Wikipedia) and then filter this list by removing sentences in which the entities are connected by a syntactic pattern which is attested by the positive examples. For the supervised baselines of various sizes, we obtain negative examples by sampling them from the annotated training set, without replacement. (The positive to negative ratio in the training data has an effect on the resulting model's quality. We experimented with positive-to-negative ratios of 1, 5, 10 and 20, as well as with a "match the dev-set" ratio, and found a ratio of 10 negative examples for each positive example to perform well.)
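A minimal sketch of this filter-and-sample step; the candidate tuples and the set of positive dependency paths are assumed to come from whatever NER and parsing pipeline is in use, and the 1:10 ratio follows the observation above:

```python
import random

def build_negative_set(candidates, positive_paths, n_positives, neg_ratio=10, seed=0):
    """candidates: (sentence, e1, e2, dep_path) tuples whose entity types are
    compatible with the relation; positive_paths: syntactic patterns attested
    by the positive examples."""
    # drop candidates whose entities are connected by a positive pattern
    pool = [c for c in candidates if c[3] not in positive_paths]
    random.seed(seed)
    # roughly 10 negatives per positive was found to work well in our setting
    k = min(len(pool), neg_ratio * n_positives)
    return random.sample(pool, k)
```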
Datasets
We used two datasets to explore our different methods: TACRED (Zhang et al., 2017), a large-scale multi-class relation extraction dataset built over newswire and web text, and DocRED (Yao et al., 2019), a dataset for document level RE, similarly designed for multi-class prediction. Per our setup above, we changed the setting of both datasets to per-relation binary classification. As our main goal in this paper is to evaluate different bootstrapping methods, and not novel methods for document-level relation extraction, we chose to include only instances with a single supporting sentence in DocRED (i.e. sentence level relations). As DocRED's labelled test set is not publicly available, we used the development set as our test set and used 20% of the train set as a development set.
Models
Our classifiers throughout the following experiments are based on the Entity Markers architecture (Soares et al., 2019). In that paper, the authors proposed wrapping the relation arguments with marker tokens (e.g. [E1start] John [E1end] was born in [E2start] ... [E2end]). The altered text is then passed as input to a BERT model (Devlin et al., 2018), where the relation between the two entities is represented by the concatenation of the final hidden states corresponding to their respective start tokens. Finally, this representation is fed into a classification head and the model is fine-tuned for relation classification; cf. Soares et al. (2019) for more details. We use a similar model, with the exception that we use a more recent pretrained language model, RoBERTa (Liu et al., 2019), and perform binary, rather than multi-class, classification.

In all of the following experiments we trained our model with 3 different random seeds to lower the variance introduced by different initializations, and report the average score. At inference time we set the prediction threshold value for the test set to be the cut-off value that maximized F1 over the development set.
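A minimal sketch of this architecture, assuming HuggingFace transformers; the exact marker strings and head details are our assumptions, not the paper's released code:

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]  # assumed marker strings

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})

class EntityMarkerClassifier(nn.Module):
    """Binary relation classifier: concatenate the hidden states at the two
    entity-start markers and feed them to a linear head."""
    def __init__(self, encoder_name="roberta-base", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.encoder.resize_token_embeddings(len(tokenizer))
        self.head = nn.Linear(2 * self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, e1_start, e2_start):
        # e1_start / e2_start: index of each entity's start-marker token per example
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        rows = torch.arange(input_ids.size(0))
        rep = torch.cat([hidden[rows, e1_start], hidden[rows, e2_start]], dim=-1)
        return self.head(rep)
```

Training-loop details (the three random seeds and the development-set F1 threshold tuning described above) are plain bookkeeping and are omitted from the sketch.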
Setup.
Our comparison point throughout the paper is a model trained on traditionally-collected annotated data. We sample increasing-sized annotated sets from TACRED and DocRED, containing 55, 110, 220, 550, and 1100 examples. These correspond to 5, 10, 20, 50, 100 positive examples with 50, 100, 200, 500, 1,000 negative examples. Additionally, we measure the performance on these datasets when using all available positive examples for each relation.
Results
Listed in the top rows of Table 1, averaged over all relations. Unsurprisingly, increasing the number of examples increases performance, with the exception of DocRED, on which using all positive labels performs slightly worse than using 100 sampled positive examples for each relation; we attribute this to sampling noise. DocRED scores are generally lower than TACRED scores. This is because of the way we constructed the development and test sets: while in TACRED's development set each sentence includes a single entity pair with a single relation, in DocRED we pass all possible sentences with entity pairs of the same type as the evaluated relation as possible candidates. This dramatically increases the number of candidates, and with it the number of possible type I errors. Moreover, as we included only examples with exactly one supporting sentence, the number of positive examples is low for some of the relations. All of this affects DocRED classification scores compared to TACRED. Importantly, in all these experiments, the number of annotated examples used is significantly higher than the number used in our Syntactic Search experiments (3 examples in total).

Figure 1: Flow of the Syntactic Search by Example method (Sentence → Query → Matching Examples → Train): the sentence "Paul founded Microsoft in April 1975" yields the query e1:[e=PER]Paul $founded e2:[e=ORG]Microsoft in 1975, whose corpus matches are used as positive labeled examples. For details, see §6.
We consider this section to be the main contribution of the work. We show that: (i) with modern DL modeling, effective relation extractors can be trained using sentences derived from less than a handful of syntactic patterns; and (ii) through the use of by-example syntactic search engines, one can construct these patterns very quickly, without needing to understand syntax.

To explain the suggested workflow, let's consider a user who wants to train a binary relation extraction classifier for the founded-by relation, and has a single example sentence, "Paul founded Microsoft in April 1975". Patterns over syntactic structures and entity types are very effective for deriving high-precision extraction templates. For example, searching for sentences containing the word "founded" with an nsubj dependent of type PERSON and a dobj dependent of type ORG will return many matches for the founded-by relation.
Table 1: Average test F1 score over all relations. Pattern Based RE was given 3 positive patterns. Synt. Search is trained on data created from the same 3 patterns. The Annotated experiments are denoted by the number of positive + negative examples.
There are two issues with this approach: (1) while high-precision, the recall of the patterns is low; and (2) syntactic patterns require both linguistic and computational expertise to specify and execute.

The premise of this paper is that the low recall can be offset by machine learning. The sentences resulting from syntactic search over a few patterns are diverse enough that an ML model trained over them manages to generalize from the specific syntactic pattern and identify a broader range of cases, increasing recall substantially. We show this is indeed the case.

To overcome the need for linguistic expertise we propose using a by-example syntactic search engine (Shlain et al., 2020), which allows users to execute syntactic queries based on example sentences: the user enters a sentence satisfying the relation of interest and annotates it with light markup indicating the arguments and the trigger words. The system then automatically translates the markup into a syntactic pattern, matches it against a large pre-annotated corpus (e.g. all Wikipedia sentences), and returns results. The user does not need to be familiar with syntactic formalisms or with advanced NLP.

Fig. 1 demonstrates the user process. Starting with the sentence "Paul founded Microsoft in April 1975", the user marks Paul as e1 (e1:) with an entity-type restriction of PERSON ([e=PER]), Microsoft as e2 (e2:) with an entity-type ORG ([e=ORG]), and founded as a trigger word ($founded). The SPIKE system (https://spike.apps.allenai.org) translates the query into a syntactic graph, which is then matched against Wikipedia, returning 11,345 sentences matching the pattern (note that the word 'founded' is matched lexically, while Paul and Microsoft become placeholders for any person and any organization that adhere to the syntactic configuration). A subset of the returned sentences is then used as positive examples for model training.

While 11,345 cases make an impressive training set, these sentences share the same core syntactic configuration, and classifiers trained on these matches will not necessarily generalize well. The matches will also share the exact same lexical predicate ("founded"). The lexical diversity of the predicate can be expanded by the user by supplying alternative words, perhaps aided by distributional similarity methods such as word2vec, or by querying a bi-LM such as BERT (Devlin et al., 2018) (§6.2.1). To counter the lack of structural diversity, the user can supply additional patterns derived from example sentences. For example, the user may also supply '[e2 Microsoft]'s founder [e1 Paul]' (possessive construction) and '[e2 Microsoft] was founded by [e1 Paul]' (passive) as additional patterns (§6.2).
For each relation, we select 3 representative sentences and annotate them based on the process described above. (The selection of representative sentences is based on a heuristic process: we intuitively conceive of basic sentences exemplifying the relation, construct the corresponding SPIKE queries and briefly validate the number and quality of the returned results. We limit the number of seed examples to 3 since we believe coming up with 3 examples should be simple even for non-experts. In §7.1 we show that using more seed examples can further improve performance.) We do not perform any lexical expansion of trigger words beyond the initial pattern at this point. The queries are processed by SPIKE (Shlain et al., 2020) and the results are used as positive instances in the generated training set. A full list of the SPIKE queries we used can be found in Appendix D, Table 5.

We also compare the TACRED classifier to a rule based extractor which uses the syntactic queries directly. Each syntactic query is added as a syntactic pattern to this extractor: any sentence which satisfies one of the syntactic patterns is labeled as a positive instance; sentences which do not satisfy any of the patterns are labeled negative.

Results
Listed in the Synt. Search and Pattern Based RE rows of Table 1: Pattern Based RE, using just the 3 patterns per relation, achieves a very low F-score of 12.8%, due to low recall. However, this is already competitive with training a classifier on 5-10 positive examples per relation. Training a classifier on the extracted relations increases the scores significantly, to 44.3 F1 on TACRED and 26.6 F1 on DocRED, approaching supervised training on 50+500 annotations (for TACRED) or 20+200 annotations (for DocRED). (These results correspond to 100+1,000 examples for TACRED and 1,000+10,000 for DocRED; for results and discussion of different dataset sizes, see Appendix B.) This result demonstrates that training an ML model over the output of a rule based model can significantly improve performance, echoing similar conclusions in Angeli et al. (2013). Interestingly, Angeli et al. (2013) used a total of 4,697 patterns across 41 relations, an average of 114 patterns per relation. We demonstrate that by applying syntactic patterns to a large corpus and using modern DL classifiers, results competitive with manual annotation baselines can be reached with as few as 3 syntactic rules.

Constructing queries from 3 seed sentences produces retrieved sentences with low lexical diversity: e.g. if all the seed sentences for founded-by use the word "founded" to express the relation, then all retrieved sentences will likewise include the word "founded", and exclude alternatives like "established", "formed", "started", etc. In this experiment we generalize the seed queries to allow a list of trigger words rather than a single word. We consider only relations which include a lexical trigger in their seed patterns (per:children, per:date of death, org:founded by, per:city of death and per:spouse, and DocRED's child, date of death, founded by, place of death and spouse). Alternative triggers are selected by reviewing the closest words to the original triggers in word2vec's embedding space (Mikolov et al., 2013). Appendix C includes the lists of alternative lexical triggers used. We train classifiers on 100+1,000, 500+5,000 and 1,000+10,000 examples obtained from these expanded-trigger queries.

Dataset   Predicates    100     500     1000
TACRED    One Trig.     0.487   0.459   0.461
TACRED    Trig. List    0.517   0.490   0.478
DocRED    One Trig.     0.290   0.336   0.338
DocRED    Trig. List    0.316   0.338   0.337

Table 2: F1 scores for founded by, child, place of death, date of death and spouse when expanding the trigger list for the Syntactic Search "by Example" method.
Results
As illustrated in Table 2, adding alternative triggers improves results across all sample sizes for TACRED, and for the 100+1,000 size in DocRED.
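A minimal sketch of the trigger-expansion step, assuming a locally available pretrained word2vec file loaded with gensim (the path is a placeholder); in practice the nearest neighbours are manually reviewed before being added to the query:

```python
from gensim.models import KeyedVectors

# hypothetical path to a pretrained word2vec model in the binary format
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def candidate_triggers(seed_triggers, topn=20):
    """Collect nearest neighbours of the seed triggers for manual review."""
    candidates = set(seed_triggers)
    for trigger in seed_triggers:
        if trigger in vectors:                     # skip out-of-vocabulary seeds
            for word, _ in vectors.most_similar(trigger, topn=topn):
                candidates.add(word)
    return sorted(candidates)

# e.g. candidate_triggers(["founded", "founder"]) -> review, keep the valid alternatives
```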
We showed how the Syntactic Search by Example method works with only a few human annotated examples. In this section we pursue NLG based methods to expand the number of exemplary patterns. Generative language models, compared to other methods for data augmentation (e.g. iterative bootstrapping and distant supervision), are highly accessible and require low technical expertise (sometimes passing a prompt is enough). Moreover, recent papers (Wei and Zou, 2019; Anaby-Tavor et al., 2019; Kumar et al., 2020; Amin-Nejad et al., 2020; Russo et al., 2020) report high impact of such models on the closely related Data Augmentation task. We therefore present several methods that take advantage of such models for RE bootstrapping.

First we show how a user can produce a high number of generated sentences using GPT2 (Radford et al., 2019). Then we demonstrate how the generated sentences can be integrated in the Syntactic Search by Example method (§7.1). Finally, in order to validate the necessity of the syntactic search in this flow, we compare it to feeding the raw generations as inputs to a classifier (§7.2).
Generating Examples with an LM

The user-flow
Depicted in Fig. 2: the user enters a relation prompt ("Paul founded Microsoft"), to which the system responds by returning sentences that express the same relation. While not all returned sentences express the relation, many of them do. To filter out out-of-relation sentences, the user goes through the list until she identifies a predefined number of positive examples (here we used 100 sentences). In our experiments we encountered 1 positive example for every 3 examples annotated. This 1-out-of-3 ratio is significantly better than blindly sampling from a corpus (1-out-of-57, see §3), and can thereby considerably save annotation time. For each example, the user marks the relevant entities, and optionally also the trigger word (the main word indicating the relation). These examples are then used as additional input examples to the syntactic search engine (§7.1) or as training datasets for ML models (§7.2).

Figure 2: Flow of sampling examples from a conditional language model. A prompt ("Paul founded Microsoft in April 1975.") is passed to the generation model; the sampled sentences (e.g. "Paul, the Microsoft co-founder who is now the company's chairman...", "Paul, co-founder and chairman of Microsoft...", "Paul works at Microsoft.") are annotated, and the annotated sentences are either used as exemplary patterns for Syntactic Search or directly as positive labeled examples for training. The "Syntactic Search" step corresponds to §7.1; skipping this step corresponds to §7.2.
Technical details
We begin with a large pre-trained LM (we use GPT2-medium (Radford et al., 2019)) and fine-tune it for the generation task. The method assumes the availability of relation-annotated data, though its relations do not need to overlap with the ones we are attempting to extract (in our case, we ensured the groups are distinct). The approach can be considered an instance of transfer learning, where we attempt to transfer the example-generation knowledge from the training relations to novel relations. Given the annotated RE dataset, we consider positive examples of the form (s, e1, e2, r), where r ∈ R. We transform each instance into a conditioned LM training example, in which the LM sees a prefix (prompt) and should complete it. In our case the prompt is derived from (e1, e2, r), followed by a special symbol, and we train the LM to produce the corresponding sentence s. To derive the prefix we apply a pre-defined template associated with each relation r. The template has two slots to be filled with the entities e1 and e2. For example, a template for the founded-by relation can take the form "[e1] founded [e2]". We then fine-tune GPT2 on these training examples. (In our experiments, we use on average 3 different templates for each relation type, so a single annotated relation example will result in 3 (on average) different fine-tuning examples for the LM, each with a different prompt.) At inference time, the user provides a single prompt based on their desired relation.

Given the user prompt, we generate 1000 sentences with nucleus sampling (Holtzman et al., 2019), with a length of up to 50 tokens. We annotate the generated sentences until reaching 100 positive instances (usually requiring 200-300 sentences); this takes up to 1.5 hours per relation. These generated sentences are annotated and used as inputs to the syntactic search method (§7.1) or directly as positive examples to a classifier (§7.2).

We integrate the generation outputs in the Syntactic Search by Example method by taking the positive annotated examples (on which we mark the entities as part of the annotation process) and automatically transforming them into SPIKE queries. This step has the potential to add substantial syntactic and lexical diversity to the pattern set, resulting in both larger and more diverse sets of positive examples. This combines the best of both worlds: the generative model is used to provide structural and lexical diversity, while the syntactic search system is used to provide a large selection of naturally occurring corpus sentences adhering to these patterns.
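A minimal sketch of the generation step with HuggingFace transformers, assuming the model has already been fine-tuned on template-filled prompts as described above; the model path, the separator symbol and the top-p value are our assumptions (the paper's exact value is not given here):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("path/to/finetuned-gpt2-medium")  # hypothetical path
model.eval()

prompt = "Paul founded Microsoft <SEP>"   # relation prompt + assumed separator symbol
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,            # nucleus (top-p) sampling
        top_p=0.9,                 # assumed value
        max_new_tokens=50,         # up to 50 generated tokens per sentence
        num_return_sequences=20,   # call repeatedly to accumulate ~1000 generations
        pad_token_id=tokenizer.eos_token_id,
    )

sentences = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```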
Experiments and Results

Setup
To reduce noise, we exclude queries where more than 1 out of 5 sampled results does not express the relation of interest. On average, we increased the number of syntactic patterns to 9.25, ranging from 6 to 14 after filtering.
Results
As listed in the
Search + Generation row of Table 1, this method achieved the best performance for both TACRED and DocRED, with overall scores corresponding to 550/1100 and 220/550 annotated examples respectively. Using the generation outputs as examples does not only help in suggesting more sentences satisfying the relation, but also in augmenting the number of predicates used. We looked at the number of predicates used for the TACRED relations which include lexical triggers (similarly to §6.2.1): the generation phase suggested more predicates per relation on average than our original patterns, though fewer than the trigger expansion method we suggested in §6.2.1, where we tried to find all the possible predicates. We conclude that while the Syntactic Search by Example method performs well with only a few example patterns, it can be further improved with more input examples. While we report that Syntactic Search by Example enjoys such generation-based pattern augmentation, a similar boost with different, non-NLG, methods is of course possible. We leave further probing of other pattern augmentation methods as future work.

It is possible that generative models produce diverse enough training examples to render our syntactic search superfluous. We validate the necessity of passing the annotated generations (annotated similarly to §7.1) through the Syntactic Search by Example method by comparing it to simply passing the annotated generations as classifier inputs, as depicted in the RHS of Fig. 2.
Experiments and Results

Setup
Many of the samples include the entities from the prompt verbatim. Before using them as the model inputs, we replace the entities with a random Wikipedia entity of the same type.
Results
As can be seen in the
Example Generation row in Table 1, on TACRED this method produces F1 scores on par with Syntactic Search by Example. However, evaluating on DocRED, the method does not produce competitive results. (The language model used to generate examples was fine-tuned on a version of TACRED which excludes the relations we evaluate on. Still, for TACRED, the language model is fine-tuned and evaluated on data from the same domain (newswire). The DocRED data, on the other hand, is taken from Wikipedia, so the evaluation is essentially out of domain. We therefore conclude that, used independently, this approach is applicable only in cases where a background RE dataset is of the same domain as the target corpus from which we want to extract relations.) On both datasets it produces worse results than Search + Generation. We conclude that it is more beneficial to use outputs of generative models as syntactic search queries, and thereby find syntactically similar sentences, than to simply use the generations as the train set. We deduce that models are likely to generalize better on "real world" examples.
Analyzing the results we highlight some interesting trends (Fig. 3). First, we note that the behavior is not consistent between relations, nor datasets: different relations behave differently, showing different trade-offs between the different methods.

Classifiers for relations like "Religion", "City of Death" and "Date of Death" seem to plateau at around 50-100 manually annotated examples. (TACRED's "religion" relation plateaus as it has a low number of train instances.) For these relations, annotating more data is not necessarily useful. The syntactic search approach works especially well for these relations: applying syntactic search over 3 seed queries is sufficient to yield results on par with or slightly higher than all available manually annotated data. We hypothesize that these findings might be the result of low diversity in the ways these relations are typically expressed.

Figure 3: F1 scores of TACRED (right) and DocRED (left), by relation (children/child, origin/country of origin, date of death, founded by, country of HQ/HQ location, city of death/place of death, religion, spouse).

While the combined
Search + Generation approach is overall useful, the effect is not consistent across relations: performance improves for some relations and deteriorates for others. In §7.1 we described the techniques we use to reduce the noise coming from additional queries. These techniques, however, are rather basic, and these results indicate that more advanced techniques of the type used in Angeli et al. (2015) and Ratner et al. (2017) are likely to yield more consistent improvements.
Distant supervision (Mintz et al., 2009) suggests a method to construct a training dataset based on a large external KB of relation triplets. Yao et al. (2019) offered a machine annotated version of DocRED constructed by aligning Wikipedia pages with Wikidata. The authors took great care in creating this resource: a high-quality NER model trained on in-domain manually annotated data was used to automatically annotate possible relation arguments; a named entity linker was used to merge entities with similar KB ID; and finally, Wikidata was queried in order to label pairs of linked entities. We trained a classifier using the released data, sampling increasing numbers of examples (100+1,000, 500+5,000, 1,000+10,000). We report the best score of 0.312 F1 (500+5,000 split).

Results
This Distant Supervision dataset, created by Yao et al. (2019), appears to be of very high quality, and the results are on par with the full set of manually annotated data. These results indicate that given a large KB of relation triplets, a high-quality in-domain NER, and a high quality linking solution, distant supervision is a very promising technique. It should be noted, however, that the availability of all these external resources is very rare in practice, and they are not required by the methods proposed in this work.
We explored only English in this work. However, we argue that our main method – example-based syntactic search followed by DL training – is not strongly tied to English, and we encourage other researchers to experiment with it in their languages of interest. We provide details of what is needed to adapt the system to a different language.

The Syntactic Search by Example method requires: (1) an automatically dependency-parsed corpus in the language, which can be readily produced by the many syntactic parsers available for many languages (Manning et al., 2014; Honnibal and Montani, 2017; Qi et al., 2020); (2) an indexing engine that supports efficient queries over parse trees – Shlain et al. (2020) use the open-source Odinson engine (Valenzuela-Escárcega et al., 2015) for this purpose; and (3) a component that translates a query in SPIKE's "by example" syntax to the indexing engine's query syntax. This requires finding the minimal (in terms of number of nodes) sub-graph that connects all relation arguments (and predicates, if available), then searching for sentences with similar sub-graphs in the index. With these three components, a syntactic-search system can be readily implemented. The rest of the components are straightforward applications of DL methods. Indeed, we suspect the major obstacle in application to a new language will be the availability of evaluation data.
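A minimal sketch of the sub-graph extraction in component (3), assuming the dependency parse is given as head indices; for a tree, the union of pairwise shortest paths between the marked tokens is exactly the smallest connecting sub-graph:

```python
import networkx as nx

def minimal_connecting_subgraph(heads, labels, terminals):
    """heads[i]: head token index of token i (-1 for the root);
    labels[i]: dependency label of the arc heads[i] -> i;
    terminals: indices of the tokens marked as arguments / trigger."""
    g = nx.Graph()
    for i, h in enumerate(heads):
        if h >= 0:
            g.add_edge(h, i, label=labels[i])
    nodes = set(terminals)
    for a in terminals:
        for b in terminals:
            nodes.update(nx.shortest_path(g, a, b))   # path within the dependency tree
    return g.subgraph(nodes)

# e.g. for "Paul founded Microsoft in April 1975" with Paul, founded and Microsoft
# marked, the result is the nsubj/dobj sub-tree around "founded".
```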
10 Conclusion
We show that with modern DL classifiers, a dataset bootstrapped using syntactic search from as few as 3 seed patterns can be as effective as a dataset with hundreds of manually annotated samples. Using LMs helps to further diversify the dataset and improve results. Overall, our results are optimistic for bootstrapping methods. However, this work is only an initial step in exploring methods for bootstrapping relation extractors using minimal user effort, supported by strong pre-trained neural LMs. We hope to encourage further work in this direction.
11 Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).
References
Ali Amin-Nejad, Julia Ive, and Sumithra Velupillai. 2020. Exploring transformer text generation for medical dataset augmentation. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4699–4708.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2019. Not enough data? Deep learning to the rescue! arXiv preprint arXiv:1911.03118.

Gabor Angeli, Arun Tejasvi Chaganty, Angel X Chang, Kevin Reschke, Julie Tibshirani, Jean Wu, Osbert Bastani, Keith Siilats, and Christopher D Manning. 2013. Stanford's 2013 KBP system. In TAC.

Gabor Angeli, Victor Zhong, Danqi Chen, Arun Tejasvi Chaganty, Jason Bolton, Melvin Jose Johnson Premkumar, Panupong Pasupat, Sonal Gupta, and Christopher D Manning. 2015. Bootstrapped self training for knowledge base population. In TAC.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M Strassel, and Ralph M Weischedel. 2004. The automatic content extraction (ACE) program – tasks, data, and evaluation. In LREC, volume 2, page 1. Lisbon.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie M Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In TAC.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Shikhar Murty, Pang Wei Koh, and Percy Liang. 2020. ExpBERT: Representation engineering with natural language explanations. arXiv preprint arXiv:2005.01932.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282.

Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567–3575.

Giuseppe Russo, Nora Hollenstein, Claudiu Musat, and Ce Zhang. 2020. Control, generate, augment: A scalable framework for multi-attribute text generation. arXiv preprint arXiv:2004.14983.

Micah Shlain, Hillel Taub-Tabib, Shoval Sadde, and Yoav Goldberg. 2020. Syntactic search by example. In Proceedings of ACL 2020, System Demonstrations.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. arXiv preprint arXiv:1906.03158.

Marco A Valenzuela-Escárcega, Gus Hahn-Powell, Mihai Surdeanu, and Thomas Hicks. 2015. A domain-independent rule-based framework for event extraction. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 127–132.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 35–45.
A Estimated effort for Annotation
Table 3 lists, for each relation, the ratio of positive to negative examples in the TACRED training set. A negative example for a relation r is any example whose entities share types with a positive example of r, but whose label is different from r. Note that TACRED significantly under-represents negative examples, so the reported ratio is an upper bound on the ratio in the wild.

Relation             Pos/Neg Ratio
org:country_of_hq    1/7
org:founded_by       1/56
per:children         1/64
per:city_of_death    1/31
per:date_of_death    1/26
per:origin           1/10
per:religion         1/2
per:spouse           1/52

Table 3: Pos/Neg ratio in TACRED, rounded to the closest fraction.
B Syntactic Search by Example with varying dataset sizes
We experimented with varying the number of sampled examples, using the same 3 seed syntactic patterns. The results are reported in Table 4. While DocRED's F1 scores increase with increasing numbers of sampled examples, the trend is opposite in TACRED. We believe this is due to different initializations and inductive noise in both the positive and negative samples introduced by sampling from semi-noisy data.
Method                 TACRED   DocRED
Synt. Search - 100     0.443    0.250
Synt. Search - 500     0.434    0.259
Synt. Search - 1000    0.427    0.266

Table 4: Syntactic Search by Example with different training sizes.
C Trigger List Expansion
For the majority of patterns used in the Syntactic Search by Example experiments we used a single trigger word (see Appendix D). To experiment with using trigger lists, we modified the patterns in Appendix D in the following way:

We changed the triggers in all child/children patterns to include any of the following possibilities: baby, child, children, daughter, daughters, son, sons, step-daughter, step-son, step-child, step-children, stepchildren, stepdaughter, stepson.

For founded-by relations we changed the "founder" trigger to be any of these triggers: founder, co-founder, cofounder, creator; and changed "founded" to be any trigger from the following list: create, creates, created, creating, creation, co-founded, co-found, debut, emerge, emerges, emerged, emerging, establish, established, establishing, establishes, establishment, forge, forges, forged, forging, forms, formed, forming, founds, found, founded, founding, launched, launches, launching, opened, opens, opening, shapes, shaped, shaping, start, started, starting, starts.

In spouse relations we expanded the "husband/wife" trigger to be any of: ex-husband, ex-wife, husband, widow, widower, wife, sweetheart, bride; and the "marry" trigger to: divorce, divorced, married, marry, wed, divorcing.

For the "date of death" and "place/city of death" relations we changed the "died" trigger to any of: died, executed, killed, dies, perished, succumbed, passed, murdered, suicide.
D Examples used for Syntactic Search by Example

child
<>e1:[e=PER]John 's t:[w={triggers}]daughter , <>e2:[e=PER]Tim, likes swimming.
<>e1:[e=PER]Mary did something to her t:[w={triggers}]son, <>e2:[e=PER]John in 1992.
<>e1:[e=PER]Mary was survived by her 4 t:[w={triggers}]sons, John, John, <>e2:[e=PER]John and John.
triggers = son | daughter | child | children | daughters | sons

founded by
<>e1:[e=ORG]Microsoft t:[w]founder <>e2:[e=PER]Mary likes running.
<>e2:[e=PER]Mary t:[w]founded <>e1:[e=ORG]Microsoft.
<>e1:[e=ORG]Microsoft was t:[w]founded $by <>e2:[e=PER]Mary.

headquarters location
John Doe, a professor at the <>e1:[e=ORG]Oxford <>in:[t=IN]in <>e2:[e=LOC]England likes running.
<>e1:[e=ORG]Oxford, a leading <>t:[t=NN]company <>in:[t=IN]in <>e2:[e=LOC]England.
<>e2:[e=LOC]England pos:[t=POS]'s largest university is <>e1:[e=ORG]Oxford.

religion
<>e1:[e=PER]John is a e2:[w={triggers}]Jewish,
e2:[w={triggers}]Jewish <>e1:[e=PER]John is walking down the street.
<>e1:[e=PER]John is a e2:[w={triggers}]Methodist Person.
triggers = Methodist | Episcopal | separatist | Jew | Christian | Sunni | evangelical | atheism | Islamic | secular | fundamentalist | Christianist | Jewish | Anglican | Catholic | orthodox | Scientology | Islamist | Islam | Muslim | Shia

spouse
<>e1:[e=PER]John 's t:[w=wife | husband]wife, <>e2:[e=PER]Mary , died in 1991.
<>e1:[e=PER]John t:[l]married <>e2:[e=PER]Mary,
<>e1:[e=PER]John is t:[w]married to <>e2:[e=PER]Mary,

origin
<>e2:[e=MISC]Scottish <>e1:[e=PER]Mary is high.
<>e1:[e=PER]Mary is a <>e2:[e=MISC]Scottish professor.
<>e1:[e=PER]Mary, the <>e2:[e=LOC]US professor.

date of death
<>e1:[e=PER]John was announced t:[w]dead in <>e2:[e=DATE]1943.
<>e1:[e=PER]John t:[w]died in <>e2:[e=DATE]1943.
<>e1:[e=PER]John, an NLP scientist, t:[w]died <>e2:[e=DATE]1943.

place of death
<>e1:[e=PER]John t:[w]died in <>e2:[e=LOC]London, <>country:e=LOC England in 1997.
<>e1:[e=PER]John t:[w]died in <>e2:[e=LOC]London in 1997.
<>e1:[e=PER]John $-LRB- t:[w]died in <>e2:[e=LOC]London $-RRB-.

DocRED's founded by
<>e1:[e=ORG]MISC Microsoft t:[w]founder <>e2:[e=PER]Mary likes running.
<>e2:[e=PER]Mary t:[w]founded <>e1:[e=ORG]MISC Microsoft.
<>e1:[e=ORG]MISC Microsoft was t:[w]founded $by <>e2:[e=PER]Mary.

DocRED's origin
<>e2:[e=MISC]Scottish company, <>e1:[e=ORG]Microsoft is successful.
<>e1:[e=ORG]MISC Microsoft is a <>e2:[e=MISC]Scottish Company.
<>e1:[e=ORG]MISC Microsoft is a <>t:[t=NN]song $by <>e2:[e=MISC]Scottish musician.

DocRED's date of death
<>e1:[e=PER]John $-LRB-
<>e1:[e=PER]John t:[w]died in <>e2:[e=DATE]1943.
<>e1:[e=PER]John, an NLP scientist, t:[w]died <>e2:[e=DATE]1943.

DocRED's place of death
<>e1:[e=PER]John t:[w]died in <>e2:[e=LOC]London, <>country:e=LOC England in 1997.
<>e1:[e=PER]John t:[w]died in <>e2:[e=LOC]London in 1997.
<>e1:[e=PER]John $-LRB- $[e=DATE]1997, $[e=LOC]London $- $[e=DATE]1997 <>e2:[e=LOC]London $-RRB-.

DocRED's headquarters location
<>e1:[e=ORG]Microsoft, a leading <>t:[t=NN] company <>in:[t=IN]in <>e2:[e=LOC]Redmond.
<>e1:[e=ORG]Microsoft is t:[l=base | headquarter]based in <>e2:[e=LOC]England.
<>e1:[e=ORG]Microsoft, a leading <>t:[t=NN] company based <>in:[t=IN]in <>e2:[e=LOC]Redmond.