Adaptive Rule Discovery for Labeling Text Data
Sainyam Galhotra
UMass Amherst
[email protected]

Behzad Golshan
Megagon Labs
[email protected]

Wang-Chiew Tan
Megagon Labs
[email protected]
ABSTRACT
Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines, and the emergence of automated feature generation techniques such as deep learning, which typically requires a lot of training data, has further exacerbated the problem. While weak-supervision techniques have circumvented this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel), or require a labeled subset of the data to automatically mine rules (e.g., Snuba). The process of manually writing rules can be tedious and time consuming. At the same time, creating a labeled subset of the data can be costly and even infeasible in imbalanced settings, because a random sample in such settings often contains only a few positive instances.

To address these shortcomings, we present Darwin, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, Darwin automatically generates a set of candidate rules for the labeling task at hand, and utilizes the annotator's feedback to adapt the candidate rules. We describe how Darwin is scalable and versatile: it can operate over large text corpora (i.e., more than 1 million sentences) and supports a wide range of labeling functions (i.e., any function that can be specified using a context-free grammar). Finally, we demonstrate with a suite of experiments over five real-world datasets that Darwin enables annotators to generate weakly-supervised labels efficiently and at a small cost. In fact, our experiments show that rules discovered by Darwin on average identify 40% more positive instances compared to Snuba, even when Snuba is provided with 1000 labeled instances.
PVLDB Reference Format:
Sainyam Galhotra, Behzad Golshan, and Wang-Chiew Tan. Adaptive Rule Discovery for Labeling Text Data. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
1. INTRODUCTION
Today, many applications are powered by machine learning techniques. The success of deep learning methods in domains such as natural language processing and computer vision is further fuelling this trend. While deep learning (and machine learning in general) can offer superior performance, training such systems typically requires a large set of labeled examples, which is expensive and time-consuming to obtain.
Weak supervision techniques circumvent the above problem to some extent by leveraging heuristic rules that can generate (noisy) labels for a subset of the data. A large volume of labels can be obtained at a low cost this way, and to compensate for the noise, noise-aware techniques can be used to further improve the performance of machine learning models [12, 6]. However, obtaining high-quality labeling heuristics remains a challenging problem. A subset of existing frameworks, with Snorkel [12] being the most notable example, rely on domain experts to provide a set of labeling heuristics, which can be a tedious and time-consuming task. In contrast, other frameworks aim to automatically mine useful heuristics using additional supervision. For instance, Snuba [17] circumvents the dependence on domain experts by requiring a labeled subset of the data, and then utilizing it to automatically derive labeling heuristics. Babble Labble is another example, which asks experts to label a few examples and explain their choices (in natural language); this explanation is used to derive labeling heuristics. While these approaches have been quite effective in certain settings, we elicit their limitations with the following real-world example on text data.
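To make the manual workflow concrete, the following is a minimal sketch of the shape of a hand-written labeling heuristic of the kind such frameworks consume. The label constants, function name, and keyword list are illustrative choices of ours, not part of any particular framework's API.

```python
# A minimal sketch of a hand-written labeling heuristic of the kind
# weak-supervision frameworks consume. The label constants and the
# keyword list are illustrative; they are not tied to any framework API.
POSITIVE, ABSTAIN = 1, -1

def lf_transport_keywords(sentence: str) -> int:
    """Vote POSITIVE if the sentence mentions a transportation keyword;
    otherwise abstain and let other heuristics (or a label model) decide."""
    keywords = {"shuttle", "taxi", "bus", "airport"}
    tokens = set(sentence.lower().replace("?", "").split())
    return POSITIVE if tokens & keywords else ABSTAIN
```

Writing a diverse collection of such functions by hand is exactly the burden the approaches below try to reduce.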
Example 1. Consider a corpus of questions submitted by a hotel's guests to the concierge. Our goal is to build an intent classifier to find (and label) the set of questions asking for directions or means of transportation from one location to another. Below is a sample of messages from the corpus with positive instances marked with +.

S1: What is the best way to get to SFO airport? (+)
S2: Is there a bart from SFO to the hotel? (+)
S3: What is the best way to check in there? (-)
S4: Is Uber the fastest way to get to the airport? (+)
S5: Would Uber Eats be the fastest way to order? (-)
S6: What is the best way to order food from you? (-)

To be more concise, we refer to heuristic rules simply as heuristics or rules throughout the paper.

Figure 1: Comparing weak-supervision frameworks.

Relying on domain experts to provide labeling heuristics for tasks such as the one presented in Example 1 is a common approach, but it has a number of shortcomings:
• It is time consuming. Annotators must be familiar with the rule language (e.g., Stanford's Tregex or AI2's IKE language). Moreover, they need to be acquainted with the dataset to specify useful rules, i.e., rules that label a reasonable number of instances with a small amount of noise. This is normally done with trial-and-error and fine-tuning of the rules on a sample of the corpus, which can be quite tedious.
• Oftentimes, some useful rules remain undiscovered. This is because annotators may miss important keywords or possess limited domain knowledge. For example, the word 'bart' (which refers to a transportation system in California) is clearly useful for the task in Example 1. However, annotators may miss the important keyword 'bart', or they may not even know what 'bart' is (especially those who are not from the area).
• It yields rules with overlapping results. If multiple annotators work on writing rules independently, they are likely to end up with identical or similar rules. Hence, the number of distinct labels obtained does not always grow linearly with the number of annotators, which is rather inefficient.

The alternative approach would be to automatically mine useful heuristics with systems such as Snuba or Babble Labble. Both systems require a set of labeled instances (accompanied by natural language explanations in the case of Babble Labble), which can be costly and oftentimes infeasible to collect in imbalanced settings. For instance, while Example 1 shows a balanced number of positive and negative instances, in practice the positive instances often make up only a tiny fraction of the entire corpus. Hence, labeling a random sample would not be sufficient to obtain enough positive instances. Consequently, automatically inferring heuristic rules is not feasible using the few discovered positive instances.

To mitigate the above issues, we present
Darwin, an adaptive rule discovery system for text data. Figure 1 highlights how Darwin compares with other state-of-the-art weak-supervision frameworks. Compared to Snuba, Darwin requires far fewer labeled instances; in fact, as we show in our experiments, a single labeling rule (or a couple of labeled instances) is sufficient for Darwin. Compared to Snorkel and Babble Labble, Darwin requires a lower degree of supervision by domain experts. More explicitly, Darwin requires experts to simply verify the suggested heuristics, while Snorkel requires them to manually write such rules and Babble Labble requires them to provide explanations for why a particular label is assigned to a given data point.

Figure 2: Sample query to annotators.

Given a corpus and a seed labeling rule, Darwin identifies a set of promising candidate rules. The best candidate rule (along with a few example instances matching the rule) is then presented to the annotator to confirm whether or not it is useful for capturing the positive instances. Figure 2 presents an example of this step for the intent described in Example 1: the annotator is presented with examples that satisfy the rule and asked whether the rule is useful for the intent (a YES/NO question). Based on the response,
Darwin adaptively identifies the next set of promising candidate rules. This interactive process, where rules are illustrated with examples, helps annotators identify the most effective set of rules without the need to fully understand the corpus or the rule language. Our contributions are as follows.
• Darwin supports any rule language that can be specified using a context-free grammar. Therefore, it can generate a wide range of rules, from simple phrase matching to complex conditions over the dependency parse trees of the sentences.
• Darwin can effectively identify rules over large text corpora, even when the number of candidate rules is exponentially large. In fact, we show that verifying 50 heuristics suggested by Darwin is enough to achieve an F1-score of 0.80. Furthermore, we present theoretical results on the approximation guarantees of Darwin.
• Darwin does not require annotators to be familiar with the rule language. By analyzing the similarity and the overlap between the sets of sentences matching different rules, Darwin automatically surfaces patterns in the data; it also supports parallel discovery of rules by asking different annotators to evaluate different rules.
• We demonstrate how Darwin can be used for a variety of labeling tasks: classifying intents, finding sentences that mention particular entity types, and identifying sentences that describe certain relationships between entities (i.e., relation extraction).

In the following sections, we define our problem, describe Darwin, and demonstrate its effectiveness and efficiency through a suite of experiments. Specifically, we show that Darwin outperforms other baseline approaches in its ability to generate a larger set of labeled examples by asking a limited number of questions.
2. PRELIMINARIES & PROBLEM DEFINITION
In a nutshell, Darwin takes as input an unlabeled corpus of sentences along with an initial seed labeling heuristic (which is assumed to generate at least two positive instances). Darwin then identifies promising candidate labeling heuristics and leverages an oracle to verify whether a particular candidate heuristic is effective at capturing positive instances. Finally, the set of discovered heuristics is forwarded to Snorkel [12] to train a high-precision classifier. (Note that Snorkel provides both a framework for writing labeling rules and tools for training noise-aware models; here we refer to the latter.) Before describing Darwin's rule discovery pipeline in detail, we provide a formal definition of labeling heuristics along with a description of an oracle.
Heuristic search space. Naturally, labeling heuristics can be of different types with distinct semantics. For example, a heuristic may check for certain phrases in a sentence [17], or it may enforce some conditions on the parse tree [19] and POS tags of a sentence. In Darwin, the space of possible heuristics is specified using a collection of Heuristic Grammars, where each grammar describes a particular type of labeling heuristic. These concepts are formally defined as follows.

Definition 1 (Heuristic Grammar). A Heuristic Grammar G is a Context-Free Grammar (CFG). Recall that a CFG consists of a collection of derivation rules.

For a given heuristic grammar G, we define labeling heuristics as follows.

Definition 2 (Labeling Heuristic).
A labeling heuristic r is a derivation of the grammar G. We use C_r to denote the set of sentences in the corpus that satisfy the heuristic r, and refer to |C_r| as its coverage.

To further clarify the above definitions, let us consider a simplified regular expression grammar called TokensRegex, which captures all regular expressions over tokens with the '+' and '*' operators. This grammar can be formally written as a CFG as shown below. We use TokensRegex to explain Darwin's pipeline; Darwin's functionality is not restricted to this grammar, and we discuss a more complex grammar shortly.

Example 2 (TokensRegex Grammar). Let V denote the set of all possible words. The regular expression grammar over tokens comprises the following derivation rules:

A → vA  (∀v ∈ V)
A → A + A
A → A * A
A → ε

Figure 3: An example of a dependency parse tree (for the sentence 'Is Uber the best way to our hotel?', with POS tags such as PROPN, VERB, NOUN, ADJ).

The TokensRegex grammar allows any regular expression over words as a candidate labeling heuristic. For example, it generates heuristics such as 'best way to' or 'shuttle' as well as less meaningful heuristics such as 'shuttle is airport' as candidates for the task described in Example 1. A sentence satisfies such a heuristic if it contains the corresponding phrase. The sentences s1, s3, and s6 in Example 1 satisfy the heuristic r = 'best way to'; hence C_r = {s1, s3, s6}.
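As a concrete illustration, the sketch below (ours) computes the coverage set C_r of a TokensRegex-style phrase heuristic; it assumes lower-cased sentences and treats a phrase heuristic as substring containment, which suffices for contiguous-phrase rules.

```python
from typing import List

def coverage(corpus: List[str], phrase: str) -> List[int]:
    """Return the ids of sentences satisfying a phrase heuristic r,
    i.e., the coverage set C_r; |C_r| is the coverage of r."""
    r = phrase.lower()
    return [i for i, s in enumerate(corpus) if r in s.lower()]

corpus = [
    "What is the best way to get to SFO airport?",   # s1
    "Is there a bart from SFO to the hotel?",        # s2
    "What is the best way to check in there?",       # s3
]
print(coverage(corpus, "best way to"))  # -> [0, 2], i.e., C_r = {s1, s3}
```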
Darwin comprises of two differentgrammars (a) TokensRegex (b) TreeMatch, with the abil-ity to plug in more heuristic grammars as long as they arecontext-free. While TokensRegex is capable of capturinglexical patterns and phrases, it fails to capture syntacticpatterns and pattern over parse trees. TreeMatch grammarcaptures such patterns to identify more complex and genericheuristic functions.
Definition 3 (TreeMatch Grammar). Let V denote the set of terminals comprising all the tokens and Part-of-Speech (POS) tags [11] present in the corpus (e.g., NOUN, VERB, etc.).

Derivation Rules: The grammar has three fundamental operations that make up a heuristic, namely And (∧), Child (/), and Descendant (//). The symbol 'a/b' implies that terminal 'b' should be a child of terminal 'a' in the dependency parse tree. The symbol 'a//b' implies that terminal 'b' should be a descendant of terminal 'a' in the parse tree. Given that, the derivation rules of the grammar are:

A → A / A
A → A ∧ A
A → A // A
A → v  (∀v ∈ V)

It is important to mention that the complexity of heuristics that can be specified using the TreeMatch grammar exceeds what rule-mining frameworks such as Snuba or Babble Labble can capture.
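To illustrate how a TreeMatch condition can be checked against a parse, here is a sketch of ours using spaCy's dependency parses. It assumes an installed English model and checks a single 'a/b' (child) or 'a//b' (descendant) condition, with terminals matching either the token text or its POS tag, as in Definition 3; conjunctions (∧) would simply AND several such checks.

```python
# A sketch of checking one TreeMatch condition against a dependency parse.
# Assumes spaCy with the en_core_web_sm model installed; helper names are ours.
import spacy

nlp = spacy.load("en_core_web_sm")

def matches_terminal(token, terminal: str) -> bool:
    # A terminal is either a literal token or a POS tag such as NOUN.
    return token.text.lower() == terminal.lower() or token.pos_ == terminal

def satisfies(sentence: str, head: str, dep: str, descendant: bool = False) -> bool:
    """Check 'head/dep' (child) or, with descendant=True, 'head//dep'."""
    doc = nlp(sentence)
    for tok in doc:
        if matches_terminal(tok, head):
            pool = tok.subtree if descendant else tok.children
            if any(matches_terminal(t, dep) for t in pool if t is not tok):
                return True
    return False

# e.g., is some token a child / descendant of 'way' in the parse of Figure 3?
print(satisfies("Is Uber the best way to our hotel?", "way", "to"))
print(satisfies("Is Uber the best way to our hotel?", "way", "NOUN", descendant=True))
```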
Oracle Abstraction. Finally, we formalize the feedback that we may obtain from a single annotator, a group of annotators, or a crowd-sourcing platform using the notion of Oracles, as follows.

Definition 4 (Oracle). An Oracle O is a function which, given a heuristic r and a few samples from its coverage set C_r, outputs a YES/NO answer indicating whether or not r is adequately precise.

An Oracle plays the role of a perfect annotator who always answers the questions correctly. In practice, annotators may provide incorrect answers (as we show in our experiments), but the notion of an oracle enables us to provide insights into the theoretical aspects of our problem.
Problem statement. We are now ready to formally define our problem. Given a labeling task, our goal is to find a set R of labeling heuristics such that the union of the coverage of the heuristics in R, denoted as P = ∪_{r∈R} C_r, has a high recall (i.e., contains a high ratio of the positive instances in the corpus). We would like to maximize the recall of the set P without posing too many queries to the oracle. We empirically observed that users label a heuristic as precise only when the heuristic has precision at least 0.8. Hence, in this paper we do not focus on optimizing the precision of heuristics, for which we can also rely on various de-noising techniques from the weak supervision literature [12].

Problem 1 (Maximize Heuristics Coverage). Given a corpus S, a seed labeling function r, an oracle O, and a budget b, find a set R of labeling heuristics using at most b queries to the oracle, such that the recall of the set P, i.e., the union of the coverage of the heuristics in R, is maximized.

Lemma 1. The Maximize Heuristics Coverage problem is NP-hard.

Proof. We show the hardness of our problem by reducing the maximum-coverage problem to an instance of our problem. Given a collection of sets A = {A_1, A_2, ..., A_n} and a budget b, the maximum-coverage problem aims to find b sets from A such that the size of their union is maximized. Given an instance of the maximum-coverage problem, we create an instance of our problem as follows. For each set A_i, we define a heuristic r_i with coverage set C_{r_i} = A_i and mark all the instances as positives. Consequently, the oracle O would always respond with a YES, as all the heuristics are perfectly precise. Now, it is easy to see that the coverage of the set P in our setting is equivalent to the coverage of the selected sets in the maximum-coverage problem. As a result, if our heuristic discovery problem could be solved in polynomial time, then the corresponding sets would form the optimal maximum-coverage solution. Hence, our problem is also NP-hard.

Note that while we focus on maximizing the recall, it is also useful to report the performance of the classifier that is trained using our weakly-supervised labels. Therefore, in our experiments we also record the F-score of our trained classifier to provide a better evaluation of Darwin.
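The reduction also points at the standard baseline: if all coverage sets were known up front (which they are not in our setting, since Darwin must discover them through oracle queries), the classical greedy strategy achieves the well-known (1 − 1/e) approximation for budgeted maximum coverage. A sketch of ours for reference:

```python
from typing import Dict, Set

def greedy_max_coverage(cov: Dict[str, Set[int]], b: int) -> Set[str]:
    """Classical greedy for maximum coverage under a budget of b picks:
    repeatedly choose the heuristic adding the most uncovered instances."""
    cov = {r: set(s) for r, s in cov.items()}  # defensive copy
    chosen, covered = set(), set()
    for _ in range(b):
        best = max(cov, key=lambda r: len(cov[r] - covered), default=None)
        if best is None or not cov[best] - covered:
            break
        chosen.add(best)
        covered |= cov.pop(best)
    return chosen
```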
3. THE Darwin SYSTEM
In this section, we describe the architecture of Darwin, which is illustrated in Figure 4. Darwin operates in multiple phases that aim to identify a diverse set of heuristics for identifying positives. The pipeline is initialized with a seed labeling function or a couple of positive sentences. Darwin learns a rudimentary classifier using these positive sentences, and the classifier is refined as the training data evolves. In order to identify new heuristics, Darwin leverages the following properties. (i) The generalizability of the trained classifier helps guide the search towards semantically similar heuristics. For example, on identifying the importance of 'bus' as a heuristic, Darwin identifies 'public transport' as another possibility due to their related semantics. (This generalization is possible via word embeddings, which are provided as input to the classifier; we provide more details of our classifier in the experiments.) (ii) Local structural changes to the already identified heuristics help identify new heuristics. For example, given 'What is the best way to the hotel?' as the input seed sentence, Darwin constructs local modifications by dropping and adding tokens (derivation rules in general), identifying the new heuristic 'shuttle to the hotel'. Darwin leverages these intuitions to adaptively refine the search space and simultaneously learn a precise classifier with high coverage over the positives.

Before describing the architecture, we define a data structure that is critical for the efficient execution of Darwin. All candidate heuristics considered by Darwin are organized in the form of a hierarchy. This hierarchy captures the subset/superset relationships among heuristics: heuristics with higher coverage are placed closer to the root, and the ones with lower coverage are closer to the leaves. For example, 'best way to the hotel' is a subset of 'best way to' and will be its descendant. One of the key properties of this hierarchical structure is that if a heuristic r is identified to capture positives, then none of its subsets (descendants in the hierarchy) captures any new positive. This data structure has O(1) update time to identify the subsets of a heuristic. Additionally, it is helpful for the efficient execution of local structural changes to any heuristic. All these benefits will be discussed in detail in the later sections.

Algorithm 1 presents the pseudocode of the end-to-end Darwin architecture.

Algorithm 1: Darwin
Input: input corpus S, seed heuristic r, budget b
Output: collection of heuristics R, set of positive instances P, classifier C
1: I ← generate_index(S)
2: Q ← ∅; R ← {r}
3: P ← coverage(r)
4: C, P′ ← train_classifier(P, {r}, S)
5: while |Q| ≤ b do
6:   H ← generate_hierarchy(S, P′, I)
7:   q ← traversal(H, P, Q, C)
8:   if oracle_query(q) then
9:     R ← R ∪ {q}; P ← P ∪ coverage(q)
10:    C, P″ ← train_classifier(R, P, S)
11:    P′ ← P″ \ P′
12:  H ← update_scores(H)
13:  Q ← Q ∪ {q}
14: return R, P, C
Darwin's input consists of the corpus to be labeled, a collection of heuristic grammars, and one (or more) seed labeling function(s). Alternatively, a set of positive instances can be provided instead of seed labeling heuristics. The output of Darwin is the set of generated heuristics, the positive instances that are discovered, and a classifier that is trained using the labeled data.

Before the heuristic-discovery phase begins, Darwin creates an index over the input corpus for fast access to the sentences that satisfy a given heuristic (more details in Section 3.1). The heuristic-discovery phase is an iterative process where Darwin interacts with annotators and uses their feedback to identify new candidates and ask further queries. In a nutshell, each iteration consists of the following steps. First, the Candidate Generation component generates a small set of promising candidate heuristics (from the space of all possible heuristic functions) and organizes them in the form of a hierarchy H (line 6), with the most generic functions at the top and the stricter ones at the bottom. We will describe shortly how H is generated and used to prune less effective heuristics. Once the hierarchy is built, Darwin's Hierarchy Traversal component carefully navigates and evaluates the heuristics in the hierarchy to find the best candidate (line 7). The best candidate is then presented to the annotator (line 8). Finally, the updated classifier and the scores of the heuristics are sent back to hierarchy generation to identify new candidates and perform the traversal for the next iteration. We describe the details of these components next.

Figure 4: Darwin's architecture.

Figure 5: Examples of derivation sketches for sentences s1 and s4.
3.1 Index Creation

Darwin creates an index for the input corpus to provide fast access to sentences that satisfy certain heuristics. This index aims at constructing a space-efficient representation of each sentence in the corpus for the efficient execution of subsequent steps involving traversal through the various candidate heuristics. The hierarchical structure of this index is very similar to that of a trie.

Given a collection of heuristic grammars {G_1, ..., G_t} and a sentence s, one can enumerate the set of all possible heuristics of G_i, generated using a fixed number of derivation rules, that s satisfies. For example, using the TokensRegex grammar, the set of all heuristics that a sentence s satisfies is the set of all regular expressions that correspond to s. We organize the set of heuristics matching a sentence s into a structure called the Derivation Sketch, which summarizes the derivations of all heuristics that match s. Figure 5 shows (parts of) the derivation sketches for sentences s1 and s4. These sketches are merged into an index I, which is a compact representation of all heuristics that are satisfied by at least one sentence in the corpus. Each node in I represents a heuristic labeling function and stores the number of sentences that satisfy it, pointers to the children in the index, and an inverted list that points to the sentences that satisfy the heuristic.

The index I is created by merging the derivation sketches of sentences, one at a time. The index is first initialized with the derivation sketch of the first sentence. Thereafter, for every new sentence s, the derivation sketch of s is merged into I as follows. The root node of the sketch and the root node of I are merged, and then all nodes (starting from the merged root) are considered in a breadth-first fashion; the children of the node under consideration which are derived using the same derivation rule are merged together. For every node that gets merged, the count of the merged node is increased by one, and the inverted list at that node is updated to include the new sentence. Figure 6 shows the index built from the derivation sketches of s1 and s4.

Figure 6: An example of the index creation process.
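For the TokensRegex grammar, a sentence's derivation sketch essentially reduces to its contiguous token n-grams up to a maximum depth, so merging sketches into I amounts to trie insertion. The following is a simplified sketch of ours of that construction; the class and field names are our own, not Darwin's actual implementation.

```python
from collections import defaultdict

class IndexNode:
    """One node of the index I: the heuristic given by the token path
    from the root, its sentence count, an inverted list, and children."""
    def __init__(self):
        self.count = 0
        self.sentences = []                      # inverted list of sentence ids
        self.children = defaultdict(IndexNode)

def build_index(corpus, max_depth=10):
    root = IndexNode()
    for sid, sentence in enumerate(corpus):
        tokens = sentence.lower().split()
        # Merging a sentence's sketch into I is a sequence of trie inserts,
        # one per contiguous n-gram of bounded length.
        for start in range(len(tokens)):
            node = root
            for tok in tokens[start:start + max_depth]:
                node = node.children[tok]
                if not node.sentences or node.sentences[-1] != sid:
                    node.sentences.append(sid)
                    node.count += 1
    return root
```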
TreeMatch Grammar: This grammar has more operators than TokensRegex and can generate exponentially more candidate heuristics. The derivation sketch can be created, as explained above, by enumerating all sequences of derivation rules up to a fixed number of steps. However, a more compact derivation sketch for the TreeMatch grammar is simply the dependency parse tree of the sentence, as we can use it to quickly check whether a heuristic matches the parse tree or not [19]. Figure 3 shows the dependency parse tree of a sentence, which can serve as its derivation sketch as well. Given the exponentially many candidate heuristics generated by this grammar, the candidate generation step is crucial for ignoring useless heuristics and thereby helping the subsequent stages to focus on meaningful heuristics. We evaluate the performance of Darwin with this grammar in the next section.
As mentioned earlier, the number of possible heuristics under a given grammar G is often exponential in the size of the dictionary. The task of the heuristics-hierarchy generation component is to generate a manageable set of promising candidate heuristics from the space of all possible heuristics, and to organize the generated candidates in a hierarchy that captures the subset/superset relationships between the heuristics. Specifically, the hierarchy generation process consists of the following steps. First, the Candidate Generation step generates a subset of possible heuristics that have high coverage over the set of positive instances discovered so far; this algorithm operates in a greedy best-first search fashion to identify valuable candidates. These heuristics are promising as they already have some overlap with the existing positive instances. Next, these candidates are arranged in the form of a hierarchy along with the subset/superset edges between them. We describe these steps in detail next.

Algorithm 2: Candidate-Heuristic Generation
Input: index I, set of positive instances P, number of desired heuristics k
Output: collection of heuristics R
1: R ← {∗}; recentHeuristic ← ∗; candidates ← ∅
2: while |R| ≤ k do
3:   candidates ← candidates ∪ Children(recentHeuristic, I)
4:   sortedCandidates ← CoverageSort(candidates, P)
5:   recentHeuristic ← sortedCandidates[0]
6:   candidates ← candidates \ {recentHeuristic}
7:   R ← R ∪ {recentHeuristic}
8: return R

The candidate generation step uses the index I to generate a set of heuristic labeling functions with high coverage over the set of positive instances P that have been discovered so far by Darwin. Note that heuristics that (at least partially) cover the set of discovered positive instances are likely to be good heuristics and help detect more positive instances. To efficiently find such heuristics, we rely on one of the interesting properties of the index I: recall that the count of a node u ∈ I refers to the total number of sentences that contain the tokens along the path from the root to u in their derivation sketches. As descendant nodes correspond to stricter heuristics, the coverage of a heuristic corresponding to a node is never less than the coverage of any of its descendants in the index. Thus, we use a greedy algorithm to identify a collection of diverse heuristics that have high coverage over the set of positive instances P.

Algorithm 2 generates candidate heuristics by exploiting the property described above. The set of candidate heuristics is initialized with the heuristic '∗', which refers to the root of the index I; this heuristic matches all possible sentences in the corpus. In each iteration, the algorithm adds the children of the previous iteration's best candidate heuristic to the candidate list (line 3). The candidates are then sorted in decreasing order of coverage over the set P (line 4). The candidate with the highest coverage is removed from the candidate list and appended to the list of final results R (lines 6-7). This process is repeated until there are k heuristics in R. Note that the time complexity of this greedy algorithm is linear in the number of candidates generated.

Other constraints can also be added to the candidate-heuristic generation phase to ensure that the generated heuristics satisfy additional criteria. For example, Darwin can apply heuristics to ensure that the candidate heuristics are diverse in terms of the set of derivation rules used to derive them, their level in the index I, and the set of instances they cover. Some of these heuristics help Darwin avoid having to evaluate many similar candidate heuristics.
Hierarchical Arrangement and Edge Discovery. The candidates returned by Algorithm 2 have high coverage over the positives (discovered so far). This component iterates over the generated heuristics to arrange them into a hierarchy H following the same parent/child relationships that the index I captures, adding an edge between each such pair: a heuristic r1 is a child of r2 if it can be obtained by applying a derivation rule to r2.

This hierarchical arrangement of heuristics is followed by a cleanup to get rid of heuristics that do not add any new positive sentences beyond the ones already identified. The goal of the cleanup is to improve the efficiency and space complexity of Darwin, as the traversal component will never query a heuristic that does not add any new positives.
The result of the heuristic-hierarchy generation is a hierarchy H of promising heuristics. The hierarchy traversal module determines which heuristic in the hierarchy is the best one to be submitted to the oracle. We present three hierarchy traversal techniques: LocalSearch, UniversalSearch, and HybridSearch. At a high level, LocalSearch relies on the hierarchy structure to select the next best candidate from the immediate neighborhood of heuristics verified by the oracle in the past. In contrast, UniversalSearch ignores locality constraints and selects the heuristic with the maximum benefit globally. Finally, the HybridSearch traversal combines the first two techniques to find the next best heuristic, and is more robust than LocalSearch and UniversalSearch. All three techniques work in an iterative fashion, and in each iteration, the criterion for selecting a heuristic to be sent to the oracle is based on how beneficial the heuristic is, which we elaborate on next.
Benefit of a heuristic r: The benefit of a heuristic r is the expected gain in the positive set P upon choosing r. More formally, the benefit is quantified as Σ_{s ∈ C_r \ P} p_s, where p_s is the probability of sentence s being a positive instance. In Darwin, these probability values are estimated by training a classifier (any short text classifier would be ideal for this task) using the set of positive instances discovered so far, with random instances sampled from the corpus as negatives. The probability estimates improve as the system iteratively discovers more heuristics and the classifier is re-trained with more positive training examples.
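A sketch of ours of this selection criterion, assuming p maps each sentence id to the classifier's current positive-probability estimate and cov maps each heuristic to its coverage set:

```python
from typing import Dict, Set

def benefit(cov_r: Set[int], positives: Set[int], p: Dict[int, float]) -> float:
    """Benefit of a heuristic r: the expected number of *new* positives,
    i.e., the sum of classifier scores over C_r \\ P."""
    return sum(p[s] for s in cov_r - positives)

def most_beneficial(cov: Dict[str, Set[int]], positives: Set[int],
                    p: Dict[int, float]) -> str:
    """Pick the candidate with maximum benefit -- the selection criterion
    shared by all three traversal strategies."""
    return max(cov, key=lambda r: benefit(cov[r], positives, p))
```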
We describe our three traversal techniques next.

LocalSearch. The LocalSearch traversal algorithm (Algorithm 3) exploits the local hierarchy structure around the heuristics already identified as useful by the oracle to identify the next best heuristic for querying. Specifically, LocalSearch maintains a set of candidate heuristics and selects the most beneficial heuristic r from the candidates. If the oracle confirms that r is adequately precise, then it adds r's parents into the candidate set, as they are generalizations of r and might be helpful at capturing more positive instances. However, if the oracle labels r as a noisy heuristic, LocalSearch instead adds the children of r to the candidate set, with the hope that a specialized version of heuristic r might be less noisy.

Algorithm 3: LocalSearch Traversal
Input: heuristic hierarchy H, seed heuristic r
Output: collection of positive instances P, collection of heuristics R
1: QueryCount ← 0; R ← {r}; P ← C_r; C ← TrainClassifier(P)
2: localCandidates ← {r}
3: while QueryCount < b do
4:   r ← GetMostBeneficialCandidateHeuristic(localCandidates, C)
5:   QueryCount ← QueryCount + 1
6:   if OracleResponse(r) is YES then
7:     R ← R ∪ {r}; P ← P ∪ C_r
8:     localCandidates ← (localCandidates \ {r}) ∪ Parents(r)
9:     C ← TrainClassifier(P)
10:  else
11:    localCandidates ← (localCandidates \ {r}) ∪ Children(r)
12: return P, R

LocalSearch is simple and efficient at utilizing the structure of the hierarchy to find promising heuristics to submit to the oracle. Since the algorithm only explores the local neighborhood of the queried candidates, it has a time complexity of O(dt), where d is the maximum degree of an internal node and t is the number of iterations the algorithm runs for. However, a disadvantage of LocalSearch is that it may require many traversal steps when the initial seed heuristic is quite different from the other precise heuristics the system aims to discover. Also, it does not exploit the similarity and the overlap between the coverage sets of different heuristics. The UniversalSearch algorithm, which we describe shortly, addresses these shortcomings by utilizing a holistic view of the hierarchy.

Efficient Implementation. Since the LocalSearch traversal only explores a node's immediate parents/children, it does not require the entire hierarchy a priori. Hence, in its implementation, we can skip the heuristic-generation component and expand the hierarchy on the fly based on the oracle's feedback.
UniversalSearch. The UniversalSearch algorithm (see Algorithm 4) evaluates all heuristics present in the hierarchy to identify the best heuristic. In each iteration, UniversalSearch omits any heuristic for which the benefit per instance is smaller than 0.5, i.e., where the majority of the instances in C_r are expected to be negatives. Among the remaining heuristics, it chooses the heuristic with the maximum benefit to submit to the oracle. Based on the oracle's feedback, it re-trains the classifier if new positives were discovered, or else it continues by querying the next best heuristic. Note that UniversalSearch captures the best candidates irrespective of the hierarchy structure.

The strength of UniversalSearch is in its capability to identify semantic similarity between heuristics and their matching instances even if they are far apart in the hierarchy. However, it has the following shortcomings: (1) compared to LocalSearch, it is inefficient, as it iterates over all heuristics in the hierarchy to identify the best candidate, and (2) in the absence of enough positive instances, the trained classifier is likely to overfit and not generalize well to other precise heuristics. In such cases, UniversalSearch fails to exploit the structure of the hierarchy to at least find heuristics that are structurally similar to the seed heuristics. We describe the HybridSearch algorithm next, which combines the strengths of UniversalSearch and LocalSearch.
Algorithm 4: UniversalSearch Traversal
Input: heuristic hierarchy H, seed heuristic r
Output: collection of positive instances P, collection of heuristics R
1: QueryCount ← 0; R ← {r}; P ← C_r
2: universalCandidates ← {r′ : r′ ∈ H}
3: C ← TrainClassifier(P)
4: while QueryCount < b do
5:   r ← GetMostBeneficialCandidate(universalCandidates, C)
6:   QueryCount ← QueryCount + 1
7:   if AvgBenefit(r) ≤ 0.5 then continue
8:   if OracleResponse(r) is YES then
9:     R ← R ∪ {r}; P ← P ∪ C_r
10:    C ← TrainClassifier(P)
11:  universalCandidates ← universalCandidates \ {r}
12: return P, R

HybridSearch. HybridSearch (see Algorithm 5) combines the two previous traversal techniques by maintaining a list of local candidates and a list of universal candidates, and imitating the strategies of both traversal algorithms. Starting from the UniversalSearch strategy, the HybridSearch algorithm queries candidate heuristics (with a benefit per instance above 0.5) to the oracle. If the algorithm fails to find a precise heuristic within a fixed number of attempts, then it switches to the LocalSearch strategy. Similarly, if the LocalSearch strategy has no success within a fixed number of attempts, the traversal toggles back to the UniversalSearch strategy. The switch between the two strategies is decided by a parameter τ (by default 5), which denotes the number of unsuccessful attempts before the switch happens. Clearly, higher values of τ discourage switching between the two strategies.

Algorithm 5: HybridSearch Traversal
Input: heuristic hierarchy H, seed heuristic r
Output: collection of positive instances P, collection of heuristics R
1: universalMode ← True; attempt ← 0
2: R ← {r}; P ← C_r; C ← TrainClassifier(P)
3: localCands ← {r}; universalCands ← {r′ : r′ ∈ H}
4: QueryCount ← 0
5: while QueryCount < b do
6:   if attempt ≥ τ then
7:     universalMode ← not universalMode; attempt ← 0
8:   attempt ← attempt + 1
9:   candidates ← universalCands if universalMode else localCands
10:  QueryCount ← QueryCount + 1
11:  r ← GetMostBeneficialCandidateHeuristic(candidates, C)
12:  if universalMode and AvgBenefit(r) ≤ 0.5 then continue
13:  if OracleResponse(r) is YES then
14:    R ← R ∪ {r}; P ← P ∪ C_r
15:    C ← TrainClassifier(P)
16:    localCands ← (localCands \ {r}) ∪ Parents(r)
17:  else
18:    localCands ← (localCands \ {r}) ∪ Children(r)
19:  universalCands ← universalCands \ {r}
20: return P, R

Our empirical evaluation shows that HybridSearch, formed by combining the UniversalSearch and LocalSearch strategies, performs well on all types of datasets, even when the other two traversal algorithms struggle to discover high-quality heuristics. In short, if the trained classifier is noisy (due to a lack of positive instances), HybridSearch exploits the structure of the hierarchy to search for precise heuristics. Similarly, when no precise heuristics are found by LocalSearch, it uses UniversalSearch's ability to generalize to other heuristics.
After each oracle response, Darwin passes the feedback to the score update component to (1) re-train the classifier, (2) re-evaluate the scores of all heuristics in the hierarchy, and (3) update the set of positive instances (if the feedback is positive) and signal the hierarchy generation component to generate new candidate heuristics to be added to the hierarchy.
Analysis of UniversalSearch. In this section, we analyze the ability of the UniversalSearch hierarchy traversal to identify positive sentences within a query budget b. For this analysis, we consider a simple model capturing how positive and negative instances are scored by a classifier. An ideal classifier assigns a score of 1 to positives and 0 to negatives; in practice, however, the scores follow a different distribution, which we model as follows. Let P* denote the collection of positive sentences in the corpus of sentences S. We assume that a reasonable classifier assigns to a positive sentence s ∈ P* a score larger than θ ≥ 0.5 with probability β, and a score less than 1 − θ otherwise. Similarly, the score of a negative sentence s ∈ S \ P* is above θ with probability β′. Naturally, for a classifier that is better than random, β is larger than β′. Under this model, we can show that the set of heuristics R and the corresponding positive set P identified by UniversalSearch are a constant approximation of the optimal solution.

We also make a few assumptions about the hierarchy of heuristics. We assume that the number of heuristics in the hierarchy H is linear in the number of sentences (i.e., O(n)) and that each heuristic has a minimum coverage of Ω(log n). This is a realistic assumption, as we focus on heuristics that can be derived from their context-free grammar using a fixed number of steps, and our algorithm is aimed at identifying heuristics that cover a large fraction of the positives. Under this assumption, we show that at any iteration, a heuristic r chosen by UniversalSearch has coverage |C_r| larger than (1/α) max_{r′∈H} |C_{r′}|, where α is a constant. This guarantees that UniversalSearch identifies a constant fraction of P_OPT positives within a query budget of b, where P_OPT is the total number of positives identified by an ideal algorithm. To bound the estimated coverage of a heuristic r, we use Hoeffding's inequality [7].

Notation. We define a random variable X_s which refers to the score assigned to a sentence s, and let µ_s denote its expected value. The benefit score of a heuristic r is Σ_{s∈C_r} X_s, and its expected value is denoted by µ_r.

Lemma 2. Given a heuristic function r with coverage C_r and precision p, the expected score of the heuristic function is at least θβ′|C_r|.

Proof. The expected score of the heuristic function is

E[Σ_{s∈C_r} X_s] = Σ_{s∈C_r} µ_s = Σ_{s∈C_r∩P*} µ_s + Σ_{s∈C_r\P*} µ_s
≥ Σ_{s∈C_r∩P*} (θβ) + Σ_{s∈C_r\P*} (θβ′)
= (θβ) p |C_r| + (θβ′)(1 − p)|C_r| ≥ θβ′|C_r|.

We use this calculation to bound the score of a heuristic r that covers more than log n sentences.

Lemma 3. Consider a heuristic function r with coverage C_r such that |C_r| = c log n, where c ≥ 4/(ε²θβ′) is a constant. The benefit score of the heuristic is at least (1 − ε)θβ′|C_r| with a probability of 1 − 1/n².

Proof. The score of the heuristic function r is Σ_{s∈C_r} X_s. The expected value of this score (denoted by µ_r) is bounded in Lemma 2. Using Hoeffding's inequality,

Pr[Σ_{s∈C_r} X_s ≤ (1 − ε)µ_r] ≤ e^{−ε²µ_r/2} ≤ e^{−ε²θβ′|C_r|/2} ≤ e^{−2 log n} = 1/n².

This shows that Σ_{s∈C_r} X_s is greater than (1 − ε)θβ′|C_r| with a probability of more than 1 − 1/n².

Using a similar analysis, we identify an upper bound on the heuristic score. Due to space constraints, we defer the proofs to the Appendix.

Lemma 4. Given a heuristic function r with coverage C_r and precision p, the expected score of the heuristic function is at most (β + (1 − θ)(1 − β))|C_r|.

Lemma 5. Consider a heuristic r with coverage C_r such that |C_r| = c log n, where c ≥ 4/(ε²(β + (1 − θ)(1 − β))) is a constant. The score of the heuristic is at most (1 + ε)(β + (1 − θ)(1 − β))|C_r| with a probability of 1 − 1/n².

Using the calculated bounds on the score of a heuristic, we evaluate the conditions under which a particular heuristic is preferred over another.

Lemma 6. Given a pair of heuristic functions r1 and r2 with respective coverages C1 and C2, if C1 has more positives than C2, then the UniversalSearch score of r1 is higher than that of r2 whenever |C1|/|C2| ≥ α, with a probability of 1 − 2/n², where α is a constant.

Using a similar analysis, we can bound the estimated average probability of a heuristic: for a heuristic r with precision p_r, we can show that it is considered for benefit calculation only when p_r > γ, where γ is a constant.

Theorem 1. In the worst case, UniversalSearch provides a constant approximation of Problem 1 with a probability of 1 − o(1).

Proof. In each iteration, the UniversalSearch algorithm sorts the candidate heuristics based on the estimated probability that a randomly chosen sentence from C_r is positive. All these candidates have true precision p_r > γ. Given a pair of heuristics r1 and r2, by Lemma 6, the benefit score of r1 is higher than that of r2 whenever |C_{r1}|/|C_{r2}| > α, with a probability of 1 − 2/n². Let r_OPT be the heuristic chosen by the optimal algorithm. Using a union bound over the O(n²) pairs of heuristics, with high probability the estimated benefit of r_OPT is higher than that of any r′ with |C_{r′}| ≤ |C_{r_OPT}|/α. Therefore, UniversalSearch never chooses a heuristic with coverage smaller than |C_{r_OPT}|/α, with a probability of 1 − o(1). This shows that the total number of positives identified by UniversalSearch is at least γ|C_{r_OPT}|/α, which is a constant approximation of |C_{r_OPT}| with a probability of 1 − o(1).

Notice that the analysis above makes certain assumptions about the quality of the classifier. In the initial iterations of Darwin, the classifier has low recall and hence the values of β and θ are lower. As Darwin identifies new heuristic functions, the increase in training data pushes these values higher, thereby improving the approximation factor of our algorithm. It is important to mention that even when the classifier is not ideal, our only key assumption is that the classifier performs better than random.

Table 1: Dataset statistics.
Discussion. We proposed three techniques for hierarchy traversal. The UniversalSearch approach is useful for capturing holistic information about the different candidate labeling heuristics and is proven to achieve a constant approximation of the optimal solution under reasonable assumptions about the trained classifier. However, due to the lack of training data in the initial iterations of the pipeline, this assumption may not hold, and UniversalSearch then does not perform optimally. In contrast, LocalSearch performs local generalization of the identified heuristics to quickly increase the number of identified positives. HybridSearch is a robust amalgamation of these two techniques and is our recommendation. Since HybridSearch is a generalization of LocalSearch and UniversalSearch, it is slightly less efficient than either of them.
4. EXPERIMENTS

In this section, we perform an empirical evaluation of Darwin along with other baselines to validate the following.
• The ability of Darwin to identify the majority of the positives even when initialized with a small seed set.
• The positives identified by Darwin outperform other baseline techniques that use active learning, a human annotator, or other automated techniques. We show that Darwin can uncover most of the positive instances (i.e., 80% or more) with roughly 100 queries.
• The heuristics identified by Darwin have high precision and help train a classifier with a superior F-score (≥ 0.8).
• Darwin is highly efficient and can generate labels from a corpus of 1M sentences in less than 3 hours.
• Darwin's performance is resilient to variations in the seed set.
4.1 Experimental Setup

Here, we describe the datasets, the baselines, and our overall experimental setup.

Datasets. We experimented with five diverse real-world datasets, each suitable for one of the following NLP tasks: entity extraction, relationship extraction, and intent classification. Table 1 summarizes the statistics of these datasets. All datasets, except for directions, come with ground-truth labels, which we use for evaluation and to synthesize the responses of an oracle. For the directions dataset, we rely on human annotators to generate the gold standard and validate the heuristics. We describe each of these datasets below.
• cause-effect [15] is a dataset commonly used as a benchmark for relationship extraction between pairs of entities. We focus on the task of finding sentences that describe a cause-and-effect relationship between two entities.
• directions is an internal dataset described in Example 1. For this dataset, we leveraged Figure-eight (https://figure-eight.com) crowd workers to verify the heuristics generated by Darwin.
• musicians consists of sentences from Wikipedia articles. The task is entity extraction, with the goal of extracting the names of musicians. The ground truth is obtained with the help of NELL's knowledge base (http://rtw.ml.cmu.edu/rtw/kbbrowser/).
• professions is a collection of sentences from ClueWeb (https://lemurproject.org/clueweb09/). The sentences that mention various professions (e.g., scientist, teacher, etc.) are positives. The ground truth is generated using NELL's knowledge base.
• tweets [18] is a benchmark for classifying the intent of tweets into predefined categories such as food, travel, career, etc.
Baselines. We evaluate our framework on two fronts: (1) the ratio of positive instances it discovers (i.e., coverage) and (2) the performance of the classifier trained using our weakly-supervised labels. Our baselines for these two evaluation criteria are listed below.
• Section 4.2 compares the fraction of identified positives with Snuba [17]. In this experiment, we consider a small sample of positives chosen randomly from the dataset.
• Section 4.3 compares the coverage obtained by Darwin against two baselines, namely HighP and HighC. HighP is a simpler version of Darwin which selects the rule that is expected to have a high precision (according to the classifier) and submits it to the oracle. On the other hand, HighC selects rules with the maximum coverage, irrespective of their expected precision. (HighC's performance was quite poor, as most of its suggested rules are rejected by the oracle; as a result, we omit HighC from the plots for the sake of clarity.)
• Section 4.4 compares the F-score of the classifier generated by Darwin with an Active Learning (AL) [14] technique and a Keyword Sampling (KS) technique, as well as the HighP baseline mentioned earlier. AL improves its performance by selecting the instance with the highest entropy, asking the oracle for its label, and then re-training the classifier using the new label. The KS approach is designed to check if we can quickly obtain a small set of promising instances by filtering the corpus using a set of relevant keywords and labeling instances from the smaller set. To do so, we asked annotators to provide 10 distinct keywords as a heuristic to filter the dataset; the KS technique then randomly samples instances from the filtered dataset and asks for their labels. We employ the same deep-learning-based classifier for all the techniques.

Finally, note that Darwin can use the different traversal algorithms LocalSearch, UniversalSearch, and HybridSearch, which we refer to as Darwin (LS), Darwin (US), and Darwin (HS), respectively.
Settings. We implemented all proposed algorithms and baselines in Python and ran the experiments on a server with 500GB RAM and 64-core, 2.10GHz x 2 processors. The dependency parse trees and the POS tags are generated with SpaCy (https://spacy.io/). All text classifiers trained in our experiments (whether used by Darwin or other baselines) are implemented with a 3-layer convolutional neural network followed by two fully connected layers, following the architecture described by Kim et al. [8]. The input to the classifier is a matrix created by stacking the word-embedding vectors of the words appearing in the sentence; we use SpaCy's word embeddings for English (https://spacy.io/models/en). For generating derivation sketches, the maximum depth is set to 10, and we consider 10K heuristics in candidate selection. When simulating the responses from an oracle (using the ground-truth data), we respond YES to a heuristic h if at least 80% of its coverage set consists of positive instances.

Figure 7: Effects of seed set size on performance ((a) directions; (b) musicians).
Figure 8: Effects of biased seed set size on performance ((a) directions; (b) musicians).
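The simulated oracle just described can be written in a few lines; the sketch below (ours) answers YES exactly when the rule's empirical precision on the ground truth reaches the 80% threshold.

```python
from typing import Set

def simulated_oracle(cov_r: Set[int], gold_positives: Set[int],
                     threshold: float = 0.8) -> bool:
    """Simulate the oracle of Definition 4 from ground-truth labels:
    answer YES iff at least `threshold` of the coverage set C_r
    consists of positive instances."""
    if not cov_r:
        return False
    precision = len(cov_r & gold_positives) / len(cov_r)
    return precision >= threshold
```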
4.2 Comparison with Snuba

In this experiment, we initialize Snuba and Darwin (HS) with the same set of randomly chosen labeled sentences and compare the total number of positives identified by each of the techniques. (In this experiment, we do not start with a single labeling heuristic, as Snuba then achieves a very small coverage: it fails to obtain enough positive instances due to the high degree of imbalance in these datasets.) Note that Snuba does not query the oracle and may infer inaccurate heuristics from the provided labeled instances; for a fair evaluation, we choose not to compare the accuracy of the identified heuristics. Figure 7 shows the change in the fraction of identified positives as the size of the initial seed set varies.
Darwin (HS) is able to identify the majority of the positives even when the pipeline is initialized with fewer than 25 sentences. However, Snuba requires at least 200 randomly chosen sentences for directions and 1000 for musicians; even if we employ an expert to sample positives, Snuba requires at least 100 positive samples for musicians.

To further evaluate the ability of Snuba and Darwin (HS) to generalize to heuristics that have limited or no evidence in the initial seed set, we construct a biased sample of seed positives. In this experiment, we choose sentences randomly from the corpus after ignoring the ones that contain the token 'shuttle' in the directions dataset and 'composer' in musicians. Figure 8 shows the fraction of positives identified with varying sizes of the seed set. Snuba is not able to identify the positives that contain the token 'shuttle' in directions and 'composer' in musicians; hence, it achieves poor coverage over the positives in the two datasets.
Darwin (HS) is able to identify the majority of the positives irrespective of the number of sentences used to initialize the pipeline. Snuba requires considerably more labeled sentences for musicians than for directions, due to the presence of many diverse heuristics in the dataset, most of which have limited evidence in the seed subset. We observe a similar performance gap between Snuba and Darwin (HS) for the other datasets.

This experiment validates that Snuba works well when the initial seed set has enough randomly chosen positives, but lacks the ability to generalize to heuristics that have limited evidence. On the other hand, Darwin (HS) is able to identify the majority of the positives even when the pipeline is initialized with just 25 sentences, and it generalizes well. To further evaluate Darwin's ability to identify positives, the following subsection considers a more challenging scenario where the pipeline is initialized with a single labeling heuristic or just two positive sentences.
4.3 Coverage of the Discovered Heuristics

Figures 9a-9d and 10a illustrate the fraction of positives identified by Darwin and our baselines. We can observe that Darwin (HS) has the most stable performance and outperforms the other techniques. While Darwin (US) occasionally outperforms Darwin (HS), we observe that it fails to perform well on all datasets. In most cases (with the exception of cause-effect), Darwin (HS) achieves a coverage of 0.8 using fewer than 120 queries to the oracle. The cause-effect dataset is known to be a tough benchmark in the NLP community, as the best F-score reported by [15] is 82% given complete access to the training set. Assuming that the oracle considers a majority vote by querying three crowd members and each query costs 2 cents (these are standard assumptions in crowdsourcing platforms, e.g., Figure-eight; we used the same cost model to collect labels for directions), the Darwin (HS) pipeline generates more than 80% of the positive labels with only $7.20. Figure 9d demonstrates the behavior for the 'Food' intent in tweets; we observed similar behavior for the 'Travel' and 'Career' intents on this dataset. We can observe that the other baselines do not perform well compared to Darwin: HighP identifies heuristics with very small coverage as its candidates. Also note that the Darwin (LS) algorithm shows high progressive coverage initially, but it converges to a very low coverage value because it is unable to identify rules that are semantically similar but far away in the hierarchy. Overall, we recommend Darwin (HS) for any practical application, as it is more robust and works better than most of the techniques. On the other hand, the Darwin (LS) and Darwin (US) variants work well in specific settings: Darwin (LS) performs better than the other techniques when precise rules are present close to each other in the hierarchy, and Darwin (US) performs well in the presence of abundant labeled examples.

Figure 11 shows some of the heuristics queried by the Darwin (HS) algorithm. In the directions example,
Darwin (HS) started with 'best way to get to' and was able to traverse to 'shuttle to', which is quite distinct from the initial seed rule. The choice of 'to the hotel from' by the algorithm provides some evidence that 'shuttle to' is also a good rule, since the phrases often co-occur in positive instances. In the cause-effect example, the traversal is relatively simple: the algorithm generalizes the initial rule first and, as soon as it reaches the noisy and unhelpful rule 'by', it specializes again to 'triggered by', which is a precise rule. In addition to these simpler heuristics, Darwin identified more complex heuristics for professions, such as '/is/NOUN ∧ job', among others.

Figure 9: Comparison of rule coverage and classifier's F-score for Darwin-based pipelines on various datasets ((a)-(d): coverage on musicians, cause-effect, directions, and food-tweets; (e)-(h): F-score on the same datasets).
Figure 10: Comparison for professions ((a) heuristic coverage; (b) classifier performance).
Figure 11: Example traversals by the HybridSearch algorithm on two datasets ((a) cause-effect: 'has been caused by' → 'caused by' → 'by' → 'triggered by'; (b) directions: 'best way to get to' → 'way to get to' → 'to get' → 'get to the hotel' → 'shuttle to' → 'shuttle to the' → 'to the hotel from').

4.4 Quality of the Trained Classifier

This section compares the quality of the classifier generated using the labels identified by Darwin.
Darwin . Figures 6e-6h and 10b show that
Darwin (HS) dominates the other techniques over all the datasets. The active learning technique suffers from a poor F-score initially and improves gradually. Since AL generates very few training examples, the trained classifier is highly unstable and shows a jittery F-score. The KS approach performs comparably to AL. On the other hand,
Darwin -based pipelines are much more stable in terms of F-score. The classifier trained with the labeled data generated by
Darwin pipelines always maintains a high precision. It is interesting to note that [18] reports a maximum F-score of 0.54 for the food intent, compared to 0.84 achieved by
Darwin . The classifier generated by
Darwin also achieved an F-score above 0.8 for other intents such as ‘Travel’ and ‘Career’, while [18] reports a maximum F-score of 0.58 for these intents.
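To illustrate how rule-generated labels feed a downstream classifier, the sketch below trains a simple model on weakly labeled sentences and scores it on a small gold set. Darwin itself trains a neural sentence classifier [8]; logistic regression over TF-IDF and the toy data are used here only to keep the example short.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Weak labels produced by the discovered rules (toy data).
weak_sentences = ["shuttle to the hotel runs hourly",
                  "best way to get to downtown",
                  "the recipe needs two eggs",
                  "stocks fell sharply on Monday"]
weak_labels = [1, 1, 0, 0]

# Small gold set for evaluation (also toy data).
gold_sentences = ["is there a shuttle to the airport",
                  "the movie was too long"]
gold_labels = [1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(weak_sentences), weak_labels)
preds = clf.predict(vec.transform(gold_sentences))
print(f1_score(gold_labels, preds))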
To provide better insights into
Darwin ’s performance, we have conducted a series of experiments to evaluate (1) how efficient the framework is in terms of the time required to obtain labels, (2) how much noise-aware models (trained by Snorkel) can improve the classification results, and (3) how well human annotators approximate our notion of an oracle. Due to space limitations, we present the effect of varying seed rules and parameters in the Appendix.
Efficiency in Label Collection.
As we demonstrated,
Darwin identifies the majority of the positive instances in all the datasets using roughly 100 queries. The time taken to generate the index structure for all the datasets was less than 5 minutes. The hierarchy generation phase then iterates over the index to identify the candidate rules. This phase takes less than 15 minutes for a corpus of 100K sentences. Since the
LocalSearch algorithm does not require the index to be pre-computed and generates candidates on the fly, it runs in less than 45 minutes for all datasets.
HybridSearch and
UniversalSearch traversal algorithms require 60-90 minutes on the smaller datasets (i.e., directions , musicians and cause-effect ) and about 2 hours and 45 minutes on professions . The major bottleneck in this process is the time taken by the classifier to make a prediction for all instances in the corpus (roughly 25 minutes for one round of training and testing on the professions dataset). We implemented a simple optimization: a sentence is re-evaluated only if its confidence score exceeded 0.3 in the previous iteration, and instances that do not satisfy this constraint are evaluated only once every three iterations. This heuristic reduced the running time from 2 hours and 45 minutes to 65 minutes on the professions dataset. The total running time does not grow linearly with the size of the dataset because most of the components use the classifier to identify the positives; these candidate positives are then used for hierarchy generation and traversal. Hence, the running time grows linearly with the size of the positive set but not with the dataset size (after applying the optimization described above).
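The following sketch illustrates this re-evaluation heuristic; the classifier interface (predict_proba) and all names here are assumptions for illustration rather than Darwin ’s actual code.

CONF_THRESHOLD = 0.3   # confidence cutoff from the optimization above
REFRESH_EVERY = 3      # re-check skipped sentences every three iterations

def update_scores(classifier, sentences, scores, iteration):
    # Re-score a sentence if it looked promising in the previous iteration;
    # otherwise refresh it only periodically to save classifier predictions.
    for i, sentence in enumerate(sentences):
        if scores[i] >= CONF_THRESHOLD or iteration % REFRESH_EVERY == 0:
            scores[i] = classifier.predict_proba(sentence)
    return scores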
                    M      C      D      F
Darwin
Darwin + Snorkel    0.82   0.78   0.97   0.87

Table 2: Performance of Darwin with Snorkel (M = musicians , C = cause-effect , D = directions , F = food-tweets ).
Training noise-aware classifiers.
One of the recent developments in weak-supervision paradigms has been the emergence of frameworks such as
Snorkel [12], which are designed to de-noise the generated labels and train noise-aware classifiers. In this experiment, we direct the set of rules identified by
Darwin to Snorkel and compare the quality of the noise-aware classifier against a classifier trained directly on the labels generated by
Darwin . Table 2 summarizes the F-scores that the two classifiers obtain on our datasets. We can observe that in most cases, using Snorkel does not yield any improvements. This is mainly because in many of these datasets, the rules generated by
Darwin already exhibit a low degree of noise and good coverage, and thus there is almost no room for improvement. Nevertheless, we can see that on some datasets, such as directions , using Snorkel can be quite beneficial.
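As a sketch of this hand-off, the vote matrix produced by the discovered rules can be passed to Snorkel's LabelModel to obtain de-noised probabilistic labels. The snippet below assumes Snorkel's v0.9 Python API and uses a toy vote matrix:

import numpy as np
from snorkel.labeling.model import LabelModel

# L is an (n_sentences x n_rules) matrix of rule votes:
# 1 = positive, 0 = negative, -1 = abstain (toy example).
L = np.array([[ 1, -1,  1],
              [-1,  0, -1],
              [ 1,  1, -1],
              [ 0,  0, -1]])

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

# Probabilistic labels for training a noise-aware classifier.
print(label_model.predict_proba(L))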
Performance of human annotators.
Clearly,
Darwin ’s performance heavily relies on the quality of the responses it receives from the annotators. To study how well human annotators perform, we ran an experimental study on the Figure Eight crowdsourcing platform for the directions dataset. Labels were collected for 2,600 heuristics. Each annotator was paid 2 cents per rule evaluation, and three evaluations were collected per rule. A manual inspection of the results reveals that annotators were able to capture most of the precise heuristics, such as ‘best way to get there’, ‘shuttle from’, ‘across the street from’, and ‘airport to hotel’. Overall, we found fewer than 10 false-positive responses among the 69 positive heuristics identified by the crowd labels. These erroneous responses arise because the 5 matching sentences presented to the annotator can, by chance, contain 3 or 4 positive instances, which confuses the annotators; presenting more samples lowers the error rate. Interestingly,
Darwin often ranks these heuristics lower in its querying preference, as it can analyze the complete coverage set and mitigate such errors by considering the entire distribution of instances. The annotators took 23 seconds on average to label a heuristic query. For 100 queries,
Darwin generates all the labels with less than 40 minutes of human effort. This time can be reduced further by asking different questions in parallel to different crowd members.
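A quick back-of-the-envelope computation reproduces these cost and time figures (assuming, for the time estimate, that responses are collected serially):

queries = 100
annotators_per_query = 3
cost_per_response = 0.02      # dollars per evaluation on Figure Eight
seconds_per_response = 23     # average labeling time observed above

print(queries * annotators_per_query * cost_per_response)   # 6.0 dollars
print(queries * seconds_per_response / 60)                  # ~38.3 minutes, under 40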
5. RELATED WORK
To the best of our knowledge,
Darwin is the first system that assists annotators in discovering rules under any desired rule grammar for rapid labeling of text data. Our work is related to studies in the areas of weak supervision, crowdsourcing, and the intersection of the two, which we discuss next.
Weak Supervision.
There are multiple existing approaches for generating labels in weakly supervised settings. Some techniques rely on the notion of distant supervision, where the labels are inferred using an external knowledge base [10, 1, 21]. One notable example is a system named Snuba [17], which generates labeling rules based on an existing labeled dataset. In contrast to these systems,
Darwin is designed for scenarios where no additional sources of information are available. In such cases, it is necessary to rely on annotators to write labeling rules. While expert-written rules have proven to be highly effective in many settings [12], there is limited work on how to facilitate the process of writing or discovering high-quality rules. One interesting example is Babble Labble [6], a labeling tool that allows annotators to explain (in natural language) why they have assigned a label to a given data point. These explanations are then transformed into labeling rules. While Babble Labble simplifies the rule-writing process, it only handles a single internal rule language. On the other hand,
Darwin allows experts to pick their desired rule language depending on the complexity and the dynamics of the task at hand. There have been several studies on utilizing weakly-supervised labels in an optimal way. Snorkel [12] and Coral [16] are recent examples of systems (based on the data programming paradigm) that de-noise and utilize the labels collected via weak supervision. Similarly, there are numerous data management problems spanning data fusion [2, 13] and truth discovery [9], which focus on identifying reliable sources of data. Many recent studies in data integration have also explored techniques that handle errors in crowd answers [3, 5]. Note that
Darwin is a framework for discovering labeling rules, which goes hand-in-hand with the aforementioned systems, since
Darwin ’s generated rules can be further processed using these de-noising techniques to achieve better results.
Crowdsourcing Frameworks.
There have been many studies on devising oracle-based abstractions that handle annotations from a crowd and minimize the noise in answers [20, 4]. Perhaps more relevant to our work are existing studies on how labeling rules can be verified with the help of the crowd. One recent example is a system named CrowdGame [22], which validates a rule by showing either the rule or its matching instances to the annotators. The authors demonstrate that their proposed game-based techniques yield the best results for rule verification. Unlike
Darwin , CrowdGame assumes a pre-existing (manageable) set of possible rules from which the best rule should be selected.
Darwin , on the other hand, makes no such assumption and has to create a promising set of rules from the rule grammar. Additionally, the game-based approach to annotating a rule can be modeled as an oracle in
Darwin .
6. CONCLUSION
We present
Darwin , an interactive end-to-end system that enables annotators to rapidly label text datasets by identifying precise labeling rules for the task at hand.
Darwin compiles the semantic and syntactic patterns in the corpus to generate a set of candidate heuristics that are highly likely to capture the positive instances in the corpus. The set of candidate heuristics is organized into a hierarchy, which enables
Darwin to quickly determine which heuristic should be presented to the annotators for verification. Our experiments demonstrate the superior performance of
Darwin in a wide range of labeling tasks spanning intent classification, entity extraction, and relationship extraction.

7. REFERENCES
[1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In ACL, 2012.
[2] X. L. Dong and D. Srivastava. Big data integration. Synthesis Lectures on Data Management, 7(1), 2015.
[3] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
[4] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM Conference on Electronic Commerce, 2011.
[5] A. Gruenheid, B. Nushi, T. Kraska, W. Gatterbauer, and D. Kossmann. Fault-tolerant entity resolution with the crowd. arXiv preprint arXiv:1512.00537, 2015.
[6] B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. arXiv preprint arXiv:1805.03818, 2018.
[7] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409-426. Springer, 1994.
[8] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
[9] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han. A survey on truth discovery. ACM SIGKDD Explorations Newsletter, 17(2), 2016.
[10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP, 2009.
[11] S. Petrov, D. Das, and R. McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
[12] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3), 2017.
[13] T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017.
[14] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
[15] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 2012.
[16] P. Varma, B. D. He, P. Bajaj, N. Khandwala, I. Banerjee, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. In NIPS, 2017.
[17] P. Varma and C. Ré. Snuba: Automating weak supervision to label training data. PVLDB, 2019.
[18] J. Wang, G. Cong, W. X. Zhao, and X. Li. Mining user intents in Twitter: A semi-supervised approach to inferring intent categories for tweets. 2015.
[19] X. Wang, A. Feng, B. Golshan, A. Halevy, G. Mihaila, H. Oiwa, and W.-C. Tan. Scalable semantic querying of text. PVLDB, 11(9), 2018.
[20] P. Welinder, S. Branson, S. J. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
[21] F. Yang, Z. Yang, and W. W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
[22] J. Yang, J. Fan, Z. Wei, G. Li, T. Liu, and X. Du. Cost-effective data annotation using game-based crowdsourcing. 2018.
APPENDIX

A. PROOF OF LEMMA 4
Proof.
The expected score of the heuristic function is
$$
\mathbb{E}\Big[\sum_{s \in C_r} X_s\Big] = \sum_{s \in C_r} \mu_s = \sum_{s \in C_r \cap P} \mu_s + \sum_{s \in C_r \setminus P} \mu_s
$$
$$
\leq \sum_{s \in C_r \cap P} \big(\beta + (1-\theta)(1-\beta)\big) + \sum_{s \in C_r \setminus P} \big(\beta' + (1-\theta)(1-\beta')\big)
$$
$$
= \big(\beta + (1-\theta)(1-\beta)\big)\, p\, |C_r| + (1-p)\, |C_r|\, \big(\beta' + (1-\theta)(1-\beta')\big)
$$
$$
\leq \big(\beta + (1-\theta)(1-\beta)\big)\, |C_r|.
$$

B. PROOF OF LEMMA 5
Proof.
The score of heuristic function $r$ is $\sum_{s \in C_r} X_s$. The expected value of the score is calculated in Lemma 4. Using Hoeffding's inequality,
$$
\Pr\Big[\frac{1}{|C_r|}\sum_{s \in C_r} X_s \geq (1+\epsilon)\,\frac{\mu_r}{|C_r|}\Big] \leq 2e^{-\epsilon \mu_r} = 2e^{-\epsilon (\beta + (1-\theta)(1-\beta))\,|C_r|} = 2e^{-\ln n} = \frac{2}{n}.
$$
This shows that $\sum_{s \in C_r} X_s$ is smaller than $(1+\epsilon)\,(\beta + (1-\theta)(1-\beta))\,|C_r|$ with probability more than $1 - 2/n$.

C. PROOF OF LEMMA 6
Proof.
Using Lemmas 3 and 5, we can observe that the calculated benefit of heuristic $h$ is at least $(1-\epsilon)\,\theta\beta'\,|C_h|$ with probability $1 - 2/n$. Similarly, the score of $r$ is at most $(1+\epsilon)\,(\beta + (1-\theta)(1-\beta))\,|C_r|$ with probability $1 - 2/n$. This shows that
$$
\text{score}(h) > \text{score}(r) \tag{1}
$$
$$
(1-\epsilon)\,\theta\beta'\,|C_h| > (1+\epsilon)\,(\beta + (1-\theta)(1-\beta))\,|C_r| \tag{2}
$$
$$
\frac{|C_h|}{|C_r|} > \frac{(1+\epsilon)\,(\beta + (1-\theta)(1-\beta))}{(1-\epsilon)\,\theta\beta'} \tag{3}
$$
$$
\frac{|C_h|}{|C_r|} > \alpha \tag{4}
$$
where $\alpha$ is a constant.

D. ADDITIONAL EXPERIMENTS
Sensitivity to
HybridSearch ’s traversal parameters.
Here, we study to what extent
Darwin ’s performance is sensitive to the parameter τ in the HybridSearch traversal algorithm. Recall that τ determines how often the HybridSearch algorithm switches between exploiting the local structure of the hierarchy and evaluating all possible candidates using the classifier.

Figure 12: Sensitivity of Darwin to τ and seed rules on musicians : (a) varying τ ∈ {3, 5, 7, 9}; (b) varying the seed rule (Rules 1-3).

Figure 13: Sensitivity of Darwin (HS) to the number of candidates generated.

Figure 12a shows that Darwin (HS) performs very similarly when varying τ. The solution quality tends to improve slightly as τ increases, because the effective rules for the musicians dataset are not close to each other in the hierarchy. Note, however, that choosing large values of τ can affect the efficiency of the pipeline. More precisely, large values of τ force the HybridSearch system to rely on its internal classifier to evaluate all rule candidates for too many steps, which can be quite time consuming.
Sensitivity to seed rule.
This experiment establishes that
Darwin performs robustly given different types of input seed rules. Focusing on the musicians dataset, we initialize
Darwin with the following seed rules. Rule 1 is the keyword ‘ composer ’, stating that any sentence containing this word mentions a musician. Rule 2 is the keyword ‘ piano ’, and finally Rule 3 is the sentence ‘
Beethoven taught piano to the daughters of Hungarian Countess Anna Brunsvik. ’. Note that Rule 2 is an extremely generalized version of Rule 3. Figure 12b compares the performance of
Darwin (HS) for all three seed rules.
Darwin performs equally well on all three types of input seed rules. We can observe that for Rule 3,
Darwin requires the initial 8 queries to generalize the seed rule; as soon as it identifies a rule with high coverage, it performs very similarly to the other seed rules.
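For concreteness, the three seeds can be encoded as simple sentence predicates. This encoding is purely illustrative; Darwin accepts seeds expressed in the configured rule grammar.

rule_1 = lambda s: "composer" in s.lower()      # keyword seed (Rule 1)
rule_2 = lambda s: "piano" in s.lower()         # keyword seed (Rule 2)
seed = ("Beethoven taught piano to the daughters of "
        "Hungarian Countess Anna Brunsvik.")
rule_3 = lambda s: s == seed                    # single-sentence seed (Rule 3)

print(rule_2(seed))   # True: Rule 2 is a generalization of Rule 3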
Sensitivity to number of generated candidates.
One of the parameters of the
Darwin framework is the number of candidates generated by the candidate-rule generation component. In our experiments,
Darwin generates 10K candidate rules with high coverage and organizes them into an index. The goal is to make sure the set of generated candidates contains some (if not all) of the precise rules. Choosing a large value for the index size would satisfy this objective but hurts efficiency, as it increases the number of candidates that the
UniversalSearch and
HybridSearch algorithms need to consider. We observed that generating 10K candidates per iteration helped
Darwin identify precise candidate rules. Figure 13 shows that the performance of the
Darwin (HS) algorithm is consistently similar across different numbers of generated candidate rules.
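A simplified sketch of this candidate-generation step is shown below. Darwin ’s actual generation is driven by the rule grammar and the pre-computed index, so the frequency-ranked n-grams here are only an approximation of the idea:

from collections import Counter

def generate_candidates(corpus, max_n=4, k=10_000):
    # Count all n-grams up to length max_n and keep the k most frequent
    # ones as candidate rules.
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return [phrase for phrase, _ in counts.most_common(k)]

corpus = ["best way to get to the hotel",
          "shuttle to the hotel from the airport"]
print(generate_candidates(corpus, max_n=2, k=5))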
Sensitivity to classifier quality.
Figure 14: Effect of classifier quality on Darwin (HS) (on the musicians dataset).

Figure 14 compares the performance of the HybridSearch strategy on the musicians dataset while varying the number of epochs for which the neural network classifier is trained. With more epochs, the classifier tends to overfit more to the training data. We measure the number of questions from
Darwin (HS) to the oracle in order to label at least 75% of the positive sentences. It is evident that the