Adaptive Rule Discovery for Labeling Text Data
Sainyam Galhotra
UMass Amherst
[email protected]

Behzad Golshan
Megagon Labs
[email protected]

Wang-Chiew Tan
Megagon Labs
[email protected]
ABSTRACT
Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines, and the emergence of automated feature generation techniques such as deep learning, which typically requires a lot of training data, has further exacerbated the problem. While weak-supervision techniques have circumvented this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel), or require a labeled subset of the data to automatically mine rules (e.g., Snuba). The process of manually writing rules can be tedious and time consuming. At the same time, creating a labeled subset of the data can be costly and even infeasible in imbalanced settings, because a random sample in such settings often contains only a few positive instances.

To address these shortcomings, we present Darwin, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, Darwin automatically generates a set of candidate rules for the labeling task at hand, and utilizes the annotator's feedback to adapt the candidate rules. We describe how Darwin is scalable and versatile: it can operate over large text corpora (i.e., more than 1 million sentences) and supports a wide range of labeling functions (i.e., any function that can be specified using a context-free grammar). Finally, we demonstrate with a suite of experiments over five real-world datasets that Darwin enables annotators to generate weakly-supervised labels efficiently and at a small cost. In fact, our experiments show that rules discovered by Darwin on average identify 40% more positive instances compared to Snuba, even when Snuba is provided with 1000 labeled instances.
PVLDB Reference Format:
Sainyam Galhotra, Behzad Golshan, and Wang-Chiew Tan. Adaptive Rule Discovery for Labeling Text Data. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
1. INTRODUCTION
Today, many applications are powered by machine learning techniques. The success of deep learning methods in domains such as natural language processing and computer vision is further fuelling this trend. While deep learning (and machine learning in general) can offer superior performance, training such systems typically requires a large set of labeled examples, which is expensive and time-consuming to obtain.
Weak supervision techniques circumvent the above problem to some extent by leveraging heuristic rules that can generate (noisy) labels for a subset of the data. A large volume of labels can be obtained at a low cost this way, and to compensate for the noise, noise-aware techniques can be used to further improve the performance of machine learning models [12, 6]. However, obtaining high-quality labeling heuristics remains a challenging problem. A subset of existing frameworks, with Snorkel [12] being the most notable example, rely on domain experts to provide a set of labeling heuristics, which can be a tedious and time-consuming task. In contrast, other frameworks aim to automatically mine useful heuristics using additional supervision. For instance, Snuba [17] circumvents the dependence on domain experts by requiring a labeled subset of the data, and then utilizing it to automatically derive labeling heuristics. Babble Labble is another example, which asks experts to label a few examples and explain their choices (in natural language); this explanation is used to derive labeling heuristics. While these approaches have been quite effective in certain settings, we elicit their limitations with the following real-world example on text data.
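To make the manual workflow concrete, the following is a minimal sketch of the shape of a hand-written labeling heuristic of the kind such frameworks consume. The label constants, function name, and keyword list are illustrative choices of ours, not part of any particular framework's API.

```python
# A minimal sketch of a hand-written labeling heuristic of the kind
# weak-supervision frameworks consume. The label constants and the
# keyword list are illustrative; they are not tied to any framework API.
POSITIVE, ABSTAIN = 1, -1

def lf_transport_keywords(sentence: str) -> int:
    """Vote POSITIVE if the sentence mentions a transportation keyword;
    otherwise abstain and let other heuristics (or a label model) decide."""
    keywords = {"shuttle", "taxi", "bus", "airport"}
    tokens = set(sentence.lower().replace("?", "").split())
    return POSITIVE if tokens & keywords else ABSTAIN
```

Writing a diverse collection of such functions by hand is exactly the burden the approaches below try to reduce.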
Example 1. Consider a corpus of questions submitted by a hotel's guests to the concierge. Our goal is to build an intent classifier to find (and label) the set of questions asking for directions or means of transportation from one location to another. Below is a sample of messages from the corpus with positive instances marked with +.

S1: What is the best way to get to SFO airport? (+)
S2: Is there a bart from SFO to the hotel? (+)
S3: What is the best way to check in there? (-)
S4: Is Uber the fastest way to get to the airport? (+)
S5: Would Uber Eats be the fastest way to order? (-)
S6: What is the best way to order food from you? (-)

To be more concise, we refer to heuristic rules simply as heuristics or rules throughout the paper.

Figure 1: Comparing weak-supervision frameworks.

Relying on domain experts to provide labeling heuristics for tasks such as the one presented in Example 1 is a common approach, but it has a number of shortcomings:
• It is time consuming. Annotators must be familiar with the rule language (e.g., Stanford's Tregex or AI2's IKE language). Moreover, they need to be acquainted with the dataset to specify useful rules, i.e., rules that label a reasonable number of instances with a small amount of noise. This is normally done with trial-and-error and fine-tuning of the rules on a sample of the corpus, which can be quite tedious.
• Oftentimes, some useful rules remain undiscovered. This is because annotators may miss important keywords or possess limited domain knowledge. For example, the word 'bart' (which refers to a transportation system in California) is clearly useful for the task in Example 1. However, annotators may miss the important keyword 'bart', or they may not even know what 'bart' is (especially those who are not from the area).
• It yields rules with overlapping results. If multiple annotators work on writing rules independently, they are likely to end up with identical or similar rules. Hence, the number of distinct labels obtained does not always grow linearly with the number of annotators, which is rather inefficient.

The alternative approach would be to automatically mine useful heuristics with systems such as Snuba or Babble Labble. Both systems require a set of labeled instances (accompanied by natural language explanations in the case of Babble Labble), which can be costly and oftentimes infeasible to collect in imbalanced settings. For instance, while Example 1 shows a balanced number of positive and negative instances, in practice the positive instances often make up only a tiny fraction of the entire corpus. Hence, labeling a random sample would not be sufficient to obtain enough positive instances. Consequently, automatically inferring heuristic rules is not feasible using the few discovered positive instances.

To mitigate the above issues, we present
Darwin, an adaptive rule discovery system for text data. Figure 1 highlights how Darwin compares with other state-of-the-art weak-supervision frameworks. Compared to Snuba, Darwin requires far fewer labeled instances; in fact, as we show in our experiments, a single labeling rule (or a couple of labeled instances) is sufficient for Darwin. Compared to Snorkel and Babble Labble, Darwin requires a lower degree of supervision by domain experts. More explicitly, Darwin requires experts to simply verify the suggested heuristics, while Snorkel requires them to manually write such rules and Babble Labble requires them to provide explanations for why a particular label is assigned to a given data point.

Figure 2: Sample query to annotators.

Given a corpus and a seed labeling rule, Darwin identifies a set of promising candidate rules. The best candidate rule (along with a few example instances matching the rule) is then presented to the annotator to confirm whether or not it is useful for capturing the positive instances. Figure 2 presents an example of this step for the intent described in Example 1: the annotator is presented with examples that satisfy the rule and asked whether the rule is useful for the intent (a YES/NO question). Based on the response,
Darwin adaptively identifies the next set of promising candidate rules. This interactive process, where rules are illustrated with examples, helps annotators identify the most effective set of rules without the need to fully understand the corpus or the rule language. Our contributions are as follows.
• Darwin supports any rule language that can be specified using a context-free grammar. Therefore, it can generate a wide range of rules, from simple phrase matching to complex conditions over the dependency parse trees of the sentences.
• Darwin can effectively identify rules over large text corpora, even when the number of candidate rules is exponentially large. In fact, we show that verifying 50 heuristics suggested by Darwin is enough to achieve an F1-score of 0.80. Furthermore, we present theoretical results on the approximation guarantees of Darwin.
• Darwin does not require annotators to be familiar with the rule language. By analyzing the similarity and the overlap between the sets of sentences matching different rules, Darwin automatically surfaces patterns in the data; it also supports parallel discovery of rules by asking different annotators to evaluate different rules.
• We demonstrate how Darwin can be used for a variety of labeling tasks: classifying intents, finding sentences that mention particular entity types, and identifying sentences that describe certain relationships between entities (i.e., relation extraction).

In the following sections, we define our problem, describe Darwin, and demonstrate its effectiveness and efficiency through a suite of experiments. Specifically, we show that Darwin outperforms other baseline approaches in its ability to generate a larger set of labeled examples by asking a limited number of questions.
2. PRELIMINARIES & PROBLEM DEFINITION
In a nutshell, Darwin takes as input an unlabeled corpus of sentences along with an initial seed labeling heuristic (which is assumed to generate at least two positive instances). Darwin then identifies promising candidate labeling heuristics and leverages an oracle to verify whether a particular candidate heuristic is effective at capturing positive instances. Finally, the set of discovered heuristics is forwarded to Snorkel [12] to train a high-precision classifier. (Note that Snorkel provides both a framework for writing labeling rules and tools for training noise-aware models; here we refer to the latter.) Before describing Darwin's rule discovery pipeline in detail, we provide a formal definition of labeling heuristics along with a description of an oracle.
Heuristic search space. Naturally, labeling heuristics can be of different types with distinct semantics. For example, a heuristic may check for certain phrases in a sentence [17], or it may enforce some conditions on the parse tree [19] and POS tags of a sentence. In Darwin, the space of possible heuristics is specified using a collection of Heuristic Grammars, where each grammar describes a particular type of labeling heuristic. These concepts are formally defined as follows.

Definition 1 (Heuristic Grammar). A Heuristic Grammar G is a Context-Free Grammar (CFG). Recall that a CFG consists of a collection of derivation rules.

For a given heuristic grammar G, we define labeling heuristics as follows.

Definition 2 (Labeling Heuristic).
A labeling heuristic r is a derivation of the grammar G. We use C_r to denote the set of sentences in the corpus that satisfy the heuristic r, and refer to |C_r| as its coverage.

To further clarify the above definitions, let us consider a simplified regular expression grammar called TokensRegex, which captures all regular expressions over tokens with the '+' and '*' operators. This grammar can be formally written as a CFG as shown below. We use TokensRegex to explain Darwin's pipeline; Darwin's functionality is not restricted to this grammar, and we discuss a more complex grammar shortly.

Example 2 (TokensRegex Grammar). Let V denote the set of all possible words. The regular expression grammar over tokens comprises the following derivation rules:

A → vA  (∀v ∈ V)
A → A + A
A → A * A
A → ε

Figure 3: An example of a dependency parse tree (for the sentence 'Is Uber the best way to our hotel?', with POS tags such as PROPN, VERB, NOUN, ADJ).

The TokensRegex grammar allows any regular expression over words as a candidate labeling heuristic. For example, it generates heuristics such as 'best way to' or 'shuttle' as well as less meaningful heuristics such as 'shuttle is airport' as candidates for the task described in Example 1. A sentence satisfies such a heuristic if it contains the corresponding phrase. The sentences s1, s3, and s6 in Example 1 satisfy the heuristic r = 'best way to'; hence C_r = {s1, s3, s6}.
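As a concrete illustration, the sketch below (ours) computes the coverage set C_r of a TokensRegex-style phrase heuristic; it assumes lower-cased sentences and treats a phrase heuristic as substring containment, which suffices for contiguous-phrase rules.

```python
from typing import List

def coverage(corpus: List[str], phrase: str) -> List[int]:
    """Return the ids of sentences satisfying a phrase heuristic r,
    i.e., the coverage set C_r; |C_r| is the coverage of r."""
    r = phrase.lower()
    return [i for i, s in enumerate(corpus) if r in s.lower()]

corpus = [
    "What is the best way to get to SFO airport?",   # s1
    "Is there a bart from SFO to the hotel?",        # s2
    "What is the best way to check in there?",       # s3
]
print(coverage(corpus, "best way to"))  # -> [0, 2], i.e., C_r = {s1, s3}
```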
Darwin comprises of two differentgrammars (a) TokensRegex (b) TreeMatch, with the abil-ity to plug in more heuristic grammars as long as they arecontext-free. While TokensRegex is capable of capturinglexical patterns and phrases, it fails to capture syntacticpatterns and pattern over parse trees. TreeMatch grammarcaptures such patterns to identify more complex and genericheuristic functions.
Definition 3 (TreeMatch Grammar). Let V denote the set of terminals comprising all the tokens and Part-of-Speech (POS) tags [11] present in the corpus (e.g., NOUN, VERB, etc.).

Derivation Rules: The grammar has three fundamental operations that make up a heuristic, namely And (∧), Child (/), and Descendant (//). The symbol 'a/b' implies that terminal 'b' should be a child of terminal 'a' in the dependency parse tree. The symbol 'a//b' implies that terminal 'b' should be a descendant of terminal 'a' in the parse tree. Given that, the derivation rules of the grammar are:

A → A / A
A → A ∧ A
A → A // A
A → v  (∀v ∈ V)

It is important to mention that the complexity of heuristics that can be specified using the TreeMatch grammar exceeds what rule-mining frameworks such as Snuba or Babble Labble can capture.
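To illustrate how a TreeMatch condition can be checked against a parse, here is a sketch of ours using spaCy's dependency parses. It assumes an installed English model and checks a single 'a/b' (child) or 'a//b' (descendant) condition, with terminals matching either the token text or its POS tag, as in Definition 3; conjunctions (∧) would simply AND several such checks.

```python
# A sketch of checking one TreeMatch condition against a dependency parse.
# Assumes spaCy with the en_core_web_sm model installed; helper names are ours.
import spacy

nlp = spacy.load("en_core_web_sm")

def matches_terminal(token, terminal: str) -> bool:
    # A terminal is either a literal token or a POS tag such as NOUN.
    return token.text.lower() == terminal.lower() or token.pos_ == terminal

def satisfies(sentence: str, head: str, dep: str, descendant: bool = False) -> bool:
    """Check 'head/dep' (child) or, with descendant=True, 'head//dep'."""
    doc = nlp(sentence)
    for tok in doc:
        if matches_terminal(tok, head):
            pool = tok.subtree if descendant else tok.children
            if any(matches_terminal(t, dep) for t in pool if t is not tok):
                return True
    return False

# e.g., is some token a child / descendant of 'way' in the parse of Figure 3?
print(satisfies("Is Uber the best way to our hotel?", "way", "to"))
print(satisfies("Is Uber the best way to our hotel?", "way", "NOUN", descendant=True))
```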
Oracle Abstraction. Finally, we formalize the feedback that we may obtain from a single annotator, a group of annotators, or a crowd-sourcing platform using the notion of Oracles, as follows.

Definition 4 (Oracle). An Oracle O is a function which, given a heuristic r and a few samples from its coverage set C_r, outputs a YES/NO answer indicating whether or not r is adequately precise.

An Oracle plays the role of a perfect annotator who always answers the questions correctly. In practice, annotators may provide incorrect answers (as we show in our experiments), but the notion of an oracle enables us to provide insights into the theoretical aspects of our problem.
Problem statement. We are now ready to formally define our problem. Given a labeling task, our goal is to find a set R of labeling heuristics such that the union of the coverage of the heuristics in R, denoted as P = ∪_{r∈R} C_r, has a high recall (i.e., contains a high ratio of the positive instances in the corpus). We would like to maximize the recall of the set P without posing too many queries to the oracle. We empirically observed that users label a heuristic as precise only when the heuristic has precision at least 0.8. Hence, in this paper we do not focus on optimizing the precision of heuristics, for which we can also rely on various de-noising techniques from the weak supervision literature [12].

Problem 1 (Maximize Heuristics Coverage). Given a corpus S, a seed labeling function r, an oracle O, and a budget b, find a set R of labeling heuristics using at most b queries to the oracle, such that the recall of the set P, i.e., the union of the coverage of the heuristics in R, is maximized.

Lemma 1. The Maximize Heuristics Coverage problem is NP-hard.

Proof. We show the hardness of our problem by reducing the maximum-coverage problem to an instance of our problem. Given a collection of sets A = {A_1, A_2, ..., A_n} and a budget b, the maximum-coverage problem aims to find b sets from A such that the size of their union is maximized. Given an instance of the maximum-coverage problem, we create an instance of our problem as follows. For each set A_i, we define a heuristic r_i with coverage set C_{r_i} = A_i and mark all the instances as positives. Consequently, the oracle O would always respond with a YES, as all the heuristics are perfectly precise. Now, it is easy to see that the coverage of the set P in our setting is equivalent to the coverage of the selected sets in the maximum-coverage problem. As a result, if our heuristic discovery problem could be solved in polynomial time, then the corresponding sets would form the optimal maximum-coverage solution. Hence, our problem is also NP-hard.

Note that while we focus on maximizing the recall, it is also useful to report the performance of the classifier that is trained using our weakly-supervised labels. Therefore, in our experiments we also record the F-score of our trained classifier to provide a better evaluation of Darwin.
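The reduction also points at the standard baseline: if all coverage sets were known up front (which they are not in our setting, since Darwin must discover them through oracle queries), the classical greedy strategy achieves the well-known (1 − 1/e) approximation for budgeted maximum coverage. A sketch of ours for reference:

```python
from typing import Dict, Set

def greedy_max_coverage(cov: Dict[str, Set[int]], b: int) -> Set[str]:
    """Classical greedy for maximum coverage under a budget of b picks:
    repeatedly choose the heuristic adding the most uncovered instances."""
    cov = {r: set(s) for r, s in cov.items()}  # defensive copy
    chosen, covered = set(), set()
    for _ in range(b):
        best = max(cov, key=lambda r: len(cov[r] - covered), default=None)
        if best is None or not cov[best] - covered:
            break
        chosen.add(best)
        covered |= cov.pop(best)
    return chosen
```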
3. THE Darwin SYSTEM
In this section, we describe the architecture of Darwin, which is illustrated in Figure 4. Darwin operates in multiple phases that aim to identify a diverse set of heuristics for identifying positives. The pipeline is initialized with a seed labeling function or a couple of positive sentences. Darwin learns a rudimentary classifier using these positive sentences, and the classifier is refined as the training data evolves. In order to identify new heuristics, Darwin leverages the following properties. (i) The generalizability of the trained classifier helps guide the search towards semantically similar heuristics. For example, on identifying the importance of 'bus' as a heuristic, Darwin identifies 'public transport' as another possibility due to their related semantics. (This generalization is possible via word embeddings, which are provided as input to the classifier; we provide more details of our classifier in the experiments.) (ii) Local structural changes to the already identified heuristics help identify new heuristics. For example, given 'What is the best way to the hotel?' as the input seed sentence, Darwin constructs local modifications by dropping and adding tokens (derivation rules in general), identifying the new heuristic 'shuttle to the hotel'. Darwin leverages these intuitions to adaptively refine the search space and simultaneously learn a precise classifier with high coverage over the positives.

Before describing the architecture, we define a data structure that is critical for the efficient execution of Darwin. All candidate heuristics considered by Darwin are organized in the form of a hierarchy. This hierarchy captures the subset/superset relationships among heuristics: heuristics with higher coverage are placed closer to the root, and the ones with lower coverage are closer to the leaves. For example, 'best way to the hotel' is a subset of 'best way to' and will be its descendant. One of the key properties of this hierarchical structure is that if a heuristic r is identified to capture positives, then none of its subsets (descendants in the hierarchy) captures any new positive. This data structure has O(1) update time to identify the subsets of a heuristic. Additionally, it is helpful for the efficient execution of local structural changes to any heuristic. All these benefits will be discussed in detail in the later sections.

Algorithm 1 presents the pseudocode of the end-to-end Darwin architecture.

Algorithm 1: Darwin
Input: input corpus S, seed heuristic r, budget b
Output: collection of heuristics R, set of positive instances P, classifier C
1: I ← generate_index(S)
2: Q ← ∅; R ← {r}
3: P ← coverage(r)
4: C, P′ ← train_classifier(P, {r}, S)
5: while |Q| ≤ b do
6:   H ← generate_hierarchy(S, P′, I)
7:   q ← traversal(H, P, Q, C)
8:   if oracle_query(q) then
9:     R ← R ∪ {q}; P ← P ∪ coverage(q)
10:    C, P″ ← train_classifier(R, P, S)
11:    P′ ← P″ \ P′
12:  H ← update_scores(H)
13:  Q ← Q ∪ {q}
14: return R, P, C
Darwin's input consists of the corpus to be labeled, a collection of heuristic grammars, and one (or more) seed labeling function(s). Alternatively, a set of positive instances can be provided instead of seed labeling heuristics. The output of Darwin is the set of generated heuristics, the positive instances that are discovered, and a classifier that is trained using the labeled data.

Before the heuristic-discovery phase begins, Darwin creates an index over the input corpus for fast access to the sentences that satisfy a given heuristic (more details in Section 3.1). The heuristic-discovery phase is an iterative process where Darwin interacts with annotators and uses their feedback to identify new candidates and ask further queries. In a nutshell, each iteration consists of the following steps. First, the Candidate Generation component generates a small set of promising candidate heuristics (from the space of all possible heuristic functions) and organizes them in the form of a hierarchy H (line 6), with the most generic functions at the top and the stricter ones at the bottom. We will describe shortly how H is generated and used to prune less effective heuristics. Once the hierarchy is built, Darwin's Hierarchy Traversal component carefully navigates and evaluates the heuristics in the hierarchy to find the best candidate (line 7). The best candidate is then presented to the annotator (line 8). Finally, the updated classifier and the scores of the heuristics are sent back to hierarchy generation to identify new candidates and perform the traversal for the next iteration. We describe the details of these components next.

Figure 4: Darwin's architecture.

Figure 5: Examples of derivation sketches for sentences s1 and s4.
3.1 Index Creation

Darwin creates an index for the input corpus to provide fast access to sentences that satisfy certain heuristics. This index aims at constructing a space-efficient representation of each sentence in the corpus for the efficient execution of subsequent steps involving traversal through the various candidate heuristics. The hierarchical structure of this index is very similar to that of a trie.

Given a collection of heuristic grammars {G_1, ..., G_t} and a sentence s, one can enumerate the set of all possible heuristics of G_i, generated using a fixed number of derivation rules, that s satisfies. For example, using the TokensRegex grammar, the set of all heuristics that a sentence s satisfies is the set of all regular expressions that correspond to s. We organize the set of heuristics matching a sentence s into a structure called the Derivation Sketch, which summarizes the derivations of all heuristics that match s. Figure 5 shows (parts of) the derivation sketches for sentences s1 and s4. These sketches are merged into an index I, which is a compact representation of all heuristics that are satisfied by at least one sentence in the corpus. Each node in I represents a heuristic labeling function and stores the number of sentences that satisfy it, pointers to the children in the index, and an inverted list that points to the sentences that satisfy the heuristic.

The index I is created by merging the derivation sketches of sentences, one at a time. The index is first initialized with the derivation sketch of the first sentence. Thereafter, for every new sentence s, the derivation sketch of s is merged into I as follows. The root node of the sketch and the root node of I are merged, and then all nodes (starting from the merged root) are considered in a breadth-first fashion; the children of the node under consideration which are derived using the same derivation rule are merged together. For every node that gets merged, the count of the merged node is increased by one, and the inverted list at that node is updated to include the new sentence. Figure 6 shows the index built from the derivation sketches of s1 and s4.

Figure 6: An example of the index creation process.
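For the TokensRegex grammar, a sentence's derivation sketch essentially reduces to its contiguous token n-grams up to a maximum depth, so merging sketches into I amounts to trie insertion. The following is a simplified sketch of ours of that construction; the class and field names are our own, not Darwin's actual implementation.

```python
from collections import defaultdict

class IndexNode:
    """One node of the index I: the heuristic given by the token path
    from the root, its sentence count, an inverted list, and children."""
    def __init__(self):
        self.count = 0
        self.sentences = []                      # inverted list of sentence ids
        self.children = defaultdict(IndexNode)

def build_index(corpus, max_depth=10):
    root = IndexNode()
    for sid, sentence in enumerate(corpus):
        tokens = sentence.lower().split()
        # Merging a sentence's sketch into I is a sequence of trie inserts,
        # one per contiguous n-gram of bounded length.
        for start in range(len(tokens)):
            node = root
            for tok in tokens[start:start + max_depth]:
                node = node.children[tok]
                if not node.sentences or node.sentences[-1] != sid:
                    node.sentences.append(sid)
                    node.count += 1
    return root
```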
TreeMatch Grammar: This grammar has more operators than TokensRegex and can generate exponentially more candidate heuristics. The derivation sketch can be created, as explained above, by enumerating all sequences of derivation rules up to a fixed number of steps. However, a more compact derivation sketch for the TreeMatch grammar is simply the dependency parse tree of the sentence, as we can use it to quickly check whether a heuristic matches the parse tree or not [19]. Figure 3 shows the dependency parse tree of a sentence, which can serve as its derivation sketch as well. Given the exponentially many candidate heuristics generated by this grammar, the candidate generation step is crucial for ignoring useless heuristics and thereby helping the subsequent stages to focus on meaningful heuristics. We evaluate the performance of Darwin with this grammar in the next section.
As mentioned earlier, the number of possible heuristics under a given grammar G is often exponential in the size of the dictionary. The task of the heuristics-hierarchy generation component is to generate a manageable set of promising candidate heuristics from the space of all possible heuristics, and to organize the generated candidates in a hierarchy that captures the subset/superset relationships between the heuristics. Specifically, the hierarchy generation process consists of the following steps. First, the Candidate Generation step generates a subset of possible heuristics that have high coverage over the set of positive instances discovered so far; this algorithm operates in a greedy best-first search fashion to identify valuable candidates. These heuristics are promising as they already have some overlap with the existing positive instances. Next, these candidates are arranged in the form of a hierarchy along with the subset/superset edges between them. We describe these steps in detail next.

Algorithm 2: Candidate-Heuristic Generation
Input: index I, set of positive instances P, number of desired heuristics k
Output: collection of heuristics R
1: R ← {∗}; recentHeuristic ← ∗; candidates ← ∅
2: while |R| ≤ k do
3:   candidates ← candidates ∪ Children(recentHeuristic, I)
4:   sortedCandidates ← CoverageSort(candidates, P)
5:   recentHeuristic ← sortedCandidates[0]
6:   candidates ← candidates \ {recentHeuristic}
7:   R ← R ∪ {recentHeuristic}
8: return R

The candidate generation step uses the index I to generate a set of heuristic labeling functions with high coverage over the set of positive instances P that have been discovered so far by Darwin. Note that heuristics that (at least partially) cover the set of discovered positive instances are likely to be good heuristics and help detect more positive instances. To efficiently find such heuristics, we rely on one of the interesting properties of the index I: recall that the count of a node u ∈ I refers to the total number of sentences that contain the tokens along the path from the root to u in their derivation sketches. As descendant nodes correspond to stricter heuristics, the coverage of a heuristic corresponding to a node is never less than the coverage of any of its descendants in the index. Thus, we use a greedy algorithm to identify a collection of diverse heuristics that have high coverage over the set of positive instances P.

Algorithm 2 generates candidate heuristics by exploiting the property described above. The set of candidate heuristics is initialized with the heuristic '∗', which refers to the root of the index I; this heuristic matches all possible sentences in the corpus. In each iteration, the algorithm adds the children of the previous iteration's best candidate heuristic to the candidate list (line 3). The candidates are then sorted in decreasing order of coverage over the set P (line 4). The candidate with the highest coverage is removed from the candidate list and appended to the list of final results R (lines 6-7). This process is repeated until there are k heuristics in R. Note that the time complexity of this greedy algorithm is linear in the number of candidates generated.

Other constraints can also be added to the candidate-heuristic generation phase to ensure that the generated heuristics satisfy additional criteria. For example, Darwin can apply heuristics to ensure that the candidate heuristics are diverse in terms of the set of derivation rules used to derive them, their level in the index I, and the set of instances they cover. Some of these heuristics help Darwin avoid having to evaluate many similar candidate heuristics.
Hierarchical Arrangement and Edge Discovery. The candidates returned by Algorithm 2 have high coverage over the positives (discovered so far). This component iterates over the generated heuristics to arrange them into a hierarchy H following the same parent/child relationships that the index I captures, adding an edge between each such pair: a heuristic r1 is a child of r2 if it can be obtained by applying a derivation rule to r2.

This hierarchical arrangement of heuristics is followed by a cleanup to get rid of heuristics that do not add any new positive sentences beyond the ones already identified. The goal of the cleanup is to improve the efficiency and space complexity of Darwin, as the traversal component will never query a heuristic that does not add any new positives.
The result of the heuristic-hierarchy generation is a hierarchy H of promising heuristics. The hierarchy traversal module determines which heuristic in the hierarchy is the best one to be submitted to the oracle. We present three hierarchy traversal techniques: LocalSearch, UniversalSearch, and HybridSearch. At a high level, LocalSearch relies on the hierarchy structure to select the next best candidate from the immediate neighborhood of heuristics verified by the oracle in the past. In contrast, UniversalSearch ignores locality constraints and selects the heuristic with the maximum benefit globally. Finally, the HybridSearch traversal combines the first two techniques to find the next best heuristic, and is more robust than LocalSearch and UniversalSearch. All three techniques work in an iterative fashion, and in each iteration, the criterion for selecting a heuristic to be sent to the oracle is based on how beneficial the heuristic is, which we elaborate on next.
Benefit of a heuristic r: The benefit of a heuristic r is the expected gain in the positive set P upon choosing r. More formally, the benefit is quantified as Σ_{s ∈ C_r \ P} p_s, where p_s is the probability of sentence s being a positive instance. In Darwin, these probability values are estimated by training a classifier (any short text classifier would be ideal for this task) using the set of positive instances discovered so far, with random instances sampled from the corpus as negatives. The probability estimates improve as the system iteratively discovers more heuristics and the classifier is re-trained with more positive training examples.
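A sketch of ours of this selection criterion, assuming p maps each sentence id to the classifier's current positive-probability estimate and cov maps each heuristic to its coverage set:

```python
from typing import Dict, Set

def benefit(cov_r: Set[int], positives: Set[int], p: Dict[int, float]) -> float:
    """Benefit of a heuristic r: the expected number of *new* positives,
    i.e., the sum of classifier scores over C_r \\ P."""
    return sum(p[s] for s in cov_r - positives)

def most_beneficial(cov: Dict[str, Set[int]], positives: Set[int],
                    p: Dict[int, float]) -> str:
    """Pick the candidate with maximum benefit -- the selection criterion
    shared by all three traversal strategies."""
    return max(cov, key=lambda r: benefit(cov[r], positives, p))
```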
We describe our three traversal techniques next.

LocalSearch. The LocalSearch traversal algorithm (Algorithm 3) exploits the local hierarchy structure around the heuristics already identified as useful by the oracle to identify the next best heuristic for querying. Specifically, LocalSearch maintains a set of candidate heuristics and selects the most beneficial heuristic r from the candidates. If the oracle confirms that r is adequately precise, then it adds r's parents into the candidate set, as they are generalizations of r and might be helpful at capturing more positive instances. However, if the oracle labels r as a noisy heuristic, LocalSearch instead adds the children of r to the candidate set, with the hope that a specialized version of heuristic r might be less noisy.

Algorithm 3: LocalSearch Traversal
Input: heuristic hierarchy H, seed heuristic r
Output: collection of positive instances P, collection of heuristics R
1: QueryCount ← 0; R ← {r}; P ← C_r; C ← TrainClassifier(P)
2: localCandidates ← {r}
3: while QueryCount < b do
4:   r ← GetMostBeneficialCandidateHeuristic(localCandidates, C)
5:   QueryCount ← QueryCount + 1
6:   if OracleResponse(r) is YES then
7:     R ← R ∪ {r}; P ← P ∪ C_r
8:     localCandidates ← (localCandidates \ {r}) ∪ Parents(r)
9:     C ← TrainClassifier(P)
10:  else
11:    localCandidates ← (localCandidates \ {r}) ∪ Children(r)
12: return P, R

LocalSearch is simple and efficient at utilizing the structure of the hierarchy to find promising heuristics to submit to the oracle. Since the algorithm only explores the local neighborhood of the queried candidates, it has a time complexity of O(dt), where d is the maximum degree of an internal node and t is the number of iterations the algorithm runs for. However, a disadvantage of LocalSearch is that it may require many traversal steps when the initial seed heuristic is quite different from the other precise heuristics the system aims to discover. Also, it does not exploit the similarity and the overlap between the coverage sets of different heuristics. The UniversalSearch algorithm, which we describe shortly, addresses these shortcomings by utilizing a holistic view of the hierarchy.

Efficient Implementation. Since the LocalSearch traversal only explores a node's immediate parents/children, it does not require the entire hierarchy a priori. Hence, in its implementation, we can skip the heuristic-generation component and expand the hierarchy on the fly based on the oracle's feedback.
UniversalSearch. The UniversalSearch algorithm (see Algorithm 4) evaluates all heuristics present in the hierarchy to identify the best heuristic. In each iteration, UniversalSearch omits any heuristic for which the benefit per instance is smaller than 0.5, i.e., where the majority of the instances in C_r are expected to be negatives. Among the remaining heuristics, it chooses the heuristic with the maximum benefit to submit to the oracle. Based on the oracle's feedback, it re-trains the classifier if new positives were discovered, or else it continues by querying the next best heuristic. Note that UniversalSearch captures the best candidates irrespective of the hierarchy structure.

The strength of UniversalSearch is in its capability to identify semantic similarity between heuristics and their matching instances even if they are far apart in the hierarchy. However, it has the following shortcomings: (1) compared to LocalSearch, it is inefficient, as it iterates over all heuristics in the hierarchy to identify the best candidate, and (2) in the absence of enough positive instances, the trained classifier is likely to overfit and not generalize well to other precise heuristics. In such cases, UniversalSearch fails to exploit the structure of the hierarchy to at least find heuristics that are structurally similar to the seed heuristics. We describe the HybridSearch algorithm next, which combines the strengths of UniversalSearch and LocalSearch.
Algorithm 4: UniversalSearch Traversal
Input: heuristic hierarchy H, seed heuristic r
Output: collection of positive instances P, collection of heuristics R
1: QueryCount ← 0; R ← {r}; P ← C_r
2: universalCandidates ← {r′ : r′ ∈ H}
3: C ← TrainClassifier(P)
4: while QueryCount < b do
5:   r ← GetMostBeneficialCandidate(universalCandidates, C)
6:   QueryCount ← QueryCount + 1
7:   if AvgBenefit(r) ≤ 0.5 then continue
8:   if OracleResponse(r) is YES then
9:     R ← R ∪ {r}; P ← P ∪ C_r
10:    C ← TrainClassifier(P)
11:  universalCandidates ← universalCandidates \ {r}
12: return P, R

HybridSearch. HybridSearch (see Algorithm 5) combines the two previous traversal techniques by maintaining a list of local candidates and a list of universal candidates, and imitating the strategies of both traversal algorithms. Starting from the UniversalSearch strategy, the HybridSearch algorithm queries candidate heuristics (with a benefit per instance above 0.5) to the oracle. If the algorithm fails to find a precise heuristic within a fixed number of attempts, then it switches to the LocalSearch strategy. Similarly, if the LocalSearch strategy has no success within a fixed number of attempts, the traversal toggles back to the UniversalSearch strategy. The switch between the two strategies is decided by a parameter τ (by default 5), which denotes the number of unsuccessful attempts before the switch happens. Clearly, higher values of τ discourage switching between the two strategies.

Algorithm 5: HybridSearch Traversal
Input: heuristic hierarchy H, seed heuristic r
Output: collection of positive instances P, collection of heuristics R
1: universalMode ← True; attempt ← 0
2: R ← {r}; P ← C_r; C ← TrainClassifier(P)
3: localCands ← {r}; universalCands ← {r′ : r′ ∈ H}
4: QueryCount ← 0
5: while QueryCount < b do
6:   if attempt ≥ τ then
7:     universalMode ← not universalMode; attempt ← 0
8:   attempt ← attempt + 1
9:   candidates ← universalCands if universalMode else localCands
10:  QueryCount ← QueryCount + 1
11:  r ← GetMostBeneficialCandidateHeuristic(candidates, C)
12:  if universalMode and AvgBenefit(r) ≤ 0.5 then continue
13:  if OracleResponse(r) is YES then
14:    R ← R ∪ {r}; P ← P ∪ C_r
15:    C ← TrainClassifier(P)
16:    localCands ← (localCands \ {r}) ∪ Parents(r)
17:  else
18:    localCands ← (localCands \ {r}) ∪ Children(r)
19:  universalCands ← universalCands \ {r}
20: return P, R

Our empirical evaluation shows that HybridSearch, formed by combining the UniversalSearch and LocalSearch strategies, performs well on all types of datasets, even when the other two traversal algorithms struggle to discover high-quality heuristics. In short, if the trained classifier is noisy (due to a lack of positive instances), HybridSearch exploits the structure of the hierarchy to search for precise heuristics. Similarly, when no precise heuristics are found by LocalSearch, it uses UniversalSearch's ability to generalize to other heuristics.
After each oracle response, Darwin passes the feedback to the score update component to (1) re-train the classifier, (2) re-evaluate the scores of all heuristics in the hierarchy, and (3) update the set of positive instances (if the feedback is positive) and signal the hierarchy generation component to generate new candidate heuristics to be added to the hierarchy.
Analysis of UniversalSearch. In this section, we analyze the ability of the UniversalSearch hierarchy traversal to identify positive sentences within a query budget b. For this analysis, we consider a simple model capturing how positive and negative instances are scored by a classifier. An ideal classifier assigns a score of 1 to positives and 0 to negatives; in practice, however, the scores follow a different distribution, which we model as follows. Let P* denote the collection of positive sentences in the corpus of sentences S. We assume that a reasonable classifier assigns to a positive sentence s ∈ P* a score larger than θ ≥ 0.5 with probability β, and a score less than 1 − θ otherwise. Similarly, the score of a negative sentence s ∈ S \ P* is above θ with probability β′. Naturally, for a classifier that is better than random, β is larger than β′. Under this model, we can show that the set of heuristics R and the corresponding positive set P identified by UniversalSearch are a constant approximation of the optimal solution.

We also make a few assumptions about the hierarchy of heuristics. We assume that the number of heuristics in the hierarchy H is linear in the number of sentences (i.e., O(n)) and that each heuristic has a minimum coverage of Ω(log n). This is a realistic assumption, as we focus on heuristics that can be derived from their context-free grammar using a fixed number of steps, and our algorithm is aimed at identifying heuristics that cover a large fraction of the positives. Under this assumption, we show that at any iteration, a heuristic r chosen by UniversalSearch has coverage |C_r| larger than (1/α) max_{r′∈H} |C_{r′}|, where α is a constant. This guarantees that UniversalSearch identifies a constant fraction of P_OPT positives within a query budget of b, where P_OPT is the total number of positives identified by an ideal algorithm. To bound the estimated coverage of a heuristic r, we use Hoeffding's inequality [7].

Notation. We define a random variable X_s which refers to the score assigned to a sentence s, and let µ_s denote its expected value. The benefit score of a heuristic r is Σ_{s∈C_r} X_s, and its expected value is denoted by µ_r.

Lemma 2. Given a heuristic function r with coverage C_r and precision p, the expected score of the heuristic function is at least θβ′|C_r|.

Proof. The expected score of the heuristic function is

E[Σ_{s∈C_r} X_s] = Σ_{s∈C_r} µ_s = Σ_{s∈C_r∩P*} µ_s + Σ_{s∈C_r\P*} µ_s
≥ Σ_{s∈C_r∩P*} (θβ) + Σ_{s∈C_r\P*} (θβ′)
= (θβ) p |C_r| + (θβ′)(1 − p)|C_r| ≥ θβ′|C_r|.

We use this calculation to bound the score of a heuristic r that covers more than log n sentences.

Lemma 3. Consider a heuristic function r with coverage C_r such that |C_r| = c log n, where c ≥ 4/(ε²θβ′) is a constant. The benefit score of the heuristic is at least (1 − ε)θβ′|C_r| with a probability of 1 − 1/n².

Proof. The score of the heuristic function r is Σ_{s∈C_r} X_s. The expected value of this score (denoted by µ_r) is bounded in Lemma 2. Using Hoeffding's inequality,

Pr[Σ_{s∈C_r} X_s ≤ (1 − ε)µ_r] ≤ e^{−ε²µ_r/2} ≤ e^{−ε²θβ′|C_r|/2} ≤ e^{−2 log n} = 1/n².

This shows that Σ_{s∈C_r} X_s is greater than (1 − ε)θβ′|C_r| with a probability of more than 1 − 1/n².

Using a similar analysis, we identify an upper bound on the heuristic score. Due to space constraints, we defer the proofs to the Appendix.

Lemma 4. Given a heuristic function r with coverage C_r and precision p, the expected score of the heuristic function is at most (β + (1 − θ)(1 − β))|C_r|.

Lemma 5. Consider a heuristic r with coverage C_r such that |C_r| = c log n, where c ≥ 4/(ε²(β + (1 − θ)(1 − β))) is a constant. The score of the heuristic is at most (1 + ε)(β + (1 − θ)(1 − β))|C_r| with a probability of 1 − 1/n².

Using the calculated bounds on the score of a heuristic, we evaluate the conditions under which a particular heuristic is preferred over another.

Lemma 6. Given a pair of heuristic functions r1 and r2 with respective coverages C1 and C2, if C1 has more positives than C2, then the UniversalSearch score of r1 is higher than that of r2 whenever |C1|/|C2| ≥ α, with a probability of 1 − 2/n², where α is a constant.

Using a similar analysis, we can bound the estimated average probability of a heuristic: for a heuristic r with precision p_r, we can show that it is considered for benefit calculation only when p_r > γ, where γ is a constant.

Theorem 1. In the worst case, UniversalSearch provides a constant approximation of Problem 1 with a probability of 1 − o(1).

Proof. In each iteration, the UniversalSearch algorithm sorts the candidate heuristics based on the estimated probability that a randomly chosen sentence from C_r is positive. All these candidates have true precision p_r > γ. Given a pair of heuristics r1 and r2, by Lemma 6, the benefit score of r1 is higher than that of r2 whenever |C_{r1}|/|C_{r2}| > α, with a probability of 1 − 2/n². Let r_OPT be the heuristic chosen by the optimal algorithm. Using a union bound over the O(n²) pairs of heuristics, with high probability the estimated benefit of r_OPT is higher than that of any r′ with |C_{r′}| ≤ |C_{r_OPT}|/α. Therefore, UniversalSearch never chooses a heuristic with coverage smaller than |C_{r_OPT}|/α, with a probability of 1 − o(1). This shows that the total number of positives identified by UniversalSearch is at least γ|C_{r_OPT}|/α, which is a constant approximation of |C_{r_OPT}| with a probability of 1 − o(1).

Notice that the analysis above makes certain assumptions about the quality of the classifier. In the initial iterations of Darwin, the classifier has low recall and hence the values of β and θ are lower. As Darwin identifies new heuristic functions, the increase in training data pushes these values higher, thereby improving the approximation factor of our algorithm. It is important to mention that even when the classifier is not ideal, our only key assumption is that the classifier performs better than random.

Table 1: Dataset statistics.
Discussion. We proposed three techniques for hierarchy traversal. The UniversalSearch approach is useful for capturing holistic information about the different candidate labeling heuristics and is proven to achieve a constant approximation of the optimal solution under reasonable assumptions about the trained classifier. However, due to the lack of training data in the initial iterations of the pipeline, this assumption may not hold, and UniversalSearch then does not perform optimally. In contrast, LocalSearch performs local generalization of the identified heuristics to quickly increase the number of identified positives. HybridSearch is a robust amalgamation of these two techniques and is our recommendation. Since HybridSearch is a generalization of LocalSearch and UniversalSearch, it is slightly less efficient than either of them.
4. EXPERIMENTS

In this section, we perform an empirical evaluation of Darwin along with other baselines to validate the following.
• The ability of Darwin to identify the majority of the positives even when initialized with a small seed set.
• The positives identified by Darwin outperform other baseline techniques that use active learning, a human annotator, or other automated techniques. We show that Darwin can uncover most of the positive instances (i.e., 80% or more) with roughly 100 queries.
• The heuristics identified by Darwin have high precision and help train a classifier with a superior F-score (≥ 0.8).
• Darwin is highly efficient and can generate labels from a corpus of 1M sentences in less than 3 hours.
• Darwin's performance is resilient to variations in the seed set.
4.1 Experimental Setup

Here, we describe the datasets, the baselines, and our overall experimental setup.

Datasets. We experimented with five diverse real-world datasets, each suitable for one of the following NLP tasks: entity extraction, relationship extraction, and intent classification. Table 1 summarizes the statistics of these datasets. All datasets, except for directions, come with ground-truth labels, which we use for evaluation and to synthesize the responses of an oracle. For the directions dataset, we rely on human annotators to generate the gold standard and validate the heuristics. We describe each of these datasets below.
• cause-effect [15] is a dataset commonly used as a benchmark for relationship extraction between pairs of entities. We focus on the task of finding sentences that describe a cause-and-effect relationship between two entities.
• directions is an internal dataset described in Example 1. For this dataset, we leveraged Figure-eight (https://figure-eight.com) crowd workers to verify the heuristics generated by Darwin.
• musicians consists of sentences from Wikipedia articles. The task is entity extraction, with the goal of extracting the names of musicians. The ground truth is obtained with the help of NELL's knowledge base (http://rtw.ml.cmu.edu/rtw/kbbrowser/).
• professions is a collection of sentences from ClueWeb (https://lemurproject.org/clueweb09/). The sentences that mention various professions (e.g., scientist, teacher, etc.) are positives. The ground truth is generated using NELL's knowledge base.
• tweets [18] is a benchmark for classifying the intent of tweets into predefined categories such as food, travel, career, etc.
Baselines. We evaluate our framework on two fronts: (1) the ratio of positive instances it discovers (i.e., coverage) and (2) the performance of the classifier trained using our weakly-supervised labels. Our baselines for these two evaluation criteria are listed below.
• Section 4.2 compares the fraction of identified positives with Snuba [17]. In this experiment, we consider a small sample of positives chosen randomly from the dataset.
• Section 4.3 compares the coverage obtained by Darwin against two baselines, namely HighP and HighC. HighP is a simpler version of Darwin which selects the rule that is expected to have a high precision (according to the classifier) and submits it to the oracle. On the other hand, HighC selects rules with the maximum coverage, irrespective of their expected precision. (HighC's performance was quite poor, as most of its suggested rules are rejected by the oracle; as a result, we omit HighC from the plots for the sake of clarity.)
• Section 4.4 compares the F-score of the classifier generated by Darwin with an Active Learning (AL) [14] technique and a Keyword Sampling (KS) technique, as well as the HighP baseline mentioned earlier. AL improves its performance by selecting the instance with the highest entropy, asking the oracle for its label, and then re-training the classifier using the new label. The KS approach is designed to check if we can quickly obtain a small set of promising instances by filtering the corpus using a set of relevant keywords and labeling instances from the smaller set. To do so, we asked annotators to provide 10 distinct keywords as a heuristic to filter the dataset; the KS technique then randomly samples instances from the filtered dataset and asks for their labels. We employ the same deep-learning-based classifier for all the techniques.

Finally, note that Darwin can use the different traversal algorithms LocalSearch, UniversalSearch, and HybridSearch, which we refer to as Darwin (LS), Darwin (US), and Darwin (HS), respectively.
Settings. We implemented all proposed algorithms and baselines in Python and ran the experiments on a server with 500GB RAM and 64-core, 2.10GHz x 2 processors. The dependency parse trees and the POS tags are generated with SpaCy (https://spacy.io/). All text classifiers trained in our experiments (whether used by Darwin or other baselines) are implemented with a 3-layer convolutional neural network followed by two fully connected layers, following the architecture described by Kim et al. [8]. The input to the classifier is a matrix created by stacking the word-embedding vectors of the words appearing in the sentence; we use SpaCy's word embeddings for English (https://spacy.io/models/en). For generating derivation sketches, the maximum depth is set to 10, and we consider 10K heuristics in candidate selection. When simulating the responses from an oracle (using the ground-truth data), we respond YES to a heuristic h if at least 80% of its coverage set consists of positive instances.

Figure 7: Effects of seed set size on performance ((a) directions; (b) musicians).
Figure 8: Effects of biased seed set size on performance ((a) directions; (b) musicians).
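The simulated oracle just described can be written in a few lines; the sketch below (ours) answers YES exactly when the rule's empirical precision on the ground truth reaches the 80% threshold.

```python
from typing import Set

def simulated_oracle(cov_r: Set[int], gold_positives: Set[int],
                     threshold: float = 0.8) -> bool:
    """Simulate the oracle of Definition 4 from ground-truth labels:
    answer YES iff at least `threshold` of the coverage set C_r
    consists of positive instances."""
    if not cov_r:
        return False
    precision = len(cov_r & gold_positives) / len(cov_r)
    return precision >= threshold
```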
4.2 Comparison with Snuba

In this experiment, we initialize Snuba and Darwin (HS) with the same set of randomly chosen labeled sentences and compare the total number of positives identified by each of the techniques. (In this experiment, we do not start with a single labeling heuristic, as Snuba then achieves a very small coverage: it fails to obtain enough positive instances due to the high degree of imbalance in these datasets.) Note that Snuba does not query the oracle and may infer inaccurate heuristics from the provided labeled instances; for a fair evaluation, we choose not to compare the accuracy of the identified heuristics. Figure 7 shows the change in the fraction of identified positives as the size of the initial seed set varies.
Darwin (HS) is able to identify the majority of the positives even when the pipeline is initialized with fewer than 25 sentences. However, Snuba requires at least 200 randomly chosen sentences for directions and 1000 for musicians; even if we employ an expert to sample positives, Snuba requires at least 100 positive samples for musicians.

To further evaluate the ability of Snuba and Darwin (HS) to generalize to heuristics that have limited or no evidence in the initial seed set, we construct a biased sample of seed positives. In this experiment, we choose sentences randomly from the corpus after ignoring the ones that contain the token 'shuttle' in the directions dataset and 'composer' in musicians. Figure 8 shows the fraction of positives identified with varying sizes of the seed set. Snuba is not able to identify the positives that contain the token 'shuttle' in directions and 'composer' in musicians; hence, it achieves poor coverage over the positives in the two datasets.
Darwin (HS) is able to identify the majority of the positives irrespective of the number of sentences used to initialize the pipeline. Snuba requires considerably more labeled sentences for musicians than for directions, due to the presence of many diverse heuristics in the dataset, most of which have limited evidence in the seed subset. We observe a similar performance gap between Snuba and Darwin (HS) for the other datasets.

This experiment validates that Snuba works well when the initial seed set has enough randomly chosen positives, but lacks the ability to generalize to heuristics that have limited evidence. On the other hand, Darwin (HS) is able to identify the majority of the positives even when the pipeline is initialized with just 25 sentences, and it generalizes well. To further evaluate Darwin's ability to identify positives, the following subsection considers a more challenging scenario where the pipeline is initialized with a single labeling heuristic or just two positive sentences.
4.3 Coverage of the Discovered Heuristics

Figures 9a-9d and 10a illustrate the fraction of positives identified by Darwin and our baselines. We can observe that Darwin (HS) has the most stable performance and outperforms the other techniques. While Darwin (US) occasionally outperforms Darwin (HS), we observe that it fails to perform well on all datasets. In most cases (with the exception of cause-effect), Darwin (HS) achieves a coverage of 0.8 using fewer than 120 queries to the oracle. The cause-effect dataset is known to be a tough benchmark in the NLP community, as the best F-score reported by [15] is 82% given complete access to the training set. Assuming that the oracle considers a majority vote by querying three crowd members and each query costs 2 cents (these are standard assumptions in crowdsourcing platforms, e.g., Figure-eight; we used the same cost model to collect labels for directions), the Darwin (HS) pipeline generates more than 80% of the positive labels with only $7.20. Figure 9d demonstrates the behavior for the 'Food' intent in tweets; we observed similar behavior for the 'Travel' and 'Career' intents on this dataset. We can observe that the other baselines do not perform well compared to Darwin: HighP identifies heuristics with very small coverage as its candidates. Also note that the Darwin (LS) algorithm shows high progressive coverage initially, but it converges to a very low coverage value because it is unable to identify rules that are semantically similar but far away in the hierarchy. Overall, we recommend Darwin (HS) for any practical application, as it is more robust and works better than most of the techniques. On the other hand, the Darwin (LS) and Darwin (US) variants work well in specific settings: Darwin (LS) performs better than the other techniques when precise rules are present close to each other in the hierarchy, and Darwin (US) performs well in the presence of abundant labeled examples.

Figure 11 shows some of the heuristics queried by the Darwin (HS) algorithm. In the directions example,
Darwin (HS) started with 'best way to get to' and was able to traverse to 'shuttle to', which is quite distinct from the initial seed rule. The choice of 'to the hotel from' by the algorithm provides some evidence that 'shuttle to' is also a good rule, since the phrases often co-occur in positive instances. In the cause-effect example, the traversal is relatively simple: the algorithm generalizes the initial rule first and, as soon as it reaches the noisy and unhelpful rule 'by', it specializes again to 'triggered by', which is a precise rule. In addition to these simpler heuristics, Darwin identified more complex heuristics for professions, such as '/is/NOUN ∧ job', among others.

Figure 9: Comparison of rule coverage and classifier's F-score for Darwin-based pipelines on various datasets ((a)-(d): coverage on musicians, cause-effect, directions, and food-tweets; (e)-(h): F-score on the same datasets).
Figure 10: Comparison for professions ((a) heuristic coverage; (b) classifier performance).
Figure 11: Example traversals by the HybridSearch algorithm on two datasets ((a) cause-effect: 'has been caused by' → 'caused by' → 'by' → 'triggered by'; (b) directions: 'best way to get to' → 'way to get to' → 'to get' → 'get to the hotel' → 'shuttle to' → 'shuttle to the' → 'to the hotel from').

4.4 Quality of the Trained Classifier

This section compares the quality of the classifier generated using the labels identified by Darwin.
Darwin . Figures 6e-6h and 10b show that
Darwin (HS) dominates the other techniques over all the datasets. The active learning technique suffers from a poor F-score initially and improves gradually. Since AL generates very few training examples, the trained classifier is highly unstable and shows a jittery F-score. The KS approach performs comparably to AL. On the other hand,
Darwin -based pipelines are much more stable in terms of F-score. The classifier trained with the labeled data generated by
Darwin pipelines always maintains a high precision. It is interesting to note that [18] reports a maximum F-score of 0.54 for the food intent, compared to 0.84 achieved by
Darwin . The classifier generated by
Darwin also achieved an F-score above 0.8 for other intents such as ‘Travel’ and ‘Career’, while [18] reports a maximum F-score of 0.58 for these intents.
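To illustrate how rule-generated labels feed a downstream classifier, the sketch below trains a simple model on weakly labeled sentences and scores it on a small gold set. Darwin itself trains a neural sentence classifier [8]; logistic regression over TF-IDF and the toy data are used here only to keep the example short.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Weak labels produced by the discovered rules (toy data).
weak_sentences = ["shuttle to the hotel runs hourly",
                  "best way to get to downtown",
                  "the recipe needs two eggs",
                  "stocks fell sharply on Monday"]
weak_labels = [1, 1, 0, 0]

# Small gold set for evaluation (also toy data).
gold_sentences = ["is there a shuttle to the airport",
                  "the movie was too long"]
gold_labels = [1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(weak_sentences), weak_labels)
preds = clf.predict(vec.transform(gold_sentences))
print(f1_score(gold_labels, preds))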
To provide better insights into
Darwin ’s performance, we have conducted a series of experiments to evaluate (1) how efficient the framework is in terms of the time required to obtain labels, (2) how much noise-aware models (trained by Snorkel) can improve the classification results, and (3) how well human annotators approximate our notion of an oracle. Due to space limitations, we present the effect of varying seed rules and parameters in the Appendix.
Efficiency in Label Collection.
As we demonstrated,
Darwin identifies the majority of the positive instances in all the datasets using roughly 100 queries. The time taken to generate the index structure for all the datasets was less than 5 minutes. The hierarchy generation phase then iterates over the index to identify the candidate rules. This phase takes less than 15 minutes for a corpus of 100K sentences. Since the
LocalSearch algorithm does not require the index to be pre-computed and generates candidates on the fly, it runs in less than 45 minutes for all datasets.
HybridSearch and
UniversalSearch traversal algorithms require 60-90 minutes on the smaller datasets (i.e., directions , musicians and cause-effect ) and about 2 hours and 45 minutes on professions . The major bottleneck in this process is the time taken by the classifier to make a prediction for all instances in the corpus (roughly 25 minutes for one round of training and testing on the professions dataset). We implemented a simple optimization: a sentence is re-evaluated only if its confidence score exceeded 0.3 in the previous iteration, and instances that do not satisfy this constraint are evaluated only once every three iterations. This heuristic reduced the running time from 2 hours and 45 minutes to 65 minutes on the professions dataset. The total running time does not grow linearly with the size of the dataset because most of the components use the classifier to identify the positives; these candidate positives are then used for hierarchy generation and traversal. Hence, the running time grows linearly with the size of the positive set but not with the dataset size (after applying the optimization described above).
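The following sketch illustrates this re-evaluation heuristic; the classifier interface (predict_proba) and all names here are assumptions for illustration rather than Darwin ’s actual code.

CONF_THRESHOLD = 0.3   # confidence cutoff from the optimization above
REFRESH_EVERY = 3      # re-check skipped sentences every three iterations

def update_scores(classifier, sentences, scores, iteration):
    # Re-score a sentence if it looked promising in the previous iteration;
    # otherwise refresh it only periodically to save classifier predictions.
    for i, sentence in enumerate(sentences):
        if scores[i] >= CONF_THRESHOLD or iteration % REFRESH_EVERY == 0:
            scores[i] = classifier.predict_proba(sentence)
    return scores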
                    M      C      D      F
Darwin
Darwin + Snorkel    0.82   0.78   0.97   0.87

Table 2: Performance of Darwin with Snorkel (M = musicians , C = cause-effect , D = directions , F = food-tweets ).
Training noise-aware classifiers.
One of the recent developments in weak-supervision paradigms has been the emergence of frameworks such as
Snorkel [12], which are designed to de-noise the generated labels and train noise-aware classifiers. In this experiment, we direct the set of rules identified by
Darwin to Snorkel and compare the quality of the noise-aware classifier against a classifier trained directly on the labels generated by
Darwin . Table 2 summarizes the F-scores that the two classifiers obtain on our datasets. We can observe that in most cases, using Snorkel does not yield any improvements. This is mainly because in many of these datasets, the rules generated by
Darwin already exhibit a low degree of noise and good coverage, and thus there is almost no room for improvement. Nevertheless, we can see that on some datasets, such as directions , using Snorkel can be quite beneficial.
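As a sketch of this hand-off, the vote matrix produced by the discovered rules can be passed to Snorkel's LabelModel to obtain de-noised probabilistic labels. The snippet below assumes Snorkel's v0.9 Python API and uses a toy vote matrix:

import numpy as np
from snorkel.labeling.model import LabelModel

# L is an (n_sentences x n_rules) matrix of rule votes:
# 1 = positive, 0 = negative, -1 = abstain (toy example).
L = np.array([[ 1, -1,  1],
              [-1,  0, -1],
              [ 1,  1, -1],
              [ 0,  0, -1]])

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

# Probabilistic labels for training a noise-aware classifier.
print(label_model.predict_proba(L))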
Performance of human annotators.
Clearly,
Darwin ’s performance heavily relies on the quality of the responses it receives from the annotators. To study how well human annotators perform, we ran an experimental study on the Figure Eight crowdsourcing platform for the directions dataset. Labels were collected for 2,600 heuristics. Each annotator was paid 2 cents per rule evaluation, and three evaluations were collected per rule. A manual inspection of the results reveals that annotators were able to capture most of the precise heuristics, such as ‘best way to get there’, ‘shuttle from’, ‘across the street from’, and ‘airport to hotel’. Overall, we found fewer than 10 false-positive responses among the 69 positive heuristics identified by the crowd labels. These erroneous responses arise because the 5 matching sentences presented to the annotator can, by chance, contain 3 or 4 positive instances, which confuses the annotators; presenting more samples lowers the error rate. Interestingly,
Darwin often ranks these heuristics lower in its querying preference, as it can analyze the complete coverage set and mitigate such errors by considering the entire distribution of instances. The annotators took 23 seconds on average to label a heuristic query. For 100 queries,
Darwin generates all the labels with less than 40 minutes of human effort. This time can be reduced further by asking different questions in parallel to different crowd members.
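A quick back-of-the-envelope computation reproduces these cost and time figures (assuming, for the time estimate, that responses are collected serially):

queries = 100
annotators_per_query = 3
cost_per_response = 0.02      # dollars per evaluation on Figure Eight
seconds_per_response = 23     # average labeling time observed above

print(queries * annotators_per_query * cost_per_response)   # 6.0 dollars
print(queries * seconds_per_response / 60)                  # ~38.3 minutes, under 40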
5. RELATED WORK
To the best of our knowledge,
Darwin is the first system that assists annotators in discovering rules under any desired rule grammar for rapid labeling of text data. Our work is related to studies in the areas of weak supervision, crowdsourcing, and the intersection of the two, which we discuss next.
Weak Supervision.
There are multiple existing approaches for generating labels in weakly supervised settings. Some techniques rely on the notion of distant supervision, where the labels are inferred using an external knowledge base [10, 1, 21]. One notable example is a system named Snuba [17], which generates labeling rules based on an existing labeled dataset. In contrast to these systems,
Darwin is designed for scenarios where no additional sources of information are available. In such cases, it is necessary to rely on annotators to write labeling rules. While expert-written rules have proven to be highly effective in many settings [12], there is limited work on how to facilitate the process of writing or discovering high-quality rules. One interesting example is Babble Labble [6], a labeling tool that allows annotators to explain (in natural language) why they have assigned a label to a given data point. These explanations are then transformed into labeling rules. While Babble Labble simplifies the rule-writing process, it only handles a single internal rule language. On the other hand,
Darwin allows experts to pick their desired rule language depending on the complexity and the dynamics of the task at hand. There have been several studies on utilizing weakly-supervised labels in an optimal way. Snorkel [12] and Coral [16] are recent examples of systems (based on the data programming paradigm) that de-noise and utilize the labels collected via weak supervision. Similarly, there are numerous data management problems spanning data fusion [2, 13] and truth discovery [9], which focus on identifying reliable sources of data. Many recent studies in data integration have also explored techniques that handle errors in crowd answers [3, 5]. Note that
Darwin is a framework for discovering labeling rules, which goes hand-in-hand with the aforementioned systems, since
Darwin ’s generated rules can be further processed using these de-noising techniques to achieve better results.
Crowdsourcing Frameworks.
There have been many studies on devising oracle-based abstractions that handle annotations from a crowd and minimize the noise in answers [20, 4]. Perhaps more relevant to our work are existing studies on how labeling rules can be verified with the help of the crowd. One recent example is a system named CrowdGame [22], which validates a rule by showing either the rule or its matching instances to the annotators. The authors demonstrate that their proposed game-based techniques yield the best results for rule verification. Unlike
Darwin , CrowdGame assumes a pre-existing (manageable) set of possible rules from which the best rule should be selected.
Darwin , on the other hand, makes no such assumption and has to create a promising set of rules from the rule grammar. Additionally, the game-based approach to annotating a rule can be modeled as an oracle in
Darwin .
6. CONCLUSION
We present
Darwin , an interactive end-to-end system that enables annotators to rapidly label text datasets by identifying precise labeling rules for the task at hand.
Darwin compiles the semantic and syntactic patterns in the corpus to generate a set of candidate heuristics that are highly likely to capture the positive instances in the corpus. The set of candidate heuristics is organized into a hierarchy, which enables
Darwin to quickly determine which heuristic should be presented to the annotators for verification. Our experiments demonstrate the superior performance of
Darwin in a wide range of labeling tasks spanning intent classification, entity extraction, and relationship extraction.

7. REFERENCES
[1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In ACL, 2012.
[2] X. L. Dong and D. Srivastava. Big data integration. Synthesis Lectures on Data Management, 7(1), 2015.
[3] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
[4] A. Ghosh, S. Kale, and P. McAfee. Who moderates the moderators?: Crowdsourcing abuse detection in user-generated content. In Proceedings of the 12th ACM Conference on Electronic Commerce, 2011.
[5] A. Gruenheid, B. Nushi, T. Kraska, W. Gatterbauer, and D. Kossmann. Fault-tolerant entity resolution with the crowd. arXiv preprint arXiv:1512.00537, 2015.
[6] B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. arXiv preprint arXiv:1805.03818, 2018.
[7] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409-426. Springer, 1994.
[8] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
[9] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han. A survey on truth discovery. ACM SIGKDD Explorations Newsletter, 17(2), 2016.
[10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In ACL/IJCNLP, 2009.
[11] S. Petrov, D. Das, and R. McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
[12] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. PVLDB, 11(3), 2017.
[13] T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017.
[14] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.
[15] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 2012.
[16] P. Varma, B. D. He, P. Bajaj, N. Khandwala, I. Banerjee, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. In NIPS, 2017.
[17] P. Varma and C. Ré. Snuba: Automating weak supervision to label training data. PVLDB, 2019.
[18] J. Wang, G. Cong, W. X. Zhao, and X. Li. Mining user intents in Twitter: A semi-supervised approach to inferring intent categories for tweets. 2015.
[19] X. Wang, A. Feng, B. Golshan, A. Halevy, G. Mihaila, H. Oiwa, and W.-C. Tan. Scalable semantic querying of text. PVLDB, 11(9), 2018.
[20] P. Welinder, S. Branson, S. J. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, 2010.
[21] F. Yang, Z. Yang, and W. W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NIPS, 2017.
[22] J. Yang, J. Fan, Z. Wei, G. Li, T. Liu, and X. Du. Cost-effective data annotation using game-based crowdsourcing. 2018.
APPENDIX

A. PROOF OF LEMMA 4
Proof.
The expected score of the heuristic function is
$$
\mathbb{E}\Big[\sum_{s \in C_r} X_s\Big] = \sum_{s \in C_r} \mu_s = \sum_{s \in C_r \cap P} \mu_s + \sum_{s \in C_r \setminus P} \mu_s
$$
$$
\leq \sum_{s \in C_r \cap P} \big(\beta + (1-\theta)(1-\beta)\big) + \sum_{s \in C_r \setminus P} \big(\beta' + (1-\theta)(1-\beta')\big)
$$
$$
= \big(\beta + (1-\theta)(1-\beta)\big)\, p\, |C_r| + (1-p)\, |C_r|\, \big(\beta' + (1-\theta)(1-\beta')\big)
$$
$$
\leq \big(\beta + (1-\theta)(1-\beta)\big)\, |C_r|.
$$

B. PROOF OF LEMMA 5
Proof.
The score of heuristic function $r$ is $\sum_{s \in C_r} X_s$. The expected value of the score is calculated in Lemma 4. Using Hoeffding's inequality,
$$
\Pr\Big[\frac{1}{|C_r|}\sum_{s \in C_r} X_s \geq (1+\epsilon)\,\frac{\mu_r}{|C_r|}\Big] \leq 2e^{-\epsilon \mu_r} = 2e^{-\epsilon (\beta + (1-\theta)(1-\beta))\,|C_r|} = 2e^{-\ln n} = \frac{2}{n}.
$$
This shows that $\sum_{s \in C_r} X_s$ is smaller than $(1+\epsilon)\,(\beta + (1-\theta)(1-\beta))\,|C_r|$ with probability more than $1 - 2/n$.

C. PROOF OF LEMMA 6
Proof.
Using Lemmas 3 and 5, we can observe that the calculated benefit of heuristic $h$ is at least $(1-\epsilon)\,\theta\beta'\,|C_h|$ with probability $1 - 2/n$. Similarly, the score of $r$ is at most $(1+\epsilon)\,(\beta + (1-\theta)(1-\beta))\,|C_r|$ with probability $1 - 2/n$. This shows that
$$
\text{score}(h) > \text{score}(r) \tag{1}
$$
$$
(1-\epsilon)\,\theta\beta'\,|C_h| > (1+\epsilon)\,(\beta + (1-\theta)(1-\beta))\,|C_r| \tag{2}
$$
$$
\frac{|C_h|}{|C_r|} > \frac{(1+\epsilon)\,(\beta + (1-\theta)(1-\beta))}{(1-\epsilon)\,\theta\beta'} \tag{3}
$$
$$
\frac{|C_h|}{|C_r|} > \alpha \tag{4}
$$
where $\alpha$ is a constant.

D. ADDITIONAL EXPERIMENTS
Sensitivity to
HybridSearch ’s traversal parameters.
Here, we study to what extent
Darwin ’s performance is sensitive to the parameter τ in the HybridSearch traversal algorithm. Recall that τ determines how often the HybridSearch algorithm switches between exploiting the local structure of the hierarchy and evaluating all possible candidates using the classifier.

Figure 12: Sensitivity of Darwin to τ and seed rules on musicians : (a) varying τ ∈ {3, 5, 7, 9}; (b) varying the seed rule (Rules 1-3).

Figure 13: Sensitivity of Darwin (HS) to the number of candidates generated.

Figure 12a shows that Darwin (HS) performs very similarly when varying τ. The solution quality tends to improve slightly as τ increases, because the effective rules for the musicians dataset are not close to each other in the hierarchy. Note, however, that choosing large values of τ can affect the efficiency of the pipeline. More precisely, large values of τ force the HybridSearch system to rely on its internal classifier to evaluate all rule candidates for too many steps, which can be quite time consuming.
Sensitivity to seed rule.
This experiment establishes that
Darwin performs robustly given different types of input seed rules. Focusing on the musicians dataset, we initialize
Darwin with the following seed rules. Rule 1 is the keyword ‘ composer ’, stating that any sentence containing this word mentions a musician. Rule 2 is the keyword ‘ piano ’, and finally Rule 3 is the sentence ‘
Beethoven taught piano to the daughters of Hungarian Countess Anna Brunsvik. ’. Note that Rule 2 is an extremely generalized version of Rule 3. Figure 12b compares the performance of
Darwin (HS) for all three seed rules.
Darwin performs equally well on all three types of input seed rules. We can observe that for Rule 3,
Darwin requires the initial 8 queries to generalize the seed rule; as soon as it identifies a rule with high coverage, it performs very similarly to the other seed rules.
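For concreteness, the three seeds can be encoded as simple sentence predicates. This encoding is purely illustrative; Darwin accepts seeds expressed in the configured rule grammar.

rule_1 = lambda s: "composer" in s.lower()      # keyword seed (Rule 1)
rule_2 = lambda s: "piano" in s.lower()         # keyword seed (Rule 2)
seed = ("Beethoven taught piano to the daughters of "
        "Hungarian Countess Anna Brunsvik.")
rule_3 = lambda s: s == seed                    # single-sentence seed (Rule 3)

print(rule_2(seed))   # True: Rule 2 is a generalization of Rule 3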
Sensitivity to number of generated candidates.
One of the parameters of the
Darwin framework is the number of candidates generated by the candidate-rule generation component. In our experiments,
Darwin generates 10K candidate rules with high coverage and organizes them into an index. The goal is to make sure the set of generated candidates contains some (if not all) of the precise rules. Choosing a large value for the index size would satisfy this objective but hurts efficiency, as it increases the number of candidates that the
UniversalSearch and
HybridSearch algorithms need to consider. We observed that generating 10K candidates per iteration helped
Darwin identify precise candidate rules. Figure 13 shows that the performance of the
Darwin (HS) algorithm is consistently similar across different numbers of generated candidate rules.
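A simplified sketch of this candidate-generation step is shown below. Darwin ’s actual generation is driven by the rule grammar and the pre-computed index, so the frequency-ranked n-grams here are only an approximation of the idea:

from collections import Counter

def generate_candidates(corpus, max_n=4, k=10_000):
    # Count all n-grams up to length max_n and keep the k most frequent
    # ones as candidate rules.
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return [phrase for phrase, _ in counts.most_common(k)]

corpus = ["best way to get to the hotel",
          "shuttle to the hotel from the airport"]
print(generate_candidates(corpus, max_n=2, k=5))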
Sensitivity to classifier quality.
Figure 14: Effect of classifier quality on Darwin (HS) (on the musicians dataset).

Figure 14 compares the performance of the HybridSearch strategy on the musicians dataset while varying the number of epochs for which the neural network classifier is trained. With more epochs, the classifier tends to overfit more to the training data. We measure the number of questions from
Darwin (HS) to the oracle in order to label at least 75% of the positive sentences. It is evident that the