Detecting Opportunities for Differential Maintenance of Extracted Views
Besat Kassaie
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L [email protected]
Frank Wm. Tompa
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L [email protected]
Abstract
Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from engineering ad hoc, application-specific extraction rules towards using expressive languages such as CPSL and AQL creates opportunities to propose solutions that can be applied to a wide range of extraction programs. In this work we focus on extracted view maintenance, a problem that is well-motivated and thoroughly addressed in the relational setting.
In particular, we formalize and address the problem of keeping extracted relations consistent with source documents that can be arbitrarily updated. We formally characterize three classes of document updates, namely those that are irrelevant, autonomously computable, and pseudo-irrelevant with respect to a given extractor. Finally, we propose algorithms to detect pseudo-irrelevant document updates with respect to extractors that are expressed as document spanners, a model of information extraction inspired by SystemT.
ACM Subject Classification Information systems → Information extraction; Information systems → Database views; Theory of computation → Formal languages and automata theory; Applied computing → Document management and text processing
Keywords and phrases information extraction, materialized views, regular languages, document spanners, static program analysis
Designing new languages and extraction platforms [2, 8, 32, 35], choosing an appropriate algorithmic approach respecting the domain and the syntactic and semantic properties of anticipated data sources and outputs [33], facilitating the incorporation of human knowledge in algorithm design [5], and adapting existing extractors to deal with new documents added to the system [6] account for a significant part of the recent research in this area. In all these efforts, the major goal is to cover the myriad ways that a relationship might be expressed in text.
Despite many technical differences, all proposed extraction approaches share a subtle and important assumption, which we call "fading attachment." The flow of information between the three main components of information extraction (source documents, the extraction program, and the extracted relations) is maintained during the development period but evaporates once the extraction program reaches a satisfactory level of accuracy and robustness. Once deployed, the information extraction process ignores the relationship between the contents of the source documents and the extracted relations.
We observe that the fading attachment assumption is inappropriate in many applications. Extracted relations might be modified due to privacy concerns [22] or for data cleaning purposes [21], but thereafter they are inconsistent with the contents of the source document. On the other hand, source documents might also be modified, perhaps for versioning purposes or to accommodate updates that reflect the most recent data; but again the extracted relations become inconsistent with the content in the source documents.
Instead we consider an extracted relation to be a materialized view of the document corpus. From this perspective, updating extracted relations resembles the classical view update problem for relational databases [9], and keeping extracted relations in sync with the document corpus resembles the problem of maintaining materialized views [18]. The extracted view update problem has been introduced and formalized elsewhere [22], and in this paper we are interested in the latter problem, i.e., extracted view maintenance.
The natural way to reflect changes in source documents is to wipe out any extracted relations and repeat the extraction process. Although this approach guarantees the preservation of consistency between the source text and the extracted relations, as in the relational database context, extracting relations from scratch can be costly. For instance, in some applications where updates to source documents occur frequently, extraction time might be a bottleneck or, in a distributed setting in which extracted relations and source documents reside in different physical sites, the communication cost for repeatedly transferring newly extracted relations might be significant. Thus, avoiding re-extraction is sometimes highly desirable.
The problem has been studied extensively in the relational database setting. Based on the requirements of target applications and the nature of view updates, proposed solutions range from recomputing views from scratch to detecting irrelevant and autonomously computable updates [3] and to updating views differentially [19, 23] or only as needed [7, 36]. Other optimization techniques can also be adopted from relational databases [32], including the materialization of partially extracted views (which would also need to be maintained, of course). In fact, we hypothesize that any of the proposed solutions in the relational setting can be adapted to the extracted view maintenance problem.
However, due to the diverse range of extraction techniques and ad hoc document updates, tackling the problem of extracted view maintenance introduces new challenges. Given a collection of documents $D$, a set of extraction programs $E$, a corresponding set of extracted relations $R$, and an instance of a document update specification $U$, we study conditions under which we can apply $U$ to members of $D$ and apply corresponding updates to members of $R$ without recomputing the revised extracted relations from scratch (Figure 1). That is, we wish to translate updates over documents into differential updates over extracted relations.
Thus, in this paper:
(a) We introduce the extracted view maintenance problem.
(b) We propose a match-and-replace document update model.
(c) We formalize three categories of document updates for which we can preserve consistency without repeating the extraction process: irrelevant, autonomously computable, and pseudo-irrelevant updates.
(d) We propose algorithms to determine whether an update is pseudo-irrelevant with respect to extractors expressed as document spanners, a formalism that models the basis of the SystemT extraction system [32].
In order to develop specific algorithms, we assume that extracted views are defined using SystemT, an information extraction platform that benefits from relational database concepts to deal with text data sources [32]. (How to maintain extracted views efficiently should also be investigated using other extraction languages, such as JAPE [8].) SystemT models each document as a single string and populates relational tables with spans, directly extracted from the input document. With SystemT, users encode extractors with a SQL-like language, i.e., AQL, to manipulate tables. AQL offers operators to work directly on text or on the extracted tables (standard relational operators that accept span predicates).

Figure 1 Extraction system that supports updates to all members of a document database.

The underlying principles adopted by SystemT have been formalized as document spanners by Fagin et al. [12]. Most of the material in this section has been introduced in that work, which contains additional details.
Let $\Sigma$ be a finite alphabet and $D$ be a (finite) document over $\Sigma$, i.e., $D \in \Sigma^*$. A span of $D$, denoted $[i, j\rangle$ ($1 \le i \le j \le |D| + 1$), specifies the start and end offsets of a substring in $D$, which is in turn denoted $D_{[i,j\rangle}$ and extends from offset $i$ through offset $j - 1$. If $i = j$, this denotes an empty span at offset $i$. Spans $s_1 = [i_1, j_1\rangle$ and $s_2 = [i_2, j_2\rangle$ are identical if and only if $i_1 = i_2$ and $j_1 = j_2$; they overlap if $i_1 \le i_2 < j_1$ or $i_2 \le i_1 < j_2$. Regular expressions extended using variables chosen from a set $V$ are called regular expressions with capture variables, defined by $\gamma$ in the grammar $G_S(\Sigma, V)$ as follows:

$$\gamma := \emptyset \mid \epsilon \mid \sigma \mid (\gamma \vee \gamma) \mid (\gamma \bullet \gamma) \mid (\gamma)^* \mid x\{\gamma\}$$

where $\sigma \in \Sigma$ and $x \in V$. A sub-expression of the form $x\{g\}$ denotes that whenever the regular expression matches a string, sub-strings matched by $g$ are to be marked by the capture variable $x$. If $E$ is a regular expression with capture variables, then we denote the set of capture variables in $E$ as $SVars(E)$. We use $G_S$ in place of $G_S(\Sigma, V)$ whenever $\Sigma$ and $V$ are immaterial or understood from the context. We also allow regular expressions with capture variables to be written without parentheses that can be inferred based on priority of operations [20].
Applying an information extractor to a document $D$ produces a span relation, i.e., a relation that contains spans of $D$. To this end, if $E$ is a regular expression with capture variables, it specifies a document spanner, denoted $\llbracket E \rrbracket$, which is a function mapping strings over $\Sigma^*$ to span relations. In particular, for a given document $D$, the spanner specified by $E$ produces a span relation $\llbracket E \rrbracket(D)$ in which there is one column for each variable from $V$ appearing in $E$, each row corresponds to a matching of $E$ against $D$ when the variables are ignored, and the value in a row for the column corresponding to $x \in V$ is the span marked by $x$. To ensure that the extracted relation is in first-normal form with no null values, we restrict our attention to a specific class of document spanners, namely functional document spanners, that assign exactly one span to each variable for all produced rows, regardless of the input document $D$.

For information on COVID-19, call us at e at +1-867-975-5772 or 403-644-4545.
Figure 2 A sample input document $D$ for our running example.

Let $\Sigma$ be the set of Latin alphanumeric, punctuation and the space characters (the last represented by ␣), and let $d$ denote a digit. Applying

$$\gamma_{phone} = \Sigma^* \bullet tn\{(0 \bullet\ \vee\ \vee + \bullet\ \bullet \text{-} \bullet ac\{d \bullet d \bullet d\} \bullet \text{-} \bullet d \bullet d \bullet d \bullet \text{-} \bullet sc\{d \bullet d \bullet d \bullet d\}\} \bullet \Sigma^*$$

to the document in Figure 2 results in the span relation in Figure 3.

tn           ac           sc
[42, 56⟩     [44, 47⟩     [52, 56⟩
[88, 103⟩    [91, 94⟩     [99, 103⟩
Figure 3 The extracted relation $\llbracket \gamma_{phone} \rrbracket(D)$, where $D$ is depicted in Figure 2.

Definition 2.1.
Throughout this paper, a functional document spanner used for the purpose of information extraction is called an extraction spanner or simply an extractor, the regular expression with capture variables defining it is called an extraction formula, and the span relation produced for a document is called an extracted relation.
Definition 2.2.
A regular expression created by eliminating all the capture variables from an extraction formula $E$ is called the corresponding Boolean spanner and is denoted by $B(E)$.
In this paper we hypothesize systems that include a document database $D$ and a set of extractors $\{E_1, \cdots, E_e\}$ that run independently over $D$. The union of span relations produced by an extractor $E_k$ against the document database is stored in a relation $T_k$ that includes an additional column to associate each document identifier with the spans for the corresponding span relation. These tables serve as materialized views of the document database.
In numerous proofs in this paper, we rely on finding "witness" documents that exhibit certain properties:
Definition 2.3.
Given a document $D$ and a property $P$, if $D$ exhibits property $P$ (i.e., we can assert $P(D)$), $D$ is called a witness for $P$.
Furthermore, we use specially-constructed finite automata to test properties of given spanners. For each automaton, $Q$ represents a finite set of states, $\Sigma$ is the input alphabet, $\delta$ stands for the transition function, $Q^0$ is a set of initial states, and $F$ represents a set of final states.
Substring replacement, deletion, and insertion are basic update operations over documents. A change to the text is typically preceded by some browsing activities or search operations to locate update positions in a target document. In this section we describe the proposed formal model for document update.
Target points of change in a document are specified using patterns over the input string, expressed as a functional document spanner with precisely one variable. Specifically, an update formula is an extraction formula for specifying an update, defined by $\gamma_1$ in the following grammar $G_U(\Sigma, x)$:

$$\gamma_1 := (\gamma_1 \vee \gamma_1) \mid (\gamma_1 \bullet \gamma_2) \mid (\gamma_2 \bullet \gamma_1) \mid x\{\gamma_2\} \qquad (1)$$
$$\gamma_2 := \emptyset \mid \epsilon \mid \sigma \mid (\gamma_2 \vee \gamma_2) \mid (\gamma_2 \bullet \gamma_2) \mid (\gamma_2)^* \qquad (2)$$

(i.e., where $\gamma_2$ is a standard, variable-free regular expression).
The functional document spanner that is represented by an update formula $g$ maps every document $D$ to a unary span relation, which we call the update relation and denote as $\llbracket g \rrbracket(D)$. When the spanner is used for updating a document $D$, we require that all spans in $\llbracket g \rrbracket(D)$ be mutually disjoint. In this case, sub-strings of $D$ associated with the spans in the update relation are simultaneously replaced by a new value denoted by a constant $A$. Because the update relation contains non-overlapping spans, such replacements will be mutually non-interfering.
Definition 3.1.
An instance of an update specification with given update formula $g$ and $A \in \Sigma^*$ is called an update expression and represented by $Repl(g, A)$. Given a document $D$, if $\llbracket g \rrbracket(D)$ contains no overlapping spans, then applying $Repl(g, A)$ to $D$ produces a new document $Repl(g, A)(D)$ that is identical to $D$ but with every substring in $D$ marked by $x$ in $\llbracket g \rrbracket$ replaced by the string $A$.
Note that if $A$ is the empty string, then the update results in the deletion of the substrings identified by the spanner; otherwise, wherever the spanner produces an empty span, the replacement, in effect, inserts the string $A$. For example, given $D$ as in Figure 2, applying $Repl(\Sigma^* \bullet u \bullet s \bullet \text{␣} \bullet x\{\epsilon\} \bullet a \bullet t \bullet \Sigma^*, \text{free␣})$ to $D$ inserts 'free␣' at $[39, 39\rangle$.
Proposition 3.2. If $g$ is an update formula, then $\llbracket g \rrbracket$ is functional.
Proof.
By induction on the height of the parse tree for $g$ derived from the root symbol of $G_U$. ◀
Lemma 3.3.
Given $\Sigma$ and $V$, let $\bar{\gamma}$ define a restricted form for extraction formulas as follows:

$$\bar{\gamma} := \gamma_2 \mid ((\bar{\gamma}\ \bullet)?\ x\{\bar{\gamma}\}\ (\bullet\ \bar{\gamma})?)$$

where '?' denotes optional and $\gamma_2$ is the variable-free regular expression defined in production (2) above. Every functional extraction formula $E$ based on $G_S(\Sigma, V)$ can be rewritten in its normalized form $\Delta(E) = \bigvee_{i=1}^{k} E_i$ where $E_i$ is a formula defined by $\bar{\gamma}$ for $i \in 1..k$. (Note that within each $E_i$, all operands for disjunction and Kleene closure are standard, variable-free regular expressions.)
Proof.
By induction on the height of the expression tree for $\gamma$. ◀ (An alternative proof can be derived by noting that every extraction formula can be represented by a "vstk-path union" [12].)
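The rewriting behind Lemma 3.3 can be illustrated with a small sketch. The tuple-based AST below is a hypothetical encoding chosen only for this illustration (it is not part of the formal model): variable-free sub-expressions are kept as opaque strings, and every disjunction that contains a capture variable is distributed over concatenations and enclosing capture variables.

```python
from itertools import product

# Hypothetical AST encoding (an assumption of this sketch):
#   ("re", s)         opaque variable-free regular expression
#   ("cat", l, r)     concatenation
#   ("alt", l, r)     disjunction
#   ("star", e)       Kleene closure
#   ("var", name, e)  capture variable name{e}

def has_var(node):
    """True if the subtree contains a capture variable."""
    tag = node[0]
    if tag == "re":
        return False
    if tag == "var":
        return True
    return any(has_var(child) for child in node[1:])

def normalize(node):
    """Return the disjuncts E_1..E_k of the normalized form as a list."""
    if not has_var(node):
        return [node]  # variable-free parts are left intact
    tag = node[0]
    if tag == "alt":   # pull the disjunction up
        return normalize(node[1]) + normalize(node[2])
    if tag == "cat":   # distribute concatenation over disjuncts
        return [("cat", l, r)
                for l, r in product(normalize(node[1]), normalize(node[2]))]
    if tag == "var":   # distribute the capture variable over disjuncts
        return [("var", node[1], body) for body in normalize(node[2])]
    raise ValueError("capture variable under Kleene closure: not functional")

def show(node):
    """Render an AST in the notation used in the text."""
    tag = node[0]
    if tag == "re":
        return node[1]
    if tag == "var":
        return f"{node[1]}{{{show(node[2])}}}"
    if tag == "star":
        return f"({show(node[1])})*"
    if tag == "alt":
        return f"({show(node[1])} ∨ {show(node[2])})"
    return f"{show(node[1])} • {show(node[2])}"

# E = (a ∨ b)* • X{(Y{a} ∨ Y{a • b}) • a} • Z{b ∨ (b • a)}
E = ("cat",
     ("cat", ("re", "(a ∨ b)*"),
      ("var", "X", ("cat",
                    ("alt", ("var", "Y", ("re", "a")),
                     ("var", "Y", ("re", "a • b"))),
                    ("re", "a")))),
     ("var", "Z", ("re", "b ∨ (b • a)")))

disjuncts = normalize(E)
for d in disjuncts:
    print(show(d))
```

Only disjunctions with capture variables in their disjuncts are split; the disjunction inside `Z{...}` is variable-free and therefore stays inside a single disjunct.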
Figure 4 $D$ is a witness for overlapping spans for $\llbracket g \rrbracket$, where $s_i = [b_i, c_i\rangle$ and $s_j = [b_j, c_j\rangle$ are spans marked by $x$ when matched by the $i$th and $j$th disjuncts (not necessarily distinct) of $\Delta(g)$. Offset $o_1$ falls within $s_i$ but not $s_j$, and $o_2$ falls within both $s_i$ and $s_j$.

For example, consider the extraction formula

$$E = (a \vee b)^* \bullet X\{(Y\{a\} \vee Y\{a \bullet b\}) \bullet a\} \bullet Z\{b \vee (b \bullet a)\}$$

The normalized form for $E$ is

$$\Delta(E) = (a \vee b)^* \bullet X\{Y\{a\} \bullet a\} \bullet Z\{b \vee (b \bullet a)\}\ \vee\ (a \vee b)^* \bullet X\{Y\{a \bullet b\} \bullet a\} \bullet Z\{b \vee (b \bullet a)\}$$

In short, to normalize a formula, all disjunctions that have capture variables in their disjuncts can be "pulled up" over concatenations and other capture variables in the expression tree to create separate disjuncts at the outermost level of the formula.
Lemma 3.4.
Given an extraction formula $E$ with $v = |V|$ capture variables and at most $d$ disjuncts per capture variable, $k \le d^v$ in the normalized form $\Delta(E)$.
Proof.
By induction on $v$. ◀
Corollary 3.5.
Every update formula $g$ can be rewritten as $\Delta(g)$, a disjunction of the form $\bigvee_{i=1}^{k} U_i$ where $U_i$ is a formula defined by $(\gamma_2\ \bullet)?\ x\{\gamma_2\}\ (\bullet\ \gamma_2)?$ for $i \in 1..k$, $\gamma_2$ is the variable-free regular expression defined by production (2), $k \le d$, and $d$ is the number of disjuncts including the capture variable in $g$.
As noted earlier, we require that an update spanner produces no overlapping spans.
Definition 3.6.
An update spanner is unrestricted if, for every input document, the spans marked by the capture variable $x$ are pairwise identical or non-overlapping, i.e., there does not exist a witness for overlapping spans. (Note that even though a span relation is a set, not a bag, the same span might be marked through more than one match to the update formula.)
To determine whether the set of witnesses for overlapping spans is provably empty (Figure 4), we first normalize $g$ and then construct the automaton $M_g$ that matches $\Delta(g)$ using standard techniques [20]. (Regarding the normalized form: because the formulas are functional, if a capture variable appears in one disjunct, it must appear in all disjuncts; for all practical purposes, normalization is a polynomial blowup in expression size.) Let $U_i = \gamma_{L_i} \bullet x\{\gamma_{C_i}\} \bullet \gamma_{R_i}$ represent the $i$th disjunct of $\Delta(g)$. Then, for each disjunct $U_i$, let the finite automaton for every (variable-free) sub-expression $\gamma_{L_i}$, $\gamma_{C_i}$, and $\gamma_{R_i}$ be represented by $M_{L_i}$, $M_{C_i}$, and $M_{R_i}$ respectively. Formally: $M_{L_i} = \langle Q_{L_i}, \Sigma_{L_i}, \delta_{L_i}, Q^0_{L_i}, F_{L_i} \rangle$, $M_{C_i} = \langle Q_{C_i}, \Sigma_{C_i}, \delta_{C_i}, Q^0_{C_i}, F_{C_i} \rangle$, and $M_{R_i} = \langle Q_{R_i}, \Sigma_{R_i}, \delta_{R_i}, Q^0_{R_i}, F_{R_i} \rangle$.
$M_{U_i}$ is constructed by applying the standard concatenation operator to $M_{L_i}$, $M_{C_i}$, and $M_{R_i}$; that is, $M_{U_i} = M_{L_i} \bullet M_{C_i} \bullet M_{R_i}$.
Finally, we construct $M_g = \langle Q_g, \Sigma_g, \delta_g, Q^0_g, F_g \rangle$ by applying the standard union operator to the $M_{U_i}$ machines. Then $Q_g = Q_L \cup Q_C \cup Q_R$ in which $Q_L$ is the union of states in $Q_{L_i}$, $Q_C$ is the union of states in $Q_{C_i}$, and $Q_R$ is the union of states in $Q_{R_i}$. Given $a \in \Sigma_g$, $Q \subseteq Q_g$, and $q \in Q_g$, let $C(Q, q, a)$ denote the predicate "$q \in Q \wedge \delta_g(q, a) \in Q$" and $C(Q_1, q, a, Q_2)$ denote the predicate "$q \in Q_1 \wedge \delta_g(q, a) \in Q_2$."
We build the following automaton to identify the set of witnesses for overlapping spans for a given $\Delta(g)$ represented by $M_g$.
Each state encodes four properties: the state of matching for each of two (not necessarily distinct) disjuncts in $\Delta(g)$, whether or not the matched spans are different, and whether or not the spans overlap. $M_\Xi = \langle Q_\Xi, \Sigma_\Xi, \delta_\Xi, Q^0_\Xi, F_\Xi \rangle$ where

$$\Sigma_\Xi = \Sigma_{M_g},$$
$$Q_\Xi = Q_g \times Q_g \times \{T, F\} \times \{T, F\},$$
$$Q^0_\Xi = \{(q_i, q_j, F, F) \mid q_i \in Q^0_g \wedge q_j \in Q^0_g\},$$
$$F_\Xi = \{(q_i, q_j, T, T) \mid q_i \in F_g \wedge q_j \in F_g\},$$
$$\delta_\Xi((q_i, q_j, v, w), a) = \begin{cases}
(\delta_g(q_i, a), \delta_g(q_j, a), T, w) & \text{if } C(Q_C, q_i, a) \wedge C(Q_L, q_j, a) \\
(\delta_g(q_i, a), \delta_g(q_j, a), T, w) & \text{if } C(Q_C, q_i, a) \wedge C(Q_R, q_j, a) \\
(\delta_g(q_i, a), \delta_g(q_j, a), v, T) & \text{if } C(Q_C, q_i, a) \wedge C(Q_C, q_j, a) \\
(\delta_g(q_i, a), \delta_g(q_j, a), T, T) & \text{if } C(Q_C, q_i, a) \wedge C(Q_L, q_j, a, Q_R) \\
(\delta_g(q_i, a), \delta_g(q_j, a), v, w) & \text{otherwise.}
\end{cases}$$

Proposition 3.7. $L(M_\Xi) = \{D \mid D \text{ is a witness for overlapping spans for } \llbracket g \rrbracket\}$.
Proof.
We first show that if $D$ is a witness for overlapping spans for $\llbracket g \rrbracket$ then $D \in L(M_\Xi)$. Being a witness implies that $D$ can be matched in at least two different ways: using $M_{U_i}$ and $M_{U_j}$. Let spans $s_i$ and $s_j$ be marked by $x$ in $M_{U_i}$ and $M_{U_j}$, respectively. If they are overlapping, there exist two offsets $o_1$ and $o_2$ (not necessarily distinct) as defined in Figure 4. $s_i$ and $s_j$ cannot both be empty: if they were, they would be either identical or disjoint by definition.
(i) If one of the spans, say $s_i$, is not empty and the other, say $s_j$, is empty, let $o_2$ be an offset that falls within both spans and let $a$ be the symbol at offset $o_2$ in $D$. Because $o_2$ falls in the span matched by $M_{C_i}$ for $M_{U_i}$, reading $a$ causes a transition to some (other) state in $M_{C_i}$ in $M_{U_i}$. However, because $s_j$ is empty, there are only epsilon transitions between the initial state(s) and final state(s) of $M_{C_j}$. Therefore reading $a$ at $o_2$ causes a transition from some state in $M_{L_j}$ to some state in $M_{R_j}$ for $M_{U_j}$. The fourth alternative in the definition of the transition function in $M_\Xi$ sets $v = w = T$, and further transitions will eventually lead to a final state. (For simplicity, we refer to the last two dimensions of each state as if they were variables named $v$ and $w$, respectively.)
(ii) If $s_i$ and $s_j$ are both non-empty, then reading a symbol at $o_1$ will cause a transition from some state in $M_{C_i}$ to some (other) state in $M_{C_i}$ while not making such a transition in $M_{C_j}$ (i.e., staying either wholly within $M_{L_j}$ or $M_{R_j}$). However, reading the symbol at $o_2$ will cause a transition from some state in $M_{C_i}$ to some (other) state in $M_{C_i}$ as well as from some state in $M_{C_j}$ to some (other) state in $M_{C_j}$. The first of these sets $v = T$, and the second sets $w = T$. Thus when the input is exhausted, $M_\Xi$ will be in a final state.
Second, we show that $D \in L(M_\Xi)$ implies that $D$ is a witness for overlapping spans for $\llbracket g \rrbracket$. By construction, if $M_\Xi$ accepts an input, it corresponds to starting in an initial state and ending in a final state of $M_g$. Furthermore, marking both $w = T$ and $v = T$ necessitates that the input contains two offsets $o_1$ and $o_2$ (not necessarily distinct) as defined in Figure 4. ◀
Corollary 3.8.
Let $g$ be an update formula and construct $M_\Xi$ as above. If $min(M_\Xi) = \emptyset$ then $\llbracket g \rrbracket$ is an unrestricted update spanner.
As defined above, applying an update expression $Repl(g, A)$ to an input document $D$, where $g$ specifies an unrestricted update spanner, returns a new document $D'$ in which the content of each span identified by $\llbracket g \rrbracket$ is replaced by the string $A$. Given an update expression and an extraction spanner, we wish to determine, for all potential input documents, whether the extracted materialized view can be kept consistent with the updated source documents without running the extractor after updating the documents in the database. This problem is similar to filtering out irrelevant updates or applying updates autonomously to relational materialized views [4].
Definition 4.1.
An update expression $Repl(g, A)$ is irrelevant with respect to an extractor $\llbracket E \rrbracket$ if for every input document $D$, applying $\llbracket E \rrbracket$ to $Repl(g, A)(D)$ produces a span relation that is identical to applying $\llbracket E \rrbracket$ to $D$. That is, if $D' = Repl(g, A)(D)$, then $\llbracket E \rrbracket(D') = \llbracket E \rrbracket(D)$.
If an update expression is relevant with respect to an extractor, it may be that the modification to the extracted relation can be computed without re-running the extractor.
Definition 4.2.
An update expression $Repl(g, A)$ is autonomously computable with respect to an extractor $\llbracket E \rrbracket$ if for every input document $D$, applying $\llbracket E \rrbracket$ to $Repl(g, A)(D)$ can be computed from the update expression, the update relation, the extraction formula that defines the extraction spanner, and the extracted relation.
There is an important distinction between the problems of updating traditional relational views and updating materialized extractions. Span relations contain pairs of offsets from input documents, not document content. Thus a span relation might be affected by an update even if the replaced text is not within an extracted span. In particular, replacing a string of one length by a string of another length somewhere in the document will cause a span somewhere else in the document to shift, even if the content of that span is unaffected.
More specifically, given a document $D$ and the corresponding updated document $D'$, if span $S$ in $D$ is disjoint from all spans produced by the unrestricted update spanner $\llbracket g \rrbracket$, let $shift(g, A)(S)$ represent the corresponding span in $D'$, i.e., the new location of the content of $S$ in $D'$. $shift(g, A)(S)$ is shifted from $S$ by an amount that is dependent on the length of $A$ and the lengths of all spans in the update relation that precede $S$ in $D$, as captured by Algorithm 1.
Definition 4.3.
Update expression $Repl(g, A)$ is pseudo-irrelevant with respect to an extraction spanner $\llbracket E \rrbracket$ if for every input document $D$, applying $\llbracket E \rrbracket$ to $Repl(g, A)(D)$ produces a span relation that is identical to applying $\llbracket E \rrbracket$ to $D$ except to replace each span $S$ by $shift(g, A)(S)$. That is, if $D' = Repl(g, A)(D)$, then $\llbracket E \rrbracket(D') = \{S' \mid \exists S \in \llbracket E \rrbracket(D) \text{ such that } S' = shift(g, A)(S)\}$.
(The $min$ function represents standard state minimization. Autonomous computability for updates is analogous to determinacy [29] for queries.)
Algorithm 1 Shift Algorithm.
Input: update relation $R_U$, $A$, span $S = [i, j\rangle$
Output: span $S' = [i', j'\rangle = shift(g, A)(S)$
Precondition: $R_U$ contains no duplicates and no span that overlaps $S$ or any other span in $R_U$
  shift ← 0
  for tuple [m, n⟩ ∈ R_U do
    if m < i then
      shift ← shift + (n − m) − length(A)
    end
  end
  return [i − shift, j − shift⟩
Thus, a pseudo-irrelevant update is a special case of an autonomously computable update.
Note.
By definition, if an update expression is irrelevant with respect to an extraction spanner, then it is also pseudo-irrelevant with respect to that spanner.
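As a minimal sketch, Algorithm 1 can be written in Python as follows. The tuple representation of spans, the `overlaps` helper (which follows the overlap condition on span offsets), and the function names are assumptions made for this illustration only.

```python
from typing import Iterable, Tuple

# A span [i, j> is modeled as a pair (i, j): 1-based start, exclusive end.
Span = Tuple[int, int]

def overlaps(s: Span, t: Span) -> bool:
    """Overlap test on span offsets (assumed form of the definition)."""
    (i1, j1), (i2, j2) = s, t
    return i1 <= i2 < j1 or i2 <= i1 < j2

def shift(update_relation: Iterable[Span], a: str, s: Span) -> Span:
    """New location of span s after every updated span is replaced by a."""
    i, j = s
    total = 0
    for m, n in set(update_relation):       # duplicates are ignored
        if overlaps((m, n), s):
            raise ValueError("precondition violated: update span overlaps S")
        if m < i:                            # only updates that precede S matter
            total += (n - m) - len(a)
    return (i - total, j - total)

def text(d: str, s: Span) -> str:
    """Substring of d denoted by span s."""
    return d[s[0] - 1:s[1] - 1]

# Hypothetical example: insert "free " at the empty span [4, 4> of "us at 403".
doc = "us at 403"
new_doc = "us free at 403"
old_span = (7, 10)                           # marks "403" in doc
new_span = shift([(4, 4)], "free ", old_span)
print(new_span, text(new_doc, new_span))
```

An insertion yields a negative `total`, so the span moves to the right; a deletion (empty `a`) moves it to the left by the total length of the deleted spans that precede it.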
We wish to identify whether an update is irrelevant or pseudo-irrelevant with respect to a given extractor, independently of input documents. The essence of our approach is to inspect various kinds of overlap between an update expression and an extractor. The proposed process verifies some sufficient conditions for irrelevant, autonomously deletable, and pseudo-irrelevant updates.
If an update changes the content length of an extracted span, then it will be relevant; the extractor should be re-executed. However, even without changing an extracted value, an update could change the context for determining that a span should be extracted. First, updated spans, with the new value $A$, could form new matches for the extraction spanner, which would create new rows in the extracted view if we re-run the extractor. Second, some extracted spans might no longer match after the update, and therefore the associated rows would disappear when the extractor is re-run after the update.
After introducing a few simple constructs, we present a sound, but not necessarily complete, mechanism to determine whether an update expression, specified by the update formula $g$ and replacement string $A$, is pseudo-irrelevant with respect to a document spanner specified by an extraction formula $E$ (Figure 5).
Definition 5.1.
Given $Repl(g, A)$, the proxy language $\nabla(g, A)$ is defined using the following disjunctive form:

$$\nabla(g, A) = \bigvee_{i=1}^{k} V_i$$

where $V_i$ is derived from disjunct $U_i$ in $\Delta(g)$ by replacing the marked subexpression in that disjunct by $x\{A\}$, that is, $V_i = \gamma_{L_i} \bullet x\{A\} \bullet \gamma_{R_i}$ where $\gamma_{L_i}$ and $\gamma_{R_i}$ are the subexpressions preceding and following, respectively, the marked subexpression in $U_i$.
(When an update changes the content length of an extracted span, there are some conditions under which the extracted relation after update might still be autonomously computable; we leave the determination and detection of such conditions for future work. The two effects of an update on the extraction context, creating new matches and removing existing ones, are not mutually exclusive.)
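To make the update model and the proxy language concrete, here is a brute-force sketch. It assumes that each disjunct $U_i$ of a normalized update formula is given as a triple of Python `re` patterns `(left, center, right)` standing in for the variable-free expressions before, inside, and after the marked subexpression; this encoding and all function names are hypothetical, and the enumeration is meant only for small examples, not as the automata-based procedure.

```python
import re
from itertools import combinations

def overlaps(s, t):
    """Overlap test on 1-based spans (i, j) with exclusive end."""
    (i1, j1), (i2, j2) = s, t
    return i1 <= i2 < j1 or i2 <= i1 < j2

def update_relation(disjuncts, d):
    """All spans marked by x for some disjunct (left, center, right)."""
    spans, n = set(), len(d)
    for left, center, right in disjuncts:
        for i in range(n + 1):
            if not re.fullmatch(left, d[:i], re.DOTALL):
                continue
            for j in range(i, n + 1):
                if (re.fullmatch(center, d[i:j], re.DOTALL)
                        and re.fullmatch(right, d[j:], re.DOTALL)):
                    spans.add((i + 1, j + 1))
    return sorted(spans)

def repl(disjuncts, a, d):
    """Replace every marked span by a, provided no two spans overlap."""
    spans = update_relation(disjuncts, d)
    if any(overlaps(s, t) for s, t in combinations(spans, 2)):
        raise ValueError("update relation contains overlapping spans")
    out, pos = [], 1                  # pos: next unread 1-based offset
    for i, j in spans:
        out.append(d[pos - 1:i - 1])  # text before the marked span
        out.append(a)                 # replacement
        pos = j
    out.append(d[pos - 1:])
    return "".join(out)

def proxy(disjuncts, a):
    """Disjuncts V_i: the marked subexpression is replaced by the constant a."""
    return [(left, re.escape(a), right) for left, _, right in disjuncts]

# Hypothetical example: insert "free " between "us " and "at".
g = [(r".*us ", r"", r"at.*")]
doc = "call us at 403"
updated = repl(g, "free ", doc)
print(update_relation(g, doc), updated)
print(update_relation(proxy(g, "free "), updated))
```

In this example the proxy disjunct marks exactly the inserted text in the updated document, which is the role the proxy language plays when reasoning about the document after the update.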
Figure 5
The verifier statically analyzes an update expression and an extraction formula to test sufficient conditions for being a pseudo-irrelevant update.
We can now describe two simple special cases:
(i) If $L(B(E)) \cap L(B(g)) = \emptyset$, the update is irrelevant: there is no document on which both $E$ and $g$ match, and therefore any document that is updated cannot have extracted content.
(ii) If $L(B(E)) \cap L(B(g)) \ne \emptyset$ but $L(B(E)) \cap L(B(\nabla(g, A))) = \emptyset$, there exist documents on which both $E$ and $g$ match, but if such a document is updated, the span relation produced by the extractor becomes empty. Although the update is relevant, it is autonomously computable: every extracted tuple from the updated relation is deleted.
We need to determine the relative positions of the capture variables in the extraction spanner and the unrestricted update spanner to determine whether an update is pseudo-irrelevant.
Clearly, if a document update changes some or all of the content of an extracted span, it will in general change the extracted span relation. Similarly, after an update, the replacement text might cause one or more additional spans to be extracted, so that the span relation includes tuples that did not meet the extraction condition before the update. We leave it to future work to determine under what conditions an update that overlaps extracted spans happens to be pseudo-irrelevant. Instead, we determine when there can be no overlap and then under which further conditions an update is pseudo-irrelevant.
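The emptiness tests in special cases (i) and (ii) reduce to the standard product construction on finite automata. The sketch below assumes that deterministic automata for $B(E)$, $B(g)$, and $B(\nabla(g, A))$ have already been built and are given as dictionaries with keys `start`, `accept`, and `delta` (a partial transition map); this representation, the toy automata, and the function names are assumptions of the illustration.

```python
from collections import deque

def intersection_is_empty(dfa1, dfa2, alphabet):
    """True if no string is accepted by both DFAs (product reachability)."""
    start = (dfa1["start"], dfa2["start"])
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in dfa1["accept"] and q in dfa2["accept"]:
            return False                      # a common string exists
        for a in alphabet:
            nxt = (dfa1["delta"].get((p, a)), dfa2["delta"].get((q, a)))
            if None in nxt or nxt in seen:    # missing transition = dead state
                continue
            seen.add(nxt)
            queue.append(nxt)
    return True

def special_case(dfa_e, dfa_g, dfa_proxy, alphabet):
    """Classify an update according to special cases (i) and (ii)."""
    if intersection_is_empty(dfa_e, dfa_g, alphabet):
        return "irrelevant"
    if intersection_is_empty(dfa_e, dfa_proxy, alphabet):
        return "autonomously computable: all extracted tuples deleted"
    return "undetermined"

# Toy automata over {a, b}, purely illustrative.
contains_ab = {"start": 0, "accept": {2},
               "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1,
                         (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}}
only_b = {"start": 0, "accept": {0}, "delta": {(0, "b"): 0}}
ends_with_b = {"start": 0, "accept": {1},
               "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

print(special_case(contains_ab, only_b, only_b, "ab"))
print(special_case(contains_ab, ends_with_b, only_b, "ab"))
```

The breadth-first search explores only reachable product states, so the test runs in time proportional to the product of the two state sets times the alphabet size.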
Definition 5.2.
Given two extraction formulas $E_1$ and $E_2$, $\llbracket E_1 \rrbracket$ and $\llbracket E_2 \rrbracket$ are disjoint if for every document $D$, $\llbracket E_1 \rrbracket(D)$ includes no span that overlaps with a span in $\llbracket E_2 \rrbracket(D)$. Otherwise, we say that the spanners overlap.
Given $Repl(g, A)$ and extractor $E$, we construct $M_{\Updownarrow}$ to determine whether $\llbracket E \rrbracket$ and the unrestricted update spanner could produce at least one overlapping pair of spans, that is, whether they could have at least one offset $o$ in common. First we create a finite state machine $M^i_\vee$ for each disjunct of $\Delta(E)$:

$$M^i_\vee = M_{R_0} \bullet M_{\gamma_1} \bullet M_{R_1} \bullet \ldots \bullet M_{\gamma_{n_i}} \bullet M_{R_{n_i}}$$

where $M_{\gamma_m}$ encodes the regular expression captured by the $m$th capture variable, and then we define $M_E = \langle Q_E, \Sigma_E, \delta_E, Q^0_E, F_E \rangle$ by applying the standard union operator over $M^i_\vee$ where $1 \le i \le k$. Next, we reuse $M_g = \langle Q_g, \Sigma_g, \delta_g, Q^0_g, F_g \rangle$ and the predicates $C(Q, q, a)$ and $C(Q_1, q, a, Q_2)$ that were introduced in Section 3.2. Let $Q_{\gamma_m}$ denote the states in $M_{\gamma_m}$. With these, we define $M_{\Updownarrow} = \langle Q_{\Updownarrow}, \Sigma_{\Updownarrow}, \delta_{\Updownarrow}, Q^0_{\Updownarrow}, F_{\Updownarrow} \rangle$ where

$$\Sigma_{\Updownarrow} = \Sigma_{M_g} \cap \Sigma_{M_E},$$
$$Q_{\Updownarrow} = Q_g \times Q_E \times \{T, F\},$$
$$Q^0_{\Updownarrow} = \{(q_i, q_j, F) \mid q_i \in Q^0_g \wedge q_j \in Q^0_E\},$$
$$F_{\Updownarrow} = \{(q_i, q_j, T) \mid q_i \in F_g \wedge q_j \in F_E\},$$
$$\delta_{\Updownarrow}((q_i, q_j, v), a) = \begin{cases}
(\delta_g(q_i, a), \delta_E(q_j, a), T) & \text{if } (C(Q_C, q_i, a) \vee C(Q_L, q_i, a, Q_R)) \wedge \exists m\, (C(Q_{\gamma_m}, q_j, a) \vee C(Q_{R_{m-1}}, q_j, a, Q_{R_m})) \\
(\delta_g(q_i, a), \delta_E(q_j, a), v) & \text{otherwise.}
\end{cases}$$

Proposition 5.3. $L(M_{\Updownarrow}) = \{D \mid D \text{ is a witness for overlapping spans for update formula } g \text{ and extraction formula } E\}$.
Proof.
The transition function identifies transitions that stay within a marked span or signal empty marked spans. The proof is then similar to that of Proposition 3.7. ◀
Corollary 5.4.
Let $\llbracket g \rrbracket$ be an unrestricted update spanner specified by update formula $g$ and $\llbracket E \rrbracket$ be a document spanner specified by extraction formula $E$, and construct automaton $M_{\Updownarrow}$ as above. If $min(M_{\Updownarrow}) = \emptyset$, $\llbracket g \rrbracket$ and $\llbracket E \rrbracket$ are disjoint.
Similarly, for a given extraction spanner and the proxy spanner for an update, we build a finite automaton, $M^p_{\Updownarrow}$, to recognize the set of witnesses for overlapping spans. The construction procedure is exactly the same as constructing $M_{\Updownarrow}$, because a proxy spanner is isomorphic to a special case of an unrestricted update spanner with a constant string as the marked subexpression.
Proposition 5.5. $L(M^p_{\Updownarrow}) = \{D \mid D \text{ is a witness for overlapping spans for the proxy formula } \nabla(g, A) \text{ and the extraction formula } E\}$.
Proof.
Identical to Proposition 5.3. (cid:74)(cid:73)
Corollary 5.6.
Let (cid:74) ∇ ( g, A ) (cid:75) be a proxy spanner specified by ∇ ( g, A ) and (cid:74) E (cid:75) be a documentspanner specified by extraction formula E , construct automaton M p (cid:109) as above. If min ( M p (cid:109) ) = ∅ , (cid:74) ∇ ( g, A ) (cid:75) and (cid:74) E (cid:75) are disjoint. (cid:73) Theorem 5.7.
For all documents, Repl ( g, A ) is disjoint from (cid:74) E (cid:75) (i.e., (cid:74) g (cid:75) is disjointfrom (cid:74) E (cid:75) and (cid:74) ∇ ( g, A ) (cid:75) is disjoint from (cid:74) E (cid:75) ) if min ( M (cid:109) ) = min ( M p (cid:109) ) = ∅ for automata M (cid:109) and M p (cid:109) as defined above. Proof.
This follows directly from Corollaries 5.4 and 5.6. (cid:74)
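The idea behind the overlap automaton constructed above can be illustrated with a small product-automaton search. The following Python sketch is not the authors' implementation: the class `DFA`, its `marked` field, and the function `overlap_witness` are hypothetical names introduced here. It assumes deterministic automata and replaces the predicates from Section 3.2 with a simpler test (the transition enters a state that lies inside a marked span), so it ignores empty marked spans.

```python
from collections import deque


class DFA:
    """Hypothetical deterministic automaton with a set of 'marked' states,
    i.e. states entered by consuming a symbol inside a marked span."""

    def __init__(self, start, accept, delta, marked):
        self.start = start
        self.accept = set(accept)
        self.delta = delta          # dict: (state, symbol) -> state
        self.marked = set(marked)


def overlap_witness(g, e, alphabet):
    """Breadth-first search over the product of g and e with a Boolean flag.

    The flag becomes True once both automata consume the same symbol inside
    their marked spans. Returns a shortest document accepted by both automata
    with the flag set, or None if the product language is empty."""
    start = (g.start, e.start, False)
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        (p, q, flag), word = queue.popleft()
        if flag and p in g.accept and q in e.accept:
            return word
        for a in alphabet:
            if (p, a) not in g.delta or (q, a) not in e.delta:
                continue  # one automaton rejects this symbol
            p2, q2 = g.delta[(p, a)], e.delta[(q, a)]
            flag2 = flag or (p2 in g.marked and q2 in e.marked)
            nxt = (p2, q2, flag2)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + a))
    return None


# Update automaton for a* b a*: the single 'b' is the marked (updated) span.
update_dfa = DFA(
    start=0, accept={1, 2}, marked={1},
    delta={(0, "a"): 0, (0, "b"): 1, (1, "a"): 2, (2, "a"): 2},
)
# Extractor automaton for a b a*: the leading 'a' is the captured span.
extractor_dfa = DFA(
    start=0, accept={2}, marked={1},
    delta={(0, "a"): 1, (1, "b"): 2, (2, "a"): 2},
)
```

In this toy setting, a result of `None` plays the role of an empty minimized automaton: no document exists on which the update and the extractor touch a common offset. Searching the product lazily avoids materializing all of its states, which is a design choice of this sketch rather than something prescribed by the construction.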
If an update is pseudo-irrelevant to an extractor, then all extracted spans must be shifted in a consistent manner. Therefore, the ordering within a document of the extracted spans forming each row in the extracted relation must remain unchanged after a pseudo-irrelevant update. Consider one disjunct from the normalized extraction formula (Lemma 3.3)

\[ E_i = \theta_0 \bullet X_1\{\theta'_1\} \bullet \theta_1 \bullet X_2\{\theta'_2\} \bullet \theta_2 \cdots \bullet X_n\{\theta'_n\} \bullet \theta_n \]

and a document $\alpha_0 \beta_1 \alpha_1 \beta_2 \alpha_2 \ldots \beta_n \alpha_n$ where $\alpha_j, \beta_j \in \Sigma^*$. If $\alpha_j$ matches $\theta_j$ and $\beta_j$ matches $\theta'_j$, and if after substituting $A$ for strings identified by marked spans within the spans covering only the $\alpha_j$ the updated document still matches $E_i$, then the new locations of the spans matching $\beta_j$ will be simple shifts from their locations prior to the update. In fact, this will be true not only if the document matches the same $E_i$ after update, but also if it matches a disjunct that is similar to $E_i$ as defined here.

For each disjunct introduced in Lemma 3.3, create a variable-profile that expresses the relative position of each capture variable with respect to other variables. More specifically, given $E$, a formula matching the grammar for $\bar{\gamma}$, define $v(E)$ as the string produced from $E$ by eliminating all symbols except for capture variables and left and right braces. For example, if $E = s_0 \bullet z\{s_1 \bullet y\{s_2\}\} \bullet x\{s_3\}$ where the $s_i$ are (standard) regular expressions, then $v(E) = z\{y\{\}\}x\{\}$. Next, given an extraction formula $E$, let $\phi_E$ define a partitioning of the disjuncts in $\Delta(E)$ by their variable-profiles:

\[ \phi_E(E_i) = \{ E_j \mid E_j \text{ is a disjunct in } \Delta(E) \text{ and } v(E_i) = v(E_j) \} \]

where $E_i$ is a disjunct in $\Delta(E)$. Finally, denote the union of all disjuncts in a partition as $\Phi_E(E_i) = \bigcup \phi_E(E_i)$ and let $\Phi(E) = \{ \Phi_E(E_i) \mid E_i \in \Delta(E) \}$.

▶ Theorem 5.8.
Given an update expression $\mathit{Repl}(g, A)$ defining an unrestricted update spanner and a disjoint extractor defined by $E$, let $E_i$ denote the disjuncts in $\Delta(E)$ and $L_i = L(B(\Phi_E(E_i)))$. The update is pseudo-irrelevant with respect to the extractor if and only if

\[ \forall D\, \big( [D = \mathit{Repl}(g, A)(D)] \lor \forall E_i\, [D \in L_i \iff \mathit{Repl}(g, A)(D) \in L_i] \big) \]

Proof. (only if:) Assume that the update is pseudo-irrelevant with respect to the extractor. If $\forall D\, (D = \mathit{Repl}(g, A)(D))$ then the theorem holds. Otherwise, choose $D$ such that $D' = \mathit{Repl}(g, A)(D) \neq D$. Let $D = \alpha_0 \beta_1 \alpha_1 \beta_2 \alpha_2 \ldots \beta_n \alpha_n$ where $\alpha_j, \beta_j \in \Sigma^*$ and each $\beta_j$ matches the capture variable in $\llbracket g \rrbracket$. (These must be non-overlapping.) Thus $D' = \alpha_0 A \alpha_1 A \alpha_2 \ldots A \alpha_n$. Because the extractor is disjoint from the update, all the extracted spans must appear within the $\alpha_j$ segments, and because the update is pseudo-irrelevant, the extractions from $D'$ must all be merely shifts from the extractions in $D$. But, in that case, the disjunct $E_{i'}$ causing the extraction for $D'$ must have the same variable-profile as the disjunct $E_i$ causing the extraction for $D$; that is, $\Phi_E(E_{i'}) = \Phi_E(E_i)$ and the theorem holds.

(if:) Assume that the update is not pseudo-irrelevant with respect to the extractor. In that case, there exists a witness document $D$ such that applying $E$ to $D' = \mathit{Repl}(g, A)(D)$ produces a span relation where $\llbracket E \rrbracket(D') \neq \{ S' \mid \exists S \in \llbracket E \rrbracket(D)$ such that $S' = \mathit{shift}(g, A)(S) \}$. That is, either (case 1) there is a span $s$ in $\llbracket E \rrbracket(D)$ that does not have a corresponding shifted span in $\llbracket E \rrbracket(D')$, or (case 2) there is a span $s'$ in $\llbracket E \rrbracket(D')$ that is not simply a shift from some span in $\llbracket E \rrbracket(D)$. Thus, $D' \neq D$.

Case if-1: Let $E_i$ be a disjunct in $\Delta(E)$ that includes $s$ in $\llbracket E_i \rrbracket(D)$ and thus $D \in L_i$. Let $E_i = \theta_0 \bullet X_1\{\theta'_1\} \bullet \theta_1 \bullet X_2\{\theta'_2\} \bullet \theta_2 \cdots \bullet X_n\{\theta'_n\} \bullet \theta_n$ and $D = \alpha_0 \beta_1 \alpha_1 \beta_2 \alpha_2 \ldots \beta_n \alpha_n$ where $\alpha_j, \beta_j \in \Sigma^*$, $\alpha_j$ matches $\theta_j$, and $\beta_j$ matches $\theta'_j$. Because $D' \neq D$ and $\llbracket g \rrbracket$ does not overlap $\llbracket E \rrbracket$, there are some updates, all of which must be replacements within $\alpha_0, \alpha_1, \ldots, \alpha_n$. Let $\alpha_i = \alpha_{i0} s_{i1} \alpha_{i1} s_{i2} \alpha_{i2} \ldots s_{i n_i} \alpha_{i n_i}$ where $s_{i1} \ldots s_{i n_i}$ each match the capture variable in $\Delta(g)$. (Note that these matches must all be mutually disjoint because an unrestricted update spanner cannot produce overlapping spans.) The update will replace $\alpha_i$ by $\alpha'_i = \alpha_{i0} A \alpha_{i1} A \alpha_{i2} \ldots A \alpha_{i n_i}$, producing the document $D' = \alpha'_0 \beta_1 \alpha'_1 \beta_2 \alpha'_2 \ldots \beta_n \alpha'_n$, and if each $\beta_j$ was within span $b_j$, then $b_j$ is shifted by $\mathit{shift}(g, A)(b_j)$. Thus if the update is not pseudo-irrelevant, then clearly $D' \notin L_i$, because otherwise a disjunct with the same variable-profile as $E_i$ would match $D'$ and the shifted span would appear in the extracted relation for $D'$.

Case if-2: Let $E_i$ be a disjunct in $\Delta(E)$ that includes $s'$ in $\llbracket E_i \rrbracket(D')$ and thus $D' \in L_i$. Let $E_i = \theta_0 \bullet X_1\{\theta'_1\} \bullet \theta_1 \bullet X_2\{\theta'_2\} \bullet \theta_2 \cdots \bullet X_n\{\theta'_n\} \bullet \theta_n$ and $D' = \alpha'_0 \beta_1 \alpha'_1 \beta_2 \alpha'_2 \ldots \beta_n \alpha'_n$ where $\alpha'_j, \beta_j \in \Sigma^*$, $\alpha'_j$ matches $\theta_j$, and $\beta_j$ matches $\theta'_j$. There must be such an $E_i$ because $s'$ is in $\llbracket E \rrbracket(D')$. As before, if $D' \neq D$, some update occurred. Because $\llbracket \nabla(g, A) \rrbracket$ does not overlap $\llbracket E \rrbracket$, all updates must have been replacements within $\alpha'_0, \alpha'_1, \ldots, \alpha'_n$. Let $\alpha'_i = \alpha_{i0} A \alpha_{i1} A \alpha_{i2} \ldots A \alpha_{i n_i}$ where the indicated instances of the string $A$ are a result of the update (i.e., not already present in $D$). (Note that again these instances must all be mutually disjoint because an unrestricted update spanner cannot produce overlapping spans.) The update will have created $\alpha'_i$ from $\alpha_i = \alpha_{i0} s_{i1} \alpha_{i1} s_{i2} \alpha_{i2} \ldots s_{i n_i} \alpha_{i n_i}$ where $s_{i1} \ldots s_{i n_i}$ match the capture variable in $\Delta(g)$. Thus $D = \alpha_0 \beta_1 \alpha_1 \beta_2 \alpha_2 \ldots \beta_n \alpha_n \in L(B(U_j))$ and if each $\beta_j$ was within span $b_j$ in $D$, then $b_j$ will have been shifted by $\mathit{shift}(g, A)(b_j)$. Thus if the update is not pseudo-irrelevant, then clearly $D$ cannot be in $L_i$, because otherwise a disjunct with the same variable-profile as $E_i$ would match $D$ and the pre-shifted span would appear in the extracted relation for $D$.

Thus in both cases, $\exists D\, \exists E_i\, (D \in L_i \iff \mathit{Repl}(g, A)(D) \notin L_i)$, which completes the proof. ◀

Using this theorem, we construct a machine, i.e., $M_R$, to recognize pseudo-irrelevant updates (Algorithm 2). Algorithm 2 creates finite state machines using standard operators including concatenation ($\bullet$), union ($\cup$), intersection ($\cap$), and complement ($\overline{M}$) [20]. (We assume that the built-in function $\mathit{fsm}()$ eliminates all capture variables from an input regular formula and converts the result to its equivalent finite state machine.)

Algorithm 2
Construction Algorithm for Recognizer of Pseudo-Irrelevant Updates.
Input: extraction formula $E$, update expression $\mathit{Repl}(g, A)$
Output: automaton $M_R$
Precondition: $\llbracket g \rrbracket$ unrestricted, $\mathit{Repl}(g, A)$ disjoint from $\llbracket E \rrbracket$

  $M_\phi, M_R \leftarrow \emptyset$;
  /* build extraction automata: */
  for $\Phi_i \in \Phi(E)$ do
    /* $\Phi_i$ includes all disjuncts with the $i$-th variable-profile */
    $M^{i}_{\phi} \leftarrow \mathit{fsm}(\Phi_i)$;
  end
  /* build document/update pairs: */
  for $u_j \in \Delta(g)$ do
    $v_j \leftarrow$ corresponding disjunct in $\nabla(g, A)$;
    $M^{j} \leftarrow \mathit{fsm}(u_j)$, $M^{j}_{\nabla} \leftarrow \mathit{fsm}(v_j)$;
    forall $M^{i}_{\phi}$ do
      $M_R \leftarrow M_R \cup ((M^{j} \cap M^{i}_{\phi}) \bullet (M^{j}_{\nabla} \cap \overline{M^{i}_{\phi}}))$;
      $M_R \leftarrow M_R \cup ((M^{j} \cap \overline{M^{i}_{\phi}}) \bullet (M^{j}_{\nabla} \cap M^{i}_{\phi}))$;
    end
  end
  return $M_R$

▶ Proposition 5.9. $\min(M_R) \neq \emptyset$ iff there exists a witness $D$ showing that the unrestricted spanner defined by $\mathit{Repl}(g, A)$ is not pseudo-irrelevant with respect to the disjoint extractor $\llbracket E \rrbracket$.

Proof.
First we prove that if there exists a witness document that shows $\mathit{Repl}(g, A)$ is not pseudo-irrelevant with respect to the extractor $\llbracket E \rrbracket$ then $\min(M_R) \neq \emptyset$. Based on Theorem 5.8, if an update is not pseudo-irrelevant there exist at least an input string $D$, an update disjunct $U_j$, and a partition $\Phi_E(E_i)$ such that one of the following two cases applies.

Case 1: $D \in L_i \implies \mathit{Repl}(g, A)(D) \notin L_i$. Based on the construction, the following holds:

\[
\begin{aligned}
& D \in L(M^{j}) \land \mathit{Repl}(g, A)(D) \in L(M^{j}_{\nabla}) \\
& D \in L(M^{i}_{\phi}) \land \mathit{Repl}(g, A)(D) \in L(\overline{M^{i}_{\phi}}) \\
& D \in L(M^{j} \cap M^{i}_{\phi}) \land \mathit{Repl}(g, A)(D) \in L(M^{j}_{\nabla} \cap \overline{M^{i}_{\phi}}) \\
& D \bullet \mathit{Repl}(g, A)(D) \in L((M^{j} \cap M^{i}_{\phi}) \bullet (M^{j}_{\nabla} \cap \overline{M^{i}_{\phi}})) \\
& D \bullet \mathit{Repl}(g, A)(D) \in L(M_R) \\
& \min(M_R) \neq \emptyset.
\end{aligned}
\]

Case 2: $\mathit{Repl}(g, A)(D) \in L_i \implies D \notin L_i$. Similarly to case 1:

\[
\begin{aligned}
& D \bullet \mathit{Repl}(g, A)(D) \in L((M^{j} \cap \overline{M^{i}_{\phi}}) \bullet (M^{j}_{\nabla} \cap M^{i}_{\phi})) \\
& D \bullet \mathit{Repl}(g, A)(D) \in L(M_R) \\
& \min(M_R) \neq \emptyset.
\end{aligned}
\]

Next we show that if $\min(M_R) \neq \emptyset$ then there exists a witness $D$ that shows $\mathit{Repl}(g, A)$ is not pseudo-irrelevant with respect to the extractor $\llbracket E \rrbracket$. Suppose $s \in L(M_R)$. Then one of the iterations in the innermost loop of Algorithm 2 must have inserted a term into $M_R$. Thus, either $\exists i, j$ s.t. $s \in L((M^{j} \cap M^{i}_{\phi}) \bullet (M^{j}_{\nabla} \cap \overline{M^{i}_{\phi}}))$ or $\exists i, j$ s.t. $s \in L((M^{j} \cap \overline{M^{i}_{\phi}}) \bullet (M^{j}_{\nabla} \cap M^{i}_{\phi}))$. Thus $s$ is the concatenation of two strings $s_1$ and $s_2$ such that $s_1 \in L(M^{j})$ and $s_2 \in L(M^{j}_{\nabla})$. But then either $s_1 \in L(M^{i}_{\phi}) \land s_2 \in L(\overline{M^{i}_{\phi}})$ or $s_1 \in L(\overline{M^{i}_{\phi}}) \land s_2 \in L(M^{i}_{\phi})$. From this it follows that $D$ is a witness that $\mathit{Repl}(g, A)$ is not pseudo-irrelevant with respect to $\llbracket E \rrbracket$. ◀

▶ Corollary 5.10.
If $\min(M_R) = \emptyset$, the update is pseudo-irrelevant.

From this, we arrive at a sufficient verification test for an update being pseudo-irrelevant with respect to an extractor, as depicted earlier in Figure 5:

▶ Theorem 5.11. Given an update expression $\mathit{Repl}(g, A)$ and a regular formula $E$, if $\min(M_\Xi) = \min(M_{\Bumpeq}) = \min(M^{p}_{\Bumpeq}) = \min(M_R) = \emptyset$ for automata $M_\Xi$, $M_{\Bumpeq}$, $M^{p}_{\Bumpeq}$, and $M_R$ as defined above, then the update expression is pseudo-irrelevant with respect to $\llbracket E \rrbracket$.

Proof.
Follows directly from Corollary 3.8 ($\llbracket g \rrbracket$ is unrestricted), Theorem 5.7 ($\mathit{Repl}(g, A)$ and $\llbracket E \rrbracket$ are disjoint), and Corollary 5.10 (updates must produce shifts). ◀

Expectations from extractors have risen as requirements have become more diversified, from the point that there were no criteria to evaluate their performance [17] to the point that extraction algorithms need to work under various stresses such as noisy data, low response time, and diverse types of input and output [34]. The problems that deal with dynamic information sources are closest to our problem. These include continuous adaptation of extractors as their information sources change and equipping extractors with the ability to recycle previously obtained extraction results. For example, the approach by Lerman et al. [25] monitors updates on information sources for a specific class of extraction algorithms (wrappers) and rebuilds the extractor if its performance decreases due to updates to its sources. In other work, Chen et al. [6] efficiently update extractions when new documents are added to the source corpus: they identify segments of new documents that have been seen previously by the extraction process and reuse their associated results.
Researchers have addressed many problems using the document spanner model, including how to deal with documents with missing information [26] and how to eliminate inconsistencies from extracted relations [11]. Others have studied the complexity of evaluating spanners and computing the results of various algebraic operations over span relations [1, 14, 30, 31].

In the presence of updates, the re-evaluation of an extractor might be sped up considerably if it is provably split-correct, that is, if the extracted relation can be computed by combining the extractions from sub-documents [10]. Not only can extractions from various sub-documents then be run in parallel, but extractions can be completely avoided for those sub-documents that are not updated (i.e., those for which the update is irrelevant).

Freydenberger and Thompson [16] have investigated the complexity of incrementally re-evaluating spanners in the presence of updates. However, their update model assumes that a document is encoded as a fixed-length word structure in which (essentially) there is a special character that represents $\epsilon$ and the only operation is replacing one character from $\Sigma \cup \{\epsilon\}$ by another.

In this work we have focused on a specific primitive representation for document spanners, i.e., so-called regex formulas. It has been shown that the class of spanners defined by the more expressive variable-set automata is closed under natural join (as well as some other relational operators) [12, 15, 27], and this mechanism can be used to express various relationships between spans of a document [13]. We plan to investigate whether variable-set automata can be adopted to simplify and extend our approach to determining pseudo-irrelevancy as well as other forms of autonomous updates.

We use finite-state automata to determine whether an update expression is pseudo-irrelevant with respect to a document spanner. Similar static analyses of regular expressions have been used in diverse areas, including access control and feature interactions. For example, Murata et al. [28] propose an automaton-based, statically analyzed access control mechanism for XML database systems. In other work, an event-based framework is introduced for developing and maintaining new gestures that can be used in multi-touch environments [24], and regular expressions associated with gestures are then statically analyzed to identify potential conflicts. Finally, we have also used finite automata to statically analyze extractors specified by JAPE [8] in the context of updating extracted views [22].
Perhaps our biggest contribution is the simple realization that information extraction can be considered as a view mechanism for document databases, subject to research similar to our community's vast experience with relational database views. The dual problems of efficiently maintaining materialized, possibly cascaded, views and of updating documents to reflect updates expressed against extracted views open up many opportunities for continued research that will ultimately lead to practical solutions.

This paper deals with the first of these problems only, and it provides a framework for exploring the basic ideas in extracted view maintenance. We have introduced a simple update model that can be applied to a document database and that is compatible with SystemT, a major extraction framework. We have begun to explore conditions for updates to be deemed irrelevant or autonomously computable with respect to extractors defined using that framework. Finally, we have described a particular form of autonomously computable update, namely pseudo-irrelevance; we have determined sufficient conditions for an update to be pseudo-irrelevant; and we have designed automata to test those conditions for given update expressions and extractors.
We have established some sufficient conditions for updates to be pseudo-irrelevant, but we have not yet investigated whether there are necessary conditions as well. Furthermore, we have not yet investigated other autonomously computable conditions, such as those that might result in span modifications or insertions of extracted tuples. We have also not yet explored the practicality of constructing our verification automata nor investigated update properties of extractors that are defined by mechanisms more expressive than spanners.

Our model for document updates is also quite limited. First, only one variable is used to identify spans that can be updated, even though correlated updates might require multiple related variables to update. Second, the substitute value is limited to being a constant, whereas real-world applications might need to use various values based on factors such as the relative position of the update, some associated string values, or the contexts of matched spans. Third, for each document, all intended spans are updated once and simultaneously, a fundamental assumption that can be violated in practical situations. Loosening any of these restrictions creates new research challenges for verifying pseudo-irrelevance or other update properties.
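To make the three restrictions of the update model concrete (a single marked variable, a constant substitute, and one simultaneous pass), the following Python sketch applies such an update and checks that the extracted spans merely shift. It is an illustration under stated assumptions, not the paper's formalism: the function names `apply_update`, `shift`, and `extract` are hypothetical, and Python's `re.finditer` yields only non-overlapping leftmost matches, whereas a document spanner returns all matching span tuples.

```python
import re


def apply_update(doc, update_regex, const, var="x"):
    """Replace every span captured by named group `var` with `const` in a
    single simultaneous pass. Returns the new document and the replaced
    (start, end) spans in the original document."""
    replaced = [m.span(var) for m in re.finditer(update_regex, doc)]
    pieces, prev = [], 0
    for start, end in replaced:
        pieces.append(doc[prev:start])
        pieces.append(const)
        prev = end
    pieces.append(doc[prev:])
    return "".join(pieces), replaced


def shift(span, replaced, const):
    """Shift a span that does not overlap any replaced span by the total
    length change of all replacements that end at or before its start."""
    start, end = span
    delta = sum(len(const) - (e - s) for s, e in replaced if e <= start)
    return (start + delta, end + delta)


def extract(doc, extract_regex, var="y"):
    """Return the spans captured by named group `var` (a stand-in extractor)."""
    return [m.span(var) for m in re.finditer(extract_regex, doc)]


doc = "id=7; name=Ann; id=42; name=Bob"
new_doc, replaced = apply_update(doc, r"id=(?P<x>\d+)", "000")
old_spans = extract(doc, r"name=(?P<y>\w+)")
new_spans = extract(new_doc, r"name=(?P<y>\w+)")
# For this document the extracted spans after the update are exactly the
# shifted spans from before the update.
shifted = [shift(s, replaced, "000") for s in old_spans]
```

Here the updated spans and the extracted spans never overlap, so recomputing the view reduces to arithmetic on span offsets. The sketch checks one document only; the automata in this paper are what establish the property for all documents.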
Acknowledgements
We gratefully acknowledge financial assistance received from the University of Waterloo and NSERC, the Natural Sciences and Engineering Research Council of Canada.
References

[1] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In Pablo Barceló and Marco Calautti, editors, volume 127 of LIPIcs, pages 22:1–22:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICDT.2019.22.
[2] Douglas E. Appelt and Boyan A. Onyshkevych. The common pattern specification language. In TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, MD, USA, October 13-15, 1998, Baltimore MD USA, 1998.
[3] José A. Blakeley, Neil Coburn, and Per-Åke Larson. Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Trans. Database Syst., 14(3):369–400, 1989. doi:10.1145/68012.68015.
[4] José A. Blakeley, Per-Åke Larson, and Frank Wm. Tompa. Efficiently updating materialized views. In Ashish Gupta and Inderpal Singh Mumick, editors, Materialized Views: Techniques, Implementations, and Applications, pages 163–175. MIT Press, Cambridge MA USA, 1999. (Reprinted from ACM Sigmod '86, pp. 61-71.) URL: http://dl.acm.org/citation.cfm?id=310709.310739.
[5] Xiaoyong Chai, Ba-Quy Vuong, AnHai Doan, and Jeffrey F. Naughton. Efficiently incorporating user feedback into information extraction and integration programs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 87–100, Rhode Island USA, 2009. ACM. doi:10.1145/1559845.1559857.
[6] Fei Chen, AnHai Doan, Jun Yang, and Raghu Ramakrishnan. Efficient information extraction over evolving text data. In Proceedings of the 24th International Conference on Data Engineering, ICDE, pages 943–952, Cancún, Mexico, 2008. IEEE Computer Society. doi:10.1109/ICDE.2008.4497503.
[7] Latha S. Colby, Timothy Griffin, Leonid Libkin, Inderpal Singh Mumick, and Howard Trickey. Algorithms for deferred view maintenance. In H. V. Jagadish and Inderpal Singh Mumick, editors, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, pages 469–480. ACM Press, 1996. doi:10.1145/233269.233364.
[8] Hamish Cunningham, Diana Maynard, and Valentin Tablan. JAPE: a Java annotation patterns engine. Technical Report CS-00-10, Dept. Comp. Sci., Univ. Sheffield, 2000.
[9] Umeshwar Dayal and Philip A. Bernstein. On the updatability of relational views. In Fourth International Conference on Very Large Data Bases, pages 368–377, West Berlin Germany, 1978. IEEE Computer Society.
[10] Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and Frank Neven. Split-correctness in information extraction. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 149–163. ACM, 2019. doi:10.1145/3294052.3319684.
[11] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Cleaning inconsistencies in information extraction via prioritized repairs. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 164–175, Snowbird UT USA, 2014. ACM. doi:10.1145/2594538.2594540.
[12] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12:1–12:51, 2015. doi:10.1145/2699442.
[13] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst., 41(1):6:1–6:44, 2016. doi:10.1145/2877202.
[14] Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Efficient enumeration algorithms for regular document spanners. ACM Trans. Database Syst., 45(1):3:1–3:42, 2020. doi:10.1145/3351451.
[15] Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 137–149. ACM, 2018. doi:10.1145/3196959.3196967.
[16] Dominik D. Freydenberger and Sam M. Thompson. Dynamic complexity of document spanners. In Carsten Lutz and Jean Christoph Jung, editors, volume 155 of LIPIcs, pages 11:1–11:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ICDT.2020.11.
[17] Robert Gaizauskas and Yorick Wilks. Information extraction: Beyond document retrieval. Journal of Documentation, 54(1):70–105, 1998.
[18] Ashish Gupta and Inderpal Singh Mumick, editors. Materialized Views: Techniques, Implementations, and Applications. MIT Press, Cambridge, MA, USA, 1999.
[19] Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. Maintaining views incrementally. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26-28, 1993, pages 157–166. ACM Press, 1993. doi:10.1145/170035.170066.
[20] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to automata theory, languages, and computation. Pearson international edition. Addison-Wesley, 3rd edition, 2007.
[21] Ihab F. Ilyas and Xu Chu. Data Cleaning. Morgan and Claypool, 2019.
[22] Besat Kassaie and Frank Wm. Tompa. Predictable and consistent information extraction. In Proceedings of the ACM Symposium on Document Engineering, pages 14:1–14:10, Berlin Germany, 2019. ACM. doi:10.1145/3342558.3345391.
[23] Akira Kawaguchi, Daniel F. Lieuwen, Inderpal Singh Mumick, and Kenneth A. Ross. Implementing incremental view maintenance in nested data models. In Sophie Cluet and Richard Hull, editors, Database Programming Languages, 6th International Workshop, DBPL-6, Estes Park, Colorado, USA, August 18-20, 1997, Proceedings, volume 1369 of Lecture Notes in Computer Science, pages 202–221. Springer, 1997. doi:10.1007/3-540-64823-2_12.
[24] Kenrick Kin, Björn Hartmann, Tony DeRose, and Maneesh Agrawala. Proton: multitouch gestures as regular expressions. In ACM Conf. on Human Factors in Computing Systems, pages 2885–2894, 2012. doi:10.1145/2207676.2208694.
[25] Kristina Lerman, Steven Minton, and Craig A. Knoblock. Wrapper maintenance: A machine learning approach. J. Artif. Intell. Res., 18:149–181, 2003. doi:10.1613/jair.1145.
[26] Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 125–136, Houston TX USA, 2018. ACM. doi:10.1145/3196959.3196968.
[27] Andrea Morciano. Engineering a runtime system for AQL. Master's thesis, École Polytechnique de Bruxelles, Université Libre de Bruxelles, 2016.
[28] Makoto Murata, Akihiko Tozawa, Michiharu Kudo, and Satoshi Hada. XML access control using static analysis. ACM Trans. Inf. Syst. Secur., 9(3):292–324, 2006. doi:10.1145/1178618.1178621.
[29] Alan Nash, Luc Segoufin, and Victor Vianu. Views and queries: Determinacy and rewriting. ACM Trans. Database Syst., 35(3):21:1–21:41, 2010. doi:10.1145/1806907.1806913.
[30] Liat Peterfreund, Dominik D. Freydenberger, Benny Kimelfeld, and Markus Kröll. Complexity bounds for relational algebra over document spanners. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 320–334. ACM, 2019. doi:10.1145/3294052.3319699.
[31] Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In Pablo Barceló and Marco Calautti, editors, volume 127 of LIPIcs, pages 13:1–13:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICDT.2019.13.
[32] Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. An algebraic approach to rule-based information extraction. In Proceedings of the 24th International Conference on Data Engineering, ICDE, pages 933–942, Cancún, Mexico, 2008. IEEE Computer Society. doi:10.1109/ICDE.2008.4497502.
[33] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP, pages 1524–1534, Edinburgh UK, 2011.
[34] Sunita Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.
[35] Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 1033–1044. ACM, 2007.
[36] Jingren Zhou, Per-Åke Larson, Jonathan Goldstein, and Luping Ding. Dynamic materialized views. In Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis, editors, Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 526–535. IEEE Computer Society, 2007. doi:10.1109/ICDE.2007.367898.