A Technical Report: Entity Extraction using Both Character-based and Token-based Similarity
Zeyi Wen, Dong Deng∗, Rui Zhang, Kotagiri Ramamohanarao
University of Melbourne, Victoria, Australia
{zeyi.wen, rui.zhang, kotagiri}@unimelb.edu.au
∗Massachusetts Institute of Technology, USA
[email protected]
Abstract—Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on matching sub-string candidates in a document against a dictionary of entities. To handle spelling errors and name variations of entities, the matching is usually approximate, and edit or Jaccard distance is used to measure the dissimilarity between sub-string candidates and the entities. For approximate entity extraction from free text, existing work considers solely character-based or solely token-based similarity and hence cannot simultaneously deal with minor variations at the token level and typos. In this paper, we address this problem by considering both character-based similarity and token-based similarity (i.e. two-level similarity). Measuring one-level (e.g. character-based) similarity is computationally expensive, and measuring two-level similarity is dramatically more expensive. By exploiting the properties of the two-level similarity and the weights of tokens, we develop novel techniques to significantly reduce the number of sub-string candidates that require computation of two-level similarity against the dictionary of entities. A comprehensive experimental study on real world datasets shows that our algorithm can efficiently extract entities from documents and produces a high F1 score in the range of [0.91, 0.97].

I. INTRODUCTION
In text mining, a primitive task is entity extraction: the recognition of the names of entities, such as people, locations and organisations, in a free text document. A common approach to entity extraction is to compare sub-strings of a document (hereafter "sub-string candidates" or simply "candidates") against a dictionary of entities, and this approach is widely used in applications such as named entity recognition (NER) [1]. The approach needs to handle the following two issues. (i) Orthographical or typographical errors (typos hereafter) may appear in documents. For example, "Oxford" may be incorrectly written as "Oxfort". (ii) Different names may refer to the same entity. For example, "Oxfort University" and "Univercity of Oxford" refer to the same entity as "The University of Oxford". Addressing the two issues in the context of free text is very challenging, since every word of the free text may be the starting (or ending) position of an entity in the dictionary. As a result, the number of sub-string candidates is very large, and all those sub-string candidates need to be matched against each entity in the dictionary. Previous methods [2], [3], [4] using only character-based or token-based (i.e. one-level) similarity cannot handle both of the issues in the free text context.

To the best of our knowledge, no existing work has addressed entity extraction from free text using two-level similarity; this is the first work to investigate it. In this paper, we propose an algorithm that considers both character-based and token-based similarity (i.e. two-level similarity) to extract entities from free text. Measuring one-level similarity is computationally expensive, and measuring two-level similarity is dramatically more expensive. Without novel techniques to support the two-level similarity based algorithm, extracting entities from a large number of documents against a large dictionary is computationally very expensive.
We observe that a sub-string candidate can be similar to an entity only if they share some tokens, thus we first identify all the matched tokens from the document for each entity. Then, based on the matched tokens, we enumerate all the sub-string candidates that are potentially similar to the entity. By exploiting the properties of the two-level similarity and the weights (measured by IDF [5]) of tokens, we further develop a spanning-based method to dramatically reduce the number of sub-string candidates that require computation of two-level similarity. To summarise, we make the following key contributions.

• This is the first work to address the problem of approximate entity extraction from free text using both character-based and token-based similarity. We formulate the problem and propose novel techniques to solve it.

• For each entity, by naively enumerating the k matched tokens in a document to an entity, the total number of sub-string candidates produced is about k². By avoiding enumerating very short or very long sub-string candidates, we design a technique to reduce the number of sub-string candidates to (u − l) · k, where [l, u] is the range of the number of tokens in the candidates that are neither too short nor too long.

• By exploiting the properties of the two-level similarity and the weights of tokens, we develop a spanning-based candidate producing technique to significantly reduce the number of sub-string candidates to just k. The key novelty of our spanning-based candidate producing technique lies in a novel lower bound and a computation reuse strategy.

• We conduct extensive experiments to validate the efficiency and effectiveness of our algorithm. The experimental results show that our algorithm can efficiently extract entities from documents, and produces a high F1 score in the range of [0.91, 0.97].

The rest of the paper is organised as follows. In Section II, we discuss related work in entity extraction.
Then, we present preliminaries in Section III, and describe our algorithm using a two-level similarity in Section IV. In Section V, we report experimental results of our algorithm. Finally, we conclude the paper in Section VI.

II. RELATED WORK
We categorise the related work on entity extraction into two main groups: studies based on machine learning approaches and studies based on string matching approaches. Our work falls into the latter group.
Machine learning based approaches: Carreras et al. proposed an AdaBoost based approach for named entity extraction [6]. Their key idea is to extract entities using two classifiers: a local classifier for detecting whether a token belongs to a named entity, and a global classifier for detecting whether a sub-string candidate is a named entity. Jain and Pennacchiotti [7] proposed an approach using heuristics (e.g. tokens with the first letter capitalised) to extract entities from query logs; the extracted entities are then grouped into different clusters and assigned labels accordingly. Cohen and Sarawagi [8] designed an algorithm using the Markov model for entity extraction. The algorithm has two main phases. First, a label (e.g. person name) is assigned to each token based on dictionaries/heuristics. Second, the Markov model is trained and used to predict the entity probability for each sub-string candidate based on the token labels. One major limitation of the abovementioned approaches is that they require a significant amount of human effort to collect training datasets and/or to tune heuristics.
String matching based approaches: The approximate entity extraction problem can be viewed as the approximate string matching problem, which is well studied. Navarro gives a nice survey of the approximate string matching problem [9]. Here, we focus on some recent work in entity extraction.

Gattani et al. developed a dictionary-based algorithm for entity extraction [10], but their algorithm aims to extract sub-string candidates that exactly match entities in the dictionary from short documents (e.g. tweets). Kim et al. [11] proposed a memory efficient indexing approach for string matching using character-based similarity. Their proposed index is memory friendly by reusing position information of n-grams through a two-level scheme. Wang et al. proposed an approximate entity extraction algorithm using neighbourhood generation [2]. Deng et al. [3] designed an efficient algorithm for approximate entity extraction based on a trie tree index. Kim and Shim proposed an algorithm that finds from a document the top-k sub-string candidates most similar to an entity [12]. A more recent study [13] presents techniques to find duplicated text segments between two documents using token-level similarity. All these algorithms use one-level, i.e. character-based or token-based, similarity to find similar entities (or text segments) in documents.

Some existing studies [14], [15] designed similarity functions and indexing techniques for the string similarity search problem. Cohen et al. [16] developed an open source software toolkit, which supports different similarity functions, for measuring the similarity between two strings. Chakrabarti et al. [4] proposed a filter using token-based similarity to classify sub-string candidates into two classes: valid sub-string candidates that may match some entities in the dictionary, and invalid sub-string candidates that do not match any entities in the dictionary. The above work differs from ours, since we aim at developing an effective and efficient algorithm to extract entities from free text using both character-based and token-based similarity.

TABLE I. FREQUENTLY USED SYMBOLS
t, e, s — a token, an entity token and a text token, respectively
idf(t), w(t) — the IDF and the weight of t, respectively
E, S, E_i, S_j — an entity, a sub-string candidate, the i-th token of E, and the j-th token of S, respectively
eds(e, s) — the edit similarity of e and s
τ, δ — the token and entity edit similarity thresholds, respectively

III. PRELIMINARIES
For ease of presentation, a token (e.g. a word) of a sub-string candidate is called a text token. Similarly, we call a token of an entity in the dictionary an entity token. Frequently used symbols in the rest of the paper are summarised in Table I. In this section, we first give an approach to computing the weights (i.e. importance) of entity tokens and text tokens. Second, we provide background knowledge on edit and Jaccard similarity. Then, we present an algorithm for finding, in a document, text tokens which match any token in the dictionary. Lastly, we formally define the approximate entity extraction problem.
A. Assigning weights to tokens
In many applications, the tokens of an entity (or a sub-string candidate) have different importance, called weights hereafter, in the entity (or the sub-string candidate). Following common practice, we use IDF [5] to measure the weights of tokens. In the approximate entity extraction problem, the dictionary is known a priori and documents are unknown beforehand. Hence, we compute the IDF value of a token based on the dictionary. Specifically, given the dictionary with N entities and a token t, we count the number (denoted by N_t) of entities that contain t to serve as the "document frequency" of the token. Then, the IDF value of t is computed by the following equation.

idf(t) = log( N / (N_t + 1) )

The total IDF value of a set of tokens A is the sum of the IDF values of all the tokens in A, and can be computed as follows.

T_idf(A) = Σ_{t ∈ A} idf(t)    (1)

After computing idf(t) and T_idf(A), we can compute the weight of the token t by the equation below.

w(t) = idf(t) / T_idf(A)    (2)

Note that A can be either the entity E or the sub-string candidate S. We define the total weight of a subset A′ of the tokens in A (i.e. A′ ⊆ A) as follows.

T_w(A′) = Σ_{t ∈ A′} w(t)    (3)

B. Edit and Jaccard similarity

1) Edit similarity: Edit-distance quantifies the dissimilarity of two tokens by counting the minimum number of edit operations (i.e. deletions, insertions and substitutions) needed to transform one token into the other. Without loss of generality, we assume that all the edit operations have the same cost.

Based on edit-distance, edit similarity quantifies the similarity of two tokens. Formally, given two tokens e and s, the edit similarity eds(e, s) is defined as follows.

eds(e, s) = 1 − ed(e, s) / max{|e|, |s|}    (4)

where ed(e, s) is the edit-distance between the two tokens; |e| and |s| are the number of characters in e and s, respectively.
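The weight computation of Equations (1)–(3) and the edit similarity of Equation (4) can be sketched as follows. This is a minimal Python illustration under our own naming (the list-of-tuples dictionary and all function names are ours, not the authors' implementation), assuming unit-cost edit operations as stated above.

```python
import math

def idf(token, dictionary):
    """idf(t) = log(N / (N_t + 1)), where N is the number of entities in
    the dictionary and N_t the number of entities containing the token."""
    n_t = sum(1 for entity in dictionary if token in entity)
    return math.log(len(dictionary) / (n_t + 1))

def token_weights(token_set, dictionary):
    """Normalised weights w(t) = idf(t) / T_idf(A) for a token set A
    (Equations (1) and (2)); the weights of A sum to one."""
    idfs = {t: idf(t, dictionary) for t in token_set}
    total = sum(idfs.values())          # T_idf(A), Equation (1)
    return {t: v / total for t, v in idfs.items()}

def edit_distance(a, b):
    """Minimum number of unit-cost deletions, insertions and
    substitutions transforming a into b (classic Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def eds(e, s):
    """Edit similarity, Equation (4)."""
    return 1.0 - edit_distance(e, s) / max(len(e), len(s))
```

For example, eds("Oxford", "Oxfort") = 1 − 1/6 ≈ 0.83, matching the typo example in the introduction.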
2) Jaccard similarity:
Jaccard similarity is mainly used as a token-based similarity. In entity extraction, Jaccard similarity measures the similarity between an entity E and a sub-string candidate S, and is defined as follows.

JAC(E, S) = |E ∩ S| / |E ∪ S| = |E ∩ S| / ( |E| + |S| − |E ∩ S| )    (5)

where |E ∩ S| is the number of matched tokens between E and S; |E ∪ S| is the number of tokens in the union of E and S; |E| and |S| are the number of tokens in E and S, respectively.

Note that the above-mentioned edit similarity is a character-based similarity, while Jaccard similarity is a token-based similarity. We postpone our definition of two-level similarity using edit or Jaccard similarity until Section IV.

C. Matching text tokens against entities
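Equation (5) can be sketched as follows; a small illustration of ours, which treats the token collections as sets (a simplification when duplicate tokens occur).

```python
def jaccard(entity_tokens, candidate_tokens):
    """Exact-match Jaccard similarity, Equation (5):
    |E ∩ S| / (|E| + |S| − |E ∩ S|), with tokens treated as sets."""
    e, s = set(entity_tokens), set(candidate_tokens)
    inter = len(e & s)
    return inter / (len(e) + len(s) - inter)
```

For example, "The University of Oxford" and "The University of Melbourne" share three of five distinct tokens, giving a Jaccard similarity of 0.6.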
Since we are interested in extracting entities from documents (i.e. free text), the first step is to find in the documents all the tokens that match tokens in each entity of the dictionary. We use Li et al.'s algorithm [17] for finding all the matched tokens in a document to an entity. The details of how Li et al.'s algorithm works are unimportant for understanding our proposed algorithms; here, we briefly explain the results produced by the algorithm. Figure 1 gives example results of the matched tokens in a document. In the example, the dictionary contains N entities which are denoted by E_1, E_2, ..., E_N. The rows represent the results of matching the same document against the N entities. A rectangle containing "X" indicates that the position does not match any token of the entity; a rectangle containing E_ij indicates that the token at this position matches the j-th token of the i-th entity.

Example: Given a document D = "... The Univercity of Oxfort is near the Oxford city ..." and an entity E = "The University of Oxford", the tokens of E are E_1 = "The", E_2 = "University", E_3 = "of", and E_4 = "Oxford". After identifying all the matched tokens of E in D, we can represent D as "... E_1 E_2 E_3 E_4 X X E_1 E_4 X ..." (similar to Figure 1).
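A naive stand-in for this matching step can be sketched as follows. This is not Li et al.'s algorithm [17], only an O(|D| · |E|) illustration of ours that reproduces the matched-position output described above; tokens are assumed to be lower-cased beforehand.

```python
def eds(e, s):
    """Edit similarity of Equation (4), via a compact Levenshtein."""
    prev = list(range(len(s) + 1))
    for i, ce in enumerate(e, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ce != cs)))
        prev = cur
    return 1.0 - prev[-1] / max(len(e), len(s))

def match_positions(doc_tokens, entity_tokens, tau):
    """For each document position, return the 1-based indices of the
    entity tokens it matches with eds >= tau, or None for an 'X'
    (unmatched) position."""
    result = []
    for s in doc_tokens:
        hits = [j for j, e in enumerate(entity_tokens, 1) if eds(e, s) >= tau]
        result.append(hits or None)
    return result
```

On the running example with τ = 0.8, this reproduces the pattern E_1 E_2 E_3 E_4 X X E_1 E_4 X.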
D. Problem definition
We define our approximate entity extraction problem.
Definition (Approximate Entity Extraction). Given an entity dictionary E, a document D and an entity similarity threshold δ, the approximate entity extraction problem is to find all pairs of entities and sub-string candidates with a similarity score not smaller than δ, i.e. Sim(E, S) ≥ δ, where E is an entity in the dictionary E and S is a sub-string candidate in D.

For ease of presentation, we refer to "the position" as shorthand for "the token at the position of the document".

Fig. 1. Matched positions of entities in a document: each row shows, for one entity E_i, all the matched text tokens (E_ij for a match to the j-th token of E_i, X for no match).

The similarity function
Sim(E, S) (e.g. Fuzzy Jaccard) takes both the character-based similarity and the token-based similarity into account to measure the overall similarity of E and S. The weight of each token in E and S is measured by a weight function w(·). An entity token e and a text token s are called "similar" or "matched" if their character-based edit similarity is not smaller than τ, i.e. eds(e, s) ≥ τ.

IV. OUR PROPOSED ALGORITHM
In this section, we elaborate our algorithm for entity extraction. Figure 2 gives an overview of our algorithm, which has four components. As matching text tokens in a document against a dictionary is well studied, the matching algorithm discussed in Section III-C serves as the Matching Text Tokens component, and proposing techniques to improve this component is out of the scope of this paper. We focus on designing techniques for the other three components, which are inside the dashed-line polygon in Figure 2. Our algorithm repeats the following three key steps until all the entities in the dictionary are checked. Step (i): Based on the matched tokens output by Matching Text Tokens and a given entity in the dictionary, the Producing Candidates component produces all the sub-string candidates that may match the entity. Step (ii): The Filtering Candidates component filters out sub-string candidates that cannot match the entity. Step (iii): The Measuring Similarity component computes the similarity between each remaining sub-string candidate and the entity, and outputs sub-string candidates with a high similarity score as extracted entities; then our algorithm goes back to Step (i) if any entity in the dictionary remains to be checked.

Our algorithm can work with various similarity functions, such as Jaccard similarity, Dice similarity, cosine similarity and edit similarity. In this paper, we focus on designing techniques for our algorithm using two-level edit similarity (hereafter FuzzyED) and two-level Jaccard similarity (hereafter Fuzzy Jaccard), as edit distance and Jaccard distance are the two most commonly used distances in string matching. Next, we first define the FuzzyED and Fuzzy Jaccard similarity. Then, we present two algorithms to find sub-string candidates. Finally, we provide filtering techniques for our algorithm for the corresponding two-level similarity.
A. Two fuzzy similarity functions

1) The FuzzyED similarity:
Here, we define the cost of edit operations for token-based similarity and the FuzzyED similarity.
Fig. 2. Our algorithm for extracting entities
Cost of edit operations on tokens of a candidate:
FuzzyED requires performing two levels of edit-distance: the character-based edit-distance (for measuring the similarity between two tokens) and the token-based edit-distance (for measuring the similarity between an entity and a sub-string candidate). As we have discussed the character-based edit-distance in Section III-B, here we provide details of the token-based edit-distance.

The total cost of FuzzyED is the cost of transforming a sub-string candidate to an entity. Without loss of generality, we assume that only the sub-string candidate can be edited and the entity is not permitted to be edited. We formulate the total cost of transforming a sub-string candidate S to an entity E using the following equation.

F_ED(E, S) = C_D(S) + C_I(S) + C_S(E, S)    (6)

where C_D(S) is the total deletion cost of removing text tokens from S; C_I(S) is the total insertion cost of inserting entity tokens of E into S; C_S(E, S) is the total substitution cost of E and S. We let S′ be the subset of tokens in S that match E, and E′ denote the tokens that are matched by S′.

Deletion: The total deletion cost is computed by the following equation.

C_D(S) = T_w(S \ S′)

where S \ S′ is the subset of the tokens in S (i.e. S \ S′ ⊆ S) that needs to be deleted from S.

Insertion: The total insertion cost is computed by the following equation.

C_I(S) = T_w(E \ E′)

where E \ E′ is the subset of tokens in E (i.e. E \ E′ ⊆ E) that needs to be inserted into S.

Substitution: The total substitution cost is computed using the following equation.

C_S(E, S) = Σ_{e ∈ E′, s ∈ S′} (1 − eds(e, s)) × (w(e) + w(s))    (7)

where s is the text token that matches the entity token e. Next, we give the FuzzyED similarity based on the cost defined in Equation (6).

Computing the FuzzyED similarity:
Given a sub-string candidate and an entity, we can compute the total edit cost of transforming the sub-string candidate to the entity by Equation (6). We adapt the dynamic programming based algorithm [9] to compute the cost of the longest sub-string of S that is the most similar to the entity E. The time complexity of the dynamic programming based algorithm is O(mn), where m and n are the number of tokens of E and S, respectively. The key idea of the dynamic programming based algorithm is similar to the algorithm for computing the character-based edit-distance between two tokens. We do not provide the details of the algorithm here and suggest that interested readers consult the original paper [9].

After computing the total edit cost in Equation (6), we can compute the FuzzyED similarity, denoted by F_EDS(E, S), using the following equation.

F_EDS(E, S) = 0 if F_ED(E, S) > 1, and 1 − F_ED(E, S) otherwise.    (8)

Note that the substitution cost may be larger than 1 when τ < 1 (cf. Equation (7)), which results in F_ED(E, S) > 1. A sub-string candidate S and an entity E are called "matched" or "similar" if F_EDS(E, S) ≥ δ.
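The token-level dynamic program can be sketched as follows. This is a simplified illustration of Equations (6)–(8) under our own naming: it computes the cost for a fixed candidate, not the longest-sub-string variant the paper adapts from [9], and w_e / w_s are assumed token-to-weight mappings.

```python
def eds(e, s):
    """Character-level edit similarity, Equation (4)."""
    prev = list(range(len(s) + 1))
    for i, ce in enumerate(e, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ce != cs)))
        prev = cur
    return 1.0 - prev[-1] / max(len(e), len(s))

def fuzzy_ed_cost(entity, candidate, w_e, w_s):
    """Weighted token-level edit cost F_ED(E, S), Equation (6), by
    dynamic programming in O(mn) over token lists."""
    m, n = len(entity), len(candidate)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):      # insert entity tokens into S
        dp[i][0] = dp[i - 1][0] + w_e[entity[i - 1]]
    for j in range(1, n + 1):      # delete text tokens from S
        dp[0][j] = dp[0][j - 1] + w_s[candidate[j - 1]]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            e, s = entity[i - 1], candidate[j - 1]
            sub = (1.0 - eds(e, s)) * (w_e[e] + w_s[s])   # Equation (7)
            dp[i][j] = min(dp[i - 1][j] + w_e[e],
                           dp[i][j - 1] + w_s[s],
                           dp[i - 1][j - 1] + sub)
    return dp[m][n]

def fuzzy_ed_similarity(entity, candidate, w_e, w_s):
    """F_EDS(E, S), Equation (8)."""
    cost = fuzzy_ed_cost(entity, candidate, w_e, w_s)
    return 0.0 if cost > 1.0 else 1.0 - cost
```

With uniform weights, a single typo such as "univercity" only incurs the small substitution cost of Equation (7) instead of a full deletion plus insertion.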
2) The Fuzzy Jaccard similarity:
To tolerate typos inside tokens, character-based edit-distance is applied before the Jaccard similarity is applied to measure the similarity between an entity and a sub-string candidate. This variant of Jaccard similarity is called Fuzzy Jaccard, which was first studied by Wang et al. [18] in the context of the string similarity join problem [19]. Computing the Fuzzy Jaccard similarity is much more complicated, since one text token may match multiple tokens of an entity and vice versa. One text token matching multiple entity tokens occurs frequently, especially when the token edit similarity threshold τ is small. Figure 3 shows a scenario where tokens have multiple matches. In the figure, a token is represented by a vertex and a match is represented by an edge. The number next to an edge represents the edit similarity between the two tokens at its ends; for example, an edge labelled 0.85 means the two tokens it connects have edit similarity 0.85. As we can see from the figure, four of the tokens match multiple tokens when the edit similarity threshold τ is 0.8. In entity extraction applications, an entity token can match at most one token of a sub-string candidate and vice versa, so the extra matches should be removed and at most one match is kept for each entity or text token. We call those extra matches redundant matches.

The maximum weight matching algorithm [20] can be applied to remove the redundant matches before computing the Fuzzy Jaccard similarity. (Dice similarity and cosine similarity can also use this approach to removing redundant matches.) Specifically, the maximum weight matching algorithm finds a graph, denoted by G, that has the following two properties: (i) any two edges in G have no common vertex; (ii) the sum of the edit similarities of the edges in G is maximum.

After removing the redundant matches, the Fuzzy Jaccard similarity of E and S can be computed using the following equation.
F_J = ( Σ_{e ∈ E′, s ∈ S′} eds(e, s) ) / ( |E| + |S| − Σ_{e ∈ E′, s ∈ S′} eds(e, s) )

where E′ ⊆ E and S′ ⊆ S; E′ and S′ are the subsets of tokens that still have matches after removing the redundant matches. Note that when the edit similarity threshold τ is one, the above equation is equivalent to Equation (5).

By considering the weights of tokens (cf. Section III-A), we can write the Fuzzy Jaccard similarity as follows.

F_J = ( Σ_{e ∈ E′, s ∈ S′} eds(e, s) · (w(e) + w(s)) ) / ( 1 + 1 − Σ_{e ∈ E′, s ∈ S′} eds(e, s) · (w(e) + w(s)) )    (9)

Fig. 3. Matches of an entity and a sub-string
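The redundant-match removal and the unweighted Fuzzy Jaccard (the equation above Equation (9)) can be sketched as follows. This is an illustration of ours: a brute-force search over permutations stands in for the proper maximum weight matching algorithm [20], which is fine for the handful of tokens in an entity but not for large inputs.

```python
from itertools import permutations

def eds(e, s):
    """Character-level edit similarity, Equation (4)."""
    prev = list(range(len(s) + 1))
    for i, ce in enumerate(e, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ce != cs)))
        prev = cur
    return 1.0 - prev[-1] / max(len(e), len(s))

def max_weight_matching(sim):
    """One-to-one matching of rows to columns of `sim` maximising the
    total similarity; zero-similarity pairs are never matched."""
    m, n = len(sim), len(sim[0])
    if m > n:   # transpose so rows are the smaller side
        pairs = max_weight_matching([list(col) for col in zip(*sim)])
        return [(i, j) for j, i in pairs]
    best_total, best = -1.0, []
    for perm in permutations(range(n), m):
        chosen = [(i, j) for i, j in enumerate(perm) if sim[i][j] > 0]
        total = sum(sim[i][j] for i, j in chosen)
        if total > best_total:
            best_total, best = total, chosen
    return best

def fuzzy_jaccard(entity, candidate, tau):
    """Unweighted Fuzzy Jaccard: only token pairs with eds >= tau count
    as matches, and redundant matches are removed first."""
    sim = [[eds(e, s) for s in candidate] for e in entity]
    sim = [[v if v >= tau else 0.0 for v in row] for row in sim]
    total = sum(sim[i][j] for i, j in max_weight_matching(sim))
    return total / (len(entity) + len(candidate) - total)
```

A production implementation would replace the permutation search with the Hungarian algorithm or a bipartite max-weight matching routine.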
3) Comparison of the two similarity functions:
Computing the Fuzzy Jaccard similarity is expensive. This is because, before computing the similarity, we need to perform the expensive maximum weight matching algorithm, which has a time complexity of O(m²n) [21], where m and n are the number of tokens of the entity E and of the sub-string candidate S, respectively. In comparison, FuzzyED only has a time complexity of O(mn).

In the following two subsections, we explain the sub-string candidate producing techniques. These techniques can be used in the Producing Candidates component of our algorithm (cf. Figure 2).

B. Producing candidates by enumeration
For each entity, we can obtain the sub-string candidates by enumerating the results produced by Li et al.'s algorithm (cf. Figure 1), as discussed in Section III-C: sub-string candidates with one token matching the entity, with two tokens matching the entity, with three tokens matching the entity, and so on. The number of sub-string candidates produced by this enumeration is of O(k²) complexity, and is k(1 + k)/2 to be more precise, where k is the number of text tokens (in the document) that match the entity. Among the k(1 + k)/2 sub-strings, many tend to be unpromising sub-string candidates. For example, a sub-string candidate with only one matched token is unlikely to match an entity of ten tokens under a high entity similarity threshold δ. To generate fewer unpromising sub-string candidates, we give an approach that only needs to consider sub-string candidates with the number of matched entity tokens in the range [l, u]. We refer to the number of matched tokens of a sub-string candidate in the range [l, u] as the valid matching length. The intuition of the valid matching length is that sub-string candidates with too few or too many matched tokens cannot match the entity under the entity edit similarity threshold δ.

In what follows, we first present two propositions for the minimum and maximum valid matching length. Then, we give details of computing the minimum and maximum valid matching length for FuzzyED and Fuzzy Jaccard. Finally, we provide analysis of this enumeration-based candidate producing technique.
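The restriction to the valid matching length can be sketched as follows; an illustrative helper of ours, not the authors' code, which enumerates document spans that start and end at matched positions and contain between l and u matched tokens.

```python
def enumerate_candidates(matched_positions, l, u):
    """Spans (start, end) of document positions, both ends at matched
    positions, containing between l and u matched tokens.  Naive
    enumeration yields k(1 + k)/2 spans; this yields at most
    (u - l + 1) * k, where k = len(matched_positions)."""
    k = len(matched_positions)
    spans = []
    for i in range(k):
        for count in range(l, u + 1):   # matched tokens inside the span
            j = i + count - 1
            if j >= k:
                break
            spans.append((matched_positions[i], matched_positions[j]))
    return spans
```

Setting l = 1 and u = k recovers the full k(1 + k)/2 enumeration, so the savings come entirely from how tight the [l, u] range is.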
1) Two propositions of the valid matching length:
For ease of presentation, we classify the text tokens of a sub-string candidate S into the following three subsets. (1) Unmatched text tokens, denoted by Ŝ: text tokens that do not match any entity token. (2) Redundant matched text tokens, denoted by S′′: text tokens that match entity tokens but are finally removed (by the maximum weight matching algorithm in Fuzzy Jaccard, or by deletion in FuzzyED). (3) Valid matched text tokens, denoted by S′: text tokens that match entity tokens and are not redundant. Please note that only the redundant matched text tokens and valid matched text tokens appear in the results produced by Li et al.'s algorithm.

The minimum valid matching length l: Suppose a sub-string candidate S has only l tokens that match the entity E, and the similarity of S and E is not smaller than δ. If l is the minimum valid matching length, the following proposition must be true:

Proposition 1.
All the l text tokens are (i) exactly matched to some entity tokens and (ii) valid matched text tokens.

The proof is straightforward and hence omitted. According to the proposition, we have S = S′ and T_w(S) = T_w(S′) = 1 (cf. Equation (3)). Please recall that S′ denotes all the valid matched tokens in S. Given that S has the minimum number l of matched tokens, the above proposition guarantees the maximum possible similarity between S and E.

The maximum valid matching length u: Suppose a sub-string candidate S has u tokens matching the entity, and the similarity between S and E is not smaller than δ. If u is the maximum valid matching length, the following proposition must be true:

Proposition 2.
All the tokens of the entity are exactly matched.
The proof is straightforward and hence omitted. From the above proposition, we have T_w(E) = T_w(E′) = 1 (cf. Equation (3)), where E′ denotes all the matched tokens in E. The above proposition guarantees that (i) the number of valid matched tokens is maximised (note that the maximum number of valid matched tokens equals the number of tokens of the entity) and (ii) the similarity between S and E is maximised given u matched tokens.
2) Computing l and u for FuzzyED:

The minimum valid matching length l: According to Proposition 1, the substitution and deletion costs are zero, and only the insertion cost is involved in transforming S to E. Therefore, the total cost F_ED(E, S) equals the insertion cost T_w(E \ E′), where E \ E′ is the subset of the tokens in E that needs to be inserted into S. According to Equation (8), the similarity score is 1 − T_w(E \ E′). When E and S are matched, their similarity score is not smaller than δ. So, we have

1 − T_w(E \ E′) ≥ δ.

As we know that T_w(E) = 1 (cf. Equations (2) and (3)), the left-hand side of the above constraint equals T_w(E′) (i.e. the total weight of the matched entity tokens). So, we have T_w(E′) ≥ δ. To compute l, we add the entity token with the largest weight, the second largest weight, the third largest weight and so on to S′ until the total weight of the tokens in S′ is not smaller than δ. Then l is computed by l = |S′|.

The maximum valid matching length u: According to Proposition 2, no insertion cost and no substitution cost are involved; the only cost is deletion of the redundant matched text tokens. Since the sub-string candidate should match the entity, the total weight of the valid matched text tokens, i.e. T_w(S′), should satisfy the constraint T_w(S′) ≥ δ. From Equations (2) and (3), we have

T_w(S′) = T_idf(S′) / T_idf(S).    (10)

Recall that S is a sub-string candidate and S′ is the set of valid matched tokens in S (i.e. S′ ⊆ S). Besides the valid matched tokens in S′, the sub-string candidate S also contains the redundant matched text tokens S′′ and the unmatched text tokens Ŝ. We can rewrite Equation (10) in the following form.

T_w(S′) = T_idf(S′) / ( T_idf(S′) + T_idf(S′′) + T_idf(Ŝ) )

Since we compute the maximum valid matching length u of the matched tokens in a sub-string candidate, we only know all the matched text tokens (cf. Section III-C) to the entity.
So we write the above equation in the following form.

T_idf(S′) / ( T_idf(S′) + T_idf(S′′) ) ≥ T_w(S′)

As T_w(S′) ≥ δ, we have

T_idf(S′) / ( T_idf(S′) + T_idf(S′′) ) ≥ δ.    (11)

We know that u equals the number of tokens in S′ plus the number of tokens in S′′. From Proposition 2, T_idf(S′) equals T_idf(E) and is a constant. The number of tokens in S′′ is maximised when each redundant token has the smallest IDF value. Therefore, u reaches its maximum when all the tokens of S′′ match E's token with the smallest IDF value. To compute u, we keep adding the same entity token (the one with the smallest IDF value among the tokens of E) to S′′ while Inequality (11) still holds. Then u is computed by u = |S′| + |S′′|.
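The two computations can be sketched as follows; illustrative Python under our naming, where `entity_idfs` is the list of IDF values of the entity's tokens, assumed positive.

```python
def min_valid_length(entity_idfs, delta):
    """Minimum valid matching length l for FuzzyED: greedily take entity
    tokens in decreasing weight order until T_w(E') >= delta."""
    total = sum(entity_idfs)                     # T_idf(E)
    acc, l = 0.0, 0
    for v in sorted(entity_idfs, reverse=True):
        acc += v / total                         # w(t) = idf(t) / T_idf(E)
        l += 1
        if acc >= delta:
            break
    return l

def max_valid_length(entity_idfs, delta):
    """Maximum valid matching length u for FuzzyED: pad S'' with copies
    of the smallest-IDF entity token while Inequality (11) still holds."""
    t_e = sum(entity_idfs)       # T_idf(S') = T_idf(E) by Proposition 2
    smallest = min(entity_idfs)
    redundant = 0                # |S''|
    while t_e / (t_e + (redundant + 1) * smallest) >= delta:
        redundant += 1
    return len(entity_idfs) + redundant
```

For an entity with IDF values [1, 2, 3, 4] and δ = 0.8, the three heaviest tokens already carry weight 0.9 ≥ 0.8, so l = 3, while two copies of the lightest token can be absorbed before Inequality (11) fails, so u = 6.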
3) Computing l and u for Fuzzy Jaccard: For a sub-string candidate to match an entity, the similarity of the sub-string candidate and the entity must satisfy the condition F_J ≥ δ, where F_J is computed using Equation (9). Combining Equation (9) and F_J ≥ δ, we have

Σ_{e ∈ E′, s ∈ S′} eds(e, s) · (w(e) + w(s)) ≥ 2δ / (1 + δ).

As eds(e, s) ≤ 1, we let eds(e, s) = 1. Then we have

Σ_{e ∈ E′, s ∈ S′} (w(e) + w(s)) = Σ_{e ∈ E′} w(e) + Σ_{s ∈ S′} w(s) ≥ 2δ / (1 + δ).

Using Equation (3), we can rewrite the above inequality as

T_w(E′) + T_w(S′) ≥ 2δ / (1 + δ).    (12)

The minimum valid matching length l: According to Proposition 1, we have T_w(S′) = 1. Putting T_w(S′) = 1 into Inequality (12), we have

T_w(E′) ≥ 2δ / (1 + δ) − 1 = (δ − 1) / (1 + δ).

Computing l here is identical to the process of computing l in FuzzyED, except that the threshold here is (δ − 1)/(1 + δ) instead of δ.

The maximum valid matching length u: According to Proposition 2, we have T_w(E′) = 1. Putting T_w(E′) = 1 into Inequality (12), we have

T_w(S′) ≥ (δ − 1) / (1 + δ).    (13)

Following the same process of deriving Inequality (11) from Equation (10), we can rewrite Inequality (13) as follows.

T_idf(S′) / ( T_idf(S′) + T_idf(S′′) ) ≥ (δ − 1) / (1 + δ).

Then, computing u here is identical to the process of computing u in FuzzyED, except that the threshold here is (δ − 1)/(1 + δ) instead of δ.
4) Analysis of producing candidates by enumeration:
In the enumeration-based candidate producing technique, the number of sub-string candidates generated using the valid matching length is of O(k) complexity, and is (u − l) × k to be more precise, where k is the number of matched tokens in the document. Even though we have reduced the number of sub-string candidates from k(1 + k)/2 to (u − l) × k, many unpromising sub-string candidates are still generated and require measuring the two-level similarity (e.g. FuzzyED). Next, we propose a novel spanning-based candidate producing technique that reduces the number of sub-string candidates which require measuring the two-level similarity to k.

C. Producing candidates by spanning
We notice that the large number of unpromising sub-string candidates generated by the enumeration-based candidate producing technique is caused by matched tokens that are not important (i.e. tokens with small IDF values, such as stop words [22]). Such tokens are likely to appear many times in a document and hence generate many unpromising sub-string candidates. Here, we propose a spanning-based candidate producing technique that makes use of important tokens, which we call core tokens. The technique starts from a core token and uses left and right spanning to find sub-string candidates for measuring the two-level similarity. To determine the left and right boundaries of a sub-string candidate, we design a lower bound dissimilarity derived from the two-level similarity.

In what follows, we first present the technique to find core tokens. Then, we provide the key steps of our spanning-based candidate producing technique. After that, we give details of applying the spanning-based candidate producing technique to FuzzyED and Fuzzy Jaccard. Lastly, we design techniques for reusing computation in spanning-based candidate producing, and analyse the candidate producing technique.
1) Finding core tokens of an entity:
As we have discussed in Section III-A, each token is associated with a weight. The weights of tokens can help reduce the number of unpromising sub-string candidates. Our key idea is to find a subset of entity tokens (i.e. core tokens) to represent the entity. For instance, we may use the core tokens {University, Oxford} to represent the entity with tokens {The, University, of, Oxford}. The remaining tokens with smaller weights, such as {The, of} in the example, are called optional tokens in this paper.

Formally, given an entity similarity threshold δ and an entity with m tokens E = {E_1, E_2, ..., E_m}, we construct a set C of q tokens to represent the entity E, where C ⊆ E. The remaining (m − q) tokens in E \ C form a set O of optional tokens. The defining property of core tokens is that at least one core token must appear in a sub-string candidate for the candidate to possibly match the entity. Next, we describe the approaches to finding core tokens in the settings of the FuzzyED and Fuzzy Jaccard similarities, and then provide more details of the properties of core tokens.

Core tokens for FuzzyED:
The core token set C should satisfy the following constraint:
$$T_w(C) > 1 - \delta \quad (14)$$
The above constraint guarantees that the total weight of the tokens in the optional token set O (where O = E \ C) is smaller than δ, because T_w(O) = 1 − T_w(C) < δ.

Core tokens for Fuzzy Jaccard:
Similar to FuzzyED, the core token set C in Fuzzy Jaccard should satisfy the following constraint:
$$T_w(C) > \frac{2(1 - \delta)}{1 + \delta} \quad (15)$$
Due to the space limitation, we omit the details of deriving the above inequality. The above constraint guarantees that the similarity of S and E is smaller than δ when no core token is matched.

Properties of core tokens:
The following lemma shows that at least one core token must appear in a sub-string candidate for the sub-string candidate to match the entity.
Lemma 1.
Given a sub-string candidate S that matches an entity E (i.e. the similarity between S and E is not smaller than δ), the sub-string candidate S must have at least one text token matching a core token of the entity E.

The proof of the lemma can be found in Appendix A. According to the above lemma, the sub-string candidates that do not contain any core token can be discarded without sacrificing recall. Hence, core tokens are good starting points for finding the sub-string candidates.

Note that the number of core tokens of an entity should be as small as possible, because a core token may match many text tokens, and those matched text tokens may generate many unpromising sub-string candidates that require measuring the two-level similarity. To minimise the number of core tokens representing an entity (i.e. to minimise the cardinality q of C), we select the q tokens with the largest weights from E such that C just satisfies the constraint for core tokens, e.g. Constraint (14) for FuzzyED.

In what follows, we explain the key steps of producing a candidate starting from a core token by left and right spanning.
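The greedy core-token selection just described can be sketched as follows (the helper name and the toy weights are our own illustration, not the paper's implementation; Constraint (14) for FuzzyED is used, with weights left un-normalised):

```python
def find_core_tokens(token_weights: dict, delta: float) -> set:
    """Pick the minimum number of largest-weight tokens whose total
    weight just exceeds (1 - delta) of the entity's total weight."""
    total = sum(token_weights.values())
    need = (1 - delta) * total          # Constraint (14), un-normalised
    core, acc = set(), 0.0
    # Greedily take tokens in decreasing weight order.
    for tok, w in sorted(token_weights.items(), key=lambda kv: -kv[1]):
        if acc > need:
            break
        core.add(tok)
        acc += w
    return core

# Toy IDF-style weights: "The" and "of" are light, so they stay optional.
entity = {"The": 0.3, "University": 4.1, "of": 0.4, "Oxford": 5.2}
print(find_core_tokens(entity, delta=0.9))  # {'Oxford'}
```

With these toy weights and δ = 0.9, a single heavy token already satisfies the constraint; a smaller δ forces more tokens into the core set.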
2) The spanning process of producing a candidate:
Since the core tokens represent the entity, we only use the core tokens as query tokens to find their matching positions in the document using Li et al.'s algorithm. The matched results of the entity in the document are similar to the results shown in Figure 1. In many cases, the left and right boundaries of a sub-string candidate are not core tokens. Hence, we need to check the left (right) side of the leftmost (rightmost) core token in the sub-string candidate and see whether any optional tokens near the core token can be included in the sub-string candidate. We call the process of finding the left (right) boundary of the sub-string candidate left spanning (right spanning). To determine when the spanning should be terminated, we compute a lower bound of the dissimilarity between the sub-string candidate and the entity. When the left spanning or right spanning makes the lower bound dissimilarity exceed the threshold (1 − δ), the spanning is terminated.

Figure 4 shows an overview of the process of finding the boundaries of a sub-string candidate. Initially, the sub-string which we call the current sub-string has only one token (i.e. the core token). Then, the left spanning leads to an optional token being included in the current sub-string. The left spanning is terminated because the lower bound dissimilarity would exceed (1 − δ) if more tokens on the left side were included. By right spanning, the current sub-string covers one more core token and one optional token. The current sub-string cannot be further extended because of the high lower bound dissimilarity, and hence we obtain the sub-string candidate that requires measuring the two-level similarity.

[Fig. 4. Spanning from the core token: the current sub-string grows from a single core token, first by left spanning (including a nearby optional token) and then by right spanning (including another core token and an optional token).]

In what follows, we first present the intuition of computing the lower bound dissimilarity.
Then, we describe the key ideas of the left spanning and the right spanning. We postpone the details of producing the sub-string candidates specifically for FuzzyED and Fuzzy Jaccard until Section IV-C3 and Section IV-C4.

The lower bound dissimilarity: As demonstrated in Figure 4, we start from a sub-string with a core token, and then extend the sub-string by left and right spanning. Spanning the current sub-string to include a nearby token changes the similarity score. To determine when the left/right spanning process should be terminated, we compute the lower bound dissimilarity for the current sub-string with the nearby token included. We denote the lower bound dissimilarity by B⊥; its computation depends on the specific similarity function (e.g. FuzzyED).

Left spanning: Here, we provide the details of extending the current sub-string via left spanning. Another interpretation of the left spanning is finding the left boundary of the sub-string candidate. To begin with, we start from the first matched text token (e.g. the first core token in Figure 4) in the document. Then, we span to the left side of the current sub-string by one text token, denoted by t. Next, we compute the lower bound dissimilarity B⊥. If B⊥ is smaller than (1 − δ), we span the current sub-string to cover the text token t; otherwise, the left spanning is terminated. When the left spanning is terminated, the leftmost matched text token is identified as the left boundary of the sub-string candidate.

Right spanning: After the left spanning, we span the current sub-string to include the tokens on its right side. The right spanning is symmetric to the left spanning and hence is not discussed further.

Next, we describe the details of computing the lower bound dissimilarity and the left spanning process for FuzzyED and Fuzzy Jaccard.
3) Producing a candidate for FuzzyED:
The lower bound dissimilarity: In the setting of the FuzzyED similarity, the lower bound dissimilarity B⊥ comes from the total deletion cost and the total substitution cost incurred while producing the sub-string candidate. Please note that the lower bound insertion cost is always zero, because all the entity tokens potentially have exact matches through left and right spanning. To compute the lower bound dissimilarity B⊥ more efficiently, we maintain the total IDF value V_T of all the tokens in the current sub-string, and the total IDF value V_R of the text tokens that need to be deleted from the current sub-string. V_T is initialised to the IDF value of the core token and V_R is initialised to 0.

The substitution cost between two similar tokens E_i and S_j is (1 − eds(E_i, S_j)) × (w(E_i) + w(S_j)) according to Equation (7). We cannot simply include this substitution cost in the lower bound dissimilarity, as there may exist another not yet included token S'_r that is more similar to E_i than S_j, i.e. eds(E_i, S'_r) > eds(E_i, S_j). If such an S'_r exists, we need to delete S_j with cost w(S_j) later in the spanning. Note that the substitution cost (1 − eds(E_i, S_j)) × (w(E_i) + w(S_j)) may be larger than the deletion cost w(S_j). Hence, the lowest cost of including S_j in the current sub-string is set to (1 − eds(E_i, S_j)) × w(S_j), which is smaller than both (1 − eds(E_i, S_j)) × (w(E_i) + w(S_j)) and w(S_j). This lowest cost can be represented in terms of IDF values as (1 − eds(E_i, S_j)) × idf(S_j), which is equivalent to deleting a token with an IDF value of (1 − eds(E_i, S_j)) × idf(S_j). In what follows, we compute the lower bound dissimilarity as if we only considered deletion cost.

For ease of computing the lower bound dissimilarity, we maintain an array M of length |E|.
The i-th element of the array, denoted by M_i where τ ≤ M_i ≤ 1, corresponds to the edit similarity between the most similar text token of the current sub-string and the i-th entity token of E (i.e. E_i). Note that some elements (e.g. M_j) in M are marked as none if the corresponding entity tokens have no matched text token (e.g. no token in the current sub-string S matches E_j). The lower bound dissimilarity can be computed as follows:
$$B_\perp = \frac{V_R + \sum_i (1 - M_i) \times idf(S'_i)}{V_T + \sum_r idf(E_r)} \quad (16)$$
where i ∈ {i : τ ≤ M_i ≤ 1} and r ranges over the entity tokens that are not exactly matched (i.e. M_r < 1 or M_r is none). The numerator of Equation (16) represents the total "deletion cost": the true deletion cost V_R plus the substitution cost Σ_i (1 − M_i) × idf(S'_i), where S'_i is the text token that is the most similar to E_i. The denominator is the ideal total IDF value of the sub-string: V_T is the total IDF value of the current sub-string, and the term Σ_r idf(E_r) is the total IDF value of all the not exactly matched entity tokens. We can prove that the lower bound dissimilarity increases monotonically as the sub-string spans. The key idea of the proof is that adding the same value idf(t) > 0 to the numerator and the denominator of Equation (16) increases the value of B⊥.

Left spanning: To update V_T and V_R in this spanning, we need to handle the following two cases separately. Suppose the token to the left of the current sub-string is t, and t is included in the sub-string after spanning.

• Case 1: t does not match any token of E, so we need to delete t. Hence, we update V_R by V_R = V_R + idf(t), and we update the total IDF value V_T by V_T = V_T + idf(t).

• Case 2: t matches a token E_j of E. We update V_T by V_T = V_T + idf(E_j). We consider this a substitution operation and update V_R according to the following two sub-cases.

  ◦ No other text token in the current sub-string matches E_j.
We update M_j by M_j = eds(t, E_j), and we do not update V_R since no deletion is required.

  ◦ A text token in the current sub-string has already matched E_j. We update V_R by V_R = V_R + idf(t), and M_j by M_j = max{M_j, eds(t, E_j)}.

After updating V_T, V_R and M, we compute the lower bound dissimilarity B⊥ using Equation (16). When B⊥ > 1 − δ, the left spanning terminates.
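The bookkeeping above can be sketched as follows (a simplified illustration with made-up IDF values; the data layout and function name are assumptions, not the paper's code):

```python
def lower_bound(v_t, v_r, m, entity_idf, match_idf):
    """Equation (16): m[i] is the best edit similarity found for entity
    token i (None if unmatched); match_idf[i] is the IDF of the text
    token realising m[i]; entity_idf[i] is the IDF of entity token i."""
    num = v_r + sum((1 - m[i]) * match_idf[i]
                    for i in range(len(m)) if m[i] is not None)
    den = v_t + sum(entity_idf[i]
                    for i in range(len(m)) if m[i] is None or m[i] < 1)
    return num / den

# Entity with a light token and a heavy core token; the current
# sub-string is just the core token, exactly matched.
entity_idf = [0.3, 5.2]
m, match_idf = [None, 1.0], [None, 5.2]
v_t, v_r = 5.2, 0.0
print(lower_bound(v_t, v_r, m, entity_idf, match_idf))  # 0.0

# Left spanning over an unmatched text token t with idf(t) = 0.8 (Case 1):
v_t += 0.8
v_r += 0.8
b = lower_bound(v_t, v_r, m, entity_idf, match_idf)
print(round(b, 3))  # 0.127 > 1 - delta = 0.1, so the spanning terminates
```

The second step illustrates the monotonicity argument: adding idf(t) to both numerator and denominator can only push B⊥ upward.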
4) Producing a candidate for Fuzzy Jaccard:
The lower bound dissimilarity: For Fuzzy Jaccard, to compute the lower bound dissimilarity B⊥ of the current sub-string S, we first compute the maximum possible similarity score sc_max of the sub-string, and then B⊥ = (1 − sc_max). According to Equation (9), the similarity score of E and S reaches its maximum value when the term Σ_{e∈E',s∈S'} eds(e, s)·(w(e) + w(s)) is maximised. We can rewrite the term in the following form:
$$\sum_{e \in E', s \in S'} eds(e, s) \cdot w(e) + \sum_{e \in E', s \in S'} eds(e, s) \cdot w(s)$$
The current sub-string S has the maximum similarity to the entity E when all the entity tokens are exactly matched, that is, Σ_{e∈E',s∈S'} eds(e, s)·w(e) = 1. So the whole term is maximised when
$$\sum_{e \in E', s \in S'} eds(e, s) \cdot w(s)$$
reaches its maximum possible value. Next, we replace the weights by IDF values, and we have
$$\sum_{e \in E', s \in S'} eds(e, s) \cdot w(s) = \frac{\sum_{e \in E', s \in S'} eds(e, s) \cdot idf(s)}{T_{idf}(S)} \quad (17)$$
where T_idf(S) is the total IDF value of the current sub-string (cf. Equation (1)). The above term is maximised when eds(e, s) equals the edit similarity of the entity token e to the most similar token of the current sub-string. Recall that the i-th element of M is the similarity between E_i and the most similar token of the current sub-string. Hence, we can rewrite the term (17) in the following form using M:
$$\frac{\sum_i M_i \cdot idf(S'_i)}{T_{idf}(S)} \quad (18)$$
where S'_i is the text token that is the most similar to E_i in the current sub-string S. The value of the above term may increase as more tokens are included via the left/right spanning. The text tokens that improve the similarity score are those similar to the entity tokens (i.e.
through improving the value of M_i). We can modify (18) to a term that attains the maximum value as follows:
$$\frac{\sum_i M_i \cdot idf(S'_i) + \sum_r (1 - M_r) \cdot idf(S'_r)}{T_{idf}(S) + \sum_r idf(E_r)} \quad (19)$$
where r ∈ {r : M_r < 1}, and the token S'_r (which is similar to E_r) is added to the above term only when it increases the value of the term (19). Note that the left/right spanning process can be terminated when the term (19) cannot be increased. The following lemma identifies the tokens that can increase the value of the term (19), and hence increase the Fuzzy Jaccard similarity.

Lemma 2.
A token S'_r can increase the Fuzzy Jaccard similarity of the current sub-string S if
$$(1 - M_r) \ge \frac{\sum_i M_i \cdot idf(S'_i)}{T_{idf}(S)}$$
The proof of Lemma 2 can be found in Appendix A. Based on Equation (9), the maximum similarity score is computed using the following similarity function:
$$sc_{max} = \frac{(1 + T)/2}{1 + 1 - (1 + T)/2} \quad (20)$$
where T is the term (19). Then the lower bound dissimilarity can be computed by B⊥ = 1 − sc_max. We can prove that the lower bound dissimilarity of Fuzzy Jaccard increases monotonically. The proof is straightforward and hence omitted.

[Fig. 5. Shrinking the previous sub-string: the leftmost landmark is removed and the current sub-string starts from the next landmark.]

Left spanning: We can use the lower bound dissimilarity discussed above to determine when the left spanning can be terminated. We denote V_T = T_idf(S). When spanning, we need to update the value of V_T. Suppose the token to the left of the current sub-string is t. We update V_T by V_T = V_T + idf(t). The numerator of the term (19) is handled by the following two cases.

• Case 1: t does not match any entity token. Then the numerator does not need to be updated.

• Case 2: t matches an entity token E_i.

  ◦ E_i has no match to any other text token. Then, M_i = eds(E_i, t). The updated M_i contributes to increasing the numerator of the term (19).

  ◦ E_i already matches some text token in the current sub-string; then, M_i = max{M_i, eds(t, E_i)}.

After updating V_T and M, we recompute the maximum similarity using Equation (20), and compute the lower bound B⊥. If B⊥ > 1 − δ, the left spanning should be terminated.
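As a quick numeric sketch of Equation (20) (our own illustration; it assumes the Fuzzy Jaccard form reconstructed above, where the fuzzy overlap is half the eds-weighted sum):

```python
def fj_lower_bound(t: float) -> float:
    """B_perp = 1 - sc_max, with sc_max from Equation (20)."""
    overlap = (1 + t) / 2             # maximum possible fuzzy overlap
    sc_max = overlap / (2 - overlap)  # i.e. (1 + T) / (3 - T)
    return 1 - sc_max

# If term (19) can rise no higher than T = 0.7, the lower bound
# dissimilarity already exceeds 1 - delta = 0.1 for delta = 0.9,
# so the spanning can stop.
print(round(fj_lower_bound(0.7), 3))  # 0.261
print(fj_lower_bound(1.0))            # 0.0 (a perfect match is still possible)
```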
5) Reusing computation in producing candidates:
The boundaries of a sub-string candidate should start and end with matched tokens, because leading/trailing unmatched tokens of the sub-string candidate are not part of the entity. We can use this property to reuse some computation while finding the boundaries of a neighbour sub-string candidate (i.e. the sub-string candidate next to the previously found sub-string candidate in the document). We refer to text tokens that match the entity E as landmark tokens.

Shrinking: To find the neighbour sub-string candidate, we shrink the previous sub-string candidate by one landmark token. That is, the left boundary is moved from the leftmost landmark, denoted by l_1, to the second leftmost landmark, denoted by l_2. Figure 5 gives an example of shrinking the previous sub-string candidate. The leftmost landmark l_1 and the second leftmost landmark l_2 of the previous sub-string candidate are an optional token and a core token, respectively; after shrinking, we obtain the current sub-string with that core token as its leftmost landmark.

Suppose l_1 matches the i-th entity token E_i. The total IDF value V_T of the sub-string after shrinking can be updated as follows:
$$V_T = V_T - V_s - idf(l_1) \quad (21)$$
where V_s = Σ idf(t_j), and t_j ranges over the text tokens between the leftmost landmark l_1 and the second leftmost landmark l_2. Next, we provide the formulas for updating the other values, specifically for FuzzyED and Fuzzy Jaccard.

Shrinking for FuzzyED: We update V_R using the following equation:
$$V_R = \begin{cases} V_R - V_s - idf(l_1) & \text{if } eds(l_1, E_i) < M_i, \\ V_R - V_s - idf(t) & \text{otherwise.} \end{cases}$$
The first case removes the landmark token l_1 when it is not the most similar token to E_i; the second case removes l_1 when it is the most similar token to E_i. If l_1 is the most similar token to E_i, we need to update M_i by M_i = eds(t, E_i), where t is the second most similar token to E_i in the previous sub-string candidate.

Shrinking for Fuzzy Jaccard: We let V_m = Σ_i M_i · idf(S'_i). Then V_m is updated using the following equation:
$$V_m = \begin{cases} V_m & \text{if } eds(l_1, E_i) < M_i, \\ V_m - (M_i - eds(t, E_i)) \cdot idf(l_1) & \text{otherwise.} \end{cases}$$
In the first case, we do not need to update V_m, since l_1 does not contribute to V_m. In the latter case, we need to find the second most similar token t to E_i, update V_m accordingly, and update M_i by M_i = eds(t, E_i).

After the shrinking, we can start the right spanning to find the right boundary of the new sub-string candidate.
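A minimal sketch of the shrinking update of Equation (21) (the function name and the values are our own illustration):

```python
def shrink_total_idf(v_t, between_idf, landmark_idf):
    """Equation (21): drop the leftmost landmark l1 and the unmatched
    text tokens sitting between l1 and the second leftmost landmark l2."""
    return v_t - sum(between_idf) - landmark_idf

# Previous candidate has total IDF 10.0; two unmatched tokens (IDF 0.5
# and 0.7) sit between l1 (IDF 2.0) and l2.
print(shrink_total_idf(10.0, [0.5, 0.7], 2.0))  # 6.8
```

Because each shrink only subtracts previously accumulated IDF values, the update runs in time proportional to the number of removed tokens rather than the candidate length.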
6) Analysis of producing candidates by spanning:
Using the spanning-based technique, the number of sub-strings that require measuring the two-level similarity is at most k, where k is the number of matched tokens (including core tokens and optional tokens). To understand this, refer to Figure 5. Each time, we shrink the previous sub-string candidate by one matched text token and find a new sub-string candidate. Hence, we perform at most k shrinkings, and each shrinking corresponds to one sub-string candidate. Therefore, the spanning-based candidate producing technique generates at most k sub-string candidates. In comparison, the enumeration-based candidate producing technique generates (u − l) × k sub-strings, as analysed in Section IV-B4.

Not using core tokens: The spanning-based candidate producing technique can also be applied without core tokens. The number of sub-string candidates that require measuring the two-level similarity is then also k (i.e. all the matched text tokens). We conduct experiments to investigate the importance of core tokens when we study the effectiveness of the spanning-based technique in the next section.

D. Filtering candidates
In the Filtering Candidates component of our algorithm (cf. Figure 2), we can integrate different filtering (i.e. pruning) techniques for the two-level similarity used in the Measuring Similarity component. Next, we propose a filtering technique for FuzzyED, and present a general filtering technique for both FuzzyED and Fuzzy Jaccard.

A filtering technique for FuzzyED: The key idea of the filtering technique is to compute a lower bound cost of transforming a sub-string candidate into an entity, and to prune the sub-string candidate if the lower bound cost is higher than a certain threshold. The lower bound cost includes the insertion and substitution costs of transforming the sub-string candidate S into the entity E, and is computed by the equation below:
$$C_\perp(E, S) = \sum_{E_i \in E} (1 - M_i) \times w(E_i) \quad (22)$$
where M_i is the edit similarity of the entity token E_i to the most similar text token in the sub-string candidate S, and M_i ∈ [0, 1]. Note that both the insertion cost and the substitution cost are considered in the above equation, because M_i = 0 is the case of insertion and M_i > 0 is the case of substitution. Note also that we do not include the deletion cost in Equation (22), because our algorithm finds the most similar sub-string of S that matches the entity, and we do not know whether the unmatched tokens are part of the most similar sub-string. If the lower bound cost C⊥(E, S) is higher than the threshold (1 − δ), we prune the sub-string candidate and avoid measuring the FuzzyED similarity.

A general filtering technique: In the Filtering Candidates component, we can use more than one filtering technique. Here, we adopt an additional filtering technique introduced by Chakrabarti et al. [4]. Formally, a sub-string candidate can be pruned if the condition below is satisfied:
$$T_w(S \cap E) < \delta \quad (23)$$
This technique can be used with both FuzzyED and Fuzzy Jaccard. We are not aware of any other suitable filtering technique for Fuzzy Jaccard.
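The FuzzyED filter of Equation (22) can be sketched as follows (hypothetical names and toy numbers, not the paper's implementation):

```python
def prune_by_lower_bound(entity_weights, m, delta):
    """entity_weights[i] = w(E_i) (normalised); m[i] = best edit similarity
    of E_i to any token of the candidate (0.0 when unmatched).
    Returns True when the candidate can be discarded without
    measuring the FuzzyED similarity."""
    cost = sum((1 - mi) * w for mi, w in zip(m, entity_weights))
    return cost > 1 - delta

# Entity {The, University, of, Oxford}; the candidate matches only
# "University" exactly, so the lower bound cost is 0.03 + 0.04 + 0.51.
w = [0.03, 0.42, 0.04, 0.51]
m = [0.0, 1.0, 0.0, 0.0]
print(prune_by_lower_bound(w, m, delta=0.9))  # True: cost 0.58 > 0.1
```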
Hence, when the Fuzzy Jaccard similarity is used, our algorithm adopts the filter of Condition (23).

V. EXPERIMENTAL STUDY
In this section, we present the experimental results on the efficiency and effectiveness of our algorithm using FuzzyED (denoted by "FED") and our algorithm using Fuzzy Jaccard (denoted by "FJ"). In the Matching Text Token component of our algorithm, we used the C++ source code offered by Li et al. [17]. We implemented FED and FJ in C++. All experiments were conducted on a machine running Linux with an Intel Xeon E5-2643 CPU and 32GB of memory. By default, we set the entity similarity threshold δ to 0.9, and the token edit similarity threshold τ to 0.8. We used three real world datasets: the Amazon Reviews dataset [23], the DBWorld Messages dataset and the IMDB Reviews dataset [24]. The details of the datasets are as follows. (i) Amazon Reviews: the dataset contains 346,867 product reviews from customers of Amazon. Each product review serves as a document; 1,989,376 product names from Amazon form the entity dictionary. (ii) DBWorld Messages: the dataset contains 33,628 messages of "call for papers", job advertisements and so forth from the database research community. Each message is a document; the entity dictionary contains 132,745 worldwide institution names from Freebase [25]. (iii) IMDB Reviews: the dataset has 97,788 movie reviews from the IMDB website. Each movie review is a document; the entity dictionary contains 108,941 movie names from the IMDB website. More details of the three datasets are provided in Table II; the average, maximum and minimum lengths of the documents (and of the entities in the dictionaries) are measured in numbers of tokens.

TABLE II. DETAILS OF DOCUMENTS AND DICTIONARIES

  dataset        size       ave len  max len  min len
  Amazon doc     346,867    191      29,070   30
  Amazon dict    1,989,376  6        204      1
  DBWorld doc    33,628     732      33,648   1
  DBWorld dict   132,745    3        27       1
  IMDB doc       97,788     277      2,968    8
  IMDB dict      108,941    3        24       1
TABLE III. OVERALL EFFICIENCY COMPARISON

                 using enumeration      using spanning
  sub-dataset    FED-e      FJ-e        FED-s     FJ-s
  Amazon         1.05 h     26.7 h      7 sec     10 sec
  DBWorld        0.25 h     12.9 h      11 sec    11 sec
  IMDB           0.13 h     6.06 h      11 sec    12 sec
We have four implementations of our algorithm:
FED-e (FED-s) is FuzzyED together with the enumeration-based (spanning-based) candidate producing technique; FJ-e (FJ-s) is Fuzzy Jaccard together with the enumeration-based (spanning-based) candidate producing technique. In what follows, we first report the efficiency and effectiveness of our algorithm, and then we investigate the effect of core tokens on our algorithm.
A. Efficiency and effectiveness comparison
Here, we investigate the performance of our algorithm in three aspects: overall efficiency, the effect of varying the parameters (e.g. τ and δ) on the efficiency, and effectiveness.
1) Overall efficiency:
We conducted experiments on the three datasets for FED-s; the elapsed time of FED-s for Amazon Reviews, DBWorld Messages and IMDB Reviews is 16 hours, 17 minutes and 16 minutes, respectively. FJ-s took about twice as long as FED-s to process the datasets. Note that although the Amazon Reviews dataset has around 350,000 documents and nearly two million entities in the dictionary, our FED-s can process it in 16 hours. In comparison, FED-e and FJ-e are extremely slow at processing the whole datasets, because they require measuring the two-level similarity for many more sub-string candidates, as discussed in Section IV-C6. To provide some concrete results on the elapsed time of the four implementations, we randomly sampled a sub-dataset from each of the original document datasets. To construct the three sub-datasets, we sampled 1 per 100 documents in the DBWorld dataset and in the IMDB dataset, and 1 per 10,000 documents in the Amazon dataset, so that FJ-e and FED-e can process the three sub-datasets in a reasonable amount of time. Note that we did not construct subsets of the dictionaries; we show the effect of changing the size of the dictionary in the next set of experiments.

Table III gives the efficiency of the four implementations on the three sub-datasets. As we can see from the table, the implementations using the spanning-based candidate producing technique (i.e. FED-s and FJ-s) are more than 40 times faster than those using the enumeration-based candidate producing technique. Another observation is that the FED based implementations are more efficient than the FJ based implementations, because FJ has higher complexity than FED, as discussed in Section IV-A3.
[Fig. 6. Elapsed time (sec) of FED-s, FJ-s, FED-e and FJ-e: (a) varying δ; (b) varying τ; (c) varying the dictionary size; (d) varying the number of documents.]
2) Effect of varying the parameters on efficiency:
Next, we study the effect of varying the parameters on the efficiency of FED-s, FED-e, FJ-s and FJ-e. In our experiments, we observed that the results on the three datasets are similar when varying the parameters. Due to the space limitation, we use the DBWorld Messages dataset as a representative in this set of experiments. The default settings of the experiments are as follows: (i) the entity similarity threshold δ is set to 0.9; (ii) the token edit similarity threshold τ is set to 0.8; (iii) the number of entities in the dictionary is 132,745 (i.e. the whole dictionary); and (iv) the number of documents is 10.

Effect of varying the entity similarity threshold: To study the effect of the entity similarity threshold δ, we varied δ from 0.85 to 1. Figure 6a shows the results for the four implementations. As can be seen from the figure, the FED based implementations consistently outperform the FJ based implementations. The implementations using the spanning-based candidate producing technique outperform those using the enumeration-based candidate producing technique by around 100 times. Another observation is that as the entity similarity threshold decreases, the total elapsed time of all the implementations increases. This is because when the entity similarity threshold is small, more candidates require measuring the two-level similarity.

Effect of varying the token similarity threshold: Figure 6b gives the results of varying the token similarity threshold τ from 0.7 to 1. FED-s and FJ-s significantly outperform FED-e and FJ-e, by two orders of magnitude. Similar to varying the entity similarity threshold δ, the smaller the threshold, the more time our algorithm requires.

Effect of varying the size of the entity dictionary: To study the effect of the size of the dictionary, we varied the number of entities in the dictionary from 2,000 to 128,000. Figure 6c shows that the elapsed time of all four implementations increases as the size of the dictionary increases.
Effect of varying the number of documents: To study the effect of the number of documents on the efficiency, we sampled from the DBWorld Messages dataset four sub-datasets of 10, 20, 40 and 80 documents with an average length of 732. We measured the total elapsed time of extracting entities from each sub-dataset. As shown in Figure 6d, the elapsed time of the FED based implementations grows more slowly than that of the FJ based ones. This is because the more documents there are, the more sub-string candidates are generated; as a result, our algorithm needs to measure the two-level similarity more often. As the cost of measuring the two-level similarity for the FED based implementations is cheaper than that for the FJ based ones (O(mn) v.s. O(m²n²)), the elapsed time of the FJ based implementations increases faster than that of the FED based ones.

TABLE IV. F-MEASURE OF FED AND FJ

         precision        recall           F
  δ      FED     FJ       FED     FJ      FED    FJ
  1.00   100%    97.6%    94.5%   95.4%   97.2   96.5
  0.95   88.0%   85.3%    94.8%   95.5%   91.3   88.4
  0.90   71.5%   69.5%    96.6%   97.1%   82.2   81.0
  0.85   64.1%   62.6%    99.7%   100%    78.0   77.0

TABLE V. EFFECT OF CORE TOKENS ON CANDIDATE PRODUCING

  sub-dataset   FED-s    FED-a     speedup
  Amazon        7 sec    0.70 hr   362
  DBWorld       11 sec   0.14 hr   45
  IMDB          11 sec   0.17 hr   56
3) Overall effectiveness:
To demonstrate the effectiveness of the FuzzyED similarity and the Fuzzy Jaccard similarity, we used the whole DBWorld Messages dataset. We manually labelled 20,000 sub-string candidates as a set of ground truth. Entities in the document correctly extracted as entities in the dictionary are called true positives (denoted by tp); non-entities in the document incorrectly extracted as entities in the dictionary are called false positives (denoted by fp). We compute the precision p and recall r by the following equations:
$$p = \frac{tp}{tp + fp}, \qquad r = \frac{tp}{tp + fn}$$
where fn is the number of false negatives, and hence (tp + fn) is the total number of entities in the ground truth set.

Table IV shows the F-measure results for FED and FJ as the entity similarity threshold δ changes from 0.85 to 1. As we can see from the table, FED has better F score and precision than FJ, and comparable recall to FJ. FED and FJ can produce an F score of around 0.9 when the entity similarity threshold is 0.95.

B. Effect of core tokens
In this set of experiments, we compare the spanning-based approach using core tokens with the spanning-based approach without core tokens, as discussed in Section IV-C6. The datasets used in these experiments are identical to those detailed in Table III. To demonstrate the effectiveness of using core tokens, we used two versions of FED: one with core tokens applied in the candidate producing process; the other, denoted by FED-a ("a" for all entity tokens), without core tokens in the candidate producing process. Note that the FJ based approach without core tokens is extremely slow and did not complete within our time limit, and hence the results of FJ are not shown here. As we can see from Table V, FED-s consistently outperforms FED-a, by up to 362 times. This is because using core tokens reduces the number of matched tokens in the document, and hence significantly reduces the number of sub-string candidates that require measuring the two-level similarity.

VI. CONCLUSION
In this paper, we have addressed the problem of entity extraction from free text using both character-based similarity and token-based similarity (i.e. two-level similarity). By exploiting the properties of the two-level similarity and the weights of tokens, we have developed novel techniques to significantly reduce the number of sub-string candidates that require computation of two-level similarity against the entities. A comprehensive experimental study has shown that our algorithm based on edit similarity is efficient and also effective. Moreover, our algorithm produces a high F score in the range of [0.91, 0.97] with an edit similarity threshold in [0.95, 1].

REFERENCES

[1] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.
[2] W. Wang, C. Xiao, X. Lin, and C. Zhang, "Efficient approximate entity extraction with edit distance constraints," in SIGMOD. ACM, 2009, pp. 759–770.
[3] D. Deng, G. Li, and J. Feng, "An efficient trie-based method for approximate entity extraction with edit-distance constraints," in ICDE, 2012, pp. 762–773.
[4] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, "An efficient filter for approximate membership checking," in SIGMOD, 2008, pp. 805–818.
[5] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008, vol. 1.
[6] X. Carreras, L. Marquez, and L. Padró, "Named entity extraction using AdaBoost," in Proceedings of the 6th Conference on Natural Language Learning - Volume 20. Association for Computational Linguistics, 2002, pp. 1–4.
[7] A. Jain and M. Pennacchiotti, "Open entity extraction from web search query logs," in International Conference on Computational Linguistics. ACL, 2010, pp. 510–518.
[8] W. W. Cohen and S. Sarawagi, "Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods," in KDD. ACM, 2004, pp. 89–98.
[9] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, 2001.
[10] A. Gattani, D. S. Lamba, N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan, "Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach," PVLDB, vol. 6, no. 11, pp. 1126–1137, 2013.
[11] M.-S. Kim, K.-Y. Whang, J.-G. Lee, and M.-J. Lee, "n-gram/2L: a space and time efficient two-level n-gram inverted index structure," in Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 2005, pp. 325–336.
[12] Y. Kim and K. Shim, "Efficient top-k algorithms for approximate substring matching," in SIGMOD. ACM, 2013, pp. 385–396.
[13] P. Wang, C. Xiao, J. Qin, W. Wang, X. Zhang, and Y. Ishikawa, "Local similarity search for unstructured text," in SIGMOD, 2016, pp. 1991–2005.
[14] M. Hadjieleftheriou and D. Srivastava, "Weighted set-based string similarity," IEEE Data Eng. Bull., vol. 33, no. 1, pp. 25–36, 2010.
[15] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and efficient fuzzy match for online data cleaning," in SIGMOD. ACM, 2003, pp. 313–324.
[16] W. W. Cohen, P. D. Ravikumar, S. E. Fienberg et al., "A comparison of string distance metrics for name-matching tasks," in IIWeb, 2003, pp. 73–78.
[17] G. Li, D. Deng, and J. Feng, "Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction," in SIGMOD. ACM, 2011, pp. 529–540.
[18] J. Wang, G. Li, and J. Feng, "Fast-join: an efficient method for fuzzy token matching based string similarity join," in ICDE. IEEE, 2011, pp. 458–469.
[19] J. Wang, J. Feng, and G. Li, "Trie-join: efficient trie-based string similarity joins with edit-distance constraints," VLDB Endowment, vol. 3, no. 1–2, pp. 1219–1230, 2010.
[20] D. B. West et al., Introduction to Graph Theory. Prentice Hall, 2001.
[21] D. P. Bertsekas, "A simple and fast label correcting algorithm for shortest paths," Networks, vol. 23, no. 8, pp. 703–709, 1993.
[22] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in KDD. ACM, 2001, pp. 269–274.
[23] J. McAuley and J. Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text," in ACM Conference on Recommender Systems. ACM, 2013, pp. 165–172.
[24] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," in Association for Computational Linguistics: Human Language Technologies. ACL, 2011, pp. 142–150.
[25] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in SIGMOD. ACM, 2008, pp. 1247–1250.

APPENDIX
We prove Lemma 1 for FuzzyED in the following.
Proof:
Suppose no token in $S$ matches the core tokens of $E$. To transform $S$ into $E$, we need at least to insert all the core tokens of $E$ into $S$, and the total cost of the insertion is larger than $1 - \delta$ (cf. Constraint (14)). Hence, the similarity between $S$ and $E$ is smaller than $\delta$. Therefore, for $S$ to match $E$ (i.e. for the similarity between $S$ and $E$ to be not smaller than $\delta$), at least one text token in $S$ must match a core token of the entity $E$. Similarly, we can prove Lemma 1 for Fuzzy Jaccard.

We prove Lemma 2 for the Fuzzy Jaccard similarity here.

Proof:
Only tokens that improve the Fuzzy Jaccard similarity are included in the current sub-string. In what follows, we investigate the tokens that improve the Fuzzy Jaccard similarity. Since $E_r$ and $S'_r$ are similar and we assume they have the same IDF value, we replace $E_r$ by $S'_r$ in the following process. As $S'_r$ leads to an increase of the Fuzzy Jaccard similarity, we have

$$\frac{\sum M_i \cdot idf(S'_i) + (1 - M_r) \cdot idf(S'_r)}{T_{idf}(S) + idf(S'_r)} \ge \frac{\sum M_i \cdot idf(S'_i)}{T_{idf}(S)}.$$

We let $a = \sum M_i \cdot idf(S'_i)$, $b = T_{idf}(S)$, $c = idf(S'_r)$, and $c' = (1 - M_r) \cdot idf(S'_r)$. Then we have $\frac{a + c'}{b + c} \ge \frac{a}{b}$. Since $a$, $b$, $c$ and $c'$ are larger than 0, we can rewrite the above inequality as follows.

$$ab + c'b \ge ab + ac \;\Rightarrow\; c'b \ge ac \;\Rightarrow\; \frac{c'}{c} \ge \frac{a}{b}$$

Substituting the original values of $a$, $b$, $c$ and $c'$, we have

$$(1 - M_r) \ge \frac{\sum M_i \cdot idf(S'_i)}{T_{idf}(S)}.$$
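The token-inclusion test derived in the proof of Lemma 2 can be checked numerically. The following Python sketch is illustrative only (the function names and inputs are ours, not from the paper): it computes the IDF-weighted Fuzzy Jaccard score of the current sub-string and applies the simplified condition $(1 - M_r) \ge a/b$, which for positive $a$, $b$, $c$ is equivalent to the direct comparison $(a + c')/(b + c) \ge a/b$.

```python
def fuzzy_jaccard(matched, total_idf):
    """IDF-weighted Fuzzy Jaccard score of the current sub-string S.

    matched:   list of (M_i, idf_i) pairs, one per matched token S'_i,
               where M_i is the matching score and idf_i its IDF weight.
    total_idf: T_idf(S), the total IDF weight of the sub-string S.
    """
    return sum(m * w for m, w in matched) / total_idf


def improves_similarity(matched, total_idf, m_r, idf_r):
    """True iff adding token S'_r (score m_r, IDF weight idf_r) increases
    the Fuzzy Jaccard score.

    Uses the simplified test from the proof: (1 - m_r) >= a / b,
    where a = sum of M_i * idf(S'_i) and b = T_idf(S); this is equivalent
    to the direct comparison (a + c') / (b + c) >= a / b.
    """
    a = sum(m * w for m, w in matched)  # current weighted match total
    b = total_idf                       # T_idf(S)
    return (1 - m_r) >= a / b
```

For example, with `matched = [(0.9, 2.0), (1.0, 3.0)]` and `total_idf = 6.0` the current score is 0.8, so a new token is included only when $1 - M_r \ge 0.8$, i.e. when its cost term $M_r$ is at most 0.2.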