Benchmarking Declarative Approximate Selection Predicates
by

Oktie Hassanzadeh

A thesis submitted in conformity with the requirements
for the degree of Master of Science
Graduate Department of Computer Science
University of Toronto

Copyright © 2007 by Oktie Hassanzadeh

Abstract
Benchmarking Declarative Approximate Selection Predicates
Oktie Hassanzadeh
Master of Science
Graduate Department of Computer Science
University of Toronto
2007

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last couple of years, several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this thesis, new similarity predicates are proposed along with their declarative realization, based on notions of probabilistic information retrieval. Then, full declarative specifications of previously proposed similarity predicates in the literature are presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations is performed.

Contents
A.1 Q-gram generation
A.2 Word token generation
A.3 Q-gram generation of the word tokens (for combination predicates)

B SQL Statements for Predicates
B.1 Overlap Predicates
B.1.1 IntersectSize
B.1.2 Jaccard
B.1.3 WeightedMatch
B.1.4 WeightedJaccard
B.2 Aggregate Weighted Predicates
B.2.1 Tfidf Cosine Predicate
B.2.2 BM25 Predicate
B.3 Language Modeling Predicates
B.3.1 Language Modeling
B.3.2 Hidden Markov Models
B.4 Combination Predicates
B.4.1 GES_Jaccard
B.4.2 GES_apx
B.4.3 SoftTFIDF

Chapter 1
Introduction
The importance of data cleaning and quality technologies for business practices is well recognized. Data cleaning has been an active research topic in several communities, including statistics, machine learning, and data management. The quality of data suffers from typing mistakes, lack of standards for recording database fields, integrity constraints that are not enforced, inconsistent data mappings, etc. For years, data quality technology has grown independently from core data management. Data quality tools became part of Extract-Transform-Load (ETL) technologies, commonly applied during the initial loading phase of data into a warehouse. Although this might be a viable approach for data analytics, where the data processed are static, it is far from acceptable for operational databases. Dynamic databases, however, face proliferating quality problems that degrade common business practices.

Recently, there has been a major focus on tighter integration of data quality technology with database technology. In particular, there has been research work on the efficient realization of popular data cleaning algorithms inside database engines, as well as studies on the efficient realization of data quality primitives in a declarative way. The approaches are complementary, the former assuring high performance and the latter ease of deployment and integration with existing applications without modification.

This thesis studies declarative realizations of similarity predicates for the popular approximate (flexible) selection operation for data de-duplication [20, 19]. A similarity predicate sim() is a predicate that numerically quantifies the 'similarity' or 'closeness' of two (string) tuples. Given a relation R, the approximate selection operation using similarity predicate sim() reports all tuples t ∈ R such that sim(t_q, t) ≥ θ, where θ is a specified numerical 'similarity threshold' and t_q is a query tuple.
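The approximate selection operation can be sketched in a few lines. The following is a minimal illustration, using word-level Jaccard similarity (one of the overlap predicates discussed later) as the predicate; the function names and data are made up for the example:

```python
# Sketch of approximate selection: report all tuples t in R with
# sim(t_q, t) >= theta. The predicate here is word-level Jaccard.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def approximate_selection(R, t_q, sim, theta):
    return [t for t in R if sim(t_q, t) >= theta]

R = ["John Smith", "Jon Smith", "Jane Doe", "J. Smith"]
print(approximate_selection(R, "John Smith", jaccard, 0.3))
# → ['John Smith', 'Jon Smith', 'J. Smith']
```

The declarative realizations studied in this thesis express exactly this computation in SQL, so that it runs inside the database rather than in application code.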
Approximate selections are special cases of the approximate join (record linkage, similarity join) operation [20, 19]. Several efficient declarative implementations of this operation for specific similarity predicates have been proposed [20, 19], both for approximate selections and joins.

In this thesis, we conduct a thorough study of declarative realizations of similarity predicates for approximate selections. We introduce and adapt novel predicates, realize them declaratively, and compare them with existing ones for accuracy and performance. In particular, we make the following contributions:

• Inspired by the success of tf-idf cosine similarity from information retrieval [26] as a similarity predicate for approximate selections, we introduce declarative realizations of other successful predicates from information retrieval, in particular the popular BM25 measure.

• We introduce declarative realizations of probabilistic similarity predicates inspired by Language Models from information retrieval [22] and Hidden Markov Models [21], suitably adapted for the case of approximate selections.

• We present declarative realizations of previously proposed similarity predicates for the approximate selection problem, and we propose a categorization of all measures, both previously proposed and new, according to their characteristics.

• We present a thorough experimental study comparing all similarity predicates for data cleaning operations.

Chapter 2
Related Work
Data quality has been an active research topic for many years. A collection of statistical techniques were introduced initially for the record linkage problem [9, 8]. The bulk of early work on data quality was geared towards correcting problems in census files [29]. A number of similarity predicates were developed taking into account the specific application domain (i.e., census files) for assessing closeness between person names (e.g., Jaro, Jaro-Winkler [15, 29]).

The work of Cohen [6] introduced the use of primitives from information retrieval (namely cosine similarity, utilizing tf-idf [26]) to identify flexible matches among database tuples. A performance/accuracy study conducted by Cohen et al. [7] demonstrated that such techniques outperform common predicates introduced for specific domains (e.g., Jaro, Jaro-Winkler).

Other techniques geared towards database tuples include the merge/purge technique [14]. Several predicates quantifying approximate matches between strings have been utilized for dealing with quality problems, including edit distance and its variants [13]. Hybrid predicates combining notions of edit distance and cosine similarity have also been introduced [4, 1]. Recently, [5, 2] presented SSJOIN, a primitive operator for efficient set similarity joins. Utilizing ideas from [28], such an operator can be used for approximate selections and joins.

Chapter 3
Framework
Let Q be a query string and D a string record from a base relation R = { D_i : 1 ≤ i ≤ N }. We denote by Q and D the sets of tokens in Q and D respectively. We refer to substrings of a string as tokens in a generic sense; such tokens can be, for example, words or q-grams (sequences of q consecutive characters of a string). For Q = 'db lab', Q = { 'db', 'lab' } for word-based tokenization and Q = { 'db ', 'b l', ' la', 'lab' } for tokenization using 3-grams. We refer to tokens throughout the thesis when referring to words or q-grams, and make the choice specific (word or q-gram) only when it is absolutely required. In certain cases, we may associate a weight with each token. Several weighting mechanisms exist; we present our techniques referring to weights of tokens, making the choice of the weighting scheme concrete when required. In Chapter 5 we realize our techniques for specific choices of tokens and specific weighting mechanisms.

Our goal is to calculate a similarity score between Q and D using a similarity predicate. We group similarity predicates into five classes based on their characteristics, namely:

• Overlap Predicates: predicates that assess similarity based on the overlap of tokens in Q and D.

• Aggregate Weighted Predicates: predicates that assess similarity by manipulating the weights of the tokens in Q and D.

• Language Modeling Predicates: predicates that are based on probabilistic models imposed on the elements of Q and D.

• Edit Based Predicates: predicates based on a set of edit operations applied between Q and D.

• Combination Predicates: predicates combining features from the classes above.

The classes were defined by studying the properties of previously proposed similarity predicates as well as ones newly proposed herein. The first four classes encompass predicates introduced previously in various contexts for data cleaning tasks, with the exception of BM25, which to the best of our knowledge is deployed here for data cleaning purposes for the first time. The Language Modeling class of predicates draws from work on information retrieval and is introduced herein for data cleaning tasks. Within each class we discuss declarative realizations of predicates.
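The two tokenization choices described above can be sketched as follows; the q-gram definition below uses no padding characters, mirroring the 'db lab' example (some q-gram variants first pad the string with q−1 special characters at each end):

```python
def word_tokens(s: str) -> set:
    # Word-based tokenization: split on whitespace
    return set(s.split())

def qgrams(s: str, q: int = 3) -> set:
    # All substrings of q consecutive characters (no padding)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

print(word_tokens('db lab'))  # contains 'db' and 'lab'
print(qgrams('db lab'))       # contains 'db ', 'b l', ' la', 'lab'
```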
Overlap Predicates

Suppose Q is the set of tokens in the query string Q and D is the set of tokens in the string tuple D. The IntersectSize predicate [28] is simply the number of common tokens between Q and D, i.e.:

sim_intersect(Q, D) = |Q ∩ D|   (3.1)

Jaccard similarity [28] is the fraction of tokens in Q and D that are present in both, namely:

sim_Jaccard(Q, D) = |Q ∩ D| / |Q ∪ D|   (3.2)

If we assign a weight w(t) to each token t, we can define weighted versions of the above predicates (discussion of ways to assign such weights to tokens follows in subsequent chapters). WeightedMatch [28] is the total weight of common tokens in Q and D, i.e., Σ_{t ∈ Q ∩ D} w(t). Similarly, WeightedJaccard is the sum of the weights of the tokens in Q ∩ D divided by the sum of the weights of the tokens in Q ∪ D.

Aggregate Weighted Predicates

The predicates in this class encompass predicates widely adopted from information retrieval (IR). A basic task in IR is, given a query, identifying the documents relevant to that query. In our context, we would like to identify the tuples in a relation that are similar to a query string. Given a query string Q and a string tuple D, the similarity score of Q and D in this class of predicates is of the form sim(Q, D) = Σ_{t ∈ Q ∩ D} w_q(t, Q) · w_d(t, D), where w_q(t, Q) is the query-based weight of the token t in string Q and w_d(t, D) is the tuple-based weight of the token t in string D.

The tf-idf cosine similarity [26] between a query string Q and a string tuple D is defined as follows:

sim_cosine(Q, D) = Σ_{t ∈ Q ∩ D} w_q(t, Q) · w_d(t, D)   (3.3)

where w_q(t, Q), w_d(t, D) are the normalized tf-idf weights [26]. The normalized tf-idf weight of a token t in a string S, w(t, S), is given by:

w(t, S) = w'(t, S) / sqrt( Σ_{t' ∈ S} w'(t', S)^2 ),   where w'(t, S) = tf(t, S) · idf(t)

The idf term makes the weight of a token inversely proportional to its frequency in the database; the tf term makes it proportional to its frequency in S.
Intuitively, this assigns low scores to frequent tokens and high scores to rare tokens. More discussion is available elsewhere [6, 12].

The BM25 similarity score between a query string Q and a tuple D is given as:

sim_BM25(Q, D) = Σ_{t ∈ Q ∩ D} w_q(t, Q) · w_d(t, D)   (3.4)

where

w_q(t, Q) = ((k3 + 1) · tf(t, Q)) / (k3 + tf(t, Q))
w_d(t, D) = w^(1)(t, R) · ((k1 + 1) · tf(t, D)) / (K(D) + tf(t, D))

and w^(1) is a modified form of the Robertson-Sparck Jones weight:

w^(1)(t, R) = log( (N − n_t + 0.5) / (n_t + 0.5) )   (3.5)

K(D) = k1 · ((1 − b) + b · |D| / avgdl)

Here N is the number of tuples in the base relation R, n_t is the number of tuples in R containing the token t, tf(t, D) is the frequency of occurrence of the token t within tuple D, |D| is the number of tokens of tuple D, avgdl is the average number of tokens per tuple, i.e., (Σ_{D ∈ R} |D|) / N, and k1, k3, and b are independent parameters. For the TREC-4 experiments [25], k1 ∈ [1, 2], k3 = 8, and b ∈ [0.6, 0.75].

Language Modeling Predicates

A language model is a form of probabilistic model. To make the discussion concrete, we base it on a specific model introduced by Ponte and Croft [22]. Given a collection of documents, a language model is inferred for each; then the probability of generating a given query according to each of these models is estimated, and documents are ranked according to these probabilities. Considering an approximate selection query, each tuple in the database is considered a document; a model is inferred for each tuple, and the probability of generating the query given the model is the similarity between the query and the tuple.

The similarity score between query Q and tuple D is defined as:

sim_LM(Q, D) = p̂(Q | M_D) = Π_{t ∈ Q} p̂(t | M_D) × Π_{t ∉ Q} (1 − p̂(t | M_D))   (3.6)

where p̂(t | M_D) is the probability of token t occurring in tuple D and is given as follows:

p̂(t | M_D) = p̂_ml(t, D)^(1 − R̂_{t,D}) × p̂_avg(t)^(R̂_{t,D})   if tf(t, D) > 0
            = cf_t / cs                                         otherwise   (3.7)

p̂_ml(t, D) is the maximum likelihood estimate of the probability of the token t under the token distribution for tuple D, equal to tf(t, D) / dl_D, where tf(t, D) is the raw term frequency and dl_D is the total number of tokens in tuple D. p̂_avg(t) is the mean probability of token t in the documents containing it, i.e.:

p̂_avg(t) = ( Σ_{D : t ∈ D} p̂_ml(t | M_D) ) / df_t   (3.8)

where df_t is the document frequency of token t. This term is used because we only have a tuple-sized sample from the distribution of M_D; the maximum likelihood estimate is not reliable enough, so we need an estimate from a larger amount of data. The term R̂_{t,D} models the risk for a term t in a document D using a geometric distribution:

R̂_{t,D} = ( 1 / (1 + f̄_{t,D}) ) × ( f̄_{t,D} / (1 + f̄_{t,D}) )^{tf_{t,D}}   (3.9)

where f̄_{t,D} is the expected term count for token t in tuple D if the token occurred at the average rate, i.e., p̂_avg(t) × dl_D. The intuition behind this formula is that as the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate. Finally, cf_t is the raw count of token t in the collection, i.e., Σ_{D ∈ R} tf(t, D), and cs is the raw collection size, i.e., the total number of tokens in the collection, Σ_{D ∈ R} dl_D. The ratio cf_t / cs is used as the probability of observing a non-occurring token.

The query generation process can be modeled by a discrete Hidden Markov process. Figure 3.1 shows a simple yet powerful two-state HMM for this process. The first state, labeled "String", represents the choice of a token directly from the string.
The second state, labeled "General English", represents the choice of a token that is unrelated to the string but occurs commonly in queries. Suppose Q is the query string and D is a string tuple from the base relation R; the similarity score between Q and D, sim_HMM(Q, D), is equal to the probability of generating Q given that D is similar, that is:

P(Q | D is similar) = Π_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | D) )   (3.10)

where:

P(q | D) = (number of times q appears in D) / (length of D)   (3.11)

P(q | GE) = ( Σ_{D ∈ R} number of times q appears in D ) / ( Σ_{D ∈ R} length of D )   (3.12)

and a0 and a1 = 1 − a0 are the transition probabilities of the HMM. The values of these parameters can be optimized to maximize accuracy given training data.

Edit Based Predicates

An important and widely used class of string matching predicates is the class of edit-based predicates. In this class, the similarity between Q and D is based on the transformation cost of string Q to D, tc(Q, D). More specifically, tc(Q, D) is defined as the minimum cost of a sequence of edit operations that converts Q to D. Edit operations include copying, inserting, substituting, and deleting characters in Q and D [13]. Algorithms exist to compute tc(Q, D) in polynomial time [13], but their complexity is sensitive to the nature of the operations and their operands (individual characters, blocks of consecutive characters, etc). The edit similarity is then defined as:

sim_edit(Q, D) = 1 − tc(Q, D) / max{|Q|, |D|}   (3.13)

Edit operations have an associated cost. In the Levenshtein edit-distance [13], which we will refer to simply as edit-distance, the cost of the copy operation is zero and all other operations have unit cost. Other cost models are also possible [13].

Combination Predicates

We present a general similarity predicate and refer to it as generalized edit similarity (GES) (following [5]). Consider two strings Q and D that are tokenized into word tokens, and a weight function w(t) that assigns a weight to each word token t.
The transformation cost of string Q to D, tc(Q, D), is the minimum cost of transforming Q to D by a sequence of the following transformation operations:

• token replacement: replacing word token t1 in Q by word token t2 in D with cost [1 − sim_edit(t1, t2)] · w(t1), where sim_edit(t1, t2) is the edit similarity score between t1 and t2;

• token insertion: inserting a word token t into Q with cost c_ins · w(t), where c_ins is a constant token insertion factor with a value between 0 and 1;

• token deletion: deleting a word token t from Q with cost w(t).

Suppose wt(Q) is the sum of the weights of all word tokens in the string Q. We define the generalized edit similarity predicate between a query string Q and a tuple D as follows:

sim_GES(Q, D) = 1 − min( tc(Q, D) / wt(Q), 1.0 )   (3.14)

A related predicate is the SoftTFIDF predicate [7]. In SoftTFIDF, normalized tf-idf weights of word tokens are used along with cosine similarity and any other similarity function sim(t, r) to find the similarity between word tokens. The similarity score, sim_SoftTFIDF(Q, D), is equal to:

Σ_{t ∈ CLOSE(θ,Q,D)} w(t, Q) · w(argmax_{r ∈ D}(sim(t, r)), D) · max_{r ∈ D}(sim(t, r))   (3.15)

where w(t, Q), w(t, D) are the normalized tf-idf weights and CLOSE(θ, Q, D) is the set of words t ∈ Q such that there exists some v ∈ D with sim(t, v) > θ.

Chapter 4
Declarative Framework

We now describe declarative realizations of predicates in each class. We present declarative statements using standard SQL expressions. For all predicates, there is a preprocessing phase responsible for tokenizing the strings in the base relation R and for calculating and storing related weight values, which are subsequently utilized at query time. Tokenization of relation R (BASE_TABLE) creates the table
BASE_TOKENS(tid, token), where tid is a unique tuple identifier for each tuple of BASE_TABLE and token is an associated token (from the set of tokens corresponding to the tuple with identifier tid in BASE_TABLE). The query string is also tokenized on the fly (at query time), creating the table QUERY_TOKENS(token).

In the rest of this chapter, we present the SQL expressions required for preprocessing and for query-time approximate selections with the different predicates. In some cases, we rewrite formulas to make them amenable to a more efficient declarative realization. The main SQL statements are given along with their descriptions here; Appendix A contains detailed SQL expressions.

INSERT INTO INTERSECT_SCORES (tid, score)
SELECT R1.tid, COUNT(*)
FROM BASE_TOKENS R1, QUERY_TOKENS R2
WHERE R1.token = R2.token
GROUP BY R1.tid
Figure 4.1: SQL Code for IntersectSize
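The statement in Figure 4.1 can be exercised end to end on any SQL engine. A minimal sketch using Python's sqlite3 module, with two made-up base tuples (tuple 1 = {john, smith}, tuple 2 = {jane, doe}) and the query {john, smith}:

```python
import sqlite3

# Run the IntersectSize statement of Figure 4.1 against tiny
# illustrative tables; schema follows the thesis, data is made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE BASE_TOKENS (tid INTEGER, token TEXT);
CREATE TABLE QUERY_TOKENS (token TEXT);
CREATE TABLE INTERSECT_SCORES (tid INTEGER, score INTEGER);
INSERT INTO BASE_TOKENS VALUES (1,'john'),(1,'smith'),(2,'jane'),(2,'doe');
INSERT INTO QUERY_TOKENS VALUES ('john'),('smith');
""")
con.execute("""
INSERT INTO INTERSECT_SCORES (tid, score)
SELECT R1.tid, COUNT(*)
FROM BASE_TOKENS R1, QUERY_TOKENS R2
WHERE R1.token = R2.token
GROUP BY R1.tid
""")
print(con.execute("SELECT tid, score FROM INTERSECT_SCORES").fetchall())
# → [(1, 2)]
```

Only tuple 1 shares tokens with the query, so it is the only tuple reported, with a score equal to the size of the overlap.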
INSERT INTO JACCARD_SCORES (tid, score)
SELECT S1.tid, COUNT(*)/(S1.len+S2.len-COUNT(*))
FROM BASE_DDL S1, QUERY_TOKENS R2,
     (SELECT COUNT(*) AS len FROM QUERY_TOKENS) S2
WHERE S1.token = R2.token
GROUP BY S1.tid, S1.len, S2.len
Figure 4.2: SQL Code for Jaccard Coefficient
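Figure 4.2 can likewise be run on SQLite; the sketch below adds a CAST because SQLite performs integer division on integer operands, whereas the figure assumes the division yields a real-valued score. Data is illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE BASE_DDL (tid INTEGER, token TEXT, len INTEGER);
CREATE TABLE QUERY_TOKENS (token TEXT);
CREATE TABLE JACCARD_SCORES (tid INTEGER, score REAL);
-- tuple 1 = {john, smith}, tuple 2 = {jane, doe}; len = token count
INSERT INTO BASE_DDL VALUES (1,'john',2),(1,'smith',2),(2,'jane',2),(2,'doe',2);
INSERT INTO QUERY_TOKENS VALUES ('john'),('xavier');
""")
con.execute("""
INSERT INTO JACCARD_SCORES (tid, score)
SELECT S1.tid, CAST(COUNT(*) AS REAL)/(S1.len+S2.len-COUNT(*))
FROM BASE_DDL S1, QUERY_TOKENS R2,
     (SELECT COUNT(*) AS len FROM QUERY_TOKENS) S2
WHERE S1.token = R2.token
GROUP BY S1.tid, S1.len, S2.len
""")
print(con.execute("SELECT tid, score FROM JACCARD_SCORES").fetchall())
# → [(1, 0.3333333333333333)], i.e. |{john}| / |{john, smith, xavier}|
```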
The IntersectSize predicate requires token generation to be completed in a preprocessing step. The SQL statements conducting such a tokenization, which is common to all predicates we discuss, are available in Appendix A. The SQL statement for approximate selections with the IntersectSize predicate is shown in Figure 4.1.

The Jaccard coefficient predicate can be computed efficiently by storing the number of tokens of each tuple of BASE_TABLE during the preprocessing step. For this reason we create a table BASE_DDL(tid, token, len), where len is the number of tokens in the tuple with tuple-id tid. The SQL statement for conducting approximate selections with the Jaccard predicate is presented in Figure 4.2.

The weighted overlap predicates require calculation and storage of the related weights for the tokens of the base relation during preprocessing. For the WeightedMatch predicate, we store during the preprocessing step the weight of each token redundantly with each (tid, token) pair in a table BASE_TOKENS_WEIGHTS(tid, token, weight), in order to avoid an extra join with a table BASE_WEIGHT(token, weight) at query time. To calculate the similarity score at query time, we use an SQL statement similar to that used for the IntersectSize predicate (shown in Figure 4.1), but with table BASE_TOKENS replaced by BASE_TOKENS_WEIGHTS and COUNT(*) replaced by SUM(R1.weight).

For the WeightedJaccard predicate, we create during preprocessing a table BASE_DDL(tid, token, weight, len), where weight is the weight of token and len is the sum of the weights of the tokens in the tuple with tuple-id tid. The SQL statement for approximate selections using this predicate is the same as the one shown in Figure 4.2, but with COUNT(*) replaced by SUM(weight).

INSERT INTO SIM_SCORES (tid, score)
SELECT R1W.tid, SUM(R1W.weight*R2W.weight)
FROM BASE_WEIGHTS R1W, QUERY_WEIGHTS R2W
WHERE R1W.token = R2W.token
GROUP BY R1W.tid

Figure 4.3: SQL Code for Aggregate Weighted Predicates

The SQL implementation of the tf-idf cosine similarity predicate has been presented in [12]. During preprocessing, we store the tf-idf weights for the base relation in relation BASE_WEIGHTS(tid, token, weight). A weight table QUERY_WEIGHTS(token, weight) for the query string is created on the fly at query time. The SQL statement in Figure 4.3 then calculates the similarity score for each tuple of the base table.

Realization of BM25 in SQL involves generating the table BASE_WEIGHTS(tid, token, weight) storing the weights for the tokens in each tuple of the base relation. These weights (w_d(t, D)) consist of two parts that can be considered modified versions of tf and idf. For the complete set of SQL statements implementing the required preprocessing, refer to Appendix A. The query weights table QUERY_WEIGHTS(token, weight) can be created on the fly using the following subquery:

(SELECT TF.token, TF.tf*(k3+1)/(k3+TF.tf) AS weight
 FROM (SELECT T.token, COUNT(*) AS tf
       FROM QUERY_TOKENS T
       GROUP BY T.token) TF)

The SQL statement shown in Figure 4.3 then calculates the BM25 similarity scores.
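The weight computation that the BM25 preprocessing and query statements implement can be sketched directly from equations (3.4) and (3.5). The sketch below is a plain-Python rendering, not the thesis's SQL; the parameter values are picked from the ranges cited in Chapter 3 and the relation is made up:

```python
import math

# Sketch of BM25 scoring per equations (3.4)-(3.5); k1, k3, b are
# illustrative choices from the cited TREC-4 ranges.
def bm25(query_tokens, tuple_tokens, relation, k1=1.2, k3=8.0, b=0.75):
    N = len(relation)
    avgdl = sum(len(d) for d in relation) / N
    score = 0.0
    for t in set(query_tokens) & set(tuple_tokens):
        n_t = sum(1 for d in relation if t in d)
        w1 = math.log((N - n_t + 0.5) / (n_t + 0.5))          # eq. (3.5)
        K = k1 * ((1 - b) + b * len(tuple_tokens) / avgdl)
        w_q = (k3 + 1) * query_tokens.count(t) / (k3 + query_tokens.count(t))
        w_d = w1 * (k1 + 1) * tuple_tokens.count(t) / (K + tuple_tokens.count(t))
        score += w_q * w_d
    return score

R = [['john', 'smith'], ['jane', 'doe'], ['john', 'doe'], ['bob', 'jones']]
print(bm25(['john', 'smith'], R[0], R))
```

Note that the declarative realization precomputes w_d for every (tid, token) pair in BASE_WEIGHTS, so at query time only the join and the sum of Figure 4.3 remain.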
In order to calculate the language modeling scores efficiently, we rewrite the formulas and drop some terms that do not affect the overall accuracy of the metric. Calculating the values in equations (3.8) and (3.9) is easy. We build the following relations during preprocessing:

BASE_TF(tid, token, tf), where tf = tf_{token,tid}
BASE_DL(tid, dl), where dl = dl_{tid}
BASE_PML(tid, token, pml), where pml = p̂_ml = tf_{token,tid} / dl_{tid}
BASE_PAVG(token, pavg), where pavg = p̂_avg(token)
BASE_FREQ(tid, token, freq), where freq = f̄_{token,tid}
BASE_RISK(tid, token, risk), where risk = R̂_{token,tid}

Equation (3.6) can be rewritten as:

p̂(Q | M_D) = [ Π_{t ∈ Q} p̂(t | M_D) × Π_{∀t} (1 − p̂(t | M_D)) ] / Π_{t ∈ Q} (1 − p̂(t | M_D))   (4.1)

We slightly change (4.1) to the following:

p̂(Q | M_D) = [ Π_{t ∈ Q} p̂(t | M_D) × Π_{∀t ∈ D} (1 − p̂(t | M_D)) ] / Π_{t ∈ Q ∩ D} (1 − p̂(t | M_D))   (4.2)

This change results in a large performance gain, since the computation is restricted to the tokens of the query and the tokens of a tuple (as opposed to the entire set of tokens present in the base relation). Experiments demonstrate that accuracy is not considerably affected.

In equation (3.7), we only materialize the first part (i.e., the values for tokens that are present in the tuple D) in the relation BASE_PM during preprocessing (storing the second part would result in an unnecessary waste of space). We therefore have to divide all formulas that use p̂(t | M_D) into two parts: one for the tokens present in the tuple under consideration and one for all other tokens. So we rewrite the first term in equation (4.2) as follows:

Π_{t ∈ Q} p̂(t | M_D) = Π_{t ∈ Q ∩ D} p̂(t | M_D) × Π_{t ∈ Q − D} p̂(t | M_D)
                     = Π_{t ∈ Q ∩ D} p̂(t | M_D) × Π_{t ∈ Q − D} (cf_t / cs)
                     = Π_{t ∈ Q ∩ D} p̂(t | M_D) × ( Π_{t ∈ Q} (cf_t / cs) ) / ( Π_{t ∈ Q ∩ D} (cf_t / cs) )   (4.3)

The term Π_{t ∈ Q} (cf_t / cs) in the above formula is constant for any specific query string, so it can be dropped, since the goal is to find the most similar tuples by ranking them based on the similarity scores. Therefore, equation (4.2) can be written as follows:

p̂(Q | M_D) = ( Π_{t ∈ Q ∩ D} p̂(t | M_D) ) / ( Π_{t ∈ Q ∩ D} (cf_t / cs) ) × ( Π_{∀t ∈ D} (1 − p̂(t | M_D)) ) / ( Π_{t ∈ Q ∩ D} (1 − p̂(t | M_D)) )   (4.4)

This transformation allows us to efficiently compute similar tuples by just storing p̂(t | M_D) and cf_t / cs for each pair of t and D. Thus, we create the table BASE_PM(tid, token, pm, cfcs), where pm = p̂(token | M_tid) and cfcs = cf_token / cs, as the final result of the preprocessing step. We also calculate and store the term Π_{∀t ∈ D} (1 − p̂(t | M_D)) during preprocessing, in relation BASE_SUMCOMPM(tid, sumcompm).

INSERT INTO LM_SCORES (tid, score)
SELECT B1.tid2, EXP(B1.score + B2.sumcompm)
FROM (SELECT P1.tid AS tid1, T2.tid AS tid2,
             SUM(LOG(P1.pm)) - SUM(LOG(1.0-P1.pm))
             - SUM(LOG(P1.cfcs)) AS score
      FROM BASE_PM P1, QUERY_TOKENS T2
      WHERE P1.token = T2.token
      GROUP BY P1.tid, T2.tid) B1,
     BASE_SUMCOMPM B2
WHERE B1.tid1 = B2.tid

Figure 4.4: SQL Code for Language Modeling

The query-time SQL statement that calculates the similarity scores is shown in Figure 4.4. The subquery in the statement computes the three terms of equation (4.4) that involve the intersection of the query and tuple tokens and therefore needs a join between the two token tables. The fourth term of the equation is read from the table stored during preprocessing, as described above.

We rewrite equation (3.10) as follows:

P(Q | D is similar) = Π_{q ∈ Q} ( a0 · P(q | GE) + a1 · P(q | D) )
                    = Π_{q ∈ Q} a0 · P(q | GE) × Π_{q ∈ Q} ( 1 + (a1 · P(q | D)) / (a0 · P(q | GE)) )   (4.5)

For a specific query, the term Π_{q ∈ Q} a0 · P(q | GE) in the above formula is constant for all tuples in the base relation and can therefore be dropped, since our goal is to order tuples based on their similarity to a specific query string. The modified similarity score is then:

sim_HMM(Q, D) = Π_{q ∈ Q} ( 1 + (a1 · P(q | D)) / (a0 · P(q | GE)) )
              = Π_{q ∈ Q ∩ D} ( 1 + (a1 · P(q | D)) / (a0 · P(q | GE)) )   (4.6)

In equation (4.6), q ∈ Q changes to q ∈ Q ∩ D because P(q | D) = 0 for all q ∉ D. Thus we can calculate the term (1 + (a1 · P(q | D)) / (a0 · P(q | GE))) for all (tid, token) pairs during preprocessing and store it as weight in relation BASE_WEIGHTS(tid, token, weight). Notice that the term P(q | D) is equal to p̂_ml(q, D) in language modeling; we use a relation BASE_PML(tid, token, pml) for it. Calculating P(q | GE) and storing it in relation BASE_PTGE(token, ptge) is also fairly simple. The SQL statements for preprocessing and for calculating the similarity scores are shown in Figure 4.5.
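Before its declarative rewriting, the HMM score of equation (3.10) can be sketched directly in a few lines of Python; the transition probability a1 and the data below are illustrative, not values from the thesis:

```python
# Sketch of the two-state HMM score, equations (3.10)-(3.12).
def hmm_score(query_tokens, tuple_tokens, relation, a1=0.8):
    a0 = 1 - a1
    total = sum(len(d) for d in relation)
    score = 1.0
    for q in query_tokens:
        p_d = tuple_tokens.count(q) / len(tuple_tokens)    # eq. (3.11)
        p_ge = sum(d.count(q) for d in relation) / total   # eq. (3.12)
        score *= a0 * p_ge + a1 * p_d                      # eq. (3.10)
    return score

R = [['john', 'smith'], ['jane', 'doe']]
print(hmm_score(['john'], R[0], R))   # ≈ 0.2*0.25 + 0.8*0.5 = 0.45
```

The declarative realization instead stores the per-token factor of equation (4.6) in BASE_WEIGHTS and multiplies the factors at query time via EXP(SUM(LOG(...))).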
Preprocessing:
INSERT INTO BASE_WEIGHTS(tid, token, weight)
SELECT M2.tid, M2.token,
       (1 + (a1*M2.pml) / (a0*P2.ptge))
FROM BASE_PTGE P2, BASE_PML M2
WHERE P2.token = M2.token

Query:
INSERT INTO HMM_SCORES (tid, score)
SELECT W1.tid, EXP(SUM(LOG(W1.weight)))
FROM BASE_WEIGHTS W1, QUERY_TOKENS T2
WHERE W1.token = T2.token
GROUP BY W1.tid

Figure 4.5: SQL Code for HMM

We use the declarative framework proposed in [11] for approximate matching based on edit-distance. The idea is to use properties of the q-grams created from the strings to generate a candidate set in such a way that no false negatives are guaranteed to exist, although the set may contain false positives. The set is subsequently filtered by computing the exact edit similarity score between the query and the strings in the candidate set. Computing the edit similarity score is performed using a UDF. The SQL statements for candidate set generation and score calculation are available in [11].
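The UDF that filters the candidate set computes the edit similarity of equation (3.13); a sketch of such a function, using the standard dynamic program for unit-cost Levenshtein distance, might look as follows:

```python
def edit_distance(a: str, b: str) -> int:
    # Row-by-row dynamic program for Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # copy/substitute
        prev = cur
    return prev[-1]

def sim_edit(q: str, d: str) -> float:
    # Equation (3.13): normalize by the longer string
    return 1 - edit_distance(q, d) / max(len(q), len(d))

print(sim_edit('smith', 'smyth'))   # one substitution over length 5, ≈ 0.8
```

In the declarative framework this function would be registered with the database engine as a UDF and applied only to the (small) candidate set, not to the whole relation.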
Since the calculation of the GES score function (Equation 3.14) between a query string and all tuples in a relation can be very expensive, we first identify a candidate set of tuples, similarly to the methodology used for edit-distance, and then use a UDF to compute exact scores between the query string and the strings in the candidate set. The elements of the candidate set are selected using a threshold θ and the following score formula, which ignores the ordering between word tokens and over-estimates sim_GES(Q, D) [4]:

sim_GES_Jaccard(Q, D) = (1 / wt(Q)) Σ_{t ∈ Q} w(t) · max_{r ∈ D} ( (2/q) · sim_Jaccard(t, r) + d_q )   (4.7)

Here wt(Q) is the sum of the weights of all word tokens in Q, w(t) is the idf weight of word token t, q is a positive integer indicating the length of the q-grams extracted from words in order to calculate sim_Jaccard(t, r), and d_q = (1 − 1/q) is an adjustment term. To enhance the performance of the operation, we can employ min-wise independent permutations [3] to approximate sim_Jaccard(t1, t2) in Equation 4.7. A description of min-wise independent permutations is beyond the scope of this thesis. This substitution replaces sim_Jaccard with the min-hash similarity sim_mh(t1, t2), which is a provable approximation. The resulting metric, GES_apx, is shown to be an upper bound for GES in expectation [4]:

sim_GES_apx(Q, D) = (1 / wt(Q)) Σ_{t ∈ Q} w(t) · max_{r ∈ D} ( (2/q) · sim_mh(t, r) + d_q )   (4.8)

In order to implement the above predicates, we need to preprocess the relation using the following methodology:

• Tokenization in two levels, first tokenizing into words and then tokenizing each word into q-grams. Word tokens are stored in relation BASE_TOKENS(tid, token) and q-grams are stored in BASE_QGRAMS(tid, token, qgram).

• Storing the idf weights of word tokens in relation BASE_IDF(token, idf), as well as the average of the idf weights in the base relation, to be used as the idf weight of unseen tokens.

• Calculating the weights related to the similarity employed to compare tokens, i.e., sim(t, r). For GES_Jaccard, employing the Jaccard predicate, this includes storing the number of q-grams of each word token in relation BASE_TOKENSIZE(tid, token, len). For GES_apx, we have to calculate the min-hash signatures (required by min-wise independent permutations). The SQL statements for generating min-hash signatures and min-hash similarity scores, sim_mh(t, r), are available in Appendix A.

INSERT INTO GESAPX_RESULTS(tid, score)
SELECT MS.tid, 1.0/SI.sumidf *
       SUM(I.idf*(((2.0/q)*MS.maxsim)+(1-1.0/q)))
FROM MAXSIM MS, QUERY_IDF I, SUM_IDF SI
WHERE MS.token = I.token
GROUP BY MS.tid, SI.sumidf

Figure 4.6: SQL Code for GES_apx and GES_Jaccard
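The min-hash similarity sim_mh used by GES_apx can be sketched compactly. The version below uses Python's built-in hash of (seed, token) pairs as a stand-in for a family of min-wise independent permutations, so it is only an illustration of the idea, not the construction from [3]; the signature length is arbitrary:

```python
# Min-hash approximation of Jaccard similarity between token sets.
# hash((s, t)) stands in for the s-th random permutation of the
# token universe; the fraction of matching minima estimates Jaccard.
def minhash_signature(tokens, seeds):
    return [min(hash((s, t)) for t in tokens) for s in seeds]

def sim_mh(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

seeds = range(64)
a = {'sm', 'mi', 'it', 'th'}   # 2-grams of 'smith'
b = {'sm', 'my', 'yt', 'th'}   # 2-grams of 'smyth'
print(sim_mh(minhash_signature(a, seeds), minhash_signature(b, seeds)))
```

With enough seeds the printed value concentrates around the true Jaccard similarity of the two q-gram sets (here 2/6); the declarative realization stores such signatures per word token during preprocessing.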
We omit most of the SQL statements in this chapter. To make the presented statements more readable, we assume that the following auxiliary relations are available to us; in practice, they are calculated on the fly as subqueries at query execution time (refer to the Appendix for the complete queries):

• QUERY_IDF(token, idf) stores the idf weight of each token in the query. Weights are retrieved from the base weights relation, and the average idf value over all tokens in the base relation is used as the weight of query tokens not present in the base relation.

• SUM_IDF(token, sumidf) stores the sum of the idf weights of the query tokens.

• MAXSIM(tid, token, maxsim) stores the maximum of the similarity scores between the tokens of tuple tid and each token in the query.

Assuming these relations are available, the SQL statements for computing the scores of GES_apx and GES_Jaccard are shown in Figure 4.6.

SoftTFIDF can also be implemented similarly to the GES approximation predicates. During preprocessing, we first tokenize each string into word tokens and store them in
BASE_TOKENS(tid, token). Depending on the function used for the similarity score between word tokens, we may need to tokenize each word token into q-grams as well. We then store the normalized tf-idf weights of the tokens of each tuple in the base relation in BASE_WEIGHTS(tid, token, weight).

• QUERY_WEIGHTS(token, weight) stores the normalized tf-idf weight of each token in the query table.

• CLOSE_SIM_SCORES(tid, token1, token2, sim) stores the similarity score of each token in the query (token2) with each token of each tuple in the base relation, where the score is greater than a threshold θ (specified at query time). Such a score could be computed using a declarative realization of some other similarity predicate, or a UDF computing similarity using a string distance scheme (e.g., Jaro-Winkler [29]).

• MAXSIM(tid, token, maxsim) stores the maximum of the sim scores for each query token among all tids in the CLOSE_SIM_SCORES relation.

• MAXTOKEN(tid, token1, token2, maxsim) also stores arg max_{r ∈ tid} sim(token2, r), i.e., the token in each tuple of the base relation that has the maximum similarity with query token token2 in CLOSE(θ, Q, D).

Figure 4.7 shows the SQL statements for computing the MAXTOKEN table and the final similarity score for SoftTFIDF.

INSERT INTO MAXTOKEN(tid, token1, token2, maxsim)
SELECT CS.tid, CS.token1, CS.token2, MS.maxsim
FROM MAXSIM MS, CLOSE_SIM_SCORES CS
WHERE CS.tid = MS.tid AND
      CS.token2 = MS.token2 AND MS.maxsim = CS.sim

INSERT INTO SOFTTFIDF_RESULTS(tid, score)
SELECT TM.tid, SUM(I.weight * WB.weight * TM.maxsim)
FROM MAXTOKEN TM, QUERY_WEIGHTS I, BASE_WEIGHTS WB
WHERE TM.token2 = I.token AND TM.tid = WB.tid
      AND TM.token1 = WB.token
GROUP BY TM.tid

Figure 4.7: SQL Code for SoftTFIDF (query time)

Chapter 5

Evaluation
We experimentally evaluate the performance of each of the similarity predicates presented thus far and compare their accuracy. The best similarity predicate in terms of accuracy depends highly on the type of dataset and the errors present in it, while the best choice in terms of performance depends on the characteristics of the specific predicates. We therefore evaluate (a) the accuracy of the predicates using different datasets with different error characteristics, and (b) the performance, by dividing preprocessing and query execution time into various phases to obtain a detailed understanding of the relative benefits and limitations. All our experiments are performed on a desktop PC running the MySQL 5.0.16 database server on Windows XP SP2, with a Pentium D 3.2 GHz CPU and 2 GB of RAM.
In the absence of a common benchmark for data cleaning, we resort to defining our own data generation scheme with controlled error. In order to generate datasets for our experiments, we modify and significantly enhance the UIS database generator, which has effectively been used in the past to evaluate duplicate detection algorithms [14]. We use the data generator to inject different types and percentages of errors into a dataset of clean tuples. The parameters of the data generator include:

• the size of the dataset to be generated;

• the fraction of clean tuples to be utilized to generate erroneous duplicates;

• the distribution of duplicates: the number of duplicates generated for a clean tuple can follow a uniform, Zipfian or Poisson distribution;

• the percentage of erroneous duplicates: the fraction of duplicate tuples in which errors are injected by the data generator;

• the extent of error in each erroneous tuple: the percentage of characters that will be selected for injecting character edit errors (character insertion, deletion, replacement or swap) in each tuple selected for error injection;

• the token swap error: the percentage of word pairs that will be swapped in each tuple that is selected for error injection.

We use two different sources of data: a dataset consisting of company names and a dataset consisting of DBLP titles. Statistical details for the two datasets are shown in Table 5.1. For the company names dataset, we also inject domain specific abbreviation errors, e.g., replacing Inc. with Incorporated and vice versa.

For both datasets, we generate different erroneous datasets by varying the parameters of the data generator as shown in Table 5.2.

Table 5.1: Statistics of Clean Datasets

parameter            range
size of dataset      5k - 100k
Table 5.2: Range of Parameters Used for Erroneous Datasets

We show accuracy results for 8 different erroneous datasets generated from a dataset of company names, each containing 5,000 tuples generated from 500 clean records, with uniform distribution. We choose to limit the size of the datasets to facilitate experiments and data collection, since each experiment is run multiple times to obtain statistical significance. We conducted experiments with datasets of increasing size, and we observed that the overall accuracy trend presented remains the same. We consider the results presented highly representative across erroneous datasets (generated according to our methodology) of varying sizes and duplicate distributions. We classify these 8 datasets into dirty, medium and low error datasets based on the parameters of data generation. We have also generated 5 datasets, each having only one specific type of error, in order to evaluate the effect of specific error types. Table 5.3 provides more details on the datasets. Table 5.4 shows a sample of duplicates generated by the data generator from the CU1 and CU5 datasets.

                  Percentage of
Class    Name   erroneous    errors in    token        abbr.
                duplicates   duplicates   swap error   error
Dirty    CU1    90           30           20           50
Dirty    CU2    50           30           20           50
Medium   CU3    30           30           20           50
Medium   CU4    10           30           20           50
Medium   CU5    90           10           20           50
Medium   CU6    50           10           20           50
Low      CU7    30           10           20           50
Low      CU8    10           10           20           50
-        F1     50           0            0            50
-        F2     50           0            20           0
-        F3     50           10           0            0
-        F4     50           20           0            0
-        F5     50           30           0            0

Table 5.3: Classification of Datasets

CU1:
  Stsalney Morgan cncorporsated Group
  jMorgank Stanlwey Grouio Inc.
  Morgan Stanley Group Inc.
  Sanlne Morganj Inocrorpated Group
  Sgalet Morgan Icnorporated Group

CU5:
  Morgan Stanle Grop Incorporated
  Stalney Morgan Group Inc.
  Morgan Stanley Group In.
  Stanley Moragn Grou Inc.
  Morgan Stanley Group Inc.
Table 5.4: Sample Tuples from CU1 & CU5 Datasets
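A drastically simplified sketch of the kind of error injection the generator performs is shown below. The real UIS-based generator also controls duplicate distributions and abbreviation errors; the function and parameter names here are made up for illustration:

```python
import random

def inject_errors(s, edit_frac=0.2, swap_frac=0.2, rng=None):
    """Inject character edit errors into a fraction of positions and
    swap one adjacent word pair with probability swap_frac."""
    rng = rng or random.Random(0)
    chars = list(s)
    for _ in range(int(len(chars) * edit_frac)):
        i = rng.randrange(len(chars))
        op = rng.choice(["insert", "delete", "replace", "swap"])
        c = rng.choice("abcdefghijklmnopqrstuvwxyz")
        if op == "insert":
            chars.insert(i, c)
        elif op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:
            chars[i] = c  # replacement (also the fallback case)
    words = "".join(chars).split()
    if len(words) >= 2 and rng.random() < swap_frac:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```

Calling this repeatedly on a clean tuple such as "Morgan Stanley Group Inc." produces duplicates resembling those in Table 5.4.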
We measure the accuracy of the predicates utilizing known methods from the information retrieval literature, in accordance with common practice in IR [27]. We compute the Mean Average Precision (MAP) and the Mean Maximum F1 scores of the rankings of each dataset imposed by approximate selection queries utilizing our predicates. Average Precision (AP) is the average of the precision after each similar record is retrieved, i.e.,

\[
AP = \frac{\sum_{r=1}^{N} P(r) \times rel(r)}{\text{number of relevant records}} \tag{5.1}
\]

where N is the total number of records returned, r is the rank of a record, i.e., the position of the record in the result list sorted by decreasing similarity score, P(r) is the precision at rank r, i.e., the ratio of the number of relevant records having rank ≤ r to the total number of records having rank ≤ r, and rel(r) is 1 if the record at rank r is relevant to the query and 0 otherwise. This measure emphasizes returning more similar strings earlier. MAP is the mean AP value over a set of queries. The Maximum F1 measure is the maximum F1 score (the harmonic mean of precision and recall) over the ranking of records, i.e.,

\[
\max_r \left[ \frac{2 \times Pr(r) \times Re(r)}{Pr(r) + Re(r)} \right] \tag{5.2}
\]

where Pr(r) and Re(r) are the precision and recall values at rank r. Pr(r) is as defined above; Re(r) is the ratio of the number of relevant records having rank ≤ r to the total number of relevant records. Again, we compute the mean maximum F1 over a set of queries.

Our data generation methodology allows us to easily associate a clean tuple with all the erroneous versions of the tuple generated by our data generator. A clean tuple and its erroneous duplicates are assigned the same cluster id. Essentially, each time we pick a tuple from a cluster and use its string attribute as a query, we consider all the tuples in the same cluster (tuples with the same cluster id) as relevant to this query. For each query and a specific predicate, we return a list of tuples sorted in order of decreasing similarity score. Thus, it is easy to identify relevant and irrelevant records among the results returned for a specific query and similarity predicate.
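Equations 5.1 and 5.2 translate directly into code. A small sketch, where rel is the 0/1 relevance list of the ranked results for one query:

```python
def average_precision(rel, num_relevant):
    """Equation 5.1: average of P(r) over the ranks of relevant records."""
    hits, total = 0, 0.0
    for r, is_rel in enumerate(rel, start=1):
        if is_rel:
            hits += 1
            total += hits / r          # P(r) at a relevant rank
    return total / num_relevant

def max_f1(rel, num_relevant):
    """Equation 5.2: maximum harmonic mean of Pr(r) and Re(r) over ranks."""
    hits, best = 0, 0.0
    for r, is_rel in enumerate(rel, start=1):
        hits += is_rel
        pr, re = hits / r, hits / num_relevant
        if pr + re > 0:
            best = max(best, 2 * pr * re / (pr + re))
    return best
```

MAP and the mean maximum F1 are then simply the means of these per-query values over the query workload.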
In order to keep our evaluation independent of any threshold constants (specified in approximate selection predicates), we do not prune this list using thresholds. For each dataset, we compute the mean average precision and mean maximum F1 measure over 500 randomly selected queries taken from that dataset (notice that our query workload contains both clean and erroneous tuples). Thus, our accuracy results represent the expected behaviour of the predicates over queries and thresholds. We report values for MAP only, since the results were consistently similar for the maximum F1 measure in all our experiments.

Both the WeightedMatch (WM) and WeightedJaccard (WJ) predicates require a weighting scheme to assign weights to the tokens. It is desirable to use a weighting scheme which captures the importance of tokens. We compared idf and the Robertson-Sparck Jones (RS) weighting scheme given in Equation 3.5, and found that RS weights lead to better accuracy. So, in the following discussion, we use RS weights for the weighted overlap predicates.

For all predicates proposed previously in the literature, we set any parameter values they require for tuning as suggested in the respective papers. For the predicates presented herein for data cleaning tasks: for BM25, we set k1 = 1.5, k2 = 8 and b = 0.675; for HMM, we set a to 0.2, although our experiments show that the accuracy results are not very sensitive to the value of a as long as a reasonable value is chosen (i.e., a value not close to 0 or 1).

The SoftTFIDF predicate requires a similarity predicate over the word tokens. We experimented with various similarity predicates such as Jaccard, IntersectSize, edit distance and Jaro-Winkler, and chose Jaro-Winkler, since SoftTFIDF with Jaro-Winkler (STfIdf w/JW) performs the best. This was also observed in [7]. Two words are similar in SoftTFIDF if their similarity score exceeds a given threshold θ; SoftTFIDF with Jaro-Winkler performed best with θ = 0.8.
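The SoftTFIDF computation (cf. Figure 4.7) can also be sketched procedurally. Here word_sim stands in for the word-level predicate (Jaro-Winkler in our setting), and the weight dictionaries are hypothetical normalized tf-idf weights:

```python
def soft_tfidf(query_weights, base_weights, word_sim, theta=0.8):
    """SoftTFIDF score of one base tuple against a query.

    query_weights / base_weights: token -> normalized tf-idf weight.
    word_sim: any [0, 1]-valued word-level similarity function.
    Only pairs with similarity above theta (the CLOSE set) contribute."""
    score = 0.0
    for qt, qw in query_weights.items():
        # Best-matching base token for this query token (MAXSIM/MAXTOKEN).
        best_sim, best_tok = max((word_sim(qt, bt), bt) for bt in base_weights)
        if best_sim > theta:
            score += qw * base_weights[best_tok] * best_sim
    return score
```

With an exact-match word similarity the score is plainly insensitive to word order, which is why SoftTFIDF handles token swap errors well.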
Finally, we set c_ins for the GES predicate to 0.5, as proposed in [4]. For calculating accuracy, we use the exact GES as shown in Equation 3.14. We remark that we do not prune the results based on any threshold, in order to keep the evaluation independent of the threshold values.

Q-gram generation is a common preprocessing step for all predicates. We use an SQL statement similar to that presented in [11] to generate q-grams, with a slightly different approach. We first insert q - 1 special characters at the beginning and end of each string.

Type of Error           Xect   Jac.   WM     WJ     Cosine, BM25,   ED     GES    STfIdf
                                                    LM, HMM                       w/JW
abbr. error (F1)        0.94   0.96   0.98   1.0    1.0             0.89   1.0    1.0
token swap error (F2)   1.0    1.0    1.0    1.0    1.0             0.77   0.94   1.0

Table 5.5: Accuracy: Abbreviation and Token Swap Errors

Tokenization into words captures all errors caused by different orderings of words, e.g., "Department of Computer Science" and "Computer Science Department". For q-gram generation, we also need an optimal value of the q-gram size (q). A lower value of q ignores the ordering of characters in the string, while a higher value cannot capture the edit errors; an optimal value is required to capture the edit errors while taking into account the ordering of characters in the string. The table below shows the accuracy comparison of different q-gram based predicates (Jaccard, tf-idf (Cosine), HMM and BM25) on the dirty cluster of our datasets:

q    Jaccard   Cosine   HMM     BM25
2    0.736     0.783    0.835   0.840
3    0.671     0.769    0.807   0.805

The trend is similar for the other predicates, and the accuracy drops further for higher values of q. Thus, we set q = 2, as it achieves the best accuracy results.

In this section we present a detailed comparison of the effectiveness of the similarity predicates in capturing the different types of error introduced in the data.
Abbreviation error: Due to abbreviation errors, a tuple AT&T Incorporated gets converted to AT&T Inc. Note that Incorporated and Inc are frequent words in the company names database. For the query AT&T Incorporated, the unweighted overlap predicates Jaccard (Jac.) and IntersectSize (Xect) will assign to the tuple IBM Incorporated a greater similarity score than to the tuple AT&T Inc, since they just try to match tuples on the basis of common q-grams. Edit distance (ED) will behave similarly, since it is cheaper to convert AT&T Incorporated to IBM Incorporated than to AT&T Inc. The weight based predicates are robust to abbreviation errors, since they assign high weights to tokens corresponding to rare (important) words, e.g., AT&T. Table 5.5 presents the accuracy of the predicates for the case of a dataset with only abbreviation errors (dataset F1). All the other predicates, namely WeightedMatch (WM), WeightedJaccard (WJ), tf-idf (Cosine), BM25, HMM, Language Modeling (LM) and SoftTFIDF (STfIdf w/JW), had near perfect accuracy. Similar behaviour is observed when the percentage of duplicates and abbreviation errors is varied.
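The effect is easy to reproduce on a word-token toy version of this example (the actual predicates operate on q-grams, and the idf values below are hypothetical):

```python
def jaccard(a, b):
    """Unweighted Jaccard over token sets."""
    return len(a & b) / len(a | b)

def weighted_jaccard(a, b, w):
    """Jaccard where each token counts with its idf weight."""
    inter = sum(w[t] for t in a & b)
    union = sum(w[t] for t in a | b)
    return inter / union if union else 0.0

# Hypothetical idf weights: rare tokens are heavy, frequent ones light.
idf = {"at&t": 3.0, "ibm": 3.0, "incorporated": 0.2, "inc": 0.2}
query = {"at&t", "incorporated"}
good = {"at&t", "inc"}          # the true duplicate
bad = {"ibm", "incorporated"}   # shares only the frequent token
```

Unweighted Jaccard scores both candidates 1/3, while the weighted version strongly prefers AT&T Inc over IBM Incorporated.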
Token swap errors: Due to token swap errors, a tuple Beijing Hotel gets converted to Hotel Beijing. Suppose there is a tuple Beijing Labs present in the database, where Labs and Hotel are equally important tokens but more frequent than Beijing. For the query Beijing Hotel, edit distance and GES will claim Beijing Labs more similar to the query than Hotel Beijing. We remark that for accuracy calculation we use the exact GES, as shown in Equation 3.14. All other predicates ignore the order of words, and hence perform well for token swap errors. Table 5.5 shows the accuracy of the predicates for a dataset with only token swap errors (dataset F2). All the other predicates had near perfect accuracy. A similar trend is observed when the percentage of duplicates and token swap errors is varied.

Edit errors: Edit errors involve character insertion, deletion, replacement and character swaps. The number of positions of a string at which an edit error has occurred defines the extent of the edit error. All the predicates discussed above are robust to low edit error, but their accuracy degrades as the extent of edit error increases. Table 5.6 shows the accuracy results for various predicates for increasing edit error in the data (datasets F3, F4 and F5). Predicates giving nearly equal accuracy are grouped together.

Predicate group                         F3    F4    F5
GES                                     1.0   .99   .97
BM25, HMM, LM, STfIdf w/JW              1.0   .97   .91
edit distance                           .99   .97   .90
WM, WJ, Cosine                          .99   .93   .85
Jaccard (Jac.), IntersectSize (Xect)    .99   .91   .81

Table 5.6: Accuracy: Only Edit Errors

GES is the most resilient to edit errors. Edit distance, designed to capture edit errors, has average performance. BM25, STfIdf w/JW, and the probabilistic predicates (LM and HMM) are competitive in catching edit errors and perform slightly better than edit distance. The weighted overlap predicates (WM and WJ) with RS weights perform equivalently to tf-idf (Cosine), but not as well as edit distance. Finally, the unweighted overlap predicates Jaccard and IntersectSize perform the worst, as they ignore the importance of tokens. A similar trend is observed when the percentage of erroneous duplicates is varied.

Figure 5.1 shows MAP values for the different predicates for the 3 classes of erroneous datasets described in Table 5.3. For the low error datasets, all the predicates perform well except edit distance, GES, IntersectSize and Jaccard. GES performs a little worse due to the presence of token swap errors, IntersectSize and Jaccard perform worse because of abbreviation errors, and edit distance is the worst because of both factors.

When the error increases, the three types of errors occur in combination and the edit based predicates experience large accuracy degradation. The edit based predicates are already not good at handling token swap errors, and the presence of edit errors deteriorates their effectiveness further since the word token weights are no longer valid. This is not the case for the q-gram based predicates, since edit errors affect only a small fraction of the q-grams and the remaining q-gram weights are still valid. Consider a query Q = Morgan Stanley Group Inc. over dataset CU5, where we expect to fetch the tuples shown in Table 5.4. The q-gram based predicates are able to return all the tuples at the top 5 positions of the list according to similarity values. GES is not able to capture the token swap, and it ranks the two tuples with swapped word order at positions 27 and 28, respectively. The edit distance predicate performs worse; both of these tuples are absent from the list of the top 40 similar tuples. Both edit based predicates give a high similarity score to tuples like Silicon Valley Group, Inc. for query Q, primarily because of the low edit distance between Stanley and Valley.

The unweighted overlap predicates ignore the importance of q-grams and hence perform worse than the predicates that incorporate weights. It is interesting to note that the weighted overlap predicates perform better than the tf-idf (Cosine) predicate. This is due to the RS weighting scheme (Equation 3.5) used for weight assignment, which has been shown to be more accurate than the idf weighting scheme: the former captures the importance of tokens more accurately than the latter. The language modeling predicates (HMM and LM) and BM25 are always the best on all three datasets. The success of SoftTFIDF is attributed to the underlying Jaro-Winkler word level similarity predicate, which can match words accurately even in the presence of high error.

We also experimented with GES_Jaccard and GES_apx. Both predicates make use of a threshold θ to prune irrelevant records without calculating the exact scores. Depending on the value of θ, relevant records might also be pruned, leading to a drop in accuracy. Table 5.7 shows the variation in accuracy of GES_Jaccard and GES_apx for threshold values (θ) of 0.7, 0.8 and 0.9 on one of the CU datasets. For GES_apx, we used 5 min-hash signatures in order to approximate GES_Jaccard. We observe that increasing the number of min-hash signatures takes more time without having a significant impact on accuracy (it quickly demonstrates diminishing returns), while a small number of min-hash signatures results in significant accuracy loss. Experimental results show that, for suitable thresholds, GES_Jaccard performs as well as GES, and the accuracy drops as the threshold increases.

Predicate       θ=0.7   θ=0.8   θ=0.9
GES_Jaccard
GES_apx

Table 5.7: Accuracy of GES Predicates for Different Thresholds

GES_apx, being an approximation of GES_Jaccard, performs slightly worse than GES_Jaccard. Similar results were observed for the other datasets.
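The min-hash approximation underlying GES_apx can be sketched as follows. With k hash functions, the fraction of agreeing signature components is an unbiased estimate of the Jaccard similarity [3]; the seeded built-in hash below stands in for truly min-wise independent permutations:

```python
import random

def minhash_signature(tokens, num_hashes=5, seed=42):
    """One min-hash value per (salted) hash function; a stand-in for
    min-wise independent permutations over the token universe."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, t)) for t in tokens) for salt in salts]

def minhash_sim(sig_a, sig_b):
    """Fraction of matching components; its expectation is the Jaccard
    similarity of the two token sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

This also makes the accuracy/time trade-off visible: more hash functions tighten the estimate but cost more preprocessing, mirroring the diminishing returns noted above.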
In this section, we compare the different predicates based on preprocessing time, query time, and how well they scale as the size of the base table grows. As expected, the performance depends primarily on the size of the base table; performance observations and trends remain relatively independent of the error rate of the underlying datasets. Thus, we present the experiments on the DBLP datasets with increasing size and a medium amount of error: 70% erroneous duplicates, 20% extent of error, 20% token swap error and no abbreviation error.
We divide the preprocessing time needed to make a dataset amenable to approximate selection queries into two phases. In the first phase, tokenization is performed: q-grams are extracted from the strings in the way described in Section 5.3.3 and stored in the related tables. The aggregate weighted predicates (Cosine and BM25) and the language modeling predicates (LM and HMM) are fastest in this phase, followed with a small difference by the overlap predicates (Xect and Jac.), which is due to storing distinct tokens only. The combination predicates (GES_Jac, GES_apx and STfIdf w/JW) are considerably slower in this phase, since they involve an extra level of tokenization into words.

The second phase involves weight calculation. The fastest predicates in this phase are GES_Jaccard and SoftTFIDF, which only require weight calculation for the word tokens. The aggregate weighted and language modeling predicates are considerably slower, since calculating weights for these predicates involves a lot of computation and the creation of many intermediate tables. Language modeling (LM) is the slowest among the probabilistic predicates, since it requires the maximum number of intermediate tables to be created and stored. GES_apx has to compute min-hash signatures for the tokens separately for a number of hash functions, on top of the two-level tokenization and idf weight calculation, so it is the slowest of all the predicates. Figure 5.2 shows the preprocessing times for all predicates on a dataset of 10,000 records with an average length of 37 characters; GES_apx in this figure employs min-hash computation utilizing 5 hash functions (a min-hash signature size of 5). The preprocessing time of GES_apx increases with an increasing number of hash functions employed for min-hash signature calculation.
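The tokenization phase can itself be done declaratively. Below is a self-contained SQLite sketch of q-gram generation in the style of [11] for q = 2; the table names echo the thesis's relations, but the sentinel characters and the auxiliary integers table are assumptions, not the exact statements of Appendix A:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base_tokens(tid INTEGER, token TEXT);
CREATE TABLE ints(i INTEGER);
""")
con.executemany("INSERT INTO base_tokens VALUES (?, ?)",
                [(1, "morgan"), (1, "stanley")])
con.executemany("INSERT INTO ints VALUES (?)", [(i,) for i in range(1, 51)])

# Pad each token with q-1 sentinels on both sides, then take every
# length-q substring by joining against the integers table.
rows = con.execute("""
    SELECT b.tid, b.token, SUBSTR('#' || b.token || '$', n.i, 2) AS qgram
    FROM base_tokens b JOIN ints n ON n.i <= LENGTH(b.token) + 1
    ORDER BY b.token, n.i
""").fetchall()
```

A single non-equi join against a small integers table replaces any procedural loop, which is exactly what makes this step cheap for all predicates.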
The query time of a predicate is the time taken to rank the tuples of the base table in order of decreasing similarity score. Query time can also be divided into two phases: preprocessing the query string, and computing the similarity scores. The preprocessing part can itself be divided into tokenization and weight computation phases, as done for the preprocessing of the base relation. We did not experience large variability in query preprocessing time among the predicates. As described in Section 4.3, the score formulas for language modeling and HMM are suitably modified by dropping query dependent terms, which does not alter the similarity score order and hence the accuracy of the predicates.

Figure 5.3 shows the average query execution time of the different predicates over 100 queries on a table of 10,000 strings with an average length of 37 characters. The experimental results are consistent with our analysis. The fastest predicates rely on weights that are based only on idf and are easy to compute; we expect the Cosine predicate to follow these predicates, as it has the additional overhead of calculating query weights that depend on the idf of tokens. The language modeling predicate involves a join of 3 tables, so it is comparatively slow. The GES based predicates are the slowest of all, since they involve identifying the best matching token among the tuples for each query token. GES_apx has been designed to efficiently approximate GES_Jaccard, so it is expected to be the fastest of the GES based predicates. Note that the filtering step of GES_Jaccard, GES_apx and edit distance requires a suitable threshold θ: a lower value of θ results in poor filtering and high post-processing time, while a higher value of θ leads to the loss of similar results and hence a drop in accuracy. We used θ = 0.8 for the filtering step of GES_Jaccard and GES_apx, and θ = 0.7 for edit distance, since these values balance the trade-off between performance and precision for these predicates. For GES_apx, we use 5 hash functions for the min-hash calculation (a min-hash signature of size 5).
In order to investigate the scalability of our approach, we run experiments on DBLP datasets with sizes varying from 10k to 100k records. The variation in query time as the base table size increases is shown in Figure 5.4. Predicates with nearly equal query execution times have been grouped together: group G1 includes the predicates IntersectSize, WeightedMatch and HMM, and group G2 includes Jaccard, WeightedJaccard, Cosine and BM25. For the predicates other than the combination predicates, the results are consistent with our analysis of query execution time presented in Section 5.5.2. The predicates in group G1 can be thought of as having a weight of 1 for query tokens, and they just require a single join to compute the similar tuples. The predicates in group G2 take slightly more time than the predicates in G1, since they have to calculate weights for the query tokens. LM is slower than both groups. The query time of the combination predicates depends on the threshold θ used for these predicates and on the number of words in the string. We use the same thresholds as in Section 5.5.2 for these predicates. We also limit the size of the query strings to three words, in order to be able to compare the values among different datasets with the other predicates. The results show that the combination predicates are significantly slower than the other predicates, since for each query token we need to determine the best matching token from the base tuple using an auxiliary similarity function such as Jaccard or Jaro-Winkler, apart from the time needed to calculate the related weights for the word tokens. GES_apx is the fastest in this cluster of predicates. Increasing the number of words in the query strings considerably slows down these predicates. We excluded edit distance from this experiment because of its significantly poor accuracy.
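For the group G1 predicates, the "single join" is literally one SQL statement. A toy SQLite sketch of IntersectSize-style scoring follows; the schema and data are illustrative only:

```python
import sqlite3

def qgrams(s, q=2):
    """Distinct q-grams of a string padded with one sentinel per side."""
    padded = "#" + s + "$"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base_qgrams(tid INTEGER, qgram TEXT);
CREATE TABLE query_qgrams(qgram TEXT);
""")
base = {1: "morgan stanley", 2: "morgen stanley", 3: "silicon valley"}
for tid, s in base.items():
    con.executemany("INSERT INTO base_qgrams VALUES (?, ?)",
                    [(tid, g) for g in qgrams(s)])
con.executemany("INSERT INTO query_qgrams VALUES (?)",
                [(g,) for g in qgrams("morgan stanley")])

# One join plus aggregation ranks tuples by q-gram overlap with the query.
rows = con.execute("""
    SELECT b.tid, COUNT(*) AS overlap
    FROM base_qgrams b JOIN query_qgrams q ON b.qgram = q.qgram
    GROUP BY b.tid
    ORDER BY overlap DESC
""").fetchall()
```

The weighted predicates in G2 differ only in joining an additional weights relation inside the SUM, which explains their slightly higher query times.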
Apart from obvious ways of boosting the performance of the algorithms, such as modifying the score formulas as described in Chapter 4 and building indices on the relations to improve the execution plans of the score calculation queries, it is possible to enhance the performance of the algorithms by using filtering and pruning methods. Filtering based methods try to find a set of tuples that are promising duplicates by dropping a considerable percentage of dissimilar records without calculating the exact scores; GES_Jaccard is an example of such a technique. Filtering based enhancement techniques for the declarative framework are described in detail in [16].

Pruning methods enhance the performance of the algorithms mainly based on the nature of the q-grams made out of the strings. As the results of our experiments suggest, in every dataset there is a huge number of q-grams that play very little or no role in the accuracy of the predicates. Therefore, a very reasonable policy is to drop those q-grams in favor of space and performance savings, much like removing stopwords (e.g., the, an, etc.) from documents for efficient keyword search.

This approach has several advantages. Although some extra steps are added to the preprocessing phase, as a result of pruning the other steps of preprocessing become considerably faster, and overall we gain a substantial performance improvement in the preprocessing phase (except for the unweighted naive predicates, where there is no step other than tokenization). Since all weights are calculated from the pruned tokens table, the probability distributions of the tokens remain meaningful. The benefit for query execution time and the effect on accuracy depend on the threshold used for pruning. Figure 5.5 shows the effect of the pruning threshold on MAP and execution time for a dirty dataset of company names. We use a threshold of the form MIN(idf) + rate × (MAX(idf) − MIN(idf)) and change the rate from 0 (i.e., no pruning) to 0.33; a rate of 0.33 will drop nearly 150,000 out of 250,000 tokens, which results in a huge performance gain and very little drop in the accuracy of the predicates (except for the unweighted naive predicates, which benefit from pruning as described above).
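The pruning policy itself is a one-liner over the idf table. A sketch with hypothetical idf values:

```python
def prune_tokens(idf, rate):
    """Keep only tokens whose idf is at least
    MIN(idf) + rate * (MAX(idf) - MIN(idf))."""
    lo, hi = min(idf.values()), max(idf.values())
    threshold = lo + rate * (hi - lo)
    return {t: w for t, w in idf.items() if w >= threshold}
```

rate = 0 keeps everything; increasing the rate drops the frequent, low-idf q-grams first, which is exactly the stopword-like behaviour described above.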
We presented an exhaustive evaluation of approximate selection predicates by grouping them into five classes based on their characteristics: overlap predicates, aggregate weighted predicates, edit based predicates, combination predicates and language modeling predicates. We experimentally showed how the predicates in each of these classes perform in terms of accuracy, preprocessing time and execution time. Within our framework, the overlap predicates are relatively efficient but have low accuracy. Edit based predicates perform worse in terms of accuracy, but are relatively fast due to the filtering step they employ. The aggregate weighted predicates, specifically BM25, perform very well both in terms of accuracy and efficiency. Both predicates from the language modeling cluster perform well in terms of accuracy; moreover, HMM is as fast as the simple overlap predicates. The combination predicates are considerably slow due to their two levels of tokenization. Among the combination predicates, the GES based predicates are robust in handling edit errors but fail considerably in capturing token swap errors. SoftTFIDF with Jaro-Winkler performs nearly equal to BM25 and HMM and is among the best in terms of accuracy.

Figure 5.2: Preprocessing time of different predicates

Figure 5.3: Query time of different predicates

Figure 5.4: Query Time: Variation in Base Table Size (Query Time in ms vs. Base Table Size in thousands of tuples; curves: G1, G2, LM, STfIdf (w=3), GES_Jac (w=3), GES_apx (w=3))

Figure 5.5: Effect of Pruning on MAP and Execution Time of Different Predicates: (a) MAP vs. Pruning Rate; (b) Execution Time vs. Pruning Rate

Figure 5.6: IDF distribution of q-grams of size 3 for the CU1 dataset

Chapter 6

Conclusions
We proposed new similarity predicates for approximate selections based on probabilistic information retrieval and presented their declarative instantiation. We presented an in-depth comparison of the accuracy and performance of these new predicates along with existing predicates, grouping them into classes based on their primary characteristics. Our experiments show that the new predicates are both effective and efficient for data cleaning applications.

Bibliography

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In
Proceedings of the 28th International Conference on Very Large Databases (VLDB), 2002.

[2] Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pages 918–929. VLDB Endowment, 2006.

[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.

[4] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 313–324, June 2003.

[5] Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE '06: Proceedings of the 22nd International Conference on Data Engineering, page 5, Washington, DC, USA, 2006. IEEE Computer Society.

[6] William W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 201–212, New York, NY, USA, 1998. ACM Press.

[7] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73–78, 2003.

[8] J. B. Copas and F. J. Hilton. Record linkage: statistical models for matching computer records.
Journal of the Royal Statistical Society, pages 287–320, 1990.

[9] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[10] Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. Declarative data cleaning: language, model, and algorithms. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 371–380, 2001.

[11] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pages 491–500, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[12] Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, and Divesh Srivastava. Text joins for data cleansing and integration in an RDBMS. In Proceedings of the 19th International Conference on Data Engineering (ICDE), pages 729–731, March 2003.

[13] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997.
Data Min. Knowl. Discov. , 2(1):9–37,1998.[15] M. A. Jaro. Advances in record linkage methodology as applied to matching the 1985census of tampa.
Journal of the American Statistical Association , pages 414–420,1984.[16] Nick Koudas, Amit Marathe, and Divesh Srivastava. Flexible string matchingagainst large databases in practice. In
Proceedings of the International Conferenceon Very Large Databases (VLDB) , pages 1078–1086, August 2004.[17] Nick Koudas, Amit Marathe, and Divesh Srivastava. Spider: flexible matching indatabases. In
SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD internationalconference on Management of data , pages 876–878, New York, NY, USA, 2005. ACMPress.[18] Nick Koudas, Amit Marathe, and Divesh Srivastava. Using spider: an experiencereport. In
SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD internationalconference on Management of data , page 719, 2006.[19] Nick Koudas, Sunita Sarawagi, and Divesh Srivastava. Record linkage: similaritymeasures and algorithms. In
SIGMOD ’06: Proceedings of the 2006 ACM SIGMODinternational conference on Management of data , pages 802–803, 2006.[20] Nick Koudas and Divesh Srivastava. Approximate joins: Concepts and techniques.In
Proceedings of the International Conference on Very Large Databases (VLDB) ,page 1363, 2005.[21] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden markov modelinformation retrieval system. In
Proceedings of the 22nd Annual International ACM ibliography SIGIR Conference on Research and Development in Information Retrieval , pages214–221, August 1999.[22] Jay M. Ponte and W. Bruce Croft. A language modeling approach to informationretrieval. In
Proceedings of the 21st Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval , pages 275–281, August 1998.[23] L.R. Rabiner. A tutorial on hidden markov models and selected applications inspeechrecognition. In
Proceedings of the IEEE , volume 77, pages 257–286, 1989.[24] Stephen Robertson. Understanding inverse document frequency: on theoretical ar-guments.
Journal of Documentation , 60(5):503–520, 2004.[25] Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Mike Gatford,and A. Payne. Okapi at trec-4. In
TREC , 1995.[26] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic textretrieval.
Information Processing and Management , 24(5):513–523, 1988.[27] Gerard Salton and Michael J. McGill.
Introduction to Modern Information Retrieval .McGraw-Hill, Inc., New York, NY, USA, 1986.[28] Sunita Sarawagi and Alok Kirpal. Efficient set joins on similarity predicates. In
SIGMOD’04: Proceedings of the ACM SIGMOD International Conference on Man-agement of Data , pages 743–754, June 2004.[29] W. E. Winkler. The state of record linkage and current research problems. TechnicalReport RR99/04, US Bureau of the Census, 1999. ppendix AData preparation SQL Statements
We assume the base relation BASE_TABLE has an integer tuple-id attribute tid and a string-valued attribute string. The following SQL statements tokenize the base relation, creating BASE_TOKENS(tid, token). Assuming that the query relation QUERY_TABLE has a single string-valued attribute string, the same SQL statements can be used to tokenize the query string by removing tid from the statements.
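As an illustration of the q-gram scheme used in Section A.1 below (strings are upper-cased, each space is replaced by q-1 copies of a padding character such as '$', and both ends are padded with q-1 padding characters), the following Python sketch mirrors the SQL; the function name is ours:

```python
def qgrams(s, q=3, pad="$"):
    """Generate q-grams the same way as the SQL in Section A.1:
    upper-case the string, replace each space with q-1 pad characters,
    and pad both ends with q-1 pad characters."""
    padded = pad * (q - 1) + s.upper().replace(" ", pad * (q - 1)) + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("ab"))  # ['$$A', '$AB', 'AB$', 'B$$']
```

The padding guarantees that every character of the string participates in exactly q q-grams, which the SQL achieves with the INTEGERS join bound.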
A.1 Qgram generation

-- MAX_STR_SIZE is the maximum string length and q is the size of the qgrams.
INSERT INTO INTEGERS(i) VALUES (1), (2), ..., (MAX_STR_SIZE + (q-1))

INSERT INTO BASE_TOKENS(tid, token)
SELECT tid, SUBSTRING(CONCAT(SUBSTRING('$...$', 1, q-1),
                             UPPER(REPLACE(CONCAT(string), ' ', SUBSTRING('$...$', 1, q-1))),
                             SUBSTRING('$...$', 1, q-1)),
                      INTEGERS.i, q)
FROM INTEGERS INNER JOIN BASE_TABLE ON
     INTEGERS.i <= LENGTH(REPLACE(CONCAT(string), ' ', SUBSTRING('$...$', 1, q-1))) + (q-1)

A.2 Word token generation
INSERT INTO BASE_TOKENS(tid, token)
SELECT tid, SUBSTRING(CONCAT(string), 1, LOCATE(' ', CONCAT(string)) - 1)
FROM BASE_TABLE
WHERE LOCATE(' ', CONCAT(string)) > 0
UNION ALL
SELECT tid, SUBSTRING(CONCAT(string), N1.i + 1, N2.i - N1.i - 1)
FROM BASE_TABLE, INTEGERS N1, INTEGERS N2
WHERE N1.i = LOCATE(' ', CONCAT(string), N1.i)
  AND N2.i = LOCATE(' ', CONCAT(string), N1.i + 1)
UNION ALL
SELECT tid, SUBSTRING(CONCAT(string),
                      LENGTH(CONCAT(string)) - LOCATE(' ', REVERSE(CONCAT(string))) + 2)
FROM BASE_TABLE
WHERE LOCATE(' ', CONCAT(string)) > 0
UNION ALL
SELECT tid, CONCAT(string)
FROM BASE_TABLE
WHERE LOCATE(' ', CONCAT(string)) = 0
A.3 Qgram generation of the word tokens (for combination predicates)
INSERT INTO BASE_QGRAMS(tid, token, qgram)
SELECT tid, token,
       SUBSTRING(CONCAT(SUBSTRING('$...$', 1, q-1), UPPER(token),
                        SUBSTRING('$...$', 1, q-1)), INTEGERS.i, q)
FROM INTEGERS INNER JOIN BASE_TOKENS ON INTEGERS.i <= LENGTH(token) + (q-1)
GROUP BY tid, token, qgram

Appendix B

SQL Statements for Predicates

B.1 Overlap Predicates
B.1.1 IntersectSize
Query

INSERT INTO INTERSECT_RESULTS(tid, score)
SELECT R1.tid, COUNT(*)
FROM BASE_TOKENS R1, QUERY_TOKENS R2
WHERE R1.token = R2.token
GROUP BY R1.tid

B.1.2 Jaccard
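The Jaccard predicate implemented below measures the overlap of the two token sets. As a cross-check, a minimal Python sketch of the measure itself (set semantics; the function name is ours):

```python
def jaccard(base_tokens, query_tokens):
    """Jaccard coefficient: |intersection| / |union|.  The SQL below
    works on token tables and counts join matches, which coincides
    with this when the tokens within a record are distinct."""
    b, q = set(base_tokens), set(query_tokens)
    inter = len(b & q)
    return inter / (len(b) + len(q) - inter)
```

For example, jaccard(["data", "cleaning"], ["data", "clean"]) is 1/3: one shared token out of three distinct tokens overall.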
Preprocessing

INSERT INTO BASE_DDL(tid, ddl)
SELECT T.tid, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid

INSERT INTO BASE_TOKENSDDL(tid, token, ddl)
SELECT T.tid, T.token, D.ddl
FROM BASE_TOKENS T, BASE_DDL D
WHERE T.tid = D.tid

Query

INSERT INTO JACCARD_RESULTS(tid, score)
SELECT S1.tid, COUNT(*) / (S1.ddl + S2.ddl - COUNT(*))
FROM BASE_TOKENSDDL S1, QUERY_TOKENS R2,
     (SELECT COUNT(*) AS ddl FROM QUERY_TOKENS T) S2
WHERE S1.token = R2.token
GROUP BY S1.tid

B.1.3 WeightedMatch
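The WeightedMatch predicate below sums a BM25-style inverse document frequency weight, midf(t) = log((N - df(t) + 0.5) / (df(t) + 0.5)), over the base-record tokens that also appear in the query. A small Python sketch of the score (the df mapping and the function names are ours):

```python
import math

def midf(df, n_records):
    """BM25-style IDF used by the WeightedMatch SQL:
    log((N - df + 0.5) / (df + 0.5))."""
    return math.log(n_records - df + 0.5) - math.log(df + 0.5)

def weighted_match(base_tokens, query_tokens, df, n_records):
    """Sum the midf weights of base-record tokens that also occur
    in the query (df maps token -> document frequency)."""
    q = set(query_tokens)
    return sum(midf(df[t], n_records) for t in base_tokens if t in q)
```

Rare shared tokens thus contribute much more to the score than common ones; a token occurring in half the records contributes a weight near zero.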
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_BMIDF(token, midf)
SELECT T.token, LOG(S.size - COUNT(T.tid) + 0.5) - LOG(COUNT(T.tid) + 0.5)
FROM BASE_TF T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_WEIGHTS(tid, token, weight)
SELECT T.tid, T.token, I.midf
FROM BASE_BMIDF I, BASE_TF T
WHERE I.token = T.token

Query

INSERT INTO WEIGHTEDMATCH_RESULTS(tid1, tid2, score)
SELECT W1.tid, T2.tid, SUM(W1.weight)
FROM BASE_WEIGHTS W1, QUERY_TOKENS T2
WHERE W1.token = T2.token
GROUP BY T2.tid, W1.tid

B.1.4 WeightedJaccard
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_BMIDF(token, midf)
SELECT T.token, LOG(S.size - COUNT(T.tid) + 0.5) - LOG(COUNT(T.tid) + 0.5)
FROM BASE_TF T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_WEIGHTS(tid, token, weight)
SELECT T.tid, T.token, I.midf
FROM BASE_BMIDF I, BASE_TOKENS T
WHERE I.token = T.token

INSERT INTO BASE_DDL(tid, ddl)
SELECT W.tid, SUM(weight)
FROM BASE_WEIGHTS W
GROUP BY W.tid

INSERT INTO BASE_TOKENSDDL(tid, token, ddl, weight)
SELECT W.tid, W.token, D.ddl, W.weight
FROM BASE_WEIGHTS W, BASE_DDL D
WHERE W.tid = D.tid

Query

INSERT INTO WJ_RESULTS(tid, score)
SELECT S1.tid, SUM(S1.weight) / (S1.ddl + S2.ddl - SUM(S1.weight))
FROM BASE_TOKENSDDL S1, QUERY_TOKENS R2,
     (SELECT SUM(T.weight) AS ddl
      FROM (SELECT T.token, I.midf AS weight
            FROM BASE_BMIDF I, QUERY_TOKENS T
            WHERE I.token = T.token) T) S2
WHERE S1.token = R2.token
GROUP BY S1.tid

B.2 Aggregate Weighted Predicates
B.2.1 Tfidf Cosine Predicate
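The predicate below computes the cosine similarity of L2-normalized tf-idf vectors. A minimal Python sketch of the same computation (names are ours; tokens with no base-relation idf get weight 0 here, whereas the SQL simply drops them, which yields the same dot product):

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """L2-normalized tf-idf weights, as in the BASE_WEIGHTS table."""
    tf = Counter(tokens)
    w = {t: idf.get(t, 0.0) * c for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def cosine(base_tokens, query_tokens, idf):
    """Dot product of the two normalized tf-idf vectors."""
    wb = tfidf_vector(base_tokens, idf)
    wq = tfidf_vector(query_tokens, idf)
    return sum(wb[t] * wq.get(t, 0.0) for t in wb)
```

Because both vectors are normalized, the score is 1 for identical token multisets and 0 when no tokens are shared.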
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_IDF(token, idf)
SELECT T.token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid))
FROM BASE_TOKENS T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_LENGTH(tid, len)
SELECT T.tid, SQRT(SUM(I.idf*I.idf*T.tf*T.tf))
FROM BASE_IDF I, BASE_TF T
WHERE I.token = T.token
GROUP BY T.tid

INSERT INTO BASE_WEIGHTS(tid, token, weight)
SELECT T.tid, T.token, I.idf*T.tf/L.len
FROM BASE_IDF I, BASE_TF T, BASE_LENGTH L
WHERE I.token = T.token AND T.tid = L.tid

Query

INSERT INTO COSINE_RESULTS(tid, score)
SELECT R1W.tid, SUM(R1W.weight * R2W.weight)
FROM BASE_WEIGHTS R1W,
     (SELECT QIDF.token, QIDF.idf * QTF.tf / QLEN.length AS weight
      FROM (SELECT R.token, R.idf
            FROM QUERY_TOKENS S, BASE_IDF R
            WHERE S.token = R.token
            GROUP BY S.token) QIDF,
           (SELECT T.token, COUNT(*) AS tf
            FROM QUERY_TOKENS T
            GROUP BY T.token) QTF,
           (SELECT SQRT(SUM(QIDF.idf*QIDF.idf*QTF.tf*QTF.tf)) AS length
            FROM (SELECT R.token, R.idf
                  FROM QUERY_TOKENS S, BASE_IDF R
                  WHERE S.token = R.token
                  GROUP BY S.token) QIDF,
                 (SELECT T.token, COUNT(*) AS tf
                  FROM QUERY_TOKENS T
                  GROUP BY T.token) QTF
            WHERE QIDF.token = QTF.token) QLEN
      WHERE QIDF.token = QTF.token) R2W
WHERE R1W.token = R2W.token
GROUP BY R1W.tid

B.2.2 BM25 Predicate
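The BM25 predicate below multiplies, per shared token, the midf weight by a length-normalized document tf and a saturated query tf. A Python sketch of the score for a single base record (names are ours; the parameter values k1, b, k3 here are illustrative assumptions, not the thesis's tuned settings):

```python
import math

def bm25_score(base_tf, query_tf, df, n_records, dl, avgdl,
               k1=1.2, b=0.75, k3=1.2):
    """BM25 score of one base record against the query:
    sum over shared tokens of midf * mtf (doc side) * mtf (query side)."""
    score = 0.0
    for t, qtf in query_tf.items():
        if t not in base_tf:
            continue
        w_midf = math.log(n_records - df[t] + 0.5) - math.log(df[t] + 0.5)
        tf = base_tf[t]
        # document-side tf normalized by record length dl vs. average avgdl
        mtf = (tf * (k1 + 1)) / ((((1 - b) + b * dl / avgdl) * k1) + tf)
        # query-side tf saturation
        qmtf = (qtf * (k3 + 1)) / (k3 + qtf)
        score += w_midf * mtf * qmtf
    return score
```

Each term of the loop corresponds to one row of the BASE_BMBASEWEIGHTS join in the SQL below.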
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_BMIDF(token, midf)
SELECT T.token, LOG(S.size - COUNT(T.tid) + 0.5) - LOG(COUNT(T.tid) + 0.5)
FROM BASE_TF T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_BMBASELENGTH(tid, len)
SELECT T.tid, SUM(T.tf)
FROM BASE_TF T
GROUP BY T.tid

INSERT INTO BASE_BMBASEAVGLENGTH(avglen)
SELECT AVG(len)
FROM BASE_BMBASELENGTH

-- k1 and b are the document-side BM25 parameters; k3 is the query-side parameter.
INSERT INTO BASE_BMBASEMODTF(tid, token, mtf)
SELECT T.tid, T.token,
       (T.tf*(k1+1)) / ((((1-b) + (b*L.len/A.avglen))*k1) + T.tf)
FROM BASE_BMBASELENGTH L, BASE_BMBASEAVGLENGTH A, BASE_TF T
WHERE L.tid = T.tid

INSERT INTO BASE_BMBASEWEIGHTS(tid, token, weight)
SELECT T.tid, T.token, T.mtf * I.midf
FROM BASE_BMBASEMODTF T, BASE_BMIDF I
WHERE T.token = I.token

Query

INSERT INTO BM25_RESULTS(tid, score)
SELECT B.tid, SUM(B.weight * S.mtf)
FROM BASE_BMBASEWEIGHTS B,
     (SELECT T.token, (COUNT(*)*(k3+1)) / (k3 + COUNT(*)) AS mtf
      FROM QUERY_TOKENS T
      GROUP BY T.token) S
WHERE B.token = S.token
GROUP BY B.tid

B.3 Language Modeling Predicates
B.3.1 Language Modeling
Preprocessing

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_DL(tid, dl)
SELECT T.tid, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid

INSERT INTO BASE_PML(tid, token, pml)
SELECT T.tid, T.token, T.tf/D.dl
FROM BASE_TF T, BASE_DL D
WHERE T.tid = D.tid

INSERT INTO BASE_PAVG(token, pavg)
SELECT P.token, AVG(P.pml)
FROM BASE_PML P
GROUP BY P.token

INSERT INTO BASE_FREQ(tid, token, freq)
SELECT T.tid, T.token, P.pavg*D.dl
FROM BASE_TF T, BASE_PAVG P, BASE_DL D
WHERE T.token = P.token AND T.tid = D.tid

INSERT INTO BASE_RISK(tid, token, risk)
SELECT T.tid, T.token,
       (1.0/(1.0+Q.freq)) * POWER(Q.freq/(1.0+Q.freq), T.tf)
FROM BASE_TF T, BASE_FREQ Q
WHERE T.tid = Q.tid AND T.token = Q.token

INSERT INTO BASE_TSIZE(size)
SELECT COUNT(*)
FROM BASE_TOKENS

INSERT INTO BASE_CFCS(token, cfcs)
SELECT T.token, COUNT(*) / S.size
FROM BASE_TOKENS T, BASE_TSIZE S
GROUP BY T.token

INSERT INTO BASE_PM(tid, token, pm, cfcs)
SELECT T.tid, T.token,
       1E0 * POWER(M.pml, 1.0-R.risk) * POWER(A.pavg, R.risk), C.cfcs
FROM BASE_TF T, BASE_RISK R, BASE_PML M, BASE_PAVG A, BASE_CFCS C
WHERE T.tid = R.tid AND T.token = R.token AND T.tid = M.tid
  AND T.token = M.token AND T.token = A.token AND T.token = C.token

INSERT INTO BASE_SUMCOMPMBASE(tid, sumcompm)
SELECT P.tid, SUM(LOG(1.0-P.pm))
FROM BASE_PM P
GROUP BY P.tid

Query

INSERT INTO LM_RESULTS(tid, score)
SELECT B1.tid, B1.score + B2.sumcompm
FROM (SELECT P1.tid,
             SUM(LOG(P1.pm)) - SUM(LOG(1.0-P1.pm)) - SUM(LOG(P1.cfcs)) AS score
      FROM BASE_PM P1, QUERY_TOKENS T2
      WHERE P1.token = T2.token
      GROUP BY P1.tid) B1,
     BASE_SUMCOMPMBASE B2
WHERE B1.tid = B2.tid

B.3.2 Hidden Markov Models
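The HMM predicate below (after Miller et al.) scores a record as the product, over matched query tokens, of a mixture of the document language model pml and a "General English" model ptge. The SQL stores the per-token log-weight log(1 + (a1*pml)/(a0*ptge)) and exponentiates the sum. A Python sketch (a0 and a1 are the HMM transition probabilities with a0 + a1 = 1; the default values here are illustrative assumptions):

```python
import math

def hmm_score(query_tokens, pml, ptge, a0=0.2, a1=0.8):
    """HMM score: exp of the summed log-weights
    log(1 + (a1 * pml[t]) / (a0 * ptge[t])) over matched tokens.
    pml maps token -> P(t | document); ptge maps token -> P(t | GE)."""
    s = 0.0
    for t in query_tokens:
        if t in pml and t in ptge:
            s += math.log(1.0 + (a1 * pml[t]) / (a0 * ptge[t]))
    return math.exp(s)
```

Unmatched query tokens contribute a factor of 1, mirroring the join in the SQL query, which only touches tokens present in the base record.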
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_DL(tid, dl)
SELECT T.tid, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid

INSERT INTO BASE_PML(tid, token, pml)
SELECT T.tid, T.token, F.tf/D.dl
FROM BASE_TOKENS T, BASE_TF F, BASE_DL D
WHERE F.tid = T.tid AND T.token = F.token AND T.tid = D.tid

INSERT INTO BASE_SUMDL(sdl)
SELECT SUM(T.dl)
FROM BASE_DL T

INSERT INTO BASE_PTGE(token, ptge)
SELECT T.token, SUM(T.tf)/D.sdl
FROM BASE_TF T, BASE_SUMDL D
GROUP BY T.token

-- a0 and a1 are the transition probabilities of the two-state HMM (a0 + a1 = 1).
INSERT INTO BASE_WEIGHTSHMM(tid, token, weight)
SELECT M.tid, M.token, LOG(1 + (a1*M.pml) / (a0*P.ptge))
FROM BASE_PTGE P, BASE_PML M
WHERE P.token = M.token
GROUP BY M.tid, M.token

Query

INSERT INTO HMM_SCORES(tid, score)
SELECT W1.tid, EXP(SUM(W1.weight))
FROM BASE_WEIGHTSHMM W1, QUERY_TOKENS T2
WHERE W1.token = T2.token
GROUP BY W1.tid

B.4 Combination Predicates
B.4.1 GES_Jaccard
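The filtering query below computes, for each query token t, the maximum q-gram Jaccard similarity maxsim(t) against any token of the base record, and combines these into the score (1 - 1/q) + (1/Σ idf(t)) · Σ idf(t) · (2/q) · maxsim(t). A Python sketch of that final combination step (names are ours):

```python
def ges_filter_score(maxsim, idf, q=3):
    """GES filtering score: combine per-query-token best q-gram
    Jaccard similarities, weighted by idf.  maxsim and idf map
    each query token to its maxsim value and idf weight."""
    total = sum(idf[t] for t in maxsim)
    weighted = sum(idf[t] * (2.0 / q) * maxsim[t] for t in maxsim)
    return (1.0 - 1.0 / q) + weighted / total
```

With q = 3 and a single token whose maxsim is 1, the score is 2/3 + 2/3 = 4/3, which is why the filtering step compares against a threshold rather than a normalized similarity.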
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_IDF(token, idf)
SELECT T.token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid))
FROM BASE_TOKENS T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_IDFAVG(idfavg)
SELECT AVG(I.idf)
FROM BASE_IDF I

INSERT INTO BASE_TOKENSIZE(tid, token, size)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_QGRAMS T
GROUP BY T.tid, T.token

INSERT INTO BASE_QGRAMSTOKENSIZE(tid, token, qgram, size)
SELECT T.tid, T.token, T.qgram, S.size
FROM BASE_QGRAMS T, BASE_TOKENSIZE S
WHERE T.tid = S.tid AND T.token = S.token

Query (Filtering Step)

SELECT MAXSIM.tid, R.string,
       (1.0 - 1.0/q) + (1/SUM(QIDF.idf)) *
           SUM(QIDF.idf * ((2.0/q) * MAXSIM.maxsim)) AS score
FROM (SELECT JAC_SIM.tid, JAC_SIM.token2, MAX(sim) AS maxsim
      FROM (SELECT BSIZE.tid AS tid, BSIZE.token AS token1, Q.token AS token2,
                   COUNT(*)/(BSIZE.size + QSIZE.size - COUNT(*)) AS sim
            FROM BASE_QGRAMSTOKENSIZE BSIZE, QUERY_QGRAMS Q,
                 (SELECT T.token, COUNT(*) AS size
                  FROM QUERY_QGRAMS T
                  GROUP BY T.token) QSIZE
            WHERE BSIZE.qgram = Q.qgram AND Q.token = QSIZE.token
            GROUP BY BSIZE.tid, BSIZE.token, Q.token) JAC_SIM
      GROUP BY JAC_SIM.tid, JAC_SIM.token2) MAXSIM,
     (SELECT R.token, R.idf
      FROM QUERY_TOKENS S, BASE_IDF R
      WHERE S.token = R.token
      GROUP BY S.token
      UNION
      SELECT S.token, A.idfavg AS idf
      FROM QUERY_TOKENS S, BASE_IDFAVG A
      WHERE S.token NOT IN (SELECT I.token FROM BASE_IDF I)
      GROUP BY S.token) QIDF,
     BASE_TABLE R
WHERE MAXSIM.token2 = QIDF.token AND R.tid = MAXSIM.tid
GROUP BY MAXSIM.tid
HAVING score >= θ

B.4.2 GES_apx
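GES_apx replaces the exact q-gram Jaccard computation with a minhash estimate (Broder et al.): each token's q-gram set is summarized by the minimum hash value under HASH_SIZE random hash functions, and the fraction of agreeing signature positions estimates the Jaccard similarity. An illustrative Python sketch (the hash construction is simplified relative to the SQL; names are ours):

```python
import random

def minhash_signature(qgram_set, n_hash=64, seed=42):
    """Minhash signature: for each of n_hash random linear hash
    functions, keep the minimum hash value over the q-gram set."""
    rnd = random.Random(seed)
    prime = 2_147_483_647  # large prime modulus
    funcs = [(rnd.randrange(1, prime), rnd.randrange(prime))
             for _ in range(n_hash)]
    return [min((a * hash(g) + b) % prime for g in qgram_set)
            for a, b in funcs]

def minhash_jaccard(sig1, sig2):
    """Fraction of agreeing positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Identical q-gram sets always produce identical signatures (estimate 1.0); for overlapping sets the estimate converges to the true Jaccard value as n_hash grows.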
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_IDF(token, idf)
SELECT T.token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid))
FROM BASE_TOKENS T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_IDFAVG(idfavg)
SELECT AVG(I.idf)
FROM BASE_IDF I

-- HASH_SIZE is the number of minhash functions; MAXINT is the maximum integer value.
INSERT INTO BASE_HASHFUNC(fid, func)
SELECT N.i-1, ROUND(RAND()*MAXINT)
FROM INTEGERS N
LIMIT HASH_SIZE

INSERT INTO BASE_HASHVALUE(fid, qgram, value)
SELECT F.fid, Q.qgram, MOD(CONV(HEX(Q.qgram), 16, 10) * MAXINT, F.func)
FROM BASE_HASHFUNC F, (SELECT DISTINCT qgram FROM BASE_QGRAMS) Q

INSERT INTO BASE_MINHASHSIGNATURE(tid, token, fid, value)
SELECT Q.tid, Q.token, H.fid, MIN(H.value)
FROM BASE_QGRAMS Q, BASE_HASHVALUE H
WHERE Q.qgram = H.qgram
GROUP BY Q.tid, Q.token, H.fid

Query (Filtering Step)

SELECT MAXSIM.tid, R.string,
       (1.0 - 1.0/q) + (1/SUM(QIDF.idf)) *
           SUM(QIDF.idf * ((2.0/q) * MAXSIM.maxsim)) AS score
FROM (SELECT MH_SIM.tid, MH_SIM.token2, MAX(sim) AS maxsim
      FROM (SELECT BMHSIG.tid AS tid, BMHSIG.token AS token1,
                   QMHSIG.token AS token2, COUNT(*)/HASH_SIZE AS sim
            FROM BASE_MINHASHSIGNATURE BMHSIG,
                 (SELECT Q.token, H.fid, MIN(H.value) AS value
                  FROM QUERY_QGRAMS Q, BASE_HASHVALUE H
                  WHERE Q.qgram = H.qgram
                  GROUP BY Q.token, H.fid) QMHSIG
            WHERE BMHSIG.fid = QMHSIG.fid AND BMHSIG.value = QMHSIG.value
            GROUP BY BMHSIG.tid, BMHSIG.token, QMHSIG.token) MH_SIM
      GROUP BY MH_SIM.tid, MH_SIM.token2) MAXSIM,
     (SELECT R.token, R.idf
      FROM QUERY_TOKENS Q, BASE_IDF R
      WHERE Q.token = R.token
      GROUP BY Q.token
      UNION
      SELECT Q.token, A.idfavg AS idf
      FROM QUERY_TOKENS Q, BASE_IDFAVG A
      WHERE Q.token NOT IN (SELECT I.token FROM BASE_IDF I)
      GROUP BY Q.token) QIDF,
     BASE_TABLE R
WHERE MAXSIM.token2 = QIDF.token AND R.tid = MAXSIM.tid
GROUP BY MAXSIM.tid
HAVING score >= θ

B.4.3 SoftTFIDF
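The SoftTFIDF predicate below pairs each query token with its most similar base-record token (similarity at least θ under an edit-based measure) and sums the products of their normalized tf-idf weights and the best similarity. A Python sketch of the scoring loop; as an assumption, difflib's ratio stands in for the JaroWinkler UDF used in the SQL, and all names are ours:

```python
import difflib
import math
from collections import Counter

def sim(a, b):
    """Stand-in for the JaroWinkler UDF: any token similarity
    in [0, 1] plays the same role here."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def soft_tfidf(base_tokens, query_tokens, idf, theta=0.8):
    """SoftTFIDF: for each query token, find the closest base token
    with sim >= theta and add weight_base * weight_query * best_sim."""
    def weights(tokens):
        tf = Counter(tokens)
        w = {t: idf.get(t, 0.0) * c for t, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    wb, wq = weights(base_tokens), weights(query_tokens)
    score = 0.0
    for qt in wq:
        close = [(sim(bt, qt), bt) for bt in wb if sim(bt, qt) >= theta]
        if close:
            best, bt = max(close)
            score += wb[bt] * wq[qt] * best
    return score
```

When every query token has an exact match, the score reduces to the tf-idf cosine of Section B.2.1.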
Preprocessing

INSERT INTO BASE_SIZE(size)
SELECT COUNT(*)
FROM BASE_TABLE

INSERT INTO BASE_IDF(token, idf)
SELECT T.token, LOG(S.size) - LOG(COUNT(DISTINCT T.tid))
FROM BASE_TOKENS T, BASE_SIZE S
GROUP BY T.token

INSERT INTO BASE_TF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM BASE_TOKENS T
GROUP BY T.tid, T.token

INSERT INTO BASE_LENGTH(tid, len)
SELECT T.tid, SQRT(SUM(I.idf*I.idf*T.tf*T.tf))
FROM BASE_IDF I, BASE_TF T
WHERE I.token = T.token
GROUP BY T.tid

INSERT INTO BASE_WEIGHTS(tid, token, weight)
SELECT T.tid, T.token, I.idf*T.tf/L.len
FROM BASE_IDF I, BASE_TF T, BASE_LENGTH L
WHERE I.token = T.token AND T.tid = L.tid

Query

INSERT INTO SOFTTFIDF_RESULTS(tid, score)
SELECT MAXTOKEN.tid, SUM(WB.weight * WQ.weight * MAXTOKEN.maxsim)
FROM BASE_WEIGHTS WB,
     (SELECT JARO_SIM.tid, JARO_SIM.token1, JARO_SIM.token2, MAXSIM.maxsim
      FROM (SELECT JARO_SIM.tid, JARO_SIM.token2, MAX(sim) AS maxsim
            FROM (SELECT R1.tid AS tid, R1.token AS token1,
                         R2.token AS token2,
                         JaroWinkler(R1.token, R2.token) AS sim
                  FROM BASE_TOKENS R1, QUERY_TOKENS R2
                  WHERE JaroWinkler(R1.token, R2.token) >= θ) JARO_SIM
            GROUP BY JARO_SIM.tid, JARO_SIM.token2) MAXSIM,
           (SELECT R1.tid AS tid, R1.token AS token1,
                   R2.token AS token2,
                   JaroWinkler(R1.token, R2.token) AS sim
            FROM BASE_TOKENS R1, QUERY_TOKENS R2
            WHERE JaroWinkler(R1.token, R2.token) >= θ