Efficient and Effective ER with Progressive Blocking

Sainyam Galhotra (UMass Amherst, [email protected]), Donatella Firmani (Roma Tre University, [email protected]), Barna Saha (UC Berkeley, [email protected]), and Divesh Srivastava (AT&T Labs – Research, [email protected])
Abstract
Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently (O(n log n) time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process up to 60%.

1 Introduction

Entity Resolution (ER) is the problem of identifying which records in a data set refer to the same real-world entity [7]. ER technologies are key for solving complex tasks (e.g., building a knowledge graph), but comparing all the record pairs to decide which pairs match is often infeasible. For this reason, the first step of ER selects a sub-quadratic number of record pairs to compare in the subsequent steps. To this end, a commonly used approach is blocking [24].
Blocking groups similar records into blocks and then selects pairs from the "cleanest" blocks, i.e., those with fewer non-matching pairs, for further comparisons. The literature is rich with methods for building and processing blocks [24], but depending on the dataset at hand, different techniques can either leave too many matching pairs outside, leading to incomplete ER results and low effectiveness, or include too many non-matching pairs, leading to low efficiency.

pBlocking. We propose a new progressive blocking technique that overcomes the above limitations by short-circuiting the two operations, blocking and pair comparisons, that are traditionally solved sequentially. Our method starts with an aggressive blocking step, which is efficient but not very effective. Then, it computes a limited amount of ER results on a subset of pairs selected by the aggressive blocking, and sends these partial (matching and non-matching) results from the ER phase back to the blocking phase, creating a "loop", to improve blocking effectiveness.
Figure 1: (a) Illustration of a standard blocking pipeline. Block building, block cleaning and comparison cleaning sub-tasks are highlighted in white. The downstream ER algorithm is shown in gray. A description of each record is reported in Table 1. (b) Block size distribution (standard blocking) for the real cars dataset used in our experiments: chevrolet (1717 records), corvette (272), c6 (251), malibu (187), navigation (43), corvette ∧ c6 / z6 (25), chevy (11).

Table 1: Sample records (we omit schema information) referring to 4 distinct entities: a Chevrolet Corvette C6 (c6), a Corvette Z6 (z6), a Chevrolet Malibu (ma), and a Citroën C6 (ci), the latter with the same model name as the Corvette C6 but a different car. r_i^e represents the i-th record referring to entity e.

r1(c6): chevy corvette c6
r2(c6): chevy corvette c6 navigation
r3(c6): chevrolet corvette c6
r1(z6): corvette z6 navigation
r1(ma): chevy malibu navigation
r2(ma): chevrolet chevy malibu
r3(ma): chevrolet malibu
r1(ci): citroen c6 navigation

In this way, blocking can progressively self-regulate and adapt to the properties of each dataset, with no configuration effort. We illustrate our blocking method, that we call pBlocking, in the following example.
Example 1.
Consider the records in Table 1 from the cars dataset used in our experiments, and a standard schema-agnostic blocking strategy S such as [20]. As shown in Figure 1a, we consider three blocking sub-tasks [24]. First, during block building, S creates a separate block for each text token (we only show the blocks 'corvette', 'navigation', 'malibu', 'c6' and 'chevy'). Then, during block cleaning, S uses a threshold to prune out all the blocks of large size. Depending on the threshold value (using the block sizes in the entire cars dataset, shown in Figure 1b), we can have any of the following extreme behaviors. (Note that no intermediate setting of the threshold can yield a sparse set of candidates that is at the same time complete.)

• Aggressive blocking: S prunes every block except the smallest one ('chevy') and returns only pairs within the 'chevy' block, such as (r1(c6), r2(c6)) and (r1(ma), r2(ma)), missing r3(c6) and r3(ma).

• Permissive blocking: S prunes only the largest block ('chevrolet') and returns many non-matching pairs.

Finally, during comparison cleaning, S can use another threshold to further prune out pairs sharing few blocks, e.g., by using meta-blocking [22]. As in block cleaning, different threshold values can yield aggressive or permissive behaviours. Note that matching pairs such as (r1(c6), r3(c6)) share the same number of blocks ('corvette' and 'c6') as non-matching pairs such as (r2(c6), r1(z6)) ('corvette' and 'navigation'). (Even worse, 'c6' is larger than 'navigation'.)

pBlocking can solve these problems in a few rounds: the first round does aggressive blocking, the second round does more effective blocking by making targeted updates according to partial ER results, and so on. Examples of such updates to the blocking result are discussed below.

1. Creation of new blocks that help inclusion of (r1(c6), r3(c6)) and (r2(c6), r3(c6)): pBlocking creates a new block 'corvette ∧ c6' with records present in both blocks 'corvette' and 'c6'. This block is much smaller than its two constituents and has only Corvette C6 cars.

2. Adaptive cleaning to help inclusion of (r1(ma), r3(ma)) and (r2(ma), r3(ma)): pBlocking can discourage pruning of block 'malibu' that contains Chevrolet Malibu cars, even if it is a large block.

3. Adaptive cleaning to help exclusion of non-matching pairs: pBlocking can encourage pruning of block 'navigation' that contains no matching pairs, even if it is a small block.

After a few rounds of updates like the above, pBlocking returns all the matching pairs with very few non-matching pairs. Note that after the last round, the ER output can be computed on the resulting pairs as in the traditional setting. Updates of type (1) are performed via a new block intersection algorithm, while (2) and (3) are performed by a new block scoring method. By construction, when the blocking scores converge, the entire blocking result also converges.

Our contributions.
The main contribution of this paper is a new blocking methodology with both high efficiency and effectiveness in a variety of application scenarios. Since pBlocking can in principle start off using any blocking strategy, it represents not only a new approach but also a way to "boost" traditional ones. pBlocking works seamlessly across different entity cluster size distributions such as:

• small entity clusters, where, using block intersection, pBlocking can recover entities such as Corvette C6 consisting of few records sharing large and dirty blocks.

• large entity clusters, where, using block scoring, pBlocking can recover entities such as Chevrolet Malibu consisting of many records sharing large and clean blocks.

We prove theoretically and show empirically that, with a few rounds and a limited amount of partial ER results, our progressive blocking method can provide a significant boost in blocking effectiveness without penalizing efficiency. Specifically, we (i) demonstrate fast convergence and low space and time complexity (O(n log n), where n is the number of records) of pBlocking; (ii) report experiments achieving up to 60% increase in recall when compared to state-of-the-art blocking [5], and up to 5x boost in efficiency. Finally, we observe that pBlocking can yield up to 70% increase on the F-score of the final ER result, thus confirming the substantial benefits of our approach.

Outline.
The rest of this paper is organized as follows. Sections 2 and 3 provide preliminary discussions and a high-level description of the pBlocking approach. Sections 4 and 5 explain our block intersection and block scoring methods, respectively. Section 6 provides theoretical analysis of pBlocking's effectiveness and Section 7 provides extensive experimental results and key takeaways. Section 8 discusses the related work and we conclude in Section 9.

Table 2: Notation

V: Collection of records
C: Collection of clusters
B: Block, a subset of records, B ⊆ V
p_m(u, v): Similarity between u and v
P = (V, A′): Blocking graph, A′ ⊂ V × V
φ: Feedback frequency
p(B): Probability score of a block B
u(B): Uniformity score of block B
H(B): Entropy of block B
H: Block hierarchy
G_t: Random geometric graph
γ: Fraction of nodes used for scoring blocks
μ_g: Expected similarity of a matching edge
μ_r: Expected similarity of a non-matching edge
2 Preliminaries

Table 2 summarizes the main symbols used throughout this paper. Let V be the input set of records, with |V| = n. Consider an (unknown) graph C = (V, E+), where (u, v) ∈ E+ means that u and v represent the same entity. C is transitively closed, that is, each of its connected components C ⊆ V is a clique representing a distinct entity. We call each clique a cluster of V, and refer to the partition induced by C as the ER ground truth.

Definition 1 (Pair Recall). Given a set of matching record pairs A′ ⊆ V × V, Pair Recall is the fraction of pairs (u, v) ∈ E+ that can be either (i) matched directly, because (u, v) ∈ A′, or (ii) indirectly inferred from other pairs (u, w1), (w1, w2), ..., (wc, v) ∈ A′ by connectivity.

A formal definition of the blocking task follows.
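Pair Recall per Definition 1 can be evaluated with a single union-find pass over the candidate pairs, since indirectly inferable pairs are exactly those falling in the same connected component. The following is a minimal Python sketch (the function and argument names are ours, not from the paper):

```python
def pair_recall(ground_truth_pairs, candidate_pairs, records):
    """Fraction of ground-truth pairs recoverable from candidate_pairs,
    either directly or by transitive connectivity (Definition 1)."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # union all candidate pairs into connected components
    for u, v in candidate_pairs:
        parent[find(u)] = find(v)

    hit = sum(1 for u, v in ground_truth_pairs if find(u) == find(v))
    return hit / len(ground_truth_pairs)
```

For example, with ground truth {(1,2), (2,3), (1,3)} and candidates {(1,2), (2,3)}, the pair (1,3) is inferred by connectivity, so the recall is 1.0.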
Problem 1 (Blocking Task). Given a set of records V, group records into possibly overlapping blocks B ≡ {B1, B2, ...}, B_i ⊆ V, and compute a graph P = (V, A′), where A′ ⊆ A, A ≡ {(u, v) : ∃ B_i ∈ B s.t. u ∈ B_i ∧ v ∈ B_i}, such that A′ is sparse (|A′| ≪ (n choose 2)) and A′ has high Pair Recall. We refer to P as the blocking graph.

The blocking graph P is the final product of blocking and contains all the pairs that can be considered for pair matching. The effectiveness of a blocking method is measured as the Pair Recall (PR) of (the set of edges in) P, and its efficiency as the number of edges in P for a given PR. Blocking methods consist of three sub-tasks as defined by [24]: block building, block cleaning and comparison cleaning. In the following, we describe each of these steps and the corresponding methods in the literature.

Block building (BB) takes as input V and returns a block collection B, by assigning each record in V to possibly multiple blocks. The popular standard blocking [20] strategy creates a separate block B_t for each token t in the records and assigns to B_t all the records that contain the token t. In order to tolerate spelling errors, q-grams blocking [11] considers character-level q-grams instead of entire tokens. Other strategies include canopy clustering [18] and sorted neighborhood [13]. Canopy clustering iteratively selects a random seed record r, and creates a new block B_r (or a canopy) with all the records that have a high similarity with r with respect to a given similarity function (e.g., using a subset of features [18]). We can use different similarity functions to build different sets of canopies. Sorted neighborhood sorts all the records according to multiple sort orders (e.g., each according to a different attribute [13]) and then slides a window w of tokens over each ordering, every time creating a new block B_w.
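For illustration, standard blocking can be sketched in a few lines of Python (the record ids and texts are a shortened version of Table 1; this is a minimal sketch, not the paper's implementation):

```python
from collections import defaultdict

def standard_blocking(records):
    """Create one block per token; assign each record to the blocks
    of all its tokens (schema-agnostic standard blocking)."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            blocks[token].add(rid)
    return blocks

cars = {
    "r1": "chevy corvette c6",
    "r2": "chevy corvette c6 navigation",
    "r3": "citroen c6 navigation",
}
B = standard_blocking(cars)
```

Here the 'c6' block contains all three records, while 'chevy' contains only the two Chevrolet ones, mirroring how token blocks overlap in Example 1.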
Blocks have the same number of distinct tokens, but the number of records in a block can vary significantly. Each of these techniques creates O(n) blocks.

Block cleaning (BC) takes as input the block collection B and returns a subset B′ ⊆ B by pruning blocks that may contain too many non-matching record pairs. Block cleaning is typically performed by assigning each block a score: B → ℝ with a block scoring procedure and then pruning blocks with low score. Traditional scoring strategies include functions of block sizes such as TF-IDF [7, 21].
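As an illustration of size-based block cleaning, the sketch below scores blocks with an IDF-style function, log(n/|B|), and keeps the highest-scoring ones. The block sizes mirror Figure 1b, while the record ids, the value of n, and the keep-fraction policy are illustrative assumptions, not the paper's method:

```python
import math

def clean_blocks(blocks, n, keep_fraction=0.5):
    """Score each block by an IDF-style function, log(n / |B|),
    and keep only the highest-scoring fraction of blocks."""
    scored = sorted(blocks.items(),
                    key=lambda kv: math.log(n / len(kv[1])),
                    reverse=True)
    kept = scored[:max(1, int(len(scored) * keep_fraction))]
    return dict(kept)

# block sizes loosely following Figure 1b (record ids are synthetic)
blocks = {"chevrolet": set(range(1717)), "corvette": set(range(272)),
          "navigation": set(range(43)), "chevy": set(range(11))}
cleaned = clean_blocks(blocks, n=2000)
```

Smaller blocks get higher IDF scores, so 'chevy' and 'navigation' survive while 'chevrolet' and 'corvette' are pruned, which is exactly the aggressive behavior that Example 1 warns about.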
Comparison cleaning (CC) takes as input the set A of all the intra-block record pairs in the block collection B′ (which is a subset of the intra-block record pairs in B) and returns a graph P = (V, A′), with A′ ⊆ A, by pruning pairs that are likely to be non-matching. Comparison cleaning is typically performed by assigning each pair a weight: A → ℝ and then pruning pairs with low weight. Weighting strategies include meta-blocking [22], possibly with active learning [29, 5]. In classic meta-blocking, weight(u, v) corresponds to the number of blocks in which u and v co-occur, based on the assumption that the more blocks a record pair shares, the more likely it is to be matching. The recent BLOSS strategy [5] employs active learning on top of the pairs generated by meta-blocking, and learns a classifier using features extracted from the blocking graph for further pruning.

We denote with B(X, Y, Z) a blocking strategy that uses the methods X, Y, and Z, respectively, for block building, block cleaning and comparison cleaning. The strategy used in our cars example (Example 1) can thus be denoted as B(standard blocking, TF-IDF, meta-blocking).

After blocking.
Typical ER algorithms include pair matching and entity clustering operations. Such operations label as "matching" the pairs referring to the same entity and "non-matching" otherwise, and typically require the use of a classifier [19] or a crowd [34]. Clustering consists of building a possibly noisy clustering C′ according to the labels, and can be done with a variety of techniques, including robust variants of connected components [31] and random graphs [9]. This noisy clustering is the final product of ER.

3 Overview of pBlocking

Analogous to traditional blocking methods, pBlocking takes as input a collection V of records and returns a blocking graph P. A high-level view of the methods introduced in pBlocking, for each of the main blocking sub-tasks of Section 2, is provided below. Such methods, unlike previous ones, can leverage feedback from partial ER results. (This assumption holds for block building methods such as standard blocking, q-grams blocking and sorted neighborhood with multiple orderings [13], and extends naturally to canopy clustering by using multiple similarity functions.)
Block building in pBlocking constructs new blocks arranged in the form of a hierarchy. First-level blocks are initialized with the blocks generated by a traditional method (e.g., standard blocking, sorted neighborhood, canopy clustering or q-grams blocking). Subsequent levels contain intersections of the blocks in the previous levels. pBlocking can use feedback from the partial ER output to build intersections such as 'corvette ∧ c6' that can lead to new, cleaner blocks, and to avoid bad intersections such as 'corvette ∧ navigation' that would not improve the fraction of matching pairs in P (Chevrolet Corvette C6 and Z6 are different entities). We discuss block intersection in Section 4.

Block cleaning in pBlocking prunes dirty blocks based on feedback-based scores. First-round scores are initialized with a traditional method (e.g., TF-IDF). Then, scores are refined based on feedback by combining two quantities: the fraction p(B) of matching pairs in a block B, and the block uniformity u(B), which captures the distribution of entities within the block (u(B) is the inverse of perplexity [17]). Since the goal of the blocking phase is to identify blocks that have a higher fraction of matching pairs and fewer entity clusters, we combine the above values as score(B) = p(B) · u(B). pBlocking can use feedback from the partial ER output to estimate p(B) and u(B), yielding high scores for clean blocks such as 'malibu' (high p(B) and high u(B)) and low scores for dirtier blocks such as 'navigation' (low p(B) and low u(B)) and 'c6' (low u(B)). We discuss block scoring in Section 5.

Finally, comparison cleaning in pBlocking is implemented with a traditional method such as meta-blocking.

Workflow.
Algorithm 1 describes the pBlocking workflow and how the introduced blocking methods can be used. We denote with pBlocking(X, Y, Z) a progressive blocking strategy that uses the methods X, Y and Z, respectively, for building the first level of the block hierarchy, initializing the block scores, and performing comparison cleaning, as described in Algorithm 1. In our cars example, we have pBlocking(standard blocking, TF-IDF, meta-blocking).

Algorithm 1: Our blocking method pBlocking
Require: Records V; methods X, Y, Z for each blocking step. Default: X = standard blocking, Y = TF-IDF, Z = meta-blocking.
Ensure: Blocking graph P
1: C′ ← ∅
2: B ← build the first level of the block hierarchy with method X
3: scores ← initialize block scores using method Y
4: P ← block cleaning and comparison cleaning with method Z
5: P_new ← ∅
6: for round = 2; round ≤ 1/φ ∧ P ≠ P_new; round++ do
7:   while ER progress is less than φ do
8:     C′ ← execute an incremental step of method W for pair matching and clustering on P
9:   scores ← update the block scores according to C′  // feedback
10:  B ← update the block hierarchy based on scores
11:  P ← P_new
12:  P_new ← block cleaning and comparison cleaning with Z
13: return P

We first initialize the set of clusters C′, the block hierarchy and the block scores (lines 1–3). The next step (line 4) consists of computing the first version of the blocking graph P according to the selected method for comparison cleaning (e.g., meta-blocking). The graph P is then progressively updated, round after round (lines 6–12). In order to activate the feedback mechanism, pBlocking needs to interact with an ER algorithm W for pair matching and clustering operations (lines 7–8). Algorithm W is executed over P until it makes a progress of φ, with φ ∈ [0, 1], that is, until φ · n log n record pairs have been processed since the previous round. (For algorithms such as [33], progress can be defined as a fraction φ · n of processed records since the previous round.) At that point, the algorithm W is interrupted, and C′ is updated (line 8) and sent as feedback to all of pBlocking's components. Based on such feedback, we update the function score(B) = p(B) · u(B) (line 9) and construct new blocks in the form of a hierarchy (line 10). Higher-score blocks are used to enumerate the most promising record pairs and generate the updated blocking graph P_new (lines 11–12). When either the maximum number of rounds 1/φ has been reached (setting φ = 1 is the same as switching off the feedback loop) or the blocking graph has converged (P = P_new), pBlocking terminates by returning P.

We present a formal analysis of the effectiveness of pBlocking in Section 6, and we refer to Section 7 for experiments. Due to its robustness to different choices of the pair matching algorithm W, we do not include W in pBlocking's parameters (differently from X, Y, Z). Natural choices for W include progressive ER strategies that can process P in an online fashion and compute C′ incrementally [32, 33, 19]. However, traditional algorithms such as [7] can be used as well, by adding incremental ER techniques [12, 35] on top.

For efficiency, it is crucial to ensure that the total time and space taken to compute P is close to linear in n. Since every round of pBlocking comes with its own time and space overhead, we first describe how to bound the complexity of every round, and then discuss how to set the parameter φ in Algorithm 1 (and thus the maximum number of rounds) so as to bound the complexity of the entire workflow.

Round Complexity. pBlocking implements the following strategies to decrease the overhead of each round.
Efficient block cleaning. We compute the block scores by sampling Θ(log n) records from each of the top O(n) high-score blocks computed in the previous round.

Efficient comparison cleaning. For simplicity, we build P by enumerating at most Θ(n log n) intra-block pairs, processing blocks in non-increasing order of block score.

Based on the above discussion, we have Lemma 1.

Lemma 1.
A single round of pBlocking(X, Y, Z), such as pBlocking(standard blocking, TF-IDF, meta-blocking), has O(n log n) space and time complexity.

Proof. We first show that the total feedback is limited to O(n log n) space complexity, even though it considers all transitively inferred matching and non-matching edges, which can be Ω(n log n). For the matching pairs, we store all the records with an entity id such that any pair of records that have been resolved share the same id. This requires O(n) space in the worst case and captures all the matching edges that have been identified in the ER output. For the non-matching pairs, we store a non-matching edge between their entity ids. Since the maximum number of pairs returned by pBlocking is limited to O(n log n), the total number of pairs compared in each round, and thus the number of non-matching edges stored, is also O(n log n). Then, we analyze the complexity of using feedback for the BB and BC tasks. Since the maximum number of blocks considered in any round for the scoring component is O(n) and the scoring mechanism samples O(log n) pairs from each block, the total number of edges enumerated for block scoring and building is O(n log n). Since the maximum number of pairs for inclusion in the graph P is also O(n log n), a single round of pBlocking outputs P in O(n log n) total work.

Workflow Complexity.
As discussed in Section 6, φ can be set to a small constant fraction. Thus, along with Lemma 1, this guarantees an O(n log n) complexity for the entire workflow. Experimentally, a smaller φ value yields higher final recall, thus as a default we set φ = 0.01, yielding a maximum of 100 rounds. Although such a φ value gets the best trade-off between effectiveness and efficiency in our experiments, we also observe that slight variations of its setting do not affect the performance much (Section 7), demonstrating the robustness of pBlocking.

4 Block Building by Intersection

One of the major challenges of block building (BB) is that, when generating candidate pairs that capture matches, it can also generate a number of non-matching pairs. This phenomenon is highly prevalent in datasets with very few matching pairs. To overcome this challenge, our block building by intersection algorithm takes a collection of blocks B1, ..., Bm built by a traditional method for BB and creates smaller clean blocks out of large dirty ones, thus contributing to the recall of the blocking graph without adding extra non-matching pairs. An intersection block hierarchy H is constructed as follows. Let the first layer be B1, ..., Bm. Then blocks in layer L_i consist of the intersection of i distinct blocks in the first layer.

Algorithm 2: Block Layers Creation
Require: Set of records V, depth d
Ensure: Layer set {L1, ..., Ld}
for i = 1; i ≤ d; i++ do
  L_i ← ∅
processed ← ∅
for v ∈ V do
  blockLst ← getBlocks(v)
  for i = 2; i < d; i++ do
    for B = {B_j : B_j ∈ blockLst}, |B| = i do
      B′ ← ∩_{B_j ∈ B} B_j
      if B′ ∉ processed then
        L_i.append(B′)
        processed.append(B′)
    blockLst ← L_i

Example 2.
Consider our cars example in Section 1, and the blocks corresponding to tokens 'corvette' and 'c6', namely B_corvette and B_c6. A sample block in the second level of H is B_{corvette,c6} = B_corvette ∩ B_c6. When we build the new block, we only include records containing the two tokens 'corvette' and 'c6' (possibly non-consecutively), thus obtaining a cleaner block than the original ones.

Refined blocks.
We refer to the newly created block as a refined block, and to the intersecting blocks as parent blocks. Not all the refined blocks are useful. We require one of the following correlation-based conditions to hold to decide whether a refined block B_{i,j} must be kept in H:

• score(B_{i,j}) > score(B_i) · score(B_j), that is, the score of the refined block is higher than the combined score of the parent blocks.

• The existence of a randomly chosen record r in blocks B_i and B_j is positively correlated, i.e., Pr[r ∈ B_{i,j}] = |B_{i,j}|/n > Pr(r ∈ B_i) · Pr(r ∈ B_j), which simplifies to |B_{i,j}| > |B_i| · |B_j| / n. For example, the number of common records in the blocks corresponding to tokens 'c6' and 'corvette' is much higher than the number of common records in the blocks corresponding to 'navigation' and 'c6'.

Suppose the maximum depth of the hierarchy is d, which is a constant. The construction of refined blocks can take O(n^d) time if the number of blocks considered in the first layer is O(n). For efficiency, we iterate over the records (linear scan) and, for each record r, we consider all pairs of blocks that contain r as candidates to generate blocks in the different levels of the hierarchy. The following lemma bounds the total number of refined blocks across the hierarchy.

Algorithm 3: Layer Cleaning
Require: Layer set {L1, ..., Ld}
Ensure: Cleaned layer set {L1, ..., Ld}
for i = 2; i < d; i++ do
  for block ∈ L_i do
    parentLst ← getParents(block)
    if ∏_{p ∈ parentLst} score(p) < score(block) or ∏_{p ∈ parentLst} (|p|/n) < |block|/n then
      continue
    else
      L_i.remove(block)

Lemma 2.
The number of blocks present in H is O(n) if each record r is present in a constant number of blocks.

Proof. Our algorithm considers each record u ∈ V and generates intersection blocks by performing the conjunction of blocks that contain the record u. Suppose the record u is present in γ_u blocks in the first layer. Then the maximum number of blocks present in H that contain u is Σ_{i=1}^{d} (γ_u choose i). Assuming γ_u is a constant, the maximum number of blocks in the hierarchy is n · Σ_{i=1}^{d} (γ_u choose i) = O(n).

Refinement algorithm.
We are now ready to describe pBlocking's intersection method for building the block hierarchy. Our method has two steps:

• (Alg. 2) The first step creates all possible blocks considering the intersection search space.

• (Alg. 3) The cleaning phase removes the blocks that do not satisfy the correlation criterion described above.

Algorithm 2 describes the creation step, which iterates over all the records in the corpus and creates all possible blocks per record. The list of all blocks to which a record belongs is constructed (denoted by blockLst) and the new blocks are added in different layers. The layer of the new block depends on the number of intersecting blocks that constitute the new block. Then, the cleaning step in Algorithm 3 iterates over the different layers and keeps only the blocks that satisfy the score or size requirements. For a block in layer q, getParents() identifies the two blocks in layer (q − 1) whose conjunction generates the block being considered. If these parents have been removed during the cleaning phase, then their parents are considered and the process is continued recursively until we end up at the ancestors present in the list of blocks.

Block Layers Creation (Alg. 2) constructs all the blocks in the form of a hierarchy and Layer Cleaning (Alg. 3) deactivates the blocks that do not satisfy the correlation requirements. Since the result of block layers creation does not change across pBlocking iterations, decoupling the creation component from the cleaning component (which changes dynamically) allows for more efficient computation.
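The two steps can be sketched in Python for a depth-2 hierarchy using the size-based correlation condition, |B_{i,j}| > |B_i| · |B_j| / n (the block names, record ids and the value of n are illustrative; the actual algorithms also handle deeper layers and the score-based condition):

```python
from collections import defaultdict
from itertools import combinations

def create_second_layer(first_layer):
    """Alg. 2 sketch (depth 2): scan records and intersect only pairs
    of first-layer blocks that co-occur in some record."""
    record_to_blocks = defaultdict(list)
    for name, members in first_layer.items():
        for r in members:
            record_to_blocks[r].append(name)
    layer2 = {}
    for r, names in record_to_blocks.items():
        for a, b in combinations(sorted(names), 2):
            if (a, b) not in layer2:
                layer2[(a, b)] = first_layer[a] & first_layer[b]
    return layer2

def clean_layer(first_layer, layer2, n):
    """Alg. 3 sketch: keep a refined block only if its parents are
    positively correlated, i.e. |B_ij| > |B_i| * |B_j| / n."""
    return {key: blk for key, blk in layer2.items()
            if len(blk) > len(first_layer[key[0]]) * len(first_layer[key[1]]) / n}

L1 = {"corvette": {"r1", "r2", "r4"},
      "c6": {"r1", "r2", "r3"},
      "navigation": {"r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"}}
L2 = clean_layer(L1, create_second_layer(L1), n=10)
```

With these (made-up) sizes only 'c6' ∧ 'corvette' survives: its two records exceed the 3 · 3 / 10 threshold, while the intersections with the large 'navigation' block do not.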
Time complexity.
Assuming the depth of the hierarchy is a constant, Algorithms 2 and 3 operate in time linear in the number of records n. Block refinement takes 3 minutes for a dataset with 1M records in our experiments.

5 Block Scoring

Let A′ ⊂ V × V be the pairs selected by the blocking phase at a given point (we recall that A′ is the edge set of the blocking graph P = (V, A′)), and let each considered pair (u, v) ∈ A′ have a similarity value denoted by p_m(u, v). A block B ⊆ V refers to a subset of records. Using this notation, we discuss the different methods for scoring blocks and how the scores converge with feedback for effective ER performance.

Block scoring.
Block scoring helps to distinguish informative blocks based on their ability to capture records from a single cluster. By selecting pairs within informative blocks, downstream ER operations can focus on record pairs that have a high probability of being a match. The most common mechanism used in the literature is TF-IDF: it assigns block scores inversely proportional to the block size, prioritizing smaller blocks over larger ones. If the data set has small clusters, such a simple method can work well. However, if the data set has a skewed cluster size distribution, some large blocks are just uninformative (and are rightfully less preferred by TF-IDF), but others can represent a large cluster and thus should stand out in the scoring. Distinguishing these blocks before pair matching can be difficult, but pBlocking provides a way to leverage the feedback.

Specifically, the scoring algorithm of pBlocking prioritizes blocks having (a) a high fraction of matching pairs, measured as the matching probability within a block, and (b) a small number of clusters (especially larger clusters), measured as uniformity (a function of the entropy of the cluster distribution within a given block B). Lower entropy, and hence lower diversity, indicates the representativeness of B towards a particular cluster, as opposed to higher entropy values, which indicate the presence of many fragmented clusters.

More formally, the matching probability score identifies the probability that a randomly chosen pair (u, v), with u, v ∈ B, refers to the same entity, and is defined as follows.

Definition 2 (Matching Probability score p(B)). The value p(B) is defined as the fraction of matching pairs within a block B.

The block uniformity u(B) captures the perplexity of the cluster distribution within B, measured in terms of its entropy.

Definition 3 (Cluster Entropy H(B)). The cluster entropy of a block, H(B), refers to the entropy of the cluster distribution when restricted to the records present in block B.
Mathematically, H(B) = −Σ_{C ∈ C} p_C log p_C, where p_C = |C ∩ B| / |B| refers to the probability that a randomly chosen node from B belongs to cluster C.

Using H(B), the block uniformity score is defined as follows.

Definition 4 (Block Uniformity u(B)). The block uniformity u(B) = e^{−H(B)} is the inverse of the perplexity [17] of the cluster distribution within the block, where perplexity refers to the exponential of the cluster distribution entropy.

Example 3.
Suppose that we know that a block B contains records of two clusters C1 and C2, and thus we can compute the uniformity of B exactly. If the two clusters are perfectly balanced in B, i.e., |C1 ∩ B| = 0.5 · |B| and |C2 ∩ B| = 0.5 · |B|, the entropy is H(B) = −0.5 log 0.5 − 0.5 log 0.5 ≈ 0.69, and thus u(B) = e^{−H(B)} = 0.5. If there is some skew, e.g., |C1 ∩ B| = 0.9 · |B| and |C2 ∩ B| = 0.1 · |B|, then the entropy is lower, H(B) = −0.9 log 0.9 − 0.1 log 0.1 ≈ 0.33, and the uniformity is higher, u(B) ≈ 0.72. In the extreme case where C1 ∩ B = B and C2 ∩ B = ∅, H(B) = 0 and u(B) = 1.

Note that when resolving two duplicate-free datasets where all clusters are of size 2 (also known as Record Linkage), the entropy increases with block size, thus block uniformity yields comparable results to traditional TF-IDF.

Since the goal of block scoring is to identify blocks that have high matching probability and high uniformity, we multiply the two values to get a final estimate of the block score.
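The entropy and uniformity computations can be sketched in a few lines (a minimal Python sketch; `cluster_counts` holds the per-cluster record counts |C ∩ B| within the block):

```python
import math

def uniformity(cluster_counts):
    """u(B) = exp(-H(B)), where H(B) = -sum p_C log p_C over the
    clusters represented in block B (Definitions 3 and 4)."""
    total = sum(cluster_counts)
    h = -sum((c / total) * math.log(c / total)
             for c in cluster_counts if c > 0)
    return math.exp(-h)
```

Two perfectly balanced clusters give u(B) = 0.5, a pure block gives u(B) = 1, and a 90/10 skew lands in between, closer to 1, matching the worked numbers above.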
Definition 5 (Block Score score(B)). The score of a block B, score(B), is defined as the product of the matching probability score and the uniformity score of B. That is, score(B) = p(B)·u(B).

Next, we describe the algorithm to estimate these components of the block score. The exact values of matching probability and block uniformity require complete ER results. However, pBlocking estimates these scores initially with the similarity estimates of every pair of records and refines them with additional feedback from partial ER results.
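A minimal sketch of the feedback-refined matching probability estimate (record and pair names are hypothetical; `p_m` holds prior similarity-based estimates, and `feedback` holds pairs already resolved by the ER phase, which override the prior):

```python
from itertools import combinations

def matching_probability(block, p_m, feedback):
    """p(B): average match probability over all record pairs in the block.

    p_m: dict mapping a (sorted) record pair to its prior estimate.
    feedback: dict mapping resolved pairs to True (match) / False (non-match).
    """
    pairs = list(combinations(sorted(block), 2))
    total = 0.0
    for pair in pairs:
        if pair in feedback:
            total += 1.0 if feedback[pair] else 0.0  # resolved pairs: 1 or 0
        else:
            total += p_m.get(pair, 0.0)              # unlabelled pairs: prior
    return total / len(pairs) if pairs else 0.0

block = {"r1", "r2", "r3"}
p_m = {("r1", "r2"): 0.8, ("r1", "r3"): 0.4, ("r2", "r3"): 0.6}
print(matching_probability(block, p_m, feedback={}))            # prior only
print(matching_probability(block, p_m, {("r1", "r3"): False}))  # with feedback
```

With no feedback the score is the plain average of priors (0.6 here); declaring (r1, r3) a non-match lowers it, illustrating how partial ER output sharpens the estimate.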
Matching probability score. The matching probability score is estimated as the average matching similarity of the pairs of records within the block, i.e.:

p(B) = Σ_{u,v ∈ B} p_m(u, v) / (|B| choose 2)

where p_m(u, v) is estimated as follows:

• for pairs declared as matches, we set p_m(u, v) = 1;
• for pairs declared as non-matches, we set p_m(u, v) = 0;
• for unlabelled pairs, we use the p_m values computed by common similarity metrics (e.g., via Jaccard similarity or the similarity-to-probability mapping as in [26]).

Block uniformity estimation.
Estimating the uniformity score requires the cluster size distribution in B, which is harder to infer from the prior similarity values. We next describe a mechanism to estimate the entropy H(B) needed to compute the uniformity score. We consider each record u ∈ B and the cluster C_u that contains u. We are interested in computing |C_u ∩ B| / |B| in order to compute the entropy H(B). Instead, we compute the expected size of |C_u ∩ B| as E_u = E[|C_u ∩ B|] = Σ_{v ∈ B} p_m(u, v), based on the p_m values of the edges incident on u. We compute the expected cluster size for every record u ∈ B and sort the records in non-increasing order of E_u. Let L be the sorted list and let u₁ be the first record in L, that is, the record with the highest expected cluster size in B. In expectation, u₁ has E_{u₁} records in B that belong to C_{u₁}, and all these records must have similar expected cluster sizes as well. We put u₁ and the next ⌊E_{u₁}⌋ records from L into a set S_{U₁}, assuming that they belong to the same cluster C_{u₁}. We recurse on L \ S_{U₁} until a partition {S_{U₁}, S_{U₂}, . . .} of the block is generated. The size of each partition can be thought of as a rough estimate of the true cluster distribution in B and is used to calculate the entropy.

Example 4.
Consider a block B with |B| = 10. Let [u₁, u₂, . . . , u₁₀] be the corresponding list L of records, sorted in non-increasing E_{u_i} values. If E_{u₁} = Σ_{i ∈ 2…10} p_m(u₁, u_i) = 6.2, we set S_{U₁} = {u₁, . . . , u_{1+⌊E_{u₁}⌋}} = {u₁, . . . , u₇} and then consider the next record in L, which is u₈. If E_{u₈} = Σ_{i ∈ {9,10}} p_m(u₈, u_i) = 2, we set S_{U₂} = {u₈, . . . , u₁₀} and then finish. As |S_{U₁}| = 0.7·|B| and |S_{U₂}| = 0.3·|B|, we estimate u(B) = e^{−(−0.7 log 0.7 − 0.3 log 0.3)} ≈ 0.54.

The value returned by this mechanism is generally an under-estimate of the true entropy H(B), but in practice it approaches H(B) quickly with increasing feedback data and turns out to be very efficient. Section 6.2 discusses this convergence rate in different application scenarios.

Efficient block cleaning.
Traditional scoring strategies such as TF-IDF are based on block size computation and thus operate in linear time. Computing our score(B) values requires instead processing intra-block pairs and thus yields potentially quadratic computation. Hence, we sample Θ(log n) records from each block for its score computation. This strategy operates in Θ(log² n) time per block and takes less than 1 minute for a data set with 1M records in our experiments. Our sampling strategy gives an approximation within a factor of (1 + ε) of the matching probability scores estimated using all the records within each block (Lemma 7).

In this section we present a theoretical analysis of the effectiveness of pBlocking. We first analyze the pair recall of blocking in the absence of feedback by considering a natural generative model for block creation. Next we analyze the effect of feedback on block scoring and the final recall.
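Before turning to the analysis, the uniformity-estimation mechanism described above (expected cluster sizes, sorting, greedy grouping) can be sketched in code. The p_m matrix below is a hypothetical input chosen to reproduce the 7/3 split of Example 4; this is an illustrative sketch, not the authors' exact implementation:

```python
import math
from itertools import combinations

def estimate_uniformity(block, p_m):
    """Estimate u(B) = e^{-H(B)} via greedy partitioning: sort records by
    expected cluster size E_u = sum_v p_m(u, v), group the top record with
    its next floor(E_u) records, and recurse on the rest."""
    def pm(u, v):
        return p_m.get((u, v), p_m.get((v, u), 0.0))

    expected = {u: sum(pm(u, v) for v in block if v != u) for u in block}
    remaining = sorted(block, key=lambda u: expected[u], reverse=True)
    sizes = []
    while remaining:
        head, rest = remaining[0], remaining[1:]
        take = 1 + int(sum(pm(head, v) for v in rest))  # head + floor(E_head)
        sizes.append(len(remaining[:take]))
        remaining = remaining[take:]
    n = sum(sizes)
    h = -sum((s / n) * math.log(s / n) for s in sizes)  # entropy of partition
    return math.exp(-h), sizes

# Hypothetical p_m: records u1..u7 are pairwise certain matches, as are u8..u10.
block = [f"u{i}" for i in range(1, 11)]
p_m = {}
for group in (block[:7], block[7:]):
    for a, b in combinations(group, 2):
        p_m[(a, b)] = 1.0

u_B, sizes = estimate_uniformity(block, p_m)
print(sizes, round(u_B, 2))  # partition sizes [7, 3], uniformity ~0.54
```

The recovered partition sizes (7 and 3 out of 10) yield the same uniformity estimate as Example 4.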
We start with the following basic lemma.
Lemma 3.
The blocking graph P = (V, A′) contains a spanning tree for each clique C of C = (V, E⁺) iff the Pair Recall is 1.

Proof. If A′ contains a spanning tree for each clique C, then any pair (u, v) ∈ A′ ∩ E⁺ contributes directly to the recall. All pairs of records (u, v) that refer to the same entity, (u, v) ∈ E⁺, but are not present in A′ can be inferred from the edges in the spanning tree using transitivity, ensuring Pair Recall = 1. For the converse, assume that ∃C ∈ C such that A′ does not contain a spanning tree over the matching edges. This implies that C is split into multiple components (say C₁, C₂) when restricted to A′ ∩ E⁺ edges. In this case, the matching edges joining these components, {(x, y), ∀x ∈ C₁, y ∈ C₂}, cannot be inferred, as none of these edges are processed by the mentioned ER operations, yielding a pair recall of P less than 1.

Our probabilistic model for block creation is motivated by the standard blocking [20], sorted neighborhood [13] and canopy clustering [18] algorithms, which aim to generate blocks that capture high-similarity candidate pairs. This model of block generation is closely related to Random Geometric Graphs [27], which were proposed by Gilbert in 1961 and have been widely used to analyze spatial graphs.

Definition 6 (Random Geometric Graphs). Let S_t refer to the surface of a t-dimensional unit sphere, S_t ≡ {x ∈ R^{t+1} : ||x|| = 1}. A random geometric graph G_t(V, E) of n vertices V has parameters t ∈ Z⁺ and a real number r ∈ [0, 2]. It assigns each vertex i ∈ V to a point chosen independently and uniformly at random within S_t, and any pair of vertices i, j ∈ V are connected if the distance between their respective points is less than r.

Now, we define the probabilistic block generation model.
Definition 7 (Probabilistic Block Generation). The block generation model places the records u ∈ V independently and uniformly at random within S_t. Every record u constructs a ball of volume (α log n / n) with u as the center, where α is a given parameter, and all points within the ball are referred to as block B_u.

The set of points present within a ball B_u can be seen as high-similarity points that would have been chosen as blocking candidates in the absence of feedback. Our probabilistic block generation model constructs n blocks, one for each record, and every pair of records that co-occur in a block B_u, u ∈ V, has an edge in the blocking graph P_g(V, E) (subscript g to emphasize the generative model). Next we analyze the pair recall of P_g(V, E).

Notation.
Let d(u, v) denote the distance between records u and v, and let r_ε denote the radius of an ε-volume ball in t dimensions, so that ε = O(r_ε^t). Under these assumptions we first show that the expected number of edges in the blocking graph P_g is at least α(n − 1) log n / 2, and then that P_g(V, E) has recall ≪ 1.

Lemma 4. The blocking graph P_g(V, E) contains at least α(n − 1) log n / 2 candidate pairs in expectation.

Proof. Each record u ∈ V constructs a spherical ball of volume α log n / n with u as the center, and all points within the ball are added as neighbors of u in the blocking graph. Hence, the expected number of neighbors of u within the ball is α(n − 1) log n / n. There are a total of n such blocks (one ball per record) and each candidate pair (u, v) is counted at most twice (once for the block B_u and once for the block B_v). Hence there are at least α(n − 1) log n / 2 candidate pairs in expectation. Notice that this analysis ignores the candidate pairs (u, v) that are more than r_{α log n/n} from each other but are still connected in the blocking graph. This happens if they are present together in another block centered at some w ∈ V \ {u, v}, that is, ∃w such that d(u, w) ≤ r_{α log n/n} and d(v, w) ≤ r_{α log n/n}. This shows that the total number of candidate pairs in the blocking graph is at least α(n − 1) log n / 2.

Additionally, P_g(V, E) has the following property:

Lemma 5. A blocking graph P_g is a subgraph of a random geometric graph G_t with r = 2 r_{α log n/n}.

Proof. Following the construction of the blocking graph, if the distance between a pair of vertices u, v ∈ V is at most r_{α log n/n}, then (u, v) ∈ E. Similarly, for any pair of vertices u, v ∈ V with d(u, v) > 2 r_{α log n/n}, we have (u, v) ∉ E. However, if r_{α log n/n} < d(u, v) ≤ 2 r_{α log n/n}, the pair (u, v) belongs to P_g only if ∃w ∈ V such that d(u, w) ≤ r_{α log n/n} and d(v, w) ≤ r_{α log n/n}. This shows that the blocking graph P_g is a subgraph of the random geometric graph in which a pair of vertices (u, v) is connected iff d(u, v) ≤ 2 r_{α log n/n}.

This means that if G_t has suboptimal recall then P_g also has poor recall, and hence we analyze the recall of G_t with r = 2 r_{α log n/n}. Lemma 3 shows that the blocking graph achieves recall 1 only if it contains a spanning tree of each cluster. Hence, we analyze the formation of spanning trees in G′_t = G_t(V, E ∩ E⁺), that is, G_t restricted to matching edges. We show the following result.

Lemma 6. The graph G_t restricted to the matching edges in the ground truth E⁺ splits any cluster C with |C| = o(n/α) into multiple components.

Proof. By the connectivity result of [27], a random geometric graph G_t of n nodes is disconnected if the expected degree of its nodes is < log n; moreover, disconnection splits the graph into many smaller components. Therefore, a cluster C is disconnected in G′_t = G_t(V, E ∩ E⁺) if the degree of each of its vertices is < log |C|. The expected degree of a record u ∈ C restricted to G′_t is O(|C| · (α log n / n)) = o(log n) if |C| = o(n/α). Hence, the expected degree of each node within a cluster C is o(log |C|), leading to the formation of disconnected components within C.

Theorem 1.
A blocking graph P_g(V, E) generated according to the probabilistic block model has recall < 1 unless all clusters have size Θ(n), assuming α is a constant.

Proof. Lemma 6 shows that a cluster C of size o(n/α) is split into multiple disconnected components when restricted to matching edges. Hence, the blocking graph P_g does not contain a spanning tree of C and has recall less than 1 (Lemma 3). Since the cluster C is broken into many small components, the drop in recall is also significant.

Remark. The analysis extends to less noisy data, e.g., when only a constant fraction of the records are placed randomly on the unit sphere and the remaining records are grouped together according to the cluster they belong to. Our analysis exposes the lack of robustness of performing blocking without feedback.
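The recall criterion of Lemma 3 is easy to check computationally: pair recall is 1 exactly when every ground-truth cluster stays connected in the blocking graph restricted to matching edges. A small union-find sketch (illustrative; records, edges and the helper names are hypothetical):

```python
class DSU:
    """Disjoint-set union over a fixed node set, with path halving."""
    def __init__(self, nodes):
        self.parent = {u: u for u in nodes}
    def find(self, u):
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]
            u = self.parent[u]
        return u
    def union(self, u, v):
        self.parent[self.find(u)] = self.find(v)

def pair_recall(nodes, candidate_edges, matching_edges):
    """Fraction of matching pairs recoverable from candidates via transitivity."""
    dsu = DSU(nodes)
    for (u, v) in candidate_edges & matching_edges:  # only matching candidates merge clusters
        dsu.union(u, v)
    recovered = sum(1 for (u, v) in matching_edges if dsu.find(u) == dsu.find(v))
    return recovered / len(matching_edges)

nodes = {1, 2, 3, 4}
matches = {(1, 2), (1, 3), (2, 3)}   # records 1, 2, 3 form one entity
spanning = {(1, 2), (2, 3)}          # a spanning tree of that cluster
print(pair_recall(nodes, spanning, matches))  # 1.0: (1,3) inferred transitively
print(pair_recall(nodes, {(1, 2)}, matches))  # 1/3: the cluster is split
```

A spanning tree of the cluster suffices for recall 1, while dropping one of its edges splits the cluster and recall falls, exactly as the lemma states.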
In this section we analyze the pair recall of blocking when employed with pBlocking. For this analysis we consider a noisy edge similarity model for p_m(u, v) that builds on the edge noise model studied in prior work on ER [8].

Definition 8 (Noisy edge model). The noisy edge model defines the similarity of a pair of records with parameters θ ∈ (0, 1), β = Θ(log n) and β′ = Θ(log n). A matching edge (u, v) ∈ E⁺ has a similarity distributed uniformly at random within [θ, 1] with probability 1 − β/n, and with the remaining probability its similarity is distributed uniformly within [0, θ). A non-matching edge has a similar distribution on similarity values, with β′ in place of β.

When β ≪ β′, the matching probability score of a block with a higher fraction of matching edges is much higher than that of a block with fewer matching edges, and the pBlocking algorithm will consider blocks in the correct order even in the absence of feedback. The most challenging case is when non-matching edges are generated with a distribution similar to that of matching edges, that is, when β and β′ are close. We define a random variable X(u, v) to refer to the edge similarity distributed according to the noisy edge model. Following this notation, let μ_g and μ_r denote the expected similarity of a matching and a non-matching edge, respectively:

μ_g = (1 − β/n)·(1 + θ)/2 + (β/n)·(θ/2),

and μ_r has the same form with β′ in place of β.

We show that the feedback-based block score, initialized with TF-IDF weights, achieves perfect recall with a feedback of Θ(n log n) pairs, assuming that the ER phase makes no mistakes on the pairs that it processes (helping to ensure the correctness of partially inferred entities) and that the feedback from the ER phase is distributed randomly across the edges within a block. We also discuss the extension where the feedback is biased towards pairs from large entity clusters and high-similarity pairs. In those scenarios, pBlocking's scoring mechanism converges even quicker, leveraging the larger feedback due to transitivity.

Effect of Sampling.
First, we show that sampling Θ(log n) records from a block gives an approximation within a factor of (1 + ε) of the matching probability score computed using all the records.

Lemma 7. For a block B with |B| > c log n, the matching probability score of B estimated by sampling Θ(log n / ε) records randomly is within a factor [(1 − ε), (1 + ε)] of p(B) with probability 1 − o(1), where p(B) is the score using all |B| records.

Proof. Consider a block B with more than c log n records. Let X(u, v) denote the edge similarity of a pair (u, v) according to the noisy edge model. The matching probability score of B when considering the complete block is (1 / (|B| choose 2)) Σ_{u,v ∈ B} X(u, v). The expected score of the block (μ_B) is

(1 / (|B| choose 2)) E[Σ_{u,v ∈ B} X(u, v)] = (1 / (|B| choose 2)) Σ_{(u,v) ∈ E⁺} E[X(u, v)] + (1 / (|B| choose 2)) Σ_{(u,v) ∈ E⁻} E[X(u, v)] = (1 − α) μ_g + α μ_r,

where α is the fraction of non-matching pairs in the block B. For a sample S of c log n / ε′ records, where ε′ = ε / (2 + ε), the expected probability score (μ_S) is also (1 − α) μ_g + α μ_r:

(1 / (|S| choose 2)) E[Σ_{u,v ∈ S} X(u, v)] = (1 / (|S| choose 2)) Σ_{(u,v) ∈ E⁺} E[X(u, v)] + (1 / (|S| choose 2)) Σ_{(u,v) ∈ E⁻} E[X(u, v)] = (1 − α) μ_g + α μ_r.

Using Hoeffding's inequality [14],

Pr[(1 / (|S| choose 2)) Σ_{u,v ∈ S} X(u, v) ≤ (1 − ε′) μ_S] ≤ e^{−2 ε′² μ_S² (|S| choose 2)} ≤ e^{−2 log n} = 1/n².

Using the same argument, we can show that

Pr[(1 − ε′) μ_S ≤ (1 / (|S| choose 2)) Σ_{u,v ∈ S} X(u, v) ≤ (1 + ε′) μ_S] ≥ 1 − 2/n².

This shows that the probability score calculated on the sample S is within a factor of (1 − ε′) and (1 + ε′) of the expected score with probability 1 − o(1). The probability score of B computed on all records is also within a factor of (1 − ε′) and (1 + ε′) of the expected value μ_S. Therefore, the estimated score on sampling guarantees an approximation within a factor of (1 + ε′)/(1 − ε′) = 1 + 2ε′/(1 − ε′) = 1 + ε with high probability.

The above lemma extends to block uniformity because the p_m values are used analogously for expected cluster sizes. In Lemma 8 we show how to set the constant within the Θ notation based on the level of noise in the p_m values.

To prove the convergence of pBlocking, we first estimate lower and upper bounds on the matching probability score of a block B in the presence of feedback, and show that a feedback of Θ(log n) pairs per block is enough to rank blocks with a larger fraction of matching pairs higher than blocks with fewer matching pairs. Our analysis first considers the blocks containing more than γ log n records (where γ is a large constant, say 12); we analyze the smaller blocks separately.

Convergence for large blocks.
First, we evaluate the converged block scores with a feedback F and derive the condition under which the block scores are in the correct order. For this analysis, we consider the fraction of matching edges for block score computation, but similar lemmas extend to the uniformity score calculation.

Lemma 8. For every block B with more than γ log n records, the matching probability score p(B) after a feedback F of O(log n) randomly chosen pairs is at most (1 − α)|F| / (γ log n choose 2) + 1.1 p′ (1 − |F| / (γ log n choose 2)) with probability 1 − 1/n², where α is the fraction of non-matching pairs in B, γ is a constant and p′ = μ_g (1 − α) + μ_r α.

Proof. For block scoring, pBlocking considers a sample S of γ log n records (where γ is a large constant), chosen so that the feedback F ⊆ S × S belongs to the sample. The expected number of matching edges identified with feedback over randomly chosen pairs is (1 − α)|F|. Let X(u, v) be a random variable for the similarity of the pair (u, v) and μ(u, v) its expected value. For |S| = γ log n, the expected total similarity of the non-feedback edges within S is

Σ_{u,v ∈ S, (u,v) ∉ F} μ(u, v) = Σ_{(u,v) ∈ E⁺} μ_g + Σ_{(u,v) ∉ E⁺} μ_r = ((γ log n choose 2) − |F|) (μ_g (1 − α) + μ_r α).

We use Hoeffding's inequality to bound the total similarity Σ X(u, v) of the T = (γ log n choose 2) − |F| edges which do not have feedback:

Σ_{u,v ∈ S, (u,v) ∉ F} X(u, v) ≤ (1 + δ) Σ_{u,v ∈ S, (u,v) ∉ F} μ(u, v)

with probability at least 1 − 1/n² after substituting δ = 0.1, since μ_r, μ_g > 1/2 and T = Θ(log² n). Hence, the matching probability score of B is at most ((1 − α)|F| / (γ log n choose 2)) + 1.1 p′ (1 − |F| / (γ log n choose 2)) with high probability.

Similarly, we prove a lower bound on the block score.

Lemma 9.
For every block B with |B| ≥ γ log n, the matching probability score after a feedback F of O(log n) record pairs in B is at least (1 − α)|F| / (γ log n choose 2) + 0.9 p′ (1 − |F| / (γ log n choose 2)) with probability 1 − 1/n², where p′ = μ_g (1 − α) + μ_r α and γ is a constant.

Now, we analyze different scenarios of edge noise to understand the trade-off between required feedback and noise.
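To make the noisy edge model concrete, the following sketch computes the means μ_g and μ_r and checks μ_g against a simulation (the parameter values θ, β, β′ are illustrative choices, not from the paper):

```python
import math
import random

def mu(theta, noise, n):
    """Expected similarity under the noisy edge model: with probability
    1 - noise/n the similarity is uniform in [theta, 1] (mean (1+theta)/2),
    otherwise uniform in [0, theta) (mean theta/2)."""
    q = noise / n
    return (1 - q) * (1 + theta) / 2 + q * theta / 2

def sample_similarity(theta, noise, n, rng):
    """Draw one similarity value from the model."""
    if rng.random() < 1 - noise / n:
        return rng.uniform(theta, 1.0)
    return rng.uniform(0.0, theta)

n, theta = 10_000, 0.5
beta, beta_prime = math.log(n), 20 * math.log(n)  # beta << beta', illustrative
mu_g, mu_r = mu(theta, beta, n), mu(theta, beta_prime, n)

rng = random.Random(0)
emp = sum(sample_similarity(theta, beta, n, rng) for _ in range(200_000)) / 200_000
print(mu_g, mu_r, emp)
```

With β ≪ β′ the two means separate (μ_g > μ_r), so blocks with more matching edges score higher even without feedback; as β′ approaches β the means coincide, which is the challenging regime the lemmas below address.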
Lemma 10.
For every pair of blocks B_c, B_d with more than γ log n records, where B_c has a 1 − α fraction of matching edges and B_d has a 1 − β fraction (with α < β), the matching probability score estimate of B_c is greater than that of B_d with probability 1 − 1/n², even in the absence of feedback, provided that 0.9((1 − α) μ_g + α μ_r) > 1.1((1 − β) μ_g + β μ_r).

Proof. Using Lemmas 8 and 9 with |F| = 0, we obtain the condition under which score(B_c) > score(B_d) holds with probability 1 − 1/n². To guarantee this for all blocks simultaneously, we take a union bound over the Θ(n²) pairs of blocks, giving a success probability of 1 − o(1).

The previous lemma covers the scenario where the noise is not high and the prior-based estimation of matching probability scores gives a correct ordering of blocks. Now, we consider the more challenging noisy scenario and show that Θ(log n) feedback per block is enough for a correct ordering.

Lemma 11. For every pair of blocks B_c, B_d with more than γ log n records, where B_c has a 1 − α fraction of matching edges and B_d has a 1 − β fraction (with α < β), the matching probability score estimate of B_c is greater than that of B_d with probability 1 − 1/n² whenever the ER phase provides an overall feedback of Θ(n log n) randomly chosen edges.

Proof. Using Lemma 9, score(B_c) ≥ (|F| / (γ log n choose 2))(1 − α) + 0.9(μ_g(1 − α) + α μ_r)(1 − |F| / (γ log n choose 2)), and using Lemma 8, score(B_d) ≤ (|F| / (γ log n choose 2))(1 − β) + 1.1(μ_g(1 − β) + β μ_r)(1 − |F| / (γ log n choose 2)), each with probability 1 − 1/n². Hence, score(B_c) > score(B_d) holds if |F| = c log n for a sufficiently large constant c. With a union bound over the (n choose 2) pairs of blocks, the score of any block B_c (with a higher fraction of matches) is higher than that of any block B_d (with a lower fraction of matches) with probability 1 − o(1). The total feedback needed to ensure Θ(log n) feedback on each block is Θ(n log n), as we consider Θ(n) blocks for scoring.

Convergence for small blocks.
The above analysis does not extend to blocks of size less than γ log n. However, all these blocks are ranked higher than the large blocks by TF-IDF. Hence, when pBlocking is initialized, the initial set of candidates includes all these blocks before any of the larger blocks. In the worst case there can be δn such blocks, for some constant δ, because our approach constructs a constant number of blocks per record (say δ). Thus, the maximum number of candidates contributed by the small blocks is δn · (γ log n choose 2), and all these candidates are considered in the first iteration of pBlocking. Following the discussion on small and large blocks, we prove the main result on the convergence of pBlocking.

Theorem 2. The pBlocking pipeline achieves perfect recall with a feedback of O(n log n) pairs spread randomly across blocks.

Proof. For blocks with more than γ log n records, Lemmas 10 and 11 show that a block with a higher fraction of matching pairs is ranked higher than a block with fewer matching pairs when provided with a feedback of Θ(n log n). Blocks with fewer than γ log n records are not covered by that analysis, but in the worst case these blocks generate O(n log n) candidates, as the maximum number of blocks considered is Θ(n). This ensures that a feedback of Θ(n log n) is sufficient for the stated result.

Discussion.
Lemma 11 considers the convergence of block scores when the feedback is provided randomly over Θ(log n) edges within a block. If the feedback is biased towards Θ(log n) non-matching edges, the scores of noisier blocks drop quicker and pBlocking converges faster. Similarly, if the ER algorithm queries pairs with higher similarity (e.g., edge ordering [34]) or grows clusters by processing nodes (e.g., node ordering [33]), providing larger feedback due to transitivity, this only accelerates the growth (reduction) in score of blocks with a higher (lower) fraction of matching pairs, leading to faster convergence.

Finally, for the presented analysis, we assumed that oracle answers are correct. Nonetheless, for a small amount of oracle errors, pBlocking keeps converging, only at a slightly slower rate, demonstrating robustness.

Table 3: Number of records n, number of clusters k (i.e., entities), size of the largest cluster |C₁|, total number of matches in the data set |E⁺|, and the reference to the paper where each data set appeared first.

dataset    n    k      |C₁|  |E⁺|   ref.  description
songs      1M   0.99M  2     146K   [6]   Self-join of songs with very few matches.
citations
products
cora
cars
camera
In this section we empirically demonstrate the ability of pBlocking to boost the efficiency and effectiveness of blocking and thus to improve the performance of ER. We also demonstrate the fast convergence of pBlocking, confirming our theoretical analysis in Section 6, and the robustness of pBlocking in different scenarios, including errors in ER results. This section is structured as follows.

• Section 7.2. We compare the efficiency and effectiveness of pBlocking to prior work, showing higher pair recall and faster running time on all the data sets.

• Section 7.3. We analyze pBlocking when used in conjunction with different ER methods, showing higher F-score (up to 60%) irrespective of the method of choice.

• Section 7.4. We study the dynamic performance of pBlocking and show its ability to converge monotonically to high effectiveness without compromising efficiency in different scenarios, including errors in ER results.

Before showing results we describe our experimental setup and the methods considered in our experiments.
Experimental set-up.
We implemented the algorithms in Java and the machine learning tools in Python. The code runs on a server with 500GB RAM. We used the Google Cloud Vision API (https://cloud.google.com/vision) to generate text descriptions of the image data (cars). For implementing the hierarchy, we observed that we can trim at a depth of 10 without any significant drop in performance.

Blocking methods.
We consider 8 strategies for the blocking sub-tasks described in Section 2 and combine them into 16 different pipelines. We study these pipelines with and without our pBlocking approach on top.

BB) We consider 4 methods for Block Building (BB) and follow the suggestions of [25] for their configuration. Standard blocking [20] (StBl) generates a new block for each text token in the data set. Q-grams blocking [11] (QGBL) generates a new block for each 3-gram of characters. Sorted neighborhood [13] (SoNE) sorts the tokens for each attribute and generates a new block for every sliding window of size 3 over these sort orders. Canopy clustering [18] (CaCl) generates a new block for each cluster of high-similarity records (calculated as unweighted Jaccard similarity). We construct multiple instances of canopies (blocks), one for each attribute (i.e., based on the similarity of record pairs with respect to that attribute) and one based on all attributes together.

BC) We consider 2 traditional block scoring methods for Block Cleaning (BC), dubbed TF-IDF [28] and uniform scoring (Unif). For comparison, each method retains the top-scored blocks containing up to M record pairs in total and then prunes the remaining blocks. We set the default M to 10 million. (We note that setting a score threshold, rather than a limit on the number of pairs, would not take different score distributions into account fairly.)

CC) We consider 2 popular methods for Comparison Cleaning (CC), dubbed meta-blocking [22] (MB) and BLOSS [5], and follow the suggestions of [22] for their configuration. Weights of record pairs are set to their Jaccard similarity weighted with the block scores from the BC sub-task. We consider the top 100 high-weight pairs for each record and prune the remaining record pairs.

We recall that variants of our approach are denoted as pBlocking(·,·,·), while traditional blocking pipelines without feedback are denoted as B(·,·,·); the parameters correspond to the techniques for the BB, BC and CC sub-tasks, respectively. Default methods are StBl for BB, TF-IDF for BC and MB for CC; pBlocking additionally uses a default feedback frequency φ, whose effect is studied in Section 7.4.

Pair matching and Clustering methods.
We consider the following 3 strategies that leverage the notion of an oracle to answer pairwise queries of the form "does u match with v?": (a) Edge [34], with default parameter settings; (b) Eager [9], the state-of-the-art technique for solving ER in the presence of erroneous oracle answers; (c) Node, the ER mechanism derived from [33], proposed as an improvement over Edge. The Eager algorithm handles noise for data sets whose number of matching pairs is much larger than n and performs similarly to Edge for data sets that have fewer matching pairs [8], so we use it as the default. We implement the abstract oracle tool with a classifier built using scikit-learn (https://scikit-learn.org/stable/) in Python. We consider two variants: Random Forest (default) and a Neural Network. The random forest classifier is trained with the default settings of scikit-learn. The neural network is implemented as a 3-layer convolutional neural network followed by two fully connected layers, using word2vec word embeddings for each token in the records. For structured data sets, we extract similarity features for each attribute as in [6]. For cars, we use the text descriptions to calculate text-based features along with image-based features. Given the unstructured nature of the text descriptions in some data sets, we also extracted POS tags using Spacy (https://spacy.io/). All the considered classifiers are trained off-line with less than 1,000 labelled pairs, containing a similar amount of matching and non-matching pairs. These labelled record pairs are the ones provided by the respective sources for citations, songs, products and camera (the papers mentioned in Table 3, column "ref."). For cars and cora, we perform active learning (following the guidelines of [6]) to identify a small set of labelled examples for training, which are excluded from the evaluation of blocking quality.

In this experiment we evaluate the empirical benefit of pBlocking compared to previous blocking strategies.
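Before the comparisons, a minimal sketch of the pairwise-oracle classifier described in the setup: a random forest over per-attribute similarity features, trained on a small labelled sample of pairs. Feature extraction is simplified here to token-Jaccard per attribute, and the records and attributes are hypothetical (the paper follows [6] for its actual features):

```python
from sklearn.ensemble import RandomForestClassifier

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def features(rec_a, rec_b, attributes):
    """One similarity feature per attribute, as in a simplified [6]-style setup."""
    return [jaccard(rec_a[attr], rec_b[attr]) for attr in attributes]

attributes = ["title", "artist"]
train_pairs = [  # (record pair, label): 1 = match, 0 = non-match
    (({"title": "hey jude", "artist": "the beatles"},
      {"title": "hey jude", "artist": "beatles"}), 1),
    (({"title": "hey jude", "artist": "the beatles"},
      {"title": "let it be", "artist": "the beatles"}), 0),
    (({"title": "yesterday", "artist": "the beatles"},
      {"title": "yesterday", "artist": "the beatles"}), 1),
    (({"title": "yesterday", "artist": "the beatles"},
      {"title": "imagine", "artist": "john lennon"}), 0),
]
X = [features(a, b, attributes) for (a, b), _ in train_pairs]
y = [label for _, label in train_pairs]
oracle = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The predicted match probability of an unseen pair serves as p_m feedback.
pair = ({"title": "hey jude", "artist": "beatles"},
        {"title": "hey jude", "artist": "the beatles"})
print(oracle.predict_proba([features(*pair, attributes)])[0][1])
```

The predicted class probability plays the role of the oracle answer (thresholded for match/non-match decisions) and of the p_m estimate fed back into block scoring.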
Blocking effectiveness.
Figure 2 compares the Pair Recall (PR) of pBlocking and of a traditional blocking pipeline B for different choices of the block building and comparison cleaning techniques, using the default block cleaning TF-IDF and the default M value. pBlocking achieves more than 0.90 recall on all the data sets and with all the block building strategies, demonstrating its robustness to different cluster distributions and properties of the data. Conversely, most of the considered block building strategies (StBl, QGBL and SoNE) have significantly lower recall even when used together with BLOSS for selecting the pairs wisely. QGBL and SoNE help to improve recall in data sets with spelling errors, but since our data sets contain very few spelling mistakes, StBl has slightly higher recall. In terms of the data sets, the no-feedback blocking approach B has varied behavior: products and camera yield the best performance due to relatively cleaner blocks that help to easily identify matching pairs even without feedback, while songs has higher noise and cars has a skewed cluster distribution, making them harder for previous techniques to handle. For this analysis, we do not consider cora (the smallest data set), as it has less than 2M pairs and hence all techniques achieve perfect recall. We observed similar trends with the Unif method for block cleaning in place of TF-IDF (discussed in the Appendix).

Figure 2: Pair recall of B(·, TF-IDF, ·) and pBlocking(·, TF-IDF, ·) with varying BB and CC, on songs, citations, products and cars; panels (a-d) use MB and panels (e-h) use BLOSS. CaCl did not finish within 24 hrs on the songs and citations data sets.

Table 4: Running time comparison of B(StBl, TF-IDF, MB) and pBlocking(StBl, TF-IDF, MB).

Dataset    Time to 0.95 pair recall    Pair recall with 1 hr budget
           pBlocking    B              pBlocking    B
songs
citations
cars
products
camera
cora
Blocking efficiency.
In this experiment, we consider two different settings to compare (i) the time required to achieve more than 0.95 pair recall and (ii) the pair recall when the pipeline is allowed to run for a fixed amount of time (1 hour). We run each technique for various values of M and choose the best value that satisfies the required constraints. In the case of a fixed running-time budget of 1 hour, we run pBlocking's feedback loop for the largest number of iterations that allows the pipeline to process all records within the time limit.

Table 4 compares the total time required to achieve 0.95 pair recall for each data set; this includes the time required by each approach to perform pair matching on the generated candidates. pBlocking provides more than a 3x reduction in running time for most large-scale data sets in this setting. In terms of the total number of pairs enumerated, pBlocking considers around M = 10 million pairs to achieve 0.95 recall for citations, as opposed to more than 200 million for B. We observed similar results for the other block building (SoNE, QGBL and CaCl) and cleaning strategies.

The last two columns of Table 4 compare the pair recall of the generated candidates when each technique is allowed to run for 1 hour. pBlocking achieves better pair recall than B across all data sets, and the gain in recall is higher for the larger data sets. The performance of pBlocking on cars is lower than in Figure 2d because the feedback loop does not converge completely within 1 hour; the pipeline runs for 8 rounds of feedback in this duration. This is consistent with the performance of pBlocking in Figure 4a, where the feedback is turned off after 10 iterations. The difference in performance between pBlocking and B is not as high for small, low-noise data sets such as products, cora and camera as it is for songs, citations and cars.

7.3 Robustness of Progressive Blocking

In this section, we evaluate the performance of pBlocking with varying strategies for pair matching and clustering in Algorithm 1 (referred to as W in the pseudo-code). For this analysis, we use the default setting for M as in Figure 2.

Varying ER methods.
We recall that pBlocking can be used in conjunction with a variety of techniques for pair matching and clustering. Table 5a compares the Pair Recall of the blocking graph when using the different progressive ER methods mentioned in Section 7.1. The final Pair Recall of pBlocking is more than 0.90 in all data sets and matching algorithms except citations for Node ER, and more than 0.85 in all cases. This observation confirms our theoretical analysis in Section 6.2, demonstrating that the feedback loop can improve the blocking irrespective of the ER algorithm under consideration (which is a desirable property for a blocking algorithm). The above comparison of ER performance considers the algorithms with a default choice of Random Forest classifier as the oracle. We observed that the feedback from the ER phase when using a Neural Network classifier contains slightly more errors, but the blocking phase with pBlocking shows similar recall. We provide more discussion on ER errors in Section 7.4.
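To make the plug-and-play role of the matcher W concrete, the loop can be sketched as follows. This is an illustrative toy (the function names, scoring rule and data are ours, not the paper's implementation): any pair-matching function W can be passed in, and its partial results are used to re-score blocks each round so that cleaner blocks are processed first.

```python
# Minimal sketch of a feedback loop in the spirit of Algorithm 1, with the
# pair matcher W as a pluggable function (illustrative, not the paper's code).

def block_score(block, matched, nonmatched):
    """Score a block by the fraction of its resolved pairs that matched."""
    pairs = [(a, b) for i, a in enumerate(block) for b in block[i + 1:]]
    resolved = [p for p in pairs if p in matched or p in nonmatched]
    if not resolved:
        return 0.5  # no feedback yet: unknown quality
    return sum(1 for p in resolved if p in matched) / len(resolved)

def progressive_blocking(blocks, W, rounds=3):
    matched, nonmatched = set(), set()
    for _ in range(rounds):
        # Process blocks in decreasing score order; results feed back
        # into the scores used in the next round.
        for block in sorted(blocks,
                            key=lambda b: -block_score(b, matched, nonmatched)):
            for i, a in enumerate(block):
                for b in block[i + 1:]:
                    pair = (a, b)
                    if pair in matched or pair in nonmatched:
                        continue
                    (matched if W(a, b) else nonmatched).add(pair)
    return matched

# Toy matcher: records match when they share a name prefix.
W = lambda a, b: a.split("_")[0] == b.split("_")[0]
blocks = [["acme_1", "acme_2", "zeta_1"], ["zeta_1", "zeta_2"]]
print(sorted(progressive_blocking(blocks, W)))
# [('acme_1', 'acme_2'), ('zeta_1', 'zeta_2')]
```

Because the loop only consumes match/non-match decisions, any of the Edge, Node or Eager strategies (or a classifier) can play the role of W.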
Benefit on the final ER result.
Table 5b compares the F-score of the final ER results when blocking is performed with and without pBlocking. In this experiment we use the state-of-the-art algorithm Eager as the pair matching algorithm, with default parameter values. The final F-score achieved with feedback is more than 0.9 for all data sets except products. For songs, citations and cars, the F-score of pBlocking is 1.5 times higher than that of the traditional blocking pipeline without feedback, demonstrating the effects of better blocking effectiveness and efficiency.

Table 5: (a) Pair recall of pBlocking on varying ER strategies. (b) Comparison of the final F-score of the Eager method. The blocking graph is computed with pBlocking (StBl, TF-IDF, MB) and B (StBl, TF-IDF, MB) (both with default settings).

Figure 3: Progressive behavior of pBlocking with varying feedback frequency and errors in the feedback (cars). (a) Feedback frequency. (b) Oracle error.

This section studies the performance of pBlocking dynamically, in terms of (i) the effect of the feedback frequency φ, (ii) the effect of error on convergence, and (iii) the convergence of the blocking result in the maximum number of rounds.

Feedback frequency.
The φ parameter represents the fraction of newly processed record pairs after which feedback is sent from the partial ER results back to the blocking phase. Therefore, the parameter φ controls the maximum number of rounds of pBlocking and how often the blocking graph is updated. To describe the effect of varying φ, Figure 3a shows the F-score of the ER results as a function of the percentage of rounds completed, which we refer to as the blocking progress (not to be confused with the "ER progress" in Algorithm 1). In the figure, different curves correspond to different feedback frequencies, including the default one (in blue). The plot shows that, by updating the blocking graph more frequently (and thus increasing the number of rounds), the F-score increases faster when φ is reduced from 0.08 to 0.01. The plot also shows that the F-score corresponding to smaller values of φ (down to 0.01) is consistently higher than or equal to the F-score corresponding to larger values of φ. Given that the running time of the pipeline increases with more frequent updates (smaller values of φ), there appears to be limited value in decreasing φ below 0.01.

Figure 4: Effect of the feedback loop on the cars dataset. (a) Pair Recall comparison. (b) F-score comparison.
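As a concrete reading of φ, the sketch below (illustrative names, not the paper's code) computes how many feedback rounds a given φ induces when feedback fires after every φ-fraction of the candidate pairs is newly processed; halving φ roughly doubles the number of rounds.

```python
# Illustrative sketch: the feedback-frequency parameter phi triggers a
# blocking-graph update after every phi-fraction of newly processed pairs.

def feedback_rounds(total_pairs, phi):
    """Return the pair counts at which feedback would fire for a given phi."""
    step = max(1, int(phi * total_pairs))  # pairs processed between updates
    return list(range(step, total_pairs + 1, step))

# Smaller phi -> more frequent updates -> more rounds of feedback.
print(len(feedback_rounds(1000, 0.08)))  # 12
print(len(feedback_rounds(1000, 0.01)))  # 100
```

This is why the runs with φ = 0.01 in Figure 3a update the blocking graph far more often, at a corresponding cost in running time.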
As in the previous experiment, Figure 3b shows the effect of synthetic error in the ER results, obtained by varying the fraction of erroneous oracle answers. To this end, we corrupted the oracle answers randomly so as to get the desired amount of noise. We note that even when 1 out of 5 answers is wrong, the final F-score is almost 0.8, growing monotonically from the beginning to the end at the cost of a few extra pairs compared. pBlocking converges more slowly with higher error, but the error does not accumulate and it performs much better than any other baseline. Additionally, we observed that even with 20% error, the pair recall of pBlocking is as high as 0.98, even though the F-score is close to 0.8 due to mistakes made by the pair matching and clustering phase. This confirms that pBlocking is robust to errors in the ER results and maintains high effectiveness, producing ER results with high F-score.
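The error-injection protocol described here can be sketched as a wrapper around a perfect oracle. The helper below is illustrative (our names, not the paper's code): it flips each answer independently with probability p.

```python
# Illustrative sketch of the error injection used in this experiment:
# wrap a perfect oracle so each answer is flipped with probability p.
import random

def noisy_oracle(oracle, p, rng):
    def answer(a, b):
        truth = oracle(a, b)
        return (not truth) if rng.random() < p else truth
    return answer

perfect = lambda a, b: a == b       # toy ground-truth matcher
rng = random.Random(42)             # fixed seed for reproducibility
corrupted = noisy_oracle(perfect, 0.2, rng)

# Roughly 20% of repeated answers on a matching pair should be flipped.
errors = sum(1 for _ in range(10000) if corrupted(1, 1) is False)
print(errors / 10000)
```

Sampling flips independently per answer (rather than corrupting a fixed subset of pairs) is one simple way to reach the desired noise level; the paper's exact corruption procedure may differ.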
Score Convergence.
Figure 4a compares the Pair Recall (PR) of the blocking phase of pBlocking (StBl, TF-IDF, MB) after every round of feedback with the recall of B (StBl, TF-IDF, MB). Both B and pBlocking start with a PR value close to 0.52, and pBlocking consistently improves with more feedback, reaching a substantially higher final PR. This demonstrates the ability of pBlocking's score-assignment strategy to achieve high PR values even with minimal feedback. Figure 4b compares the final F-score achieved by our method if the feedback loop is stopped after a few rounds. It shows that pBlocking achieves more than 0.8 F-score even when stopped after 10 rounds of feedback. This experiment validates that the convergence of block scoring leads to the convergence of the entire ER workflow.

The empirical analysis in the previous sections has demonstrated pBlocking's benefit on the final F-score and its ability to boost the effectiveness of blocking techniques across all data sets without compromising efficiency. The key takeaways from our analysis are summarized below.

• pBlocking improves Pair Recall irrespective of the technique used for block building, block cleaning or comparison cleaning (Figure 2), thus demonstrating its flexibility.

• Feedback-based scoring helps in particular to boost blocking efficiency and effectiveness for noisy datasets with many matching pairs (i.e., containing large clusters), such as cars, by enabling accurate selection of the cleanest blocks.

• The block intersection algorithm helps in particular with data sets with fewer matching pairs (i.e., with mainly small clusters), such as citations and songs, by providing a way to build small, focused blocks with a high fraction of matching pairs. Block intersection can also help in data sets like products and camera, but the benefit is not as high as in songs, because many records in such data sets have unique identifiers (e.g., product model IDs) and thus the initial blocks are reasonably clean.
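The block-intersection idea from the takeaways above can be sketched in a few lines: intersecting the blocks produced by two different blocking keys yields small, focused blocks whose records agree on both keys. The code below is an illustrative toy, not the paper's implementation.

```python
# Toy sketch of block intersection: blocks are sets of record ids, and
# intersecting blocks from two blocking functions keeps only records
# that agree on both blocking keys (illustrative, not the paper's code).

def intersect_blocks(blocks_a, blocks_b, min_size=2):
    out = []
    for a in blocks_a:
        for b in blocks_b:
            common = a & b
            if len(common) >= min_size:  # singleton blocks yield no pairs
                out.append(common)
    return out

# E.g., blocks by a title token vs. blocks by an author token.
by_title = [{1, 2, 3, 4}, {5, 6}]
by_author = [{2, 3, 9}, {5, 6, 7}]
print(intersect_blocks(by_title, by_author))  # [{2, 3}, {5, 6}]
```

The large block {1, 2, 3, 4} shrinks to the focused block {2, 3}, illustrating how intersection raises the fraction of matching pairs when clusters are small.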
Related Work

Blocking has been used to scale Entity Resolution (ER) for a very long time. However, the techniques in the literature have considered blocking as a preprocessing step and suffered from the trade-off between effectiveness and efficiency/scalability. We divide the related work into two parts: advanced blocking methods, which we improve upon, and progressive ER methods, which can be used to generate a limited amount of matching/non-matching pairs to send as feedback to our blocking computation.
Advanced blocking methods.
There are many blocking methods in the literature, with different internal functionalities and solving different blocking sub-tasks. In this paper, we considered four representative block building strategies, namely standard blocking [20], canopy clustering [18], sorted neighborhood [13] and q-grams blocking [11]. It is well known that such techniques can yield a fairly dense blocking graph when used alone. We refer the reader to [24] for an extensive survey of various blocking techniques and their shortcomings. Such block building strategies can be used as the method X in our Algorithm 1.

Recent works have proposed advanced methods that can be used in combination with the mentioned block building techniques by focusing on the comparison cleaning sub-task (thus improving efficiency). The first technique in this space is meta-blocking [22]. Meta-blocking aims to extract the most similar pairs of records by leveraging block-to-record relationships and can be very efficient in reducing the number of unnecessary pairs produced by traditional blocking techniques, but it is not always easy to configure. To this end, follow-up works such as Blast [29] use "loose" schema information to distinguish promising pairs, while [4] and SNB [23] rely on a sample of labeled pairs for learning accurate blocking functions and classification models, respectively. Finally, the most recent strategy, BLOSS [5], uses active learning to select such a sample and configure the meta-blocking. Traditional meta-blocking [22] and its follow-up techniques like BLOSS [5] prune low-similarity candidates from the blocking graph generated using the various block building strategies discussed above. Their performance is highly dependent on the effectiveness of the block building techniques and the quality of the blocking graph. In contrast, pBlocking constructs meaningful blocks that effectively capture the majority of the matching pairs, and scores each block based on its quality to generate fewer non-matching pairs in the blocking graph. Meta-blocking techniques compute the blocking graph statically, prior to ER, and thus can be used as the Z method in our Algorithm 1. In Figure 2 we compare with classic meta-blocking and BLOSS, as the latter has shown its superiority over Blast and SNB.
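For intuition, the comparison-cleaning step performed by meta-blocking can be sketched with the simple common-blocks weighting scheme: each candidate pair is weighted by how many blocks it co-occurs in, and pairs below the average weight are pruned. This is an illustrative simplification in the spirit of [22], not its full algorithm.

```python
# Sketch of comparison cleaning in the spirit of meta-blocking [22]:
# weight each candidate pair by the number of blocks it co-occurs in
# (common-blocks scheme) and prune pairs below the average weight.
from collections import Counter
from itertools import combinations

def meta_block(blocks):
    weights = Counter()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            weights[pair] += 1  # one unit of evidence per shared block
    avg = sum(weights.values()) / len(weights)
    return {p for p, w in weights.items() if w >= avg}

blocks = [{"r1", "r2", "r3"}, {"r1", "r2"}, {"r1", "r2"}, {"r3", "r4"}]
print(sorted(meta_block(blocks)))  # [('r1', 'r2')]
```

The pair (r1, r2) co-occurs in three blocks and survives, while pairs seen in a single block are pruned; this static, pre-ER filtering is what allows meta-blocking to serve as the Z method in Algorithm 1.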
Progressive ER.
Many applications need to resolve data sets efficiently but do not require the ER result to be complete. Recent literature has described methods to compute the best possible partial solution. Such techniques include pay-as-you-go ER [36], which uses "hints" on records that are likely to refer to the same entity, and more generally progressive ER, such as the schema-agnostic method in [30] and the strategies in [3, 26] that consider a limit on the execution time. In our discussion, we considered oracle-based techniques, namely Node [33], Edge [34], and Eager [9]. Differently from other progressive techniques, oracle-based methods consider a limit on the number of pairs that are examined by the oracle for a matching/non-matching response. Such techniques were originally designed for dealing with the crowd, but they can also be used with a variety of classifiers due to their flexibility. All these techniques naturally work in combination with pBlocking by sending their partial results as feedback.
Other ER methods.
In addition to the above methods, we mention works on ER architectures that can help users debug and tune parameters for the different components of ER [10, 6, 15, 25]. Specifically, the approaches in [10, 6] show how to leverage the crowd in this setting. All of these techniques are orthogonal to the scope of our work, and we do not consider them in our analysis. The previous work in [37] proposes to greedily merge records as they are matched by ER, while processing the blocks one at a time. Each merged record (containing tokens from the component records) is added to the unprocessed blocks, permitting its participation in subsequent matching and merging by their iterative algorithm. The limitations of processing blocks one at a time have been shown in more recent blocking works [22].
Conclusions
We have proposed a new blocking algorithm, pBlocking, that progressively updates the relative scores of blocks and constructs new blocks by leveraging a novel feedback mechanism from partial ER results. Most of the techniques in the literature perform blocking as a preprocessing step to prune out redundant non-matching record pairs. However, these techniques are sensitive to the distribution of cluster sizes and the amount of noise in the data set, and thus are either highly efficient with poor recall or have high recall with poor efficiency. pBlocking can boost the effectiveness and efficiency of blocking across all data sets by jump-starting blocking with any of the standard techniques and then using new, robust feedback-based methods for solving blocking sub-tasks in a data-driven way. To the best of our knowledge, pBlocking is the first framework where the blocking and pair matching components of ER help each other and produce high-quality results in synergy.
References

[1] .
[2] http://di2kg.dia.uniroma3.it/2019.
[3] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. PVLDB, 7(11):999–1010, 2014.
[4] M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage. In ICDM, 2006.
[5] G. dal Bianco, M. A. Gonçalves, and D. Duarte. BLOSS: Effective meta-blocking with almost no effort. Information Systems, 75, 2018.
[6] S. Das, P. S. GC, A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, and Y. Park. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD, 2017.
[7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1), 2007.
[8] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5), 2016.
[9] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
[10] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014.
[11] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
[12] A. Gruenheid, X. L. Dong, and D. Srivastava. Incremental record linkage. PVLDB, 7(9), 2014.
[13] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In ACM SIGMOD Record, volume 24, pages 127–138. ACM, 1995.
[14] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[15] P. Konda, S. Das, P. Suganthan GC, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, et al. Magellan: Toward building entity matching management systems. PVLDB, 9(12):1197–1208, 2016.
[16] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In , 2013.
[17] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. 1999.
[18] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178, 2000.
[19] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, 2018.
[20] G. Papadakis, G. Alexiou, G. Papastefanatos, and G. Koutrika. Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. PVLDB, 9(4):312–323, 2015.
[21] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering, 25(12):2665–2682, 2012.
[22] G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl. Meta-blocking: Taking entity resolution to the next level. TKDE, 26, 2014.
[23] G. Papadakis, G. Papastefanatos, and G. Koutrika. Supervised meta-blocking. PVLDB, 7, 2014.
[24] G. Papadakis, J. Svirsky, A. Gal, and T. Palpanas. Comparative analysis of approximate blocking techniques for entity resolution. PVLDB, 2016.
[25] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, and M. Koubarakis. The return of JedAI: End-to-end entity resolution for structured and semi-structured data. PVLDB, 11(12):1950–1953, 2018.
[26] T. Papenbrock, A. Heise, and F. Naumann. Progressive duplicate detection. TKDE, 27(5), 2015.
[27] M. Penrose. Random Geometric Graphs, volume 5. Oxford University Press, 2003.
[28] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Cambridge University Press, 2008.
[29] G. Simonini, S. Bergamaschi, and H. Jagadish. Blast: A loosely schema-aware meta-blocking approach for entity resolution. PVLDB, 9(12), 2016.
[30] G. Simonini, G. Papadakis, T. Palpanas, and S. Bergamaschi. Schema-agnostic progressive entity resolution. IEEE Transactions on Knowledge and Data Engineering, 31(6):1208–1221, 2018.
[31] V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. In ICDE, pages 219–230, 2015.
[32] V. Verroios, H. Garcia-Molina, and Y. Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In SIGMOD, 2017.
[33] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.
[34] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD, 2013.
[35] S. E. Whang and H. Garcia-Molina. Incremental entity resolution on rules and data. The VLDB Journal, 23(1), Feb. 2014.
[36] S. E. Whang, D. Marmaros, and H. Garcia-Molina. Pay-as-you-go entity resolution. TKDE, 25(5), 2013.
[37] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD, 2009.

Figure 5: Pair recall of B (·, Unif, ·) and pBlocking (·, Unif, ·) with varying BB and CC. (a–d) use MB and (e–h) use BLOSS. CaCl did not finish within 24 hrs on the songs and citations data sets.
A Additional Experiments
Blocking Effectiveness.
Figure 2 compares the Pair Recall of pBlocking and a traditional blocking pipeline B, both with block weights initialized with the TF-IDF weighting mechanism. Figure 5 performs the same comparison with the pipelines initialized using Unif weights. Since all blocks are assigned equal weight, we consider a block cleaning threshold of 100 along with the default value of M. pBlocking performs substantially better than B for different settings of block building techniques across various datasets. Compared to the TF-IDF weighting scheme, Unif performs slightly worse, but the difference is not substantial. The no-feedback pipeline B has varied performance across different data sets, with the best performance on products and the poorest performance on citations and songs.
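For intuition on the two initializations, the sketch below (illustrative names and data, not the paper's code) contrasts Unif, which assigns every block the same weight, with a TF-IDF-style scheme in which blocks built from rare blocking keys receive higher weight.

```python
# Illustrative sketch of the two block-weight initializations compared
# above: Unif gives every block the same weight, while a TF-IDF-style
# scheme down-weights blocks built from frequent (uninformative) keys.
import math

def unif_weights(blocks):
    return {key: 1.0 for key in blocks}

def tfidf_weights(blocks, n_records):
    # IDF-style score for a blocking key: rarer keys (smaller blocks)
    # are more discriminative and score higher.
    return {key: math.log(n_records / len(recs)) for key, recs in blocks.items()}

blocks = {"smith": {1, 2, 3, 4, 5, 6, 7, 8}, "zanzibar": {9, 10}}
print(unif_weights(blocks))
w = tfidf_weights(blocks, n_records=10)
print(w["zanzibar"] > w["smith"])  # True: the rare key scores higher
```

Under Unif all blocks start indistinguishable, which is why a fixed block cleaning threshold is used above; the feedback loop then has to learn the relative block quality that TF-IDF partially encodes up front.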