Efficient and Effective ER with Progressive Blocking

Sainyam Galhotra (UMass Amherst, [email protected]), Donatella Firmani (Roma Tre University, [email protected]), Barna Saha (UC Berkeley, [email protected]), and Divesh Srivastava (AT&T Labs – Research, [email protected])
Abstract
Blocking is a mechanism to improve the efficiency of Entity Resolution (ER) which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently (O(n log n) time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process up to 60%.

1 Introduction

Entity Resolution (ER) is the problem of identifying which records in a data set refer to the same real-world entity [7]. ER technologies are key for solving complex tasks (e.g., building a knowledge graph), but comparing all the record pairs to decide which pairs match is often infeasible. For this reason, the first step of ER selects a sub-quadratic number of record pairs to compare in the subsequent steps. To this end, a commonly used approach is blocking [24].
Blocking groups similar records into blocks and then selects pairs from the "cleanest" blocks, i.e., those with fewer non-matching pairs, for further comparisons. The literature is rich with methods for building and processing blocks [24], but depending on the dataset at hand, different techniques can either leave too many matching pairs outside, leading to incomplete ER results and low effectiveness, or include too many non-matching pairs, leading to low efficiency.

pBlocking. We propose a new progressive blocking technique that overcomes the above limitations by short-circuiting the two operations, blocking and pair comparisons, that are traditionally solved sequentially. Our method starts with an aggressive blocking step, which is efficient but not very effective. Then, it computes a limited amount of ER results on a subset of pairs selected by the aggressive blocking, and sends these partial (matching and non-matching) results from the ER phase back to the blocking phase, creating a "loop", to improve blocking effectiveness.
Figure 1: (a) Illustration of a standard blocking pipeline. Block building, block cleaning and comparison cleaning sub-tasks are highlighted in white. The downstream ER algorithm is shown in gray. A description of each record is reported in Table 1. (b) Block size distribution (standard blocking) for the real cars dataset used in our experiments: chevrolet (1717 records), corvette (272), c6 (251), malibu (187), navigation (43), corvette ∧ c6 / z6 (25), chevy (11).

Table 1: Sample records (we omit schema information) referring to 4 distinct entities: a Chevrolet Corvette C6 (c6), a Corvette Z6 (z6), a Chevrolet Malibu (ma), and a Citroën C6 (ci), the latter with the same model name as the Corvette C6 but a different car. r_i^e represents the i-th record referring to entity e.

r1(c6): chevy corvette c6
r2(c6): chevy corvette c6 navigation
r3(c6): chevrolet corvette c6
r1(z6): corvette z6 navigation
r1(ma): chevy malibu navigation
r2(ma): chevrolet chevy malibu
r3(ma): chevrolet malibu
r1(ci): citroen c6 navigation

In this way, blocking can progressively self-regulate and adapt to the properties of each dataset, with no configuration effort. We illustrate our blocking method, that we call pBlocking, in the following example.
Example 1.
Consider the records in Table 1 from the cars dataset used in our experiments, and a standard schema-agnostic blocking strategy S such as [20]. As shown in Figure 1a, we consider three blocking sub-tasks [24]. First, during block building, S creates a separate block for each text token (we only show the blocks 'corvette', 'navigation', 'malibu', 'c6' and 'chevy'). Then, during block cleaning, S uses a threshold to prune out all the blocks of large size. Depending on the threshold value (using the block sizes in the entire cars dataset, shown in Figure 1b), we can have any of the following extreme behaviors. (Note that no intermediate setting of the threshold can yield a sparse set of candidates that is at the same time complete.)

• Aggressive blocking: S prunes every block except the smallest one ('chevy') and returns only pairs within the 'chevy' block, such as (r1(c6), r2(c6)) and (r1(ma), r2(ma)), missing r3(c6) and r3(ma).

• Permissive blocking: S prunes only the largest block ('chevrolet') and returns many non-matching pairs.

Finally, during comparison cleaning, S can use another threshold to further prune out pairs sharing few blocks, e.g., by using meta-blocking [22]. As in block cleaning, different threshold values can yield aggressive or permissive behaviours. Note that matching pairs such as (r1(c6), r3(c6)) share the same number of blocks ('corvette' and 'c6') as non-matching pairs such as (r2(c6), r1(z6)) ('corvette' and 'navigation'). (Even worse, 'c6' is larger than 'navigation'.)

pBlocking can solve these problems in a few rounds: the first round does aggressive blocking, the second round does more effective blocking by making targeted updates according to partial ER results, and so on. Examples of such updates to the blocking result are discussed below.

1. Creation of new blocks that help inclusion of (r1(c6), r3(c6)) and (r2(c6), r3(c6)): pBlocking creates a new block 'corvette ∧ c6' with records present in both blocks 'corvette' and 'c6'. This block is much smaller than its two constituents and has only Corvette C6 cars.

2. Adaptive cleaning to help inclusion of (r1(ma), r3(ma)) and (r2(ma), r3(ma)): pBlocking can discourage pruning of block 'malibu' that contains Chevrolet Malibu cars, even if it is a large block.

3. Adaptive cleaning to help exclusion of non-matching pairs: pBlocking can encourage pruning of block 'navigation' that contains no matching pairs, even if it is a small block.

After a few rounds of updates like the above, pBlocking returns all the matching pairs with very few non-matching pairs. Note that after the last round, the ER output can be computed on the resulting pairs as in the traditional setting. Updates of type (1) are performed via a new block intersection algorithm, while (2) and (3) are performed by a new block scoring method. By construction, when the blocking scores converge, the entire blocking result also converges.

Our contributions.
The main contribution of this paper is a new blocking methodology with both high efficiency and effectiveness in a variety of application scenarios. Since pBlocking can in principle start off using any blocking strategy, it represents not only a new approach but also a way to "boost" traditional ones. pBlocking works seamlessly across different entity cluster size distributions such as:

• small entity clusters, where, using block intersection, pBlocking can recover entities such as Corvette C6 consisting of few records sharing large and dirty blocks.

• large entity clusters, where, using block scoring, pBlocking can recover entities such as Chevrolet Malibu consisting of many records sharing large and clean blocks.

We prove theoretically and show empirically that, with a few rounds and a limited amount of partial ER results, our progressive blocking method can provide a significant boost in blocking effectiveness without penalizing efficiency. Specifically, we (i) demonstrate fast convergence and low space and time complexity (O(n log n), where n is the number of records) of pBlocking; (ii) report experiments achieving up to 60% increase in recall when compared to state-of-the-art blocking [5], and up to 5x boost in efficiency. Finally, we observe that pBlocking can yield up to 70% increase on the F-score of the final ER result, thus confirming the substantial benefits of our approach.

Outline.
The rest of this paper is organized as follows. Sections 2 and 3 provide preliminary discussions and a high-level description of the pBlocking approach. Sections 4 and 5 explain our block intersection and block scoring methods, respectively. Section 6 provides theoretical analysis of pBlocking's effectiveness and Section 7 provides extensive experimental results and key takeaways. Section 8 discusses the related work and we conclude in Section 9.

Table 2: Notation

V: Collection of records
C: Collection of clusters
B: Block, a subset of records, B ⊆ V
p_m(u, v): Similarity between u and v
P = (V, A′): Blocking graph, A′ ⊂ V × V
φ: Feedback frequency
p(B): Probability score of a block B
u(B): Uniformity score of block B
H(B): Entropy of block B
H: Block hierarchy
G_t: Random geometric graph
γ: Fraction of nodes used for scoring blocks
μ_g: Expected similarity of a matching edge
μ_r: Expected similarity of a non-matching edge
2 Preliminaries

Table 2 summarizes the main symbols used throughout this paper. Let V be the input set of records, with |V| = n. Consider an (unknown) graph C = (V, E+), where (u, v) ∈ E+ means that u and v represent the same entity. C is transitively closed, that is, each of its connected components C ⊆ V is a clique representing a distinct entity. We call each clique a cluster of V, and refer to the partition induced by C as the ER ground truth.

Definition 1 (Pair Recall). Given a set of matching record pairs A′ ⊆ V × V, Pair Recall is the fraction of pairs (u, v) ∈ E+ that can be either (i) matched directly, because (u, v) ∈ A′, or (ii) indirectly inferred from other pairs (u, w1), (w1, w2), ..., (wc, v) ∈ A′ by connectivity.

A formal definition of the blocking task follows.
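Pair Recall per Definition 1 can be evaluated with a single union-find pass over the candidate pairs, since indirectly inferable pairs are exactly those falling in the same connected component. The following is a minimal Python sketch (the function and argument names are ours, not from the paper):

```python
def pair_recall(ground_truth_pairs, candidate_pairs, records):
    """Fraction of ground-truth pairs recoverable from candidate_pairs,
    either directly or by transitive connectivity (Definition 1)."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # union all candidate pairs into connected components
    for u, v in candidate_pairs:
        parent[find(u)] = find(v)

    hit = sum(1 for u, v in ground_truth_pairs if find(u) == find(v))
    return hit / len(ground_truth_pairs)
```

For example, with ground truth {(1,2), (2,3), (1,3)} and candidates {(1,2), (2,3)}, the pair (1,3) is inferred by connectivity, so the recall is 1.0.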
Problem 1 (Blocking Task). Given a set of records V, group records into possibly overlapping blocks B ≡ {B1, B2, ...}, B_i ⊆ V, and compute a graph P = (V, A′), where A′ ⊆ A, A ≡ {(u, v) : ∃ B_i ∈ B s.t. u ∈ B_i ∧ v ∈ B_i}, such that A′ is sparse (|A′| ≪ (n choose 2)) and A′ has high Pair Recall. We refer to P as the blocking graph.

The blocking graph P is the final product of blocking and contains all the pairs that can be considered for pair matching. The effectiveness of a blocking method is measured as the Pair Recall (PR) of (the set of edges in) P, and its efficiency as the number of edges in P for a given PR. Blocking methods consist of three sub-tasks as defined by [24]: block building, block cleaning and comparison cleaning. In the following, we describe each of these steps and the corresponding methods in the literature.

Block building (BB) takes as input V and returns a block collection B, by assigning each record in V to possibly multiple blocks. The popular standard blocking [20] strategy creates a separate block B_t for each token t in the records and assigns to B_t all the records that contain the token t. In order to tolerate spelling errors, q-grams blocking [11] considers character-level q-grams instead of entire tokens. Other strategies include canopy clustering [18] and sorted neighborhood [13]. Canopy clustering iteratively selects a random seed record r, and creates a new block B_r (or a canopy) with all the records that have a high similarity with r with respect to a given similarity function (e.g., using a subset of features [18]). We can use different similarity functions to build different sets of canopies. Sorted neighborhood sorts all the records according to multiple sort orders (e.g., each according to a different attribute [13]) and then slides a window w of tokens over each ordering, every time creating a new block B_w.
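For illustration, standard blocking can be sketched in a few lines of Python (the record ids and texts are a shortened version of Table 1; this is a minimal sketch, not the paper's implementation):

```python
from collections import defaultdict

def standard_blocking(records):
    """Create one block per token; assign each record to the blocks
    of all its tokens (schema-agnostic standard blocking)."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            blocks[token].add(rid)
    return blocks

cars = {
    "r1": "chevy corvette c6",
    "r2": "chevy corvette c6 navigation",
    "r3": "citroen c6 navigation",
}
B = standard_blocking(cars)
```

Here the 'c6' block contains all three records, while 'chevy' contains only the two Chevrolet ones, mirroring how token blocks overlap in Example 1.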
Blocks have the same number of distinct tokens, but the number of records in a block can vary significantly. Each of these techniques creates O(n) blocks.

Block cleaning (BC) takes as input the block collection B and returns a subset B′ ⊆ B by pruning blocks that may contain too many non-matching record pairs. Block cleaning is typically performed by assigning each block a score: B → ℝ with a block scoring procedure and then pruning blocks with low score. Traditional scoring strategies include functions of block sizes such as TF-IDF [7, 21].
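As an illustration of size-based block cleaning, the sketch below scores blocks with an IDF-style function, log(n/|B|), and keeps the highest-scoring ones. The block sizes mirror Figure 1b, while the record ids, the value of n, and the keep-fraction policy are illustrative assumptions, not the paper's method:

```python
import math

def clean_blocks(blocks, n, keep_fraction=0.5):
    """Score each block by an IDF-style function, log(n / |B|),
    and keep only the highest-scoring fraction of blocks."""
    scored = sorted(blocks.items(),
                    key=lambda kv: math.log(n / len(kv[1])),
                    reverse=True)
    kept = scored[:max(1, int(len(scored) * keep_fraction))]
    return dict(kept)

# block sizes loosely following Figure 1b (record ids are synthetic)
blocks = {"chevrolet": set(range(1717)), "corvette": set(range(272)),
          "navigation": set(range(43)), "chevy": set(range(11))}
cleaned = clean_blocks(blocks, n=2000)
```

Smaller blocks get higher IDF scores, so 'chevy' and 'navigation' survive while 'chevrolet' and 'corvette' are pruned, which is exactly the aggressive behavior that Example 1 warns about.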
Comparison cleaning (CC) takes as input the set A of all the intra-block record pairs in the block collection B′ (which is a subset of the intra-block record pairs in B) and returns a graph P = (V, A′), with A′ ⊆ A, by pruning pairs that are likely to be non-matching. Comparison cleaning is typically performed by assigning each pair a weight: A → ℝ and then pruning pairs with low weight. Weighting strategies include meta-blocking [22], possibly with active learning [29, 5]. In classic meta-blocking, weight(u, v) corresponds to the number of blocks in which u and v co-occur, based on the assumption that the more blocks a record pair shares, the more likely it is to be matching. The recent BLOSS strategy [5] employs active learning on top of the pairs generated by meta-blocking, and learns a classifier using features extracted from the blocking graph for further pruning.

We denote with B(X, Y, Z) a blocking strategy that uses the methods X, Y, and Z, respectively, for block building, block cleaning and comparison cleaning. The strategy used in our cars example (Example 1) can thus be denoted as B(standard blocking, TF-IDF, meta-blocking).

After blocking.
Typical ER algorithms include pair matching and entity clustering operations. Such operations label as "matching" the pairs referring to the same entity and "non-matching" otherwise, and typically require the use of a classifier [19] or a crowd [34]. Clustering consists of building a possibly noisy clustering C′ according to the labels, and can be done with a variety of techniques, including robust variants of connected components [31] and random graphs [9]. This noisy clustering is the final product of ER.

3 Overview of pBlocking

Analogous to traditional blocking methods, pBlocking takes as input a collection V of records and returns a blocking graph P. A high-level view of the methods introduced in pBlocking, for each of the main blocking sub-tasks of Section 2, is provided below. Such methods, unlike previous ones, can leverage feedback from partial ER results. (This assumption holds for block building methods such as standard blocking, q-grams blocking and sorted neighborhood with multiple orderings [13], and extends naturally to canopy clustering by using multiple similarity functions.)
Block building in pBlocking constructs new blocks arranged in the form of a hierarchy. First-level blocks are initialized with the blocks generated by a traditional method (e.g., standard blocking, sorted neighborhood, canopy clustering or q-grams blocking). Subsequent levels contain intersections of the blocks in the previous levels. pBlocking can use feedback from the partial ER output to build intersections such as 'corvette ∧ c6' that can lead to new, cleaner blocks, and to avoid bad intersections such as 'corvette ∧ navigation' that would not improve the fraction of matching pairs in P (Chevrolet Corvette C6 and Z6 are different entities). We discuss block intersection in Section 4.

Block cleaning in pBlocking prunes dirty blocks based on feedback-based scores. First-round scores are initialized with a traditional method (e.g., TF-IDF). Then, scores are refined based on feedback by combining two quantities: the fraction p(B) of matching pairs in a block B, and the block uniformity u(B), which captures the distribution of entities within the block (u(B) is the inverse of perplexity [17]). Since the goal of the blocking phase is to identify blocks that have a higher fraction of matching pairs and fewer entity clusters, we combine the above values as score(B) = p(B) · u(B). pBlocking can use feedback from the partial ER output to estimate p(B) and u(B), yielding high scores for clean blocks such as 'malibu' (high p(B) and high u(B)) and low scores for dirtier blocks such as 'navigation' (low p(B) and low u(B)) and 'c6' (low u(B)). We discuss block scoring in Section 5.

Finally, comparison cleaning in pBlocking is implemented with a traditional method such as meta-blocking.

Workflow.
Algorithm 1 describes the pBlocking workflow and how the introduced blocking methods can be used. We denote with pBlocking(X, Y, Z) a progressive blocking strategy that uses the methods X, Y and Z, respectively, for building the first level of the block hierarchy, initializing the block scores, and performing comparison cleaning, as described in Algorithm 1. In our cars example, we have pBlocking(standard blocking, TF-IDF, meta-blocking).

Algorithm 1: Our blocking method pBlocking
Require: Records V; methods X, Y, Z for each blocking step. Default: X = standard blocking, Y = TF-IDF, Z = meta-blocking.
Ensure: Blocking graph P
1: C′ ← ∅
2: B ← build the first level of the block hierarchy with method X
3: scores ← initialize block scores using method Y
4: P ← block cleaning and comparison cleaning with method Z
5: P_new ← ∅
6: for round = 2; round ≤ 1/φ ∧ P ≠ P_new; round++ do
7:   while ER progress is less than φ do
8:     C′ ← execute an incremental step of method W for pair matching and clustering on P
9:   scores ← update the block scores according to C′  // feedback
10:  B ← update the block hierarchy based on scores
11:  P ← P_new
12:  P_new ← block cleaning and comparison cleaning with Z
13: return P

We first initialize the set of clusters C′, the block hierarchy and the block scores (lines 1–3). The next step (line 4) consists of computing the first version of the blocking graph P according to the selected method for comparison cleaning (e.g., meta-blocking). The graph P is then progressively updated, round after round (lines 6–12). In order to activate the feedback mechanism, pBlocking needs to interact with an ER algorithm W for pair matching and clustering operations (lines 7–8). Algorithm W is executed over P until it makes a progress of φ, with φ ∈ [0, 1], that is, until φ · n log n record pairs have been processed since the previous round. (For algorithms such as [33], progress can be defined as a fraction φ · n of processed records since the previous round.) At that point, the algorithm W is interrupted, and C′ is updated (line 8) and sent as feedback to all of pBlocking's components. Based on such feedback, we update the function score(B) = p(B) · u(B) (line 9) and construct new blocks in the form of a hierarchy (line 10). Higher-score blocks are used to enumerate the most promising record pairs and generate the updated blocking graph P_new (lines 11–12). When either the maximum number of rounds 1/φ has been reached (setting φ = 1 is the same as switching off the feedback loop) or the blocking graph has converged (P = P_new), pBlocking terminates by returning P.

We present a formal analysis of the effectiveness of pBlocking in Section 6, and we refer to Section 7 for experiments. Due to its robustness to different choices of the pair matching algorithm W, we do not include W in pBlocking's parameters (differently from X, Y, Z). Natural choices for W include progressive ER strategies that can process P in an online fashion and compute C′ incrementally [32, 33, 19]. However, traditional algorithms such as [7] can be used as well, by adding incremental ER techniques [12, 35] on top.

For efficiency, it is crucial to ensure that the total time and space taken to compute P is close to linear in n. Since every round of pBlocking comes with its own time and space overhead, we first describe how to bound the complexity of every round, and then discuss how to set the parameter φ in Algorithm 1 (and thus the maximum number of rounds) so as to bound the complexity of the entire workflow.

Round Complexity. pBlocking implements the following strategies to decrease the overhead of each round.
Efficient block cleaning. We compute the block scores by sampling Θ(log n) records from each of the top O(n) high-score blocks computed in the previous round.

Efficient comparison cleaning. For simplicity, we build P by enumerating at most Θ(n log n) intra-block pairs, processing blocks in non-increasing order of block score.

Based on the above discussion, we have Lemma 1.

Lemma 1.
A single round of pBlocking(X, Y, Z), such as pBlocking(standard blocking, TF-IDF, meta-blocking), has O(n log n) space and time complexity.

Proof. We first show that the total feedback is limited to O(n log n) space complexity, even though it considers all transitively inferred matching and non-matching edges, which can be Ω(n log n). For the matching pairs, we store all the records with an entity id such that any pair of records that have been resolved share the same id. This requires O(n) space in the worst case and captures all the matching edges that have been identified in the ER output. For the non-matching pairs, we store a non-matching edge between their entity ids. Since the maximum number of pairs returned by pBlocking is limited to O(n log n), the total number of pairs compared in each round, and thus the number of non-matching edges stored, is also O(n log n). Then, we analyze the complexity of using feedback for the BB and BC tasks. Since the maximum number of blocks considered in any round for the scoring component is O(n) and the scoring mechanism samples O(log n) pairs from each block, the total number of edges enumerated for block scoring and building is O(n log n). Since the maximum number of pairs for inclusion in the graph P is also O(n log n), a single round of pBlocking outputs P in O(n log n) total work.

Workflow Complexity.
As discussed in Section 6, φ can be set to a small constant fraction. Thus, along with Lemma 1, this guarantees an O(n log n) complexity for the entire workflow. Experimentally, a smaller φ value yields higher final recall, thus as a default we set φ = 0.01, yielding a maximum of 100 rounds. Although such a φ value gets the best trade-off between effectiveness and efficiency in our experiments, we also observe that slight variations of its setting do not affect the performance much (Section 7), demonstrating the robustness of pBlocking.

4 Block Building by Intersection

One of the major challenges of block building (BB) is that, when generating candidate pairs that capture matches, it can also generate a number of non-matching pairs. This phenomenon is highly prevalent in datasets with very few matching pairs. To overcome this challenge, our block building by intersection algorithm takes a collection of blocks B1, ..., Bm built by a traditional method for BB and creates smaller clean blocks out of large dirty ones, thus contributing to the recall of the blocking graph without adding extra non-matching pairs. An intersection block hierarchy H is constructed as follows. Let the first layer be B1, ..., Bm. Then blocks in layer L_i consist of the intersection of i distinct blocks in the first layer.

Algorithm 2: Block Layers Creation
Require: Set of records V, depth d
Ensure: Layer set {L1, ..., Ld}
for i = 1; i ≤ d; i++ do
  L_i ← ∅
processed ← ∅
for v ∈ V do
  blockLst ← getBlocks(v)
  for i = 2; i < d; i++ do
    for B = {B_j : B_j ∈ blockLst}, |B| = i do
      B′ ← ∩_{B_j ∈ B} B_j
      if B′ ∉ processed then
        L_i.append(B′)
        processed.append(B′)
    blockLst ← L_i

Example 2.
Consider our cars example in Section 1, and the blocks corresponding to tokens 'corvette' and 'c6', namely B_corvette and B_c6. A sample block in the second level of H is B_{corvette,c6} = B_corvette ∩ B_c6. When we build the new block, we only include records containing the two tokens 'corvette' and 'c6' (possibly non-consecutively), thus obtaining a cleaner block than the original ones.

Refined blocks.
We refer to the newly created block as a refined block, and to the intersecting blocks as parent blocks. Not all the refined blocks are useful. We require one of the following correlation-based conditions to hold to decide whether a refined block B_{i,j} must be kept in H:

• score(B_{i,j}) > score(B_i) · score(B_j), that is, the score of the refined block is higher than the combined score of the parent blocks.

• The existence of a randomly chosen record r in blocks B_i and B_j is positively correlated, i.e., Pr[r ∈ B_{i,j}] = |B_{i,j}|/n > Pr(r ∈ B_i) · Pr(r ∈ B_j), which simplifies to |B_{i,j}| > |B_i| · |B_j| / n. For example, the number of common records in the blocks corresponding to tokens 'c6' and 'corvette' is much higher than the number of common records in the blocks corresponding to 'navigation' and 'c6'.

Suppose the maximum depth of the hierarchy is d, which is a constant. The construction of refined blocks can take O(n^d) time if the number of blocks considered in the first layer is O(n). For efficiency, we iterate over the records (linear scan) and, for each record r, we consider all pairs of blocks that contain r as candidates to generate blocks in the different levels of the hierarchy. The following lemma bounds the total number of refined blocks across the hierarchy.

Algorithm 3: Layer Cleaning
Require: Layer set {L1, ..., Ld}
Ensure: Cleaned layer set {L1, ..., Ld}
for i = 2; i < d; i++ do
  for block ∈ L_i do
    parentLst ← getParents(block)
    if ∏_{p ∈ parentLst} score(p) < score(block) or ∏_{p ∈ parentLst} (|p|/n) < |block|/n then
      continue
    else
      L_i.remove(block)

Lemma 2.
The number of blocks present in H is O(n) if each record r is present in a constant number of blocks.

Proof. Our algorithm considers each record u ∈ V and generates intersection blocks by performing the conjunction of blocks that contain the record u. Suppose the record u is present in γ_u blocks in the first layer. Then the maximum number of blocks present in H that contain u is Σ_{i=1}^{d} (γ_u choose i). Assuming γ_u is a constant, the maximum number of blocks in the hierarchy is n · Σ_{i=1}^{d} (γ_u choose i) = O(n).

Refinement algorithm.
We are now ready to describe pBlocking's intersection method for building the block hierarchy. Our method has two steps:

• (Alg. 2) The first step creates all possible blocks considering the intersection search space.

• (Alg. 3) The cleaning phase removes the blocks that do not satisfy the correlation criterion described above.

Algorithm 2 describes the creation step, which iterates over all the records in the corpus and creates all possible blocks per record. The list of all blocks to which a record belongs is constructed (denoted by blockLst) and the new blocks are added in different layers. The layer of the new block depends on the number of intersecting blocks that constitute the new block. Then, the cleaning step in Algorithm 3 iterates over the different layers and keeps only the blocks that satisfy the score or size requirements. For a block in layer q, getParents() identifies the two blocks in layer (q − 1) whose conjunction generates the block being considered. If these parents have been removed during the cleaning phase, then their parents are considered and the process is continued recursively until we end up at the ancestors present in the list of blocks.

Block Layers Creation (Alg. 2) constructs all the blocks in the form of a hierarchy and Layer Cleaning (Alg. 3) deactivates the blocks that do not satisfy the correlation requirements. Since the result of block layers creation does not change across pBlocking iterations, decoupling the creation component from the cleaning component (which changes dynamically) allows for more efficient computation.
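The two steps can be sketched in Python for a depth-2 hierarchy using the size-based correlation condition, |B_{i,j}| > |B_i| · |B_j| / n (the block names, record ids and the value of n are illustrative; the actual algorithms also handle deeper layers and the score-based condition):

```python
from collections import defaultdict
from itertools import combinations

def create_second_layer(first_layer):
    """Alg. 2 sketch (depth 2): scan records and intersect only pairs
    of first-layer blocks that co-occur in some record."""
    record_to_blocks = defaultdict(list)
    for name, members in first_layer.items():
        for r in members:
            record_to_blocks[r].append(name)
    layer2 = {}
    for r, names in record_to_blocks.items():
        for a, b in combinations(sorted(names), 2):
            if (a, b) not in layer2:
                layer2[(a, b)] = first_layer[a] & first_layer[b]
    return layer2

def clean_layer(first_layer, layer2, n):
    """Alg. 3 sketch: keep a refined block only if its parents are
    positively correlated, i.e. |B_ij| > |B_i| * |B_j| / n."""
    return {key: blk for key, blk in layer2.items()
            if len(blk) > len(first_layer[key[0]]) * len(first_layer[key[1]]) / n}

L1 = {"corvette": {"r1", "r2", "r4"},
      "c6": {"r1", "r2", "r3"},
      "navigation": {"r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"}}
L2 = clean_layer(L1, create_second_layer(L1), n=10)
```

With these (made-up) sizes only 'c6' ∧ 'corvette' survives: its two records exceed the 3 · 3 / 10 threshold, while the intersections with the large 'navigation' block do not.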
Time complexity.
Assuming the depth of the hierarchy is a constant, Algorithms 2 and 3 operate in time linear in the number of records n. Block refinement takes 3 minutes for a dataset with 1M records in our experiments.

5 Block Scoring

Let A′ ⊂ V × V be the pairs selected by the blocking phase at a given point (we recall that A′ is the edge set of the blocking graph P = (V, A′)), and let each considered pair (u, v) ∈ A′ have a similarity value denoted by p_m(u, v). A block B ⊆ V refers to a subset of records. Using this notation, we discuss the different methods for scoring blocks and how the scores converge with feedback for effective ER performance.

Block scoring.
Block scoring helps to distinguish informative blocks based on their ability to capture records from a single cluster. By selecting pairs within informative blocks, downstream ER operations can focus on record pairs that have a high probability of being a match. The most common mechanism used in the literature is TF-IDF: it assigns block scores inversely proportional to the block size, prioritizing smaller blocks over larger ones. If the data set has small clusters, such a simple method can work well. However, if the data set has a skewed cluster size distribution, some large blocks are just uninformative (and are rightfully less preferred by TF-IDF), but others can represent a large cluster and thus should stand out in the scoring. Distinguishing these blocks before pair matching can be difficult, but pBlocking provides a way to leverage the feedback.

Specifically, the scoring algorithm of pBlocking prioritizes blocks having (a) a high fraction of matching pairs, measured as the matching probability within a block, and (b) a small number of clusters (especially larger clusters), measured as uniformity (a function of the entropy of the cluster distribution within a given block B). Lower entropy, and hence lower diversity, indicates the representativeness of B towards a particular cluster, as opposed to higher entropy values, which indicate the presence of many fragmented clusters.

More formally, the matching probability score identifies the probability that a randomly chosen pair (u, v), with u, v ∈ B, refers to the same entity, and is defined as follows.

Definition 2 (Matching Probability score p(B)). The value p(B) is defined as the fraction of matching pairs within a block B.

The block uniformity u(B) captures the perplexity of the cluster distribution within B, measured in terms of its entropy.

Definition 3 (Cluster Entropy H(B)). The cluster entropy of a block, H(B), refers to the entropy of the cluster distribution when restricted to the records present in block B.
Mathematically, H(B) = −Σ_{C ∈ C} p_C log p_C, where p_C = |C ∩ B| / |B| refers to the probability that a randomly chosen node from B belongs to cluster C.

Using H(B), the block uniformity score is defined as follows.

Definition 4 (Block Uniformity u(B)). The block uniformity u(B) = e^{−H(B)} is the inverse of the perplexity [17] of the cluster distribution within the block, where perplexity refers to the exponential of the cluster distribution entropy.

Example 3.
Suppose that we know that a block B contains records of two clusters C1 and C2, and thus we can compute the uniformity of B exactly. If the two clusters are perfectly balanced in B, i.e., |C1 ∩ B| = 0.5 · |B| and |C2 ∩ B| = 0.5 · |B|, the entropy is H(B) = −0.5 log 0.5 − 0.5 log 0.5 ≈ 0.69, and thus u(B) = e^{−H(B)} = 0.5. If there is some skew, e.g., |C1 ∩ B| = 0.9 · |B| and |C2 ∩ B| = 0.1 · |B|, then the entropy is lower, H(B) = −0.9 log 0.9 − 0.1 log 0.1 ≈ 0.33, and the uniformity is higher, u(B) ≈ 0.72. In the extreme case where C1 ∩ B = B and C2 ∩ B = ∅, H(B) = 0 and u(B) = 1.

Note that when resolving two duplicate-free datasets where all clusters are of size 2 (also known as Record Linkage), the entropy increases with block size, thus block uniformity yields comparable results to traditional TF-IDF.

Since the goal of block scoring is to identify blocks that have high matching probability and high uniformity, we multiply the two values to get a final estimate of the block score.
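The entropy and uniformity computations can be sketched in a few lines (a minimal Python sketch; `cluster_counts` holds the per-cluster record counts |C ∩ B| within the block):

```python
import math

def uniformity(cluster_counts):
    """u(B) = exp(-H(B)), where H(B) = -sum p_C log p_C over the
    clusters represented in block B (Definitions 3 and 4)."""
    total = sum(cluster_counts)
    h = -sum((c / total) * math.log(c / total)
             for c in cluster_counts if c > 0)
    return math.exp(-h)
```

Two perfectly balanced clusters give u(B) = 0.5, a pure block gives u(B) = 1, and a 90/10 skew lands in between, closer to 1, matching the worked numbers above.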
Definition 5 (Block Score score(B)). The score of a block B, score(B), is defined as the product of the matching probability score and the uniformity score of B. That is, score(B) = p(B)·u(B).

Next, we describe the algorithm to estimate these components of the block score. The exact values of matching probability and block uniformity require complete ER results. However, pBlocking estimates these scores initially with the similarity estimates of every pair of records and refines them with additional feedback from partial ER results.
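A minimal sketch of the feedback-refined matching probability estimate (record and pair names are hypothetical; `p_m` holds prior similarity-based estimates, and `feedback` holds pairs already resolved by the ER phase, which override the prior):

```python
from itertools import combinations

def matching_probability(block, p_m, feedback):
    """p(B): average match probability over all record pairs in the block.

    p_m: dict mapping a (sorted) record pair to its prior estimate.
    feedback: dict mapping resolved pairs to True (match) / False (non-match).
    """
    pairs = list(combinations(sorted(block), 2))
    total = 0.0
    for pair in pairs:
        if pair in feedback:
            total += 1.0 if feedback[pair] else 0.0  # resolved pairs: 1 or 0
        else:
            total += p_m.get(pair, 0.0)              # unlabelled pairs: prior
    return total / len(pairs) if pairs else 0.0

block = {"r1", "r2", "r3"}
p_m = {("r1", "r2"): 0.8, ("r1", "r3"): 0.4, ("r2", "r3"): 0.6}
print(matching_probability(block, p_m, feedback={}))            # prior only
print(matching_probability(block, p_m, {("r1", "r3"): False}))  # with feedback
```

With no feedback the score is the plain average of priors (0.6 here); declaring (r1, r3) a non-match lowers it, illustrating how partial ER output sharpens the estimate.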
Matching probability score. The matching probability score is estimated as the average matching similarity of the pairs of records within the block, i.e.:

p(B) = Σ_{u,v ∈ B} p_m(u, v) / (|B| choose 2)

where p_m(u, v) is estimated as follows:

• for pairs declared as matches, we set p_m(u, v) = 1;
• for pairs declared as non-matches, we set p_m(u, v) = 0;
• for unlabelled pairs, we use the p_m values computed by common similarity metrics (e.g., via Jaccard similarity or the similarity-to-probability mapping as in [26]).

Block uniformity estimation.
Estimating the uniformity score requires the cluster size distribution in B, which is harder to infer from the prior similarity values. We next describe a mechanism to estimate the entropy H(B) needed to compute the uniformity score. We consider each record u ∈ B and the cluster C_u that contains u. We are interested in computing |C_u ∩ B| / |B| in order to compute the entropy H(B). Instead, we compute the expected size of |C_u ∩ B| as E_u = E[|C_u ∩ B|] = Σ_{v ∈ B} p_m(u, v), based on the p_m values of the edges incident on u. We compute the expected cluster size for every record u ∈ B and sort the records in non-increasing order of E_u. Let L be the sorted list and let u₁ be the first record in L, that is, the record with the highest expected cluster size in B. In expectation, u₁ has E_{u₁} records in B that belong to C_{u₁}, and all these records must have similar expected cluster sizes as well. We put u₁ and the next ⌊E_{u₁}⌋ records from L into a set S_{U₁}, assuming that they belong to the same cluster C_{u₁}. We recurse on L \ S_{U₁} until a partition {S_{U₁}, S_{U₂}, . . .} of the block is generated. The size of each partition can be thought of as a rough estimate of the true cluster distribution in B and is used to calculate the entropy.

Example 4.
Consider a block B with |B| = 10. Let [u₁, u₂, . . . , u₁₀] be the corresponding list L of records, sorted in non-increasing E_{u_i} values. If E_{u₁} = Σ_{i ∈ 2…10} p_m(u₁, u_i) = 6.2, we set S_{U₁} = {u₁, . . . , u_{1+⌊E_{u₁}⌋}} = {u₁, . . . , u₇} and then consider the next record in L, which is u₈. If E_{u₈} = Σ_{i ∈ {9,10}} p_m(u₈, u_i) = 2, we set S_{U₂} = {u₈, . . . , u₁₀} and then finish. As |S_{U₁}| = 0.7·|B| and |S_{U₂}| = 0.3·|B|, we estimate u(B) = e^{−(−0.7 log 0.7 − 0.3 log 0.3)} ≈ 0.54.

The value returned by this mechanism is generally an under-estimate of the true entropy H(B), but in practice it approaches H(B) quickly with increasing feedback data and turns out to be very efficient. Section 6.2 discusses this convergence rate in different application scenarios.

Efficient block cleaning.
Traditional scoring strategies such as TF-IDF are based on block size computation and thus operate in linear time. Computing our score(B) values requires instead processing intra-block pairs and thus yields potentially quadratic computation. Hence, we sample Θ(log n) records from each block for its score computation. This strategy operates in Θ(log² n) time per block and takes less than 1 minute for a data set with 1M records in our experiments. Our sampling strategy gives an approximation within a factor of (1 + ε) of the matching probability scores estimated using all the records within each block (Lemma 7).

In this section we present a theoretical analysis of the effectiveness of pBlocking. We first analyze the pair recall of blocking in the absence of feedback by considering a natural generative model for block creation. Next we analyze the effect of feedback on block scoring and the final recall.
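Before turning to the analysis, the uniformity-estimation mechanism described above (expected cluster sizes, sorting, greedy grouping) can be sketched in code. The p_m matrix below is a hypothetical input chosen to reproduce the 7/3 split of Example 4; this is an illustrative sketch, not the authors' exact implementation:

```python
import math
from itertools import combinations

def estimate_uniformity(block, p_m):
    """Estimate u(B) = e^{-H(B)} via greedy partitioning: sort records by
    expected cluster size E_u = sum_v p_m(u, v), group the top record with
    its next floor(E_u) records, and recurse on the rest."""
    def pm(u, v):
        return p_m.get((u, v), p_m.get((v, u), 0.0))

    expected = {u: sum(pm(u, v) for v in block if v != u) for u in block}
    remaining = sorted(block, key=lambda u: expected[u], reverse=True)
    sizes = []
    while remaining:
        head, rest = remaining[0], remaining[1:]
        take = 1 + int(sum(pm(head, v) for v in rest))  # head + floor(E_head)
        sizes.append(len(remaining[:take]))
        remaining = remaining[take:]
    n = sum(sizes)
    h = -sum((s / n) * math.log(s / n) for s in sizes)  # entropy of partition
    return math.exp(-h), sizes

# Hypothetical p_m: records u1..u7 are pairwise certain matches, as are u8..u10.
block = [f"u{i}" for i in range(1, 11)]
p_m = {}
for group in (block[:7], block[7:]):
    for a, b in combinations(group, 2):
        p_m[(a, b)] = 1.0

u_B, sizes = estimate_uniformity(block, p_m)
print(sizes, round(u_B, 2))  # partition sizes [7, 3], uniformity ~0.54
```

The recovered partition sizes (7 and 3 out of 10) yield the same uniformity estimate as Example 4.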
We start with the following basic lemma.
Lemma 3.
The blocking graph P = (V, A′) contains a spanning tree for each clique C of C = (V, E⁺) iff the Pair Recall is 1.

Proof. If A′ contains a spanning tree for each clique C, then any pair (u, v) ∈ A′ ∩ E⁺ contributes directly to the recall. All pairs of records (u, v) that refer to the same entity, (u, v) ∈ E⁺, but are not present in A′ can be inferred from the edges in the spanning tree using transitivity, ensuring Pair Recall = 1. For the converse, assume that ∃C ∈ C such that A′ does not contain a spanning tree over the matching edges. This implies that C is split into multiple components (say C₁, C₂) when restricted to A′ ∩ E⁺ edges. In this case, the matching edges joining these components, {(x, y), ∀x ∈ C₁, y ∈ C₂}, cannot be inferred, as none of these edges are processed by the mentioned ER operations, yielding a pair recall of P less than 1.

Our probabilistic model for block creation is motivated by the standard blocking [20], sorted neighborhood [13] and canopy clustering [18] algorithms, which aim to generate blocks that capture high-similarity candidate pairs. This model of block generation is closely related to Random Geometric Graphs [27], which were proposed by Gilbert in 1961 and have been widely used to analyze spatial graphs.

Definition 6 (Random Geometric Graphs). Let S_t refer to the surface of a t-dimensional unit sphere, S_t ≡ {x ∈ R^{t+1} : ||x|| = 1}. A random geometric graph G_t(V, E) of n vertices V has parameters t ∈ Z⁺ and a real number r ∈ [0, 2]. It assigns each vertex i ∈ V to a point chosen independently and uniformly at random within S_t, and any pair of vertices i, j ∈ V are connected if the distance between their respective points is less than r.

Now, we define the probabilistic block generation model.
Definition 7 (Probabilistic Block Generation). The block generation model places the records u ∈ V independently and uniformly at random within S_t. Every record u constructs a ball of volume (α log n / n) with u as the center, where α is a given parameter, and all points within the ball are referred to as block B_u.

The set of points present within a ball B_u can be seen as high-similarity points that would have been chosen as blocking candidates in the absence of feedback. Our probabilistic block generation model constructs n blocks, one for each record, and every pair of records that co-occur in a block B_u, u ∈ V, has an edge in the blocking graph P_g(V, E) (subscript g to emphasize the generative model). Next we analyze the pair recall of P_g(V, E).

Notation.
Let d(u, v) denote the distance between records u and v, and let r_ε denote the radius of an ε-volume ball in t dimensions, so that ε = O(r_ε^t). Under these assumptions we first show that the expected number of edges in the blocking graph P_g is at least α(n − 1) log n / 2, and then that P_g(V, E) has recall ≪ 1.

Lemma 4. The blocking graph P_g(V, E) contains at least α(n − 1) log n / 2 candidate pairs in expectation.

Proof. Each record u ∈ V constructs a spherical ball of volume α log n / n with u as the center, and all points within the ball are added as neighbors of u in the blocking graph. Hence, the expected number of neighbors of u within the ball is α(n − 1) log n / n. There are a total of n such blocks (one ball per record) and each candidate pair (u, v) is counted at most twice (once for the block B_u and once for the block B_v). Hence there are at least α(n − 1) log n / 2 candidate pairs in expectation. Notice that this analysis ignores the candidate pairs (u, v) that are more than r_{α log n/n} from each other but are still connected in the blocking graph. This happens if they are present together in another block centered at some w ∈ V \ {u, v}, that is, ∃w such that d(u, w) ≤ r_{α log n/n} and d(v, w) ≤ r_{α log n/n}. This shows that the total number of candidate pairs in the blocking graph is at least α(n − 1) log n / 2.

Additionally, P_g(V, E) has the following property:

Lemma 5. A blocking graph P_g is a subgraph of a random geometric graph G_t with r = 2 r_{α log n/n}.

Proof. Following the construction of the blocking graph, if the distance between a pair of vertices u, v ∈ V is at most r_{α log n/n}, then (u, v) ∈ E. Similarly, for any pair of vertices u, v ∈ V with d(u, v) > 2 r_{α log n/n}, we have (u, v) ∉ E. However, if r_{α log n/n} < d(u, v) ≤ 2 r_{α log n/n}, the pair (u, v) belongs to P_g only if ∃w ∈ V such that d(u, w) ≤ r_{α log n/n} and d(v, w) ≤ r_{α log n/n}. This shows that the blocking graph P_g is a subgraph of the random geometric graph in which a pair of vertices (u, v) is connected iff d(u, v) ≤ 2 r_{α log n/n}.

This means that if G_t has suboptimal recall then P_g also has poor recall, and hence we analyze the recall of G_t with r = 2 r_{α log n/n}. Lemma 3 shows that the blocking graph achieves recall 1 only if it contains a spanning tree of each cluster. Hence, we analyze the formation of spanning trees in G′_t = G_t(V, E ∩ E⁺), that is, G_t restricted to matching edges. We show the following result.

Lemma 6. The graph G_t restricted to the matching edges in the ground truth E⁺ splits any cluster C with |C| = o(n/α) into multiple components.

Proof. By the connectivity result of [27], a random geometric graph G_t of n nodes is disconnected if the expected degree of its nodes is < log n; moreover, disconnection splits the graph into many smaller components. Therefore, a cluster C is disconnected in G′_t = G_t(V, E ∩ E⁺) if the degree of each of its vertices is < log |C|. The expected degree of a record u ∈ C restricted to G′_t is O(|C| · (α log n / n)) = o(log n) if |C| = o(n/α). Hence, the expected degree of each node within a cluster C is o(log |C|), leading to the formation of disconnected components within C.

Theorem 1.
A blocking graph P_g(V, E) generated according to the probabilistic block model has recall < 1 unless all clusters have size Θ(n), assuming α is a constant.

Proof. Lemma 6 shows that a cluster C of size o(n/α) is split into multiple disconnected components when restricted to matching edges. Hence, the blocking graph P_g does not contain a spanning tree of C and has recall less than 1 (Lemma 3). Since the cluster C is broken into many small components, the drop in recall is also significant.

Remark. The analysis extends to less noisy data, e.g., when only a constant fraction of the records are placed randomly on the unit sphere and the remaining records are grouped together according to the cluster they belong to. Our analysis exposes the lack of robustness of performing blocking without feedback.
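The recall criterion of Lemma 3 is easy to check computationally: pair recall is 1 exactly when every ground-truth cluster stays connected in the blocking graph restricted to matching edges. A small union-find sketch (illustrative; records, edges and the helper names are hypothetical):

```python
class DSU:
    """Disjoint-set union over a fixed node set, with path halving."""
    def __init__(self, nodes):
        self.parent = {u: u for u in nodes}
    def find(self, u):
        while self.parent[u] != u:
            self.parent[u] = self.parent[self.parent[u]]
            u = self.parent[u]
        return u
    def union(self, u, v):
        self.parent[self.find(u)] = self.find(v)

def pair_recall(nodes, candidate_edges, matching_edges):
    """Fraction of matching pairs recoverable from candidates via transitivity."""
    dsu = DSU(nodes)
    for (u, v) in candidate_edges & matching_edges:  # only matching candidates merge clusters
        dsu.union(u, v)
    recovered = sum(1 for (u, v) in matching_edges if dsu.find(u) == dsu.find(v))
    return recovered / len(matching_edges)

nodes = {1, 2, 3, 4}
matches = {(1, 2), (1, 3), (2, 3)}   # records 1, 2, 3 form one entity
spanning = {(1, 2), (2, 3)}          # a spanning tree of that cluster
print(pair_recall(nodes, spanning, matches))  # 1.0: (1,3) inferred transitively
print(pair_recall(nodes, {(1, 2)}, matches))  # 1/3: the cluster is split
```

A spanning tree of the cluster suffices for recall 1, while dropping one of its edges splits the cluster and recall falls, exactly as the lemma states.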
In this section we analyze the pair recall of blocking when employed with pBlocking. For this analysis we consider a noisy edge similarity model for p_m(u, v) that builds on the edge noise model studied in prior work on ER [8].

Definition 8 (Noisy edge model). The noisy edge model defines the similarity of a pair of records with parameters θ ∈ (0, 1), β = Θ(log n) and β′ = Θ(log n). A matching edge (u, v) ∈ E⁺ has a similarity distributed uniformly at random within [θ, 1] with probability 1 − β/n, and with the remaining probability its similarity is distributed uniformly within [0, θ). A non-matching edge has a similar distribution on similarity values, with β′ in place of β.

When β ≪ β′, the matching probability score of a block with a higher fraction of matching edges is much higher than that of a block with fewer matching edges, and the pBlocking algorithm will consider blocks in the correct order even in the absence of feedback. The most challenging case is when non-matching edges are generated with a distribution similar to that of matching edges, that is, when β and β′ are close. We define a random variable X(u, v) to refer to the edge similarity distributed according to the noisy edge model. Following this notation, let μ_g and μ_r denote the expected similarity of a matching and a non-matching edge, respectively:

μ_g = (1 − β/n)·(1 + θ)/2 + (β/n)·(θ/2),

and μ_r has the same form with β′ in place of β.

We show that the feedback-based block score, initialized with TF-IDF weights, achieves perfect recall with a feedback of Θ(n log n) pairs, assuming that the ER phase makes no mistakes on the pairs that it processes (helping to ensure the correctness of partially inferred entities) and that the feedback from the ER phase is distributed randomly across the edges within a block. We also discuss the extension where the feedback is biased towards pairs from large entity clusters and high-similarity pairs. In those scenarios, pBlocking's scoring mechanism converges even quicker, leveraging the larger feedback due to transitivity.

Effect of Sampling.
First, we show that sampling Θ(log n) records from a block gives an approximation within a factor of (1 + ε) of the matching probability score computed using all the records.

Lemma 7. For a block B with |B| > c log n, the matching probability score of B estimated by sampling Θ(log n / ε) records randomly is within a factor [(1 − ε), (1 + ε)] of p(B) with probability 1 − o(1), where p(B) is the score using all |B| records.

Proof. Consider a block B with more than c log n records. Let X(u, v) denote the edge similarity of a pair (u, v) according to the noisy edge model. The matching probability score of B when considering the complete block is (1 / (|B| choose 2)) Σ_{u,v ∈ B} X(u, v). The expected score of the block (μ_B) is

(1 / (|B| choose 2)) E[Σ_{u,v ∈ B} X(u, v)] = (1 / (|B| choose 2)) Σ_{(u,v) ∈ E⁺} E[X(u, v)] + (1 / (|B| choose 2)) Σ_{(u,v) ∈ E⁻} E[X(u, v)] = (1 − α) μ_g + α μ_r,

where α is the fraction of non-matching pairs in the block B. For a sample S of c log n / ε′ records, where ε′ = ε / (2 + ε), the expected probability score (μ_S) is also (1 − α) μ_g + α μ_r:

(1 / (|S| choose 2)) E[Σ_{u,v ∈ S} X(u, v)] = (1 / (|S| choose 2)) Σ_{(u,v) ∈ E⁺} E[X(u, v)] + (1 / (|S| choose 2)) Σ_{(u,v) ∈ E⁻} E[X(u, v)] = (1 − α) μ_g + α μ_r.

Using Hoeffding's inequality [14],

Pr[(1 / (|S| choose 2)) Σ_{u,v ∈ S} X(u, v) ≤ (1 − ε′) μ_S] ≤ e^{−2 ε′² μ_S² (|S| choose 2)} ≤ e^{−2 log n} = 1/n².

Using the same argument, we can show that

Pr[(1 − ε′) μ_S ≤ (1 / (|S| choose 2)) Σ_{u,v ∈ S} X(u, v) ≤ (1 + ε′) μ_S] ≥ 1 − 2/n².

This shows that the probability score calculated on the sample S is within a factor of (1 − ε′) and (1 + ε′) of the expected score with probability 1 − o(1). The probability score of B computed on all records is also within a factor of (1 − ε′) and (1 + ε′) of the expected value μ_S. Therefore, the estimated score on sampling guarantees an approximation within a factor of (1 + ε′)/(1 − ε′) = 1 + 2ε′/(1 − ε′) = 1 + ε with high probability.

The above lemma extends to block uniformity because the p_m values are used analogously for expected cluster sizes. In Lemma 8 we show how to set the constant within the Θ notation based on the level of noise in the p_m values.

To prove the convergence of pBlocking, we first estimate lower and upper bounds on the matching probability score of a block B in the presence of feedback, and show that a feedback of Θ(log n) pairs per block is enough to rank blocks with a larger fraction of matching pairs higher than blocks with fewer matching pairs. Our analysis first considers the blocks containing more than γ log n records (where γ is a large constant, say 12); we analyze the smaller blocks separately.

Convergence for large blocks.
First, we evaluate the converged block scores with a feedback F and derive the condition under which the block scores are in the correct order. For this analysis, we consider the fraction of matching edges for block score computation, but similar lemmas extend to the uniformity score calculation.

Lemma 8. For every block B with more than γ log n records, the matching probability score p(B) after a feedback F of O(log n) randomly chosen pairs is at most (1 − α)|F| / (γ log n choose 2) + 1.1 p′ (1 − |F| / (γ log n choose 2)) with probability 1 − 1/n², where α is the fraction of non-matching pairs in B, γ is a constant and p′ = μ_g (1 − α) + μ_r α.

Proof. For block scoring, pBlocking considers a sample S of γ log n records (where γ is a large constant), chosen so that the feedback F ⊆ S × S belongs to the sample. The expected number of matching edges identified with feedback over randomly chosen pairs is (1 − α)|F|. Let X(u, v) be a random variable for the similarity of the pair (u, v) and μ(u, v) its expected value. For |S| = γ log n, the expected total similarity of the non-feedback edges within S is

Σ_{u,v ∈ S, (u,v) ∉ F} μ(u, v) = Σ_{(u,v) ∈ E⁺} μ_g + Σ_{(u,v) ∉ E⁺} μ_r = ((γ log n choose 2) − |F|) (μ_g (1 − α) + μ_r α).

We use Hoeffding's inequality to bound the total similarity Σ X(u, v) of the T = (γ log n choose 2) − |F| edges which do not have feedback:

Σ_{u,v ∈ S, (u,v) ∉ F} X(u, v) ≤ (1 + δ) Σ_{u,v ∈ S, (u,v) ∉ F} μ(u, v)

with probability at least 1 − 1/n² after substituting δ = 0.1, since μ_r, μ_g > 1/2 and T = Θ(log² n). Hence, the matching probability score of B is at most ((1 − α)|F| / (γ log n choose 2)) + 1.1 p′ (1 − |F| / (γ log n choose 2)) with high probability.

Similarly, we prove a lower bound on the block score.

Lemma 9.
For every block B with |B| ≥ γ log n, the matching probability score after a feedback F of O(log n) record pairs in B is at least (1 − α)|F| / (γ log n choose 2) + 0.9 p′ (1 − |F| / (γ log n choose 2)) with probability 1 − 1/n², where p′ = μ_g (1 − α) + μ_r α and γ is a constant.

Now, we analyze different scenarios of edge noise to understand the trade-off between required feedback and noise.
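To make the noisy edge model concrete, the following sketch computes the means μ_g and μ_r and checks μ_g against a simulation (the parameter values θ, β, β′ are illustrative choices, not from the paper):

```python
import math
import random

def mu(theta, noise, n):
    """Expected similarity under the noisy edge model: with probability
    1 - noise/n the similarity is uniform in [theta, 1] (mean (1+theta)/2),
    otherwise uniform in [0, theta) (mean theta/2)."""
    q = noise / n
    return (1 - q) * (1 + theta) / 2 + q * theta / 2

def sample_similarity(theta, noise, n, rng):
    """Draw one similarity value from the model."""
    if rng.random() < 1 - noise / n:
        return rng.uniform(theta, 1.0)
    return rng.uniform(0.0, theta)

n, theta = 10_000, 0.5
beta, beta_prime = math.log(n), 20 * math.log(n)  # beta << beta', illustrative
mu_g, mu_r = mu(theta, beta, n), mu(theta, beta_prime, n)

rng = random.Random(0)
emp = sum(sample_similarity(theta, beta, n, rng) for _ in range(200_000)) / 200_000
print(mu_g, mu_r, emp)
```

With β ≪ β′ the two means separate (μ_g > μ_r), so blocks with more matching edges score higher even without feedback; as β′ approaches β the means coincide, which is the challenging regime the lemmas below address.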
Lemma 10.
For every pair of blocks B_c, B_d with more than γ log n records, where B_c has a 1 − α fraction of matching edges and B_d has a 1 − β fraction (with α < β), the matching probability score estimate of B_c is greater than that of B_d with probability 1 − 1/n², even in the absence of feedback, provided that 0.9((1 − α) μ_g + α μ_r) > 1.1((1 − β) μ_g + β μ_r).

Proof. Using Lemmas 8 and 9 with |F| = 0, we obtain the condition under which score(B_c) > score(B_d) holds with probability 1 − 1/n². To guarantee this for all blocks simultaneously, we take a union bound over the Θ(n²) pairs of blocks, giving a success probability of 1 − o(1).

The previous lemma covers the scenario where the noise is not high and the prior-based estimation of matching probability scores gives a correct ordering of blocks. Now, we consider the more challenging noisy scenario and show that Θ(log n) feedback per block is enough for a correct ordering.

Lemma 11. For every pair of blocks B_c, B_d with more than γ log n records, where B_c has a 1 − α fraction of matching edges and B_d has a 1 − β fraction (with α < β), the matching probability score estimate of B_c is greater than that of B_d with probability 1 − 1/n² whenever the ER phase provides an overall feedback of Θ(n log n) randomly chosen edges.

Proof. Using Lemma 9, score(B_c) ≥ (|F| / (γ log n choose 2))(1 − α) + 0.9(μ_g(1 − α) + α μ_r)(1 − |F| / (γ log n choose 2)), and using Lemma 8, score(B_d) ≤ (|F| / (γ log n choose 2))(1 − β) + 1.1(μ_g(1 − β) + β μ_r)(1 − |F| / (γ log n choose 2)), each with probability 1 − 1/n². Hence, score(B_c) > score(B_d) holds if |F| = c log n for a sufficiently large constant c. With a union bound over the (n choose 2) pairs of blocks, the score of any block B_c (with a higher fraction of matches) is higher than that of any block B_d (with a lower fraction of matches) with probability 1 − o(1). The total feedback needed to ensure Θ(log n) feedback on each block is Θ(n log n), as we consider Θ(n) blocks for scoring.

Convergence for small blocks.
The above analysis does not extend to blocks of size less than γ log n. However, all these blocks are ranked higher than the large blocks by TF-IDF. Hence, when pBlocking is initialized, the initial set of candidates includes all these blocks before any of the larger blocks. In the worst case there can be δn such blocks, for some constant δ, because our approach constructs a constant number of blocks per record (say δ). Thus, the maximum number of candidates contributed by the small blocks is δn · (γ log n choose 2), and all these candidates are considered in the first iteration of pBlocking. Following the discussion on small and large blocks, we prove the main result on the convergence of pBlocking.

Theorem 2. The pBlocking pipeline achieves perfect recall with a feedback of O(n log n) pairs spread randomly across blocks.

Proof. For blocks with more than γ log n records, Lemmas 10 and 11 show that a block with a higher fraction of matching pairs is ranked higher than a block with fewer matching pairs when provided with a feedback of Θ(n log n). Blocks with fewer than γ log n records are not covered by that analysis, but in the worst case these blocks generate O(n log n) candidates, as the maximum number of blocks considered is Θ(n). This ensures that a feedback of Θ(n log n) is sufficient for the stated result.

Discussion.
Lemma 11 considers the convergence of block scores when the feedback is provided randomly over Θ(log n) edges within a block. If the feedback is biased towards Θ(log n) non-matching edges, the scores of noisier blocks drop quicker and pBlocking converges faster. Similarly, if the ER algorithm queries pairs with higher similarity (e.g., edge ordering [34]) or grows clusters by processing nodes (e.g., node ordering [33]), providing larger feedback due to transitivity, this only accelerates the growth (reduction) in score of blocks with a higher (lower) fraction of matching pairs, leading to faster convergence.

Finally, for the presented analysis, we assumed that oracle answers are correct. Nonetheless, for a small amount of oracle errors, pBlocking keeps converging, only at a slightly slower rate, demonstrating robustness.

Table 3: Number of records n, number of clusters k (i.e., entities), size of the largest cluster |C₁|, total number of matches in the data set |E⁺|, and the reference to the paper where each data set appeared first.

dataset    n    k      |C₁|  |E⁺|   ref.  description
songs      1M   0.99M  2     146K   [6]   Self-join of songs with very few matches.
citations
products
cora
cars
camera
In this section we empirically demonstrate the ability of pBlocking to boost the efficiency and effectiveness of blocking and thus to improve the performance of ER. We also demonstrate the fast convergence of pBlocking, confirming our theoretical analysis in Section 6, and the robustness of pBlocking in different scenarios, including errors in ER results. This section is structured as follows.

• Section 7.2. We compare the efficiency and effectiveness of pBlocking to prior work, showing higher pair recall and faster running time on all the data sets.

• Section 7.3. We analyze pBlocking when used in conjunction with different ER methods, showing higher F-score (up to 60%) irrespective of the method of choice.

• Section 7.4. We study the dynamic performance of pBlocking and show its ability to converge monotonically to high effectiveness without compromising efficiency in different scenarios, including errors in ER results.

Before showing results we describe our experimental setup and the methods considered in our experiments.
Experimental set-up.
We implemented the algorithms in Java and the machine learning tools in Python. The code runs on a server with 500GB RAM. We used the Google Cloud Vision API (https://cloud.google.com/vision) to generate text descriptions of the image data (cars). For implementing the hierarchy, we observed that we can trim at a depth of 10 without any significant drop in performance.

Blocking methods.
We consider 8 strategies for the blocking sub-tasks described in Section 2 and combine them into 16 different pipelines. We study these pipelines with and without our pBlocking approach on top.

BB) We consider 4 methods for Block Building (BB) and follow the suggestions of [25] for their configuration. Standard blocking [20] (StBl) generates a new block for each text token in the data set. Q-grams blocking [11] (QGBL) generates a new block for each 3-gram of characters. Sorted neighborhood [13] (SoNE) sorts the tokens for each attribute and generates a new block for every sliding window of size 3 over these sort orders. Canopy clustering [18] (CaCl) generates a new block for each cluster of high-similarity records (calculated as unweighted Jaccard similarity). We construct multiple instances of canopies (blocks), one for each attribute (i.e., based on the similarity of record pairs with respect to that attribute) and one based on all attributes together.

BC) We consider 2 traditional block scoring methods for Block Cleaning (BC), dubbed TF-IDF [28] and uniform scoring (Unif). For comparison, each method retains the top-scored blocks containing up to M record pairs in total and then prunes the remaining blocks. We set the default M to 10 million. (We note that setting a score threshold, rather than a limit on the number of pairs, would not take different score distributions into account fairly.)

CC) We consider 2 popular methods for Comparison Cleaning (CC), dubbed meta-blocking [22] (MB) and BLOSS [5], and follow the suggestions of [22] for their configuration. Weights of record pairs are set to their Jaccard similarity weighted with the block scores from the BC sub-task. We consider the top 100 high-weight pairs for each record and prune the remaining record pairs.

We recall that variants of our approach are denoted as pBlocking(·,·,·), while traditional blocking pipelines without feedback are denoted as B(·,·,·); the parameters correspond to the techniques for the BB, BC and CC sub-tasks, respectively. Default methods are StBl for BB, TF-IDF for BC and MB for CC; pBlocking additionally uses a default feedback frequency φ, whose effect is studied in Section 7.4.

Pair matching and Clustering methods.
We consider the following 3 strategies that leverage the notion of an oracle to answer pairwise queries of the form "does u match with v?": (a) Edge [34], with default parameter settings; (b) Eager [9], the state-of-the-art technique for solving ER in the presence of erroneous oracle answers; (c) Node, the ER mechanism derived from [33], proposed as an improvement over Edge. The Eager algorithm handles noise for data sets whose number of matching pairs is much larger than n and performs similarly to Edge for data sets that have fewer matching pairs [8], so we use it as the default. We implement the abstract oracle tool with a classifier built using scikit-learn (https://scikit-learn.org/stable/) in Python. We consider two variants: Random Forest (default) and a Neural Network. The random forest classifier is trained with the default settings of scikit-learn. The neural network is implemented as a 3-layer convolutional neural network followed by two fully connected layers, using word2vec word embeddings for each token in the records. For structured data sets, we extract similarity features for each attribute as in [6]. For cars, we use the text descriptions to calculate text-based features along with image-based features. Given the unstructured nature of the text descriptions in some data sets, we also extracted POS tags using Spacy (https://spacy.io/). All the considered classifiers are trained off-line with less than 1,000 labelled pairs, containing a similar amount of matching and non-matching pairs. These labelled record pairs are the ones provided by the respective sources for citations, songs, products and camera (the papers mentioned in Table 3, column "ref."). For cars and cora, we perform active learning (following the guidelines of [6]) to identify a small set of labelled examples for training, which are excluded from the evaluation of blocking quality.

In this experiment we evaluate the empirical benefit of pBlocking compared to previous blocking strategies.
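Before the comparisons, a minimal sketch of the pairwise-oracle classifier described in the setup: a random forest over per-attribute similarity features, trained on a small labelled sample of pairs. Feature extraction is simplified here to token-Jaccard per attribute, and the records and attributes are hypothetical (the paper follows [6] for its actual features):

```python
from sklearn.ensemble import RandomForestClassifier

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def features(rec_a, rec_b, attributes):
    """One similarity feature per attribute, as in a simplified [6]-style setup."""
    return [jaccard(rec_a[attr], rec_b[attr]) for attr in attributes]

attributes = ["title", "artist"]
train_pairs = [  # (record pair, label): 1 = match, 0 = non-match
    (({"title": "hey jude", "artist": "the beatles"},
      {"title": "hey jude", "artist": "beatles"}), 1),
    (({"title": "hey jude", "artist": "the beatles"},
      {"title": "let it be", "artist": "the beatles"}), 0),
    (({"title": "yesterday", "artist": "the beatles"},
      {"title": "yesterday", "artist": "the beatles"}), 1),
    (({"title": "yesterday", "artist": "the beatles"},
      {"title": "imagine", "artist": "john lennon"}), 0),
]
X = [features(a, b, attributes) for (a, b), _ in train_pairs]
y = [label for _, label in train_pairs]
oracle = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The predicted match probability of an unseen pair serves as p_m feedback.
pair = ({"title": "hey jude", "artist": "beatles"},
        {"title": "hey jude", "artist": "the beatles"})
print(oracle.predict_proba([features(*pair, attributes)])[0][1])
```

The predicted class probability plays the role of the oracle answer (thresholded for match/non-match decisions) and of the p_m estimate fed back into block scoring.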
Blocking effectiveness.
Figure 2 compares the Pair Recall (PR) of pBlocking and of a traditional blocking pipeline B for different choices of the block building and comparison cleaning techniques, using the default block cleaning TF-IDF and the default M value. pBlocking achieves more than 0.90 recall on all the data sets and with all the block building strategies, demonstrating its robustness to different cluster distributions and properties of the data. Conversely, most of the considered block building strategies (StBl, QGBL and SoNE) have significantly lower recall even when used together with BLOSS for selecting the pairs wisely. QGBL and SoNE help to improve recall in data sets with spelling errors, but since our data sets contain very few spelling mistakes, StBl has slightly higher recall. In terms of the data sets, the no-feedback blocking approach B has varied behavior: products and camera yield the best performance due to relatively cleaner blocks that help to easily identify matching pairs even without feedback, while songs has higher noise and cars has a skewed cluster distribution, making them harder for previous techniques to handle. For this analysis, we do not consider cora (the smallest data set), as it has less than 2M pairs and hence all techniques achieve perfect recall. We observed similar trends with the Unif method for block cleaning in place of TF-IDF (discussed in the Appendix).

Figure 2: Pair recall of B(·, TF-IDF, ·) and pBlocking(·, TF-IDF, ·) with varying BB and CC, on songs, citations, products and cars; panels (a-d) use MB and panels (e-h) use BLOSS. CaCl did not finish within 24 hrs on the songs and citations data sets.

Table 4: Running time comparison of B(StBl, TF-IDF, MB) and pBlocking(StBl, TF-IDF, MB).

Dataset    Time to 0.95 pair recall    Pair recall with 1 hr budget
           pBlocking    B              pBlocking    B
songs
citations
cars
products
camera
cora
Blocking efficiency.
In this experiment, we consider two different settings to compare (i) the time required to achieve more than 0.95 pair recall and (ii) the pair recall when the pipeline is allowed to run for a fixed amount of time (1 hour). We run each technique for various values of M and choose the best value that satisfies the required constraints. In the case of a fixed running-time budget of 1 hour, we run pBlocking's feedback loop for the largest number of iterations that allows the pipeline to process all records within the time limit.

Table 4 compares the total time required to achieve 0.95 pair recall for each data set; this includes the time required by each approach to perform pair matching on the generated candidates. pBlocking provides more than a 3x reduction in running time for most large-scale data sets in this setting. In terms of the total number of pairs enumerated, pBlocking considers around M = 10 million pairs to achieve 0.95 recall for citations, as opposed to more than 200 million for B. We observed similar results for the other block building (SoNE, QGBL and CaCl) and cleaning strategies.

The last two columns of Table 4 compare the pair recall of the generated candidates when each technique is allowed to run for 1 hour. pBlocking achieves better pair recall than B across all data sets, and the gain in recall is higher for the larger data sets. The performance of pBlocking on cars is lower than in Figure 2d because the feedback loop does not converge completely within 1 hour; the pipeline runs for 8 rounds of feedback in this duration. This is consistent with the performance of pBlocking in Figure 4a, where the feedback is turned off after 10 iterations. The difference in performance between pBlocking and B is not as high for small, low-noise data sets such as products, cora and camera as it is for songs, citations and cars.

7.3 Robustness of Progressive Blocking

In this section, we evaluate the performance of pBlocking with varying strategies for pair matching and clustering in Algorithm 1 (referred to as W in the pseudo-code). For this analysis, we use the default setting for M as in Figure 2.

Varying ER methods.
We recall that pBlocking can be used in conjunction with a variety of techniques for pair matching and clustering. Table 5a compares the Pair Recall of the blocking graph when using the different progressive ER methods mentioned in Section 7.1. The final Pair Recall of pBlocking is more than 0.90 in all data sets and matching algorithms except citations for Node ER, and more than 0.85 in all cases. This observation confirms our theoretical analysis in Section 6.2, demonstrating that the feedback loop can improve the blocking irrespective of the ER algorithm under consideration (which is a desirable property for a blocking algorithm). The above comparison of ER performance considers the algorithms with a default choice of Random Forest classifier as the oracle. We observed that the feedback from the ER phase when using a Neural Network classifier contains slightly more errors, but the blocking phase with pBlocking shows similar recall. We provide more discussion on ER errors in Section 7.4.
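To make the plug-and-play role of the matcher W concrete, the loop can be sketched as follows. This is an illustrative toy (the function names, scoring rule and data are ours, not the paper's implementation): any pair-matching function W can be passed in, and its partial results are used to re-score blocks each round so that cleaner blocks are processed first.

```python
# Minimal sketch of a feedback loop in the spirit of Algorithm 1, with the
# pair matcher W as a pluggable function (illustrative, not the paper's code).

def block_score(block, matched, nonmatched):
    """Score a block by the fraction of its resolved pairs that matched."""
    pairs = [(a, b) for i, a in enumerate(block) for b in block[i + 1:]]
    resolved = [p for p in pairs if p in matched or p in nonmatched]
    if not resolved:
        return 0.5  # no feedback yet: unknown quality
    return sum(1 for p in resolved if p in matched) / len(resolved)

def progressive_blocking(blocks, W, rounds=3):
    matched, nonmatched = set(), set()
    for _ in range(rounds):
        # Process blocks in decreasing score order; results feed back
        # into the scores used in the next round.
        for block in sorted(blocks,
                            key=lambda b: -block_score(b, matched, nonmatched)):
            for i, a in enumerate(block):
                for b in block[i + 1:]:
                    pair = (a, b)
                    if pair in matched or pair in nonmatched:
                        continue
                    (matched if W(a, b) else nonmatched).add(pair)
    return matched

# Toy matcher: records match when they share a name prefix.
W = lambda a, b: a.split("_")[0] == b.split("_")[0]
blocks = [["acme_1", "acme_2", "zeta_1"], ["zeta_1", "zeta_2"]]
print(sorted(progressive_blocking(blocks, W)))
# [('acme_1', 'acme_2'), ('zeta_1', 'zeta_2')]
```

Because the loop only consumes match/non-match decisions, any of the Edge, Node or Eager strategies (or a classifier) can play the role of W.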
Benefit on the final ER result.
Table 5b compares the F-score of the final ER results when blocking is performed with and without pBlocking. In this experiment we use the state-of-the-art algorithm Eager as the pair matching algorithm, with default parameter values. The final F-score achieved with feedback is more than 0.9 for all data sets except products. For songs, citations and cars, the F-score of pBlocking is 1.5 times higher than that of the traditional blocking pipeline without feedback, demonstrating the effects of better blocking effectiveness and efficiency.

Table 5: (a) Pair recall of pBlocking on varying ER strategies. (b) Comparison of the final F-score of the Eager method. The blocking graph is computed with pBlocking (StBl, TF-IDF, MB) and B (StBl, TF-IDF, MB) (both with default settings).

Figure 3: Progressive behavior of pBlocking with varying feedback frequency and errors in the feedback (cars). (a) Feedback frequency. (b) Oracle error.

This section studies the performance of pBlocking dynamically, in terms of (i) the effect of the feedback frequency φ, (ii) the effect of error on convergence, and (iii) the convergence of the blocking result in the maximum number of rounds.

Feedback frequency.
The φ parameter represents the fraction of newly processed record pairs after which feedback is sent from the partial ER results back to the blocking phase. Therefore, the parameter φ controls the maximum number of rounds of pBlocking and how often the blocking graph is updated. To describe the effect of varying φ, Figure 3a shows the F-score of the ER results as a function of the percentage of rounds completed, which we refer to as the blocking progress (not to be confused with the "ER progress" in Algorithm 1). In the figure, different curves correspond to different feedback frequencies, including the default one (in blue). The plot shows that, by updating the blocking graph more frequently (and thus increasing the number of rounds), the F-score increases faster when φ is reduced from 0.08 to 0.01. The plot also shows that the F-score corresponding to smaller values of φ (down to 0.01) is consistently higher than or equal to the F-score corresponding to larger values of φ. Given that the running time of the pipeline increases with more frequent updates (smaller values of φ), there appears to be limited value in decreasing φ below 0.01.

Figure 4: Effect of the feedback loop on the cars dataset. (a) Pair Recall comparison. (b) F-score comparison.
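As a concrete reading of φ, the sketch below (illustrative names, not the paper's code) computes how many feedback rounds a given φ induces when feedback fires after every φ-fraction of the candidate pairs is newly processed; halving φ roughly doubles the number of rounds.

```python
# Illustrative sketch: the feedback-frequency parameter phi triggers a
# blocking-graph update after every phi-fraction of newly processed pairs.

def feedback_rounds(total_pairs, phi):
    """Return the pair counts at which feedback would fire for a given phi."""
    step = max(1, int(phi * total_pairs))  # pairs processed between updates
    return list(range(step, total_pairs + 1, step))

# Smaller phi -> more frequent updates -> more rounds of feedback.
print(len(feedback_rounds(1000, 0.08)))  # 12
print(len(feedback_rounds(1000, 0.01)))  # 100
```

This is why the runs with φ = 0.01 in Figure 3a update the blocking graph far more often, at a corresponding cost in running time.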
As in the previous experiment, Figure 3b shows the effect of synthetic error in the ER results, obtained by varying the fraction of erroneous oracle answers. To this end, we corrupted the oracle answers randomly so as to get the desired amount of noise. We note that even when 1 out of 5 answers is wrong, the final F-score is almost 0.8, growing monotonically from the beginning to the end at the cost of a few extra pairs compared. pBlocking converges more slowly with higher error, but the error does not accumulate and it performs much better than any other baseline. Additionally, we observed that even with 20% error, the pair recall of pBlocking is as high as 0.98, even though the F-score is close to 0.8 due to mistakes made by the pair matching and clustering phase. This confirms that pBlocking is robust to errors in the ER results and maintains high effectiveness, producing ER results with high F-score.
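The error-injection protocol described here can be sketched as a wrapper around a perfect oracle. The helper below is illustrative (our names, not the paper's code): it flips each answer independently with probability p.

```python
# Illustrative sketch of the error injection used in this experiment:
# wrap a perfect oracle so each answer is flipped with probability p.
import random

def noisy_oracle(oracle, p, rng):
    def answer(a, b):
        truth = oracle(a, b)
        return (not truth) if rng.random() < p else truth
    return answer

perfect = lambda a, b: a == b       # toy ground-truth matcher
rng = random.Random(42)             # fixed seed for reproducibility
corrupted = noisy_oracle(perfect, 0.2, rng)

# Roughly 20% of repeated answers on a matching pair should be flipped.
errors = sum(1 for _ in range(10000) if corrupted(1, 1) is False)
print(errors / 10000)
```

Sampling flips independently per answer (rather than corrupting a fixed subset of pairs) is one simple way to reach the desired noise level; the paper's exact corruption procedure may differ.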
Score Convergence.
Figure 4a compares the Pair Recall (PR) of the blocking phase of pBlocking (StBl, TF-IDF, MB) after every round of feedback with the recall of B (StBl, TF-IDF, MB). Both B and pBlocking start with a PR value close to 0.52, and pBlocking consistently improves with more feedback, reaching a substantially higher final PR. This demonstrates the ability of pBlocking's score-assignment strategy to achieve high PR values even with minimal feedback. Figure 4b compares the final F-score achieved by our method if the feedback loop is stopped after a few rounds. It shows that pBlocking achieves more than 0.8 F-score even when stopped after 10 rounds of feedback. This experiment validates that the convergence of block scoring leads to the convergence of the entire ER workflow.

The empirical analysis in the previous sections has demonstrated pBlocking's benefit on the final F-score and its ability to boost the effectiveness of blocking techniques across all data sets without compromising efficiency. The key takeaways from our analysis are summarized below.

• pBlocking improves Pair Recall irrespective of the technique used for block building, block cleaning or comparison cleaning (Figure 2), thus demonstrating its flexibility.

• Feedback-based scoring helps in particular to boost blocking efficiency and effectiveness for noisy datasets with many matching pairs (i.e., containing large clusters), such as cars, by enabling accurate selection of the cleanest blocks.

• The block intersection algorithm helps in particular with data sets with fewer matching pairs (i.e., with mainly small clusters), such as citations and songs, by providing a way to build small, focused blocks with a high fraction of matching pairs. Block intersection can also help in data sets like products and camera, but the benefit is not as high as in songs, because many records in such data sets have unique identifiers (e.g., product model IDs) and thus the initial blocks are reasonably clean.
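The block-intersection idea from the takeaways above can be sketched in a few lines: intersecting the blocks produced by two different blocking keys yields small, focused blocks whose records agree on both keys. The code below is an illustrative toy, not the paper's implementation.

```python
# Toy sketch of block intersection: blocks are sets of record ids, and
# intersecting blocks from two blocking functions keeps only records
# that agree on both blocking keys (illustrative, not the paper's code).

def intersect_blocks(blocks_a, blocks_b, min_size=2):
    out = []
    for a in blocks_a:
        for b in blocks_b:
            common = a & b
            if len(common) >= min_size:  # singleton blocks yield no pairs
                out.append(common)
    return out

# E.g., blocks by a title token vs. blocks by an author token.
by_title = [{1, 2, 3, 4}, {5, 6}]
by_author = [{2, 3, 9}, {5, 6, 7}]
print(intersect_blocks(by_title, by_author))  # [{2, 3}, {5, 6}]
```

The large block {1, 2, 3, 4} shrinks to the focused block {2, 3}, illustrating how intersection raises the fraction of matching pairs when clusters are small.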
Related Work

Blocking has been used to scale Entity Resolution (ER) for a very long time. However, the techniques in the literature have considered blocking as a preprocessing step and suffered from the trade-off between effectiveness and efficiency/scalability. We divide the related work into two parts: advanced blocking methods, which we improve upon, and progressive ER methods, which can be used to generate a limited amount of matching/non-matching pairs to send as feedback to our blocking computation.
Advanced blocking methods.
There are many blocking methods in the literature, with different internal functionalities and solving different blocking sub-tasks. In this paper, we considered four representative block building strategies, namely standard blocking [20], canopy clustering [18], sorted neighborhood [13] and q-grams blocking [11]. It is well known that such techniques can yield a fairly dense blocking graph when used alone. We refer the reader to [24] for an extensive survey of various blocking techniques and their shortcomings. Such block building strategies can be used as the method X in our Algorithm 1.

Recent works have proposed advanced methods that can be used in combination with the mentioned block building techniques by focusing on the comparison cleaning sub-task (thus improving efficiency). The first technique in this space is meta-blocking [22]. Meta-blocking aims to extract the most similar pairs of records by leveraging block-to-record relationships and can be very efficient in reducing the number of unnecessary pairs produced by traditional blocking techniques, but it is not always easy to configure. To this end, follow-up works such as Blast [29] use "loose" schema information to distinguish promising pairs, while [4] and SNB [23] rely on a sample of labeled pairs for learning accurate blocking functions and classification models, respectively. Finally, the most recent strategy, BLOSS [5], uses active learning to select such a sample and configure the meta-blocking. Traditional meta-blocking [22] and its follow-up techniques like BLOSS [5] prune low-similarity candidates from the blocking graph generated using the various block building strategies discussed above. Their performance is highly dependent on the effectiveness of the block building techniques and the quality of the blocking graph. In contrast, pBlocking constructs meaningful blocks that effectively capture the majority of the matching pairs, and scores each block based on its quality to generate fewer non-matching pairs in the blocking graph. Meta-blocking techniques compute the blocking graph statically, prior to ER, and thus can be used as the Z method in our Algorithm 1. In Figure 2 we compare with classic meta-blocking and BLOSS, as the latter has shown its superiority over Blast and SNB.
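For intuition, the comparison-cleaning step performed by meta-blocking can be sketched with the simple common-blocks weighting scheme: each candidate pair is weighted by how many blocks it co-occurs in, and pairs below the average weight are pruned. This is an illustrative simplification in the spirit of [22], not its full algorithm.

```python
# Sketch of comparison cleaning in the spirit of meta-blocking [22]:
# weight each candidate pair by the number of blocks it co-occurs in
# (common-blocks scheme) and prune pairs below the average weight.
from collections import Counter
from itertools import combinations

def meta_block(blocks):
    weights = Counter()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            weights[pair] += 1  # one unit of evidence per shared block
    avg = sum(weights.values()) / len(weights)
    return {p for p, w in weights.items() if w >= avg}

blocks = [{"r1", "r2", "r3"}, {"r1", "r2"}, {"r1", "r2"}, {"r3", "r4"}]
print(sorted(meta_block(blocks)))  # [('r1', 'r2')]
```

The pair (r1, r2) co-occurs in three blocks and survives, while pairs seen in a single block are pruned; this static, pre-ER filtering is what allows meta-blocking to serve as the Z method in Algorithm 1.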
Progressive ER.
Many applications need to resolve data sets efficiently but do not require the ER result to be complete. Recent literature has described methods to compute the best possible partial solution. Such techniques include pay-as-you-go ER [36], which uses "hints" on records that are likely to refer to the same entity, and more generally progressive ER, such as the schema-agnostic method in [30] and the strategies in [3, 26] that consider a limit on the execution time. In our discussion, we considered oracle-based techniques, namely Node [33], Edge [34], and Eager [9]. Differently from other progressive techniques, oracle-based methods consider a limit on the number of pairs that are examined by the oracle for a matching/non-matching response. Such techniques were originally designed for dealing with the crowd, but they can also be used with a variety of classifiers due to their flexibility. All these techniques naturally work in combination with pBlocking by sending their partial results as feedback.
Other ER methods.
In addition to the above methods, we mention works on ER architectures that can help users debug and tune parameters for the different components of ER [10, 6, 15, 25]. Specifically, the approaches in [10, 6] show how to leverage the crowd in this setting. All of these techniques are orthogonal to the scope of our work, and we do not consider them in our analysis. The previous work in [37] proposes to greedily merge records as they are matched by ER, while processing the blocks one at a time. Each merged record (containing tokens from the component records) is added to the unprocessed blocks, permitting its participation in subsequent matching and merging by their iterative algorithm. The limitations of processing blocks one at a time have been shown in more recent blocking works [22].
Conclusions
We have proposed a new blocking algorithm, pBlocking, that progressively updates the relative scores of blocks and constructs new blocks by leveraging a novel feedback mechanism from partial ER results. Most of the techniques in the literature perform blocking as a preprocessing step to prune out redundant non-matching record pairs. However, these techniques are sensitive to the distribution of cluster sizes and the amount of noise in the data set, and thus are either highly efficient with poor recall or have high recall with poor efficiency. pBlocking can boost the effectiveness and efficiency of blocking across all data sets by jump-starting blocking with any of the standard techniques and then using new, robust feedback-based methods for solving blocking sub-tasks in a data-driven way. To the best of our knowledge, pBlocking is the first framework where the blocking and pair matching components of ER help each other and produce high-quality results in synergy.
References

[1] .
[2] http://di2kg.dia.uniroma3.it/2019.
[3] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. PVLDB, 7(11):999–1010, 2014.
[4] M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage. In ICDM, 2006.
[5] G. dal Bianco, M. A. Gonçalves, and D. Duarte. BLOSS: Effective meta-blocking with almost no effort. Information Systems, 75, 2018.
[6] S. Das, P. S. GC, A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, and Y. Park. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD, 2017.
[7] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1), 2007.
[8] D. Firmani, B. Saha, and D. Srivastava. Online entity resolution using an oracle. PVLDB, 9(5), 2016.
[9] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava. Robust entity resolution using random graphs. In SIGMOD, 2018.
[10] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014.
[11] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
[12] A. Gruenheid, X. L. Dong, and D. Srivastava. Incremental record linkage. PVLDB, 7(9), 2014.
[13] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In ACM SIGMOD Record, volume 24, pages 127–138. ACM, 1995.
[14] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[15] P. Konda, S. Das, P. Suganthan GC, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, et al. Magellan: Toward building entity matching management systems. PVLDB, 9(12):1197–1208, 2016.
[16] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In , 2013.
[17] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. 1999.
[18] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178, 2000.
[19] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, 2018.
[20] G. Papadakis, G. Alexiou, G. Papastefanatos, and G. Koutrika. Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. PVLDB, 9(4):312–323, 2015.
[21] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering, 25(12):2665–2682, 2012.
[22] G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl. Meta-blocking: Taking entity resolution to the next level. TKDE, 26, 2014.
[23] G. Papadakis, G. Papastefanatos, and G. Koutrika. Supervised meta-blocking. PVLDB, 7, 2014.
[24] G. Papadakis, J. Svirsky, A. Gal, and T. Palpanas. Comparative analysis of approximate blocking techniques for entity resolution. PVLDB, 2016.
[25] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, and M. Koubarakis. The return of JedAI: End-to-end entity resolution for structured and semi-structured data. PVLDB, 11(12):1950–1953, 2018.
[26] T. Papenbrock, A. Heise, and F. Naumann. Progressive duplicate detection. TKDE, 27(5), 2015.
[27] M. Penrose. Random Geometric Graphs, volume 5. Oxford University Press, 2003.
[28] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Cambridge University Press, 2008.
[29] G. Simonini, S. Bergamaschi, and H. Jagadish. Blast: A loosely schema-aware meta-blocking approach for entity resolution. PVLDB, 9(12), 2016.
[30] G. Simonini, G. Papadakis, T. Palpanas, and S. Bergamaschi. Schema-agnostic progressive entity resolution. IEEE Transactions on Knowledge and Data Engineering, 31(6):1208–1221, 2018.
[31] V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. In ICDE, pages 219–230, 2015.
[32] V. Verroios, H. Garcia-Molina, and Y. Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In SIGMOD, 2017.
[33] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071–1082, 2014.
[34] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD, 2013.
[35] S. E. Whang and H. Garcia-Molina. Incremental entity resolution on rules and data. The VLDB Journal, 23(1), Feb. 2014.
[36] S. E. Whang, D. Marmaros, and H. Garcia-Molina. Pay-as-you-go entity resolution. TKDE, 25(5), 2013.
[37] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD, 2009.

Figure 5: Pair recall of B (·, Unif, ·) and pBlocking (·, Unif, ·) with varying BB and CC. (a–d) use MB and (e–h) use BLOSS. CaCl did not finish within 24 hrs on the songs and citations data sets.
A Additional Experiments
Blocking Effectiveness.
Figure 2 compares the Pair Recall of pBlocking and a traditional blocking pipeline B, both with block weights initialized with the TF-IDF weighting mechanism. Figure 5 performs the same comparison with the pipelines initialized using Unif weights. Since all blocks are assigned equal weight, we consider a block cleaning threshold of 100 along with the default value of M. pBlocking performs substantially better than B for different settings of block building techniques across various datasets. Compared to the TF-IDF weighting scheme, Unif performs slightly worse, but the difference is not substantial. The no-feedback pipeline B has varied performance across different data sets, with the best performance on products and the poorest performance on citations and songs.
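For intuition on the two initializations, the sketch below (illustrative names and data, not the paper's code) contrasts Unif, which assigns every block the same weight, with a TF-IDF-style scheme in which blocks built from rare blocking keys receive higher weight.

```python
# Illustrative sketch of the two block-weight initializations compared
# above: Unif gives every block the same weight, while a TF-IDF-style
# scheme down-weights blocks built from frequent (uninformative) keys.
import math

def unif_weights(blocks):
    return {key: 1.0 for key in blocks}

def tfidf_weights(blocks, n_records):
    # IDF-style score for a blocking key: rarer keys (smaller blocks)
    # are more discriminative and score higher.
    return {key: math.log(n_records / len(recs)) for key, recs in blocks.items()}

blocks = {"smith": {1, 2, 3, 4, 5, 6, 7, 8}, "zanzibar": {9, 10}}
print(unif_weights(blocks))
w = tfidf_weights(blocks, n_records=10)
print(w["zanzibar"] > w["smith"])  # True: the rare key scores higher
```

Under Unif all blocks start indistinguishable, which is why a fixed block cleaning threshold is used above; the feedback loop then has to learn the relative block quality that TF-IDF partially encodes up front.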