Crowdsourced Collective Entity Resolution with Relational Match Propagation
Jiacheng Huang†, Wei Hu†∗, Zhifeng Bao‡ and Yuzhong Qu†
† State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
‡ RMIT University, Melbourne, Australia
Abstract: Knowledge bases (KBs) store rich yet heterogeneous entities and facts. Entity resolution (ER) aims to identify entities in KBs which refer to the same real-world object. Recent studies have shown significant benefits of involving humans in the loop of ER. They often resolve entities with pairwise similarity measures over attribute values and resort to the crowds to label uncertain ones. However, existing methods still suffer from high labor costs and insufficient labeling to some extent. In this paper, we propose a novel approach called crowdsourced collective ER, which leverages the relationships between entities to infer matches jointly rather than independently. Specifically, it iteratively asks human workers to label picked entity pairs and propagates the labeling information to their neighbors in distance. During this process, we address the problems of candidate entity pruning, probabilistic propagation, optimal question selection and error-tolerant truth inference. Our experiments on real-world datasets demonstrate that, compared with state-of-the-art methods, our approach achieves superior accuracy with much less labeling.
I. INTRODUCTION
Knowledge bases (KBs) store rich yet heterogeneous entities and facts about the real world, where each fact is structured as a triple of the form (entity, property, value). Entity resolution (ER) aims at identifying entities referring to the same real-world object, which is critical in cleansing and integrating KBs. Existing approaches exploit diversified features of KBs, such as attribute values and entity relationships; see the surveys [1], [2], [3], [4]. Recent studies have demonstrated that crowdsourced ER, which recruits human workers to solve micro-tasks (e.g., judging whether a pair of entities is a match), can improve the overall accuracy.

Current crowdsourced ER approaches mainly leverage transitivity [5], [6], [7] or monotonicity [8], [9], [10], [11], [12] as their resolution basis. The transitivity-based approaches rely on the observation that the match relation is usually an equivalence relation. The monotonicity-based ones assume that each pair of entities can be represented by a similarity vector of attribute values, and that the binary classification function, which judges whether a similarity vector is a match, is monotonic in terms of the partial order among the similarity vectors.

However, both kinds of approaches can hardly infer matches across different types of entities. Consider Figure 1 as an example.
The figure shows a directed graph, called an entity resolution graph (ER graph), in which each vertex denotes a pair of entities and each edge denotes a relationship between two entity pairs. Assume that (y:Joan, d:Joan) is labeled as a match; then the birth place pair (y:NYC, d:NYC) is expected to be a match. Since these two pairs are in different equivalence classes, the transitivity-based approaches are unable to take effect. As different relationships (like y:directedBy and y:wasBornIn) make most similarity vectors of entities of different types incomparable, the monotonicity-based approaches have to handle them separately.

[Figure 1: ER graph with vertices (y:Joan, d:Joan), (y:John, d:John), (y:Joan, d:John), (y:John, d:Joan), (y:Player, d:Player), (y:Cradle, d:Cradle), (y:Cradle, d:Player), (y:Tim, d:Tim), (y:NYC, d:NYC), (y:Evanston, d:Evanston) and (y:Evanston, d:NYC), and edges labeled (y:directedBy, d:directedBy), (y:actedIn, d:actedIn), (y:wasBornIn, d:birthPlace) and (y:livesIn, d:residence).]
Fig. 1: An ER graph example between YAGO and DBpedia. Entities in YAGO are prefixed by "y:", and entities in DBpedia are prefixed by "d:". Joan, John and Tim are persons. Cradle and Player are movies. NYC and Evanston are cities.

∗ Corresponding author

In this paper, we propose a new approach called Remp (Relational match propagation) to address the above problems. The main idea is to leverage collective ER, which resolves entities connected by relationships jointly and distantly, based on a small number of labels provided by workers. Specifically, Remp iteratively asks workers to label a few entity pairs and propagates the labeling information to their neighboring entity pairs in distance, which are then resolved jointly rather than independently. Two challenges remain in achieving such crowdsourced collective ER.

The first challenge is how to conduct effective relational match propagation. Relationships like functional/inverse functional properties in OWL [13] (e.g., y:wasBornIn) provide strong evidence, but these properties only account for a small portion, while the majority of relationships are multi-valued (e.g., actedIn). Multi-valued relationships often connect non-matches to matches (e.g., (y:John, d:Joan) is connected to (y:Cradle, d:Cradle) in Figure 1). Therefore, we propose a new relational match propagation model to decide which neighbors can be safely inferred as matches.

The second challenge is how to select good questions to ask workers. For an ER graph involving two large KBs, the number of vertices (i.e., candidate questions) can be quadratic. We introduce an entity pair pruning algorithm to narrow the search space of questions. Moreover, different questions have different inference power. In order to maximize the expected number of inferred matches, we propose a question selection algorithm, which chooses possible entity matches scattered in different parts of the ER graph to achieve the largest number of inferred matches.

In summary, the main contributions of this paper are listed as follows:
• We design a partial order based entity pruning algorithm, which significantly reduces the size of an ER graph.
• We propose a relational match propagation model, which can jointly infer the matches between different types of entities in distance.
• We formulate the problem of optimal multiple questions selection with cost constraint, and design an efficient algorithm to obtain approximate solutions.
• We present an error-tolerant method to infer truths from imperfect human labeling. Moreover, we train a classifier to handle isolated entity pairs.
• We conduct real-world experiments and comparisons with state-of-the-art approaches to assess the performance of our approach. The experimental results show that our approach achieves superior accuracy with much fewer labeling tasks.
Paper organization. Section II reviews the literature. Section III defines the problem and sketches out the approach. In Sections IV–VII, we describe the approach in detail. Section VIII reports the experiments and results. Last, Section IX concludes this paper.

II. RELATED WORK
A. Crowdsourced ER
Inference models. Based on the transitive relation of entity matches, many approaches such as [5], [14] make use of prior match probabilities to decide the order of questions. Firmani et al. [7] proved that the optimal strategy is to ask questions in descending order of entity cluster size. They formulated the problem of crowdsourced ER with early termination and put forward several question ordering strategies. Although the transitive relation can infer matches within each cluster, workers need to check all clusters.

On the other hand, Arasu et al. [8] investigated the monotonicity property among the similarity vectors of entity pairs. Given two similarity thresholds $s_1, s_2$ with $s_1 \succeq s_2$, we have $\Pr[u_1 \simeq u_2 \mid s(u_1, u_2) \succeq s_1] \ge \Pr[u_1 \simeq u_2 \mid s(u_1, u_2) \succeq s_2]$. ALGPR [8] and ERLEARN [11] use the monotonicity property to search new thresholds and estimate the precision of results. In particular, the partial order based approaches [15], [16], [12] explore similarity thresholds among similarity vectors. Furthermore, POWER [16] groups similarity vectors to reduce the search space. Corleone [9] and Falcon [10] learn random forest classifiers, where each decision tree is equivalent to a similarity vector. However, these approaches are designed for ER with a single entity type. To leverage monotonicity on ER between KBs with complex type information, HIKE [12] uses hierarchical agglomerative clustering to partition entities with similar attributes and relationships, and uses the monotonicity techniques on each entity partition to find matches. Although our approach also uses monotonicity, it only uses it to prune candidate entity pairs. In addition, our approach allows match inference between different entity types (e.g., from persons to locations) to reduce the labeling efforts.

Question interfaces. Pairwise and multi-item are two common question interfaces. The pairwise interface asks workers to judge whether a pair of entities is a match [7], [17]. Differently, Marcus et al. [18] proposed a multi-item interface to save questions, where each question contains multiple entities to be grouped. Wang et al. [19] minimized the number of multi-item questions on the given entity pair set such that each question contains at most $k$ entities. Waldo [20] is a recent hybrid interface, which optimizes the trade-off between cost and accuracy of the two question interfaces based on task difficulty. The above approaches do not have inference power and may generate a large number of questions.

Quality control. To deal with errors produced by workers, quality control techniques [6], [20], [21] leverage the correlation between matches and workers to find inaccurate labels, and improve the accuracy by asking more questions about uncertain ones. These approaches gain improvement by redundant labeling.
B. Collective ER
In addition to attribute values, collective ER [22], [23], [24], [25] further takes the relationships between entities into account. CMD [26] extends probabilistic soft logic to learn rules for ontology matching. LMT [27] learns soft logic rules to resolve entities in a familial network. Because learning a probabilistic distribution on large KBs is time-consuming, PARIS [28] and SiGMa [29] implement message passing-style algorithms that obtain seed matches created by handcrafted rules and pass the match messages to their neighbors. However, they do not leverage crowdsourcing to improve the ER accuracy and may encounter the error accumulation problem.

III. APPROACH OVERVIEW
In this section, we present necessary preliminaries to define our problem, followed by a general workflow of our approach. Frequently used notations are summarized in Table I.

A. Preliminaries & Problem Definition

KBs store rich, structured real-world facts. In a KB, each fact is stated as a triple of the form (entity, property, value), where property can be either an attribute or a relationship, and value can be either a literal or another entity. The sets of entities, literals, attributes, relationships and triples are denoted by $U$, $L$, $A$, $R$ and $T$, respectively. Therefore, a KB is defined as a 5-tuple $K = (U, L, A, R, T)$. Moreover, attribute triples $T_{attr} \subseteq U \times A \times L$ attach literals to entities, e.g., (Leonardo da Vinci, birth date, "1452-4-15"), and relationship triples $T_{rel} \subseteq U \times R \times U$ link entities by relationships, e.g., (Leonardo da Vinci, works, Mona Lisa).

TABLE I: Frequently used notations

Notations           Descriptions
$K$, $u$            a KB and an entity
$r$, $a$            a relationship and an attribute
$N_u^r$, $N_u^a$    the value sets of $r$ and $a$ w.r.t. $u$
$p$, $q$            an entity pair and a question
$m_p$, $m_q$        the events that $p$ and $q$ are matches
$M$                 a set of entity or attribute matches
$C$                 a set of candidate questions
$Q$                 a set of asked questions
$H$                 a set of labels

[Figure 2: two input KBs pass through four stages (ER graph construction, relational match propagation, multiple questions selection, truth inference), producing an ER graph and then a probabilistic ER graph; selected entity pairs are sent to crowd workers as match questions, and prior matches are marked in the graph.]
Fig. 2: Workflow of the proposed approach
Entity Resolution (ER) aims to resolve entities in KBs denoting the same real-world thing. Let $u_1, u_2$ denote two entities in two different KBs. We call the entity pair $p = (u_1, u_2)$ a match, denoted by $u_1 \simeq u_2$ or $m_p$, if $u_1$ and $u_2$ refer to the same object. In contrast, we call $p = (u_1, u_2)$ a non-match, denoted by $u_1 \not\simeq u_2$, if $u_1$ and $u_2$ refer to two different objects. Both matches and non-matches are regarded as resolved entity pairs, and other pairs are regarded as unresolved. Traditionally, reference matches (i.e., a gold standard) are used to evaluate the quality of the ER results, and precision, recall and F1-score are widely used metrics.

Crowdsourced ER carries out ER with human help. Usually, it executes several human-machine loops. In each loop, the machine picks one or several questions, asks workers to label them, and updates the ER results in terms of the labels. Due to the monetary cost of human labor, a crowdsourced ER algorithm is expected to ask a limited number of questions while obtaining as many results as possible.
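As a small illustration of how ER results are scored against reference matches, the sketch below computes precision, recall and F1-score over sets of entity pairs. The function name `precision_recall_f1` and the toy pairs are assumptions made for this example only; they are not part of the approach or its evaluation.

```python
def precision_recall_f1(predicted, reference):
    """Score predicted matches against reference matches (gold standard).

    Both arguments are iterables of entity pairs (u1, u2).
    Empty inputs yield 0.0 instead of raising a division error.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: two of three predictions are correct,
# and two of four reference matches are found.
predicted = {("y:Joan", "d:Joan"), ("y:John", "d:Joan"), ("y:NYC", "d:NYC")}
reference = {("y:Joan", "d:Joan"), ("y:John", "d:John"),
             ("y:NYC", "d:NYC"), ("y:Tim", "d:Tim")}
p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Treating matches as a set of pairs keeps the metric independent of the order in which questions were asked or matches were inferred.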
Definition 1 (Crowdsourced Collective ER). Given two KBs $K_1$ and $K_2$, and a budget, the crowdsourced collective ER problem is to maximize recall with a precision restriction by asking humans to label tasks while not exceeding the budget.

Specifically, we assume that both KBs contain "dense" relationships and focus on using matches obtained from workers to jointly infer matches with relationships.
B. A Workflow of Our Approach
Given two KBs as input, Figure 2 shows the workflow of our approach to crowdsourced collective ER. After iterating four processing stages, the approach returns a set of matches between the two KBs.

1) ER graph construction aims to construct a small ER graph by reducing the number of vertices (i.e., entity pairs). It first conducts a similarity measurement to filter out some non-matches. At the same time, it uses some matches obtained from exact matching [12], [29], [30] to calculate the similarities between attributes and find attribute matches. Then, based on the attribute matches, it assembles the similarities between values into similarity vectors, and leverages the natural partial order on the vectors to prune more vertices.

2) Relational match propagation models how to use matches to infer the match probabilities of unresolved entity pairs in each connected component of the ER graph. It first uses some matches and maximum likelihood estimation to measure the consistency of relationships. Then, based on the consistency of relationships and the ER graph structure, it computes the conditional match probabilities of unresolved entity pairs given the matches. The conditional match probabilities yield a probabilistic ER graph.

3) Multiple questions selection selects a set of unresolved entity pairs in the probabilistic ER graph as questions to ask workers. It models the discovery of the inferred match set for each question as the all-pairs shortest path problem and uses a graph-based algorithm to solve it. We prove that the multiple questions selection problem is NP-hard and design a greedy algorithm to find the best questions to ask.

4) Truth inference infers matches based on the results labeled by workers. It first computes the posterior match probabilities of the questions based on the quality of the workers, and then leverages these posterior probabilities to update the (probabilistic) ER graph. Also, for isolated entity pairs, it builds a random forest classifier to avoid asking the workers to check them one by one.

The approach stops asking more questions when there is no unresolved entity pair that can be inferred by relational match propagation.

IV. ER GRAPH CONSTRUCTION
A. ER Graph
Graph structures [31], [32] are widely used to model the resolution states of entity pairs and the relationships between them. For example, Dong et al. [31] proposed the dependency graph to model the dependency between similarities of entity pairs. In this paper, we use the notion of ER graph to denote this graph structure. Different from the dependency graph, each edge in the ER graph is labeled with a pair of relationships from two KBs.
Definition 2 (ER Graph). Given two KBs $K_1 = (U_1, L_1, A_1, R_1, T_1)$ and $K_2 = (U_2, L_2, A_2, R_2, T_2)$, an ER graph on $K_1$ and $K_2$ is a directed, edge-labeled multigraph $G = (V, E, l_e)$, such that (1) $V \subseteq U_1 \times U_2$; (2) for each vertex pair $(u_1, u_2), (u_1', u_2') \in V$, $\big((u_1, u_2), (u_1', u_2')\big) \in E \wedge l_e\big((u_1, u_2), (u_1', u_2')\big) = (r_1, r_2)$ if and only if $(u_1, r_1, u_1') \in T_1 \wedge (u_2, r_2, u_2') \in T_2$.

Figure 1 illustrates an ER graph fragment built from DBpedia and YAGO. Note that an entity can occur in multiple vertices, and a relationship can appear in different edge labels. A probabilistic ER graph is an ER graph where each edge $\big((u_1, u_2), (u_1', u_2')\big)$ is labeled with a conditional probability $\Pr[u_1' \simeq u_2' \mid u_1 \simeq u_2]$. The major challenge of constructing an ER graph is how to significantly reduce the size of the graph while preserving as many potential entity matches as possible.

B. Candidate Entity Match Generation
We conduct string matching on entity labels (e.g., the values of rdfs:label) to generate candidate entity matches and regard them as vertices in the ER graph. Specifically, we first normalize entity labels via lowercasing, tokenization, stemming, etc. Then, we leverage the Jaccard coefficient (the size of the intersection divided by the size of the union of two sets) as our similarity measure to compute similarities on the normalized label token sets, and follow previous studies [5], [16], [19] to prune the entity pairs whose similarities are less than a predefined threshold (e.g., 0.3). Although the choice of thresholds is dataset dependent, this process runs fast and largely reduces the number of non-matches, thus helping the ER approaches scale up. Note that there are many choices of similarity metric, e.g., Jaccard, cosine, Dice and edit distance [33]; our approach can work with any of them, and we use Jaccard for illustration purposes only. The set of candidate entity matches is denoted by $M_c$. Similar to [12], we use the label similarities as prior match probabilities (i.e., $\Pr[m_p]$). More accurate estimation in [6], [7] can be achieved by human labeling.

C. Attribute Matching

In $M_c$, we refer to the subset of entity pairs that have exactly the same labels as initial entity matches. We leverage them as a priori knowledge for attribute and relationship matching (see Sections IV-C and V-A). Other features, e.g., owl:sameAs and inverse functional properties [13], may also be used to infer initial entity matches [34], [30]. Note that we do not directly add initial entity matches to the final ER results, because they may contain errors. The set of initial entity matches is denoted by $M_{in}$.

For such a set of initial entity matches $M_{in}$ between two KBs $K_1 = (U_1, L_1, A_1, R_1, T_1)$ and $K_2 = (U_2, L_2, A_2, R_2, T_2)$, we define the following attribute similarity to find their attribute matches. For any two attributes $a_1 \in A_1$ and $a_2 \in A_2$, their similarity $sim_A(a_1, a_2)$ is defined as the average similarity of their values:

\[ sim_A(a_1, a_2) = \frac{\sum_{(u_1, u_2) \in M_{in}} sim_L\big(N_{u_1}^{a_1}, N_{u_2}^{a_2}\big)}{\big|\{(u_1, u_2) \in M_{in} : N_{u_1}^{a_1} \cup N_{u_2}^{a_2} \neq \emptyset\}\big|}, \tag{1} \]

where $N_{u_1}^{a_1} = \{l : (u_1, a_1, l) \in T_1\}$ and $N_{u_2}^{a_2}$ is defined analogously. $sim_L$ represents an extended Jaccard similarity measure for two sets of literals, which employs an internal literal similarity measure and a threshold, and regards two literals as the same when their similarity is not lower than the threshold [35]. For different types of literals, we use the Jaccard coefficient for strings and the maximum percentage difference for numbers (e.g., integers, floats and dates). The threshold is set to 0.9 to guarantee high precision. We refer interested readers to [36] for more information about attribute matching.

For simplicity, every attribute in one KB is restricted to match at most one attribute in the other KB. This global 1:1 matching constraint is widely used in ontology matching [37], and facilitates our assembling of similarity vectors (later in Section IV-D). The 1:1 attribute matching selection is modeled as the bipartite graph matching problem and solved with the Hungarian algorithm [38] in $O\big((|A_1| + |A_2|)\,|A_1||A_2|\big)$ time. The set of attribute matches is denoted by $M_{at}$.

D. Partial Order Based Pruning
Given the candidate entity match set $M_c$ and the attribute match set $M_{at}$, for each candidate $(u_1, u_2) \in M_c$, we create a similarity vector $s(u_1, u_2) = (s_1, s_2, \ldots, s_{|M_{at}|})$, where $s_i$ is the literal similarity ($sim_L$) between $u_1$ and $u_2$ on the $i$-th attribute match ($1 \le i \le |M_{at}|$). As a consequence, a natural partial order exists among the similarity vectors: $s \succeq s'$ if and only if $s_i \ge s_i'$ for all $1 \le i \le |M_{at}|$. This partial order can be used to determine whether an entity pair is a (non-)match in two ways: (i) an entity pair $(u_1, u_2)$ is a match if there exists an entity pair $(u_1', u_2')$ such that $(u_1', u_2')$ is a match and $s(u_1, u_2) \succeq s(u_1', u_2')$; and (ii) $(u_1, u_2)$ is a non-match if there exists $(u_1', u_2')$ such that $(u_1', u_2')$ is a non-match and $s(u_1', u_2') \succeq s(u_1, u_2)$.

We incorporate this partial order into a $k$-nearest neighbor search for further pruning the candidate entity match set $M_c$. Let us assume that an entity $u_1$ in one KB has a set of candidate match counterparts $\{u_2^1, u_2^2, \ldots, u_2^J\}$ in another KB. The similarity vectors are written as $s(u_1, u_2^1), s(u_1, u_2^2), \ldots, s(u_1, u_2^J)$, and we want to determine the top-$k$ among them. Since the partial order is a weak ordering, we count the number of vectors strictly larger than each pair $(u_1, u_2^j)$ ($1 \le j \le J$) as its "rank", i.e., the minimal rank in all possible refined full orders. Note that the counterparts of both entities in an entity pair are considered.
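The rank counting described above can be sketched as follows. This is a minimal illustration, assuming similarity vectors are equal-length tuples of floats; the helper names (`dominates`, `strictly_dominates`, `one_sided_ranks`) and the candidates d:Jon and d:Jane with their vectors are hypothetical and only serve the example.

```python
def dominates(s, t):
    """Return True if s >= t component-wise (the partial order on vectors)."""
    return all(a >= b for a, b in zip(s, t))


def strictly_dominates(s, t):
    """Return True if s dominates t and the two vectors differ."""
    return dominates(s, t) and tuple(s) != tuple(t)


def one_sided_ranks(vectors):
    """Count, for each counterpart of a fixed entity, how many of the
    other counterparts have a strictly larger similarity vector.

    `vectors` maps a counterpart identifier to its similarity vector.
    """
    return {
        cand: sum(
            strictly_dominates(other_vec, vec)
            for other, other_vec in vectors.items()
            if other != cand
        )
        for cand, vec in vectors.items()
    }


# Hypothetical similarity vectors of y:Joan against four candidates,
# over two attribute matches.
vectors = {
    "d:Joan": (0.9, 0.8),
    "d:John": (0.6, 0.8),
    "d:Jon": (0.5, 0.9),
    "d:Jane": (0.4, 0.3),
}
ranks = one_sided_ranks(vectors)
print(ranks)
```

In this toy input, d:Joan and d:Jon are incomparable (each is larger on one attribute), so neither counts against the other and both receive the best possible rank, which is exactly why the count is only a lower bound on the rank in any refined full order.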
So, the worst rank of an entity pair $(u_1, u_2)$, denoted by $\mathrm{min\_rank}(u_1, u_2)$, is

\[ \begin{aligned} \mathrm{min\_rank}(u_1, u_2) &= \max_{i \in \{1, 2\}} \mathrm{min\_rank}_i(u_1, u_2), \\ \mathrm{min\_rank}_1(u_1, u_2) &= \big|\{u_2' : s(u_1, u_2') \succ s(u_1, u_2)\}\big|, \\ \mathrm{min\_rank}_2(u_1, u_2) &= \big|\{u_1' : s(u_1', u_2) \succ s(u_1, u_2)\}\big|, \end{aligned} \tag{2} \]

where all $(u_1, u_2), (u_1, u_2'), (u_1', u_2) \in M_c$.

Using $\mathrm{min\_rank}$, we design a modified $k$-nearest neighbor algorithm on this partial order (see Algorithm 1). Because the full order among candidate entity matches is unknown, instead of finding the top-$k$ matches directly, we prune the ones that cannot be in the top-$k$. Thus, each entity pair $(u_1, u_2) \in M_c$ such that $\mathrm{min\_rank}(u_1, u_2) \ge k$ needs to be pruned. Also, each pair smaller than a pruned pair should be removed based on the partial order to avoid redundant checking, because the $\mathrm{min\_rank}$ of these pairs must be at least $k$. The set of retained entity matches is denoted by $M_{rd}$, where each entity is involved in nearly $k$ candidate matches, due to the weak ordering of the partial order.

Algorithm 1: Partial order based pruning
Input: Candidate entity match set $M_c$, attribute match set $M_{at}$, threshold $k$
Output: Retained entity match set $M_{rd}$
 1  foreach $(u_1, u_2) \in M_c$ do pre-compute $s(u_1, u_2)$;
 2  $M_{rd} \leftarrow$ PruningInOneWay($M_c$, $U_1$, $k$);
 3  $M_{rd} \leftarrow$ PruningInOneWay($M_{rd}$, $U_2$, $k$);
 4  return $M_{rd}$;
 5  Function PruningInOneWay($M$, $U_i$, $k$)
 6      $D \leftarrow \emptyset$;
 7      foreach $u_i \in U_i$ do
 8          $B \leftarrow \{(u_1, u_2) \in M : u_1 = u_i \vee u_2 = u_i\}$;
 9          if $|B| \le k$ then continue;    /* no need to prune */
10          foreach $(u_1, u_2) \in B$ do
11              if $\mathrm{min\_rank}_i(u_1, u_2) \ge k$ then
                    /* $(u_1', u_2')$ cannot be pruned here */
12                  $B \leftarrow \{(u_1', u_2') \in B : s(u_1, u_2) \not\succeq s(u_1', u_2')\}$;
13          $D \leftarrow D \cup B$;
14      return $D$;

Algorithm 1 first partitions the entity match set $M$ into blocks $B$, where all pairs in a block contain the same entity (Line 8). Then, it checks each entity pair $(u_1, u_2) \in B$, and prunes entity pairs such that $\mathrm{min\_rank} \ge k$ (Lines 10–12). Finally, the retained pairs in $B$ are added into the output match set.

Algorithm 1 first takes $O(|M_c||M_{at}|)$ time to pre-compute the similarity vectors. When processing $U_i$ ($i = 1, 2$), the pruning step (Lines 7–13) checks at most $|M_c|$ pairs, and each time it spends $O(3|U_{3-i}||M_{at}|)$ time to compute $\mathrm{min\_rank}_i$, prune pairs in $B$ and store the retained pairs in $D$. So, the overall time complexity of Algorithm 1 is $O\big(|M_c||M_{at}|(|U_1| + |U_2|)\big)$. In practice, similarity vector construction is the most time-consuming part, while the pruning step only needs to check a small number of entities in $U_1$ or $U_2$.

V. RELATIONAL MATCH PROPAGATION
Given an ER graph $G = (V, E, l_e)$ and an entity match $u_1 \simeq u_2$ in it, relational match propagation infers how likely each unresolved entity pair $p \in V$ is a match based on the structure of $G$, i.e., $\Pr[m_p \mid u_1 \simeq u_2]$. In this section, we first consider a basic case in which unresolved entity pairs are neighbors of a match in $G$. Then, we generalize it to the case in which unresolved pairs are reachable from several matches. In the basic case, we resolve entity pairs between the two value sets of a relationship pair, and define the consistency between relationships to measure the portion of values that have matched counterparts in the other value set. The consistency and the prior match probabilities of entity pairs are further combined to obtain "tight" posterior match probabilities. In the general case, we propose a Markov model on paths from matches to unresolved pairs to find the match probability bounds.

A. Consistency Between Relationships
Functional/inverse functional properties are ideal for match propagation. For example, wasBornIn is a functional property, and the birth places of two persons in a match must be identical. However, we cannot rely on functional/inverse functional properties alone, since many relationships are multi-valued and only a part of the values may match. Thus, we define the consistency between relationships as follows.

Let $r_1$ and $r_2$ be two relationships in two KBs. We assume that, given the condition $u_1 \simeq u_2 \wedge u_1' \in N_{u_1}^{r_1}$, the probability of the event $\exists u_2' : \big(u_2' \in N_{u_2}^{r_2} \wedge u_1' \simeq u_2'\big)$ follows a Bernoulli (binary) distribution with parameter $\epsilon_1$. Symmetrically, we define parameter $\epsilon_2$. We use $\epsilon_1$ and $\epsilon_2$ to represent the consistency for the two relationships $r_1$ and $r_2$, respectively:

\[ \begin{aligned} \epsilon_1 &= \Pr\big[\exists u_2' : u_2' \in N_{u_2}^{r_2} \wedge u_1' \simeq u_2' \mid u_1 \simeq u_2,\ u_1' \in N_{u_1}^{r_1}\big], \\ \epsilon_2 &= \Pr\big[\exists u_1' : u_1' \in N_{u_1}^{r_1} \wedge u_1' \simeq u_2' \mid u_1 \simeq u_2,\ u_2' \in N_{u_2}^{r_2}\big], \end{aligned} \tag{3} \]

where $N_{u_1}^{r_1}, N_{u_2}^{r_2}$ are the value sets of relationships $r_1, r_2$ w.r.t. entities $u_1, u_2$, respectively.

To estimate $\epsilon_1$ and $\epsilon_2$, we use the value distribution on the initial entity matches $M_{in}$. For an entity pair $(u_1, u_2) \in M_{in}$, we introduce a latent random variable $L_{u_1,u_2}^{r_1,r_2} = |M_{u_1,u_2}^{r_1,r_2}|$, where $M_{u_1,u_2}^{r_1,r_2}$ denotes the set of entity matches in $N_{u_1}^{r_1} \times N_{u_2}^{r_2}$. Note that we omit $r_1, r_2$ in $L_{u_1,u_2}^{r_1,r_2}$ and $M_{u_1,u_2}^{r_1,r_2}$ to simplify notations. Similar to [39], we make an assumption on the entity sets: no duplicate entities exist in each entity set. Hence, $L_{u_1,u_2}$ is also the number of entities in $N_{u_1}^{r_1}$ (or $N_{u_2}^{r_2}$) which appear in $M_{u_1,u_2}$. Based on the latent variable $L_{u_1,u_2}$, the likelihood of $(N_{u_1}^{r_1}, N_{u_2}^{r_2}, L_{u_1,u_2})$ is

\[ \Pr\big[N_{u_1}^{r_1}, N_{u_2}^{r_2}, L_{u_1,u_2}\big] = \prod_{i=1,2} \binom{|N_{u_i}^{r_i}|}{L_{u_1,u_2}} \Big(\frac{\epsilon_i}{1 - \epsilon_i}\Big)^{L_{u_1,u_2}} (1 - \epsilon_i)^{|N_{u_i}^{r_i}|}. \tag{4} \]

Then, we use maximum likelihood estimation to obtain $\epsilon_1$ and $\epsilon_2$:

\[ \max_{\epsilon_1, \epsilon_2, L_{\cdot,\cdot}} \prod_{(u_1, u_2) \in M_{in}} \Pr\big[N_{u_1}^{r_1}, N_{u_2}^{r_2}, L_{u_1,u_2}\big]. \tag{5} \]

Since each $L_{u_1,u_2}$ is an integer variable, brute-force optimization can cost exponential time. Next, we present an optimization process. Let $\zeta = \frac{\epsilon_1 \epsilon_2}{(1 - \epsilon_1)(1 - \epsilon_2)}$ and $\xi(\epsilon_1, \epsilon_2) = (1 - \epsilon_1)^{b_1} (1 - \epsilon_2)^{b_2}$, where $b_1 = \sum_{(u_1,u_2) \in M_{in}} |N_{u_1}^{r_1}|$ and $b_2 = \sum_{(u_1,u_2) \in M_{in}} |N_{u_2}^{r_2}|$. We simplify (5) to $\max_{\epsilon_1, \epsilon_2} \xi(\epsilon_1, \epsilon_2) \prod_{(u_1,u_2) \in M_{in}} \max_{L_{u_1,u_2}} c_{L_{u_1,u_2}} \zeta^{L_{u_1,u_2}}$, where $c_{L_{u_1,u_2}} = \binom{|N_{u_1}^{r_1}|}{L_{u_1,u_2}} \binom{|N_{u_2}^{r_2}|}{L_{u_1,u_2}}$. Notice that $c_i \zeta^i = c_j \zeta^j$ has only one solution for different integers $i, j$. Thus, the curves $c_{L_{u_1,u_2}} \zeta^{L_{u_1,u_2}}$ ($0 \le L_{u_1,u_2} \le L_M$) can have at most $\binom{L_M + 1}{2}$ common points, where $L_M = \min\{|N_{u_1}^{r_1}|, |N_{u_2}^{r_2}|\}$. Therefore, $\max_{L_{u_1,u_2}} c_{L_{u_1,u_2}} \zeta^{L_{u_1,u_2}}$ is an $O(L_M)$-piecewise continuous function, and the product of these $O(L_M)$-piecewise continuous functions is an $O(\max\{|N_{u_1}^{r_1}|, |N_{u_2}^{r_2}|\})$-piecewise continuous function. As a result, we can optimize (5) by solving $O(\max\{|N_{u_1}^{r_1}|, |N_{u_2}^{r_2}|\})$ continuous optimization problems with two variables, which runs efficiently.

B. Match Propagation to Neighbors
A basic case is that the unresolved entity pairs are adjacent to a match $u_1 \simeq u_2$ in $G$. We consider the neighbors with the same edge label, i.e., relationship pair $(r_1, r_2)$, together. Then, our goal is to identify matches between $N_{u_1}^{r_1}$ and $N_{u_2}^{r_2}$.

Let $M_{u_1,u_2} \subseteq N_{u_1}^{r_1} \times N_{u_2}^{r_2}$ denote a set of entity matches. We consider two factors in how likely $M_{u_1,u_2}$ is to be the correct match result of $N_{u_1}^{r_1} \times N_{u_2}^{r_2}$: (1) the prior match probabilities of matches without neighborhood information; (2) the consistency of the relationships. The match probability of $M_{u_1,u_2}$ given $u_1 \simeq u_2$ is:

\[ \Pr[M_{u_1,u_2} \mid u_1 \simeq u_2] = \frac{1}{Z}\, f\big(M_{u_1,u_2} \mid N_{u_1}^{r_1}, N_{u_2}^{r_2}\big) \times g\big(M_{u_1,u_2} \mid N_{u_1}^{r_1}\big)\, g\big(M_{u_1,u_2} \mid N_{u_2}^{r_2}\big), \tag{6} \]

where $Z$ is the normalization factor, $f(M_{u_1,u_2} \mid N_{u_1}^{r_1}, N_{u_2}^{r_2})$ is the prior match probability, and $g(M_{u_1,u_2} \mid N_{u_1}^{r_1})$, $g(M_{u_1,u_2} \mid N_{u_2}^{r_2})$ are the consistency of $M_{u_1,u_2}$ w.r.t. $N_{u_1}^{r_1}$, $N_{u_2}^{r_2}$, respectively.

Without considering neighborhood information, the prior match probability $f(M_{u_1,u_2} \mid N_{u_1}^{r_1}, N_{u_2}^{r_2})$ is defined as the likelihood function of $M_{u_1,u_2}$:

\[ f\big(M_{u_1,u_2} \mid N_{u_1}^{r_1}, N_{u_2}^{r_2}\big) = \prod_{p \in M_{u_1,u_2}} \Pr[m_p] \times \prod_{p \in N_{u_1}^{r_1} \times N_{u_2}^{r_2} \setminus M_{u_1,u_2}} \big(1 - \Pr[m_p]\big), \tag{7} \]

where $\Pr[m_p]$ denotes the prior probability of entity pair $p$ being a match, and $1 - \Pr[m_p]$ denotes the prior probability of $p$ being a non-match.

Let $\pi_1(M_{u_1,u_2}) = \{u_1' \mid (u_1', u_2') \in M_{u_1,u_2}\}$. Note that when $u_1$ and $u_2$ form a match, each entity $u_1' \in \pi_1(M_{u_1,u_2})$ is a neighbor of $u_1$ for relationship $r_1$ such that $\exists u_2' : u_2' \in N_{u_2}^{r_2} \wedge u_1' \simeq u_2'$. Based on $\epsilon_1$, the consistency of $M_{u_1,u_2}$ given $N_{u_1}^{r_1}$ is defined as follows:

\[ g\big(M_{u_1,u_2} \mid N_{u_1}^{r_1}\big) = \epsilon_1^{|\pi_1(M_{u_1,u_2})|} (1 - \epsilon_1)^{|N_{u_1}^{r_1}| - |\pi_1(M_{u_1,u_2})|}. \tag{8} \]

$\pi_2(M_{u_1,u_2})$ and $g(M_{u_1,u_2} \mid N_{u_2}^{r_2})$ can be defined similarly.

Finally, we obtain the posterior match probability of $u_1' \simeq u_2'$ by marginalizing $\Pr[u_1' \simeq u_2', M_{u_1,u_2} \mid u_1 \simeq u_2]$:

\[ \Pr[u_1' \simeq u_2' \mid u_1 \simeq u_2] = \sum_{M_{u_1,u_2}} \Pr[u_1' \simeq u_2', M_{u_1,u_2} \mid u_1 \simeq u_2] = \sum_{M_{u_1,u_2} :\ (u_1', u_2') \in M_{u_1,u_2}} \Pr[M_{u_1,u_2} \mid u_1 \simeq u_2], \tag{9} \]

where $M_{u_1,u_2}$ ranges over the subsets of $(N_{u_1}^{r_1} \times N_{u_2}^{r_2}) \cap V$.

Example.
Let $(u_1, u_2)$ = (y:Tim, d:Tim), $r_1$ and $r_2$ denote the relationship directed, $\epsilon_1 = \epsilon_2 = 0.$ , and $\Pr[m_p] \equiv 0.$ (implying all pairs are viewed as the same). From Figure 1, we can find that $N_{u_1}^{r_1}$ = {y:Cradle, y:Player} and $N_{u_2}^{r_2}$ = {d:Cradle, d:Player}. Thus, when $M_{u_1,u_2}$ = {(y:Cradle, d:Cradle), (y:Player, d:Player)}, $\Pr[M_{u_1,u_2} \mid u_1 \simeq u_2] = 0. \times 0. \approx 0.$ ; when $M'_{u_1,u_2}$ = {(y:Cradle, d:Player)}, $\Pr[M'_{u_1,u_2} \mid u_1 \simeq u_2] = 0. \times 0. \times 0. \approx 0.$ . So, $M_{u_1,u_2}$ is more likely to be the match set within $N_{u_1}^{r_1} \times N_{u_2}^{r_2}$. Furthermore, $\Pr[\text{y:Cradle} \simeq \text{d:Cradle}] \approx 0.$ , whereas $\Pr[\text{y:Cradle} \simeq \text{d:Player}] \approx 0.$ .

C. Distant Match Propagation
The above match propagation to neighbors only estimates the match probabilities of direct neighbors of an entity match, which lacks the capability of discovering entity matches far away. In the following, we extend it to a more general case, called distant match propagation, where a match reaches an unresolved entity pair through a path.

Intuitively, given a match $(u_1, u_2)$ and an unresolved pair $(u'_1, u'_2)$, the distant propagation process can be modeled as a path consisting of the entity pairs from $(u_1, u_2)$ to $(u'_1, u'_2)$, where each unresolved pair can be inferred as a match via its predecessor. Assume that there is a path $(u^0_1, u^0_2), (u^1_1, u^1_2), \ldots, (u^l_1, u^l_2)$ in $G$, where $(u^0_1, u^0_2) = (u_1, u_2)$ and $(u^l_1, u^l_2) = (u'_1, u'_2)$. According to the chain rule of conditional probability, we have

$$\Pr[u^l_1 \simeq u^l_2 \mid u^0_1 \simeq u^0_2] \geq \Pr[u^1_1 \simeq u^1_2, \ldots, u^l_1 \simeq u^l_2 \mid u^0_1 \simeq u^0_2] = \prod_{i=1}^{l} \Pr[u^i_1 \simeq u^i_2 \mid u^0_1 \simeq u^0_2, \ldots, u^{i-1}_1 \simeq u^{i-1}_2] = \prod_{i=1}^{l} \Pr[u^i_1 \simeq u^i_2 \mid u^{i-1}_1 \simeq u^{i-1}_2], \tag{10}$$

where the last equality holds because we assume that this propagation path satisfies the Markov property [22]. Inequality (10) gives a lower bound for $\Pr[u^l_1 \simeq u^l_2 \mid u^0_1 \simeq u^0_2]$. The largest lower bound is selected to estimate $\Pr[u'_1 \simeq u'_2 \mid u_1 \simeq u_2]$. We estimate $\Pr[u'_1 \simeq u'_2 \mid u_1 \simeq u_2]$ in Algorithm 2.

VI. MULTIPLE QUESTIONS SELECTION
Based on the relational match propagation, unresolved entity pairs can be inferred by human-labeled matches. However, different questions have different inference capabilities. In this section, we first describe the definition of inferred match set and the multiple questions selection problem. Then, we design a graph-based algorithm to determine the inferred match set for each question. Finally, we formulate the benefit of multiple questions and design a greedy algorithm to select the best questions.
A. Question Benefits
We follow the so-called pairwise question interface [5], [6], [7], [12], [14], [17], where each question is whether an entity pair is a match or not. Let $Q$ be a set of pairwise questions. Labeling $Q$ can be defined as a binary function $H : Q \to \{0, 1\}$, where for each question $q \in Q$, $H(q) = 1$ means that $q$ is labeled as a match, while $H(q) = 0$ indicates that $q$ is labeled as a non-match.

Given the labels $H$, we propagate the labeled matches in $H$ to unresolved pairs. The set of entity pairs that can be inferred as matches by $H$ is

$$\mathit{inferred}(H) = \bigcup_{q \in Q :\, H(q) = 1} \mathit{inferred}(q), \tag{11}$$

$$\mathit{inferred}(q) = \{p \in C : \Pr[m_p \mid m_q] \geq \tau\}, \tag{12}$$

where $C$ is the set of unresolved entity pairs and $\tau$ is the precision threshold for inferring high-quality matches. We evaluate $\mathit{inferred}(q)$ in Section VI-B.

Since non-matches are quadratically more than matches in the ER problem [1], the labels to the ideal questions should infer as many matches as possible. Thus, we define the benefit function of $Q$ as the expected number of matches that can be inferred by labels to $Q$, which is

$$\mathit{benefit}(Q) = E\big[\,|\mathit{inferred}(H)|\, \big|\, Q\big]. \tag{13}$$

The ER algorithm can ask each question with the greatest benefit iteratively; however, there is a latency caused by waiting for workers to finish the question. Assigning multiple questions to workers simultaneously in one human-machine loop is a straightforward optimization to reduce the latency. Since workers in crowdsourcing platforms are paid based on the number of solved questions, the number of questions should not exceed a given budget. Thus, the optimal multiple questions selection problem is to

$$\text{maximize } \mathit{benefit}(Q), \quad \text{s.t. } Q \subseteq C,\ |Q| \leq \mu, \tag{14}$$

where $\mu$ is the constraint on the number of questions asked.

B. Discovery of Inferred Match Set
In order to obtain the benefit for each question set $Q$, we need to compute $\mathit{inferred}(q)$ for each $q \in Q$. To estimate $\Pr[m_p \mid m_q]$ in $\mathit{inferred}(q)$, we define the length of a directed edge $(v, v')$ in the probabilistic ER graph $F$ as $\mathit{length}(v, v') = -\log f(v, v') = -\log \Pr[m_{v'} \mid m_v]$. According to the definition of $\Pr[m_p \mid m_q]$, $\Pr[m_p \mid m_q] = e^{-\mathit{dist}(q, p)}$, where $\mathit{dist}(q, p)$ is the length of the shortest path from $q$ to $p$. As a result, the condition $\Pr[m_p \mid m_q] \geq \tau$ can be interpreted as $\mathit{dist}(q, p) \leq \zeta = -\log \tau$. Note that edge $(v, v')$ can be removed when $\Pr[m_{v'} \mid m_v] = 0$ to avoid $\log 0$.

All-pairs shortest path algorithms can efficiently compute $\mathit{inferred}(q)$ for every $q$. Since most $|\mathit{inferred}(q)|$ should be smaller than $|C|$, we choose to apply binary trees rather than an array of size $|C|$ to maintain distances. We depict our modified Floyd-Warshall algorithm in Algorithm 2. In Lines 1–2, for every $q$, we create a binary tree $bt(q)$ to store the inferred pairs as well as their corresponding lengths, and a binary tree $bt^{-1}(q)$ to store pairs inferring $q$ as well as their corresponding lengths. In Lines 3–5, the edges whose length is not greater than $\zeta$ are stored into the binary trees. In Lines 6–11, we modify the dynamic programming process in the original Floyd-Warshall algorithm. Since the number of pairs which can be inferred is significantly less than $|C|$, the inner loop in Lines 9–11 iterates only over the set of distances which are likely to be updated. Lines 13–14 extract the inferred match sets from the binary trees.

Since each binary tree contains at most $|C|$ elements, $|R| \leq |bt(q).val| \leq |C|$. The loop in Lines 6–11 takes $O(|C|^3)$ time in total. The time complexity of Algorithm 2 is $O(|C|^3)$.

C. Multiple Questions Selection
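The distance-based reading of $\mathit{inferred}(q)$ from Section VI-B can be illustrated with a short sketch. Note that it swaps in a bounded Dijkstra search from a single question instead of the paper's modified Floyd-Warshall procedure, and that the adjacency layout `edges`, the function name `inferred_set`, and the toy probabilities are assumptions made only for this illustration.

```python
import heapq
import math


def inferred_set(edges, q, tau):
    """Return {p: estimated Pr[m_p | m_q]} for all p reachable with Pr >= tau.

    edges: dict mapping a pair v to {v2: Pr[m_v2 | m_v]} (hypothetical layout).
    The estimate is the best product of edge probabilities along a path,
    computed as a shortest path over edge lengths -log(probability).
    """
    zeta = -math.log(tau)  # distance threshold, zeta = -log(tau)
    dist = {q: 0.0}        # q trivially satisfies Pr[m_q | m_q] = 1
    heap = [(0.0, q)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, math.inf):
            continue  # stale heap entry
        for v2, prob in edges.get(v, {}).items():
            if prob <= 0.0:
                continue  # zero-probability edges are dropped to avoid log 0
            nd = d - math.log(prob)
            # Expand only within the threshold (small slack for rounding).
            if nd <= zeta + 1e-12 and nd < dist.get(v2, math.inf):
                dist[v2] = nd
                heapq.heappush(heap, (nd, v2))
    return {p: math.exp(-d) for p, d in dist.items()}


# Hypothetical toy graph: b is reached through a, not through the weak direct edge.
toy_edges = {"q": {"a": 0.9, "b": 0.5}, "a": {"b": 0.95, "c": 0.9}}
toy_inferred = inferred_set(toy_edges, "q", tau=0.8)
```

Running one bounded search per candidate question yields the same sets as an all-pairs computation; which variant is faster depends on how sparse the probabilistic ER graph is.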
Algorithm 2: DP-based inferred match set discovery
Input: Probabilistic ER graph F, candidate question set C, distance threshold ζ
Output: Set B of inferred match sets for all questions
 1  foreach q ∈ C do
 2      Initialize two empty binary trees bt(q), bt^{-1}(q);
 3  foreach (q, p) ∈ C × C do
 4      if length(q, p) ≤ ζ then
 5          bt(q)[p] ← length(q, p); bt^{-1}(p)[q] ← length(p, q);
 6  foreach q ∈ C do
 7      foreach p ∈ bt(q).val do
 8          R ← {r ∈ bt^{-1}(q).val : bt(q)[p] + bt^{-1}(q)[r] ≤ ζ};
 9          foreach r ∈ R do
10              d ← bt(q)[p] + bt^{-1}(q)[r];
11              bt(q)[r] ← d; bt^{-1}(r)[p] ← d;
12  B ← ∅;
13  foreach q ∈ C do
14      inferred(q) ← bt(q).val; B ← B ∪ {inferred(q)};
15  return B;

Since the match propagation works independently for each label, the event that an entity pair $p$ is inferred as a match by labels $H$ is equivalent to the event that $p$ is inferred by $q \in Q$ such that $H(q) = 1$. When $H$ is not labeled, $p$ is resolved as a match if and only if at least one question that can resolve $p$ as a match is labeled as a match. Given the question set $Q$, the probability that $p$ can be resolved as a match by labels is

$$\Pr[p \in \mathit{inferred}(H) \mid Q] = 1 - \prod_{q \in Q :\, p \in \mathit{inferred}(q)} (1 - \Pr[m_q]), \tag{15}$$

where $\mathit{inferred}(H)$ is defined in Eq. (11), representing the matches that can be inferred after $Q$ is labeled by workers. The benefit of question set $Q$ is formulated as the expected size of the inferred matches by labels $H$:

$$\mathit{benefit}(Q) = E\big[\,|\mathit{inferred}(H)|\, \big|\, Q\big] = \sum_{p \in C} \Pr[p \in \mathit{inferred}(H) \mid Q]. \tag{16}$$

Now, we want to select a set of questions that can maximize the benefit. We first prove the hardness of the multiple questions selection problem. Then, we describe a greedy algorithm to solve it.

Theorem 1.
The problem of optimal multiple questions selection is NP-hard.

Proof. The optimization version of the set cover problem is NP-hard. Given an element set $U = \{1, 2, \ldots, n\}$ and a collection $S$ of sets whose union equals $U$, the set cover problem aims to find the minimum number of sets in $S$ whose union also equals $U$. This problem can be reduced to our multiple questions selection problem in polynomial time. Assume that the vertex set of an ER graph is $\{p_1, p_2, \ldots, p_n\} \cup \{p_s \mid s \in S\}$, the edge set is $\{(p_s, p_k) : k = 1, 2, \ldots, n \wedge s \in S \wedge k \in s\}$, all the prior match probabilities are 1, the precision threshold is 1, $\Pr[p_k \mid p_s] = 1$ and $\Pr[p_s \mid p_k] = 0$, for all $k = 1, 2, \ldots, n$, $s \in S$ satisfying $k \in s$. Because the benefit is equal to the number of covered elements in $U$, the optimal solution of the multiple questions selection problem is also that of the set cover problem. Thus, the multiple questions selection problem is NP-hard.

Theorem 2. $\mathit{benefit}(Q)$ is an increasing submodular function.

Proof. Let $b_p(Q)$ represent $\Pr[p \in \mathit{inferred}(H) \mid Q]$. For every $p \in C$ and two disjoint subsets $Q, Q' \subseteq C$, we have $b_p(Q \cup Q') = b_p(Q) + b_p(Q') - b_p(Q)\, b_p(Q')$. Thus, $b_p(Q \cup Q') - b_p(Q) = b_p(Q') \big(1 - b_p(Q)\big) \geq 0$. Since $\mathit{benefit}(Q) = \sum_{p \in C} b_p(Q)$, it is an increasing function.

Also, for every $p \in C$, $Q \subseteq C$ and $q_1, q_2 \in C \setminus Q$ such that $q_1 \neq q_2$, we have

$$b_p(Q \cup \{q_1, q_2\}) + b_p(Q) - b_p(Q \cup \{q_1\}) - b_p(Q \cup \{q_2\}) = b_p(\{q_2\}) \big(b_p(Q) - b_p(Q \cup \{q_1\})\big) \leq 0.$$

Thus, $b_p(Q \cup \{q_1\}) + b_p(Q \cup \{q_2\}) \geq b_p(Q \cup \{q_1, q_2\}) + b_p(Q)$. Since $\mathit{benefit}(Q) = \sum_{p \in C} b_p(Q)$, it is a submodular function. Together, we prove that $\mathit{benefit}(\cdot)$ is an increasing submodular function.

Since Eq. (16) is monotonic and submodular, the multiple questions selection problem can be solved by using submodular optimization. We design Algorithm 3, which gives a $(1 - 1/e)$-approximation guarantee. This algorithm selects questions greedily with the highest gain in benefits (i.e. $\Delta_q$). We also leverage the lazy evaluation of the submodular function to improve the efficiency [40]. Specifically, we maintain a priority queue $PQ$ over the candidate questions $q$, ordered by the gain in benefits $\Delta_q$ in descending order. Based on the submodular property, when the recomputed gain $\Delta_q$ of the picked question $q$ is no smaller than that of the top question $q'$ in $PQ$, $q$ is the question with the largest gain in benefits. We use an array to store $b_p(Q)$, such that $\Delta_q$ can be obtained in $O(|C|)$ time. The overall time complexity of Algorithm 3 is $O(\mu |C|)$, where $\mu$ is the number of questions asked in each loop and $C$ is the set of unresolved entity pairs in the ER graph.

Algorithm 3: Greedy multiple questions selection
Input: Probabilistic ER graph F, candidate question set C, precision threshold τ, question number µ
Output: Selected question set Q
    Q ← ∅; PQ ← {(q, benefit({q})) | q ∈ C};
    while |Q| < µ do
        q, ∆q ← PQ.pop(); q′, ∆q′ ← PQ.top();
        while ∆q > 0 do
            ∆q ← benefit(Q ∪ {q}) − benefit(Q);
            if ∆q ≥ ∆q′ then
                Q ← Q ∪ {q}; break;
            else
                PQ.push((q, ∆q));
                q, ∆q ← PQ.pop(); q′, ∆q′ ← PQ.top();
        if ∆q ≤ 0 then break;
    return Q;

VII. TRUTH INFERENCE
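Before turning to truth inference, the benefit of Eq. (16) and the lazy greedy selection described for Algorithm 3 in Section VI-C can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it recomputes the benefit from scratch instead of keeping the array of $b_p(Q)$ values, and the names `benefit`, `greedy_select`, `inferred` and `prior`, as well as the toy numbers, are hypothetical.

```python
import heapq


def benefit(questions, inferred, prior):
    """Eq. (16): expected number of pairs inferred as matches by labels to Q.

    inferred: dict q -> set of pairs inferable from q (cf. Eq. (12)).
    prior: dict q -> prior match probability Pr[m_q].
    """
    miss = {}  # miss[p] = product of (1 - Pr[m_q]) over questions q covering p
    for q in questions:
        for p in inferred[q]:
            miss[p] = miss.get(p, 1.0) * (1.0 - prior[q])
    # Eq. (15) per pair, summed over all covered pairs.
    return sum(1.0 - m for m in miss.values())


def greedy_select(candidates, inferred, prior, mu):
    """Lazy greedy selection of at most mu questions."""
    selected, current = [], 0.0
    # heapq is a min-heap, so gains are stored negated; stale gains are
    # upper bounds on the true gains because benefit is submodular.
    heap = [(-benefit([q], inferred, prior), q) for q in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < mu:
        _, q = heapq.heappop(heap)
        gain = benefit(selected + [q], inferred, prior) - current
        if not heap or gain >= -heap[0][0]:
            if gain <= 0.0:
                break  # no remaining question can increase the benefit
            selected.append(q)
            current += gain
        else:
            heapq.heappush(heap, (-gain, q))  # re-queue with the fresh gain
    return selected


# Hypothetical toy instance: q2 overlaps heavily with q1, q3 covers a new pair.
toy_inferred = {"q1": {"a", "b", "c"}, "q2": {"b", "c"}, "q3": {"d"}}
toy_prior = {"q1": 0.9, "q2": 0.8, "q3": 0.5}
toy_choice = greedy_select(["q1", "q2", "q3"], toy_inferred, toy_prior, mu=2)
```

In the toy instance the second pick is `q3` rather than `q2`, because most of what `q2` could infer is already likely to be inferred through `q1`; this diminishing return is exactly what submodularity captures.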
After the questions are labeled by workers, we design an error-tolerant model to infer truths (i.e. matches and non-matches) from the imperfect labeling, which facilitates updating the (probabilistic) ER graph and resolving isolated entities.
A. Error-Tolerant Inference
As the labels completed by the workers on crowdsourcing platforms may contain errors, we assign one question to multiple workers and use their labels to infer the posterior match probabilities. We leverage the worker probability model [41], which uses a single real number to denote a worker $w$'s quality $\lambda_w \in (0, 1)$, i.e. the probability that $w$ can correctly label a question. Since crowdsourcing platforms such as Amazon MTurk offer a qualification test for their workers, we reuse a worker's precision in this test as her quality. The posterior probability of question $q$ being a match is

$$\Pr[m_q \mid W_T, W_F] = \frac{\Pr[m_q] \Pr[W_T, W_F \mid m_q]}{\Pr[m_q] \Pr[W_T, W_F \mid m_q] + \Pr[\overline{m_q}] \Pr[W_T, W_F \mid \overline{m_q}]} = \frac{\Pr[m_q]}{\Pr[m_q] + \Pr[\overline{m_q}] \prod_{w \in W_T} \frac{1 - \lambda_w}{\lambda_w} \prod_{w \in W_F} \frac{\lambda_w}{1 - \lambda_w}}, \tag{17}$$

where $W_T$ denotes the set of workers labeling $q$ as a match, $W_F$ denotes the set of workers labeling $q$ as a non-match, and $\Pr[\overline{m_q}] = 1 - \Pr[m_q]$ is the prior probability of $q$ being a non-match.

We assign two thresholds to filter matches and non-matches based on consistent labels. Entity pairs with a high posterior probability are regarded as matches, while pairs with a low posterior probability are regarded as non-matches. Others are considered inconsistent and remain unresolved. One possible reason for the inconsistency is that these questions are too hard. For a hard question $q$, we set $\Pr[m_q]$ to $\Pr[m_q \mid W_T, W_F]$ to reduce its benefit, so that it is less likely to be asked again. Next, we infer matches based on the consistent labels and re-estimate the probability of each edge in $F$ using the new matches and non-matches.

B. Inference for Isolated Entity Pairs
As an exception, there may exist a small number of isolated entity pairs which do not occur in any relationship triples. In this case, the match propagation cannot infer their truths, and the question selection algorithm has to ask these pairs one by one. To avoid such inefficient polling, we reuse the similarity vectors and the partial order relations obtained in Section IV to train a classifier for these isolated pairs.

Given an isolated entity pair $p$, let $A_p$ denote the set of its attribute matches. We define the set $N_p$ of retained matches with similar attributes to $p$ by $N_p = \{p' \in M_{rd} : \mathit{Jaccard}(A_p, A_{p'}) \geq \psi\}$, where $\mathit{Jaccard}$ calculates the similarity between two sets of attribute matches and $\psi$ is a threshold that we fix for high precision. Since we only allow matches to propagate in the ER graph, most obtained labels are matches. Therefore, we treat all unresolved pairs in $N_p$ as non-matches to balance the proportions of different labels. Next, we use $N_p$ and the labels as training data, and scikit-learn (https://scikit-learn.org) to train a random forest classifier with default parameters to predict whether $p$ is a match. The random forest finds the unresolved pairs in $N_p$ whose similarity vectors are close to known matches.

TABLE II: Statistics of the datasets
TABLE III: F1-score and number of questions with real workers (columns: Remp, HIKE, POWER, Corleone)

VIII. EXPERIMENTS AND RESULTS
In this section, we conduct a thorough evaluation on the effectiveness of our approach Remp, by comparing with state-of-the-art methods followed by an in-depth investigation on each part of Remp (as outlined in Section III-B).
Datasets.
We use one benchmark dataset and three real-world datasets widely used in previous work [12], [16], [28], [29]. Table II lists their statistics.
• IIMB is a small, synthetic benchmark dataset in OAEI containing two KBs with identical attributes and relationships.
• DBLP-ACM (abbr. D-A) is a dataset about publications and authors. The original version uses a text field to store all authors of a publication. Here, we split it and create authorship triples. In the case that an author has multiple representations on the original dataset, we follow [42] to extend the gold standard with author matches.
• IMDB-YAGO (abbr. I-Y) is a large dataset about movies and actors. Following [29], we generate the gold standard based on “external links” in Wikipedia pages.
• DBpedia-YAGO (abbr. D-Y) is a large dataset with heterogeneous attributes and relationships. We use the same version as in [12], [28].
Competitors.
We compare Remp with three state-of-the-art crowdsourced ER approaches, namely, HIKE [12], POWER [16] and Corleone [9]. We have introduced them in Section II. Since POWER and Corleone are designed for tabular data, we follow HIKE to partition entities into different clusters and deploy POWER and Corleone on each entity cluster. Specifically, IIMB, D-A and I-Y have clear type information, which is directly used to partition entities; for D-Y which does not have clear type information, we reuse the partitioning algorithm presented in HIKE.
Setup.
We implement Remp and all competing methods (as their code is not available) in Python 3 and C++, and strictly follow each competitor's reported parameters in the respective paper. All our code is open sourced. All experiments are conducted on a workstation with an Intel Xeon 3.3GHz CPU and 128GB RAM. For Remp, we use the same setting on all datasets: k = 4 and µ = 10, together with a fixed precision threshold τ and a fixed label similarity threshold. Similar to [9], [12], [16], we first prune out all definite non-matches (outlined in Section IV), and all methods take the same retained entity matches $M_{rd}$ as input.

http://islab.di.unimi.it/content/im oaei/2019/
https://dbs.uni-leipzig.de/en/research/projects/object matching
https://github.com/nju-websoft/Remp

A. Remp vs. State of the Art
We set up two experiments: one with real workers and the other with simulated workers. The evaluation metrics are the F1-score and the number of questions.
Experiment with real workers.
We publish the questions selected by each approach on Amazon MTurk. Each question is labeled by five workers to decide whether the two entities refer to the same object in the real world. We leverage the common worker qualifications to avoid spammers, i.e. we only allow workers whose approval rate reaches a minimum threshold. Furthermore, we reuse the label to each question for all approaches. Thus, all approaches receive the same label to the same question. In total, 651 real workers labeled 3,484 questions.

The results are presented in Table III, and we have the following findings: (1) Remp consistently achieves the best F1-score with the fewest questions. (2) Remp improves the F1-score moderately, and reduces the number of questions significantly. (3) Specifically, compared with the second best result, Remp reduces the average number of questions by 85.7%, 14.3%, 54.2% and 74.0% on IIMB, D-A, I-Y and D-Y, respectively. To summarize, Remp achieves the best F1-score while asking the fewest questions, especially when the dataset contains various relationships (e.g., D-Y).
Experiment with simulated workers.
We also generate simulated workers who give wrong labels to questions with a fixed probability (called error rate). We follow HIKE to set the error rate of simulated workers to 0.05, 0.15 and 0.25. Figure 3 shows the comparison results and we make several observations: (1) All approaches obtain stable F1-scores, indicating their robustness in handling imperfect labeling. (2) Remp consistently obtains the highest F1-score, and beats the second best result by 0.4%, 3.0%, 1.6%, 8.0% on IIMB, D-A, I-Y and D-Y, respectively. This is attributed to Remp's robustness in uncovering matches with low literal similarities as compared to its competitors. For example, literal information is insufficient on I-Y and D-Y, thereby causing errors in the partial order of HIKE and POWER as well as the rules of Corleone. (3) Remp needs considerably fewer questions on IIMB, I-Y and D-Y. Compared with the second best result, Remp reduces the average number of questions by 79.9%, 26.7%, 62.5% and 71.4% on IIMB, D-A, I-Y and D-Y, respectively. One reason is that there are many types of entities on these datasets, and most matches are linked by relationships of different domain/range types. However, HIKE, POWER and Corleone cannot infer these matches efficiently. (4) On the D-A dataset, Remp saves only six questions compared with POWER, because in the ER graph there are many isolated components but only one type of relationship, so that Remp has to check them all.

Fig. 3: F1-score and number of questions w.r.t. simulated workers of varying error rates (panels: IIMB, D-A, I-Y, D-Y; methods: Remp, POWER, HIKE, Corleone; x-axis: error rate 0.05, 0.15, 0.25; y-axes: F1 (%) and number of questions)

TABLE IV: Effectiveness of attribute matching
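To make the simulated-worker setting concrete, the following sketch draws labels from workers with a fixed error rate and feeds them into the posterior of Eq. (17). It is only an illustration: the function names `simulate_labels` and `posterior_match`, the use of five workers per question, the uniform prior, and the choice of setting every worker's quality to one minus the error rate are assumptions, not details confirmed by the experiments above.

```python
import random


def posterior_match(prior, yes_quality, no_quality):
    """Eq. (17): posterior probability that a question is a match.

    yes_quality / no_quality: qualities (lambda_w, strictly between 0 and 1)
    of the workers who labeled the question as a match / as a non-match.
    """
    ratio = 1.0
    for lam in yes_quality:
        ratio *= (1.0 - lam) / lam
    for lam in no_quality:
        ratio *= lam / (1.0 - lam)
    return prior / (prior + (1.0 - prior) * ratio)


def simulate_labels(truth, error_rate, n_workers, rng):
    """Each simulated worker flips the true label with probability error_rate."""
    return [truth if rng.random() >= error_rate else not truth
            for _ in range(n_workers)]


# Hypothetical run: five workers with error rate 0.15 label a true match.
rng = random.Random(0)
labels = simulate_labels(True, 0.15, 5, rng)
quality = 1.0 - 0.15  # assumed worker quality
n_yes = sum(labels)
post = posterior_match(0.5, [quality] * n_yes, [quality] * (len(labels) - n_yes))
```

With equal worker qualities the posterior depends only on the prior and on the difference between the numbers of match and non-match votes, so a single dissenting worker among five barely moves the result.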
B. Internal Evaluation of Remp
In this section, we evaluate how each major module of Remp contributes to its overall performance.
Effectiveness of attribute matching.
For the I-Y dataset, we reuse the gold standard created by SiGMa [29]. For the D-Y dataset, we follow the recommendation of YAGO and extract 19 attribute matches from the subPropertyOf links as the gold standard. Note that it is not necessary to match attributes for the other two datasets. We employ the conventional precision, recall and F1-score as our evaluation metrics.

As depicted in Table IV, Remp performs perfectly on the I-Y dataset but gains a relatively low recall on the D-Y dataset, and the 1:1 matching constraint helps Remp improve the precision. We observe that Remp fails to identify several attribute matches when the attribute pairs rarely appear in $M_{in}$ (i.e. entity pairs from exact string matching), or when the values are dramatically different (e.g., the icd value for dbp:Trigeminal neuralgia is “G44.847”, but for yago:Trigeminal neuralgia is “G-50.0”). We argue that our attribute matches are sufficient for ER, since the first type of missing matches only helps resolve a small portion of entities but increases the running time of building similarity vectors, while the second type requires extra value processing/correction steps before computing the similarities.

http://webdam.inria.fr/paris/yd relations.zip

Effectiveness of partial order based pruning.
To test the performance of the entity pair pruning module in Remp, we employ two metrics: (i) the reduction ratio (RR), which is the proportion of pruned candidates, and (ii) the pair completeness (PC), which is the proportion of true matches preserved in candidate/retained matches. We also use the error rate of the optimal monotone classifier defined in [15] to measure the incorrectness of the partial order.

As shown in Table V, candidate matches contain most true matches on IIMB, D-A and I-Y, but a smaller share of the true matches on the D-Y dataset. This is because on this dataset 8.4% of the entities in the true matches lack labels. On IIMB and D-A, Remp has a relatively low RR, which reflects the proportion of true matches among the candidate matches on these two datasets. On I-Y and D-Y, the PC of retained matches is close to that of candidate matches, but most candidate matches are pruned. This indicates that the entity pair pruning module is effective. We notice that the error rate on each dataset is very low, but the other monotonicity-based approaches (i.e., POWER and HIKE) achieve worse accuracy (see Table III). The main reason is that our partial order is restricted to neighbors of each entity pair, where errors do not propagate to the whole candidate match set.

Furthermore, the pair completeness of retained matches w.r.t. varying k is shown in Figure 4. The pair completeness converges quickly on IIMB, D-A and I-Y but slowly on D-Y, because many matches have only one or two shared attributes, making the partial order work inefficiently.

TABLE V: Effectiveness of partial order based pruning (k = 4; columns: candidate matches, retained matches)

Fig. 4: Pair completeness w.r.t. k-nearest neighbors (x-axis: k-nearest neighbors; y-axis: PC (%); curves: IIMB, D-A, I-Y, D-Y)

Effectiveness of match propagation.
We additionally compare the match propagation module of Remp with two collective, non-crowdsourcing ER approaches: PARIS [28] and SiGMa [29]. More details about them have been given in Section II. To assess the real propagation capability of Remp, we ignore the classifier for handling isolated entity pairs. We randomly sample different portions of entity matches as the seeds for Remp, PARIS and SiGMa. The experiments are repeated five times and the F1-score is reported in Table VI. We observe that Remp achieves the best F1-score on D-A, I-Y and D-Y. On IIMB (20% of matches), the F1-score of Remp is slightly worse than that of SiGMa, because SiGMa can obtain matches between isolated entities based on their literal similarities directly. Overall, Remp can achieve the highest F1-score in most cases.

Fig. 5: F1-score of Remp, MaxInf and MaxPr w.r.t. varying numbers of questions (panels: IIMB, D-A, I-Y, D-Y; y-axis: F1 (%))

TABLE VI: F1-score w.r.t. varying portions of seed matches
% of matches   20   40   60   80
IIMB   Remp    97.5%
       PARIS   96.0%   96.5%   97.0%   97.4%
       SiGMa
       PARIS   71.3%   79.1%   86.2%   92.5%
       SiGMa   92.7%   94.9%   96.7%   98.4%
I-Y    Remp
       PARIS   34.8%   57.9%   75.4%   89.0%
       SiGMa   34.0%   58.5%   76.1%   89.3%
D-Y    Remp
       PARIS   82.2%   84.7%   87.2%   89.5%
       SiGMa   33.6%   57.4%   75.3%   89.1%
Effectiveness of question selection benefit.
We implement two alternative heuristics as baselines, namely MaxInf and MaxPr, to evaluate the question selection benefit. We set µ = 1 and use ground truths as labels. MaxInf selects the questions with the maximal inference power. MaxPr chooses the questions with the maximal match probability. Figure 5 depicts the result and each curve starts when the F1-score is greater than 0. We find that (1) Remp always achieves the best F1-score with far fewer questions. (2) MaxPr obtains the lowest F1-score except on the IIMB dataset, because it does not consider how many matches can be inferred by the new question. (3) MaxInf performs worse than Remp, as it often chooses non-matches as the questions, making it find fewer matches than Remp using the same number of questions. This experiment demonstrates that our benefit function is the most effective one.

Effectiveness of multiple questions selection.
Table VII depicts the F1-score and the number of questions for different question number thresholds per round (µ = 1, 5, 10, 20), and our findings are as follows: (1) Remp achieves a stable F1-score on all datasets. (2) The number of questions increases when µ increases, especially when µ = 10, 20. This is probably because Remp always asks µ questions in one human-machine loop, and it has to ask an extra batch of questions when some questions with large benefit are labeled as non-matches. Although asking multiple questions in one loop increases the monetary cost, it reduces the number of loops by 75%–94.1% when µ = 20.

TABLE VII: F1-score and number of questions with different question number thresholds per round (columns: µ = 1, µ = 5, µ = 10, µ = 20)

TABLE VIII: F1-score of inference on isolated entity pairs
        Isolated matches   Remp     Random forest
IIMB    0.3%               95.3%    0.0%
D-A     0.4%               97.7%    13.7%
I-Y     28.1%              70.9%    66.3%
D-Y     60.4%              87.2%    84.5%

Effectiveness of inference on isolated entity pairs.
We examine the performance of the random forest classifier on each dataset in the experiments with real workers. As depicted in Table VIII, the classifier achieves poor performance on IIMB and D-A. Given the tiny proportion of isolated entity pairs in these two datasets, this is probably due to chance. When the portions of isolated matches increase on I-Y and D-Y, the classifier achieves comparable performance to Remp. This demonstrates that Remp can infer enough matches for resolving the entire dataset even if the ER graph does not cover all candidate matches.
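For reference, the training data that feeds this classifier (Section VII-B) can be assembled roughly as in the sketch below. It covers only the Jaccard-based selection of $N_p$ and the labeling rule that treats unresolved pairs as non-matches; the classifier itself is omitted, and the function names, the data layout, and the threshold value used in the usage lines are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of attribute matches."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_retained(p, attr_matches, retained, psi):
    """N_p: retained matches whose attribute matches are similar to those of p."""
    return [r for r in retained
            if jaccard(attr_matches[p], attr_matches[r]) >= psi]


def training_rows(n_p, labels):
    """Pair each element of N_p with a label; unresolved pairs count as 0.

    labels: dict mapping resolved pairs to 1 (match) or 0 (non-match).
    """
    return [(r, labels.get(r, 0)) for r in n_p]


# Hypothetical toy input; psi = 0.5 is an illustrative value only.
toy_attrs = {"p": {"name", "year"}, "r1": {"name", "year"},
             "r2": {"name"}, "r3": {"title"}}
toy_np = similar_retained("p", toy_attrs, ["r1", "r2", "r3"], psi=0.5)
toy_rows = training_rows(toy_np, {"r1": 1})
```

The similarity vectors of the selected pairs, together with these labels, would then serve as the feature matrix and target for the random forest.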
Efficiency Analysis.
We run each algorithm three times to record the running time on each of the four datasets. The average running time of Algorithm 1 on the four datasets is 1s, 8s, 3.9h and 3.6h, the average running time of Algorithm 2 is 0.476s, 6.7s, 109s and 1.07h, and the average running time of Algorithm 3 is 0.128s, 1.27s, 78.5s and 1.25h. We follow the analysis in [10] to evaluate the performance of Remp on 25%, 50%, 75% and 100% of candidate (retained) entity matches $M_c$ ($M_{rd}$) on the D-Y dataset. As depicted in Figure 6, the running times of Algorithm 1 and Algorithm 2 increase linearly as the number of entity pairs increases. The running times of Algorithm 3 on 25% and 50% of retained entity matches are close. This is probably because the sizes of some inferred match sets do not increase significantly.
Fig. 6: Running time w.r.t. different portion of entity pairs (left panel: Algorithm 1 over candidate entity matches (%); right panel: Algorithm 2 and Algorithm 3 over retained entity matches (%); y-axis: time (h))

IX. CONCLUSION

In this paper, we proposed a crowdsourced approach leveraging relationships to resolve entities in KBs collectively. Our main contributions are a partial order based pruning algorithm, a relational match propagation model, a constrained multiple questions selection algorithm and an error-tolerant truth inference model. Compared with existing work, our experimental results demonstrated superior ER accuracy with far fewer questions. In future work, we plan to combine transitive relation, partial order and match propagation together as a hybrid ER approach.

ACKNOWLEDGMENTS
This work was partially supported by the National Key R&D Program of China under Grant 2018YFB1004300, the National Natural Science Foundation of China under Grants 61872172, 61772264 and 91646204, and the ARC under Grants DP200102611 and DP180102050. Zhifeng Bao is the recipient of Google Faculty Award.

REFERENCES
[1] L. Getoor and A. Machanavajjhala, “Entity resolution: Tutorial,” http://users.umiacs.umd.edu/~getoor/Tutorials/ER VLDB2012.pdf, 2012.
[2] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: A survey,” IEEE TKDE, vol. 19, no. 1, pp. 1–16, 2007.
[3] K. Sun, Y. Zhu, and J. Song, “Progress and challenges on entity alignment of geographic knowledge bases,” International Journal of Geo-Information, vol. 8, no. 2, pp. 77–101, 2019.
[4] J. Bleiholder and F. Naumann, “Data fusion,” ACM Computing Surveys, vol. 41, no. 1, pp. 1–41, 2009.
[5] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng, “Leveraging transitive relations for crowdsourced joins,” in SIGMOD. ACM, 2013, pp. 229–240.
[6] S. E. Whang, P. Lofgren, and H. Garcia-Molina, “Question selection for crowd entity resolution,” Proc. of the VLDB Endowment, vol. 6, no. 6, pp. 349–360, 2013.
[7] D. Firmani, B. Saha, and D. Srivastava, “Online entity resolution using an oracle,” Proc. of the VLDB Endowment, vol. 9, no. 5, pp. 384–395, 2016.
[8] A. Arasu, M. Götz, and R. Kaushik, “On active learning of record matching packages,” in SIGMOD. ACM, 2010, pp. 783–794.
[9] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu, “Corleone: Hands-off crowdsourcing for entity matching,” in SIGMOD. ACM, 2014, pp. 601–612.
[10] S. Das, P. S. GC, A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, and Y. Park, “Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services,” in SIGMOD. ACM, 2017, pp. 1431–1446.
[11] K. Qian, L. Popa, and P. Sen, “Active learning for large-scale entity resolution,” in CIKM. ACM, 2017, pp. 1379–1388.
[12] Y. Zhuang, G. Li, Z. Zhong, and J. Feng, “Hike: A hybrid human-machine method for entity alignment in large-scale knowledge bases,” in CIKM. ACM, 2017, pp. 1917–1926.
[13] P. F. Patel-Schneider, P. Hayes, and I. Horrocks, OWL Web ontology language semantics and abstract syntax, W3C, 2004.
[14] N. Vesdapunt, K. Bellare, and N. Dalvi, “Crowdsourcing algorithms for entity resolution,” Proc. of the VLDB Endowment, vol. 7, no. 12, pp. 1071–1082, 2014.
[15] Y. Tao, “Entity matching with active monotone classification,” in PODS. ACM, 2018, pp. 49–62.
[16] C. Chai, G. Li, J. Li, D. Deng, and J. Feng, “A partial-order-based framework for cost-effective crowdsourced entity resolution,” The VLDB Journal, vol. 27, no. 6, pp. 745–770, 2018.
[17] V. Verroios and H. Garcia-Molina, “Entity resolution with crowd errors,” in ICDE. IEEE, 2015, pp. 219–230.
[18] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller, “Human-powered sorts and joins,” Proc. of the VLDB Endowment, vol. 5, no. 1, pp. 13–24, 2011.
[19] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, “CrowdER: Crowdsourcing entity resolution,” Proc. of the VLDB Endowment, vol. 5, no. 11, pp. 1483–1494, 2012.
[20] V. Verroios, H. Garcia-Molina, and Y. Papakonstantinou, “Waldo: An adaptive human interface for crowd entity resolution,” in SIGMOD. ACM, 2017, pp. 1133–1148.
[21] S. Galhotra, D. Firmani, B. Saha, and D. Srivastava, “Robust entity resolution using random graphs,” in SIGMOD. ACM, 2018, pp. 3–18.
[22] V. Rastogi, N. Dalvi, and M. Garofalakis, “Large-scale collective entity matching,” Proc. of the VLDB Endowment, vol. 4, no. 4, pp. 208–218, 2011.
[23] C. Böhm, G. De Melo, F. Naumann, and G. Weikum, “Linda: distributed web-of-data-scale entity matching,” in CIKM. ACM, 2012, pp. 2104–2108.
[24] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra, “Progressive approach to relational entity resolution,” Proc. of the VLDB Endowment, vol. 7, no. 11, pp. 999–1010, 2014.
[25] V. Efthymiou, G. Papadakis, K. Stefanidis, and V. Christophides, “Minoaner: Schema-agnostic, non-iterative, massively parallel resolution of web entities,” in EDBT, 2019, pp. 373–384.
[26] A. Kimmig, A. Memory, R. J. Miller, and L. Getoor, “A collective, probabilistic approach to schema mapping,” in ICDE. IEEE, 2017, pp. 921–932.
[27] P. Kouki, J. Pujara, C. Marcum, L. Koehly, and L. Getoor, “Collective entity resolution in familial networks,” in ICDM. IEEE, 2017, pp. 227–236.
[28] F. M. Suchanek, S. Abiteboul, and P. Senellart, “PARIS: Probabilistic alignment of relations, instances, and schema,” Proc. of the VLDB Endowment, vol. 5, no. 3, pp. 157–168, 2011.
[29] S. Lacoste-Julien, K. Palla, A. Davies, G. Kasneci, T. Graepel, and Z. Ghahramani, “SiGMa: Simple greedy matching for aligning large knowledge bases,” in KDD. ACM, 2013, pp. 572–580.
[30] Y. Zhuang, G. Li, Z. Zhong, and J. Feng, “PBA: Partition and blocking based alignment for large knowledge bases,” in DASFAA. Springer, 2016, pp. 415–431.
[31] X. Dong, A. Halevy, and J. Madhavan, “Reference reconciliation in complex information spaces,” in SIGMOD. ACM, 2005, pp. 85–96.
[32] T. Papenbrock, A. Heise, and F. Naumann, “Progressive duplicate detection,” IEEE TKDE, vol. 27, no. 5, pp. 1316–1329, 2015.
[33] J. Sun, Z. Shang, G. Li, Z. Bao, and D. Deng, “Balance-aware distributed string similarity-based query processing system,” Proc. of the VLDB Endowment, vol. 12, no. 9, pp. 961–974, 2019.
[34] W. Hu and C. Jia, “A bootstrapping approach to entity linkage on the semantic web,” Journal of Web Semantics, vol. 34, pp. 1–12, 2015.
[35] F. Naumann and M. Herschel, An introduction to duplicate detection. Morgan and Claypool Publishers, 2010.
[36] M. Cheatham and P. Hitzler, “The properties of property alignment,” in ISWC Workshop on Ontology Matching. CEUR-WS, 2014.
[37] I. Megdiche, O. Teste, and C. Trojahn, “An extensible linear approach for holistic ontology matching,” in
ISWC , vol. LNCS 9981. Springer,2016, pp. 393–410.[38] H. W. Kuhn, “The hungarian method for the assignment problem,”
Navalresearch logistics quarterly , vol. 2, no. 1-2, pp. 83–97, 1955.[39] D. Zhang, B. I. P. Rubinstein, and J. Gemmell, “Principled graphmatching algorithms for integrating multiple data sources,”
IEEE TKDE ,vol. 27, no. 10, pp. 2784–2796, 2015.[40] B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondr´ak, andA. Krause, “Lazier than lazy greedy,” in
AAAI , 2015, pp. 1812–1818.[41] Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng, “Truth inference incrowdsourcing: Is the problem solved?”
Proc. of the VLDB Endowment ,vol. 10, no. 5, pp. 541–552, 2017.[42] A. Thor and E. Rahm, “MOMA - A mapping-based object matchingsystem,” in