A framework for constructing a huge name disambiguation dataset: algorithms, visualization and human collaboration
Zhuoyue Xiao ([email protected]), University, Haidian Qu, Beijing Shi, China
Yutao Zhang* ([email protected]), University, Haidian Qu, Beijing Shi, China
Bo Chen* ([email protected]), University of China, Haidian Qu, Beijing Shi, China
Xiaozhao Liu ([email protected]), University
Jie Tang ([email protected]), University, Haidian Qu, Beijing Shi, China
[Figure 1: An overview of the proposed annotation framework: candidate sets feed relation graph construction and sub-clustering, whose results enter the annotation workflow (cleaning, verifying, adding and merging, each aggregated by voting); assignments such as "W Wang 1"–"W Wang 4" are updated iteratively until the final result.]
ABSTRACT
We present a manually-labeled Author Name Disambiguation (AND) dataset called WhoisWho, which consists of 399,255 documents and 45,187 distinct authors spanning 421 ambiguous author names. To label such a large amount of AND data with high accuracy, we propose a novel annotation framework in which humans and computers collaborate efficiently and precisely. Within the framework, we also propose an inductive disambiguation model that classifies whether two documents belong to the same author. We evaluate the proposed method and other state-of-the-art disambiguation methods on WhoisWho. The experimental results show that: (1) our model outperforms other disambiguation algorithms on this challenging benchmark; (2) the AND problem remains largely unsolved and requires more in-depth research. We believe that such a large-scale benchmark will bring great value to the author name disambiguation task. We also conduct several experiments showing that our annotation framework helps annotators produce accurate results efficiently and effectively eliminates wrong labels made by human annotators.

* Both authors contributed equally to this research.
CCS CONCEPTS
• Information systems → Entity resolution.

KEYWORDS
Name Disambiguation, Graph Neural Network, Dataset
INTRODUCTION

The popularity of information systems has brought explosive growth of academic digital records. The latest estimates show that there are more than 271 million publications, 133 million scholars and 754 million citations on AMiner [17], and even more in the Google Scholar database. Among these digital records, almost all documents suffer from author name ambiguity: authors of bibliographic documents are very likely to share identical names with others, so scholars' names are not reliable enough to determine their identities. This poses a great challenge to digital bibliographic libraries such as Google Scholar, DBLP, AMiner and Microsoft Academic. The author name disambiguation (AND) task is proposed to solve this problem: for a given publication, distinguish the author who wrote it from the other authors who share the identical name. Despite the great amount of effort devoted and the rapid growth of data-driven methods on graph structures [20][8][13][9], the AND problem remains largely unsolved.

[Figure 2: An ambiguous case in the AND task: documents 1 and 2 belong to the same author yet share no similar attributes; both are similar to document 3 through common coauthors.]
One of the main reasons is that existing name disambiguation benchmarks are limited in scale and complexity compared to the great number of academic documents available on the Web. Many machine learning models, such as deep learning models, heavily depend on large-scale, high-quality data. Hence, a large, diverse and accurate benchmark is indispensable for building a sophisticated and robust name disambiguation model.

However, building such a large-scale benchmark is extremely challenging. Annotators need to assign a great number of publications into different clusters. The number of clusters is uncertain, and the annotators have to attend to all the pairwise relationships between the publications. For example, if there are 3 thousand publications to label (a median data size in our dataset), the number of pairwise relationships is nearly 4.5 million, which is far beyond what humans can handle. Besides, in order to judge whether two publications belong to the same person, it is necessary to consider the relationships between the two publications and other papers. Figure 2 gives an example of this kind of ambiguous case. Although documents 1 and 2 belong to the same author, their attributes are completely unrelated. It is almost impossible for either humans or computers to form a correct judgement on their relationship without the involvement of document 3. Lastly, the workload of labeling each set of data is heavy and collaboration is hard: for a given name, each annotator has to complete a long and challenging process alone, and it is difficult to merge annotation results from different annotators. This inevitably leads to poor accuracy and efficiency.

Thus, we propose a supervised inductive name disambiguation method to help annotators disambiguate correctly and efficiently. In this method, both supervised and unsupervised methods can serve as similarity models. Based on the constructed similarity graphs, we propose an end-to-end graph neural network model to predict whether two documents belong to the same author, and apply a community detection algorithm to the generated result. We then use visualization techniques to provide display and operation interfaces for annotators. Lastly, our well-designed annotation workflow splits the annotation process into several parts, which greatly simplifies the annotators' tasks and allows annotation results from different annotators to be aggregated directly.

We have used this annotation framework to label a great amount of data sampled from AMiner's database, and organized the labeled data into a disambiguation dataset. Our benchmark has significant advantages in scale, complexity and accuracy over the existing ones. The dataset is now available online.

RELATED WORK

Typically, the author name disambiguation task can be divided into two categories: classification and clustering. The classification task [7][6] aims to predict whether two documents refer to the same author, while the clustering task [16][24][25][12] is to cluster documents belonging to the same author together. Considerable work has been done on these two disambiguation tasks [21][11]. Han [6] defines several similarity functions to evaluate document similarity based on TF-IDF and NTF, and applies the k-way spectral clustering method on the constructed similarity graphs. GHOST [3] uses coauthor relationships to build a similarity graph and applies a graph partitioning algorithm to generate the results.
Tang [16] integrates both document features and graph structural features with a unified probabilistic graphical model, HMRF. Tran [18] uses a deep neural network to determine whether two ambiguous documents belong to the same author. Recently, learning a low-dimensional representation for each document with network embedding methods has become popular, and several state-of-the-art works take this approach. Zhang [24] solves the problem by learning graph embeddings from three graphs constructed from coauthor relationships. AMiner [25] leverages a supervised inductive graph neural network model to learn document representations and uses a recurrent neural network to predict the true number of clusters.
Both the traditional and the novel methods need an authoritative dataset for evaluation. Supervised methods [19][18][25] also need labeled data for training, and their performance greatly relies on the training data. Previously, CiteSeerX and AMiner published manually-labeled author name disambiguation benchmarks. The CiteSeerX dataset consists of 8,466 documents with 14 author names, while the AMiner dataset consists of 70,258 documents with 100 author names. These two benchmarks are widely used to train and evaluate name disambiguation models. However, models that perform well on these benchmarks do not always generalize well in production environments, due to three nontrivial flaws:

Limited Complexity.
The maximum document numbers per name in the CiteSeerX and AMiner benchmarks are 1,464 and 999 respectively, which is small compared to the real data scale. In the real world, it is quite common for an author name to refer to thousands of documents, and some author names, such as Jing Zhang and Wei Wang, refer to more than ten thousand documents. These huge document sets have much more complicated patterns and place higher demands on the efficiency of disambiguation algorithms. Existing benchmarks (CiteSeerX: http://clgiles.ist.psu.edu/data/; AMiner: https://aminer.org/disambiguation) can hardly provide evaluations of these aspects.

Limited Scale.
Compared to the hundreds of millions of papers in the database, the total numbers of documents and author names in these benchmarks are limited. In order to apply disambiguation models in a production environment, training samples at a larger scale are indispensable.
Limited Accuracy.
What we have learned from constructing a name disambiguation dataset is that it is a very difficult task for humans, who struggle to give accurate results without well-designed tools. However, the existing AND benchmarks provide no specific description of their annotation process, which gives us reason to worry about wrongly labeled data.
Since there is a gap between the proposed disambiguation models and the real production environment, several human-involved methods have been proposed. D-Dupe [2] is an interactive framework for entity resolution that visualizes the author collaboration network as a graph, helping users distinguish different persons. Shen [15] designs several novel visualization interfaces where users can assign newly arriving documents to existing authors based on the rich information provided by the interfaces. We take one of Shen's visualization interfaces, which cleverly visualizes coauthor similarity and document set information, as our user interface. We have also made many functional extensions and improvements so that it can be used in more complicated annotation tasks. More details are given in the appendix.
THE WHOISWHO DATASET

Data Annotation.

The raw data of our name disambiguation dataset are collected from the AMiner database as follows. First, we choose author names based on the number of ambiguous authors and their papers. Then, for each author sharing an ambiguous name, we collect all the papers belonging to that author along with their attributes, such as title, abstract, coauthors, affiliations, venues and keywords. Finally, we hire several annotators to label the raw name disambiguation data using our annotation framework.
Statistics of WhoisWho.
To the best of our knowledge, we have published the world's largest manually-labeled name disambiguation dataset, with 399,255 papers belonging to 45,187 persons across 421 common author names. Some details of the dataset are shown in Figure 3.

[Figure 3: Statistics of WhoisWho. (a) number of authors; (b) number of documents.]
Achievements.
We also organize a data challenge based on the published dataset (https://biendata.com/competition/aminer2019/). Moreover, we comprehensively analyze different name disambiguation scenarios and define two basic task tracks in the challenge: Name Disambiguation from Scratch and Continuous Name Disambiguation. The challenge was held successfully, attracting more than 1,000 participants in 500 teams and producing some meaningful ideas. The challenge also indicates that the name disambiguation task remains an open problem
and requires further research. To promote exploration in the author name disambiguation field, we plan to release more labeled data periodically, with corresponding data challenges, in the future.

[Figure 4: A toy example of input raw data. Documents of "Wei Wang" are grouped into assigned profiles (Wei Wang 1–5) and an unassigned set; colors denote true identities, illustrating over-partitioned and over-merged assignments.]
PROBLEM FORMULATION

In this section, we present the formulation of the name disambiguation annotation task with preliminaries.
Let $a$ be a given name reference, and $\mathcal{D}^a = \{D^a_1, D^a_2, \ldots, D^a_N\}$ be a set of $N$ documents associated with the author name $a$. We call $\mathcal{D}^a$ the document set of $a$. We use $I(D^a_i)$ to denote the identity (corresponding real-world person) of $D^a_i$. Thus, if $D^a_i$ and $D^a_j$ are authored by the same author, we have $I(D^a_i) = I(D^a_j)$. We omit the superscript in the following description when there is no ambiguity. Given this, we define the problem of author disambiguation as follows. Definition 4.1.
Name Disambiguation.
The task of author disambiguation is to find a function $\Theta$ that partitions $\mathcal{D}$ into a set of disjoint clusters, i.e., $\Theta(\mathcal{D}) \rightarrow \mathcal{C}$, where $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$, such that each cluster contains only documents of the same identity, i.e., $I(D_i) = I(D_j), \forall (D_i, D_j) \in C_k \times C_k$, and different clusters contain documents of different identities, i.e., $I(D_i) \neq I(D_j), \forall (D_i, D_j) \in C_k \times C_{k'}, k \neq k'$.

Our annotation framework begins with existing data in the database, which is more efficient and less challenging than labeling from scratch. In addition, documents are integrated into the database in a streaming fashion, so there is also a set of unassigned documents $\tilde{C}$ in the input data. Thus we use $\mathcal{C} = \{C_1, C_2, \ldots, C_K, \tilde{C}\}$ to denote the original assignment of the input data.

Figure 4 gives a toy example of raw input data sampled from the AMiner database. In Figure 4, each document has an author named "Wei Wang". The color of each document represents its real author identity, and the documents within a dashed container are assigned to the same author profile. There are typically two types of errors in the assignments:

• Over-merged:
Documents of different authors are wrongly merged into a single profile in the database, due to common attributes or error propagation in the disambiguation process, i.e., $\exists (D_i, D_j) \in C_k \times C_k, I(D_i) \neq I(D_j)$. For example, "W Wang 001" in Figure 4 is over-merged with documents from two different authors.

• Over-partitioned:
Documents of an individual author are wrongly split across several distinct profiles in the database, i.e., $\exists (D_i, D_j) \in C_k \times C_{k'}, k \neq k', I(D_i) = I(D_j)$. For example, "W Wang 004" and "W Wang 005" in Figure 4 are over-partitioned from the same author.

For a given author name and its document set $\mathcal{D} = \{D_i\}$, our annotation task has two goals (see the sketch after this list):

• 1) generate a perfect document assignment $\mathcal{C} = \{C_i\}$ that does not include the unassigned document set $\tilde{C}$;

• 2) assign as many documents as possible, as long as the first goal is met.
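To make these notions concrete, here is a minimal Python sketch (the structure and names are illustrative assumptions, not the authors' implementation) of an assignment with an unassigned set, plus checks for the two error types against ground-truth identities:

```python
from dataclasses import dataclass, field

@dataclass
class Assignment:
    """A candidate assignment C = {C_1, ..., C_K} plus the unassigned set."""
    clusters: dict[str, set[str]]            # profile id -> document ids
    unassigned: set[str] = field(default_factory=set)

def find_errors(assignment: Assignment, identity: dict[str, str]):
    """Compare an assignment against ground-truth identities I(D_i).

    Returns (over_merged, over_partitioned):
    - over_merged: profiles containing documents of more than one identity
    - over_partitioned: identities split across more than one profile
    """
    over_merged = [
        pid for pid, docs in assignment.clusters.items()
        if len({identity[d] for d in docs}) > 1
    ]
    profiles_per_identity: dict[str, set[str]] = {}
    for pid, docs in assignment.clusters.items():
        for d in docs:
            profiles_per_identity.setdefault(identity[d], set()).add(pid)
    over_partitioned = [i for i, ps in profiles_per_identity.items() if len(ps) > 1]
    return over_merged, over_partitioned
```

On the Figure 4 example, the first check would flag the over-merged "W Wang 001" profile, and the second would pair "W Wang 004" with "W Wang 005" under one identity.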
ANNOTATION FRAMEWORK

In this section, we present our annotation framework for the name disambiguation task. We first introduce the method for extracting several similarity graphs from a raw document set. Then we discuss the inductive model used to aggregate and refine the similarity graphs. On the refined graph, we apply a community detection algorithm to cluster publications that are likely to belong to the same author. After that, the annotators conduct operations upon these clusters; with well-designed operations and visualization interfaces, they are able to complete this difficult task efficiently and correctly. Lastly, we discuss the annotation workflow, which decomposes the whole annotation process into four steps: cleaning, verifying, adding and merging. Throughout these steps, mutual inspection and majority voting are applied so that annotation accuracy is guaranteed.

In the first step, we model pairwise similarity based on several document attributes and build multiple similarity graphs for each raw document set. Each similarity graph corresponds to a specific document attribute.
Definition 5.1.
Similarity Graph.
For a given document set $\mathcal{D} = \{D_i\}$ and a given document attribute $x$, we construct a complete weighted graph as the similarity graph $G_x(\mathcal{D}) = (\mathcal{D}, E_x)$. Each edge weight (possibly zero) is calculated by a similarity function $S_x$.

Both supervised and unsupervised methods can serve as similarity functions. For the name disambiguation model proposed in this paper, we adopt inverse document frequency as the similarity function for its simplicity and scalability. In practical deployment, however, we carefully define the features and use Support Vector Machine models to build more accurate similarity graphs, mainly because the similarity graphs are visualized in our annotation interface.

In order to cluster documents properly, there should be a single graph where each edge represents the probability that the connected documents belong to the same author. Since the similarity of a single attribute cannot serve this role, it is necessary to aggregate all similarity graphs into one universal graph. We call this process Graph Refinement. We use different graph refinement models as the annotation project progresses. At early stages, when little training data is available, we simply sum all the similarity graphs together and use an empirical threshold to filter the edges. After gaining enough annotation data, we adopt an end-to-end graph neural network model to refine the graph. This model evaluates the similarity between each pair of documents using topological information from all the similarity graphs, making it possible to solve the tricky problem in Figure 2.
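As an illustration of Definition 5.1 and the early-stage refinement just described, the following sketch builds an IDF-weighted similarity graph for one attribute and sums attribute graphs under an empirical threshold. The tokenization and threshold value are assumptions for the example; the production system uses SVM-based similarity functions instead.

```python
import math
from collections import Counter
from itertools import combinations

def idf_similarity_graph(docs: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Similarity graph for one attribute (e.g., coauthor sets).

    Each edge weight is the summed inverse document frequency of the
    tokens the two documents share, so rare shared tokens count more.
    """
    n = len(docs)
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    idf = {tok: math.log(n / cnt) for tok, cnt in df.items()}
    graph = {}
    for (i, ti), (j, tj) in combinations(docs.items(), 2):
        w = sum(idf[tok] for tok in ti & tj)
        if w > 0:
            graph[(i, j)] = w
    return graph

def refine_by_sum(graphs: list[dict], threshold: float = 1.0) -> dict:
    """Early-stage refinement: sum all attribute graphs, keep strong edges."""
    total = Counter()
    for g in graphs:
        total.update(g)
    return {edge: w for edge, w in total.items() if w >= threshold}
```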
Graph Neural Network for Edge Classification
Xu [23] and Hamilton [5] have shown that a GNN can learn graph structure information like the Weisfeiler-Lehman algorithm if the input node features are randomized. In that sense, for a given document set $\mathcal{D} = \{d_i\}$ and its similarity graphs $\mathcal{G} = \{G_x\}$, we generate a random Gaussian vector for each node as its input feature. There are two reasons for random initialization:

• Almost all valuable information from the documents' attributes has already been encoded into the similarity graphs.

• Involving global semantics would significantly increase model complexity, making the model prone to overfitting.

Since the similarity graphs are complete graphs and most edges within them are weak, we apply an adaptive pruning strategy to these graphs.

[Figure 5: The structure of the proposed name disambiguation model: similarity graphs are pruned, encoded by a GNN encoder over random node features, and decoded by MLPs over node and edge features for edge classification.]
Definition 5.2.
Adaptive Graph Pruning.
For a given $N \times N$ input adjacency matrix $\hat{A}$, the pruned adjacency matrix $A$ is produced as:

$$\tilde{A}_{ij} = \begin{cases} 0 & \text{if } \hat{A}_{ij} < \frac{1}{N}\sum_{j'} \hat{A}_{ij'} \\ \hat{A}_{ij} & \text{otherwise} \end{cases} \qquad A_{ij} = \tilde{A}_{ij} + \tilde{A}_{ji}$$

so that each element $\hat{A}_{ij}$ is filtered by both its row threshold and its column threshold, and the symmetry of the matrix $A$ is preserved.

Row normalization is used to normalize the pruned adjacency matrices; $\mathcal{A} = \{A_x\}$ denotes the normalized adjacency matrix set. The adjacency matrices $\mathcal{A}$ are fed into two EGNN [4] layers together with the node features $V$. The $k$-th layer function is defined as

$$V^k = \sigma\left[ \big\Vert_x \left( A_x V^{k-1} W^k \right) \right]$$

where $\Vert$ represents the concatenation operation over the attribute graphs. Through the GNN encoder, the input $N \times N$ adjacency matrices (each row and column represents the similarity relationships of a specific document) are encoded as $N$ dense vectors $\{V_i\}$, where each vector corresponds to a graph node (document). We then combine each pair of node feature vectors with the corresponding edge feature vector into $N^2$ triplets $(V_i, V_j, E_{ij})$. We concatenate the feature vectors within each triplet and feed them into an MLP classifier to decode the triplets into edges. For each given document set $\mathcal{D} = \{d_i\}$ and its annotation result $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$, we build a complete graph $G_c(\mathcal{V}, \mathcal{E})$ as the ground truth, where $\mathcal{E} = \{e_{ij}, \forall (v_i, v_j) \in \mathcal{V} \times \mathcal{V}, i \neq j\}$. According to the annotation results, we classify edges into two categories, the positive $\mathcal{E}_p$ and the negative $\mathcal{E}_n$:

$$\mathcal{E}_p = \{e_{ij} \mid (d_i, d_j) \in C_k \times C_k, C_k \in \mathcal{C}\}$$
$$\mathcal{E}_n = \{e_{ij} \mid (d_i, d_j) \in C_k \times C_{k'}, k \neq k', (C_k, C_{k'}) \in \mathcal{C} \times \mathcal{C}\}$$

In most instances, the number of negative samples is much larger than the positive, so we apply a weighted loss based on the ratio of positive to negative samples.
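A minimal numpy sketch of the pruning rule and the layer update above; the ReLU nonlinearity and the weight shapes are illustrative assumptions:

```python
import numpy as np

def adaptive_prune(A_hat: np.ndarray) -> np.ndarray:
    """Zero out entries below their row mean, then symmetrize: A = Ã + Ãᵀ."""
    row_mean = A_hat.sum(axis=1, keepdims=True) / A_hat.shape[0]
    A_tilde = np.where(A_hat < row_mean, 0.0, A_hat)
    # adding the transpose filters each entry by row and column thresholds
    return A_tilde + A_tilde.T

def row_normalize(A: np.ndarray) -> np.ndarray:
    deg = A.sum(axis=1, keepdims=True)
    return A / np.maximum(deg, 1e-12)

def gnn_layer(A_list: list, V: np.ndarray, W_list: list) -> np.ndarray:
    """V^k = σ[ ||_x (A_x V^{k-1} W^k) ]: propagate the node features on
    every similarity graph, then concatenate the per-graph results."""
    outs = [A @ V @ W for A, W in zip(A_list, W_list)]
    return np.maximum(np.concatenate(outs, axis=1), 0.0)  # σ = ReLU here
```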
After graph refinement, we obtain a refined graph $G$ for a given document set $\mathcal{D} = \{D_i\}$. The edge $e_{ij}$ between documents $D_i$ and $D_j$ represents how likely it is that $D_i$ and $D_j$ belong to the same author. Based on this graph, we apply community detection to each assigned document group $C_i \in \mathcal{C}$ and to the unassigned document set $\tilde{C}$ respectively. We call this process sub-clustering. It splits each assigned group and the unassigned set into several sub-groups whose documents are very likely to belong to the same scholar. Hence, annotators can readily operate on a batch of documents, which significantly boosts the efficiency of the annotation process. Whenever an annotator conducts an operation, our system re-clusters the documents. Efficiency is critical for this online algorithm, so we use the Speaker-Listener Label Propagation Algorithm (SLPA) [22], an improved label propagation algorithm, as our clustering method; it produces a good result in a short time. However, the community detection results cannot be applied to our framework directly, because they contain many huge sub-groups, which would severely impact the visualization (we plot the similarity graphs of selected documents). So we set a maximum sub-group size, empirically 50, and split all over-size sub-groups into several medium-sized groups with a breadth-first search.
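The sketch below illustrates this sub-clustering step. Since SLPA [22] is not in standard libraries, networkx's asynchronous label propagation stands in for it here, and over-size communities are split by chunking a BFS ordering, following the description above (the size limit 50 is the paper's empirical setting):

```python
import networkx as nx
from networkx.algorithms.community import asyn_lpa_communities

MAX_SIZE = 50  # empirical maximum sub-group size from the paper

def sub_cluster(edges: dict[tuple[str, str], float]) -> list[set[str]]:
    """Split a document group into sub-groups of likely-same-author papers."""
    G = nx.Graph()
    for (u, v), w in edges.items():
        G.add_edge(u, v, weight=w)
    sub_groups = []
    # SLPA is used in the real system; LPA here is an illustrative stand-in.
    for community in asyn_lpa_communities(G, weight="weight"):
        if len(community) <= MAX_SIZE:
            sub_groups.append(set(community))
            continue
        # split over-size communities by chunking a BFS ordering
        sub = G.subgraph(community)
        src = next(iter(community))
        order = [src] + [v for _, v in nx.bfs_edges(sub, src)]
        for k in range(0, len(order), MAX_SIZE):
            sub_groups.append(set(order[k:k + MAX_SIZE]))
    return sub_groups
```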
Annotation Operations.

According to the problem formulation, there are mainly two types of errors: over-merged and over-partitioned. In addition, there is a set of documents $\tilde{C}$ that remains unassigned. Given these, we define five batch-wise operations as follows (a code sketch follows the list):

• Merge: a merge operation $\phi_m(C_1, C_2)$ selects two assigned groups $C_1$ and $C_2$ and combines all their documents into a new assigned group $C_{1,2}$, i.e., $\phi_m(C_1, C_2): C_{1,2} \leftarrow \{D_i \in C_1 \cup C_2\}$. Intuitively, we prefer to merge assigned groups with similar documents, as they are very likely from the same author.

• Separate: a separate operation $\phi_s(C_k, c_j)$ excludes a subset of documents $\{D_i \in c_j\}$ from an assigned group $C_k$ and creates a new assigned group $C_j$ based on $c_j$, i.e., $\phi_s(C_k, c_j): C_k \leftarrow \{D_i \in C_k \setminus c_j\}, C_j \leftarrow \{D_i \in c_j\}$, where $c_j \in \Psi(C_k)$. We prefer to separate when the documents in $c_j$ are similar to each other but dissimilar to the documents in $C_k \setminus c_j$.

• Create: a create operation $\phi_c(c_j)$ takes a subset $c_j$ from the unassigned document set $\tilde{C}$ and creates a new assigned group $C_j$, i.e., $\phi_c(c_j): C_j \leftarrow \{D_i \in c_j, c_j \subset \tilde{C}\}$. A create operation is taken when the documents within $c_j$ are similar (likely from the same author) but dissimilar to any documents in the existing assigned groups $\{C_k\}$.

• Assign: an assign operation $\phi_a(C_k, c_j)$ takes a subset $c_j$ from the unassigned document set $\tilde{C}$ and assigns these documents to an existing assigned group $C_k$, i.e., $\phi_a(C_k, c_j): C_k \leftarrow \{D_i \in c_j \cup C_k, c_j \subset \tilde{C}\}$. An assign operation is taken when the documents in $c_j$ are similar to the documents in $C_k$ (likely from the same author).

• Exclude: an exclude operation $\phi_e(C_k, c_j)$ excludes a subset of documents $\{D_i \in c_j\}$ from an assigned group $C_k$ and sets all the documents in $c_j$ as unassigned, i.e., $\phi_e(C_k, c_j): C_k \leftarrow \{D_i \in C_k \setminus c_j\}, \tilde{C} \leftarrow \{D_i \in c_j \cup \tilde{C}\}$, where $c_j \in \Psi(C_k)$. We prefer to exclude when the documents in $c_j$ are dissimilar to each other and also dissimilar to the documents in $C_k \setminus c_j$.

Any possible disambiguation operation can be expressed as a sequence of the above operations.
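As a code sketch of these five operations, the following illustrative functions manipulate an assignment held as a dict of document-id sets plus an unassigned set (not the production implementation):

```python
def merge(C: dict, a: str, b: str, new: str) -> None:
    """φ_m: combine two assigned groups into one new group."""
    C[new] = C.pop(a) | C.pop(b)

def separate(C: dict, k: str, subset: set, new: str) -> None:
    """φ_s: move a subset out of group k into a new assigned group."""
    C[k] -= subset
    C[new] = set(subset)

def create(C: dict, unassigned: set, subset: set, new: str) -> None:
    """φ_c: promote a subset of unassigned documents to a new group."""
    unassigned -= subset
    C[new] = set(subset)

def assign(C: dict, unassigned: set, k: str, subset: set) -> None:
    """φ_a: add a subset of unassigned documents to an existing group."""
    unassigned -= subset
    C[k] |= subset

def exclude(C: dict, unassigned: set, k: str, subset: set) -> None:
    """φ_e: remove a subset from group k and mark it unassigned."""
    C[k] -= subset
    unassigned |= subset
```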
Annotation Workflow.

We divide the annotators' task into four steps: Cleaning, Verifying, Adding and Merging. For each step, we give a clear goal and a limited set of operations that annotators may use, which greatly simplifies the task, so that even inexperienced annotators can master it quickly. In the steps where annotators cooperate, they are likely to give different results, so voting strategies are applied to aggregate the results.

• Cleaning: this task is completed independently by a single annotator, who uses separate and exclude operations to clean all the noise in each assigned group. If the annotator cannot be sure whether documents should be removed from an assigned group, the operation should still be performed, to make sure all assigned groups are clean.
• Verifying: since the previous step is done by a single annotator, it is necessary to verify the results, so several annotators get involved at this step. They use only the exclude operation to refine the result given by the previous annotator. After verifying, the over-merged problem should be eliminated: all the documents in the same assigned group belong to the same person.
• Adding: at this step, several annotators conduct assign and create operations respectively. In order to maintain high accuracy of the labeled data, annotators are asked to perform these operations only when they are quite sure, so ambiguous documents remain unassigned.
• Merging: at the last step, the annotator checks each pair of assigned groups and decides whether to merge them (annotators only need to check a few pairs, because our visualization interface filters out pairs that are totally unrelated). Putting the merge operation last avoids the potential problem described in Figure 2: while many documents remain unassigned, some over-partitioned document sets may seem totally unrelated, so annotators would simply skip them. To avoid this, merge operations are conducted only after sufficient documents have been assigned.

After applying voting to the merging results, we obtain the final version of the annotation result. Our workflow makes the complicated annotation task simple and efficient while ensuring accuracy.

Voting Strategies.

There are three steps where voting strategies are applied:
Verifying, Adding and Merging. At these steps, each annotator gives a different result. As mentioned in the problem formulation, for a given document set $\mathcal{D} = \{D_i\}$ and its assignment $\mathcal{C} = \{C_i\} \cup \tilde{C}$, our ultimate goal is to guarantee the accuracy of $\mathcal{C} \setminus \tilde{C}$. Based on this goal, a different strategy is applied at the end of each step.

• Voting for Verifying:
After verifying, each annotator $p_k$ gives a different set of excluded documents $\mathcal{E}^k_i$ for each assigned group $C_i$, where $C_i \subset \mathcal{C} \setminus \tilde{C}$. To maximize the accuracy of $\mathcal{C} \setminus \tilde{C}$, the aggregated result $\mathcal{E}_i$ for assigned group $C_i$ is the union of $\{\mathcal{E}^k_i\}$, i.e., $\mathcal{E}_i = \bigcup_k \mathcal{E}^k_i$.

• Voting for Assigning (assign operation):
After adding, each annotator $p_k$ gives a set of newly-assigned documents $\mathcal{A}^k_i$ for each assigned group $C_i$. Let $N(d_j)$ denote the number of sets $|\{\mathcal{A}^k_i : d_j \in \mathcal{A}^k_i\}|$ and $K$ the number of annotators. We apply majority voting, so the adding result for each assigned group is defined as $\mathcal{A}_i = \{d_j : N(d_j) > K/2\}$. Thus the accuracy of assigning is guaranteed.

• Voting for Adding (create operation):
On the other hand, after adding, each annotator also gives several sets of documents $\mathcal{C}^k_i$ from create operations. Unlike the assign operation, there can be conflicts between the results given by different annotators. For better quantification, we formulate the conflicts as pairwise conflicts. For instance, if one annotator assigns both document $i$ and document $j$ to a newly-created group while another annotator assigns these two documents to two different newly-created groups, there is a conflicting pair; otherwise, there is a verified pair. We apply the voting strategy for the create operation based on two pairwise principles: 1) we only adopt verified document pairs; 2) we do not adopt any conflicting document pairs, even if they are verified. We use a greedy search algorithm to merge as many documents as we can without violating these principles.

• Voting for Merging:
After merging, each annotator $p_k$ gives a set $\mathcal{M}^k = \{(C_i, C_j), i \neq j\}$ of assigned group pairs to be merged. For example, if three assigned groups $C_i$, $C_j$, $C_k$ are merged together, there are three merging pairs: $(C_i, C_j)$, $(C_i, C_k)$ and $(C_j, C_k)$. We also apply majority voting to these merging pairs and generate a new set of merging pairs $\mathcal{M}$; lastly, all the group pairs within $\mathcal{M}$ are merged (if there are only two merging pairs among three groups, where ideally there should be three, the three groups are still merged together). A sketch of these aggregation rules follows.
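Below is a compact sketch of the aggregation rules: union of exclusions for Verifying, majority voting with $N(d_j) > K/2$ for Assigning, and majority voting over merging pairs for Merging. The greedy conflict-free aggregation for create operations is omitted for brevity; all names are illustrative.

```python
from collections import Counter

def vote_verifying(exclusions: list[set[str]]) -> set[str]:
    """Union of every annotator's excluded documents for one group."""
    return set().union(*exclusions)

def vote_assigning(additions: list[set[str]]) -> set[str]:
    """Keep a document only if more than half of the K annotators added it."""
    K = len(additions)
    counts = Counter(d for added in additions for d in added)
    return {d for d, n in counts.items() if n > K / 2}

def vote_merging(proposals: list[set[tuple[str, str]]]) -> set[tuple[str, str]]:
    """Majority voting over proposed (C_i, C_j) merging pairs."""
    K = len(proposals)
    counts = Counter(p for props in proposals for p in props)
    return {p for p, n in counts.items() if n > K / 2}
```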
Complexity.

For each ambiguous author name, we model the relationships between every pair of documents; hence, the time complexity of building similarity graphs and of graph refinement is $O(N^2)$. Furthermore, the document counts for most ambiguous names are in the thousands, which means our models would need to run inference millions or tens of millions of times per ambiguous name. This is clearly too time-consuming to deploy online. To solve this problem, we preprocess the data and cache the refined graphs for the names to be disambiguated in advance, since the refined graph is constant during the annotation process. Only time-efficient modules, such as community detection and some rendering functions, run online, so annotators barely feel any delay regardless of the data scale.

EXPERIMENTS

We conduct several comprehensive experiments to evaluate the accuracy of our proposed annotation framework. We also evaluate recent state-of-the-art author name disambiguation methods [25][24][10] on our benchmark, and compare them with our method to demonstrate its superiority.

Experimental Setup. We sampled 320 author names from our dataset and split them into 200, 60 and 60 names for training, validation and testing. Each author name refers to a completely different document set.
Comparison Methods. We evaluate three state-of-the-art name disambiguation methods on our dataset.
Aminer et al. [25]:
This method learns a supervised inductive embedding model from manually-labeled data, and uses an unsupervised graph auto-encoder to refine the embedding on a local linkage graph constructed from the common features between documents.
Zhang et al. [24]:
The second method is unsupervised: it constructs three graphs (document-document, author-author and document-author) based on coauthors, and uses triplets sampled from the graphs to optimize a graph embedding.
Louppe et al. [10]:
This semi-supervised method first trains a pairwise distance function based on a set of carefully designed similarity features; then a semi-supervised HAC algorithm is used to determine the clusters.

Our method is denoted
Xiao. In our method, we leverage both pairwise document similarity information and topological information from the similarity graphs to predict whether two documents belong to the same person. We also report the performance of the topological component alone, to analyze the contribution of the edge features:
Xiao(F)
This variant uses only the topological information of the similarity graphs: the decoder takes only the document embeddings learned by the GNN model as input.

We evaluate these disambiguation models using pairwise Precision, Recall and F1-score on 13 sampled author names. We also use both micro- and macro-averaged scores to evaluate the overall performance of each method; the averaged scores are calculated on the complete testing set.
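For reference, pairwise precision, recall and F1 can be computed as in the sketch below (a standard formulation of these metrics; the paper does not include its exact evaluation script):

```python
from itertools import combinations

def pairwise_f1(pred: list[set[str]], truth: list[set[str]]):
    """Pairwise precision/recall/F1 over same-cluster document pairs."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}
    P, T = pairs(pred), pairs(truth)
    tp = len(P & T)
    precision = tp / len(P) if P else 0.0
    recall = tp / len(T) if T else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```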
Table 1: Results of Author Name Disambiguation (each cell is Prec/Rec/F1; the averaged-score row is only partially recoverable from the source).

Name         | Size | Xiao              | Xiao(F)           | Louppe et al.     | Aminer            | Zhang et al.
Xianghua Li  | 303  | 95.45/95.81/95.63 | 92.40/90.12/91.24 | 90.73/94.01/92.34 | 95.55/98.50/97.00 | 99.35/88.62/93.68
Xu Shen      | 352  | 92.38/77.38/84.21 | 97.46/77.32/86.23 | 88.86/16.71/28.13 | 93.54/48.71/64.06 | 99.03/90.29/94.46
Xiaoming Xie | 478  | 87.00/47.01/61.04 | 71.36/48.79/57.96 | 90.86/23.28/37.06 | 94.44/36.20/02.33 | 98.55/72.38/83.46
Suqin Liu    | 517  | 81.71/86.98/84.26 | 94.04/88.03/90.93 | 57.82/98.04/72.74 | 83.67/70.80/76.70 | 98.77/48.36/64.93
Makoto Inoue | 661  | 93.32/79.75/86.00 | 80.32/74.69/77.40 | 95.03/52.09/67.29 | 98.55/81.02/88.93 | 98.75/92.14/95.33
Jihua Wang   | 713  | 93.56/86.20/89.24 | 86.40/75.48/80.57 | 78.96/84.95/81.84 | 94.90/45.70/61.69 | 98.46/44.73/61.51
Chao Deng    | 745  | 95.04/87.10/90.89 | 85.44/78.57/81.86 | 97.01/77.80/86.35 | 99.08/72.14/83.50 | 98.99/86.53/92.34
Qiang Wei    | 1130 | 92.78/67.67/78.26 | 83.02/74.62/78.60 | 97.93/51.62/67.61 | 97.49/34.08/50.51 | 99.11/55.43/71.10
Xiaohua Liu  | 1334 | 97.11/96.81/96.96 | 94.80/96.49/95.63 | 95.38/96.06/95.72 | 99.41/73.08/84.23 | 97.26/73.47/83.71
Weimin Liu   | 1484 | 83.17/77.89/80.44 | 76.73/81.74/79.15 | 92.94/33.80/49.57 | 97.93/21.45/35.19 | 97.84/50.31/66.45
Min Yang     | 2244 | 92.11/61.23/73.56 | 75.86/61.67/68.03 | 96.85/34.34/50.70 | 98.54/31.14/47.32 | 98.18/38.59/55.41
Jing Li      | 4950 | 94.62/88.83/91.64 | 72.64/81.18/76.67 | 99.44/56.00/71.65 | 99.61/51.41/67.82 | 98.56/75.61/85.57
Jing Zhang   | 6141 | 94.91/73.49/82.84 | 65.41/50.59/57.06 | 98.41/33.61/50.11 | 98.99/25.97/41.14 | 95.78/25.29/40.02
Micro Avg.   | -    | 93.69/75.65/-     |                   |                   |                   |

Table 1 shows the performance of the different disambiguation methods on sampled test names of different sizes. According to the results, neither Aminer nor Zhang performs well on the large document sets, since their micro average scores are much lower than their macro ones. Benefiting from a scalable end-to-end training method, our method (Xiao) outperforms the other state-of-the-art methods in both macro and micro average score (+17.18% and +19.67% over Louppe, +22.32% and +10.42% over Aminer, +6.91% and +2.28% over Zhang). The results also show that feeding edge features directly into the decoder greatly boosts performance (+13.54% and +10.71% over Xiao(F)). However, there is still a huge gap between our model and human annotators.

We further investigate the performance of annotators during the annotation process. During the Verifying step, a noisy disambiguation result made by a single annotator is refined by multiple annotators with our voting strategy, so all seemingly wrongly-assigned documents are excluded. We count the number of excluded documents $E$ and the number of remaining assigned documents $R$ during the annotation process; the numbers are shown in Figure 6. The excluding ratio $ER$ is computed as

$$ER = \frac{E}{R + E},$$

which is the proportion of seemingly wrongly-assigned documents and indicates the accuracy of the result made by a single human annotator in our annotation framework.

[Figure 6: Excluding Ratio of Verifying.]

Figure 6 shows that the excluding ratios of most document sets are less than 5%, which indicates that, with our annotation framework, a human annotator can complete a disambiguation task with fewer than 5% wrongly-assigned documents in most instances. Moreover, for sampled document sets whose sizes range from 0 to 4,000, the distribution of excluding ratios is stable, which further demonstrates that our annotation framework is scalable enough to label larger and more complex document sets.

As mentioned before, there can be pairwise conflicts between the results given by different annotators during the adding step. We define the conflicting pair ratio
$CPR$ in the following way. For two annotators $i$ and $j$, let $N_{ij}$ be the number of common documents to which they both apply the create operation, and let $C_{ij}$ be the number of conflicting pairs between them; the conflict pair ratio for this pair of annotators is $CPR_{ij} = \frac{C_{ij}}{N_{ij}}$, and $CPR$ is aggregated over annotator pairs.

[Figure 7: Conflict Pair Ratio of Create.]

Figure 7 shows the distribution of $CPR$ counted during annotation. According to Figure 7, creating conflicts exist widely, but their scale is generally small, which indicates that annotators are likely to make minor mistakes but seldom make big ones in our framework. These mistakes can be detected and completely eliminated by our workflow, which further demonstrates the accuracy of our dataset. In the real annotation scenario, we arrange three annotators for each collaboration step, which triples the amount of work in these steps. Our records show that it takes about 300 man-hours to label 100,000 documents under this arrangement. On average, each person can label 600 to 700 documents per hour with the assistance of our annotation system.
CONCLUSION

The experiment code and a demo of the annotation system are available online. In this paper, we dived into the issues of Author Name Disambiguation. First, we proposed a novel crowdsourcing framework to build the world's largest manually-labeled Author Name Disambiguation dataset,
WhoisWho, which provides a new point of view for researchers. Through comprehensive evaluation and analysis, we demonstrated that our annotation process is highly efficient and the annotation results are accurate. We also organized a competition based on the published dataset, which produced some meaningful ideas. In addition, we adapted an inductive supervised model for the AND task and applied it in our annotation framework to assist annotators. We evaluated it against several state-of-the-art name disambiguation methods on our benchmark. The experimental results demonstrate the advantage of our method over state-of-the-art author name disambiguation methods. They also show that even the best name disambiguation model still has a huge gap with humans, which indicates that AND is still an open problem with large room for improvement. In the future, we will continue to expand the scale of the annotated data and release the annotation task on the Internet in the form of crowdsourcing, so that more people can participate. New annotation results will be made available with new data challenges soon. In addition to manually-labeled data, several works [1][14] evaluate their models with the certified information in Google Scholar. Per author name this kind of data is very small, so it is seldom used as a dataset; however, it has great advantages in accuracy and diversity. In the future, we will try to use it to verify the accuracy of our annotation results.

REFERENCES

[1] Mehmet Ali Abdulhayoglu and Bart Thijs. 2017. Use of ResearchGate and Google CSE for author name disambiguation. Scientometrics.
[2] Mustafa Bilgic, Louis Licamele, Lise Getoor, and Ben Shneiderman. 2006. D-Dupe: An Interactive Tool for Entity Resolution in Social Networks. In Visual Analytics Science and Technology, 2006 IEEE Symposium On. 43–50.
[3] Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, and Bing Lv. 2011. On Graph-Based Name Disambiguation. Journal of Data and Information Quality 2, 2 (2011), 1–23.
[4] Liyu Gong and Qiang Cheng. 2019. Exploiting Edge Features for Graph Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9211–9219.
[5] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems. 1024–1034.
[6] Hui Han. 2005. Name Disambiguation in Author Citations Using a K-way Spectral Clustering Method. In ACM/IEEE-CS Joint Conference on Digital Libraries. 334–343.
[7] Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. 2004. Two Supervised Learning Approaches for Name Disambiguation in Author Citations. In JCDL'04. 296–305.
[8] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. (2016).
[9] Thomas N. Kipf and Max Welling. 2016. Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308 (2016).
[10] Gilles Louppe, Hussein T. Al-Natsheh, Mateusz Susik, and Eamonn James Maguire. 2016. Ethnicity Sensitive Author Disambiguation Using Semi-Supervised Learning. In KESW'16. 272–287.
[11] Gideon S. Mann and David Yarowsky. 2003. Unsupervised Personal Name Disambiguation. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4. Association for Computational Linguistics, 33–40.
[12] Hsin Tsung Peng, Cheng Yu Lu, William Hsu, and Jan Ming Ho. 2012. Disambiguating Authors in Citations on the Web and Authorship Correlations. Expert Systems with Applications 39, 12 (2012), 10521–10532.
[13] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
[14] Christian Schulz, Amin Mazloumian, Alexander M. Petersen, Orion Penner, and Dirk Helbing. 2014. Exploiting Citation Networks for Large-Scale Author Name Disambiguation. EPJ Data Science 3, 1 (2014), 11.
[15] Qiaomu Shen, Tongshuang Wu, Haiyan Yang, Yanhong Wu, Huamin Qu, and Weiwei Cui. 2017. NameClarifier: A Visual Analytics System for Author Name Disambiguation. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 141–150.
[16] Jie Tang, Alvis C. M. Fong, Bo Wang, and Jing Zhang. 2012. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge & Data Engineering 24, 6 (2012), 975–987.
[17] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: Extraction and Mining of Academic Social Networks. In KDD'08. 990–998.
[18] Hung Nghiep Tran, Tin Huynh, and Tien Do. 2014. Author Name Disambiguation by Using Deep Neural Network. Springer International Publishing. 123–132.
[19] Pucktada Treeratpituk and C. Lee Giles. 2009. Disambiguating Authors in Academic Publications Using Random Forests. In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries. 39–48.
[20] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903 (2017).
[21] Xuezhi Wang, Jie Tang, Hong Cheng, and Philip S. Yu. 2011. ADANA: Active Name Disambiguation. In IEEE International Conference on Data Mining. 794–803.
[22] Jierui Xie, Boleslaw K. Szymanski, and Xiaoming Liu. 2011. SLPA: Uncovering Overlapping Communities in Social Networks via a Speaker-Listener Interaction Dynamic Process. In IEEE International Conference on Data Mining Workshops. IEEE, 344–349.
[23] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful Are Graph Neural Networks? arXiv preprint arXiv:1810.00826 (2018).
[24] Baichuan Zhang and Mohammad Al Hasan. 2017. Name Disambiguation in Anonymized Graphs Using Network Embedding. In CIKM'17. 1239–1248.
[25] Yutao Zhang, Fanjin Zhang, Peiran Yao, and Jie Tang. 2018. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1002–1011.
APPENDIX: THE USER INTERFACE

In this section, we introduce our user interface, shown in Figure 8. The prototype of this visualization interface was proposed by Shen [15]. We first describe the features designed in that previous work, and then the improvements we have made on it. We follow the order of the markers in Figure 8 one by one.
The hollow circle pointed to by arrow 1 represents the assigned document groups, and each of its segments represents one assigned group. If a segment is clicked, the profile information of that group is displayed on the left side of the interface, and the documents belonging to the group are visualized at the center of the circles. Group size is encoded as segment length, while group quality is encoded as segment color: the darker the segment, the more similar the papers in the group.
The other hollow circle, pointed to by arrow 2, represents the unassigned document groups. Each of its segments represents a sub-clustering group. The features of this component are exactly the same as those of the previous one.
Each document belonging to the selected groups is plotted as a graph node in the center of the circles. If two documents have any common authors, there is an edge between the corresponding nodes. Besides that, some edges exist between documents and assigned groups, meaning that at least one document within the assigned group has common authors with the connected document. The links between groups and documents are called potential links. The visualization techniques above were proposed by the previous work; next, we introduce our new features.
Our framework uses the sub-clustering method to split each assigned document group into several sub-groups. The segments pointed to by arrow 4 indicate the sub-groups of assigned group 3. These segments share the same features as the previous two components and help annotators conduct more precise operations.
In the previous work, the edges between documents and groups indicate only author similarity. Since our annotation framework takes various document attributes into consideration, we extend the edges to represent more complex relationships: annotators can choose the document features they are interested in, and two documents are connected if at least one of the selected attributes is similar.
Our system provides several node interfaces to meet annotators' requirements. The marking interface allows annotators to mark nodes by modifying their shapes; the freezing interface fixes node positions; and the brush interface enables annotators to select a group of nodes quickly.
For each sub-group, our framework counts the frequencies of features and plots them separately, so an annotator can quickly analyze each selected group, or simply click a feature to select all the corresponding documents.
The previous work designs potential links between assigned groups and documents, which help annotators quickly find the assigned groups they are interested in. We extend this design to the unassigned document sets by adding shadows, so that annotators can quickly find target unassigned groups or documents as well.