Name Disambiguation in Anonymized Graphs using Network Embedding
Baichuan Zhang
Purdue University, West Lafayette, IN
[email protected]
Mohammad Al Hasan
Indiana University Purdue University Indianapolis, Indianapolis, IN
[email protected]
ABSTRACT
In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search, and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed, which aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task rely substantially on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, in many scenarios such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. In the methodological aspect, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than the existing name disambiguation methods working in a similar setting.
CCS CONCEPTS
• Information systems → Clustering; Information retrieval; Document representation;
KEYWORDS
Name Disambiguation; Neural Network Embedding; Clustering
ACM Reference format:
Baichuan Zhang and Mohammad Al Hasan. 2017. Name Disambiguation in Anonymized Graphs using Network Embedding. In Proceedings of CIKM'17, Singapore, Singapore, November 6–10, 2017, 11 pages.
DOI: 10.1145/3132847.3132873

This research is sponsored by Mohammad Al Hasan's NSF CAREER Award (IIS-1149851) and also by a research grant from CareerBuilder.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM'17, Singapore, Singapore
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 978-1-4503-4918-5/17/11...$15.00
DOI: 10.1145/3132847.3132873
1 INTRODUCTION

Name disambiguation [3, 10, 30, 32, 33] is an important problem, which has numerous applications in information retrieval, counter-terrorism, and bibliographic data analysis. In information retrieval, name disambiguation is critical for sanitizing search results of ambiguous queries. For example, an online search query for "Michael Jordan" may retrieve pages of the former US basketball player, pages of the UC Berkeley machine learning professor, and pages of other persons having that name, and name disambiguation is necessary to split those pages into homogeneous groups. In counter-terrorism, such an exercise is essential before inserting a person's profile in a law enforcement database; failing to do so may cause severe trouble to many innocent persons who are namesakes of a potential criminal. Evidently, name disambiguation is particularly important in the fields of bibliometrics and library science. This is due to the fact that many distinct authors share the same name reference, as the first name of an author is typically written in abbreviated form in the citation of many scientific articles. Thus, bibliographic servers that maintain such data may mistakenly aggregate the articles from multiple scholars (sharing the same name) into a unique profile in some digital repositories. For example, the Google Scholar profile associated with the name "Yang Chen" is verified as the profile page of a Computer Graphics PhD candidate at Purdue University, but based on our labeling, more than 20 distinct persons' publications are mistakenly mixed under that profile. Such mistakes in library science over- or under-estimate a researcher's citation-related impact metrics.

Due to its importance, the name disambiguation task has attracted substantial attention from the information retrieval and data mining communities.
However, the majority of existing solutions [1, 3, 12, 15] for this task use biographical features such as name, address, institutional affiliation, email address, and homepage. Also, contextual features such as collaborators and community affiliation, and external data sources such as Wikipedia, are used in some works [13, 15]. Using biographical features is acceptable for disambiguation of authors in the bibliometrics domain, but in many scenarios, for example in national security related applications, biographical features are hard to obtain, or they may even be illegal to obtain unless a security analyst has the appropriate level of security clearance. Besides, in real-world social networks (e.g., Twitter, Facebook, and LinkedIn), some users may choose a strict privacy setting that restricts the visibility of their profile information and posts. For such privacy-preserving scenarios, many existing name disambiguation techniques [10, 12, 15, 22, 27], which compute document similarity using biographical attributes, are not applicable.

In recent years, a few works have emerged in which the name disambiguation task in a privacy-preserving setting has been considered [14, 32]. These works use relational data in the form of an anonymized person-person collaboration graph, and solve name disambiguation by using graph topological features. Thus they preserve the privacy of a user. Authors of [14] use a graphlet kernel based classification model, and the authors of [32] use a Markov clustering based unsupervised approach. However, both of these works only consider a binary classification task, predicting whether a given person-node in the graph is ambiguous or non-ambiguous. This is far from a traditional name disambiguation task, which partitions the records pertaining to a given name reference into different groups, each belonging to a unique person.

(Google Scholar profile of "Yang Chen": https://scholar.google.com/citations?user=gl26ACAAAAAJ&hl=en)
Another limitation of the existing works is that they only utilize the person-person collaboration network, which does not generally yield a good disambiguation performance. There is other information, such as person-document association information and document-document similarity information, which can also be exploited for obtaining improved name disambiguation while still preserving the user's privacy.

In this work, we solve the name disambiguation task by using only relational information. For a given name reference, our proposed method pre-processes the input data as three graphs: a person-person graph representing collaboration between a pair of persons, a person-document graph representing association of a person with a document, and a document-document similarity graph. These graphs are appropriately anonymized; as such, the vertices of these graphs are represented by unique pseudo-random identifiers. Nodal features (such as biographical information of a person-node, or keywords of a document-node) of any of the above three graphs are not used, which makes the proposed method privacy-preserving.

In the graph representation, the name disambiguation task becomes a graph clustering task on the document-document graph, with the objective that each cluster contains documents pertaining to a unique real-life person. A traditional method to cluster a homogeneous network cannot facilitate information exchange among the three graphs, so we propose a novel representation learning model, which embeds the vertices of these graphs into a shared low-dimensional latent space by using a joint objective function. The objective function of our representation learning task utilizes pairwise similarity ranking, which is different from the typical objective functions used in the existing document embedding methods, such as LINE [24] and PTE [23]; the latter ones are based on K-L divergence between the empirical similarity distribution and the embedding similarity distribution.
K-L divergence works over the entire distribution vector and it works well for document labeling or topic modeling, but not so for clustering. On the other hand, our objective function is better suited for a downstream clustering task because it directly optimizes the pairwise distance between similar and dissimilar documents, thus making the document vectors disambiguation-aware in the embedded space; as such, a traditional hierarchical clustering of the vectors in the embedded space generates excellent name disambiguation performance. Experimental comparison with several state-of-the-art name disambiguation methods, both traditional and network embedding-based, shows that the proposed method is significantly better than the existing methods on multiple real-life name disambiguation datasets.

The key contributions of this work are summarized as below:

(1) We solve the name disambiguation task by using only linked data from network topological information. The work is motivated by the growing demand for big data analysis without violating user privacy in security sensitive domains.
(2) We propose a network embedding based solution that leverages the linked structures of a variety of anonymized networks in order to represent each document in a low-dimensional vector space for solving the name disambiguation task. To the best of our knowledge, our work is the first one to adopt a representation learning framework for name disambiguation in anonymized graphs.
(3) For representation learning, we present a novel pairwise ranking based objective, which is particularly suitable for solving the name disambiguation task by clustering.
(4) We use two real-life bibliographic datasets for evaluating the disambiguation performance of our solution. The results demonstrate the superiority of our proposed method over the state-of-the-art methodologies for name disambiguation in a privacy-preserving setup.
2 RELATED WORK

There exist a large number of works on name disambiguation [3, 10]. In terms of methodologies, existing works have considered supervised [1, 10], unsupervised [3, 11], and probabilistic relational models [21, 22, 33]. In the supervised setting, Han et al. [10] proposed supervised name disambiguation methodologies utilizing Naive Bayes and SVM. In these works, a distinct real-life entity can be considered as a class, and the objective is to classify each record to one of the classes. For unsupervised name disambiguation, the records are partitioned into several clusters with the goal of obtaining a partition where each cluster contains records from a unique entity. For example, Han et al. [11] used K-way spectral clustering for name disambiguation in bibliographical data. Recently, probabilistic relational models, especially graphical models, have also been considered for the name disambiguation task. For instance, [22] proposed to use Markov Random Fields to address name disambiguation in a unified probabilistic framework.

Most existing solutions to the name disambiguation task use either biographical attributes, or auxiliary features that are collected from external sources. However, the attempt of extracting biographical or external data sustains the risk of privacy violation. To address this issue, a few works [14, 17, 20, 32] have considered name disambiguation using anonymized graphs without leveraging the node attributes. The central idea of this type of works is to exploit graph topological features to solve the name disambiguation problem without intruding on user privacy through the collection of biographical attributes. For example, authors in [14] characterized the similarity between two nodes based on their local neighborhood structures using graph kernels and solved the name disambiguation problem using SVM.
[Figure 1: Paper Count Distribution of "S Lee"]

However, the major drawback of the proposed method in [14] is that it can only detect entities that should be disambiguated, but fails to further partition the documents into their corresponding homogeneous groups. Authors in [20, 32] proposed an unsupervised solution to name disambiguation in an anonymized graph by exploiting the time-stamped network topology around a vertex. However, it also suffers from a similar issue as described above.

Our proposed solution utilizes a network representation learning based approach [2, 4, 7, 9, 19, 23–26], a rather recent development in machine learning. Many of these methods are inspired by word embedding based language models [18]. Different from traditional graph embedding methods, such as Laplacian Eigenmaps [5, 6], the recently proposed network embedding methods, such as DeepWalk [19], LINE [24], PTE [23], and Node2Vec [9], are more scalable and have shown better performance in node classification and link prediction tasks. Among these works, LINE [24] finds embeddings of documents by using a document-document similarity matrix, whereas our work uses multiple networks and performs a joint learning. PTE [23] performs a joint learning over multiple input graphs, but PTE needs labeled data. Finally, the embedding formulation and optimization of our proposed method are different from LINE or PTE. Specifically, we use a ranking based loss function as our objective function, whereas almost all the existing methods use a K-L divergence based objective function.
3 PROBLEM FORMULATION

We first introduce the notations used in this paper. Throughout the paper, a bold uppercase letter (e.g., X) denotes a matrix, a bold lowercase letter such as x_i denotes a column vector, and (·)^T denotes vector transpose. ‖X‖_F is the Frobenius norm of matrix X. A calligraphic uppercase letter (e.g., 𝒳) is used to denote a set, and |𝒳| is the cardinality of the set 𝒳.

For a given name reference a, we denote D^a = {d^a_1, d^a_2, ..., d^a_N} to be the set of N documents with which a is associated, and A^a = {a_1, a_2, ..., a_M} is the collaborator set of a in D^a, where a ∉ A^a. If there is no ambiguity, we remove the superscript a in the notations of both D^a and A^a and refer to the terms as D and A, respectively. For illustration, in the bibliographic field, D can be the set of scholarly publications where a is one of the authors, and A is the set of a's coauthors. In real life, the given name reference a can be associated with multiple persons (say L), all sharing the same name. The task of name disambiguation is to partition D into L disjoint sets such that each partition contains documents of a unique person entity with name reference a.

Though it may appear to be a simple clustering problem, name disambiguation is challenging on real-life data. This is due to the fact that it requires solving a highly class-imbalanced clustering task, as the number of documents associated with a distinct person follows a power-law distribution. We demonstrate it through an example from the bibliographic domain. In Figure 1, we show a histogram of paper counts of various real-life persons named "S Lee" in CiteSeerX. As we can observe, there are a few real-life authors (dominant entities) with the name "S Lee" to whom the majority of the publications belong. Only a few publications belong to each of the remaining real-life authors with the name "S Lee". Due to this severe class imbalance issue, the majority of traditional clustering methods perform poorly on this task.
Sophisticated machine learning models, like the one we propose below, are needed for solving this task. This example is from the bibliographic domain, but power-law distribution of possession is common in every aspect of real life, so we expect this challenge to hold in other domains as well.

In this study, we investigate the name disambiguation problem in a restricted setup, where biographical features and information from external sources are not considered, so that the risk of privacy violation can be alleviated. Instead, we formulate the problem using graphs in which each node has been assigned an anonymized identifier, and network topological structure is the only information available. Specifically, our solution encodes the local neighborhood structures accumulated from three different networks into a proposed network embedding model, which generates a k-dimensional vector representation for each document. The networks are the person-person network, the person-document network, and the linked document network, which we formally define as below:

Definition 3.1 (Person-Person Network). For a given name reference x, the person-person network, denoted as G_pp = (A^x, E_pp), captures collaboration between a pair of persons within the collection of documents associated with x. A^x is the collaborator set, and e_ij ∈ E_pp represents the edge between the persons a_i and a_j, who collaborated in at least one document. The weight w_ij of the edge e_ij is defined as the number of distinct documents in which a_i and a_j have collaborated.

The person-person network is important because the inter-person acquaintances represented by the collaboration relation can be used to discriminate the sets of documents of multiple real-life persons. However, the collaboration network does not account for the fact that the documents associated with the same real-life person are inherently similar; the person-document network and the document-document network cover for this shortcoming.
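As a concrete illustration, the weighted person-person graph of Definition 3.1 can be derived directly from the per-document collaborator sets. The following minimal Python sketch (the identifiers and the toy data are hypothetical, not from the paper's datasets) counts, for each pair of collaborators, the number of distinct documents they share:

```python
from collections import Counter
from itertools import combinations

def build_person_person(doc_authors):
    """Build the weighted person-person graph G_pp.

    doc_authors: list of collaborator sets, one per document (the
    ambiguous name reference itself is excluded). Returns a Counter
    mapping an unordered pair (a_i, a_j) to the weight w_ij, i.e.,
    the number of distinct documents in which a_i and a_j collaborated.
    """
    weights = Counter()
    for authors in doc_authors:
        # every unordered pair of collaborators in this document
        for a_i, a_j in combinations(sorted(authors), 2):
            weights[(a_i, a_j)] += 1
    return weights

# Toy anonymized person identifiers.
docs = [{"p1", "p2"}, {"p1", "p2", "p3"}, {"p2", "p3"}]
g_pp = build_person_person(docs)
# ("p1", "p2") co-occur in two distinct documents, so w_12 = 2.
```

Because identifiers are pseudo-random and no nodal attributes are attached, this construction keeps the privacy-preserving property described above.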
Definition 3.2 (Person-Document Network). The person-document network, represented as G_pd = (A ∪ D, E_pd), is a bipartite network where D is the set of documents with which the name reference a is associated and A is the set of collaborators of a over all the documents in D. E_pd is the set of edges between persons and documents. The edge weight w_ij between a person node a_i and a document d_j is simply defined as the number of times a_i appears in document d_j. For a bibliographic dataset, a_i is simply an author of the document d_j, and the weight w_ij = 1.

Definition 3.3 (Linked Document Network). The document-document network is represented as G_dd = (D, E_dd), where each vertex d_i ∈ D is a document. If two documents d_i and d_j are similar (more discussion is forthcoming), we build an edge between them, represented as e_ij ∈ E_dd.

There are several ways document-document similarity can be captured. For instance, one can find word co-occurrence between different documents to compute this similarity. However, we refrained from using word co-occurrence due to the privacy concern, as sometimes a list of a set of unique words can reveal the identity of a person [31]. Instead, we define document-document similarity through a combination of person-person and person-document relationships. Two documents are similar if the intersection of their collaborator-sets is large (by using the person-document relationship), or if the intersection of the one-hop neighbors of their collaborator-sets is large (by using both person-document and person-person relationships).

The above definition of document similarity captures two important patterns which facilitate effective name disambiguation by document clustering. First, there is a high chance for two documents to be authored by the same real-life person if they have a large number of overlapping collaborators.

(CiteSeerX: http://citeseerx.ist.psu.edu/index;jsessionid=4A26742FADC605600567F493C2D7825E)
Second, even if they do not have any overlapping collaborators, a large overlap in the neighbors of their collaborators signals that the documents are most likely authored by the same person. In both cases, these two documents should be placed in close proximity in the embedded space. Mathematically, we denote A_di as the collaborator set of d_i. Furthermore, Ā_di is the set of collaborators obtained by extending A_di with all neighbors of the persons in A_di, namely Ā_di = A_di ∪ ⋃_{b ∈ A_di} NB_Gpp(b), where NB_Gpp(b) is the set of neighbors of node b in the person-person network G_pp. Then the document similarity between d_i and d_j in the graph G_dd is simply defined as w_ij = |Ā_di ∩ Ā_dj|.

Based on our problem formulation, the name disambiguation solution consists of two phases: (1) document representation and (2) disambiguation. We discuss them below.

Given a name reference a, its associated document set D^a (which we want to disambiguate), and the collaborator set A^a, the document representation phase first constructs the corresponding person-person network G_pp, person-document bipartite network G_pd, and linked document network G_dd. Then our proposed document representation model combines structural information from these three networks to generate a k-dimensional document embedding matrix D = [d_1^T, ..., d_N^T] ∈ IR^{N×k}.

Then the disambiguation phase takes the document embedding matrix D as input and applies hierarchical agglomerative clustering (HAC) with the group-average merging criterion to partition the N documents in D^a into L disjoint sets, with the expectation that each set is composed of documents of a unique person entity sharing the name reference a. At this stage, L is a user-defined parameter which we match with the ground truth during the evaluation phase.
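To make the similarity weight w_ij = |Ā_di ∩ Ā_dj| concrete, the following sketch (a simplified illustration with hypothetical identifiers; the adjacency dictionary stands in for G_pp) extends each document's collaborator set with the one-hop G_pp neighbors of its collaborators, then intersects the extended sets:

```python
def extended_collaborators(doc_authors, g_pp_adj):
    """Compute the extended collaborator set Ā_di for each document.

    doc_authors: list of collaborator sets A_di, one per document.
    g_pp_adj: adjacency of the person-person graph G_pp, mapping a
    person to the set of persons they collaborated with.
    """
    extended = []
    for authors in doc_authors:
        ext = set(authors)
        for a in authors:
            # add all G_pp neighbors of collaborator a
            ext |= g_pp_adj.get(a, set())
        extended.append(ext)
    return extended

def doc_similarity(ext_i, ext_j):
    # w_ij = |Ā_di ∩ Ā_dj|
    return len(ext_i & ext_j)

# Toy data: documents 0 and 1 share no collaborators, but they are
# linked through the p2-p3 acquaintance in G_pp; document 2 is isolated.
docs = [{"p1", "p2"}, {"p3"}, {"p4"}]
adj = {"p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2"}}
ext = extended_collaborators(docs, adj)
```

In this toy example the first two documents receive a positive weight even without a shared collaborator, which is exactly the second pattern discussed above.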
In real life, though, a user needs to tune the parameter L, which can easily be done with HAC, because HAC provides a hierarchical organization of clusters at all levels, starting from a single cluster up to the case of single-instance clusters, and a user can recover a clustering for any value of L as needed without additional cost. Also, across different L values the cluster assignment of HAC is consistent (i.e., two instances that are in the same cluster for some L value will remain in the same cluster for any smaller L value), which further helps in choosing an appropriate L value.

4 NETWORK EMBEDDING MODEL

In this section, we discuss our proposed representation learning model for name disambiguation. Our goal is to encode the local neighborhood structures captured by the three networks (see Definitions 3.1, 3.2, and 3.3) into the k-dimensional document embedding matrix with strong name disambiguation ability.

The main intuition of our network embedding model is that neighboring nodes in a graph should have more similar vector representations in the embedding space than non-neighboring nodes. For instance, in the linked document network, the affinity between two neighboring vertices d_i and d_j, i.e., e_ij ∈ E_dd, should be larger than the affinity between two non-neighboring vertices d_i and d_t, i.e., e_it ∉ E_dd. The affinity score between two nodes d_i and d_j in G_dd can be calculated as the inner product of their corresponding embedding representations, denoted as S^dd_ij = d_i^T d_j. More specifically, we model the probability of preserving the ranking order S^dd_ij > S^dd_it using the logistic function σ(x) = 1/(1 + e^{−x}). Mathematically,

P(S^dd_ij > S^dd_it | d_i, d_j, d_t) = σ(S^dd_ijt)    (1)

where S^dd_ijt is defined as below:

S^dd_ijt = S^dd_ij − S^dd_it = d_i^T d_j − d_i^T d_t    (2)

As we observe from Equation 1, the larger S^dd_ijt is, the more likely the ranking order S^dd_ij > S^dd_it is preserved.
By assuming all the ranking orders generated from the linked document network G_dd to be independent, the probability P(>|D) of all the ranking orders being preserved, given the document embedding matrix D ∈ IR^{N×k}, is defined as below:

P(>|D) = ∏_{(d_i,d_j) ∈ P_Gdd, (d_i,d_t) ∈ N_Gdd} P(S^dd_ij > S^dd_it | d_i, d_j, d_t)
       = ∏_{(d_i,d_j) ∈ P_Gdd, (d_i,d_t) ∈ N_Gdd} σ(S^dd_ijt)
       = ∏_{(d_i,d_j) ∈ P_Gdd, (d_i,d_t) ∈ N_Gdd} σ(S^dd_ij − S^dd_it)    (3)

where P_Gdd and N_Gdd are the positive and negative training sets in G_dd.

From Equation 3, the goal is to seek the document latent representation D for all nodes in the linked document network G_dd which maximizes P(>|D). For computational convenience, we minimize the following sum of negative log-likelihoods:

OBJ_dd = min_D − ln P(>|D)
       = − Σ_{(d_i,d_j) ∈ P_Gdd, (d_i,d_t) ∈ N_Gdd} ln σ(S^dd_ijt)
       = − Σ_{(d_i,d_j) ∈ P_Gdd, (d_i,d_t) ∈ N_Gdd} ln σ(S^dd_ij − S^dd_it)    (4)

The formulation shown in Equation 4 constructs a probabilistic framework for distinguishing between neighbor nodes and non-neighbor nodes in a linked document network by preserving a ranking order objective function. Using the identical argument, the objective functions for capturing the person-person and person-document relations are given as below:

OBJ_pp = min_A − ln P(>|A) = − Σ_{(a_i,a_j) ∈ P_Gpp, (a_i,a_t) ∈ N_Gpp} ln σ(S^pp_ij − S^pp_it)    (5)

OBJ_pd = min_{A,D} − ln P(>|A, D) = − Σ_{(d_i,a_j) ∈ P_Gpd, (d_i,a_t) ∈ N_Gpd} ln σ(S^pd_ij − S^pd_it)    (6)

where A ∈ IR^{M×k} can be thought of as the person embedding matrix, and M is the number of persons in the collaborator set A.
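As a numerical sanity check of Equations 3 and 4, the per-triplet term −ln σ(S_ij − S_it) can be evaluated directly. The sketch below (toy 2-dimensional embeddings, not from a trained model; the regularization term is omitted for clarity) shows that the loss is small when the positive pair already has the higher affinity and large when the ranking order is violated:

```python
import numpy as np

def sigma(x):
    # logistic function: sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def ranking_loss(d_i, d_j, d_t):
    """Per-triplet term -ln sigma(S_ij - S_it) of Equation 4,
    with S_ij = d_i^T d_j (regularization omitted)."""
    return -np.log(sigma(d_i @ d_j - d_i @ d_t))

# Toy embeddings: d_j is meant to be a neighbor of d_i, d_t is not.
d_i = np.array([1.0, 0.0])
d_j = np.array([0.9, 0.1])
d_t = np.array([-0.8, 0.2])
```

Swapping the roles of d_j and d_t in `ranking_loss` simulates a violated ranking order, and the loss grows accordingly; minimizing the sum of such terms is what pushes neighbors together and non-neighbors apart.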
S^pp_ij represents the affinity score between two nodes a_i and a_j in the collaboration graph G_pp, and S^pd_ij denotes the affinity score between two nodes d_i and a_j in the heterogeneous bipartite graph G_pd. Finally, P_Gpp and N_Gpp are the positive and negative training sets in G_pp, and P_Gpd and N_Gpd are the positive and negative training sets in G_pd, respectively.

The goal of the proposed network embedding framework is to unify these three types of relations, where the person and document vertices are shared across the three networks. An intuitive manner is to collectively embed the three networks, which can be achieved by minimizing the following objective function:

OBJ = min_{A,D} OBJ_pp + OBJ_pd + OBJ_dd + λ Reg(A, D)    (7)

where λ Reg(A, D) in Equation 7 is an l2-norm regularization term to prevent the model from overfitting. Here, for computational convenience, we set Reg(A, D) = ‖A‖²_F + ‖D‖²_F. Such a pairwise ranking loss objective is in a similar spirit to Bayesian Personalized Ranking [8, 29], which aims to predict the interaction between users and items in the recommender system domain.

We use the stochastic gradient descent (SGD) algorithm for optimizing Equation 7. Specifically, in each step we sample the training instances involved in the person-person, person-document, and document-document relations accordingly. The sampling strategy for positive instances is based on edge sampling [23]. For example, in the linked document network G_dd, given an arbitrary node d_i, we sample one of its neighbors d_j, i.e., (d_i, d_j) ∈ P_Gdd, with probability proportional to the edge weight, for the model update. On the other hand, for sampling negative instances, we utilize a uniform sampling technique.
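This sampling scheme can be sketched in a few lines of Python (the graph data and identifiers are illustrative): a positive neighbor is drawn with probability proportional to the edge weight via `random.choices`, and a negative node is drawn uniformly from the non-neighbors.

```python
import random

def sample_triplet(adj, nodes, rng=random):
    """Draw one training triplet (d_i, d_j, d_t) from a weighted graph.

    adj: dict mapping a node to a dict {neighbor: edge_weight}.
    Positive pair: a neighbor d_j of d_i, chosen with probability
    proportional to w_ij (edge sampling). Negative instance: a node
    d_t sampled uniformly among the non-neighbors of d_i.
    """
    d_i = rng.choice([n for n in nodes if adj.get(n)])
    neighbors = adj[d_i]
    d_j = rng.choices(list(neighbors), weights=list(neighbors.values()))[0]
    while True:
        d_t = rng.choice(nodes)
        if d_t != d_i and d_t not in neighbors:
            return d_i, d_j, d_t

# Toy anonymized document graph G_dd; "d4" is isolated, so it can
# only ever be drawn as a negative instance.
g_dd = {"d1": {"d2": 3, "d3": 1}, "d2": {"d1": 3}, "d3": {"d1": 1}}
nodes = ["d1", "d2", "d3", "d4"]
triplet = sample_triplet(g_dd, nodes)
```

The rejection loop for negatives is a simplification; it terminates as long as every sampled d_i has at least one non-neighbor, which holds in sparse real-life graphs.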
In particular, given the sampled node d_i, we sample an arbitrary negative instance d_t uniformly, namely (d_i, d_t) ∈ N_Gdd.

Therefore, given a sampled triplet (d_i, d_j, d_t) with (d_i, d_j) ∈ P_Gdd and (d_i, d_t) ∈ N_Gdd, using the chain rule and back-propagation, the gradient of the objective function OBJ in Equation 7 w.r.t. d_i can be computed as below:

∂OBJ/∂d_i = − ∂ ln σ(S^dd_ij − S^dd_it)/∂d_i + λ d_i
          = − [1/σ(S^dd_ij − S^dd_it)] × σ(S^dd_ij − S^dd_it)(1 − σ(S^dd_ij − S^dd_it)) × (d_j − d_t) + λ d_i
          = − ( e^{−(d_i^T d_j − d_i^T d_t)} / (1 + e^{−(d_i^T d_j − d_i^T d_t)}) ) (d_j − d_t) + λ d_i    (8)

Using a similar chain rule derivation, the gradients of the objective function OBJ w.r.t. d_j and d_t can be obtained as follows:

∂OBJ/∂d_j = − ( e^{−(d_i^T d_j − d_i^T d_t)} / (1 + e^{−(d_i^T d_j − d_i^T d_t)}) ) d_i + λ d_j    (9)

∂OBJ/∂d_t = − ( e^{−(d_i^T d_j − d_i^T d_t)} / (1 + e^{−(d_i^T d_j − d_i^T d_t)}) ) (−d_i) + λ d_t    (10)

Then the embedding vectors d_i, d_j, and d_t are updated as below:

d_i = d_i − α ∂OBJ/∂d_i,  d_j = d_j − α ∂OBJ/∂d_j,  d_t = d_t − α ∂OBJ/∂d_t    (11)

where α is the learning rate. Likewise, when the training instances come from the person-person network or the person-document bipartite network, we update the corresponding gradients accordingly. We omit the detailed derivations here since they are very similar to the aforementioned ones.

Algorithm 1
Network Embedding based Name Disambiguation in Anonymized Graphs
Input: name reference a, dimension k, λ, α, L
Output: document embedding matrix D and its clustering membership set C
1: Given name reference a, construct its associated D^a, A^a, G_pp, G_pd, G_dd
2: Given G_pp, G_pd, G_dd, construct the training sample sets P_Gpp, N_Gpp, P_Gpd, N_Gpd, P_Gdd, N_Gdd respectively, based on edge sampling and uniform sampling techniques
3: Initialize A and D as k-dimensional matrices
4: for each training instance in the training sample sets do
5:   Update the involved parameters using SGD as described in Section 4.2
6: end for
7: Given D and L, perform HAC to partition the N documents in D^a into L disjoint sets for name disambiguation
8: return D, C = {c_1, c_2, ..., c_N}

The pseudo-code of the proposed network embedding method for name disambiguation under anonymized graphs is summarized in Algorithm 1. The entire process consists of two phases: network embedding for document representation, and name disambiguation by clustering. Specifically, given a name reference a and its associated document set D^a that we aim to disambiguate, we first prepare the training instances in Lines 1-2. Line 3 initializes the person and document embedding matrices A and D by randomly sampling elements from a uniform distribution [−., .]. Then we train our proposed network embedding model and update A and D using the training samples based on the SGD optimization in Lines 4-6. Then, given the obtained document embedding matrix D and L, in Line 7, we perform HAC to partition the N documents in D^a into L disjoint sets such that each partition contains documents of a unique person entity with name reference a. Finally, in Line 8, we return the document embedding matrix D and its clustering membership set C = {c_1, ..., c_i, ..., c_N} for evaluation, where 1 ≤ c_i ≤ L.

For the time complexity analysis: for the document embedding, when the training sample is (d_i, d_j) ∈ P_Gdd, as observed from Equations 8, 9 and 11, the cost of calculating the gradient of OBJ w.r.t.
d_i and d_j, and updating d_i and d_j, are both O(k). A similar analysis applies when the training instances are from P_Gpp, N_Gpp, P_Gpd, N_Gpd, or N_Gdd. Therefore, the total computational cost of the embedding step is O((|P_Gpp| + |P_Gpd| + |P_Gdd|) · k). For the name disambiguation, the computational cost of hierarchical clustering is O(N log N) [28]. So the total computational complexity of Algorithm 1 is O((|P_Gpp| + |P_Gpd| + |P_Gdd|) · k) + O(N log N).

5 EXPERIMENTS

We perform several experiments to validate the performance of our proposed network embedding method for solving the name disambiguation task in a privacy-preserving setting using only linked data. We also compare our method with various other methods to demonstrate its superiority over those methods.

[Table 1: Arnetminer Name Disambiguation Dataset]
A key challenge for the evaluation of the name disambiguation task is the lack of labeled datasets from diverse application domains. In recent years, the bibliographic repository sites Arnetminer (https://aminer.org/disambiguation) and CiteSeerX (http://clgiles.ist.psu.edu/data/) have published several ambiguous author name references along with the respective ground truths (the paper list of each real-life author), which we use for evaluation. From each of these two sources, we use 10 highly ambiguous name references (those having a larger number of distinct authors for a given name) and report the performance of our method on these name references. The statistics of the name references in the Arnetminer and CiteSeerX datasets are shown in Table 1 and Table 2, respectively. In these tables, for each name reference, we show the number of documents and the number of distinct authors associated with that name reference. It is important to understand that a name disambiguation model is built on a name reference, not on a source dataset such as Arnetminer or CiteSeerX as a whole, so each name reference is a distinct dataset on which the evaluation is performed.

To validate the disambiguation performance of our proposed approach, we compare it against 9 different methods. For a fair comparison, all of these methods accommodate name disambiguation using only relational data. Among the competing methods, Rand, AuthorList, and AuthorList-NNMF are a set of primitive
Table 2: CiteSeerX Name Disambiguation Dataset

baselines that we have designed, while the remaining methods are taken from recently published works. For instance, GF, DeepWalk, LINE, Node2Vec, and PTE are existing state-of-the-art approaches for vertex embedding, which we use for name disambiguation by clustering the documents with HAC in the embedding space, similar to our approach. The graphlet-based graph kernel methods (GL3, GL4) are existing state-of-the-art approaches for name disambiguation in anonymized graphs. More details of each of the competing methods are given below. For each method, for a given name reference, a list of documents needs to be partitioned among L (user-defined) different clusters.

(1) Rand: This naive method randomly assigns one of the existing classes to each of the associated documents.

(2) AuthorList:
Given the associated documents, we first aggregate the author lists of all documents into an author array, then define a binary feature for each author indicating his or her presence or absence in the author list of a document. Finally, we use HAC with the generated author-list features for the disambiguation task.

(3) AuthorList-NNMF:
We perform Non-Negative Matrix Factorization (NNMF) on the author-list features generated as described above. Then the latent features from NNMF are used in a HAC framework for the disambiguation task.

(4) Graph Factorization (GF) [16]:
We first represent the co-authorship network G_pp and the linked document network G_dd as affinity matrices, and then utilize a matrix factorization technique to represent each document as a low-dimensional vector. Note that GF is optimized via a point-wise regression model that minimizes a square loss function, whereas the objective of our proposed embedding approach minimizes a ranking loss function, which is substantially different from GF.

(5) DeepWalk [19]: DeepWalk is a recently proposed approach for network embedding, which is only applicable to homogeneous networks with binary edges. Given G_pp and G_dd, we use uniform random walks to obtain the contextual information of each node's neighborhood for document embedding.

(6) LINE [24]: LINE aims to learn a document embedding that preserves both the first-order and second-order proximities (implementation code is available at https://github.com/tangjianpku/LINE). Note that LINE can only handle the embedding of a homogeneous network, and its embedding formulation and optimization are quite different from the ones proposed in our work.

(7) Node2Vec [9]:
Similar to DeepWalk, Node2Vec designs a biased random walk procedure for document embedding; we use the code from https://github.com/aditya-grover/node2vec.

(8) PTE [23]: The Predictive Text Embedding (PTE) framework aims to capture word-word, word-document, and word-label relations. However, such keyword and label based biographical features are not available in the anonymized setup. Instead, we utilize the local structural information of both the G_pp and G_pd networks to learn the document embedding. This approach, however, is not able to capture the linked information among documents.

(9) Graph Kernel [14]: In this work, size-3 graphlets (GL3) and size-4 graphlets (GL4) are used to build graph kernels, which measure the similarity between documents; the kernel values are obtained with source code supplied by the original authors. The learned similarity metric is then used as features in HAC for name disambiguation. As we see, both kernels only use network topological information.

For each of the 20 name references, we perform the name disambiguation task using our proposed method and each of the competing methods, to demonstrate that our proposed method is superior to the competing methods. As the evaluation metric, we use the Macro-F1 measure [28], which is the unweighted average of the F1 measure of each class. The Macro-F1 measure ranges between 0 and 1, and a higher value indicates better disambiguation performance. Besides the comparison with competing methodologies, we also perform experiments to show that our method is robust against variation of the user-defined parameters (specifically, the embedding dimension and the number of clusters) over a wide range of parameter values. Experiments are also performed to show how the embedding model performs as each of the three types of networks (person-person, person-document, and document-document) is incrementally added. Finally, we show the convergence of the learning model during the document embedding phase.

There are a few user-defined parameters in our proposed embedding model. The first among these is the embedding dimension k, which we set to 20.
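As a concrete reference point for the primitive baselines listed above, the AuthorList method (2) can be sketched in a few lines with scikit-learn. The helper name and the choice of average linkage are our own illustrative assumptions; the paper does not specify these details.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def author_list_baseline(author_lists, L):
    """AuthorList baseline sketch: one binary presence/absence feature
    per co-author, clustered with HAC into L groups.
    `author_lists` is a list of author-name lists, one per document."""
    authors = sorted({a for doc in author_lists for a in doc})
    index = {a: i for i, a in enumerate(authors)}
    # binary document-by-author feature matrix
    X = np.zeros((len(author_lists), len(authors)))
    for row, doc in enumerate(author_lists):
        for a in doc:
            X[row, index[a]] = 1.0
    hac = AgglomerativeClustering(n_clusters=L, linkage="average")
    return hac.fit_predict(X)
```

Documents that share co-authors end up close in this feature space, so HAC groups them together; the AuthorList-NNMF variant (3) simply factorizes X with NNMF before clustering.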
For the regularization parameter λ in the model inference (see Section 4.2), we perform a grid search on the validation set over a range of candidate values. In addition, we fix the learning rate α = 0.02. For the disambiguation stage, we use the actual number of classes L of each name reference as input to perform HAC. For both data processing and model implementation, we wrote our own code in Python, using the NumPy, SciPy, scikit-learn, and NetworkX libraries for linear algebra, machine learning, and graph operations. We run all the experiments on a 2.1 GHz machine with 8 GB memory running the Linux operating system.

Table 3 and Table 4 show the performance comparison of name disambiguation between our proposed method and the other competing methods for all 20 name references (one table for Arnetminer names, the other for CiteSeerX names). In both tables, the rows correspond to the name references and the columns (2 to 12) stand for the various methods. The competing methods are grouped
CIKM'17, Nov 2017, Singapore
Table 3: Comparison of Macro-F1 values between our proposed method and other competing methods for the name disambiguation task on the Arnetminer dataset (embedding dimension = 20). A paired t-test is conducted on all performance comparisons and shows that all improvements are significant at the 0.01 level.
Table 4: Comparison of Macro-F1 values between our proposed method and other competing methods for the name disambiguation task on the CiteSeerX dataset (embedding dimension = 20). A paired t-test is conducted on all performance comparisons and shows that all improvements are significant at the 0.01 level.

logically. The first group includes the baseline methods that we have designed, such as the random predictor (Rand) and the methods using a low-dimensional factorization of the author list for clustering. The second group includes various state-of-the-art network embedding methodologies, and the third group includes the two methods using graphlet-based graph kernels. The cell values give the performance of a method, measured by the Macro-F1 score, for disambiguating the documents under a given name reference. The last column shows the overall improvement of our proposed method compared with the best competing method. Since the SGD-based optimization technique in our proposed embedding model is randomized, for each name reference we execute the method 10 times and report the average Macro-F1 score. For our method, we also show the standard deviation in parentheses (standard deviations of the other competing methods are not shown due to the space limit). For better visual comparison, we highlight the best Macro-F1 score of each name reference in bold-face font.

As we observe, our proposed embedding model performs the best for 9 and 8 name references (out of 10) in Table 3 and Table 4, respectively. Besides, the overall percentage improvement that our method delivers over the second best method is relatively large. As an example, consider the name "S Lee" shown in the last row of Table 4. This is a difficult disambiguation task; as Table 2 shows, it has 1091 documents and 74 distinct real-life authors! A random predictor (Rand) obtains a Macro-F1 of only 0.057 due to the large number of classes, whereas our method achieves a Macro-F1 score of 0.624 for this name reference; the second best method for this name (GF) achieves a far lower score.
Figure 2: The effects of embedding dimension on the name disambiguation performance

To be precise, PTE performs poorly as it fails to incorporate the linked structural information among the documents. Both GF and LINE outperform DeepWalk on the majority of name references. This is because DeepWalk ignores the weights of the edges, which are considered very important in the linked document network. However, none of the embedding-based competing methods can encode the document co-occurrence by exploiting the information from multiple networks, which our proposed model does exploit. Besides, as mentioned earlier, our similarity-ranking-based objective function is better suited than the K-L divergence based objective functions for placing the nodes in the embedding space so as to facilitate a downstream clustering task. This is possibly a significant reason why our method shows superior performance over the existing network embedding based methods.
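For reference, the Macro-F1 evaluation of a clustering requires aligning predicted cluster ids with ground-truth authors before per-class F1 can be averaged. The Hungarian-matching protocol below is a common convention and our own assumption, not a detail stated in the paper; it assumes the number of clusters equals the number of classes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import f1_score

def clustering_macro_f1(y_true, y_pred):
    """Macro-F1 for a clustering: map each predicted cluster to the
    ground-truth author that maximizes total overlap (Hungarian
    algorithm on the contingency table), then average per-class F1.
    Assumes len(unique(y_pred)) == len(unique(y_true))."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # negative overlap counts, so the assignment maximizes agreement
    cost = np.zeros((len(clusters), len(classes)))
    for ci, c in enumerate(clusters):
        for ki, k in enumerate(classes):
            cost[ci, ki] = -np.sum((y_pred == c) & (y_true == k))
    rows, cols = linear_sum_assignment(cost)
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    mapped = np.array([mapping[c] for c in y_pred])
    return f1_score(y_true, mapped, average="macro")
```

Under this protocol a perfect clustering scores 1.0 regardless of how the cluster ids are numbered, which is why it suits methods like HAC whose labels are arbitrary.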
We also perform an experiment to show how the embedding dimension k affects the disambiguation performance of our proposed method. Specifically, we vary the embedding dimension k over {10, 20, 30, 40, 50}. For the sake of space, for each of the datasets we show the average results over all the 10 name references. The disambiguation results are given in Figure 2. As we observe, for both datasets, as the embedding dimension increases, the disambiguation performance in terms of Macro-F1 first increases and then decreases. A possible explanation is that when the embedding dimension is too small, the representation capability of the embedding is insufficient, whereas when the embedding dimension is too large, the proposed embedding model may overfit the data, leading to unsatisfactory disambiguation performance.

One of the potential problems for name disambiguation is to determine the number of real-life persons L under a given name reference, because in real life L is generally unknown a priori. So a method whose performance is superior over a range of L values should be preferred. For this comparison, after learning the document representation, we use various L values as input in the HAC
Figure 3: Macro-F1 results of multiple L values on name reference "Lei Wang" using Our Method, GF, and LINE (embedding dimension = 20).
Figure 4: Component contribution analysis in terms of name disambiguation performance, using Arnetminer and CiteSeerX as a whole source (embedding dimension = 20).

for name disambiguation and record the Macro-F1 score over the different L values for the competing methods. In our experiment, we compare the Macro-F1 value of our method with the two other best performing methods over several names, but due to the space limitation we show this result only for one name ("Lei Wang" in Arnetminer) using bar charts in Figure 3. In this figure, we compare the performance differences between our method and the two other best performing methods (GF and LINE) as we vary L over {40, 45, 50, 55, 60}. Note that the actual number of distinct authors under "Lei Wang" is 48, as shown in Table 1. As we can see, our proposed method outperforms the state-of-the-art for all the different L values, and the overall improvement of our method over these two methods is statistically significant with a p-value of less than 0.01. Because of the robustness of our proposed embedding method regardless of the L value, it is the better method for real-life applications.

Our proposed network embedding model is composed of three types of networks, namely the person-person, person-document, and linked document networks (explained in Section 3). In this section we study the contribution of each of the three components for the
Figure 5: Convergence analysis, in terms of both the objective loss ((a) Loss vs. Epoch) and AUC ((b) AUC vs. Epoch), of name reference "Lei Wang" using our proposed network embedding model for name disambiguation.

task of name disambiguation by incrementally adding the components to the network embedding model. Specifically, we first rank each individual component by its disambiguation performance in terms of Macro-F1, then add the components one by one in the order of their disambiguation power. In particular, we first add the person-document graph, followed by the linked document graph and the person-person graph. Figure 4 shows the name disambiguation performance in terms of the Macro-F1 value using our proposed network embedding model with the different component combinations. As we see from the figure, after adding each component we observe improvements for both datasets, in which the results are averaged over all the 10 name references.
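The per-epoch training procedure whose convergence is analyzed next can be sketched as follows. This is a simplified, hypothetical loop over a single linked-document edge list with a logistic pairwise loss (the paper's full model draws samples from all three networks); recording the mean loss per epoch is what allows curves like Figure 5(a) to be plotted.

```python
import numpy as np

def train_with_monitoring(pairs, n_docs, k=20, alpha=0.02, epochs=50, seed=0):
    """Hypothetical SGD loop: each epoch makes one pass over the edge
    list (mirroring the per-epoch sample budget of one instance per
    edge), applies a logistic pairwise-ranking update, and records the
    mean epoch loss so convergence can be monitored."""
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(n_docs, k)) / k  # small random init
    losses = []
    for _ in range(epochs):
        total = 0.0
        for i, j in pairs:
            # sample a negative document distinct from the positive pair
            neg = int(rng.integers(n_docs))
            while neg == i or neg == j:
                neg = int(rng.integers(n_docs))
            x = D[i] @ D[j] - D[i] @ D[neg]   # ranking margin
            total += np.log1p(np.exp(-x))     # loss = -log sigmoid(x)
            g = -1.0 / (1.0 + np.exp(x))      # d(loss)/dx
            di, dj, dn = D[i].copy(), D[j].copy(), D[neg].copy()
            D[i] -= alpha * g * (dj - dn)
            D[j] -= alpha * g * di
            D[neg] += alpha * g * di
        losses.append(total / len(pairs))
    return D, losses
```

Plotting `losses` against the epoch index reproduces the kind of loss-vs-epoch curve shown in Figure 5(a); an AUC curve can be tracked the same way by scoring held-out positive and negative pairs after each epoch.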
We further investigate the convergence of the proposed network embedding algorithm presented in Section 4. Figure 5 shows the convergence analysis of our method for the name reference "Lei Wang" from Arnetminer. For each epoch, we sample (|E_pp| + |E_pd| + |E_dd|) training instances to update the corresponding model embedding vectors. We can observe that our proposed network embedding approach converges approximately within 50 epochs and achieves promising convergence results on both the pairwise-ranking-based objective loss and AUC. However, as shown in Equation 7, the objective function of our proposed embedding model is not convex, so reaching the global optimal solution with an SGD-based optimization technique is fairly challenging. A possible remedy is to decrease the learning rate α in SGD as the number of epochs increases; another strategy is to try multiple runs with different seed initializations. Similar convergence patterns are observed for the other name references as well.

To conclude, in this paper we propose a novel representation learning based solution to the name disambiguation problem. Our proposed representation learning model uses a pairwise ranking objective function, which clusters the documents belonging to a single person better than the existing network embedding methods. Moreover, the proposed solution uses only relational data, so it is particularly useful for name disambiguation in anonymized networks, where node attributes are not available due to privacy concerns. Our experimental results on multiple datasets show that our proposed method significantly outperforms many existing state-of-the-art methods for name disambiguation.
REFERENCES
[1] Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. In ACL. 9–16.
[2] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning Graph Representations with Global Structural Information. In CIKM. 891–900.
[3] Lei Cen, Eduard C. Dragut, Luo Si, and Mourad Ouzzani. 2013. Author Disambiguation by Hierarchical Agglomerative Clustering with Adaptive Stopping Criterion. In SIGIR. 741–744.
[4] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. 2015. Heterogeneous Network Embedding via Deep Architectures. In SIGKDD. 119–128.
[5] P. Y. Chen, S. Choudhury, and A. O. Hero. 2016. Multi-centrality graph spectral decompositions and their application to cyber intrusion detection. In ICASSP'16.
[6] Pin-Yu Chen, Baichuan Zhang, Mohammad Al Hasan, and Alfred O. Hero. 2016. Incremental Method for Spectral Clustering of Increasing Orders. In KDD Workshop on Mining and Learning with Graphs.
[7] Ting Chen and Yizhou Sun. Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM'17). 295–304.
[8] Sutanay Choudhury, Khushbu Agarwal, Sumit Purohit, Baichuan Zhang, Meg Pirrung, Will Smith, and Mathew Thomas. 2017. NOUS: Construction and Querying of Dynamic Knowledge Graphs. IEEE.
[9] Aditya Grover and Jure Leskovec. 2016. Node2Vec: Scalable Feature Learning for Networks. In SIGKDD. 855–864.
[10] Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. 2004. Two Supervised Learning Approaches for Name Disambiguation in Author Citations. In Joint Conf. on Digital Libraries.
[11] Hui Han, Hongyuan Zha, and C. Lee Giles. 2005. Name Disambiguation in Author Citations Using a K-way Spectral Clustering Method. In ACM Joint Conf. on Digital Libraries. 334–343.
[12] Xianpei Han, Le Sun, and Jun Zhao. 2011. Collective Entity Linking in Web Text: A Graph-based Method. In SIGIR.
[13] Xianpei Han and Jun Zhao. 2009. Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge. In CIKM. 215–224.
[14] Linus Hermansson, Tommi Kerola, Fredrik Johansson, Vinay Jethava, and Devdatt Dubhashi. 2013. Entity Disambiguation in Anonymized Graphs Using Graph Kernels. In CIKM. 1037–1046.
[15] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust Disambiguation of Named Entities in Text. In EMNLP.
[16] Da Kuang, Haesun Park, and Chris H. Q. Ding. 2012. Symmetric Non-negative Matrix Factorization for Graph Clustering. In SDM. 106–117.
[17] Bradley Malin. Unsupervised name disambiguation via social network similarity. In SDM'05 Workshop on Link Analysis, Counterterrorism, and Security. 93–102.
[18] Tomas Mikolov, Ilya Sutskever, K. Chen, G. S. Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS'13.
[19] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In SIGKDD. 701–710.
[20] Tanay Kumar Saha, Baichuan Zhang, and Mohammad Al Hasan. 2015. Name disambiguation from link data in a collaboration graph using temporal and topological features. Social Network Analysis and Mining (2015), 1–14.
[21] Yang Song, Jian Huang, Isaac G. Councill, Jia Li, and C. Lee Giles. 2007. Efficient Topic-based Unsupervised Name Disambiguation. In JCDL. 342–351.
[22] Jie Tang, Alvis C. M. Fong, Bo Wang, and Jing Zhang. 2012. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE TKDE (2012).
[23] Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive Text Embedding Through Large-scale Heterogeneous Text Networks. In SIGKDD. 1165–1174.
[24] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale Information Network Embedding. In WWW. 1067–1077.
[25] Cunchao Tu, Han Liu, Zhiyuan Liu, and Maosong Sun. 2017. CANE: Context-Aware Network Embedding for Relation Modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1722–1731.
[26] Suhang Wang, Jiliang Tang, Charu Aggarwal, and Huan Liu. 2016. Linked Document Embedding for Classification. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). 115–124.
[27] Xuezhi Wang, Jie Tang, Hong Cheng, and Philip S. Yu. 2011. ADANA: Active Name Disambiguation. In ICDM. 794–803.
[28] Mohammed J. Zaki and Wagner Meira Jr. 2014. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press.
[29] Baichuan Zhang, Sutanay Choudhury, Mohammad Al Hasan, Xia Ning, Khushbu Agarwal, Sumit Purohit, and Paola Gabriela Pesntez Cabrera. 2016. Trust from the past: Bayesian Personalized Ranking based Link Prediction in Knowledge Graphs. In SDM Workshop on Mining Networks and Graphs.
[30] Baichuan Zhang, Murat Dundar, and Mohammad Al Hasan. 2016. Bayesian Non-Exhaustive Classification. A Case Study: Online Name Disambiguation using Temporal Record Streams. In CIKM'2016. 1341–1350.
[31] Baichuan Zhang, Noman Mohammed, Vachik S. Dave, and Mohammad Al Hasan. 2017. Feature Selection for Classification under Anonymity Constraint. Transactions on Data Privacy 10, 1 (2017), 1–25.
[32] Baichuan Zhang, Tanay Kumar Saha, and Mohammad Al Hasan. Name disambiguation from link data in a collaboration graph. In ASONAM'14. 81–84.
[33] Duo Zhang, Jie Tang, Juanzi Li, and Kehong Wang. 2007. A Constraint-based Probabilistic Framework for Name Disambiguation. In