Pairwise Learning for Name Disambiguation in Large-Scale Heterogeneous Academic Networks
Qingyun Sun, Hao Peng, Jianxin Li, Senzhang Wang, Xiangyu Dong, Liangxuan Zhao, Philip S. Yu, Lifang He
Qingyun Sun*, Hao Peng*, Jianxin Li*, Senzhang Wang†, Xiangyu Dong*, Liangxuan Zhao*, Philip S. Yu‡ and Lifang He§
* Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China
† Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
‡ University of Illinois at Chicago, Chicago 60607, USA
§ Lehigh University, Bethlehem, PA, USA
Email: {sunqy, penghao, lijx}@act.buaa.edu.cn, [email protected], {dongxiangyu, zhaolx}@buaa.edu.cn, [email protected], [email protected]

Abstract: Name disambiguation aims to identify unique authors with the same name. Existing name disambiguation methods always exploit author attributes to enhance disambiguation results. However, some discriminative author attributes (e.g., email and affiliation) may change because of graduation or job-hopping, which will result in the separation of the same author's papers in digital libraries. Although these attributes may change, an author's co-authors and research topics do not change frequently with time, which means that papers within a period have similar text and relation information in the academic network. Inspired by this idea, we introduce the Multi-view Attention-based Pairwise Recurrent Neural Network (MA-PairRNN) to solve the name disambiguation problem. We divide papers into small blocks based on discriminative author attributes, and blocks of the same author are merged according to the pairwise classification results of MA-PairRNN. MA-PairRNN combines heterogeneous graph embedding learning and pairwise similarity learning into one framework. In addition to attribute and structure information, MA-PairRNN also exploits semantic information via meta-paths and generates node representations in an inductive way, which is scalable to large graphs. Furthermore, a semantic-level attention mechanism is adopted to fuse multiple meta-path based representations.
A Pseudo-Siamese network consisting of two RNNs takes two paper sequences in publication time order as input and outputs their similarity. Results on two real-world datasets demonstrate that our framework achieves a significant and consistent performance improvement on the name disambiguation task. We also demonstrate that MA-PairRNN performs well with a small amount of training data and has better generalization ability across different research areas.
Keywords: Name disambiguation, graph embedding, pairwise learning, heterogeneous information network
I. INTRODUCTION
The namesake problem [1] poses a huge challenge to many applications, e.g., information retrieval and bibliographic data analysis. When searching for academic publications by author name, the results may contain a long list of publications by multiple authors with the same name. Some digital libraries (e.g., DBLP and Google Scholar) list candidates after name disambiguation with the corresponding homepage, email, and affiliation to make it easier to obtain all publications of one particular author. The academic impacts of researchers are always measured by the impacts of their publications in the research community. Therefore, it is important to keep publication data in digital libraries accurate, consistent, and up to date.

(Footnote: Qingyun Sun and Hao Peng contributed equally to this work. Jianxin Li is the corresponding author.)

Name disambiguation [2], [3], which aims to identify unique persons with the same name, has been studied for decades but remains largely unsolved. Most of the existing solutions utilize author attributes, including name, affiliation, email, homepage, etc., to generate paper representations or further validate disambiguation results. However, these discriminative attributes, especially email and affiliation, may change because of graduation or job-hopping. We take Jian Pei, the well-known leading researcher in data science, as an example to show the change of discriminative attributes in Fig. 1.
Jian Pei's papers from 2003 to 2005 are associated with [email protected] and State University of New York at Buffalo. His papers from 2005 to 2020 are associated with [email protected] and Simon Fraser University. The change of discriminative attributes may lead to the paper separation problem [4], i.e., papers of one author are regarded as belonging to different authors, which commonly occurs in digital libraries. To address this issue, name disambiguation methods should perform well even when discriminative attributes change.

Even though discriminative attributes may have changed, researchers often have a fixed co-author set and a few specific research areas that do not change frequently over time, which can also be exploited to solve the name disambiguation problem. As shown in Fig. 1, even though Jian Pei has different affiliations and emails in the two time periods, his close co-authors (e.g., Jiawei Han, Ke Wang) are fixed and his research areas (e.g., Data mining, Time series) are also consistent over time.

(Footnotes: https://dblp.uni-trier.de/ and https://scholar.google.com/)

Figure 1. An example of the change of Jian Pei's discriminative attributes.
Figure 2. Academic network.

There are several challenges that should be overcome:
(1) Heterogeneity of the academic network. The academic network is a heterogeneous network that contains multiple entities (e.g., author, paper, venue) and multiple relationships (e.g., writing, publishing), as shown in Fig. 2. It is challenging to preserve diverse structural and semantic information simultaneously.
(2) Inductive capability. Many real-world applications encounter a large number of new papers every day. It is challenging for name disambiguation methods to have the inductive capability to generate representations of new papers efficiently.
(3) Uncertain number of authors. It is challenging to determine the number of authors with the same name. In existing clustering-based name disambiguation methods [2], [3], [5], the number of authors (i.e., the cluster size) is usually a pre-specified parameter.

Current works [6], [7] did not efficiently handle the change of discriminative attributes and the inductive paper embedding problem in the heterogeneous academic network simultaneously. In this work, we propose a novel Multi-view Attention-based Pairwise Recurrent Neural Network framework, namely MA-PairRNN, to solve the name disambiguation problem. The intuitive idea is that an author's papers during a period of time should have more similar representations, since the co-authors and research interests of most authors are consistent despite attribute changes. Inspired by this idea, we formulate name disambiguation as a pairwise paper set classification problem that does not require estimating the number of authors with the same name. We divide papers into small blocks according to discriminative author attributes to reduce the search space of the name disambiguation algorithm. Then small blocks are merged based on pairwise classification results, and each block after merging is the paper set of one author. We represent each paper block as a sequence in publication time order and solve the pairwise classification problem by comparing sequence similarity. MA-PairRNN combines multiple multi-view graph embedding layers, a semantic-level attention layer, and a Pseudo-Siamese recurrent neural network layer to learn node embeddings and node sequence pair similarity simultaneously. Specifically, the multi-view graph embedding layer generates meta-path based embeddings of papers in the heterogeneous academic network. Then, the semantic-level attention layer fuses these meta-path based embeddings into a vector. Finally, the Pseudo-Siamese recurrent neural network layer learns the similarity of a node sequence pair. We elaborate on the three components as follows:
Multi-view graph embedding layer. The multi-view graph embedding layer incorporates meta-paths to capture rich semantic information in the heterogeneous network. The heterogeneous network is converted into multiple relation views according to meta-paths. For each view, we learn K aggregator functions to incorporate the K-hop neighborhood of each node. In this way, node embeddings are generated by enhancing node features with semantics.

Semantic attention layer. The semantic attention layer captures the importance of meta-paths via an attention mechanism and fuses semantic information for specific tasks.

Pseudo-Siamese recurrent neural network layer. The Pseudo-Siamese recurrent neural network is composed of two recurrent neural networks, which are used to learn the inherent relations of paper sequences. It takes two sequences of paper embeddings as input and outputs their similarity.

The main contributions are summarized as follows:
• We propose a novel pairwise classification framework called MA-PairRNN for the name disambiguation task, which learns heterogeneous graph representations and paper set pairwise similarity simultaneously.
• Under MA-PairRNN, we propose an inductive graph embedding method that takes both the heterogeneity and the large scale of the academic network into account. A semantic-level attention mechanism is leveraged to put different emphases on each of the meta-paths. A Pseudo-Siamese recurrent neural network is adopted to learn inherent relations and measure the similarity of two paper sets.
• We conduct extensive experiments on AMiner-AND and a large-scale real-world dataset collected from Semantic Scholar. The results illustrate the best performance as well as the good generalization ability of the proposed MA-PairRNN compared to other methods.

The code of MA-PairRNN is available at https://github.com/RingBDStack/MA-PairRNN.

II. RELATED WORK

In this section, we will briefly review name disambiguation methods and graph embedding methods.

A. Name Disambiguation
Name disambiguation methods can be divided into supervised [1], [8], unsupervised [6], [9], and graph-based ones [2], [5]. Graph-based works exploit graph topological features in the academic network to enhance the representation of papers. For instance, GHOST [2] constructs a document graph based on co-authorship. [5] leverages only relational data in the form of anonymized graphs to preserve author privacy. Pairwise classification methods are applied to estimate the probability that a pair of author mentions belongs to the same author and are essential in the name disambiguation task. [6] first learns a representation for every name mention in a pairwise or tripletwise way and refines the representation by a graph auto-encoder, but this method neglects the linkage between paper and author as well as co-authorship. [7] addresses the pairwise classification problem by extracting both structure-aware features and global features, without considering semantic features. In this paper, we focus on the paper-set-level pairwise classification problem and exploit attribute, structure, and semantic features to form better representations.
B. Graph Embedding
Graph embedding aims to represent a graph as a low-dimensional vector while preserving graph structure and properties. Recently, Graph Neural Networks (GNNs) [10] have attracted rising attention due to their effective representation ability. While most GNN works [10]–[12] focus on the transductive setting, there have been some recent works adopting an inductive learning setting. DeepGL [13] aggregates a set of base graph features by relational functions that can generalize across networks. GraphSage [14] samples a fixed number of neighbors and generates node embeddings by aggregating their features. Both DeepGL and GraphSage are designed for homogeneous graphs. LAN [15] aggregates neighbors with both rule-based and network-based attention weights for knowledge graphs.

Heterogeneous information networks [16]–[19] have been studied in recent years. The meta-path is designed to preserve diverse semantic information of node types and edge types [20]–[22]. GTN [23] converts a heterogeneous graph into new graph structures, which involves identifying task-specific meta-paths and multi-hop connections. HAN [24] includes both node-level and semantic-level attention to take the importance of nodes and meta-paths into consideration simultaneously.

In this paper, we propose an inductive graph embedding method utilizing rich heterogeneous information.

III. PROPOSED METHOD
A. Problem Definition
In this section, we formally define the Heterogeneous Academic Network and the problem of Name Disambiguation.
Definition 1 (Heterogeneous Academic Network): A Heterogeneous Academic Network is defined as $G = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ and $\mathcal{E}$ denote the set of nodes and edges, respectively. A Heterogeneous Academic Network is associated with a node type mapping function $f_v: \mathcal{V} \rightarrow O$ and an edge type mapping function $f_e: \mathcal{E} \rightarrow R$. $O = \{P, A, T, V\}$ denotes the node type set, and $R = \{A \text{ writes } P,\ P \text{ cites } P,\ P \text{ is related to } T,\ P \text{ is published in } V\}$ denotes the edge type set, where $P, A, T, V$ denote the types of Paper, Author, Topic, and Venue, respectively.
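To make Definition 1 concrete, the sketch below builds a tiny heterogeneous academic network with explicit type mappings $f_v$ and $f_e$. The class, the edge-type names, and the toy graph itself are invented for illustration and are not the paper's data structures.

```python
from collections import defaultdict

# Node types O and edge types R from Definition 1.
NODE_TYPES = {"P", "A", "T", "V"}          # Paper, Author, Topic, Venue
EDGE_TYPES = {"writes", "cites", "related_to", "published_in"}

class HeteroAcademicNetwork:
    """Minimal heterogeneous network G = {V, E} with type mappings f_v and f_e."""
    def __init__(self):
        self.node_type = {}                 # f_v: V -> O
        self.adj = defaultdict(list)        # v -> [(neighbor, edge type)]

    def add_node(self, v, ntype):
        assert ntype in NODE_TYPES
        self.node_type[v] = ntype

    def add_edge(self, u, v, etype):
        assert etype in EDGE_TYPES          # f_e: E -> R
        self.adj[u].append((v, etype))
        self.adj[v].append((u, etype))

# A tiny invented example: one author writes two papers; one is published at a venue.
g = HeteroAcademicNetwork()
g.add_node("a1", "A"); g.add_node("p1", "P"); g.add_node("p2", "P"); g.add_node("v1", "V")
g.add_edge("a1", "p1", "writes")
g.add_edge("a1", "p2", "writes")
g.add_edge("p1", "v1", "published_in")
print(g.node_type["p1"], len(g.adj["a1"]))  # prints: P 2
```

Edges are stored in both directions so meta-path neighborhoods (e.g., Paper-Author-Paper) can later be traversed from either endpoint.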
Definition 2 (Name Disambiguation): Given a name $a$, $D^a = \{d^a_1, d^a_2, \ldots, d^a_N\}$ is a set of papers with name mention $a$. Every paper $d^a_k$ consists of some metadata, including paper attributes (e.g., title, year, venue, keywords) and author attributes (e.g., name, email, affiliation). The objective of name disambiguation is to partition all name mentions into a set of unique authors $C^a = \{c^a_1, c^a_2, \ldots, c^a_n\}$.

B. Model Architecture
In this section, we propose a novel framework named MA-PairRNN for name disambiguation. As described above, the main intuition is that papers of the same author within a period should have similar representations in the academic network, since the author's research topics and scholarly relations are consistent. We divide the set of papers $D^a$ into small blocks by discriminative author attributes in the metadata. These small blocks will be merged based on the pairwise classification results of MA-PairRNN. First, the multi-view inductive graph embedding layer is designed to generate the paper representation for each meta-path. Then a semantic attention layer is designed to learn the importance of meta-paths and fuse the meta-path based representations. Finally, the papers in every block are arranged as a sequence, denoted as $s \in S$, according to their publication time. Two sequences of paper embeddings are fed into a Pseudo-Siamese network with two RNNs for pairwise similarity learning. The overall architecture of MA-PairRNN is shown in Fig. 3.

C. Multi-View Graph Embedding Layer
The multi-view graph embedding layer generates node representations inductively by learning a function to aggregate attribute and topology information from local neighborhoods. To exploit the rich semantic information in the heterogeneous academic network, we propose the concept of the meta-path based view. Given a heterogeneous academic network $G = \{\mathcal{V}, \mathcal{E}\}$ and a meta-path $p$, a meta-path based view $G_p$ is derived from a type of proximity or relationship between nodes characterized by the meta-path. It can capture different aspects of structure information through meta-paths and makes it possible to add new nodes dynamically.

Figure 3. An overview of our overall network architecture.

For each meta-path based view, similar to GraphSage [14], node representations are generated by aggregating the features of meta-path based neighbors and propagating information across $K$ layers. Node $v_i$'s representation based on meta-path $p$ is generated as below. First, in the $k$-th layer, each node aggregates its own representation and the representations of its 1-hop neighborhood $N_i$ generated by the $(k-1)$-th layer into a single vector $z^{(k)}_p(N_i)$ as (1):

$$z^{(k)}_p(N_i) = \mathrm{mean}(\{z^{(k-1)}_p(v_j), \forall v_j \in \{v_i\} \cup N_i\}), \qquad (1)$$

where $z^{(k-1)}_p(v_j)$ denotes the representation of $v_j$ in the $(k-1)$-th layer. When $k = 0$, $z^{(0)}_p(v_j)$ is defined as the original feature $x(v_j)$ of $v_j$. Then a weight matrix $W^{(k)}_p$ and a bias vector $b^{(k)}_p$ are used to transfer information between layers as (2):

$$z^{(k)}_p(v_i) = \sigma(W^{(k)}_p \cdot z^{(k-1)}_p(N_i) + b^{(k)}_p). \qquad (2)$$

To extend the algorithm to a mini-batch setting, we first sample the $l$-egonet of the papers in the batch. The $l$-egonet of node $v$ is defined as the set of its $l$-hop neighbors and all edges between nodes in the set. For each batch, multi-view subgraphs are constructed based on the union of the $l$-egonets of all paper nodes in this batch. Then we generate meta-path based representations of every node in these multi-view subgraphs. For more convenient notation, we denote $v_i$'s final representation based on meta-path $p$ after $K$ layers as $z_p(v_i) \equiv z^{(K)}_p(v_i)$, where $z_p(v_i) \in \mathbb{R}^d$.

D. Semantic Attention Layer
For each paper, multiple meta-path based representations are obtained, and they can collaborate with each other. Since we assume that the importance of meta-paths varies, an attention mechanism is adopted to capture their contributions and fuse the meta-path based node representations.

We first introduce a meta-path preference vector $a_p \in \mathbb{R}^{d'}$ for each meta-path $p$ to guide the semantic attention mechanism. For a meta-path based representation $z_p$ and the meta-path preference vector $a_p$, the more similar they are, the greater the weight that will be assigned to $z_p$. We use a non-linear function to transform the $d$-dimensional meta-path based embedding into $d'$ dimensions as (3):

$$z'_p(v_i) = \sigma(W_p \cdot z_p(v_i) + b_p), \qquad (3)$$

where $W_p \in \mathbb{R}^{d' \times d}$ is the weight parameter and $b_p \in \mathbb{R}^{d'}$ is the bias parameter of the transformation. $z'_p(v_i) \in \mathbb{R}^{d'}$ is the representation of node $v_i$ based on meta-path $p$ after the transformation. The similarity $\omega_p(v_i)$ of the transformed representation vector and the preference vector is calculated as (4):

$$\omega_p(v_i) = \frac{a_p^T \cdot z'_p(v_i)}{\|a_p\| \cdot \|z'_p(v_i)\|}, \qquad (4)$$

where $\|\cdot\|$ is the L2 norm of a vector. The weight of meta-path $p$ for node $v_i$ is defined using a softmax unit as follows:

$$\omega'_p(v_i) = \frac{\exp(\omega_p(v_i))}{\sum_{p' \in P} \exp(\omega_{p'}(v_i))}. \qquad (5)$$

The final representation of node $v_i$ is generated by fusing all meta-path based representations in the weighted-sum form:

$$z(v_i) = \sum_{p' \in P} \omega'_{p'}(v_i) \cdot z_{p'}(v_i). \qquad (6)$$

E. Pseudo-Siamese Recurrent Neural Network Layer
We design a Pseudo-Siamese recurrent neural network layer to capture the inherent relations of papers and measure the similarity of two paper sets. The Pseudo-Siamese recurrent neural network layer is a Pseudo-Siamese network consisting of two RNNs with different parameters that generate the representations of two node sequences. Specifically, we feed the two sequences of paper embeddings into the two RNNs respectively. The learned embedding of each paper is taken as the input of an RNN unit. The output of each RNN unit can be formalized as:

$$h_t = \mathrm{RNN}(z_t, \theta_t), \qquad (7)$$

where $\theta_t$ denotes the parameters of the RNN unit. Here we apply the popular LSTM to capture the inherent relations of paper sequences and learn their similarity. Note that the paper sequence published earlier is fed in publication time order and the other sequence is fed in reverse. This setting is based on the assumption that an author's research topics and co-authors are stable during the period of attribute change. All outputs of the RNN units are aggregated by a $\mathrm{GlobalPool}$ function to generate the representation of the paper sequence as follows:

$$h = \mathrm{GlobalPool}(\{h_t, t = 1, 2, \cdots, |s|\}), \qquad (8)$$

where $|\cdot|$ denotes the length of the sequence. We apply a simple averaging strategy as the $\mathrm{GlobalPool}$ function here. The final representations of the two paper sequences, $h^{(1)}$ and $h^{(2)}$, are concatenated and then fed into a multi-layer fully connected neural network:

$$\hat{y}_s = \sigma(\mathrm{MLP}([h^{(1)}, h^{(2)}])), \qquad (9)$$

where $\sigma(\cdot)$ denotes the softmax function and $[\cdot, \cdot]$ represents the concatenation operation.

Since our task is classification, the loss function $L_{classify}$ can be defined as the Cross-Entropy between the ground truth and the predicted results over all labeled node sequence pairs. The proposed framework can be trained on a set of example pairs. For each pair of paper sequences, a cosine score function is applied to measure the similarity of the two paper sequence representations as (10).
$$L_{sim} = \mathrm{sim}(h^{(1)}, h^{(2)}) = \frac{h^{(1)} \cdot h^{(2)}}{\|h^{(1)}\| \cdot \|h^{(2)}\|}. \qquad (10)$$

The pairwise similarity loss function encourages node sequences of the same author to have similar representations and enforces those of different authors to be highly distinct. The model is then trained to minimize the sum of the losses as follows:

$$L = L_{classify} + \eta \cdot L_{sim}, \qquad (11)$$

where $\eta$ denotes the coefficient of the pairwise similarity loss. The overall process of MA-PairRNN is shown in Algorithm 1.

IV. EXPERIMENTS

A. Dataset
For our experiments, we use two datasets: Aminer-AND and Semantic Scholar.
• Aminer-AND [6]: This dataset contains 70,285 records of 12,798 unique authors with 100 ambiguous name references.
Algorithm 1: The overall process of MA-PairRNN

Input: paper set $D$; heterogeneous graph $G = \{\mathcal{V}, \mathcal{E}\}$; node features $\{x(v), \forall v \in \mathcal{V}\}$; meta-path set $P = \{p_1, p_2, \cdots, p_M\}$; number of multi-view graph embedding layers $K$
Output: meta-path based node representations $\{z_{p_1}, z_{p_2}, \cdots, z_{p_M}\}$

Separate the paper set $D$ into small blocks according to discriminative author attributes;
Arrange the papers in every block as a sequence $s \in S$;
Construct the meta-path based views $\{G_{p_1}, G_{p_2}, \cdots, G_{p_M}\}$;
$z^{(0)}_p(v_i) = x(v_i), \forall v_i \in \mathcal{V}$;
while not converged do
  for $v_i \in \mathcal{V}$ do
    for $p \in P$ do
      for $k = 1, 2, \cdots, K$ do
        Aggregate the meta-path based neighbor information from the previous layer by (1);
        Calculate the representation of the current layer by (2);
      end
    end
    Calculate the attention weight of each meta-path by (3), (4), (5);
    Fuse the semantic representations of the meta-path based views by (6);
  end
  for $s \in S$ do
    Calculate the representations of the sequence pair by (7) and (8);
    Classify the sequence pair by (9);
  end
  Calculate the loss by (10) and (11);
end
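As a hedged sketch of Algorithm 1's outer workflow, the fragment below implements the block-merging step with a union-find structure and a mean-aggregation step in the spirit of Eqs. (1)-(2). The scalar weight, the tanh nonlinearity, and the `same_author` predicate are illustrative stand-ins for the trained matrices $W^{(k)}_p$, the activation $\sigma$, and the MA-PairRNN pairwise classifier.

```python
import math
from collections import defaultdict

def aggregate(features, neighbors, v):
    """Eq. (1): mean of {z(v)} ∪ {z(u) : u ∈ N(v)} for one layer of one meta-path view."""
    group = [features[v]] + [features[u] for u in neighbors[v]]
    return [sum(col) / len(group) for col in zip(*group)]

def layer(features, neighbors, weight=1.0, bias=0.0):
    """Eq. (2), with a scalar weight/bias standing in for W and b, and tanh as σ."""
    return {v: [math.tanh(weight * x + bias) for x in aggregate(features, neighbors, v)]
            for v in features}

class UnionFind:
    """Merges paper blocks whose sequence pairs are classified as the same author."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:          # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def merge_blocks(blocks, same_author):
    """same_author(block_i, block_j) stands in for the pairwise classifier's decision."""
    uf = UnionFind(len(blocks))
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if same_author(blocks[i], blocks[j]):
                uf.union(i, j)
    clusters = defaultdict(list)
    for i, block in enumerate(blocks):
        clusters[uf.find(i)].extend(block)
    return list(clusters.values())          # one merged paper set per unique author
```

Union-find makes the merge transitive: if blocks A-B and B-C are each classified as the same author, all three end up in one paper set even if A-C is never compared.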
Table I. Statistics of Semantic Scholar.

| Node type | Edge type |
| --- | --- |
| author | author-paper |
| paper | paper-term |
| topic | paper-venue |
| venue | paper-paper |

• Semantic Scholar: We construct a new real-world academic dataset from the digital library Semantic Scholar. There are 154,822 records of 857 unique authors with 226 highly ambiguous names in the medicine area, together with the reference papers of these records. A detailed description is shown in Table I. The statistics of these authors' paper sets are shown in Fig. 4.

Figure 4. Length statistics of paper sets.
B. Evaluation Metrics and Baselines
We apply pairwise Precision, Recall, and F1 score on Aminer-AND and averaged Accuracy, F1 score, and AUC on Semantic Scholar to measure the performance of all methods. We compare with attribute-based methods as well as attribute-and-structure-based methods to demonstrate the effectiveness of our model. To verify the effectiveness of each component, including the meta-path based views, the semantic-level attention, and the Pseudo-Siamese structure, we also test three variants of MA-PairRNN.
• MLP [25]: A multilayer perceptron that directly projects input features into a low-dimensional vector.
• Deepwalk [26]: Deepwalk captures contextual information of neighborhoods via uniform random walks for node embedding in a homogeneous network.
• GraphSage [14]: GraphSage samples node neighborhoods to generate node embeddings for unseen data in an inductive way and is designed for homogeneous networks.
• Zhang et al. [5]: This method learns paper embeddings by sampling triplets from three graphs constructed from the relations of authors and papers, and clusters them with a hierarchical agglomerative algorithm.
• GHOST [2]: GHOST uses the affinity propagation algorithm for clustering on a co-author graph where the node distance is measured based on the number of valid paths.
• Louppe et al. [3]: This method trains a pairwise distance function based on similarity features and uses a semi-supervised HAC algorithm for clustering.
• Aminer [6]: This method first learns supervised global embeddings and then refines the global embeddings for each candidate set based on the local contexts.
• Kim et al. [7]: A hybrid pairwise classification method which generates paper representations by extracting both structure-aware features and global features.
• PairRNN_LSTM: A variation of MA-PairRNN_LSTM which directly feeds node features into a Pseudo-Siamese recurrent neural network layer with two LSTMs.
• G-PairRNN_LSTM: A variation of MA-PairRNN_LSTM which neglects the heterogeneity of the academic network and generates representations on the original graph.
• M-PairRNN_LSTM: A variation of MA-PairRNN_LSTM which removes the semantic-level attention layer and assigns the same importance to each meta-path.
• MA-PairRNN_LSTM: The proposed model, which fuses attribute, structure, and semantic features for node embedding generation with a semantic attention mechanism.
C. Implementation Details
On Aminer-AND, the selected meta-paths of our method consist of Paper-Author-Paper, Paper-Topic-Paper, and Paper-Venue-Paper. We use the author's affiliation as the discriminative attribute to separate papers into small blocks, and we use the same training set and test set as in [6].

On Semantic Scholar, the selected meta-paths of our method consist of Paper-Paper, Paper-Author-Paper, Paper-Topic-Paper, and Paper-Venue-Paper. We use the author's email as the discriminative attribute to separate papers into small blocks. To evaluate the learning ability of the models, we test them on Semantic Scholar with different training ratios.

The common training parameters are a fixed learning rate and dropout = 0.2. The node embedding dimension is set to 64, and the classifier of all methods is a three-layer fully connected neural network with a ReLU function. In our proposed model MA-PairRNN_LSTM, $K$ is set to 2 and the dimension of the meta-path preference vector $a$ is set to 32.
The performance of different methods on some sampled names from Aminer-AND is reported in Table II. The results on Semantic Scholar are reported in Table III. The major findings from the experimental results can be summarized as follows:

Performance Comparison. As shown in Table II and Table III, by incorporating attribute, structure, and semantic information, MA-PairRNN_LSTM outperforms all baselines on both datasets. Generally, GNN-based methods that combine attribute and structure information usually perform better than methods that only exploit attribute information. Compared to simply concatenating node representations, the Pseudo-Siamese RNN network can better extract the inherent relations of paper sequences. Compared to treating the graph as homogeneous, M-PairRNN_LSTM and MA-PairRNN_LSTM exploit semantic information successfully and show their superiority. This demonstrates that the combined use of attribute, structure, and semantic features better captures the similarities between papers. In addition, the semantic-level attention mechanism in MA-PairRNN_LSTM can exploit semantic information more properly.

Fig. 5 shows the F1 scores of MA-PairRNN_LSTM on different partition versions of Semantic Scholar with a training ratio of 80%. After adequate rounds of training, the performance of MA-PairRNN_LSTM on each dataset partition version has gained stability and certainty and is difficult to improve further, though fluctuations exist.

Table II. The detailed results (%) on Aminer-AND. Each cell reports Prec / Rec / F1; the column groups correspond to Louppe et al. (Attr.), Zhang et al. (Struc.), GHOST (Attr. + Struc.), Aminer (Attr. + Struc.), and MA-PairRNN_LSTM (Attr. + Struc. + Sem.).

| Name | Louppe et al. | Zhang et al. | GHOST | Aminer | MA-PairRNN_LSTM |
| --- | --- | --- | --- | --- | --- |
| Hongbin Li | 19.48 / 85.96 / 31.77 | 54.66 / 53.05 / 53.84 | 56.29 / 29.12 / 38.39 | 77.20 / 69.21 / 72.99 | 88.89 / 65.98 / 75.74 |
| Hua Bai | 36.39 / 41.33 / 38.70 | 58.58 / 35.90 / 44.52 | 83.06 / 29.54 / 43.58 | 71.49 / 39.73 / 51.08 | 89.22 / 70.54 / 78.79 |
| Kexin Xu | 91.26 / 98.35 / 94.67 | 90.02 / 82.47 / 86.08 | 92.90 / 28.52 / 43.64 | 91.37 / 98.64 / 94.87 | 85.19 / 71.88 / 77.97 |
| Lu Han | 30.25 / 46.65 / 36.70 | 47.88 / 20.62 / 28.82 | 69.72 / 17.39 / 27.84 | 51.78 / 28.05 / 36.39 | 92.43 / 69.62 / 79.42 |
| Lin Huang | 24.86 / 71.32 / 36.87 | 71.84 / 34.17 / 46.31 | 86.15 / 17.25 / 28.74 | 77.10 / 32.87 / 46.09 | 88.26 / 73.44 / 80.17 |
| Meiling Chen | 58.32 / 47.14 / 52.14 | 59.36 / 28.80 / 38.79 | 86.11 / 23.85 / 37.35 | 74.93 / 44.70 / 55.99 | - |
| Min Zheng | 25.86 / 32.67 / 28.87 | 54.76 / 19.70 / 28.98 | 80.50 / 15.21 / 25.58 | 57.65 / 22.35 / 32.21 | 86.07 / 82.03 / 84.00 |
| Qiang Shi | 35.31 / 47.18 / 40.39 | 43.84 / 36.94 / 40.10 | 53.72 / 26.80 / 35.76 | 52.20 / 36.15 / 42.72 | 80.25 / 69.15 / 74.29 |
| Rong Yu | 38.85 / 91.43 / 54.53 | 65.48 / 40.85 / 50.32 | 92.00 / 36.41 / 52.17 | 89.13 / 46.51 / 61.12 | 90.67 / 68.69 / 78.16 |
| Tao Deng | 40.46 / 51.38 / 45.27 | 53.04 / 29.89 / 38.23 | 73.33 / 24.50 / 36.73 | 81.63 / 43.62 / 56.86 | 88.42 / 65.12 / 75.00 |
| Wei Quan | 37.86 / 63.41 / 47.41 | 64.45 / 47.66 / 54.77 | 86.42 / 27.80 / 42.07 | 53.88 / 39.02 / 45.26 | 75.76 / 78.13 / 76.92 |
| Xudong Zhang | 72.38 / 79.83 / 75.92 | 70.20 / 23.35 / 35.04 | 85.75 / 7.23 / 13.34 | 62.40 / 22.54 / 33.12 | - |
| Xu Xu | 22.55 / 64.40 / 33.40 | 48.16 / 41.87 / 44.80 | 61.34 / 21.79 / 32.15 | 74.18 / 45.86 / 56.68 | 78.68 / 79.08 / 78.88 |
| Yanqing Wang | 29.64 / 79.08 / 43.11 | 60.40 / 51.97 / 55.87 | 80.79 / 40.39 / 53.86 | 71.52 / 75.33 / 73.37 | 77.42 / 64.86 / 70.59 |
| Yong Tian | 32.08 / 63.71 / 42.67 | 70.74 / 56.85 / 63.04 | 86.94 / 54.58 / 67.06 | 76.32 / 51.95 / 61.82 | 87.80 / 70.59 / 78.26 |
| Average | 57.09 / 77.22 / 63.10 | 70.63 / 59.53 / 62.81 | 81.62 / 40.43 / 50.23 | 77.96 / 63.03 / 67.79 | - |

Table III. Quantitative results and standard deviation (%) on Semantic Scholar, comparing MLP and PairRNN_LSTM (Attr.); Deepwalk, GraphSage, Aminer, and Kim et al. (Attr. + Struc.); and G-PairRNN_LSTM, M-PairRNN_LSTM, and MA-PairRNN_LSTM (Attr. + Struc. + Sem.) under each metric and training ratio.

Figure 5. Performance of MA-PairRNN_LSTM on different Semantic Scholar partition versions with a training ratio of 80%.
Impact of training ratio. The F1 scores of all methods on Semantic Scholar with different training ratios are shown in Fig. 6 (a), and their distributions are shown in Fig. 6 (b). The performance of all methods gets worse as the training ratio decreases. Our method MA-PairRNN_LSTM and its variants suffer less performance degradation than the others, which shows better learning ability.
Siamese Network vs. Pseudo-Siamese Network. As mentioned above, the Pseudo-Siamese neural network component consists of two RNNs with different parameters. We also test three variations, including a Pseudo-Siamese network with two BiLSTMs (MA-PairRNN_BiLSTM), a Siamese network with two parameter-shared LSTMs (MA-RNN_LSTM), and a Siamese network with two parameter-shared BiLSTMs (MA-RNN_BiLSTM). The results on Semantic Scholar are shown in Table IV. We can see that the Pseudo-Siamese network models have better performance than the two Siamese network models. Based on our assumption that papers during the period of discriminative attribute change have similar text and structure features, the paper sequence published earlier is fed into its RNN in publication time order and the other in reverse order. The Pseudo-Siamese network may better capture the changing trend of research topics and scholarly relationships.
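The forward/reverse pairing described above can be sketched end to end. The toy recurrent cell below is an invented stand-in for the two trained LSTM branches; its update rule and the scalar weights are illustrative only, while the mean pooling and cosine score follow Eqs. (8) and (10).

```python
import math

def toy_rnn(seq, w):
    """Invented stand-in for one RNN branch (Eq. (7)): h_t = tanh(w * (z_t + h_{t-1}))."""
    h = [0.0] * len(seq[0])
    outputs = []
    for z in seq:
        h = [math.tanh(w * (zi + hi)) for zi, hi in zip(z, h)]
        outputs.append(h)
    return outputs

def global_pool(outputs):
    """Eq. (8): average the per-step outputs into one sequence representation."""
    return [sum(dim) / len(outputs) for dim in zip(*outputs)]

def cosine(a, b):
    """Eq. (10): cosine similarity of the two sequence representations."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pair_similarity(seq_early, seq_late):
    """Feed the earlier sequence in publication order and the later one reversed
    into two branches with different parameters (the Pseudo-Siamese setting)."""
    h1 = global_pool(toy_rnn(seq_early, w=0.5))
    h2 = global_pool(toy_rnn(list(reversed(seq_late)), w=0.7))
    # Training would combine this score with the classification loss:
    # L = L_classify + eta * L_sim (Eq. (11)); omitted in this sketch.
    return cosine(h1, h2)
```

With the trained LSTMs in place of `toy_rnn`, this score is exactly the $L_{sim}$ term that pushes same-author sequence pairs toward similar representations.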
Impact of Different Meta-paths. To verify the ability of semantic-level attention, we report the F1 scores of MA-PairRNN_LSTM using each single meta-path, together with the corresponding attention values, on Semantic Scholar in Fig. 7. Obviously, there is a positive correlation between the performance of each meta-path and its attention value. Among the four meta-paths, MA-PairRNN_LSTM gives PVP the highest weight, which means that PVP is considered the most critical meta-path for paper representation. This makes sense because authors' research areas are highly correlated with the venues where their papers are published. Meanwhile, PP is also given a high weight. This also makes sense because an author's papers are often closely related and share similar references.

Figure 6. Performance with different training ratios on Semantic Scholar: (a) F1 scores; (b) distributions of F1 scores.

Table IV. Performance comparison (%) of different sequence representation models on Semantic Scholar, reporting Accuracy, F1 score, and AUC for MA-PairRNN_LSTM, MA-PairRNN_BiLSTM, MA-RNN_LSTM, and MA-RNN_BiLSTM.
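The semantic-level fusion that produces these per-meta-path weights can be sketched as below. This is a hedged illustration: scoring each view by a dot product with a preference vector `a` is our assumption, and the exact parameterization in the paper may differ; only the softmax-weighted combination of per-meta-path embeddings is taken from the section.

```python
import math

# Hedged sketch of semantic-level attention fusion: given one embedding of
# the same node per meta-path, score each view against a preference vector
# `a` (assumed scoring function), softmax the scores, and return the
# weighted sum as the fused node embedding plus the attention weights.
def fuse_meta_paths(view_embeddings, a):
    scores = [sum(ai * vi for ai, vi in zip(a, v)) for v in view_embeddings]
    m = max(scores)                               # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(view_embeddings[0])
    fused = [sum(w * v[i] for w, v in zip(weights, view_embeddings))
             for i in range(dim)]
    return fused, weights

views = [[1.0, 0.0], [0.0, 1.0]]   # toy embeddings from two meta-paths
a = [1.0, 0.0]                     # preference vector favoring the first view
fused, weights = fuse_meta_paths(views, a)
# The first view receives the larger weight, mirroring how PVP receives
# the largest attention value in Fig. 7.
```

The learned weights themselves are what make the model interpretable here: inspecting them directly yields the PVP/PP ranking discussed above.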
Generalization ability across research areas. On Semantic Scholar, our models are trained on papers from the medical area. To verify the generalization ability of the models across different research areas, we collected data of 100 authors from each of the biology, chemistry, computer science, and mathematics areas. The performance of all models on these data is shown in Fig. 8. When trained on data from the medical area and tested on the other four areas, the performance degradation of our proposed model (MA-PairRNN_LSTM) and its variations (G-PairRNN_LSTM and M-PairRNN_LSTM) is less than 3%, which is better than the other models. This indicates that structure information can enhance the model's generalization ability. Most models perform better when transferred to the biology and chemistry areas than to the other two areas. This makes sense because these two areas share more domain knowledge with the medical one.

Figure 7. Performance of each single meta-path and its corresponding attention value.

Figure 8. Performance (F1 score %) in different research areas.
E. Parameters Analysis
In this section, we investigate how the dimension of the node embedding, the dimension of the attention preference vector, and the coefficient of the similarity loss affect classification performance. The results on Semantic Scholar are reported in Fig. 9.
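Before examining each parameter, the role of the similarity-loss coefficient η analyzed in this section can be made concrete. The following is a hedged sketch of the assumed training objective (pairwise cross-entropy plus an η-weighted cosine-similarity term); the exact loss form and signs are our assumption, not the paper's verbatim formulation.

```python
import math

# Hedged sketch: pairwise objective = classification cross-entropy
# + eta * cosine-similarity term. The similarity term is assumed to pull
# same-author pairs together (1 - cos) and push different-author pairs
# apart (positive cos is penalized).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def total_loss(p_same, label, emb_u, emb_v, eta=1.0):
    # binary cross-entropy on the pairwise classifier output p_same
    bce = -(label * math.log(p_same) + (1 - label) * math.log(1 - p_same))
    cos = cosine(emb_u, emb_v)
    sim = (1.0 - cos) if label == 1 else max(cos, 0.0)
    return bce + eta * sim

# With eta = 0 only the classifier term remains; identical embeddings of a
# same-author pair contribute no similarity penalty.
loss = total_loss(0.9, 1, [1.0, 0.0], [1.0, 0.0], eta=1.0)
```

Under this reading, a very small η under-uses the embedding-similarity signal while a very large η overwhelms the classification term, which matches the sensitivity pattern reported for Fig. 9(c).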
Dimension of the final node embedding z. The representation ability of graph embedding methods is affected by the dimension of the node embedding z. We explore its impact with dimensions in {16, 32, 64, 128, 256}. As shown in Fig. 9(a), the performance first improves as the node embedding dimension increases, then degrades slowly, achieving its best value at a dimension of 64. The reason may be that a larger dimension introduces additional redundancy.

Dimension of the semantic attention vector a. We evaluate the effect of the dimension of the semantic attention vector a over several settings. As shown in Fig. 9(b), the F1 score changes only slightly, which shows that MA-PairRNN_LSTM is not very sensitive to the dimension of the attention preference vector.

Figure 9. Parameter sensitivity: (a) dimension of the node embedding z; (b) dimension of the semantic attention vector a; (c) coefficient η of the cosine similarity loss.
Coefficient η of the cosine similarity loss. The impact of the similarity loss term is controlled by η. We vary η over a range of values. As shown in Fig. 9(c), optimal performance is obtained near η = 1, indicating that η should be set neither too small nor too large, so as to avoid both underfitting and overfitting.

F. Case Study
We specifically choose three author variants named Jian Pei in Semantic Scholar as a case study, denoted as Jian Pei 1, Jian Pei 2, and Jian Pei 3. Statistics of the three selected author variants are shown in Table V. Our model classifies Jian Pei 1 and Jian Pei 2 as the same person, while Jian Pei 3 is another person, which is consistent with the ground truth. We visualize the subgraph of the academic network containing the three author variants. The visualized subgraph includes the papers and co-authors of the three author variants, and the topics their papers relate to. Papers of the three author variants are colored blue, green, and red, respectively, and the other nodes are colored by their type. The blue paper nodes of Jian Pei 1 and the green paper nodes of Jian Pei 2 tend to be closely connected, and many of them are linked through the same topics (e.g., Data mining, Social network) and the same venues (e.g., KDD, TKDE). Jian Pei 3's paper nodes are connected to the paper nodes of the other two only through topic nodes such as Algorithm and Simulation experiment, which are used in many research areas.

Table V
STATISTICS OF SELECTED AUTHOR VARIANTS

Author       Papers   Citations   Research topics
Jian Pei 1   441      23,729      Data mining, Social networks, Frequent pattern mining
Jian Pei 2   78       4,512       Data mining, Sequential pattern mining, Frequent pattern mining
Jian Pei 3   36       690         Molecular synthesis, Functional materials, Convenient syntheses

Figure 10. Subgraph visualization of selected author variants. Paper node color represents the author variant (blue: Jian Pei 1; green: Jian Pei 2; red: Jian Pei 3).

V. CONCLUSION AND FUTURE WORK

In this paper, we propose MA-PairRNN, a novel pairwise node sequence classification framework for name disambiguation, in which a multi-view graph embedding layer is designed to generate node representations inductively, and a Pseudo-Siamese recurrent neural network is designed to learn sequence pair similarity. Our proposed method learns node representations and sequence pair similarity simultaneously, and scales to large graphs thanks to its inductive capability. Experimental results on two real-world datasets demonstrate the effectiveness of our method. By analyzing the learned attention weights of meta-paths, MA-PairRNN shows potentially good interpretability. By testing on data from unseen areas, MA-PairRNN also shows good generalization ability. In the future, we plan to leverage hierarchical clustering to address the problem that an author may have diverse research areas and may work with non-overlapping sets of co-authors in each research area.

ACKNOWLEDGMENT
This work is supported by the National Key R&D Program of China (2018YFC0830804), NSFC No. 61872022, NSF of Jiangsu Province BK20171420, NSF of Guangdong Province (2017A030313339), and the CCF-Tencent Open Research Fund, and in part by NSF under grants III-1526499, III-1763325, III-1909323, and SaTC-1930941.