A network approach to expertise retrieval based on path similarity and credit allocation
Xiancheng Li, Luca Verginer, Massimo Riccaboni, Pietro Panzarasa
Abstract
With the increasing availability of online scholarly databases, publication records can be easily extracted and analysed. Researchers can promptly keep abreast of others' scientific production and, in principle, can select new collaborators and build new research teams. A critical factor one should consider when contemplating new potential collaborations is the possibility of unambiguously defining the expertise of other researchers. While some organisations have established database systems to enable their members to manually produce a profile, maintaining such systems is time-consuming and costly. Therefore, there has been a growing interest in retrieving expertise through automated approaches. Indeed, the identification of researchers' expertise is of great value in many applications, such as identifying qualified experts to supervise new researchers, assigning manuscripts to reviewers, and forming qualified teams. Here, we propose a network-based approach to the construction of authors' expertise profiles. Using the MEDLINE corpus as an example, we show that our method can be applied to a number of widely used data sets and outperforms other methods traditionally used for expertise identification.
Keywords
Expertise retrieval · Path similarity · Credit allocation · Heterogeneous Information Networks
Introduction

The increasing complexity of research problems calls for innovative solutions which combine knowledge from different scientific disciplines (Van Rijnsoever and Hessels 2011). Therefore, many researchers become involved in interdisciplinary projects, thus collaborating with people with a variety of expertise. When facing the task of finding collaborators, scholars need to answer two inter-related questions: 1) how to identify an expert, i.e., how to find someone who is competent in a given field; and 2) how to profile an expert, i.e., how to identify the fields in which a given scholar is an expert. Together, these two questions describe the objective of expertise retrieval (Balog et al. 2012). Indeed, figuring out the research areas associated with an individual represents a challenging research problem. Search engines
such as Google Scholar or DBLP are of great help for finding documents (Hertzum and Pejtersen 2000). However, these engines only return scientific documents, not the specific expertise of people. Even in an academic environment, researchers still have to rely on their social networks to identify the expertise of others (Hofmann et al. 2010).

Author affiliations: Xiancheng Li, School of Business and Management, Queen Mary University of London, London, E-mail: [email protected] · Luca Verginer, Chair of Systems Design, ETH Zürich, Zürich, Switzerland · Massimo Riccaboni, IMT School for Advanced Studies, Lucca, Italy · Pietro Panzarasa, School of Business and Management, Queen Mary University of London, London.

Identifying experts is crucial for academic groups when they need to involve a collaborator with specific expertise. In organisational settings, knowing the expertise of relevant researchers facilitates the assignment of important roles and jobs. For example, conference organisers may search for moderators, session chairs and keynote speakers with the proper expertise, and universities may want to recruit researchers with expertise in a particular fast-developing area to improve their reputation. A good method for expertise retrieval is therefore fundamental to provide the necessary knowledge for such activities.

However, expertise retrieval is challenging for many reasons. First, expertise is a relatively abstract concept, and there is currently no consensus on how to define it. Moreover, expertise is a particular kind of knowledge stored in one's mind, and is thus hard to identify; the only way to access people's expertise is through their works, e.g., documents, books, and articles. Second, experts' names are often ambiguous. A single name may belong to multiple people, and the name of the same expert can vary in different databases.
Indeed, name disambiguation has recently become a specific and independent area of enquiry, and many studies have been carried out in this field (Smalheiser and Torvik 2009). Finally, it is difficult to evaluate the strength of the association between an expert and the works he or she has been involved in, especially because an increasing amount of scientific production is co-authored by multiple individuals. These challenges have made expertise retrieval a multi-faceted research area. In particular, since we learn about researchers' expertise mainly from their publications, the task of expertise retrieval has mainly been articulated into identifying the knowledge areas/topics in the text corpus and assigning them to the researchers (Silva et al. 2018).

Inspired by previous approaches to credit allocation (Shen and Barabási 2014) and by recent studies on node similarity in heterogeneous information networks (HINs) (Shi et al. 2014), we formalise the topics/expertise extracted from a given scientific publication as credit to be assigned to the co-authors of the publication, and propose a new method to allocate it to the co-authors based on their publication histories. Traditional approaches to the identification of knowledge areas within a text corpus use topic-modelling methods such as Latent Dirichlet Allocation (LDA) based on controlled vocabulary from well-known classification systems such as the Medical Subject Headings (
MeSH) in MEDLINE and the topic tags in the Microsoft Academic Graph (MAG; https://academic.microsoft.com/topics).

Our work focuses on the process of evaluating the degree of each co-author's contribution to a collaborative work. We propose a new method for properly assigning the expertise to each co-author according to his or her contribution. Our method differs from traditional ones, where the contribution of authors is assumed to be equal or is assessed simply based on the order of authors in the byline. Moreover, our method can deal with large-scale data sets, and produces results that vary dynamically as the data set is updated over time. Unlike some citation-based approaches to the assessment of contributions, which require a certain time for citations to accumulate, our method is experience-based, and authors' expertise is updated as soon as new records are added to the data set.

The rest of the article is organised as follows. In Section 2 we review the strengths and limitations of the existing literature on expertise identification, and motivate our work. In Section 3 we introduce the data used in our study. In Section 4 and Section 5 we present our new method and different selection strategies. In Section 6, we provide some extensions to account for weights and time. In Section 7 we report results obtained using the MEDLINE corpus and various examples. Section 8 summarises the findings of this work and outlines their implications for research and practice.

Related work

Previous work on expert profiling has primarily focused on identifying and ranking topics for a given expert (Balog et al. 2007; Serdyukov et al. 2011). However, only few studies have considered the temporal aspects of expertise. The work by Tsatsaronis et al. (2011) was one of the first studies which focused on the evolution of authors' expertise over time.
Their work was based on co-authorship information, and proposed evolution indices to measure the dynamics of authors' expertise. Inspired by their work, Rybak et al. (2014) constructed temporal hierarchical expertise profiles using topic models. Typically, the underlying question of expert profiling is: what topics does a person know about? (Balog et al. 2007; Rybak et al. 2014). Indeed, the word "topic" is commonly used in the various definitions of expertise because traditional approaches to expertise profiling rely on topic models and Natural Language Processing (NLP) techniques (Van Gysel et al. 2016). The main purpose of using those models is to classify documents into a number of topics and find a better match between authors and topics according to the topics extracted from their documents. As most of these machine learning algorithms are unsupervised, the topics are simply collections of words and thus not always appropriate for identifying expertise (Silva et al. 2018).

Since the main focus of expertise retrieval tasks is on the analysis of documents, NLP techniques have commonly been applied. Traditional approaches to expert profiling tasks are based on the LDA algorithm. LDA is a generative statistical model, first proposed in 2003, which considers each document as a mixture of a small number of topics and according to which the presence of each word is attributable to one of the topics of the document (Blei et al. 2003). LDA is a powerful tool to analyse documents and pinpoint topics, but it was not designed to address the task of identifying expertise; there is no better solution than to treat an author as a bigger document by combining all the documents he or she has published. To include authorship information, Rosen-Zvi et al. (2004) extended LDA and proposed the author-topic model for identifying the interests of authors. To make LDA suitable for different tasks in various contexts, many extensions have been proposed over the years.
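The "author as a bigger document" workaround mentioned above can be sketched in a few lines (a toy illustration with made-up documents and simple whitespace tokenisation; the aggregated word counts would then be fed to a topic model such as LDA):

```python
from collections import Counter

def author_profile(docs_by_author):
    """Merge all of an author's documents into one bag of words,
    i.e., treat the author as a single 'bigger document'."""
    return {author: Counter(w for doc in docs for w in doc.lower().split())
            for author, docs in docs_by_author.items()}

# Hypothetical two-author corpus.
corpus = {
    "alice": ["gene expression in tumours", "tumour suppressor gene"],
    "bob": ["bayesian topic models", "topic models for text"],
}
profiles = author_profile(corpus)
print(profiles["alice"].most_common(2))  # 'gene' dominates Alice's profile
```

Aggregating documents this way loses the per-paper structure, which is exactly the limitation the author-topic model and its extensions address.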
Some examples are the Author-Conference Topic model (Tang et al. 2008), the Author-Conference Topic-Connection model (Wang et al. 2012), and the Author-Topic over Time model (Xu et al. 2014). Some of these have been applied in practice as part of the search engine AMiner (Tang 2016) (https://aminer.org/).

However, classic LDA algorithms have several characteristics that are not ideal for such tasks. First, LDA requires a manual choice of the number of topics, but one can hardly tell whether the choice is good, since the performance of an LDA model is evaluated by perplexity, a metric proposed by Blei et al. (2003); it is therefore difficult to choose and evaluate the number of topics. When this number is too large or too small, the research areas (corresponding to the topics) provided by LDA may become too specific or too general (Berendsen et al. 2013). Second, since LDA is an unsupervised learning algorithm, the topics it generates are just distributions of words without labels, which can be hard to interpret. Additionally, academic research areas are interconnected and have a hierarchical structure, whereas LDA generates independent topics without any relationships between them (Silva et al. 2018).

While most studies are concerned with better solutions to address the flaws of topic models, few have highlighted the importance of author-document connections in expertise retrieval tasks. Duan et al. (2012) first integrated community discovery with topic modelling, and proposed the Mutual Enhanced Infinite Community-Topic model, which finds communities and the topics they discuss in text-augmented social networks. Lately, more studies have started using information networks to avoid the problems of LDA models. Gerlach et al. (2018) represent the data as a bipartite network of words and documents and convert the task into finding communities in such a network. Different approaches that focus on topic modelling using HINs have also been proposed (Sun et al. 2009b).
Subsequently, a pioneering algorithm called RankClus was designed; it uses a generative model that operates on bipartite topologies and simultaneously clusters and ranks nodes in a HIN (Sun et al. 2009a). More recently, different community detection methods, such as generative models and modularity optimisation, have been applied to the creation of hierarchical expert profiles (Silva et al. 2018; Wang et al. 2015).

Despite the efforts of many scholars to find better ways of extracting individuals' interests from the works they produce, most studies have paid little attention to the unequal contributions of authors to collaborative works. Authors that publish with other co-authors in several fields can be associated with multiple topics found in their publications. Identifying the expert in a specific field associated with a paper requires identifying the different contributions of authors to collaborative works, and therefore identifying one or more people as experts bears a resemblance to a credit allocation problem.

In the last decade, as the complexity and interdisciplinarity of modern research have steadily risen, collaborations among researchers have been playing an increasingly important role (Newman 2004). The multidisciplinary nature of research requires expertise from different scientific fields (Lawrence 2007). In turn, as a result of the increasing size of newly formed scientific groups, the scientific credit system has come under mounting pressure (Koopman et al. 2010). As a matter of fact, the interdisciplinarity of modern science not only endangers the current credit allocation system, but also poses more obstacles to expertise retrieval. In such interdisciplinary collaborations, authors from different fields work together to produce one result (e.g., an article), but each author contributes only partly to the publication.
It can therefore be difficult to quantitatively discern the individual co-authors' contributions to a multi-authored publication (Bao and Zhai 2017). Most topic models for expertise retrieval cannot solve this problem, and new approaches to allocating scientific credit to co-authors are therefore required.

Current approaches to credit allocation fall into several major categories. The first and classic one is to view each author as the sole author contributing a copy of the same publication. The second is to distribute the contribution evenly among all co-authors, and the third is to distribute it according to the order in the publication byline or to the role of the co-authors (Hirsch 2005, 2007; Stallings et al. 2013). The first two categories are obviously biased to some degree, and the third is based on tacit discipline-specific conventions which may not be easily accepted by others. Recently, scholars have been working on allocating credit based on the specific contribution of each author (Foulkes and Neylon 1996; Tscharntke et al. 2007). Shen and Barabási (2014) proposed a new method which focuses on co-citations. This method is based on the intuition that the more often an author's other papers are co-cited with a given paper, the more credit he or she should receive for it. In this way, they managed to capture the contribution of co-authors as perceived by the scientific community, and successfully tested the method on Nobel Prize publications. Considering that the novelty of a paper and the attention paid to it tend to fade with time, Bao and Zhai (2017) extended this idea and proposed a dynamic credit allocation algorithm.

As science can be regarded as a complex, self-organising and evolving network of scholars, projects, papers and ideas (Fortunato et al.
2018), another way to deal with the unequal contributions of multiple authors to collaborative works is to use the similarity between a node representing a given topic and a node representing a given author to assess the contribution that the author made to the focal document with respect to the topic. Information networks are networks consisting of data items linked in some way. The best-known example is the World Wide Web, where the nodes are web pages consisting of texts, pictures or other information, and the links are hyperlinks that allow us to navigate from one page to another. Some networks can be considered information networks and also have social connotations; examples include networks of email communication, and online social networks such as Twitter and Facebook (Xiong et al. 2015).

An information network is defined as a directed graph G = (V, E) with an object type mapping function φ: V → A and a link type mapping function ψ: E → R, where each object v ∈ V belongs to one particular object type φ(v) ∈ A, and each link e ∈ E belongs to a particular relation type ψ(e) ∈ R. Unlike in the traditional network definition, we explicitly distinguish object types and relation types in the network. Notice that, if there exists a relation R from type A to type B (denoted A −R→ B), the inverse relation R⁻¹ naturally holds from B to A (B −R⁻¹→ A). Most of the time, R and its inverse R⁻¹ are not equal, unless the two types are the same and R is symmetric. When the number of object types |A| > 1 or the number of relation types |R| > 1, the network is called a heterogeneous information network (HIN); otherwise, it is a homogeneous information network. In real-world networks, multiple-typed objects are often interconnected, forming HINs (Shi et al. 2012). A bibliographic information network is a typical HIN, containing objects from several types of entities. The most common entities are papers (P), venues (conferences/journals) (V), authors (A), affiliations (aff), and terms (T). The DBLP and ACM data in Fig. 1 are typical examples (Shi et al. 2014). There are links connecting objects of different types, and the link types are defined by the relations between two object types. For a bibliographic network, links can exist between nodes of the same or different types. For example, there are links between authors and papers denoting the "write" or "written-by" relations, and links between papers denoting the "cite" and "cited-by" relations.

Fig. 1 Examples of typical Heterogeneous Information Networks (HINs): (a) DBLP data; (b) ACM data

In a heterogeneous network, two objects can be connected via different paths. For example, two authors can be connected via the "author-paper-author" path, the "author-paper-venue-paper-author" path, and so forth. Formally, these paths are called meta-paths. In a graph T_G = (A, R), where A is the set of node types and R is the set of relation types, a meta-path P is a path denoted in the form A_1 −R_1→ A_2 −R_2→ ⋯ −R_l→ A_{l+1}, which defines a composite relation R = R_1 ∘ R_2 ∘ ⋯ ∘ R_l between types A_1 and A_{l+1}, where ∘ denotes the composition operator on relations (Shi et al. 2014).

Similarity search is a primitive operation in large-scale HINs that consist of multi-typed, interconnected objects, such as bibliographic networks and social media networks. Traditional similarity measures (e.g., cosine similarity) are computed between vector representations of features, using numerical data types (Nguyen and Bai 2010). In information networks, however, the interconnections between objects are sometimes more important than the features of the objects themselves.

To capture the information contained in the links, Lin et al. (2006) proposed a link-based similarity measure, PageSim, and applied it to the identification of similar web pages. PageSim only works on networks with one type of node (i.e., homogeneous information networks), but many networks are heterogeneous. Considering the semantics of meta-paths constituted by different-typed objects, Sun et al.
(2011) first proposed the path-based similarity measure
PathSim, to evaluate the similarity of same-typed objects based on symmetric paths. Following their work, Yao et al. (2014) extended PathSim by incorporating richer information, such as transitive similarity, temporal dynamics, and supportive attributes. A path-based similarity join method, JoinSim, was proposed to return the top-k similar pairs of objects based on user-specified join paths (Begum et al. 2016). Wang et al. (2016) defined a meta-path-based relation similarity measure, RelSim, to examine the similarity between relation instances in schema-rich HINs. In order to evaluate the relevance of different-typed objects, Shi et al. (2014) proposed HeteSim to measure the relevance of any object pair under an arbitrary meta-path. To overcome the high computational and memory requirements of HeteSim, Meng et al. (2014) proposed the AvgSim measure, which evaluates similarity scores through two random walk processes along the given meta-path and the reverse meta-path, respectively.

The idea of node similarity can be useful in expertise retrieval because, if we can measure the similarity between a given author and a field, we can assess the author's expertise in that field.
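For intuition, the simplest of these path-based measures, PathSim under a symmetric meta-path, reduces to counting path instances. A minimal sketch for the author-paper-author meta-path, with hypothetical authors and paper sets:

```python
def pathsim_apa(papers_of):
    """PathSim under the symmetric meta-path A-P-A:
    s(x, y) = 2 * |paths x->P->y| / (|paths x->P->x| + |paths y->P->y|).
    For this meta-path, a path count is simply a shared-paper count."""
    def count(x, y):
        return len(papers_of[x] & papers_of[y])
    def sim(x, y):
        return 2 * count(x, y) / (count(x, x) + count(y, y))
    return sim

# Hypothetical authors and the sets of papers they wrote.
papers_of = {"ann": {1, 2, 3}, "ben": {2, 3}, "eva": {4}}
sim = pathsim_apa(papers_of)
print(round(sim("ann", "ben"), 2))  # 2 shared papers out of 3 + 2 -> 0.8
print(sim("ann", "eva"))            # no shared papers -> 0.0
```

The normalisation by the self-path counts is what makes the measure favour peers with comparable visibility rather than simply highly prolific authors.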
HeteSim has been designed to evaluate the relevance of different-typed objects, and thus has the potential to be applied to the task of expertise retrieval. However, this task needs to explicitly account for the uneven contributions of different authors to collaborative efforts, and therefore cannot be carried out merely by applying simple measures of similarity between nodes. For this reason, we decided to draw on HeteSim and propose a suitably adjusted method for capturing authors' expertise in evolving networks.

As a result of the increasing interest in extracting relevant topics from scientific publications, many widely used online data sets provide external controlled vocabularies to classify publications. Examples are the MeSH classification system in MEDLINE and the topic tags in MAG. These systems have used a variety of techniques to improve the reliability of the classifications, and some scholars have started to use them as ground truth or baselines in their work (AlShebli et al. 2018). Our method simplifies the process of topic extraction from documents by using the MEDLINE corpus as an example, and focuses on how to allocate expertise to co-authors who contribute unevenly to collaborative efforts.

The method for collective credit allocation in science developed by Shen and Barabási (2014) is conceptually similar to ours. Yet, it differs in one important aspect: it focuses on appropriately allocating the credit of a given paper to each of its co-authors. It uses the co-citations between the given paper and other papers published by the co-authors to determine the proportion of credit to be assigned to each co-author. If more papers have cited the focal paper together with other papers published by a given co-author, a larger proportion of the credit is allocated to this co-author, indicating that this co-author made a larger contribution to the work. However, at the time when a paper is published and therefore has no citations, contributions to the paper are allocated equally across co-authors. Moreover, because citations vary over the years, so does the credit allocated to each co-author by this method. Clearly, one shortcoming of this method lies in the fact that the contribution of an author to a paper should be unambiguously defined once the paper is published, and should therefore be assessed according to the experience or background of each co-author rather than based on future citations.
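The co-citation intuition just described can be reduced to a toy allocation rule. The following is a deliberately simplified sketch with made-up co-citation strengths, not the full Shen and Barabási algorithm:

```python
def allocate_credit(cocitation_strength):
    """Split one paper's unit of credit among its co-authors in proportion
    to how often each co-author's other papers are co-cited with the focal
    paper. With no citations yet (all zeros), fall back to an equal split."""
    total = sum(cocitation_strength.values())
    n = len(cocitation_strength)
    if total == 0:
        return {a: 1 / n for a in cocitation_strength}
    return {a: s / total for a, s in cocitation_strength.items()}

# Hypothetical co-citation strengths for three co-authors of one paper.
print(allocate_credit({"a1": 6, "a2": 3, "a3": 1}))  # shares 0.6, 0.3, 0.1
print(allocate_credit({"a1": 0, "a2": 0}))           # no citations yet -> 0.5 each
```

The equal split at publication time, and the drift of the shares as citations accumulate, are exactly the properties that motivate an experience-based alternative.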
Data

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic database of life sciences and biomedical information, maintained and curated by the US National Library of Medicine. It includes bibliographic information on articles from academic journals covering medicine, nursing, pharmacy, dentistry, veterinary medicine, and healthcare. The database contains records from more than 5,000 selected journals covering biomedicine and health from 1948 to the present, and is freely accessible via the PubMed interface.

In addition, PubMed provides an online scientific publication search engine that associates each paper with several MeSH terms. These terms are similar to paper keywords, except that a controlled vocabulary is used to classify publications. Since the MeSH terms of a paper are not given by the authors, they are not subject to subjective biases and can be considered as labels which indicate the major topics discussed in the paper. PubMed has also constructed tree structures for MeSH terms, so that one can look up the research field of each MeSH term.

In particular, in PubMed, each MeSH term has one MeSH Unique ID (starting with the letter 'D' followed by 6 digits) and at least one MeSH Tree ID (starting with a letter followed by digits separated by dots). For example, the MeSH Tree ID of 'Anatomic Landmarks' is 'A01.111' and its MeSH Unique ID is 'D059925'. The first letter of the MeSH Tree ID of a MeSH term indicates which one of the 16 categories the MeSH term belongs to. However, the MeSH terms in the raw data are indexed by the MeSH Unique ID rather than the MeSH Tree ID. To map each MeSH Unique ID to the corresponding MeSH Tree ID, we downloaded detailed information about each MeSH Unique ID and used regular expressions (regex) to match each MeSH Unique ID with its corresponding MeSH Tree ID. MeSH Tree IDs can have different depths (the depth of a node is the number of edges from the node to the tree's root node): some MeSH IDs have corresponding MeSH Tree IDs of depth five (e.g., 'A15.378.316.378'), while others only have depth two (e.g., 'B02'). To ensure that all MeSH IDs can be mapped to MeSH Tree IDs of the same depth, we converted all MeSH Tree IDs to depth two by cutting the numbers after the first point. As a result, all MeSH IDs have been mapped to 127 MeSH Tree IDs of depth two.

To disambiguate authors' names, we used the data set named Author-ity provided by Torvik (Torvik and Smalheiser 2009). The data set provides the disambiguated authors' names appearing in the MEDLINE data set up to the year 2008. In our work, we used the first decade of publications in MEDLINE, from 1948 to 1957, to test the method we developed and to compare a baseline (BL) method with our method.

HeteAlloc: An algorithm based on path similarity
MeSH term allocation problem: given a time T, an author A, and a MeSH term M, what is the expertise of author A on MeSH term M at time T? To answer this question, we have developed a method based on the idea of credit allocation, using the author-paper and paper-MeSH connections. Notice that what we care about is the effort devoted by an author to a MeSH term (measured by the number of papers published with that MeSH term, or possibly by the reputation or impact factor of the journals, research venues and outlets where these papers have appeared), rather than the reputation of the author (measured by the citations received).
Problem description. We focus on a subset of the HIN which contains three types of nodes: papers, authors, and MeSH terms. A simple example of this HIN is shown in Fig. 2. In this network, the MeSH terms are indexed by MeSH Tree IDs, and the links between papers and MeSH terms show which MeSH terms the papers are associated with. Our problem is how to allocate credit to individual authors. The input is the list of links for every year between 1948 and 1957, and the output is a vector for each author, with a value for each of the 127 MeSH categories indicating the author's expertise in those categories.

We developed a dynamic credit allocation algorithm based on path similarity, which we shall call HeteAlloc. Based on the HIN with three types of nodes (i.e., authors, papers and MeSH terms), our task is to assign the credit of each MeSH term in a paper to the corresponding authors, and to use authors' whole publication histories to find their expertise. Our method calculates the similarity between an author and a MeSH term, and assigns a value to each author based on this similarity. It is based on HeteSim (Shi et al. 2014), as this method is able to measure the similarity between nodes of different types, i.e., authors and MeSH terms in this case.

[Footnote: The MeSH tree structures are available at https://MeSHb.nlm.nih.gov/treeView. The 16 most general categories are: A. Anatomy; B. Organisms; C. Diseases; D. Chemicals and Drugs; E. Analytical, Diagnostic and Therapeutic Techniques and Equipment; F. Psychiatry and Psychology; G. Phenomena and Processes; H. Disciplines and Occupations; I. Anthropology, Education, Sociology and Social Phenomena; J. Technology, Industry, Agriculture; K. Humanities; L. Information Science; M. Named Groups; N. Health Care; V. Publication Characteristics; Z. Geographicals. In cases where a MeSH Unique ID has two MeSH Tree IDs, we kept both.]

Fig. 2 An example of HIN
Heterogeneous Similarity (HeteSim). HeteSim is a measure of the relatedness of heterogeneous objects based on an arbitrary search path. The properties of HeteSim (e.g., symmetry and self-maximum) make it suitable for a number of applications. HeteSim is defined as follows: given a relevance path P = R_1 ∘ R_2 ∘ ⋯ ∘ R_l, the HeteSim score between two objects s and t (s ∈ R_1.S and t ∈ R_l.T) is

HS(s, t \mid R_1 \circ R_2 \circ \cdots \circ R_l) = \frac{1}{|O(s|R_1)|\,|I(t|R_l)|} \sum_{i=1}^{|O(s|R_1)|} \sum_{j=1}^{|I(t|R_l)|} HS\big(O_i(s|R_1), I_j(t|R_l) \mid R_2 \circ \cdots \circ R_{l-1}\big),   (1)

where O(s|R_1) is the set of out-neighbours of s based on relation R_1, and I(t|R_l) is the set of in-neighbours of t based on relation R_l.

Transition probability matrix. The adjacency matrix W_{AB} is defined for all links from nodes of type A to nodes of type B. The transition probability matrix U_{AB} is W_{AB} normalised along its row vectors.

Reachable probability matrix. Given a network G = (V, E) following a network schema S = (A, R), the reachable probability matrix PM_P for a path P = (A_1 A_2 ⋯ A_{l+1}) is defined as PM_P = U_{A_1 A_2} U_{A_2 A_3} ⋯ U_{A_l A_{l+1}}. PM_P(i, j) represents the probability of object i ∈ A_1 reaching object j ∈ A_{l+1} under the path P.

Using the reachable probability matrices (Ramage et al. 2009), the HeteSim score between two nodes a and b can be written in matrix form as

HeteSim(a, b \mid P) = PM_{P_L}(a,:)\, PM'_{P_R^{-1}}(b,:),   (2)

where the path P is decomposed into a left part P_L and a right part P_R, PM_{P_L}(a,:) refers to the a-th row of the reachable probability matrix PM_{P_L}, and the prime denotes transposition.

Finally, Equation 3 provides the normalised version of HeteSim, which ensures that the similarity between a node and itself is equal to one:

HeteSim(a, b \mid P) = \frac{PM_{P_L}(a,:)\, PM'_{P_R^{-1}}(b,:)}{\sqrt{\|PM_{P_L}(a,:)\|\,\|PM'_{P_R^{-1}}(b,:)\|}}.   (3)

HeteSim in MeSH term assignment.
The definition of
HeteSim in Equation 3 can be directly applied to our network. For a node of type Author (A) a and a node of type MeSH (M) m, the HeteSim between a and m is

HeteSim(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot M'_{MP}[m,:]}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|M'_{MP}[m,:]\|}},   (4)

where M_{AP} and M_{MP} are the adjacency matrices between Author nodes and Paper nodes, and between MeSH nodes and Paper nodes, respectively. In Equation 4, the adjacency matrix is used instead of the reachable probability matrix to make our method more interpretable. It can be shown that, in an unweighted network, the formalisation of HeteSim using the adjacency matrix is the same as the formalisation of HeteSim based on the reachable probability matrix. Note that M_{MP} = M'_{PM}, and that the matrix product of M_{AP} and M'_{MP} is the weighted reachable matrix between node type Author and node type MeSH. Formally, we have

N(papers published by author a which include m) = M_{AP}[a,:] \cdot M'_{MP}[m,:],   (5)

where N means 'the number of'. Note that all elements of M_{MP} and M_{AP} are either 1 or 0, and thus we have

\|M_{AP}[a,:]\| = \sum M_{AP}[a,:].   (6)

Thus,

\sqrt{\|M_{AP}[a,:]\|} = \sqrt{\sum M_{AP}[a,:]} = \sqrt{N(papers published by author a)}.   (7)

In the same way,

\sqrt{\|M'_{MP}[m,:]\|} = \sqrt{\sum M'_{MP}[m,:]} = \sqrt{N(papers which include the MeSH term m)}.   (8)

Equation 4 can therefore be rewritten as

HeteSim(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot M'_{MP}[m,:]}{\sqrt{\sum M_{AP}[a,:]} \cdot \sqrt{\sum M'_{MP}[m,:]}},   (9)

and interpreted as

HeteSim(a, m \mid a \in A, m \in M) = \frac{N(papers published by author a which include the MeSH term m)}{\sqrt{N(papers published by author a)} \cdot \sqrt{N(papers which include the MeSH term m)}}.   (10)

Though HeteSim is quite suitable for our task, it has some disadvantages. The most important one is that HeteSim is, in a sense, a "global" measure. When the similarity between an author and a MeSH term is calculated, all papers are taken into consideration, even those which have no connection with the target author. For example, if someone publishes a paper with a MeSH term M1, the similarity between M1 and all authors changes: HeteSim measures the contribution of each author to the total knowledge (limited to the data set) of a MeSH term. However, the expertise we want to examine refers to the MeSH terms on which an author conducted most of his or her work. In a real-world situation, one can contribute to at most several hundred papers. If we compare this fraction of papers to the tremendous overall number of papers available in online databases, the similarity will be vanishingly small and the original HeteSim will perform poorly.
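To make the closed form in Equation 10 concrete, the following is a minimal NumPy sketch (an illustration with invented toy matrices, not the authors' implementation):

```python
import numpy as np

# Toy sketch of Equation 10 (not the authors' code). Rows of M_AP are authors,
# rows of M_MP are MeSH terms; columns of both are papers, entries are 0/1.
def hetesim(M_AP, M_MP, a, m):
    shared = M_AP[a] @ M_MP[m]          # papers by author a that carry MeSH term m
    n_a = M_AP[a].sum()                 # number of papers published by author a
    n_m = M_MP[m].sum()                 # number of papers that carry MeSH term m
    if n_a == 0 or n_m == 0:
        return 0.0
    return shared / (np.sqrt(n_a) * np.sqrt(n_m))

# 2 authors x 3 papers, 2 MeSH terms x 3 papers (invented data)
M_AP = np.array([[1, 1, 0],
                 [0, 0, 1]])
M_MP = np.array([[1, 0, 1],
                 [0, 1, 1]])
print(hetesim(M_AP, M_MP, a=0, m=0))    # 1 / (sqrt(2) * sqrt(2)) = 0.5
```

With 0/1 adjacency matrices, the dot product in the numerator counts the papers shared by the author and the term, exactly as in Equation 5.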
Modification of HeteSim (HeteAlloc). To address this shortcoming of HeteSim, here we propose a modified version, namely HeteAlloc. The underlying idea is to limit the calculation to a subset of papers, which can be selected according to the context. Formally, we have

HeteAlloc(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot (M_{sub}[a,:] \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|M_{sub}[a,:] \odot M_{MP}[m,:]\|}},   (11)

where \odot is the element-wise product, and M_{sub} is the subset selection matrix with

M_{sub}[a, n] = \begin{cases} 1 & \text{if the } n\text{-th paper is in the selected subset of target author } a \\ 0 & \text{otherwise.} \end{cases}   (12)

Like the original HeteSim, our method is based on the cosine of two vectors. As Pirotte et al. (2007) pointed out, the angle between the node vectors is a much more predictive measure than the distance between the nodes. The only difference is that the second vector is filtered by a row of the subset selection matrix. The selection of the subset is the essential part of our method, and requires a considerable amount of effort in the design and computation of the matrix multiplication. In what follows, we present three subset selection strategies, show how to compute the measure, discuss the advantages and disadvantages of each strategy, and finally provide interpretations.
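The masking in Equation 11 can be sketched as follows (invented toy data; M_sub here is a hand-picked hypothetical mask rather than one produced by the selection strategies):

```python
import numpy as np

# Sketch of Equation 11 (illustrative, not the authors' code): HeteAlloc filters
# the MeSH-paper vector with a per-author subset mask before the cosine-style ratio.
def hetealloc(M_AP, M_MP, M_sub, a, m):
    masked = M_sub[a] * M_MP[m]          # element-wise product: subset papers carrying m
    num = M_AP[a] @ masked
    denom = np.sqrt(M_AP[a].sum()) * np.sqrt(masked.sum())
    return 0.0 if denom == 0 else num / denom

M_AP = np.array([[1, 1, 0, 0],           # author 0 wrote papers 0 and 1
                 [0, 0, 1, 1]])
M_MP = np.array([[1, 0, 1, 1]])          # MeSH term 0 appears in papers 0, 2, 3
M_sub = np.array([[1, 1, 1, 0],          # hypothetical subset mask for author 0
                  [0, 0, 1, 1]])
print(hetealloc(M_AP, M_MP, M_sub, a=0, m=0))   # 1 / (sqrt(2) * sqrt(2)) = 0.5
```

Because paper 3 is outside author 0's subset, it no longer inflates the denominator, which is the intended effect of the modification.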
5.1 Subset of co-authors' papers.

The HeteSim measure should therefore be limited to the subset of papers published either by our target author or by those who have co-authored with this author. To find the subset, we provide the following definition:

Binary Reachable Matrix of Path Length i: Given a relation A → B and the adjacency matrix W_{AB} between type A and type B, the Binary Reachable Matrix of Path Length i from A to B following the meta-path (AB)^i is

RM^{(i)}_{AB}(m, n) = \begin{cases} 0 & \text{if } M^{(i)}_{AB}(m, n) = 0 \\ 1 & \text{otherwise,} \end{cases}   (13)

where M^{(i)}_{AB} = W_{AB} \cdot (W_{BA} \cdot W_{AB})^{(i-1)}.

The selected subset, RM^{(2)}_{AP}, follows the meta-path 'APAP', which, for each author, creates the subset of papers published by the author or his/her co-authors. To be more specific, the n-th row of RM^{(2)}_{AP} is a vector whose m-th value is 1 if, for the n-th author, paper m is included in the subset. To this end, we define HeteAlloc as

HeteAlloc(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot (RM^{(2)}_{AP}[a,:] \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|RM^{(2)}_{AP}[a,:] \odot M_{MP}[m,:]\|}},   (14)

which can be interpreted as

HeteAlloc(a, m) = \frac{N(papers of a which include m)}{\sqrt{N(papers of a)} \cdot \sqrt{N(papers of a's co-authors which include m)}}.   (15)

The advantage of this selection strategy is that the similarity between an author and any MeSH term will not be influenced by an irrelevant global change of the data set. The subset matrix is constant across all target MeSH terms. However, this selection does not reflect the specific MeSH term on which an author has collaborated with another author, and simply includes the papers of all co-authors in the subset.

5.2 Subset of co-authors' papers in a target
MeSH term.

The basic idea of this strategy is to add the target MeSH term as another constraint for selecting the subset. The subset includes all papers published by the target author and by the authors who have co-authored with him or her on the target MeSH term. Since this subset varies across MeSH terms, we use the reachable vector of a and m to replace the row of the subset matrix:

HeteAlloc(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot (RV^{(a,m)}_{sub} \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|RV^{(a,m)}_{sub} \odot M_{MP}[m,:]\|}},   (16)

RV^{(a,m)}_{sub}(1, n) = \begin{cases} 0 & \text{if } V^{(a,m)}_{sub}(1, n) = 0 \\ 1 & \text{otherwise,} \end{cases}   (17)

where

V^{(a,m)}_{sub} = (W_{AP}(a,:) \odot W_{MP}(m,:)) \cdot W_{PA} \cdot W_{AP}.   (18)

Equation 16 can be interpreted as

HeteAlloc(a, m) = \frac{N(papers of a which include m)}{\sqrt{N(papers of a)} \cdot \sqrt{N(papers of a's co-authors which include m)}}.   (19)

The advantage of this selection strategy is that the similarity between an author and any MeSH term will not be influenced by irrelevant global changes of the data set. The similarity is MeSH-sensitive, and the subset vector can filter out co-authors who had no experience with the target MeSH term. However, this selection will lead to a low score for those who have worked with very experienced authors.

5.3 Subset of all papers published by the co-authors of the focal paper.

For each paper p, the subset includes all papers published by the co-authors of p. For each pair of author a and MeSH term m, the calculation is conducted for every paper p of author a which includes the MeSH term m, and the average or the sum over all such papers is used as the final score. The sum can be considered as a method for credit allocation, and the average as a similarity measure. Here we shall use the sum as an example:

HeteAlloc(a, m) = \sum_{p \in P_a} HeteAlloc(a, p, m),   (20)

HeteAlloc(a, p, m) = \frac{M_{AP}[a,:] \cdot (RV^{(a,p)}_{sub} \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|RV^{(a,p)}_{sub} \odot M_{MP}[m,:]\|}},   (21)

RV^{(a,p)}_{sub}(1, n) = \begin{cases} 0 & \text{if } V^{(a,p)}_{sub}(1, n) = 0 \\ 1 & \text{otherwise,} \end{cases}   (22)

where

V^{(a,p)}_{sub} = \left(W_{PA} \cdot W_{AP}\right)(p,:).   (23)

Equation 21 can be interpreted as

HeteAlloc(a, m) = \sum_{\text{all papers } p \text{ of } a} \frac{N(papers of a which include m)}{\sqrt{N(papers of a)} \cdot \sqrt{N(papers of co-authors of paper p)}}.   (24)

This similarity avoids a significant decrease when the target author co-authors with a more experienced author on the target MeSH term. The similarity retains the property of having a
MeSH-sensitive subset. Notice that this method works better when applied to calculating the absolute value of expertise.
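As an illustration of the third strategy (Equations 20 to 23), the sketch below sums the per-paper scores over the focal author's papers; the toy matrices are invented, and the subset of a paper p is taken to be all papers written by p's authors:

```python
import numpy as np

# Illustrative sketch of the third subset selection strategy (sum version).
# W_AP: authors x papers, W_MP: MeSH x papers, entries are 0/1 (invented data).
def ha3(W_AP, W_MP, a, m):
    W_PA = W_AP.T
    # row p of W_PA @ W_AP reaches every paper written by an author of p
    papers_of_coauthors = (W_PA @ W_AP > 0).astype(int)
    total = 0.0
    # iterate over a's papers that carry MeSH term m (the focal papers)
    for p in np.nonzero(W_AP[a] * W_MP[m])[0]:
        rv = papers_of_coauthors[p]            # binary reachable vector for paper p
        masked = rv * W_MP[m]                  # subset papers that carry m
        num = W_AP[a] @ masked
        denom = np.sqrt(W_AP[a].sum()) * np.sqrt(masked.sum())
        if denom > 0:
            total += num / denom
    return total

# 2 authors, 3 papers, 1 MeSH term; authors 0 and 1 co-wrote paper 1
W_AP = np.array([[1, 1, 0],
                 [0, 1, 1]])
W_MP = np.array([[1, 1, 1]])
print(ha3(W_AP, W_MP, a=0, m=0))   # 1.0 + 2/sqrt(6), roughly 1.816
```

Summing over the focal author's papers, as in Equation 20, is what turns the per-paper cosine ratio into a cumulative credit allocation rather than a bounded similarity.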
6.1 Weighted HeteAlloc
The formalisation above is based on an unweighted network. Yet, one may want to capture the concentration of an author's effort on a specific topic (MeSH term). For example, let us suppose that all papers of author A1 contain only one MeSH term, M1, whereas all papers of author A2 contain two MeSH terms, M1 and M2. In this case, one may argue that A1 concentrates more than A2 on M1, as A1 has worked exclusively on this topic while A2 has also worked on the additional topic M2. According to this idea, we propose a weighted version of HeteAlloc which accounts for the weights of the links between papers and MeSH terms. The weight of a link between a paper and a MeSH term is inversely proportional to the number of MeSH terms associated with the paper. HeteAlloc can be applied to a weighted network by using U_{MP} instead of M_{MP}, where U_{MP} is the matrix obtained by normalising each column of M_{MP}.

The weighted HeteAlloc can capture authors' concentration on specific topics and identify the authors whose papers are focused on smaller sets of MeSH terms. However, this characteristic is not necessarily an advantage, but simply a different strategy to deal with the number of MeSH terms in a paper. There may exist different views about the similarity between an author and a given MeSH term. For example, one may believe that an author is entirely devoted to a given research topic if each of his or her papers contains the corresponding MeSH term. In this case, the similarity between the author and the MeSH term would be equal to one (i.e., the idea behind the unweighted version). However, others may believe that the similarity between the author and the MeSH term should never be equal to 1 unless an author's work is exclusively about this MeSH term (i.e., the idea behind the weighted version). The decision should be made after careful examination of the context, and should also be based on the assumptions made by potential users of the method (e.g., researchers or funding agencies).

Here we shall provide our personal recommendation and blueprint. When papers carry few MeSH terms, the weighted version will work better, since it is not common for researchers to work on completely different MeSH terms (say, Finance and Chemistry). However, when the division of topics is too fragmented and most papers have many MeSH terms, the weighted version may not perform well, and the unweighted version would be recommended.

6.2 Iterative calculations over the years

The original
HeteSim is designed for a "static" measurement of similarity. However, authors keep publishing papers over the years, and their expertise may change over time. When expertise is measured at year T, only the papers published before this year should be considered. To make our method HeteAlloc applicable to dynamic calculation, we distinguish the links connecting Author and Paper between the experience/history links before year T and the update links at year T. This can be done by using two adjacency matrices: M_{update} and M_{experience}. Since it is difficult to identify the time ordering of publications within year T, we assume that papers of year T were published at the same time. The formalisation of HeteAlloc needs to be modified, and the calculation, based on the modified measure, can be conducted iteratively over the years. We shall refer to the modified algorithm as
DynamicHeteAlloc (DHA), and the corresponding formalisation is

DHA(a, m) = \sum_{p_i \in M_{update}[a,:] \odot M_{MP}[m,:]} DHA(a, p_i, m)   (25)

and

DHA(a, p_i, m) = \frac{(M_{experience}[a,:] + I_{nn}[p_i,:]) \cdot (V_{subset}(p_i) \odot M_{MP}[m,:])'}{\sqrt{\|M_{experience}[a,:] + I_{nn}[p_i,:]\|} \cdot \sqrt{\|V_{subset}(p_i) \odot M_{MP}[m,:]\|}},   (26)

where

V_{subset}(p_i) = M'_{update}[p_i,:] \cdot M_{experience} + I_{nn}[p_i,:].   (27)

For each paper, we add I_{nn}[p_i,:] to M_{experience}[a,:] in Equation 26 to include the current paper in the experience paper set, so as to avoid the case where M_{experience} is a zero matrix. According to the formalisation of DHA, we have implemented Algorithm 1:
Algorithm 1 Dynamic HeteAlloc
Input: link lists for every year, MeSH lists
Output: expertise of every author
1: initialise list_pre as a blank list; load the MeSH list as M_MP;
2: for each year ∈ [1946, …] do
3:   load list_year as list_cur;
4:   Sparse Matrix Creation (Algorithm 2);
5:   for each AuthorID ∈ list_cur do
6:     if M_update[AuthorID, :] is a null vector then
7:       skip to the next iteration;
8:     end if
9:     find the MeSH terms that need updating, MeSH_update;
10:    create an empty dictionary dic_cur;
11:    if AuthorID exists in the expertise dictionary dic_expts then
12:      use dic_expts[AuthorID] to replace dic_cur;
13:    end if
14:    for each MeSHID ∈ MeSH_update do
15:      initialise HeteAlloc_value as zero;
16:      if MeSHID is in dic_cur then use dic_cur[MeSHID] to replace HeteAlloc_value; end if
17:      update HeteAlloc_value by adding the result of DynamicHeteAlloc(AuthorID, MeSHID);
18:      update dic_cur[MeSHID] with HeteAlloc_value;
19:      update dic_expts[AuthorID] with dic_cur;
20:    end for
21:  end for
22: end for
23: write out dic_expts.

Algorithm 2 Sparse Matrix Creation
Input: list_pre, list_cur, MeSH lists
Output: M_experience, M_update, updated list_pre, dictionaries
1: merge list_pre and list_cur as list_all;
2: create a dictionary from list_all for mapping nodes to indexes;
3: use the dictionary to map list_pre to M_experience and list_cur to M_update;
4: replace list_pre with list_all; return the dictionaries for mapping.

An example of this method using illustrative networks is provided in the Appendix. The results are given in the form of expertise matrices, where the value in row i and column j indicates the expertise of Author i on MeSH j. In the example, we use the publication lists of 4 authors from year 1 to year 10 and calculate the expertise matrices for each author at each year. We also show the result of the (BL) method, which attributes every MeSH term of a paper equally to all co-authors. In this case, the expertise of a focal author is computed through the cumulative counts of
MeSH terms associated with all publications of the author. Thus, in the expertise matrix calculated with the (BL) method for a year t, the value in row i and column j is equal to the number of papers published by Author i with MeSH j before year t.

To compare the performance of the different subset selections on the HIN, we calculated the similarity between all pairs extracted from the pair set {a, m | a ∈ Author, m ∈ MeSH} on three small example networks, using the (BL) method mentioned above, the original HeteSim, the HeteAlloc with the subset of co-authors' papers (HA1), the HeteAlloc with the subset of co-authors' papers in a target MeSH term (HA2), the HeteAlloc with the subset of all papers published by the co-authors of the focal paper (HA3), and the corresponding weighted versions of HA1, HA2, HA3 (i.e., WHA1, WHA2, WHA3).

In the first example in Fig. 3, BL, HA2 and HA3 perform well (see Table 1; the similarities characterised by better performance are highlighted in bold). These methods can uncover the differences between the author and MeSH pairs in this network. Since each paper has a single MeSH term, the weighted versions in this example degenerate to the unweighted ones.
Fig. 3 Example network 1

In the second example network in Fig. 4, HA3 performs well. Compared to the other methods, only HA3 assigns a higher similarity to the pair formed by M1 and the author who did most of the work on M1. Since each paper has a single MeSH term, the weighted versions in this example again degenerate to the unweighted ones.
Table 1 Results based on example network 1

               Baseline  Original  Unweighted             Weighted
Pair \ Method  BL        HeteSim   HA1    HA2    HA3     WHA1   WHA2   WHA3
(A1,M1)        0.577     0.577     0.577  0.577  0.577   0.577  0.577  0.577
(A1,M2)
(A2,M1)        0.577     0.577     0.577  0.577  0.577   0.577  0.577  0.577
(A2,M2)
Fig. 4 Example network 2

Table 2 Results based on example network 2

               Baseline  Original  Unweighted             Weighted
Pair \ Method  BL        HeteSim   HA1    HA2    HA3     WHA1   WHA2   WHA3
(A1,M1)        1         0.577     0.632  0.632
(A1,M2)        0         0         0      0      0       0      0      0
(A2,M1)        1         0.816     0.894  0.894
(A2,M2)        0         0         0      0      0       0      0      0
(A3,M1)        0.707     0.288     0.707  0.707
(A3,M2)        0.707     0.707     0.707  0.707  0.707   0.707  0.707  0.707
For the third example shown in Fig. 5, the weighted methods differentiate between Sim(A1, M1) and Sim(A2, M1). Both A1 and A2 have published on M1, and the only difference between them is that one of A2's papers also includes M2. The weighted version can capture the concentration of research efforts on some MeSH terms, and is biased in favour of the authors whose papers are more concentrated on a smaller MeSH set.
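Concretely, the column normalisation behind the weighted variants (using U_MP in place of M_MP, as described in the section on the weighted version) can be sketched as follows; the paper-term matrix is invented:

```python
import numpy as np

# Sketch of the weighting scheme: each paper's MeSH weights are divided by the
# number of MeSH terms it carries, so every column of U_MP sums to one.
M_MP = np.array([[1, 1, 0],     # MeSH terms x papers, 0/1 annotations (toy data)
                 [0, 1, 1]])
col_sums = M_MP.sum(axis=0)
U_MP = M_MP / np.where(col_sums == 0, 1, col_sums)   # guard against empty columns
print(U_MP)
# paper 0: all weight on term 0; paper 1: 0.5 on each term; paper 2: all weight on term 1
```

A paper with many MeSH terms thus contributes only a fraction of a count to each term, which is what penalises broadly annotated papers relative to the unweighted version.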
Fig. 5 Example network 3

Table 3 Results based on example network 3

               Baseline  Original  Unweighted             Weighted
Pair \ Method  BL        HeteSim   HA1    HA2    HA3     WHA1   WHA2   WHA3
(A1,M1)        1         0.943     0.816  0.816  0.908   0.943  0.943
(A1,M2)        0         0         0      0      0       0      0      0
(A2,M1)        0.816     0.707     0.816  0.816  0.908   0.707  0.707
(A2,M2)        0.577     0.236     0.5    0.5    0.5     0.316  0.316  0.316
(A3,M1)        0         0         0      0      0       0      0      0
(A3,M2)        1         0.943     0.816  0.816  0.908   0.943  0.943
(A4,M1)        0.577     0.236     0.5    0.5    0.5     0.316  0.316  0.316
(A4,M2)        0.816     0.707     0.816  0.816  0.908   0.707  0.707
From the three examples above, the third subset selection strategy (i.e., the subset of all papers published by the co-authors of the focal paper) outperforms the other two strategies. Moreover, by taking the sum of all scores (i.e., similarity measures) obtained from all publications of the focal author, this method enables us to evaluate the global expertise of an author based on his or her entire scientific production. In what follows, we will use the third selection strategy and perform a comparison between our method (DHA) and the (BL) method applied to the MEDLINE data set. As most publications in our data set are associated with multiple MeSH terms, we chose the unweighted version of our method.

The output of both methods is a vector associated with each author, representing his or her expertise on each topic (i.e., MeSH term). To compare the two methods, for each author we consider the following measures: (1) the ratio between the maximum and minimum values of the author's expertise; (2) the author's maximum normalised expertise (i.e., obtained by dividing all values in a vector by its norm); and (3) the normalised maximum expertise of authors that had published more than 10 papers at the time of the assessment of expertise (i.e., criterion 2 applied only to the subset of productive authors). Moreover, for every year, we calculate the mean and standard deviation of the values produced by the above assessment measures, and compare them between methods.
Table 4 Comparison between DHA and BL based on the first 10 years of the MEDLINE data set

              Measure (1)      Measure (2)    Measure (3)
Year          DHA     BL       DHA    BL      DHA    BL
1948  mean    2.05    1.45     0.60   0.58    0.57   0.52
      std     3.54    1.13     0.17   0.17    0.14   0.12
1949  mean    2.72    1.66     0.60   0.58    0.59   0.54
      std     6.24    1.63     0.16   0.16    0.14   0.12
1950  mean    3.48    1.84     0.60   0.57    0.60   0.55
      std     9.59    2.09     0.16   0.15    0.14   0.12
1951  mean    4.37    2.06     0.59   0.56    0.61   0.56
      std     13.85   2.65     0.15   0.14    0.14   0.12
1952  mean    5.22    2.24     0.59   0.56    0.61   0.56
      std     18.36   3.15     0.15   0.14    0.14   0.12
1953  mean    6.05    2.39     0.59   0.55    0.61   0.56
      std     23.02   3.60     0.15   0.14    0.14   0.11
1954  mean    6.85    2.53     0.59   0.55    0.61   0.56
      std     28.05   4.01     0.15   0.13    0.14   0.11
1955  mean    7.65    2.66     0.59   0.54    0.61   0.55
      std     33.04   4.41     0.15   0.13    0.14   0.11
1956  mean    8.41    2.78     0.59   0.54    0.61   0.55
      std     38.16   4.79     0.15   0.13    0.14   0.11
1957  mean    9.14    2.88     0.59   0.54    0.61   0.55
      std     43.32   5.13     0.15   0.13    0.14   0.11

(1) the ratio between maximum and minimum values of the author's expertise; (2) the author's maximum normalised expertise (i.e., obtained by dividing all values in a vector by its norm); (3) the normalised maximum expertise of authors that have published more than 10 papers at the time of the assessment of expertise.
Fig. 6 Comparison between DHA and BL using the normalised maximum expertise of productive authors (histogram; x-axis: author's maximum expertise; y-axis: frequency of authors; series: DHA, BL)

The results reported in Table 4 show that the mean and standard deviation of the ratio between the maximum and minimum values of an author's expertise obtained with the DHA method are higher than those obtained with the BL method. This suggests that DHA can better distinguish authors according to their areas of expertise, whereas BL treats all authors involved in works relevant to multiple topics as interdisciplinary authors (i.e., with the same expertise on all MeSH terms, thus producing smaller ratios of maximum to minimum expertise). The results based on the normalised maximum expertise of
DHA are similar to those of BL when all authors are considered, but they differ when the methods are applied only to the restricted subset of productive authors, which suggests that our method has the potential to identify authors' main areas of expertise precisely when they are most likely to work in multiple areas.

Figure 6 shows the frequency of productive authors with normalised maximum expertise ranging from 0 to 1. The (BL) method shows no authors with maximum expertise higher than 0.9, which suggests that no researcher is dedicated to one single area and that the maximum expertise of most authors lies in the middle of the range. In contrast, the results obtained with our method clearly highlight its ability to identify both specialised authors who preferentially focus on one area (i.e., with high maximum expertise) and interdisciplinary authors whose work spans different areas (i.e., with low maximum expertise).
In this work, we have proposed a new method based on path similarity and a number of subset selection strategies to identify authors' expertise. Our method differs from previous work in that it assigns expertise to a focal author by accounting for co-authors' contributions to the works they were involved in. We have shown that our method can be applied to the HIN constructed from the MEDLINE corpus. However, the applicability of our method is not limited to one data set. Indeed, if we replace MeSH terms with the topic tags in MAG, our method can be directly applied to MAG. In this case, it can retrieve authors' expertise based on topics as classified in MAG, and it can be suitably adjusted to reflect the depth and granularity required by users. In more general cases, users can generate their own topics from documents using topic modelling or other methods. By linking the generated topics and the corresponding documents, users can produce networks similar to those shown in Fig. 2 and then apply our method by selecting an appropriate subset. Our work can also be used to complement standard approaches, for example in conjunction with topic modelling for documents or with topic classification systems.

The lack of a ground truth does not enable a definitive validation of our method. While this represents a limitation of our work, it also opens up new avenues for future work. For example, to mitigate this limitation, we could check the Contributor Roles Taxonomy (CRediT) author statements available from several journals to identify which author was involved in which part of the research. However, CRediT statements are self-declared and not verifiable, which again highlights the need for methods such as the one proposed in this article. Moreover, the CRediT author statements are not detailed enough to unambiguously indicate which specific expertise (e.g., MeSH term) should be associated with which author. Another possibility is to handpick some very interdisciplinary papers (i.e., with many MeSH terms). By reading the CVs of the authors or searching for relevant information about them, we might be able to infer the MeSH terms associated with each author, and then compare our prior knowledge with the results obtained using our method. This test represents a "sanity check", and an example is given in the Appendix.

Our method has a number of important applications for research and practice. Understanding the composition of a team and being able to associate each co-author of a paper with one or several fields of expertise can spur new studies of the interdisciplinarity of research teams. For example, our method will enable us to distinguish between interdisciplinary papers co-authored by researchers with overlapping expertise and equally interdisciplinary papers in which the co-authors have non-overlapping research profiles. This, in turn, could shed further light on the impact of team diversity on scientific success and knowledge creation. Moreover, being able to identify expertise facilitates a comparative assessment of two equally interdisciplinary studies, one pursued by an individual and the other by a group of researchers. In particular, our method enables us to distinguish between research pursued by one individual scholar with a highly interdisciplinary background and research pursued by an interdisciplinary group comprising several highly specialised scholars. This variation in the type and sources of interdisciplinarity is likely to be a critical nuance with non-trivial implications for innovation, research performance, and the long-term impact of publications.

Our method also has practical implications for funding agencies, research institutions and scientists. First, it can assist funding agencies in the identification of appropriate reviewers with the right competence to evaluate research proposals. In turn, it may also assist reviewers in uncovering possible gaps between a proposed research project and the combined expertise of the pool of applicants. Second, our method can help research institutions develop effective recruitment policies targeted at strengthening specific research fields or at developing new and fast-growing areas that require a prompt investment of resources. Finally, the identification of specific expertise can help scientists identify potential collaborators and shape successful research groups.

A Appendix
A.1 Example of DHA using illustrative networks

Here we show how our method works out in full using illustrative networks, and we then compare the results with those obtained using the BL method. Figure 7 shows the illustrative networks from year 1 to year 5 (identical networks for the five years). Figure 8 shows the illustrative networks from year 6 to year 10 (identical networks for the five years). Before year 5, the four authors worked separately: A1 published P1 with M2 and M3; A2 published P2 with M1 and M3, and P3 with M1; A3 published P4 with M2 and M3, and P5 with M2; and A4 published P6 with M1 and M3. From year 6, A1 and A2 co-authored P1 with M2 and M3; A2 and A3 co-authored P2 with M1 and M2; and A3 and A4 co-authored P3 with M1 and M3. The publication lists can be found in Tables 5 and 6.

Based on their experience, it is not likely for A2 to have worked on M2 in P2; similarly, it is not likely for A3 to have worked on M1 in P2. The expertise matrices below are reported for BL and DHA, respectively. The results are similar between year 1 and year 5 and begin to differentiate from year 6. At the end of year 5, both methods suggest that all four authors had similar expertise on M3, whereas A2 and A3 were experts on M1 and M2, respectively. BL simply counts the number of papers each author published on every MeSH term, and adds them together. Following this idea, from year 6 onwards A2 and A3 receive equal credit for M1 and M2 from P2, so that A2 acquires the same expertise on M2 as A3, and A3 acquires the same expertise on M1 as A2. The results obtained using DHA, instead, gave the expected result: i.e., A2 remains chiefly an expert on M1, and A3 on M2.

Table 5
Publication list in the illustrative networks from year 1 to year 5
Author  Paper  MeSH    Year
A1      P1     M2, M3  1, 2, 3, 4, 5
A2      P2     M1, M3  1, 2, 3, 4, 5
A2      P3     M1      1, 2, 3, 4, 5
A3      P4     M2, M3  1, 2, 3, 4, 5
A3      P5     M2      1, 2, 3, 4, 5
A4      P6     M1, M3  1, 2, 3, 4, 5
Fig. 7
Illustrative networks from year 1 to year 5
Table 6
Publication list in the illustrative networks from year 6 to year 10
Author   Paper  MeSH    Year
A1, A2   P1     M2, M3  6, 7, 8, 9, 10
A2, A3   P2     M1, M2  6, 7, 8, 9, 10
A3, A4   P3     M1, M3  6, 7, 8, 9, 10
Fig. 8
Illustrative networks from year 6 to year 10

Equations (28) to (37) report the year-by-year expertise matrices M_t^{BL} and M_t^{DHA} for t = 1, …, 10 (rows: authors A1 to A4; columns: MeSH terms M1 to M3).

A.2 An example based on real data
Here we provide an example based on a focal paper and show the results obtained using our method. The title of this focal paper is "Calcium Levels and Calciuria in Decalcification in Acromegaly" (https://pubmed.ncbi.nlm.nih.gov/13327374/). It was published in 1956 and co-authored by five authors: S. de Sèze, A. Lichtwitz, D. Hioco, M. Delaville, H. Gille. Table 7 shows the MeSH terms associated with this paper, the relevant MeSH Tree IDs and the corresponding category names. Table 8 shows the expertise of the five co-authors on the MeSH terms associated with the focal paper before the year 1956. The first author, Stanislas de Sèze, was a pioneering scholar of French rheumatology. He was already an expert in the categories Eukaryota (which includes Humans), Musculoskeletal Diseases and Nervous System Diseases, as indicated by the high values in his expertise vector: 90 for B01, 42 for C05 and 12 for C10. The second author, Alfred Lichtwitz, mainly worked on D06, B01 and C19. The third author, Denis Hioco, mainly worked on D01, D06 and A12. The fourth author, M. Delaville, mainly worked on B01, D06 and D01. The last author, Halvor Gille, was a new author, and this paper was his first publication.

Although there were some overlaps among the co-authors' profiles, each of them (except the new author) had some major background knowledge in selected research areas. The desired method should be able to add appropriate value to the co-authors' expertise vectors and update them so that they better represent the evolution of the co-authors' expertise.

The results are given in Table 9. Upon publication of this paper, Stanislas de Sèze obtains 0.762 on B01, 0.371 on C05 and 0.106 on C10, since he was the most experienced author in these three categories. Similarly, D. Hioco obtains 0.315 on D01 and 0.265 on A12; A. Lichtwitz obtains 0.193 on D01 and 0.211 on C19. However, M. Delaville does not achieve a high score, as he was not the most experienced author in any of these categories. As for the new author, he gains some experience in nearly every category, especially those in which no one had much experience: in this example, he obtains 0.535 on D23, 0.424 on G02 and 0.366 on G03. In general, our method clearly returns a reasonable result which meets our expectations.
Table 7 MeSH terms associated with the focal paper, relevant MeSH Tree IDs and corresponding category names

MeSH term                  MeSH Tree IDs    Categories
Acromegaly                 C05, C10, C19    Musculoskeletal Diseases; Nervous System Diseases; Endocrine System Diseases
Calcium                    D01, D23         Inorganic Chemicals; Biological Factors
Hormones                   D06, D27         Hormones, Hormone Substitutes, and Hormone Antagonists; Chemical Actions and Uses
Humans                     B01              Eukaryota
Osteoporosis               C05, C18         Musculoskeletal Diseases; Nutritional and Metabolic Diseases
Phosphorus                 D01              Inorganic Chemicals
Urine                      A12              Fluids and Secretions
Water-Electrolyte Balance  G02, G03, G07    Chemical Phenomena; Metabolism; Physiological Phenomena
Table 8 Expertise of co-authors on the MeSH terms associated with the focal paper before year 1956

              D06    D27    B01     D01    D23    A12    C05     C10     C19    G02    G03    G07    C18
M. Delaville  4.860  2.477  10.200  3.235  0.188  0.915  1.472   0.211   2.758  0.456  0.089  1.010  0.971
H. Gille      0.000  0.000  0.000   0.000  0.000  0.000  0.000   0.000   0.000  0.000  0.000  0.000  0.000
A. Lichtwitz  7.139  3.963  22.821  3.141  1.295  0.987  4.754   1.064   6.219  2.074  2.406  2.576  2.821
D. Hioco      3.283  2.543  1.172   3.887  0.863  2.338  0.444   0.289   0.973  0.000  0.816  0.131  2.014
De Sèze       3.514  0.682  90.230  0.417  0.196  0.157  42.682  12.108  0.390  0.213  0.000  0.133  0.697
Table 9 Expertise acquired from the focal paper

              D06    D27    B01    D01    D23    A12    C05    C10    C19    G02    G03    G07    C18
M. Delaville  0.185  0.097  0.084  0.141  0.041  0.048  0.005  0.006  0.092  0.038  0.027  0.051  0.038
H. Gille      0.101  0.183  0.011  0.165
A.3 Summary

In Appendix A.1, we showed how our method works out in full using illustrative networks, and then compared the results with those obtained with the BL method. In this example, four authors with their publication lists over 10 years are given. By checking the publication history of those authors, we can indeed confirm that the second and the third authors are experts in different topics. Our method was able to correctly identify the expertise of each author. However, the BL method gave a result according to which the research profiles of the two authors were the same. This example and the comparison between methods thus showed that our method outperformed the BL one.

In Appendix A.2, we gave an example of a handpicked paper, and provided the results obtained using our method. We showed that our method correctly assigned expertise to the most experienced author on most MeSH terms, and that authors would not acquire much experience in categories with which they were not familiar. The results showed that our method was able to add appropriate value to the co-authors' expertise vectors and update them so that they could better represent the evolution of the co-authors' expertise.

Despite the lack of ground-truth data to definitively validate the performance of our method, the examples in the Appendix provide some possible ways to test it. The results showed that our method can provide a reasonable assessment of authors' expertise.
References

AlShebli BK, Rahwan T, Woon WL (2018) The preeminence of ethnic diversity in scientific collaboration. Nature Communications 9(1):5163
Balog K, De Rijke M, et al. (2007) Determining expert profiles (with an application to expert finding). In: IJCAI, vol 7, pp 2657-2662
Balog K, Fang Y, de Rijke M, Serdyukov P, Si L, et al. (2012) Expertise retrieval. Foundations and Trends in Information Retrieval