A network approach to expertise retrieval based on path similarity and credit allocation
Xiancheng Li, Luca Verginer, Massimo Riccaboni, Pietro Panzarasa
Abstract
With the increasing availability of online scholarly databases, publication records can be easily extracted and analysed. Researchers can promptly keep abreast of others' scientific production and, in principle, can select new collaborators and build new research teams. A critical factor one should consider when contemplating new potential collaborations is the possibility of unambiguously defining the expertise of other researchers. While some organisations have established database systems to enable their members to manually produce a profile, maintaining such systems is time-consuming and costly. Therefore, there has been a growing interest in retrieving expertise through automated approaches. Indeed, the identification of researchers' expertise is of great value in many applications, such as identifying qualified experts to supervise new researchers, assigning manuscripts to reviewers, and forming qualified teams. Here, we propose a network-based approach to the construction of authors' expertise profiles. Using the MEDLINE corpus as an example, we show that our method can be applied to a number of widely used data sets and outperforms other methods traditionally used for expertise identification.
Keywords
Expertise retrieval · Path similarity · Credit allocation · Heterogeneous Information Networks
Introduction

The increasing complexity of research problems calls for innovative solutions which combine knowledge from different scientific disciplines (Van Rijnsoever and Hessels 2011). Therefore, many researchers become involved in interdisciplinary projects, thus collaborating with people with a variety of expertise. When facing the task of finding collaborators, scholars need to answer two inter-related questions: 1) how to identify an expert, i.e., how to find someone who is competent in a given field; and 2) how to profile an expert, i.e., how to identify the fields in which a given scholar is an expert. Together, these two questions describe the objective of expertise retrieval (Balog et al. 2012). Indeed, figuring out the research areas associated with an individual represents a challenging research problem. Search engines
such as Google Scholar or DBLP are of great help for finding documents (Hertzum and Pejtersen 2000). However, these engines only return scientific documents, not the specific expertise of people. Even in an academic environment, researchers still have to rely on their social networks to identify the expertise of others (Hofmann et al. 2010).

Author affiliations: Xiancheng Li, School of Business and Management, Queen Mary University of London, London, E-mail: [email protected] · Luca Verginer, Chair of Systems Design, ETH Zürich, Zürich, Switzerland · Massimo Riccaboni, IMT School for Advanced Studies, Lucca, Italy · Pietro Panzarasa, School of Business and Management, Queen Mary University of London, London.

Identifying experts is crucial for academic groups when they need to involve a collaborator with specific expertise. In organisational settings, knowing the expertise of relevant researchers facilitates the assignment of important roles and jobs. For example, conference organisers may search for moderators, session chairs and keynote speakers with the proper expertise, and universities may want to recruit researchers with expertise in a particular fast-developing area to improve their reputation. A good method for expertise retrieval is therefore fundamental to provide the necessary knowledge for such activities.

However, expertise retrieval is challenging for many reasons. First, expertise is a relatively abstract concept, and there is currently no consensus on how to define it. Moreover, expertise is a particular kind of knowledge stored in one's mind, and is thus hard to identify; the only way to access people's expertise is through their works, e.g., documents, books, and articles. Second, experts' names are often ambiguous. A single name may belong to multiple people, and the name of the same expert can vary in different databases.
Indeed, name disambiguation has recently become a specific and independent area of enquiry, and many studies have been carried out in this field (Smalheiser and Torvik 2009). Finally, it is difficult to evaluate the strength of the association between an expert and the works he or she has been involved in, especially because an increasing amount of scientific production is co-authored by multiple individuals. These challenges have made expertise retrieval a multi-faceted research area. In particular, since we learn about researchers' expertise mainly from their publications, the task of expertise retrieval has mainly been articulated into identifying the knowledge areas/topics in the text corpus and assigning them to the researchers (Silva et al. 2018).

Inspired by previous approaches to credit allocation (Shen and Barabási 2014) and by recent studies on node similarity in heterogeneous information networks (HINs) (Shi et al. 2014), we formalise the topics/expertise extracted from a given scientific publication as credit to be assigned to the co-authors of the publication, and propose a new method to allocate it to the co-authors based on their publication histories. Traditional approaches to the identification of knowledge areas within a text corpus use topic-modelling methods such as Latent Dirichlet Allocation (LDA) based on controlled vocabulary from well-known classification systems such as the Medical Subject Headings (
MeSH) in MEDLINE and the topic tags in the Microsoft Academic Graph (MAG; https://academic.microsoft.com/topics).

Our work focuses on the process of evaluating the degree of each co-author's contribution to a collaborative work. We propose a new method for properly assigning the expertise to each co-author according to his or her contribution. Our method differs from traditional ones, where the contribution of authors is assumed to be equal or is assessed simply based on the order of authors in the byline. Moreover, our method can deal with large-scale data sets, and produces results that vary dynamically as the data set is updated over time. Unlike some citation-based approaches to the assessment of contributions, which require a certain time for citations to accumulate, our method is experience-based, and authors' expertise is updated as soon as new records are added to the data set.

The rest of the article is organised as follows. In Section 2 we review the strengths and limitations of the existing literature on expertise identification, and motivate our work. In Section 3 we introduce the data used in our study. In Section 4 and Section 5 we present our new method and different selection strategies. In Section 6, we provide some extensions to account for weights and time. In Section 7 we report results obtained using the MEDLINE corpus and various examples. Section 8 summarises the findings of this work and outlines their implications for research and practice.

Related work

Previous work on expert profiling has primarily focused on identifying and ranking topics for a given expert (Balog et al. 2007; Serdyukov et al. 2011). However, only few studies have considered the temporal aspects of expertise. The work by Tsatsaronis et al. (2011) was one of the first studies which focused on the evolution of authors' expertise over time.
Their work was based on co-authorship information, and proposed evolution indices to measure the dynamics of authors' expertise. Inspired by their work, Rybak et al. (2014) constructed temporal hierarchical expertise profiles using topic models. Typically, the underlying question of expert profiling is: what topics does a person know about? (Balog et al. 2007; Rybak et al. 2014). Indeed, the word "topic" is commonly used in the various definitions of expertise because traditional approaches to expertise profiling rely on topic models and Natural Language Processing (NLP) techniques (Van Gysel et al. 2016). The main purpose of using those models is to classify documents into a number of topics and find a better match between authors and topics according to the topics extracted from their documents. As most of these machine learning algorithms are unsupervised, the topics are simply collections of words and thus not always appropriate for identifying expertise (Silva et al. 2018).

Since the main focus of expertise retrieval tasks is on the analysis of documents, NLP techniques have commonly been applied. Traditional approaches to expert profiling tasks are based on the LDA algorithm. LDA is a generative statistical model, first proposed in 2003, which considers each document as a mixture of a small number of topics and according to which the presence of each word is attributable to one of the topics of the document (Blei et al. 2003). LDA is a powerful tool to analyse documents and pinpoint topics, but it was not designed to address the task of identifying expertise; there is no better solution than to treat an author as a bigger document by combining all the documents he or she has published. To include authorship information, Rosen-Zvi et al. (2004) extended LDA and proposed the author-topic model for identifying the interests of authors. To make LDA suitable for different tasks in various contexts, many extensions have been proposed over the years.
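The "author as a bigger document" workaround mentioned above can be sketched in a few lines (a toy illustration with made-up documents and simple whitespace tokenisation; the aggregated word counts would then be fed to a topic model such as LDA):

```python
from collections import Counter

def author_profile(docs_by_author):
    """Merge all of an author's documents into one bag of words,
    i.e., treat the author as a single 'bigger document'."""
    return {author: Counter(w for doc in docs for w in doc.lower().split())
            for author, docs in docs_by_author.items()}

# Hypothetical two-author corpus.
corpus = {
    "alice": ["gene expression in tumours", "tumour suppressor gene"],
    "bob": ["bayesian topic models", "topic models for text"],
}
profiles = author_profile(corpus)
print(profiles["alice"].most_common(2))  # 'gene' dominates Alice's profile
```

Aggregating documents this way loses the per-paper structure, which is exactly the limitation the author-topic model and its extensions address.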
Some examples are the Author-Conference Topic model (Tang et al. 2008), the Author-Conference Topic-Connection model (Wang et al. 2012), and the Author-Topic over Time model (Xu et al. 2014). Some of these have been applied in practice as part of the search engine AMiner (Tang 2016) (https://aminer.org/).

However, classic LDA algorithms have several characteristics that are not ideal for such tasks. First, LDA requires a manual choice of the number of topics, but one can hardly tell whether the choice is good, since the performance of an LDA model is evaluated by perplexity, a metric proposed by Blei et al. (2003); it is therefore difficult to choose and evaluate the number of topics. When this number is too large or too small, the research areas (corresponding to the topics) provided by LDA may become too specific or too general (Berendsen et al. 2013). Second, since LDA is an unsupervised learning algorithm, the topics it generates are just distributions of words without labels, which can be hard to interpret. Additionally, academic research areas are interconnected and have a hierarchical structure, whereas LDA generates independent topics without any relationships between them (Silva et al. 2018).

While most studies are concerned with better solutions to address the flaws of topic models, few have highlighted the importance of author-document connections in expertise retrieval tasks. Duan et al. (2012) first integrated community discovery with topic modelling, and proposed the Mutual Enhanced Infinite Community-Topic model, which finds communities and the topics they discuss in text-augmented social networks. Lately, more studies have started using information networks to avoid the problems of LDA models. Gerlach et al. (2018) represent the data as a bipartite network of words and documents and convert the task into finding communities in such a network. Different approaches that focus on topic modelling using HINs have also been proposed (Sun et al. 2009b).
Subsequently, a pioneering algorithm called RankClus was designed; it uses a generative model that operates on bipartite topologies and simultaneously clusters and ranks nodes in a HIN (Sun et al. 2009a). More recently, different community detection methods, such as generative models and modularity optimisation, have been applied to the creation of hierarchical expert profiles (Silva et al. 2018; Wang et al. 2015).

Despite the efforts of many scholars to find better ways of extracting individuals' interests from the works they produce, most studies have paid little attention to the unequal contributions of authors to collaborative works. Authors that publish with other co-authors in several fields can be associated with multiple topics found in their publications. Identifying the expert in a specific field associated with a paper requires identifying the different contributions of authors to collaborative works, and therefore identifying one or more people as experts bears a resemblance to a credit allocation problem.

In the last decade, as the complexity and interdisciplinarity of modern research have steadily risen, collaborations among researchers have been playing an increasingly important role (Newman 2004). The multidisciplinary nature of research requires expertise from different scientific fields (Lawrence 2007). In turn, as a result of the increasing size of newly formed scientific groups, the scientific credit system has come under mounting pressure (Koopman et al. 2010). As a matter of fact, the interdisciplinarity of modern science not only endangers the current credit allocation system, but also poses more obstacles to expertise retrieval. In such interdisciplinary collaborations, authors from different fields work together to produce one result (e.g., an article), but each author contributes only partly to the publication.
It can therefore be difficult to quantitatively discern the individual co-authors' contributions to a multi-authored publication (Bao and Zhai 2017). Most topic models for expertise retrieval cannot solve this problem, and new approaches to allocating scientific credit to co-authors are therefore required.

Current approaches to credit allocation fall into several major categories. The first and classic one is to view each author as the sole author contributing a copy of the same publication. The second is to distribute the contribution evenly among all co-authors, and the third is to distribute it according to the order in the publication byline or to the role of the co-authors (Hirsch 2005, 2007; Stallings et al. 2013). The first two categories are obviously biased to some degree, and the third is based on tacit discipline-specific conventions which may not be easily accepted by others. Recently, scholars have been working on allocating credit based on the specific contribution of each author (Foulkes and Neylon 1996; Tscharntke et al. 2007). Shen and Barabási (2014) proposed a new method which focuses on co-citations. This method is based on the intuition that the more often an author's other papers are co-cited with a given paper, the more credit he or she should receive for it. In this way, they managed to capture the contribution of co-authors as perceived by the scientific community, and successfully tested the method on Nobel Prize publications. Considering that the novelty of a paper and the attention paid to it tend to fade with time, Bao and Zhai (2017) extended this idea and proposed a dynamic credit allocation algorithm.

As science can be regarded as a complex, self-organising and evolving network of scholars, projects, papers and ideas (Fortunato et al.
2018), another way to deal with the unequal contributions of multiple authors to collaborative works is to use the similarity between a node representing a given topic and a node representing a given author to assess the contribution that the author made to the focal document with respect to the topic. Information networks are networks consisting of data items linked in some way. The best-known example is the World Wide Web, where the nodes are web pages consisting of texts, pictures or other information, and the links are hyperlinks that allow us to navigate from one page to another. Some networks can be considered information networks and also have social connotations; examples include networks of email communication, and online social networks such as Twitter and Facebook (Xiong et al. 2015).

An information network is defined as a directed graph G = (V, E) with an object type mapping function φ: V → A and a link type mapping function ψ: E → R, where each object v ∈ V belongs to one particular object type φ(v) ∈ A, and each link e ∈ E belongs to a particular relation type ψ(e) ∈ R. Unlike in the traditional network definition, we explicitly distinguish object types and relation types in the network. Notice that, if there exists a relation R from type A to type B (denoted A −R→ B), the inverse relation R⁻¹ naturally holds from B to A (B −R⁻¹→ A). Most of the time, R and its inverse R⁻¹ are not equal, unless the two types are the same and R is symmetric. When the number of object types |A| > 1 or the number of relation types |R| > 1, the network is called a heterogeneous information network (HIN); otherwise, it is a homogeneous information network. In real-world networks, multiple-typed objects are often interconnected, forming HINs (Shi et al. 2012). A bibliographic information network is a typical HIN, containing objects from several types of entities. The most common entities are papers (P), venues (conferences/journals) (V), authors (A), affiliations (aff), and terms (T). The DBLP and ACM data in Fig. 1 are typical examples (Shi et al. 2014). There are links connecting objects of different types, and the link types are defined by the relations between two object types. For a bibliographic network, links can exist between nodes of the same or different types. For example, there are links between authors and papers denoting the "write" or "written-by" relations, and links between papers denoting the "cite" and "cited-by" relations.

Fig. 1 Examples of typical Heterogeneous Information Networks (HINs): (a) DBLP data; (b) ACM data

In a heterogeneous network, two objects can be connected via different paths. For example, two authors can be connected via the "author-paper-author" path, the "author-paper-venue-paper-author" path, and so forth. Formally, these paths are called meta-paths. In a graph T_G = (A, R), where A is the set of node types and R is the set of relation types, a meta-path P is a path denoted in the form A_1 −R_1→ A_2 −R_2→ ⋯ −R_l→ A_{l+1}, which defines a composite relation R = R_1 ∘ R_2 ∘ ⋯ ∘ R_l between types A_1 and A_{l+1}, where ∘ denotes the composition operator on relations (Shi et al. 2014).

Similarity search is a primitive operation in large-scale HINs that consist of multi-typed, interconnected objects, such as bibliographic networks and social media networks. Traditional similarity measures (e.g., cosine similarity) are computed between vector representations of features, using numerical data types (Nguyen and Bai 2010). In information networks, however, the interconnections between objects are sometimes more important than the features of the objects themselves.

To capture the information contained in the links, Lin et al. (2006) proposed a link-based similarity measure, PageSim, and applied it to the identification of similar web pages. PageSim only works on networks with one type of node (i.e., homogeneous information networks), but many networks are heterogeneous. Considering the semantics of meta-paths constituted by different-typed objects, Sun et al.
(2011) first proposed the path-based similarity measure
PathSim, to evaluate the similarity of same-typed objects based on symmetric paths. Following their work, Yao et al. (2014) extended PathSim by incorporating richer information, such as transitive similarity, temporal dynamics, and supportive attributes. A path-based similarity join method, JoinSim, was proposed to return the top-k similar pairs of objects based on user-specified join paths (Begum et al. 2016). Wang et al. (2016) defined a meta-path-based relation similarity measure, RelSim, to examine the similarity between relation instances in schema-rich HINs. In order to evaluate the relevance of different-typed objects, Shi et al. (2014) proposed HeteSim to measure the relevance of any object pair under an arbitrary meta-path. To overcome the high computational and memory requirements of HeteSim, Meng et al. (2014) proposed the AvgSim measure, which evaluates similarity scores through two random walk processes along the given meta-path and the reverse meta-path, respectively.

The idea of node similarity can be useful in expertise retrieval because, if we can measure the similarity between a given author and a field, we can assess the author's expertise in that field.
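For intuition, the simplest of these path-based measures, PathSim under a symmetric meta-path, reduces to counting path instances. A minimal sketch for the author-paper-author meta-path, with hypothetical authors and paper sets:

```python
def pathsim_apa(papers_of):
    """PathSim under the symmetric meta-path A-P-A:
    s(x, y) = 2 * |paths x->P->y| / (|paths x->P->x| + |paths y->P->y|).
    For this meta-path, a path count is simply a shared-paper count."""
    def count(x, y):
        return len(papers_of[x] & papers_of[y])
    def sim(x, y):
        return 2 * count(x, y) / (count(x, x) + count(y, y))
    return sim

# Hypothetical authors and the sets of papers they wrote.
papers_of = {"ann": {1, 2, 3}, "ben": {2, 3}, "eva": {4}}
sim = pathsim_apa(papers_of)
print(round(sim("ann", "ben"), 2))  # 2 shared papers out of 3 + 2 -> 0.8
print(sim("ann", "eva"))            # no shared papers -> 0.0
```

The normalisation by the self-path counts is what makes the measure favour peers with comparable visibility rather than simply highly prolific authors.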
HeteSim has been designed to evaluate the relevance of different-typed objects, and thus has the potential to be applied to the task of expertise retrieval. However, this task needs to explicitly account for the uneven contributions of different authors to collaborative efforts, and therefore cannot be carried out merely by applying simple measures of similarity between nodes. For this reason, we decided to draw on HeteSim and propose a suitably adjusted method for capturing authors' expertise in evolving networks.

As a result of the increasing interest in extracting relevant topics from scientific publications, many widely used online data sets provide external controlled vocabularies to classify publications. Examples are the MeSH classification system in MEDLINE and the topic tags in MAG. These systems have used a variety of techniques to improve the reliability of the classifications, and some scholars have started to use them as ground truth or baselines in their work (AlShebli et al. 2018). Our method simplifies the process of topic extraction from documents by using the MEDLINE corpus as an example, and focuses on how to allocate expertise to co-authors who contribute unevenly to collaborative efforts.

The method for collective credit allocation in science developed by Shen and Barabási (2014) is conceptually similar to ours. Yet, it differs in one important aspect: it focuses on appropriately allocating the credit of a given paper to each of its co-authors. It uses the co-citations between the given paper and other papers published by the co-authors to determine the proportion of credit to be assigned to each co-author. If more papers have cited the focal paper together with other papers published by a given co-author, a larger proportion of the credit is allocated to this co-author, indicating that this co-author made a larger contribution to the work. However, at the time when a paper is published and therefore has no citations, contributions to the paper are allocated equally across co-authors. Moreover, because citations vary over the years, so does the credit allocated to each co-author by this method. Clearly, one shortcoming of this method lies in the fact that the contribution of an author to a paper should be unambiguously defined once the paper is published, and should therefore be assessed according to the experience or background of each co-author rather than based on future citations.
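The co-citation intuition just described can be reduced to a toy allocation rule. The following is a deliberately simplified sketch with made-up co-citation strengths, not the full Shen and Barabási algorithm:

```python
def allocate_credit(cocitation_strength):
    """Split one paper's unit of credit among its co-authors in proportion
    to how often each co-author's other papers are co-cited with the focal
    paper. With no citations yet (all zeros), fall back to an equal split."""
    total = sum(cocitation_strength.values())
    n = len(cocitation_strength)
    if total == 0:
        return {a: 1 / n for a in cocitation_strength}
    return {a: s / total for a, s in cocitation_strength.items()}

# Hypothetical co-citation strengths for three co-authors of one paper.
print(allocate_credit({"a1": 6, "a2": 3, "a3": 1}))  # shares 0.6, 0.3, 0.1
print(allocate_credit({"a1": 0, "a2": 0}))           # no citations yet -> 0.5 each
```

The equal split at publication time, and the drift of the shares as citations accumulate, are exactly the properties that motivate an experience-based alternative.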
Data

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic database of life sciences and biomedical information, maintained and curated by the US National Library of Medicine. It includes bibliographic information on articles from academic journals covering medicine, nursing, pharmacy, dentistry, veterinary medicine, and healthcare. The database contains records from more than 5,000 selected journals covering biomedicine and health from 1948 to the present, and is freely accessible via the PubMed interface.

In addition, PubMed provides an online scientific publication search engine that associates each paper with several MeSH terms. These terms are similar to paper keywords, except that a controlled vocabulary is used to classify publications. Since the MeSH terms of a paper are not given by the authors, they are not subject to subjective biases and can be considered as labels which indicate the major topics discussed in the paper. PubMed has also constructed tree structures for MeSH terms, so that one can look up the research field of each MeSH term.

In particular, in PubMed, each MeSH term has one MeSH Unique ID (starting with the letter 'D' followed by 6 digits) and at least one MeSH Tree ID (starting with a letter followed by digits separated by dots). For example, the MeSH Tree ID of 'Anatomic Landmarks' is 'A01.111' and its MeSH Unique ID is 'D059925'. The first letter of the MeSH Tree ID of a MeSH term indicates which one of the 16 categories the MeSH term belongs to. However, the MeSH terms in the raw data are indexed by the MeSH Unique ID rather than the MeSH Tree ID. To map each MeSH Unique ID to the corresponding MeSH Tree ID, we downloaded detailed information about each MeSH Unique ID and used regular expressions (regex) to match each MeSH Unique ID with its corresponding MeSH Tree ID. MeSH Tree IDs can have different depths (the depth of a node is the number of edges from the node to the tree's root node): some MeSH IDs have corresponding MeSH Tree IDs of depth five (e.g., 'A15.378.316.378'), while others only have depth two (e.g., 'B02'). To ensure that all MeSH IDs can be mapped to MeSH Tree IDs of the same depth, we converted all MeSH Tree IDs to depth two by cutting the numbers after the first point. As a result, all MeSH IDs have been mapped to 127 MeSH Tree IDs of depth two.

To disambiguate authors' names, we used the data set named Author-ity provided by Torvik (Torvik and Smalheiser 2009). The data set provides the disambiguated authors' names appearing in the MEDLINE data set up to the year 2008. In our work, we used the first decade of publications in MEDLINE, from 1948 to 1957, to test the method we developed and to compare a baseline (BL) method with our method.

HeteAlloc: An algorithm based on path similarity
MeSH term allocation problem: given a time T, an author A, and a MeSH term M, what is the expertise of author A on MeSH term M at time T? To answer this question, we have developed a method based on the idea of credit allocation, using the author-paper and paper-MeSH connections. Notice that what we care about is the effort devoted by an author to a MeSH term (measured by the number of papers published with that MeSH term, or possibly by the reputation or impact factor of the journals, research venues and outlets where these papers have appeared), rather than the reputation of the author (measured by the citations received).
Problem description. We focus on a subset of the HIN which contains three types of nodes: papers, authors, and MeSH terms. A simple example of this HIN is shown in Fig. 2. In this network, the MeSH terms are indexed by MeSH Tree IDs, and the links between papers and MeSH terms show which MeSH terms the papers are associated with. Our problem is how to allocate credit to individual authors. The input is the list of links for every year between 1948 and 1957, and the output is a vector for each author, with a value for each of the 127 MeSH categories indicating the author's expertise in those categories.

We developed a dynamic credit allocation algorithm based on path similarity, which we shall call HeteAlloc. Based on the HIN with three types of nodes (i.e., authors, papers and MeSH terms), our task is to assign the credit of each MeSH term in a paper to the corresponding authors, and to use authors' whole publication histories to find their expertise. Our method calculates the similarity between an author and a MeSH term, and assigns a value to each author based on this similarity. It is based on HeteSim (Shi et al. 2014), as this method is able to measure the similarity between nodes of different types, i.e., authors and MeSH terms in this case.

[Footnote: The MeSH tree structures are available at https://MeSHb.nlm.nih.gov/treeView. The 16 most general categories are: A. Anatomy; B. Organisms; C. Diseases; D. Chemicals and Drugs; E. Analytical, Diagnostic and Therapeutic Techniques and Equipment; F. Psychiatry and Psychology; G. Phenomena and Processes; H. Disciplines and Occupations; I. Anthropology, Education, Sociology and Social Phenomena; J. Technology, Industry, Agriculture; K. Humanities; L. Information Science; M. Named Groups; N. Health Care; V. Publication Characteristics; Z. Geographicals. In cases where a MeSH Unique ID has two MeSH Tree IDs, we kept both.]

Fig. 2 An example of HIN
Heterogeneous Similarity (HeteSim). HeteSim is a measure of the relatedness of heterogeneous objects based on an arbitrary search path. The properties of HeteSim (e.g., symmetry and self-maximum) make it suitable for a number of applications. HeteSim is defined as follows: given a relevance path P = R_1 ∘ R_2 ∘ ⋯ ∘ R_l, the HeteSim score between two objects s and t (s ∈ R_1.S and t ∈ R_l.T) is

HS(s, t \mid R_1 \circ R_2 \circ \cdots \circ R_l) = \frac{1}{|O(s|R_1)|\,|I(t|R_l)|} \sum_{i=1}^{|O(s|R_1)|} \sum_{j=1}^{|I(t|R_l)|} HS\big(O_i(s|R_1), I_j(t|R_l) \mid R_2 \circ \cdots \circ R_{l-1}\big),   (1)

where O(s|R_1) is the set of out-neighbours of s based on relation R_1, and I(t|R_l) is the set of in-neighbours of t based on relation R_l.

Transition probability matrix. The adjacency matrix W_{AB} is defined for all links from nodes of type A to nodes of type B. The transition probability matrix U_{AB} is W_{AB} normalised along its row vectors.

Reachable probability matrix. Given a network G = (V, E) following a network schema S = (A, R), the reachable probability matrix PM_P for a path P = (A_1 A_2 ⋯ A_{l+1}) is defined as PM_P = U_{A_1 A_2} U_{A_2 A_3} ⋯ U_{A_l A_{l+1}}. PM_P(i, j) represents the probability of object i ∈ A_1 reaching object j ∈ A_{l+1} under the path P.

Using the reachable probability matrices (Ramage et al. 2009), the HeteSim score between two nodes a and b can be written in matrix form as

HeteSim(a, b \mid P) = PM_{P_L}(a,:)\, PM'_{P_R^{-1}}(b,:),   (2)

where the path P is decomposed into a left part P_L and a right part P_R, PM_{P_L}(a,:) refers to the a-th row of the reachable probability matrix PM_{P_L}, and the prime denotes transposition.

Finally, Equation 3 provides the normalised version of HeteSim, which ensures that the similarity between a node and itself is equal to one:

HeteSim(a, b \mid P) = \frac{PM_{P_L}(a,:)\, PM'_{P_R^{-1}}(b,:)}{\sqrt{\|PM_{P_L}(a,:)\|\,\|PM'_{P_R^{-1}}(b,:)\|}}.   (3)

HeteSim in MeSH term assignment.
The definition of
HeteSim in Equation 3 can be directly applied to our network. For a node of type Author (A) a and a node of type MeSH (M) m, the HeteSim between a and m is

HeteSim(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot M'_{MP}[m,:]}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|M'_{MP}[m,:]\|}},   (4)

where M_{AP} and M_{MP} are the adjacency matrices between Author nodes and Paper nodes, and between MeSH nodes and Paper nodes, respectively. In Equation 4, the adjacency matrix is used instead of the reachable probability matrix to make our method more interpretable. It can be shown that, in an unweighted network, the formalisation of HeteSim using the adjacency matrix is the same as the formalisation of HeteSim based on the reachable probability matrix. Note that M_{MP} = M'_{PM}, and that the matrix product of M_{AP} and M'_{MP} is the weighted reachable matrix between node type Author and node type MeSH. Formally, we have

N(papers published by author a which include m) = M_{AP}[a,:] \cdot M'_{MP}[m,:],   (5)

where N means 'the number of'. Note that all elements of M_{MP} and M_{AP} are either 1 or 0, and thus we have

\|M_{AP}[a,:]\| = \sum M_{AP}[a,:].   (6)

Thus,

\sqrt{\|M_{AP}[a,:]\|} = \sqrt{\sum M_{AP}[a,:]} = \sqrt{N(papers published by author a)}.   (7)

In the same way,

\sqrt{\|M'_{MP}[m,:]\|} = \sqrt{\sum M'_{MP}[m,:]} = \sqrt{N(papers which include the MeSH term m)}.   (8)

Equation 4 can therefore be rewritten as

HeteSim(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot M'_{MP}[m,:]}{\sqrt{\sum M_{AP}[a,:]} \cdot \sqrt{\sum M'_{MP}[m,:]}},   (9)

and interpreted as

HeteSim(a, m \mid a \in A, m \in M) = \frac{N(papers published by author a which include the MeSH term m)}{\sqrt{N(papers published by author a)} \cdot \sqrt{N(papers which include the MeSH term m)}}.   (10)

Though HeteSim is quite suitable for our task, it has some disadvantages. The most important one is that HeteSim is, in a sense, a "global" measure. When the similarity between an author and a MeSH term is calculated, all papers are taken into consideration, even those which have no connection with the target author. For example, if someone publishes a paper with a MeSH term M1, the similarity between M1 and all authors changes: HeteSim measures the contribution of each author to the total knowledge (limited to the data set) of a MeSH term. However, the expertise we want to examine refers to the MeSH terms on which an author conducted most of his or her work. In a real-world situation, one can contribute to at most several hundred papers. If we compare this fraction of papers to the tremendous overall number of papers available in online databases, the similarity will be vanishingly small and the original HeteSim will perform poorly.
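To make the closed form in Equation 10 concrete, the following is a minimal NumPy sketch (an illustration with invented toy matrices, not the authors' implementation):

```python
import numpy as np

# Toy sketch of Equation 10 (not the authors' code). Rows of M_AP are authors,
# rows of M_MP are MeSH terms; columns of both are papers, entries are 0/1.
def hetesim(M_AP, M_MP, a, m):
    shared = M_AP[a] @ M_MP[m]          # papers by author a that carry MeSH term m
    n_a = M_AP[a].sum()                 # number of papers published by author a
    n_m = M_MP[m].sum()                 # number of papers that carry MeSH term m
    if n_a == 0 or n_m == 0:
        return 0.0
    return shared / (np.sqrt(n_a) * np.sqrt(n_m))

# 2 authors x 3 papers, 2 MeSH terms x 3 papers (invented data)
M_AP = np.array([[1, 1, 0],
                 [0, 0, 1]])
M_MP = np.array([[1, 0, 1],
                 [0, 1, 1]])
print(hetesim(M_AP, M_MP, a=0, m=0))    # 1 / (sqrt(2) * sqrt(2)) = 0.5
```

With 0/1 adjacency matrices, the dot product in the numerator counts the papers shared by the author and the term, exactly as in Equation 5.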
Modification of HeteSim (HeteAlloc). To address this shortcoming of HeteSim, here we propose a modified version, namely HeteAlloc. The underlying idea is to limit the calculation to a subset of papers, which can be selected according to the context. Formally, we have

HeteAlloc(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot (M_{sub}[a,:] \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|M_{sub}[a,:] \odot M_{MP}[m,:]\|}},   (11)

where \odot is the element-wise product, and M_{sub} is the subset selection matrix with

M_{sub}[a, n] = \begin{cases} 1 & \text{if the } n\text{-th paper is in the selected subset of target author } a \\ 0 & \text{otherwise.} \end{cases}   (12)

Like the original HeteSim, our method is based on the cosine of two vectors. As Pirotte et al. (2007) pointed out, the angle between the node vectors is a much more predictive measure than the distance between the nodes. The only difference is that the second vector is filtered by a row of the subset selection matrix. The selection of the subset is the essential part of our method, and requires a considerable amount of effort in the design and computation of the matrix multiplication. In what follows, we present three subset selection strategies, show how to compute the measure, discuss the advantages and disadvantages of each strategy, and finally provide interpretations.
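The masking in Equation 11 can be sketched as follows (invented toy data; M_sub here is a hand-picked hypothetical mask rather than one produced by the selection strategies):

```python
import numpy as np

# Sketch of Equation 11 (illustrative, not the authors' code): HeteAlloc filters
# the MeSH-paper vector with a per-author subset mask before the cosine-style ratio.
def hetealloc(M_AP, M_MP, M_sub, a, m):
    masked = M_sub[a] * M_MP[m]          # element-wise product: subset papers carrying m
    num = M_AP[a] @ masked
    denom = np.sqrt(M_AP[a].sum()) * np.sqrt(masked.sum())
    return 0.0 if denom == 0 else num / denom

M_AP = np.array([[1, 1, 0, 0],           # author 0 wrote papers 0 and 1
                 [0, 0, 1, 1]])
M_MP = np.array([[1, 0, 1, 1]])          # MeSH term 0 appears in papers 0, 2, 3
M_sub = np.array([[1, 1, 1, 0],          # hypothetical subset mask for author 0
                  [0, 0, 1, 1]])
print(hetealloc(M_AP, M_MP, M_sub, a=0, m=0))   # 1 / (sqrt(2) * sqrt(2)) = 0.5
```

Because paper 3 is outside author 0's subset, it no longer inflates the denominator, which is the intended effect of the modification.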
5.1 Subset of co-authors' papers.

The HeteSim measure should therefore be limited to the subset of papers published either by our target author or by those who have co-authored with this author. To find the subset, we provide the following definition:

Binary Reachable Matrix of Path Length i: Given a relation A → B and the adjacency matrix W_{AB} between type A and type B, the Binary Reachable Matrix of Path Length i from A to B following the meta-path (AB)^i is

RM^{(i)}_{AB}(m, n) = \begin{cases} 0 & \text{if } M^{(i)}_{AB}(m, n) = 0 \\ 1 & \text{otherwise,} \end{cases}   (13)

where M^{(i)}_{AB} = W_{AB} \cdot (W_{BA} \cdot W_{AB})^{(i-1)}.

The selected subset, RM^{(2)}_{AP}, follows the meta-path 'APAP', which, for each author, creates the subset of papers published by the author or his/her co-authors. To be more specific, the n-th row of RM^{(2)}_{AP} is a vector whose m-th value is 1 if, for the n-th author, paper m is included in the subset. To this end, we define HeteAlloc as

HeteAlloc(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot (RM^{(2)}_{AP}[a,:] \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|RM^{(2)}_{AP}[a,:] \odot M_{MP}[m,:]\|}},   (14)

which can be interpreted as

HeteAlloc(a, m) = \frac{N(papers of a which include m)}{\sqrt{N(papers of a)} \cdot \sqrt{N(papers of a's co-authors which include m)}}.   (15)

The advantage of this selection strategy is that the similarity between an author and any MeSH term will not be influenced by an irrelevant global change of the data set. The subset matrix is constant across all target MeSH terms. However, this selection does not reflect the specific MeSH term on which an author has collaborated with another author, and simply includes the papers of all co-authors in the subset.

5.2 Subset of co-authors' papers in a target
MeSH term.

The basic idea of this strategy is to add the target MeSH term as another constraint for selecting the subset. The subset includes all papers published by the target author and by the authors who have co-authored with him or her on the target MeSH term. Since this subset varies across MeSH terms, we use the reachable vector of a and m to replace the row of the subset matrix:

HeteAlloc(a, m \mid a \in A, m \in M) = \frac{M_{AP}[a,:] \cdot (RV^{(a,m)}_{sub} \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|RV^{(a,m)}_{sub} \odot M_{MP}[m,:]\|}},   (16)

RV^{(a,m)}_{sub}(1, n) = \begin{cases} 0 & \text{if } V^{(a,m)}_{sub}(1, n) = 0 \\ 1 & \text{otherwise,} \end{cases}   (17)

where

V^{(a,m)}_{sub} = (W_{AP}(a,:) \odot W_{MP}(m,:)) \cdot W_{PA} \cdot W_{AP}.   (18)

Equation 16 can be interpreted as

HeteAlloc(a, m) = \frac{N(papers of a which include m)}{\sqrt{N(papers of a)} \cdot \sqrt{N(papers of a's co-authors which include m)}}.   (19)

The advantage of this selection strategy is that the similarity between an author and any MeSH term will not be influenced by irrelevant global changes of the data set. The similarity is MeSH-sensitive, and the subset vector can filter out co-authors who had no experience with the target MeSH term. However, this selection will lead to a low score for those who have worked with very experienced authors.

5.3 Subset of all papers published by the co-authors of the focal paper.

For each paper p, the subset includes all papers published by the co-authors of p. For each pair of author a and MeSH term m, the calculation is conducted for every paper p of author a which includes the MeSH term m, and the average or the sum over all such papers is used as the final score. The sum can be considered as a method for credit allocation, and the average as a similarity measure. Here we shall use the sum as an example:

HeteAlloc(a, m) = \sum_{p \in P_a} HeteAlloc(a, p, m),   (20)

HeteAlloc(a, p, m) = \frac{M_{AP}[a,:] \cdot (RV^{(a,p)}_{sub} \odot M_{MP}[m,:])'}{\sqrt{\|M_{AP}[a,:]\|} \cdot \sqrt{\|RV^{(a,p)}_{sub} \odot M_{MP}[m,:]\|}},   (21)

RV^{(a,p)}_{sub}(1, n) = \begin{cases} 0 & \text{if } V^{(a,p)}_{sub}(1, n) = 0 \\ 1 & \text{otherwise,} \end{cases}   (22)

where

V^{(a,p)}_{sub} = \left(W_{PA} \cdot W_{AP}\right)(p,:).   (23)

Equation 21 can be interpreted as

HeteAlloc(a, m) = \sum_{\text{all papers } p \text{ of } a} \frac{N(papers of a which include m)}{\sqrt{N(papers of a)} \cdot \sqrt{N(papers of co-authors of paper p)}}.   (24)

This similarity avoids a significant decrease when the target author co-authors with a more experienced author on the target MeSH term. The similarity retains the property of having a
MeSH-sensitive subset. Notice that this method works better when applied to calculating the absolute value of expertise.
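As an illustration of the third strategy (Equations 20 to 23), the sketch below sums the per-paper scores over the focal author's papers; the toy matrices are invented, and the subset of a paper p is taken to be all papers written by p's authors:

```python
import numpy as np

# Illustrative sketch of the third subset selection strategy (sum version).
# W_AP: authors x papers, W_MP: MeSH x papers, entries are 0/1 (invented data).
def ha3(W_AP, W_MP, a, m):
    W_PA = W_AP.T
    # row p of W_PA @ W_AP reaches every paper written by an author of p
    papers_of_coauthors = (W_PA @ W_AP > 0).astype(int)
    total = 0.0
    # iterate over a's papers that carry MeSH term m (the focal papers)
    for p in np.nonzero(W_AP[a] * W_MP[m])[0]:
        rv = papers_of_coauthors[p]            # binary reachable vector for paper p
        masked = rv * W_MP[m]                  # subset papers that carry m
        num = W_AP[a] @ masked
        denom = np.sqrt(W_AP[a].sum()) * np.sqrt(masked.sum())
        if denom > 0:
            total += num / denom
    return total

# 2 authors, 3 papers, 1 MeSH term; authors 0 and 1 co-wrote paper 1
W_AP = np.array([[1, 1, 0],
                 [0, 1, 1]])
W_MP = np.array([[1, 1, 1]])
print(ha3(W_AP, W_MP, a=0, m=0))   # 1.0 + 2/sqrt(6), roughly 1.816
```

Summing over the focal author's papers, as in Equation 20, is what turns the per-paper cosine ratio into a cumulative credit allocation rather than a bounded similarity.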
6.1 Weighted HeteAlloc
The formalisation above is based on an unweighted network. Yet, one may want to capture the concentration of an author's effort on a specific topic (MeSH term). For example, let us suppose that all papers of author A1 contain only one MeSH term, M1, whereas all papers of author A2 contain two MeSH terms, M1 and M2. In this case, one may argue that A1 concentrates more than A2 on M1, as A1 has worked exclusively on this topic while A2 has also worked on the additional topic M2. According to this idea, we propose a weighted version of HeteAlloc which accounts for the weights of the links between papers and MeSH terms. The weight of a link between a paper and a MeSH term is inversely proportional to the number of MeSH terms associated with the paper. HeteAlloc can be applied to a weighted network by using U_{MP} instead of M_{MP}, where U_{MP} is the matrix obtained by normalising each column of M_{MP}.

The weighted HeteAlloc can capture authors' concentration on specific topics and identify the authors whose papers are focused on smaller sets of MeSH terms. However, this characteristic is not necessarily an advantage, but simply a different strategy to deal with the number of MeSH terms in a paper. There may exist different views about the similarity between an author and a given MeSH term. For example, one may believe that an author is entirely devoted to a given research topic if each of his or her papers contains the corresponding MeSH term. In this case, the similarity between the author and the MeSH term would be equal to one (i.e., the idea behind the unweighted version). However, others may believe that the similarity between the author and the MeSH term should never be equal to 1 unless an author's work is exclusively about this MeSH term (i.e., the idea behind the weighted version). The decision should be made after careful examination of the context, and should also be based on the assumptions made by potential users of the method (e.g., researchers or funding agencies).

Here we shall provide our personal recommendation and blueprint. When papers carry few MeSH terms, the weighted version will work better, since it is not common for researchers to work on completely different MeSH terms (say, Finance and Chemistry). However, when the division of topics is too fragmented and most papers have many MeSH terms, the weighted version may not perform well, and the unweighted version would be recommended.

6.2 Iterative calculations over the years

The original
HeteSim is designed for a "static" measurement of similarity. However, authors keep publishing papers over the years, and their expertise may change over time. When expertise is measured at year T, only the papers published before this year should be considered. To make our method HeteAlloc applicable to dynamic calculation, we distinguish the links connecting Author and Paper between the experience/history links before year T and the update links at year T. This can be done by using two adjacency matrices: M_{update} and M_{experience}. Since it is difficult to identify the time ordering of publications within year T, we assume that papers of year T were published at the same time. The formalisation of HeteAlloc needs to be modified, and the calculation, based on the modified measure, can be conducted iteratively over the years. We shall refer to the modified algorithm as
DynamicHeteAlloc (DHA), and the corresponding formalisation is

DHA(a, m) = \sum_{p_i \in M_{update}[a,:] \odot M_{MP}[m,:]} DHA(a, p_i, m)   (25)

and

DHA(a, p_i, m) = \frac{(M_{experience}[a,:] + I_{nn}[p_i,:]) \cdot (V_{subset}(p_i) \odot M_{MP}[m,:])'}{\sqrt{\|M_{experience}[a,:] + I_{nn}[p_i,:]\|} \cdot \sqrt{\|V_{subset}(p_i) \odot M_{MP}[m,:]\|}},   (26)

where

V_{subset}(p_i) = M'_{update}[p_i,:] \cdot M_{experience} + I_{nn}[p_i,:].   (27)

For each paper, we add I_{nn}[p_i,:] to M_{experience}[a,:] in Equation 26 to include the current paper in the experience paper set, so as to avoid the case where M_{experience} is a zero matrix. According to the formalisation of DHA, we have implemented Algorithm 1:
Algorithm 1 Dynamic HeteAlloc
Input: link lists for every year, MeSH lists
Output: expertise of every author
1: initialise list_pre as a blank list; load the MeSH list as M_MP;
2: for each year ∈ [1946, …] do
3:   load list_year as list_cur;
4:   Sparse Matrix Creation (Algorithm 2);
5:   for each AuthorID ∈ list_cur do
6:     if M_update[AuthorID, :] is a null vector then
7:       skip to the next iteration;
8:     end if
9:     find the MeSH terms that need updating, MeSH_update;
10:    create an empty dictionary dic_cur;
11:    if AuthorID exists in the expertise dictionary dic_expts then
12:      use dic_expts[AuthorID] to replace dic_cur;
13:    end if
14:    for each MeSHID ∈ MeSH_update do
15:      initialise HeteAlloc_value as zero;
16:      if MeSHID is in dic_cur then use dic_cur[MeSHID] to replace HeteAlloc_value; end if
17:      update HeteAlloc_value by adding the result of DynamicHeteAlloc(AuthorID, MeSHID);
18:      update dic_cur[MeSHID] with HeteAlloc_value;
19:      update dic_expts[AuthorID] with dic_cur;
20:    end for
21:  end for
22: end for
23: write out dic_expts.

Algorithm 2 Sparse Matrix Creation
Input: list_pre, list_cur, MeSH lists
Output: M_experience, M_update, updated list_pre, dictionaries
1: merge list_pre and list_cur as list_all;
2: create a dictionary from list_all for mapping nodes to indexes;
3: use the dictionary to map list_pre to M_experience and list_cur to M_update;
4: replace list_pre with list_all; return the dictionaries for mapping.

An example of this method using illustrative networks is provided in the Appendix. The results are given in the form of expertise matrices, where the value in row i and column j indicates the expertise of Author i on MeSH j. In the example, we use the publication lists of 4 authors from year 1 to year 10 and calculate the expertise matrices for each author at each year. We also show the result of the (BL) method, which attributes every MeSH term of a paper equally to all co-authors. In this case, the expertise of a focal author is computed through the cumulative counts of
MeSH terms associated with all publications of the author. Thus, in the expertise matrix calculated with the (BL) method for a year t, the value in row i and column j is equal to the number of papers published by Author i with MeSH j before year t.

To compare the performance of the different subset selections on the HIN, we calculated the similarity between all pairs extracted from the pair set {a, m | a ∈ Author, m ∈ MeSH} on three small example networks, using the (BL) method mentioned above, the original HeteSim, the HeteAlloc with the subset of co-authors' papers (HA1), the HeteAlloc with the subset of co-authors' papers in a target MeSH term (HA2), the HeteAlloc with the subset of all papers published by the co-authors of the focal paper (HA3), and the corresponding weighted versions of HA1, HA2, HA3 (i.e., WHA1, WHA2, WHA3).

In the first example in Fig. 3, BL, HA2 and HA3 perform well (see Table 1; the similarities characterised by better performance are highlighted in bold). These methods can uncover the differences between the author and MeSH pairs in this network. Since each paper has a single MeSH term, the weighted versions in this example degenerate to the unweighted ones.
Fig. 3 Example network 1

In the second example network in Fig. 4, HA3 performs well. Compared to the other methods, only HA3 assigns a higher similarity to the pair formed by M1 and the author who did most of the work on M1. Since each paper has a single MeSH term, the weighted versions in this example again degenerate to the unweighted ones.
Table 1 Results based on example network 1

               Baseline  Original  Unweighted             Weighted
Pair \ Method  BL        HeteSim   HA1    HA2    HA3     WHA1   WHA2   WHA3
(A1,M1)        0.577     0.577     0.577  0.577  0.577   0.577  0.577  0.577
(A1,M2)
(A2,M1)        0.577     0.577     0.577  0.577  0.577   0.577  0.577  0.577
(A2,M2)
Fig. 4 Example network 2

Table 2 Results based on example network 2

               Baseline  Original  Unweighted             Weighted
Pair \ Method  BL        HeteSim   HA1    HA2    HA3     WHA1   WHA2   WHA3
(A1,M1)        1         0.577     0.632  0.632
(A1,M2)        0         0         0      0      0       0      0      0
(A2,M1)        1         0.816     0.894  0.894
(A2,M2)        0         0         0      0      0       0      0      0
(A3,M1)        0.707     0.288     0.707  0.707
(A3,M2)        0.707     0.707     0.707  0.707  0.707   0.707  0.707  0.707
For the third example shown in Fig. 5, the weighted methods differentiate between Sim(A1, M1) and Sim(A2, M1). Both A1 and A2 have published on M1, and the only difference between them is that one of A2's papers also includes M2. The weighted version can capture the concentration of research efforts on some MeSH terms, and is biased in favour of the authors whose papers are more concentrated on a smaller MeSH set.
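Concretely, the column normalisation behind the weighted variants (using U_MP in place of M_MP, as described in the section on the weighted version) can be sketched as follows; the paper-term matrix is invented:

```python
import numpy as np

# Sketch of the weighting scheme: each paper's MeSH weights are divided by the
# number of MeSH terms it carries, so every column of U_MP sums to one.
M_MP = np.array([[1, 1, 0],     # MeSH terms x papers, 0/1 annotations (toy data)
                 [0, 1, 1]])
col_sums = M_MP.sum(axis=0)
U_MP = M_MP / np.where(col_sums == 0, 1, col_sums)   # guard against empty columns
print(U_MP)
# paper 0: all weight on term 0; paper 1: 0.5 on each term; paper 2: all weight on term 1
```

A paper with many MeSH terms thus contributes only a fraction of a count to each term, which is what penalises broadly annotated papers relative to the unweighted version.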
Fig. 5 Example network 3

Table 3 Results based on example network 3

               Baseline  Original  Unweighted             Weighted
Pair \ Method  BL        HeteSim   HA1    HA2    HA3     WHA1   WHA2   WHA3
(A1,M1)        1         0.943     0.816  0.816  0.908   0.943  0.943
(A1,M2)        0         0         0      0      0       0      0      0
(A2,M1)        0.816     0.707     0.816  0.816  0.908   0.707  0.707
(A2,M2)        0.577     0.236     0.5    0.5    0.5     0.316  0.316  0.316
(A3,M1)        0         0         0      0      0       0      0      0
(A3,M2)        1         0.943     0.816  0.816  0.908   0.943  0.943
(A4,M1)        0.577     0.236     0.5    0.5    0.5     0.316  0.316  0.316
(A4,M2)        0.816     0.707     0.816  0.816  0.908   0.707  0.707
From the three examples above, the third subset selection strategy (i.e., the subset of all papers published by the co-authors of the focal paper) outperforms the other two strategies. Moreover, by taking the sum of all scores (i.e., similarity measures) obtained from all publications of the focal author, this method enables us to evaluate the global expertise of an author based on his or her entire scientific production. In what follows, we will use the third selection strategy and perform a comparison between our method (DHA) and the (BL) method applied to the MEDLINE data set. As most publications in our data set are associated with multiple MeSH terms, we chose the unweighted version of our method.

The output of both methods is a vector associated with each author, representing his or her expertise on each topic (i.e., MeSH term). To compare the two methods, for each author we consider the following measures: (1) the ratio between the maximum and minimum values of the author's expertise; (2) the author's maximum normalised expertise (i.e., obtained by dividing all values in a vector by its norm); and (3) the normalised maximum expertise of authors that had published more than 10 papers at the time of the assessment of expertise (i.e., criterion 2 applied only to the subset of productive authors). Moreover, for every year, we calculate the mean and standard deviation of the values produced by the above assessment measures, and compare them between methods.
Table 4 Comparison between DHA and BL based on the first 10 years of the MEDLINE data set

              Measure (1)      Measure (2)    Measure (3)
Year          DHA     BL       DHA    BL      DHA    BL
1948  mean    2.05    1.45     0.60   0.58    0.57   0.52
      std     3.54    1.13     0.17   0.17    0.14   0.12
1949  mean    2.72    1.66     0.60   0.58    0.59   0.54
      std     6.24    1.63     0.16   0.16    0.14   0.12
1950  mean    3.48    1.84     0.60   0.57    0.60   0.55
      std     9.59    2.09     0.16   0.15    0.14   0.12
1951  mean    4.37    2.06     0.59   0.56    0.61   0.56
      std     13.85   2.65     0.15   0.14    0.14   0.12
1952  mean    5.22    2.24     0.59   0.56    0.61   0.56
      std     18.36   3.15     0.15   0.14    0.14   0.12
1953  mean    6.05    2.39     0.59   0.55    0.61   0.56
      std     23.02   3.60     0.15   0.14    0.14   0.11
1954  mean    6.85    2.53     0.59   0.55    0.61   0.56
      std     28.05   4.01     0.15   0.13    0.14   0.11
1955  mean    7.65    2.66     0.59   0.54    0.61   0.55
      std     33.04   4.41     0.15   0.13    0.14   0.11
1956  mean    8.41    2.78     0.59   0.54    0.61   0.55
      std     38.16   4.79     0.15   0.13    0.14   0.11
1957  mean    9.14    2.88     0.59   0.54    0.61   0.55
      std     43.32   5.13     0.15   0.13    0.14   0.11

(1) the ratio between maximum and minimum values of the author's expertise; (2) the author's maximum normalised expertise (i.e., obtained by dividing all values in a vector by its norm); (3) the normalised maximum expertise of authors that have published more than 10 papers at the time of the assessment of expertise.
Fig. 6 Comparison between DHA and BL using the normalised maximum expertise of productive authors (histogram; x-axis: author's maximum expertise; y-axis: frequency of authors; series: DHA, BL)

The results reported in Table 4 show that the mean and standard deviation of the ratio between the maximum and minimum values of an author's expertise obtained with the DHA method are higher than those obtained with the BL method. This suggests that DHA can better distinguish authors according to their areas of expertise, whereas BL treats all authors involved in works relevant to multiple topics as interdisciplinary authors (i.e., with the same expertise on all MeSH terms, thus producing smaller ratios of maximum to minimum expertise). The results based on the normalised maximum expertise of
DHA are similar to those of BL when all authors are considered, but they differ when the methods are applied only to the restricted subset of productive authors, which suggests that our method has the potential to identify authors' main areas of expertise precisely when they are most likely to work in multiple areas.

Figure 6 shows the frequency of productive authors with normalised maximum expertise ranging from 0 to 1. The (BL) method shows no authors with maximum expertise higher than 0.9, which suggests that no researcher is dedicated to one single area and that the maximum expertise of most authors lies in the middle of the range. In contrast, the results obtained with our method clearly highlight its ability to identify both specialised authors who preferentially focus on one area (i.e., with high maximum expertise) and interdisciplinary authors whose work spans different areas (i.e., with low maximum expertise).
In this work, we have proposed a new method based on path similarity and a number of subset selection strategies to identify authors' expertise. Our method differs from previous work in that it assigns expertise to a focal author by accounting for co-authors' contributions to the works they were involved in. We have shown that our method can be applied to the HIN constructed from the MEDLINE corpus. However, the applicability of our method is not limited to one data set. Indeed, if we replace MeSH terms with the topic tags in MAG, our method can be directly applied to MAG. In this case, it can retrieve authors' expertise based on topics as classified in MAG, and it can be suitably adjusted to reflect the depth and granularity required by users. In more general cases, users can generate their own topics from documents using topic modelling or other methods. By linking the generated topics and the corresponding documents, users can produce networks similar to those shown in Fig. 2 and then apply our method by selecting an appropriate subset. Our work can also be used to complement standard approaches, for example in conjunction with topic modelling for documents or with topic classification systems.

The lack of a ground truth does not enable a definitive validation of our method. While this represents a limitation of our work, it also opens up new avenues for future work. For example, to mitigate this limitation, we could check the Contributor Roles Taxonomy (CRediT) author statements available from several journals to identify which author was involved in which part of the research. However, CRediT statements are self-declared and not verifiable, which again highlights the need for methods such as the one proposed in this article. Moreover, the CRediT author statements are not detailed enough to unambiguously indicate which specific expertise (e.g., MeSH term) should be associated with which author. Another possibility is to handpick some very interdisciplinary papers (i.e., with many MeSH terms). By reading the CVs of the authors or searching for relevant information about them, we might be able to infer the MeSH terms associated with each author, and then compare our prior knowledge with the results obtained using our method. This test represents a "sanity check", and an example is given in the Appendix.

Our method has a number of important applications for research and practice. Understanding the composition of a team and being able to associate each co-author of a paper with one or several fields of expertise can spur new studies of the interdisciplinarity of research teams. For example, our method will enable us to distinguish between interdisciplinary papers co-authored by researchers with overlapping expertise and equally interdisciplinary papers in which the co-authors have non-overlapping research profiles. This, in turn, could shed further light on the impact of team diversity on scientific success and knowledge creation. Moreover, being able to identify expertise facilitates a comparative assessment of two equally interdisciplinary studies, one pursued by an individual and the other by a group of researchers. In particular, our method enables us to distinguish between research pursued by one individual scholar with a highly interdisciplinary background and research pursued by an interdisciplinary group comprising several highly specialised scholars. This variation in the type and sources of interdisciplinarity is likely to be a critical nuance with non-trivial implications for innovation, research performance, and the long-term impact of publications.

Our method also has practical implications for funding agencies, research institutions and scientists. First, it can assist funding agencies in the identification of appropriate reviewers with the right competence to evaluate research proposals. In turn, it may also assist reviewers in uncovering possible gaps between a proposed research project and the combined expertise of the pool of applicants. Second, our method can help research institutions develop effective recruitment policies targeted at strengthening specific research fields or at developing new and fast-growing areas that require a prompt investment of resources. Finally, the identification of specific expertise can help scientists identify potential collaborators and shape successful research groups.

A Appendix
A.1 Example of DHA using illustrative networks

Here we show how our method works out in full using illustrative networks, and we then compare the results with those obtained using the BL method. Figure 7 shows the illustrative networks from year 1 to year 5 (identical networks for the five years). Figure 8 shows the illustrative networks from year 6 to year 10 (identical networks for the five years). Before year 5, the four authors worked separately: A1 published P1 with M2 and M3; A2 published P2 with M1 and M3, and P3 with M1; A3 published P4 with M2 and M3, and P5 with M2; and A4 published P6 with M1 and M3. From year 6, A1 and A2 co-authored P1 with M2 and M3; A2 and A3 co-authored P2 with M1 and M2; and A3 and A4 co-authored P3 with M1 and M3. The publication lists can be found in Tables 5 and 6.

Based on their experience, it is not likely for A2 to have worked on M2 in P2; similarly, it is not likely for A3 to have worked on M1 in P2. The expertise matrices below are reported for BL and DHA, respectively. The results are similar between year 1 and year 5 and begin to differentiate from year 6. At the end of year 5, both methods suggest that all four authors had similar expertise on M3, whereas A2 and A3 were experts on M1 and M2, respectively. BL simply counts the number of papers each author published on every MeSH term, and adds them together. Following this idea, from year 6 onwards A2 and A3 receive equal credit for M1 and M2 from P2, so that A2 acquires the same expertise on M2 as A3, and A3 acquires the same expertise on M1 as A2. The results obtained using DHA, instead, gave the expected result: i.e., A2 remains chiefly an expert on M1, and A3 on M2.

Table 5
Publication list in the illustrative networks from year 1 to year 5
Author  Paper  MeSH    Year
A1      P1     M2, M3  1, 2, 3, 4, 5
A2      P2     M1, M3  1, 2, 3, 4, 5
A2      P3     M1      1, 2, 3, 4, 5
A3      P4     M2, M3  1, 2, 3, 4, 5
A3      P5     M2      1, 2, 3, 4, 5
A4      P6     M1, M3  1, 2, 3, 4, 5
Fig. 7
Illustrative networks from year 1 to year 5
Table 6
Publication list in the illustrative networks from year 6 to year 10
Author   Paper  MeSH    Year
A1, A2   P1     M2, M3  6, 7, 8, 9, 10
A2, A3   P2     M1, M2  6, 7, 8, 9, 10
A3, A4   P3     M1, M3  6, 7, 8, 9, 10
Fig. 8
Illustrative networks from year 6 to year 10

Equations (28) to (37) report the year-by-year expertise matrices M_t^{BL} and M_t^{DHA} for t = 1, …, 10 (rows: authors A1 to A4; columns: MeSH terms M1 to M3).

A.2 An example based on real data
Here we provide an example based on a focal paper and show the results obtained using our method. The title of this focal paper is "Calcium Levels and Calciuria in Decalcification in Acromegaly" (https://pubmed.ncbi.nlm.nih.gov/13327374/). It was published in 1956 and co-authored by five authors: S. de Sèze, A. Lichtwitz, D. Hioco, M. Delaville, H. Gille. Table 7 shows the MeSH terms associated with this paper, the relevant MeSH Tree IDs and the corresponding category names. Table 8 shows the expertise of the five co-authors on the MeSH terms associated with the focal paper before the year 1956. The first author, Stanislas de Sèze, was a pioneering scholar of French rheumatology. He was already an expert in the categories Eukaryota (which includes Humans), Musculoskeletal Diseases and Nervous System Diseases, as indicated by the high values in his expertise vector: 90 for B01, 42 for C05 and 12 for C10. The second author, Alfred Lichtwitz, mainly worked on D06, B01 and C19. The third author, Denis Hioco, mainly worked on D01, D06 and A12. The fourth author, M. Delaville, mainly worked on B01, D06 and D01. The last author, Halvor Gille, was a new author, and this paper was his first publication.

Although there were some overlaps among the co-authors' profiles, each of them (except the new author) had some major background knowledge in selected research areas. The desired method should be able to add appropriate value to the co-authors' expertise vectors and update them so that they better represent the evolution of the co-authors' expertise.

The results are given in Table 9. Upon publication of this paper, Stanislas de Sèze obtains 0.762 on B01, 0.371 on C05 and 0.106 on C10, since he was the most experienced author in these three categories. Similarly, D. Hioco obtains 0.315 on D01 and 0.265 on A12; A. Lichtwitz obtains 0.193 on D01 and 0.211 on C19. However, M. Delaville does not achieve a high score, as he was not the most experienced author in any of these categories. As for the new author, he gains some experience in nearly every category, especially those in which no one had much experience: in this example, he obtains 0.535 on D23, 0.424 on G02 and 0.366 on G03. In general, our method clearly returns a reasonable result which meets our expectations.
Table 7 MeSH terms associated with the focal paper, relevant MeSH Tree IDs and corresponding category names

MeSH term                  MeSH Tree IDs    Categories
Acromegaly                 C05, C10, C19    Musculoskeletal Diseases; Nervous System Diseases; Endocrine System Diseases
Calcium                    D01, D23         Inorganic Chemicals; Biological Factors
Hormones                   D06, D27         Hormones, Hormone Substitutes, and Hormone Antagonists; Chemical Actions and Uses
Humans                     B01              Eukaryota
Osteoporosis               C05, C18         Musculoskeletal Diseases; Nutritional and Metabolic Diseases
Phosphorus                 D01              Inorganic Chemicals
Urine                      A12              Fluids and Secretions
Water-Electrolyte Balance  G02, G03, G07    Chemical Phenomena; Metabolism; Physiological Phenomena
Table 8 Expertise of co-authors on the MeSH terms associated with the focal paper before year 1956

              D06    D27    B01     D01    D23    A12    C05     C10     C19    G02    G03    G07    C18
M. Delaville  4.860  2.477  10.200  3.235  0.188  0.915  1.472   0.211   2.758  0.456  0.089  1.010  0.971
H. Gille      0.000  0.000  0.000   0.000  0.000  0.000  0.000   0.000   0.000  0.000  0.000  0.000  0.000
A. Lichtwitz  7.139  3.963  22.821  3.141  1.295  0.987  4.754   1.064   6.219  2.074  2.406  2.576  2.821
D. Hioco      3.283  2.543  1.172   3.887  0.863  2.338  0.444   0.289   0.973  0.000  0.816  0.131  2.014
De Sèze       3.514  0.682  90.230  0.417  0.196  0.157  42.682  12.108  0.390  0.213  0.000  0.133  0.697
Table 9 Expertise acquired from the focal paper

              D06    D27    B01    D01    D23    A12    C05    C10    C19    G02    G03    G07    C18
M. Delaville  0.185  0.097  0.084  0.141  0.041  0.048  0.005  0.006  0.092  0.038  0.027  0.051  0.038
H. Gille      0.101  0.183  0.011  0.165
A.3 Summary

In Appendix A.1, we showed how our method works out in full using illustrative networks, and then compared the results with those obtained with the BL method. In this example, four authors with their publication lists over 10 years are given. By checking the publication history of those authors, we can indeed confirm that the second and the third authors are experts in different topics. Our method was able to correctly identify the expertise of each author. However, the BL method gave a result according to which the research profiles of the two authors were the same. This example and the comparison between methods thus showed that our method outperformed the BL one.

In Appendix A.2, we gave an example of a handpicked paper, and provided the results obtained using our method. We showed that our method correctly assigned expertise to the most experienced author on most MeSH terms, and that authors would not acquire much experience in categories with which they were not familiar. The results showed that our method was able to add appropriate value to the co-authors' expertise vectors and update them so that they could better represent the evolution of the co-authors' expertise.

Despite the lack of ground-truth data to definitively validate the performance of our method, the examples in the Appendix provide some possible ways to test it. The results showed that our method can provide a reasonable assessment of authors' expertise.
References

AlShebli BK, Rahwan T, Woon WL (2018) The preeminence of ethnic diversity in scientific collaboration. Nature Communications 9(1):5163
Balog K, De Rijke M, et al. (2007) Determining expert profiles (with an application to expert finding). In: IJCAI, vol 7, pp 2657-2662
Balog K, Fang Y, de Rijke M, Serdyukov P, Si L, et al. (2012) Expertise retrieval. Foundations and Trends in Information Retrieval