Learning to Rank Academic Experts in the DBLP Dataset
Catarina Moreira [email protected]
Pável Calado [email protected]
Bruno Martins [email protected]
Instituto Superior Técnico, INESC-ID, Av. Professor Cavaco Silva, 2744-016 Porto Salvo, Portugal
The original publication is available at: Expert Systems, Wiley Online Library http://onlinelibrary.wiley.com/doi/10.1111/exsy.12062/abstract
Abstract
Expert finding is an information retrieval task that is concerned with the search for the most knowledgeable people with respect to a specific topic, and the search is based on documents that describe people's activities. The task involves taking a user query as input and returning a list of people who are sorted by their level of expertise with respect to the user query. Despite recent interest in the area, the current state-of-the-art techniques lack in principled approaches for optimally combining different sources of evidence. This article proposes two frameworks for combining multiple estimators of expertise. These estimators are derived from textual contents, from the graph structure of the citation patterns for the community of experts, and from profile information about the experts. More specifically, this article explores the use of supervised learning to rank methods, as well as rank aggregation approaches, for combining all of the estimators of expertise. Several supervised learning algorithms, which are representative of the pointwise, pairwise and listwise approaches, were tested, and various state-of-the-art data fusion techniques were also explored for the rank aggregation framework. Experiments that were performed on a dataset of academic publications from the Computer Science domain attest the adequacy of the proposed approaches.
Introduction

The search for people who are knowledgeable about specific topic areas and who are within the scope of specific user communities, based on documents that describe people's activities, is an information retrieval problem that has been receiving increasing attention (Serdyukov 2009). Usually referred to as expert finding, this task involves taking a short user query as input and returning a list of people who are sorted by their level of expertise in what concerns the query topic. When looking for experts in an enterprise environment, the documents that describe people's activities include reports, web pages, and manuals. When looking for academic experts, these documents are the individual's publications.

Expert finding has been gaining attention not only in the scientific community but also in enterprises. Companies are interested in expert finding systems to save time and money. For example, when a problem arises in an on-going project, an expert with some specific knowledge must be found with some urgency. In large companies, going through all of the documentation and looking for people who might have the skills that are needed to solve the problem is time consuming. Thus, enterprises often engage external personnel to solve their issues, when the solution to their problem could be found in an employee from their own ranks. The TREC enterprise track dataset from the 2008 edition simulates this scenario. It provides documents from an Australian company, and the goal is for researchers to test their algorithms for finding the most knowledgeable people for some query topics. In an academic scenario, expert finding systems are also very useful for researchers and for students who are looking for the best advisor for their work.

Several effective approaches for finding experts have been proposed in the literature; these approaches explore different retrieval models and different sources of evidence for estimating expertise. However, the current state-of-the-art techniques still lack in principled approaches for combining the multiple sources of evidence that can be used to estimate expertise. In traditional information retrieval tasks, such as ad hoc retrieval, there has been an increasing interest in the use of machine learning methods for building retrieval formulas that are capable of estimating relevance for query-document pairs. This approach is commonly referred to as Learning to Rank for Information Retrieval (L2R4IR) (Liu 2009). The general idea behind L2R4IR approaches is to use hand-labelled data (e.g., document collections that contain relevance judgments for specific sets of queries, or information regarding user clicks that is aggregated over query logs) to train ranking models and, in this way, use data to combine the different estimators of relevance in an optimal way. Thus far, although many different approaches have been proposed in the expert finding literature, few previous studies have specifically addressed the use of learning to rank in the development of approaches for the task of expert finding.

The combination of multiple sources of evidence has also received a substantial amount of interest in traditional search engines. The problem of combining various ranked lists for the same set of documents, to obtain a more accurate and more reliable ordering, can be defined as Rank Aggregation (Dwork, Kumar, Naor & Sivakumar 2001). Data fusion techniques include methods that are used to combine these different rankings.
Montague & Aslam (2002) have experimented with data fusion techniques in the domain of search engines, where they concluded that data fusion can provide significant advantages. Different retrieval methods often return very different irrelevant documents, although they return the same relevant ones. Thus, rank aggregation can provide a more reliable performance than individual retrieval methods.

This article explores the use of learning to rank methods or, alternatively, the use of rank aggregation algorithms in the expert finding task, specifically combining a large pool of estimators for expertise. We build on a preliminary study by Moreira, Calado & Martins (2011) that addresses the same problem, adding a larger set of experiments. We have evaluated this work on an academic publication dataset from the Computer Science domain.

The main contributions of this work can be summarised as follows:

• A Set of Features to Estimate Expertise. In this study, we defined a set of features that are based on three different sources of evidence, namely textual similarity features, author profile information features and features based on citation graphs. The textual features use traditional information retrieval techniques, which measure term co-occurrences between the query topics and the documents that are associated with a candidate expert. The profile information features measure the total publication record of a candidate throughout his career, under the assumption that true expert candidates are more productive. Finally, the features that are based on citation graphs capture the authority of candidate experts from the attention that others give to their work. Although all of these features correspond to statistical values that are often used in the literature in different areas, to the best of our knowledge, the present study is the first to use them for the task of discovering experts. The only exceptions are the numbers of publications and citations features, which have been used in the previous work of Yang, Tang, Wang, Guo, Li & Chen (2009). In addition, we applied a new set of features that has never been used before in Information Retrieval or in expert finding. Inspired by works in Scientometrics, we tested a set of academic indexes, which are commonly used to estimate the impact of the author's publications on the scientific community: h-index, g-index, e-index, contemporary h-index, trend h-index, a-index and the individual h-index. In terms of the features that we proposed, the use of such indexes is our main contribution.
• A Supervised Learning to Rank Approach for Expert Finding. Our experiments explored the use of learning to rank methods in the expert finding task, specifically combining a large pool of estimators of expertise, under the hypothesis that learning to rank approaches provide a significant improvement over the current state-of-the-art methods. To combine these multiple estimators, we performed experiments with state-of-the-art algorithms from the pointwise, pairwise and listwise learning to rank approaches. There are a couple of approaches in the literature that propose a learning to rank framework for the problem of expert finding. One approach is concerned with expertise retrieval in enterprises (Macdonald & Ounis 2011), and the other approach is based on finding academic experts (Yang et al. 2009), which is similar to the method used in our paper. Our main concern with both of these approaches is that they lack a detailed description of how the learning to rank framework works, and they formulate experiments by using only two algorithms (AdaRank and Metzler's Automatic Feature Selection algorithm in Macdonald & Ounis (2011), and SVMrank in Yang et al. (2009)). In our paper, we provide a more thorough description of how the learning to rank framework works, and we also provide a set of algorithms that spans the various classes of learning to rank solutions, with which we made comparative experiments.

• A Rank Aggregation Approach for Expert Finding. Our experiments also tested the hypothesis that rank aggregation methods, which are based on data fusion techniques, can provide significant advantages over the representative generative probabilistic models that are proposed in the expert finding literature. We use existing state-of-the-art algorithms to build a single expert finding model that enables the combination of a large pool of expertise estimates.

The remainder of this article is organised as follows: Section 2 presents the main concepts and related work. Section 3 presents the learning to rank approaches that are used in our experiments. Section 4 details the rank aggregation framework as well as the data fusion techniques that are used in our experiments. Section 5 introduces the multiple features that we use to estimate expertise. Section 6 describes how the system was evaluated, detailing the datasets that are used in our experiments as well as the obtained results. Finally, Section 7 presents our conclusions and points to directions for future work in this area.
Concepts and Related Work
Previous publications have surveyed the most important concepts and representative previous studies in the expert finding task (Serdyukov 2009, Macdonald & Ounis 2008). Two of the most popular and well-performing methods are the candidate-based and the document-based approaches. In candidate-based approaches, the system gathers all of the textual information about a candidate and merges it into a single document (i.e., the profile document). The profile document is then ranked by determining the probability of the candidate given the query topics. Candidate-based approaches are also referred to as Model 1 in Balog, Azzopardi & de Rijke (2006) and as query-independent methods according to Petkova & Croft (2006). In document-based approaches, the system gathers all of the documents that contain the expertise topic terms that are included in the query. Then, the system uncovers which candidates are associated with each of those documents and determines their probability scores. The final ranking of a candidate is provided by summing all of the individual scores of the candidate in each document. Document-based approaches are also referred to as Model 2 in Balog et al. (2006) or as query-dependent methods according to Petkova & Croft (2006). Experimental results show that document-based approaches usually outperform candidate-based approaches (Balog et al. 2006).

The first candidate-based approach was proposed by Craswell, Hawking, Vercoustre & Wilkins (2001), where the ranking of a candidate was computed through text similarity measures between the query topics and the candidate's profile document. Balog et al. (2006) formalised a general probabilistic framework for modelling the expert finding task, which used language models to rank the candidates. Language models apply probabilistic functions that rank documents based on the probability of the document model generating the query topic (within the document). This scenario occurs when the document contains a large number of occurrences of the query terms (Manning 2008). Petkova & Croft (2006) presented a general approach for representing the knowledge of a candidate expert as a mixture of language models from associated documents. Later, Balog, Azzopardi & de Rijke (2009) and Petkova & Croft (2007) introduced the idea of dependency between candidates and query topics by including a surrounding window to weight the strength of the associations between mentions of candidates in the text of the documents and the query topics. The surrounding window measures the proximity in which two words occur in the text. For expert finding, this proximity plays an important role, because when the query topics appear next to a candidate's name, there is a high probability that this query topic is associated with that candidate.

Many different authors have also proposed sophisticated probabilistic retrieval models that are based on the document-based approaches (Balog et al. 2006, Petkova & Croft 2007, Serdyukov 2009). For example, Cao, Liu, Bao & Li (2006) proposed a two-stage language model that combines document relevance and co-occurrence between experts and query terms. Fang & Zhai (2007) derived a generative probabilistic model from the probabilistic ranking principle and extended it with query expansion and non-uniform candidate priors. Zhu, Song & Rüger (2007) proposed a multiple-window-based approach for integrating multiple levels of associations between experts and query topics in expert finding.
Later, Zhu, Song, Rüger & Huang (2008) proposed a unified language model that integrates many document features.

In addition to the candidate-based and document-based approaches, other methods have also been proposed in the expert finding literature. For example, Macdonald & Ounis (2008) formalised a voting framework that was combined with data fusion techniques. Each candidate that was associated with documents that contained the query topics received a vote, and the ranking of each candidate was given by the aggregation of the votes of each document through data fusion techniques. Deng, King & Lyu (2011) proposed a query-sensitive AuthorRank model. These investigators modelled a co-authorship network and measured the weights of the connections between authors with the AuthorRank algorithm (Liu, Bollen, Nelson & de Sompel 2005). Because AuthorRank is query independent, the authors added probabilistic models to refine the algorithm to consider the query topics. Serdyukov & Hiemstra (2008) proposed the person-centric approach, which combines the ideas of the candidate-based and document-based approaches. Their system starts by retrieving the documents that contain the query topics and then ranks the candidates by combining the probability of generation of the query by each candidate's language model.

Although these models are capable of employing different types of associations among query terms, documents and experts, they mostly ignore other important sources of evidence, such as the importance of individual documents or the citation patterns between candidate experts that are available from citation graphs. In this paper, we study two different principled approaches for combining a much larger set of estimates for expert finding, namely learning to rank and rank aggregation.

Another work that follows the paradigm of this paper belongs to Macdonald & Ounis (2011), who proposed a learning to rank approach in which they created a feature generator that was composed of three components, namely, a document ranking model, a cutoff value to select the top documents according to the query topics, and rank aggregation methods. Using those features, the authors made experiments with the AdaRank listwise learning to rank algorithm, which outperformed all of the generative probabilistic methods that were proposed in the literature.

Fang & Zhai (2007) applied the probabilistic ranking principle to develop a general framework from which the candidate-based and document-based models for expert finding could be derived. They also showed how query expansion techniques, such as the association of different weights to each candidate representation and the topic expansion to give more textual information that is related to the original query, can improve the performance of the models that use the framework.

In the Scientometrics community, the evaluation of the scientific output of a scientist has also attracted significant interest, due to the importance of obtaining unbiased and fair criteria. Most of the existing methods are based on metrics such as the total number of authored papers or the total number of citations (Sidiropoulos & Manolopoulos 2005, Sidiropoulos & Manolopoulos 2006). Simple and elegant indexes, such as the Hirsch index, calculate how broad the research work of a scientist is, accounting for both productivity and impact. Graph centrality metrics inspired by PageRank, which are calculated over co-authorship graphs, have also been extensively used (Liu et al. 2005).
In the context of academic expert search systems, these metrics can easily be used as query-independent estimators of expertise, in much the same way that PageRank is used in the case of Web information retrieval systems.

A comprehensive survey about expertise retrieval and the different techniques proposed in the literature can be found in (Balog, Fang, de Rijke, Serdyukov & Si 2012).
For combining the multiple sources of expertise, we propose to use previous work concerning the subject of L2R4IR. Liu (2009) presented a notable survey on the subject, categorising the previously proposed supervised L2R4IR algorithms into three groups, according to their input representation and optimisation objectives:

• Pointwise approach - L2R4IR is seen as either a regression or a classification problem. Given the feature vectors of each single document from the data for the input space, the relevance degree of each of those individual documents is predicted with either a regression or a classification model. The relevance scores can then be used to sort the documents and produce the final ranked list of results. Several different pointwise methods have been proposed in the literature, including the Additive Groves algorithm by Sorokina, Caruana & Riedewald (2007), RankClass (Ji, Han & Danilevsky 2011), the algorithm proposed by Adali, Magdon-Ismail & Marshall (2007) and random model trees (Pfahringer 2011).

• Pairwise approach - L2R4IR is seen as a binary classification problem for document pairs, because the relevance degree can be regarded as a binary value that tells which document order is better for a given pair of documents. Given the feature vectors of pairs of documents from the data for the input space, the relevance degree of each of those documents can be predicted with scoring functions that attempt to minimise the average number of misclassified pairs. Several different pairwise methods have been proposed, including SVMrank (Joachims 2006), RankNet (Burges, Shaked, Renshaw, Lazier, Deeds, Hamilton & Hullender 2005), RankBoost (Freund, Iyer, Schapire & Singer 2003) and P-Norm Push (Ertekin & Rudin 2011).

• Listwise approach - L2R4IR is addressed in a way that accounts for an entire set of documents that are associated with a query, taking each document as an instance. These methods train a ranking function through the minimisation of a listwise loss function that is defined on the predicted list and the ground truth list. Given the feature vectors of a list of documents from the data for the input space, the relevance degree of each of those documents can be predicted with scoring functions that attempt to directly optimise the value of a specific information retrieval evaluation metric, averaged over all of the queries in the training data (Liu 2009). Several different listwise methods have also been proposed, including SVMmap (Yue, Finley, Radlinski & Joachims 2007), AdaRank (Xu, Liu, Lu, Li & Ma 2008, Xu & Li 2007), Coordinate Ascent (Metzler & Croft 2007) and P-Classification (Ertekin & Rudin 2011).

There are also some works that have extended the learning to rank approach to a relational learning to rank framework (Qin, Liu, Zhang, Wang, Xiong & Li 2008). In this new learning task, the ranking model accounts for not only the features in the documents but also the relationship information between the documents. The main difference between traditional learning to rank tasks and this new approach is that, in relational learning to rank, the ranking function is not solely concerned with the optimisation of a specific bound for the retrieval task. Instead, it attempts to exploit relationships between the documents that are retrieved. For example, if we have two very similar documents and the learning function only considers one of them to be relevant, then, through a relationship, the other document will be given a ranking that is similar to that of the first document.
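To make the distinction between the pointwise and pairwise families more concrete, the following minimal Python sketch (illustrative only, not the implementation used in this article) contrasts a pointwise regression loss with a pairwise hinge loss computed over the same toy set of query-expert feature vectors and binary relevance labels; the linear scoring function and the numbers are assumptions made for the example.

```python
# Illustrative sketch: one linear scoring function evaluated under a pointwise
# loss (regression on relevance labels) and under a pairwise loss (penalising
# mis-ordered expert pairs for a single query).

def score(weights, features):
    """Linear ranking function h(x) = w . x."""
    return sum(w * f for w, f in zip(weights, features))

def pointwise_loss(weights, instances):
    """Squared error between predicted scores and relevance labels."""
    return sum((score(weights, x) - y) ** 2 for x, y in instances)

def pairwise_loss(weights, instances):
    """Hinge loss over pairs where the more relevant expert should outrank the less relevant one."""
    loss = 0.0
    for x_i, y_i in instances:
        for x_j, y_j in instances:
            if y_i > y_j:  # expert i should be ranked above expert j
                loss += max(0.0, 1.0 - (score(weights, x_i) - score(weights, x_j)))
    return loss

# Toy data: (feature vector, binary relevance judgment) for one query.
instances = [([0.9, 0.7], 1), ([0.2, 0.4], 0), ([0.6, 0.1], 0)]
weights = [1.0, 0.5]
print(pointwise_loss(weights, instances), pairwise_loss(weights, instances))
```

A listwise method would instead evaluate the whole ranked list induced by the scores against a retrieval metric such as MAP, which is the strategy followed by SVMmap and AdaRank.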
For combining the different sources of expertise evidence, we also rely on previous studies that have addressed the problem of ranking search results through a rank aggregation framework, which are often based on data fusion methods that take their inspiration from voting protocols proposed in the area of social sciences. Riker (1988) suggested a classification that distinguishes the existing data fusion algorithms into two categories, namely the positional methods and the majoritarian algorithms. Later, Fox & Shaw (1994) proposed score aggregation methods that are specifically designed for information retrieval.

The positional methods are characterised by the computation of a candidate's score based on the position that the candidate occupies in the ranked lists given by each voter. If the candidate falls into the top position of the ranked list, then he receives a maximum score. If the candidate falls into the end of the list, then his score is a minimum score. The most representative positional algorithms are the Borda Count (de Borda 1781) and the Reciprocal Rank (Voorhees 1999) fusion methods.

The majoritarian algorithms are characterised by a series of pairwise comparisons between candidates. The candidates are scored according to the number of times that they win against another candidate in a pairwise comparison. The most representative majoritarian algorithm is most likely the Condorcet Fusion method proposed by Montague & Aslam (2002). However, there have been other proposals that are based on Markov Chain Models (Dwork et al. 2001).

Finally, score aggregation methods determine the highest ranked candidate by combining the ranking scores from all of the input rankings. Fox & Shaw (1994) proposed the CombSUM and CombMNZ methods, which have been used frequently in IR experiments. In this article, we performed experiments with representative supervised learning to rank algorithms from the pointwise, pairwise and listwise approaches, as well as with representative state-of-the-art data fusion algorithms from the positional, majoritarian and score aggregation approaches. Sections 3 and 4 detail, respectively, the learning to rank and rank aggregation approaches.

Figure 1: The learning to rank framework for expert finding.
One of the research questions that motivates this work is concerned with the possibility of learning to rank approaches being effectively used in the context of expert search tasks to combine different estimators of expertise in a principled way, to improve on the current state-of-the-art methods.

The expert finding problem can be formalised as follows. Given a set of queries Q = {q_1, ..., q_m} and a collection of experts E = {e_1, ..., e_n}, each of which is associated with specific documents that describe his topics of expertise, a training corpus for learning to rank is created as a set of query-expert pairs, (q_i, e_j) ∈ Q × E, upon which a relevance judgment that indicates the match between q_i and e_j is assigned by a labeller. This relevance judgment is a binary label that indicates whether the expert e_j is relevant to the query topic q_i or not. For each instance (q_i, e_j), a feature extractor produces a vector of features that contains statistical values related to q_i and e_j. The features can range from classical IR estimators computed from the documents associated with the experts (e.g., term frequency, inverse document frequency) to link-based features that are computed from networks that encode relations between the experts (e.g., PageRank). These features are detailed in Section 5 of this article. The inputs to the learning algorithm comprise training instances, their feature vectors, and the corresponding relevance judgments. The output is a ranking function, h, which produces a ranking score for each candidate expert e_j in such a way that, when sorting experts for a given query according to these scores, the more relevant experts appear at the top of the ranked list.

During the training process, the learning algorithm attempts to learn a ranking function that is capable of sorting experts in the following ways: for the listwise approach, it optimises a specific retrieval performance measure; for the pairwise approach, it attempts to minimise the number of misclassifications between the expert pairs; and for the pointwise approach, it attempts to directly predict the relevance score. In the test phase, the learned ranking function is applied to determine the relevance between each expert e_j in E and a new query q_i. Figure 1 shows a general illustration of the learning to rank framework that is used in this work. The experiments reported in this paper compared many different learning to rank algorithms, which required manual tuning of different parameters. In the next sections, we describe how we adjusted these parameters, and we also detail how the different methods work. In this section, we describe the algorithms that were used in our experiments, categorising them by the learning method that was used.
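As an illustration of the data layout just described, the following sketch (an assumption made for exposition, not the code used in our experiments) builds query-expert training instances with binary relevance judgments and serialises them in the SVMlight-style "label qid:N 1:v1 2:v2 ..." text format accepted by tools such as SVMrank; the feature names mentioned in the comments are hypothetical examples.

```python
# Minimal sketch of a learning-to-rank training corpus for expert finding:
# one instance per (query, expert) pair, with a binary relevance label and a
# feature vector, serialised into SVMlight/LETOR-style text lines.

from typing import List, Tuple

# (query id, expert id, relevance judgment, feature vector)
Instance = Tuple[int, str, int, List[float]]

def to_svmlight(instances: List[Instance]) -> str:
    """Serialise query-expert instances into SVMlight-style text lines."""
    lines = []
    for qid, expert, label, features in instances:
        feats = " ".join(f"{i + 1}:{v:.4f}" for i, v in enumerate(features))
        lines.append(f"{label} qid:{qid} {feats} # {expert}")
    return "\n".join(lines)

# Hypothetical features: [BM25 over titles, number of publications, h-index, PageRank].
corpus = [
    (1, "author_A", 1, [12.3, 85, 21, 0.0042]),
    (1, "author_B", 0, [3.1, 12, 4, 0.0007]),
    (2, "author_C", 1, [9.8, 40, 15, 0.0021]),
]
print(to_svmlight(corpus))
```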
Boosting is a supervised machine learning approach that iteratively attempts to improve a candidate solution. Boosting uses the concepts of weak and strong learners. Weak learners are classifiers that are only slightly correlated with the true classification. Strong learners are classifiers that are highly correlated with the true classification. The paradigm of boosting involves creating a single strong learner through the combination of a set of weak learners (Freund et al. 2003).

The AdaRank listwise method, proposed by Xu et al. (2008), builds a ranking model through the formalism of boosting, attempting to optimise a specific information retrieval performance measure. The basic idea of AdaRank is to train one weak ranker at each round of iteration and to combine these weak rankers into the final ranking function. After each round, the experts are re-weighted by decreasing the weight of correctly ranked experts, based on a specific evaluation metric, and by increasing the weight of the experts that performed poorly for the same metric. The AdaRank algorithm receives as input the parameter T, which is the number of iterations that the algorithm will perform, and the parameter E, which corresponds to a specific information retrieval performance measure.

The RankBoost pairwise method, proposed by Freund et al. (2003), also builds a ranking model through the formalism of boosting, attempting to minimise the number of misclassified pairs of experts in a pairwise approach. The basic idea of RankBoost is to train one weak ranker at each round of iteration and to combine these weak rankers into the final ranking function. After each round, the expert pairs are re-weighted by decreasing the weight of correctly ranked pairs of experts and increasing the weight of wrongly ranked pairs. The RankBoost algorithm receives as input the parameter T, which is the number of iterations that the algorithm will perform, and the parameter θ, which is a threshold that corresponds to the number of candidates to be considered in the weak rankers.

The Additive Groves pointwise method, introduced by Sorokina et al. (2007), builds a ranking model through the formalism of regression trees, attempting to directly minimise the errors in relevance predictions over the training dataset. In this approach, a grove is an additive model that contains a small number of large trees. The ranking model of a grove is built upon the sum of the ranking models of each one of those trees. The basic idea of Additive Groves is to initialise a grove with a single small tree. Iteratively, the grove is gradually expanded by adding a new tree or by enlarging the existing trees of the model. The new trees in the grove are trained with the set of experts that were misclassified by the other previously trained trees. In addition, trees are discarded and retrained in turn until the overall predictions converge to a stable function. The goal of this algorithm is to find the simplest model that can make the most accurate predictions. The prediction of a grove is given by the sum of the predictions of the trees that are contained in it.

One major problem in using regression trees is that these models will learn a function that fits the training data very well but cannot make good predictions on unseen data. This phenomenon is known as overfitting, and such models tend to have high variance. Therefore, there is a high danger of regression trees overfitting the training data.
The bagging procedure improves the performance of these models by reducing the variance and thus helps to avoid overfitting the training data. The algorithm receives as input the parameter N, which is the number of trees in the grove, the parameter α, which controls the size of each individual tree, and the parameter b, which is the number of bagging iterations, i.e., the number of additive models that are combined in the final ensemble. The publicly available version of this algorithm tunes these parameters automatically.

3.1.2 Algorithms Based on Classification by N-Dimensional Hyperplanes

Artificial Neural Networks (ANNs) are a machine learning approach that attempts to construct an N-dimensional decision boundary surface that separates the data into positive and negative examples, through the simulation of some properties that occur in biological neural networks. They require the use of optimisation methods, such as gradient descent, to find a solution that minimises the number of misclassifications (Haykin 2008).

The RankNet pairwise method, proposed by Burges et al. (2005), builds a ranking model through the formalism of Artificial Neural Networks, attempting to minimise the number of misclassified pairs of experts. The basic idea of RankNet is to use a multilayer neural network with a cost error entropy function. While a typical artificial neural network computes this cost by measuring the difference between the network's output values and the respective target values, RankNet computes the cost function by measuring the difference between a pair of network outputs. RankNet attempts to minimise the value of the cost function by adjusting each weight in the network according to the gradient of the cost function. This goal is accomplished through the use of the backpropagation algorithm. The RankNet algorithm receives as input the parameter epochs, which is the number of iterations that are used in the process of providing the network with an input and updating the network's weights, and the parameter hiddenNodes, which corresponds to the number of nodes in the network's hidden layer. If there are too few nodes, then we can underfit the data. On the other hand, if there are too many nodes, then we can overfit the data, and the resulting network will not generalise well.

Support Vector Machines (SVMs) can also be defined as learning machines that construct an N-dimensional decision boundary surface that optimally separates data into positive examples and negative examples, by maximising the margin of separation between these examples. One major advantage is that the computational complexity of an SVM does not depend on the dimensionality of the input space (Haykin 2008).

The SVMmap listwise method, introduced by Yue et al. (2007), builds a ranking model through the formalism of structured Support Vector Machines (Tsochantaridis, Joachims, Hofmann & Altun 2005), attempting to optimise the metric of Average Precision (see Section 6.1). The basic idea of SVMmap is to minimise a loss function that measures the difference between the performance of a perfect ranking (i.e., when the Average Precision equals one) and the minimum performance of an incorrect ranking. The SVMmap algorithm receives as input the parameter C, which affects the trade-off between the model complexity and the proportion of non-separable samples. If C is too large, then we have a high penalty for non-separable points and we could create many support vectors, which could lead to overfitting.
If C is too small, then we could have underfitting. In our experiments, we used SVMmap with a radial basis function kernel, which also requires the manual tuning of the parameter γ, which determines the area of influence that the centre support vector has over the data space.

The SVMrank pairwise method, introduced by Joachims (2006), builds a ranking model through the formalism of Support Vector Machines. The basic idea of SVMrank is to attempt to minimise the number of misclassified expert pairs in a pairwise setting. This goal is achieved by modifying the default support vector machine optimisation problem so that it performs a minimisation over each pair of experts. This optimisation is performed over a set of training queries, their associated pairs of experts and the corresponding relevance judgment over each pair of experts (i.e., pairwise preferences that result from a conversion from the ordered relevance judgments over the query-expert pairs). SVMrank receives as input the parameter C, which affects the trade-off between the model complexity and the proportion of non-separable samples. In our experiments, we used a linear kernel.

The Coordinate Ascent listwise method, proposed by Metzler & Croft (2007), is an optimisation algorithm that is used in unconstrained optimisation problems and that builds a ranking model by directly maximising an information retrieval performance measure. The basic idea of Coordinate Ascent is to iteratively optimise a multivariate objective function by solving a series of one-dimensional searches. In each iteration, Coordinate Ascent randomly selects one feature to perform a search on while holding all of the other features fixed. This way, in each iteration, the algorithm chooses the parameters that maximise the information retrieval performance measure. The Coordinate Ascent algorithm receives as input the parameter rr, which is the number of random restarts, and the parameter T, which corresponds to the number of iterations to perform in each one-dimensional space.

The most naive approach to parameter search is the grid search method. In this approach, a grid is placed over the parameter space, and the data are evaluated at every grid intersection, returning the parameters that lead to the maximum performance of the learning algorithm (Metzler & Croft 2007). However, a grid search has the problem of being unbounded, because an infinite set of parameters is available for testing. To overcome this issue, the parameter search was restricted by using the boundaries that were suggested by Hsu, Chang & Lin (2010) for the SVM-based algorithms (C ∈ {2^{-5}, 2^{-3}, ..., 2^{13}, 2^{15}} and γ ∈ {2^{-15}, 2^{-13}, ..., 2^{1}, 2^{3}}). For the other approaches (AdaRank, Coordinate Ascent, RankBoost and RankNet), the grid search was stopped when the results that were obtained started to converge to a single value according to an information retrieval metric. The parameters were learned with a k-fold cross-validation method and were fitted to the training data. We collected the parameters that, on average, achieved the best results over all of the tested folds.

We also experimented with the use of rank aggregation frameworks for combining multiple sources of expertise evidence. Because data fusion can aggregate the rankings of several individual features for each candidate, it has the advantage of not reflecting the tendency of a single feature. Instead, it reflects the combination of all of them, resulting in a more reliable and accurate ranking system.
The general rank aggregation framework that is proposed for expert finding is illustrated in Figure 2.

Figure 2: A general rank aggregation framework for expert retrieval.

In this framework, we are given a set of queries Q = {q_1, q_2, ..., q_m} and a collection of candidate experts E = {e_1, e_2, ..., e_n}, each of which is associated with specific documents that describe the candidate's topics of expertise. For each instance (q_i, e_j) ∈ Q × E, a feature extractor produces a set of ranked lists according to the match between q_i and e_j. These features are detailed in Section 5 of this work. A data fusion algorithm is then applied to combine the various ranked lists that are computed by each of the features. The inputs of a rank aggregation algorithm comprise a set of queries and the data fusion technique to be applied. The output is a ranking score that results from the aggregation of the multiple features. The relevance of each expert e_j towards the query q_i is determined through this aggregated score.

The score aggregation data fusion techniques that are used in our experiments require normalised scores for the different features. To perform this normalisation, we applied the Min-Max normalisation procedure, which is given by Equation 1.

NormalisedValue = (Value − minValue) / (maxValue − minValue)    (1)

The CombSUM, CombMNZ and CombANZ approaches, which were introduced by Fox & Shaw (1994), are three examples of rank aggregation algorithms. In this article, we performed experiments with representative data fusion algorithms from the information retrieval literature, namely CombSUM, CombMNZ, CombANZ, Borda Fuse, Reciprocal Rank Fuse and Condorcet Fusion.

The CombSUM score of an expert e for a given query q is the sum of the normalised scores received by the expert in each individual ranking feature, as given by Equation 2.

CombSUM(e, q) = Σ_{j=1}^{k} score_j(e, q)    (2)

Similarly, the CombMNZ score of an expert e for a given query q is defined by Equation 3, where r_e is the number of ranking features that contribute to the retrieval of the candidate, by having a score that is larger than zero.

CombMNZ(e, q) = CombSUM(e, q) × r_e    (3)

The CombANZ score of an expert e for a given query q is defined in the same way as the CombMNZ method, but the scores of the candidates are divided by the number of ranking features that contribute to the retrieval of the candidate, instead of being multiplied, as shown in Equation 4. CombANZ gives more weight to the candidates that are relevant to the query but are not returned by many of the systems.

CombANZ(e, q) = CombSUM(e, q) / r_e    (4)

The Borda Fuse positional method was originally proposed by de Borda (1781) in the scope of social voting theory. This method determines the highest ranked expert by assigning to each individual candidate a certain number of votes, which corresponds to the candidate's position in the ranked list given by each feature. Generally speaking, if a given candidate e_j appears at the top of the ranked list, then one assigns to him n votes, where n is the number of experts in the list. If the candidate appears in the second position of the ranked list, then he is assigned n − 1 votes, and so forth.

The Reciprocal Rank Fuse positional method was originally proposed by Voorhees (1999) in the scope of Question Answering systems. Reciprocal Rank Fuse determines the highest ranked expert by assigning to each individual candidate a score that corresponds to the inverse of his position in the ranked list given by each feature.
Generally speaking, if a candidate e_j appears at the top of the ranked list, one assigns to him a score of 1/1; if he appears in the second position, the score is 1/2, and so forth.

The Condorcet Fusion majoritarian method was originally proposed by Montague & Aslam (2002) in the scope of social voting theory. The Condorcet Fusion method determines the highest ranked expert by accounting for the number of times that an expert wins or ties with every other candidate in a pairwise comparison. To rank the candidate experts, we use their win and loss values. If the number of wins of an expert is higher than that of another expert, then the first expert wins. A tie in Condorcet Fusion occurs when two experts have the same number of wins. To untie them, we account for the number of losses: the expert that has the lowest number of losses wins. If the experts have exactly the same number of wins and losses (an unlikely but possible scenario), then there is no way to untie them, and the system returns both in a random order (Bozkurt, Gurkok & Ayaz 2007).

To give an illustrative example of how the framework for rank aggregation works, let us assume that a user wants to know the top experts in
Information Retrieval. The first step of our system is to retrieve all of the authors that have the query topics in their publications' titles or abstracts. Each set of features is then responsible for detecting different types of information in those documents. The textual features will collect information such as the term frequency (Section 5 details these features). Profile features, on the other hand, will collect the total publication record of the candidate. Citation features collect information such as the number of citations of the candidate's work and the number of co-authors. For each feature, a score will be computed and assigned to the author. These features will represent the author's knowledge of the query topics. Then, according to the data fusion method that is used, all of the individual scores of each feature will be combined into a single value. For example, if one uses the CombSUM data fusion method, then the scores of each feature will be summed. The authors are then ranked by the resulting summation score.
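The following Python sketch mimics this example with assumed toy scores (it is not the system's implementation): three features produce scores for three hypothetical authors, and the Min-Max normalisation of Equation 1, the CombSUM and CombMNZ methods of Equations 2 and 3, and the Borda Fuse positional method are applied to fuse them.

```python
# Illustrative data fusion sketch: CombSUM, CombMNZ and Borda Fuse applied to
# per-feature scores for a single query; toy scores only.

def min_max(scores):
    """Min-Max normalisation of one feature's scores (Equation 1)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {e: (s - lo) / span for e, s in scores.items()}

def comb_sum(feature_scores):
    """CombSUM: sum of normalised scores over all ranking features (Equation 2)."""
    fused = {}
    for scores in map(min_max, feature_scores):
        for expert, s in scores.items():
            fused[expert] = fused.get(expert, 0.0) + s
    return fused

def comb_mnz(feature_scores):
    """CombMNZ: CombSUM times the number of features with a non-zero score (Equation 3)."""
    summed = comb_sum(feature_scores)
    support = {e: sum(1 for f in feature_scores if f.get(e, 0) > 0) for e in summed}
    return {e: summed[e] * support[e] for e in summed}

def borda(feature_scores):
    """Borda Fuse: each feature awards n votes to its top expert, n-1 to the next, and so on."""
    votes = {}
    for scores in feature_scores:
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for position, expert in enumerate(ranked):
            votes[expert] = votes.get(expert, 0) + (n - position)
    return votes

# Toy scores from three features (e.g., BM25 on titles, citation count, PageRank).
features = [
    {"author_A": 12.0, "author_B": 3.0, "author_C": 7.0},
    {"author_A": 150, "author_B": 40, "author_C": 90},
    {"author_A": 0.002, "author_B": 0.004, "author_C": 0.001},
]
for name, fused in [("CombSUM", comb_sum(features)), ("CombMNZ", comb_mnz(features)), ("Borda", borda(features))]:
    print(name, sorted(fused, key=fused.get, reverse=True))
```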
The considered set of features for estimating the degree of expertise of a person toward a givenquery can be divided into three groups, namely, the textual features, profile features and citationgraph features. The textual features are similar to those used in standard text retrieval systemsand also in previous learning to rank experiments (e.g., TF-IDF and BM25 scores). The profileinformation features correspond to importance estimates for the authors, which are derived fromtheir profile information (e.g., the number of papers published). Finally, the graph featurescorrespond to importance and relevance estimates that are computed from citation counts andcitation graphs.
Similar to previous expert finding studies that are based on document-centric approaches, we also use textual similarities between the query and the contents of the documents to build estimates of expertise. In the domain of academic digital libraries, the associations between documents and experts can easily be obtained from the authorship information that is associated with the publications. For each topic-expert pair, we used the Okapi BM25 document-scoring function to compute the textual similarity features. Okapi BM25 is a state-of-the-art IR ranking mechanism that is composed of several simpler scoring functions with different parameters and components (e.g., term frequency and inverse document frequency). It can be computed through the formula shown in Equation 5, where
Terms(q) represents the set of terms from query q, Docs(a) is the set of documents that have a as an author, Freq(i, d_j) is the number of occurrences of term i in document d_j, |d_j| is the number of terms in document d_j, N is the number of documents in the collection, and A is the average length of the documents in the collection. The values given to the parameters k1 and b were 1.2 and 0.75, respectively. Most of the previous IR experiments use these default values for the k1 and b parameters.

BM25(q, a) = Σ_{d_j ∈ Docs(a)} Σ_{i ∈ Terms(q)} log( (N − Freq(i, d_j) + 0.5) / (Freq(i, d_j) + 0.5) ) × ( (k1 + 1) × Freq(i, d_j)/|d_j| ) / ( Freq(i, d_j)/|d_j| + k1 × (1 − b + b × |d_j|/A) )    (5)

We also experimented with other textual features that are commonly used in ad-hoc IR systems, such as Term Frequency and
Inverse Document Frequency.

Term Frequency (TF) corresponds to the number of times that each individual term in the query occurs in all of the documents that are associated with the author. Equation 6 describes the TF formula.

TF_{q,a} = Σ_{d_j ∈ Docs(a)} Σ_{i ∈ Terms(q)} Freq(i, d_j) / |d_j|    (6)

The Inverse Document Frequency (IDF) is the sum of the values for the inverse document frequency of each query term and is given by Equation 7. In this formula, |D| is the size of the document collection, and f_{i,D} corresponds to the number of documents in the collection in which the i-th query term occurs.

IDF_q = Σ_{i ∈ Terms(q)} log( |D| / f_{i,D} )    (7)

Other features that we used correspond to the number of unique authors that are associated with documents that contain the query topics, the range of years between the first and last publications of the author containing the query terms, and the sum of the document lengths, in terms of the number of words, for all of the publications that are associated with the author. In the computation of these textual features, we considered two different textual streams from the documents, namely (i) a stream that is composed of the titles, and (ii) a stream that uses the abstracts of the articles.

We also considered a set of profile features that are related to the amount of published material associated with authors, generally assuming that expert authors are likely to be more productive. Most of the features that are based on profile information are query independent, meaning that they have the same value for different queries. The considered set of profile features is based on the temporal interval between the first and the last publications, the average number of papers and articles per year, and the number of publications in conferences and in journals with and without the query topics in their contents.
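A minimal sketch of the textual similarity features from Equations 5-7 is given below, under simplifying assumptions (naive whitespace tokenisation, collection statistics passed in explicitly, and the standard document-frequency-based IDF component inside BM25); the toy titles and statistics are invented for the example, so the code is illustrative rather than the exact formulation used in our system.

```python
# Sketch of the textual features: BM25, TF and IDF computed over the documents
# (e.g., titles or abstracts) associated with a single candidate author.

import math

def bm25(query_terms, author_docs, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Sum BM25 contributions over the author's documents and the query terms (cf. Equation 5)."""
    score = 0.0
    for doc in author_docs:
        tokens = doc.lower().split()
        length = len(tokens)
        for term in query_terms:
            tf = tokens.count(term) / length if length else 0.0
            df = doc_freq.get(term, 0)                      # collection document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5))
            score += idf * ((k1 + 1) * tf) / (tf + k1 * (1 - b + b * length / avg_len))
    return score

def term_frequency(query_terms, author_docs):
    """Equation 6: query-term frequency accumulated over the author's documents."""
    total = 0.0
    for doc in author_docs:
        tokens = doc.lower().split()
        if tokens:
            total += sum(tokens.count(t) for t in query_terms) / len(tokens)
    return total

def inverse_document_frequency(query_terms, doc_freq, n_docs):
    """Equation 7: sum of the inverse document frequencies of the query terms."""
    return sum(math.log(n_docs / doc_freq[t]) for t in query_terms if doc_freq.get(t))

# Toy example: two titles for one author, plus assumed collection statistics.
docs = ["learning to rank academic experts", "rank aggregation for expert finding"]
stats = {"expert": 120, "finding": 80, "rank": 200}
print(bm25(["expert", "finding"], docs, stats, n_docs=1000, avg_len=8.0))
print(term_frequency(["expert", "finding"], docs))
print(inverse_document_frequency(["expert", "finding"], stats, n_docs=1000))
```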
Scientific impact metrics computed over scholarly networks that encode citation information can offer effective approaches for estimating the importance of the contributions of specific publications, publication venues, or individual authors. Thus, we have considered a set of features that estimate expertise based on citation information. The considered features are divided into three subsets, namely (i) citation counts, (ii) academic indexes and (iii) graph centrality. With regard to citation counts, we used the total, the average and the maximum number of citations of the papers that contain the query topics, the average number of citations per year of the papers that are associated with an author, and the total number of unique collaborators that worked with an author. With regard to academic impact indexes, we used the following features:

• Hirsch index of the author and of the author's institution, measuring both the scientific productivity and the scientific impact of the author or his institution (Hirsch 2005). A given author or institution has a Hirsch index of h if h of his N_p papers have at least h citations each and the other (N_p − h) papers have at most h citations each. Authors who have a high Hirsch index, or authors who are associated with institutions that have a high Hirsch index, are more likely to be considered experts.

• Hirsch index considering the query topics of the author, enabling the measurement of the scientific impact of the author in the field that is characterised by the query topic. An author has an h index of h if h of his N_p papers that contain the query terms have at least h citations each, and the other (N_p − h) papers have at most h citations each.

• Contemporary Hirsch index of the author, which adds an age-related weighting to each cited article, giving less weight to older articles (Sidiropoulos, Katsaros & Manolopoulos 2007). A researcher has a contemporary Hirsch index h^c if h^c of his N_p articles have a score of S^c(i) >= h^c each, and the remaining (N_p − h^c) articles have a score of S^c(i) <= h^c. For an article i, the score S^c(i) is:

S^c(i) = γ × (Y(now) − Y(i) + 1)^{−δ} × |CitationsTo(i)|    (8)

In this formula, Y(i) refers to the year of publication of article i. The γ and δ parameters were set to 4 and 1, respectively, which means that the citations for an article that was published during the current year are counted 4 times, the citations for an article that was published 4 years ago are counted only one time, and the citations for an article that was published 6 years ago are counted 4/7 times.

• Trend Hirsch index (Sidiropoulos et al. 2007) for the author, which assigns to each citation an exponentially decaying weight according to the age of the citation, estimating the impact of a researcher's work at a specific time instance. A researcher has a trend Hirsch index h^t if h^t of his N_p articles receive a score of S^t(i) >= h^t each, and the remaining (N_p − h^t) articles receive a score of S^t(i) <= h^t. For an article i, the score S^t(i) is defined as shown below:

S^t(i) = γ × Σ_{∀x ∈ C(i)} (Y(now) − Y(x) + 1)^{−δ}    (9)

Similar to the case of the contemporary Hirsch index, the γ and δ parameters are set here to 4 and 1, respectively.
• Individual Hirsch index of the author, which is computed by dividing the value of the standard Hirsch index by the average number of authors in the articles that contribute to the Hirsch index of the author, thus reducing the effects of frequent co-authorship with influential authors (Batista, Campiteli & Kinouchi 2006).

• The a-index of the author or the author's institution, which measures the magnitude of the most influential articles. For an author or an institution that has a Hirsch index of h and a total of N_{c,tot} citations toward his papers, we say that he has an a-index of a = N_{c,tot} / h^2.

• The g-index of the author or his institution, which also quantifies scientific productivity based on the publication record (Egghe 2006). Given the set of articles that are associated with an author or an institution, ranked in decreasing order of the number of citations that they received, the g-index is the (unique) largest number such that the top g articles received, on average, at least g citations.

• The e-index of the author (Zhang 2009), which represents the excess citations of an author. The motivation behind this index is that we can complement the h-index by accounting for the excess citations that are ignored by the h-index. The e-index is given by the formula shown in Equation 10, where cit_j are the citations that are received by the j-th paper and h is the h-index.

e^2 = Σ_{j=1}^{h} cit_j − h^2  ⟹  e = sqrt( Σ_{j=1}^{h} cit_j − h^2 )    (10)

In addition to these features, and following the ideas of Chen, Xie, Maslov & Redner (2007), we have also considered a set of graph-centrality features that estimate the influence of individual authors using PageRank, which is a well-known graph linkage analysis algorithm that was introduced by the Google search engine (Brin, Page, Motwani & Winograd 1999). PageRank assigns a numerical weighting to each element of a linked set of objects (e.g., hyperlinked Web documents or articles in a citation network) with the purpose of measuring its relative importance within the set. The PageRank value of a node is defined recursively and depends on the number and PageRank scores of all of the other nodes that link to it. A node that is linked to by many nodes with high PageRank scores receives a high score itself.

Formally, given a graph with N authors as nodes, connected to each other through citation links, the PageRank of an author A, PR(A), is defined by Equation 11.

PR(A) = (1 − d)/N + d × Σ_{j ∈ inLinks_A} PR(j) / OutLinks_j    (11)

In Equation 11, the sum is over all authors j that cite author A. The term OutLinks_j corresponds to the number of citations made by author j, and the term 1 − d, with d being the damping factor, can be seen as a decay term. Under a web search scenario, it represents the probability that a user will stop clicking links and jump to another random page. Under the expert finding scenario, it can be seen as an interest in a different author, instead of the search process only being interested in authors that are cited. The parameter d was set to 0.85, because most of the implementations in the literature use this value (Brin et al. 1999).

The PageRank-based features that we considered correspond to the sum and average of the PageRank values that are associated with the papers of the author that contain the query terms, which are computed over a directed graph that represents citations between papers.
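To illustrate how two of these citation-based features can be computed, the sketch below (toy data, not the actual experimental code) derives the Hirsch index from a list of per-paper citation counts and runs a simple power-iteration PageRank, in the spirit of Equation 11, over a small citation graph; papers without outgoing references are handled naively, which is a simplification.

```python
# Illustrative computation of the Hirsch index and of PageRank over a citation graph.

def h_index(citations):
    """h of the author's Np papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

def pagerank(out_links, d=0.85, iterations=50):
    """Power-iteration PageRank with damping factor d over a dict {node: [cited nodes]}."""
    nodes = set(out_links) | {n for targets in out_links.values() for n in targets}
    n = len(nodes)
    pr = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1 - d) / n for node in nodes}
        for source, targets in out_links.items():
            if targets:  # papers with no outgoing references simply do not redistribute mass
                share = pr[source] / len(targets)
                for target in targets:
                    new[target] += d * share
        pr = new
    return pr

print(h_index([25, 8, 5, 3, 3, 1]))                     # -> 3
citations = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": [], "p4": ["p1", "p3"]}
print(pagerank(citations))
```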
Authors who published papers with high PageRank scores are more likely to be considered experts.

This section describes the validation of the main hypothesis behind this work, which states that either learning to rank approaches or rank aggregation methods can combine multiple estimators of expertise in a principled way, in this way improving over the current state-of-the-art expert retrieval systems.
The validation of the proposed approaches requires a sufficiently large repository of textual content that describes the expertise of the individuals. In this work, we used a dataset for evaluating expert searches in the Computer Science domain, which corresponds to an enriched version of the DBLP database that was made available through the Arnetminer project.

DBLP data have been used in several previous experiments that involve citation analysis (Sidiropoulos & Manolopoulos 2005, Sidiropoulos & Manolopoulos 2006) and expert search (Deng, King & Lyu 2008, Deng et al. 2011). DBLP is a large dataset that covers both journal and conference publications for the Computer Science domain, in which substantial effort has been invested in the problem of author identity resolution (i.e., resolving possibly different names to the same person). This dataset contains only the publications' titles and, for some papers, the abstracts; the contents of the full papers are not available. Table 1 provides a statistical characterisation of the DBLP dataset.

The main reason for using the DBLP dataset is the fact that, to the best of our knowledge, it is the only dataset for expert finding in academic publications that provides relevance judgments for each query-expert pair. Therefore, it was the only publicly available dataset that enabled the exploration of supervised techniques. Moreover, this dataset is the one most often used in the literature for finding academic experts (Yang et al. 2009, Deng et al. 2008, Deng et al. 2011). We note that our approach could be extended to any dataset of academic publications, as long as the dataset provides information about the publications' titles and abstracts, their respective authors and the references in the publications.

The relevance judgments correspond to lists of experts for particular query topics that have already been used in other expert finding experiments (Yang et al. 2009, Deng et al. 2011). The Arnetminer dataset comprises a set of 13 query topics from the Computer Science domain, and it was built by collecting people from the program committees of important conferences that are related to the query topics. Table 2 shows the distribution of experts that are associated with each topic, as provided by Arnetminer.
Query Topics                    Rel. Authors    Query Topics                      Rel. Authors
Boosting (B)                    46              Natural Language (NL)             41
Computer Vision (CV)            176             Neural Networks (NN)              103
Cryptography (C)                148             Ontology (O)                      47
Data Mining (DM)                318             Planning (P)                      23
Information Extraction (IE)     20              Semantic Web (SW)                 326
Intelligent Agents (IA)         30              Support Vector Machines (SVM)     85
Machine Learning (ML)           34

Table 2: Characterisation of the Arnetminer dataset of Computer Science experts.

With respect to the learning to rank framework, we used the existing learning to rank implementations that are available in the RankLib software package developed by Van Dang, as well as the SVMrank implementation by Joachims (2006), the SVMmap implementation by Yue et al. (2007) (available at http://projects.yisongyue.com/svmmap/) and the Additive Groves implementation by Sorokina et al. (2007). The Arnetminer relevance judgments are available at http://arnetminer.org/lab-datasets/expertfinding/.

The models were trained and evaluated through a k-fold cross-validation methodology, in which we used 4 folds. In this method, our data were randomly partitioned into k equal-size sub-samples. By equal size, we mean that the number of different classes (queries) for classification is the same in every sub-sample. Of these k sub-samples, a single sub-sample was retained as the validation data to test the model, and the remaining k − 1 sub-samples were used as training data. The cross-validation process was then repeated k times, with each of the k sub-samples being used exactly once as validation data. The results of the k folds were then averaged to produce a single estimation. In the end, each fold contained 9 queries to train and 4 queries to test.

The parameters were determined by using a grid search approach (see Section 3.2). Table 3 presents the optimal parameters that were found.

Parameters    AdaRank    Coordinate Ascent    RankBoost    RankNet    SVMrank    SVMmap
Parameters        AdaRank     Coordinate Ascent     RankBoost     RankNet     SVMrank     SVMmap

Table 3: Parameters found with the grid search approach. The "-" symbol means that the algorithm does not have the corresponding parameter. For instance, AdaRank has only one parameter, namely the number of iterations; the other parameters do not apply to this algorithm.

With regard to the rank aggregation framework, we implemented six different data fusion algorithms, based on score aggregation, positional methods and majoritarian methods. The score aggregation algorithms that were developed were CombSUM, CombMNZ and CombANZ. The positional algorithms were Borda Fuse and Reciprocal Rank Fuse. The majoritarian algorithm was Condorcet Fusion. All of these algorithms have been described in Section 4 of this work.

To validate the different rank aggregation algorithms, we again had to complement the Arnetminer dataset with negative relevance judgments (i.e., adding non-relevant authors for each of the query topics). Because we are interested in deriving a ranked ordering, rather than a classification, the non-expert candidates were obtained in the following way: for each query topic, we retrieved from the database the top authors that were not marked as relevant, according to the BM25 metric. The Arnetminer instances, together with the non-relevant candidates that were collected, constituted a total of 350 instances for each query topic.

To measure the quality of the results, we used three different performance metrics, namely Precision at k (P@k), Mean Average Precision (MAP) and Normalised Discounted Cumulative Gain (NDCG).

Precision at rank k is used when a user wishes to look only at the first k retrieved domain experts. The precision is calculated at that rank position through Equation 12:

P@k = \frac{r(k)}{k}    (12)

In the formula, r(k) is the number of relevant authors that are retrieved in the top k positions. P@k considers only the top-ranking experts to be relevant and computes the fraction of such experts in the top k elements of the ranked list.

The Mean Average Precision (MAP) over the test queries is defined as the mean of the Average Precision scores of all of the retrieved relevant experts. For each query r, the Average Precision (AP) is given by Equation 13:

AP[r] = \frac{\sum_{k=1}^{n} P@k[r] \times I\{g_{rk} = \max(g)\}}{\sum_{k=1}^{n} I\{g_{rk} = \max(g)\}}    (13)

In Equation 13, n is the number of experts that are associated with query r, and g_{rk} is the relevance grade for author k in relation to query r. In the case of our dataset, max(g) = 1 (i.e., we have two different relevance grades, 0 or 1).

The Normalised Discounted Cumulative Gain (NDCG) emphasises the fact that highly relevant experts should appear at the top of the ranked list. This metric is given by Equation 14, where Z_k is a normalisation factor that corresponds to the maximum score that could be obtained when looking at the top k experts, and rel_i is the relevance grade that is assigned to expert i:

NDCG[r] = Z_k \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(1 + i)}    (14)

We also performed statistical significance tests over the results, by using an implementation of the two-sided randomisation test (Smucker, Allan & Carterette 2007).
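For reference, the following is a minimal sketch of how Equations 12 to 14 can be computed for a single ranked list with binary relevance grades, as in our dataset. It merely illustrates the definitions above and is not the evaluation code that was actually used; MAP is then the mean of the AP values over the test queries, and the example list at the bottom is hypothetical.

```python
import math

def precision_at_k(rels, k):
    """P@k (Equation 12): fraction of relevant experts in the top k positions."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP (Equation 13): with binary grades, the indicator selects the rank
    positions that hold relevant experts, and AP averages P@k over them."""
    hits = [precision_at_k(rels, k) for k in range(1, len(rels) + 1) if rels[k - 1] == 1]
    return sum(hits) / len(hits) if hits else 0.0

def ndcg(rels, k):
    """NDCG (Equation 14): discounted cumulative gain at rank k, normalised by
    the maximum gain achievable with an ideal ordering (the Z_k factor)."""
    def dcg(grades):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical ranked list for one query: positions 1, 2 and 4 hold experts.
rels = [1, 1, 0, 1, 0]
print(precision_at_k(rels, 5), average_precision(rels), ndcg(rels, 5))
```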
This section presents the results that were obtained with both of the frameworks that were tested in this work, namely the supervised learning to rank framework and the rank aggregation framework for expert finding.

In this paper, we argue that the use of supervised learning to rank algorithms is a sound approach for the expert finding task, effectively combining a large pool of estimators that characterise the knowledge of an expert.

The goal of a learning to rank framework is to combine features in an optimal way. In our work, we combine them by using different learning to rank algorithms, and we test them in two different ways: (1) by determining the best algorithm for the expert finding task in academic publications (Table 4), and (2) by comparing different groups of features, to understand which groups achieve better results and are more relevant for discriminating experts (Table 6).

Table 4 presents the results that were obtained over the DBLP dataset. These results show that the pointwise Additive Groves approach outperformed all of the other pairwise and listwise learning to rank algorithms in terms of MAP, and it therefore provided a better ranking procedure than all of the other approaches that were tested. On the other hand, the listwise SVMmap algorithm, with an RBF kernel, performed almost as well as the Additive Groves method. This finding makes sense, because the goal of SVMmap is to optimise the Mean Average Precision scores. In fact, SVMmap outperformed Additive Groves when ranking the top 5 experts (P@5), which shows that this listwise method can also be successfully used in the context of expert finding. Table 4 also shows competitive results for the pairwise approach of SVMrank with a linear kernel.

We also compared these algorithms against the BM25 baseline and against the approaches proposed by Balog et al. (2006), namely the candidate-based Model 1 and the document-based Model 2 (implementation available at http://code.google.com/p/ears/). The experiments revealed that Model 1 and Model 2 have similar performances on such an academic dataset, but they achieved a lower performance when compared with all of the algorithms that were tested with the supervised learning approach. In Model 1, when an author publishes a paper that contains a set of words that exactly matches the query topics, the author achieves a very high score. In addition, because we are addressing very large datasets, there are many authors in such a situation and, consequently, the top-ranked authors are dominated by non-experts, while the real experts are ranked lower. Indeed, more general papers on a subject can actually contain the topic in the title and abstract, in contrast to papers that are written by experts, who tend to jump directly into the details. In Model 2, because we only include the publications' titles and some abstracts, the query topics provide only limited textual evidence for this document-based model.
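To illustrate the general pointwise setup behind the best performing method, the sketch below fits query-expert feature vectors with gradient boosted regression trees from scikit-learn, used here only as a readily available stand-in for the actual Additive Groves implementation; the feature matrix and the relevance labels are randomly generated placeholders, not data from our experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical toy data: one row per query-expert pair, with the textual,
# profile and graph expertise estimators concatenated into a single vector.
rng = np.random.default_rng(0)
X_train = rng.random((200, 25))      # 25 hypothetical expertise estimators
y_train = rng.integers(0, 2, 200)    # binary relevance judgments
X_test = rng.random((50, 25))        # candidate experts for one test query

# Pointwise ranking: regress the relevance label directly from the features ...
model = GradientBoostingRegressor(n_estimators=300, max_depth=3)
model.fit(X_train, y_train)

# ... and rank the candidates of each test query by their predicted scores.
scores = model.predict(X_test)
ranking = np.argsort(-scores)        # candidate indices, best first
print(ranking[:10])
```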
L2R Algorithms        P@5      P@10     P@15     P@20     MAP      NDCG
AdaRank               0.667    0.683    0.674    0.681    0.648    0.885
Coordinate Ascent     0.925    0.873    0.841    0.825    0.758    0.936
RankNet               0.704    0.719    0.688    0.676    0.653    0.873
RankBoost             0.838    0.879    0.832    0.815    0.784    0.940
Additive Groves       0.967
SVMmap
Table 4: Results for the various learning to rank algorithms that were tested.

In a separate experiment, we attempted to measure the impact of the different types of ranking features on the quality of the results. Using the best performing learning to rank algorithm, namely the Additive Groves method, we measured the results that were obtained by ranking models that considered (i) only the textual similarity features, (ii) only the profile features, (iii) only the graph features, (iv) textual similarity and profile features, (v) textual similarity and graph features and (vi) profile and graph features. Table 6 shows the results that were obtained, and Table 7 shows the p-values for the statistical significance tests, using a two-sided randomisation test.
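These tests compare two systems on the basis of their per-query scores. The following is a minimal sketch of a two-sided paired randomisation test in the spirit of Smucker et al. (2007), assuming one metric value per query for each system; it is not the exact implementation that was used, and the per-query values in the example are hypothetical.

```python
import random

def randomisation_test(scores_a, scores_b, trials=20000, seed=13):
    """Two-sided paired randomisation test: under the null hypothesis the
    per-query scores of the two systems are exchangeable, so randomly swapping
    them should often produce a mean difference at least as large (in absolute
    value) as the one that was actually observed."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)) / n)
    extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff / n) >= observed:
            extreme += 1
    return extreme / trials  # two-sided p-value

# Hypothetical per-query average precision values for two systems (13 queries).
ap_system_a = [0.91, 0.85, 0.77, 0.88, 0.93, 0.80, 0.74, 0.90, 0.86, 0.79, 0.95, 0.82, 0.88]
ap_system_b = [0.70, 0.66, 0.61, 0.72, 0.75, 0.58, 0.63, 0.69, 0.71, 0.60, 0.74, 0.65, 0.68]
print(randomisation_test(ap_system_a, ap_system_b))
```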
Algorithms                                           P@5       P@10      P@15      P@20      MAP       NDCG
Add. Groves vs AdaRank                               0.00200*  0.00049*  0.00051*  0.00021*  0.00038*  0.00024*
Add. Groves vs Coord. Ascent                         0.49894   0.03075*  0.02347*  0.00797*  0.00327*  0.00701*
Add. Groves vs RankNet                               0.01578*  0.00215*  0.00052*  0.00021*  0.00038*  0.00024*
Add. Groves vs RankBoost                             0.49717   0.08504   0.03020*  0.00639*  0.00941*  0.03638*
Add. Groves vs SVMmap                                1.00000   0.32419   0.21003   0.13053   0.33701   0.67007
Add. Groves vs SVMrank                               1.00000   0.19688   0.10640   0.11147   0.08929   0.27928
Add. Groves vs BM25                                  0.03448*  0.00137*  0.00052*  0.00021*  0.00065*  0.00083*
Add. Groves vs Balog's Model 1 (Balog et al. 2006)   0.00021*  0.00023*  0.00022*  0.00021*  0.00038*  0.00024*
Add. Groves vs Balog's Model 2 (Balog et al. 2006)   0.00021*  0.00023*  0.00022*  0.00021*  0.00038*  0.00024*
Table 5: P-values obtained during the significance tests. The * indicates that the improvement obtained using the Additive Groves algorithm is statistically significant, for a confidence level of 95%.

Sets of Features                     P@5     P@10    P@15    P@20    MAP     NDCG
Text Similarity + Profile + Graph
Text Similarity + Profile
Table 6: The results obtained with different sets of features.
Sets of Features                            P@5       P@10      P@15      P@20      MAP       NDCG
All Features vs Text Similarity + Profile   1.00000   0.62337   0.74859   0.38971   0.02632*  0.02494*
All Features vs Text Similarity + Graph     0.49951   0.50039   0.37410   0.24767   0.02207*  0.02850*
All Features vs Profile + Graph             0.50069   0.12367   0.05410   0.03009*  0.01128*  0.00100*
All Features vs Text Similarity             0.49951   0.12539   0.12440   0.03976*  0.08140   0.05283
All Features vs Profile                     1.00000   0.31476   0.50108   0.73553   0.53595   0.56721
All Features vs Graph                       0.18783   0.24849   0.06149   0.10932   0.02725*  0.57462
Table 7: P-values obtained during the significance tests. The * indicates that the improvement obtained when using the Additive Groves algorithm with the entire set of features is statistically significant, for a confidence level of 95%.

Table 6 shows that the combination of all of the features yields the best results. The results also show that the combination of profile and graph features (i.e., the set without textual features) has the poorest results. This finding means that the presence of the query topics in the authors' publications, specifically in the titles and abstracts, is crucial for determining whether some authors are experts or not, and indeed the information that is provided by textual evidence can help in expertise retrieval.
In this paper, we also argue that rank aggregation methods, which are based on data fusion techniques, can provide significant advantages over the representative generative probabilistic models that have been proposed in the expert finding literature. We used existing state-of-the-art algorithms to build an expert finding framework that enables the combination of a large pool of expertise estimates.

Table 8 presents the results obtained on the DBLP dataset. The CombMNZ rank aggregation technique outperformed all of the other algorithms in terms of MAP, which shows that this rank aggregation method provides a better ranking than all of the other approaches. On the other hand, the Condorcet Fusion algorithm outperformed all of the other methods in almost all of the evaluation metrics that were tested. In fact, the Condorcet Fusion method achieved much better results for P@k than all of the other algorithms. Both MAP and NDCG are important metrics for the evaluation of system performance. However, P@k also plays an important role in the evaluation process, given that, in expert finding systems, when users search for experts on some topic, they are usually only interested in the top k experts that are retrieved.

Table 8 also shows that the Borda Fuse and the Reciprocal Rank Fuse algorithms had the same performance in our experiments. This finding is not surprising, because these positional algorithms are very similar. The only difference between them is that Borda Fuse directly uses the positions of the candidates, whereas Reciprocal Rank Fuse uses the reciprocal rank of those positions. In this experiment, the final ranked lists were the same; in other words, both methods returned the experts in the same order.

Regarding the worst results that were obtained, Table 8 shows that CombANZ was not as successful as the other algorithms. This finding can be explained by the fact that the CombANZ algorithm divides the CombSUM scores by the number of systems that contribute to the ranking of a candidate. Because the validation data were built such that all of the candidates had the query topics present in their publications, CombANZ de-emphasises those candidates that appear multiple times in the different systems.
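For reference, the following is a minimal sketch of the score-based and positional fusion rules discussed above, assuming that each input system provides already normalised scores for its candidates. Condorcet Fusion, which instead relies on pairwise majority votes between candidates, is omitted for brevity, and the example data at the bottom are hypothetical; this is an illustration of the definitions rather than our exact implementation.

```python
from collections import defaultdict

def comb_fusion(system_scores, variant="SUM"):
    """Score aggregation: CombSUM adds the (normalised) scores that a candidate
    receives across systems; CombMNZ multiplies that sum by the number of
    systems that retrieved the candidate; CombANZ divides the sum by it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in system_scores:             # one {candidate: score} dict per system
        for cand, score in scores.items():
            totals[cand] += score
            counts[cand] += 1
    fused = {}
    for cand, total in totals.items():
        if variant == "MNZ":
            fused[cand] = total * counts[cand]
        elif variant == "ANZ":
            fused[cand] = total / counts[cand]
        else:                                # "SUM"
            fused[cand] = total
    return sorted(fused, key=fused.get, reverse=True)

def positional_fusion(ranked_lists, reciprocal=False):
    """Positional aggregation: Borda Fuse scores a candidate by the number of
    candidates ranked below it in each list, whereas Reciprocal Rank Fuse
    scores it by 1 / position."""
    scores = defaultdict(float)
    for ranked in ranked_lists:              # one ordered candidate list per system
        n = len(ranked)
        for pos, cand in enumerate(ranked, start=1):
            scores[cand] += 1.0 / pos if reciprocal else n - pos
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example with two systems and three candidate experts.
systems = [{"a": 0.9, "b": 0.4}, {"a": 0.7, "b": 0.6, "c": 0.5}]
print(comb_fusion(systems, "MNZ"))
print(positional_fusion([["a", "b", "c"], ["b", "a", "c"]], reciprocal=True))
```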
Finally, Table 8 shows that all of the algorithms that were tested in this rank aggregation framework outperformed the baseline BM25 ranking function and the state-of-the-art approaches proposed by Balog et al. (2006), namely the candidate-based Model 1 and the document-based Model 2, for the same reasons that were noted in the learning to rank experiment. Table 9 shows the p-values for the statistical significance tests, again using a two-sided randomisation test.

Data Fusion Algorithms                  P@5      P@10     P@15     P@20     MAP      NDCG
CombSUM                                 0.400    0.408    0.415    0.450    0.413    0.739
CombMNZ                                 0.492    0.477    0.472    0.512
BM25 (baseline)                         0.492    0.431    0.385    0.339    0.326    0.709
Balog's Model 1 (Balog et al. 2006)     0.077    0.085    0.092    0.104    0.121    0.544
Balog's Model 2 (Balog et al. 2006)     0.123    0.092    0.103    0.092    0.121    0.542
Table 8: Results for the various data fusion algorithms that were tested.

Again, in a separate experiment, we measured the impact of the different types of ranking features on the quality of the results. Using the Condorcet Fusion algorithm, we measured the results that were obtained by ranking models that considered (i) only the textual similarity features, (ii) only the profile features, (iii) only the graph features, (iv) the textual similarity and profile features, (v) the textual similarity and graph features and (vi) the profile and graph features.

Algorithms                                            P@5       P@10      P@15      P@20      MAP       NDCG
Cond. Fusion vs CombSUM                               0.00187*  0.00584*  0.01272*  0.10114   0.30943   0.04779*
Cond. Fusion vs CombMNZ                               0.00997*  0.02306*  0.04293*  0.40498   0.62124   0.38286
Cond. Fusion vs CombANZ                               0.00094*  0.00102*  0.00621*  0.03273*  0.02431*  0.00024*
Cond. Fusion vs Borda Fuse                            0.00094*  0.00102*  0.00420*  0.00830*  0.20045   0.02883*
Cond. Fusion vs R. Rank Fuse                          0.00094*  0.00102*  0.00420*  0.00830*  0.20045   0.02883*
Cond. Fusion vs BM25                                  0.00181*  0.00096*  0.00084*  0.00064*  0.00038*  0.00024*
Cond. Fusion vs Balog's Model 1 (Balog et al. 2006)   0.00021*  0.00023*  0.00022*  0.00021*  0.00038*  0.00024*
Cond. Fusion vs Balog's Model 2 (Balog et al. 2006)   0.00021*  0.00023*  0.00022*  0.00021*  0.00038*  0.00024*
Table 9: P-values obtained during the significance tests. The * indicates that the improvement obtained using the Condorcet Fusion algorithm is statistically significant, for a confidence level of 95%.

Table 10 shows the obtained results, and Table 11 shows the p-values for the statistical significance tests, which were again performed by using a two-sided randomisation test.
Sets of Features                     P@5      P@10     P@15     P@20     MAP      NDCG
Text Similarity + Profile + Graph
Text Similarity + Profile            0.400    0.415    0.400    0.373    0.327    0.704
Text Similarity + Graph              0.569    0.523    0.477    0.450    0.391    0.762
Profile + Graph                      0.631    0.539    0.482    0.462    0.417    0.774
Text Similarity                      0.350    0.333    0.344    0.313    0.298    0.668
Profile                              0.462    0.431    0.415    0.419    0.369    0.724
Graph                                0.646    0.577    0.544
Table 10: The results obtained with different sets of features.

As observed, the set with the combination of all of the features had the best results. Because DBLP has rich information about citation links, the set of graph features alone also achieved very competitive results. The results also show that, individually, the textual similarity features have the poorest results. This finding means that considering only the textual evidence provided by the query topics, together with the articles' titles and abstracts, might not be sufficient to determine whether some authors are experts or not, and that the information provided by citation patterns can indeed help in expertise retrieval through a rank aggregation framework.
Sets of Features                            P@5       P@10      P@15      P@20      MAP       NDCG
All Features vs Text Similarity + Profile   0.00047*  0.00198*  0.00052*  0.00042*  0.00038*  0.00024*
All Features vs Text Similarity + Graph     0.01573*  0.03341*  0.02886*  0.02930*  0.00410*  0.00803*
All Features vs Profile + Graph             0.43723   0.02689*  0.02374*  0.04621*  0.07792   0.03597*
All Features vs Text Similarity             0.00091*  0.00044*  0.00048*  0.00042*  0.00062*  0.00038*
All Features vs Profile                     0.03319*  0.00584*  0.00530*  0.00579*  0.01095*  0.00680*
All Features vs Graph                       0.36671   0.22709   0.35898   0.99413   0.62449   0.41703
Table 11: P-values obtained during the significance tests. The * indicates that the improvement obtained when using the Condorcet Fusion algorithm with the entire set of features is statistically significant, for a confidence level of 95%.
Conclusions and Future Work
The tests performed in this paper indicate that learning to rank approaches achieve an overall good performance in the task of expert finding within digital libraries of academic publications.

The various learning algorithms that were tested achieved significantly different results from one another, which leads us to conclude that some of the algorithms are more suitable for this task than others. In addition, we experimentally demonstrated that the Additive Groves pointwise approach and the SVMmap listwise approach outperformed the other algorithms. These results were quite interesting, because pointwise approaches do not consider the order of the experts and, therefore, worse results were expected for the Additive Groves approach. However, one must consider that the Additive Groves algorithm is very robust, in the sense that, at each iteration, it trains a new and more accurate regression tree for the experts that were misclassified in the previous iteration and then merges all of the learned trees into a single model. In addition, this algorithm was amongst the top 5 best performing algorithms in the Yahoo! Learning to Rank Competition (Chapelle & Chang 2011). This finding implies that pointwise approaches can also be very effective, performing similarly to, or even better than, listwise approaches.

Regarding our rank aggregation framework, the results showed that rank aggregation approaches also provide reasonable results for the task of expert finding in digital libraries, because they outperformed some of the state-of-the-art approaches in terms of MAP. In our experiments, the CombMNZ and Condorcet Fusion algorithms achieved the best results.

The effectiveness of the learning to rank and rank aggregation frameworks depends directly on the quality of the features that they use. In this work, both frameworks achieved very good results, always outperforming the state-of-the-art approaches. We can argue that the proposed features provide accurate information and discriminate the expertise levels of the candidates, enabling the retrieval of a reliable and accurate ranked list of experts. When comparing all of the different sets of features, we concluded that a combination of all of the features (textual, profile and graph) is required to achieve the best results in both experiments.

For future work, it would be very interesting to apply the algorithms that were tested in this work to the TREC enterprise track dataset, which is very commonly used in empirical studies. For example, in the learning to rank approach for expert finding that is proposed by Macdonald & Ounis (2011), the best results were achieved by using the AdaRank listwise algorithm, making their approach one of the top contributions for the expert finding task in enterprises. However, the experiments that were performed in the scope of this article showed that AdaRank performed poorly. We are very curious to know how the Additive Groves algorithm would perform on such a task.

It would also be interesting to extend the features that were proposed in this work to incorporate the various expert finding models that have been proposed in the literature. For example, we could build a supervised learning to rank system, still using the Additive Groves algorithm, but with a new set of features based on other studies in the literature.
Forexample, we could use Balog et al.’s (2006) candidate-based model scores (Model 1), Balog etal.’s (2006) document-based model scores (Model 2), and Deng et al.’s (2011) query-sensitiveAuthorRank scores. Given that each of these models alone already provided a good method forranking experts, the combination of all of them through a supervised machine-learning approachcould lead to even more accurate and reliable ranked lists.
Acknowledgements

This work was supported by Fundação para a Ciência e Tecnologia (FCT), through the INESC-ID multi-annual funding under project PEst-OE/EEI/LA0021/2013 and through the FCT project SMARTIS (ref. PTDC/EIA-EIA/115346/2009).
References
Adali, S., Magdon-Ismail, M. & Marshall, B. (2007), A classification algorithm for finding the optimal rank aggregation method, in 'Proceedings of the 22nd International Symposium on Computer and Information Sciences'.
Balog, K., Azzopardi, L. & de Rijke, M. (2006), Formal models for expert finding in enterprise corpora, in 'Proceedings of the 29th Annual International ACM Conference on Research and Development in Information Retrieval'.
Balog, K., Azzopardi, L. & de Rijke, M. (2009), 'A language modeling framework for expert finding', Information Processing and Management, 1–19.
Balog, K., Fang, Y., de Rijke, M., Serdyukov, P. & Si, L. (2012), 'Expertise retrieval', Foundations and Trends in Information Retrieval, 127–256.
Batista, P. D., Campiteli, M. G. & Kinouchi, O. (2006), 'Is it possible to compare researchers with different scientific interests?', Scientometrics, 179–189.
Bozkurt, I. N., Gurkok, H. & Ayaz, E. S. (2007), Data fusion and bias, Technical report, Bilkent University.
Breiman, L. (1996), 'Bagging predictors', Machine Learning, 123–140.
Brin, S., Page, L., Motwani, R. & Winograd, T. (1999), The PageRank citation ranking: Bringing order to the web, Technical Report 1999-66, Stanford Digital Library Technologies Project.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N. & Hullender, G. (2005), Learning to rank using gradient descent, in 'Proceedings of the 22nd International Conference on Machine Learning'.
Cao, Y., Liu, J., Bao, S. & Li, H. (2006), Research on expert search at enterprise track of TREC 2005, in 'Proceedings of the 14th Text REtrieval Conference'.
Chapelle, O. & Chang, Y. (2011), 'Yahoo! Learning to Rank Challenge overview', Machine Learning Research, 1–24.
Chen, P.-J., Xie, H., Maslov, S. & Redner, S. (2007), 'Finding scientific gems with Google's PageRank algorithm', Informetrics, 8–15.
Craswell, N., Hawking, D., Vercoustre, A.-M. & Wilkins, P. (2001), P@noptic expert: Searching for experts not just for documents, in 'Proceedings of the 7th Australian World Wide Web Conference (poster papers)'.
de Borda, J.-C. (1781), Mémoire sur les Élections au Scrutin, Histoire de l'Académie Royale des Sciences.
Deng, H., King, I. & Lyu, M. R. (2008), Formal models for expert finding on DBLP bibliography data, in 'Proceedings of the 8th IEEE International Conference on Data Mining'.
Deng, H., King, I. & Lyu, M. R. (2011), 'Enhanced models for expertise retrieval using community-aware strategies', IEEE Transactions on Systems, Man, and Cybernetics, 1–14.
Dwork, C., Kumar, R., Naor, M. & Sivakumar, D. (2001), Rank aggregation revisited, in 'Proceedings of the 10th World Wide Web Conference Series'.
Egghe, L. (2006), 'Theory and practice of the g-index', Scientometrics, 131–152.
Ertekin, S. & Rudin, C. (2011), 'On equivalence relationships between classification and ranking algorithms', Machine Learning Research, 2905–2929.
Fang, H. & Zhai, C. (2007), Probabilistic models for expert finding, in 'Proceedings of the 29th European Conference on Information Retrieval Research'.
Fox, E. & Shaw, J. A. (1994), Combination of multiple searches, in 'Proceedings of the 2nd Text REtrieval Conference'.
Freund, Y., Iyer, R., Schapire, R. E. & Singer, Y. (2003), 'An efficient boosting algorithm for combining preferences', Machine Learning Research, 933–969.
Haykin, S. (2008), Neural Networks and Learning Machines, Pearson Education.
Hirsch, J. E. (2005), An index to quantify an individual's scientific research output, in 'Proceedings of the National Academy of Sciences USA'.
Hsu, C.-W., Chang, C.-C. & Lin, C.-J. (2010), A practical guide to support vector classification, Technical report, National Taiwan University.
Ji, M., Han, J. & Danilevsky, M. (2011), Ranking-based classification of heterogeneous information networks, in 'Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining'.
Joachims, T. (2006), Training linear SVMs in linear time, in 'Proceedings of the 12th ACM Conference on Knowledge Discovery and Data Mining'.
Liu, T.-Y. (2009), 'Learning to rank for information retrieval', Foundations and Trends in Information Retrieval, 225–331.
Liu, X., Bollen, J., Nelson, M. L. & de Sompel, H. V. (2005), 'Co-authorship networks in the digital library research community', Information Processing and Management, 1462–1480.
Macdonald, C. & Ounis, I. (2008), 'Voting techniques for expert search', Knowledge and Information Systems, 259–280.
Macdonald, C. & Ounis, I. (2011), Learning models for ranking aggregates, in 'Proceedings of the 33rd European Conference on Information Retrieval'.
Manning, C. D. (2008), Introduction to Information Retrieval, Cambridge University Press.
Metzler, D. & Croft, W. B. (2007), 'Linear feature-based models for information retrieval', Information Retrieval, 1–23.
Montague, M. H. & Aslam, J. A. (2002), Condorcet fusion for improved retrieval, in 'Proceedings of the 11th International Conference on Information and Knowledge Management'.
Moreira, C., Calado, P. & Martins, B. (2011), Learning to rank for expert search in digital libraries of academic publications, in 'Proceedings of the 15th Portuguese Conference on Artificial Intelligence'.
Petkova, D. & Croft, B. (2006), Hierarchical language models for expert finding in enterprise corpora, in 'Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence'.
Petkova, D. & Croft, B. (2007), Proximity-based document representation for named entity retrieval, in 'Proceedings of the 16th ACM Conference on Information and Knowledge Management'.
Pfahringer, B. (2011), Semi-random model tree ensembles: An effective and scalable regression method, in 'Proceedings of the 24th Australasian Joint Conference in Advances in Artificial Intelligence'.
Qin, T., Liu, T.-Y., Zhang, X.-D., Wang, D.-S., Xiong, W.-Y. & Li, H. (2008), Learning to rank relational objects and its application to web search, in 'Proceedings of the 17th International Conference on World Wide Web'.
Riker, W. H. (1988), Liberalism Against Populism: A Confrontation Between the Theory of Democracy and the Theory of Social Choice, Waveland Press.
Serdyukov, P. (2009), Search for Expertise: Going Beyond Direct Evidence, PhD thesis, University of Twente.
Serdyukov, P. & Hiemstra, D. (2008), Modeling documents as mixtures of persons for expert finding, in 'Proceedings of the 30th European Conference on Advances in Information Retrieval'.
Sidiropoulos, A. & Manolopoulos, Y. (2005), 'A citation-based system to assist prize awarding', Journal of the ACM Special Interest Group on Management of Data Record, 54–60.
Sidiropoulos, A. & Manolopoulos, Y. (2006), 'Generalized comparison of graph-based ranking algorithms for publications and authors', Journal of Systems and Software, 1679–1700.
Sidiropoulos, A., Katsaros, D. & Manolopoulos, Y. (2007), 'Generalized h-index for disclosing latent facts in citation networks', Scientometrics, 253–280.
Smucker, M. D., Allan, J. & Carterette, B. (2007), A comparison of statistical significance tests for information retrieval evaluation, in 'Proceedings of the 16th ACM Conference on Information and Knowledge Management'.
Sorokina, D., Caruana, R. & Riedewald, M. (2007), Additive groves of regression trees, in 'Proceedings of the 18th European Conference on Machine Learning'.
Tsochantaridis, I., Joachims, T., Hofmann, T. & Altun, Y. (2005), 'Large margin methods for structured and interdependent output variables', Machine Learning Research, 1453–1484.
Voorhees, E. (1999), The TREC-8 question answering track report, in 'Proceedings of the 8th Text REtrieval Conference'.
Xu, J. & Li, H. (2007), AdaRank: A boosting algorithm for information retrieval, in 'Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval'.
Xu, J., Liu, T.-Y., Lu, M., Li, H. & Ma, W.-Y. (2008), Directly optimizing evaluation measures in learning to rank, in 'Proceedings of the 31st Annual International ACM Conference on Research and Development in Information Retrieval'.
Yang, Z., Tang, J., Wang, B., Guo, J., Li, J. & Chen, S. (2009), Expert2Bole: From expert finding to bole search, in 'Proceedings of the 15th ACM Conference on Knowledge Discovery and Data Mining'.
Yue, Y., Finley, T., Radlinski, F. & Joachims, T. (2007), A support vector method for optimizing average precision, in 'Proceedings of the 30th Annual International ACM Conference on Research and Development in Information Retrieval'.
Zhang, C.-T. (2009), 'The e-index, complementing the h-index for excess citations', Public Library of Science One, 5.
Zhu, J., Song, D. & Rüger, S. (2007), The Open University at TREC 2006 enterprise track expert search task, in 'Proceedings of the 15th Text REtrieval Conference'.
Zhu, J., Song, D., Rüger, S. & Huang, J. (2008), Modeling document features for expert finding, in 'Proceedings of the 17th ACM Conference on Information and Knowledge Management'.