Associations between author-level metrics in subsequent time periods


Ana C. M. Brito, Filipi N. Silva and Diego R. Amancio
Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
Indiana University Network Science Institute, Bloomington, Indiana 47408, USA
(Dated: November 26, 2020)

Abstract

Understanding the dynamics of authors is relevant to predict and quantify performance in science. While the relationship between recent and future citation counts is well-known, many relationships between scholarly metrics at the author level remain unknown. In this context, we performed an analysis of author-level metrics extracted from subsequent periods, focusing on visibility, productivity and interdisciplinarity. First, we investigated how metrics controlled by the authors (such as references diversity and productivity) affect their visibility and citation diversity. We also explored the relation between authors' interdisciplinarity and citation counts. The analysis of a subset of Physics papers revealed that there is no strong correlation between authors' productivity and future visibility for most of the authors. A higher fraction of strong positive correlations, though, was found for those with a lower number of publications. We also found that reference diversity computed at the author level may positively impact authors' future visibility. The analysis of metrics impacting future interdisciplinarity suggests that productivity may play a role only for low-productivity authors. We also found a surprisingly strong positive correlation between references diversity and interdisciplinarity, suggesting that an increase in diverse citing behavior may be related to a future increase in authors' interdisciplinarity. Finally, interdisciplinarity and visibility were found to be moderately positively associated: significant positive correlations were observed for 30% of authors with lower productivity.

arXiv [cs.DL]

I. INTRODUCTION

The age of information promoted several new discoveries in science, with many of them emerging from interdisciplinary endeavors [4, 10, 19].
At the same time, these communities are growing in size and productivity, resulting in an ever-increasing deluge of digital information available in the form of published articles [15], datasets, and algorithms, as well as across many platforms, such as cloud services and social media. However, the increase in digital resources has not leveled the playing field for researchers; inequality is rising in science [26].

Understanding the mechanisms leading to inequality in science can help policymakers and funding agencies better distribute research resources while also promoting a more just and democratic environment. Part of this problem lies in the fact that researchers compete among themselves for limited funding and attention. In such a system, an increase in researchers' visibility leads to better funding opportunities which, in turn, leads to more availability of resources for their institutions, thus allowing those researchers to attain even greater visibility.

The cycle in which researchers with the most resources are rewarded with even more resources over time is a source of inequality known as the Matthew Effect [12]. This is one of the reasons why understanding how the dynamics of authors' visibility unfold over time is one of the most important problems in the field of Science of Science [8, 22]. However, not much attention has been given to understanding the relationships between author metrics other than citations [2, 7, 18, 21]. In particular, the literature lacks studies on metrics controlled by the authors, such as those based on their choices of references and their productivity, and on the possible effects they may have on their received citations.

Here, we propose to explore the associations among different bibliometric measures for authors in subsequent time periods through correlation. Among the metrics we consider are interdisciplinarity, which is measured in terms of the subject diversity [19] of citations received by the authors, productivity and visibility of authors.
Here, we are interested in addressing three main questions:

1. How do metrics controlled by the authors (namely, their productivity and diversity in the choice of references) correlate with visibility metrics, such as the received number of citations per paper, in a subsequent time period?
2. How do these characteristics correlate with the future interdisciplinarity of their publications based on citations?
3. How is interdisciplinarity related to future citations and vice versa?

In addition to productivity and visibility, we also studied interdisciplinarity as it plays an important role in modern science given the increasing number of authors bridging different fields. Here, it is used as a descriptor for citation diversity, as adopted in related works [19]. In a similar fashion, we adopt a reference diversity for authors based on the fields of the references employed in their publications.

We employ the American Physical Society (APS) dataset, which incorporates all the citations and metadata for papers, mainly in Physics, published in any of the APS journals up to 2010. More specifically, we employ the dataset used in [20], which was supplemented with disambiguated authors from the Microsoft Academic Graph (MAG). First, we construct a co-occurrence network for the categories existing in the APS journals (PACS codes), which is used to define a metric of interdisciplinarity for authors in terms of the diversity of their received citations or the references they used. Next, we calculate the correlation between the considered author-level metrics for a window considering previous publications and another which considers subsequent publications and citations. Finally, we use a statistical framework based on null models to obtain the significance of the correlations between the considered metrics.

Several interesting results have been obtained in our analysis. We found that the diversity of references may positively impact the observed future visibility for 1/3 of low-productivity authors. This effect is weaker for more productive authors, yet the fraction of authors that were positively affected varied between 22% and 25%. A weaker association between productivity and citation counts was found: the highest fraction of authors with a significant positive correlation was 21%. When comparing the fraction of authors displaying significant positive and negative correlations, both productivity and reference diversity turned out to be more positively than negatively correlated with authors' visibility. Surprisingly, we also found that reference diversity and future interdisciplinarity are strongly positively correlated for roughly 50% of authors. Finally, the association between interdisciplinarity and visibility revealed that an increase in interdisciplinarity is more likely to be linked to an increase in visibility for low-productivity authors. Such positive significant correlations were observed in roughly 30% of authors in that class. We believe our results can provide further insights into better understanding researchers' career dynamics.

II. RELATED WORKS

In this paper, among other relationships, we analyze which factors affect the visibility of authors (measured in terms of citations). At the paper level, some correlations between paper features and the number of citations have been studied in the last few years. An important factor that has been found to affect the visibility of papers is related to the interdisciplinarity of the venues in which they are disseminated. Different aspects of scientific pieces have been used to define interdisciplinarity indexes. In [19], journal citation networks are used to quantify how interdisciplinary a journal is. For a given journal, the diversity of citations from different areas is used to gauge interdisciplinarity. Such a diversity is computed using the concept of true diversity, a measure widely used to express how diverse a set of elements from different classes is [5, 23, 25]. Subject areas and citation data were extracted from the Journal Citation Reports dataset. One interesting conclusion was the positive correlation between the proposed interdisciplinarity index and journals' impact factor. In other words, interdisciplinary journals tend to have a higher impact factor than specialized journals.

Using a different approach, the study conducted in [4] also quantified journals' interdisciplinarity. The authors used Scopus data comprising Information and Communication Technology publications. The relationship between scholars and journals was represented via bipartite graphs. After an SVD dimension reduction, a spectral co-clustering method was used to identify communities of scholars and journals. The diversity (i.e. the interdisciplinarity) of a journal was then defined by analyzing the unevenness of the authors' distribution over the obtained network communities. Such a dispersion was computed via Shannon entropy, Simpson diversity, and the Rao-Stirling index [11]. High values of disparity metrics were found to occur in journals appearing between communities. Conversely, low diversity was observed mostly in network community cores.

A correlation between interdisciplinarity and citation impact was investigated in [27]. Three aspects of interdisciplinarity were investigated at the paper level: variety, balance, and disparity. Variety is the total number of different disciplines (or Web of Science categories) cited by the paper, while balance corresponds to the evenness of the disciplines' distribution, computed via Shannon diversity. Disparity measures how different the disciplines in the reference set are. The authors analyzed the impact of papers using the Normalized Citation Score (NCS). The dataset comprised papers from the Science Citation Index-Expanded (2005). A regression analysis revealed that variety was positively associated with NCS. In contrast, both balance and disparity were negatively associated with NCS.

The impact of citing interdisciplinary papers on papers' visibility was investigated in [10]. The authors characterized interdisciplinarity at the paper level by using papers' references. Subdisciplines were defined by the UCSD map of science [3]. According to this map, the similarity between journals is based on the number of shared references (via bibliographic coupling) and keywords. An average-linkage clustering strategy generates a cluster of 13 different categories and the pairwise cluster distance is represented in a 3D Fruchterman-Reingold layout. An analysis of 25,000 documents showed that papers citing interdisciplinary sub-disciplines tend to receive more citations than papers with fewer references to interdisciplinary sub-disciplines. This study also grouped sub-disciplines by distance in the UCSD map and demonstrated that papers citing distant sub-disciplines tend to have higher relative citation rates than papers citing similar sub-disciplines.

At the author level, the study carried out in [17] investigated the effects of interdisciplinarity on scientists' careers. The APS dataset was used, considering papers published between 1980 and 2009. The hierarchical system of subdiscipline classification, referred to as the Physics and Astronomy Classification Scheme (PACS), was used to measure the interdisciplinarity of an author. They proposed an index combining the total number of PACS codes used during the entire author career and the average number of different classes appearing simultaneously in the author's papers. Using this value, authors were grouped by different levels of interdisciplinarity: low, medium, and high. Based on these groups, it was observed that higher interdisciplinarity positively affects productivity. A statistical model was proposed to reproduce the original data. The factors considered in the model were the proposed interdisciplinarity index, the number of publications in each class, the number of citations, talent, reputation, and luck. The model reproducing the properties of the studied system revealed that authors with medium-high talent are the most successful ones. In addition, luck turned out to play an important role in career success. Surprisingly, it was found to be even more relevant than interdisciplinarity factors in some cases.

Another factor concerns the well-known rich-get-richer paradigm. In other words, an author who has received a high citation rate in the past has a higher tendency of receiving more citations. In [20], the authors describe a model for reproducing the distribution of authors' citations in the APS dataset. Unlike other models, they included a recency factor so that more recent citation data receives a higher weight in the preferential attachment model. This model showed that the rich-get-richer paradigm describes the citation distribution for authors publishing in APS journals. Most importantly, they also found that recency plays an important role in defining how broad the burstiness of citations is. The number of citations received by authors is strongly dependent on the total number of citations received in the last 1-2 years [20].

III. METHODOLOGY

The methodology adopted in this paper can be divided into the following steps:

1. Creation of PACS networks: this phase is responsible for establishing and identifying the subfields inside the considered dataset. Strongly connected subareas are grouped into network communities. Each community is used to identify an area, which in turn is used to define some of the variables of interest. The dataset used to create the networks is described in Section III A. The process of creating and identifying communities of co-occurring PACS is described in Section III B.
2. Definition of diversity indexes: here we use diversity indexes to quantify how diversely authors cite or are cited by other papers. The diversity takes as reference the subareas (communities) identified in the PACS network. The adopted diversity index is defined in Section III C. Diversity indexes are among the author-level metrics of interest in this paper.
3. Quantifying the relationship between variables of interest: here we quantify whether there are correlations between variables of interest measured in subsequent time intervals. The methodology adopted to quantify the fraction of authors displaying significant positive/negative correlations between the variables of interest is described in Section III D.

A. Dataset

The dataset consists of papers published by the American Physical Society (APS) journals between 1991 and 2010. The dataset comprises 299,930 publications from APS journals. While the dataset provides several article metadata, we used for each paper the list of authors and the reference list. We also used the list of subfield codes provided by the authors and selected from the Physics and Astronomy Classification Scheme (PACS). This classification scheme is a hierarchical code system used to organize the main fields and subfields in Physics journals.

When addressing any issue at the author level, one should be aware that ambiguities and name splits may arise [1, 13]. To address this problem, we used the Microsoft Academic Graph (MAG) dataset, which is a more extensive set of publications with authors' names disambiguated [20]. We mapped the APS dataset into the MAG database by matching DOI values.

B. PACS Networks

In this work, we use the notion of subfields to compute the degree of interdisciplinarity inside the Physics area (for APS journals). Subfields were derived from PACS co-occurrence networks [16]. Each publication in the APS dataset has its PACS codes, which are chosen by the authors from a list of possible codes. We used this information to generate networks where nodes are PACS codes. Figure 1 shows an example of a PACS co-occurrence network extracted from a set of papers. As suggested by other works, PACS codes were analyzed at the first two levels [16]. Two codes are linked whenever they appear together in one or more papers. Here we take the view that a subfield in the considered subset of Physics papers can be seen as a subset of highly connected codes. In this way, each subfield is defined as a community in the respective PACS co-occurrence network. While our results are based on the Louvain community detection algorithm [24], a preliminary analysis revealed that there is no large difference when other methods are used to detect communities.

Considering the most recent years of the dataset, using the Louvain method, we found 10 network communities. An analysis of the obtained communities considering data from the last 5 years showed that the four largest communities are mainly composed of papers in the following subjects: (i) magnetic properties and materials; (ii) quantum mechanics, field theories, and special relativity; (iii) structure of solids and liquids; crystallography; and (iv) statistical physics, thermodynamics, and nonlinear dynamical systems.

FIG. 1. Schematic representation of the components needed to calculate citations and references diversity.
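The construction of the co-occurrence network described in this section can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the helper names (`truncate_pacs`, `cooccurrence_network`) and the toy PACS lists are hypothetical, keeping "the first two levels" is assumed here to mean keeping the two leading dot-separated fields of a code, and the edge weights (number of papers listing both codes) are an addition that the text does not require. Community detection (e.g. Louvain) would then be run on the resulting edges and is not shown.

```python
from collections import Counter
from itertools import combinations


def truncate_pacs(code: str) -> str:
    """Keep the two leading dot-separated fields (assumed 'first two levels')."""
    return ".".join(code.split(".")[:2])


def cooccurrence_network(papers):
    """Return edges {(code_a, code_b): number of papers listing both codes}."""
    edges = Counter()
    for pacs in papers:
        # Deduplicate after truncation so a paper contributes at most once per pair.
        codes = sorted({truncate_pacs(c) for c in pacs})
        for a, b in combinations(codes, 2):
            edges[(a, b)] += 1
    return edges


# Toy input: each inner list holds the PACS codes of one hypothetical paper.
papers = [
    ["05.45.Xt", "89.75.Hc"],
    ["05.45.-a", "89.75.Fb", "02.50.Ey"],
    ["75.10.Jm", "75.30.Ds"],
]
edges = cooccurrence_network(papers)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Two codes are linked as soon as their weight is at least one, which matches the linking rule stated in the text.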

C. Diversity indexes

Here we employ a diversity index for authors based on the diversity of the fields that cite their papers (citations diversity) or that are referenced by their papers (references diversity). Because citation diversity is usually related to interdisciplinarity [19], we use both terms to describe the same concept. To assign a distribution of fields to a given author $A$, we first look at all the papers $P_i^{(A)}$ citing publications co-authored by $A$ during the considered time window. For each citing paper, we obtain the communities associated with the PACS codes listed in the paper. Figure 1 illustrates the components employed to calculate the in-diversity index for authors. Next, we derive the weights $w(P_i, C_j)$ relating a paper $P_i$ to a PACS community $C_j$, defined as the fraction of the PACS codes listed in $P_i$ that belong to $C_j$, i.e.

$$w(P_i, C_j) = \frac{|\mathrm{PACS}(P_i) \cap C_j|}{|\mathrm{PACS}(P_i)|}, \qquad (1)$$

where $\mathrm{PACS}(P_i)$ is the set of PACS codes listed in paper $P_i$. Next, we assign a weight $\bar{w}_{\mathrm{cit}}(A, C_j)$ relating an author $A$ to each PACS community $C_j$ based on the citing papers. Each citation to a paper from author $A$ counts as a unit that is distributed among the communities, so that $\bar{w}_{\mathrm{cit}}(A, C_j)$ is defined as

$$\bar{w}_{\mathrm{cit}}(A, C_j) = \sum_{P_i} n_{\mathrm{cit}}(P_i, A)\, w(P_i, C_j), \qquad (2)$$

where $n_{\mathrm{cit}}(P_i, A)$ is the number of citations from $P_i$ to author $A$. Finally, we normalize $\bar{w}_{\mathrm{cit}}(A, C_j)$ across all the received citations, thus obtaining a probability-like measure $p_{\mathrm{cit}}(A, C_j)$ of relatedness between an author $A$ and a community $C_j$, given by

$$p_{\mathrm{cit}}(A, C_j) = \frac{\bar{w}_{\mathrm{cit}}(A, C_j)}{\sum_{C_k} \bar{w}_{\mathrm{cit}}(A, C_k)}. \qquad (3)$$

The citation diversity index $\mathrm{citDiv}(A)$ is then defined as the exponential of the entropy of $p_{\mathrm{cit}}(A, C_j)$ [19], i.e.,

$$\mathrm{citDiv}(A) = \exp\Big[-\sum_{C_j} p_{\mathrm{cit}}(A, C_j) \log p_{\mathrm{cit}}(A, C_j)\Big]. \qquad (4)$$

Similarly, to obtain the references diversity index, we use the papers $P_i$ referenced by works authored by $A$ instead of the received citations.
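As an illustration of Eqs. (1)-(4), the citation diversity of one author could be computed along the following lines. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the `community_of` mapping (each PACS code assigned to exactly one community) and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def paper_weights(pacs, community_of):
    """Eq. (1): fraction of a paper's PACS codes that fall in each community."""
    codes = set(pacs)
    weights = defaultdict(float)
    for code in codes:
        weights[community_of[code]] += 1.0 / len(codes)
    return weights


def citation_diversity(citing_papers, community_of):
    """Eqs. (2)-(4). citing_papers is a list of (pacs_codes, n_citations) pairs."""
    w_bar = defaultdict(float)
    for pacs, n_cit in citing_papers:
        for community, w in paper_weights(pacs, community_of).items():
            w_bar[community] += n_cit * w                  # Eq. (2)
    total = sum(w_bar.values())
    probs = [value / total for value in w_bar.values()]   # Eq. (3)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)                               # Eq. (4)


# Hypothetical community assignment and citing papers for one author.
community_of = {"05.45": "C1", "89.75": "C1", "75.10": "C2"}
citing = [(["05.45", "75.10"], 1), (["89.75", "75.10"], 1)]
div_two_fields = citation_diversity(citing, community_of)
div_one_field = citation_diversity([(["05.45"], 3)], community_of)
print(div_two_fields, div_one_field)
```

In this toy case the citations are split evenly over two communities, so the index is close to 2, while an author cited from a single community gets an index of 1. The references diversity follows by passing the referenced papers and the number of times each was cited by the author.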
For references, the weight linking an author and a PACS community is thus defined as

$$\bar{w}_{\mathrm{ref}}(A, C_j) = \sum_{P_i} n_{\mathrm{ref}}(A, P_i)\, w(P_i, C_j), \qquad (5)$$

where $n_{\mathrm{ref}}(A, P_i)$ is the number of times author $A$ cited the paper $P_i$. The probability analogous to $p_{\mathrm{cit}}$ (i.e. $p_{\mathrm{ref}}$) is then normalized as

$$p_{\mathrm{ref}}(A, C_j) = \frac{\bar{w}_{\mathrm{ref}}(A, C_j)}{\sum_{C_k} \bar{w}_{\mathrm{ref}}(A, C_k)}, \qquad (6)$$

and the references diversity $\mathrm{refDiv}(A)$ is calculated as

$$\mathrm{refDiv}(A) = \exp\Big[-\sum_{C_j} p_{\mathrm{ref}}(A, C_j) \log p_{\mathrm{ref}}(A, C_j)\Big]. \qquad (7)$$

Both equations 3 and 6 have been used to measure diversity in many contexts [5, 6, 19]. Because the computation of $p_{\mathrm{cit}}$ and $p_{\mathrm{ref}}$ is not reliable when only a few data points are available, these quantities were computed only for authors with more than ten references and citations in the dataset.

D. Past and Future scholarly time series

We propose a framework to analyze how a scholarly metric or diversity at a certain point in time for an author $A$ may impact their future metrics. First, we define two moving windows, one for the past and another for the future: a 5-year window before the time under consideration $t$, and a 3-year window after $t$, as illustrated in Figure 2a. For each window, we calculate the scholarly metrics of $A$. In particular, for the Past window, we calculate the number of papers, citations received per paper, and references diversity, only considering publications in the period.
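The two moving windows can be sketched as follows for the simplest Past metric, the number of papers. The sketch is illustrative only: the function names and the publication years are hypothetical, and whether the year $t$ itself belongs to either window is not stated in the text, so it is excluded from both here as an assumption.

```python
def past_window(t, length=5):
    """Years of the Past window, assumed here to be t-5, ..., t-1."""
    return range(t - length, t)


def future_window(t, length=3):
    """Years of the Future window, assumed here to be t+1, ..., t+3."""
    return range(t + 1, t + 1 + length)


def papers_in_past(pub_years, t):
    """Number of papers an author published during the Past window of year t."""
    window = past_window(t)
    return sum(1 for year in pub_years if year in window)


# Hypothetical publication years of a single author.
pub_years = [1991, 1994, 1994, 1997, 2001]
# Sliding t yields one point of the Past time series per year.
series = {t: papers_in_past(pub_years, t) for t in range(1996, 2000)}
print(series)
```

The other Past and Future metrics would be evaluated on the same windows, producing one time series per metric and author.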

FIG. 2. Schematic representation of the proposed methodology. (a) Given two subsequent windows (past and future) that move over time, we calculate the time series of the considered metrics. (b) For each time series we derive a null model based on shuffling it along time. (c) We draw the correlation distribution (gray) obtained from the data time series and highlight negative (blue) and positive (yellow) values that are significant in comparison to the null models. The average null model distribution is also shown for comparison in red.
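The shuffling-based significance test summarized in panels (b) and (c) of Figure 2 can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names, the default number of shuffled realizations (`n_null=1000`), the fixed seed, and the choice to shuffle only the past series are assumptions, and the p-value is taken as the fraction of shuffled series whose absolute Pearson correlation exceeds the observed one.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def shuffle_p_value(past, future, n_null=1000, seed=0):
    """Return (observed correlation, empirical p-value from time shuffling)."""
    rng = random.Random(seed)
    observed = pearson(past, future)
    exceed = 0
    for _ in range(n_null):
        shuffled = list(past)
        rng.shuffle(shuffled)  # null model: destroy the temporal ordering
        if abs(pearson(shuffled, future)) > abs(observed):
            exceed += 1
    return observed, exceed / n_null


# Hypothetical past/future metric series of one author.
past = list(range(1, 11))
future = [2 * value + 1 for value in past]
observed, p_value = shuffle_p_value(past, future)
print(observed, p_value)
```

An author would then be counted as displaying a significant positive (negative) correlation when the p-value is below the chosen threshold and the observed correlation is positive (negative).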

For the Future window, we calculate the number of citations received in that window by papers published by $A$ during the Past window. In the same fashion, we calculate citation diversity by considering only publications in the Past window and citations in the Future window. By moving the windows along $t$ for a period from 1995 to 2010, we obtain Past (number of papers, citations per paper, and references diversity) and Future (citations received per paper and citation diversity) time series for each author based on the calculated scholarly metrics.

In order to draw relationships between the scholarly metrics from past and future windows, we adopted the Pearson correlation. However, as these metrics may have characteristics that can lead to spurious correlations, such as the presence of outliers or long-tail distributions, we employed a statistical approach based on null models to measure the significance of the obtained correlations. First, for each time series of each author, we obtain a set of 10[…] $p$-value associated with each author and a pair of past and future metrics. The $p$-value is defined as the probability of the null model resulting in an absolute correlation that is higher than what was found for the data. Finally, the results are presented in the form of a correlation distribution alongside the percentage of negative and positive significant relationships by considering a threshold of $5 \times 10^{-2}$ for the $p$-values. This is illustrated in Figure 2c.

IV. RESULTS AND DISCUSSION

Here we analyze the relationship between relevant author-level metrics. More specifically, we analyze whether the diversity of references, the number of papers, and the number of references are correlated with citation counts and citation diversity. We first focus on the relationship between variables that authors can control in the first 5-year window (e.g. the number and diversity of references) and variables that are not directly self-dependent (such as the number of citations and citation diversity) and are measured in the following 3-year window. The correlations between paper/reference features and citation counts are discussed in Section IV A. The correlations between paper/reference features and citation diversity are discussed in Section IV B. Because interesting relationships between interdisciplinarity (i.e. citation diversity) and citation counts have been reported at different levels [4, 14, 19], we also analyzed the correlations between interdisciplinarity and citations at the author level. This is reported in Section IV C.

A. Correlations between reference features and citations

The simplest reference feature that can be used in our analysis is the total number of references. For the sake of clarity, we will instead use the number of papers (i.e. the authors' productivity) in our analysis because the total number of references is strongly related to the number of papers. In addition, the results using either the number of references or the number of papers are very similar.

We start our analysis by checking whether productivity (i.e. the number of published papers) is correlated with the total number of citations per paper. This result is shown in Figure 3. As mentioned in the methodology (see Section III D), the histograms show the distribution of authors over different degrees of correlation between the variables of interest. Each panel corresponds to a different class of authors, according to their productivity. The authors analyzed in subpanels (a)-(d) are those who published the following number of papers over the whole considered period: (a) 5–25; (b) 26–36; (c) 37–58; and (d) 59–359 papers. The considered thresholds in the number of publications were chosen so that each class comprises 25% of all authors in the dataset. In other words, each panel corresponds to a quartile of authors. In this figure, the distribution of correlations observed using the null model is represented by the red curve (see Section III D). The fractions of authors displaying significant positive and negative correlations between the considered variables are represented in yellow and blue, respectively.

The results in Figure 3 reveal that the observed distribution in all panels differs from the null model distribution. The discrepancy between real data and null model arises since very high or low values of correlations are unlikely to happen by chance, while the real data reveals an opposite effect: for a fraction of authors, the correlations are significant. Considering all four classes, 18-22% of authors displayed a positive correlation between productivity and visibility.
On the other hand, a negative correlation was also observed in all classes of authors. The percentage of authors displaying a negative correlation between productivity and visibility ranged between 10% and 15%. Because more than 64% of the observed correlations are not significant in all four classes of authors, the results suggest that for most of the authors the increase in productivity is not correlated with higher citation counts per paper.

[Figure 3: histograms of correlation values (from -1 to 1) for panels (a)-(d). Fractions of authors with non-significant (p-value > 0.05), significant positive, and significant negative correlations: (a) 67.74%, 21.77%, 10.49%; (b) 69.26%, 18.87%, 11.87%; (c) 65.10%, 20.63%, 14.27%; (d) 64.22%, 21.13%, 14.65%.]

FIG. 3. Correlation between the total number of published papers and citations per paper. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The distribution of correlations obtained with the adopted null model is shown in the red curve.

In our analysis, we also compared the proportion of positive ($f_+$) and negative ($f_-$) correlations. The proportions are compared via the $q$-index, defined as

$$q = \frac{f_+}{f_-}. \qquad (8)$$

In this case, all values of $q$ are higher than $q = 1$, thus suggesting that in all classes of authors positive correlations are more likely to appear. The highest value of $q$ was observed for authors with the lowest number of publications (see panel (a)). We found $q = 2.$[…] diversity of references and the number of citations per paper (Figure 4). A stronger positive correlation is observed especially for authors with lower productivity. In panel (a), one-third of authors displayed a positive correlation between references diversity and visibility, while in (b), the same behavior occurred for one-fourth of all authors. In both cases, positive correlations are more frequent than negative correlations. We found $q = 5.22$ and $q = 2.65$, respectively, for authors in classes (a) and (b). Authors in classes (c) and (d) displayed $q$ values similar to those observed in class (b).

The analysis of reference diversity showed that the way in which authors cite other works may affect their visibility in the near future. This effect was found to be more relevant than productivity, since significant positive correlations were found in up to 25% of authors. This effect might be related to the fact that diverse references might attract attention from other subfields, thus favoring the dissemination of authors' visibility in other scientific communities. In fact, a similar effect has been reported in journal-level analyses comparing the relationship between journals' impact factor and interdisciplinarity indexes [19]. Similar effects have also been observed in diffusion systems, where the presence across different communities benefits the spreading of agents [9]. While it is not possible to establish a causal effect, our results suggest that references diversity (inside a field) might play a role in predicting authors' visibility.

B. Correlations between reference features and citations diversity

While in the previous section we analyzed how references features correlate with visibility, here we investigate the relationship between references and the diversity of citations. Because citations diversity can be seen as an interdisciplinarity index (see e.g. [19]), this section analyzes how the choice (and quantity) of references is related to authors' interdisciplinarity.

[Figure 4: histograms of correlation values (from -1 to 1) for panels (a)-(d). Fractions of authors with non-significant (p-value > 0.05), significant positive, and significant negative correlations: (a) 60.12%, 33.47%, 6.41%; (b) 65.83%, 24.83%, 9.35%; (c) 67.57%, 22.46%, 9.97%; (d) 62.76%, 24.15%, 13.10%.]

FIG. 4. Correlation between diversity of references and citations per paper. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The distribution of correlations obtained with the adopted null model is shown in the red curve.

Figure 5 depicts the correlations between the number of published papers and citation diversity. As observed in the results reported in the previous section, for most of the authors there is no significant correlation between the considered variables. However, a positive correlation is observed for 1/3 of all authors in class (a), while 1/4 of authors displayed a positive correlation in the other classes. Negative correlations are less frequent than positive correlations. In addition, the values of $q$ decrease with productivity, since we obtained $q_A = 8.$[…], $q_B = 3.$[…], $q_C = 2.$[…] and $q_D = 2.$[…]

[Figure 5: histograms of correlation values (from -1 to 1) for panels (a)-(d). Fractions of authors with non-significant (p-value > 0.05), significant positive, and significant negative correlations: (a) 61.19%, 34.60%, 4.21%; (b) 64.12%, 27.45%, 8.43%; (c) 62.73%, 25.50%, 11.77%; (d) 61.39%, 26.53%, 12.07%.]

FIG. 5. Correlation between number of published papers and citations diversity. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model.
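As a worked example of the $q$-index of Eq. (8), the ratios can be recomputed from the panel percentages of Figure 5. The snippet is only a sketch: the function name is hypothetical, the pairing of each panel's percentages with significant positive and significant negative correlations is inferred from the surrounding discussion, and the recomputed decimals may differ in rounding from the values reported in the text.

```python
def q_index(f_pos, f_neg):
    """Eq. (8): ratio between the fractions of positive and negative correlations."""
    return f_pos / f_neg


# (significant positive %, significant negative %) per panel of Figure 5.
fig5 = {
    "a": (34.60, 4.21),
    "b": (27.45, 8.43),
    "c": (25.50, 11.77),
    "d": (26.53, 12.07),
}
q = {panel: round(q_index(pos, neg), 2) for panel, (pos, neg) in fig5.items()}
print(q)
```

The same function applied to the panels of Figure 4, (33.47, 6.41) and (24.83, 9.35), gives ratios of about 5.22 and 2.66, in line with the values quoted for classes (a) and (b) in Section IV A.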

The association between references and citation diversity was also analyzed. This result is shown in Figure 6. The observed correlations are much stronger than the ones analyzed so far. The null model distribution is clearly not compatible with the real data. Here, a significant relationship between reference and citation diversity arises for more than 50% of all authors. Surprisingly, virtually all significant correlations are positive. The percentage of positive correlations reaches roughly 50%, while significant negative correlations were observed for roughly 1.5% of all authors. Another distinctive feature of the relationship between reference and citation diversity lies in the fact that the relationship is similar for all classes of authors. This result is, therefore, strong evidence that researchers who cite papers from many other disciplines might also be cited by many other subareas. In other words, if authors display a diverse behavior when citing other papers, they also tend to be cited by other diverse subareas. Because citation diversity can be seen as a way to measure authors' interdisciplinarity [19], most of the authors adopting larger reference diversity in a given period are expected to increase their interdisciplinarity indexes in the near future.

[Figure 6: histograms of correlation values (from -1 to 1) for panels (a)-(d). Fractions of authors with non-significant (p-value > 0.05), significant positive, and significant negative correlations: (a) 44.43%, 53.93%, 1.64%; (b) 48.91%, 49.70%, 1.38%; (c) 49.85%, 48.60%, 1.55%; (d) 46.79%, 51.85%, 1.36%.]

FIG. 6. Correlation analysis between reference diversity and citation diversity. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model.

C. Interplay between interdisciplinarity and citations

Here we analyze the relationship between the interdisciplinarity of authors (computed as citation diversity) and the number of citations. While studies have shown that a positive correlation exists between journals' interdisciplinarity and impact factor [19], only a few studies have touched on this issue at the author level. Here we found that for most of the authors (≥ [...] q for each class are q_A = 3.[...], q_B = 2.[...], q_C = 1.[...] and q_D = 1.4. As observed in other associations studied here, higher values of q are found for authors in the group of lower productivity.

While Figure 7 only shows the relationship between interdisciplinarity and future visibility, it would still be interesting to see whether there is an inverse effect. To investigate whether variation in citations is correlated with a future variation in interdisciplinarity, we conducted an analysis similar to the one provided in Figure 7. The histograms of correlations are shown in Figure 8. Overall, the histograms are similar to the ones depicted in Figure 7, but here the fraction of significant positive correlations is smaller. This is evident e.g. for authors in (a): the fraction of positive correlations drops from 29.1% to 24.[...]

V. CONCLUSION

In this paper, we analyzed whether relevant scholarly variables are correlated. We proposed a framework to probe whether features extracted from authors' recent history are correlated with metrics observed a few years later. While some correlations are trivial and were not the object of study (such as correlations between citations in subsequent time periods [20]), we studied the correlation between other variables of interest. We focused our analysis on simple, yet relevant metrics, including number of publications, number of citations, reference diversity and authors' interdisciplinarity (measured via citation diversity).

FIG. 7. Correlation analysis between citation diversity and citations per paper. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model. Fractions of authors printed in each panel (p-value > 0.05; p-value ≤ 0.05 with corr ≥ 0; p-value ≤ 0.05 with corr < 0): (a) 63.34%, 29.12%, 7.54%; (b) 67.23%, 22.42%, 10.35%; (c) 67.64%, 20.74%, 11.62%; (d) 62.71%, 22.05%, 15.24%.

Several interesting results have been obtained. Among the associations studied, we found that the strongest correlations were obtained between reference diversity and authors' interdisciplinarity. Here we found a reciprocal tendency: if authors increase their diversity when citing other papers, the diversity of the citations they receive also tends to increase. This pattern was observed for more than 50% of authors. The relationship between productivity and visibility was found to be more prominent for authors with a lower productivity. While no significant correlation exists for most authors, about 20% showed a positive and significant correlation. A stronger association was obtained when analyzing the relationship between reference diversity and future citations. For the class of authors with a lower productivity, we found that roughly 1/3 of authors displayed a significant positive correlation between reference diversity and visibility. We also studied the association between interdisciplinarity (citation diversity) and citations per paper, and found that the fraction of positive significant correlations ranges between 18% and 30% across different classes of authors.

FIG. 8. Correlation analysis between citations per paper and citation diversity. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model.

Our study sheds light on the relationship between researchers' current and future activity. The results obtained here could be extended in diverse studies to provide mechanisms to predict authors' behavior, given researchers' recent history. Future research could dive into other research questions arising from our analysis. For example, while we found that significant positive correlations are more likely to happen than negative ones, it would be interesting to probe which factors make authors display opposite behaviors for the same variables of interest. Another interesting feature that could be studied concerns the causality of the obtained correlations. Finally, a systematic study could be performed in different areas to check whether correlations are more significant in specific subfields.
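As a minimal sketch of the kind of per-author test summarized in this section, the code below correlates a metric measured in one period with another metric measured in the following period, and assigns the author to one of the three categories used in the figure legends (p-value ≤ 0.05 with corr ≥ 0, p-value ≤ 0.05 with corr < 0, p-value > 0.05). The Pearson coefficient, the permutation-based p-value, the example data and all function names are assumptions made for illustration; this section does not specify which correlation coefficient or significance test was adopted, and the permutation test here is not the null model mentioned in the figure captions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles of y with |r| >= the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def classify(x, y, alpha=0.05):
    """Return the figure-legend category for one author's pair of series."""
    r = pearson(x, y)
    p = permutation_p_value(x, y)
    if p > alpha:
        return "not significant"
    return "significant positive" if r >= 0 else "significant negative"


# Hypothetical author: reference diversity in period t and
# citation diversity in the subsequent period.
reference_diversity = [1.0, 1.4, 1.9, 2.3, 2.8, 3.1, 3.7, 4.0]
citation_diversity = [1.2, 1.5, 2.1, 2.2, 3.0, 3.3, 3.6, 4.2]
print(classify(reference_diversity, citation_diversity))
```

Applying `classify` to every author in a productivity quartile and counting the three outcomes would give fractions comparable in form to those printed in the figure panels.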

ACKNOWLEDGMENTS

D.R.A. acknowledges financial support from São Paulo Research Foundation (FAPESP Grant no. 2020/06271-0) and CNPq-Brazil (Grant no. 304026/2018-2). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001.

[1] D. R. Amancio, O. N. Oliveira Jr, and L. d. F. Costa. On the use of topological features and hierarchical characterization for disambiguating names in collaborative networks. EPL (Europhysics Letters), 99(4):48002, 2012.
[2] D. R. Amancio, O. N. Oliveira Jr, and L. da Fontoura Costa. Three-feature model to reproduce the topology of citation networks and the effects from authors' visibility on their h-index. Journal of Informetrics, 6(3):427–434, 2012.
[3] K. Börner, R. Klavans, M. Patek, A. M. Zoss, J. R. Biberstine, R. P. Light, V. Larivière, and K. W. Boyack. Design and update of a classification system: The UCSD map of science. PloS one, 7(7), 2012.
[4] C. Carusi and G. Bianchi. A look at interdisciplinarity using bipartite scholar/journal networks. Scientometrics, pages 1–28, 2019.
[5] E. A. Corrêa Jr, F. N. Silva, L. d. F. Costa, and D. R. Amancio. Patterns of authors contribution in scientific manuscripts. Journal of Informetrics, 11(2):498–510, 2017.
[6] H. F. de Arruda, L. d. F. Costa, and D. R. Amancio. Using complex networks for text classification: Discriminating informative and imaginative documents. EPL (Europhysics Letters), 113(2):28007, 2016.
[7] Y.-H. Eom and S. Fortunato. Characterizing and modeling citation dynamics. PloS one, 6(9):e24926, 2011.
[8] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, et al. Science of science. Science, 359(6379), 2018.
[9] M. Kaiser, M. Goerner, and C. C. Hilgetag. Criticality of spreading dynamics in hierarchical cluster networks without inhibition. New Journal of Physics, 9(5):110, 2007.
[10] V. Larivière, S. Haustein, and K. Börner. Long-distance interdisciplinarity leads to higher scientific impact. PloS one, 10(3), 2015.
[11] L. Leydesdorff, C. S. Wagner, and L. Bornmann. Interdisciplinarity as diversity in citation patterns among journals: Rao-Stirling diversity, relative variety, and the Gini coefficient. Journal of Informetrics, 13(1):255–269, 2019.
[12] R. K. Merton. The Matthew effect in science: The reward and communication systems of science are considered. Science, 159(3810):56–63, 1968.
[13] S. Milojević. Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4):767–773, 2013.
[14] K. Okamura. Interdisciplinarity revisited: evidence for research impact and dynamism. Palgrave Communications, 5(1):1–9, 2019.
[15] R. K. Pan, A. M. Petersen, F. Pammolli, and S. Fortunato. The memory of science: Inflation, myopia, and the knowledge network. Journal of Informetrics, 12(3):656–678, 2018.
[16] R. K. Pan, S. Sinha, K. Kaski, and J. Saramäki. The evolution of interdisciplinarity in physics research. Scientific Reports, 2(1):1–8, 2012.
[17] A. Pluchino, G. Burgio, A. Rapisarda, A. E. Biondo, A. Pulvirenti, A. Ferro, and T. Giorgino. Exploring the role of interdisciplinarity in physics: Success, talent and luck. PloS one, 14(6), 2019.
[18] F.-X. Ren, H.-W. Shen, and X.-Q. Cheng. Modeling the clustering in citation networks. Physica A: Statistical Mechanics and its Applications, 391(12):3533–3539, 2012.
[19] F. N. Silva, F. A. Rodrigues, O. N. Oliveira Jr, and L. d. F. Costa. Quantifying the interdisciplinarity of scientific journals and fields. Journal of Informetrics, 7(2):469–477, 2013.
[20] F. N. Silva, A. Tandon, D. R. Amancio, A. Flammini, F. Menczer, S. Milojević, and S. Fortunato. Recency predicts bursts in the evolution of author citations. Quantitative Science Studies, 1(3):1298–1308, 2020.
[21] M. V. Simkin and V. P. Roychowdhury. Stochastic modeling of citation slips. Scientometrics, 62(3):367–384, 2005.
[22] R. Sinatra, D. Wang, P. Deville, C. Song, and A.-L. Barabási. Quantifying the evolution of individual scientific impact. Science, 354(6312), 2016.
[23] J. V. Tohalino and D. R. Amancio. Extractive multi-document summarization using multilayer networks. Physica A: Statistical Mechanics and its Applications, 503:526–539, 2018.
[24] V. A. Traag, L. Waltman, and N. J. van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):1–12, 2019.
[25] H. Tuomisto. A diversity of beta diversities: straightening up a concept gone awry. Part 1. Defining beta diversity as a function of alpha and gamma diversity. Ecography, 33(1):2–22, 2010.
[26] Y. Xie. "Undemocracy": inequalities in science. Science, 344(6186):809–810, 2014.
[27] A. Yegros-Yegros, I. Rafols, and P. D'Este. Does interdisciplinary research lead to higher citation impact? The different effect of proximal and distal interdisciplinarity. PloS one, 10(8), 2015.