Associations between author-level metrics in subsequent time periods
Ana C. M. Brito, Filipi N. Silva and Diego R. Amancio
Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
Indiana University Network Science Institute, Bloomington, Indiana 47408, USA
(Dated: November 26, 2020)
Abstract
Understanding the dynamics of authors is relevant to predict and quantify performance in science. While the relationship between recent and future citation counts is well-known, many relationships between scholarly metrics at the author level remain unknown. In this context, we performed an analysis of author-level metrics extracted from subsequent periods, focusing on visibility, productivity and interdisciplinarity. First, we investigated how metrics controlled by the authors (such as references diversity and productivity) affect their visibility and citation diversity. We also explored the relation between authors' interdisciplinarity and citation counts. The analysis in a subset of Physics papers revealed that there is no strong correlation between authors' productivity and future visibility for most of the authors. A higher fraction of strong positive correlations, though, was found for those with a lower number of publications. We also found that reference diversity computed at the author level may positively impact authors' future visibility. The analysis of metrics impacting future interdisciplinarity suggests that productivity may play a role only for low-productivity authors. We also found a surprisingly strong positive correlation between references diversity and interdisciplinarity, suggesting that an increase in diverse citing behavior may be related to a future increase in authors' interdisciplinarity. Finally, interdisciplinarity and visibility were found to be moderately positively associated: significant positive correlations were observed for 30% of authors with lower productivity.

I. INTRODUCTION

The age of information promoted several new discoveries in science, with many of them emerging from interdisciplinary endeavors [4, 10, 19].
At the same time, these communities are growing in size and productivity, resulting in an ever-increasing deluge of digital information available in the form of published articles [15], datasets, and algorithms, as well as across many platforms, such as cloud services and social media. However, the increase in digital resources has not leveled the playing field for researchers; inequality is rising in science [26].

Understanding the mechanisms leading to inequality in science can help policymakers and funding agencies better distribute research resources while also promoting a more just and democratic environment. Part of this problem stems from the fact that researchers compete among themselves for limited funding and attention. In such a system, an increase in researchers' visibility leads to better funding opportunities which, in turn, leads to more resources for their institutions, thus allowing those researchers to attain even greater visibility.

The cycle in which researchers with the most resources are rewarded with even more resources over time is a source of inequality known as the Matthew Effect [12]. This is one of the reasons why understanding how the dynamics of authors' visibility unfold over time is one of the most important problems in the field of Science of Science [8, 22]. However, little attention has been given to understanding the relationships between author metrics other than citations [2, 7, 18, 21]. In particular, the literature lacks studies on metrics controlled by the authors, such as those based on their choices of references and their productivity, and on the possible effects these metrics may have on the citations they receive.

Here, we propose to explore the associations among different bibliometric measures for authors in subsequent time periods through correlation. The metrics we consider include interdisciplinarity, which is measured in terms of the subject diversity [19] of citations received by the authors, as well as the productivity and visibility of authors.
Here, we are interested in addressing three main questions:

1. How do metrics controlled by the authors (namely, their productivity and diversity in the choice of references) correlate with visibility metrics, such as the received number of citations per paper, in a subsequent time period?
2. How do these characteristics correlate with the future interdisciplinarity of their publications based on citations?
3. How is interdisciplinarity related to future citations and vice versa?

In addition to productivity and visibility, we also studied interdisciplinarity, as it plays an important role in modern science given the increasing number of authors bridging different fields. Here, it is used as a descriptor for citation diversity, as adopted in related works [19]. In a similar fashion, we adopt a reference diversity for authors based on the fields of the references employed in their publications.

We employ the American Physical Society (APS) dataset, which incorporates all the citations and metadata for papers, mainly in Physics, published in any of the APS journals up to 2010. More specifically, we employ the dataset used in [20], which was supplemented with disambiguated authors from the Microsoft Academic Graph (MAG). First, we construct a co-occurrence network for the categories existing in the APS journals (PACS codes), which is used to define a metric of interdisciplinarity for authors based on the diversity of their received citations or of the references they used. Next, we calculate the correlation between the considered author-level metrics for a window considering previous publications and another window considering subsequent publications and citations. Finally, we use a statistical framework based on null models to obtain the significance of the correlations between the considered metrics.

Several interesting results were obtained in our analysis. We found that the diversity of references may positively impact the observed future visibility for 1/3 of low-productivity authors. This effect is weaker for more productive authors, yet the fraction of authors that were positively affected varied between 22% and 25%. A weaker association between productivity and citation counts was found: the highest fraction of authors with a significant positive correlation was 21%. When comparing the fraction of authors displaying significant positive and negative correlations, both productivity and reference diversity turned out to be more positively than negatively correlated with authors' visibility. Surprisingly, we also found that reference diversity and future interdisciplinarity are strongly positively correlated for roughly 50% of authors. Finally, the association between interdisciplinarity and visibility revealed that an increase in interdisciplinarity is more likely to be linked to an increase in visibility for low-productivity authors. Such significant positive correlations were observed in roughly 30% of authors in that class. We believe our results can provide further insights into researchers' career dynamics.
II. RELATED WORKS
In this paper, among other relationships, we analyze which factors affect the visibility of authors (measured in terms of citations). At the paper level, some correlations between paper features and the number of citations have been studied in the last few years. An important factor that has been found to affect the visibility of papers is the interdisciplinarity of the venues in which they are disseminated. Different aspects of scientific pieces have been used to define interdisciplinarity indexes. In [19], journal citation networks are used to quantify how interdisciplinary a journal is. For a given journal, the diversity of citations from different areas is used to gauge interdisciplinarity. Such a diversity is computed using the concept of true diversity, a measure widely used to express how diverse a set of elements from different classes is [5, 23, 25]. Subject areas and citation data were extracted from the Journal Citation Reports dataset. An interesting conclusion was the positive correlation between the proposed interdisciplinarity index and journals' impact factor. In other words, interdisciplinary journals tend to have a higher impact factor than specialized journals.

Using a different approach, the study conducted in [4] also quantified journals' interdisciplinarity. The authors used Scopus data comprising Information and Communication Technology publications. The relationship between scholars and journals was represented via bipartite graphs. After an SVD-based dimensionality reduction, a spectral co-clustering method was used to identify communities of scholars and journals. The diversity (i.e. the interdisciplinarity) of a journal was then defined by analyzing the unevenness of the distribution of its authors over the obtained network communities. Such a dispersion was computed via Shannon entropy, Simpson diversity, and the Rao-Stirling index [11]. High values of the disparity metrics were found to occur in journals appearing between communities. Conversely, low diversity was observed mostly in network community cores.

A correlation between interdisciplinarity and citation impact was investigated in [27]. Three aspects of interdisciplinarity were investigated at the paper level: variety, balance, and disparity. Variety is the total number of different disciplines (or Web of Science categories) cited by the paper, while balance corresponds to the evenness of the distribution of disciplines, computed via Shannon diversity. Disparity measures how different the disciplines in the reference set are. The authors analyzed the impact of papers using the Normalized Citation Score (NCS). The dataset comprised papers from the Science Citation Index-Expanded (2005). A regression analysis revealed that variety was positively associated with NCS. In contrast, both balance and disparity were negatively associated with NCS.

The impact of citing interdisciplinary papers on papers' visibility was investigated in [10]. The authors characterized interdisciplinarity at the paper level by using papers' references. Subdisciplines were defined by the UCSD map of science [3]. According to this map, the similarity between journals is based on the number of shared references (via bibliographic coupling) and keywords. An average-linkage clustering strategy groups the subdisciplines into 13 different categories, and the pairwise cluster distance is represented in a 3D Fruchterman-Reingold layout. An analysis of 25,000 documents showed that papers citing interdisciplinary sub-disciplines tend to receive more citations than papers with fewer references to interdisciplinary sub-disciplines. This study also grouped sub-disciplines by distance in the UCSD map and demonstrated that papers citing distant sub-disciplines tend to have higher relative citation rates than papers citing similar sub-disciplines.

At the author level, the study carried out in [17] investigated the effects of interdisciplinarity on scientists' careers. The APS dataset was used, considering papers published between 1980 and 2009. The hierarchical system of subdiscipline classification, referred to as the Physics and Astronomy Classification Scheme (PACS), was used to measure the interdisciplinarity of an author. They proposed an index combining the total number of PACS codes used during the entire career of the author and the average number of different classes appearing simultaneously in the author's papers. Using this value, authors were grouped by level of interdisciplinarity: low, medium, and high. Based on these groups, it was observed that higher interdisciplinarity positively affects productivity. A statistical model was proposed to reproduce the original data. The factors considered in the model were the proposed interdisciplinarity index, the number of publications in each class, the number of citations, talent, reputation, and luck. The model reproducing the properties of the studied system revealed that authors with medium-high talent are the most successful ones. In addition, luck turned out to play an important role in career success. Surprisingly, it was found to be even more relevant than interdisciplinarity factors in some cases.

Another factor concerns the well-known rich-get-richer paradigm. In other words, an author who has received many citations in the past tends to receive more citations in the future. In [20], the authors describe a model for reproducing the distribution of authors' citations in the APS dataset. Unlike other models, they included a recency factor so that more recent citation data receives a higher weight in the preferential attachment model. This model showed that the rich-get-richer paradigm describes the citation distribution for authors publishing in APS journals. Most importantly, they also found that recency plays an important role in defining how broad the burstiness of citations is. The number of citations received by authors is strongly dependent on the total number of citations received in the last 1-2 years [20].
III. METHODOLOGY
The methodology adopted in this paper can be divided into the following steps:

1. Creation of PACS networks: this phase is responsible for establishing and identifying the subfields inside the considered dataset. Groups of strongly connected subareas are grouped into network communities. Each community is used to identify an area, which in turn is used to define some of the variables of interest. The dataset used to create the networks is described in Section III A. The process of creating and identifying communities of co-occurring PACS is described in Section III B.
2. Definition of diversity indexes: here we use diversity indexes to quantify how diversely authors cite or are cited by other papers. The diversity takes as reference the subareas (communities) identified in the PACS network. The adopted diversity index is defined in Section III C. Diversity indexes are among the author-level metrics of interest in this paper.
3. Quantifying the relationship between variables of interest: here we quantify whether there are correlations between variables of interest measured in subsequent time intervals. The methodology adopted to quantify the fraction of authors displaying significant positive/negative correlations between the variables of interest is described in Section III D.

A. Dataset
The dataset consists of papers published by the American Physical Society (APS) journals between 1991 and 2010. The dataset comprises 299,930 publications from APS journals. While the dataset provides several article metadata, we used for each paper the list of authors and the reference list. We also used the list of subfield codes provided by the authors and selected from the Physics and Astronomy Classification Scheme (PACS). This classification scheme is a hierarchical code system used to organize the main fields and subfields in Physics journals.

When addressing any issue at the author level, one should be aware that name ambiguities and name splitting may arise [1, 13]. To address this problem, we used the Microsoft Academic Graph (MAG) dataset, which is a more extensive set of publications with disambiguated author names [20]. We mapped the APS dataset into the MAG database by matching DOI values.
B. PACS Networks
In this work, we use the notion of subfields to compute the degree of interdisciplinarity inside the Physics area (for APS journals). Subfields were derived from PACS co-occurrence networks [16]. Each publication in the APS dataset has its PACS codes, which are chosen by the authors from a list of possible codes. We used this information to generate networks where nodes are PACS codes. Figure 1 shows an example of a PACS co-occurrence network extracted from a set of papers. As suggested by other works, PACS codes were analyzed at the first two levels [16]. Two codes are linked whenever they appear together in one or more papers. Here we take the view that a subfield in the considered subset of Physics papers can be seen as a subset of highly connected codes. In this way, each subfield is defined as a community in the respective PACS co-occurrence network. While our results are based on the Louvain community detection algorithm [24], a preliminary analysis revealed that there is no large difference when other methods are used to detect communities.

Considering the most recent years of the dataset, using the Louvain method, we found 10 network communities. An analysis of the obtained communities considering data from the last 5 years showed that the four largest communities are mainly composed of papers in the following subjects: (i) magnetic properties and materials; (ii) quantum mechanics, field theories, and special relativity; (iii) structure of solids and liquids; crystallography; and (iv) statistical physics, thermodynamics, and nonlinear dynamical systems.
FIG. 1. Schematic representation of the components needed to calculate citations and references diversity.
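The network construction described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`truncate_pacs`, `cooccurrence_edges`), the toy PACS lists, and the reading of "first two levels" as the first two dot-separated fields of a code are assumptions made for illustration. Edge weights count the papers in which two codes co-occur; the text only requires that a link exists when this count is at least one.

```python
from collections import Counter
from itertools import combinations


def truncate_pacs(code: str) -> str:
    """Keep the first two levels of a PACS code, e.g. '05.45.-a' -> '05.45'.

    Assumption: the levels are the dot-separated fields of the code.
    """
    return ".".join(code.split(".")[:2])


def cooccurrence_edges(papers):
    """Count, for each pair of truncated codes, the papers listing both."""
    edges = Counter()
    for codes in papers:
        # A set avoids self-pairs when two codes share their first two levels.
        nodes = sorted({truncate_pacs(c) for c in codes})
        for u, v in combinations(nodes, 2):
            edges[(u, v)] += 1
    return edges


# Hypothetical toy input: one list of PACS codes per paper.
papers = [
    ["05.45.-a", "89.75.Hc"],
    ["05.45.Xt", "89.75.Fb", "02.50.-r"],
    ["75.10.Jm", "75.30.Ds"],
]

edges = cooccurrence_edges(papers)
for (u, v), weight in sorted(edges.items()):
    print(u, v, weight)
```

The resulting edge list could then be handed to any Louvain implementation to obtain the communities that play the role of subfields; that step is omitted here because it depends on the graph library used.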
C. Diversity indexes
Here we employ a diversity index for authors based on the diversity of the fields citing their papers (citations diversity) or referenced by their papers (references diversity). Because citation diversity is usually related to interdisciplinarity [19], we use both terms to describe the same concept. To assign a distribution of fields to a given author $A$, we first look at all the papers $P_i$ citing publications co-authored by $A$ during the considered time window. For each citing paper we obtain the communities associated with the PACS codes listed in the paper. Figure 1 illustrates the components employed to calculate the citation diversity index for authors. Next, we derive the weights $w(P_i, C_j)$ relating a paper $P_i$ to a PACS community $C_j$, defined as the fraction of the PACS codes listed in $P_i$ that belong to $C_j$, i.e.

$$w(P_i, C_j) = \frac{|\mathrm{PACS}(P_i) \cap C_j|}{|\mathrm{PACS}(P_i)|}, \tag{1}$$

where $\mathrm{PACS}(P_i)$ is the set of PACS codes listed in paper $P_i$. Next, we assign a weight $\bar{w}_{\mathrm{cit}}(A, C_j)$ relating an author $A$ to each PACS community $C_j$ based on the citing papers. Each citation to a paper from author $A$ counts as a unit that is distributed among the communities, so that $\bar{w}_{\mathrm{cit}}(A, C_j)$ is defined as

$$\bar{w}_{\mathrm{cit}}(A, C_j) = \sum_{P_i} n_{\mathrm{cit}}(P_i, A)\, w(P_i, C_j), \tag{2}$$

where $n_{\mathrm{cit}}(P_i, A)$ is the number of citations from $P_i$ to author $A$. Finally, we normalize $\bar{w}_{\mathrm{cit}}(A, C_j)$ across all the received citations, thus obtaining a probability-like measure $p_{\mathrm{cit}}(A, C_j)$ of relatedness between an author $A$ and a community $C_j$, given by

$$p_{\mathrm{cit}}(A, C_j) = \frac{\bar{w}_{\mathrm{cit}}(A, C_j)}{\sum_{C_k} \bar{w}_{\mathrm{cit}}(A, C_k)}. \tag{3}$$

The citation diversity index $\mathrm{citDiv}(A)$ is then defined as the exponential of the entropy of $p_{\mathrm{cit}}(A, C_j)$ [19], i.e.,

$$\mathrm{citDiv}(A) = \exp\Big[ -\sum_{C_j} p_{\mathrm{cit}}(A, C_j) \log p_{\mathrm{cit}}(A, C_j) \Big]. \tag{4}$$

Similarly, to obtain the references diversity index, we use the papers $P_i$ referenced by works authored by $A$ instead of the received citations.
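As an illustration of Eqs. (1)-(4), the following sketch computes the citation diversity of a single author from a toy set of citing papers. The function names, the `community_of` mapping, and the input layout (a list of PACS codes plus a citation count per citing paper) are hypothetical choices made for this example, and every listed code is assumed to belong to exactly one community.

```python
import math
from collections import defaultdict


def paper_weights(pacs, community_of):
    """Eq. (1): fraction of a paper's PACS codes that fall in each community."""
    weights = defaultdict(float)
    for code in pacs:
        weights[community_of[code]] += 1.0 / len(pacs)
    return weights


def citation_diversity(citing_papers, community_of):
    """Eqs. (2)-(4): exponential of the Shannon entropy of p_cit(A, C_j).

    citing_papers: list of (pacs_codes, n_cit) pairs, where n_cit is the
    number of citations from that paper to the author.
    """
    w_bar = defaultdict(float)
    for pacs, n_cit in citing_papers:
        for community, w in paper_weights(pacs, community_of).items():
            w_bar[community] += n_cit * w          # Eq. (2)
    total = sum(w_bar.values())
    probs = [w / total for w in w_bar.values() if w > 0]   # Eq. (3)
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)                        # Eq. (4)


# Hypothetical mapping from (two-level) PACS codes to communities.
community_of = {"05.45": "C1", "89.75": "C1", "75.10": "C2", "03.65": "C3"}

# Two citing papers, each citing the author once.
citing = [(["05.45", "75.10"], 1), (["03.65"], 1)]
div = citation_diversity(citing, community_of)
print(div)
```

In this toy case the author's citations are spread over three communities with probabilities 0.25, 0.25 and 0.5, so the index lies between 2 and 3; an author cited from a single community obtains the minimum value of 1. The references diversity follows from the same function by passing the referenced papers and the counts $n_{\mathrm{ref}}$ instead.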
For the references diversity, the weight linking an author and a PACS community is defined as

$$\bar{w}_{\mathrm{ref}}(A, C_j) = \sum_{P_i} n_{\mathrm{ref}}(A, P_i)\, w(P_i, C_j), \tag{5}$$

where $n_{\mathrm{ref}}(A, P_i)$ is the number of times author $A$ cited the paper $P_i$. The probability analogous to $p_{\mathrm{cit}}$ (i.e. $p_{\mathrm{ref}}$) is then obtained by normalization:

$$p_{\mathrm{ref}}(A, C_j) = \frac{\bar{w}_{\mathrm{ref}}(A, C_j)}{\sum_{C_k} \bar{w}_{\mathrm{ref}}(A, C_k)}, \tag{6}$$

and the references diversity $\mathrm{refDiv}(A)$ is calculated as

$$\mathrm{refDiv}(A) = \exp\Big[ -\sum_{C_j} p_{\mathrm{ref}}(A, C_j) \log p_{\mathrm{ref}}(A, C_j) \Big]. \tag{7}$$

The diversity measure in Equations 4 and 7 has been used in many contexts [5, 6, 19]. Because the computation of $p_{\mathrm{cit}}$ and $p_{\mathrm{ref}}$ is not reliable when only few data are available, these quantities were computed for authors with more than ten references and citations in the dataset.

D. Past and Future scholarly time series
We propose a framework to analyze how a scholarly metric or diversity at a certain point in time for an author A may impact their future metrics. First, we define two moving windows, one for the past and another for the future: a 5-year window before the time under consideration t, and a 3-year window after t, as illustrated in Figure 2a. For each window, we calculate the scholarly metrics of A. In particular, for the Past window, we calculate the number of papers, citations received per paper, and references diversity, only considering publications in the period.

FIG. 2. Schematic representation of the proposed methodology. (a) Given two subsequent windows (past, 5 years, and future, 3 years) that move over time, we calculate the time series of the considered metrics. (b) For each time series we derive a null model based on shuffling it along time. (c) We draw the correlation distribution (gray), ranging from -1 to 1, obtained from the data time series and highlight negative (blue) and positive (yellow) values that are significant in comparison to the null models. The average null model distribution is also shown for comparison in red.
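The significance test summarized in Figure 2 can be sketched as follows for one author and one pair of past and future metrics. This is an illustrative reading, not the authors' code: the function names, the number of shuffles (`n_null`), the choice to shuffle only the future series, the use of `>=` in the comparison, and the treatment of constant series are all assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def shuffle_pvalue(past, future, n_null=1000, seed=0):
    """Two-sided p-value of the observed correlation under time shuffling.

    The p-value is the fraction of shuffled realizations whose absolute
    correlation is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = pearson(past, future)
    shuffled = list(future)
    hits = 0
    for _ in range(n_null):
        rng.shuffle(shuffled)  # destroys the temporal alignment
        if abs(pearson(past, shuffled)) >= abs(observed):
            hits += 1
    return observed, hits / n_null


# Hypothetical series with one value per window position (16 time points).
past_metric = list(range(1, 17))
future_metric = [2 * v + 1 for v in past_metric]

r, p = shuffle_pvalue(past_metric, future_metric)
print(r, p, "significant" if p <= 0.05 else "not significant")
```

Repeating this test for every author and collecting the signs of the significant correlations yields the fractions of significant positive and negative correlations discussed in the results.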
For the Future window, we calculate the number of citations received in that window by papers published by A during the Past window. In the same fashion, we calculate citation diversity by considering only publications in the Past window and citations in the Future window. By moving the windows along t for a period from 1995 to 2010, we obtain Past (number of papers, citations per author, and the references diversity) and Future (citations received per paper and citation diversity) time series for each author based on the calculated scholarly metrics.

In order to draw relationships between the scholarly metrics from past and future windows, we adopted the Pearson correlation. However, as these metrics may have characteristics that can lead to spurious correlations, such as the presence of outliers or long-tail distributions, we employed a statistical approach based on null models to measure the significance of the obtained correlations. First, for each time series of each author, we obtain a set of 10,… null-model realizations by shuffling the series along time (Figure 2b). These realizations are used to compute a p-value associated with each author and a pair of past and future metrics. The p-value is defined as the probability of the null model resulting in an absolute correlation that is higher than what was found for the data. Finally, the results are presented in the form of a correlation distribution alongside the percentage of negative and positive significant relationships, considering a threshold of $5 \times 10^{-2}$ for the p-values. This is illustrated in Figure 2c.

IV. RESULTS AND DISCUSSION
Here we analyze the relationship between relevant author-level metrics. More specifically, we analyze whether the diversity of references, the number of papers, and the number of references are correlated with citation counts and citation diversity. We first focus on the relationship between variables that authors can control in the first 5-year window (e.g. the number and diversity of references) and variables that are not directly self-dependent (such as the number of citations and citation diversity) and are measured in the following 3-year window. The correlations between paper/reference features and citation counts are discussed in Section IV A. The correlations between paper/reference features and citation diversity are discussed in Section IV B. Because interesting relationships between interdisciplinarity (i.e. citation diversity) and citation counts have been reported at different levels [4, 14, 19], we also analyzed the correlations between interdisciplinarity and citations at the author level. This is reported in Section IV C.
A. Correlations between reference features and citations
The simplest reference feature that can be used in our analysis is the total number of references. For the sake of clarity, we will instead use the number of papers (i.e. the authors' productivity) in our analysis because the total number of references is strongly related to the number of papers. In addition, the results using either the number of references or the number of papers are very similar.

We start by analyzing whether productivity (i.e. the number of published papers) is correlated with the total number of citations per paper. This result is shown in Figure 3. As mentioned in the methodology (see Section III D), the histograms show the distribution of authors over different degrees of correlation between the variables of interest. Each panel corresponds to a different class of authors, according to their productivity. The authors analyzed in subpanels (a)-(d) are those who published the following numbers of papers over the whole considered period: (a) 5–25; (b) 26–36; (c) 37–58; and (d) 59–359 papers. The thresholds in the number of publications were chosen so that each class comprises 25% of all authors in the dataset. In other words, each panel corresponds to a quartile of authors. In this figure, the distribution of correlations observed using the null model is represented by the red curve (see Section III D). The fractions of authors displaying significant positive and negative correlations between the considered variables are represented in yellow and blue, respectively.

The results in Figure 3 reveal that the observed distribution in all panels differs from the null model distribution. The discrepancy between real data and null model arises because very high or very low correlation values are unlikely to happen by chance, while the real data reveals the opposite effect: for a fraction of authors, the correlations are significant. Considering all four classes, 18-22% of authors displayed a positive correlation between productivity and visibility. On the other hand, a negative correlation was also observed in all classes of authors. The percentage of authors displaying a negative correlation between productivity and visibility ranged between 10% and 15%. Because more than 64% of the observed correlations are not significant in all four classes of authors, the results suggest that for most of the authors an increase in productivity is not correlated with higher citation counts per paper.

FIG. 3. Correlation between the total number of published papers and citations per paper. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The distribution of correlations obtained with the adopted null model is shown in the red curve. Fractions of authors shown in the panels (not significant; significant positive; significant negative, at p-value 0.05): (a) 67.74%, 21.77%, 10.49%; (b) 69.26%, 18.87%, 11.87%; (c) 65.10%, 20.63%, 14.27%; (d) 64.22%, 21.13%, 14.65%.

In our analysis, we also compared the proportion of positive ($f_+$) and negative ($f_-$) correlations. The proportions are compared via the q-index, defined as

$$q = \frac{f_+}{f_-}. \tag{8}$$

In this case, all values of q are higher than q = 1, suggesting that in all classes of authors positive correlations are more likely to appear. The highest value of q was observed for authors with the lowest number of publications (see panel (a)). We found q ≈ 2.08 for this class, from the fractions 21.77% and 10.49% shown in panel (a) of Figure 3.

Figure 4 shows the correlations between the diversity of references and the number of citations per paper. A stronger positive correlation is observed especially for authors with lower productivity. In panel (a), one-third of authors displayed a positive correlation between references diversity and visibility, while in (b) the same behavior occurred for one-fourth of all authors. In both cases, positive correlations are more frequent than negative correlations. We found q = 5.22 and q = 2.65, respectively, for authors in classes (a) and (b). Authors in classes (c) and (d) displayed q values similar to those observed in class (b).

The analysis of reference diversity showed that the way in which authors cite other works may affect their visibility in the near future. This effect was found to be more relevant than that of productivity, since significant positive correlations were found in up to 25% of authors. This effect might be related to the fact that diverse references might attract attention from other subfields, thus favoring the dissemination of authors' visibility in other scientific communities. In fact, a similar effect has been reported at the journal level, in analyses comparing the relationship between journals' impact factor and interdisciplinarity indexes [19]. Similar effects have also been observed in diffusion systems, where the presence across different communities benefits the spreading of agents [9]. While it is not possible to establish a causal effect, our results suggest that references diversity (inside a field) might play a role in predicting authors' visibility.

B. Correlations between reference features and citations diversity
While in the previous section we analyzed how reference features correlate with visibility, here we investigate the relationship between references and the diversity of citations. Because citations diversity can be seen as an interdisciplinarity index (see e.g. [19]), this section analyzes how the choice (and quantity) of references is related to authors' interdisciplinarity.

FIG. 4. Correlation between diversity of references and citations per paper. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The distribution of correlations obtained with the adopted null model is shown in the red curve. Fractions of authors shown in the panels (not significant; significant positive; significant negative, at p-value 0.05): (a) 60.12%, 33.47%, 6.41%; (b) 65.83%, 24.83%, 9.35%; (c) 67.57%, 22.46%, 9.97%; (d) 62.76%, 24.15%, 13.10%.

Figure 5 depicts the correlations between the number of published papers and citation diversity. As observed in the results reported in the previous section, for most of the authors there is no significant correlation between the considered variables. However, a positive correlation is observed for 1/3 of all authors in class (a), while roughly 1/4 of authors displayed a positive correlation in the other classes. Negative correlations are less frequent than positive correlations. In addition, the values of q tend to decrease with productivity: applying Eq. (8) to the fractions shown in Figure 5 gives q_A ≈ 8.2, q_B ≈ 3.3, q_C ≈ 2.2, and q_D ≈ 2.2.

FIG. 5. Correlation between number of published papers and citations diversity. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model. Fractions of authors shown in the panels (not significant; significant positive; significant negative, at p-value 0.05): (a) 61.19%, 34.60%, 4.21%; (b) 64.12%, 27.45%, 8.43%; (c) 62.73%, 25.50%, 11.77%; (d) 61.39%, 26.53%, 12.07%.
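As a worked example of the q-index in Eq. (8), the short sketch below recomputes q for each productivity class from the fractions of significant positive and negative correlations printed in the panels of Figure 5. The function name and the dictionary layout are illustrative assumptions; the percentages are those reported in the figure.

```python
def q_index(f_pos: float, f_neg: float) -> float:
    """Eq. (8): ratio of significant positive to significant negative fractions."""
    return f_pos / f_neg


# (significant positive %, significant negative %) per panel of Figure 5.
figure5 = {
    "a": (34.60, 4.21),
    "b": (27.45, 8.43),
    "c": (25.50, 11.77),
    "d": (26.53, 12.07),
}

q = {panel: q_index(f_pos, f_neg) for panel, (f_pos, f_neg) in figure5.items()}
for panel, value in q.items():
    print(f"q_{panel.upper()} = {value:.2f}")
```

Values of q above 1 indicate that significant positive correlations outnumber significant negative ones in the corresponding class, which is the case for all four quartiles here.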
The association between references diversity and citation diversity was also analyzed. This result is shown in Figure 6. The observed correlations are much stronger than the ones analyzed so far. The null model distribution is clearly not compatible with the real data. Here, a significant relationship between reference and citation diversity arises for more than 50% of all authors. Surprisingly, virtually all significant correlations are positive. The percentage of positive correlations reaches roughly 50%, while significant negative correlations were observed for roughly 1.5% of all authors. Another distinctive feature of the relationship between reference and citation diversity lies in the fact that the relationship is similar for all classes of authors. This result is, therefore, strong evidence that researchers who cite papers from many other disciplines might also be cited by many other subareas. In other words, if authors display a diverse behavior when citing other papers, they also tend to be cited by other diverse subareas. Because citation diversity can be seen as a way to measure authors' interdisciplinarity [19], most of the authors adopting larger reference diversity in a given period are expected to increase their interdisciplinarity indexes in the near future.

FIG. 6. Correlation analysis between reference diversity and citation diversity. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model. Fractions of authors shown in the panels (not significant; significant positive; significant negative, at p-value 0.05): (a) 44.43%, 53.93%, 1.64%; (b) 48.91%, 49.70%, 1.38%; (c) 49.85%, 48.60%, 1.55%; (d) 46.79%, 51.85%, 1.36%.

C. Interplay between interdisciplinarity and citations
Here we analyze the relationship between interdisciplinarity of authors (computed as citation diversity) and the number of citations. While studies have shown that a positive correlation exists between journals' interdisciplinarity and impact factor [19], only a few studies have touched on this issue at the author level. Here we found that for most of the authors ( ≥ q for each class are q_A = 3 . q_B = 2 . q_C = 1 . q_D = 1.4. As observed in other associations studied here, higher values of q are found for authors in the group of lower productivity.

While Figure 7 only shows the relationship between interdisciplinarity and future visibility, it would still be interesting to see if there is an inverse effect. To investigate whether variation in citations is correlated with a future variation in interdisciplinarity, we conducted an analysis similar to the one provided in Figure 7. The histograms of correlations are shown in Figure 8. Overall the histograms are similar to the ones depicted in Figure 7, but here the fraction of significant positive correlations is smaller. This is evident e.g. for authors in (a): the fraction of positive correlations drops from 29.1% to 24 .

V. CONCLUSION
In this paper, we analyzed whether relevant scholarly variables are correlated. We proposed a framework to probe whether features extracted from authors' recent history are correlated with metrics observed a few years later. While some correlations are trivial and were not an object of study (such as correlations between citations in subsequent time periods [20]), we studied the correlation between other variables of interest. We focused our analysis on simple, yet relevant metrics, including number of publications, number of citations, reference diversity and authors' interdisciplinarity (measured via citation diversity).

[Histograms of per-author correlations, panels (a)-(d). Fractions of authors (p-value > 0.05 / p-value ≤ 0.05 with corr ≥ 0 / p-value ≤ 0.05 with corr < 0): (a) 63.34% / 29.12% / 7.54%; (b) 67.23% / 22.42% / 10.35%; (c) 67.64% / 20.74% / 11.62%; (d) 62.71% / 22.05% / 15.24%.]

FIG. 7. Correlation analysis between citation diversity and citations per paper. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model.

Several interesting results have been obtained. Among the associations studied, we found that the strongest correlations were obtained between reference diversity and authors' interdisciplinarity. Here we found a reciprocal tendency: if authors increase their diversity when citing other papers, the diversity of the citations they receive will also tend to increase. This pattern was observed for more than 50% of authors. The relationship between productivity and visibility was found to be more prominent for authors with a lower productivity. While no significant correlation exists for most authors, about 20% showed a positive and significant correlation. A stronger association was obtained when analyzing the relationship between reference diversity and future citations. For the class of authors with a lower productivity, we found that roughly 1/3 of authors displayed a significant positive correlation between reference diversity and visibility. We also studied the association between references and citation diversity and found that the fraction of positive significant correlations ranges between 18% and 30% across different classes of authors.

[Histograms of per-author correlations, panels (a)-(d). Legend: p-value ≤ 0.05 (corr ≥ 0); p-value ≤ 0.05 (corr < 0); p-value > 0.05.]

FIG. 8. Correlation analysis between citations per paper and citation diversity. Panels (a)-(d) correspond to quartiles of authors sorted, in increasing order, by number of publications. The red curve denotes the distribution of correlations obtained with the adopted null model.

Our study sheds light on the relationship between researchers' current and future activity. The results obtained here could be extended in diverse studies to provide mechanisms to predict authors' behavior, given researchers' recent history. Future research could dive into other research questions arising from our analysis. For example, while we found that significant positive correlations are more likely to happen than negative ones, it would be interesting to probe which factors make authors display opposite behaviors for the same variables of interest. Another interesting feature that could be studied concerns the causality of the obtained correlations. Finally, a systematic study could be performed in different areas to check whether correlations are more significant in specific subfields.
ACKNOWLEDGMENTS
D.R.A. acknowledges financial support from São Paulo Research Foundation (FAPESP Grant no. 2020/06271-0) and CNPq-Brazil (Grant no. 304026/2018-2). This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001.

[1] D. R. Amancio, O. N. Oliveira Jr, and L. d. F. Costa. On the use of topological features and hierarchical characterization for disambiguating names in collaborative networks. EPL (Europhysics Letters), 99(4):48002, 2012.
[2] D. R. Amancio, O. N. Oliveira Jr, and L. da Fontoura Costa. Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index. Journal of Informetrics, 6(3):427–434, 2012.
[3] K. Börner, R. Klavans, M. Patek, A. M. Zoss, J. R. Biberstine, R. P. Light, V. Larivière, and K. W. Boyack. Design and update of a classification system: The UCSD map of science. PLoS ONE, 7(7), 2012.
[4] C. Carusi and G. Bianchi. A look at interdisciplinarity using bipartite scholar/journal networks. Scientometrics, pages 1–28, 2019.
[5] E. A. Corrêa Jr, F. N. Silva, L. d. F. Costa, and D. R. Amancio. Patterns of authors contribution in scientific manuscripts. Journal of Informetrics, 11(2):498–510, 2017.
[6] H. F. de Arruda, L. d. F. Costa, and D. R. Amancio. Using complex networks for text classification: Discriminating informative and imaginative documents. EPL (Europhysics Letters), 113(2):28007, 2016.
[7] Y.-H. Eom and S. Fortunato. Characterizing and modeling citation dynamics. PLoS ONE, 6(9):e24926, 2011.
[8] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, et al. Science of science. Science, 359(6379), 2018.
[9] M. Kaiser, M. Goerner, and C. C. Hilgetag. Criticality of spreading dynamics in hierarchical cluster networks without inhibition. New Journal of Physics, 9(5):110, 2007.
[10] V. Larivière, S. Haustein, and K. Börner. Long-distance interdisciplinarity leads to higher scientific impact. PLoS ONE, 10(3), 2015.
[11] L. Leydesdorff, C. S. Wagner, and L. Bornmann. Interdisciplinarity as diversity in citation patterns among journals: Rao-Stirling diversity, relative variety, and the Gini coefficient. Journal of Informetrics, 13(1):255–269, 2019.
[12] R. K. Merton. The Matthew effect in science: The reward and communication systems of science are considered. Science, 159(3810):56–63, 1968.
[13] S. Milojević. Accuracy of simple, initials-based methods for author name disambiguation. Journal of Informetrics, 7(4):767–773, 2013.
[14] K. Okamura. Interdisciplinarity revisited: evidence for research impact and dynamism. Palgrave Communications, 5(1):1–9, 2019.
[15] R. K. Pan, A. M. Petersen, F. Pammolli, and S. Fortunato. The memory of science: Inflation, myopia, and the knowledge network. Journal of Informetrics, 12(3):656–678, 2018.
[16] R. K. Pan, S. Sinha, K. Kaski, and J. Saramäki. The evolution of interdisciplinarity in physics research. Scientific Reports, 2(1):1–8, 2012.
[17] A. Pluchino, G. Burgio, A. Rapisarda, A. E. Biondo, A. Pulvirenti, A. Ferro, and T. Giorgino. Exploring the role of interdisciplinarity in physics: Success, talent and luck. PLoS ONE, 14(6), 2019.
[18] F.-X. Ren, H.-W. Shen, and X.-Q. Cheng. Modeling the clustering in citation networks. Physica A: Statistical Mechanics and its Applications, 391(12):3533–3539, 2012.
[19] F. N. Silva, F. A. Rodrigues, O. N. Oliveira Jr, and L. d. F. Costa. Quantifying the interdisciplinarity of scientific journals and fields. Journal of Informetrics, 7(2):469–477, 2013.
[20] F. N. Silva, A. Tandon, D. R. Amancio, A. Flammini, F. Menczer, S. Milojević, and S. Fortunato. Recency predicts bursts in the evolution of author citations. Quantitative Science Studies, 1(3):1298–1308, 2020.
[21] M. V. Simkin and V. P. Roychowdhury. Stochastic modeling of citation slips. Scientometrics, 62(3):367–384, 2005.
[22] R. Sinatra, D. Wang, P. Deville, C. Song, and A.-L. Barabási. Quantifying the evolution of individual scientific impact. Science, 354(6312), 2016.
[23] J. V. Tohalino and D. R. Amancio. Extractive multi-document summarization using multilayer networks. Physica A: Statistical Mechanics and its Applications, 503:526–539, 2018.
[24] V. A. Traag, L. Waltman, and N. J. van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1):1–12, 2019.
[25] H. Tuomisto. A diversity of beta diversities: straightening up a concept gone awry. Part 1. Defining beta diversity as a function of alpha and gamma diversity. Ecography, 33(1):2–22, 2010.
[26] Y. Xie. “Undemocracy”: inequalities in science. Science, 344(6186):809–810, 2014.
[27] A. Yegros-Yegros, I. Rafols, and P. D’Este. Does interdisciplinary research lead to higher citation impact? The different effect of proximal and distal interdisciplinarity. PLoS ONE, 10(8), 2015.