Bibliometric analysis on mathematics, 3 snapshots: 2005, 2010, 2015
Serge Richard∗, Qiwen Sun
Graduate School of Mathematics, Nagoya University, Chikusa-ku, Nagoya 464-8602, Japan
E-mails: [email protected], [email protected]
Abstract
We carry out a thorough bibliometric analysis of recent publications in mathematics based on the database Web of Science. The individual relations between various features and the citations are provided, and the importance of the features is investigated with decision trees. The evolution of the features over a period of 10 years is also studied. National and international collaborations are scrutinized, but personal information is fully disregarded.
Keywords: Citations, bibliometric predictors, mathematics, tree-based methods, international collaborations.
∗ Supported by the grant Topological invariants through scattering theory and noncommutative geometry from Nagoya University, and by JSPS Grant-in-Aid for scientific research C no 18K03328, and on leave of absence from Univ. Lyon, Université Claude Bernard Lyon 1, CNRS UMR 5208, Institut Camille Jordan, 43 blvd. du 11 novembre 1918, F-69622 Villeurbanne cedex, France.
arXiv: … [cs.DL] … Feb …
This research has been triggered by a simple question: Do international collaborations increase the number of citations in mathematics? By looking at the existing studies about this question in various fields of research [12, 18, 23, 24, 25], the easy and naive answer would be positive. However, some investigations show that a more precise answer depends on the field of research, and that additional information should be taken into account, see for example [9, 20, 21]. Thus, from the original narrow question, our interest has shifted to the more general question: What are the important predictors for the publications in mathematics, if the response is the number of citations?
Similar bibliometric investigations have already been performed, as for example in [7], but mathematics was not considered in this reference, and the analysis is partially model dependent. Specific to mathematics, let us mention the early studies [10, 11] based on MathSciNet, followed by [3] which discusses the different citation indices for mathematics journals, [4] which performs a bibliometric analysis on the period 1868–2008, and [13] which focuses on mathematics education. For very recent investigations, let us also mention [19] which also studies citations but from an individual perspective, [22] which provides a detailed bibliometric analysis over 40 years but on publications of a single journal, and [14] which studies US mathematics faculties with some bibliometric tools.
In order to provide a broad picture of recent publications in mathematics, and therefore complement some of the publications introduced above, our initial hope was to use MathSciNet, which is familiar to all mathematicians and which really focuses on mathematics publications. Unfortunately, MathSciNet does not allow any automated searching or downloading, and collecting enough information for any serious analysis turned out to be impossible (despite several requests). On the other hand, Web of Science, which is not specific to mathematics but contains information about mathematics among other fields, allows the collection of large amounts of data, and its supporting team answered all our inquiries. For these reasons, after a comparison of the two databases provided in Section 2, we concentrate in the subsequent sections on data provided by Web of Science only.
Let us now be more specific about the content of this paper. Our investigations focus on publications in mathematics for three years: 2005, 2010, and 2015. This choice allows us to see an evolution in the publication records over a period of ten years, without overwhelming us with too much data. For each of these three years, we collected between 45’000 and nearly 80’000 items related to mathematics, and for each item we kept the record of 10 features as predictors together with the response, namely the number of citations up to November 2020. These predictors are introduced and discussed in Section 3.
The preliminary analysis consists in looking at the response as a function of a single predictor. Let us immediately stress that since the response depends on time (the number of citations increases as years pass), all investigations are performed on the three years independently. Section 4 contains these results, presented either with graphs or with tables.
More precisely, the citations are provided successively as a function of
(i) the number of authors,
(ii) the number of countries associated with the authors,
(iii) the number of institutes associated with the authors,
(iv) the number of references provided by the authors,
(v) the number of pages of the publication,
(vi) the number of keywords provided by the authors,
(vii) the open access (or not) of the publication,
(viii) the journal impact factor JIF (if the publication has appeared in a journal with a JIF),
(ix) the research area of the publication,
(x) the categories associated with the publication.
More explanations and comments are provided in Section 4.
It clearly appears in these individual investigations that the response is related to some predictors, but how much information can be extracted from them, and what is their relative importance? These questions, and others, are discussed in Section 5. Because of the diversity of the predictors we opted for an approach based on tree-based methods, as introduced in [5]. Indeed, unlike the approach provided in [18] we do not want to consider some linear relations between the predictors and the response, but prefer an approach which divides the predictor space into several regions and associates to each region a local response. Alternatively, we could have borrowed some bibliometrix tools developed in [2] if our investigations were performed in R, but we opted for the tools available on the platform [17].
Several experiments are performed with trees, with some parameters chosen according to the year of publication and to the existence (or non-existence) of a JIF associated with the publications. Based on these experiments, the predictors can be ranked according to their importance. Another outcome of tree classifiers is the ability to predict the citations (at least within some predefined classes) based on the predictors. Clearly, the result is not very good, but the converse would have been even more surprising.
However, the predictions are better than a random guess, as explained in Section 5.
In Section 6 we turn our attention to countries: What information can be deduced from the individual publications about the research in the countries of the authors? Can one measure a kind of performance for each country? And what about collaborations between countries, which are related to our very initial question: can one measure these collaborations, and say something about them? Data for answering these questions are presented in Section 6 for the main countries, which means for the countries producing the majority of publications in mathematics. In fact, data covering about 130 countries were available, but for some of them, the annual number of publications is too limited to support any analysis.
With this paper we provide three snapshots (2005, 2010, 2015) of the publications in mathematics, and extract as much bibliometric information as possible. As already mentioned, we would have preferred working on the database MathSciNet because some information would have been more accurate. It is quite unfortunate that the policy and the tools provided by this website do not allow such investigations, as implicitly acknowledged in [8]. On the other hand, by using the Web of Science database, our investigations about mathematics have covered a slightly broader range of publications.
In this section we provide general information about publications in mathematics for the last 20 years. A few comparisons between the two databases MathSciNet (MSN) and Web of Science (WoS) are also presented. Finally, the data we shall use in the following sections are introduced, and some statistics are exhibited.
Figure 1: Yearly new indexed math publications
Since this research is based on data about mathematics, let us first have a look at two important sources of information. MathSciNet is an electronic database operated by the American Mathematical Society focusing exclusively on publications in mathematics.
In November 2020 it contained about 3.9 million items. Web of Science is a much more general database operated by the private company Clarivate. It is possible to select publications in mathematics by choosing the research area mathematics (SU=mathematics). In November 2020, the outcome for this general request was about 2.0 million items. Note that WoS also contains categories, and one of them corresponds to mathematics. However, by choosing this request (WC=mathematics) the number of items is 1.7 million, and these items are strictly contained in the previous request about research area.
Figure 2: Publications with at least one author from a given country: absolute and relative numbers from MSN
A comparison of the number of works published in the last 20 years is provided in Figure 1. For MSN, all works are reported, while for WoS it is again the works corresponding to the research area mathematics. Let us provide one more comparison between the two databases, based on one piece of information that will be used in the analysis. The information is related to the country in which research institutions (universities, research institutes, etc.) are located. For simplicity, we shall call this the country of the research institution, and by extension the country of the author working in this research institution. Figures 2 and 3 show the yearly publications and their relative numbers with at least one author from a research institution in one of the following countries: USA, China, France, Japan, Chile. The relative numbers are with respect to the total number of publications in mathematics indexed by MSN and WoS (shown in Figure 1).
Thus, even if the difference between the total numbers of items in the two databases is not negligible, we expect that their shapes and trends are similar.
Figure 3: Publications with at least one author from a given country: absolute and relative numbers from WoS
On the other hand, a unique feature of MSN is the Mathematics Subject Classification (MSC). This classification contains more than 60 subjects, and each one can be divided into numerous sub-subjects. Each publication is indexed by one primary subject or one primary subject with several secondary subjects. The subjects are usually chosen by the author(s) of a publication, or carefully assigned by the editors of MSN. The MSC provides rather precise information about the content of each publication. With this information, a refined plot of Figure 1 for MSN is shown in Figure 4. The eight combined subjects are elaborated in [16]. Unfortunately, WoS does not contain the MSC, and therefore we shall not be able to use this information in our investigations.
Figure 4: Yearly new indexed math publications (8 fields) based on the MSC
As mentioned in the Introduction, our analysis is based on the data of three years. With the request research area = mathematics, and once the items with no clear author or with no clear affiliation for the author(s) have been removed, the numbers of items collected are:
research area = mathematics, clear author(s) and affiliation(s)
2005: 45’035 items    2010: 62’945 items    2015: 76’788 items. (1)
The following statistics are computed on these numerous items. It has been observed in earlier publications that the average number of authors for each paper has been increasing over time, see for example [4, Figure 10]. Since our investigations are based on three distinct years, let us observe this effect on a period of 10 years, see Figure 5.
Figure 5: Distribution of the number of authors per publication
It clearly appears that the proportions of publications with 1 and 2 authors are decreasing, while the ones with 3, 4, or 5 and more authors are increasing. The average numbers of authors for these three years, based on the data mentioned in (1), are respectively:
2005: 2.10 authors    2010: 2.23 authors    2015: 2.39 authors. (2)
Note that other numbers confirm this increase in the collaborations for each publication. Indeed, if one looks at the average number of research institutes involved for each publication one gets
2005: 1.57 institutes    2010: 1.74 institutes    2015: 1.90 institutes. (3)
These numbers have been computed by counting the number of different addresses provided by the publications.
Since one of our interests is to study international collaborations, let us provide similar results for the average number of countries involved for each publication:
2005: 1.23 countries    2010: 1.28 countries    2015: 1.32 countries. (4)
Again, these numbers have been computed by counting the number of different countries mentioned in the list of addresses of the authors. If we look at the details, one obtains the distributions provided in Figure 6.
Figure 6: Distribution of the number of countries per publication
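The counting behind (3) and (4) can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the records are made up, each publication is represented by a hypothetical list of (institute, country) pairs, and the actual address format delivered by WoS is not reproduced here.

```python
from statistics import mean

# Made-up records: one list of (institute, country) addresses per publication.
# The names are placeholders, not data from WoS.
publications = [
    [("Institute A", "Japan")],
    [("Institute A", "Japan"), ("Institute B", "France")],
    [("Institute B", "France"), ("Institute B", "France"), ("Institute C", "USA")],
]


def count_distinct(addresses, field):
    """Number of different values of one address field (0: institute, 1: country)."""
    return len({address[field] for address in addresses})


n_institutes = [count_distinct(p, 0) for p in publications]
n_countries = [count_distinct(p, 1) for p in publications]

# Averages of the kind reported in (3) and (4), here on the toy records only.
avg_institutes = mean(n_institutes)
avg_countries = mean(n_countries)
print(n_institutes, n_countries, avg_institutes, avg_countries)
```

Repeated addresses are counted once, which is why the third toy publication contributes two institutes and two countries.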
In this section we introduce the predictors and the response that we have employed for our investigations, and make a few comments about them.
The predictors can be roughly divided into three categories, namely those related to the author(s) of a publication, those related to the publication itself, and those related to the journal or the physical support in which the publication has appeared. All of them have been extracted from the WoS database for the items mentioned in (1). Let us immediately stress that the authors’ names have been completely disregarded in our investigations.
• Author(s)
  authors: the number of authors
  countries: the number of countries
  institutes: the number of research institutes
Note that in addition to the number of countries involved for each publication, the exact list of countries will also be investigated in Section 6.
• Publication
  references: the number of references
  pages: the number of pages
  keywords: the number of keywords provided by the authors
  open access: open access
WoS provides some information about various types of open access. More precisely, publications which are partially or fully open access are identifiable in the database. This information is obtained in collaboration with EndNote Click, formerly called Kopernio. Since there exist various levels of open access, this predictor will be used cautiously.
• Journal
  jif: journal impact factor
  research areas: research areas
  categories: categories
The journal impact factor is computed by WoS and assigned to several journals for each year. More information about its computation and its weaknesses can be found at https://en.wikipedia.org/wiki/Impact_factor. As already mentioned, research areas and categories are classification indices provided by WoS. Several research areas and several categories have been assigned to each journal in the WoS database, based on several criteria. Compared with the rather precise definitions of the categories, the research areas are less precisely defined (this has been confirmed by the technical support from Clarivate who answered our inquiries). Moreover, compared with the Mathematics Subject Classification (MSC) from MathSciNet, these two predictors are far less precise. Not only are these indices not chosen by the authors, but they are common to all publications in one journal, and their assignment is not so clear. Nevertheless, they provide some vague information which deserves to be collected, and which will be further discussed later on.
For the response, the number of citations for each publication has been recorded. These numbers were collected in October/November 2020. It should be emphasized that these numbers range between 0 and some very large numbers. For example, one work published in 2005 has been cited up to 11’106 times (12’015 and 18’439 times for the most cited works published in 2010 and 2015). […]
research area = mathematics, clear author(s) and affiliation(s), citation < 64
2005: […]’792 items    2010: 61’084 items    2015: 76’168 items. (5)
In the subsequent computations, and in particular for computations of means, it is these items which are considered.
Let us finally mention that we could have considered the 95th percentiles for the three years, which means that the upper limit for 2010 and 2015 would have been lower than 64. As a result, the means related to these years would have been slightly smaller. However, since we cannot directly compare the citations between publications produced respectively in 2005, 2010 and 2015, the simpler choice of keeping the upper bound 64 instead of keeping 95th percentiles does not affect our investigations.
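To make the role of an upper bound on the citations concrete, here is a minimal sketch of a percentile-based truncation. It rests on assumptions that are not taken from the paper: the citation counts are made up, the percentile uses the nearest-rank definition, and the helper name `percentile_nearest_rank` is purely illustrative.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100), nearest-rank definition (an assumption)."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up citation counts for one year, with one very large outlier.
citations = [0, 0, 1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 30, 64, 500]

upper_bound = percentile_nearest_rank(citations, 95)
kept = [c for c in citations if c < upper_bound]  # strict bound, as in "citation < 64"

# The mean is far less sensitive to outliers once the truncation is applied.
print(upper_bound, len(kept), mean(citations), mean(kept))
```

On this toy list the bound happens to equal 64 by construction of the data; with real data, a lower bound for the later years would simply remove a few more highly cited items and lower the means slightly.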
In this section we study the relations of the individual predictors with the number of citations. A comparison between the predictors, based on a tree classifier, will be presented in the next section.
The items mentioned in (5) have been divided according to the number of authors (1, 2, 3 and more) and the respective distributions of citations have been reported in the first column of Figure 7.
Figure 7: Distributions depending on the number of authors or on the number of countries involved
For each group, the mean has been indicated with a vertical line, and the median has been reported with a vertical dashed line. These precise values are indicated in Table 1, as well as the proportions of publications with 1, 2, or 3 and more authors.

               2005                  2010                  2015
               %     mean  median    %     mean  median    %     mean  median
1 author       35.3   7.8  3         30.4   5.9  2         25.6   3.4  1
2 authors      36.6  10.8  6         36.5   9.0  5         35.9   5.1  2
≥ 3 authors    […]
Table 1: […]

Let us now divide the items according to the number of countries appearing in the list of addresses of the authors. A division into 1, 2, 3 and more countries has also been performed, and the distribution of citations is reported in the second column of Figure 7. The values of the means, the medians, and the proportions are provided in Table 2.

               2005                  2010                  2015
               %     mean  median    %     mean  median    %     mean  median
1 country      80.0   9.2  4         76.5   7.7  3         73.7   4.7  2
2 countries    17.4  13.5  8         19.8  10.8  6         21.4   6.6  3
≥ 3 countries  […]
Table 2: […]

[…] n countries correspond to at least n authors, one (or more) in each country. However, there also exist tens of items with one author having two main addresses in two different countries. Since these situations reflect an additional facet of internationalization, we have not tried to separate these two effects.
Another division according to the number of different research institutes appearing in the list of addresses is performed. We divided the items into 1, 2, 3, 4 and more institutes, and computed the proportion, the mean and the median for each of these groups. These numbers are reported in Table 3. As for the previous two predictors, the proportion of publications involving only one institute decreases over time, while the proportion of publications involving two and more institutes increases over time.
By looking at these three predictors, it appears quite clearly that an increase in the number of authors, countries, or institutes corresponds to an increase in the number of citations. In fact, both the mean and the median are increasing with these predictors.

               2005                  2010                  2015
               %     mean  median    %     mean  median    %     mean  median
1 institute    60.7   8.0  3         51.7   6.8  3         44.7   4.2  2
2 institutes   27.6  12.5  8         31.0   9.6  5         33.1   5.5  3
3 institutes    8.8  14.5  9         12.4  10.8  6         14.6   6.6  3
≥ 4 institutes […]
Table 3: […]

For most items provided by WoS, the number of references mentioned by the authors of a publication is provided. A classification depending on the number of references has been carried out, and Figure 8 provides the information about the number of publications (y-axis) with a given number of references (x-axis), together with the citation mean (color). Note that in the three graphs, the grey color corresponds to the mean citation over all data of the respective year. These means appear in Table 6 (first row), but for the record let us already mention them:
2005: 10.0    2010: 8.5    2015: 5.3. (6)
[…]
Similar investigations can be performed with the number of pages. Figure 9 contains the outcomes of this investigation, with the number of items for a given number of pages together with information about the citation mean.
Figure 9: Citations depending on the number of pages
The conclusion is similar to the one obtained for Figure 8, namely a positive correlation between the number of pages and the citations. Once again, the average number of pages can be computed. The expected numbers of pages are
2005: 15.[…]
Let us now have a look at keywords. WoS lists the keywords provided by the authors, and a second list of keywords introduced by WoS. Let us immediately say that we were interested only in the first list. In Table 4, we report the proportions of publications as a function of the number of keywords, and also provide the means and the medians. Note that there is no information about keywords for about 1/[…]

               2005                  2010                  2015
               %     mean  median    %     mean  median    %     mean  median
[…]
≥ 11            0.5  15.3  11         0.5  11.2  7          0.6   6.6  4
Table 4: Citations depending on the number of keywords

What appears in this table is a positive relation between the number of keywords and the citation mean as long as the number of keywords is between 1 and 5. For the items with more than 5 keywords, no clear relation between the number of keywords and the citation mean is visible.
Let us still provide the average number of keywords. Based on WoS, the expected numbers of keywords are:
2005: 2.78 keywords    2010: 3.19 keywords    2015: 3.49 keywords. (10)
In order to eliminate the bias related to 0 keywords, the statistics with 0 keywords removed have also been computed, and one gets
2005: 4.38 keywords    2010: 4.48 keywords    2015: 4.58 keywords. (11)
As for the number of references and the number of pages, it is visible in (10) and in (11) that the number of keywords provided by the authors is also following an increasing trend.
As already mentioned, WoS lists the items which have a partial or a full open access. Since this information is available, let us just provide the proportion of items having any kind of open access, together with the respective mean and median. The outcomes are summarized in Table 5.

               2005                  2010                  2015
               %     mean  median    %     mean  median    %     mean  median
OA             16.4  12.2  7         21.2   9.7  5         32.3   5.7  3
no OA          83.6   9.6  5         78.8   8.1  4         67.7   5.1  2
Table 5: Citations with or without open access

The table contains one piece of information which is not surprising: the proportion of items with an open access increases over the years. Also, both the means and the medians are larger for the items with an open access compared to the items without this access. However, even if these results look quite natural, we think that further and more precise information would be necessary to get a better picture of the impact of open access.
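The three statistics reported in the tables of this section (proportion, mean and median of the citations within each group) can be computed with a generic helper. The following is a minimal sketch on made-up items; the field names `open_access` and `citations` and the function `group_summary` are hypothetical, chosen only for the illustration.

```python
from collections import defaultdict
from statistics import mean, median


def group_summary(items, key):
    """Proportion (in %), citation mean and citation median for each value of `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item["citations"])
    total = len(items)
    return {
        value: {
            "percent": 100 * len(cited) / total,
            "mean": mean(cited),
            "median": median(cited),
        }
        for value, cited in groups.items()
    }


# Made-up items; real items would come from the list described in (5).
items = [
    {"open_access": True, "citations": 7},
    {"open_access": True, "citations": 12},
    {"open_access": False, "citations": 0},
    {"open_access": False, "citations": 5},
    {"open_access": False, "citations": 10},
]

summary = group_summary(items, "open_access")
print(summary)
```

The same helper applies to any discrete predictor (number of authors, countries or institutes), provided the larger values are first merged into a single group such as "3 and more".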
By definition, the journal impact factor (JIF) is available only for journals, and not for all items. Note that WoS does not automatically link the publications appearing in journals with the corresponding JIF. However, we were able to track this information for about 2/3 of our items. More precisely, a JIF has been associated with the following numbers of items from the list described in (5):
research area = mathematics, clear author(s) and affiliation(s), citation < 64, JIF available
2005: 31’556 items    2010: 44’639 items    2015: 57’756 items. (12)
It is not surprising that this list of items is biased. Indeed, if we look at the citation means over these items, the numbers are not exactly the ones appearing in (6) but
2005: 12.[…]
[…].1, collected all papers linked to a JIF in each subinterval and computed their mean and median. The information about the number of items corresponding to each subinterval is also mentioned. For a JIF below 2, a nearly linear relation between JIF and citation mean or median is quite visible. On the other hand, for values of the JIF above 2, the relation is less clear. But this part of the statistics is performed on far fewer items, which makes it less reliable.
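The grouping of items by JIF subinterval can be sketched as follows. Several points are assumptions rather than facts from the paper: the subinterval length is taken to be 0.1 (the value is only partly legible in the text above), the JIF is assumed to carry at most three decimals, and the (JIF, citations) pairs are made up.

```python
from collections import defaultdict
from statistics import mean, median


def jif_bin(jif):
    """Index k of the subinterval [k/10, (k+1)/10) containing the JIF.

    Working in thousandths avoids floating-point surprises such as 0.3 / 0.1 < 3.
    """
    return int(round(jif * 1000)) // 100


def citations_by_jif(items):
    """Map each subinterval index to (number of items, citation mean, citation median)."""
    bins = defaultdict(list)
    for jif, citations in items:
        bins[jif_bin(jif)].append(citations)
    return {k: (len(c), mean(c), median(c)) for k, c in sorted(bins.items())}


# Made-up (JIF, citations) pairs.
items = [(0.30, 2), (0.349, 4), (0.399, 6), (1.20, 10), (1.25, 20)]
stats = citations_by_jif(items)
print(stats)
```

Reporting the number of items next to the mean and the median makes it visible which subintervals are too sparsely populated to be reliable, in line with the remark on JIF values above 2.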
To each item, WoS associates one or several research area(s). As mentioned in the previous section, our selection was based on research area = mathematics, which means that all data have at least mathematics in the list of their research areas. However, most of them possess more than just one research area. For the data from 2005, it turns out that only 12 additional research areas were contained in at least 1% of the considered items. The list of these research areas is: Mathematics, Computer Science, Engineering, Physics, Mechanics, Operations Research & Management Science, Mathematical & Computational Biology, Mathematical Methods in Social Sciences, Business & Economics, Biochemistry & Molecular Biology, Automation & Control Systems, Science & Technology, Biotechnology & Applied Microbiology.
Figure 10: Citations depending on the Journal Impact Factor
Based on this list, we have computed the fraction, the mean and the median of all the items introduced in (5) which possess one of these entries as a research area. The result is provided in Table 6. In this table, we have also reported the relative citation, which corresponds to the citation mean of a particular research area divided by the citation mean of the same year for all the data. It turns out that these computations give some interesting results: the range of this relative citation is between 0.[…] and […].8, with the lowest value shared in 2005 and 2010 by Engineering and by Operations Research & Management Science. On the other hand, most of the highest values are reached by research areas related to biology or to biotechnology. Note that the median follows a similar pattern, but since the computation of a relative median does not look so natural, we have refrained from providing such relative information. Thus, this table confirms that the citation mean or median really depends on the research areas, and that some striking differences exist. This fact is well documented, and has led to the development of several relative indices, see for example [1, 6].
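For the record, the relative citation described above can be written as a formula. The notation $\bar c_y(A)$ and $r_y(A)$ is introduced here only for illustration and does not appear in the original text; the numerical check uses the first row of Table 6, and the second value is explicitly hypothetical.

```latex
% \bar c_y(A): citation mean of the items of year y having A among their research areas
% \bar c_y   : citation mean of all items of year y, as given in (6)
\[
  r_y(A) \;=\; \frac{\bar c_y(A)}{\bar c_y}.
\]
% Check with the first row of Table 6: every item has Mathematics as a research area, so
\[
  r_{2005}(\text{Mathematics}) = \frac{10.0}{10.0} = 1.0 .
\]
% Hypothetical illustration (not a value from the paper): an area with citation mean 15.0
% in 2005 would have relative citation 15.0 / 10.0 = 1.5.
```

Since the denominator depends on the year, relative citations can be compared across 2005, 2010 and 2015 even though the raw citation means cannot.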
Categories correspond to another indexation of the items chosen by WoS. They correspond to rather broad research fields, but some of them also coincide with research areas. The main categories (different from any research area) appearing in our items are the following: Applied Mathematics, Mathematics, Statistics & Probability, Mathematics (Interdisciplinary Applications), Computer Science (Interdisciplinary Applications), Mathematical Physics, Computer Science (Theory & Methods), Engineering (Multidisciplinary), Mechanics.
For each of these categories, the proportion, the citation mean, the relative citation, and the median have been reported in Table 7. The variations already observed in Table 6 are also visible here. Note that even in the three main categories, namely Mathematics, Applied Mathematics, and Statistics & Probability, […]

                   2005                     2010                     2015
                   %    mean  rel. c. med.  %    mean  rel. c. med.  %    mean  rel. c. med.
Mathematics        100  10.0  1.0     5     100   8.5  1.0     4     100   5.3  1.0     2
Comp. Science      […]
Engineering        […]
Physics            […]
Mechanics          […]
Op. Research       […]
Math. & C. Bio.    […]
Math. M. Soc. S.   […]
Bus. & Eco.        […]
Bio. & M. Bio.     […]
Auto. & C. Syst.   […]
Science & Tech.    […]
Bio. & A. Mic.     […]
Table 6: […]

                   2005                     2010                     2015
                   %    mean  rel. c. med.  %    mean  rel. c. med.  %    mean  rel. c. med.
Math. app.         […]
Mathematics        […]
Sta. & Prob.       […]
Math., Int. App.   […]
C. S., Int. App.   […]
Math. Phys.        […]
C. S., T. & M.     […]
Eng., Mult.        […]
Mechanics          […]
Table 7: […]
In the previous section, we studied the relationships between the number of citations and some individual predictors extracted from the WoS database. In several figures and tables, it clearly appears that the citations are related to these predictors. The remaining questions are how much of the information on the citations can be explained by the predictors, and what the relative importance of the predictors is. To answer these questions, we use decision trees, as thoroughly introduced in [5].
A tree classifier is a procedure that divides a data set into two or more subsets based on some predetermined criteria. Let us explain this on a simple example. Suppose that the response contains two classes: Yes and No, which are represented by red and blue dots in Figure 11. Suppose also that there are two predictors $X_1$ and $X_2$ attached to each item. The tree classifier can identify the best split value of one of the predictors such that the purity in each subset is enhanced. After the splitting into two subsets, further splits will be carried out independently for each subset. The classifier goes through all the possible values of the predictors at each split. It stops splitting when a stopping rule is met. During the process, each subset is called a node, and the nodes without further subsets are called leaves. When a node contains more than one class (which means that the node is not pure), the class with the majority of items is selected as the label class of the node. The misclassification rate of the node corresponds to the ratio of items in the node belonging to a class which is not the label class.
Figure 11: Tree-structured classifier
There exist several criteria for the choice of the split value, but all of them are based on an impurity function which has to be minimized. More precisely, let $H$ denote such an impurity function, and let us consider a split of the content of one node $t$ into two subsets, called $t_{\rm left}$ and $t_{\rm right}$. Then one sets
$$Q_t = \frac{N_{t_{\rm left}}}{N_t} H(t_{\rm left}) + \frac{N_{t_{\rm right}}}{N_t} H(t_{\rm right}), \tag{14}$$
where $N_t$, $N_{t_{\rm left}}$, and $N_{t_{\rm right}}$ denote the number of items in the node $t$ and in the two subsets. The function $H$ is evaluated on the items of $t_{\rm left}$ and $t_{\rm right}$. By considering all possible splits, we choose the one which generates the minimal value for $Q_t$.
For the impurity function, a few canonical choices are possible.
In order to define them, consider now that the response contains $J$ classes, and assume that in a node $t$ the items are distributed following a distribution $\{p_j\}_{j=1}^J$, with $p_j$ the proportion of items in the class $j$. Then, some canonical impurity functions are
Gini index: $H(t) = \sum_j p_j (1 - p_j)$
Entropy: $H(t) = -\sum_j p_j \ln(p_j)$
Misclassification: $H(t) = 1 - \max_j (p_j)$
In the subsequent investigations, we shall use the impurity function provided by the Gini index only.
By building a tree with the above process, we often end up with a very big tree: many leaves and a big height, which corresponds to the maximal distance between the first node (root) and the farthest leaf. Such a big tree often leads to an overfitting phenomenon. Thus, a pruning procedure needs to be carried out to preserve the effective tree structure and to reduce the risk of overfitting. One can perform the pruning procedure by successively removing the leaves which contribute the least to the decrease of the impurity, namely the weakest leaves. There exist several ways of implementing such a process; let us therefore only sketch the main ideas of the cost complexity pruning. To each node $t$, one associates a real coefficient $\alpha_{\rm eff}(t)$ which takes into account the misclassification of the node, the misclassification of the subtree having $t$ as a root, and the number of leaves of the subtree, see for example [17]. Then, starting from 0 and by slowly increasing a parameter $\alpha$, one successively prunes the nodes with $\alpha_{\rm eff}$ smaller than $\alpha$. Obviously, an additional stopping rule has to be fixed, otherwise the process would end up keeping only the root, which means the original set of items without any subdivision. Again, several options exist. Before presenting one of these stopping rules, let us mention some outcomes of the pruning operation.
Consider that $\alpha$ is slowly increasing from 0, and that the pruning is taking place.
It is clear that the number of leaves and the height of the successive trees are decreasing. On the other hand, the total misclassification of the tree tends to increase. Equivalently, the train accuracy (ratio of correctly classified items in the leaves) tends to decrease as $\alpha$ increases. It is thus natural to look for a suitable value of $\alpha$, leading neither to a too small tree which is not able to classify effectively, nor to a too large tree with a high risk of overfitting.
Such a suitable $\alpha$ can be obtained by testing the tree on new items. Indeed, consider a new item having all necessary predictors and labeled by one class. According to the value of its predictors, the item can be placed in a unique leaf with a labeled class. It may be correctly or incorrectly classified. By repeating this operation on several new items, one gets a test accuracy, the ratio of correctly classified new items. Again, if one considers the family of trees obtained by pruning according to the parameter $\alpha$, one observes that the test accuracy starts by increasing with $\alpha$, before decreasing again. Since we are interested in the highest test accuracy, we then select the optimal value $\alpha_{\rm opt}$ of the parameter $\alpha$ corresponding to the maximum test accuracy. A typical example of the train accuracy and the test accuracy as a function of $\alpha$ is provided in Figure 12. It is then this $\alpha_{\rm opt}$ which is used for stopping the pruning process.
Figure 12: Train accuracy and test accuracy, as a function of $\alpha$
In this section, we implement these processes and describe the precise experiment we have performed. The analysis tools are provided by the Scikit-learn package (Supervised Learning / Decision Trees), see [15]. The experiments will be done independently on the three datasets of 2005, 2010, and 2015. For the predictors, we shall use those introduced and discussed in Section 3. However, for the predictor jif, some items miss this information because they do not have an associated JIF.
We have then decided to perform the experiment independently on two lists of items. The first list contains all items, as shown in (5), and the predictor jif is not used. The second list contains the items with a JIF, as shown in (12), and the predictor jif is included in the list of all possible predictors.

For the six lists of items mentioned above, we define a family of classes, namely a partition of the set $\{0, 1, 2, \dots, 63\}$, corresponding to the $J$ classes mentioned in the previous section. The partitions should be relevant and understandable, and they should also be chosen according to the specificity of the different lists of items. For example, the following partition will be used for the first list of items of 2005:

weak: [0, 5], normal: [6, 12], good: [13, 20], very good: [21, 40], excellent: [41, 63]. (15)

Let us now describe the precise construction of the tree classifier, and the pruning process.

(i) Fix one dataset among the six presented in (5) and in (12),
(ii) Fix one relevant partition for the dataset, as for example the one presented in (15). The number of classes defined by this partition is denoted by $J$ and corresponds to the number of intervals,
(iii) Label the items in the dataset with one of the $J$ classes, according to their citations, and select randomly $X$ items for each class. For our experiment we have chosen $X$ equal to 80% of the items of the smallest class (which has always been the class corresponding to the highest citations). One ends up with $JX$ items equally distributed among the $J$ classes,
(iv) Divide randomly these $JX$ items into $K$ folds $\{\Lambda_k\}_{k=1}^K$ of equal size, for $K \in \mathbb{N}$, and fix $k = 1$,
(v) The fold $\Lambda_k$ is called the test set, and the other $K - 1$ folds form the training set. A classification tree is built with the training set,
(vi) On the classification tree, the pruning process introduced in the previous section is performed: the computation of $\alpha_{\mathrm{eff}}(t)$ for each node $t$, the increase of $\alpha$, the pruning of the weakest nodes (i.e. the ones with the smallest $\alpha_{\mathrm{eff}}$), the computation of the train accuracy and of the test accuracy. These accuracies, denoted respectively by $a_k(\alpha)$ and $b_k(\alpha)$, are computed for each $\alpha$ (but are piecewise constant) and are stored. The value $k$ is updated by setting $k := k + 1$,
(vii) The steps (v) and (vi) are repeated as long as $k \leq K$,
(viii) For each $\alpha$, the average $a(\alpha)$ and the average $b(\alpha)$ are computed by averaging the $K$ values, and $\alpha_{\mathrm{opt}}$ is deduced by generating a graph like Figure 12 with the average values ($K$-fold cross-validation),
(ix) The optimal tree is built based on the $JX$ items and pruned with the $\alpha_{\mathrm{opt}}$ found at the previous step.

Note that in our experiments we have used the parameter $K = 5$. Once this optimal tree is built, the relative importance of the predictors can be obtained. Indeed, the relative importance of the predictors can be computed by looking at the weighted impurity decrease at each node. By using the notation already introduced in (14), this quantity is provided for each node $t$ by the formula

\[ \frac{N_t}{N} \Big( H(t) - \frac{N_{t_{\mathrm{left}}}}{N_t} H(t_{\mathrm{left}}) - \frac{N_{t_{\mathrm{right}}}}{N_t} H(t_{\mathrm{right}}) \Big), \]

where $N$ is the total number of items in the tree. Clearly, if a node is a leaf, there is no contribution to be subtracted. The above quantity provides information about the decrease in the impurity provided by a subdivision. Then, since each subdivision is associated with a single predictor, this decrease in the impurity is credited to the corresponding predictor. By summing up the decrease in impurity due to all subdivisions associated with one predictor, one obtains the total decrease in impurity due to this predictor. The predictors are finally ordered by decreasing order of their total decrease in impurity.

In Table 8, we provide various information obtained with the tree classifier, namely the relative importance of the predictors, as mentioned above, but also the number of leaves and the height of each tree.
The average train accuracy and the test accuracy are also reported. These two values correspond to the two accuracies obtained at $\alpha_{\mathrm{opt}}$ after cross-validation.

                      2005                  2010                  2015
JIF               No        Yes         No        Yes         No        Yes
classes           [0,5]     [0,6]       [0,4]     [0,5]       [0,3]     [0,4]
                  [6,12]    [7,14]      [5,10]    [6,12]      [4,9]     [5,11]
                  [13,20]   [15,23]     [11,20]   [13,24]     [10,63]   [12,63]
                  [21,40]   [24,45]     [21,63]   [25,63]
                  [41,63]   [46,63]
Experiment (%)    39        40          44        43          53        53
[Predictor rows: references, authors, countries, institutes, pages, jif, research area (ENG.; ENG.; Bio. & A. M.); their cell values are not recoverable from the source.]

Table 8: Experiments with tree classifier

Let us now make several comments about Table 8.

(i) First of all, the number of classes and their precise values are partially arbitrary, and were determined after several preliminary tests. Note that we considered fewer classes for the recent data since the citations accumulate with time.

(ii) The size of the resulting tree is determined by the computation of $\alpha_{\mathrm{opt}}$ as explained above. We were surprised that these trees are relatively small, with a number of leaves between 15 and 49, and a height of at most 10. On the other hand, the 6 trees contained between 4,800 and 20,000 items.

(iii) About the accuracies: since the $J$ classes are of equal size, a random guess for an item of the test set would give a correct prediction with a probability of 20% for the data of 2005, 25% for the data of 2010, and 33.3% for the data of 2015. Thus, the difference between these values and the test accuracy can be understood as the gain in prediction due to the tree. However, since the train set and the test set do not follow the initial distributions of items, a different computation has to be performed for an arbitrary set of items in the initial dataset (see below).

(iv) The ranking of the predictors for each of the 6 trees is reported in the table, but only for the first 4. It turns out that only 7 predictors among 10 appear in the ranking. For publications which appear in a journal with a JIF, the predictor jif is always chosen first, and the number of references used by the authors is chosen as the second predictor. On the other hand, for the larger set of publications, with no use of the predictor jif, it appears that the number of references is always chosen as the first predictor, while the second one is different in the three experiments. Note that the importance of the predictor references was already anticipated in Section 4 just by looking at the sharp contrasts of Figure 8.

(v) An additional experiment has been performed with the items not used for the construction of the trees. Indeed, 20% of the smallest class was still available, and 20% of the untouched items in the other classes could be selected randomly. We apply the classification tree to the new test set which is made up of these items. The last row of Table 8 provides the percentage of correctly assigned classes. Clearly, these accuracies do not coincide with the train accuracy or the test accuracy, since the corresponding items do not share the same distributions. The interpretation is the following: given an arbitrary item from one of the three years, and based on the constructed trees, our ability to predict correctly the citation class corresponds to the last row of the table.
Not surprisingly, this accuracy increases as the number of classes decreases, but the knowledge of the JIF for the item does not improve our prediction. Even if these numbers show the limitations of our approach, they also provide a heartwarming message: the content of a publication still matters for the citations, and no bibliometric analysis will be able to predict this.
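As a complement to the ranking of the predictors discussed in this section, the following Python sketch illustrates how the weighted impurity decrease can be accumulated per predictor, with the Gini index as impurity function. It is only an illustration under stated assumptions: the functions `gini`, `weighted_decrease` and `importances`, as well as the toy tree with two classes, are hypothetical and are not taken from the experiments, and the totals are left unnormalized.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini index H(t) = sum_j p_j (1 - p_j) of the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def weighted_decrease(parent, left, right, n_total):
    """(N_t / N) * (H(t) - N_left/N_t * H(left) - N_right/N_t * H(right))."""
    n_t = len(parent)
    return (n_t / n_total) * (
        gini(parent)
        - len(left) / n_t * gini(left)
        - len(right) / n_t * gini(right)
    )


def importances(splits, n_total):
    """Sum the weighted decreases per predictor, largest total first."""
    totals = defaultdict(float)
    for predictor, parent, left, right in splits:
        totals[predictor] += weighted_decrease(parent, left, right, n_total)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy tree on 8 items with two classes ("weak", "good").
root = ["weak"] * 4 + ["good"] * 4
left = ["weak"] * 3 + ["good"]
right = ["weak"] + ["good"] * 3
splits = [
    ("references", root, left, right),              # subdivision at the root
    ("authors", left, ["weak"] * 3, ["good"]),      # both children are pure
    ("references", right, ["weak"], ["good"] * 3),  # both children are pure
]
ranking = importances(splits, n_total=len(root))
for predictor, total in ranking:
    print(f"{predictor}: {total:.4f}")
```

In this toy tree all leaves are pure, so the totals add up to the Gini index of the root, and the predictor used for two subdivisions is ranked first.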
About countries
In this last section we provide some statistics related to countries. Indeed, by selecting the items according to the location of the corresponding research institutes, some additional information can be extracted.

In Table 9 we provide a comparison between the main 25 countries (according to their number of publications). These statistics can be thought of as a kind of relative performance. More precisely, Table 9 contains the following entries:

%: Percentage of the publications having at least one author from the given country,
h.c.: (highly cited) Percentage of publications having more than 63 citations with at least one author from the given country,
mn: (mean) Citation mean for the publications having less than 64 citations and at least one author from the given country,
md: (median) Median for the publications having less than 64 citations and at least one author from the given country,
r.c.: (relative citation) Ratio of mn by the citation mean over all publications with less than 64 citations.

The following abbreviations for the countries are used: US = USA, CN = People's R. China, FR = France, DE = Germany, IT = Italy, UK = England, CA = Canada, JP = Japan, ES = Spain, RU = Russia, AU = Australia, KR = South Korea, PL = Poland, IL = Israel, NL = Netherlands, IN = India, BR = Brazil, TW = Taiwan, BE = Belgium, CH = Switzerland, SE = Sweden, GR = Greece, CZ = Czech Republic, AT = Austria, TR = Turkey.
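The five entries defined above can be illustrated by a short Python sketch. It relies on several assumptions which are not part of the original analysis: the layout of the records (pairs made of a set of country codes and a number of citations), the function name `country_stats`, the toy data, the reading of h.c. as a share among the highly cited publications, and the computation of % over all publications, outliers included.

```python
from statistics import mean, median

OUTLIER = 63  # publications with more than 63 citations count as highly cited


def country_stats(pubs, country):
    """Entries in the style of Table 9 for one country.

    pubs: list of (set_of_country_codes, citations) pairs (hypothetical layout).
    """
    regular = [c for _, c in pubs if c <= OUTLIER]
    highly = [p for p in pubs if p[1] > OUTLIER]
    own = [p for p in pubs if country in p[0]]
    own_regular = [c for _, c in own if c <= OUTLIER]
    own_highly = [p for p in highly if country in p[0]]
    mn = mean(own_regular)
    return {
        "%": 100 * len(own) / len(pubs),
        "h.c.": 100 * len(own_highly) / len(highly) if highly else 0.0,
        "mn": mn,
        "md": median(own_regular),
        "r.c.": mn / mean(regular),
    }


# Hypothetical toy dataset: (countries of the authors, citations).
pubs = [
    ({"US"}, 10),
    ({"US", "CN"}, 20),
    ({"CN"}, 4),
    ({"FR"}, 6),
    ({"US", "FR"}, 100),  # highly cited
    ({"CN"}, 80),         # highly cited
]
stats = country_stats(pubs, "US")
print(stats)
```

Since a publication with authors from several countries is counted once for each of them, the entries % and h.c. computed in this way can sum up to more than 100 over all countries.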
Table 9: Statistics for the main countries

Let us make some comments about this table. First of all, the columns % and h.c. sum up to more than 100 (when all countries are considered) because publications involving authors from different countries are counted more than once. Note also that it is the first (and last) time that outliers (namely publications with more than 63 citations) are used: the columns h.c. provide the information on how much the countries are involved in the highly cited publications. The columns r.c. can be thought of as a comparison between the performance of these countries: it is rather striking that the relative citations take values between 0.6 and 1.5. Some countries are clearly performing better than others. On the other hand, these data do not present a clear pattern over the 3 years considered, and the relative citations are quite stable. Only a few countries show a small variation of their relative citations over the 3 years, but the evolution is not so noticeable. Most probably, a period of 10 years is not long enough to assert a real change of performance for a given country.

The next table, Table 10, is more related to the importance of international collaborations for each country. By international collaboration we mean a publication with one author from the given country, and at least one author from another country.
More precisely, Table 10 contains the following entries, again for the main 25 countries:

%: Percentage of publications of a given country which are international collaborations,
mn: (mean) Citation mean for the publications of a given country which are international collaborations and which have less than 64 citations,
md: (median) Median for the publications of a given country which are international collaborations and which have less than 64 citations,
rcc: (relative citation for international collaborations) Ratio of mn by the citation mean over all publications of this country with less than 64 citations,
rc2: (relative citations for international collaborations / 2 authors) Ratio of mn by the citation mean over all publications of this country with less than 64 citations and at least 2 authors.

Note that we have added the column rc2 in order to eliminate a bias. Indeed, an international collaboration involves at least 2 authors (with a few exceptions already mentioned in Section 4.2) while arbitrary publications from any country also include many single-author productions. Since these publications are usually less cited (see Table 1), we eliminate this bias by considering only publications with at least 2 authors.

By looking at this table, the interest of international collaborations is quite clear. All the relative citations appearing in rcc or rc2 are greater than or equal to 1. In fact, some countries benefit a lot from international collaborations, with a factor rc2 reaching a maximum value of 1.5. On the other hand, for some countries, publications which are international collaborations do not receive substantially more citations than publications involving only researchers in this country. As a rule and not surprisingly, authors in a country with a low relative citation in Table 9 benefit more from international collaborations than authors from a country with a high relative citation.
Fortunately, no researcher from any country is penalized by establishing international collaborations: such an unfortunate situation would result in an rcc or rc2 smaller than 1 in Table 10.

In Table 11, we provide more specific information about bi-national collaborations. More precisely, since our dataset is large enough, collaborations between two countries can be extracted. They correspond to publications involving at least one author in a country X, and one author in a country Y. Additional authors and/or countries can also be involved. The number of such publications can then be divided either by the total number of international collaborations of the country X, or by the total number of international collaborations of the country Y. In the first case, it gives for the country X the relative importance of collaborations with the country Y, while in the second case it gives for the country Y the relative importance of collaborations with the country X. Table 11 contains this information for 16 countries and for the three years. The three values are provided in each cell, with 2005 on the top, 2010 in the middle, and 2015 on the bottom.
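The two ratios attached to a pair of countries, as described above for Table 11, can be sketched in Python as follows. The function `bilateral_share`, the representation of a publication by the set of countries of its authors, and the toy data are assumptions made for this illustration only.

```python
def bilateral_share(pubs, x, y):
    """Percentage of the international collaborations of x that also involve y.

    pubs: list of sets of country codes, one set per publication
    (hypothetical layout). A publication is an international collaboration
    of x if it involves x and at least one other country.
    """
    intl_x = [countries for countries in pubs if x in countries and len(countries) > 1]
    both = [countries for countries in intl_x if y in countries]
    return 100 * len(both) / len(intl_x) if intl_x else 0.0


# Hypothetical toy dataset.
pubs = [
    {"US", "CN"},
    {"US", "FR"},
    {"US", "CN", "DE"},  # additional countries can be involved
    {"CN", "AU"},
    {"US"},              # not an international collaboration
    {"US", "UK"},
]
share_us_cn = bilateral_share(pubs, "US", "CN")  # importance of CN for US
share_cn_us = bilateral_share(pubs, "CN", "US")  # importance of US for CN
print(share_us_cn, share_cn_us)
```

The same two joint publications are divided by two different totals, which is why the two ratios differ and why each pair of countries contributes two different entries to a table of this kind.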
                2005                          2010                          2015
       %    mn    md   rcc   rc2     %    mn    md   rcc   rc2     %    mn    md   rcc   rc2
US  30.1  14.4     9   1.2   1.0  34.9  12.3     7   1.2   1.0  42.4   7.5     4   1.2   1.0
CN  23.7  14.1     9   1.3   1.2  21.2  12.4     7   1.4   1.4  25.5   8.6     5   1.3   1.3
FR  38.7  14.4     9   1.2   1.1  48.1  11.5     7   1.1   1.0  54.7   6.6     4   1.1   1.0
DE  40.1  13.6     9   1.2   1.1  47.6  10.8     7   1.2   1.1  51.8   7.1     4   1.2   1.0
IT  33.7  13.6     8   1.4   1.2  42.4  11.5     7   1.2   1.1  49.5   7.8     5   1.2   1.1
UK  46.8  14.3     9   1.2   1.0  54.0  12.3     7   1.1   1.0  61.5   7.3     4   1.1   1.0
CA  51.2  13.6     8   1.2   1.1  53.7  11.4     7   1.2   1.1  60.8   6.4     3   1.1   1.0
JP  23.3  12.3     8   1.4   1.2  30.3   8.7     5   1.4   1.2  33.3   6.3     3   1.5   1.3
ES  36.0  14.2    10   1.3   1.2  46.2  10.5     6   1.2   1.1  54.4   6.4     4   1.1   1.1
RU  29.3  11.5     7   1.8   1.5  30.3   9.0     5   1.6   1.2  25.0   5.9     3   1.5   1.3
AU  40.9  14.2     9   1.5   1.4  56.8  11.2     6   1.2   1.1  57.7   8.2     4   1.3   1.2
KR  31.5  13.0     7   1.4   1.3  41.9  10.7     6   1.3   1.2  40.7   6.8     3   1.4   1.2
PL  32.0  11.9     8   1.4   1.2  34.1   9.3     5   1.3   1.2  38.1   6.1     3   1.3   1.1
IL  46.9  13.2     8   1.2   1.0  54.0  10.4   6.5   1.2   1.1  58.1   6.7     4   1.2   1.1
NL  42.6  13.9     9   1.2   1.1  54.9  11.5     7   1.2   1.1  63.7   6.8     3   1.0   1.0
IN  31.0  12.6     7   1.2   1.2  31.3  10.7     6   1.2   1.1  26.4   6.3     3   1.4   1.3
BR  39.8  13.4     9   1.3   1.2  44.1  10.2     6   1.2   1.1  43.8   6.6     4   1.3   1.3
TW  29.5  13.6     9   1.1   1.1  35.4  10.8     7   1.2   1.1  40.8   6.4     4   1.3   1.3
BE  45.1  14.5    10   1.3   1.1  57.4  11.1     6   1.1   1.1  66.9   7.1     4   1.0   1.0
CH  49.2  17.4    13   1.3   1.2  61.7  11.8     8   1.1   1.0  68.6   8.6     5   1.1   1.0
SE  44.6  12.8     7   1.2   1.1  50.9   9.8     5   1.2   1.0  62.2   6.3     3   1.2   1.1
GR  27.5  14.7    10   1.6   1.5  45.3   9.9     5   1.2   1.1  54.9   7.4     3   1.3   1.2
CZ  37.1  12.5     7   1.7   1.4  44.8   9.8     6   1.4   1.2  39.4   6.2     3   1.6   1.5
AT  40.9  14.4    10   1.4   1.2  59.5  10.7     6   1.1   1.0  62.2   6.7     4   1.1   1.0
TR  21.9  11.8     8   1.1   1.1  25.6  11.7     6   1.2   1.1  33.5   6.9     4   1.4   1.3
Table 10: Statistics about international collaborations for the main countries

A general trend is visible in Table 11: for all countries except China, the ratios of collaborations with the USA slightly decrease. On the other hand, for all countries, USA included, the ratios of collaborations with China increase, even if for most of them, collaborations with the USA are still much more numerous than collaborations with China. In that respect, Australia is quite an exception, with a higher percentage of collaborations with China than with the USA. Note that for the USA, the ratio of collaborations with China has doubled over a period of 10 years (from 11.3% to 22.2%). For other bi-national collaborations, involving a few percent of all international collaborations for both countries, a general trend is not clearly visible, and fluctuations are more important (also because the numbers of publications involved are smaller).
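Returning to the ratios rcc and rc2 of Table 10, the following Python sketch shows how the restriction to publications with at least 2 authors changes the denominator, and hence reduces the bias due to single-author publications. The record layout (countries, number of authors, citations), the function name `collaboration_ratios` and the toy data are hypothetical and only serve as an illustration.

```python
from statistics import mean


def collaboration_ratios(pubs, country, cap=63):
    """Return mn, rcc and rc2 for one country.

    pubs: list of (set_of_country_codes, n_authors, citations) triples
    (hypothetical layout). Publications with more than `cap` citations
    are disregarded, as for the entries of Table 10.
    """
    own = [p for p in pubs if country in p[0] and p[2] <= cap]
    intl = [c for countries, _, c in own if len(countries) > 1]
    all_citations = [c for _, _, c in own]
    multi_author = [c for _, n, c in own if n >= 2]
    mn = mean(intl)
    return {
        "mn": mn,
        "rcc": mn / mean(all_citations),
        "rc2": mn / mean(multi_author),
    }


# Hypothetical toy dataset for one country.
pubs = [
    ({"JP"}, 1, 2),         # single author, national
    ({"JP"}, 2, 6),         # two authors, national
    ({"JP", "US"}, 2, 10),  # international collaboration
    ({"JP", "FR"}, 3, 14),  # international collaboration
    ({"US"}, 1, 30),        # does not involve the country
    ({"JP", "US"}, 4, 90),  # more than 63 citations, disregarded
]
ratios = collaboration_ratios(pubs, "JP")
print(ratios)
```

In this toy example rc2 is smaller than rcc, because removing the weakly cited single-author publication raises the mean used as the denominator.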
One of the unexpected outcomes of these investigations is the rapid increase of the number of publications, but also the significant change of many predictors over a period of only 10 years. Indeed, if we gather some results obtained in (1), (2), (3), (4), (8), (9), (11), and in Table 5, we obtain Table 12. If we summarize in one sentence the content of this table, it would be that the publications in mathematics are becoming more collaborative, more international, longer, with more references and keywords, more freely accessible, and especially more numerous.

For the citations, it is quite natural that the Journal Impact Factor plays an important role. In that sense, its appearance as the most important predictor (whenever available) is not surprising. For more general publications, the importance of the number of references is also not so surprising: quite often, a paper with numerous references corresponds to a paper which is well embedded in the research landscape, and as a consequence it can be cited by several authors. The importance of the other predictors is less clear, and no conclusion can be established for them. At this level, the real content of a paper certainly matters more than any bibliometric predictor.

The content of Section 6 seems new. From the point of view of the authors, these statistics are
            US    CN    FR    DE    IT    UK    CA    JP    ES    RU    AU    KR    PL    IL    NL    IN
US  2005     -  36.7  23.9  24.2  21.6  27.9    37  27.2  25.1  20.7  26.9  46.9    29  54.5  28.3  36.1
    2010     -  37.5  21.5  22.8  22.2  22.3  34.6  18.5  16.6  19.3  19.7  38.6  20.5  46.3  23.6    23
    2015     -  40.1  18.8  22.7  19.9    25  32.7  22.3  15.6  18.8  19.5  33.6    15  49.2  24.2  22.2
CN  2005  11.3     -   3.7   5.3   3.3   6.4  13.5  13.1   2.2   0.4  15.2  15.3   2.3   2.4   1.7   4.1
    2010  16.1     -   5.6   5.3   3.5   7.7    13    18   2.9   3.1    22  21.6   2.3   2.1   3.6   6.5
    2015  22.2     -   6.7   6.8   3.1  10.3  18.7  16.7   4.2   4.1  26.8  19.6   5.4     6   5.6   8.3
FR  2005   8.8   4.5     -   8.2  16.8   7.2     8   6.1  12.5  13.8   3.2   3.6  12.7     8   7.9     7
    2010   9.5   5.7     -    10  17.9   8.8   9.3   6.7  12.4  13.8   7.7   5.1  10.2  10.1   7.1   5.1
    2015   8.2   5.3     -   9.9  15.2   9.6    10   9.5  11.7  11.4   4.8   4.2  10.8   8.7   9.9   4.9
DE  2005   8.4   6.1   7.8     -  11.5  10.1   5.4  13.1   5.8  14.6   6.1   3.9  13.7   7.3  13.6  11.9
    2010   9.4   5.1   9.3     -  11.7  11.7   5.5   8.5   8.2  12.6   8.8     4  14.7  10.1  19.7  12.5
    2015   9.3     5   9.2     -  12.1  11.6   6.2   8.8   7.5  12.2   6.1   5.9  12.9    10    19   5.3
IT  2005   4.8   2.4  10.2   7.3     -   5.4   2.6   4.8   6.4  10.1   2.7   1.6     6   3.2   5.7   4.1
    2010   5.7   2.1  10.4   7.3     -   5.8   3.7   4.4  10.1   6.7   4.8   1.5   5.4     4   5.7   2.9
    2015   5.7   1.6    10   8.5     -   8.8   6.7     5   8.9   8.5   3.5   2.8   6.9   3.9   7.2     3
UK  2005   8.2   6.1   5.8   8.5   7.2     -   5.5   3.7   5.5  10.7    15   5.5   4.7   4.4   5.7   7.8
    2010   7.1   5.7   6.4   9.1   7.3     -   6.6   7.3   6.2   8.4  11.8   4.5     6   6.4  11.2   4.5
    2015   8.2   6.1   7.3   9.3  10.1     -   7.6   6.1   6.7   8.3  12.8   3.7   6.8   7.4  10.2   4.1
CA  2005  10.4  12.4   6.1   4.3   3.2   5.2     -   6.1   4.6   5.7   7.8   6.2   3.3   7.3   7.1   7.8
    2010   9.3   8.1   5.7   3.6   3.8   5.5     -   5.5   4.3   3.8   8.2   4.5   3.9   9.9   3.3   5.6
    2015     8   8.3   5.6   3.7   5.7   5.6     -   4.4   3.2   2.6   5.8   3.6     5     6   4.8   5.3
JP  2005   3.3   5.2     2   4.6   2.6   1.5   2.6     -   1.6     2   2.5  11.1     5   2.2   3.1   4.1
    2010   2.9   6.5   2.4   3.2   2.7   3.5   3.2     -   1.9   4.1     2   9.8   4.4   0.5   3.3   4.3
    2015   3.2   4.3   3.1   3.1   2.5   2.6   2.5     -   1.9   2.6   2.7   8.8   4.2     1   1.4   2.1
ES  2005   4.5   1.3   6.1     3   5.1   3.3   2.9   2.4     -   4.4   3.4     2     7   2.4   3.7   2.9
    2010   3.9   1.6   6.6   4.7   9.2   4.5   3.7   2.8     -   3.2   3.5   2.1   7.1   3.5   5.5   4.7
    2015   3.6   1.8   6.2   4.3   7.2   4.7     3     3     -     4   3.4   3.5     6   1.9   6.1   3.8
RU  2005     3   0.2   5.4     6   6.6   5.2   2.9   2.4   3.6     -   2.8   1.3   3.3   5.1   4.2   1.2
    2010   2.6     1   4.2   4.1   3.5   3.5   1.9   3.5   1.9     -     2   0.6     5   3.3   2.6   0.4
    2015   2.6     1   3.6   4.1   4.1   3.5   1.5   2.5   2.4     -   2.4   2.6     3   4.8   2.2     1
AU  2005   3.8   6.9   1.2   2.4   1.7   7.1   3.9   2.8   2.7   2.8     -     2   1.7   2.2   2.3   1.6
    2010   2.7   7.1   2.4     3   2.6   5.1   4.2   1.8   2.1   2.1     -   1.5   1.7   1.9   1.4   1.6
    2015   3.1   7.8   1.8   2.4     2   6.3   3.8     3   2.4   2.8     -   2.1     2     1   3.7   2.7
KR  2005   3.8   4.1   0.8   0.9   0.6   1.5   1.8   7.4   0.9   0.7   1.1     -     0   0.5   0.8   5.3
    2010   3.9     5   1.2     1   0.6   1.4   1.7   6.3   0.9   0.4   1.1     -     1   0.3   0.5   5.4
    2015   3.9   4.1   1.1   1.7   1.1   1.3   1.7   7.1   1.7   2.2   1.5     -   0.9   2.1   0.5   6.2
PL  2005   2.3   0.6   2.7   3.1   2.1   1.3   0.9   3.3   3.1   1.8   0.9     0     -     1   1.1     0
    2010   1.9   0.5   2.1   3.2   1.9   1.7   1.3   2.6   2.7   3.4   1.1   0.9     -   1.7   1.9   0.9
    2015   1.7   1.1   2.8   3.6   2.7   2.3   2.3   3.3   2.9   2.5   1.4   0.9     -   2.7   1.8   2.9
IL  2005     6   0.9   2.4   2.3   1.6   1.6   2.8     2   1.5   3.9   1.7   0.7   1.3     -   4.2   1.6
    2010     5   0.5   2.5   2.7   1.7   2.2     4   0.4   1.6   2.7   1.5   0.4   2.1     -   3.6   0.7
    2015   4.5     1   1.8   2.3   1.2   2.1   2.2   0.6   0.8   3.2   0.6   1.7   2.2     -   1.6   1.7
NL  2005   2.7   0.5     2   3.7   2.4   1.8   2.4   2.4   1.9   2.8   1.5     1   1.3   3.6     -   1.2
    2010   2.6   0.9   1.8   5.2   2.4   3.8   1.3   2.3   2.6   2.1   1.1   0.6   2.3   3.7     -   1.6
    2015   2.3   0.9   2.1   4.3   2.3   2.9   1.8   0.9   2.4   1.5   2.1   0.4   1.4   1.6     -   0.7
IN  2005   2.3   0.9   1.2   2.2   1.2   1.7   1.8   2.2     1   0.6   0.8   4.2     0     1   0.8     -
    2010   1.9   1.3     1   2.6     1   1.2   1.8   2.3   1.7   0.3     1   4.5   0.8   0.5   1.2     -
    2015   2.7   1.8   1.4   1.6   1.3   1.5   2.7   1.8     2   0.9     2   6.6   3.1   2.3     1     -
Table 11: Bilateral collaborations: For a country in the x-coordinate, the numbers correspond to the % of its international collaborations with a country of the y-coordinate, for 2005, 2010, and 2015

Acknowledgement
The authors would like to thank the referees for their careful reading of a previous version of this work and for their numerous pieces of advice. They also thank Keiko Koide, from the Technical Support / Customer Service of Clarivate (Asia Pacific), for having answered all their inquiries related to the WoS database.
References

[1] Amodio, P. and Brugnano, L. (2014). Recent advances in bibliometric indexes and the PaperRank problem. Journal of Computational and Applied Mathematics.
[2] bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics 11, 959–975.
[3] Bensman, S., Smolinsky, L., and Pudovkin, A. (2010). Mean Citation Rate per Article in Mathematics Journals: Differences From the Scientific Model. Journal of the American Society for Information Science and Technology.
[4] Scientometrics 86, 179–194.
[5] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Chapman & Hall/CRC.
[6] De Battisti, F. and Salini, S. (2013). Robust analysis of bibliometric data. Stat. Methods Appl.
[7] Journal of Informetrics 7, 861–873.
[8] Dunne, E. (2021). Don’t count on it. Notices Amer. Math. Soc. 68 no. 1, 114–118.
[9] Goldfinch, S., Dale, T., and DeRouen, K. (2003). Science from the periphery: Collaboration, networks and ‘Periphery Effects’ in the citation of New Zealand Crown Research Institutes articles, 1995–2000. Scientometrics.
[10] Congr. Numer.
[11] Notices Amer. Math. Soc. 52 no. 1, 35–41.
[12] Luo, J., Flynn, J.M., Solnick, R.E., Ecklund, E.H., and Matthews, K.R.W. (2011). International Stem Cell Collaboration: How Disparate Policies between the United States and the United Kingdom Impact Research. PLoS ONE.
[13] Educational Research and Reviews.
[15] Journal of Machine Learning Research 12, 2825–2830.
[16] Rusin, D. (2015). A Gentle Introduction to the Mathematics Subject Classification Scheme. Link provided by https://en.wikipedia.org/wiki/Mathematics_Subject_Classification.
[17] Scikit-learn.org. https://scikit-learn.org/stable/modules/tree.html
[18] Smith, M.J., Weinberger, C., Bruna, E.M., and Allesina, S. (2014). The Scientific Impact of Nations: Journal Placement and Citation Performance. PLoS ONE.
[19] Scientometrics.
[20] Scientometrics.
[21] Scientometrics.
[22] Applied Mathematical Modelling 89, 1177–1197.
[23] Wagner, C., Whetsell, T., and Leydesdorff, L. (2017). Growth of international collaboration in science: revisiting six specialties. Scientometrics.