Citations versus expert opinions: Citation analysis of Featured Reviews of the American Mathematical Society

Abstract

Peer review and citation metrics are two means of gauging the value of scientific research, but the lack of publicly available peer review data makes the comparison of these methods difficult. Mathematics can serve as a useful laboratory for considering these questions because as an exact science, there is a narrow range of reasons for citations. In mathematics, virtually all published articles are post-publication reviewed by mathematicians in Mathematical Reviews (MathSciNet) and so the data set was essentially the Web of Science mathematics publications from 1993 to 2004. For a decade, especially important articles were singled out in Mathematical Reviews for featured reviews. In this study, we analyze the bibliometrics of elite articles selected by peer review and by citation count. We conclude that the two notions of significance described by being a featured review article and being highly cited are distinct. This indicates that peer review and citation counts give largely independent determinations of highly distinguished articles. We also consider whether hiring patterns of subfields and mathematicians' interest in subfields reflect subfields of featured review or highly cited articles. We reexamine data from two earlier studies in light of our methods for implications on the peer review/citation count relationship to a diversity of disciplines.


by Lawrence Smolinsky, Daniel S. Sage, Aaron J. Lercher, and Aaron Cao

Department of Mathematics, Louisiana State University, Baton Rouge, LA 70803. Middleton Library, Louisiana State University, Baton Rouge, LA 70803. Carnegie Mellon University, Pittsburgh, PA 15213. Corresponding author: [email protected]

Partially supported by National Science Foundation grant DMS 1503555 and Simons Foundation Collaboration Grant 637367.

Abstract

Peer review and citation metrics are two means of gauging the value of scientific research, but the lack of publicly available peer review data makes the comparison of these methods difficult. Mathematics can serve as a useful laboratory for considering these questions because as an exact science, there is a narrow range of reasons for citations. In mathematics, virtually all published articles are post-publication reviewed by mathematicians in Mathematical Reviews (MathSciNet). For a decade, especially important articles were singled out in Mathematical Reviews for featured reviews. In this study, we analyze the bibliometrics of elite articles selected by peer review and by citation count. We conclude that the two notions of significance described by being a featured review article and being highly cited are substantially distinct. This indicates that peer review and citation counts give largely independent determinations of highly distinguished papers. In another direction, we consider how hiring patterns of subfields and mathematicians’ interest in subfields may be assessed in terms of the subfields of featured review and highly cited articles.

Introduction

Two methods of evaluating the impact, quality, importance or other versions of value of a scientific work are peer assessment and informetric indicators. Peer assessment includes reviews of individual articles, reviewing for publication by referees and editors, reviewing for scholarly prizes and awards and honors, reviewing for grant support, and more (Lee, Sugimoto, Zhang, & Cronin, 2013). Peer reviewers ostensibly attempt to directly assess value, quality, and relevance. The meaning of citations is more ambiguous, but they have been used as indicators of value, impact, and even fame and pecuniary value (Cronin, 2005). Both citations and peer review are used as instruments of research evaluation. There is interest in comparing the two in terms of understanding both the significance of citations and the validity of citations in research evaluation.

Citation and publication networks cover nearly the entirety of academic literature. Counts of citations are available for papers indexed in the Web of Science (WOS), Scopus, Google Scholar, and, for mathematics, Mathematical Reviews (MR), available online as MathSciNet. The situation for peer reviewing is different. While the entirety of the literature indexed in Scopus and the WOS has undergone peer review from referees and editors, there is no systematic evaluation that allows comparisons of articles.

Although both peer review and citation analysis may reveal certain aspects of the value of scholarly work (importance, novelty, scientific usefulness, etc.), it is not clear that they measure the same aspects of value. For example, Aksnes, Langfeldt, and Wouters (2019) conjecture that research quality has four independently varying qualitative dimensions, only one of which is significantly measured by citations. It is accordingly a question of central importance to understand the relationship between citation analysis and peer review, and indeed, there have been many studies on the subject.
However, almost all such research has examined peer review of research groups, institutions, or individual scholars. Although most peer review takes place at the article level, Patterson and Harris (2009, p. 343) observe that there are “surprisingly few” studies at this level.

Mathematics deserves special attention in bibliometrics. As we will discuss, mathematics, as an exact science, has a narrower range of reasons for citing than other fields. This makes citation analysis somewhat less complex in mathematics than in other disciplines. Accordingly, mathematics can serve as a useful laboratory for bibliometric investigations.

In mathematics, there is a collection of distinguished articles well-suited for exploring the relationship between peer review and citation analysis. Between 1993 and 2004, those articles and books deemed to be especially significant were selected to receive featured reviews in MR. Since the choices were made shortly after the articles appeared, they were made independently of citations. The main goal of the present study is to investigate consistency between these two measures of quality for mathematical research by concentrating on featured review articles in MR.

Prestigious highly cited and featured review articles are not evenly distributed throughout all subfields of mathematics, and these distributions shed light on the perceived importance of subfields. Two other phenomena related to the perceived importance of subfields are the hiring patterns in top mathematics departments and the interest of mathematicians. We explore the relationship between these various phenomena related to the perceived importance of subfields.

Peer Review

Peer review is used to assess various manifestations of scholarly work, including reviewing submitted manuscripts and grant proposals, selecting prizes and awards, and evaluating research departments (Moed 2005, pp. 229-231). Peer review is paramount in scientific evaluation. Before an article can accumulate citation data in the WOS or Scopus, it must first pass peer review to be published. While non-peer-reviewed information is widely available in the digital age and indexed on Google Scholar, a Sloan Foundation study surveyed 4,000 academic researchers and found that the influence of peer review is growing in the digital environment (Nicholas, Watkinson, Jamali, Herman, Tenopir, Volentine, Allard, & Levine, 2015).

Reliability

In comparing measures of research quality, the reliability of the measures limits any potential correlations. It is accordingly important to consider the reliability of peer review. In particular, how strongly do the results of peer review depend on the choice of reviewers, the form of the review instructions, and the timing of the review? Campanario’s (1998) review of literature on peer review concluded that peer review is both high status and low reliability. While reviewers are typically given instructions or guidance on evaluation criteria, Langfeldt (2001) in her study of grant peer review points out that reviewers interpret the criteria differently. The situation is summed up by an oft-repeated pithy quote from a former co-ordinating editor of the Journal of the American Statistical Association: “All who routinely submit articles for publication realize the Monte Carlo nature of review” (Eysenck & Eysenck, 1992).

Several studies on inter-rater reliability are discussed by Lee, Sugimoto, Zhang, and Cronin (2013). The studies by Bornmann and Daniel (2008b); Jackson, Srinivasan, Rea, Fletcher, and Kravitz (2011); Kravitz, Franks, Feldman, Gerrity, Byrne, and Tierney (2010); and Rothwell and Martyn (2000) primarily had kappa values below 0.15, with the largest at 0.28. These are all very low values (McHugh, 2012, Table 3), supporting the Monte Carlo nature of review.

In theory, a uniform method for peer review across an entire discipline might be used as a standard measure, but no such method exists in any field. Perhaps the best approximation to a high peer review assessment is an article’s acceptance, after review by referees and editors, in a well-respected subject-area journal. In fields where there is a reasonable consensus on the hierarchy of journals, one can consider the prestige of the journal in which an article appears. However, this is problematic, since journals are now commonly ranked using impact factors (Wouters, 1999), not peer review.
Another source of unreliability for peer review comes from the potential for personal bias. For example, some journals and grant organizations allow researchers to suggest or exclude potential reviewers. Coauthors are excluded in some fields but not others. There may be elaborate restrictions on reviewers in a promotion case, including disallowing faculty members from any of the candidate’s prior institutions. Most of these examples are intended to avoid positive bias, but positive bias for one individual may be negative for competitors. See Lee, Sugimoto, Zhang, and Cronin (2013) for a broad review of the literature on bias in peer review.

It may be reasonable to expect that peer review becomes more reliable when one focuses on the most distinguished articles. For example, whereas different evaluators might reach opposite conclusions about the publishability of a marginal manuscript, one might expect almost all referees to agree on outstanding work. Since this study is restricted to Featured Review articles in MR, constituting less than 0.13% of all articles reviewed, peer review may be more reliable here than is typical in the peer review studies discussed above. We could not find this issue investigated in the literature. We remark that such an investigation would need to avoid the use of citation metrics in ranking outstanding articles or journals.

Citation Analysis

Citation counts of scholarly publications are widely used as a measure of research performance, and thereby as an instrument of research evaluation. In Moed’s summary of important informetric indicators (2017, p. 51 Table 3.5), about half depend on the networks of citations and publications. G. Nigel Gilbert began his influential article (1977): “Some studies have used the number of citations received by a paper as an indication of its scientific quality, significance or ‘worth’. Likewise, the number of citations obtained by an author has been used to measure the impact of his or her work on the scientific community”. More recently the National Research Council (NRC), which is the primary operating arm of the United States National Academies of science and engineering, reported that US faculty members were “generally in agreement that publications and citations were the most important factors in [graduate] program quality” (National Research Council, 2009, p. 12). Many bibliometrics researchers attempt to study citations and their meaning without believing they are necessarily a measure of value or impact. Others have endorsed it as a measure of value or impact.

Reliability and meaning

Whereas peer review is known to be unreliable, the notion of reliability does not even make sense for citation counts. Indeed, the citation count of an individual article is simply part of the historical record; it is open to analysis, but not to experimentation. A single article can be given to different scientists to be independently peer reviewed and compared. However, a single paper does not admit independent citation counts.

On the other hand, while the meaning of peer review is clear, this is not the case for citation counts. Individual referees can interpret review criteria differently, but at least specific review criteria exist. In contrast, the possible reasons for citing an article are much more amorphous. There are no set criteria required for making a citation, and an author’s reason for including a particular citation may not be obvious.

The notion that citation counts reflect the impact or value of an article’s contribution to science is attributed to Robert K. Merton’s normative theory. Merton was a sociologist who has been recognized as the founder of the sociology of science. He also served on the advisory board of the Science Citation Index (Storer, 1973), which is now part of the WOS. In Merton’s view, a citation “registers in the enduring archives the intellectual property of the acknowledged source by providing a pellet of peer recognition of the knowledge claim” (Merton, 1988, p. 622).

Notes: The other indicators in Moed’s summary are based on altmetric measures or peer review, such as mentions on social media, patent-based measures, grant funding, or prizes and awards. See Gilbert (1977) for references. For example, Bornmann and Osório write, “we use citations as a measure of ‘value’, because citations are usually applied to assess the usefulness and the value of publications for other researchers (Bornmann, 2017)” (Bornmann and Osório 2019, p. 546).
Even if one accepts that citations are given for scientific utility or as recognition of scientific accomplishments, there are still complications and subtleties in understanding the meaning of citation counts. For example, Eugene Garfield considered the issues of negative citations, self-citations, methodological and review articles, journal prestige, and variation by discipline (Garfield, 1979). However, in his view, these issues did not justify the rejection of the normative theory as they could be overcome with appropriate methodological adjustments (Garfield, 1979, pp. 244-252). Garfield wrote, “…we know that citation rates say something about the contribution made by an individual’s work, at least in terms of the utility and interest the rest of the scientific community finds in it” (p. 250). We remark that as the evaluation stakes heighten for researchers, new versions of these technical challenges arise, e.g., the formation of “citation circles” (Aksnes, Langfeldt, & Wouters, 2019, p. 7).

On the other hand, if one rejects the normative view of citations, then there is no simple way to summarize the meaning of citations, leaving their use in evaluation unclear. A citation may be a pellet of peer recognition, as Merton asserts, but the underlying reason for the peer recognition may have little to do with scientific utility. First, since the citer is not anonymous, the reference may be made out of self-interest. Second, there are no awarding standards for the citation other than perhaps being relevant in the eye of the author and/or editor.

Perhaps it is naive to attribute an author’s choice of references primarily to the Merton theory of recognizing scientific contributions and scientific utility rather than to a competing notion of economic utility, where authors choose their citations to achieve their goals of being read, respected, and recognized. This perspective is exemplified by G. Nigel Gilbert’s (1977) article title, “Referencing as persuasion.” Blaise Cronin writes, “The Achilles’ heel of citation is its residual subjectivity…” (2005, p. 169). If the failure to cite is probabilistic then the randomness may be studied and corrected or, perhaps naively, ignored as averaging out. Michael H. and Barbara R. MacRoberts have long argued that the process is nonrandom and that scientists’ citations are “highly biased”: “The equation: cited=used, may be correct with many caveats, exceptions, corrections, and qualifications, but the equation: not cited=not used, is simply false” (2018, p. 476).

Citation Analysis versus Peer Review

Do citations and peer review measure similar notions of impact or value? The question has been explored in studies comparing peer review assessments of academic programs, research groups, individual scholars, and articles. Surveys by Aksnes, Langfeldt, and Wouters (2019) and Bornmann and Daniel (2008b) and Blaise Cronin’s book (2005) describe some of the studies. We will discuss some of the results most relevant to the present study.

In comparing two measures A and B, here peer review and citation counts, the reliabilities of A and B are relevant. With low or unknown reliability of A and B, more measurements of the correlation between A and B with non-overlapping data sets can help develop an understanding of the relationship between A and B. Various measurements are not replication studies since correlations between A and B will be a distribution rather than a number.

Correlations

Before turning to particular articles that address the issue of comparing citations and peer review, we comment on correlations used to address the question. The interpretation of a correlation must be made in the context of the question posed. Suppose we have two instruments or indicators, A and B. If the qualities they measure have some common component, then one might expect a nonzero correlation, i.e., a statistically significant correlation. However, that does not mean the instruments substantively measure similar qualities.

Correlation as a measure

When the question is whether indicator A and indicator B measure the same quality, or whether A can replace B as a measure, even a statistically significant correlation of 0.6 may be very weak. Consider an example from the first author’s teaching. He tested math students with paper tests and computer-based tests in Calculus to see if the knowledge and skill measured were the same. Each student’s test result was an ordered pair, (handwritten score, computer score). The correlation was r > 0.6 and was statistically significant. However, the scatterplot graph (Figure 1) makes it apparent that the notions of skill and knowledge measured by these computer and handwritten tests are different. There is a large variance in the abscissa and ordinate at each level. Both tests may measure some aspects of knowledge and skill, but the specific aspects seem different.

Figure 1

Scatterplot, bins, and quartiles for a data set

Data of individual students in scatterplot (left). The same data in bins of size ten and in quartiles (right).

Aggregating, averaging, and binning

A second analytical tool that we feel requires caution is the use of averaging (or aggregating) data sets. In judging whether computer tests and handwritten tests measure the same aspects of knowledge, we would like to know if they are close on the level of individuals. With its large variance at each level, Figure 1 may not give an appealing picture and suggests “no,” but after averaging, the picture might suggest “yes”. This issue also occurs if an indicator is too coarse with a small number of possible outcomes in one variable and averaging is done in the second variable. For instance, suppose we bin the abscissa on Figure 1 either by quartiles or in ten point groups and average on the ordinate for the binned groups. The Pearson correlations are then greater than 0.97, and the Spearman’s rhos are both a perfect 1.

Binning, aggregating, and averaging may manifest in nonobvious ways. For example, peer reviewers might give a rating of 1-4 to approximate an unnamed underlying continuous rating. In citation analysis, one might make use of an impact factor that averages a large number of article citation results.

We can now consider the main questions about the relation between peer review and citation counts. To what extent is the measure of value obtained using citations similar to the measure of value obtained using peer evaluations? More precisely:

1. Is there a statistically significant correlation between citations and peer review?
2. Do citations and peer review substantively measure a common notion?

As a caveat, we remark that a positive answer to the second question only makes sense if there is a high correlation between citations and peer review. However, with respect to the second question, the validity of even a high correlation between measures depends on the reliability (i.e., the self-correlation) of the measures. As has already been discussed, reliability can be low for peer review and does not even make sense for citation counts. In light of this, we view a correlation of 0.6 as very weak for question 2.
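
The effect of binning and averaging described in this section can be reproduced on a small synthetic example. The sketch below does not use the test-score data behind Figure 1: the two indicators, the noise pattern, and the helper names (`pearson`, `bin_means`, `ranks`) are all assumptions made for illustration. The noise is constructed so that it cancels within each ten-point bin, which makes this an extreme case of the phenomenon rather than a typical one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bin_means(xs, ys, width):
    """Group points by x // width; return the mean x and mean y of each group."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x // width, []).append((x, y))
    keys = sorted(groups)
    mean_x = [sum(p[0] for p in groups[k]) / len(groups[k]) for k in keys]
    mean_y = [sum(p[1] for p in groups[k]) / len(groups[k]) for k in keys]
    return mean_x, mean_y


def ranks(values):
    """Ranks 1..n of the values (assumes no ties, which holds for the bin means here)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Synthetic (hypothetical) data: a linear trend plus a repeating noise pattern.
NOISE = [-30, -10, 10, 30, 0]
indicator_a = list(range(100))
indicator_b = [0.5 * a + 25 + NOISE[a % 5] for a in indicator_a]

# Correlation at the level of individual data points.
r_individual = pearson(indicator_a, indicator_b)

# Bin the abscissa in ten-point groups and average the ordinate within each bin.
binned_a, binned_b = bin_means(indicator_a, indicator_b, 10)
r_binned = pearson(binned_a, binned_b)
# Spearman's rho is the Pearson correlation of the ranks.
rho_binned = pearson(ranks(binned_a), ranks(binned_b))

print(f"individual r = {r_individual:.2f}")
print(f"binned r = {r_binned:.2f}, binned rho = {rho_binned:.2f}")
```

In this constructed data set the individual-level correlation is close to 0.6, while the correlation of the ten bin means is essentially 1, even though no individual point has moved.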

Studies

There are very few studies examining the correlation between citation counts and peer review at the article level. Patterson and Harris (2009) did one such study for papers in the journal Physics in Medicine and Biology. Patterson and Harris were an editorial board member and publisher, respectively, of this journal. They sought information on how to increase the impact factor of their journal and had access to internal peer review data. For the three years considered, they found statistically significant correlations between citation counts and peer review, all of which were weaker than 0.24. They used an averaging procedure where articles are aggregated into quintiles and then compared with the internal peer review. The authors thought it “reassuring to find that there is a significant correlation, albeit low, between citations and independent, expert, prospective review” (Patterson & Harris, 2009, p. 349). For editors interested in increasing an impact factor, this correlation may suffice to recommend action. However, this correlation, which is very low even after averaging, does not suggest that citation counts can serve as a reasonable replacement for the notion of value measured by peer review.

Other researchers have investigated the relationship between peer review and citations using data from F1000, a publisher of services for biological and medical scientists. F1000 does not provide systematic peer review, but rather is a form of social media for scientists allowing post-publication peer recommendation of articles. Recommendations are submitted by F1000 faculty members, who choose articles to read and recommend. Since only a small number of articles receive a recommendation, recommended articles can be usefully compared to highly cited articles. Two studies have included an examination of recommendations and WOS citations (Li & Thelwall 2012; Waltman & Costas, 2014). Both found weak but statistically significant correlations. Li and Thelwall used the ad hoc FFa numerical ratings provided by F1000 and Spearman’s rho to find correlations of about 0.3. We discuss the larger study by Waltman and Costas in more detail.
Of the 1,707,631 publications in the total (“micro-subject” determined) population considered by Waltman and Costas, 38,327 had at least one recommendation and an assigned subject. They found that 73.7% of the highly cited (top 1%) articles have no recommendations. This information allows construction of the contingency table, Table 1. The correlation in Table 1 is φ = 0.163 with a 95% confidence interval of [0.159, 0.168]. Given that there are fewer than half as many highly cited as recommended articles, the largest possible correlation was approximately 0.663, so that φ = 0.163 is about 25% of the maximum possible correlation.

Table 1

Waltman and Costas’s data

Received a recommendation | Highly cited: Yes | Highly cited: No
Yes | 4,491 | 33,836
No | 12,585 | 1,656,719

Other researchers have considered the relationship between citation counts and peer review at the level of individual authors. In their study of a certain postdoctoral funding program of the German Research Foundation, Hornbostel, Bohmer, Klingsporn, Neufeld, and von Ins (2009) compared the citation counts accumulated by successful and unsuccessful applicants after undergoing the peer review-based selection process. They concluded that while “some minor performance could be identified”, there was no decisive difference between the two classes of applicants.

Wainer and Vieira (2013) also studied the relationship between bibliometric data and peer review coming from a Brazilian research funding agency. They looked at data for 2,663 individual scientists arranged in 96 groups by field and academic level. They computed Spearman’s rho correlations for each group and combined correlations from the same field using a weighted average method from biostatistics. Spearman’s rho seems a minimal type of correlation to measure with citations, but Wainer and Vieira did not have direct peer review scores. They computed weighted Spearman’s rho correlations for 55 fields (including humanities) between peer reviews and total citations for a researcher in each of WOS, Scopus, and Google Scholar resulting in 157 correlations (Table 3, pp. 407-408). Wainer and Vieira considered correlations of 0.4-0.6 to be moderate, and most of the correlations were low by their standards (see Figure 2). Only four of the 157 correlations were greater than 0.6 (for Architecture, Morphology, Urban planning, and Zoology), and only one was very strong (Architecture at 0.95). Only one of the correlations over 0.6 was for WOS citations. (It should also be remarked that two correlations were less than -0.8 (for Astronomy).) They do caution that their method of peer evaluation required repeated annual evaluations, and the peer reviewers may have relied on bibliometric measures more than if it was done only once.
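
Returning to Table 1, the reported correlation appears to be the phi coefficient of the 2×2 contingency table. Under that assumption, the sketch below recomputes it from the four cell counts, together with the largest value that phi could take given the marginal totals. The function names `phi_coefficient` and `max_phi` are hypothetical helpers introduced for illustration, not part of the original study.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi coefficient of the 2x2 contingency table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def max_phi(row_yes, col_yes, total):
    """Largest phi attainable when only the marginal totals are held fixed.

    The maximum is reached when the smaller "yes" group lies entirely
    inside the larger one, so that one off-diagonal cell is zero.
    """
    small, large = sorted((row_yes, col_yes))
    return phi_coefficient(small, 0, large - small, total - large)


# Cell counts from Table 1
# (rows: received a recommendation yes/no; columns: highly cited yes/no).
rec_cited, rec_not_cited = 4_491, 33_836
unrec_cited, unrec_not_cited = 12_585, 1_656_719

total = rec_cited + rec_not_cited + unrec_cited + unrec_not_cited
recommended = rec_cited + rec_not_cited   # row total: recommended articles
highly_cited = rec_cited + unrec_cited    # column total: highly cited articles

phi = phi_coefficient(rec_cited, rec_not_cited, unrec_cited, unrec_not_cited)
phi_max = max_phi(recommended, highly_cited, total)

print(f"total = {total:,}")
print(f"phi = {phi:.3f}, max phi = {phi_max:.3f}, ratio = {phi / phi_max:.2f}")
```

The cell counts sum to the 1,707,631 publications in the population, and the computed values agree with the 0.163 and 0.663 quoted in the text, which supports reading the reported correlation as a phi coefficient.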

Figure 2

Wainer and Vieira 157 correlation counts

Aggregation and averages at the level of research groups or departments are yet further removed from the article level than aggregation and averages at the level of authors. These studies are often complicated by the fact that the measure of department quality includes factors beyond the quality of the faculty’s scholarship. For example, the number of PhD graduates may be a more important factor in the quality of a graduate program than faculty members’ bibliometric measures. Citation count averages seem appropriate for assessing a group, but make deductions concerning peer review versus citations difficult.

Aksnes and Taxt (2004), Franceschet and Costantini (2011), Rinia et al. (1998), and Wainer and Vieira (2013) conducted studies that compared average citation metrics with peer evaluations of an entire research group’s work. Rinia et al. (1998) found Spearman rank-correlation coefficients ranging from 0.16 to 0.68 between jury ratings and various bibliometric measures including number of citations divided by citation average in the subfields in which the evaluated research group was active, among others. Aksnes and Taxt (2004) found a 0.46 correlation coefficient in the latter comparison, which they considered weak. Franceschet and Costantini (2011) found rank correlations ranging from 0.32 to 0.81 for research groups in different disciplines. Civil engineering and architecture had the lowest correlation, and physics had the highest.

In summary, prior studies have found only low correlations between peer review and citation counts. Moreover, data for precise, article-level comparisons is hard to come by.

Mathematics and citation analysis

Interest

Note to the preceding section: see the US National Research Council regression weighting of US graduate programs in Mathematics and Physics (National Research Council, 2011, p. 266 & p. 271).

One of the difficulties in citation analysis is the broad range of possible reasons for a given citation. The field of mathematics provides a useful test laboratory for understanding citations in general because in mathematics, this range is greatly restricted. Mathematics has a standard of argument or proof that is not present in observational, experimental, or theoretical science. Mathematical theorems are established by deductive reasoning from previously established results. Accordingly, in the course of a proof, it is common for a mathematician to cite only papers containing lesser-known theorems used. Refutation and debate of results become a small part of the literature. As such, papers in mathematics tend to have fewer citations on average than is usual in science. Bibliometric indicators behave differently when applied to mathematics than to the sciences (Bensman, Smolinsky, & Pudovkin, 2010).

The point is illustrated by a conversation between the chemist, Darl McDaniel, and his mathematician son, Andrew McDaniel. The mathematician described mathematical argumentation as a chain where each step is securely linked to the next in ironclad proof. The chemist described argument in chemistry as a bundle of straw. Here, each individual straw is a strand of evidence, with the strength of the argument determined by the number and thickness of the individual straws in the bundle.

Other than direct references to theorems used, the primary reason for citations in mathematics is to attempt to persuade readers of the interest, depth, and significance of the problems considered and results obtained. Table 2 is Bornmann and Daniel’s (2008a) version of Eugene Garfield’s (1962) list of possible motivations of citers. We have added to it our view of its relevance to mathematics.

Table 2

Motivations of citers

Reason for citation | Relevance in mathematics
1. Paying homage to pioneers. | Y
2. Giving credit for related work (homage to peers). | Y
3. Identifying methodology, equipment, etc. | Y
4. Providing background reading. | Y
5. Correcting one’s own work. | N
6. Correcting the work of others. | N
7. Criticizing previous work. | N
8. Substantiating claims. | N
9. Alerting to forthcoming work. | Y
10. Providing leads to poorly disseminated, poorly indexed, or uncited work. | Y
11. Authenticating data and classes of fact (physical constants, etc.). | N
12. Identifying original publications in which an idea or concept was discussed. | Y
13. Identifying original publication or other work describing an eponymic concept or term. | Y
14. Disclaiming work or ideas of others (negative claims). | N
15. Disputing priority claims of others (negative homage). | Y
(Garfield, 1962, p. 85)

In addition to its narrower uses of citations, mathematics has a lower average number of joint authors per paper than other sciences (National Science Board, 2010, Table 5-16; Mallapaty 2018, with data provided by Larivière). Moreover, since there is no laboratory work in mathematics, there are fewer “collaborator (team self-citations).” These facts simplify citation analysis for mathematics and, as was observed by Smolinsky, Lercher, & McDaniel (2015), may make it a closer fit to the preferential attachment model (Simon 1955; Barabási & Albert 1999) or the cumulative advantage model (Price, 1976).

Data

MR, together with its online incarnation MathSciNet, is a primary source for information on peer-reviewed articles in the mathematical sciences. It is published by the American Mathematical Society (AMS), and over 125,000 new items are added each year (American Mathematical Society, 2019). During the years 1995 to 2006, MR published 717,164 reviews of journal articles. For comparison, the WOS lists 163,648 papers that include mathematics as one of their categories for the publication years 1993-2004. Mathematics may be unique in having nearly its entire literature undergo post-publication review by scholars. Writing a review for MathSciNet is considered service to the profession similar to refereeing for a journal. Since the reviewer is not anonymous, the reviewer has motivation to be diligent.

From 1995 to 2006, MR recognized articles of particular note in Featured Reviews. Featured review articles were “. . . identified by the MR editors with the advice of distinguished outside mathematicians as being especially important…” (American Mathematical Society, 1995, p. 1) and were highlighted on the title pages of MR and in MathSciNet. During the period 1995-2006, 927 articles were selected for featured review, constituting less than 0.13% of the MR literature and less than 0.45% of WOS mathematics literature. The program was discontinued in 2006. The selection process was based on a posteriori peer review and was independent of citation counts, since the articles had already been accepted for publication or recently appeared.

In our determination that 927 articles received featured reviews, we made the following decisions. A few featured reviews include two articles that were published as complete articles (e.g., part 1 and part 2). Each of these articles is included in our count. Three other papers have corrections, entitled Addendum …, Correction …, or Corrigendum …, that were separately published articles. These three are not included in our count. One article was published twice due to production errors in the original. We have counted the two versions as a single publication and added the three WOS citations of the original to the citation count for the corrected version.

Among the 927 featured review articles, 79 are not indexed on the WOS and 734 include Mathematics as a WOS category. 80 featured review papers include a WOS classification of Applied Mathematics, 60 include one of the physics categories, and 30 include Mechanics. Article publication dates were 1993 to 2004. The list of featured review articles is no longer available from the American Mathematical Society. We found featured review articles through the analysis of the review texts.

We examined citation counts of featured review articles in bins of size 20 and size 5. The WOS lists 163,648 papers that include mathematics as one of its categories for the publication years 1993-2004. Usually, an article is termed “highly cited” if its citation count is in the top 1%. Here, this gives 1636 articles with 97 or more citations. However, in order to only consider full bins of 5, we restrict the definition to the 1559 articles with more than 100 citations. These are the top 0.952% most cited papers. All of the WOS highly cited articles are indexed in MR. The MR primary classification numbers were also recorded to examine the area distribution of the highly cited articles.
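The selection rates quoted in this section follow directly from the reported counts. The short Python sketch below recomputes them as a consistency check. The variable and function names are ours (illustrative only), and the only inputs are figures stated in the text.

```python
# Counts reported in the text (names are illustrative, not from the paper).
MR_REVIEWS = 717_164          # MR reviews of journal articles, 1995-2006
WOS_MATH_PAPERS = 163_648     # WOS mathematics papers, 1993-2004
FEATURED_TOTAL = 927          # featured review articles
FEATURED_IN_WOS_MATH = 734    # featured review articles in the WOS mathematics category
HIGHLY_CITED = 1_559          # papers meeting the restricted highly cited definition


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Size of a nominal top-1% set of the WOS mathematics papers.
top_one_percent = WOS_MATH_PAPERS // 100

rates = {
    "featured / MR reviews": percent(FEATURED_TOTAL, MR_REVIEWS),
    "featured in WOS math / WOS math": percent(FEATURED_IN_WOS_MATH, WOS_MATH_PAPERS),
    "highly cited / WOS math": percent(HIGHLY_CITED, WOS_MATH_PAPERS),
}

if __name__ == "__main__":
    print(f"top 1% of WOS mathematics papers: {top_one_percent}")
    for label, value in rates.items():
        print(f"{label}: {value:.3f}%")
```

Note that the "less than 0.45%" rate corresponds to the 734 featured review articles in the WOS mathematics category rather than to all 927.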

Results

Featured Review Articles Versus Highly Cited Articles

Of the 734 featured review articles that were indexed in the mathematics category on WOS, 122 were also highly cited. The correlation between the two dichotomous variables of being a featured review and being highly cited is the phi coefficient $\phi$, i.e., the mean square contingency coefficient. Three entries in the contingency table (Table 3) are available to compute $\phi$. The last necessary number in the contingency table is the number of articles $x$ that are neither a featured review nor a highly cited article. This last number would require knowing the number of articles in the intersection of the WOS mathematics category and the MR reviewed items, which was not computed. However, $0 \le x \le 163{,}648$,

$$\phi(x) = \frac{122x - 612 \cdot 1437}{\sqrt{(122 + 612)(122 + 1437)(x + 1437)(x + 612)}},$$

and $\phi(x)$ is an increasing function on $[0, \infty)$. For $x > 9394$, it is statistically significant at the 1% level using chi-squared (Chedzoy, 2006). The maximum possible value of the correlation is 0.11, but a correlation of $\phi = 0.11$ is weak. For $x = 163{,}648$, a 95% confidence interval is [0.091, 0.128]. We recognize that being highly cited is an artificial dichotomous variable, since it is determined by a cutoff value of the number of citations. We do not have enough information to conduct an exact point-biserial correlation calculation but estimate it to be less than 0.15.

Table 3

Contingency table for $\phi$

 | Highly cited: Yes | Highly cited: No
Featured review: Yes | 122 | 612
Featured review: No | 1437 | x

Footnote: Using a sample of 6,000 and assuming that WOS mathematics category papers are included in MathSciNet.

We note that since being a featured review is a rarer distinction (~0.45%) than being highly cited (~1%), there could not be a perfect correlation. Given the ex post facto rates of selection of featured review articles and highly cited articles, the largest possible $\phi$ would be 0.684. The observed value is therefore less than 16% of the maximum possible correlation. One can also consider Cohen’s $\kappa$ statistic (Cohen, 1960), which has been previously used in the Italian study (Bertocchi, Gambardella, Jappelli, Nappi, & Peracchi, 2015) as well as for analyses of reliability. This statistic takes the observed categories’ frequencies as an a priori given. For Table 3, $\kappa$ is a function of $x$, and $\kappa < 0.11$, which is small.

Only 7.83% of the 1559 highly cited WOS mathematics articles were featured review articles and only 16.62% of the 734 featured review articles classified in the WOS mathematics category were highly cited. In Figure 3, the highly cited featured review articles represent only the tail of the distribution while the first 5 bars represent the 83.38% of featured review articles that are not highly cited.

Figure 3

WOS citations versus number of featured reviews

Frequency of WOS core citations for the 734 featured reviews indexed in the mathematics category on the WOS.

To summarize, the two notions of significance described by being a featured review article and being highly cited are substantially distinct. This indicates that peer review and citation counts give largely independent determinations of highly distinguished papers, at least when peer judgement is uninfluenced by knowledge of citation counts.
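The contingency-table statistics in this subsection can be recomputed from the three known cells of Table 3. The following Python sketch is ours, not the authors' code. It assumes the standard 2×2 formulas for the phi coefficient, its upper bound under fixed marginals, and Cohen's kappa, and the function names are illustrative.

```python
import math

# Known cells of Table 3; x (neither featured nor highly cited) is unknown.
A = 122    # featured review and highly cited
B = 612    # featured review, not highly cited
C = 1437   # highly cited, not featured review


def phi(x: float) -> float:
    """Phi coefficient of the 2x2 table [[A, B], [C, x]]."""
    numerator = A * x - B * C
    denominator = math.sqrt((A + B) * (A + C) * (x + C) * (x + B))
    return numerator / denominator


def chi_squared(x: float) -> float:
    """Pearson chi-squared statistic (1 degree of freedom): N * phi^2."""
    n = A + B + C + x
    return n * phi(x) ** 2


def phi_max(n: float) -> float:
    """Largest phi attainable with the observed marginals among n articles.

    For marginal rates p (featured) <= q (highly cited), the bound is
    sqrt(p * (1 - q) / (q * (1 - p))).
    """
    p = (A + B) / n
    q = (A + C) / n
    return math.sqrt(p * (1 - q) / (q * (1 - p)))


def cohen_kappa(x: float) -> float:
    """Cohen's kappa for the same 2x2 table (two-category form)."""
    n = A + B + C + x
    observed = (A + x) / n
    expected = ((A + B) * (A + C) + (C + x) * (B + x)) / n**2
    return (observed - expected) / (1 - expected)
```

Evaluated at x = 163,648, the value used in the text for the confidence interval, `phi` comes out at about 0.11 and `cohen_kappa` at about 0.10. `phi_max(163_648)` is about 0.68, in line with the reported 0.684.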

Subfield Analysis

Data on featured review articles can also be used to investigate how subfields of mathematics are evaluated for their importance to mathematics as a whole. How do the subfields of a discipline relate to hiring patterns and faculty interest? Do the subfields chosen for hiring by distinguished departments correlate more strongly with the subfields with a larger number of highly cited articles or with those with more featured review articles?

The Mathematics Subject Classification (MSC) used by MR divides mathematics into 63 major topics. The Joint Data Committee of the American Mathematical Society, American Statistical Association, the Mathematical Association of America, and the Society for Industrial and Applied Mathematics has aggregated the 63 topics into twelve “field of thesis” categories. Following the approach of Smolinsky and Lercher in their study of the effect of subdiscipline on citation rates (Smolinsky and Lercher, 2012), we will view these categories as the subfields of mathematics.

Here, we consider two measures of the prominence of a subfield within mathematics. First, we will look at the subfield of interest of mathematicians. The professional mathematical societies request that members select two-digit MSC numbers as their fields of interest. The AMS generously supplied the 2009 data for the research of Smolinsky and Lercher (2012). Second, we examine the subfields of new PhDs hired from 2000–2010 by the top 48 mathematics departments (American Mathematical Society Group 1).

Let FR, HC, H, and AMS be real-valued random variables with domain the set of twelve fields {Algebra, Analysis, Geometry, Discrete, Probability, Statistics, Applied, Computation, Control, Differential Equations, Math Education, Other}. The random variables are defined by FR(field) = the number of featured review articles in the field, HC(field) = the number of highly cited articles in the field, H(field) = the number of Group 1 hires in the field as detailed in Smolinsky and Lercher (2012), and AMS(field) = the number of AMS members with responses indicating primary interest in the subfield.

The correlation matrix for the random variables is given in Table 4. The correlation between subfield of hiring in the top departments and the featured review article subfields was very strong. It was still strong, but less so, between subfield of hiring and the subfield of highly cited articles. It is also noticeable that the subfields of faculty interest correlate more strongly with featured review article subfields than with the subfields of highly cited articles or hiring. All of the correlations in Table 4 between the random variables are statistically significant.
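The significance statement can be checked from the coefficients reported in Table 4 alone. The sketch below is one possible check, not the authors' procedure: it assumes a two-sided t test of a Pearson correlation over the twelve fields at the 5% level, whereas the text does not state which test or level was used. All names are illustrative.

```python
import math

N_FIELDS = 12             # twelve "field of thesis" categories
T_CRIT_5PCT_DF10 = 2.228  # two-sided 5% critical value of Student's t, 10 d.f.

# Off-diagonal correlations reported in Table 4.
TABLE_4 = {
    ("FR", "HC"): 0.71,
    ("FR", "H"): 0.91,
    ("FR", "AMS"): 0.89,
    ("HC", "H"): 0.80,
    ("HC", "AMS"): 0.67,
    ("H", "AMS"): 0.77,
}


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing that a Pearson correlation is zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Assumed test: compare each t value with the 5% critical value.
t_values = {pair: t_statistic(r, N_FIELDS) for pair, r in TABLE_4.items()}
significant = {pair: t > T_CRIT_5PCT_DF10 for pair, t in t_values.items()}

if __name__ == "__main__":
    for pair, t in t_values.items():
        print(pair, round(t, 2), significant[pair])
```

Under these assumptions every pair exceeds the critical value, including the smallest coefficient, 0.67 between HC and AMS.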

Table 4

Correlation matrix

r.v. | FR | HC | H | AMS
FR | 1 | 0.71 | 0.91 | 0.89
HC | | 1 | 0.80 | 0.67
H | | | 1 | 0.77
AMS | | | | 1

Footnote: Formerly, the committee also included a representative of the Institute for Mathematical Statistics.

... review articles (FR) reflect the peer preference for the subfields as measured both by faculty interest (AMS) and hiring (H).

Discussion

In this study, we examined the relationship between peer review and citation counts in mathematics by focusing on a body of highly distinguished mathematical articles, those selected for featured reviews. While we found a statistically significant correlation between featured review selection and being highly cited, the correlation is weak. This indicates the presence of substantial differences in the underlying selection processes. Our results are consistent with previous studies discussed in the literature review.

Waltman and Costas’s (2014) study was closest in spirit to this study. The F1000 recommendations they studied are a social media selection and are less systematic than featured reviews. Since MR covers all mathematical literature, we believed that a featured review selection would be a more reliable method of detecting the relationship between elite peer review and high citation counts. The association that Waltman and Costas found between WOS highly cited articles and F1000 reviews was somewhat stronger than the one we found between WOS highly cited articles and featured reviews. Furthermore, given the different selection rates for featured reviews, F1000 recommendations, and highly cited articles, being highly cited has a stronger association to F1000 recommendations than to featured review selection, but both are weak.

It appears that peer review and citation metrics are related to different notions of value in an article. Li and Thelwall suggest that F1000 evaluators measure “the quality of articles from an expert point of view, citations measure research impact from an author point of view…” (2012, p. 549). We believe peer review in general can be characterized as measuring quality from the expert point of view. But what does the expert or author point of view mean? Peer review is a serious professional responsibility. It is a matter of basic professional ethics to be impartial and to review an article, researcher, program or institution according to the specified parameters without personal bias.
The underlying assumption is that reviewers will embrace this responsibility and will not violate the trust of the profession to chase a (typically small) measure of personal career gain. In those cases where there is a significant conflict of interest, scholars are expected to recuse themselves. In peer review, the reviewer is functioning as an independent expert.

Since scholarly output is the basis of an academic’s career, an author necessarily has a different viewpoint from that of an independent expert. An author is a consumer of references and a producer of articles. As producers, authors want their articles to be read, cited, and recognized as significant. As consumers of references, they will be guided by the economic utility of achieving their career goals. Consider the eight positive “relevant to mathematics” reasons for citation in Table 2. Other than results in the immediate chain of logical argument (item 3), there is great flexibility for an author to choose references for their economic utility. Which articles should an author include and exclude as “relevant?” Will the citation affect the likely peer reviewers? Will the citation increase the credibility of the paper or attract readers and citers? Will citing fashionable or important articles improve the perception of importance? Suitable results may occur in multiple articles.

A highly cited article may perhaps be assumed to have value or scientific utility. However, the converse is certainly false. As of February 19, 2019, more than 85% of the 163,648 mathematics articles in WOS for the publication years 1993-2004 garnered fewer than 20 citations. These include 30% of the featured review articles and 25% of the papers appearing in the highly prestigious Annals of Mathematics during this time period, both of which are determined by a demanding level of peer review.

In mathematics, it is easy to see how a paper could be flagged as significant by reviewers even though it is predictable that it will not be highly cited. One example is an article that solves a long-explored problem and completes a line of investigation. The solution may not open new directions of research, and even if it does, those new directions may not be of particular interest to present researchers. The article may not garner many citations since relatively few papers build on it.

Hiring in top departments as well as the list of fields of interest to mathematicians are more closely correlated with featured review subfields than with highly cited article subfields. It may be that the faculty, hiring committees, and chairs are acting as experts (reflecting peer review) when making hiring decisions. On the other hand, it is reasonable that selections of featured review articles would follow the subject pattern of the discipline members’ interests.

There are methodological limitations on studies comparing peer review and bibliometrics. Such studies usually involve data gathered for other purposes and so do not follow experimental protocols or journal peer-review protocols. Common issues are: a) reviewers are not assigned but self-selected, b) articles reviewed are not assigned but reviewer-selected, c) reviewers are not anonymous, d) reviewers have access to citation information, and e) reviewers know the journal where the article was accepted. Since single-blind review is the most common protocol in the sciences, we have omitted the lack of anonymity of authors from this list. However, Tomkins, Zhang, and Heavlin (2017) found papers with famous authors or from high-prestige institutions are at an advantage in single-blind review compared to double-blind review.
Three studies at the article level are considered in this paper: Patterson and Harris (2009), Waltman and Costas (2014), and the present study. Patterson and Harris does not suffer from any of these issues, all but d are relevant for Waltman and Costas, and c and e are present in the current study.

There is an increasing trend of viewing citation counts as the primary measure of the distinction of a paper. As an illustration of this point, not only did the American Mathematical Society terminate the featured review program, but when the first author requested the list of featured review articles from the AMS, he was told that it was no longer available. Instead, he was offered the list of highly cited articles. We feel that this trend is unfortunate and that identifying important articles from the viewpoint of independent experts is valuable to the community of scholars.

References

American Mathematical Society. (1995). Editorial Statement. Mathematical Reviews, 95(a), 1.

American Mathematical Society. (2019). About MathSciNet. Retrieved August 12, 2019 from https://mathscinet.ams.org/mathscinet/help/about.html?version=2.

Aksnes, D.W., Langfeldt, L., & Wouters, P. (2019). Citations, citation indicators, and research quality: An overview of basic concepts and theories. SAGE Open, 9(1), 1-17. doi:10.1177/2158244019829575.

Aksnes, D.W. & Taxt, R.E. (2004). Peer reviews and bibliometric indicators: A comparative study at a Norwegian university. Research Evaluation, 13(1), 33-41.

Barabási, A.L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.

Bensman, S.J., Smolinsky, L.J., & Pudovkin, A.I. (2010). Mean citation rate per article in mathematics journals: Differences from the scientific model. Journal of the American Society for Information Science and Technology, 61(7), 1440–1463.

Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C.A., & Peracchi, F. (2015). Bibliometric evaluation vs. informed peer review: Evidence from Italy. Research Policy, 44, 451–46.

Bornmann, L. & Daniel, H.-D. (2008a). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1), 45-80.

Bornmann, L. & Daniel, H.-D. (2008b). The effectiveness of the peer review process: Inter-referee agreement and predictive validity of manuscript refereeing at Angewandte Chemie. Angewandte Chemie International Edition, 47(38), 7173–7178.

Bornmann, L. & Osório, A. (2019). The value and credits of n-authors publications. Journal of Informetrics, 13, 540–554.

Campanario, J.M. (1998). Peer review for journals as it stands today—part 1. Science Communication, 19(3), 181-211.

Chedzoy, O.B. (2006). Phi-Coefficient. In Encyclopedia of Statistical Sciences. Wiley Online Library. John Wiley & Sons, Inc. https://doi.org/10.1002/0471667196.ess1960.pub2.

Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104.

Cronin, B. (2005). The hand of science: Academic writing and its rewards. Lanham, Maryland: Scarecrow Press.

Eysenck, H.J., & Eysenck, S.B.G. (1992). Peer review: Advice to referees and contributors. Personality and Individual Differences, 13, 393-99.

Franceschet, M. & Costantini, A. (2011). The first Italian research assessment exercise: A bibliometric perspective. Journal of Informetrics, 5(2), 275-291. doi:10.1016/j.joi.2010.12.002.

Garfield, E. (1962). “Can citation indexing be automated?”, Essays of an Information Scientist, 1, 84-90.

Garfield, E. (1979). Citation indexing, its theory and application in science, technology, and humanities. John Wiley & Sons, Inc, New York.

Gilbert, G.N. (1977). Referencing as persuasion. Social Studies of Science, 7, 113-122.

Hornbostel, S., Bohmer, S., Klingsporn, B., Neufeld, J., & von Ins, M. (2009). Funding of young scientist and scientific excellence. Scientometrics, 79, 171-190.

Jackson, J.L., Srinivasan, M., Rea, J., Fletcher, K.E., & Kravitz, R.L. (2011). The Validity of Peer Review in a General Medicine Journal. PLOS One, 6(7), e22475. doi:10.1371/journal.pone.0022475.

Kravitz, R.L., Franks, P., Feldman, M.D., Gerrity, M., Byrne, C., Byrne, C., & Tierney, W.M. (2010). Editorial Peer Reviewers’ Recommendations at a General Medical Journal: Are They Reliable and Do Editors Care? PLOS One, 5(4), e10072. doi:10.1371/journal.pone.0010072.

Langfeldt, L. (2001). The decision-making constraints and processes of grant peer review, and their effects on the review outcome. Social Studies of Science, 31, 820–841.

Lee, C.J., Sugimoto, C.R., Zhang, G., & Cronin, B. (2013). Bias in Peer Review. Journal of the American Society for Information Science and Technology, 64(1), 2-17.

Li, X., & Thelwall, M. (2012). F1000, Mendeley and traditional bibliometric indicators. In Archambault, É., Gingras, Y., Larivière, V. (Eds.), Proceedings of 17th International Conference on Science and Technology Indicators (pp. 541–551). Montréal: OST and Science-Metrix.

MacRoberts, M. H. & MacRoberts, B. R. (2018). The mismeasure of science: Citation analysis. Journal of the Association for Information Science and Technology, 69

Biochemia Medica, 22(3), 276-82.

Merton, R.K. (1988). The Matthew Effect in Science, II: Cumulative Advantage and the Symbolism of Intellectual Property. Isis, 79(4), 606-623.

Moed, H.F. (2005). Citation analysis in research evaluation. The Netherlands: Springer.

Moed, H.F. (2017). Applied evaluative informetrics. The Netherlands: Springer.

National Research Council. (2009). A Guide to the Methodology of the National Research Council Assessment of Doctorate Programs. The National Academies Press.

National Research Council. (2011). A Data-based Assessment of Research Doctoral Programs in the United States. The National Academies Press.

National Science Board. (2010). Science and Engineering Indicators 2010. (NSB 10-01). National Science Foundation.

Nicholas, D., Watkinson, A., Jamali, H. R., Herman, E., Tenopir, C., Volentine, R., Allard, S. & Levine, K. (2015). Peer review: still king in the digital age. Learned Publishing, 28, 15-21. doi:10.1087/20150104.

Patterson, M.S., & Harris, S. (2009). The relationship between reviewers’ quality-scores and number of citations for papers published in the journal Physics in Medicine and Biology from 2003–2005. Scientometrics, 80(2), 345–351.

Peters, D. P., & Ceci, S. J. (1982). Peer-Review Practices of Psychological Journals - The Fate Of Accepted, Published Articles, Submitted Again. Behavioral and Brain Sciences, 5(2), 187–195.

Price, D.D.S. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5), 292–306.

Rinia, E.J., van Leeuwen, Th.N., van Vuren, H.G., & van Raan, A.F.J. (1998). Comparative analysis of a set of bibliometric indicators and central peer review criteria: Evaluation of condensed matter physics in the Netherlands. Research Policy, , 95-107. doi:10.1016/S0048-7333(98)00026-2.

Rothwell, P.M. & Martyn, C.N. (2000). Reproducibility of peer review in clinical neuroscience: Is agreement between reviewers any greater than would be expected by chance alone? Brain, 123(9), 1964–1969.

Simon, H.A. (1955). On a class of skew distribution functions. Biometrika, 42(3/4), 425–440.

Smolinsky, L., Lercher, A., & McDaniel, A. (2015). Testing theories of preferential attachment in random networks of citations. Journal of the Association for Information Science and Technology, (10), 2132-2145.

Smolinsky, L., & Lercher, A. (2012). Citation rates in mathematics: a study of variation by subdiscipline. Scientometrics, , 911–924. doi:10.1007/s11192-012-0647-3.

Storer, N.W. (1973). Introduction. In R.K. Merton, The sociology of science: Theoretical and empirical investigations. University of Chicago Press.

Tomkins, A., Zhang, M., & Heavlin, W. D. (2017). Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences of the United States of America, (48), 12708–12713. doi:10.1073/pnas.1707323114.

Wainer, J. & Vieira, P. (2013). Correlations between bibliometrics and peer evaluation for all disciplines: The evaluation of Brazilian scientists. Scientometrics, 96, 395-410. doi:10.1007/s11192-013-0969-9.

Waltman, L. & Costas, R. (2014). F1000 recommendations as a potential new data source for research evaluation: a comparison with citations.