Is preprint the future of science? A thirty year journey of online preprint services
Boya Xie
Microsoft Research, [email protected]
Zhihong Shen
Microsoft Research, [email protected]
Kuansan Wang
Microsoft Research, [email protected]
ABSTRACT
KEYWORDS preprint analysis, citation impact, publication rate, time to publish, arXiv, SSRN, bioRxiv
A preprint is a scholarly article that has not yet been peer-reviewed but is made available either in paper format or as an electronic copy. If the curiosity and excitement to push forward the frontier of collective human knowledge make scientists want to share their discoveries as soon as possible with as wide an audience as possible, then the online preprint could be the ultimate manifestation of that spirit of science: accessible to anyone with Internet access shortly after the paper is uploaded. In the past thirty years, the Web has deeply impacted all aspects of our lives and society, including the
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
Figure 1: Overall preprints distribution
In this work, we employ a large-scale multi-disciplinary preprint dataset sourced from the Microsoft Academic Graph (MAG) [41] to explore the trends and impacts of preprint quantitatively. This dataset includes 2.8 million preprints and 69.8 million peer-reviewed articles from 1991 to 2020 and, to the best of our knowledge, it is the largest preprint dataset used for analysis.
• Trends and adoption of preprint. To understand how different research communities and disciplines adopt the preprint culture, we analyze the development of preprints over the past thirty years. The annual number of preprints has increased from a few thousand in 1991 to 227k in 2019, although it still accounts for a small portion of all scientific papers. In the first nine months of 2020, 192k preprints were produced, which represent 6.4% of all papers published in the same period. Zooming in to individual domains: physics, mathematics, and economics adopt the preprint culture the best; increasing popularity is observed in computer science and biology in the last 5 years; and preprint is rare in other domains.
• Impacts on individual researchers. The data shows that when a work is documented in preprint, it reaches an audience 14 months earlier than a peer-reviewed counterpart. We also unveil that the early release is associated with two times more citations for a paper.
• Impacts on research communities
In order to have a holistic view of online preprint and its evolution in different disciplines, we design our analysis to include as many major preprint servers as possible, and cover fields from natural science to social science. MAG, as a Web-scale scholarly dataset, serves the purpose well. This section describes how the preprint dataset is generated from MAG, and what the assumptions and known limitations are.
MAG is a heterogeneous graph containing scientific publications, citation relationships, as well as other academic entities such as authors, institutions, journals, conferences, and fields of study. The graph is updated weekly to include newly released scholarly articles on the Web. As of October 2020, there are over 240 million scientific documents from 1800 to 2020 in MAG with eight different types: journal, conference, book, patent, book chapter, repository, thesis, and other. The preprint servers covered in this work are listed in Table 1.
In total, 2.8 million preprints are found in MAG based on document types and URL domains, among which 1.7 million are contributed by arXiv, followed by SSRN. SSRN claims to host 0.95 million abstracts and 0.82 million full text papers [43]. However, MAG only contains 0.76 million preprints sourced from SSRN. We notice there is a significant drop in the number of SSRN papers posted online after 2016 in MAG, coincident with Elsevier's acquisition of SSRN. Since we did not find any other source providing downloadable SSRN data, MAG SSRN data post 2016 is treated as incomplete, and related evaluations are affected accordingly. Besides SSRN, the collections for Keldysh Institute preprint, PsyArXiv, and SocArXiv are also incomplete in MAG. There are other smaller preprint servers not covered in this study, but we believe the major players in the preprint services are included and the data is sufficient to unveil the key characteristics of the preprint landscape.
Online preprints started in 1991 with the launch of the arXiv server. Less than 1% of the preprints in MAG predate 1991. Therefore this study only focuses on preprints starting from 1991. In order to compare preprints with peer-reviewed papers, journal and conference papers from 1991 to 2020 are also included. We exclude other types of documents (e.g. technical reports) with the assumption that a vast majority of journal and conference papers are peer-reviewed while other types of documents may not go through a rigorous peer endorsement process. All analyses performed in this study are based on the past 30 years of data unless otherwise specified.
Server      Self-reported size   Size in MAG   Year range
arXiv       1,777,025            […]
[…]                              […] 721       1986-present
SocArXiv    6,439                […] 564       2016-present
chemRxiv    N.A.                 307           2017-present
PsyArXiv    11,959               […] 28        2016-present
Table 1: Online preprint servers covered in this study

All data presented in this study are based on MAG 2020-10-01 version.
https://arxiv.org/ accessed Oct. 14, 2020
https://papers.ssrn.com/sol3/DisplayAbstractSearch.cfm accessed Oct. 14, 2020
http://api.biorxiv.org/reports/content_summary
https://vixra.org/ accessed Oct. 14, 2020
https://peerj.com/preprints/
https://osf.io/preprints/socarxiv accessed Oct. 14, 2020
https://psyarxiv.com/ accessed Oct. 14, 2020

Figure 2: Paper groups and sizes used in this analysis

To facilitate evaluations, we split preprints, journal and conference papers from 1991 to 2020 into three mutually exclusive groups (P-only, JC-only, P-JC) based on whether they are submitted to preprints or are published at a peer-reviewed journal or conference. In total, five groups are defined below and we summarize their relationships and sizes in Figure 2 and their usages in subsequent analysis in Table 2.
• P-only: papers only submitted to preprints;
• JC-only: papers only published;
• P-JC: papers submitted to preprints and published;
• P-all (P-only + P-JC): all preprints;
• JC-all (JC-only + P-JC): all papers published.
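The five paper groups can be illustrated with plain set operations. The following is a minimal sketch, not the pipeline used in this study: the function name `split_groups` and the integer paper IDs are hypothetical, and it assumes that a preprint and its published version have already been linked to a single paper ID.

```python
def split_groups(preprint_ids, published_ids):
    """Split paper IDs into the three mutually exclusive groups and their unions.

    Assumption: a preprint and its journal/conference version share one ID.
    """
    preprints = set(preprint_ids)
    published = set(published_ids)

    p_jc = preprints & published      # submitted to preprints and published
    p_only = preprints - published    # only submitted to preprints
    jc_only = published - preprints   # only published

    return {
        "P-only": p_only,
        "JC-only": jc_only,
        "P-JC": p_jc,
        "P-all": p_only | p_jc,       # all preprints
        "JC-all": jc_only | p_jc,     # all published papers
    }


# Hypothetical toy data: paper 3 has both a preprint and a published version.
groups = split_groups(preprint_ids=[1, 2, 3], published_ids=[3, 4, 5, 6])
sizes = {name: len(ids) for name, ids in groups.items()}
print(sizes)
```

Because P-only, JC-only, and P-JC are disjoint by construction, the sizes of P-all and JC-all are simple sums of their parts.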
Preprint and post-print.
Authors normally upload papers to preprint servers either without submitting to a journal, or while waiting for the peer review result or for the official publishing. However, there are also cases in which papers are uploaded to repositories after they are published in journals or conferences; these are called post-prints. Some servers restrict submission to pre-publication only, while others host both. ArXiv and SSRN allow both preprint and post-print. Since there is no clear signal to differentiate post-prints from preprints, and we estimate there are about 4% post-prints in arXiv and 2% in SSRN, we treat all articles in arXiv and SSRN as preprints in this study.

Section   Figure   Usage                        Measurement
1 & 3.2   1        Preprint distribution        P-all
3.1       3        Trend                        P-all and P-all/(P-all + JC-all)
3.2       4        Trend by domain              P-all/(P-all + JC-all)
4.1       5        Days to publish by server    P-JC (publishDate - onlineDate)
4.1       6        Days to publish by domain    P-JC (publishDate - onlineDate)
4.2       7        Citation impact              P-JC.citation and JC-only.citation
4.2       8        Citation impact by domain    P-JC.citation - JC-only.citation
5.2       9        Publish rate by domain       P-JC/P-all
5.2       10       Publish rate by server       P-JC/P-all
5.3       11       IF distribution              JC-all and P-JC
5.3       12       IF distribution by domain    JC-all and P-JC
Table 2: Usage of the paper groups in subsequent analysis

Preprint and working paper.
Multiple economics paper servers, such as RePEc, NBER, and IMF, refer to unpublished articles as working papers. A working paper shares many things in common with a preprint: both are non-peer-reviewed, and both are made available online preceding publication to reach audiences earlier. Sometimes a working paper presents preliminary results and is less mature than a preprint; at other times the two terms are used interchangeably. Here we treat working papers from the above-mentioned three servers as preprints.
Paper discipline categorization.
To perform the analysis on preprints in individual disciplines, we leverage the domain categorization of each paper in MAG [40]. MAG has a high-quality scientific concept (or field of study) taxonomy constructed semi-automatically. As of October 2020, there are more than 740K concepts organized into a six-level hierarchy with 19 top-level domains, such as biology, computer science, economics, etc. Each paper is assigned at least one top-level domain. Interdisciplinary papers are labeled with multiple top-level domains, e.g. a biochemical paper is labeled with both biology and chemistry. For all domain-related analysis, an interdisciplinary paper is counted once towards each of the domains it belongs to. Physics, mathematics, computer science, economics, and biology are selected for disciplinary analysis because they either have the largest number of preprints or are fast evolving.
Post-print is assumed if the paper's publication date is prior to the preprint online date.
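Two of the conventions above, the post-print rule and the once-per-domain counting of interdisciplinary papers, can be sketched as follows. This is an illustrative sketch only: the function names, the record layout (a `domains` list per paper), and the sample dates are hypothetical and do not reflect the MAG schema.

```python
from collections import Counter
from datetime import date


def is_postprint(online_date, publish_date):
    """A paper is assumed to be a post-print if it was published before going online."""
    return publish_date is not None and publish_date < online_date


def count_by_domain(papers):
    """Count each paper once towards every top-level domain it is labeled with."""
    counts = Counter()
    for paper in papers:
        for domain in set(paper["domains"]):  # set() guards against repeated labels
            counts[domain] += 1
    return counts


# Hypothetical records: the first paper is interdisciplinary.
papers = [
    {"domains": ["biology", "chemistry"]},
    {"domains": ["biology"]},
    {"domains": ["physics"]},
]
domain_counts = count_by_domain(papers)

published_first = is_postprint(date(2018, 5, 1), date(2017, 11, 20))
never_published = is_postprint(date(2018, 5, 1), None)
```

Note that with this counting rule the per-domain totals can sum to more than the number of papers, which matters when comparing domain shares.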
In this section, we provide an overview of the general preprint trends and development in different disciplines. The annual preprint volume and ratio are evaluated.
The number of preprints uploaded online per year has been increasing steadily in the past decades, from 3k in 1991 to 227k in 2019. At least 192k new preprints were made available online in the first nine months of 2020. Previously, the volume of overall scientific literature was found to grow exponentially and double every 15 years on average [11]. Preprints follow a similar growth trajectory but double in less than 10 years. Despite the increasing number, the ratio of preprints to all scientific articles is still very low, only 6.4% in 2020, as shown in Figure 3. ArXiv is the dominant player in preprint services; its change is illustrated in the bottom area of the figure. The volume growth in arXiv defines the overall trend of all preprints. As described in the previous section, SSRN volume is expected to be higher post 2015. BioRxiv becomes an important contributor starting from 2015, as shown in the red area. It has generated the third largest volume in the past five years.
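As a rough sanity check on the doubling claim, a doubling time can be estimated from the two endpoint counts quoted above. This is a simplified sketch that assumes constant exponential growth between 1991 and 2019 and uses the rounded figures (3k and 227k); it is not necessarily how the growth rate was measured in this study, and the function name is hypothetical.

```python
import math


def doubling_time_years(start_count, end_count, years_elapsed):
    """Doubling time under constant exponential growth between two observations."""
    growth_rate = math.log(end_count / start_count) / years_elapsed  # per year
    return math.log(2) / growth_rate


# Rounded endpoint counts from the text: 3k preprints in 1991, 227k in 2019.
preprint_doubling = doubling_time_years(3_000, 227_000, 2019 - 1991)
print(f"Estimated doubling time: {preprint_doubling:.1f} years")
```

Under this simplification the estimate falls between four and five years, which is consistent with the statement that preprint volume doubles in less than 10 years.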
Figure 3: Annual number of preprints (P-all) and preprints (P-all)/all-papers (P-all + JC-all) rate growth
The development of preprint is different in each domain: physics and mathematics are the pioneers, as people in these communities have been more familiar with Web technologies since the early days; others, such as computer science and biology, have become active players within the past 10 years. Figure 1 shows the preprint distribution in various disciplines and the servers hosting the articles. Among all preprints, half are physics and mathematics papers, and 96% of physics and mathematics preprints are uploaded to arXiv. Considering the history of online preprints, it is not surprising to find that 31% of all preprints are physics papers. Publishing papers as a preprint is the norm in physics. Figure 4 exhibits that three out of ten physics papers were preprints in the past twenty years. On the other hand, mathematics is where we see the second largest volume, and there has been an increasing level of interest in using preprints among mathematicians. Although computer science has the third largest number of preprints, the majority (95%) of computer science papers are non-preprints due to the high throughput of computer science literature. Besides physics and mathematics, which have a high ratio of preprints to all-papers at 27.7% and 15.7% respectively, economics has a relatively high average ratio at 20.8% from 1991 to 2015. SSRN hosts the most economics preprints, as well as preprints from other social sciences such as business and political science, as shown in Figure 1. While many other disciplines have less than 2% preprints, biology has been promoting preprints since 2013, and good momentum is observed in the past five years. Biology has 6.5% preprints in 2020 and the majority of them are in bioRxiv.
Figure 4: Yearly preprints (P-all)/all-papers (P+JC) trend by domain
Multiple articles have discussed the concerns that authors might have over submitting preprints: scooping, being precluded from journal publication, the implication of low quality, etc. [9, 10, 22, 38]. On the other hand, the benefits of preprints are frequently mentioned as well. Since it is not straightforward to design quantitative measurements of the above-mentioned pitfalls from the publishing data, we instead examine some claimed benefits of preprints for individual researchers.
Presenting scientific findings through a paper puts a timestamp on the work. The lengthy peer review and publishing process could take months or even years. In contrast, uploading a preprint to get a permanent, openly accessible timestamp only takes a few days. In addition to this unbeatable advantage of time-stamping a paper, preprints enable the work to receive credit (in the form of citations) faster, which is crucial for young researchers whose career progression heavily depends on timely recognition of their work [38].
Figure 5: Days to publish by preprint server
Figure 6: Days to publish by domain
Here, we analyze the duration between a preprint's initial online date and the official journal/conference publication date to evaluate how much earlier the preprint lets an article reach an audience and gain potential recognition. The analysis is conducted from two perspectives: the difference in days for each preprint server, and for different disciplines. Figure 5 exhibits the measurement for six major preprint servers and the average of all preprints. The average days to publish of different preprint servers are listed in the legend of Figure 5. ArXiv has similar days to publish to the overall average of all servers. Both are around 400 days. It takes much longer to publish for papers from NBER, SSRN, and RePEc (shown with dotted lines in Figure 5), which is 0.5 to 1.1 times longer compared to the average. These three servers are where most economics preprints are submitted. This result demonstrates a huge benefit that preprints bring to economics researchers for earlier visibility. We assume that a first version preprint is similar to a ready-for-peer-review paper, and both take a similar amount of time to be published in peer-reviewed destinations. Figure 6 implies economics papers take the longest time to publish, at 822 days. This finding is consistent with the observation of a prolonged peer review process in economics [30, 35]. By submitting to preprint servers, an economics researcher can get recognition 2.25 years earlier. Biology has the shortest lag, about 7 months. Figure 5 shows preprints in medRxiv only take 103 days to publish. However, medRxiv only launched in June 2019. Many papers in medRxiv might still be undergoing the review and publishing process and are thus excluded from the plots in Figure 5. The measurement for medRxiv will be more accurate when more data becomes available over time to overcome the current bias from the early published "survivors".
To conclude, preprints help papers become accessible 7 months to 2.25 years earlier than peer-reviewed counterparts, and economics researchers benefit from it the most.
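The days-to-publish measurement (publishDate - onlineDate for P-JC papers) can be sketched with standard date arithmetic. The dates below are made-up examples, not records from the dataset, and the function name `days_to_publish` is hypothetical; the last line only converts the 822-day figure reported for economics into years.

```python
from datetime import date
from statistics import mean


def days_to_publish(online_date, publish_date):
    """Number of days between the preprint going online and official publication."""
    return (publish_date - online_date).days


# Hypothetical (online_date, publish_date) pairs for P-JC papers.
records = [
    (date(2015, 1, 10), date(2016, 2, 14)),
    (date(2018, 3, 1), date(2018, 9, 27)),
]

lags = [days_to_publish(online, published) for online, published in records]
average_lag = mean(lags)

# Converting the reported 822-day lag for economics into years.
economics_lag_years = round(822 / 365.25, 2)
print(lags, average_lag, economics_lag_years)
```

A negative lag would indicate a post-print under the assumption stated earlier, so a real analysis would need to decide how to treat such records before averaging.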
The early visibility provided by preprints generates opportunities for scientific work to make more impact. One of the most direct and recognized measurements of research impact is paper citations. Citations are found to follow a power-law distribution [19, 46, 48] and the dynamics are universal for all disciplines [11]. Preprints are no exception: Appendix A shows that preprint citations follow a power-law distribution. To illustrate whether preprints are associated with more citations, we compare the citations of published papers with (P-JC) and without (JC-only) a preprint version.
Figure 7: Overall citation impact
The yearly median citations of the four groups of papers, P-JC, JC-only, P-all, and JC-all, are illustrated in Figure 7. It shows that papers with preprints have more citations in any year regardless of whether they have a published version. Averaged over all years, a journal or conference paper with a preprint version (P-JC) has a median citation count of 14.8, while a non-preprint counterpart (JC-only) receives 2.6 citations, which results in a difference of 12.2 citations (five times more). All preprints (P-all), regardless of whether they have been published in a journal or conference, have 0.7 more citations than papers without a preprint version (JC-only).
Figure 8 demonstrates the citation difference between papers with and without preprints in different domains. The citation difference is defined as the yearly average citation per published paper with preprints (P-JC) minus the yearly average citation per paper without preprints (JC-only). The citation difference is positive, which indicates that published papers with a preprint version (P-JC group) have bigger impacts, as the citation count of the P-JC group is higher than that of the JC-only group. Here we reveal that preprint has a positive citation impact in all domains, among which economics has the largest citation difference (6.9-fold more) and mathematics has the smallest (0.95-fold more). The number of citations and citation differences for each discipline are listed in Table 3.
Figure 8: Citation impact by domain (citation of published paper with preprint (P-JC) minus citation of published paper without preprint (JC-only))
Domain        Paper with preprint    Paper without preprint    Citation ratio
              citations (P-JC)       citations (JC-only)       (P-JC/JC-only)
Economics     220                    28                        7.86
CS            116                    21                        5.52
Physics       50                     18                        2.78
Others        72                     31                        2.32
Biology       68                     31                        2.19
Mathematics   43                     22                        1.95
Overall       74                     24                        3.08
Table 3: Citation impact
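The ratio column of Table 3, and the "fold more" figures quoted in the text, follow directly from the two citation columns. The sketch below recomputes them from the table values; the variable and function names are hypothetical.

```python
# (P-JC citations, JC-only citations) per domain, copied from Table 3.
table3 = {
    "Economics": (220, 28),
    "CS": (116, 21),
    "Physics": (50, 18),
    "Others": (72, 31),
    "Biology": (68, 31),
    "Mathematics": (43, 22),
    "Overall": (74, 24),
}


def citation_ratio(with_preprint, without_preprint):
    """Citations of papers with a preprint relative to papers without one."""
    return with_preprint / without_preprint


ratios = {
    domain: round(citation_ratio(*counts), 2) for domain, counts in table3.items()
}

# "N-fold more" is the ratio minus one (the excess over the JC-only baseline).
economics_fold_more = round(citation_ratio(*table3["Economics"]) - 1, 1)
mathematics_fold_more = round(citation_ratio(*table3["Mathematics"]) - 1, 2)
print(ratios, economics_fold_more, mathematics_fold_more)
```

Distinguishing the ratio (7.86 for economics) from the excess over baseline (6.9-fold more) explains why the text and the table report different numbers for the same domain.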
Do preprints contribute to the research community as peer-reviewed articles do, or do they dilute novel findings with inaccurate or even incorrect conclusions? Do preprints serve better in certain scenarios where peer-reviewed journals or conferences have limitations? In this section, we address the quality concerns of preprints by first reviewing and summarizing the established ways of quality control on preprint servers. We then assess the population of preprints ultimately being published in a peer-reviewed venue and evaluate the impact factors of these venues to provide further quantitative evidence of the preprints' quality. Finally, we review and discuss the important role of preprints in recent public health emergencies and how preprints help to speed up the dissemination of biomedical knowledge in a global pandemic.
• Author control:
• Screening process: many preprint servers perform checks on completeness, relevance, plagiarism, appropriate language and contents, as well as ethical and legal compliance [4, 24, 26, 44].
• Extended content review: bioRxiv and medRxiv reject materials that might pose a health or biosecurity risk [26, 27]. Additionally, preprint platforms take extra caution when a paper might have an immediate impact on public health. ArXiv and ChemRxiv have enhanced their screening of COVID-19 related articles, and bioRxiv has stopped accepting articles "making predictions about treatments for COVID-19 solely on the basis of computational work" [25].
The peer review process functions as an error and fraud capturing tool; therefore peer-reviewed papers are perceived as quality guaranteed today. Oftentimes, the value of a paper is linked to the prestige of the publication venue, especially when it comes to grant application or career progression. Nevertheless, peer review has its own limitations. Besides issues such as sustainability, peer review fraud, unfair reviews, and incompetent reviewers [2, 12, 45], several studies have shown that peer review does not always detect errors, and that research rejected in peer review is not necessarily of low quality; some rejected works are even Nobel Prize winning discoveries [3, 6, 42, 49].
Putting the criticisms of peer review aside, publishing at a well-branded journal or conference is still one of the most common ways to certify the quality of an article. Evaluating 2.8 million preprints in MAG, we find that 41% of them are eventually published at peer-reviewed venues, as shown in Figure 9. Physics and mathematics are the top two domains with the most preprints published in journals or conferences, 69% and 51% respectively. Comparably stable publication rates are observed in all domains except biology, which jumps from 23% in 2009 to over 80% in 2017. Notice that the new player bioRxiv has the highest publication rate among all preprint servers shown in Figure 10. The publication rate of bioRxiv in our result is consistent with the value reported by Abdill and Blekhman [1]. BioRxiv is the major contributor to the elevated preprint publication rate in biology in recent years. MedRxiv is a new medical preprint server launched by the same founder as bioRxiv, Cold Spring Harbor Laboratory, in June 2019. Within one year, medRxiv already has 32% of papers published. This number is likely to grow over time as some papers might be waiting to be published. The three dashed lines in Figure 10 represent the publication rate for the three most used economics preprint servers. NBER has the best rate among the three, close to arXiv. SSRN comes second, but has less than half of the NBER publication rate. RePEc has an even larger portion of preprints that stay unpublished, with less than 10% published.
Figure 9: Preprint publication rate (P-JC/P-all) by domain
Figure 10: Preprint publication rate (P-JC/P-all) by server
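The publication rate plotted in these figures is the share of preprints that have a published version (P-JC/P-all), computed per domain or per server. A minimal sketch of that grouping follows; the record fields (`server`, `published`), the server names, and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def publication_rate_by_key(preprints, key):
    """Fraction of preprints with a published version (P-JC / P-all), grouped by `key`."""
    totals = defaultdict(int)
    published = defaultdict(int)
    for preprint in preprints:
        group = preprint[key]
        totals[group] += 1
        if preprint["published"]:
            published[group] += 1
    return {group: published[group] / totals[group] for group in totals}


# Hypothetical P-all records for two made-up servers.
p_all = [
    {"server": "server_a", "published": True},
    {"server": "server_a", "published": True},
    {"server": "server_a", "published": True},
    {"server": "server_a", "published": False},
    {"server": "server_b", "published": False},
    {"server": "server_b", "published": False},
]
rates = publication_rate_by_key(p_all, key="server")
print(rates)
```

As noted for medRxiv, a rate computed this way understates the eventual publication rate for recently launched servers, because preprints still under review count as unpublished.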
With the cohort of published preprints (P-JC group), we further analyze their destination journals and conferences. To check whether preprints tend to be accepted by less influential venues, we use the 2019 two-year impact factor as the impact indicator and plot the number of published preprints (P-JC) and the number of all published papers (JC-all) with respect to their destination impact factors in Figures 11 and 12. The 2019 impact factor is calculated as the average citation received in 2019 for papers published in 2017 and 2018 using MAG data:
\[ IF_{2019} = \frac{\mathit{Citations}_{2019}}{\mathit{Publication}_{2017} + \mathit{Publication}_{2018}} \]
Since the calculation of the impact factor is based on publications in 2017 and 2018, only papers published in these two years are included in this distribution analysis. There are 2,278,117 papers published in 2017 and 2018, and among them 152,799 papers have a preprint version. Figure 11 illustrates the paper distributions over venue impact factors, where the solid blue line denotes published preprints, i.e. P-JC papers published in 2017 and 2018, and the dashed black line represents all papers, i.e. JC-all papers published in 2017 and 2018. The distributions of the two groups are closely aligned with each other and both are right skewed with similar means. The published preprints are accepted by venues with higher impacts (impact factor 5.2 and 4.2 for P-JC and JC-all groups respectively). When we take a closer look at different domains, in physics, computer science, and biology, published preprints shift more towards higher impact factors in Figure 12. To showcase the popular preprint destination venues, we present twenty-four top journals and fourteen top conferences with the highest numbers of preprint papers in Appendix B. Four top journals from each domain are listed. Since there are not many disciplines which regard peer-reviewed conferences as primary publishing venues except for computer science, the top conferences listed are all in the computer science domain.
Figure 11: 17’-18’ published paper (JC-all) and published preprint (P-JC) distribution over venue impact factors (* calculated with MAG data)
https://en.wikipedia.org/wiki/Impact_factor
Figure 12: 17’-18’ published paper (JC-all) and published preprint (P-JC) distribution over venue impact factors by domain (* calculated with MAG data)
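The 2019 two-year impact factor used in this venue analysis, i.e. the citations received in 2019 by a venue's 2017 and 2018 papers divided by the number of those papers, can be sketched as a small function. The function name, argument names, and the example counts are hypothetical; how citations and publications are counted in MAG is not specified here.

```python
def two_year_impact_factor(citations_2019, publications_2017, publications_2018):
    """2019 impact factor: 2019 citations to 2017-2018 papers per 2017-2018 paper."""
    total_publications = publications_2017 + publications_2018
    if total_publications == 0:
        raise ValueError("venue has no publications in 2017 or 2018")
    return citations_2019 / total_publications


# Hypothetical venue: 60 papers in 2017, 40 in 2018, cited 520 times in 2019.
example_if = two_year_impact_factor(
    citations_2019=520, publications_2017=60, publications_2018=40
)
print(example_if)
```

Raising an error for venues with no 2017 or 2018 papers is a design choice of this sketch; such venues simply have no defined 2019 impact factor and would be left out of a distribution plot.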
In September 2015, a statement from a WHO Consultation urged timely sharing of pre-publication data during public health emergencies [33]. In the following year, Wellcome Trust, together with 30 other global health bodies, called for rapid and open access to Zika research data [47]. Both statements emphasized the importance and necessity of sharing preliminary data faster in such disease outbreaks. The COVID-19 pandemic that started in December 2019 had infected more than 34.8 million people and caused 1 million deaths around the globe by October 2020 according to the WHO situation reports [34]. Compared with other outbreaks in the 21st century, such as SARS, H1N1 swine flu, MERS, Ebola, and Zika, scientists have responded to COVID-19 much faster. The number of papers related to COVID-19, especially preprint papers, has surpassed that of any other outbreak topic. Preprints have been utilized by researchers to an unprecedented degree to share their studies on pathogenesis, epidemiological characteristics and estimation, treatment, vaccines, etc. Fraser et al. [13] reported that 37.5% of COVID-19 papers are hosted on preprint servers and that as many as 250 new COVID-19 preprints appear weekly. The top preprint articles have been reported in more than 300 news stories and tweeted over 10 thousand times. The swift release process of preprints not only lets researchers access the latest findings earlier but also bridges the communication between laboratories and the public more efficiently at this uniquely challenging time.
In conclusion, preprint servers have implemented several screening processes for quality control. 41% of preprints are published in peer-reviewed journals or conferences, and their destination venues are as influential as those of papers without a preprint version, if not more. Preprints contribute to the research community by facilitating rapid open access to research results, which is highly valuable during public health emergencies.
To the best of our knowledge, there are more than sixty online preprint servers active today. Despite the absence of an overview which covers all disciplines with large scale data, various evaluations have been carried out on a specific subject area or server. The most frequently evaluated preprint servers are arXiv and bioRxiv; popular evaluated subject areas are high energy physics, biomedicine, economics, and COVID-19. Gentil-Beccot et al. [16] analyzed the citation and reading behavior in arXiv with 286,180 high energy physics journal articles and preprints which appeared from 1991 to 2007. Kelk and Devine [23] did a scienceographic comparison between arXiv and viXra by sampling 20 physics papers that appeared between 2007 and 2012 from each of the two servers. A detailed overview of all 37,648 preprints uploaded to bioRxiv from 2013 to 2018 was performed by Abdill and Blekhman [1]. Serghiou et al. [39] assessed 776 bioRxiv preprints posted between 2013 and 2017 to measure their Altmetric scores and citations. A similar study assessing 5,405 bioRxiv preprints was done by Fu and Hughey [14]. Baumann and Wohlrabe [7] investigated four working paper series in RePEc to estimate the publish rate of economics working papers. Li et al. [28] compared the citations received across four repositories, arXiv, RePEc, SSRN and PMC, by sampling 384 papers. While the COVID-19 pandemic is affecting the whole world, a vast number of studies related to this disease have been carried out, including analyses of the preprints' performance during the pandemic [8, 13, 17, 25]. Key characteristics and policies of 44 biomedical preprint platforms were reviewed in [24]. Besides data analysis, there have also been many discussions related to preprints, such as the reasons to use and not to use preprints and their value to early career researchers [9, 22, 29, 38].
In this work, we provide a quantitative overview of preprints from all disciplines in the past thirty years. Despite the increasing number of preprint servers and the effort from major research funders adopting policies to promote preprints, the preprint has not been a popular form for most scientists. Generally, physics and mathematics are the two disciplines that adopt preprint culture the best. They have a long history of using online preprint archives, produce the largest number of preprints (50% of all preprints), and have high preprint to all-paper ratios as well as the highest preprint publication rates. Biology has seen a significant increase in recent years, both in the number of preprints and in the publication rate.
One major advantage of preprint is its immediate online release in contrast to the long peer-reviewed publishing process. This study shows that by using preprints, individual researchers communicate their findings to their audiences 14 months ahead of peer-reviewed papers, and the work is cited five times more compared to articles only published in peer-reviewed venues without a preprint version. Researchers in economics gain the most time and citation advantage when they utilize preprints.
Besides the benefit to individual researchers, preprints also contribute to the research community by providing a platform which shares valuable research results in a timely fashion. The quality of preprint papers is assessed through the discussion of existing quality control strategies, the preprint publication rate, and the quality of their destination venues. The 41% publication rate and the similar destination impact factor distribution indicate the quality of preprints is on par with peer-reviewed papers. Last but not least, we emphasize the important role that preprint plays in response to disease outbreaks, where its swift publishing mechanism shows an unbeatable advantage in such time-sensitive events.
Our analysis has shown that, in the past thirty years, the format of scholarly communication has evolved together with the rapid development of Web technology. The scientific publishing landscape has gradually shifted to embrace the preprint culture. The ultimate goal of scholarly communication is to disseminate knowledge and new discoveries more rapidly, widely, and cost-effectively. Our work provides quantitative evidence to demonstrate that online preprint services could bring us one step closer to this grand goal. We have witnessed the fast adoption and flourishing of the preprint culture in biomedical domains in the past five years with the promotional efforts from research organizations, publishers, and high profile researchers. Despite our extensive analysis of the positive impacts of preprints, there are still uncharacterized facets of the reserved attitudes towards preprints among many researchers. How to better promote preprints in different communities will need a closer look by domain experts and the integration of other data besides scholarly publications themselves.
REFERENCES
[1] Richard J. Abdill and Ran Blekhman. 2019. Tracking the popularity and outcomes of all bioRxiv preprints. eLife
Science 321 (July 2008), 15. https://doi.org/10.1126/science.1162115
[3] David B. Allison, Andrew W Brown, Brandon J George, and Kathryn A Kaiser. 2016. Reproducibility: A tragedy of errors. Nature 530 (Feb. 2016), 27–29. https://doi.org/10.1038/530027a
[4] arXiv. 2019. Our Moderation Process. Retrieved September 30, 2020 from http://blog.arxiv.org/2019/08/29/our-moderation-process/
[5] arXiv. 2020. The arXiv endorsement system. Retrieved August 31, 2020 from https://arxiv.org/help/endorsement
[6] Stefano Balietti. 2016. Here’s how competition makes peer review more unfair. The Conversation (Aug. 2016). https://theconversation.com/heres-how-competition-makes-peer-review-more-unfair-62936
[7] Alexandra Baumann and Klaus Wohlrabe. 2020. Where have all the working papers gone? Evidence from four major economics working paper series. Scientometrics (July 2020), 2433–2441. https://doi.org/10.1007/s11192-020-03570-x
[8] Nature Biotechnology (Ed.). 2020. All that’s fit to preprint. Nature Biotechnology (May 2020). https://doi.org/10.1038/s41587-020-0536-x
[9] Philip E. Bourne, Jessica K. Polka, Ronald D. Vale, and Robert Kiley. 2017. Ten simple rules to consider regarding preprint submission. PLoS Comput Biol
Science (March 2018). https://science.sciencemag.org/content/359/6379/eaao0185.full
[12] Henry Fountain. 2014. Science Journal Pulls 60 Papers in Peer-Review Fraud. The New York Times
bioRxiv (May 2020). https://doi.org/10.1101/2020.05.22.111294
[14] Darwin Y Fu and Jacob J Hughey. 2019. Releasing a preprint is associated with more attention and citations for the peer-reviewed article. eLife
APS News
Scientometrics 84 (Dec. 2010), 345–355. https://doi.org/10.1007/s11192-009-0111-1
[17] Silvia Gianola, Tiago S Jesus, Silvia Bargeri, and Greta Castellini. 2020. Characteristics of academic publications, preprints, and registered clinical trials on the COVID-19 pandemic. PLoS One 6, 15 (Oct. 2020). https://doi.org/10.1371/journal.pone.0240123
[18] Paul Ginsparg. 2011. ArXiv at 20. Nature 476 (Aug. 2011), 145–147. https://doi.org/10.1038/476145a
[19] Michael Golosovsky and Sorin Solomon. 2012. Runaway events dominate the heavy tail of citation distributions. European Physical Journal-special Topics
The Scholarly Kitchen (Oct. 2019). https://scholarlykitchen.sspnet.org/2019/10/16/the-second-wave-of-preprint-servers-how-can-publishers-keep-afloat/
[22] Jocelyn Kaiser. 2017. Are preprints the future of biology? A survival guide for scientists. Science News (Sept. 2017). https://doi.org/10.1126/science.aaq0747
[23] David Kelk and David Devine. 2012. A Scienceographic Comparison of Physics Papers from the arXiv and viXra Archives. arXiv (Nov. 2012). https://doi.org/1211.1036
[24] Jamie J Kirkham, Naomi Penfold, Fiona Murphy, Isabelle Boutron, John PA Ioannidis, Jessica K Polka, and David Moher. 2020. A systematic examination of preprint platforms for use in the medical and biomedical sciences setting. bioRxiv (2020). https://doi.org/10.1101/2020.04.27.063578
[25] Diana Kwon. 2020. How swamped preprint servers are blocking bad coronavirus research. Nature
Aslib Journal of Information Management 67, 6 (Nov. 2015), 614–635. https://doi.org/10.1108/AJIM-03-2015-0049
[29] Miguel Mayo-Yánez. 2020. Research during SARS-CoV-2 pandemic: To “Preprint” or not to “Preprint”, that is the question. Med Clin (Barc)
J. Chem. Doc. 5, 3 (Aug. 1965), 126–128. https://doi.org/10.1021/c160018a003
[31] Richard Van Noorden. 2016. Social-sciences preprint server snapped up by publishing giant Elsevier. Nature
SLAC-PUB-0710
PLoS Biology
JAMA
ACL 2018: 56th Annual Meeting of the Association for Computational Linguistics. 87–92.
[41] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243–246. https://doi.org/10.1145/2740908.2742839
[42] Richard Smith. 2006. Peer review: a flawed process at the heart of science and journals. Journal of the Royal Society of Medicine 99, 4 (April 2006), 178–182. https://doi.org/10.1258/jrsm.99.4.178
[43] SSRN. 2020. SSRN eLibrary Statistics. Retrieved September 11, 2020 from https://papers.ssrn.com/sol3/DisplayAbstractSearch.cfm
BMC Med. 12, 179 (Sept. 2014). https://doi.org/10.1186/s12916-014-0179-1
[46] C Clara Stegehuis, N Nelly Litvak, and LR Waltman. 2012. Predicting the long-term citation impact of recent publications. Journal of Informetrics
Journal of the Association for Information Science and Technology 63 (2012), 72–77.
[49] Eva Xiao. 2020. Scientific Journal Pulls Over a Dozen Papers by Chinese Researchers. The Wall Street Journal
A PREPRINT CITATION DISTRIBUTION
B SELECT JOURNALS AND CONFERENCES
| Journal | Domain | Paper count with preprint | Journal paper count | Preprint rate | Citation per paper |
|---|---|---|---|---|---|
| Physical Review D | Physics | 61,584 | 87,816 | 0.70 | 34 |
| Physical Review B | Physics | 52,946 | 163,086 | 0.32 | 36 |
| The Astrophysical Journal | Physics | 51,359 | 86,688 | 0.59 | 62 |
| Physical Review Letters | Physics | 40,496 | 104,261 | 0.39 | 82 |
| Journal of Algebra | Mathematics | 3,959 | 12,604 | 0.31 | 14 |
| Advances in Mathematics | Mathematics | 3,468 | 6,795 | 0.51 | 25 |
| J. Stat. Mech. Theory Exp. | Mathematics | 3,222 | 4,955 | 0.65 | 20 |
| Trans Am Math Soc | Mathematics | 2,992 | 8,094 | 0.37 | 24 |
| IEEE Trans Wirel Commun | CS | 1,450 | 9,541 | 0.15 | 47 |
| IEEE Trans Commun | CS | 1,200 | 10,353 | 0.12 | 51 |
| IEEE Trans. Inf. Theory | CS | 4,109 | 11,844 | 0.35 | 76 |
| IEEE Trans. Signal Process. | CS | 1,666 | 15,349 | 0.11 | 61 |
| J Bank Financ | Economics | 2,171 | 4,970 | 0.44 | 88 |
| Am Econ Rev | Economics | 2,050 | 5,677 | 0.36 | 278 |
| Journal of Financial Economics | Economics | 1,969 | 2,881 | 0.68 | 310 |
| Management Science | Economics | 1,945 | 8,285 | 0.23 | 94 |
| eLife | Biology | 2,044 | 18,958 | 0.11 | 15 |
| PLoS Computational Biology | Biology | 1,339 | 8,183 | 0.16 | 43 |
| Journal of Theoretical Biology | Biology | 722 | 9,986 | 0.07 | 35 |
| PLoS Genetics | Biology | 672 | 9,479 | 0.07 | 63 |
| Scientific Reports | Multidisciplinary | 5,947 | 119,208 | 0.05 | 15 |
| PLoS ONE | Multidisciplinary | 4,551 | 257,798 | 0.02 | 24 |
| Nature Communications | Multidisciplinary | 4,277 | 33,712 | 0.13 | 53 |
| Nature | Multidisciplinary | 1,949 | 95,790 | 0.02 | 169 |
Table 4: Top journals by number of papers with a preprint version (1991 to Sep. 2020)
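As a quick sanity check on Table 4, the sketch below recomputes the preprint rate for three of its rows. It assumes, since the table does not state the definition explicitly, that the preprint rate is the paper count with preprint divided by the journal paper count, rounded to two decimals. The function name `preprint_rate` is hypothetical, and only the three rows shown are copied from the table.

```python
# Illustrative sketch: recompute the "Preprint rate" column of Table 4.
# Assumption: rate = (paper count with preprint) / (journal paper count).
# `preprint_rate` is a hypothetical helper, not part of the paper.

def preprint_rate(with_preprint: int, total: int) -> float:
    """Return the share of a journal's papers that have a preprint version."""
    if total <= 0:
        raise ValueError("journal paper count must be positive")
    return with_preprint / total


# (journal, paper count with preprint, journal paper count), from Table 4.
rows = [
    ("Physical Review D", 61_584, 87_816),
    ("Journal of Financial Economics", 1_969, 2_881),
    ("PLoS ONE", 4_551, 257_798),
]

# Round to two decimals, matching the precision used in the table.
rates = {name: round(preprint_rate(w, t), 2) for name, w, t in rows}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Under this assumption the recomputed values agree with the table for these rows (0.70, 0.68, and 0.02).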