Reproducible Research: A Retrospective
Roger D. Peng and Stephanie C. Hicks
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
July 27, 2020
Rapid advances in computing technology over the past few decades have spurred two extraordinary phenomena in science: large-scale and high-throughput data collection coupled with the creation and implementation of complex statistical algorithms for data analysis. Together, these two phenomena have brought about tremendous advances in scientific discovery but have also raised two serious concerns, one relatively new and one quite familiar. The complexity of modern data analyses raises questions about the reproducibility of the analyses, meaning the ability of independent analysts to re-create the results claimed by the original authors using the original data and analysis techniques. While seemingly a straightforward concept, reproducibility of analyses is typically thwarted by the lack of availability of the data and computer code that were used in the analyses. A much more general concern is the replicability of scientific findings, which concerns the frequency with which scientific claims are confirmed by completely independent investigations. While the concepts of reproducibility and replicability are related, it is worth noting that they are focused on quite different goals and address different aspects of scientific progress. In this review, we discuss the origins of reproducible research, characterize the current status of reproducibility in public health research, and connect reproducibility to current concerns about the replicability of scientific findings. Finally, we describe a path forward for improving both the reproducibility and replicability of public health research in the future.

Introduction
Scientific progress has long depended on the ability of scientists to communicate to others the details of their investigations. The exact meaning of "details of their investigations" has changed considerably over time and in recent years has been nearly impossible to describe precisely using traditional means of communication. Rapid advances in computing technology have led to large-scale and high-throughput data collection coupled with the creation and implementation of complex statistical algorithms for data analysis. In the past, it might have sufficed to describe the data collection and analysis using a few key words and high-level language. However, with today's computing-intensive research, the lack of details about the data analysis in particular can make it impossible to re-create any of the results presented in a paper (33). Compounding these difficulties is the impracticality of describing these myriad details in traditional journal publications using natural language. To address this communication problem, a concept has emerged known as reproducible research, which aims to provide far more precise descriptions of an investigator's work to others in the field. As such, reproducible research is an extension of the usual communication practices of scientists, adapted to the modern era.

The notion of reproducible research, which was popularized in the early 1990s, was ultimately designed to address an emerging and serious issue at the time (43). Results of published findings were increasingly dependent on complex computations done on powerful computers, often implementing sophisticated algorithms on large datasets. Given the importance of computing to the generation of these results, it was surprising that consumers of scientific results had no ability to inspect or examine the details of the computations being done.
Traditional forms of scientific publication allowed for extended descriptions of study design and high-level analysis approaches, but low-level details about computer code, data processing pipelines, and algorithms were not prioritized and were generally left in an appendix or, with the wider availability of the internet, an online supplement.

Jon Claerbout, a geophysicist at Stanford University, wrote down many of the original ideas concerning reproducibility of computational research. His concern largely focused on developing a software system whereby the research produced by his lab could be passed on to others, including the original authors, the authors' colleagues, students, research sponsors, and the general public. He noted in particular the benefits of reproducibility to the original authors: "It may seem strange to put the author's own name at the top of the list to whom we wish to provide the reproducible research, but it often seems that the greatest beneficiary of preparing the work in a reproducible form is the original author!" It is equally notable that the public was listed last; all of the other constituencies mentioned would likely exist within the small orbit of an individual investigator (10). In Claerbout's discussion, the primary focus is on improving the transparency and productivity of the lab itself, given that much time can be lost attempting to re-create past findings for the sole purpose of understanding what was previously done.

Buckheit and Donoho introduced much of the statistical community to the concept of reproducibility with an influential paper in 1995 detailing their WaveLab software for implementing wavelets for data analysis (9). Citing Claerbout as a strong influence, their rationale for promoting reproducible research produced a useful summary of Claerbout's ideas that has since been repeated many times: an article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship.
The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. The general conclusion was that delivering a research end product such as a figure or table was no longer sufficient. Rather, the software environment and the means to create the end product must also be delivered, as those additional elements represent the actual scholarship. In order to satisfy this requirement, one would have to make available the data and the computer code used to generate the results.
The definition of reproducible research generally consists of the following elements. A published data analysis is reproducible if the analytic datasets and the computer code used to create the data analysis are made available to others for independent study and analysis (33). This definition is sufficiently vague that it ultimately raises more questions than it answers. What is an "analytic dataset"? What does it mean to be "available"? What is included with the "computer code"?

Published research can be thought of as living on a continuum up until the point of publication (34), starting from question formulation and study design, proceeding to data collection, data processing, and data analysis, and finally to presentation. Along this journey, various elements are introduced to aid in executing the research, such as computing environments, measurement instruments, and software tools. One could choose to make available to others any aspect of this sequence, depending on the practicalities of doing so and the relevance to the final published results. It is challenging to develop a universal cut point for determining which aspects of an investigation should be disseminated and which are not required. However, within various research communities, internal standards have developed and continuously evolve to keep pace with technology (e.g., 6, 41).

The analytic dataset generally contains all of the data that can be directly linked to a published result or number. For example, if a paper publishes an estimate of the rate of hospitalization for heart attacks, but the overall study also collected data on hospitalizations for influenza, the influenza data may not be part of the analytic dataset if they make no appearance in the published result and are not otherwise relevant. While outside investigators may be interested in seeing the influenza data (and the original authors may be happy to share them), they are not needed for the sake of reproducible research.
The analytic computer code is any code that was used to transform the analytic dataset into results. This may include some data processing (such as variable transformations) as well as modeling or visualization. Generally, the software environment in which the analysis was conducted (e.g., R, Python, Matlab) does not need to be distributed if it is easily obtainable or open source. However, niche software that may be unfamiliar to many readers may need to be bundled with the data and code.

Upon first consideration, many see reproducibility as a non-issue. How could it be that applying the original computer code to the original dataset would not produce the original results? The practical reality of modern research, though, is that even many simple results depend on a precise record of procedures and processing and the order in which they occur. Furthermore, many statistical algorithms have tuning parameters that can dramatically change the results if they are not input exactly the same way as was done by the original investigators (22). If any of these myriad details are not properly recorded in the code, then there is a significant possibility that the results produced by others will differ from the original investigators' results.
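To make this concrete, consider a minimal sketch (our own illustration in Python, not code from any study discussed in this review) of how unrecorded details, here a resample count and a random number generator seed, can silently change a published number even when the data and code are otherwise shared:

```python
import numpy as np

def bootstrap_ci(data, n_boot=2000, seed=None):
    """Percentile bootstrap 95% CI for the mean.

    The output is exactly reproducible only when both inputs that
    affect it -- the resample count and the RNG seed -- are recorded.
    """
    rng = np.random.default_rng(seed)
    boot_means = [
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_boot)
    ]
    return np.percentile(boot_means, [2.5, 97.5])

data = np.arange(20.0)  # toy "analytic dataset" with mean 9.5

# With the seed recorded alongside the code, two runs agree exactly.
lo1, hi1 = bootstrap_ci(data, seed=2020)
lo2, hi2 = bootstrap_ci(data, seed=2020)
assert (lo1, hi1) == (lo2, hi2)

# Omit the seed (or change n_boot) and re-runs will generally produce
# slightly different interval endpoints from identical data and code.
lo3, hi3 = bootstrap_ci(data)
```

The function name, seed value, and dataset are hypothetical, chosen only to show that "the results" of an analysis can depend on inputs that never appear in a methods section.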
The terminology of reproducible research can be bewildering to some in the scientific community because there is little agreement about the meaning of the phrase in relation to other related concepts (3, 21, 35). In particular, one related concept with which all scientists are concerned is what we refer to here as replication. In this review, we define replication as the independent investigation of a scientific hypothesis with newly collected data, new experimental setups, new investigators, and possibly new analytic approaches. In a thorough investigation of the terminologies of reproducible research, Lorena Barba found that some fields of study made no distinction between "reproducible" and "replicable," while some fields used those terms to mean the exact opposite of how we define them here (3, 30). However, a significant plurality of fields, including epidemiology, medicine, and statistics, appears to adopt the definitions we use here.

A key distinction between reproducibility and replication is that reproducibility does not allow for any real variation in the results. If an independent investigator were to reproduce the results of another investigator with the original data and code, there should not be any variation between the two investigators' results, except for some allowance for differences in machine precision. Thus, exact reproducibility is sometimes referred to as "bitwise reproducibility" (30). However, replication generally allows for differences in results that arise from statistical variability. Two independent investigators conducting the exact same experiment should, in theory, only differ by an amount quantified by the standard deviation of the data. More generous definitions of replication allow for slightly different study designs, analytic populations, or statistical techniques (30). In those cases, differences in results may arise beyond simple statistical variation. Patil et al.
have devised a useful visualization of what may or may not differ when reproducing or replicating a published study (32). The relationship between reproducible research and replication is a topic to which we will return in greater detail in Section 3.

It is difficult to argue that interest in exactly reproducing another investigator's work is anything but a modern phenomenon (14). Interest in reproducibility prior to the computer and internet age was likely low or non-existent given that there was generally no expectation that investigators would share data in papers; there was simply no practical way to do that except for very small datasets. In the past, other investigators could only resort to independently replicating a published study using their own data collection and whatever high-level description of the methods was available in the paper. In this setting, detailed descriptions of the methods of analysis were critical if others were to execute the same approach. If the process of conducting the experiment or analysis was simple enough or sufficiently standardized, then it could be reasonably described within the confines of a journal paper. Suggesting that analyses be described with data and code is a departure from previous ways of communicating scientific results, which relied on describing experiments and analyses in more general terms to give readers the highlights of what was done. A more abstract approach could not be taken with this new form of computational research because the proper abstractions for communicating ideas and standardization of approaches were not yet available.
The concept of reproducible research was developed to achieve arguably modest goals. Its original aims focused on providing an approach to better communicate the details of computationally intensive research to one's collaborators, colleagues, students, and oneself. But two key developments over the past 30 years have changed the context in which reproducible research lives. Although the definition of reproducibility has not changed much since the 1990s, almost everything else about scientific research has.

In much of the early literature on reproducible research, the focus is on "computationally intensive" research which, because of its reliance on complex computer algorithms, was considered perhaps more impenetrable than other research. Fast forward 30 years, and the use of computing in scientific research is ubiquitous. It is no longer the domain of niche geophysical scientists or mathematical statisticians using obscure computer packages. Now, all scientific research involves the use of powerful computers, whether for the data collection, the data analysis, or both. Furthermore, the increase in complexity of statistical techniques over this time period has resulted in the need for detailed descriptions of analytic approaches and data processing pipelines. We are all computational scientists now, and as a result the concept of reproducibility is relevant to all scientists.

Along with computing power, another key advance over the past 30 years has been the development of the internet. Claerbout's original scheme for distributing data and code to others was via CD-ROM disks, which was a perfectly reasonable approach at the time. However, the need for a physical medium greatly limited the transferability of information to a large audience. With the development of the internet, it became possible for academics to distribute data and code to the entire world for seemingly minimal cost.
This increase in distribution reach changed the nature and importance of reproducible research, from primarily improving the internal efficiency of the lab to allowing others to anonymously build on one's own work. The internet dramatically grew the size of an investigator's personal scientific community to include many members beyond one's immediate circle of collaborators. This phenomenon has provided significant benefits to science, but there are some implications that could threaten the viability of maintaining and supporting reproducible research in the long run. Before we consider these implications, we must first consider what the goals of reproducible research are and what problems we want reproducibility to solve.
Beyond communicating the details of an investigation, what are the goals of making research reproducible? The stated goals achieved by making research reproducible have evolved over time since the early 1990s and have become somewhat more elusive. Originally, the goal was to better reveal the process of doing the research. Computational research added a new complexity in the form of software code and high-dimensional datasets, and that complexity made the research process more difficult for a reader to infer. Therefore, the solution was to simply publish every step in the process along with the data. Claerbout and colleagues were concerned that others (including themselves!) would not be able to learn from what they had done if they did not have the details. The easiest way to do this was via the literal computer code that executed the steps. Any less precise format could risk omitting a key step that affected downstream results (22).

Reproducible research comes with a few side benefits. In addition to being able to fully understand the process by which the results were obtained, readers also get the data and the computer code, both of which are valuable to the extent that they can be re-used or re-purposed for future studies or research. Some have suggested that making data and computer code available to others is a per se goal of reproducible research, because both can be built upon and leveraged to further scientific knowledge (18). However, such an interpretation is an extension of the original ideas of reproducibility. The former view saw data and code as a medium for communicating ideas, whereas the latter view sees data and code essentially as products or digital research objects to be used by others (45).
While converting a dataset into a data product and packaging computer code into usable software may seem like nominal tasks given that the underlying data and code already exist, there are non-negligible costs associated with the development, maintenance, and support of these products.

Another goal of reproducible research is to provide a kind of audit trail, should one be needed. In fact, one could suggest a definition of reproducible research as "research that can be checked." Desiring an audit trail for data analyses raises the question of when such an audit trail might be used. In general, one might be interested in seeing the details of the data and the code for an analysis when there is curiosity about how a specific result was reached. Sometimes that curiosity arises because of suspicion of an error in the analysis, but other times there is a desire to learn the details of new techniques or methodologies (22). Thus, reproducibility primarily concerns the integrity and transparency of the data analysis for an investigation. Unlike replication, reproducibility allows for an internal check on the results and is not immediately connected to the context of the outside world.

One could summarize the goal of reproducible research as providing a means to answer the question, "Do I understand and trust this data analysis?" With the computational nature of today's research, we cannot hope to answer that question without being able to look at the data and the code. In addition, we may wish to know things about the experimental or study design as well as the hypothesis being examined (23). Given the claimed results, the data, and the code, one can theoretically determine the reproducibility status of a data analysis.
Reproducibility gives us the means by which we can assess our confidence and trust in an analysis, but it is important to reiterate that the mere fact of reproducibility of an analysis is not a check on the validity of the analysis.

The notion of reproducibility as a binary, or perhaps multi-level, "state" of a data analysis is a useful characterization in part because it is one of the few qualities of a data analysis that can be immediately verified. Unlike with replication, we do not need to wait for future studies to be conducted in order to determine the reproducibility of an analysis. However, this suggests reproducibility's usefulness is limited. What do we ultimately learn from merely reproducing the results of an analysis? For example, it may be possible to execute code on a dataset without ever looking at the code or the data. In that case, the original goal of reproducibility, to learn about the details of an investigation, has been thwarted. We have simply learned that the code produces what the authors claim the code produces. In general, executing a process and seeing that process produce the results exactly as they were expected produces very little new information.

The answer to the question "Do I understand and trust this data analysis?" depends critically on the perspective of the person asking. If the person asking is an expert in the area, they might be able to glance at the code and data and understand immediately what is going on. A non-expert in the field might be able to execute the code and produce results without ever understanding the operations of the analysis. An adjacent question that might be worth asking is "Is this data analysis understandable and trustworthy?" However, this question is not any easier to answer because it hypothesizes underlying objective qualities of a data analysis, and opinions may still vary widely about what those underlying qualities should be depending on who is asking.
To answer either question, one needs to look carefully at the data and the code to learn exactly what was done. But ultimately, the data and code represent only a piece of the answer. Whether an analysis can be understood or trusted depends critically on many aspects outside of the analysis itself, including the perspective of the person reading the analysis.

Nevertheless, one hope is that reproducibility can lead to higher quality data analyses. The logic is that requiring all analyses to provide data and code would put investigators on notice that their work will be scrutinized. However, one high-profile example suggests this is unlikely to be the case.
In a now-retracted 2006 study by Potti et al. (37), the investigators claimed to have identified genomic signatures using microarrays that could predict whether an individual responded well to chemotherapy. The analysis was conducted using data from publicly available cell lines, and so the data were in a sense available. However, subsequent attempts to reproduce the findings failed, and reproducibility was only achieved when errors were deliberately inserted into the analysis code (2, 12). Keith Baggerly, Kevin Coombes, and Jing Wang meticulously reconstructed the error-prone analysis and laid out all of the details in both text and code. Ironically, they were ultimately able to reproduce the analysis of Potti et al. after significant reverse engineering and forensic investigation. In fact, we might never have learned what mistakes were made if Baggerly and his colleagues had not been able to reproduce the analysis.

The example of the Potti study is a pathological example of a reproducible analysis (after much forensic investigation) being profoundly incorrect. However, it is worth asking what role reproducibility might have played in this case. If Potti et al. had released code and data that were clearly linked together, perhaps as a research compendium (18), then the errors could have been found more quickly. However, given the sheer number and complexity of the problems, it still likely would have taken some time to understand them all. Coombes et al. published their letter only a year after the initial publication, so the timeline might have been advanced by a few months. However, a key fact would remain: the flawed analysis was already completed. Furthermore, once the truth was ultimately revealed to the authors, it took years of further investigation by many others before the original paper was retracted.
Examples like the Potti paper raise the question of whether demanding or requiring reproducibility of a study beforehand can pre-emptively improve its quality. Evidence of this connection between reproducibility and quality is lacking, which is not surprising given that the question is somewhat ill-posed. What exactly are we looking for in a "high-quality" data analysis? One could hypothesize that if an investigator knew in advance that the data and the code would be publicly available for scrutiny, then they would take the extra effort to make sure that the analyses were properly done. Perhaps if Potti et al. had been forced to make their code publicly available, they would have checked it first.

In the case of Potti et al., we now know that requiring reproducibility or even just code sharing would not have made much difference. Reporting done by
The Cancer Letter showed definitively that the investigators were aware of numerous statistical and coding errors in the analysis but did not think they were serious problems (20). Rather, they were considered "differences of opinion." The notion that requiring reproducibility can lead to improved data analyses relies on the critical assumption that the investigators are able to recognize what is an error in the first place. If they do recognize the error and hide it, then that is fraud. If they do not recognize the error and publish it anyway, then that is at best careless. However, in both cases, forcing the data and code to be published would not have made any difference.
Reproducibility does not provide a useful route to preventing poor data analyses from occurring, but it does provide the basis for a meaningful discussion about whether there might be problems in the analysis and how such problems might be fixed. Replication differs from reproducibility primarily because it addresses a different goal. Replication answers the question, "Is this scientific claim true?" Reproducibility addresses the integrity of the data analysis that generated the evidence for a scientific claim, while replication addresses the integrity of the claim itself in the context of the outside world. Fundamentally, reproducible research has relatively little to say on the question of external validity. Claims resulting from reproducible results can be both correct and incorrect (28). Claims resulting from irreproducible results are less likely to be true, but that may depend on the reasons for the lack of reproducibility. For example, evidence generated via random algorithms may not be exactly reproducible if random number generator seeds are not saved, but the underlying evidence may still be sound. Ultimately, claims made by irreproducible studies may in fact be true, but irreproducible studies simply do not provide evidence for such claims.

3.1 Example: Re-analysis of Air Pollution Studies
In the mid-1990s, two large studies of ambient air pollution and mortality, the Six Cities Study (13) and the American Cancer Society (ACS) Study (36), were published, presenting evidence that differences in air pollution concentrations between cities were significantly associated with rates of mortality in those cities. Both studies came under intense scrutiny when the U.S. Environmental Protection Agency cited the results in its revision of the National Ambient Air Quality Standards for fine particles. In particular, there were demands from numerous corners that the data used in the studies be made available. However, the data in these studies, as with most health-related studies, included personal information about the subjects, and arguments were made that promises of confidentiality had to be kept. To address the impasse over making the data available, the original investigators engaged the Health Effects Institute (HEI) to serve as a kind of trusted third party to broker a re-analysis of the studies. Ultimately, HEI recruited a research team led by investigators at the University of Ottawa to obtain the original data for both the Six Cities and ACS studies, reproduce the original findings, and conduct additional sensitivity analyses to assess the robustness of the original findings (27).

The extensive re-analysis found that the original studies were largely reproducible, if not perfectly reproducible. For the Six Cities Study, the key result was a mortality relative risk of 1.26, which the re-analysis team computed to be 1.28. For the ACS Study, the original mortality relative risk was 1.17, close to the re-analysis value of 1.18. While one could argue that these studies were, strictly speaking, not reproducible, such small differences are not likely to be material. In fact, we now know, after numerous follow-up studies and independent replications, that the core findings of both studies appear to be true (8) and that the U.S. EPA itself rates the evidence of a connection between fine particles and mortality as "likely causal" (16). The re-analysis team ran many other analyses, including variables that had not been considered in the original studies. Overall, they found that the sensitivity analyses did not change any of the major conclusions. Interestingly, one of the key conclusions of the final report from HEI was that, at the end of the day, "No single epidemiologic study can be the basis for determining a causal relation between air pollution and mortality" (27).

The HEI re-analysis of the Six Cities and ACS studies highlights the role of trust in data analysis. Prior to the re-analysis, many parties simply did not trust that the analysis was done properly or that all reasonable competing hypotheses had been considered. While making the data available might have allowed others to build that trust for themselves, allowing a neutral third party to examine the data and reproduce the findings at least ensured that one other group had seen the data. In addition, HEI's role in organizing the expert panel, conducting public outreach, and managing an open process played an important part in building trust in the community. While not all parties were completely satisfied with the process, the re-analysis allowed fellow scientists to learn from the original studies and gain insight into the process that led to the original findings. Ultimately, the key goals of reproducible research were achieved.

In hindsight, another lesson learned from the HEI re-analysis is that the importance of reproducibility of a given study can fade with time.
Over 25 years later, there have been scores of follow-up studies and replications that have largely come to similar conclusions as the Six Cities and ACS studies. Although both studies remain seminal in the field of air pollution epidemiology, they could be deleted from the literature at this point with little impact on our understanding of the science. This is not to say that the data and ongoing analyses do not have value, but rather that the original results have been subsumed by later studies. Reproducibility was only critical when the studies were first published because of the paucity of large studies at the time.
Recent work has focused on the quality and variability of data analyses published in various fields of study (e.g., 11, 24, 25, 31), with some claiming the existence of a "replication crisis" due to the wide variation between studies examining the same hypotheses (42). The causes of this variation between studies are myriad, but one large category includes various aspects of the data analysis. Because of the increasing complexity of data analyses, many choices and decisions must be made by analysts in the process of obtaining a result. With these increasing complexities we also increase the risk of human error and bias in data analysis. These choices and decisions often have an unknown impact on the final estimates produced and therefore may or may not be recorded by the investigators (45). These "researcher degrees of freedom" allow investigators to unknowingly, or perhaps knowingly, steer data analyses in directions that may support specific hypotheses rather than represent all of the evidence in the data (44).

What role can reproducible research play in improving the quality of data analyses across all fields? The answer can be found in part in the experience of the HEI re-analysis of the Six Cities and ACS air pollution studies. Because they were re-analyses, one could imagine the expectation was that the results would be confirmed to some reasonable degree. If there had been a significant deviation from the published results, then we would have had to dig into the original analysis to discover why. Because the results were largely reproduced, one could argue that little was learned. However, additional analyses were done and sensitivity analyses were conducted. As a result, we learned much about the data analysis process. The re-analysis thus produced valuable knowledge about how to analyze air pollution and health data.

For example, the re-analysis team noted that both mortality and air pollution were highly spatially correlated, a feature that was not considered in the original analysis.
They noted, "If not identified and modeled correctly, spatial correlation could cause substantial errors in both the regression coefficients and their standard errors. The Reanalysis Team identified several methods for dealing with this, all of which resulted in some reduction in the estimated regression coefficients." (27) In addition, reproducibility helps free up time for analysts interested in re-analyzing the data to focus on the parts of the data analysis that require more human interpretation. For example, if an independent data analyst knew that an analysis was already reproducible, then more time and resources would be available to understand why a specific model was chosen, instead of what version of software was used to run the model. In the re-analysis of the data from Potti et al. (37), Baggerly and Coombes noted that they had spent thousands of hours re-examining the data attempting to reproduce the original results (1, 19).

There are also different degrees of reproducibility when building a data analysis, and differences in the audiences that may or may not be allowed to have access to these components. For example, a data analyst may choose to make the data available, but not the code (or the opposite). Others may make both the code and data available for one audience (Audience A), but not for another audience (Audience B). There are valid reasons why an analyst might choose to do this, such as if the data analysis uses data with protected health information in a hospital setting, or if the data analyst works at a business or company and cannot share the code or data with others outside of the company. It is important to note that just because an analysis is not fully reproducible to one audience (Audience B) does not mean that it is an invalid analysis with incorrect conclusions. While the lack of access makes it harder for Audience B to trust the results, the analysis can still be valid and correct.
However, the lack of reproducibility to this audience may mean that the evidence supporting any claims is weaker. Despite these potential differences in degrees of reproducibility, as demonstrated in the HEI re-analysis, efforts to make a data analysis more reproducible are a step in the right direction toward making it a better data analysis.

Ultimately, the reproducibility of research, when possible, allows us a significant opportunity to (1) learn from others about how best to analyze certain types of data; (2) reduce human error and bias as data become larger and more complex; (3) free up time for re-analyzers to focus on the parts of a data analysis that require more human interpretation; (4) have discussions about what makes for a good data analysis in certain areas of study; and (5) improve the quality of future data analyses. When teaching data analysis to students, it is common to talk in abstractions and theories, describing statistical methods and models in isolation. When real data are shown, it is often in the form of toy examples or short excerpts. Increasing the reproducibility of all studies presents an opportunity to dramatically expand instruction on the craft of data analysis so that a core set of elements and principles for characterizing high-quality analyses can be established within a field (23).
In the thirty years since the idea of reproducible computational research was brought to the forefront of the research community, we have learned much about its role and its value in the research enterprise. The original goal of providing a transparent means by which researchers can communicate what they have done and allow others to learn remains a primary rationale. Reproducibility has a secondary role to play in improving the quality of data analysis in that it serves as the foundation on which people can learn how others analyze data. Without code and data, it is nearly impossible to fully understand how a given data analysis was actually done. But much about computational research has changed in the past 30 years, and we can perhaps develop a more refined notion of what it means to make research "reproducible". The two key ideas about reproducibility—data and code—are worth revisiting in greater detail.
The sharing of data is ultimately valuable in and of itself. Data sharing, to the extent possible, reduces the need for others to collect similar data, allows for combined analyses with other datasets, and can create important resources for unforeseen future studies. Datasets can retain their value for considerable time, depending on the area and field of study. One example of the value of data sharing comes from the National Morbidity, Mortality, and Air Pollution Study, a major air pollution epidemiology study conducted in the late 1990s and early 2000s (39, 40). The mortality data for this study were shared on a web site and then later updated with new data. A systematic review found that 67 publications had made use of the dataset, often to demonstrate the development of new statistical methodology (4). In addition, the release of the data allowed for a level of transparency and trust in air pollution research that was novel for its time.

Today, many data sharing web repositories exist that allow easy distribution of data of almost any size. While in the past an investigator interested in sharing data had to purchase and set up a web server, investigators can now simply upload to any number of services. The Open Science Framework (17), the Dataverse Project (26), ICPSR (47), and the SRA (29) are only a handful of the public and private repositories that offer free hosting of datasets. The major benefit of repositories such as these is that they absorb and consolidate the cost of hosting data for possibly long periods of time.

The view of data sharing as inherently valuable is not without its challenges. Indeed, stripping data from its original context can be problematic and lead to inappropriate "off-label" re-use by others.
It has been argued that data only have value in their explicit connection to the knowledge they produce, and that we must be careful to preserve the connections between the data and the knowledge they generate (46).

Recently, best practices for sharing data have been developed. Some of these practices are specific to areas of study, while others are more generic. In particular, the emergence of the concept of tidy data has provided a generic format for many different types of data that serves as the backbone of a wide variety of analytic techniques (50). Practical guidance on sharing data via commonly used spreadsheet formats (7) and on providing relevant metadata to collaborators is now widely applicable to many kinds of data (15).
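To make the tidy data idea concrete, the sketch below reshapes a small wide-format table (one column per year) into tidy form, in which each row holds exactly one observation. The city names, years, and "pm10" variable are hypothetical, and the reshaping is written in plain Python rather than in any particular statistical language.

```python
# Wide format: one row per city, one column per measurement year.
wide = [
    {"city": "Boston", "1990": 28.1, "2000": 21.5},
    {"city": "Topeka", "1990": 12.9, "2000": 11.2},
]

def tidy(rows, id_col, var_name, value_name):
    """Melt wide-format rows into tidy (long) records:
    one record per (id, variable, value) observation."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_col:
                long_rows.append(
                    {id_col: row[id_col], var_name: key, value_name: value}
                )
    return long_rows

records = tidy(wide, id_col="city", var_name="year", value_name="pm10")
# Each record now holds a single observation, e.g.
# {"city": "Boston", "year": "1990", "pm10": 28.1}
```

In tidy form, each variable (city, year, pm10) is a column and each observation is a row, which is what makes the format a generic backbone for downstream analytic tools.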
The primary role of sharing code is to communicate what was done in transforming the data into scientific results. Today, almost all actions relevant to the science will have occurred on a computer, and it is essential that we have a precise way to document those actions. Computer code, via any number of programming and data analytic languages, is the most precise way to do that. The sharing of code generally represents less of a technical burden than the sharing of data. Code tends to be much smaller in size than most datasets and can easily be served by code sharing services such as GitHub, Bitbucket, SourceForge, or GitLab.

While the benefits of code sharing tend to focus on the code's usability and potential for re-purposing in other applications, it is important to reiterate that code's primary purpose is to communicate what was done. In short, code is not equivalent to software. Software is code that is specifically designed and organized for use by others in a wide variety of scenarios, often abstracting away operational details from the user. The usability of software depends critically on aspects like design, efficiency, modularity, and portability—factors that should not generally play a role when releasing research code. Sharing research code that is poorly designed and inefficient is far preferable to not sharing code at all. That said, this notion does not preclude the possibility of best practices for developing and sharing research code.

Software is often a product of research activity, particularly when new methodology is developed. In those cases, it is important that the software is carefully considered and designed well for its intended users. However, it should not be considered a requirement of reproducible research that software be a product of research.
For software that is developed for distribution, there is increasing guidance on how such software should be distributed. Software package development has become easier for programming languages like R, which have robust developer and user communities (38), and numerous tools have been developed to make incorporating code into packages more straightforward for non-professional programmers (49). In addition, the concept of testing and test-based development has been shown to be a useful framework for setting expectations for how software should perform and for identifying errors or bugs (48).
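As an illustration of this test-based style (a sketch, not a prescription from any particular framework), a researcher can pin down the expected behavior of an analysis function with a few assertions that are re-run whenever the code changes. The `excess_risk` function and its expected values here are hypothetical; in R this role is typically played by the testthat package (48).

```python
def excess_risk(rate_exposed, rate_unexposed):
    """Excess risk in the exposed group, as a percentage of the unexposed rate."""
    if rate_unexposed <= 0:
        raise ValueError("unexposed rate must be positive")
    return 100.0 * (rate_exposed - rate_unexposed) / rate_unexposed

def test_excess_risk():
    # A doubling of the rate is a 100% excess.
    assert excess_risk(2.0, 1.0) == 100.0
    # Equal rates mean no excess risk.
    assert excess_risk(1.5, 1.5) == 0.0
    # Invalid input should fail loudly rather than return a misleading number.
    try:
        excess_risk(1.0, 0.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a zero unexposed rate")

# Running the tests documents, and continually re-checks, the expectations.
test_excess_risk()
```

The tests themselves serve as a precise, executable record of what the function is supposed to do, which is the communication role that code sharing is meant to play.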
Technological trends over time generally favor a more open approach to science as the costs of sharing, hosting, and publishing have gone down. The continuing rapid advancement of computing technology, internet infrastructure, and algorithmic complexity will likely introduce new challenges to reproducible research. As the scientific community expands its sharing of data and code, there are some important issues to consider going forward.

The rapidly evolving nature of scientific communication serves to highlight the role of reproducibility in advancing science. Without reproducibility, countless hours could be wasted simply trying to figure out what was done in a study. In situations where key decisions must be made based on scientific results, it is important that the robustness of the findings can be assessed quickly, without the need for guessing or inferences about the underlying data. A stark example can be drawn from the COVID-19 pandemic. In April 2020, little was known about the disease, and a study was published on medRxiv producing an estimate of the prevalence of COVID-19 in the population (5). At the time, important public health decisions had to be made in response to the pandemic, and any information about the disease would have been highly relevant. Upon publication, numerous criticisms of the study's design and analysis appeared on social media and the web. However, the aspect most relevant to this review is that in many of the critiques, substantial time was taken simply to guess at what the researchers had done. Although a written statistical appendix was provided with the paper, no data or code were published along with the study. As a result, independent investigators had little choice but to infer what was done.

The urgency of decision-making based on scientific evidence can exist in a variety of situations, not just on the minute-by-minute timescale of a worldwide pandemic.
Many regulatory decisions in environmental health have to be made based on only a handful of studies. Often, there is no time to wait years for another large cohort study to replicate (or not) existing findings. In such situations where decisions need to be made, the more code and data that can be made available to assess the evidence, the better. In the interim, follow-up studies can be conducted and revisions to the evidence base can be made in the future if needed. The re-analyses of the Six Cities and ACS studies provide a clear example of this process, and history has shown those results to be highly consistent across a range of replication studies.

The maintenance of code and data is generally not a topic that is discussed in the context of reproducible research. When a paper is published, it is sent to the journal and is considered "finished" by the investigators. Unless errors are found in the paper, one generally need not revisit a paper after publication. However, both code and data need to be maintained to some degree in order to remain useful. Data formats can change, and older formats can fall out of favor, often making older datasets unreadable. Code that was once highly readable can become unreadable as newer languages come to the fore and practitioners of older languages decrease in number. Maintenance of data and code is not a question of paying for computer hardware or services. Rather, it is about paying for people to periodically update and fix problems that may be introduced by the constantly changing computing environment.

Unfortunately, funding models for scientific research are aligned with the mechanism of paper publication, where one can definitively mark the end of a project (and also the end of the funding). However, with data and code, there is often no specific end point because other investigators may re-use the data or code for years into the future.
Term-based project funding, which is the structure of almost all research funding, is simply not designed to provide support for maintaining materials on an uncertain timeline.

The first thirty years of reproducible research largely centered on discussions of the validity of the idea and what value it provided to the scientific community. Such discussions are largely settled now, and both data and code sharing are practiced widely in many fields of study. However, we must now engage in a second phase of reproducible research, one which focuses on the continued development of infrastructure for supporting reproducibility.

References
1. Baggerly, K. (2010), "Disclose all data in publications," Nature, 467, 401.
2. Baggerly, K. A. and Coombes, K. R. (2009), "Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology," The Annals of Applied Statistics, 3, 1309–1334.
3. Barba, L. A. (2018), "Terminologies for reproducible research," arXiv preprint arXiv:1802.03311.
4. Barnett, A. G., Huang, C., and Turner, L. (2012), "Benefits of Publicly Available Data," Epidemiology, 23, 500–501.
5. Bendavid, E., Mulaney, B., Sood, N., Shah, S., Ling, E., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saavedra, R., Tedrow, J., et al. (2020), "COVID-19 Antibody Seroprevalence in Santa Clara County, California," medRxiv.
6. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., et al. (2001), "Minimum information about a microarray experiment (MIAME)–toward standards for microarray data," Nature Genetics, 29, 365.
7. Broman, K. W. and Woo, K. H. (2018), "Data organization in spreadsheets," The American Statistician, 72, 2–10.
8. Brook, R. D., Rajagopalan, S., Pope, C. A., Brook, J. R., Bhatnagar, A., Diez-Roux, A. V., Holguin, F., Hong, Y., Luepker, R. V., Mittleman, M. A., Peters, A., Siscovick, D., Smith, S. C., Whitsel, L., and Kaufman, J. D. (2010), "Particulate matter air pollution and cardiovascular disease: An update to the scientific statement from the American Heart Association," Circulation, 121, 2331–2378.
9. Buckheit, J. and Donoho, D. L. (1995), "Wavelab and Reproducible Research," in Wavelets and Statistics, ed. Antoniadis, A., Springer-Verlag, New York.
10. Claerbout, J. and Schwab, M. (2001), "CD-ROM versus The Web."
11. Open Science Collaboration (2015), "Estimating the reproducibility of psychological science," Science, 349.
12. Coombes, K., Wang, J., and Baggerly, K. (2007), "Microarrays: retracing steps," Nature Medicine, 13, 1276–1277.
13. Dockery, D. W., Pope, C. A., Xu, X., Spengler, J. D., Ware, J. H., Fay, M. E., Ferris Jr, B. G., and Speizer, F. E. (1993), "An association between air pollution and mortality in six US cities," New England Journal of Medicine, 329, 1753–1759.
14. Drummond, C. (2018), "Reproducible research: a minority opinion," Journal of Experimental & Theoretical Artificial Intelligence, 30, 1–11.
15. Ellis, S. E. and Leek, J. T. (2018), "How to share data for collaboration," The American Statistician, 72, 53–57.
16. EPA (2009), Integrated Science Assessment for Particulate Matter, EPA National Center for Environmental Assessment.
17. Foster, E. D. and Deardorff, A. (2017), "Open Science Framework (OSF)," Journal of the Medical Library Association: JMLA, 105, 203.
18. Gentleman, R. and Temple Lang, D. (2007), "Statistical Analyses and Reproducible Research," Journal of Computational and Graphical Statistics, 16, 1–23.
19. Goldberg, P. (2014), "Duke Scientist: I Hope NCI Doesn't Get Original Data," The Cancer Letter, 41, 2.
20. — (2015), "Duke Officials Silenced Med Student Who Reported Trouble in Anil Potti's Lab," The Cancer Letter.
21. Goodman, S. N., Fanelli, D., and Ioannidis, J. P. (2016), "What does research reproducibility mean?" Science Translational Medicine, 8, 341ps12.
22. Haibe-Kains, B., Adam, G. A., Hosny, A., Khodakarami, F., Board, M., Waldron, L., Wang, B., McIntosh, C., Kundaje, A., Greene, C. S., et al. (2020), "The importance of transparency and reproducibility in artificial intelligence research," arXiv preprint arXiv:2003.00898.
23. Hicks, S. C. and Peng, R. D. (2019), "Elements and principles of data analysis," arXiv preprint arXiv:1903.07639.
24. Ioannidis, J. P. (2005), "Why most published research findings are false," PLoS Medicine, 2, e124.
25. Jager, L. R. and Leek, J. T. (2014), "An estimate of the science-wise false discovery rate and application to the top medical literature," Biostatistics, 15, 1–12.
26. King, G. (2007), "An introduction to the Dataverse Network as an infrastructure for data sharing."
27. Krewski, D., Burnett, R. T., Goldberg, M. S., Hoover, K., Siemiatycki, J., Jerrett, M., Abrahamowicz, M., and White, W. (2000), "Reanalysis of the Harvard Six Cities Study and the American Cancer Society Study of particulate air pollution and mortality," Health Effects Institute, Cambridge, MA.
28. Leek, J. T. and Peng, R. D. (2015), "Opinion: Reproducible research can still be wrong: Adopting a prevention approach," Proceedings of the National Academy of Sciences, 112, 1645–1646.
29. Leinonen, R., Sugawara, H., Shumway, M., and International Nucleotide Sequence Database Collaboration (2010), "The sequence read archive," Nucleic Acids Research, 39, D19–D21.
30. National Academies of Sciences, Engineering, and Medicine (2019), Reproducibility and Replicability in Science, National Academies Press.
31. Patil, P., Peng, R. D., and Leek, J. T. (2016), "What should researchers expect when they replicate studies? A statistical view of replicability in psychological science," Perspectives on Psychological Science, 11, 539–544.
32. — (2019), "A visual tool for defining reproducibility and replicability," Nature Human Behaviour, 3, 650–652.
33. Peng, R. D. (2011), "Reproducible research in computational science," Science, 334, 1226–1227.
34. Peng, R. D., Dominici, F., and Zeger, S. L. (2006), "Reproducible Epidemiologic Research," American Journal of Epidemiology, 163, 783–789.
35. Plesser, H. E. (2018), "Reproducibility vs. replicability: a brief history of a confused terminology," Frontiers in Neuroinformatics, 11, 76.
36. Pope, C. A., Thun, M. J., Namboodiri, M. M., Dockery, D. W., Evans, J. S., Speizer, F. E., Heath, C. W., et al. (1995), "Particulate air pollution as a predictor of mortality in a prospective study of US adults," American Journal of Respiratory and Critical Care Medicine, 151, 669–674.
37. Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., et al. (2006), "Genomic signatures to guide the use of chemotherapeutics," Nature Medicine, 12, 1294–1300.
38. R Core Team (2020), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
39. Samet, J. M., Zeger, S. L., Dominici, F., et al. (2000), The National Morbidity, Mortality, and Air Pollution Study, Part I: Methods and Methodological Issues, Health Effects Institute, Cambridge, MA.
40. — (2000), The National Morbidity, Mortality, and Air Pollution Study, Part II: Morbidity and Mortality from Air Pollution in the United States, Health Effects Institute, Cambridge, MA.
41. Sandve, G. K., Nekrutenko, A., Taylor, J., and Hovig, E. (2013), "Ten simple rules for reproducible computational research," PLoS Computational Biology, 9.
42. Schooler, J. W. (2014), "Metascience could rescue the replication crisis," Nature, 515, 9.
43. Schwab, M., Karrenbach, N., and Claerbout, J. (2000), "Making scientific computations reproducible," Computing in Science & Engineering, 2, 61–67.
44. Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011), "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant," Psychological Science, 22, 1359–1366.
45. Stodden, V. (2015), "Reproducing statistical results," Annual Review of Statistics and Its Application, 2, 1–19.
46. — (2020), "Beyond Open Data: A Model for Linking Digital Artifacts to Enable Reproducibility of Scientific Claims," in Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems, pp. 9–14.
47. Swanberg, S. M. (2017), "Inter-university Consortium for Political and Social Research (ICPSR)," Journal of the Medical Library Association: JMLA, 105, 106.
48. Wickham, H. (2011), "testthat: Get Started with Testing," The R Journal, 3, 5–10.
49. Wickham, H., Hester, J., and Chang, W. (2020), devtools: Tools to Make Developing R Packages Easier, R package.
50. Wickham, H. et al. (2014), "Tidy data,"