Reproducible Research: A Retrospective
Roger D. Peng and Stephanie C. Hicks
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
July 27, 2020
Abstract
Rapid advances in computing technology over the past few decades have spurred two extraordinary phenomena in science: large-scale and high-throughput data collection coupled with the creation and implementation of complex statistical algorithms for data analysis. Together, these two phenomena have brought about tremendous advances in scientific discovery but have also raised two serious concerns, one relatively new and one quite familiar. The complexity of modern data analyses raises questions about the reproducibility of the analyses, meaning the ability of independent analysts to re-create the results claimed by the original authors using the original data and analysis techniques. While seemingly a straightforward concept, reproducibility of analyses is typically thwarted by the lack of availability of the data and computer code that were used in the analyses. A much more general concern is the replicability of scientific findings, which concerns the frequency with which scientific claims are confirmed by completely independent investigations. While the concepts of reproducibility and replicability are related, it is worth noting that they are focused on quite different goals and address different aspects of scientific progress. In this review, we will discuss the origins of reproducible research, characterize the current status of reproducibility in public health research, and connect reproducibility to current concerns about the replicability of scientific findings. Finally, we describe a path forward for improving both the reproducibility and replicability of public health research in the future.

Introduction
Scientific progress has long depended on the ability of scientists to communicate to others the details of their investigations. The exact meaning of “details of their investigations” has changed considerably over time and in recent years has been nearly impossible to describe precisely using traditional means of communication. Rapid advances in computing technology have led to large-scale and high-throughput data collection coupled with the creation and implementation of complex statistical algorithms for data analysis. In the past, it might have sufficed to describe the data collection and analysis using a few key words and high-level language. However, with today's computing-intensive research, the lack of details about the data analysis in particular can make it impossible to re-create any of the results presented in a paper (33). Compounding these difficulties is the impracticality of describing these myriad details in traditional journal publications using natural language. To address this communication problem, a concept has emerged known as reproducible research, which aims to provide far more precise descriptions of an investigator's work to others in the field. As such, reproducible research is an extension of the usual communications practices of scientists, adapted to the modern era.

The notion of reproducible research, which was popularized in the early 1990s, was ultimately designed to address an emerging and serious issue at the time (43). Results of published findings were increasingly dependent on complex computations done on powerful computers, often implementing sophisticated algorithms on large datasets. Given the importance of computing to the generation of these results, it was surprising that consumers of scientific results had no ability to inspect or examine the details of the computations being done. Traditional forms of scientific publication allowed for extended descriptions of study design and high-level analysis approaches, but low-level details about computer code, data processing pipelines, and algorithms were not prioritized and were generally left in an appendix or, with the wider availability of the internet, an online supplement.

Jon Claerbout, a geophysicist at Stanford University, wrote down many of the original ideas concerning reproducibility of computational research. His concern largely focused on developing a software system whereby the research produced by his lab could be passed on to others, including the original authors, the authors' colleagues, students, research sponsors, and the general public. He noted in particular the benefits of reproducibility to the original authors: “It may seem strange to put the author's own name at the top of the list to whom we wish to provide the reproducible research, but it often seems that the greatest beneficiary of preparing the work in a reproducible form is the original author!” It is equally notable that the public was listed last; all of the other constituencies mentioned would likely exist within the small orbit of an individual investigator (10). In Claerbout's discussion, the primary focus is on improving the transparency and productivity of the lab itself, given that much time can be lost attempting to re-create past findings for the sole purpose of understanding what was previously done.

Buckheit and Donoho introduced much of the statistical community to the concept of reproducibility with an influential paper in 1995 detailing their WaveLab software for implementing wavelets for data analysis (9).
Citing Claerbout as a strong influence, their rationale for promoting reproducible research produced a useful summary of Claerbout's ideas that has since been repeated many times:

    An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

The general conclusion was that delivering a research end product such as a figure or table was no longer sufficient. Rather, the software environment and the means to create the end product must also be delivered, as those additional elements represent the actual scholarship. In order to satisfy this requirement, one would have to make available the data and the computer code used to generate the results.
The definition of reproducible research generally consists of the following elements. A published data analysis is reproducible if the analytic datasets and the computer code used to create the data analysis are made available to others for independent study and analysis (33). This definition is sufficiently vague that it ultimately raises more questions than it answers. What is an “analytic dataset”? What does it mean to be “available”? What is included with the “computer code”?

Published research can be thought of as living on a continuum up until the point of publication (34), starting from question formulation and study design, and proceeding to data collection, data processing, data analysis, and finally presentation. Along this journey, various elements are introduced to aid in executing the research, such as computing environments, measurement instruments, and software tools. One could choose to make available to others any aspect of this sequence, depending on the practicalities of doing so and the relevance to the final published results. It is challenging to develop a universal cut point for determining which aspects of an investigation should be disseminated and which are not required. However, within various research communities, internal standards have developed and continuously evolve to keep pace with technology (e.g. 6, 41).

The analytic dataset generally contains all of the data that can be directly linked to a published result or number. For example, if a paper publishes an estimate of the rate of hospitalization for heart attacks, but the overall study also collected data on hospitalizations for influenza, the influenza data may not be part of the analytic dataset if it makes no appearance in the published result and is not otherwise relevant. While outside investigators may be interested in seeing the influenza data (and the original authors may be happy to share it), it is not needed for the sake of reproducible research. The analytic computer code is any code that was used to transform the analytic dataset into results. This may include some data processing (such as variable transformations) as well as modeling or visualization. Generally, the software environment in which the analysis was conducted (e.g. R, Python, Matlab) does not need to be distributed if it is easily obtainable or open source. However, niche software which may be unfamiliar to many readers may need to be bundled with the data and code.

Upon first consideration, many see reproducibility as a non-issue. How could it be that applying the original computer code to the original dataset would not produce the original results? The practical reality of modern research, though, is that even many simple results depend on a precise record of procedures and processing and the order in which they occur. Furthermore, many statistical algorithms have tuning parameters that can dramatically change the results if they are not inputted exactly the same way as was done by the original investigators (22). If any of these myriad details are not properly recorded in the code, then there is a significant possibility that the results produced by others will differ from the original investigators' results.
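As a small illustration of this point, the following R sketch (with simulated data and hypothetical parameter choices, not drawn from any study discussed here) shows how an unrecorded random seed or tuning parameter can change a result:

    set.seed(2020)                    # without a recorded seed, results vary run to run
    x <- matrix(rnorm(200), ncol = 2) # simulated stand-in for an analytic dataset
    fit1 <- kmeans(x, centers = 3, nstart = 1)  # one (hypothetical) analyst's choice
    fit2 <- kmeans(x, centers = 3, nstart = 50) # a re-analyst's plausible default
    # The cluster assignments and within-cluster sums of squares can differ
    # between the two calls, so the exact call must be recorded for the
    # analysis to be reproducible.
    c(fit1$tot.withinss, fit2$tot.withinss)

Recording the literal code removes any ambiguity about which of the two calls produced a published result.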
The terminology of reproducible research can be bewildering to some in the scientific community because there is little agreement about the meaning of the phrase in relation to other related concepts (3, 21, 35). In particular, one related concept with which all scientists are concerned is what we refer to here as replication. In this review, we define replication as the independent investigation of a scientific hypothesis with newly collected data, new experimental setups, new investigators, and possibly new analytic approaches. In a thorough investigation of the terminologies of reproducible research, Lorena Barba found that some fields of study made no distinction between “reproducible” and “replicable” while some fields used those terms to mean the exact opposite of how we define them here (3, 30). However, a significant plurality of fields, including epidemiology, medicine, and statistics, appear to adopt the definitions we use here.

A key distinction between reproducibility and replication is that reproducibility does not allow for any real variation in the results. If an independent investigator were to reproduce the results of another investigator with the original data and code, there should not be any variation between the two investigators' results, except for some allowance for differences in machine precision. Thus, exact reproducibility is sometimes referred to as “bitwise reproducibility” (30). However, replication generally allows for differences in results that arise from statistical variability. Two independent investigators conducting the exact same experiment should, in theory, only differ by an amount quantified by the standard deviation of the data. More generous definitions of replication allow for slightly different study designs, analytic populations, or statistical techniques (30). In those cases, differences in results may arise beyond simple statistical variation. Patil et al. have devised a useful visualization of what may or may not differ when reproducing or replicating a published study (32). The relationship between reproducible research and replication is a topic to which we will return in greater detail in Section 3.

It is difficult to argue that interest in exactly reproducing another investigator's work is anything but a modern phenomenon (14). Interest in reproducibility prior to the computer and internet age was likely low or non-existent given that there was generally no expectation that investigators would share data in papers; there was simply no practical way to do that except for very small datasets. In the past, other investigators could only resort to independently replicating a published study using their own data collection and whatever high-level description of the methods was available in the paper. In this setting, detailed descriptions of the methods of analysis were critical if others were to execute the same approach. If the process of conducting the experiment or analysis was simple enough or sufficiently standardized, then it could be reasonably described in the confines of a journal paper. Suggesting that analyses be described with data and code is a departure from previous ways of communicating scientific results, which relied on describing experiments and analyses in more general terms to give readers the highlights of what was done. A more abstract approach could not be taken with this new form of computational research because the proper abstractions for communicating ideas and standardization of approaches were not yet available.
The concept of reproducible research was developed to achieve arguably modest goals. Its original aims focused on providing an approach to better communicate the details of computationally intensive research to one's collaborators, colleagues, students, and oneself. But two key developments over the past 30 years have changed the context in which reproducible research lives. Although the definition of reproducibility has not changed much since the 1990s, almost everything else about scientific research has.

In much of the early literature on reproducible research, the focus is on “computationally-intensive” research which, because of its reliance on complex computer algorithms, was considered perhaps more impenetrable than other research. Fast forward 30 years and the use of computing in scientific research is ubiquitous. It is no longer the domain of niche geophysical scientists or mathematical statisticians using obscure computer packages. Now, all scientific research involves the use of powerful computers, whether for data collection, data analysis, or both. Furthermore, the increase in complexity of statistical techniques over this time period has resulted in the need for detailed descriptions of analytic approaches and data processing pipelines. We are all computational scientists now, and as a result the concept of reproducibility is relevant to all scientists.

Along with computing power, another key advance over the past 30 years has been the development of the internet. Claerbout's original scheme for distributing data and code to others was via CD-ROM disks, which was a perfectly reasonable approach at the time. However, the need for a physical medium greatly limited the transferability of information to a large audience. With the development of the internet, it became possible for academics to distribute data and code to the entire world for seemingly minimal cost. This increase in distribution reach changed the nature and importance of reproducible research, from primarily improving the internal efficiency of the lab to allowing others to anonymously build on one's own work. The internet dramatically grew the size of an investigator's personal scientific community to include many members beyond one's immediate circle of collaborators. This phenomenon has provided significant benefits to science, but there are some implications that could threaten the viability of maintaining and supporting reproducible research in the long run. Before we consider these implications, we must first consider the goals of reproducible research and the problems we want reproducibility to solve.
Beyond communicating the details of an investigation, what are the goals of making research reproducible? The stated goals achieved by making research reproducible have evolved over time since the early 1990s and have become somewhat more elusive. Originally, the goal was to better reveal the process of doing the research. Computational research added a new complexity in the form of software code and high-dimensional datasets, and that complexity made the research process more difficult for a reader to infer. Therefore, the solution was to simply publish every step in the process along with the data. Claerbout and colleagues were concerned that others (including themselves!) would not be able to learn from what they had done if they did not have the details. The easiest way to provide those details was via the literal computer code that executed the steps. Any less precise format could risk omitting a key step that affected downstream results (22).

Reproducible research comes with a few side benefits. In addition to being able to fully understand the process by which the results were obtained, readers also get the data and the computer code, both of which are valuable to the extent that they can be re-used or re-purposed for future studies or research. Some have suggested that making data and computer code available to others is a per se goal of reproducible research, because both can be built upon and leveraged to further scientific knowledge (18). However, such an interpretation is an extension of the original ideas of reproducibility. The former view saw data and code as a medium for communicating ideas, whereas the latter view sees data and code essentially as products or digital research objects to be used by others (45). While converting a dataset into a data product and packaging computer code into usable software may seem like nominal tasks given that the underlying data and code already exist, there are non-negligible costs associated with the development, maintenance, and support of these products.

Another goal of reproducible research is to provide a kind of audit trail, should one be needed. In fact, one could suggest a definition of reproducible research as “research that can be checked”. Desiring an audit trail for data analyses raises the question of when such an audit trail might be used. In general, one might be interested in seeing the details of the data and the code for an analysis when there is a curiosity about how a specific result was reached. Sometimes that curiosity is raised because of suspicion of an error in the analysis, but other times there is a desire to learn the details of new techniques or methodologies (22). Thus, reproducibility primarily concerns the integrity and transparency of the data analysis for an investigation. Unlike replication, reproducibility allows for an internal check on the results and is not immediately connected to the context of the outside world.

One could summarize the goal of reproducible research as providing a means to answer the question, “Do I understand and trust this data analysis?” With the computational nature of today's research, we cannot hope to answer that question without being able to look at the data and the code. In addition, we may wish to know things about the experimental or study design as well as the hypothesis being examined (23). Given the claimed results, the data, and the code, one can theoretically determine the reproducibility status of a data analysis.
Reproducibility gives us the means by which we can assess our confidence and trust in an analysis, but it is important to reiterate that the mere fact of reproducibility of an analysis is not a check on the validity of the analysis.

The notion of reproducibility as a binary or perhaps multi-level “state” of a data analysis is a useful characterization in part because it is one of a few qualities of a data analysis that can be immediately verified. Unlike with replication, we do not need to wait for future studies to be conducted in order to determine the reproducibility of an analysis. However, this suggests that reproducibility's usefulness is limited. What do we ultimately learn from merely reproducing the results of an analysis? For example, it may be possible to execute code on a dataset without ever looking at the code or the data. In that case, the original goal of reproducibility, to learn about the details of an investigation, has been thwarted. We have simply learned that the code produces what the authors claim the code produces. In general, executing a process and seeing that process produce the results exactly as they were expected produces very little new information.

The answer to the question “Do I understand and trust this data analysis?” depends critically on the perspective of the person asking it. If the person asking is an expert in the area, they might be able to glance at the code and data and understand immediately what is going on. A non-expert in the field might be able to execute the code and produce results without ever understanding the operations of the analysis. An adjacent question that might be worth asking is “Is this data analysis understandable and trustworthy?” However, this question is not any easier to answer because it hypothesizes underlying objective qualities of a data analysis. But opinions may still vary widely about what these underlying qualities should be depending on who is asking the question. To answer either question, one needs to look carefully at the data and the code to learn exactly what was done. But ultimately, the data and code only represent a piece of the answer. Whether an analysis can be understood or trusted depends critically on many aspects outside of the analysis itself, including the perspective of the person reading the analysis.

Nevertheless, one hope is that reproducibility can lead to higher quality data analyses. The logic is that requiring all analyses to provide data and code would put investigators on notice that their work would be scrutinized. However, one high-profile example suggests this is unlikely to be the case.
In a now-retracted 2006 study by Potti et al. (37), the investigators claimed to have identified genomic signatures using microarrays that could predict whether an individual responded well to chemotherapy. The analysis was conducted using data from publicly available cell lines, and so the data were in a sense available. However, subsequent attempts to reproduce the findings failed, and reproducibility was only achieved when errors were deliberately inserted into the analysis code (2, 12). Keith Baggerly, Kevin Coombes, and Jing Wang meticulously reconstructed the error-prone analysis and laid out all of the details in both text and code. Ironically, they were ultimately able to reproduce the analysis of Potti et al. after significant reverse engineering and forensic investigation. In fact, we might never have learned what mistakes were made if Baggerly and his colleagues had not been able to reproduce the analysis.

The example of the Potti study is a pathological example of a reproducible analysis (after much forensic investigation) being profoundly incorrect. However, it is worth asking what role reproducibility might have played in this case. If Potti et al. had released code and data that were clearly linked together, perhaps as a research compendium (18), then the errors could have been found more quickly. However, given the sheer number and complexity of the problems, it still likely would have taken some time to understand them all. Coombes et al. published their letter only a year after the initial publication, so the timeline might have been advanced by a few months. However, a key fact would remain: the flawed analysis was already completed. Furthermore, once the truth was ultimately revealed to the authors, it took years of further investigation by many others before the original paper was retracted.
Examples like the Potti paper raise the question of whether demanding or requiring reproducibility of a study beforehand can pre-emptively improve its quality. Evidence of this connection between reproducibility and quality is lacking, which is not surprising given that the question is somewhat ill-posed. What exactly are we looking for in a “high-quality” data analysis? One could hypothesize that if an investigator knew in advance that the data and the code would be publicly available for scrutiny, then they would take the extra effort to make sure that the analyses were properly done. Perhaps if Potti et al. had been forced to make their code publicly available, they would have checked it first.

In the case of Potti et al., we now know that requiring reproducibility or even just code sharing would not have made much difference. Reporting done by
The Cancer Letter showed definitively that the investigators were aware of numerous statistical and coding errors in the analysis but did not think they were serious problems (20). Rather, they were considered “differences of opinion”. The notion that requiring reproducibility can lead to improved data analyses relies on the critical assumption that the investigators are able to recognize what is an error in the first place. If they do recognize the error and hide it, then that is fraud. If they do not recognize the error and publish it anyway, then that is at best careless. However, in both cases, forcing the data and code to be published would not have made any difference.
Reproducibility does not provide a useful route to preventing poor data analyses from occurring, but it does provide the basis for a meaningful discussion about whether there might be problems in the analysis and how such problems might be fixed. Replication differs from reproducibility primarily because it addresses a different goal. Replication answers the question, “Is this scientific claim true?” Reproducibility addresses the integrity of the data analysis that generated the evidence for a scientific claim, while replication addresses the integrity of the claim itself in the context of the outside world. Fundamentally, reproducible research has relatively little to say on the question of external validity. Claims resulting from reproducible results can be both correct and incorrect (28). Claims resulting from irreproducible results are less likely to be true, but that may depend on the reasons for the lack of reproducibility. For example, evidence generated via random algorithms may not be exactly reproducible if random number generator seeds are not saved, but the underlying evidence may still be sound. Ultimately, claims made by irreproducible studies may in fact be true, but irreproducible studies simply do not provide evidence for such claims.
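To make the point about random algorithms concrete, here is a minimal R sketch, using simulated data purely for illustration, of how saving the random number generator seed makes a bootstrap result exactly reproducible:

    set.seed(1234)            # saving the seed fixes the bootstrap resamples exactly
    y <- rnorm(100, mean = 5) # simulated analytic data
    boot_means <- replicate(2000, mean(sample(y, replace = TRUE)))
    quantile(boot_means, c(0.025, 0.975)) # bootstrap 95% interval for the mean
    # Without the set.seed() call, each run yields slightly different interval
    # endpoints, even though the statistical evidence is essentially unchanged.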
3.1 Example: Re-analysis of Air Pollution Studies

In the mid-1990s, two large studies of ambient air pollution and mortality, the Six Cities Study (13) and the American Cancer Society (ACS) Study (36), were published, presenting evidence that differences in air pollution concentrations between cities were significantly associated with rates of mortality in these cities. Both studies came under intense scrutiny when the U.S. Environmental Protection Agency cited the results in their revision of the National Ambient Air Quality Standards for fine particles. In particular, there were demands from numerous corners that the data used in the studies should be made available. However, the data in these studies, as with most health-related studies, included personal information about the subjects, and arguments were made that promises of confidentiality had to be kept. To address the impasse over making the data available, the original investigators engaged the Health Effects Institute (HEI) to serve as a kind of trusted third party to broker a reanalysis of the studies. Ultimately, HEI recruited a research team led by investigators at the University of Ottawa to obtain the original data for both the Six Cities and ACS studies, reproduce the original findings, and conduct additional sensitivity analyses to assess the robustness of the original findings (27).

The extensive reanalysis found that the original studies were largely reproducible, if not perfectly reproducible. For the Six Cities study, the key result was a mortality relative risk of 1.26, which the re-analysis team computed to be 1.28. For the ACS study, the original mortality relative risk was 1.17, close to the re-analysis value of 1.18. While one could argue that these studies were, strictly speaking, not reproducible, such small differences are not likely to be material. In fact, we now know, after numerous follow-up studies and independent replications, that the core findings of both studies appear to be true (8) and that the U.S. EPA itself rates the evidence of a connection between fine particles and mortality to be “likely causal” (16). The re-analysis team ran many other analyses, including variables that had not been considered in the original studies. Overall, they found that the sensitivity analyses did not change any of the major conclusions. Interestingly, one of the key conclusions of the final report from HEI was that, at the end of the day, “No single epidemiologic study can be the basis for determining a causal relation between air pollution and mortality” (27).

The HEI re-analysis of the Six Cities and ACS studies highlights the role of trust in data analysis. Prior to the re-analysis, many parties simply did not trust that the analysis was done properly or that all reasonable competing hypotheses had been considered. While making the data available might have allowed others to build that trust for themselves, allowing a neutral third party to examine the data and reproduce the findings at least ensured that one other group had seen the data. In addition, HEI's role in organizing the expert panel, conducting public outreach, and managing an open process played an important role in building trust in the community. While not all parties were completely satisfied with the process, the re-analysis allowed fellow scientists to learn from the original studies and gain insight into the process that led to the original findings. Ultimately, the key goals of reproducible research were achieved.

In hindsight, another lesson learned from the HEI re-analysis is that the importance of reproducibility of a given study can fade with time. Over 25 years later, there have been scores of follow-up studies and replications that have largely come to similar conclusions as the Six Cities and ACS studies. Although both studies remain seminal in the field of air pollution epidemiology, they could be deleted from the literature at this point and have little impact on our understanding of the science. This is not to say that the data and ongoing analyses do not have value, but rather that the original results have been subsumed by later studies. Reproducibility was only critical when the studies were first published because of the paucity of large studies at the time.
Recent work has focused on the quality and variability of data analyses published in various fields of study (e.g. 11, 24, 25, 31), with some claiming the existence of a “replication crisis” due to the wide variation between studies examining the same hypotheses (42). The causes of this variation between studies are myriad, but one large category includes various aspects of the data analysis. Because of the increasing complexity of data analyses, many choices and decisions must be made by analysts in the process of obtaining a result. With these increasing complexities we also increase the risk of human error and bias in data analysis. These choices and decisions often have an unknown impact on the final estimates produced and therefore may or may not be recorded by the investigators (45). These “researcher degrees of freedom” allow investigators to unknowingly, or perhaps knowingly, steer data analyses in directions that may support specific hypotheses rather than represent all of the evidence in the data (44).

What role can reproducible research play in improving the quality of data analyses across all fields? The answer can be found in part in the experience of the HEI re-analysis of the Six Cities and ACS air pollution studies. Because they were re-analyses, one could imagine the expectation was that the results would be confirmed to some reasonable degree. If there had been a significant deviation from the published results, then we would have had to dig into the original analysis to discover why. Because the results were largely reproduced, one could argue that little was learned. However, additional analyses were done and sensitivity analyses were conducted. As a result, we learned much about the data analysis process. The re-analysis thus produced valuable knowledge about how to analyze air pollution and health data.

For example, the re-analysis team noted that both mortality and air pollution were highly spatially correlated, a feature that was not considered in the original analysis. They noted, “If not identified and modeled correctly, spatial correlation could cause substantial errors in both the regression coefficients and their standard errors. The Reanalysis Team identified several methods for dealing with this, all of which resulted in some reduction in the estimated regression coefficients.” (27) In addition, reproducibility helps free up time for analysts interested in re-analyzing the data to focus on parts of the data analysis that require more human interpretation. For example, if an independent data analyst knew that an analysis was already reproducible, then more time and resources would be available to understand why a specific model was chosen, instead of what version of software was used to run the model. In the re-analysis of the data from Potti et al. (37), Baggerly and Coombes noted that they had spent thousands of hours re-examining the data attempting to reproduce the original results (1, 19).

There are also different degrees of reproducibility when building a data analysis, and differences in audiences that may or may not be allowed to have access to these components. For example, a data analyst may choose to make the data available, but not the code (or the opposite). Others may make both the code and data available for only one audience (Audience A), but not for another audience (Audience B).
There are valid reasons why an analyst might choose to do this, such as if the data analysis uses data with protected health information in a hospital setting, or if the data analyst works at a business or company and cannot share the code or data with others outside of the company. It is important to note that just because an analysis is not fully reproducible to one audience (Audience B) does not mean that it is an invalid analysis with incorrect conclusions. While it does make it harder for Audience B to trust the results, it still can be a valid or correct analysis. However, the lack of reproducibility to this audience may mean that the evidence supporting any claims is weaker. Despite these potential differences in degrees of reproducibility, as demonstrated in the HEI re-analysis, efforts made to make a data analysis more reproducible are a step in the right direction toward making it a better data analysis.

Ultimately, the reproducibility of research, when possible, allows us a significant opportunity to (1) learn from others about how best to analyze certain types of data; (2) reduce human errors and bias as data become larger and more complex; (3) free up time for re-analyzers to focus on parts of a data analysis that require more human interpretation; (4) have discussions about what makes for a good data analysis in certain areas of study; and (5) improve the quality of future data analyses. When teaching data analysis to students, it is common to talk in abstractions and theories, describing statistical methods and models in isolation. When real data are shown, it is often in the form of toy examples or short excerpts. Increasing the reproducibility of all studies presents an opportunity to dramatically expand instruction on the craft of data analysis so that a core set of elements and principles for characterizing high quality analyses can be established within a field (23).
In the thirty years since the idea of reproducible computational research was brought to the forefront of the research community, we have learned much about its role and its value in the research enterprise. The original goal of providing a transparent means by which researchers can communicate what they have done and allow others to learn remains a primary rationale. Reproducibility has a secondary role to play in improving the quality of data analysis in that it serves as the foundation on which people can learn how others analyze data. Without code and data, it is nearly impossible to fully understand how a given data analysis was actually done. But much about computational research has changed in the past 30 years, and we can perhaps develop a more refined notion about what it means to make research “reproducible”. The two key ideas about reproducibility, data and code, are worth revisiting in greater detail.
The sharing of data is ultimately valuable in and of itself. Data sharing, to the extent possible, reduces the need for others to collect similar data, allows for combined analyses with other datasets, and can create important resources for unforeseen future studies. Datasets can retain their value for considerable time, depending on the area and field of study. One example of the value of data sharing comes from the National Morbidity, Mortality, and Air Pollution Study, a major air pollution epidemiology study conducted in the late 1990s and early 2000s (39, 40). The mortality data for this study were shared on a web site and then later updated with new data. A systematic review found 67 publications had made use of the dataset, often to demonstrate the development of new statistical methodology (4). In addition, the release of the data at the time allowed for a level of transparency and trust in air pollution research that was novel for its time.

Today, many data sharing web repositories exist that allow easy distribution of data of almost any size. While in the past an investigator interested in sharing data had to purchase and set up a web server, now investigators can simply upload to any number of services. The Open Science Framework (17), the Dataverse Project (26), ICPSR (47), and the SRA (29) are only a handful of the public and private repositories that offer free hosting of datasets. The major benefit of repositories such as these is to absorb and consolidate the cost of hosting data for possibly long periods of time.

The view of data sharing as inherently valuable is not without its challenges. Indeed, stripping data from its original context can be problematic and lead to inappropriate “off-label” re-use by others. It has been argued that data only has value in its explicit connection to the knowledge that it produces and that we must be careful to preserve the connections between the data and the knowledge they generate (46).

Recently, best practices for sharing data have been developed. Some of these practices are specific to areas of study while some are more generic. In particular, the emergence of the concept of tidy data has provided a generic format for many different types of data that serves as the backbone of a wide variety of analytic techniques (50). Practical guidance on sharing data via commonly used spreadsheet formats (7) and on providing relevant metadata to collaborators is now widely applicable to many kinds of data (15).
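As a minimal sketch of the tidy data format (50), the following R code reshapes a hypothetical “wide” table of fine particle measurements (the values and column names are invented for illustration) so that each row is a single observation:

    library(tidyr)

    # Hypothetical "wide" spreadsheet: one column per year of PM2.5 values
    wide <- data.frame(city = c("A", "B"),
                       pm25_2019 = c(10.2, 12.5),
                       pm25_2020 = c(9.8, 11.9))

    # Tidy version: one row per observation, one column per variable
    tidy <- pivot_longer(wide, cols = starts_with("pm25"),
                         names_to = "year", names_prefix = "pm25_",
                         values_to = "pm25")
    tidy # columns: city, year, pm25

The tidy layout is what makes the shared data immediately usable by the standard modeling and visualization tools that expect one observation per row.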
The primary role of sharing code is to communicate what was done in transforming the data into scientific results. Today, almost all actions relevant to the science will have occurred on the computer, and it is essential that we have a precise way to document those actions. Computer code, via any number of programming and data analytic languages, is the most precise way to do that. The sharing of code generally represents less of a technical burden than the sharing of data. Code tends to be much smaller in size than most datasets and can easily be served by code sharing services such as GitHub, BitBucket, SourceForge, or GitLab.

While the benefits of code sharing tend to focus on the code's usability and potential for re-purposing in other applications, it is important to reiterate that code's primary purpose is to communicate what was done. In short, code is not equivalent to software. Software is code that is specifically designed and organized for use by others in a wide variety of scenarios, often abstracting away operational details from the user. The usability of software depends critically on aspects like design, efficiency, modularity, and portability, factors that should not generally play a role when releasing research code. Sharing research code that is poorly designed and inefficient is far preferable to not sharing code at all. That said, this notion does not preclude the possibility of best practices in developing and sharing research code.

Software is often a product of research activity, particularly when new methodology is developed. In those cases, it is important that the software is carefully considered and designed well for its intended users. However, it should not be considered a requirement of reproducible research that software be a product of research. For software that is developed for distribution, there is increasing guidance for how such software should be distributed. Software package development has become easier for programming languages like R, which have robust developer and user communities (38), and numerous tools have been developed to make incorporating code into packages more straightforward for non-professional programmers (49). In addition, the concept of testing and test-based development has been shown to be a useful framework for setting expectations for how software should perform and for identifying errors or bugs (48).
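As a minimal sketch of test-based development with the testthat package (48), one can encode expectations about a small analysis function so that errors are caught automatically; the function and values below are hypothetical:

    library(testthat)

    # Hypothetical analysis helper: events per 10,000 person-years
    hosp_rate <- function(events, person_years) {
      stopifnot(person_years > 0)  # guard against an invalid denominator
      10000 * events / person_years
    }

    test_that("hospitalization rate behaves as expected", {
      expect_equal(hosp_rate(5, 50000), 1) # 5 events over 50,000 person-years
      expect_error(hosp_rate(5, 0))        # invalid input should fail loudly
    })

Tests like these record the developer's expectations alongside the code, so a re-analyst can verify that the software still performs as intended in a new environment.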
Technological trends over time generally favor a more open approach to science as the costs of sharing, hosting, and publishing have gone down. The continuing rapid advancement of computing technology, internet infrastructure, and algorithmic complexity will likely introduce new challenges to reproducible research. As the scientific community expands its sharing of data and code, there are some important issues to consider going forward.

The rapidly evolving nature of scientific communication serves to highlight the role of reproducibility in advancing science. Without reproducibility, countless hours could be wasted simply trying to figure out what was done in a study. In situations where key decisions must be made based on scientific results, it is important that the robustness of the findings can be assessed quickly, without the need for guessing or inferences about the underlying data. A stark example can be drawn from the COVID-19 pandemic. In April 2020, little was known about the disease, and a study was published on medRxiv producing an estimate of the prevalence of COVID-19 in the population (5). At the time, important public health decisions had to be made in response to the pandemic, and any information about the disease would have been highly relevant. Upon publication, numerous criticisms about the study's design and analysis appeared on social media and the web. However, the aspect most relevant to this review is that in many of the critiques, substantial time was taken to simply guess at what the researchers had done. Although a written statistical appendix was provided with the paper, no data or code were published along with the study. As a result, independent investigators had little choice but to infer what was done.

The urgency of decision-making based on scientific evidence can exist in a variety of situations, not just on the minute-by-minute timescale of a worldwide pandemic. Many regulatory decisions in environmental health have to be made based on only a handful of studies. Often, there is no time to wait years for another large cohort study to replicate (or not) existing findings. In such situations where decisions need to be made, the more code and data that can be made available to assess the evidence, the better. In the interim, follow-up studies can be conducted and revisions to the evidence base can be made in the future if needed. The re-analyses of the Six Cities and ACS studies provide a clear example of this process, and history has shown those results to be highly consistent across a range of replication studies.

The maintenance of code and data is generally not a topic that is discussed in the context of reproducible research. When a paper is published, it is sent to the journal and is considered “finished” by the investigators. Unless errors are found in the paper, one generally need not revisit a paper after publication. However, both code and data need to be maintained to some degree in order to be useful. Data formats can change and older formats can fall out of favor, often making older datasets unreadable. Code that was once highly readable can become unreadable as newer languages come to the fore and practitioners of older languages decrease in number. Maintenance of data and code is not a question of paying for computer hardware or services.
Rather, it is about paying for people to periodically update and fix problems that may be introduced by the constantly changing computing environment.

Unfortunately, funding models for scientific research are aligned with the mechanism of paper publication, where one can definitively mark the end of a project (and also the end of the funding). However, with data and code, there is often no specific end point because other investigators may re-use the data or code for years into the future. Term-based project funding, which is the structure of almost all research funding, is simply not designed to provide support for maintaining materials on an uncertain timeline.

The first thirty years of reproducible research largely centered on discussions of the validity of the idea and what value it provided to the scientific community. Such discussions are largely settled now, and both data and code sharing are practiced widely in many fields of study. However, we must now engage in a second phase of reproducible research, one which focuses on the continued development of infrastructure for supporting reproducibility.

References
1. Baggerly, K. (2010), “Disclose all data in publications,” Nature, 467, 401.

2. Baggerly, K. A. and Coombes, K. R. (2009), “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology,” The Annals of Applied Statistics, 1309–1334.

3. Barba, L. A. (2018), “Terminologies for reproducible research,” arXiv preprint arXiv:1802.03311.

4. Barnett, A. G., Huang, C., and Turner, L. (2012), “Benefits of Publicly Available Data,” Epidemiology, 23, 500–501.

5. Bendavid, E., Mulaney, B., Sood, N., Shah, S., Ling, E., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saavedra, R., Tedrow, J., et al. (2020), “COVID-19 Antibody Seroprevalence in Santa Clara County, California,” medRxiv.

6. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., et al. (2001), “Minimum information about a microarray experiment (MIAME)–toward standards for microarray data,” Nature Genetics, 29, 365.

7. Broman, K. W. and Woo, K. H. (2018), “Data organization in spreadsheets,” The American Statistician, 72, 2–10.

8. Brook, R. D., Rajagopalan, S., Pope, C. A., Brook, J. R., Bhatnagar, A., Diez-Roux, A. V., Holguin, F., Hong, Y., Luepker, R. V., Mittleman, M. A., Peters, A., Siscovick, D., Smith, S. C., Whitsel, L., and Kaufman, J. D. (2010), “Particulate matter air pollution and cardiovascular disease: An update to the scientific statement from the American Heart Association,” Circulation, 121, 2331–2378.

9. Buckheit, J. and Donoho, D. L. (1995), “Wavelab and Reproducible Research,” in Wavelets and Statistics, ed. Antoniadis, A., Springer-Verlag, New York.

10. Claerbout, J. and Schwab, M. (2001), “CD-ROM versus The Web.”

11. Open Science Collaboration et al. (2015), “Estimating the reproducibility of psychological science,” Science, 349.

12. Coombes, K., Wang, J., and Baggerly, K. (2007), “Microarrays: retracing steps,” Nature Medicine, 13, 1276–1277.

13. Dockery, D. W., Pope, C. A., Xu, X., Spengler, J. D., Ware, J. H., Fay, M. E., Ferris Jr., B. G., and Speizer, F. E. (1993), “An association between air pollution and mortality in six US cities,” New England Journal of Medicine, 329, 1753–1759.

14. Drummond, C. (2018), “Reproducible research: a minority opinion,” Journal of Experimental & Theoretical Artificial Intelligence, 30, 1–11.

15. Ellis, S. E. and Leek, J. T. (2018), “How to share data for collaboration,” The American Statistician, 72, 53–57.

16. EPA (2009), Integrated Science Assessment for Particulate Matter, EPA National Center for Environmental Assessment.

17. Foster, E. D. and Deardorff, A. (2017), “Open science framework (OSF),” Journal of the Medical Library Association: JMLA, 105, 203.

18. Gentleman, R. and Temple Lang, D. (2007), “Statistical Analyses and Reproducible Research,” Journal of Computational and Graphical Statistics, 16, 1–23.

19. Goldberg, P. (2014), “Duke Scientist: I Hope NCI Doesn't Get Original Data,” The Cancer Letter, 41, 2.

20. Goldberg, P. (2015), “Duke Officials Silenced Med Student Who Reported Trouble in Anil Potti's Lab,” The Cancer Letter.

21. Goodman, S. N., Fanelli, D., and Ioannidis, J. P. (2016), “What does research reproducibility mean?” Science Translational Medicine, 8, 341ps12.

22. Haibe-Kains, B., Adam, G. A., Hosny, A., Khodakarami, F., Board, M., Waldron, L., Wang, B., McIntosh, C., Kundaje, A., Greene, C. S., et al. (2020), “The importance of transparency and reproducibility in artificial intelligence research,” arXiv preprint arXiv:2003.00898.

23. Hicks, S. C. and Peng, R. D. (2019), “Elements and principles of data analysis,” arXiv preprint arXiv:1903.07639.

24. Ioannidis, J. P. (2005), “Why most published research findings are false,” PLoS Medicine, 2, e124.

25. Jager, L. R. and Leek, J. T. (2014), “An estimate of the science-wise false discovery rate and application to the top medical literature,” Biostatistics, 15, 1–12.

26. King, G. (2007), “An introduction to the dataverse network as an infrastructure for data sharing.”

27. Krewski, D., Burnett, R. T., Goldberg, M. S., Hoover, K., Siemiatycki, J., Jerrett, M., Abrahamowicz, M., and White, W. (2000), “Reanalysis of the Harvard Six Cities Study and the American Cancer Society Study of particulate air pollution and mortality,” Health Effects Institute, Cambridge, MA.

28. Leek, J. T. and Peng, R. D. (2015), “Opinion: Reproducible research can still be wrong: Adopting a prevention approach,” Proceedings of the National Academy of Sciences, 112, 1645–1646.

29. Leinonen, R., Sugawara, H., Shumway, M., and International Nucleotide Sequence Database Collaboration (2010), “The sequence read archive,” Nucleic Acids Research, 39, D19–D21.

30. National Academies of Sciences, Engineering, and Medicine and others (2019), Reproducibility and Replicability in Science, National Academies Press.

31. Patil, P., Peng, R. D., and Leek, J. T. (2016), “What should researchers expect when they replicate studies? A statistical view of replicability in psychological science,” Perspectives on Psychological Science, 11, 539–544.

32. Patil, P., Peng, R. D., and Leek, J. T. (2019), “A visual tool for defining reproducibility and replicability,” Nature Human Behaviour, 3, 650–652.

33. Peng, R. D. (2011), “Reproducible research in computational science,” Science, 334, 1226–1227.

34. Peng, R. D., Dominici, F., and Zeger, S. L. (2006), “Reproducible Epidemiologic Research,” American Journal of Epidemiology, 163, 783–789.

35. Plesser, H. E. (2018), “Reproducibility vs. replicability: a brief history of a confused terminology,” Frontiers in Neuroinformatics, 11, 76.

36. Pope, C. A., Thun, M. J., Namboodiri, M. M., Dockery, D. W., Evans, J. S., Speizer, F. E., Heath, C. W., et al. (1995), “Particulate air pollution as a predictor of mortality in a prospective study of US adults,” American Journal of Respiratory and Critical Care Medicine, 151, 669–674.

37. Potti, A., Dressman, H. K., Bild, A., Riedel, R. F., Chan, G., Sayer, R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., et al. (2006), “Genomic signatures to guide the use of chemotherapeutics,” Nature Medicine, 12, 1294–1300.

38. R Core Team (2020), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

39. Samet, J. M., Zeger, S. L., Dominici, F., et al. (2000), The National Morbidity, Mortality, and Air Pollution Study, Part I: Methods and Methodological Issues, Health Effects Institute, Cambridge, MA.

40. Samet, J. M., Zeger, S. L., Dominici, F., et al. (2000), The National Morbidity, Mortality, and Air Pollution Study, Part II: Morbidity and Mortality from Air Pollution in the United States, Health Effects Institute, Cambridge, MA.

41. Sandve, G. K., Nekrutenko, A., Taylor, J., and Hovig, E. (2013), “Ten simple rules for reproducible computational research,” PLoS Computational Biology, 9.

42. Schooler, J. W. (2014), “Metascience could rescue the replication crisis,” Nature, 515, 9.

43. Schwab, M., Karrenbach, N., and Claerbout, J. (2000), “Making scientific computations reproducible,” Computing in Science & Engineering, 2, 61–67.

44. Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011), “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological Science, 22, 1359–1366.

45. Stodden, V. (2015), “Reproducing statistical results,” Annual Review of Statistics and Its Application, 2, 1–19.

46. Stodden, V. (2020), “Beyond Open Data: A Model for Linking Digital Artifacts to Enable Reproducibility of Scientific Claims,” in Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Computer Systems, pp. 9–14.

47. Swanberg, S. M. (2017), “Inter-university consortium for political and social research (ICPSR),” Journal of the Medical Library Association: JMLA, 105, 106.

48. Wickham, H. (2011), “testthat: Get Started with Testing,” The R Journal, 3, 5–10.

49. Wickham, H., Hester, J., and Chang, W. (2020), devtools: Tools to Make Developing R Packages Easier, R package version 2.3.1.

50. Wickham, H. (2014), “Tidy data,” Journal of Statistical Software, 59.