Advancing computational reproducibility in the Dataverse data repository platform
Ana Trisovic, Philip Durbin, Tania Schlatter, Gustavo Durand, Sonia Barbosa, Danny Brooke, Mercè Crosas
Institute for Quantitative Social Science, Harvard University, Cambridge, MA, USA
ABSTRACT
Recent reproducibility case studies have raised concerns showing that much of the deposited research has not been reproducible. One of their conclusions was that the way data repositories store research data and code cannot fully facilitate reproducibility due to the absence of a runtime environment needed for the code execution. New specialized reproducibility tools provide cloud-based computational environments for code encapsulation, thus enabling research portability and reproducibility. However, they do not often enable research discoverability, standardized data citation, or long-term archival like data repositories do. This paper addresses the shortcomings of data repositories and reproducibility tools and how they could be overcome to improve the current lack of computational reproducibility in published and archived research outputs.
CCS CONCEPTS
• Information systems → Digital libraries and archives; Computing platforms.

KEYWORDS
computational reproducibility, open data, open code, data repository, data management, data preservation
ACM Reference Format:
Ana Trisovic, Philip Durbin, Tania Schlatter, Gustavo Durand, Sonia Barbosa, Danny Brooke, and Mercè Crosas. 2020. Advancing computational reproducibility in the Dataverse data repository platform. In Proceedings of the 3rd International Workshop on Practical Reproducible Evaluation of Systems (P-RECS'20), June 23, 2020, Stockholm, Sweden. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/1122445.1122456

© 2020 Association for Computing Machinery.
INTRODUCTION

The requirement of reproducible computational research is becoming increasingly important and mandatory across the sciences [13]. Because reproducibility implies a certain level of openness and sharing of data and code, parts of the scientific community have developed standards around documenting and publishing these research outputs [4, 19]. Publishing data and code as a replication package in a data repository is considered to be a best practice for enabling research reproducibility and transparency [12]. (In Dataverse, the data and code used to reproduce a published study are called "replication data" or a "replication package"; in Whole Tale, this is called a "tale", and in Code Ocean a "reproducible capsule".)

Some academic journals endorse this approach for publishing research outputs, and they often encourage (or require) their authors to release a replication package upon publication. Data repositories, such as Dataverse or Dryad, are the predominantly encouraged mode for sharing research data and code, followed by journals' own websites (Figure 1) [7]. For example, the American Journal of Political Science (AJPS) and the journal Political Analysis have their own collections within the Harvard Dataverse repository, which is their required venue for sharing research data and code.

Recent case studies [6] reported that the research material published in data repositories does not often guarantee reproducibility. This is in part because, in their current form, data repositories do not capture all software and system dependencies necessary for code execution. Even when this information is documented by the original authors in an instructions file (like a readme), contextual information still might be missing, which could make the process of research verification and reuse hard or impossible.
Figure 1: Aggregated results for the most popular data sharing mode in anthropology, economics, history, PoliSci+IR, psychology, and sociology, from Ref. [7].

This is also often the case with some of the alternative ways of publishing research data and code, for example, through the journal's website. A study [18] reported that the majority of supplemental data deposited on a journal's website was inaccessible due to broken links. Such problems are less likely to happen in data repositories that follow standards for long-term archival and support persistent identifiers.

Some researchers prefer to release their data and code on their personal websites or on websites like GitHub and GitLab. This approach does not natively provide a standardized persistent citation for referencing and accessing the research materials, nor sufficient metadata to make it discoverable in data search engines like Google Dataset Search and DataCite Search. In addition, it does not guarantee long-term accessibility as data repositories do. Because research deposited this way does not typically contain a runtime environment, nor system or contextual information, this approach is also often ineffective in enabling computational reproducibility [17].

New cloud services have emerged to support research data organization, collaborative work, and reproducibility [15]. Even though the number of different and useful reproducibility tools is constantly increasing, in this paper we focus on the following projects: Code Ocean [5], Whole Tale [1], Renku, and Binder [9, 10]. All of these tools are available through a web browser, and they are based on the containerization technology Docker, which provides a standardized way to capture the computational environment so that it can be shared, reproduced, and reused.

(1) Code Ocean is a research collaboration platform that enables its users to develop, execute, share, and publish their data and code.
The platform supports a large number of programming languages, including R, C/C++, Java, and Python, and it is currently the only platform that supports code sharing in proprietary software like MATLAB and Stata.

(2) Whole Tale is a free and open-source reproducibility platform that, by capturing data, code, and a complete software environment, enables researchers to examine, transform, and republish research data that was used in an academic article.

(3) Renku (https://renkulab.io) is a project similar to Whole Tale that focuses on employing tools for best coding practices to facilitate collaborative work and reproducibility.

(4) Binder is a free and open-source project that allows users to run notebooks (Jupyter or R) and other code files by creating a containerized environment using configuration files within a replication package (or a repository).

Even though virtual containers are currently considered the most comprehensive way to preserve computational research [8, 16], they do not entirely comply with modern scientific workflows and needs. Through the use of containers, the reproducibility platforms in most cases fail to support the FAIR principles (Findable, Accessible, Interoperable, and Reusable) [19], standardized persistent citation, and long-term preservation of research outputs, as data repositories strive to do. Findability is enabled through standard or community-used metadata schemas that document research artifacts. Data and code stored in a Docker container on a reproducibility platform are not easily visible nor accessible from outside of the container, which thus hinders their findability. This could be an issue for a researcher looking for a specific dataset rather than a replication package. In addition, unlike data repositories, reproducibility platforms do not undertake a commitment to the archival of research materials.
This means that, for example, in a scenario where a reproducibility platform runs out of funding, the deposited research could become inaccessible.

Individually, data repositories and reproducibility platforms cannot fully support scientific workflows and requirements for reproducibility and preservation. This paper explains how these shortcomings could be overcome through integration that would result in a robust paradigm for preserving computational research and enabling reproducibility and reuse while making the replication packages FAIR. We argue that through the integration of these existing projects, rather than inventing new ones, we could combine functionalities that effectively complement each other.

Through integration, reproducibility platforms and data repositories create a synergy that addresses the weaknesses of both approaches. Some of these integrations are already on the way:

• CLOCKSS (https://clockss.org) is an archiving repository that preserves data with regular validity checks. Unlike other data repositories, it does not provide public or user access to the preserved content, except in special cases that are referred to as "triggered content". Code Ocean has partnered with CLOCKSS to preserve in perpetuity research capsules associated with publications from some of the collaborating journals.

• The Whole Tale platform relies on integrations with external resources for long-term stewardship and preservation. It already enables data import from data repositories, and a publishing functionality for a replication package is currently underway through DataONE, Dataverse, and Zenodo [3].

• Stencila is an open-source office suite designed for creating interactive, data-driven publications.
With its familiar user interface, it is geared toward users of Microsoft Word and Excel. It integrates data and code as a self-contained part of the publication, and it also enables external researchers to explore the data and write custom code. Stencila and the journal eLife have partnered up to facilitate reproducible publications [11].

Figure 2: Preliminary view of how research stored in Dataverse can be viewed and explored in reproducibility platforms with a button click.
In this paper, we present our developments in the context of the Dataverse Project, which is a free and open-source software platform to archive, share, and cite research data. Currently, 55 institutions around the globe run Dataverse instances as their data repository.

Dataverse's integration with the reproducibility platforms has propelled a series of questions and developments around advancing reproducibility for its vast and diverse user community. First, while container files can be uploaded to Dataverse, there is no special handling for these files, which can result in mixed outcomes for researchers trying to verify reproducibility. Second, it is important to facilitate the capture of computational dependencies for the Dataverse users who choose not to use a reproducibility platform. Finally, in a replication package with multiple seemingly disorganized code files, it would be important to minimize the time and effort of an external user who wants to rerun and reuse the files. Therefore, new functionality to support container-based deposits, organization, and access needs to be added to Dataverse to improve reproducibility.
Dataverse integration with the reproducibility platforms should allow both adding new research material into Dataverse, and importing and reusing the existing material from Dataverse in a reproducibility platform. This communication is implemented through a series of existing and new APIs. The reproducibility platforms that have an ongoing integration collaboration with Dataverse are Code Ocean, Whole Tale, Binder, and Renku.
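As a concrete illustration, the sketch below shows how an integrating platform might construct a call to Dataverse's native API to retrieve a dataset's metadata (including its file listing) by persistent identifier. The endpoint path is part of Dataverse's public native API; the server URL and DOI used here are placeholders.

```python
# Build the Dataverse native-API URL for fetching a dataset by its DOI.
# A reproducibility platform would GET this URL (adding an X-Dataverse-key
# header for restricted material) and read the file list from the JSON reply.
from urllib.parse import urlencode

def dataset_api_url(server: str, doi: str) -> str:
    """Return the native-API URL for the dataset identified by `doi`."""
    query = urlencode({"persistentId": f"doi:{doi}"})
    return f"{server.rstrip('/')}/api/datasets/:persistentId/?{query}"

# Placeholder server and DOI, for illustration only.
url = dataset_api_url("https://demo.dataverse.org", "10.7910/DVN/EXAMPLE")
```

The same pattern (a persistent identifier passed as a query parameter) is what lets external tools like the "Explore" integrations locate a replication package without any manual file handling.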
Figure 3: Using data from Dataverse in the Whole Tale environment. Snapshot from a YouTube video. Credit: Craig Willis.
Importing research material from Dataverse means that data and code that already exist in Dataverse can be transferred directly into a reproducibility platform. On the Dataverse side, this is implemented through a new "Explore" button, shown in Figure 2. When the button is clicked, the replication package is copied and sent to a reproducibility platform, which, using the configuration files from the package, creates a Docker container, places all data and code into it, and provides a view through a web browser. This means that Dataverse users will not need to download any of the files to their personal computers, nor will they need to set up a computational environment to execute and explore the deposited files. So far, the "Explore" button is functional for the Whole Tale platform, while the others are underway.

The researcher whose starting point is the reproducibility platform will be able to import materials for their analysis directly from Dataverse. An example where a researcher is importing Dataverse open data as "external data" into Whole Tale is shown in Figure 3, as this integration is now implemented. Similarly, Figure 4 shows new integration developments with the lightweight cloud platform Binder, which now enables users to import and view data from Dataverse. The export of the encapsulated research material into Dataverse will also be possible, which means that, once an analysis is ready for dissemination, the researcher would initiate an "analysis export" in a reproducibility platform, which would then copy the files from the Docker container into Dataverse. This way, all necessary computational dependencies are automatically recorded by a reproducibility tool and stored at a data repository following preservation standards. This functionality is already implemented in Renku.

Importing replication packages from the reproducibility platforms means that Dataverse would need to support the capture of their virtual containers.
(Dataverse documentation for integrations: http://guides.dataverse.org/en/4.20/admin/external-tools.html. Renku integration code: https://github.com/SwissDataScienceCenter/renku-python/pull/909.)
Figure 4: Binder's GUI (https://mybinder.org) supports viewing content from Dataverse.
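The Binder integration shown in Figure 4 can be sketched as a URL construction, assuming mybinder.org's v2/<provider>/<spec> launch-URL scheme with a Dataverse content provider; the DOI below is a placeholder.

```python
# Build a Binder launch URL for a Dataverse dataset DOI, assuming the
# mybinder.org route https://mybinder.org/v2/dataverse/<DOI>.
from urllib.parse import quote

def binder_launch_url(doi: str, binder: str = "https://mybinder.org") -> str:
    """Return a Binder launch URL for the given Dataverse dataset DOI."""
    # Percent-encode unusual characters but keep the '/' separators in the DOI.
    return f"{binder}/v2/dataverse/{quote(doi, safe='/')}"

url = binder_launch_url("10.7910/DVN/EXAMPLE")
```

Opening such a URL asks BinderHub to fetch the dataset, build a container from any configuration files it contains, and serve an interactive session in the browser.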
Since all the aforementioned reproducibility platforms are based on the containerization technology Docker, new Dataverse developments focus on Docker containers. Docker containers can be built automatically from the instructions laid out in a "Dockerfile". A Dockerfile is an often tiny text file that contains commands, typically for installing software and dependencies, to set up the runtime environment needed for a research analysis. The Dataverse platform will encourage depositing Dockerfiles to capture the computational environment. This would allow users to explore replication packages in any supported reproducibility tool, as Dockerfiles are agnostic to computational platforms. An alternative solution would be to create a Dataverse Docker registry where whole images would be preserved. This approach will not be pursued at this time due to excessive storage requirements.

It is important to mention that, at present, any file can be stored in Dataverse, including a Dockerfile, and there are currently dozens of Dockerfiles stored in Harvard's instance of Dataverse. However, whereas previously Dockerfiles were treated as "other files", in the new development they will be identified at upload and will require additional metadata. When a reproducibility platform automatically generates a Dockerfile, it is likely to be suitable for portability and preservation. However, when a researcher prepares it, this might not be the case. Dockerfiles can be susceptible to some of the practices that cause irreproducibility, like the use of absolute (fixed) file paths, which is why Dataverse will encourage its users to follow best practices when depositing these files. In particular, before depositing a Dockerfile, the researcher will be prompted to confirm that their Dockerfile does not include any of the common reproducibility errors.
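To make this concrete, the following is an illustrative sketch (not Dataverse's actual implementation) of how two such common errors, unpinned base images and absolute host paths in COPY/ADD sources, could be flagged before deposit:

```python
# Illustrative Dockerfile check for two common reproducibility problems:
# an unpinned base image (FROM with no tag) and an absolute host path
# used as a COPY/ADD source.
import re

def dockerfile_warnings(text: str) -> list:
    """Return a list of warning strings for the given Dockerfile text."""
    warnings = []
    for line in text.splitlines():
        line = line.strip()
        # "FROM ubuntu" is unpinned; "FROM python:3.8-slim" is pinned.
        if re.match(r"FROM\s+\S+$", line) and ":" not in line.split()[1]:
            warnings.append(f"unpinned base image: {line}")
        # A source path starting with "/" refers to the author's machine.
        if re.match(r"(COPY|ADD)\s+/", line):
            warnings.append(f"absolute source path: {line}")
    return warnings

good = "FROM python:3.8-slim\nCOPY requirements.txt /app/\n"
bad = "FROM ubuntu\nCOPY /home/alice/data.csv /app/\n"
```

A real check would cover more cases (network fetches of unversioned resources, `latest` tags, and so on), but even this minimal screen catches the absolute-path problem the text describes.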
In addition to capturing a Docker container via a Dockerfile, it is important to capture the sequence of steps that the user ran to obtain their results. This applies to results obtained with command-line languages such as Python, MATLAB, and Julia. Capturing the command sequence is particularly important when there are multiple code files within the replication dataset without a clear notation of the order in which they should be executed. The commands will be captured in the replication package metadata using a community standard that is yet to be determined (see, for example, RO-Crate [2]).

In case the replication package was pulled from a reproducibility tool, like Code Ocean or Whole Tale, these replication commands would be automatically populated. For example, Code Ocean generates the commands that build and run a Docker container for each replication package, and it also encourages researchers to specify the command sequence in a file called "run" to automate their code. This means that all command sequences that run "outside" and "inside" the container are captured. Dataverse users who choose not to use a reproducibility platform would need to manually specify this sequence based on presented best practices.
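Since the metadata standard is still to be determined, the following is only a hypothetical sketch of what recording an ordered command sequence as replication-package metadata might look like; the field names are illustrative, not part of any adopted schema.

```python
# Hypothetical serialization of an ordered command sequence as JSON
# metadata for a replication package. Field names are illustrative only.
import json

def command_metadata(commands: list) -> str:
    """Serialize an ordered list of shell commands as a JSON record."""
    record = {
        "executionSequence": [
            {"step": i + 1, "command": cmd} for i, cmd in enumerate(commands)
        ]
    }
    return json.dumps(record, indent=2)

meta = command_metadata([
    "python clean_data.py raw.csv clean.csv",
    "Rscript analysis.R clean.csv results.rds",
])
```

Whatever standard is adopted, the essential point is the same: the order of steps is explicit, so an external user (or a platform) can replay the analysis without guessing which file to run first.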
Because no dataset would be 'hidden' within a virtual container in Dataverse, all files originally used in research would be indexed and thus findable by common dataset search engines. They would also be accessible directly from the dataset landing page on the web. Their interoperability and reusability would now be improved through the integration with the reproducibility platforms, as the barriers to creating a runtime environment and running code files would be alleviated.
Dataverse traditionally aims to provide incentives for researchers to share data through data citation credit and data metrics, such as a count of downloads for datasets and access requests for restricted data. One of the completed new developments includes integrating certifications or science badges, such as Open Data and Open Materials, within a dataset landing page on Dataverse.

The new support for reproducibility tools and containers will also result in new metrics for the users. The datasets that are deposited through a reproducibility platform into Dataverse will be denoted with a 'reproducibility certification' badge that will signal their origin and easy execution on the cloud. For example, a replication package that was received from the reproducibility platform Whole Tale will include its origin information and encourage its exploration and reuse through Whole Tale.
The Dataverse integration with the reproducibility platforms and the new developments that improve the reproducibility of deposited research will facilitate research workflows relating to verification, preservation, and reuse in the following ways (shown in Figure 5):

Figure 5: Four main workflows that Dataverse aims to support with reproducibility platform integration.

(1) Research encapsulation. The first supported workflow enables authors to deposit their data and code through Code Ocean, Whole Tale, or Renku, which then create a replication package that is sent for dissemination and preservation to Dataverse. The Dataverse users who were not previously familiar with the containerization technology Docker will now be able to containerize their research through the new workflow. In addition, this workflow is particularly important for prestigious academic journals that verify research reproducibility through third-party curation services and a reproducibility platform. For example, code review at the journal Political Analysis, which collaborates with Code Ocean and Harvard Dataverse for data dissemination and preservation, will be significantly sped up with the deployment of this workflow, as all the code associated with a publication will already be automated, containerized, and available on the cloud.

(2) Modify and republish research. The second workflow covers pulling a replication package from Dataverse and republishing it after an update. This would create a new version of the package in Dataverse, as well as track provenance information about the original package. Peer review and revisions of the package should thus become much easier.
In addition, the replication packages on Dataverse that currently do not have information on their runtime environment could be updated and republished with a Dockerfile generated by one of the reproducibility platforms.

(3) View deposited research materials. The third functionality allows viewing and exploring the content of deposited research without the need to download the files and install new software. This could be particularly valuable for external researchers and students who would like to understand research results or reuse data or code.

(4) Preserve the computational environment with a Dockerfile. Through the new developments in Dataverse that encourage depositing Dockerfiles following best practices, researchers who are experienced in using Docker will now be able to adequately preserve these files in the repository.
In the last decade, there has been extensive discussion around preservation, reproducibility, and openness of computational research, which has resulted in multiple new tools to facilitate these efforts, the most popular being data repositories and reproducibility platforms. However, individually these two approaches cannot fully facilitate findable, interoperable, reusable, and reproducible research materials. This paper presents a robust solution achieved through their integration.

The described integrations have resulted in new functionality in Dataverse, such as expanding the existing API, introducing new replication-package metadata, and handling virtual containers via Dockerfiles. In addition to allowing research preservation in a reproducible and reusable way through the integrations, Dataverse aims to identify new and useful data metrics to be displayed on the dataset landing page. Because there is an increasing number of similar reproducibility tools, this paper also advocates for considering integration with an existing solution before (re)inventing a new reproducibility tool.
ACKNOWLEDGMENTS
Thank you to the Dataverse team for their help and support. This work is partially funded by the Sloan Foundation. Ana Trisovic is funded by the Sloan Foundation.
REFERENCES
[1] Adam Brinckman et al. 2019. Computing environments for reproducibility: Capturing the Whole Tale. Future Generation Computer Systems 94 (2019), 854–867.
[2] Eoghan Ó Carragáin, Carole Goble, Peter Sefton, and Stian Soiland-Reyes. [n.d.]. RO-Crate, a lightweight approach to Research Object data packaging. ([n.d.]).
[3] Kyle Chard, Niall Gaffney, Matthew B Jones, Kacper Kowalik, Bertram Ludäscher, Jarek Nabrzyski, Victoria Stodden, Ian Taylor, Matthew J Turk, and Craig Willis. 2019. Implementing computational reproducibility in the Whole Tale environment. In Proceedings of the 2nd International Workshop on Practical Reproducible Evaluation of Computer Systems. 17–22.
[4] Xiaoli Chen et al. 2019. Open is not enough. Nature Physics 15, 2 (2019), 113–119.
[5] April Clyburne-Sherin, Xu Fei, and Seth Ariel Green. 2019. Computational Reproducibility via Containers in Psychology. Meta-Psychology.
[6] Commun. ACM 59, 3 (2016), 62–69.
[7] Mercè Crosas, Julian Gautier, Sebastian Karcher, Dessi Kirilova, Gerard Otalora, and Abigail Schwartz. 2018. Data policies of highly-ranked social science journals. (2018).
[8] Ivo Jimenez, Carlos Maltzahn, Adam Moody, Kathryn Mohror, Jay Lofstead, Remzi Arpaci-Dusseau, and Andrea Arpaci-Dusseau. 2015. The role of container technology in reproducible computer systems research. IEEE, 379–385.
[9] P Jupyter, M Bussonnier, J Forde, J Freeman, B Granger, T Head, C Holdgraf, K Kelley, G Nalvarte, A Osheroff, et al. 2018. Binder 2.0 - Reproducible, interactive, sharable environments for science at scale. In Proceedings of the 17th Python in Science Conference, Vol. 113. 120.
[10] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B Hamrick, Jason Grout, Sylvain Corlay, et al. 2016. Jupyter Notebooks - a publishing format for reproducible computational workflows. In ELPUB. 87–90.
[11] Giuliano Maciocci, Michael Aufreiter, and Nokome Bentley. 2019. Introducing eLife's first computationally reproducible article. eLife Labs [Internet].
[12] PLoS Biology 9, 12 (2011), e1001195.
[13] National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. National Academies Press.
[14] Omar S Navarro Leija, Kelly Shiptoski, Ryan G Scott, Baojun Wang, Nicholas Renner, Ryan R Newton, and Joseph Devietti. 2020. Reproducible Containers. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 167–182.
[15] JM Perkel. 2019. Make code accessible with these cloud services.
[16] Stephen R Piccolo and Michael B Frampton. 2016. Tools and techniques for computational reproducibility. GigaScience 5, 1 (2016), 30.
[17] João Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. 2019. A large-scale study about quality and reproducibility of Jupyter notebooks. In Proceedings of the 16th International Conference on Mining Software Repositories. IEEE Press, 507–517.
[18] Anisa Rowhani-Farid and Adrian G Barnett. 2018. Badges for sharing data and code at Biostatistics: an observational study. F1000Research.