A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility
Nicholas J Tierney* (1,2) & Karthik Ram* (3)
Abstract
Data makes science possible. Sharing data improves visibility, and makes the research process transparent. This increases trust in the work, and allows for independent reproduction of results. However, a large proportion of data from published research is often only available to the original authors. Despite the obvious benefits of sharing data, and scientists advocating for the importance of sharing data, most advice on sharing data discusses its broader benefits, rather than the practical considerations of sharing. This paper provides practical, actionable advice on how to actually share data alongside research. The key message is that sharing data falls on a continuum, and entering it should come with minimal barriers.

Introduction

“Data! data! data!” he cried impatiently. “I can’t make bricks without clay.” - Sherlock Holmes (The Adventure of the Copper Beeches by Sir Arthur Conan Doyle)

Data are a fundamental currency upon which scientific discoveries are made. Without access to good data, it becomes extremely difficult, if not impossible, to advance science. Yet, a large majority of data on which published research papers are based rarely see the light of day and are only visible to the original authors (Rowhani-Farid and Barnett 2016; Stodden, Seiler, and Ma 2018). Sharing data sets upon publication of a research paper has benefits to individual researchers, often through increased visibility of the work (Popkin 2019; Kirk and Norton 2019). But there is also a component of benefit to the broader scientific community. This primarily comes in the form of potential for reuse in other contexts, along with use for training and teaching (McKiernan et al. 2016). Assuming that the data have no privacy concerns (e.g., human subjects, locations of critically endangered species), or that the act of sharing does not put the authors at a competitive disadvantage (data can be embargoed for reasonable periods of time), sharing data will always have a net positive benefit. First and foremost, sharing data along with other artifacts can make it easier for a reader to independently reproduce the results, thereby increasing transparency and trust in the work. Easy data sharing can also improve model training, as many different models can be tried and tested on the latest data sources, closing the loop on research and application of statistical techniques. Existing data sets can be combined or linked with new or existing data, fostering the development and synthesis of new ideas and research areas. The biggest of these benefits is the overall increase in reproducibility.

For nearly two decades, researchers who work in areas related to computational science have pushed for better standards to verify scientific claims, especially in areas where a full replication of the study would be prohibitively expensive. To meet these minimal standards, there must be easy access to the data, models, and code. Among the different areas with a computational bent, the bioinformatics community in particular has a strong culture around open source (Gentleman et al. 2004), and has made releasing code and associated software a recognized mainstream activity. Many journals in these fields have also pushed authors to submit code (in the form of model implementations) alongside their papers, with a few journals going as far as providing a “reproducibility review” (Peng 2011). In this paper we focus on the practical side of sharing data for the purpose of reproducibility.
Our goal is to describe various methods by which an analyst can share their data with minimal friction. We steer clear of idealistic ideas such as the FAIR data principles (Wilkinson et al. 2016) since they still do not help a researcher share their data. We also skip the discussion around citation and credit, because data citations are still poorly tracked and there is no standardized metric or an h-index equivalent for data as of this writing.

For a piece of computational research to be minimally reproducible, it requires three distinct elements: 1) Code; 2) Computing environment; and 3) Data. The first two of these challenges have largely been solved (Poisot et al. 2019).

Although code sharing in science had a rocky start (Barnes 2010), more and more code written by scientists is being shared, partly due to the rapid increase in training opportunities made available by organizations like The Carpentries, combined with the rapid adoption of GitHub by scientists (Ram 2013). The bigger driver for this may also be connected to the rise in popularity of data science as a discipline distinct from statistics (Donoho 2017). This rapid growth in data science has largely been fueled by easy access to open source software tools. Programming languages such as Python, R, and Julia help scientists implement and share new methods to work with data (R Core Team 2019; “Python,” n.d.; Bezanson et al. 2017). Each of these languages is supported by thriving communities of researchers and software developers who contribute many of the building blocks that make them popular. As of this writing, Python, R, and Julia have 167k packages (“Pypi,” n.d.), ~14k packages (“Cran,” n.d.), and ~2k packages (“Julia-Pkgman,” n.d.), respectively. These packages form the building blocks of a data scientist’s daily workflow.

In a modern computational workflow, a user can pip install a package in Python, or use install.packages in R, to install all software dependencies for a project. By relying on the idea of having research compendia (Gentleman and Temple Lang 2007), or something as simple as a requirements file, it is relatively straightforward to install all the necessary scaffolding. Data, on the other hand, are rarely accessible that easily. A typical data analysis loads a dozen or two of these open source libraries at the top of a notebook and then relies on existing routines to rapidly read, transform, visualize, and model data. Each package depends on a complex web of other packages, building upon existing work rather than re-implementing everything from scratch. Working from a script and a list of such dependencies, a data analyst can easily install all the necessary tools in any local or remote environment and reproduce the computation. When new functionality is developed, it is packaged into a separate entity and added to a language’s package manager.

The computing environment is also easily captured with modern tools such as Docker (Boettiger 2015; Jupyter 2018). Tools such as Binder (Jupyter 2018) can parse Docker files and dependency trees to provide on demand, live notebooks in R and Python that a reader can immediately execute in the browser without dealing with the challenges of local installation. This makes it simple to load a specific environment to run any analysis. Code is handled by version control with tools like Git and GitHub (“Git,” n.d.; “Github,” n.d.), which, paired with archiving frameworks such as Zenodo, provide access to code (particularly model implementations) (Zenodo 2016).
All the necessary software is available from various package managers (and their numerous geographic mirrors and archives), making it easy to install any version of a software package. However, the biggest challenge, even today, remains easy and repeatable access to data in a data analysis.

Datasets are often far more diverse than code in terms of complexity, size, and formats. This makes them particularly challenging to standardize or easily “install” where the code is running. While there are numerous public and private data repositories, none of them function as package managers, which is what provides so much robustness to code sharing. As a result, data used in an analysis are often read from various locations (local, network, or cloud), in various formats, with varying levels of tidiness (Wickham 2014). There is also the overhead associated with data publishing - the act of archiving data in a repository that also mints permanent identifiers - which is not required of all projects due to the effort and time involved. It is worth drawing the distinction between data sharing (making the data available with little effort) and data publishing (archiving the data in a long-term repository, with or without curated metadata).

What aspects of data make them particularly challenging to share from a reproducibility perspective? This is the question we tackle in this paper. While there are several papers that serve as best-practice guides for formatting data (Broman and Woo 2017) and getting them ready for sharing (Ellis and Leek 2017), the aims of this paper are a bit different. Our aim is to address the issue of data in the context of reproducibility in data science workflows. In particular we discuss the various tradeoffs one has to consider when documenting and sharing data, when it is worth publishing, and how this changes depending on the use case, perceived impact, and potential audience.

We discuss how to share and/or publish data, and cover various tradeoffs when deciding how much to do. We skip detailed discussions of data preparation (Broman and Woo 2017; Ellis and Leek 2017), and citation (as of this writing, data citation is still in its infancy). We also analyze the state of data contained inside software packages, shared and made available as part of modern data analysis. How often are data shipped inside software packages? How do data format and size impact availability?

To share data alongside an analysis, there should be some minimal set of requirements. For example, the data should contain information on metadata, data dictionaries, the README, and the data used in analysis (see Section 2). No matter where data are submitted, there should ideally be a canonical data repository in one long term archive that links to others. It should also have an accession number, or Digital Object Identifier (DOI), which allows for it to be cited (discussed further in Section 2.4).

There are 8 pieces of content to consider for data sharing:

1. README: A human readable description of the data
2. Data dictionary: A human readable dictionary of the data contents
3. License: How to use and share the data
4. Citation: How you want your data to be cited
5. Machine readable metadata: Make your data searchable
6. Raw data: The original/first data provided
7. Scripts: To clean raw data ready for analysis
8. Analysis ready data: Final data used in analysis

One basic suggested directory layout is given below in Figure 1.

Figure 1:
Example directory layout and structure for a data repository.
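As an illustrative sketch of such a layout (file and folder names here are examples rather than prescriptions, following the sections below):

```
project/
├── README.md       # human readable description of the data (2.1)
├── LICENSE.md      # how the data may be used and shared (2.3)
├── citation.md     # preferred citation, ideally with a DOI (2.4)
├── metadata.json   # machine readable metadata, e.g., Table Schema (2.5)
├── data-raw/
│   ├── survey-raw.xlsx   # raw data, as first received (2.6)
│   └── 01-tidy.R         # scripts that clean the raw data (2.7)
└── data/
    ├── survey.csv        # analysis ready (tidy) data (2.8)
    └── dictionary.csv    # data dictionary in raw form (2.2)
```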
From these sections, the minimal files to provide, in order of most importance, are:

1. README
2. Data dictionary
3. License
4. Citation
5. Analysis ready data
6. Scripts to tidy raw data into analysis ready data

These sections are now described.

2.1 README: A human readable description of the data

In the context of datasets, READMEs are particularly useful when there are no reliable standards. A README is often the first place people will go to learn more about anything in a folder - they are very common in software, and historically were included so that the uppercase letters of README meant it would be at the top of a directory. The README is meant for someone to read and understand more about the data, and contains the “who, what, when, where, why, & how”:
• Who collected it
• What is the data
• When was it collected
• Where was it collected
• Why it was collected
• How it was collected

The README should be placed in the top level of the project, with one README per dataset. It should be brief, provide links to the other aforementioned sections, and be in one directory. It should also contain any other guidance for the user on how to read and interpret the whole directory - for example, explaining where tidy and raw data are stored, and other scripts for tidying. Saving a README with the extension .md gives the author the formatting benefits of markdown, making it simple to insert links, tables, and make lists. In systems like GitHub, a README file is detected and rendered in nice HTML on the repository by default.
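For example, a minimal README sketch (all names, dates, and details below are illustrative):

```markdown
# University staff survey data

Survey of academic staff (what) collected by the Example Lab (who)
between 2018-01-01 and 2018-12-31 (when) at Example University (where)
via an online questionnaire (how) to study career progression (why).

- `data/`: analysis ready data; see the data dictionary below
- `data-raw/`: raw data, plus the scripts that tidy it
- License: CC BY 4.0 (see `LICENSE.md`); citation details in `citation.md`
```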
2.2 Data dictionary: A human readable dictionary of the data contents

A data dictionary provides a human readable description of the data, giving context on the nature and structure of the data. This helps someone not familiar with the data understand and use the data. At a minimum it should contain the following pieces of information about the data:

• variable names,
• variable labels,
• variable codes, and
• special values for missing data.

Figure 2 shows an example of the data and data dictionary details. Variable names should be short, descriptive names with no spaces or special characters. In the most common tabular data case, these correspond to column names, for example, “job_position”, “faculty_level”, and “years_at_uni”.
Variable labels are longer descriptions of variables. For example, “University Job Position”, “University Faculty Position Level”, and “Number of Years Worked at University” (“Codebook Cookbook: A Guide to Writing a Good Codebook for Data Analysis Projects in Medicine” n.d.; Arslan 2019).
Variable codes apply to categorical (factor) variables, and are the values for their contents. For example, a variable could contain answers to a question with codes like 0 = is not in statistics dept. and 1 = is currently in statistics dept., to indicate if someone is in statistics. These should be consistent across similar variables, to avoid problems where, say, 1 means yes for one variable but no in another. Date variables should have consistent formatting. For example, all date information could be in format “YYYY-MM-DD”, and this should not be mixed with “YYYY-DD-MM”. Date formatting should be described in the variable labels.
Missing data are values that should have been observed, but were not. The code for missingness should be documented in the data dictionary, and should nominally be NA. If the reason for missingness is known it should be recorded. For example, censored data, patient drop out, or measurement error can have different values, such as “unknown”, -99, or other value codes (White et al. 2013; Broman and Woo 2017).
Figure 2: Four tables - the data, and its variable names, variable codes, and the meaning of missing values.

Table 1 shows an example data dictionary table from the Tidy Tuesday repository on incarceration trends (“Tidy-Tuesday-Incarcerate,” n.d.). This includes information on the variable, its class (type), and a longer description.

Table 1:
The prisoner summary data dictionary, with columns for the variable, its class, and a short description of the contents of the variable.
| Variable | Class | Description |
|----------|-------|-------------|
| year | integer (date) | Year |
| urbanicity | character | County-type (urban, suburban, small/mid, rural) |
| pop_category | character | Category for population - either race, gender, or Total |
| rate_per_100000 | double | Rate within a category for prison population per 100,000 people |

Data dictionaries should be placed in the README and presented as a table. Every data dictionary should also be provided in its raw form (e.g., a CSV) in the repository, so they aren’t “trapped” in the README.
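For example, the dictionary in Table 1 could be shipped as a plain CSV like the following sketch:

```
variable,class,description
year,integer (date),Year
urbanicity,character,"County-type (urban, suburban, small/mid, rural)"
pop_category,character,"Category for population - either race, gender, or Total"
rate_per_100000,double,"Rate within a category for prison population per 100,000 people"
```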
2.3 License: How to use and share the data

Data with a license clearly establishes rules on how everyone can modify, use, and share data. Without a license, these rules are unclear, and can lead to problems with attribution and citation. It can be overwhelming to try and find the right license for a use case. Two licenses that are well suited for data sharing are:

1. Creative Commons Attribution 4.0 International Public License (CC BY), and
2. Creative Commons CC0 1.0 Universal (CC0)

Other licenses or notices to be aware of are copyrighted data, and data embargoes. If you are working with data that is already copyrighted (for example under CC BY or CC0), you must follow the appropriate guidelines for giving credit. Data may also be under embargo, meaning the data cannot be shared more widely until a specific release time. If sharing data under an embargo, include detailed information on the embargo requirements in the README, and in separate correspondence with those who receive the data. Databases may also be licensed with the Open Data Commons Open Database License (https://opendatacommons.org/licenses/odbl/), which provides provisions for sharing, creating, and adapting, provided that work is attributed, shared, and kept open.

Once you decide on a license, you should provide a LICENSE or LICENSE.md file that contains the entire license in the top level of the directory.

The CC BY enforces attribution and due credit by default, but gives a lot of freedom for its use. Data can be shared and adapted, even for commercial use, with the following conditions:

• You must provide appropriate credit to the source. This means listing the names of the creators.
• Link back to the CC BY license, and
• Clearly show if changes were made.
• Data cannot be sub-licensed, that is - a change to the existing license.
• There is also no warranty, so the person or people who obtained the data cannot be held liable.

The journal PLOS Computational Biology requires that data submitted cannot be more restrictive than CC BY (“PLOS Computational Biology,” n.d.). For a brief overview of the CC BY, suitable to include in a README, see (“CCBY Short Guide,” n.d.). For the full license, see (“CCBY Full License,” n.d.).
The CC0 is a “public domain” license. Data with a CC0 license means the data owners waive all their rights to the work, and it is now “owned” by the public. The data can be freely shared, copied, modified, and distributed, even for commercial purposes, without asking permission. When using data with CC0, it is good practice to cite the original paper, but it is not required. If you wish to use the CC0, see (“Choose-Cc0,” n.d.). For a brief overview of the CC0, see (“CC0 Short Guide,” n.d.), and for the full license, see (“CC0 Full License,” n.d.).
2.4 Citation: How you want your data to be cited

A Digital Object Identifier (DOI) is a unique identifier for a digital object such as a paper, poster, or software, and is a prerequisite for citation. For example, if the data are deposited in repositories like Dryad, Zenodo, or the Open Science Framework, the best practice would be to copy the citation created by these repositories. Under the hood, the DOI is “minted” by DataCite, who provide DOIs for research outputs and data. This means that when citing data, it only makes sense to cite datasets that have been deposited into a DataCite compliant repository. If a DOI is unavailable, a citation will be meaningless, as it cannot be tracked by any means. A file named citation should be placed in the directory, at the same level as the README. It should contain a DOI, and could be in .bibtex format.
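For example, the citation file could hold a BibTeX entry along the lines of the following sketch (all details, including the DOI, are placeholders to be replaced with those minted by the repository):

```bibtex
@misc{example_data_2019,
  author    = {Tierney, Nicholas J. and Ram, Karthik},
  title     = {Example analysis ready data},
  year      = {2019},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.0000000},
  url       = {https://doi.org/10.5281/zenodo.0000000}
}
```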
2.5 Machine readable metadata: Make your data searchable

The README and data dictionary provide human readable information on the data. To ensure data types are preserved - for example, dates are dates, names are characters - there needs to be some form of machine readable metadata. This provides a structure allowing the data to be indexed and searched online, through services such as Google Dataset Search (Castelvecchi 2018). An excellent standard for metadata is Table Schema, written by Frictionless Data (Fowler, Barratt, and Walsh 2017). For example, a dataset “demographics” with three variables is shown in Table 2, and the JavaScript Object Notation (JSON) equivalent in Figure 3.
Table 2:
Example demographics table of age, height, and nationality.

| age | height | nationality |
|-----|--------|-------------|
| 12 | 161.5 | Australian |
| 21 | 181.2 | American |
| 37 | 178.3 | New Zealand |

Figure 3: Example snippet of some Table Schema data for a dataset with three variables. This provides a description of each field, and the type of field.
This contains fields such as “path”, describing the file path to the data, and a schema with sub-fields name and type for each variable. It also provides information for licensing and features such as line breaks and delimiters, useful to ensure the data is correctly imported into a machine. It is built on JSON, a lightweight, human-friendly, machine readable data-interchange format. Table Schema is baked into formats such as csvy (“Csvy,” n.d.), an extended CSV format, which has additional front matter in a lightweight markup format (YAML) using Table Schema.

In contrast to a CSV data dictionary, JSON-LD has a defined, nested structure, which means more can be communicated efficiently, while still maintaining readability, and avoiding the extra writing and repetition that comes with a plain CSV where everything is defined from scratch.

Another rich format is XML, the eXtensible Markup Language. This was an early iteration on the idea of a plain text format that was both human and machine readable. It is powerful and extensible (it powers the entire Microsoft Office suite), but for metadata purposes it is not as human readable as JSON. JSON is shorter, and quicker and easier to read and write than XML, and can also be parsed with standard JavaScript, whereas XML must be parsed by a special XML parser. XML and JSON can be converted into each other’s respective formats, for example with the json or xml2 R packages. XML is still used as metadata for data storage, for example in EML, the Ecological Metadata Language (n.d.).

To create appropriate metadata, we recommend metadata generators such as dataspice (Dataspice: Create Lightweight Schema.org Descriptions of Dataset 2018) or codebook (Arslan 2019).
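To make the Table Schema discussion concrete, here is a sketch of such a descriptor for the demographics table in Table 2 (the file path and field descriptions are illustrative):

```json
{
  "name": "demographics",
  "resources": [
    {
      "path": "data/demographics.csv",
      "schema": {
        "fields": [
          {"name": "age", "type": "integer", "description": "Age in years"},
          {"name": "height", "type": "number", "description": "Height in centimetres"},
          {"name": "nationality", "type": "string", "description": "Nationality of the person"}
        ]
      }
    }
  ]
}
```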
2.6 Raw data: The original/first data provided

Raw data is usually the first format of the data provided, before any tidying or cleaning. If the raw data is a practical size to share, it should be shared in a folder called data-raw. The raw data should be in the form that was first received, even if it is in binary or some proprietary format. If possible, data dictionaries of the raw data should be provided in this folder as well.
2.7 Scripts: To clean raw data ready for analysis

Any code used to clean and tidy the raw data should be provided in the data-raw directory. The cleaned up, analysis ready data should be placed in the data/ directory. Ideally this would involve only scripted languages, but if other practical steps were taken to clean up the data, these should be recorded in a plain text or markdown file.

2.8 Analysis ready data: Final data used in analysis

The data used in the data analysis should be provided in a folder called data. Ideally, the “analysis ready data” should be in “Tidy Data” format (Wickham 2014), where tidy data contains variables in columns, observations in rows, and only one value per cell.

In contrast to “raw data”, “analysis ready data” should be in an easily readable plain-text format, such as comma-, tab-, or semicolon-separated files. These typically have file extensions like .csv, .tsv, or .txt. Binary or proprietary formats - for example, R data formats like .rds and .rda, or SPSS, STATA, and SAS data formats like .sav, .dta, and .sas7bdat - are discouraged in favor of interoperability, as they require special software to read, even if plain text is sometimes slower to read and write, and harder to share due to size.

One low cost and easy way to distribute data alongside compute is to package the datasets as a data only package, or as part of something larger where the methods are being implemented. The R ecosystem has many noteworthy examples of data-only packages. One exemplar is the nycflights13 package by Hadley Wickham (Wickham 2018). This package makes available airline data for all flights departing New York City in 2013 in a tidy format, with distinct tables for metadata on airlines, airports, weather, and planes. The package not only provides ready to use binary data, but also ships raw data (in a folder called data-raw) along with the scripts used to clean them. The package was originally created as a way to provide example data to teach tidy data principles, and serves as a great model for how one can publish a data package in R.

A more common use case is to include data as part of a regular package where analytical methods are being implemented and shared. This serves the dual purpose of exposing methods and data together, making it extremely easy for a researcher to simply install and load the library at the top of their script. CRAN’s distributed network (along with the global content distribution network maintained by RStudio) ensures that the package is quickly accessible to anyone in the R ecosystem. A second critical advantage of this approach is that one can also benefit from R’s package documentation syntax to generate useful metadata for fast reference. This approach can also easily be adapted to other languages. Python, for example, is far more flexible with respect to including arbitrary files as part of a package. Other benefits of packaging data in R include:
1. Packaging data can be a very powerful pedagogical tool to help researchers understand how to transform data and prepare it for further analysis. To do so, one can package raw data alongside scripts. Long form documentation such as vignettes can provide further detailed discussion of the process. Users can also skip the raw data and scripts and proceed directly to the fast binary data, which can hold a lot of data when heavily compressed.

2. When the primary motivation for shipping data is to illustrate visualization techniques or to run examples, one can skip the raw data and processing scripts and only include the binary data. As long as the total package size does not exceed 5 megabytes, it would be acceptable as part of CRAN. For cases when this size is hard to maintain, CRAN recommends data-only packages that will be rarely updated. For a detailed discussion on this issue and alternative approaches, see Brooke Anderson and Eddelbuettel (2017).

Unlike CRAN, Bioconductor does not have a 5 megabyte limit for package size and offers a more liberal data inclusion policy. It has a strict specification for how to organize genomic data in a package, so the data can be used for analysis out of the box. While a robust solution, it only works for homogeneous data as found in bioinformatics. For example, data in their ExperimentData section (bioconductor.org/packages/3.9/data/experiment) can be used reliably. Such a strict standard would be impossible to enforce or scale on a repository as heterogeneous as CRAN. However, CRAN packages are still generally used for demonstration or testing purposes, over generating new knowledge for papers.

One major disadvantage of packaging data inside R is that it makes the data availability very language centric. Non R users are unlikely to download and export data out of a package. This is why we recommend, as a rule, that researchers also archive data in a long-term data repository. These include domain specific repositories (see Section 6) or more general purpose ones such as Zenodo or Figshare, and that researchers include the persistent identifier in all locations where the data is referenced, such as the manuscript, notebook, and data package.

Of the 15539 packages on the Comprehensive R Archive Network (CRAN), 6278 contain datasets, either as binary data (5903 packages) or as external datasets (766). Binary files comprise the bulk of the data shipped in the data folder (68.06%), with other plain text formats such as txt, CSV, dat, and json comprising less than one percent of data shipped in packages.

Another common situation that researchers face is in dealing with data that fall somewhere between small and large. For example, small data could be tabular, as a CSV, in the order of bytes to megabytes to gigabytes that can be easily compressed, and large could be what falls under the umbrella of big data. The happy medium is generally data that are too big to fit in the memory of most standard machines, but can successfully fit on disk (https://ropensci.github.io/arkdb/articles/articles/noapi.html). In this case, users who do not have the support to maintain resource intensive database servers can instead rely on light serverless databases such as MonetDB or SQLite. These databases provide disk based storage, and using language agnostic interfaces, an analyst can easily query these data in manageable chunks that don’t overrun memory limitations. Using software packages such as arkdb (Boettiger 2018), one can easily chunk large data from flat text files into these lite databases without running into memory limitations.
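A minimal sketch of this approach (file, database, and table names are illustrative):

```r
# Archive a large flat file into a serverless SQLite database in chunks,
# so the data never has to fit in memory at once.
library(arkdb)
library(DBI)

con <- DBI::dbConnect(RSQLite::SQLite(), "data/teaching.sqlite")

# Stream the text file into the database 50,000 lines at a time;
# the table name is taken from the file name ("teaching").
unark("data-raw/teaching.csv", con, lines = 50000)

# Analysts can then pull manageable chunks via ordinary SQL
DBI::dbGetQuery(con, "SELECT * FROM teaching LIMIT 10")

DBI::dbDisconnect(con)
```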
Another option is to break large text data files into chunks with named sequences (e.g., teaching-1.csv, teaching-2.csv, etc.).

To make these files available alongside compute, one ingenious but short-term solution is to use the GitHub release feature to attach such large database dumps. GitHub releases are designed to serve software releases and provide the opportunity to attach binary files. This mechanism can also be used to attach arbitrary files such as large data files, binary data, and database dumps, as long as each file does not exceed 2 GB. The R package piggyback (https://docs.ropensci.org/piggyback/) allows for uploading and downloading such files to GitHub releases, making it easy for anyone to access data files wherever the script is being run. We emphasize again that this is a short-term solution that is dependent on GitHub maintaining this service.
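A sketch of this workflow (the repository, tag, and file names are illustrative; it assumes a GitHub token is set and that the release already exists):

```r
# Attach a large data file to a GitHub release with piggyback,
# keeping the data next to the code without committing it to git.
library(piggyback)

pb_upload("data/teaching-1.csv", repo = "user/repo", tag = "v1.0.0")

# Anyone running the analysis elsewhere can fetch the same file:
pb_download("teaching-1.csv", repo = "user/repo", tag = "v1.0.0")
```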
Challenges in documenting your dataset

There are many features to include with data alongside publications. However, not all of these are needed for every dataset. Working out which are needed is a challenge that is not discussed in the literature. This section discusses a framework to help users decide how much they should document their data.

To frame discussion around the challenges of documenting data, we can think of how an individual dataset falls on two features, “Effort to prepare” and “Ease of understanding”, shown in Figure 4. The ideal space to be in the graph would be the top left hand corner. But what we notice is that taking more effort to prepare data means that the data is easier to understand.
Figure 4: There is a big difference in the effort to prepare data, and how easy it is to understand - look at the difference between most datasets, and something like a Randomized Control Trial (RCT). (The figure plots example datasets - Data Dump, Most studies, Sensor Data, the 50 hectare plot, RCT, and Simulation Study - on axes of Effort to Prepare against Ease of Understanding.)
Data with higher potential impact and reuse should be made as easy to understand as possible; but this also requires more time and effort to prepare. Impact is hard to measure, and varies from field to field, but as an example, take some data from medical randomized control trials (RCTs) on cancer treatment. These can have high impact, so they require a lot of time and effort to document. Comparatively, a small survey on a few demographics can have low impact. Placing these on the graph above, we see a small survey might not have a worthwhile tradeoff between ease of understanding and ease of preparation. This means the effort put into preparing data documentation for a small survey should be kept relatively simple, not over complicated. Likewise, data that can be created via simulation from open source software could arguably not be shared, since it can be generated from scratch with code - a reproducible process that requires computer time, not person time, to create.

Deciding how much data documentation to include should be based on the impact of the data. The more impactful the data, the more documentation features to include. Figure 5 shows the practical types of steps that can be taken against the effort required.

Figure 5:
The steps required compared to effort required for data sharing for reproducibility.
Creating good documentation has similar challenges to good writing: it takes time, and it can be hard to know when you are done. Thinking about two features - 1) the impact of data, and 2) the current effort to do each step - provides guidelines for the researcher to decide how much documentation they should provide for their data.

Documentation challenges may evolve over time as the costs of making data easy to prepare and understand change. For example, if new technology automates rigorous data documentation, thorough documentation could take the same time as poor documentation does now. In this case, there would be no reason why more people could not do this, even when the benefits are not immediately apparent. So even if something might appear to have low impact, it should still be made easy to understand, if it does not take long to prepare that documentation.

It is worth distinguishing between sharing data and publishing data. Data can be shared in numerous ways without going through the trouble of publishing it, which often requires metadata that a human must verify. Sharing can take numerous forms, including placing data in a repository, packaging it with methods, or using various free tiers of commercial services (such as Dropbox or Google Drive). However, one must publish data when appropriate.

There are three common options for publishing data in research:
1. Least Moving Parts
2. Domain Specific Venue
3. Long Term Archive
We discuss each of these options and provide recommendations on how to publish data in these areas.
In Least Moving Parts, the data might be published with an R package, or as part of a GitHub release using piggyback (“Piggyback,” n.d.), or in a serverless database. This approach gets the data somewhere rather than nowhere. Its minimal features mean it is simple to maintain. In addition, piggyback also keeps the data in the same location as the code, making it easier and simpler to organize. A downside is that it does not scale to larger data. Self hosting the data is an option, but we discourage this, as it may succumb to bit rot - where data is corrupted, degraded, or servers turn off.

In Domain Specific Venue, data can be published in journal data papers, or venues specific to the type of data. For example, an astronomy project hosts its data at the Sloan Digital Sky Survey (SDSS), and genetic data can be hosted at GenBank (Benson et al. 2005).

The purpose, use, and origin of the data is an important component to consider. Data for research has a different domain compared to data collected by governments, or by journalists. Many governments or civil organizations are now making their own data available through a government website interface. Media and journalism groups are also making their data available, either through places like GitHub (“Pudding-Data,” n.d.), or by self hosting their data, such as FiveThirtyEight (“Data-538,” n.d.).

This is a good option when the data is appropriate for the domain, which is often decided by community standards. We recommend adhering to community standards for a location to publish your data, as this will likely improve data reuse. The guidelines suggested in this paper for sharing the data should be included.
A Long Term Archive is the best option to share the data. Long term archives provide information such as a DOI (Digital Object Identifier) that makes the data easier to cite. Data can be placed in a long term archive and a DOI can be minted. This DOI can then be used in other venues, such as domain specific or even self hosted ones, and will ensure that the projects refer back appropriately.

If the dataset you are shipping has a research application, the most relevant home would be the research data repositories Zenodo, Dryad, or the Open Science Framework (OSF). Zenodo, launched in 2013 in a joint collaboration between OpenAIRE and CERN, provides a free, archival location for any researcher to deposit their datasets. The only limits to file sizes are 50 GB for individual files, which is generous enough to accommodate a large number of use cases, and Zenodo is able to accommodate larger file sizes upon request. The Dryad Digital Repository (“Dryad,” n.d.) will take data from any field of research, and perform human quality control and assistance with the data, with the ability to link data with a journal publication, in exchange for a data publishing fee. The OSF is a tool that captures the research process online. This ranges from conceiving research ideas, study design, and writing papers, to storing research data. The entire history of the project is recorded, so it promotes centralized and transparent workflows. Although not specifically designed for data, like Dryad or Zenodo, the OSF does provide a DOI minting service, and a more holistic approach, which might be appealing to users (Spies et al. 2012).

Scientists have numerous venues to deposit their data, many of which are ephemeral, despite the convenience they offer. For data that are not narrow in scope with limited potential for reuse, we recommend publishing data in an established data repository such as Dryad. In addition to providing a long-term home for data, these repositories also include data curation by a professional data manager, compliance with open source standards, and metrics.
Data used in publications are often shared in the supplementary materials of articles, or served on repositories such as the Dryad Digital Repository (“Dryad,” n.d.). Dryad makes data from scientific publications discoverable, reusable, and citable. It is well funded through grants from the National Science Foundation and the European Commission.

To provide better context around the data used in research, and to better expose data for reuse, journals are now adding “data papers”. These are specifically designed for publishing articles about the data, and sharing it. This benefits both researchers and readers. Researchers receive credit for data they have collected or produced, and readers get more context about the data.

Data papers are similar to research articles: they have titles, authors, affiliations, an abstract, and references. They generally require an explanation of why the data is useful to others, a direct link to the data, and a description of the design, materials, and methods. Other information on the subject area, data type, format, and related articles is usually required.

Whilst useful, these requirements do not tell the author how to actually structure the data and folders for reuse. Instead, they provide ideas on what to include. This is a useful step towards improving data reuse, but it lacks the minimal structure that allows a researcher to have a predictable way to access and interpret the data.

Other journals operating in this space include “Data in Brief”, “Data”, and “Nature Scientific Data”. Guidelines for what is required in the content of the paper, and the required information along with the data (metadata, data format, etc.), vary by journal.

Tooling for producing data documentation speeds up and simplifies the process of sharing data. Machine-readable metadata that can be indexed by Google can be created using the dataspice package
(Dataspice: Create Lightweight Schema.org Descriptions of Dataset 2018), or with the codebook package in R (Arslan 2019), which also generates machine readable metadata. Data dictionaries are implemented in other software such as STATA, which provides a “codebook” command. Data can be packaged up in a “data package” with DataPackageR (Greg Finak 2019), which provides tools to wrap up data into an R package, while providing helpers with MD5sum checks that allow checksum file comparisons for versioning. Note that this is different to Frictionless Data’s tabular data package spec (Fowler, Barratt, and Walsh 2017).
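As a sketch of how little effort this tooling can require (assuming the template-then-write workflow described in the dataspice README):

```r
# Generate machine readable, Schema.org JSON-LD metadata with dataspice
library(dataspice)

create_spice()  # writes editable CSV templates (creators, attributes,
                # access, biblio) into data/metadata/
# ... fill in the templates, then:
write_spice()   # renders them into a dataspice.json JSON-LD file
```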
We now explore the variety of documentation practices of a few selected datasets.
Long-term field surveys, where the same type of data is repeatedly measured and collected over a long period of time, are an example of data collection where the time and financial investment would necessitate meticulously curated and easily accessible metadata. Both because the same protocol is being followed year after year, and because field data collection efforts are quite expensive, these data need to be well documented, with documentation available alongside the data.

One example of a very laborious, long-term study is the 50-hectare plot on Barro Colorado Island in Panama. Researchers at the field station have censused every single tree in a pre-defined plot 7 times over the past 30 years. More than a million trees have been counted. The data and metadata, however, are unnecessarily difficult to reach. To obtain the data, one must fill out a form and agree to terms and conditions which require approval and authorship prior to publication. The biggest challenge with this study is that the data are stored on a personal FTP server of one of the authors. While the data are available in CSV and binary R data formats, the data storage servers do not have any metadata, README files, or a data dictionary. A separate, public archive hosts the metadata (https://repository.si.edu/handle/10088/20925) in a PDF file (https://repository.si.edu/bitstream/handle/10088/20925/RoutputFull.pdf?sequence=1&isAllowed=y) that describes the fields.
Datasets obtained from sensors, such as meteorological data, are typically easy to prepare and understand. This is because sensors measure specific features, so the description of the data type happens upstream at the instrument-design level, and flows down to data collection. The telescope data from the Sloan Digital Sky Survey (SDSS) is a great example of sensor data.

The SDSS data includes photometric and spectroscopic information obtained from a 2.5m telescope at Apache Point Observatory in New Mexico, operating since 2000 and producing over 200 GB of data every day (York 2000; Blanton et al. 2017; “Sdss-Website,” n.d.). This data has been used to produce the most detailed three dimensional maps of the universe ever made.

The data are free to use and publicly accessible, with the interface to the data designed to cover a wide range of skills - from the marvin service to streamline access and visualisation of MaNGA data (Cherinka et al. 2018), through to raw data access (see Figure 6).

Figure 6: Screenshot of raw data available through DR15 FITS.
The telescope data is very high quality, with the following features:

• Metadata in the machine readable standard, SCHEMA
• Data in multiple formats (e.g., .csv, .fits, and database)
• Previously released data available in archives
• Entire datasets available for download
• Raw data, before processing into the database, also available

For example, the optical and spectral data can be accessed in FITS data format, or even as a plain CSV, the first few rows of which are shown in Table 3. Wavelength plotted against Flux is shown in Figure 7, with Figure 7A showing the output from the SDSS website, and Figure 7B showing the output from the CSV. The fact that Figure 7A is virtually replicated in Figure 7B demonstrates great reproducibility.
Table 3:
Example spectra data from SDSS, showing values for Wavelength, Flux, BestFit, and SkyFlux.
| Wavelength | Flux | BestFit | SkyFlux |
|------------|------|---------|---------|
| 3801.893 | 48.513 | 44.655 | 14.002 |
| 3802.770 | 54.516 | 42.890 | 14.113 |
| 3803.645 | 52.393 | 47.813 | 14.142 |
| 3804.522 | 45.273 | 42.612 | 14.186 |
| 3805.397 | 51.529 | 39.659 | 14.221 |
| 3806.273 | 44.530 | 44.183 | 14.349 |

Figure 7:
Visualisations from SDSS can be effectively replicated. The spectra image of wavelength against flux from SDSS from two sources: Part A shows the example image taken from the SDSS website; Part B shows the replicated figure using the data provided as CSV. Although there has been some data cleaning or filtering, since the numbers are not identical, it still demonstrates the great potential for reproducibility.
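Reproducing Figure 7B takes only a few lines once the CSV is in hand - a sketch (the file name is illustrative; columns follow Table 3):

```r
# Plot Flux against Wavelength from the SDSS CSV export (cf. Figure 7B)
library(ggplot2)

spectra <- read.csv("sdss-spectra.csv")

ggplot(spectra, aes(x = Wavelength, y = Flux)) +
  geom_line() +
  labs(x = "Wavelength", y = "Flux")
```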
The SDSS provides data at different levels of complexity for different degrees of understanding. It is a huge project involving many people, and a staggering effort to prepare. Despite this, it is still very easy to understand. This is a further reflection of the idea that high effort can create highly understandable data (see Figure 4). The impact of this data is high, having changed the way we understand the past, present, and future of the universe itself. This impact is surely due to the care and thought put into making the data accessible and public. It is a statement of what is possible with good data sharing practices.
The vast majority of datasets are handled in an ad-hoc manner (Michener et al. 2001) and dumped in transient data stores with little usable metadata. The value of these datasets declines over time until they are useless. We recommend that researchers consider the tradeoff between the long-term value of the dataset and the effort required to document and archive it.

1. Decide whether publishing is appropriate for your data.
It is critical to publish your data if they are the basis for a peer-reviewed research publication. To be broadly useful, your data must be deposited in a permanent archive, to ensure that it does not disappear from an ephemeral location such as your university website. It should also contain useful metadata so that someone outside the project team can understand what the data, and variables, mean. However, this level of effort is not always critical for all data science efforts. For small-scale projects, such as ones where one might generate simulated datasets, this level of curation is unnecessary. Here, making data available in a transient location like GitHub or a website is sufficient.
2. Include a README file with your data archive. README files have a long history in software (https://medium.com/@NSomar/readme-md-history-and-components-a365aff07f10) and are named so that, on ASCII systems where capital letters sort first, the file shows at the top of a directory, making it the obvious place to put relevant information. In the context of datasets, READMEs are particularly useful when there are no reliable standards. The best practice here is to have one README per dataset, with names that easily associate with the corresponding data files. In addition to metadata about the data such as variable names, units of measurement, and code books, the README should also contain information on licenses, authors/copyright, a title, and the dates and locations where the data were collected. Additionally, keywords, author identifiers such as ORCID, and funder information would be useful.
3. Provide a data dictionary, describing the variable names, labels, codes, and special values for missing data (see Section 2.2).

4. Provide a machine readable format for your data. When possible, provide machine readable metadata that map on to the open standards of schema.org and JSON-LD. These metadata provide all of the information described in the README best practices (rule 2), but in a machine readable way, and include much of the same information such as name, description, spatial/temporal coverage, etc. One way to test if your metadata are machine readable is to use Google’s structured data testing tool (https://search.google.com/structured-data/testing-tool/u/0/) to verify the integrity of the metadata.
5. Provide the data in its rawest form in a folder called “data-raw”. Keeping your data raw (also sometimes referred to as the sushi principle) is the safest way to ensure that your analysis can be reproduced from scratch. This approach not only lets you trace the provenance of any analysis, but it also ensures further use of the data that a derived dataset may prevent.
6. Provide the scripts used to clean data from its rawest form into tidy/analysis ready format. Raw data is often unusable without further processing (data munging or data cleaning) - the steps necessary to detect and clean inaccurate or incomplete records and missing values from a dataset. While datasets can be cleaned interactively, this approach is often very difficult to reproduce. It is a better practice to use scripts to batch process data cleaning. Scripts can be verified, version controlled, and rerun without much overhead when mistakes are uncovered. These scripts describe the steps taken to prepare the data, which helps explain and document any decisions taken during the cleaning phase. They should ideally operate on raw data stored in the “data-raw” folder (rule 5); a sketch of such a script is given after this list.
7. Keep additional copies in more accessible locations. Even if you archive the data somewhere long-term, keep a second copy (or additional copies) in locations that are more readily and easily accessible for data analysis workflows. These include services like GitHub, GitHub LFS, and others, where fast content delivery networks (CDNs) make access and reuse practical. In the event that these services shut down or become unavailable, a long-term archival copy can still be accessed from a permanent repository such as Zenodo or Dryad and populated into a new data hosting service or technology.
8. Use a hash function like an MD5 checksum to ensure that the data you deposited are the same as those being made available to others. Hash values such as MD5 are short values that can serve as digital signatures for large amounts of data. By sharing the hash in a README, authors can ensure that the data - particularly the version of the data being used by the reader - is the same (see the sketch after this list).

9. Only add a citation if your data has a DOI. A citation only makes sense when your data has a DataCite compliant DOI, which is automatically provided when data is published in repositories like Zenodo and Dryad. While a citation may not accrue references, without a DOI it is guaranteed not to.
10. Stick with simple data formats. To ensure the long-term usefulness of datasets in a way that is not tied to transient technologies (such as ever changing spreadsheet standards), store the data as plain text (e.g., CSV) where possible. This can take the form of comma- or tab-separated files in most cases. This will ensure transparency, and future proof your data.
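As a sketch of rules 6 and 8 in practice (file and column names are illustrative):

```r
# data-raw/01-tidy-teaching.R: batch clean the raw data (rule 6)
library(readr)
library(dplyr)

raw <- read_csv("data-raw/teaching-raw.csv")

teaching <- raw %>%
  rename(job_position = Position) %>%            # short, descriptive names
  mutate(date_started = as.Date(date_started))   # consistent YYYY-MM-DD dates

write_csv(teaching, "data/teaching.csv")

# Record an MD5 checksum to share in the README (rule 8)
tools::md5sum("data/teaching.csv")
```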
10 Conclusions
The open science literature has been advocating the importance of data sharing for a long time. Many of these articles appeal to the broader benefits of data sharing, but rarely talk about the practical considerations of making data available alongside code. Trivial reproducibility has always relied upon the idea that as long as code exists on a platform (e.g., GitHub), and clearly lists open source dependencies, one should be able to run the code at a future time. Containerization efforts such as Docker have made it even easier to capture system dependencies, and rapidly launch containers. But true reproducibility requires not just code, but also data.

Over the years researchers have made data available in a multitude of unreliable ways, including offering downloads from university websites, FTP repositories, and password protected servers, all of which are prone to bit rot. In rare cases where the data are deposited in a permanent archive, insufficient metadata and missing data processing scripts have made it harder and more time intensive to map these to code. Our goal here is to describe a variety of ways a researcher can ship data, analogous to how easy it has become to push code to services like GitHub. The choice of technology and method depends on the nature of the data, and its size and complexity. Once the issue of access is solved, we also discuss how much metadata to include in order to make the data useful. Ultimately there is a tradeoff to consider on a case by case basis. Long-term studies with complex protocols require structured metadata, while more standardized datasets such as those coming from sensors require far less effort.

The key message for the reader is that accessing data for reproducibility should come with minimal barriers. Having users jump through data usage forms or other barriers is enough of a roadblock to dissuade users, which over time will make it hard to track the provenance of critical studies. For data with broad impact, one should further invest effort in documenting the metadata (either unstructured or structured depending on the complexity) and also focus on ensuring that at least one archival copy exists in a permanent archive. Following this continuum can ensure that more and more computational work becomes readily reproducible without unreasonable effort.
11 Acknowledgements

• Miles McBain
• Anna Krystalli
• Daniella Lowenberg
• ACEMS International Mobility Programme
• Helmsley Charitable Trust, Gordon and Betty Moore Foundation, Sloan Foundation.
Literature Cited
Arslan, Ruben C. 2019. “How to Automatically Document Data with the Codebook Package to Facilitate Data Reuse.” Advances in Methods and Practices in Psychological Science.

Barnes, Nick. 2010. “Publish Your Computer Code: It Is Good Enough.” Nature 467 (7317): 753. https://doi.org/10.1038/467753a.

Benson, Dennis A, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and David L Wheeler. 2005. “GenBank.” Nucleic Acids Research 33 (Database issue): D34–8.

Bezanson, Jeff, Alan Edelman, Stefan Karpinski, and Viral B Shah. 2017. “Julia: A Fresh Approach to Numerical Computing.” SIAM Review 59 (1): 65–98. https://doi.org/10.1137/141000671.

Blanton, Michael R, Matthew A Bershady, Bela Abolfathi, Franco D Albareti, Carlos Allende Prieto, Andres Almeida, Javier Alonso-García, et al. 2017. “Sloan Digital Sky Survey IV: Mapping the Milky Way, Nearby Galaxies, and the Distant Universe,” February. http://arxiv.org/abs/1703.00052.

Boettiger, Carl. 2018. Arkdb: Archive and Unarchive Databases Using Flat Files. https://CRAN.R-project.org/package=arkdb.

———. 2015. “An Introduction to Docker for Reproducible Research.” ACM SIGOPS Operating Systems Review 49 (1): 71–79. https://doi.org/10.1145/2723872.2723882.

Broman, Karl W., and Kara H. Woo. 2017. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.

Brooke Anderson, G, and Dirk Eddelbuettel. 2017. “Hosting Data Packages via Drat: A Case Study with Hurricane Exposure Data.” The R Journal.

Castelvecchi, Davide. 2018. “Google Unveils Search Engine for Open Data.” Nature 561 (7722): 161–62.

“CC0 Full License.” n.d. https://creativecommons.org/publicdomain/zero/1.0/legalcode.

“CC0 Short Guide.” n.d. https://creativecommons.org/publicdomain/zero/1.0/.

“CCBY Full License.” n.d. https://creativecommons.org/licenses/by/4.0/legalcode.

“CCBY Short Guide.” n.d. https://creativecommons.org/licenses/by/4.0/.

Cherinka, Brian, Brett H Andrews, José Sánchez-Gallego, Joel Brownstein, María Argudo-Fernández, Michael Blanton, Kevin Bundy, et al. 2018. “Marvin: A Toolkit for Streamlined Access and Visualization of the SDSS-IV MaNGA Data Set,” December. http://arxiv.org/abs/1812.03833.

“Choose-Cc0.” n.d. https://creativecommons.org/choose/zero/.

Dataspice: Create Lightweight Schema.org Descriptions of Dataset. 2018. https://github.com/ropenscilabs/dataspice.

Donoho, David. 2017. “50 Years of Data Science.” Journal of Computational and Graphical Statistics 26 (4): 745–66.

“Dryad.” n.d. https://datadryad.org/.

“Ecological Metadata Language (EML).” n.d. Accessed December 18, 2018. https://knb.ecoinformatics.org/.

Ellis, Shannon E., and Jeffrey T. Leek. 2017. “How to Share Data for Collaboration.” The American Statistician 72 (1): 53–57. https://doi.org/10.1080/00031305.2017.1375987.

Fowler, Dan, Jo Barratt, and Paul Walsh. 2017. “Frictionless Data: Making Research Data Quality Visible.” IJDC 12 (2): 274–85.

Gentleman, Robert C, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5 (10): R80.

Gentleman, Robert, and Duncan Temple Lang. 2007. “Statistical Analyses and Reproducible Research.” Journal of Computational and Graphical Statistics 16 (1): 1–23.

“Git.” n.d. https://git-scm.com/about.

“Github.” n.d. https://github.com/.

Greg Finak. 2019. DataPackageR: Construct Reproducible Analytic Data Sets as R Packages. https://doi.org/10.5281/zenodo.2620378.

“Julia-Pkgman.” n.d. https://pkg.julialang.org/.

Jupyter. 2018. “Binder 2.0 - Reproducible, Interactive, Sharable Environments for Science at Scale.” Proceedings of the 17th Python in Science Conference. https://doi.org/10.25080/Majora-4af1f417-011.

Kirk, Rebecca, and Larry Norton. 2019. “Supporting Data Sharing.” NPJ Breast Cancer.

McKiernan, Erin C., et al. 2016. “How Open Science Helps Researchers Succeed.” eLife 5: e16800.

Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–7. https://doi.org/10.1126/science.1213847.

“Piggyback.” n.d. https://github.com/ropensci/piggyback.

“PLOS Computational Biology.” n.d. Public Library of Science. https://journals.plos.org/ploscompbiol/s/data-availability.

Poisot, Timothée, Anne Bruneau, Andrew Gonzalez, Dominique Gravel, and Pedro Peres-Neto. 2019. “Ecological Data Should Not Be so Hard to Find and Reuse.” Trends in Ecology & Evolution 34 (6): 494–96. https://doi.org/10.1016/j.tree.2019.04.005.

Popkin, Gabriel. 2019. “Data Sharing and How It Can Benefit Your Scientific Career.” Nature.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ram, Karthik. 2013. “Git Can Facilitate Greater Reproducibility and Increased Transparency in Science.” Source Code for Biology and Medicine 8 (1): 7.

Rowhani-Farid, Anisa, and Adrian G. Barnett. 2016. “Has Open Data Arrived at the British Medical Journal (BMJ)? An Observational Study.” BMJ Open.

Spies, Jeffrey R., et al. 2012. Open Science Collaboration.

Stodden, Victoria, Jennifer Seiler, and Zhaokun Ma. 2018. “An Empirical Analysis of Journal Policy Effectiveness for Computational Reproducibility.” Proceedings of the National Academy of Sciences of the United States of America 115 (11): 2584–9.

“Tidy-Tuesday-Incarcerate.” n.d. https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-22.

White, Ethan P, Elita Baldridge, Zachary T Brym, Kenneth J Locey, Daniel J McGlinn, and Sarah R Supp. 2013. “Nine Simple Ways to Make It Easier to (Re)use Your Data.” Ideas in Ecology and Evolution.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

———. 2018. Nycflights13: Flights That Departed NYC in 2013. https://CRAN.R-project.org/package=nycflights13.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data.