DataFed: Towards Reproducible Research via Federated Data Management
Dale Stansberry*, Suhas Somnath, Jessica Breet, Gregory Shutt, Mallikarjun Shankar
This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Submitted to the International Conference on Computational Science and Computational Intelligence (CSCI'19), held at Las Vegas, NV, USA on Dec 05-07, 2019.
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
*Email: [email protected]
Abstract—The increasingly collaborative, globalized nature of scientific research combined with the need to share data and the explosion in data volumes present an urgent need for a scientific data management system (SDMS). An SDMS presents a logical and holistic view of data that greatly simplifies and empowers data organization, curation, searching, sharing, dissemination, etc. We present DataFed - a lightweight, distributed SDMS that spans a federation of storage systems within a loosely-coupled network of scientific facilities. Unlike existing SDMS offerings, DataFed uses high-performance and scalable user management and data transfer technologies that simplify deployment, maintenance, and expansion of DataFed. DataFed provides web-based and command-line interfaces to manage data and integrate with complex scientific workflows. DataFed represents a step towards reproducible scientific research by enabling reliable staging of the correct data at the desired environment.
Index Terms—scientific data management system, federated identity management, Globus, FAIR data principles, cross-facility
I. INTRODUCTION
Several scientific domains are experiencing an explosion in the volume, variety, veracity, and velocity of data owing to increased automation, increased computational power, and faster, higher resolution sensors and detectors in scientific instruments [1], [2]. At the same time, research is becoming ever more globalized, collaborative, and multidisciplinary, and there is an increasing need to publish the supporting datasets behind research findings [3]. Furthermore, scientific discovery using data analytics techniques like machine learning (ML) and artificial intelligence (AI) requires large volumes of high quality and well organized data. Prior research has shown that as much as 50-80% of the time in most scientific research projects is spent on data management and wrangling, and this fraction is expected to rise [4], [5]. These factors are not only lowering scientific productivity but are also exacerbating the problem of poor reproducibility in science. The current state of the practice leads us to urgently seek a way to manage the lifecycle of data with an effective Scientific Data Management System (SDMS) [6], and to use the SDMS as an essential component of the scientific process.

We require an SDMS to greatly alleviate common data handling and management challenges associated with large and/or complex corpora of data by providing an intuitive and holistic view of data and associated metadata, supported by powerful but user-friendly organization, collaboration, and search/discovery capabilities. Data should be precisely identified and tracked to prevent accidental mishandling or misuse and to more easily support sharing with collaborators.
Domain-specific metadata and provenance would need to be captured to provide additional context for data sets and to enable powerful AI-driven discoveries.

Many commercial and academic SDMSs are currently available and have been used for many years in production-oriented and research laboratories, respectively [7]–[10]. Commercial SDMS products typically support only a single organization or domain in a rigid manner and are oriented toward business processes rather than research. Academic, research-focused SDMS solutions tend to be tied to specific scientific domains, are often challenging to deploy, and use non-scalable technologies that limit the pervasiveness and widespread acceptance of such services - resulting in disjoint data silos [8], [11], [12].

In this paper we introduce DataFed - a general purpose SDMS tailored to scientific research with the overarching goals of eliminating data silos and enabling cross-organizational collaboration. DataFed was also designed to support the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles, which were proposed to facilitate open, collaborative, and reproducible scientific research [13]. DataFed achieves these goals by providing both services and software technologies that help to bridge multiple scientific domains and organizational boundaries. By efficiently supporting scalable big data, and by addressing the typical barriers to adoption such as complexity, rigidity, and administrative effort, DataFed aims to encourage cross-discipline and cross-domain data exchange. When deployed, DataFed forms a federation of geographically distributed research organizations, scientific facilities, and data repositories.
The resulting data overlay establishes a scalable, multi-domain, scientific data network that greatly simplifies the data management aspects of cross-facility scientific research.

Our primary contributions are: (1) we describe the system architecture and implementation of an SDMS that offers a uniform (non-file-system-silo) view on data management across facilities, and (2) we discuss specific benefits and use of DataFed to enable reproducible research over the data lifecycle. DataFed is available for use at https://datafed.ornl.gov. We proceed by first providing a high-level overview of DataFed in Section II, followed by the system architecture in Section III. Next, in Section IV, we discuss the benefits of using DataFed, and conclude with the future extensions planned for DataFed in Section V.

Fig. 1. A conceptual overview of DataFed

II. SYSTEM OVERVIEW
DataFed is both a service and a software framework that integrates distributed scientific data repositories with centralized metadata and data management services to produce a federated, cross-organizational SDMS. DataFed is an opt-in data management network, where member organizations connect locally-managed data repositories to the federation, yet retain full control of the underlying storage systems (i.e. allocations and data policies). Central DataFed services are used to track, manage, and coordinate access to all data within the network.

Figure 1 illustrates a high-level control-flow and data-flow view across different facilities using DataFed. Federating heterogeneous scientific facilities into a common data network meets the data access needs of complex cross-facility collaborations and workflows. Individual facilities may house one or more local data repositories, or may have none and rely on data in remote data repositories. Though member organizations may host various scientific resources such as experimental/observational instruments, or compute/analytics hardware, DataFed is only concerned with data access/management within and between these facilities - not with allocating the scientific resources themselves.

The Globus software suite [14], [15] has become the de facto large-scale data communication mechanism of the scientific community by providing secure, performant, and managed data transfer services (based on GridFTP [16]) coupled with modern scalable user management technologies. By leveraging these popular Globus services, DataFed easily expands into existing Globus-enabled organizations and scales out to support many users without creating the administrative bottlenecks that are common with older security models [17].

Unlike distributed file systems, DataFed presents a logical view of data rather than a physical storage path to a named file.
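This logical view can be sketched as a pair of synchronized structures: a location-independent record that clients handle, and an opaque raw-data object held by some repository. The field names below are illustrative assumptions, not DataFed's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RawDataObject:
    """Opaque payload ingested into a DataFed-managed repository."""
    repository: str   # which repository physically holds the bytes
    size_bytes: int
    checksum: str     # integrity check used when transferring data

@dataclass
class DataRecord:
    """Logical identity and metadata; the only handle clients ever use."""
    record_id: str                                # unique, location-independent ID
    title: str
    metadata: dict = field(default_factory=dict)  # domain-specific, schema-free
    raw: Optional[RawDataObject] = None           # synchronized raw-data counterpart

# A record exists (and can be searched, shared, and tracked) independently
# of where its raw data happens to be stored.
rec = DataRecord(record_id="d/12345", title="STEM scan 42",
                 metadata={"voltage_kV": 200})
rec.raw = RawDataObject(repository="repo/olcf", size_bytes=1_048_576,
                        checksum="sha256:abc...")
```

Because clients reference only `record_id`, the raw data can move between repositories without breaking any reference to it.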
Internally, DataFed manages data as two distinct yet synchronized parts: a raw data object and a data record. The raw data object can contain data in any format and is ingested from a source data file into a DataFed-managed data repository. The data record is used to uniquely identify the data, track administrative information, and store general- and domain-specific metadata. DataFed's client interfaces or application programming interfaces (APIs) must always be used to access or modify managed data.

DataFed provides users with a variety of clients and interfaces - enabling simple and uniform data access from any connected resource within the DataFed federation, regardless of physical storage location or organizational association. For example, a user can interactively identify and stage remote data to a local analytics resource for processing, or create scripts with DataFed APIs for batch processing and/or workflows on compute resources. DataFed can also be directly integrated into experimental facilities and/or data pipelines to automate data management. (We also employ a Data Gateway device that enables edge devices (scientific instruments) to submit data into the managed repositories of DataFed.)

III. SYSTEM ARCHITECTURE
The architecture of DataFed, shown in Figure 2, consists of a “Core Facility” housing central DataFed services that manage and coordinate remote DataFed data repositories housed within member-organization-managed units, such as experimental, compute, and analytics facilities. To enable large data transfers, DataFed requires that organizational facilities have a Globus endpoint (managed by the organization) connecting one or more local file systems to the DataFed network, and they must install a DataFed client for users to access and/or update remote data, as illustrated for the “Experimental Facility” in Figure 2. Optionally, a facility may house one or more local DataFed data repositories, which each require their own Globus endpoint (managed by DataFed), as demonstrated by both the “Observational Facility” and “Computational Facility” in Figure 2. DataFed services and clients communicate via a control protocol which is distinct from the GridFTP protocol. A user or process with sufficient permissions at one facility may access and/or update raw data housed within another facility's repository by requesting access through the Core services via a DataFed interface.
A. Core Services
The DataFed core facility includes three distinct server types: 1) control servers, 2) web servers, and 3) a database server. Control servers are primarily responsible for orchestrating activities within the system that involve more than one component, such as data transfers between repositories or managing concurrent data access. Control servers also act as a gateway to microservices running within the underlying database. Web servers interface with control servers and are responsible for serving the DataFed Web Portal - a modern web application providing a graphical interface to all DataFed features. Finally, the database houses all of the records and metadata that describe the state of the entire DataFed network - including users, organizations, projects, repository configuration, metadata, access controls, and more. The raw binary data is stored in remote data repositories and not in the database. Given the criticality of the core database, it should be deployed in a cluster configuration on highly resilient storage with a suitable data replication factor.

Fig. 2. DataFed System Architecture
B. Data Repository Servers
DataFed data repository servers are a companion service that, along with a dedicated Globus endpoint, connects facility-local data storage to the DataFed network. The organization administering the repository may use any type or configuration of data storage system to support the repository, so long as it is supported by Globus. The data repository software stack integrates with the associated Globus endpoint to route authorization requests to core services for all data access - enabling both fine-grained user authorization and concurrency controls.
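The routing of authorization decisions to core services can be sketched as follows. The `core_authorize` function, the permission table, and the identifier formats are assumptions for illustration only, not DataFed's actual protocol:

```python
# Sketch: a repository server holds no access policy of its own; it defers
# every data-access decision to the core services, which keep the
# authoritative access-control state for the whole federation.

PERMISSIONS = {("u/alice", "d/12345"): {"read", "write"},
               ("u/bob", "d/12345"): {"read"}}

def core_authorize(user_id: str, record_id: str, action: str) -> bool:
    """Stand-in for the core service's centralized access-control check."""
    return action in PERMISSIONS.get((user_id, record_id), set())

def repository_access(user_id: str, record_id: str, action: str) -> str:
    # The repository only enforces the decision returned by core services,
    # enabling fine-grained, centrally managed authorization.
    if not core_authorize(user_id, record_id, action):
        raise PermissionError(f"{user_id} may not {action} {record_id}")
    return f"{action} granted on {record_id}"

print(repository_access("u/alice", "d/12345", "write"))  # write granted on d/12345
```

Centralizing the decision this way means revoking a permission takes effect across every repository at once, with no per-facility policy synchronization.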
C. Client Interfaces
Users can interact with DataFed via a command-line interface (CLI) and a Web Portal. The CLI provides access to basic DataFed features for both interactive use and scripting via a Python 3 package that is available via the Python Package Index (PyPI) for all major operating systems. This Python package provides both high- and low-level APIs that can be used to build custom applications that directly interface with DataFed.

The DataFed Web Portal provides users a simple yet powerful interface to manage their data through the use of modern web standards. Graphical elements, contextual menus, and drag-and-drop interactions are used widely to simplify the interface and make operations intuitive. The focal point of the Web Portal is the Data Browser, which allows users to navigate through data, create and manage collections and projects, and share data with collaborators. Users can create data records, enter or edit metadata, view provenance graphs, transfer raw data, view publicly available data catalogs, and much more.
D. Authentication and Data Transfer
User account management based on certificates or encryption keys can become a significant administrative burden when managing a large number of users. The common approach of defining virtual organizations [18] based on these authentication technologies can quickly become untenable due to the need to synchronize users across all member organizations. A more modern and scalable approach to user authentication is federated identity management [17], where users link multiple organizational accounts into a single composite identity. This approach enables mutual secure access between disjoint organizations despite differences in user accounting systems and security policies.

As mentioned previously, DataFed is based on Globus' federated user authentication and data transfer services, consequently benefiting from the availability, scalability, and performance advantages of Globus. As a consequence, DataFed users must also have a Globus account linked with one or more scientific institutions (universities, national laboratories, etc.). DataFed uses internal Globus endpoints for its own data repositories and allows DataFed users to transfer data to/from external endpoints (either organizational or personal). DataFed manages and monitors all data transfers to or from DataFed data repositories while applying DataFed-specific concurrency controls and error handling policies.

IV. BENEFITS OF DATAFED FOR SCIENTIFIC RESEARCH
DataFed offers scientific researchers many of the benefits of production-oriented SDMSs, such as unique data identification and tracking, abstraction of physical data storage, metadata and provenance capture, powerful organization and search capabilities, and data sharing with fine-grained data access controls. DataFed is also domain-agnostic, easy to learn and use, and scales out across multiple organizations.

The logical representation of data as data records instead of files enables DataFed to offer a wide range of organizational possibilities beyond what is achievable through file systems alone, such as:

• Collections that offer a hierarchical, logical organization of data similar to file system directories, but are more flexible and descriptive. Collections can also be used to configure common access controls for contained records and support batch downloads.

• Dynamic Views of data that users can create by saving metadata-based queries. Future versions will also allow tags to be used for dynamic views.

• Published Collections that organize data published by a user to one or more DataFed catalogs.

• Personal Allocations that show data by repository to assist users with raw data storage monitoring and management.

When provenance information is captured, provenance graphs can be viewed from the web portal to allow users to dynamically explore related datasets, including those created by other users. Provenance information can be used to better understand complex interrelated data collections and serves as the basis for future DataFed data dissemination features.
A. Query-able Structured Metadata
DataFed data records support a common top-level metadata schema consisting of title, description, keywords, and references (i.e. provenance), in addition to administrative fields such as identifier, owner, alias, etc. DataFed does not impose schemas for the domain-specific structured metadata associated with data records. Users are free to define this metadata in any schema of their choosing and can use textual, numeric, Boolean, geospatial, and array value types.

The top-level metadata fields (textual values) are indexed by DataFed using a full-text search engine that supports root-word analysis. Domain-specific metadata is not indexed by default, but indexing can be enabled by DataFed administrators after considering the impact on query performance. Despite not being indexed, arbitrary metadata query expressions may be used to search within domain-specific fields. For example, users can search a materials data collection to find datasets on samples of specific compositions tested over a specific temperature range. Users can limit searches to all personally-owned data, data owned by a specific project, data contained in a specific collection, etc.
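The materials-search example above (a specific composition tested over a temperature range) can be sketched as a predicate evaluated against each record's schema-free metadata. The field names and record layout here are invented for illustration, not DataFed's query syntax:

```python
# Sketch: searching schema-free, domain-specific metadata with an
# arbitrary query expression (composition + temperature range).
records = [
    {"id": "d/1", "metadata": {"composition": "SrTiO3", "temp_K": 350}},
    {"id": "d/2", "metadata": {"composition": "SrTiO3", "temp_K": 900}},
    {"id": "d/3", "metadata": {"composition": "BaTiO3", "temp_K": 400}},
]

def search(records, predicate):
    """Evaluate a query expression against each record's metadata."""
    return [r["id"] for r in records if predicate(r["metadata"])]

# Find SrTiO3 samples tested between 300 K and 500 K.
hits = search(records, lambda m: m.get("composition") == "SrTiO3"
                                 and 300 <= m.get("temp_K", 0) <= 500)
print(hits)  # ['d/1']
```

Because no schema is imposed, the predicate must tolerate missing fields, which is why `m.get(...)` is used rather than direct key access.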
B. Collective Data Ownership with Projects
DataFed offers collective ownership of data through the creation and use of DataFed Projects. These projects typically reflect real-world research being conducted by a team of researchers with one or more principal investigators (PIs). Projects are typically owned by a PI, who can add other DataFed users to the project and define project structure and access privileges for members. Since projects have their own repository allocations, project data does not count against members' personal storage allocation(s). Furthermore, projects enable joint administration of data by PIs and project members.
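The allocation accounting described above can be sketched as charging storage to whichever entity owns a record, so project-owned data never touches a member's personal quota. The identifiers and structure are illustrative assumptions only:

```python
# Sketch: storage accounting by record owner. Records owned by a project
# are charged to the project's repository allocation, not to the
# uploading member's personal allocation.
allocations = {"u/alice": 0, "p/neutron-study": 0}  # bytes used per owner

def ingest(owner: str, size_bytes: int) -> None:
    """Charge raw-data storage to whichever entity owns the record."""
    allocations[owner] += size_bytes

ingest("u/alice", 500)            # personal record: Alice's allocation
ingest("p/neutron-study", 2000)   # project record: project's allocation
print(allocations)  # {'u/alice': 500, 'p/neutron-study': 2000}
```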
C. Scripting, Batch Processing, and Workflow Support
The DataFed Python client package provides both high- and low-level APIs that support non-interactive use cases such as custom applications, batch processing, integration with computational workflows, and full data pipeline automation. Running any application that requires user authentication in a non-interactive scenario requires local security credentials to be available to the application. DataFed greatly simplifies the process of installing such credentials (issuing a single CLI command) compared to the common approach of manually generating and installing certificates or encryption key pairs.

DataFed APIs can be used to ingest data and metadata generated by scientific instruments, stage data in compute or analytics environments, ingest datasets resulting from processing steps in complex workflows, capture provenance based on input-output relationships in processing steps, etc.
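The provenance-capture pattern described above can be sketched as follows: each workflow step ingests its output as a new record whose provenance lists the records it consumed, building a lineage graph step by step. The record structure and the `derived_from` field are illustrative assumptions, not DataFed's API:

```python
# Sketch: building a provenance (lineage) graph during a workflow.
registry = {}  # record_id -> record
counter = 0

def ingest_record(title, metadata, derived_from=()):
    """Register a new record, capturing which records it was derived from."""
    global counter
    counter += 1
    rid = f"d/{counter}"
    registry[rid] = {"title": title, "metadata": metadata,
                     "derived_from": list(derived_from)}
    return rid

# Step 1: raw data arrives from an instrument.
raw = ingest_record("raw scan", {"instrument": "STEM-01"})

# Step 2: a processing step records which inputs produced its output,
# along with the processing parameters, enabling later reproduction.
filtered = ingest_record("filtered scan",
                         {"algorithm": "median", "kernel": 3},
                         derived_from=[raw])
print(registry[filtered]["derived_from"])  # ['d/1']
```

Walking `derived_from` links backwards from any record recovers the full chain of inputs and parameters that produced it.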
D. Simplified and Scalable System Administration
As mentioned earlier, DataFed uses Globus's federated identity management, which has already been adopted by many research organizations. Therefore, DataFed does not add any additional user administrative burden to these organizations. At the same time, DataFed users benefit by only needing to sign on once to Globus via their regular institutional credentials.
E. Pre-publication Data Catalogs
DataFed provides a pre-publication data catalog feature that allows users and projects to advertise publicly accessible collections of living data to the wider DataFed community. These catalogs are organized using a community-driven taxonomy which permits users to browse by topic (e.g. scientific domain) and search data records within topics using keyword, phrase, and metadata expressions.

V. EXAMPLE APPLICATIONS

• Data Ingestion - Data portals or pipelines are often used to capture raw data generated by scientific instruments and move it to a centralized data repository [19]. DataFed can be incorporated into such data pipelines to capture operational (user name, project id, etc.) and domain-specific (experiment description, sample identifier and description, instrument calibration, etc.) metadata that simplify organization, discovery, and sharing of data.

• User Facility Administration - The DataFed Projects construct enables staff at user facilities, which often support hundreds of visiting researchers a year, to easily manage the data for users while allowing those users to control their data once their research is concluded (for example, moving data back to a home institute data repository).

• Data Analytics - The association of rich domain-specific metadata with raw data in DataFed data records can significantly alleviate the data wrangling necessary to create collections of examples that can be used to train AI models. Users could rapidly develop, iterate over, and improve their models by searching for applicable data collections using search queries or tags in DataFed, placing the collection in their compute environment regardless of its physical location, and applying the model to the data collection. ML and AI agents can be deployed to automatically extract features, identify anomalies, and embed such knowledge as metadata in DataFed data records.

• Schema Support - DataFed can serve as a global collaborative platform for scientific communities to come together and develop schemas for their domain. Deep learning and natural language processing can be applied to facilitate mapping of metadata from one schema to another.
• Provenance Capture in Workflows - In addition to capturing the resulting data assets created from a series of data processing steps in a workflow, DataFed can also capture rich provenance information such as the relationships between the data created and consumed at each step, including the processing parameters, algorithms/software used for processing, etc. The presence of such transparent and rich provenance information would not only enable other users to repeat or replicate data processing, but would also give downstream users additional confidence in such datasets, and help peer reviewers of articles or datasets to vet them.

• Scientific Companion - The scientific community has frequently expressed a desire for a digital companion that can assist researchers in planning and steering experiments. Such a companion could mine metadata, provenance, and raw data present in DataFed to avoid unnecessary repetition of experiments, show results from similar experiments, suggest unexplored parameter combinations, etc.

VI. DEVELOPMENT ROADMAP
The following list highlights SDMS features that enhance scalability, reliability, and usability:

• High Availability and Reliability - Services are required to be resilient, prevent data loss, recover from failures, and preserve data integrity. In DataFed, this effort is primarily focused on clustered core and database services.

• Schema Support - DataFed will permit users/communities to define metadata schemas to both validate records at the time of creation and to support domain-specific search wizards that will generate graphical input forms based on a specified schema. This capability will facilitate the evolution of schemas and may potentially be followed by the ability to map fields between overlapping schemas.

• Data Dissemination and Bi-directional Annotations - Much like journal articles, datasets too can become outdated, or be found to be incomplete or erroneous. In such cases, downstream users producing derived data products of such datasets need to be informed of upstream changes to the data record status. Conversely, a downstream user may discover an issue with an upstream data record and wish to notify the associated owners about the issue. DataFed modifications will allow users to “subscribe” to “data events” associated with records or collections. Bi-directional annotations will be added to DataFed to permit up- or down-stream users to attach informational, warning, and/or error notifications to records and collections.

• Publication Support - DataFed will allow data records to be represented using Digital Object Identifier (DOI) numbers with links to raw data curated in data repositories external to DataFed. DataFed interfaces could still be used to access such data records from within the DataFed network. Furthermore, DataFed would facilitate far more nuanced searches than typically possible in external data repositories, given the availability of rich domain-specific metadata.
• Data Caching and Replication - DataFed will automatically cache transient copies of remote raw data within high-use facilities based on policies and/or user preferences to improve system efficiency. A data replication feature will also be added to allow users to create redundant, but synchronized, copies of raw data on separate data repositories for increased data safety.

• Tags and Labels - The attachment of tags and labels to data records will offer additional organizational options such as dynamic views based on tags and faceted searches. Tagged datasets will also augment the collections of examples (with labeled ground truth) for training AI models.

• Improved Batch Import - The existing batch import feature will be enhanced to simplify the process of importing large preexisting data collections into DataFed.

• Multimedia Attachments - DataFed will allow small multimedia assets, such as thumbnail images, to be attached to data records to enhance browsing of data catalogs.

• Metrics - Data records and collections will support community-driven metrics such as numbers of downloads or subscribers, ratings, etc. to highlight the quality of datasets.

VII. CONCLUSIONS
We present DataFed - a general purpose and domain-agnostic SDMS aimed at significantly alleviating the burden of data management to improve scientific productivity, lower barriers to cross-institutional and collaborative research, and facilitate data-driven scientific discovery. DataFed provides users with a logical view of data that abstracts the physical data storage and facilitates the capture and enrichment of metadata (operational and domain-specific) associated with the raw data. Users of DataFed benefit from provenance capture, powerful organization and search capabilities, fine-grained access control for sharing data with collaborators, unique identifiers for data records, etc.

DataFed has been designed in compliance with FAIR data principles for facilitating open, collaborative, and repeatable research. By using DataFed to manage both data and execution contexts captured within software containers [20], DataFed represents an important step towards reproducibility in open scientific research by enabling users to easily, correctly, repeatably, and reliably work with datasets within appropriate compute or analytic contexts.

DataFed's use of popular and scalable user authentication and data transfer services overcomes limitations of prior SDMSs and simplifies deployment, maintenance, and expansion of the DataFed federation. Besides lowering barriers to adoption and deployment, DataFed also provides a Web Portal for user-friendly management of data as well as a command line interface for integration with complex workflows or for creating custom applications. We are continually improving the resilience, scalability, performance, and capabilities of DataFed. Interested readers are welcome to use DataFed at https://datafed.ornl.gov.

ACKNOWLEDGMENT
This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) and of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

REFERENCES

[1] J. Blair, R. S. Canon, J. Deslippe, A. Essiari, A. Hexemer, A. A. MacDowell, D. Y. Parkinson, S. J. Patton, L. Ramakrishnan, N. Tamura et al., “High performance data management and analysis for tomography,” in Developments in X-Ray Tomography IX, vol. 9212. International Society for Optics and Photonics, 2014, p. 92121G.
[2] S. V. Kalinin, E. Strelcov, A. Belianinov, S. Somnath, R. K. Vasudevan, E. J. Lingerfelt, R. K. Archibald, C. Chen, R. Proksch, N. Laanait et al., “Big, deep, and smart data in scanning probe microscopy,” ACS Nano, pp. 9068–9086, 2016.
[3] B. A. Nosek, G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, G. Christensen et al., “Promoting an open research culture,” Science, vol. 348, no. 6242, pp. 1422–1425, 2015.
[4] K. Marder, A. Patera, A. Astolfo, M. Schneider, B. Weber, and M. Stampanoni, “Investigating the microvessel architecture of the mouse brain: An approach for measuring, stitching, and analyzing 50 teravoxels of data,” AIP, July 2015, p. 73. [Online]. Available: https://rawgit.com/4Quant/SRI2015/master/SRIPres.html
[5] EDBT, vol. 16, 2016, pp. 473–478.
[6] R. Moore, “Data management systems for scientific applications,” in Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software. Deventer, The Netherlands: Kluwer, B.V., 2001, pp. 273–284. [Online]. Available: http://dl.acm.org/citation.cfm?id=647102.717555
[7] C. Quintero, K. Tran, and A. A. Szewczak, “High-throughput quality control of DMSO acoustic dispensing using photometric dye methods,” Journal of Laboratory Automation, vol. 18, no. 4, pp. 296–305, 2013.
[8] L. Marini, I. Gutierrez-Polo, R. Kooper, S. P. Satheesan, M. Burnette, J. Lee, T. Nicholson, Y. Zhao, and K. McHenry, “Clowder: Open source data management for long tail data,” in Proceedings of the Practice and Experience on Advanced Research Computing. ACM, 2018, p. 40.
[9] C. Allan, J.-M. Burel, J. Moore, C. Blackburn, M. Linkert, S. Loynton, D. MacDonald, W. J. Moore, C. Neves, A. Patterson et al., “OMERO: flexible, model-driven data management for experimental biology,” Nature Methods, vol. 9, no. 3, p. 245, 2012.
[10] A. P. Arkin, R. L. Stevens, R. W. Cottingham, S. Maslov, C. S. Henry, P. Dehal, D. Ware, F. Perez, N. L. Harris, S. Canon et al., “The DOE systems biology knowledgebase (KBase),” BioRxiv, p. 096354, 2016.
[11] V. Garonne, R. Vigne, G. Stewart, M. Barisits, M. Lassnig, C. Serfon, L. Goossens, A. Nairz, ATLAS Collaboration et al., “Rucio - the next generation of large scale distributed system for ATLAS data management,” in Journal of Physics: Conference Series, vol. 513, no. 4. IOP Publishing, 2014, p. 042021.
[12] A. Rajasekar, R. Moore, and F. Vernon, “iRODS: A distributed data management cyberinfrastructure for observatories,” in AGU Fall Meeting Abstracts, 2007.
[13] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne et al., “The FAIR guiding principles for scientific data management and stewardship,” Scientific Data, vol. 3, 2016.
[14] B. Allen, R. Ananthakrishnan, K. Chard, I. Foster, R. Madduri, J. Pruyne, S. Rosen, and S. Tuecke, “Globus: A case study in software as a service for scientists,” in Proceedings of the 8th Workshop on Scientific Cloud Computing, ser. ScienceCloud ’17. New York, NY, USA: ACM, 2017, pp. 25–32. [Online]. Available: http://doi.acm.org/10.1145/3086567.3086570
[15] K. Chard, J. Pruyne, B. Blaiszik, R. Ananthakrishnan, S. Tuecke, and I. Foster, “Globus data publication as a service: Lowering barriers to reproducible science,” IEEE, 2015, pp. 401–410.
[16] W. Allcock, “GridFTP: Protocol extensions to FTP for the grid,” 2003.
[17] S. S. Y. Shim, G. Bhalla, and V. Pendyala, “Federated identity management,” Computer, vol. 38, no. 12, pp. 120–122, Dec 2005.
[18] A. Mowshowitz et al., “Virtual organization,” Communications of the ACM, vol. 40, no. 9, pp. 30–37, 1997.
[19] S. Androulakis, “MyTardis and TARDIS: Managing the lifecycle of data from generation to publication,” in eResearch Australasia 2010, 2010.
[20] G. M. Kurtzer, V. Sochat, and M. W. Bauer, “Singularity: Scientific containers for mobility of compute,” PLoS ONE, vol. 12, no. 5, p. e0177459, 2017.