DataFed: Towards Reproducible Research via Federated Data Management
Dale Stansberry*, Suhas Somnath, Jessica Breet, Gregory Shutt, Mallikarjun Shankar
This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Submitted to the International Conference on Computational Science and Computational Intelligence (CSCI'19), held at Las Vegas, NV, USA on Dec 05-07, 2019.
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
*Email: [email protected]
Abstract—The increasingly collaborative, globalized nature of scientific research combined with the need to share data and the explosion in data volumes present an urgent need for a scientific data management system (SDMS). An SDMS presents a logical and holistic view of data that greatly simplifies and empowers data organization, curation, searching, sharing, dissemination, etc. We present DataFed - a lightweight, distributed SDMS that spans a federation of storage systems within a loosely-coupled network of scientific facilities. Unlike existing SDMS offerings, DataFed uses high-performance and scalable user management and data transfer technologies that simplify deployment, maintenance, and expansion of DataFed. DataFed provides web-based and command-line interfaces to manage data and integrate with complex scientific workflows. DataFed represents a step towards reproducible scientific research by enabling reliable staging of the correct data at the desired environment.
Index Terms—scientific data management system, federated identity management, Globus, FAIR data principles, cross-facility
I. INTRODUCTION
Several scientific domains are experiencing an explosion in the volume, variety, veracity, and velocity of data owing to increased automation, increased computational power, and faster, higher resolution sensors and detectors in scientific instruments [1], [2]. At the same time, research is becoming ever more globalized, collaborative, and multidisciplinary, and there is an increasing need to publish the supporting datasets behind research findings [3]. Furthermore, scientific discovery using data analytics techniques like machine learning (ML) and artificial intelligence (AI) requires large volumes of high quality and well organized data. Prior research has shown that as much as 50-80% of the time in most scientific research projects is spent on data management and wrangling, and this fraction is expected to rise [4], [5]. These factors are not only lowering scientific productivity but are also exacerbating the problem of poor reproducibility in science. The current state of the practice leads us to urgently seek a way to manage the lifecycle of data with an effective Scientific Data Management System (SDMS) [6], and to use the SDMS as an essential component of the scientific process.

We require an SDMS to greatly alleviate common data handling and management challenges associated with large and/or complex corpora of data by providing an intuitive and holistic view of data and associated metadata, supported by powerful but user-friendly organization, collaboration, and search/discovery capabilities. Data should be precisely identified and tracked to prevent accidental mishandling or misuse and to more easily support sharing with collaborators.
Domain-specific metadata and provenance would need to be captured to provide additional context for data sets and to enable powerful AI-driven discoveries.

Many commercial and academic SDMSs are currently available and have been used for many years in production-oriented and research laboratories, respectively [7]–[10]. Commercial SDMS products typically support only a single organization or domain in a rigid manner and are oriented toward business processes rather than research. Academic, research-focused SDMS solutions tend to be tied to specific scientific domains, are often challenging to deploy, and use non-scalable technologies that limit the pervasiveness and widespread acceptance of such services - resulting in disjoint data silos [8], [11], [12].

In this paper we introduce DataFed - a general purpose SDMS tailored to scientific research with the overarching goals of eliminating data silos and enabling cross-organizational collaboration. DataFed was also designed to support the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles, which were proposed to facilitate open, collaborative, and reproducible scientific research [13]. DataFed achieves these goals by providing both services and software technologies that help to bridge multiple scientific domains and organizational boundaries. By efficiently supporting scalable big data, and by addressing the typical barriers to adoption such as complexity, rigidity, and administrative effort, DataFed aims to encourage cross-discipline and cross-domain data exchange. When deployed, DataFed forms a federation of geographically distributed research organizations, scientific facilities, and data repositories.
The resulting data overlay establishes a scalable, multi-domain, scientific data network that greatly simplifies the data management aspects of cross-facility scientific research.

Our primary contributions are: (1) we describe the system architecture and implementation of an SDMS that offers a uniform (non-file-system-silo) view on data management across facilities, and (2) we discuss specific benefits and use of DataFed to enable reproducible research over the data lifecycle. DataFed is available for use at https://datafed.ornl.gov. We proceed by first providing a high-level overview of DataFed in Section II, followed by the system architecture in Section III. Next, in Section IV, we discuss the benefits of using DataFed, and conclude with the future extensions planned for DataFed in Section V.

Fig. 1. A conceptual overview of DataFed

II. SYSTEM OVERVIEW
DataFed is both a service and a software framework that integrates distributed scientific data repositories with centralized metadata and data management services to produce a federated, cross-organizational SDMS. DataFed is an opt-in data management network, where member organizations connect locally-managed data repositories to the federation, yet retain full control of the underlying storage systems (i.e. allocations and data policies). Central DataFed services are used to track, manage, and coordinate access to all data within the network.

Figure 1 illustrates a high-level control-flow and data-flow view across different facilities using DataFed. Federating heterogeneous scientific facilities into a common data network meets the data access needs of complex cross-facility collaborations and workflows. Individual facilities may house one or more local data repositories, or may have none and rely on data in remote data repositories. Though member organizations may host various scientific resources such as experimental/observational instruments, or compute/analytics hardware, DataFed is only concerned with data access/management within and between these facilities - not with allocating the scientific resources themselves.

The Globus software suite [14], [15] has become the de facto large-scale data communication mechanism of the scientific community by providing secure, performant, and managed data transfer services (based on GridFTP [16]) coupled with modern scalable user management technologies. By leveraging these popular Globus services, DataFed easily expands into existing Globus-enabled organizations and scales out to support many users without creating the administrative bottlenecks that are common with older security models [17].

Unlike distributed file systems, DataFed presents a logical view of data rather than a physical storage path to a named file.
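This logical view can be sketched as a pair of synchronized structures: a location-independent record that clients handle, and an opaque raw-data object held by some repository. The field names below are illustrative assumptions, not DataFed's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RawDataObject:
    """Opaque payload ingested into a DataFed-managed repository."""
    repository: str   # which repository physically holds the bytes
    size_bytes: int
    checksum: str     # integrity check used when transferring data

@dataclass
class DataRecord:
    """Logical identity and metadata; the only handle clients ever use."""
    record_id: str                                # unique, location-independent ID
    title: str
    metadata: dict = field(default_factory=dict)  # domain-specific, schema-free
    raw: Optional[RawDataObject] = None           # synchronized raw-data counterpart

# A record exists (and can be searched, shared, and tracked) independently
# of where its raw data happens to be stored.
rec = DataRecord(record_id="d/12345", title="STEM scan 42",
                 metadata={"voltage_kV": 200})
rec.raw = RawDataObject(repository="repo/olcf", size_bytes=1_048_576,
                        checksum="sha256:abc...")
```

Because clients reference only `record_id`, the raw data can move between repositories without breaking any reference to it.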
Internally, DataFed manages data as two distinct yet synchronized parts: a raw data object and a data record. The raw data object can contain data in any format and is ingested from a source data file into a DataFed-managed data repository. The data record is used to uniquely identify the data, track administrative information, and store general- and domain-specific metadata. DataFed's client interfaces or application programming interfaces (APIs) must always be used to access or modify managed data.

DataFed provides users with a variety of clients and interfaces - enabling simple and uniform data access from any connected resource within the DataFed federation, regardless of physical storage location or organizational association. For example, a user can interactively identify and stage remote data to a local analytics resource for processing, or create scripts with DataFed APIs for batch processing and/or workflows on compute resources. DataFed can also be directly integrated into experimental facilities and/or data pipelines to automate data management. (We also employ a Data Gateway device that enables edge devices (scientific instruments) to submit data into the managed repositories of DataFed.)

III. SYSTEM ARCHITECTURE
The architecture of DataFed, shown in Figure 2, consists of a “Core Facility” housing central DataFed services that manage and coordinate remote DataFed data repositories housed within member-organization-managed units, such as experimental, compute, and analytics facilities. To enable large data transfers, DataFed requires that organizational facilities have a Globus endpoint (managed by the organization) connecting one or more local file systems to the DataFed network, and they must install a DataFed client for users to access and/or update remote data, as illustrated for the “Experimental Facility” in Figure 2. Optionally, a facility may house one or more local DataFed data repositories, which each require their own Globus endpoint (managed by DataFed), as demonstrated by both the “Observational Facility” and “Computational Facility” in Figure 2. DataFed services and clients communicate via a control protocol which is distinct from the GridFTP protocol. A user or process with sufficient permissions at one facility may access and/or update raw data housed within another facility's repository by requesting access through the Core services via a DataFed interface.
A. Core Services
The DataFed core facility includes three distinct server types: 1) control servers, 2) web servers, and 3) a database server. Control servers are primarily responsible for orchestrating activities within the system that involve more than one component, such as data transfers between repositories or managing concurrent data access. Control servers also act as a gateway to microservices running within the underlying database. Web servers interface with control servers and are responsible for serving the DataFed Web Portal - a modern web application providing a graphical interface to all DataFed features. Finally, the database houses all of the records and metadata that describe the state of the entire DataFed network - including users, organizations, projects, repository configuration, metadata, access controls, and more. The raw binary data is stored in remote data repositories and not in the database. Given the criticality of the core database, it should be deployed in a cluster configuration on highly resilient storage with a suitable data replication factor.

Fig. 2. DataFed System Architecture
B. Data Repository Servers
DataFed data repository servers are a companion service that, along with a dedicated Globus endpoint, connects facility-local data storage to the DataFed network. The organization administering the repository may use any type or configuration of data storage system to support the repository, so long as it is supported by Globus. The data repository software stack integrates with the associated Globus endpoint to route authorization requests to core services for all data access - enabling both fine-grained user authorization and concurrency controls.
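The routing of authorization decisions to core services can be sketched as follows. The `core_authorize` function, the permission table, and the identifier formats are assumptions for illustration only, not DataFed's actual protocol:

```python
# Sketch: a repository server holds no access policy of its own; it defers
# every data-access decision to the core services, which keep the
# authoritative access-control state for the whole federation.

PERMISSIONS = {("u/alice", "d/12345"): {"read", "write"},
               ("u/bob", "d/12345"): {"read"}}

def core_authorize(user_id: str, record_id: str, action: str) -> bool:
    """Stand-in for the core service's centralized access-control check."""
    return action in PERMISSIONS.get((user_id, record_id), set())

def repository_access(user_id: str, record_id: str, action: str) -> str:
    # The repository only enforces the decision returned by core services,
    # enabling fine-grained, centrally managed authorization.
    if not core_authorize(user_id, record_id, action):
        raise PermissionError(f"{user_id} may not {action} {record_id}")
    return f"{action} granted on {record_id}"

print(repository_access("u/alice", "d/12345", "write"))  # write granted on d/12345
```

Centralizing the decision this way means revoking a permission takes effect across every repository at once, with no per-facility policy synchronization.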
C. Client Interfaces
Users can interact with DataFed via a command-line interface (CLI) and a Web Portal. The CLI provides access to basic DataFed features for both interactive use and scripting via a Python 3 package that is available via the Python Package Index (PyPI) for all major operating systems. This Python package provides both high- and low-level APIs that can be used to build custom applications that directly interface with DataFed.

The DataFed Web Portal provides users a simple yet powerful interface to manage their data through the use of modern web standards. Graphical elements, contextual menus, and drag-and-drop interactions are used widely to simplify the interface and make operations intuitive. The focal point of the Web Portal is the Data Browser, which allows users to navigate through data, create and manage collections and projects, and share data with collaborators. Users can create data records, enter or edit metadata, view provenance graphs, transfer raw data, view publicly available data catalogs, and much more.
D. Authentication and Data Transfer
User account management based on certificates or encryption keys can become a significant administrative burden when managing a large number of users. The common approach of defining virtual organizations [18] based on these authentication technologies can quickly become untenable due to the need to synchronize users across all member organizations. A more modern and scalable approach to user authentication is federated identity management [17], where users link multiple organizational accounts into a single composite identity. This approach enables mutual secure access between disjoint organizations despite differences in user accounting systems and security policies.

As mentioned previously, DataFed is based on Globus' federated user authentication and data transfer services, consequently benefiting from the availability, scalability, and performance advantages of Globus. As a consequence, DataFed users must also have a Globus account linked with one or more scientific institutions (universities, national laboratories, etc.). DataFed uses internal Globus endpoints for its own data repositories and allows DataFed users to transfer data to/from external endpoints (either organizational or personal). DataFed manages and monitors all data transfers to or from DataFed data repositories while applying DataFed-specific concurrency controls and error handling policies.

IV. BENEFITS OF DATAFED FOR SCIENTIFIC RESEARCH
DataFed offers scientific researchers many of the benefits of production-oriented SDMSs, such as unique data identification and tracking, abstraction of physical data storage, metadata and provenance capture, powerful organization and search capabilities, and data sharing with fine-grained data access controls. DataFed is also domain-agnostic, easy to learn and use, and scales out across multiple organizations.

The logical representation of data as data records instead of files enables DataFed to offer a wide range of organizational possibilities beyond what is achievable through file systems alone, such as:

• Collections that offer a hierarchical, logical organization of data similar to file system directories, but are more flexible and descriptive. Collections can also be used to configure common access controls for contained records and support batch downloads.

• Dynamic Views of data that users can create by saving metadata-based queries. Future versions will also allow tags to be used for dynamic views.

• Published Collections that organize data published by a user to one or more DataFed catalogs.

• Personal Allocations that show data by repository to assist users with raw data storage monitoring and management.

When provenance information is captured, provenance graphs can be viewed from the web portal to allow users to dynamically explore related datasets, including those created by other users. Provenance information can be used to better understand complex interrelated data collections and serves as the basis for future DataFed data dissemination features.
A. Query-able Structured Metadata
DataFed data records support a common top-level metadata schema consisting of title, description, keywords, and references (i.e. provenance), in addition to administrative fields such as identifier, owner, alias, etc. DataFed does not impose schemas for the domain-specific structured metadata associated with data records. Users are free to define this metadata in any schema of their choosing and can use textual, numeric, Boolean, geospatial, and array value types.

The top-level metadata fields (textual values) are indexed by DataFed using a full-text search engine that supports root-word analysis. Domain-specific metadata is not indexed by default, but indexing can be enabled by DataFed administrators after considering the impact on query performance. Despite not being indexed, arbitrary metadata query expressions may be used to search within domain-specific fields. For example, users can search a materials data collection to find datasets on samples of specific compositions tested over a specific temperature range. Users can limit searches to all personally-owned data, data owned by a specific project, data contained in a specific collection, etc.
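The materials-search example above (a specific composition tested over a temperature range) can be sketched as a predicate evaluated against each record's schema-free metadata. The field names and record layout here are invented for illustration, not DataFed's query syntax:

```python
# Sketch: searching schema-free, domain-specific metadata with an
# arbitrary query expression (composition + temperature range).
records = [
    {"id": "d/1", "metadata": {"composition": "SrTiO3", "temp_K": 350}},
    {"id": "d/2", "metadata": {"composition": "SrTiO3", "temp_K": 900}},
    {"id": "d/3", "metadata": {"composition": "BaTiO3", "temp_K": 400}},
]

def search(records, predicate):
    """Evaluate a query expression against each record's metadata."""
    return [r["id"] for r in records if predicate(r["metadata"])]

# Find SrTiO3 samples tested between 300 K and 500 K.
hits = search(records, lambda m: m.get("composition") == "SrTiO3"
                                 and 300 <= m.get("temp_K", 0) <= 500)
print(hits)  # ['d/1']
```

Because no schema is imposed, the predicate must tolerate missing fields, which is why `m.get(...)` is used rather than direct key access.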
B. Collective Data Ownership with Projects
DataFed offers collective ownership of data through the creation and use of DataFed Projects. These projects typically reflect real-world research being conducted by a team of researchers with one or more principal investigators (PIs). Projects are typically owned by a PI, who can add other DataFed users to the project and define project structure and access privileges for members. Since projects have their own repository allocations, project data does not count against members' personal storage allocation(s). Furthermore, projects enable joint administration of data by PIs and project members.
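The allocation accounting described above can be sketched as charging storage to whichever entity owns a record, so project-owned data never touches a member's personal quota. The identifiers and structure are illustrative assumptions only:

```python
# Sketch: storage accounting by record owner. Records owned by a project
# are charged to the project's repository allocation, not to the
# uploading member's personal allocation.
allocations = {"u/alice": 0, "p/neutron-study": 0}  # bytes used per owner

def ingest(owner: str, size_bytes: int) -> None:
    """Charge raw-data storage to whichever entity owns the record."""
    allocations[owner] += size_bytes

ingest("u/alice", 500)            # personal record: Alice's allocation
ingest("p/neutron-study", 2000)   # project record: project's allocation
print(allocations)  # {'u/alice': 500, 'p/neutron-study': 2000}
```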
C. Scripting, Batch Processing, and Workflow Support
The DataFed Python client package provides both high- and low-level APIs that support non-interactive use cases such as custom applications, batch processing, integration with computational workflows, and full data pipeline automation. Running any application that requires user authentication in a non-interactive scenario requires local security credentials to be available to the application. DataFed greatly simplifies the process of installing such credentials (issuing a single CLI command) compared to the common approach of manually generating and installing certificates or encryption key pairs.

DataFed APIs can be used to ingest data and metadata generated by scientific instruments, stage data in compute or analytics environments, ingest datasets resulting from processing steps in complex workflows, capture provenance based on input-output relationships in processing steps, etc.
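The provenance-capture pattern described above can be sketched as follows: each workflow step ingests its output as a new record whose provenance lists the records it consumed, building a lineage graph step by step. The record structure and the `derived_from` field are illustrative assumptions, not DataFed's API:

```python
# Sketch: building a provenance (lineage) graph during a workflow.
registry = {}  # record_id -> record
counter = 0

def ingest_record(title, metadata, derived_from=()):
    """Register a new record, capturing which records it was derived from."""
    global counter
    counter += 1
    rid = f"d/{counter}"
    registry[rid] = {"title": title, "metadata": metadata,
                     "derived_from": list(derived_from)}
    return rid

# Step 1: raw data arrives from an instrument.
raw = ingest_record("raw scan", {"instrument": "STEM-01"})

# Step 2: a processing step records which inputs produced its output,
# along with the processing parameters, enabling later reproduction.
filtered = ingest_record("filtered scan",
                         {"algorithm": "median", "kernel": 3},
                         derived_from=[raw])
print(registry[filtered]["derived_from"])  # ['d/1']
```

Walking `derived_from` links backwards from any record recovers the full chain of inputs and parameters that produced it.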
D. Simplified and Scalable System Administration
As mentioned earlier, DataFed uses Globus's federated identity management, which has already been adopted by many research organizations. Therefore, DataFed does not add any additional user administrative burden to these organizations. At the same time, DataFed users benefit by only needing to sign on once to Globus via their regular institutional credentials.
E. Pre-publication Data Catalogs
DataFed provides a pre-publication data catalog feature that allows users and projects to advertise publicly accessible collections of living data to the wider DataFed community. These catalogs are organized using a community-driven taxonomy which permits users to browse by topic (e.g. scientific domain) and search data records within topics using keyword, phrase, and metadata expressions.

V. EXAMPLE APPLICATIONS

• Data Ingestion - Data portals or pipelines are often used to capture raw data generated by scientific instruments and move it to a centralized data repository [19]. DataFed can be incorporated into such data pipelines to capture operational (user name, project id, etc.) and domain-specific (experiment description, sample identifier and description, instrument calibration, etc.) metadata that simplify organization, discovery, and sharing of data.

• User Facility Administration - The DataFed Projects construct enables staff at user facilities, which often support hundreds of visiting researchers a year, to easily manage the data for users while allowing those users to control their data once their research is concluded (for example, moving data back to a home institute data repository).

• Data Analytics - The association of rich domain-specific metadata with raw data in DataFed data records can significantly alleviate the data wrangling necessary to create collections of examples that can be used to train AI models. Users could rapidly develop, iterate over, and improve their models by searching for applicable data collections using search queries or tags in DataFed, placing the collection in their compute environment regardless of its physical location, and applying the model to the data collection. ML and AI agents can be deployed to automatically extract features, identify anomalies, and embed such knowledge as metadata in DataFed data records.

• Schema Support - DataFed can serve as a global collaborative platform for scientific communities to come together and develop schemas for their domain. Deep learning and natural language processing can be applied to facilitate mapping of metadata from one schema to another.
• Provenance Capture in Workflows - In addition to capturing the resulting data assets created from a series of data processing steps in a workflow, DataFed can also capture rich provenance information such as the relationships between the data created and consumed at each step, including the processing parameters, algorithms/software used for processing, etc. The presence of such transparent and rich provenance information would not only enable other users to repeat or replicate data processing, but would also give downstream users additional confidence in such datasets, and help peer reviewers of articles or datasets to vet them.

• Scientific Companion - The scientific community has frequently expressed a desire for a digital companion that can assist researchers in planning and steering experiments. Such a companion could mine metadata, provenance, and raw data present in DataFed to avoid unnecessary repetition of experiments, show results from similar experiments, suggest unexplored parameter combinations, etc.

VI. DEVELOPMENT ROADMAP
The following list highlights SDMS features that enhance scalability, reliability, and usability:

• High Availability and Reliability - Services are required to be resilient, prevent data loss, recover from failures, and preserve data integrity. In DataFed, this effort is primarily focused on clustered core and database services.

• Schema Support - DataFed will permit users/communities to define metadata schemas to both validate records at the time of creation and to support domain-specific search wizards that will generate graphical input forms based on a specified schema. This capability will facilitate the evolution of schemas and may potentially be followed by the ability to map fields between overlapping schemas.

• Data Dissemination and Bi-directional Annotations - Much like journal articles, datasets too can become outdated, or be found to be incomplete or erroneous. In such cases, downstream users producing derived data products of such datasets need to be informed of upstream changes to the data record status. Conversely, a downstream user may discover an issue with an upstream data record and wish to notify the associated owners about the issue. DataFed modifications will allow users to “subscribe” to “data events” associated with records or collections. Bi-directional annotations will be added to DataFed to permit up- or down-stream users to attach informational, warning, and/or error notifications to records and collections.

• Publication Support - DataFed will allow data records to be represented using Digital Object Identifier (DOI) numbers with links to raw data curated in data repositories external to DataFed. DataFed interfaces could still be used to access such data records from within the DataFed network. Furthermore, DataFed would facilitate far more nuanced searches than typically possible in external data repositories, given the availability of rich domain-specific metadata.
• Data Caching and Replication - DataFed will automatically cache transient copies of remote raw data within high-use facilities based on policies and/or user preferences to improve system efficiency. A data replication feature will also be added to allow users to create redundant, but synchronized, copies of raw data on separate data repositories for increased data safety.

• Tags and Labels - The attachment of tags and labels to data records will offer additional organizational options such as dynamic views based on tags and faceted searches. Tagged datasets will also augment the collections of examples (with labeled ground truth) for training AI models.

• Improved Batch Import - The existing batch import feature will be enhanced to simplify the process of importing large preexisting data collections into DataFed.

• Multimedia Attachments - DataFed will allow small multimedia assets, such as thumbnail images, to be attached to data records to enhance browsing of data catalogs.

• Metrics - Data records and collections will support community-driven metrics such as numbers of downloads or subscribers, ratings, etc. to highlight the quality of datasets.

VII. CONCLUSIONS
We present DataFed - a general purpose and domain-agnostic SDMS aimed at significantly alleviating the burden of data management to improve scientific productivity, lower barriers to cross-institutional and collaborative research, and facilitate data-driven scientific discovery. DataFed provides users with a logical view of data that abstracts the physical data storage and facilitates the capture and enrichment of metadata (operational and domain-specific) associated with the raw data. Users of DataFed benefit from provenance capture, powerful organization and search capabilities, fine-grained access control for sharing data with collaborators, unique identifiers for data records, etc.

DataFed has been designed in compliance with FAIR data principles for facilitating open, collaborative, and repeatable research. By using DataFed to manage both data and execution contexts captured within software containers [20], DataFed represents an important step towards reproducibility in open scientific research by enabling users to easily, correctly, repeatably, and reliably work with datasets within appropriate compute or analytic contexts.

DataFed's use of popular and scalable user authentication and data transfer services overcomes limitations of prior SDMSs and simplifies deployment, maintenance, and expansion of the DataFed federation. Besides lowering barriers to adoption and deployment, DataFed also provides a Web Portal for user-friendly management of data as well as a command line interface for integration with complex workflows or for creating custom applications. We are continually improving the resilience, scalability, performance, and capabilities of DataFed. Interested readers are welcome to use DataFed at https://datafed.ornl.gov.

ACKNOWLEDGMENT
This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) and of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

REFERENCES

[1] J. Blair, R. S. Canon, J. Deslippe, A. Essiari, A. Hexemer, A. A. MacDowell, D. Y. Parkinson, S. J. Patton, L. Ramakrishnan, N. Tamura et al., “High performance data management and analysis for tomography,” in Developments in X-Ray Tomography IX, vol. 9212. International Society for Optics and Photonics, 2014, p. 92121G.
[2] S. V. Kalinin, E. Strelcov, A. Belianinov, S. Somnath, R. K. Vasudevan, E. J. Lingerfelt, R. K. Archibald, C. Chen, R. Proksch, N. Laanait et al., “Big, deep, and smart data in scanning probe microscopy,” ACS Nano, pp. 9068–9086, 2016.
[3] B. A. Nosek, G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, G. Christensen et al., “Promoting an open research culture,” Science, vol. 348, no. 6242, pp. 1422–1425, 2015.
[4] K. Marder, A. Patera, A. Astolfo, M. Schneider, B. Weber, and M. Stampanoni, “Investigating the microvessel architecture of the mouse brain: An approach for measuring, stitching, and analyzing 50 teravoxels of data,” AIP, July 2015, p. 73. [Online]. Available: https://rawgit.com/4Quant/SRI2015/master/SRIPres.html
[5] EDBT, vol. 16, 2016, pp. 473–478.
[6] R. Moore, “Data management systems for scientific applications,” in Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software. Deventer, The Netherlands: Kluwer, B.V., 2001, pp. 273–284. [Online]. Available: http://dl.acm.org/citation.cfm?id=647102.717555
[7] C. Quintero, K. Tran, and A. A. Szewczak, “High-throughput quality control of DMSO acoustic dispensing using photometric dye methods,” Journal of Laboratory Automation, vol. 18, no. 4, pp. 296–305, 2013.
[8] L. Marini, I. Gutierrez-Polo, R. Kooper, S. P. Satheesan, M. Burnette, J. Lee, T. Nicholson, Y. Zhao, and K. McHenry, “Clowder: Open source data management for long tail data,” in Proceedings of the Practice and Experience on Advanced Research Computing. ACM, 2018, p. 40.
[9] C. Allan, J.-M. Burel, J. Moore, C. Blackburn, M. Linkert, S. Loynton, D. MacDonald, W. J. Moore, C. Neves, A. Patterson et al., “OMERO: flexible, model-driven data management for experimental biology,” Nature Methods, vol. 9, no. 3, p. 245, 2012.
[10] A. P. Arkin, R. L. Stevens, R. W. Cottingham, S. Maslov, C. S. Henry, P. Dehal, D. Ware, F. Perez, N. L. Harris, S. Canon et al., “The DOE systems biology knowledgebase (KBase),” BioRxiv, p. 096354, 2016.
[11] V. Garonne, R. Vigne, G. Stewart, M. Barisits, M. Lassnig, C. Serfon, L. Goossens, A. Nairz, ATLAS Collaboration et al., “Rucio - the next generation of large scale distributed system for ATLAS data management,” in Journal of Physics: Conference Series, vol. 513, no. 4. IOP Publishing, 2014, p. 042021.
[12] A. Rajasekar, R. Moore, and F. Vernon, “iRODS: A distributed data management cyberinfrastructure for observatories,” in AGU Fall Meeting Abstracts, 2007.
[13] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne et al., “The FAIR guiding principles for scientific data management and stewardship,” Scientific Data, vol. 3, 2016.
[14] B. Allen, R. Ananthakrishnan, K. Chard, I. Foster, R. Madduri, J. Pruyne, S. Rosen, and S. Tuecke, “Globus: A case study in software as a service for scientists,” in Proceedings of the 8th Workshop on Scientific Cloud Computing, ser. ScienceCloud ’17. New York, NY, USA: ACM, 2017, pp. 25–32. [Online]. Available: http://doi.acm.org/10.1145/3086567.3086570
[15] K. Chard, J. Pruyne, B. Blaiszik, R. Ananthakrishnan, S. Tuecke, and I. Foster, “Globus data publication as a service: Lowering barriers to reproducible science,” IEEE, 2015, pp. 401–410.
[16] W. Allcock, “GridFTP: Protocol extensions to FTP for the grid,” 2003.
[17] S. S. Y. Shim, G. Bhalla, and V. Pendyala, “Federated identity management,” Computer, vol. 38, no. 12, pp. 120–122, Dec 2005.
[18] A. Mowshowitz et al., “Virtual organization,” Communications of the ACM, vol. 40, no. 9, pp. 30–37, 1997.
[19] S. Androulakis, “MyTardis and TARDIS: Managing the lifecycle of data from generation to publication,” in eResearch Australasia 2010, 2010.
[20] G. M. Kurtzer, V. Sochat, and M. W. Bauer, “Singularity: Scientific containers for mobility of compute,” PLoS ONE, vol. 12, no. 5, p. e0177459, 2017.