Workflow Provenance in the Lifecycle of Scientific Machine Learning
Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto
IBM Research; Federal University of Rio de Janeiro, Brazil; Inria, Univ. Montpellier, CNRS & LIRMM, France
Abstract
Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to provide for critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping low overhead (<…).

Index Terms
Scientific Machine Learning, Machine Learning Lifecycle, Artificial Intelligence, Data Science, Provenance, Lineage, Reproducibility, Explainability, Scientific Workflow, Data Lake, e-Science, Design Principles, Taxonomy
1 Introduction

Machine Learning (ML) has been fundamentally transforming several industries and businesses in numerous ways. More recently, it has also been impacting computational science and engineering domains, such as geoscience, climate science, material science, and health science. Scientific ML, i.e., ML applied to these domains, is characterized by the combination of data-driven techniques with domain-specific data and knowledge to obtain models of physical phenomena [1], [2], [3], [4], [5]. Obtaining models in scientific ML works similarly to conducting traditional large-scale computational experiments [6], which involve a team of scientists and engineers who formulate hypotheses, design the experiment, predefine parameters and input datasets, analyze the experiment data, make observations, and calibrate initial assumptions in a cycle until they are satisfied with the results. Scientific ML is naturally large-scale because multiple people collaborate in a project, using their multidisciplinary domain-specific knowledge to design and perform data-intensive tasks to curate (i.e., understand, clean, and enrich with observations) datasets and prepare them for learning algorithms. They then plan and execute compute-intensive tasks for computational simulations or for training ML models affected by the scientific domain's constraints. They utilize specialized scientific software tools running either on their desktops, on cloud clusters (e.g., Docker-based), or on large HPC machines.

Other works propose an ML lifecycle [7], [8]. Although they might apply to scientific ML, in our view, there are still gaps in these lifecycle proposals to properly address scientific ML characteristics, particularly the need for deeper integration with scientific domain data and specialized knowledge of a domain.
Our proposed model for the lifecycle of scientific ML has three phases (explained in detail later in this paper): data curation, to curate raw data; learning data preparation, to prepare the curated data for learning; and the learning itself, aware of the constraints of a scientific domain. In each of these phases, there may be multiple workflows. Each workflow is a set of chained data transformations consuming and producing datasets, and a workflow may consume the datasets produced by another workflow. For instance, there may be multiple workflows only in the learning data preparation phase to transform curated data into learning datasets. These datasets may then be consumed by multiple workflows in the learning phase, transforming the datasets into different ML models. Therefore, we propose modeling these workflows as multiple interconnected workflows [9]. From now on, we refer to workflows as these multiple interconnected workflows in all phases of the lifecycle of scientific ML.

*Correspondence: Renan Souza - [email protected]

Our primary goal in this paper is to support this lifecycle by enabling scientists and engineers to perform comprehensive, i.e., end-to-end data analyses that integrate the data consumed and generated in these workflows, from raw domain data to learned models. The importance of these data analyses is that they are enablers to meet critical requirements in ML, such as model reproducibility and explainability, and experiment data understanding.

The main problem in achieving this goal is to deal, in an integrated and comprehensive way, with the high heterogeneity of the different contexts (e.g., data, software, environments, personas) involved in this lifecycle. For example, the analyses need to be aware of the (hyper)parametrization of different data transformations in various workflows, how the transformations affect the experiment results (e.g.
, quality of the ML models), and the relationships between parameters, results, and domain-specific data and knowledge. For instance, one may ask: "what happened to the model performance when the parameters varied from X to Y and the datasets had a specific characteristic in the domain?". To allow for such analyses, it is necessary to track how the data are transformed throughout the workflows in an integrated and holistic way. Not having such holistic integration is critical for several reasons. To exemplify, it compromises experiment reproducibility from a scientific perspective. From a business perspective, stakeholders may be less likely to apply an ML model, even the one with the best performance, if they do not understand the transformations that led to it [10].

Provenance (also referred to as lineage) data management techniques help reproduce, trace, assess, understand, and explain data, models, and their transformation processes [11], [12], [13]. The provenance research community has evolved significantly in recent years to provide for several strategic capabilities, including experiment reproducibility [14], user steering (i.e., runtime monitoring, interactive data analysis, runtime fine-tuning) [15], raw data analysis [16], and our previous work, which helps data integration for multiple workflows generating data in a data lake [9]. Furthermore, other works contribute to support provenance tracking specifically for ML workflows [17], [18], [19], [8], [7], including reproducible models and explainability [20]. These related works are essential building blocks to be leveraged towards supporting the lifecycle.

Nevertheless, scientists and engineers still face difficulties in performing comprehensive data analyses that would help them meet those critical requirements in ML. Tracking provenance in those workflows could be used as a tool to provide for a holistic view, hence enabling the data analyses.
However, the high heterogeneity in the lifecycle raises several challenges. For example, the workflows are highly heterogeneous and have distributed execution control: there may not be one single Workflow Management System (WMS) orchestrating all workflows; instead, there may be multiple WMSs, scripts, programs, and ML and data processing frameworks without a single unified execution orchestrator. Further, these workflows manage domain- and ML-specific data and knowledge stored in various distributed data stores and run on various execution environments. Hence, strategies to track data in multiple data stores are needed. Another complicating factor is that efficiency is a common requirement, especially in HPC executions. Thus, the systems supporting the lifecycle need to scale and must not add significant tracking overhead. Designing a system to efficiently track provenance in such heterogeneous scenarios has recently been acknowledged as a research challenge by leading data management researchers [21].

In this paper, our focus is to support the lifecycle of scientific ML by enabling comprehensive data analyses, addressing the problem of high heterogeneity of different contexts. Particularly, we contribute with:

(i) A comprehensive characterization of the lifecycle, from raw domain data to learned models, passing through the processes that manipulate these data, and a taxonomy (detailing, e.g., data, execution timing, and training timing classes) positioning the role of provenance analysis to support the lifecycle (Sec. 2);

(ii) Data Design Principles to build and query a provenance-based holistic data view that integrates the data processed by workflows in the lifecycle, aware of the heterogeneous dimensions, enabling the comprehensive analyses.
A result of these principles is PROV-ML, a new provenance data representation for scientific ML leveraging W3C PROV [23] and MLS [24]; and System Design Principles that guide how to build a provenance system to track and integrate the data efficiently in distributed executions. A result of these principles is a reference system architecture (Sec. 3);

(iii) Lessons learned after applying the principles in a system implementation and evaluating it in a real case in the Oil & Gas (O&G) industry in a testbed with 3 environments, including an HPC cluster with 393 computing nodes and 946 GPUs. We found that the principles enabled comprehensive queries with rich semantics about the application domain and ML, while maintaining low tracking overhead (<…).

2 The Lifecycle of Scientific ML

Existing works describe an ML lifecycle [7], [8], but such descriptions focus on business domains and do not address the high heterogeneity problem of the lifecycle of scientific ML. Since our main goal is to support this lifecycle by enabling scientists and engineers to perform comprehensive data analyses, we begin with a proposal of a model for this lifecycle
and a thorough characterization. To the best of our knowledge, this is the first work that proposes a lifecycle focused on scientific ML. To illustrate this section's explanations, we explore a concrete use case in the geoscience domain, of high interest in the O&G industry.

1. This paper is a major extension of our work published in IEEE WORKS@SC19 [22]. We improved and expanded the design principles, definitions, examples, and lessons learned after new experiments. Also, we refined PROV-ML and extended the literature analysis.
Motivating use case.
Finding oil and gas reservoirs is a demanding task in the O&G industry and involves a broad spectrum of actions, such as the interpretation of seismic surveys. These surveys are indirect measures of the earth subsurface that can be organized into slices (images). They cover hundreds of square kilometers and help to interpret the geology by identifying geological structures, like salt bodies, and finding possible hydrocarbon accumulations. Processing seismic data imposes complex chained data transformations and can suffer from many problems, like noise and shadows (regions with low signal). Automating such activity is of high interest in academia and industry, and deep learning is a promising machine learning technique for this [5]. However, the geological structures vary geographically, from point to point in the subsurface, imposing significant challenges on the ML algorithms. Thus, it requires specialized knowledge to prepare, clean, and understand the data processed in the workflows. To cope with this, the different teams in an interdisciplinary group composed of geoscientists, computational scientists, engineers, statisticians, among others, often decompose the problem into parts so that each can address different facets of the problem. Nonetheless, each team has a preferred way to automate tasks and store data, and a team consumes data generated by another. Although decomposing the problem into parts makes it feasible, it creates a new problem: how to consume the data in an integrated way [9], [22], [5].

In this section, we explore this use case to propose an abstract model of the lifecycle, which applies to other scientific domains as well. We first characterize the personas and describe the lifecycle in Section 2.1 and then characterize the data analyses using provenance in Section 2.2.
2.1 Personas and the Lifecycle

Multidisciplinary personas, with different skills in the domain and in ML techniques, participate in the lifecycle phases. In our previous work, we presented a spectrum of expertise and personas in scientific ML [22], depicted in Figure 1 and briefly summarized here. The spectrum ranges from scientific-domain only (fully white on the left) to ML only (fully black on the right), with the following personas: (i) Domain scientists, who have in-depth knowledge of the domain data and use specialized tools to interpret, visualize, and clean the scientific data; (ii) Computational scientists and engineers, who have high computational skills, often with abilities to develop parallel scripts and execute them in HPC clusters; and (iii) ML scientists and engineers, who have in-depth knowledge of statistics, ML algorithms, and software engineering. In an orthogonal sense, Provenance specialists design the provenance schema for applications and guide other users to add provenance capture hooks to the workflows.
Fig. 1: Spectrum of expertise and personas in the lifecycle.

Our proposed model of the lifecycle of scientific ML divides it into three phases: data curation, learning data preparation, and learning (Figure 2: dashed arrows are data flows and solid arrows are interactions between phases).
Fig. 2: The Lifecycle of Scientific ML.
Data curation.
It is the most complex phase of the lifecycle, mainly because of the nature of the scientific data. Much manual and highly specialized work is performed by the users (primarily domain scientists) to achieve the automated knowledge extraction from scientific data promoted by ML. There is a significant gap between raw scientific data and useful data for consumption (e.g., data to serve as input to train ML models). Datasets can be huge, typically containing geospatial-temporal data stored in scientific formats, like HDF5, NetCDF, and SEG-Y. Specialized formats in scientific domains may require industry-specific software and domain-specific knowledge to inspect, visualize, and understand the data. In addition, users can use metadata and textual reports to annotate the data with extra domain-specific knowledge, without which it would be nearly impossible to make the data useful for ML algorithms. Considering the heterogeneous nature of the data, "it is unreasonable to assume that data lives in a single source" (e.g., a single file system or DBMS) [10]. For instance, raw files can be stored in file systems or cloud stores, domain-specific annotations can be stored in a Semantic Graph DBMS (e.g., a Triple Store) with domain ontologies, and curated data can be stored in a NoSQL DBMS. Then, computational scientists and engineers develop data-intensive scripts to clean, filter, and validate the data. Each of these steps inside the data curation phase is highly interactive, manual, and may execute independently. In other words, users may run different scripts to perform these steps, several times, in an ad-hoc way, in any order, and on different machines. These steps occur in a cycle, which stops when the users consider the data "curated". In the context of ML, the data are then ready to be transformed into learning data.
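For illustration, such domain-specific annotations can be thought of as subject-predicate-object triples attached to a raw file, as sketched below; the file URI, predicate names, and values are all hypothetical, and a real deployment would store them in the Semantic Graph DBMS mentioned above rather than in an in-memory list.

```python
def annotate(subject, predicate, obj, store):
    """Append one (subject, predicate, object) annotation to the store."""
    store.append((subject, predicate, obj))

annotations = []
raw_file = "file:///lake/seismic/survey_042.segy"  # hypothetical file URI

# Link the raw file to domain concepts a geoscientist would provide.
annotate(raw_file, "domain:acquiredOver", "domain:CamposBasin", annotations)
annotate(raw_file, "domain:hasSliceCount", 1876, annotations)
annotate(raw_file, "rdf:type", "domain:SeismicSurvey", annotations)

# A later step can recover every domain concept attached to the file.
concepts = [o for s, p, o in annotations if s == raw_file]
```

Without links of this kind, queries that join models back to oil basins or slice counts (such as Q1 and Q6 later in Table 1) have nothing to traverse.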
Learning data preparation.
Model trainers select relevant parts of the curated data to be used for learning. For instance, if the ML task is to classify geological structures [5], seismic images will need to be correlated with seismic interpretation, creating labeled samples. After selecting the data, model designers develop scripts, typically using domain-specific libraries to manipulate the raw scientific data, to transform (e.g., image cropping, quantization, scaling) the data into learning datasets. Due to data complexity, frequently the data need to be manually inspected before they can be used as input for the learning phase.
Learning.
Learning comprises training, validation, and evaluation. In this phase, model trainers select the input learning datasets, optionally choose validation datasets, and choose learning parameters (e.g., in deep learning, they can choose ranges of epochs and learning rates) that will be optimized. Trainers can use their domain knowledge to discard learning datasets that will unlikely provide good results. The learning process is compute-intensive, typically executed on an HPC machine. One single learning process often generates multiple learned models, among which one is chosen as the "best" depending on evaluation metrics (e.g., MSE, accuracy, or any other user-defined metric). Moreover, trainers need to monitor the learning process by, e.g., inspecting how the evaluation metrics are evolving while the learning process iterates. They can wait until completion or interrupt the learning process, change parameters, and iteratively re-submit the learning until satisfied with the results.
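A deliberately simplified sketch of this phase: one learning process explores several parameterizations and selects the "best" model by an evaluation metric. The `train` function is a stand-in mock with a made-up formula, not a real learner, and all names are hypothetical.

```python
def train(hp):
    """Stand-in for a compute-intensive training job: returns a mock
    validation loss for the given hyperparameters (hypothetical formula)."""
    return round(1.0 / (hp["epochs"] * hp["learning_rate"] * 1e4), 4)

# One learning process explores several parameterizations...
search_space = [
    {"epochs": 10, "learning_rate": 1e-3},
    {"epochs": 20, "learning_rate": 1e-3},
    {"epochs": 10, "learning_rate": 1e-4},
]
trained_models = [
    {"model_id": f"m{i}", "hyperparams": hp, "val_loss": train(hp)}
    for i, hp in enumerate(search_space)
]

# ...and the "best" model is selected by the evaluation metric.
best = min(trained_models, key=lambda m: m["val_loss"])
```

Capturing the `trained_models` records as provenance, rather than discarding all but `best`, is what later enables inter-training analyses.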
2.2 Workflow Provenance Analysis in the Lifecycle

Provenance data in workflows contain a structured record of the data derivation paths within chained data transformations and their parameterizations [16], [15]. Provenance data are usually represented as a directed graph where vertices are instances of entities (data), activities (the data transformations), or agents (e.g., users), and edges are instances of relationships between vertices [23]. Comprehensive data analysis using provenance has been used as an enabler for several key capabilities:

• Experiment reproducibility [11], [25], [12];
• AI explainability [20], [26], [9];
• Experiment fine-tuning and what-if analyses [15];
• Uncertainty quantification [27], [28];
• Hypothesis testing [6]; and
• Real-time monitoring and interactive data analysis [29].

Based on a literature analysis [30], [11], [10], [31], [25] and on our own experience leveraging provenance to support workflows for scientific ML [9], [22], [32], [33], we propose here a taxonomy to classify workflow provenance analysis in support of ML, considering three classes: data, execution timing, and training timing. Next, we characterize the data involved in the lifecycle.

TABLE 1: Examples of provenance queries in the lifecycle of scientific ML.

Q1: Given a trained model, what are the geographic coordinates, oil basin and field, and the number of seismic slices of the seismic in the training dataset?
Q2: Given a trained model, what is the tile size, the noise filter threshold, and the ranges of seismic slices that were selected to generate the training set used to adjust this model?
Q3: Given a training set, what are the values for all hyperparameters and the evaluation measure values associated with the trained model with least loss?
Q4: What are the average, min, and max execution times of each batch iteration inside each epoch of the deep neural network training, given a training dataset?
Q5: What is the execution time on average per batch iteration, per epoch, and what are the evaluation metrics of the trained models that used the training dataset generated for a given range of seismic slices?
Q6: Given the training dataset used in Q5, what was the seismic data file used, along with its number of slices, related oil basin, and field?
Q7: Considering only the learning workflows that used the learning dataset associated with a given range of seismic slices, list the minimum batch loss per model obtained in the learning stage, also listing the model's hyperparameters and evaluation measurements jointly with the hyperparameters and measurements for the associated model obtained in the validation stage, ordered by the best learned models.
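To make the shape of such queries concrete, the sketch below poses Q3 against a toy, in-memory set of provenance records. In a real deployment the records would live in a provenance database and the query would be expressed in its query language; all record fields and values here are hypothetical.

```python
# Hypothetical provenance records linking training sets to trained models.
provenance = [
    {"training_set": "ts_01", "model": "m0",
     "hyperparams": {"epochs": 10, "lr": 1e-3}, "loss": 0.42},
    {"training_set": "ts_01", "model": "m1",
     "hyperparams": {"epochs": 20, "lr": 1e-4}, "loss": 0.31},
    {"training_set": "ts_02", "model": "m2",
     "hyperparams": {"epochs": 10, "lr": 1e-3}, "loss": 0.55},
]

def q3(training_set, records):
    """Q3: hyperparameters and loss of the least-loss model trained on
    the given training set."""
    candidates = [r for r in records if r["training_set"] == training_set]
    return min(candidates, key=lambda r: r["loss"])

answer = q3("ts_01", provenance)  # record for model "m1", loss 0.31
```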
Data class includes domain-specific, machine learning, and execution data. Provenance data may be augmented with these data, increasing the scope of the analysis.
Domain-specific data are the main data processed in the data curation phase (Sec. 2.1). Approaches to add domain data into provenance analysis include, e.g., raw data extraction [34] and the utilization of domain-specific knowledge databases associated with provenance databases [9]. For raw data extraction, quantities of interest are extracted from raw data files. For domain databases, domain scientists may provide relevant information and metadata about the raw data and store them in knowledge graphs.
Machine learning data include learning data and generated learned models, which are more related to the learning data preparation (e.g., Q1) and learning (e.g., Q2, Q3, Q7) phases (Fig. 2). These queries exemplify that the parametrization within the data transformations and relevant metadata of the generated data are important for provenance analysis.
Execution data.
Besides model performance metrics (e.g., accuracy), users need to assess workflow execution time and resource consumption. They need to inspect whether a critical block in their workflow (e.g., one demanding high parallelism) is taking longer than usual or whether other parts are consuming more memory than expected. For this, provenance systems can capture system performance metrics and timestamps (e.g., Q4). Metadata, such as data store metadata (e.g., host address), HPC cluster name, and nodes in use, can be captured and associated with the provenance of the data transformations for extended analysis.
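A minimal sketch of this kind of capture, assuming a Python training script: a wrapper times each batch iteration and aggregates Q4-style statistics (average, min, max). The helper names are ours, not from any specific provenance system.

```python
import time
from statistics import mean

def timed_batches(batches, log):
    """Wrap batch iteration, recording per-batch wall-clock times."""
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        yield batch  # the consumer's batch computation runs here
        log.append({"batch": i, "elapsed_s": time.perf_counter() - start})

log = []
for batch in timed_batches(range(5), log):
    _ = sum(range(1000))  # stand-in for the real batch computation

# Q4-style aggregation over the captured execution data.
summary = {"avg": mean(e["elapsed_s"] for e in log),
           "min": min(e["elapsed_s"] for e in log),
           "max": max(e["elapsed_s"] for e in log)}
```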
Hybrid.
These data can be combined. In Q5 and Q7, the analysis queries data processed in workflows in the learning data preparation and learning phases, whereas Q6 uses the same dataset to analyze the raw files curated in the data curation phase.
Execution timing refers to whether the analysis is done online, i.e., while at least one workflow is running, or offline.

Offline analysis.
The typical use of offline provenance analysis is to support reproducibility and historical data understanding, e.g., to understand the curation of raw files and relate it with the ML models. The queries Q1–Q7 can be executed offline.
Online analysis.
Users can use online provenance analysis to monitor, debug, or inspect the data transformations while they are still running (e.g., see the status, see how the intermediate results are evolving as the input parameters vary). The problem of adding low provenance data capture overhead is more challenging for provenance systems that allow for online analysis [9]. Queries Q3–Q5 and Q7 exemplify queries that can be executed online, e.g., while a training process is running.
Training timing refers to whether the analysis performs intra-training, i.e., inspects one training process (e.g., a training job running on an HPC cluster), or inter-training, i.e., analyses comprehending results of several training processes.

Intra-training. In an offline intra-training analysis, users are interested in understanding how well the trained models generated in a given training process perform. All queries, Q1–Q7, could be executed either online or offline, but Q3 and Q4 are more likely to be performed as online intra-training analyses.
Inter-training. This analysis refers to comprehensive queries to understand multiple training processes, e.g., how each of them performed, which learning datasets were used, and how the training processes were parameterized. It supports activities like Model Validation, Management, Training, and Design. Usually, these analyses are performed offline, but they may also be performed online. Queries Q1–Q7 fit this class when analyzing multiple trained models generated in different training processes.
Fig. 3: A taxonomy for workflow provenance analysis of the lifecycle of scientific ML.
Further characterization.
Other classes worth mentioning for provenance analysis are: data store — data are distributed onto multiple stores, like file systems, cloud stores (e.g., IBM Cloud Object Storage, AWS S3), and Relational or NoSQL DBMSs [9]; execution environment — where the workflows execute, such as HPC clusters, Kubernetes clusters, or a standalone server; execution orchestration software — each workflow may be executed as a standalone script, as a workflow in a WMS, as a composition of microservice calls, or as a pipeline in data processing (e.g., Spark) and ML frameworks (e.g., TensorFlow); provenance data granularity — provenance of files (i.e., references to files consumed and generated in a script), function calls (arguments and outputs), blocks of code, and stack traces [30]; and provenance analysis direction — forward or backward: generally, forward queries analyze from raw scientific files or learning datasets to trained models (e.g., Q3–Q5, Q7), whereas backward queries analyze from trained models to learning datasets or raw files (e.g., Q1, Q2, Q6).
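A backward query of this kind amounts to a reachability traversal over derivation edges. The sketch below shows the idea on a toy, hand-built edge map (all identifiers hypothetical); a provenance DBMS would evaluate the equivalent traversal over its stored graph.

```python
from collections import deque

# Hypothetical derivation edges: each entry maps a derived item to what it
# was derived from (model <- learning dataset <- curated data <- raw file).
derived_from = {
    "model:m1": ["data:train_set_7"],
    "data:train_set_7": ["data:curated_slices"],
    "data:curated_slices": ["file:raw_survey.segy"],
}

def backward(item, edges):
    """Backward query: every ancestor reachable from the given item."""
    seen, queue = set(), deque([item])
    while queue:
        for parent in edges.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

ancestors = backward("model:m1", derived_from)
# ancestors == {"data:train_set_7", "data:curated_slices", "file:raw_survey.segy"}
```

A forward query is the same traversal over the inverted edge map.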
3 Design Principles

This section presents the fundamental design principles for effective and efficient management of workflow provenance data in the lifecycle of scientific ML to provide for comprehensive data analyses. Although some of these design principles, individually, may have been proposed in related works [35], [9], [36], [34], together they compose the building blocks of our approach, and we assemble them as one unified set of principles and describe how they support the lifecycle. They are organized as: (i) Data Design Principles (Sec. 3.1), which contain the principles and key concepts that drive the contents of our holistic data view, whose resulting artifact is PROV-ML, a new provenance data representation; and (ii) System Design Principles (Sec. 3.2), which contain the principles that determine how the provenance data are captured in a scalable and portable manner, whose resulting artifact is a reference system architecture.
3.1 Data Design Principles

DDP 1: Data Integration with a Holistic Data View.

The primary design principle is that, to be able to manage effectively (i.e., capture, integrate, store, and query) provenance data in the interconnected workflows in all lifecycle phases, a provenance system must implement techniques to provide for an integrated, unified, and holistic data view. Also, it has to be aware of the contexts of the data transformations in the multiple workflows that consume and generate these data, their (hyper)parameterization and output values, where these transformations run, where the generated data are stored, who the involved personas are, and how they interact with the workflows. This design principle builds on the multiworkflow data view concept proposed in our previous work [9]. It extends it to support the lifecycle comprehensively, with specializations to address ML-specific data and knowledge related to domain-specific data and knowledge. Let us call this data view the Provenance-based Holistic Data View of the Lifecycle of Scientific ML (MLHolView). The contents and the granularity of the MLHolView are driven by the relevant queries for a project, and the view can be materialized as the database that integrates data from several sources while the workflows run [9].
DDP 2: Context-awareness using Knowledge Graphs: Domain, ML, and Hybrid Environments and Data Stores.

Extending provenance with domain-specific data for data analysis has been explored before [37], [34], [15]. However, in scientific ML, it is required to go a step further into the details of domain-specific knowledge, including how key domain concepts relate to each other. Thus, it is important to relate the data in the workflows with as much knowledge as possible available about the project's key concepts. To be able to integrate with domain-specific knowledge databases, one needs to design the workflows aware that files (or data in other data stores, like DBMSs or object stores) are associated with concepts defined elsewhere. Then the provenance system needs to provide the proper links between the files and the domain-specific concepts.

Similarly, the MLHolView needs ML-specific concepts and relationships. Although modeling ML-specific concepts could be seen as modeling data for a specific domain (in this case, ML would be the domain), ML, by itself, is a distinguished domain, which crosses many industries and scientific domains. Thus, the MLHolView should have a built-in ML-specific schema, tightly coupled with the rest of the provenance data schema, to provide ML-specific context to support the comprehensive analyses. In certain cases, such specialized schema modeling might even help accelerate queries that require it [38].

In addition to domain- and ML-specific context awareness, since the workflows can be executed within heterogeneous frameworks, scripts, or WMSs and on heterogeneous environments, the MLHolView needs to be aware of such hybrid (i.e., heterogeneous) execution by containing the track of the execution environment and software, and associated metadata. These data and their relationships, with pointers to domain-specific knowledge graphs and to large data stored in other stores, are all materialized using provenance data in the knowledge graph that forms the MLHolView. Figure 4 illustrates the MLHolView and its awareness of data coming from the ML phases and the dimensions of heterogeneity (illustrated as layers) it addresses: software, data, data stores, and infrastructure (execution environment). The figure also shows the kinds of provenance analysis (top-left) and the key capabilities the MLHolView enables (top-right) (Sec. 2.2).
Fig. 4: The Provenance-based Holistic Data View of the Lifecycle of Scientific ML.
DDP 3: Provenance of Multiple Workflows on Data Lakes meets ML Provenance Following W3C Standards.
To be able to implement the context-awareness for domain, ML, and hybrid environments, the MLHolView needs a comprehensive data representation. Data lake provenance builds on workflow provenance to enable awareness of the location of each data item generated by chained data transformations in a data lake, even if multiple data items are dispersed across hybrid environments and data stores [9], [39], making it a good alternative to address such heterogeneity of data, stores, and environments. However, it is not enough to support the lifecycle, as the lifecycle requires provenance of ML-specific data and learning processes. The provenance data community has evolved significantly in recent years, oftentimes leveraging the PROV [23] family of documents, a W3C recommendation, making it a de facto standard that provides the building blocks, in terms of data representation, for any provenance-based approach and allows for compatibility among different solutions [40]. The PROV-Wf [41] workflow provenance data representation and its derivatives [16] have also been used and evolved by several initiatives [29], [15], [42]. Our previous work builds on W3C PROV and PROV-Wf to propose PROVLake, a first provenance data representation for workflows on data lakes [9]. With respect to ML-specific data modeling, there is a W3C community group developing a data representation with specific ML vocabulary, the W3C ML Schema (MLS) [24]. Therefore, this data design principle proposes that the data representation for the MLHolView should be comprehensive, with detailed semantics about the workflows, where they execute, the data they process, and where these data are stored, combining and extending a data lake provenance representation with an ML-specific data representation, following standards and reusing existing representations, such as W3C PROV, PROV-Wf, PROVLake, and MLS.
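A minimal sketch of how such a combined representation could look as W3C PROV-style triples, linking an ML model to the training activity, its input dataset, and the data store holding that dataset. All URIs, prefixes, and property names here are illustrative placeholders (the `ex:` terms in particular are assumptions), not the actual PROVLake/PROV-ML vocabulary.

```python
# Sketch: one training run as W3C PROV-style (subject, predicate, object)
# triples. PROV terms (prov:used, prov:wasGeneratedBy, ...) are standard;
# the ex:-prefixed names are illustrative only.

def training_provenance(model, dataset, training_run, store, agent):
    """Link a model (prov:Entity) to the training activity and its inputs."""
    return [
        (training_run, "rdf:type",               "prov:Activity"),
        (dataset,      "rdf:type",               "prov:Entity"),
        (model,        "rdf:type",               "prov:Entity"),
        (training_run, "prov:used",              dataset),
        (model,        "prov:wasGeneratedBy",    training_run),
        (model,        "prov:wasDerivedFrom",    dataset),
        (training_run, "prov:wasAssociatedWith", agent),
        # data lake awareness: where the dataset physically resides
        (dataset,      "ex:storedIn",            store),
    ]

triples = training_provenance(
    "ex:model188", "ex:train_hdf5", "ex:trainingRun7",
    "ex:gpfs_store", "ex:geoscientist1")

# Walking the edges answers lineage questions, e.g. what generated the model:
generated_by = [o for s, p, o in triples
                if s == "ex:model188" and p == "prov:wasGeneratedBy"]
```

Because the representation is just a graph of PROV relationships, domain, ML, and storage facts compose in the same structure, which is what enables the integrated queries discussed later.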
DDP 4: Keeping Prospection and Retrospection Related but Separated.
Davidson and Freire explain that prospective provenance captures the specification of a workflow, i.e., the recipe of which data transformations will be processed and their inputs and outputs. In contrast, retrospective provenance captures the data that were actually consumed and produced, along with a detailed execution log about the computational tasks and execution environment [25]. Prospective provenance provides the abstraction layer to specify provenance analyses, often giving semantics to the retrospective provenance data generated during the workflows' execution. Also, there are cases in which a provenance analysis uses only one kind of provenance data. Therefore, managing both kinds of provenance data, and, more importantly, keeping a strong connection between them, is essential for the MLHolView, and this should be reflected in the provenance data modeling.
DDP 5: Designing a Focused Conceptual Data Schema.
To provide the specialized semantics needed by the MLHolView, we propose a conceptual data schema focusing on the key concepts identified by the characterization in Section 2. The concepts are driven by the lifecycle phases and the data they manipulate: the phases are illustrated with a gray background and the four main kinds of data with a white background in the UML class diagram in Figure 5.
Fig. 5: Conceptual data schema of the key concepts of the lifecycle and their supporting graphs.

In the four data concept classes, each instance represents one dataset, i.e., a set of data elements that, combined, form one meaningful set of data for a given application. As any dataset, it may have a data schema that varies depending on the application; it may be further decomposed into several interrelated subdatasets (or subconcepts for a given application); and there may be related metadata, such as where it is physically stored and data sizes. For example, in the case of domain data, a set of well log files forms a dataset whose application is to serve as a training dataset to train an ML algorithm to find well tops; and the combination of one seismic data file with metadata about the geographic location of the seismic data acquisition and associated oil basins forms a dataset whose application is to curate the seismic data. In the case of models, there may be metadata about the model performance and hyperparameters. With respect to the three phases' classes, each can be further decomposed into workflows with associated execution data. A Learning instance can be qualified into training, validation, and evaluation. With respect to relationships, each Data Curation instance consumes a Raw Domain Data instance and generates a Curated Domain Data instance. Then, each Curated Domain Data instance may be consumed by one or more Data Preparation instances, which in turn may consume one or multiple Curated Domain Data instances (i.e., an n:m relationship). For instance, a learning algorithm may require the preparation of well log data and seismic data, jointly, and thus two sets of Curated Domain Data would need to be related to the Data Preparation instance. Finally, each Data Preparation instance generates a Learning Data instance to be consumed by one Learning process that generates one Model instance. Typically, during a learning phase, there are multiple Learning instances, each generating a Model instance.
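The concepts and multiplicities above can be sketched as plain Python dataclasses. The class and field names below are assumptions for this example; the paper fixes only the key concepts and their relationships.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the conceptual schema in Fig. 5.

@dataclass
class Dataset:
    name: str
    uri: str                      # where it is physically stored

@dataclass
class DataCuration:               # consumes 1 Raw, generates 1 Curated (1:1)
    raw: Dataset
    curated: Dataset

@dataclass
class DataPreparation:            # n:m with Curated Domain Data
    curated_inputs: List[Dataset]
    learning_data: Dataset        # generates 1 Learning Data

@dataclass
class Learning:                   # consumes 1 Learning Data, generates 1 Model
    learning_data: Dataset
    model: Dataset
    kind: str = "training"        # training | validation | evaluation

# Example: well log and seismic data prepared jointly, as in the text.
wells = Dataset("well_logs", "file:///curated/well_logs")
seismic = Dataset("seismic", "file:///curated/netherlands_seismic")
prep = DataPreparation([wells, seismic],
                       Dataset("train_set", "file:///input/train.hdf5"))
run = Learning(prep.learning_data,
               Dataset("model", "file:///models/model.hdf5"))
```

Note how the n:m relationship appears as the list of curated inputs in `DataPreparation`, while the 1:1 links are single fields.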
SDP 1: Portable and Distributed Capture Control.
As discussed, the workflows execute in highly distributed, heterogeneous environments, processing data in heterogeneous data stores and executing within heterogeneous software. To address this distributed execution control, the provenance system should be portable, with distributed capture control, so that there may be multiple provenance data capturers spread out across the multiple executing workflows. To address the heterogeneity of how workflows are executed, the provenance system cannot be tightly coupled with a specific workflow tool; rather, it should be pluggable to any of the aforementioned heterogeneous ways of executing workflows. The distributed captured data are ultimately integrated in the unified MLHolView.

SDP 2: Specialized Microservices in a Distributed Architecture.

In addition to the distributed capture control, designing a provenance system using a microservices architecture allows for the flexibility needed for large-scale deployments in hybrid environments. The provenance system can be decomposed into smaller, stateless microservices with specialized functions and, more importantly, this enables components of the provenance system architecture to be deployed wherever it best fits the workflow having its provenance captured. For instance, provenance capture components can be deployed geographically near (or inside) the machine where the workflow runs, to reduce the latency caused by communication costs, while other heavyweight provenance-specific processes (e.g., creating the linkages, inserting in the DBMS) and the DBMS itself can be deployed elsewhere, to reduce concurrency with the running workflows. A real deployment exploring this flexibility to place the architectural components to reduce communication costs and concurrency is shown in Section 4.1.
SDP 3: Strategies for a Scalable Capture.
Since many of these workflows require HPC, the provenance capture system should not add significant performance penalties to the running workflows, which requires designing strategies for scalable data capture. In addition to reducing concurrency, as described in SDP 2, which is one of these strategies, other strategies to reduce performance overhead are as follows. During capture, all calls from the running workflows to the provenance system should be asynchronous and should not wait for the data capture request to be completely processed, avoiding periods of waiting in the running workflow. Also, batches of data capture requests from the running workflows can be queued and sent to the provenance system at once, avoiding keeping multiple communication channels open between the running workflow and the provenance system. These batches are then received by the provenance system, which should process the requests in a batch in parallel, to reduce the time between the provenance capture in the workflow and the data becoming readily available for queries in the MLHolView. Moreover, during capture, the provenance system component responsible for creating the data linkages should avoid read operations on the underlying DBMS and should only append data to the DBMS. This is because read operations on the DBMS inevitably have to wait for the query response, potentially increasing latency in the provenance capture. Finally, the only component in direct contact with a running workflow should be a lightweight provenance capture library, shielding the workflows from possible slowness in other components. The key for such a lightweight library is to significantly reduce provenance-specific code in a workflow, consequently reducing provenance-specific calls during execution, and to strictly follow the insert-only policy, so that no queries to the DBMS are made by the library, avoiding waits. The provenance-specific descriptions, essential for the specification of the workflows, are stored as prospective provenance data externally to the actual workflow. The provenance library (on the client side of the system) does not need these specifications, which are essential for the server side of the system, so that the linkages that form the MLHolView can be provided. A side effect of reducing provenance calls in a workflow is that it also reduces the changes needed in the workflow, making it look as similar as possible to the original workflow without the hooks [9], [32].
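The asynchronous, batched capture strategy can be sketched as a client that enqueues events and ships them in batches from a background thread. This is a minimal illustration, not ProvLake's implementation; `send_batch` stands in for the single request that ships a batch of capture events to the provenance system.

```python
import queue
import threading

class ProvCaptureClient:
    """Sketch of asynchronous, batched provenance capture (SDP 3)."""

    def __init__(self, send_batch, batch_size=10):
        self._q = queue.Queue()
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def capture(self, event):
        """Called by the workflow: enqueue and return immediately."""
        self._q.put(event)

    def _drain(self):
        batch = []
        while True:
            event = self._q.get()
            if event is None:                  # shutdown sentinel
                break
            batch.append(event)
            # flush when the batch is full or the producer paused
            if len(batch) >= self._batch_size or self._q.empty():
                self._send_batch(batch)
                batch = []
        if batch:                              # flush any leftover events
            self._send_batch(batch)

    def close(self):
        self._q.put(None)
        self._worker.join()

sent_batches = []
client = ProvCaptureClient(sent_batches.append, batch_size=5)
for i in range(12):
    client.capture({"task": i, "status": "finished"})
client.close()
total = sum(len(b) for b in sent_batches)
```

The workflow thread never blocks on the provenance system: `capture` only enqueues, and the background worker amortizes communication over batches.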
SDP 4: Easing Data Linkage with Unique Data Identifiers.
The concept of using unique identifiers is useful for keeping track of data in provenance systems [39], [43]. Existing approaches keep track of data files consumed and produced in the workflows; here we extend this concept to keep track of every data value that participates in the MLHolView, even scalar values. Thus, every attribute-value pair consumed or produced in any data transformation participating in any workflow receives a unique identifier. Then, whenever an attribute-value pair generated by one data transformation is consumed by another, the provenance system can reuse the identifier, keeping track of the paths between transformations and, thus, keeping the workflows interconnected.
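A minimal sketch of this identifier scheme, under assumed names (`IdRegistry`, `produced`, `consumed` are illustrative, not ProvLake's API): each produced attribute-value pair gets a fresh identifier, and consuming transformations record the same identifier, which interconnects the workflows.

```python
import itertools

class IdRegistry:
    """Sketch of SDP 4: unique identifiers for attribute-value pairs."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._producers = {}   # value_id -> (task, attribute, value)
        self._edges = []       # (producer_task, value_id, consumer_task)

    def produced(self, task, attribute, value):
        value_id = "v{}".format(next(self._counter))
        self._producers[value_id] = (task, attribute, value)
        return value_id

    def consumed(self, task, value_id):
        producer_task, _, _ = self._producers[value_id]
        self._edges.append((producer_task, value_id, task))

    def upstream_of(self, task):
        """Tasks whose outputs this task consumed (one hop in the path)."""
        return {p for (p, _, c) in self._edges if c == task}

reg = IdRegistry()
vid = reg.produced("data_preparation", "train_path", "/input/train.hdf5")
reg.consumed("training", vid)     # training reuses the same identifier
```

Even a scalar like a file path becomes a first-class node: following the stored edges reconstructs the paths between transformations across workflows.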
SDP 5: Workflow Design and Adding the Provenance Capture Hooks.
To enable the context-awareness (DDP 2), the first step is to design the workflows with context-awareness. For this, for each of the multiple workflows in a given project, one needs to specify its data transformations with input datasets, parameters, and expected outputs. Each computational process (data transformation) and the datasets it transforms are qualified according to the MLHolView's conceptual data schema (DDP 5). When specifying data references, the physical location where the referenced data are expected to be stored should be provided, as well as metadata about the execution environment where the workflow will execute. Finally, the relationships between the workflows and the data in the distributed data stores need to be specified. Such a specification can be maintained in configuration files, which inform the provenance capture system, enabling it to create the linkages that provide the context-aware integration of domain, ML, and hybrid environments and stores using provenance. After the specification, hooks can be added to the workflows before and after each data transformation, informing the key concept (following the MLHolView's conceptual data schema) in each data transformation and data reference. A data transformation execution is encapsulated by a provenance capture task, which typically corresponds to a function call, a program execution, a web service call, or an iteration in an iterative workflow.
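Such hooks can be sketched as a decorator that emits one event before and one after each data transformation, qualified by the key concept from the prospective specification. The decorator and `emit` function below are assumptions for illustration, standing in for the actual capture library calls.

```python
import functools
import time
import uuid

captured_events = []

def emit(event):
    """Stand-in for the asynchronous capture-library call."""
    captured_events.append(event)

def prov_task(workflow, transformation, concept):
    """`concept` follows the conceptual data schema, e.g. 'Learning'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**inputs):
            task_id = str(uuid.uuid4())
            emit({"task": task_id, "workflow": workflow,
                  "transformation": transformation, "concept": concept,
                  "status": "started", "inputs": inputs,
                  "time": time.time()})
            outputs = func(**inputs)
            emit({"task": task_id, "status": "finished",
                  "outputs": outputs, "time": time.time()})
            return outputs
        return wrapper
    return decorator

@prov_task("seismic_wf", "train", concept="Learning")
def train(max_epochs, learning_rate):
    # placeholder for the actual training code
    return {"model_path": "/models/model.hdf5"}

result = train(max_epochs=300, learning_rate=0.01)
```

The hook keeps provenance-specific code out of the transformation body itself, matching the goal of leaving the workflow as close as possible to its original form.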
Reference System Architecture.
Based on these system design principles, our proposed reference architecture is illustrated in Figure 6 and is described as follows. There are M environments (e.g., HPC clusters, Kubernetes clusters) and N workflows in all phases of the lifecycle, distributed on these environments. Each workflow may use heterogeneous data stores and may be implemented as a standalone script, as a workflow in a WMS, as a composition of microservice calls, or as a pipeline in a data processing or ML framework. Provenance capture hooks, through a lightweight ProvLib, are added to capture provenance data at each data transformation in each of these workflows. At the beginning and end of each (potentially parallel) data transformation execution of each (potentially parallel) workflow, a provenance capture event is emitted by the ProvLib. Thus a provenance capture event has the granularity of a data transformation execution, with the corresponding input data (at the beginning) and output data (at the end). These events are asynchronously sent to a Message Broker, such as Apache Kafka, or any lightweight repository that persists the queue of capture requests. Then, the ProvConsumer, a lightweight service that runs in the background, consumes from this queue and sends the requests to the ProvManager, which is aware of the prospective provenance data, can create the context-aware linkages using W3C PROV-based relationships and the reuse of unique identifiers (DDP 4), and sends the data to the MLHolView, which is managed by a DBMS, typically a knowledge graph DBMS. The (Message Broker, ProvConsumer) pair is instantiated in each environment to reduce communication costs with the ProvLib. The ProvManager is a RESTful, stateless service and can receive provenance capture requests in any order; thus it uses a lightweight key-value DBMS (e.g., Redis) to manage state when needed (e.g., to link a just-received request with another request sent before). During the execution of these workflows, users or applications may submit provenance analyses through a Query API that communicates with the ProvQuery component, a RESTful service responsible for implementing query-building strategies using the query language of the MLHolView's DBMS and returning the results to the requesting client.
Fig. 6: Reference system architecture to manage workflow provenance in the lifecycle of scientific ML.
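The granularity described above (one event per data transformation execution, emitted at its beginning and end) can be sketched as the serialized message the ProvLib would hand to the Message Broker. The field names are assumptions for this sketch, not ProvLake's actual wire format.

```python
import json
import time

def capture_event(workflow_id, transformation, phase, payload):
    """`phase` is 'begin' (input data) or 'end' (output data)."""
    return json.dumps({
        "workflow_id": workflow_id,
        "transformation": transformation,
        "phase": phase,
        "payload": payload,      # attribute-value pairs or data references
        "timestamp": time.time(),
    })

# The Message Broker persists the serialized event; the ProvConsumer later
# forwards it to the ProvManager, which creates the PROV linkages.
msg = capture_event("learning_wf_01", "training", "begin",
                    {"train_set": "/input/train.hdf5", "batch_size": 60})
decoded = json.loads(msg)
```

Serializing to a self-describing format lets the ProvManager accept events in any order, as required by its stateless design.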
We propose a generic provenance data representation for the lifecycle of scientific ML based on the data design principles, which is the first one to the best of our knowledge. PROV-ML is depicted in Fig. 7, where the light-color classes represent prospective provenance, and the dark-color classes, retrospective provenance. PROV-ML provides rich semantics and details based on the conceptual data model of the lifecycle's fundamental concepts (DDP 5), especially the ones in the learning phase. The colors in the figure map to these concepts: the blue-shaded classes account for the Learning Data; the gray-shaded, for the Learning; and the yellow-shaded, for the Model. The stereotypes indicated in the figure represent the classes inherited from PROVLake. All classes illustrated in the figure are individually described in Table 2. We briefly discuss the PROV-ML classes here; further details are available online [44].

In PROV-ML, the Study class introduces a series of experiments, portrayed by the LearningExperiment class, which defines one of the three major phases in the lifecycle, the Learning phase. A learning experiment comprises a set of learning stages, represented by the BaseLearningStage class, which are the primary data transformations within the Learning phase and with which the agent (Persona class) is associated. The base learning stage serves as an abstract class from which the LearningStage and LearningStageSection classes inherit. Also, it relates to the ML algorithm, represented by the Algorithm class, used in the stage, which might be defined in the context of a specific ML task (e.g., classification, regression), represented by the LearningTask class. This approach allows both the learning stage and the learning stage section to conserve their relationships with other classes while granting them the special characteristics discussed in the following. A learning stage varies regarding its type, i.e., the Training, Validation, and Evaluation classes. The provision of a specific class for the learning stage allows the explicit representation of the relationship between the Learning Data Preparation phase, through its Learning Data, and the Learning phase of an ML lifecycle. The LearningStageSection class introduces the sectioning semantics that grant capabilities of referencing subparts of the learning stage and of the data. An example of the relevance of sectioning elements is the ability to reference a specific epoch within a training stage, or a set of batches within a specific epoch. The Learning Data appears in the model through the LearningDataSetReference class. Another data transformation specified in PROV-ML is the FeatureExtraction class, which represents the process that transforms the learning dataset into a set of features, represented by the FeatureSet class. This modeling favors the reproducibility of the ML experiment since it relates the dataset with the feature extraction process and the resulting feature set.

Further fundamental aspects regarding the Learning phase are the outputs and the parametrization used to produce these outputs. The ModelSchema class describes the characteristics of the models produced in a learning stage or learning stage section, such as the number of layers of a neural network or the number of trees in a random forest. The ModelProspection class represents the prospected ML models, i.e., the reference for the ML models learned during a learning stage or learning stage section of a training stage. In addition to the data produced in the Learning phase is the EvaluationMeasure class. This class, combined with the EvaluationProcedure and EvaluationSpecification classes, provides the representation of evaluation mechanisms for the produced ML models during any stage of learning. Specifically: an evaluation measure defines an overall metric used to evaluate a learning stage (e.g., accuracy, F1-score, area under the curve); an evaluation specification defines the set of evaluation measures used in the evaluation of learned models; and an evaluation procedure serves as the model evaluation framework, i.e., it details the evaluation process and the methods used. On the parametrization aspect, PROV-ML affords two classes: LearningHyperparameter and ModelHyperparameter. The first represents the hyperparameters used in a learning stage or learning stage section (e.g., max training epochs, weights initialization). The second is used in the representation of the models' hyperparameters (e.g., network weights). Finally, PROV-ML addresses the retrospective counterpart of the classes mentioned above: the classes ending in Execution and Value are the retrospective analogues of the data transformations and the attributes, respectively.

Fig. 7: PROV-ML: a W3C PROV- and W3C ML Schema-compliant provenance data representation for scientific ML. A larger visualization is available online [44].

TABLE 2: PROV-ML data representation classes.
Class: Description

Study: Investigation (e.g., research hypothesis) leading to ML workflow definitions.
LearningExperiment: The set of analyses (e.g., research questions) that drives the ML workflow.
LearningProcessExecution: An ML workflow execution. This is equivalent to mls:Run and was renamed to explicitly preserve the aspects of retrospective provenance, which are not explicitly handled in MLS.
LearningTask and LearningTaskValue: Defines the goal of a learning process, i.e., the ML task (e.g., LearningTask: Classification; LearningTaskValue: Seismic Stratigraphic Classification).
BaseLearningStage and BaseLearningStageExecution: Abstract classes of LearningStage and LearningStageSection, and their execution counterparts, used to conserve the relationships among other classes while granting them special characteristics.
LearningStage and LearningStageExecution: Defines a (Training, Validation, or Evaluation) learning stage and its execution.
LearningStageSection and LearningSectionExecution: Introduces the sectioning semantics, i.e., capabilities for provenance of subparts of the learning stage and corresponding data.
LearningDatasetReference and LearningDataset: Defines the dataset to be used by a LearningStage or LearningStageSection; in the latter case, it is a section of a LearningDatasetReference. LearningDataset is the dataset used in the execution.
DatasetCharacteristic and DatasetCharacteristicValue: Defines metadata about the LearningDatasetReference; DatasetCharacteristicValue relates to a LearningDataset.
FeatureSet: The set of features a FeatureExtraction should generate over a LearningDatasetReference.
FeatureSetCharacteristic: Defines the set of metadata that describes the FeatureSet (e.g., number of features, features' type).
Software: Defines a collection of ML techniques' implementations (e.g., Scikit-Learn).
Algorithm: ML technique with no associated technology, software, or implementation (e.g., the k-means clustering technique).
Implementation: Defines the retrospective aspect of an Algorithm, i.e., an ML technique's implementation in a software (e.g., Scikit-Learn's k-means implementation).
ImplementationCharacteristicValue: Defines the implementation's set of metadata (properties and values), e.g., version, git hash.
LearningHyperparameter: Defines a prior parameter of an Algorithm used by a LearningStage or LearningStageSection.
LearningHyperparameterValue: Defines the parameter values of an execution (e.g., the k value in a k-means clustering technique, the range of epochs in a neural network training).
ModelSchema: The scope of the resulting model.
ModelProspection and Model: The resulting model a LearningStage or a LearningStageSection should generate, and the generated value (e.g., the trained model after the training stage).
ModelHyperparameter and ModelHyperparameterValue: Hyperparameters a LearningStage or a LearningStageSection generates, and their values corresponding to the resulting model (e.g., the epoch at which the resulting model was generated).
DataStoreInstance: Storage of the resulting model.
EvaluationMeasure and ModelEvaluation: A measure a LearningStage or a LearningStageSection should evaluate, and the generated value (e.g., the precision of a classifier model).
EvaluationSpecification and EvaluationProcedure: Classes directly inherited from MLS, with their semantics preserved.
In this section, we provide an experimental validation of the design principles to build and query the MLHolView to support the lifecycle of scientific ML in a real case study in the O&G industry. First, we explain how we implement and deploy the provenance system used in the evaluation (Sec. 4.1). Then, we show a running example of which data are captured during the execution of the workflows to answer the exemplary queries Q1–Q7 (Sec. 4.2). After that, we present performance and scalability analyses of the system (Sec. 4.3). Then, we discuss the benefits of PROV-ML both in terms of easing queries and query performance (Sec. 4.4). Finally, we conclude with lessons learned from this evaluation (Sec. 4.5).
ProvLake [44] is a provenance system capable of capturing, integrating, and querying data across distributed services, programs, scripts, and data stores used by multiple computational workflows, using provenance data management techniques [9], [22]. In this section, we explain how we implement these principles to enable ProvLake to build the MLHolView and how ProvLake is deployed to support the lifecycle in our case study.
ProvLake Architecture.
The ProvLake architecture is an implementation of the reference architecture (Fig. 6). Details about this architecture can be found in our previous work [22]. Here we give a brief summary, highlighting how its components map to the reference architecture proposed in this paper. The ProvLake Library (PLLib) [45] maps to the ProvLib. ProvTracker implements a simple queue management to receive the provenance capture events coming from the library and also implements a queue consumer, thus working both as the message broker and the provenance consumer in the reference architecture. ProvManager maps directly to its namesake in the reference architecture, and the PolyProvQueryEngine is the component for building the provenance queries and sending them to the DBMS managing the MLHolView. As described in principle SDP 5, the workflows are specified using prospective provenance data stored as configuration files. Data transformations that are specific and standard in ML workflows, e.g., training, validation, and evaluation, are defined beforehand following the conceptual data schema for the key concepts (DDP 5) and PROV-ML (Sec. 3.3) for attributes such as hyperparameters and model evaluation attributes. ProvTracker uses the specified prospective provenance data to provide the tracking by creating the relationships of the retrospective provenance data continuously sent by the PLLib calls added to the workflows. ProvTracker gives unique identifiers (SDP 4) to every data value captured, and when there are data references (e.g., references to files or identifiers in a database table, or any analogous data reference), it creates a knowledge graph relationship between the data value and the data store [9]. ProvManager transforms the captured data into RDF triples (the data model of the DBMS in use by ProvLake in this implementation) following the PROV-ML ontology (when capturing data in the learning phase) and the PROVLake ontology (when capturing data in the previous phases of the lifecycle).
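A hedged sketch of this triple generation for one captured training execution. The prefix and predicate names below are illustrative placeholders; the actual PROV-ML ontology is published online [44].

```python
# Sketch: turning one captured training execution into RDF-style triples
# in the spirit of PROV-ML. EX and the predicate names are assumptions.

EX = "http://example.org/provml/"

def training_execution_triples(run_id, hyperparams, measures, model_uri):
    run = EX + "TrainingExecution/" + run_id
    triples = [
        (run, EX + "type", EX + "TrainingExecution"),
        (model_uri, EX + "wasGeneratedBy", run),
    ]
    for name, value in hyperparams.items():
        # retrospective LearningHyperparameterValue counterpart
        triples.append((run, EX + "hasLearningHyperparameterValue",
                        "{}={}".format(name, value)))
    for name, value in measures.items():
        # retrospective ModelEvaluation counterpart
        triples.append((run, EX + "hasModelEvaluation",
                        "{}={}".format(name, value)))
    return triples

triples = training_execution_triples(
    "188",
    hyperparams={"learning_rate": 0.01, "batch_size": 60},
    measures={"loss": 0.0022},
    model_uri=EX + "Model/model188")
```

Once stored in the knowledge graph DBMS, such triples are what the PolyProvQueryEngine traverses to answer the integrated queries.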
ProvLake Deployment in the Case Study.
The deployment in our case study also follows the system design principles (Sec. 3.2). It uses two clusters: a Kubernetes cloud cluster for the data curation and learning data preparation workflows, and a large HPC cluster with CPUs and GPUs for the workflows in the learning phase. PLLib is the only component in direct contact with the users' workflows running in the clusters (SDP 3). This deployment is illustrated in detail in our previous paper [22].
Hardware Setup.
The experiments use three environments. An HPC cluster for learning workflows, which has 393 Intel and Power8 nodes, each with 24 to 48 CPU cores and 256 to 512 GB RAM, interconnected via InfiniBand, sharing about 3.45 PB in a GPFS, and using in total 946 GPUs (NVIDIA Tesla K40 and K80, with 2880 and 4992 CUDA cores, respectively); a Kubernetes cloud cluster for data processing, which has 4 nodes, each with 16 GB RAM and 8 cores; and a server machine with an Intel Core i7-7700T CPU at 2.40 GHz, 8 GB DDR4 RAM, and a 128 GB Liteon SSD.
Software Setup.
ProvManager, PolyProvQueryEngine, and the provenance DBMS are deployed on a virtual Kubernetes cluster with two nodes, each with 4 vCores and 16 GB RAM, virtualized on top of the data processing cluster. ProvManager's queue size is set to 50, and ProvTracker's threads are set to 120. The workflow scripts of our use case are implemented in Python using multiple libraries, such as libraries to manipulate raw seismic files and for learning (PyTorch v1.1), and execute on the learning cluster. For the query performance tests, we deployed three different DBMSs on the server machine: Apache Jena TDB 3.12, Allegro 6.6.0, and Blazegraph 2.1.5.
In this section, we investigate whether our approach supports the lifecycle by enabling users to perform comprehensive, i.e., end-to-end, analyses that integrate the data consumed and generated in the workflows, from raw domain data to learned models. More specifically, we investigate whether the proposed data design principles (Sec. 3.1) can be applied to answer queries that perform such data integration. We explore the O&G use case described in Section 2 and validate whether the data tracked by ProvLake, inserted in the MLHolView implementing the PROV-ML representation, can answer the queries Q1–Q7. Fig. 8 shows the phases of the lifecycle in this use case. Next, we describe the workflows of the use case and how ProvLake tracks the data.
[Figure: for each lifecycle step (Data Curation; Learning Data Preparation; Learning), the workflows, the data stores used (file system, document DBMS, triple store), and examples of captured data: raw data extraction metadata (file reference /data/netherlands.sgy, file size, inline/crossline ranges, geographic coordinates), data transformation parameters (curated and annotated data references, seismic slice ranges, noise threshold, tile size), data reference tracking (input/output file references, document identifiers, instance URIs, data store metadata and access information), training hyperparameters (batch_size: 60, max_epochs: 300, learning_rates: [0.01, 0.001]), model hyperparameters (learning_rate: 0.01, epoch: 188), execution data (cluster name, host nodes, job id, start/end times), and evaluation measures (confusion matrix, loss: 2.2e-3, mean IoU: 7.22e-1).]
Fig. 8: Summarized example of provenance tracking in an O&G use case. Details on the captured data, contents, stores, and the dataflow used to answer the queries Q1–Q7 are in Table 3.

In the data curation phase, ProvLake tracks provenance while data-intensive scripts run. When processing raw files, essential data that will help answer the queries are extracted, associated with the file's URI, and stored in the provenance database. One example of such data is the geographic coordinates embedded in raw SEG-Y seismic files. Additionally, geoscientists add relevant information, based on their specialized knowledge, as input to some of those scripts, to be loaded into a domain-specific knowledge graph database, external to the provenance database but also tracked by ProvLake through links between the workflows and this domain knowledge in the graph. Relevant information includes associated oil fields, basins, oil wells, and pieces of text from PDF documents with survey information related to the geological data.

TABLE 3: Details about the captured data in the use case.
Data structurename Description Data Characteristics and Size Data Store
Geoscientist’sAnnotations Observations they do about the seismic dataset,such as its geographic global coordinates and char-acteristics about the subsurface terrain this seismicacqusition was obtained. Also, they relate the seis-mic datasets with other artifacts of interest, such aswell logs and geological basins. Semi-structured textual files Textualdocumentsin the filesystemStructured domainknowledge Domain-specific information parsed from unstruc-tured and semi-structured documents and repre-sented as structured facts in domain ontologies. En-tities in such ontologies may represent taxonomy,rules and assertions for a given domain. Stored as domain-specific knowledge graphs in aKnowledge Base, typically managed by a TripleStore Triple storeGeological LabeledData Tabular text files, where each line contains x, ypositions (floar32, float32) on Earth surface, anddepth (float32) that can be in distance or time. N x · N y · N h · bytes, where N x and N y are the num-ber of points in x and y directions, respectively. N h is the number of annotated horizons. File systemPost-stacked SEGYfile A binary file containing N x × N y stacked tracesof one particular seismic attribute, e.g. amplitude,coherence, frequency, phase. The file also includesa main header and several trace headers. H main + N x · N y · ( H trace · T size ) , where the mainheader ( H main ) takes approximately 10KB; thetrace header uses 240 bytes; and each trace containsone float32 value for each point in depth. For exam-ple, if the seismic is a volume × × ,besides the headers, it will contain × traceseach of which comprising float32 values. File systemCurated and anno-tated seismic data Merged expert annotations and the SEGY raw file.It comprises the structured knowledge about thegeological data and also cube geometry, such as in-line and crossline ranges, resolution, depth range,and unit. The expert informs which parts of theinput file are suitable or not for the task. 
Finally,it may contain legal and access information. Withthis data, it is possible to set next phase hyperpa-rameters. Stored as structured data in a combination of Doc.DBMS and Knowledge Bases with references tothe Doc. DBMS. The Doc. DBMS has hundreds ofgigabytes and the Knowledge Base has hundredsof megabytes. Triple Storeand Doc.DBMSTraining, validation,and evaluation datasets Binary files stored in HDF5 or using Google’sProtocol buffer serialization for a good balancebetween portability and speed. These files may vary a lot, depending on the con-figuration selected for data preparation workflows.From our experimental observations, it takes about10% or less of the input SEGY file. However, be-cause workflows create data sets by experimentconfiguration, it is possible to end up with a totaldata set storage multiple times bigger than theoriginal raw file. CloudObjectStore andFile systemLearned models Mix of binary and configuration files depending onthe engine used to run the learning phase (PyTorch,Tensorflow, Scikit-learn, etc.). The engine used to run defines trained models’type and size. Since we used Tensorflow backendin our experiments, we store our trained modelsusing Tensorflow’s tools, where each experimentproduces configuration and binary files. The firstone stores the model structure and other trainingparameters, and the second one stores the model’sstate. Although model size can vary from a fewMB to several GB, our models used approximately50MB per state in our experiments. Notice that onestate is just one snapshot of one step during train-ing, so if depending on the configuration settings,it is possible to have several saved states, the 50MBmay turn GB very quickly. File system acquisition process. 
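The size estimate for the post-stacked SEGY file in Table 3 can be sketched as a small helper. The header constants (≈10 KB main header, 240-byte trace headers, 4-byte float32 samples) come from the table; combining the per-trace header and payload additively is an assumption about the garbled formula, not a statement of the SEG-Y standard:

```python
# Sketch of the post-stacked SEGY size estimate from Table 3 (assumed
# additive combination of trace header and trace payload).
FLOAT32 = 4          # bytes per float32 sample
H_MAIN = 10 * 1024   # ~10 KB main header
H_TRACE = 240        # bytes per trace header

def segy_size_bytes(nx, ny, nh):
    """Estimated size of a post-stacked SEGY volume with nx*ny traces,
    each holding nh float32 depth samples."""
    trace_payload = nh * FLOAT32
    return H_MAIN + nx * ny * (H_TRACE + trace_payload)
```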
These annotations are stored in triple stores in a domain-specific database, external to the provenance database.

The learning data preparation phase includes several data transformations in a pipeline that converts the curated and annotated scientific data into training, validation, and evaluation datasets. Each transformation has parameters that specify, for instance, noise filter thresholds, input shape, or the selected seismic lines (inlines or crosslines) of the seismic cube that constitute the training dataset. Each value of these parameters, the name of the transformation, the execution data, and the references to input and output data are captured and represented in ProvLake's provenance data graph.

The entire process is interconnected: each phase produces data and passes it forward for consumption by the next one. Essentially, ProvLake tracks and maintains such interconnections in a provenance data graph composed of RDF triples. Such structures describe chained data transformations in the multiple workflows that constitute the inner phases of the major ones of the lifecycle run. RDF resources represent the data in Fig. 8, i.e., instances that extend prov:Entity and PROV-ML specializations. Each of these instances receives a URI, which works as a global identifier throughout the lifecycle (DDP4). Examples of RDF resources are learned models produced in the learning phase, models' hyperparameters, evaluation metrics, and references (file paths) to the actual model files stored in the file system. Provenance data graphs also associate execution data with learned models. Execution data may include file system metadata, the cluster's hostname and the node names used in the HPC jobs, job ids in the cluster scheduler, or start and end timestamps of each block of provenance capture events.

ProvLake can keep track of data distributed in multiple stores.
Such ability helps to maintain data relationships between raw files in the file system and structured knowledge stored in another database. Auxiliary data, such as polygons in the seismic cube, are stored in the Document DBMS. The system similarly tracks data references and relates them to the raw files. Other data, such as implementation details, software name, and version, are captured and stored in the provenance database, following PROV-ML, but, for simplicity, we do not show them in the figure. Finally, since the system tracks every data item and their relationships while the workflows execute, ProvLake enables answering online, offline, intra- and inter-training provenance queries to analyze ML data, domain-specific data, and execution data throughout the phases of the lifecycle, exemplified by the queries Q1–Q7.

To submit queries, the user sends a GET or POST request to one of PolyProvQueryEngine's endpoints. Then, PolyProvQueryEngine sends requests to ProvManager. Most of the queries are answered with simple graph traversals using standard SPARQL features. For instance, to answer Q1, the user provides a learned model URI (generated in the learning phase), and the query traverses the provenance data graph backward until reaching the raw seismic file's URI (processed in the data curation phase). One can get the geographic coordinates and the number of seismic slices by querying the extracted data related to the seismic file. In turn, to obtain the oil basin and oil field information, the query retrieves data from the resource, in the Triple Store, that represents structured knowledge about the seismic file. For Q2 and Q6, one can execute a similar graph traversal. Other queries require analytical operators, such as Q3, which requires finding the learned model with the least loss (using SPARQL's native min() operator) and returning its hyperparameters.
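As an illustration of the Q1-style backward traversal, the sketch below walks a toy provenance graph from a model's URI back to the raw file's URI. The URIs, edge labels, and dictionary layout are hypothetical stand-ins, not ProvLake's actual data; a real implementation runs SPARQL over the RDF graph.

```python
from collections import deque

# Toy provenance graph (hypothetical URIs): child -> parents, following
# prov:wasGeneratedBy / prov:used / prov:wasDerivedFrom edges backward.
EDGES = {
    "ex:model188": ["ex:training_exec_1"],
    "ex:training_exec_1": ["ex:train_dataset"],
    "ex:train_dataset": ["ex:curated_seismic"],
    "ex:curated_seismic": ["ex:raw_segy"],
}
TYPES = {"ex:raw_segy": "RawFile"}

def trace_to_raw_file(model_uri):
    """Walk backward through provenance edges until an entity typed
    RawFile is reached, mimicking Q1's SPARQL graph traversal."""
    queue, seen = deque([model_uri]), set()
    while queue:
        node = queue.popleft()
        if TYPES.get(node) == "RawFile":
            return node
        for parent in EDGES.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None
```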
Q4 and Q5 make use of execution data to provide basic statistics (min(), max(), avg() operators) about the execution time of training iterations, and Q7 retrieves the models, their hyperparameters, their evaluation measures, and the minimum batch loss per model generated when a specific learning dataset was used.

In our use case for training an autonomous identifier of geological structures (cf.
Sec. 2), the learning phase generates a large amount of provenance data at a high frequency, stressing the ProvLake services. In the deep learning model training, there are two provenance capture calls (at the beginning and end) of each batch iteration in each learning epoch. In this test, each learning workflow executes about 35 iterations per learning epoch and up to 300 epochs, generating about 15,000 provenance capture events per workflow run. ProvTracker runs on one node in the learning cluster with 24 CPU cores, whereas the learning workflows run in parallel, distributed on up to 8 nodes, each with 28 Intel CPU cores and 6 GPUs (K80). While running the workflows, PLLib captures data at runtime and sends them to ProvTracker, which in turn sends them to the ProvManager service, deployed externally on the virtual Kubernetes cluster, which finally stores them in the Prov DBMS. A provenance capture overhead analysis of ProvLake using synthetic workloads to highly stress the system, and a comparison with a competing system, has been presented in previous work [9]. Here, we evaluate the system design principles that focus on providing distributed capture control and a scalable architecture (
SDP1–SDP3). We test different settings for provenance capture and then test the scalability using real ML workloads, in both cases measuring the overall execution time of the learning workflow script. We repeat each test at least 10 times; we plot boxplots of the repetitions, and the numeric values used in-text refer to the medians of the repetitions.
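The overhead computation behind the numbers reported in this section can be sketched as follows; the run times below are illustrative values chosen to mirror the magnitudes discussed here, not the paper's raw measurements.

```python
import statistics

# Sketch of the overhead computation: the median of repeated runs with
# provenance capture is compared against the no-provenance baseline.
def overhead_pct(baseline_runs_s, prov_runs_s):
    """Relative overhead (%) of the median provenance-enabled run time."""
    base = statistics.median(baseline_runs_s)
    prov = statistics.median(prov_runs_s)
    return 100.0 * (prov - base) / base
```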
Varying Provenance Capture Settings.
The PLLib allows customizing provenance capture settings, such as the queue size and whether the provenance capture events should be persisted to the local disk rather than sent to ProvTracker. Then, if disk-only is not specified, when the scripts execute, provenance data are captured and sent to ProvTracker.

For a baseline, we first execute the training without any provenance capture; then we vary the queue size in PLLib (i.e., the number of provenance capture requests accumulated in PLLib), diskless vs. diskful (i.e., whether provenance data are saved in a log file on disk), and online vs. offline (i.e., whether provenance data are stored in the DBMS, available for online provenance queries during the execution). As for the training datasets, we use a curated and labeled real seismic dataset with a specific range of seismic slices (corresponding to a regional section of a seismic cube) defined by the model trainer. The results are in Fig. 9(a), where the fastest result is for Queue Size = 50, Diskless, Online (Setting D). Compared with the setting with no provenance capture, the added execution overhead in this case is only 8.6 seconds on top of 21.3 minutes, i.e., 0.67%, which is considered negligible.

To analyze the queue size, we compare Settings A–C with D–F and see that larger queues provide faster provenance capture, since there is less, but larger, communication with the ProvTracker service. For instance, Setting A is about 7% slower than D. However, very large queues have drawbacks, as they introduce higher latency between the event being captured in the workflow execution and the provenance record being stored in the database, caused by the retention of provenance capture events in PLLib's queue.
Nevertheless, for the settings with queue size 50 (D–F), the latency of less than 5 seconds between the actual occurrence of the event and its provenance being registered in the database, available for queries, can be considered near real-time and good enough even for training monitoring. To analyze diskless vs. diskful settings, we compare Setting A with B and C, and D with E and F. Diskless is faster than diskful, as the latter introduces more I/O operations at runtime. However, comparing only the medians, the difference is negligible (less than 0.1%). Thus, because of the higher fault tolerance provided by a diskful setting, it may be useful to append provenance data onto a file on disk, locally in the cluster where the workflow runs. Similarly, comparing the medians, we observe that the difference between online vs. offline (e.g., Setting B vs. C or E vs. F) is also small, about 1%. Therefore, despite (D) being the fastest setting, (E) may be preferred because its performance is nearly the same as (D) and it has the advantage of backup storage for provenance data, which is quite important as provenance is used for reproducibility.

Fig. 9: Performance analysis results. Figure 9(a) shows the variation of provenance capture settings (A: QSize 1, Diskless, Online; B: QSize 1, Diskful, Online; C: QSize 1, Diskful, Offline; D: QSize 50, Diskless, Online; E: QSize 50, Diskful, Online; F: QSize 50, Diskful, Offline), where Setting D adds 0.67% overhead. Figure 9(b) shows the scalability results: near-linear scalability with up to 48 GPUs and 228 CPUs.
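The queue-size tradeoff just discussed can be sketched with a simple buffered sender. The class and its interface below are hypothetical illustrations, not PLLib's actual API: events accumulate in a queue and are sent in one request per flush, so larger queues mean fewer, larger messages but higher latency for the last buffered events.

```python
# Hypothetical sketch of queue-based provenance capture buffering.
class CaptureBuffer:
    def __init__(self, queue_size, send):
        self.queue_size = queue_size
        self.send = send          # callable taking a batch of events
        self.queue = []

    def capture(self, event):
        self.queue.append(event)
        if len(self.queue) >= self.queue_size:
            self.flush()

    def flush(self):
        if self.queue:
            self.send(self.queue)
            self.queue = []

batches = []
buf = CaptureBuffer(50, batches.append)
for i in range(15_000):           # ~15,000 capture events per workflow run
    buf.capture({"event": i})
buf.flush()                       # drain any remaining events at the end
# 15,000 events with queue size 50 -> 300 requests instead of 15,000
```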
Scalability Analysis.
In this experiment, we want to confirm whether the execution strategies on an HPC cluster keep the overhead low in a real ML workload, running multiple learning workflows in parallel. We run a weak scalability test by increasing the number of processing units while increasing the data size. We use the fastest setting of the previous experiment (i.e., D) and the same seismic cube. To set up the training datasets, the trainer selects up to 8 different sets of seismic slices, where each set has the same length (i.e., nearly the same data size). Thus, for x ∈ {1, 2, 4, 8}, there are x workflows running on x nodes in parallel, summing 28x Intel CPU cores and 6x GPUs (with their CUDA cores), using in total an input dataset of size x · datasize, where datasize is the size of a dataset formed by one set of seismic slices. The results are in Fig. 9(b), where we illustrate linear scalability as a horizontal line passing through the median of the smallest setting (x = 1). Ideally, the medians should be near this line. If they are not, it means that ProvTracker is taking too long to answer, caused by high stress in the system due to too many provenance capture requests, adding latency to the training. However, we see that even in the largest setting (i.e., x = 8), the execution time remains close to the linear curve. The boxes remain within a small margin of 0.2 min (or 0.9% of the x = 1 median), between 21.4 and 21.6 min, meaning that the system delivers constant and predictable behavior even at larger scales. We note, though, that the variance grows with the scale, caused by the larger number of parallel tasks. Therefore, we conclude that, at least for this scale (up to 48 K80 GPUs), the provenance capture system delivers good scalability.

In this experiment, we analyze the benefits of PROV-ML, both qualitatively and quantitatively.
We begin with a qualitative comparison of queries that use PROV-ML, highlighting its expressiveness and the complexity of building queries with and without PROV-ML. Then, we provide a quantitative analysis to investigate whether using PROV-ML can help accelerate queries and, if so, by how much. Among the queries Q1–Q7, we select three to compare in detail: Q1, Q5, and Q7. The reason for this choice is that they increase in complexity and in how much they make use of the concepts modeled specifically in the PROV-ML ontology (i.e., emphasis on the learning phase, cf.
Sec. 3.3). Q1 is the simplest query and makes the least use of PROV-ML-specific concepts, Q7 is the most complex query with the heaviest use of PROV-ML, and Q5 is in between. We write the selected queries both with and without the PROV-ML ontology (written in OWL) using SPARQL 1.1. The query complexity stems from the number of clauses to filter, the patterns to match in the graph traversal, the aggregations and sorting, and the number of triples that satisfy the patterns to match; the number of clauses that use the PROV-ML ontology defines how much each query makes use of it.
Qualitative comparison.
Since Q7 is the most complex query and makes heavy use of PROV-ML, it helps us to illustrate whether PROV-ML eases query building, especially when there is heavy use of Learning phase concepts. Excerpts of Q7 in SPARQL with and without PROV-ML are available in Listings 1 and 2, respectively. Comparing both, since PROV-ML has specialized concepts for the Learning phase, it requires fewer clauses to express the same concept. For instance, to match triples in the training stage only, with PROV-ML we just write one clause, giving ?training its direct type, to determine the correct stage (Lst. 1). Without PROV-ML, the only resource we have to do this is to tag the data transformations that are related to training. In the ProvLake ontology, tagging of workflows, data transformations, and attributes is possible with the property provlake:tag, but since naming, schema definitions, and tagging are available only in the prospective part, we need three more clauses: one to relate the retrospective instance with its prospective instance, another to give the prospective instance its type, and a third to qualify this instance via its tag (Lst. 2). The same applies, e.g., for model evaluation (Lst. 1 vs. Lst. 2).

Listing 1: Excerpt of Q7 with PROV-ML.

    # Training stage
    ?training a provml:TrainingExecution .
    # Epoch iteration (training section)
    ?epoch_exec_training prov:wasInformedBy ?training ;
        a provml:TrainingSectionExecution .
    # Model hyperparameters
    ?epoch_training_hyperparam prov:wasGeneratedBy ?epoch_exec_training ;
        a provml:ModelHyperparameterValue ;
        prov:value ?epoch_training_hyperparam_v ;
        prov:wasDerivedFrom ?epoch_training_hparam_psp .
    ?epoch_training_hparam_psp a provml:LearningHyperparameterSetting ;
        rdfs:label ?epoch_training_hyperparam_name .
    # Model
    ?model_training prov:wasGeneratedBy ?epoch_exec_training ;
        a provml:Model .
    # Model evaluation
    ?model_training_eval prov:wasGeneratedBy ?epoch_exec_training ;
        a provml:ModelEvaluation ;
        prov:value ?model_training_eval_value .

Listing 2: The same excerpt of Q7, without PROV-ML.

    # Training stage
    ?training a provlake:DataTransformationExecution ;
        prov:wasInfluencedBy ?training_prosp .
    ?training_prosp a provlake:DataTransformation ;
        provlake:tag "Training" .
    # Epoch iteration (training section)
    ?epoch_exec_training prov:wasInformedBy ?training ;
        a provlake:DataTransformationExecution ;
        prov:wasInfluencedBy ?epoch_exec_training_psp .
    ?epoch_exec_training_psp a provlake:DataTransformation ;
        rdfs:label "Epoch Execution" .
    # Model hyperparameters
    ?epoch_training_hyperparam prov:wasGeneratedBy ?epoch_exec_training ;
        a provlake:AttributeValue ;
        prov:value ?epoch_training_hyperparam_v ;
        prov:wasDerivedFrom ?epoch_training_hparam_psp .
    ?epoch_training_hparam_psp a provlake:Attribute ;
        provlake:tag "Hyperparameter" ;
        rdfs:label ?epoch_training_hyperparam_name .
    # Model
    ?model_training a provlake:AttributeValue ;
        prov:wasGeneratedBy ?epoch_exec_training ;
        prov:wasDerivedFrom ?model_training_prosp .
    ?model_training_prosp a provlake:Attribute ;
        provlake:tag "Model" .
    # Model evaluation
    ?model_training_eval prov:wasGeneratedBy ?epoch_exec_training ;
        a provlake:AttributeValue ;
        prov:value ?model_training_eval_value ;
        prov:wasDerivedFrom ?model_training_eval_psp .
    ?model_training_eval_psp a provlake:Attribute ;
        provlake:tag "Model Evaluation" .

Therefore, we found that, when the parts of the query do not demand prospective provenance data, one needs to write three extra clauses when not using PROV-ML (a clause to relate the retrospective with the prospective instance, another to give the prospective instance its type, and a third to qualify this instance, often using tags or labels). However, when the query demands prospective provenance data, one needs only one extra clause (to qualify the instance), because the relationship and the types will be required regardless of whether PROV-ML is used.
These observations are summarized in Table 4. We verified the same behavior in queries Q1 and Q5. Thus, we conclude that PROV-ML's ability to qualify specific ML data transformations and attributes using direct types eases query building, as it reduces the number of clauses required to express ML-specific concepts compared with a data representation that does not use PROV-ML. A reduction of one to three clauses per query part was observed in all queries analyzed.

TABLE 4: Qualitative comparison of Q7 in terms of the number of clauses with and without PROV-ML.

Q7 query part | # clauses with PROV-ML | # clauses without PROV-ML
Training stage | 1 | 4
Epoch iteration | 2 | 5
Model hyperparameters | 6 | 7
Model | 2 | 5
Model evaluation | 3 | 6

Quantitative comparison.
We discussed in Section 3.1 that certain data design principles followed by our approach might accelerate queries that make use of the defined concepts, and it is known that design choices when modeling an ontology may impact query performance. In fact, a recent work evaluates schema optimization to speed up queries in knowledge graphs [38], showing that this is still a relevant topic to be investigated. Therefore, in addition to the qualitative gains discussed previously, we conduct a quantitative evaluation experiment with the goal of verifying how much (if at all) PROV-ML impacts query performance.

We generate two synthetic datasets that mimic the real use case evaluated in Sections 4.2 and 4.3. With the synthetic datasets, we can control experiment variables, such as the number of parallel learning workflows, the number of hyperparameters, model evaluation metrics, epochs, and batches per epoch, and we can generate one dataset that uses PROV-ML and another that does not. With these two synthetic datasets, it is easy to switch between with and without PROV-ML, rather than having to implement, deploy, and run the real learning workflows without PROV-ML, which would not make sense for the project goals. Both datasets are as follows: eight parallel learning workflows, each with the three stages (training, validation, and evaluation), 300 epochs, and 200 batches per epoch (i.e., 60,000 batches), where each batch is associated with batch losses and hyperparameters, and each epoch uses hyperparameters and generates models and model evaluations. In total, each dataset has 10,168,890 triples.

The performance impact depends on the number of clauses to be matched in the query and on the number of triples actually matched by the Triple Store DBMS. However, the performance also depends on the underlying DBMS that manages the MLHolView, since the DBMS might implement efficient indexing mechanisms, parallelism techniques, or data transformation strategies. Therefore, for this experiment, we analyze the three queries (Q1, Q5, Q7). Q1 does a simple graph traversal with simple pattern matching. Q5 does more complex graph traversals and needs to calculate aggregates (the average time difference per batch, per epoch), but for training stages only. Q7 also does complex traversals and needs to calculate aggregates (the minimum batch loss per epoch), but for the three stages (training, validation, and evaluation), in addition to listing hyperparameters and model performance. Since the choice of the underlying DBMS may impact the results, we analyze three different DBMSs: AllegroGraph, Blazegraph, and Jena TDB, running on the same hardware under the same conditions, with their default settings (no special fine-tuning is performed in any DBMS). We analyze the query execution time, measured at the requesting client by subtracting the timestamp taken immediately before sending the request from the timestamp taken immediately after the response arrives at the client. Results are in Figure 10, where we plot the medians of the query execution time over a hundred repetitions, or fewer when the confidence interval of the medians fell below 5%. The numeric values reported in-text also refer to the medians of the repetitions; we do not remove outliers; the heights of the confidence intervals are the error bars; and the Q7 results are in log scale.
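The client-side timing just described can be sketched as follows; the query callable is a stand-in for the real SPARQL endpoint request, not the actual client code.

```python
import statistics
import time

# Sketch of the client-side measurement: the timestamp is taken immediately
# before sending the request and immediately after the response arrives;
# the reported value is the median over repeated submissions.
def median_query_time_s(run_query, repetitions=100):
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # stand-in for submitting the SPARQL request
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```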
Fig. 10: Execution time comparison of queries using PROV-ML vs. queries without PROV-ML. Q7 is in log scale.

The results show that, for Q1, PROV-ML does not significantly impact query performance, since the queries using PROV-ML are only 1.17x and 1.05x slower in AllegroGraph and BlazeGraph, respectively, and the difference for Jena is within the error bars, i.e., not statistically significant. However, Q1 is the simplest query, with trivial graph traversals and little use of PROV-ML-specific concepts, and with query times of up to a hundred milliseconds (very fast queries). The DBMSs are likely spending more time doing data transfers than actually computing the query, which would explain the higher error bars for Q1 than for Q5 and Q7, for which the error bars are very small.

We draw a set of lessons learned after the practical experience of implementing the data and system design principles to support the lifecycle in a real deployment in an O&G industry case that uses heterogeneous environments, i.e., a Kubernetes cluster and a large HPC cluster with CPUs and GPUs. The key findings for the success of the experiments are the following:

(i)
Characterizing the lifecycle and identifying the main classes of data analysis using provenance allowed understanding of the different needs in scientific ML (Sec. 2). Particularly, it helped to understand the different personas driving the provenance capture to answer key online and offline, intra- and inter-training provenance queries. The queries were capable of analyzing ML data, domain-specific data, and execution data throughout the data curation, data preparation, and learning phases of the lifecycle in an integrated way. We observed that the data curation phase is the most complex. One needs to address it carefully to take advantage of domain-specific knowledge, which highly benefits trainers in the learning phase.

(ii)
Employing provenance tracking and a data representation that allow data integration of multiple workflows helped to address the highly heterogeneous nature of the lifecycle. To accomplish such integration, it was key to promote a holistic view of the lifecycle, end-to-end, which we called MLHolView, as described in the data design principle DDP1; it enabled the comprehensive data analyses (e.g., Q1–Q7), thus supporting the lifecycle, which is our main motivation. Due to this highly heterogeneous nature, the context awareness using domain-specific and ML data and knowledge, materialized in a knowledge graph leveraging provenance-based relationships (DDP2), enabled tracking, persisting, and querying interconnections between heterogeneous data with details about localization and data access. Furthermore, it enabled queries with rich semantics about the application domain and ML, exploring new data relationships that would not be possible without such context awareness.

(iii)
Designing a conceptual data schema focused on the key concepts enabled the design and implementation of the system, facilitating query building and the ML-specialized schema modeling. The key concepts are described in DDP5. This enabled query acceleration and facilitated query building for queries that make heavy use of ML-specific concepts, compared with a schema that does not have such specializations. Moreover, the focused schema was the basis for PROV-ML (Sec. 3.3), which served as the underlying schema for the provenance system. PROV-ML combines provenance of data lakes to address integration, embracing the heterogeneous nature of the lifecycle, with concepts for ML (DDP3). PROV-ML leverages W3C contributions for provenance, W3C PROV [23], and for ML, ML Schema [24]. We hope other systems with similar purposes can adopt such a representation.

(iv) The system design principles enabled data capture and integration in a highly heterogeneous and distributed setting, adding negligible overhead. Particularly, SDP1 and SDP2 provided the portability and flexibility needed in such deployments. The scalable strategies (SDP3) allowed the system to do this while incurring low overhead and high scalability, even in HPC workloads.
The interest in workflow provenance management has increased in recent years, driven by a major effort by the provenance community [46], [47], [48], [49], [50], [51], [31], [52], [53], [35], [54], [55], [56], [57], [58], particularly to explore possibilities of optimizing workflows with the data captured by provenance tools, and as a response to the urgent need for reproducible science, which is critical in scientific ML [59]. To exemplify, Thavasimani et al. [14] investigate provenance traces recorded during workflow executions to observe differences in results caused by minor workflow configuration differences. Other works have advanced provenance tracking techniques for heterogeneous data, stores, and environments [39], [36], [60], [61], and others have explored the intersection of provenance and blockchain [62], [63].

On the intersection between ML and provenance, other works have explored provenance to support ML workflows [17], [26], [64], [18], [19], [65], and Deelman et al. [66] characterized provenance analysis to leverage ML in support of scientific workflows. On reproducible ML models, another aspect that has become a focus of interest in the research community is the use of provenance as an essential tool to help create explainable artificial intelligence [20], [59]. In addition, some works addressed the gap between the experiments of an ML workflow execution and a standard representation to provide reproducible experiments [67], [68], [24]. Esteves et al. [67] provide a machine-readable vocabulary and a common schema for the reproducibility of ML experiments in various frameworks and workflow systems. Publio et al. [68] present a new ML data representation based on the MEX vocabulary [67] to improve processes in ML workflows, despite not having a clear separation between prospective and retrospective provenance. Samuel [69] proposes ProvBook, for the reproducibility of ML experiments using Jupyter notebooks, applying FAIR data principles. Moreno et al.
[70] proposed MLWfM to provide data concepts for ML and domain-specific awareness, but without provenance concepts and a data representation.

These works are important building blocks to support the lifecycle of scientific ML using provenance management techniques. Nevertheless, they still lack a holistic view capable of comprehensively integrating the data in the whole lifecycle, end-to-end, from raw domain data to learned models. Without such a holistic view, the ML-specific concepts cannot integrate with the specific concepts of the scientific domain, jeopardizing the comprehensive end-to-end analyses that require richer semantics about the domain integrated with rich semantics about ML.

In this work, we aimed at enabling scientists and engineers to perform comprehensive data analyses in the lifecycle of scientific ML. We proposed workflow provenance techniques to address the problem of dealing, in an integrated and comprehensive way, with the high heterogeneity of the different contexts (e.g., data, software, environments, personas) involved in the lifecycle, to enable such analyses. We proposed modeling the workflows in all phases of this lifecycle as multiple interconnected workflows. A holistic view of the data processed in these workflows is built as the workflows execute. In this way, the collaborating teams can use it as their primary source for data analyses that integrate everything from raw data to learned ML models. We called it the
Provenance-based Holistic Data View of the Lifecycle of Scientific ML (MLHolView). It is materialized as a knowledge graph with provenance-based relationships. It is aware of the contexts of the data transformations in the workflows, their (hyper)parameterizations and model metrics, the computational environments in which they run and the data stores they use, the involved personas, and how they interact with the workflows.

To be able to build this view, aware of these many dimensions of heterogeneity, we first characterized the lifecycle and proposed a taxonomy for the classes of data analyses (e.g., data, execution timing, and training timing). Then, we proposed design principles for the effective and efficient management of provenance data from these workflows. From this understanding and these design principles, we derived the PROV-ML data representation, promoting such a holistic view of the data in the workflows of the lifecycle, which is the first one to the best of our knowledge. We also proposed system design principles and a reference system architecture to provide the view with efficient provenance capture, adding negligible data capture overhead (0.67% in our experiments).

REFERENCES

[1] J. Hesthaven and G. Karniadakis. Scientific machine learning workshop. [Online]. Available: https://icerm.brown.edu/events/ht19-1-sml
[2] Y. Gil, S. A. Pierce, H. Babaie, A. Banerjee, K. Borne, G. Bust, M. Cheatham, I. Ebert-Uphoff, C. Gomes, M. Hill, J. Horel, L. Hsu, J. Kinter, C. Knoblock, D. Krum, V. Kumar, P. Lermusiaux, Y. Liu, C. North, V. Pankratius, S. Peters, B. Plale, A. Pope, S. Ravela, J. Restrepo, A. Ridley, H. Samet, and S. Shekhar, "Intelligent systems for geosciences: an essential research agenda,"
CACM, 2018.
[3] M. Raissi, P. Perdikaris, and G. Karniadakis, “Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” J. Comp. Physics, 2019.
[4] E. Rodrigues, I. Oliveira, R. Cunha, and M. Netto, “DeepDownscale: a deep learning strategy for high-resolution weather forecast,” in IEEE eScience, 2018.
[5] D. S. Chevitarese, D. Szwarcman, E. V. Brazil, and B. Zadrozny, “Efficient classification of seismic textures,” in IJCNN, 2018.
[6] M. Mattoso, C. Werner, G. Travassos, V. Braganholo, E. Ogasawara, D. de Oliveira, S. Cruz, W. Martinho, and L. Murta, “Towards supporting the life cycle of large-scale scientific experiments,” IJBPIM, 2010.
[7] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “ModelHub: Deep learning lifecycle management,” in ICDE, 2017.
[8] S. Schelter, J.-H. Boese, J. Kirschnick, T. Klein, and S. Seufert, “Automatically tracking metadata and provenance of machine learning experiments,” in MLS@NIPS, 2017.
[9] R. Souza, L. Azevedo, R. Thiago, E. Soares, M. Nery, M. A. S. Netto, E. V. Brazil, R. Cerqueira, P. Valduriez, and M. Mattoso, “Efficient runtime capture of multiworkflow data using provenance,” in IEEE eScience, 2019.
[10] N. Polyzotis, S. Roy, S. Whang, and M. Zinkevich, “Data lifecycle challenges in production machine learning: a survey,” SIGMOD Rec., 2018.
[11] M. Herschel, R. Diestelkämper, and H. B. Lahmar, “A survey on provenance: What for? what form? what from?” VLDB, 2017.
[12] L. Moreau, B. Ludäscher, I. Altintas, R. S. Barga, S. Bowers, S. Callahan, G. Chin Jr., B. Clifford, S. Cohen, S. Cohen-Boulakia, S. Davidson, E. Deelman, L. Digiampietri, I. Foster, J. Freire, J. Frew, J. Futrelle, T. Gibson, Y. Gil, C. Goble, J. Golbeck, P. Groth, D. A. Holland, S. Jiang, J. Kim, D. Koop, A. Krenek, T. McPhillips, G. Mehta, S. Miles, D. Metzger, S. Munroe, J. Myers, B. Plale, N. Podhorszki, V. Ratnakar, E. Santos, C. Scheidegger, K. Schuchardt, M. Seltzer, Y. L. Simmhan, C. Silva, P. Slaughter, E. Stephan, R. Stevens, D. Turi, H. Vo, M. Wilde, J. Zhao, and Y. Zhao, “Special issue: The first provenance challenge,” CCPE, 2008.
[13] P. Buneman and W.-C. Tan, “Data provenance: What next?” SIGMOD Rec., 2019.
[14] P. Thavasimani and P. Missier, “Facilitating reproducible research by investigating computational metadata,” in IEEE Big Data, 2016.
[15] R. Souza, V. Silva, J. J. Camata, A. Coutinho, P. Valduriez, and M. Mattoso, “Keeping track of user steering actions in dynamic workflows,” FGCS, 2019.
[16] V. Sousa, D. Oliveira, P. Valduriez, and M. Mattoso, “Analyzing related raw data files through dataflows,” CCPE, 2016.
[17] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “Towards unified data and lifecycle management for deep learning,” in ICDE, 2017.
[18] M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar, “Accelerating the machine learning lifecycle with MLflow,” IEEE Data Eng. Bulletin, 2018.
[19] D. Pina, L. Kunstmann, A. Paes, D. Oliveira, and M. Mattoso, “Análise de hiperparâmetros em aplicações de aprendizado profundo por meio de dados de proveniência,” in SBBD, 2019.
[20] C. Lucero, B. Coronado, O. Hui, and D. S. Lange, “Exploring explainable artificial intelligence and autonomy through provenance,” XAI@IJCAI, 2018.
[21] M. Balazinska, S. Chaudhuri, A. Ailamaki, J. Freire, S. Krishnamurthy, and M. Stonebraker, “The next 5 years: what opportunities should the database community seize to maximize its impact?” in SIGMOD, 2020.
[22] R. Souza, L. Azevedo, V. Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. Brazil, M. Moreno, P. Valduriez, M. Mattoso, R. Cerqueira, and M. A. S. Netto, “Provenance data in the machine learning lifecycle in computational science and engineering,” in WORKS@Supercomputing.
SIGMOD, 2008.
[26] Z. Miao, Q. Zeng, B. Glavic, and S. Roy, “Going beyond provenance: explaining query answers with pattern-based counterbalances,” in SIGMOD, 2019.
[27] R. F. Silva, R. Filgueira, I. Pietri, M. Jiang, R. Sakellariou, and E. Deelman, “A characterization of workflow management systems for extreme-scale applications,”
FGCS, 2017.
[28] G. Guerra, F. A. Rochinha, R. Elias, D. De Oliveira, E. Ogasawara, J. F. Dias, M. Mattoso, and A. L. Coutinho, “Uncertainty quantification in computational predictive models for fluid dynamics using a workflow management engine,” Int. J. Uncertain. Quantif., 2012.
[29] R. Souza, V. Silva, A. L. Coutinho, P. Valduriez, and M. Mattoso, “Data reduction in scientific workflows using provenance monitoring and user steering,” FGCS, 2017.
[30] J. Pimentel, J. Freire, L. Murta, and V. Braganholo, “A survey on collecting, managing, and analyzing provenance from scripts,” ACM Surv., 2019.
[31] L. F. Sikos and D. Philp, “Provenance-aware knowledge representation: a survey of data models and contextualized knowledge graphs,” Data Sci. Eng., 2020.
[32] R. M. Thiago, R. Souza, L. Azevedo, E. F. D. S. Soares, R. Santos, W. Dos Santos, M. De Bayser, M. C. Cardoso, M. F. Moreno, and R. Cerqueira, “Managing data lineage of O&G machine learning models: the sweet spot for shale use case,” in EAGE Digital, 2020.
[33] R. Souza, A. Codas, J. A. Nogueira Junior, M. P. Quinones, L. Azevedo, R. Thiago, E. Soares, M. Cardoso, and L. Martins, “Supporting the training of physics informed neural networks for seismic inversion using provenance,” in AAPG, 2020.
[34] V. Silva, D. Oliveira, P. Valduriez, and M. Mattoso, “DfAnalyzer: runtime dataflow analysis of scientific applications using provenance,” VLDB, 2018.
[35] C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M. R. Crusoe, K. Peters, and D. Schober, “FAIR computational workflows,” Data Intelligence, 2019.
[36] D. Hu, D. Feng, Y. Xie, G. Xu, X. Gu, and D. Long, “Efficient provenance management via clustering and hybrid storage in big data environments,” IEEE Trans. on Big Data, 2019.
[37] D. Oliveira, V. Silva, and M. Mattoso, “How much domain data should be in provenance databases?” in TaPP, 2015.
[38] C. Lei, R. Alotaibi, A. Quamar, V. Efthymiou, and F. Özcan, “Property graph schema optimization for domain-specific knowledge graphs,” arXiv, 2020.
[39] I. Suriarachchi and B. Plale, “Crossing analytics systems: A case for integrated provenance in data lakes,” in IEEE eScience, 2016.
[40] P. Missier, B. Ludäscher, S. Bowers, S. Dey, A. Sarkar, B. Shrestha, I. Altintas, M. K. Anand, and C. Goble, “Linking multiple workflow provenance traces for interoperable collaborative science,” in WORKS@Supercomputing, 2010.
[41] F. Costa, V. Silva, D. Oliveira, K. Ocaña, E. Ogasawara, J. Dias, and M. Mattoso, “Capturing and querying workflow runtime provenance with PROV: a practical approach,” in EDBT/ICDT Workshops, 2013.
[42] R. Souza and M. Mattoso, “Provenance of dynamic adaptations in user-steered dataflows,” in IPAW, 2018.
[43] D. Koop, E. Santos, B. Bauer, M. Troyer, J. Freire, and C. T. Silva, “Bridging workflow and data provenance using strong links,” in SSDBM, 2010.
[44] ProvLake website. [Online]. Available: https://ibm.biz/provlake
[45] ProvLakeLib GitHub repository. [Online]. Available: https://github.com/IBM/multi-data-lineage-capture-py
[46] P. Thavasimani, J. Cała, and P. Missier, “Exploiting execution provenance to explain difference between two data-intensive computations,” in IEEE eScience, 2018.
[47] P. Missier, J. Bryans, C. Gamble, and V. Curcin, “Abstracting PROV provenance graphs: A validity-preserving approach,” FGCS, 2020.
[48] R. Lourenço, J. Freire, and D. Shasha, “BugDoc: algorithms to debug computational processes,” in
SIGMOD, 2020.
[49] C. Rajmohan, P. Lohia, H. Gupta, S. Brahma, M. Hernandez, and S. Mehta, “On efficiently processing workflow provenance queries in Spark,” in IEEE ICDCS, 2019.
[50] D. Garijo, Y. Gil, K. M. Cobourn, E. Deelman, C. Duffy, R. Ferreira da Silva, A. Kemanian, C. Knoblock, V. Kumar, S. D. Peckham, Y. Y. Chiang, D. Khider, A. Khandelwal, J. Pujara, V. Ratnakar, M. Stoica, B. Vu, and M. Pham, “Integrating models through knowledge-powered data and process composition,” AGU Fall Meeting, 2018.
[51] M. H. Namaki, A. Floratou, F. Psallidas, S. Krishnan, A. Agrawal, and Y. Wu, “Vamsa: tracking provenance in data science scripts,” arXiv, 2020.
[52] A. Spinuso, M. Atkinson, and F. Magnoni, “Active provenance for data-intensive workflows: engaging users and developers,” in IEEE eScience, 2019.
[53] T. Guedes, L. B. Martins, M. L. F. Falci, V. Silva, K. A. C. S. Ocaña, M. Mattoso, M. V. N. Bedo, and D. Oliveira, “Capturing and analyzing provenance from Spark-based scientific workflows with SAMbA-RaP,” FGCS, 2020.
[54] F. Magnoni, E. Casarotti, P. Artale Harris, M. Lindner, A. Rietbrock, I. A. Klampanos, A. Davvetas, A. Spinuso, R. Filgueira, A. Krause, M. Atkinson, A. Gemund, and V. Karkaletsis, “DARE to perform seismological workflows,” AGU Fall Meeting, 2019.
[55] K. Chard, N. Gaffney, M. Hategan, K. Kowalik, B. Ludäscher, T. McPhillips, J. Nabrzyski, V. Stodden, I. Taylor, T. Thelen, M. Turk, and C. Willis, “Toward enabling reproducibility for data-intensive research using the Whole Tale platform,” arXiv, 2020.
[56] T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R. K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K. W. Kintigh, T. A. Kohler, D. Koop, J. A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda, and B. Ludäscher, “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” IJDC, 2015.
[57] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, “noWorkflow: a tool for collecting, analyzing, and managing provenance from Python scripts,” VLDB, 2017.
[58] L. Rupprecht, J. C. Davis, C. Arnold, Y. Gur, and D. Bhagwat, “Improving reproducibility of data science pipelines through transparent provenance capture,” VLDB, 2020.
[59] M. Arnold, R. K. E. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilović, R. Nair, K. N. Ramamurthy, A. Olteanu, D. Piorkowski, D. Reimer, J. Richards, J. Tsay, and K. R. Varshney, “FactSheets: Increasing trust in AI services through supplier’s declarations of conformity,” IBM J. Research & Development, 2019.
[60] Y. Mendes, R. Braga, V. Ströele, and D. Oliveira, “PolyFlow: a SOA for analyzing workflow heterogeneous provenance data in distributed environments,” in SBSI, 2019.
[61] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, and P. C. Arocena, “Data lake management: challenges and opportunities,” VLDB, 2019.
[62] X. Liang, S. Shetty, D. Tosh, C. Kamhoua, K. Kwiat, and L. Njilla, “ProvChain: A blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability,” in CCGrid, 2017.
[63] P. Ruan, G. Chen, T. T. A. Dinh, Q. Lin, B. C. Ooi, and M. Zhang, “Fine-grained, secure and efficient data provenance on blockchain systems,” VLDB, 2019.
[64] A. Kumar, R. McCann, J. Naughton, and J. M. Patel, “Model selection management systems: the next frontier of advanced analytics,” SIGMOD Rec., 2016.
[65] R. Souza, L. Neves, L. Azeredo, R. Luiz, E. Tady, P. Cavalin, and M. Mattoso, “Towards a human-in-the-loop library for tracking hyperparameter tuning in deep learning development,” in LaDaS@VLDB, 2018.
[66] E. Deelman, A. Mandal, M. Jiang, and R. Sakellariou, “The role of machine learning in scientific workflows,” Int. J. HPC, 2019.
[67] D. Esteves, D. Moussallem, C. B. Neto, T. Soru, R. Usbeck, M. Ackermann, and J. Lehmann, “MEX vocabulary: a lightweight interchange format for machine learning experiments,” in ICSS, 2015.
[68] G. C. Publio, D. Esteves, A. Ławrynowicz, P. Panov, L. Soldatova, T. Soru, J. Vanschoren, and H. Zafar, “ML Schema: exposing the semantics of machine learning with schemas and ontologies,” in ICML, 2018.
[69] S. Samuel, F. Löffler, and B. König-Ries, “Machine learning pipelines: provenance, reproducibility and FAIR data principles,” arXiv, 2020.
[70] M. Moreno, V. Lourenço, S. Fiorini, P. Costa, R. Brandão, D. Civitarese, and R. Cerqueira, “Managing machine learning workflow components,” in