Workflow Provenance in the Lifecycle of Scientific Machine Learning
Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto
IBM Research; Federal University of Rio de Janeiro, Brazil; Inria, Univ. Montpellier, CNRS & LIRMM, France
Abstract
Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to provide for critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping low overhead (<…).

Index Terms
Scientific Machine Learning, Machine Learning Lifecycle, Artificial Intelligence, Data Science, Provenance, Lineage, Reproducibility, Explainability, Scientific Workflow, Data Lake, e-Science, Design Principles, Taxonomy
1 Introduction

Machine Learning (ML) has been fundamentally transforming several industries and businesses in numerous ways. More recently, it has also been impacting computational science and engineering domains, such as geoscience, climate science, material science, and health science. Scientific ML, i.e., ML applied to these domains, is characterized by the combination of data-driven techniques with domain-specific data and knowledge to obtain models of physical phenomena [1], [2], [3], [4], [5]. Obtaining models in scientific ML works similarly to conducting traditional large-scale computational experiments [6], which involve a team of scientists and engineers who formulate hypotheses, design the experiment, predefine parameters and input datasets, analyze the experiment data, make observations, and calibrate initial assumptions in a cycle until they are satisfied with the results. Scientific ML is naturally large-scale because multiple people collaborate in a project, using their multidisciplinary domain-specific knowledge to design and perform data-intensive tasks to curate (i.e., understand, clean, and enrich with observations) datasets and prepare them for learning algorithms. They then plan and execute compute-intensive tasks for computational simulations or for training ML models affected by the scientific domain's constraints. They utilize specialized scientific software tools running either on their desktops, on cloud clusters (e.g., Docker-based), or on large HPC machines.

Other works propose an ML lifecycle [7], [8]. Although they might apply to scientific ML, in our view, there are still gaps in these lifecycle proposals to properly address scientific ML characteristics, particularly the need for deeper integration with scientific domain data and specialized knowledge of a domain.
Our proposed model for the lifecycle of scientific ML has three phases (explained in detail later in this paper): data curation, to curate raw data; learning data preparation, to prepare the curated data for learning; and the learning itself, aware of the constraints of a scientific domain. In each of these phases, there may be multiple workflows. Each workflow is a set of chained data transformations consuming and producing datasets, and a workflow may consume the datasets produced by another workflow. For instance, there may be multiple workflows only in the learning data preparation phase to transform curated data into learning datasets. These datasets may then be consumed by multiple workflows in the learning phase, transforming the datasets into different ML models. Therefore, we propose modeling these workflows as multiple interconnected workflows [9]. From now on, we refer to workflows as these multiple interconnected workflows in all phases of the lifecycle of scientific ML.

*Correspondence: Renan Souza - [email protected]

Our primary goal in this paper is to support this lifecycle by enabling scientists and engineers to perform comprehensive, i.e., end-to-end data analyses that integrate the data consumed and generated in these workflows, from raw domain data to learned models. The importance of these data analyses is that they are enablers to meet critical requirements in ML, such as model reproducibility and explainability, and experiment data understanding.

The main problem in achieving this goal is to deal, in an integrated and comprehensive way, with the high heterogeneity of the different contexts (e.g., data, software, environments, personas) involved in this lifecycle. For example, the analyses need to be aware of the (hyper)parametrization of different data transformations in various workflows, how the transformations affect the experiment results (e.g.
, quality of the ML models), and the relationships between parameters, results, and domain-specific data and knowledge. For instance, one may ask: "what happened to the model performance when the parameters varied from X to Y and the datasets had a specific characteristic in the domain?". To allow for such analyses, it is necessary to track how the data are transformed throughout the workflows in an integrated and holistic way. Not having such holistic integration is critical for several reasons. To exemplify, it compromises experiment reproducibility from a scientific perspective. From a business perspective, stakeholders may be less likely to apply an ML model, even the one with the best performance, if they do not understand the transformations that led to it [10].

Provenance (also referred to as lineage) data management techniques help reproduce, trace, assess, understand, and explain data, models, and their transformation processes [11], [12], [13]. The provenance research community has evolved significantly in recent years to provide for several strategic capabilities, including experiment reproducibility [14], user steering (i.e., runtime monitoring, interactive data analysis, runtime fine-tuning) [15], raw data analysis [16], and our previous work, which helps data integration for multiple workflows generating data in a data lake [9]. Furthermore, other works contribute to support provenance tracking specifically for ML workflows [17], [18], [19], [8], [7], including reproducible models and explainability [20]. These related works are essential building blocks to be leveraged towards supporting the lifecycle.

Nevertheless, scientists and engineers still face difficulties in performing comprehensive data analyses that would help them meet those critical requirements in ML. Tracking provenance in those workflows could be used as a tool to provide for a holistic view, hence enabling the data analyses.
However, the high heterogeneity in the lifecycle raises several challenges. For example, the workflows are highly heterogeneous and have distributed execution control: there may not be one single Workflow Management System (WMS) orchestrating all workflows; instead, there may be multiple WMSs, scripts, programs, and ML and data processing frameworks without a single unified execution orchestrator. Further, these workflows manage domain- and ML-specific data and knowledge stored in various distributed data stores and run on various execution environments. Hence, strategies to track data in multiple data stores are needed. Another complicating factor is that efficiency is a common requirement, especially in HPC executions. Thus, the systems supporting the lifecycle need to scale and must not add significant tracking overhead. Designing a system to efficiently track provenance in such heterogeneous scenarios has recently been acknowledged as a research challenge by leading data management researchers [21].

In this paper, our focus is to support the lifecycle of scientific ML by enabling comprehensive data analyses, addressing the problem of high heterogeneity of different contexts. Particularly, we contribute with:

(i) A comprehensive characterization of the lifecycle, from raw domain data to learned models, passing through the processes that manipulate these data, and a taxonomy (detailing, e.g., data, execution timing, and training timing classes) positioning the role of provenance analysis to support the lifecycle (Sec. 2);

(ii) Data Design Principles to build and query a provenance-based holistic data view that integrates the data processed by workflows in the lifecycle, aware of the heterogeneous dimensions, enabling the comprehensive analyses.
A result of these principles is PROV-ML, a new provenance data representation for scientific ML leveraging W3C PROV [23] and MLS [24]; and System Design Principles that guide how to build a provenance system to track and integrate the data efficiently in distributed executions. A result of these principles is a reference system architecture (Sec. 3);

(iii) Lessons learned after applying the principles in a system implementation and evaluating it in a real case in the Oil & Gas (O&G) industry in a testbed with 3 environments, including an HPC cluster with 393 computing nodes and 946 GPUs. We found that the principles enabled comprehensive queries with rich semantics about the application domain and ML, while maintaining low tracking overhead (<…).

2 The Lifecycle of Scientific ML

Existing works describe an ML lifecycle [7], [8], but such descriptions focus on business domains and do not address the high heterogeneity problem of the lifecycle of scientific ML. Since our main goal is to support this lifecycle by enabling scientists and engineers to perform comprehensive data analyses, we begin with a proposal of a model for this lifecycle
and a thorough characterization. To the best of our knowledge, this is the first work that proposes a lifecycle focused on scientific ML. To illustrate this section's explanations, we explore a concrete use case in the geoscience domain, of high interest in the O&G industry.

1. This paper is a major extension of our work published in IEEE WORKS@SC19 [22]. We improved and expanded the design principles, definitions, examples, and lessons learned after new experiments. Also, we refined PROV-ML and extended the literature analysis.
Motivating use case.
Finding oil and gas reservoirs is a demanding task in the O&G industry and involves a broad spectrum of actions, such as the interpretation of seismic surveys. These surveys are indirect measures of the earth subsurface that can be organized into slices (images). They cover hundreds of square kilometers and help to interpret the geology by identifying geological structures, like salt bodies, and finding possible hydrocarbon accumulations. Processing seismic data imposes complex chained data transformations and can suffer from many problems, like noise and shadows (regions with low signal). Automating such activity is of high interest in academia and industry, and deep learning is a promising machine learning technique for this [5]. However, the geological structures vary geographically, from point to point in the subsurface, imposing significant challenges on the ML algorithms. Thus, it requires specialized knowledge to prepare, clean, and understand the data processed in the workflows. To cope with this, the different teams in an interdisciplinary group composed of geoscientists, computational scientists, engineers, statisticians, among others, often decompose the problem into parts so that each can address different facets of the problem. Nonetheless, each team has a preferred way to automate tasks and store data, and a team consumes data generated by another. Although decomposing the problem into parts makes it feasible, it creates a new problem: how to consume the data in an integrated way [9], [22], [5].

In this section, we explore this use case to propose an abstract model of the lifecycle, which applies to other scientific domains as well. We first characterize the personas and describe the lifecycle in Section 2.1 and then characterize the data analyses using provenance in Section 2.2.
2.1 Personas and the Lifecycle

Multidisciplinary personas, with different skills in the domain and in ML techniques, participate in the lifecycle phases. In our previous work, we presented a spectrum of expertise and personas in scientific ML [22], depicted in Figure 1 and briefly summarized here. The spectrum ranges from scientific-domain only (fully white on the left) to ML only (fully black on the right), with the following personas: (i) Domain scientists, who have in-depth knowledge of the domain data and use specialized tools to interpret, visualize, and clean the scientific data; (ii) Computational scientists and engineers, who have high computational skills, often with abilities to develop parallel scripts and execute them in HPC clusters; and (iii) ML scientists and engineers, who have in-depth knowledge of statistics, ML algorithms, and software engineering. In an orthogonal sense, Provenance specialists design the provenance schema for applications and guide other users to add provenance capture hooks to the workflows.
Fig. 1: Spectrum of expertise and personas in the lifecycle.

Our proposed model of the lifecycle of scientific ML divides it into three phases: data curation, learning data preparation, and learning (Figure 2: dashed arrows are data flows and solid arrows are interactions between phases).
Fig. 2: The Lifecycle of Scientific ML.
Data curation.
It is the most complex phase of the lifecycle, mainly because of the nature of the scientific data. Much manual and highly specialized work is performed by the users (primarily domain scientists) to achieve the automated knowledge extraction from scientific data promoted by ML. There is a significant gap between raw scientific data and useful data for consumption (e.g., data to serve as input to train ML models). Datasets can be huge, typically containing geospatial-temporal data stored in scientific formats, like HDF5, NetCDF, and SEG-Y. Specialized formats in scientific domains may require industry-specific software and domain-specific knowledge to inspect, visualize, and understand the data. In addition, users can use metadata and textual reports to annotate the data with extra domain-specific knowledge, without which it would be nearly impossible to make the data useful for ML algorithms. Considering the heterogeneous nature of the data, "it is unreasonable to assume that data lives in a single source" (e.g., a single file system or DBMS) [10]. For instance, raw files can be stored in file systems or cloud stores, domain-specific annotations can be stored in a Semantic Graph DBMS (e.g., a Triple Store) with domain ontologies, and curated data can be stored in a NoSQL DBMS. Then, computational scientists and engineers develop data-intensive scripts to clean, filter, and validate the data. Each of these steps inside the data curation phase is highly interactive, manual, and may execute independently. In other words, users may run different scripts to perform these steps, several times, in an ad-hoc way, in any order, and on different machines. These steps occur in a cycle, which stops when the users consider the data "curated". In the context of ML, the data are then ready to be transformed into learning data.
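For illustration, such domain-specific annotations can be thought of as subject-predicate-object triples attached to a raw file, as sketched below; the file URI, predicate names, and values are all hypothetical, and a real deployment would store them in the Semantic Graph DBMS mentioned above rather than in an in-memory list.

```python
def annotate(subject, predicate, obj, store):
    """Append one (subject, predicate, object) annotation to the store."""
    store.append((subject, predicate, obj))

annotations = []
raw_file = "file:///lake/seismic/survey_042.segy"  # hypothetical file URI

# Link the raw file to domain concepts a geoscientist would provide.
annotate(raw_file, "domain:acquiredOver", "domain:CamposBasin", annotations)
annotate(raw_file, "domain:hasSliceCount", 1876, annotations)
annotate(raw_file, "rdf:type", "domain:SeismicSurvey", annotations)

# A later step can recover every domain concept attached to the file.
concepts = [o for s, p, o in annotations if s == raw_file]
```

Without links of this kind, queries that join models back to oil basins or slice counts (such as Q1 and Q6 later in Table 1) have nothing to traverse.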
Learning data preparation.
Model trainers select relevant parts of the curated data to be used for learning. For instance, if the ML task is to classify geological structures [5], seismic images will need to be correlated with seismic interpretation, creating labeled samples. After selecting the data, model designers develop scripts, typically using domain-specific libraries to manipulate the raw scientific data, to transform (e.g., image cropping, quantization, scaling) the data into learning datasets. Due to data complexity, frequently the data need to be manually inspected before they can be used as input for the learning phase.
Learning.
Learning comprises training, validation, and evaluation. In this phase, model trainers select the input learning datasets, optionally choose validation datasets, and choose learning parameters (e.g., in deep learning, they can choose ranges of epochs and learning rates) that will be optimized. Trainers can use their domain knowledge to discard learning datasets that will unlikely provide good results. The learning process is compute-intensive, typically executed on an HPC machine. One single learning process often generates multiple learned models, among which one is chosen as the "best" depending on evaluation metrics (e.g., MSE, accuracy, or any other user-defined metric). Moreover, trainers need to monitor the learning process by, e.g., inspecting how the evaluation metrics are evolving while the learning process iterates. They can wait until completion or interrupt the learning process, change parameters, and iteratively re-submit the learning until satisfied with the results.
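A deliberately simplified sketch of this phase: one learning process explores several parameterizations and selects the "best" model by an evaluation metric. The `train` function is a stand-in mock with a made-up formula, not a real learner, and all names are hypothetical.

```python
def train(hp):
    """Stand-in for a compute-intensive training job: returns a mock
    validation loss for the given hyperparameters (hypothetical formula)."""
    return round(1.0 / (hp["epochs"] * hp["learning_rate"] * 1e4), 4)

# One learning process explores several parameterizations...
search_space = [
    {"epochs": 10, "learning_rate": 1e-3},
    {"epochs": 20, "learning_rate": 1e-3},
    {"epochs": 10, "learning_rate": 1e-4},
]
trained_models = [
    {"model_id": f"m{i}", "hyperparams": hp, "val_loss": train(hp)}
    for i, hp in enumerate(search_space)
]

# ...and the "best" model is selected by the evaluation metric.
best = min(trained_models, key=lambda m: m["val_loss"])
```

Capturing the `trained_models` records as provenance, rather than discarding all but `best`, is what later enables inter-training analyses.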
2.2 Workflow Provenance Analysis in the Lifecycle

Provenance data in workflows contain a structured record of the data derivation paths within chained data transformations and their parameterizations [16], [15]. Provenance data are usually represented as a directed graph where vertices are instances of entities (data), activities (the data transformations), or agents (e.g., users), and edges are instances of relationships between vertices [23]. Comprehensive data analysis using provenance has been used as an enabler for several key capabilities:

• Experiment reproducibility [11], [25], [12];
• AI explainability [20], [26], [9];
• Experiment fine-tuning and what-if analyses [15];
• Uncertainty quantification [27], [28];
• Hypothesis testing [6]; and
• Real-time monitoring and interactive data analysis [29].

Based on a literature analysis [30], [11], [10], [31], [25] and on our own experience leveraging provenance to support workflows for scientific ML [9], [22], [32], [33], we propose here a taxonomy to classify workflow provenance analysis in support of ML, considering three classes: data, execution timing, and training timing. Next, we characterize the data involved in the lifecycle.

TABLE 1: Examples of provenance queries in the lifecycle of scientific ML.

Q1: Given a trained model, what are the geographic coordinates, oil basin and field, and the number of seismic slices of the seismic in the training dataset?
Q2: Given a trained model, what is the tile size, the noise filter threshold, and the ranges of seismic slices that were selected to generate the training set used to adjust this model?
Q3: Given a training set, what are the values for all hyperparameters and the evaluation measure values associated with the trained model with least loss?
Q4: What are the average, min, and max execution times of each batch iteration inside each epoch of the deep neural network training, given a training dataset?
Q5: What is the execution time on average per batch iteration, per epoch, and what are the evaluation metrics of the trained models that used the training dataset generated for a given range of seismic slices?
Q6: Given the training dataset used in Q5, what was the seismic data file used, along with its number of slices, related oil basin, and field?
Q7: Considering only the learning workflows that used the learning dataset associated with a given range of seismic slices, list the minimum batch loss per model obtained in the learning stage, also listing the model's hyperparameters and evaluation measurements jointly with the hyperparameters and measurements for the associated model obtained in the validation stage, ordered by the best learned models.
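To make the shape of such queries concrete, the sketch below poses Q3 against a toy, in-memory set of provenance records. In a real deployment the records would live in a provenance database and the query would be expressed in its query language; all record fields and values here are hypothetical.

```python
# Hypothetical provenance records linking training sets to trained models.
provenance = [
    {"training_set": "ts_01", "model": "m0",
     "hyperparams": {"epochs": 10, "lr": 1e-3}, "loss": 0.42},
    {"training_set": "ts_01", "model": "m1",
     "hyperparams": {"epochs": 20, "lr": 1e-4}, "loss": 0.31},
    {"training_set": "ts_02", "model": "m2",
     "hyperparams": {"epochs": 10, "lr": 1e-3}, "loss": 0.55},
]

def q3(training_set, records):
    """Q3: hyperparameters and loss of the least-loss model trained on
    the given training set."""
    candidates = [r for r in records if r["training_set"] == training_set]
    return min(candidates, key=lambda r: r["loss"])

answer = q3("ts_01", provenance)  # record for model "m1", loss 0.31
```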
Data class includes domain-specific, machine learning, and execution data. Provenance data may be augmented with these data, increasing the scope of the analysis.
Domain-specific data are the main data processed in the data curation phase (Sec. 2.1). Approaches to add domain data into provenance analysis include, e.g., raw data extraction [34] and the utilization of domain-specific knowledge databases associated with provenance databases [9]. For raw data extraction, quantities of interest are extracted from raw data files. For domain databases, domain scientists may provide relevant information and metadata about the raw data and store them in knowledge graphs.
Machine learning data include learning data and generated learned models, which are more related to the learning data preparation (e.g., Q1) and learning (e.g., Q2, Q3, Q7) phases (Fig. 2). These queries exemplify that the parametrization within the data transformations and relevant metadata of the generated data are important for provenance analysis.
Execution data.
Besides model performance metrics (e.g., accuracy), users need to assess workflow execution time and resource consumption. They need to inspect whether a critical block in their workflow (e.g., one demanding high parallelism) is taking longer than usual or whether other parts are consuming more memory than expected. For this, provenance systems can capture system performance metrics and timestamps (e.g., Q4). Metadata, such as data store metadata (e.g., host address), HPC cluster name, and nodes in use, can be captured and associated with the provenance of the data transformations for extended analysis.
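A minimal sketch of this kind of capture, assuming a Python training script: a wrapper times each batch iteration and aggregates Q4-style statistics (average, min, max). The helper names are ours, not from any specific provenance system.

```python
import time
from statistics import mean

def timed_batches(batches, log):
    """Wrap batch iteration, recording per-batch wall-clock times."""
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        yield batch  # the consumer's batch computation runs here
        log.append({"batch": i, "elapsed_s": time.perf_counter() - start})

log = []
for batch in timed_batches(range(5), log):
    _ = sum(range(1000))  # stand-in for the real batch computation

# Q4-style aggregation over the captured execution data.
summary = {"avg": mean(e["elapsed_s"] for e in log),
           "min": min(e["elapsed_s"] for e in log),
           "max": max(e["elapsed_s"] for e in log)}
```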
Hybrid.
These data can be combined. In Q5 and Q7, the analysis queries data processed in workflows in the learning data preparation and learning phases, whereas Q6 uses the same dataset to analyze the raw files curated in the data curation phase.
Execution timing refers to whether the analysis is done online, i.e., while at least one workflow is running, or offline.

Offline analysis.
The typical use of offline provenance analysis is to support reproducibility and historical data understanding, e.g., to understand the curation of raw files and relate it with the ML models. The queries Q1–Q7 can be executed offline.
Online analysis.
Users can use online provenance analysis to monitor, debug, or inspect the data transformations while they are still running (e.g., see the status, see how the intermediate results are evolving as the input parameters vary). The problem of adding low provenance data capture overhead is more challenging for provenance systems that allow for online analysis [9]. Queries Q3–Q5 and Q7 exemplify queries that can be executed online, e.g., while a training process is running.
Training timing refers to whether the analysis performs intra-training, i.e., inspects one training process (e.g., a training job running on an HPC cluster), or inter-training, i.e., analyses comprehending results of several training processes.

Intra-training. In an offline intra-training analysis, users are interested in understanding how well the trained models generated in a given training process perform. All queries, Q1–Q7, could be executed either online or offline, but Q3 and Q4 are more likely to be performed as online intra-training analyses.
Inter-training. This analysis refers to comprehensive queries to understand multiple training processes, e.g., how each of them performed, which learning datasets were used, and how the training processes were parameterized. It supports activities like Model Validation, Management, Training, and Design. Usually, these analyses are performed offline, but they may also be performed online. Queries Q1–Q7 fit this class when analyzing multiple trained models generated in different training processes.
Fig. 3: A taxonomy for workflow provenance analysis of the lifecycle of scientific ML.
Further characterization.
Other classes worth mentioning for provenance analysis are: data store — data are distributed onto multiple stores, like file systems, cloud stores (e.g., IBM Cloud Object Storage, AWS S3), and Relational or NoSQL DBMSs [9]; execution environment — where the workflows execute, such as HPC clusters, Kubernetes clusters, or a standalone server; execution orchestration software — each workflow may be executed as a standalone script, as a workflow in a WMS, as a composition of microservice calls, or as a pipeline in data processing (e.g., Spark) and ML frameworks (e.g., TensorFlow); provenance data granularity — provenance of files (i.e., references to files consumed and generated in a script), function calls (arguments and outputs), blocks of code, and stack traces [30]; and provenance analysis direction — forward or backward: generally, forward queries analyze from raw scientific files or learning datasets to trained models (e.g., Q3–Q5, Q7), whereas backward queries analyze from trained models to learning datasets or raw files (e.g., Q1, Q2, Q6).
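A backward query of this kind amounts to a reachability traversal over derivation edges. The sketch below shows the idea on a toy, hand-built edge map (all identifiers hypothetical); a provenance DBMS would evaluate the equivalent traversal over its stored graph.

```python
from collections import deque

# Hypothetical derivation edges: each entry maps a derived item to what it
# was derived from (model <- learning dataset <- curated data <- raw file).
derived_from = {
    "model:m1": ["data:train_set_7"],
    "data:train_set_7": ["data:curated_slices"],
    "data:curated_slices": ["file:raw_survey.segy"],
}

def backward(item, edges):
    """Backward query: every ancestor reachable from the given item."""
    seen, queue = set(), deque([item])
    while queue:
        for parent in edges.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

ancestors = backward("model:m1", derived_from)
# ancestors == {"data:train_set_7", "data:curated_slices", "file:raw_survey.segy"}
```

A forward query is the same traversal over the inverted edge map.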
3 Design Principles

This section presents the fundamental design principles for effective and efficient management of workflow provenance data in the lifecycle of scientific ML to provide for comprehensive data analyses. Although some of these design principles, individually, may have been proposed in related works [35], [9], [36], [34], together they compose the building blocks of our approach, and we assemble them as one unified set of principles and describe how they support the lifecycle. They are organized as: (i) Data Design Principles (Sec. 3.1), which contain the principles and key concepts that drive the contents of our holistic data view, whose resulting artifact is PROV-ML, a new provenance data representation; and (ii) System Design Principles (Sec. 3.2), which contain the principles that determine how the provenance data are captured in a scalable and portable manner, whose resulting artifact is a reference system architecture.
3.1 Data Design Principles

DDP 1: Data Integration with a Holistic Data View.

The primary design principle is that, to be able to manage effectively (i.e., capture, integrate, store, and query) provenance data in the interconnected workflows in all lifecycle phases, a provenance system must implement techniques to provide for an integrated, unified, and holistic data view. Also, it has to be aware of the contexts of the data transformations in the multiple workflows that consume and generate these data, their (hyper)parameterization and output values, where these transformations run, where the generated data are stored, who the involved personas are, and how they interact with the workflows. This design principle builds on the multiworkflow data view concept proposed in our previous work [9]. It extends it to support the lifecycle comprehensively, with specializations to address ML-specific data and knowledge related to domain-specific data and knowledge. Let us call this data view the Provenance-based Holistic Data View of the Lifecycle of Scientific ML (MLHolView). The contents and the granularity of the MLHolView are driven by the relevant queries for a project, and the view can be materialized as the database that integrates data from several sources while the workflows run [9].
DDP 2: Context-awareness using Knowledge Graphs: Domain, ML, and Hybrid Environments and Data Stores.

Extending provenance with domain-specific data for data analysis has been explored before [37], [34], [15]. However, in scientific ML, it is required to go a step further into the details of domain-specific knowledge, including how key domain concepts relate to each other. Thus, it is important to relate the data in the workflows with as much knowledge as possible available about the project's key concepts. To be able to integrate with domain-specific knowledge databases, one needs to design the workflows aware that files (or data in other data stores, like DBMSs or object stores) are associated with concepts defined elsewhere. Then the provenance system needs to provide the proper links between the files and the domain-specific concepts.

Similarly, the MLHolView needs ML-specific concepts and relationships. Although modeling ML-specific concepts could be seen as modeling data for a specific domain (in this case, ML would be the domain), ML, by itself, is a distinguished domain, which crosses many industries and scientific domains. Thus, the MLHolView should have a built-in ML-specific schema, tightly coupled with the rest of the provenance data schema, to provide ML-specific context to support the comprehensive analyses. In certain cases, such specialized schema modeling might even help accelerate queries that require it [38].

In addition to domain- and ML-specific context awareness, since the workflows can be executed within heterogeneous frameworks, scripts, or WMSs and on heterogeneous environments, the MLHolView needs to be aware of such hybrid (i.e., heterogeneous) execution by containing the track of the execution environment and software, and associated metadata. These data and their relationships, with pointers to domain-specific knowledge graphs and to large data stored in other stores, are all materialized using provenance data in the knowledge graph that forms the MLHolView. Figure 4 illustrates the MLHolView and its awareness of data coming from the ML phases and the dimensions of heterogeneity (illustrated as layers) it addresses: software, data, data stores, and infrastructure (execution environment). The figure also shows the kinds of provenance analysis (top-left) and the key capabilities the MLHolView enables (top-right) (Sec. 2.2).
Fig. 4: The Provenance-based Holistic Data View of the Lifecycle of Scientific ML.
DDP 3: Provenance of Multiple Workflows on Data Lakes meets ML Provenance Following W3C Standards.
To be able to implement the context-awareness for domain, ML, and hybrid environments, the MLHolView needs a comprehensive data representation. Data lake provenance builds on workflow provenance to enable awareness of the location of each data item generated by chained data transformations in a data lake, even if multiple data items are dispersed across hybrid environments and data stores [9], [39], making it a good alternative to address such heterogeneity of data, stores, and environments. However, it is not enough to support the lifecycle, as the lifecycle requires provenance of ML-specific data and learning processes. The provenance data community has evolved significantly in recent years, oftentimes leveraging the PROV [23] family of documents, a W3C recommendation, making it a de facto standard that provides the building blocks, in terms of data representation, for any provenance-based approach and allows for compatibility among different solutions [40]. The PROV-Wf [41] workflow provenance data representation and its derivatives [16] have also been used and evolved by several initiatives [29], [15], [42]. Our previous work builds on W3C PROV and PROV-Wf to propose PROVLake, a first provenance data representation for workflows on data lakes [9]. With respect to ML-specific data modeling, there is a W3C community group developing a data representation with specific ML vocabulary, the W3C ML Schema (MLS) [24]. Therefore, this data design principle proposes that the data representation for the MLHolView should be comprehensive, with detailed semantics about the workflows, where they execute, the data they process, and where these data are stored, combining and extending a data lake provenance representation with an ML-specific data representation, following standards and reusing existing representations, such as W3C PROV, PROV-Wf, PROVLake, and MLS.
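A minimal sketch of how such a combined representation could look as W3C PROV-style triples, linking an ML model to the training activity, its input dataset, and the data store holding that dataset. All URIs, prefixes, and property names here are illustrative placeholders (the `ex:` terms in particular are assumptions), not the actual PROVLake/PROV-ML vocabulary.

```python
# Sketch: one training run as W3C PROV-style (subject, predicate, object)
# triples. PROV terms (prov:used, prov:wasGeneratedBy, ...) are standard;
# the ex:-prefixed names are illustrative only.

def training_provenance(model, dataset, training_run, store, agent):
    """Link a model (prov:Entity) to the training activity and its inputs."""
    return [
        (training_run, "rdf:type",               "prov:Activity"),
        (dataset,      "rdf:type",               "prov:Entity"),
        (model,        "rdf:type",               "prov:Entity"),
        (training_run, "prov:used",              dataset),
        (model,        "prov:wasGeneratedBy",    training_run),
        (model,        "prov:wasDerivedFrom",    dataset),
        (training_run, "prov:wasAssociatedWith", agent),
        # data lake awareness: where the dataset physically resides
        (dataset,      "ex:storedIn",            store),
    ]

triples = training_provenance(
    "ex:model188", "ex:train_hdf5", "ex:trainingRun7",
    "ex:gpfs_store", "ex:geoscientist1")

# Walking the edges answers lineage questions, e.g. what generated the model:
generated_by = [o for s, p, o in triples
                if s == "ex:model188" and p == "prov:wasGeneratedBy"]
```

Because the representation is just a graph of PROV relationships, domain, ML, and storage facts compose in the same structure, which is what enables the integrated queries discussed later.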
DDP 4: Keeping Prospection and Retrospection Related but Separated.
Davidson and Freire explain that prospective provenance captures the specification of a workflow, i.e., the recipe of which data transformations will be processed and their inputs and outputs. In contrast, retrospective provenance captures the data that were actually consumed and produced, along with a detailed execution log about the computational tasks and execution environment [25]. Prospective provenance provides the abstraction layer to specify provenance analyses, often giving semantics to the retrospective provenance data generated during the workflows' execution. Also, there are cases in which a provenance analysis uses only one kind of provenance data. Therefore, managing both kinds of provenance data, and, more importantly, keeping a strong connection between them, is essential for the MLHolView, and this should be reflected in the provenance data modeling.
DDP 5: Designing a Focused Conceptual Data Schema.
To provide the specialized semantics needed by the MLHolView, we propose a conceptual data schema focusing on the key concepts identified by the characterization in Section 2. The concepts are driven by the lifecycle phases and the data they manipulate: the phases are illustrated with a gray background and the four main kinds of data with a white background in the UML class diagram in Figure 5.
Fig. 5: Conceptual data schema of the key concepts of the lifecycle and their supporting graphs.

In the four data concept classes, each instance represents one dataset, i.e., a set of data elements that, combined, form one meaningful set of data for a given application. As any dataset, it may have a data schema that varies depending on the application; it may be further decomposed into several interrelated subdatasets (or subconcepts for a given application); and there may be related metadata, such as where it is physically stored and data sizes. For example, in the case of domain data, a set of well log files forms a dataset whose application is to serve as a training dataset to train an ML algorithm to find well tops; and the combination of one seismic data file with metadata about the geographic location of the seismic data acquisition and associated oil basins forms a dataset whose application is to curate the seismic data. In the case of models, there may be metadata about the model performance and hyperparameters. With respect to the three phases' classes, each can be further decomposed into workflows with associated execution data. A Learning instance can be qualified into training, validation, and evaluation. With respect to relationships, each Data Curation instance consumes a Raw Domain Data instance and generates a Curated Domain Data instance. Then, each Curated Domain Data instance may be consumed by one or more Data Preparation instances, which in turn may consume one or multiple Curated Domain Data instances (i.e., an n:m relationship). For instance, a learning algorithm may require the preparation of well log data and seismic data, jointly, and thus two sets of Curated Domain Data would need to be related to the Data Preparation instance. Finally, each Data Preparation instance generates a Learning Data instance to be consumed by one Learning process that generates one Model instance. Typically, during a learning phase, there are multiple Learning instances, each generating a Model instance.
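The concepts and multiplicities above can be sketched as plain Python dataclasses. The class and field names below are assumptions for this example; the paper fixes only the key concepts and their relationships.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the conceptual schema in Fig. 5.

@dataclass
class Dataset:
    name: str
    uri: str                      # where it is physically stored

@dataclass
class DataCuration:               # consumes 1 Raw, generates 1 Curated (1:1)
    raw: Dataset
    curated: Dataset

@dataclass
class DataPreparation:            # n:m with Curated Domain Data
    curated_inputs: List[Dataset]
    learning_data: Dataset        # generates 1 Learning Data

@dataclass
class Learning:                   # consumes 1 Learning Data, generates 1 Model
    learning_data: Dataset
    model: Dataset
    kind: str = "training"        # training | validation | evaluation

# Example: well log and seismic data prepared jointly, as in the text.
wells = Dataset("well_logs", "file:///curated/well_logs")
seismic = Dataset("seismic", "file:///curated/netherlands_seismic")
prep = DataPreparation([wells, seismic],
                       Dataset("train_set", "file:///input/train.hdf5"))
run = Learning(prep.learning_data,
               Dataset("model", "file:///models/model.hdf5"))
```

Note how the n:m relationship appears as the list of curated inputs in `DataPreparation`, while the 1:1 links are single fields.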
SDP 1: Portable and Distributed Capture Control.
As discussed, the workflows execute in highly distributed, heterogeneous environments, processing data in heterogeneous data stores and executing within heterogeneous software. To address this distributed execution control, the provenance system should be portable, with distributed capture control, so that there may be multiple provenance data capturers spread out across the multiple executing workflows. To address the heterogeneity of how workflows are executed, the provenance system cannot be tightly coupled with a specific workflow tool; rather, it should be pluggable to any of the aforementioned heterogeneous ways of executing workflows. The distributed captured data are ultimately integrated in the unified MLHolView.

SDP 2: Specialized Microservices in a Distributed Architecture.

In addition to the distributed capture control, designing a provenance system using a microservices architecture allows for the flexibility needed for large-scale deployments in hybrid environments. The provenance system can be decomposed into smaller, stateless microservices with specialized functions and, more importantly, this enables components of the provenance system architecture to be deployed wherever it best fits the workflow having its provenance captured. For instance, provenance capture components can be deployed geographically near (or inside) the machine where the workflow runs, to reduce the latency caused by communication costs, while other heavyweight provenance-specific processes (e.g., creating the linkages, inserting in the DBMS) and the DBMS itself can be deployed elsewhere, to reduce concurrency with the running workflows. A real deployment exploring this flexibility to place the architectural components to reduce communication costs and concurrency is shown in Section 4.1.
SDP 3: Strategies for a Scalable Capture.
Since many of these workflows require HPC, the provenance capture system should not add significant performance penalties to the running workflows, which requires designing strategies for scalable data capture. In addition to reducing concurrency, as described in SDP 2, which is one of these strategies, other strategies to reduce performance overhead are as follows. During capture, all calls from the running workflows to the provenance system should be asynchronous and should not wait for the data capture request to be completely processed, avoiding periods of waiting in the running workflow. Also, batches of data capture requests from the running workflows can be queued and sent to the provenance system at once, avoiding keeping multiple communication channels open between the running workflow and the provenance system. These batches are then received by the provenance system, which should process the requests in a batch in parallel, to reduce the time between the provenance capture in the workflow and the data becoming readily available for queries in the MLHolView. Moreover, during capture, the provenance system component responsible for creating the data linkages should avoid read operations on the underlying DBMS and should only append data to the DBMS. This is because read operations on the DBMS inevitably have to wait for the query response, potentially increasing latency in the provenance capture. Finally, the only component in direct contact with a running workflow should be a lightweight provenance capture library, shielding the workflows from possible slowness in other components. The key for such a lightweight library is to significantly reduce provenance-specific code in a workflow, consequently reducing provenance-specific calls during execution, and to strictly follow the insert-only policy, so that no queries to the DBMS are made by the library, avoiding waits. The provenance-specific descriptions, essential for the specification of the workflows, are stored as prospective provenance data externally to the actual workflow. The provenance library (on the client side of the system) does not need these specifications, which are essential for the server side of the system, so that the linkages that form the MLHolView can be provided. A side effect of reducing provenance calls in a workflow is that it also reduces the changes needed in the workflow, making it look as similar as possible to the original workflow without the hooks [9], [32].
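The asynchronous, batched capture strategy can be sketched as a client that enqueues events and ships them in batches from a background thread. This is a minimal illustration, not ProvLake's implementation; `send_batch` stands in for the single request that ships a batch of capture events to the provenance system.

```python
import queue
import threading

class ProvCaptureClient:
    """Sketch of asynchronous, batched provenance capture (SDP 3)."""

    def __init__(self, send_batch, batch_size=10):
        self._q = queue.Queue()
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def capture(self, event):
        """Called by the workflow: enqueue and return immediately."""
        self._q.put(event)

    def _drain(self):
        batch = []
        while True:
            event = self._q.get()
            if event is None:                  # shutdown sentinel
                break
            batch.append(event)
            # flush when the batch is full or the producer paused
            if len(batch) >= self._batch_size or self._q.empty():
                self._send_batch(batch)
                batch = []
        if batch:                              # flush any leftover events
            self._send_batch(batch)

    def close(self):
        self._q.put(None)
        self._worker.join()

sent_batches = []
client = ProvCaptureClient(sent_batches.append, batch_size=5)
for i in range(12):
    client.capture({"task": i, "status": "finished"})
client.close()
total = sum(len(b) for b in sent_batches)
```

The workflow thread never blocks on the provenance system: `capture` only enqueues, and the background worker amortizes communication over batches.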
SDP 4: Easing Data Linkage with Unique Data Identifiers.
The concept of using unique identifiers is useful for keeping track of data in provenance systems [39], [43]. Existing approaches keep track of data files consumed and produced in the workflows; here we extend this concept to keep track of every data value that participates in the MLHolView, even scalar values. Thus, every attribute-value pair consumed or produced in any data transformation participating in any workflow receives a unique identifier. Then, whenever an attribute-value pair generated by one data transformation is consumed by another, the provenance system can reuse the identifier, keeping track of the paths between transformations and, thus, keeping the workflows interconnected.
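A minimal sketch of this identifier scheme, under assumed names (`IdRegistry`, `produced`, `consumed` are illustrative, not ProvLake's API): each produced attribute-value pair gets a fresh identifier, and consuming transformations record the same identifier, which interconnects the workflows.

```python
import itertools

class IdRegistry:
    """Sketch of SDP 4: unique identifiers for attribute-value pairs."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._producers = {}   # value_id -> (task, attribute, value)
        self._edges = []       # (producer_task, value_id, consumer_task)

    def produced(self, task, attribute, value):
        value_id = "v{}".format(next(self._counter))
        self._producers[value_id] = (task, attribute, value)
        return value_id

    def consumed(self, task, value_id):
        producer_task, _, _ = self._producers[value_id]
        self._edges.append((producer_task, value_id, task))

    def upstream_of(self, task):
        """Tasks whose outputs this task consumed (one hop in the path)."""
        return {p for (p, _, c) in self._edges if c == task}

reg = IdRegistry()
vid = reg.produced("data_preparation", "train_path", "/input/train.hdf5")
reg.consumed("training", vid)     # training reuses the same identifier
```

Even a scalar like a file path becomes a first-class node: following the stored edges reconstructs the paths between transformations across workflows.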
SDP 5: Workflow Design and Adding the Provenance Capture Hooks.
To enable the context-awareness (DDP 2), the first step is to design the workflows with context-awareness. For this, for each of the multiple workflows in a given project, one needs to specify its data transformations with input datasets, parameters, and expected outputs. Each computational process (data transformation) and the datasets it transforms are qualified according to the MLHolView's conceptual data schema (DDP 5). When specifying data references, the physical location where the referenced data are expected to be stored should be provided, as well as metadata about the execution environment where the workflow will execute. Finally, the relationships between the workflows and the data in the distributed data stores need to be specified. Such a specification can be maintained in configuration files, which inform the provenance capture system, enabling it to create the linkages that provide the context-aware integration of domain, ML, and hybrid environments and stores using provenance. After the specification, hooks can be added to the workflows before and after each data transformation, informing the key concept (following the MLHolView's conceptual data schema) in each data transformation and data reference. A data transformation execution is encapsulated by a provenance capture task, which typically corresponds to a function call, a program execution, a web service call, or an iteration in an iterative workflow.
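Such hooks can be sketched as a decorator that emits one event before and one after each data transformation, qualified by the key concept from the prospective specification. The decorator and `emit` function below are assumptions for illustration, standing in for the actual capture library calls.

```python
import functools
import time
import uuid

captured_events = []

def emit(event):
    """Stand-in for the asynchronous capture-library call."""
    captured_events.append(event)

def prov_task(workflow, transformation, concept):
    """`concept` follows the conceptual data schema, e.g. 'Learning'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**inputs):
            task_id = str(uuid.uuid4())
            emit({"task": task_id, "workflow": workflow,
                  "transformation": transformation, "concept": concept,
                  "status": "started", "inputs": inputs,
                  "time": time.time()})
            outputs = func(**inputs)
            emit({"task": task_id, "status": "finished",
                  "outputs": outputs, "time": time.time()})
            return outputs
        return wrapper
    return decorator

@prov_task("seismic_wf", "train", concept="Learning")
def train(max_epochs, learning_rate):
    # placeholder for the actual training code
    return {"model_path": "/models/model.hdf5"}

result = train(max_epochs=300, learning_rate=0.01)
```

The hook keeps provenance-specific code out of the transformation body itself, matching the goal of leaving the workflow as close as possible to its original form.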
Reference System Architecture.
Based on these system design principles, our proposed reference architecture is illustrated in Figure 6 and is described as follows. There are M environments (e.g., HPC clusters, Kubernetes clusters) and N workflows in all phases of the lifecycle, distributed on these environments. Each workflow may use heterogeneous data stores and may be implemented as a standalone script, as a workflow in a WMS, as a composition of microservice calls, or as a pipeline in a data processing or ML framework. Provenance capture hooks, through a lightweight ProvLib, are added to capture provenance data at each data transformation in each of these workflows. At the beginning and end of each (potentially parallel) data transformation execution of each (potentially parallel) workflow, a provenance capture event is emitted by the ProvLib. Thus a provenance capture event has the granularity of a data transformation execution, with the corresponding input data (at the beginning) and output data (at the end). These events are asynchronously sent to a Message Broker, such as Apache Kafka, or any lightweight repository that persists the queue of capture requests. Then, the ProvConsumer, a lightweight service that runs in the background, consumes from this queue and sends the requests to the ProvManager, which is aware of the prospective provenance data, can create the context-aware linkages using W3C PROV-based relationships and the reuse of unique identifiers (DDP 4), and sends the data to the MLHolView, which is managed by a DBMS, typically a knowledge graph DBMS. The (Message Broker, ProvConsumer) pair is instantiated in each environment to reduce communication costs with the ProvLib. The ProvManager is a RESTful, stateless service and can receive provenance capture requests in any order; thus it uses a lightweight key-value DBMS (e.g., Redis) to manage state when needed (e.g., to link a just-received request with another request sent before). During the execution of these workflows, users or applications may submit provenance analyses through a Query API that communicates with the ProvQuery component, a RESTful service responsible for implementing query-building strategies using the query language of the MLHolView's DBMS and returning the results to the requesting client.
Fig. 6: Reference system architecture to manage workflow provenance in the lifecycle of scientific ML.
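The granularity described above (one event per data transformation execution, emitted at its beginning and end) can be sketched as the serialized message the ProvLib would hand to the Message Broker. The field names are assumptions for this sketch, not ProvLake's actual wire format.

```python
import json
import time

def capture_event(workflow_id, transformation, phase, payload):
    """`phase` is 'begin' (input data) or 'end' (output data)."""
    return json.dumps({
        "workflow_id": workflow_id,
        "transformation": transformation,
        "phase": phase,
        "payload": payload,      # attribute-value pairs or data references
        "timestamp": time.time(),
    })

# The Message Broker persists the serialized event; the ProvConsumer later
# forwards it to the ProvManager, which creates the PROV linkages.
msg = capture_event("learning_wf_01", "training", "begin",
                    {"train_set": "/input/train.hdf5", "batch_size": 60})
decoded = json.loads(msg)
```

Serializing to a self-describing format lets the ProvManager accept events in any order, as required by its stateless design.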
We propose a generic provenance data representation for the lifecycle of scientific ML based on the data design principles, which is the first one to the best of our knowledge. PROV-ML is depicted in Fig. 7, where the light-color classes represent prospective provenance, and the dark-color classes, retrospective provenance. PROV-ML provides rich semantics and details based on the conceptual data model of the lifecycle's fundamental concepts (DDP 5), especially the ones in the learning phase. The colors in the figure map to these concepts: the blue-shaded classes account for the Learning Data; the gray-shaded, for the Learning; and the yellow-shaded, for the Model. The stereotypes indicated in the figure represent the classes inherited from PROVLake. All classes illustrated in the figure are individually described in Table 2. We briefly discuss the PROV-ML classes here; further details are available online [44].

In PROV-ML, the Study class introduces a series of experiments, portrayed by the LearningExperiment class, which defines one of the three major phases in the lifecycle, the Learning phase. A learning experiment comprises a set of learning stages, represented by the BaseLearningStage class, which are the primary data transformations within the Learning phase and with which the agent (Persona class) is associated. The base learning stage serves as an abstract class from which the LearningStage and LearningStageSection classes inherit. Also, it relates to the ML algorithm, represented by the Algorithm class, used in the stage, which might be defined in the context of a specific ML task (e.g., classification, regression), represented by the LearningTask class. This approach allows both the learning stage and the learning stage section to conserve their relationships with other classes while granting them the special characteristics discussed in the following. A learning stage varies regarding its type, i.e., the Training, Validation, and Evaluation classes. The provision of a specific class for the learning stage allows the explicit representation of the relationship between the Learning Data Preparation phase, through its Learning Data, and the Learning phase of an ML lifecycle. The LearningStageSection class introduces the sectioning semantics that grant capabilities of referencing subparts of the learning stage and of the data. An example of the relevance of sectioning elements is the ability to reference a specific epoch within a training stage, or a set of batches within a specific epoch. The Learning Data appears in the model through the LearningDataSetReference class. Another data transformation specified in PROV-ML is the FeatureExtraction class, which represents the process that transforms the learning dataset into a set of features, represented by the FeatureSet class. This modeling favors the reproducibility of the ML experiment since it relates the dataset with the feature extraction process and the resulting feature set.

Further fundamental aspects regarding the Learning phase are the outputs and the parametrization used to produce these outputs. The ModelSchema class describes the characteristics of the models produced in a learning stage or learning stage section, such as the number of layers of a neural network or the number of trees in a random forest. The ModelProspection class represents the prospected ML models, i.e., the reference for the ML models learned during a learning stage or learning stage section of a training stage. In addition to the data produced in the Learning phase is the EvaluationMeasure class. This class, combined with the EvaluationProcedure and EvaluationSpecification classes, provides the representation of evaluation mechanisms for the produced ML models during any stage of learning. Specifically: an evaluation measure defines an overall metric used to evaluate a learning stage (e.g., accuracy, F1-score, area under the curve); an evaluation specification defines the set of evaluation measures used in the evaluation of learned models; and an evaluation procedure serves as the model evaluation framework, i.e., it details the evaluation process and the methods used. On the parametrization aspect, PROV-ML affords two classes: LearningHyperparameter and ModelHyperparameter. The first represents the hyperparameters used in a learning stage or learning stage section (e.g., max training epochs, weights initialization). The second is used in the representation of the models' hyperparameters (e.g., network weights). Finally, PROV-ML addresses the retrospective counterpart of the classes mentioned above: the classes ending in Execution and Value are the retrospective analogues of the data transformations and the attributes, respectively.

Fig. 7: PROV-ML: a W3C PROV- and W3C ML Schema-compliant provenance data representation for scientific ML. A larger visualization is available online [44].

TABLE 2: PROV-ML data representation classes.
Class: Description

Study: Investigation (e.g., research hypothesis) leading to ML workflow definitions.
LearningExperiment: The set of analyses (e.g., research questions) that drives the ML workflow.
LearningProcessExecution: An ML workflow execution. This is equivalent to mls:Run and was renamed to explicitly preserve the aspects of retrospective provenance, which are not explicitly handled in MLS.
LearningTask and LearningTaskValue: Defines the goal of a learning process, i.e., the ML task (e.g., LearningTask: Classification; LearningTaskValue: Seismic Stratigraphic Classification).
BaseLearningStage and BaseLearningStageExecution: Abstract classes of LearningStage and LearningStageSection, and their execution counterparts, used to conserve the relationships among other classes while granting them special characteristics.
LearningStage and LearningStageExecution: Defines a (Training, Validation, or Evaluation) learning stage and its execution.
LearningStageSection and LearningSectionExecution: Introduces the sectioning semantics, i.e., capabilities for provenance of subparts of the learning stage and corresponding data.
LearningDatasetReference and LearningDataset: Defines the dataset to be used by a LearningStage or LearningStageSection; in the latter case, it is a section of a LearningDatasetReference. LearningDataset is the dataset used in the execution.
DatasetCharacteristic and DatasetCharacteristicValue: Defines metadata about the LearningDatasetReference; DatasetCharacteristicValue relates to a LearningDataset.
FeatureSet: The set of features a FeatureExtraction should generate over a LearningDatasetReference.
FeatureSetCharacteristic: Defines the set of metadata that describes the FeatureSet (e.g., number of features, features' type).
Software: Defines a collection of ML techniques' implementations (e.g., Scikit-Learn).
Algorithm: ML technique with no associated technology, software, or implementation (e.g., the k-means clustering technique).
Implementation: Defines the retrospective aspect of an Algorithm, i.e., an ML technique's implementation in a software (e.g., Scikit-Learn's k-means implementation).
ImplementationCharacteristicValue: Defines the implementation's set of metadata (properties and values), e.g., version, git hash.
LearningHyperparameter: Defines a prior parameter of an Algorithm used by a LearningStage or LearningStageSection.
LearningHyperparameterValue: Defines the parameter values of an execution (e.g., the k value in a k-means clustering technique, the range of epochs in a neural network training).
ModelSchema: The scope of the resulting model.
ModelProspection and Model: The resulting model a LearningStage or a LearningStageSection should generate, and the generated value (e.g., the trained model after the training stage).
ModelHyperparameter and ModelHyperparameterValue: Hyperparameters a LearningStage or a LearningStageSection generates, and their values corresponding to the resulting model (e.g., the epoch at which the resulting model was generated).
DataStoreInstance: Storage of the resulting model.
EvaluationMeasure and ModelEvaluation: A measure a LearningStage or a LearningStageSection should evaluate, and the generated value (e.g., the precision of a classifier model).
EvaluationSpecification and EvaluationProcedure: Classes directly inherited from MLS, with their semantics preserved.
In this section, we provide an experimental validation of the design principles to build and query the MLHolView to support the lifecycle of scientific ML in a real case study in the O&G industry. First, we explain how we implement and deploy the provenance system used in the evaluation (Sec. 4.1). Then, we show a running example of which data are captured during the execution of the workflows to answer the exemplary queries Q1–Q7 (Sec. 4.2). After that, we present performance and scalability analyses of the system (Sec. 4.3). Then, we discuss the benefits of PROV-ML both in terms of easing queries and query performance (Sec. 4.4). Finally, we conclude with lessons learned from this evaluation (Sec. 4.5).
ProvLake [44] is a provenance system capable of capturing, integrating, and querying data across distributed services, programs, scripts, and data stores used by multiple computational workflows, using provenance data management techniques [9], [22]. In this section, we explain how we implement these principles to enable ProvLake to build the MLHolView and how ProvLake is deployed to support the lifecycle in our case study.
ProvLake Architecture.
The ProvLake architecture is an implementation of the reference architecture (Fig. 6). Details about this architecture can be found in our previous work [22]. Here we give a brief summary, highlighting how its components map to the reference architecture proposed in this paper. The ProvLake Library (PLLib) [45] maps to the ProvLib. ProvTracker implements a simple queue management to receive the provenance capture events coming from the library and also implements a queue consumer, thus working both as the message broker and the provenance consumer in the reference architecture. ProvManager maps directly to its namesake in the reference architecture, and the PolyProvQueryEngine is the component for building the provenance queries and sending them to the DBMS managing the MLHolView. As described in principle SDP 5, the workflows are specified using prospective provenance data stored as configuration files. Data transformations that are specific and standard in ML workflows, e.g., training, validation, and evaluation, are defined beforehand following the conceptual data schema for the key concepts (DDP 5) and PROV-ML (Sec. 3.3) for attributes such as hyperparameters and model evaluation attributes. ProvTracker uses the specified prospective provenance data to provide the tracking by creating the relationships of the retrospective provenance data continuously sent by the PLLib calls added to the workflows. ProvTracker gives unique identifiers (SDP 4) to every data value captured, and when there are data references (e.g., references to files or identifiers in a database table, or any analogous data reference), it creates a knowledge graph relationship between the data value and the data store [9]. ProvManager transforms the captured data into RDF triples (the data model of the DBMS in use by ProvLake in this implementation) following the PROV-ML ontology (when capturing data in the learning phase) and the PROVLake ontology (when capturing data in the previous phases of the lifecycle).
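A hedged sketch of this triple generation for one captured training execution. The prefix and predicate names below are illustrative placeholders; the actual PROV-ML ontology is published online [44].

```python
# Sketch: turning one captured training execution into RDF-style triples
# in the spirit of PROV-ML. EX and the predicate names are assumptions.

EX = "http://example.org/provml/"

def training_execution_triples(run_id, hyperparams, measures, model_uri):
    run = EX + "TrainingExecution/" + run_id
    triples = [
        (run, EX + "type", EX + "TrainingExecution"),
        (model_uri, EX + "wasGeneratedBy", run),
    ]
    for name, value in hyperparams.items():
        # retrospective LearningHyperparameterValue counterpart
        triples.append((run, EX + "hasLearningHyperparameterValue",
                        "{}={}".format(name, value)))
    for name, value in measures.items():
        # retrospective ModelEvaluation counterpart
        triples.append((run, EX + "hasModelEvaluation",
                        "{}={}".format(name, value)))
    return triples

triples = training_execution_triples(
    "188",
    hyperparams={"learning_rate": 0.01, "batch_size": 60},
    measures={"loss": 0.0022},
    model_uri=EX + "Model/model188")
```

Once stored in the knowledge graph DBMS, such triples are what the PolyProvQueryEngine traverses to answer the integrated queries.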
ProvLake Deployment in the Case Study.
The deployment in our case study also follows the system design principles (Sec. 3.2). It uses two clusters: a Kubernetes cloud cluster for the data curation and learning data preparation workflows, and a large HPC cluster with CPUs and GPUs for the workflows in the learning phase. PLLib is the only component in direct contact with the users' workflows running in the clusters (SDP 3). This deployment is illustrated in detail in our previous paper [22].
Hardware Setup.
The experiments use three environments. An HPC cluster for learning workflows, which has 393 Intel and Power8 nodes, each with 24 to 48 CPU cores and 256 to 512 GB RAM, interconnected via InfiniBand, sharing about 3.45 PB in a GPFS, and using in total 946 GPUs (NVIDIA Tesla K40 and K80, with 2880 and 4992 CUDA cores, respectively); a Kubernetes cloud cluster for data processing, which has 4 nodes, each with 16 GB RAM and 8 cores; and a server machine with an Intel Core i7-7700T CPU at 2.40 GHz, 8 GB DDR4 RAM, and a 128 GB Liteon SSD.
Software Setup.
ProvManager, PolyProvQueryEngine, and the provenance DBMS are deployed on a virtual Kubernetes cluster with two nodes, each with 4 vCores and 16 GB RAM, virtualized on top of the data processing cluster. ProvManager's queue size is set to 50, and ProvTracker's threads are set to 120. The workflow scripts of our use case are implemented in Python using multiple libraries, such as libraries to manipulate raw seismic files and for learning (PyTorch v1.1), and execute on the learning cluster. For the query performance tests, we deployed three different DBMSs on the server machine: Apache Jena TDB 3.12, Allegro 6.6.0, and Blazegraph 2.1.5.
In this section, we investigate whether our approach supports the lifecycle by enabling users to perform comprehensive, i.e., end-to-end, analyses that integrate the data consumed and generated in the workflows, from raw domain data to learned models. More specifically, we investigate whether the proposed data design principles (Sec. 3.1) can be applied to answer queries that perform such data integration. We explore the O&G use case described in Section 2 and validate whether the data tracked by ProvLake, inserted in the MLHolView implementing the PROV-ML representation, can answer the queries Q1–Q7. Fig. 8 shows the phases of the lifecycle in this use case. Next, we describe the workflows of the use case and how ProvLake tracks the data.
[Figure: for each lifecycle step (Data Curation; Learning Data Preparation; Learning), the workflows, the data stores used (file system, document DBMS, triple store), and examples of captured data: raw data extraction metadata (file reference /data/netherlands.sgy, file size, inline/crossline ranges, geographic coordinates), data transformation parameters (curated and annotated data references, seismic slice ranges, noise threshold, tile size), data reference tracking (input/output file references, document identifiers, instance URIs, data store metadata and access information), training hyperparameters (batch_size: 60, max_epochs: 300, learning_rates: [0.01, 0.001]), model hyperparameters (learning_rate: 0.01, epoch: 188), execution data (cluster name, host nodes, job id, start/end times), and evaluation measures (confusion matrix, loss: 2.2e-3, mean IoU: 7.22e-1).]
Fig. 8: Summarized example of provenance tracking in an O&G use case. Details on the captured data, contents, stores, and the dataflow used to answer the queries Q1–Q7 are in Table 3.

In the data curation phase, ProvLake tracks provenance while data-intensive scripts run. When processing raw files, essential data that will help answer the queries are extracted, associated with the file's URI, and stored in the provenance database. One example of such data is the geographic coordinates embedded in raw SEG-Y seismic files. Additionally, geoscientists add relevant information, based on their specialized knowledge, as input to some of those scripts, to be loaded into a domain-specific knowledge graph database, external to the provenance database but also tracked by ProvLake through links between the workflows and this domain knowledge in the graph. Relevant information includes associated oil fields, basins, oil wells, and pieces of text from PDF documents with survey information related to the geological data.

TABLE 3: Details about the captured data in the use case.
Data structurename Description Data Characteristics and Size Data Store
Geoscientist’sAnnotations Observations they do about the seismic dataset,such as its geographic global coordinates and char-acteristics about the subsurface terrain this seismicacqusition was obtained. Also, they relate the seis-mic datasets with other artifacts of interest, such aswell logs and geological basins. Semi-structured textual files Textualdocumentsin the filesystemStructured domainknowledge Domain-specific information parsed from unstruc-tured and semi-structured documents and repre-sented as structured facts in domain ontologies. En-tities in such ontologies may represent taxonomy,rules and assertions for a given domain. Stored as domain-specific knowledge graphs in aKnowledge Base, typically managed by a TripleStore Triple storeGeological LabeledData Tabular text files, where each line contains x, ypositions (floar32, float32) on Earth surface, anddepth (float32) that can be in distance or time. N x · N y · N h · bytes, where N x and N y are the num-ber of points in x and y directions, respectively. N h is the number of annotated horizons. File systemPost-stacked SEGYfile A binary file containing N x × N y stacked tracesof one particular seismic attribute, e.g. amplitude,coherence, frequency, phase. The file also includesa main header and several trace headers. H main + N x · N y · ( H trace · T size ) , where the mainheader ( H main ) takes approximately 10KB; thetrace header uses 240 bytes; and each trace containsone float32 value for each point in depth. For exam-ple, if the seismic is a volume × × ,besides the headers, it will contain × traceseach of which comprising float32 values. File systemCurated and anno-tated seismic data Merged expert annotations and the SEGY raw file.It comprises the structured knowledge about thegeological data and also cube geometry, such as in-line and crossline ranges, resolution, depth range,and unit. The expert informs which parts of theinput file are suitable or not for the task. 
Finally,it may contain legal and access information. Withthis data, it is possible to set next phase hyperpa-rameters. Stored as structured data in a combination of Doc.DBMS and Knowledge Bases with references tothe Doc. DBMS. The Doc. DBMS has hundreds ofgigabytes and the Knowledge Base has hundredsof megabytes. Triple Storeand Doc.DBMSTraining, validation,and evaluation datasets Binary files stored in HDF5 or using Google’sProtocol buffer serialization for a good balancebetween portability and speed. These files may vary a lot, depending on the con-figuration selected for data preparation workflows.From our experimental observations, it takes about10% or less of the input SEGY file. However, be-cause workflows create data sets by experimentconfiguration, it is possible to end up with a totaldata set storage multiple times bigger than theoriginal raw file. CloudObjectStore andFile systemLearned models Mix of binary and configuration files depending onthe engine used to run the learning phase (PyTorch,Tensorflow, Scikit-learn, etc.). The engine used to run defines trained models’type and size. Since we used Tensorflow backendin our experiments, we store our trained modelsusing Tensorflow’s tools, where each experimentproduces configuration and binary files. The firstone stores the model structure and other trainingparameters, and the second one stores the model’sstate. Although model size can vary from a fewMB to several GB, our models used approximately50MB per state in our experiments. Notice that onestate is just one snapshot of one step during train-ing, so if depending on the configuration settings,it is possible to have several saved states, the 50MBmay turn GB very quickly. File system acquisition process. 
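The size estimate for the post-stacked SEGY file in Table 3 can be sketched as a small helper. The header constants (≈10 KB main header, 240-byte trace headers, 4-byte float32 samples) come from the table; combining the per-trace header and payload additively is an assumption about the garbled formula, not a statement of the SEG-Y standard:

```python
# Sketch of the post-stacked SEGY size estimate from Table 3 (assumed
# additive combination of trace header and trace payload).
FLOAT32 = 4          # bytes per float32 sample
H_MAIN = 10 * 1024   # ~10 KB main header
H_TRACE = 240        # bytes per trace header

def segy_size_bytes(nx, ny, nh):
    """Estimated size of a post-stacked SEGY volume with nx*ny traces,
    each holding nh float32 depth samples."""
    trace_payload = nh * FLOAT32
    return H_MAIN + nx * ny * (H_TRACE + trace_payload)
```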
These annotations are stored in triple stores in a domain-specific database, external to the provenance database.

The learning data preparation phase includes several data transformations in a pipeline that converts the curated and annotated scientific data into training, validation, and evaluation datasets. Each transformation has parameters that specify, for instance, noise filter thresholds, input shape, or the selected seismic lines (inlines or crosslines) of the seismic cube that constitute the training dataset. Each value of these parameters, the name of the transformation, the execution data, and the references to input and output data are captured and represented in ProvLake's provenance data graph.

The entire process is interconnected: each phase produces data and passes it forward for consumption by the next one. Essentially, ProvLake tracks and maintains such interconnections in a provenance data graph composed of RDF triples. Such structures describe chained data transformations in the multiple workflows that constitute the inner phases of the major ones of the lifecycle run. RDF resources represent the data in Fig. 8, i.e., instances that extend prov:Entity and PROV-ML specializations. Each of these instances receives a URI, which works as a global identifier throughout the lifecycle (DDP4). Examples of RDF resources are learned models produced in the learning phase, models' hyperparameters, evaluation metrics, and references (file paths) to the actual model files stored in the file system. Provenance data graphs also associate execution data with learned models. Execution data may include file system metadata, the cluster's hostname and the node names used in the HPC jobs, job ids in the cluster scheduler, or start and end timestamps of each block of provenance capture events.

ProvLake can keep track of data distributed in multiple stores.
Such ability helps to maintain data relationships between raw files in the file system and structured knowledge stored in another database. Auxiliary data, such as polygons in the seismic cube, are stored in the Document DBMS. The system similarly tracks data references and relates them to the raw files. Other data, such as implementation details, software name, and version, are captured and stored in the provenance database, following PROV-ML, but, for simplicity, we do not show them in the figure. Finally, since the system tracks every data item and their relationships while the workflows execute, ProvLake enables answering online, offline, intra- and inter-training provenance queries to analyze ML data, domain-specific data, and execution data throughout the phases of the lifecycle, exemplified by the queries Q1–Q7.

To submit queries, the user sends a GET or POST request to one of PolyProvQueryEngine's endpoints. Then, PolyProvQueryEngine sends requests to ProvManager. Most of the queries are answered with simple graph traversals using standard SPARQL features. For instance, to answer Q1, the user provides a learned model URI (generated in the learning phase), and the query traverses the provenance data graph backward until reaching the raw seismic file's URI (processed in the data curation phase). One can get the geographic coordinates and the number of seismic slices by querying the extracted data related to the seismic file. In turn, to obtain the oil basin and oil field information, the query retrieves data from the resource, in the Triple Store, that represents structured knowledge about the seismic file. For Q2 and Q6, one can execute a similar graph traversal. Other queries require analytical operators, such as Q3, which requires finding the learned model with the least loss (using SPARQL's native min() operator) and returning its hyperparameters.
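As an illustration of the Q1-style backward traversal, the sketch below walks a toy provenance graph from a model's URI back to the raw file's URI. The URIs, edge labels, and dictionary layout are hypothetical stand-ins, not ProvLake's actual data; a real implementation runs SPARQL over the RDF graph.

```python
from collections import deque

# Toy provenance graph (hypothetical URIs): child -> parents, following
# prov:wasGeneratedBy / prov:used / prov:wasDerivedFrom edges backward.
EDGES = {
    "ex:model188": ["ex:training_exec_1"],
    "ex:training_exec_1": ["ex:train_dataset"],
    "ex:train_dataset": ["ex:curated_seismic"],
    "ex:curated_seismic": ["ex:raw_segy"],
}
TYPES = {"ex:raw_segy": "RawFile"}

def trace_to_raw_file(model_uri):
    """Walk backward through provenance edges until an entity typed
    RawFile is reached, mimicking Q1's SPARQL graph traversal."""
    queue, seen = deque([model_uri]), set()
    while queue:
        node = queue.popleft()
        if TYPES.get(node) == "RawFile":
            return node
        for parent in EDGES.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None
```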
Q4 and Q5 make use of execution data to provide basic statistics (min(), max(), avg() operators) about the execution time of training iterations, and Q7 retrieves the models, their hyperparameters, their evaluation measures, and the minimum batch loss per model generated when a specific learning dataset was used.

In our use case for training an autonomous identifier of geological structures (cf.
Sec. 2), the learning phase generates a large amount of provenance data at a high frequency, stressing the ProvLake services. In the deep learning model training, there are two provenance capture calls (at the beginning and end) of each batch iteration in each learning epoch. In this test, each learning workflow executes about 35 iterations per learning epoch and up to 300 epochs, generating about 15,000 provenance capture events per workflow run. ProvTracker runs on one node in the learning cluster with 24 CPU cores, whereas the learning workflows run in parallel, distributed on up to 8 nodes, each with 28 Intel CPU cores and 6 GPUs (K80). While running the workflows, PLLib captures data at runtime and sends them to ProvTracker, which in turn sends them to the ProvManager service, deployed externally on the virtual Kubernetes cluster, which finally stores them in the Prov DBMS. A provenance capture overhead analysis of ProvLake using synthetic workloads to highly stress the system, and a comparison with a competing system, has been presented in previous work [9]. Here, we evaluate the system design principles that focus on providing distributed capture control and a scalable architecture (
SDP1–SDP3). We test different settings for provenance capture and then test the scalability using real ML workloads, in both cases measuring the overall execution time of the learning workflow script. We repeat each test at least 10 times; we plot boxplots of the repetitions, and the numeric values used in-text refer to the medians of the repetitions.
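The overhead computation behind the numbers reported in this section can be sketched as follows; the run times below are illustrative values chosen to mirror the magnitudes discussed here, not the paper's raw measurements.

```python
import statistics

# Sketch of the overhead computation: the median of repeated runs with
# provenance capture is compared against the no-provenance baseline.
def overhead_pct(baseline_runs_s, prov_runs_s):
    """Relative overhead (%) of the median provenance-enabled run time."""
    base = statistics.median(baseline_runs_s)
    prov = statistics.median(prov_runs_s)
    return 100.0 * (prov - base) / base
```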
Varying Provenance Capture Settings.
The PLLib allows customizing provenance capture settings, such as the queue size and whether the provenance capture events should be persisted to the local disk rather than sent to ProvTracker. Then, if disk-only is not specified, when the scripts execute, provenance data are captured and sent to ProvTracker.

For a baseline, we first execute the training without any provenance capture; then we vary the queue size in PLLib (i.e., the number of provenance capture requests accumulated in PLLib), diskless vs. diskful (i.e., whether provenance data are saved in a log file on disk), and online vs. offline (i.e., whether provenance data are stored in the DBMS, available for online provenance queries during the execution). As for the training datasets, we use a curated and labeled real seismic dataset with a specific range of seismic slices (corresponding to a regional section of a seismic cube) defined by the model trainer. The results are in Fig. 9(a), where the fastest result is for Queue Size = 50, Diskless, Online (Setting D). Compared with the setting with no provenance capture, the added execution overhead in this case is only 8.6 seconds on top of 21.3 minutes, i.e., 0.67%, which is considered negligible.

To analyze the queue size, we compare Settings A–C with D–F and see that larger queues provide faster provenance capture, since there is less, but larger, communication with the ProvTracker service. For instance, Setting A is about 7% slower than D. However, very large queues have drawbacks, as they introduce higher latency between the event being captured in the workflow execution and the provenance record being stored in the database, caused by the retention of provenance capture events in PLLib's queue.
Nevertheless, for the settings with queue size 50 (D–F), the latency of less than 5 seconds between the actual occurrence of the event and its provenance being registered in the database, available for queries, can be considered near real-time and good enough even for training monitoring. To analyze diskless vs. diskful settings, we compare Setting A with B and C, and D with E and F. Diskless is faster than diskful, as the latter introduces more I/O operations at runtime. However, comparing only the medians, the difference is negligible (less than 0.1%). Thus, because of the higher fault tolerance provided by a diskful setting, it may be useful to append provenance data onto a file on disk, locally in the cluster where the workflow runs. Similarly, comparing the medians, we observe that the difference between online vs. offline (e.g., Setting B vs. C or E vs. F) is also small, about 1%. Therefore, despite (D) being the fastest setting, (E) may be preferred because its performance is nearly the same as (D) and it has the advantage of backup storage for provenance data, which is quite important as provenance is used for reproducibility.

Fig. 9: Performance analysis results. Figure 9(a) shows the variation of provenance capture settings (A: QSize 1, Diskless, Online; B: QSize 1, Diskful, Online; C: QSize 1, Diskful, Offline; D: QSize 50, Diskless, Online; E: QSize 50, Diskful, Online; F: QSize 50, Diskful, Offline), where Setting D adds 0.67% overhead. Figure 9(b) shows the scalability results: near-linear scalability with up to 48 GPUs and 228 CPUs.
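The queue-size tradeoff just discussed can be sketched with a simple buffered sender. The class and its interface below are hypothetical illustrations, not PLLib's actual API: events accumulate in a queue and are sent in one request per flush, so larger queues mean fewer, larger messages but higher latency for the last buffered events.

```python
# Hypothetical sketch of queue-based provenance capture buffering.
class CaptureBuffer:
    def __init__(self, queue_size, send):
        self.queue_size = queue_size
        self.send = send          # callable taking a batch of events
        self.queue = []

    def capture(self, event):
        self.queue.append(event)
        if len(self.queue) >= self.queue_size:
            self.flush()

    def flush(self):
        if self.queue:
            self.send(self.queue)
            self.queue = []

batches = []
buf = CaptureBuffer(50, batches.append)
for i in range(15_000):           # ~15,000 capture events per workflow run
    buf.capture({"event": i})
buf.flush()                       # drain any remaining events at the end
# 15,000 events with queue size 50 -> 300 requests instead of 15,000
```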
Scalability Analysis.
In this experiment, we want to confirm whether the execution strategies on an HPC cluster keep the overhead low in a real ML workload, running multiple learning workflows in parallel. We run a weak scalability test by increasing the number of processing units while increasing the data size. We use the fastest setting of the previous experiment (i.e., D) and the same seismic cube. To set up the training datasets, the trainer selects up to 8 different sets of seismic slices, where each set has the same length (i.e., nearly the same data size). Thus, for x ∈ {1, 2, 4, 8}, there are x workflows running on x nodes in parallel, summing 28x Intel CPU cores and 6x GPUs (with their CUDA cores), using in total an input dataset of size x · datasize, where datasize is the size of a dataset formed by one set of seismic slices. The results are in Fig. 9(b), where we illustrate linear scalability as a horizontal line passing through the median of the smallest setting (x = 1). Ideally, the medians should be near this line. If they are not, it means that ProvTracker is taking too long to answer, caused by high stress in the system due to too many provenance capture requests, adding latency to the training. However, we see that even in the largest setting (i.e., x = 8), the execution time remains close to the linear curve. The boxes remain within a small margin of 0.2 min (or 0.9% of the x = 1 median), between 21.4 and 21.6 min, meaning that the system delivers constant and predictable behavior even at larger scales. We note, though, that the variance grows with the scale, caused by the larger number of parallel tasks. Therefore, we conclude that, at least for this scale (up to 48 K80 GPUs), the provenance capture system delivers good scalability.

In this experiment, we analyze the benefits of PROV-ML, both qualitatively and quantitatively.
We begin with a qualitative comparison of queries that use PROV-ML, highlighting its expressiveness and the complexity of building queries with and without PROV-ML. Then, we provide a quantitative analysis to investigate whether using PROV-ML can help accelerate queries and, if so, by how much. Among the queries Q1–Q7, we select three to compare in detail: Q1, Q5, and Q7. The reason for this choice is that they increase in complexity and in how much they make use of the concepts modeled specifically in the PROV-ML ontology (i.e., emphasis on the learning phase, cf.
Sec. 3.3). Q1 is the simplest query and makes the least use of PROV-ML-specific concepts, Q7 is the most complex query with the heaviest use of PROV-ML, and Q5 is in between. We write the selected queries both with and without the PROV-ML ontology (written in OWL) using SPARQL 1.1. The query complexity stems from the number of clauses to filter, the patterns to match in the graph traversal, the aggregations and sorting, and the number of triples that satisfy the patterns to match; the number of clauses that use the PROV-ML ontology defines how much each query makes use of it.
Qualitative comparison.
Since Q7 is the most complex query and makes heavy use of PROV-ML, it helps us to illustrate whether PROV-ML eases query building, especially when there is heavy use of Learning phase concepts. Excerpts of Q7 in SPARQL with and without PROV-ML are available in Listings 1 and 2, respectively. Comparing both, since PROV-ML has specialized concepts for the Learning phase, it requires fewer clauses to express the same concept. For instance, to match triples in the training stage only, with PROV-ML we just write one clause, giving ?training its direct type, to determine the correct stage (Lst. 1). Without PROV-ML, the only resource we have to do this is to tag the data transformations that are related to training. In the ProvLake ontology, tagging of workflows, data transformations, and attributes is possible with the property provlake:tag, but since naming, schema definitions, and tagging are available only in the prospective part, we need three more clauses: one to relate the retrospective instance with its prospective instance, another to give the prospective instance its type, and a third to qualify this instance via its tag (Lst. 2). The same applies, e.g., for model evaluation (Lst. 1 vs. Lst. 2).

Listing 1: Excerpt of Q7 with PROV-ML.

    # Training stage
    ?training a provml:TrainingExecution .
    # Epoch iteration (training section)
    ?epoch_exec_training prov:wasInformedBy ?training ;
        a provml:TrainingSectionExecution .
    # Model hyperparameters
    ?epoch_training_hyperparam prov:wasGeneratedBy ?epoch_exec_training ;
        a provml:ModelHyperparameterValue ;
        prov:value ?epoch_training_hyperparam_v ;
        prov:wasDerivedFrom ?epoch_training_hparam_psp .
    ?epoch_training_hparam_psp a provml:LearningHyperparameterSetting ;
        rdfs:label ?epoch_training_hyperparam_name .
    # Model
    ?model_training prov:wasGeneratedBy ?epoch_exec_training ;
        a provml:Model .
    # Model evaluation
    ?model_training_eval prov:wasGeneratedBy ?epoch_exec_training ;
        a provml:ModelEvaluation ;
        prov:value ?model_training_eval_value .

Listing 2: The same excerpt of Q7, without PROV-ML.

    # Training stage
    ?training a provlake:DataTransformationExecution ;
        prov:wasInfluencedBy ?training_prosp .
    ?training_prosp a provlake:DataTransformation ;
        provlake:tag "Training" .
    # Epoch iteration (training section)
    ?epoch_exec_training prov:wasInformedBy ?training ;
        a provlake:DataTransformationExecution ;
        prov:wasInfluencedBy ?epoch_exec_training_psp .
    ?epoch_exec_training_psp a provlake:DataTransformation ;
        rdfs:label "Epoch Execution" .
    # Model hyperparameters
    ?epoch_training_hyperparam prov:wasGeneratedBy ?epoch_exec_training ;
        a provlake:AttributeValue ;
        prov:value ?epoch_training_hyperparam_v ;
        prov:wasDerivedFrom ?epoch_training_hparam_psp .
    ?epoch_training_hparam_psp a provlake:Attribute ;
        provlake:tag "Hyperparameter" ;
        rdfs:label ?epoch_training_hyperparam_name .
    # Model
    ?model_training a provlake:AttributeValue ;
        prov:wasGeneratedBy ?epoch_exec_training ;
        prov:wasDerivedFrom ?model_training_prosp .
    ?model_training_prosp a provlake:Attribute ;
        provlake:tag "Model" .
    # Model evaluation
    ?model_training_eval prov:wasGeneratedBy ?epoch_exec_training ;
        a provlake:AttributeValue ;
        prov:value ?model_training_eval_value ;
        prov:wasDerivedFrom ?model_training_eval_psp .
    ?model_training_eval_psp a provlake:Attribute ;
        provlake:tag "Model Evaluation" .

Therefore, we found that, when the parts of the query do not demand prospective provenance data, one needs to write three extra clauses when not using PROV-ML (a clause to relate the retrospective with the prospective instance, another to give the prospective instance its type, and a third to qualify this instance, often using tags or labels). However, when the query demands prospective provenance data, one needs only one extra clause (to qualify the instance), because the relationship and the types will be required regardless of whether PROV-ML is used.
These observations are summarized in Table 4. We verified the same behavior in queries Q1 and Q5. Thus, we conclude that PROV-ML's ability to qualify specific ML data transformations and attributes using direct types eases query building, as it reduces the number of clauses required to express ML-specific concepts compared with a data representation that does not use PROV-ML. A reduction of one to three clauses per query part was observed in all queries analyzed.

TABLE 4: Qualitative comparison of Q7 in terms of the number of clauses with and without PROV-ML.

Q7 query part | # clauses with PROV-ML | # clauses without PROV-ML
Training stage | 1 | 4
Epoch iteration | 2 | 5
Model hyperparameters | 6 | 7
Model | 2 | 5
Model evaluation | 3 | 6

Quantitative comparison.
We discussed in Section 3.1 that certain data design principles followed by our approach might accelerate queries that make use of the defined concepts, and it is known that design choices when modeling an ontology may impact query performance. In fact, a recent work evaluates schema optimization to speed up queries in knowledge graphs [38], showing that this is still a relevant topic to be investigated. Therefore, in addition to the qualitative gains discussed previously, we conduct a quantitative evaluation experiment with the goal of verifying how much (if at all) PROV-ML impacts query performance.

We generate two synthetic datasets that mimic the real use case evaluated in Sections 4.2 and 4.3. With the synthetic datasets, we can control experiment variables, such as the number of parallel learning workflows, the number of hyperparameters, model evaluation metrics, epochs, and batches per epoch, and we can generate one dataset that uses PROV-ML and another that does not. With these two synthetic datasets, it is easy to switch between with and without PROV-ML, rather than having to implement, deploy, and run the real learning workflows without PROV-ML, which would not make sense for the project goals. Both datasets are as follows: eight parallel learning workflows, each with the three stages (training, validation, and evaluation), 300 epochs, and 200 batches per epoch (i.e., 60,000 batches), where each batch is associated with batch losses and hyperparameters, and each epoch uses hyperparameters and generates models and model evaluations. In total, each dataset has 10,168,890 triples.

The performance impact depends on the number of clauses to be matched in the query and on the number of triples actually matched by the Triple Store DBMS. However, the performance also depends on the underlying DBMS that manages the MLHolView, since the DBMS might implement efficient indexing mechanisms, parallelism techniques, or data transformation strategies. Therefore, for this experiment, we analyze the three queries (Q1, Q5, Q7). Q1 does a simple graph traversal with simple pattern matching. Q5 does more complex graph traversals and needs to calculate aggregates (the average time difference per batch, per epoch), but for training stages only. Q7 also does complex traversals and needs to calculate aggregates (the minimum batch loss per epoch), but for the three stages (training, validation, and evaluation), in addition to listing hyperparameters and model performance. Since the choice of the underlying DBMS may impact the results, we analyze three different DBMSs: AllegroGraph, Blazegraph, and Jena TDB, running on the same hardware under the same conditions, with their default settings (no special fine-tuning is performed in any DBMS). We analyze the query execution time, measured at the requesting client by subtracting the timestamp taken immediately before sending the request from the timestamp taken immediately after the response arrives at the client. Results are in Figure 10, where we plot the medians of the query execution time over a hundred repetitions, or fewer when the confidence interval of the medians fell below 5%. The numeric values reported in-text also refer to the medians of the repetitions; we do not remove outliers; the heights of the confidence intervals are the error bars; and the Q7 results are in log scale.
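The client-side timing just described can be sketched as follows; the query callable is a stand-in for the real SPARQL endpoint request, not the actual client code.

```python
import statistics
import time

# Sketch of the client-side measurement: the timestamp is taken immediately
# before sending the request and immediately after the response arrives;
# the reported value is the median over repeated submissions.
def median_query_time_s(run_query, repetitions=100):
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()  # stand-in for submitting the SPARQL request
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```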
Fig. 10: Execution time comparison of queries using PROV-ML vs. queries without PROV-ML. Q7 is in log scale.

The results show that, for Q1, PROV-ML does not significantly impact query performance, since the queries using PROV-ML are only 1.17x and 1.05x slower in AllegroGraph and BlazeGraph, respectively, and the difference for Jena is within the error bars, i.e., not statistically significant. However, Q1 is the simplest query, with trivial graph traversals and little use of PROV-ML-specific concepts, and with query times of up to a hundred milliseconds (very fast queries). The DBMSs are likely spending more time doing data transfers than actually computing the query, which would explain the higher error bars for Q1 than for Q5 and Q7, for which the error bars are very small.

We draw a set of lessons learned after the practical experience of implementing the data and system design principles to support the lifecycle in a real deployment in an O&G industry case that uses heterogeneous environments, i.e., a Kubernetes cluster and a large HPC cluster with CPUs and GPUs. The key findings for the success of the experiments are the following:

(i)
Characterizing the lifecycle and identifying the main classes of data analysis using provenance allowed understanding of the different needs in scientific ML (Sec. 2). Particularly, it helped to understand the different personas driving the provenance capture to answer key online and offline, intra- and inter-training provenance queries. The queries were capable of analyzing ML data, domain-specific data, and execution data throughout the data curation, data preparation, and learning phases of the lifecycle in an integrated way. We observed that the data curation phase is the most complex. One needs to address it carefully to take advantage of domain-specific knowledge, which highly benefits trainers in the learning phase.

(ii)
Employing provenance tracking and a data representation that allow data integration of multiple workflows helped to address the highly heterogeneous nature of the lifecycle. To accomplish such integration, it was key to promote a holistic view of the lifecycle, end-to-end, which we called MLHolView, as described in the data design principle DDP1; it enabled the comprehensive data analyses (e.g., Q1–Q7), thus supporting the lifecycle, which is our main motivation. Due to this highly heterogeneous nature, the context awareness using domain-specific and ML data and knowledge, materialized in a knowledge graph leveraging provenance-based relationships (DDP2), enabled tracking, persisting, and querying interconnections between heterogeneous data with details about localization and data access. Furthermore, it enabled queries with rich semantics about the application domain and ML, exploring new data relationships that would not be possible without such context awareness.

(iii)
Designing a conceptual data schema focused on the key concepts enabled the design and implementation of the system, facilitating query building and the ML-specialized schema modeling. The key concepts are described in DDP5. This enabled query acceleration and facilitated query building for queries that make heavy use of ML-specific concepts, compared with a schema that does not have such specializations. Moreover, the focused schema was the basis for PROV-ML (Sec. 3.3), which served as the underlying schema for the provenance system. PROV-ML combines provenance of data lakes to address integration, embracing the heterogeneous nature of the lifecycle, with concepts for ML (DDP3). PROV-ML leverages W3C contributions for provenance, W3C PROV [23], and for ML, ML Schema [24]. We hope other systems with similar purposes can adopt such a representation.

(iv) The system design principles enabled data capture and integration in a highly heterogeneous and distributed setting, adding negligible overhead. Particularly, SDP1 and SDP2 provided the portability and flexibility needed in such deployments. The scalable strategies (SDP3) allowed the system to do this while incurring low overhead and high scalability, even in HPC workloads.
The interest in workflow provenance management has increased in recent years, driven by a major effort by the provenance community [46], [47], [48], [49], [50], [51], [31], [52], [53], [35], [54], [55], [56], [57], [58], particularly to explore possibilities of optimizing workflows with the data captured by provenance tools, and as a response to the urgent need for reproducible science, which is critical in scientific ML [59]. To exemplify, Thavasimani et al. [14] investigate provenance traces recorded during workflow executions to observe differences in results caused by minor workflow configuration differences. Other works have advanced provenance tracking techniques for heterogeneous data, stores, and environments [39], [36], [60], [61], and others have explored the intersection of provenance and blockchain [62], [63].

On the intersection between ML and provenance, other works have explored provenance to support ML workflows [17], [26], [64], [18], [19], [65], and Deelman et al. [66] characterized provenance analysis to leverage ML in support of scientific workflows. On reproducible ML models, another aspect that has become a focus of interest in the research community is the use of provenance as an essential tool to help create explainable artificial intelligence [20], [59]. In addition, some works addressed the gap between the experiments of an ML workflow execution and a standard representation to provide reproducible experiments [67], [68], [24]. Esteves et al. [67] provide a machine-readable vocabulary and a common schema for the reproducibility of ML experiments in various frameworks and workflow systems. Publio et al. [68] present a new ML data representation based on the MEX vocabulary [67] to improve processes in ML workflows, despite not having a clear separation between prospective and retrospective provenance. Samuel [69] proposes ProvBook, for the reproducibility of ML experiments using Jupyter notebooks, applying FAIR data principles. Moreno et al.
[70] proposed MLWfM to provide data concepts for ML and domain-specific awareness, but without provenance concepts and a data representation.

These works are important building blocks to support the lifecycle of scientific ML using provenance management techniques. Nevertheless, they still lack a holistic view capable of comprehensively integrating the data in the whole lifecycle, end-to-end, from raw domain data to learned models. Without such a holistic view, the ML-specific concepts cannot integrate with the specific concepts of the scientific domain, jeopardizing the comprehensive end-to-end analyses that require richer semantics about the domain integrated with rich semantics about ML.

In this work, we aimed at enabling scientists and engineers to perform comprehensive data analyses in the lifecycle of scientific ML. We proposed workflow provenance techniques to address the problem of dealing, in an integrated and comprehensive way, with the high heterogeneity of the different contexts (e.g., data, software, environments, personas) involved in the lifecycle, to enable such analyses. We proposed modeling the workflows in all phases of this lifecycle as multiple interconnected workflows. A holistic view of the data processed in these workflows is built as the workflows execute. In this way, the collaborating teams can use it as their primary source for data analyses that integrate everything from raw data to learned ML models. We called it the
Provenance-based Holistic Data View of the Lifecycle of Scientific ML (MLHolView). It is materialized as a knowledge graph with provenance-based relationships. It is aware of the contexts of the data transformations in the workflows, their (hyper)parameterizations and model metrics, the computational environments in which they run and the data stores they use, the involved personas, and how they interact with the workflows.

To be able to build this view, aware of these many dimensions of heterogeneity, we first characterized the lifecycle and proposed a taxonomy for the classes of data analyses (e.g., data, execution timing, and training timing). Then, we proposed design principles for the effective and efficient management of provenance data from these workflows. From this understanding and these design principles, we derived the PROV-ML data representation, promoting such a holistic view of the data in the workflows of the lifecycle, which is the first one to the best of our knowledge. We also proposed system design principles and a reference system architecture to provide the view with efficient provenance capture, adding negligible data capture overhead (0.67% in our experiments).

REFERENCES

[1] J. Hesthaven and G. Karniadakis. Scientific machine learning workshop. [Online]. Available: https://icerm.brown.edu/events/ht19-1-sml
[2] Y. Gil, S. A. Pierce, H. Babaie, A. Banerjee, K. Borne, G. Bust, M. Cheatham, I. Ebert-Uphoff, C. Gomes, M. Hill, J. Horel, L. Hsu, J. Kinter, C. Knoblock, D. Krum, V. Kumar, P. Lermusiaux, Y. Liu, C. North, V. Pankratius, S. Peters, B. Plale, A. Pope, S. Ravela, J. Restrepo, A. Ridley, H. Samet, and S. Shekhar, "Intelligent systems for geosciences: an essential research agenda,"
CACM, 2018.
[3] M. Raissi, P. Perdikaris, and G. Karniadakis, “Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” J. Comp. Physics, 2019.
[4] E. Rodrigues, I. Oliveira, R. Cunha, and M. Netto, “DeepDownscale: a deep learning strategy for high-resolution weather forecast,” in IEEE eScience, 2018.
[5] D. S. Chevitarese, D. Szwarcman, E. V. Brazil, and B. Zadrozny, “Efficient classification of seismic textures,” in IJCNN, 2018.
[6] M. Mattoso, C. Werner, G. Travassos, V. Braganholo, E. Ogasawara, D. de Oliveira, S. Cruz, W. Martinho, and L. Murta, “Towards supporting the life cycle of large-scale scientific experiments,” IJBPIM, 2010.
[7] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “ModelHub: Deep learning lifecycle management,” in ICDE, 2017.
[8] S. Schelter, J.-H. Boese, J. Kirschnick, T. Klein, and S. Seufert, “Automatically tracking metadata and provenance of machine learning experiments,” in MLS@NIPS, 2017.
[9] R. Souza, L. Azevedo, R. Thiago, E. Soares, M. Nery, M. A. S. Netto, E. V. Brazil, R. Cerqueira, P. Valduriez, and M. Mattoso, “Efficient runtime capture of multiworkflow data using provenance,” in IEEE eScience, 2019.
[10] N. Polyzotis, S. Roy, S. Whang, and M. Zinkevich, “Data lifecycle challenges in production machine learning: a survey,” SIGMOD Rec., 2018.
[11] M. Herschel, R. Diestelkämper, and H. B. Lahmar, “A survey on provenance: What for? what form? what from?” VLDB, 2017.
[12] L. Moreau, B. Ludäscher, I. Altintas, R. S. Barga, S. Bowers, S. Callahan, G. Chin Jr., B. Clifford, S. Cohen, S. Cohen-Boulakia, S. Davidson, E. Deelman, L. Digiampietri, I. Foster, J. Freire, J. Frew, J. Futrelle, T. Gibson, Y. Gil, C. Goble, J. Golbeck, P. Groth, D. A. Holland, S. Jiang, J. Kim, D. Koop, A. Krenek, T. McPhillips, G. Mehta, S. Miles, D. Metzger, S. Munroe, J. Myers, B. Plale, N. Podhorszki, V. Ratnakar, E. Santos, C. Scheidegger, K. Schuchardt, M. Seltzer, Y. L. Simmhan, C. Silva, P. Slaughter, E. Stephan, R. Stevens, D. Turi, H. Vo, M. Wilde, J. Zhao, and Y. Zhao, “Special issue: The first provenance challenge,” CCPE, 2008.
[13] P. Buneman and W.-C. Tan, “Data provenance: What next?” SIGMOD Rec., 2019.
[14] P. Thavasimani and P. Missier, “Facilitating reproducible research by investigating computational metadata,” in IEEE Big Data, 2016.
[15] R. Souza, V. Silva, J. J. Camata, A. Coutinho, P. Valduriez, and M. Mattoso, “Keeping track of user steering actions in dynamic workflows,” FGCS, 2019.
[16] V. Sousa, D. Oliveira, P. Valduriez, and M. Mattoso, “Analyzing related raw data files through dataflows,” CCPE, 2016.
[17] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “Towards unified data and lifecycle management for deep learning,” in ICDE, 2017.
[18] M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar, “Accelerating the machine learning lifecycle with MLflow,” IEEE Data Eng. Bulletin, 2018.
[19] D. Pina, L. Kunstmann, A. Paes, D. Oliveira, and M. Mattoso, “Análise de hiperparâmetros em aplicações de aprendizado profundo por meio de dados de proveniência,” in SBBD, 2019.
[20] C. Lucero, B. Coronado, O. Hui, and D. S. Lange, “Exploring explainable artificial intelligence and autonomy through provenance,” XAI@IJCAI, 2018.
[21] M. Balazinska, S. Chaudhuri, A. Ailamaki, J. Freire, S. Krishnamurthy, and M. Stonebraker, “The next 5 years: what opportunities should the database community seize to maximize its impact?” in SIGMOD, 2020.
[22] R. Souza, L. Azevedo, V. Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. Brazil, M. Moreno, P. Valduriez, M. Mattoso, R. Cerqueira, and M. A. S. Netto, “Provenance data in the machine learning lifecycle in computational science and engineering,” in WORKS@Supercomputing.
SIGMOD, 2008.
[26] Z. Miao, Q. Zeng, B. Glavic, and S. Roy, “Going beyond provenance: explaining query answers with pattern-based counterbalances,” in SIGMOD, 2019.
[27] R. F. Silva, R. Filgueira, I. Pietri, M. Jiang, R. Sakellariou, and E. Deelman, “A characterization of workflow management systems for extreme-scale applications,”
FGCS, 2017.
[28] G. Guerra, F. A. Rochinha, R. Elias, D. De Oliveira, E. Ogasawara, J. F. Dias, M. Mattoso, and A. L. Coutinho, “Uncertainty quantification in computational predictive models for fluid dynamics using a workflow management engine,” Int. J. Uncertain. Quantif., 2012.
[29] R. Souza, V. Silva, A. L. Coutinho, P. Valduriez, and M. Mattoso, “Data reduction in scientific workflows using provenance monitoring and user steering,” FGCS, 2017.
[30] J. Pimentel, J. Freire, L. Murta, and V. Braganholo, “A survey on collecting, managing, and analyzing provenance from scripts,” ACM Surv., 2019.
[31] L. F. Sikos and D. Philp, “Provenance-aware knowledge representation: a survey of data models and contextualized knowledge graphs,” Data Sci. Eng., 2020.
[32] R. M. Thiago, R. Souza, L. Azevedo, E. F. D. S. Soares, R. Santos, W. Dos Santos, M. De Bayser, M. C. Cardoso, M. F. Moreno, and R. Cerqueira, “Managing data lineage of O&G machine learning models: the sweet spot for shale use case,” in EAGE Digital, 2020.
[33] R. Souza, A. Codas, J. A. Nogueira Junior, M. P. Quinones, L. Azevedo, R. Thiago, E. Soares, M. Cardoso, and L. Martins, “Supporting the training of physics informed neural networks for seismic inversion using provenance,” in AAPG, 2020.
[34] V. Silva, D. Oliveira, P. Valduriez, and M. Mattoso, “DfAnalyzer: runtime dataflow analysis of scientific applications using provenance,” VLDB, 2018.
[35] C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M. R. Crusoe, K. Peters, and D. Schober, “FAIR computational workflows,” Data Intelligence, 2019.
[36] D. Hu, D. Feng, Y. Xie, G. Xu, X. Gu, and D. Long, “Efficient provenance management via clustering and hybrid storage in big data environments,” IEEE Trans. on Big Data, 2019.
[37] D. Oliveira, V. Silva, and M. Mattoso, “How much domain data should be in provenance databases?” in TaPP, 2015.
[38] C. Lei, R. Alotaibi, A. Quamar, V. Efthymiou, and F. Özcan, “Property graph schema optimization for domain-specific knowledge graphs,” arXiv, 2020.
[39] I. Suriarachchi and B. Plale, “Crossing analytics systems: A case for integrated provenance in data lakes,” in IEEE eScience, 2016.
[40] P. Missier, B. Ludäscher, S. Bowers, S. Dey, A. Sarkar, B. Shrestha, I. Altintas, M. K. Anand, and C. Goble, “Linking multiple workflow provenance traces for interoperable collaborative science,” in WORKS@Supercomputing, 2010.
[41] F. Costa, V. Silva, D. Oliveira, K. Ocaña, E. Ogasawara, J. Dias, and M. Mattoso, “Capturing and querying workflow runtime provenance with PROV: a practical approach,” in EDBT/ICDT Workshops, 2013.
[42] R. Souza and M. Mattoso, “Provenance of dynamic adaptations in user-steered dataflows,” in IPAW, 2018.
[43] D. Koop, E. Santos, B. Bauer, M. Troyer, J. Freire, and C. T. Silva, “Bridging workflow and data provenance using strong links,” in SSDBM, 2010.
[44] ProvLake website. [Online]. Available: https://ibm.biz/provlake
[45] ProvLakeLib GitHub repository. [Online]. Available: https://github.com/IBM/multi-data-lineage-capture-py
[46] P. Thavasimani, J. Cała, and P. Missier, “Exploiting execution provenance to explain difference between two data-intensive computations,” in IEEE eScience, 2018.
[47] P. Missier, J. Bryans, C. Gamble, and V. Curcin, “Abstracting PROV provenance graphs: A validity-preserving approach,” FGCS, 2020.
[48] R. Lourenço, J. Freire, and D. Shasha, “BugDoc: algorithms to debug computational processes,” in
SIGMOD, 2020.
[49] C. Rajmohan, P. Lohia, H. Gupta, S. Brahma, M. Hernandez, and S. Mehta, “On efficiently processing workflow provenance queries in Spark,” in IEEE ICDCS, 2019.
[50] D. Garijo, Y. Gil, K. M. Cobourn, E. Deelman, C. Duffy, R. Ferreira da Silva, A. Kemanian, C. Knoblock, V. Kumar, S. D. Peckham, Y. Y. Chiang, D. Khider, A. Khandelwal, J. Pujara, V. Ratnakar, M. Stoica, B. Vu, and M. Pham, “Integrating models through knowledge-powered data and process composition,” AGU Fall Meeting, 2018.
[51] M. H. Namaki, A. Floratou, F. Psallidas, S. Krishnan, A. Agrawal, and Y. Wu, “Vamsa: tracking provenance in data science scripts,” arXiv, 2020.
[52] A. Spinuso, M. Atkinson, and F. Magnoni, “Active provenance for data-intensive workflows: engaging users and developers,” in IEEE eScience, 2019.
[53] T. Guedes, L. B. Martins, M. L. F. Falci, V. Silva, K. A. C. S. Ocaña, M. Mattoso, M. V. N. Bedo, and D. Oliveira, “Capturing and analyzing provenance from Spark-based scientific workflows with SAMbA-RaP,” FGCS, 2020.
[54] F. Magnoni, E. Casarotti, P. Artale Harris, M. Lindner, A. Rietbrock, I. A. Klampanos, A. Davvetas, A. Spinuso, R. Filgueira, A. Krause, M. Atkinson, A. Gemund, and V. Karkaletsis, “DARE to perform seismological workflows,” AGU Fall Meeting, 2019.
[55] K. Chard, N. Gaffney, M. Hategan, K. Kowalik, B. Ludäscher, T. McPhillips, J. Nabrzyski, V. Stodden, I. Taylor, T. Thelen, M. Turk, and C. Willis, “Toward enabling reproducibility for data-intensive research using the Whole Tale platform,” arXiv, 2020.
[56] T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R. K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K. W. Kintigh, T. A. Kohler, D. Koop, J. A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda, and B. Ludäscher, “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” IJDC, 2015.
[57] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, “noWorkflow: a tool for collecting, analyzing, and managing provenance from Python scripts,” VLDB, 2017.
[58] L. Rupprecht, J. C. Davis, C. Arnold, Y. Gur, and D. Bhagwat, “Improving reproducibility of data science pipelines through transparent provenance capture,” VLDB, 2020.
[59] M. Arnold, R. K. E. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilović, R. Nair, K. N. Ramamurthy, A. Olteanu, D. Piorkowski, D. Reimer, J. Richards, J. Tsay, and K. R. Varshney, “FactSheets: Increasing trust in AI services through supplier’s declarations of conformity,” IBM J. Research & Development, 2019.
[60] Y. Mendes, R. Braga, V. Ströele, and D. Oliveira, “PolyFlow: a SOA for analyzing workflow heterogeneous provenance data in distributed environments,” in SBSI, 2019.
[61] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, and P. C. Arocena, “Data lake management: challenges and opportunities,” VLDB, 2019.
[62] X. Liang, S. Shetty, D. Tosh, C. Kamhoua, K. Kwiat, and L. Njilla, “ProvChain: A blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability,” in CCGrid, 2017.
[63] P. Ruan, G. Chen, T. T. A. Dinh, Q. Lin, B. C. Ooi, and M. Zhang, “Fine-grained, secure and efficient data provenance on blockchain systems,” VLDB, 2019.
[64] A. Kumar, R. McCann, J. Naughton, and J. M. Patel, “Model selection management systems: the next frontier of advanced analytics,” SIGMOD Rec., 2016.
[65] R. Souza, L. Neves, L. Azeredo, R. Luiz, E. Tady, P. Cavalin, and M. Mattoso, “Towards a human-in-the-loop library for tracking hyperparameter tuning in deep learning development,” in LaDaS@VLDB, 2018.
[66] E. Deelman, A. Mandal, M. Jiang, and R. Sakellariou, “The role of machine learning in scientific workflows,” Int. J. HPC, 2019.
[67] D. Esteves, D. Moussallem, C. B. Neto, T. Soru, R. Usbeck, M. Ackermann, and J. Lehmann, “MEX vocabulary: a lightweight interchange format for machine learning experiments,” in ICSS, 2015.
[68] G. C. Publio, D. Esteves, A. Ławrynowicz, P. Panov, L. Soldatova, T. Soru, J. Vanschoren, and H. Zafar, “ML Schema: exposing the semantics of machine learning with schemas and ontologies,” in ICML, 2018.
[69] S. Samuel, F. Löffler, and B. König-Ries, “Machine learning pipelines: provenance, reproducibility and FAIR data principles,” arXiv, 2020.
[70] M. Moreno, V. Lourenço, S. Fiorini, P. Costa, R. Brandão, D. Civitarese, and R. Cerqueira, “Managing machine learning workflow components,” in