Data provenance, curation and quality in metrology
James Cheney, Adriane Chapman, Joy Davidson, Alistair Forbes
February 17, 2021 1:44 WSPC Proceedings - 9in x 6in
James Cheney
University of Edinburgh and The Alan Turing Institute. E-mail: [email protected]
Adriane Chapman
University of Southampton and The Alan Turing Institute. E-mail: [email protected]
Joy Davidson
Digital Curation Centre. E-mail: [email protected]
Alistair Forbes
National Physical Laboratory. E-mail: [email protected]
Data metrology – the assessment of the quality of data, particularly in scientific and industrial settings – has emerged as an important requirement for the UK National Physical Laboratory (NPL) and other national metrology institutes. Data provenance and data curation are key components of an emerging understanding of data metrology. However, to date provenance research has had limited visibility to or uptake in metrology. In this work, we summarize a scoping study carried out with NPL staff and industrial participants to understand their current and future needs for provenance, curation and data quality. We then survey provenance technology and standards that are relevant to metrology. We analyse the gaps between requirements and the current state of the art.
Keywords: provenance, curation
1. Introduction
Metrology, the science of measurement, underpins commerce and manufacturing in the UK and worldwide. The International System of Units (SI) provides a coherent foundation for the representation and exchange of measurement data, enabling interoperability and reproducibility in all scientific and technological domains. Concepts such as metrological traceability require a documented, unbroken chain of calibration experiments, each with a stated uncertainty. Confidence in measurement is provided through these traceability chains, which establish the provenance of the measurement results. As science, engineering and industry increasingly rely on large-scale data analysis, it is increasingly critical to adapt the principles of metrology to digital settings (sometimes called data metrology), to ensure confidence in the use of data. Provenance, curation, and data quality need to be better understood in this context. Over the past 15 years, there has been a great deal of academic research and development in different scientific communities concerned with data provenance, data curation, and data quality; however, there seems to be relatively little interaction or mutual awareness between this academic community and the metrology community.

Provenance is information that explains "where" some data came from, or "why" it was produced as the result of a database query. Provenance is "information that helps determine the derivation history of a data product. [It includes] the ancestral data product(s) from which this data product evolved, and the process of transformation of these ancestral data product(s)". Many overviews of provenance exist, from use in scientific computation and workflow systems, to types of provenance, and security over provenance.
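To make the role of traceability chains concrete: each calibration link carries a stated standard uncertainty, and, assuming independent (uncorrelated) errors, the combined standard uncertainty of the chain is the root sum of squares of the link uncertainties. A minimal sketch in Python; the chain and its values are invented for illustration:

```python
import math

def combined_uncertainty(chain):
    """Combine standard uncertainties along a traceability chain.

    Assumes independent, uncorrelated errors, so the combined
    standard uncertainty is the root sum of squares of the links.
    """
    return math.sqrt(sum(u ** 2 for _, u in chain))

# An illustrative chain from a primary standard down to a shop-floor gauge,
# each link with its stated standard uncertainty (arbitrary units).
chain = [
    ("primary standard", 0.001),
    ("reference standard", 0.005),
    ("working standard", 0.010),
    ("shop-floor gauge", 0.020),
]

u_c = combined_uncertainty(chain)
print(f"combined standard uncertainty: {u_c:.4f}")  # prints: combined standard uncertainty: 0.0229
```

Note that recording which instruments and calibrations make up such a chain, and when each link was last renewed, is exactly a provenance problem.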
Meanwhile, digital curation (alias data curation) 'involves maintaining, preserving and adding value to digital research data throughout its lifecycle... As well as reducing duplication of effort in research data creation, curation enhances the long-term value of existing data by making it available for further high quality research'. These definitions were originally proposed with the scientific research data context in mind, but apply equally to data in industry; henceforth we just say 'data'.

In 2017–18, we conducted a study initiated by NPL (the UK's National Measurement Institute), interviewing 22 participants from NPL and industry partners in sectors such as energy & environment, life sciences and health, and advanced manufacturing, all with interests in this area, to analyze the needs of the metrology community. This paper summarizes the key findings of this study, and is intended as a guide to existing resources for readers who need Context and Understanding of where data came from; better information for Curation and Reuse; to Identify Good Practices; to perform Integrity checks on the data; to provide Interoperability; to facilitate Linking entities; to understand data Quality; to enable Reproducibility in research; and to assess Uncertainty. We analyze a large number of use cases drawn from across the metrology domain, and identify their needs.

The contributions of this work include:
(1) A brief survey of provenance and available tools that allow the metrology community to immediately use concepts and tools within their domain, in Section 2.
(2) Identification of existing data curation and provenance standards that are appropriate for provenance within this domain, in Section 3.
(3) An analysis of a wide range of use cases from NPL, Advanced Manufacturing, Life Sciences and Health, and Energy and Environmental domains, in Section 4, concluding with suggested next steps.
2. Overview of provenance and digital curation
The World Wide Web Consortium (W3C) has developed standards for provenance under the auspices of the Provenance Incubator Group (2010) and Provenance Interchange Working Group (2011–2013). From the W3C Provenance Incubator Group: "Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. The provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it". Provenance, occasionally referred to as lineage, creates a "family tree" of how information, processes and people interacted to create a given artefact (where artefact, in this context, denotes a digital object such as a data set).

Several other related terms are used together with provenance and curation. These include metadata, context and data quality. In this report, we use the following definitions for each term:

• Metadata is extra information describing a data artefact. Based on definitions from the W3C Provenance Incubator Group, "Descriptive metadata of a resource only becomes part of its provenance when one also specifies its relationship to deriving the resource." Thus, metadata that reflects information about the object, e.g. file size, is not provenance information, while metadata that reflects information about creation or modification, e.g. creation date, is provenance information.

• Context is domain-relevant information about the state of the world at the time information is created or modified. Some context can be collected with the provenance, e.g., the temperature in the laboratory at time of measurement; some can be added later by a user, e.g., annotations; some context is maintained separately, e.g., the accuracy of the measurement instrument used.

• Data quality assessments are estimates of accuracy, precision, etc., of the information. These assessments may be automatically or manually generated.

• Annotations can be made on either data or provenance, and are typically used to associate context and quality information with particular portions of the data and/or provenance.
Provenance requirements and other considerations
Groth et al. identified three distinct dimensions of requirements for provenance information: content, management and use. In this work, we discuss content in Section 2.2, while management and use are expanded here along with capture and annotation. In addition to these concerns, Allen et al. outlined practical considerations when designing a system to capture provenance, including capture, annotation, storage, and management. There are core implementation concerns for any system that contains or uses provenance. For instance, a mechanism must exist to gather or capture provenance information. This can either be a mechanism, e.g. an API, by which other systems report provenance information to the provenance system, or a set of tools that proactively gather provenance information from other applications. Additionally, it can be beneficial to associate data quality concerns and context information with specific portions of provenance information. Obviously, some thought must be given to how provenance information and annotations are stored for later retrieval and use. Finally, "Provenance management refers to the mechanisms that make provenance available and accessible in a system", and includes managing security and privacy concerns as well as tools for validation, visualisation, representation transformation, etc.
Provenance content

Several dimensions related to "provenance content" were identified by the W3C community, including:
(1) Object: The artefact that a provenance statement is about.
(2) Attribution: The sources or entities that contributed to create the artefact in question.
(3) Process: The activities (or steps) that were carried out to generate or access the artefact at hand.
(4) Versioning: Records of changes to an artefact over time and what entities and processes were associated with those changes.
(5) Justification: Documentation recording why and how a particular decision is made.
(6) Entailment: Explanations showing how facts were derived from other facts.

However, there can be many possible representations of how processes interacted, or how data was derived, that are equally valid. It is possible to collect large amounts of provenance data, and still learn nothing from it. So what IS provenance? We know no universal definition, and doubt one exists. Instead, the end usage of the information should be considered when deciding what information should be collected and stored as provenance. For instance, if an analysis over the provenance to determine process bottlenecks in a manufacturing setting is required, then it is imperative that the provenance contains information about those processes, as well as the starting and ending timestamps of activities. On the other hand, if the provenance is to be used to detect anomalous file access by operating system processes in a security setting, then the processes and all of the file-read and file-write actions must be recorded.
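The first of the two capture scenarios just described needs only coarse, process-level provenance: one record per activity, with start and end timestamps, is enough to locate bottlenecks. A minimal sketch; the activity names and timings are invented for illustration:

```python
import time

class ProcessLog:
    """Coarse-grained provenance: one record per activity, with timestamps."""

    def __init__(self):
        self.records = []

    def run(self, name, fn, *args):
        # Record only start/end times around the activity, not its internals.
        start = time.time()
        result = fn(*args)
        end = time.time()
        self.records.append({"activity": name, "start": start, "end": end})
        return result

    def bottleneck(self):
        """Return the name of the activity with the longest elapsed time."""
        return max(self.records, key=lambda r: r["end"] - r["start"])["activity"]

log = ProcessLog()
log.run("calibrate", lambda: time.sleep(0.02))
log.run("measure", lambda: time.sleep(0.05))
log.run("report", lambda: time.sleep(0.01))
print(log.bottleneck())  # prints: measure
```

The security scenario, by contrast, would have to hook individual read and write calls inside each process, a far finer granularity with correspondingly higher capture and storage costs.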
Of course, these uses may not always be fully anticipated at the time provenance needs to be recorded, which is why it is important to consider the possible uses and needs at an early stage. Before any provenance information is recorded, the first consideration should be "what is it intended to be used for". Standardization may be required to identify agreed sets of provenance for particular applications; for example, for measuring-device metrology this could include uncertainty budget and calibration certificate information. If the provenance that is actually captured and stored cannot be used for the desired purpose, the system should be reconsidered. The use cases discussed in Section 4 involve different sectors/domains, and each sector provides an immediate grounding of what should be in its provenance system. Additionally, how much provenance is actually required, or the granularity of the provenance, should be carefully considered. Granularity refers to the level of detail captured. For instance, in our first example above, the granularity is the set of processes. In the second example, the granularity is individual functions (reads/writes) within a process. In terms of capture, finer granularity often requires the capture agents to be embedded deeper in the applications in order to get the appropriate detail. Additionally, on the storage side, the finer the granularity, the more provenance will actually be recorded, and an implementation of the storage component must be sufficiently scalable to deal with the resulting data size.
Digital Curation Lifecycle Model

Digital curation enhances the value of digital information through provision of provenance information, the production of metadata descriptions, and by creating links between outputs to provide context, which ultimately supports intelligent reuse. The UK Digital Curation Centre (DCC)'s data-centric Curation Lifecycle Model (Figure 1) was developed to provide a graphical overview of the various stages required for successful curation. We briefly describe each stage of the digital curation lifecycle; further information can be found on the DCC web site.
Digital Curation Lifecycle Components

During the conceptualise stage, active planning for the creation of outputs should be carried out. Planning should include an assessment of data capture equipment, quality assessment and storage methods, as well as any relevant standards that will be adopted. Planning should also consider who needs access to the data during the active stage of data collection and how access will be facilitated.

During the creation stage, researchers should ensure that sufficient contextual information is captured to make their data FAIR (findable, accessible, interoperable, reusable). Researchers should consider what information may be necessary to validate any published findings. This may include information about hardware used, software or code that is produced, and any models or algorithms used to analyse and visualise the data.

Researchers should select and appraise outputs necessary to support reproducibility or for validation purposes, but should also be mindful of any additional outputs that may have longer-term value. However, not all data generated during a given project can or should be retained.

Selected outputs should be deposited (ingested) to a suitable digital repository or data centre. Digital Object Identifiers (DOIs) should be assigned to all outputs selected for deposit. Identifiers for researchers (e.g., ORCiD, https://orcid.org/), equipment, funding bodies (e.g., the Funder Registry offered by Crossref) and grant numbers should also be used to provide unambiguous context.

The chosen repository should undertake preservation actions to ensure that data remains authentic, reliable and usable while maintaining its integrity. Researchers may wish to make use of certified repositories (such as CoreTrustSeal).

The key objective for digital curation is to support longer-term access and reuse. Researchers should consider what context a future reuser may need to understand and make use of the given output.
FAIR data can be transformed into new research outputs to facilitate new investigations and help researchers to solve grand challenges. The derived outputs feed back into the conceptualise stage of the lifecycle model and the cycle begins again.
3. Existing resources and standards
There are many resources, including standards and prototype tools, which may be useful for provenance applications in data metrology. In this section, we briefly survey these, with references to starting points in the computer science and scientific data management literature. The coverage we can provide here is necessarily shallow; further details are in the references.
Standardization of provenance concepts
Provenance information exists in many domain-specific standards. There is also a World Wide Web Consortium (W3C) provenance interoperability standard called W3C PROV. The World Wide Web Consortium (W3C) is the main standards body for the World Wide Web, and has created many standards (called Recommendations) including HTML (used for web pages), RDF (a data model for descriptive information using subject-predicate-object triples), and OWL, a language for defining ontologies for RDF data. In this context, "ontology" means a set of classes (defining kinds of things), properties (defining kinds of relationships between the things), and rules that impose constraints on the classes and properties, or define ways to infer new knowledge from existing facts. RDF and OWL are used to represent data in the Semantic Web / Linked Data settings, which have been proposed as ways to make data on the Web more machine-readable and more useful for automated analysis.

The Provenance Interchange Working Group was active from 2011 to 2013 and produced several W3C Recommendations as well as a number of supporting W3C Notes (documents that support and explain standards, but do not themselves specify standard behaviour).
The four main Recommendations are:
• PROV-DM, which specifies the PROV Data Model, a collection of vocabulary terms used in the PROV standards and their intended meanings
• PROV-N, which specifies the PROV Notation (also called PROV-N), a serialization format for PROV information that is intended to be both human and machine readable
• PROV-O, which specifies a PROV Ontology (also called PROV-O), an OWL ontology that specifies basic constraints and usage for PROV represented in RDF format
• PROV-CONSTRAINTS, which specifies more detailed constraints and inference rules for PROV, and defines a notion of validity and equivalence of PROV datasets

Of these standards, PROV-DM and PROV-O are probably the most relevant for typical uses of provenance in the Semantic Web (that is, using RDF and OWL). Other formats for PROV based on XML and JSON have also been defined, but as supporting documents rather than formal Recommendations. The design rationale and lessons learned from the development of PROV are described by Moreau et al. Since PROV has now superseded earlier proposals, we do not go into further details about them.

In the rest of this section, we will summarize the key points of PROV with respect to its potential use in data metrology. Figure 2 summarizes the core concepts and relationships of the PROV Data Model. The three main classes introduced in PROV are:
• Entities: representing some real-world or informational entity (e.g. a book, an article, a website, an instrument, an image file, a measurement)
• Activities: representing processes that can be started or ended by agents, and can generate, use, or change entities
• Agents: representing people, organizations, or software systems that can control or bear responsibility for activities

The basic relationships between these concepts are:
• Usage: an activity may use an entity, meaning its behaviour or outcome is influenced in some way by it
• Generation: an activity may generate an entity, meaning the entity is created by the activity
• Association: an agent may be associated with an activity in different ways, such as controlling or participating
• Information: an activity may inform another activity (typically by some communication)
• Derivation: an entity may be derived from another entity by an activity. As a special case, derivation includes versioning, where an old version of an entity is replaced by a new one derived from it
• Delegation: an agent may act on behalf of another agent

Using these concepts, it is possible to highlight relationships between artefacts and how they were manipulated. A simple example of how the W3C PROV standard might be used to represent provenance of some online documents is shown in Figure 3.

Fig. 3. Example provenance graph using W3C PROV

These concepts and relationships are intentionally open-ended and broad. Typically, a particular application of PROV will make use of additional concepts and relationships suitable to the domain of application. This is in keeping with the open-ended and distributed nature of the Semantic Web/Linked Data technologies; PROV intentionally does not try to cover all possible application areas, but instead encourages other vocabularies or ontologies to be used (or developed) as needed. For example, to represent executions of scientific computations, the ProvONE vocabulary extends PROV with terminology for describing scientific workflow computations.
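The classes and relationships above can be exercised in a few lines of code. The sketch below builds a tiny provenance record in memory and prints it in a PROV-N-like notation; it is a toy, not the standard's API (a library such as ProvPy would be used in practice), and the identifiers (ex:raw_data, ex:report, ex:analysis, ex:alice) are invented for illustration:

```python
class ProvDoc:
    """A toy in-memory PROV document: typed nodes plus binary relations."""

    def __init__(self):
        self.statements = []

    def entity(self, ident):
        self.statements.append(f"entity({ident})")

    def activity(self, ident):
        self.statements.append(f"activity({ident})")

    def agent(self, ident):
        self.statements.append(f"agent({ident})")

    def relate(self, relation, subj, obj):
        self.statements.append(f"{relation}({subj}, {obj})")

doc = ProvDoc()
doc.entity("ex:raw_data")   # a measurement file
doc.entity("ex:report")     # a derived report
doc.activity("ex:analysis") # the analysis run
doc.agent("ex:alice")       # the analyst

doc.relate("used", "ex:analysis", "ex:raw_data")           # Usage
doc.relate("wasGeneratedBy", "ex:report", "ex:analysis")   # Generation
doc.relate("wasAssociatedWith", "ex:analysis", "ex:alice") # Association
doc.relate("wasDerivedFrom", "ex:report", "ex:raw_data")   # Derivation

print("\n".join(doc.statements))
```

The same little graph could equally be serialized as PROV-O triples in RDF, or validated against PROV-CONSTRAINTS; the point is only that the data model itself is small.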
As another example, to represent the recording and analysis of neuroimaging experiments, the Neuroimaging Data Model (NIDM) extends PROV with concepts and relationships that standardize terminology relating to neuroimaging instruments and software packages.
Available tools and techniques

There is a great deal of research on provenance, extending from theoretical aspects, to uses, tools and implementation concerns. While much of this research is still in the academic sphere, some has ripened into more mature tools that can be leveraged. In response to the US Federal Government's interest in implementing provenance, and based on the set of questions frequently encountered, the technical report summarizes good practices and engineering design decisions. In the following subsections, we identify many of these more industrial-ready tools.
Capture tools and technology

There are no generic provenance capture tools that are suited to any kind of application or system. All provenance capture to date involves modifying the underlying system, tools and processes identified as being important to the organisation. In some cases, specific systems have been modified to provide built-in provenance tracking capabilities; the most relevant example of this approach is scientific workflow management systems, which have been proposed for high-level scientific computation in a number of areas. Recently, some of the approaches developed for workflow systems have been adapted to scripting languages such as Python, but this still leaves the problem of how to track provenance when several different systems or languages are used together, each tracking provenance in its own way, or some not at all. Recent work considers a lazy-capture approach, which involves system replay and capture of provenance only when needed.
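A capture mechanism of the reporting kind described in Section 2 (an API by which applications report provenance to a collecting system) can be sketched as a function wrapper. This is a minimal illustration, not any particular tool's interface, and the function and field names are invented:

```python
import functools

CAPTURED = []  # stands in for the provenance system's store

def capture(fn):
    """Report a usage/generation record each time the wrapped function runs."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        CAPTURED.append({
            "activity": fn.__name__,
            "used": [repr(a) for a in args],
            "generated": repr(result),
        })
        return result
    return wrapper

@capture
def convert_units(value_mm):
    return value_mm / 1000.0  # millimetres to metres

convert_units(1500)
print(CAPTURED[-1]["activity"])  # prints: convert_units
```

This captures only what passes through decorated functions; the proactive alternative mentioned above instead observes other applications from outside, without modifying them.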
Annotation tools and technology

Annotation allows the enrichment of data by providing additional context or quality information about the underlying data. While it does not necessarily require provenance, an annotation is a piece of metadata that can be attached either to the data, or to the provenance representation of that underlying data. To date, there are many tools that assist with creating and using annotations. However, most of these tools are very domain-specific. For instance, one such tool works with chemistry-specific annotations, by highlighting chemistry-specific text, like chemical equations, and providing easy mechanisms for chemistry-specific annotations. Many projects still provide annotation in a do-it-yourself, ad hoc way. For instance, EMBL-EBI's UniProt started by custom-building user annotation into their web application, and has since moved on to automatically suggesting annotations. More general tools (e.g. Ontotext, https://ontotext.com/) focus on ingesting text documents, extracting entities, topic models, social tags, etc., and providing the ability to use and organise this semantic information.
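At its simplest, attaching annotations to data or provenance is a keyed side-table: each annotation names the portion it describes. The sketch below is a bare-bones illustration (identifiers and field names invented), not any particular tool:

```python
# Annotations keyed by the identifier of the data (or provenance) portion
# they describe; a real system would persist these alongside the data.
annotations = {}

def annotate(target_id, kind, value, author):
    annotations.setdefault(target_id, []).append(
        {"kind": kind, "value": value, "author": author}
    )

def annotations_for(target_id, kind=None):
    found = annotations.get(target_id, [])
    return [a for a in found if kind is None or a["kind"] == kind]

# Attach a quality note and a context note to one slice of a dataset.
annotate("dataset:42/rows:10-20", "quality", "sensor drift suspected", "ex:bob")
annotate("dataset:42/rows:10-20", "context", "lab temperature 21.3 C", "ex:bob")

print(len(annotations_for("dataset:42/rows:10-20", kind="quality")))  # prints: 1
```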
Storage

Provenance is data, and therefore the same storage options exist for provenance as for traditional data, including: relational; flat file; graph-based. These storage options are not mutually exclusive. For instance, the data and provenance may be permanently stored in a database, but may be instantiated in a file when passing between systems. Provenance storage introduces a number of challenges, since provenance data can often grow to dwarf the size of the actual data of interest, requiring compression or choices regarding what provenance to keep. Provenance can also be modelled in different ways, for example as annotations directly on the underlying data, or separately using identifiers for cross-referencing. The choice of which technology is best for provenance storage will depend upon how that provenance information is to be used, local technology expertise within the organisation, and network and trust architectures within the organisation.
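As a concrete instance of the relational option, provenance relations can be stored as ordinary rows and queried with SQL. A minimal sketch using Python's built-in sqlite3; the schema and identifiers are invented for illustration:

```python
import sqlite3

# In-memory database standing in for a persistent provenance store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE provenance (
        subject  TEXT NOT NULL,   -- e.g. an entity identifier
        relation TEXT NOT NULL,   -- e.g. wasGeneratedBy, used
        object   TEXT NOT NULL
    )
""")

records = [
    ("ex:report", "wasGeneratedBy", "ex:analysis"),
    ("ex:analysis", "used", "ex:raw_data"),
    ("ex:report", "wasDerivedFrom", "ex:raw_data"),
]
conn.executemany("INSERT INTO provenance VALUES (?, ?, ?)", records)

# Query: what was ex:report derived from?
rows = conn.execute(
    "SELECT object FROM provenance WHERE subject = ? AND relation = ?",
    ("ex:report", "wasDerivedFrom"),
).fetchall()
print(rows)  # prints: [('ex:raw_data',)]
```

A graph database would answer multi-step derivation queries more naturally, which is one reason the choice of storage technology should follow from the intended queries.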
Administration

Like ordinary data, provenance may need to be secured and protected; in particular, recording provenance in a secure system may introduce new risks to confidentiality (e.g. confidential data might be leaked via provenance) or privacy (e.g. recording user activity may inadvertently record protected personal information). At a minimum, we must apply classic access control techniques to the provenance information. However, classic access controls may not be sufficient for provenance, because attackers may be able to leverage knowledge of how provenance is captured to infer missing information even if sanitisation is applied. As such, several protection mechanisms that work more gracefully with provenance have been identified. Cheney and Perera surveyed and compared a number of techniques in this area.

Beyond security and protection, any provenance solution must have basic administration concerns addressed, including classic considerations for most systems such as facilitating user access, managing users, managing size, scalability, performance, etc.
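The minimum measure mentioned above (classic access control over provenance) amounts to filtering records by a reader's clearance before release. A sketch; the labels and levels are invented for illustration:

```python
LEVELS = {"public": 0, "internal": 1, "confidential": 2}

def redact(records, clearance):
    """Return only the provenance records the reader is cleared to see.

    Note: as discussed above, filtering alone may not prevent an attacker
    from inferring the removed records from the shape of what remains.
    """
    allowed = LEVELS[clearance]
    return [r for r in records if LEVELS[r["label"]] <= allowed]

records = [
    {"stmt": "used(ex:analysis, ex:raw_data)", "label": "public"},
    {"stmt": "wasAssociatedWith(ex:analysis, ex:alice)", "label": "internal"},
    {"stmt": "used(ex:analysis, ex:patient_data)", "label": "confidential"},
]

print(len(redact(records, "internal")))  # prints: 2
```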
Manipulation

A suite of tools exists to facilitate working with and manipulating W3C PROV. These include:
(1) Validator: a RESTful web service that validates PROV descriptions against the PROV Constraints specification
(2) Translator: translates between different representations of PROV; supports PROV-N, PROV-XML, PROV-O and PROV-JSON
(3) Store: a provenance repository that allows storing, browsing, and managing provenance documents via a Web interface or a REST API
(4) ProvToolbox: a Java toolbox for handling PROV
(5) ProvPy: a Python implementation of the PROV data model
(6) ProvExtract: for dealing with PROV embedded in web pages
(7) ProvVis: visualisations of PROV
(8) PROV-N Editor: a text editor with PROV-N syntax highlighting
4. Summary of study: interviews and use cases
As part of our study, individuals from several organisations that relate to NPL's mission, as well as NPL staff, were interviewed in order to identify the scope of possible interest in provenance, and to identify specific use cases and requirements from each area. A brief summary of the interviewed groups and their activities is presented below. These have been organised into the following groups: activities within NPL; the Advanced Manufacturing sector; the Life Science & Health sector; and the Energy & Environment sector.

The interviews with NPL staff helped us to understand the current capabilities and needs within NPL regarding "data science" aspects of data quality, provenance and curation. For the interviews with participants from other organisations, or initiatives where NPL staff are already collaborating with external organisations, we asked the following questions:
• What is the mission of the organisation/initiative and what role does it play in the UK economy?
• Why are data quality, provenance or curation important to that mission?
• What are the current practices regarding data quality, provenance and curation, and what are the unmet challenges or future needs?
• What role does NPL currently play, and how could NPL provide better support for these needs?

For each interview summary, one or more use cases (scenarios illustrating a challenge or unmet need) relating to provenance, curation, and data quality have been identified. These are listed in Table 1. Space limits preclude discussing all of the use cases in detail; instead, we briefly summarize three representative use cases for each domain.
NPL activities.
We held discussions with several groups within NPL, including staff members representing NPL's core activities in metrology, such as the measurement of mass, force, pressure, and density, and a staff member in the early stages of developing a provenance-tracking prototype for stress-strain laboratory data. Based on these discussions we identified several common or significant needs. Use case 1 involves maintaining records of calibration and measurement methods for measurement equipment, including the measurement procedure, resulting artefacts, calibration methods, and times of (re)calibrations. Use case 2 considers the possibility that errors might initially go undetected and influence other results, leading to the need to understand how an error has propagated through a system so that users may be alerted and the errors corrected. Use case 4 considers the reverse scenario, when errors are noticed in derived results and the goal is to diagnose the root cause of the observed error. All of these use cases raise new challenges as measurements and calibration records move from largely analog, manually maintained documents recording only a few steps of measurement and computation, to electronic records of possibly long data supply chains.
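Use cases 2 and 4 are, at bottom, graph traversals over recorded derivations: following derivation edges forwards finds the results a faulty input may have contaminated, while following them backwards searches for a root cause. A sketch over a toy derivation graph (entity identifiers invented for illustration):

```python
from collections import deque

# derived_from[x] = entities that x was directly derived from;
# a real system would read these edges from the provenance store.
derived_from = {
    "ex:calibration": [],
    "ex:measurement": ["ex:calibration"],
    "ex:analysis": ["ex:measurement"],
    "ex:report": ["ex:analysis", "ex:measurement"],
}

def ancestors(entity):
    """Use case 4: walk backwards to the candidate root causes."""
    seen, queue = set(), deque([entity])
    while queue:
        for parent in derived_from.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def affected(entity):
    """Use case 2: walk forwards to everything the entity influenced."""
    return {e for e in derived_from if entity in ancestors(e)}

print(sorted(affected("ex:calibration")))
```

If the calibration turns out to be faulty, every entity returned by `affected("ex:calibration")` needs rechecking; conversely, an error noticed in `ex:report` narrows the diagnosis to `ancestors("ex:report")`.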
Advanced manufacturing.

Advanced Manufacturing refers to the sophisticated use of cloud computing, cognitive computing, Internet of Things, and robotics in manufacturing. Because of the reliance on digital models and data analytics, ensuring the quality of the end products also depends upon understanding and extending techniques for data quality, such as tracking provenance through computations, and curation of digital artefacts. In this sector, standards such as the Quality Information Framework (QIF) are used to record manufacturing processes, which is useful for understanding existing processes and where they might break down (Use case 6), and to provide traceability, optimisation, and quality control in manufacturing (Use case 7). Use case 8 likewise highlights the need to identify and calculate failure points in manufacturing, which may involve working backwards from observations of failed tests to identify root causes, or forwards to identify other risks that may be correlated with the root causes identified.
Life science & health.

The Life Science & Health sector covers medical, pharmaceutical and biological research, as well as healthcare and public health. Use of computation and data management in these areas is already widespread, particularly in biomedical science ('bioinformatics'). Some NPL researchers are involved in developing computational methodologies that have the potential to improve the quality of patients' computerised medical records, both prospectively by automating the correction of submitted data and retrospectively by reconstructing missing data from historical datasets. These methods enable incomplete or missing data to be made available for analysis and comparison with published results. Use case 9 highlights the potential benefits that AI can offer, but introduces issues around transparency, as it will be essential that researchers are able to easily distinguish between original data and data that have been automatically enhanced or reconstructed using algorithms. Many life science researchers depend on having access to reliable physical samples to carry out their analyses. They find that the quality and consistency of the contextual information associated with these physical samples varies greatly, which in turn impacts the quality of the analysis they can carry out. Use case 10 highlights that the ability to understand a sample's quality is directly related to how the physical sample has been sourced and treated prior to its arrival in the lab for digitisation and analysis. NPL researchers are working to improve and standardise sample documentation practices by identifying points over the lifecycle where relevant contextual information is produced and assessing how it should be captured most efficiently and effectively. There is growing interest among researchers in all domains and industry to improve the availability and reuse of existing data.
For the life sciences, this may include sharing pre-competitive data across organisations, data collected from hospitals, and subcontracting research to external organisations. Use case 17 introduces a need for improved communication about the context and meaning of archived data to support informed reuse. As many research performing organisations (RPOs) transition to infrastructure based on cloud storage and object-based data repositories, there is an opportunity to rethink data capture, curation and management processes to support future reuse.
Energy & environment.
The Energy & Environment sector comprises oil & gas, nuclear, hydroelectric and other renewable sources of energy, as well as Earth observation, sustainability, climate change impact assessment, and environmental monitoring. Sensor networks and IoT devices are seeing increasing use for environmental monitoring and for monitoring the condition of conventional or renewable energy technologies, but face challenges regarding calibration and uncertainty propagation through data analysis pipelines. Use case 23 identifies the need for transparency and inter-compatibility of standards used by different groups so that data can be exchanged, compared and reused. Environmental monitoring via Earth observation satellites plays an important role both in understanding the climate and in making detailed information about the environment available to government and business, and requires an understanding of how historical data were captured and how they can be combined with current data, as described in Use case 21. In a related vein, given the longevity of many environmental studies, it is essential that provenance survives beyond the lifetime of the software used to create the data or capture the provenance, so that an understanding of this historical data can facilitate later data use, as discovered in Use case 22. In all of these cases, the increasing use of large amounts of data, gathered under varying conditions by devices with varying levels of accuracy, implies the need for provenance and data quality management techniques similar to those that have already been investigated for scientific data management and analysis.
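The uncertainty-propagation challenge for sensor pipelines can be illustrated with the standard GUM-style rule for combining uncorrelated measurements: an inverse-variance weighted mean with a root-sum-square combined uncertainty. The sensor readings and stated uncertainties below are invented for illustration:

```python
from math import sqrt

def combine_mean(values_with_u):
    """Inverse-variance weighted mean of uncorrelated readings, returning
    the mean and the standard uncertainty of the result (GUM-style)."""
    weights = [1.0 / (u * u) for _, u in values_with_u]
    total = sum(weights)
    mean = sum(w * v for w, (v, _) in zip(weights, values_with_u)) / total
    return mean, sqrt(1.0 / total)

# Three temperature sensors with stated standard uncertainties (invented values).
readings = [(20.1, 0.1), (20.3, 0.2), (19.9, 0.1)]
value, u = combine_mean(readings)
```

Sensors with smaller stated uncertainties receive more weight, and the combined standard uncertainty shrinks as independent readings are added; recording each sensor's calibration provenance is what justifies the stated uncertainties in the first place.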
Table: use cases identified in the study, grouped by sector; each is mapped to a subset of the requirement themes Context & Understanding, Curation & Reuse, Id, Good Practice, Integrity, Interoperability, Linking Entities, Quality, Reproducibility and Uncertainty.

NPL Activities
1. Understanding measurements and calibration measures

Advanced Manufacturing
6. Modelling and improvement of manufacturing processes

Life Science and Health
9. Automating the reconstruction of missing data
10. Linking digital provenance with physical artefacts for quality analysis
11. Interpretable confidence/quality assessment for machine learning
12. Electronic lab notebooks and reproducibility
13. Identifying human gaps
14. Standards and best practices for reuse
15. Facilitating communication and understanding across organisations
16. Provenance and quality in trials
17. Data reuse for additional purposes
18. Identifying duplication in rare genetic data
19. Diagnosing unexpected differences in results of analytical models
20. Understanding changes to data over time

Energy and Environment
21. Reuse and understand historical data with recent data
22. Recording and using past configurations and performance
23. Transparency and inter-compatibility of standards
24. Reflecting good practice in new tools and services

Recommendations

We made several recommendations to NPL based on our study, which we have adapted for consideration by the broader metrology community as follows:

• Metrologists should establish best practices for provenance standards, providing greater clarity about which techniques are sufficient to meet current or future needs.
• Flagship case studies should be developed to consolidate expertise in developing and using provenance techniques.
• Metrology institutes should raise awareness of provenance as part of data metrology through augmenting training materials or other dissemination pathways.
• The importance of data management planning should be highlighted to promote data reuse and curation in industrial settings.
• Academic research and industry need to be brought together to facilitate the transfer of solutions into practice, and to communicate unmet needs to researchers.

Addressing these recommendations is the subject of current work by NPL and collaborators within the metrology and data science communities. For example, the EURAMET Technical Committee for Interdisciplinary Metrology (TC-IM) has initiated a project on Research Data Management, and the Consultative Committee for Units (CCU) of the International Bureau of Weights and Measures (BIPM) has set up a task group (D-SI) to promote the digitisation of metrology data exchange, references, services and products, and to bring the SI into the digital world. In these initiatives it is seen as essential to address issues relating to data provenance, curation and the FAIR principles, and for the metrology community to join with the wider scientific and technical communities to provide well-founded, coherent and interoperable solutions.
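Machine-readable traceability records of the kind envisaged by the D-SI initiative can be expressed in the W3C PROV-N notation, in which each calibration certificate is an entity derived from its predecessor back to a primary standard. The sketch below uses only the standard library; the instrument identifiers and the `ex` namespace are invented for illustration:

```python
def calibration_chain_provn(chain):
    """Emit a PROV-N document in which each certificate in `chain` is an
    entity derived from its predecessor (the next element in the list)."""
    lines = ["document", "  prefix ex <http://example.org/>"]
    for cert in chain:
        lines.append(f"  entity(ex:{cert})")
    for child, parent in zip(chain, chain[1:]):
        lines.append(f"  wasDerivedFrom(ex:{child}, ex:{parent})")
    lines.append("endDocument")
    return "\n".join(lines)

# A working standard traced back to a national primary standard (hypothetical IDs).
doc = calibration_chain_provn(["workshop_gauge", "reference_gauge", "primary_standard"])
```

A fuller record would attach the calibration activities, timestamps, responsible agents and stated uncertainties to each derivation, which PROV also supports.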
5. Conclusion
Our study highlighted a number of situations in metrology and industrial settings where provenance information is needed, and thus many settings where techniques developed by the provenance research community could be beneficial; but it also showed that there is relatively little awareness or adoption of these techniques. This may be in part because the provenance research community initially focused on settings such as scientific data and computation, and on Web technologies that do not appear to be in widespread use in industrial settings. For example, only in the life science and health sector (where RDF is already widely used) did we find pre-existing awareness of RDF and PROV. Thus there remains a significant gap between what is technically possible in research prototypes and the day-to-day needs expressed in the use cases we encountered. We hope that this paper will serve to catalyse further discussion between the metrology and data provenance communities to address these challenges.
Acknowledgments
This work was supported by EPSRC (grant number EP/S028366/1) and an ISCF Metrology Fellowship grant provided by the UK government's Department for Business, Energy and Industrial Strategy (BEIS).

References
1. P. Buneman, S. Khanna and W. C. Tan, Why and Where: A Characterization of Data Provenance, in ICDT (Springer, 2001).
2. K. K. Muniswamy-Reddy, D. A. Holland, U. Braun and M. I. Seltzer, Provenance-Aware Storage Systems, in USENIX Annual Technical Conference, General Track (USENIX, 2006).
3. A. Barker and J. I. van Hemert, Scientific Workflow: A Survey and Research Directions, in PPAM, LNCS Vol. 4967 (Springer, 2007).
4. R. Bose and J. Frew, Lineage retrieval for scientific data processing: a survey, ACM Comput. Surv., 1 (2005).
5. J. Freire, D. Koop, E. Santos and C. T. Silva, Provenance for Computational Tasks: A Survey, Computing in Science and Engineering, 11 (2008).
6. Y. Simmhan, B. Plale and D. Gannon, A survey of data provenance in e-science, SIGMOD Record, 31 (2005).
7. J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in databases: Why, how, and where, Found. Trends Databases, 379 (April 2009).
8. B. Glavic and K. R. Dittrich, Data Provenance: A Categorization of Existing Approaches, in BTW (GI, 2007).
9. J. Cheney and R. Perera, An analytical survey of provenance sanitization, in IPAW (Springer, 2014).
10. Digital Curation Centre, What is digital curation? (2018), accessed March 26, 2018.
11. W3C Incubator Group Report, Provenance XG Final Report.
12. IJDC, 39 (2012).
13. D. Allen, L. Seligman, B. Blaustein and A. Chapman, Provenance Capture and Use: A Practical Guide, tech. rep., The MITRE Corporation (2010).
14. ProvToolbox (2020), http://lucmoreau.github.io/ProvToolbox/.
15. Digital Curation Centre, DCC curation lifecycle model (2018), accessed March 26, 2018.
16. International Organization for Standardization, ISO 19115.
17. National System for Geospatial Intelligence Metadata Foundation (NMF), tech. rep., National System for Geospatial Intelligence (2010).
18. P. Missier, K. Belhajjame and J. Cheney, The W3C PROV family of specifications for modelling provenance metadata, in EDBT (ACM, 2013).
19. W3C PROV Working Group, An Overview of the PROV Family of Documents.
20. Journal of Web Semantics, 235 (2015).
21. V. Cuevas-Vicenttín, B. Ludäscher, P. Missier, K. Belhajjame, F. Chirigati, Y. Wei, S. Dey, P. Kianmajd, D. Koop, S. Bowers, I. Altintas, C. Jones, M. B. Jones, L. Walker, P. Slaughter, B. Leinfelder and Y. Cao, ProvONE: A PROV extension data model for scientific workflow provenance (May 2016), https://purl.dataone.org/provone-v1-dev.
22. C. Maumet, T. Auer, A. Bowring, G. Chen, S. Das, G. Flandin, S. Ghosh, T. Glatard, K. J. Gorgolewski, K. G. Helmer, M. Jenkinson, D. B. Keator, B. N. Nichols, J.-B. Poline, R. Reynolds, V. Sochat, J. Turner and T. E. Nichols, Sharing brain mapping statistical results with the neuroimaging data model, Sci. Data (2016).
23. M. D. Allen, A. Chapman, L. Seligman and B. Blaustein, Provenance for collaboration: Detecting suspicious behaviors and assessing trust in information, in CollaborateCom (IEEE, Oct 2011).
24. A. Chapman, B. T. Blaustein, L. Seligman and M. D. Allen, PLUS: A provenance manager for integrated information, in Information Reuse and Integration (IRI), 2011 IEEE International Conference on (IEEE, 2011).
25. M. Anand, S. Bowers, T. M. McPhillips and B. Ludäscher, Efficient Provenance Storage over Nested Data Collections, in Procs. EDBT (ACM, March 2009).
26. A. Chapman, H. V. Jagadish and P. Ramanan, Efficient provenance storage, in SIGMOD Conference (ACM, 2008).
27. I. Altintas, O. Barney and E. Jaeger-Frank, Provenance Collection Support in the Kepler Scientific Workflow System, in IPAW (Springer, 2006).
28. J. Freire, S. P. Callahan, E. Santos, C. E. Scheidegger, C. T. Silva and H. T. Vo, VisTrails: visualization meets data management, in SIGMOD Conference (ACM, 2006).
29. J. Kim, E. Deelman, Y. Gil, G. Mehta and V. Ratnakar, Provenance trails in the Wings/Pegasus workflow system, Concurr. Comput. (2008).
30. T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R. K. Bocinsky, Y. Cao, J. Cheney, F. Chirigati, S. Dey, J. Freire, C. Jones, J. Hanken, K. W. Kintigh, T. A. Kohler, D. Koop, J. A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda and B. Ludäscher, YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts, International Journal of Digital Curation, 298 (2015).
31. L. Murta, V. Braganholo, F. Chirigati, D. Koop and J. Freire, noWorkflow: Capturing and analyzing provenance of scripts, in IPAW, eds. B. Ludäscher and B. Plale (Springer, 2015).
32. M. Stamatogiannakis, E. Athanasopoulos, H. Bos and P. Groth, Prov2R: Practical provenance analysis of unstructured processes, ACM Trans. Internet Technol. (August 2017).
33. L. Hawizy et al., ChemicalTagger: A tool for semantic text-mining in chemistry, J. Cheminform. (2011).
34. H. Park, R. Ikeda and J. Widom, RAMP: A system for capturing and tracing provenance in MapReduce workflows, in VLDB (VLDB Foundation, August 2011).
35. A. Rosenthal, L. Seligman, A. Chapman and B. Blaustein, Scalable access controls for lineage, in TaPP (USENIX, 2009).
36. R. Hasan, R. Sion and M. Winslett, The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance, in FAST (USENIX, 2009).
37. J. Zhang, A. Chapman and K. LeFevre, Do you know where your data's been? Tamper-evident database provenance, in SDM (Springer-Verlag, 2009).
38. J. Cheney, A formal framework for provenance security, in CSF (IEEE, 2011).
39. B. Blaustein, A. Chapman, L. Seligman, M. D. Allen and A. Rosenthal, Surrogate parenthood: Protected and informative graphs, Proc. VLDB Endow. 4