FAIR data principles imply that the origins and processing history of data should be discoverable. Data professionals have long recognized this fundamental need. The uses of such information are many: discovery and assessment of data for reuse, determination of data quality and fitness for purpose, the replication and reproducibility of findings, data validation for policy purposes, the resolution of legal issues around confidentiality, and so on. Data provenance is a critical resource that is all too often lacking.
The growing demand for more data, coming from an increasingly disparate range of sources, makes this need even more pressing. Often, the most expensive aspect of research is finding, understanding, and integrating data so that it can be analyzed. In an era of big data, machine learning, and an increasingly powerful array of technologies and techniques for using data, the need for better data provenance is acute.
Different communities experience this need in different ways, and they have formulated different approaches to standardizing and automating systems for provenance and process description. Some approaches emphasize documenting and understanding data provenance after the fact, while others aim to provide executable descriptions of process that can be used across platforms and contexts.
This session will build on recent discussions held at the FAIR Convergence Symposium and the growing interest in PROV-O, SDTL, DDI-CDI, and the work of the WholeTale project. It will seek to extend the discussion beyond the application of these standards, toward identifying approaches, practices, and requirements in different domains.
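To make the idea of automated provenance tracking concrete, the following is a minimal sketch, in plain Python, of recording one processing step using the core relations defined by the W3C PROV data model that PROV-O expresses (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasAssociatedWith`). This is an illustrative toy, not a conformant PROV implementation; the identifiers such as `ex:raw-survey` and the `record_step` helper are hypothetical placeholders.

```python
from datetime import datetime, timezone

def record_step(log, activity, inputs, output, agent=None):
    """Append PROV-style assertions describing one processing step.

    Each assertion is a simple (relation, subject, object) triple kept
    in an in-memory list; a real system would serialize these using
    PROV-O (RDF) or another standard representation.
    """
    # Record the activity itself, with a UTC timestamp.
    log.append(("activity", activity, datetime.now(timezone.utc).isoformat()))
    for src in inputs:
        log.append(("used", activity, src))            # activity consumed src
        log.append(("wasDerivedFrom", output, src))    # output derives from src
    log.append(("wasGeneratedBy", output, activity))   # output produced by activity
    if agent:
        log.append(("wasAssociatedWith", activity, agent))

# Hypothetical example: a harmonization step over a raw survey file.
log = []
record_step(log, "ex:harmonize", ["ex:raw-survey"], "ex:clean-survey",
            agent="ex:data-steward")
```

A record like this, captured automatically as pipelines run, is the kind of artifact the standards above aim to make exchangeable across tools and domains.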
The following questions will be explored:
- How is provenance being tracked and documented in various domains?
- What are the technical and conceptual points of contact?
- Is provenance tracking inherently discipline-specific (incommensurable between disciplines, as it were), or can we learn from how it is done in various contexts (for example, how materials scientists, crystallographers, and astronomers track the processes their data goes through)?
- Ultimately, what alignment or convergence is possible between different ways of automatically tracking provenance and process, across technologies and across domains?
The intention of this session is to promote discussion across domains and to identify commonalities of approach that might lead us toward convergence on best practice, and ultimately to more scalable and automated solutions.
In due course, a co-branded RDA-CODATA WG or IG on this topic may be proposed. The intention is to build collaboration that will contribute to the ISC CODATA Decadal Programme ‘Making Data Work for Cross-Domain Grand Challenges’.
Domains to be invited (through existing IGs and other contacts):
- Social sciences
- Environmental sciences
- Earth sciences
- Life sciences/biodiversity
- Crystallography/material sciences