Aligning Provenance Approaches Across Domains
Submitted by Simon Hodson
The FAIR data principles imply that the origins and processing history of data should be discoverable. Data professionals have long recognized this fundamental need. The uses of such information are many: discovery and assessment of data for reuse, determination of data quality and fitness for purpose, the replication and reproducibility of findings, data validation for policy purposes, the resolution of legal issues around confidentiality, and so on. Data provenance is a critical resource that is all too often lacking.
The growing demand for more data, coming from an increasingly disparate range of data sources, makes this need even more pressing. Often, the most expensive aspect of research relates to finding the data, understanding it, and integrating it so that it can be analyzed. In an era of big data, machine learning, and an increasingly powerful array of technologies and techniques for using data, the need for better data provenance is acute.
Different communities experience this need in different ways, and they have formulated different approaches to the standardization and automation of systems relating to provenance and process description. Some approaches emphasize documenting and understanding data provenance after the fact, while others aim to provide executable descriptions of process that can be used across platforms and contexts.
This session will build on recent discussions held at the FAIR Convergence Symposium and the growing interest in PROV-O, SDTL, DDI-CDI and the work of the WholeTale project. It will seek to extend the discussion beyond the application of these standards to the identification of approaches, practices and requirements in different domains.
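To make concrete what "documenting provenance" can mean in practice, the sketch below records a small after-the-fact provenance trail in the style of the W3C PROV-O vocabulary mentioned above. It is a minimal illustration only, not a method proposed by this session: the dataset and activity names (`ex:raw_survey`, `ex:clean_survey`, `ex:harmonisation`) are hypothetical placeholders, and the Turtle serialization is written by hand rather than with an RDF library.

```python
# Minimal sketch: expressing a provenance trail as PROV-O style
# subject-predicate-object statements, serialized as Turtle lines.
# All resource names under the "ex:" prefix are hypothetical.

PROV = "prov"  # W3C PROV-O vocabulary prefix
EX = "ex"      # hypothetical local namespace for this example

def triple(subject, predicate, obj):
    """Format one statement as a Turtle line."""
    return f"{subject} {predicate} {obj} ."

# A cleaned survey file was generated from a raw file by a
# harmonisation step; PROV-O captures this with three core terms:
# prov:used, prov:wasGeneratedBy, and prov:wasDerivedFrom.
statements = [
    triple(f"{EX}:raw_survey",    "a",                      f"{PROV}:Entity"),
    triple(f"{EX}:clean_survey",  "a",                      f"{PROV}:Entity"),
    triple(f"{EX}:harmonisation", "a",                      f"{PROV}:Activity"),
    triple(f"{EX}:harmonisation", f"{PROV}:used",           f"{EX}:raw_survey"),
    triple(f"{EX}:clean_survey",  f"{PROV}:wasGeneratedBy", f"{EX}:harmonisation"),
    triple(f"{EX}:clean_survey",  f"{PROV}:wasDerivedFrom", f"{EX}:raw_survey"),
]

document = "\n".join(statements)
print(document)
```

In real deployments such statements would normally be produced by tooling (an RDF library, a workflow system, or a repository platform) rather than by hand; the point here is only the shape of the information that domains would need to capture.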
The following questions will be explored:
- How is provenance being tracked and documented in various domains?
- What are the technical and conceptual points of contact?
- Is provenance tracking inherently discipline-specific (incommensurable, as it were, between disciplines), or can we learn from how it is done in various contexts (for example, how materials scientists, crystallographers, and astronomers track the processes their data goes through)?
- And ultimately, what alignment or convergence is possible between different ways of (automatically) tracking provenance and process across technologies and domains?
The intention of this session is to promote discussion across domains, and to identify commonalities of approach which might lead us toward convergence on best practice, and ultimately to more scalable and automated solutions.
In due course, a co-branded RDA-CODATA WG or IG on this topic may be proposed. The intention is to build collaboration that will contribute to the ISC CODATA Decadal Programme ‘Making Data Work for Cross-Domain Grand Challenges’.
Domains to be invited (through some existing IGs and through other contacts):
- Social sciences
- Environmental sciences
- Earth sciences
- Life sciences/biodiversity
- Crystallography/material sciences
Collaborative session notes (main session): https://docs.google.com/document/d/1EkT9LFKGGRVP7qhRTTHXSWm6EGPfjJSm-xIbqk9_XMk/edit?usp=sharing
Collaborative session notes (repeat session): https://docs.google.com/document/d/11Yggy1g-yD-u2o8u4TflH1F8mIruWJvZd5YujhLK-8A/edit?usp=sharing
0-10 minutes: Welcome, tour de table and rationale for the BoF.
10-50 minutes: Brief introductory presentations on provenance and process from a sample of domains. Presenters will be asked to briefly highlight key steps that the domain undertakes to gather process and provenance information and to indicate how this is encoded. A set of questions and a template will be provided.
Social sciences and statistics (Arofan Gregory)
Astronomy (Mireille Louys)
Crystallography (John Helliwell)
Chemistry (Stuart Chalk)
Environmental sciences (Matt Jones)
Earth sciences (Kerstin Lehnert)
Biomedical (Mark Musen)
50-80 minutes: Discussion including respondents.
80-90 minutes: Wrap Up and Next Steps.
Questions for panellists
What is the typical provenance and data transformation information that your domain needs to capture? Or, put another way: what is data provenance in your domain?
Is there a practice around provenance information?
If so, how is it captured and shared?
How widespread is it? How much of the domain has a shared or best practice? What is the demand in your domain?
What semantics, tools, and software are used for this in your domain?
What is the role of fine-grained process information (transformations, normalisation, etc.) versus contextual provenance information?
What is the level of granularity about provenance? What do users want in terms of provenance information, and at what level of detail?
What demand is there for provenance and process information?
How do researchers / users use provenance and process information?
Discussion and Respondents
We propose to invite the following as respondents for the discussion:
Research Objects (Stian Soiland-Reyes, TBC)
PROV-O (Timothy McPhillips)
DDI-CDI (George Alter and Arofan Gregory)
WholeTale (Bertram Ludaescher)
The respondents would be asked to contribute to the discussion, but not to give a formal presentation. They will be asked to reflect, from the perspective of technologists and implementers, on the scenarios described by the domains, and to respond without promoting any given solution.
This BoF builds on a series of sessions and discussions, notably:
FAIR Convergence Symposium: https://conference.codata.org/FAIRconvergence2020/sessions/260/ ; recording https://vimeo.com/499271767; presentations https://drive.google.com/drive/folders/1OBpCqmH0oqVKjzu9SzEpT_0zfw3oI3a5
DDI-CDI Process and Provenance: https://codata.org/initiatives/strategic-programme/decadal-programme/ddi...