Aligning Provenance Approaches Across Domains

26 Jan 2021

Submitted by Simon Hodson


Meeting objectives: 

FAIR data principles imply that the origins and processing history of data should be discoverable. Data professionals have long recognized this fundamental need. The uses of such information are many: discovery and assessment of data for reuse, determination of data quality and fitness for purpose, the replication and reproducibility of findings, data validation for policy purposes, the resolution of legal issues around confidentiality, and so on. Data provenance is a critical resource that is all too often lacking.

The growing demand for more data, coming from an increasingly disparate range of data sources, makes this need even more pressing. Often, the most expensive aspect of research relates to finding the data, understanding it, and integrating it so that it can be analyzed. In an era of big data, machine learning, and an increasingly powerful array of technologies and techniques for using data, the need for better data provenance is acute.

Different communities experience this need in different ways, and they have formulated different approaches to the standardization and automation of systems relating to provenance and process description. Some approaches emphasize the documentation and understanding of data provenance after the fact, while others aim to provide executable descriptions of process that can be used across platforms and contexts.

This session will build on recent discussions held at the FAIR Convergence Symposium and the growing interest in PROV-O, SDTL, DDI-CDI and the work of the WholeTale project.  It will seek to extend the discussion beyond the application of these standards to the identification of approaches, practices and requirements in different domains.

The following questions will be explored:

  • How is provenance being tracked and documented in various domains?
  • What are the technical and conceptual points of contact?
  • Is provenance tracking ultimately discipline-specific (incommensurable, as it were, between disciplines), or can we learn from how it is done in various contexts (including, for example, how materials scientists, crystallographers and astronomers track the processes their data goes through)?
  • And ultimately, what alignment or convergence is possible between different ways of (automatically) tracking provenance and process in different technologies and across domains?

The intention of this session is to promote discussion across domains, and to identify commonalities of approach which might lead us toward convergence on best practice, and ultimately to more scalable and automated solutions.

In due course, a co-branded RDA-CODATA WG or IG on this topic may be proposed.  The intention is to build collaboration that will contribute to the ISC CODATA Decadal Programme ‘Making Data Work for Cross-Domain Grand Challenges’.

Domains to be invited (through some existing IGs and through other contacts)

  • Social sciences 
  • Environmental sciences
  • Earth sciences
  • Life sciences/biodiversity
  • Crystallography/material sciences
  • Astronomy

Meeting agenda: 

Collaborative session notes (main session): https://docs.google.com/document/d/1EkT9LFKGGRVP7qhRTTHXSWm6EGPfjJSm-xIbqk9_XMk/edit?usp=sharing

Collaborative session notes (repeat session): https://docs.google.com/document/d/11Yggy1g-yD-u2o8u4TflH1F8mIruWJvZd5YujhLK-8A/edit?usp=sharing

Programme:

0-10 minutes: Welcome, tour de table and rationale for the BoF.

10-50 minutes: Brief introductory presentations on provenance and process from a sample of domains.  Presenters will be asked to briefly highlight key steps that the domain undertakes to gather process and provenance information and to indicate how this is encoded.  A set of questions and a template will be provided.

  1. Social sciences and statistics (Arofan Gregory)

  2. Astronomy (Mireille Louys)

  3. Crystallography (John Helliwell)

  4. Chemistry (Stuart Chalk)

  5. Environmental sciences (Matt Jones)

  6. Earth sciences (Kerstin Lehnert)

  7. Biomedical (Mark Musen)

50-80 minutes: Discussion including respondents.

80-90 minutes: Wrap Up and Next Steps.

 

Questions for panellists

  • What is the typical provenance and data transformation information that your domain needs to capture? Or, put another way, what is data provenance in your domain?

  • Is there a practice around provenance information?

    • If so, how is it captured and shared?

  • How widespread is it?  How much of the domain has a shared or best practice?  What is the demand in your domain?

  • What semantics, tools and software are used to do this in your domain?

  • What is the role of fine-grained process information (transformations, normalisation, etc.) vs contextual provenance information?

  • What is the level of granularity of provenance information?  What do users want in terms of provenance information, and at what level of detail?

  • What demand is there for provenance and process information?

  • How do researchers / users use provenance and process information?

 

Discussion and Respondents

We propose to invite the following as respondents for the discussion:

  1. Research Objects (Stian Soiland-Reyes, TBC)

  2. PROV-O (Timothy McPhillips)

  3. DDI-CDI (George Alter and Arofan Gregory)

  4. WholeTale (Bertram Ludaescher)

The respondents will be asked to contribute to the discussion, but not to give a formal presentation.  They will be asked to reflect, from the perspective of a technologist and implementer, on the scenarios described by the domains, and to respond without promoting any given solution.  

Type of Meeting: 
Working meeting
Please indicate the breakout slot(s) that would suit your meeting: 
Breakout 2
Breakout 4
Breakout 5
Breakout 8
Breakout 9
Are you willing to host a live second session to accommodate a different time zone? : 
Yes
Meeting presenters: 
Organisers: George Alter, Arofan Gregory, Joachim Wackerow, Simon Hodson; Presenters are listed in the agenda.
How do you prefer to hold the virtual component of your session: 
live