So You Want to Track Provenance
Concepts and Considerations
Anna Krohn : firstname.lastname@example.org
RDA/US Scholars Program Internship and Research Data Provenance Interest Group
Over the course of the RDA/US Interns Program, I worked with Bridget Almas and David Dubin of the Research Data Provenance Interest Group to examine provenance conceptually and practically. My initial literature review distinguished several varieties of provenance and gave me a solid background. I then explored metadata standards, primarily those in the Digital Curation Centre List of Metadata Standards, with the goal of identifying the provenance recording capabilities of the listed standards and any potential gaps in the developing Metadata Standards Directory RDA deliverable. Taking what I’d learned from the literature and standards reviews, I applied the PROV standard to a use case. This poster synthesizes these activities into a set of considerations for those who wish to add provenance metadata to their datasets but are unsure where to begin. This need surfaced in feedback from attendees of the Research Data Provenance Interest Group meeting at the 3rd RDA plenary in Dublin. Together, definitions of core terminology, identification of common standards, and a roadmap for applying both to a specific use case may serve as a useful starting point for this anticipated deliverable of the interest group.
Adding provenance tracking to a dataset must begin with an examination of the dataset’s context. In my review of the provenance research literature, I compiled definitions of several varieties of provenance, which fall into three distinct “traditions” that revolve around context. The oldest of these “traditions” is database provenance. Next, as e-science became more prevalent, discussions of workflow provenance appeared in the literature. Most recently, models for tracking the provenance of web resources arose. Within each of these “traditions” there are sub-concepts of provenance. For databases there are ‘Why,’ ‘Where,’ and ‘How’ provenance; workflows have ‘Actor,’ ‘Input,’ and ‘Interaction’ provenance; and the web adds ‘Access’ provenance. The definitions featured on the poster, as sourced from the literature, have also been refined and compiled into a SKOS Vocabulary by David Dubin and will be integrated into the Data Foundation and Terminology Working Group’s registry of foundational terminology.
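As a rough illustration only, the grouping of sub-concepts by “tradition” described above can be written down as a small lookup table. This is a sketch, not the SKOS Vocabulary itself: the dictionary keys, the variable name PROVENANCE_TYPES, and the function tradition_of are names assumed here for illustration, and the definitions of each term are not reproduced.

```python
# Hypothetical lookup table mirroring the three "traditions" named in the text.
# Only the sub-concepts listed in the text are included; definitions are omitted.
PROVENANCE_TYPES = {
    "database": ["Why", "Where", "How"],
    "workflow": ["Actor", "Input", "Interaction"],
    "web": ["Access"],
}


def tradition_of(sub_concept):
    """Return the tradition that a sub-concept is listed under, or None."""
    for tradition, sub_concepts in PROVENANCE_TYPES.items():
        if sub_concept in sub_concepts:
            return tradition
    return None


if __name__ == "__main__":
    for name in ("Where", "Input", "Access"):
        print(name, "provenance belongs to the", tradition_of(name), "tradition")
```

A lookup of this kind could help when deciding which tradition, and therefore which standards and tools, is most relevant to a given dataset.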
A basic understanding of these types of provenance and how they relate to a dataset leads into methods of recording provenance metadata. The nature of the dataset, the sort of provenance that is applicable, and how that provenance will be used should inform the choice of standards and tools. I provide a selection of common methods, standards, and tools to serve as a starting point for investigating a provenance strategy.
Lastly, I illustrate the application of one standard, the W3C recommendation PROV, to a use case. The use case, preparing a text for linguistic annotation through a series of manual and automated processes, is a complex workflow but does not come from a scientific domain. It illustrates the possibilities of provenance use in the humanities, which is currently a less typical application.
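To give a sense of what applying PROV to such a workflow can look like, here is a minimal sketch that records two processing steps as simple subject-predicate-object triples using PROV-O class and property names (prov:Entity, prov:Activity, prov:Agent, prov:used, prov:wasGeneratedBy, prov:wasAssociatedWith, prov:wasDerivedFrom). It is not the model shown on the poster. The two steps (a manual transcription and an automated tokenization), every ex: identifier, and the helper functions are assumptions made for illustration, and the triples are plain Python tuples rather than a validated RDF serialization.

```python
from collections import namedtuple

# A simplified triple; "obj" is used instead of "object" to avoid shadowing a builtin.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


def record_step(activity, agent, source, result):
    """Return PROV-O style triples describing one processing step."""
    return [
        Triple(activity, "rdf:type", "prov:Activity"),
        Triple(agent, "rdf:type", "prov:Agent"),
        Triple(source, "rdf:type", "prov:Entity"),
        Triple(result, "rdf:type", "prov:Entity"),
        Triple(activity, "prov:used", source),
        Triple(result, "prov:wasGeneratedBy", activity),
        Triple(activity, "prov:wasAssociatedWith", agent),
        Triple(result, "prov:wasDerivedFrom", source),
    ]


def build_trace():
    """Build a two-step trace; all ex: names are hypothetical."""
    triples = []
    # Manual step: a person transcribes a scanned page.
    triples += record_step("ex:transcription", "ex:editor",
                           "ex:pageScan", "ex:transcribedText")
    # Automated step: a service tokenizes the transcription.
    triples += record_step("ex:tokenization", "ex:tokenizerService",
                           "ex:transcribedText", "ex:tokenizedText")
    # Drop repeated triples while keeping the original order.
    return list(dict.fromkeys(triples))


def derivation_chain(triples, entity):
    """Follow prov:wasDerivedFrom links from an entity back to its earliest source."""
    derived_from = {t.subject: t.obj for t in triples
                    if t.predicate == "prov:wasDerivedFrom"}
    chain = [entity]
    while chain[-1] in derived_from:
        chain.append(derived_from[chain[-1]])
    return chain


if __name__ == "__main__":
    trace = build_trace()
    for t in trace:
        print(t.subject, t.predicate, t.obj)
    print(" <- ".join(derivation_chain(trace, "ex:tokenizedText")))
```

Recording both the manual and the automated step with the same small set of relations is what makes it possible to trace a prepared text back to its source, regardless of whether a person or a program carried out each step.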
I am grateful to the RDA and RDA/US Scholars Program for the opportunity to pursue this project, and specifically to Beth Plale and Inna Kouper for their organization and support of the internship. I especially wish to thank Bridget Almas and David Dubin for their guidance.