Principles and best practices in data versioning for all data sets big and small

16 Jan 2020

By Mingfang Wu


Data Versioning WG

Group co-chairs: 

Jens Klump, Lesley Wyborn, Ari Asmi, Robert Downs

Supporting Output title: Principles and best practices in data versioning for all data sets big and small  

Authors: Jens Klump, Lesley Wyborn, Robert Downs, Ari Asmi, Mingfang Wu, Gerry Ryder, Julia Martin

DOI: 10.15497/RDA00042

Citation:  Klump, J., Wyborn, L., Downs, R., Asmi, A., Wu, M., Ryder, G., & Martin, J. (2020). Principles and best practices in data versioning for all data sets big and small. Version 1.1. Research Data Alliance. DOI: 10.15497/RDA00042.

 

Abstract:

The demand for better reproducibility of research results is growing. More and more data is becoming available online. In some cases, the datasets have become so large that downloading the data is no longer feasible. Data can also be offered through web services and accessed on demand. This means that parts of the data are accessed at a remote source when needed. In this scenario, it will become increasingly important for a researcher to be able to cite the exact extract of the data set that was used to underpin their research publication. However, while the means to identify datasets using persistent identifiers have been in place for more than a decade, systematic data versioning practices are currently not available.

Versioning procedures and best practices are well established for scientific software. The related Wikipedia article gives an overview of software versioning practices. The codebase of a large software project bears some resemblance to a large dynamic dataset. Are versioning practices for code therefore also suitable for datasets, or do we need a separate suite of practices for data versioning? How can we apply our knowledge of versioning code to improve data versioning practices? This Working Group investigated to what extent these practices can be used to enhance the reproducibility of scientific results.

The Research Data Alliance (RDA) Data Versioning Working Group produced this white paper to document use cases and practices, and to make recommendations for the versioning of research data. To further adoption of the outcomes, the Working Group contributed selected use cases and recommended data versioning practices to other groups in RDA and W3C. The outcomes of the RDA Data Versioning Working Group add a central element to the systematic management of research data at any scale by providing recommendations for standard practices in the versioning of research data. These practice guidelines are illustrated by a collection of use cases.

 

Please note that the previous version (v1.0) underwent community review. The current version (v1.1) was updated following the community review.

 

 

Output Status: RDA Supporting Outputs
Review period: Tuesday, 28 January 2020 to Friday, 28 February 2020
Domain Agnostic: 
Domain Agnostic
  • Author: Robert Huber

    Date: 21 Jan, 2020

    I like it, but the recommendations could be clearer. For example, what does "DataCite recommends" mean, or a statement that another WG recommends something? Do you follow those recommendations, or should the reader decide?

    Please clarify whether you recommend following third-party recommendations.

  • Author: Martin Schultz

    Date: 02 Feb, 2020

    Great document! In my view, the recommendations are clear enough. However, it would be great to have reference implementations to get all the details sorted out. I have only one comment, concerning the similarity between data and software versioning: with software, one usually has a stable identifier for the work, for example a git repository. This is not made explicit for dataset versioning. The document only recommends building collections and versioning those. As these will generally yield a new identifier with each version, it is not easy to define a stable "landing identifier" for a dataset as a work. Of course, in large projects (e.g. satellite data) there will be web pages built to describe the work. For long-tail data this might only rarely be the case. Hence, I would advocate the use of a stable identifier in such cases. The open question is then how to make sure that new versions are linked to this stable identifier. This is easy in software versioning, because the stable identifier is the versioning system itself.
