Principles and best practices in data versioning for all data sets big and small
By Mingfang Wu
Supporting Output title: Principles and best practices in data versioning for all data sets big and small
Authors: Jens Klump, Lesley Wyborn, Robert Downs, Ari Asmi, Mingfang Wu, Gerry Ryder, Julia Martin
Impact: Provides recommendations for standard practices in the versioning of research data, adding a central element to the systematic management of research data at any scale. This in turn enhances reproducibility and enables attribution to any person or organisation that contributed to the development or funding of any version of a dataset.
Citation: Klump, J., Wyborn, L., Downs, R., Asmi, A., Wu, M., Ryder, G., & Martin, J. (2020). Principles and best practices in data versioning for all data sets big and small. Version 1.1. Research Data Alliance. DOI: 10.15497/RDA00042.
The demand for better reproducibility of research results is growing. More and more data is becoming available online, and in some cases datasets have become so large that downloading them is no longer feasible. Data can also be offered through web services and accessed on demand, meaning that parts of the data are retrieved from a remote source when needed. In this scenario, it will become increasingly important for researchers to be able to cite the exact extract of the dataset that was used to underpin their research publication. However, while the means to identify datasets using persistent identifiers have been in place for more than a decade, systematic data versioning practices are currently not available.
Versioning procedures and best practices are well established for scientific software. The related Wikipedia article gives an overview of software versioning practices. The codebase of a large software project bears some resemblance to a large dynamic dataset. Are versioning practices for code therefore also suitable for datasets, or do we need a separate suite of practices for data versioning? How can we apply our knowledge of versioning code to improve data versioning practices? This Working Group investigated the extent to which these practices can be used to enhance the reproducibility of scientific results.
The Research Data Alliance (RDA) Data Versioning Working Group produced this white paper to document use cases and practices, and to make recommendations for the versioning of research data. To further adoption of the outcomes, the Working Group contributed selected use cases and recommended data versioning practices to other groups in RDA and W3C. The outcomes of the RDA Data Versioning Working Group add a central element to the systematic management of research data at any scale by providing recommendations for standard practices in the versioning of research data. These practice guidelines are illustrated by a collection of use cases.
Report of the RDA Data Versioning Working Group_V1.1.pdf