status: Recognised & Endorsed

Chair(s): Jens Klump, Lesley Wyborn, Mingfang Wu, Kirsten Elger

Group Email: [group_email]

Secretariat Liaison: Stefanie Kethers


The Data Versioning WG has transitioned to the Data Versioning IG as of July 2021. The email address and group space have remained the same. 


The demand for reproducibility of research results is growing. It will therefore become increasingly important for researchers to be able to cite the exact extract of the data set that was used to underpin their research publication. The capacity of computational hardware infrastructures has grown, and it is now common to have online petabyte-scale data stores. This has encouraged the development of concatenated, seamless data sets from which users can select subsets through web services, based on spatial and temporal queries. Further, the growth in computing power means that higher-level, pre-processed data products can be generated in very short time frames.

This means that data sets and data products need a systematic way to reference the exact version of the data that was used to underpin research findings and/or to generate higher-level products. This was recognised by the RDA Working Group on Data Citation, whose final report acknowledges the need for data versioning. However, there were no specifics on best practice for data versioning, particularly for large-volume, multi-terabyte and even petabyte-scale data sets. A BoF meeting held at the RDA Plenary in September 2016 in Denver highlighted that there are no recognised best practices for versioning of data.

Versioning procedures and best practices are well established for scientific software and can be used to enable reproducibility of scientific results. The codebase of a very large software project does bear some resemblance to a large dynamic data set. Are these practices suitable for data sets, or do we need a separate suite of practices for data versioning?

Ultimately, versioning concepts developed for research data will need to be brought in line with versioning concepts used in persistent identifier systems.


The BoF initially emerged at Plenary 8 in Denver through the discussion available here: https://www.rd-alliance.org/data-versioning-rda-8th-plenary-bof-meeting

Outputs

08 April 2021

Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles

by Mingfang Wu

The supporting output from this working group has been revised and published in the Data Science Journal.


16 January 2020

Principles and best practices in data versioning for all data sets big and small

by Mingfang Wu

The demand for better reproducibility of research results is growing. More and more data is becoming available online. In some cases, the datasets have become so large that downloading the data is no longer feasible. Data can also be offered through web services and accessed on demand.


16 January 2020

Compilation of Data Versioning Use cases from the RDA Data Versioning Working Group

by Mingfang Wu

Data versioning is a fundamental element in ensuring the reproducibility of research. Work in other RDA groups on data provenance and data citation, as well as in the W3C Dataset Exchange Working Group, has highlighted that definitions of data versioning concepts and recommended practices are still missing.

