Data Versioning WG


Group details

Chair(s): 
Jens Klump, Lesley Wyborn, Ari Asmi, Robert Downs
Secretariat Liaison: 
Stefanie Kethers
TAB Liaison: 
Tobias Weigel
 

The demand for reproducibility of research results is growing. It will therefore become increasingly important for researchers to be able to cite the exact extract of the data set that was used to underpin their research publication. The capacity of computational hardware infrastructure has grown to the point where online petabyte-scale data stores are now common. This has encouraged the development of concatenated, seamless data sets from which users can select subsets via web services, based on spatial and temporal queries. Further, the growth in computing power means that higher-level pre-processed data products can be generated in very short time frames.

This means that data sets and data products need a systematic way of referencing the exact version of the data that was used to underpin the research findings and/or to generate higher-level products. This was recognised by the RDA Working Group on Data Citation, whose final report recognises the need for data versioning. However, the report offered no specifics on best practice for data versioning, particularly for large-volume, multi-terabyte and even petabyte-scale data sets. A BoF meeting held at the RDA Plenary in Denver in September 2016 highlighted the fact that there are no recognised best practices for versioning of data.

Versioning procedures and best practices are well established for scientific software and can be used to enable reproducibility of scientific results. The codebase of a very large software project does bear some resemblance to a large dynamic data set. Are software versioning practices suitable for data sets, or do we need a separate suite of practices for data versioning?

Ultimately, versioning concepts developed for research data will need to be brought in line with the versioning concepts used in persistent identifier systems.


The BoF first emerged at RDA Plenary 8 in Denver. The discussion is available here: https://www.rd-alliance.org/data-versioning-rda-8th-plenary-bof-meeting

Recent Activity

05 Sep 2017

RE: W3C Data eXchange Working Group is also considering versioning

Hi Simon,
Thank you for pointing out this current work at W3C. I have added it to our collection of materials.
https://docs.google.com/document/d/1TfBPlfjTVg0YcFxuw0UszAXPYrRmyZ6PCxtx...
Cheers,
Jens
--
Dr Jens Klump
E ***@***.*** T +61 8 6436 8828
CSIRO ARRC, 26 Dick Perry Avenue, Kensington, WA 6151, Australia

05 Sep 2017

W3C Data eXchange Working Group is also considering versioning

Dear Data Versioners -
It is probably of interest that Dataset Versioning is one of the topics on the list for consideration by the W3C Data eXchange Working Group (DXWG [1]); see
https://w3c.github.io/dxwg/ucr/#ID4
Note that the W3C DXWG is scheduled to deliver by 30 June 2019 [2].
The anticipated products are
1. A revision of the Data Catalogue vocabulary (DCAT)
2. A standard way to formalize DCAT profiles (for a particular community or application)

03 Apr 2017

Remote participation in IG Data Versioning RDA 9th Plenary meeting

Dear Members of the IG Data Versioning,
For those of you who cannot attend the RDA 9th Plenary session in person (https://www.rd-alliance.org/ig-data-versioning-rda-9th-plenary-meeting), we have arranged the option of remote participation.
To join the session remotely, titled "RDA Plenary 9: Data Versioning Interest Group" and held on April 5 from 14:00-15:30, please go to https://global.gotomeeting.com/join/311156213

23 Mar 2017

Research Data Alliance DDPIG Interim Outputs for review and comment

Dear RDA Interest Group members,

We wish to share with you the draft outputs created by three of
the Task Force teams of the RDA Data Discovery Paradigms Interest
Group. We think one or more of these outputs are relevant to the
work your IG is doing. Your thoughts and feedback on the three
interim documents will be greatly appreciated:

15 Mar 2017

Introduction

Hello,
My name is Benno Lee. I am a PhD student at Rensselaer Polytechnic
Institute studying data set versioning. I was wondering if I could join
the conversation about new best practices for data sets. I am working
to produce a linked data model that may be useful.
Benno