WG Data Versioning Working Meeting
A short introduction describing the activities and the scope of the group:
The demand for reproducibility of research results is growing, so it will become increasingly important for researchers to be able to cite the exact extract of the data set that was used to underpin their research publication. The capacity of computational hardware infrastructures has grown, and it is now common to have online petabyte data stores. This has encouraged the development of concatenated seamless data sets where users can use web services to select subsets based on spatial and time queries. Further, the growth in computing power has meant that higher-level pre-processed data products can be generated in very short time frames.
This means that data sets and data products need a systematic way of referencing the exact version of the data that was used to underpin the research findings and/or to generate higher-level products. This was recognised by the RDA Working Group on Data Citation, whose final report recognises the need for data versioning. However, the report gave no specifics on best practice for data versioning, particularly for large-volume, multi-terabyte and even petabyte-scale data sets.
There are two use cases for dynamic data. In the first, nothing is done to the existing data sets, and new data are simply appended at identifiable occurrences. For this case, versioning is more straightforward.
The second use case is more complex and involves existing data sets, models and derivative products being revised with new data, or the data itself being revised as processing methods are improved. For this case, there do not appear to be agreed principles on how data should be versioned.
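As a rough illustration of the two use cases above, the following Python sketch compares two snapshots of a small data set and reports whether the newer one only appends records or revises existing ones. Everything in it is an assumption made for illustration: the function names (fingerprint, classify_change, next_version), the record layout, and in particular the rule that an append bumps a minor number while a revision bumps a major number. No such rule has been agreed by the group.

```python
import hashlib


def fingerprint(records):
    """Return a SHA-256 hex digest computed over the records in order."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(repr(record).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def classify_change(old, new):
    """Classify how snapshot `new` differs from snapshot `old`.

    Returns 'unchanged', 'appended' (old records intact, new ones added
    at the end) or 'revised' (any existing record changed or removed).
    """
    old, new = list(old), list(new)
    if new == old:
        return "unchanged"
    if len(new) > len(old) and new[:len(old)] == old:
        return "appended"
    return "revised"


def next_version(version, change):
    """Hypothetical numbering rule: revise -> major + 1, append -> minor + 1."""
    major, minor = version
    if change == "revised":
        return (major + 1, 0)
    if change == "appended":
        return (major, minor + 1)
    return (major, minor)


if __name__ == "__main__":
    v1 = [("2017-09-01", 1.0), ("2017-09-02", 2.0)]
    v_append = v1 + [("2017-09-03", 3.0)]
    v_revise = [("2017-09-01", 1.5), ("2017-09-02", 2.0)]
    for snapshot in (v_append, v_revise):
        change = classify_change(v1, snapshot)
        print(change, next_version((1, 0), change), fingerprint(snapshot)[:12])
```

The fingerprint is included only to show one way a citable snapshot could be identified independently of its version number. For petabyte-scale stores, a full comparison like this would not be practical, and checksums per file or per chunk would be a more plausible starting point.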
Versioning procedures and best practices are well established for scientific software and can be used to enable reproducibility of scientific results. Are these suitable for data sets, or do we need a separate suite of practices for data versioning?
The IG initially emerged from a BoF at Plenary 8 in Denver through the discussion available here: https://www.rd-alliance.org/data-versioning-rda-8th-plenary-bof-meeting.
A subsequent breakout was held at Plenary 9 in Barcelona: https://www.rd-alliance.org/ig-data-versioning-rda-9th-plenary-meeting
The objective was to seek interest in forming a Data Versioning Interest Group, and a suite of use cases was documented. There was consensus that documenting best practices and developing guidelines for versioning was the path forward. There was also agreement that the proposed work was highly relevant to the Software Source Code IG, the Dynamic Data IG and the Provenance IG.
Given the maturity of the use cases presented in Barcelona, subsequent discussions led to agreement to turn the proposed IG into a WG and to produce a high-level guideline for data versioning based on the material compiled so far. More specific versioning concerns could be followed up subsequently in an IG.
Additional links to informative material related to the group i.e. group page, Case statement, working documents etc:
Case statement: https://www.rd-alliance.org/group/data-versioning-ig/case-statement/data...
Notes from Denver Plenary BoF meeting: https://www.rd-alliance.org/data-versioning-rda-8th-plenary-bof-meeting
Use cases and definitions: https://docs.google.com/document/d/1TfBPlfjTVg0YcFxuw0UszAXPYrRmyZ6PCxtx...
Links provided by Simon Cox:
W3C Data eXchange Working Group
RDA Data Citation work
W3C Data on the Web
* Establish an RDA Working Group on developing agreed practices for Data Versioning and develop a work plan;
* Seek further documented cases where groups/organisations are undertaking versioning;
* Develop a white paper on best practices for versioning for a spectrum of data types (files, databases, unstructured data, model runs, etc.), including assignment of persistent identifiers.
2. Why, How and What of Data Versioning
- Why - case statement (Lesley Wyborn)
- How - overview of identified practices (Jens Klump)
- Where - Lightning presentations
3. Develop work plan for RDA WG on Data Versioning.
Anyone who is interested in moving data versioning forward. We particularly welcome contributions on use cases of data versioning and data versioning policies.
Group chair serving as contact person: Jens Klump
Type of meeting: Working meeting
Session Room: Mansfield 2
Session Time: Thursday 21 September, Breakout 8, 11:00 - 12:30
Collaborative session notes:
Session slides and materials: See attachments
Remote Access Instructions (Gotomeeting):
Access Code: 575-504-053
Australia: +61 2 9087 3604
Austria: +43 1 2530 22520
Belgium: +32 28 93 7018
Canada: +1 (647) 497-9410
Denmark: +45 32 72 03 82
Finland: +358 923 17 0568
France: +33 170 950 594
Germany: +49 692 5736 7317
Ireland: +353 15 360 728
Italy: +39 0 230 57 81 42
Netherlands: +31 207 941 377
New Zealand: +64 9 280 6302
Norway: +47 21 93 37 51
Spain: +34 932 75 2004
Sweden: +46 853 527 827
Switzerland: +41 225 4599 78
United Kingdom: +44 330 221 0088
United States: +1 (312) 757-3136