Data Versioning RDA 8th Plenary BoF meeting

Meeting title: Is there a need to develop agreed best practice for versioning of Dynamic Data Sets?

Please give a short introduction describing the scope of the group and any previous activities

The demand for reproducibility of research results is growing, so it will become increasingly important for researchers to be able to cite the exact extract of the data set that underpins their research publication. As the capacity of computational hardware infrastructures has grown, it is now common to have online petabyte-scale data stores. This has encouraged the development of concatenated, seamless data sets from which users can select subsets through web services, based on spatial and temporal queries. Further, the growth in computing power means that higher-level pre-processed data products can be generated in very short time frames.

This means that data sets and data products need a systematic way of referencing the exact version of the data that was used to underpin the research findings and/or to generate higher-level products. This was recognised by the RDA Working Group on Data Citation, whose final report recognises the need for data versioning. However, no specifics were given on best practice for data versioning, particularly for large-volume multi-terabyte and even petabyte-scale data sets.

There are two use cases for dynamic data. In the first, nothing is done to the existing data sets, and new data are simply appended at identifiable points. For this case, versioning is more straightforward.

The second use case is more complex: existing data sets, models and derivative products are revised with new data, or the data themselves are revised as processing methods improve. For this case, there do not appear to be agreed principles on how data should be versioned.
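
As a purely illustrative sketch of the two use cases above, the following Python fragment distinguishes an append-only update from a revision of existing records by comparing a checksum of the previously published records. The function names (fingerprint, classify_change, next_version), the assumption that the records fit in memory, and the major/minor numbering are all hypothetical choices made for this example; they are not an agreed practice and are not taken from the RDA recommendations.

```python
import hashlib
import json


def fingerprint(records):
    """Return a SHA-256 hex digest of a list of JSON-serializable records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def classify_change(old_records, new_records):
    """Classify a change as 'unchanged', 'append' or 'revision'."""
    if new_records == old_records:
        return "unchanged"
    # Use case 1: the old records are untouched and new ones follow them.
    prefix = new_records[: len(old_records)]
    if len(new_records) > len(old_records) and fingerprint(prefix) == fingerprint(old_records):
        return "append"
    # Use case 2: at least one previously published record was altered or removed.
    return "revision"


def next_version(version, change):
    """Hypothetical numbering: appends bump minor, revisions bump major."""
    major, minor = version
    if change == "append":
        return (major, minor + 1)
    if change == "revision":
        return (major + 1, 0)
    return (major, minor)


# Example data (invented for illustration only).
v1 = [{"t": 1, "value": 10.0}, {"t": 2, "value": 11.5}]
appended = v1 + [{"t": 3, "value": 12.0}]
revised = [{"t": 1, "value": 10.2}, {"t": 2, "value": 11.5}]

print(classify_change(v1, appended), next_version((1, 0), classify_change(v1, appended)))
print(classify_change(v1, revised), next_version((1, 0), classify_change(v1, revised)))
```

In this sketch an append only increments the minor number, while a revision starts a new major version. Whether a scheme of this kind is suitable, especially at multi-terabyte or petabyte scale, is the sort of question the meeting is intended to discuss.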

Versioning procedures and best practices are well established for scientific software and can be used to enable reproducibility of scientific results. Are these suitable for data sets, or do we need a separate suite of practices for data versioning?

Ultimately versioning will need to be attached to persistent identifiers.

Please provide additional links to informative material related to the group, e.g. case statements, working documents, etc.

Identification of Reproducible Subsets for Data Citation, Sharing and Reuse: https://www.rd-alliance.org/system/files/documents/TCDL-RDA-Guidelines_1...

Data Citation of evolving data https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_15...

 

Please list the meeting objectives

To determine if there is a need to establish an RDA working group on developing agreed practices for Data Versioning

Meeting agenda

1. Introductions

2. Why, How and What of Data Versioning

        Why (Lesley Wyborn)

        How (Jens Klump)

        Where  - Lightning presentations

                     Bob Downs (NASA Socioeconomic Data and Applications Center (SEDAC))

                     Mike Jones (Mendeley)

                     Cynthia Chandler (BCO-DMO)

                     Joe Hand (Dat Data)

4. General Discussion on whether there is a need to establish an RDA WG or IG to move forward.
 

Audience: Please specify who is your target audience and how they should prepare for the meeting

Anyone who is interested in moving data versioning forward.

Those who have begun to systematize data versioning are encouraged to contribute their use cases to the discussion.

Group chair serving as contact person: Lesley Wyborn