Data Versioning IG











The Data Versioning WG has transitioned to the Data Versioning IG as of July 2021. The email address and group space have remained the same. 


The demand for reproducibility of research results and for the re-use of data is growing. It will therefore become increasingly important for researchers to be able to cite the exact version of the dataset that underpins their research publication. The capacity of computational hardware infrastructures has grown, and this has encouraged the development of concatenated, seamless datasets from which users can select subsets through web services, based on spatial and time queries or other data attributes. Further, the growth in computing power means that higher-level data products can be generated in very short time frames. We therefore need a systematic way to refer to the exact version of a dataset or data product that was used to underpin research findings or to generate higher-level data products, including who developed it and who funded it.


Versioning procedures and best practices are well established for scientific software and can be used to enable reproducibility of scientific results. The codebase of a very large software project bears some resemblance to a large dynamic dataset. Are these practices suitable for datasets, or do we need different practices for data versioning? The need for unambiguous references to specific datasets was recognised by the RDA Working Group on Data Citation, whose final report notes the need for systematic data versioning practices.


This gap was discussed at a BoF meeting held at the RDA Plenary in September 2016 in Denver, resulting in the formation in 2017 of an RDA Interest Group on Data Versioning. A review of the recommendations by this RDA Data Versioning IG concluded that systematic data versioning practices were not available. In 2018 the Working Group was formed, and it first met at P12 in Gaborone. It focused on assessing current practices and compiled 39 use cases of data versioning across 33 organisations globally. In January 2020, the WG produced a white paper documenting these use cases and recommended practices (Klump et al., 2020). The WG delineated 6 high-level principles, which provide a framework for guiding the consistent practice of data versioning and can also serve as guidance for data centres or data providers when setting up their own data revision and version protocols and procedures (Klump et al., 2021). To further the adoption of these outcomes, the proposed new interest group plans to contribute the use cases and recommended data versioning practices to other groups in RDA, W3C, and other emerging activities in this field.


Please read the group's charter for more information.


The BoF initially emerged at Plenary 8 in Denver from the discussion available here: https://www.rd-alliance.org/data-versioning-rda-8th-plenary-bof-meeting