The demand for reproducibility of research results is growing, so it will become increasingly important for researchers to be able to cite the exact version of the data set that underpins their research publication. The capacity of computational hardware infrastructures has grown, and this has encouraged the development of concatenated, seamless data sets from which users can select subsets through web services using spatial and temporal queries. Further, the growth in computing power means that higher-level data products can be generated in very short time frames.
Combined, this means that we need a systematic way to refer to the exact version of a data set or data product that was used to underpin research findings or to generate higher-level products. This was recognised by the RDA Working Group on Data Citation, whose final report identifies a need for systematic data versioning practices, which are not yet available. This gap was discussed at a Birds of a Feather (BoF) meeting held at the RDA Plenary in Denver in September 2016, resulting in the formation of an Interest Group on data versioning.
Versioning procedures and best practices are well established for scientific software and can be used to enable reproducibility of scientific results. The codebase of a very large software project bears some resemblance to a large dynamic data set. Are these software versioning practices suitable for data sets, or do we need a separate suite of practices for data versioning?