- Pilot name: Git Reference Implementation
- Contact person: Stefan Pröll, Kristof Meixner
- Type: research pilot
- Status: completed
- Type of data: ASCII
- Dynamics: frequent (minutes)
- Domain: Generic
- Short description: Generic approach for referencing and citing data sets in ASCII
- Solution / approach: Use Git for handling the versioning. Store the subset queries in Git as well, or in a query store.
- Timeline: 2016
- Supplementary material:
  - Stefan Proell and Kristof Meixner and Andreas Rauber, "Precise Data Identification Services for Long Tail Research Data," in 13th International Conference on Digital Preservation (iPRES 2016), 2016. PDF
  - Prototype at GitHub
  - Screenshots below
RDA Data Citation Recommendations and their Application in the Git Reference Implementation
The source code management system Git is very popular in software development. Git allows users to work collaboratively on files without disrupting each other's work. It takes care of versioning the files and allows separate branches to be created, so that changes made to the same files by different users do not interfere with each other. These branches can later be merged, incorporating the changes back into a single file. As each revision of a file is stored, users can access earlier versions of each file at any time. Each committed change is unambiguously identified by a unique commit hash.
The principles Git applies to source code management can also be used to store subsets of data in a reproducible way. Instead of storing each version of a subset as an individual data export, we store only the parent data set in a versioned fashion with Git. We can then use any query language, such as SQL, to create subsets of the parent data set by executing queries on the versioned data. This allows us to create specific subsets of a potentially large data set and keep track of all changes in the parent data set as they occur. We also store the queries as plain text in the Git repository and create a link between the specific parent data set version and each query we want to keep. As both the query and the data set are versioned and uniquely identified by a commit hash, we can re-execute the query on the desired version of the data set and retrieve the subset as it was at the given time. This adds reproducibility for subsets of evolving data sources in a very simple way.
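The idea of running a query against a specific version of the parent data set can be illustrated with a minimal Python sketch. It is not the prototype's code: it assumes the parent data set is a CSV file with a header row, reads the file as it was at a given commit with `git show <commit>:<path>`, and loads it into an in-memory SQLite table so that SQL can be run on it. The function names and the table name `data` are hypothetical.

```python
import csv
import io
import sqlite3
import subprocess


def read_file_at_commit(repo_dir, commit, path):
    """Return the content of `path` as it was at `commit`, via `git show`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", f"{commit}:{path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run `sql` on it.

    Assumption: the first CSV row is a header. All values are stored as
    text, so numeric comparisons need an explicit CAST in the query.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', records
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def subset_at_commit(repo_dir, commit, path, sql):
    """Re-create a subset: the same query on the same version of the data."""
    return query_csv(read_file_at_commit(repo_dir, commit, path), sql)
```

Calling `subset_at_commit` with the same commit hash and the same query text should yield the same rows no matter how the file has changed in later commits, which is the reproducibility property described above.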
Users can select a data set from a list of available data sets.
Users can enter a SQL query for creating a specific subset.
The subset is shown to the user, who can then either save the subset or make refinements.
Instead of storing the subset as an individual data export, the system stores the query, the commit hash, a persistent identifier and a description.
In order to retrieve the same subset again, users can select the data set and the subset via the interface and re-execute the query on the versioned data.
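The stored information and the retrieval step can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype's actual format: the record holds the four items named above plus a hypothetical `dataset_path` field so the data file can be located, it is serialized as JSON so it can be kept as plain text, and the two callables passed to `reexecute` stand in for whatever mechanism loads a file version from Git and runs the query.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SubsetRecord:
    pid: str           # persistent identifier (how it is minted is left open)
    query: str         # the SQL text entered by the user
    commit_hash: str   # Git commit of the parent data set the query ran on
    description: str   # free-text description of the subset
    dataset_path: str  # assumption: path of the data file in the repository


def to_text(record):
    """Serialize a record as plain text (JSON), e.g. for storage in Git."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_text(text):
    """Rebuild a record from its plain-text form."""
    return SubsetRecord(**json.loads(text))


def reexecute(record, load_version, run_query):
    """Retrieve the subset again.

    `load_version(commit_hash, path)` and `run_query(data, query)` are
    hypothetical callables supplied by the caller.
    """
    data = load_version(record.commit_hash, record.dataset_path)
    return run_query(data, record.query)
```

Because only the record is stored, the size of a saved subset does not depend on the size of the data it selects; the cost is that the query has to be executed again whenever the subset is retrieved.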