Data Citation Working Group
Recommendation Title: Scalable Dynamic Data Citation Methodology
Impact: Supports accurate citation of data subjected to change, for the efficient processing of data and linking from publications.
Recommendation package DOI: http://dx.doi.org/10.15497/RDA00016
Andreas Rauber, Vienna University of Technology
Dieter Van Uytvanck, CLARIN
Ari Asmi, University of Helsinki
Stefan Pröll, SBA Research (Secretary)
Digitally driven research depends on quickly evolving technology. As a result, many existing tools and collections of data were not developed with a focus on long-term sustainability. Researchers strive for fast results and promotion of those results, but without a consistent, long-term record of the validation of their data, evaluation and verification of research experiments and business processes are not possible.
There is a strong need for data identification and citation mechanisms that identify arbitrary subsets of large data sets with precision. These mechanisms need to be user-friendly, transparent, machine-actionable, scalable and applicable to various static and dynamic data types.
The aim of the Dynamic Data Citation Working Group was to devise a simple, scalable mechanism that allows the precise, machine-actionable identification of arbitrary subsets of data at a given point in time, irrespective of any subsequent addition, deletion or modification. The principles must be applicable regardless of the underlying database management system (DBMS) and must keep working across technological changes. The mechanism shall enable efficient resolution of the identified data, allowing it to be used both in human-readable citations and in machine-processable links to data as part of analysis processes.
The approach recommended by the Working Group relies on dynamic resolution of a data citation via a time-stamped query, also known as dynamic data citation. It is based on time-stamped and versioned source data, together with time-stamped queries used to retrieve the desired dataset in the version that was valid at the specified time.
The solution comprises the following core recommendations:
» Data Versioning: To retrieve earlier states of datasets, the data needs to be versioned. Markers shall indicate inserts, updates and deletes of data in the database.
» Data Timestamping: Ensure that operations on data are timestamped, i.e. any additions or deletions are marked with a timestamp.
» Data Identification: The data used shall be identified via a PID pointing to a time-stamped query, resolving to a landing page.
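The three recommendations above can be sketched as follows. This is a minimal illustration, not the Working Group's reference implementation: it assumes a relational store (SQLite here), and the table `measurements`, the columns `valid_from` and `valid_to`, the in-memory `query_store`, and the PID strings are all hypothetical names chosen for the example. The landing page that a real PID would resolve to is not modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Versioning + timestamping (assumed schema): rows are never overwritten or
# physically deleted. Each row version carries the interval in which it was valid.
conn.execute("""
    CREATE TABLE measurements (
        record_id  INTEGER NOT NULL,
        value      REAL    NOT NULL,
        valid_from TEXT    NOT NULL,  -- timestamp of the insert or update
        valid_to   TEXT               -- timestamp of the update or delete; NULL while current
    )
""")

def insert(record_id, value, ts):
    conn.execute("INSERT INTO measurements VALUES (?, ?, ?, NULL)",
                 (record_id, value, ts))

def delete(record_id, ts):
    # Close the current version instead of removing the row.
    conn.execute("UPDATE measurements SET valid_to = ? "
                 "WHERE record_id = ? AND valid_to IS NULL", (ts, record_id))

def update(record_id, value, ts):
    # An update closes the old version and opens a new one.
    delete(record_id, ts)
    insert(record_id, value, ts)

# Identification (assumed design): a PID points to the stored query, its
# parameters, and the timestamp at which it was issued.
query_store = {}

def cite(pid, where_sql, params, ts):
    query_store[pid] = (where_sql, params, ts)

def resolve(pid):
    # Re-run the stored query against the data as it was at the stored timestamp.
    # where_sql comes from the trusted query store, never from user input.
    where_sql, params, ts = query_store[pid]
    sql = ("SELECT record_id, value FROM measurements "
           f"WHERE ({where_sql}) "
           "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
           "ORDER BY record_id")
    return conn.execute(sql, (*params, ts, ts)).fetchall()

# ISO 8601 UTC strings compare correctly as plain text.
insert(1, 10.0, "2015-01-01T00:00:00Z")
insert(2, 20.0, "2015-01-01T00:00:00Z")
cite("hypothetical-pid-1", "value >= ?", (5.0,), "2015-02-01T00:00:00Z")

# Later changes do not affect the cited subset.
update(1, 11.0, "2015-03-01T00:00:00Z")
delete(2, "2015-04-01T00:00:00Z")
insert(3, 30.0, "2015-05-01T00:00:00Z")
cite("hypothetical-pid-2", "value >= ?", (5.0,), "2015-06-01T00:00:00Z")

print(resolve("hypothetical-pid-1"))  # → [(1, 10.0), (2, 20.0)]
print(resolve("hypothetical-pid-2"))  # → [(1, 11.0), (3, 30.0)]
```

The same query yields different results under the two PIDs because each resolves against the data state at its own timestamp. A validity interval per row version is only one way to mark inserts, updates and deletes; a separate history table would serve the same purpose.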
Instead of providing static data exports or textual descriptions of data subsets, we support a dynamic, query-centric view of data sets. The proposed solution enables precise identification of the exact subset and version of data used, supporting reproducibility of processes as well as sharing and reuse of data.
The attached recommendation gives a set of 14 clear rules that, if followed, make your dynamic data citable.