Recommendations by the RDA WG on Data Citation: Making Data Citable
The recommendations of the Working Group on Data Citation (https://rd-alliance.org/groups/data-citation-wg.html) on how to make data citable have been published and are available in the download section below. The recommendations come in two flavors: a two-page flyer describing the concepts briefly, and a more extensive article discussing the implications and consequences of applying and implementing the recommendations in different scenarios. The most recent version of both documents is kept in the file repository of the Working Group and can be retrieved from the download section below.
Adoption stories are available from the webinar page of the WGDC at https://www.rd-alliance.org/group/data-citation-wg/webconference/webconference-data-citation-wg.html
Background and History
The recommendations were first presented on March 30, 2015 and have since undergone several revisions following a series of individual consultations, presentations and workshops with WG pilots. The result of these revisions is now available for further public comment and remains subject to further revision.
The present document summarizes the key aspects discussed within the WG in a 2-page flyer. This compressed format cannot reflect all the detailed discussions or provide all the reasoning behind the recommendations. It also does not cover individual implementation aspects of these recommendations in specific settings. These are part of ongoing evaluations that complement the finalization of the recommendations, both in existing WG pilot evaluations and in new pilots being started.
The document presents 14 recommendations grouped into 4 phases (Preparing the data and query store; Persistently identifying specific data sets; Resolving a PID; Actions to be taken upon modification of the data infrastructure). To ease commenting, these are provided below (please see the attached document for further details, the mission of the WG, the benefits of the proposed solution and short answers to the most pertinent FAQs):
A. Preparing the Data and the Query Store
- R1 – Data Versioning: To allow earlier states of data sets to be retrieved, the data needs to be versioned.
- R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions and deletions are marked with a timestamp.
- R3 – Query Store: Provide means to store the queries used to select data and associated metadata.
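As an illustration of R1 and R2, the following is a minimal sketch, not part of the recommendations themselves. It assumes a relational store (SQLite here) in which every record version carries a validity interval. The table layout, the column names `valid_from` and `valid_to`, the helper function names and the use of integer logical timestamps are all assumptions made for this example.

```python
import sqlite3

# Hypothetical schema: rows are never overwritten or physically deleted.
SCHEMA = """
CREATE TABLE records (
    id         INTEGER NOT NULL,
    value      TEXT,
    valid_from INTEGER NOT NULL,  -- timestamp at which this version was added
    valid_to   INTEGER            -- timestamp at which it was superseded or deleted
)
"""

def create_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def insert_record(conn, rec_id, value, ts):
    # An addition is marked with the timestamp of the operation.
    conn.execute(
        "INSERT INTO records (id, value, valid_from, valid_to) VALUES (?, ?, ?, NULL)",
        (rec_id, value, ts),
    )

def delete_record(conn, rec_id, ts):
    # A deletion only closes the validity interval of the current version.
    conn.execute(
        "UPDATE records SET valid_to = ? WHERE id = ? AND valid_to IS NULL",
        (ts, rec_id),
    )

def update_record(conn, rec_id, value, ts):
    # An update is modelled as a deletion followed by an addition.
    delete_record(conn, rec_id, ts)
    insert_record(conn, rec_id, value, ts)

def select_as_of(conn, ts):
    # Return the state of the data as it existed at timestamp ts.
    cursor = conn.execute(
        "SELECT id, value FROM records"
        " WHERE valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)"
        " ORDER BY id",
        (ts, ts),
    )
    return cursor.fetchall()
```

With this layout, `select_as_of(conn, ts)` reconstructs an earlier state of the data without any separate backup, which is what re-executing a timestamped query later relies on.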
B. Persistently Identify Specific Data Sets
When a data set should be persisted, the following steps need to be applied:
- R4 – Query Uniqueness: Re-write the query to a normalised form so that identical queries can be detected. Compute a checksum of the normalised query to efficiently detect identical queries.
- R5 – Stable Sorting: Ensure an unambiguous sorting of the records in the data set.
- R6 – Result Set Verification: Compute a checksum of the query result set to enable verification of the correctness of a result upon re-execution.
- R7 – Query Timestamping: Assign a timestamp to the query either based on the last update to the entire database or the last update to the selection of data affected by the query or the query execution time. This allows retrieving the data as it existed at query time.
- R8 – Query PID: Assign a new PID to the query if either the query is new or if the result set returned from an earlier identical query is different due to changes in the data. Otherwise, return the existing PID.
- R9 – Store Query: Store the query and its metadata (e.g. PID, original and normalised query, query and result set checksums, timestamp, superset PID, data set description and others) in the query store.
- R10 – Citation Text: Provide a recommended citation text and the PID to the user.
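The steps R4 to R9 can be sketched roughly as follows. This is an illustration under simplifying assumptions, not a reference implementation: the normalisation only collapses whitespace and drops a trailing semicolon (a real system would parse the query), rows are assumed to be tuples of mutually comparable values so that sorting them gives a stable order, and the `QueryStore` class, its `register` method and the `pid:` prefix are hypothetical names. A random UUID stands in for a PID that would normally be minted by a PID service.

```python
import hashlib
import uuid

def normalise_query(query):
    # Simplified normalisation (assumption): collapse whitespace, drop trailing ';'.
    return " ".join(query.strip().rstrip(";").split())

def query_checksum(normalised_query):
    return hashlib.sha256(normalised_query.encode("utf-8")).hexdigest()

def result_checksum(rows):
    # Rows must already be in a stable, unambiguous order (R5).
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

class QueryStore:
    """Hypothetical in-memory query store."""

    def __init__(self):
        self.entries = []

    def register(self, query, rows, timestamp):
        normalised = normalise_query(query)
        q_sum = query_checksum(normalised)
        r_sum = result_checksum(sorted(rows))  # stable sorting before hashing
        # R8: reuse the PID if an identical query returned an identical result.
        for entry in self.entries:
            if entry["query_checksum"] == q_sum and entry["result_checksum"] == r_sum:
                return entry["pid"]
        pid = "pid:" + uuid.uuid4().hex  # placeholder for a real PID service
        # R9: store the query together with its metadata.
        self.entries.append({
            "pid": pid,
            "query": query,
            "normalised_query": normalised,
            "query_checksum": q_sum,
            "result_checksum": r_sum,
            "timestamp": timestamp,  # R7
        })
        return pid
```

In this sketch, re-issuing the same query against unchanged data returns the existing PID, whereas the same query against changed data yields a new entry and a new PID.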
C. Upon Request of a PID
- R11 – Landing Page: PIDs should resolve to a human readable landing page of the data set, which provides metadata including a link to the superset (PID of the data source) and citation text snippet.
- R12 – Machine Actionability: The landing page should be machine-actionable and allow retrieving the data set by re-executing the timestamped query.
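One possible way to serve the information named in R11 and R12 in machine-readable form is sketched below. The field names, the dictionary-based query store and the function names `build_landing_page` and `resolve` are assumptions made for this example; the recommendations do not prescribe a particular format.

```python
import json

def build_landing_page(entry):
    # Collect the metadata a landing page is expected to expose.
    page = {
        "pid": entry["pid"],
        "superset_pid": entry["superset_pid"],  # PID of the data source
        "citation": entry["citation"],          # recommended citation text
        "query": entry["query"],                # query to re-execute
        "timestamp": entry["timestamp"],        # state of the data to query
        "result_checksum": entry["result_checksum"],
    }
    return json.dumps(page, sort_keys=True)

def resolve(pid, query_store):
    # query_store is assumed to be a dict mapping PID -> stored entry.
    entry = query_store.get(pid)
    if entry is None:
        raise KeyError("unknown PID: " + pid)
    return build_landing_page(entry)
```

A client can parse the returned JSON, re-execute the query against the data as of the given timestamp, and compare the result with the stored checksum.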
D. Upon Modifications to the Data Infrastructure
- R13 – Technology Migration: When data is migrated to a new representation (e.g. new database system, a new schema or a completely different technology), the queries and associated checksums need to be migrated.
- R14 – Migration Verification: Successful query migration should be verified by ensuring that queries can be re-executed correctly.
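The verification in R14 can be approximated by re-executing every stored query on the migrated system and comparing result checksums. The sketch below assumes that query store entries carry `pid`, `query`, `timestamp` and `result_checksum` fields, and that the caller supplies an `execute` function for the new system; all of these names are hypothetical.

```python
import hashlib

def result_checksum(rows):
    # Checksum over rows in a stable order; must match the pre-migration scheme.
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def verify_migration(stored_queries, execute):
    """Return the PIDs whose re-executed result no longer matches.

    stored_queries: iterable of query store entries (dicts).
    execute: callable (query, timestamp) -> list of rows on the migrated system.
    """
    failures = []
    for entry in stored_queries:
        rows = execute(entry["query"], entry["timestamp"])
        if result_checksum(rows) != entry["result_checksum"]:
            failures.append(entry["pid"])
    return failures
```

An empty list indicates that every stored query still reproduces its original result set. If the migration changes how rows are represented, the stored checksums need to be migrated as well (R13) before such a comparison is meaningful.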
We invite all stakeholders to comment on the recommendations so that the wording can be finalized. The aim is to ensure that all aspects of precisely identifying arbitrary subsets of data in potentially highly dynamic environments are properly addressed, enabling subsequent citation and re-use.
The Recommendations Article: Details and Discussion
In addition to the flyer, we present the recommendations and their effects on existing research data infrastructures in more detail in the article "Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use". This article is available as a draft version below and will appear in the next issue of the Bulletin of IEEE Technical Committee on Digital Libraries (TCDL). The article mirrors the structure of the 2-page flyer and provides more details for each individual recommendation.
You can find the most recent version of both documents in the file repository of this working group.
- Two-page flyer: download
- TCDL article: Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Pröll. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of the IEEE Technical Committee on Digital Libraries, 12(1), 2016. download