Dear all,
This is absolutely correct! In a nutshell, we need to differentiate
between two aspects, namely (1) **identifying** the precise
subset/collection of a – potentially changing – dataset, and (2) the
actual information making up a citation to that dataset.
For (1), this WG has come up with an answer, a single principle that so
far seems to work across all types of data and all approaches to
implementing a data repository, whereas for (2) some recommendations can
be made, but these will ultimately depend strongly on the domain, the
type of data and its use.
1) For the identification we again have two sub-challenges, namely (a)
the evolution of data (new data being added, errors corrected, …) and
(b) the identification of arbitrary subsets. (1a) is solved by versioning
all changes to the data, whereas (1b) is addressed by resolving any subset
dynamically via a reproducible operation (referred to in the guidelines
as a query) that was executed at a certain timestamp; both the query and
the timestamp need to be stored and associated with a persistent identifier.
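As a minimal sketch of how (1a) and (1b) interact – using a hypothetical schema and Python with SQLite, not any particular reference implementation – one can version every change by recording insertion and deletion timestamps per row, and then re-create a subset by filtering with the timestamp at which the stored query was executed:

```python
import sqlite3

# Hypothetical versioned table: every change (1a) is an insertion and/or a
# deletion timestamp; nothing is ever overwritten in place.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE measurements (
    id INTEGER, value REAL,
    inserted_at TEXT, deleted_at TEXT)""")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [(1, 3.5, "2020-01-01", None),          # still valid
     (2, 9.9, "2020-01-01", "2020-06-01"),  # erroneous value, later corrected
     (2, 4.2, "2020-06-01", None),          # the corrected value
     (3, 7.1, "2021-01-01", None)])         # added after our query ran

# The stored, timestamped query (1b): the state of the data as of this date.
as_of = "2020-07-01"
rows = conn.execute(
    """SELECT id, value FROM measurements
       WHERE inserted_at <= ?
         AND (deleted_at IS NULL OR deleted_at > ?)
       ORDER BY id""", (as_of, as_of)).fetchall()
print(rows)  # → [(1, 3.5), (2, 4.2)]: row 3 did not yet exist, the error is gone
```

Re-executing the same query with the same timestamp against the same versioned table reproduces the identical subset, which is what makes it safe to attach a persistent identifier to the query rather than to a materialized copy.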
This could be the check-out from a Git repository at a certain point in
time, a “list-directory” command against a versioned file
system, or an SQL query against a temporal table; we have seen
solutions using slice/dice operators against a NetCDF file, and it could
– to use the audio example referred to before – even be a pointer to a
specific offset in an audio file (or, e.g., a 30-second segment starting at
minute 1 for all audio files with a 44 kHz sampling rate in the
collection, as used to be done for music retrieval
experiments). This also works across distributed repositories, as each
repository only needs to keep the queries it processed, together with
their timestamps, locally, without any need to synchronize clocks. An
aggregator would then simply store the individual PIDs of the responses
from a federated query.
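The bookkeeping each repository needs is quite small. As a hypothetical sketch (class and PID format are illustrative, not a prescribed interface): a query store records the query, its execution timestamp, and a hash of the result set, and assigns a PID that later resolves back to those three things:

```python
import hashlib
import uuid
from datetime import datetime, timezone

class QueryStore:
    """Hypothetical sketch of a local query store: a subset PID resolves
    to the stored query, the execution timestamp, and a hash of the
    result set, so the subset can be re-created (and verified) against
    the versioned data at any later point."""

    def __init__(self):
        self._records = {}  # pid -> (query, timestamp, result_hash)

    def register(self, query, result_rows):
        # Normalizing and hashing the result set lets a later
        # re-execution be checked for identity with the original.
        result_hash = hashlib.sha256(
            "\n".join(sorted(map(repr, result_rows))).encode()
        ).hexdigest()
        timestamp = datetime.now(timezone.utc).isoformat()
        pid = "pid:" + uuid.uuid4().hex  # stand-in for a real DOI/handle
        self._records[pid] = (query, timestamp, result_hash)
        return pid

    def resolve(self, pid):
        """Return (query, timestamp, result_hash) for a registered PID."""
        return self._records[pid]
```

Because each timestamp is only ever interpreted against that repository's own versioned data, repositories never need to synchronize clocks with one another; an aggregator merely collects the PIDs returned by the individual stores.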
2) Concerning the actual citation text, we may want to re-visit that
topic to see how much more specific we can get while still making sure
that any recommendation works across possibly all types of data and
domains. Currently, we have stayed with a very limited set of metadata,
borrowing from the analogy of citations to literature and recommending the
use of two identifiers: one for the (continuously evolving) data source,
and one for the specific subset extracted from it at a given point in
time (analogy: a specific, static paper identified e.g. via a DOI, within
an evolving journal that grows as new editions are added, identified
e.g. via an ISSN). The creator of the subset may be compared
to the author of a paper, whereas the owner/operator of the data source
may be likened to the editor of proceedings – but any such mapping will
already differ quite a lot across repositories and types of data, so it
is not part of the general recommendations – beyond the statement that
each data center should provide a recommended way of phrasing/expressing
citation. This may well be worth picking up, now that we have a better
understanding of the identification and resolution process, if we feel
that this core set can be extended.
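To make the two-identifier recommendation concrete, a data center's recommended citation phrasing might combine the two PIDs roughly as follows – the template, names, and DOIs here are entirely hypothetical, since the actual phrasing is deliberately left to each data center:

```python
def format_citation(creator, year, subset_label, subset_pid,
                    operator, source_title, source_pid):
    # Hypothetical template combining the two recommended identifiers:
    # one for the timestamped subset, one for the evolving data source.
    # The creator/operator roles mirror the paper-author / proceedings-
    # editor analogy from the discussion above.
    return (f"{creator} ({year}): {subset_label} [{subset_pid}]. "
            f"In: {operator}, {source_title} [{source_pid}].")

citation = format_citation(
    "J. Doe", 2016, "Subset of station measurements, as of 2016-03-01",
    "doi:10.1234/subset-xyz",   # hypothetical subset PID
    "Example Data Center", "Continuously Updated Station Archive",
    "doi:10.1234/source-abc")   # hypothetical source PID
print(citation)
```

Any such mapping of roles to citation fields will, as noted, differ across repositories and data types, which is why only the two-identifier core is part of the general recommendations.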
The pre-print version of the paper we have prepared on the recommendations,
as well as the reference implementations and deployed adoptions, could be
useful for reviewing these principles in different settings:
best regards,