Plenary 3 - WG Data Citation: Making Dynamic Data Citeable Session

1100-1230 on Friday 28 March
Session chair(s) including email address(es):
      Andreas Rauber
      Ari Asmi
      Dieter van Uytvanck
Being able to reliably and efficiently cite entire datasets or subsets of data in large and dynamically growing or 
changing datasets constitutes a significant challenge for a range of research domains. Several 
approaches for assigning PIDs to support data citation at different levels in the process have been 
proposed. While these may suffice in settings where small and/or static datasets are concerned (such as 
assigning PIDs to the entire dataset or to individual data items), these do not provide sufficient flexibility 
in dynamic, high-volume data settings. The RDA Working Group on Data Citation (WG-DC) aims to bring 
together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing 
approaches for efficiently citing subsets of data. Our concept includes machine-actionable data citations 
that are efficient and can be applied transparently. The goal is to ensure that each state and subset of 
data can be uniquely identified in the face of data being added, deleted or otherwise modified in a 
database, across longer periods of time, even when data is being migrated from one DMS to another. 
Currently discussed principles include the following aspects:
- Ensuring that data items added to a data collection are added in a 
manner that is time-stamped
- Ensuring that the data collection is versioned, i.e. changes/deletions 
to the data are marked as changed with validity timestamps
- PIDs are assigned to the query/expression identifying a certain subset 
of the data that one wishes to cite, with the query being time-stamped as well
- Hash keys may be computed for the selection result to allow subsequent 
verification of identity of the results returned
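
The four principles above can be illustrated with a small sketch. This is not the WG's reference implementation: the in-memory store, the logical clock standing in for real timestamps, the named query registry, and the `pid:` string format are all assumptions made for illustration. In practice the PID would come from a PID service and the data would live in a real database.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordVersion:
    key: str
    value: dict
    valid_from: int                 # timestamp at which this version was added
    valid_to: Optional[int] = None  # set when the version is changed or deleted


class VersionedStore:
    """Hypothetical store: nothing is overwritten, old versions are only closed."""

    def __init__(self):
        self.versions = []
        self.clock = 0  # logical clock (assumption: stands in for wall-clock time)

    def _close(self, key, ts):
        for v in self.versions:
            if v.key == key and v.valid_to is None:
                v.valid_to = ts

    def upsert(self, key, value):
        self.clock += 1
        self._close(key, self.clock)
        self.versions.append(RecordVersion(key, dict(value), self.clock))

    def delete(self, key):
        self.clock += 1
        self._close(key, self.clock)

    def as_of(self, ts):
        """Return the versions that were valid at timestamp ts, ordered by key."""
        live = [v for v in self.versions
                if v.valid_from <= ts and (v.valid_to is None or ts < v.valid_to)]
        return sorted(live, key=lambda v: v.key)


# Hypothetical query registry; a real system would store the query text itself.
QUERIES = {"warm": lambda v: v.value["temp"] > 20}


@dataclass(frozen=True)
class Citation:
    pid: str
    query_name: str
    timestamp: int
    result_hash: str


def hash_result(rows):
    payload = json.dumps([[v.key, v.value] for v in rows], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run(store, query_name, ts):
    return [v for v in store.as_of(ts) if QUERIES[query_name](v)]


def cite(store, query_name):
    """Time-stamp the query, hash its result, and attach a (made-up) PID."""
    ts = store.clock
    rows = run(store, query_name, ts)
    return Citation(f"pid:{query_name}@{ts}", query_name, ts, hash_result(rows))


def resolve(store, citation):
    """Re-execute the cited query against the state at citation time."""
    rows = run(store, citation.query_name, citation.timestamp)
    if hash_result(rows) != citation.result_hash:
        raise ValueError("result set differs from the cited one")
    return rows


store = VersionedStore()
store.upsert("a", {"temp": 25})
store.upsert("b", {"temp": 18})
citation = cite(store, "warm")

# The collection keeps changing after the citation was made.
store.upsert("b", {"temp": 30})
store.delete("a")

cited_keys = [v.key for v in resolve(store, citation)]
current_keys = [v.key for v in run(store, "warm", store.clock)]
```

Because changes and deletions only close a validity interval, resolving the citation re-creates the subset as it was when cited, while the same query against the current state returns a different result.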
Links to additional reading material:
The Case Statement of the WG is available at
- Case statement Discussion: shaping the activities of the WG
- Moving ahead with pilots: conceptual walk-through
- Venue/contributors for joint paper on initial pilots
- Schedule / preferred timing for upcoming web meetings
Meeting minutes
by Stefan Pröll

Andreas introduced the agenda of the session and highlighted what has been accomplished so far. He recapped the core principles of the dynamic data citation approach and gave an overview of the six existing pilots, which can be found on the RDA Web site. The use cases should be used to check the fitness of the approach. The WG on Data Citation (WGDC) aims to provide actual implementations; the time frame is 18 months from now.

He outlined the phases that we want to work through within this working group: first validate the concepts by collecting as many use cases as possible, then try to apply them to real pilots. The goals are to identify requirements, which will be derived from (new) pilots, and to develop a reference architecture and guidelines for dynamic data citation. Andreas encouraged all participants to engage and to provide use cases for which we can develop methods that enable the precise citation of subsets of data.

Carlos asked whether our model already considers schema evolution and stressed that changes need to be handled at this level as well. Andreas acknowledged the importance of such concepts and explained that migrating queries to new schemata is essential for technology independence. Legal constraints might also require the deletion of specific details, so the framework needs to be able to handle such events as well. Therefore not only the data but also the schema needs to be versioned. We have to use the pilots to detect such limitations of our approach, because there can be a gap between theory and practice.

It was noted that in any case we require the goodwill of the data providers who will implement the framework in the end. Without their support the proposed solution will not work.

Use cases proposed by the workshop participants

Carlos proposed atomic data as a use case. Researchers submit their query to the system, and a node replies with the unique result, which contains the actual location of the data. So far there are no versions or schema changes, but there is a need to extend the system with these features. The queries used in this system can be quite complex and return millions of records.

Stephen proposed the Google book collection, which is handled at his institution. Currently the collection at the University of Illinois has OCR scans of 3.9 billion books. Most of the data is under copyright and may not leave the building. The scans are improved over several iterations, hence versioning is important.

Hans showed slides about the "Earth System Science Data" (ESSD) journal, which maintains links between data sets and journal articles. Data sets can be simple (e.g. Excel) but also complex. Articles in the digital journal can be improved iteratively and thus versioning is required. The articles are referenced with a single DOI, but each article can point to several data files. The journal does not store the data itself, but links to a landing page. The data is stored at reliable repositories. Hans raised the question of how the completeness and correctness of the data set can be verified. He proposed internal or even external hash service providers. He stressed that journals do not want the responsibility for the data.
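
As a rough illustration of the hash-based verification Hans raised, the sketch below records a digest for a data file and later checks a retrieved copy against it. The sample bytes and function names are made up for the example, and SHA-256 is an assumption; the minutes do not name a hash function or a particular hash service.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Hash a binary stream in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(stream, recorded_digest):
    """Compare a freshly computed digest with the one recorded at publication."""
    return sha256_of_stream(stream) == recorded_digest


# Made-up file content standing in for a data file linked from a landing page.
original = b"station,temp\nA,25\nB,18\n"
recorded = sha256_of_stream(io.BytesIO(original))

intact = is_unchanged(io.BytesIO(original), recorded)
truncated = is_unchanged(io.BytesIO(original[:-5]), recorded)
```

A matching digest shows that the retrieved bytes are identical to the ones hashed at publication time, which covers both truncation and modification. It says nothing about whether the original data was scientifically correct.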

Rob mentioned that credit for data is very important to scientists and that complex questions arise when derived data sets are considered. Also, when articles refer to data sets, usually only the article author gets recognition for the work, but not the data producer. Ari clarified that there are two purposes for data citation: the first is finding the data, the second is providing credit. Andreas responded that the solution needs to be agnostic about the purpose of the citation; there are other working groups dealing with bibliometrics etc. We can't weigh the credit-giving mechanisms and therefore need to support all potential uses, but we are not interested in the semantics of these credits.

Steve proposed his use case of asynchronously changing CSV files that can be updated anytime. The data is highly dynamic.

John presented the dynamic research data which they use at the research centre. They manage data from various disciplines and changes occur quite often. Credit is also an important topic, and citing specific versions of the data is very difficult. Researchers need to know which version is the latest, but they also require access to earlier versions and need to know which previous versions exist. Data citation needs to be data driven and data centric. The research data centre handles many different data types in various formats such as RDF.

Andreas encouraged the participants to provide their use cases even if similar settings have been proposed already. There are many different usages of data and the implementations can be diverse. Even if identical data sets are considered, there might be a difference in how the data is used in different environments. Although we are not primarily interested in questions regarding credit, landing pages etc., we are looking for synergies with other working groups.

Roger described his use case, in which time series and time-relevant sensor data are used. Researchers can submit their data, but now they need versioning, subsets etc., which also need to be citable.

Andreas showed the pilots page and provided an overview of the existing use cases. The participants of the workshop should use the template provided there to contribute their own use cases to the community of this WG. The more use cases we have, the easier it is to create an overview of the existing technologies, requirements and needs. This allows us to derive generic methods. We need to find more concrete examples. We should then meet in even smaller groups, where people with similar use cases exchange ideas for each scenario. Having small but focussed face-to-face bi-/trilateral meetings would be highly beneficial. Details can be discussed much more effectively when people meet in person.

Parinaz mentioned that NoSQL databases are not yet considered. She proposed a use case where meteorological data is stored in file systems but the metadata is stored in a MongoDB instance. It was agreed that NoSQL and file-system-based data stores will also be considered.

The question of which PID system we as a group would choose was brought up. Andreas replied that, again, we are agnostic about any specific system and neutral in the choice of the PID mechanisms used. Still, we can collect our experience with the systems and provide a wiki page with our ideas. In general it seemed to be a good idea to create wiki pages for subtopics such as PIDs, hashing, schema versioning etc.

Rob introduced his use case with nano publications, which is a bottom-up approach based on single RDF statements. In this use case, these statements do not change and are combined into new data sets from the bottom up. During the discussion it was agreed that the query-centric top-down approach and the nano publications approach both address the same issue and conceptually describe the same problem. There are always queries involved which can be used, no matter how fine-grained the data is. The idea of migrating everything to LOD and then making that citable was also considered.

Dieter mentioned the CLARIN research project and described the PID approach used.

It was agreed that we need as many use cases as possible and that participants will provide short descriptions of their use cases on the wiki. Sarah and John might have resources for prototype development.

Stephen suggested using an old data set and starting from that, even if there are no new updates.

It was agreed that whatever approach we develop, it needs to be easy to use for researchers.

Andre proposed a use case with aurora images, operational and experimental data.

Organisational Issues

We will mainly use our email list and telcos. A three-week schedule with alternating starting times was proposed. For now, Wednesday was agreed on, with one morning slot (7:30 GMT, 8:30 CET, 18:30 Sydney) and one evening slot (17:00 GMT, 18:00 CET, 11:00 New York, 09:00 San Francisco), but this is not yet fixed. Alternative proposals also included a 514-hour meeting interval (:-)). The final decision will be published here.