The case statement outlines our work and defines the focus and boundaries of our research.
We need to integrate all stakeholders and reflect their views accordingly. So far we have identified four stakeholder groups that will actually use our contributions:
- Data providers – data will be reused
- Solution providers – machine readable data citations
- Researchers – receive citable results
- Community – gains trust and transparency
The beneficiaries will be able to reuse data, reproduce experiments, provide machine-readable and machine-actionable data citations for complex data sets, and trace their data and its usage.
Being able to reliably and efficiently cite entire datasets, or subsets of them, when the data is large and dynamically growing or changing constitutes a significant challenge for a range of research domains. Several approaches for assigning persistent identifiers (PIDs) to support data citation at different levels in the process have been proposed. These range from individual PIDs assigned to individual data elements to PIDs assigned to queries executed on time-stamped and versioned databases.
Based on the discussions at the First Plenary Meeting in Gothenburg, the formation of an RDA Working Group on Data Citation (WG-DC) was initiated. The WG-DC aims to bring together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing subsets of data. The WG-DC focuses on a narrow field where we can contribute significantly and provide prototypes and reference implementations. Several data citation initiatives already exist, each with its own advantages and special purposes. An overview of these standards and their best practices was published by the CODATA Task Group on Digital Data Curation. Strong cooperation with existing initiatives is required: CODATA, OpenAire, DataCite, W3C, Open Annotation Coalition and the related standards.
Our concept includes machine actionable data citation that is efficient and can be applied transparently. We will be looking at different types of data and database management systems, including:
- SQL-style databases
- XML databases / semi-structured databases
- Graph-based databases
- NetCDF files
- HDF5 files
The goal is to ensure that subsections of data can be uniquely identified in the face of data being added, deleted or otherwise modified in a database, across longer periods of time, even when data is being migrated from one DBMS to another. We want to discuss and evaluate different existing approaches to this challenge, assess their advantages and shortcomings, identify obstacles to their deployment in different settings, and develop concrete recommendations for the deployment of prototypes within existing data centers. Among other things, these should subsequently form a solid basis for citing data and linking to it from publications in an actionable manner.
Dynamic data citation tackles challenges of versioning and the proper definition of subsets of data in different domains. Potential issues concern the relations between data sets, which need to be captured as well. Other challenges are scalability, the trade-off between costs and benefits of ownership, and operations that are potentially not reversible. This WG concentrates on the technical aspects of data citation solutions, focusing on proof of concept and prototype implementations. It will collaborate with other RDA working groups on PIDs and other topics under the umbrella of the Interest Group on Data Publication.
The principle currently proposed includes the following aspects:
- Ensuring that data items added to a data collection are added in a manner that is time-stamped
- Ensuring that the data collection is versioned, i.e. changes/deletions to the data are marked as changed with validity timestamps
- PIDs are assigned to the query/expression identifying a certain subset of the data that one wishes to cite, with the query being time-stamped as well
- Hash keys are computed for the selection result to allow subsequent verification of identity
- Issues such as unique sorting of results need to be considered when the operation returns data as sets and subsequent processes depend on the sequence in which the data is provided
These should be working across all settings where we have a combination of data sources and operations identifying subsets at specific points in time.
We propose a three-stage plan consisting of solutions (short-term), plans (mid-term) and the future perspective (long-term).