Working Group On Data Citation - Case Statement Proposal

CASE STATEMENT PROPOSAL v.1/ RDA WORKING GROUP ON DATA CITATION 5/8/2013 6:38:46 PM

 
1. Contents
2. WG Charter 
2.1 Short-term goals (M12)
2.2 Mid-term goals (M18) 
2.3 Long-term goals (> M18) 
2.4 Timeframe
3. Value Proposition 
3.1 Individuals, communities, and initiatives that will benefit from the RDA WG on Data Citation
3.2 Key impacts of the RDA Data Citation initiative 
4. Engagement with existing work in the area 
5. Work Plan 
5.1 Work plan components 
5.2 WG-DC operation 
6. Initial Membership 
7. References 
8. Appendix A 
 
2. WG Charter
This case statement outlines our work and sets the focus and boundaries of our research. We need to involve all stakeholders and reflect their views accordingly. So far we have identified four stakeholder groups that will use our contributions:
  •  Data providers – data will be reused
  •  Solution providers – machine readable data citations
  •  Researchers – receive citable results
  •  Community – gains trust and transparency
The beneficiaries will be able to reuse data, reproduce experiments, provide machine-readable and machine-actionable data citations for complex data sets, and trace their data and its usage.
Being able to reliably and efficiently cite entire datasets or subsets of data in large and dynamically growing or changing datasets constitutes a significant challenge for a range of research domains. Several approaches for assigning persistent identifiers (PIDs) to support data citation at different levels in the process have been proposed. These range from PIDs assigned to individual data elements to PIDs assigned to queries executed on time-stamped and versioned databases.
Based on the discussions at the First Plenary Meeting in Gothenburg, the formation of a Working Group on Data Citation (WG-DC) was initiated. The WG-DC aims to bring together a group of experts to discuss the issues, requirements, advantages and shortcomings of existing approaches for efficiently citing subsets of data. The WG-DC focuses on a narrow field where we can contribute significantly and provide prototypes and reference implementations. Several data citation initiatives already exist, each with its own advantages and special purposes. An overview of these standards and their best practices was published by the CODATA Task Group on Digital Data Citation [1]. Strong cooperation with existing initiatives and the related standards is required, in particular with CODATA, OpenAire, DataCite, W3C and the Open Annotation Coalition.
Our concept includes machine actionable data citation that is efficient and can be applied transparently. We will be looking at different types of data and database management systems, including:
  •  SQL-style databases
  •  XML databases / semi-structured databases
  •  Graph-based databases
  •  NetCDF files
  •  HDF5 files
  •  …
The goal is to ensure that subsets of data can be uniquely identified even as data is added, deleted or otherwise modified in a database, across longer periods of time, and even when data is migrated from one DBMS to another. We want to discuss and evaluate existing approaches to this challenge, assess their advantages and shortcomings, identify obstacles to their deployment in different settings, and develop concrete recommendations for the deployment of prototypes within existing data centers. Among other things, these should subsequently form a solid basis for citing data and for linking to it from publications in an actionable manner.
Dynamic data citation tackles the challenges of versioning and of properly defining subsets of data in different domains. The relations between data sets are a potential issue and need to be captured as well. Other challenges are scalability, the trade-off between costs and benefits of ownership, and operations that are potentially not reversible. This WG concentrates on the technical aspects of data citation solutions, focusing on proof of concept and prototype implementations. It will collaborate with other RDA working groups on PIDs and other topics under the umbrella of the Interest Group on Data Publication.
The principle currently proposed includes the following aspects:
  • Ensuring that data items added to a data collection are added in a manner that is time-stamped
  • Ensuring that the data collection is versioned, i.e. changes/deletions to the data are marked as changed with validity timestamps
  • PIDs are assigned to the query/expression identifying a certain subset of the data that one wishes to cite, with the query being time-stamped as well
  • Hash keys are computed for the selection result to allow subsequent verification of identity
  • Issues such as unique sorting of results need to be considered when the operation returns data as sets and subsequent processes depend on the sequence in which the data is provided
These principles should work across all settings in which data sources are combined with operations that identify subsets at specific points in time.
We propose a three-stage plan consisting of solutions (short-term), plans (mid-term) and the future perspective (long-term).
2.1 Short-term goals (M12)
  • Evaluation of recommended data citation approaches for specific scenarios
  • Selecting pilot candidates in cooperation with stakeholders/data owners
  • Detailed planning of technical aspects of data citation approach preparing for implementation
  • Develop a model for data citation within databases that remains valid in the long term, along with initial prototypes and demonstrators.
2.2 Mid-term goals (M18)
  • Develop a set of reference implementations of the data citation model for selected pilot data types.
  • Evaluate developed prototypes
  • Establish consensus on a universal data citation model that can be implemented independently from specific systems or vendors.
2.3 Long-term goals (> M18)
  • Seek official endorsement for data citation within the RDA community and foster the application of identified standards.
  • Become the contact point for data citation questions and provide the community with best practices and know-how.
2.4 Timeframe
The following figure depicts the sequence of the three phases. All of them will be accompanied by networking and intense feedback loops between the WG-DC, the data providers, the solution providers and the community. Collaboration with other initiatives such as CODATA, DataCite, W3C, the Open Annotation Coalition and others will be maintained throughout all three phases.
 
3. Value Proposition
Digitally driven research is a rather young discipline that evolves fast. As a result, the tools and the data are rarely developed with long-term use in mind. What matters most to researchers are fast results and prompt publications. Whether the data they produce today can be understood, interpreted or even accessed in the future is hardly considered these days. Yet only if results can be reproduced precisely can the validity of research experiments and business processes be judged, evaluated and verified in a machine-actionable manner that is scalable and can cope with dynamically changing data. Hence there is a strong need for data citation mechanisms that allow portions of large data sets to be identified precisely.
An additional challenge within the area of research data is the requirement to cite evolving data reliably. Researchers need to be able to reference data that is subject to change. Hence mechanisms are required that allow researchers to cite data exactly as they used it during a particular experiment. When the data is updated, modified or deleted, these changes must be reflected by the citation system as well. Time-based data is therefore an important factor. The ability to specify subsets and derived data is also a requirement. Being able to identify, reference, share and distribute specific subsets encourages reuse amongst researchers. The easier and more transparently this citation process can be implemented, the higher the acceptance among the target audience and the designated community will be. We will provide proofs of concept, mockups and prototype implementations that can be tested and used by the community. We want to go beyond theoretical work and deliver real-world applications for our models. In an optimal setting, a researcher selecting a subset of data for an experiment will be issued a PID that allows others to retrieve the same data set again.
The international orientation and interdisciplinary nature of the RDA community provide input from a wide range of areas, enabling broad research and application of the results of the WG-DC. The participation of researchers, data providers, solution providers and the community allows expert knowledge from various perspectives to be integrated. This direct access to domain experts boosts the development of new standards within the area of data citation and allows improvement via direct feedback loops. Collaboration with other initiatives in the field is also a key concern of this WG.
3.1 Individuals, communities, and initiatives that will benefit from the RDA WG on Data Citation
  • Researchers: by being able to cite their data
  • Database developers: by reducing redundancy
  • Digital preservation managers: by being able to retrieve subsets of data
  • Data centers: by enhancing reuse of existing resources
  • Data managers and data scientists: having tools for referencing subsets in dynamic data environments
  • Professional societies: by reproducing experiments
  • Publishers: by encouraging reuse and an increased level of trust
  • Repositories, data archives: by being able to reference, cite and retrieve subsets
  • Software tool developers: by allowing transparent implementation of data citation capabilities
3.2 Key impacts of the RDA Data Citation initiative
  • Provide the know-how for data citation of partial datasets from dynamic data sources
  • Enhance reproducibility of research results by allowing peers to re-execute experiments
  • Facilitate discovery, access, and reference of large data sets
  • Provide a reference model for dynamic data citation
  • Enhance digital preservation of data sets and their reference
 
4. Engagement with existing work in the area
  •  CODATA
  •  OpenAire
  •  DataCite
  •  W3C
  •  Open Annotation Coalition
  • and others
 
5. Work Plan
5.1 Work plan components
1. Analysis of requirements and the selection of candidate solutions (months 1-6)
In the beginning phase we will consult existing work in the area of data citation and study available best practices [1]. We will evaluate recommended data citation approaches for specific scenarios and select pilot candidates. The selection of these candidates will be in close cooperation with stakeholders and data owners. Detailed planning of the technical aspects of data citation approaches will guide us during the implementation.
2. Defining the reference model (months 4-12)
We will develop a technical reference model for data citation in relational database systems. This model should be open, extensible and implementation agnostic. It defines how research data must be structured in order to allow the citation of subsets.
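As an illustration only, the following sketch shows one possible shape for the citation record that such a model might keep alongside the data. The class `CitationRecord`, its fields, the helper functions and the PID string are hypothetical and not part of any agreed model. The result hash is computed over a canonically sorted form of the rows, so that verification does not depend on the order in which a particular DBMS happens to return them.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class CitationRecord:
    pid: str          # hypothetical identifier string, not a real PID scheme
    query: str        # query/expression that selects the cited subset
    executed_at: int  # timestamp at which the query was executed
    result_hash: str  # SHA-256 over the canonicalised result set

def hash_result(rows):
    # Serialise every row, then sort the serialised rows so that the
    # hash is independent of the order in which rows were returned.
    canonical = sorted(json.dumps(list(row)) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def cite(pid, query, executed_at, rows):
    # Build the record that would be stored when a subset is cited.
    return CitationRecord(pid, query, executed_at, hash_result(rows))

def verify(record, rows):
    # Check that a re-executed query returned the same subset.
    return record.result_hash == hash_result(rows)

rows = [("B", 2.0), ("A", 1.0)]
record = cite(
    "example-pid-1",
    "SELECT station, value FROM measurements",
    15,
    rows,
)

# Same subset in a different order still verifies.
assert verify(record, [("A", 1.0), ("B", 2.0)])
# A changed value is detected.
assert not verify(record, [("A", 9.0), ("B", 2.0)])
```

Keeping only the query, its timestamp and the hash keeps the record small regardless of the size of the cited subset. The cost is that the underlying database must be able to reproduce its historical state.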
3. Improve and test the model iteratively (months 8-15)
The model developed in the previous step will be evaluated against suitable data sets of considerable size from various research disciplines. The consortium partners will be asked to provide their real-world research data sets as a testbed upon which the model can be evaluated. This will follow an iterative approach in order to allow improvements.
After the model has been carefully tested, a reference architecture for citable research data will be implemented. This reference implementation should be based on open source software so that all participating partners can use and improve it. Hence this phase provides a second feedback loop for the iterative improvement and testing phase. The implementation has to be generic and flexible enough to be adapted to various purposes. An official release will follow the implementation and iterative improvement phase.
4. Promotion of the RDA Data Citation Model and Reference Implementation (months 12-18)
Promotion activities will include widespread dissemination of the data citation model. The ready-to-use reference implementation will be accompanied by substantial documentation and use case scenarios in order to increase acceptance and encourage contributions.
 
5.2 WG-DC operation
  • Form and description of final deliverables
Short-term deliverables include reference models for selected types of datasets (e.g. SQL, XML …), proof-of-concept deployments and guidelines for making data citable.
  • Milestones
The four steps outlined in the previous section can be used for setting milestones. All phases will be supported by multi-channel dissemination activities and community outreach. This includes a wiki, a developer forum and mailing lists.
 
6. Initial Membership
Leadership (brief biographic notes in Appendix A):
- Chair: Andreas Rauber, Vienna University of Technology & SBA
- Co-chair: Reagan Moore, UNC Chapel Hill
- Co-Chair: Dieter van Uytvanck, MPI
Members/Interested: (based on BoF Session Participation and individual nominations)
- Daan Broeder, MPI
- Hans Pfeiffenberger, Alfred Wegener Institute for Polar and Marine Research
- Peter Wittenburg, Max Planck Institute
- Jeroen Rombouts, TU Delft
- Joachim Wambsganss, Uni Heidelberg
- Ilya Zaslavsky, UCSD
- JuanLe Wang, Chinese Academy of Sciences
- Robert H. McDonald, Indiana University
- Emily Grumbling, NSF
- Diana Hendrickx, Maastricht University
- Stefan Pröll, SBA Research
- Patricia Cruse, DataCite
- Martina Stockhause, WDCC/DKRZ
- Christoph Becker, TU Wien
- Ari Asmi, Uni Helsinki
- Natalia Manola, University of Athens
- Constantino Thanos, ISTI-CNR
- Volker Boehlke, Uni Leipzig
- Thomas Eckart, Uni Leipzig
- Paul Uhlir, National Academy of Sciences
- Yannis Ioannidis, University of Athens
- Shih-Chieh Ilya Li, Academia Sinica/CODATA Taipei
- Jane Hunter, University of Queensland
 
7. References
[1] CODATA Task Group on Digital Data Citation. Best Practices: Research & Analysis Results. CODATA, 2012.
[2] S. Pröll (Ed.). Position statements submitted to the BoF Session on Data Citation at the Research Data Alliance - Launch and First Plenary, Gothenburg, Sweden, March 18-20, 2013. http://forum.rd-alliance.org/download/file.php?id=90
[3] S. Pröll and A. Rauber. Minutes of the BoF Session on Data Citation at the Research Data Alliance - Launch and First Plenary, Gothenburg, Sweden, March 18-20, 2013. http://forum.rd-alliance.org/download/file.php?id=110
 
8. Appendix A
Leadership Biographical Notes
 
Andreas Rauber is Associate Professor at the Department of Software Technology and Interactive Systems (ifs) at the Vienna University of Technology (TU-Wien). He is furthermore president of AARIT, the Austrian Association for Research in IT, and an Honorary Research Fellow in the Department of Humanities Advanced Technology and Information Institute (HATII), University of Glasgow. He received his MSc and PhD in Computer Science from the Vienna University of Technology in 1997 and 2000, respectively. In 2001 he joined the National Research Council of Italy (CNR) in Pisa as an ERCIM Research Fellow, followed by an ERCIM Research position at the French National Institute for Research in Computer Science and Control (INRIA), at Rocquencourt, France, in 2002. From 2004 to 2008 he was also head of the iSpaces research group at the eCommerce Competence Center (ec3). In 1998 he received the ÖGAI Award of the Austrian Society for Artificial Intelligence (ÖGAI), and in 2002 the Cor-Baayen Award of the European Research Consortium for Informatics and Mathematics (ERCIM). He has published numerous papers in refereed journals and international conferences and served as PC member and reviewer for several major journals, conferences and workshops. He is a member of the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE) and the Austrian Society for Artificial Intelligence (ÖGAI). He serves on the board of the IEEE Technical Committee on Digital Libraries (TCDL), and was a member of the DELOS Network of Excellence on Digital Libraries as well as the MUSCLE Network of Excellence on Multimedia Understanding through Semantics, Computation and Learning. His research interests cover the broad scope of digital libraries and information spaces, including specifically text and music information retrieval and organization, information visualization, as well as data analysis, neural computation and digital preservation.
Reagan Moore
Dieter van Uytvanck