Repository Platforms for Research Data
25 September 2015 - BREAKOUT 9 - 14:00
Stefan Kramer SKramer@american.edu
Ralph Pfefferkorn firstname.lastname@example.org (absent)
David Wilcox email@example.com (absent)
Eric Maris firstname.lastname@example.org The Donders Institute
Reagan Moore email@example.com
Natalie Meyers firstname.lastname@example.org http://library.nd.edu/cds
Thomas Jejkal email@example.com
Original agenda https://rd-alliance.org/ig-repository-platforms-research-data-p6-meeting-session.html
The goal of this meeting is for group members to meet face-to-face and work toward several of our already-established goals, including:
* Reviewing related work within RDA
* Reviewing related work outside RDA
* Reviewing submitted use cases
* Identifying sources for additional use cases
* Planning next steps
The following is a draft agenda for the meeting:
* Introductions (10 mins)
* Related document review and discussion (20 mins)
** Reviewers summarize their reviews
** Group asks questions and notes any relevant info for group activities
* RDA group liaison discussion (20 mins)
** Liaisons summarize activities of related groups
** Group asks questions and notes any relevant info for group activities
* Use case discussion (30 mins)
** Initial use cases are presented (preferably by those who submitted them)
** Sources for additional use cases
** Discuss any issues with the template, collection procedures, wording/clarity, etc.
* Next steps (10 mins)
** Who will volunteer to be an editor?
Agenda point: Introduction
Stefan started out with a brief review of the history of the IG: it was structured as a Working Group when formed, but due to RDA registration issues it was approved as an Interest Group. The group nevertheless still has a case statement and a detailed description.
End of group: one year from now (as originally proposed)
Expected output: a matrix of requirements and use cases to be utilized and referenced by repository developers, researchers, and repository managers.
Question: What types of repositories are being investigated, in terms of disciplines?
Answer: We prepared four presentations from different disciplines for today, and six use cases have been submitted to the IG.
Agenda point: Related document review and discussion
Expectation from the review: whether the repositories adhere to data terminology.
Expected usage of the functional requirements produced by the IG: whether the requirements of the examined repositories overlap.
Agreement on the exchange object between repositories at a technical level - first step: specification of interfaces → could relate to data packaging.
Agenda point: Reviewers summarize their reviews
Agenda point: RDA group liaison discussion
Agenda point: Use case discussion
Six submitted use cases are available on the group website; four of them were presented at the session.
The RDM life cycle developed by the Donders Institute RDM project defines a protocol describing how collections must be built. Collections are categorized into three sets: the Data Acquisition Collection (DAC), the Scientific Integrity Collection (SIC), and the Data Sharing Collection (DSC). Each collection has a different clearance level.
Data collection starts the moment data are produced by the machines; the data enter the DAC, the first-level collection, which includes the raw data files and the metadata pertaining to them. Data generated during the scientific analysis process are recorded in the SIC. Finally, after a paper is accepted, the data that the journal asks to be shared enter the DSC.
The objective of data collection is data preservation -- data in the broad sense, not just data produced by the machines; simulated data and scripts are also included.
The SIC makes use of data from the DAC. By the time the SIC starts, the DAC is frozen and closed with a PID (this is machine-actionable). The SIC is closed when the publication is accepted. The DSC is exposed to the general public; the DAC and SIC stay internal (they may contain sensitive information).
Data collection is an individual project for the researcher, ending in a data publication. [Notes: see slides.]
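The three-stage collection life cycle above (DAC frozen with a PID before the SIC starts, SIC closed on publication) can be sketched as a small state check. This is a hedged illustration only: the class, method names, and the PID value are assumptions, not the Donders implementation.

```python
from enum import Enum

class Stage(Enum):
    DAC = "data acquisition"      # raw data + metadata, filled at acquisition time
    SIC = "scientific integrity"  # analysis outputs, built from a frozen DAC
    DSC = "data sharing"          # journal-requested data, exposed to the public

class Collection:
    """Sketch of a collection in the DAC -> SIC -> DSC life cycle."""
    def __init__(self, stage: Stage):
        self.stage = stage
        self.frozen = False
        self.pid = None

    def freeze(self, pid: str):
        """Close the collection and assign a machine-actionable PID."""
        self.frozen = True
        self.pid = pid

def start_sic(dac: "Collection") -> "Collection":
    # Per the protocol above, an SIC may only draw on a DAC that is
    # already frozen and identified by a PID.
    if dac.stage is not Stage.DAC or not dac.frozen or dac.pid is None:
        raise ValueError("SIC requires a frozen, PID-identified DAC")
    return Collection(Stage.SIC)
```

For example, calling `start_sic` on an open DAC fails, while freezing the DAC with a (placeholder) PID first allows the SIC to be created.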
Genetic data are to be shared; an agreement must be signed for sharing, and there is a data use agreement for non-sensitive data. Access control to the data is established; non-sensitive data can be accessed with a Google/Facebook account.
Question: Is the metadata available? Answer: Metadata will be exposed to crawlers, using controlled metadata vocabularies from four disciplinary domains.
Question: Are privacy concerns the driver for the collection types? Answer: Metadata for the data sharing collection will be exposed; for the other collections, metadata sharing is recommended.
The Data Intensive Cyber Environments group (at UNC) has worked with a wide range of research repositories over the last two decades. The goal of the group is to understand the generic requirements of managing research data. Requirements are difficult to pin down because the collections of data that researchers are interested in evolve over time, and the demands on repositories evolve with them. Most research institutions start with a project collection, where the project participants have a deep understanding of how the data are organized, what the metadata mean, what the formats are, etc. Problems arise when they broaden their research approach and include projects and researchers from other institutions in a shared collection: they then have to make some of the tacit knowledge explicit (to ensure that people on the other side can use the data without having to call them on the phone). If you publish your data, you need to meet the standards of your discipline; if you build a processing pipeline, you need to provide mechanisms for data manipulation; and when you archive the data, you have to ensure long-term reusability and retrievability.
We work with groups that are going through these transitions. All partners we have worked with had to evolve their context and services and to restructure and reorganize their data -- we have to work in a highly dynamic environment. We use policies -- computer-actionable rules that the system enforces. These rules determine workflows, which can be changed dynamically, so you can modify the environment over time to execute additional tasks.
As the users of a data collection broaden, we provide mechanisms for the communities to gain the context they need to understand the original collection.
There are currently 200 performable operations and 300 types of status information, which together constitute a broad context within which the information is managed.
The mechanism has been adopted by a variety of projects of different sizes and disciplines. Each project has different policies for data types and data access.
The context associated with a collection evolves as its user community broadens.
Policies are computer-actionable rules.
The collection is the persistent item; the system flows through it.
Application environments range widely; all manage distributed data.
Requirements across projects [slide]: a generic infrastructure that enables these multiple types of operations and policies to be applied.
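The idea above -- policies as computer-actionable rules that are attached to system events and drive workflows, and that can be changed without modifying the repository core -- can be sketched minimally as follows. This is an illustrative sketch, not the presented system's rule engine; the event names and rule registry are assumptions.

```python
import hashlib

# Rules are plain functions registered against system events, so the
# workflow can be modified dynamically by adding or swapping rules.
rules = {}  # event name -> list of rule functions

def policy(event):
    """Decorator registering a computer-actionable rule for an event."""
    def register(fn):
        rules.setdefault(event, []).append(fn)
        return fn
    return register

def fire(event, ctx):
    """Run every rule attached to an event, in registration order."""
    for rule in rules.get(event, []):
        rule(ctx)

@policy("on_ingest")
def checksum_on_ingest(ctx):
    # Example rule: record a fixity checksum when data enter a collection.
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()

@policy("on_ingest")
def tag_collection(ctx):
    # Example rule: default new items into the project collection.
    ctx.setdefault("collection", "project")
```

The point of the design is that adding a new task (e.g. replication or metadata extraction) is just registering another rule on the event, leaving the core untouched.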
Question: What is the UX like, and how can developers apply it?
A: Everything happens backstage; the UX doesn't change.
Question: when did it start?
A: Development started in 1994; the policy basis started in 2006.
Motivation: Observations produce huge data sets, the data processing software is still under development, and there is a lack of experience in what to analyse.
Needs: A platform for storing, processing, and interpreting data that allows sharing, workflow annotation, result comparison, and keeping track of provenance information (the algorithms and parameters used in the analysis process).
Architecture: [diagram on slide]
Requirements: PIDs, quality control, data processing, data policy, data reuse, discipline-specific microservices, APIs [see slide for specifics].
Question: The volume of data is a challenge → is there any way to record only the changes in a dataset, so you don't have to copy the 200 TB every time?
A: The data sets are all different. Metadata creation for dynamic data citation counterbalances the gain in reproducibility; it takes too much time to reconstruct the data sets. (Natalie)
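The question above amounts to delta-based versioning: store one full snapshot, then record only per-version changes, and reconstruct any cited version by replaying the deltas instead of copying the full dataset each time. A minimal sketch, assuming a dataset can be modeled as a mapping of file names to content versions (none of this reflects the presented system):

```python
def apply_delta(base: dict, delta: dict) -> dict:
    """Apply one delta (upserted entries and deletions) to a snapshot,
    returning a new state and leaving the snapshot untouched."""
    out = dict(base)
    out.update(delta.get("upsert", {}))
    for key in delta.get("delete", []):
        out.pop(key, None)
    return out

def reconstruct(snapshot: dict, deltas: list) -> dict:
    """Rebuild the dataset state as of the last recorded delta."""
    state = snapshot
    for d in deltas:
        state = apply_delta(state, d)
    return state
```

The trade-off raised in the answer is visible here: storage per version is small, but citing an old version costs a replay of every intervening delta.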
Agenda Point: Issues with the template, collection procedures, wording/clarity, etc.
The blank template and the submitted use cases are available on the RDA IG site.
You don’t have to have a solution to submit a use case, as long as there is a need.
Next steps: To volunteer as an editor for the matrix of requirements, contact one of the co-chairs of the IG.