2015-09-25 RPRD P6 Meeting

Repository Platforms for Research Data

25 September 2015 - Breakout 9 - 14:00


Stefan Kramer SKramer@american.edu

Ralph Müller-Pfefferkorn ralph.mueller-pfefferkorn@tu-dresden.de (absent)

David Wilcox dwilcox@duraspace.org (absent)



Eric Maris e.maris@donders.ru.nl The Donders Institute

Reagan Moore rwmoore@renci.org

Natalie Meyers natalie.meyers@nd.edu http://library.nd.edu/cds

Thomas Jejkal thomas.jejkal@kit.edu


Original agenda https://rd-alliance.org/ig-repository-platforms-research-data-p6-meeting-session.html

The goal of this meeting is for group members to meet face-to-face and work toward several of our already-established goals, including:


* Reviewing related work within RDA

* Reviewing related work outside RDA

* Reviewing submitted use cases

* Identifying sources for additional use cases

* Planning next steps


The following is a draft agenda for the meeting:


* Introductions (10 mins)

* Related document review and discussion (20 mins)

** Reviewers summarize their reviews

** Group asks questions and notes any relevant info for group activities

* RDA group liaison discussion (20 mins)

** Liaisons summarize activities of related groups

** Group asks questions and notes any relevant info for group activities

* Use case discussion (30 mins)

** Initial use cases are presented (preferably by those who submitted them)

** Sources for additional use cases

** Discuss any issues with the template, collection procedures, wording/clarity, etc.

* Next steps (10 mins)

** Who will volunteer to be an editor?

Agenda point: Introduction

Stefan started out with a brief review of the history of the IG: it was structured as a WG when formed, but due to RDA registration issues it was approved as an IG; the group still has a case statement and a detailed description.

  • End of group: one year from now (as originally proposed)

  • Expected output: a matrix of requirements and use cases to be utilized and referenced by repository developers, researchers, and repository managers.

Question: Which types of repositories are being investigated, in terms of disciplines?

Answer: Four presentations from different disciplines are prepared for today, and six use cases have been submitted to the IG.

Agenda point: Related document review and discussion

  • Expectation from the review: whether the repositories adhere to common data terminology

  • Expected use of the functional requirements produced by the IG: whether the examined repositories' requirements overlap

  • Agreement on the exchange object between repositories at a technical level -- first step: specification of interfaces → could relate to data packaging.

Agenda point: Reviewers summarize their reviews

Group asks questions and notes any relevant info for group activities

Agenda point: RDA group liaison discussion (20 mins)

  • Add two other RDA WGs as liaisons (see action items at the end of the notes).

Liaisons summarize activities of related groups

  • Liaisons not present

Use case discussion

Six submitted use cases are available on the group website; four of them were presented at the session.

  • Eric: The Donders Institute RDM Project [Slides]

The data lifecycle developed by the Donders Institute RDM project defines a protocol that describes how collections must be built. Collections are categorized into three types: the Data Acquisition Collection (DAC), the Scientific Integrity Collection (SIC), and the Data Sharing Collection (DSC). Each collection type has a different clearance level.

Data collection starts the moment data are produced by the machines; the data enter the DAC, the first-level collection, which includes the raw data files and the metadata pertaining to them. Data generated during the scientific analysis process are recorded in the SIC. Finally, after the paper is accepted, the data the journal asks to be shared enter the DSC.

The objective of data collection is data preservation -- data in the broad sense, not just data produced by the machines -- simulated data and scripts are also included.

The SIC makes use of data from the DAC. By the time the SIC starts, the DAC is frozen and closed with a PID (this is machine-actionable). The SIC is closed when the publication is accepted. The DSC is exposed to the general public; the DAC and SIC stay internal (they may contain sensitive information).
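The freeze-and-PID step can be sketched as a toy model (illustrative only; the class, method, and PID names below are invented for this sketch, not the Donders implementation):

```python
# Toy model of the DAC -> SIC -> DSC lifecycle: a collection accepts
# items until it is frozen and closed with a persistent identifier.
# All names here are hypothetical.

class Collection:
    def __init__(self, kind):
        self.kind = kind      # "DAC", "SIC", or "DSC"
        self.items = {}       # file name -> content
        self.frozen = False
        self.pid = None

    def add(self, name, content):
        if self.frozen:
            raise RuntimeError(f"{self.kind} is frozen; no further changes allowed")
        self.items[name] = content

    def freeze(self, pid):
        """Close the collection and attach a persistent identifier."""
        self.frozen = True
        self.pid = pid


# Raw data enter the DAC; when analysis (the SIC) starts,
# the DAC is frozen and gets a PID.
dac = Collection("DAC")
dac.add("scan001.raw", b"...")
dac.freeze("hdl:21.11101/example-dac")   # hypothetical handle

sic = Collection("SIC")
sic.add("analysis.py", "...")            # scripts and derived data
```

Once frozen, any further `add` on the DAC raises an error, which is the machine-actionable guarantee the notes describe.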

  • Data collection is an individual project for the researcher, which ends in a data publication. [notes: see slides]

  • Genetic data are to be shared. An agreement must be signed for sharing; there is a data use agreement for non-sensitive data. Controlled data access is established. Non-sensitive data can be accessed with a Google/Facebook account.

  • Question: is the metadata available? Answer: metadata will be exposed to crawlers; controlled metadata vocabularies cover four disciplinary domains.

  • Question: are privacy concerns the driver for the collection types? Answer: metadata for the data sharing collection will be exposed; for the other collections, metadata sharing is recommended.

  • Reagan: Repository Use Case - Policy-Oriented RDM System

The Data Intensive Cyber Environments (DICE) group (at UNC) has worked with a wide range of research repositories over the last two decades. The goal of the group is to understand the generic requirements of managing research data. Requirements are difficult to pin down because the collections of data researchers are interested in evolve over time, so the demands on repositories evolve with them. Most research institutions start with a project collection, where the participants of the project have a deep understanding of how the data are organized, what the metadata mean, what the formats are, etc. But problems arise when they broaden their research approach and bring projects and researchers from other institutions into a shared collection: they have to make some of the tacit knowledge explicit (to ensure that the people on the other side will be able to use the data without having to call them on the phone). If you publish your data, you need to meet the standards of your discipline; if you build a processing pipeline, you need to provide mechanisms for data manipulation; when you archive the data, you have to ensure long-term reusability and retrievability.

We work with groups that are going through these transitions; all partners we have worked with had to evolve their context and services and restructure and reorganize their data -- we have to work in a highly dynamic environment. We use policies -- computer-actionable rules that the system enforces. These rules determine workflows, which can be changed dynamically, so the environment can be modified over time to execute additional tasks.
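The idea of computer-actionable rules attached to repository events can be sketched as follows (loosely in the spirit of policy enforcement points in systems like iRODS; the event names, rules, and functions here are invented for illustration, not a real API):

```python
# Minimal sketch of policy enforcement: rules are registered against
# repository events, and the repository fires the event at each
# enforcement point. Swapping rules changes the workflow dynamically
# without touching repository code. All names are hypothetical.

audit_log = []   # record of what the rules actually did
policies = {}    # event name -> list of registered rules

def on(event):
    """Register a rule to run when the given event fires."""
    def register(rule):
        policies.setdefault(event, []).append(rule)
        return rule
    return register

def fire(event, **context):
    """Called by the repository at each policy enforcement point."""
    for rule in policies.get(event, []):
        rule(**context)

@on("after_ingest")
def replicate(path, **_):
    audit_log.append(("replicate", path))

@on("after_ingest")
def record_checksum(path, **_):
    audit_log.append(("checksum", path))

# Ingesting a file triggers both rules in registration order.
fire("after_ingest", path="/projects/demo/scan001.raw")
```

Adding or removing a rule at runtime is all it takes to change the workflow, which is the dynamic behavior the notes describe.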

As the users of the data collection broaden, we provide mechanisms for the communities to gain the context they need to understand the original collection.

Currently there are 200 performable operations and 300 types of status information -- together they constitute a broad context within which the information is managed.

The mechanism has been adopted by a variety of projects of different sizes and disciplines. Each project has different policies for data types and data access.

  • context associated with a collection evolves as the user community broadens

  • policies are computer-actionable rules

  • the collection is the persistent item; systems flow through it

  • application environments range widely; all manage distributed data

  • requirements across projects [slide]: a generic infrastructure that enables these multiple types of operations and policies to be applied

  • independently manage storage

  • access controls

  • microservices -- continue to evolve

  • Question: what is the UX like, and how can developers apply it?

  • A: everything happens backstage; the UX doesn't change.

  • Question: when did it start?

  • A: development started in 1994; the policy basis started in 2006.

  • Natalie: VecNet Digital Library

    • a digital library that enables data simulation, data model access, and reproducible projects.

    • a digital library designed for input/output data to facilitate interdisciplinary collaboration. How: let people share data in a common environment, curate data and code snippets, make data from different models comparable, and tag geographic citations to literature.

    • Software: Fedora with Hydra on top to support specific workflows, Solr for indexing, with extensions based on disciplinary use.

    • Emphasizes that in this context the repository enables data as a research object that is shared, communicated, and integrated into the research process, rather than treated only as a final research output.

  • Thomas: Nanoscopy Open Reference Data Repository (NORDR)

  • Motivation: observations produce huge data sets, the data processing software is still under development, and there is a lack of experience in what to analyse.

  • Needs: a platform for storing, processing, and interpreting data that allows sharing, workflow annotation, result comparison, and keeping track of provenance information (the algorithms and parameters used in the analysis process).

  • Architecture: [diagram on slide]

  • Requirements: PID, Quality control, data processing, data policy, data reuse, disciplinary specific microservice, APIs [see slide for specifics]

    • Question: the volume of data is a challenge → is there any way to record only the changes in a dataset, so you don't have to copy the 200 TB every time?

    • A: the data sets are all different. Metadata creation for dynamic data citation counterbalances the gain in reproducibility; it takes too much time to reconstruct the data sets (Natalie)
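The "record only the changes" idea raised in the question can be sketched as an append-only change log from which any cited version is reconstructed on demand, rather than storing a full copy per version (a simplified sketch of the idea behind dynamic data citation; the class, keys, and values below are invented for illustration):

```python
# Sketch of change-only versioning: each update stores one delta
# record (version, key, value). Citing a version means replaying
# deltas up to that version, so no full copies are ever made.
# All names and numbers here are hypothetical.

class VersionedDataset:
    def __init__(self):
        self.changes = []   # append-only list of (version, key, value)

    def put(self, version, key, value):
        """Record a single changed record for a given version."""
        self.changes.append((version, key, value))

    def snapshot(self, version):
        """Reconstruct the dataset as of the given version."""
        state = {}
        for v, key, value in self.changes:
            if v <= version:
                state[key] = value
        return state

ds = VersionedDataset()
ds.put(1, "sample_A", 0.91)
ds.put(1, "sample_B", 0.47)
ds.put(2, "sample_A", 0.93)   # version 2 stores only the changed record

# Citing "version 1" reconstructs the old state without a full copy.
old = ds.snapshot(1)
```

The trade-off discussed in the answer shows up here: storage stays proportional to the changes, but reconstructing a snapshot means replaying the log, which can become expensive for very large data sets.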

Agenda Point: Issues with the template, collection procedures, wording/clarity, etc.

  • The blank template and the submitted use cases are available on the IG's RDA site.

  • You don't have to have a solution for a use case to submit a form; a need is sufficient.

  • Next steps: to volunteer as an editor for the matrix of requirements, contact one of the co-chairs of the IG.


Action items

  • Group video conference in mid-October