The objective of the workshop is to exchange information about current practices in using PIDs and PID systems and to draw best practice conclusions at the end. In one of the discussions during the RDA EU Science Workshop, participants stated the urgent need for such a workshop (in combination with a training course) to disseminate basic knowledge.
The workshop aims to bring together people from various disciplinary backgrounds to discuss advanced uses of PIDs, sharpen understanding and disseminate current practices. To this end, exemplary use cases will be presented and discussed. Plenty of time will be reserved for discussion.
The workshop will start on 1 September 2016 (13.00) and end on 2 September (16.00).
Link to the main page of the PID events.
Both events will be organised at the Max Planck Compute and Data Facility (MPCDF) in Garching/Munich (Germany). There is no registration fee. Lunch and dinner as well as travel and accommodation costs are self-paid.
The workshop will be devoted to submitted papers on PID use cases that are accepted by the Programme Committee (PC). The PC will select a number of advanced use cases that have proven their practicability. These will get more time for their presentation. Each session will have sufficient time for Q&A and comments. At the end we will first ask one experienced expert to present highlights from the training course and the presentations, and then invite a panel to discuss these highlights and other issues. The audience will be included in the panel discussion. The essential results and outcomes of the workshop will be summarised before the workshop concludes.
Ari Asmi (ENVRI), Geoff Builder (CROSSREF), Jonathan Clark (IDF), Massimo Cocco (EPOS), Sünje Dallmeier-Thiessen (CERN, Thor), Wolfram Horstmann (DARIAH, LIBER), Larry Lannom (CNRI), Stéphane Rondenay (University of Bergen), Laura Rueda (DataCite), Tibor Kalman (GWDG), Pavel Stranak (CLARIN), Tobias Weigel (DKRZ, ENES), Peter Wittenburg (RDA)
Participation is free of charge, but subject to on-line registration. If you would like to attend the event but do not intend to submit an abstract, please register here https://rd-alliance.org/views-pid-systems-training-course-and-workshop-registration
The final report on the PID training course and workshop is available here.
Thursday 1. September
Friday 2. September
Ulrich Schwardmann, GWDG
Building and Maintaining a Registry for PID Info Types:
The general structure is that a PID InfoType is either a so-called Basic PID InfoType or is built out of other PID InfoTypes and Basic PID InfoTypes in a structured way. The presentation will give an overview of how this is implemented. Validating the syntactic correctness of tabular data, a widespread data format, is a first simple and easily established example of how users benefit from this concept.
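The composition idea can be illustrated with a minimal Python sketch. It is not the registry's implementation or data model: the class names `BasicType` and `CompositeType`, the helper `validate_table` and the example types are assumptions made up for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


def _is_integer(text: str) -> bool:
    """Syntax check for an integer literal (hypothetical basic type)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_decimal(text: str) -> bool:
    """Syntax check for a decimal literal (hypothetical basic type)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


@dataclass
class BasicType:
    """A leaf type: a name plus a syntax check on a string value."""
    name: str
    check: Callable[[str], bool]

    def validate(self, value) -> bool:
        return isinstance(value, str) and self.check(value)


@dataclass
class CompositeType:
    """A type built out of named member types (basic or composite)."""
    name: str
    members: Dict[str, Union["BasicType", "CompositeType"]]

    def validate(self, value) -> bool:
        # The value must provide exactly the declared members.
        if not isinstance(value, dict) or set(value) != set(self.members):
            return False
        return all(t.validate(value[key]) for key, t in self.members.items())


def validate_table(column_types: List[BasicType], rows: List[List[str]]) -> List[int]:
    """Return the indices of rows that do not match the column types."""
    bad_rows = []
    for index, row in enumerate(rows):
        ok = len(row) == len(column_types) and all(
            t.validate(cell) for t, cell in zip(column_types, row)
        )
        if not ok:
            bad_rows.append(index)
    return bad_rows


INTEGER = BasicType("integer", _is_integer)
DECIMAL = BasicType("decimal", _is_decimal)
MEASUREMENT = CompositeType("measurement", {"station_id": INTEGER, "value": DECIMAL})
```

Under these assumptions, `validate_table([INTEGER, DECIMAL], rows)` reports the rows of a table whose cells are not syntactically valid for their column type.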
Jozef Misutka, Institute of Formal and Applied Linguistics
SHORTREF.ORG - citing URLs:
We will present an easy-to-cite and persistent infrastructure (www.shortref.org) for research and data citation in the form of a URL shortener service.
Reproducibility of results is essential for building on existing research, and it directly depends on the availability of the research data. Advances in web technologies have made redistributing data much easier. However, due to the dynamic nature of the web, content is constantly moving from one location to another. The URLs that researchers use to cite their content do not account for these changes, and users who try to access a cited URL often find that the data is no longer available or has moved to a newer version. In our proposed solution, the shortened URLs are not simple URLs but are backed by persistent identifiers. They provide a reliable mechanism for keeping the data accessible, which can directly improve the impact of research.
In the presentation, we will discuss the technology, the workflow and the advantages of using PIDs instead of using URLs directly.
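The benefit of citing through a PID rather than a plain URL can be sketched in a few lines of Python. This is only an illustration of the indirection, not the implementation behind www.shortref.org: the class `ShortRefSketch`, the base URL `https://example.org/` and the identifier `hdl:00.0000/example-1` are invented for the example.

```python
class ShortRefSketch:
    """Toy resolver: short key -> PID -> current location."""

    def __init__(self, base="https://example.org/"):
        self.base = base
        self._key_to_pid = {}   # short key -> persistent identifier
        self._pid_to_url = {}   # persistent identifier -> current location

    def register(self, pid, url):
        """Store the current location of a PID and return a short URL for it."""
        self._pid_to_url[pid] = url
        key = format(len(self._key_to_pid) + 1, "x")  # simple counter-based key
        self._key_to_pid[key] = pid
        return self.base + key

    def move(self, pid, new_url):
        """Record that the resource behind a PID has a new location."""
        if pid not in self._pid_to_url:
            raise KeyError(pid)
        self._pid_to_url[pid] = new_url

    def resolve(self, short_url):
        """Follow short URL -> PID -> current location."""
        if not short_url.startswith(self.base):
            raise ValueError("not a short URL of this service: " + short_url)
        key = short_url[len(self.base):]
        return self._pid_to_url[self._key_to_pid[key]]
```

In this sketch the cited short URL never changes: when the data moves, only the PID record is updated with `move`, and `resolve` returns the new location.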
Hainaut, Bordelon, Grothkopf, Fourniol, Micol, Retzlaff, Sterzik, Stoehr [ESO]; Harry Enke, Kristin Riebe [AIP]
DOI usage in Astronomical Data Centers - the ESO archive and the AIP data center:
ESO, the European Southern Observatory, is the intergovernmental astronomy organisation in Europe. ESO operates several observing sites in Chile: La Silla, with two 4-m class telescopes, and Paranal, with the Very Large Telescope and two survey telescopes. ESO is a major partner in ALMA, the largest interferometric millimetre radio telescope, on Chajnantor. On Armazones, ESO is building the 39-metre European Extremely Large Telescope, the E-ELT. All data produced by ESO’s telescopes are stored as raw data in the Science Archive Facility and accessed through it, together with a variety of processed datasets. The archive currently holds 40 million files corresponding to 700 TB of data, characterized by 30 billion metadata rows (with a full back-up at MPCDF).
The Leibniz Institute for Astrophysics Potsdam (AIP) is one of Germany’s astronomical data centers, providing cosmological data, survey data and data of digitized astronomical photographic plates.
ESO will generate DOIs for specific, well-defined data sets:
• All raw data produced by an observing program run (which is the “quantum” in which ESO allocates the telescope time to projects, with well-defined start and end dates)
• All “Science-Grade Data Products” published as “data releases”, i.e. well-defined sets of processed frames, corresponding for instance to the results of an astronomical survey programme.
ESO also plans to offer its users a “DOI generator” in which they can upload a list of frame identifiers, along with a series of metadata describing the dataset (e.g. their name, affiliation, purpose of the dataset, etc.), in order to make specific datasets citable and retrievable.
AIP currently uses DOIs for the data releases of the RAVE survey, i.e. database tables with ~0.5 million rows, and for CosmoSim, a cosmological database of simulated data with tables of up to 10 billion rows. For the digitized plates archive, DOIs are also used to identify the images as Cultural Heritage Objects within Europeana.
The presented material will
• Provide details on 6 different “real-life” implementations of DOIs for different types of dataset from ESO and AIP
• Discuss the infrastructure requirements for the data centers
• Discuss the limitations and constraints on the data in the context of the RDA recommendations
Florian Krämer, Marius Politze, Dr. Dominik Schmitz, RWTH Aachen University
Empowering the usage of persistent identifiers (PIDs) in local research processes by providing a service and integration infrastructure
A university typically has a rather diverse infrastructure. While there are essential central services such as identity management, financial management, an e-learning infrastructure or student lifecycle management, research-related services are often realized at the institute level. This is essential in particular if specialized machines, hardware and software are pivotal to the research. Nonetheless, research data management in such situations can benefit from centralized services as well, as long as these can be easily integrated into the individual researcher’s infrastructure. Most commonly, backup and archiving as well as publication services are offered centrally. In our understanding, services that provide and resolve persistent identifiers (PIDs) can also be offered centrally, as long as suitable measures are taken to integrate them into the researcher’s local infrastructure.
We focus on adopting the persistent identifier concept very early, already during the creation and use of research data, as introduced by Peter Wittenburg’s data fabric [Witt15]. Registering a PID early, even when it is not yet clear whether the associated data will be valuable, might seem wasteful. But this approach is useful in particular if the early adoption allows for more automation or fits better into the researchers’ working processes. For example, it might be much easier to capture important metadata if they can be fetched easily from the machine at the time of creation or if they can be collected from the user manually while she or he is waiting for analysis results.
In this contribution we present our concept for a solution that accommodates the requirements mentioned above. At RWTH Aachen University we decided to take part in the EPIC infrastructure (http://www.pidconsortium.eu/) by using the PID service offered by GWDG Göttingen. To embed the infrastructure into our local identity management, we chose to hide the original application programming interface (API) behind a local implementation within the existing REST and OAuth-based infrastructure, which has access to organizational attributes and services. In particular, we integrated basic support for metadata management by allowing users to choose and adapt RDF-based metadata schemata that are then associated with the organization the user comes from. Furthermore, we limited the publicly available metadata of a PID to the minimum in order to accommodate privacy concerns. The RESTful service is easy to use from external code, thus enabling local efforts to embed its usage in institute-specific processes and infrastructures. Using standard JSON Web Tokens, the OAuth framework provides easy-to-use rights management for changes to a PID’s metadata.
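A rough Python sketch of such a facade is given below. It is not the RWTH Aachen service or the EPIC API: the class `PidFacade`, the scope name `pid:write`, the claim `org` and the example identifier are assumptions for illustration, and the token claims are assumed to have been verified by the OAuth layer before they reach this code.

```python
# Only these fields are exposed publicly (assumed minimal set).
PUBLIC_FIELDS = ("pid", "url")


class PidFacade:
    """Local layer in front of a PID service, holding records in memory."""

    def __init__(self):
        self._records = {}

    def create(self, claims, pid, url, metadata):
        """Register a PID; owner and organization come from the token claims."""
        self._records[pid] = {
            "pid": pid,
            "url": url,
            "owner": claims["sub"],   # standard JWT subject claim
            "org": claims["org"],     # hypothetical organization claim
            "metadata": dict(metadata),
        }

    def public_view(self, pid):
        """Return only the minimal public part of a record."""
        record = self._records[pid]
        return {name: record[name] for name in PUBLIC_FIELDS}

    def update_metadata(self, claims, pid, changes):
        """Apply metadata changes if the token holder is allowed to make them."""
        record = self._records[pid]
        scopes = claims.get("scope", "").split()
        is_owner = claims.get("sub") == record["owner"]
        same_org = claims.get("org") == record["org"]
        if not (is_owner or (same_org and "pid:write" in scopes)):
            raise PermissionError("token does not permit changes to " + pid)
        record["metadata"].update(changes)
        return dict(record["metadata"])
```

The design choice illustrated here is that callers never see the upstream API: the facade decides which fields are public and derives ownership and organization from the token rather than from user input.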
As a first application, we have extended our basic archiving service to enable the registration of a PID for an archive node. By referring to the publication identifiers within the metadata, it becomes possible to establish a link between publications and the unpublished but locally stored data that provide the foundation of the research work. We are currently working on two use cases for integrating the central PID service into local infrastructures.
In the workshop we would like to exchange ideas about approaches to integrate PIDs into diverse IT landscapes and discuss possible solutions as well as pitfalls of this approach.
[Witt15] Wittenburg, Peter: Data Foundation & Terminology WG. Data Fabric IG. Presentation at the RDA-DE-DINI Workshop “Aktuelle Resultate der Research Data Alliance (RDA) und deren zukünftige Bedeutung” 2015-05-28/29 in Karlsruhe, Germany. Online: http://www.forschungsdaten.org/images/f/fc/RDA-DE-2015_Wittenburg_Peter_... [Last access: 2016-07-08]
Wolfgang Kuchinke, Heinrich-Heine University Duesseldorf
Persistent Identifiers (PIDs) as Means to Improve Transparency of Clinical Research
Most clinical trial data is not openly accessible, and often not even published. Thus, a large portion of high quality human data is not available for reuse, reanalysis and meta-analysis. This lack of transparency of clinical research has serious implications for patient safety, for health providers and the health system in general.
Recently, the need for improved transparency of clinical trials was raised as part of several research transparency initiatives (e.g. OpenTrials, AllTrials, CTTI).
In this context, we have developed a concept for the consistent adoption of Persistent Identifiers (PIDs) for clinical trials that adapts to the clinical trials data life cycle. The aim is to improve the transparency of clinical trials, based on the fact that the persistent identification of digital resources can play a vital role in enabling their accessibility and re-usability. In our concept PIDs are assigned to all components of clinical trials: to patients, investigators, contractors, institutions, companies and sites, as well as to data sets, documents, software solutions, etc. PIDs are created and allocated as soon as the digital resources come into existence and accompany them along the entire life cycle until the final archiving of the TMF.
This approach will consider not only the allocation of clinical trial results, but also issues of their intellectual property, restrictions caused by commercial interests, the promotion of public health, the quality of trials and data, and the significance of data privacy protection. In detail, existing trial identifiers will be integrated and PIDs will be linked to relevant clinical trial metadata. PIDs are assigned not only to clinical trial datasets, according to their origin (CRF based, eSource, EHR based, lab data) and their data protection needs, but also to their generators (such as the investigator, the study nurse, the patient, or even a machine) and to the corresponding documents (e.g. study protocol, Statistical Analysis Plan, data management plan, informed consent).
Once fully developed and embedded into a trust-ensuring infrastructure, such a PID solution will enable the investigator to demonstrate how and by whom the datasets were collected and processed, whether the correct rights and permissions were available, whether the necessary documentation is available, whether the trial was GCP compliant and whether re-analysed data has been cited and referenced correctly. In this way, PIDs can support Open Research to some degree even for restricted clinical trials. We hope to discuss this concept in order to evaluate its feasibility and to improve it, so that it may be used for the development of the ECRIN clinical trials data repository.
Margareta Hellström, Lund University & ICOS RI
Credit where credit is due - PIDs, citations and bibliometry for research data collections
ENVRIplus (http://www.envriplus.eu), a Horizon2020 cluster project involving 20 European environmental and Earth science research infrastructures (RIs), recently surveyed its members’ requirements for developments and services within 9 ICT and data management-related topics. One of the surveyed topics was Data Identification and Citation, which is also the main focus of one of the ENVRIplus work packages.
The ENVRIplus survey responses showed that a majority of RIs find it absolutely necessary to ensure that credit for producing and managing scientific data sets is “properly assigned”, down to the level of the individual principal investigators (PIs) in charge of measurement and observation stations. This result is in line with many earlier studies which have shown that the perceived lack of proper attribution of data is a major reason for the hesitancy felt by many researchers to share their data openly.
There is reasonable confidence that identification and subsequent citation practices, for example assigning DOIs to data objects, will make it possible to trace and account for the usage of individual datasets. However, many RIs are concerned about how usage statistics for data collections can be fairly and correctly translated into “usage credit” for the data items that are members of such collections. They fear that, with the currently used data citation practices and bibliometry tools, it will be difficult or even impossible to trace the provenance of the collection items actually used back to their individual providers.
One of the main causes for this concern is the increasing pressure from policy makers and funding agencies on research groups and organizations to show that their data are not only being released under Open Access policies, but are also re-used, thus maximizing the benefit of public investments. Indeed, funding agencies are pushing towards increasingly open data policies, including re-distribution and commercial use of data. At the same time, the same usage statistics (mainly in the form of “citation metrics”) remain the basis for documenting the scientific merit that is paramount for scientists’ employment and stations’ funding.
Within ENVRIplus, a decision has been taken to set up a task force to investigate different methods for improving data usage metrics and statistics, not only for individual datasets but especially for collections of data sets. The goal is to have, within about 12 months’ time, a detailed “best practices” report that outlines advice both to data producers & RI repositories and to end users of the ENVRIplus RIs’ data. (The work will be closely coordinated with another ENVRIplus initiative aimed at setting up working demonstrators of the recommendations of the RDA Data Citations group.)
Robert Huber, Jens Klump, University of Bremen
How dead is dead in the PID zombie zoo?
Persistent identifiers (PIDs) were invented to address challenges arising from the distributed and disorganised nature of the internet, which not only allowed new technologies to emerge but also made it difficult to maintain a persistent record of science. This phenomenon, also dubbed “link rot”, affects all digital resources on the web, including research data.
To address this problem, several PID systems entered the research market with their promise of salvation. Their diagnosis was a missing separation of an object’s identity from its location on the web, and the medicine was to combine globally unique identifiers with a clever resolution system.
Web-based persistent identifiers have been around for more than 20 years, a period long enough for us to start observing patterns of success and failure. In our presentation we will give an overview of some commonly used PID systems and present some key indicators to estimate their present state of health, their trustworthiness, sustainability and persistence.
Based on these indicators, our analysis shows that unfortunately not every PID system is well managed today. PID zombies, meaning orphaned identifiers as well as orphaned PID systems, represent a major problem for the scientific community. We will present some prominent examples and an estimate of the dimension of the problem, and discuss possible means to revitalize dead PIDs.