#IDW2018 at Botswana: These are exciting times to get into research data management
My name is João Rocha da Silva and I am a researcher at INESC TEC and the Faculty of Engineering of the University of Porto (FEUP), Portugal. I had the privilege of being one of the recipients of the RDA Early Research Grants, which allowed me to attend International Data Week, held on 5-9 November 2018 in Gaborone, Botswana. International Data Week hosted both the 12th RDA Plenary and SciDataCon 2018.
Participation in the RDA
I have been involved in the RDA since 2017 and I am currently one of the co-chairs of the Repository Platforms for Research Data Interest Group, together with Ralph Müller-Pfefferkorn and Robert Downs. The RDA’s 12th Plenary was a great opportunity to meet the co-chairs of my Interest Group face to face. We usually meet online to manage the IG, but this time we had the opportunity to gather on the first day of the conference to host our own BoF session, with two proposals to launch new WGs related to research data repositories:
The “Policy Support and Enforcement within Repositories” WG proposal aims to gather requirements that help repository platform developers provide mechanisms for executing and validating institutional policies on research data management; for example, automating metadata validation, deposit steps or embargo periods.
The “Repository Interfaces for Data Analytics” WG proposal has another ambitious goal: to outline a set of recommendations that foster data interoperability across different repositories. Imagine you are a researcher who needs to fetch data from multiple repositories; it would be really cool to use the same data retrieval interface for all these sources, so that you can focus on what really matters: the algorithms you want to run on the data.
Interested? Join the RDA RPRD Group here and leave your mark on these recommendations!
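To make the second idea a little more concrete, here is a minimal Python sketch of what "the same data retrieval interface" could look like from the researcher's side. Everything in it is an assumption made for illustration: the DataSource protocol, the fetch method, the two stand-in repository classes and the sample values are hypothetical and do not come from any RDA recommendation or existing repository platform.

```python
from typing import Iterable, Protocol


class DataSource(Protocol):
    """Hypothetical common retrieval interface (illustration only)."""

    def fetch(self, dataset_id: str) -> list[float]:
        ...


class RepositoryA:
    """Stand-in repository that stores values as Python lists."""

    def __init__(self) -> None:
        self._store = {"temp-2018": [21.5, 22.0, 20.5]}

    def fetch(self, dataset_id: str) -> list[float]:
        return list(self._store[dataset_id])


class RepositoryB:
    """Stand-in repository that stores values as comma-separated text."""

    def __init__(self) -> None:
        self._store = {"temp-2018": "19.0,20.0"}

    def fetch(self, dataset_id: str) -> list[float]:
        return [float(v) for v in self._store[dataset_id].split(",")]


def mean_across(sources: Iterable[DataSource], dataset_id: str) -> float:
    """The 'algorithm': it only relies on fetch(), not on storage details."""
    values = [v for source in sources for v in source.fetch(dataset_id)]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(mean_across([RepositoryA(), RepositoryB()], "temp-2018"))
```

The analysis function never needs to know how each repository stores its data, which is the point of the proposal: the differences stay behind the shared interface.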
A self-assessment of FAIR compliance
As part of my participation in the plenary I also presented a poster, co-authored with Cristina Ribeiro, João Aguiar Castro, Joana Rodrigues and Nelson Pereira. The poster covers our proposal for a research data management workflow currently being implemented at the University of Porto and INESC TEC, and showcases a self-assessment of the workflow according to the FAIR principles.
At the conference
During this week I had the chance to see the amazing developments in the RDA community. From persistent identifiers to metadata, data interoperability and visualization, the scene is growing at a very fast pace.
On the first day of the conference, 5 November, I attended the “Persistent Identifiers in Action” joint meeting, where several applications for PIDs were discussed. You can find the full meeting notes here, but I would like to name a few:
Rolf Krahl presented PIDs for research instruments, i.e. identifying the actual machine or measurement instrument that produced a certain dataset. This helps with logistics and with tracking instrument use, and gives labs a way to get credit when their instruments are used by their own researchers or by others;
PIDs for individual samples gathered, helping to track the provenance of data down to the individual samples that were used in their production;
A proposal for a metadata schema for the records associated with these PIDs, inspired by the DataCite metadata schema;
Tom Demeranville presented some of the latest developments at ORCID, such as Assertion Origins, which will allow ORCID researcher profiles to record distinctions, invited positions, memberships, service and access to facilities;
Martin Fenner from DataCite gave the audience an update on the Organization Identifier Initiative, which aims to provide PIDs for organizations, allowing them to be linked to their members’ scientific production;
At the end of the session there was some time for debate, with Andrew Treloar stating that continued emphasis on the relationships between the entities identified by PIDs is paramount for realizing a valuable knowledge graph for research overall, and Martin Fenner stating that we should be looking at a distributed scenario instead of trying to build a single very large graph.
On the 7th, I attended the CODATA workshop on Data Integration for Science. Many relevant issues faced by data curators were discussed, of which I highlight:
Issues related to generic metadata: abstracts and subject keywords that do not follow any established vocabulary make it hard to process these records automatically, hindering the I in the FAIR principles, Interoperability. Also, metadata tends to be produced according to the “fitness for purpose” of a given dataset, tying it to the original application for which it was produced and limiting discoverability for other purposes. For example, when an X-ray is taken, people may describe it as a “Spine X-ray” because it was used to treat a spinal problem, but that X-ray can show more than just the spine; it can contain relevant information on other parts of the torso that others could use to study different scenarios;
Multilingual metadata: how to represent the same metadata in different languages (RDF would help solve this through language tags, but would require the metadata to be represented in RDF);
The Plinth project was presented as a basis for scalable, inter-disciplinary data usage, by querying heterogeneous sources of Linked Data, and adding semantic layers on top of existing data lakes to help with interoperability and discovery.
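As an illustration of the multilingual metadata point above, the following Python sketch shows how language tags let one property carry the same title in several languages. It uses the Turtle syntax for language-tagged literals ("text"@lang) and the Dublin Core Terms title property; the dataset URL, the sample titles and the two helper functions are hypothetical, made up for this example and not taken from the workshop.

```python
def to_turtle_literal(text: str, lang: str) -> str:
    """Format a language-tagged literal in Turtle syntax, e.g. "Dados"@pt."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def pick_title(titles: dict[str, str], preferred: list[str]) -> str:
    """Return the title in the first preferred language that is available."""
    for lang in preferred:
        if lang in titles:
            return titles[lang]
    raise KeyError("no title available in any preferred language")


# Hypothetical dataset with the same title in two languages.
DATASET = "<https://example.org/dataset/1>"
titles = {"en": "Rainfall measurements", "pt": "Medições de precipitação"}

lines = ["@prefix dcterms: <http://purl.org/dc/terms/> ."]
lines += [
    f"{DATASET} dcterms:title {to_turtle_literal(text, lang)} ."
    for lang, text in titles.items()
]
turtle = "\n".join(lines)

if __name__ == "__main__":
    print(turtle)
    print(pick_title(titles, ["fr", "pt", "en"]))
```

A consumer can then ask for the title in its preferred language and fall back to another one when a translation is missing, without the record needing a separate field per language.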
In the afternoon I attended the Early Career and Engagement IG session. A great forum and experience! The moderators did an amazing job of getting everyone to introduce themselves and their line of research. The group is trying to match newcomers to RDA with “mentors”, people with more experience who can help them find the right groups to join and contribute to. I will be staying in touch with several of the people I met there.
Another topic in which I am very much interested is data citation and versioning. At the Data Versioning WG session, the group showed 5 patterns of versioning, including the following (more details in the versioning use cases report):
Version Identification (PID)
Granularity (Single Objects vs. Collections)
Provenance (Derived products)
On the final day at the conference, the 8th, there were several interesting presentations in the Visualization and Pattern Recognition Techniques for Understanding Data session of SciDataCon. From these, I highlight ORBUS, a software platform for displaying georeferenced data in spherical visualizers. These are fascinating visualizations that project georeferenced, time-dependent overlays onto a sphere that represents the Earth. One of the demonstrations showed the propagation of the energy waves of earthquakes across the surface of the planet and another showed the path followed by a hurricane.
The Portuguese node of RDA
On 6 November, the second day of the conference, I attended the Regional Engagement meeting, where I had the chance to reaffirm Portugal's commitment to the recently approved Portuguese node of the RDA. The application resulted from a nationwide consortium of universities and research institutions (INESC TEC, Universidade de Évora, Instituto Superior Técnico, the DGLAB, Universidade do Minho, Instituto de Ciências Sociais, Universidade de Coimbra, Escola Superior de Tecnologia e Gestão de Portalegre) as well as the main research funding provider in Portugal, the FCT (Foundation for Science and Technology).
The Portuguese node will help promote the adoption of RDA recommendations in Portugal and serve as a platform to spread awareness of RDA activities among our researchers. In fact, we have already presented the new node and called for the participation of all librarians, researchers and research data managers attending the 4th National Forum on Research Data Management, which took place on 16 November in Castelo Branco, Portugal.
Time for the wrap-up
Overall, engaging in this plenary was everything I expected and then some. The sessions included many reports on lessons learned from implementing actual production workflows at some of the major research institutions of the world. It is amazing to see how far research data management has come since 2011, when I started working in this field. There are so many new initiatives on research data management all over the world that I see a promising future ahead for the topic and also for RDA in particular.
I think that RDA is my kind of crowd: the “Movers and Shakers” of this field, who don’t just talk about doing stuff but actually get their hands dirty and come out with results to show for it. If you have not yet joined RDA, let me tell you that it is a unique opportunity to engage with the world’s leading experts in research data management, in a very open and healthy discussion forum that is bustling with activity!
About the author
João Rocha da Silva has been working on research in the area of Research Data Management since 2011, having obtained his PhD in Informatics Engineering in 2016 from FEUP. As part of his studies, he is developing the Dendro research data management platform. Dendro introduces research data management from the start of the research project, in a collaborative environment similar to Dropbox, but adding support for ontology-based, domain-specific metadata production. Dendro is being deployed as part of the research data management workflow of INESC TEC and U.Porto in the context of the TAIL project, funded by FCT.