Shortly after I joined the Leibniz Supercomputing Centre (LRZ) in December 2018, I got in touch with the Research Data Alliance (RDA) through my colleagues Tobias Weber and Stephan Hachinger. Both had attended RDA plenaries before and were involved in working groups (WG) and interest groups (IG). Thanks to an early career travel grant, I had the opportunity to attend the 14th RDA Plenary in Helsinki.
Although LRZ is known mainly for its activities in high-performance computing (HPC), it is actively involved in several projects that deal with making research data FAIR, e.g. the GeRDI (www.gerdi.org) and AlpEnDAC (www.alpendac.eu) projects, helping its customers make the hidden “treasures” in their datastores available to the public. While many scientific datasets require only a few gigabytes of disk space, the output of simulations run on HPC systems can easily occupy hundreds of terabytes. This makes these datasets immobile in the sense that they cannot be transferred to another data repository without major effort. The research data management team at LRZ is developing a software solution to overcome such issues. Because we want to follow common standards and best practices, attending RDA plenaries is essential for us to keep up with the latest developments in the field.
Although I had already attended several scientific conferences during my PhD, the RDA plenary was a rather different experience. Scientific conferences commonly leave the audience in the consumer role, so the interactivity of most breakout sessions at the RDA plenary was a surprising and joyful new experience for me. However, due to the number of parallel sessions and the diversity of topics discussed, newcomers can get lost very easily.
Thanks to the early career program, the onboarding process was much smoother than I had expected, and with the help of my experienced colleague Stephan Hachinger (who had received an expert travel grant), I had the chance to get in touch with many people and their interesting work. In my personal experience, the most valuable parts of a conference are usually the discussions during coffee breaks and poster sessions. As these are combined at the RDA plenary, you get many chances to present your own work and discuss how to improve it. At the 14th plenary we presented a poster about our lightweight, microservice-based architecture for making the research data of HPC simulations FAIR. Thanks to the valuable input from experts in the field, we are now able to improve the service’s compatibility with common standards for exchanging metadata between repositories. Due to the interactive nature of the breakout sessions, I also learned a lot from the sessions I attended. In the following, I want to focus on just a few of them.
IG – Data Policy Standardization and Implementation Meeting
This interest group was the first one that I was assigned to as an early career participant. The group is mainly driven by journal publishers and tries to standardize data policies across journals. Today, nearly every journal has its own data policy, making it difficult for researchers to find a journal that fits their scientific topic as well as their funder’s or department’s policies. The IG tries to solve this issue by providing a general data policy framework that can be adopted by journals and other publishers. The advantage of this approach is obvious: it significantly simplifies the process of finding a journal suitable for publishing a researcher’s work. A first step towards a general framework has already been made with the publication of an openly accessible paper that provides an overview of the existing data policies of different journals. The paper groups them into 6 policy types that range from minimal to very strict specifications. The accompanying matrix helps to choose the right policy on the basis of 14 features. The presentation of the paper and the report of an early adopter led to an intense but very productive discussion that brought out the difficulties related to this topic. For our work at LRZ these guidelines are crucial, as we have to fulfill the data policy requirements of the journals that our customers select for publication. As our user base ranges from astrophysics to biology, the number of journals and different policies is huge. Hence a reduced set of policies would significantly simplify our development efforts.
WG – Data Versioning: Final Recommendations and Next Steps
Today, version control is one of the most important technologies used in code management. While it is common to use tools like “git” to track every code change, how to version research data is still an open discussion. The large datasets of most HPC simulations usually do not change after publication, but the input data for these simulations may change over time, and as new data points accumulate, the simulation may need to be run again. In the AlpEnDAC project we deal with constantly incoming data points from various sensors and measurement instruments. Each of these datasets has to be assigned a digital object identifier in some way, and the constant changes in the range of available data need to be taken into account. Therefore, I was really happy to be assigned to this very interesting working group, which unfortunately is already in its final stage. From the session I learned that the topic is much more difficult than I had expected, because it not only requires the formulation of best practices but also involves a lot of definitions and technical considerations. Thanks to the well-written final report of the group, which is currently available as a draft, I could take away many things for my own work. We intend to implement the recommendations and guidelines of the final report in our projects soon, so that they agree with the latest standards.
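To give an idea of the kind of technical consideration involved, here is a minimal sketch in Python of how a constantly growing dataset could be frozen into citable versions. This is my own illustration and is not taken from the working group's report or from the AlpEnDAC code; the class `VersionedDataset` and its methods are hypothetical names, and it assumes an append-only dataset in which existing records never change.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Append-only dataset whose citable states are recorded as versions.

    Hypothetical illustration: assumes records are only ever appended.
    """
    records: list = field(default_factory=list)
    versions: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        """Add a newly measured data point to the dataset."""
        self.records.append(record)

    def release(self) -> dict:
        """Freeze the current state and return a description of it."""
        payload = json.dumps(self.records, sort_keys=True).encode("utf-8")
        entry = {
            "version": len(self.versions) + 1,
            "record_count": len(self.records),
            # The checksum lets a reader verify they hold exactly this state.
            "sha256": hashlib.sha256(payload).hexdigest(),
        }
        self.versions.append(entry)
        return entry

    def checkout(self, version: int) -> list:
        """Return the records as they were when the version was released."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version: {version}")
        count = self.versions[version - 1]["record_count"]
        return self.records[:count]
```

In such a scheme, an identifier would be assigned to each entry returned by `release()` rather than to the ever-changing dataset as a whole, so that a citation always points to a reproducible state while new data points keep arriving.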
BoF – Repository Interfaces for Data Analytics (RIDA)
This session was very special to me, as I had the chance to be involved in its preparation and to serve as co-chair together with Gretchen Green. I am pretty sure that being co-chair is very unusual for an early career participant at their first plenary; however, my LRZ colleague Tobias Weber, who is one of the initiators of RIDA, proposed me for this position. As research becomes more and more data driven, easy access to the data stored in repositories becomes increasingly crucial, especially when combining data from different domains (e.g. health data with environmental or climate data). Writing code to access the data through many different APIs is cumbersome and one of the most time-consuming parts of a data scientist’s daily workflow. Therefore, RIDA tries to establish a common protocol for a standard repository interface. While several domains have already developed their own protocols, none of these protocols has been standardized or reused by other scientific communities yet. As defining a new and accepted protocol is a huge task on its own, we decided to make the RIDA BoF session as interactive as possible and to collect as much community input as we could in order to point RIDA in the right direction. Thanks to an interested and very active audience, we received many great comments, which allow us to refocus RIDA on its way to becoming an official RDA working group.
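The benefit of a shared interface can be sketched in a few lines of Python. Please note that RIDA has not defined any protocol yet, so everything below is purely hypothetical: the `Repository` interface, its `search` and `fetch` methods, and the in-memory stand-in for a real repository are names I made up for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Repository(ABC):
    """Hypothetical common interface; not a RIDA specification."""

    @abstractmethod
    def search(self, keyword: str) -> list[str]:
        """Return the identifiers of datasets tagged with the keyword."""

    @abstractmethod
    def fetch(self, dataset_id: str) -> list[dict]:
        """Return the records of one dataset."""


class InMemoryRepository(Repository):
    """Stand-in for a real repository, backed by a plain dictionary."""

    def __init__(self, datasets: dict[str, dict]) -> None:
        # Maps a dataset id to {"keywords": [...], "records": [...]}.
        self._datasets = datasets

    def search(self, keyword: str) -> list[str]:
        return sorted(
            dataset_id
            for dataset_id, dataset in self._datasets.items()
            if keyword in dataset["keywords"]
        )

    def fetch(self, dataset_id: str) -> list[dict]:
        return list(self._datasets[dataset_id]["records"])


def collect(repositories: list[Repository], keyword: str) -> list[dict]:
    """Gather matching records from every repository via the shared interface."""
    records: list[dict] = []
    for repository in repositories:
        for dataset_id in repository.search(keyword):
            records.extend(repository.fetch(dataset_id))
    return records
```

With an interface like this, the analysis code in `collect` does not need to know whether it is talking to a health or a climate repository; only one adapter per repository has to be written, instead of new access code in every analysis script.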
IG – Research Data Architectures in Research Institutions: Healthy Architectures for Healthy Data - Sharing Approaches for Sensitive Data Architectures
The meeting of this interest group mainly focused on the handling of sensitive data, a challenging topic for all infrastructure providers. The speakers presented very interesting approaches to handling sensitive data. A frequently used approach is to separate this type of data from other data and encapsulate it. This is commonly done by physical separation on dedicated storage and processing systems with specific security requirements. Access to these data is then further restricted by strict control of access rights and access paths. Besides the talks related to sensitive data, my colleague Stephan Hachinger gave an update on the research data management efforts in the Munich area (rdmuc). LRZ collaborates closely with the university libraries of the Technical University of Munich (TUM) and the Ludwig-Maximilian-University Munich (LMU) as well as the Bayerische Staatsbibliothek (BSB) to provide research data management solutions for scientists in the Munich area. Although our work on it is still at an early stage, the topic of sensitive data is also a major issue for LRZ in its role as a university data center. Therefore, Stephan’s talk included a short report on the first efforts in this direction.
What I take with me and what comes next?
In many ways, the 14th RDA Plenary in Helsinki was an exciting new experience for me. I learned many new things that will directly impact my work on the different data management projects at LRZ. I hope that we can make the transition to an official RDA working group with RIDA, as this is a topic of high interest for the community. I am pretty sure that this plenary won’t be my last one, and I am looking forward to the next RDA plenary. I also want to thank the organizers who made this a very special event in a very special city. I really enjoyed the beautiful landscape around Aalto University with its many lakes. Having the conference in autumn was perfect thanks to the colors of the season, combined with nearly perfect weather for the region.