In December 2018, RDA Europe issued an open call for projects adopting outputs from the RDA’s various Working and Interest Groups. Following recommendations from external evaluators, eight funding grants were awarded in April 2019. This blog series will introduce the eight Adoption Grant cases, giving an overview of their project remits and demonstrating the practical approaches organisations can take when looking to implement the RDA’s Recommendations & Outputs.
For some RDA members, the value of the alliance is as a space to formulate ideas, to deliberate and build on embryonic concepts with a like-minded community. For others, particularly those who are working on projects that are further advanced, the benefit of RDA comes in the form of the guidelines it offers and the technical expertise made available as Outputs from the Working and Interest Groups. The Data Centre at the Climate Change Centre Austria falls into the latter category. Its RDA Europe 4.0 adoption grant will allow it to continue with a project that began in 2016, outlined in a previous pilot programme report and RDA Adoption Story. In essence, the project is focused on implementing the technical recommendations of an RDA Working Group to develop an operational service which will enable the citation of evolving, dynamic data and its data fragments (i.e. subsets) in an accurate and consistent manner.
The project was initially developed within GEOCLIM, a data infrastructure investment programme of the Austrian government (specifically the Federal Ministry of Education, Science and Research). GEOCLIM aimed to foster closer collaboration between the CCCA Data Centre and the Earth Observation Data Centre (EODC) through the shared use of existing hardware and server environments, such as data storage, archive and HPC facilities. The CCCA maintains part of the research data infrastructure for the Austrian climate research community, while the EODC, a spin-off of TU Vienna, focuses on earth observation, especially the EU Copernicus Sentinel products. The coordinated, collaborative development of the service with the EODC is strengthened by the work available from the Open Data Cube, whose approach to data ingest and processing is similar to that of the CCCA Data Centre. These processing fundamentals originate in the climatological domain and are well established, based on the NetCDF environment and Python toolboxes such as xarray; the Open Data Cube offers them as a condensed software package.
Scope of project up to now
The CCCA Data Centre was set up in 2015, at the same time as the first results of the Data Citation Working Group were released. Chris Schubert and his team have used the Working Group’s Data Citation for Evolving Data Recommendation since the initial conceptual phase of the Data Centre. The next challenge concerned subsetting their vast datasets and providing adequate information about them. Following the successful implementation of the guidelines set out by the Working Group, CCCA has continued to apply the recommendation, focusing on the development of a subsetting tool service which can be extended to other databases from the earth observation community. These earth observation repositories contain satellite data that cover a region with many snapshots over time. Because they can be linked to the CCCA’s repository, the technical approach for this adoption is to develop subsetting processes for this ‘remote’ data. The aim is for a user to receive a dynamically generated citation text which contains the original author, the label of the data set, versions, selected parameters, data set intersections, and a persistent identifier. Metadata in these subsets are to be inherited from the original sets and supplemented by the arguments that define the subset.
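To make the idea of a generated citation more concrete, the following is a minimal Python sketch of how subset metadata might be inherited from a parent data set and rendered as citation text. It is not the CCCA implementation: the class names, field names, citation layout and the placeholder identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ParentDataset:
    """Descriptive metadata of the full data set (hypothetical fields)."""
    authors: Tuple[str, ...]
    title: str
    version: str
    year: int


@dataclass(frozen=True)
class SubsetArguments:
    """The arguments a user chose when cutting out a subset (hypothetical)."""
    parameters: Tuple[str, ...]              # selected variables
    bbox: Tuple[float, float, float, float]  # west, south, east, north
    time_range: Tuple[str, str]              # ISO start and end dates
    pid: str                                 # identifier assigned to the subset


def subset_metadata(parent: ParentDataset, args: SubsetArguments) -> dict:
    """Inherit the parent's metadata and add the subset-defining arguments."""
    return {
        "authors": list(parent.authors),
        "title": parent.title,
        "version": parent.version,
        "year": parent.year,
        "subset": {
            "parameters": list(args.parameters),
            "bbox": list(args.bbox),
            "time_range": list(args.time_range),
        },
        "identifier": args.pid,
    }


def citation_text(meta: dict) -> str:
    """Render a human-readable citation from the subset metadata."""
    sub = meta["subset"]
    west, south, east, north = sub["bbox"]
    start, end = sub["time_range"]
    return (
        f"{'; '.join(meta['authors'])} ({meta['year']}): {meta['title']}, "
        f"version {meta['version']}. "
        f"Subset: parameters {', '.join(sub['parameters'])}; "
        f"area {west}/{south}/{east}/{north} (W/S/E/N); "
        f"period {start} to {end}. {meta['identifier']}"
    )


# All values below are invented placeholders, not real CCCA records.
parent = ParentDataset(
    ("Example, A.", "Sample, B."), "Example Climate Scenario", "2", 2019
)
args = SubsetArguments(
    ("tas",), (9.5, 46.4, 17.2, 49.0),
    ("2021-01-01", "2050-12-31"), "hdl:00000/example-subset",
)
meta = subset_metadata(parent, args)
citation = citation_text(meta)
print(citation)
```

Keeping the metadata step separate from the rendering step means the same record could feed both a landing page and the citation string, although how the CCCA service actually divides these responsibilities is not described here.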
Focus of Adoption Grant
Climate science by default deals with very large data sets, many of which are constantly evolving. The high-resolution climate data handled by CCCA, in particular, changes frequently because of its complex dependencies on global and regional climate models. How to deal with these large amounts of data, including how to manage storage consumption, minimise the download of unnecessary information, provide data provenance and versioning information, and automate metadata, comes under the general scope of the implementation.
New technical challenges that will be taken on with the support of the RDA Adoption Grant include:
- Setup of a dedicated openEO Driver for CCCA, as an independent backend for openEO, with the advantage that data can be searched and downloaded via an openEO client (simplified data access for R, Python & QGIS).
- Linking of an existing backend with data from the CCCA repository, i.e. users access CCCA data with an openEO client and connect to a backend (e.g. EODC) in order to run processes on the data, executed on either of the two infrastructures. The expected benefit is that users gain the data management capability of the CCCA-DC.
- This final development may fall beyond the end of the RDA Adoption project, but a long-term aim would be to demonstrate the CCCA-DC as an openEO Query Store, independent from other existing data infrastructures; data would still be housed remotely at the respective repositories, but the queries and the versioning information would be created and delivered by the CCCA-DC. This would mean that a distributed data repository infrastructure would be implemented for several backends and there would be a central location for data records that can be used in openEO.
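The query store idea in the last point can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the CCCA-DC or openEO implementation: the class, its method names, the record fields and the identifier format are hypothetical. The data stays at the backend; only the normalised query, the data set version and a timestamp are kept, and identical queries resolve to the same identifier.

```python
import hashlib
import json
from datetime import datetime, timezone


class QueryStore:
    """Hypothetical central store for subset queries against remote backends."""

    def __init__(self, pid_prefix: str = "example-pid"):
        self._records = {}            # query hash -> stored record
        self._pid_prefix = pid_prefix

    @staticmethod
    def _normalise(backend, dataset_id, dataset_version, arguments) -> str:
        # Sorting dictionary keys makes semantically identical queries
        # produce identical text (list order is kept as given).
        return json.dumps(
            {
                "backend": backend,
                "dataset": dataset_id,
                "version": dataset_version,
                "arguments": arguments,
            },
            sort_keys=True,
            separators=(",", ":"),
        )

    def register(self, backend, dataset_id, dataset_version, arguments) -> dict:
        """Store the query once and return its record, including the identifier."""
        normalised = self._normalise(backend, dataset_id, dataset_version, arguments)
        query_hash = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if query_hash not in self._records:
            self._records[query_hash] = {
                "pid": f"{self._pid_prefix}/{query_hash[:12]}",
                "query": normalised,
                "executed_at": datetime.now(timezone.utc).isoformat(),
            }
        return self._records[query_hash]

    def resolve(self, pid: str) -> dict:
        """Return the stored query so a backend could re-execute it."""
        for record in self._records.values():
            if record["pid"] == pid:
                return json.loads(record["query"])
        raise KeyError(pid)


# Placeholder backend and data set names, invented for this example.
store = QueryStore()
first = store.register(
    "backend-a", "example-dataset", "2",
    {"parameters": ["tas"], "bbox": [9.5, 46.4, 17.2, 49.0]},
)
same = store.register(
    "backend-a", "example-dataset", "2",
    {"bbox": [9.5, 46.4, 17.2, 49.0], "parameters": ["tas"]},
)
newer = store.register(
    "backend-a", "example-dataset", "3",
    {"parameters": ["tas"], "bbox": [9.5, 46.4, 17.2, 49.0]},
)
resolved = store.resolve(first["pid"])
```

Including the data set version in the hashed text means a query against a newer version receives a new identifier, while repeating an earlier query returns the existing one.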
A number of developments have been made since the beginning of the RDA grant period. A paper titled ‘Dynamic Data Citation Service—Subset Tool for Operational Data Management’ has been published by Chris Schubert and his colleagues Georg Seyerl and Katharina Sack; it outlines the degree to which the RDA’s Data Citation Working Group has contributed to the project so far, as well as the next steps addressing the scalability of the service. Chris was also on hand at the recent Plenary meeting in Helsinki, at which he presented on the progress of the Adoption project so far at the Adoption and Outputs plenary session.
Long term aims
As a research data infrastructure facility in Austria for the climatological domain, CCCA’s focus is currently on a limited user community. The plan to extend the subsetting service to EODC is intended to expose the service to a wider community. This also feeds into the longer-term sustainability of the service: understanding the options for permanently maintaining a service with a growing user community is one of the aims of the Adoption. As such, the dissemination of the project achievements through the RDA’s platforms, including the Austria National Node, is vital to the service’s future viability.
As is pointed out in the previous CCCA Adoption Story, the specific benefit of the RDA Outputs lies in the fact that they can relieve a project of the “intellectual and conceptual work required” in a project of this sort. These types of guidelines offer a way to directly apply the knowledge and expertise of the community to an ongoing project, with the project in turn demonstrating the practical contribution that the RDA Groups make to the wider research data community.