DKRZ is integrating persistent identifiers (PIDs) into the Earth System Grid Federation data infrastructure, which supports WCRP CMIP6 data provisioning. The use cases include precise data tracking, automated replication and versioning, and custom and early data citation. This requires elemental PID information to be interoperable across multiple services and tools, as well as the formulation of community-specific PID profiles. Furthermore, future automated processing workflows could leverage such information if it is bound to specific data types and brokered through a dedicated service. To give structure to possibly huge numbers of objects and their identifiers, the services and tools involved can also benefit from a possible RDA recommendation on research data collections.
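To make the idea of typed PID information more concrete, the following is a minimal Python sketch, not the actual ESGF or DKRZ implementation. It assumes a simple in-memory registry; the names `PIDRecord`, `Registry`, the keys `checksum`, `replicas`, `successor` and `members`, and the `example/...` identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PIDRecord:
    """A persistent identifier plus typed key-value information (hypothetical layout)."""
    pid: str
    info: dict = field(default_factory=dict)


class Registry:
    """In-memory stand-in for a PID resolution service (an assumption, not a real API)."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.pid] = record

    def resolve(self, pid):
        return self._records[pid]

    def latest_version(self, pid):
        """Follow 'successor' links until the newest version is reached."""
        seen = set()
        current = self.resolve(pid)
        # The 'seen' set guards against accidental cycles in the links.
        while "successor" in current.info and current.pid not in seen:
            seen.add(current.pid)
            current = self.resolve(current.info["successor"])
        return current.pid


registry = Registry()
registry.register(PIDRecord("example/file-v1", {
    "checksum": "abc",
    "replicas": ["site-a", "site-b"],
    "successor": "example/file-v2",
}))
registry.register(PIDRecord("example/file-v2", {
    "checksum": "def",
    "replicas": ["site-a"],
}))
# A collection is modelled here as just another record that lists member PIDs.
registry.register(PIDRecord("example/dataset-1", {
    "members": ["example/file-v1"],
}))
```

With records like these, a tool could answer versioning and replication questions by resolving identifiers rather than by inspecting files and directories, for example by calling `registry.latest_version("example/file-v1")`.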
“Current data management practices still rely largely on managing files and directories in file systems. Factors such as the relative increase of data volumes compared to available network bandwidth and the easy availability of remote and on-demand computing resources are drivers behind bringing processing and data closer together. National and international policy changes in Earth Science funding may also cause a shift in the skills and expectations of archetypical data service users,”
says Tobias Weigel, a computer scientist at the adopting organisation, Deutsches Klimarechenzentrum (DKRZ).
Together, these factors lead to scenarios where it will be increasingly difficult to manage data on a per-file or per-directory basis and deal with data transfer, replication and life cycle management at a comparatively low level of automation. Future tools may intentionally hide the location and structure of scientific data objects from the user, requiring more intelligence from back-end services. Services that provide easy data preparation and processing and make data provenance transparent may be particularly valuable for interdisciplinary users unfamiliar with established community practices.
Weigel continues: “Without solutions that increase automation, costs of maintaining services will increase, which would have a deteriorating effect on service quality or lessen resources available for developing new services required for future user demands. Past experience has shown that many tasks such as data transfers or replication suffer from the manual intervention required as long as no comprehensible data tracking solution is in place. Such tasks may take up even more resources given that the data volumes and number of objects to manage increase exponentially.”
RDA Recommendations adopted
- Data Foundation and Terminology
- PID Information Types
- Data Fabric
- Data Type Registries
- Dynamic Data Citation
DKRZ German Climate Computing Center
DKRZ (German Climate Computing Center) is a national German facility that provides state-of-the-art supercomputing, data and other associated services to the German and international scientific community for conducting top-of-the-line Earth system and climate modelling. DKRZ operates a fully scalable supercomputing system designed for and dedicated to Earth system modelling, including a mass storage system with a capacity of at least 400 PByte. DKRZ is a partner in ENES (European Network for Earth System Modelling) and is one of the representatives of the Earth system research communities in the EUDAT project. DKRZ operates the ICSU World Data Centre Climate (WDCC), a community-specific long-term data archive. Linked to WDCC, DKRZ provides best practice examples in scientific data life cycle management for the Earth system research community (federated data infrastructures, long-term archiving service, grid-based data processing workflows).