NetCDF Pilot Implementation of Austrian Climate Scenarios at CCCA Climate Change Centre Austria - Data Centre
- Pilot name: NetCDF Pilot Implementation
- Contact person: Chris Schubert
- Type: research & implementation pilot
- Status: active
- Type of data: NetCDF
- Dynamics: frequent (yearly/monthly/daily/hourly)
- Domain: High Resolution Climate Change Scenarios for Austria 1970 - 2100
- Short description: Make the large number of subfiles and subsets of the climate scenario data resources citable.
- Solution / approach: Import the NetCDF data resources into a data portal (file-based as well as data-based import), providing versions and PIDs. Users can create subsets via a web interface using arguments such as time range, geographical extent, and parameters. Each such query is captured in the Query Store and assigned a persistent identifier combined with a time stamp. Created subsets can be published immediately in the CCCA Data Portal and receive an automatically generated citation text that includes the PID. Metadata are inherited from the original dataset and aligned with the new parameters: the bounding box of the new area of interest, the new time range, and the contact of the subset creator. In addition, all relations to versions of the original dataset are kept.
- Phase 1: file-based approach, March 2016 - Summer 2016
- Phase 2: individual queries and subsets, and web-based visualisation, Summer 2016 - Dec 2016, finalised June 2017
- Supplementary material:
Goal and scope
The pilot's goal is to implement a technical solution for the citation of climate data at the CCCA Data Centre. The pilot was initiated in the context of the e-Infrastructures Austria project. As a data provider, the CCCA Data Centre ensures access to distributed information related to climate research in Austria from its member organisations as well as from other institutions. It is planned to provide integration, retrieval of, and access to various data types (primary data, metadata, project reports, etc.), including data, models, and model results relevant to climate research. CCCA is responsible for the technical implementation of the pilot. The RDA Data Citation WG and the e-Infrastructures Austria project act as consultants.
By introducing consistent data citation, with a clear declaration and machine-readable representation of version, time stamp, geolocation, and data policies, and with the assignment of a persistent identifier (PID), the pilot aims to make the implicit knowledge of Austrian climate data, including their contextual information and related processes, available to the research community and the general public. The NetCDF Pilot Implementation closely follows the recommendations of the RDA Data Citation WG as well as the requirements and results provided by the e-Infrastructures Austria project.
Data and format
As a best-practice implementation of the pilot at CCCA, data from the ÖKS15 project (Austrian climate scenarios until 2100) are being used (Chimani et al. 2016). The data include important climatological parameters from the EURO-CORDEX RCMs (regional climate models), brought to a high-resolution 1 x 1 km grid using empirical-statistical downscaling (ESD) methods in order to calculate climate change signals between 1970 and 2100 via various climate indices. The climate parameter outputs are temperature, precipitation, and global sunshine duration, provided both on a grid and for stations. The derived climate indices (33 in total) include, to mention a few, mean temperature, summer days, heat days, tropical nights, length of heat waves, and cold snaps.
These data are stored in NetCDF (Network Common Data Form). NetCDF is an open standard that was developed as a machine-independent data format. In science it is mainly used for storing structured, multi-dimensional data in a single container. A NetCDF file contains attributes, dimensions, and variables. Attributes have a name and a value and can be associated with a variable. Dimensions define the size of the variable arrays. Variables hold the data, either a single value or a multi-dimensional array. For each variable, the data type, the dimensions, and any required attributes need to be declared. An example: temperature data are provided as gridded, geo-referenced data split into a number of single files, ordered by region and by day, week, or month. The value of a grid cell corresponds to the average temperature for the respective timespan and place.
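The relationship between dimensions, variables, and attributes described above can be illustrated with a small sketch. It deliberately uses plain Python rather than a NetCDF library, so it models only the structure and reads or writes no files. The class names `Dataset` and `Variable`, the variable names `tas` and `crs`, and the grid sizes are hypothetical choices made for this illustration and are not taken from the ÖKS15 files.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import prod


@dataclass
class Variable:
    name: str
    dtype: str                       # declared data type, e.g. "f4" for 32-bit float
    dimensions: tuple[str, ...]      # names of the dimensions that shape this variable
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    dimensions: dict[str, int]       # dimension name -> size
    variables: dict[str, Variable] = field(default_factory=dict)
    attributes: dict[str, str] = field(default_factory=dict)  # file-level attributes

    def add_variable(self, var: Variable) -> None:
        # A variable may only refer to dimensions declared in the dataset.
        missing = [d for d in var.dimensions if d not in self.dimensions]
        if missing:
            raise ValueError(f"undeclared dimensions: {missing}")
        self.variables[var.name] = var

    def shape(self, name: str) -> tuple[int, ...]:
        # The shape of a variable follows from the sizes of its dimensions.
        return tuple(self.dimensions[d] for d in self.variables[name].dimensions)


# Hypothetical toy container: 31 daily time steps on a 4 x 5 grid.
ds = Dataset(dimensions={"time": 31, "y": 4, "x": 5},
             attributes={"title": "toy example"})
ds.add_variable(Variable("tas", "f4", ("time", "y", "x"),
                         {"units": "degC", "long_name": "mean air temperature"}))
ds.add_variable(Variable("crs", "i4", ()))  # no dimensions: holds a single value

print(ds.shape("tas"), prod(ds.shape("tas")))  # (31, 4, 5) 620
```

A variable without dimensions has the empty shape `()` and therefore holds exactly one value, which corresponds to the "single value" case mentioned above.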
The whole ÖKS15 data set (approx. 3.5 TB) has been integrated and made openly accessible via the CCCA Data Centre. The relationships between data sets, applied methods, and results will be represented as links. This approach ensures that the whole process, from the meteorological measurements to the decision makers' fact sheet, is completely transparent and replicable for everyone.
Reference: Barbara Chimani et al., ÖKS 15: Hochaufgelöste, biaskorrigierte Klimaszenarien für Österreich. In Tagungsband 17. Österreichischer Klimatag, 6.–8. April 2016, Graz, pp. 34-35.
Workplan and status
The work plan is divided into two phases:
Phase 1: Implementation of data citation using file-based query (approx. 1700 files);
Phase 2: Implementation of freely parameterised queries to generate subsets.
At the end of May 2016 the CCCA Data Centre presented a prototype to be tested by the community. The original ÖKS data set can be downloaded via SFTP. Phase 1 of the prototype implementation supports downloading entire NetCDF files, with a PID assigned to each file.
In Phase 1, versioning is done at file level, both for the NetCDF files and for the ISO 19115/INSPIRE-compliant metadata that describe them. Users can select the data they want to view via an interface. CKAN and map and application servers were used; extensions for metadata and PID management and for versioning were developed and will be published as official open-source CKAN extensions. For visualisation, a THREDDS server was set up, together with a NetCDF Subset Service. A locally running Handle server was used to issue PIDs.
Potential users, however, are expected to focus on specific regions, time intervals, or variables within the high-resolution data set. For this reason, a further expansion stage provides for queries at subset level. To enable users to directly access specific slices of each file instead of downloading hundreds of files, the storing of queries against the NetCDF data was implemented in Phase 2 of the pilot. The main motivation was to provide a building block for proper data management and to reduce storage consumption, because only the query is stored, not the resulting subset itself.
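The idea of storing the query instead of the subset can be sketched as follows. This is a minimal in-memory illustration and not the CCCA implementation: the class `QueryStore`, the record fields, the PID prefix `hdl:example`, and the rule that an identical query reuses its existing PID are all assumptions made for this example. A production system would keep the records in a database and obtain PIDs from the Handle server.

```python
import hashlib
import json
from datetime import datetime, timezone


class QueryStore:
    """In-memory stand-in for a query store (hypothetical, for illustration)."""

    def __init__(self, pid_prefix="hdl:example"):
        self.pid_prefix = pid_prefix  # placeholder prefix, not a registered handle
        self.records = {}             # query hash -> stored record

    @staticmethod
    def normalise(query):
        # Equivalent queries must serialise identically: sort the variable
        # list and the keys, and use a fixed separator style.
        q = dict(query)
        q["variables"] = sorted(q["variables"])
        return json.dumps(q, sort_keys=True, separators=(",", ":"))

    def store(self, query):
        text = self.normalise(query)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self.records:    # same query seen before: reuse its PID
            return self.records[digest]
        record = {
            "pid": f"{self.pid_prefix}/{digest[:12]}",  # assumed PID scheme
            "query": text,            # only the query is kept, not the subset
            "hash": digest,
            "executed": datetime.now(timezone.utc).isoformat(),  # time stamp
        }
        self.records[digest] = record
        return record


store = QueryStore()
first = store.store({
    "source": "dataset-v1",                     # hypothetical dataset/version id
    "variables": ["tas", "pr"],
    "time_range": ["2021-01-01", "2050-12-31"],
    "bbox": [9.5, 46.3, 17.2, 49.1],            # illustrative lon/lat bounding box
})
second = store.store({
    "bbox": [9.5, 46.3, 17.2, 49.1],
    "time_range": ["2021-01-01", "2050-12-31"],
    "variables": ["pr", "tas"],                 # same query, different ordering
    "source": "dataset-v1",
})
print(first["pid"] == second["pid"], len(store.records))  # True 1
```

Because the stored record contains the source version, the arguments, and a time stamp, the subset can be regenerated on request, which is what makes it unnecessary to keep the subset file itself.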
To implement this functionality, the CCCA Data Centre, in cooperation with the RDA Data Citation WG, is interested in extending the pilot runtime beyond the currently planned timeframe. For the continuation of the pilot beyond the e-Infrastructures Austria project, a small budget is currently available from the CCCA.
Outstanding work on the Dynamic Data Citation tool at the CCCA Data Centre currently comprises i) providing subset templates that record the arguments used (e.g. an individual area of interest that can then be applied to all other data sets), with a release planned for Sept. 2017, ii) subset verification, and iii) a job scheduler with notification. The queuing system is needed because of the current limit on the time range: if a user chooses a time range longer than 10 years of daily data, the server cluster times out.
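The role of the planned job scheduler can be illustrated with a small dispatch rule: short requests are answered directly, and requests above the time-range limit go to a queue. This is only a sketch under stated assumptions. The function names `dispatch` and `add_years` and the return values "direct" and "queue" are hypothetical, and the 10-year threshold is taken from the limit mentioned above for daily data; the real cut-off may depend on the data set and on the server configuration.

```python
from datetime import date

MAX_YEARS_DIRECT = 10  # assumed cut-off, taken from the limit described in the text


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 Feb to 28 Feb where needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def dispatch(start: date, end: date, max_years: int = MAX_YEARS_DIRECT) -> str:
    """Return "direct" for short requests and "queue" for long ones."""
    if end < start:
        raise ValueError("end must not be before start")
    # The inclusive range [start, end] exceeds max_years
    # as soon as end reaches start + max_years.
    return "queue" if end >= add_years(start, max_years) else "direct"


print(dispatch(date(2021, 1, 1), date(2030, 12, 31)))  # direct
print(dispatch(date(2021, 1, 1), date(2100, 12, 31)))  # queue
```

A queued request would then be processed in the background, and the user would be notified once the subset is available.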