Use Cases for the Data Type Registries WG

Here are the use cases we have gathered over the summer for a Data Type Registry. They informed the creation of the proposed data model, which will be described in a separate document.

1. Broad Functional Classification

Source: Daan Broeder – MPI and Co-chair of this group.

Description: A high-level functional classification of the object being referenced or described. The basic problem he wants to solve is that repositories hold data and metadata under widely varying policies, so even this elementary level of data description is needed to make sense of what is available. Likely categories include: primary data object, structured metadata for that object, human-readable description of that object, general description of the repository providing access to the object, contact information needed to obtain access to the object, and so forth.

2. Simple License Information Available Via ID Resolution

Source: Jan Brase – DataCite

This use case came out of an International DOI Foundation meeting in June, at which Type Registry use cases were specifically solicited. The general idea is that DataCite DOIs reference data sets with varying access conditions. A handle/type/value tuple resulting from DOI resolution would be used to identify, probably through indirection, the specific access condition, so that clients that understand those types can display them appropriately; e.g., the dx.doi.org proxy server could display the access conditions in a pop-up or intervening page. The first type used here would be the type of the handle value that carries the access condition, and a further set of types could be used to categorize the access conditions. In some cases those types would be sufficient to describe the access conditions; in other cases they would categorize the reference that leads to the detailed access condition. Creative Commons licenses are a likely starting point for describing access conditions.
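
As a rough illustration of the indirection described above, the following Python sketch takes the type/value pairs of an already-resolved record and turns an access-condition value into display text. The type name ACCESS_CONDITION, the category identifiers, and the sample record are all hypothetical; no such types are assumed to be registered, and no real resolver is contacted.

```python
from typing import List, NamedTuple, Optional


class HandleValue(NamedTuple):
    """One entry of a resolved handle record (simplified)."""
    index: int
    type: str
    data: str


# Hypothetical type name for the handle value carrying the access condition.
ACCESS_CONDITION_TYPE = "ACCESS_CONDITION"

# Hypothetical categories a client might understand without further lookup.
KNOWN_CONDITIONS = {
    "CC-BY-4.0": "Creative Commons Attribution 4.0",
    "CC0-1.0": "Creative Commons Zero 1.0",
}


def describe_access(values: List[HandleValue]) -> Optional[str]:
    """Return display text for the first access-condition value, if any."""
    for value in values:
        if value.type != ACCESS_CONDITION_TYPE:
            continue
        label = KNOWN_CONDITIONS.get(value.data)
        if label is not None:
            # The category alone is enough to describe the condition.
            return label
        # Otherwise treat the data as a reference to the detailed condition.
        return f"See access conditions at: {value.data}"
    return None


record = [
    HandleValue(1, "URL", "https://example.org/dataset/42"),
    HandleValue(2, ACCESS_CONDITION_TYPE, "CC-BY-4.0"),
]
print(describe_access(record))
```

A client that does not recognize the category falls back to showing the reference, which mirrors the two cases mentioned in the text.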

3. Object Types in the Deep Carbon Observatory Data Management System

Source: John Erickson – RPI/TWC

The Deep Carbon Observatory (DCO) is a large global ten-year project studying Earth’s deep carbon. It covers a wide range of disciplines and researchers, and RPI/TWC is the Data Science partner building the data management and access systems for the entire project. Every data object in the project, including experiments, data sets, researchers, etc., will be given an identifier (DCO-ID) and be linked appropriately to related objects. The anticipated use of types and a type registry in this project is to explicitly type every object, which is to say everything with a DCO-ID, and to record that type in the DCO-ID record, which is a handle record. Each type would be registered in a type registry and related to an ontology that would define the required and optional properties of an object of that type. For data acquisition, the type informs the template for mapping the raw data into an identified DCO object with required properties; for data use, the type, in John’s nice phrasing, serves as a “simple short-cut for dependent services to figure out if the data object in question has what is needed for processing.”
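
The "short-cut" idea can be sketched as a check of an object's properties against the required properties of its registered type. This is only a minimal sketch: the type identifier, the property names, and the sample object below are invented for illustration and are not taken from the DCO ontology.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class TypeDefinition:
    """Hypothetical registry entry: a type with required/optional properties."""
    type_id: str
    required: FrozenSet[str]
    optional: FrozenSet[str] = frozenset()


def missing_properties(obj: Dict[str, object], type_def: TypeDefinition) -> Set[str]:
    """Return the required properties that the object does not carry."""
    return set(type_def.required) - set(obj)


# Hypothetical type definition; real definitions would come from the registry.
DATASET_TYPE = TypeDefinition(
    type_id="example/dataset",
    required=frozenset({"dco_id", "title", "creator"}),
    optional=frozenset({"instrument"}),
)

# Hypothetical object that lacks one required property.
sample_object = {"dco_id": "example/dco-0001", "title": "Sample data set"}

print(sorted(missing_properties(sample_object, DATASET_TYPE)))  # ['creator']
```

A dependent service could run such a check before processing and skip or report objects for which the result is not empty.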

4. Registration of Existing and Future Handle/DOI Types

Source: Larry Lannom – CNRI and Co-chair of this group

Handle records, including all DOIs, are composed of type/value pairs, which are returned to clients resolving handles. Most handle clients, e.g., the widely used hdl.handle.net and dx.doi.org http-to-handle proxy servers, have a built-in understanding of what actions can be taken given certain specific types returned by handle resolution: URL types result in one-to-one http redirects; 10320/loc contains structured data that can be interpreted to provide one-to-many resolution, such as journal articles available through different venues; an HS_VLIST type indicates the handles of valid administrators for a given handle; and so on. These common types are themselves registered as handles but are not discoverable. The typing system is open; that is, anyone with administrative permissions over a handle record can create any type they and their community find useful. A type registry for these types, and other PID types, would enable developers to see whether a needed type is already available and would also allow those who have minted a new type to explain what it means and what properties or services can be expected in the identifier record and/or the identified object.
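
To make the type-driven behaviour concrete, here is a small Python sketch of a client that decides what to do with each type/value pair of a record it has already resolved. The sample record, the simplified 10320/loc XML, the type example/new-type, and the action strings are assumptions made for illustration; this is not how the proxy servers are actually implemented.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def actions_for(values: List[Tuple[str, str]]) -> List[str]:
    """Map each (type, data) pair of a handle record to a client action."""
    actions = []
    for value_type, data in values:
        if value_type == "URL":
            # One-to-one resolution: redirect to the single location.
            actions.append(f"redirect:{data}")
        elif value_type == "10320/loc":
            # One-to-many resolution: every listed location is a candidate.
            for location in ET.fromstring(data).iter("location"):
                href = location.get("href")
                if href:
                    actions.append(f"candidate:{href}")
        else:
            # Unknown type: this is where a type registry lookup would help.
            actions.append(f"lookup-type:{value_type}")
    return actions


# Hypothetical record; the XML is a simplified stand-in for a 10320/loc value.
sample_record = [
    ("URL", "https://example.org/landing"),
    ("10320/loc",
     '<locations>'
     '<location id="0" href="https://example.org/a" weight="1"/>'
     '<location id="1" href="https://example.org/b" weight="0"/>'
     '</locations>'),
    ("example/new-type", "some value"),
]

for action in actions_for(sample_record):
    print(action)
```

The final branch is the gap this use case points at: without a registry, a client has nothing to consult when it meets a type it was not built to understand.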

5. Content Negotiation

Source: Simon Cox – CSIRO

An individual dataset may be provided in multiple formats for different client applications. HTTP headers can be used to request a specific MIME-type, or express an order of preference. But there are (at least) two limitations: 

(i) HTTP does not provide a way to discover what types are available, so HTTP content negotiation is essentially a guessing game.

(ii) the granularity of MIME-types is insufficient. So even if you get a format you like, the content may still be unusable. 

A plausible interaction sequence would see 

- an initial request for the available types, via a landing page, or by requesting a representation denoted by a (new?) specific MIME-type (the latter perhaps a special usage of AtomPub)

- final request using a suitably fine-grained data-type descriptor, selected from the set obtained in response to the initial request. 
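
The two-step sequence above can be sketched in Python as a selection over a list of advertised representations. The data-type identifiers, URLs, and the shape of the listing are hypothetical; in practice the listing would come from a landing page or a dedicated representation, and the identifiers from a type registry.

```python
from typing import Dict, List, Optional


def choose_representation(
    available: List[Dict[str, str]],
    preferred_types: List[str],
) -> Optional[Dict[str, str]]:
    """Pick the first advertised representation matching the client's
    preferences, which are listed in descending order of preference."""
    by_type = {rep["data_type"]: rep for rep in available}
    for data_type in preferred_types:
        if data_type in by_type:
            return by_type[data_type]
    return None


# Step 1 (assumed result): the server lists what it can provide. Both entries
# share one MIME type, which is exactly why MIME types alone are too coarse.
available = [
    {"mime": "text/csv", "data_type": "example/timeseries-wide",
     "url": "https://example.org/data/1?layout=wide"},
    {"mime": "text/csv", "data_type": "example/timeseries-long",
     "url": "https://example.org/data/1?layout=long"},
]

# Step 2: the client selects a fine-grained type and requests that URL.
choice = choose_representation(
    available, ["example/timeseries-long", "example/timeseries-wide"]
)
print(choice["url"] if choice else "no usable representation")
```

Because the client chooses from an explicit list, it no longer has to guess what the server can deliver.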

I had assumed that these concerns were key motivations for the work of this group, but don't see this case specifically reflected in the ones provided to date. 

6. Data Persistence from Lab Devices

Source: Dirk Fleischer – Kiel Marine Science, Christian-Albrechts University Kiel

Lab devices usually create a standard output file; in the best case this is CSV, Excel, etc. SensorML is usually not available. We are considering setting up a system to capture all necessary information on the general usage and intended purpose of the lab device. It would be great to have a DTR to attach the device output file to the collected information and then process the data into a repository or database. The DTR would simplify the programming of the processing module: a DTR request on the ID of the lab device would return the processing information needed to bring the data, combined with the additional information, into a database for later use. It would then be unnecessary to create device-specific processing modules and to manage their versioning and evolution. As long as the output format is the same across device versions, those versions can share one DTR record instead of each requiring an additional programming module.
This would be a great help for capturing data at the point of origin, right at the machine where the values were measured.
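
A minimal Python sketch of this idea follows, assuming the DTR record carries simple parsing hints. The record fields (delimiter, skip_lines, columns), the device IDs, and the sample output are invented for illustration; a real DTR record would be defined by the registry's data model.

```python
import csv
from typing import Dict, List

# Hypothetical DTR content: one record describes how to read an output format.
DTR_RECORDS = {
    "example/device-output-v1": {
        "delimiter": ";",
        "skip_lines": 1,  # number of header lines written by the device
        "columns": ["timestamp", "temperature_c"],
    },
}

# Two device versions with the same output format share one DTR record.
DEVICE_TO_TYPE = {
    "lab-device-A-v1": "example/device-output-v1",
    "lab-device-A-v2": "example/device-output-v1",
}


def parse_device_output(device_id: str, raw: str) -> List[Dict[str, str]]:
    """Parse raw device output using the DTR record registered for the device."""
    record = DTR_RECORDS[DEVICE_TO_TYPE[device_id]]
    lines = raw.splitlines()[record["skip_lines"]:]
    reader = csv.reader(lines, delimiter=record["delimiter"])
    return [dict(zip(record["columns"], row)) for row in reader if row]


raw_output = "Device export\n2013-09-01T10:00;4.2\n2013-09-01T10:05;4.3\n"
rows = parse_device_output("lab-device-A-v2", raw_output)
print(rows)
```

The processing function itself contains nothing device-specific; supporting a new device version with the same format only requires a new entry in the device-to-type mapping.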