Metadata for a research dataset can describe the dataset's bibliographic properties as well as its scientific properties. The scientific metadata are bound to be domain-specific. In materials science, they may describe what measurements were performed on which specimen, using which instrument, in what kind of environment. We defined an original schema to describe these metadata, and we have also explored other methods, which are covered below.
Materials science is a syncretic field of study, and creating a widely accepted standard for describing materials-science metadata is challenging. Our JSON Schema is designed primarily for messaging and data relaying among the systems in our materials data platform (data collection, management, repository, etc.). It features a hierarchical structure to describe several different types of resources (instruments, specimens, processes, properties, and computation) or combinations thereof. It is partially inspired by the Materials Data Vocabulary, which was one of the outputs of the RDA International Materials Resource Registries WG. We have been developing the platform systems to generate and interpret this format. However, the complexity of the hierarchical structure has made the model difficult to implement in some systems. User feedback also indicated that some researchers find the complex input form, which mirrors the structure of the schema, burdensome. We also found that a dataset may contain multiple parts to which different metadata apply, and these cannot be described accurately as long as only one metadata record is attached per dataset.
The second method is directory-based metadata collection. In cases where the structure of the metadata can be stably defined beforehand, we can design a system in which researchers are asked to save their data in a pre-defined file hierarchy, which can then be interpreted by another system. A simplified version of this idea was tried out as part of a semi-automated data collection system in our institute and has received generally positive feedback from researchers. We are extending the same principle to one of our closed repositories, currently in development under the codename "MDR-X" (Materials Data Repository-X), which will deal with an even greater variety of materials data. This approach has two benefits: it is easy for researchers to operate, and it can be used in an environment with limited network connectivity, a common situation in materials laboratories. Its relative lack of flexibility could be a weak point.
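As a minimal sketch of the directory-based idea, the Python snippet below derives a metadata record from a file's position in a pre-defined hierarchy. The layout (`<instrument>/<specimen>/<process>/<data file>`), the `LEVELS` tuple, and the function name `metadata_from_path` are assumptions made for illustration only; they are not the layout or code used in our data collection system or in MDR-X.

```python
from pathlib import PurePosixPath

# Hypothetical layout, assumed for illustration only:
#   <instrument>/<specimen>/<process>/<data file>
LEVELS = ("instrument", "specimen", "process")


def metadata_from_path(relative_path):
    """Derive a metadata record from a file's position in the hierarchy."""
    parts = PurePosixPath(relative_path).parts
    # Each directory level maps to one metadata field; the last part is the file.
    if len(parts) != len(LEVELS) + 1:
        raise ValueError(f"unexpected depth for {relative_path!r}")
    record = dict(zip(LEVELS, parts[:-1]))
    record["file"] = parts[-1]
    return record
```

For example, `metadata_from_path("xrd-01/sample-A/annealed/scan001.csv")` would yield a record with `instrument`, `specimen`, `process`, and `file` fields. Because the interpretation depends only on paths, it can run offline, but any path that does not match the fixed depth is rejected, which illustrates the limited flexibility noted above.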
Meanwhile, it is becoming increasingly popular to describe metadata using files in a data package, typically in JSON or other lightweight formats. In particular, RO-Crate is a data packaging approach based on Schema.org, in which metadata is written in JSON-LD. (Schema.org focuses on structured data for web resources, but RO-Crate has extended its use cases to more general-purpose metadata description.) We anticipate that this approach could solve the problem of different parts of a dataset requiring different metadata. The life sciences community has been actively developing Bioschemas as an extension to Schema.org. For materials science, NIST has recently published a pre-alpha version of Material Schema on their website. We believe RDA is a great place to coordinate these kinds of domain-specific schema developments.
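To illustrate how a data package can attach different metadata to different parts of a dataset, the following Python sketch builds a small RO-Crate 1.1 style JSON-LD document in which each file is its own entity. The function name `build_crate`, the file names, and the descriptions are hypothetical, and the sketch omits other properties a complete crate would carry (for example a license and a publication date). It uses only generic Schema.org properties (`name`, `description`), not any materials-specific vocabulary.

```python
import json


def build_crate(name, parts):
    """Build a minimal RO-Crate style JSON-LD document.

    `parts` maps a file path to a free-text description, so each part of
    the dataset gets its own metadata entity.
    """
    files = [
        {"@id": path, "@type": "File", "description": description}
        for path, description in parts.items()
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor pointing at the root dataset
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset listing its parts
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "hasPart": [{"@id": path} for path in parts],
            },
            *files,
        ],
    }


# Hypothetical example: two files with different descriptions.
crate = build_crate(
    "Example materials dataset",
    {
        "xrd/scan001.csv": "X-ray diffraction pattern of the annealed specimen",
        "sem/image001.tif": "SEM image of the as-received specimen",
    },
)
print(json.dumps(crate, indent=2))
```

The resulting JSON would be saved as `ro-crate-metadata.json` at the root of the package, alongside the data files it describes.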
This poster aims to share the issues we have encountered in implementing materials metadata and promote discussion on scientific metadata efforts.