Materials metadata: as a custom schema, as directories, or in a data package
Metadata for a research dataset can describe its bibliographic properties as well as its scientific properties. The scientific metadata are bound to be domain-specific. In materials science, they may describe what measurements were performed on which specimen, using which instrument, in what kind of environment. We defined an original schema to describe these, and we also explored two other methods, which are covered below.
Materials science is a syncretic field of study, and creating a widely accepted standard to describe materials-science metadata is challenging. Our JSON Schema is designed primarily for messaging and data relaying among the systems in our materials data platform (data collection, management, repository, etc.). It features a hierarchical structure to describe several different types of resources (instruments, specimens, processes, properties, and computation) or combinations thereof. It is partially inspired by the Materials Data Vocabulary, which was one of the outputs of the RDA International Materials Resource Registries WG. We have been developing the platform systems to generate and interpret this format. However, the complexity of the hierarchical structure has made the model difficult to implement in some systems. User feedback also indicated that, for some researchers, an input form mirroring the structure of the schema could feel burdensome. In addition, we found that a dataset may contain multiple parts to which different metadata apply, which cannot be accurately described as long as only one metadata record is attached per dataset.
The second method is directory-based metadata collection. In cases where the structure of the metadata can be stably defined beforehand, we can design a system where the researchers are requested to save their data in a pre-defined file hierarchy, which can then be interpreted by another system. A simplified version of this idea was tried out as part of a semi-automated data collection system in our institute and has received generally positive feedback from researchers. We are expanding the same principle to one of our closed repositories, currently in development under the codename "MDR-X" (Materials Data Repository-X), which will deal with an even greater variety of materials data. This approach has two benefits: it is easy for researchers to operate, and it can be used in an environment with limited network connectivity, a common situation in materials laboratories. Its weak point could be a relative lack of flexibility.
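As an illustration of the directory-based idea, the following is a minimal Python sketch, not the implementation used in our systems. It assumes a hypothetical layout of the form user_id/project_id/instrument/..., and the function name metadata_from_path and the level names are invented for this example.

```python
from pathlib import PurePosixPath

# Hypothetical convention: <user_id>/<project_id>/<instrument>/<file...>
# The actual hierarchy used by a given system may differ.
LEVELS = ("user_id", "project_id", "instrument")


def metadata_from_path(relative_path: str) -> dict:
    """Derive a flat metadata record from a path that follows LEVELS."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) <= len(LEVELS):
        # Need one directory per level plus at least one file component.
        raise ValueError(
            f"expected at least {len(LEVELS) + 1} path components: {relative_path!r}"
        )
    # Pair each leading directory name with the level it represents.
    record = dict(zip(LEVELS, parts))
    # Whatever remains below the pre-defined levels is the data file itself.
    record["file"] = str(PurePosixPath(*parts[len(LEVELS):]))
    return record


if __name__ == "__main__":
    print(metadata_from_path("u123/projA/xrd01/run1/scan.csv"))
```

Because the interpretation depends only on the path, a sketch like this can run on an offline instrument PC. The fixed LEVELS tuple also shows where the lack of flexibility comes from.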
Meanwhile, it is becoming increasingly popular to describe metadata using files in a data package, typically using JSON or other lightweight formats. In particular, RO-Crate is a data packaging approach based on Schema.org, where metadata is written in JSON-LD. (Schema.org focuses on structured data for web resources, but RO-Crate has extended the use cases to more general-purpose metadata description.) We anticipate that this approach can provide a solution to the problem where different parts of a dataset require different metadata. The life sciences community has been actively developing Bioschemas as an extension to Schema.org. For materials science, NIST has recently published a pre-alpha version of Material Schema on their website. We believe RDA is a great place to coordinate these kinds of domain-specific schema developments.
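To make the per-part metadata idea concrete, here is a minimal Python sketch that assembles an RO-Crate-style JSON-LD document in which each file carries its own description. It is a sketch under assumptions: the helper build_crate, the file paths, and the descriptions are hypothetical, only generic Schema.org properties are used, and the result omits root-dataset properties (such as license and publication date) that a complete crate would be expected to carry. Materials-specific terms would need an extension vocabulary, which is not shown.

```python
import json


def build_crate(name: str, parts: dict) -> dict:
    """Build an RO-Crate-style metadata document.

    parts maps a relative file path to a dict of properties for that file.
    """
    graph = [
        # Metadata file descriptor pointing at the root dataset.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset listing every part by reference.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "hasPart": [{"@id": path} for path in parts],
        },
    ]
    # One entity per file, so different parts can carry different metadata.
    for path, props in parts.items():
        graph.append({"@id": path, "@type": "File", **props})
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


if __name__ == "__main__":
    crate = build_crate(
        "Example specimen dataset",  # hypothetical dataset
        {
            "xrd/scan.csv": {"description": "X-ray diffraction pattern"},
            "sem/image.tif": {"description": "Scanning electron micrograph"},
        },
    )
    print(json.dumps(crate, indent=2))
```

The flat graph with references by "@id" is what allows metadata to be attached at the level of individual files rather than once per dataset.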
This poster aims to share the issues we have encountered in implementing materials metadata and promote discussion on scientific metadata efforts.
The efforts described here are based on earlier outputs from the RDA/CODATA Materials Data, Infrastructure & Interoperability IG (Materials IG) and the RDA International Materials Resource Registries WG. Further developments under consideration would benefit from international discussion and cooperation, for which the Materials IG and the Metadata IG can be suitable venues.
Author: James Myers
Date: 08 Apr, 2020
Interesting poster. It's nice to see consideration of how usable different options might be in a practical research setting.
FWIW, I like the idea of mining directory structure to generate formal metadata. On past projects, I've proposed, but never implemented, the idea of using regular expressions to provide a flexible, per-project way to extract the parts of a path needed to generate values that could then be represented in the repository as formal metadata. Some of the things that stopped us were figuring out how to make such a capability simple enough for people to configure, and keeping the file path and metadata in sync if either can be edited. I'm curious to know what mechanism(s) you've thought about for specifying the mappings to terms. I'd also like to know if your system allows edits or just generates the metadata once, and if editing can happen, I'm curious if you've thought about how to handle metadata derived from paths.
Author: Asahiko Matsuda
Date: 09 Apr, 2020
James, thank you for your comment!
What we've actually implemented so far does not have any per-project way to define more metadata. All we have for now are the user ID, project ID, and the instrument... very basic stuff. They are not editable once they have entered the system. The files in the system no longer live in a hierarchical directory, so editing or syncing never becomes an issue in this system. Obviously this first system is fairly limited, and there's a separate project that tries to do more: project-specific metadata mapping, keeping the directory structure intact, etc... We've only begun working on that, so we've yet to come up with pretty solutions to your concerns. Thank you for your heads-up on potential issues.