The last breakout slot of the Thirteenth RDA Plenary Meeting saw a joint meeting between some long-running groups with a shared interest in metadata: the Metadata Interest Group, the Metadata Standards Catalog Working Group and the Data In Context Interest Group.
Metadata Standards Catalog
The meeting began with my report on progress with the Metadata Standards Catalog. The Catalog represents the third phase of development of a resource whose first phase was a project within the Digital Curation Centre (DCC) in the UK. That first phase culminated in the January 2013 launch of the DCC’s Disciplinary Metadata catalogue, a resource aimed at research support staff to help them advise researchers on how best to document their data. Meanwhile, within the RDA, the Metadata Standards Directory Working Group was forming with the aim of developing a resource to combat both the underuse and duplication of metadata standards for research data. It soon became apparent that the DCC catalogue was a good fit for the aims of the working group, so instead of starting again from scratch the group gave it a second phase of development. The focus of the work was essentially two-fold: to fill in gaps in coverage, and to make the ongoing hosting and maintenance of the catalogue more distributed and sustainable.
The work of the group resulted in the addition of 11 new major standards, 5 new metadata profiles, 7 metadata-related tools and 19 examples of these standards being used by services and organizations. A more visible outcome was the Metadata Standards Directory itself: a new instance of the catalogue hosted on GitHub, providing a straightforward and transparent way for people to suggest changes. Between them, the two instances of the catalogue have been recommended for use by the DataONE Best Practices Database and various individual institutions. Their impact can be felt in instances of adoption of and support for standards by projects, groups and institutions internationally. And as a way of sharing expertise the Directory has proved valuable: around 50 updates have been submitted to it via GitHub since its launch.
In January 2016, the Metadata Standards Catalog (MSC) Working Group started work on a third phase of development. There were several drivers for this work, most notably to make the information available through a Web service API, and to enrich the information held to support a wider set of use cases, including metadata research and development. The work was conducted in the mould of a software development project, dividing roughly into six months collecting use cases and developing a specification, six months designing a new data model and migrating the Directory data into it, and six months developing the application software.
Though that concluded the main business of the Working Group, there followed a lengthy period of getting the application adopted into production. A private instance of the MSC was set up by the DCC to allow the information on metadata standards to be pulled into the data management planning tool DMPonline via the API. This was upgraded to a public-facing prototype in December 2017. Even so, the prototype was dogged with various technical problems, which meant we were hesitant to recommend it more widely. Finally, in March 2019, a new instance of the MSC was set up on a dedicated virtual server at the University of Bath, and we are very happy with its performance.
In the meeting, I gave a live demonstration of how to update a record in the MSC, and how to use the API to retrieve information. To help with wider adoption of the MSC, its API has now been documented with live demonstration capabilities on SwaggerHub; the OpenAPI source for this can be inspected in the MSC’s GitHub repository. As I said to those present, if you have an interest in metadata, please do take a look at the records in the MSC and contribute your additions, updates and corrections. If you would like access to the editor-level API functions, please get in touch.
As the working group has been operating in maintenance mode for over 18 months, we put it to the room that the time was right to make it official; they agreed, and so we have since applied to transition to a Maintenance Group. The hard work is potentially just beginning, though, as one of the main adoptions planned for the MSC is as a vehicle for documenting standards in terms of the Recommended Metadata Element Set (MES) being developed by the Metadata Interest Group, with the eventual aim of enabling semi-automated production of converters between arbitrary metadata standards.
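To make the converter idea a little more concrete, here is a minimal Python sketch of a crosswalk that goes through a shared element set as a pivot. The two standards, their field names and the pivot element names are all hypothetical, chosen purely for illustration; they are not taken from the MES or from any real schema.

```python
# Hypothetical mappings from each standard's field names to shared
# pivot elements (illustrative names only, not the real MES).
TO_PIVOT_A = {"dc_title": "title", "dc_creator": "originator", "dc_id": "identifier"}
TO_PIVOT_B = {"name": "title", "author": "originator", "pid": "identifier", "licence": "licence"}


def make_converter(source_to_pivot, target_to_pivot):
    """Build a record converter by composing two pivot mappings."""
    # Invert the target mapping: pivot element -> target field.
    pivot_to_target = {pivot: field for field, pivot in target_to_pivot.items()}
    # Compose: source field -> pivot element -> target field.
    field_map = {
        src: pivot_to_target[pivot]
        for src, pivot in source_to_pivot.items()
        if pivot in pivot_to_target
    }

    def convert(record):
        # Source fields with no counterpart in the target are dropped.
        return {field_map[k]: v for k, v in record.items() if k in field_map}

    return convert


a_to_b = make_converter(TO_PIVOT_A, TO_PIVOT_B)
converted = a_to_b({"dc_title": "Ice core data", "dc_creator": "A. Researcher", "dc_extra": "x"})
print(converted)
```

The attraction of a pivot is that each standard only needs to be mapped once: any pair of standards can then be converted without a hand-written crosswalk for every combination.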
Metadata Element Set
This led seamlessly into Keith Jeffery’s update on progress on the MES. So far 17 broad elements have been identified and discussed in some detail, but there is more work to do to unpack them into a form suitable for use as a sort of metadata Rosetta Stone. We are especially interested in enumerating the possible subelements and attributes of the elements, the possible relationships that exist between those elements and subelements, and the classifications that will be needed to represent their semantics. The end result for each element should be a formal syntax (with referential and functional integrity) and declared semantics (through rich ontological structures to permit management of the relationships of terms including multilinguality). This will require some dedicated work, so Keith asked for volunteers to head up task and finish groups to deal with each element in turn. At the meeting we had two volunteers: Nick Juty (University of Manchester) will be leading the group on identifiers and licensing, while Jane Greenberg (Drexel University) will lead the group on data quality.
The full list of elements is available on the Metadata Interest Group page. If you would like to get involved in leading or contributing to a group, please email the Metadata Interest Group list: don’t forget to mention which element you’d like to tackle!
In the discussion that followed, Maggie Hellström (ICOS Carbon Portal) encouraged us to look at the archiving of metadata schema specifications. We had always planned to store schemas normalised in terms of the MES within the Metadata Standards Catalog, but we could certainly look at storing native schemas as well.
Presentations from other groups
The Metadata Interest Group has always been the focal point for all things metadata within RDA, but groups throughout RDA find metadata forms an important part of achieving their aims. Rebecca Koskela invited five such groups to present their activity and share their metadata experiences within the forum of this meeting.
RDA/TDWG Attribution Metadata for Collections
Anne E Thessen presented on behalf of the RDA/TDWG Metadata Standards for Attribution of Physical and Digital Collections Stewardship Working Group. The group is tackling the issue that curating and maintaining collections is valuable work that goes largely unrecognised. By developing a metadata standard for recording this work, the group hopes to provide a basis for factoring such contributions into incentive structures. In outline, the scheme works as follows:
- An Entity is generated by an Activity. The Activity has a DateTime period (when it happened) and optionally a Reason (why it happened).
- The Activity is associated with an Agent in either a generic way or via a qualified Association.
- The qualified Association is used to specify the Role that the Agent played.
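As a rough illustration of how these pieces fit together, the three statements above can be sketched as Python data classes. The class names follow the terms used in the list; the attribute names and the example values are my own assumptions, not part of the group's standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Agent:
    name: str


@dataclass
class Association:
    """Qualified link between an Activity and an Agent, carrying the Role."""
    agent: Agent
    role: str


@dataclass
class Activity:
    started: datetime                 # DateTime period: when it happened
    ended: datetime
    reason: Optional[str] = None      # optional Reason: why it happened
    agents: List[Agent] = field(default_factory=list)              # generic association
    associations: List[Association] = field(default_factory=list)  # qualified association

    def roles_of(self, agent):
        """Roles the given Agent played, taken from the qualified Associations."""
        return [a.role for a in self.associations if a.agent == agent]


@dataclass
class Entity:
    identifier: str
    generated_by: Activity


# Hypothetical example: a curator re-housing a specimen.
curator = Agent("J. Curator")
activity = Activity(
    started=datetime(2019, 1, 7),
    ended=datetime(2019, 1, 11),
    reason="re-housing",
    associations=[Association(curator, "curator")],
)
specimen = Entity("specimen-001", generated_by=activity)
print(specimen.generated_by.roles_of(curator))
```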
The scheme has been mapped to the Metadata Interest Group MES and a controlled vocabulary for Roles is being assembled from existing resources such as TaDiRAH and CRediT. The group is looking into adding terms/subclasses into VIVO to permit adoption of the standard in that ontology, and similarly into developing an extension to Darwin Core based on the standard.
The standard has already been adopted by providers of attribution metadata such as Arctos, iDigBio, Bloodhound and TaxonWorks, and by aggregators/consumers such as ORCID, ImpactStory and Altmetric.
Data Discovery Paradigms
Mingfang Wu presented on behalf of the Data Discovery Paradigms Interest Group, which is currently running three task forces related to metadata:
- Metadata Enrichment, describing and cataloguing efforts to enrich the metadata of research datasets;
- Data/Metadata Granularity, addressing the issue of granularity in dataset description, cataloguing, citation and access;
- Using schema.org for Research Dataset Discovery, identifying minimum information guidelines, and looking to fill gaps in both core and extension schemas for datasets such as Bioschemas and Science-on-schema.
The last of these task forces recently ran a survey aimed at data repository managers, to establish current practice with regard to using metadata schemas generally and schema.org in particular, and to performing crosswalks between schema.org and other schemas. The survey asked which schemas and crosswalks were being used, how schema.org was being applied and what gaps people had noticed, and also gathered suggestions for what the task force could do to help the situation.
By the time the survey closed on 25 March, 20 replies had been received. One of the biggest bugbears people had was the lack of support within schema.org for controlled vocabularies, and indeed there were calls for support in ‘collecting’ and managing schemas, vocabularies, ontologies and other semantic assets. There was also recognition that support for Arts and Humanities domains is somewhat lacking at the moment.
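For readers who have not seen schema.org dataset markup, the following Python sketch builds a minimal JSON-LD description of a dataset. The property names are standard schema.org terms; the values are placeholders, and the selection of properties is only an illustration, not the minimum information guidelines the task force is working on.

```python
import json

# Placeholder values throughout; only the property names are real schema.org terms.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example ice core measurements",
    "description": "Placeholder description of a hypothetical dataset.",
    "identifier": "https://example.org/dataset/1",
    "keywords": ["ice cores", "glaciology"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Serialise as JSON-LD, ready to embed in a landing page inside a
# <script type="application/ld+json"> element.
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```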
Standard Vocabularies, Metadata and Improved Semantics
Charles Vardeman gave a presentation that was partly personal and partly on behalf of the Data Foundations and Terminology Interest Group, but bringing in perspectives from the Vocabulary Services Interest Group.
The thrust of his talk was about how ontology design patterns can be used to handle some of the shortcomings of lightweight solutions. Most lightweight solutions rely on standard vocabularies and illustrations, but it can be hard to arrive at standardized definitions and a single focus for illustrations. Complex concepts like ‘ablation’ can be decomposed several different ways (as a process or as an amount of mass removed, or from which layer the mass was removed) and simply splitting out the subconcepts and relating them back to the main one with an ‘x is a y’ relationship can be confusing.
As a worked example, Charles looked at some cryosphere concepts which imply things about the location of ice as well as the chemistry of its composition. For example, ice sheets and ice shelves are largely similar, except that an ice shelf is attached to the coast. So the Hydro Foundational Ontology (HyFO) has an ontology design pattern for ‘features’: a GeoFeature has a constituent Physical Material and has Immaterial/Abstract properties. In this case the ice shelf is considered a feature that is composed primarily of some form of ice, and has a more abstract location that is both floating and attached to land. It is then possible to chain several of these patterns together: the ice shelf may be considered and further described as a type of physical object, both as a feature and as a body of ice, and the particularities of its actual location may be described in terms of a terrain or in terms of coordinates.
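A very rough Python sketch of the ‘features’ pattern might look like the following. The class and attribute names are my paraphrase of the description above, not HyFO’s actual identifiers, and the property values are plain illustrative strings rather than ontology terms.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PhysicalMaterial:
    name: str


@dataclass(frozen=True)
class GeoFeature:
    label: str
    constituent: PhysicalMaterial        # what the feature is made of
    abstract_properties: FrozenSet[str]  # immaterial properties, e.g. location


ICE = PhysicalMaterial("ice")

# Same constituent material; the features differ only in their abstract location.
ice_sheet = GeoFeature("ice sheet", ICE, frozenset({"grounded on land"}))
ice_shelf = GeoFeature("ice shelf", ICE, frozenset({"floating", "attached to land"}))

print(ice_sheet.constituent == ice_shelf.constituent)
print(sorted(ice_shelf.abstract_properties - ice_sheet.abstract_properties))
```

Keeping the material and the abstract properties in separate slots is what lets two similar concepts share one description of their composition while differing in where they sit.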
Charles’ final point was that we shouldn’t underestimate the value of local models tailored to particular solutions: in attempting to unify them into one grand vision, we can easily end up with a hypermodel object, the information associated with which transcends our capability to control or understand it.
Metadata for Machines Workshops
Erik Schultes gave a presentation on the Metadata for Machines (M4M) Workshops for Faster FAIRification, jointly organized by the GO FAIR project and the Research Data Alliance. The workshops are predicated on the idea that data cannot be FAIR without machine-actionable metadata, but (a) it’s not always easy for domain experts to create machine-ready metadata, and (b) it’s never easy to achieve its widespread reuse.
The workshops are about bringing together domain experts and metadata experts to design metadata templates that tick all the boxes: supporting FAIR compliance, complying with community standards, being machine-actionable, and being registered to enable reuse. More specifically, a domain community aiming to become more FAIR engages with metadata experts at the workshop to create a FAIR metadata template, being either a profile of existing metadata elements or a new set of elements. This is then declared and registered, adding it to the pool of elements and templates from which the next communities can draw.
In the second M4M workshop, for example, Preclinicaltrials.eu (representing the preclinical trials research community) worked with CEDAR and FAIRsharing on a template, and in the third, the Health Research Board and ZonMw (representing funders) worked with the same metadata experts. Many more workshops are in the pipeline.
In future, there are plans to hold different types of workshops: a ‘beginners’ version for creating simple profiles, an ‘advanced’ version for tackling more complex problems, and an institutional version aimed at aligning internal administrative metadata to global standards.
Chemistry Metadata Initiatives
Finally, Ian Bruno (Cambridge Crystallographic Data Centre) presented on behalf of the Chemistry Research Data Interest Group. IUPAC (International Union of Pure and Applied Chemistry) was formed in 1919 and has worked extensively on developing and standardizing chemical terminology. This terminology has historically been set out in a series of colour-coded books (e.g. the Blue Book for Organic Chemistry), but in 2006 it was unified into a single digital resource: the Gold Book. The name handily continues the naming convention, signifies the resource’s role as the new ‘gold standard’, and honours Victor Gold, who initiated work on the first edition. The resource contains over 7000 terms with authoritative definitions, each of which has its own DOI for ease of reference. Work is now underway to update the underlying platform and provide an API for machine access.
Ian was an advisor to the NSF OAC workshop ‘FAIR Publishing Guidelines for Spectral Data and Chemical Structures’ held in March 2019. The goal was to try to achieve some quick wins in metadata best practice, by developing a data publishing workflow, formulating guidelines for published FAIR chemical data, and looking again at re-use cases for chemical characterization data. One thing it did, for example, was look at NMR spectra and consider what were the key metadata fields for describing them: suggestions included the InChI (International Chemical Identifier) of the substance being studied, and various items of instrument and bibliographic metadata.
At that workshop, Dave Martinsen and Henry Rzepa presented work on registering DOIs for chemical substances. One challenge they faced was deciding the most useful way of encoding key metadata about the substances in the DataCite metadata schema. But they have a number of proofs of concept, with the pleasing result that it is possible to search DataCite for an InChI, included in the metadata as a subject term.
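As a sketch of what such a search might look like, the Python snippet below builds a query URL for the DataCite REST API that looks for an InChI among the subject terms. I am assuming here that the InChI is held as the text of a subject and that the `subjects.subject` field can be targeted through the `query` parameter; the helper function is hypothetical, and the snippet only constructs the URL rather than sending the request.

```python
from urllib.parse import urlencode

DATACITE_DOIS = "https://api.datacite.org/dois"


def inchi_search_url(inchi):
    """Build a DataCite search URL for DOIs whose subjects include the InChI.

    Assumes the InChI is stored as a subject term (see lead-in).
    """
    # Escape backslashes and quotes, then wrap the InChI in quotes so that
    # it is searched as a single phrase.
    escaped = inchi.replace("\\", "\\\\").replace('"', '\\"')
    query = f'subjects.subject:"{escaped}"'
    return DATACITE_DOIS + "?" + urlencode({"query": query})


# InChI for water, used purely as an example.
url = inchi_search_url("InChI=1S/H2O/h1H2")
print(url)
```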
Another promising initiative is NMReDATA, an effort to create a standardized file format for chemical structures that includes both NMR data and metadata. The critical part of this effort is to try to get the format adopted by instrument vendors and software suppliers so that it is truly standard.
Following the talks, Keith suggested some ways in which the groups could engage with the Metadata Interest Group further:
- The RDA/TDWG Attribution Metadata for Collections group could add their schema into the Metadata Standards Catalog (MSC).
- The Data Discovery Paradigms group could add the schemas they discover into the MSC and align common elements to the Metadata Element Set (MES).
- The Data Foundations & Terminology and Vocabulary Services Interest Groups could contribute terminology, especially valid values of instances, into the formulation of the MES.
- GO FAIR could ensure that the MES provides enough metadata for FAIR compliance, and surface expertise from the wider Metadata Interest Group in GO FAIR recommendations.
- The Chemistry Research Data Interest Group could look at relating the Rzepa group’s DataCite profile to the MES.
Data In Context Interest Group
Keith and Rebecca explained that the Data In Context Interest Group (DICIG) no longer had a distinct role to play within the RDA. It was originally set up to develop contextual profiles for describing datasets as they evolve through the data lifecycle, using standardized open vocabularies and formal semantics. It performed its preparatory work for this, collecting metadata use cases from RDA members, but things had moved on by the time this was complete. Work relating to the original aim was and is going on in the context of other groups (such as the RDA/CODATA Materials Data, Infrastructure and Interoperability Interest Group).
Rebecca and Keith therefore indicated they would like a new co-chair to come on board and lead the group in a new direction, otherwise it would be best to put the group officially into hibernation.
All in all, this was a highly constructive session, including not only reports of substantive work being accomplished, but also fixing actions for the near and medium term. Personally I find it highly rewarding to be engaged in such active groups, and I would recommend people make the most of opportunities within RDA for contributing to or even chairing groups, and adopting the outputs wherever possible.