This is the update from the Metadata Standards Catalog Working Group delivered at the RDA 9th Plenary in Barcelona.
The accompanying slides are available.
The problem we're addressing is that data is not self-documenting, so we need to have metadata and other forms of documentation to tell us what it means. If that documentation is in a standard, structured form, we can easily get computers to sort through it all and provide discovery services, processing and analysis services, and many other good things. The trouble is that many researchers aren't doing this. They might be using a structure they've invented themselves or that their lab has come up with, or they might just stick to narrative text, or they might not bother at all. And that's hurting science.
So the idea of the Metadata Standards Catalog is to make it easy for researchers to find out about the standards relevant to their work, and point them towards tools and examples that take the pain out of learning to use them. We also have ambitions for trying to make standards more useful by recording and even generating mappings between standards and profiles.
Building on previous work
This isn't a new piece of work but the latest phase in the evolution of a resource originally created and still maintained by the UK Digital Curation Centre. It was first put together by consultant Liz Bedford and subsequently maintained by a small team of, well, one busy person.
To make it easier, the Metadata Standards Directory Working Group transferred the data to GitHub and set up a nice new static website for displaying the information. Now anyone can make changes to the data, and once one of the admins has approved them, they show up on the Web. Speaking from personal experience of both systems, I have to say the second is much easier.
So why do we need a new Catalog?
These resources are good, but there are things we can improve.
Both the existing directories are for all practical purposes static. You can't search them in a structured way; that's something we could improve.
The directories are also for human eyes only. The only way to interact with the DCC directory from a script is by screen scraping. Things are slightly easier with the GitHub directory since the records are all available as a collection of structured YAML files, but that still means people have to write their own scripts for querying the information.
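To give a flavour of the do-it-yourself querying this entails, here is a minimal Python sketch of reading one such record. For self-containment it parses a flat sample inline rather than running PyYAML over a repository checkout, and the field names are illustrative rather than the Directory's actual ones.

```python
# A sample record in the style of the directory's YAML files.
# (Field names are illustrative, not the Directory's real schema.)
sample_record = """\
title: Darwin Core
standard: https://example.org/spec
"""

def parse_flat_yaml(text):
    """Naive parser for flat 'key: value' records (sketch only;
    a real script would use PyYAML's safe_load)."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    return record

record = parse_flat_yaml(sample_record)
print(record["title"])  # → Darwin Core
```

Every user who wants structured access has to write something like this themselves, which is exactly the duplication of effort the Catalog is meant to remove.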
Thirdly, as good as the directories are, the information held is quite limited. We've uncovered use cases that people have for the directories that we simply can't support with the information we currently hold.
Migrating the data
The first step we took in developing the new Catalog was to work out what information we needed to hold, and to devise a way of upgrading the Directory database to suit.
The Directory database uses a simple four-element data model. Most of the information is held in the records for top-level metadata standards. Other entities just have a description, link and subject classification. This is even true of metadata application profiles. A side effect is that you can't associate a tool, say, with a profile; you have to associate it with a top-level standard.
The main changes to the data model of the Catalog are these:
- Records for standards and tools include information about version history.
- Profiles can now be given just as much detail as parent schemes.
- Mappings are a separate entity so they can be found from either end but maintained in one place.
- The use case entity has been generalized to 'organization', allowing us to record maintainers and funders in the same way.
- There is a new endorsements entity.
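For illustration, the revised model might be sketched along these lines; the entity and field names here are indicative only, not the Catalog's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Version:
    """Version history entry, now recorded for schemes and tools alike."""
    number: str
    date: str

@dataclass
class Scheme:
    """A metadata standard or profile; profiles carry full detail
    and simply point at a parent scheme."""
    name: str
    versions: List[Version] = field(default_factory=list)
    parent: "Scheme" = None  # set for profiles; None for top-level standards

@dataclass
class Mapping:
    """A separate entity, so a mapping can be found from either
    scheme but maintained in one place."""
    source: Scheme
    target: Scheme

@dataclass
class Organization:
    """Generalizes the old 'use case' entity: one record can act as
    user, maintainer, or funder of a scheme."""
    name: str
    roles: List[str] = field(default_factory=list)

# Example (names and dates invented for the sketch):
standard = Scheme("Example Standard", versions=[Version("1.0", "2016-01-01")])
profile = Scheme("Example Profile", parent=standard)
```

The key structural difference from the Directory is visible here: a profile is a full `Scheme` in its own right, so tools, mappings, and organizations can attach to it directly rather than only to its parent.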
Here are some explanatory notes to accompany the video demonstration (17MB).
[00:00] Schemes are arranged in a hierarchy with profiles grouped under their top-level metadata standards. There is also an index for tools.
[00:18] You can browse by subject: The terms come from the UNESCO Vocabulary, again presented in their hierarchy. The list is filtered to remove unused terms. The linked pages show schemes which have been tagged with the given term or one of its broader or narrower terms.
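The filtering logic behind those linked pages can be sketched as follows, assuming a simple parent-pointer representation of the vocabulary hierarchy; the term and scheme names are made up for the example.

```python
# Parent pointers: each term maps to its broader term.
# (Illustrative terms, not the actual UNESCO Vocabulary.)
broader = {
    "Ecology": "Biology",
    "Biology": "Life sciences",
}

# Illustrative scheme-to-subject tagging.
schemes = {"EML": ["Ecology"], "Darwin Core": ["Biology"]}

def related_terms(term):
    """The term itself, plus all broader and narrower terms."""
    terms = {term}
    t = term
    while t in broader:           # walk up the broader chain
        t = broader[t]
        terms.add(t)
    frontier = {term}
    while frontier:               # walk down the narrower subtree
        frontier = {n for n, b in broader.items() if b in frontier}
        terms |= frontier
    return terms

def schemes_for(term):
    """Schemes tagged with the term or a broader/narrower term."""
    related = related_terms(term)
    return sorted(s for s, tags in schemes.items() if related & set(tags))

print(schemes_for("Biology"))  # → ['Darwin Core', 'EML']
```

This is why a search on a mid-level term still surfaces schemes tagged at either end of the hierarchy.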
[00:31] If you know what you want, you can go straight to it by searching. We have automatic hints to help you. If there is only one search result you are taken straight to it; otherwise you get the full list of results. Currently you can search by scheme name, identifier, or subject. When we get information about funders and data types into the Catalog you'll be able to search for those too.
[01:07] Display of scheme: As with the Directory, we provide ways to navigate to related records. You can click on subject areas to show other standards related to that area. There are links to the specification and website. New to the Catalog is the version history of the standard.
You can easily see how this standard relates to others, with links to parent schemes and profiles, and then a list of mappings between this standard and others, with links to more information. There are also links to records describing tools that support this standard. If this record had links to organizations, those links would take you to a search for other standards used, maintained or funded by them.
[02:19] We can also search using the API. Here's the same search with a JSON response. In fact there are a couple more search options than in the Web version. Once you have a list of IDs back, you can look up the underlying JSON record for any of the internal identifiers.
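As a sketch, a client script might handle such a response as follows; the response structure shown (an `ids` list of internal identifiers) and the identifier format are assumptions based on the demo rather than a documented API contract.

```python
import json

# A hypothetical JSON response from a Catalog search; the structure
# and ID format are illustrative, not the documented API output.
sample_response = '{"ids": ["msc:m1", "msc:m2"]}'

ids = json.loads(sample_response)["ids"]

# Each ID could then be looked up at a per-record endpoint,
# e.g. (hypothetically) GET /api/records/<id>.
for record_id in ids:
    print(record_id)
```

Compared with the YAML-scraping scripts the Directory requires, the point is that parsing ends here: the structure comes back ready-made.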
[02:38] If you want to make changes, you have to sign in. The Catalog currently delegates authentication to external providers. You can use any OpenID 2.0 provider, or Google, LinkedIn or Twitter. Support for Facebook is planned.
[02:54] Once you've signed in you see extra links for adding new records to the Catalog. You can also navigate to an existing scheme or tool page and follow the editing links there. In this example we'll look at the page for the DataCite Metadata Schema. You can see links for editing mappings and organizations, and at the bottom there's a link for editing the current record.
[03:29] You can see that we're showing 3.1 as the most recent version. Is that right? No. So let's add the most recent version. We begin by adding the number and date. Once we've saved that we get a link for adding version-specific details. So let's add the links to the specification and the schema. Both have DOIs so we can add those too. You can only add one at a time, after which you get additional boxes. At some point I'll try to make that easier with Ajax. You can see the record now shows 4.0 as the most recent version.
[05:41] To protect the Catalog, all changes are version controlled using Git. You can see in this visualization of the Git log where we added the new version, the specification, and the schema.
[05:58] Now we've finished, we can sign out again.
In terms of raw capability, I'll admit that this is an incremental improvement on what we had before, but the important thing is that it opens up a much greater range of possibilities. With this platform we'll be able to collect much richer information and provide more powerful functionality.
For an up-to-date list of features we're considering adding to the Catalog, see the GitHub issue tracker. (On the slide, these have been arranged with features related to the Web pages on the left, and features for the JSON API on the right. The features in yellow we aim to do, but probably not by July. The features in brown we may or may not do depending on how much work it takes.)
This isn't live on the Internet yet. There are some admin things to sort out first. But you can keep track of developments through our GitHub pages. Of particular interest are the aforementioned issue tracker where we keep the requirements, and the readmes for the application and the database.
We can't do this without your support so please keep in touch.