We are at a tipping point in the development of a common conceptual framework and a set of tools and components that will revolutionize the management of scientific data. It is widely acknowledged, as detailed below, that the volume and complexity of the data now being collected, and even more so the inevitable and enormous increase in both, have reached the point where action is required. At the same time, and largely in response to this perceived crisis, a number of principles for the management of scientific data have arisen and been widely endorsed. The danger now is that agreement will stop at the level of principles and that multiple non-interoperable, domain- and technology-specific silos will continue to arise, all based on the same abstract principles. We would then lose the opportunity to use the current crisis to create a common set of tools and components based on an agreed conceptual approach.
What follows is our summary of the currently agreed-upon principles, a more detailed analysis of the requirements implied by those principles, and the current state of work on those requirements as reflected in the work of RDA, which we believe has the broadest base and the most neutral view of the situation. This includes brief summaries of the requirements and the current state of work on repositories, registries, identifiers, metadata, types, licenses, and, in general, the whole ecosystem of interlinked digital objects needed for managing the life cycle of scientific data. We end with a more detailed view of the requirements for selected components, partly extending the FAIR principles (Findable-Accessible-Interoperable-Reusable, Appendix A).
Action is now required to put in place operational infrastructural components based on this and similar analyses. Some of these components already exist at an operational level with wide experience across communities, while others are still at a prototype or concept stage. No design from scratch is intended; we can build on the extensive knowledge accumulated in various regions. We should now establish a systematic approach in which these components can mature and ultimately enable communities to build new services and prove that combining components adds value. There will, of course, be a risk in doing this, and some of these components will surely fail or otherwise prove inadequate. In some cases, waiting another five or ten years would perhaps result in better designs and implementations based on technology advances between now and then, but by that time the non-interoperable silo problem will have gained ground and the silos will be difficult to displace. The real risk at the moment is in not building a common core infrastructure according to our best current information.
In addition to the current widely adopted recommendations by funders, the essence of the recommendations that follow can be summarized as:
1. Digital objects should be stored in trustworthy repositories that are assessed regularly using DSA/WDS guidelines, and those repositories should be registered in open registries such as re3data.
2. Trustworthy repositories need to assign PIDs to all digital objects and register them with trustworthy PID service providers, such as the International DOI Foundation and the European Persistent ID Consortium for eResearch, that guarantee their resolution to meaningful state information.
3. The digital objects referenced in points 1 and 2 above are not restricted to the data itself but also include schemas, queries, concepts and concept vocabularies, all of which need to be registered in open registries and assigned PIDs if they are cited or referenced.
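The relationship between the three recommendations can be illustrated with a minimal sketch. It is not an implementation of any existing PID service: the names PidRecord and PidRegistry, the record fields, and the example identifiers and URLs are all hypothetical and chosen only for illustration. It assumes a registry that accepts deposits only from repositories listed in an open registry and that resolves each PID to a small record of state information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PidRecord:
    """State information a PID resolves to (hypothetical minimal fields)."""
    pid: str          # persistent identifier string
    repository: str   # name of the repository holding the object
    location: str     # where the object can currently be retrieved
    object_type: str  # "data", "schema", "query", "concept", "vocabulary", ...


class PidRegistry:
    """In-memory stand-in for a PID service provider (illustration only)."""

    def __init__(self, registered_repositories):
        # Recommendation 1: repositories must appear in an open registry.
        self._repositories = set(registered_repositories)
        self._records = {}

    def register(self, record):
        # Reject deposits from repositories that are not registered.
        if record.repository not in self._repositories:
            raise ValueError(f"repository not registered: {record.repository}")
        # A PID must identify exactly one digital object.
        if record.pid in self._records:
            raise ValueError(f"PID already assigned: {record.pid}")
        self._records[record.pid] = record

    def resolve(self, pid):
        # Recommendation 2: resolution returns meaningful state information.
        return self._records[pid]


registry = PidRegistry(registered_repositories={"example-repo"})
# Recommendation 3: schemas and vocabularies get PIDs too, not only data.
registry.register(PidRecord("pid:example/dataset-1", "example-repo",
                            "https://repo.example.org/objects/1", "data"))
registry.register(PidRecord("pid:example/schema-1", "example-repo",
                            "https://repo.example.org/objects/2", "schema"))
print(registry.resolve("pid:example/schema-1").object_type)
```

In this sketch a schema is registered and resolved in exactly the same way as a dataset, which is the point of the third recommendation; a real PID service would add persistence, access control, and a far richer record.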
There are still many issues to be explored and questions to be answered, but we believe that science would be well served if future scientific data infrastructure projects accepted and followed these high-level recommendations.