1 WG Charter
In complex data domains, unique and persistent identifiers (PIDs) associated with specific information are the core of proper data management and access. They can be used to give every data object (including collection objects) an identity that enables referring to the data resources and metadata and, additionally, to prove integrity, authenticity and other attributes. But this requires a PID to be uniquely associated with specific types of information, and those types and their association with PIDs must be well managed. Therefore it is useful to specify a framework for information types, to start agreeing on some essential types, and to define a process by which other types can be integrated. The framework provides generic facilities only, which can and must be employed by specific communities to support their needs. The focus of the working group therefore is on cross-community concerns.
1.1 A note on terminology
Terminological discussions should not be part of this WG’s activities; they are the domain of the dedicated Terminology WG. In particular, the definition of what PID Information Types (PITs) are should be provided through interaction with the Terminology WG. For the scope of this WG, it is more important to provide the elements necessary to fulfill the use cases than to aim for a universal definition. In this regard, one view on PITs may be that they form the interoperable set of encodings of object properties. The scope of this WG also ends where, for example, domain-level distinctions between data and metadata are concerned, because one user’s data may be another’s metadata.
In general, the strategy to avoid terminology discussion lock-up is to head for pragmatic compromises that sacrifice precision in favour of steady progress towards the WG goals. In view of the limited time frame and measurable outcomes we may be forced to accept some potentially poor compromises.
1.2 Short-term goals (12-18 months)
· Scope the type standardization effort and its boundaries
· Define core PID information types
· Define the role of structural elements, such as collections, in combination with the information types
· Define a profiles mechanism and demonstrate its scope with some initial exemplar profiles
· Provide a first API prototype that demonstrates a possible implementation of these elements and accompanying rudimentary services
1.3 Long-term goals (beyond 18 months)
· As applications emerge, come to a coherent set of information types and organizational structures that is sufficient for most scenarios from the various disciplines
2 Value Proposition
The ongoing work and final outcomes of this WG will shape the first usage scenarios of persistent identifiers and lay the groundwork for future use. Promoting a standardization effort at an early stage has several benefits: for example, initial tools can be developed in a more streamlined manner and be of more use to early adopters, and interoperability between repositories can be established early on.
In particular, the following actors or organizations will benefit from the outcomes of this WG in their respective ways:
· Data centers: attributes help in proper data management, execution of policy rules, producing a better product, reduction in support effort, etc.
· Data infrastructure: commonalities across infrastructures/federations, ease of building new services, etc.
· Data consumer: attributes help in disambiguation, resource discoverability, trust building, reducing need for support (citation info etc.), enabling automatic processing
· Data providers: enhancing visibility & reusability, enabling automatic processing
· Tool builders: benefit from controlled, explicit vocabulary
Marketing the value of this WG’s outcomes is a crucial task in moving the rather academic topic of PIDs towards adoption in practice. The strategy for accomplishing this is to focus on exemplary use cases which can be understood and appraised by different communities and actors.
3 Engagement with existing work in the area
· EPIC, the European Persistent Identifier Consortium, can provide PID services and may decide to implement PITs from an infrastructural perspective
· DataCite should be involved as a major consumer and potential adopter of proposed information types
Broad involvement of experts, particularly on technical details of existing systems and initiatives in the wider PID area, is required for this WG to achieve overarching consensus. Membership from initiatives such as ARK, URN, LSID, pURL and others must be encouraged and sought out actively.
4 Work Plan
The WG will use an iterative approach with two basic cycles.
In general, activities start with the gathering of use cases and a review of existing approaches to determine what can be learned from them. The use cases that start the discussion will most likely be community-specific, so one of the major goals is to abstract from them and derive cross-community use cases. The use cases feed into a first discussion on core PITs. Community-specific elements should be generalized here, as well, to come to a cross-community core set of elements.
After this first cycle is complete in M6, there will be a short second cycle to pick up any use cases and PITs which came up in the first cycle but were left out because they were out of scope for the first set of use cases. The second cycle is intentionally short to acknowledge that we do not intend to provide a full set of PITs, but only an initial core stack that can be extended by future WGs. Some of the second cycle activities can be done in parallel to the first cycle. Both cycles are complete by the major milestone in M9.
The second half of the WG time frame is used to discuss and define two frameworks and the profiles mechanism. The first is a framework for PITs, which includes a proposal for the processes by which PITs can be proposed and integrated, as well as instruments to define PITs. The second is a framework for collections, references and any other higher-level information structures that build upon or enhance the previously defined PITs. Finally, a discussion on the role of a profiles mechanism and its potential specification will commence. The current understanding (which is subject to discussion) is as follows: A profile defines a set of PITs which are mandatory for creating PIDs that conform to that profile. The purpose of profiles is to ease implementation of higher-level interoperable tools that expect certain information to be there and to provide a minimal form of quality control. The exact definition and scope of a profile and the profile mechanism will be defined during WG activities or even by the Terminology WG.
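To make the current understanding of profiles more concrete, the following is a minimal sketch in Python, not a proposed specification. It assumes that the information associated with a PID can be viewed as a mapping from type names to values and that a profile is simply a named set of mandatory type names. The class `Profile`, the functions `missing_types` and `conforms`, and the type names used (`checksum`, `creation_date`, `format`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A named set of PIT names that must be present (hypothetical model)."""
    name: str
    mandatory_types: frozenset


def missing_types(record: dict, profile: Profile) -> set:
    """Return the mandatory type names that the record does not provide."""
    return set(profile.mandatory_types) - set(record)


def conforms(record: dict, profile: Profile) -> bool:
    """A record conforms if it provides every mandatory type of the profile."""
    return not missing_types(record, profile)


# Hypothetical profile and records, for illustration only.
archive_profile = Profile(
    name="basic-archive",
    mandatory_types=frozenset({"checksum", "creation_date"}),
)

complete_record = {
    "checksum": "d41d8cd9",
    "creation_date": "2013-03-18",
    "format": "netcdf",  # extra, non-mandatory information is allowed
}
incomplete_record = {"checksum": "d41d8cd9"}

print(conforms(complete_record, archive_profile))
print(missing_types(incomplete_record, archive_profile))
```

In this reading, a tool that relies on a profile only needs to check conformance once and can then assume the mandatory information is present, which is the minimal form of quality control mentioned above.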
API specification and prototypical implementation takes place during the second half of the WG activity and forms in itself an iterative process to pick up any PITs, collection framework etc. which emerge as the WG activities progress. The API is meant to be a middleware / service layer, not a complete PID provider infrastructure. The prototype will however have to be built for a specific PID infrastructure and coming to the particular decision is a WG activity.
Overall, WG activities should focus on an initial, non-exhaustive core set of elements. The potential number and thematic breadth of elements is very large, and the key deliverables should reflect this by providing an overview of topics that were deemed relevant but did not make it into the core discussion. We expect future RDA WGs or external projects to use the outputs of this WG as a starting point for their own work.
While the number of deliverables may seem quite large given the limited time frame, some of them may actually manifest as very short documents. In view of the complexity and openness of the topic, the WG activities are subject to fixed deadlines for all deliverables to come to a stable prototypic core. We are all aware that the defined PITs, profiles etc. are not settled once and for all, and the deliverables reflect this by including open discussion items whenever possible.
Each of the deliverables listed below includes an identifier (e.g., “D1”); a short, descriptive name (e.g., “Use cases”); the time frame for working on the particular process and finishing the deliverable, specified as a range of months; and a more complete description. For deliverables that have two specified time frames, the end of the first time frame corresponds to a deadline for the preliminary version of the deliverable after the first major iteration cycle.
All deliverables must adhere to the policy that to enable wide adoption, current practice must not be disturbed beyond reasonable limits. The overall idea is to examine existing infrastructures and practices, come to a common denominator and then extend, preferably by exposing or enhancing already existing functionality. We cannot hope to include every existing effort, but participation should be maximized.
D1. Review of existing approaches (M0-M3): A brief summary on lessons learned from past projects and whether (and how) they are taken up by WG activities. The focus includes both literature in general as well as lessons from existing systems or registries that deal with types in a broader sense.
D2. Use cases (M0-M3, M3-M6): A collection of detailed use cases in storytelling / user scenario prose format. There will be a short statement for each use case characterizing whether it is used as a driver in the WG or as a supplementary use case as a reference for future work.
D3. Core PID Information Types (M3-M6, M6-M9): A report on the agreed upon initial set of core PITs, describing among other things their scope and relationship to the individual driving use cases. This report is designed as a major input to the Type Registry WG. It should also include a supplementary list of further candidates which were discussed but not accepted as being part of the core list. The supplementary elements are only described in brief.
D4. Collections and references (M3-M9, M9-M12): A report on how collections of identified elements and typed references (typed links) to other elements may be implemented. Different communities are likely to need different ways of addressing collections (as individual first-class objects, as dynamic groups of other first-class objects etc.) and the report should provide one or more generic solutions with respect to different usage scenarios. The same applies to typed element references: Where no single solution is possible, the report should provide alternatives.
D5. PIT Framework (M9-M18): A report on an overarching framework for PITs in general, ideally with detailed instructions and process descriptions. The framework includes a clarification of the practical role of PITs and their relationship to other defined concepts (such as metadata, data types, profiles etc.) as well as a first suggestion for practical policies involving PITs, such as processes by which new PITs can be proposed and integrated.
D6. Profiles mechanism (M12-M18): A report on the scope and elements of a profile, which also includes an agreed-upon definition of a profile. If actual profiles are discussed, the report will define up to three candidate profiles in detail and briefly describe others that were under consideration. The report should also define the scope of the profile mechanism, and may go as far as defining rough use cases for it.
D7. API specification (M6-M12, M12-M18): Specification of an API which allows for requesting information associated with PIDs. This API spans the concerns of D3-D6.
D8. API implementation (M12-M18): Prototypic implementation of the specified API.
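As a rough illustration of the kind of request D7 describes (requesting information associated with PIDs), the following is a minimal in-memory sketch in Python. It is not the API the WG will specify: the class `InMemoryPidService`, its methods `register` and `get_info`, the example PIDs and the type names (including `part_of_collection` as a stand-in for a typed reference in the sense of D4) are all hypothetical.

```python
from typing import Dict, Iterable, Optional


class InMemoryPidService:
    """Toy service layer mapping PIDs to typed information (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, str]] = {}

    def register(self, pid: str, info: Dict[str, str]) -> None:
        """Store a copy of the typed information for a PID."""
        self._records[pid] = dict(info)

    def get_info(
        self, pid: str, types: Optional[Iterable[str]] = None
    ) -> Dict[str, str]:
        """Return all information for a PID, or only the requested types.

        Requested types that the record does not carry are silently omitted.
        Raises KeyError if the PID is unknown.
        """
        if pid not in self._records:
            raise KeyError(f"unknown PID: {pid}")
        record = self._records[pid]
        if types is None:
            return dict(record)
        return {t: record[t] for t in types if t in record}


# Hypothetical identifiers and type names, for illustration only.
service = InMemoryPidService()
service.register(
    "hdl:12345/obj-1",
    {
        "checksum": "d41d8cd9",
        # A typed reference: the value is itself a PID.
        "part_of_collection": "hdl:12345/coll-1",
    },
)

print(service.get_info("hdl:12345/obj-1"))
print(service.get_info("hdl:12345/obj-1", types=["checksum"]))
```

The sketch keeps the service layer separate from any particular PID infrastructure: a real prototype would replace the in-memory dictionary with calls to the infrastructure chosen by the WG.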
4.2 Milestones and intermediate documents
Milestone A. Finalization of the first iteration cycle in M6. Initial set of use cases and PITs defined.
Milestone B. Finalization of the second iteration cycle in M9. Set of use cases and PITs refined and first report on structural elements.
Milestone C. First specification of the API and kick-off of its implementation in M12.
Milestone D. Collections and other structural elements defined in M12. Integration of results with API activities.
Milestone E. End of the WG activity in M18. Prototypic implementation and final specification delivered, profiles mechanism established and put to use with exemplary profiles. PIT Framework finished.
4.3 Working Group operation
Mode and frequency of operation: The primary forms of communication are the RDA Forum and official RDA assemblies. Additional video- and teleconferences may be used; however, participants are expected to post a short public report on the outcomes of each such virtual meeting to keep the rest of the WG informed. The canonical state of WG discussions and work is the RDA Forum.
Achieving consensus and addressing conflicts: Consensus will be reached via open discussion and voting as appropriate. If required during the course of the WG activities, a more formalized voting procedure may be put into action. It is the responsibility of the WG leaders to drive consensus through structured moderation and to take careful notice of dissent and conflicts in order to moderate a resolution. If a conflict involves a WG leader or cannot be resolved within the WG, the RDA Council will be consulted and an independent person not in the WG will be brought in to mediate the conflict.
Staying on track and within scope: The project plan is specifically designed to concentrate on a limited set of core items which should be the focus of discussions. The appointed moderators on the communication channels should help to channel discussions and gently enforce focused discussion while not precluding sidetracks, e.g. by splitting forum threads as appropriate. The project plan also specifies deadlines for milestones and deliverables which must be adhered to, even if the ongoing discussion is not finished. In cases where there is still a lot of open discussion as a deadline approaches, the state of the discussion should be reported in the corresponding deliverable.
5 Adoption Plan
DKRZ will implement a prototype of the API. DKRZ will most likely adopt outcomes of the WG early on, e.g. in its long-term archive, at the latest during the second half of WG activities.
Other interested archival facilities, data centers or similar organizations are encouraged to start adopting the emerging API, core types and profiles during the course of the WG and integrate it in their own software frameworks. This will help to shape the API in a pragmatic way and possibly enable collaboration within the group.
Overall, the API must be designed in a way which minimizes adoption barriers. Existing systems should be extended rather than replaced or heavily reworked; otherwise wide adoption will fail. Rather than setting up a central service provider, the goal is to enable any interested organization to adopt the technical specification quickly and integrate it in their own operational software ecosystem. Interoperability is achieved by agreeing on the standard protocols which are part of the API.
Finally, in parallel to adoption of the API, the WG must encourage the development of some first prototypic tools which make use of the proposed types and structures. Although we cannot make this a key deliverable due to the timeframe and workload involved, at least some ideas will be sought out towards the end of WG activities and put into action to demonstrate actual use of the API.
The framework which establishes the procedures to follow to implement and disseminate new PID Information Types can only work as a community effort, supported by a critical mass. At the end of WG activities, the participating communities will be encouraged to continue adoption of the framework. Full adoption can only be successful as part of community culture, which is however unlikely to consolidate during the WG timeframe.
Long-term adoption of all WG outcomes will be promoted by creating dissemination flyers at one or two major milestones to be distributed at relevant conferences or other high-visibility events. Final and possibly also intermediate results will be published as articles.