An initial needs assessment for the PID Kernel

17 Dec 2017

Dear PID Kernel Info group,
I’ve been thinking a bit about the kernel since our Montreal meeting. I took a step back and approached it as a sociological problem, from a science and technology studies (STS) perspective. This is a bit long, but I hope it’s helpful.
To give you a sense of what I’ve been thinking about, you can see a poster I presented in the informatics section of the American Geophysical Union meeting: Persistent Identifiers as Boundary Objects. I suspect some of you will find it arm-waving social science, but I’ve found it useful. I was considering issues broader than just the Kernel but I think it does inform the kernel.
In viewing PIDs as a type of ‘boundary object’ per Star and Griesemer (1989), I build on the classic STS work of Latour, Callon, Law, and others, which shows that the important questions concern the flow of objects and concepts through a network of participants and social worlds. It’s an anti-reductionist, “ecological” viewpoint. I think we need to consider a few basic principles:
We need to consider multiple, non-privileged perspectives -- not just the researcher, but also the technician, the banker, the air conditioning manufacturer (per Larry’s Intel use case), the pharmaceutical company, the disaster relief organization, etc. The idea is not simply to identify a least common denominator (at least not until later), but to consider a fundamental, simultaneous, n-way translation requiring a holistic view to understand. In essence, we are looking for a standardized method.
BOs, hence PIDs, are created through durable (‘artful’) collaboration and situated learning -- so we need to make sure we actually apply them in these multiple contexts.
PIDs need to be locally useful and globally relevant. They reduce uncertainty but are weakly structured in common use and become strongly structured in individual site use. This is where profiles come in.
From Parsons et al. 2011: “Our experience suggests that, when developing data systems, it is best to start simple, using proven and known approaches, and then take an incremental, adaptive approach to expanding their interconnection.”
I also think the more specific technical principles from Beth, Tobias, and Ulrich are solid and we need to continue to test them.
Given all that, I returned to the FAIR principles for another assessment. I end up with more questions than answers, but it may be helpful. My more general conclusion is that the real difficulty is going to be in the typing.
F1. (meta)data are assigned a globally unique and persistent identifier
This raises a granularity issue and the identifier/locator issue — does every level of PID need the kernel?
F2. data are described with rich metadata (defined by R1 below)
Not our problem, we just link.
F3. metadata clearly and explicitly include the identifier of the data it describes
Redundant with F1 and F2
F4. (meta)data are registered or indexed in a searchable resource
Is this our problem? Or is the idea to provide just enough typing and linking for indexes and other search tools to identify the first level of stuff? It raises the granularity question too.
General comment: these are not really looking at first level discoverability. How many look-ups per unit time do we need to consider? Do we need a primary and maybe secondary filtering mechanism? Or a dynamic filter? At core it’s a typing question. Is this just a latency and throughput issue? Are we assuming anonymous interaction at this level?
Include: Yes (open public domain)/No/with restrictions or license [link], but does it need to be in the kernel?
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
i.e. it must be a locator, and there needs to be a link in the kernel to metadata. Does this apply to PIDs for physical things, like an ORCID or instrument ID, where we just mean info about that physical thing, which is harder to keep current? Do we need separate types for physical and digital objects as a primitive?
A1.1 the protocol is open, free, and universally implementable
Yup. Is this another principle?
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
Is this a PID kernel concern? Is it a subset of “with restrictions”? (see “Safe” below)
A2. metadata are accessible, even when the data are no longer available
A second level PID concern? A subset of “with restrictions” or a different type like retired or deprecated?
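To make the access and typing questions above a little more concrete, here is a minimal sketch of what such a record could look like. Everything in it is an assumption for illustration only: the names `KernelRecord`, `Access`, `Kind`, `metadata_url`, and `license_url` are hypothetical and are not taken from any agreed kernel profile.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    """Hypothetical three-way access flag mirroring 'Yes/No/with restrictions'."""
    OPEN = "open"              # yes, open public domain
    CLOSED = "closed"          # no
    RESTRICTED = "restricted"  # with restrictions or license [link]


class Kind(Enum):
    """Hypothetical primitive separating digital from physical referents."""
    DIGITAL = "digital"
    PHYSICAL = "physical"


@dataclass(frozen=True)
class KernelRecord:
    pid: str                           # the identifier itself
    kind: Kind                         # digital object or physical thing
    metadata_url: str                  # the one link the kernel must carry
    access: Access
    license_url: Optional[str] = None  # only meaningful for restricted access

    def __post_init__(self) -> None:
        # A restricted record without a pointer to its terms is not actionable.
        if self.access is Access.RESTRICTED and not self.license_url:
            raise ValueError("restricted access needs a license or terms link")


def needs_authorization(record: KernelRecord) -> bool:
    """True when a client should expect an authentication step (cf. A1.2)."""
    return record.access is not Access.OPEN
```

Whether any of these fields belongs in the kernel, rather than in the linked metadata, is exactly the open question; the sketch only shows that the record stays small either way.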
None of these seem to be in scope for the kernel. We just need a link to metadata.
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
Is there value for including metadata license information in the kernel?
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
Safe/Secure (because FAIR is not enough)
Should the kernel contain information about the authenticity or integrity of the object? What about the integrity of the organization managing the PID? Can I trust that the PID location is current and correct? How long is my resolver service valid?
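As one illustration of the object-integrity part of that question, the sketch below assumes the kernel carried a checksum and the name of its algorithm (both hypothetical fields, not part of any agreed profile). A client could then verify the bytes it retrieves. This covers integrity only; it says nothing about authenticity or about trust in the organization managing the PID.

```python
import hashlib
import hmac


def matches_checksum(content: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Check retrieved bytes against a checksum assumed to be stored in the kernel.

    `expected_hex` and `algorithm` stand in for hypothetical kernel fields.
    """
    # hashlib.new accepts any algorithm name supported by the local build.
    digest = hashlib.new(algorithm, content).hexdigest()
    # Compare case-insensitively, since hex digests may be stored in upper case.
    return hmac.compare_digest(digest, expected_hex.lower())
```

A resolver or client could run a check like this after following the PID to its current location, which at least detects a silently changed or corrupted object.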
Mark A. Parsons
Senior Research Scientist
Tetherless World Constellation
Rensselaer Polytechnic Institute
+1 303 941 9986
Skype: mark.a.parsons
mail: 1550 Linden Ave., Boulder CO 80304, USA

  • Author: Mark Parsons

    Date: 17 Dec, 2017

    Sorry that email came across very messy.
    See it here:
    -m.

  • Author: Tobias Weigel

    Date: 18 Dec, 2017

    Hello Mark, all,
    fabulous work. You are raising a lot of very important, fundamental
    questions to which I'd like to reply and start a discussion in more
    detail. What strikes me most is that you are approaching this from a
    process perspective rather than through the static structure of Kernel
    Information; reading our guiding principles, I see some process aspects
    reflected, but I don't feel that this is complete when I look at what
    you opened up. Process in this case means by far not only how to come to
    KI profiles, but what their use results in, how they shape
    organizational processes in return etc.
    I would like to discuss a bit at today's call as well, though I suspect
    we have to do other things first (P11 planning in particular). But at
    the very least we should talk about what to make of this, how to proceed.
    Best, Tobias

  • Author: Mark Parsons

    Date: 21 Dec, 2017

    Yes, data are often boundary objects, but I think PIDs are BOs too. They are not file names. They are more complex than that, and they are often extended in local use beyond what they mean globally (as BOs are). For example, the DCO-ID is used for many objects in many ways within the DCO data portal, but Vivo (for one) uses them in a much narrower sense.
    I think this is part of the profile type question being discussed in the other thread.
