Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts

12 Apr 2016

Hi,
I'm one of the researchers that Jeremy contacted yesterday regarding the
definition of the *gatheredInto(x,y)* predicate. I've been reading up on
this discussion and had a question about one of the collection definitions
being maintained by the RDA. Regarding the concept of a PID (persistent
identifier), has there been any true consensus on what "persistent" and
"identifier" mean? For instance, would the name Keith be a PID (why or why
not)?
Also, regarding the "precise" collection definition from the collection
working group: I see that the term "collection" has been pegged to the PID,
which effectively makes the PID an identifier and a collection...which
seems a bit strange (but then again my street address is both a fire number
and a mailing address, so that may be okay). More importantly though, the
nature of the digital object the PID points to is never defined beyond what
it consists of (a set or a list). But the digital object is not itself a
set or a list. Is it some kind of anonymous thing, like a blank node in
RDF? So I'm not sure what to make of it. The collection definition doesn't
seem very precise at all.
The main problem seems to be a conflation of the notions of referrer and
referent. Identifiers (persistent or otherwise) only refer to things and
are not synonymous with the things they refer to. A better definition might
be: A collection is a digital object which consists of a set or a list and
is named by a PID (which when reified delivers the collection object).
Another analogy I might make is that sets (and possibly lists) are scalar
objects particular to a point in time. Collections (and probably also lists
in all actuality) are vector objects (using the physicist's notion of
vectors) that vary over time. So at any point in time, a collection may be
represented by one set or another (not dissimilar to a document being
represented by one string or another over time as it gets edited).
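One way to picture the distinction between the identifier and the thing it names, along with the idea of a collection varying over time, is the minimal Python sketch below. Everything in it is an assumption made for illustration: the PID, Collection, and Registry classes, the integer timestamps, and the example identifier strings are hypothetical and are not taken from any RDA specification or API.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class PID:
    """A label that names something; it is not the thing itself (hypothetical)."""
    value: str


@dataclass
class Collection:
    """A digital object whose membership may differ from one time to the next."""
    # Each entry records (timestamp, set of member PIDs as of that timestamp).
    history: List[Tuple[int, FrozenSet[PID]]] = field(default_factory=list)

    def update(self, timestamp: int, members: Iterable[PID]) -> None:
        """Record the membership of the collection as of the given time."""
        self.history.append((timestamp, frozenset(members)))

    def members_at(self, timestamp: int) -> FrozenSet[PID]:
        """Return the one set that represents the collection at that time."""
        current: FrozenSet[PID] = frozenset()
        for recorded_at, members in sorted(self.history, key=lambda e: e[0]):
            if recorded_at <= timestamp:
                current = members
        return current


class Registry:
    """Maps identifiers to the objects they name (the referents)."""

    def __init__(self) -> None:
        self._objects: Dict[PID, Collection] = {}

    def register(self, pid: PID, obj: Collection) -> None:
        self._objects[pid] = obj

    def resolve(self, pid: PID) -> Collection:
        """Follow the identifier to the object it names."""
        return self._objects[pid]


# Hypothetical example data.
registry = Registry()
pid = PID("example:collection-1")
collection = Collection()
collection.update(1, [PID("obj-a")])
collection.update(5, [PID("obj-a"), PID("obj-b")])
registry.register(pid, collection)
```

In this sketch the PID and the collection are different kinds of objects: the registry maps one to the other, and asking for the members at a given time yields one particular set.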
Some thoughts on the *gatheredInto* predicate. If memory serves correctly,
the relationship "gatheredInto" is intended to capture the creative effort
(or curatorial effort) that goes into creating a collection. You may have
noted that we dialed back from that version of the definition in our JOHD
article (and in the HTRC's workset data model). This is because that
definition of "gatheredInto" is a bit semantically overloaded. It captures
both that a collection is the result of some process ("gathering") and that
said process is the result of some intelligent agent. So we went for a
narrower definition---"gatheredInto" only states that a collection is the
result of some process ("gathering") but doesn't remark on where that
process comes from. This permits a relaxed definition that admits more
colloquial uses of "collecting" such as "the silt collects at the river's
mouth", "debris collects on the beach", "crumbs collect at the bottom of
the cereal box", etc. While trivial looking, such distinctions can provide
important cues indicating when metadata needs to be articulated (in this
case by articulating who is doing the gathering and why they are doing it).
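As a rough illustration of the narrower reading, the Python sketch below treats "gatheredInto" as a bare assertion that something ended up in a collection, with the agent and the purpose as optional metadata layered on top. The GatheredInto class, its field names, and the needs_provenance helper are hypothetical; they are not the HTRC workset data model or any published definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GatheredInto:
    """Narrow reading: the item is in the collection as the result of some process."""
    item: str
    collection: str
    # Optional metadata; the narrow predicate itself does not require either.
    agent: Optional[str] = None
    purpose: Optional[str] = None


def needs_provenance(assertion: GatheredInto) -> bool:
    """True when who did the gathering, or why, has not been articulated."""
    return assertion.agent is None or assertion.purpose is None


# Hypothetical examples: a colloquial "collecting" and a curated one.
silt = GatheredInto(item="silt", collection="river mouth")
curated = GatheredInto(
    item="volume-1",
    collection="workset-42",
    agent="a curator",
    purpose="a study of a particular genre",
)
```

Both assertions are valid under the narrow definition; only the second says anything about an intelligent agent, and a check like needs_provenance flags the cases where that metadata is still missing.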
This is important, as it seems to me that regardless of how one defines
collection, one needs to understand the role it's supposed to play within
its native information ecosystem. Most importantly, how do you distinguish
among different collections? Is it easier to rely on a layer of metadata
that describes the collection object (past experiences suggest that it is)
or to dig into it and examine the descriptions of the objects in the
collection? Is it more like a box at the supermarket or more like a box
under the Christmas tree?
A closing thought. Since Europeana is already working on a
collection model, is there a need for RDA to reinvent the wheel here?
Regards,
Jacob
_____________________________________________________
Jacob Jett
Research Assistant
Center for Informatics Research in Science and Scholarship
The Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
(217) 244-2164
***@***.***

    Author: Bridget Almas

    Date: 12 Apr, 2016

    Hi Jacob,
    With the RDA Collections WG we aren't trying to reinvent the wheel, but
    rather to see if we can come up with a generalizable API that works
    across disciplines and models and reduces the overhead needed to develop
    tools and services to create and use collections of data. In an ideal
    world, it would work equally for a Collections endpoint provided by
    Europeana for working with collections of humanities data, and one
    provided by DKRZ for working with collections of climate science data.
    We are really trying to take into account as much of the existing work
    on collections as possible with this -- we already do have the HTRC case
    on our radar. We will also look at what Europeana is doing.
    You can read more about what we're trying to accomplish in our case
    statement -- we welcome additional participants!
    https://rd-alliance.org/group/pid-collections-wg/case-statement/pid-coll...
    Best
    Bridget

    Author: Gary Berg-Cross

    Date: 12 Apr, 2016

    Jacob,
    I agree with much of what you say about referrers, referents, etc. Indeed,
    if Europeana is working on a Collections Model, we can take that as input
    and maybe adopt it. In DFT we did look at some of this work and, as I
    noted, leveraged some of the language.
    The RDA community discussion is a way of generating some views that might
    be used to gauge their definition. PIDs play a big role in some parts of
    RDA, so you see some attempt to relate this to collection, although as
    earlier noted some of us, like you, don't think of any identifier as the
    same type of thing as the object identified.
    In regard to your question:
    "Regarding the concept of PID (persistent identifier) has there been any
    true consensus on what "persistent" and "identifier" mean? For instance,
    would the name Keith be a PID (why?/why not?)."
    I drafted definitions for Identifier and Identity, which are in our Term
    Tool. The definition of Identifier is not general; it is that of a digital
    identifier, which is what I think most RDA folk are thinking about (as
    opposed to a name label like "Keith", which is another type of metadata we
    might use to find info about Keith).
    An identifier (ala digital identifier) is a bitstring that is used to
    provide Object Identity.
    Explanation: For many, a digital identifier is associated with a registry
    for the identifier and a repository for the data that is identified.
    Identity is that property of an object, such as a Digital Object or
    Resource, which distinguishes each object from all others.
    Identity is established by some process that connects a set of attributes
    to some object.
    There are legitimate issues of generality vs specificity in many of our
    definitions. You raised an issue, for example, of aggregation and whether
    soil accumulating at the delta is covered. In my opinion, no, because we
    have a focus on digital-data concepts and not the broader non-digital
    aspects of reality.
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770

    Author: Jacob Jett

    Date: 12 Apr, 2016

    Hi Gary,
    I think perhaps I am having some trouble with how you use the term
    identity. To clarify, it wouldn't be the case that some bitstring (an
    identifier) is identical to a file object it names, would it? Or is it the
    case that you are making the assumption that all identifiers must be
    reified?
    Regards,
    Jacob

    Author: Gary Berg-Cross

    Date: 12 Apr, 2016

    Jacob,
    > I think perhaps I am having some trouble with how you use the term
    > identity. To clarify, it wouldn't be the case that some bitstring (an
    > identifier) is identical to a file object it names, would it? Or is it the
    > case that you are making the assumption that all identifiers must be
    > reified?
    I have argued (with other RDA folks) that a bitstring is NOT identical with
    the object it identifies.
    So I hope that the definition does not in any way imply identity; I tried
    to use the idea of representation for such things.
    I'm not sure I understand your reification question. Bit-strings are
    reified digital things, just not the same as the object they reference.
    They serve a role in a reference process.
    Gary Berg-Cross, Ph.D.

    Author: Jacob Jett

    Date: 12 Apr, 2016

    Hi Gary,
    My point is that identifiers really aren't any different from names,
    labels, or whatever you call them. (I *think* we're on the same page.)
    The primary distinction is that the "identifiers" in this case are expected
    to operate in a digital environment.
    The thing is, the difference between something that lets you refer to
    something else and something that lets you navigate to that something else
    is one that human beings effortlessly ignore. If I know your name I can
    (with some effort) find you in addition to using it to refer to you. The
    functionality of labels is context-dependent (and so are their uniqueness
    and persistence).
    This kind of functionality is actually baked into computers because
    computers are designed and programmed by humans. So I do believe a better
    definition for an identifier is "a label that names a (digital) thing."
    As with real-world names, I can use an identifier both to refer to a
    thing (i.e., assert facts about it, such as through a metadata record or a
    graph of assertions) and to find the thing. That the
    label is unique and persistent is a matter of the context it's expected to
    operate within and not particular to any specific identifier in and of
    itself. Uniqueness and persistence are contingent properties of an identifier
    and/or contingent metaproperties of the thing the identifier names.
    Ultimately saying something like 'collection == PID' (i.e., a collection is
    a PID) is weird because the object and the identifier are not the same
    kinds of things and don't possess the same properties and so are
    fundamentally, formally not identical to one another. The definition
    probably needs to be altered to clarify this for the humans building the
    APIs.
    Regards,
    Jacob

    Author: Gary Berg-Cross

    Date: 12 Apr, 2016

    BTW, Jacob asked about definitions for persistence.
    There is one as part of the Persistent Identifier "explanation" in the Term
    Tool:
    An identifier should have an unlimited lifetime, even if the existence of
    the identified entity ceases. This aspect of an identifier is called
    “persistency”.
    Ref:
    Paskin, N. (1999). Toward unique identifiers. Proceedings of the IEEE
    87 (7) 1208-1227.
    Khedmatgozar, H. R., and Alipour-Hafezi, M. (2015). A basic comparative
    framework for evaluation of digital identifier systems. Journal of Digital
    Information Management 13 (3) 191.

    Gary Berg-Cross, Ph.D.
