Re: [rda-datafabric-ig][rda-dft][rda-collection-wg] Re: [rda-datafabric-ig] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabri...

12 Apr 2016

Hi Gary,
On Tue, Apr 12, 2016 at 1:57 PM, Gary <***@***.***> wrote:
>
> They are, I would say, of the same KIND. But there are differences in
> practice.
>
>
Yes, this is my understanding. Or more specifically an identifier is a kind
of name (which itself is a kind of label).
> When people document data with metadata there is often a separate spot for
> an identifier (s) (for digital purposes as you say) and a labeled name
> etc. They may all operate in a digital environment but in different ways
> including this role of navigating to a prepared spot like a landing page
> with more state info etc. I think of the label as asserting a naming fact
> as you say but that the role of finding something involves a concept beyond
> naming and I like the practice of distinguishing them as in many metadata
> efforts.
>
I would hesitate to call an identifier a name in the role of finding
something. If you look at things from the RDF worldview, it is also a handy
thing to use for making assertions about the thing it names. So the analogy
begins to break down as soon as we try to separate the finding role from
the naming role and, almost ironically, we find that we need it to be unique
and persistent for this last role, naming, or our assertions start to
become unreliable. As I said, these roles are baked right into our
architecture (all the way down to the fundamental manner in which things
move on the bus and to/from various memory spaces). Context of use matters
(and is probably more important than the entities and properties
themselves).

Apologies for the digression from the topic. This discussion of
labels/names really has very little to do with collections (beyond the fact
that one shouldn't conflate the collection with its name/identifier/label
thing).
Regards,
Jacob
_____________________________________________________
Jacob Jett
Research Assistant
Center for Informatics Research in Science and Scholarship
The Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
(217) 244-2164
***@***.***

    Author: Reagan Moore

    Date: 12 Apr, 2016

    Gary:
    Behind every identifier there is an operation that links the identifier to the digital object. In effect, the identifier is a proxy for the resolving operation.
    You can generalize any identifier as the operation that can be performed. There are a wide variety of operations in use in data management systems:
    * GUID, globally unique identifier which has no associated location information, but does have a repository of names
    * Handle, which has a unique identifier and an associated access location
    * Ticket, which has a unique identifier, an access location, and access controls
    * Collection name, which has an identifier, location, access control, arrangement, descriptive/provenance metadata
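One way to read the list above is as a progression in which each kind of identifier carries more information for its resolving operation. The Python sketch below is only an illustration of that reading: the class names, field names, the `resolve` function, and the sample values are all assumptions and do not correspond to any particular data management system. A collection name would add arrangement and descriptive/provenance metadata on top of this and is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Guid:
    """Unique name only; a repository of names is assumed, but no location."""
    value: str


@dataclass(frozen=True)
class Handle(Guid):
    """Unique identifier plus an associated access location."""
    location: str = ""


@dataclass(frozen=True)
class Ticket(Handle):
    """Unique identifier, access location, and access controls."""
    allowed_users: frozenset = frozenset()


def resolve(identifier: Guid, user: str) -> Optional[str]:
    """Hypothetical resolving operation: return a location, or None."""
    if isinstance(identifier, Ticket) and user not in identifier.allowed_users:
        return None  # access controls deny this user
    if isinstance(identifier, Handle):
        return identifier.location
    return None  # a bare GUID carries no location information


# Made-up sample identifiers.
guid = Guid("3f2a")
handle = Handle("hdl-1", location="https://repo.example.org/obj/1")
ticket = Ticket("tkt-1", location="https://repo.example.org/obj/2",
                allowed_users=frozenset({"alice"}))
```

In this sketch the identifier really is just a proxy: everything it can do is decided by what `resolve` is able to look up for it.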
    You can also invert this description. Every label is defined by a set of assertions that must be valid. The assertions are verified by applying operations to see if the label is correct. If we are defining unique identifiers, we are specifying the set of assertions that must hold for the assignment of the identifier to be valid. A collection assumes that each member of the collection has passed an equivalent set of assertions for naming.
    Thus a collection is an assertion that the members can be identified through a common naming convention. In practice, we use multiple naming conventions to build collections.
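The inverted description can be sketched in the same spirit: a label is accepted only if a set of assertions holds for it, and a collection is a group of members that all pass the same set. The naming convention used here (the prefix `obj-` followed by digits) and every function name are assumptions made up for the example.

```python
from typing import Callable, Iterable, List

Assertion = Callable[[str], bool]


def has_prefix(label: str) -> bool:
    # Hypothetical rule: labels start with "obj-".
    return label.startswith("obj-")


def has_numeric_suffix(label: str) -> bool:
    # Hypothetical rule: everything after the 4-character prefix is digits.
    return label[4:].isdigit()


NAMING_CONVENTION: List[Assertion] = [has_prefix, has_numeric_suffix]


def label_is_valid(label: str, assertions: Iterable[Assertion]) -> bool:
    """A label is accepted only if every assertion holds for it."""
    return all(check(label) for check in assertions)


def is_collection(labels: Iterable[str], assertions: List[Assertion]) -> bool:
    """Members form a collection if all pass the same set of assertions."""
    return all(label_is_valid(label, assertions) for label in labels)
```

Using several naming conventions, as happens in practice, would amount to checking members against more than one such list.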
    Reagan Moore
    From: <***@***.***-groups.org> on behalf of jjett <***@***.***>
    Date: Tuesday, April 12, 2016 at 3:46 PM
    To: Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>
    Cc: Data Foundations and Terminology IG <***@***.***-groups.org>, Research Data Collections WG <***@***.***-groups.org>, "***@***.***" <***@***.***>
    Subject: [rda-datafabric-ig] Re: [rda-datafabric-ig][rda-dft][rda-collection-wg] Re: [rda-datafabric-ig] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabri...
    Hi Gary,
    On Tue, Apr 12, 2016 at 1:57 PM, Gary <***@***.***> wrote:
    They are, I would say, of the same KIND. But there are differences in practice.
    Yes, this is my understanding. Or more specifically an identifier is a kind of name (which itself is a kind of label).
    When people document data with metadata there is often a separate spot for an identifier (s) (for digital purposes as you say) and a labeled name etc. They may all operate in a digital environment but in different ways including this role of navigating to a prepared spot like a landing page with more state info etc. I think of the label as asserting a naming fact as you say but that the role of finding something involves a concept beyond naming and I like the practice of distinguishing them as in many metadata efforts.

    I would hesitate to call an identifier a name in the role of finding something. If you look at things from the RDF worldview it is also a handy thing to use for making assertions about the thing it names. So the analogy begins to break down as soon as we try to separate the finding role from the naming role and almost ironically we find that we need it to be unique and persistent for this last role, naming, or our assertions start to become unreliable. As I said these roles are baked right into our architecture (all the way down to the fundamental manner in which things move on the bus and to/from various memory spaces). Context of use matters (and is probably more important than the entities and properties themselves).
    Apologies for the digression from the topic. This discussion of labels/names really has very little to do with collections (beyond the fact that one shouldn't conflate the collection with its name/identifier/label thing).
    Regards,
    Jacob
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770
    On Tue, Apr 12, 2016 at 1:06 PM, jjett <***@***.***> wrote:
    Hi Gary,
    My point is that identifiers really aren't any different from names, labels, or whatever you call them. (I think we're on the same page.)
    The primary distinction is that the "identifiers" in this case are expected to operate in a digital environment.
    The thing is, the difference between something that lets you refer to something else and something that lets you navigate to that something else is one that human beings effortlessly ignore. If I know your name I can (with some effort) find you in addition to using it to refer to you. The functionality of labels is context-dependent (and so are their uniqueness and persistence).
    This kind of functionality is actually baked into computers because computers are designed and programmed by humans. So I do believe a better definition for an identifier is "a label that names a (digital) thing." As with real-world names, I can use an identifier both to refer to a thing (i.e., assert facts about it, such as through a metadata record or a graph of assertions) and to find the thing. That the label is unique and persistent is a matter of the context it's expected to operate within and not particular to any specific identifier in and of itself. Uniqueness and persistence are contingent properties of an identifier and/or contingent metaproperties of the thing the identifier names.
    Ultimately saying something like 'collection == PID' (i.e., a collection is a PID) is weird because the object and the identifier are not the same kinds of things and don't possess the same properties and so are fundamentally, formally not identical to one another. The definition probably needs to be altered to clarify this for the humans building the APIs.
    Regards,
    Jacob
    On Tue, Apr 12, 2016 at 11:25 AM, Gary Berg-Cross <***@***.***> wrote:
    Jacob,
    > I think perhaps I am having some trouble with how you use the term identity. To clarify, it wouldn't be the case that some bitstring (an identifier) is identical to a file object it names, would it? Or is it the case that you are making the assumption that all identifiers must be reified?
    I have argued (with other RDA folks) that a bitstring is NOT identical with the object it identifies.
    So I hope that the definition does not in any way imply identity; I tried to use the idea of representation for such things.
    I'm not sure I understand your reification question. Bitstrings are reified digital things, just not the same as the object they reference. They serve a role in a reference process.
    Gary Berg-Cross, Ph.D.
    On Tue, Apr 12, 2016 at 12:05 PM, Jacob Jett <***@***.***> wrote:
    Hi Gary,
    I think perhaps I am having some trouble with how you use the term identity. To clarify, it wouldn't be the case that some bitstring (an identifier) is identical to a file object it names, would it? Or is it the case that you are making the assumption that all identifiers must be reified?
    Regards,
    Jacob
    On Tue, Apr 12, 2016 at 10:56 AM, Gary Berg-Cross <***@***.***> wrote:
    Jacob,
    I agree with much of what you say about refer, references, etc. Indeed, Europeana is working on a Collections Model; we can take that as input and maybe adopt it. In DFT we did look at some of this work and, as I noted, leveraged some of the language.
    The RDA community discussion is a way of generating some views that might be used to gauge their definition. PIDs play a big role in some parts of RDA, so you see some attempts to relate this to collections, although, as noted earlier, some of us, like you, don't think of any identifier as the same type of thing as the object identified.
    In regard to your question:
    "Regarding the concept of PID (persistent identifier) has there been any true consensus on what "persistent" and "identifier" mean? For instance, would the name Keith be a PID (why?/why not?)."
    I drafted definitions for Identifier and Identity, which are in our Term Tool. The definition of Identifier is not general; it covers a digital identifier, which is what I think most RDA folk are thinking about (as opposed to a name label like "Keith", which is another type of metadata we might use to find info about Keith).
    An identifier (à la digital identifier) is a bitstring that is used to provide Object Identity.
    Explanation: For many, a digital identifier is associated with a registry for the identifier and a repository for the data that is identified.
    Identity is that property of an object, such as a Digital Object or Resource, which distinguishes each object from all others.
    Identity is established by some process that connects a set of attributes to some object.
    There are legitimate issues of generality vs. specificity in many of our definitions. You raised an issue, for example, of aggregation and whether soil accumulating at the delta is covered. In my opinion no, because we have a focus on digital-data concepts and not the broader non-digital aspect of reality.
    Gary Berg-Cross, Ph.D.
    On Tue, Apr 12, 2016 at 10:10 AM, jjett <***@***.***> wrote:
    Hi,
    I'm one of the researchers that Jeremy contacted yesterday regarding the definition of the gatheredInto(x,y) predicate. I've been reading up on this discussion and had a question about one of the collection definitions being maintained by the RDA. Regarding the concept of PID (persistent identifier) has there been any true consensus on what "persistent" and "identifier" mean? For instance, would the name Keith be a PID (why?/why not?).
    Also regarding the "precise" collection definition from the collection working group. I see where the term "collection" has been pegged to the PID which effectively makes the PID an identifier and a collection...which seems a bit strange (but then again my street address is both a fire number and a mailing address, so that may be okay). More importantly though, the nature of the digital object the PID points to is never defined beyond what it consists of (a set or a list). But the digital object is not itself a set or a list. Is it some kind of anonymous thing, like a blank node in RDF? So I'm not sure what to make of it. The collection definition doesn't seem very precise at all.
    The main problem seems to be a conflation of the notions referrer and referent. Identifiers (persistent or otherwise) only refer to things and are not synonymous with the things they refer to. A better definition might be: A collection is a digital object which consists of a set or a list and is named by a PID (which when reified delivers the collection object).
    Another analogy I might make is that sets (and possibly lists) are scalar objects particular to a point in time. Collections (and probably also lists in all actuality) are vector objects (using the physicist's notion of vectors) that vary over time. So at any point in time, a collection may be represented by one set or another (not dissimilar to a document being represented by one string or another over time as it gets edited).
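The set-versus-collection analogy can be made concrete with a small sketch. This is only one possible reading of the paragraph above: the class name `VersionedCollection`, its methods, the use of integer `YYYYMMDD` timestamps, and the sample PID are assumptions introduced for illustration, not part of any RDA definition.

```python
from typing import Dict, FrozenSet


class VersionedCollection:
    """A collection that varies over time; each snapshot is a plain set."""

    def __init__(self, pid: str) -> None:
        self.pid = pid  # the name of the collection, not the collection itself
        self._snapshots: Dict[int, FrozenSet[str]] = {}

    def record(self, timestamp: int, members: FrozenSet[str]) -> None:
        """Store the membership that holds from `timestamp` onward."""
        self._snapshots[timestamp] = members

    def at(self, timestamp: int) -> FrozenSet[str]:
        """Return the latest snapshot recorded at or before `timestamp`."""
        earlier = [t for t in self._snapshots if t <= timestamp]
        if not earlier:
            return frozenset()
        return self._snapshots[max(earlier)]


# Made-up history: the same collection is represented by two different sets.
c = VersionedCollection("pid:example/123")
c.record(20160101, frozenset({"a", "b"}))
c.record(20160301, frozenset({"a", "b", "c"}))
```

The collection keeps one name throughout, while `at()` returns whichever set represents it at the requested point in time.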
    Some thoughts on the gatheredInto predicate. If memory serves correctly, the relationship "gatheredInto" is intended to capture the creative effort (or curatorial effort) that goes into creating a collection. You may have noted that we dialed back from that version of the definition in our JOHD article (and in the HTRC's workset data model). This is because that definition of "gatheredInto" is a bit semantically overloaded. It captures both that a collection is the result of some process ("gathering") and that said process is the result of some intelligent agent. So we went for a narrower definition: "gatheredInto" only states that a collection is the result of some process ("gathering") but doesn't remark on where that process comes from. This permits a relaxed definition that admits more colloquial uses of "collecting" such as "the silt collects at the river's mouth", "debris collects on the beach", "crumbs collect at the bottom of the cereal box", etc. While trivial looking, such distinctions can provide important cues indicating when metadata needs to be articulated (in this case by articulating who is doing the gathering and why they are doing it).
    This is important, as it seems to me that regardless of how one defines collection one needs to understand the role it's supposed to play within its native information ecosystem. Most importantly, how do you distinguish among different collections? Is it easier to rely on a layer of metadata that describes the collection object (past experiences suggest that it is) or to dig into it and examine the descriptions of the objects in the collection? Is it more like a box at the supermarket or more like a box under the Christmas tree?
    A closing thought. Since Europeana is already working on a collection model, is there a need for RDA to reinvent the wheel here?
    Regards,
    Jacob
    On Tue, Apr 12, 2016 at 8:36 AM, uschwar1 <***@***.***> wrote:
    Dear Keith, all,
    Oh dear, I had to truncate the subject line because it got too long for the RDA list server during our debate. I think this is a really strong reminder to get this settled ;-)
    On 12.04.2016 at 12:41, ***@***.*** wrote:
    Ulrich –
    Many thanks for this clear explanation. I also enjoy this type of discussion. I have a few points:
    You said:
    Alternatively we also can say, we omit the possibility of correctness proofs and use artificial intelligence. In this case we can just use language and, if really wanted, ontologies.
    In fact using knowledge engineering does not necessarily preclude formality (after all logic is a branch of mathematics) so I believe we can have the best of both worlds: the formality and precision of formal mathematics and the richness of declared semantics within a formal syntax.
    You are right here, with knowledge engineering we have a third alternative, and we should try to get the best of both, or even all three worlds.
    KE can indeed help to decide, for example, whether a DO is a collection, but on the other hand this only helps the collection WG if the underlying definition used by KE is reductive enough to be used by processes whose correctness can be proven.
    You said
    My suggestion would be to change the definition to: A collection is +++referenced by+++ a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
    I think we have to be very clear on the difference between an identifier (PID) and a navigation. For me a PID is an identifier. I do get annoyed by W3C people who insist a URL (named URI) is an identifier when in fact it is a navigation path (or address if you like).
    I'm not sure whether it would already be sufficient to say 'A collection is +++identified by+++ a PID' to avoid this problem. Or is there more behind the scenes here?
    You said:
    I understand this as a statement about possible counterexamples,……
    It was meant purely to indicate that in collections we have to deal with richer structures (syntax) than hierarchies and that not all nodes are reachable by simple recursion. However, this is not less formal than your original case.
    You said:
    'give me all vertices and edges connected to one of its vertices'.
    With the added semantics I suggested on the edges this can become a query –for example ‘only those vertices connected by an edge with role ‘is part of’ and temporal duration between 20160101 and 20160331’
    Yes, this is exactly the type of function I have in mind. Currently we do not have any 'role' in our definition other than 'is part of'. My suggestion here is to think about overloading this 'is part of' role with other semantic content in the metadata of a collection. For example, I did this analogously in the collection example of publication references in publications and the corresponding graphs. But we would probably say we have two different collections if they differ in this semantic content, even if the components happened to be the same. But as you see, these are interesting questions for the collection WG and not necessarily for the DFT WG.
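The kind of query discussed above (edges with role 'is part of' and a temporal duration between 20160101 and 20160331) could be sketched as a simple filter over annotated edges. The `Edge` fields, the function name `members`, and the sample data are assumptions for illustration. "Duration between" is read here as "the edge's validity interval lies entirely inside the window", which is only one possible interpretation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    role: str
    start: int  # dates as YYYYMMDD integers, as in the example above
    end: int


def members(edges: List[Edge], collection: str, role: str,
            start: int, end: int) -> Set[str]:
    """Vertices linked to `collection` by `role` edges valid within [start, end]."""
    return {
        e.source
        for e in edges
        if e.target == collection
        and e.role == role
        and e.start >= start
        and e.end <= end
    }


# Made-up edges pointing at a hypothetical collection "collA".
edges = [
    Edge("doc1", "collA", "is part of", 20160105, 20160320),
    Edge("doc2", "collA", "is part of", 20151201, 20160320),
    Edge("doc3", "collA", "derived from", 20160110, 20160301),
]

result = members(edges, "collA", "is part of", 20160101, 20160331)
```

Overloading the role, as suggested above, would then simply mean allowing more values in the `role` field of each edge.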
    All this has been implemented in Europe with CERIF (Common European Research Information Format – an EU Recommendation to member states) see http://www.eurocris.org/cerif/main-features-cerif and its dependent tree of information for details. Although the data model is represented in extended entity-relation notation it can be implemented in just about any paradigm (logic programming, object-oriented….)
    It would also be an interesting question whether, and to what extent, the collection WG will be able to use this, but first we need to get the basics in order in our WG. We should certainly come back to this.
    Best
    Keith
    From: uschwar1=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of uschwar1
    Sent: 12 April 2016 10:01
    To: Gary Berg-Cross; Research Data Collections WG; Data Foundations and Terminology IG; Data Fabric IG
    Cc: Jeremy York; TobiasWeigel; ThomasZastrow
    Subject: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
    Hi Gary, all,
    I agree with Thomas: this now tends to become a more and more philosophical debate. I like this, and we should perhaps continue it over a beer in Denver. But to shorten the decision process here, let me assume that an undoubted goal is to set up the foundations for building automated processes on collections, and try to bring it down to a simple question:
    Do we want to be able to prove the correctness of processes on collections or not? If this is the case, we need a mathematically solid definition of the object we are working on. I'm not saying that we have to prove correctness for all processes, btw.; that's not common practice in computer science anyway.
    Alternatively we also can say, we omit the possibility of correctness proofs and use artificial intelligence. In this case we can just use language and, if really wanted, ontologies.
    The obvious resulting question in this case is how and why AI processes would need a concept of collection. I suppose these processes would not reflect on collections but just use the links inside collections in an unstructured, recursive way, just as a crawler would work. A concept of collections becomes unnecessary for such processes; they just work.
    But understanding how they work brings us back to the foundations of automated processes on collections and the correctness proofs of our understanding. That's why I think we should rely on sound definitions.
    To Juha and Keith (1.):
    We are still talking about whether we use PID or ID or both inside collections. The major point is that we want to formalize the references as the major structural element of collections.
    The phrase "But saying that a collection is a PID is a bit like saying that a book is an ISBN." is great and shows what is irritating here, even if it makes sense from a mathematical viewpoint.
    My suggestion would be to change the definition to: A collection is +++referenced by+++ a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
    Juha's phrase then becomes: "But saying that a collection is referenced by a PID is a bit like saying that a book is referenced by an ISBN." which sounds reasonable for me.
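The revised wording can be written down as a data structure, which makes the separation between the PID and the object it references explicit. This is a minimal sketch under stated assumptions: the class names `MemberEntry` and `CollectionObject`, the `store` lookup table, and the sample PIDs are hypothetical and not taken from the collection WG's documents.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemberEntry:
    """One PID/Id in the collection, with its own links and metadata."""
    pid: str
    links: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class CollectionObject:
    """The digital object: a list of member entries."""
    entries: List[MemberEntry] = field(default_factory=list)

    def member_pids(self) -> List[str]:
        return [entry.pid for entry in self.entries]


# The PID references the collection object; it is a key, not the object.
store: Dict[str, CollectionObject] = {}
store["pid:example/coll-1"] = CollectionObject(entries=[
    MemberEntry("pid:example/a", metadata={"role": "is part of"}),
    MemberEntry("pid:example/b", links=["pid:example/a"]),
])
```

Here the string `"pid:example/coll-1"` and the `CollectionObject` it maps to are values of different types, which is the book/ISBN distinction in miniature.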
    To Keith (2.):
    it should be possible to express relationships between any collections (or any DO) whether hierarchic (‘belongs to’/is part of’) or in a fully connected graph where it may be that one collection is a proper subset of another (or superset of >1 other collections) or that collection A was derived from Collection B by process X or that collection C was derived from collection D with process U and from collection E with process W - and all with appropriate date/time stamping so that provenance is recorded (and all associated descriptive / contextual / actionable metadata).
    I understand this as a statement about possible counterexamples, but actually it is a great 'collection' of test cases, where one can see the possibilities of the given definition:
    The fully connected graph is a resulting description of collections seen as vertices with (directed) edges given by the PIDs/Ids inside each of the collections. The fully connected graph therefore is a collection given by the process 'give me all vertices and edges connected to one of its vertices'. Proper subsets of collections are in the scope of the definition as well. And that collection A could be derived from Collection B by process X, was something I said before anyway. Date/time stamping and provenance for collections is on the roadmap of the collections WG too. So from my point of view at least this all fits quite well.
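The process 'give me all vertices and edges connected to one of its vertices' can be sketched as a breadth-first traversal over the references held inside each collection. The sketch assumes that "connected" means reachable by following the directed references. The dictionary `contents`, the function name `connected`, and the sample identifiers are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical data: each collection Id maps to the PIDs/Ids listed inside it.
contents: Dict[str, List[str]] = {
    "A": ["B", "x1"],
    "B": ["x2"],
    "C": ["x3"],
}


def connected(start: str, graph: Dict[str, List[str]]
              ) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Collect all vertices and directed edges reachable from `start`."""
    vertices: Set[str] = {start}
    edges: Set[Tuple[str, str]] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            edges.add((node, target))
            if target not in vertices:
                vertices.add(target)
                queue.append(target)
    return vertices, edges


vertices, edges = connected("A", contents)
```

Starting from `"A"`, the result contains `A`, `B`, `x1`, and `x2` but not `C`, so the traversal itself yields a set of Ids that could again be stored as a collection.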
    On 12.04.2016 at 00:12, Gary Berg-Cross wrote:
    Ulrich
    In response to your reductive assumption in:
    Gary:
    Behind every identifier there is an operation that links the identifier to the digital object. In effect, the identifier is a proxy for the resolving operation.
    You can generalize any identifier as the operation that can be performed. There are a wide variety of operations in use in data management systems:
    * GUID, globally unique identifier which has no associated location information, but does have a repository of names
    * Handle, which has a unique identifier and an associated access location
    * Ticket, which has a unique identifier, an access location, and access controls
    * Collection name, which has an identifier, location, access control, arrangement, descriptive/provenance metadata
    You can also invert this description. Every label is defined by a set of assertions that must be valid. The assertions are verified by applying operations to see if the label is correct. If we are defining unique identifiers, we are specifying the set of assertions that must hold for the assignment of the identifier to be valid. A collection assumes that each member of the collection has passed an equivalent set of assertions for naming.
    Thus a collection is an assertion that the members can be identified through a common naming convention. In practice, we use multiple naming conventions to build collections.
    Reagan Moore
    From: <***@***.***-groups.org> on behalf of jjett <***@***.***>
    Date: Tuesday, April 12, 2016 at 3:46 PM
    To: Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>
    Cc: Data Foundations and Terminology IG <***@***.***-groups.org>, Research Data Collections WG <***@***.***-groups.org>, "***@***.***" <***@***.***>
    Subject: [rda-datafabric-ig] Re: [rda-datafabric-ig][rda-dft][rda-collection-wg] Re: [rda-datafabric-ig] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabri...
    Hi Gary,
    On Tue, Apr 12, 2016 at 1:57 PM, Gary <***@***.***> wrote:
    They are, I would say, of the same KIND. But there are differences in practice.
    Yes, this is my understanding. Or more specifically an identifier is a kind of name (which itself is a kind of label).
    When people document data with metadata there is often a separate spot for an identifier (s) (for digital purposes as you say) and a labeled name etc. They may all operate in a digital environment but in different ways including this role of navigating to a prepared spot like a landing page with more state info etc. I think of the label as asserting a naming fact as you say but that the role of finding something involves a concept beyond naming and I like the practice of distinguishing them as in many metadata efforts.

    I would hesitate to call an identifier a name in the role of finding something. If you look at things from the RDF worldview it is also a handy thing to use for making assertions about the thing it names. So the analogy begins to break down as soon as we try to separate the finding role from the naming role and almost ironically we find that we need it to be unique and persistent for this last role, naming, or our assertions start to become unreliable. As I said these roles are baked right into our architecture (all the way down to the fundamental manner in which things move on the bus and to/from various memory spaces). Context of use matters (and is probably more important than the entities and properties themselves).
    Apologies for the digression from the topic. This discussion of labels/names really has very little to do with collections (beyond the fact that one shouldn't conflate the collection with its name/identifier/label thing).
    Regards,
    Jacob
    _____________________________________________________
    Jacob Jett
    Research Assistant
    Center for Informatics Research in Science and Scholarship
    The Graduate School of Library and Information Science
    University of Illinois at Urbana-Champaign
    501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
    (217) 244-2164
    ***@***.***
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770
    On Tue, Apr 12, 2016 at 1:06 PM, jjett <***@***.***> wrote:
    Hi Gary,
    My point is that identifiers really aren't any different from names, labels, or whatever you call them. (I think we're on the same page.)
    The primary distinction is that the "identifiers" in this case are expected to operate in a digital environment.
    The thing is, the difference between something that lets you refer to something else and something that lets you navigate to that something else is one that human beings effortlessly ignore. If I know your name I can (with some effort) find you in addition to using it to refer to you. The functionality of labels is context-dependent (and so are their uniqueness and persistence).
    This kind of functionality is actually baked into computers because computers are designed and programmed by humans. So I do believe a better definition for an identifier is "a label that names a (digital) thing." Like with real-world names, I can use an identifier to both refer to a thing (i.e., assert facts about, such as through a metadata record or a graph of assertions) and I can also use it to find the thing. That the label is unique and persistent is a matter of the context it's expected to operate within and not particular to any specific identifier in and of itself. Uniqueness and persistence are contingent properties of an identifier and/or contingent metaproperties of the thing the identifier names.
    Ultimately saying something like 'collection == PID' (i.e., a collection is a PID) is weird because the object and the identifier are not the same kinds of things and don't possess the same properties and so are fundamentally, formally not identical to one another. The definition probably needs to be altered to clarify this for the humans building the APIs.
    Regards,
    Jacob
    _____________________________________________________
    Jacob Jett
    Research Assistant
    Center for Informatics Research in Science and Scholarship
    The Graduate School of Library and Information Science
    University of Illinois at Urbana-Champaign
    501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
    (217) 244-2164
    ***@***.***
    On Tue, Apr 12, 2016 at 11:25 AM, Gary Berg-Cross <***@***.***> wrote:
    Jacob,
    > I think perhaps I am having some trouble with how you use the term identity. To clarify, it wouldn't be the case that some bitstring (an identifier) is identical to a file object it names would it? Or is it the case that you are making the assumption that all identifiers must be reified?
    I have argued (with other RDA folks) that a bitstring is NOT identical with the object it identifies.
    So I hope that the definition does not in any way imply identity, I tried to use the idea of representation for such things.
    I'm not sure if I understand your reification questions. Bit-strings are reified digital things, just not the same as the object they reference. They serve a role in a reference process.
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770
    On Tue, Apr 12, 2016 at 12:05 PM, Jacob Jett <***@***.***> wrote:
    Hi Gary,
    I think perhaps I am having some trouble with how you use the term identity. To clarify, it wouldn't be the case that some bitstring (an identifier) is identical to a file object it names would it? Or is it the case that you are making the assumption that all identifiers must be reified?
    Regards,
    Jacob
    _____________________________________________________
    Jacob Jett
    Research Assistant
    Center for Informatics Research in Science and Scholarship
    The Graduate School of Library and Information Science
    University of Illinois at Urbana-Champaign
    501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
    (217) 244-2164
    ***@***.***
    On Tue, Apr 12, 2016 at 10:56 AM, Gary Berg-Cross <***@***.***> wrote:
    Jacob,
    I agree with much of what you say about refer, references, etc. Indeed, if Europeana is working on a Collections Model, we can take that as input and maybe adopt it. In DFT we did look at some of this work and, as I noted, leveraged some of the language.
    The RDA community discussion is a way of generating some views that might be used to gauge their definition. PIDs play a big role in some parts of RDA, so you see some attempts to relate this to collections, although, as noted earlier, some of us, like you, don't think of any identifier as the same type of thing as the object identified.
    In regard to your question:
    "Regarding the concept of PID (persistent identifier) has there been any true consensus on what "persistent" and "identifier" mean? For instance, would the name Keith be a PID (why?/why not?)."
    I drafted definitions for Identifier and Identity, which are in our Term Tool. The definition of Identifier is not general but is for a digital identifier, which is what I think most RDA folk are thinking about (as opposed to a name label like "Keith", which is another type of metadata we might use to find info about Keith).
    An identifier (ala digital identifier) is a bitstring that is used to provide Object Identity.
    Explanation: For many, a digital identifier is associated with a registry for the identifier and a repository for the data that is identified.
    Identity is that property of an object, such as a Digital Object or Resource, which distinguishes each object from all others.
    Identity is established by some process that connects a set of attributes to some object.
    There are legitimate issues of generality vs specificity in many of our definitions. You raised an issue, for example, of aggregation and whether soil accumulating at the delta is covered. In my opinion no, because we have a focus on digital-data concepts and not the broader non-digital aspect of reality.
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770
    On Tue, Apr 12, 2016 at 10:10 AM, jjett <***@***.***> wrote:
    Hi,
    I'm one of the researchers that Jeremy contacted yesterday regarding the definition of the gatheredInto(x,y) predicate. I've been reading up on this discussion and had a question about one of the collection definitions being maintained by the RDA. Regarding the concept of PID (persistent identifier) has there been any true consensus on what "persistent" and "identifier" mean? For instance, would the name Keith be a PID (why?/why not?).
    Also regarding the "precise" collection definition from the collection working group. I see where the term "collection" has been pegged to the PID which effectively makes the PID an identifier and a collection...which seems a bit strange (but then again my street address is both a fire number and a mailing address, so that may be okay). More importantly though, the nature of the digital object the PID points to is never defined beyond what it consists of (a set or a list). But the digital object is not itself a set or a list. Is it some kind of anonymous thing, like a blank node in RDF? So I'm not sure what to make of it. The collection definition doesn't seem very precise at all.
    The main problem seems to be a conflation of the notions referer and referent. Identifiers (persistent or otherwise) only refer to things and are not synonymous with the things they refer to. A better definition might be: A collection is a digital object which consists of a set or a list and is named by a PID (which when reified delivers the collection object).
    Another analogy I might make is that sets (and possibly lists) are scalar objects particular to a point in time. Collections (and probably also lists in all actuality) are vector objects (using the physicist's notion of vectors) that vary over time. So at any point in time, a collection may be represented by one set or another (not dissimilar to a document being represented by one string or another over time as it gets edited).
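    The set-at-a-point-in-time view above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `VersionedCollection`, its methods, and the integer timestamps are hypothetical and not taken from any RDA or HTRC model.

```python
from bisect import bisect_right


class VersionedCollection:
    """A collection that varies over time; each recorded state is a plain set."""

    def __init__(self):
        self._times = []      # timestamps, kept in ascending order
        self._snapshots = []  # frozenset of member ids recorded at each timestamp

    def record(self, timestamp, members):
        """Record the membership that holds from `timestamp` onwards."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("timestamps must be recorded in ascending order")
        self._times.append(timestamp)
        self._snapshots.append(frozenset(members))

    def at(self, timestamp):
        """Return the set representing the collection at `timestamp`."""
        i = bisect_right(self._times, timestamp)
        # Before the first recorded state the collection is treated as empty.
        return self._snapshots[i - 1] if i else frozenset()
```

    The collection keeps one identity while `at()` returns whichever set represents it at the requested time, much like a document being represented by different strings as it is edited.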
    Some thoughts on the gatheredInto predicate. If memory serves correctly, the relationship "gatheredInto" is intended to capture the creative effort (or curatorial effort) that goes into creating a collection. You may have noted that we dialed back from that version of the definition in our JOHD article (and in the HTRC's workset data model). This is because that definition of "gatheredInto" is a bit semantically overloaded. It captures both that a collection is the result of some process ("gathering") and that said process is the result of some intelligent agent. So we went for a narrower definition---"gatheredInto" only states that a collection is the result of some process ("gathering") but doesn't remark on where that process comes from. This permits a relaxed definition that admits more colloquial uses of "collecting" such as "the silt collects at the river's mouth", "debris collects on the beach", "crumbs collect at the bottom of the cereal box", etc. While trivial looking, such distinctions can provide important cues indicating when metadata needs to be articulated (in this case by articulating who is doing the gathering and why they are doing it).
    This is important, as it seems to me that regardless of how one defines collection one needs to understand the role it's supposed to play within its native information ecosystem. Most importantly, how do you distinguish among different collections? Is it easier to rely on a layer of metadata that describes the collection object (past experiences suggest that it is) or to dig into it and examine the descriptions of the objects in the collection? Is it more like a box at the supermarket or more like a box under the Christmas tree?
    A closing thought. Since Europeana is already working on a collection model is there a need for RDA to reinvent the wheel here?
    Regards,
    Jacob
    _____________________________________________________
    Jacob Jett
    Research Assistant
    Center for Informatics Research in Science and Scholarship
    The Graduate School of Library and Information Science
    University of Illinois at Urbana-Champaign
    501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
    (217) 244-2164
    ***@***.***
    On Tue, Apr 12, 2016 at 8:36 AM, uschwar1 <***@***.***> wrote:
    Dear Keith, all,
    I had to truncate the subject line because it got too long for the RDA list server during our debate. I think this is a really strong reminder to get settled ;-)
    On 12.04.2016 at 12:41, ***@***.*** wrote:
    Ulrich –
    Many thanks for this clear explanation. I also enjoy this type of discussion. I have a few points:
    You said:
    Alternatively we can also say that we omit the possibility of correctness proofs and use artificial intelligence. In this case we can just use language and, if really wanted, ontologies.
    In fact using knowledge engineering does not necessarily preclude formality (after all logic is a branch of mathematics) so I believe we can have the best of both worlds: the formality and precision of formal mathematics and the richness of declared semantics within a formal syntax.
    You are right here, with knowledge engineering we have a third alternative, and we should try to get the best of both, or even all three worlds.
    KE can indeed help to decide, for example, whether a DO is a collection, but on the other hand this only helps the collection WG if the underlying definition used by KE is reductive enough to be used by processes whose correctness can be proven.
    You said
    My suggestion would be to change the definition to: A collection is +++referenced by+++ a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
    I think we have to be very clear on the difference between an identifier (PID) and a navigation. For me a PID is an identifier. I do get annoyed by W3C people who insist a URL (named URI) is an identifier when in fact it is a navigation path (or address if you like).
    I'm not sure whether it would already be sufficient to say 'A collection is +++identified by+++ a PID' to avoid this problem. Or is more behind the scenes here?
    You said:
    I understand this as a statement about possible counterexamples,……
    It was meant purely to indicate that in collections we have to deal with richer structures (syntax) than hierarchies and that not all nodes are reachable by simple recursion. However, this is not less formal than your original case.
    You said:
    'give me all vertices and edges connected to one of its vertices'.
    With the added semantics I suggested on the edges this can become a query –for example ‘only those vertices connected by an edge with role ‘is part of’ and temporal duration between 20160101 and 20160331’
    Yes, exactly this type of function is what I have in mind. Currently we do not have any 'role' in our definition other than 'is part of'. My suggestion here is to think about overloading this 'is part of' role with other semantic content in the metadata of a collection. For example, I did this analogously in the collection example of publication references in publications and the corresponding graphs. But we would probably say that we have two different collections if they differ in this semantic content, even if the components happened to be the same by chance. But as you see, these are interesting questions for the collection WG and not necessarily for the DFT WG.
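    A query of the kind Keith describes could look roughly like the sketch below. It is an illustration only: the `Edge` record, its field names, and the reading of "temporal duration between" as "validity interval lies inside the query window" are assumptions, not part of CERIF or of the WG definition.

```python
from datetime import date
from typing import NamedTuple


class Edge(NamedTuple):
    """Hypothetical typed, time-stamped edge between two PIDs."""
    source: str
    target: str
    role: str
    start: date
    end: date


def select_vertices(edges, role, window_start, window_end):
    """Vertices joined by an edge with `role` whose interval lies inside the window."""
    hits = set()
    for edge in edges:
        if edge.role == role and window_start <= edge.start and edge.end <= window_end:
            hits.update((edge.source, edge.target))
    return hits
```

    For Keith's example one would call `select_vertices(edges, "is part of", date(2016, 1, 1), date(2016, 3, 31))`.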
    All this has been implemented in Europe with CERIF (Common European Research Information Format – an EU Recommendation to member states) see http://www.eurocris.org/cerif/main-features-cerif and its dependent tree of information for details. Although the data model is represented in extended entity-relation notation it can be implemented in just about any paradigm (logic programming, object-oriented….)
    It would also be an interesting question whether the collection WG will be able to use this, and to what extent, but first we need to get the basics in order in our WG. But we should certainly come back to this.
    Best
    Keith
    From: uschwar1=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of uschwar1
    Sent: 12 April 2016 10:01
    To: Gary Berg-Cross; Research Data Collections WG; Data Foundations and Terminology IG; Data Fabric IG
    Cc: Jeremy York; TobiasWeigel; ThomasZastrow
    Subject: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
    Hi Gary, all,
    I agree with Thomas: this now tends to become a more and more philosophical debate - I like this, and we should continue this perhaps with a beer in Denver. But to shorten the decision process here, let me assume that an undoubted goal is to set up the foundations to build automated processes on collections, and try to bring it down to a simple question:
    Do we want to be able to prove the correctness of processes on collections or not? If this is the case, we need a mathematically solid definition of the object we are working on. I'm not saying that we have to prove correctness for all processes, btw.; that's not common practice in computer science anyway.
    Alternatively we can also say that we omit the possibility of correctness proofs and use artificial intelligence. In this case we can just use language and, if really wanted, ontologies.
    The obvious resulting question in this case is, how and why AI processes would need a concept of collection. I suppose, these processes would not reflect on collections but just use the links inside collections in an unstructured, recursive way, just as a crawler would work. A concept of collections becomes unnecessary for such processes, they just work.
    But understanding how they work brings us back to the foundations of automated processes on collections and the correctness proofs of our understanding. That's why I think we should rely on sound definitions.
    To Juha and Keith (1.):
    we are still talking about whether we use PID or ID or both inside collections. The major point is that we want to formalize the references as the major structural element of collections.
    The phrase "But saying that a collection is a PID is a bit like saying that a book is an ISBN." is great and shows what is irritating here, even if it makes sense from a mathematical viewpoint.
    My suggestion would be to change the definition to: A collection is +++referenced by+++ a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
    Juha's phrase then becomes: "But saying that a collection is referenced by a PID is a bit like saying that a book is referenced by an ISBN.", which sounds reasonable to me.
    To Keith (2.):
    it should be possible to express relationships between any collections (or any DO) whether hierarchic (‘belongs to’/‘is part of’) or in a fully connected graph where it may be that one collection is a proper subset of another (or superset of >1 other collections) or that collection A was derived from Collection B by process X or that collection C was derived from collection D with process U and from collection E with process W - and all with appropriate date/time stamping so that provenance is recorded (and all associated descriptive / contextual / actionable metadata).
    I understand this as a statement about possible counterexamples, but actually it is a great 'collection' of test cases, where one can see the possibilities of the given definition:
    The fully connected graph is a resulting description of collections seen as vertices with (directed) edges given by the PIDs/Ids inside each of the collections. The fully connected graph therefore is a collection given by the process 'give me all vertices and edges connected to one of its vertices'. Proper subsets of collections are in the scope of the definition as well. And that collection A could be derived from Collection B by process X, was something I said before anyway. Date/time stamping and provenance for collections is on the roadmap of the collections WG too. So from my point of view at least this all fits quite well.
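    The process 'give me all vertices and edges connected to one of its vertices' can be sketched as below. This is a minimal illustration assuming the collections are available as a plain dict from collection PID to the PIDs it contains; the function name and the example PIDs are hypothetical.

```python
def connected_component(start, members):
    """Return (vertices, edges) of the part of the graph connected to `start`.

    `members` maps a collection PID to the PIDs/Ids it contains; each
    containment is a directed edge (collection, member). Connectivity is
    computed ignoring edge direction.
    """
    neighbours = {}
    for coll, pids in members.items():
        for pid in pids:
            neighbours.setdefault(coll, set()).add(pid)
            neighbours.setdefault(pid, set()).add(coll)

    vertices = {start}
    todo = [start]
    while todo:
        for nxt in neighbours.get(todo.pop(), ()):
            if nxt not in vertices:
                vertices.add(nxt)
                todo.append(nxt)

    # Keep the original directed edges whose source lies in the component.
    edges = {(c, p) for c, pids in members.items() if c in vertices for p in pids}
    return vertices, edges
```

    The result is again just a set of PIDs plus links, so it fits the shape of a collection in the sense of the definition.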
    On 12.04.2016 at 00:12, Gary Berg-Cross wrote:
    Ulrich
    In response to your reductive assumption in:
    >To Gary: of course a collection is something different to an ordinary PID also in my reductionist approach. It is a PID, that points to a very special kind of DO. My assumption is, that this is sufficient for all underlying "substance". But this of course still has to be proven. But perhaps the examples I mentioned already give a feeling of the possibilities, that such a definition can have.
    PID doesn't seem to be the substrate, even if it can be formalized neatly and recursed. Behind the idea of a PID is that of Identity, but even this doesn't seem like a basis for building up a Collection concept. Data collections pre-existed digital data, and thus PIDs, as a practical example.
    I am more in the camp of ontologists like John Sowa who see ontological concepts as the material with which logical operators are used to express concepts.
    "Pure logic is ontologically neutral. It makes no presuppositions about what exists or may exist in any domain or any language for talking about the domain. To represent knowledge about a specific domain, it must be supplemented with an ontology that defines the categories of things in that domain and the terms that people use to talk about them. The ontology defines the words of a natural language, the predicates of predicate calculus, the concept and relation types of conceptual graphs, the classes of an object-oriented language, or the tables and fields of a relational database." from "Ontology, Metadata, and Semiotics" John F. Sowa
    So as a basis for Collection, if you want to find an atom for the molecule of Collection, it might be the idea of "and" or "partOf", which produces aggregations & wholes. But there are just so many ways of building larger structures from smaller ones, and this is a sub-part of ontology called Mereology.
    So to me we can't start with mathematical and logical terms and expect to build a world unless we use concepts with terms from that world.
    Again to quote Sowa on this language effort:
    "No ontology, formal or informal, is independent of the vocabulary and the methodologies (i.e., language games) used to analyze the data. Natural language terms have been the starting point for every ontology from Aristotle to the present. Even the most abstract ontologies of mathematics and science are analyzed, debated, explained, and taught in natural languages. For computer applications, the users who enter data and choose options on menus, think in the words of the NL vocabulary. Any options that cannot be explained in words the users understand are open invitations to mistakes, confusions, and system vulnerabilities. Therefore, every ontology that has any practical application must have a mapping, direct or indirect, to and from natural languages." (from John Sowa's "The Role of Logic and Ontology In Language and Reasoning.")
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770
    On Mon, Apr 11, 2016 at 12:50 PM, uschwar1 <***@***.***> wrote:
    Dear Jeremy, all
    here, as far as I can see from a first look, the definition is relying on the binary predicate isGatheredInto(x,y), which I couldn't find to be defined at the given location anymore. So one probably cannot use this as a definition here, without defining how this predicate function works in all cases.
    But the other way around: if one uses my reductionist definition, the function isGatheredInto(x,y) is almost trivial to define, because one just checks whether PID y is contained in the set of PIDs in the DO that PID x points to.
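    As a minimal sketch of that reading of isGatheredInto(x,y): the in-memory `REGISTRY` dict standing in for a PID resolver, the `resolve` helper, the `"members"` key, and the example PIDs are all assumptions made for illustration, not an existing API.

```python
# Hypothetical in-memory stand-in for a PID resolver: PID -> digital object.
# A collection DO is modelled here simply as a dict with a "members" set.
REGISTRY = {
    "pid:coll-1": {"members": {"pid:a", "pid:b"}},
    "pid:a": {"content": "some data"},
    "pid:b": {"content": "more data"},
}


def resolve(pid):
    """Return the digital object a PID points to, or None if it is unknown."""
    return REGISTRY.get(pid)


def is_gathered_into(x, y):
    """True if PID y is in the set of PIDs of the DO that PID x points to."""
    obj = resolve(x)
    if obj is None:
        return False
    return y in obj.get("members", set())
```

    Note that this only checks direct membership; membership through sub-collections would need the iterative resolution mentioned in the definition.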
    To Gary: of course a collection is something different to an ordinary PID also in my reductionist approach. It is a PID, that points to a very special kind of DO. My assumption is, that this is sufficient for all underlying "substance". But this of course still has to be proven. But perhaps the examples I mentioned already give a feeling of the possibilities, that such a definition can have.
    And certainly we need to discuss counter examples, to see what the limitations are.
    On 11.04.2016 at 18:24, Jeremy York wrote:
    I don't know if this will contribute to the discussion but I wanted to point to work being done with HathiTrust at the University of Illinois to define collections in a digital humanities context: http://doi.org/10.5334/johd.3.
    Jeremy
    Jeremy York
    Project Manager
    The Stewardship Gap
    http://bit.ly/stewardshipgap
    On Mon, Apr 11, 2016 at 12:13 PM, TobiasWeigel <***@***.***> wrote:
    Hello Ulrich,
    thank you for the examples - I particularly like the power collection idea as it could solve very aesthetically some of the issues we get into once we talk about collections that grow over time but yet should be somewhat statically referable. I think this also has a new twist on the API: A rule-based collection might need its own dedicated querying and creation mechanisms (or at least different parameter sets). When thinking in terms of collection models, I mostly worked along lines of common ADTs and multiple membership in several collections. The 'family' of rule-based collections may be a distinct sister branch to these. Thanks a lot for sharing these early examples - I clearly have to look deeper into the mathematical view when continuing down the models path.
    Best, Tobias
    -------- Original Message --------
    Subject: Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
    From: Ulrich Schwardmann <***@***.***>
    To: TobiasWeigel <***@***.***>, ThomasZastrow
    <***@***.***>, Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>, RDA Collections WG <***@***.***-groups.org>
    Date: 11 Apr 2016, 16:19
    Hi Tobias, Gary and others,
    in principle each function, that generates (new) collections, could be used. For example from a given collection one could build a new collection by requiring restrictions like for example time constraints on the generation of the DOs it contains. Or one can build a kind of power collection, the collection of all sub collections.
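    Two of these generating functions can be sketched as follows. This is only an illustration: collections are reduced to plain Python sets of PIDs, and the `created` mapping from PID to creation time is an assumed piece of metadata.

```python
from itertools import chain, combinations


def restrict_by_time(members, created, start, end):
    """New collection: the members whose creation time lies in [start, end]."""
    return {pid for pid in members if start <= created[pid] <= end}


def power_collection(members):
    """The collection of all sub-collections of a finite collection."""
    items = sorted(members)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}
```

    The power collection of n members has 2^n sub-collections, so in practice it would be given implicitly by the rule rather than enumerated.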
    Particularly interesting generation rules come with the possibility of following the links given in the collection, either by the PIDs in the collection itself or by the additional pointers/links given in the definition. For example, if one has a set of collections, each consisting of, let's say, two PIDs pointing to another collection in this set, then one can see this as such a set, but one can also build the sub-collections formed by the connected components in the graph with PID vertices and edges defined by the relation 'PID in a collection'.
    A real world example would be 'references in publications': each publication (collection) only contains a small number of references (PIDs), but for a given publication there is a whole tree of all publications, that this publication relies on, which is a new collection.
    Even more interesting is the reverse generation rule: give me all publications that rely on a given publication. It is a valid rule too, but it's much harder to implement, because for each publication one needs to know all publications relying on it, or all publications altogether.
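    The forward and the reverse rule can be sketched side by side. The sketch assumes the references are available as a dict `cites` from a publication PID to the PIDs it references; that structure and the function names are hypothetical.

```python
def reachable(start, edges):
    """All nodes reachable from `start` by following `edges` (cycle-safe)."""
    seen = set()
    todo = [start]
    while todo:
        for nxt in edges.get(todo.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def invert(cites):
    """Reverse index: publication -> publications that reference it."""
    cited_by = {}
    for pub, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(pub)
    return cited_by


def relies_on(pub, cites):
    """Forward rule: everything `pub` relies on, directly or indirectly."""
    return reachable(pub, cites)


def relied_on_by(pub, cites):
    """Reverse rule: everything that relies on `pub`; needs the whole corpus."""
    return reachable(pub, invert(cites))
```

    The asymmetry shows up in `invert`: the forward rule only touches the publications it actually visits, while the reverse rule first has to scan every publication to build the index.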
    Similarly, new collections can be built from the additional pointers that are possible for a collection according to the definition below. A typical example of such a pointer could be the previous version of a collection, and one can easily build the collection of all previous versions of a collection by the rule of always following the previous-version pointer.
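    The previous-version rule can be sketched in a few lines, assuming a hypothetical dict `previous` that maps a collection PID to the PID of its previous version (absent for the first version).

```python
def previous_versions(pid, previous):
    """Follow the previous-version pointer from `pid`; return the chain, newest first."""
    versions = []
    seen = {pid}
    current = previous.get(pid)
    # The `seen` guard stops the walk if the pointers accidentally form a loop.
    while current is not None and current not in seen:
        versions.append(current)
        seen.add(current)
        current = previous.get(current)
    return versions
```

    The returned list is ordered, which is one case where a list rather than a set is the natural representation of the generated collection.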
    On 11.04.2016 at 14:30, TobiasWeigel wrote:
    Hi Ulrich, Gary,
    I think this is a very timely and much needed discussion. I like Ulrich's idea to boil this down to the mathematical definitions because I also agree that this reduces the ambiguity and there are some well-known concepts we can reuse. At least at this abstract level, we then won't have to define e.g. a Digital Object in all its meaning at first.
    Ulrich - can you give an example for a generation rule? I think I get the direction in which you are heading, but I am not sure I understand the variety of possibilities you hint at.
    I am not so sure that collection implementation will mostly be lists - there is a clear advantage in terms of computational efficiency in using unordered sets (distributed hash maps, NoSQL storage and so on). In my mind, both set and list implementations are valid choices with trade-offs depending on a concrete use case.
    Best, Tobias
    -------- Original Message --------
    Subject: Re: [rda-datafabric-ig] Some thoughts on "Data Aggregations" terminology & concepts
    From: uschwar1 <***@***.***>
    To: ThomasZastrow
    <***@***.***>, Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>
    Date: 11 Apr 2016, 11:12
    Dear Gary, all,
    as Thomas already mentioned, in the last VC of the Collections WG we saw the need for a relatively rigid and precise definition of what a digital collection should be in the sense of that WG. This definition is still under discussion and is currently given as the fourth definition, next to three others, at
    http://smw-rda.esc.rzg.mpg.de/index.php/Collection
    and the one in the DFT WG snapshot document. The current definition of the collection WG is:
    (
    Definition
    A collection is a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
    A collection can be given explicitly, by naming each PID/Id directly, or implicitly, by a generating rule.
    By definition a collection can contain other "sub-"collections.
    A collection is called finite, if the set of PIDs/Ids, generated by iteratively resolving its "sub-"collections, is finite.
    )
    which is relatively abstract, tries to use mathematical terms like sets or lists or simple constructions like PIDs and pointers, and avoids relying on other relatively undefined terms like aggregations and DEs.
    The term DO is complicated enough that avoiding it as well is under discussion, but there is currently no good alternative.
    The reason for this attempt was that we were discussing several concepts, like data streams, that are used and need to be referenced but that continuously collect additional data over time, which makes it necessary to get versioning under control for such references. The idea of the Collections WG is to pave the way for automated services on collections. With a definition like the one above we are much better able to handle different representations of such a use case and to classify them.
    From my point of view, the use of generating rules in particular opens up a huge range of possibilities. The definition of a finite collection is an important restriction here, because it lets one create collections by generating rules while avoiding the mathematical (set-theoretical) problems that such rules can cause.
    The definition above is still not final, in the sense that we are still discussing the alternatives indicated by the slashes '/'. For example, there are good reasons to see a collection as an unordered 'set' in an abstract sense, but in most implementations it will usually be a list (where the ordering might play an explicit or implicit role), and therefore we have to handle this possibility anyway.
    From my point of view the idea from Reagan is interesting, because it adds the community's needs as a further aspect of collections, and one can mention something like that in addition. But again, terms like arrangement are too far from being well defined to be used as a basis for automated services.
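    To make the finiteness condition a bit more concrete, here is a small Python sketch under several assumptions of mine: the Collection class, the registry dictionary and the limit parameter are hypothetical and not part of the WG definition, and a budget like limit can only report that finiteness was not shown, it cannot prove that a collection is infinite.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Collection:
    pid: str
    members: List[str] = field(default_factory=list)      # explicitly named PIDs/Ids
    rule: Optional[Callable[[], Iterable[str]]] = None     # optional generating rule

    def member_pids(self):
        yield from self.members
        if self.rule is not None:
            yield from self.rule()


def resolve(pid, registry, limit=10_000):
    """Iteratively resolve sub-collections of `pid`.

    Returns the set of non-collection PIDs, or None if more than `limit`
    PIDs were visited (finiteness could not be shown within the budget).
    """
    leaves, visited, stack = set(), set(), [pid]
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # a collection that contains itself is resolved only once
        visited.add(current)
        if len(visited) > limit:
            return None
        collection = registry.get(current)
        if collection is None:
            leaves.add(current)  # not a collection: an ordinary PID/Id
            continue
        # Read lazily so that an unbounded rule cannot hang the resolver.
        stack.extend(itertools.islice(collection.member_pids(), limit + 1))
    return leaves


registry = {
    "pid:root": Collection("pid:root", members=["pid:1", "pid:sub"]),
    "pid:sub": Collection("pid:sub", members=["pid:2", "pid:root"]),  # cycle back to root
    "pid:stream": Collection(
        "pid:stream", rule=lambda: (f"pid:s{i}" for i in itertools.count())
    ),
}

finite_result = resolve("pid:root", registry)
stream_result = resolve("pid:stream", registry, limit=100)
```

    In this sketch a cyclic pair of collections still resolves to a finite set, while the ever-growing stream collection exhausts the budget and is reported as None.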
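    One possible way to serve both views at once is a list that keeps the order plus a set that gives fast membership tests. The following Python sketch is only an illustration; the class name OrderedCollection and its methods are hypothetical.

```python
class OrderedCollection:
    """List-backed collection with a set index for fast membership tests."""

    def __init__(self, pids=()):
        self._order = []     # keeps insertion order (the 'list' view)
        self._index = set()  # keeps uniqueness and fast lookup (the 'set' view)
        for pid in pids:
            self.add(pid)

    def add(self, pid):
        """Add a PID; return False if it was already a member."""
        if pid in self._index:
            return False
        self._order.append(pid)
        self._index.add(pid)
        return True

    def __contains__(self, pid):
        return pid in self._index

    def __iter__(self):
        return iter(self._order)

    def __len__(self):
        return len(self._order)

    def as_set(self):
        """Order-free view, for comparing collections in the abstract sense."""
        return frozenset(self._index)


first = OrderedCollection(["pid:b", "pid:a", "pid:b"])
second = OrderedCollection(["pid:a", "pid:b"])
same_members = first.as_set() == second.as_set()
```

    Two such collections with different orderings compare as equal sets, so the abstract definition can stay set-based even when the implementation keeps a list.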
    On 11.04.2016 at 10:11, ThomasZastrow wrote:
    Hi Gary,
    The Research Data Collection group also started to do some work regarding the definition of basic terms like "collections". Fortunately, the TeD-T tool supports multiple definitions and scopes.
    Our final definition will be narrower, but in our group we need to arrive at a concrete specification / implementation:
    http://smw-rda.esc.rzg.mpg.de/index.php/Collection
    (Using the scope "BOF PID Collection")
    Best,
    Tom
    On 10.04.2016 at 17:36, Gary wrote:
    The various types of data aggregation and what we call them have been a topic in several RDA groups. "Data set/dataset", "Digital Collection" and "data series" are a few of the frequently used terms. In the DFT WG snapshot document we had an initial definition of "Digital Collection" as:
    A digital collection is an aggregation which contains DOs and DEs. The collection is identified by a PID and described by metadata.
    Note: A digital collection is a (complex) DO.
    Note: A digital collection is an aggregation in so far as there are other types of aggregations.
    There was probably too little discussion of this and related concepts and so I have tried to continue the conversation with relevant people and groups.
    A recent conversation was with Reagan Moore, who provided some ideas (perhaps from a policy point of view) as listed below. I thought that they might serve as a basis for more conversation.
    1. Reagan "Digital collections implement arrangement by a community for organizing their digital entities."
    Gary comment - this makes the point that aggregations serve community needs and thus will vary. There may then not be external labels for all of these types of arrangements. Maybe the best we can do is to have some broad categories into which different types of arrangements fit.
    2. Reagan "Data series is used by NARA to define the sequence of records archived by a federal agency under a submission agreement control."
    Gary comment - I like this as a way of grounding ourselves in an authoritative source, NARA, as a basis for data series. They merely add a time dimension to files and digital sets. But does this work for everyone, and if not, how would their definition differ from NARA's? See http://smw-rda.esc.rzg.mpg.de/index.php/Dataset_series for our attempt as part of the DFT WG.
    3. Reagan "A data series is also used to denote the sequence of data received from a sensor."
    Gary discussion - This introduces a more specific type of data series - a "sensor-based data series."
    4. Reagan "A data set nominally identifies a discrete set of digital entities."
    Gary comment - We might need to explain the arrangement basis for the "discrete set." Note how many alternative ideas on dataset we had when discussing this in the DFT WG; see http://smw-rda.esc.rzg.mpg.de/index.php/Data_Set
    5. Reagan "A data stream denotes the sequence of data received from a sensor."
    Gary comment - We did not have the sensor as source in our working definition, but this was perhaps included or implied in the context of messaging. See http://smw-rda.esc.rzg.mpg.de/index.php/Data_Stream
    Comments on the above ideas would be appreciated.
    --
    Full post: https://rd-alliance.org/group/data-fabric-ig/post/some-thoughts-data-agg...
    Manage my subscriptions: https://rd-alliance.org/mailinglist
    Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/51939
    --
    Dr. Thomas Zastrow
    Max Planck Computing and Data Facility (MPCDF)
    Gießenbachstr. 2, D-85748 Garching bei München, Germany
    Tel +49-89-3299-1457
    http://www.mpcdf.de
    --
    With kind regards
    Ulrich Schwardmann
    Phone: +49-551-201-1542 Email: ***@***.***
    Gesellschaft fuer wissenschaftliche
    Datenverarbeitung mbH Goettingen (GWDG)
    Am Fassberg 11 D-37077 Goettingen Germany
    URL: http://www.gwdg.de E-Mail: ***@***.***
    Tel.: +49 (0)551 201-1510 Fax: +49 (0)551 201-2150
    Managing Director: Prof. Dr. Ramin Yahyapour
    Chairman of the Supervisory Board: Dipl.-Kfm. Markus Hoppe
    Registered office: Goettingen Court of registration: Goettingen
    Commercial register no. B 598 Certified to ISO 9001
    --
    Tobias Weigel
    Data Management Department
    Deutsches Klimarechenzentrum GmbH (DKRZ)
    Bundesstraße 45 a • 20146 Hamburg • Germany
    Phone: +49 40 460094-104
    Email: ***@***.***
    URL: http://www.dkrz.de
    ORCID: orcid.org/0000-0002-4040-0215
    Managing Director: Prof. Dr. Thomas Ludwig
    Registered office: Hamburg
    Hamburg District Court, HRB 39784
