Some thoughts on "Data Aggregations" terminology & concepts

10 Apr 2016

The various types of data aggregation, and what we call them, have been a topic in several RDA groups. "Data set/dataset", "Digital Collection" and "data series" are a few of the frequently used terms. In the DFT WG snapshot document we had an initial definition of "Digital Collection" as:

A digital collection is an aggregation which contains DOs and DEs. The collection is identified by a PID and described by metadata.

Note: A digital collection is a (complex) DO.

Note: A digital collection is an aggregation in so far as there are other types of aggregations.

There was probably too little discussion of this and related concepts, so I have tried to continue the conversation with relevant people and groups.

A recent one was with Reagan Moore, who provided some ideas (perhaps from a policy point of view), listed below. I thought that they might serve as a basis for more conversation.

 

1. Reagan "Digital collections implement arrangement by a community for organizing their digital entities."

Gary comment - this makes the point that aggregations serve community needs and thus will vary.  There may then not be external labels for all of these types of arrangements.  Maybe the best we can do is to have some broad categories into which different types of arrangements fit.

 

2. Reagan "Data series is used by NARA to define the sequence of records archived by a federal agency under a submission agreement control."

Gary comment - I like this as a way of grounding ourselves in an authoritative source, NARA, as a basis for data series. They merely add a time dimension to files and digital sets. But does this work for everyone, and if not, how would their definition differ from NARA's? See http://smw-rda.esc.rzg.mpg.de/index.php/Dataset_series for our attempt as part of the DFT WG.

 

3. Reagan "A data series is also used to denote the sequence of data received from a sensor."

Gary comment - This introduces a more specific type of data series - a "sensor-based data series."

 

4. Reagan "A data set nominally identifies a discrete set of digital entities."

Gary comment - We might need to explain the arrangement basis for the "discrete set." Note how many alternate ideas on dataset we had when discussing this in the DFT WG; see http://smw-rda.esc.rzg.mpg.de/index.php/Data_Set

 

5. Reagan "A data stream denotes the sequence of data received from a sensor."

Gary comment - We did not have the sensor as a source in our working definition, but this was perhaps included or implied in the context of messaging. See http://smw-rda.esc.rzg.mpg.de/index.php/Data_Stream

 

Comments on the above ideas would be appreciated.

  • Author: Thomas Zastrow

    Date: 11 Apr, 2016

    Hi Gary,
    The Research Data Collection group also started to do some work
    regarding the definition of basic terms like "collections". Fortunately,
    the TeD-T tool supports multiple definitions and scopes.
    Our final definition will be more narrow, but in our group we need to
    come to a concrete specification / implementation:
    http://smw-rda.esc.rzg.mpg.de/index.php/Collection
    (Using the scope "BOF PID Collection")
    Best,
    Tom

  • Author: Ulrich Schwardmann

    Date: 11 Apr, 2016

    Dear Gary, all,
    as Thomas already mentioned, in the last VC of the Collections WG we saw
    the necessity to have a relatively rigid and precise definition of what
    a digital collection should be in the sense of that WG. This definition
    is still under discussion and currently given as the fourth of currently
    three such definitions at
    http://smw-rda.esc.rzg.mpg.de/index.php/Collection
    and the one in the DFT WG snapshot document. The current definition of
    the collection WG is:
    (
    Definition A collection is a PID pointing to a digital object
    consisting of a set/list of PIDs/Ids and a set of additional
    pointers/links and metadata together with each PID/Id.
    A collection can be given explicitly by naming each PID/Id directly as
    well as implicitly by a generating rule.
    By definition a collection can contain other "sub-"collections.
    A collection is called finite, if the set of PIDs/Ids, generated by
    iteratively resolving its "sub-"collections, is finite.
    )
    which is relatively abstract, tries to use mathematical terms like sets
    or lists or simple constructions like PIDs and pointers, and avoids
    relying on other relatively undefined terms like aggregations and DEs.
    A DO is complicated enough that avoiding this term as well is under
    discussion, but currently there is no good alternative.
    The reason for such an attempt was that we were discussing several
    concepts, like data streams, that are used and need to be referenced
    but that continually collect additional data over time, which makes it
    necessary to get versioning under control for such references. The
    idea of the Collections WG is to pave the way for automated services on
    collections. With a definition like the one above we are much better
    able to handle different representations of such a use case and to
    classify them.
    From my point of view, the use of generating rules in particular opens
    up a huge range of possibilities. And the definition of a finite
    collection is an important restriction here, as this way one is able to
    create collections by generating rules while avoiding the mathematical
    (set-theoretical) problems that such rules can cause.
    The definition above is still not final, in the sense that we are
    still discussing the alternatives given by the slashes '/'. For example,
    there are good reasons to see a collection as an unordered 'set' in an
    abstract sense, but in most implementations it will usually be a list
    (where the ordering might play an explicit or implicit role), and
    therefore we have to handle this possibility anyway.
    From my point of view the idea from Reagan is interesting, as it
    adds the community's needs as an additional aspect of collections,
    and one can mention something like that in addition. But again, terms
    like "arrangement" are too far from being well defined to be used for
    creating automated services on them.
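
The working definition above can be sketched in code, purely as an illustration. The names `Collection`, `REGISTRY`, `register` and `resolve`, and the example PIDs, are hypothetical and not part of the Collections WG definition. A generating rule is modelled here simply as a Python callable that returns PIDs, and the `max_items` cut-off is only a practical stand-in for the "finite" condition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Set


@dataclass
class Collection:
    """A digital object, reachable by a PID, that lists member PIDs/Ids."""
    pid: str
    members: List[str] = field(default_factory=list)        # explicit naming
    rule: Optional[Callable[[], Iterable[str]]] = None       # implicit naming (assumed form)
    metadata: Dict[str, dict] = field(default_factory=dict)  # metadata per member PID/Id

    def member_pids(self) -> List[str]:
        generated = list(self.rule()) if self.rule else []
        return self.members + generated


# Hypothetical PID registry: maps a PID to the collection it points to.
REGISTRY: Dict[str, Collection] = {}


def register(collection: Collection) -> None:
    REGISTRY[collection.pid] = collection


def resolve(pid: str, max_items: int = 10_000) -> Set[str]:
    """Iteratively resolve sub-collections into the set of non-collection PIDs.

    Raises ValueError once more than max_items PIDs have been visited. This
    cannot prove finiteness in general; it only bounds the work done.
    """
    leaves: Set[str] = set()
    seen: Set[str] = set()
    stack = [pid]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited, e.g. a collection that contains itself
        seen.add(current)
        if len(seen) > max_items:
            raise ValueError("resolution exceeded max_items; collection may not be finite")
        if current in REGISTRY:
            stack.extend(REGISTRY[current].member_pids())
        else:
            leaves.add(current)
    return leaves


register(Collection("pid:sub", members=["pid:a", "pid:b"]))
register(Collection("pid:top", members=["pid:sub", "pid:c"],
                    rule=lambda: (f"pid:gen{i}" for i in range(3))))
print(sorted(resolve("pid:top")))
```

Because a member PID may itself point to a collection, the recursion falls out of the data rather than out of the definition, which is one way to read the intent of defining a collection in terms of PIDs.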

  • Author: Reagan Moore

    Date: 11 Apr, 2016

    Gary:
    Data grids rely upon multiple abstractions for collections:
    * Logical collection. This denotes the ability to name and arrange digital objects independently of their physical location and the naming convention used on a storage system. Note that a logical collection has properties related to naming, arrangement, access controls, management policies, distribution, retention, disposition, and metadata (provenance, description, representation, administrative). These properties are associated with the logical collection. A typical property is a PID. Multiple versions of a PID can be associated with a logical collection.
    * Virtual collection. The arrangement of digital objects into a collection can be done through a query on the logical collection properties. The result of the query can be presented as a browsable collection that links each entity back to the logical collection. The logical collection in turn links the digital objects back to a physical entity.
    Reagan Moore
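
The two abstractions above can be illustrated with a small sketch. This is not how any particular data grid implements them; the class names, paths, storage locations and property keys are all made up for the example. The logical collection maps storage-independent names to physical locations and properties, and the virtual collection is just the result of a query over those properties.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Entry:
    """One digital object as seen through the logical collection."""
    logical_name: str        # name chosen independently of the storage system
    physical_location: str   # where the bytes actually live (hypothetical format)
    properties: Dict[str, str] = field(default_factory=dict)


class LogicalCollection:
    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: Dict[str, Entry] = {}

    def add(self, entry: Entry) -> None:
        self.entries[entry.logical_name] = entry

    def locate(self, logical_name: str) -> str:
        """Link a logical name back to its physical entity."""
        return self.entries[logical_name].physical_location

    def virtual_collection(self, predicate: Callable[[Dict[str, str]], bool]) -> List[str]:
        """Query on the properties. The result is a list of logical names,
        so every hit links back into this logical collection."""
        return sorted(n for n, e in self.entries.items() if predicate(e.properties))


lc = LogicalCollection("/zone/home/project")
lc.add(Entry("/zone/home/project/run1.nc", "disk-a:/vault/0001", {"pid": "pid:1", "sensor": "s1"}))
lc.add(Entry("/zone/home/project/run2.nc", "tape-b:/arch/0002", {"pid": "pid:2", "sensor": "s2"}))
lc.add(Entry("/zone/home/project/run3.nc", "disk-a:/vault/0003", {"pid": "pid:3", "sensor": "s1"}))

hits = lc.virtual_collection(lambda p: p.get("sensor") == "s1")
print(hits)
print([lc.locate(name) for name in hits])
```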

  • Author: Keith Jeffery

    Date: 11 Apr, 2016

    Gary, Reagan -
    The first may be regarded as some kind of index (hopefully a metadata index) and the second is more-or-less a database 'view'
    Best
    Keith
    Keith G Jeffery Consultants
    Prof Keith G Jeffery
    E: ***@***.***
    T: +44 7768 446088
    S: keithgjeffery
    Past President ERCIM www.ercim.eu (***@***.***)
    Past President euroCRIS www.eurocris.org
    Past Vice President VLDB www.vldb.org
    Fellow (CITP, CEng) BCS www.bcs.org
    Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
    Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
    Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html

  • Author: Tobias Weigel

    Date: 11 Apr, 2016

    Hi Ulrich, Gary,
    I think this is a very timely and much needed discussion. I like
    Ulrich's idea to boil this down to the mathematical definitions because
    I also agree that this reduces the ambiguity and there are some
    well-known concepts we can reuse. At least at this abstract level, we
    then won't have to define e.g. a Digital Object in all its meaning at first.
    Ulrich - can you give an example for a generation rule? I think I get
    the direction in which you are heading, but I am not sure I understand
    the variety of possibilities you hint at.
    I am not so sure that collection implementations will mostly be lists -
    there is a clear advantage in terms of computational efficiency in using
    unordered sets (distributed hash maps, NoSQL storage and so on). In my
    mind, both set and list implementations are valid choices with
    trade-offs depending on a concrete use case.
    Best, Tobias
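
The set-versus-list trade-off mentioned above can be seen in a few lines. This is only an illustration using Python's built-in types with made-up PIDs; a distributed hash map or NoSQL store would behave analogously for membership tests but is not shown here.

```python
member_list = ["pid:c", "pid:a", "pid:c", "pid:b"]  # order and duplicates are kept
member_set = set(member_list)                        # neither order nor duplicates

# A list can answer positional questions; a set cannot.
first = member_list[0]

# Both answer membership, but a set does so by hashing (average O(1)),
# while a list scans its items (O(n)).
in_list = "pid:b" in member_list
in_set = "pid:b" in member_set

# Set algebra is available directly on sets.
other = {"pid:b", "pid:d"}
shared = member_set & other

print(first, in_list, in_set, len(member_list), len(member_set), sorted(shared))
```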

  • Author: Reagan Moore

    Date: 11 Apr, 2016

    Keith:
    The challenge is managing the index using an arbitrary choice of technology. You can store the information in a relational database, or a graph database, or a NoSQL database, or a spreadsheet. The interface requires three basic operations: create, read, delete. A query can be deconstructed into these operations, which can then be mapped to the choice of technology.
    Thus a PID is an attribute on a digital object which can be queried to find the rest of the state information. But you can query any piece of the state information to return a characterization of the digital objects.
    Reagan Moore
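
A minimal sketch of the three-operation interface described above, under stated assumptions: the names `StateStore`, `DictStore`, `query` and `state_of` are hypothetical, the state information is modelled as (object, attribute, value) triples, and an in-memory dictionary stands in for whichever backend is chosen. Queries are written only in terms of `read`, so they do not depend on the storage technology.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class StateStore(ABC):
    """Hypothetical minimal interface: create, read, delete."""

    @abstractmethod
    def create(self, object_id: str, attribute: str, value: str) -> None: ...

    @abstractmethod
    def read(self) -> Iterator[Tuple[str, str, str]]: ...

    @abstractmethod
    def delete(self, object_id: str, attribute: str) -> None: ...


class DictStore(StateStore):
    """One possible backend; a relational, graph or NoSQL store could
    implement the same three calls."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], str] = {}

    def create(self, object_id: str, attribute: str, value: str) -> None:
        self._rows[(object_id, attribute)] = value

    def read(self) -> Iterator[Tuple[str, str, str]]:
        for (object_id, attribute), value in self._rows.items():
            yield object_id, attribute, value

    def delete(self, object_id: str, attribute: str) -> None:
        self._rows.pop((object_id, attribute), None)


def query(store: StateStore, attribute: str, value: str) -> List[str]:
    """Find objects by any piece of state information, using read() only."""
    return sorted(o for o, a, v in store.read() if a == attribute and v == value)


def state_of(store: StateStore, object_id: str) -> Dict[str, str]:
    """Return the rest of the state information for one object."""
    return {a: v for o, a, v in store.read() if o == object_id}


store = DictStore()
store.create("obj1", "pid", "pid:1")
store.create("obj1", "owner", "group-a")
store.create("obj2", "pid", "pid:2")
store.create("obj2", "owner", "group-a")

found = query(store, "pid", "pid:1")      # look up by PID ...
print(found, state_of(store, found[0]))   # ... then fetch the remaining state
print(query(store, "owner", "group-a"))   # or query on any other attribute
```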

  • Author: Gary Berg-Cross

    Date: 11 Apr, 2016

    We have a good discussion started here and I'm glad that some think it timely. A major challenge is to proceed with the exchange and somehow forge a consensus so we have a useful definition, or family of definitions, that serves the community.

    I did have a question for Ulrich on his working definition: "A collection is a PID pointing to a digital object
    consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id."

     

    I don't understand the first part of this. Why do you say "A collection is a PID"? I think of the collection as something that includes an attribute of a PID, as I think that Reagan has said. In some of our attempts we suggest that a collection is the result of some type of aggregation process and that there are some metadata used to document the (digital) collection.

  • Author: Ulrich Schwardmann

    Date: 11 Apr, 2016

    Hi Gary,
    On 11.04.2016 at 17:39, Gary wrote:
    > I don't understand the first part of this. Why do you say "A
    > collection is a PID"?
    >
    This is because this way a collection uses a relatively well-defined
    term (PID) and also becomes a recursive structure in a very simple
    way, because it contains PIDs that can again be collections. This way
    one defines a very rich structure with very simple terms. One shouldn't
    define a collection as something that can contain collections, because
    the definition would then need itself in order to be defined.
    On the other hand, everything more domain specific can be described in
    the metadata or data of whatever the PID is pointing to, and thus can
    be hidden behind the term digital object with its definition and all its
    fuzziness.

  • Author: Gary Berg-Cross

    Date: 11 Apr, 2016

    Ulrich,
    It will be interesting to see what others make of this idea of foundations
    for Collection:
    "This is because this way a collection uses a relatively well-defined term
    (PID) and also becomes a recursive structure in a very simple way,
    because it contains PIDs that can again be collections."
    For me, as in geometry, you have some assumed primitives to build on, which
    aren't defined in the same way as the things that proceed from the primitives.
    But, again to me, there is an issue of what the underlying "substance" is
    in this reductionist approach.
    A PID seems to me to be a different TYPE of thing than a COLLECTION, so I would
    like to have some other concept at its base for understanding. That is why
    I/we looked to a more general idea of AGGREGATION as the base.
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    Member, Ontolog Board of Trustees
    Independent Consultant
    Potomac, MD
    240-426-0770

  • Author: Keith Jeffery

    Date: 13 Apr, 2016

    Reagan -
    As usual we agree; the index can be made more useful, though, if it is structured to reflect the collection (of collections...) - this allows more efficient query and also the ability to partition/fragment the collections themselves.
    Best
    Keith
