The various types of data aggregation, and what we call them, have been a topic in several RDA groups. "Data set/dataset", "Digital Collection", and "data series" are a few of the frequently used terms. In the DFT WG snapshot document we had an initial definition of "Digital Collection" as:
A digital collection is an aggregation which contains DOs and DEs. The collection is identified by a PID and described by metadata.
Note: A digital collection is a (complex) DO.
Note: A digital collection is an aggregation in so far as there are other types of aggregations.
There was probably too little discussion of this and related concepts, so I have tried to continue the conversation with relevant people and groups.
A recent exchange was with Reagan Moore, who provided some ideas (perhaps from a policy point of view), listed below. I thought they might serve as a basis for more conversation.
1. Reagan "Digital collections implement arrangement by a community for organizing their digital entities."
Gary comment - this makes the point that aggregations serve community needs and thus will vary. There may then not be external labels for all of these types of arrangements. Maybe the best we can do is to have some broad categories into which different types of arrangements fit.
2. Reagan "Data series is used by NARA to define the sequence of records archived by a federal agency under a submission agreement control."
Gary comment - I like this as a way of grounding ourselves in an authoritative source, NARA, as a basis for data series. They merely add a time dimension to files and digital sets. But does this work for everyone, and if not, how would their definition differ from NARA's? See http://smw-rda.esc.rzg.mpg.de/index.php/Dataset_series for our attempt as part of the DFT WG.
3. Reagan "A data series is also used to denote the sequence of data received from a sensor."
Gary comment - This introduces a more specific type of data series - a "sensor-based data series."
4. Reagan "A data set nominally identifies a discrete set of digital entities."
Gary comment - We might need to explain the arrangement basis for the "discrete set." Note how many alternate ideas on dataset we had when discussing this in the DFT WG; see http://smw-rda.esc.rzg.mpg.de/index.php/Data_Set
5. Reagan "A data stream denotes the sequence of data received from a sensor."
Gary comment - We did not have the sensor as a source in our working definition, but this was perhaps included or implied in the context of messaging. See http://smw-rda.esc.rzg.mpg.de/index.php/Data_Stream
Comments on the above ideas would be appreciated.
Author: Thomas Zastrow
Date: 11 Apr, 2016
Hi Gary,
The Research Data Collection group also started to do some work
regarding the definition of basic terms like "collections". Fortunately,
the TeD-T tool supports multiple definitions and scopes.
Our final definition will be more narrow, but in our group we need to
come to a concrete specification / implementation:
http://smw-rda.esc.rzg.mpg.de/index.php/Collection
(Using the scope "BOF PID Collection")
Best,
Tom
Author: Ulrich Schwardmann
Date: 11 Apr, 2016
Dear Gary, all,
as Thomas already mentioned, in the last VC of the Collections WG we saw
the necessity to have a relatively rigid and precise definition of what
a digital collection should be in the sense of that WG. This definition
is still under discussion and is currently listed as the fourth
definition, next to the three others at
http://smw-rda.esc.rzg.mpg.de/index.php/Collection
and the one in the DFT WG snapshot document. The current definition of
the Collections WG is:
(
Definition: A collection is a PID pointing to a digital object
consisting of a set/list of PIDs/Ids and a set of additional
pointers/links and metadata together with each PID/Id.
A collection can be given explicitly by naming each PID/Id directly as
well as implicitly by a generating rule.
By definition a collection can contain other "sub-"collections.
A collection is called finite, if the set of PIDs/Ids, generated by
iteratively resolving its "sub-"collections, is finite.
)
which is relatively abstract, tries to use mathematical terms like sets
or lists or simple constructions like PIDs and pointers, and avoids
relying on other relatively undefined terms like aggregations and DEs.
The term DO is complicated enough that avoiding it as well is under
discussion, but currently there is no good alternative.
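As an illustration only, the definition above could be read as the following minimal Python sketch. The class names `Member` and `Collection`, the in-memory `registry` that stands in for PID resolution, and the `hdl:demo/...` identifiers are all assumptions made for this example; none of them come from the Collections WG.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Member:
    """One entry of a collection: a PID/Id with its own links and metadata."""
    pid: str
    links: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Collection:
    """A PID pointing to a digital object that lists member PIDs/Ids."""
    pid: str
    members: List[Member] = field(default_factory=list)

    def member_pids(self) -> List[str]:
        return [m.pid for m in self.members]


# Hypothetical stand-in for a PID resolver: maps a PID to the collection
# object it points to. PIDs that are not in the registry are plain objects.
registry: Dict[str, Collection] = {}


def register(collection: Collection) -> Collection:
    registry[collection.pid] = collection
    return collection


def is_collection(pid: str) -> bool:
    return pid in registry


# A "sub-"collection is just a member whose PID resolves to a collection.
sub = register(Collection("hdl:demo/sub", [Member("hdl:demo/a"), Member("hdl:demo/b")]))
top = register(Collection("hdl:demo/top", [
    Member("hdl:demo/sub", metadata={"role": "sub-collection"}),
    Member("hdl:demo/c", links=["http://example.org/c"]),
]))
```

In this reading the recursion needs no special case: a member PID either resolves to another collection or it does not.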
The reason for such an attempt was that we were discussing several
concepts, like data streams, that are used and need to be referenced,
but that continuously collect additional data over time, which makes it
necessary to get the versioning under control for such references. The
idea of the Collections WG is to pave the way for automated services on
collections. With a definition such as the one above we are much better
able to handle different representations of such a use case and to
classify them.
From my point of view, the use of generating rules in particular allows
a huge number of possibilities. The definition of a finite
collection is an important restriction here, as this way one is able to
create collections by generating rules but avoids the mathematical
(set-theoretical) problems that can be caused this way.
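A small Python sketch may help to picture generating rules and the finiteness condition. It assumes that a rule is simply a callable that yields member PIDs, and it approximates "finite" with a cut-off, since finiteness of an arbitrary rule cannot be decided in general. The names `explicit`, `resolve`, and `limit`, as well as the `c:` and `obj:` identifiers, are hypothetical.

```python
from itertools import count
from typing import Callable, Dict, Iterable, Optional, Set

# Assumption: a generating rule is a zero-argument callable yielding PIDs.
Rule = Callable[[], Iterable[str]]


def explicit(*pids: str) -> Rule:
    """An explicitly given collection is the trivial rule 'yield these PIDs'."""
    return lambda: pids


def resolve(root: str, rules: Dict[str, Rule], limit: int = 1000) -> Optional[Set[str]]:
    """Iteratively resolve sub-collections starting at `root`.

    Returns the set of PIDs found, or None if a rule yields more than
    `limit` items or more than `limit` PIDs accumulate. None therefore
    means "gave up", not a proof that the collection is infinite.
    """
    seen: Set[str] = set()
    pending = [root]
    while pending:
        pid = pending.pop()
        rule = rules.get(pid)
        if rule is None:
            continue  # not a collection, so nothing to resolve
        for i, member in enumerate(rule()):
            if i >= limit:
                return None
            if member not in seen:  # the seen-set also stops cycles
                seen.add(member)
                if len(seen) > limit:
                    return None
                pending.append(member)
    return seen


rules: Dict[str, Rule] = {
    "c:top": explicit("c:sub", "obj:1"),
    "c:sub": explicit("obj:2", "c:top"),                 # refers back to its parent
    "c:stream": lambda: (f"obj:{n}" for n in count()),   # unbounded rule
}

finite_result = resolve("c:top", rules)
stream_result = resolve("c:stream", rules, limit=50)
```

Here `c:top` resolves to a finite set even though the two collections refer to each other, while the unbounded `c:stream` rule hits the cut-off.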
The definition above is still not final in the sense that we are
still discussing the alternatives given by the slashes '/'. For example,
there are good reasons to see a collection as an unordered 'set' in an
abstract sense, but in most implementations it will usually be a list
(where the ordering might play an explicit or implicit role), and
therefore we have to handle this possibility anyway.
From my point of view the idea from Reagan is interesting, as it
adds the community's needs as an additional aspect of collections,
and one can mention something like that additionally. But again, the
terms arrangement etc. are too far from being well defined, so
they cannot be used to create automated services on them.
Author: Reagan Moore
Date: 11 Apr, 2016
Gary:
Data grids rely upon multiple abstractions for collections:
* Logical collection. This denotes the ability to name and arrange digital objects independently of their physical location and the naming convention used on a storage system. Note that a logical collection has properties related to naming, arrangement, access controls, management policies, distribution, retention, disposition, and metadata (provenance, description, representation, administrative). These properties are associated with the logical collection. A typical property is a PID. Multiple versions of a PID can be associated with a logical collection.
* Virtual collection. The arrangement of digital objects into a collection can be done through a query on the logical collection properties. The result of the query can be presented as a browsable collection that links each entity back to the logical collection. The logical collection in turn links the digital objects back to a physical entity.
Reagan Moore
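The two abstractions above can be pictured with a short Python sketch. It is only one possible reading, not how any particular data grid implements them: `LogicalEntry`, `LogicalCollection`, `VirtualCollection`, the property names, and the example paths are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LogicalEntry:
    logical_name: str        # name chosen by the community
    physical_location: str   # where the storage system keeps the object
    properties: Dict[str, str] = field(default_factory=dict)


class LogicalCollection:
    """Names and arranges objects independently of their physical location."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._entries: Dict[str, LogicalEntry] = {}

    def add(self, entry: LogicalEntry) -> None:
        self._entries[entry.logical_name] = entry

    def locate(self, logical_name: str) -> str:
        """Link a logical name back to its physical entity."""
        return self._entries[logical_name].physical_location

    def query(self, predicate: Callable[[Dict[str, str]], bool]) -> "VirtualCollection":
        """Build a virtual collection from a query on the properties."""
        names = [n for n, e in self._entries.items() if predicate(e.properties)]
        return VirtualCollection(self, names)


@dataclass
class VirtualCollection:
    """A query result that only links back to the logical collection."""
    source: LogicalCollection
    names: List[str]

    def locations(self) -> List[str]:
        return [self.source.locate(n) for n in self.names]


lc = LogicalCollection("ocean-project")
lc.add(LogicalEntry("/project/run1/temp.nc", "s3://bucket-a/0001", {"instrument": "buoy"}))
lc.add(LogicalEntry("/project/run1/wind.nc", "file:///tape/77/x9", {"instrument": "mast"}))

buoy_view = lc.query(lambda props: props.get("instrument") == "buoy")
```

The virtual collection stores no data and no locations of its own; it reaches the physical entity only through the logical collection.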
Author: Keith Jeffery
Date: 11 Apr, 2016
Gary, Reagan -
The first may be regarded as some kind of index (hopefully a metadata index) and the second is more-or-less a database 'view'
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM www.ercim.eu (***@***.***)
Past President euroCRIS www.eurocris.org
Past Vice President VLDB www.vldb.org
Fellow (CITP, CEng) BCS www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
Author: Tobias Weigel
Date: 11 Apr, 2016
Hi Ulrich, Gary,
I think this is a very timely and much needed discussion. I like
Ulrich's idea to boil this down to the mathematical definitions because
I also agree that this reduces the ambiguity and there are some
well-known concepts we can reuse. At least at this abstract level, we
then won't have to define e.g. a Digital Object in all its meaning at first.
Ulrich - can you give an example for a generation rule? I think I get
the direction in which you are heading, but I am not sure I understand
the variety of possibilities you hint at.
I am not so sure that collection implementations will mostly be lists -
there is a clear advantage in terms of computational efficiency in using
unordered sets (distributed hash maps, NoSQL storage and so on). In my
mind, both set and list implementations are valid choices with
trade-offs depending on a concrete use case.
Best, Tobias
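The set versus list trade-off mentioned above can be shown with plain Python built-ins. This is a minimal sketch with made-up `pid:` identifiers; it says nothing about which choice a real collection service should make.

```python
# The same three member PIDs, stored two ways.
as_list = ["pid:3", "pid:1", "pid:2"]   # keeps insertion order, allows duplicates
as_set = frozenset(as_list)             # no order, no duplicates, hashed lookup

# Equality: lists compare element by element in order, sets ignore order.
same_members_other_order = ["pid:1", "pid:2", "pid:3"]
lists_equal = as_list == same_members_other_order
sets_equal = as_set == frozenset(same_members_other_order)

# Membership works on both; on a list it is a linear scan, on a set it is
# a hash lookup (constant time on average).
in_list = "pid:2" in as_list
in_set = "pid:2" in as_set

# Only the list can answer positional questions such as "what comes first?".
first_member = as_list[0]
```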
Author: Reagan Moore
Date: 11 Apr, 2016
Keith:
The challenge is managing the index using an arbitrary choice of technology. You can store the information in a relational database, or a graph database, or a NoSQL database, or a spreadsheet. The interface requires three basic operations: create, read, delete. A query can be deconstructed into these operations, which can then be mapped to the choice of technology.
Thus a PID is an attribute on a digital object which can be queried to find the rest of the state information. But you can query any piece of the state information to return a characterization of the digital objects.
Reagan Moore
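To illustrate the point about a technology-independent interface, here is a minimal Python sketch with two interchangeable backends, one in memory and one using the standard-library `sqlite3` module. The triple layout `(pid, attribute, value)`, the class names `MemoryIndex` and `SqliteIndex`, and the helper functions are assumptions for this example, not a description of any existing system.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str, str]  # assumed layout: (pid, attribute, value)


class MemoryIndex:
    """State information kept in a plain Python list."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def create(self, pid: str, attr: str, value: str) -> None:
        self.rows.append((pid, attr, value))

    def read(self, pid: Optional[str] = None, attr: Optional[str] = None,
             value: Optional[str] = None) -> List[Row]:
        wanted = (pid, attr, value)  # None means "match anything"
        return [r for r in self.rows
                if all(w is None or w == x for w, x in zip(wanted, r))]

    def delete(self, pid: str, attr: str) -> None:
        self.rows = [r for r in self.rows if not (r[0] == pid and r[1] == attr)]


class SqliteIndex:
    """The same three operations mapped onto a relational database."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE state (pid TEXT, attr TEXT, value TEXT)")

    def create(self, pid: str, attr: str, value: str) -> None:
        self.db.execute("INSERT INTO state VALUES (?, ?, ?)", (pid, attr, value))

    def read(self, pid: Optional[str] = None, attr: Optional[str] = None,
             value: Optional[str] = None) -> List[Row]:
        sql = ("SELECT pid, attr, value FROM state "
               "WHERE (? IS NULL OR pid = ?) "
               "AND (? IS NULL OR attr = ?) "
               "AND (? IS NULL OR value = ?)")
        return self.db.execute(sql, (pid, pid, attr, attr, value, value)).fetchall()

    def delete(self, pid: str, attr: str) -> None:
        self.db.execute("DELETE FROM state WHERE pid = ? AND attr = ?", (pid, attr))


def find_pids(index, attr: str, value: str) -> List[str]:
    """Query by any piece of state information, expressed as a read."""
    return sorted({pid for pid, _, _ in index.read(attr=attr, value=value)})


def describe(index, pid: str) -> Dict[str, str]:
    """Use the PID to find the rest of the state information."""
    return {attr: value for _, attr, value in index.read(pid=pid)}


def demo(index):
    index.create("hdl:demo/1", "type", "collection")
    index.create("hdl:demo/1", "owner", "lab-a")
    index.create("hdl:demo/2", "type", "collection")
    index.delete("hdl:demo/2", "type")
    return find_pids(index, "type", "collection"), describe(index, "hdl:demo/1")
```

Calling `demo(MemoryIndex())` and `demo(SqliteIndex())` runs the same sequence of operations against both backends, so the calling code does not need to know which technology holds the index.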
Author: Gary Berg-Cross
Date: 11 Apr, 2016
We have a good discussion started here and I'm glad that some think it timely. A major challenge is to proceed with the exchange and somehow forge a consensus so we have a useful definition or family of definitions that serve the community.
I did have a Q for Ulrich on his working definition: "A collection is a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id."
I don't understand the first part of this. Why do you say "A collection is a PID"? I think of the collection as something that includes a PID as an attribute, as I think Reagan has said. In some of our attempts we suggest that a collection is the result of some type of aggregation process and that there is some metadata used to document the (digital) collection.
Author: Ulrich Schwardmann
Date: 11 Apr, 2016
Hi Gary,
On 11.04.2016 at 17:39, Gary wrote:
> I don't understand the first part of this. Why do you say " A
> collection is a PID"?
>
This is because this way a collection uses a relatively well-defined
term (PID), and it also becomes a recursive structure in a very simple
way, because it contains PIDs that can again be collections. This way
one defines a very rich structure with very simple terms. One shouldn't
define a collection as something that can contain collections, because
it would then need its own definition in order to be defined.
On the other hand, everything more domain-specific can be described in
the metadata or data of whatever the PID is pointing to, and thus can
be hidden behind the term digital object with its definition and all its
fuzziness.
Author: Gary Berg-Cross
Date: 11 Apr, 2016
Ulrich,
It will be interesting to see what others make of this idea of foundations
for Collection:
"This is because this way a collection uses a relatively good defined term
(PID) and also it becomes a recursive structure in a very simple way,
because it contains PIDs, that again can be collections."
For me, as in geometry, you have some assumed primitives to build on, which
aren't defined in the same way as the concepts that proceed from the primitives.
But, again to me, there is an issue of what the underlying "substance" is
in this reductionist approach.
A PID seems to me to be a different TYPE of thing than a COLLECTION, so I would
like to have some other concept at its base for understanding. That is why
I/we looked to a more general idea of AGGREGATION as the base.
Gary Berg-Cross, Ph.D.
***@***.***
http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
Member, Ontolog Board of Trustees
Independent Consultant
Potomac, MD
240-426-0770
Author: Keith Jeffery
Date: 13 Apr, 2016
Reagan -
As usual we agree; the index can be made more useful, though, if it is structured to reflect the collection (of collections...) - this allows more efficient queries and also the ability to partition/fragment the collections themselves.
Best
Keith