RE: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts

11 Apr 2016

All –
Let me rejoin now.
1. I don’t like ‘a collection is a PID’. A collection is a collection, and a PID is something that identifies it uniquely and permanently.
2. The recursive approach is elegant but limited. It should be possible to express relationships between any collections (or any DOs), whether hierarchic (‘belongs to’/‘is part of’) or in a fully connected graph. In such a graph one collection may be a proper subset of another (or a superset of more than one other collection), collection A may have been derived from collection B by process X, or collection C may have been derived from collection D with process U and from collection E with process W. All of this should carry appropriate date/time stamps so that provenance is recorded, along with all associated descriptive, contextual and actionable metadata.
So this is an appeal that we do not simplify to a level where we lose rich semantics.
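As a rough illustration of point 2, typed and time-stamped relationships between collections could be recorded as simple statements. This is only a sketch: the `Relation` class, the predicate names ("wasDerivedFrom", "isProperSubsetOf") and the `pid:` identifiers are hypothetical and not taken from any RDA specification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """One typed, time-stamped statement about two collections (or DOs)."""
    subject: str                      # PID of the collection the statement is about
    predicate: str                    # hypothetical relation name
    obj: str                          # PID of the related collection
    process: Optional[str] = None     # process used for a derivation, if any
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Collection C was derived from D with process U and from E with process W;
# collection A is a proper subset of collection B.
relations = [
    Relation("pid:C", "wasDerivedFrom", "pid:D", process="U"),
    Relation("pid:C", "wasDerivedFrom", "pid:E", process="W"),
    Relation("pid:A", "isProperSubsetOf", "pid:B"),
]


def sources_of(pid, relations):
    """Return (source PID, process) pairs for everything `pid` was derived from."""
    return sorted(
        (r.obj, r.process)
        for r in relations
        if r.subject == pid and r.predicate == "wasDerivedFrom"
    )
```

Because each statement is independent, the same collection can take part in hierarchic and non-hierarchic relationships at once.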
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM www.ercim.eu (***@***.***)
Past President euroCRIS www.eurocris.org
Past Vice President VLDB www.vldb.org
Fellow (CITP, CEng) BCS www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
----------------------------------------------------------------------------------------------------------------------------------
The contents of this email are sent in confidence for the use of the
intended recipient only. If you are not one of the intended
recipients do not take action on it or show it to anyone else, but
return this email to the sender and delete your copy of it.
----------------------------------------------------------------------------------------------------------------------------------
From: uschwar1=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of uschwar1
Sent: 11 April 2016 17:51
To: Jeremy York; TobiasWeigel; Data Fabric IG; Research Data Collections WG
Cc: ThomasZastrow; Gary
Subject: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
Dear Jeremy, all
here, as far as I can see from a first look, the definition relies on the binary predicate isGatheredInto(x,y), which I could no longer find defined at the given location. So one probably cannot use this as a definition here without defining how this predicate works in all cases.
But the other way around: if one uses my reductionist definition, the function isGatheredInto(x,y) is almost trivial to define, because one just checks whether PID y is contained in the set of PIDs in the DO that PID x points to.
To Gary: of course a collection is something different from an ordinary PID, also in my reductionist approach. It is a PID that points to a very special kind of DO. My assumption is that this is sufficient for all underlying "substance". This of course still has to be proven, but perhaps the examples I mentioned already give a feeling for the possibilities that such a definition offers.
And certainly we need to discuss counterexamples to see what the limitations are.
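The reductionist reading of isGatheredInto(x,y) given above can be sketched in a few lines. The in-memory `registry` standing in for PID resolution, the "members" key and the `pid:` identifiers are assumptions made for illustration; the argument order follows the wording in this message (x is the collection, y the candidate member).

```python
# Hypothetical stand-in for a PID resolver: PID -> digital object.
registry = {
    "pid:coll-1": {"members": {"pid:do-1", "pid:do-2"}},
    "pid:do-1": {"members": set()},
}


def is_gathered_into(x, y, registry):
    """True if PID y is in the set of PIDs of the DO that PID x points to."""
    digital_object = registry.get(x)
    if digital_object is None:
        return False  # x does not resolve, so nothing is gathered into it
    return y in digital_object["members"]
```

For example, `is_gathered_into("pid:coll-1", "pid:do-2", registry)` holds, while swapping the two arguments does not.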
On 11.04.2016 at 18:24, Jeremy York wrote:
I don't know if this will contribute to the discussion but I wanted to point to work being done with HathiTrust at the University of Illinois to define collections in a digital humanities context: http://doi.org/10.5334/johd.3.
Jeremy
Jeremy York
Project Manager
The Stewardship Gap
http://bit.ly/stewardshipgap
On Mon, Apr 11, 2016 at 12:13 PM, TobiasWeigel <***@***.***> wrote:
Hello Ulrich,
thank you for the examples. I particularly like the power collection idea, as it could very elegantly solve some of the issues we run into once we talk about collections that grow over time yet should be somewhat statically referable. I think this also puts a new twist on the API: a rule-based collection might need its own dedicated querying and creation mechanisms (or at least different parameter sets). When thinking in terms of collection models, I mostly worked along the lines of common ADTs and multiple membership in several collections. The 'family' of rule-based collections may be a distinct sister branch to these. Thanks a lot for sharing these early examples. I clearly have to look deeper into the mathematical view when continuing down the models path.
Best, Tobias
-------- Original Message --------
Subject: Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
From: Ulrich Schwardmann <***@***.***>
To: TobiasWeigel <***@***.***>, ThomasZastrow
<***@***.***>, Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>, RDA Collections WG <***@***.***-groups.org>
Date: 11 Apr 2016, 16:19
Hi Tobias, Gary and others,
in principle any function that generates (new) collections could be used. For example, from a given collection one could build a new collection by imposing restrictions, such as time constraints on the creation of the DOs it contains. Or one can build a kind of power collection, the collection of all sub-collections.
Particularly interesting generation rules come with the possibility of following the links given in the collection, either via the PIDs in the collection itself or via the additional pointers/links given in the definition. For example, if one has a set of collections, each consisting of, let's say, two PIDs pointing to other collections in this set, then one can view this simply as such a set, but one can also build the sub-collections formed by the connected components of the graph whose vertices are PIDs and whose edges are defined by the relation 'PID in a collection'.
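The connected-components rule can be sketched as follows, treating 'PID in a collection' as an undirected edge. The dictionary layout (collection PID mapped to its member PIDs) and the sample identifiers are assumptions for illustration only.

```python
from collections import deque


def connected_components(collections):
    """Group PIDs into components of the 'PID in a collection' graph."""
    # Build an undirected adjacency map: collection <-> each of its members.
    adjacency = {}
    for coll, members in collections.items():
        adjacency.setdefault(coll, set())
        for member in members:
            adjacency[coll].add(member)
            adjacency.setdefault(member, set()).add(coll)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Two groups of collections, each pointing to two others in its own group.
example = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
```

Each returned component is a candidate sub-collection in the sense described above.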
A real-world example would be 'references in publications': each publication (collection) contains only a small number of references (PIDs), but for a given publication there is a whole tree of all publications that this publication relies on, which is a new collection.
Even more interesting is the reverse generation rule: give me all publications that rely on a given publication. It is a valid rule too, but it is much harder to implement, because for each publication one needs to know all publications relying on it, or all publications altogether.
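Both rules can be sketched over a small reference table. The forward rule only needs the reference lists it actually visits, whereas the reverse rule, as written here, has to scan every known publication to build an inverse index, which is exactly the difficulty mentioned above. The `references` mapping and the publication identifiers are hypothetical.

```python
def all_references(pid, references):
    """Forward rule: every publication that `pid` relies on, directly or not."""
    closure = set()
    stack = list(references.get(pid, ()))
    while stack:
        current = stack.pop()
        if current in closure:
            continue  # already visited, also guards against citation cycles
        closure.add(current)
        stack.extend(references.get(current, ()))
    return closure


def citing_publications(pid, references):
    """Reverse rule: every publication that relies on `pid`."""
    # Requires knowledge of all publications to invert the reference relation.
    inverse = {}
    for source, targets in references.items():
        for target in targets:
            inverse.setdefault(target, set()).add(source)
    return all_references(pid, inverse)


references = {
    "p1": ["p2", "p3"],
    "p2": ["p4"],
    "p3": ["p4"],
    "p4": [],
}
```

Here `all_references("p1", references)` yields p2, p3 and p4, and `citing_publications("p4", references)` yields p1, p2 and p3.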
Similarly, new collections can be built from the additional pointers that the definition below allows for a collection. A typical example of such a pointer is the previous version of a collection: one can easily build the collection of all previous versions of a collection by the rule of always following the previous-version pointer.
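A minimal sketch of the previous-version rule, assuming each collection carries at most one previous-version pointer. The `previous` mapping and the version names are made up for the example.

```python
def previous_versions(pid, previous):
    """Follow the previous-version pointer until it runs out."""
    chain = []
    seen = {pid}
    current = previous.get(pid)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)  # stop if the pointers ever loop back
        current = previous.get(current)
    return chain


# Hypothetical pointers: v3 -> v2 -> v1, and v1 has no predecessor.
previous = {"v3": "v2", "v2": "v1"}
```

The result is ordered from the most recent predecessor to the oldest, so here it is v2 followed by v1.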
On 11.04.2016 at 14:30, TobiasWeigel wrote:
Hi Ulrich, Gary,
I think this is a very timely and much needed discussion. I like Ulrich's idea to boil this down to the mathematical definitions because I also agree that this reduces the ambiguity and there are some well-known concepts we can reuse. At least at this abstract level, we then won't have to define e.g. a Digital Object in all its meaning at first.
Ulrich - can you give an example for a generation rule? I think I get the direction in which you are heading, but I am not sure I understand the variety of possibilities you hint at.
I am not so sure that collection implementations will mostly be lists. There is a clear advantage in terms of computational efficiency in using unordered sets (distributed hash maps, NoSQL storage and so on). In my mind, both set and list implementations are valid choices, with trade-offs depending on the concrete use case.
Best, Tobias
-------- Original Message --------
Subject: Re: [rda-datafabric-ig] Some thoughts on "Data Aggregations" terminology & concepts
From: uschwar1 <***@***.***>
To: ThomasZastrow
<***@***.***>, Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>
Date: 11 Apr 2016, 11:12
Dear Gary, all,
as Thomas already mentioned, in the last VC of the Collections WG we saw the need for a relatively rigid and precise definition of what a digital collection should be in the sense of that WG. This definition is still under discussion and currently given as the fourth of currently three such definitions at
http://smw-rda.esc.rzg.mpg.de/index.php/Collection
and the one in the DFT WG snapshot document. The current definition of the collection WG is:
(
Definition
A collection is a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
A collection can be given explicitly by naming each PID/Id directly as well as implicitly by a generating rule.
By definition a collection can contain other "sub-"collections.
A collection is called finite, if the set of PIDs/Ids, generated by iteratively resolving its "sub-"collections, is finite.
)
which is relatively abstract, tries to use mathematical terms like sets or lists or simple constructions like PIDs and pointers, and avoids relying on other relatively undefined terms like aggregations and DEs.
A DO is complicated enough that avoiding it as well is under discussion, but there is currently no good alternative.
The reason for such an attempt was that we were discussing several concepts, like data streams, that are used and need to be referenced but that continuously collect additional data over time, making it necessary to get versioning under control for such references. The idea of the Collections WG is to pave the way for automated services on collections. With a definition like the one above we are much better able to handle and classify different representations of such a use case.
From my point of view, the use of generating rules in particular allows a huge number of possibilities. The definition of a finite collection is an important restriction here, because it lets one create collections by generating rules while avoiding the mathematical (set-theoretical) problems that such rules can cause.
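The definition and the finiteness restriction might be modelled roughly as below. This is a sketch under stated assumptions, not the WG's data model: the `Collection` class, the `registry` standing in for PID resolution, and the `limit` cut-off are all hypothetical. Since finiteness of an arbitrary generating rule cannot be decided in general, the sketch simply gives up once more than `limit` PIDs have been produced.

```python
from dataclasses import dataclass, field
from itertools import count, islice
from typing import Callable, Iterable, List, Optional


@dataclass
class Collection:
    members: List[str] = field(default_factory=list)        # explicit PIDs/Ids
    rule: Optional[Callable[[], Iterable[str]]] = None      # generating rule


def resolve(pid, registry, limit=1000):
    """Iteratively resolve sub-collections; None if `limit` is exceeded."""
    seen = set()
    pending = [pid]
    while pending:
        current = pending.pop()
        if current in seen:
            continue  # cycles between collections are visited only once
        seen.add(current)
        if len(seen) > limit:
            return None  # treated as "not known to be finite"
        collection = registry.get(current)
        if collection is None:
            continue  # an ordinary DO, nothing further to resolve
        pending.extend(collection.members)
        if collection.rule is not None:
            pending.extend(islice(collection.rule(), limit + 1))
    seen.discard(pid)  # the root collection itself is not reported
    return seen


registry = {
    "c1": Collection(members=["d1", "c2"]),
    "c2": Collection(members=["d2", "c1"]),   # mutual containment
    "stream": Collection(rule=lambda: (f"obs-{i}" for i in count())),
}
```

With these sample entries, resolving "c1" terminates despite the cycle, while the ever-growing "stream" collection hits the cut-off and is reported as `None`.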
The definition above is still not final, in the sense that we are still discussing the alternatives marked by the slashes '/'. For example, there are good reasons to see a collection as an unordered 'set' in an abstract sense, but in most implementations it will usually be a list (where the ordering might play an explicit or implicit role), and therefore we have to handle this possibility anyway.
From my point of view the idea from Reagan is interesting, as it brings in the community's needs as an additional aspect of collections, and one can mention something like that additionally. But again, terms like arrangement are too far from being well defined to be used for creating automated services on them.
On 11.04.2016 at 10:11, ThomasZastrow wrote:
Hi Gary,
The Research Data Collection group also started to do some work regarding the definition of basic terms like "collections". Fortunately, the TeD-T tool supports multiple definitions and scopes.
Our final definition will be more narrow, but in our group we need to come to a concrete specification / implementation:
http://smw-rda.esc.rzg.mpg.de/index.php/Collection
(Using the scope "BOF PID Collection")
Best,
Tom
On 10.04.2016 at 17:36, Gary wrote:
The various types of data aggregation and what we call them have been a topic in several RDA groups. "Data set/dataset" or "Digital Collection" and "data series" are a few of the frequently used terms. In the DFT WG snapshot document we had an initial definition of "Digital Collection" as:
A digital collection is an aggregation which contains DOs and DEs. The collection is identified by a PID and described by metadata.
Note: A digital collection is a (complex) DO.
Note: A digital collection is an aggregation in so far as there are other types of aggregations.
There was probably too little discussion of this and related concepts and so I have tried to continue the conversation with relevant people and groups.
A recent one was with Reagan Moore, who provided some ideas (perhaps from a policy point of view) as below. I thought that it might serve as a basis for more conversation.
1. Reagan "Digital collections implement arrangement by a community for organizing their digital entities."
Gary comment - This makes the point that aggregations serve community needs and thus will vary. There may then not be external labels for all of these types of arrangements. Maybe the best we can do is to have some broad categories into which different types of arrangements fit.
2. Reagan "Data series is used by NARA to define the sequence of records archived by a federal agency under a submission agreement control."
Gary comment - I like this as a way of grounding ourselves in an authoritative source, NARA, as a basis for data series. They merely add a time dimension to files and digital sets. But does this work for everyone, and if not, how would their definition differ from NARA's? See http://smw-rda.esc.rzg.mpg.de/index.php/Dataset_series for our attempt as part of DFT WG.
3. Reagan "A data series is also used to denote the sequence of data received from a sensor."
Gary comment - This introduces a more specific type of data series - a "sensor-based data series."
4. Reagan "A data set nominally identifies a discrete set of digital entities."
Gary comment - We might need to explain the arrangement basis for the "discrete set." Note how many alternative ideas on dataset we had when discussing this in DFT WG; see http://smw-rda.esc.rzg.mpg.de/index.php/Data_Set
5. Reagan "A data stream denotes the sequence of data received from a sensor."
Gary comment - We did not have the sensor as source in our working definition, but this was perhaps included or implied in the context of messaging. See http://smw-rda.esc.rzg.mpg.de/index.php/Data_Stream
Comments on the above ideas would be appreciated.
--
Full post: https://rd-alliance.org/group/data-fabric-ig/post/some-thoughts-data-agg...
Manage my subscriptions: https://rd-alliance.org/mailinglist
Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/51939
--
Dr. Thomas Zastrow
Max Planck Computing and Data Facility (MPCDF)
Gießenbachstr. 2, D-85748 Garching bei München, Germany
Tel +49-89-3299-1457
http://www.mpcdf.de
--
Mit freundlichem Gruss
Ulrich Schwardmann
Phone:+49-551-201-1542 Email:***@***.*** _____ _____ ___
Gesellschaft fuer wissenschaftliche / __\ \ / / \ / __|
Datenverarbeitung mbH Goettingen (GWDG) | (_--\ \/\/ /| |) | (_--
Am Fassberg 11 D-37077 Goettingen Germany \___| \_/\_/ |___/ \___|
URL: http://www.gwdg.de E-Mail: ***@***.***
Tel.: +49 (0)551 201-1510 Fax: +49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Dipl.-Kfm. Markus Hoppe
Sitz der Gesellschaft: Goettingen Registergericht: Goettingen
Handelsregister-Nr. B 598 Zertifiziert nach ISO 9001
--
Tobias Weigel
Abteilung Datenmanagement
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45 a • 20146 Hamburg • Germany
Phone: +49 40 460094-104
Email: ***@***.***
URL: http://www.dkrz.de
ORCID: orcid.org/0000-0002-4040-0215
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784

    Author: Jeremy York

    Date: 11 Apr, 2016

    To follow up regarding the location of the binary predicate definition in
    the paper I linked to, one of the paper's authors (Jacob Jett) informed me
    that the most recent definition is here:
    http://www.dlib.org/dlib/may14/wickett/05wickett.html
    He further noted, "The predicate itself comes from the DCMI collection
    applications profile
    (http://dublincore.org/groups/collections/collection-application-profile/...).
    (At least this has always been the oldest mention of it. But the DCMI-CAP
    is also one of the oldest formal accounts of collections.)"
    Jeremy

    Author: Juha Hakala

    Date: 12 Apr, 2016

    Hello,
    The Dublin Core community discussed the definition of collection a lot when we were drafting the DC Collections application profile, available at http://dublincore.org/groups/collections/collection-application-profile/. After trying several other alternatives we finally decided to use simply “collection is an aggregation of items”, since adding more detail would have limited the applicability of the definition. The definition allows even collections with zero items (one of the things which also caused problems). An item in turn is a physical or digital resource, and these resources may be complex, like research data sets.
    Like Keith, I do not think it is a good idea to use PID in the definition of a collection. There are a lot of collections out there which do not and may never have PIDs or any other kind of identifiers. An identifier such as ISCI (International Standard Collection Identifier, ISO 27730 http://www.iso.org/iso/catalogue_detail.htm?csnumber=44293) is one of the key metadata elements describing a collection, and I’m fine with, for instance, making it mandatory in RDA. But saying that a collection is a PID is a bit like saying that a book is an ISBN. RDA can of course use whatever collection definition it wishes, but other communities may not follow the example, or fully understand what is going on. On the other hand, using or refining the Dublin Core definition of collection (or something else that is already out there) would make the RDA approach easier to grasp.
    One of the things I like in the DC Collections application profile is its data model, which was inherited from an earlier research project carried out in the UK. RDA is of course free to develop its own data model, but IMO it would do no harm to take a look at what the Dublin Core community has done. The DC data model does not explicitly present sub- and super-collections, but they have been taken into account at the metadata level, just like associated collections and associated publications, which are both relevant for research data collections.
    The International Standard Collection Identifier, by the way, is a semantic identifier which is based on the standard identifier of the agent which owns the collection. For instance, any ISCI owned by the National Library of Finland would start with FI-NL, which is the library’s ISIL standard identifier. Deciding what kind of (standard) identifiers collections should have can be non-trivial.
    Best,
    Juha
    From: keith.jeffery=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of ***@***.***
    Sent: 11. huhtikuuta 2016 20:23
    To: uschwar1 <***@***.***>; Jeremy York <***@***.***>; TobiasWeigel <***@***.***>; Data Fabric IG <***@***.***-groups.org>; Research Data Collections WG <***@***.***-groups.org>
    Cc: ThomasZastrow
    <***@***.***>; Gary <***@***.***>
    Subject: [rda-datafabric-ig] RE: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
    All –
    Let me rejoin now.
    1. I don’t like ‘a collection is a PID’. A collection is a collection and a PID is something that identifies it uniquely and permanently
    2. The recursive approach is elegant but limited; it should be possible to express relationships between any collections (or any DO) whether hierarchic (‘belongs to’/is part of’) or in a fully connected graph where it may be that one collection is a proper subset of another (or superset of >1 other collections) or that collection A was derived from Collection B by process X or that collection C was derived from collection D with process U and from collection E with process W - and all with appropriate date/time stamping so that provenance is recorded (and all associated descriptive / contextual / actionable metadata).
    So this is an appeal that we do not simplify to a level where we lose rich semantics
    Best
    Keith
    Keith G Jeffery Consultants
    Prof Keith G Jeffery
    E: ***@***.***
    T: +44 7768 446088
    S: keithgjeffery
    Past President ERCIM www.ercim.eu (***@***.***)
    Past President euroCRIS www.eurocris.org
    Past Vice President VLDB www.vldb.org
    Fellow (CITP, CEng) BCS www.bcs.org
    Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
    Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
    Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
    ----------------------------------------------------------------------------------------------------------------------------------
    The contents of this email are sent in confidence for the use of the
    intended recipient only. If you are not one of the intended
    recipients do not take action on it or show it to anyone else, but
    return this email to the sender and delete your copy of it.
    ----------------------------------------------------------------------------------------------------------------------------------
    From: uschwar1=***@***.***-groups.org [mailto:***@***.***-groups.org] On Behalf Of uschwar1
    Sent: 11 April 2016 17:51
    To: Jeremy York; TobiasWeigel; Data Fabric IG; Research Data Collections WG
    Cc: ThomasZastrow; Gary
    Subject: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
    Dear Jeremy, all
    here, as far as I can see from a first look, the definition is relying on the binary predicate isGatheredInto(x,y), which I couldn't find to be defined at the given location anymore. So one probably cannot use this as a definition here, without defining how this predicate function works in all cases.
    But the other way around: if one uses my reductionist definition, the function isGatheredInto(x,y) is almost trivially to define, because one just looks, whether PID y is contained in the set of PIDs in the DO where PID x points to.
    To Gary: of course a collection is something different to an ordinary PID also in my reductionist approach. It is a PID, that points to a very special kind of DO. My assumption is, that this is sufficient for all underlying "substance". But this of course still has to be proven. But perhaps the examples I mentioned already give a feeling of the possibilities, that such a definition can have.
    And certainly we need to discuss counter examples, to see what the limitations are.
    Am 11.04.2016 um 18:24 schrieb Jeremy York:
    I don't know if this will contribute to the discussion but I wanted to point to work being done with HathiTrust at the University of Illinois to define collections in a digital humanities context: http://doi.org/10.5334/johd.3.
    Jeremy
    Jeremy York
    Project Manager
    The Stewardship Gap
    http://bit.ly/stewardshipgap
    On Mon, Apr 11, 2016 at 12:13 PM, TobiasWeigel <***@***.***> wrote:
    Hello Ulrich,
    thank you for the examples - I particularly like the power collection idea as it could solve very aesthetically some of the issues we get into once we talk about collections that grow over time but yet should be somewhat statically referable. I think this also has a new twist on the API: A rule-based collection might need its own dedicated querying and creation mechanisms (or at least different parameter sets). When thinking in terms of collection models, I mostly worked along lines of common ADTs and multiple membership in several collections. The 'family' of rule-based collections may be a distinct sister branch to these. Thanks a lot for sharing these early examples - I clearly have to look deeper into the mathematical view when continuing down the models path.
    Best, Tobias
    -------- Original Message --------
    Subject: Re: [rda-datafabric-ig][rda-collection-wg] Some thoughts on "Data Aggregations" terminology & concepts
    From: Ulrich Schwardmann <***@***.***>
    To: TobiasWeigel <***@***.***>, ThomasZastrow
    <***@***.***>, Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>, RDA Collections WG <***@***.***-groups.org>
    Date: 11 Apr 2016, 16:19
    Hi Tobias, Gary and others,
    in principle each function, that generates (new) collections, could be used. For example from a given collection one could build a new collection by requiring restrictions like for example time constraints on the generation of the DOs it contains. Or one can build a kind of power collection, the collection of all sub collections.
    Particularly interesting generation rules come with the possibity of following the links given in the collection, either by the PIDs in the collection itsself or by the additional pointers/links given in the definition. For example if one has a set of collections consisting each of lets say two PIDs pointing to another collection in this set, then one can see this as such a set, but also one can build the sub collections build by the connected components in the graph with PID vertices and edges defined by the relation 'PID in a collection'.
    A real world example would be 'references in publications': each publication (collection) only contains a small number of references (PIDs), but for a given publication there is a whole tree of all publications, that this publication relies on, which is a new collection.
    Even more interesting is also the reverse generation rule: give me all publications, that rely on a given publication. It is a valid rule too, but its much harder to implement it, because one needs for each publication to know all reliying publications, or all publications at all.
    Similarly new collections can be build from the additional pointers that are possible for a collection according the definition below. A typical example for such a pointer could be the previous version of a collection and one can build easily the collection of all previous versions of a collection by the rule to follow always the previous version pointer.
    Am 11.04.2016 um 14:30 schrieb TobiasWeigel:
    Hi Ulrich, Gary,
    I think this is a very timely and much needed discussion. I like Ulrich's idea to boil this down to the mathematical definitions because I also agree that this reduces the ambiguity and there are some well-known concepts we can reuse. At least at this abstract level, we then won't have to define e.g. a Digital Object in all its meaning at first.
    Ulrich - can you give an example for a generation rule? I think I get the direction in which you are heading, but I am not sure I understand the variety of possibilities you hint at.
    I am not so sure that collection implementation will mostly be lists - there is a clear advantage in terms of computational efficiency in using unordered sets (distributed hash maps, NoSQL storage and so on). In my mind, both set and list implementations are valid choices with trade-offs depending on a concrete use case.
    Best, Tobias
    -------- Original Message --------
    Subject: Re: [rda-datafabric-ig] Some thoughts on "Data Aggregations" terminology & concepts
    From: uschwar1 <***@***.***>
    To: ThomasZastrow
    <***@***.***>, Gary <***@***.***>, Data Fabric IG <***@***.***-groups.org>
    Date: 11 Apr 2016, 11:12
    Dear Gary, all,
    as Thomas already mentioned, in the last VC of the Collections WG we saw the necessity to have a relatively rigid and precice definition of what a digital collection should be in the sense of that WG. This definition is still under discussion and currently given as the fourth of currently three such definitions at
    http://smw-rda.esc.rzg.mpg.de/index.php/Collection
    and the one in the DFT WG snapshot document. The current definition of the collection WG is:
    (
    Definition
    A collection is a PID pointing to a digital object consisting of a set/list of PIDs/Ids and a set of additional pointers/links and metadata together with each PID/Id.
    A collection can be given explicitely by naming each PIDs/Id directly as well as implicitly by a generating rule.
    By definition a collection can contain other "sub-"collections.
    A collection is called finite, if the set of PIDs/Ids, generated by iteratively resolving its "sub-"collections, is finite.
    )
    which is relatively abstract, tries to use mathematical terms like sets or lists or simple constructions like PIDs and pointers and avoids to rely on other relatively undefined terms like aggregations and DEs.
    A DO is complicated enough and therefore under discussion to be avoided as well, but currently without a good alternative.
    The reason for such an attempt was, that we were discussing several concepts, like data streams, that are used and need to be referenced, but that permanently collect additional data in time, causing the necessity to get the versioning under control for such references. The idea of the collection WG is to pave the way for automated services on collections. With such a definition as above we are much better able handle different representations of such a use case and to classify them.
    From my point of view especially the use of the generating rules allows a huge amount of possibilities. And the definition of a finite collection is an important restriction here, as this way one is able to create collections by generating rules but avoids the mathematical (set theoretical) problems that can be caused this way.
    The definition above is still not final, in the sense that we are still discussing the alternatives marked by the slashes '/'. For example, there are good reasons to see a collection as an unordered 'set' in an abstract sense, but in most implementations it will usually be a list (where the ordering might play an explicit or implicit role), and therefore we have to handle this possibility anyway.
    From my point of view the idea from Reagan is interesting, as it adds the community's needs as an additional aspect of collections, and one can mention something like that in addition. But again, terms like 'arrangement' are too far from being well defined to be used to create automated services on them.
    On 11.04.2016 at 10:11, Thomas Zastrow wrote:
    Hi Gary,
    The Research Data Collection group also started to do some work regarding the definition of basic terms like "collections". Fortunately, the TeD-T tool supports multiple definitions and scopes.
    Our final definition will be narrower, but in our group we need to arrive at a concrete specification / implementation:
    http://smw-rda.esc.rzg.mpg.de/index.php/Collection
    (Using the scope "BOF PID Collection")
    Best,
    Tom
    On 10.04.2016 at 17:36, Gary wrote:
    The various types of data aggregation and what we call them have been a topic in several RDA groups. "Data set/dataset", "Digital Collection" and "data series" are a few of the frequently used terms. In the DFT WG snapshot document we had an initial definition of "Digital Collection" as:
    A digital collection is an aggregation which contains DOs and DEs. The collection is identified by a PID and described by metadata.
    Note: A digital collection is a (complex) DO.
    Note: A digital collection is an aggregation in so far as there are other types of aggregations.
    There was probably too little discussion of this and related concepts and so I have tried to continue the conversation with relevant people and groups.
    A recent conversation was with Reagan Moore, who provided some ideas (perhaps from a policy point of view) as below. I thought that they might serve as a basis for more conversation.
    1. Reagan "Digital collections implement arrangement by a community for organizing their digital entities."
    Gary comment - this makes the point that aggregations serve community needs and thus will vary. There may then not be external labels for all of these types of arrangements. Maybe the best we can do is to have some broad categories into which different types of arrangements fit.
    2. Reagan "Data series is used by NARA to define the sequence of records archived by a federal agency under a submission agreement control."
    Gary comment - I like this as a way of grounding ourselves in an authoritative source, the NARA, as a basis of data series. They merely add a time dimension to files and digital sets. But does this work for everyone, and if not, how would their definition differ from NARA's? See http://smw-rda.esc.rzg.mpg.de/index.php/Dataset_series for our attempt as part of DFT WG.
    3. Reagan "A data series is also used to denote the sequence of data received from a sensor."
    Gary discussion - This introduces a more specific type of data series - a "sensor-based data series."
    4. Reagan "A data set nominally identifies a discrete set of digital entities."
    Gary comment - We might need to explain the arrangement basis for the "discrete set." Note how many alternate ideas on dataset we had when discussing this in DFT WG, see http://smw-rda.esc.rzg.mpg.de/index.php/Data_Set
    5. Reagan "A data stream denotes the sequence of data received from a sensor."
    Gary comment - We did not have the sensor as source in our working definition, but this was perhaps included or implied in the context of messaging. See http://smw-rda.esc.rzg.mpg.de/index.php/Data_Stream
    Comments on the above ideas would be appreciated.
    --
    Full post: https://rd-alliance.org/group/data-fabric-ig/post/some-thoughts-data-agg...
    Manage my subscriptions: https://rd-alliance.org/mailinglist
    Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/51939
    --
    Dr. Thomas Zastrow
    Max Planck Computing and Data Facility (MPCDF)
    Gießenbachstr. 2, D-85748 Garching bei München, Germany
    Tel +49-89-3299-1457
    http://www.mpcdf.de
    --
    Kind regards
    Ulrich Schwardmann
    Phone: +49-551-201-1542 Email: ***@***.***
    Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen (GWDG)
    Am Fassberg 11, D-37077 Goettingen, Germany
    URL: http://www.gwdg.de E-Mail: ***@***.***
    Tel.: +49 (0)551 201-1510 Fax: +49 (0)551 201-2150
    Managing Director: Prof. Dr. Ramin Yahyapour
    Chairman of the Supervisory Board: Dipl.-Kfm. Markus Hoppe
    Registered office: Goettingen Register court: Goettingen
    Commercial register no. B 598 Certified to ISO 9001
    --
    Tobias Weigel
    Data Management Department
    Deutsches Klimarechenzentrum GmbH (DKRZ)
    Bundesstraße 45 a • 20146 Hamburg • Germany
    Phone: +49 40 460094-104
    Email: ***@***.***
    URL: http://www.dkrz.de
    ORCID: orcid.org/0000-0002-4040-0215
    Managing Director: Prof. Dr. Thomas Ludwig
    Registered office: Hamburg
    Hamburg District Court (Amtsgericht Hamburg), HRB 39784
    --
    Full post: https://rd-alliance.org/group/data-fabric-ig-research-data-collections-w...
    Manage my subscriptions: https://rd-alliance.org/mailinglist
    Stop emails for this post: https://rd-alliance.org/mailinglist/unsubscribe/51950