Re: [rda-datafabric-ig] draft version white paper form DFIG

02 Jan 2015

Ulrich.
This implied bi-directionality is a good catch.
I think that was intended, based on past work (in part by Yin Chin) and
analysis of the DFT WG, and it was one of your proposed solutions:
>The other solution would be to use the PID record, that points to the
digital object, also as reference to the metadata object. Several PID
systems are able to setup such an additional reference. In this case the
whole triple can be referred to by this PID record very efficiently. The
metadata as well as the digital object are directly resolved in this
construction. From my point of view this solution would be more elegant (in
the sense of Occam's razor), because it uses less resources and gives more
direct access to all the necessary information.
I would make another observation about what is or is not implied in
Diagram 5.
The line relations are not qualified or labeled. They seem to imply some
"process" leading to access via "info" in the boxes. But this processing
(or something else) is subject to interpretation. From some use case
discussions one may do a search on metadata and identify a digital object
or several of interest. One may go directly to a DO from such a
metadata-based search and bypass the PID record.
I think that the Figure should show this relation too, since such access to
DOs is likely to persist and may be a plurality of activity now.
Gary Berg-Cross, Ph.D.
***@***.***
http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
SOCoP Executive Secretary
Independent Consultant
Potomac, MD
240-426-0770

  • Author: Herman Stehouwer

    Date: 02 Jan, 2015

    Dear all,
    For what it is worth, I fully agree that the triple should be fully
    connected.
    I.e. I do not think a DO without MD is very useful, neither is MD
    without a pointer to the DO.
    Equally, the PID should be able to resolve both in some manner.
    I am not sure that the DO should be able to resolve the MD (though it
    could be useful in some cases), but I am sure the MD should be able to
    resolve to the DO.
    My main reason here is that if the DO has to resolve the MD then the
    coupling of the MD with the storage of the DO has to be fairly tight.
    I prefer a loose coupling as I feel that will make things easier on
    everyone.
    The other way around, one can generally put some sort of reference
    somewhere in the MD without any issues.
    In short: PID -> MD, PID -> DO, MD -> DO, and MD->PID.
    Quite likely DO->PID (required for PID management and so forth anyway).
    Not sure about DO -> MD (though one can always do the DO -> MD -> PID hop).
    Cheers,
    Herman
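
As an illustration of the link set summarised above, the triple can be written down as a small directed graph and checked for reachability. This is only a sketch under assumptions: the names `LINKS`, `reachable` and `hops` are invented for the example, the "quite likely" DO -> PID link is included, and the direct DO -> MD link is left out on purpose.

```python
from collections import deque

# Links from the "In short" summary, plus the "quite likely" DO -> PID link.
# The direct DO -> MD link is deliberately omitted (assumption for this sketch).
LINKS = {
    "PID": {"MD", "DO"},
    "MD": {"DO", "PID"},
    "DO": {"PID"},
}


def reachable(start, links):
    """Return every node that can be reached from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen - {start}


def hops(start, goal, links):
    """Return the number of links on a shortest path, or None if unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for target in links.get(node, ()):
            if target not in dist:
                dist[target] = dist[node] + 1
                queue.append(target)
    return None
```

With these links every element of the triple stays reachable from every other one; without a direct DO -> MD link, the MD is two hops away from the DO, via the PID record.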

  • Author: Tobias Weigel

    Date: 05 Jan, 2015

    -------- Original Message --------
    *Subject: *Re: [rda-datafabric-ig] Re: [rda-datafabric-ig] draft version
    white paper form DFIG
    *From: *HermanStehouwer <***@***.***>
    *To: *Gary <***@***.***>, uschwar1 <***@***.***>, Data
    Fabric IG <***@***.***-groups.org>
    *Date: *02 Jan 2015, 17:20
    > Dear all,
    >
    > For what it is worth, I fully agree that the triple should be fully
    > connected.
    > I.e. I do not think a DO without MD is very useful, neither is MD
    > without a pointer to the DO.
    > Equally, the PID should be able to resolve both in some manner.
    >
    > I am not sure that the DO should be able to resolve the MD (though it
    > could be useful in some cases), but I am sure the MD should be able to
    > resolve to the DO.
    > My main reason here is that if the DO has to resolve the MD than the
    > coupling of the MD with the storage of the DO has to be fairly tight.
    > I prefer a loose coupling as I feel that will make things easier on
    > everyone.
    > The other way around, one can generally put some sort of reference
    > somewhere in the MD without any issues.
    >
    > In short: PID -> MD, PID -> DO, MD -> DO, and MD->PID.
    > Quite likely DO->PID (required for PID management and so forth anyway).
    > Not sure about DO -> MD (though one can always do the DO -> MD -> PID
    > hop).
    I agree with this and just want to emphasize the 'hop' solution as
    an important point: at some point (millions of objects) we may want to
    prefer a smaller number of direct (location-dependent) links in view of
    maintenance issues, although this makes navigation more expensive. So you
    may even cut off the direct MD -> DO connection.
    An alternative constellation comes to mind if MD and DO are seen as
    distinct entities with dedicated PIDs (i.e., 2 PIDs in total). Then - in
    view of the same considerations as above - we would point back and forth
    between MD/DO and its PID and have bidirectional pointers between the
    PIDs (putting down PID A in PID record of PID B and vice versa).
    Relocating MD or DO then does not force us to touch the other entity.
    Both alternatives are valid, the choice depends on the use case /
    discipline / desired granularity.
    Best, Tobias
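
The two-PID constellation described above can be sketched with an in-memory dictionary standing in for a PID resolver. Everything here is an assumption made for illustration: the identifiers `pid:DO` and `pid:MD`, the example URLs, and the field names `location` and `related` do not come from any real PID system.

```python
# In-memory stand-in for a PID resolver; no real PID service is modelled.
# Identifiers, URLs and field names are made-up examples.
pid_records = {
    "pid:DO": {"location": "https://repo.example.org/data/1", "related": "pid:MD"},
    "pid:MD": {"location": "https://repo.example.org/meta/1", "related": "pid:DO"},
}


def resolve(pid):
    """Return the current location stored in the PID record."""
    return pid_records[pid]["location"]


def counterpart(pid):
    """Follow the PID-to-PID pointer and resolve the other entity."""
    return resolve(pid_records[pid]["related"])


def relocate(pid, new_location):
    """Move an entity: only its own PID record is updated."""
    pid_records[pid]["location"] = new_location
```

Calling `relocate("pid:DO", ...)` changes one record only; `counterpart("pid:MD")` then returns the new location without the metadata or its PID record having been touched, at the price of one extra resolution step.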

  • Author: Keith Jeffery

    Date: 05 Jan, 2015

    Tobias-
    I had always assumed the MD had its own PID just like the DO. However, the problem comes in defining PID(s) for the DO, and this depends on the DO structure.
    A simple tabular data file may have one PID.
    But a file of complex objects (e.g. descriptions of experiments with their results, equipment used, cross-references to publications etc.) may have multiple PIDs. In short, the atomicity of DO PIDs needs discussion.
    Best
    Keith
    Keith G Jeffery Consultants
    Prof Keith G Jeffery
    E: ***@***.***
    T: +44 7768 446088
    S: keithgjeffery
    Past President ERCIM www.ercim.eu (***@***.***)
    Past President euroCRIS www.eurocris.org
    Past Vice President VLDB www.vldb.org
    Fellow (CITP, CEng) BCS www.bcs.org
    Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
    Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
    Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
    ----------------------------------------------------------------------------------------------------------------------------------
    The contents of this email are sent in confidence for the use of the
    intended recipient only. If you are not one of the intended
    recipients do not take action on it or show it to anyone else, but
    return this email to the sender and delete your copy of it.
    ----------------------------------------------------------------------------------------------------------------------------------
    On Fri, Jan 2, 2015 at 7:44 AM, uschwar1 <***@***.***> wrote:
    Dear all, Peter,
    Thanks for all the work already done in this White Paper, which already gives quite a good impression of what this group is about, and in which direction it might go.
    I have a comment and change suggestion concerning the white paper, which seems only to be a small correction in diagram 5:
    the arrow between the metadata object and the PID record should be bidirectional.
    Because I think that this is important, but on the other hand that it might raise some discussion of how data, metadata and PIDs are (or have to be) organized, I will try to justify this in the following. If you immediately agree, just ignore the rest (except the BTW).
    As mentioned in the White Paper this diagram is a core part of the framework, because it tries to describe schematically the processes and its input and output elements on an atomic level. Since both, input and output elements are triples of the same kind, these triples are somehow elementary of the framework and the question arises, how these triples are referred to inside the framework.
    With reference here I mean the technical reference, but also it makes sense to have a good name for these triples. But this is another discussion.
    The technical reference to the basic elements in a framework always plays an important role, as inside the framework workflows these basic elements are used and transferred via its reference, the pointers, as long as possible. This is also why pointers play such an important role in programming.
    In the current version of the diagram the technical reference to all components of the triple is only possible via the metadata object, because only this points to the PID record and from there to the digital object. There is no direct way to get the metadata, if one only has the PID record as reference to the digital object. This always needs a search inside all the metadata records.
    The consequence would be that the framework would be essentially driven by metadata objects organized in some metadata registry. From my point of view this would only work efficiently, safely and interoperably if the metadata objects also have identifiers like PIDs that can be used as lightweight pointers in the processes. In this case we have to extend the triples to quadruples, containing also the PID referring to the metadata object, and we have a three-step resolution to come from the pointer of the quadruple to the digital object. But in the end this could also be a possible extension of the diagram.
    The other solution would be to use the PID record that points to the digital object also as a reference to the metadata object. Several PID systems are able to set up such an additional reference. In this case the whole triple can be referred to by this PID record very efficiently. The metadata as well as the digital object are directly resolved in this construction. From my point of view this solution would be more elegant (in the sense of Occam's razor), because it uses less resources and gives more direct access to all the necessary information.
    At least for the further discussions I would emphasise the possibility of such a solution by allowing the reference between PID record and metadata object in both directions. Whether this additional arrow direction is implemented in the concrete case is another question, that is addressed by the formulation, that the diagram illustrates possible processing at the atomic level.
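
The difference between the two situations described above can be sketched as follows. This is a hedged illustration only: the in-memory data, the identifiers and the field names `object_url` and `metadata_id` are assumptions for the example and not the schema of any particular PID system.

```python
# Hypothetical in-memory data; identifiers and field names are assumptions.
metadata_registry = [
    {"md_id": "md-1", "pid": "pid-1", "title": "Survey 2014"},
    {"md_id": "md-2", "pid": "pid-2", "title": "Corpus A"},
]


def find_metadata_by_search(pid):
    """Only MD -> PID exists: the metadata records have to be scanned."""
    for record in metadata_registry:
        if record["pid"] == pid:
            return record["md_id"]
    return None


pid_records = {
    "pid-1": {"object_url": "https://repo.example.org/do/1", "metadata_id": "md-1"},
    "pid-2": {"object_url": "https://repo.example.org/do/2", "metadata_id": "md-2"},
}


def resolve_triple(pid):
    """PID record references both DO and MD: one lookup yields the triple."""
    record = pid_records[pid]
    return pid, record["metadata_id"], record["object_url"]
```

In the first variant the cost of getting from a PID to its metadata grows with the size of the registry; in the second a single lookup of the PID record is enough.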
    BTW. A happy New Year to everybody.
    On 24.12.2014 at 10:10, Peter Wittenburg wrote:
    Thanks Daan.
    We received now a few comments on the first draft. In early January we should work on a revised version again. If other people have comments as well, please send them asap.
    Best
    Peter
    From: Daan Broeder
    Sent: Tuesday, December 23, 2014 11:10 PM
    To: Peter Wittenburg; ***@***.***-groups.org
    Subject: Re: [rda-datafabric-ig] draft version white paper form DFIG
    Dear all,
    Some comments and suggestions.
    Main points are:
    * Still need to make the concept of DF clearer, especially differences with respect to workflow frameworks. Suggest to emphasize the ‘pure’ DM application of DF
    * I suggest to introduce the idea of the DF as a superset of all DM components and services. Specific combinations of these (“profiles”) may be used to do specific DM work
    Please see the attached version for more.
    If you have already discussed and clarified these points, I apologise that I did not join before, but please still have a look.
    Happy Christmas,
    Daan
    --
    Daan Broeder
    CTO & Deputy Head
    The Language Archive – MPI for Psycholinguistics
    +31 24 3521103
    ***@***.***
    P.O. Box 310
    6500 AH Nijmegen, The Netherlands
    --
    From: Peter Wittenburg
    <***@***.***>
    Date: Thursday, 4 December 2014 16:57
    To: "***@***.***-groups.org" <***@***.***-groups.org>
    Cc: Peter Wittenburg
    <***@***.***>
    Subject: [rda-datafabric-ig] draft version white paper form DFIG
    Dear all,
    Here is a first draft version of the white paper which we want to circulate outside of the editing team. All side information you can find in the DFIG group wiki:
    https://rd-alliance.org/groups/data-fabric-ig/wiki/data-fabric-ig-docume...
    Please comment before Christmas in this thread so that we can work on a first real version 1.X during the Christmas/New Year's days.
    Best
    Rob & Peter
    ---------------------------------------------------------------------------------------------------------------
    Peter Wittenburg Tel: +49 2821 49180 ***@***.***
    RDA Founding Member
    EUDAT Scientific Coordinator
    Senior Advisor Data Systems
    Computer Center Garching
    Boltzmannstraße 2
    85748 Garching
    Germany
    http://www.rzg.mpg.de/
    http://www.mpi.nl/people/wittenburg-peter
    former affiliation:
    Max Planck Institute for Psycholinguistics
    Wundtlaan 1
    6525 XD Nijmegen
    The Netherlands
    --
    Kind regards
    Ulrich Schwardmann
    Phone: +49-551-201-1542 Email: ***@***.***
    Gesellschaft fuer wissenschaftliche
    Datenverarbeitung mbH Goettingen (GWDG)
    Am Fassberg 11 D-37077 Goettingen Germany
    URL: http://www.gwdg.de E-Mail: ***@***.***
    Tel.: +49 (0)551 201-1510 Fax: +49 (0)551 201-2150
    Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
    Aufsichtsratsvorsitzender: Dipl.-Kfm. Markus Hoppe
    Sitz der Gesellschaft: Goettingen Registergericht: Goettingen
    Handelsregister-Nr. B 598 Zertifiziert nach ISO 9001
    --
    Dr. ir. Herman Stehouwer
    Rechenzentrum Garching @ Max Planck for Plasmaphysics
    RDA Secretariat
    ***@***.*** 0031-619258815
    Skype: herman.stehouwer.mpi
    --
    Tobias Weigel
    Abteilung Datenmanagement
    Deutsches Klimarechenzentrum GmbH (DKRZ)
    Bundesstraße 45 a • 20146 Hamburg • Germany
    Phone: +49 40 460094-104
    Email: ***@***.***
    URL: http://www.dkrz.de
    Geschäftsführer: Prof. Dr. Thomas Ludwig
    Sitz der Gesellschaft: Hamburg
    Amtsgericht Hamburg HRB 39784

  • Author: Thomas Zastrow

    Date: 05 Jan, 2015

    Tobias, Herman, all,
    I'm a fan of storing the MD directly with the object data. That makes
    the handling of the whole thing much easier and all the common
    repositories today are also working in this way: the MD is stored
    together with the object data in a thing called "Digital Object" (having
    Fedora, DSpace in mind). Then, only one PID is necessary to address the
    whole DO, containing the object data and its metadata. Of course this scenario
    is not always possible, but as a recommendation, I think it would make
    sense.
    In Clarin, we had the PID stored as "MDSelfLink" in the CMDI metadata,
    that would be the already existing arrow in the diagram MD -> PID.
    Having a pointer back from the PID to the MD would be a great
    enhancement, especially having our PIT API in mind: implementing such a
    pointer back would be easy and, as Ulrich already said, common PID
    systems are able to do this.
    Just 2 cents :-)
    Best,
    Tom

  • Author: Larry Lannom

    Date: 05 Jan, 2015

    All,
    Right - the MD object is a DO in its own right. It is also the case that not all DOs which exist in the role of metadata to other DOs will be known to the owners/controllers of those other DOs, e.g., annotations and reviews, and so direct links establishing all relationships among all possible related DOs are not a reasonable expectation.
    Larry

  • Author: Keith Jeffery

    Date: 05 Jan, 2015

    Larry -
    But intelligent mining might discover potential relationships which can then be validated?
    Best
    Keith

  • Author: Larry Lannom

    Date: 05 Jan, 2015

    Keith,
    Absolutely. And one could even help less intelligent miners by building indexes, etc. I believe the key is being able to rely on the identifiers and associated metadata of any and all DOs to dynamically build the network of relationships as needed.
    Best,
    Larry

  • Author: Keith Jeffery

    Date: 05 Jan, 2015

    Larry
    As usual we are in accord
    Best
    Keith

  • Author: Daan Broeder

    Date: 05 Jan, 2015

    Dear all,
    Clearly this keeps being a popular topic over the years.
    I would urge treating metadata as a ‘separate’ object as much as
    possible. This implies each having a separate PID and also
    allowing metadata for the metadata. If data and metadata are separate
    objects there is also a real need for managing both with PIDs.
    Now, not all metadata is equal. Some is more authoritative than other
    metadata, and is likely minted and maintained by the DO creator/provider
    that owns the PID and that metadata can then indeed be linked to also by
    the DO PID (via a PID Information Type). This type of metadata should for
    instance contain Rights/Licensing information that should not be changed
    (by others). The fact that such metadata can be reached from the PID
    ‘proves’ that it is the authoritative metadata (controlled by the owner of
    the PID).
    There are proposals (and implementations) for different architectures:
    PID->MD->DO, PID->MD{DO}, … But these all rely on a coherent data world
    where software knows for instance how to fish a DO PID from the metadata.
    There is some advantage in having a machinery that is robust and does not
    depend on special constructs.
    Best,
    Daan
    --
    Daan Broeder
    CTO & Deputy Head
    The Language Archive – MPI for Psycholinguistics
    +31 24 3521103
    ***@***.***
    P.O. Box 310
    6500 AH Nijmegen, The Netherlands
    --

  • Author: Ulrich Schwardmann

    Date: 06 Jan, 2015

    Dear all,
    I must confess I didn't want to raise a discussion about the different
    types and levels of metadata here, even if, as Daan said, this is a
    popular topic.
    I suppose, my point was at some different level. I was looking at this
    figure 5 from the processing point of view, and my underlying question
    was, what is the simplest model for the operational objects, that is
    needed and useful for processing there.
    My proposition here is, that the simplest, still useful model for such
    an operational object is a triple of the form MD<-PID->DO, and the next
    less simple model with some more advantages is MD<->PID->DO. The most
    complex model is of course the bidirectional fully connected triple PID
    <-> MD <-> DO <-> PID, and there are a couple of others, which are in
    some cases useful and in other cases not useful at all (like unconnected
    triples).
    My assumptions are:
    A0) We are looking primarily for the simplest solution.
    A1) We want to get as much coherence in the system(s) as possible.
    A2) Processing on a DO will always need some sort of the MD (owner,
    rights, data structure, time stamps, ...).
    A3) The processing itself is a black box.
    A4) For some reason (like global interoperability, persistent links) we
    also need PIDs.
    A5) PIDs are a special kind of pointers.
    A6) PIDs at least point to a DO, and might have information types.
    A7) Processing can also take place on PIDs (i.e. update).
    Consequences:
    C1) From A1) the operational objects should be addressable via pointers.
    C2) From A1) and A3) the operational objects must always have the same
    minimal internal structure.
    C3) From A2) and A7) depending on the process we might need access to
    the DO, the PID and/or the MD, which usually needs pointers to them.
    C4) From A3), C1), C2) and C3) the pointer to the operational object
    must give access to the whole triple MD, PID and DO.
    C5) A solution for C4) could be to introduce a new entity, which is a
    list of pointers to MD, PID, DO and to which the pointer to the
    operational object actually points, but this would be rejected by A0).
    C6) If the pointer to the operational object points to one element of
    the triple, for a solution of C4) the internal structure must allow the
    access to pointers to the others, if needed in separate steps.
    C7) From C6) the triple must build a connected directed graph, and
    actually must contain a tree, because a connection X->Y<-Z would not
    fulfill C4).
    C8) From A6) (PID->DO) and C7), and if we assume only two pointers
    (according to A0)) we only have the possibilities MD<-PID->DO,
    MD->PID->DO, PID->DO->MD and DO->MD->PID. The latter two are less useful
    in my opinion, and others made similar remarks.
    C9) If we additionally use A5) we can directly use the PID as pointer
    internally, which would be in accordance with A0) again, and from C8) this
    ends up with MD<-PID->DO (my favorite tree :).
    Of course with other assumptions one gets other consequences here, and
    usually the different existing repository implementations will have
    started with different assumptions. I nevertheless suppose, that the
    assumptions above are rather generic, and that the other assumptions
    might contain these assumptions, such that a specific solution might be
    a more connected triple (like MD<->PID->DO for instance).
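
The counting argument in C7) and C8) can be checked mechanically. The following is a minimal sketch, assuming a strict reading of A6) (the link PID->DO must be present) and of the two-pointer limit from A0); the function names are invented for this example. It enumerates all pairs of directed links over MD, PID and DO and keeps those that have at least one entry point from which the whole triple is reachable.

```python
from itertools import combinations

NODES = ("MD", "PID", "DO")
# All possible directed links between two different nodes.
ALL_LINKS = [(a, b) for a in NODES for b in NODES if a != b]


def reaches_all(root, links):
    """True if every node can be reached from root by following links."""
    seen = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for src, dst in links:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen == set(NODES)


def candidate_models(required=("PID", "DO"), size=2):
    """List link sets of the given size that contain the required link and
    have at least one entry point reaching the whole triple."""
    models = []
    for links in combinations(ALL_LINKS, size):
        if required not in links:
            continue
        roots = [n for n in NODES if reaches_all(n, links)]
        if roots:
            models.append((frozenset(links), roots))
    return models
```

Under these assumptions the enumeration yields MD<-PID->DO (entry point PID), MD->PID->DO (entry point MD) and PID->DO->MD (entry point PID). The chain DO->MD->PID does not itself contain the link PID->DO, so it is not produced under this strict reading; PID->DO<-MD is rejected as in C7) because it has no entry point reaching all three elements.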
    BTW. One also has a very similar situation in the context of object
    storage, where each object typically includes the data itself, a
    variable amount of metadata, and a globally unique identifier. Here, too,
    one can usually access metadata and data directly via the identifier.
    The only problem is that global uniqueness and resolvability of the
    identifier are usually not really guaranteed. Besides this, it probably
    contains all the abstraction we need in this context.

  • Author: Keith Jeffery

    Date: 06 Jan, 2015

    Ulrich -
    Just to note a common use case is:
    Each DO instance has a PID
    There are >1 MD instances that may be or must be associated with 1 DO instance (descriptive metadata, restrictive metadata, navigational metadata, schema metadata...)
    Some MD instances may be or must be associated with >1 DO instance (e.g. a formal description of rights)
    Each of these MD instances has a PID
    Hence the cardinality in the relationship MD<->DO is n:m (i.e. either end may be 1 (or even 0) but it is not mandatory).
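
The n:m relationship in this use case can be sketched as a toy registry that keeps the association in both directions. The class name `Registry`, its methods and the example PIDs are assumptions made for illustration and do not refer to an existing service.

```python
from collections import defaultdict


class Registry:
    """Toy registry of MD <-> DO associations keyed by PID (illustrative only)."""

    def __init__(self):
        self._md_for_do = defaultdict(set)
        self._do_for_md = defaultdict(set)

    def associate(self, md_pid, do_pid):
        """Record that the MD instance is associated with the DO instance."""
        self._md_for_do[do_pid].add(md_pid)
        self._do_for_md[md_pid].add(do_pid)

    def metadata_of(self, do_pid):
        """All MD PIDs associated with a DO (possibly none)."""
        return set(self._md_for_do.get(do_pid, ()))

    def objects_of(self, md_pid):
        """All DO PIDs associated with an MD instance (possibly none)."""
        return set(self._do_for_md.get(md_pid, ()))
```

One DO can then carry descriptive and rights metadata at the same time, while a single rights description can be attached to several DOs; a DO with no registered metadata simply returns an empty set.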
    For me the PID instance is what it says, a persistent (or permanent) identifier instance, hopefully with a unique value (hence my preference for UUIDs because of the problems of managing uniqueness in registered PIDs). Ideally it has no semantics. It is effectively an attribute value of a DO or MD instance. It is not itself an entity or object since it has no attributes/properties. However, this does not accord with the European EPIC consortium, which is supported by RDA and which uses a handle system where the prefix refers to the server to resolve the suffix (and hence introduces a dangerous binding to physical resources).
    [Aside: we have the problem that PID was used earlier to mean process identifier, i.e. how an operating system uniquely identifies a running process.]
    best
    Keith
    ---------------------------------------------------------------------------------------------------------------------------------------
    Keith G Jeffery Consultants
    Prof Keith G Jeffery
    E: ***@***.***
    T: +44 7768 446088
    S: keithgjeffery
    Past President ERCIM www.ercim.eu (***@***.***)
    Past President euroCRIS www.eurocris.org
    Past Vice President VLDB www.vldb.org
    Fellow (CITP, CEng) BCS www.bcs.org
    Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
    Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
    Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
    ----------------------------------------------------------------------------------------------------------------------------------
    The contents of this email are sent in confidence for the use of the
    intended recipient only. If you are not one of the intended
    recipients do not take action on it or show it to anyone else, but
    return this email to the sender and delete your copy of it.
    ----------------------------------------------------------------------------------------------------------------------------------
    -----Original Message-----
    From: Ulrich Schwardmann [mailto:***@***.***]
    Sent: 06 January 2015 14:08
    To: dgbroeder; Keith Jeffery; Larry Lannom; Data Fabric IG
    Cc: TobiasWeigel; HermanStehouwer; Gary; Peter Wittenburg
    Subject: Re: [rda-datafabric-ig] Re: [rda-datafabric-ig] draft version white paper form DFIG
    Dear all,
    I must confess I didn't want to raise a discussion about the different
    types and levels of metadata here, even if, as Daan said, this is a
    popular topic.
    I suppose, my point was at some different level. I was looking at this
    figure 5 from the processing point of view, and my underlying question
    was, what is the simplest model for the operational objects, that is
    needed and useful for processing there.
    My proposition here is, that the simplest, still useful model for such
    an operational object is a triple of the form MD<-PID->DO, and the next
    less simple model with some more advantages is MD<->PID->DO. The most
    complex model is of course the bidirectional fully connected triple PID
    <-> MD <-> DO <-> PID, and there are a couple of others, which are in
    some cases useful and in other cases not useful at all (like unconnected
    triples).
    My assumptions are:
    A0) We are looking primarily for the simplest solution.
    A1) We want to get as much coherence in the system(s) as possible.
    A2) Processing on a DO will always need some sort of the MD (owner,
    rights, data structure, time stamps, ...).
    A3) The processing itself is a black box.
    A4) For some reason (like global interoperability, persistent links) we
    also need PIDs.
    A5) PIDs are a special kind of pointers.
    A6) PIDs at least point to a DO, and might have information types.
    A7) Processing can also take place on PIDs (i.e. update).
    Consequences:
    C1) From A1) the operational objects should be addressable via pointers.
    C2) From A1) and A3) the operational objects must always have the same
    minimal internal structure.
    C3) From A2) and A7) depending on the process we might need access to
    the DO, the PID and/or the MD, which usually needs pointers to them.
    C4) From A3), C1), C2) and C3) the pointer to the operational object
    must give access to the whole triple MD, PID and DO.
    C5) A solution for C4) could be to introduce a new entity, which is a
    list of pointers to MD, PID, DO and to which the pointer to the
    operational object actually points, but this would be rejected by A0).
    C6) If the pointer to the operational object points to one element of
    the triple, for a solution of C4) the internal structure must allow
    access to pointers to the others, if needed in separate steps.
    C7) From C6) the triple must build a connected directed graph, and
    actually must contain a tree, because a connection X->Y<-Z would not
    fulfill C4).
    C8) From A6) (PID->DO) and C7), and if we assume only two pointers
    (according to A0)) we only have the possibilities MD<-PID->DO,
    MD->PID->DO, PID->DO->MD and DO->MD->PID. The latter two are less useful
    in my opinion, and others made similar remarks.
    C9) If we additionally use A5), we can directly use the PID as a pointer
    internally, which would be in accordance to A0 again, and from C8) this
    ends up with MD<-PID->DO (my favorite tree:).
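    The argument in C7) and C8) can be checked mechanically. The following
    is a minimal Python sketch, not part of the original reasoning: it
    assumes the triple is modelled as a directed graph on the nodes MD, PID
    and DO, that exactly two pointers are allowed (A0), that PID->DO is one
    of them (A6), and that C4) means some element reaches the other two.
    All function names are hypothetical.

```python
from itertools import combinations, permutations

NODES = ("MD", "PID", "DO")
# All possible directed pointers between distinct elements of the triple.
EDGES = list(permutations(NODES, 2))


def reachable(start, edges):
    """Return the set of nodes reachable from start by following pointers."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def entry_points(edges):
    """Nodes from which the whole triple can be reached (the C4/C6 condition)."""
    return [n for n in NODES if reachable(n, edges) == set(NODES)]


def two_pointer_models(required=("PID", "DO")):
    """Two-pointer graphs that contain the required edge and have an entry point."""
    models = {}
    for pair in combinations(EDGES, 2):
        roots = entry_points(pair)
        if required in pair and roots:
            models[frozenset(pair)] = roots
    return models


if __name__ == "__main__":
    for edges, roots in two_pointer_models().items():
        print(sorted(edges), "entry via", roots)
```

    Under these assumptions the sketch finds MD<-PID->DO, MD->PID->DO and
    PID->DO->MD, and rejects X->Y<-Z shapes such as MD->DO<-PID. DO->MD->PID
    is not produced because, written with two pointers, it does not include
    PID->DO.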
    Of course with other assumptions one gets other consequences here, and
    usually the different existing repository implementations will have
    started with different assumptions. I nevertheless suppose, that the
    assumptions above are rather generic, and that the other assumptions
    might contain these assumptions, such that a specific solution might be
    a more connected triple (like MD<->PID->DO for instance).
    BTW, one also has a very similar situation in the context of object
    storage, where each object typically includes the data itself, a
    variable amount of metadata, and a globally unique identifier. Here,
    too, one can usually access metadata and data directly via the
    identifier. The only problem is that global uniqueness and
    resolvability of the identifier are usually not really guaranteed.
    Besides this, it probably contains all the abstraction we need in this
    context.

  • Ulrich Schwardmann's picture

    Author: Ulrich Schwardmann

    Date: 06 Jan, 2015

    Hi Keith,
    an n:m relationship for MD<->DO is certainly not a simple construction
    and therefore, from my point of view, not a possible model for the
    operational object in such a system. I therefore suggest breaking such
    a complex object down into smaller pieces, such that the MDs you
    mention become DOs in their own right, which additionally refer to the
    DO and to other related MDs, which are then also DOs. This gives simple
    units at the operational level, but obviously one needs to map the more
    complex structure at a higher level. A couple of the processing elements
    will be designed to handle these interdependencies and references
    between the operational objects. So much for the abstract level.
    On the more technical level, your hint about a dangerous binding to
    physical resources via global resolution is very important, and I
    completely agree that such a system should be designed so that it can
    avoid global resolution for each processing operation. But there are
    alternatives. Especially in the case of Handle and EPIC, all PIDs used
    inside the local domain can be resolved directly from the underlying
    local database. The rest is a performance and reliability issue of the
    underlying database. But this is solvable, as the internal DBs of the
    object storage implementations show. Only changes to the PID need a bit
    more than DB access, but are also local. And only references to external
    objects via external PIDs need global resolution. But all this would
    still fit into the abstract picture of such triples.
    And to bring UUIDs into play here: they could become globally resolvable
    as suffix of a PID. One would have two resolving systems in this case,
    one by the UUID, the other by minting into the PID system. We use this
    configuration at GWDG in some projects.
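    The two resolution paths described above (directly by UUID, or via a
    PID whose suffix is that UUID) could look roughly like the following
    Python sketch. It is an illustration only, not the GWDG setup: the
    prefix "00.EXAMPLE", the LocalResolver class and the in-memory
    dictionary standing in for the local database are all assumptions.

```python
import uuid


class LocalResolver:
    """Hypothetical resolver: a dict stands in for the local PID database."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._by_uuid = {}

    def mint(self, location):
        """Register a location under a fresh UUID; return (uuid, prefix/uuid)."""
        key = str(uuid.uuid4())
        self._by_uuid[key] = location
        return key, f"{self.prefix}/{key}"

    def resolve_uuid(self, key):
        """First path: look the object up by its bare UUID."""
        return self._by_uuid[key]

    def resolve_pid(self, pid, global_resolver=None):
        """Second path: resolve a PID, staying local when the prefix is ours."""
        prefix, _, suffix = pid.partition("/")
        if prefix == self.prefix:
            # PID from the local domain: answered from the local database,
            # with no round trip to a global resolver.
            return self._by_uuid[suffix]
        if global_resolver is None:
            raise LookupError(f"external PID needs global resolution: {pid}")
        return global_resolver(pid)


if __name__ == "__main__":
    resolver = LocalResolver("00.EXAMPLE")  # made-up prefix for illustration
    key, pid = resolver.mint("file:///data/run1.nc")
    print(pid, "->", resolver.resolve_pid(pid))
```

    Only PIDs with a foreign prefix reach the global_resolver callback,
    which mirrors the point that global resolution is needed only for
    references to external objects.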

  • Keith Jeffery's picture

    Author: Keith Jeffery

    Date: 06 Jan, 2015

    Ulrich -
    Thanks for your response - very helpful.
    We have to be careful about the distinction between MD and DO. I was using DO to mean (in the RDA context) research data instances such as tuples in a relation describing observations or experiments, or complex objects in an OODB. I was using MD to mean data (a DO if you like) being used as metadata to describe the DO.
    The example I always use (since it is simple and universally understood) is the e-library catalog card system. To a researcher finding a book it is metadata. To a librarian counting books on geochemistry it is data. Thus the distinction between data and metadata is purpose or intent, not any fundamental property. Each instance has a PID. The question is whether the file as a whole (of instances of catalog cards) should have a PID and whether the metadata describing the e-library catalog card system (which could be a simple relational schema or something with many kinds of metadata concerning rights etc.) should have a PID. If so the question is how are they used and how are they related to the instance PIDs.
    To me DO and MD (and for that matter software) are all data, as are descriptions of computing resources, data storage, networks, users... all the things needed for a VRE (Virtual Research Environment). The relationships (declared and/or discovered, and re-use / re-purposing of the relationships) between object classes / entities or object instances / tuples are for me the interesting part.
    Since these relationships are complex and certainly in the real world they are n:m, we have to find a way to handle the n:m relationships in IT systems if they are to reflect accurately the real world. There is a tendency in IT to over-simplify (hierarchies instead of fully connected graphs for example) which leads to terrible misrepresentation of the real world in simplified data structures and horrible hacks to overcome the limitations (think of aspect-oriented programming or representing multiple instances of complex objects in XML!). BTW IBM discovered the problem in the data environment with DBOMP for Boeing back in the 60s leading to IMS (which had the equivalent of aspect-orientation) and finally of course the simplicity, clarity and formality (theoretical basis) of relational systems (which do handle fully connected graphs using the MVD decomposition technique of linking relations (expressing the relationship or dependency) between base relations).
    So this is why I cannot accept your first assertion; I believe we must handle n:m in order to represent the real world of interest to RDA. Moreover we have to handle dynamic changes to n:m in cardinality and also in temporal duration (we are getting into provenance here) and possibly in probability (including estimated degrees of probability that the assertion in the tuple (or triple) is true). In an earlier email in this thread (20150105) I explained the problems with triples and described the IBM Hursley work in the 1980s on triples, later extended to quintuples and septuples (adding temporal and modal logic). The problem was performance and storage required, and the use of conventional relational base entities with linking entities was demonstrated to be more effective. However, the need to go beyond simple triples (more recently re-invented as RDF) was clear. There is no problem generating RDF triples from a formal relational (or object-oriented) environment although the reverse may be undecidable.
    So, your suggestion of breaking down so all objects are DOs is fine by me, as long as the relationships between them (at entity/object class level and at instance level) are described fully and appropriately. I do not accept that having the same PID (as some have suggested) expresses sufficiently a relationship. It is analogous to having the same primary key value in 2 relations which should, in fact, indicate that the associated attribute values in each case can be components of an equi-join. In the case of a MD DO and a Data instance DO this is clearly wrong. Even between two data instance DOs the relationship is likely to be more subtle - for example one set of attributes could have been collected by one experiment and the other set by observation - and this should be recorded in the relationship. Another problem (opportunity!) is that the relationships may not be known or defined at data input / collection time; in fact a great part of science is discovering the new relationships between entities (using instances of them) and these relationships may well be n:m, are certainly dynamic (new relationships discovered changing for example cardinality) and may have attached probability.
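    To make the point about recording the relationship itself more concrete, here is a minimal Python sketch of a relationship record that carries a role, a validity interval and a probability. It is an assumption-laden illustration only: the `Relationship` class, its field names and the `related` helper are hypothetical and not taken from any RDA specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    """One asserted link between two objects, each identified by its own PID."""
    source_pid: str
    target_pid: str
    role: str                         # e.g. "describes", "restricts"
    valid_from: date
    valid_to: Optional[date] = None   # None means the assertion still stands
    probability: float = 1.0          # estimated degree of belief in the assertion

    def holds_on(self, day):
        """True if the assertion is within its validity interval on the given day."""
        started = self.valid_from <= day
        not_ended = self.valid_to is None or day < self.valid_to
        return started and not_ended


def related(relationships, pid, day, min_probability=0.0):
    """Relationships touching pid that hold on day with at least min_probability."""
    return [
        r for r in relationships
        if pid in (r.source_pid, r.target_pid)
        and r.holds_on(day)
        and r.probability >= min_probability
    ]
```

    Because each relationship is its own record, new ones can be added after data collection, and cardinality, duration and probability can change without touching the DO or MD instances themselves.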
    Thanks again for the good discussion
    Best
    Keith
