Ulrich.
This implied bi-directionality is a good catch.
I think that was intended, based on past work (in part by Yin Chin) and
analysis of the DFT WG, and it was one of your proposed solutions:
> The other solution would be to use the PID record, that points to the
> digital object, also as reference to the metadata object. Several PID
> systems are able to set up such an additional reference. In this case the
> whole triple can be referred to by this PID record very efficiently. The
> metadata as well as the digital object are directly resolved in this
> construction. From my point of view this solution would be more elegant (in
> the sense of Occam's razor), because it uses less resources and gives more
> direct access to all the necessary information.
I would make another observation about what is or is not implied in
Diagram 5.
The line relations are not qualified or labeled. They seem to imply some
"process" leading to access via "info" in the boxes. But this processing
(or something else) is subject to interpretation. Some use case
discussions suggest that one may search the metadata and identify one or
several digital objects of interest. From such a metadata-based search one
may go directly to a DO and bypass the PID record.
I think that the Figure should show this relation too, since such access to
DOs is likely to persist and may account for a plurality of activity now.
Gary Berg-Cross, Ph.D.
***@***.***
http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
SOCoP Executive Secretary
Independent Consultant
Potomac, MD
240-426-0770
Author: Herman Stehouwer
Date: 02 Jan, 2015
Dear all,
For what it is worth, I fully agree that the triple should be fully
connected.
I.e. I do not think a DO without MD is very useful, neither is MD
without a pointer to the DO.
Equally, the PID should be able to resolve both in some manner.
I am not sure that the DO should be able to resolve the MD (though it
could be useful in some cases), but I am sure the MD should be able to
resolve to the DO.
My main reason here is that if the DO has to resolve the MD then the
coupling of the MD with the storage of the DO has to be fairly tight.
I prefer a loose coupling as I feel that will make things easier on
everyone.
The other way around, one can generally put some sort of reference
somewhere in the MD without any issues.
In short: PID -> MD, PID -> DO, MD -> DO, and MD -> PID.
Quite likely DO -> PID (required for PID management and so forth anyway).
Not sure about DO -> MD (though one can always do the DO -> PID -> MD hop).
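
A minimal Python sketch of the link set listed above, treating the triple as a small directed graph. The names `LINKS` and `shortest_path` are illustrative assumptions, not part of any PID system. It only shows that, without a direct DO -> MD link, the metadata can still be reached from the DO by hopping through the PID:

```python
from collections import deque

# Links as listed in the message; the direct DO -> MD link is deliberately omitted.
LINKS = {
    "PID": ["MD", "DO"],
    "MD": ["DO", "PID"],
    "DO": ["PID"],
}

def shortest_path(links, start, goal):
    """Breadth-first search over the links; returns a list of nodes or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(LINKS, "DO", "MD"))  # ['DO', 'PID', 'MD']
```

The cost of leaving out the direct link is one extra resolution step per access.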
Cheers,
Herman
Author: Tobias Weigel
Date: 05 Jan, 2015
I agree with this and just want to emphasize the 'hop' solution as
an important point: at some scale (millions of objects) we may
prefer a smaller number of direct (location-dependent) links in view of
maintenance issues, although this makes navigation more expensive. So you
may even cut off the direct MD -> DO connection.
An alternative constellation comes to mind if MD and DO are seen as
distinct entities with dedicated PIDs (i.e., 2 PIDs in total). Then - in
view of the same considerations as above - we would point back and forth
between MD/DO and its PID and have bidirectional pointers between the
PIDs (putting down PID A in PID record of PID B and vice versa).
Relocating MD or DO then does not force us to touch the other entity.
Both alternatives are valid; the choice depends on the use case /
discipline / desired granularity.
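
The two-PID constellation can be sketched roughly as follows. This is only an illustration under stated assumptions: the `PidRecord` class, the `related` field, the PID strings and the example.org locations are all hypothetical and do not correspond to any particular PID system. The point is that relocating the DO touches only the DO's own PID record:

```python
from dataclasses import dataclass, field

@dataclass
class PidRecord:
    location: str                                # current, location-dependent address
    related: dict = field(default_factory=dict)  # role -> PID of the other entity

DO_PID = "example/do-1"   # hypothetical identifiers
MD_PID = "example/md-1"

records = {
    DO_PID: PidRecord("https://store-a.example.org/do-1", {"metadata": MD_PID}),
    MD_PID: PidRecord("https://catalog.example.org/md-1", {"describes": DO_PID}),
}

def resolve(pid):
    """Return the current location stored in the PID record."""
    return records[pid].location

def related_location(pid, role):
    """Follow the PID-to-PID pointer, then resolve the other PID."""
    return resolve(records[pid].related[role])

def relocate(pid, new_location):
    """Moving an entity updates only its own PID record."""
    records[pid].location = new_location

relocate(DO_PID, "https://store-b.example.org/do-1")
print(related_location(MD_PID, "describes"))  # https://store-b.example.org/do-1
```

Because the MD side stores only the DO's PID and never its location, the MD record is unchanged by the move, at the price of one extra resolution step.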
Best, Tobias
Author: Keith Jeffery
Date: 05 Jan, 2015
Tobias-
I had always assumed the MD had its own PID just like the DO. However, the problem comes in defining PID(s) for the DO, and this depends on the DO structure.
A simple tabular data file may have one PID.
But a file of complex objects (e.g. descriptions of experiments with their results, equipment used, cross-references to publications etc.) may have multiple PIDs. In short, the atomicity of DO PIDs needs discussion.
Best
Keith
Keith G Jeffery Consultants
Prof Keith G Jeffery
E: ***@***.***
T: +44 7768 446088
S: keithgjeffery
Past President ERCIM www.ercim.eu (***@***.***)
Past President euroCRIS www.eurocris.org
Past Vice President VLDB www.vldb.org
Fellow (CITP, CEng) BCS www.bcs.org
Co-chair RDA MIG https://rd-alliance.org/internal-groups/metadata-ig.html
Co-chair RDA MSDWG https://rd-alliance.org/working-groups/metadata-standards-directory-work...
Co-chair RDA DICIG https://rd-alliance.org/internal-groups/data-context-ig.html
----------------------------------------------------------------------------------------------------------------------------------
The contents of this email are sent in confidence for the use of the
intended recipient only. If you are not one of the intended
recipients do not take action on it or show it to anyone else, but
return this email to the sender and delete your copy of it.
----------------------------------------------------------------------------------------------------------------------------------
On Fri, Jan 2, 2015 at 7:44 AM, uschwar1 <***@***.***> wrote:
Dear all, Peter,
Thanks for all the work already done in this White Paper, which already gives quite a good impression of what this group is about, and in which direction it might go.
I have a comment and change suggestion concerning the white paper, which seems only to be a small correction in diagram 5:
the arrow between the metadata object and the PID record should be bidirectional.
Because I think that this is important, but on the other hand that it might raise some discussion about how data, metadata and PIDs are (or have to be) organized, I will try to justify this in the following. If you immediately agree, just ignore the rest (except the BTW).
As mentioned in the White Paper, this diagram is a core part of the framework, because it tries to describe schematically the processes and their input and output elements on an atomic level. Since both input and output elements are triples of the same kind, these triples are in some sense the elementary units of the framework, and the question arises how these triples are referred to inside the framework.
With reference here I mean the technical reference, but also it makes sense to have a good name for these triples. But this is another discussion.
The technical reference to the basic elements in a framework always plays an important role, because inside the framework's workflows these basic elements are used and transferred via their references, the pointers, for as long as possible. This is also why pointers play such an important role in programming.
In the current version of the diagram the technical reference to all components of the triple is only possible via the metadata object, because only this points to the PID record and from there to the digital object. There is no direct way to get the metadata, if one only has the PID record as reference to the digital object. This always needs a search inside all the metadata records.
The consequence would be that the framework would be essentially driven by metadata objects organized in some metadata registry. From my point of view this would only work efficiently, safely and interoperably if the metadata objects also have identifiers like PIDs that can be used as lightweight pointers in the processes. In this case we have to extend the triples to quadruples, containing also the PID referring to the metadata object, and we have a three-step resolution to get from the pointer of the quadruple to the digital object. But in the end this could also be a possible extension of the diagram.
The other solution would be to use the PID record, that points to the digital object, also as reference to the metadata object. Several PID systems are able to set up such an additional reference. In this case the whole triple can be referred to by this PID record very efficiently. The metadata as well as the digital object are directly resolved in this construction. From my point of view this solution would be more elegant (in the sense of Occam's razor), because it uses less resources and gives more direct access to all the necessary information.
At least for the further discussions I would emphasise the possibility of such a solution by allowing the reference between PID record and metadata object in both directions. Whether this additional arrow direction is implemented in the concrete case is another question, that is addressed by the formulation, that the diagram illustrates possible processing at the atomic level.
BTW. A happy New Year to everybody.
On 24.12.2014 at 10:10, Peter Wittenburg wrote:
Thanks Daan.
We received now a few comments on the first draft. In early January we should work on a revised version again. If other people have comments as well, please send them asap.
Best
Peter
From: Daan Broeder
Sent: Tuesday, December 23, 2014 11:10 PM
To: Peter Wittenburg; ***@***.***-groups.org
Subject: Re: [rda-datafabric-ig] draft version white paper form DFIG
Dear all,
Some comments and suggestions.
Main points are:
* Still need to make the concept of DF clearer, especially the differences with respect to workflow frameworks. Suggest to emphasize the ‘pure’ DM application of DF
* I suggest to introduce the idea of the DF as a superset of all DM components and services. Specific combinations of these (“profiles”) may be used to do specific DM work
Please see the attached version for more.
If you have already discussed and clarified these points, I apologise that I did not join before, but please still have a look.
Happy Christmas,
Daan
--
Daan Broeder
CTO & Deputy Head
The Language Archive – MPI for Psycholinguistics
+31 24 3521103
***@***.***
P.O. Box 310
6500 AH Nijmegen, The Netherlands
--
From: Peter Wittenburg
<***@***.***>
Date: Thursday, 4 December 2014 16:57
To: "***@***.***-groups.org" <***@***.***-groups.org>
Cc: Peter Wittenburg
<***@***.***>
Subject: [rda-datafabric-ig] draft version white paper form DFIG
Dear all,
Here is a first draft version of the white paper which we want to circulate outside of the editing team. You can find all side information in the DFIG group wiki:
https://rd-alliance.org/groups/data-fabric-ig/wiki/data-fabric-ig-docume...
Please comment before Christmas in this thread so that we can work on a first real version 1.X during the Christmas/New Year's days.
Best
Rob & Peter
---------------------------------------------------------------------------------------------------------------
Peter Wittenburg Tel: +49 2821 49180 ***@***.***
RDA Founding Member
EUDAT Scientific Coordinator
Senior Advisor Data Systems
Computer Center Garching
Boltzmannstraße 2
85748 Garching
Germany
http://www.rzg.mpg.de/
http://www.mpi.nl/people/wittenburg-peter
former affiliation:
Max Planck Institute for Psycholinguistics
Wundtlaan 1
6525 XD Nijmegen
The Netherlands
--
Full post: https://www.rd-alliance.org/group/data-fabric-ig/post/draft-version-whit...
Manage my subscriptions: https://www.rd-alliance.org/mailinglist
Stop emails for this post: https://www.rd-alliance.org/mailinglist/unsubscribe/46723
--
With kind regards
Ulrich Schwardmann
Phone:+49-551-201-1542 Email:***@***.*** _____ _____ ___
Gesellschaft fuer wissenschaftliche / __\ \ / / \ / __|
Datenverarbeitung mbH Goettingen (GWDG) | (_--\ \/\/ /| |) | (_--
Am Fassberg 11 D-37077 Goettingen Germany \___| \_/\_/ |___/ \___|
URL: http://www.gwdg.de E-Mail: ***@***.***
Tel.: +49 (0)551 201-1510 Fax: +49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Dipl.-Kfm. Markus Hoppe
Sitz der Gesellschaft: Goettingen Registergericht: Goettingen
Handelsregister-Nr. B 598 Zertifiziert nach ISO 9001
--
Full post: https://www.rd-alliance.org/group/data-fabric-ig/post/re-rda-datafabric-...
Manage my subscriptions: https://www.rd-alliance.org/mailinglist
Stop emails for this post: https://www.rd-alliance.org/mailinglist/unsubscribe/46883
--
Dr. ir. Herman Stehouwer
Rechenzentrum Garching @ Max Planck for Plasmaphysics
RDA Secretariat
***@***.*** 0031-619258815
Skype: herman.stehouwer.mpi
--
Tobias Weigel
Abteilung Datenmanagement
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45 a • 20146 Hamburg • Germany
Phone: +49 40 460094-104
Email: ***@***.***
URL: http://www.dkrz.de
Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784
Author: Thomas Zastrow
Date: 05 Jan, 2015
Tobias, Herman, all,
I'm a fan of storing the MD directly with the object data. That makes
the handling of the whole thing much easier, and all the common
repositories today also work this way: the MD is stored
together with the object data in a thing called "Digital Object" (having
Fedora, DSpace in mind). Then only one PID is necessary to address the
whole DO, containing the object data and its metadata. Of course this
scenario is not always possible, but as a recommendation, I think it would
make sense.
In Clarin, we had the PID stored as "MDSelfLink" in the CMDI metadata;
that would be the already existing arrow in the diagram, MD -> PID.
Having a pointer back from the PID to the MD would be a great
enhancement, especially having our PIT API in mind: implementing such a
pointer back would be easy and, as Ulrich already said, common PID
systems are able to do this.
Just 2 cents :-)
Best,
Tom
Author: Larry Lannom
Date: 05 Jan, 2015
All,
Right - the MD object is a DO in its own right. It is also the case that not all DOs which exist in the role of metadata to other DOs will be known to the owners/controllers of those other DOs, e.g., annotations and reviews, and so expecting direct links that establish all relationships among all possible related DOs is not reasonable.
Larry
Author: Keith Jeffery
Date: 05 Jan, 2015
Larry -
But intelligent mining might discover potential relationships which can then be validated?
Best
Keith
Author: Larry Lannom
Date: 05 Jan, 2015
Keith,
Absolutely. And one could even help less intelligent miners by building indexes, etc. I believe the key is being able to rely on the identifiers and associated metadata of any and all DOs to dynamically build the network of relationships as needed.
Best,
Larry
Author: Keith Jeffery
Date: 05 Jan, 2015
Larry
As usual we are in accord
Best
Keith
Author: Daan Broeder
Date: 05 Jan, 2015
Dear all,
Clearly this keeps being a popular topic over the years.
I would urge treating metadata as much as possible as a ‘separate’
object. This implies each having a separate PID and also
allowing metadata for the metadata. If data and metadata are separate
objects there is also a real need for managing both with PIDs.
Now, not all metadata is equal. Some metadata is more authoritative than
other metadata; it is likely minted and maintained by the DO creator/provider
that owns the PID, and that metadata can then indeed also be linked to from
the DO PID (via a PID Information Type). This type of metadata should for
instance contain Rights/Licensing information that should not be changed
(by others). The fact that such metadata can be reached from the PID
‘proves’ that it is the authoritative metadata (controlled by the owner of
the PID).
There are proposals (and implementations) for different architectures:
PID->MD->DO, PID->MD{DO}, … But these all rely on a coherent data world
where software knows for instance how to fish a DO PID from the metadata.
There is some advantage in having a machinery that is robust and does not
depend on special constructs.
Best,
Daan
Author: Ulrich Schwardmann
Date: 06 Jan, 2015
Dear all,
I must confess I didn't want to raise a discussion about the different
types and levels of metadata here, even if, as Daan said, this is a
popular topic.
I suppose, my point was at some different level. I was looking at this
figure 5 from the processing point of view, and my underlying question
was, what is the simplest model for the operational objects, that is
needed and useful for processing there.
My proposition here is, that the simplest, still useful model for such
an operational object is a triple of the form MD<-PID->DO, and the
next-simplest model, with some more advantages, is MD<->PID->DO. The most
complex model is of course the bidirectional fully connected triple PID
<-> MD <-> DO <-> PID, and there are a couple of others, which are in
some cases useful and in other cases not useful at all (like unconnected
triples).
My assumptions are:
A0) We are looking primarily for the simplest solution.
A1) We want to get as much coherence in the system(s) as possible.
A2) Processing on a DO will always need some sort of the MD (owner,
rights, data structure, time stamps, ...).
A3) The processing itself is a black box.
A4) For some reason (like global interoperability, persistent links) we
also need PIDs.
A5) PIDs are a special kind of pointer.
A6) PIDs at least point to a DO, and might have information types.
A7) Processing can also take place on PIDs (e.g. update).
Consequences:
C1) From A1) the operational objects should be addressable via pointers.
C2) From A1) and A3) the operational objects must always have the same
minimal internal structure.
C3) From A2) and A7) depending on the process we might need access to
the DO, the PID and/or the MD, which usually needs pointers to them.
C4) From A3), C1), C2) and C3) the pointer to the operational object
must give access to the whole triple MD, PID and DO.
C5) A solution for C4) could be to introduce a new entity, which is a
list of pointers to MD, PID, DO and to which the pointer to the
operational object actually points, but this would be rejected by A0).
C6) If the pointer to the operational object points to one element of
the triple, then for a solution of C4) the internal structure must allow
access to pointers to the others, if needed in separate steps.
C7) From C6) the triple must build a connected directed graph, and
actually must contain a tree, because a connection X->Y<-Z would not
fulfill C4).
C8) From A6) (PID->DO) and C7), and if we assume only two pointers
(according to A0)) we only have the possibilities MD<-PID->DO,
MD->PID->DO, PID->DO->MD and DO->MD->PID. The latter two are less useful
in my opinion, and others made similar remarks.
C9) If we additionally use A5) we can directly use the PID as pointer
internally, which would be in accordance with A0) again, and from C8) this
ends up with MD<-PID->DO (my favorite tree :).
Of course with other assumptions one gets other consequences here, and
usually the different existing repository implementations will have
started with different assumptions. I nevertheless suppose, that the
assumptions above are rather generic, and that the other assumptions
might contain these assumptions, such that a specific solution might be
a more connected triple (like MD<->PID->DO for instance).
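
The reachability requirement in C4), C6) and C7) can be checked mechanically. The following Python sketch is only an illustration; the function names `reachable` and `entry_points` are assumptions made for this example. For a given set of pointers it reports from which element of the triple all three elements can be reached:

```python
NODES = {"MD", "PID", "DO"}

def reachable(edges, start):
    """Return the set of nodes reachable from start by following pointers."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def entry_points(edges):
    """Elements from which the whole triple is accessible (requirement C4)."""
    return sorted(n for n in NODES if reachable(edges, n) == NODES)

print(entry_points([("PID", "MD"), ("PID", "DO")]))  # MD<-PID->DO: ['PID']
print(entry_points([("MD", "PID"), ("PID", "DO")]))  # MD->PID->DO: ['MD']
print(entry_points([("MD", "DO"), ("PID", "DO")]))   # X->Y<-Z:     []
```

In the first model the PID itself is the single entry point, which is what makes it usable directly as the internal pointer in the sense of C9).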
BTW. One also has a very similar situation in the context of object
storage, where each object typically includes the data itself, a
variable amount of metadata, and a globally unique identifier. Here, too,
one can usually access metadata and data directly via the identifier.
The only problem is that global uniqueness and resolvability of the
identifier are usually not really guaranteed. Besides this, it probably
contains all the abstraction we need in this context.
Author: Keith Jeffery
Date: 06 Jan, 2015
Ulrich -
Just to note a common use case is:
Each DO instance has a PID
There are >1 MD instances that may be or must be associated with 1 DO instance (descriptive metadata, restrictive metadata, navigational metadata, schema metadata...)
Some MD instances may be or must be associated with >1 DO instance (e.g. a formal description of rights)
Each of these MD instances has a PID
Hence the cardinality in the relationship MD<->DO is n:m (i.e. either end may be 1 (or even 0) but it is not mandatory).
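
As a rough illustration of this n:m relationship, the following Python sketch keeps the association in both directions. The variable and function names are assumptions made for this example, and UUIDs are used as identifier values only because of the preference for UUIDs stated in this message:

```python
import uuid
from collections import defaultdict

md_of = defaultdict(set)  # DO PID -> PIDs of associated MD instances
do_of = defaultdict(set)  # MD PID -> PIDs of associated DO instances

def new_pid():
    """Identifier value without semantics (here: a random UUID)."""
    return str(uuid.uuid4())

def associate(md_pid, do_pid):
    """Record one MD<->DO association in both directions."""
    md_of[do_pid].add(md_pid)
    do_of[md_pid].add(do_pid)

do_a, do_b = new_pid(), new_pid()
descriptive_a = new_pid()   # descriptive metadata for one DO only
rights = new_pid()          # one formal rights description shared by both DOs

associate(descriptive_a, do_a)
associate(rights, do_a)
associate(rights, do_b)

print(len(md_of[do_a]))    # 2 MD instances for one DO
print(len(do_of[rights]))  # 2 DOs for one MD instance
```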
For me the PID instance is what it says, a persistent (or permanent) identifier instance, hopefully with a unique value (hence my preference for UUIDs because of the problems of managing uniqueness in registered PIDs). Ideally it has no semantics. It is effectively an attribute value of a DO or MD instance. It is not itself an entity or object since it has no attributes/properties. However, this does not accord with the approach of the European EPIC consortium, which is supported by RDA and uses a handle system where the prefix refers to the server that resolves the suffix (and hence introduces a dangerous binding to physical resources).
[Aside: we have the problem that PID was used earlier to mean process identifier, i.e. how an operating system uniquely identifies a running process.]
best
Keith
-----Original Message-----
From: Ulrich Schwardmann [mailto:***@***.***]
Sent: 06 January 2015 14:08
To: dgbroeder; Keith Jeffery; Larry Lannom; Data Fabric IG
Cc: TobiasWeigel; HermanStehouwer; Gary; Peter Wittenburg
Subject: Re: [rda-datafabric-ig] Re: [rda-datafabric-ig] draft version white paper form DFIG
Dear all,
I must confess I didn't want to raise a discussion about the different
types and levels of metadata here, even if, as Daan said, this is a
popular topic.
I suppose, my point was at some different level. I was looking at this
figure 5 from the processing point of view, and my underlying question
was, what is the simplest model for the operational objects, that is
needed and useful for processing there.
My proposition here is, that the simplest, still useful model for such
an operational object is a triple of the form MD<-PID->DO, and the next
less simple model with some more advantages is MD<->PID->DO. The most
complex model is of course the bidirectional fully connected triple PID
<-> MD <-> DO <-> PID, and there are a couple of others, which are in
some cases useful and in other cases not useful at all (like unconnected
triples).
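As a minimal sketch of the simplest model, MD<-PID->DO, the following Python fragment treats the PID record as the single entry point holding one reference to the digital object and one to its metadata. The class name `PidRecord`, the field names, the prefix, and the URLs are hypothetical and serve only as illustration; no existing PID system's API is implied.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PidRecord:
    """Hypothetical PID record: one identifier, two outgoing references."""
    pid: str     # the identifier itself
    do_ref: str  # PID -> DO: where the digital object lives
    md_ref: str  # PID -> MD: where the metadata lives


def resolve(records: Dict[str, PidRecord], pid: str) -> Tuple[str, str]:
    """Return (metadata location, data location) from a single lookup."""
    record = records[pid]
    return record.md_ref, record.do_ref


# Illustrative data only; prefix and URLs are made up.
records = {
    "example-prefix/0001": PidRecord(
        pid="example-prefix/0001",
        do_ref="https://repo.example.org/data/0001",
        md_ref="https://repo.example.org/metadata/0001",
    )
}

md_location, do_location = resolve(records, "example-prefix/0001")
```

In this sketch a process that holds only the PID can reach both the metadata and the data without any further entity, which is the property the triple model is meant to provide.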
My assumptions are:
A0) We are looking primarily for the simplest solution.
A1) We want to get as much coherence in the system(s) as possible.
A2) Processing on a DO will always need some sort of the MD (owner,
rights, data structure, time stamps, ...).
A3) The processing itself is a black box.
A4) For some reason (like global interoperability, persistent links) we
also need PIDs.
A5) PIDs are a special kind of pointers.
A6) PIDs at least point to a DO, and might have information types.
A7) Processing can also take place on PIDs (i.e. update).
Consequences:
C1) From A1) the operational objects should be addressable via pointers.
C2) From A1) and A3) the operational objects must always have the same
minimal internal structure.
C3) From A2) and A7) depending on the process we might need access to
the DO, the PID and/or the MD, which usually needs pointers to them.
C4) From A3), C1), C2) and C3) the pointer to the operational object
must give access to the whole triple MD, PID and DO.
C5) A solution for C4) could be to introduce a new entity, which is a
list of pointers to MD, PID, DO and to which the pointer to the
operational object actually points, but this would be rejected by A0).
C6) If the pointer to the operational object points to one element of
the triple, for a solution of C4) the internal structure must allow the
access to pointers to the others, if needed in separate steps.
C7) From C6) the triple must build a connected directed graph, and
actually must contain a tree, because a connection X->Y<-Z would not
fulfill C4).
C8) From A6) (PID->DO) and C7), and if we assume only two pointers
(according to A0)) we only have the possibilities MD<-PID->DO,
MD->PID->DO, PID->DO->MD and DO->MD->PID. The latter two are less useful
in my opinion, and others made similar remarks.
C9) if we additionally use A5) we can directly use the PID as pointer
internally, which would be in accordance to A0 again, and from C8) this
ends up with MD<-PID->DO (my favorite tree:).
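The argument in C6) to C8) can be checked mechanically. The sketch below is an illustration, not part of the original reasoning: it fixes the pointer PID->DO (A6), adds each of the other five possible directed pointers in turn, and keeps the structures in which some single element reaches the whole triple (C4). Under this strict reading, where PID->DO must itself be one of the two pointers, it yields MD<-PID->DO, MD->PID->DO and PID->DO->MD; the chain DO->MD->PID is not produced because it contains no direct PID->DO pointer.

```python
from itertools import permutations

NODES = ("MD", "PID", "DO")
REQUIRED = ("PID", "DO")  # A6: the PID points at least to the DO


def reachable(start, edges):
    """Return the set of nodes reachable from start by following pointers."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def entry_points(edges):
    """Nodes from which the whole triple can be reached (consequence C4)."""
    return [n for n in NODES if reachable(n, edges) == set(NODES)]


# The six possible directed pointers between the three elements.
all_edges = list(permutations(NODES, 2))

# Map each admissible second pointer to the entry point(s) it allows.
candidates = {}
for extra in all_edges:
    if extra == REQUIRED:
        continue
    roots = entry_points((REQUIRED, extra))
    if roots:
        candidates[extra] = roots

for extra, roots in candidates.items():
    print(f"PID->DO plus {extra[0]}->{extra[1]}: entry point(s) {roots}")
```

The rejected combinations are PID->DO with MD->DO (the X->Y<-Z shape excluded in C7) and PID->DO with DO->PID (which leaves MD unconnected).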
Of course with other assumptions one gets other consequences here, and
usually the different existing repository implementations will have
started with different assumptions. I nevertheless suppose, that the
assumptions above are rather generic, and that the other assumptions
might contain these assumptions, such that a specific solution might be
a more connected triple (like MD<->PID->DO for instance).
BTW. One also has a very similar situation in the context of object
storage, where each object typically includes the data itself, a
variable amount of metadata, and a globally unique identifier. Usually
here also one can directly access metadata and data via the identifier.
The only problem here is that global uniqueness and resolvability of the
identifier are usually not really guaranteed. Besides this it probably
contains all the abstraction we need in this context.
Author: Ulrich Schwardmann
Date: 06 Jan, 2015
Hi Keith,
a n:m relationship for MD<->DO is certainly not a simple construction
and therefore from my point of view not a possible model for the
operational object in such a system. Therefore I suggest breaking such a
complex object down into smaller pieces, such that the MDs you are
mentioning become DOs in their own right, which additionally refer to the DO
and to other related MDs, which are then also DOs. This gives simple units at
the operational level, but obviously one needs to map the more complex
structure on a higher level. A couple of the processing elements will be
designed to handle these interdependences and references between the
operational objects. So much for the abstract level.
On the more technical level your hint about a dangerous binding to
physical resources via global resolution is very important and I
completely agree, that such a system should be designed to be able to
avoid global resolution for each processing operation. But there are
alternatives. Especially in the case of Handle and EPIC all PIDs used
inside the local domain can be resolved directly from the underlying
local data base. The rest is a performance and reliability issue of the
underlying data base. But this is solvable, as the internal DBs of the
object storage implementations show. Only changes to the PID need a bit
more than DB access, but are also local. And only references to external
objects via external PIDs need global resolution. But all this would
still fit into the abstract picture of such triples.
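The local-versus-global distinction described above can be sketched as follows. This is an assumption-laden illustration, not the Handle or EPIC API: `make_resolver`, the prefixes, the local table, and the stand-in global resolver are all hypothetical. The only point shown is that a PID carrying the local prefix is answered from the local database, and only a foreign prefix triggers a global lookup.

```python
from typing import Callable, Dict, List, Tuple


def make_resolver(
    local_prefix: str,
    local_db: Dict[str, str],
    global_resolve: Callable[[str], str],
) -> Callable[[str], Tuple[str, str]]:
    """Build a resolver that leaves the local domain only for foreign prefixes."""

    def resolve(pid: str) -> Tuple[str, str]:
        prefix, _, suffix = pid.partition("/")
        if prefix == local_prefix:
            # Local PID: answered directly from the local database.
            return local_db[suffix], "local"
        # External PID: the only case that needs global resolution.
        return global_resolve(pid), "global"

    return resolve


global_calls: List[str] = []


def fake_global_resolve(pid: str) -> str:
    """Stand-in for a global resolver; records each call it receives."""
    global_calls.append(pid)
    return f"https://global-resolver.example.org/{pid}"


resolve = make_resolver(
    local_prefix="local-prefix",
    local_db={"abc": "file:///archive/abc.dat"},
    global_resolve=fake_global_resolve,
)

local_result = resolve("local-prefix/abc")
external_result = resolve("other-prefix/xyz")
```

After these two calls `global_calls` holds only the external PID, so processing that stays within the local domain never depends on the global service.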
And to bring UUIDs into play here: they could become globally resolvable
as suffix of a PID. One would have two resolving systems in this case,
one by the UUID, the other by minting into the PID system. We use this
configuration at GWDG in some projects.
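A rough sketch of the UUID-as-suffix idea, assuming a prefix/suffix PID syntax as in Handle: the same object is registered once under its bare UUID and once under a PID whose suffix is that UUID, giving two independent lookup paths. The prefix, the two in-memory indexes, and the `register` helper are hypothetical; how the GWDG configuration is actually implemented is not described in this thread.

```python
import uuid
from typing import Dict, Tuple

PREFIX = "example-prefix"  # hypothetical PID prefix

by_uuid: Dict[uuid.UUID, str] = {}  # resolution path 1: UUID -> location
by_pid: Dict[str, str] = {}         # resolution path 2: PID -> location


def mint_pid(object_uuid: uuid.UUID) -> str:
    """Form a PID whose suffix is the object's UUID."""
    return f"{PREFIX}/{object_uuid}"


def register(location: str) -> Tuple[uuid.UUID, str]:
    """Create a UUID for the object and enter it into both indexes."""
    object_uuid = uuid.uuid4()
    pid = mint_pid(object_uuid)
    by_uuid[object_uuid] = location
    by_pid[pid] = location
    return object_uuid, pid


object_uuid, pid = register("https://repo.example.org/data/0001")
```

Both `by_uuid[object_uuid]` and `by_pid[pid]` return the same location, and the UUID can be recovered from the PID by stripping the prefix.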
Author: Keith Jeffery
Date: 06 Jan, 2015
Ulrich -
Thanks for your response - very helpful.
We have to be careful about the distinction between MD and DO. I was using DO to mean (in the RDA context) research data instances such as tuples in a relation describing observations or experiments, or complex objects in an OODB. I was using MD to mean data (a DO if you like) being used as metadata to describe the DO.
The example I always use (since it is simple and universally understood) is the e-library catalog card system. To a researcher finding a book it is metadata. To a librarian counting books on geochemistry it is data. Thus the distinction between data and metadata is purpose or intent, not any fundamental property. Each instance has a PID. The question is whether the file as a whole (of instances of catalog cards) should have a PID and whether the metadata describing the e-library catalog card system (which could be a simple relational schema or something with many kinds of metadata concerning rights etc.) should have a PID. If so the question is how are they used and how are they related to the instance PIDs.
To me DO and MD (and for that matter software) are all data, as are descriptions of computing resources, data storage, networks, users... all the things needed for a VRE (Virtual Research Environment). The relationships (declared and/or discovered, and the re-use / re-purposing of those relationships) between object classes / entities or object instances / tuples are for me the interesting part.
Since these relationships are complex, and certainly in the real world they are n:m, we have to find a way to handle the n:m relationships in IT systems if they are to reflect accurately the real world. There is a tendency in IT to over-simplify (hierarchies instead of fully connected graphs, for example), which leads to terrible misrepresentation of the real world in simplified data structures and horrible hacks to overcome the limitations (think of aspect-oriented programming or representing multiple instances of complex objects in XML!). BTW IBM discovered the problem in the data environment with DBOMP for Boeing back in the 60s, leading to IMS (which had the equivalent of aspect-orientation) and finally of course the simplicity, clarity and formality (theoretical basis) of relational systems (which do handle fully connected graphs using the MVD decomposition technique of linking relations (expressing the relationship or dependency) between base relations).
So this is why I cannot accept your first assertion; I believe we must handle n:m in order to represent the real world of interest to RDA. Moreover we have to handle dynamic changes to n:m in cardinality and also in temporal duration (we are getting into provenance here) and possibly in probability (including estimated degrees of probability that the assertion in the tuple (or triple) is true). In an earlier email in this thread (20150105) I explained the problems with triples and described the IBM Hursley work in the 1980s on triples, later extended to quintuples and septuples (adding temporal and modal logic). The problem was performance and storage required, and the use of conventional relational base entities with linking entities was demonstrated to be more effective. However, the need to go beyond simple triples (more recently re-invented as RDF) was clear. There is no problem generating RDF triples from a formal relational (or object-oriented) environment, although the reverse may be undecidable.
So, your suggestion of breaking down so all objects are DOs is fine by me, as long as the relationships between them (at entity/object class level and at instance level) are described fully and appropriately. I do not accept that having the same PID (as some have suggested) expresses sufficiently a relationship. It is analogous to having the same primary key value in 2 relations which should, in fact, indicate that the associated attribute values in each case can be components of an equi-join. In the case of a MD DO and a Data instance DO this is clearly wrong. Even between two data instance DOs the relationship is likely to be more subtle - for example one set of attributes could have been collected by one experiment and the other set by observation - and this should be recorded in the relationship. Another problem (opportunity!) is that the relationships may not be known or defined at data input / collection time; in fact a great part of science is discovering the new relationships between entities (using instances of them) and these relationships may well be n:m, are certainly dynamic (new relationships discovered changing for example cardinality) and may have attached probability.
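One way to picture "all objects are DOs, with the relationships described fully" is a base relation of objects plus a linking relation that carries the role, the temporal validity and a probability for each relationship. The sketch below uses Python's built-in `sqlite3` module; the table names, column names and sample rows are hypothetical and are not taken from any RDA specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE digital_object (
    pid  TEXT PRIMARY KEY,
    kind TEXT NOT NULL           -- 'data', 'metadata', 'software', ...
);
CREATE TABLE relationship (      -- linking relation between base objects
    source_pid  TEXT NOT NULL REFERENCES digital_object(pid),
    target_pid  TEXT NOT NULL REFERENCES digital_object(pid),
    role        TEXT NOT NULL,   -- what the relationship means
    valid_from  TEXT NOT NULL,   -- temporal duration of the assertion
    valid_to    TEXT,            -- NULL means still valid
    probability REAL NOT NULL DEFAULT 1.0,
    PRIMARY KEY (source_pid, target_pid, role, valid_from)
);
""")

conn.executemany(
    "INSERT INTO digital_object VALUES (?, ?)",
    [("do-1", "data"), ("do-2", "data"), ("md-1", "metadata"), ("md-2", "metadata")],
)
# n:m: md-1 describes two data objects, and do-1 is described by two metadata objects.
conn.executemany(
    "INSERT INTO relationship VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("md-1", "do-1", "describes", "2015-01-01", None, 1.0),
        ("md-1", "do-2", "describes", "2015-01-01", None, 1.0),
        ("md-2", "do-1", "describes", "2015-01-06", None, 0.8),
    ],
)

describing_do1 = [
    row[0]
    for row in conn.execute(
        "SELECT source_pid FROM relationship "
        "WHERE target_pid = ? AND role = 'describes' ORDER BY source_pid",
        ("do-1",),
    )
]
described_by_md1 = [
    row[0]
    for row in conn.execute(
        "SELECT target_pid FROM relationship "
        "WHERE source_pid = ? AND role = 'describes' ORDER BY target_pid",
        ("md-1",),
    )
]
```

Because the relationship is a row of its own, a newly discovered relationship is an insert rather than a schema change, and the PIDs of the related objects remain distinct.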
Thanks again for the good discussion
Best
Keith