Community discussion of the definition of Digital Object

14 Aug 2014

For some time we've had alternate ideas about the nature and essential definition of a Digital Object (DO). The term is used widely (4,340,000 hits in Google) but with different implicit ideas in various communities. Of course this is far from the only term on which communities differ, but this one seems central to communities and workers dealing with "digital object identifiers" (3,840,000 hits), persistent identifiers (896,000), persistent identifiers and PIDs (14,400), and to groups like DataCite.

DOs in the archive context are discussed differently than in some other contexts. In Europeana, digital objects are discussed as computer files that can be played or viewed (JPEG, PDF, MP3, AVI, etc.). In general they don't have PIDs; they have URIs.

Given such differences it seems useful to have some discussion from the community on this.

Below are 2 of the definitions currently in the DFT term tool. 

1. A digital object is composed of structured sequence of bits/bytes. As an object it is named. The bit sequence realizing the object can be identified & accessed by a unique and persistent identifier or by use of referencing attributes describing its properties.

2. Digital Object is also called a Digital Entity defined as “machine-independent data structure consisting of one or more elements in digital form that can be parsed by different information systems; the structure helps to enable interoperability among diverse information systems in the Internet.”

On the other hand, there is a definition of a DO as requiring an ID, as in one of the models reviewed in the DFT analysis document; it goes back to the Kahn-Wilensky framework published in the mid-90s.  This requirement for a DO seems at odds with the use of the concept in other models.

Perhaps people in the WG would care to offer their opinions on the alternate definitions and ways to resolve differences to come up with a useful definition.  Perhaps alternate terms like "Registered Data Object" could be added, or is this making things too complex?

  • Author: Herman Stehouwer

    Date: 14 Aug, 2014

    Dear Gary, All,
    I am a tad surprised to see this discussion now.
    I was under the impression that you all managed to converge on a
    definition that included identity, attributes, and data in an
    infrastructure.
    To me, having a digital object (or whatever you want to call it) as the
    thin bit of the hourglass is key to the success of not just this group,
    but RDA.
    It is a clear addressable thing on which we can build.
    On the other hand, I am just an observer, don't mind me ;)
    Cheers,
    Herman

  • Author: Thomas Zastrow

    Date: 14 Aug, 2014

    Dear all,
    Unfortunately, a colloquial term was used for describing a very special
    thing within a limited, scientific scope. A much more specialized
    wording would avoid a lot of confusion here.
    But I'm coming from a community where people like to formalize stuff -
    so I activated the Math extension of the MediaWiki software.
    Unfortunately, it doesn't work together with the Semantic MediaWiki
    extension, but nevertheless, on normal wiki pages, you can now make use
    of LaTeX equations.
    So as a first attempt, I tried to define "Digital Object" in a
    formalized way - with all its advantages and disadvantages:
    http://smw-rda.esc.rzg.mpg.de/index.php/Test_Formalization
    It's not perfect and all of you are free to change it. But from my point
    of view, this is what a clear scientific definition should look like.
    Best,
    Tom

  • Author: Herman Stehouwer

    Date: 14 Aug, 2014

    Hi Thomas,
    I cannot edit the wiki so I will put some stuff here.
    I think a DO should have one or more metadata files associated with it
    (not just one).
    I am not sure it needs multiple bitstreams (identical bitstreams in
    multiple locations, sure).
    In any case it is a good start.
    Cheers,
    Herman

  • Author: Peter Wittenburg

    Date: 14 Aug, 2014

    Thanks Tom.
    Very interesting test :)
    In particular I like $DO := \langle P, M, \sum_{1}^{k} B \rangle$ !!!!
    Peter

  • Author: Thomas Zastrow

    Date: 14 Aug, 2014

    Hi Herman,
    There were too many spammers, so we decided that people need an account
    to edit pages in the wiki. You can request one here:
    http://smw-rda.esc.rzg.mpg.de/index.php/Special:RequestAccount
    (As admin I still have to confirm your account but I will do it asap :-)
    What you mentioned: I agree that there could be more than one metadata
    bitstream, I'll change that.
    Best,
    Tom

  • Author: Peter Wittenburg

    Date: 14 Aug, 2014

    Herman,
    Wrt the metadata objects associated with it we had a long discussion earlier.
    MD can be traded openly since you can harvest it etc. So there will be many instances of MD objects floating around, and they will be changed, transformed etc. So metadata are mutable objects.
    I guess it is necessary to indicate here that there is an "original metadata" object somewhere. In the case of different appearances (formats, schemas) the checksums are different, and repositories that, for example, attach checksums to the PIDs which identify digital objects may need to assign different PIDs.
    For similar reasons, in the case of the bit stream (encoding the content of the data object), which can be stored in different repositories, we need to say which one is the original deposit repository. The original repository may have certain rights/duties wrt the data, such as determining access permissions etc.
    Best
    Peter
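
Peter's point about checksums can be made concrete with a minimal Python sketch. It assumes SHA-256 as the checksum, and the record fields and the two serializations are hypothetical examples, not taken from the thread. It only shows that the same description in two formats yields two different bit sequences and therefore two different checksums, which is why a repository that binds a checksum to a PID might need one PID per format.

```python
import hashlib
import json


def checksum(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte sequence (assumed checksum type)."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical metadata description; the field names are illustrative only.
record = {"title": "Example dataset", "creator": "A. Researcher"}

# The same description serialized in two different formats.
as_json = json.dumps(record, sort_keys=True).encode("utf-8")
as_xml = (
    "<record><title>Example dataset</title>"
    "<creator>A. Researcher</creator></record>"
).encode("utf-8")

json_sum = checksum(as_json)
xml_sum = checksum(as_xml)

# Same intellectual content, different bits, hence different checksums.
checksums_differ = json_sum != xml_sum
```

Recomputing the checksum over an unchanged serialization gives the same digest, so the checksum can still be used to verify each individual copy.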

  • Author: Peter Wittenburg

    Date: 14 Aug, 2014

    As just indicated Tom.
    Indeed there will be many metadata sets.
    But when we talk about the core model we just need to require that there is a PID and a metadata description. We do not need to say in the core model that many versions can be created.
    Peter

  • Author: Peter Wittenburg

    Date: 14 Aug, 2014

    Thanks Herman.
    In principle I agree with you. Let's come back again to the basics.
    In all that we do, we will only talk about the registered domain of data objects, knowing that there is much more around. But for RDA this is not per se relevant, since we want to speak only about data objects which are referable, findable, accessible. RDA is about "improving sharing etc".
    So when we make clear at the beginning of our documents that we speak about the registered domain of data, we do not have to talk about the difference between registered vs. unregistered digital objects or so. In our open domain, Digital Objects have a PID assigned.
    Best
    Peter

  • Author: Herman Stehouwer

    Date: 14 Aug, 2014

    Peter,
    I am talking about the expected different metadata sets.
    E.g. core set (author, description, etc.), technical metadata (equipment
    used, settings for experiment, etc.), data type metadata (see DTR),
    regular metadata (like so many metadata types).
    I think we can usually expect at least two sets in different metadata
    formats.
    I don't think we can avoid the possibility of multiple sets, with a
    minimum of one set of metadata.
    Cheers,
    Herman

  • Author: Herman Stehouwer

    Date: 14 Aug, 2014

    Peter,
    It is probably good to keep in mind that I am talking about the formal
    definition as started by Thomas.
    The formal definition has to cover all correct cases and none of the
    incorrect cases.
    The written description can be as you say as far as I am concerned, for
    ease of explanation.
    Cheers,
    Herman

  • Author: Thomas Zastrow

    Date: 14 Aug, 2014

    Hi Peter & Herman,
    I don't see a problem here. I changed the formal definition so that
    there can be 1 to i sets of metadata ... which means a DO needs at
    least one metadata description, but there could be more. For example,
    one exhaustive CMDI file and Dublin Core as an extraction of that CMDI file.
    BTW, this is exactly how Fedora Commons does it. And when I'm talking
    about Digital Objects, I always have the definition of Fedora Commons in
    my mind:
    http://www.fedora-commons.org/documentation/3.0b1/userdocs/digitalobject...
    For RDA as a whole, this is already too specific. But in my eyes, it is one
    of the best definitions and one of the few things here which have already
    been implemented and in *practical* use for years ... ;-)
    Best,
    Tom
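
The constraint Tom describes (a PID, at least one metadata set, and the data itself) can be sketched as a small Python data structure. This is only an illustration under stated assumptions: the class names `DigitalObject` and `MetadataSet`, the example PID string, and the choice of a single bitstream are hypothetical and are not taken from the wiki formalization or from Fedora Commons.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MetadataSet:
    """One explicit metadata description (hypothetical structure)."""
    schema: str     # open vocabulary, e.g. "CMDI" or "Dublin Core"
    content: bytes  # the serialized metadata itself


@dataclass(frozen=True)
class DigitalObject:
    """Sketch of a DO: a PID, one or more metadata sets, and a bitstream."""
    pid: str
    metadata: Tuple[MetadataSet, ...]
    bitstream: bytes  # assumption: a single bitstream, for simplicity

    def __post_init__(self) -> None:
        # A registered DO must carry a PID.
        if not self.pid:
            raise ValueError("a digital object needs a PID")
        # "1 to i sets of metadata": at least one description is required.
        if len(self.metadata) < 1:
            raise ValueError("a digital object needs at least one metadata set")


# Usage: one exhaustive CMDI description plus a Dublin Core extraction.
example = DigitalObject(
    pid="hdl:0000/example",  # made-up identifier
    metadata=(
        MetadataSet("CMDI", b"<CMD/>"),
        MetadataSet("Dublin Core", b"<dc/>"),
    ),
    bitstream=b"raw data bytes",
)
```

Rejecting an empty metadata tuple at construction time mirrors the idea that the formal definition should cover all correct cases and none of the incorrect ones.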

  • Author: Peter Wittenburg

    Date: 14 Aug, 2014

    Thanks Thomas.
    Well, I once added the Fedora model to the model paper, as you may have seen, and I agree with you that they made a very good contribution to the development. But as you indicate, the Fedora model goes beyond what we are discussing as core. As far as I can see, though, the Fedora model is compliant with what we discuss as core, which is good.
    Such a Fedora object needs to have a PID, but then it can contain a lot of "streams", as they call them. This can then be seen as an aggregation in our basic terms.
    Let me add the following: if everyone used, for example, the Fedora Object model or something similar to describe their data organization, we would not have to discuss inefficiencies in dealing with data etc. The practice is different, as we know. My feeling is that this will only change when major software providers adopt an "object model" and implement it. It is the task of DFT/RDA to come to agreements, since why should software builders otherwise change their software?
    Peter

  • Author: Gary Berg-Cross

    Date: 14 Aug, 2014

    Thomas and Herman
    Thanks for engaging on this topic. I hope that others join in.
    The thing about formalization is that one should get agreement on the
    ingredients and relations of what you want to put into a formal language.
    In the case of digital objects there is some difference of opinion on what
    to call the entity that requires a PID: is it a digital object, or a special
    case of a more generalized notion of a digital object, one that acquires
    a PID as part of becoming a registered digital object?
    I note that both of you employ the concept of an infrastructure within
    which digital objects exist.
    In Thomas' logic formalism there seems to be the concept of metadata
    formats such as CMDI, which I take to be the Component Metadata
    Infrastructure (part of the CLARIN effort).
    http://www.balisage.net/Proceedings/vol7/html/Broeder01/BalisageVol7-Bro...
    might be a relevant discussion of this concept, which I don't think we had
    drilled down to.
    Thomas, will you add the relevant CMDI concept to the tool? The various
    metadata groups may be interested in this.
    Gary Berg-Cross, Ph.D.
    ***@***.***
    http://ontolog.cim3.net/cgi-bin/wiki.pl?GaryBergCross
    NSF INTEROP Project
    http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0955816
    SOCoP Executive Secretary
    Independent Consultant
    Potomac, MD
    240-426-0770

  • Author: Thomas Zastrow

    Date: 15 Aug, 2014

    Hi Gary,
    Yes, regarding the CMDI format you have the right link. But I meant it
    only as an example format; there are so many metadata formats out there,
    so I annotated it as an open vocabulary where CMDI is just one out of many.
    What I wanted to say is that this kind of metadata should be explicit,
    not only implicit like, for example, some kinds of system metadata. The
    timestamp of a file, for example, is not visible if you access that file
    via a web server. So what is meant here by "metadata" is something
    described explicitly and published as part of the DO, and freely
    harvestable, as Peter already mentioned.
    Best,
    Tom
