Data Foundation and Terminology (DFT) WG Recommendations (Endorsed)

  Data Foundation and Terminology (DFT) Working Group

Recommendation Title: Basic Vocabulary of Foundational Terminology Query Tool

Impact: Ensures researchers apply a common core data model when organising their data, thus making data accessible and re-usable

Recommendation package DOI: dx.doi.org/10.15497/06825049-8CA4-40BD-BCAF-DE9F0EA2FADF

Authors: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg

Contributors: Stan Ahalt, Gary Berg-Cross, Jan Brase, Daan Broeder, Keith Jeffery, Yin Chen, Antoine Isaac, Bob Kahn, Larry Lannom, Michael Lautenschlager, Reagan Moore, Alberto Michelini, Hans Pfeiffenberger, Raphael Ritz, Ulrich Schwardmann, Herbert Van de Sompel, Dieter van Uytvanck, Dave Vieglas, Peter Wittenburg, Yunqiang ZHU, Herman Stehouwer, Thomas Zastrow

 

Analysis and Synthesis of Data Management Terms

 

Based on a variety of data models and use cases presented by experts from different disciplines, and on 120 interviews and interactions with scientists and scientific departments, the DFT group has produced 8 inter-related reports and agreed on a number of simple definitions for digital data in a registered domain, based on a shared conceptualization. A summary of the core terms and the underlying model can be found in the first document, called "Core Terms and Model", which can also be used to cite the work.

 

Reference: https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html

 

Rights: Creative Commons CC0 1.0 Public Domain Waiver (CC0)

 


Please download the full Data Foundation and Terminology (DFT) WG Recommendations package below, available as individual sections.

RDA is running a community review on the recommendations and highly values your input on this first set of recommendations. 

 

Please use the comment function below for questions, suggestions and requests. Please note that you need to log in in order to comment.

  • Author: Keith Jeffery

    Date: 29 Jan, 2015

    I recognise the huge amount of work and thinking that has gone into this collection of documents. However...

     

    DFT2

    Comment1

    1.2.2 Access Methods. We can basically identify 2 methods to access a bitstream that encodes the information we are looking for:

    1. A user (or algorithm) has found a meaningful PID, resolves it to useful state information, amongst which is access path information, and requests the bitstream from a repository by making use of the access information. The advantage of this method is that the PID can be seen as an immediate handle, since it identifies the DO and will offer all kinds of useful information. The disadvantage is that humans cannot work with such numbers.

    2. A user (or algorithm) finds a metadata description of a useful digital object or collection, extracts the PID from the description, and then step 1 is carried out. The advantage is that the domain of metadata descriptions is understandable for humans. The disadvantage is that access to the bitstream is not given immediately; an intermediate step is needed.

    The problem with Method 1 is that it may bypass all metadata concerning legalities (rights etc).
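
    A minimal sketch of the two access methods, and of why Method 1 can skip the rights metadata, might look as follows. The dictionaries, the example PID, the title and the function names are all hypothetical stand-ins for a PID resolver, a metadata catalogue and a repository; none of them come from the DFT documents.

```python
from typing import Dict, Tuple

# Hypothetical stand-ins; real resolvers and repositories are network services.
PID_RECORDS: Dict[str, Dict[str, str]] = {
    "pid:example/0001": {"access_path": "repo/obj-0001"},
}
METADATA: Dict[str, Dict[str, str]] = {
    "Sea surface temperature 2014": {"pid": "pid:example/0001", "rights": "CC0"},
}
REPOSITORY: Dict[str, bytes] = {"repo/obj-0001": b"\x00\x01\x02"}


def access_by_pid(pid: str) -> bytes:
    """Method 1: resolve the PID to state information, then fetch the bitstream."""
    access_path = PID_RECORDS[pid]["access_path"]
    return REPOSITORY[access_path]


def access_by_metadata(title: str) -> Tuple[bytes, str]:
    """Method 2: find the metadata description, extract the PID, then apply Method 1."""
    description = METADATA[title]
    bitstream = access_by_pid(description["pid"])
    # Only this path has the rights statement in hand when the bits are returned.
    return bitstream, description["rights"]
```

    In this sketch access_by_pid never touches the metadata description, so any rights statement stored there is not seen unless the PID record itself carries or links to it.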

    Comment2

    However, to prevent recursion, metadata objects do not have metadata descriptions, which makes them special digital objects.

    Why this restriction?  There is no reason why metadata objects should not have metadata descriptions – in fact XML schema describing DC metadata is a very widely used example.

     

    DFT3

    Comment 1

    2.2.5 Metadata A. Definition Metadata contains descriptive, contextual and provenance assertions about the properties of a DO. Note: Such metadata will make the DO discoverable, accessible and usable/interpretable. Note: To make metadata referable it needs to be associated with a PID and thus is a DO. Note: Metadata minimally needs to contain the PID.

    Is at variance with

    2.2.3 PID Record A. Definition A PID record contains a set of attributes stored with a PID describing DO properties.

    The concept of a PID record is redundant since metadata records have (a) their own PID and (b) a PID representing the DO with which the metadata record has a relationship.

    Comment2

    2.2.5 Metadata A. Definition Metadata contains descriptive, contextual and provenance assertions about the properties of a DO. Note: Such metadata will make the DO discoverable, accessible and usable/interpretable. Note: To make metadata referable it needs to be associated with a PID and thus is a DO. Note: Metadata minimally needs to contain the PID.

    Metadata can contain much more than the kinds of information mentioned here. Refer to RDA Metadata Groups for details.

    Comment3

    2.2.10 Bitstream A. Definition A bitstream is a sequence of bits that encodes a specific informational content, either stored on some media or being transferred under control of protocols.

    The bitstream content is data, not information (if we follow the common definition that information is structured data in context). Also, the use of bitstream for a finite-length data representation of a DO is at variance with its common usage, meaning a non-finite-length data stream (as in time-series recording), which implies specialised processing (windowing/segmenting etc). This is mentioned in DFT4 Chapter 3.

    Comment4

    2.2.11 State Information A. Definition State information is “metadata” information that describes those current properties of the DO that are relevant for proper management and access.

    In Computer Science, state is defined as the value of all memory locations (attributes) associated with an identified object.  The proposal in the document is for a limited collection of attributes and clearly these do not define uniquely the state.

    Comment5

    2.2.14 Checksum A. Definition A checksum is a type of metadata and an important property of a digital object to allow verifying identity and integrity.

    In the elaboration it says “A checksum is a randomly generated piece of data calculated by some algorithm that is used to verify the fixity or stability of a digital object”.  This is not correct.  It verifies….. the identifier of a DO.
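
    To illustrate why "randomly generated" is misleading: a checksum is computed deterministically from the bitstream, so the same bits always give the same value and any change to the bits gives a different one. The sketch below assumes SHA-256 via Python's hashlib; the DFT documents do not prescribe an algorithm, and the function names are hypothetical.

```python
import hashlib


def sha256_checksum(bitstream: bytes) -> str:
    """Return the SHA-256 digest of a bitstream as a hex string."""
    return hashlib.sha256(bitstream).hexdigest()


def verify_fixity(bitstream: bytes, recorded_checksum: str) -> bool:
    """Check that a bitstream still matches the checksum recorded for it."""
    return sha256_checksum(bitstream) == recorded_checksum


original = b"example digital object"
recorded = sha256_checksum(original)          # stored, e.g., with the metadata

unchanged_ok = verify_fixity(original, recorded)              # same bits
altered_ok = verify_fixity(original + b"\x00", recorded)      # one byte appended
```

    Here unchanged_ok is true and altered_ok is false: nothing random is involved, which is what makes the value usable for checking fixity.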

  • Author: Gary Berg-Cross

    Date: 05 Mar, 2015

    Keith

     

    Here are some responses to some of your questions and comments:

    For DFT document 2, I largely agree with Comment1, which considers both human and automated processing of PID info. I had similar questions, especially in regard to Reagan's Use Case.
     

    Comment2

    > There is no reason why metadata objects should not have metadata descriptions – in fact XML schema describing DC metadata is a very widely used example.

    Seems reasonable, so why the restriction, as asked? There is a natural limit to recursion in that people stop documenting when they choose, not because we prohibit it.

     

    DFT 3

    Comment 1

     >PID Record A. Definition A PID record contains a set of attributes stored with a PID describing DO properties.

     

    >The concept of a PID record is redundant since metadata records have (a) their own PID (b) a PID representing the DO with which the metadata record has a relationship

     

    I think the disconnect here is that the PID record might play a functional role for PIDs and is therefore broken out from the overall concept of metadata. One might think of some other parts of overall metadata as also distinct for particular purposes.

     

    Comment 2

    >2.2.5 Metadata A. Definition Metadata contains descriptive, contextual and provenance assertions about the properties of a DO...

    >Metadata can contain much more than the kinds of information mentioned here.  Refer to RDA Metadata Groups for details.

    Sure and this is for the various MD groups to elaborate on.  I would assume that our definition will evolve as they clarify things.

     

    Comment 3

    2.2.10 Bitstream A. Definition 

    >The bitstream content is data, not information (if we follow the common definition that information is  structured data in context).  

    We say "encodes a specific informational content", meaning there is information content conveyed in the bitstream. This is consistent with other data-information distinctions, such as in OAIS.

    >Also the use of bitstream for a finite length data representation of a DO is at variance with its common usage meaning a non-finite length data stream (as in time-series recording) which implies specialised processing (windowing/segmenting etc).  This is mentioned in DFT4 Chapter 3.

    I'm at a loss to understand this. Referencing non-finite things seems beyond the usual limits of data we find in IT. A stream may be indefinitely long and growing, as in time-series, but it is not non-finite to my way of thinking.

     

    Comment 4

    >2.2.11 State Information A. Definition State information is “metadata” information that describes those current properties of the DO that are relevant for proper management and access.

    >In Computer Science, state is defined as the value of all memory locations (attributes) associated with an identified object. 

    Well, we are talking about the state of the data as described by the metadata, so perhaps we can call it Data State information. I am in favor of using this type of term expansion to be clearer about things, and I disavow trying to take over terms used elsewhere, or what some have called "loaded terms."

     

    > The proposal in the document is for a limited collection of attributes and clearly these do not define uniquely the state.

    Perhaps we are not communicating the idea here clearly enough. I think the intention is that we would include enough of the attributes to identify it. Perhaps this is a sufficient definition in that it should work for most people, most of the time. It may be a judgement call for the people adding the attribute MD, but we can perhaps set some guidelines here.

     

    Comment 5

    >2.2.14 Checksum A. Definition A checksum is a type of metadata and an important property of a digital object to allow verifying identity and integrity.

    >In the elaboration it says “A checksum is a randomly generated piece of data calculated by some algorithm that is used to verify the fixity or stability of a digital object”.  This is not correct.  It verifies….. the identifier of a DO. 

     

    Hmm. Well, I can see both ideas, so maybe we combine them. The intent is to provide some stability for the DO, but the mechanism for this is ensuring that our PID is still good and will get us to the DO.

     

  • Author: John Ambrosiano

    Date: 09 Feb, 2015

    Please forgive me if this is the wrong thread.

    I'm trying to find out if the DFT WG is the right group to ask about the availability of a general ontology for scientific terms and topics. Does such a thing exist outside of specific communities like biomedicine? If not, will this be within the scope of the DFT WG?

  • Author: Michael Martin

    Date: 26 Feb, 2015

    Hi Ambrosiano

    If you find one please let me know, I am looking for the same thing.  So far no luck.

    Thanks, Mike

     

  • Author: Gary Berg-Cross

    Date: 05 Mar, 2015

    John

    There are lots of parts to scientific terms by discipline (Hydrology & geology have different ones for example), measurement, research process and the like.

    You can, for example, get some sense of work by looking at Semanticscience Integrated Ontology (SIO).

    SIO with just 5 key entities and some top-level relations (is-related-to) provides a simple, integrated ontology of types and relations for rich description of objects, processes and their attributes.
     
    Implemented as an OWL-DL ontology (SRIQ(D) expressivity), SIO comprises 1396 classes, 203 object properties, 1 datatype property, 8 annotation properties, 7272 axioms, 1747 subClassOf axioms, 43 equivalentClass axioms, and 209 subPropertyOf axioms.
     
    SIO does include a small model of the research and experiment process.
     
    You can start looking at SIO at: http://sio.semanticscience.org

  • Author: Stephen Richard

    Date: 09 Feb, 2015

    In DFT3 Snapshot of Core Terms, the following definitions appear:

    6. A digital aggregation is a bundle of digital entities.

    7. A digital collection is an aggregation which contains DOs and DEs. The collection is identified by its PID and described by its metadata.

     

    in the model on p.12 of DFT2 Analysis and Synthesis

    d-collection IsA d-aggregation, and d-aggregation has 'aggregates' associations to d-entity and to d-object.

    The diagram implies to me that d-collection and d-aggregation can both aggregate d-entity and d-object. 

    The definition of d-object (a thing represented by a bitstream, with some other mandatory properties) makes it a restriction of d-entity (anything that can be represented by a bitstream), implying to me that it should have an 'IsA' relationship to d-entity. The diagram is consistent with these definitions.

    It seems that the definition of digital collection should also state explicitly that a digital collection is a kind of digital object (which is implicit in saying that it has a PID and metadata...)
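
    One way to read the relationships discussed above is as a small class hierarchy. The following is only a sketch of that reading, not the DFT model itself: the class and attribute names are hypothetical, and giving a collection its own (possibly empty) bitstream is an assumption made so that it can count as a digital object.

```python
from typing import Dict, List


class DigitalEntity:
    """Anything that can be represented by a bitstream."""

    def __init__(self, bitstream: bytes) -> None:
        self.bitstream = bitstream


class DigitalObject(DigitalEntity):
    """A digital entity that also carries a PID and metadata (d-object IsA d-entity)."""

    def __init__(self, bitstream: bytes, pid: str, metadata: Dict[str, str]) -> None:
        DigitalEntity.__init__(self, bitstream)
        self.pid = pid
        self.metadata = metadata


class DigitalAggregation:
    """A bundle of digital entities; digital objects qualify through IsA."""

    def __init__(self, members: List[DigitalEntity]) -> None:
        self.members = list(members)


class DigitalCollection(DigitalAggregation, DigitalObject):
    """An aggregation that is itself a digital object (the suggested explicit IsA)."""

    def __init__(self, members: List[DigitalEntity], pid: str,
                 metadata: Dict[str, str], bitstream: bytes = b"") -> None:
        DigitalAggregation.__init__(self, members)
        DigitalObject.__init__(self, bitstream, pid, metadata)


raw = DigitalEntity(b"\x01")
obj = DigitalObject(b"\x02", "pid:example/0002", {"title": "sample"})
coll = DigitalCollection([raw, obj], "pid:example/0003", {"title": "sample collection"})
```

    With this reading, coll is both an aggregation and a digital object, and it may contain plain entities as well as objects, which matches the diagram as described.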

  • Author: Gary Berg-Cross

    Date: 05 Mar, 2015

    Steve, I agree that the snapshot definitions on aggregation do not address some issues, such as the level of aggregation and how a collection is a special type of aggregation (since d-collection and d-aggregation can both aggregate d-entity and d-object).

    The snapshot definitions do try to make the point that collections and aggregations need identifiers, but of course if they are digital objects too, that comes by implication. They also try to make the point, the other way around, that the components of a collection need an ID.

  • Author: Stephen Richard

    Date: 09 Feb, 2015

    What is the basis for distinguishing 'isTypeOf' from 'isa' in the model on page 12 of DFT2?

  • Author: Gary Berg-Cross

    Date: 05 Mar, 2015

    One may argue that "IsTypeOf" is a more specific relation.  

     

    It distinguishes between two ISAs. One says what type of thing something is (e.g. "Human is a species") and the other says that something is an instance of a type ("Steve is a human").

    We don't want to say that Steve is a species by inference so we distinguish the TypeOf relations from the (Isa) instance relation.
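
    A small sketch of why keeping the two relations apart blocks the unwanted inference: instance membership is allowed to propagate along subclass links, but never along the type-of link. The relation names, the helper function and the extra "Mammal" fact are hypothetical and added only for contrast.

```python
from typing import Set, Tuple

Pairs = Set[Tuple[str, str]]

TYPE_OF: Pairs = {("Human", "Species")}     # "Human is a species": classifies a class
INSTANCE_OF: Pairs = {("Steve", "Human")}   # "Steve is a human": individual to class
SUBCLASS_OF: Pairs = {("Human", "Mammal")}  # hypothetical extra fact, for contrast


def classes_of(individual: str) -> Set[str]:
    """Classes an individual belongs to, following subclass links only."""
    found = {cls for (ind, cls) in INSTANCE_OF if ind == individual}
    frontier = list(found)
    while frontier:
        current = frontier.pop()
        for sub, sup in SUBCLASS_OF:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


steve_classes = classes_of("Steve")
```

    Because TYPE_OF is never consulted when collecting an individual's classes, steve_classes contains "Human" and "Mammal" but not "Species".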

  • Author: Stephen Richard

    Date: 09 Feb, 2015

    The model on p.12 of DFT2 does not capture the relationship of state information to a d-object.

  • Author: Peter Doorn

    Date: 11 Feb, 2015

    The Science Europe Working Group on Research Data (http://www.scienceeurope.org/policy/working-groups/Research-Data) is also working on a list of terms. A small task group, chaired by Rūta Petrauskaitė <r.petrauskaite@hmf.vdu.lt> and myself, has compiled a preliminary list of data terms with (draft) definitions. Work is still in progress. The scope of the SE list is different, but there are similarities with what the RDA WG is doing. Perhaps we can explore the possibilities to combine our efforts at the upcoming RDA P5 in San Diego?

    The SE list is more policy-oriented and first of all intended for internal use, but we will make the list available on the Science Europe website. It would be very useful if funding organisations throughout Europe refer to the same terms and use the same definitions. A common term list of RDA and SE would even have more authority.

  • Author: Gary Berg-Cross

    Date: 05 Mar, 2015

    Peter,

     

    I hope that you will come to the P5 DFT IG session on Monday afternoon (1400, I believe) and talk about the common interest and help develop plans for joint work going forward.

  • Author: Michael Martin

    Date: 26 Feb, 2015

    Hi everyone

    I am uncomfortable with "13. A digital metadata repository is a type of digital repository that is able to store, manage and curate metadata."   I think it is clearer to associate repositories (like archival storage in OAIS) with data and registries (data management in OAIS) with metadata.  There may be metadata in the repository but metadata in a searchable form should be in a registry or data management system.  The services to access data from a repository are different than the services to access metadata from a registry.  I would much rather see "metadata registry", which appears in some of your documents.

    I am also uncomfortable with term 6, digital aggregation.  I think OAIS calls this an information object or content information.  It might also be a digital product or digital resource.  Also, digital aggregation seems more of a verb than a noun.

    I'm not sure I understand what a digital entity is.  Just data bits with no metadata?

    Thanks, Mike Martin

    NASA Consultant

     

  • Author: Gary Berg-Cross

    Date: 05 Mar, 2015

    Mike

    One issue here might be the different type of services one might expect from a registry versus a repository.

     

    It is easy to think that you have a registry as a front end (for data and/or metadata) which provides some basic info about what is registered, but that additional repository processes are needed for each across the lifecycle.

    I tend to think of this subsequent data management and curation as part of a repository process.

    Some make the argument that metadata is a role that data can play, and in this view metadata, like what we consider data, needs repository care.
