Validation of datacite

24 Jan 2018

Dear all,

Over the last week I implemented the importer/exporter tool for the KIT Data Manager platform according to our recommendations. In the meantime, David found out that the SWORD people also refer to our recommendations, except that they drop the requirement for datacite.xml. After a short discussion in their profile working document [1], it seems that the main reason is the necessity of a DOI. The machine-recognizable codes described under ‘Guidance for handling missing mandatory property values’ in [2] can apparently be used for all mandatory properties except the identifier, because the schema defines a fixed regular expression, 10\..+/.+, for this element. Thus, datacite documents that use a placeholder for the identifier won’t validate against the schema.
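
The mismatch can be reproduced in a few lines of Python. This is only a minimal sketch: the pattern is the one quoted above, the list of codes is recalled from Appendix 3 of the DataCite documentation and should be checked against [2], and the function name and the sample DOI are made up for illustration.

```python
import re

# Pattern quoted from the schema. XSD patterns have to match the whole
# value, so re.fullmatch is the closest Python equivalent.
DOI_PATTERN = re.compile(r"10\..+/.+")

# Codes for unknown values, as recalled from Appendix 3 of the DataCite
# documentation (assumption: verify against the document itself).
UNKNOWN_CODES = [
    "(:unac)", "(:unal)", "(:unap)", "(:unas)", "(:unav)",
    "(:unkn)", "(:none)", "(:null)", "(:tba)", "(:etal)",
]


def matches_schema_pattern(value: str) -> bool:
    """Return True if the value passes the identifier pattern."""
    return DOI_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # A made-up DOI passes, every placeholder code fails.
    print("10.5072/example-dataset", matches_schema_pattern("10.5072/example-dataset"))
    for code in UNKNOWN_CODES:
        print(code, matches_schema_pattern(code))
```

None of the codes starts with "10.", so none of them can satisfy the pattern, no matter which one is chosen.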

The question is now how we proceed. I see two options:

1. We ignore the fact that datacite documents without a DOI can’t be validated and add a comment to our recommendations saying that it is allowed (but of course not advised) to include ‘invalid’ datacite documents; in that case, there should be an alternateIdentifier of type INTERNAL that the consumer of the bag can use.
2. We switch to another standard for providing our minimal metadata set, e.g. DataCrate [3].

What is your opinion on that? Do you see a third option?

Regards,

Thomas

[1] https://docs.google.com/document/d/1eQL1Guv0ihfxPJIIceLJk4l22cRpGbXTsNVS...

[2] https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_...

[3] https://github.com/UTS-eResearch/datacrate/blob/develop/spec/0.1/data_cr...

--

Karlsruhe Institute of Technology (KIT)

Steinbuch Centre for Computing (SCC)

Dipl. Ing. Thomas Jejkal

Hermann-von-Helmholtz-Platz 1

76344 Eggenstein-Leopoldshafen, Germany

Phone: +49 721 608-24042

E-mail: thomas.jejkal@kit.edu

Web: http://www.scc.kit.edu

ORCID: http://orcid.org/0000-0003-2804-688X

Registered office: Kaiserstraße 12, 76133 Karlsruhe, Germany

KIT – The Research University in the Helmholtz Association

  • Author: Rolf Krahl

    Date: 24 Jan, 2018

    Hi Thomas,

    On Wednesday, 24 January 2018, 08:10:41, TJejkal wrote:
    >
    > David figured out that the SWORD people are also referring to our
    > recommendations with the difference that they skip the requirement
    > of datacite.xml. After a short discussion in their profile working
    > document [1] it seems that the main reason is the necessity of a
    > DOI. Obviously, using machine-recognizable codes as stated under
    > ‘Guidance for handling missing mandatory property values’ in [2]
    > applies to all mandatory properties but the identifier as the schema
    > defined a fixed regular expression with the value 10\..+/.+ for this
    > element. Thus, datacite documents using placeholders for the
    > identifier won’t validate against the schema.

    I would say that the authoritative source for the standard is the
    written document. The XML Schema Definition file is (or should be)
    merely an implementation of this standard. The wording in the
    standard document is clear, see p. 10:

    | 2.3 DataCite Properties
    |
    | Table 3 provides a detailed description of the mandatory properties,
    | which must be supplied with any initial metadata submission to
    | DataCite, together with their sub‐properties. If one of the
    | required properties is unavailable, please use one of the standard
    | (machine‐recognizable) codes listed in Appendix 3, Table 11.

    I.e. the standard values for unknown information in Appendix 3, Table
    11 may be used for the mandatory properties listed in Table 3, which
    include the Identifier property. It follows that the regular
    expression in the XML Schema Definition file is a bug.

    > The question is now how we proceed. I see two options:
    >
    > We ignore the fact that datacite documents without DOI can’t be
    > validated and add a comment to our recommendations saying it is
    > allowed (but of course not advised) to add ‘invalid’ datacite
    > documents, but in that case, there should be an alternateIdentifier
    > of type INTERNAL that can be used by the consumer of the bag.

    I would favor:

    1. For the time being, we keep the requirement that packages must
    contain a datacite.xml file and also that the content of this file
    must be valid according to the DataCite standard.

    2. We add a note that if the digital object in the package does
    not have a DOI or if the DOI is not known, one of the standard
    values for unknown information (Appendix 3 of DataCite) MUST be
    used in the Identifier property. (I.e. we state explicitly that
    according to our interpretation, DataCite does not imply the
    necessity of a DOI.)

    We add that an AlternateIdentifier SHOULD be used if a DOI is not
    provided. (But I would not require any particular type. DataCite
    requires the alternateIdentifierType sub-property to be used with
    AlternateIdentifier, but specifies the allowed value only as free
    text. We should not go further than that here.)

    3. We add a note that if one of the standard values for unknown
    information is used for the Identifier property, the datacite.xml
    will not validate against the DataCite XML Schema Definition. We add
    that we consider this a bug in the XSD and that this fact does not
    imply invalidity of the provided metadata. We add that the
    receiver of a package MUST NOT reject the package based on a failed
    XML Schema validation, if this failure is only due to an unknown
    DOI.

    4. We contact the DataCite people and ask them to fix the bug in their
    XML Schema Definition.
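
The receiver-side rule in point 3 could look roughly like the following Python sketch. It is an illustration under assumptions, not an agreed implementation: the function names and the `other_schema_errors` parameter are hypothetical, the namespace is the one used by DataCite kernel 4 documents, and the list of codes is recalled from Appendix 3 and should be checked against the DataCite documentation.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of DataCite kernel 4 documents.
NS = {"dc": "http://datacite.org/schema/kernel-4"}
DOI_PATTERN = re.compile(r"10\..+/.+")
# Codes for unknown values (assumption: recalled from Appendix 3).
UNKNOWN_CODES = {
    "(:unac)", "(:unal)", "(:unap)", "(:unas)", "(:unav)",
    "(:unkn)", "(:none)", "(:null)", "(:tba)", "(:etal)",
}


def classify_identifier(datacite_xml: str) -> str:
    """Return 'doi', 'unknown-code' or 'invalid' for the identifier element."""
    root = ET.fromstring(datacite_xml)
    element = root.find("dc:identifier", NS)
    value = (element.text or "").strip() if element is not None else ""
    if DOI_PATTERN.fullmatch(value):
        return "doi"
    if value in UNKNOWN_CODES:
        return "unknown-code"
    return "invalid"


def may_reject(datacite_xml: str, other_schema_errors: int) -> bool:
    """Hypothetical receiver policy: a failed validation that is only due
    to an unknown DOI is not a reason to reject the package."""
    if other_schema_errors > 0:
        return True
    return classify_identifier(datacite_xml) == "invalid"


SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">(:unav)</identifier>
</resource>"""

if __name__ == "__main__":
    print(classify_identifier(SAMPLE), may_reject(SAMPLE, 0))
```

How a receiver counts the schema errors that are unrelated to the identifier depends on the validator in use and is deliberately left out here.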

    --
    Rolf Krahl
    Helmholtz-Zentrum Berlin für Materialien und Energie (HZB)
    Albert-Einstein-Str. 15, 12489 Berlin
    Tel.: +49 30 8062 12122

  • Author: Thomas Jejkal

    Date: 24 Jan, 2018

    Hi Rolf,

    thanks for your thoughts. Of course, we can interpret the XSD as a (too strict) implementation of the standard; maybe the RegEx is also an artifact from older versions. I think it is definitely a good idea to file an issue in the GitHub repository of DataCite. I will do this in a second. However, as we do not 'force' anyone to validate datacite.xml, this should be no showstopper. It's just not optimal. Regarding the alternate identifier, we can of course also skip recommending any identifier type and let the bag consumer assign an identifier if the bagged object has none.

    Regards,
    Thomas

  • Author: David Wilcox

    Date: 29 Jan, 2018

    Hi Thomas,

    Thanks for volunteering to file an issue in the DataCite GitHub repository. Please report back once you receive a reply so we can determine how to proceed. I agree with Rolf that the written DataCite spec is the authoritative source for the standard and that we do not require validation against the XSD. However, as you point out, it is not optimal that a DataCite record without a DOI can be valid according to the spec and still fail to validate against the XSD.

    It would be useful to schedule another group call so we can review our current status and make any necessary decisions on how to proceed before the next plenary in March. I’ll work with Thomas to put something on the calendar and send an invitation to the group.

    Regards,

    David

    --
    David Wilcox
    Fedora Product Manager
    DuraSpace
    dwilcox@duraspace.org

  • Author: Rolf Krahl

    Date: 31 Jan, 2018

    Dear all,

    On Wednesday, 24 January 2018, 10:49:57, rolf.krahl wrote:
    >
    > I would say, the authoritative source for the standard is the written
    > document. The XML Schema Definition file is (or should be) merely an
    > implementation of this standard. The formulation in the standard
    > document is clear, cite p. 10:

    Thomas has opened an issue in the DataCite GitHub project. My
    conclusion from the discussion with DataCite people is that it
    revealed a fundamental misconception about what DataCite is: I was
    always using DataCite as a metadata standard. And I know of many
    other people in the community that do as well. From the discussion on
    GitHub it becomes apparent that DataCite people themselves consider it
    a DOI registration service and that what we call the "DataCite
    metadata standard" is only the input format for this particular
    service from their point of view. What I was reading as a standard
    document is in fact intended to be a manual for the input of the
    DataCite service.

    Obviously, for a DOI registration service, they insist that a DOI must
    be provided in the input. I would never argue against that.

    It follows that I must amend my proposal on how to deal with this
    issue.

    > I would favor:
    >
    > 1. For the time being, we keep the requirement that packages must
    > contain a datacite.xml file and also that the content of this file
    > must be valid according to the DataCite standard.
    >
    > 2. We add a note that if the digital object in the package does
    > not have a DOI or if the DOI is not known, one of the standard
    > values for unknown information (Appendix 3 of DataCite) MUST be
    > used in the Identifier property. (I.e. we state explicitly that
    > according to our interpretation, DataCite does not imply the
    > necessity of a DOI.)
    >
    > We add that an AlternateIdentifier SHOULD be used if a DOI is not
    > provided. (But I would not require any particular type. DataCite
    > requires the alternateIdentifierType sub-property to be used with
    > AlternateIdentifier, but specifies the allowed value only as free
    > text. We should not go further than that here.)
    >
    > 3. We add a note that if one of the standard values for unknown
    > information is used for the Identifier property, the datacite.xml
    > will not validate against the DataCite XML Schema Definition. We add
    > that we consider this a bug in the XSD and that this fact does not
    > imply invalidity of the provided metadata. We add that the
    > receiver of a package MUST NOT reject the package based on a failed
    > XML Schema validation, if this failure is only due to an unknown
    > DOI.
    >
    > 4. We contact the DataCite people and ask them to fix the bug in their
    > XML Schema Definition.

    We could do the following instead:

    1. We add an explicit note to the specification that the content of
    the package does not need to have a DOI.

    2. We keep the requirement that packages must contain a datacite.xml
    file. The content of this file must be a /variant/ of the DataCite
    metadata standard.

    3. We add a small section on "Minimal metadata" that explains the
    format of the datacite.xml file in detail and which modifications
    to DataCite we use in our specification. The only modification
    that I see for the moment, is that we explicitly allow the standard
    values for unknown information to be used for the Identifier
    property if no DOI is available. We add that an
    AlternateIdentifier SHOULD be used if a DOI is not provided.

    4. As an option, we may provide a modified XML Schema Definition for
    our variant of the DataCite metadata standard.
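
To make the proposed variant more concrete, the following Python sketch shows what a relaxed identifier check might look like. It is an illustration only: the relaxed pattern is one possible way to write the modification, the function name and the message texts are hypothetical, and the list of codes is recalled from Appendix 3 and should be checked against the DataCite documentation.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of DataCite kernel 4 documents.
NS = {"dc": "http://datacite.org/schema/kernel-4"}
# Codes for unknown values (assumption: recalled from Appendix 3).
UNKNOWN_CODES = [
    "(:unac)", "(:unal)", "(:unap)", "(:unas)", "(:unav)",
    "(:unkn)", "(:none)", "(:null)", "(:tba)", "(:etal)",
]
# Relaxed pattern: either the original DOI pattern or one of the codes.
RELAXED_PATTERN = re.compile(
    r"10\..+/.+|" + "|".join(re.escape(code) for code in UNKNOWN_CODES)
)


def check_variant(datacite_xml: str) -> list:
    """Return a list of messages; an empty list means nothing to report."""
    root = ET.fromstring(datacite_xml)
    identifier = root.find("dc:identifier", NS)
    value = (identifier.text or "").strip() if identifier is not None else ""
    messages = []
    if not RELAXED_PATTERN.fullmatch(value):
        messages.append("ERROR: identifier is neither a DOI nor a code for unknown values")
    elif value in UNKNOWN_CODES:
        # No DOI available: an AlternateIdentifier SHOULD be present.
        alternates = root.findall(
            "dc:alternateIdentifiers/dc:alternateIdentifier", NS
        )
        if not alternates:
            messages.append("WARNING: no DOI given, an alternateIdentifier SHOULD be provided")
    return messages


SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">(:unav)</identifier>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="local">dataset-0001</alternateIdentifier>
  </alternateIdentifiers>
</resource>"""

if __name__ == "__main__":
    print(check_variant(SAMPLE))
```

The missing AlternateIdentifier is reported as a warning rather than an error because the proposal uses SHOULD, not MUST. In a modified XSD, the same alternation could presumably replace the current pattern of the identifier element.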

    Let me add two more notes: it is common practice in the community to
    consider DataCite a metadata standard. Thus, I still believe it is
    acceptable to keep the term "DataCite metadata standard" in our
    specification even though we just learned that the DataCite people
    themselves might not consider it a standard. Most people who use
    DataCite as a standard don't take it too strictly either and often
    apply their own custom modifications. Compared to what others do, our
    modification is very minor. Therefore I still believe it is
    acceptable to keep the name "datacite.xml" here.

    Best regards,
    Rolf

