Dear all,
during the last week I’ve implemented the importer/exporter tool according to our recommendations for the KIT Data Manager platform. In the meantime, David figured out that the SWORD people are also referring to our recommendations, with the difference that they skip the requirement of datacite.xml. After a short discussion in their profile working document [1], it seems that the main reason is the necessity of a DOI. Using machine-recognizable codes, as stated under ‘Guidance for handling missing mandatory property values’ in [2], works for all mandatory properties except the identifier, because the schema defines a fixed regular expression with the value 10\..+/.+ for this element. Thus, datacite documents using placeholders for the identifier won’t validate against the schema.
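To illustrate the mismatch, here is a minimal Python sketch that applies the pattern quoted above to one DOI-shaped string and to a few placeholder codes. The DOI `10.1234/example` is a made-up value, and the listed codes are assumed to be among the "unknown information" values in Appendix 3 of [2]; the actual XSD validation is not reproduced here, only the pattern check.

```python
import re

# Pattern quoted in the message for the identifier element of the DataCite XSD.
DOI_PATTERN = re.compile(r"10\..+/.+")

# A few "unknown information" codes (assumed from Appendix 3 of [2]).
PLACEHOLDER_CODES = ["(:unav)", "(:unas)", "(:null)", "(:tba)"]


def matches_schema_pattern(identifier: str) -> bool:
    """Return True if the whole string matches the pattern.

    XSD patterns are implicitly anchored, hence fullmatch().
    """
    return DOI_PATTERN.fullmatch(identifier) is not None


# Hypothetical DOI, used only for illustration.
print("10.1234/example", matches_schema_pattern("10.1234/example"))
for code in PLACEHOLDER_CODES:
    print(code, matches_schema_pattern(code))
```

None of the placeholder codes starts with `10.`, so none of them can satisfy the pattern, which is why a document using them fails schema validation.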
The question is now how we proceed. I see two options:
1. We ignore the fact that datacite documents without a DOI can’t be validated and add a comment to our recommendations saying it is allowed (but of course not advised) to add ‘invalid’ datacite documents. In that case, there should be an alternateIdentifier of type INTERNAL that can be used by the consumer of the bag.
2. We switch to another standard for providing our minimal metadata set, e.g. DataCrate [3].
What is your opinion on that? Do you see a third option?
Regards,
Thomas
[1] https://docs.google.com/document/d/1eQL1Guv0ihfxPJIIceLJk4l22cRpGbXTsNVS...
[2] https://schema.datacite.org/meta/kernel-4.0/doc/DataCite-MetadataKernel_...
[3] https://github.com/UTS-eResearch/datacrate/blob/develop/spec/0.1/data_cr...
--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Dipl. Ing. Thomas Jejkal
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
Phone: +49 721 608-24042
E-mail: ***@***.***
Web: http://www.scc.kit.edu
ORCID: http://orcid.org/0000-0003-2804-688X
Registered office: Kaiserstraße 12, 76133 Karlsruhe, Germany
KIT – The Research University in the Helmholtz Association
Author: Rolf Krahl
Date: 24 Jan, 2018
Hi Thomas,
On Wednesday, 24 January 2018, 08:10:41, TJejkal wrote:
>
> David figured out that the SWORD people are also referring to our
> recommendations with the difference that they skip the requirement
> of datacite.xml. After a short discussion in their profile working
> document [1] it seems that the main reason is the necessity of a
> DOI. Obviously, using machine-recognizable codes as stated under
> ‘Guidance for handling missing mandatory property values’ in [2]
> applies to all mandatory properties but the identifier as the schema
> defined a fixed regular expression with the value 10\..+/.+ for this
> element. Thus, datacite documents using placeholders for the
> identifier won’t validate against the schema.
I would say the authoritative source for the standard is the written
document. The XML Schema Definition file is (or should be) merely an
implementation of this standard. The wording in the standard
document is clear; to quote p. 10:
| 2.3 DataCite Properties
|
| Table 3 provides a detailed description of the mandatory properties,
| which must be supplied with any initial metadata submission to
| DataCite, together with their sub‐properties. If one of the
| required properties is unavailable, please use one of the standard
| (machine‐recognizable) codes listed in Appendix 3, Table 11.
I.e. the standard values for unknown information in Appendix 3, Table
11 are allowed to be used for the mandatory properties listed in Table
3, which includes the Identifier property. From this it follows that the
regular expression in the XML Schema Definition file is a bug.
I would favor:
1. For the time being, we keep the requirement that packages must
contain a datacite.xml file and also that the content of this file
must be valid according to the DataCite standard.
2. We add a note that if the digital object in the package does
not have a DOI or if the DOI is not known, one of the standard
values for unknown information (Appendix 3 of DataCite) MUST be
used in the Identifier property. (I.e. we state explicitly that
according to our interpretation, DataCite does not imply the
necessity of a DOI.)
We add that an AlternateIdentifier SHOULD be used if a DOI is not
provided. (But I would not require any particular type. DataCite
requires the alternateIdentifierType sub-property to be used with
AlternateIdentifier, but specifies the allowed value only as free
text. We should not go further than that here.)
3. We add a note that if one of the standard values for unknown
information is used for the Identifier property, the datacite.xml
will not validate against the DataCite XML Schema Definition. We add
that we consider this a bug in the XSD and that this fact does not
imply invalidity of the provided metadata. We add that the
receiver of a package MUST NOT reject the package based on a failed
XML Schema validation, if this failure is only due to an unknown
DOI.
4. We contact the DataCite people and ask them to fix the bug in their
XML Schema Definition.
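As an illustration of point 3, a receiver could inspect the Identifier value itself before deciding whether a failed schema validation matters. The following Python sketch assumes the DataCite kernel-4 namespace URI and the list of codes from Appendix 3; the function name `classify_identifier` and the sample document are hypothetical, and the sketch does not perform XSD validation.

```python
import re
import xml.etree.ElementTree as ET

# Pattern that the XSD applies to the identifier element.
DOI_PATTERN = re.compile(r"10\..+/.+")

# Standard values for unknown information (assumed from Appendix 3 of DataCite).
UNKNOWN_CODES = {"(:unac)", "(:unal)", "(:unap)", "(:unas)", "(:unav)",
                 "(:unkn)", "(:none)", "(:null)", "(:tba)", "(:etal)"}

# Namespace of DataCite kernel-4 documents (assumption).
NS = {"dc": "http://datacite.org/schema/kernel-4"}


def classify_identifier(datacite_xml: str) -> str:
    """Classify the identifier as 'doi', 'unknown-doi' or 'invalid'."""
    root = ET.fromstring(datacite_xml)
    node = root.find("dc:identifier", NS)
    value = (node.text or "").strip() if node is not None else ""
    if DOI_PATTERN.fullmatch(value):
        return "doi"
    if value in UNKNOWN_CODES:
        # Fails the XSD pattern, but MUST NOT lead to rejection under point 3.
        return "unknown-doi"
    return "invalid"


# Hypothetical, heavily shortened document for illustration only.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">(:unav)</identifier>
</resource>"""

print(classify_identifier(SAMPLE))
```

A receiver that gets "unknown-doi" here would treat a schema failure on the identifier as expected and look for an AlternateIdentifier instead.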
--
Rolf Krahl <***@***.***-berlin.de>
Helmholtz-Zentrum Berlin für Materialien und Energie (HZB)
Albert-Einstein-Str. 15, 12489 Berlin
Tel.: +49 30 8062 12122
Author: Thomas Jejkal
Date: 24 Jan, 2018
Hi Rolf,
thanks for your thoughts. Of course, we can interpret the XSD as a (too strict) implementation of the standard; maybe the RegEx is also an artifact from older versions. I think it is definitely a good idea to file an issue in the GitHub repository of DataCite. I will do this in a second. However, as we do not 'force' anyone to validate datacite.xml, this should be no showstopper. It's just not optimal. Regarding the alternate identifier, we can of course also skip recommending any identifier type and let the bag consumer assign an identifier if the bagged object has none.
Regards,
Thomas
--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Dipl. Ing. Thomas Jejkal
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
Phone: +49 721 608-24042
E-mail: ***@***.***
Web: http://www.scc.kit.edu
ORCID: http://orcid.org/0000-0003-2804-688X
Registered office: Kaiserstraße 12, 76133 Karlsruhe, Germany
KIT – The Research University in the Helmholtz Association
Author: David Wilcox
Date: 29 Jan, 2018
Hi Thomas,
Thanks for volunteering to file an issue in the DataCite GitHub repository. Please report back once you receive a reply so we can determine how to proceed. I agree with Rolf that the written DataCite spec is the authoritative source for the standard and that we do not require validation against the XSD; however, as Thomas points out, the fact that one can create a DataCite record that is valid according to the spec but fails to validate against the XSD if a DOI is not present is not optimal.
It would be useful to schedule another group call so we can review our current status and make any necessary decisions on how to proceed before the next plenary in March. I’ll work with Thomas to put something on the calendar and send an invitation to the group.
Regards,
David
--
David Wilcox
Fedora Product Manager
DuraSpace
***@***.***
Author: Rolf Krahl
Date: 31 Jan, 2018
Dear all,
On Wednesday, 24 January 2018, 10:49:57, rolf.krahl wrote:
>
> I would say, the authoritative source for the standard is the written
> document. The XML Schema Definition file is (or should be) merely an
> implementation of this standard. The formulation in the standard
> document is clear, cite p. 10:
Thomas has opened an issue in the DataCite GitHub project. My
conclusion from the discussion with DataCite people is that it
revealed a fundamental misconception about what DataCite is: I was
always using DataCite as a metadata standard. And I know of many
other people in the community that do as well. From the discussion on
GitHub it becomes apparent that DataCite people themselves consider it
a DOI registration service and that what we call the "DataCite
metadata standard" is only the input format for this particular
service from their point of view. What I was reading as a standard
document is in fact intended to be a manual for the input of the
DataCite service.
Obviously, for a DOI registration service, they insist that a DOI must
be provided in the input. I would never argue against that.
From this follows that I must amend my proposal on how to deal with
this issue.
We could do the following instead:
1. We add an explicit note to the specification that the content of
the package does not need to have a DOI.
2. We keep the requirement that packages must contain a datacite.xml
file. The content of this file must be a /variant/ of the DataCite
metadata standard.
3. We add a small section on "Minimal metadata" that explains the
format of the datacite.xml file in detail and which modifications
to DataCite we use in our specification. The only modification
that I see for the moment, is that we explicitly allow the standard
values for unknown information to be used for the Identifier
property if no DOI is available. We add that an
AlternateIdentifier SHOULD be used if a DOI is not provided.
4. As an option, we may provide a modified XML Schema Definition for
our variant of the DataCite metadata standard.
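To make the proposed variant concrete, the following Python sketch builds the identifier part of such a datacite.xml: a standard "unknown" value in the Identifier property plus an AlternateIdentifier. The namespace URI, the choice of `(:unav)`, the function name and the example values `bag-0001` and `INTERNAL` are assumptions for illustration; the other mandatory properties (creators, titles, and so on) are deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace of DataCite kernel-4 documents (assumption).
NS = "http://datacite.org/schema/kernel-4"


def build_minimal_record(alternate_id: str, alternate_id_type: str) -> str:
    """Return a partial record with a placeholder identifier.

    Only the identifier-related elements are written; this is not a
    complete DataCite document.
    """
    ET.register_namespace("", NS)  # serialize with a default namespace
    root = ET.Element(f"{{{NS}}}resource")

    # No DOI available: use a standard value for unknown information.
    ident = ET.SubElement(root, f"{{{NS}}}identifier",
                          {"identifierType": "DOI"})
    ident.text = "(:unav)"

    # AlternateIdentifier SHOULD be present; its type is free text.
    alts = ET.SubElement(root, f"{{{NS}}}alternateIdentifiers")
    alt = ET.SubElement(alts, f"{{{NS}}}alternateIdentifier",
                        {"alternateIdentifierType": alternate_id_type})
    alt.text = alternate_id

    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
print(build_minimal_record("bag-0001", "INTERNAL"))
```

A modified XSD as mentioned in point 4 would then only need to relax the pattern on the identifier element so that such a record validates.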
Let me add two more notes: it is common practice in the community to
consider DataCite a metadata standard. Thus, I still believe it is
acceptable to keep the term "DataCite metadata standard" in our
specification even though we just learned that the DataCite people
themselves might not consider it a standard. Most people who use
DataCite as a standard don't take it too strictly either and often apply
their own custom modifications. Compared to what others do, our
modification is very minor. Therefore I still believe it is
acceptable to keep the name "datacite.xml" here.
Best regards,
Rolf
--
Rolf Krahl <***@***.***-berlin.de>
Helmholtz-Zentrum Berlin für Materialien und Energie (HZB)
Albert-Einstein-Str. 15, 12489 Berlin
Tel.: +49 30 8062 12122