Dear WG Data Citation members,
I work at a repository for the long-term preservation of digital research data and we are currently in the process of preparing citation information for each data collection and also for the individual parts of the collection.
We decided to use BibLaTeX because it already provides a 'dataset' type that is also supported by some citation styles (e.g. APA 6th). BibLaTeX is actively developed, which allows the dataset type to be shaped further. This is needed because there is currently no 'good' way to, e.g., collect reference information for parts of a dataset or to provide fingerprints, unique queries, etc.
I've started a discussion on GitHub: https://github.com/plk/biblatex/issues/1103
Any feedback (here or maybe directly on GitHub) would be highly appreciated!
Best from Vienna
Author: Hugh Paterson
Date: 01 Apr, 2021
There is an overarching problem in dataset citation, and you point out half
of it. Datasets are frequently aggregate works (collections) and there is
no easy way to reference the components of the aggregate unit. While I
advocate making bibliographic metadata available, and while biblatex is
marginally better than bibtex, neither was designed to be the source
authority format for archival records.
The second and more frequently ignored problem in dataset referencing is
that datasets as archival objects are often miscategorized.
Some take the position that all objects in digital form are data… but is
software data? And this leads to an important philosophical question about
the role of institutional repositories: should they persist data, or should
they persist the evidentiary record? That is, is the term data at all even
If I have an aggregate work of audio materials, it may be
cited/referenced as an album and each sub-unit as a track. There is no
reason to categorize this as a “dataset”. The same is true with a set of
ethnographic interviews which are just dumped into a repository. They are
interviews, not just recordings or a “dataset”. So depending on the media
type some things should not be datasets. I find the dcmitype vocabulary
very helpful in this regard. Dublin Core says that every artifact should
have a one-to-one record in the catalogue. So each audio recording should
get its own record and a relationship to the record for the aggregate work,
which would be the album.
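A minimal sketch of the one-record-per-artifact idea, assuming a made-up `Record` schema and invented identifiers. Only the type terms ("Collection", "Sound") and the isPartOf/hasPart relations are taken from Dublin Core; everything else is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """One catalogue record per artifact (hypothetical minimal schema)."""
    identifier: str
    title: str
    dcmi_type: str                      # a DCMI Type term, e.g. "Sound"
    is_part_of: Optional[str] = None    # identifier of the aggregate record
    has_part: List[str] = field(default_factory=list)


def catalogue_album(album_id: str, album_title: str,
                    track_titles: List[str]) -> List[Record]:
    """Create one Collection record for the album and one Sound record per track."""
    album = Record(album_id, album_title, "Collection")
    records = [album]
    for number, title in enumerate(track_titles, start=1):
        track = Record(f"{album_id}/track-{number}", title, "Sound",
                       is_part_of=album_id)
        album.has_part.append(track.identifier)   # relation from aggregate to part
        records.append(track)
    return records


# Invented example data.
records = catalogue_album("album-1", "Field Recordings", ["Greeting", "Story"])
for record in records:
    print(record.identifier, record.dcmi_type, record.is_part_of)
```

Each track can then be referenced on its own, while the album record still lists all of its parts.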
Datasets are a legitimate item type, but as the dcmitype identifies them
they are tabular data, ready for ingest into a computer application. In
this manner they are distinct from the dcmitype for text in that they are
not designed for human literary consumption.
The need to accurately identify item types comes back to repositories and
how they identify content, and make those identifications available via
pre-formatted bibliographic records. If the repository says that everything
is “dataset” then the use of biblatex @dataset versus bibtex @misc is a
moot point because both are equally unhelpful and ambiguous to the end-user
who might look to reuse the bibliographic metadata.
Also note that APA 7th is out. I don’t like it, but it is out.
Some food for thought,
All the best,
Author: Roberto Di Cosmo
Date: 01 Apr, 2021
Similar discussions have been taking place about software (which is not just a
special case of data), and the following short summary may be of interest for
your work in this area.
Of course, Bib(La)TeX entries are not meant to contain all relevant metadata
about software; there are other standards for that (like CodeMeta). For example,
we do not expect to find individual author roles or affiliations recorded there.
Nonetheless they are extremely useful when it comes to citing software in
publications, creating bibliographic reference lists, producing activity
Hence the importance of having proper support for software entries in BibLaTeX,
but, as with @dataset, the @software entry in stock BibLaTeX is just
another name for @misc, and does not fit the bill at all.
A significant amount of work has been done to determine:
- the fields needed to describe software in a bibliography
- the kind of entries needed for capturing the software facets
(@software alone is not enough)
- the best way to make these new fields and entries supported in existing styles
This has led to the development of the biblatex-software package, included in
all recent TeXLive distributions, and also separately available from CTAN at
This package is a /style extension/ that can be used to add support for the
following four entries to any existing BibLaTeX bibliographic style:
Biblatex-software supports inheritance between these entries, and provides a
broad set of parameters that allow tweaking the rendering of bibliographies as
desired; see the extensive documentation for more information.
It would be great to see similar work done for @dataset, and I hope this
information about what we did for software (not only the technical
implementation, but also the process that led to it) may be of help.
All the best
Roberto Di Cosmo
Computer Science Professor
(on leave at INRIA from IRIF/Université de Paris)
Software Heritage https://www.softwareheritage.org
Author: Andreas Rauber
Date: 02 Apr, 2021
This is absolutely correct! In a nutshell, we need to differentiate
between two aspects, namely 1) **identifying** the precise
subset/collection of a (potentially changing) dataset, and 2) the
actual information making up a citation to that dataset.
For 1) this WG has come up with an answer, a single principle, that so
far seems to work across all types of data and solutions for
implementing a data repository, whereas for 2) some recommendations can
be made, but it will ultimately depend strongly on the domain, the type
of data and its use.
1) For the identification we again have two sub-challenges, namely a)
the evolution of data (new data being added, errors corrected, ...); and
b) the identification of arbitrary subsets. 1a) is solved by versioning
all changes to data, whereas 1b) is addressed by resolving any subsets
dynamically via a reproducible operation (referred to, in the
guidelines, as a query) that was executed at a certain timestamp, that
needs to be stored and that is associated with a persistent identifier.
This could be the check-out from a Git repository at a certain point in
time, it could be a "list-directory" command against a versioned file
system, it can be an SQL query against a temporal table, we have seen
solutions using slice/dice operators against a NetCDF file, but it could
also - to use the audio example referred to before - even be a pointer to a
specific off-set in an audio file (or, e.g. a 30sec segment starting at
minute 1 for all audio files with a 44kHz sampling rate in the
collection, as sometimes used to be done for music retrieval
experiments). This also works across distributed repositories, as each
only needs to keep the queries it processed as well as the timestamps
locally, without any need to synchronize clocks. An aggregator would
then simply store the individual PIDs of the responses from a federated
2) Concerning the actual citation text, we may want to re-visit that
topic to see how much more specific we can get while still making sure
that any recommendation works across possibly all types of data and
domains. So far, we have stayed with a very limited set of metadata,
borrowing from the analogy of citations to literature, recommending the
use of two identifiers: one for the (continuously evolving) data source,
and one for the specific subset extracted from it at a given point in
time (analogy: a specific (static) paper identified e.g. via a DOI in an
(evolving, i.e. growing, with new editions being added) journal
(identified via e.g. an ISSN)). The creator of the subset may be compared
to the author of a paper, whereas the owner/operator of the data source
may be likened to the editor of proceedings - but any such mapping will
already differ quite a lot across repositories and types of data, so it
is not part of the general recommendations - beyond the statement that
each data center should provide a recommended way of phrasing/expressing
citation. This may well be worth picking up if we have a feeling that
this core can be extended, now that we have a better understanding of
the identification and resolving process.
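The identification principle from 1) can be sketched roughly as follows. This is a toy illustration, not a reference implementation from the WG. The `VersionedStore` class, the equality-filter "query" and the hash-based `pid:` strings are all assumptions made up for the example. It shows that every change is kept as a new version, that a query is stored together with a local timestamp, and that re-executing it later yields the same subset, which gives the two identifiers mentioned in 2): the data source itself and the stored query.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RowVersion:
    key: str
    data: Dict[str, str]
    valid_from: int                   # local logical timestamp of the change
    valid_to: Optional[int] = None    # None while this version is current


class VersionedStore:
    """Toy data source: changes are versioned, queries are stored with a timestamp."""

    def __init__(self) -> None:
        self.clock = 0                # purely local, no clock synchronization
        self.versions: List[RowVersion] = []
        self.stored_queries: Dict[str, Tuple[Dict[str, str], int]] = {}

    def put(self, key: str, data: Dict[str, str]) -> None:
        """Add or correct a row without overwriting earlier versions (1a)."""
        self.clock += 1
        for version in self.versions:
            if version.key == key and version.valid_to is None:
                version.valid_to = self.clock
        self.versions.append(RowVersion(key, dict(data), self.clock))

    def query_at(self, criteria: Dict[str, str], timestamp: int) -> List[str]:
        """Evaluate an equality filter against the state valid at `timestamp`."""
        return sorted(
            v.key for v in self.versions
            if v.valid_from <= timestamp
            and (v.valid_to is None or timestamp < v.valid_to)
            and all(v.data.get(name) == want for name, want in criteria.items())
        )

    def cite_subset(self, criteria: Dict[str, str]) -> str:
        """Store query and timestamp; return a made-up identifier for the subset (1b)."""
        payload = json.dumps([criteria, self.clock], sort_keys=True)
        pid = "pid:" + hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.stored_queries[pid] = (dict(criteria), self.clock)
        return pid

    def resolve(self, pid: str) -> List[str]:
        """Re-execute the stored query against the data as of its timestamp."""
        criteria, timestamp = self.stored_queries[pid]
        return self.query_at(criteria, timestamp)


store = VersionedStore()
store.put("a", {"type": "audio"})
store.put("b", {"type": "video"})
pid = store.cite_subset({"type": "audio"})     # subset as cited at this moment
store.put("c", {"type": "audio"})              # new data added later
store.put("b", {"type": "audio"})              # an error corrected later

print(store.resolve(pid))                              # ['a']
print(store.query_at({"type": "audio"}, store.clock))  # ['a', 'b', 'c']
```

A real repository would run an actual query language against versioned storage and mint PIDs through a registration service; the hash here only stands in for that step.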
The pre-print version of the paper we've prepared on the recommendations
as well as reference implementations and deployed adoptions, could be
useful to review these principles in different settings:
Author: Martina Trognitz
Date: 14 Apr, 2021
Dear Hugh, Roberto, Andi and Mark,
thank you very much for your thoughts, comments and pointers to other recommendations. They do provide some useful information to consider for shaping BibLaTeX's @dataset type.
I feel that I should elaborate a bit on my use case to clarify what I am trying to achieve. The repository hosts data from the (digital) humanities, with collections from disciplines like oriental studies, archaeology or history. Sizes of the collections vary significantly (1GB to 10TB; or a few up to 100 000 (and more) resources) and a collection can contain multiple resource types (we use a vocabulary based on DCMI type). We developed a dedicated metadata schema to describe the objects both on collection as well as on resource level and as long as the resources are publicly accessible they also get a PID (Handle). The PID points to the respective object's landing page with all its metadata and machine-readable endpoints are also available.
To aid in proper attribution when a collection (or a subcollection or a single resource) is re-used the repository provides a citation suggestion. This is comparable to those you can e.g. find on Zenodo or on Dataverse instances. The suggested citation is automatically computed from the metadata and we decided to provide it in BibLaTeX format because most reference management software supports this and also many citation styles already exist.
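A minimal sketch of how such a citation suggestion might be computed, assuming a simplified metadata dictionary. The key names (`creators`, `repository`, `pid`, `accessed`), the sample values and the handle URL are invented for illustration and are not the repository's actual schema. The output uses only standard BibLaTeX fields (`author`, `title`, `date`, `publisher`, `url`, `urldate`).

```python
from typing import Any, Dict


def escape(value: str) -> str:
    """Escape a few characters that commonly break a non-verbatim .bib field."""
    for char in ("&", "%", "#", "_"):
        value = value.replace(char, "\\" + char)
    return value


def to_biblatex(key: str, metadata: Dict[str, Any]) -> str:
    """Render a @dataset entry from a (hypothetical) repository metadata record."""
    fields = [
        ("author", escape(" and ".join(metadata["creators"]))),
        ("title", escape(metadata["title"])),
        ("date", metadata["date"]),
        ("publisher", escape(metadata["repository"])),
        ("url", metadata["pid"]),          # verbatim field, so left unescaped
        ("urldate", metadata["accessed"]),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@dataset{{{key},\n{body},\n}}"


# Invented example record.
entry = to_biblatex("doe2021", {
    "creators": ["Doe, Jane", "Roe, Richard"],
    "title": "Excavation Photos & Plans",
    "date": "2021",
    "repository": "Example Repository",
    "pid": "https://hdl.handle.net/00000/example_1",
    "accessed": "2021-04-14",
})
print(entry)
```

Putting the PID into `url` is only one option; the mapping for subcollections and single resources is exactly the part that the current @dataset data model leaves open.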
During the mapping process of our metadata to BibLaTeX, I found that most of the principles of this WG's recommendations can be met, but not all. As BibLaTeX is still actively developed I saw the chance to shape the @dataset type into something that could then help citation style developers to provide sound and useful citations of 'datasets'. One of the developers pointed out: "I realise that with some things we might have a bit of a chicken-or-egg problem: Certain things might not be popular yet, because they are not properly supported by the software yet." -- By working on enhancing the data model for the @dataset type and possibly even introducing a new type like @datasubset we could pave the way for better citation styles for data. I myself was, for example, thinking of promoting this in the German-speaking archaeology community, but this only makes sense if the technological basis is there.
Author: Hugh Paterson
Date: 14 Apr, 2021
Given your further description, my impression is that your DCMIType should
be "collection" not "dataset", given that DCMIType suggests that a dataset
cannot be further broken down and described. This is inferred because
"collection" is the only DCMIType class which can be further broken down
and "contain"/"hasPart" individually describable items. In that case the
most appropriate biblatex type would be @collection, and each part
could be either @incollection or another more appropriate biblatex database
entry type for when the item is referenced as a single entity. Looking at
the linked reference, I notice that there is no default @audio
or @recording which would be appropriate for audio type artifacts in
collections (an album is a type of audio collection). However, even though
these are non-default, they are used in some style sheets (see link).
This leads me to ask if you want to stay with the default settings
in Biblatex or if you are willing to provide data in formats used within
"standard" secondary communities of the biblatex community. That is, if
there is a biblatex style that is common with the major audience of the
content within your archive, then maybe venturing into providing biblatex
within that dialect would be acceptable. Another thing to note, if you are
serving archaeology data, is that you might have DCMIType
InteractiveResource material — assuming that some of the larger artifacts
are visualizations from lidar or other 3D imaging tools used in modern
You mention the collection size varying significantly and you list (1GB to
10TB), but I suggest that the extent of a collection is not the number of
bytes that it contains but rather the number of objects which are uniquely
described within the next lower level of the collection (collections in Dublin
Core can be recursive). Unfortunately, the number of items in a collection
does not fit into the allowable options of the extent field within DCTerms.
One must use the property tableOfContents instead. Different
referencing styles handle this sort of information in different ways.
The APA 6th edition (VandenBos 2010) provides the following template on page 212, which I
have emulated for how I would apply it for a collection of field recordings
in linguistics. My application shows how I would provide the collection
summary statement to include the tableOfContents/extent information.
Author, A. A. (Year, Month Day). Title of material. [Description of
material]. Name of collection (Call number, Box number, File name or
number, etc.). Name and location of repository.
Paterson III, H. J. (2018-2019). Western Kainji Oral Stories. [435 audio
and video recordings, 5 hours, 8 languages]. African Voices (ark:12025,
DOI: 10.1234/780912). Pangloss, Paris, France.
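As a rough sketch, the template and the extent summary above could be filled in programmatically. The function names and the total duration in seconds are assumptions for illustration; the other values are taken from the example.

```python
from typing import List


def describe_extent(item_count: int, item_kind: str,
                    total_seconds: int, language_count: int) -> str:
    """Summarise a collection by items, duration and languages rather than bytes."""
    hours = round(total_seconds / 3600)
    return f"{item_count} {item_kind}, {hours} hours, {language_count} languages"


def format_reference(author: str, date: str, title: str, description: str,
                     collection: str, identifiers: List[str],
                     repository: str) -> str:
    """Fill the archival-material template quoted above."""
    return (f"{author} ({date}). {title}. [{description}]. "
            f"{collection} ({', '.join(identifiers)}). {repository}.")


reference = format_reference(
    author="Paterson III, H. J.",
    date="2018-2019",
    title="Western Kainji Oral Stories",
    # 5 * 3600 seconds is an assumed total duration matching "5 hours".
    description=describe_extent(435, "audio and video recordings", 5 * 3600, 8),
    collection="African Voices",
    identifiers=["ark:12025", "DOI: 10.1234/780912"],
    repository="Pangloss, Paris, France",
)
print(reference)
```

Keeping the extent summary in its own function makes it easy to swap in a different description for other media types.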
An audio or video artifact (or set of artifacts) would not be
helpfully described as 1GB to 10GB, but would be more helpfully described
with a time-based extent, e.g., 1h3m35s. Note that Zenodo does not
currently allow a depositor to distinguish between audio and video
all the best,
- Hugh Paterson III
VandenBos, Gary R., ed. 2010. *Publication Manual of the American
Psychological Association*. 6th edn. Washington, DC: American Psychological
Author: Martina Trognitz
Date: 16 Apr, 2021
Good morning Hugh,
thank you again!
One thing I would like to stress in this discussion is the purpose of references, which should identify and point to some source or resource. IMHO references are not intended to fully and thoroughly describe the object they identify, as this is done either with an imprint (or other means) in a printed resource or with a landing page of a digital resource.
The Dublin Core's DCMI Type and BibLaTeX's Entry Types are two different pairs of shoes with very different backgrounds and terminology. While the DCMI Type vocabulary was already developed with digital data collections and different resource types in mind, the Entry Types of BibLaTeX originate in bibtex, which itself was first released in 1985. bibtex and biblatex were developed with printed contributions in mind and therefore the Entry Types concentrate on such, e.g. @collection is defined as:
From this definition, it becomes clear that it is not possible to just use that type for electronic data collections, especially bearing in mind that a user might include various different kinds of references in a bibliography. The best fit is @dataset, which was introduced as a fully supported Entry Type in BibLaTeX in 2019. It is quite vaguely defined as
For the purposes of including a reference to a data collection, I think this vague definition is fine. I think introducing corresponding biblatex Entry Types for each of the DCMI Types should be avoided to (1) prevent the list of Entry Types from exploding, (2) avoid dealing with disambiguation in defining Entry Types, (3) keep the adoption barrier low. Adoption is key: it does not suffice to have a proper Entry Type, but also, after properly defining the Entry Type, respective citation styles (e.g. with the Citation Style Language (CSL)) should be available for convenient use.
What I am trying to achieve is (a) to provide a convenient way for users of our repository to save a reference to a data collection to their bibliography, and (b) to do this in a way that can be widely used without much tinkering like installing additional packages etc. This is why I thought that shaping and expanding the @dataset Entry Type (and the necessary and optional Entry Fields) directly for one of the next releases of BibLaTeX (see issue on GitHub) might be the best way. (Another task would be to advocate for proper data referencing in the respective communities and try to influence citation guidelines.)
To wrap up, here are the key points that came up in this discussion, which I will suggest on GitHub:
* a way to reference the components of the aggregate unit (i.e. part of a data set)
* an Entry Field to indicate the type of @dataset (e.g. with terms from DCMI Type)
* an Entry Field for storing a "query"
By the way: BibLaTeX has an Entry Type @software, but it is treated as an alias of @misc. It could also be worked on and expanded, if wanted.