Identifiers of Digital Objects (IDOs) and Digital Identifiers of Objects (DIOs): a conceptual framework for identifiers

26 Oct 2018

[ I'm cross-posting this message to several groups that may be interested in these results. I hope that sending a single mail to several mailing lists will allow the list servers to deliver just one copy to each recipient.
My apologies for any duplicate messages if this is not the case. ]
Dear all,
the Software Heritage archive provides universal access to some 5 billion source code files from over 80 million origins since last June (see https://www.softwareheritage.org/2018/09/22/browsing-the-software-herita... for a walkthrough).
Assigning identifiers to the billions of digital objects we archive is not an easy task. It is not just a technological challenge: we knew that whatever choice we made, it would end up setting a standard in the medium term, and this is a serious responsibility.
That's why I'm delighted to share with you a research article that gives a full account of why and how we designed the system of identifiers now deployed across the entire Software Heritage archive (it is what stands behind the red vertical "Permalinks" tab available in every view of the web app used to browse the code).
It was presented at iPres 2018 in Boston this September, and you can now find it online at https://hal.inria.fr/hal-01865790
To make some sense of the complex landscape of terms, properties, and systems we were confronted with, we had to produce a conceptual framework for analysing the existing systems of identifiers, which can, after all, be modeled as just a simple abstract data type, and clearly fall into two different categories:
- Digital Identifiers of Objects (or DIOs), of which DOIs or ARKs are well known examples,
- Identifiers of Digital Objects (or IDOs), of which UUIDs, git commit hashes, and now the Software Heritage identifiers, are prominent instances.
We do not claim any originality in the introduction of the terms IDOs and DIOs, as they are directly extracted from the following enlightening remark written by Norman Paskin in his 2010 reference article on DOIs in the Encyclopedia of Library and Information Sciences:
The term “Digital Object Identifier” is construed as “digital identifier of an
object,” rather than “identifier of a digital object”: the objects identified by
DOI names may be of any form—digital, physical, or abstract—as all these forms
may be necessary parts of a content management system. The DOI system is an
abstract framework which does not specify a particular context of its
application, but is designed with the aim of working over the Internet.
These two categories of identifier systems address clearly different needs, and we need to keep this in mind when tackling fundamental issues like the reproducibility of software-intensive experiments.
It turns out that for identifying digitally native objects that have a canonical form (software source code being one example among others), IDOs fulfill all the prerequisites, and this is why we have chosen IDOs and not DIOs as the Software Heritage identifiers (SWH IDs).
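To make the idea of an identifier derived from the object itself more concrete, here is a minimal Python sketch. It assumes that a version 1 content identifier is the git blob hash of the file's bytes (SHA-1 over a "blob <length>" header, a NUL byte, and the content), which is not spelled out in this message; the function name swh_content_id is purely illustrative.

```python
import hashlib


def swh_content_id(data: bytes) -> str:
    """Derive an identifier from the bytes of a file, and nothing else.

    Assumption: content identifiers reuse the git blob hash, i.e.
    SHA-1 over b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    # The same bytes always yield the same identifier, with no registry involved.
    print(swh_content_id(b"hello world\n"))
```

Anyone holding a copy of the object can recompute the identifier and check that it matches, which is the property that sets IDOs apart from identifiers assigned by a registry.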
Full details are in the article, but, as an example, here are a few quite nice comment lines found in the source code of the Apollo 11 guidance computer software:
https://archive.softwareheritage.org/swh:1:cnt:41ddb23118f92d7218099a5e7...
This URL is built out of three key parts:
- a resolver URL: https://archive.softwareheritage.org/
- the SWH ID: swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa
- optional attributes: ;lines=64-72;origin=https://github.com/chrislgarry/Apollo-11/
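The three-part structure can be illustrated with a short Python sketch that splits such a URL back into its components. The names parse_swh_url and ParsedLink are hypothetical, and the pattern (a numeric version, a lowercase type tag, 40 hexadecimal digits) is an assumption inferred from the examples in this message rather than taken from a formal grammar.

```python
import re
from typing import Dict, NamedTuple

# Assumed shape of an SWH ID, inferred from the examples in this message.
SWHID_RE = re.compile(r"^swh:(\d+):([a-z]+):([0-9a-f]{40})$")


class ParsedLink(NamedTuple):
    resolver: str
    swhid: str
    version: int
    object_type: str
    object_hash: str
    attributes: Dict[str, str]


def parse_swh_url(url: str,
                  resolver: str = "https://archive.softwareheritage.org/") -> ParsedLink:
    """Split a URL into resolver prefix, SWH ID, and optional attributes."""
    if not url.startswith(resolver):
        raise ValueError("URL does not start with the expected resolver prefix")
    # Everything after the resolver is the SWH ID plus ';'-separated attributes.
    swhid, *raw_attrs = url[len(resolver):].split(";")
    match = SWHID_RE.match(swhid)
    if match is None:
        raise ValueError("not a well-formed SWH ID: " + swhid)
    attributes = {}
    for item in raw_attrs:
        key, _, value = item.partition("=")  # split on the first '=' only
        attributes[key] = value
    return ParsedLink(resolver, swhid, int(match.group(1)),
                      match.group(2), match.group(3), attributes)


if __name__ == "__main__":
    link = parse_swh_url(
        "https://archive.softwareheritage.org/"
        "swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa"
        ";lines=64-72;origin=https://github.com/chrislgarry/Apollo-11/"
    )
    print(link.object_type, link.attributes)
```

Only the SWH ID identifies the object; the resolver and the attributes say where to look it up and which part to display.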
SWH IDs work at all levels: for example, here is the SWH ID of the directory containing the full Apollo-11 source code:
swh:1:dir:3c235a1a8223727a964c154eb8f2273176c48c88
SWH IDs are also resolved by N2T (n2t.net), so the following works too:
https://n2t.net/swh:1:dir:3c235a1a8223727a964c154eb8f2273176c48c88
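Since the identifier is independent of the resolver, the same SWH ID can simply be appended to either resolver prefix. A minimal Python sketch, where the RESOLVERS table and the resolver_urls helper are illustrative names only and the two prefixes are the ones quoted in this message:

```python
from typing import Dict

# Resolver prefixes quoted in this message; the dictionary keys are arbitrary labels.
RESOLVERS = {
    "softwareheritage": "https://archive.softwareheritage.org/",
    "n2t": "https://n2t.net/",
}


def resolver_urls(swhid: str) -> Dict[str, str]:
    """Return one resolvable URL per known resolver for the given SWH ID."""
    return {name: prefix + swhid for name, prefix in RESOLVERS.items()}


if __name__ == "__main__":
    for name, url in resolver_urls(
            "swh:1:dir:3c235a1a8223727a964c154eb8f2273176c48c88").items():
        print(name, url)
```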
I hope that you will find this work interesting and the conceptual framework useful.
All the best
--
Roberto Di Cosmo
------------------------------------------------------------------
Computer Science Professor
(on leave at INRIA from IRIF/University Paris Diderot)
Director
Software Heritage E-mail : ***@***.***
INRIA Web : http://www.dicosmo.org
Bureau C123 Twitter : http://twitter.com/rdicosmo
2, Rue Simone Iff Tel : +33 1 80 49 44 42
CS 42112
75589 Paris Cedex 12
------------------------------------------------------------------
GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3