[software-source-code] Answering Q&A from VP17 session

05 May 2021

Dear all,
We had a very lively session at VP17 and I want to thank you for
participating and answering all our questions (ice-breakers and workshop).
I'm sharing with you the links to the slides [2] and to the note-taking
document [1]. The video
from the session is available on the Juno platform (for participants) and
will be shared publicly in a few weeks.
During the session, some questions were left unanswered, and I promised to
answer them via email.
For your benefit, we have prepared answers to all of them:
* Q: Has this excellent work (The EOSC SIRS report) highlighting important
issues resulted yet in any progress on the short-term recommendations,
and/or identification of ways to enable the recommendations in general?
A: The SIRS report was published in December 2020 and sets out a full
range of recommendations for the short, mid and long term in section 5 of
the report [4].
* Q: Can you elaborate on "no need for a register"? There is still a need
for a "registry", correct? How will these objects with intrinsic
identifiers be discovered? How do you get on a global scale from an
intrinsic id the object behind it?
A: This question touches upon three different issues: identification,
retrieval and search (or discovery).
When talking about identification, one important concern is how an
identifier is assigned to an object. An intrinsic identifier, such as the
SWHID, is calculated from the artifact itself, like a fingerprint. No
registry or authority is needed to assign the identifier; anyone who has
(a copy of) the object can check that it corresponds to the identifier;
and nobody can tamper with the object and get away unnoticed (compare this
with registry-based identifiers, where preventing tampering is a real
issue).
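To make the "fingerprint" idea concrete, here is a minimal sketch in Python of how a SWHID for a single file content (object type "cnt") can be computed and checked without asking any registry. It assumes the content identifier is the same SHA-1 that Git computes for a blob, which is how the SWHID specification defines it for contents; the function names are made up for this example.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute the SWHID of a file content (object type "cnt")."""
    # Same scheme as a Git blob: SHA-1 over a short header plus the raw bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def matches(data: bytes, swhid: str) -> bool:
    """Check that a copy of the object corresponds to the identifier."""
    return swhid_for_content(data) == swhid


original = b"hello world\n"
identifier = swhid_for_content(original)
print(identifier)  # swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# Anyone holding a copy can verify it; any tampering changes the identifier.
print(matches(original, identifier))          # True
print(matches(b"hello w0rld\n", identifier))  # False
```

The check only needs the bytes and the identifier, which is why no authority has to be involved in assigning or verifying it.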
When talking about retrieval, the key concern is how to access a copy of
the designated object when one has only its identifier at hand. This is a
far more complex issue, as the system of identifiers itself offers no
guarantee that (a copy of) the designated object is available, no matter
whether we use intrinsic or extrinsic identifiers. Extrinsic systems of
identifiers typically maintain a registry with a metadata record that
points to a known location where the object is stored, but cannot guarantee
that the object will actually be there (this has been well known for a
while, see https://tools.ietf.org/html/rfc3650 : “the only operational
connection between a handle and the entity it names is maintained within
the Handle System. This of course does not guarantee persistence, which is
a function of administrative care.”). For intrinsic identifiers, one can
also use a registry that keeps a link to a copy, or rely on a
comprehensive archive, or use a peer-to-peer system such as BitTorrent.
Finally, we have the issue of looking for objects that we do not know (aka
discovery). For this, we need a search engine that indexes information about
the objects (metadata) or from the object (e.g. extracted from PDF or
software source code). When a registry is already used (e.g. for extrinsic
identifiers), it may be handy to also store in it a lot of metadata, and
then use that metadata for the search engine (e.g.
https://datacite.org/search.html). But this is just one way of building a
search engine: another approach is to harvest objects and extract
information from them directly.
For more information on intrinsic and extrinsic identifiers, we invite you
to read the following blog post

* Q: How do intrinsic identifiers interact with approaches where the
metadata is brought into the project filesystem (e.g. CodeMeta)? Can
intrinsic identifiers treat the metadata and software source components as
A: Metadata that is included in the source code tree (we call it
intrinsic metadata) is part of the project, so it is included in the
SWHID calculation, like all other files.
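As an illustration of this point, the following Python sketch computes a directory-level identifier for a small source tree and shows that editing the metadata file changes it. It is a simplified sketch under stated assumptions: it relies on directory SWHIDs being computed like Git tree objects, it only handles a flat directory of regular, non-executable files, and the function names and file contents are invented for the example.

```python
import hashlib
from typing import Dict


def git_object_id(kind: bytes, payload: bytes) -> bytes:
    """SHA-1 of a Git-style object: '<kind> <length>' + NUL + payload."""
    header = kind + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).digest()


def swhid_for_flat_directory(files: Dict[str, bytes]) -> str:
    """Directory SWHID for a flat tree of regular files (name -> content)."""
    payload = b""
    # Entries are ordered by their byte-encoded names, as in a Git tree.
    for name in sorted(files, key=lambda n: n.encode("utf-8")):
        blob_id = git_object_id(b"blob", files[name])
        # 100644 is the mode of a regular, non-executable file.
        payload += b"100644 " + name.encode("utf-8") + b"\0" + blob_id
    return "swh:1:dir:" + git_object_id(b"tree", payload).hex()


# Hypothetical project: one source file plus its intrinsic metadata.
project = {
    "main.py": b"print('hi')\n",
    "codemeta.json": b'{"name": "demo"}\n',
}
before = swhid_for_flat_directory(project)

# Editing only the metadata file yields a different directory identifier.
project["codemeta.json"] = b'{"name": "demo", "version": "1.0"}\n'
after = swhid_for_flat_directory(project)
print(before != after)  # True
```

Real source trees also contain sub-directories, executables and symbolic links, which this sketch deliberately leaves out.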
Notice though that SWHIDs (which are intrinsic identifiers) can identify
different levels of granularity and each file can be identified separately
if needed. For example here is an identifier of a codemeta.json file:

* Q: Why are GitHub/GitLab not “scholarly infrastructures” for Source Code?
What is missing, what should be added?
A: When we say scholarly infrastructures, we mean infrastructures that
are developed in or for academia and are created to support research
outputs. Academia has a voice in scholarly infrastructures, either via
participation in the governance, or as a key customer. GitHub and GitLab
are used for academic projects but are not scholarly infrastructure, much
like the Web, which is definitely used massively in academia, but is not a
“scholarly infrastructure” either. Raising awareness of these differences
and the advantages of using scholarly infrastructures is also important.
* Q: Roadmap from today's practices to the optimal practices? Today most
repositories store software and do not match these expectations. Generic
metadata formats are applied (DC, DataCite). Is the approach taking such
"migration" into account? For example prop
A: Very good point. This “migration” should be taken into account when
implementing the recommendation to use the CodeMeta metadata format, which
is software-specific (and not generic) and is one of the EOSC SIRS report
recommendations.
[1] Collaborative notes SSC IG https://tinyurl.com/6tj2tu8
[2] The main slides, Morane Gruenpeter and Neil Chue Hong:
[3] SIRS report slides, Roberto Di Cosmo:
[4] European Commission. Directorate General for Research and Innovation.
(2020). Scholarly infrastructures for research software: report from the
EOSC Executive Board Working Group (WG) Architecture Task Force (TF) SIRS.
Publications Office. https://doi.org/10.2777/28598
Best regards,
Morane Ottilia GRUENPETER
Software engineer and metadata specialist
Software Heritage http://www.softwareheritage.org
@INRIA Paris
personal website: http://moranegg.github.io/