Rich Metadata for annotation of citations contexts and data-citations contexts

25 Jan 2021

Submitted by Carlo Maria Zwölf


Meeting objectives: 

Co-proposers list (in alphabetical order):

Name | Affiliation(s)
Daan Broeder | CLARIN ERIC, KNAW/HuC
Marie Lise Dubernet | Virtual Atomic and Molecular Data Centre, Paris Observatory
Nicolas Larrousse | Huma-Num, CNRS
Fenghong Liu | Data Intelligence
Paolo Manghi | CNR - OpenAIRE
Mark Parson | CODATA Data Science Journal, ESIP Research Artifact Citation Cluster
Peter Wittenburg | Max Planck Computing and Data Facility, GEDE
Carlo Maria Zwölf | Virtual Atomic and Molecular Data Centre, Paris Observatory, GEDE

 

The Research Data Alliance, through its Data Citation Working Group and its Scholix WG, has successfully addressed the challenges of data citation and cross-referencing between classic papers and data.

In “classic” scientific papers the context of a citation is deduced from the text: a human reader can easily understand why the authors are citing a given previous work (e.g. to propose a new methodology compared with the referenced work, or because the citing work builds on some fundamental result of the cited work).

This “citation context” is lost when the bibliographic information is processed through automatic, machine-based workflows. Our aim is to provide the community with a mechanism that lets authors (of both data and papers) state the intent of a citation in a machine-actionable way.

The intention behind a citation is crucial for scientific reasons:

  • The reasons may provide a first assessment of the quality of what is cited. A dataset that is cited as « crucial » in several other works presumably has better quality than a dataset whose citations come mostly from « erratum works ». Consider, for example, the paper on the memory of water (doi: 10.1038/333816a0), which is highly cited, but many of whose citations are (of course) negative.
  • Understanding how and why their work is reused will help data producers to better fit the community's needs.
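
To illustrate the first point, here is a minimal sketch of how recorded citation intents could separate two datasets that have the same raw citation count. The intent labels ("crucial", "builds_on", "erratum", "refutes"), the identifiers and the helper names are hypothetical; the proposal does not define a vocabulary yet.

```python
from collections import Counter

# Hypothetical set of intents that signal a problem with the cited object.
CRITICAL = {"erratum", "refutes"}


def summarize_intents(citations):
    """Group (cited_id, intent) pairs into per-object intent counts."""
    summary = {}
    for cited_id, intent in citations:
        summary.setdefault(cited_id, Counter())[intent] += 1
    return summary


def critical_share(counts):
    """Fraction of citations whose intent is critical (0.0 when uncited)."""
    total = sum(counts.values())
    critical = sum(n for intent, n in counts.items() if intent in CRITICAL)
    return critical / total if total else 0.0


# Made-up example data: both datasets are cited three times.
citations = [
    ("dataset:A", "crucial"),
    ("dataset:A", "builds_on"),
    ("dataset:A", "crucial"),
    ("dataset:B", "erratum"),
    ("dataset:B", "refutes"),
    ("dataset:B", "crucial"),
]

summary = summarize_intents(citations)
for cited_id, counts in sorted(summary.items()):
    print(cited_id, dict(counts), round(critical_share(counts), 2))
```

With a plain citation count the two datasets look identical; only the intent breakdown shows that most citations of the second one are corrections or refutations.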

 

The goal of the proposed BoF is to obtain feedback, to gauge the interest of RDA members in the proposed themes, and to structure the activities of a new Interest Group. We plan to organize the group's work in two phases:

1) First, we will identify, with the contributors and members, a “taxonomy” of citation contexts (e.g. the citing element is a subset of the cited one; the citing element is a collection containing the cited one; the cited element is wrong and the citing one is the erratum; etc.).

  • We will also try to identify any additional metadata that may help to retrieve or interpret the citer's intention and to assess the value of the cited object. For this, we may cross-match several other sources, such as semantic annotations that characterize the cited object in registries (e.g. registry and repository metadata attached to the cited object).

2) In the second phase, we will provide a data model / ontology for the contexts identified in the first phase.

  • The proposed data model and the infrastructure required for processing citations and additional metadata may rely on FAIR Digital Objects (FDOs): a citation-context annotation would be an FDO containing a pointer to the PID of the citing element and another pointer to the cited element. The inner metadata of the FDO would define the context of the citation in a machine-actionable way.
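
As a rough illustration of the idea in phase 2, the sketch below models a citation-context annotation as a small record with two PID pointers and a context label, serialized to JSON. It is not an FDO implementation: the class names, field names, context labels and placeholder PIDs are all assumptions made for this example, and the real taxonomy and data model are exactly what the group would have to define.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum


class CitationContext(Enum):
    # Hypothetical labels, loosely based on the examples given for phase 1.
    SUBSET_OF = "citing_is_subset_of_cited"
    COLLECTION_CONTAINING = "citing_is_collection_containing_cited"
    ERRATUM_FOR = "citing_is_erratum_for_cited"


@dataclass
class CitationContextAnnotation:
    """One annotation: who cites whom, and why (assumed field names)."""
    citing_pid: str
    cited_pid: str
    context: CitationContext
    extra_metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize to JSON, storing the context as its string label."""
        record = asdict(self)
        record["context"] = self.context.value
        return json.dumps(record, sort_keys=True)


# Placeholder PIDs, not real identifiers.
annotation = CitationContextAnnotation(
    citing_pid="hdl:EXAMPLE/citing-0001",
    cited_pid="hdl:EXAMPLE/cited-0002",
    context=CitationContext.ERRATUM_FOR,
    extra_metadata={"asserted_by": "citing author"},
)
print(annotation.to_json())
```

Keeping the annotation as a separate object that points to both PIDs means that neither the citing nor the cited record has to be modified to carry the context.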
Meeting agenda: 

Collaborative meeting notes: https://docs.google.com/document/d/1gpajAMWl-LzMkPD2CQZ1gmaqTzhIq5B4j8xV...

 

Preliminary session program:

  • 40 min - presentation from the contributors (expression of needs, see following table)

Presentation list (in alphabetical order):

Name | Community
Marie Lise Dubernet | Atomic and Molecular Physics
Nicolas Larrousse | SSHOC
Fenghong Liu | Botanic, Data-Science
Paolo Manghi | CNR - OpenAIRE
Dieter van Uytvanck | CLARIN
Peter Wittenburg | Data-Science

  • 30 min - discussion of whether a dedicated IG is warranted

  • 20 min - preliminary organization of the new IG and census of early adopters (if no new IG is created: how can the identified needs be addressed?)

Type of Meeting: 
Working meeting
Short introduction describing any previous activities: 

The Research Data Alliance, through its Data Citation Working Group and its Scholix WG, has successfully addressed the challenges of data citation and cross-referencing between classic papers and data.

This proposal stems from our recent work. During our activity as data producers and providers in VAMDC (Virtual Atomic and Molecular Data Centre), we faced a new need: we would like to express, in a machine-actionable way, WHY we are referencing a given paper and/or datum. Even if the act of citing is technically the same, it does not carry the same meaning when we cite something to say “the cited work is very good and I’m using its results because they’re crucial here in the present work” as when we cite it to say “the cited work is wrong. Here we explain why and provide corrections”.

Since the proposal deals with some aspects of data citation, we discussed it with Andreas Rauber (chair of the RDA Data Citation Interest Group). The Data Citation IG is more interested in the « HOW » aspects (in the sense of which identifier, which type of data, …) than in the « WHY ». They have no objection to what we are proposing, and the two efforts are complementary.

We are interested in the « WHY » question mainly for scientific reasons:

  • The reasons may provide a first assessment of the quality of what is cited. A dataset that is cited as « crucial » in several other works presumably has better quality than a dataset whose citations come mostly from « erratum works ». Consider, for example, the paper on the memory of water (doi: 10.1038/333816a0), which is highly cited, but many of whose citations are (of course) negative.
  • Understanding how and why their work is reused will help data producers to better fit the community's needs.
Remote participation availability (only for physical Plenaries): 
Yes
Please indicate the breakout slot (s) that would suit your meeting. : 
Breakout 1
Breakout 2
Breakout 3
Breakout 4
Breakout 5
Breakout 6
Breakout 7
Breakout 8
Breakout 9
Breakout 10
Breakout 11
Are you willing to host a live second session to accommodate a different time zone? : 
No
Meeting presenters: 
Marie Lise Dubernet (Virtual Atomic and Molecular Data Centre, Paris Observatory), Nicolas Larrousse (Huma-Num, CNRS), Fenghong Liu (Data Intelligence), Paolo Manghi (CNR, OpenAIRE), Dieter van Uytvanck (CLARIN), Peter Wittenburg (GEDE)
How do you prefer to hold the virtual component of your session: 
live
Do any of the session speakers plan to present from the venue?: 
Remote presentations only