Challenges of Curating for Reproducible and FAIR Research Output


20 April 2021


By Limor Peer


CURE-FAIR WG

Group co-chairs: Limor Peer, Florio Arguillas, Thu-Mai Christian, Tom Honeyman

Supporting Output title: Challenges of Curating for Reproducible and FAIR Research Output

Authors: Limor Peer, Florio Arguillas, Tom Honeyman, Nadica Miljković, Karsten Peters-von Gehlen and CURE-FAIR subgroup 3

DOI: 10.15497/RDA00063

Citation: Peer, L., Arguillas, F., Honeyman, T., Miljković, N., Peters-von Gehlen, K., & CURE-FAIR WG Subgroup 3. (2021). Challenges of Curating for Reproducible and FAIR Research Output. Research Data Alliance. https://doi.org/10.15497/RDA00063

 

Computational reproducibility is the ability to repeat an analysis and arrive at the same result (National Academies of Sciences, Engineering, and Medicine, 2019). It contributes to the preservation of a complete scientific record, the verification of scientific claims, building upon published findings, and teaching. In this framework, the object of curation is a “reproducible file bundle,” which optimally includes FAIR data and code. This output reports on the challenges of preparing and reusing the materials required for computational reproducibility.
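To make the idea concrete, here is a purely illustrative sketch of how such a bundle might be organized; the file and directory names below are hypothetical examples, not a layout prescribed by the WG:

    README            - overview, citation, and step-by-step reproduction instructions
    LICENSE           - terms of reuse for the data and code
    data/raw/         - original input data, unmodified
    data/processed/   - derived data produced by the scripts
    code/01_prepare   - commented script documenting each data-processing step
    code/02_analysis  - script that regenerates the reported tables and figures
    environment/      - record of software versions and dependencies (for example, a requirements file, lock file, or container recipe)
    results/          - the outputs the scripts are expected to regenerate

A bundle organized along these lines gives a curator and a future re-user a single entry point (the README) and makes explicit which files are inputs, which are code, and which outputs the code should reproduce.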

Context: The goal of the CURE-FAIR WG is to establish standards-based guidelines for curating for reproducible and FAIR data and code (Wilkinson et al., 2016). The final deliverable of the WG is a document outlining CURE-FAIR standards-based guidelines for best practices in publishing and archiving computationally reproducible studies. To support the deliverable, the WG created four subgroups, each tasked with studying and summarizing a particular aspect of this work: CURE-FAIR Definitions, Practices, Challenges, and Alliances. 

Objective: The goal of this output is to review the literature and the use cases, stories, and interviews we collected from various stakeholders (researchers, publishers, funders, data professionals, IT staff, repositories) who are trying to reproduce computation-based scientific results and processes. We believe this report is an accurate and comprehensive survey of the current state of CURE-FAIR. We plan to complement it with a report examining curation practices, and their alignment with the FAIR principles, as currently implemented by various organizations. Our ultimate objective is to improve the FAIRness and long-term usability of “reproducible file bundles” across domains.

Request for comment: We invite the RDA community to review and comment on the CURE-FAIR WG output as part of the open process for endorsement and recognition by RDA. Comments are welcome and should be made no later than 21 May 2021.

(Please note that Versions 1.0 and 1.1 of the Output underwent community review, and that Version 2.0 is the final version of the Supporting Output based on these comments.)

Output Status: Supporting Outputs under community review
Review period: Wednesday, 21 April 2021 to Friday, 21 May 2021
Group content visibility: Public - accessible to all site users
Primary WG Focus / Output focus: Domain Agnostic

    Author: Sarah Davidson

    Date: 03 May, 2021

    Thanks very much for the huge effort you've put into this, and for sharing the draft and giving the opportunity for comments!

    I think it is important to clarify that while reproducibility is a type of reuse, a focus on exactly reproducing specific published analyses can directly conflict with enabling other types of re-use. Given the resources needed to advance this work across disciplines, I would start by defining what types of re-use are being asked for by user communities. The document seems to treat reproducibility as largely synonymous with reuse. However, in ten years of curating a research data platform that promotes FAIR data, no one has ever come to me with questions about reproducing a published analysis. Rather, they regularly want to apply similar methods across different data, or aggregate data for meta-analysis. These are re-uses of the data for different purposes that can advance knowledge and are uniquely enabled by FAIR data. In my opinion, a major problem with archiving data underlying specific analyses/papers is that the same data records from a full dataset are commonly re-published multiple times, it may be difficult or impossible to identify this duplication, and across a research career the owner is never asked to publish the full dataset as one unit, because this exceeds the mandate of journals (concerned with individual papers) or funders (concerned with individual grants). Re-usability of the full dataset is therefore not supported by the archived data; worse, attempting to aggregate data with many overlapping records requires significant preprocessing that users may not realize is needed or know how to do, which can lead to meaningless or false conclusions.

    Likewise, in the case of software/code, I am doubtful that most re-users care about exact reproducibility of an existing result. How often are researchers or practitioners asking for the ability to replicate the exact computing environment used in a published analysis? What are the most common reasons users give for wanting access to published software/code? Robust research results should be reproducible using state-of-the-art software and computing power; in some cases, results will be improved upon or corrected. In this light, and given many of the challenges you describe in the report, one way forward is to consider what best practices become apparent if you focus on publishing software or code and related metadata as documentation of the analytical approach of a published analysis, and on providing specific examples of commented scripts that allow users to apply these approaches to new use cases, using current software, computing environments, dependencies, etc. To me this way of thinking through the problem helps to delineate a more manageable goal and is in line with the reality of the speed of technology development. Pieces of this message appear in the document but could be made clearer.

    I realize some of these comments might be outside the scope of feedback you're looking for at this point, but in any case I hope it's helpful for this or future efforts. I'll keep an eye out for related RDA WGs.

    Best regards,

    Sarah
