Challenges of Curating for Reproducible and FAIR Research Output
By Limor Peer
Group co-chairs: Limor Peer, Florio Arguillas, Thu-Mai Christian, Tom Honeyman
Supporting Output title: Challenges of Curating for Reproducible and FAIR Research Output
Authors: Limor Peer, Florio Arguillas, Tom Honeyman, Nadica Miljković, Karsten Peters-von Gehlen and CURE-FAIR subgroup 3
Impact: This report identified the current set of key challenges associated with generating, sharing, and using reproducible computation-based scientific results. The report will inform the WG’s forthcoming standards-based guidelines for curating for FAIR and reproducible research outputs.
Citation: Peer, L., Arguillas, F., Honeyman, T., Miljković, N., Peters-von-Gehlen, K., & CURE-FAIR WG Subgroup 3. (2021). Challenges of Curating for Reproducible and FAIR Research Output. Research Data Alliance. DOI: 10.15497/RDA00063
Computational reproducibility is the ability to repeat the analysis and arrive at the same result (National Academies of Sciences, Engineering, and Medicine, 2019). Computational reproducibility contributes to the preservation of a complete scientific record, verification of scientific claims, building upon the findings, and teaching. In this framework, the object of the curation is a “reproducible file bundle,” which optimally includes FAIR data and code. This output reports on the challenges of preparing and reusing materials required for computational reproducibility.
Context: The goal of the CURE-FAIR WG is to establish standards-based guidelines for curating for reproducible and FAIR data and code (Wilkinson et al., 2016). The final deliverable of the WG is a document outlining CURE-FAIR standards-based guidelines for best practices in publishing and archiving computationally reproducible studies. To support the deliverable, the WG created four subgroups, each tasked with studying and summarizing a particular aspect of this work: CURE-FAIR Definitions, Practices, Challenges, and Alliances.
Objective: The goal of this output is to provide a review of the literature and collected use cases, stories, and interviews with various stakeholders (researchers, publishers, funders, data professionals, IT, repositories) who are trying to reproduce computation-based scientific results and processes. We believe that this report is an accurate and comprehensive survey of the current state of CURE-FAIR. We plan to complement this output with a report examining curation practices and their alignment with FAIR principles as currently implemented by various organizations. Our ultimate objective is to improve FAIR-ness and long-term usability of “reproducible file bundles” across domains.
Request for comment: We invite the RDA community to review and comment on the CURE-FAIR WG output as part of the open process for endorsement and recognition by RDA. Comments are welcome and should be made no later than May 21st 2021.
Please note that Versions 1.0 and 1.1 of the Output underwent review by the WG. Version 2.0 is based on these comments and underwent RDA community review. Version 2.1 is the final version and was created after the community review.
Attachment: Challenges of Curating for Reproducible and FAIR Research Output - Output Card.pdf (935.65 KB)
Author: Sarah Davidson
Date: 03 May, 2021
Thanks very much for the huge effort you've put into this and to sharing the draft and giving the opportunity for comments!
I think it is important to clarify that while reproducibility is a type of reuse, a focus on exactly reproducing specific published analyses can directly conflict with enabling other types of re-use. Given the resources needed to advance this work across disciplines, I would start by defining what types of re-use are being asked for by user communities. The document seems to treat reproducibility as largely synonymous with reuse. However, in ten years of curating a research data platform that promotes FAIR data, no one has ever come to me with questions about reproducing a published analysis. Rather, they regularly want to apply similar methods across different data, or aggregate data for meta-analysis. These are re-uses of the data for different purposes that can advance knowledge and are uniquely enabled by FAIR data. In my opinion, a major problem with archiving data underlying specific analyses/papers is that the same data records from a full dataset are commonly re-published multiple times, it may be difficult or impossible to identify this duplication, and across a research career the owner is never asked to publish the full dataset as one unit, because this exceeds the mandate of journals (concerned with individual papers) or funders (concerned with individual grants). Re-usability of the full dataset therefore is not supported by the archived data. Worse, attempting to aggregate data with many overlapping data records will require significant preprocessing that many users may not realize is needed or know how to do, and skipping it could lead to meaningless or false conclusions.
Likewise, in the case of software/code, I am doubtful that most re-users of software/code care about exact reproducibility of an existing result. How often are researchers or practitioners asking for the ability to replicate the exact computing environment used in a published analysis? What are the most common reasons users provide for wanting access to published software/code? Robust research results should be reproducible using state-of-the-art software and computing power; in some cases, results will be improved on or corrected. In this case, and given many of the challenges you describe in the report, one way forward is to consider what best practices become apparent if you focus on the goal of publishing software or code and related metadata as documentation of the analytical approach of a published analysis, and on providing specific examples of commented script that allow users to apply these approaches to new use cases, using current software and computing environments, dependencies, etc. To me this way of thinking through the problem helps to delineate a more manageable goal and is in line with the reality of the speed of technology development. Pieces of this message appear in the document but I think could be made clearer.
I realize some of these comments might be outside the scope of feedback you're looking for at this point, but in any case I hope it's helpful for this or future efforts. I'll keep an eye out for related RDA WGs.
Author: Limor Peer
Date: 18 May, 2021
Thank you for sharing your thoughts and providing us with excellent feedback on the report. You raise valid points and highlight areas that can be clarified and we will incorporate your comments in our next draft.
We want to respond here to the spirit of your comments about the issues that need to be considered when focusing on “exactly reproducing specific published analyses.”
First, we’re in agreement that a good place to start is to ask communities what types of re-use they are looking for. We acknowledge disciplinary differences and try in our report to reflect the challenges of various domains. When it comes to writing the guidelines for CURE-FAIR (the final deliverable), it might make sense to start by focusing on domains where this is quite well defined. For example, some social science journals (e.g., AEA, AJPS) always request verification that results can be reproduced. It’s a condition for publication. We can highlight the practices of institutions and organizations that support these requirements and then look for applications in other disciplines. In that vein, we would argue that some re-users of software/code also care about exact reproducibility of an existing result. In addition to publishers, other stakeholders might want to check the reproducibility of results. For example, if it becomes known that the code had an error that resulted in papers with incorrect results, scholarly associations, domain archives, or even research integrity oversight bodies might want to look into the results (we’re thinking of problems such as a recent case in computational chemistry in which there were questions about the software used to calibrate and process images https://arstechnica.com/science/2021/02/scientific-community-on-report-of-a-strange-chemical-at-venus-probably-not/ ).
Second, we don’t think that reproducing a specific result (and archiving the materials that generate the result) is inherently in conflict with other types of re-use. The goal should be to work with, archive, and share the complete data. We recommend including the raw FAIR data in the reproducible file bundle, or referencing it if access is restricted, and enabling computational reproducibility on top of that. Depending on the discipline and data characteristics, in some cases it may make sense to include the original data in the file bundle, and in others to provide instructions on where and how to access it. A readme file can state where to download the files and exactly where to place them in the reproducibility package so that the code runs and reproduces the results exactly. We agree that efforts should be made to include documentation about links to and versions of the data to avoid duplication. This will require more education of researchers and curators, and perhaps better guidelines from journals on what exactly they require for reproducibility.
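As a minimal sketch of the readme-driven approach described above, the script below checks that externally hosted data files have been placed where the analysis code expects them and that their contents match recorded checksums. The file path, the EXPECTED_FILES manifest, the function names, and the choice of SHA-256 are all hypothetical illustrations and are not taken from the CURE-FAIR report.

```python
from __future__ import annotations

import hashlib
from pathlib import Path

# Hypothetical manifest: path inside the package -> expected SHA-256 digest.
# In practice these values would be copied from the package's readme file.
EXPECTED_FILES = {
    "data/raw/survey.csv": "<sha256 digest listed in the readme>",
}


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bundle(root: Path, expected: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every file is in place."""
    problems = []
    for relative_path, expected_digest in expected.items():
        target = root / relative_path
        if not target.is_file():
            problems.append(f"missing: {relative_path}")
        elif sha256_of(target) != expected_digest:
            problems.append(f"checksum mismatch: {relative_path}")
    return problems


if __name__ == "__main__":
    # Run from the top-level directory of the reproducibility package.
    for problem in check_bundle(Path("."), EXPECTED_FILES):
        print(problem)
```

Running a check like this before the analysis code would tell a re-user whether a downloaded file is missing or differs from the version the authors used, which also speaks to the point about documenting data versions.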
Third, thanks for the comment about documentation and metadata of the code. It highlights the potential re-use of the code to further science, and the importance of comments in the code detailing the methods and analytic process. Not only should comments help map code to sections of the paper (or to reproduced results), but they should also talk re-users through the application of the methods and the analytic thought process. However, too many comments can clutter the code, so finding the right balance (just enough comments) is important. Often few comments are needed, because the code itself describes the process. This will be important to include in the guidelines.
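To illustrate the kind of commenting we have in mind, here is a short sketch in which comments tie the code to a paper and explain a methods choice, while the code itself stays readable. The table number, section number, and function name are hypothetical and do not refer to any real paper.

```python
import statistics


def table_2_summary(scores):
    """Reproduce Table 2 (hypothetical): mean and standard deviation of scores."""
    # Section 3.1 (hypothetical): the paper reports the sample standard
    # deviation (n - 1 denominator), which is what statistics.stdev computes.
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
    }
```

Only the link to the paper and the one non-obvious methods decision are commented; the rest is left to the code.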
Florio, Tom, and Limor
Author: Heather Brindley
Date: 21 May, 2021
Hello, I have just joined and was browsing the requests for input. I'm not sure whether you have considered the approach below in previous discussions. I think it will work for your topic because software code is a form of data. If you find the material interesting I'm happy to provide more detail and other references.
Brown, W.J., McCormick, H.W., & Thomas, S.W. (1999). Antipatterns and patterns in software configuration management. New York: J. Wiley.
Author: Limor Peer
Date: 01 Jul, 2021
Thank you very much for your comment. I respond here on behalf of subgroup 3 of the WG: First, we will update the report in light of the current work of the FAIR4RS WG (subgroup 3: Definition of research software), addressing your point about the differences and similarities between software and data. Second, we appreciate the reference that you suggest and its message that learning should draw on both good and bad examples (i.e., patterns and antipatterns). We used the same approach when reporting current practices in order to identify the challenges of curating for reproducible and FAIR research output. In our report, in addition to the perspectives of software engineers, developers, architects, and project managers, we also aimed to include the experiences of librarians, data curators, researchers and students, decision makers, etc. We highlight the motivation to repeat the analysis and arrive at the same result, rather than only conformance to software standards and models (e.g., ISO, IEEE), which is the main focus of software configuration management.