10 Things for Curating Reproducible and FAIR Research

12 Apr 2022


By Limor Peer


CURE-FAIR WG

Group co-chairs: Limor Peer, Florio Arguillas, Thu-Mai Christian, Tom Honeyman, Mandy Gooch

Recommendation title: 10 Things for Curating Reproducible and FAIR Research

Authors: 

Lead Authors: Florio Arguillas, Thu-Mai Christian, Mandy Gooch, Tom Honeyman, Limor Peer (CURE-FAIR WG co-chairs)

Contributors: Erin Clary, Christopher Erdmann, Ana Van Gulick, Daniel S. Katz, Katherine E. Koziar, Wanda Marsolek, Peter McQuilton, Qian Zhang, and members of the CURE-FAIR WG

 

Impact: 

The "10 Things for Curating Reproducible and FAIR Research" offer a framework for implementing effective curation workflows that achieve greater FAIR-ness and long-term usability of research data and code. Adoption of the guidelines for curating reproducible and FAIR research will improve the prospects for a reproducible scholarly record.

The "10 CURE-FAIR Things" provide guidance on better practices for those entrusted with stewardship of the scholarly record, including repository managers, curators, and preservation and archival experts; the scholarly community, including researchers who generate and use data and code; policy-setting institutions, including research organizations, publishers, and funders; as well as others involved in the production, dissemination, and preservation of research.

The document focuses primarily on research compendia produced by quantitative, data-driven social science. It serves as a starting point for developing curatorial guidelines that extend beyond the specific concerns of the social sciences to other domains and disciplines that use similar methods, and to the particular curatorial concerns and requirements of an archive or publisher.

DOI: 10.15497/RDA00074

Citation: Arguillas, F., Christian, T.-M., Gooch, M., Honeyman, T., & Peer, L. (2022). 10 Things for Curating Reproducible and FAIR Research (Version 1.1). Research Data Alliance. https://doi.org/10.15497/RDA00074

 

This document, "10 Things for Curating Reproducible and FAIR Research," describes the key issues of curating reproducible and FAIR research (CURE-FAIR). It lists standards-based guidelines for ten practices, focusing primarily on research compendia produced by quantitative data-driven social science.

The "10 CURE-FAIR Things" are intended primarily for data curators and information professionals who are charged with publication and archival of FAIR and computationally reproducible research. Often the first reusers of the research compendium, they have the opportunity to verify that a computation can be executed and that it can reproduce prespecified results. Secondarily, the "10 CURE-FAIR Things" will be of interest to researchers, publishers, editors, reviewers, and others who have a stake in creating, using, sharing, publishing, or preserving reproducible research.

The "10 CURE-FAIR Things" are:

  1. Completeness: The research compendium contains all of the objects needed to reproduce a predefined outcome. 
  2. Organization: It is easy to understand and keep track of the various objects in the research compendium and their relationship over time.
  3. Economy: Fewer extraneous objects in the compendium mean fewer things that can break and less maintenance required over time.
  4. Transparency: The research compendium provides full disclosure of the research process that produced the scientific claim.
  5. Documentation: Information describing compendium objects is provided in enough detail to enable independent understanding and use of the compendium.
  6. Access: It is clear who can use what, how, and under what conditions, with open access preferred. 
  7. Provenance: The origin of the components of the research compendium and how each has changed over time is evident.
  8. Metadata: Information about the research compendium and its components is embedded in a standardized, machine-readable format.
  9. Automation: As much as possible, the computational workflow is script- or workflow-based so that the workflow can be re-executed using minimal actions.
  10. Review: A series of managed activities is in place to ensure continued access to and functionality of the research compendium and its components for as long as necessary.
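Thing 9 (Automation) calls for a script- or workflow-based computation that can be re-executed with minimal actions. As a minimal sketch of that idea, a compendium might expose a single entry point that runs every stage in order; the stage names, toy data, and summary statistic below are purely illustrative assumptions, not taken from the recommendation itself:

```python
# Minimal sketch of a script-based workflow (Thing 9: Automation).
# The stages, data, and statistic are hypothetical examples; a real
# compendium would read its input data and write its outputs to files.

def clean(raw_rows):
    """Stage 1: drop incomplete records from the (toy) raw data."""
    return [r for r in raw_rows if r.get("income") is not None]

def analyze(rows):
    """Stage 2: compute a prespecified summary statistic."""
    incomes = [r["income"] for r in rows]
    return {"n": len(incomes), "mean_income": sum(incomes) / len(incomes)}

def report(results):
    """Stage 3: render the result a curator would verify against the paper."""
    return f"N = {results['n']}, mean income = {results['mean_income']:.2f}"

def main():
    # One entry point re-executes the whole workflow with a single action,
    # which is what makes independent verification by a curator feasible.
    raw = [{"income": 30000}, {"income": 50000}, {"income": None}]
    return report(analyze(clean(raw)))

if __name__ == "__main__":
    print(main())
```

Running the script once (e.g., `python run_all.py`) reproduces the predefined outcome end to end, which also supports Thing 1 (Completeness) and Thing 4 (Transparency): the full path from raw input to reported result is disclosed in executable form.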

Maintenance: This output will be hosted on GitHub and maintained by the Odum Institute.


Version: 1.1

 

 

 

Output Status: RDA Endorsed Recommendations
Review period: Wednesday, 13 April 2022 to Friday, 13 May 2022
Group content visibility: Public - accessible to all site users
Primary WG Focus / Output focus: Domain Agnostic

    Author: Lars Vilhuber

    Date: 14 Apr, 2022

    Great work. I've added a few comments in the original Google Doc. https://docs.google.com/document/d/1xNQtSlsCw_eyCim-iKd13x_g-hy9FYdgPKPr...


    Author: Mandy Gooch

    Date: 02 Jun, 2022

    Hello Lars,

    Thank you very much for your comments on this output. We accidentally left the draft version open in Google Docs and have since closed it; we have therefore addressed your comments below:

    • Lars' comment - v.4 is missing here. This seems redundant with the one at the top
      • Response: Added v.4 to table at bottom of document with link to draft 4
    • Lars' comment - delete 'github' replace with GitHub
      • Response:  Accepted suggested edit
    • Lars' comment - Suggested addition of question: 'If not, for how long is the data available expected to be available?' to Thing 1: Completeness, section Get started, subsection Data, bullet point 4. 
      • Response: Accepted edit, altered text to read: 'If not, for how long is the input data expected to be available?' 
    • Lars' comment - In Introduction we state: 'This document is primarily for data curators and information professionals who are charged with verifying that a computation can be executed and that it can reproduce prespecified results. Secondarily,  it will be of interest to researchers, publishers, editors, reviewers, and others who have a stake in creating, using, sharing, publishing, or preserving reproducible research.' Lars asked: 'Why this limitation? Seems like the guidance is equally applicable to researchers as well as data curators.'
      • Response:  Altered the text to read: 'This document is for data curators and information professionals who are charged with verifying that a computation can be executed and that it can reproduce prespecified results. It will also be of interest to researchers, publishers, editors, reviewers, and others who have a stake in creating, using, sharing, publishing, or preserving reproducible research.'
    • Lars' comment: I also humbly submit https://social-science-data-editors.github.io/template_README/ (Vilhuber, Kóren, Llull, Connolly, Morrow, 2019)
      • Response: Accepted additional citation; added to Thing 5: Documentation, Learn more citations list.
    • Lars' comment: Note that this particular guide is specific to documenting DATA, not CODE. It makes no or very little mention of code. (in reference to Cornell University Research Data Management Service Group. (n.d.). Guide to writing "README" style metadata. https://data.research.cornell.edu/content/readme citation). 
      • Response: Provided more explicit language - This targets data documentation, but can serve as a model for other types of documentation (e.g., code).
    • Lars' comment: For this and above, reference the Data Citation Principles (in reference to Thing 5: Documentation, Get started, subsection Data Availability Statement bullet point 1 & 2)
      • Response: Accepted comment and added citation to Learn more references list, plus added links to bullet point 2 of Data Availability Statement to DCP and Data Citation Guidelines
    • Lars' comment on Thing 9: Automation > Learn more > Holding the computation as fixed section: 'These are all platforms that leverage containers ("docker"). Should mention that as an underlying principle, no "platform" required. Platforms just make it easier.
      • Response: Added clarification to this section to emphasize that it might make things easier, and to explain what happens when using these platforms. Text reads as: "Holding the computation environment as fixed: Platforms that make this easier

        One way to aid automation is to hold the execution environment as fixed. Fixing the execution environment (hardware, operating system, software and dependencies) increases the likelihood that code that ran in the same environment will run again without issue in the future.

        A common approach to making it easier to rerun code is to do the computation on a cloud-based service or platform. Examples of this approach include Whole Tale, Code Ocean, or MyBinder. Many of these are services built on top of JupyterHub or RStudio which encapsulate the compute environment in a container. Integrations of similar functionality are also starting to become available within journals, notably the reproducible article from eLife, which brings the reproduction of results as close as possible to the published article."

    • Lars' comment on Thing 7: Provenance > Learn More: 'Curious why DCP are not repeated here. Citations are about documenting provenance.'
      • Response: Added DCP and software citation principle/guide links to the beginning of Provenance for users to read up on those resources
    • Lars' comment on Thing 5: Documentation > Get started > Code headers and comments: 'Why should that be in the code, if it's already in the README?'
      • Response: We kept this section and clarified the necessity of it: This embeds metadata within the code files themselves so that the context is available to re-users of those files. This is necessary in case the code file(s) are ever orphaned from the original research compendium. It can also include the licensing information within the citation, which may be different from that in the README documentation.
    • Lars' comment on Thing 5: Documentation > Get Started > Data Availability Statement: 'When available'
      • Response: Accepted comment, updated language in bullet point to read: "Formal data citation that, if available, includes a persistent identifier (e.g., DOI) or other stable URL."
    • Lars' comment on Thing 4: Transparency > Get Started > Access Transparency: "Are those reasons listed? (it is not clear from the sentence that the reasons should be listed, only that instructions are provided)"
      • Response: Changed to - Are explicit instructions provided on where and how to request and access materials not included in the compendium? If they cannot be made publicly available due to licensing or sensitive data restrictions, is this detail included?
    • Lars' comment on Thing 1: Completeness > Get Started > Code, first bullet point: "Does this allow for Python code as PDF? (yes, people do that; that should not be possible under these recommendations)"
      • Response: Added language to second bullet point to address this comment: Files should be in a form ready for execution.
    • Lars' comment on Thing 1: Completeness > Get Started > Data, first bullet: "This is not an either/or. Regardless of whether the data are included, detailed information on where and how to obtain it must always be provided - that is necessary for completeness as well as transparency of data provenance. I would start this list with this item, without caveats."
      • Response: Added link to Thing 4: Transparency to address this comment. 

    Thank you again for your comments and review of this draft. We appreciate your contributions!

    Best,
    CURE-FAIR Co-Chairs
