WDS/RDA Assessment of Data Fitness for Use WG
Recommendation Title: WDS/RDA Assessment of Data Fitness for Use WG Outputs and Recommendations
Group co-chairs: Michael Diepenbroek, Claire Austin, Jonathan Petters, Marina Soares e Silva
Authors: Claire Austin (Environment Canada), Helena Cousijn (DataCite), Michael Diepenbroek (PANGAEA), Jonathan Petters (Virginia Tech), Marina Soares e Silva (Elsevier), World Data System-Research Data Alliance Data Fitness for Use Working Group
Impact: A practical process by which CoreTrustSeal-certified repositories can evaluate their dataset holdings for fitness for use.
Recommendation package DOI: 10.15497/rda00034
This statement describes the background, efforts and outputs of the WDS/RDA Assessment of Data Fitness for Use Working Group. The group was chartered to develop criteria and procedures for assessing research data fitness for use, along with a means to communicate this assessment to others. It concluded with the development of a) criteria for research dataset fitness for use compared against the CoreTrustSeal requirements and FAIR principles, and b) a checklist for evaluating datasets for fitness for use, meant to supplement the CoreTrustSeal Repository Certification process. The checklist carries with it numerous caveats that exemplify the broad landscape surrounding dataset fitness assessment that this working group has mapped.
As described in our case statement, the increasing availability of research data and its evolving role as a first-class scientific output in scholarly communication require a better understanding of data quality and the means to assess it. Data quality, in turn, can be described as the conformance of data properties to data usability, or fitness for use.
Thus, this working group was formed with the goal of producing the following deliverables:
- The definition of criteria and procedures for assessment of fitness for use
- The development of a system of badges/labels communicating fitness for use of individual datasets
In the course of its efforts the working group has created the following two deliverables:
Criteria for research dataset fitness for use compared against the CoreTrustSeal requirements and FAIR principles.
Our five primary categories of dataset fitness for use criteria can be mapped to the FAIR principles as displayed in parentheses:
- Metadata completeness (R)
- Accessibility (A)
- Data completeness and correctness (R)
- Findability & interoperability (F, I)
- Curation (leading to overall FAIRness)
Through this comparison of criteria, we determined that a CoreTrustSeal-certified repository’s data holdings would meet several (but not all) aspects of dataset fitness for use. Additionally, we determined that evaluating metadata completeness and data completeness/correctness in an automated fashion is not feasible at this time. Thus, we aimed to develop a manual evaluative process for research datasets that would build on the CoreTrustSeal repository certification process.
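The mapping of the five criteria categories to the FAIR principles listed above can be written down as a small lookup table. This is only an illustrative sketch: the names `CRITERIA_TO_FAIR` and `criteria_for` are hypothetical and are not part of the working group's outputs, and mapping "Curation" to all four letters is our reading of "leading to overall FAIRness", not a statement from the group.

```python
# Illustrative encoding of the criteria-to-FAIR mapping given in the list above.
# All names here are hypothetical; they are not defined by the working group.
FAIR_LETTERS = {
    "F": "Findable",
    "A": "Accessible",
    "I": "Interoperable",
    "R": "Reusable",
}

CRITERIA_TO_FAIR = {
    "Metadata completeness": ["R"],
    "Accessibility": ["A"],
    "Data completeness and correctness": ["R"],
    "Findability & interoperability": ["F", "I"],
    # Assumption: "leading to overall FAIRness" is read as touching all four.
    "Curation": ["F", "A", "I", "R"],
}


def criteria_for(letter):
    """Return the criteria categories mapped to one FAIR letter, in list order."""
    if letter not in FAIR_LETTERS:
        raise ValueError(f"unknown FAIR letter: {letter!r}")
    return [name for name, letters in CRITERIA_TO_FAIR.items() if letter in letters]


if __name__ == "__main__":
    for letter, label in FAIR_LETTERS.items():
        print(f"{label}: {', '.join(criteria_for(letter))}")
```

Read this way, reusability (R) is touched by the two criteria that the group found could not be evaluated automatically, which is consistent with the choice of a manual process.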
A checklist for evaluating datasets for fitness for use. This checklist is meant to supplement the CoreTrustSeal Repository Certification process, and is based on the above criteria.
This manual evaluative process would be conducted by a repository manager or an external entity, such as a CoreTrustSeal repository evaluator, on a sample (6-12) of individual datasets within the repository.
The working group conducted minimal testing of this checklist through ICPSR staff and has received input from a number of repository managers through the Domain Repositories Interest Group.
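As a rough illustration of how a repository manager might pick the 6-12 datasets to evaluate, the sketch below draws a simple random sample of dataset identifiers. The source does not say how the sample should be selected, so random selection is an assumption here, and the function name `sample_datasets` and the identifier format are hypothetical.

```python
import random

MIN_SAMPLE = 6   # lower bound of the sample range given in the text
MAX_SAMPLE = 12  # upper bound of the sample range given in the text


def sample_datasets(dataset_ids, sample_size=MIN_SAMPLE, seed=None):
    """Pick datasets for manual checklist evaluation.

    Assumption: simple random sampling without replacement. The working
    group's text only gives the sample size range, not a selection method.
    """
    if not MIN_SAMPLE <= sample_size <= MAX_SAMPLE:
        raise ValueError(
            f"sample_size must be between {MIN_SAMPLE} and {MAX_SAMPLE}"
        )
    ids = list(dataset_ids)
    # A repository holding fewer datasets than requested is evaluated in full.
    k = min(sample_size, len(ids))
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(ids, k)


if __name__ == "__main__":
    # Hypothetical identifiers for a repository with 500 datasets.
    holdings = [f"dataset-{n:04d}" for n in range(500)]
    for dataset_id in sample_datasets(holdings, sample_size=8, seed=2019):
        print(dataset_id)
```

Recording the seed alongside the evaluation results would let a second evaluator reproduce the same selection.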
The checklist has numerous caveats and limitations:
- Designed as an add-on to the CoreTrustSeal repository certification process, it is not applicable to non-CoreTrustSeal-certified repositories (but may be somewhat applicable to repositories with the Data Seal of Approval).
- Because this checklist is expected to be implemented as a manual approach, it can realistically be applied only to a small sample of datasets. For repositories with a large number of datasets, we can expect the representativeness of this small sample to decrease as the heterogeneity of repository holdings increases.
- Currently we do not find it feasible to automate checks for metadata completeness and data correctness, as these require domain expertise to evaluate. Thus the knowledge and expertise of the evaluator will be important to the dataset assessment.
- This checklist does not incorporate repository data use history where it is available, even though older, more established repositories rely on this history to some extent when assessing data fitness.
- Our approach focuses on the data provider (i.e. repository manager) perspective, and thus neglects the important perspective of the prospective data user.
- Many research domains have not established data or metadata standards for reusability, which could hamper attempts to effectively assess dataset fitness in those domains.
These numerous caveats exemplify the broad landscape surrounding dataset fitness assessment that this working group has mapped.
These deliverables are provided under a Creative Commons Attribution-ShareAlike 4.0 license.
We recommend that this checklist be adopted for use as an add-on to the CoreTrustSeal repository certification process after further testing and refinement. We have had initial discussions with members of the CoreTrustSeal Board about this add-on and it is under consideration. Repository managers may also find this checklist a useful tool in helping to evaluate their holdings for fitness for use, or begin consideration of how to conduct such evaluations.
Future Work (or the steps we would have taken but we ran out of time)
The next steps that would be taken if resources were available are as follows:
- We recommend that this dataset fitness checklist evaluation process be tested and commented upon by a wider set of repository managers, possibly through a subset of the repositories of the World Data System (WDS). Following testing, we would further refine this checklist and the evaluative process.
- We have not yet created a rating or badging scheme that pairs with this dataset fitness checklist evaluation process (i.e. the second planned deliverable for this working group).
- We would include the capability for prospective data users to comment on each dataset's fitness for their use, and for these comments to be accessible to other prospective data users. This capability would allow for a form of social tagging as a way of rating datasets.
We note that the current outputs of this working group are being considered and included in the efforts of the new FAIR Data Maturity Model WG within the RDA. Further, one of this working group’s co-chairs (M. Diepenbroek) is a participant in the recently awarded FAIRsFAIR project, and anticipates following up on these efforts through this project.