NOTE - The following Charter text has been revised, see the attached document - 29 Jan 2018



RDA Interest Group Draft Charter Template
Name of Proposed Interest Group: Preservation Tools, Techniques, and Policies

Introduction (A brief articulation of what issues the IG will address, how this IG is aligned with the RDA mission, and how this IG would be a value-added contribution to the RDA community):

The Preservation Tools, Techniques, and Policies (PTTP) IG provides a forum to bring together domain researchers, data and informatics experts, and policy specialists to discuss such issues as:

  • What data/software/artifacts/documentation (hereafter referred to as “knowledge products”) should be preserved for sharing, re-use, and reproducibility for a given research domain? For other domains?
  • What tools are available for researchers to preserve these elements in a manner that does not obstruct or hinder their research?
    • What are the strengths and weaknesses of these tools?
    • Are there common features that could allow tools from one domain to be re-used elsewhere?
    • Are there tools that archives/repositories could provide that could make preservation much easier for researchers?
    • What are the longer-term development goals of each of these tools?
  • What preservation policies exist, imposed by government agencies, publishers, or other actors? How are they changing? How are they implemented? What are their strengths and weaknesses?
  • How can preservation policies be implemented in a way that aids research both now and in the future?
    • How does this depend on the tools provided?

Through the course of these discussions, the PTTP IG acts to strengthen the dialogue between domain researchers and the data community by focusing on how researchers are enabled to use previously generated results and to preserve new ones. This enhanced engagement amplifies the voice of the research community within the fabric of RDA. The additional focus on policy considerations, which are by nature nation-, agency-, and organization-specific, serves to illustrate the means by which research preservation can be encouraged (or required) and the implications of these policy decisions.

Given that one must preserve knowledge products before one can (usefully) share them, the mechanisms by which this preservation happens are primarily in the hands of the researcher and should be a critically important element of the mission of the RDA. The quality of the data and of the information relevant to their creation can only be guaranteed by the researcher who produces the data. Thus, it is in the RDA’s best interest to consider this an integral part of its progress.

This group has obvious synergies with the Reproducibility IG, the Provenance IG, the Active Data Management Plans IG, and the Preservation e-Infrastructure IG, among others. It is entirely complementary in charter and focus to the existing Preservation e-Infrastructure IG.

User scenario(s) or use case(s) the IG wishes to address (what triggered the desire for this IG in the first place):
Largely absent from formal RDA deliberations thus far have been discussions of how researchers can interact with repositories in order to preserve their findings. Most researchers do not consider preservation part of their research workflow and, when confronted with an unfamiliar repository interface for data ingestion, are unable to provide the information required. They do not, as a matter of course, use tools that automatically generate the metadata and other information necessary for preserving the knowledge behind their research results. In fact, for many researchers, the situation is represented by the (re-purposed) familiar cartoon, below:

The PTTP IG will operate in the “overlap space” that is (somewhat unfairly) represented as empty in the above figure. The IG will explore how knowledge preservation is currently being done, what tools exist, what are their strengths and shortfalls, and how policy considerations are (if at all) driving preservation strategies, preservation tool development, and preservation tool adoption. These discussions are extremely urgent given the impending implementation of “open data” policies from all US funding agencies and the corresponding move in the EU in this same direction. The knowledge preservation tools for most researchers are either inadequate or woefully under-adopted. This clash between the research enterprise and policy can only be resolved with discussions between the primary stakeholders. RDA is the only global forum that currently provides an opportunity for these discussions. By encouraging increased participation by domain researchers in these important discussions, the PTTP IG can make significant contributions to solving one of the most important issues around data and research.


Objectives (A specific set of focus areas for discussion, including use cases that pointed to the need for the IG in the first place. Articulate how this group is different from other current activities inside or outside of RDA.):

Following from the issues listed above, the PTTP IG will:

  • Catalogue available preservation tools, including capabilities, compatibilities, and rates of adoption, and make this information available to researchers and archivists. This catalogue will serve as a basis for discussion of tool development and deployment in order to better meet the needs of diverse research communities.

  • Undertake outreach activities, including holding sessions at RDA plenaries and conducting outside workshops to engage researchers and archivists in preservation tool specification and, potentially, adoption.

  • Engage related RDA IGs: Metadata, Provenance, Reproducibility, Active Data Management Plans, Preservation e-Infrastructure; and domain-specific IGs in a wider dialogue around preservation needs, tools, and the specifications thereof.

  • Survey preservation policies across countries, funding agencies, and research areas to assemble a comprehensive view of researcher and archive responsibilities.

  • Engage representatives of funding agencies, either at RDA plenaries or at other workshops, in order to involve them in the details of these discussions.


Participation (Address which communities will be involved, what skills or knowledge should they have, and how will you engage these communities. Also address how this group proposes to coordinate its activity with relevant related groups.):

The PTTP IG seeks to connect domain researchers with data scientists and data-handling professionals in order to improve, first, the communication between them, and second and more important, the tools the researchers use to preserve the knowledge inherent in their research results. If this is taken as a primary goal, the communities involved are quite broad: all researchers on one hand, and all those whose goal is to provide the platforms for preservation and access on the other. This second community currently makes up the main constituency of RDA, and its members are already quite engaged. Emissaries of the research community are also enthusiastic RDA members. The PTTP IG will prevail upon this smaller cadre for outreach (or in-reach) to other researchers who should be involved in the discussions. Several means will be used to achieve additional in-reach or outreach to broader communities, including attendance at scientific society meetings or at domain-specific conferences. Additional, smaller workshops outside of RDA plenaries may be a more targeted way to attract additional participation and dialogue. Extensive networks of interested researchers exist; the task is to bring more of them into the conversation.

In terms of coordination, members of the proposed IG have already coordinated two joint sessions at the 8th RDA Plenary in Denver with two of the related groups, the Reproducibility IG and the Active Data Management Plans IG. IG leaders from the Provenance IG and several of the discipline-specific groups were also attendees at those sessions. Thus, coordination is already occurring. Maintaining open lines of communication and combining meeting sessions when appropriate should be relatively straightforward.

Outcomes (Discuss what the IG intends to accomplish. Include examples of WG topics or supporting IG-level outputs that might lead to WGs later on.):

A. As mentioned above, the PTTP IG intends to produce two catalogues/reviews:

  1. A taxonomy of preservation tools, including their features, strengths and weaknesses, and their rates of adoption in various research domains

  2. A survey of open access policies worldwide

B. A primary outcome of this IG will be better communication between the data scientist/archivist realms and that of the domain researcher

C. A primary outcome of this IG will be wider adoption of preservation tools by domain researchers

D. A primary outcome of this IG will be the delineation of desirable characteristics and features of scientific preservation tools for future tool development

A potential WG project would be to choose a pilot research domain with no good preservation tools and to solve that particular problem in a manner that isn’t completely domain-specific.


Mechanism (Describe how often your group will meet and how will you maintain momentum between Plenaries.):

The PTTP IG will have meetings every month to six weeks, partially reflecting on issues raised by previous plenary meetings, and planning for the next round of plenary sessions. Significant attention will also be devoted to accomplishing the goals of the IG, namely greater researcher participation and the development of the proposed catalogues of tools and policies. Planning of additional workshops, when appropriate, will certainly keep the level of engagement high.


Timeline (Describe draft milestones and goals for the first 12 months):

1. Organize at least one breakout session at the 9th RDA Plenary in Barcelona, including researchers who have not previously attended RDA. a. Timeline: months

2. Compile preliminary list of available preservation tools a. Timeline: 3 months

3. Compile preliminary tool taxonomies (features, strengths, weaknesses, adoption) a. Timeline: available for 10th Plenary (12 months)

4. Plan for Preservation policy discussion at 10th Plenary  a. Timeline: begin discussions in Barcelona (6 months), detailed planning finished: 9 months

Potential Group Members (Include proposed chairs/initial leadership and all members who have expressed interest):   Visible in the attached PDF

Review period start:
Monday, 9 January, 2017


Small Unmanned Aircraft Systems (sUAS) are rapidly becoming important tools for data capture across many scientific domains, as well as within commercial industry. sUAS have the potential to transform how data are captured in many arenas by offering higher temporal and spatial resolutions, less impact on the environments being monitored, and access to new locations and parameters. In many cases these advantages are accompanied by lower costs and increased human safety during data capture.


As a new technology, however, sUAS currently have no industry-wide accepted best practices for sensor and flight data handling and management. There are many reasons why such practices would be beneficial, but three are of particular note:

(1) The creation of standards would lower the barrier to entry and innovation in terms of what might be monitored with sUAS, by reducing the number of unknowns a new user faces and providing working examples to serve as guides.

(2) With no common standards to build to, the development of mature tools for processing sUAS-captured data and fusing it (with sUAS and other data sources) is currently hampered. As a consequence, each use case generally develops a unique custom pipeline that only sees one-time use.

(3) sUAS captured data is - for the most part - not being managed according to data stewardship best practices, such as would ensure the data is FAIR, as articulated by Force11 (Findable, Accessible, Interoperable, and Re-usable).  


This interest group therefore seeks to explore and publish (via the RDA community-based working group model) some best practices for the handling of sUAS-captured sensor and flight data. By publishing these after a broad, cross-community engagement process, it is hoped and expected that they will see adoption both by those already using sUAS for scientific work and by those just beginning to explore their possibilities. They will therefore address the three concerns laid out above, with the associated positive consequences for the scientific community. These outcomes also align directly with the RDA’s Vision and Mission focus, namely, promoting the open sharing of data.


User scenario(s) or use case(s) the IG wishes to address

Many examples could be listed here; the following three are selected solely for the broader context they represent:

(1) It is possible to place a temperature sensor on a sUAS. However, there is currently no other equivalent (spatially or temporally) example of capturing temperature data. It is therefore left to each individual researcher to create a sampling protocol, to select a data storage format, to determine which of the many possible metadata parameters are worth storing, to develop a tool for processing the captured data for integration with other data sets, and finally to choose how to publish the captured data and with what metadata.

(2) It is currently a non-trivial task to use a sUAS to capture data in the field (generally one that requires at least a team including members with electrical, computational, and mechanical engineering expertise, along with the target science expertise). As a result, a new industry is evolving that is able to provide many of the desired data products to a researcher for a fee. If standard practices existed, these providers would, firstly, be able to utilise them where advantageous to their own models. Secondly, researchers would be able to require that the commercial providers adhere to them, so as to ensure that good open data stewardship practices are upheld.

(3) As indicated above, it is currently a non-trivial task to use a sUAS-based sensory system. However, in addition to the industry avenue, and thanks to the long-standing hobbyist Remote Control market, there is already a highly sophisticated and very mature fully open sUAS stack that is also available to researchers. While already mature in fundamental function, this stack is immature in terms of usability and science use case features. It therefore still requires much of the above-mentioned expertise to be successfully utilised. However, many of these remaining challenges could be removed or overcome if the appropriate common standards were in place for developers to build to.



  1. Provide a venue for data standards and recommendations comparisons with oceanographic AUVs, and other similar platforms.
  2. Identify common and divergent data needs across sUAS implementations in different domains.
  3. Identify a community aggregation point for others in the field who are currently isolated.
  4. Identify community partnerships, including with industry, tech companies/manufacturers, and computing organizations and infrastructures.
  5. Provide a venue for ongoing community discussion around the legalities, logistics and opportunities governing sUAS use, given that sUAS are a relatively new data collection platform.



Within RDA:

Agricultural Data IG, Geospatial IG, Metadata IG, Marine Data Harmonization, Vocabulary Services IG, Weather Climate and Air Quality IG


External to RDA:

Earth Science Information Partners (ESIP): This group will be closely linked with the Earth Science Informatics community through joint development (and continued) collaboration with the Federation of Earth Science Information Partners (ESIP). The Drone Cluster (chaired by Lindsay Barbieri and Jane Wyngaard) provides ample opportunity to work closely with Earth Science data practitioners from NASA, NOAA, USGS, USDA and other major sUAS research organizations. Sessions at biannual meetings and monthly telecons have set the stage for collaborative work and can continue to attract sUAS user interest both from the researcher and data practitioner perspective. Additionally, previous collaborations between the ESIP Drone Cluster and the ESIP Education Workgroup have already resulted in sUAS-use education for K-12 teachers and further workshops for education and implementation activities could be developed.


The following is a list of groups with whom Wyngaard and Barbieri have been in contact and that have shown interest in helping to develop further data and metadata standards and community working relationships:

  • AgGateway: Consortium of over 300 agricultural industry partners (including sUAS companies) for the development of agricultural industry standards. Barbieri has attended their annual meeting, presented during their geospatial working group session, and has garnered interest and support from their UAS precision agriculture community.
  • UAViators, Humanitarian UAV Network: With over 2,500 members in 80+ countries they promote the safe, coordinated and effective use of UAVs for data collection and cargo delivery in a wide range of humanitarian and development settings by developing and championing international guidelines for the responsible use of UAVs. Barbieri has connected with Patrick Meier (director), and had him speak at an ESIP meeting and garnered interest for the continued discussion and community development of UAS data standards.
  • The American Geophysical Union (AGU): Members of the AGU are currently discussing formalizing a UAS Focus Group, or more formalized UAS in Earth Sciences working group. Barbieri has been in communication with them and garnered interest and support for collaboration between AGU Focus Group and an RDA IG.


Other organizations we intend to reach out to, with whom we’ve had some communication and collaborative ties, but no direct explicit RDA IG communication yet:




Outcomes (Discuss what the IG intends to accomplish.  Include examples of WG topics or supporting IG-level outputs that might lead to WGs later on.):

  1. Provide a discussion venue for sUAS use within many disciplines to distill current data and metadata uses and needs, with a final report on current practices and identified gaps.
  2. Provide a list of recommended data formats for a relevant range of parameters.
  3. Provide a list of recommended metadata formats for a range of relevant parameters.
  4. Provide a recommended parameter naming convention to be used.
  5. Provide a recommended file naming convention to be used.
  6. Provide an international and transdisciplinary community platform for continued discussion, development, and implementation of sUAS data recommendations.



Mechanism (Describe how often your group will meet and how will you maintain momentum between Plenaries.):

  • Regular telecons, potentially subdivided into relevant sections, and as frequent as is relevant for each. For instance, initially there may need to be a weekly telecon for those interested in the broad goal and contributing new insights. This might fade to a monthly telecon. Simultaneously, there may need to be a weekly telecon for those interested in and focused on organising the first kickoff session. Post P9 this may convert into a weekly telecon focused on spinning off a working group.
  • Within the USA, the ESIP Drone Cluster will support biannual meetings in January and July each year. It is hoped that equivalent local meetings will develop in Europe and elsewhere.
  • The Interest group may potentially support the submission of proposals where the goals of such align with those of this Interest Group.
  • Active documentation of IG activity through use of the Open Science Framework, RDA website, or other web-based project management tool, and possible ongoing collaboration through Slack or other online host.


Timeline (Describe draft milestones and goals for the first 12 months):

  • Hold a kick-off session at P9 in April 2017 that sees contributions from as many relevant sectors as possible (sUAS manufacture and data collection-processing industry, various academic and non-academic current sUAS users, data practice experts, RC hobbyist sUAS community members, and experts from relevant analogous fields).
  • Post P9, host continued community discussions to develop a 3 year strategic plan for the sUAS RDA IG, including targeting a specific goal to address via a working group by the end of the first 12 months.
  • Conduct a survey of sUAS users and leaders from a variety of disciplines and sectors to draft a report on current sUAS data and metadata practices and to identify the gap between current practices and ideal data and metadata needs, with the goal of publishing this report and hosting a follow-up workshop.


Potential Group Members (Include proposed chairs/initial leadership and all members who have expressed interest):

Name                  Title                               Institution
Jane Wyngaard         Data Technologist                   University of Notre Dame
Lindsay Barbieri      Doctoral Student                    University of Vermont Gund Institute
Rob Stevenson         Associate Professor of Biology      University of Massachusetts Boston
Cynthia Parr          Technical Information Specialist    United States Department of Agriculture
Vanessa Raymond       Graduate Research Assistant         Geographic Information Network of Alaska
Bill Teng             Programme Manager                   National Aeronautics and Space Administration
Karen Anderson        Associate Professor                 Exeter University
Adam Steer            Earth systems data specialist       National Computational Infrastructure
Charles Vardeman II   Professor                           University of Notre Dame
Lance Christensen     Researcher                          Jet Propulsion Laboratory
Sean Barberie         Data Scientist                      University of Alaska Fairbanks
Stephen Gray          Senior Research Data Librarian      University of Bristol



Review period start:
Wednesday, 21 December, 2016


Increasing the availability of research data for reuse is in part being driven by research data policies, and the number of funders, journals, and institutions with some form of research data policy is growing. The research data policy landscape of funders, institutions, and publishers is, however, too complex (Ref:), and the implementation and implications of policies for researchers can be unclear. While around half of researchers share data, their primary motivations are often to carry out and publish good research and to receive renewed funding, rather than to make data available. Data policies that support publication of research need to be practical and seen in this context to be effective beyond specialist data communities and publications.


Use cases and user scenarios

The prevalence of research data policies from institutions and research funders (such as the UK research councils and European Commission) is increasing (Ref:), so publishers and editors are paying more attention to standardisation and the wider adoption of data sharing policies. The International Committee of Medical Journal Editors introduced a data sharing policy; Springer Nature is implementing a standardised research data policy framework with four standard data policy types, each with a defined set of requirements, and is encouraging adoption across all its journals (Ref:). More than 1000 journals have adopted one of these policies as of June 2017. This policy framework is available for reuse by others under a Creative Commons license but requires wider debate in the research and publishing communities. We envisage there to be common elements of research data policy shared between all stakeholders, such as support for data repositories and data citation.

Much of this work draws on earlier Jisc activity in examining the potential for a tabulation of publisher research data policies. Naughton and Kernohan (2016) (Ref:) reported that the journal data policy landscape was not at the required maturity to be comparable or indexable in this way. Jisc is therefore committed to working with publishers in supporting the standardisation of journal data policies, with an end goal of supporting machine readable policies that would be easier for researchers and research support staff to use in selecting a suitable journal for publication and in ensuring compliance with journal and funder data requirements.
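
To illustrate what a machine readable policy might enable, the following is a minimal sketch only. The field names, the `FunderRequirement` type, and the journal names are all hypothetical and are not taken from any publisher, funder, or Jisc specification; an agreed schema would be one of the outputs of the standardisation work described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalDataPolicy:
    """Hypothetical machine readable summary of one journal's data policy."""
    journal: str
    data_availability_statement_required: bool
    repository_deposit_required: bool


@dataclass(frozen=True)
class FunderRequirement:
    """Hypothetical summary of what a funder expects of a publication."""
    data_availability_statement: bool
    repository_deposit: bool


def compliant_journals(policies, requirement):
    """Return names of journals whose policy demands at least what the funder does."""
    return [
        p.journal
        for p in policies
        if (p.data_availability_statement_required
            or not requirement.data_availability_statement)
        and (p.repository_deposit_required or not requirement.repository_deposit)
    ]


# Illustrative data only; these journals and policies are invented.
policies = [
    JournalDataPolicy("Journal A", True, True),
    JournalDataPolicy("Journal B", True, False),
    JournalDataPolicy("Journal C", False, False),
]
funder = FunderRequirement(data_availability_statement=True, repository_deposit=False)
print(compliant_journals(policies, funder))  # ['Journal A', 'Journal B']
```

With policies expressed in a comparable form like this, a researcher or research support officer could filter candidate journals automatically rather than reading each policy by hand.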


Objectives and Outcomes

  • Help define common frameworks for research data policy allowing for different levels of commitment and requirements and disciplinary differences that could be agreed by multiple stakeholders

  • Identify priority areas/stakeholders where policy frameworks can be defined e.g. beginning with journal/publisher policy, then considering funder policy

  • For these prioritised areas, stimulate creation of Working Groups to:

    • Produce guidance for researchers on complying with and implementing research data policy and the tools to support compliance

  • Facilitate greater understanding of the landscape of research data policies across disciplines, institutions and learned societies

  • Increase adoption of (standardised) research data policies by all stakeholders in particular journals and publishers

  • Connect stakeholders and broaden a collective understanding of their roles and relationships in data policy implementation

The report from the RDA P9 meeting is here:

Minutes from first informal meeting of this group at RDA 8th Plenary are here:



While the focus of the policies developed by the Group would be on publishing research data, multiple stakeholders (publishers, institutions, repositories, societies, funders) will be included. Common elements of data policy likely exist for all these stakeholders and this will be explored.

The proposed group would complement the Practical Policy WG, as this proposed group has a specific focus on journals and publishing with a goal of harmonising and standardising policy. These seem to be prerequisites to, and would feed into, efforts to create machine readable and actionable policies.

The proposed group would also complement efforts aimed at publishing and citing research data, as data policy of publications should help raise awareness of both these activities.



Co-chairs will have regular conference calls (every 1-2 months) and communicate updates to group members via the RDA group mailing list and using other RDA communication resources as needed e.g. group wiki, file repository. Group members will be invited to a group/community call that will take place every 2-3 months, after an initial meeting of the group at the RDA plenary - currently scheduled for April 2017.

We will use collaborative editing tools (Google Drive etc) to rapidly share outcomes of calls, key documents and to solicit feedback from group members.



The first 6-9 months will involve further discussions with members and stakeholders to prioritise the objectives and secure support for delivering them, which might require the creation of sub-groups focused on specific tasks. We envisage our first priority to be the first listed objective, to “Help define a common framework for research data policy allowing for different levels of commitment and requirements and disciplinary differences that could be agreed by multiple stakeholders”, to support academic publishers and others in developing usable and practical research data policies. We will gather requirements in 2017 and present them to group members by September 2017.

Our goal is to evolve from an Interest Group to a Working Group for publisher/journal policy by 2018, in coordination with RDA plenary meetings.



Iain Hrynaszkiewicz, Springer Nature (group proposer)
Natasha Simons, ANDS
Simon Goudie, Wiley
TBC, Jisc

Review period start:
Monday, 19 December, 2016

WDS/RDA Publishing Data Interest Group

WDS/RDA Certification of Digital Repositories Interest Group


Assessment of Data Fitness for Use


WG Charter

The increasing availability of research data and its evolving role as a first-class scientific output in scholarly communication require a better understanding of data quality and the ability to assess it. Data quality, in turn, can be described as the conformance of data properties to data usability or fitness for use. These properties are multifaceted and cover various aspects related to data objects, access services, and data management processes, such as the level of annotation, curation, peer review, and citability or machine readability of datasets. Moreover, the compliance of a data repository or data center providing datasets, for example with certification requirements, could serve as a useful proxy.

Currently, there is a fairly good understanding of how to certify the quality of a data center / repository as a whole, but there is no generally acknowledged concept for assessing the data usability (or fitness for use) of individual datasets. Some of the properties describing data usability are not available or not transparent to users, and requirements for other properties cannot be matched with standards. Furthermore, current certifications and accreditations of data repositories only allow limited conclusions on the re-usability of individual datasets. Thus, assessing the fitness for purpose and making a decision whether to reuse a dataset is not straightforward. This situation reduces the chances of shared data being reused and, in case of reuse, could decrease the reliability of research results.

Firstly, a concept of data fitness requires deciding which quality criteria are to be assessed, as well as how each of those criteria is to be weighted. The process should preferably lead to the development of a corresponding metric. Secondly, we want to find effective ways to expose and communicate this metric, e.g. by using a labelling or tagging system in which different usability levels are made explicit.

The proposed working group would work towards the following deliverables:

  • The definition of criteria and procedures for assessment of fitness for use

  • The development of a system of badges/labels communicating fitness for use of individual datasets

Criteria such as the following would be used:

  • Trustworthiness of the data centers/repositories (such as assessed through existing certifications: DSA-WDS, DIN, ISO 16363 etc.)

  • Data accessibility in terms of discoverability, openness, interoperability etc.

  • Level of curation applied  (citability, metadata completeness, data harmonization, machine readability etc.)
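
As a rough illustration of how weighted criteria could be combined into a metric and mapped to a label, consider the following minimal sketch. The criterion names loosely follow the list above, but the weights, the 0 to 1 scoring scale, the thresholds, and the label names are all assumptions made for illustration; defining them is precisely the work the group proposes to undertake.

```python
# Hypothetical weights per criterion; the charter does not define these.
WEIGHTS = {
    "repository_trustworthiness": 3,
    "accessibility": 3,
    "curation_level": 4,
}

# Hypothetical label thresholds, checked from highest to lowest.
LABELS = [(0.8, "gold"), (0.5, "silver"), (0.0, "bronze")]


def fitness_score(scores, weights=WEIGHTS):
    """Weighted mean of per-criterion scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def fitness_label(score):
    """Map a score in [0, 1] to the first label whose threshold it reaches."""
    for threshold, label in LABELS:
        if score >= threshold:
            return label
    raise ValueError("score must be in [0, 1]")


# Illustrative dataset assessment; the values are invented.
example = {
    "repository_trustworthiness": 1.0,
    "accessibility": 0.5,
    "curation_level": 0.5,
}
score = fitness_score(example)
print(round(score, 2), fitness_label(score))  # 0.65 silver
```

A weighted mean is only one possible aggregation; a scheme in which a very low score on any single criterion caps the overall label would be an equally plausible design and is not ruled out by the text above.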


Value Proposition

The following stakeholders would benefit:

  • Researchers who deposit data can visibly improve and communicate the quality of their datasets, thereby increasing reuse and citation, which provides researchers with additional metrics showing their productivity.

  • Researchers who reuse data can more easily assess the quality of a dataset and in particular its fitness for their reuse. This makes reuse of data safer and more efficient.

  • Data centers/repositories can offer better quality data publication services, such as more transparent curation, thus increasing the overall usage of services, which in turn might improve the facility's financial base.

  • Science publishers can better integrate referenced data into the editorial process and improve the review of articles and related datasets as well as citations and cross-linking of datasets and literature as a result of more transparency about data usability.   

  • Funders can make provisions for funded data archiving and publication services in accordance with their funding requirements and expectations in terms of data fitness for use (and reuse).

Overall impacts:

  • Improved and standardized data publication services

  • Improved communication of data fitness for use

  • Improved reliability and efficiency in the reuse of research data


Engagement with existing work in the area

Data fitness for use has been addressed in the literature over the last 20 years, and the topic has received more attention with the general increase in data production. The following gives a brief overview of selected publications; it is by no means exhaustive. In 1998, Tayi and Ballou stated that the concept of data quality is relative, with quality being dependent on users and applications. Some authors concentrated on special aspects, for example the assessment of accuracy of geospatial data (de Bruin 2001) or de-duplication, which is relevant for example to data mining approaches (Christen & Goiser 2007). A further aspect is preserving the usability of sensitive data (Bhumiratana & Bishop 2009). In 2007, the OECD underlined the importance of efficiency in reusing data (OECD 2007). For example, efficient compilations of data from multiple providers require harmonized and machine-readable data, in particular for high-volume data. Correspondingly, the FAIR Data Publishing group supplies a set of principles for publishing data and emphasizes machine readability of data as one of the major challenges (Wilkinson et al. 2016). More recently, authors have also started to investigate data usability with respect to big data approaches (Jianzhong 2013). The effect of peer review on data quality and usability was stressed by Lawrence et al. (Lawrence 2011) and by an editorial in the Nature Scientific Data Journal (2016). Costello linked data fitness for use with the data publication concept (Costello 2013). Also worth noting are the ISO/IEC 25012 data quality model (ISO/IEC 2008) and the ISO 8000 Requirements for Quality Data (ISO 2009). The W3C Data on the Web Best Practices Working Group elaborated vocabularies needed to describe data quality and highlighted the importance of data provenance (W3C 2016), which, if applicable, should also include detailed information about physical samples, for example in the case of biocollections (Bishop 2016).
Finally, fitness for use of datasets should be transparent and comprehensible to users. The effectiveness of using badges or labels for this purpose was shown by Kidwell et al. (Kidwell 2016).

In addition to works published in the literature, the WG can build on a wide range of activities that are relevant to the aims and scope of the group. In particular:

  • The Working Group would operate under the umbrella of the RDA/WDS Data Publishing IG and the RDA/WDS Certification of Digital Repositories IG

  • This Working Group will follow up on the work of the RDA/WDS Data Publishing Workflows WG and assess the impact of workflows on fitness for use (Austin et al. 2016)

  • This Working Group will follow up on the work of the Repository Audit and Certification DSA–WDS Partnership WG and develop a related certification system for individual datasets

  • The Working Group would incorporate the criteria defined by the FAIR working group (Wilkinson 2016) as a starting point.

  • The Working Group will collaborate with the NIH Commons FAIR metrics group to elaborate on the FAIR criteria (NIH 2016)

  • This Working Group would incorporate the W3C data quality vocabulary to define quality processes (W3C 2016).


Work Plan

Work will be along four strands:

  1. Descriptions and definitions of data fitness criteria. In a first step we will gather literature and initiatives that have addressed the topic. To sort out ambiguities in term definitions relevant to this group, we will collaborate with the CODATA/CASRAI development of an International Research Data Management glossary (IRiDiuM) and maintain consistency with terms in the RDA Term Definition Tool (TeD-T). The selection of data fitness criteria will be put to the wider community for comment before the document is finalized.

  2. Development of a fitness for use label at the level of datasets

    1. Conceptual model

      1. Selection and evaluation/weighing of criteria with respect to the different aspects of fitness for use such as curation or accessibility

      2. Considerations for adoption by stakeholders (archives/repositories, e.g. built into workflows; science publishers)

    2. Design of label/badge

  3. Development of service components

    1. Investigate how a fitness for use concept can be integrated into current certification procedures for data centers/repositories (WDS/DSA)

    2. Investigate data centers/repositories service components

    3. Setup of a testbed of several data centers/repositories

  4. Governance and sustainability:

    1. Concept for a long-term organizational structure to operate elaborated services successfully and in a way that meets the needs of all stakeholder groups. This stream will also deliver a process through which new organizations can connect to the service.


Deliverables

  • Addition or revision of relevant terms in the IRiDiuM glossary (CODATA/CASRAI)

  • Document defining fitness for use criteria

  • Description and design of fitness for use label (badge system)

  • Concept for a certification procedure including the fitness for use aspect

  • Concept for data center/repository service components

  • Adoption plan including certifying organizations and governance

  • Manuscript for submission to a peer-reviewed journal.


Milestones

  • Fitness for use concept ready

  • Setup of a testbed with several data centers/repositories and science publishers

  • Prototype of fitness for use label available

Mode & frequency of operation

  • Telecons every 4 weeks

  • Face to face meetings during RDA plenaries and at least one additional workshop. RDA plenaries in particular will be used to engage the wider community and coordinate the work with related groups.

  • Additional meetings of subgroups working on particular deliverables including adoption





April - July 2017

  • Terminology & definition of criteria

  • Overview of criteria, for discussion at 9th plenary meeting

July - December 2017

  • Pilot assessment of criteria

  • Report on outcomes of pilot, for discussion at 10th plenary meeting

December 2017 - February 2018

  • Development/design of badge system and integration with current certification schemes

  • Guide for repositories

February - August 2018

  • Concept for integration of data repository service components. Piloting integration of badge system.

  • Governance structure and adoption plan

May 2017 - October 2018

  • Draft article for peer review

  • Submission of article to a peer-reviewed journal


Adoption Plan

Members of the proposed working group are planning to carry out a pilot during the 12-18 month timeframe in which they incorporate the insights that come out of the working group. In this pilot, a first assessment of the fitness for use of individual datasets will be carried out. This simultaneous pilot will provide the working group with important information about both the benefits and the challenges of adoption, which will make it easier for additional organizations to adopt the outcomes of the working group. The goal is that at the end of the 18 month timeframe, a first network of adopters will exist.


Initial Membership

Claire Austin (Research Data Canada, Co-Chair)

Bradley Wade Bishop (Univ. Tennessee)

Helena Cousijn (Elsevier, Co-Chair)

Michael Diepenbroek (PANGAEA, Co-Chair)

Amy Nurnberger (Columbia University Libraries)

Ingrid Dillo (DANS)

Stephane Pesant (MARUM)

Mustapha Mokrane (ICSU-WDS)

Markus Stocker (PANGAEA)

Rob Hooft (DTL)

Peter Doorn (DANS)

Christina Lohr (Elsevier)

Robert R. Downs (CIESIN, Columbia University)

Daniel Fowler (Open Knowledge International)

Martina Stockhause (WDC Climate, DKRZ)

Ian Bruno (CCDC)

Tim Smith (CERN/Zenodo)

Donna Scott (NSIDC)

Jonathan Petters (Virginia Tech)

Kathleen Gregory (DANS)



Austin CC, *Bloom T, *Dallmeier-Tiessen S, Khodiyar V, Murphy F, Nurnberger A, Raymond L, Stockhause M, Tedds J, Vardigan M, & Whyte A (2016). Key components of data publishing: Using current best practices to develop a reference model for data publishing. International Journal on Digital Libraries (IJDL), Research Data Publishing Special Issue. Pages 1-16. DOI 10.1007/s00799-016-0178-2

Bhumiratana B & Bishop M (2009) Privacy aware data sharing: balancing the usability and privacy of datasets, in: Proceedings of the 2nd International Conference on PErvasive Technologies Related to Assistive Environments

Bishop, B. W. & Hank, C. F. (2016) Fitness for Use in Data Curation Profiles for Biocollections [Presentation] American Society for Information Science and Technology Annual Meeting, October 2016, Copenhagen, Denmark

de Bruin S, Bregt A, van de Ven M (2001) Assessing fitness for use: the expected value of spatial data sets, International Journal of Geographical Information Science, v15, no5, p457-471

Christen P & Goiser K (2007) Quality and Complexity Measures for Data Linkage and Deduplication, in: Guillot FC & Hamilton HJ (eds) Quality Measures in Data Mining, Studies in Computational Intelligence pp 127-151

Costello M et al (2013) Biodiversity data should be published, cited, and peer reviewed, Trends in Ecology & Evolution, p1-8

International Renewable Energy Agency (2013) Data quality for the Global Renewable Energy Atlas – Solar and Wind

ISO (2009ff) Data quality

ISO/IEC (2008) Data quality model

Kidwell MC, Lazarević LB, Baranski E, Hardwicke TE, Piechowski S, Falkenberg L-S, et al. (2016) Badges to Acknowledge Open Practices: A Simple, Low-Cost, Effective Method for Increasing Transparency. PLoS Biol 14(5): e1002456.  

Lawrence, B., Jones, C., Matthews, B., Pepler, S. & Callaghan, S. (2011). Citation and Peer Review of Data: Moving Towards Formal Data Publication. International Journal of Digital Curation 6, 4–37

Li Jianzhong & Liu Xianmin (2013) An important aspect of big data: data usability, Journal of Computer Research and Development, v6

NIH Commons FAIR metrics group (2016) WG interim report

OECD (2007) OECD Principles and Guidelines for Access to Research Data from Public Funding

Scientific Data Journal (2016) Let referees see the data, editorial, Nature Scientific Data Journal, 3, 160033.

Tayi GK & Ballou DP (1998) Examining data quality, Communications of the ACM, v41, no2, p54-57

W3C (2016) Data on the Web Best Practices: Data Quality Vocabulary, W3C Working Group Note

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., and Baak, A. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018

Review period start:
Tuesday, 15 November, 2016 to Thursday, 15 December, 2016

WG Charter

The general objective of the Fisheries Data Interoperability Working Group (FDIWG) is to devise a global data exchange and integration framework to support scientific advice on stock status and exploitation that builds on fisheries data. The various fisheries data domains used in such scientific processes are concerned, including data collected for monitoring, control and surveillance, scientific fisheries Data Collection Frameworks, fisheries scientific observer schemes, and statistical or status & trends reporting frameworks. The proposed framework will facilitate the use of de facto, and preferably open, standards for the identification, description, mapping and publication of fisheries data supporting scientific processes.

More specifically, the fisheries Data WG will address the (minimal) metadata requirements to describe fisheries data required for supporting stock assessment and fisheries management. It will also seek to recommend global data standards for topical vocabularies, domain ontologies, and mapping rules and formats (as done for example by CF Conventions for physical and chemical parameters in oceanography).

Driven by pragmatic considerations, the working group will focus on a few selected priority needs expressed by its invited participants, ranging from filling gaps in selected schemes to applying best practices across schemes, through issues of data transformation and harmonization among schemes. In terms of functionality and data types, the WG will identify several use cases describing realistic scenarios to produce and test fisheries data work-flows. The WG recommendations will be captured as a set of best practices.

By including interoperability experts, organisations with standardization initiatives, and standardization bodies, the WG will have key actors to reflect on and propose future governance of the data framework. The focus of this governance is the efficient delivery of interoperability guidelines.

To organize the collaboration and involvement from the community, the WG co-chair on Fisheries data structures will oversee the activity of two topical sub-groups, with one co-chair responsible for the formulation of a framework for structured fisheries data exchange (data structures), and another co-chair responsible for fisheries geospatial explicit data.

To achieve these objectives, the WG will:

  1. Promote existing facilities for data sharing on capture, landing, effort, size classes, VMS and production through sharing of structural data definitions. This promotion will be supported by demonstrations of live examples of data sharing;

  2. Facilitate access to data by recommending standards such as netCDF, SDMX or UN/CEFACT and assist in adoption of tools and facilities;

  3. Recommend existing data tools: Tools for Master Data Management (MDM), database connectors, registries and other assets;

  4. Recommend Master Data Management solutions for classifications and multilingual / multi-locale data: The challenge lies in the variety of languages in which the data is stored, and locale specific data types. This requires also mapping between local classifications and regional and global ones.

  5. Connect existing data networking initiatives such as

    1. The FAO FiRMS partnership,

    2. The FAO secretariat to the Coordinating Working Party on Fishery Statistics (CWP), which combines 19 global partners such as ICES and IOTC,

    3. The Tuna Atlas initiative (tuna RFMOs, FAO, IRD), to provide examples for storing datasets from various RFMOs within a single gridded data format;

    4. The CF Conventions community, to extend the conventions for biological and fisheries data,

    5. EU / DG MARE: DCF, Integrated Fisheries Data Management Programme (FLUX)  and INSPIRE directive,

    6. The SDMX community, such as through Eurostat, FAO, and the World Bank,

    7. UNESCO’s "International Oceanographic Data and Information Exchange" (IODE) of the "Intergovernmental Oceanographic Commission" (IOC),

    8. Other relevant RDA WGs and IGs, such as (alphabetically) the Agricultural Data IG, Agrisemantics WG, Data Citation WG, Geospatial IG, Marine Data Harmonization IG, and RDA/CODATA Legal Interoperability IG.


Value Proposition

The WG will provide a negotiation framework on fishery related standards for data storage and exchange structures to improve data analysis. It will benefit organizations in the fisheries sector by providing a reference interoperability framework based on existing initiatives and formats.

In the longer term, implementing a common framework (however small the scale may be) will help to further cultivate a fisheries data ecosystem, based on common tools and services.

  • The fisheries data managers and data scientists will have a common and global framework to describe, document, and structure their fisheries data.

  • If suitable standards are identified, then the WG can propose generic data storage standards (e.g. for gridded datasets or NetCDF) and services (OGC Web Services for the GIS community, facilitating compliance with the INSPIRE Directive).

  • Fishers, traceability organizations, NGOs, and other data users will have seamless access to a wide range of fisheries data. Data mapping will also ease the emergence of new data analyses and knowledge discovery methodologies.

  • Data managers and scientists from other infrastructures will have the benefit of a reusable data framework. Researchers working in other domains will be able to easily access, reuse and link fisheries data with their own data.

  • Development professionals and policy makers will be enabled to take informed decisions across multiple data providers.

Expected key impacts of the RDA fisheries Data Interoperability Guidelines

  • Reduced costs related to reusing data. The incompleteness of standards (or guidelines) has a cost: data structures can vary widely for similar data, and much time is wasted transforming data from one format into another. Agreeing on a set of standards and writing related guidelines is key.

  • Increased adoption of existing common standards, vocabularies and best practices related to fisheries data management with new communities, such as regional projects. Increased general awareness about research open data and interoperability standards among the fisheries organizations.

  • Enhanced access, discovery (metadata) and reuse of fisheries data, and improved visibility.

  • Major fisheries data integration and more effective measurement of the impact of free sharing of fisheries data through data provenance attribution.

  • New opportunities for Data Structure Definition (DSD) and ontology-based knowledge management in the fisheries sector.

Engagement with existing work in the area:

The members of the WG will liaise through their organizations with existing activities in the area of fisheries data exchange and with overall activities to foster data interoperability. The engagement will allow the WG to tap into a wide knowledge base of data exchange specialists and to prepare recommendations that may also be of value to experts beyond the domain of fisheries data exchange, such as legal interoperability and geospatial metadata experts.

  • iMarine / BlueBRIDGE: Tuna Atlas use case for RFMOs datasets,

  • ICCAT BFT-E Stock Assessment working group to facilitate stock assessment datasets sharing,

  • OpenAIRE open data specialists,

  • Agroknow network of expertise on open data sharing,

  • EGI Engage e.g. for legal interoperability,

  • IRD scientific data collection activities,

  • FAO and Eurostat SDMX SEIF initiative,  

  • DG-MARE: FLUX initiative (in particular VMS & elogbooks) and DCF / DCMAP,

  • FAO CWP standards for fisheries reference data,

  • OGC geospatial standards setting organization.

Through the engagement work, a list of potential adopters of the WG products will emerge. Specific statements of interest and priority needs are expected from the invited participants while the WG is established. Examples of interoperable data flows that could benefit from the application of WG reference models and best practices include:

  • FAO Fisheries and Aquaculture department:

    • data ingestion from regional fishery bodies, fisheries organizations or member states; regional databases to support scientific process;

    • improve the statistical data exchange in line with CWP’s SDMX initiative;

    • improve the geospatial data exchange building on CWP’s geospatial standards work group;

  • IRD:

    • improve fisheries observers’ data flows to support regional fisheries bodies;

    • improve scientific data flows;

    • improve the quality of NetCDF metadata;

  • EU:

    • Ease interoperability between FLUX and SDMX;

    • Ease interoperability between FLUX and FishFrame.

Work Plan

Work plan components

Inventory of existing formats to support solutions (months 1-4)

In the first months after the WG has been established, a consultation on existing formats and activities related to fisheries data will identify:

  1. Data formats, existing and proposed,

  2. Data exchange needs and examples,

  3. Data access and storage existing solutions and development proposals

We will evaluate recommended data exchange approaches for several specific scenarios and select pilot candidates for a demonstration. The selection of these candidates will be in close cooperation with stakeholders and data owners. In this phase, a detailed report of the technical aspects of data sharing approaches will be developed.

Examples of scenarios for which the WG can propose data interoperability solutions, and which could be selected for inclusion in the report, include:

  • Globally established data frameworks interoperability; what are the technical challenges in re-using data collected through e.g. FLUX or SDMX work-flows?

  • Improve coverage and re-usability of on-board collected data such as by-catch reports by harmonizing reference data through master data management; can the interoperability of collected data be improved by relying on global reference data for e.g. species names, gear classifications, and area references?

  • Legal interoperability requirements; what provisions exist in current data exchange mechanisms to ensure that the data are properly described from a legal perspective through descriptive metadata on license, copyright, and ownership?

  • Spatial data interoperability of fisheries geospatial explicit data such as gridded datasets (Tuna Atlas example) through descriptive metadata;

  • Identify requirements for additional data formats for activities such as vessel or FAD (Fish Aggregating Device) trajectories
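The reference-data harmonization scenario above can be sketched as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: the local labels, the three-letter target codes and the record layout are hypothetical stand-ins for whatever master data the WG would actually recommend.

```python
# Hypothetical mapping from local species labels to a shared code list.
# The labels and codes are illustrative and not an agreed reference list.
LOCAL_TO_GLOBAL = {
    "yellowfin": "YFT",
    "yellowfin tuna": "YFT",
    "skipjack": "SKJ",
    "skipjack tuna": "SKJ",
}


def harmonize(records, mapping):
    """Split records into those mapped to a shared code and those left over.

    Each record is a dict with a free-text "species" field. Records whose
    label is not in the mapping are returned unchanged so that they can be
    reviewed by a data manager instead of being silently dropped.
    """
    mapped, unmapped = [], []
    for record in records:
        key = record["species"].strip().lower()
        code = mapping.get(key)
        if code is None:
            unmapped.append(record)
        else:
            mapped.append({**record, "species": code})
    return mapped, unmapped


logbook = [
    {"species": " Yellowfin ", "tonnes": 10},
    {"species": "Skipjack tuna", "tonnes": 4},
    {"species": "unidentified", "tonnes": 1},
]
mapped, unmapped = harmonize(logbook, LOCAL_TO_GLOBAL)
print(len(mapped), "mapped;", len(unmapped), "left for review")
```

Keeping the unmapped records separate makes gaps in the reference data visible, which is the kind of information an inventory of data exchange needs would draw on.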

The WG will not meet physically but will be consulted through several online WG meetings.

This report will be the Deliverable of this phase.


Defining the reference models (months 3-8)

We will develop technical reference models for data exchange based on the inventory above, possibly including Data Structure Definitions (DSDs) for statistical data and UML models for OGC and ISO standards.

Each model should be open, extensible and, if possible, implementation agnostic. The models define how fisheries data can be structured in order to facilitate the sharing of datasets and subsets, and how those structured data can be used in interoperable exchanges.

A selected set of DSDs and UML diagrams, or other formalizations of fisheries data for exchange, based on the report of the previous phase, will be this activity’s deliverable.

A reference model should address the interoperability issues related to formats, ownership, copyright, data re-use and data quality.
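To give a rough idea of what a structure definition for statistical data does, the sketch below models a record as a set of coded dimensions plus one numeric measure and checks records against it. It is a simplified, hypothetical stand-in written for this illustration: the class, field names and code lists are assumptions, and it does not implement SDMX or any format the WG may select.

```python
from dataclasses import dataclass


@dataclass
class StructureDefinition:
    """A simplified, hypothetical stand-in for a data structure definition."""

    dimensions: dict[str, set[str]]  # dimension name -> allowed codes
    measure: str                     # name of the numeric observation field

    def validate(self, record: dict) -> list[str]:
        """Return a list of problems; an empty list means the record conforms."""
        problems = []
        for name, allowed in self.dimensions.items():
            if name not in record:
                problems.append(f"missing dimension: {name}")
            elif record[name] not in allowed:
                problems.append(f"unknown code for {name}: {record[name]!r}")
        value = record.get(self.measure)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"measure {self.measure} must be numeric")
        return problems


# Illustrative code lists only, not an agreed reference list.
CATCH_STRUCTURE = StructureDefinition(
    dimensions={
        "species": {"YFT", "SKJ"},
        "area": {"51", "57"},
        "year": {"2015", "2016"},
    },
    measure="catch_tonnes",
)

record = {"species": "YFT", "area": "51", "year": "2016", "catch_tonnes": 12.5}
print(CATCH_STRUCTURE.validate(record))
```

Because the definition is plain data, a producer and a consumer can share it and validate records independently, which is the property an implementation-agnostic model is meant to provide.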

Improve and test the models iteratively (months 7-12)

The models (structure definitions and UML) developed in the previous step will be evaluated against suitable data sets of considerable size from various research organizations. The consortium partners will be asked to provide their real-world research data sets as a testbed to evaluate each model. This will follow an iterative approach in order to allow improvements.

After the models have been validated, a reference architecture for fisheries data will be implemented. This reference implementation should be based on open source software in order to be usable and improvable by all participating partners. Several implementations of data architectures already exist, and these could be repurposed to also accept the fisheries data models.

  • For statistical data;

  • For geospatial explicit data;

The implementation has to be generic and flexible enough for being adapted to various purposes. An official release will follow the implementation and iterative improvement phase and demonstrate interoperability between two systems (a producer and a consumer) with a live example of fisheries data.

The evaluation report of the existing reference architectures for suitability to manage fisheries data will be the deliverable of this phase.

Promotion of the RDA FDI Model and Reference Adoption (months 8-18)

Promotion activities will include internal and external dissemination about the data structures and architecture. The reference implementation will be accompanied by substantial documentation and use case scenarios in order to increase adoption and encourage contributions.

FDIWG operation

Form and description of final deliverables

The deliverables are listed above as activity phase outcomes.


No particular milestones are specified. If needed, the Deliverables of the previous section can be used as milestones.

Communication and outreach

The entire process will be supported by dissemination activities and community outreach. The dissemination will rely on RDA tools, and include a wiki, documents, and possibly a demonstration site in an EU infrastructure. No developer forum or mailing lists are foreseen.

The outreach will focus on the initiation and conclusion phases: the announcement of the activity and plans, the installation of the core team and the resource team, the development of the concrete objectives and, once a result has been obtained, a presentation of progress and of plans for a further development and roll-out phase through the participating members' channels.


Initial Membership of the FDIWG

The WG organization is specified in the Case Statement. It will be structured with 3 Co-chairs. The first Co-chair will retain overall responsibility for progress and deliverables, and for communication with RDA, while the other Co-chairs will be responsible for content development and, more broadly, for technical issues and future collaboration.


  • Co-chair: Anton Ellenbroek (FAO) - Fisheries data structures

  • Co-chair: Julien Barde  (IRD) - Fisheries and geospatial data management

  • Co-chair: Aymen Charif (FAO) - Statistical data management

Members/Interested (Not formally invited):

  • Marc Taconet - FAO Rome - Data Governance and Global fisheries data interoperability

  • Donatella Castelli - CNR-ISTI, Pisa - Networking and data interoperability

  • Pasquale Pagano - CNR-ISTI, Pisa - Data and infrastructure interoperability

  • Yann Laurent - FAO Consultant - Fisheries data exchange and interoperability expert

  • Neil Holdsworth - ICES Denmark - fisheries data formats and tools

  • Daniel Surany - ESTAT - SDMX Expertise

  • Erik van Ingen - FAO CIO Rome - SDMX Expertise, mainstreaming fisheries data in FAO UN statistical data flows

  • Fabio Carocci and Emmanuel Blondel - FAO Fisheries - Geospatial data standards expertise


  • Charalampos Thanopoulos - Agroknow Greece - Expert on data interoperability

  • Imma Subirats / C.Caracciola - FAO OPCC Rome - Data interoperability experts

  • NOAA - TBC

  • NAFO - Through FAO FiRMS partnership and CWP (logbook data models)

  • David Ramm - CCAMLR Hobart - Fisheries data management expert

  • Alicia Mostiero / Dawn Borg Costanzi - FAO Rome - Global Record - Vessel data management expert  (UN/CEFACT - FLUX)

  • DG MARE - FLUX: Thierry Remy / Eric Honoré (UN/CEFACT business layer standardization)

  • DG MARE - DCF: Bas Drukker / Venetia Kostopoulou

  • JRC - TBD

  • VLIZ - WoRMS Marine species master data, marine georeferences

  • Dimitris Gavrilis - Athena RC

Review period start:
Thursday, 5 January, 2017






In the healthcare sector, 1.3 million new pieces of research related to biomedical science alone are published each year.[2] A typical database search returns about 80,000 hits, and only 4,000 of those are likely to be very relevant to a researcher’s work. Text and Data Mining (TDM) techniques can already be used to zoom in on the top 25% of papers which are most relevant to any given search query. Researchers believe that, with a little more work, it will be possible to use TDM to identify the top 10% of search results. In a similar vein, the quantity of data being created has also grown exponentially, making it difficult to handle and analyse. Data mining techniques are needed to help researchers to spot patterns in large batches of data.

TDM was initially defined as “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different (…) resources, to reveal otherwise hidden meanings.” Its applicability in all fields of research is growing in this age of information overload.[3]
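To make the idea of narrowing a result list to its most relevant fraction concrete, here is a deliberately simple sketch. It is an assumption-laden illustration, not the method behind the figures quoted above: it scores each document by the share of its words that match the query and keeps the top fraction. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_fraction(query, documents, fraction=0.25):
    """Return indices of the best-matching documents, best first.

    A document's score is the share of its tokens that appear in the query.
    At least one document is returned when the list is not empty.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, document in enumerate(documents):
        counts = Counter(tokenize(document))
        total = sum(counts.values()) or 1  # avoid dividing by zero
        score = sum(counts[term] for term in query_terms) / total
        scored.append((score, index))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    keep = max(1, math.ceil(len(documents) * fraction))
    return [index for _, index in scored[:keep]]


abstracts = [
    "gene expression in tumour cells alters gene regulation",
    "survey of hospital staffing",
    "expression levels",
    "soil chemistry",
]
print(top_fraction("gene expression", abstracts, fraction=0.5))
```

Real TDM tools use far richer models than word overlap; the sketch only shows the shape of the task that an introductory course would need to explain.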

Recent studies show that the uptake of TDM is lacking.[4] One of the reasons is the lack of awareness and skill amongst researchers, librarians, and industry practitioners.[5] A key conclusion from the Publishing Research Consortium survey on TDM was that ‘Awareness of text mining techniques is still relatively low’.[6] Moreover, the European Communications Monitor identified a ‘gap between training offered and development needs’.[7] Both industry and academia have confirmed a need for education on TDM. Our focus will be on providing the basic skills so as to reach the widest audiences. The decision therefore is to establish a Working Group with a clear focus and purpose to develop a course within the 18 month time frame.


Purpose of Initiative

This Working Group aims to address the current skills gap identified with respect to Text and Data Mining (TDM) and help improve the adoption of these practices in a range of research disciplines.

TDM is a cross-cutting skill of value to a wide range of researchers. This working group aims to develop a short module that can plug into existing courses (e.g. the CODATA-RDA School of Research Data Science and existing university research skills courses) to equip researchers and practitioners with basic TDM skills and increase the use of these.


Scope of initiative

The Working Group aims to develop a short introductory programme and related content (presentations, exercises and case studies) to introduce researchers[8] to TDM and provide practical experience in applying open source tools to use these skills in their field of research.[9]
The design of the course will be developed based on research and feedback from the stakeholder communities in the upcoming months (see timeline and workplan). More specifically, the content and the proposed duration of the course will be determined after these consultations. For now we envision a 1-2 day modular course for people with no prior knowledge that includes stand-alone modules, lessons and elements that can be selected independently depending on the focus and level of knowledge of the participants. The course can be spread out over several days or weeks to fit within existing courses and trainings.

The introductory course will not be discipline-specific, though later iterations could be tailored to particular disciplines if needed and, for example, go into more detail on discipline-related fields of interest and expertise. Although the aim of the 1-2 day course is to address the skills gap for researchers with no prior knowledge, we anticipate that we may need to extend the duration of the course to 4-5 days if we find that we need to include more basic introductory material, for example on the more technical aspects of TDM.

The course and course materials will be made available online in an easy-to-use, modular digital format, accessible to anyone who is interested in using and adapting the course to suit their specific level/audience.[10]


Background to Initiative

The European projects FutureTDM, FOSTER and EDISON confirm that there is a growing demand for researchers who understand and are able to use TDM and that current education is falling behind in providing people with the skills and knowledge needed both in academia and industry.[11]

At RDA Plenary 8, a discussion on TDM in the IG on Education session confirmed community interest in developing training materials to address the skills gap. This working group therefore aims to look at how education and in-work training can help fill the gap and create enough expert data scientists.[12]


Relevance of the Initiative

Taking into account the many benefits of TDM for research and society, this is a topic relevant for RDA. By designing a course to cover TDM skills, developing course materials and making them available to the community, we can contribute to bridging this gap. This will include learning outcomes (essential and desirable) and course content (specific readings, lecture and discussion content, class activities, practical assignments, and graded assignments).


Proposed Outcome

The aim is to develop a generic/adaptable course or training module on TDM skills and knowledge that can then be used by different disciplines.


Timeline and Workplan : Term: 12-18 months

Quarter one - 2017: Requirements gathering phase

This will include identifying survey participants (such as existing course providers, the research community, industry partners, librarians and RDA members) and undertaking a questionnaire to understand what skills need to be covered in an introductory TDM course.

Analysing survey outputs and drafting a course design, learning outcomes and programme for consultation at the RDA plenary in Barcelona.

This work will be conducted via virtual meetings and desk-based research.

Deliverable: Survey and results

Milestone: Preliminary course outline for discussion in Barcelona

Quarter two - 2017: Course development

Development of course content, including specific readings, lecture and discussion content, class activities, practical exercises and graded assignments. For this we will look at existing courses and tutorials and build upon those with input from the TDM community, such as users and tool developers. For example, we will work together with Contentmine, industry partners such as SAS, and at least two universities that have expressed interest in adopting a course.

Establishing an international network of experts and potential TDM trainers. This will build on the initial survey work and contacts developed through the WG and will support roll-out and reuse of the materials.

The majority of this work will be conducted virtually, with OKFN leading. At least one face-to-face meeting will be scheduled to help define the structure of the course and/or develop key components.

Deliverable: A draft set of training materials and user guides ready for testing

Quarter three - 2017: Testing

Liaising with contacts to establish one or two potential opportunities to trial the course. These could be aligned with existing events from partners such as DCC, FutureTDM or institutions who have expressed an interest in hosting events for researchers.

A train-the-trainers style session could be run at RDA Montreal to walk members through the course content and how it should be delivered, and to receive feedback from potential adopters.

This work will require at least two face-to-face sessions to deliver courses in different contexts.

Milestone: Have tested the course and gathered feedback from trainers and pilot participants

Quarter four - 2017: Evaluation and review

Here we will take stock of feedback received during the trial. Particular emphasis will be paid to which sessions were most effective in addressing the learning outcomes and engaging participants. The time taken to deliver the sessions, any technical issues encountered by trainers and ideas for reworking content or improving flow will also be addressed.

The course materials will be refined based on the feedback. Supporting materials that help others reuse the content, such as speaker notes, will also be improved.

The work will be conducted remotely with regular virtual meetings to support the analysis and review.

Deliverable: a revised set of openly-licensed training materials available online for reuse

Quarter five - 2018: Adoption

The complete course materials will be made available online (GitHub, SlideShare, Zenodo) together with documentation on how to implement the course module, FAQs and contact details for support. Further events like the train-the-trainers session at Montreal could help others to understand and adopt the resources.

Through the DCC, European training initiatives (e.g. Swafs-07) and e-infrastructure projects like OpenAIRE, we will raise awareness of the module and promote adoption in academia.

In addition, the IEA has a number of industrial partners (including Microsoft, Airbus, environmental consultancies and civil engineering companies) and can be used as a route to gaining contact with industry.

This work will involve promoting the outputs at events, as well as specific meetings with key targets (e.g. training departments and Doctoral Training Centres) to promote adoption.


WG Communication


Bi-weekly calls for the Chairs or others engaged in specific activities currently underway

Monthly calls to update all members of the Working Group on progress

WG Email list for discussion and sharing of relevant information

Google Drive / GitHub for collaboration on course materials





  • Freyja van den Boom (EU)
  • Sarah Jones (EU)
  • Devan Ray Donaldson (US)
  • Clement E. Onime (TBC)


  • Steve Brewer
  • Vicky Lucas
  • Simon Hodson
  • Amy Nurnberger        
  • Puneet Kishor
  • Baden Appleyard
  • Christoph Bruch
  • Alex Fenlon
  • Jez Cope
  • Hugh Shanahan
  • Małgorzata Krakowian
  • Bridget Almas

Group Email:

Secretariat Liaison: Fotis Karayannis

TAB Liaison: Devika Madalli

Engagement with existing work in the area:

Collaborations and opportunities for further engagement include:

The FutureTDM project seeks to improve uptake of text and data mining (TDM) in the EU. FutureTDM actively engages with stakeholders such as researchers, developers, publishers and SMEs and looks in depth at the TDM landscape in the EU to help pinpoint why uptake is lower, to raise awareness of TDM and to develop solutions.

EDISON is a 2-year project (started September 2015) with the purpose of accelerating the creation of the Data Science profession.

The forthcoming Swafs-07 ‘Training on Open Science in the European Research Area’ project.
CODATA-RDA School of Research Data Science

The Institute for Environmental Analytics (IEA) is part of the University of Reading. It provides training on analytics and produces proof-of-concept software, either by using environmental data or by applying big data to environmental applications. The IEA is funded until 2019 by the Higher Education Funding Council for England. The IEA recognises that TDM is a growing field for environmental analysis and applications. It currently has projects using TDM on tweets and text messages and is moving into larger document analysis, specifically environmental impact assessments.

The Belmont Forum is a group of national science funders, including NSF (US) and NERC (UK).  The e-infrastructure group is exploring training requirements for research data scientists, including developing a relevant curriculum in 2017.


The UK Digital Curation Centre has delivered training on Research Data Management for several years and is involved in training activities for a number of European projects such as FOSTER, OpenAIRE, EUDAT and the European Open Science Cloud. Through these and participation in the CODATA summer schools, the DCC will help to embed the module in existing courses and encourage broad adoption.

Other possible collaborations:

Academia: We have interest from several Universities

Possible try-outs may be organized alongside the Trieste School, 10-21 July at ICTP in Trieste, followed by São Paulo, Brazil, 4-15 December.

School of Data works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively.

Industry and organisations: Contentmine, SAS


[1] Developed during the Plenary in Denver IG session Education and Training on handling of research data

[2] FutureTDM project report D4.3 Compendium of Best Practices and Methodologies available online at

[3] For an overview of use examples in the US, see: Shaw, J., Why “Big Data” Is a Big Deal: Information science promises to change the world, Harvard Magazine, available online

[4] The EU expert report on Text and Data Mining states that Europe is falling behind the US and China with respect to the uptake of TDM, available at

[5] FutureTDM consortium D4.3 Compendium of Best Practices and Methodologies report shows the need for more TDM practitioners in industry as well as a lack of awareness and skill amongst students and researchers in different disciplines.

[6] Key finding from the Publishers community on this issue available here

[7] As identified in Europe. See European Communication Monitor 2016

[8] We will initially develop this course for (student) researchers with little or no prior knowledge of TDM. For a second iteration of the course we will also look at industry, librarians and other interested parties to see how the course can be tailored more to specific needs.

[9] The course will be made available under an open access license using open source tools and materials to make sure the course can be adopted by a wide audience.

[10] The content of the course, the course materials and the best platform to make them available will be looked at in this working group. See timeline for more detailed information.

[11] FutureTDM Deliverables 2.4 and 4.3 available at

[12] The UK Royal Society is holding a special conference on this topic see


Review period start:
Friday, 9 December, 2016 to Monday, 9 January, 2017
Custom text:

Scholarly Link Exchange Working Group:

Follow on from: RDA-WDS working group on Data Publishing Services

On Enabling Interlinking of Data and Literature


The Scholarly Link Exchange Working Group aims to enable a comprehensive global view of the links between scholarly literature and data. The working group will leverage existing work and international initiatives to work towards a global information commons by establishing:

  • Pathfinder services and enabling infrastructure
  • An interoperability framework with guidelines and standards (see also
  • A significant consensus
  • Support for communities of practice and implementation


By the end of this 18 month WG period there will be:

  • A critical mass of Scholix conformant hubs providing the enabling infrastructure for a global view of data-literature links
  • Pathfinder services providing aggregations, query services, and analyses
  • Beneficiaries of these services accessing data-literature link information to add value to scholarly journal sites, data centre portals, research impact services, research discovery services, research management software, etc.
  • Operational workflows to populate the infrastructure with data-literature links
  • Better understanding of current data-literature interlinking landscape viewed from the perspective of e.g. disciplines, publishers, repositories etc.


The working group follows on from the RDA/WDS Publishing Data Services WG. The original working group established demonstrator services and enabling infrastructure. The follow-on working group will support the “hardening” of that infrastructure and those services, as well as an increase in the number of participating hubs and services. The original working group established an interoperability framework. The follow-on group will provide further specification, documentation and profiling of that framework to support adoption by link contributors and consumers. The original working group established a consensus among large infrastructure providers and early adopters; the follow-on group will extend that consensus to the next stage of adopters and to a more diverse set of infrastructure providers. The original working group harnessed the energy and interest of specialists; the follow-on group will provide support for a number of communities and services as they implement and adopt the framework and vision established in the original group.

The working group believes a global system for linking data and literature should be:

  • Cross-disciplinary and global (built for, and aspiring to, comprehensiveness) 
  • Transparent with provenance allowing users to make trust and quality decisions
  • Open and non-discriminatory in terms of content coverage and user access (this also means ranging from formal to informal, and from structured to non-structured content)
  • Standards-based (content standards and exchange protocols)
  • Participatory and adopted, including community buy-in
  • Sustainable
  • An enabling infrastructure, on top of which services can be built (as opposed to a monolithic “one-stop-shop” solution).

Note - This group retains the principles established in its precursor working group (Publishing Data Services)


Value Proposition:

The WG aims to oversee and guide the maturation of a distributed global system to collect, normalize, aggregate, store, and share links between research data and the literature. This will build upon the output of the preceding Data Publishing Services Working Group, which delivered a consensus vision and set of guidelines called the Scholix Framework, together with an operational system called the Data-Literature Interlinking (DLI) System, which puts these guidelines into practice as a pathfinder implementation. The WG proposed here will build out these assets into an operational infrastructure and service layer that is to become the de facto go-to place for organizations to deposit or retrieve links between research data and the literature.


The value of such a system ultimately rests on the value of links between research data and the literature. The utility of such links is threefold (see also the Case Statement of the Data Publishing Services WG):

  1. They improve the visibility and discoverability of research data (and relevant literature), so that researchers can find relevant material more easily.
  2. They help place research data in the right context, so that researchers can re-use data more effectively.
  3. They support credit attribution mechanisms, which incentivize researchers to share their data in the first place.


These value elements are illustrated below, and in more detail in Annex A.

While there is broad support for the value and utility of data-literature links amongst the various stakeholders in research data publishing (including researchers as the ultimate end-users of this information), organizing the associated information space is not an easy feat: there are many disconnected sources with overlapping information, and there is a wide heterogeneity in practices today - both at a technical level (different PID systems, storage systems, etc.) and at a social level (different ways of referencing a data set in the literature, different moments in time to assert a link, etc.). As a consequence, the landscape today is incomplete and patchy, characterized by independent, many-to-many non-standard solutions - for example a bilateral arrangement between a journal publisher and a data center. This is both inefficient and limiting in the value that can be delivered to researchers.


The universal linking infrastructure which this WG strives to put in place represents a systemic change. It will offer an overarching, cohesive structure that binds together many of today’s practices into a common interoperability framework - which will ensure that links between research data and the literature can be easily shared, aggregated, and used on a global scale. This will drive a network effect, where the value in the system as a whole is greater than the sum of individual parts: for researchers as end-users, this value lies in the comprehensiveness and quality of link information; for service providers and infrastructure providers (including journal publishers and data centers), the value also lies in simplicity, efficiency, and reduction of friction in the process by being able to work with a single interface to deposit and retrieve links (and, potentially, the possibility to benefit from additional services developed on top of the core infrastructure).


Who will benefit and Impact


Mapping the value proposition described above to the various stakeholders and actors in research data publishing (copied largely from the Data Publishing Services WG Summary & Recommendations), benefits and impact may be summarized as follows:

  • For data repositories and journal publishers: linking data and the literature will increase their visibility and usage, and can support additional services to improve the user experience on online platforms (for example, offering links to relevant data sets with articles, or offering links to the literature that will help place data in context). In contrast to the bilateral arrangements that we often see today between data centers and journal publishers, the global linking infrastructure will make the process of linking data sets and research literature a more robust, comprehensive, and scalable enterprise.
  • For research institutes, bibliographic service providers, and funding bodies: the infrastructure will enable advanced bibliographic services and productivity assessment tools that track datasets and journal publications within a common and comprehensive framework.
  • For researchers: firstly, the infrastructure will make the processes of finding and accessing relevant articles and data sets easier and more effective. Secondly, it will



Engagement with existing work in the area:

  1. Building upon previous work of the RDA/WDS Publishing Data Services WG
  2. RDA/WDS Publishing Data IG
    • RDA/WDS Publishing Data Bibliometrics WG
    • RDA/WDS Publishing Data Workflows WG
  3. Infrastructure providers
  4. Infrastructure projects
  5. Related projects
  6. Data Center Community
    • ICSU WDS
    • DataCite
  7. Publisher Community
  8. Institutional Repository Community
    • OpenAIRE
    • SHARE
  9. Discipline-specific Communities
    • Pangaea (Earth and Environmental Science)
    • EBI-EMBL (Life Sciences)
    • ICPSR (Social Sciences)
    • CERN (High Energy Physics)


Adoption Plan:

The Adoption Plan for this Working Group is quite mature since it builds on a previous working group, includes adopter work packages, includes outreach and documentation work packages, targets new hubs, and focuses on benefit realisation.


Previous Working Group:  The proposed working group builds directly on the Data Publishing Services Working Group which has a considerable membership with an active core of contributors. The WG is representative of publishers, data centres, research organisations and research information infrastructure services who are the key stakeholder and adopter communities. The existing momentum and buy-in of this group will be leveraged for adoption.


Technical Development of Hubs: In a similar vein, the WG activity plan includes targeted activity to extend existing hubs (CrossRef, DataCite, OpenAIRE, RMap) and establish new hubs in new community areas (such as Astronomy, Life Sciences).


Implementation Sub Projects: The “Activities” section of the working group case statement provides details of a number of adoption sub projects. The Scholix framework that underpins the WG approach involves content publishers (e.g. journal publishers or data centres) communicating with natural hubs (e.g. CrossRef and DataCite). This WG activity plan includes implementation projects from publisher to hub.


Documentation and Support Materials: The WG activity plan includes an extension of the Scholix framework, documenting how the abstract Scholix information model is instantiated in various technologies or formats (such as XML, RDF, JSON) and used with a number of common protocols (such as open API calls, SPARQL, OAI-PMH, ResourceSync). These specification and implementation materials will also be the product of the development and adoption projects described above.
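As a rough illustration of what "instantiating an abstract information model in several formats" can mean, the following minimal Python sketch serialises one link record to both JSON and XML. The class name, the field names and the identifiers are hypothetical and chosen only for this example; they are not the Scholix schema, which is defined by the working group's own specifications.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class LinkRecord:
    """Hypothetical, simplified link record (illustrative field names only)."""
    source_pid: str      # identifier of the linking object
    source_type: str     # "literature" or "dataset"
    target_pid: str      # identifier of the linked object
    target_type: str
    relationship: str    # e.g. "references"
    link_provider: str   # who asserted the link (provenance)


def to_json(link: LinkRecord) -> str:
    """Serialise the record as JSON, e.g. for a RESTful API response."""
    return json.dumps(asdict(link), sort_keys=True)


def to_xml(link: LinkRecord) -> str:
    """Serialise the same record as XML, e.g. for an XML-based feed."""
    root = ET.Element("link")
    for name, value in asdict(link).items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up identifiers, for illustration only.
example = LinkRecord(
    source_pid="doi:10.1234/example-article",
    source_type="literature",
    target_pid="doi:10.5678/example-dataset",
    target_type="dataset",
    relationship="references",
    link_provider="example-hub",
)

print(to_json(example))
print(to_xml(example))
```

The point of the sketch is that both serialisations are derived from one model, so a profile for a new format only needs a new serialiser rather than a new model.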


Outreach, Liaison, Collaboration:  This Working Group focuses on a technical solution to the exchange and aggregation of data-literature link information.  Other peak bodies and advocacy groups focus on changing practice and integrating data citation as part of scientific practice.  The WG work plan includes collaboration with those organisations to leverage their established agendas.  Current members of the WG include leaders in these organisations and further such activity is slated in that area of the work plan.


Benefit realisation: The sustainable driver of adoption is benefit for the adopter. The overall work plan is underpinned by the objective of delivering benefits to end users, as outlined in the use cases in Annex A.


Work Plan:

The work plan will be implemented through a set of interconnected activities outlined below. Categories exist only for planning and pragmatic purposes; they are not at all independent and activities will not be siloed.  Cross-category contributions by working group members will be the norm.

Stream 1. Technical Development.

The objective of this stream is to put the Scholix framework into practice such that both hubs and services develop operational functionality.

A. Develop Hubs

  1. OpenAIRE
    • Make OpenAIRE APIs compatible with Scholix to export and import links to and from DLI Service
  2. DataCite
    • Further develop standardised interfaces for query and export
  3. CrossRef
    • Further develop standardised interfaces for query and export
  4. New domain-specific hubs, e.g. EMBL/EBI (TBC by opportunity)
  5. Interim hubs (direct feed to DLI): standardisation (using Scholix framework) of feeds from previous working group and improvement of dynamic currency of feeds
    • ANDS to DLI direct (only non-DOI content)
    • ...
  6. Further interoperation of the hubs (extensions to the Scholix conceptual framework during the course of the working group)

B. Develop Services (in relation to the user scenarios defined in previous WG)

  1. DLI aggregation service
    • Transition to production at OpenAIRE data centre and infrastructure
    • APIs for PID resolution (Scholix conformant) - Pangaea
    • Improving quality: e.g. de-duplication of objects (datasets and literature)
    • Improving service level: live updates of links
  2. Use of the Scholix framework to access and expose links between articles and data in exemplar end-user services
    • OpenAIRE APIs compatible with Scholix to export and import links to and from DLI Service
    • Data centre/ publisher exemplar projects using DLI as per user scenarios
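The de-duplication of objects mentioned above could, in its simplest form, be sketched as follows. This is only an assumption-laden illustration: it treats every identifier as a DOI (which is case-insensitive), the function names and sample records are hypothetical, and it says nothing about how the DLI service actually de-duplicates.

```python
# Prefixes under which the same DOI commonly appears.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(pid: str) -> str:
    """Reduce a DOI to a bare, lower-case form so variants compare equal."""
    value = pid.strip().lower()
    for prefix in DOI_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def deduplicate(objects):
    """Keep the first record seen for each normalised identifier."""
    seen = set()
    unique = []
    for obj in objects:
        key = normalise_doi(obj["pid"])
        if key not in seen:
            seen.add(key)
            unique.append(obj)
    return unique


# Made-up records: the first two describe the same object.
records = [
    {"pid": "doi:10.1234/ABC", "title": "Example data set"},
    {"pid": "https://doi.org/10.1234/abc", "title": "Example data set (copy)"},
    {"pid": "doi:10.5678/xyz", "title": "Example article"},
]

print(deduplicate(records))
```

Real de-duplication would also have to handle objects that carry different identifiers in different sources, which identifier normalisation alone cannot detect.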

C. Elaborate the Scholix framework

  1. Create profiles of the information model for use in different technologies
    • XML for OAI-PMH
    • JSON for RESTful API
  2. Investigate how best to apply
    • DISCO (through cooperation with RMap)
    • ResourceSync
    • Others? … (RDF for SPARQL)
  3. Provide documentation and support materials for the above


Stream 2. Community Buy-in.

This stream supports buy-in from different communities such that exchange of scholarly link information is implemented and accepted as standard practice.

D. Support Community Adoption:

  1. Create strategies for community adoption:
    • Publishers
    • Data centres
    • Repositories
    • ….
  2. Implement these strategies through:
    • Early adopter groups (e.g. CrossRef early adopters, the Force11 DCIP project, the THOR project, COAR)
    • Implementation projects
    • Webinars
    • Presentations
    • Support materials and activities

E. Communicate Broadly

  1. Create communications plans
  2. Implement communications plans

F.  Create Coordination and Governance Materials. Investigate and document issues such as:

  • Quality of data links
  • Requirements to be a hub
  • Access
  • Benefits for contributors
  • Measures of success


Key Stakeholder Groups:

The above Activity Plan will be delivered with involvement of the following groups who bring complementary resources, approaches, focus, and expertise.

A. Advocacy and Peak Bodies

  • Force11 (application of data citation standards and advice on implementation standards)
  • CODATA (application of data citation standards and advice on implementation standards)
  • ICSU World Data System (e.g. get more citations into DataCite)
  • STM (outreach, training, Crossref early adopter project)
  • FAIR Data

B. Other data literature linkage projects

  • National Data Service
  • RMAP (application of DISCO)
  • RDA Working Groups (Publishing Data IG, ….)

C. Prospective Hubs

  • BIOCaddie (DataMed)


Initial Membership

Initial members are coming from the existing working group on an opt-out basis; they will be asked again if they want to join this newly formed working group by e-mail following the RDA


  • Adrian Burton
  • Amir Aryani
  • Amye Kenall
  • Aris Gkoulalas-Divanis
  • Arnold Rots
  • Arthur Smith
  • Bernard Avril
  • Carly Strasser
  • Carole Goble
  • Caroline Martin
  • Claire Austin
  • Claudio Atzori
  • Dan Valen
  • David Martinsen
  • David Arctur
  • Donatella Castelli
  • Eefke Smit*
  • Elise Dunham
  • Elizabeth Moss
  • Francis ANDRE
  • George Mbevi
  • Håkan Grudd
  • Haralambos Marmanis
  • Howard Ratner
  • Hua Xu
  • Hylke Koers
  • Iain Hrynaszkiewicz
  • Ian Bruno*
  • Ingrid Dillo*
  • Jamus Collier
  • Jeffrey Grethe
  • Jingbo Wang
  • Jo McEntyre
  • Joachim Wackerow
  • Johanna Schwarz
  • John Helly
  • Jonathan Tedds
  • Juanle Wang*
  • Kate Roberts
  • Katerina Iatropoulou
  • Kathrin Beck
  • Kerstin Helbig
  • Kerstin Lehnert*
  • Lars Vilhuber
  • Laura Rueda*
  • Laurel Haak
  • Leonardo Candela
  • Luiz Olavo Bonino da Silva Santos
  • Lyubomir Penev
  • Mark Donoghue
  • Martin Fenner*
  • Martina Stockhause*
  • Michael Diepenbroek*
  • Mohan Ramamurthy
  • Mustapha Mokrane*
  • Natalia Manola
  • Niclas Jareborg
  • Nigel Robinson
  • Paolo Manghi
  • Patricia Cruse*
  • Paul Dlug
  • Peter Rose
  • Peter Fox
  • Rabia Khan
  • Rainer Stotzka
  • Richard Kidd
  • Rick Johnson
  • Robert Arko
  • Rorie Edmunds*
  • Sarah Callaghan*
  • Sheila Morrissey
  • Siddeswara Guru
  • Simon Hodson*
  • Suenje Dallmeier-Tiessen
  • Tim DiLauro
  • Timea Biro
  • Tom Demeranville
  • Ui Ikeuchi
  • William Mischo
  • Wouter Haak*
  • Xiaoli Chen
  • Yolanda Meleco

* Representatives of a WDS member


Initial workstream leads and co-chairs:

    • technical specs and docs (Paolo Manghi)
    • hub development and interoperability (Martin Fenner)
    • Scholix service development (Jeff Grethe)
    • publisher (Iain )
    • repository (Ian Bruno)
    • general outreach (Fiona Murphy)
  1. WG Coordination
    • WG program oversight (Wouter Haak)
    • WG component integration (Adrian Burton)


Annex: Use Cases


Use Case


Live linking

As a publisher, I want to know about relevant data for an article that I published so that I can present links to such data sets to the users on my platform

- OR -

As a data center, I want to know about relevant articles for a data set that I published so that I can present links to such articles to the users on my platform

  • Needs to be an on-demand, real-time query. Performance is critical.
  • Publisher or data center platform should be able to control the UI for smooth platform integration.
  • No need for the service to do any filtering; just return all linked data sets and client can filter as needed.
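The division of labour described in these requirements (the service returns every link for an identifier, the client filters) can be sketched in Python as below. The `LinkIndex` class, its methods and the identifiers are hypothetical stand-ins for a real link service, used here only to make the pattern concrete.

```python
from collections import defaultdict


class LinkIndex:
    """Hypothetical in-memory stand-in for a link query service."""

    def __init__(self):
        self._links = defaultdict(list)

    def add(self, source_pid, target_pid, target_type):
        """Register one link from source_pid to target_pid."""
        self._links[source_pid].append({"pid": target_pid, "type": target_type})

    def links_for(self, pid):
        """Return ALL known links for a PID; no filtering on the service side."""
        return list(self._links.get(pid, []))


index = LinkIndex()
# Made-up identifiers, for illustration only.
index.add("doi:10.1234/article-1", "doi:10.5678/dataset-a", "dataset")
index.add("doi:10.1234/article-1", "doi:10.5678/dataset-b", "dataset")
index.add("doi:10.1234/article-1", "doi:10.1234/article-2", "literature")

# The publisher platform decides what to show: here, data sets only.
all_links = index.links_for("doi:10.1234/article-1")
datasets = [link for link in all_links if link["type"] == "dataset"]
print(datasets)
```

Keeping the lookup a plain dictionary access reflects the performance requirement: the service does one keyed read per query and leaves presentation decisions to the platform.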



As a data center, I want to obtain a full overview of article/data (and data/data) links for the data sets relevant to me so that I can demonstrate the utility of my data

  • Query should be on-demand, complete, and up-to-date.
  • Precision and comprehensiveness are key
  • Ideally an on-demand, pull mechanism.


As a data center, I want to be alerted that an article may be citing/referencing our data so that I can validate that link and then add it to our own database.

  • For an alerting mechanism, recall is more important than precision (since the data center will still validate)
  • Should be push notifications.
  • Data center needs to be able to selectively receive notifications for their own data repository only; this requires “data center” metadata.
  • This service is not so sensitive to comprehensive coverage
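The trade-off between recall and precision in an alerting service can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the candidate links, the confidence scores, the `data_center` field and the thresholds are invented, and the push delivery itself is not modelled.

```python
def alerts_for(data_center, candidates, threshold):
    """Select candidate links for one data center at or above a score threshold."""
    return {
        (c["article"], c["dataset"])
        for c in candidates
        if c["data_center"] == data_center and c["score"] >= threshold
    }


def precision_recall(predicted, actual):
    """Precision = correct alerts / all alerts; recall = correct alerts / all true links."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Invented candidate links with invented confidence scores.
candidates = [
    {"article": "A1", "dataset": "D1", "data_center": "dc-x", "score": 0.9},
    {"article": "A2", "dataset": "D2", "data_center": "dc-x", "score": 0.5},
    {"article": "A3", "dataset": "D3", "data_center": "dc-x", "score": 0.4},
    {"article": "A4", "dataset": "D4", "data_center": "dc-x", "score": 0.35},
    {"article": "A5", "dataset": "D9", "data_center": "dc-y", "score": 0.95},
]
# Links that the data center would confirm after validation (also invented).
true_links = {("A1", "D1"), ("A2", "D2"), ("A3", "D3")}

loose = alerts_for("dc-x", candidates, threshold=0.3)
strict = alerts_for("dc-x", candidates, threshold=0.8)

print(precision_recall(loose, true_links))   # more alerts, nothing missed
print(precision_recall(strict, true_links))  # fewer alerts, true links missed
```

With the loose threshold the data center receives one false alert but misses nothing, which matches the requirement that recall matters more than precision when a human still validates each link.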


As a researcher interested in a particular topic of study, I want to be able to explore a relevant article/data graph so that I can find the articles or data sets that I am interested in.

  • General “research” use case, could apply to individual researchers, data repositories, and others.
  • Requires a lot of freedom to do exploration on the user’s terms
  • The user in this case is expected to be highly tech-savvy and to want to create their own search logic using a minimal “hopping service” that exposes a set of links given an article or data set PID.
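A minimal "hopping service" of the kind described here only needs to answer one question: which objects are linked to a given PID. The Python sketch below shows how a user could build their own exploration logic (a breadth-first traversal) on top of such a call. The `hop` function, the `LINKS` table and the identifiers are hypothetical; a real service would be a remote API rather than a dictionary.

```python
from collections import deque

# Invented link graph standing in for the remote service's data.
LINKS = {
    "article:1": ["dataset:a", "dataset:b"],
    "dataset:a": ["article:1", "article:2"],
    "dataset:b": ["article:1"],
    "article:2": ["dataset:a", "dataset:c"],
    "dataset:c": ["article:2"],
}


def hop(pid):
    """Hypothetical hopping service: return the PIDs directly linked to pid."""
    return LINKS.get(pid, [])


def explore(start, max_hops):
    """Breadth-first exploration; returns each reachable PID with its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in hop(current):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


print(explore("article:1", max_hops=2))
```

Because the service exposes only single hops, all search strategy (depth limits, filtering by type, ranking) stays on the user's side, which is the freedom this use case asks for.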



Review period start:
Friday, 7 October, 2016
Custom text:

Please see attached document.

Review period start:
Thursday, 1 September, 2016 to Friday, 30 September, 2016
Custom text:

"Semantic Interoperability is usually defined as the ability of services and systems to exchange data in a meaningful/useful way." In practice, achieving semantic interoperability is a hard task, in part because the description of data (their meanings, methodologies of creation, relations with other data etc.) is difficult to separate from the contexts in which the data are produced. This problem is evident even when trying to use or compare data sets about seemingly unambiguous observations, such as the height of a given crop (depending on how height was measured, at which growth phase, under what cultural conditions, ...). Another difficulty with achieving semantic interoperability is the lack of the appropriate set of tools and methodologies that allow people to produce and reuse semantically-rich data, while staying within the paradigm of open, distributed and linked data.

The use and reuse of accurate semantics for describing data, datasets and services, and for providing interoperable content (e.g., column headings and data values), should be supported as community resources at an infrastructural level. Such an infrastructure should enable data producers to find, access and reuse the appropriate semantic resources for their data, and to produce new ones when no reusable resource is available. The Agrisemantics working group aims to be a community hub for the diffusion of knowledge and practices related to semantic interoperability in agriculture, and to serve as a common place where the future of data interoperability through semantics will be envisaged.

Review period start:
Thursday, 1 September, 2016 to Saturday, 1 October, 2016
Custom text:

This WG proposal emerged from the repository registry discussions within the Data Fabric IG. The bootstrapping co-chairs are Michael Witt, Johannes Reetz, Herman Stehouwer and Peter Wittenburg. At P8 we will suggest an election of the co-chairs and present an initial core group covering European, US and Asian experts also including an increased number of other initiatives that are actively building large federations.


For background information look at the Repository Registry web-pages in the DFIG realm:


Work Group (WG) Charter

The task of the RCD WG is to analyse the existing mechanisms and schemas by which repositories offer their detailed characteristics to service providers and, based on this analysis, to develop two concrete recommendations:

  1. A set of guidelines that should be followed by digital repositories in presenting their characteristics
  2. A flexible yet unified schema that should be used by trustworthy repositories in presenting their characteristics

Since it will not be easy to collect the information of a large group of repositories active in larger federations, the WG may restrict itself to delivering point 1 within the 18-month period, i.e. shift the definition of an agreed schema to a phase 2 group.


The full case statement can be downloaded here.

Review period start:
Tuesday, 19 July, 2016
Custom text: