• Primary Domain: Natural Sciences
  • Group Focus: Identify, Store, and Reserve, Disseminate, Link, and Find
  • Group Technology Focus: Dissemination, Search & Discovery, Re-Use
  • RDA Pathways: Semantics, Ontology, Standardisation, Discipline Focused Data Issues
  • Group Description

    Motivation: Since the completion of the Human Genome Project, we have witnessed an explosion of datasets annotating particular locations along reference DNA sequences with e.g. aspects of functional processes. Tremendous amounts of research funding have been provided to large and small projects that have generated genomic annotation datasets. Many of these datasets relate to human genomes, but an increasing amount of such data is also generated for other model organisms – and with the recent surge of biodiversity projects – for a range of species spanning the whole tree of life. Unfortunately, researchers who want to make use of such data face practical challenges in discovering and reusing datasets relevant to their research. While many repositories and data distribution solutions exist, the metadata is often poorly aligned with best practices for Findable, Accessible, Interoperable and Reusable (FAIR) research data1. Even in cases where proper metadata exists, relevant solutions typically provide metadata according to different metadata models and APIs.


    Figure 1: Example of genomic annotations (tracks) from the ENCODE consortium imported into the UCSC Genome Browser. From Rosenbloom et. al. (2011). “ENCODE whole-genome data in the UCSC genome browser: update 2012.” Nucleic acids research. 40. D912-7. Licence: CC BY-NC 3.0


    Genomic annotation data: The FAIRification of Genomic Annotations Working Group (FGA-WG) focuses on the challenges of harmonising metadata and software solutions to improve the discovery and reuse of publicly available “genomic annotation” data (see Figure 1). Genomic annotations, sometimes referred to as “genomic tracks2”, refer to data files that annotate reference sequence positions and can be visualised in genome browsers, such as the UCSC Genome Browser and Ensembl, or analysed often non-visually using computational tools that span the domains of life science (e.g., statistical colocalization analyses). Genomic annotations are often routinely generated by computational workflows in larger datasets; in the context of software- or method-oriented publications; or as the result of manual curation processes. The types of data include condensed summaries of experiment outputs as well as predictive and descriptive annotation of DNA reference sequence positions.


    Deliverables: The FGA-WG establishes a minimal metadata schema based on the harmonisation of existing schemas and recommendations, working to gradually improve this schema based on the interrogation of a set of use cases. To foster the availability of metadata that adheres to the schema, the FGA-WG will develop practical recommendations to build scalable and maintainable metadata transformation pipelines to “FAIRify” and unify metadata from multiple data sources. Recommendations will be developed for publishing and registering the harmonised metadata – with persistent and globally unique identifiers – such that the metadata can be easily harvested by search engines and other services that improve the discovery and reuse of the metadata (and associated datasets). Lastly, the FGA-WG will develop a harmonised API for the use of search and discovery services by downstream users and tools.


    The three main use cases to be considered are the following:


    1. Biomedical analysis: Discovery of genomic annotations through harmonised metadata to improve the availability and scope of datasets for analytical applications, including methodology-oriented software tools and AI/ML efforts. The focus area will be human biomedical research, but this use case could also include support for research on other species insofar as this aligns with the interests of the FGA-WG contributors.


    2. Biodiversity genome annotation: Establish genome annotations for biodiversity assemblies as FAIR objects with improved metadata and integration with relevant infrastructure and tools, including both the manual and automated annotation of genome assemblies.


    3. Track Hub infrastructure: Enhance genome browser-related infrastructure for track hubs with harmonised metadata and related services, to simplify the process of generating, hosting, and registering metadata-enriched track hubs; improve discovery of track hub-hosted data at the level of individual tracks; and facilitate the integration of metadata-enriched track hubs with genome browsers and other analysis tools.


    Community building: The FGA-WG aims to build a broad community of individuals and groups with an interest in data integration related to genomic annotations, encompassing data producers, domain experts, tool/service developers, FAIR/RDM specialists, ethics and ELSI expertise, and analytical end users. The long-term goal is to build a sustainable infrastructure that improves the FAIRness of genomic annotations, with a particular focus on improving the end-user experience.


    1 e.g.; Xue, Bingjie, et al. “Opportunities and challenges in sharing and reusing genomic interval data.” Frontiers in Genetics 14 (2023): 1155809; Gonçalves, Rafael S., and Mark A. Musen. “The variable quality of metadata about biological samples used in biomedical experiments.” Scientific data 6.1 (2019): 1-15.

    2 We will, unless otherwise indicated, use the terms “genomic annotations” and “genomic tracks” interchangeably in this text, referring to the same group of data files, whether in context of visualisation, non–visual analysis, or elsewhere.



  • Rationale Summary

  • Secretariat Liason(s)

  • Group Visibility

  • Group Creation Date

  • Endorsement Date

  • Estimated End Date

  • Actual End Date

  • Group Email

  • Group Type: Working Group
  • Group Status: In Group Revisions
  • Co-Chair(s): Sveinung Gundersen, Adam Wright, Anna Bernasconi, Nathan Sheffield
No comments found.