The Life Cycle of Structural Biology Data

RDA Structural Biology Interest Group
Supporting Output Title: The Life Cycle of Structural Biology Data
Corresponding author: Chris Morris, STFC, Daresbury Laboratory, WA4 4AD
Contributors: Claudia Alen, Lucia Banci, Alexandre Bonvin, Pablo Conesa, Alfonso Duarte, John Helliwell, Rob Hooft, Brian Matthews, Antonio Rosato, Sameer Velankar, Geerten Vuister, John Westbrook, Martyn Winn

 

Research data is acquired, interpreted, published, reused, and sometimes eventually discarded. This document reports how structural biologists perform these tasks, and recommends improvements to the infrastructure available to them.


Executive Summary

Research data is acquired, interpreted, published, reused, and sometimes eventually discarded. Understanding this life cycle better will help the development of appropriate infrastructural services, ones which make it easier for researchers to preserve, share, and find data.

 

Structural biology is a discipline within the life sciences, one that investigates the molecular basis of life by discovering and interpreting the shapes of macromolecules. Structural biology has a strong tradition of data sharing, expressed by the founding of the Protein Data Bank (PDB) in 1971 (PDB, 1971). In the early years, data submissions to the archive were made by mailing decks of punched cards. The culture of structural biology is therefore already in line with the perspective of the European Commission that data from publicly funded research projects are public data (COM(2011) 882 final).

 

This report is based on the data life cycle as defined by the UK Data Archive. This is the most clearly defined workflow that the authors are aware of. It identifies six stages: creating data, processing data, analysing data, preserving data, giving access to data, re-using data. Each will be discussed below. However, the data infrastructure for structural biology is not a perfect match for this workflow. For clarity, ‘preserving data’ and ‘giving access to data’ are discussed together. We also add a final stage to the life cycle, ‘discarding data’.

 

Changes in research goals and methods have led to some changes in the requirements for IT infrastructure. A common data infrastructure is required, giving a simple user interface and simple programmatic access to scattered data. Progress on these tasks will support the development of workflows that facilitate the use of datasets from different facilities and techniques. The automatic acquisition of metadata can help. Large experimental centres already provide a highly professional data infrastructure. For smaller centres, providing such infrastructure is onerous; it is desirable that a standard package be provided that enables them to use the European e-infrastructure resources in a way that integrates with other structural biology resources.

 

 

File: SB-IG-Life-Cycle-Report.pdf (PDF, 476.61 KB)
  • Author: Rainer Stotzka

    Date: 13 Jun, 2017

    (The thoughts I describe in this comment reflect my personal opinions as an RDA member.)

    The report proposed as an "RDA Supporting Output" summarizes the European situation in research data management in some fields of structural biology. It concludes that a common data infrastructure is required "making the facilities offered by EUDAT and INDICO (European projects) more directly accessible to structural biologists".

    The report does not cover the situation and progress of structural biology in other regions (the Americas, Africa, Asia, and Australia), nor that of related scientific domains with similar characteristics. It is nearly identical to deliverable D3.1 http://internal-wiki.west-life.eu/w/index.php?title=File:Assessment_of_t... of the H2020 West-Life project (http://about.west-life.eu/).

    To my knowledge, the IG Structural Biology was inactive from Feb 2013 until Apr 2017, without any meetings or open discussions visible to all RDA members. At P9 in Barcelona the co-chairs organized a session, which I attended from 16:00 until 18:00: 5th April 2017, Breakout 3, 16:00 – 17:30 IG Structural Biology: The Life Cycle of Structural Biology Data (https://www.rd-alliance.org/ig-structural-biology-rda-9th-plenary-meeting). The proposal to submit the report as an "RDA Supporting Output" was neither on the agenda (see link above) nor discussed during the official meeting time.

    In my opinion the report “The Life Cycle of Structural Biology Data” does not represent an RDA consensus produced by an open and balanced discussion.

    I would recommend facilitating an open and transparent discussion within the IG, involving members from all regions, and examining which existing outputs and recommendations (from RDA, W3C, WDC, CODATA, …) can be adopted to improve research data management in structural biology. The newly forming IG on Disciplinary Interoperability Framework could provide new insights for future implementation. This could result in a real RDA community- and consensus-driven output that can be consolidated by a meeting at Plenary 10 in Montreal.

    I would welcome the revitalization of the IG Structural Biology by this process.

  • Author: Chris Morris

    Date: 28 Jun, 2017

    The initial discussions that led to this document, as part of the West-Life project (www.west-life.eu), were organised through the Structural Biology IG of the RDA, and some of the contributors to the document are members of the RDA. We additionally reached out into the structural biology research community and gained valuable contributions from people who are not (yet) members of the RDA.

    Notably, many concepts in the design of West-Life itself crystallized thanks to the interaction of the SBIG with other RDA groups, beyond discussions within the community. This is a success for the RDA, which we have acknowledged on many public occasions. Some of the issues identified in the first report of the Interest Group are currently being addressed by West-Life. The effort to develop this report was also funded by the West-Life grant, and the first draft was submitted as a deliverable of that grant. The RDA engages with research communities at a variety of levels of maturity, so the right way to engage varies case by case. By convening the SBIG, we had a chance to involve key personalities in the field who represent global perspectives. Dr John Helliwell reported on this document (and other RDA discussions) to the International Union of Crystallography (http://www.iucr.org/), arguably the most important organization for crystallographic research, which is not limited to biology. Moreover, the Director of PDB Europe, Dr Sameer Velankar, provided a global perspective of the world-wide PDB (wwPDB), which covers all regions and not just Europe. The wwPDB itself has been fostering relevant discussion with different task forces and workshops.

    Structural Biology is far from a green field for data management. The Protein Data Bank was founded in 1971. The present outreach was the only way of developing a document that is either accurate or representative of a consensus among leaders of the community. As the draft reflects, the discipline of structural biology has a rich variety of infrastructure and practices for data management, but also some room for improvement. We believe the present document will help to foster such improvement.

  • Author: Dimitrios Koureas

    Date: 23 Jun, 2017

    This is a very interesting summary of the existing practices related to data flows of structural biology information. The authors mention that the described data workflow follows outputs of the UK Data Archive. The article is indeed a useful reference point for practitioners in that field of study.

    I would like, however, to raise some concerns regarding the status of such a report as an RDA supporting output. It is unclear to me what work was undertaken in the relevant RDA group that led to this document. It is also not very clear how this document is expected to be used to inform decisions of practitioners within the domain or, more importantly, in other domains.

    I would argue that the RDA community would benefit more from this robust work if the authors addressed the two points above.

     

  • Author: Antonio Rosato

    Date: 27 Jun, 2017

    I received the following comment from Dr. John Markley, who authorized me to post it on his behalf. Dr. Markley is, among other duties, Director of the only public repository for biological NMR data (BMRB: www.bmrb.wisc.edu/). The BMRB, which is based in Wisconsin, USA, is a member of the World-Wide PDB.

    -------- Forwarded Message --------

    Subject: RE: Request for Comments
    Date: Wed, 14 Jun 2017 14:03:46 -0400
    From: John Markley <johnlmarkley@gmail.com>
    To: 'Antonio Rosato' <rosato@cerm.unifi.it>

     

    Dear Antonio,
    Thanks for sending me the draft report. It is well written and fairly comprehensive.
    I have a couple of suggestions. 
    It would be good to include some discussion of BMRB as a repository for NMR data associated with structures that are beyond the scope of the data in the PDB archive. BMRB captures more extensive metadata, and data on a broad variety of NMR experiments.(1) BMRB also captures primary (time-domain) data that is useful for reproducibility and for technique development.
    It would also be worth mentioning the NMRbox project, which aims at archiving workflows and associated software packages.(2) 
    Best regards,
    John
    
    1.	Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, Nakatani E, Schulte CF, Tolmie DE, Kent Wenger R, Yao H, Markley JL. BioMagResBank. Nucleic Acids Res. 2008;36(Database issue):D402-8. PubMed PMID: 17984079.
    2.	Maciejewski MW, Schuyler AD, Gryk MR, Moraru, II, Romero PR, Ulrich EL, Eghbalnia HR, Livny M, Delaglio F, Hoch JC. NMRbox: A Resource for Biomolecular NMR Computation. Biophys J. 2017;112(8):1529-34. doi: 10.1016/j.bpj.2017.03.011. PubMed PMID: 28445744; PMCID: PMC5406371.
    

     

  • Author: Matthew Viljoen

    Date: 29 Jun, 2017

    This is an interesting document outlining the data life cycle. I particularly welcome the inclusion of the topic of discarding data, an area often overlooked. There are, however, some further aspects that this topic could cover, such as the potential dangers of discarding data and the controls that are (or could be) included in this part of the life cycle.

    Finally, it would be helpful if the section on Preserving Data and Giving Access to Data could include some of the challenges and solutions involved in giving access to data - especially within the context of ever larger data sets from newer detectors.

  • Author: Antonio Rosato

    Date: 03 Jul, 2017

    Just read the report you sent. I actually enjoyed reading it. I fully agree with its contents. A few very minor points to include:
    
    a) These days many beamlines also have web-based interfaces (a compilation of different programs) for both in-line and remote data processing and structure determination (MR and SAD methods). One example is RAPD at the NE-CAT beamline of APS, Chicago (https://rapd.nec.aps.anl.gov/rapd/). I am sure similar resources are available at the synchrotrons in Europe.
    b) In addition to MrBUMP for automated MR (page 15), there is another pipeline that people use for molecular replacement, BALBES (PMID: 18094476).
    
    These are just minor points. Let me know if you have any questions.
    
    Best regards
    
    Yogesh
      
    

    -- Yogesh Gupta, PhD, Assistant Professor, Greehey Children's Cancer Research Institute & Dept. of Biochemistry & Structural Biology, Univ. of Texas Health Science Center, 8403 Floyd Curl Drive, San Antonio, TX

  • Author: Antonio Rosato

    Date: 03 Jul, 2017

    Hello Chris and Antonio,
    I think the Life Cycle of Structural Biology Data document draft is looking excellent. 

    My suggestions are:-
    On page 13, regarding options for depositing raw diffraction images / primary data, I commend mention of Zenodo. (You do mention it elsewhere, but here there is a very specific need to mention it again.) You could also usefully mention the Kroon-Batenburg et al. (2017) IUCrJ article as an overview for MX (https://journals.iucr.org/m/issues/2017/01/00/ti5008/ti5008.pdf).

    In the reference list, I request that all authors of Minor et al. be cited explicitly, not least because the point made in the Minor et al. text about access to data for referees is my portion of that paper!

    Finally, there is the question of what referees should do with their access to data for submitted articles. Whilst chemical crystallography has a clear workflow and tradition for assessing an article together with its underlying data, both automatically via the checkCIF report and through overall human scrutiny of the data (http://www.iucr.org/__data/assets/pdf_file/0003/80274/AL_Data_Validation.pdf), this is not so clear in general (https://arxiv.org/abs/1704.02236) or for MX in particular (https://arxiv.org/abs/1704.08848).

     

    Greetings, 

    John 

    Emeritus Professor John R Helliwell DSc

  • Author: Gaetano Montelione

    Date: 06 Jul, 2017

    On Page 1. structural biology... investigates the molecular basis of life by discovering and interpreting the shapes AND MOTIONS of macromolecules

     

    Page 10.  "The NMR Validation Task Force (Montelione, 2013) also strongly encourages depositors of biomolecular NMR structures to archive (where available) NOESY peak lists and other experimental data, including unprocessed free induction decay (FID) data, in the BioMagResBank. However, to date, only a handful of research groups have followed these recommendations."   This point might also be  better made on page 13, under Preserving Data?

    Page 13.  Should mention that, at least currently, NEF is "particularly well developed for representing NMR-derived restraints, and sharing them between structure-generation programs".  

    Page 17.  Regarding "discarding data".  It should also be mentioned here that obsoleted coordinates, and the data used to generate them, are very valuable for testing new methods of structure quality assessment.  For this reason, "an archive of annotated obsoleted structures and data should be maintained, separately from the currently recommended model(s)".

    Additional comments

    Consistent data formats are needed for reproducibility and transparency, and also to allow uniform structure quality assessment and validation.  Such validation is essential for ensuring the usefulness of structural biology data for the broader biological community.

    "Several efforts have been made to track the various processes of NMR data analysis, and to generate a comprehensive archive of intermediate and final results, but to date none of these efforts has resulted in a broadly adopted platform".  

    Many of the logistical issues of data archiving can be addressed by (i) developing common formats and conventions for data exchange by consensus discussions with leaders of the communities and (ii) implementation of these formats and conventions for data exchange by the developers who create the most widely used software tools for structural biology.

     

  • Author: Mark Parsons

    Date: 06 Jul, 2017

    This is a valuable work, but I don’t think it is reasonable to call it an RDA output yet. It is much more clearly a West-Life output that was influenced by RDA activity. Even some of the comments in this RFC don’t come from RDA members and are entered by proxy. RDA is not in the practice of endorsing project deliverables.

    That said, it seems like it could easily become an RDA output:  just socialize it more within RDA, respond to the comments above, and demonstrate a broader and truly international acceptance beyond the West-Life project.

     

  • Author: Christine Zardecki

    Date: 18 Jul, 2017

    Thank you for sharing this document with us.  A few comments below:

    In the Introduction, the number listed is the number released to the public, and not the number of entries deposited.    Both numbers are listed by year at https://www.wwpdb.org/stats/deposition

    In section 5, the number of data downloads in 2015 is 534,339,871.

    At the end of section 2 (data processing), there is some discussion of publication standards, data standards and validation standards which would be better placed in Section 4 (preserving and access).

    Section 4 could acknowledge the value added by the repositories in managing data during the life cycle. The PDB and other repositories are active in developing, maintaining and promoting community data content and quality standards; supporting tools for data harvesting; integrating content with other resources; maintaining reference data; and delivering a single archive.  

    In Section 6, please note that marking structures as obsolete does not delete the data.  Obsolete entries remain available to the public through the ftp archive in order to preserve these data as part of the historical scientific record.

    The data deleted or unaccounted for in this life cycle are those that do not result in a successful structural outcome. This collection of data, which may have some future value, is currently lost to the community.

    In the Appendix, we believe the estimated number of 25,000 experimental sessions performed each year is an underestimate.  ~1 out of 10 synchrotron crystal diffraction data sets leads to a publishable structure deposited in the PDB.

    In the References, please cite the wwPDB using Nature Structural Biology 10, 980 (2003) doi: 10.1038/nsb1203-980.
