The Life Cycle of Structural Biology Data
RDA Structural Biology Interest Group
Supporting Output Title: The Life Cycle of Structural Biology Data
Corresponding author: Chris Morris, STFC, Daresbury Laboratory, WA4 4AD
Contributors: Claudia Alen, Lucia Banci, Alexandre Bonvin, Pablo Conesa, Alfonso Duarte, John Helliwell, Yogesh Gupta, Rob Hooft, John Markley, Brian Matthews, Gaetano Montelione, Antonio Rosato, Sameer Velankar, Matthew Viljoen, Geerten Vuister, John Westbrook, Martyn Winn, and Christine Zardecki.
Research data is acquired, interpreted, published, reused, and sometimes eventually discarded. This document reports how structural biologists perform these tasks, and recommends improvements to the infrastructure available to them.
Research data is acquired, interpreted, published, reused, and sometimes eventually discarded. Understanding this life cycle better will help the development of appropriate infrastructural services, ones which make it easier for researchers to preserve, share, and find data.
Structural biology is a discipline within the life sciences, one that investigates the molecular basis of life by discovering and interpreting the shapes of macromolecules. Structural biology has a strong tradition of data sharing, expressed by the founding of the Protein Data Bank (PDB) in 1971 (PDB, 1971). In the early years, data submissions to the archive were made by mailing decks of punched cards. The culture of structural biology is therefore already in line with the perspective of the European Commission that data from publicly funded research projects are public data (COM(2011) 882 final).
This report is based on the data life cycle as defined by the UK Data Archive. This is the most clearly defined workflow that the authors are aware of. It identifies six stages: creating data, processing data, analysing data, preserving data, giving access to data, and re-using data. Each will be discussed below. However, the data infrastructure for structural biology is not a perfect match for this workflow. For clarity, 'preserving data' and 'giving access to data' are discussed together. We also add a final stage to the life cycle, 'discarding data'.
Changes in research goals and methods have led to some changes in the requirements for IT infrastructure. A common data infrastructure is required, giving a simple user interface and simple programmatic access to scattered data. Progress on these tasks will support the development of workflows that facilitate the use of datasets from different facilities and techniques. The automatic acquisition of metadata can help. Large experimental centres already provide a highly professional data infrastructure. For smaller centres this is onerous; it is desirable that a standard package be provided that enables them to use the European e-infrastructure resources in a way that integrates with other structural biology resources.
Author: Rainer Stotzka
Date: 13 Jun, 2017
(The thoughts I am describing in this comment reflect my personal opinions as an RDA member.)
The report proposed as an "RDA Supporting Output" summarizes the European situation in research data management in some fields of structural biology. It concludes that a common data infrastructure is required "making the facilities offered by EUDAT and INDICO (European projects) more directly accessible to structural biologists".
The report does not cover the situation and progress of structural biology in the other regions (the Americas, Africa, Asia, and Australia), nor those of related scientific domains with similar characteristics. It is nearly identical to the deliverable D3.1 http://internal-wiki.west-life.eu/w/index.php?title=File:Assessment_of_t... of the H2020 West-Life project http://about.west-life.eu/ .
To my knowledge the IG Structural Biology was inactive from Feb 2013 until Apr 2017, without any meetings or open discussions visible to all RDA members. At P9 in Barcelona the co-chairs organized a session which I attended from 16:00 until 18:00: 5th April 2017, Breakout 3, 16:00 – 17:30 IG Structural Biology: The Life Cycle of Structural Biology Data (https://www.rd-alliance.org/ig-structural-biology-rda-9th-plenary-meeting). The proposal of the report as an "RDA Supporting Output" was neither listed in the agenda (see link above) nor discussed during the official meeting time.
In my opinion the report “The Life Cycle of Structural Biology Data” does not represent an RDA consensus produced by an open and balanced discussion.
I would recommend facilitating an open and transparent discussion within the IG, involving members from all regions, and examining which existing outputs and recommendations (from RDA, W3C, WDC, CODATA, …) can be adopted to improve research data management in structural biology. The newly forming IG on Disciplinary Interoperability Framework could provide new insights for future implementation. This could result in a real RDA community- and consensus-driven output that can be consolidated by a meeting at Plenary 10 in Montreal.
I would welcome the revitalization of the IG Structural Biology by this process.
Author: Chris Morris
Date: 28 Jun, 2017
The initial discussions that led to this document, as part of the West-Life project (www.west-life.eu), were organised through the Structural Biology IG of the RDA, and some of the contributors to the document are members of the RDA. We additionally reached out into the structural biology research community and gained valuable contributions from people who are not (yet) members of the RDA.
Notably, many concepts in the design of West-Life itself crystallized thanks to the interaction of the SBIG with other RDA groups, beyond discussions within the community. This is a success for the RDA, which we have acknowledged on many public occasions. Some of the issues identified in the first report of the Interest Group are currently being addressed by West-Life. The effort to develop this report was also supported by this grant, and the first draft was submitted as a deliverable of that grant. The RDA engages with research communities at a variety of levels of maturity, so the right way to engage varies case by case. By convening the SBIG, we had a chance to involve key personalities in the field who represent global perspectives. Dr John Helliwell reported on this document (and other RDA discussions) to the International Union of Crystallography (http://www.iucr.org/), arguably the most important organization for crystallographic research, not limited to biology. Moreover, the Director of PDB Europe, Dr. Sameer Velankar, provided a global perspective of the world-wide PDB (wwPDB), which covers all regions and not just Europe. The wwPDB itself has been fostering relevant discussion with different task forces and workshops.
Structural Biology is far from a green field for data management. The Protein Data Bank was founded in 1971. The present outreach was the only way of developing a document that is either accurate or representative of a consensus among leaders of the community. As the draft reflects, the discipline of structural biology has a rich variety of infrastructure and practices for data management, but also some room for improvement. We believe the present document will be instructive to foster such improvement.
Author: Dimitris Koureas
Date: 23 Jun, 2017
This is a very interesting summary of the existing practices related to data flows of structural biology information. The authors mention that the described data workflow follows outputs of the UK Data Archive. The article is indeed a useful reference point for practitioners in that field of study.
I would like, however, to raise some concerns regarding the status of such a report as an RDA supporting output. It is unclear to me what work was undertaken in the relevant RDA group that led to this document. It is also not very clear how this document is expected to be used to inform decisions of practitioners within the domain or, more importantly, in other domains.
I would argue that the RDA community would benefit more from this robust work if the authors addressed the two points above.
Author: Antonio Rosato
Date: 27 Jun, 2017
I received the following comment from Dr. John Markley, who authorized me to post it on his behalf. Dr. Markley is, among other duties, Director of the only public repository for biological NMR data (BMRB: www.bmrb.wisc.edu/). The BMRB, which is based in Wisconsin, USA, is a member of the World-Wide PDB.
-------- Forwarded Message --------
Author: Matthew Viljoen
Date: 29 Jun, 2017
This is an interesting document outlining the data life cycle. I particularly welcome the inclusion of the topic of discarding data, an area often overlooked. There are, however, some further aspects that this topic could cover, such as the potential dangers of discarding data and the controls that are (or could be) included in this part of the life cycle.
Finally, it would be helpful if the section on Preserving Data and Giving Access to Data could include some of the challenges and solutions involved in giving access to data - especially within the context of ever larger data sets from newer detectors.
Author: Antonio Rosato
Date: 03 Jul, 2017
-- Yogesh Gupta, PhD, Assistant Professor, Greehey Children's Cancer Research Institute & Dept. of Biochemistry & Structural Biology, Univ. of Texas Health Science Center, 8403 Floyd Curl Drive, San Antonio, TX
Author: Antonio Rosato
Date: 03 Jul, 2017
Hello Chris and Antonio,
I think the Life Cycle of Structural Biology Data document draft is looking excellent.
My suggestions are:
On page 13, regarding options for depositing raw diffraction images / primary data, I recommend mentioning Zenodo (you do mention it elsewhere, but there is a very specific need to mention it again here). You could also usefully mention the Kroon-Batenburg et al. 2017 IUCrJ article as an overview for MX (https://journals.iucr.org/m/issues/2017/01/00/ti5008/ti5008.pdf).
In the references list I request that for Minor et al. all authors are explicitly cited, not least because the point made in the Minor et al. text about access to data for referees is my portion of that paper!
Finally, there is the question of what referees should do with their access to data for submitted articles. Whilst there is a clear workflow and tradition in chemical crystallography for assessing articles with underlying data, both automatically via the checkCIF report and through overall human scrutiny of the data (http://www.iucr.org/__data/assets/pdf_file/0003/80274/AL_Data_Validation.pdf), this is not so clear in general (https://arxiv.org/abs/1704.02236) or for MX in particular (https://arxiv.org/abs/1704.08848).
Emeritus Professor John R Helliwell DSc
Author: Gaetano Montelione
Date: 06 Jul, 2017
Author: Mark Parsons
Date: 06 Jul, 2017
This is a valuable work, but I don’t think it is reasonable to call it an RDA output yet. It is much more clearly a West-Life output that was influenced by RDA activity. Even some of the comments in this RFC don’t come from RDA members and are entered by proxy. RDA is not in the practice of endorsing project deliverables.
That said, it seems like it could easily become an RDA output: just socialize it more within RDA, respond to the comments above, and demonstrate a broader and truly international acceptance beyond the West-Life project.
Author: Christine Zardecki
Date: 18 Jul, 2017
Thank you for sharing this document with us. A few comments below:
In the Introduction, the number listed is the number released to the public, and not the number of entries deposited. Both numbers are listed by year at https://www.wwpdb.org/stats/deposition
In section 5, the number of data downloads in 2015 is 534,339,871.
At the end of section 2 (data processing), there is some discussion of publication standards, data standards, and validation standards, which would be better placed in Section 4 (preserving and access).
Section 4 could acknowledge the value added by the repositories in managing data during the life cycle. The PDB and other repositories are active in developing, maintaining and promoting community data content and quality standards; supporting tools for data harvesting; integrating content with other resources; maintaining reference data; and delivering a single archive.
In Section 6, please note that marking structures as obsolete does not delete the data. Obsolete entries remain available to the public through the ftp archive in order to preserve these data as part of the historical scientific record.
The data that are deleted or unaccounted for in this life cycle are those that do not result in a successful structural outcome. It is this collection of data, which may have some future value, that is currently lost to the community.
In the Appendix, we believe the estimated number of 25,000 experimental sessions performed each year is an underestimate. Roughly 1 in 10 synchrotron crystal diffraction data sets leads to a publishable structure submitted to the PDB.
In the References, please cite the wwPDB using Nature Structural Biology 10, 980 (2003) doi: 10.1038/nsb1203-980.
Author: Chris Morris
Date: 22 Aug, 2017
Thanks to all for the comments above; the report has now been updated.