By Kisun Pokharel, RDA EU Early Career Grant Winner – University of Helsinki
I attended the International Data Forum (15 September 2016) and the 8th Plenary Meeting of the Research Data Alliance (RDA; 16–17 September 2016), which were organized as part of International Data Week in Denver, USA. I would first like to thank RDA-EU for providing the early career award.
As part of my PhD project, I work with DNA and RNA sequence data from different animal species, and my work relies heavily on open-source tools and software for analyzing the data. In addition to using my own data and sharing it with the public, I depend heavily on open data and databases for my research. Thus, I already belong to the data science community. However, attending the IDF and the RDA plenary meeting in Denver exposed me to a vast array of aspects of data science that I had never considered.
At the International Data Forum (IDF), a series of keynote talks and panel discussions highlighted both the opportunities and the challenges of the data revolution. The full-day meeting was divided into five panel sessions. While the discussions and presentations during the IDF were more general, the RDA plenary sessions gave a very good understanding of how people from industry, government and academia work together under the umbrella of the RDA on data generation, data management, data analysis and data sharing.
As a newcomer to the RDA community, it was an excellent opportunity for me to understand the RDA as an organization, to learn about the interesting collaborative work done by different working groups and, most importantly, to get to know others working with data. A dedicated session, “RDA for newcomers”, on the first day covered all the essential information and was extremely useful for newcomers like me. In addition, an informal “early-career lunch” was a nice way to meet fellow early-career award winners and build a network. I kept myself engaged throughout the meeting, and every session I attended was equally interesting.
As an early career award winner, I was assigned to assist the chairs of two meetings, which mainly involved taking notes on the current and future activities of the groups. The joint meeting of IGAD and the different agricultural working groups was of particular interest to me. During the meetings of the IGs and WGs, overall progress as well as future plans were discussed. It is amazing how people of diverse academic backgrounds, expertise and geographical locations voluntarily work together in a WG to create data infrastructures within the short span of 18 months. Personally, I think that is one of the strongest assets of the RDA.
All in all, the Denver meeting was very rewarding for me. I am now more aware of community-based research, open science and big data. Having joined a number of interesting IGs and WGs, I look forward to being actively involved in them in the near future and to advancing my career as a genomic data scientist.
Open data as a public good and the responsibilities of scientists
In the following paragraphs, I will highlight the topics that were addressed during the second session of the International Data Forum, “Open data as a public good and the responsibilities of scientists”.
Victoria Stodden started her talk by highlighting the complexity of big data. While we are still collecting existing data, more data is generated every day. “The data is big in terms of its different types, styles and velocity”. While existing data in many cases still need improved methods to yield meaningful results, new kinds of data demand new kinds of methods. The data revolution has already started. The computational aspects of data are just as important as the underlying methods for analyzing it. Whether we work with real or simulated data, we are doing computationally intensive experiments and analyses, said Victoria. Thus, collecting data, storing and analyzing it, and the underlying methods are all equally important.
Victoria mentioned three different types of reproducibility during her talk: empirical, statistical and computational. The importance of reproducibility to data scientists is unquestionable. “Every time researchers report new findings, they document the underlying methods in detail so that another scientist can retrace the steps and obtain a similar result, which is referred to as reproducible research”. Although all scientific experiments are meant to be reproducible, it is worth mentioning that some experiments are irreproducible (such as those involving fossil specimens) and others (such as biological experiments influenced by environmental factors) can be difficult to reproduce.
Nowadays the majority of scientific disciplines depend on computer software for their research. Computational reproducibility is therefore vital, and it is exciting from the reproducibility perspective because we can capture far more tacit knowledge than ever before: many steps can be recorded on a computer. Although computational reproducibility should be achievable in theory, a number of factors affect it in practice: the complexity of the software, its manual or tutorial, the operating system, the available parameters and settings, and the software version. It may also be that not all the thinking behind the experimental design can be captured in the software design, which is why we need to try different settings and run different models before we get something that matches our hypothesis.
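The factors listed above, operating system, parameter settings and software versions, are exactly the details worth recording alongside any computational result. As a minimal, illustrative sketch (my own, not something presented at the session; the package names passed in are placeholders), one could snapshot the computing environment like this:

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Record the details that most often break computational
    reproducibility: the OS, the Python version, and the versions
    of the packages the analysis depends on."""
    record = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": {},
    }
    for name in packages:
        try:
            record["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Still record the dependency, even if it is missing here.
            record["packages"][name] = "not installed"
    return record


if __name__ == "__main__":
    # Save this snapshot next to the analysis results, e.g. as JSON.
    env = capture_environment(["numpy", "pandas"])
    print(json.dumps(env, indent=2))
```

Storing such a snapshot with each set of results does not make the analysis rerunnable by itself, but it documents the environment that produced it, which is often the missing piece when someone later tries to reproduce a computation.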
Takashi Onishi spoke about recent progress in Japan regarding open science and open data. He mentioned three recommendations set by the Science Council of Japan (SCJ) regarding open data: first, establishing a research data infrastructure that enables open innovation by promoting interdisciplinary integration and social implementation; second, letting research communities decide their own open–close strategy for data under a holistic guideline provided by national science academies such as the SCJ; and third, adopting measures for the career design of data producers and data curators. Takashi noted that cooperation between publishers and funders with regard to open science may lead to better research. Such initiatives already exist among many organizations across the world. For instance, national funding agencies such as the Academy of Finland promote open science by committing funded projects to open-access publishing and open data.
Victoria mentioned during the panel discussion that “you are never dealing with data without dealing with software – they come intertwined”. Open-source software is the backbone of open science, but such software should also be sustainable. However, Victoria argued that even when open software is not sustainable – for instance, when we cannot run the code next year, or at all – there is still value in its openness: we can at least get an idea of how the code was implemented, what parameter settings were available, and so on. Thus, inspecting code is genuinely useful even if it is no longer executable. There has been a lot of progress lately on the portability of virtual machines and containers, which addresses the sustainability problem. We do not expect all software to run forever; after all, useful code persists while the rest falls by the wayside, possibly remaining open for reproducibility or verification purposes.
Data as a public good
Data-driven work can provide important new insights that could ultimately improve lives across the world. Takashi Onishi talked about how open data contributes to disaster risk reduction (DRR). After the East Japan earthquake and tsunami of 2011, the Science Council of Japan (SCJ) initiated a network of 52 academic organizations and more than 160 scientific societies to work on DRR. The network comprised experts from areas such as seismology, geoscience, architecture, civil engineering, acute medicine and disaster nursing. Takashi showed an example where interdisciplinary collaboration between science and engineering fields produced an emergency braking system for bullet trains that stops them immediately once primary earthquake waves are detected. The earthquake that struck Nepal on 25 April 2015 also showed the crucial role big data can play in both disaster response and relief. Google’s Person Finder and Facebook’s Safety Check were used to track missing loved ones. Moreover, a number of open-source projects such as OpenStreetMap, QuakeMap.org, and Open Data Kit helped with disaster relief tasks such as damage assessment, relief mobilization and reconstruction. While it is unlikely that big data will help prevent such natural disasters, it has already saved many lives, and its role as a public good will only grow in the future.
Responsibilities of data scientists
The power of data science is unquestionable, but there are issues regarding the responsibilities of data scientists. Myron P. Gutmann focused his talk on data confidentiality, one of the most important responsibilities of a data scientist. We work with information about people that they might not want made public; while collecting that information, we promise to protect their privacy and confidentiality. Myron said that in the world of data sharing, we often find ourselves in situations that create tension between protecting individuals and making data available. We therefore need to find a balance between access and protection, which can be achieved by overcoming technological as well as legal and institutional barriers. The technological aspects include the storage and management of confidential data: some data need restricted access, while other data can be made partially accessible. Thus, one of the responsibilities of a data scientist is to find ways to answer questions without revealing confidential information. Beyond confidentiality, there are other areas where data scientists should take responsibility. We should be responsible for how the data are analyzed and for ensuring that the results are free of errors. Similarly, data scientists should be responsible for transparency. Victoria gave the example of Google Flu Trends, which produced inaccurate estimates; that example lacked both transparency and accuracy. Failing to address transparency and accuracy can have serious consequences.
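One standard way of answering questions without revealing confidential information, offered here as my own illustrative sketch rather than anything presented at the session, is small-cell suppression: publish aggregate counts, but withhold any group so small that its members could be identified. The field name and the threshold of 5 below are hypothetical:

```python
from collections import Counter


def suppressed_counts(records, key, threshold=5):
    """Aggregate individual records into group counts, suppressing any
    group smaller than the threshold so that rare (and therefore
    potentially identifying) values are never released."""
    counts = Counter(r[key] for r in records)
    return {
        group: (n if n >= threshold else "<suppressed>")
        for group, n in counts.items()
    }


# Hypothetical survey records: ages are bucketed, and one bucket is rare.
records = [{"age_band": "30-39"}] * 12 + [{"age_band": "90-99"}] * 2
print(suppressed_counts(records, "age_band"))
# The 90-99 group has only two respondents, so its count is withheld.
```

The question “how many respondents are in each age band?” is still answered for every sufficiently large group, while the rare group that could single out an individual is protected, exactly the kind of access-versus-protection balance the talk described.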