Andreas Rauber is Associate Professor at the Department of Software Technology and Interactive Systems (ifs) at the Vienna University of Technology (TU-Wien).
He is also president of AARIT, the Austrian Association for Research in IT, a Key Researcher at Secure Business Austria (SBA-Research), and Co-Chair of the RDA Working Group on Dynamic Data Citation. He received his MSc and PhD in Computer Science from the Vienna University of Technology in 1997 and 2000, respectively. In 2001 he joined the National Research Council of Italy (CNR) in Pisa as an ERCIM Research Fellow, followed by an ERCIM Research position at the French National Institute for Research in Computer Science and Control (INRIA) at Rocquencourt, France, in 2002. From 2004 to 2008 he was also head of the iSpaces research group at the eCommerce Competence Center (ec3).
His research interests cover the broad scope of digital libraries and information spaces, specifically including text and music information retrieval and organization, information visualization, data analysis, and digital preservation, all of which have recently begun to merge under the umbrella of reproducible science.
When: Day 1 - 14th November, Session 2 Setting the Context, 11:10 - 12:40
Reproducibility challenges in computational settings: what are they, why should we address them, and how?
Abstract. Reproducibility of experiments is a key foundation in the empirical sciences. Yet both the perceived complexity and the proposed solutions sometimes fail to grasp the full extent of the problem. At the same time, reproducibility is often treated as a goal in its own right, without questioning what precisely we gain from the effort required to make a specific experiment reproducible. Last, but not least, we need to consider that computational aspects pervade virtually all scientific disciplines, yet we cannot expect every domain scientist to become an expert in addressing computational reproducibility issues.
In this talk Andreas will review a few examples of reproducibility challenges in computational environments and discuss their potential effects. Based on discussions in a recent Dagstuhl seminar, he will identify different types of reproducibility, focusing specifically on what we gain from them rather than seeing them merely as means to an end. He will then address two core challenges impacting reproducibility, namely (1) understanding and automatically capturing process context and provenance information, and (2) approaches for dealing with dynamically evolving data sets, relying on recommendations of the Research Data Alliance (RDA). The goal is to ensure reproducibility transparently, not only in strictly defined benchmark settings but also in operational ones. After all, we want to ensure that results obtained under operational conditions are scientifically solid as well, and that they can be analyzed, traced, and reproduced even when obtained in a dynamically changing, complex world.
When: Day 2 - 15th November, Session 9: DMP Technical Services #Part 2
Reproducibility: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation
Abstract. In order to repeat an earlier study, or to apply data from an earlier study to a new model, we need to be able to precisely identify the exact subset of data used. Verbal descriptions of how the subset was created (e.g. by providing selected attribute ranges and time intervals) are hardly precise enough and do not support automated handling, while keeping redundant copies of the data in question does not scale to the big data settings encountered in many disciplines today. Conventional approaches, such as assigning persistent identifiers to entire data sets, individual subsets, or data items, are not sufficient to meet these requirements. The problem is further exacerbated if the data itself is dynamic, i.e. if new data keeps being added to a database, if errors are corrected, or if data items are deleted.
In this talk we will review the challenges identified above and discuss the solutions and recommendations that have been elaborated within the context of a Working Group of the Research Data Alliance (RDA) on Data Citation: Making Dynamic Data Citeable. These approaches are based on versioned and time-stamped data sources, with persistent identifiers being assigned to the time-stamped queries/expressions that are used for creating the subset of data. We will review examples of how these can be implemented for different types of data, including SQL-style databases and CSV or XML files, present operational pilots currently under development, and see how this fits into the larger context of activities on Data Citation.