- Arriving late to the party
For three days in Philadelphia, I attended a plethora of working groups, interest groups, and keynotes, even though, admittedly, I had no idea at that point what the difference between an interest group and a working group was. Being an early career researcher in research data management, with a background in the social sciences and humanities no less, comes with certain advantages when attending an RDA plenary meeting for the first time, especially when that meeting’s overarching topic is data and responsibility.
First, research data management seems to be primarily the concern of researchers with scientific and/or technical backgrounds, whose agendas can be quite far removed from those of humanities scholars and researchers. It was thus refreshing to be reminded that, even though few cared to admit it, the problems and issues surrounding research data management are as much technical and managerial as they are organisational and social in nature (in the social sciences this is fairly old news, but it may not yet have filtered through to other fields). Having said this, I was especially pleased to discover that within the RDA there are already like-minded people such as Brandon Costelloe-Kuehn and Lindsay Poirier, with comparable and arguably more developed agendas, with whom to interact and from whom to learn in at least two ways. One, research data management is as much a social/organisational issue as it is a technical problem, and so we had better make use of sociological, anthropological, (insert your favourite social science here) methodologies as best we can to understand it. Two, the social sciences produce and (re)use research data as well, which creates problems that are very similar to (and yet very different from) those faced by other scientists. As Sabina Leonelli (2016), for example, notes, re-using (research) data involves packaging them for travel, transferring them across (geographical, institutional, sometimes disciplinary) contexts and re-opening them. Needless to say, this involves interpretation as much as it necessitates coming to grips with those different contexts of use. Understanding context is arguably very well developed in the “hard” sciences, even though one might expect the SSH to be following suit (after all, they deal with meanings and interpretations).
As Steve McFeely aptly remarked during the “Data for Sustainable Development” plenary (Wednesday 3 April), at RDA people talk a lot about data management and data science, but hardly ever about statistics.
In any case, we had better find out where there are overlaps and opportunities for (mutual?) learning. It should be noted here that data (whatever we mean by that; the meetings I attended made it abundantly clear that the term is notoriously ambiguous) have technical, social, economic, ethical and political aspects (the list can be extended) all at the same time; they are not one or the other, though some aspects might be more important than others depending on the situation.
- Joining the conversation: Working Groups and Interest Groups
Second, many conversations within RDA have been going on for a while, so joining a Working Group that is about to wrap up its discussion can feel a bit like arriving late to a party where everyone already has someone to talk to and no one can make time to introduce the newcomers. On the other hand, this very particular situation can be an asset in that it allows the newcomer to focus on outcomes instead of processes/discussions, and to make the most of the opportunity to acquire knowledge while being unable to contribute much to the discussion. In addition, RDA appears to be quite complex as an organisation, so a lot of time must be devoted (at least by the novice) to deciphering its inner workings. What’s an Interest Group for, how do Working Groups work, and what can be expected at a BoF? These questions do not have easy answers for the newcomer. Attending an entire meeting was not enough to grasp RDA in full, I’d say, but it was long enough to get excited about its many initiatives. Again, having a social science background comes in handy here, since talking about research data management is as much a social and cultural phenomenon as actually managing research data.
- First Impressions: The responsibility of data scientists
As an early career grantee, I had the great pleasure of joining several working groups and a few interest groups. Notably, I observed the proceedings of one of two Data Management Planning Tools Working Groups, a topic which is being discussed at Interest Group level as well (IG Active DMPs), the Data Science Ethics Interest Group which discussed challenges in data ethics but notably excluded issues of privacy, and the Libraries for Research Data Interest Group. The Data Properties as Economic Goods Interest Group brought issues of data commodification such as business models and institutional impact on data ownership to the stage, whereas the Empirical Humanities Working Group discussed guidelines for Metadata in the Social Sciences and Humanities.
Julia Stoyanovich’s keynote seemed especially apt in hindsight as an initial foray into the world of RDA, as she spoke about the pitfalls of translating ethical standards (fairness, accountability, transparency) into technological standards and made a strong case as to why this translation is necessary. Julia went to great lengths to explain what it might mean to do data science responsibly, using an array of real-world examples, some of which were positively alarming to the data management novice. Problems of bias abound, and while they may in part result from technical decisions, she stressed that the solution cannot and will not be (merely) technical, since bias may reside not in the algorithms per se, but rather in the trade-offs necessary to operate them. The problem of how to achieve fairness in (machine) classification is essentially unsolvable, as it pertains to the way outcomes are assigned to sub-populations, which in turn depends on the underlying concept of fairness. Julia thus traced the problems of algorithmic transparency and accountability to their philosophical foundations: Essentially, the problem concerns the world-view which underlies a choice of algorithm. As a consequence, the solution depends on choice of world-view and therefore cannot be purely technical.
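To make the point about competing notions of fairness a little more tangible, here is a minimal sketch in Python. The numbers, the group labels and the choice of the two measures (equal selection rates versus equal true-positive rates) are my own illustrative assumptions, not examples taken from the keynote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    group: str        # hypothetical sub-population label ("A" or "B")
    qualified: bool   # whether the person actually meets the criterion
    selected: bool    # what the decision rule did with them


def build(group, qual_sel, qual_rej, unq_sel, unq_rej):
    """Create one group from counts of (qualified/unqualified) x (selected/rejected)."""
    return (
        [Person(group, True, True)] * qual_sel
        + [Person(group, True, False)] * qual_rej
        + [Person(group, False, True)] * unq_sel
        + [Person(group, False, False)] * unq_rej
    )


def selection_rate(people, group):
    """Share of the group that was selected (equal rates = 'demographic parity')."""
    members = [p for p in people if p.group == group]
    return sum(p.selected for p in members) / len(members)


def true_positive_rate(people, group):
    """Share of the group's qualified members that was selected."""
    qualified = [p for p in people if p.group == group and p.qualified]
    return sum(p.selected for p in qualified) / len(qualified)


# Invented numbers: 6 of 10 people in group A are qualified, but only 2 of 10 in group B.
# Rule 1 selects exactly the qualified people in each group.
rule_1 = build("A", 6, 0, 0, 4) + build("B", 2, 0, 0, 8)
# Rule 2 selects four people from each group, whatever the share of qualified people.
rule_2 = build("A", 4, 2, 0, 4) + build("B", 2, 0, 2, 6)

for name, people in (("rule 1", rule_1), ("rule 2", rule_2)):
    for group in ("A", "B"):
        print(
            name,
            group,
            f"selection rate {selection_rate(people, group):.2f}",
            f"true-positive rate {true_positive_rate(people, group):.2f}",
        )
```

Rule 1 treats qualified people in both groups alike but selects the two groups at different rates; rule 2 selects both groups at the same rate but passes over qualified members of group A. Neither rule is wrong in a technical sense: which of them counts as fair depends on the definition of fairness one starts from.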
- Getting down to the nitty gritty: Talking Data Management Plans
Having a data management plan (DMP) in place is increasingly becoming mandatory in national and international grant applications. For me, it was therefore all the more interesting to hear about the outcomes of the DMP Common Standards Working Group, which was launched in October 2017. The Working Group has given some thought to how DMPs can be made usable by getting the right information in at the right time and by making them interoperable with other systems. The issues with current DMPs thus concern repetitive information and systems that do not integrate. The solution proposed by the Working Group involves an automated data management flow specifying which information needs to be provided when and by whom, to make the system more efficient. This would incidentally also help to overcome the shortcomings of existing DMPs: they are vague, need to be completed manually and are usually not updated. Machine-actionable DMPs (maDMPs) need common data models to model information from a standard DMP; the DMP model should use existing standards (e.g. persistent identifiers) wherever possible and only develop new concepts where necessary. The solution works by mapping out processes to identify the tasks performed by stakeholders, which in turn determines which systems need to be put in place (e.g. an maDMP repository or a costing service), which concepts need to be developed and which models can be used to automate tasks. Interestingly, while the group is ultimately deploying a technical solution (i.e. a machine-actionable DMP), a lot of the work that went into it is sociological in nature.
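As a rough illustration of what “machine-actionable” could mean in practice, the sketch below models a DMP as structured data in Python. The field names, the placeholder identifiers and the `missing_information` helper are hypothetical; they do not reproduce the Working Group's actual data model.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Dataset:
    title: str
    identifier: str = ""   # persistent identifier such as a DOI; empty until assigned
    license: str = ""
    repository: str = ""


@dataclass
class DataManagementPlan:
    project: str
    contact_orcid: str     # persistent identifier for the responsible person
    datasets: List[Dataset] = field(default_factory=list)


def missing_information(plan):
    """List the fields that are still empty, so a system could prompt for them."""
    gaps = []
    if not plan.contact_orcid:
        gaps.append("contact_orcid")
    for dataset in plan.datasets:
        for name, value in asdict(dataset).items():
            if not value:
                gaps.append(f"{dataset.title}: {name}")
    return gaps


# All values below are placeholders invented for this example.
plan = DataManagementPlan(
    project="Example project",
    contact_orcid="https://orcid.org/0000-0000-0000-0000",
    datasets=[
        Dataset(title="Interview transcripts", license="CC-BY-4.0"),
        Dataset(
            title="Survey results",
            identifier="doi:10.1234/example",
            license="CC0-1.0",
            repository="Example Repository",
        ),
    ],
)

# Because the plan is structured data, other systems can read it ...
print(json.dumps(asdict(plan), indent=2))
# ... and a workflow can work out what still has to be provided.
print(missing_information(plan))
```

The point of the sketch is only that a plan stored this way can be checked and exchanged automatically, instead of being a free-text document that someone fills in once and never updates.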
- Ethics of Data Science: Privacy is not the issue (or is it?)
The Ethics Interest Group re-iterated themes from the keynote, adding depth to what data scientists’ responsibility might be. Whose turn is it to define the proper ethics for data science, ethicists/philosophers or data scientists? Oya Beyan considered the ethical problem with machine learning in her contribution, echoing themes from Julia Stoyanovich’s keynote. The fundamental problem can be summed up like this: Do we learn from data or do we simply perpetuate human bias? The answer according to Oya lies in observing what we do with data and how/when: Big Data often implies collecting as much data as possible, but we must not forget to consider whether data are representative, whether data have sensitive attributes (e.g. gender). Data scientists tend to check for correlations, but not whether sensitive features can be inferred; likewise, the impact e.g. on minorities (specific groups) is rarely considered.
According to Fran Berman, many ethical problems in data science stem from the diversity of data types involved, particularly when considering interdisciplinary fields such as environmental science or climate science. These transdisciplinary environments are a real challenge to inducing cultural change with respect to sharing research data. To overcome a lack of trust and a sense of data ownership which arguably stand in the way here, Fran recommended forging trust through more interaction and communication and through effective data licensing agreements between the disciplines/groups involved.
Myrna Morales reminded the attendees that ethics and morality refer to democracy and, consequently, to empowerment. Addressing ethical concerns that are not about privacy or surveillance is inextricably linked to fighting for democracy and to re-writing history “from below”, given that data scientists and engineers are predominantly white males with middle-class backgrounds. Accordingly, technologies are often created with this demographic in mind.
Fran Berman talked about bias inherent in algorithms, data, and systems. Transparency thus concerns all these levels, not just algorithms: data can be, and often are, biased as well, which leads to misinterpretation and infamously wrong predictions (such as Hillary Clinton’s victory in 2016). Big Data should therefore be used with caution: they are easily misinterpreted, correlations are often spurious, and Big Data analyses should not replace the scientific method. In too many cases, Big Data analyses are merely offering scientific-sounding answers to problems that are ill-defined on closer inspection. Training (i.e. learning to use data science tools, e.g. to do research) is different from an education in Data Science (i.e. with the goal of becoming a data scientist); often these courses do not include policy and compliance, ethics/RRI, stewardship and data preservation. What kinds of questions are we even asking about the data we collect, and who is asking the questions? Neither data nor algorithms are neutral.
- Data Properties as Economic Goods: Who owns the data?
Incidentally, when I decided to go to the RDA 13th plenary, this was precisely the question I had been pondering for weeks as part of a project I am working on at Graz University of Technology (TU Graz for short). The project is about introducing RDM to TU Graz, first through policy development and later by providing data repositories and introducing data stewards. What struck me most during the project initiation (we talked to several researchers from different fields of expertise) was that many consider research data to be their own personal asset. Given the expectations regarding academic careers, this is understandable even if there are no genuine expectations to monetise these data, but it can prove problematic for introducing Open Science and RDM. It is therefore essential to understand notions of data ownership, even if they have nothing directly to do (in this case) with business models. The IG on Data Properties as Economic Goods tackled precisely these issues, predominantly from an economic perspective. In his presentation, Luis Rios considered the boundary conditions of the commodification of data: economics is useful for understanding well-defined markets, but less so for the ill-defined information industry. Because data are not a traditional commodity, market analogies tend to break down. How can we balance private benefits and public welfare (open source versus data property)? Industrial economies rely on supply-side economies of scale based on the accumulation of resources. Not so with the information economy: connections (nodes) create value and grow non-linearly (information is a non-rivalrous good, which means there is no point of diminishing returns as in industrial economies). Data and information create network effects for those who control the platforms. Intellectual Property is therefore one of the most valuable resources.
The situation is arguably very different for academia: both academia and industry produce data, but academia thrives on attention and reputation as currencies. Open source software has thus had very different effects in the two domains: in academia, users can get citations for providing free software, which creates an alternative incentive. What can be monetised in academia are the services needed to enable data sharing, which need not touch upon the openness of data (incidentally, this is a model which various publishers are now pursuing). The problem with the data economy is that platforms harbour data from suppliers (i.e. users) in exchange for use of the platform, which means that users have very little bargaining power. Greg Madden pointed out in his presentation that it is often the institutional structure of universities which creates ownership problems with respect to data, simply because all data handlers need to undertake practical actions (but which?), and this means that costs accrue, such as technical, administrative, political and institutional opportunity costs as well as researcher opportunity costs. This might be another reason for some researchers’ insistence on not sharing their data.
- Data Management and Epistemic Pluralism
The issue of epistemic pluralism is part and parcel of research data management efforts, it would seem, and it popped up in multiple spaces at the plenary meeting. One of them was the Guidelines for Metadata in the Empirical Humanities Working Group convened by Lindsay Poirier and Brandon Costelloe-Kuehn. Data management poses specific problems for empirical SSH researchers (such as anthropologists and sociologists). First, there is hardly any technical expertise to implement metadata practices, and “standard” standards are typically not intended for research communities embracing epistemological pluralism. Second, empirical SSH researchers are uncomfortable with sharing data due to concerns about data ownership and confidentiality, and often do not know what to include in the metadata portions of their DMPs. The solution offered by the WG was to develop a series of guidelines for planning metadata management to support work in empirical SSH under the rubric of a “Fair to FAIR” paradigm (making data Findable, Accessible, Interoperable, Re-Interpretable). Metadata in SSH are about reinterpretation and re-analysis, NOT about reproducibility, which does not have a lot of traction in ethnography and related fields. Coming from sociology (and philosophy, but that is a different story), epistemic pluralism is something I am very familiar with. However, I did not sense a lot of discussion on this issue outside of the Empirical SSH WG, even though the RDA and its meetings are wildly interdisciplinary (it might be difficult to find another conference that features medicine, environmental sciences and fisheries at the same time). Sure, researchers from fields dubbed “interdisciplinary” are more attuned to noticing different and sometimes mutually exclusive forms of reasoning, but in many cases these are resolved, on the face of it, in terms of different data types collected by different disciplines.
In any case, this epistemic pluralism (sometimes downright epistemic inequalities) lies at the heart of why data are not being shared more openly. If this premise were accepted, interoperability could be profitably analysed (and tackled) not as (merely) a technical problem but one related to social and cultural change.
- Libraries for Research Data: So, what about sharing?
The Libraries for Research Data Interest Group is based on the premise that libraries have looked after written research assets for centuries and are therefore in an excellent position to tackle the challenge of adapting their function to making data reliably accessible and re-usable. Research Data have become a primary research asset that often requires continued access in the dynamic environment of mobile researchers, volatile repositories, transient products and short-lived standards. What better institutional format, the argument goes, than libraries to provide guidance and support? Devan Ray Donaldson thus presented a digital curation project his students had carried out in close collaboration with the Indiana University Libraries entitled “User stories and agile project management in developing repository software”.
The objective was to define user stories (informal narratives) to create repository software for the university library to collect, manage and store data sets. Students therefore had to study repository features, functionalities, limitations, and data citation and then create user stories, i.e. informal, natural-language descriptions of features written from the perspective of end-users and other stakeholders (for whom, what, how and why are we building the repository?). After creating the stories, each group turned their story into a series of repository features that should address the specific needs of those users (data producer, data consumer, data repository manager, grant funder, officer of research administration, institutional review board committee member). Based on these user stories, students developed and refined features of the repository for each type of user.
Helena Andreassen, Raman Ganguly and Andrea Medina-Smith talked about the gap between the perceived benefits of data sharing and actual data sharing practice (i.e. the social dilemma of data sharing) and how RDM service providers can help shrink this gap by engaging researchers through surveys (which can still be accessed here: http://www.1ka.si/a/193487). Led by Marta Teperek (TU Delft), the project aims to help institutions engage with RDM practices. The project will be presented on a website (still under construction) featuring information about the project, team, data, and data analysis. Data will be shared via a repository and linked to the website, which in turn will be linked to the RDA website. At the moment, qualitative and quantitative data are being analysed; a full report will be presented at the RDA 14th plenary in Helsinki later this year. The IG meeting ended with a glimpse of the Wikidata platform and how it increasingly features data about bibliographic objects, as well as of the WikiCite initiative to develop a database of open citations and linked bibliographic data to serve free knowledge. Wikidata was presented as having the potential to be the source for all kinds of datasets once a common structure can be agreed upon. For obvious reasons, librarians could form the core of this emerging community. All in all, the IG meeting was rather hard to follow for a novice, but many of the insights presented might prove valuable for my own work.
- Managing Data, Managing Expectations: The RDA Plenary Meeting Venue
Believe it or not, attending the RDA 13th Plenary Meeting was not just my first foray into RDA, but (incidentally) also my first-ever trip to the US-of-A. Philadelphia (at least what little of it I had the great fortune to visit) is marvellous, and would have warranted an extensive stay which, due to family obligations, proved impossible for me. Arriving at the Loews, participants were greeted with an American breakfast (at least I would consider it American) of bagels with cream cheese, porridge, and muffins (take that, continental breakfast; I find myself missing the bagels most of all back home). Somehow, despite the sheer number of participants (close to 400!), the atmosphere was still homey, allowing for making contact with many interesting people (to be fair, coffee in the morning and alcohol in the evening helped as well). What struck me most, aside from the number of participants, was the diversity of backgrounds represented within the RDA, and my own ignorance about many of the exciting discussions happening now. Talking about data seems to bring people together, even if those people sometimes speak very different languages.
Leonelli, Sabina (2016). Data-Centric Biology: A Philosophical Study. Chicago and London: The University of Chicago Press, pp. 45-66.