Minutes of the Security and Trust Birds of a Feather (BoF) Session During RDA P6.
Editor: Stefan Pröll
Session chairs: Stefan Pröll, Rudolf Mayer, Peter Kieseberg, Andreas Rauber, Vasily Bunakov, Paul Burton and Mike Priddy
During the 6th Plenary Meeting of the RDA in Paris we held the first session for establishing a working group with a focus on security. 26 participants attended the session, and 6 speakers presented potential use cases.
Stefan Pröll introduced the speakers and gave an introduction to the aims of this working group. As the name of the proposed WG implies, the main focus will be on trust and the secure exchange of data. This requires identifying common questions and issues shared among the participants and agreeing on common standards, best practices and tools. Suggested topics are policies on data access and data release for research data that is deemed sensitive, authentication and authorisation protocols for data access, and protocols for data integrity and authenticity. Not all relevant questions can be addressed within this group, so we will focus on practical topics. If this working group is established, we will develop a set of guidelines on how institutions can define security requirements and how they can exchange their data in a secure way. Following the introduction, the use cases were presented.
Presented Use Cases
Rudolf Mayer and Stefan Pröll from SBA Research presented DEXHELPP, an Austrian project in the domain of routine, secondary medical data. The data is accounting / reimbursement data from the social insurance providers for doctors and hospitals, and it is collected for 99% of the Austrian population. The data covers a 2-year span, and for some states an even longer period. It is structured data (relational database) and consists of around 2.5 billion records. The data is the basis for analysing the effectiveness of health care technologies / treatments and for predicting the future demand for health care services.
The data is highly sensitive and is disclosed on a case-by-case basis. Each release is approved by the data provider for a clearly defined data structure, and the data is exported by a designated data curator. Security and trust are vital for owners to release their data, and non-disclosure agreements are signed by the project partners. As the DEXHELPP project brings together several data providers and researchers and integrates data from various sources, the collaboration and data sharing need to be improved.
The handling of sensitive data needs to be professionalised, and defined workflows, procedures and common standards need to be established and implemented.
Big Facility for Small Science
Vasily Bunakov from STFC presented the Big Facility for Small Science use case and provided an overview of the facility's research life cycle. The way researchers create and run their experiments is well defined and requires sensitive data to be protected at several steps. Even for non-commercial research, the data is considered sensitive, and the collected metadata of the experiments needs to be protected as well. Therefore, sensible policies for data sharing, including user authentication for data and metadata access, are required. User authorization needs to be implemented where applicable and the integrity of data needs to be ensured. Additionally, there is a demand for protecting intellectual property derived from science (empowered by data management and data analysis technology).
Currently, data release policies based on an embargo period for data access are implemented and in use. Technology in support of data release policies includes a data catalogue with user authentication and authorization. Nevertheless, several issues need to be tackled, potentially within the proposed WG on Security. These challenges include the inclusion of embargoed data in the research and innovation value chain, keeping metadata and contextual data for sensitive research up to date, data integrity, and the certification of data repositories.
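As an illustration only, an embargo-based release policy combined with user authentication could be sketched as follows. The `Dataset` fields, the `can_access` function and the rule that the experiment owner can always read their own data are assumptions made for this sketch and were not described in the session.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Dataset:
    name: str
    collected_on: date
    embargo_days: int   # hypothetical embargo length, set by facility policy
    owner: str          # user who ran the experiment (assumed field)


def can_access(dataset: Dataset, user: Optional[str], today: date) -> bool:
    """Return True if `user` may read `dataset` on `today`."""
    if user is None:
        # Assumption: unauthenticated users never get access.
        return False
    if user == dataset.owner:
        # Assumption: the experiment owner can always read their own data.
        return True
    release_date = dataset.collected_on + timedelta(days=dataset.embargo_days)
    # Everyone else has to wait until the embargo period has expired.
    return today >= release_date
```

A catalogue would call `can_access` with the authenticated user name (or `None`) before returning data or metadata.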
Data Exchange in Biomarker Research
Peter Kieseberg from SBA Research presented the Biomarker Research use case, which deals with genomics, diagnosis and clinical data and other sensitive sources. The goal of this project is to foster the secondary use of the data in different fields and encourage complementary research in the same field. Sharing the data during peer review processes and providing the data sources of publications is also a requirement.
Privacy needs to be protected and anonymity needs to be ensured in big data settings. Methods for detecting data leakage, such as fingerprinting, need to be applied. Additionally, the disclosure of sensitive information, for instance by aggregating data, needs to be prevented. This requires trust models for data exchange, which can be implemented by the project participants. Legal considerations also need to be taken into account.
The Cbmed project therefore requires a technical framework consisting of guidelines, standardised tools, platforms and workflows, which allow implementing secure data sharing. A legal framework needs to clarify questions regarding the data exchange across institutional and national boundaries. Ethical questions need to be addressed and a common ethical baseline needs to be defined.
DataSHIELD: secured analysis and co-analysis of micro-data
Paul Burton from the University of Bristol presented the DataSHIELD Research Project (Data Aggregation Through Anonymous Summary-statistics from Harmonized Individual levEL Databases), which deals with the analysis of sensitive medical data in a cloud-based research environment. There are a number of use cases where DataSHIELD can be of value.
DataSHIELD was developed for three areas where microdata sharing is troublesome. The first is ethical-legal questions; most of the relevant frameworks were developed decades ago. The second is control of intellectual property: currently researchers are happy to publish results, but the data itself is not given out. The third is that passing the data on physically is difficult, not only for legal but also for technical reasons. For instance, the sheer size of large data sets makes them hard to handle across institutional boundaries.
Currently six studies are considered in DataSHIELD, each with its own server. There is one additional server from which the analysis can be controlled. DataSHIELD allows calculating the results on the local servers at each institution and integrating them centrally. This raises concerns, such as the disclosure of individuals through small cell counts. DataSHIELD can block such requests by returning only statistical data that do not disclose information. Implementing this approach in practice is difficult.
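The idea of computing summaries locally, blocking requests on small cells and combining the rest centrally can be illustrated with a minimal Python sketch. This is not DataSHIELD code; the function names and the threshold `MIN_CELL_COUNT` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

MIN_CELL_COUNT = 5  # hypothetical disclosure threshold, not a DataSHIELD setting


def local_summary(values: List[float]) -> Optional[Tuple[float, int]]:
    """Runs at one study server: return (sum, count), or None if the cell is too small."""
    if len(values) < MIN_CELL_COUNT:
        return None  # block the request rather than reveal a small group
    return sum(values), len(values)


def pooled_mean(summaries: List[Optional[Tuple[float, int]]]) -> Optional[float]:
    """Runs at the central server: combine only the summaries that were not blocked."""
    usable = [s for s in summaries if s is not None]
    if not usable:
        return None
    total = sum(s for s, _ in usable)
    count = sum(n for _, n in usable)
    return total / count
```

Only sums and counts leave each institution in this sketch; the individual-level records stay on the local server.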
At the moment, an R environment is embedded in the local databases, from which all functionality that could potentially be harmful has been removed. Only allowed functions are added, with specified and allowed parameters. The only way of interacting with the data is via functions allowed beforehand. This allows coordinating the work with the data. DataSHIELD can be put in front of any data sitting on a server and regulate what exactly can be done with the data.
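The allow-list principle described above can be sketched in Python as follows. DataSHIELD itself uses a restricted R environment; the table `ALLOWED` and the function `run_request` are hypothetical names for this sketch and do not mirror its actual interface.

```python
import statistics
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical allow-list: function name -> (implementation, permitted keyword parameters)
ALLOWED: Dict[str, Tuple[Callable[..., float], Set[str]]] = {
    "mean": (statistics.fmean, set()),
    "variance": (statistics.pvariance, {"mu"}),
}


def run_request(name: str, data: List[float], **params: float) -> float:
    """Execute a request only if the function and all its parameters are allow-listed."""
    if name not in ALLOWED:
        raise PermissionError(f"function {name!r} is not allowed")
    func, permitted = ALLOWED[name]
    unexpected = set(params) - permitted
    if unexpected:
        raise PermissionError(f"parameters not allowed: {sorted(unexpected)}")
    return func(data, **params)
```

Anything not listed in the table is rejected by default, so new functionality has to be added deliberately rather than removed after the fact.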
Access and Use of Confidential Microdata in Social and Economic Sciences
Mike Priddy from DANS (Data Archiving and Networked Services) presented the Access and Use of Confidential Microdata in Social and Economic Sciences use case. In this scenario, the national statistical agencies and data archives of different countries hold data about individuals and businesses. Currently the access to the data for researchers is controlled by different accreditation and legal regimes. Access is mostly granted within safe rooms or centers, which provide an additional physical layer of protection. Still, remote access to the data is required.
Sharing data across borders is difficult due to legal concerns, thus combining data from different sources/countries becomes a legal challenge on top of technical issues. Currently no standards on sharing or the accreditation of researchers exist. Researchers tend to stick with using data from only one or two studies/sources, because they want to avoid the trouble. The confidential data generated by researchers is currently not handled consistently, so a secure deposit is required. In addition, training is needed and trust in the researcher needs to be established.
There exist many different data owners, who have differing and ad hoc policies on who can use/view media. The access to the data needs to be managed and differing security requirements or levels of sensitivity need to be supported. Storing the data in a secure way and ensuring its long term access is fundamental for reusing and sharing the data across different organisations.
What we need are new research methods to work with structured data. Currently most archives do not have the capability or capacity to handle such data and guarantee its safety and privacy. Materials can still be sensitive even if the law allows publication, and some remain confidential. Therefore a network of secure services for customisable workflows is required, which supports linked secure data.
After the presentations the discussion was opened, moderated by Mike Priddy. The following comments, questions and suggestions were raised:
There are many areas where currently no security measures are in place. This causes trust issues. Biodiversity, agriculture and species data were mentioned as examples. All these areas have in common that they deal with sensitive data which is exchanged between researchers, and the researchers are often not even aware that they are handling sensitive data.
The potential of leaking data by linking data sets is often not considered by researchers. People often do not realise that this is an issue.
Small institutes want to share the data with the right people, whom they know. But sharing the data is not yet done systematically. We would like different security levels and levels of permissions that define what people are allowed to share. Even clearance based on training etc. is often not possible to describe. A mapping between what is highly sensitive and what you are allowed to see would be desirable.
Geologists work at different sites, where they would benefit from sharing data, but funders do not allow sharing the data. This could also be tackled with security levels.
Smart contracts could be an idea. Something practical and technical, based on cryptographic hash keys, would be useful for automation.
There are guidelines in the medical domain and they adopt a risk assessment approach. You have the tools, but there are tradeoffs in these data exchange workflows. If you are a doctor and you want to get a view on an examination, an anonymized version might not have the proper quality. Often this is decided case by case. This also takes into account different quality requirements. How can quality be measured, and how can it be decided how quality is defined? The dream of just having to press a button will not work; tools are required that make the whole process simpler. Money is also a problem. Many research facilities do not want to give away data above a certain quality, as this is their asset. They want to share data only to a level they can control.
How can we ensure that a standard is appropriate in a given context? We need to decide which issues we have to agree on. We need a project with a variety of use cases to start looking at the requirements. This is a long way to go.
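The wish raised in the discussion for security levels and a mapping between sensitivity and what a person is allowed to see could, in its simplest form, look like the following sketch. The three levels and the function `may_view` are assumptions for illustration; no concrete scheme was agreed in the session.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical levels; real schemes would be defined by the data providers.
    PUBLIC = 0
    RESTRICTED = 1
    HIGHLY_SENSITIVE = 2


def may_view(clearance: Level, sensitivity: Level) -> bool:
    """A person may view data whose sensitivity does not exceed their clearance."""
    return clearance >= sensitivity
```

A single ordered scale is the simplest possible mapping; clearance based on training or on funder restrictions would need additional attributes.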
The Tale of the Two BoFs
Several BoFs dealt with security questions. Naturally the question was raised how these groups differ and how they can be harmonised to avoid duplication. In this BoF, we discussed this with David Schiller, who chaired the BoF on International Access to Sensitive Social and Economic Microdata.
The scope of the two sessions differed, but both span different disciplines. It was agreed that we should not split the work into medical and social science domains. All areas should be covered. But there is a technical view and a more policy and governance oriented view. This could be the line along which the two potential WGs are differentiated. Still, we have to consider that if we separate the technology too far from the legal view, we might create wonderful things which we are not allowed to use. We thus identified two dimensions: the legal and the technical.
As a third dimension, organizational aspects need to be taken into account in both groups. This includes the data providers' view. It is not just legal issues, but legal issues translated into technical issues. Policies also need to be translated between the needs of different data providers.
It was agreed that taking both groups forward is an opportunity, as there are enough topics to address. Sharing between the two groups is an obvious step. This should be seen as positive cross-fertilisation: we will align both groups and not develop in silos.
What Should be the Outcome of the WG?