Full Summary of the Long Tail Research Data IG Meeting

Long Tail of Research Data Interest Group

Friday 28th of March 2014, 11:00-12:30 and 14:00-15:30

By Artemis Lavasa – CERN & ATEITH (Greece)

The Long Tail of Research Data Interest Group session, chaired by Kathleen Shearer (COAR) and Wolfram Horstmann (The Bodleian Libraries), was divided into two parts on Friday, both of which drew a sizeable audience.

Session 1 was dedicated to scoping the landscape and examining the current situation in the area of long tail data. Its purpose was to explore how long tail data is being managed, using examples grouped into external services, institutional services and research solutions. The topic of long tail research data, generally characterized as small and/or multidisciplinary data sets that fall outside the scope of the big data repositories, is very current. It is also generating a lot of interest, as reflected by the overwhelming response to the call for contributions to this session.

Within 90 minutes, 13 examples were presented: Dryad, Scientific Data, F1000 Research, Ubiquity Press and Zenodo in the external services category; the California Digital Library/UC3, Oxford, Columbia, the Notre Dame/Northwestern/Indiana/Cincinnati/UVa collaboration and the University of Leicester in the institutional services category; and the Strasbourg Astronomical Data Center, SiDORA (Smithsonian) and Scratchpads in the research solutions category. The presentations, though brief due to time restrictions, were informative and to the point, and succeeded in displaying the main features of each service. They are available from the Long Tail of Research Data webpage: https://rd-alliance.org/internal-groups/long-tail-research-data-ig.html

Session 2 began with a presentation by Kathleen Shearer of the results of a survey of current practices for discovery of research data in repositories. The survey targeted long tail repositories and received 60 responses, 30 of which were complete. It was noted that the responses are not a representative sample of data repositories, but rather an indication of which way the wind is blowing. The survey found that Dublin Core and DataCite were the most common metadata schemas used in the data repositories, and that less than half of the respondents were using DOIs. In terms of discovery, most respondents indicated that the metadata was sufficient for users to find datasets when searching within the repository; however, the metadata may not support widespread discovery via search engines or directories. The Interest Group discussed the tension between content recruitment, where the aim is to make the deposit process as easy and quick as possible, and the need for data documentation and metadata if the datasets are to be found and re-used. The group also discussed strategies to improve the discovery of datasets, which included assigning a DOI, connecting the data to the journal article, adding greater descriptive information about the data and attaching data management plans to datasets.

As a final discussion point, the group was asked to suggest specific areas that could be pursued through the long tail IG. One concern expressed was that further research is needed to gain a more rounded understanding of certain issues, in particular which tools, support and environments are needed to facilitate researcher engagement and good practice.

Other suggestions concerned collecting evidence to incentivise researchers to deposit, creating environments that make it easier for researchers to deposit their data, sharing practices about discovery, finding ways to achieve interoperability across repositories, and preservation planning. Taking all the suggestions into consideration, the group will in the immediate future start building on some of the ideas that resulted from this session.