Greetings, members of the Data Discovery Paradigms Interest Group!
As it's been some time since P11 in Berlin, we would like to share with
everyone in the group the discussions held within our session, as well as to
capture your thoughts on forming and joining the new set of Task Forces (see
points 3 and 4 below). But first, here's a summary of the session:
1. Summary
DDPIG had its 4th break-out session this March at RDA P11 in Berlin. The
main objectives of the meeting were to inform all members and interested
parties of the progress made on Data Discovery, and to discuss next steps
for the existing Task Forces as well as potential new ones. The session
attracted 45 attendees and generated lively discussion and feedback on all
aspects of the interest group.
The session started with a short introduction of the group's goals and
progress, including a timeline from the initial BoF in Apr 2016 leading to
the current status. Two task forces that were selected from the most
frequently requested topics after P8, namely "Use Cases, Prototyping Tools &
Test Collections" and "Best Practices for Making Data Findable", were
officially closed and their respective outputs presented. The two remaining
ongoing task forces, "Relevancy Ranking" and "Metadata Enrichment", were
then presented, each briefly outlining the work done so far.
The presentation slides from the break-out session and the collaborative
notes are available online.
2. Presentations from the four task forces
The "Use Cases / Requirements" Task Force focused on the process of
capturing user scenarios, and re-formulating them as requirements applicable
to different actors in the Data Discovery process (such as Researchers,
Repository Managers, Librarians etc). As the task force has officially
completed its lifecycle, the corresponding output was presented, i.e.
-discovery-paradigms-user-requirements-and> Data Discovery Paradigms: User
Requirements and Recommendations for Data Repositories, also being currently
considered for submission as a journal article.
The Best Practices Task Force focused on exploring current practices for
making data findable, and on recommending best practices to the data
community. Having reached its conclusion, the task force has developed
recommendations for two types of audiences: data repositories (produced
jointly with the Use Cases Task Force) and data users. The recommendation
for data users, "Eleven Quick Tips for Finding Research Data", has also been
published as an article in PLoS Computational Biology (DOI:
10.1371/journal.pcbi.1006038).
The Relevancy Ranking Task Force presented an overview of the task force and
a report on progress so far. Its goal is to help data repositories choose
appropriate technologies when implementing or improving search
functionality, and to explore the creation of test collections and search
tasks for the data search community to work on collectively. The Task Force
is scheduled to conclude at the next plenary (P12); its primary focus until
then is the preparation of its final output, an analysis of the 114
responses to a community survey aimed at capturing and identifying common
ranking models.
A discussion of this TF's work during P11 addressed the following points:
* The main usability expectation is that people want something like Google.
It is far from clear what a sensible ranking would be. Are there any
thoughts on how relevancy can be defined?
* In search communities the metrics usually used are precision and recall
(see the sketch after this list) - but in the end, the question is whether
the user finds the system useful.
* A lot of thinking has gone into this, and the library community has done
considerable work on it. The survey actually gives some information on how
ranking is performed and what is expected.
* An interesting point highlighting this issue is that, when we collected
our use cases, one of the requests was to "be like Google". But this is a
vague requirement; is it the interface, the method, or something else?
* What are the user expectations in domain versus cross-domain repositories?
* The participating repositories are mostly closed-domain repositories, so
the survey itself does not really tell us what the roadblocks are; it just
captures the current system configurations.
* However, the survey report shows statistics on the domains covered by the
responses. The majority of responses spanned multiple domains, though it is
unclear whether those repositories evaluate relevancy differently. We
haven't made any recommendations from this group yet, but this is something
we plan to discuss.
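As context for the precision/recall point above, here is a minimal sketch of
how these metrics are computed for a single ranked result list. The function,
identifiers, and relevance judgments are illustrative, not drawn from the
Task Force's survey or any actual test collection.

```python
# Illustrative only: precision and recall at rank cutoff k for one query.
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """ranked_ids: results in ranked order; relevant_ids: ground-truth set."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Hypothetical relevance judgments for one data-search query:
results = ["ds42", "ds07", "ds13", "ds99", "ds01"]
relevant = {"ds07", "ds01", "ds55"}
p, r = precision_recall_at_k(results, relevant, k=5)
print(f"P@5 = {p:.2f}, R@5 = {r:.2f}")  # P@5 = 0.40, R@5 = 0.67
```

As the discussion notes, scoring well on such metrics does not by itself mean
users will judge the system useful.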
The Metadata Enrichment Task Force presented its current activities and the
focus of future efforts. Formed in April 2017, the TF aims to describe and
catalog various methods of enriching research data metadata to satisfy
several use cases. A list of planned activities was provided, including a
review of responses to the DDPIG survey question on metadata enrichment, as
well as cross-referencing survey responses about metadata enrichment efforts
with other responses to look for possible correlations.
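By way of illustration, here is a minimal hypothetical sketch of one such
enrichment method: matching a record's title and abstract against a small
controlled vocabulary to derive keywords. The record fields, vocabulary, and
function are invented for this example and are not taken from the TF's
catalog of methods.

```python
# Hypothetical enrichment method: derive keywords for a sparse metadata
# record by matching its text against a small controlled vocabulary.
# All fields and terms below are invented for illustration.
VOCABULARY = {"sea surface temperature", "salinity", "ocean", "buoy"}

def enrich_keywords(record):
    """Add any vocabulary terms found in the title or abstract as keywords."""
    text = f"{record.get('title', '')} {record.get('abstract', '')}".lower()
    found = {term for term in VOCABULARY if term in text}
    record.setdefault("keywords", [])
    record["keywords"] = sorted(set(record["keywords"]) | found)
    return record

record = {
    "title": "Buoy measurements of sea surface temperature, 2015-2017",
    "abstract": "Hourly ocean observations from moored buoys.",
}
print(enrich_keywords(record)["keywords"])
# ['buoy', 'ocean', 'sea surface temperature']
```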
3. New Task Forces
With the completion of work by most of the initial task forces, the
discussion turned to potential new Task Forces, revolving mostly around a
Schema.org TF and a Granularity TF.
Schema.org Task Force
An analysis of search logs from Research Data Australia shows that nearly
80% of its traffic comes from web search engines (see slide 46 of the group
presentation); this indicates the importance of making data searchable by
web search engines. The wide use of schema.org
vocabularies to add structured metadata in web pages for use by commercial
search engines has attracted the attention of the data management community
as a possible mechanism to leverage the robust commercial search engines
such as Google, Yahoo, and Bing to facilitate discovery and access to
scientific data. Various projects have been exploring this approach,
including the US NSF EarthCube p418 project,
Google's
Dataset Recommendations, BioSchemas,
Force11 DCIP,
DataCite, and
Research Data Australia. Since the schema.org vocabulary has largely been
driven by commercial business use cases, and its process for adding and
defining terms is loosely governed, there are gaps and deficiencies in the
vocabulary that make its application to science data problematic. There are
therefore opportunities for this interest group, and the research data
community at large, to work with the major web search engine providers to
make data discoverable from the web.
A draft proposal for the schema.org Task Force (by Dr. Stephen Richard) is
available.
The proposed Task Force has a Git repository with identified activities
registered as issues. Your contribution to this task force, either by
registering new activities or by contributing to existing ones, is much
encouraged and appreciated!
The following suggestions were provided by the attendees in the session:
1. Tradeoffs of using schema.org as opposed to stricter vocabularies (see
the markup sketch after this list).
a. The life sciences, for example, have a domain-specific RIS. We are
thinking about how to spread information across different domains. We
decided to use schema.org properties, and then add domain-specific
vocabularies in combination. There are two ways to do this: (i) propose
something to schema.org, or (ii) use what is already there.
2. Another concern about schema.org is what control safeguards can be
implemented. Any data is associated with a source, and it is up to the user
to evaluate whether the source is trustworthy or not.
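To make the approach concrete, below is a minimal sketch of schema.org
Dataset markup expressed as JSON-LD and printed from Python. The dataset,
DOI, and publisher are invented for illustration; the property names (name,
description, identifier, keywords, publisher) are genuine schema.org
vocabulary.

```python
import json

# Minimal, illustrative schema.org "Dataset" description as JSON-LD.
# The dataset, DOI, and publisher below are invented for illustration.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example ocean temperature observations",
    "description": "Hypothetical dataset used to illustrate the markup.",
    "identifier": "https://doi.org/10.1234/example",
    "keywords": ["ocean", "temperature"],
    "publisher": {"@type": "Organization", "name": "Example Repository"},
}

# A repository would typically embed this in the dataset's landing page:
print('<script type="application/ld+json">')
print(json.dumps(dataset, indent=2))
print("</script>")
```

As discussed in point 1 above, domain-specific vocabularies can then be
layered on top of these generic schema.org properties.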
Granularity Task Force
Data are commonly discoverable & accessible at the level of 'datasets',
which are often aggregates of observations. The efficient and effective
reuse of data requires users, be they humans or machines, to be able to find
and access resources at finer levels of granularity. For example, a dataset
could be a collection of digital objects (DO), with each DO containing
multiple variables or geospatial layers. It is becoming more common for
repositories to offer services that allow users to discover and access
individual files within a collection, and even individual layers or columns
from a table. This is possible only if there is metadata matching the
desired level of granularity. If a user accesses and uses only a subset of a
collection, that subset needs its own identifier for proper citation.
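As an illustration of metadata at multiple levels of granularity, here is a
minimal sketch of a collection whose digital objects and individual
variables each carry their own identifier. The structure, identifiers, and
fields are invented for illustration rather than drawn from any particular
repository.

```python
# Illustrative only: a collection described at three levels of granularity,
# each level carrying its own identifier so that subsets remain citable.
collection = {
    "id": "doi:10.1234/collection",          # invented identifier
    "title": "Regional climate observations",
    "digital_objects": [
        {
            "id": "doi:10.1234/collection/file-01",
            "title": "Station A observations (NetCDF)",
            "variables": [
                {"id": "doi:10.1234/collection/file-01#temp",
                 "name": "air_temperature", "units": "K"},
                {"id": "doi:10.1234/collection/file-01#rh",
                 "name": "relative_humidity", "units": "%"},
            ],
        },
    ],
}

# A user citing only one variable can reference its fine-grained identifier:
print(collection["digital_objects"][0]["variables"][0]["id"])
```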
The following suggestions were provided by the attendees in the session:
1. How can you define citability within the context of granularity?
a. First index the data, then create or extract tags, and from these create
the metadata.
b. The Dynamic Citation Group proposed a solution for data that is generated
dynamically from a service.
2. In the long tail of science, the data used tend to decrease in frequency
but increase in diversity. So how should we deal with small, very
specialized datasets?
a. Have a look at online data and identify the metadata through data
packages.
4. Next Actions | ToDo!
We would like to ask all of you to consider:
1. Whether either of these two Task Forces appeals to you
2. If so, whether you are interested in leading or joining the respective
Task Force(s) (leading a task force takes about 8 hours per month; joining,
about 4 hours a month)
3. For the Schema.org TF, we have already initiated a GitHub repository that
will capture all the discussion. Feel free to join in and give us your
perspective.
4. For the Granularity TF, we are keen to start the discussion on its
particular short- and medium-term goals.
Please feel free to contact any of us with any further questions or
suggestions.
Looking forward to hearing your views!
With kind regards, the IG Chairs:
SiriJodha Singh Khalsa, ***@***.***
Fotis E. Psomopoulos, ***@***.***
MingFang Wu, ***@***.***