In the healthcare sector, 1.3 million new pieces of research related to biomedical science alone are published each year. A typical database search returns about 80,000 hits, and only 4,000 of those are likely to be very relevant to a researcher’s work. Text and Data Mining (TDM) techniques can already be used to zoom in on the top 25% of papers which are most relevant to any given search query. Researchers believe that, with a little more work, it will be possible to use TDM to identify the top 10% of search results. In a similar vein, the quantity of data being created has also grown exponentially, making it difficult to handle and analyse. Data mining techniques are needed to help researchers to spot patterns in large batches of data.
TDM was initially defined as “the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different (…) resources, to reveal otherwise hidden meanings.” Its applicability in all fields of research is growing in this age of information overload.
Recent studies show the uptake of TDM is lacking. One of the reasons is the lack of awareness and skill amongst researchers, librarians, and industry practitioners. A key conclusion from the Publishing Research Consortium survey on TDM was that ‘Awareness of text mining techniques is still relatively low’. Moreover, the European Communication Monitor identified a ‘gap between training offered and development needs’. Both industry and academia have confirmed a need for education on TDM. Our focus will be on providing the basic skills so as to reach the widest audiences. The decision therefore is to establish a Working Group with a clear focus and purpose to develop a course within the 18-month time frame.
Purpose of Initiative
This Working Group aims to address the current skills gap identified with respect to Text and Data Mining (TDM) and help improve the adoption of these practices in a range of research disciplines.
TDM is a cross-cutting skill of value to a wide range of researchers. This working group aims to develop a short module that can plug into existing courses (e.g. the CODATA-RDA School of Research Data Science and existing university research skills courses) to equip researchers and practitioners with basic TDM skills and increase the use of these.
Scope of Initiative
The Working Group aims to develop a short introductory programme and related content (presentations, exercises and case studies) to introduce researchers to TDM and provide practical experience in applying open source tools to use these skills in their field of research.
The design of the course will be developed based on the research and feedback from the stakeholder communities in the upcoming months (see timeline and workplan). More specifically, the content and the proposed duration of the course will be determined after these consultations. For now we envision a 1-2 day modular course for people with no prior knowledge that includes stand-alone modules, lessons and elements that can be selected independently depending on the focus and level of knowledge of the participants. The course can be spread out over several days or weeks to fit within existing courses and training.
The introductory course will not be discipline-specific, though later iterations could be tailored to specific disciplines if needed, for example by going into more detail on discipline-related fields of interest and expertise. Although the aim of the 1-2 day course is to address the skills gap for researchers with no prior knowledge, we anticipate that we may need to extend the duration to 4-5 days if we find that more basic introductory content is needed, for example on the more technical aspects of TDM.
The course and course materials will be made available online in a digital, easy-to-use and modular format, accessible to anyone who is interested in using and adapting the course to suit their specific level or audience.
Background to Initiative
The European projects FutureTDM, FOSTER and EDISON confirm that there is a growing demand for researchers who understand and are able to use TDM and that current education is falling behind in providing people with the skills and knowledge needed both in academia and industry.
At RDA Plenary 8, a discussion on TDM in the IG on Education session confirmed community interest in developing training materials to address the skills gap. This working group therefore aims to look at how education and in-work training can help fill the gap and create enough expert data scientists.
Relevance of the Initiative
Taking into account the many benefits of TDM for research and society, this is a topic relevant for RDA. By designing a course to cover TDM skills, developing course materials and making them available to the community, we can contribute to bridging the skills gap. The course design will include learning outcomes (essential and desirable) and course content (specific readings, lecture and discussion content, class activities, practical assignments, and graded assignments).
The aim is to develop a generic, adaptable course or training module on TDM skills and knowledge that can then be used by different disciplines.
Timeline and Workplan: Term: 12-18 months
Quarter one - 2017: Requirements gathering phase
This will include identifying survey participants (such as existing course providers, the research community, industry partners, librarians and RDA members) and undertaking a questionnaire to understand what skills need to be covered in an introductory TDM course.
Analysing survey outputs and drafting a course design, learning outcomes and programme for consultation at the RDA plenary in Barcelona.
This work will be conducted via virtual meetings and desk-based research.
Deliverable: Survey and results
Milestone: Preliminary course outline for discussion in Barcelona
Quarter two - 2017: Course development
Development of course content, including specific readings, lecture and discussion content, class activities, practical exercises and graded assignments. For this we will look at existing courses and tutorials and build upon those with input from the TDM community, such as users and tool developers. For example, we will work together with Contentmine, industry partners such as SAS, and at least two universities that have expressed interest in adopting a course.
Establishing an international network of experts and potential TDM trainers. This will build on the initial survey work and contacts developed through the WG and will support roll-out and reuse of the materials.
The majority of this work will be conducted virtually, with OKFN leading. At least one face-to-face meeting will be scheduled to help define the structure of the course and/or develop key components.
Deliverable: A draft set of training materials and user guides ready for testing
Quarter three - 2017: Testing
Liaising with contacts to establish one or two potential opportunities to trial the course. These could be aligned with existing events from partners such as DCC, FutureTDM or institutions who have expressed an interest in hosting events for researchers.
A train-the-trainers style session could be run at RDA Montreal to walk members through the course content and how it should be delivered, and to gather feedback from potential adopters.
This work will require at least two face-to-face sessions to deliver courses in different contexts.
Milestone: Have tested the course and gathered feedback from trainers and pilot participants
Quarter four - 2017: Evaluation and review
Here we will take stock of feedback received during the trial. Particular attention will be paid to which sessions were most effective in addressing the learning outcomes and engaging participants. The time taken to deliver the sessions, any technical issues encountered by trainers and ideas for reworking content or improving flow will also be addressed.
The course materials will be refined based on the feedback. Supporting materials that help others reuse the content, such as speaker notes, will also be improved.
The work will be conducted remotely with regular virtual meetings to support the analysis and review.
Deliverable: a revised set of openly-licensed training materials available online for reuse
Quarter five - 2018: Adoption
The complete course materials will be made available online (GitHub, SlideShare, Zenodo) together with documentation on how to implement the course module, FAQs and contact details for support. Further events like the train-the-trainers session at Montreal could help others to understand and adopt the resources.
Through the DCC, European training initiatives (e.g. Swafs-07) and e-infrastructure projects like OpenAIRE, we will raise awareness of the module and promote adoption in academia.
In addition, the Institute for Environmental Analytics (IEA) has a number of industrial partners (including Microsoft, Airbus, environmental consultancies and civil engineering companies) and can serve as a route to contacts in industry.
This work will involve promoting the outputs at events, as well as specific meetings with key target groups (e.g. training departments and Doctoral Training Centres) to promote adoption.
Bi-weekly calls for the Chairs or others engaged in specific activities currently underway
Monthly calls to update all members of the Working Group on progress
WG Email list for discussion and sharing of relevant information
Google Drive/GitHub for collaboration on course materials
- Freyja van den Boom (EU)
- Sarah Jones (EU)
- Devan Ray Donaldson (US)
- Clement E. Onime (TBC)
- Steve Brewer
- Vicky Lucas
- Simon Hodson
- Amy Nurnberger
- Puneet Kishor
- Baden Appleyard
- Christoph Bruch
- Alex Fenlon
- Jez Cope
- Hugh Shanahan
- Małgorzata Krakowian
- Bridget Almas
Group Email: firstname.lastname@example.org
Secretariat Liaison: Fotis Karayannis
TAB Liaison: Devika Madalli
Engagement with existing work in the area:
Collaborations and opportunities for further engagement include:
http://www.futuretdm.eu/ The FutureTDM project seeks to improve uptake of text and data mining (TDM) in the EU. FutureTDM actively engages with stakeholders such as researchers, developers, publishers and SMEs and looks in depth at the TDM landscape in the EU to help pinpoint why uptake is lower, to raise awareness of TDM and to develop solutions.
EDISON is a 2-year project (started September 2015) with the purpose of accelerating the creation of the Data Science profession.
The forthcoming Swafs-07 ‘Training on Open Science in the European Research Area’ project.
CODATA-RDA School of Research Data Science
The Institute for Environmental Analytics (IEA) is part of the University of Reading, providing training on analytics and producing proof-of-concept software, either by using environmental data or by using big data for environmental applications. The IEA is funded until 2019 by the Higher Education Funding Council for England. The IEA recognises that TDM is a growing field for environmental analysis and applications. The IEA currently has projects using TDM on tweets and text messages and is moving into larger document analysis, specifically environmental impact assessments.
The Belmont Forum is a group of national science funders, including NSF (US) and NERC (UK). The e-infrastructure group is exploring training requirements for research data scientists, including developing a relevant curriculum in 2017.
The UK Digital Curation Centre has delivered training on Research Data Management for several years and is involved in training activities for a number of European projects such as FOSTER, OpenAIRE, EUDAT and the European Open Science Cloud. Through these and participation in the CODATA summer schools, the DCC will help to embed the module in existing courses and encourage broad adoption.
Other possible collaborations:
Academia: We have interest from several universities.
Pilot sessions may be organised alongside the Trieste School (10-21 July, at ICTP in Trieste), followed by São Paulo, Brazil (4-15 December).
School of Data works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively.
Industry and organisations: Contentmine, SAS
 Developed during the Plenary in Denver, in the session of the IG on Education and Training on Handling of Research Data.
 FutureTDM project report D4.3 Compendium of Best Practices and Methodologies available online at http://www.futuretdm.eu/knowledge-library/
 For an overview of use examples in the US, see J. Shaw, “Why ‘Big Data’ Is a Big Deal: Information science promises to change the world”, Harvard Magazine, available online at http://harvardmag.com/pdf/2014/03-pdfs/0314-30.pdf
 The EU expert report on Text and Data Mining states that Europe is falling behind the US and China with respect to the uptake of TDM; available at http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_ex...
 FutureTDM consortium D4.3 Compendium of Best Practices and Methodologies report shows the need for more TDM practitioners in industry as well as a lack of awareness and skill amongst students and researchers in different disciplines.
 Key finding from the publishing community on this issue, available at http://publishingresearchconsortium.com/index.php/prc-projects/text-mining-of-journal-literature-2016?platform=hootsuite
 As identified in Europe. See European Communication Monitor 2016 http://www.communicationmonitor.eu/
 We will initially develop this course for (student) researchers with little or no prior knowledge of TDM. For a second iteration of the course we will also look at industry, librarians and other interested parties to see how the course can be tailored more to specific needs.
 The course will be made available under an open access license using open source tools and materials to make sure the course can be adopted by a wide audience.
 The content of the course, the course materials and the best platform for making them available will be looked at in this working group. See the timeline for more detailed information.
 FutureTDM Deliverable 2.4 and 4.3 available at http://www.futuretdm.eu/
 The UK Royal Society is holding a special conference on this topic; see https://royalsociety.org/science-events-and-lectures/2016/11/data-skills-workshop/