Linguistics Data Interest Group Charter Statement

08 Apr 2017

Version 1.0 14th June 2017

(The draft version 0.1 and the final approved version are available as attached documents at the bottom of the page.)



Data are fundamental to the field of linguistics. Examples drawn from natural languages provide a foundation for claims about the nature of human language, and validation of these linguistic claims relies crucially on these supporting data. Yet, while linguists have always relied on language data, they have not always facilitated access to those data. Publications typically include only short excerpts from data sets, and where citations are provided, the connections to the data sets are usually only vaguely identified. At the same time, the field of linguistics has generally viewed the value of data without accompanying analysis with some degree of skepticism, and thus linguists have murky benchmarks for evaluating the creation, curation, and sharing of data sets in hiring, tenure and promotion decisions.


This disconnect between linguistics publications and their supporting data results in much linguistic research being unreproducible, either in principle or in practice. Without reproducibility, linguistic claims cannot be readily validated or tested, rendering their scientific value moot. In order to facilitate the development of reproducible research in linguistics, the Linguistics Data Interest Group (LDIG) plans to promote the discipline-wide adoption of common standards for data citation and attribution. In our parlance, citation refers to the practice of identifying the source of linguistic data, and attribution refers to mechanisms for assessing the intellectual and academic value of data citations. The LDIG is for data at all linguistic levels (from individual sounds or words to video recordings of conversations to experimental data) and data for all of the world’s languages, and acknowledges that many of the world’s languages have high cultural value and are underrepresented with regard to the amount of information that is available about them.


This interest group is aligned with the RDA mission to improve open sharing of data through forming transparent discipline-specific data citation and attribution conventions to be adopted by the international research community. This interest group will add value to the RDA community by providing breadth to the current roster of RDA interest groups: linguistics is a discipline that straddles social/behavioral sciences and the humanities, and thus we have a great deal to contribute to the general RDA discussion on a multiplicity of data types. This group ties in with other initiatives in transparent research methods in linguistics at all stages of the workflow, including Open Access data archiving and publishing, reproducible methodologies and critical consideration of data licensing. The LDIG seeks to support these initiatives while focusing on data citation specifically. The LDIG provides an ongoing space for linguists to come together to improve how we manage and cite our data, and how we train linguists in good practice.


Who is this group for?

The LDIG is for people who work with linguistic and language data. This work includes, but is not limited to, the collection, management and analysis of linguistic data. We encourage participation from academic and speaker communities.


Objectives and outcomes

Our overarching objective is to contribute to a positive culture of linguistic data management and transparency in ways that are in keeping with what is happening in the larger digital data management community. To do this we aim to be a group that is able to provide tangible tools (e.g. guidelines, software) for improving the culture of data citation and attribution within linguistics. This will also involve understanding the breadth of data types linguists work with, and current uses of persistent identifiers. We outline three main objectives. For each objective we also suggest specific outcomes, which would be the focus of shorter term timelines (e.g. Working Groups):

  • Development and adoption of common principles and guidelines for data citation and attribution by professional organizations, such as the Linguistic Society of America and the Societas Linguistica Europaea, academic publishers, and archives for linguistic and language data. Principles and guidelines will follow the recommendations in the Joint Declaration of Data Citation Principles.

    Potential WG topics include:

    • Development of a common stylesheet for citation of linguistic data

    • Adoption of the stylesheet by publishers, archives, organisations and individuals

    • Integrating RIS with linguistic data services like the Open Language Archives Community

  • Education and outreach efforts to make linguists more aware of the principles of reproducible research and the value of data creation methodology, curation, management, sharing, citation and attribution. Practical training also helps make proper data preparation less burdensome for researchers, and normalises this work as an expectation of the discipline. While much of this work will be practical training, outreach also needs to take into account the complex and varying attitudes towards creation of open access data sets across linguistics.
    Potential WG topics include:

    • Development of training modules

    • Delivery of training at conferences and workshops

    • Development of tools for the management of linguistic data

  • Efforts to ensure greater attribution of linguistic data set preparation within the linguistics profession.
    Potential WG topics include:

    • Framework for valuing the development of linguistic data sets in job appointments, tenure and promotion applications and in research degrees and postdoctoral research projects.

It will be up to the LDIG to decide if any of these specific outcomes would be best met by forming short term working groups with specific timelines for the deliverables. Other outcomes may be worked on within the LDIG on a more open timeline. Further goals include fostering greater transparency in research methodology, and data access rights. We expect that other outcomes will be developed as LDIG grows and responds to the changing research environment.



The co-chairs will hold a conference call every two months. The wider LDIG will convene quarterly meetings. The timezone spread of LDIG members means that these meetings will be held asynchronously in an editable document. The agenda will be posted with discussion points, and will be open for comment for a week, before actions are decided upon and delegated. We will also host face-to-face meetings at relevant linguistics conferences, such as Societas Linguistica Europaea, Linguistic Society of America, the Australian Linguistics Society, and at the RDA plenaries.


Interaction with groups in RDA

The following RDA groups have been identified as having interests that are relevant to LDIG, both in terms of technical and ethical issues in linguistic data management:

While setting up the LDIG we will ask at least four of our members to nominate themselves to participate in one of these other groups and be officially named as our cross-group co-ordinator. This will facilitate cross-group relevance.

Linguists from particular subfields may find that particular interest groups are relevant to issues in their area; for example, corpus linguists may find that the Big Data IG addresses relevant issues. We encourage LDIG participants to also engage with other interest groups and working groups in the RDA.


Related projects and activities

There are also a number of organisations and groups outside the RDA that LDIG will engage with directly as the objectives of the group are addressed.




Andrea L. Berez-Kroeker, U Hawai‘i at Mānoa

Lauren Gawne, La Trobe University

Helene N. Andreassen, UiT The Arctic University of Norway

Potential members are welcome to sign up to the LDIG or contact the co-chairs for more information. LDIG has been promoted through the LINGUIST List, and we invite any interested party to participate.



The LDIG aims to be an ongoing group whose overall aim is to promote better practice in linguistic data management. A general timeline is given below; however, some of these responsibilities may be handed over to a working group specifically set up for the delivery of the data citation standards.

Outreach - first 6 months (May-November 2017)

  • April 2017    Draft charter posted

  • May 2017    Group advertised publicly

  • June 2017    Amended charter posted

  • Sept 2017    Attend Montreal RDA plenary and connect with relevant RDA groups

  • Oct 2017    Finalise LDIG structure and communication processes

Groundwork - second 6 months (November 2017-May 2018)

This groundwork helps us expand the reach of the LDIG and ensures that we are as relevant and inclusive as possible. It includes attendance at the April 2018 RDA plenary, along with the following activities:

  • Survey of linguists on current data citation practice (individual practice and institutional level training opportunities)

  • Collate possible citation practices

  • Survey of linguists on current practices for academic attribution of curation of linguistic data sets in departmental tenure and promotion

Review period start: Saturday, 8 April 2017
  • Author: Lauren Collister

    Date: 11 Apr, 2017

    Overall, I approve of this charter. Two comments came to mind as I read it that may be useful to those drafting or reading the charter.

    1. The multiplicity of data types used by linguists is briefly mentioned here; I believe that this may be one of this group's most valuable contributions to the RDA. Linguists use many different kinds of data with many different methods of describing, citing, and reusing; therefore, the work of this group to encompass all of those data types may be the most difficult task. I hope that being part of the RDA will help shine a light on how to think about all of these different data types and that the group will be able to connect to others with similar questions and issues. 

    2. One thing not mentioned here is the cultural value of linguistic data. As just one example, linguistic data sets may be used to make language revitalization materials for endangered languages and communities. This cultural value of linguistic data sets contextualizes linguistic data, and may be a useful addition to this charter to help others understand the nature and potential uses of these data. 

    Cheers to all for great work. 

  • Author: Jill Vaughan

    Date: 19 Apr, 2017

    Great work!

    One thing springs to mind across the three objectives – more connected to data management and making data open and shareable – is the need for good data management and archiving protocols from the very earliest stages of research, during a graduate project or equivalent research undertaking, with the knowledge that this should ultimately lead to the open sharing of data down the track. This guidance first needs to come from immediate supervisors, and perhaps also through grant providers (some of which do this already).

    While I wholeheartedly agree that there ought to be ways to value this kind of work in job appointments and the like, it also needs to be valued as a time-intensive activity for the kinds of researchers seeking jobs, e.g. during short postdoc contracts amidst the pressures of publishing, teaching, attending conferences etc. In my experience this is often not the case.

    Best practice protocols for these activities should be widely available and manageable. I think many hold back from making data available as they feel it is not in some idealised state of final and absolute analysis, so an articulation of the basic requirements of what needs to be included may benefit those who are hesitant.

  • Author: Alexis Michaud

    Date: 20 Apr, 2017

    Facilitating reproducible research: great aim!

    For citing data, what about using the same system as for publications? It could facilitate the transition. Users of Zotero, EndNote, BibTeX and other software for managing references could cite data sources like they do publications. This would imply using a standard such as RIS (the standardized tag format developed by Research Information Systems). This raises the question of how far RIS is compatible with the metadata of OLAC (Open Language Archives Community), which allows fine-grained description of who did what in putting together a resource (transcriber, translator, interviewer, sponsor, researcher etc).

    Another issue is that of identifiers. To cite data, we want stable identifiers (UIDs = unique identifiers; this is a bit of a pleonasm, as identifiers need to be unique). If there is a measure of convergence with the formats and identifiers used for publications, this would argue in favour of identifiers with which many researchers are familiar from handling their bibliographic references. We could get the best of both worlds by having several layers of identifiers, assigned independently: some more familiar to archivists and IT specialists (ARK, OAI...), others more familiar to the world of publishing houses (DOIs). Coordination with OLAC would seem really important here.

    Again, many thanks for getting this started!

  • Author: Sebastian Nordhoff

    Date: 02 May, 2017

    Thanks for this nice and timely initiative. I have the following comments to offer:

    • encourage depositing data in accessible dedicated archives
    • add the goal of establishing best practices for licensing linguistic data, also with regard to personality rights
    • add the goal of making methods of collecting/collating/analyzing/representing data available

    I understand that citation/attribution is a first step, but these other aspects are worthwhile to be included from the outset as further goals.

    The Generic Style Rules for Linguistics should be added to the list of related projects.
