Linguistics Data Interest Group Charter Statement

08 Apr 2017

Version 0.1 draft Charter Statement 7th April 2017

 

Introduction

Data are fundamental to the field of linguistics. Examples drawn from natural languages provide a foundation for claims about the nature of human language, and validation of these linguistic claims relies crucially on these supporting data. Yet, while linguists have always relied on language data, they have not always facilitated access to those data. Publications typically include only short excerpts from data sets, and where citations are provided, the connections to the data sets are usually only vaguely identified. At the same time, the field of linguistics has generally viewed the value of data without accompanying analysis with some degree of skepticism, and thus linguists have murky benchmarks for evaluating the creation, curation, and sharing of data sets in hiring, tenure and promotion decisions.

 

This disconnect between linguistics publications and their supporting data results in much linguistic research being unreproducible, either in principle or in practice. Without reproducibility, linguistic claims cannot be readily validated or tested, rendering their scientific value moot. To facilitate the development of reproducible research in linguistics, the Linguistics Data Interest Group plans to promote the discipline-wide adoption of common standards for data citation and attribution. In our parlance, citation refers to the practice of identifying the source of linguistic data, and attribution refers to mechanisms for assessing the intellectual and academic value of data citations.

 

This interest group is aligned with the RDA mission to improve open sharing of data through forming transparent discipline-specific data citation and attribution conventions to be adopted by the international research community. This interest group will add value to the RDA community by providing breadth to the current roster of RDA interest groups: linguistics is a discipline that straddles social/behavioral sciences and the humanities, and thus we have a great deal to contribute to the general RDA discussion on a multiplicity of data types.

 

Who is this group for?

The LDIG is for people who work with linguistic and language data. This work includes, but is not limited to, the collection, management and analysis of linguistic data. We encourage participation from academic and speaker communities.

 

Objectives and outcomes

 

Our overarching objective is to provide tangible tools (e.g. guidelines, software) for improving the culture of data citation and attribution within linguistics. We outline three main objectives, and specific outcomes for each:

  •   Development and adoption of common principles and guidelines for data citation and attribution by professional organizations, such as the Linguistic Society of America and the Societas Linguistica Europaea, and academic publishers. Outcomes include:

    •   Development of a common style sheet for citation of linguistic data

    •   Adoption of the style sheet by publishers, organisations and individuals

  •   Education and outreach efforts to make linguists more aware of the principles of reproducible research and the value of data creation, curation, management, sharing, citation and attribution. Outcomes include:

    •   Development of training modules

    •   Delivery of training at conferences and workshops

    •   Development of tools for the management of linguistic data

  •   Efforts to ensure greater attribution of linguistic data set preparation within the linguistics profession.  Outcomes include:

    •   Framework for valuing the development of linguistic data sets in job appointments

    •   Framework for valuing the development of linguistic data sets in tenure and promotion applications.
       

We expect that other outcomes will be developed as LDIG grows.

 

Mechanism

The co-chairs will hold a conference call every two months. The wider LDIG will convene quarterly meetings. The timezone spread of LDIG members means that these meetings will be held asynchronously in an editable document. The agenda will be posted with discussion points, and will be open for comment for a week, before actions are decided upon and delegated. We will also host face-to-face meetings at relevant linguistics conferences, such as Societas Linguistica Europaea, Linguistic Society of America, and the Australian Linguistics Society, and at the RDA plenaries.

 

Interaction with groups in RDA

The following RDA groups have been identified as having interests that are relevant to LDIG, both in terms of technical and ethical issues in linguistic data management:

 

  •   Data policy standardisation and implementation IG

  •   Data Versioning IG

  •   Reproducibility IG

  •   RDA/NISO Privacy Implications of Research Data Sets IG

  •   Ethics and Social Aspects of Data IG

  •   Metadata IG

  •   Data Citation WG

  •   BoF on Data Champion Communities

     

While setting up the LDIG we will ask at least four of our members to nominate themselves to participate in one of these other groups and be officially named as our cross-group co-ordinator. This will facilitate cross-group relevance.

Linguists from particular subfields may find that other interest groups are relevant to issues in their area; for example, corpus linguists may find that the Big Data IG addresses relevant issues. We encourage LDIG participants to also engage with other interest groups and working groups in the RDA.

 

Related projects and activities

There are also a number of organisations and groups outside the RDA that LDIG will engage with directly as the objectives of the group are addressed.

Contributors

Co-Chairs:
Andrea L. Berez-Kroeker, U Hawai‘i at Mānoa
Lauren Gawne, La Trobe University
Susan S. Kung, U Texas at Austin
Helene N. Andreassen, UiT The Arctic University of Norway

 

Timeline

 

Outreach - first 6 months (May-November 2017)

  •   April 2017  - Draft charter posted

  •   May 2017  - Group advertised publicly

  •   July 2017  - Member comment (within 6 weeks of draft going live)

  •   Sept 2017  - Revised charter posted

  •   Sept 2017  - Attend Montreal RDA plenary and connect with relevant RDA groups

  •   Oct 2017   - Finalise LDIG structure and communication processes

Groundwork - second 6 months (November 2017-May 2018)
To be driven by Working Groups led by 1-2 LDIG Chairs; includes attendance at April 2018 RDA plenary:

  •   Survey of linguists on current data citation practice (individual practice and institutional level training opportunities)

  •   Collate possible citation practices

  •   Survey of linguists on current practices for academic attribution of curation of linguistic data sets in departmental tenure and promotion

     

Building the citation standards - third 6 months (May 2018-November 2018)

  •   Development of the citation standards

  •   Development of a statement on and guidelines for tenure and promotion committees and applicants about how to weigh data set curation in linguistics. Includes attendance at September 2018 RDA plenary.

Documents: LDIGcharterstatement.pdf (89.49 KB)

  • Author: Lauren Collister

    Date: 11 Apr, 2017

    Overall, I approve of this charter. I have two comments that came to mind as I read this, and may be useful to those drafting the charter or reading it. 

    1. The multiplicity of data types used by linguists is briefly mentioned here; I believe that this may be one of this group's most valuable contributions to the RDA. Linguists use many different kinds of data with many different methods of describing, citing, and reusing; therefore, the work of this group to encompass all of those data types may be the most difficult task. I hope that being part of the RDA will help shine a light on how to think about all of these different data types and that the group will be able to connect to others with similar questions and issues. 

    2. One thing not mentioned here is the cultural value of linguistic data. As just one example, linguistic data sets may be used to make language revitalization materials for endangered languages and communities. This cultural value of linguistic data sets contextualizes linguistic data, and may be a useful addition to this charter to help others understand the nature and potential uses of these data. 

    Cheers to all for great work. 

  • Author: Jill Vaughan

    Date: 19 Apr, 2017

    Great work!

    One thing springs to mind across the three objectives – more connected to data management and making data open and shareable – is the need for good data management and archiving protocols from the very earliest stages of research, during a graduate project or equivalent research undertaking, with the knowledge that this should ultimately lead to the open sharing of data down the track. This guidance first needs to come from immediate supervisors, and perhaps also through grant providers (some of which do this already).

    While I wholeheartedly agree that there ought to be ways to value this kind of work in job appointments and the like, it also needs to be valued as a time-intensive activity for the kinds of researchers seeking jobs, e.g. during short postdoc contracts amidst the pressures of publishing, teaching, attending conferences etc. In my experience this is often not the case.

    Best practice protocols for these activities should be widely available and manageable. I think many hold back from making data available as they feel it is not in some idealised state of final and absolute analysis, so an articulation of the basic requirements of what needs to be included may benefit those who are hesitant.

  • Author: Alexis Michaud

    Date: 20 Apr, 2017

    Facilitating reproducible research: great aim!

    For citing data, what about using the same system as for publications? It could facilitate the transition. Users of Zotero, EndNote, BibTeX and other software for managing references could cite data sources like they do publications. This would imply using a standard such as RIS (the standardized tag format developed by Research Information Systems). This raises the question of how far RIS is compatible with the metadata of OLAC (Open Language Archives Community), which allow fine-grained description of who did what in putting together a resource (transcriber, translator, interviewer, sponsor, researcher etc).

    Another issue is that of identifiers. To cite data, we want stable identifiers (UIDs = unique identifiers; this is a bit of a pleonasm, as identifiers need to be unique). If there is a measure of convergence with the formats and identifiers used for publications, this would argue in favour of identifiers with which many researchers are familiar from handling their bibliographic references. We could get the best of both worlds by having several layers of identifiers, assigned independently: some more familiar to archivists and IT specialists (ARK, OAI...), others, such as DOIs, more familiar to the world of publishing houses. Coordination with OLAC would seem really important here.

    Again, many thanks for getting this started!
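
The suggestion in the comment above, citing data sets through the same reference-manager formats used for publications, can be illustrated with a minimal Python sketch. It is not an LDIG or OLAC specification. The DatasetRecord class, the to_ris function, the mapping of the holding archive to the PB tag, and every value in the example record (title, creator, archive name, DOI, URL) are hypothetical and exist only for illustration. The sketch relies on the RIS reference type DATA and the tags AU, TI, PY, PB, DO, UR and ER.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of a citable linguistic data set."""
    title: str
    creators: List[str]   # "Family, Given" order, one entry per AU line
    year: int
    archive: str          # holding archive; mapping it to PB is an assumption
    doi: str              # persistent identifier, written to the DO tag
    url: str = ""         # optional landing page, written to the UR tag


def to_ris(record: DatasetRecord) -> str:
    """Serialise a record as one RIS entry of type DATA."""
    lines = ["TY  - DATA"]  # every RIS entry opens with its reference type
    lines += [f"AU  - {name}" for name in record.creators]
    lines.append(f"TI  - {record.title}")
    lines.append(f"PY  - {record.year}")
    lines.append(f"PB  - {record.archive}")
    lines.append(f"DO  - {record.doi}")
    if record.url:
        lines.append(f"UR  - {record.url}")
    lines.append("ER  - ")  # every RIS entry closes with the end-of-record tag
    return "\n".join(lines)


# Invented example record; none of these values refers to a real resource.
example = DatasetRecord(
    title="Example corpus of narrative recordings",
    creators=["Doe, Jane", "Roe, Richard"],
    year=2017,
    archive="Example Language Archive",
    doi="10.1234/example",
    url="https://example.org/corpus",
)

if __name__ == "__main__":
    print(to_ris(example))
```

In this sketch every contributor becomes a plain AU line, so role distinctions of the kind OLAC records (transcriber, translator, interviewer and so on) are lost. That is the compatibility question the comment raises, and a fuller mapping would need to decide where such roles go.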

  • Author: Sebastian Nordhoff

    Date: 02 May, 2017

    Thanks for this nice and timely initiative. I have the following comments to offer:

    • encourage depositing data in accessible dedicated archives
    • add the goal of establishing best practices for licensing linguistic data, also with regard to personality rights
    • add the goal of making methods of collecting/collating/analyzing/representing data available

    I understand that citation/attribution is a first step, but these other aspects are worth including from the outset as further goals.

    The Generic Style Rules for Linguistics should be added to the list of related projects.
