Composed by: Dr Helene Andreassen (RDA/EOSC Future Ambassador for Linguistics), Andrea Berez-Kroeker, Lindsay Ferrara
Contributors: List TBA
Comments requested: Please note that this is a new Discipline page, and it is open for comments from the RDA Community. To add your input please use the comments section below.
Downloadable disciplinary info sheet: Linguistics
Overview of data-related practices in Linguistics
“Data, in many forms and from many sources, underlie the discipline of linguistics. [...] From descriptive to theoretical work, from corpus-based to introspection-based inquiry, from quantitative to qualitative analysis, linguists rely on data every day. [... D]ata must be understandable, discoverable, reusable, shareable, remixable, and transformable.”
(Berez-Kroeker, A.L., McDonnell, B., Collister, L.B., Koller, E. 2022. Data, Data Management, and Reproducible Research in Linguistics: On the Need for The Open Handbook of Linguistic Data Management. In Berez-Kroeker, A.L., McDonnell, B., Collister, L.B., Koller, E. (eds.), The Open Handbook of Linguistic Data Management, p. 3. Cambridge, MA: MIT Press Open. https://doi.org/10.7551/mitpress/12200.003.0005)
Linguistics has a history of developing data practices in relative isolation by subfield, lab, and researcher, which means that broad disciplinary discussions about the role of data in our research are needed. The value of data to our field is under-recognized, and language data carries an added ethical dimension: handling the words and languages of historically marginalized peoples requires particular care.
The Linguistics Data Interest Group of the RDA endeavors to broaden the conversation around research data and increase the competence of practitioners in our field about methods for data handling. Our outputs include:
The interest group is currently working on a needs analysis to determine which educational efforts are needed to train linguists broadly in the methods of open science. This work is supported by the RDA/EOSC Future Domain Ambassador #2 (2022-2023) project. LDIG members are also involved in SSHOC, a project responsible for developing the social sciences and humanities area of EOSC.
What kinds of data are used in linguistics?
Documentary linguistic data (e.g., text, audio, video), grammaticality judgements, instrumental data (e.g., eye tracking, EEG measurements, spectrograms), experimental data, derived data (e.g., transcriptions, annotations, syntactic treebanks), metadata, lexical data (e.g., dictionaries), language catalogs, computational data, interview data
Where is linguistics data shared?
Domain-specific repositories for language and linguistics, institutional repositories, national repositories, Open Science Framework, personal websites, article supplementary files
How is linguistics data shared (e.g. standards, guidelines, trusted examples)?
Metadata requirements in repository guidelines, e.g. domain-neutral ones such as Dublin Core, or more discipline-specific ones such as the Data Documentation Initiative and the International Standards for Language Engineering
File format requirements in repository guidelines
CC licenses, CLARIN licenses (https://www.clarin.eu/content/licenses-and-clarin-categories)
Citation guidelines: Recommendations in journal author guidelines (e.g. IASSIST's Quick Guide to Data Citation or DataCite), Tromsø Recommendations for Citation of Research Data in Linguistics (published in late 2019). Also (auto-generated) citation format on the dataset landing page in repositories.
What are typical file formats for linguistics data?
Audio: .wav, .mp3, .flac
Video: .mpeg, .mp4, and others
Text: .txt, .pdf, .docx, .eaf
Image: .tiff, .jpg, .png
Tabular data: .csv, .xlsx, .txt, .tsv, .json
Programming: .r, .py, .ipynb
Structured attribute-value data: .xml and derivatives (.lmf, .tbx, .tmx, .tei, .cmdi, and others), .json
Which disciplines collaborate or interface with linguistics?
(Social) Psychology, Gesture studies, Anthropology, Semiotics, Cognitive Science, Education, Applied Linguistics, Health Sciences
RDA Groups active in this discipline
RDA Groups in this discipline that are no longer active
Highlighted RDA outputs