Of interest: Cross Linguistic Data Formats

16 Apr 2018

LDIG members Simon Greenhill and Robert Forkel (Max Planck Institute for
the Science of Human History, Jena) share the following message and the
attached paper on Cross Linguistic Data Formats. They invite LDIG members
to comment either here or in the GitHub repository, where all discussion is
open and archived with the CLDF documentation for posterity!
We’ve been thinking hard about how best to store cross-linguistic data in a
way that follows best practices.
Our approach is a set of lightweight data formats based on CSV called CLDF
(Cross Linguistic Data Formats). We actually just submitted a description
of this data format (attached below) a few days before your ‘position
statement’ paper was released last year, and I met Lauren at ALT in
Canberra and discussed this briefly with her.
This format is based on our years of experience building and running some
of the world’s largest cross-linguistic databases. We followed a strict set
of design principles: the format should be lightweight, ‘natural’ for
practitioners to use (i.e. something that fits well with how linguists are
used to working), prioritise human readability over machine readability (no
XML!), and everything should be explicitly defined and referenced.
Currently we have protocols for wordlists, grammatical datasets, parallel
texts and dictionaries. The format is explicitly designed to be extensible
to new types of data as well.
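To give a sense of what "based on CSV" can mean in practice, here is a minimal sketch in Python that reads a small wordlist with nothing but the standard library. The column names (ID, Language_ID, Parameter_ID, Form) are assumed here for illustration, the rows are made up and not taken from any real dataset, and forms_by_concept is a hypothetical helper; the attached paper and the CLDF documentation remain the authority on the actual layout.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the language IDs and forms are invented placeholders.
# Column names are an assumption about a wordlist-style table.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
1,lang_a,hand,lima
2,lang_a,eye,mata
3,lang_b,hand,rima
4,lang_b,eye,mata
"""


def forms_by_concept(csv_text):
    """Group forms by concept (Parameter_ID), mapping each language to its form."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Parameter_ID"]][row["Language_ID"]] = row["Form"]
    return dict(grouped)


if __name__ == "__main__":
    # Print one line per concept with the forms attested in each language.
    for concept, forms in forms_by_concept(FORMS_CSV).items():
        print(concept, forms)
```

Because the table is plain CSV, the same file can be opened in a spreadsheet or read in any language with a CSV parser, which is the point of the "lightweight" and "human readable" principles described above.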
CLDF now underlies all of our language databases (a short and incomplete
list is here: http://clld.org/datasets.html, and there are more coming:
http://glottobank.org/).
We have an open framework for it on GitHub (https://github.com/cldf/cldf),
along with some examples (e.g.
https://github.com/cldf/cldf/tree/master/examples) and documentation about
how to use it in programming languages like Python
(https://github.com/cldf/pycldf) or R (e.g.
cldf_r/simple_access_to_values.ipynb).

File Attachment:
Forkel_et_al.pdf (353.68 KB)