RDA P3 BoF Rescuing Heritage Data

20 February 2014

Rescuing Heritage Data BoF

Chair: Elizabeth Griffin, Dominion Astrophysical Observatory, Victoria, Canada.

Scientific research, particularly into long-term variability, needs to access, share and analyse all relevant data, *including pre-digital ones*. Unfortunately rather few pre-digital observations can be accessed electronically today, yet they have the same scientific weight as born-digital ones.  The born-digital/pre-digital divide is an artificial side-effect of technology, and its perpetuation threatens the integrity of modern research. This BoF will discuss the problems, the risks, possible solutions, and case studies.

The natural sciences are empirical, meaning that knowledge is derived from observation or experiment.  Observations in the natural sciences have been made, recorded and (mostly) kept for a very long time.  Sets of those data build up, per science or field, a unique history of the way the measured properties have changed naturally, evolved, or been modified for other reasons.
Since the advent of digital technology, digital detectors and space missions, data management has become vastly more complex than it once was, but also potentially much more powerful.  Models and data simulations are used widely (think of research into climate change).  In the (still unresolved) debates regarding the fact, and (if a fact) then the cause(s), of global warming and associated trends, ingesting all possible electronic data into computer-generated solutions is the central activity of vast research efforts in Earth sciences in many countries and centres today.

However, the data which *really* establish baselines for long-term variability are the historic measurements - hand-written records, photographs, log-books, pro-formas, or early magnetic tapes with unrecognized formats and devoid of decoding instructions and meta-data.  All of the established sciences have some legacy of such analogue observations, and if only the data were accessible in digital form and on-line, how much richer the scientific research, how much more reliable the models and the predictions, if only, if only ...  yet the number of efforts to recover and fully digitize those analogue measurements is pitifully small.

Why?  The technology is hardly challenging; "scanning" a paper record requires minimal training, and compiling specific information is often only copy-typing. Is the obstacle money, resources, or (more worryingly) attitude?  It is an interesting truism that the perceived quality of a product is linked tightly to the status of the operator, so when high-flying scientists delegate the use of simple but outmoded technologies to less-qualified or volunteer assistants, the products get branded accordingly.  Yet in the few cases where legacy data have been successfully recovered, digitized and incorporated into modern research (for example, GODAR in oceanography; measuring the Earth's ozone from historic astronomical spectra; determining weather patterns from old ship log-books) the results have been stunning, the sort that make headlines.  Why is there no concerted effort among scientists to locate, preserve and digitize all that heritage information upon which modern assertions of (say) climate variability rest so sensitively?  Is there a reluctance to look backwards when "progressive" people only look forwards?  Is it competition for funding, a lack of precedent, or insufficient access to all of the skills that are actually required?

Modern educational courses on aspects of data acquisition, handling, management and archiving refer only to born-digital data.  Implicitly, therefore, the teaching is that only those (recent) data are "valuable".  Is the student tutorial the best place to begin modifying attitudes?

Activities of "data rescue" need to include elements such as location, provenance assessment, handling, preservation and validation that are the province of archivists, librarians and IT specialists, but often there is a virtual divide between those fields and the pure sciences.  How could the situation be improved (as improved it certainly must be)?

A prime problem for the science researcher is knowing or discovering what is actually out there.  How can the public (a.k.a. citizen scientists) assist, and how can inventories (the only sure way to learn about "lost" data) be made?

Success stories encompass not only the sciences but also the humanities, such as archaeology, papyrology, philology, and cultural history of every hue.  How can we create comprehensive lists of those successes?

This BoF will discuss the above issues - and any others that attendees wish to raise - and debate ideas for solutions.  Contributions (formal or just brief statements) will be welcome from anyone who has had experience of data recovery, whether successful or otherwise (particularly the latter, as bad experiences are great teachers).

The driving force behind this BoF is the CODATA Task Group on "Data At Risk", established in 2010.  The TG objectives, membership and other vital statistics can be read at http://ils.unc.edu/~janeg/dartg/.  Descriptions of recent activities and aspirations can also be read on the CODATA blog at http://codata.org/blog/2014/01/19/building-support-for-principle-guidelines-for-data-at-risk/

BoF Agenda: 90 minutes is not very long to share all we know, or want to learn, about Data At Risk.  We will commence with an Introduction by the Chair, and then look in turn at the following:

** The status of 'lost' data, and evolving attitudes
** Success stories - how to spread the word?
** Skills required for appropriate and adequate data recovery and digitization
** Education - how important is it, and at what level/field?
** "Lost" data in the public domain: can (and should) citizen science help?

Those with material to share, or ideas to offer, are invited to email me so that a list of intended contributions can be created.  The BoF is in effect a plenary discussion, but we should ensure that those planning to contribute do get a slot.

Thank you!

We look forward eagerly to discussing this remarkably neglected aspect of scientific data, and to meeting you at this BoF on Wednesday March 26.

Elizabeth Griffin