Cultural Heritage Science Data at the Crossroads
Submitted by Fenella France
- Collaborative Notes Link: https://drive.google.com/open?id=1007wm-do6sU32mSXpk5mP58PgqeQLsFwhvyDuO...
Cultural heritage data is at the convergence of historical and cross-cultural humanities, social and physical science data. Further, heritage science [1] research data is inherently multi-disciplinary, involving scientists from a diverse range of fields, including chemistry, physics, material science, engineering, and archaeology. With increasing heritage preservation challenges, for both moveable collections and heritage sites, the data and information collected need to be collated and shared with an ever-wider range of colleagues, not infrequently with expertise in previously unrelated fields. For instance, it is more critical than ever that sites and environments are closely monitored, with changes tracked over time, which is itself an important component of many scientific research projects. Fusing these data sets requires a common terminology to allow event-based coordination of research activities, and the ability to crosswalk between appropriate ontologies and schemas.
To align with the mission of RDA, we consider this data initiative extremely relevant to real-world data issues, especially in relation to the destruction and damage of cultural heritage materials worldwide. As part of our engagement with the Council on Library and Information Resources (CLIR) we are looking to use this infrastructure to address heritage preservation challenges through truly global, linked and shared heritage science data. CLIR has a strong global focus on the advancement of knowledge not bound by national lines, since the challenges that climate change poses to cultural heritage are likewise not bound by nations (https://www.clir.org/global/). CLIR’s current response is an evolving initiative, Pangia: an open, interoperable, advanced quantitative environment that will preserve and make reusable digitized cultural and scientific knowledge, essential to addressing the climate crisis. Our CH data platform is integrally linked with this initiative. Pangia is in partnership with Stanford University and currently in formal discussion with Europeana (Europeana.eu, an initiative of the European Union for sharing heritage collection data and digital tools). We consider this data challenge to have real-world impact, and the BoF will enable us to form a strong WG, networking across science and humanities fields to share data and technical solutions.
We are aware of numerous RDA IGs sharing and discussing solutions and challenges relevant to the cultural heritage sphere. Our hope is that interested parties can be drawn together to form a WG to address the increasingly critical need for a baseline level of data sharing, and to establish this through mutual agreement and published guidelines for the integration of data. We have made some headway independently, and would like to demonstrate what we have been working on for discussion and to open up the subject. However, our work in isolation cannot address the need for international, cross-institution, data interoperability.
IGs we have coordinated with include Libraries for Research Data IG, Research Data Architectures in Research Institutions IG, Chemistry Research Data IG, RDA/CODATA Materials Data, Infrastructure & Interoperability IG, and the Physical Samples and Collections in the Research Data Ecosystem IG. There is interest in various overlap areas with colleagues including data scientists, chemists, materials scientists, earth scientists, life scientists, archaeologists, anthropologists, computer and information scientists, and collection and museum curators. We will be setting up conference calls early in the New Year to discuss specific involvement and further information for group members.
Most disciplines use similar instruments and techniques; access to non-proprietary file formats and a data model/structure that allows dataset fusion would therefore be of great benefit for collaborations, especially at the international level. The STEM community could make scientific data and datasets available to a broader user base, and share relevant data between disciplines more readily, through structured linked data[2] that reuses existing vocabularies[3] and models[4], rather than having each institution create new structures that meet only its own needs and prove more or less incompatible with those of other institutions. The United States and Europe have initiatives to move forward with shared open data, especially with regard to broader and easier dissemination for diverse user groups: scientists, researchers, academia, government, and the public.
The purpose of this proposed BoF meeting is to address the increasingly critical need for a baseline level of data sharing, and to establish this through mutual agreement and published guidelines for the integration of data. To achieve this aim we will explore what could be used as the top 10 ontologies for scientific data terminology, accepted file formats and structure, and required high level metadata. Further, we want to understand which other RDA STEM/STEAM group projects related to shared LOD scientific authoritative sources for instrumentation and analytical techniques are underway that we can make use of, contribute to, and integrate with heritage science data. Relevant groups include the IG on Digital Practices in History and Ethnography, the IG on International Indigenous Data Sovereignty, the IG on Physical Samples and Collections in the Research Data Ecosystem, the IG on Social Science Research, the WG Empirical Humanities Metadata, and the WG Metadata Standards for Attribution of Physical and Digital Collections. Discussions with data, informatics, and library colleagues at the National Institute of Standards and Technology (NIST) and other institutions have indicated significant crossover between research data within chemistry, archaeology, materials science, physics and other fields. Addressing the current capabilities for true linked open data that exist within STEM disciplines to provide authoritative sources will form the scope of the discussion around heritage science, to determine how this may already be represented or developed within existing RDA interest and working groups. Analytical and instrumentation data types include spectral imaging, x-ray fluorescence (XRF), gas-chromatography-mass spectrometry (GC-MS), fiber optic reflectance spectrometry (FORS), Fourier transform infrared spectrometry (FTIR), and size exclusion chromatography (SEC), to name a few.
Meeting objectives: Discussion of what a baseline level of data sharing would look like, and what guidelines would be needed to begin that integration of data.
- Discuss the current heritage science data initiative to assess the objective analysis of 3000 of the same volumes, instrumental sources, the challenges of access to active datasets, and avoiding creating new terminology.
- Discussion of how to move forward in collaboration with RDA partners for creation of a Working Group
- Discussion of advantages, issues, challenges and opportunities
- Current heritage science and STEM/STEAM vocabularies, terminologies
- Identification of interested and additional partners to expand and engage a broader and more diverse audience (such as small and lower–funded institutions or in economically disadvantaged nations and regions)
- Identification of interested and additional partners for developing an expandable base model suitable for describing cultural heritage scientific data alongside humanities’ data: discussion will include baseline guidelines with minimum barrier to entry to encourage adoption
The target audience of this meeting will include other scientific discipline interest groups, cultural heritage members, libraries and archives, data science, reference collections, metadata standards, ontologies, research data repository or infrastructure developers and providers, and all interested in integrated datasets and data fusion between related scientific disciplines. A declared goal of this session is to establish an RDA working group and learn from colleagues in related scientific disciplines who are facing similar issues for linking and sharing data.
- Presentation and demonstration of the data modelling for “Assessing the Condition of the National Collection” research project with active data integration (20 minutes)
- Overview of the CLIR Pangia global data sharing initiative (https://www.clir.org/?s=pangia) (5 minutes)
- Update on the current status of the European Research Infrastructure for Heritage Science – Digilab (5 minutes)
- Review of commonly used data research analytical and instrumentation: For heritage data these include spectral imaging, microscopy, x-ray fluorescence (XRF), gas-chromatography-mass spectrometry (GC-MS), fiber optic reflectance spectrometry (FORS), Fourier transform infrared spectrometry (FTIR)
Presenters: Fenella France* ***, Andrew Forsberg*, John Henry Scott** (* Library of Congress, LC; ** National Institute of Standards and Technology, NIST; *** Council on Library and Information Resources, CLIR)
- Overview of the objectives for moving forward and discussion of how to engage with current interest groups (discussion, all participants) (40 minutes)
- Other international partners include the Council on Library and Information Resources (CLIR), the International Image Interoperability Framework (IIIF), Stanford University, the European Research Infrastructure for Heritage Science (E-RIHS)[1] (National Research Council (CNI), Florence, Italy), and the National Gallery, London.
- Identification of other potential group members (all participants) (5 minutes)
- Summary of the results, actions, and identification of contributions of the group members (TBD) (25 minutes)
The intent of this BoF is to develop a working group to identify specific components of heritage science data that overlap and can enhance other current RDA WG initiatives.
The minutes of the meeting will be published at the latest one week after the session as an attachment to the session’s web page.
This approach is to assemble a pragmatic selection of use-case driven solutions for real world scenarios, one which privileges ‘recipes’ with minimal barriers for entry. To this end, we are currently working on a pilot series of LOD design patterns/recipes for the Mellon-funded ‘Assessing the Physical Condition of the National Collection’[5] (APCNC) project. As is the norm for the cultural heritage domain, APCNC’s data, its producers and consumers, bridge numerous traditional disciplinary borders. We need to integrate and query data from some quite standard sources, such as online cataloging records (OCLC[6]), but the overwhelming majority of it is from more disparate sources, such as:
- the actual publication details collected from the partner institutions’ physical books;
- quantitative and qualitative book, binding, and paper descriptions and condition assessments; and,
- data and metadata for a wide array of scientific procedures, instruments, and analyses.
We are using CouchDB[7] as a JSON[8] document store so that we can interrogate heterogeneous datasets for trends and correlations. However, a primary goal is to publish (via an API) the same data serialized as JSON-LD (and Turtle, etc.) for sharing publicly, and for consumption by other internal tools and services. We could use a triple store, but would prefer that it not be a requirement.
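As a rough illustration of the document-store-to-JSON-LD step described above, the sketch below turns a CouchDB-style JSON document into a JSON-LD view by attaching a context and an identifier. The field names, the `example.org` namespace, the `BASE_IRI` value, and the `to_json_ld` helper are all assumptions made for the example; they are not taken from the APCNC project's actual data model or API.

```python
import json

# Hypothetical context mapping local JSON keys to IRIs. Only the Dublin Core
# "title" IRI is a real term; the example.org IRIs are placeholders.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "technique": "https://example.org/vocab/technique",
    "value": "https://example.org/vocab/value",
}

# Placeholder base IRI for published records (an assumption).
BASE_IRI = "https://example.org/records/"


def to_json_ld(couch_doc):
    """Return a JSON-LD view of a CouchDB document without mutating it."""
    # CouchDB reserves top-level fields that start with "_" (e.g. _id, _rev),
    # so they are dropped from the published view.
    body = {k: v for k, v in couch_doc.items() if not k.startswith("_")}
    return {"@context": CONTEXT, "@id": BASE_IRI + couch_doc["_id"], **body}


# A made-up measurement record, shaped like a document fetched from CouchDB.
doc = {
    "_id": "sample-001",
    "_rev": "1-abc",
    "title": "Paper sample",
    "technique": "FTIR",
    "value": 0.42,
}

json_ld = to_json_ld(doc)
print(json.dumps(json_ld, indent=2))
```

Because the stored document stays plain JSON and the context is attached only at publication time, a triple store remains optional: any JSON-LD processor can expand the published view into triples if a consumer needs them.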
[1] https://en.wikipedia.org/wiki/Heritage_science Heritage science is cross-disciplinary scientific research of cultural heritage. It is the application of science and technology to heritage to improve understanding, engagement and its long-term management.
[2] https://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
[3] https://en.wikipedia.org/wiki/Controlled_vocabulary Vocabulary terms allow LOD to clearly identify exactly what data refers to in a way that both people and computers can interact with. https://www.w3.org/TR/ld-glossary/#vocabulary; And/or: https://www.w3.org/TR/ld-glossary/#ontology
[4] https://en.wikipedia.org/wiki/Data_model LOD models define how the data is organized in a standardized way.
[5] https://nationalbookcollection.org/
[6] https://www.oclc.org/en/home.html and https://www.worldcat.org/
[7] Apache CouchDB: https://couchdb.apache.org/
[8] JSON: https://www.w3.org/TR/ld-glossary/#json, JSON-LD: https://www.w3.org/TR/ld-glossary/#json-ld
Reducing the gap between the humanities and scientific aspects of cultural heritage can foster the engagement of cultural heritage and STEM scholarly communities within the digital realm, creating a more accessible and interoperable digital library of images and data.
Accordingly, we have been assessing how best to reuse existing data initiatives in each field, beginning with the CIDOC CRM extension, the Scientific Observation Model (CRMsci).[1] A large part of the attraction was that the CIDOC CRM itself has been adapted by Linked Art’s[2] streamlined LOD model for data sharing between cultural heritage institutions. Linked Art enjoys a large international community,[3] which overlaps to a high degree with IIIF-adopting institutions.[4] The model attempts to retain the flexibility of the CIDOC CRM, and compatibility with it, while greatly reducing the burden of its complexity, and we had high hopes for performing an analogous transformation of the CRMsci. However, compatibility issues between the CRM and CRMsci, together with the latter being better suited to mapping the logic of the scientific process than to defining and sharing interoperable datasets, led us to look at other ontologies for scientific procedure data. We are currently evaluating the Semanticscience Integrated Ontology (SIO)[5] for procedures, measurements, and other scientific data, with the intention of using it alongside Linked Art for cultural heritage institutions’ other domains.
Other terminology and vocab links for heritage data and science include:
Getty Vocabularies:[6] The Linked Art model strongly recommends using Getty vocabulary terms wherever possible for the sake of consistency and interoperability, and Getty’s vocabularies are a very good fit for the cultural heritage community.[7] However, for scientific data the terms are not always adequate, or appropriate ones simply do not exist. We are attempting an inclusive approach – for instance, for the measurement units used in heritage science, these can be specified with both Getty vocabularies and the Units of Measurement Ontology (UO),[8] an approach that Linked Art’s ‘bucket’ approach[9] to classifications handles gracefully, and we think should cover the needs of most interested parties.
Open Biological and Biomedical Ontology Foundry (OBO):[10] a collaborative collection of interoperable ontologies for BioMed applications, a number of which are well suited for the techniques and procedures, instruments, and data types used in heritage science labs, including the Chemical Entities of Biological Interest, Chemical Methods, Mass Spectrometry, and Units of Measurement ontologies (ChEBI, CHMO, MO, UO), and an OBO-friendly edition of the NCI Thesaurus (NCIT). In addition to supplying scientific terms, the NCIT has proven extremely useful for its qualitative terms on the cultural heritage humanities side as well. (Most vocabularies, the Getty included, steer clear of qualifying terms like ‘temporary,’ ‘permanent,’ ‘partial,’ ‘acceptable,’ and ‘unacceptable,’ each of which has its place in this domain, for example in Oddy Test results.[11])
IUPAC’s Gold Book:[12] Recently updated, edging ever closer to becoming an LOD source.
Rare Books and Manuscripts Controlled Vocabularies (RBMS):[13] Some cultural heritage projects have specialized needs that go beyond what the Getty can accommodate, and such is the case for the APCNC project, where specific binding, paper, and printing terms are required. Using the RBMS controlled vocabularies is an example of how a core model based on a context with reconciled Linked Art/SIO/Getty ontologies might be expanded for domain-specific needs.
Along with SIO, we are using the partner Chemical Information Ontology (ChemInf), both of which, with OBO Foundry ontologies, interoperate well in our experience with NIH’s PubChem.
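The inclusive, multi-vocabulary approach to units described above can be sketched as follows: a measurement's unit carries a classification array holding one term from the Getty vocabularies and one from UO. The structure is only loosely modelled on Linked Art's JSON shape; the `classify` and `dimension` helpers are hypothetical, placing `classified_as` on the unit is an assumption made for this example, and both term identifiers are placeholders rather than real vocabulary entries.

```python
import json

# Placeholder identifiers (NOT real terms): substitute the actual Getty AAT
# and Units of Measurement Ontology IRIs for the unit in question.
GETTY_UNIT = "http://vocab.getty.edu/aat/000000000"
UO_UNIT = "http://purl.obolibrary.org/obo/UO_0000000"


def classify(*terms):
    """Build a classification array from (iri, label) pairs."""
    return [{"id": iri, "type": "Type", "_label": label} for iri, label in terms]


def dimension(value, unit_label, unit_terms):
    """Describe a measured value whose unit is classified in several vocabularies."""
    return {
        "type": "Dimension",
        "value": value,
        "unit": {
            "type": "MeasurementUnit",
            "_label": unit_label,
            # The 'bucket': an array, so terms from any number of
            # vocabularies can sit side by side.
            "classified_as": classify(*unit_terms),
        },
    }


# A made-up paper thickness measurement.
thickness = dimension(
    0.11,
    "millimeters",
    [(GETTY_UNIT, "millimeters (Getty)"), (UO_UNIT, "millimeter (UO)")],
)
print(json.dumps(thickness, indent=2))
```

A consumer that only understands one vocabulary can filter the array for the identifiers it recognizes and ignore the rest, which is what keeps the barrier to entry low.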
[1] CIDOC is ICOM’s International Committee for Documentation, http://network.icom.museum/cidoc/. The CIDOC CRM is CIDOC’s Conceptual Reference Model, which provides an event-driven formal structure for describing concepts and relationships used in cultural heritage documentation, http://www.cidoc-crm.org. CRMsci, http://www.cidoc-crm.org/crmsci/, uses and extends the CIDOC CRM, and is a formal ontology intended to be used as a global schema for integrating metadata about scientific observation, measurements and processed data in descriptive and empirical sciences such as biodiversity, geology, geography, archaeology, cultural heritage conservation and others in research IT environments and research data libraries.
[3] https://linked.art/community/index.html
[4] https://iiif.io/, see: https://iiif.io/community/#participating-institutions
[5] https://sio.semanticscience.org/, see: https://www.researchgate.net/publication/260608288
[6] http://www.getty.edu/research/tools/vocabularies/ and http://vocab.getty.edu/
[7] Including, for instance, historical units of measure, such as the pre-metric system French ‘ligne’: http://vocab.getty.edu/page/aat/300435501
[8] https://github.com/bio-ontology-research-group/unit-ontology. Alternatives, such as Quantities, Units, Dimensions and Types (QUDT) also suggest themselves, http://www.qudt.org/2.1/catalog/qudt-catalog.html.
[9] Specifically, a record’s type/classification can be specified with an array of appropriate values, which is efficient, terse, and convenient for programmatic handling. e.g., a work of art can be classified as a ‘painting’ and ‘artwork’: https://linked.art/model/base/#types-and-classifications; the ‘part’ of a photograph describing its front can be classified as ‘front part’ and ‘artwork’: https://linked.art/model/object/physical/index.html#object_50
[10] http://www.obofoundry.org/
[11] https://www.conservation-wiki.com/wiki/Oddy_Test
[12] https://goldbook.iupac.org/
[13] http://rbms.info/vocabularies/index.shtml