Exploring Services for Collections as Data at the Library of Congress

31 Oct 2022

Submitted by Eileen Manchester

Meeting objectives: 

The objective of this Birds of a Feather meeting is to introduce the Research Data Alliance community to the Library of Congress LC Labs team’s efforts to make digital historic materials available for bulk analysis as datasets. The session will focus on the findings from a specialized user research effort to better understand the needs of researchers and practitioners accessing collections-derived data.

For five years, LC Labs’ approaches have focused on the barriers to increased use of the Library of Congress digital collections and have extended the “Collections as Data” framework under investigation in other cultural heritage institutions. Through these lenses, the LC Labs team has developed user research, proof-of-concept tools, and pilot initiatives to make the Library’s digital materials accessible and available for a wider range of research and creative purposes.

The informative meeting will highlight past work to both enhance Library data and make them more intelligible. To date, this work has included the public release of the loc.gov JSON/YAML application programming interface; the incubation of the Library’s crowdsourced transcription program; a body of experiments investigating responsible application of machine learning methods for digital cultural heritage; public-facing exploratory interfaces developed by external Innovators in Residence; and preliminary recommendations for designing human-in-the-loop workflows for increasing the discoverability of library and archival materials.
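As a rough illustration of what programmatic access through the loc.gov JSON API can look like, the sketch below builds a request URL for a collection and pulls item titles out of a decoded response. It assumes the API's `fo=json` query parameter together with `c` (results per page) and `sp` (page number), and it assumes the response carries a `results` list whose items have a `title` field. The helper names `build_collection_url` and `extract_titles` and the `stereograph-cards` slug are illustrative choices, not part of the Library's documentation.

```python
from urllib.parse import urlencode

# Base address of the public Library of Congress website.
BASE_URL = "https://www.loc.gov"


def build_collection_url(slug, page=1, per_page=25):
    """Return a JSON request URL for a collection (hypothetical helper).

    Assumes the query parameters fo=json (format), c (results per page),
    and sp (page number).
    """
    query = urlencode({"fo": "json", "c": per_page, "sp": page})
    return f"{BASE_URL}/collections/{slug}/?{query}"


def extract_titles(payload):
    """Collect item titles from a decoded JSON response (hypothetical helper).

    Assumes the payload has a "results" list of dicts with a "title" key;
    missing titles become empty strings.
    """
    return [item.get("title", "") for item in payload.get("results", [])]


if __name__ == "__main__":
    # Prints the URL only; fetching it is left to the reader's HTTP client.
    print(build_collection_url("stereograph-cards", page=1, per_page=50))
```

Keeping URL construction separate from the network call makes the request easy to inspect and log before any data is downloaded in bulk.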

The presentation will conclude by sharing outcomes from LC Labs’ most recent initiative centered on this topic: Computing Cultural Heritage in the Cloud (CCHC). Funded by a $1 million grant from the Andrew W. Mellon Foundation, CCHC investigates the service models, cost implications, and technical affordances of providing access to cultural heritage collections as data in the cloud.

In the fall of 2022, the CCHC team invited seven cultural heritage data experts from around the world to test programmatic access pathways (i.e., access via computer programs) for retrieving data in the Library’s cloud-based storage environment. The experts were offered the opportunity to explore three different datasets of digital Library content ahead of the virtual CCHC Data Jam. Each participant then shared feedback on both the access pathway and the dataset itself. The datasets are now available in LC Labs’ experimental sandbox environment (https://data.labs.loc.gov/) and include nearly 40,000 stereograph card images from the loc.gov Stereograph Cards collection; about 5,000 map images from the Library’s Austria-Hungary cartographic resources; and over 165,000 full-text book files from the Selected Digitized Books collection.

As a result of this effort to understand and prototype contextualized data packages, the LC Labs team is poised to continue iterating upon datasets developed for a range of users, while simultaneously creating mechanisms to better understand the service model requirements to support deep engagement with the Library’s collections as data.

Meeting agenda: 


Collaborative session notes: https://docs.google.com/document/d/1-3WjI0q3Wqj5233bGet_RiOBxnQz8ZwXqfUE...


Text in orange indicates revisions made on 12/12/22 in response to reviewer feedback. 

  1. Introducing LC Labs
  2. Accessing the Library's collections "as data": strategies and examples for end users
    1. Loc.gov JSON API and supporting documentation
    2. Collection Readiness
    3. Crowdsourcing and machine learning
    4. Digital Scholarship support
  3. About Computing Cultural Heritage in the Cloud
    1. 2022 CCHC Data Jam Goals & Participants
    2. Findings from Data Jam aka feedback from end users 
  4. Time for Feedback and Discussion with RDA members & ideas for potential collaborations 
Type of Meeting: 
Informative meeting
Short introduction describing any previous activities: 

In September 2017, the Library established a group of innovation specialists to support creative uses of the digital collections. LC Labs works with colleagues around the institution to help throw open the Library’s treasure chest, connect more deeply with researchers and the public, and cultivate a culture of continuous learning.

Through research, experimentation, and collaborations with other federal agencies and cultural heritage groups, some of the Library’s brightest ideas have become vivid reality. The original Library of Congress API has evolved into three distinct services and an array of machine-readable access methods. The early Beyond Words crowdsourcing pilot has grown into By the People, now a permanent Library program with thousands of dedicated volunteers. The efforts of the Library’s very first Innovator in Residence, data artist Jer Thorp, now sit in company with ideas from Brian Foo, Benjamin Lee, and Courtney McClellan. And numerous other investigations have explored machine learning, speech-to-text transcription, emulation environments, and other ways of using technology to help make the collections more available. Read more in this recap post on the Signal Blog: https://blogs.loc.gov/thesignal/2022/09/lc-labs-is-celebrating-five-years/

BoF chair serving as contact person: 
Meeting presenters: 
Eileen J. Manchester (ejakeway@loc.gov); Meghan Ferriter (mefe@loc.gov)
Applicable Pathways: 
Data Infrastructures and Environments - Institutional
Training, Stewardship, and Data Management Planning
Discipline Focused Data Issues