Managing Collections as Data services: the 14,000-foot view

May 29, 2019
Jul 11, 2019

Kevin Clair

image credit: "Grays and Torreys Peaks 2006-08-06," by Flickr user [Daidipya](https://www.flickr.com/people/16157855@N00), from Wikimedia Commons; CC-BY 2.0

My name is Kevin Clair and I am the Digital Collections Librarian at the University of Denver. I manage the Digital Collections Services unit in the University Libraries; in brief, we are responsible for the technical services side of Special Collections and Archives, including the Beck Archives of Rocky Mountain Jewish History, where the source materials for our Collections as Data project – the Jewish Consumptives’ Relief Society (JCRS) Records – are housed. In this post I’ll talk about some of the opportunities and challenges of incorporating Collections as Data services into my work as a department manager in the library.

Within that remit, Digital Collections Services handles digitization, preservation and reformatting, and resource description and metadata services. We build and maintain collections using our digital repository, and manage resource description using our archival collection management system. Through our collaborations with colleagues in Special Collections, we have provided bespoke collections-as-data services to researchers for some time, using a combination of digitized collections, traditional archival arrangement and description, and JCRS patient databases.

Being part of the first Collections as Data cohort has presented us with interesting opportunities, but also interesting challenges. Where does publishing collections as data fit naturally into the services we already provide? What new services, and consequently what new skills and workflows, are required in order to implement these services successfully? How can we implement these services without losing sight of our core mission, and the services we already provide to support it? As a unit manager, I see answering these questions as my role in the Collections as Data project.

Transcription of textual resources to enable search and retrieval has always been one of our primary services. Like most digitization units in academic libraries, we perform optical character recognition (OCR) on most of the textual resources we scan, and make the products of that OCR available through the digital repository for full-text search. Handwritten text recognition (HTR) services are a natural extension of that work. We have been using Transkribus for a few months now, training its algorithms on the handwritten manuscripts and correspondence from the JCRS Records that we have already scanned. The initial returns from this work, while far from what we hope and expect to see by the end of the grant, are nevertheless very promising and have exceeded our expectations. Alice Tarrant, our Digitization Coordinator, has led this initial work, and will speak more to its technical details in a future post.

This initial work has already raised a number of questions we will need to address as a unit. We are currently rebuilding our digital repository infrastructure; one goal of this work is to enable integrations with other systems as much as possible, allowing particular stars within our constellation of collection management tools to do what they do best. One of these stars is ArchivesSpace, our primary collection management tool and the source of most of the metadata in the repository. ArchivesSpace is many things, but it is specifically not a digital asset management system, so we cannot use it to store scans of JCRS correspondence. Nor can we easily use it to manage transcriptions, or the technical metadata extracted from Transkribus describing the HTR process, which may be useful for researchers wishing to use the JCRS collection as a dataset.

We have begun work within Digital Collections Services, and to some extent across technical services as a whole, on a data dictionary of the core data properties used across collection management systems for resource management and user retrieval. We hope this work will aid us in system integration and future data migrations. It is still in its very early days, and there is a lot left to be done. Notably absent – to me, anyway – are any technical metadata, such as information about HTR algorithm versions and other characteristics of a JCRS patient record transcript generated with Transkribus. Identifying and documenting these property definitions will help both researchers and those working directly on the project, and will also help should we ever move away from Transkribus as an HTR tool, or use other services to complement it.
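To make the idea of technical metadata for a transcript a little more concrete, here is a minimal sketch in Python of what such property definitions might look like. Every field name and value below is a hypothetical placeholder for illustration; none of it is drawn from our actual data dictionary, from ArchivesSpace, or from any Transkribus export format.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class HtrTranscriptMetadata:
    """Technical metadata for one machine-generated transcript.

    All property names here are hypothetical; a real data dictionary
    would define and document its own.
    """
    transcript_id: str        # local identifier for the transcript
    source_scan_id: str       # identifier of the scanned page it came from
    htr_tool: str             # name of the HTR service used
    model_name: str           # name of the trained recognition model
    model_version: str        # version of that model at transcription time
    transcribed_on: date      # date the transcript was generated
    character_error_rate: Optional[float] = None  # left empty if not measured

    def to_record(self) -> dict:
        """Return a plain dict suitable for export, with the date as ISO 8601."""
        record = asdict(self)
        record["transcribed_on"] = self.transcribed_on.isoformat()
        return record


# Placeholder values only; these do not describe a real transcript.
example = HtrTranscriptMetadata(
    transcript_id="transcript-0001",
    source_scan_id="scan-0001",
    htr_tool="Transkribus",
    model_name="example-model",
    model_version="0.1",
    transcribed_on=date(2019, 1, 1),
)

print(example.to_record())
```

Recording the tool name and model version as separate properties, rather than in a free-text note, is what would let a researcher filter transcripts by how they were produced, and would keep the record meaningful if a different HTR service were used later.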

I wasn’t able to attend the recent Collections as Data project meeting in Philadelphia, but prior to it we had some conversations locally about developing user personas for potential stakeholders in the JCRS project. This was a very valuable activity, and it ties in with conversations we have had in technical services in the Libraries around the relative value of different metadata properties for resource search and retrieval, and the importance of talking to users in establishing that value. While we have talked to scholars at DU about the scholarly value and importance of the JCRS Records and of publishing them as datasets for research, we don’t yet have a sense of how scholars in different disciplines might find or use such datasets in their work. Developing personas, and refining them through future conversations with digital humanists and others working with computational data in digital library collections, will help us greatly here. We hope it will guide us both in publishing our collections as data and in thinking through how those services might affect the way we model library data across all of our collection management tools.

That’s my introduction to what I’m thinking about, as a unit manager, in terms of incorporating collections as data into our existing services. You’ll hear more about the details in future posts, and I’m looking forward to checking back in a few months about what we have learned.