Strategic Planning for Adopting Collections as Data

Aug 20, 2019

Kim Pham

My name is Kim Pham and I am the IT Librarian in the Library Technology Services department at the University of Denver Main Library. My role in our collections as data project is largely that of a project manager: I coordinate the various goings-on and do some of the high-level design and specification work to create our digital collections as data. If you haven’t read any of the other blog posts, we’re focusing specifically on producing machine-readable transcripts from the Jewish Consumptives Relief Society records, a series of handwritten medical documents, application letters, cheques, and more.

In my role I get to think about some broader implications of collections as data in our organization, such as what resulting services we should provide, how to sustain this work going forward, and what impact it will have on library staffing, workflows, and processes. I read up on various communications (mostly blog posts, conference presentations, and papers) to see what progress the library community at large has made on adopting handwritten text recognition (HTR) technologies, and I also reached out to other organizations that have worked or are working on projects similar to ours. I talked to the British Library’s digital research team, who graciously shared their documentation and practices, and I got to see how they managed to integrate HTR into their regular collection workflows. I talked to the researchers and developers behind Transkribus, the software platform we’re using to create HTR transcripts, and have also gotten a lot of valuable tips and information through the Transkribus Users group (I find it very charming that it’s on Facebook).

A very active Transkribus open-source community on Facebook

The manager in me thinks this outreach and advocacy work is important, not only to build relationships and get external validation and internal buy-in for our project, but also to help us generate ideas and take the time to think about how to standardize our processes so that they can be shared and reused by other organizations in the future. A report that has framed some of this thinking is David A. Smith and Ryan Cordell’s “A Research Agenda for Historical and Multilingual Optical Character Recognition”. In it they emphasize the need for standard models and for developing and distributing tools at all stages of OCR (and this can be applied to HTR), from image preprocessing to layout analysis to transcription and post-correction. HTR is going to be a growing research area in the next five to ten years, so it’s important for us to think about what needs to be documented, how we can develop and train people, and what would be interesting and insightful data to collect about the process.

More specific to our institution, I am also thinking about and planning how integrating collections as data will align with our department’s work and how it will fit into our technology infrastructure. Kevin mentioned in his post that we have been building our digital collections infrastructure, which is an integration of Archivematica (preservation services) + DuraCloud (storage) + Kaltura (media streaming) + Node.js frontend + Node.js backend admin.

University of Denver’s digital repository architecture. HTR transcripts will be indexed in Elasticsearch, data will be accessed through the repository frontend.
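As a rough illustration of the indexing step mentioned in the caption, here is a minimal Python sketch of how page-level transcripts could be packaged for Elasticsearch. It is not our actual implementation (our frontend and backend are Node.js); the index name `htr-transcripts`, the `TranscriptPage` fields, the object identifier, and the sample text are all hypothetical placeholders. The only thing taken as given is the newline-delimited shape of an Elasticsearch bulk request, where each action line is followed by the document itself.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TranscriptPage:
    """One HTR-transcribed page. All field names here are hypothetical."""
    object_id: str  # repository identifier of the parent digital object
    page: int       # page order within that object
    text: str       # plain-text transcript exported from the HTR tool


def to_bulk_ndjson(pages, index_name="htr-transcripts"):
    """Build a newline-delimited JSON body in the Elasticsearch bulk format:
    an action line, then the document source line, for each page."""
    lines = []
    for p in pages:
        # Hypothetical ID scheme: one document per page of an object.
        doc_id = f"{p.object_id}-p{p.page}"
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(asdict(p)))
    # A bulk request body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


pages = [
    TranscriptPage("example-object-001", 1, "placeholder transcript text for page one"),
    TranscriptPage("example-object-001", 2, "placeholder transcript text for page two"),
]
payload = to_bulk_ndjson(pages)
```

Indexing one document per page, rather than one per object, is just one possible choice; it would let a search hit point a reader at the exact page image, at the cost of more documents to manage.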

To me it’s a pretty standard design in terms of meeting the general use cases of digital repositories, where you have storage, retrieval, curatorial services, and a means to access the digital objects. The access part is similar to that of an exhibition or a traditional library catalogue, where you can search for, browse, view, and download objects. But when I first read the collections as data CFP, one paragraph stuck out for me:

“items in digital collections are generally treated as surrogates of physical objects and their organization and the expectations for their use are based on the metaphor of a physical bookshelf, gallery wall, or listening booth. A collections as data orientation suggests different approaches to collection production, documentation, and access.” (Reference: Grant narrative for Collections as Data: Part to Whole)

A lot of digital collection development focuses on mirroring physical library and collection viewing practices in a digital space. But one problem that P. Gabrielle Foreman and Labanya Mookerjee mention is that this approach “does not meet the needs of the researcher, the student, the journalist, and others who would like to leverage computational methods and tools to treat digital library collections as data”. Working on this project, we saw an opportunity to rethink how our infrastructure should be developed. Thinking about this more carefully, we realized it’s important to consider collections as data up front in our software development cycle, rather than trying to shoehorn this need into a traditional digital collections experience.

So, we’re taking the outputs of Kevin’s, Alice’s, and the Digital Collections Services department’s work, and I am now thinking through the architecture of how we can best store, use, and represent this data to meet the needs of the people who will end up using it. That need might be just browsing and viewing; there’s no denying there’s still a place for that and that people still research in that way. But we also see our faculty and classes doing work with this collection that can be better served through computational analysis. Stay tuned, I’ll talk about some of the design decisions we’ve made in my next post!