ESTHR Project: Grant Narrative

Evolutionary Subject Tagging in the Humanities

Enhancing the Humanities through Innovation

Recently, the Boston University Center for the Study of Asia asked the library how many books we hold in Asian Studies. We could not easily answer this question. Library of Congress Subject Headings (LCSH) are not designed to answer it, and time and workflow constraints typically limit the number of subject headings a cataloger can assign to any one object, regardless of how many subjects it may touch upon. This cataloging problem worsens each year as the volume of new publications continues to grow.

As scholars delve deeper into academic studies, knowledge in all its complexity refuses to be contained within any set of imposed boundaries, and we must continually discard or redraw them. We aim to thematize the problem of scale. We are requesting a Level I Digital Humanities Start-Up Grant to begin to approach the complex problem of creating a rule set for software that will parse humanities articles, map their text onto a set of controlled vocabularies, and automatically suggest subject headings for them.
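
To make the goal concrete, the following is a minimal sketch, in Python, of the kind of rule-driven mapping we have in mind. The vocabulary entries, patterns, and function names are illustrative placeholders rather than part of any chosen tool or finished design.

    import re
    from collections import Counter

    # Hypothetical rules: text patterns mapped to candidate LCSH-style headings.
    # A real rule set would be far larger and would be refined iteratively.
    RULES = {
        r"\b(buddhis[mt]|sutra|dharma)\b": "Buddhism",
        r"\b(kant|transcendental idealism)\b": "Kant, Immanuel, 1724-1804",
        r"\b(epistemolog\w*|theory of knowledge)\b": "Knowledge, Theory of",
    }

    def suggest_headings(article_text, min_hits=2):
        """Return candidate headings whose patterns recur in the article."""
        hits = Counter()
        lowered = article_text.lower()
        for pattern, heading in RULES.items():
            hits[heading] += len(re.findall(pattern, lowered))
        return [heading for heading, count in hits.items() if count >= min_hits]

A production rule set would of course need to handle weighting, disambiguation, and vocabulary maintenance; the point here is only the general shape of a mapping from scholarly prose to controlled terms.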

By tackling the problem of scale, we hope to enable humanities scholars to contextualize their research more easily and to cross disciplinary boundaries more fluidly in search of answers to their questions. We believe that the proper function of the 21st-century library is not only to provide materials to scholars, but also to make research more efficient. We also propose to create a venue that has been called for in numerous scholarly publications and at several professional gatherings: one where librarians, archivists, IT professionals, and humanities scholars can work together on a specific, and broadly applicable, project.

Library classification and subject analysis systems were developed in part to arrange physical objects (books) or their representatives (catalog cards) so that they could be discovered by browsing. Even if we could automate the cataloging process and thus eliminate the time and workflow constraints that limit the number of subject headings that can be applied, would adding more (or different) subject headings simply impose on knowledge an additional or different structure that carries with it the problems we have identified? We intend to explore this question by experimenting with ways of re-drawing boundaries dynamically.

Environmental Scan

Efforts to develop web-scale discovery tools for published literature, such as EBSCO Information Services’ Discovery, ExLibris’ Primo, OCLC’s WorldCat Local and MetaSearch, and SerialsSolutions’ Summon, do not address how metadata is created or the limits on how much of it can be created. These tools attempt to normalize metadata and make it searchable through a single search interface. We want to improve the metadata that such searches draw upon.

Computer-based tools for text analysis and metadata creation already exist. One such tool, Mallet, approaches a corpus of text computationally, performing topic modeling, sequence tagging, and classification. MIT’s Curators WorkBench adopts an “approximate cataloging” model: it attempts to address the scalability of metadata creation and maintenance by allowing curator-provided metadata to be assigned to groups of similar or related data. Perhaps most promising is MeSH Indexer Web Services (MIWS), a tool developed by the Johns Hopkins Medical School Library. MIWS ingests the full text of articles, analyzes key terms and the recurrence of key words against an existing grid of terms set up to map to a taxonomy (the National Library of Medicine’s Medical Subject Headings, or MeSH), and then pre-populates a catalog record for each article with appropriate MeSH headings.
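
Mallet itself is a Java toolkit, but the kind of output topic modeling produces can be illustrated with a short sketch. The example below uses the gensim Python library and a toy corpus purely for illustration; it is not meant to represent our eventual toolchain.

    from gensim import corpora, models

    # Toy stand-ins for full article texts; a real corpus would be much larger
    # and would be tokenized and stoplisted with more care.
    documents = [
        "kant critique reason transcendental idealism metaphysics",
        "buddhist meditation practice monastic ritual tibet",
        "phenomenology husserl intentionality consciousness perception",
        "zen buddhism koan practice enlightenment japan",
    ]
    texts = [doc.split() for doc in documents]

    dictionary = corpora.Dictionary(texts)                 # term -> integer id
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

    # Fit a two-topic LDA model; each topic is a weighted list of terms that a
    # cataloger or a later rule set might map onto a controlled vocabulary.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
    for topic_id, terms in lda.print_topics(num_words=5):
        print(topic_id, terms)

The topics themselves are not subject headings; part of the work we propose is precisely to determine how such statistical output could be mapped onto controlled vocabularies.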

Web-scale searching, computational text analysis, “approximate cataloging,” and automated indexing each hold promise for addressing aspects of the problem, but none fully addresses it. In addition, we continue to question whether traditional print models of classification and subject analysis serve humanistic researchers’ needs in working with digital content.

History and duration of the project

Over the past two years, Drs. Ammerman, Green, and Zafrin have met regularly to discuss issues relating to the reconception, indeed the re-creation, of the academic library in light of new developments in the digital humanities. Initial conversations about the impact of digital culture and content on collection development, and on how people seek and perceive information, led us to explore facilitating user engagement with digital content as a social object. How should libraries shape their collecting, services, and programming to address the specific needs of humanistic research? One result of these conversations was a jointly authored paper titled “Library as Agent of [Re]Contextualization,” presented at Digital Humanities 2009 and available on Digital Common.

How do we organize digital content? Traditional schemas for arranging information spatially, as physical items on library shelves, become less compelling, yet traditional models of classification and subject analysis endure. Libraries now collect and archive not only atoms but bits; how does this change their role and potential in today’s academe? From these conversations emerged two significant research questions:

  1. How is humanistic research different from scientific research? How should libraries shape their collecting, services, and programming to address these specific needs?
  2. How do we facilitate the retrieval of contextually relevant, if not directly relevant, information without imposing an arrangement of information that inhibits discovery?

To date, our exploration has enabled us to gain some clarity on the issues that need to be addressed. We also recognize that the principal investigators, though familiar with standard tools for humanities research and digital librarianship, do not have expertise in the computational analysis of massive amounts of data. Our natural inclination to approach such problems at a micro rather than a macro level often leads us to develop solutions that lack scalability. We believe that bringing librarians and humanities scholars into conversation with scholars whose expertise is the computational analysis of massive data sets will enable us to gain new perspective on the issues and to frame possible solutions that are in fact scalable.

Work plan

We intend to organize a consultation, inviting a group of librarians, computational analysis experts, and humanities scholars to meet with the principal investigators with the goal of gaining additional perspective on the problem and developing a scalable solution. We recognize that the very title of our proposal, “Evolutionary Subject Tagging in the Humanities,” already signals our preconception of how we would approach the issue – which is to say, iteratively. We will ask our consultants to help us formulate an approach to the problem of improving the discoverability of information from multi-disciplinary perspectives in the humanities.

We envision a two-day consultation meeting, beginning with lunch on the first day and ending with lunch on the second. During the meeting, the principal investigators will present the issues as they understand them and invite the consultants to bring their disciplinary expertise to expand and reframe our understanding of the problem of mapping scholarly prose in the humanities onto controlled vocabularies.

Following the meeting, we will look more closely at existing software to identify the tools most likely to address our needs. We will then apply for a Phase II grant to tag two test corpora of 100 articles each, drawn from the fields of theology, philosophy, and science. We plan to use the resulting data to create a parsing rule set for the software we choose to employ, and to improve that rule set through an iterative process of testing and modification.
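
As a rough indication of what that iterative process might look like, the sketch below compares automatically suggested headings for an article against the headings a cataloger assigned to it, scoring each revision of the rule set by precision and recall. The function and variable names are placeholders, not features of any particular tool.

    def precision_recall(suggested, assigned):
        """Score one article's suggested headings against cataloger-assigned ones."""
        suggested, assigned = set(suggested), set(assigned)
        true_positives = len(suggested & assigned)
        precision = true_positives / len(suggested) if suggested else 0.0
        recall = true_positives / len(assigned) if assigned else 0.0
        return precision, recall

    def evaluate(suggestions_by_article, gold_records):
        """Average scores over a test corpus; rerun after each rule-set revision."""
        scores = [precision_recall(suggestions_by_article[article_id], headings)
                  for article_id, headings in gold_records.items()]
        n = len(scores)
        return (sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n)

Under a scheme of this kind, a revision of the rule set would be retained only if it improved these scores on the test corpora.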

Staff

Jack Ammerman (Assoc. Univ. Librarian for Digital Initiatives and Open Access), Garth Green (Asst. Prof. of Philosophy of Religion), Daniel Benedetti (Bibliographer/Librarian) and Vika Zafrin (Digital Collections Librarian) will be the project team leaders.

Consultants from the Library of Congress (Janis Young, Senior Subject Policy Specialist; Daniel Chudnov, Librarian and Programmer, Office of Strategic Initiatives), Boston University faculty, and others yet to be identified will be brought in for the consultation meeting.

Final product and dissemination

We will publish a white paper detailing our methods and process, and theorizing a prototype of the software module that we intend to build in Phase II, should that application be funded. The paper will describe the problems encountered (both technological and those of communication across fields of expertise), how they were resolved, and what remains unresolved. We will disseminate the paper through mailing lists, direct email, and conference presentations, and invite feedback, which will inform our Phase II proposal.

A nontrivial product of our Phase I project will be the relationships built over the course of the consultation meeting. By the end of Phase I we will have identified a team of scholars, librarians, and technologists who will continue this work collaboratively in Phase II.