Project News

Evolutionary Subject Tagging in the Humanities project publishes report

December 1st, 2011 in ESTH, news.

The project team for the Evolutionary Subject Tagging in the Humanities project published the culminating white paper for the project: Evolutionary Subject Tagging in the Humanities; Supporting Discovery and Examination in Digital Cultural Landscapes.

The research interest in “evolutionary subject tagging in humanities research” grew out of both an appreciation for the value of subject classification in organizing and discovering information, and frustration with the limitations we encounter in currently available systems. Even as subject terms highlight and focus attention on relationships between information objects, particularly within academic disciplines, they can hide and blur relationships when trying to bridge multiple disciplines in one’s research.

Driven by a desire to help humanities scholars more easily discover and examine information, the team’s early articulations of the problem led them to explore how we might fix it. Would additional subject tags improve discoverability? Would layering subject terms from multiple disciplines help? Is there a way to merge them? Is translation between disciplinary thesauri required? If more subject terms are required, could we develop a scalable (sustainable) model for providing them? Would any of the efforts to improve the discoverability of humanities texts actually facilitate enhanced examination of the texts?

Repeatedly throughout the project, team members found themselves challenging each other about very basic assumptions that underlie subject classification and the use of subject terms for discovery and examination of information objects. Those conflicting opinions form a creative tension out of which the project and this paper have emerged. They continue to engage the team and to shape the Libraries’ exploration of how to improve discovery and examination of texts for humanities research.

ESTHR at the DLF Fall Forum

November 19th, 2010 in ESTH.

In early November, Jack Ammerman and Vika Zafrin attended the Digital Library Federation Fall Forum 2010 in Palo Alto, California. While there, we led a working session on the subject of our grant project: evolutionary subject tagging in humanities research. We are grateful for the feedback the session participants provided. Below is a summary of references and questions we’ll need to consider. If you have any input on these, please email us!

Extant tools:

  • Weka (“Data Mining and Open Source Machine Learning Software in Java”)
  • Bowker Data Profiler

Topics and questions to consider:

  • Interaction between social tagging and “official” cataloging
  • What’s the test/first corpus to tag?
    • how to pick it while remaining as non-disciplinary as we can? (see latent semantic analysis below)
  • Seeing subjects along with examples of those subjects within a structure can enable users to learn a taxonomy as they use it.
  • How do disciplinary portals/sites describe, classify, categorize information?
  • Computational linguistics

Approaches to consider:

  • Reverse engineer bibliographies via citations?
  • Look at latent semantic analysis
    • Do we want to obtain relatedness of objects, or a thesaurus?
    • LSA, avoiding any strings, might work for languages beyond English
    • Trusting mathematical models vs. trusting catalogers’ (or anyone else’s) point of view
    • Strategy that limits us to English and isn’t scalable may be the wrong road.
    • Relevance: exploring items related to other items, or along a taxonomy map?
    • We need to pull together conceptually related terms while keeping homonyms apart (the philosophical “soul” and the theological “soul” are different)
    • Wikipedia disambiguation model?
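
To make the latent semantic analysis idea above concrete, here is a minimal sketch of how relatedness of objects could be computed. It is an illustration only, not the project's method: the toy corpus, the function name lsa_document_similarity, and the choice of k=2 dimensions are all assumptions made for the example, and it relies on NumPy. It builds a term-document count matrix, reduces it with a truncated singular value decomposition, and compares documents by cosine similarity in the reduced space.

```python
import numpy as np


def lsa_document_similarity(docs, k=2):
    """Cosine similarity between documents in a k-dimensional LSA space.

    Hypothetical helper for illustration; tokenization is a naive
    lowercase whitespace split, and raw counts are used without weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    row_of = {term: i for i, term in enumerate(vocab)}

    # Term-document matrix: rows are terms, columns are documents.
    counts = np.zeros((len(vocab), len(docs)))
    for j, tokens in enumerate(tokenized):
        for term in tokens:
            counts[row_of[term], j] += 1.0

    # Truncated SVD: keep only the k largest singular values.
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    doc_vectors = (s[:k, None] * vt[:k, :]).T  # shape: (n_docs, k)

    # Normalize to unit length so the dot product is the cosine.
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


# Made-up toy corpus: two "philosophy" texts and two "theology" texts.
# The word "soul" appears on both sides.
TOY_DOCS = [
    "soul immortality plato forms",
    "plato soul forms reason",
    "grace salvation soul church",
    "church grace salvation prayer",
]

SIMILARITY = lsa_document_similarity(TOY_DOCS, k=2)

if __name__ == "__main__":
    np.set_printoptions(precision=2, suppress=True)
    print(SIMILARITY)
```

In this toy setting the two philosophy texts land close together and far from the purely theological one, even though "soul" occurs in both groups. A sketch like this yields relatedness of objects rather than a thesaurus, which is one of the trade-offs raised in the list above. Because it only counts tokens, it makes no assumption about English beyond the whitespace split.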

Two models for how we might proceed: natural-language, and mathematical.

We need to clearly articulate those models, and what we see to be the strengths and weaknesses of each model.

Engage consultants around that thinking: are we missing something here? Is there a hybrid of these two models? Should we imagine moving forward with both at the same time? Different target audiences? Two projects? (What if we could compare the results of both on the same corpus?) If we move ahead with either of these two approaches, who needs to be involved? What would be the barriers to each approach? Which target audience would be best served by which of these approaches?

2010 Digital Humanities Start-up Grants Project Directors Meeting

October 13th, 2010 in ESTH.

Vika Zafrin and Jack Ammerman attended the 2010 Digital Humanities Start-Up Grants Project Directors Meeting on September 28. Below is the presentation that Vika made during the “Lightning Round” of project presentations. We were limited to three slides and two minutes:

MeSH Indexer Web Services

When we were setting up MIT's DSpace at Boston University, to serve as our institutional repository software, our colleagues at the BU medical campus showed us the MeSH Indexer Web Services application developed at Johns Hopkins. When a medical article is uploaded into DSpace, the Indexer parses its text and automatically suggests medical subject headings for it. This process is semantic: epidemiology might be suggested for an article on AIDS even if the word "epidemiology" is found nowhere in it.

It's Complicated

We thought that was brilliant. We thought, we should have something like that for the humanities. It could mean big and positive changes for the usually time-consuming library cataloging process. But doing this in the humanities is harder: word meanings are much more multivariate, more context-dependent. This needs collaboration among people with a wide range of expertise.

Evolutionary Subject Tagging in the Humanities

So: our grant project will bring together computing analysis specialists, librarians and humanities scholars. We'll work together to produce a white paper detailing how we think such software can be built. Then we'll gather a team that will apply for a Phase II grant to build it.


Twitter Hashtag: #SUG2010