ESTHR at the DLF Fall Forum

November 19th, 2010

In early November, Jack Ammerman and Vika Zafrin attended the Digital Library Federation Fall Forum 2010 in Palo Alto, California. While there, we led a working session on the subject of our grant project: evolutionary subject tagging in humanities research. We are grateful for the feedback the session participants provided. Below is a summary of references and questions we’ll need to consider. If you have any input on these, please email us!

Extant tools:

  • Weka (“Data Mining and Open Source Machine Learning Software in Java”)
  • Bowker Data Profiler

Topics and questions to consider:

  • Interaction between social tagging and “official” cataloging
  • What’s the test/first corpus to tag?
    • How do we pick it while remaining as discipline-neutral as possible? (see latent semantic analysis below)
  • Seeing subjects along with examples of those subjects within a structure can enable users to learn a taxonomy as they use it.
  • How do disciplinary portals/sites describe, classify, categorize information?
  • Computational linguistics

Approaches to consider:

  • Reverse engineer bibliographies via citations?
  • Look at latent semantic analysis (see the sketch after this list)
    • Do we want to obtain relatedness of objects, or a thesaurus?
    • LSA, because it avoids relying on particular strings, might work for languages beyond English
    • Trusting mathematical models vs. trusting catalogers’ (or anyone else’s) point of view
    • A strategy that limits us to English and doesn’t scale may be the wrong road.
    • Relevance: exploring items related to other items, or along a taxonomy map?
    • We need to pull together conceptually related terms (the philosophical and theological senses of “soul” are different)
    • Wikipedia disambiguation model?
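
As a concrete reference point for the LSA bullets above, here is a minimal sketch of the “relatedness of objects” idea, assuming a Python/scikit-learn toolchain (one choice among many, not something settled at the session). The records and parameter values are invented for illustration: tf-idf vectors are reduced by truncated SVD, and relatedness is measured as cosine similarity in the latent space.

```python
# Minimal LSA sketch (records and parameters are stand-ins): tf-idf vectors
# reduced by truncated SVD, with cosine similarity as the relatedness measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy records; a real run would use catalog records or full texts.
records = [
    "the immortality of the soul in Platonic philosophy",
    "theological debates on the soul and bodily resurrection",
    "statistical methods for automatic text categorization",
]

# Term-document matrix weighted by tf-idf.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(records)

# Truncated SVD projects records into a small number of latent "concepts".
# n_components is a tuning choice; 2 only makes sense for this toy corpus.
lsa = TruncatedSVD(n_components=2, random_state=0)
reduced = lsa.fit_transform(tfidf)

# Pairwise relatedness of records in the latent space.
print(cosine_similarity(reduced))
```

The same decomposition also yields term vectors, so conceptually related terms could be grouped with the same similarity measure; whether that grouping amounts to a thesaurus or only to relatedness scores is exactly the question above. Note that a single vector per term string still conflates the philosophical and theological senses of “soul,” so disambiguation would need separate handling.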

Two models for how we might proceed: natural-language and mathematical.

We need to clearly articulate those models and what we see as the strengths and weaknesses of each.
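
To make that articulation more concrete, here is one possible (purely illustrative) reading of the two models in miniature, again assuming Python/scikit-learn: a “natural-language” pass that matches a record against curated subject term lists, next to a “mathematical” pass that scores the record by vector similarity to exemplar texts. Every lexicon entry and exemplar below is invented for the sketch and is not a proposal for the actual project.

```python
# Toy contrast between the two models (lexicons and exemplars are invented):
# a rule/lexicon pass versus a vector-similarity pass over the same record.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

record = "lectures on the immortality of the soul in Plato"

# "Natural-language" style: match against hand-maintained subject term lists.
lexicon = {
    "Philosophy": ["soul", "plato", "metaphysics"],
    "Theology": ["soul", "resurrection", "grace"],
}
rule_tags = [subject for subject, terms in lexicon.items()
             if any(term in record.lower() for term in terms)]

# "Mathematical" style: score the record against exemplar texts per subject.
exemplars = {
    "Philosophy": "Plato Aristotle ethics metaphysics soul",
    "Theology": "resurrection grace doctrine soul salvation",
}
vectorizer = TfidfVectorizer().fit(list(exemplars.values()) + [record])
record_vec = vectorizer.transform([record])
scores = {subject: cosine_similarity(record_vec,
                                     vectorizer.transform([text]))[0, 0]
          for subject, text in exemplars.items()}

print("lexicon tags:", rule_tags)
print("similarity scores:", scores)
```

The trade-off the sketch is meant to surface: the term lists carry a human point of view and are readable, but they tie us to particular strings (and to English); the similarity scores scale more easily but trust a mathematical model and offer no human-readable rationale.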

Engage consultants around that thinking: are we missing something here? Is there a hybrid of these two models? Should we imagine moving forward with both at the same time? Different target audiences? Two projects? (What if we could compare the results of both on the same corpus?) If we move ahead with either approach, who needs to be involved? What would be the barriers to each approach? Which target audience would be best served by which approach?
