DH09 Tuesday, session 3: Use Cases Driving the Tool Development in the MONK Project

in DigiLib BLog
June 23rd, 2009

MONK Project is “a digital environment designed to help humanities scholars discover and analyze patterns in the texts they study. The MONK project has been generously supported by the Andrew W. Mellon Foundation, from 2007-2009. All code produced by the project is open source. MONK has a publicly available instance with texts contributed by Indiana University, the University of North Carolina at Chapel Hill, the University of Virginia, and Martin Mueller at Northwestern University.” So now you have context.

Today’s presentation was guided by case studies. The cases were diverse but representative of questions that MONK seeks to address. First up is Tanya Clement presenting “The Story of One or, Rereading The Making of Americans by Gertrude Stein, Part Two.”

The pre(o)mise of MONK: not-reading a million books.

Why is The Making of Americans (finished 1911, published 1925) not read? Mostly because it’s very difficult to read. The text progressively dismantles both the story and the plot. For 900 pages. Tanya says: a new perspective afforded by digital tools reveals a structure in the text that positions it as a Modernist text, not a postmodern one.

MoA has been studied via frequent pattern analysis, repetition analysis, and word usage. The pattern of repetitions from all nine chapters of MoA: first half of the text has longer, less frequent phrase repetitions, and second half has much shorter, jagged chunks of repetition. (The jagged chunks are there throughout the text, but in the first half you’re paying too much attention to the plot to notice.)
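To make "repetition analysis" a little more concrete, here is a minimal Python sketch that counts word sequences occurring more than once in a text. It is only an illustration of the general idea, not the tooling Tanya used; the function name and the sample sentence are invented (the sentence is not a quotation from Stein).

```python
from collections import Counter

def repeated_phrases(tokens, n):
    """Return every n-word phrase that occurs more than once, with its count."""
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: count for gram, count in grams.items() if count > 1}

# Invented sample text, used only to exercise the function.
sample = "any one is one being living any one is one".split()
for phrase, count in repeated_phrases(sample, 3).items():
    print(" ".join(phrase), count)
```

Running this per chapter with different values of n would give a rough picture of where long repetitions give way to short ones.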

So how do we make meaning out of the jagged chunks?

WordHoard is a part of the MONK development phase in terms of text analysis tools. Using WordHoard, Tanya measured word usage in the text, and compared word frequencies in MoA to other works by other authors (Austen, Dickens, etc.) using Dunning’s log-likelihood ratio. One word consistently appears more in MoA than in all the other works, and is also one of the most often-used words within MoA itself: “one.” Why? Partly because it’s a “schizophrenic” word with “many meaning-making possibilities.”

There follows a demo of PosViz, a MONK-created text analysis project that allows for rich visualization of text analysis. Tanya walks us through the multiple but not indeterminate meanings of the word “one,” through which the story of “one” becomes the story of everyone. And since Stein’s work is about identity formation, that tells us something about the work.

[vz: By the way, Tanya’s paper was much richer and more nuanced than I’m managing to capture here. If you seek out one fascinating text-analytical paper, make it hers.]

Sara Steger’s presentation, “More Than a Feeling: Patterns in Sentimentality in Victorian Literature,” is being presented by… Sara, with audio overlaid onto her presentation! She’s at home now, expecting a baby very soon, and all the best to them. After the talk, she took questions via Skype. Awesome.

Sara’s testbed for studying sentimentality in the Victorian age is 80 novels (or 3,921 chapters). She created a training set of 409 chapters: 186 of them she classified as sentimental, and 223 as unsentimental. Using WordHoard, she created lexicons for sentimental vs. unsentimental chapters; sorted them by part of speech; and compared the resulting top words. Sentimental nouns: Mother, Heart, Child, Boy, Home, God, Arm, Love, Thought. Trend towards words that reflect intimate connections. Unsentimental nouns: Lady, Miss, Friend, Gentleman, Woman, People, Morning, Name, Course, Lord. So, not community but hierarchy.

Sentimental adjectives: full, dead, happy, wild, low, strange, dark, deep, quiet, kind. Both happiness and sorrow are equally affective. Unsentimental adjectives: certain, small, half, high, whole, real, large, best, short, pretty. Quantitative, concrete words. Sentimentality seeks the mysterious.

Sentimental verbs: cry, love, lie, bear, open. Unsentimental verbs: want, like, talk, mean, suppose. The sentimental doesn’t seek a way out of uncertainties. The unsentimental uses less specialized, more common, less nuanced words (talk as opposed to utter, cry out, etc.)

Dunning’s log-likelihood ratio: compares how many times a word actually appears with how many times we would expect it to appear, given a comparison corpus. Words over-represented in sentimental texts, with their log-likelihood ratios: she (1773), I (384), mother (377), child (310), heart (243), love (210), tear (158), sorrow (150), etc. Sentimentality is unusually concerned with the personal, and with interpersonal relationships and the domestic sphere.
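For the curious, here is a minimal Python sketch of that observed-vs-expected comparison. It uses the two-corpus form of the log-likelihood statistic that is common in corpus linguistics, which may differ in detail from what WordHoard computes. The function name and all the counts are invented for illustration; they are not Sara's data.

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Log-likelihood score for one word across two corpora.

    count_a / count_b: occurrences of the word in corpus A / corpus B.
    total_a / total_b: total number of words in corpus A / corpus B.
    """
    combined = count_a + count_b
    # Expected counts if the word were spread evenly across both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2.0 * score

def is_overrepresented(count_a, total_a, count_b, total_b):
    """True if the word is relatively more frequent in corpus A."""
    return count_a / total_a > count_b / total_b

# Invented counts: a word seen 20 times in 100 words vs. 5 times in 100 words.
print(log_likelihood(20, 100, 5, 100), is_overrepresented(20, 100, 5, 100))
```

The score itself only says how surprising the difference is; the separate direction check is what distinguishes over-represented words from under-represented ones.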

Words under-represented in sentimental texts (and hey, this is nigh unto impossible to discover without digital tools!): mr., duke, gentleman, lady, bishop, archdeacon, hound… Words having to do with titles and hierarchies and hunting and business and politics.

OK, so then Sara started in on machine classification. She used machine-learning classifiers (naive Bayes and, it seems, a decision tree) to classify chapters for sentimentality. Starting with the 3,921 chapters, 943 were classified as sentimental. Some were obvious, like all of Dickens. But the process also pointed Sara to works she didn’t know before.
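As a rough idea of what naive Bayes classification looks like, here is a tiny self-contained Python sketch: train on hand-labelled chapters, then label unseen ones by word evidence. This is a generic multinomial naive Bayes with add-one smoothing, not MONK's implementation; the function names and the toy "chapters" (built from words in the lists above) are made up for illustration.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs from a hand-labelled set."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log-probability for these tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0).
                likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set, invented for this sketch.
training = [
    (["mother", "heart", "child", "tear"], "sentimental"),
    (["love", "sorrow", "heart", "home"], "sentimental"),
    (["mr", "duke", "gentleman", "lady"], "unsentimental"),
    (["bishop", "hound", "lady", "course"], "unsentimental"),
]
model = train(training)
print(classify(["mother", "tear", "love"], model))
```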

Women were underrepresented in the test bed (22% of the chapters), but of the 943 sentimental chapters, 507 (54%) were written by women.

The sentimentality project used an early prototype of MONK’s “backend” tools, and helped shape them too. Demonstrated the need to be able to import worksets; demonstrated the need to have email notification when analysis is complete. In that, it was a great case study for MONK.

Kirsten Carol Uszkalo talks about “The Devil and Mother Shipton: Serendipitous Associations and the MONK Project,” all about using the MONK workbench to trace the role of familiars in early modern witchcraft trials, mostly (all?) in England. The word cloud she showed us first reveals that, even in witch trials, it’s always all about God.

Pattern finding across a corpus is important because it draws on working memory and on fluid (as opposed to crystallized) intelligence, and that’s where creativity happens.

Text analysis is useful if it assists us in understanding and interpreting a work.

Richard Head claims to have discovered a 15th-century manuscript of Mother Shipton’s prophecies, but the representation of the sexy, well-dressed gentleman devil was a late 17th-century phenomenon. (The sexy, well-dressed, moneyed-guy image is a power thing.)

Kirsten’s worksets are full texts, and full tracts: trials and condemnations, mostly. In the workbench, you can import [ginormous!] texts and then define chunks that you’re testing. Kirsten flagged words and phrases that define what makes a witch, according to these texts. She then asked the workbench analytical tools to flag other texts for the density of what-makes-a-witch descriptions. (What if instead of a keyword you could search by example? “Give me something like this.”)
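One very simple way to picture "flag other texts by density" is to score each text by the share of its words that appear on the flagged list. The Python sketch below does just that. It is an assumption-laden stand-in, not how the MONK workbench actually scores texts: the function names, the flagged terms, and the two sample tracts are all invented, and it handles single words only, not phrases.

```python
import re

def flagged_density(text, flagged_terms):
    """Fraction of the words in `text` that appear in `flagged_terms`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in flagged_terms)
    return hits / len(tokens)

def rank_by_density(texts, flagged_terms):
    """Return (density, title) pairs, densest text first."""
    scored = [
        (flagged_density(body, flagged_terms), title)
        for title, body in texts.items()
    ]
    return sorted(scored, reverse=True)

# Invented flagged terms and sample texts, for illustration only.
flagged = {"familiar", "devil", "imp"}
tracts = {
    "tract_a": "The devil sent her a familiar in the shape of an imp",
    "tract_b": "The sermon was long and the weather was fair",
}
for density, title in rank_by_density(tracts, flagged):
    print(title, density)
```

A text that scores high without containing the obvious keywords is exactly the kind of surprise a search-by-example tool can surface.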

Some of the associations that came up were ones Kirsten wouldn’t have been able to predict. What popped up if not a text about Tannakin Skinker, the “hog-faced woman” from Holland! She showed up as an example of sex with the Devil. But there *was* no sex with the devil in this tract! What’s up with that?

Tannakin Skinker actually shares numerous traits with Richard Head’s Mother Shipton. Published in the same year, both texts include implied or clearly articulated themes of witchcraft, the presence of a Devil, wealth as a lure, a setting in a costly location, and hog-faced children.

The economics of courtship, the idea of a malefic ugliness that could’ve only come from the Devil, were themes in both of the texts.

In conclusion: MONK tools can be used for pattern discovery.
