Thursday, January 12, 2006
Poetry, Patterns, and Provocation: The nora Project
What follows is a brief introduction to nora, an ongoing and experimental project in literature and computation. My thanks to The Valve for the opportunity to make our work part of the conversation here. While not directly inspired by Moretti’s writing in Graphs, Maps, Trees, nora exhibits many of the same priorities: an emphasis on quantitative method, large-scale data analysis, visualization, abstract modeling, cooperation and collaboration. These are methods foreign to many in the humanities, as are our actual technologies which run the gamut from XML and Java to a toolkit developed by the Automated Learning Group at the National Center for Supercomputing Applications. Yet nora (which, depending on who you ask on the project team, originated as either an acronym for no one remembers acronyms or a character in the William Gibson novel Pattern Recognition--though we’ve since located other noras), is also about provocation, ambiguity, and ultimately, interpretation--in short, still the stuff most of us would identify as central to academic literary studies.
First, here’s the official version of what we’re doing:
The goal of the nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries. In search-and-retrieval, we bring specific queries to collections of text and get back (more or less useful) answers to those queries; by contrast, the goal of data-mining (including text-mining) is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends. Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions, are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make those collections significantly more useful, more informative, and more rewarding for research and teaching.

nora is currently in the second year of two years of funding from the Andrew W. Mellon Foundation. The Principal Investigator is John Unsworth, formerly director of the Institute for Advanced Technology in the Humanities at the University of Virginia, now Dean of the Graduate School of Library and Information Science at the University of Illinois, Urbana-Champaign. Participating researchers are also based at the Universities of Georgia, Maryland, Virginia, and Alberta.
Now for the unofficial version:
All of this is very, very hard. For starters, none of the technical architecture for what we wanted to do was in place when we started. We were able to leverage several existing platforms and technologies but other pieces had to be built from scratch. At present, nora is held together with chewing gum and duct tape, a loose tissue of resources and standards (datastores, text mining engine, visualization toolkit and end-user interface) coupled with the more-than-occasional late night email or IM session when something isn’t working. A significant part of our efforts to date have been devoted to stabilizing this architecture, and we’re most of the way there at this point.
But we’ve also been spending our time trying to figure out what technologies like text mining are good for in humanities research, particularly literary studies. Were we in a social sciences discipline that routinely contends with large amounts of data, or even perhaps a humanities discipline like history, we would not have to work quite so hard. Literary scholars, however--here the force of Moretti’s arguments makes itself felt--traditionally do not contend with very large amounts of data in their research. A significant component of our work is therefore basic research in the most literal sense: what kinds of questions do we seek to answer in literary studies and how can data mining help, or--more interestingly--what new kinds of questions can data mining provoke? (As a sidebar, we tried to answer that first one inductively. You can look here for a list of verbs occurring in critical essays on 18th- and 19th-century British and American literature from journals in Project Muse that never or rarely occur in the American National Corpus, which is newspaper writing for the most part. In short, a portrait of a profession.) But the barriers to entry are non-trivial. To engage in data mining on its own turf demands fluency in terms like naïve Bayesian analysis, cosine similarity matrices, features, vectors, dendrograms, decision trees, and neural networks. On the one hand, we don’t want to black-box this stuff; on the other, we don’t want to build a system so intimidating that one needs an advanced degree in information science just to approach it.
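To take just one of those terms: cosine similarity treats each document as a vector of word counts and measures the angle between two such vectors. Here is a minimal sketch in Python (the example texts and the bag-of-words simplification are my own for illustration; nora’s actual pipeline is considerably more elaborate):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("the pearl the diver", "the diver sought the pearl"))
```

Two documents using the same words in the same proportions score 1.0 regardless of length; documents sharing no vocabulary at all score 0.0.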
Data mining and visualization are traditionally perceived as problem-solving technologies. The canonical instance is Don Swanson’s early use of text mining in bio-medical literature to identify a possible link between magnesium deficiency and migraine headaches. Swanson found patterns of association in widely disparate areas of the published literature to make the initial connection, subsequently confirmed through a great deal of more traditional medical testing. He called his findings “undiscovered public knowledge”--it was all there, already in the journals, but no one had put the pieces together because no human reader would likely ever have been in a position to synthesize all of the relevant articles. But we don’t typically set out to “solve” problems in the humanities. We’re not trying to find the causes of migraines. We’re not trying to “solve” the problem of Emily Dickinson so that we can move on to the even more urgent problem of Walt Whitman. So what does data mining have to offer literary interpretation?
To start, we’re interested in provocation, anomaly, and outlier results as much or more than in what we think the system actually gets right. In one early proof of concept for nora, Steve Ramsay and Bei Yu attempted to classify Shakespeare’s plays according to the traditional categories. The data mining got most of them right. That’s not what was interesting though, at least not to Steve. What was interesting was that the data mining thought Othello might be a comedy. Interesting not because we’re assigning any undue authority to the machine’s determinations, but because the question became what was it about Othello that made it different from the other tragedies? Why did this dumb brute force machine “read” it as a comedy? As it happens, Steve subsequently stumbled across a strain of scholarship on the play that makes exactly that argument.
This initial experiment led us down a path that produced the following kinds of questions, as recently articulated by John Unsworth: What patterns would be of interest to literary scholars? Can we distinguish between patterns that are, for example, characteristic of the English language, and those that are characteristic of a particular author, work, topic, or time? Can we extract patterns that are based in things like plot, or syntax? Or can we just find patterns of words? When is a correlation meaningful, and when is it coincidental? What does it mean to be “coincidental”? How do we train software to focus on the features that are of interest to researchers, and can that training interface be usable for people who don’t like numbers and do like to read? Can we structure an interface that is sufficiently generalized that it can accommodate interest in many different kinds of features, without knowing in advance what they will be? What are meaningful visualizations, and how do we allow them to instruct their users on their use, while provoking an appropriate suspicion of what they appear to convey? How would we evaluate the effectiveness of our visualizations, or the software in general? Is it succeeding if it surprises us with its results, or if it doesn’t? How can we make visualizations function as interfaces, in an iterative process that allows the user to explore and tinker? And how in the hell can we do all this in real time on the web, when a modest subset of our collection, like the novels of a single author, contains millions of datapoints, all of which need to be sifted for these patterns?
Our team at Maryland includes my colleague Martha Nell Smith in the English department, a long-time Emily Dickinson scholar. In order to focus our efforts I urged Martha and the rest of the team here to concentrate on the question of erotic language in Dickinson, certainly a well-turned question in the scholarship. We began with a corpus of about 200 XML-encoded letters comprising correspondence between the poet Emily Dickinson and Susan Huntington (Gilbert) Dickinson, her sister-in-law (married to her brother William Austin). The demo we produced, which is available for you to mess with here, requires a user to first rank an initial set of documents with which to train the automatic classifier. This is done on a scale of 1 to 5, a process we call hot or not for short. The process is not unlike the Netflix interface that asks you to evaluate your favorite movies and then finds others that the recommender system thinks also might be to your liking. This initial training set is then delivered to the data mining engine, which proceeds to iterate over the rest of the document set and return its initial predictions. Users can see which individual words the data mining seized upon as potential indicators of the erotic. The method here, by the way, is known as naïve Bayes. Bayesian probability deals with degrees of belief rather than frequencies of repeatable events: not the long-run rate at which a coin lands heads or tails, but how strongly one believes it might land on its side; hence it is also known as subjective probability. Our Bayesian classification is “naïve” because it deliberately does not consider relationships and dependencies between words we might instinctively think go together--“kiss” and “lips,” for example. The algorithm merely establishes the presence or absence of one or more words, and takes their presence or absence into account when assigning a probability value to the overall text.
(“Yeah, naïve, you got that part right,” I hear some of you saying. But this is the kind of thing computers are very good at, and naïve Bayes has been proven surprisingly reliable in a number of different text classification domains.) Right now the demo is hard-coded to the Dickinson corpus, but it will be general before we are through with nora.
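For readers who want to see the mechanics, here is a minimal naïve Bayes classifier along the lines just described. The training texts and their hot/not labels are invented for illustration, and the real nora engine runs on the NCSA toolkit mentioned earlier, not on code like this:

```python
from collections import defaultdict
from math import log

def train(labeled_docs):
    """Count words per class; labeled_docs is a list of (text, label)."""
    word_counts = defaultdict(lambda: defaultdict(int))
    class_counts = defaultdict(int)
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class with the highest log-probability, with Laplace
    smoothing. 'Naive' because every word is treated as independent
    of every other word in the document."""
    total_docs = sum(class_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = log(class_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            score += log((word_counts[label].get(word, 0) + 1) /
                         (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [("kiss lips dear", "hot"), ("pearl mine deep", "hot"),
            ("weather garden plain", "not"), ("letter news town", "not")]
model = train(training)
print(classify("mine pearl", *model))  # prints "hot" on this toy data
```

Note that the classifier never asks what “mine” means; it only tallies which labeled documents the word tends to appear in, which is exactly why its verdicts are provocations rather than readings.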
As I hope should be clear, by far the least interesting aspect of this (to us) would be the machine’s definitive conclusion as to whether Emily is hot or not (we think the answer to that is rather obvious). No, we’re interested in the data mining’s capacity for provocation. Here, for example, is Smith on some early results, when the word “mine” ranked high on the list of words the data mining thought might be hot:
The minute I saw it, I had one of those “I knew that” moments. Besides possessiveness, “mine” connotes delving deep, plumbing, penetrating--all things we associate with the erotic at one point or another. And Emily Dickinson was, by her own accounting and metaphor, a diver who relished going for the pearls. So “mine” should have been identified as a “likely hot” word, but has not been, oddly enough, in the extensive literature on Dickinson’s desires. Same goes for “write”--oh to leave a piece of oneself with, for, the beloved. To “write” is to present oneself, or a piece of oneself, physically—and noting that the data mining was picking up both “write” when recorded by Dickinson and “write” in the [XML] header [where it would indicate a letter] led the three of us to a “can we teach a computer to recognize tone” discussion. I wonder, remembering Dickinson’s “A pen has so many inflections and a voice but one,” what the human machine can do, what the human machine does (recognizing, identifying tone), what we think we’re doing when we’re so damned sure of ourselves. So the data mining has made me plumb much more deeply into little four- and five-letter words, the function of which I thought I was already sure, and has also enabled me to expand and deepen some critical connections I’ve been making for the last 20 years.
We’re currently in the midst of a second, larger experiment on reading sentimental novels from the nineteenth century.
There’s more to say, including what students might get out of a process like this. Maybe I can address some of that in the comments. But just one more point here: Louis Menand, in the current issue of Profession, decries what he calls the “Captain Kangaroo” model of interdisciplinarity that pervades the humanities: putting a psychologist on the podium with a Freudian literary critic for a conference session, for example. When the full nora project team meets, we have literary scholars, computer scientists, and information specialists around the table. I have a Ph.D. in English, but at Maryland, in addition to Martha Nell Smith and Tanya Clement in the English department I work with Catherine Plaisant and James Rose in the Human Computer Interaction Lab and Greg Lord at the Maryland Institute for Technology in the Humanities (MITH). Steve Ramsay at the University of Georgia is a gifted programmer as well as an Assistant Professor of English and his graduate student Sara Steger is both a serious scholar of the 19th century novel and a serious hacker. Similar teams exist at Virginia and Alberta. At Illinois, we’re joined by personnel from the Graduate School of Library and Information Science (usually ranked as the finest library school in the country) and the National Center for Supercomputing Applications (NCSA). This collaboration is not always easy or idyllic. As Stanley Fish is said to have said somewhere, being interdisciplinary is hard. But the collaboration and mutual respect are real, and they are indispensable to getting things done. No one person has all of the expertise and knowledge at hand that a project like nora demands. I dare say that the graduate students working with us on the project, both from English and computer/information science, are exposed to a very different model of scholarly production, and different work habits, from what would typically be the case in their disciplines.
Okay, this really is the last thing I want to say: none of us see this as a messianic enterprise. We’re not out to “save” the humanities. Graphs, maps, and trees may not be for everyone, nor is data mining I’m sure. But by the same token, none of what I have been describing here is extracurricular research for me or any of the other humanities scholars on the project team. It counts toward the research portion of my annual distribution of effort for one thing, and the publications and results go onto my vita. Moreover, I’m a member of the MLA (dues all paid up) and work from nora has already been presented there, to interested and receptive audiences. Let’s all be a little careful with the generalizations and assumptions that have been flying around some of the previous entries. Cheers again to The Valve for getting this book event together.
There are 98 Google hits for “hamlet as a comedy” and 2 Google hits for “othello as a comedy”. Adjusting for the Google-indicated 3.41-times-greater popularity of Hamlet, we can estimate that Hamlet is 14.3 times funnier than Othello. Or something.
Which is just to say that if nora indicated Hamlet was a comedy, nobody would be too surprised. All the incentives are for the user of nora to interpret its output as indicating some surprising new truth. My spam filter uses Bayesian probabilities too, but sometimes it just makes mistakes.
>All the incentives are for the user of nora to interpret its output as indicating some surprising new truth.
Hope it wasn’t anything that I wrote that gave you that impression, Joe. It may be that 9 times out of 10 a user simply shrugs and dismisses nora’s results. That’s fine. But every so often, nora does have the capacity to startle.
The lesson of Swanson and his migraines should not be lost here: his particular medical insight was only confirmed through a great deal of more traditional, empirical testing. In like manner, we wouldn’t want or encourage anyone to take some result out of nora and go forth and publish it without the appropriate reading, thinking, and skepticism that characterize literary criticism. The system’s results are a starting point, not a deliverable. This approach is typical of data mining in many fields, not just our corner of literary studies.
Which leads me to also stress the iterative nature of the process. A session with nora isn’t a one-shot deal. You play with the system, you adjust weights and rankings, you resubmit the sample set and you see where the process takes you. Unlike the approaches favored by Moretti, however, this always involves extensive close reading of the texts. So in some ways we hold a very different set of critical values.
Data dredging is a potential problem.
With medical or other scientific databases where data mining is preliminary to actual tests, this isn’t such a big problem. Even data mining for business purposes gives you feedback. If you are wrong, you can lose a lot of money. When data mining is just preliminary to coming up with a reasonable-sounding story, this is a problem. Fiddling with the weights makes it worse.
Smith’s reaction to the word “mine” is an example of what I am talking about.
>The minute I saw it, I had one of those “I knew that” moments. Besides possessiveness, “mine” connotes delving deep, plumbing, penetrating--all things we associate with the erotic at one point or another.
I just don’t buy it.
I do agree that nora could act as an aid to come up with new interpretive ideas.
I was struck by
We will label a subset of the chapters (the training set) with a score indicating a level of sentimentalism, and then see how text mining classifies the remaining chapters from those novels. (Texts: Charlotte, Uncle Tom’s Cabin, Incidents in the Life of a Slave Girl.)
I don’t mean to be picky, but aren’t both Incidents and Narrative, even allowing for the presence of composite characters and fictionalizations, more non-fiction than novel? Which isn’t to say there isn’t a lot of potential here: I’d be interested to know how often the, not to mention which kinds of, patterns of sentimental novels were adopted in contemporaneous non-fiction writing.
In this experiment there will be a strong focus on gaining insights on sentimentalism and these novels. (Texts: Moby-Dick, The Scarlet Letter, The Blithedale Romance, Irving’s Sketchbook, Narrative of the Life of Frederick Douglass.)
eb, you’re right of course. That was just careless prose. We’re interested in sentiment primarily but not *exclusively* in novels.
>I just don’t buy it.
Well, we’re not charging any licensing fees so you don’t have to. ;-)
>I do agree that nora could act as an aid to come up with new interpretive ideas.
That’s all we’ve ever really claimed.
I believe that the four-color theorem was proved mechanically, and, that in practice, its validity was also assessed mechanically, as it’s too complex for any human mathematician to follow. I wonder about the ability of data mining and algorithmic hypothesizing to uncover hidden relationships in the great unread corpus, relationships so hidden that they may be beyond our ability to understand.
I’ve experimented with MOOs in teaching literature classes before and am fascinated with the potential for automatic creation of the story world through data mining. You could interact with an environment in which scenery, objects, and characters have been blended from a wide sample of a chosen genre or historical period.
This is most interesting and encouraging.
In the near term, one obvious use for this kind of technology in Moretti’s program would be to search out genealogies of the kind he investigated in the “Trees” chapter. A bit more ambitiously, you could use it to investigate the kinds of genre cycles he uncovered in the “Graphs” chapter. Here the big job would be to digitize and tag all those second and third tier texts. I have no idea about the kind of computational horsepower that would be needed to do that kind of classification work on relatively large numbers of whole texts—thousands and tens of thousands of texts as opposed, for example, to the 30-odd dramatic texts by Shakespeare. But I imagine that would be less of a problem than getting all those texts prepared.
The problem would be to justify that effort. Is getting a fine-grained appreciation for the evolution of middle-class European mentality through several centuries sufficient justification? I don’t know.
But there is something else that intrigues me, and that is unpredictable in its effects. I’m interested in the models that underlie this technology. The nora website has a link to a supplementary site for a book entitled Geometry and Meaning:
That is a fairly technical account of the semantic models used in natural language processing. These are more sophisticated than anything routinely used in literary criticism which, for all its sense of complexity and subtlety of meanings, has little in the way of explicit models of semantic structure. Back in the days before structuralism was eclipsed by post-structuralism there was some interest in such models, but that disappeared quickly. Umberto Eco, for example, adopted an early AI model by Ross Quillian in his 1979 A Theory of Semiotics (pp. 125-129), but never did much with it, nor did he or anyone else follow up on such models. [Not to mention that Quillian’s model was a decade old by the time Eco wrote that book.] Marie-Laure Ryan looks at such models in her 1991 Possible Worlds, Artificial Intelligence, and Narrative Theory and I believe that, more recently, David Herman has been looking at some of them as well. And some of that knowledge representation work is lurking behind the scenes in the conceptual metaphor theory and conceptual blending work in cognitive linguistics. So far this is just here and there, nothing really systematic.
I can’t help thinking that, as people use nora-type software, they’re going to become interested in what’s going on “under the hood,” as it were. And so they’ll not only learn about these models, but actually learn and begin to use the models themselves. That will change how the professions think about “meaning.” Meaning will seem less and less like an endlessly self-reflecting funhouse of mirrors and mystery. It will gradually seem tractable and intelligible.
We aren’t there yet, nor am I willing to hazard any predictions about when we’ll get there, much less how things will look then and there. But it seems to me that this particular genie is out of the bottle.
And something else: all over the web we have piles of text generated by fans of this or that TV show, movie, graphic novel, etc. As I’ve mentioned elsewhere on this site* I’ve followed one particular fan community for *Buffy the Vampire Slayer* (and, by extension, all things Joss). That particular community has maintained a complete record of its interaction since it started in the Table Talk area of Salon. [The older stuff is in the form of downloadable files.] One could mine that record for patterns of discourse in discussing BTVS and thereby put some empirical teeth into the abstract notion of an interpretive community.
For example, the conversation switches back and forth between discussing the people and characters in BTVS as though they were real and discussing the show as a staged event. In the former mode, people wonder about motivation and intention and back story and so forth. In the latter mode they comment on how well or poorly an actor did, how plausible the plot, whether or not this is one of Joss’s or Marti’s better or worse episodes, and so forth. Is there any order to this ebb and flow between these two modes of discourse? That’s one sort of thing you could look at.
You could also look at how certain words and phrases spread through the community. At some point “kissage” appeared in one of the episodes. The word was quickly picked up by the fans. How much of this has been going on?
How are newbies integrated into the community?
And so forth?
*From Frye to the Buffisstas, with a glance at hermeneutics along the way: http://tinyurl.com/8lwz4
Organizing, characterizing, and postulating. Text mining offers organization assistance through classification (which can extend through forecasting and fill-in-the-blank type exercises like identifying authors of unattributed texts). With decision tree classification in particular one can even develop a methodology for classifying texts (not truly classification in the library science sense but rather more precisely called categorization). Characterization of texts can be performed en masse through clustering; it’s a great way to divine, if you will, new means for describing large collections of texts. Think of it as a credible & supported way to create new ways of grouping and characterizing literature. Classification and categorization can feed each other. Postulating, the power to generate relationships, is an exercise in generating new sentences. Not new knowledge per se but rather information for further investigation.
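To make the clustering point concrete, here is a toy sketch: group documents by shared vocabulary, with no labels supplied in advance. (The documents, the Jaccard-overlap rule, and the greedy single pass are all invented simplifications; real text-mining toolkits use far more sophisticated algorithms.)

```python
def cluster(docs, threshold=0.25):
    """Greedy single-pass clustering: put each document into the first
    cluster whose seed document shares enough vocabulary with it,
    measured by Jaccard overlap of word sets."""
    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)
    clusters = []  # each cluster is a list of docs; the first is its seed
    for doc in docs:
        for c in clusters:
            if jaccard(doc, c[0]) >= threshold:
                c.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

docs = ["whale sea ship", "sea whale harpoon",
        "letter garden rose", "rose garden gate"]
print(len(cluster(docs)))  # prints 2: a nautical group and a domestic group
```

The groupings emerge from the texts themselves, which is the sense in which clustering can suggest new ways of characterizing a collection.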
The true and rather unrealized power of text mining is its ability to generate new information. New, as in unusual and hopefully of high quality, and information, as in data in context. That text mining’s power remains largely unrealized is due in part to the lack of clarity with which text mining bellwethers have described it to others.
Swanson’s “undiscovered public knowledge” is a misleading phrase. That which is disclosed by Swanson’s method (and Swanson & Smalheiser’s application Arrowsmith) is neither hidden nor known even after it is “disclosed.” What is *generated* is simply a relationship, an association between terms supported by documents. The term relationship paired with supporting documents in turn supports the user of the application in formulating a credible hypothesis, a narrative if you will, of how those terms came to be associated. The quality of the product rests with the user, the rarity or novelty of the association, as well as the value of the papers that support it all.
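The generative move can be sketched in a few lines: find bridge terms B that co-occur with both A and C in a literature where A and C themselves never co-occur. (The mini-corpus below is invented for illustration; Arrowsmith itself is of course vastly more sophisticated.)

```python
from collections import defaultdict
from itertools import combinations

def co_occurrence(corpus):
    """Map each term to the set of terms it appears with in any document."""
    links = defaultdict(set)
    for doc in corpus:
        for a, b in combinations(set(doc.lower().split()), 2):
            links[a].add(b)
            links[b].add(a)
    return links

def undiscovered_links(links, a, c):
    """Bridge terms B such that A-B and B-C co-occur while A-C never do:
    candidate 'undiscovered public knowledge' in Swanson's sense."""
    if c in links[a]:
        return set()  # A and C are already directly connected
    return links[a] & links[c]

corpus = ["migraine spreading depression",
          "spreading depression magnesium",
          "magnesium deficiency diet"]
print(undiscovered_links(co_occurrence(corpus), "migraine", "magnesium"))
```

The output is just an association plus its supporting documents; turning it into a credible hypothesis remains, exactly as above, the user’s job.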
The problem with the application of text mining to literary studies may be nothing more than a reflection of the state of literary study itself. As you say there are no problems to be solved per se, but then, if that’s the case (which I don’t believe for a second) then why study it, right? There are many reasons for doing so of course. Any domain in which a search engine can help text mining can improve on that assistance. Bravo on nora.
There is one particular area that could easily benefit literary analysis, linguistics, and even basic text mining research itself. That is the construction of a *temporal* semantic ontology. Think of it as putting the OED together with WordNet. In fact I don’t think text mining could be used ambitiously for any diachronic humanities projects without such information. Literary analysis always requires some grounding in the shifts in meanings of words.
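As a sketch of what such a resource might look like as a data structure: word, date range, attested senses. (The entries and date ranges below are invented placeholders, not OED or WordNet data; a real resource would merge dated senses with a synset structure.)

```python
# A toy temporal sense inventory: word -> period -> attested glosses.
temporal_senses = {
    "mine": {
        "1550-1700": ["possessive pronoun", "excavation for ore"],
        "1800-1900": ["possessive pronoun", "excavation for ore",
                      "explosive device"],
    },
}

def senses(word, year):
    """Return the glosses attested for a word in the period covering year."""
    for period, glosses in temporal_senses.get(word, {}).items():
        start, end = map(int, period.split("-"))
        if start <= year <= end:
            return glosses
    return []

print(senses("mine", 1862))  # senses in play for a Dickinson-era text
```

A diachronic project would consult such a table before trusting any pattern built on a word whose meaning has drifted.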
Ultimately this boils down to the suggestion that text mining offers a powerful enhancement of creativity and productivity in any language-related domain.