Tuesday, June 01, 2010
From the CHE: Text-Mining and Data Digging as “The Humanities Go Google”
Data-diggers are gunning to debunk old claims based on “anecdotal” evidence and answer once-impossible questions about the evolution of ideas, language, and culture. Critics, meanwhile, worry that these stat-happy quants take the human out of the humanities. Novels aren’t commodities like bags of flour, they warn. Cranking words from deeply specific texts like grist through a mill is a recipe for lousy research, they say—and a potential disaster for the profession. . . .
The idea that animates [Franco Moretti’s] vision for pushing the field forward is “distant reading.” Mr. Moretti and Mr. Jockers say scholars should step back from scrutinizing individual texts to probe whole systems by counting, mapping, and graphing novels.
And not just famous ones. New insights can be gleaned by shining a spotlight into the “cellars of culture” beneath the small portion of works that are typically studied, Mr. Moretti believes.
He has pointed out that the 19th-century British heyday of Dickens and Austen, for example, saw the publication of perhaps 20,000 or 30,000 novels, the huge majority of which are never studied.
The problem with this “great unread” is that no human can sift through it all. “It just puts out of work most of the tools that we have developed in, what, 150 years of literary theory and criticism,” Mr. Moretti says. “We have to replace them with something else.”
Something else, to him, means methods from linguistics and statistical analysis. His Stanford team takes the Hardys and the Austens, the Thackerays and the Trollopes, and tosses their masterpieces into a database that contains hundreds of lesser novels. Then they cast giant digital nets into that megapot of words, trawling around like intelligence agents hunting for patterns in the chatter of terrorists.
(read the whole story here)
I couldn’t decide what I wanted to say about this story when I first posted it. I’m still not sure, but it’s something about the way this project, and the reactions to it, highlight the difference between approaching literature as symptomatic and approaching it as aesthetic experience. Of those 20,000 or 30,000 unread 19thC novels, how many offer much for those interested in the latter approach, for instance? And yet it is pretty obviously imperfect to proceed towards large generalizations about culture and/or society on the basis of the handful of books (OK, dozens) that are really well known from the 19thC.
Has the filtering process selected for aesthetic value, do you think? It seems so, but there’s really too much unread to know for sure.
I agree that we can’t assume it has. But there’s no way to read through 30,000 volumes to check--assuming “text mining,” whatever else it does, can’t help you find the lost gems. What I had in mind more, though, was just the difference between slow reading and data digging as methods and what they suggest about the purpose of literary study--whichever texts we select.
Also, the comment about “dullards” in that article is a bit much.
Well, this is one kind of issue if you think of literary studies as REALLY DEEPLY AND TRULY one thing. If that’s the case, then we’ve got a rather nasty conflict over what that one thing is and how to study it.
But maybe it’s not ONE THING. Maybe there’s more than one intellectually valid way of looking at literature and its history. In that case, the data miners aren’t necessarily in conflict with the close readers.
What I’m wondering is something like this: If the canonical texts really are somehow “superior to” or “more central” than the rest, can we show this with statistical methods? I’m imagining that for any given corpus of texts and for any text within that corpus, we can calculate the, shall we say, projection of the text into the corpus. I would expect the projections of canonical texts to be of a different statistical character than the projections of non-canonical texts. Just what that difference should be, gimme a break, I’m making this up as I go along. But that’s what I’d be gunning for if I were doing this work.
I have a book entitled Algorithmic Aesthetics that was published in the late 70s, I think. People have been thinking about this (and using computational stylistics) for a long time. It’s just the size of the available corpus that has changed. (And probably what Moretti is doing is qualitatively different as well.)
The size of the corpus makes a BIG difference. It’s one thing to look for the “signature” of excellence text by text. What I have in mind is quite different. I want to look at the relationships that obtain among all the texts and see if some aren’t more “central” than others.
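One way to make “more central” concrete, offered only as a rough sketch and not as anything the commenters or Moretti’s team actually propose: represent each text as a bag of word counts, compare every pair of texts by cosine similarity, and score each text by its average similarity to all the others. The function names, the choice of cosine similarity, and the three-line toy corpus below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Turn a text into a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centrality(corpus):
    """Map each title to its mean cosine similarity with every other text."""
    vectors = {title: word_counts(text) for title, text in corpus.items()}
    scores = {}
    for title, vec in vectors.items():
        others = [cosine(vec, v) for t, v in vectors.items() if t != title]
        scores[title] = sum(others) / len(others) if others else 0.0
    return scores


# Hypothetical toy corpus: three invented one-line "texts", not real novels.
corpus = {
    "A": "the ship sailed across the sea",
    "B": "the ship and the sea and the storm",
    "C": "a quiet garden in the morning",
}

scores = centrality(corpus)
for title in sorted(scores, key=scores.get, reverse=True):
    print(title, round(scores[title], 3))
```

In this toy case the garden text shares the least vocabulary with the other two and so ranks last. Whether a score of this kind would separate canonical from non-canonical novels is exactly the open question raised above; the sketch only shows that the relationship among all the texts, and not a text-by-text “signature,” is what gets measured.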
We said quite a bit about this in the Moretti event linked to at the left. As usual, those bygone discussions are “dead” even though they’re right there.
The article strikes me as the typical kind of thing you get with science journalism. In humanities journalism, you most often get the mocking article. In science journalism, it’s the gosh-wow everything-is-becoming-different article, whether the science is really new or not. It’s not really the fault of the people interviewed, who have typically been subtly misquoted in any event. But it’s impossible for a science reporter to write an article that does not go gosh-wow, both because that’s the only way they get them published, and because reporters are generally the dimmest group of literate people, as a class, that I’ve ever encountered.
The various issues mentioned don’t seem like that large a deal to me, other than their ongoing thumbsucker value. People who study literary quality and aesthetics are going to study individual works. People who study mass trends in literature are going to study a database. It’s gradually going to get easier and easier to do the second, so there’s going to be a gradual push-and-shove over turf within the academic departments that study literature, but the two kinds of questions aren’t really that connected.
the Moretti event
My bad, Rich; thanks for pointing this connection out. I’d seen that graphic many times but the event predated my relationship with the Valve and it didn’t even come to mind when I saw this CHE story. There’s something interesting, perhaps, in this “news” story being sort of 4 years old (at least)--that event was in 2006. Of course, it may yet be new to other people now, as it was, basically, to me.
What Rich said. There’s no particular reason for you to have known or been curious about the Moretti event, Rohan.
And, as Jonathan Goodwin has noted, this kind of thing has been going on for a long time. There’s a journal called Computers and the Humanities—I assume it still exists, though I don’t really know—that’s been publishing at least since the mid-70s (when I coauthored a review-article on computational linguistics and the humanities), more likely since some time in the 60s. And Stanley Fish critiqued some of this work in some of his 70s essays on stylistics (collected in Is There a Text in This Class?). So, the person who wrote the article didn’t do their homework.
Wow. I mean, I wish Moretti and Jockers all the best, but the article gave me the sense that they really have no idea how they are going to use these databases and algorithms. The one actual research question they raise (how did the 19th century novel shift from an abstract teaching form to a concrete and dramatic form) seems already weird and misguided. And the way they hope to track it, by looking at the frequency of abstract words like “loyalty,” seems ignorant at best, moronic at worst.
Take two sentences:
“Sacrifice your life to him to whom you’ve tied your life.”
“I questioned Johnson’s loyalty. His honor was at stake.”
The former sentence is abstract and didactic. The latter sentence is dramatic and concrete.
Abstract diction vs. concrete diction is different than “saying versus showing,” which is what the researchers really seem to be addressing.
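The thought experiment can be run as a small sketch of the kind of word count being criticized. This is an illustration under stated assumptions, not the researchers’ actual method: only “loyalty” is named in the article as described here, and the other entries in the word list, along with the function names, are hypothetical.

```python
import re

# Hypothetical word list: only "loyalty" comes from the article as described;
# the other entries are assumptions added for illustration.
ABSTRACT_WORDS = {"loyalty", "honor", "duty", "virtue"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def abstract_rate(text, vocabulary=ABSTRACT_WORDS):
    """Return (hits, total_tokens, hits / total_tokens) for one text."""
    tokens = tokenize(text)
    hits = sum(1 for t in tokens if t in vocabulary)
    rate = hits / len(tokens) if tokens else 0.0
    return hits, len(tokens), rate


# The two sentences from the thought experiment above.
didactic = "Sacrifice your life to him to whom you've tied your life."
dramatic = "I questioned Johnson's loyalty. His honor was at stake."

for label, sentence in [("didactic", didactic), ("dramatic", dramatic)]:
    hits, total, rate = abstract_rate(sentence)
    print(label, hits, total, round(rate, 2))
```

The count finds no abstract words in the didactic sentence and two in the dramatic one, so a frequency measure of this kind would rank the two sentences the opposite way from a reader’s judgment.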
Ugh. These questions should be raised by historians and sociologists, not by literature scholars. The fact that they are *not* raised by sociologists and historians suggests that they aren’t terribly important.
Not all of it is the fault of the journalists—let’s not forget that Franco Moretti and Matt Jockers are themselves presenting this shift as a revolution, a radical departure from the way things have been done in the past.
It’s not, of course: literary historians (not to mention good old plain historians, who know that they are more likely to get usable data from less “interesting” literary texts) have been going beyond the “classics” for quite some time.
As so often, I’m with Bill Benzon on this one: what has changed is the set of tools; what remains is the choice between aesthetics-focused projects and mass-trend-focused projects.
Both are extremely valuable when well done, and I look forward to seeing some interesting results from the new databases. Here’s a really great one, indicating—for all the constructivists out there—that Romantic love is a literary universal. http://muse.jhu.edu/journals/philosophy_and_literature/v030/30.2gottschall.html
My concerns are only (1) the rhetoric of revolution, (2) the studies that return less than exciting results (“hey, titles got shorter!”), and (3) the occasional mishandling of evidence (as with the clues in Sherlock Holmes).
Yeah, the rhetoric of revolution’s got to go, even if there’s a revolution at hand. It’s been totally devalued by the previous generation of “revolutionaries.”
To Luther Blissett [whoever this person actually is]: you think Jockers and I are “ignorant at best, moronic at worst”? We’ve published plenty of work with our signature at the end: read it, and then say whatever you wish. But read our work, not someone else’s.
To Joshua Landy, who is a colleague, so to speak, at Stanford: you want to criticize what I have written on titles and other topics? Why didn’t you show up the many times I have discussed them in public at Stanford? Let me suggest a reason: because you could never get away with caricature [“hey, titles got shorter!”] and slander [mishandling of evidence].
Prof. Moretti—I did not label you or Prof. Jockers that way. I labeled the attempt to trace developments in the novel by doing word searches as ignorant and/or moronic. If the journalist misrepresented that research plan, I apologize. However, I think that you’ll need to do more than look at diction to prove large claims about if, how, and when the novel shifted in the way the journalist claims you claim it shifted.
Which is to say that a million half-pennies of evidence cannot prove an argument any better than fifty if that evidence is, by its very nature, not sufficient evidence. I presented a simplistic thought experiment to get at this problem above. Another: let’s say that from 1846 to 1881, the word “loyalty” was used in geometrically increasing numbers. Does that mean that the theme of loyalty suddenly became more important? I doubt it. Nor does it suggest that the novel before 1846 was somehow more dramatic and less abstract and didactic about loyalty.
Which is to say: any large claim of any import about literature will always require close reading. If that means one must closely read 30,000 novels to establish the validity of that claim, then so be it. But I wonder what sorts of claims about literature as literature require such amounts of data. It seems to me that the sorts of claims being investigated are in the historian’s or sociologist’s terrain. And if literature is merely a subset of history or sociology, then let’s simply disband the English departments and move the now-all-too-limited funding to those departments whose questions are so much more pressing.
In the end, though, I’d rather read a sentence of Dickens’ than one of any of his unfortunately forgotten English contemporaries. And I’m quite stubborn in my belief that explaining why or arguing if that is the case is the business of English departments (and teaching students how to write in such powerful ways is the other business of English departments).
Well, Luther, a lot depends on what claims you consider to be “large” and significant, doesn’t it? Consider the following article:
Monika Fludernik, The Diachronization of Narratology. Narrative, Vol 11, No. 3, October 2003.
After some preliminary throat-clearing, Fludernik gets around to a particular investigation, how the writer negotiates a scene shift from one location to another. She looks at 50 British narratives from late medieval to early 20th century. One of the things she finds is that, early on, the scene shift was explicitly signaled (now we’re moving from X to Y) and that such signalling disappeared over time and, with the novel, scene shifts correspond to chapter shifts and have no explicit signalling. It’s simply done. Here’s the opening of her summary statement:
As a consequence of changes in narrative structure, the scene shift that was an important functional element in Middle English narrative became downgraded to a supernumerary element occurring at chapter beginnings. The originally highly important scene shift was connected to the oral delivery of episodic narrative and depended on the overt manipulations of the narrator qua bard. The metanarrative quality of the scene shifts, although salient from the perspective of the twentieth-century novel, was far less salient in Middle English narrative, since such narrative included much more striking abstract and coda sections in which the narrator played a crucial role. In the development of prose narrative, the figure of the narrator, still prominent in the Renaissance and of special importance in Fielding’s work, increasingly lost his position of privilege. At the same time the structural patterns of drama acquired a hold over narrative production and started to influence scene shifts. Thus, in most novels the narrative moves from one scene to the other, rarely changing the setting within a chapter. In this manner the novel, despite its diegetic surface structure, comes to obey the deep-structural patterns of drama, in which scene changes occur either when a new setting is introduced or when a new set of protagonists appear on stage. The introduction of lengthy dialogue scenes and, later, of consciousness scenes in the novel underlines this development (Fludernik, “Natural” Narratology ch. 4).
Now, I regard that as large and interesting. Fludernik, of course, has done close reading of her 50 texts. But we don’t know whether or not her sample is biased. What I’d really like to do is check her results in all relevant texts. That might well be doable with current technology, though we’d need to get all the texts together in a single digital archive.
From my POV the single most important result would be the demonstration of a systematic change in scene shift over time. That would suggest that there is a direction to literary history, that literary history is not just a stack of more or less disconnected moments in time (see this recent post of mine on Time’s Arrow in Literary Space). If that’s the case, then the historicists have to move beyond encapsulating their texts in time-stamped cases of amber and figure out what’s “driving” that directionality. Why does literature change, and in a specific direction?
In defence of Luther, it’s worth noting that the interesting studies being done on the basis of digital corpora command our attention by virtue of the questions they pose. For example, the case Bill cites, about scene-changes, is compelling because scene-changes are interesting. (What a great topic, by the way!)
But these questions would simply be unimaginable were it not for the existence of close reading. All interesting cases of “distant reading” are, in fact, ultimately parasitic on traditions of attention to detail (narratological, formalist, new critical, or what-have-you). And so it’s a great shame that the recent re-advocacy of “distant reading” (which has always been a good idea) has been accompanied by a needless and counterproductive polemic against close reading. If we kill close reading, we will, in the long run, kill distant reading too.
I don’t see how you can do close reading without looking at a wider context. Or all you end up with is something rather distant from the text.
I hit this problem yesterday with a 17th century text by Walter Hamond, “A paradox Prooving that the inhabitants of the isle called Madagascar, or St. Laurence, (in temporall things) are the happiest people in the world.”
Louis Wright suggests that Hamond owes no particular debt to literary tradition; he may have picked up a few literary allusions, but Wright views the text as dependent on Hamond “turning his imagination loose”.
Hamond certainly has turned his imagination loose, playing with a number of well-worn themes that would have been highly familiar in the 17th century but not in the 20th. As it’s a work of propaganda, he gives a non-traditional spin to traditional themes in places to suit his agenda and to reassure an audience familiar with such themes, and his work is certainly in keeping with the classical tradition stemming from the Alexandrian school with regard to the nobility of primitive people.
The text is certainly original but it has a clear historical context and past life, which is downplayed in the article. Wright’s concern is to discuss it as an early example of the popular mid 18th century theme of man in a state of nature.
I don’t think you can draw historical horizons so tightly. It’s certainly a product of its time but it has a relationship with history.
Can you lit-crit types ever figure out the difference between a “topos” and a “trope”? Just sayin’.
Jim, I think your comment would have made more sense over on the tvtropes thread. And yeah, you’re right: a trope is any twist of the language and usually today synonymous with any figure of speech. A topos is a rhetorical commonplace: definition, cause and effect, before and after, process analysis, etc.