Monday, June 6, 2011

Data Mining Literature

Franco Moretti is data mining literature. It is not evolution and literature, to be sure, but it is an interesting development all the same. The approach I proposed in my dissertation falls somewhere between data mining and evolutionary approaches. It involves data mining a work for theme words that have a fractal distribution, whose existence I predicted from the consistent appearance and evolution of fractal patterns in nature.


  1. Do you feel this possible fractal pattern owes more to the recurrent function of plot or to the structure of human memory? It sounds like both must be considered: if the formal elements of plot require familiar patterns of repetition and summary throughout a novel, for instance, then this would account for a patterned distribution of theme words within a single work. But we also need to acknowledge the existence of authorial "pet terms." A timely example is George R. R. Martin's use of the eccentric "mistrust," which has great thematic importance throughout his (as yet incomplete) Song of Ice and Fire. In some ways he likely uses the word to age his language, but in other ways the question of how much, and whether, one can trust other people is at the heart of his tale.

    Sorry if this is a bit tortuous -- just riffing as I read the linked article/blog.

  2. Some excellent questions and points. One thing I have thought about is the nature of the patterns. There is some evidence that regular word distributions have what is known as a bios pattern, an extremely complex pattern discovered by Hector Sabelli. It is related to, but more complex than, fractal/chaotic patterns. Against that background, the simpler (relative to bios) fractal pattern of a theme word may draw (unconscious?) attention to itself. Imagine a rectangle somewhere in a picture of a forest. What would your eye be drawn to? In the case of literature, it is your consciousness that is drawn. The pattern is probably fractal because, in novels and epics especially, it is built from the "bottom up," so to speak -- there is a strong element of self-organization (vs. a lyrical poem, for example, which has a great deal of top-down structure) -- and thus one would expect more complex patterns. The themes are attractors that draw the theme words toward them, creating the patterns. How much of that is the recurrent function of the plot (which is not unrelated to the fact that language itself has recurrent structure) and how much is an element of memory (which functions according to the laws of self-organization and strange attractors as well) would of course be very much worth investigating. My suspicion is that plot structure emerges out of the structure of language, which is related to how action is structured, which is related to how memories form. All of this is separate from the issue of a purposefully chosen word -- except insofar as one can use the analysis to check whether a word the author chose consciously was also chosen unconsciously. For example, I think that Thomas Hardy knew he was writing about friendship and not love in Jude the Obscure.
However, I have an unpublished novella that I thought was about love. When I submitted it to the same process, I discovered that "love" was randomly distributed, but that "friend" had a pattern structured much like the one found in Jude the Obscure (and I wrote my novella before I read Jude, so there was not even an unconscious influence). It turns out my novella is about friendship, not love. And when I did the analysis, I realized that that was exactly what the novella was really about.
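
For anyone who wants to try the "random vs. patterned" comparison on a text of their own, here is a minimal Python sketch. It is not the procedure from the dissertation, and it does not test for a fractal or bios pattern. As a stand-in, it measures how uneven the gaps between occurrences of a word are (the coefficient of variation of those gaps) and compares that with what random placement of the same number of occurrences would give. The function names, the tokenizer, and the choice of statistic are all assumptions made for illustration.

```python
import random
import re
import statistics


def word_positions(text, word):
    """Return (token indices where `word` occurs, total token count)."""
    # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    target = word.lower()
    hits = [i for i, tok in enumerate(tokens) if tok == target]
    return hits, len(tokens)


def gap_cv(positions):
    """Coefficient of variation of the gaps between successive occurrences.

    0 means perfectly even spacing; larger values mean the word clusters
    in some stretches and vanishes in others.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three occurrences of the word")
    return statistics.pstdev(gaps) / statistics.mean(gaps)


def random_baseline(n_tokens, n_hits, trials=200, seed=0):
    """Average gap_cv when n_hits occurrences are placed at random."""
    rng = random.Random(seed)
    cvs = []
    for _ in range(trials):
        positions = sorted(rng.sample(range(n_tokens), n_hits))
        cvs.append(gap_cv(positions))
    return statistics.mean(cvs)
```

To use it, read a plain-text file into a string, call `word_positions(text, "friend")`, and compare `gap_cv(hits)` with `random_baseline(n_tokens, len(hits))`. A value close to the baseline is what a randomly scattered word would produce; a value well above it suggests clustering that deserves a closer look, though it says nothing by itself about whether the pattern is fractal.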