Prior to this week’s practicum, I was unfamiliar with topic modeling: “…a method of computational linguistics that attempts to find words that frequently appear together within a text and then group them into clusters.” But in the context of understanding algorithm basics, I knew that whatever I would be working with would generate more questions than answers. After all, as danah boyd made clear in a past presentation on “Data, Algorithms, Fairness, Accountability,” “there is nothing about doing data analysis that is neutral.” One thing that already seemed at odds about the process was using literature as the data input for a tool that some scholars suggest is better suited to analyzing material that is copious in volume but limited in topical scope, as illuminated by the example of a midwife’s daily diary from the late 1700s to early 1800s: “In many ways, it seems that Martha Ballard’s diary is ideally suited for this kind of analysis. Short, content-driven entries that usually touch upon a limited number of topics appear to produce remarkably cohesive and accurate topics.”
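To make that definition concrete, here is a minimal sketch of the kind of word-clustering a topic model performs, using Python’s gensim library rather than the specific tool from the practicum; the toy documents and parameter choices are my own illustrative assumptions, not anything from the assignment.

```python
# A minimal topic modeling sketch using gensim's LDA implementation.
# The documents below are toy stand-ins; real input would be whole texts.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    "revenge blood honor rome sword".split(),
    "love song evening streets yellow fog".split(),
    "rome honor sword battle revenge".split(),
]

dictionary = corpora.Dictionary(docs)               # maps each word to an id
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words counts

# Ask the model to group frequently co-occurring words into two topics.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)  # each topic is a weighted cluster of words
```

On a corpus this tiny the clusters come out unstable from run to run, which already hints at why a single play or poem may give the tool too little to work with.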
By contrast with that diary, in choosing texts that I knew well and that were readily available for this project, I focused on literary works from the worlds of theater and poetry: Shakespeare’s “Titus Andronicus” and T.S. Eliot’s “The Love Song of J. Alfred Prufrock.” To harness the full capability of a topic modeling tool, matching the tool to the job at hand is key, and I suspect topic modeling offers limited efficacy for the items I chose. Now, if I had had the time to enter the full canon of Shakespeare’s plays or T.S. Eliot’s work across his career, the power of the topic modeling tool might have been more awe-inspiring. On the other hand, given the earlier suggestion by scholars that a text with limited topics to begin with could yield more fruitful topic modeling results, I think the comprehensive range of issues a single Shakespeare play touches upon would render the tool impotent. Or perhaps this is simply my conceit as a human being feeling threatened by the potential power of a digital algorithm.
But in the process of entering my literary texts into the topic modeling tool, a few things stood out.
First, I really needed to format my text before entering it into the tool; otherwise, information that was not part of the text of interest would be mistakenly included in the topic modeling word list. For example, since the texts on Project Gutenberg from which I drew Shakespeare’s and Eliot’s work came with legally mandated language preceding the text of interest, I had to paste everything into a TextEdit window so I could erase the Project Gutenberg boilerplate, lest it be incorporated into the analysis.
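The same cleanup can also be scripted rather than done by hand. Here is a minimal sketch, assuming the standard “*** START OF …” / “*** END OF …” marker lines that frame Project Gutenberg texts (older files vary, so the markers may need adjusting), with a hypothetical file name standing in for the downloaded play:

```python
# Strip Project Gutenberg boilerplate so that only the text of interest
# reaches the topic modeling tool. Assumes the standard marker lines.
def strip_gutenberg_boilerplate(raw_text: str) -> str:
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1   # the work begins after this marker
        elif line.startswith("*** END OF"):
            end = i         # and ends just before this one
            break
    return "\n".join(lines[start:end]).strip()

# Hypothetical file name for the downloaded play.
with open("titus_andronicus.txt", encoding="utf-8") as f:
    play = strip_gutenberg_boilerplate(f.read())
```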
Secondly, my choice of texts made the output trickier to analyze: highly poetic dialogue by an author famous for coining words and phrases that didn’t exist in the English language before him (Shakespeare), and work by a modernist poet characterized by artistically inflected, deliberately unfinished phrases (Eliot). At times I could not tell what was computational-linguistic nonsense and what was part of the author’s artistic license. After all, it is generally those with mastery of their native tongue who demonstrate that mastery precisely by breaking grammatical rules. With a computer algorithm fully focused on finding rules in the rule-less moments, the result starts to look rather confusing in a topic modeling chart.
Finally (and following on the second concern above), what merits being called literary art can be subjective. Ever go to a modern art museum with a friend, where you both look at a work you aren’t sure is really art, one of you blurts out “my two-year-old could have made that,” and then you’re both staring ahead wondering who would overpay for such art and why you paid for tickets to this exhibition anyway? Something like that comes into play in my mind as I read through the words coming together in a new way. Did the algorithm really reveal something brilliant my human mind could not capture? Or did it just spit out a bunch of words at random, and, because my brain provides the contextual mortar that can hold those literary bricks together (from having read those pieces many times as full works), did I invent a building out of the computer’s haphazard throwing of bricks? How much credit should the algorithm get for doing anything? Can you tell I’m still a bit skeptical of this tool?
For now, as I read the topic lists I’ve generated from this project, I think I’ll not over-analyze them but enjoy them as a remix of the original texts, a tribute to literary classics with a digital melody line that makes them feel a bit newer.
Works Cited:
(1) boyd, danah. 2016. “Data, Algorithms, Fairness, Accountability.” Presentation to the U.S. Department of Commerce, Data Advisory Council, Washington, DC, October 28.
(2) Brett, Megan R. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities. Accessed February 24, 2021. http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/.
(3) Blevins, Cameron. “Topic Modeling Martha Ballard’s Diary.” Archived November 16, 2016. https://web.archive.org/web/20161116080309/http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/.