Topic Modeling and Algorithms

I ran the complete Grimms' Fairy Tales through the TMT to generate topic models. I tried it with various numbers of words per topic. I thought it was interesting because it created categories for the kinds of stories being told.

Topic modeling of "The Complete Grimms Brothers" with 5 words. Many of the words are "rquote" or other meaningless terms

Running it with 5 words per topic exposed a problem: because I had not tokenized my text, many of the topics included words that were not useful. However, Topic 4 is clearly stories about children, Topic 8 is about princess stories, and Topic 10 is probably about the “animal” type of stories.
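The fix for this is preprocessing. Since I used the TMT rather than writing any code, the following is only a hypothetical sketch of what tokenizing the text and filtering out junk terms like “rquote” might look like in Python with the gensim library; the file name grimms.txt, the one-story-per-line format, and the extra stopwords are all my own assumptions.

```python
# Hypothetical preprocessing + topic modeling sketch with gensim
# (not the TMT's actual pipeline; file name and junk tokens are placeholders).
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

with open("grimms.txt", encoding="utf-8") as f:
    stories = [line for line in f if line.strip()]  # assume one story per line

# Tokenize, lowercase, and drop common stopwords plus formatting debris
junk = STOPWORDS | {"rquote", "lquote"}
texts = [[w for w in simple_preprocess(story) if w not in junk] for story in stories]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train an LDA model and print the top 5 words of each topic
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary,
                      passes=10, random_state=1)
for topic_id, words in lda.show_topics(num_topics=10, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```

With the stopwords and the “rquote”-style tokens filtered out ahead of time, even the 5-word topics would presumably come back more readable.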

Topic modeling of "Grimms' Fairy Tales" with 10 words per topic

Running it with 10 words per topic removed some of the problems of using a non-tokenized text. For example, Topic 1 now looks to be about people who “set off” from home to find castles, Topic 2 about adventure stories, Topic 7 about princess stories, and so on.

Topic modeling of "Grimms' Fairy Tales" with 25 words per topic

But expanding the number of words is not always useful; 25 words per topic was too many. The tool began putting words like “snowdrop” and “dog” together, probably because the lower-ranked words in a topic are only loosely tied to its theme.
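In code, at least, the number of words shown per topic is just a display setting on the same trained model, so comparing 5, 10, and 25 words is cheap. Continuing the hypothetical gensim sketch from above:

```python
# Print the same ten topics at different word counts to compare readability
for n in (5, 10, 25):
    print(f"--- top {n} words per topic ---")
    for topic_id, words in lda.show_topics(num_topics=10, num_words=n, formatted=False):
        print(topic_id, ", ".join(w for w, _ in words))
```

The topics themselves do not change; only how far down each topic's ranked word list you look.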

I then ran Paradise Lost, which I have only read parts of, because I was interested to see whether I could parse the topics when I knew the text only vaguely. Sure enough, I could see pretty quickly that Topic 1 corresponded to “Paradise” and Topic 2 to “Lost.” Topic 5 also spoke to me as being about Love. If I am going to use this tool on texts that I have not closely read, I’m glad I can still make some sense of them.

Topic modeling of "Paradise Lost" with 10 words

I used the “Talk to Transformer” tool to see what it came up with for “Digital Humanities.” It gave me definitions, quotes from academic articles, or a breakdown of how students did on a quiz based on their major.

Text generated by "Talk to Transformer", which includes the text "European (i.e. non British) students score better in several subjects than “British” students"
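As far as I know, Talk to Transformer was a web front end for OpenAI’s GPT-2 model. For anyone curious, here is a minimal sketch of generating comparable text locally with the Hugging Face transformers library; this is my own approximation, not the site’s actual setup.

```python
# Rough GPT-2 text generation sketch with Hugging Face transformers
# (an approximation of a Talk-to-Transformer-style prompt, not its real code).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Digital Humanities", max_length=60, num_return_sequences=1)
print(result[0]["generated_text"])
```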

I don’t think this software will be writing blog posts any time soon. However, it is scary how readily it surfaces information from the internet. Several of my classmates found that the Transformer tool could produce information about them and their families from just a little input.

One could argue that the tools have to be trained on something. If information is available to train these algorithms, is it really a problem if some of that information is a little sensitive? I say yes, absolutely, because the main reason we are developing algorithms to handle large amounts of data is that companies are collecting large amounts of data. They don’t always know what to do with it, but by collecting it they create the demand for software that can process it.

I have never been able to opt out of this process. My classes began using Facebook as a method of communication when I was a senior in high school and did not have a Facebook account. This put me at a disadvantage with my peers, and I ended up having to go online more, on dial-up because Wi-Fi was expensive and my family couldn’t spare the money, just to keep up. I have to use my Gmail account to sign into things all the time, which signs me up for spam emails for the rest of my life. I have had to write letters of recommendation for friends in which their prospective bosses asked all sorts of personal questions that had nothing to do with their ability to do the job. A friend had to take a personality test as part of a school application.

Does this data collecting help anyone? It certainly hurts the underprivileged, who can’t afford Wi-Fi or who have difficult family situations, as Danah Boyd points out. You can’t not be online; no job will hire you without a cell phone number to call. Currently, the choices are social isolation and giving up privacy. As the last year has shown us, social isolation is not a choice anyone can make. We need to be protected from these businesses that gather data without permission. If that means we can only train AI on Paradise Lost, so be it.

 

Works Cited

Blevins, Cameron. “Topic Modeling Martha Ballard’s Diary.” Wayback Machine, 16 Nov. 2016, https://web.archive.org/web/20161116080309/http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/.

Boyd, Danah. “Data, Algorithms, Fairness, Accountability.” http://www.danah.org/papers/talks/2016/CDAC.html. Accessed 23 Feb. 2021.

Brett, Megan R. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities, http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/. Accessed 23 Feb. 2021.

Grimm, Jacob, and Wilhelm Grimm. Grimms’ Fairy Tales. Translated by Edgar Taylor and Marian Edwardes, Project Gutenberg, https://www.gutenberg.org/files/2591/2591-h/2591-h.htm. Accessed 26 Feb. 2021.

Milton, John. Paradise Lost. 1991. Project Gutenberg, https://www.gutenberg.org/ebooks/20.

Remy, Emma, et al. “How Does a Computer ‘See’ Gender?” Pew Research Center, https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/. Accessed 23 Feb. 2021.