Topic Modeling and Algorithms

I ran the complete Grimms’ Fairy Tales through the TMT to generate topic models, trying several different numbers of words per topic. I thought it was interesting because it created categories for the kinds of stories being told.

Topic modeling of "The Complete Grimms Brothers" with 5 words. Many of the words are "rquote" or other meaningless terms

Running it with 5 words created a problem: because I had not tokenized my text, many of the topics included words that were not useful. However, topic 4 is clearly about children, topic 8 about princesses, and topic 10 probably about the “animal” type of stories.
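A minimal sketch of what tokenizing the text before topic modeling might look like in Python. This is not how the TMT works internally; the junk list and stopword list are assumptions made up for illustration (only “rquote” comes from the output above), and a real run would need much longer lists.

```python
import re

# Assumed junk tokens: "rquote" appeared in the topics above; "lquote" is a guess.
JUNK = {"rquote", "lquote"}
# A tiny, illustrative stopword list; a real one would be far longer.
STOPWORDS = {"the", "and", "a", "to", "of", "in", "was", "he", "she", "it"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop junk and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in JUNK and w not in STOPWORDS]

# Made-up sample sentence with the kind of residue seen in the 5-word topics.
sample = "rquote The princess went to the castle, rquote and the king was glad."
tokens = tokenize(sample)
print(tokens)
```

Keeping only runs of letters is a blunt choice: it also splits contractions such as “don’t” into two pieces, so a more careful tokenizer would treat apostrophes separately.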

Topic modeling of "The Complete Grimms Brothers" with 10 words

Running with 10 words removed some of the problems of using non-tokenized texts. For example, now topic 1 looks to be about people who “set off” from home to find castles, topic 2 about adventure stories, topic 7 about princess stories, and so on.

Topic modeling of "The Complete Grimms Brothers" with 25 words

But expanding the number of words is not always useful; using 25 words was too many. It began putting words like “snowdrop” and “dog” together, which might be because of too many words per topic.

I then ran Paradise Lost, which I have only read parts of, because I was interested to see whether I could parse the topics of a text I knew only vaguely. Sure enough, I could see the correlation of topic 1 with Paradise and topic 2 with Lost pretty quickly. Topic 5 also spoke to me as being about love. If I am to use this tool on texts that I have not close read, I’m glad I can still parse the text somewhat.

Topic modeling of "Paradise Lost" with 10 words

I used the “Talk to Transformer” tool to see what it came up with for “Digital Humanities”. It gave me either definitions, quotes from academic articles, or a breakdown of how students did on a quiz based on their major.

Text generated by "Talk to Transformer", which includes the text "European (i.e. non British) students score better in several subjects than “British” students"

I don’t think this software will be writing blog posts any time soon. However, it is scary that information on the internet is so readily available. Several of my classmates found the Transformer tool could find information about them and their families with just a little input.

One could argue that the tools have to be trained on something. If information is available to be used to train these algorithms, is it really a problem if some of that information is a little sensitive? I say yes, absolutely, because the main reason we are developing algorithms to deal with large amounts of data is that companies are collecting large amounts of data. They don’t always know what to do with it, but by collecting it they create a demand for software that can handle huge amounts of data.

I have never been able to opt out of this process. My classes began using Facebook as a method of communication when I was a senior in high school, and I did not have a Facebook account. This put me at a disadvantage with my peers, and I ended up having to go online more (using dial-up, because Wi-Fi is expensive and my family couldn’t spare the money) just to keep up. I have to use my Gmail account to sign into things all the time, which signs me up for spam emails for the rest of my life. I have had to write letters of recommendation for my friends in which their prospective bosses ask all sorts of personal questions that have nothing to do with their ability to do a job. A friend had to take a personality test as part of the school application process.

Does this data collecting help anyone? It certainly hurts the underprivileged, who can’t afford Wi-Fi, or who have difficult family situations, as Danah Boyd points out. You can’t not be online: no job will hire you without a cell phone number to call. Currently, the choices are social isolation or giving up privacy. As the last year has shown us, social isolation is not a choice anyone can make. We need to be protected from these businesses that gather data without permission. If that means we can only train AI on “Paradise Lost,” so be it.

 

Works Cited

Blevins, Cameron. “Topic Modeling Martha Ballard’s Diary.” Wayback Machine, 16 Nov. 2016, https://web.archive.org/web/20161116080309/http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/.

Boyd, Danah. Data, Algorithms, Fairness, Accountability. http://www.danah.org/papers/talks/2016/CDAC.html. Accessed 23 Feb. 2021.

Brett, Megan R. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities, http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/. Accessed 23 Feb. 2021.

Grimm, Jacob, and Wilhelm Grimm. Fairy Tales, by The Brothers Grimm. Translated by Edgar Taylor and Marian Edwardes, https://www.gutenberg.org/files/2591/2591-h/2591-h.htm. Accessed 26 Feb. 2021.

Milton, John. Paradise Lost. 1991. Project Gutenberg, https://www.gutenberg.org/ebooks/20.

Remy, Emma, et al. “How Does a Computer ‘see’ Gender?” Pew Research Center, https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/. Accessed 23 Feb. 2021.

Using “Voyant” to analyze texts

When I was writing my undergraduate thesis, I often came across advice about how to look at sources similar to that given in “Tooling Up for Digital Humanities”:

“Scholars once accustomed to studying a handful of letters or a couple hundred diary entries are now faced with massive amounts of data that cannot possibly be analyzed in traditional ways.”

When working on my thesis, I only had a handful of letters, because I was working with the very limited sources left behind by Catholic women in Elizabethan England. However, the fact remains that we now live in a time when there is more information available than ever before. Thankfully, we also have more ways of analyzing this information than ever before. One of these is Voyant, a tool that analyzes texts to show which words are most common and where in the text they occur.

In examining some texts using Voyant, I realized that there was truth to another statement in “Tooling Up for Digital Humanities,” that simply looking at the words in a vacuum can lead to a simplistic understanding of the text. Having a close understanding of these texts helped me to avoid some simplistic assumptions, but in unfamiliar texts, such close reading would be more difficult. For example, I ran an analysis on Wuthering Heights and found that the words “Catherine” and “Cathy” both appeared frequently. Knowing the book well, I knew that these referred to two characters; without this familiarity, I might not have understood why the two terms appeared as they did.

Use of the words “Catherine” and “Cathy” in the book Wuthering Heights. It is not immediately clear whether they refer to one or two characters

I also found some problems with the text entries from Project Gutenberg. I had to remove the beginning and end of the texts, which explain what the text is and what restrictions are placed on it, or else Voyant mistakenly believed “Gutenberg” to be a popular word in every text analyzed. I also had to check whether there was an introduction; including the introduction to Edgar Allan Poe’s poems created a very different chart than the one that used only Poe’s own words. If I were to analyze a poem to see whether it was written by Poe, checking for the same common words, such as “heart,” “like,” “love,” and “night,” would be a good first step. This would be skewed if I included words written by other authors.
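The header and footer removal described above can be sketched in Python. This assumes the file contains the “*** START OF THE PROJECT GUTENBERG EBOOK …” and “*** END OF THE PROJECT GUTENBERG EBOOK …” marker lines; the exact wording varies between files, so the patterns below are a best guess rather than a complete rule. The function name strip_gutenberg and the sample text are made up for illustration. Editorial introductions, like the one in the Poe collection, sit inside the markers and still have to be removed by hand.

```python
import re

# Marker lines as they commonly appear; older files say "THIS" instead of "THE".
START = re.compile(r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg(text):
    """Return only the text between the START and END marker lines.

    If a marker is missing, fall back to the beginning or end of the file.
    """
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# Made-up miniature file in the shape of a Project Gutenberg download.
sample = (
    "The Project Gutenberg eBook of Poems\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK POEMS ***\n"
    "Once upon a midnight dreary\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK POEMS ***\n"
    "License text mentioning Project Gutenberg\n"
)
body = strip_gutenberg(sample)
print(body)
```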

If the preface is included, the word “Poe” is one of the most common – at least, in the beginning of the book

With the preface removed, the most common words become Poe’s own. Even here, though, the prominence of the word “poem” means that I probably left an afterword at the end that I would have to remove if I wanted to truly examine this document.

I next ran an analysis on Jane Eyre. I was most interested in the word “little,” because my favorite line in this book is “Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you, — and full as much heart!” Using the “Word Tree” tool, I could even see a little bit of the context. However, because the punctuation had not been scrubbed, there were some obvious problems; I am not interested in “little” being followed by a semicolon, but rather in what word came next. If I were to run a proper analysis, I would need to tokenize the text.
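A rough sketch of how tokenizing would fix the semicolon problem, written in Python rather than in Voyant. The function name words_after is made up, and the sample combines the line quoted above with an invented second sentence, so the counts say nothing about the real novel.

```python
import re
from collections import Counter

def words_after(text, target):
    """Count the words that immediately follow `target`, ignoring punctuation."""
    # Keep only letters and apostrophes, so "little;" and "little," both become "little".
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)

# The quoted line plus a made-up sentence for illustration.
sample = (
    "Do you think, because I am poor, obscure, plain, and little, "
    "I am soulless and heartless? A little bird; a little bird."
)
following = words_after(sample, "little")
print(following.most_common())
```

Dropping punctuation entirely means the count also pairs words across commas and sentence breaks, as in “little, I” here. Whether that is acceptable depends on the question being asked of the text.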

The word “Little” in Voyant’s “Word Tree” tool for Jane Eyre

Using Voyant was very interesting, and revealed some of the advantages and disadvantages of text analysis tools.

 

Works Cited

Brontë, Charlotte, and F. H. (Frederick Henry) Townsend. Jane Eyre: An Autobiography. 1998. Project Gutenberg, https://www.gutenberg.org/ebooks/1260.
Brontë, Emily. Wuthering Heights. 1996. Project Gutenberg, https://www.gutenberg.org/ebooks/768.
“Optical Character Recognition.” Wikipedia, 28 Jan. 2021. Wikipedia, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=1003359641.
Poe, Edgar Allan. The Complete Poetical Works of Edgar Allan Poe Including Essays on Poetry. Edited by John Henry Ingram, 2003. Project Gutenberg, https://www.gutenberg.org/ebooks/10031.
Text Analysis » Tooling Up for Digital Humanities. 2 Mar. 2017, https://web.archive.org/web/20170302102313/http://toolingup.stanford.edu/?page_id=981.
“When OCR Goes Bad: Google’s Ngram Viewer & The F-Word.” Search Engine Land, 19 Dec. 2010, https://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181.

Week Two Practicum: Zotero

Barkawi, Tarak. Globalization and War. Rowman & Littlefield Publishers, 2005.

Dudziak, Mary L. War Time: An Idea, Its History, Its Consequences. Reprint edition, Oxford University Press, 2013.

Foucault, Michel. “Preface.” The Order of Things: An Archaeology of the Human Sciences, Vintage Books: A Division of Random House, Inc.

Gray, J. Glenn, and Hannah Arendt. The Warriors: Reflections on Men in Battle. 1998.

Kieran, David. Signature Wounds: The Untold Story of the Military’s Mental Health Crisis. First edition., NYU Press, 2019.

Lee, Ashley. “Critics Ridiculed Brandy’s ‘Cinderella.’ Its Legacy Is a Lesson to Hollywood.” Los Angeles Times, 12 Feb. 2021, https://www.latimes.com/entertainment-arts/tv/story/2021-02-12/cinderella-brandy-whitney-houston-disney-plus-movies.

Lutz, Catherine A. Homefront: A Military City and the American Twentieth Century. First Edition, Beacon Press, 2002.

Ramsay, Stephen. “Databases.” Companion to Digital Humanities (Blackwell Companions to Literature and Culture), by Susan Schreibman et al., Hardcover, Blackwell Publishing Professional, 2004, http://www.digitalhumanities.org/companion/.

Sperberg-McQueen, C. M. “Classification and Its Structures.” Companion to Digital Humanities (Blackwell Companions to Literature and Culture), by Susan Schreibman et al., Hardcover, Blackwell Publishing Professional, 2004, http://www.digitalhumanities.org/companion/.

The Editors of Encyclopaedia Britannica, and Melissa Petruzzello. “Saint Margaret Clitherow | Biography, Death, & Facts.” Encyclopedia Britannica, 1 Jan. 2021, https://www.britannica.com/biography/Saint-Margaret-Clitherow.

Tishkov, Valery, and Mikhail S. Gorbachev. Chechnya: Life in a War-Torn Society. University of California Press, 2004.

 

Categorization and Digital Scholarship

Foucault argues that because “certain aphasiacs, when shown various differently coloured skeins of wool on a table top, are consistently unable to arrange them into any coherent pattern,” language is required for categorization (Foucault xviii). The opposite might also be true; the nature of language encourages people to categorize. Words as simple as “sister,” “brother,” “near,” “far,” “expensive,” or “cheap” exist because the nature of descriptive words is to place things into more easily understood categories.

For Digital Humanities in particular, categorization is important because the field focuses on how to present and understand large data sets. Providing a list of thousands of data points is not a useful way to present information; it must be organized in a more easily understood fashion.

Categorization of data into databases does cause some problems, however. For example, if the people creating the categories don’t know what the data will ultimately be used for, they might split the data up into categories that are not useful at all. I have seen this in my work in economic research. The Census Bureau often presents data separated into age groups; if a study is being done on a specific age instead of the age group, this category is not useful. Alternatively, the Bureau of Labor Statistics provides employment data separated by year and location; if a comparison must be done for all locations in a particular year, re-assembling the data out of the categories provided can be a lengthy process. Dividing data into too many categories can be as irritating to the user as dividing data into too few categories. Stephen Ramsay discusses the problems inherent in setting up a database, pointing out that “one needs to balance the goals of correctness against the practical exigencies of the system and its users” (Ramsay).

The end user of a system ought to determine how the system is organized. For example, I once got a job organizing music for a high school choir teacher. For this, the first step was to categorize by “Christmas and Non Christmas Music,” because that is how high school choir teachers use their music. But if I were organizing it for another purpose, this would be pointless. In this instance, having the works digitized is helpful, because it flattens all the categories onto one level. The user can sort by the category they want, instead of having to work in the order used by the person organizing the music. Sperberg-McQueen discusses this by saying “the order of axes has tended to become somewhat less important in multidimensional classification schemes intended for computer use” (Sperberg-McQueen).
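The idea that digitizing “flattens” the categories can be illustrated with a small Python sketch. The Score record, its fields, and the three catalogue entries are all assumptions chosen for illustration, not the actual choir library. The same flat list is simply re-sorted for each user instead of being filed once in a fixed order.

```python
from dataclasses import dataclass

@dataclass
class Score:
    title: str
    composer: str
    christmas: bool

# Illustrative entries only.
library = [
    Score("Silent Night", "Gruber", True),
    Score("Ave Verum Corpus", "Mozart", False),
    Score("Deck the Halls", "Traditional", True),
]

# The choir teacher's view: Christmas music first, then alphabetical by title.
christmas_first = sorted(library, key=lambda s: (not s.christmas, s.title))

# Another user's view of the very same records: ordered by composer.
by_composer = sorted(library, key=lambda s: s.composer)

print([s.title for s in christmas_first])
print([s.title for s in by_composer])
```

Neither ordering is stored in the data itself, which is one way of reading Sperberg-McQueen’s point that the order of axes matters less once the scheme lives on a computer.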

Zotero is a place where the end user and the person organizing the database ought to be the same person: me. For this reason, I will be using the folder function to keep track of things, and will come up with a keyword system as I work further toward my research projects. I must be careful and check my citations when they are generated from the browser plugin. Sometimes the link is incorrect, and when I tried to cite a newspaper article and an encyclopedia entry, Zotero didn’t recognize them and instead categorized them as “web pages.” Furthermore, using the browser plugin to cite A Companion to Digital Humanities cited the entire book instead of the chapter. Although Zotero is a huge time-saver, it must be used with care.

Works Cited
Foucault, Michel. “Preface.” The Order of Things: An Archaeology of the Human Sciences, Vintage Books: A Division of Random House, Inc.
Ramsay, Stephen. “Databases.” Companion to Digital Humanities (Blackwell Companions to Literature and Culture), by Susan Schreibman et al., Hardcover, Blackwell Publishing Professional, 2004, http://www.digitalhumanities.org/companion/.
Sperberg-McQueen, C. M. “Classification and Its Structures.” Companion to Digital Humanities (Blackwell Companions to Literature and Culture), by Susan Schreibman et al., Hardcover, Blackwell Publishing Professional, 2004, http://www.digitalhumanities.org/companion/.

Week One Practicum: Zooniverse

Transcribing required knowing what was being asked for, which could be difficult when different forms were used, as shown in these two forms

For this week, I worked on the Zooniverse project Every Name Counts. I specifically chose to transcribe in the subheading “BASIC Names Auschwitz – Prisoner Registration Forms,” because I recognized the name of the concentration camp, and knowing that history made me connect more to the work I was doing. I immediately ran into the problem that I do not speak German, which meant that I had to click on the “Need some help with this task?” button for every single question, usually multiple times. The first transcription was especially difficult, because in transcribing relatives I couldn’t figure out that parents were listed together on one line. Once I finished the first document and understood the process better, I had a much easier time. After this, the process of transcribing the names weighed on me, knowing the suffering these people underwent. I learned that Motek is a name, and also a term of endearment in Hebrew, and transcribed a record for an 18-year-old who was imprisoned with his parents. I also transcribed the record of a married carpenter who died three months after his date of entry. Knowing these facts about Roman Kolanola’s life makes me remember his name and helps me understand the importance of the Every Name Counts database.

“Motek” is a Hebrew Endearment, as I learned in researching this project

Franz Gemza’s parents had crosses over their names, marking their deaths

Roman Kolanola’s death date was shown on his form

Participating in this Digital Humanities project gave me a deeper understanding of what the subject means. When I read in the instructions that each item is transcribed three times to keep mistakes from slipping through, I understood Kirschenbaum’s words: “digital humanities is also a social undertaking” (Kirschenbaum 2). I also understood “The Bentham Project” better after performing transcription myself, since I did not understand how crowd-sourced transcription could work when I first heard of it (Ross 29). I found the tagging feature at the end of the Zooniverse project very interesting, and it struck me as something future users of this database will find very useful. Perhaps I found this intuitively important because of tagging’s prevalence on Twitter and other social media, as discussed in both of the readings for this week (Kirschenbaum 4; Ross 33-35). I was glad to be able to participate in a Digital Humanities project and help digitize these important records.
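One way to picture why transcribing each item three times catches mistakes is a simple majority vote. This is only a sketch of the general idea in Python; I do not know how Every Name Counts actually reconciles its transcriptions, and the consensus function and sample entries are made up for illustration.

```python
from collections import Counter

def consensus(transcriptions):
    """Return the value at least two transcribers agree on, or None if they all differ."""
    # Ignore differences in capitalization and stray spaces.
    normalized = [t.strip().lower() for t in transcriptions]
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= 2 else None

# Made-up entries for a single field, typed by three different volunteers.
agreed = consensus(["Krakau", "krakau ", "Krakow"])
disputed = consensus(["Krakau", "Krakow", "Cracow"])
print(agreed, disputed)
```

A field with no agreement comes back as None, which in practice would mean sending it to another volunteer or a reviewer rather than guessing.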

 

Works Cited

Kirschenbaum, Matthew G. “What Is Digital Humanities and What’s It Doing in English Departments?” ADE Bulletin, no. 150 (2010): 1–7.

Ross, Claire. “Social Media for Digital Humanities and Community Engagement.” In Digital Humanities in Practice, by Claire Warwick, Melissa M. Terras, and Julianne Nyhan, 23–45. London: Facet Publishing, 2012. https://eds-b-ebscohost-com.libproxy.chapman.edu/eds/ebookviewer/ebook?sid=5322476a-83c0-4a5d-a546-0dd1c3ab2a02%40pdc-v-sessmgr04&ppid=pp_23&vid=0&format=EB.