Using “Voyant” to analyze texts

When I was writing my undergraduate thesis, I often came across advice about how to look at sources similar to that given in “Tooling Up for Digital Humanities”:

“Scholars once accustomed to studying a handful of letters or a couple hundred diary entries are now faced with massive amounts of data that cannot possibly be analyzed in traditional ways.”

When working on my thesis, I only had a handful of letters, because I was working with the very limited sources left behind by the Catholic women of Elizabethan England. However, the fact remains that we now live in a time when there is more information available than ever before. Thankfully, we also have more ways of analyzing this information than ever before. One of these is Voyant, a tool that analyzes texts to show which words are most common and where in the text they occur.

In examining some texts using Voyant, I realized that there was truth to another statement in “Tooling Up for Digital Humanities”: simply looking at the words in a vacuum can lead to a simplistic understanding of the text. Knowing these texts closely helped me to avoid some simplistic assumptions, but with unfamiliar texts, I would not have that knowledge to fall back on. For example, I ran an analysis on Wuthering Heights and found that the words “Catherine” and “Cathy” both appeared frequently. Knowing the book well, I knew that these referred to two characters; without this familiarity, I might not have understood why the two terms appeared as they did.

Use of the words “Catherine” and “Cathy” in the book Wuthering Heights. It is not immediately clear whether they refer to one character or two.
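To make the idea behind this kind of chart concrete, here is a minimal sketch of how one might count a few terms across equal slices of a text. This is not how Voyant itself works; the function name `term_trend` and the choice of ten segments are my own assumptions for illustration.

```python
import re

def term_trend(text, terms, segments=10):
    """Count how often each term appears in each equal-sized slice of the text."""
    # Lowercase and keep only runs of letters, so "Cathy's" counts as "cathy".
    words = re.findall(r"[a-z]+", text.lower())
    # Ceiling division, so every word falls into one of the segments.
    size = max(1, -(-len(words) // segments))
    trend = {term.lower(): [0] * segments for term in terms}
    for i, word in enumerate(words):
        if word in trend:
            trend[word][min(i // size, segments - 1)] += 1
    return trend

# Hypothetical sample text, not taken from the novel.
sample = "Catherine spoke. Catherine left. Cathy arrived. Cathy stayed."
print(term_trend(sample, ["Catherine", "Cathy"], segments=2))
```

A count like this shows where each name clusters, but it still cannot say whether the two names belong to one person or two. That part depends on knowing the book.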

I also found some problems with the text files from Project Gutenberg. I had to remove the beginning and end of each text, which explain what the text is and what restrictions are placed on it, or else Voyant mistakenly treated “Gutenberg” as a popular word in every text analyzed. I also had to check whether there was an introduction; including the introduction to Edgar Allan Poe’s poems created a very different chart than the one that used only Poe’s own words. If I were to analyze a poem to see whether it was written by Poe, checking for his common words, such as “heart,” “like,” “love,” and “night,” would be a good first step. This check would be skewed if I included words written by other authors.

If the preface is included, the word “Poe” is one of the most common, at least in the beginning of the book.

With the preface removed, the most common words become Poe’s own. Even here, though, the prominence of the word “poem” means that I probably left an afterword at the end that I would have to remove if I wanted to truly examine this document.
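The cleanup described above can be partly automated. The following is a rough sketch, assuming the file uses the “*** START OF THE PROJECT GUTENBERG EBOOK …” and “*** END OF THE PROJECT GUTENBERG EBOOK …” marker lines found in many Project Gutenberg plain-text files. The function names and the tiny stopword list are hypothetical, and a file with different markers would pass through unchanged.

```python
import re
from collections import Counter

# Marker lines assumed to surround the body of the book.
START = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg(text):
    """Return only the text between the start and end markers, if present."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

def top_words(text, n=5, stopwords=frozenset({"the", "a", "and", "of", "to", "in"})):
    """List the n most frequent words, ignoring a small illustrative stopword list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords).most_common(n)

# Hypothetical miniature file, standing in for a real download.
sample = (
    "The Project Gutenberg eBook of Poems\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK POEMS ***\n"
    "night and love and night\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK POEMS ***\n"
    "Gutenberg license text\n"
)
print(top_words(strip_gutenberg(sample)))
```

This only removes the Project Gutenberg wrapper. A preface or afterword written by an editor sits inside the markers, so it would still need to be found and cut by hand.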

I next ran an analysis on Jane Eyre. I was most interested in the word “little,” because my favorite line in this book is “Do you think, because I am poor, obscure, plain, and little, I am soulless and heartless? You think wrong! — I have as much soul as you, — and full as much heart!” Using the “Word Tree” tool, I could even see a little bit of the context. However, because the punctuation had not been scrubbed, there were some obvious problems; I am not interested in “little” being followed by a semicolon, but rather in what word came next. If I were to run a proper analysis, I would need to tokenize the text first, splitting it into individual words and discarding the punctuation.

The word “Little” in Voyant’s “Word Tree” tool for Jane Eyre
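As a sketch of what that tokenizing step might look like, the snippet below splits a text into lowercase word tokens, drops the punctuation, and counts which words follow a target word. The function name `next_words` and the simple regular expression are my own assumptions; a dedicated tokenizer would handle more edge cases.

```python
import re
from collections import Counter

def next_words(text, target, n=5):
    """Count the words that immediately follow `target`, ignoring punctuation."""
    # Keep runs of letters, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target.lower()
    )
    return followers.most_common(n)

# Hypothetical sample sentence, not a quotation from the novel.
sample = "She was little; little Jane, the little girl, and little Jane again."
print(next_words(sample, "little"))
```

Because the punctuation is thrown away, this version also pairs words across sentence boundaries, so a “little” that ends one sentence is counted with the first word of the next. Whether that matters depends on the question being asked.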

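Using Voyant was very interesting, and revealed some of the advantages and disadvantages of text analysis tools. It can show the most common words in a long text within seconds, but the results are only as good as the cleaned-up text that goes in, and making sense of them still depends on knowing the text itself.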

Works Cited

Brontë, Charlotte, and F. H. (Frederick Henry) Townsend. Jane Eyre: An Autobiography. 1998. Project Gutenberg, https://www.gutenberg.org/ebooks/1260.
Brontë, Emily. Wuthering Heights. 1996. Project Gutenberg, https://www.gutenberg.org/ebooks/768.
“Optical Character Recognition.” Wikipedia, 28 Jan. 2021. Wikipedia, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=1003359641.
Poe, Edgar Allan. The Complete Poetical Works of Edgar Allan Poe Including Essays on Poetry. Edited by John Henry Ingram, 2003. Project Gutenberg, https://www.gutenberg.org/ebooks/10031.
Text Analysis » Tooling Up for Digital Humanities. 2 Mar. 2017, https://web.archive.org/web/20170302102313/http://toolingup.stanford.edu/?page_id=981.
“When OCR Goes Bad: Google’s Ngram Viewer & The F-Word.” Search Engine Land, 19 Dec. 2010, https://searchengineland.com/when-ocr-goes-bad-googles-ngram-viewer-the-f-word-59181.