February 2021 archive

Topic Modeling Practicum

1) Use the Voyant Topics tool or the TMT to generate a Topic Model of a text you know well. Report on your results–are you seeing the patterns in your topics like Blevins and Posner describe in their articles about Topic Modeling? Perhaps try to run the tool multiple times with different topic and word lengths to see if that offers any insights into a “distant” reading of your text. You may also want to try using more than one text to see if one generates better topics than another.

In their articles about topic modeling, Blevins and Posner discuss how topic modeling finds words that frequently appear together and groups them into clusters. The program is only concerned with how the words are used in the text and which words tend to be used similarly. I ran the Topics tool through each available platform: corpus, document, and grid. For my text, The Great Gatsby, I got different results each time I ran the tool through each platform. Most of the results that made sense came from the corpus tool with smaller word lengths. The larger the word length, the less the words in a topic made sense together and the less they seemed to correlate. The best results appeared with a shorter word length and a higher topic count.
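
To make the idea of "words that frequently appear together" more concrete, here is a minimal Python sketch that counts how often pairs of words show up in the same segment of text. This is only a simple co-occurrence count, not the actual algorithm that Voyant or the TMT uses, and the function name and sample segments are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(segments):
    """Count how often each pair of words appears in the same segment."""
    pairs = Counter()
    for segment in segments:
        # Use each distinct word once per segment, sorted so pairs are consistent.
        words = sorted(set(segment.lower().split()))
        for pair in combinations(words, 2):
            pairs[pair] += 1
    return pairs


# Made-up sample segments, not actual lines from the novel.
segments = [
    "gatsby daisy love",
    "tom daisy city",
    "gatsby daisy city",
]
counts = cooccurrence_counts(segments)
print(counts.most_common(3))
```

Pairs with higher counts are the kind of words a topic model would be more likely to place in the same cluster, although a real topic model does much more than count pairs.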

The most common themes I found when I ran the topic tool on corpus were names and what I consider major symbols. For example, I often found Tom and Daisy grouped together, or Gatsby and Jordan. I also found the names Gatsby and Daisy grouped with words such as love and the city. These things are closely tied to one another in the text. On the grid and document topic tools, none of the results really made sense. I was expecting that the bigger the word length and the higher the topic count, the better the results would be. However, for The Great Gatsby, the opposite was true.

Our other reading from class, Topic Modeling: A Basic Introduction, says that one of the best ways to understand what the program is telling you is through visualization, because the output is not always human readable. I looked at the visualizations provided in Voyant next to the topic output, and I did not find them very helpful: the output did not quite make sense together, so neither did the visualizations. According to the article, you have to prepare the corpus before you use it by stripping out punctuation and capitalization and ignoring stop words. However, I could only use the Voyant tool because the TMT was not working, so I could not properly prepare the corpus to get optimized results. I think this largely affected the output, because many of the words in my topics happened to be stop words.
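
As a rough sketch of what that preparation might look like if done by hand in Python, the example below lowercases the text, strips punctuation, and drops stop words. The stop-word list is a tiny sample I made up for illustration, not the list that Voyant or the TMT uses, and the sample sentence is not a quote from the novel.

```python
import string

# Tiny illustrative stop-word list (an assumption, not Voyant's or the TMT's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he", "she", "it"}


def prepare_text(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


# Made-up sample sentence.
sample = "The city was loud, and Daisy laughed."
tokens = prepare_text(sample)
print(tokens)
```

Only the words that survive this step would be passed on to the topic tool, which is why an unprepared text can end up with topics full of stop words.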

Because The Great Gatsby did not produce great results, I ran a second text, The Wizard of Oz. Unfortunately, I saw many of the same issues with this text as well. There were a lot of names and places when I kept the word length shorter, but when I increased the word length I was shown a lot of random words that might have been cut out if I had been able to make the proper modifications in the TMT.

2) Use one of the neural net tools to generate some text on a Humanities topic. Do you notice anything concerning in your results?  If so, why do you think you are getting these results?  If you use the same prompt again, do you get a different kind of response?

When I used the neural net tool, the Humanities topic I typed in was “textual mining”. I ran it twice and got different results both times. The first result appeared to be a definition of textual mining. The second result was very strange: it discussed Bromium and cyber-attacks, which was a bit odd. The next topic I put in was digital archives, and neither of the two results made sense. The first was a link to another article, and the second was about Berkeley Auditorium, which does not seem to have anything to do with digital archives in my opinion. The last reading from class, Data, Algorithms, Fairness, and Accountability, discusses how holding data accountable is difficult because we don’t really have the tools to do it well. I think that is the case with this tool. It does not correctly read and represent the data, because data is not always accurate, and neither are the tools that we have to make sense of it.

I think that the TMT is the most accurate tool because it contains options to refine the search. I was only able to use Voyant for this assignment, which does not have the same refining options for text analysis. I think that it can still produce meaningful results; however, its limitations include not being able to eliminate stop words, which can make the results slightly less readable. The neural net is the most sketchy. It never produces similar results, and most of the results make absolutely no sense. I think it has the most limitations because you really have no idea whether your results are accurate or how to read them. I think computer data is increasingly impacting our lives in many ways, most visibly in the online marketing and advertising we see on a daily basis. I think we can learn a lot from computer data, and from Boyd's reading, I have learned that it should not be used for everything.

 

Voyant Tool: Week Three Practicum

In the class reading Tooling Up for Digital Humanities, there are several sections that I found significant to my findings from the tools. The section Authorship Attribution discusses identifying the author of a piece of writing by comparing common function words. A tool called Document Terms in Voyant shows you a word cloud of the most frequently used terms. For my particular book, The Wizard of Oz, the most frequently used words were the characters' names: Dorothy, Scarecrow, and Woodman.

Another section of this same reading, Term Frequency, states that the most fundamental form of content-based text analysis is word counting, which measures how often words and phrases appear in a text. The Text Revival section of this reading corresponds to the Phrases tab in Voyant because it searches for particular words and phrases. I think these results show that the story is very character driven and focused. It is striking how many conclusions and assumptions you can make about a text simply from this tool.
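
The word counting described above can be sketched in a few lines of Python. This is only an illustration of the basic idea, not how Voyant computes its results, and the sample sentence is made up rather than quoted from the book.

```python
from collections import Counter
import re


def term_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text using the character names from my results.
sample = "Dorothy met the Scarecrow. Dorothy and the Scarecrow met the Woodman."
freqs = term_frequencies(sample)
print(freqs.most_common(3))
```

In this sample the most frequent word is a stop word ("the"), which shows why a tool that filters stop words leaves the characters' names at the top of the list.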

The next reading we had for class was titled When OCR Goes Bad: Google’s Ngram Viewer & The F-Word, and it discussed optical character recognition. The Ngram Viewer takes the word you input and scans it against the five million books Google has. It does not always recognize words correctly, and it is case sensitive. One thing I found interesting was how the tool could show you the relationships between different words during different time periods. It then allows you to search the books that these results are found in, which can give great insight into those relationships at different times. I have included one of my search results, which compares different animals during different time periods. I was not surprised by the results, but it was interesting to see when domesticated animals became common in comparison to owning a horse throughout the decades.

 

The final class reading, Optical Character Recognition, talked about how OCR provides a high degree of recognition accuracy. In both activities, the pattern recognition component of OCR was the most frequently used, and it was displayed in different ways. My takeaway from this was learning how to read, analyze, and use these patterns to make some observations about texts.

Zotero Practicum

I inserted screenshots of both the bibliography and the ten Zotero links because they did not paste in the correct format on my site, but I have included them anyway. The categorization of materials is useful to digital scholarship because it helps researchers access, store, and share data with others easily and reliably. I have found it particularly helpful for storing different types of media in an organized manner that is easy to access later on. The three authors' texts we read this week provided insight into the positives and negatives of digital humanities, the different applications and rules that go into making it successful, and simple ways to keep things from going wrong.
Categorization of materials is useful to digital scholarship because it sorts data in a way that makes it easier for users to search, save, and find. The reading Classification and Its Structures addresses the many different ways things can be classified, including one-dimensional and N-dimensional classification schemes and a priori systems, along with several rules of classification to follow. The closer the purpose of the classification is to the central problem of the research, the more likely it is that a custom-made classification scheme will be necessary. This reading also highlights a growing emphasis on image-based computing for the humanities. However, there are difficulties with categories for humanities research. The reading discusses how hard it is to agree on and maintain consistency in keyword-based classifications or descriptions of images because of the similarities among graphic images.
The second reading, Databases, addresses how databases can be problematic for some humanities research and how to minimize these problems. Humanists realized that the use of databases could create intellectual opportunities, such as the mapping of relationships among entities, the visualization of information patterns, and methodologies worth studying further. This chapter focuses on the design and implementation of relational databases in a way that removes problematic technical and conceptual details. The most common issues involve redundancy, which can be resolved by giving each record a primary key and by applying normal forms. Transaction management and collaborative database collections can often lead to fragmented data that does not give consistent results. However, if implemented successfully, databases could expand the possibilities of knowledge representation considerably.
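
To illustrate how a primary key can reduce redundancy, here is a small sketch using Python's built-in sqlite3 module. The table and column names are hypothetical and chosen only for this example; they do not come from the reading. The author's name is stored once and each book points to it by key, instead of the name being repeated on every book record.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: each author is stored once and identified by a primary key.
conn.execute(
    "CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE books (
           book_id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           author_id INTEGER NOT NULL REFERENCES authors(author_id)
       )"""
)

conn.execute("INSERT INTO authors VALUES (1, 'L. Frank Baum')")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "The Wonderful Wizard of Oz", 1), (2, "The Marvelous Land of Oz", 1)],
)

# Join the two tables to recover the full records.
rows = conn.execute(
    "SELECT b.title, a.name FROM books b "
    "JOIN authors a ON a.author_id = b.author_id ORDER BY b.book_id"
).fetchall()
author_count = conn.execute("SELECT COUNT(*) FROM authors").fetchone()[0]
print(rows, author_count)
```

If the author's name ever needed correcting, it would only have to be changed in one row, which is the kind of consistency that normal forms are meant to protect.
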
Finally, in The Order of Things, the author questions how things come to be classified and what justifies their categorization. This seems to be the overarching issue he is addressing, along with how it relates to our thinking about categorization in the humanities. The author brings up several examples that serve as good tools for thinking about categories and groupings of items.
Citations:

Foucault, M. (2005). The order of things. doi:10.4324/9780203996645