Topic Modeling Practicum
1) Use the Voyant Topics tool or the TMT to generate a Topic Model of a text you know well. Report on your results–are you seeing the patterns in your topics like Blevins and Posner describe in their articles about Topic Modeling? Perhaps try to run the tool multiple times with different topic and word lengths to see if that offers any insights into a “distant” reading of your text. You may also want to try using more than one text to see if one generates better topics than another.
In Blevins's and Posner's articles about topic modeling, they discuss how topic modeling finds words that frequently appear together and groups them into clusters. The program is concerned only with how the words are used in the text and which words tend to be used in similar ways. I ran the Topics tool in each available view: corpus, document, and grid. For my text, The Great Gatsby, I got different results each time I ran the tool in each view. Most of the results that made sense appeared when I used the corpus view with a smaller number of words per topic. The larger the word length, the less the words fit together and the weaker the correlations became. The best results appeared with a shorter word length but a higher topic count.
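To make concrete what the tool is doing behind the scenes, here is a minimal sketch of a topic model in Python, assuming gensim's LDA implementation; Voyant and the TMT may use different algorithms or settings, so this only illustrates the grouping idea, not the actual tools:

```python
# A sketch of what a topic-modeling tool does under the hood,
# assuming LDA via gensim. The toy "documents" stand in for
# chapters of a novel.
from gensim import corpora
from gensim.models import LdaModel

documents = [
    "gatsby daisy love party city light",
    "tom daisy house dinner jordan",
    "gatsby jordan party music night",
]

# Tokenize, then build the word<->id dictionary and bag-of-words corpus.
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# num_topics controls the topic count; num_words below controls
# the "word length" of each printed topic.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics(num_words=4):
    print(topic)
```

The num_topics and num_words parameters correspond to the topic count and word length I experimented with above.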
The most common themes I found when I ran the Topics tool in the corpus view were names and what I consider major symbols. For example, I often found Tom and Daisy grouped together, or Gatsby and Jordan. I also found the names Gatsby and Daisy grouped with topics such as love and the city, things that are heavily correlated with one another in the text. In the grid and document views, none of the results quite made sense. I had expected that the larger the word length and the higher the topic count, the better the results would be; for The Great Gatsby, however, the opposite turned out to be true.
Our other reading from class, Topic Modeling: A Basic Introduction, says that one of the best ways to understand what the program is telling you is through visualization, because the output is not always human-readable. I looked at the visualizations Voyant provides next to the topic output, but I did not find them very helpful: since the output results did not make sense together, neither did the visualizations. According to the article, you have to prepare the corpus before you use it by stripping out punctuation, lowercasing the text, and removing stop words. However, I could only use the Voyant tool because the TMT was not working, so I could not properly prepare the corpus for optimal results. I think this strongly affected the output, because many of the words in my topics happened to be stop words.
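To make the preparation step concrete, here is a sketch of the cleanup the article describes, assuming Python with NLTK's English stop-word list; the TMT presumably does something similar internally, though its exact stop-word list may differ:

```python
# Corpus preparation as the article describes: strip punctuation,
# lowercase everything, and drop stop words. Assumes NLTK's
# English stop-word list (requires nltk.download("stopwords")).
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def prepare(text):
    # Remove punctuation and lowercase the text.
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # Keep only tokens that are not stop words.
    return [word for word in text.split() if word not in stop_words]

print(prepare("In my younger and more vulnerable years my father gave me some advice."))
# -> ['younger', 'vulnerable', 'years', 'father', 'gave', 'advice']
```

Without this step, high-frequency words like "the" and "and" dominate the topics, which matches what I saw in my output.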
Because The Great Gatsby did not produce great results, I ran a second text, The Wizard of Oz. Unfortunately, I saw many of the same issues with this text as well. A lot of names and places appeared when I kept the word length shorter, but when I increased the word length I was shown a lot of random words that might have been filtered out if I had been able to apply the proper preprocessing on the TMT platform.
2) Use one of the neural net tools to generate some text on a Humanities topic. Do you notice anything concerning in your results? If so, why do you think you are getting these results? If you use the same prompt again, do you get a different kind of response?
When I used the neural net tool, the Humanities topic I typed in was “textual mining”. I ran it twice and got different results both times. The first result appeared to be a definition of textual mining. The second result was very strange: it discussed Bromium and cyber-attacks, which was a bit odd. The next topic I put in was “digital archives”. Neither result made sense. The first was a link to another article, and the second was about Berkeley Auditorium, which does not seem to have anything to do with digital archives in my opinion. The last reading from class, Data, Algorithms, Fairness, and Accountability, discusses how holding data accountable is difficult because we do not really have the tools to do it well. I think that is the case with this tool: it is not correctly reading and representing the data, because data is not always accurate, and neither are the tools we have to make sense of it.
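To illustrate why the same prompt gives different responses, here is a sketch of prompting a neural text generator, assuming GPT-2 via the Hugging Face transformers library; I do not know which model the class tool actually uses, but models like this sample randomly by default, which would explain the varying output:

```python
# A sketch of prompting a neural text generator, assuming GPT-2
# via Hugging Face transformers. The class tool may differ, but
# the sampling behavior is the same idea.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")

# Sampling is random by default, so the same prompt yields
# different text on every run, just as I observed.
for run in range(2):
    out = generator("textual mining", max_new_tokens=30, do_sample=True)
    print(f"Run {run + 1}:", out[0]["generated_text"])

# Fixing the random seed makes runs repeatable instead.
set_seed(42)
print("Seeded:", generator("textual mining", max_new_tokens=30, do_sample=True)[0]["generated_text"])
```

The model is only predicting plausible next words from its training data, not looking anything up, which may be why a prompt like “textual mining” can wander off into cyber-attacks.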
I think that the TMT tool is the most accurate because it contains options to refine the search. I was only able to use Voyant for this assignment, which does not have the same refining options for text analysis. I think it can still produce meaningful results, but its limitations include not being able to eliminate stop words, which can make the results less readable. The neural net is the sketchiest: it never produces similar results twice, and most of the results make absolutely no sense. I think it has the most limitations because you really have no idea whether your results are accurate or how to read them. Computer data is increasingly impacting our lives in many ways, most visibly in the online marketing and advertising we see daily. I think we can learn a lot from computer data, but from boyd's reading I have learned that it should not be used for everything.