Introduction to Topic Modeling with Mallet
Identifies clusters of words in a text that represent “topics”
Offers a different approach to text analysis beyond basic word frequency
Java-based package, run on the command line
“Topic modeling is not a way of revealing any objective “truth” about a text; instead, it’s a way of deriving a certain kind of meaning — which still needs to be interpreted and interrogated.”
Dividing up your documents.
Do you want to segment your txt files by article, issue, academic year, or calendar year? This will affect your results. Mallet looks for clusters in each individual txt file. If you group by year, you will get broader patterns; if you keep individual articles, you will get more granularity.
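To make the segmentation choice concrete, here is a minimal Python sketch that turns the same articles into either one document per article or one document per year. The sample articles, the filenames, and the function names are all hypothetical; a real corpus would be read from disk and written back out as txt files.

```python
from collections import defaultdict

# Hypothetical sample data: (year, title, text). Not taken from any real corpus.
articles = [
    (1920, "Campus News", "the team won the game"),
    (1920, "Editorial", "students debate the new rules"),
    (1921, "Campus News", "the library opens a new wing"),
]

def one_document_per_article(articles):
    """Return {filename: text} with one entry per article (more granularity)."""
    docs = {}
    for i, (year, _title, text) in enumerate(articles):
        docs[f"{year}_{i:03d}.txt"] = text
    return docs

def one_document_per_year(articles):
    """Return {filename: text} with each year's articles joined (broader patterns)."""
    by_year = defaultdict(list)
    for year, _title, text in articles:
        by_year[year].append(text)
    return {f"{year}.txt": "\n".join(texts) for year, texts in by_year.items()}

per_article = one_document_per_article(articles)
per_year = one_document_per_year(articles)
print(len(per_article), "article-level documents")
print(len(per_year), "year-level documents")
```

Either dictionary could then be written out as a folder of txt files, which is the form Mallet imports.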
How many topics?
The fewer topics you ask for, the broader they will be. The more topics you ask for, the more specific.
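Since Mallet is run on the command line, the number of topics is just an option you pass when training. The sketch below builds the two usual commands (import a folder of txt files, then train) as Python lists so that several topic counts can be compared. The paths, output filenames, and function names are assumptions for illustration; check the options against your installed Mallet version before running anything.

```python
import subprocess

def import_command(input_dir, mallet_file, mallet_bin="bin/mallet"):
    """Build the command that converts a folder of txt files into a Mallet file."""
    return [
        mallet_bin, "import-dir",
        "--input", input_dir,
        "--output", mallet_file,
        "--keep-sequence",      # keep word order, needed for topic training
        "--remove-stopwords",   # drop Mallet's default English stopwords
    ]

def train_command(mallet_file, num_topics, mallet_bin="bin/mallet"):
    """Build the training command for a given number of topics."""
    return [
        mallet_bin, "train-topics",
        "--input", mallet_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", f"keys_{num_topics}.txt",           # hypothetical filename
        "--output-doc-topics", f"composition_{num_topics}.txt",    # hypothetical filename
    ]

def run(command):
    """Run one command; only call this where Mallet is actually installed."""
    subprocess.run(command, check=True)

# Print the commands for a broad model (10 topics) and a more specific one (50 topics).
print(" ".join(import_command("my_texts", "corpus.mallet")))
for k in (10, 50):
    print(" ".join(train_command("corpus.mallet", k)))
```

Comparing the topic-keys files from a small and a large topic count is a quick way to see how broad or specific the topics become.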
*TMT is only for unaccented Latin characters.
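If your texts contain accented characters, one possible workaround is to strip the accents before loading the files. This is a minimal sketch using Python's standard unicodedata module; the function name is hypothetical, and removing accents changes the words themselves, so whether it is acceptable depends on your language and research question.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented letters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("café résumé naïve"))  # cafe resume naive
```

Letters that are not a base letter plus an accent mark (for example ø or ß) pass through unchanged and would need separate handling.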
LDA: “Latent Dirichlet Allocation,” a probabilistic technique used in topic modeling. “Let’s also assume that a “topic” can be understood as a collection of words that have different probabilities of appearance in passages discussing the topic.”
“part of the value of LDA will be that it implicitly sorts out the different contexts/meanings of a written symbol” e.g. lead as an element versus lead as a verb “to lead.”
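The idea of a topic as a set of words with different probabilities can be illustrated with a toy example. The sketch below is not LDA itself and not how Mallet works internally: it only shows the generative story LDA assumes, in which each word of a document is drawn by first picking a topic and then picking a word from that topic. The two topics, their words, and their probabilities are invented for illustration. Note that “lead” appears in both topics, which is how one written symbol can belong to different contexts.

```python
import random

# Hypothetical topics: each maps words to probabilities of appearance.
topics = {
    "metals": {"lead": 0.4, "iron": 0.3, "copper": 0.3},
    "leadership": {"lead": 0.3, "follow": 0.3, "govern": 0.4},
}

def generate_document(topic_mix, length, rng):
    """Draw each word by sampling a topic, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(topics[topic])
        probs = [topics[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return words

rng = random.Random(0)  # fixed seed so the toy document is repeatable
doc = generate_document({"metals": 0.8, "leadership": 0.2}, 10, rng)
print(" ".join(doc))
```

Topic modeling runs this story in reverse: given only the documents, it estimates which topics and word probabilities could plausibly have produced them.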
Why topic modeling?
A way to think about the topical structure or assembly of topics at a large scale.
A way to investigate rhetoric
The differences in linguistic discourse among fiction, poetry, letters, and oration
*“Standard list of stopwords is rarely adequate”
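As a small illustration of why a standard list may not be enough, this Python sketch filters the same sentence twice: once with a tiny stand-in for a standard list, and once with extra corpus-specific words added. Both word lists and the sample sentence are hypothetical. Mallet's import step also accepts an extra stopword file through its --extra-stopwords option, so a list like this can be applied there instead.

```python
import re

# Tiny stand-in for a standard stopword list (real lists are much longer).
STANDARD_STOPWORDS = {"the", "a", "of", "and", "to", "in"}
# Hypothetical corpus-specific additions, e.g. frequent words in a novel.
EXTRA_STOPWORDS = {"chapter", "said"}

def tokenize(text):
    """Lowercase the text and keep runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

text = "Chapter one: the captain said to lead the crew"
tokens = tokenize(text)
print(remove_stopwords(tokens, STANDARD_STOPWORDS))
print(remove_stopwords(tokens, STANDARD_STOPWORDS | EXTRA_STOPWORDS))
```

With only the standard list, words like “chapter” and “said” survive and can end up dominating topics; the extended list removes them.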
“I should also admit that, when you’re modeling fiction, the “author” signal can be very strong. I frequently discover topics that are dominated by a single author, and clearly reflect her unique idiom. This could be a feature or a bug, depending on your interests; I tend to view it as a bug, but I find that the author signal does diffuse more or less automatically as the collection expands.”
“valuable exploratory method”
But you make decisions that impact the results: stopwords, number of topics, scope of the collection and how it is divided.
Classifying versus clustering
On topic models: “they’re powerful, widely applicable, easy to use, and difficult to understand — a dangerous combination” – Scott Weingart