The mallet R package depends on rJava, an R package that lets R call
Java code (here, the mallet Java code). See github.com/s-u/rJava for
details.
Next, install the mallet R package from CRAN with install.packages().
install.packages("mallet")
Depending on the size of your data, you may need to increase the Java
virtual machine (JVM) heap memory to handle larger corpora. To do this,
specify how much memory to allocate to the JVM with the -Xmx flag. Set
the option before loading the package, because JVM parameters only take
effect when the JVM starts. The example below allocates 4 GB to the JVM.
options(java.parameters = "-Xmx4g")
To load the package, use library().
library(mallet)
There are multiple ways to read text data into R. A simple way is to
read individual text files into a character vector. The example below
reads the stop list txt files that come with the mallet package into a
character vector, which the mallet R package can then use as data.
# Note this is the path to the folder where the stoplists are stored in the R package.
# Change this path to another directory to read other txt files into R.
directory <- system.file("stoplists", package = "mallet")
files_in_directory <- list.files(directory, full.names = TRUE)
txt_file_content <- character(length(files_in_directory))
for (i in seq_along(files_in_directory)) {
  txt_file_content[i] <- paste(readLines(files_in_directory[i]), collapse = "\n")
}
# We can check the content with str()
str(txt_file_content)
## chr [1:6] "English stoplist is the standard Mallet stoplist.\n\nGerman, French, Finnish are borrowed from http://www.ranks.nl." ...
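The same pattern of reading every file in a folder can also be written with vapply(), which returns the character vector directly. The sketch below is self-contained: it writes two small files to a temporary folder instead of using the stop lists, and the names example_dir, a.txt and b.txt are made up for illustration.

```r
# Hypothetical example files in a temporary directory (illustration only).
example_dir <- file.path(tempdir(), "example_txt")
dir.create(example_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(example_dir, "a.txt"))
writeLines("only line", file.path(example_dir, "b.txt"))

files_in_directory <- list.files(example_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file and collapse its lines into a single string.
txt_file_content <- vapply(
  files_in_directory,
  function(path) paste(readLines(path), collapse = "\n"),
  character(1),
  USE.NAMES = FALSE
)
str(txt_file_content)
```

Each element of the resulting vector holds one document, which is the shape expected for the text.array argument used later in this vignette.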
We will now use the example data set of State of the Union addresses
from 1946 to 2000, which is included with the mallet R package as a
data.frame. The data can be accessed as follows.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(sotu)
sotu[["text"]][1:2]
## [1] "To the Congress of the United States: "
## [2] "A quarter century ago the Congress decided that it could no longer consider the financial programs of the various departments on a piecemeal basis. Instead it has called on the President to present a comprehensive Executive Budget. The Congress has shown its satisfaction with that method by extending the budget system and tightening its controls. The bigger and more complex the Federal Program, the more necessary it is for the Chief Executive to submit a single budget for action by the Congress. "
Mallet also comes with five different stop list files (see above). We can access the path to these lists as follows.
mallet_supported_stoplists()
## [1] "de" "en" "fi" "fr" "jp"
stopwords_en_file_path <- mallet_stoplist_file_path("en")
As a first step, we need to create an LDA trainer object and supply
the trainer with documents. We start by creating a mallet instance list
object with mallet.import(). This function has a few extra options (such
as whether to lowercase the text and how a token is defined). See
?mallet.import for details.
sotu.instances <-
  mallet.import(id.array = row.names(sotu),
                text.array = sotu[["text"]],
                stoplist = stopwords_en_file_path,
                token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
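The token.regexp pattern used above reads as: a letter, then one or more letters or punctuation characters, then a letter. To get a feel for what such a pattern accepts, the sketch below applies it to a made-up sentence with base R regular expressions. This is only an approximation: mallet evaluates the pattern in Java, whose regex engine may differ from R's PCRE in edge cases, and example_text is hypothetical.

```r
# Illustration only: base R approximation of how the token pattern matches.
token_regexp <- "\\p{L}[\\p{L}\\p{P}]+\\p{L}"

# Hypothetical input, lowercased by hand for this example.
example_text <- tolower("The U.S. budget isn't small, is it?")

# Extract every non-overlapping match of the pattern.
tokens <- regmatches(example_text,
                     gregexpr(token_regexp, example_text, perl = TRUE))[[1]]
print(tokens)
```

Because the pattern needs a letter at both ends and at least one character in between, every token has three or more characters, so two-letter words such as "is" are not matched. Punctuation inside a token (as in "isn't") is kept, while trailing punctuation is dropped.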
If the data is already cleaned and we want to use the index of
text.array as the document ids, we can supply only text.array.
sotu.instances.short <-
  mallet.import(text.array = sotu[["text"]])
It is also possible to supply stop words as a character vector.
stop_vector <- readLines(stopwords_en_file_path)
sotu.instances.short <-
  mallet.import(text.array = sotu[["text"]],
                stoplist = stop_vector)
We first need to create a topic trainer object to fit a model.
topic.model <- MalletLDA(num.topics=10, alpha.sum = 1, beta = 0.1)
Next, we load our documents. We could also pass in the filename of a saved instance list file built with the command-line tools.
topic.model$loadDocuments(sotu.instances)
We use the getVocabulary() method to get the model’s vocabulary. The
vocabulary may be helpful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
head(vocabulary)
## [1] "congress" "united" "states" "quarter" "century" "ago"
Similarly, we can access the word and document frequencies with
mallet.word.freqs().
word_freqs <- mallet.word.freqs(topic.model)
head(word_freqs)
## word word.freq doc.freq
## 1 congress 1025 879
## 2 united 508 426
## 3 states 557 480
## 4 quarter 16 15
## 5 century 166 155
## 6 ago 179 171
To optimize the hyperparameters (alpha and beta) every 20 iterations, after 50 burn-in iterations, we set alpha optimization as follows.
topic.model$setAlphaOptimization(20, 50)
Now train a model. Note that hyperparameter optimization is on by default. We can specify the number of iterations. Here we’ll use a large-ish round number.
topic.model$train(200)
We can also run through a few iterations where we pick the best topic for each token rather than sampling from the posterior distribution.
topic.model$maximize(10)
To analyze our corpus using our model, we usually want to access the probability of topics per document and the probability of words per topic. By default, these functions return raw word counts. Here we want probabilities, so we normalize and add “smoothing” so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=TRUE, normalized=TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
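To make the idea of smoothing and normalizing concrete, the base R sketch below turns a small made-up count matrix into probabilities. The counts and the smoothing constant are assumptions chosen for illustration; the values that mallet itself adds when smoothed=TRUE are not reproduced here.

```r
# Hypothetical raw counts: 2 documents (rows) by 3 topics (columns).
counts <- matrix(c(5, 0, 0,
                   2, 3, 0), nrow = 2, byrow = TRUE)

# Assumed smoothing constant, for illustration only.
smoothing <- 0.1

# Smoothing: add a small positive value to every cell.
smoothed <- counts + smoothing

# Normalizing: divide each row by its sum so that it becomes a distribution.
probs <- smoothed / rowSums(smoothed)

print(round(probs, 3))
print(rowSums(probs))
```

After these two steps each row sums to one, and cells that had a raw count of zero receive a small but nonzero probability.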
What are the top words in topic 2? Notice that R indexes from 1 and Java from 0, so this will be the topic that mallet called topic 1.
mallet.top.words(topic.model, word.weights = topic.words[2,], num.top.words = 5)
## term weight
## 1 years 0.01721053
## 2 war 0.01525148
## 3 america 0.01422937
## 4 people 0.01192963
## 5 men 0.01048164
Next, we show the longest document in which more than 50% of the tokens belong to topic 2. Note that because the model is not identified, topic 2 may correspond to a different topic when you run the same code.
docs <- which(doc.topics[,2] > 0.50)
doc_size <- nchar(sotu[["text"]])[docs]
idx <- docs[order(doc_size, decreasing = TRUE)[1]]
sotu[["text"]][idx]
## [1] "The last person I want to introduce is Jack Lucas from Hattiesburg, Mississippi. Jack, would you stand up. Fifty years ago in the sands of Iwo Jima, Jack Lucas taught and learned the lessons of citizenship. On February the 20th, 1945, he and three of his buddies encountered the enemy and two grenades at their feet. Jack Lucas threw himself on both of them. In that moment he saved the lives of his companions and miraculously in the next instant a medic saved his life. He gained a foothold for freedom and at the age of 17, just a year older than his grandson, who's up there with him today, and his son, who is a West Point graduate and a veteran, at 17, Jack Lucas became the youngest marine in history and the youngest soldier in this century to win the Congressional Medal of Honor. All these years later, yesterday, here's what he said about that day: Didn't matter where you were from or who you were. You relied on one another. You did it for your country. We all gain when we give and we reap what we sow. That's at the heart of this New Covenant. Responsibility, opportunity and citizenship. "
We can also study the topics and how they differ in different parts of the corpus, for example in different time periods.
post1975_topic_words <- mallet.subset.topic.words(topic.model, sotu[["year"]] > 1975)
mallet.top.words(topic.model, word.weights = post1975_topic_words[2,], num.top.words = 5)
## term weight
## 1 america 123
## 2 years 92
## 3 american 70
## 4 people 67
## 5 war 60
The mallet R package can also (hierarchically) cluster the topics to
assess which topics are “closer” to each other. See
?mallet.topic.hclust for further details on how to cluster topics.
topic_labels <- mallet.topic.labels(topic.model, num.top.words = 2)
topic_clusters <- mallet.topic.hclust(doc.topics, topic.words, balance = 0.5)
plot(topic_clusters, labels = topic_labels, xlab = "")
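As a rough intuition for what a balance argument could mean when clustering topics, the sketch below blends two distance matrices between topics: one based on their word distributions and one based on how they are spread over documents. This is a minimal sketch on random, made-up matrices using only base R. Whether mallet.topic.hclust() computes its clustering in exactly this way is an assumption, so consult ?mallet.topic.hclust for the authoritative description.

```r
set.seed(1)

# Hypothetical inputs: 6 documents, 4 topics, 5 words (rows sum to 1).
doc_topics_example <- matrix(runif(6 * 4), nrow = 6)
doc_topics_example <- doc_topics_example / rowSums(doc_topics_example)
topic_words_example <- matrix(runif(4 * 5), nrow = 4)
topic_words_example <- topic_words_example / rowSums(topic_words_example)

# Assumed weighting between the two notions of distance.
balance <- 0.5

# Describe each topic by its distribution over documents.
topic_docs <- t(doc_topics_example)
topic_docs <- topic_docs / rowSums(topic_docs)

# Blend word-based and document-based distances between topics.
blended <- balance * dist(topic_words_example) +
  (1 - balance) * dist(topic_docs)

# Hierarchical clustering on the blended distances.
example_clusters <- hclust(blended)
print(example_clusters)
```

In this sketch, a balance near 1 groups topics mainly by shared vocabulary, while a balance near 0 groups them mainly by the documents in which they occur.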
We can also store our current topic model state to use it for postprocessing. We can store the state file either as a text file or a compressed gzip file.
state_file <- file.path(tempdir(), "temp_mallet_state.gz")
save.mallet.state(topic.model = topic.model, state.file = state_file)
We also store the topic counts per document and remove the old model.
doc.topics.counts <- mallet.doc.topics(topic.model, smoothed=FALSE, normalized=FALSE)
rm(topic.model)
To initialize a model with the sampled topic indicators, create a new model, load the same data, and then load the topic indicators. Unfortunately, setting the alpha parameter vector is currently not possible, so the model cannot be initialized with the same alpha prior.
new.topic.model <- MalletLDA(num.topics=10, alpha.sum = 1, beta = 0.1)
new.topic.model$loadDocuments(sotu.instances)
load.mallet.state(topic.model = new.topic.model, state.file = state_file)
doc.topics.counts[1:3, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 7 6 0 0 0
## [3,] 0 0 3 0 0 0
mallet.doc.topics(new.topic.model, smoothed=FALSE, normalized=FALSE)[1:3, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 7 6 0 0 0
## [3,] 0 0 3 0 0 0
This vignette gives a first example of using the mallet R package for topic modelling.
We can also save Mallet topic models and load them back into R.
model_file <- file.path(tempdir(), "temp_mallet.model")
mallet.topic.model.save(new.topic.model, model_file)
read.topic.model <- mallet.topic.model.read(model_file)
doc.topics.counts[1:3, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 7 6 0 0 0
## [3,] 0 0 3 0 0 0
mallet.doc.topics(read.topic.model, smoothed=FALSE, normalized=FALSE)[1:3, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 7 6 0 0 0
## [3,] 0 0 3 0 0 0