Introduction to tipitaka.critical

The tipitaka.critical package provides a lemmatized critical edition of the complete Pali Canon (Tipitaka), the canonical scripture of Theravada Buddhism. The text is based on a five-witness collation and lemmatized using the Digital Pali Dictionary.

The texts dataset

The package ships a single dataset, texts, containing 5,777 text units spanning all three pitakas:

library(tipitaka.critical)
#> Loading required package: Matrix

dim(texts)
#> [1] 5777    6
names(texts)
#> [1] "id"              "collection"      "pitaka"          "title"          
#> [5] "text"            "text_lemmatized"

Each row is a text unit (a sutta, a chapter, or a standalone text) with both the surface-form Pali text and a lemmatized version where every word is replaced by its dictionary headword:

# The Brahmajala Sutta (DN 1)
dn1 <- texts[texts$id == "dn1", ]
dn1$title
#> [1] "Brahmajālasutta"

# First 120 characters of surface text
cat(substr(dn1$text, 1, 120), "...\n")
#> dīgha nikāya brahmajālasutta paribbājakakathā evaṃ me sutaṃ ekaṃ samayaṃ bhagavā antarā ca rājagahaṃ antarā ca nāḷandaṃ  ...

# Same passage, lemmatized
cat(substr(dn1$text_lemmatized, 1, 120), "...\n")
#> dīgha nika brahmajālasutta paribbājakakathā evaṃ ahaṃ suta eka samaya bhagavant antara ca rājagaha antara ca nāḷandā add ...

The text units are distributed across the three pitakas and seven collections as follows:

table(texts$pitaka)
#> 
#> abhidhamma      sutta     vinaya 
#>          8       5764          5
table(texts$collection)
#> 
#> abhidhamma         an         dn         kn         mn         sn     vinaya 
#>          8       1408         34       2351        152       1819          5

Lemma frequencies

The lemmas dataset is a frequency table computed from the lemmatized text. It is not shipped with the package but computed automatically on first access (about 5 seconds):

dim(lemmas)
#> [1] 642064      7
head(lemmas)
#>           word  n total         freq  id collection pitaka
#> 1   abbhantara  4  7808 0.0005122951 dn1         dn  sutta
#> 2 abbhujjalana  1  7808 0.0001280738 dn1         dn  sutta
#> 3      abbhuta  3  7808 0.0003842213 dn1         dn  sutta
#> 4      abhibhū  3  7808 0.0003842213 dn1         dn  sutta
#> 5 abhinandunti  1  7808 0.0001280738 dn1         dn  sutta
#> 6   abhivadati 22  7808 0.0028176230 dn1         dn  sutta

Each row gives the count and frequency of one lemma in one text unit. This makes it easy to find the most common words across the entire canon:

totals <- tapply(lemmas$n, lemmas$word, sum)
head(sort(totals, decreasing = TRUE), 15)
#>        ta        ti        ca    dhamma        na        pa        va       kho 
#>     87797     64523     60970     56529     49894     41909     34493     32672 
#>      hoti        ya      ahaṃ bhikkhave uppajjati   paccaya       eka 
#>     31890     31412     29822     26517     20562     18207     16611

The most frequent lemmas are function words: the pronoun ta (that/it), the quotative marker ti, and the particles ca (and) and na (not). The first content word is dhamma (teaching, truth, phenomenon), the central concept of the canon. Further down, bhikkhave (O monks, vocative) appears at rank 12, and bhikkhu (monk) also falls within the top 20 (beyond the 15 shown here), reflecting that the primary audience for these teachings was the monastic community.

The same aggregation can be restricted to a single collection, here the Digha Nikaya:

dn_lemmas <- lemmas[lemmas$collection == "dn", ]
dn_totals <- tapply(dn_lemmas$n, dn_lemmas$word, sum)
head(sort(dn_totals, decreasing = TRUE), 10)
#>     ta     ti    kho     ca     va   hoti   ahaṃ     na     ya dhamma 
#>   5027   3440   3291   2341   2158   1897   1685   1547   1467   1232
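
To make the structure of the lemmas table concrete, the following is a minimal sketch of how per-unit counts (n), unit lengths (total), and relative frequencies (freq) could be derived from lemmatized text. The two text units and the helper count_lemmas() are hypothetical and are not the package's actual implementation.

```r
# Toy lemmatized corpus: two hypothetical text units (not real canon data).
toy <- data.frame(
  id = c("t1", "t2"),
  text_lemmatized = c("eka dhamma ca eka hoti", "dhamma na hoti"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count lemmas in one text unit and derive
# the within-document relative frequency of each lemma.
count_lemmas <- function(id, text) {
  tokens <- strsplit(text, "\\s+")[[1]]
  counts <- table(tokens)
  data.frame(
    word = names(counts),
    n = as.integer(counts),
    total = length(tokens),
    freq = as.integer(counts) / length(tokens),
    id = id,
    stringsAsFactors = FALSE
  )
}

# One row per (lemma, text unit) pair, as in the lemmas table.
toy_lemmas <- do.call(rbind, Map(count_lemmas, toy$id, toy$text_lemmatized))
rownames(toy_lemmas) <- NULL

# Corpus-wide totals, using the same tapply() idiom as above.
toy_totals <- tapply(toy_lemmas$n, toy_lemmas$word, sum)
```

Because freq is normalized by the length of each unit, the freq values within one unit sum to 1, while n keeps the raw counts needed for corpus-wide totals.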

Searching for a lemma

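The search_lemma() function finds all text units containing a given lemma, sorted by relative frequency (the freq column) in descending order. Short texts can therefore rank first even with a single occurrence: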

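# Where is "nibbana" most frequent relative to text length?
# Note: only five rows come back (head() would show six), so this unaccented
# spelling matches just five units. Headwords elsewhere in the data carry
# diacritics, so "nibbāna" is presumably the form to query for the standard lemma.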
nibbana <- search_lemma("nibbana")
head(nibbana[, c("id", "collection", "n", "freq")])
#>                id collection n        freq
#> 341398      ja272         kn 1 0.026315789
#> 325887 dhp273-289         kn 1 0.004629630
#> 304547       cnd4         kn 1 0.004444444
#> 490124    snp5.19         kn 1 0.002314815
#> 315956      cnd22         kn 7 0.001462294

# "dhamma" across collections
dhamma <- search_lemma("dhamma")
tapply(dhamma$n, dhamma$collection, sum)
#> abhidhamma         an         dn         kn         mn         sn     vinaya 
#>      40160       4735       1232       3925       2319       2215       1943
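
The ranking behaviour seen above, where a unit with a single occurrence tops the list, follows from sorting on freq rather than n. The sketch below reproduces that logic on toy data. search_lemma_sketch() is a hypothetical stand-in written for illustration; it is an assumption about the behaviour, not the package's own code.

```r
# Hypothetical stand-in for the lemmas table (toy values, not real counts).
toy_lemmas <- data.frame(
  word  = c("dhamma", "dhamma", "dhamma", "eka"),
  n     = c(3L, 1L, 10L, 2L),
  total = c(100L, 10L, 1000L, 50L),
  id    = c("a1", "a2", "a3", "a1"),
  stringsAsFactors = FALSE
)
toy_lemmas$freq <- toy_lemmas$n / toy_lemmas$total

# Hypothetical helper: keep rows for one lemma, highest relative frequency first.
search_lemma_sketch <- function(lemma, data) {
  hits <- data[data$word == lemma, , drop = FALSE]
  hits[order(hits$freq, decreasing = TRUE), , drop = FALSE]
}

res <- search_lemma_sketch("dhamma", toy_lemmas)
res[, c("id", "n", "freq")]
```

Here unit a2 ranks first with one occurrence in ten tokens, ahead of a3 with ten occurrences in a thousand. Sorting on n instead would answer a different question: where the lemma occurs most often in absolute terms.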

Document-term matrix

The dtm dataset is a sparse matrix (from the Matrix package) with text units as rows and lemmas as columns. Values are within-document frequencies. Like lemmas, it is computed on first access:

dim(dtm)
#> [1]   5777 102577
class(dtm)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

# Sparsity (proportion of zero entries)
1 - length(dtm@x) / prod(dim(dtm))
#> [1] 0.9989165
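
The relationship between the long lemmas table and the sparse matrix can be sketched on toy data. This is a minimal illustration using Matrix::sparseMatrix(), assuming one stored entry per (text unit, lemma) row. The toy values are hypothetical, and the package may build dtm differently.

```r
library(Matrix)

# Toy long-format table: one row per (text unit, lemma) pair.
toy_lemmas <- data.frame(
  id   = c("t1", "t1", "t2", "t3"),
  word = c("ca", "dhamma", "dhamma", "na"),
  freq = c(0.5, 0.5, 1.0, 1.0),
  stringsAsFactors = FALSE
)

ids   <- unique(toy_lemmas$id)
words <- sort(unique(toy_lemmas$word))

# Rows are text units, columns are lemmas, values are within-document frequencies.
toy_dtm <- sparseMatrix(
  i = match(toy_lemmas$id, ids),
  j = match(toy_lemmas$word, words),
  x = toy_lemmas$freq,
  dims = c(length(ids), length(words)),
  dimnames = list(ids, words)
)

# Same sparsity formula as above: share of cells with no stored value.
sparsity <- 1 - length(toy_dtm@x) / prod(dim(toy_dtm))
```

The figures reported above are consistent with this layout: 642,064 rows in lemmas divided by 5,777 × 102,577 cells gives a density of about 0.0010835, which matches the sparsity of 0.9989165.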

Visualizing the Canon

The DTM enables standard text-analysis workflows. We can start with a simple example: hierarchical clustering of the 34 Digha Nikaya suttas.

dn_ids <- texts$id[texts$collection == "dn"]
dn_dtm <- dtm[dn_ids, ]

# Drop empty columns
dn_dtm <- dn_dtm[, colSums(dn_dtm) > 0]

d <- dist(as.matrix(dn_dtm))
hc <- hclust(d, method = "ward.D2")
plot(hc, main = "Digha Nikaya — Hierarchical Clustering",
     xlab = "", sub = "", cex = 0.7)
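
A dendrogram is read by eye, but the same tree can also be cut into a fixed number of groups with cutree(). The sketch below shows this on a small hypothetical frequency matrix. The row names and values are invented for illustration and do not come from the canon.

```r
# Toy frequency matrix: four hypothetical texts over four lemmas.
# a1/a2 share one vocabulary profile, b1/b2 another.
toy <- rbind(
  a1 = c(0.50, 0.40, 0.10, 0.00),
  a2 = c(0.45, 0.45, 0.10, 0.00),
  b1 = c(0.00, 0.10, 0.50, 0.40),
  b2 = c(0.05, 0.05, 0.45, 0.45)
)
colnames(toy) <- c("w1", "w2", "w3", "w4")

# Same pipeline as above: Euclidean distances, then Ward clustering.
hc_toy <- hclust(dist(toy), method = "ward.D2")

# Cut the tree into two clusters; returns a named integer vector.
clusters <- cutree(hc_toy, k = 2)
split(names(clusters), clusters)
```

Applied to the Digha Nikaya tree, cutree(hc, k = ...) would give explicit cluster memberships that can be tabulated or compared against traditional groupings. The choice of k is left to the analyst.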

PCA of the entire Canon

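To see how all 5,777 text units relate to each other, we can project the DTM into two dimensions using principal component analysis. To keep the computation fast, we keep only the 500 lemmas with the largest summed within-document frequency. Because dtm stores relative frequencies rather than counts, this ranking can differ slightly from one based on raw totals:

# Select top 500 lemmas by summed within-document frequency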
col_sums <- colSums(dtm)
top_terms <- names(sort(col_sums, decreasing = TRUE))[1:500]
dtm_sub <- as.matrix(dtm[, top_terms])

# PCA
pca <- prcomp(dtm_sub, center = TRUE, scale. = FALSE)
pct_var <- summary(pca)$importance[2, 1:2] * 100

# Color by collection
coll_colors <- c(
  abhidhamma = "#E41A1C", an = "#377EB8", dn = "#4DAF4A",
  kn = "#FF7F00", mn = "#984EA3", sn = "#A65628",
  vinaya = "#F781BF"
)
pt_col <- coll_colors[texts$collection]

plot(pca$x[, 1], pca$x[, 2],
     col = adjustcolor(pt_col, alpha.f = 0.5), pch = 16, cex = 0.6,
     xlab = paste0("PC1 (", round(pct_var[1], 1), "%)"),
     ylab = paste0("PC2 (", round(pct_var[2], 1), "%)"),
     main = "PCA of All Tipitaka Texts")
legend("topright",
       c("Abhidhamma", "AN", "DN", "KN", "MN", "SN", "Vinaya"),
       col = coll_colors, pch = 16, cex = 0.8)

The Abhidhamma texts cluster distinctly from the Sutta Pitaka, reflecting their specialized technical vocabulary. Within the Sutta Pitaka, the five nikayas overlap substantially but show characteristic tendencies.
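
The axis percentages in the plot come from the proportion of variance explained by each component. As a minimal sketch on random toy data (not the canon), the same figures can be computed directly from the standard deviations that prcomp() returns. summary() reports them rounded to five decimal places.

```r
set.seed(42)

# Toy data: 20 observations, 3 variables, with columns 1 and 2 correlated.
toy <- matrix(runif(60), nrow = 20, ncol = 3)
toy[, 2] <- toy[, 1] + 0.1 * toy[, 2]

pca_toy <- prcomp(toy, center = TRUE, scale. = FALSE)

# Proportion of variance per component, computed by hand ...
var_share <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)

# ... and as reported (rounded) in the summary's "Proportion of Variance" row.
from_summary <- summary(pca_toy)$importance[2, ]
```

With scale. = FALSE, as in the analysis above, lemmas with high and variable frequencies dominate the leading components. Setting scale. = TRUE would instead weight every lemma equally, which is a different analytical choice rather than a correction.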

Canon-wide hierarchical clustering

For a dendrogram of the whole canon, we aggregate texts to an intermediate level: individual suttas for DN and MN, samyuttas for SN, nipatas for AN, and individual texts for KN, Vinaya, and Abhidhamma.

# Create group IDs at an intermediate level
group_id <- texts$id

# SN: sn1.1 -> sn1 (by samyutta)
sn_mask <- texts$collection == "sn"
group_id[sn_mask] <- sub("\\..*", "", group_id[sn_mask])

# AN: an1.1 -> an1 (by nipata)
an_mask <- texts$collection == "an"
group_id[an_mask] <- sub("\\..*", "", group_id[an_mask])

# KN: dhp1-20 -> dhp, snp1.1 -> snp, etc. (by text)
kn_mask <- texts$collection == "kn"
group_id[kn_mask] <- sub("[0-9].*", "", group_id[kn_mask])

# Aggregate DTM by group (mean of member frequencies)
groups <- unique(group_id)
group_dtm <- matrix(0, length(groups), length(top_terms))
group_coll <- character(length(groups))
for (i in seq_along(groups)) {
  rows <- which(group_id == groups[i])
  if (length(rows) == 1) {
    group_dtm[i, ] <- dtm_sub[rows, ]
  } else {
    group_dtm[i, ] <- colMeans(dtm_sub[rows, ])
  }
  group_coll[i] <- texts$collection[rows[1]]
}
rownames(group_dtm) <- groups

# Cluster
d <- dist(group_dtm)
hc <- hclust(d, method = "ward.D2")

# Color labels by collection
label_col <- coll_colors[group_coll[hc$order]]
dend <- as.dendrogram(hc)
# Apply colors to leaf labels
color_labels <- function(n, col_vec) {
  if (is.leaf(n)) {
    i <- match(attr(n, "label"), groups[hc$order])
    attr(n, "nodePar") <- list(pch = NA, lab.col = col_vec[i], lab.cex = 0.45)
  }
  n
}
dend <- dendrapply(dend, color_labels, col_vec = label_col)

oldpar <- par(mar = c(2, 1, 2, 8))
plot(dend, horiz = TRUE, main = "Tipitaka — Hierarchical Clustering",
     xlab = "")
legend("topleft",
       c("Abhidhamma", "AN", "DN", "KN", "MN", "SN", "Vinaya"),
       text.col = coll_colors, cex = 0.7, bty = "n")

par(oldpar)

The dendrogram reveals how texts cluster by vocabulary: Abhidhamma and Vinaya texts form their own branches, while within the Sutta Pitaka, texts with similar subject matter cluster together regardless of which nikaya they belong to.
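
The grouping and averaging steps above can be checked on a handful of IDs. The sketch below applies the same sub() patterns to hypothetical IDs and then computes group means with rowsum(), a vectorized alternative to the explicit loop. The toy matrix is invented for illustration.

```r
# Hypothetical IDs and collections in the style of the texts dataset.
ids  <- c("sn1.1", "sn1.2", "an3.10", "dhp1-20", "snp1.1", "dn1")
coll <- c("sn", "sn", "an", "kn", "kn", "dn")

# Same patterns as above: strip from the first dot (SN, AN)
# or from the first digit (KN); DN keeps the full ID.
group_id <- ids
dotted <- coll %in% c("sn", "an")
group_id[dotted] <- sub("\\..*", "", group_id[dotted])
group_id[coll == "kn"] <- sub("[0-9].*", "", group_id[coll == "kn"])

# Toy dense matrix standing in for dtm_sub (6 texts, 2 lemmas).
toy <- matrix(seq_len(12), nrow = 6, dimnames = list(ids, c("w1", "w2")))

# Sum rows per group in order of first appearance, then divide by group size.
sums  <- rowsum(toy, group_id, reorder = FALSE)
sizes <- tabulate(match(group_id, rownames(sums)))
group_means <- sums / sizes
```

Dividing the matrix by sizes recycles the vector down the rows, so each group's sums are divided by that group's member count. Singleton groups such as dn1 pass through unchanged, matching the length(rows) == 1 branch of the loop.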