When analyzing texts in any context, the most basic linguistic characteristics of the corpus (i.e., texts) to describe are word tokens (i.e., the number of words) and types (i.e., the number of distinct words). These numbers, however, will be slightly different depending on which software you use. I wanted to compare different options for obtaining the number of tokens and types.
In this post, I compare the R packages quanteda and tidytext + textstem, the NLP engines that I have introduced in this blog (spaCy and Stanford CoreNLP), and two popular programs, AntConc and Wordsmith (version 7).
I have randomly chosen five essays from LOCNESS to use as examples.
I prefer using the package readtext to build a raw corpus. It will create a data table with the document ID and text in two separate columns for each file. I named the corpus object txt.
library(readtext)
txt <- readtext("corpus/*.txt")
head(txt)
## readtext object consisting of 5 documents and 0 docvars.
## # Description: df[,2] [5 x 2]
## doc_id text
## * <chr> <chr>
## 1 text_01.txt "\"Two men, o\"..."
## 2 text_02.txt "\"I am not a\"..."
## 3 text_03.txt "\"Over the p\"..."
## 4 text_04.txt "\"There is a\"..."
## 5 text_05.txt "\"Boxing is \"..."
The package quanteda, which stands for quantitative analysis of textual data, provides simple functions that compute the number of tokens, ntoken(), and types, ntype(). This is the quick-and-dirty option.
library(quanteda)
ex <- "This is an example. Here is another example sentence, providing a couple short sentences."
ntoken(ex)
## text1
## 17
ntype(tolower(ex))
## text1
## 14
ntoken says there are 17 tokens, which means that every word and every punctuation mark is counted. ntype says there are 14 types. I assume that these are: “this”, “is” (2), “an”, “example” (2), “.” (2), “here”, “another”, “sentence”, “,”, “providing”, “a”, “couple”, “short”, “sentences”. In other words, the ntype function treats only identical strings as one type. Therefore, the pairs “a”/“an” and “sentence”/“sentences” each count as two different types. This does not agree with a definition of word type that groups the inflected forms of a word together.
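To make the counting rule concrete, here is a rough base-R sketch of the same tally. It assumes a simple regular-expression tokenizer (runs of letters or digits are words, each punctuation mark is its own token), which is my approximation and not quanteda's actual tokenizer; the object names are made up for illustration.

```r
ex <- "This is an example. Here is another example sentence, providing a couple short sentences."
lower <- tolower(ex)

# Assumed tokenization rule (not quanteda's own): a run of letters/digits
# is one token, and every punctuation mark is a token of its own.
tokens <- regmatches(lower, gregexpr("[[:alnum:]]+|[[:punct:]]", lower))[[1]]

n_tokens <- length(tokens)          # all running tokens
n_types  <- length(unique(tokens))  # distinct token strings

n_tokens
n_types
```

Under this rule the sketch arrives at the same 17 tokens and 14 types for the example sentence, which supports the reading that identical strings, punctuation included, are what gets counted.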
I already have the corpus txt, but I can feed it into quanteda’s corpus() function to build its own corpus object. Calling summary, I can see the types and tokens in each text. The Types column must be computed differently here than through the ntype() function, as I see different numbers. I’m not sure what causes the difference; one candidate is letter case, since my ntype() call lowercases the text first.
(quant_n <- summary(corpus(txt)))
## Corpus consisting of 5 documents:
##
## Text Types Tokens Sentences
## text_01.txt 167 316 10
## text_02.txt 276 517 21
## text_03.txt 260 685 22
## text_04.txt 205 390 16
## text_05.txt 183 372 13
##
## Source: D:/GitHub/susie-kim.github.io/content/post/* on x86-64 by susie
## Created: Sun Nov 03 12:43:13 2019
## Notes:
The differences between the two type counts range from 5 to 12. I will append the ntype() result as Types_word to the data table for later comparisons.
ntype(tolower(txt$text))
## text1 text2 text3 text4 text5
## 162 265 248 199 177
This is the summary of results from using the quanteda package.
| id | Tokens | Types | Types_word |
|---|---|---|---|
| text_01.txt | 316 | 167 | 162 |
| text_02.txt | 517 | 276 | 265 |
| text_03.txt | 685 | 260 | 248 |
| text_04.txt | 390 | 205 | 199 |
| text_05.txt | 372 | 183 | 177 |
The package tidytext includes a tokenizing function, unnest_tokens(). It automatically removes punctuation marks and lowercases the words, so the result will definitely be different from the previous analysis.
I don’t think this package includes any lemmatizing functions, so I turn to the textstem package for this step. lemmatize_words() literally lemmatizes words, meaning that it returns the base form of each word. For instance, the word “men” is noted as “man” in the lemma column below.
library(tidytext); library(textstem); library(dplyr)
tidy_text <- txt %>%
  unnest_tokens(word, text) %>%
  mutate(lemma = lemmatize_words(word))
head(tidy_text, 10)
## doc_id word lemma
## 1 text_01.txt two two
## 2 text_01.txt men man
## 3 text_01.txt one one
## 4 text_01.txt ring ring
## 5 text_01.txt only only
## 6 text_01.txt one one
## 7 text_01.txt can can
## 8 text_01.txt leave leave
## 9 text_01.txt dramatic dramatic
## 10 text_01.txt it it
You can see from the code below that Types_word is the number of unique words, and Types_lemma is the number of unique lemmas.
tidy_n <- tidy_text %>%
  group_by(doc_id) %>%
  summarize(Tokens = n(),
            Types_word = length(unique(word)),
            Types_lemma = length(unique(lemma))) %>%
  rename(id = doc_id)
| id | Tokens | Types_word | Types_lemma |
|---|---|---|---|
| text_01.txt | 288 | 155 | 140 |
| text_02.txt | 487 | 260 | 233 |
| text_03.txt | 623 | 245 | 214 |
| text_04.txt | 353 | 190 | 175 |
| text_05.txt | 342 | 173 | 157 |
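Why Types_lemma is always smaller than Types_word can be seen in a toy base-R sketch. The word list and the two-entry lemma dictionary below are hypothetical and only for illustration; textstem relies on a much larger dictionary.

```r
words <- c("two", "men", "one", "man", "sentences", "sentence")

# Hypothetical mini lemma dictionary (inflected form -> base form)
lemma_lookup <- c(men = "man", sentences = "sentence")

# Use the dictionary entry when there is one, otherwise keep the word as is
in_dict <- words %in% names(lemma_lookup)
lemmas <- words
lemmas[in_dict] <- unname(lemma_lookup[words[in_dict]])

types_word  <- length(unique(words))   # distinct word forms
types_lemma <- length(unique(lemmas))  # distinct base forms

types_word
types_lemma
```

Because “men” collapses into “man” and “sentences” into “sentence”, the lemma count drops below the word-form count, which is the same pattern as in the table above.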
The next option is spaCy, which I access through the cleanNLP package. For installation, see my previous post on the topic.
I annotate the corpus using the cnlp_annotate function, which will perform tokenization, lemmatization, and tagging for parts-of-speech.
library(cleanNLP); library(reticulate)
cnlp_init_spacy(model_name = "en_core_web_lg")
cnlp_ann <- cnlp_annotate(txt)
cnlp_tok <- cnlp_get_token(cnlp_ann)
head(cnlp_tok, 10)
## # A tibble: 10 x 8
## id sid tid word lemma upos pos cid
## <chr> <int> <int> <chr> <chr> <chr> <chr> <int>
## 1 text_01.txt 1 1 Two two NUM CD 0
## 2 text_01.txt 1 2 men man NOUN NNS 4
## 3 text_01.txt 1 3 , , PUNCT , 7
## 4 text_01.txt 1 4 one one NUM CD 9
## 5 text_01.txt 1 5 ring ring NOUN NN 13
## 6 text_01.txt 1 6 , , PUNCT , 17
## 7 text_01.txt 1 7 only only ADV RB 19
## 8 text_01.txt 1 8 one one PRON PRP 24
## 9 text_01.txt 1 9 can can VERB MD 28
## 10 text_01.txt 1 10 leave leave VERB VB 32
Here is the result from this data. I’ve included the number of unique words, which would presumably be equivalent to quanteda’s ntype function.
cnlp_n <- cnlp_tok %>%
  group_by(id) %>%
  summarize(Tokens = n(),
            Types_word = length(unique(word)),
            Types_lemma = length(unique(lemma)))
| id | Tokens | Types_word | Types_lemma |
|---|---|---|---|
| text_01.txt | 317 | 167 | 145 |
| text_02.txt | 525 | 277 | 237 |
| text_03.txt | 685 | 260 | 221 |
| text_04.txt | 390 | 205 | 182 |
| text_05.txt | 375 | 185 | 160 |
The last NLP option is the Stanford CoreNLP tool, the process for which is illustrated in a separate post. I have already processed the files and only present the results here. The types of output I have generated are the same as for spaCy.
head(st_tok, 10)
## id word lemma CharacterOffsetBegin CharacterOffsetEnd POS
## 1 text_01.txt Two two 0 3 CD
## 2 text_01.txt men man 4 7 NNS
## 3 text_01.txt , , 7 8 ,
## 4 text_01.txt one one 9 12 CD
## 5 text_01.txt ring ring 13 17 NN
## 6 text_01.txt , , 17 18 ,
## 7 text_01.txt only only 19 23 RB
## 8 text_01.txt one one 24 27 CD
## 9 text_01.txt can can 28 31 MD
## 10 text_01.txt leave leave 32 37 VB
snlp_n <- st_tok %>%
  group_by(id) %>%
  summarize(Tokens = n(),
            Types_word = length(unique(word)),
            Types_lemma = length(unique(lemma)))
| id | Tokens | Types_word | Types_lemma |
|---|---|---|---|
| text_01.txt | 317 | 168 | 147 |
| text_02.txt | 521 | 276 | 243 |
| text_03.txt | 680 | 259 | 222 |
| text_04.txt | 390 | 207 | 187 |
| text_05.txt | 373 | 184 | 161 |
So far, I have obtained word tokens and types using four different methods. I also ran the texts in AntConc and Wordsmith, which are not described here.
Let’s look at the number of tokens that the different methods of analysis/software produced.
| id | quanteda | tidytext | cnlp | snlp | antconc | wordsmith |
|---|---|---|---|---|---|---|
| text_01.txt | 316 | 288 | 317 | 317 | 289 | 288 |
| text_02.txt | 517 | 487 | 525 | 521 | 490 | 487 |
| text_03.txt | 685 | 623 | 685 | 680 | 623 | 623 |
| text_04.txt | 390 | 353 | 390 | 390 | 351 | 353 |
| text_05.txt | 372 | 342 | 375 | 373 | 343 | 341 |
The fact that the numbers are similar among tidytext, AntConc, and Wordsmith suggests that AntConc and Wordsmith do not include punctuation in their word counts, which makes sense. The cnlp and snlp counts currently include punctuation, so I will remove it (along with symbols and numerals, as the filters below show) and recalculate the tokens and types.
cnlp_n <- cnlp_tok %>%
  filter(!upos %in% c("PUNCT", "SYM", "NUM")) %>%
  group_by(id) %>%
  summarize(Tokens = n(),
            Types_word = length(unique(word)),
            Types_lemma = length(unique(lemma)))
# create a list of POS tags to exclude
except <- c(",", ".", "``", "''", ":", "#", "$", "-LRB-", "-RRB-", "CD")
snlp_n <- st_tok %>%
  filter(!POS %in% except) %>%
  group_by(id) %>%
  summarize(Tokens = n(),
            Types_word = length(unique(word)),
            Types_lemma = length(unique(lemma)))
| id | quanteda | tidytext | cnlp | snlp | antconc | wordsmith |
|---|---|---|---|---|---|---|
| text_01.txt | 316 | 288 | 286 | 285 | 289 | 288 |
| text_02.txt | 517 | 487 | 487 | 483 | 490 | 487 |
| text_03.txt | 685 | 623 | 620 | 619 | 623 | 623 |
| text_04.txt | 390 | 353 | 349 | 349 | 351 | 353 |
| text_05.txt | 372 | 342 | 337 | 336 | 343 | 341 |
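The filter-then-count step can also be sketched in base R on a toy token table. The data frame below is invented for illustration (it is not taken from the essays), and lowercasing before counting types is my own choice here, not something the dplyr code above does.

```r
# Hypothetical annotated tokens for two tiny documents
tok <- data.frame(
  id   = c("a.txt", "a.txt", "a.txt", "a.txt", "a.txt", "a.txt",
           "b.txt", "b.txt", "b.txt"),
  word = c("Two", "men", ",", "two", "men", "left",
           "It", "ended", "."),
  upos = c("NUM", "NOUN", "PUNCT", "NUM", "NOUN", "VERB",
           "PRON", "VERB", "PUNCT"),
  stringsAsFactors = FALSE
)

# Drop punctuation, symbols, and numerals before counting
keep <- tok[!tok$upos %in% c("PUNCT", "SYM", "NUM"), ]

# Tokens: remaining rows per document
tokens_by_doc <- table(keep$id)

# Types: distinct lowercased word forms per document
types_by_doc <- tapply(tolower(keep$word), keep$id,
                       function(w) length(unique(w)))

tokens_by_doc
types_by_doc
```

Which tags go on the exclusion list is a judgment call; dropping numerals, as done here and above, is one reason counts can end up a little below those of tools that keep them.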
Now the results look very similar, except for those from quanteda, which still count punctuation marks. The remaining differences could result from how contractions, numbers, and non-alphabetic characters are handled. These numbers are also slightly different from the word count in Microsoft Word, but close enough.
The number of unique words is interesting because even though I have removed punctuation marks, there are some differences between tidytext and cnlp/snlp:
| id | quanteda | tidytext | cnlp | snlp |
|---|---|---|---|---|
| text_01.txt | 162 | 155 | 159 | 158 |
| text_02.txt | 265 | 260 | 267 | 265 |
| text_03.txt | 248 | 245 | 254 | 254 |
| text_04.txt | 199 | 190 | 193 | 193 |
| text_05.txt | 177 | 173 | 175 | 174 |
Next, I compared the unique lemma types. Right off the bat, quanteda’s type count deviates visibly from the others, so I would not trust it. There are two patterns: the numbers obtained from AntConc and Wordsmith are very similar, and the numbers from tidytext, cnlp (spaCy), and snlp (Stanford CoreNLP) are very close to one another. The numbers of these two groups are also different enough to assume that the number of unique lemmas is not how word type is computed in conventional software.
| id | quanteda | tidytext | cnlp | snlp | antconc | wordsmith |
|---|---|---|---|---|---|---|
| text_01.txt | 167 | 140 | 137 | 137 | 156 | 155 |
| text_02.txt | 276 | 233 | 227 | 231 | 261 | 260 |
| text_03.txt | 260 | 214 | 215 | 217 | 245 | 245 |
| text_04.txt | 205 | 175 | 170 | 173 | 188 | 189 |
| text_05.txt | 183 | 157 | 150 | 151 | 174 | 173 |
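While, conceptually, one might think that the words “sentence” and “sentences” are the same word, types in conventional software are counted over word forms, so different forms of the same lemma count separately. For instance, “am”, “are”, “was”, and “were” are four types even though they share the lemma “be”.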
Types from AntConc and Wordsmith differ from the number of unique words shown above, with the results from the NLP engines being slightly larger. I suspect that how numbers and contractions are treated has something to do with this. In the NLP engines, contractions are split: for example, “shouldn’t” becomes two tokens, “should” and “n’t” (lemmatized as “not”). In AntConc, “shouldn’t” becomes “shouldn” and “t”. Therefore “should” and “shouldn” are two types. In Wordsmith, “shouldn’t” is just one word.
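The three treatments of contractions can be imitated with regular expressions in base R. This is only a sketch of the behaviours described above, not the actual tokenizers of spaCy, CoreNLP, AntConc, or Wordsmith; the example sentence and object names are made up, and the code uses a straight apostrophe.

```r
x <- tolower("You shouldn't go, but you should stay.")

# Letters only: the apostrophe splits the word ("shouldn" + "t")
letters_only <- regmatches(x, gregexpr("[[:alpha:]]+", x))[[1]]

# Apostrophe kept inside the word: "shouldn't" stays one token
with_apostrophe <- regmatches(x, gregexpr("[[:alpha:]']+", x))[[1]]

# Clitic split off first: "should" + "n't"
clitic_split <- gsub("n't", " n't", x, fixed = TRUE)
nlp_style <- regmatches(clitic_split,
                        gregexpr("[[:alpha:]']+", clitic_split))[[1]]

# Tokens and types under each rule
sapply(list(letters_only = letters_only,
            with_apostrophe = with_apostrophe,
            nlp_style = nlp_style),
       function(tk) c(tokens = length(tk), types = length(unique(tk))))
```

Even in one short sentence the three rules disagree on both tokens and types, so small gaps between tools on whole essays are to be expected.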
Obtaining tokens and types by file has so far been easiest with Wordsmith, but the software must be purchased. AntConc is freely available, but it doesn’t report results text by text when there is a batch of files to process. In my quest to examine and compare how the tokens and types differ by software, I found tidytext to be the simplest yet still reliable way to compute word tokens and types.