Comparing tools for obtaining word tokens and types

When analyzing texts in any context, the most basic linguistic characteristics of the corpus (i.e., texts) to describe are word tokens (i.e., the number of words) and types (i.e., the number of distinct words). These numbers, however, will be slightly different depending on which software you use. I wanted to compare different options for obtaining the number of tokens and types.

In this post, I use the R packages quanteda and tidytext + textstem, the NLP engines that I have introduced in this blog (spacy and Stanford CoreNLP), and two popular software programs, AntConc and Wordsmith (version 7), for this comparison.

1 . Text files

I have randomly chosen five essays from LOCNESS to use as examples.

I prefer using the package readtext to build a raw corpus. It creates a data frame with the document ID and the text in two separate columns, one row per file. I named the corpus object txt.

library(readtext)
txt <- readtext("corpus/*.txt")
head(txt)
## readtext object consisting of 5 documents and 0 docvars.
## # data.frame [5 x 2]
##   doc_id      text               
## * <chr>       <chr>              
## 1 text_01.txt "\"Two men, o\"..."
## 2 text_02.txt "\"I am not a\"..."
## 3 text_03.txt "\"Over the p\"..."
## 4 text_04.txt "\"There is a\"..."
## 5 text_05.txt "\"Boxing is \"..."

2 . Working with R packages

2.1 . Quanteda

The package quanteda, which stands for quantitative analysis of textual data, provides simple functions that compute the number of tokens, ntoken(), and the number of types, ntype(). This looks like a quick-and-dirty option.

library(quanteda)

ex <- "This is an example. Here is another example sentence, providing a couple short sentences."

ntoken(ex)
## text1 
##    17
ntype(tolower(ex))
## text1 
##    14

ntoken() says there are 17 tokens, which suggests that each word is counted, including punctuation marks. ntype() says there are 14 types. I assume that these are: “this”, “is” (2), “an”, “example” (2), “.” (2), “here”, “another”, “sentence”, “,”, “providing”, “a”, “couple”, “short”, “sentences”. In other words, ntype() only considers exactly identical word forms as one type, so the pairs “a”/“an” and “sentence”/“sentences” each count as two different types. This is a strictly form-based notion of word type, rather than one based on lemmas.
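To double-check this, the tokens can be inspected directly. This is a quick sketch using quanteda's default tokenizer settings: tokens() builds the token object that ntoken() and ntype() count, and types() lists the distinct forms.

# a quick check of what actually gets counted, with quanteda's default tokenizer
as.character(tokens(ex))       # the 17 tokens, punctuation included
types(tokens(tolower(ex)))     # the 14 distinct word forms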

I already have the corpus txt, but I can feed this into quanteda’s corpus() function to build its version of a corpus. Calling summary(), I can see the types and tokens in each text. The Types column must be computed differently from the ntype() counts I obtain below, as the numbers differ. I’m not sure what causes the difference, though one likely factor is capitalization: summary() appears to count types on the tokens as-is, whereas I lowercase the text before calling ntype().

(quant_n <- summary(corpus(txt)))
## Corpus consisting of 5 documents:
## 
##         Text Types Tokens Sentences
##  text_01.txt   167    316        10
##  text_02.txt   276    517        21
##  text_03.txt   260    685        22
##  text_04.txt   205    390        16
##  text_05.txt   183    372        13
## 
## Source: D:/GitHub/susie-kim.github.io/content/post/* on x86-64 by susie
## Created: Thu Jan 03 18:45:21 2019
## Notes:

The differences range from 5 to 12. I will append the ntype() result as Types_word to the data table for later comparisons.

ntype(tolower(txt$text))
## text1 text2 text3 text4 text5 
##   162   265   248   199   177
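To check whether capitalization accounts for the gap, the case-sensitive and case-folded counts can be compared directly. This is just a sketch of the idea; I am assuming here that summary() does not lowercase the tokens before counting types.

# compare case-sensitive and case-folded type counts (a sketch)
ntype(txt$text)           # types without lowercasing
ntype(tolower(txt$text))  # types after lowercasing, as above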

This is the summary of results from using the quanteda package.

id           Tokens  Types  Types_word
text_01.txt     316    167         162
text_02.txt     517    276         265
text_03.txt     685    260         248
text_04.txt     390    205         199
text_05.txt     372    183         177

2.2 . Tidytext

The package tidytext includes a tokenizing function, unnest_tokens(). It automatically removes punctuation marks. Therefore, the result will definitely be different from the previous analysis.

I don’t think this package includes any lemmatizing functions, so I turn to the textstem package for this step. lemmatize_words() literally lemmatizes words, meaning that it returns the base form of each word. For instance, the word “men” appears as “man” in the lemma column below.

library(tidytext); library(textstem); library(dplyr)

tidy_text <- txt %>% 
    unnest_tokens(word, text) %>% 
    mutate(lemma = lemmatize_words(word))

head(tidy_text, 10)
##         doc_id     word    lemma
## 1  text_01.txt      two      two
## 2  text_01.txt      men      man
## 3  text_01.txt      one      one
## 4  text_01.txt     ring     ring
## 5  text_01.txt     only     only
## 6  text_01.txt      one      one
## 7  text_01.txt      can      can
## 8  text_01.txt    leave    leave
## 9  text_01.txt dramatic dramatic
## 10 text_01.txt       it       it
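As a quick standalone check of what lemmatize_words() does, a few forms can be passed to it directly. This is a sketch assuming textstem's default lemma dictionary; the exact output may vary with the dictionary used.

# a quick check of the lemmatizer with the default dictionary
lemmatize_words(c("men", "sentences", "was", "running"))
# expected to be along the lines of: "man" "sentence" "be" "run"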

You can see from the code below that Types_word is the number of unique words, and Types_lemma is the number of unique lemmas.

tidy_n <- tidy_text %>% 
    group_by(doc_id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma))) %>% 
    rename(id = doc_id)

id           Tokens  Types_word  Types_lemma
text_01.txt     288         155          140
text_02.txt     487         260          233
text_03.txt     623         245          214
text_04.txt     353         190          175
text_05.txt     342         173          157

3 . Results from Natural Language Processing Tools

3.1 . spacy

For installation, see my previous post on the topic. I annotate the corpus using the cnlp_annotate() function, which performs tokenization, lemmatization, and part-of-speech tagging.

library(cleanNLP); library(reticulate)

cnlp_init_spacy()
cnlp_ann <- cnlp_annotate(txt)
cnlp_tok <- cnlp_get_token(cnlp_ann)
head(cnlp_tok, 10)
## # A tibble: 10 x 8
##    id            sid   tid word  lemma upos  pos     cid
##    <chr>       <int> <int> <chr> <chr> <chr> <chr> <int>
##  1 text_01.txt     1     1 Two   two   NUM   CD        0
##  2 text_01.txt     1     2 men   man   NOUN  NNS       4
##  3 text_01.txt     1     3 ,     ,     PUNCT ,         7
##  4 text_01.txt     1     4 one   one   NUM   CD        9
##  5 text_01.txt     1     5 ring  ring  NOUN  NN       13
##  6 text_01.txt     1     6 ,     ,     PUNCT ,        17
##  7 text_01.txt     1     7 only  only  ADV   RB       19
##  8 text_01.txt     1     8 one   one   PRON  PRP      24
##  9 text_01.txt     1     9 can   can   VERB  MD       28
## 10 text_01.txt     1    10 leave leave VERB  VB       32

Here is the result from this data. I’ve included the number of unique words, which should presumably be comparable to quanteda’s ntype() counts.

cnlp_n <- cnlp_tok %>% 
    group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

id           Tokens  Types_word  Types_lemma
text_01.txt     317         167          145
text_02.txt     525         277          236
text_03.txt     685         260          220
text_04.txt     390         205          182
text_05.txt     375         185          161
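Note that the word column from spacy keeps the original capitalization. To make these counts more directly comparable to ntype(tolower(...)), one could lowercase the words before counting; a small sketch:

# lowercase before counting unique words (a sketch)
cnlp_tok %>% 
    group_by(id) %>% 
    summarize(Types_lower = length(unique(tolower(word))))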

3.2 . Stanford CoreNLP

For this option, I utilized the Stanford CoreNLP tool; the process is illustrated in this post. I have already processed the files and only present the results here. The output I have generated is of the same type as in the previous section.

head(st_tok, 10)
##             id  word lemma CharacterOffsetBegin CharacterOffsetEnd POS
## 1  text_01.txt   Two   two                    0                  3  CD
## 2  text_01.txt   men   man                    4                  7 NNS
## 3  text_01.txt     ,     ,                    7                  8   ,
## 4  text_01.txt   one   one                    9                 12  CD
## 5  text_01.txt  ring  ring                   13                 17  NN
## 6  text_01.txt     ,     ,                   17                 18   ,
## 7  text_01.txt  only  only                   19                 23  RB
## 8  text_01.txt   one   one                   24                 27  CD
## 9  text_01.txt   can   can                   28                 31  MD
## 10 text_01.txt leave leave                   32                 37  VB
snlp_n <- st_tok %>% group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

id           Tokens  Types_word  Types_lemma
text_01.txt     317         168          147
text_02.txt     521         276          243
text_03.txt     680         259          222
text_04.txt     390         207          187
text_05.txt     373         184          161

4 . Comparisons

So far, I have obtained word tokens and types using four different methods. I also ran the texts through AntConc and Wordsmith, whose procedures are not described here.
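For reference, the comparison tables in the following subsections could be assembled along these lines. This is only a sketch: it assumes summary() returned a plain data frame with a Text column (as it did in the quanteda version used here), and the AntConc/Wordsmith counts are entered by hand into a hypothetical manual_n table.

# a sketch of assembling the token comparison table
manual_n <- tibble(id        = tidy_n$id,
                   antconc   = c(289, 490, 623, 351, 343),
                   wordsmith = c(288, 487, 623, 353, 341))

token_tab <- quant_n %>% 
    transmute(id = as.character(Text), quanteda = Tokens) %>% 
    left_join(select(tidy_n, id, tidytext = Tokens), by = "id") %>% 
    left_join(select(cnlp_n, id, cnlp = Tokens), by = "id") %>% 
    left_join(select(snlp_n, id, snlp = Tokens), by = "id") %>% 
    left_join(manual_n, by = "id")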

4.1 . Tokens

Let’s look at the number of tokens that the different methods of analysis/software produced.

id           quanteda  tidytext  cnlp  snlp  antconc  wordsmith
text_01.txt       316       288   317   317      289        288
text_02.txt       517       487   525   521      490        487
text_03.txt       685       623   685   680      623        623
text_04.txt       390       353   390   390      351        353
text_05.txt       372       342   375   373      343        341

The fact that the numbers are similar among tidytext, antconc, and wordsmith suggests that AntConc and Wordsmith do not include punctuation in their word counts, which makes sense. The cnlp and snlp counts currently include punctuation, so I will filter it out (along with symbols and numbers, as in the code below) and recalculate the tokens and types.

cnlp_n <- cnlp_tok %>% filter(!upos %in% c("PUNCT", "SYM", "NUM")) %>% 
    group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

#create a list of POS tags to exclude 
except <- c(",", ".", "``", "''", ":", "#", "$", "-LRB-", "-RRB-", "CD")

snlp_n <- st_tok %>% filter(!POS  %in% except) %>% 
    group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

id           quanteda  tidytext  cnlp  snlp  antconc  wordsmith
text_01.txt       316       288   286   285      289        288
text_02.txt       517       487   486   483      490        487
text_03.txt       685       623   620   619      623        623
text_04.txt       390       353   350   349      351        353
text_05.txt       372       342   337   336      343        341

Now the results look very similar, except for those from quanteda. The remaining differences could result from how contractions, numbers, and non-word characters are handled. These numbers are also slightly different from the word counts in Microsoft Word, but close enough.
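To see exactly what the filters removed from each text, the excluded tags can be tabulated; a quick sketch on the spacy output:

# count the tokens removed by the POS filter, per text (a sketch)
cnlp_tok %>% 
    filter(upos %in% c("PUNCT", "SYM", "NUM")) %>% 
    count(id, upos)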

4.2 . Types

The number of unique words is interesting because even though I have removed punctuation marks, there are some differences between tidytext and cnlp/snlp:

id           quanteda  tidytext  cnlp  snlp
text_01.txt       162       155   159   158
text_02.txt       265       260   266   265
text_03.txt       248       245   254   254
text_04.txt       199       190   194   193
text_05.txt       177       173   175   174

Next, I compared the unique lemma types. Right off the bat, quanteda’s count is noticeably different from the others (quanteda does not lemmatize, so its column here is still a form-based type count), so I would not rely on it here. There are two patterns: the numbers obtained from AntConc and Wordsmith are very similar, and the numbers from tidytext, cnlp (spacy), and snlp (Stanford CoreNLP) are very close to one another. The two groups are also different enough to suggest that the number of unique lemmas is not how word types are computed in conventional software.

id           quanteda  tidytext  cnlp  snlp  antconc  wordsmith
text_01.txt       167       140   137   137      156        155
text_02.txt       276       233   225   231      261        260
text_03.txt       260       214   214   217      245        245
text_04.txt       205       175   171   173      188        189
text_05.txt       183       157   151   151      174        173

While, conceptually, one might think that the words “sentence” and “sentences” are the same word, word types treat different forms of the same lemma as distinct. For instance, “am”, “are”, “was”, and “were” are four types that share the single lemma “be”.

The type counts from AntConc and Wordsmith differ from the numbers of unique words shown above, with the results from the NLP tools being slightly larger. I suspect that how numbers and contractions are treated has something to do with this. In the NLP tools, contractions are separated: for example, “shouldn’t” becomes two tokens, “should” and “n’t” (the latter lemmatized as “not”). In AntConc, “shouldn’t” becomes “shouldn” and “t”, so “should” and “shouldn” end up as two different types. In Wordsmith, “shouldn’t” is counted as just one word.
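One way to confirm how each tokenizer handled contractions is to filter the token tables created earlier for apostrophes. This is only a sketch; the exact forms returned depend on each tool's tokenizer and on whether the source texts use straight or curly apostrophes.

# look for apostrophe-containing tokens in each token table (a sketch)
tidy_text %>% filter(grepl("['’]", word)) %>% head()   # tidytext's tokens
cnlp_tok  %>% filter(grepl("['’]", word)) %>% head()   # spacy's tokens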

Obtaining tokens and types by file has so far been the easiest with Wordsmith, but the software requires a purchase. AntConc is freely available, but it doesn’t break the results down text by text when a batch of files is processed. In my quest to examine and compare how the tokens and types differ by software, I found that tidytext was the simplest yet reliable way to compute word tokens and types.
