Comparing tools for obtaining word tokens and types

Dec. 25, 2018

1 . Text files
2 . Working with R packages
- 2.1 . Quanteda
- 2.2 . Tidytext
3 . Results from Natural Language Processing Tools
- 3.1 . spacy
- 3.2 . Stanford CoreNLP
4 . Comparisons
- 4.1 . Tokens
- 4.2 . Types

When analyzing texts in any context, the most basic linguistic characteristics of the corpus (i.e., texts) to describe are word tokens (i.e., the number of words) and types (i.e., the number of distinct words). These numbers, however, will be slightly different depending on which software you use. I wanted to compare different options for obtaining the number of tokens and types.

In this post, I compare the R packages quanteda and tidytext (with textstem), the NLP engines that I have introduced in this blog (spacy and Stanford CoreNLP), and two popular programs, AntConc and Wordsmith (version 7).

1 . Text files

I have randomly chosen five essays from LOCNESS to use as examples.

I prefer using the package readtext to build a raw corpus. It will create a data table with the document ID and text in two separate columns for each file. I named the corpus object txt.

library(readtext)
txt <- readtext("corpus/*.txt")
head(txt)

## readtext object consisting of 5 documents and 0 docvars.
## # Description: df[,2] [5 x 2]
##   doc_id      text               
## * <chr>       <chr>              
## 1 text_01.txt "\"Two men, o\"..."
## 2 text_02.txt "\"I am not a\"..."
## 3 text_03.txt "\"Over the p\"..."
## 4 text_04.txt "\"There is a\"..."
## 5 text_05.txt "\"Boxing is \"..."

2 . Working with R packages

2.1 . Quanteda

The package quanteda, which stands for quantitative analysis of textual data, provides simple functions that compute the number of tokens, ntoken(), and types, ntype(). They are a quick and dirty way to get these counts.

library(quanteda)

ex <- "This is an example. Here is another example sentence, providing a couple short sentences."

ntoken(ex)

## text1 
##    17

ntype(tolower(ex))

## text1 
##    14

ntoken() reports 17 tokens, which means that punctuation marks are counted along with the words. ntype() reports 14 types. I assume that these are: “this”, “is” (2), “an”, “example” (2), “.” (2), “here”, “another”, “sentence”, “,”, “providing”, “a”, “couple”, “short”, “sentences”. In other words, the ntype function only treats identical word forms as one type. Therefore, “a” and “an” count as two different types, and so do “sentence” and “sentences”. This does not agree with a definition of word type that is based on lemmas.
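To make the counting explicit, here is a minimal base R sketch that reproduces the two numbers for the example sentence. The regular expression is my own simplification (a token is a run of letters or digits, or a single punctuation mark); it is not quanteda's tokenizer and will not match it on every input.

```r
# Base R sketch: count tokens and types with punctuation kept as tokens.
# The pattern below is an illustrative assumption, not quanteda's tokenizer.
ex <- "This is an example. Here is another example sentence, providing a couple short sentences."

# A token is either a run of letters/digits or a single punctuation mark.
toks <- regmatches(ex, gregexpr("[[:alnum:]]+|[[:punct:]]", ex))[[1]]

n_tokens <- length(toks)                   # every occurrence counts
n_types  <- length(unique(tolower(toks)))  # distinct lowercased forms

print(n_tokens)  # 17
print(n_types)   # 14

# Forms that occur more than once
freq <- table(tolower(toks))
print(freq[freq > 1])
```

The last line lists the three forms that occur twice (“is”, “example”, and the full stop), which accounts for the gap between 17 tokens and 14 types.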

I already have the corpus txt, but I can feed it into quanteda’s corpus() function to build its own version of a corpus. Calling summary() on it shows the types and tokens in each text. The Types column must be computed differently here than through the ntype() call above, as the numbers differ. I’m not sure what causes the difference; one plausible explanation is letter case, since I lowercased the text before calling ntype(), whereas summary() may count “The” and “the” as separate types.

(quant_n <- summary(corpus(txt)))

## Corpus consisting of 5 documents:
## 
##         Text Types Tokens Sentences
##  text_01.txt   167    316        10
##  text_02.txt   276    517        21
##  text_03.txt   260    685        22
##  text_04.txt   205    390        16
##  text_05.txt   183    372        13
## 
## Source: D:/GitHub/susie-kim.github.io/content/post/* on x86-64 by susie
## Created: Sun Nov 03 12:43:13 2019
## Notes:

The two type counts differ by 5 to 12 per text. I will append the result below as Types_word to the data table for later comparisons.

ntype(tolower(txt$text))

## text1 text2 text3 text4 text5 
##   162   265   248   199   177
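One thing that can make two type counts diverge is letter case. The following base R sketch shows the effect on a made-up sentence; count_types() is a hypothetical helper written for this illustration, not a quanteda function, and I have not verified that case alone explains the gap seen above.

```r
# Base R sketch: how lowercasing changes the number of types.
# `count_types` is a hypothetical helper, not part of quanteda.
count_types <- function(x, lower = TRUE) {
  toks <- regmatches(x, gregexpr("[[:alnum:]]+|[[:punct:]]", x))[[1]]
  if (lower) toks <- tolower(toks)
  length(unique(toks))
}

s <- "The match ended. Then the crowd left the arena."

types_case_sensitive <- count_types(s, lower = FALSE)  # "The" and "the" differ
types_lowercased     <- count_types(s, lower = TRUE)   # "The" and "the" merge

print(c(case_sensitive = types_case_sensitive, lowercased = types_lowercased))
```

Here the case-sensitive count is one higher than the lowercased one because the sentence-initial “The” is kept apart from “the”.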

This is the summary of results from using the quanteda package.

id	Tokens	Types	Types_word
text_01.txt	316	167	162
text_02.txt	517	276	265
text_03.txt	685	260	248
text_04.txt	390	205	199
text_05.txt	372	183	177

2.2 . Tidytext

The package tidytext includes a tokenizing function, unnest_tokens(). It automatically lowercases the text and removes punctuation marks. Therefore, the result will definitely be different from the previous analysis.

I don’t think this package includes any lemmatizing functions, so I turn to the textstem package for this step. lemmatize_words() literally lemmatizes words, meaning that it returns the base form of each word. For instance, the word “men” is listed as “man” in the lemma column below.

library(tidytext); library(textstem); library(dplyr)

tidy_text <- txt %>% 
    unnest_tokens(word, text) %>% 
    mutate(lemma = lemmatize_words(word))

head(tidy_text, 10)

##         doc_id     word    lemma
## 1  text_01.txt      two      two
## 2  text_01.txt      men      man
## 3  text_01.txt      one      one
## 4  text_01.txt     ring     ring
## 5  text_01.txt     only     only
## 6  text_01.txt      one      one
## 7  text_01.txt      can      can
## 8  text_01.txt    leave    leave
## 9  text_01.txt dramatic dramatic
## 10 text_01.txt       it       it

You can see from the code below that Types_word is the number of unique words, and Types_lemma is the number of unique lemmas.

tidy_n <- tidy_text %>% 
    group_by(doc_id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma))) %>% 
    rename(id = doc_id)

id	Tokens	Types_word	Types_lemma
text_01.txt	288	155	140
text_02.txt	487	260	233
text_03.txt	623	245	214
text_04.txt	353	190	175
text_05.txt	342	173	157
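The three columns can be illustrated on the earlier example sentence with a few lines of base R. This is only a sketch: the word pattern is a simplification of what unnest_tokens() does, and toy_lemmas is a hand-made lookup that I made up for this sentence, whereas textstem relies on a much larger dictionary.

```r
# Base R sketch: word tokens, word types, and lemma types without punctuation.
# `toy_lemmas` is a hypothetical three-entry lookup for this example only.
ex <- "This is an example. Here is another example sentence, providing a couple short sentences."

# Lowercase, then keep runs of letters only (punctuation is dropped).
low   <- tolower(ex)
words <- regmatches(low, gregexpr("[[:alpha:]]+", low))[[1]]

# Replace a word by its lemma when the lookup has an entry for it.
toy_lemmas <- c(sentences = "sentence", is = "be", providing = "provide")
lemmas <- ifelse(words %in% names(toy_lemmas), toy_lemmas[words], words)
lemmas <- unname(lemmas)

counts <- c(Tokens      = length(words),
            Types_word  = length(unique(words)),
            Types_lemma = length(unique(lemmas)))
print(counts)
```

Dropping punctuation lowers the token count from 17 to 14, and merging “sentences” into “sentence” makes the lemma count one lower than the word-type count.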

3 . Results from Natural Language Processing Tools

3.1 . spacy

For installation, see my previous post on the topic. I annotate the corpus using the cnlp_annotate() function from the cleanNLP package, which performs tokenization, lemmatization, and part-of-speech tagging.

library(cleanNLP); library(reticulate)

cnlp_init_spacy(model_name = "en_core_web_lg")
cnlp_ann <- cnlp_annotate(txt)
cnlp_tok <- cnlp_get_token(cnlp_ann)

head(cnlp_tok, 10)

## # A tibble: 10 x 8
##    id            sid   tid word  lemma upos  pos     cid
##    <chr>       <int> <int> <chr> <chr> <chr> <chr> <int>
##  1 text_01.txt     1     1 Two   two   NUM   CD        0
##  2 text_01.txt     1     2 men   man   NOUN  NNS       4
##  3 text_01.txt     1     3 ,     ,     PUNCT ,         7
##  4 text_01.txt     1     4 one   one   NUM   CD        9
##  5 text_01.txt     1     5 ring  ring  NOUN  NN       13
##  6 text_01.txt     1     6 ,     ,     PUNCT ,        17
##  7 text_01.txt     1     7 only  only  ADV   RB       19
##  8 text_01.txt     1     8 one   one   PRON  PRP      24
##  9 text_01.txt     1     9 can   can   VERB  MD       28
## 10 text_01.txt     1    10 leave leave VERB  VB       32

Here are the results from these data. I’ve included the number of unique word forms, which should presumably be comparable to what quanteda’s ntype() function counts.

cnlp_n <- cnlp_tok %>% 
    group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

id	Tokens	Types_word	Types_lemma
text_01.txt	317	167	145
text_02.txt	525	277	237
text_03.txt	685	260	221
text_04.txt	390	205	182
text_05.txt	375	185	160

3.2 . Stanford CoreNLP

For this option, I used the Stanford CoreNLP tool, the process of which is illustrated in this post. I have already processed the files and only present the results here. The types of output I have generated are the same as in the previous section.

head(st_tok, 10)

##             id  word lemma CharacterOffsetBegin CharacterOffsetEnd POS
## 1  text_01.txt   Two   two                    0                  3  CD
## 2  text_01.txt   men   man                    4                  7 NNS
## 3  text_01.txt     ,     ,                    7                  8   ,
## 4  text_01.txt   one   one                    9                 12  CD
## 5  text_01.txt  ring  ring                   13                 17  NN
## 6  text_01.txt     ,     ,                   17                 18   ,
## 7  text_01.txt  only  only                   19                 23  RB
## 8  text_01.txt   one   one                   24                 27  CD
## 9  text_01.txt   can   can                   28                 31  MD
## 10 text_01.txt leave leave                   32                 37  VB

snlp_n <- st_tok %>% group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

id	Tokens	Types_word	Types_lemma
text_01.txt	317	168	147
text_02.txt	521	276	243
text_03.txt	680	259	222
text_04.txt	390	207	187
text_05.txt	373	184	161

4 . Comparisons

So far, I have obtained word tokens and types using four different methods. I also ran the texts in AntConc and Wordsmith, which are not described here.

4.1 . Tokens

Let’s look at the number of tokens that the different methods of analysis/software produced.

id	quanteda	tidytext	cnlp	snlp	antconc	wordsmith
text_01.txt	316	288	317	317	289	288
text_02.txt	517	487	525	521	490	487
text_03.txt	685	623	685	680	623	623
text_04.txt	390	353	390	390	351	353
text_05.txt	372	342	375	373	343	341
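A table like this can be assembled by merging the per-tool summaries on the id column. The sketch below does so in base R for two of the tools, reusing the quanteda and tidytext token counts reported earlier; the object names (quanteda_n, tidytext_n, tokens_wide) are illustrative and not the ones used to build the table above.

```r
# Base R sketch: combine per-tool token counts into one wide table.
# The figures are the quanteda and tidytext token counts reported above;
# the object names are hypothetical.
ids <- sprintf("text_%02d.txt", 1:5)
quanteda_n <- data.frame(id = ids, Tokens = c(316, 517, 685, 390, 372))
tidytext_n <- data.frame(id = ids, Tokens = c(288, 487, 623, 353, 342))

results <- list(quanteda = quanteda_n, tidytext = tidytext_n)

# Rename each Tokens column after its tool, then merge everything on `id`.
renamed <- Map(function(d, tool) setNames(d[, c("id", "Tokens")], c("id", tool)),
               results, names(results))
tokens_wide <- Reduce(function(a, b) merge(a, b, by = "id"), renamed)

# Gap between a count that includes punctuation and one that does not.
tokens_wide$diff <- tokens_wide$quanteda - tokens_wide$tidytext
print(tokens_wide)
```

Adding another tool only requires one more named entry in the results list.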

The fact that the numbers are similar among tidytext, AntConc, and Wordsmith suggests that AntConc and Wordsmith do not include punctuation in their word counts, which makes sense. The cnlp and snlp results currently include punctuation, so I will remove punctuation marks, along with symbols and numerals, and recalculate the tokens and types.

cnlp_n <- cnlp_tok %>% filter(!upos %in% c("PUNCT", "SYM", "NUM")) %>% 
    group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

#create a list of POS tags to exclude 
except <- c(",", ".", "``", "''", ":", "#", "$", "-LRB-", "-RRB-", "CD")

snlp_n <- st_tok %>% filter(!POS  %in% except) %>% 
    group_by(id) %>% 
    summarize(Tokens = n(), 
              Types_word = length(unique(word)), 
              Types_lemma = length(unique(lemma)))

id	quanteda	tidytext	cnlp	snlp	antconc	wordsmith
text_01.txt	316	288	286	285	289	288
text_02.txt	517	487	487	483	490	487
text_03.txt	685	623	620	619	623	623
text_04.txt	390	353	349	349	351	353
text_05.txt	372	342	337	336	343	341

Now the results look very similar, except for those from quanteda, which still include punctuation. The remaining differences could result from how contractions, numbers, and non-alphabetic characters are handled. These numbers are also slightly different from the word count in Microsoft Word, but close enough.

4.2 . Types

The number of unique words is interesting because even though I have removed punctuation marks, there are some differences between tidytext and cnlp/snlp:

id	quanteda	tidytext	cnlp	snlp
text_01.txt	162	155	159	158
text_02.txt	265	260	267	265
text_03.txt	248	245	254	254
text_04.txt	199	190	193	193
text_05.txt	177	173	175	174

Next, I compared the number of unique lemmas. Right off the bat, quanteda’s type counts (here, the Types column from summary()) deviate visibly from the others, so I would not trust them. Two patterns emerge: the numbers from AntConc and Wordsmith are very similar to each other, and the numbers from tidytext, cnlp (spacy), and snlp (Stanford CoreNLP) are very close to one another. The two groups differ enough to suggest that conventional software does not compute word types as the number of unique lemmas.

id	quanteda	tidytext	cnlp	snlp	antconc	wordsmith
text_01.txt	167	140	137	137	156	155
text_02.txt	276	233	227	231	261	260
text_03.txt	260	214	215	217	245	245
text_04.txt	205	175	170	173	188	189
text_05.txt	183	157	150	151	174	173

While, conceptually, one might think that the words “sentence” and “sentences” are the same word, types in conventional software include different forms of the same lemma. For instance, “am”, “are”, “was”, and “were” are four types.

The types from AntConc and Wordsmith differ from the numbers of unique words shown above, with the results from the NLP tools being slightly larger. I suspect that how numbers and contractions are treated has something to do with this. The NLP tools split contractions: for example, “shouldn’t” becomes two tokens, “should” and “n’t” (lemmatized as “not”). In AntConc, “shouldn’t” becomes “shouldn” and “t”. Therefore “should” and “shouldn” are two types. In Wordsmith, “shouldn’t” is just one word.
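The three treatments of contractions can be imitated with regular expressions to see how they move the counts. This is a rough base R sketch on a made-up sentence: the patterns only mimic the behaviours described above and are not the actual tokenizers of the NLP tools, AntConc, or Wordsmith. It also assumes a straight apostrophe.

```r
# Base R sketch: three simplified ways of tokenizing a contraction.
# The patterns imitate the behaviours described in the text; they are
# assumptions, not the tools' real tokenizers. Straight apostrophes only.
x <- tolower("You shouldn't go, but you should stay.")

tokenize <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

schemes <- list(
  # contraction kept as one word
  keep_whole = tokenize(x, "[a-z]+(?:'[a-z]+)?"),
  # split at the apostrophe: "shouldn" + "t"
  split_apos = tokenize(x, "[a-z]+"),
  # clitic separated first: "should" + "n't"
  split_nt   = tokenize(gsub("n't", " n't", x, fixed = TRUE), "n't|[a-z]+")
)

counts <- sapply(schemes, function(t) c(Tokens = length(t),
                                        Types  = length(unique(t))))
print(counts)
```

In this toy sentence, splitting at the apostrophe produces the most types because “shouldn” and “t” are both new forms, whereas separating “n’t” lets “should” merge with the other occurrence of “should”.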

Obtaining tokens and types by file has so far been easiest with Wordsmith, but the software must be purchased. AntConc is freely available, but doesn’t report results text by text when there is a batch of files to process. In my quest to examine and compare how tokens and types differ by software, I found that tidytext was the simplest option that still computes word tokens and types reliably.