Forget my previous posts on running the Stanford NLP engine from the command line and retrieving information from XML files in R. It turns out everything can be done inside RStudio (and at the very least I learned more about working with XML in R). This post replaces those two earlier ones and adds more example analyses.
Download and install the Java Development Kit if you don't already have it on your computer. There is nothing in particular to look out for during installation.
The packages we need in R are rJava and cleanNLP. Install the development version of cleanNLP, as the (old) CRAN version won't work properly.
install.packages("rJava")
devtools::install_github("statsmaths/cleanNLP")
library(cleanNLP)
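Before going further, it doesn't hurt to confirm that rJava can actually start a JVM and see your JDK. A quick optional check, using rJava's standard low-level interface:

# Start the JVM; an error here points to a Java/JDK installation problem
library(rJava)
.jinit()
# Ask Java for its version string
.jcall("java/lang/System", "S", "getProperty", "java.version")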
library(dplyr)

After loading the package, you can pass an argument to download different language models. The default is set to English, so I'm not going to pass anything to the function.
cnlp_download_corenlp()

This will take some time.
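For other languages, cnlp_download_corenlp() takes an argument selecting the model; if I remember this version of cleanNLP correctly the argument is called type, but treat the call below as an assumption and check ?cnlp_download_corenlp for the exact interface:

# Hypothetical: download the German model instead of the English default.
# The argument name (type) and value ("de") are assumptions; see
# ?cnlp_download_corenlp for the actual options.
cnlp_download_corenlp(type = "de")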
Now you can initialize the engine to parse your text. The more annotation features you want to utilize, the higher the anno_level needs to be. I usually just go for anno_level = 0 since I only need tokenization, lemmatization, and part-of-speech tagging. Loading the higher-level functions takes longer and can slow down your computer.
cnlp_init_corenlp(anno_level = 0)

I'll process the same five texts that I've been using on this blog: five random essays from the LOCNESS corpus. The function below can directly read text files from a directory and annotate them.
anno_text <- cnlp_annotate("corpus/*.txt", as_strings = FALSE)

However, I like building the corpus as its own object so I can keep using it for various analyses.
#Build the corpus
txt_cor <- readtext::readtext("corpus/*.txt")
#Save annotations as a table
txt_ann <- cnlp_annotate(txt_cor)
txt_tab <- cnlp_get_token(txt_ann)
#Check the first 15 words
head(txt_tab, 15)

## # A tibble: 15 x 8
## id sid tid word lemma upos pos cid
## <chr> <int> <int> <chr> <chr> <chr> <chr> <int>
## 1 text_01.txt 1 1 Two two NUM CD 0
## 2 text_01.txt 1 2 men man NOUN NNS 4
## 3 text_01.txt 1 3 , , . , 7
## 4 text_01.txt 1 4 one one NUM CD 9
## 5 text_01.txt 1 5 ring ring NOUN NN 13
## 6 text_01.txt 1 6 , , . , 17
## 7 text_01.txt 1 7 only only ADV RB 19
## 8 text_01.txt 1 8 one one NUM CD 24
## 9 text_01.txt 1 9 can can VERB MD 28
## 10 text_01.txt 1 10 leave leave VERB VB 32
## 11 text_01.txt 1 11 . . . . 37
## 12 text_01.txt 2 1 Dramatic dramatic ADJ JJ 39
## 13 text_01.txt 2 2 it it PRON PRP 48
## 14 text_01.txt 2 3 may may VERB MD 51
## 15 text_01.txt 2 4 be be VERB VB 55

It appears that everything worked well. Next, I'll do some text analysis.
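Even before bringing in any other packages, the token table supports quick summaries with dplyr; for instance:

# How many tokens does each essay contain?
txt_tab %>% count(id)

# Which universal POS tags are most frequent across the corpus?
txt_tab %>% count(upos, sort = TRUE)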
tidytext is a convenient package for performing this kind of text analysis. Luckily, free resources such as the tidytext book are available to serve as a structured, useful guide.
library(tidytext)

This package includes some functions that are easy to use. We used coreNLP to POS-tag the text, but if we didn't need that, we could have just tokenized with the unnest_tokens() function, as I did in a previous post.
unnest_tokens() first takes the data frame (txt_cor). The default setting breaks the text into words (i.e., tokenizes it) and creates a new data frame. We need to provide the name of the column in this new data frame (output; I named it word) and the column that contains the text data (input, which is text).
tidy_tok <- txt_cor %>% unnest_tokens(word, text)

Analyzing n-grams is done with the same function; we just provide different arguments to generate a table of n-grams. The first line below takes the corpus and creates a new data frame (tidy_bi) with a column bigram that contains the bigrams. token = "ngrams" and n = 2 extract two-word sequences. The second line creates a list of trigrams.
tidy_bi <- txt_cor %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_tri <- txt_cor %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)

| bigram | trigram |
|---|---|
| two men | two men one |
| men one | men one ring |
| one ring | one ring only |
| ring only | ring only one |
| only one | only one can |
| one can | one can leave |
| can leave | can leave dramatic |
| leave dramatic | leave dramatic it |
| dramatic it | dramatic it may |
| it may | it may be |
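Counting the most frequent n-grams from here is a one-liner; for example:

# The most frequent bigrams across the five essays
tidy_bi %>% count(bigram, sort = TRUE)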
Coming from a linguistics perspective, I find it potentially problematic that the bigrams include word chunks that are not meaningful, especially for qualitative text analysis. What I mean is that, for example, the last word of sentence #1 and the first word of sentence #2, "leave dramatic", are treated as a bigram. The same applies to words within a sentence that are separated by commas or other punctuation. Consider the first couple of sentences from our corpus:
“Two men, one ring, only one can leave. Dramatic it may be but…”
"men one" is not meaningful, and neither is "ring only". Punctuation serves specific purposes in writing, and ignoring it can fail to deliver meaningful results; crossing such boundaries can also lead to misleading results. The meaningful, uninterrupted n-grams are called "CollGrams" by some researchers (Bestgen & Granger, 2014)1.
For this reason, I'll go back to the annotated data we created earlier. To inspect sequences of words, we can use the lead() function from the dplyr package to create new columns containing information about the rows that follow each word.
txt_df <- txt_tab %>%
mutate(second_word = lead(word), second_upos = lead(upos), second_pos = lead(pos),
third_word = lead(word, 2), third_upos = lead(upos, 2), third_pos = lead(pos, 2))

| id | sid | tid | word | lemma | upos | pos | second_word | second_upos | second_pos | third_word | third_upos | third_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text_01.txt | 1 | 1 | Two | two | NUM | CD | men | NOUN | NNS | , | . | , |
| text_01.txt | 1 | 2 | men | man | NOUN | NNS | , | . | , | one | NUM | CD |
| text_01.txt | 1 | 3 | , | , | . | , | one | NUM | CD | ring | NOUN | NN |
| text_01.txt | 1 | 4 | one | one | NUM | CD | ring | NOUN | NN | , | . | , |
| text_01.txt | 1 | 5 | ring | ring | NOUN | NN | , | . | , | only | ADV | RB |
| text_01.txt | 1 | 6 | , | , | . | , | only | ADV | RB | one | NUM | CD |
| text_01.txt | 1 | 7 | only | only | ADV | RB | one | NUM | CD | can | VERB | MD |
| text_01.txt | 1 | 8 | one | one | NUM | CD | can | VERB | MD | leave | VERB | VB |
| text_01.txt | 1 | 9 | can | can | VERB | MD | leave | VERB | VB | . | . | . |
| text_01.txt | 1 | 10 | leave | leave | VERB | VB | . | . | . | Dramatic | ADJ | JJ |
The newly created columns second_word, second_upos, and second_pos contain information about the next word, and third_word, third_upos, and third_pos contain the same information for the word after that.
To clean this data, we'll execute the following code. The unite() function from the tidyr package concatenates word and second_word to show the bigram. Although where punctuation occurs can be of interest in itself (e.g., marking clauses, inserting phrases, etc.), in this post I'll filter out the n-grams that include any punctuation marks, so that we only consider two- or three-word sequences that co-occur without interruption.
library(tidyr)
txt_bi <- txt_df %>% unite(bigram, word, second_word, sep = " ") %>%
filter(!second_upos == ".", !upos == ".") %>% select(1, 4:9)

| id | bigram | lemma | upos | pos | second_upos | second_pos |
|---|---|---|---|---|---|---|
| text_01.txt | Two men | two | NUM | CD | NOUN | NNS |
| text_01.txt | one ring | one | NUM | CD | NOUN | NN |
| text_01.txt | only one | only | ADV | RB | NUM | CD |
| text_01.txt | one can | one | NUM | CD | VERB | MD |
| text_01.txt | can leave | can | VERB | MD | VERB | VB |
| text_01.txt | Dramatic it | dramatic | ADJ | JJ | PRON | PRP |
| text_01.txt | it may | it | PRON | PRP | VERB | MD |
| text_01.txt | may be | may | VERB | MD | VERB | VB |
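The bigram table also makes it easy to ask which part-of-speech patterns are most common; for example:

# The most frequent POS patterns among the uninterrupted bigrams
txt_bi %>% count(upos, second_upos, sort = TRUE)

The same unite() and filter() approach produces the trigram table: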
txt_tri <- txt_df %>% unite(trigram, word, second_word, third_word, sep = " ") %>%
filter(!third_upos == ".", !second_upos == ".", !upos == ".") %>% select(1, 4:11)

| id | trigram | lemma | upos | pos | second_upos | second_pos | third_upos | third_pos |
|---|---|---|---|---|---|---|---|---|
| text_01.txt | only one can | only | ADV | RB | NUM | CD | VERB | MD |
| text_01.txt | one can leave | one | NUM | CD | VERB | MD | VERB | VB |
| text_01.txt | Dramatic it may | dramatic | ADJ | JJ | PRON | PRP | VERB | MD |
| text_01.txt | it may be | it | PRON | PRP | VERB | MD | VERB | VB |
| text_01.txt | may be but | may | VERB | MD | VERB | VB | CONJ | CC |
| text_01.txt | be but basically | be | VERB | VB | CONJ | CC | ADV | RB |
| text_01.txt | but basically that | but | CONJ | CC | ADV | RB | DET | DT |
| text_01.txt | basically that is | basically | ADV | RB | DET | DT | VERB | VBZ |
What’s the most common part of speech that comes after the “be” verb? What does it say about the role of the “be” verb and the constituent that follows?
txt_bi %>% filter(lemma == "be") %>% count(second_upos, sort = TRUE)

Largely, "be" is most frequently followed by another verb. Looking at the finer-grained POS tags reveals a bit more information.
txt_bi %>% filter(lemma == "be") %>% count(second_pos, sort = TRUE)

The "be" verb most frequently co-occurs with another verb in the past participle form (i.e., VBN), so presumably the 34 occurrences are passive constructions, in which "be" serves as an auxiliary.
Since a determiner almost certainly marks the start of a noun phrase, in 27 cases the "be" verb is a main verb followed by a noun phrase complement.
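Those cases can be pulled up the same way, using columns already in txt_bi:

# Bigrams where "be" is followed by a determiner, i.e., the start of a noun phrase
txt_bi %>% filter(lemma == "be", second_pos == "DT") %>% select(bigram)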

And here are the presumably passive "be + VBN" bigrams:

txt_bi %>% filter(lemma == "be", second_pos == "VBN") %>% select(bigram)

## # A tibble: 34 x 1
## bigram
## <chr>
## 1 been made
## 2 be banned
## 3 is argued
## 4 been won
## 5 be banned
## 6 was put
## 7 be prepared
## 8 are surrounded
## 9 are trained
## 10 are paid
## # ... with 24 more rows

The third most frequently co-occurring tag is the adverb. The question, then, is what follows an adverb? Considering that the verb past participle is the category that appears most frequently after "be", it could be that an adverb is inserted between the two verbs (be + adverb + past participle; e.g., is actually made).
It is also possible that the adverb is part of an adjective phrase (e.g., is really important), which in turn may or may not be embedded in a noun phrase (e.g., is really an important …). Let's dig a little deeper by looking at the trigrams.
txt_tri %>%
filter(lemma == "be", second_pos == "RB") %>%
count(third_pos, sort = TRUE) %>%
mutate(percent = round(n*100/sum(n), 1))

## # A tibble: 10 x 3
## third_pos n percent
## <chr> <int> <dbl>
## 1 JJ 9 34.6
## 2 DT 4 15.4
## 3 RB 4 15.4
## 4 VBG 2 7.7
## 5 VBN 2 7.7
## 6 CD 1 3.8
## 7 IN 1 3.8
## 8 RBR 1 3.8
## 9 TO 1 3.8
## 10 WRB 1 3.8

It appears that many of the "be + adverb" sequences (34.6%) are followed by an adjective, such as:
txt_tri %>%
filter(lemma == "be", second_pos == "RB", third_pos == "JJ") %>%
select(trigram)

## # A tibble: 9 x 1
## trigram
## <chr>
## 1 are already few
## 2 were not stupid
## 3 is not right
## 4 is very popular
## 5 is always much
## 6 is clearly aware
## 7 is hardly surprising
## 8 is very likely
## 9 be very severe

However, to determine the exact structure, we need to go further. We can look at the sentence in which each trigram occurs.
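One way to build such a lookup is to reconstruct each sentence from the token table and then search for the trigram in it. A minimal sketch, assuming the txt_tab object created above (the paste() reconstruction is crude about spacing around punctuation, but good enough for matching):

# Rebuild each sentence by pasting its tokens back together
sentences <- txt_tab %>%
  group_by(id, sid) %>%
  summarize(sentence = paste(word, collapse = " ")) %>%
  ungroup()

# Look up the sentence(s) containing one trigram of interest
sentences %>%
  filter(grepl("is very popular", sentence, ignore.case = TRUE)) %>%
  select(id, sentence)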
| trigram | sentence |
|---|---|
| are already few | first of all, there are already few enough liberties in this country when compared with other nations of similar political and economic conditions such as france and the united states. |
| were not stupid | these men were professionals, they were not stupid. |
| is not right | the people who want the sport to be banned, say this because it is not right for people to fight, and there is a too high risk of serious injury and brain damage caused by the severe pounding that the head takes during a boxing match. |
| is very popular | another reason not to ban boxing is because it is very popular, and millions of people worldwide get entertained by watching boxing, so why should it banned it will cause displeasure to so many people. |
| is always much | there is always much speculation over the dangers of such a brutal sport as boxing. |
| is clearly aware | he is clearly aware of the dangers and brutalism of the sport, which is possibly why he enjoys it so much. |
| is hardly surprising | it is hardly surprising how important the sport can be to some. |
| is very likely | so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there. |
| be very severe | so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there. |
As a very rough summary, we can say that out of 127 cases where the verb "be" was used, 40 involved auxiliary be, constituting passive voice constructions and progressive forms. In the remaining cases, the verb was frequently followed by a noun phrase (at least 31 times) or by an adjective phrase that may or may not be embedded in a noun phrase.
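A rough way to reproduce this tally from txt_bi; the bucketing of tags below into "auxiliary" and "noun phrase follows" is my own simplification for illustration, not a property of the tagset:

# Classify what follows "be": VBN/VBG suggest auxiliary uses
# (passive/progressive); DT, NN, NNS, PRP suggest a following noun phrase.
txt_bi %>%
  filter(lemma == "be") %>%
  mutate(role = case_when(
    second_pos %in% c("VBN", "VBG") ~ "auxiliary",
    second_pos %in% c("DT", "NN", "NNS", "PRP") ~ "noun phrase follows",
    TRUE ~ "other"
  )) %>%
  count(role, sort = TRUE)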
Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28–41. https://doi.org/10.1016/j.jslw.2014.09.004