Using Stanford CoreNLP with R: Bigram and Trigram Analysis

Jan. 1, 2019

Forget my previous posts on using the Stanford NLP engine via the command line and retrieving information from XML files in R…. I’ve found that everything can be done in RStudio (and at least I learned more about how to work with XML in R). This post replaces those two previous ones and adds more example analyses.

1 . Preparation

1.1 . Install Java

Download and install the Java Development Kit if you don’t already have it on your computer. There is nothing specific to look out for during installation.

1.2 . Install cleanNLP and language model

The packages we need in R are rJava and cleanNLP. Install the development version of cleanNLP, as the (old) CRAN version won’t work properly.

install.packages("rJava")
devtools::install_github("statsmaths/cleanNLP")
library(cleanNLP)
library(dplyr)

After loading the package, you can pass an argument to download different language models. The default is English, so I’m not going to pass anything to the function.

cnlp_download_corenlp()

This will take some time.

2 . Annotation Using Stanford CoreNLP

Now you can initialize the engine to parse your text. The more annotation features you want to utilize, the higher the anno_level needs to be. I usually just go for anno_level = 0 since I only need tokenization, lemmatization, and part-of-speech tagging. Loading higher-level functions takes longer and can slow down your computer.

cnlp_init_corenlp(anno_level = 0)

I’ll process the same five texts that I’ve been using in this blog: five random essays from the LOCNESS corpus. The function below can read text files directly from a directory and annotate them.

anno_text <- cnlp_annotate("corpus/*.txt", as_strings = FALSE)

However, I like building the corpus as its own object to keep using it for various analyses.

#Build the corpus
txt_cor <- readtext::readtext("corpus/*.txt")

#Save annotations as a table
txt_ann <- cnlp_annotate(txt_cor)
txt_tab <- cnlp_get_token(txt_ann)

#Check the first 15 words
head(txt_tab, 15)
## # A tibble: 15 x 8
##    id            sid   tid word     lemma    upos  pos     cid
##    <chr>       <int> <int> <chr>    <chr>    <chr> <chr> <int>
##  1 text_01.txt     1     1 Two      two      NUM   CD        0
##  2 text_01.txt     1     2 men      man      NOUN  NNS       4
##  3 text_01.txt     1     3 ,        ,        .     ,         7
##  4 text_01.txt     1     4 one      one      NUM   CD        9
##  5 text_01.txt     1     5 ring     ring     NOUN  NN       13
##  6 text_01.txt     1     6 ,        ,        .     ,        17
##  7 text_01.txt     1     7 only     only     ADV   RB       19
##  8 text_01.txt     1     8 one      one      NUM   CD       24
##  9 text_01.txt     1     9 can      can      VERB  MD       28
## 10 text_01.txt     1    10 leave    leave    VERB  VB       32
## 11 text_01.txt     1    11 .        .        .     .        37
## 12 text_01.txt     2     1 Dramatic dramatic ADJ   JJ       39
## 13 text_01.txt     2     2 it       it       PRON  PRP      48
## 14 text_01.txt     2     3 may      may      VERB  MD       51
## 15 text_01.txt     2     4 be       be       VERB  VB       55

It appears that everything worked well. Next, I’ll do some text analysis.

3 . Example Text Analysis: Creating Bigrams and Trigrams

3.1 . With tidytext

The tidytext package is a convenient means to perform text analysis in R. Luckily, free resources such as the book Text Mining with R are available to serve as a structured, useful guide.

library(tidytext)

This package includes some functions that are easy to use. We used CoreNLP to POS-tag the text, but if we didn’t need that, we could have just tokenized using the unnest_tokens() function, as I did in a previous post.

unnest_tokens() first takes the data frame (txt_cor). The default setting breaks the text into words (i.e., tokenizes it) and creates a new data frame. We need to provide the name of the column for this new data frame (output, which I named word) and the column that contains the text data (input, which is text).

tidy_tok <- txt_cor %>% unnest_tokens(word, text)

Analyzing n-grams is done with the same function; we just provide different arguments to generate a table of n-grams. The first line below takes the corpus and creates a new data frame (tidy_bi) with a column bigram that contains the bigrams: token = "ngrams" and n = 2 extract two-word sequences. The second line creates a list of trigrams.

tidy_bi <- txt_cor %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_tri <- txt_cor %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
bigram         trigram
two men        two men one
men one        men one ring
one ring       one ring only
ring only      ring only one
only one       only one can
one can        one can leave
can leave      can leave dramatic
leave dramatic leave dramatic it
dramatic it    dramatic it may
it may         it may be
Coming from a linguistics perspective, I find it potentially problematic that the bigrams include word chunks that are not meaningful, especially for qualitative text analysis. What I mean is that, for example, the last word of sentence #1 and the first word of sentence #2, “leave dramatic”, are treated as a bigram. The same applies to words within a sentence that are separated by commas or other punctuation. Consider the first couple of sentences from our corpus:

“Two men, one ring, only one can leave. Dramatic it may be but…”

“men one” is not meaningful, and neither is “ring only”. Punctuation serves specific purposes in writing, and ignoring it might fail to deliver meaningful results. Crossing such boundaries can also lead to misleading results. Some researchers call the meaningful, uninterrupted n-grams “CollGrams” (Bestgen & Granger, 2014)1.
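To see the problem in isolation, here is a small sketch (assuming tidytext and dplyr are installed) with a toy two-sentence text; the default n-gram tokenizer happily forms a bigram across the sentence boundary:

```r
library(dplyr)
library(tidytext)

# Toy two-sentence text mirroring the corpus example
toy <- tibble(text = "Only one can leave. Dramatic it may be.")

# The default tokenizer lowercases, strips punctuation, and ignores
# sentence boundaries, so "leave dramatic" comes back as a bigram
toy_bi <- toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
"leave dramatic" %in% toy_bi$bigram
```

Nothing in the output signals that “leave dramatic” spans two sentences.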

3.2 . Manually Creating Bigrams and Trigrams

For this reason, I’ll go back to the annotated data we created earlier. To inspect sequences of words, we can use the lead() function from the dplyr package to create new columns that contain information about the rows that follow each word.
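As a quick illustration (nothing specific to our corpus), lead() shifts a vector forward by one position, or by k positions with its second argument, padding the end with NA:

```r
library(dplyr)

w <- c("Two", "men", ",", "one")
lead(w)     # "men" ","   "one" NA
lead(w, 2)  # ","   "one" NA    NA
```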

txt_df <- txt_tab %>% 
  mutate(second_word = lead(word), second_upos = lead(upos), second_pos = lead(pos), 
         third_word = lead(word, 2), third_upos = lead(upos, 2), third_pos = lead(pos, 2))
id          sid tid word     lemma    upos pos second_word second_upos second_pos third_word third_upos third_pos
text_01.txt   1   1 Two      two      NUM  CD  men         NOUN        NNS        ,          .          ,
text_01.txt   1   2 men      man      NOUN NNS ,           .           ,          one        NUM        CD
text_01.txt   1   3 ,        ,        .    ,   one         NUM         CD         ring       NOUN       NN
text_01.txt   1   4 one      one      NUM  CD  ring        NOUN        NN         ,          .          ,
text_01.txt   1   5 ring     ring     NOUN NN  ,           .           ,          only       ADV        RB
text_01.txt   1   6 ,        ,        .    ,   only        ADV         RB         one        NUM        CD
text_01.txt   1   7 only     only     ADV  RB  one         NUM         CD         can        VERB       MD
text_01.txt   1   8 one      one      NUM  CD  can         VERB        MD         leave      VERB       VB
text_01.txt   1   9 can      can      VERB MD  leave       VERB        VB         .          .          .
text_01.txt   1  10 leave    leave    VERB VB  .           .           .          Dramatic   ADJ        JJ

The newly created second_word, second_upos, and second_pos columns contain information pertaining to the next word, and the third_word, third_upos, and third_pos columns that of the word after it.

To clean this data, we’ll execute the following code. The unite() function from the tidyr package concatenates word and second_word to show the bigram. Although where punctuation occurs can be of interest in itself (e.g., marking clauses, inserting phrases, etc.), in this post I’ll filter out the bigrams that include any punctuation marks, so that we only consider two- or three-word sequences that co-occur without any interruption.

library(tidyr)

txt_bi <- txt_df %>% unite(bigram, word, second_word, sep = " ") %>% 
  filter(!second_upos == ".", !upos == ".") %>% select(1, 4:9)
id          bigram      lemma    upos pos second_upos second_pos
text_01.txt Two men     two      NUM  CD  NOUN        NNS
text_01.txt one ring    one      NUM  CD  NOUN        NN
text_01.txt only one    only     ADV  RB  NUM         CD
text_01.txt one can     one      NUM  CD  VERB        MD
text_01.txt can leave   can      VERB MD  VERB        VB
text_01.txt Dramatic it dramatic ADJ  JJ  PRON        PRP
text_01.txt it may      it       PRON PRP VERB        MD
text_01.txt may be      may      VERB MD  VERB        VB
txt_tri <- txt_df %>% unite(trigram, word, second_word, third_word, sep = " ") %>% 
  filter(!third_upos == ".", !second_upos == ".", !upos == ".") %>% select(1, 4:11)
id          trigram            lemma     upos pos second_upos second_pos third_upos third_pos
text_01.txt only one can       only      ADV  RB  NUM         CD         VERB       MD
text_01.txt one can leave      one       NUM  CD  VERB        MD         VERB       VB
text_01.txt Dramatic it may    dramatic  ADJ  JJ  PRON        PRP        VERB       MD
text_01.txt it may be          it        PRON PRP VERB        MD         VERB       VB
text_01.txt may be but         may       VERB MD  VERB        VB         CONJ       CC
text_01.txt be but basically   be        VERB VB  CONJ        CC         ADV        RB
text_01.txt but basically that but       CONJ CC  ADV         RB         DET        DT
text_01.txt basically that is  basically ADV  RB  DET         DT         VERB       VBZ

3.3 . Example Analysis: Be + words

What’s the most common part of speech that comes after the “be” verb? What does it say about the role of the “be” verb and the constituent that follows?

txt_bi %>% filter(lemma == "be") %>% count(second_upos, sort = TRUE)

Largely, “be” is most frequently followed by another verb. Looking at the POS tag reveals a bit more information.

txt_bi %>% filter(lemma == "be") %>% count(second_pos, sort = TRUE)

The “be” verb most frequently co-occurs with another verb in the past participle form (i.e., VBN), so presumably the 34 occurrences are passive constructions, in which “be” serves as an auxiliary.

It’s almost certain that a determiner starts a noun phrase, so in the 27 cases where a determiner follows, the “be” verb is a main verb followed by a noun phrase complement.
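Those determiner cases can be pulled out with the same filter pattern, just swapping the tag. Since the full corpus output is long, here is a sketch on a hypothetical miniature version of the txt_bi table (the rows are made up for illustration):

```r
library(dplyr)

# Hypothetical miniature of the txt_bi table built earlier
txt_bi_toy <- tibble(
  id         = "text_01.txt",
  bigram     = c("is a", "was the", "be banned"),
  lemma      = c("be", "be", "be"),
  second_pos = c("DT", "DT", "VBN")
)

# "be" followed by a determiner: likely main-verb be + noun phrase complement
txt_bi_toy %>% filter(lemma == "be", second_pos == "DT") %>% pull(bigram)
# "is a" "was the"
```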

txt_bi %>% filter(lemma == "be", second_pos == "VBN") %>% select(bigram)
## # A tibble: 34 x 1
##    bigram        
##    <chr>         
##  1 been made     
##  2 be banned     
##  3 is argued     
##  4 been won      
##  5 be banned     
##  6 was put       
##  7 be prepared   
##  8 are surrounded
##  9 are trained   
## 10 are paid      
## # ... with 24 more rows

The third most frequently co-occurring tag is the adverb. The question then is, what follows the adverb? Considering that the past participle is the category that appears most frequently after “be”, it could be that an adverb is inserted between the two verbs (be + adverb + past participle; e.g., is actually made).

It is also possible that the adverb is part of an adjective phrase (e.g., is really important), which in turn may or may not be embedded in a noun phrase (e.g., is really an important + noun). Let’s dig a little deeper by looking at the trigrams.

txt_tri %>% 
    filter(lemma == "be", second_pos == "RB") %>% 
    count(third_pos, sort = TRUE) %>% 
    mutate(percent = round(n*100/sum(n), 1))
## # A tibble: 10 x 3
##    third_pos     n percent
##    <chr>     <int>   <dbl>
##  1 JJ            9    34.6
##  2 DT            4    15.4
##  3 RB            4    15.4
##  4 VBG           2     7.7
##  5 VBN           2     7.7
##  6 CD            1     3.8
##  7 IN            1     3.8
##  8 RBR           1     3.8
##  9 TO            1     3.8
## 10 WRB           1     3.8

It appears that many of the “be + adverb” sequences (34.6%) are followed by an adjective, such as:

txt_tri %>% 
  filter(lemma == "be", second_pos == "RB", third_pos == "JJ") %>% 
  select(trigram)
## # A tibble: 9 x 1
##   trigram             
##   <chr>               
## 1 are already few     
## 2 were not stupid     
## 3 is not right        
## 4 is very popular     
## 5 is always much      
## 6 is clearly aware    
## 7 is hardly surprising
## 8 is very likely      
## 9 be very severe

However, to determine the exact structure, we need to go further. We can look at the sentence where each trigram occurs.

trigram              sentence
are already few      first of all, there are already few enough liberties in this country when compared with other nations of similar political and economic conditions such as france and the united states.
were not stupid      these men were professionals, they were not stupid.
is not right         the people who want the sport to be banned, say this because it is not right for people to fight, and there is a too high risk of serious injury and brain damage caused by the severe pounding that the head takes during a boxing match.
is very popular      another reason not to ban boxing is because it is very popular, and millions of people worldwide get entertained by watching boxing, so why should it banned it will cause displeasure to so many people.
is always much       there is always much speculation over the dangers of such a brutal sport as boxing.
is clearly aware     he is clearly aware of the dangers and brutalism of the sport, which is possibly why he enjoys it so much.
is hardly surprising it is hardly surprising how important the sport can be to some.
is very likely       so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there.
be very severe       so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there.

As a very rough summary, we can say that out of the 127 cases where the verb “be” was used, 40 involved auxiliary be, comprising passive voice constructions and progressive forms. Otherwise, the verb was frequently followed by a noun phrase (at least 31 times) or by an adjective phrase that may or may not be embedded in a noun phrase.


  1. Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28-41. https://doi.org/10.1016/j.jslw.2014.09.004.