Using Stanford CoreNLP with R: Bigram and Trigram Analysis

Jan. 1, 2019

Forget my previous posts on using the Stanford NLP engine via the command line and retrieving information from XML files in R…. I’ve found that everything can be done in RStudio (and at least I learned more about how to work with XML in R). This post replaces those two previous ones and adds more example analyses.

1 . Preparation

1.1 . Install Java

Download and install the Java Development Kit if you don’t already have it on your computer. There is nothing specific to look out for during installation.

1.2 . Install cleanNLP and language model

The packages we need in R are rJava and cleanNLP. Install the development version of cleanNLP, as the (old) CRAN version won’t work properly.

install.packages("rJava")
devtools::install_github("statsmaths/cleanNLP")
library(cleanNLP)
library(dplyr)

After loading the package, you can pass an argument to download different language models. The default is English, so I’m not going to pass anything to the function.

cnlp_download_corenlp()

This will take some time.

2 . Annotation Using Stanford CoreNLP

Now you can initialize the engine to parse your text. The more annotation features you want to utilize, the higher the anno_level needs to be. I usually just go for anno_level = 0 since I only need tokenization, lemmatization, and part-of-speech tagging. Loading higher-level functions takes longer and can slow down your computer.

cnlp_init_corenlp(anno_level = 0)

I’ll process the same five texts that I’ve been using in this blog: five random essays from the LOCNESS corpus. The function below can read text files directly from a directory and annotate them.

anno_text <- cnlp_annotate("corpus/*.txt", as_strings = FALSE)

However, I like building the corpus as its own object to keep using it for various analyses.

#Build the corpus
txt_cor <- readtext::readtext("corpus/*.txt")

#Save annotations as a table
txt_ann <- cnlp_annotate(txt_cor)
txt_tab <- cnlp_get_token(txt_ann)

#Check the first 15 words
head(txt_tab, 15)
## # A tibble: 15 x 8
##    id            sid   tid word     lemma    upos  pos     cid
##    <chr>       <int> <int> <chr>    <chr>    <chr> <chr> <int>
##  1 text_01.txt     1     1 Two      two      NUM   CD        0
##  2 text_01.txt     1     2 men      man      NOUN  NNS       4
##  3 text_01.txt     1     3 ,        ,        .     ,         7
##  4 text_01.txt     1     4 one      one      NUM   CD        9
##  5 text_01.txt     1     5 ring     ring     NOUN  NN       13
##  6 text_01.txt     1     6 ,        ,        .     ,        17
##  7 text_01.txt     1     7 only     only     ADV   RB       19
##  8 text_01.txt     1     8 one      one      NUM   CD       24
##  9 text_01.txt     1     9 can      can      VERB  MD       28
## 10 text_01.txt     1    10 leave    leave    VERB  VB       32
## 11 text_01.txt     1    11 .        .        .     .        37
## 12 text_01.txt     2     1 Dramatic dramatic ADJ   JJ       39
## 13 text_01.txt     2     2 it       it       PRON  PRP      48
## 14 text_01.txt     2     3 may      may      VERB  MD       51
## 15 text_01.txt     2     4 be       be       VERB  VB       55

It appears that everything worked well. Next, I’ll do some text analysis.

3 . Example Text Analysis: Creating Bigrams and Trigrams

3.1 . With tidytext

The tidytext package is a convenient means to perform text analysis in R. Luckily, free resources such as the book Text Mining with R are available to serve as a structured, useful guide.

library(tidytext)

This package includes some functions that are easy to use. We used CoreNLP to POS-tag the text, but if we didn’t need that, we could have just tokenized using the unnest_tokens() function, as I did in a previous post.

unnest_tokens() first takes the data frame (txt_cor). The default setting breaks the text into words (i.e., tokenizes it) and creates a new data frame. We need to provide the name of the column for this new data frame (output, which I named word) and the column that contains the text data (input, which is text).

tidy_tok <- txt_cor %>% unnest_tokens(word, text)

Analyzing n-grams is done with the same function; we just provide different arguments to generate a table of n-grams. The first line below takes the corpus and creates a new data frame (tidy_bi) with a column bigram that contains the bigrams: token = "ngrams" and n = 2 extract two-word sequences. The second line creates a list of trigrams.

tidy_bi <- txt_cor %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_tri <- txt_cor %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
bigram         trigram
two men        two men one
men one        men one ring
one ring       one ring only
ring only      ring only one
only one       only one can
one can        one can leave
can leave      can leave dramatic
leave dramatic leave dramatic it
dramatic it    dramatic it may
it may         it may be
Coming from a linguistics perspective, I find it potentially problematic that the bigrams include word chunks that are not meaningful, especially for qualitative text analysis. What I mean is that, for example, the last word of sentence #1 and the first word of sentence #2, “leave dramatic”, are treated as a bigram. The same applies to words within a sentence that are separated by commas or other punctuation. Consider the first couple of sentences from our corpus:

“Two men, one ring, only one can leave. Dramatic it may be but…”

“men one” is not meaningful, and neither is “ring only”. Punctuation serves specific purposes in writing, and ignoring it might fail to deliver meaningful results. Crossing such boundaries can also lead to misleading results. Some researchers call the meaningful, uninterrupted n-grams “CollGrams” (Bestgen & Granger, 2014)1.
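To see the problem in isolation, here is a small sketch (assuming tidytext and dplyr are installed) with a toy two-sentence text; the default n-gram tokenizer happily forms a bigram across the sentence boundary:

```r
library(dplyr)
library(tidytext)

# Toy two-sentence text mirroring the corpus example
toy <- tibble(text = "Only one can leave. Dramatic it may be.")

# The default tokenizer lowercases, strips punctuation, and ignores
# sentence boundaries, so "leave dramatic" comes back as a bigram
toy_bi <- toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
"leave dramatic" %in% toy_bi$bigram
```

Nothing in the output signals that “leave dramatic” spans two sentences.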

3.2 . Manually Creating Bigrams and Trigrams

For this reason, I’ll go back to the annotated data we created earlier. To inspect sequences of words, we can use the lead() function from the dplyr package to create new columns that contain information about the rows that follow each word.
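As a quick illustration (nothing specific to our corpus), lead() shifts a vector forward by one position, or by k positions with its second argument, padding the end with NA:

```r
library(dplyr)

w <- c("Two", "men", ",", "one")
lead(w)     # "men" ","   "one" NA
lead(w, 2)  # ","   "one" NA    NA
```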

txt_df <- txt_tab %>% 
  mutate(second_word = lead(word), second_upos = lead(upos), second_pos = lead(pos), 
         third_word = lead(word, 2), third_upos = lead(upos, 2), third_pos = lead(pos, 2))
id          sid tid word     lemma    upos pos second_word second_upos second_pos third_word third_upos third_pos
text_01.txt   1   1 Two      two      NUM  CD  men         NOUN        NNS        ,          .          ,
text_01.txt   1   2 men      man      NOUN NNS ,           .           ,          one        NUM        CD
text_01.txt   1   3 ,        ,        .    ,   one         NUM         CD         ring       NOUN       NN
text_01.txt   1   4 one      one      NUM  CD  ring        NOUN        NN         ,          .          ,
text_01.txt   1   5 ring     ring     NOUN NN  ,           .           ,          only       ADV        RB
text_01.txt   1   6 ,        ,        .    ,   only        ADV         RB         one        NUM        CD
text_01.txt   1   7 only     only     ADV  RB  one         NUM         CD         can        VERB       MD
text_01.txt   1   8 one      one      NUM  CD  can         VERB        MD         leave      VERB       VB
text_01.txt   1   9 can      can      VERB MD  leave       VERB        VB         .          .          .
text_01.txt   1  10 leave    leave    VERB VB  .           .           .          Dramatic   ADJ        JJ

The newly created second_word, second_upos, and second_pos columns contain information pertaining to the next word, and the third_word, third_upos, and third_pos columns that of the word after it.

To clean this data, we’ll execute the following code. The unite() function from the tidyr package concatenates word and second_word to show the bigram. Although where punctuation occurs can be of interest in itself (e.g., marking clauses, inserting phrases, etc.), in this post I’ll filter out the bigrams that include any punctuation marks, so that we only consider two- or three-word sequences that co-occur without any interruption.

library(tidyr)

txt_bi <- txt_df %>% unite(bigram, word, second_word, sep = " ") %>% 
  filter(!second_upos == ".", !upos == ".") %>% select(1, 4:9)
id          bigram      lemma    upos pos second_upos second_pos
text_01.txt Two men     two      NUM  CD  NOUN        NNS
text_01.txt one ring    one      NUM  CD  NOUN        NN
text_01.txt only one    only     ADV  RB  NUM         CD
text_01.txt one can     one      NUM  CD  VERB        MD
text_01.txt can leave   can      VERB MD  VERB        VB
text_01.txt Dramatic it dramatic ADJ  JJ  PRON        PRP
text_01.txt it may      it       PRON PRP VERB        MD
text_01.txt may be      may      VERB MD  VERB        VB
txt_tri <- txt_df %>% unite(trigram, word, second_word, third_word, sep = " ") %>% 
  filter(!third_upos == ".", !second_upos == ".", !upos == ".") %>% select(1, 4:11)
id          trigram            lemma     upos pos second_upos second_pos third_upos third_pos
text_01.txt only one can       only      ADV  RB  NUM         CD         VERB       MD
text_01.txt one can leave      one       NUM  CD  VERB        MD         VERB       VB
text_01.txt Dramatic it may    dramatic  ADJ  JJ  PRON        PRP        VERB       MD
text_01.txt it may be          it        PRON PRP VERB        MD         VERB       VB
text_01.txt may be but         may       VERB MD  VERB        VB         CONJ       CC
text_01.txt be but basically   be        VERB VB  CONJ        CC         ADV        RB
text_01.txt but basically that but       CONJ CC  ADV         RB         DET        DT
text_01.txt basically that is  basically ADV  RB  DET         DT         VERB       VBZ

3.3 . Example Analysis: Be + words

What’s the most common part of speech that comes after the “be” verb? What does it say about the role of the “be” verb and the constituent that follows?

txt_bi %>% filter(lemma == "be") %>% count(second_upos, sort = TRUE)

Largely, “be” is most frequently followed by another verb. Looking at the POS tag reveals a bit more information.

txt_bi %>% filter(lemma == "be") %>% count(second_pos, sort = TRUE)

The “be” verb most frequently co-occurs with another verb in the past participle form (i.e., VBN), so presumably the 34 occurrences are passive constructions, in which “be” serves as an auxiliary.

It’s almost certain that a determiner starts a noun phrase, so in the 27 cases where a determiner follows, the “be” verb is a main verb followed by a noun phrase complement.
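Those determiner cases can be pulled out with the same filter pattern, just swapping the tag. Since the full corpus output is long, here is a sketch on a hypothetical miniature version of the txt_bi table (the rows are made up for illustration):

```r
library(dplyr)

# Hypothetical miniature of the txt_bi table built earlier
txt_bi_toy <- tibble(
  id         = "text_01.txt",
  bigram     = c("is a", "was the", "be banned"),
  lemma      = c("be", "be", "be"),
  second_pos = c("DT", "DT", "VBN")
)

# "be" followed by a determiner: likely main-verb be + noun phrase complement
txt_bi_toy %>% filter(lemma == "be", second_pos == "DT") %>% pull(bigram)
# "is a" "was the"
```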

txt_bi %>% filter(lemma == "be", second_pos == "VBN") %>% select(bigram)
## # A tibble: 34 x 1
##    bigram        
##    <chr>         
##  1 been made     
##  2 be banned     
##  3 is argued     
##  4 been won      
##  5 be banned     
##  6 was put       
##  7 be prepared   
##  8 are surrounded
##  9 are trained   
## 10 are paid      
## # ... with 24 more rows

The third most frequently co-occurring tag is the adverb. The question then is, what follows the adverb? Considering that the past participle is the category that appears most frequently after “be”, it could be that an adverb is inserted between the two verbs (be + adverb + past participle; e.g., is actually made).

It is also possible that the adverb is part of an adjective phrase (e.g., is really important), which in turn may or may not be embedded in a noun phrase (e.g., is really an important + noun). Let’s dig a little deeper by looking at the trigrams.

txt_tri %>% 
    filter(lemma == "be", second_pos == "RB") %>% 
    count(third_pos, sort = TRUE) %>% 
    mutate(percent = round(n*100/sum(n), 1))
## # A tibble: 10 x 3
##    third_pos     n percent
##    <chr>     <int>   <dbl>
##  1 JJ            9    34.6
##  2 DT            4    15.4
##  3 RB            4    15.4
##  4 VBG           2     7.7
##  5 VBN           2     7.7
##  6 CD            1     3.8
##  7 IN            1     3.8
##  8 RBR           1     3.8
##  9 TO            1     3.8
## 10 WRB           1     3.8

It appears that many of the “be + adverb” sequences (34.6%) are followed by an adjective, such as:

txt_tri %>% 
  filter(lemma == "be", second_pos == "RB", third_pos == "JJ") %>% 
  select(trigram)
## # A tibble: 9 x 1
##   trigram             
##   <chr>               
## 1 are already few     
## 2 were not stupid     
## 3 is not right        
## 4 is very popular     
## 5 is always much      
## 6 is clearly aware    
## 7 is hardly surprising
## 8 is very likely      
## 9 be very severe

However, to determine the exact structure, we need to go further. We can look at the sentence where each trigram occurs.

trigram              sentence
are already few      first of all, there are already few enough liberties in this country when compared with other nations of similar political and economic conditions such as france and the united states.
were not stupid      these men were professionals, they were not stupid.
is not right         the people who want the sport to be banned, say this because it is not right for people to fight, and there is a too high risk of serious injury and brain damage caused by the severe pounding that the head takes during a boxing match.
is very popular      another reason not to ban boxing is because it is very popular, and millions of people worldwide get entertained by watching boxing, so why should it banned it will cause displeasure to so many people.
is always much       there is always much speculation over the dangers of such a brutal sport as boxing.
is clearly aware     he is clearly aware of the dangers and brutalism of the sport, which is possibly why he enjoys it so much.
is hardly surprising it is hardly surprising how important the sport can be to some.
is very likely       so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there.
be very severe       so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there.

As a very rough summary, we can say that out of the 127 cases where the verb “be” was used, 40 involved auxiliary be, comprising passive voice constructions and progressive forms. Otherwise, the verb was frequently followed by a noun phrase (at least 31 times) or by an adjective phrase that may or may not be embedded in a noun phrase.


  1. Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28-41. https://doi.org/10.1016/j.jslw.2014.09.004.