Update to Using Stanford CoreNLP with R

Forget my previous posts on using the Stanford NLP engine via the command line and retrieving information from XML files in R…. I’ve found that everything can be done in RStudio (and at least I learned more about how to work with XML in R). This post replaces those two previous ones and adds more example analyses.

1. Preparation

1.1. Install Java

Download and install the Java Development Kit, if you don’t already have it on your computer. There is nothing specific to look out for during installation.

1.2. Install cleanNLP and language model

The packages we need in R are rJava and cleanNLP. Install the development version of cleanNLP, as the (old) CRAN version won’t work properly.

install.packages("rJava")
devtools::install_github("statsmaths/cleanNLP")
library(cleanNLP)
library(dplyr)

After loading the package, you can pass an argument to cnlp_download_corenlp() to download different language models. The default is English, so I’m not going to pass anything to the function.

cnlp_download_corenlp()

This will take some time.
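If your texts are in another language, you can request that model instead. The argument name and the available models depend on your cleanNLP version, so treat the following as a hypothetical sketch and check ?cnlp_download_corenlp before running it:

# hypothetical call for a non-English model; verify the argument name
# in your version's documentation before using
cnlp_download_corenlp(type = "french")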

2. Annotation Using Stanford CoreNLP

Now you can initialize the engine to parse your text. The more annotation features you want to utilize, the higher anno_level needs to be. I usually just go for anno_level = 0, since I only need tokenization, lemmatization, and part-of-speech tagging. Loading the higher-level annotators takes longer and can slow down your computer.

cnlp_init_corenlp(anno_level = 0)
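For reference, each higher level loads additional annotators on top of the previous one; as far as I recall (check ?cnlp_init_corenlp for the authoritative list), level 1 adds dependency parsing, level 2 adds named entity recognition, and level 3 adds coreference resolution. A heavier initialization would simply be:

# heavier pipeline; exactly which annotators each level adds may vary
# by cleanNLP version, so consult the help page
cnlp_init_corenlp(anno_level = 2)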

I’ll process the same five texts that I’ve been using on this blog: five random essays from the LOCNESS corpus. The function below can read text files directly from a directory and annotate them.

anno_text <- cnlp_annotate("corpus/*.txt", as_strings = FALSE)

However, I like building the corpus as its own object so I can keep using it for various analyses.

#Build the corpus
txt_cor <- readtext::readtext("corpus/*.txt")

#Save annotations as a table
txt_ann <- cnlp_annotate(txt_cor)
txt_tab <- cnlp_get_token(txt_ann)

#Check the first 15 words
head(txt_tab, 15)
## # A tibble: 15 x 8
##    id            sid   tid word     lemma    upos  pos     cid
##    <chr>       <int> <int> <chr>    <chr>    <chr> <chr> <int>
##  1 text_01.txt     1     1 Two      two      NUM   CD        0
##  2 text_01.txt     1     2 men      man      NOUN  NNS       4
##  3 text_01.txt     1     3 ,        ,        .     ,         7
##  4 text_01.txt     1     4 one      one      NUM   CD        9
##  5 text_01.txt     1     5 ring     ring     NOUN  NN       13
##  6 text_01.txt     1     6 ,        ,        .     ,        17
##  7 text_01.txt     1     7 only     only     ADV   RB       19
##  8 text_01.txt     1     8 one      one      NUM   CD       24
##  9 text_01.txt     1     9 can      can      VERB  MD       28
## 10 text_01.txt     1    10 leave    leave    VERB  VB       32
## 11 text_01.txt     1    11 .        .        .     .        37
## 12 text_01.txt     2     1 Dramatic dramatic ADJ   JJ       39
## 13 text_01.txt     2     2 it       it       PRON  PRP      48
## 14 text_01.txt     2     3 may      may      VERB  MD       51
## 15 text_01.txt     2     4 be       be       VERB  VB       55

It appears that everything worked well. Each row is one token: sid indexes the sentence within the document, tid the token within the sentence, and cid the character offset where the token starts. Next, I’ll do some text analysis.

3. Example Text Analysis: Creating Bigrams and Trigrams

3.1. With tidytext

tidytext is a convenient package for performing text analysis. Luckily, free resources are available, such as the Tidytext book, that serve as a structured, useful guide.

library(tidytext)

This package includes some functions that are easy to use. We used CoreNLP to POS-tag the text, but if we didn’t need the tags, we could have simply tokenized the text using the unnest_tokens() function, as I did in a previous post.

unnest_tokens() first takes the data frame (txt_cor). The default setting breaks the text into words (i.e., tokenizes it) and creates a new data frame. We need to provide the name of the column for this new data frame (output, which I named word) and the name of the column that contains the text data (input, which is text).

tidy_tok <- txt_cor %>% unnest_tokens(word, text)
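Since tidy_tok now has one row per word, standard dplyr verbs apply directly. For example, a quick frequency list (just a usage sketch, not part of the original analysis):

# most frequent words across the five essays
tidy_tok %>% count(word, sort = TRUE)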

Analyzing n-grams is done with the same function; we just provide different arguments to generate a table of n-grams. The first line below takes the corpus and creates a new data frame (tidy_bi) with the column bigram that contains the bigrams: token = "ngrams" and n = 2 extract two-word sequences. The second line creates a list of trigrams.

tidy_bi <- txt_cor %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_tri <- txt_cor %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
bigram          trigram
two men         two men one
men one         men one ring
one ring        one ring only
ring only       ring only one
only one        only one can
one can         one can leave
can leave       can leave dramatic
leave dramatic  leave dramatic it
dramatic it     dramatic it may
it may          it may be
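As with single words, one row per n-gram means frequencies are one count() away (again, only a usage sketch):

# most frequent bigrams and trigrams
tidy_bi %>% count(bigram, sort = TRUE)
tidy_tri %>% count(trigram, sort = TRUE)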

Coming from a linguistics perspective, I find it potentially problematic that the bigrams include word chunks that are not meaningful, especially for qualitative text analysis. What I mean is that, for example, the last word of sentence #1 and the first word of sentence #2, “leave dramatic”, are treated as a bigram. The same applies to words within a sentence that are separated by commas or other punctuation. Consider the first couple of sentences from our corpus:

“Two men, one ring, only one can leave. Dramatic it may be but…”

Neither “men one” nor “ring only” is meaningful. Punctuation serves specific purposes in writing, and ignoring it might fail to deliver meaningful results; crossing such boundaries can also lead to misleading ones. Meaningful, uninterrupted n-grams are called “CollGrams” by some researchers (Bestgen & Granger, 2014)1.

3.2. Manually Creating Bigrams and Trigrams

For this reason, I’ll go back to the annotated data we created earlier. To inspect sequences of words, we can use the lead() function from the dplyr package to create new columns that contain information about the rows that follow each word.
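To see what lead() does on its own, consider a toy example: each element is replaced by the element n positions ahead, and the tail is padded with NA.

lead(c("two", "men", ",", "one"))     # "men" ","   "one"  NA
lead(c("two", "men", ",", "one"), 2)  # ","   "one"  NA    NA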

txt_df <- txt_tab %>% 
  mutate(second_word = lead(word), second_upos = lead(upos), second_pos = lead(pos), 
         third_word = lead(word, 2), third_upos = lead(upos, 2), third_pos = lead(pos, 2))
id           sid  tid  word   lemma  upos  pos  second_word  second_upos  second_pos  third_word  third_upos  third_pos
text_01.txt    1    1  Two    two    NUM   CD   men          NOUN         NNS         ,           .           ,
text_01.txt    1    2  men    man    NOUN  NNS  ,            .            ,           one         NUM         CD
text_01.txt    1    3  ,      ,      .     ,    one          NUM          CD          ring        NOUN        NN
text_01.txt    1    4  one    one    NUM   CD   ring         NOUN         NN          ,           .           ,
text_01.txt    1    5  ring   ring   NOUN  NN   ,            .            ,           only        ADV         RB
text_01.txt    1    6  ,      ,      .     ,    only         ADV          RB          one         NUM         CD
text_01.txt    1    7  only   only   ADV   RB   one          NUM          CD          can         VERB        MD
text_01.txt    1    8  one    one    NUM   CD   can          VERB         MD          leave       VERB        VB
text_01.txt    1    9  can    can    VERB  MD   leave        VERB         VB          .           .           .
text_01.txt    1   10  leave  leave  VERB  VB   .            .            .           Dramatic    ADJ         JJ

The newly created second_* columns contain information pertaining to the next word, and the third_* columns contain that of the word after the next.

To clean these data, we’ll execute the following code. The unite() function from the tidyr package concatenates word and second_word to show the bigram. Although where punctuation occurs can be of interest in itself (e.g., marking clauses, inserting phrases, etc.), in this post I’ll filter out the n-grams that include any punctuation marks, so as to only consider two- or three-word sequences that co-occur without interruption.

library(tidyr)
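First, a minimal illustration of unite() on a toy tibble, which also shows that the input columns are dropped by default:

tibble(word = c("Two", "men"), second_word = c("men", ",")) %>% 
  unite(bigram, word, second_word, sep = " ")
## # A tibble: 2 x 1
##   bigram 
##   <chr>  
## 1 Two men
## 2 men ,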

txt_bi <- txt_df %>% unite(bigram, word, second_word, sep = " ") %>% 
  filter(!second_upos == ".", !upos == ".") %>% select(1, 4:9)
id           bigram       lemma     upos  pos  second_upos  second_pos
text_01.txt  Two men      two       NUM   CD   NOUN         NNS
text_01.txt  one ring     one       NUM   CD   NOUN         NN
text_01.txt  only one     only      ADV   RB   NUM          CD
text_01.txt  one can      one       NUM   CD   VERB         MD
text_01.txt  can leave    can       VERB  MD   VERB         VB
text_01.txt  Dramatic it  dramatic  ADJ   JJ   PRON         PRP
text_01.txt  it may       it        PRON  PRP  VERB         MD
text_01.txt  may be       may       VERB  MD   VERB         VB
txt_tri <- txt_df %>% unite(trigram, word, second_word, third_word, sep = " ") %>% 
  filter(!third_upos == ".", !second_upos == ".", !upos == ".") %>% select(1, 4:11)
id           trigram             lemma      upos  pos  second_upos  second_pos  third_upos  third_pos
text_01.txt  only one can        only       ADV   RB   NUM          CD          VERB        MD
text_01.txt  one can leave       one        NUM   CD   VERB         MD          VERB        VB
text_01.txt  Dramatic it may     dramatic   ADJ   JJ   PRON         PRP         VERB        MD
text_01.txt  it may be           it         PRON  PRP  VERB         MD          VERB        VB
text_01.txt  may be but          may        VERB  MD   VERB         VB          CONJ        CC
text_01.txt  be but basically    be         VERB  VB   CONJ         CC          ADV         RB
text_01.txt  but basically that  but        CONJ  CC   ADV          RB          DET         DT
text_01.txt  basically that is   basically  ADV   RB   DET          DT          VERB        VBZ

3.3. Example Analysis: Be + words

What’s the most common part of speech that comes after the “be” verb? What does it say about the role of the “be” verb and the constituent that follows?

txt_bi %>% filter(lemma == "be") %>% count(second_upos, sort = TRUE)

By and large, “be” is most frequently followed by another verb. Looking at the finer-grained POS tags reveals a bit more information.

txt_bi %>% filter(lemma == "be") %>% count(second_pos, sort = TRUE)

The “be” verb most frequently co-occurs with another verb in the past participle form (i.e., VBN), so presumably the 34 occurrences are passive constructions, in which “be” serves as an auxiliary.

Since a determiner almost certainly starts a noun phrase, in 27 cases the “be” verb is a main verb followed by a noun phrase complement.
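Those determiner cases could be inspected in the same way as the past-participle query that follows (a sketch; output omitted here):

txt_bi %>% filter(lemma == "be", second_pos == "DT") %>% select(bigram)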

txt_bi %>% filter(lemma == "be", second_pos == "VBN") %>% select(bigram)
## # A tibble: 34 x 1
##    bigram        
##    <chr>         
##  1 been made     
##  2 be banned     
##  3 is argued     
##  4 been won      
##  5 be banned     
##  6 was put       
##  7 be prepared   
##  8 are surrounded
##  9 are trained   
## 10 are paid      
## # ... with 24 more rows

The third most frequently co-occurring tag is the adverb. The question, then, is: what follows an adverb? Considering that the verb past participle is the category that appears most frequently after “be”, it could be that an adverb is inserted between the two verbs (be + adverb + past participle; e.g., is actually made).

It is also possible that the adverb is part of an adjective phrase (e.g., is really important), which, in turn, may or may not be embedded in a noun phrase (e.g., is really an important issue). Let’s dig a little deeper by looking at the trigrams.

txt_tri %>% 
    filter(lemma == "be", second_pos == "RB") %>% 
    count(third_pos, sort = TRUE) %>% 
    mutate(percent = round(n*100/sum(n), 1))
## # A tibble: 10 x 3
##    third_pos     n percent
##    <chr>     <int>   <dbl>
##  1 JJ            9    34.6
##  2 DT            4    15.4
##  3 RB            4    15.4
##  4 VBG           2     7.7
##  5 VBN           2     7.7
##  6 CD            1     3.8
##  7 IN            1     3.8
##  8 RBR           1     3.8
##  9 TO            1     3.8
## 10 WRB           1     3.8

It appears that many of the “be + adverb” sequences (34.6%) are followed by an adjective, such as:

txt_tri %>% 
  filter(lemma == "be", second_pos == "RB", third_pos == "JJ") %>% 
  select(trigram)
## # A tibble: 9 x 1
##   trigram             
##   <chr>               
## 1 are already few     
## 2 were not stupid     
## 3 is not right        
## 4 is very popular     
## 5 is always much      
## 6 is clearly aware    
## 7 is hardly surprising
## 8 is very likely      
## 9 be very severe

However, to determine the exact structure, we need to go further and look at the sentence in which each trigram occurs.
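A minimal sketch of one way to build such a table, assuming sid was kept in txt_tri (e.g., by retaining it in the select() call above), is to rebuild each sentence from the token table and join it back in:

# reassemble each sentence from the token table, then attach it to the
# matching trigrams by document and sentence id (assumes sid is present
# in txt_tri)
sentences <- txt_tab %>% 
  group_by(id, sid) %>% 
  summarize(sentence = tolower(paste(word, collapse = " ")))

txt_tri %>% 
  filter(lemma == "be", second_pos == "RB", third_pos == "JJ") %>% 
  inner_join(sentences, by = c("id", "sid")) %>% 
  select(trigram, sentence)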

trigram               sentence
are already few       first of all, there are already few enough liberties in this country when compared with other nations of similar political and economic conditions such as france and the united states.
were not stupid       these men were professionals, they were not stupid.
is not right          the people who want the sport to be banned, say this because it is not right for people to fight, and there is a too high risk of serious injury and brain damage caused by the severe pounding that the head takes during a boxing match.
is very popular       another reason not to ban boxing is because it is very popular, and millions of people worldwide get entertained by watching boxing, so why should it banned it will cause displeasure to so many people.
is always much        there is always much speculation over the dangers of such a brutal sport as boxing.
is clearly aware      he is clearly aware of the dangers and brutalism of the sport, which is possibly why he enjoys it so much.
is hardly surprising  it is hardly surprising how important the sport can be to some.
is very likely        so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there.
be very severe        so, a boxer who has been boxing for a number of years, ten for example, and retires is very likely to have brain damage and it could be very severe or hardly noticable but the damage is still there.

As a very rough summary, we can say that out of the 127 cases where the verb “be” was used, 40 involved auxiliary be, forming passive voice constructions and progressive forms. In the other cases, the verb was frequently followed by a noun phrase (at least 31 times) or by an adjective phrase that may or may not be embedded in a noun phrase.
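As a quick reproducibility check, totals like these can be recovered from the tables directly; for instance (a sketch, and whether the figure matches exactly depends on which table the 127 was counted from):

# count occurrences of the lemma "be" in the full token table
txt_tab %>% filter(lemma == "be") %>% nrow()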


  1. Bestgen, Y., & Granger, S. (2014). Quantifying the development of phraseological competence in L2 English writing: An automated approach. Journal of Second Language Writing, 26, 28-41. https://doi.org/10.1016/j.jslw.2014.09.004.
