Source: https://www.r-bloggers.com/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know/
Text mining methods allow us to highlight the most frequently used keywords in a paragraph of text. One can create a word cloud, also referred to as a text cloud or tag cloud, which is a visual representation of text data.
The procedure for creating word clouds is very simple in R if you know the different steps to execute. The text mining package (tm) and the word cloud generator package (wordcloud) are available in R to help us analyze texts and quickly visualize the keywords as a word cloud.
In this article, we'll describe, step by step, how to generate word clouds using the R software.
Contents
3 reasons you should use word clouds to present your text data
- Word clouds add simplicity and clarity. The most used keywords stand out better in a word cloud
- Word clouds are a potent communication tool. They are easy to understand, easy to share and impactful
- Word clouds are more visually engaging than a table of data
Who is using word clouds?
- Researchers: for reporting qualitative data
- Marketers: for highlighting the needs and pain points of customers
- Educators: to support essential issues
- Politicians and journalists
- Social media sites: to collect, analyze and share user sentiments
The 5 main steps to create word clouds in R
Step 1: Create a text file
In the following examples, I'll process the "I have a dream" speech of Martin Luther King, but you can use any text you want:
- Copy and paste the text into a plain text file (e.g., ml.txt)
- Save the file
Step 2: Install and load the required packages
Type the R code below to install and load the required packages:

# Install
install.packages("tm")           # for text mining
install.packages("SnowballC")    # for text stemming
install.packages("wordcloud")    # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
Step 3: Text mining
Load the text
The text is loaded using the Corpus() function from the text mining (tm) package. A corpus is a list of documents (in our case, we only have one document).
- We start by importing the text file created in Step 1
text <- readLines(file.choose())
In the example below, I'll load a .txt file hosted on the STHDA website:

# Read the text file from the internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)
- Load the data as a corpus

# Load the data as a corpus
docs <- Corpus(VectorSource(text))

The VectorSource() function creates a corpus from a character vector.
- Inspect the content of the document
inspect(docs)
Text transformation
Transformation is performed using the tm_map() function to replace, for example, special characters in the text.
Replacing "/", "@" and "|" with space:

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
Cleaning the text
The tm_map() function is used to remove unnecessary white space, to convert the text to lower case and to remove common stopwords like "the" and "we". The information value of stopwords is close to zero because they are so common in a language. Removing this kind of word is useful before further analyses. For stopwords, the supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.
I'll also show you how to make your own list of stopwords to remove from the text.
You could also remove numbers and punctuation with the removeNumbers and removePunctuation arguments.
Another important preprocessing step is text stemming, which reduces words to their root form. In other words, this process removes suffixes from words to simplify them and to obtain their common origin. For example, a stemming process reduces the words "moving", "moved" and "moves" to the root word "move".
Note that text stemming requires the package 'SnowballC'.
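If you want to preview what stemming does before applying it to the whole corpus, you can call SnowballC's wordStem() function directly on a few words (a quick sketch; the exact stems depend on the Snowball algorithm):

```r
library(SnowballC)

# Stem a few related word forms with the English Snowball stemmer
stems <- wordStem(c("moving", "moved", "moves"), language = "english")
print(stems)  # "move" "move" "move"
```

This is the same stemmer that tm_map(docs, stemDocument) applies to every word in the corpus.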
The R code below can be used to clean your text:

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
Step 4: Build a term-document matrix
A term-document matrix is a table containing the frequency of the words. Row names are words and column names are documents. The function TermDocumentMatrix() from the text mining package can be used as follows:

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7
Step 5: Generate the word cloud
The importance of words can be illustrated as a word cloud as follows:

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Arguments of the word cloud generator function:
- words: the words to be plotted
- freq: their frequencies
- min.freq: words with frequency below min.freq will not be plotted
- max.words: maximum number of words to be plotted
- random.order: plot words in random order. If false, they will be plotted in decreasing frequency
- rot.per: proportion of words with 90 degree rotation (vertical text)
- colors: color words from least to most frequent. Use, for example, colors = "black" for a single color.
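The colors argument accepts any vector of R colors, so you can inspect an RColorBrewer palette before passing it in (a quick sketch; the palette name and size are simply the ones used in the call above):

```r
library(RColorBrewer)

# "Dark2" is a qualitative palette with a maximum of 8 colors;
# brewer.pal() returns them as hex color codes, ordered for use
# from least to most frequent words
pal <- brewer.pal(8, "Dark2")
print(pal)
length(pal)  # 8
```

Any such vector, or a single color such as "black", can then be supplied as the colors argument of wordcloud().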
Go further
Explore frequent terms and their associations
You can have a look at the frequent terms in the term-document matrix as follows. In the example below, we want to find words that occur at least four times:

findFreqTerms(dtm, lowfreq = 4)
[1] "able" "day" "dream" "every" "faith" "free" "freedom" "let" "mountain" "nation" [11] "one" "ring" "shall" "together" "will"
You can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function. The R code below identifies which words are associated with "freedom" in the I have a dream speech:

findAssocs(dtm, terms = "freedom", corlimit = 0.3)
$freedom
         let         ring  mississippi mountainside        stone        every     mountain       nation
        0.89         0.86         0.34         0.34         0.34         0.32         0.32         0.32
The frequency table of words
head(d, 10)
             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7
Plot word frequencies
The frequencies of the 10 most frequent words are plotted below:

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")