
Build a word cloud using text mining tools of R

TAVISH SRIVASTAVA, MAY 7, 2014

This is what a word cloud of our entire website looks like!

A word cloud is a graphical representation of frequently used words in a collection of
text files. The size of each word in the picture indicates how often that word occurs
in the entire text. By the end of this article, you will be able to make a word cloud
using R on any given set of text files. Such diagrams are very useful when doing text
analytics.

Why do we need text analytics?

Analytics is the science of processing raw information to bring out meaningful insights.
This raw information can come from a variety of sources. For instance, let’s consider a
modern multinational bank that wants to use all the available information to drive the
best strategy. What are the sources of information available to the bank?

1. The bank issues different kinds of products to different customers. This information
is fed to the system and can be used for targeting new customers, servicing existing
customers and forming customer-level strategies.

2. Customers of the bank carry out millions of transactions every day. Information
about where these transactions are done, when they are done and what amounts are
involved helps the bank understand its customers.

There can be other behavioral variables (e.g. cash withdrawal patterns) which provide
the bank with valuable data and help it build an optimal strategy. This analysis gives
the bank a competitive edge over other market players by targeting the right customer
with the right product at the right time. But given that every competitor now uses
similar tools and data, analytics has become more of a hygiene factor than a competitive
edge. To gain the edge back, the bank has to find more sources of data and more
sophisticated tools to handle this data. All the data we have discussed up to this point
is structured data. There are two other types of data the bank can use to derive
insightful information.

1. System data: Consider a teller carrying out a transaction at one of the counters.
Every time the teller completes a transaction, a log is created in the system. This type
of data is called system data. It is enormous in volume, but still not utilized to a
considerable extent in a lot of banks. If we do analyze this data, we can optimize the
number of tellers in a branch or gauge the efficiency of each branch.

2. Unstructured data: Feedback forms with free-text comments, comments on the bank’s
Facebook page, Twitter page, etc. are all examples of unstructured data. This data holds
unique information about customer sentiment. Say the bank launches a product and finds
it very profitable in the first 3 months. But customers who bought the product find it
a really bad choice and start spreading negative word of mouth on social networks and
through feedback channels. If the bank has no way to decode this information, it will
suffer a huge loss, because it will never make a proactive effort to stop the negative
wave against its image. Imagine the kind of power that analyzing such data hands over to
the bank.

Installing required packages on R

Text mining needs some special packages, which might not be pre-installed with your R
software. The tm and SnowballC libraries used below belong to the Natural Language
Processing task view, which you can install through the ctv package. Run the following
commands to install the required packages.

> install.packages("ctv")

> library("ctv")

> install.views("NaturalLanguageProcessing")
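
If installing the whole task view is more than you need, the three packages this article loads (tm, SnowballC and wordcloud) can also be installed one by one from CRAN. The following is only a sketch: the helper name `missing_packages` is made up for illustration, and the install line is left commented out because it needs an internet connection.

```r
# Return the names of packages that are not installed yet.
# `missing_packages` is a hypothetical helper, not part of any package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tm", "SnowballC", "wordcloud"))
print(to_install)

# Uncomment to install whatever is missing (needs an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

An empty result means all three packages are already available.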

Step by step coding on R

The following steps create a word cloud from a set of text files. For simplicity, we
are using files in .txt format.

Step 1 : Identify & create text files to turn into a cloud

The first step is to identify & create the text files from which you want to build the
word cloud. Store these files in the location “./corpus/target”. Make sure that you do
not have any other file in this location. You can use any location for this exercise,
but for simplicity, try it with this location the first time.
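
If you do not have text files at hand yet, a couple of small sample files are enough to follow along. The sketch below writes two invented sentences into a temporary folder; the file names and their content are placeholders, and for the article's setup the folder would be file.path(".", "corpus", "target") instead.

```r
# Create the corpus folder and write two small sample documents into it.
# A temporary directory is used here; the article's location would be
# file.path(".", "corpus", "target").
target_dir <- file.path(tempdir(), "corpus", "target")
dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)

# Sample text, invented purely for illustration.
writeLines("Analytics helps the bank target the right customer.",
           file.path(target_dir, "doc1.txt"))
writeLines("Text mining decodes customer sentiment from feedback.",
           file.path(target_dir, "doc2.txt"))

# Confirm that the folder holds only the files you expect.
list.files(target_dir)
```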

Step 2 : Create a corpus from the collection of text files

The second step is to transform these text files into an R-readable format. The tm
package and other text mining packages operate on a format called a corpus. A corpus is
just a way to store a collection of documents in a format that R can read.

> cname <- file.path(".","corpus","target")

> library(tm)

> docs <- Corpus(DirSource(cname))
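
Before moving on, it can help to check that the corpus holds what you expect. The sketch below builds a throwaway folder with two invented files, reads it with VCorpus (one of tm's corpus constructors, used here in place of Corpus) and looks at the result. The folder name corpus_demo and the sample sentences are assumptions made for the example.

```r
library(tm)

# Build a throwaway folder with two sample files (invented text).
cname <- file.path(tempdir(), "corpus_demo")
dir.create(cname, showWarnings = FALSE)
writeLines("Banks use analytics every day.", file.path(cname, "a.txt"))
writeLines("Customers share feedback online.", file.path(cname, "b.txt"))

# One document per file in the folder.
docs <- VCorpus(DirSource(cname))

length(docs)               # number of documents in the corpus
meta(docs[[1]], "id")      # id of the first document
as.character(docs[[1]])    # raw text of the first document
```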

Step 3 : Data processing on the text files

This is the most critical step in the entire process. Here, we decode the text files by
keeping the keywords that carry the meaning of a sentence or the sentiment of the
author. R makes this step really easy. We will make 3 important transformations.

i. Replace symbols like “/” or “@” with a blank space.

ii. Remove words like “a”, “an”, “the”, “I”, “He” and numbers. This is done to remove
any skewness caused by these commonly occurring words.

iii. Remove punctuation and finally whitespace. Note that we do not replace punctuation
with blanks, because grammatically it is already followed by a blank and replacing it
would leave an additional one.

> library(SnowballC)

> # tm 0.6 and later: wrap non-tm functions in content_transformer()
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

> docs <- tm_map(docs, toSpace, "/")

> docs <- tm_map(docs, toSpace, "@")

> docs <- tm_map(docs, content_transformer(tolower))

> docs <- tm_map(docs, removeWords, stopwords("english"))

> docs <- tm_map(docs, removeNumbers)

> docs <- tm_map(docs, removePunctuation)

> docs <- tm_map(docs, stripWhitespace)
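
To see what each transformation does, you can apply the same functions to a single string, since tm's cleaning functions also accept plain character vectors. This is only a sketch with an invented sentence. It lower-cases the text before removing stop words so that a capitalized “The” is matched as well.

```r
library(tm)

# Sample sentence, invented for illustration.
x <- "The Bank/Branch replied @ 9 AM: the product is GOOD, really!"

x <- gsub("[/@]", " ", x)                  # i.   symbols -> blank space
x <- tolower(x)                            # lower-case so "The" matches "the"
x <- removeWords(x, stopwords("english"))  # ii.  drop common words ...
x <- removeNumbers(x)                      #      ... and numbers
x <- removePunctuation(x)                  # iii. drop punctuation
x <- stripWhitespace(x)                    #      collapse repeated blanks

print(x)
```

Printing x after each line is a quick way to check the effect of a single step on your own text.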

Step 4 : Create structured data from the text file

Now it is time to convert the entire corpus into a structured dataset. Note that all
filler words have already been removed. The conversion is done by the
“DocumentTermMatrix” command in R. Execute the following line in your R session.

> dtm <- DocumentTermMatrix(docs)
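
A document-term matrix has one row per document and one column per distinct term, and each cell holds the number of times that term occurs in that document. The sketch below shows this on two invented one-line documents held in memory rather than read from files.

```r
library(tm)

# Two tiny documents, invented for illustration.
texts <- c("analytics drives strategy",
           "text analytics finds sentiment")
docs <- VCorpus(VectorSource(texts))

# Rows are documents, columns are terms, cells are counts.
dtm <- DocumentTermMatrix(docs)

dim(dtm)        # documents x distinct terms
inspect(dtm)    # prints the counts
Terms(dtm)      # the column names, i.e. the vocabulary
```

Summing each column of this matrix gives the total frequency of a term across all documents, which is what the next step relies on.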

Step 5 : Making the word cloud using the structured form of the data

Once we have the content of the text files in a structured format, we make a matrix of
words and their frequencies. This matrix is finally passed to the function that builds
the word cloud.

> library(wordcloud)

> m <- as.matrix(dtm)

> v <- sort(colSums(m),decreasing=TRUE)

> head(v,14)

> words <- names(v)

> d <- data.frame(word=words, freq=v)

> wordcloud(d$word,d$freq,min.freq=50)

Running this code will give you the required output. The order of the words is
completely random, but the size of each word grows with the frequency of occurrence of
that word in the text files. Our website has the words “Analytics Vidhya” repeated many
times, and hence they appear largest. This diagram directly helps us identify the most
frequently used words in the text files.
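
The wordcloud function only needs a vector of words and a vector of counts, so it also works on a ready-made frequency table. The sketch below uses invented counts; it draws the cloud only when the wordcloud package is installed, and it sets random.order = FALSE so that words are plotted in decreasing frequency rather than in random order.

```r
# Word counts, invented for illustration; in the article these come
# from sorting colSums(as.matrix(dtm)).
d <- data.frame(word = c("analytics", "data", "bank", "customer", "text"),
                freq = c(120, 95, 60, 45, 20),
                stringsAsFactors = FALSE)

# Keep only words that occur at least 50 times (what min.freq does).
frequent <- d[d$freq >= 50, ]
print(frequent)

# Draw the cloud only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)                               # makes the layout repeatable
  wordcloud::wordcloud(d$word, d$freq,
                       min.freq = 50,        # same threshold as above
                       random.order = FALSE) # plot in decreasing frequency
}
```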

End Notes

Text mining deals with the relationships between words and with the sentiment conveyed
by combinations of these words. Structured data has a defined number of variables, and
the analysis is done by finding correlations between these variables. In text mining,
however, relationships have to be found between all the words present in the text. This
is the reason text mining is rarely used in the industry today. R offers tools which
make this analysis much simpler. This article covers only the tip of the iceberg. In one
of the coming articles, we will cover a framework for analyzing unstructured data.

Did you find the article interesting? Have you worked on text mining before? Did you
use R to do text mining, or some other software? Let us know about any other interesting
feature of R used for text mining.


Author
Tavish Srivastava
I am Tavish Srivastava, a post graduate from IIT Madras in Mechanical Engineering. I
have more than two years of work experience in analytics. My experience ranges from
hands-on analytics in a developing country like India to convincing banking partners
with analytical solutions in a mature market like the US. For the last two and a half
years I have contributed to various sales, marketing and recruitment strategies in both
the insurance and banking industries.


11 COMMENTS

Eric J. Christeson says:

MAY 8, 2014 AT 2:49 PM

This is really neat. Thanks for writing about this. One issue I noticed is the ordering
of the normalization. I would switch tolower and stopwords like so:

> docs <- tm_map(docs,tolower)

> docs <- tm_map(docs,removeWords, stopwords("english"))

otherwise any capitalized stop words such as ‘The’ are not removed. I ran it on a
collection of books; many of the sentences must start with ‘The’.

    Tavish Srivastava says:

    MAY 8, 2014 AT 8:53 PM

    Eric,
    Thanks for the input. I re-ran the code multiple times, which eliminated this
    issue. Thanks for bringing this up. I have revised the code in the article.

    Tavish

Avinash says:

SEPTEMBER 1, 2014 AT 11:26 AM

Hi Tavish,

I am planning to do a word cloud on a data set which has words and their respective
word counts. Is it feasible/appropriate to do a word cloud on such data? If so, can you
please guide me?

Thank you

    Kunal Jain says:

    SEPTEMBER 4, 2014 AT 3:32 AM

    Avinash,

    Look at step 5 of the article; I think that is the only thing you need in this
    case.

    Regards,
    Kunal

Ankit Mundada says:

OCTOBER 14, 2014 AT 7:08 AM

I think the article needs an update. The new version of the tm library forces you to use
the content_transformer() method to apply any processing to the corpus other than the
transformations available in tm_map.
I got an error saying “Error: inherits(doc, “TextDocument”) is not TRUE”. From the
following link I got to know that the library has been updated.

http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument

Bob says:

NOVEMBER 13, 2014 AT 1:31 PM

Hi,
I’m currently working with a heavy data set (a .txt file of about 100 MB) and I’m trying
to develop a word cloud as a graphical output of the analysis. But here’s where I keep
running into problems: when I get to the TermDocumentMatrix command, it takes hours, and
it seems like it might have stopped (the little stop sign in the top right corner of the
console view doesn’t disappear and the blue pointer signaling the next row doesn’t
appear either).
Is this hardware related? Or is it due to the size of the file?

Thanks. Cheers.

Nissanka says:

DECEMBER 15, 2014 AT 6:25 AM

Hi Avinash, I am exploring the possibility of creating a word cloud using characters
that are not from the Latin alphabet. Do you have any ideas about this?

Career In Business Analytics | Start Learning Analytics says:

DECEMBER 26, 2014 AT 11:29 AM

[…] 3. http://www.analyticsvidhya.com/blog/2014/05/build-word-cloud-text-mining-tools/ […]

Riza says:

APRIL 29, 2015 AT 6:17 PM

Brilliant work.

Is there any possibility of incorporating sentiment analysis into this process? It
would be very useful if the same corpus could be used to fetch the opinion bias.

Vinod Patidar says:

JUNE 3, 2015 AT 11:57 AM

Hi,

I am looking for a way to count words with the same meaning as a single word in a
corpus in R. Can anyone help?

Thanks

nandhini says:

JULY 31, 2015 AT 4:09 AM

Hi, I want to manually assign the number of occurrences of words. I have a text file
with 200 distinct words (each word on a new line). I have another file with 200
numbers. How can I assign these numbers as the number of occurrences of the words? The
words are a mix of European languages and some of the words are getting broken up. What
should I do to get the complete word?
