RDataMining Slides Twitter Analysis

Twitter Data Analysis with R Text Mining and
Social Network Analysis 1
Yanchang Zhao
http://www.RDataMining.com
Short Course on R and Data Mining

University of Canberra
7 October 2016
1
Chapter 10: Text Mining, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
1 / 40
Outline
Introduction
Tweets Analysis
Extracting Tweets
Text Cleaning
Frequent Words and Word Cloud
Word Associations
Topic Modelling
Sentiment Analysis
Followers and Retweeting Analysis
Follower Analysis
Retweeting Analysis
R Packages
References and Online Resources
2 / 40
Twitter
I An online social networking service that enables users to send

and read short 140-character messages called tweets
(Wikipedia)
I Over 300 million monthly active users (as of 2015)
I Creating over 500 million tweets per day
3 / 40
RDataMining Twitter Account
I @RDataMining: focuses on R and Data Mining

I 600+ tweets/retweets (as of October 2016)
I 2,700+ followers
4 / 40
Techniques and Tools
I Techniques
I Text mining
I Topic modelling
I Sentiment analysis
I Social network analysis
I Tools
I Twitter API
I R and its packages:
I twitteR
I tm
I topicmodels
I sentiment140
I igraph
5 / 40
Process
1. Extract tweets and followers from the Twitter website with R

and the twitteR package
2. With the tm package, clean text by removing punctuations,
numbers, hyperlinks and stop words, followed by stemming
and stem completion
3. Build a term-document matrix
4. Analyse topics with the topicmodels package
5. Analyse sentiment with the sentiment140 package
6. Analyse following/followed and retweeting relationships with
the igraph package
6 / 40
Outline
Introduction
Tweets Analysis
Extracting Tweets
Text Cleaning
Word Associations
Topic Modelling
Sentiment Analysis
Follower Analysis
Retweeting Analysis
R Packages
7 / 40
Retrieve Tweets
## Option 1: retrieve tweets from Twitter

library(twitteR)
library(ROAuth)
## Twitter authentication
setup_twitter_oauth(consumer_key, consumer_secret, access_token,
access_secret)
## 3200 is the maximum to retrieve
tweets <- userTimeline("RDataMining", n = 3200)
## Option 2: download @RDataMining tweets from RDataMining.com

url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
download.file(url, destfile = "./data/RDataMining-Tweets-20160212.rds")
## load tweets into R
tweets <- readRDS("./data/RDataMining-Tweets-20160212.rds")
Twitter Authentication with OAuth:

Section 3 of http://geoffjentry.hexdump.org/twitteR.pdf
8 / 40
(n.tweet <- length(tweets))
## [1] 448
# convert tweets to a data frame

tweets.df <- twListToDF(tweets)
# tweet #190
tweets.df[190, c("id", "created", "screenName", "replyToSN",
"favoriteCount", "retweetCount", "longitude", "latitude", "text")]
## id created screenName re...

## 190 362866933894352898 2013-08-01 09:26:33 RDataMining ...
## favoriteCount retweetCount longitude latitude
## 190 9 9 NA NA
## ...
## 190 The R Reference Card for Data Mining now provides lin...
# print tweet #190 and make text fit for slide width
writeLines(strwrap(tweets.df$text[190], 60))
## The R Reference Card for Data Mining now provides links to

## packages on CRAN. Packages for MapReduce and Hadoop added.
## http://t.co/RrFypol8kw 9 / 40
Text Cleaning
library(tm)
# build a corpus, and specify the source to be character vectors
myCorpus <- Corpus(VectorSource(tweets.df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
# remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# keep a copy for stem completion later

myCorpusCopy <- myCorpus
10 / 40
2
Stemming and Stem Completion
myCorpus <- tm_map(myCorpus, stemDocument) # stem words
writeLines(strwrap(myCorpus[[190]]$content, 60))
## r refer card data mine now provid link packag cran packag
## mapreduc hadoop ad
stemCompletion2 <- function(x, dictionary) {

x <- unlist(strsplit(as.character(x), " "))
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary=myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
writeLines(strwrap(myCorpus[[190]]$content, 60))
## r reference card data miner now provided link package cran

## package mapreduce hadoop add
2
http://stackoverflow.com/questions/25206049/stemcompletion-is-not-working
11 / 40
Issues in Stem Completion: Miner vs Mining
# count word frequence
wordFreq <- function(corpus, word) {
results <- lapply(corpus,
function(x) { grep(as.character(x), pattern=paste0("\\<",word)) }
)
sum(unlist(results))
}
n.miner <- wordFreq(myCorpusCopy, "miner")
n.mining <- wordFreq(myCorpusCopy, "mining")
cat(n.miner, n.mining)
## 9 104
# replace oldword with newword

replaceWord <- function(corpus, oldword, newword) {
tm_map(corpus, content_transformer(gsub),
pattern=oldword, replacement=newword)
}
myCorpus <- replaceWord(myCorpus, "miner", "mining")
myCorpus <- replaceWord(myCorpus, "universidad", "university")
myCorpus <- replaceWord(myCorpus, "scienc", "science")
12 / 40
Build Term Document Matrix
tdm <- TermDocumentMatrix(myCorpus,

control = list(wordLengths = c(1, Inf)))
tdm
## <<TermDocumentMatrix (terms: 1073, documents: 448)>>

## Non-/sparse entries: 3594/477110
## Sparsity : 99%
## Maximal term length: 23
## Weighting : term frequency (tf)
idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))

as.matrix(tdm[idx, 21:30])
## Docs
## Terms 21 22 23 24 25 26 27 28 29 30
## data 0 1 0 0 1 0 0 0 0 1
## mining 0 0 0 0 1 0 0 0 0 1
## r 1 1 1 1 0 1 0 1 1 1
13 / 40
Top Frequent Terms
# inspect frequent words

(freq.terms <- findFreqTerms(tdm, lowfreq = 20))
## [1] "analysing" "analytics" "australia" "big"

## [5] "canberra" "course" "data" "example"
## [9] "group" "introduction" "learn" "mining"
## [13] "network" "package" "position" "r"
## [17] "rdatamining" "research" "science" "slide"
## [21] "talk" "text" "tutorial" "university"
term.freq <- rowSums(as.matrix(tdm))

term.freq <- subset(term.freq, term.freq >= 20)
df <- data.frame(term = names(term.freq), freq = term.freq)
14 / 40
library(ggplot2)
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip() +
theme(axis.text=element_text(size=7))
university
tutorial
text
talk
slide
science
research
rdatamining
r
position
Terms
package
network
mining
learn
introduction
group
example
data
course
canberra
big
australia
analytics
analysing
0 50 100 150 200
15 / 40
Wordcloud
m <- as.matrix(tdm)
# calculate the frequency of words and sort it by frequency
word.freq <- sort(rowSums(m), decreasing = T)
# colors
pal <- brewer.pal(9, "BuGn")[-(1:4)]
# plot word cloud

library(wordcloud)
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3,
random.order = F, colors = pal)
16 / 40
random website
paper snowfall
sunday
source decision wwwrdataminingcom
track friday search developed management
mode interacting california version facebook
tree edited center guidance forecastingadvanced summit
titled document page check healthcare
member novfunction
postdoc technological official
conference
format sas apache thanks
iselect card classification singapore
learn april
june
credit
knowledge due deadline america
mapreduce postdoctoral
may
university
jan march
load excel process poll start list
link
francisco computational reference project
result area modeling find retrieval
step seminar forest
analytics
risk
engine
melbourne book
intern
make
create twittercan
us machine spatial
case
participation
build top new application research webinar

tutorial
performance
linkedin cluster free recent

mining
associate download
business
regression
tweet
fast kdnuggets package video notes
answers
web
week
event th now language

acm science
data
useful dataset
today
updated
please
open social feb
tool
dr
group
provided cfp time sydney
tuesday
australia simple
short file rule thursday
visualisations
database series available follow

go
neoj keynote
r
job call software high
survey seattle
pm
v example
online outlier workshop dmapps
big analysing
vs
will
looking china oct

task access nd
fit position text statistical
spark share state
nice aug
distributed
coursera
iapa
network talk present slide
senior published
add lab
plot sept extract rdatamining canberra
hadoop
course
ieee graph give dynamic
mid
amazon tricks rstudio little code cran
informal
introduction join
scientist
pls technique august systemmexico
ausdm skills
pdf
massive canada parallel get user close map

studies vacancies program extended
competition lecture large publicimprove
stanford submission kdd sna detection
support visit
experience analyst easier detailed fellow initial
google topic quick
run sigkdd chapter predicting san
sentiment
datacamp
australasian
prof graphical ranked cloud media

various industrial
algorithm handling natural contain
together australian
comment youtube
southern world
17 / 40
Associations
# which words are associated with 'r'?
findAssocs(tdm, "r", 0.2)
## r
## code 0.27
## example 0.21
## series 0.21
## markdown 0.20
## user 0.20
# which words are associated with 'data'?

findAssocs(tdm, "data", 0.2)
## data
## mining 0.48
## big 0.44
## analytics 0.31
## science 0.29
## poll 0.24
18 / 40
Network of Terms
library(graph)
library(Rgraphviz)
plot(tdm, term = freq.terms, corThreshold = 0.1, weighting = T)
talk
tutorial slide
network text rdatamining r
analysing group mining example package science learn
canberra data course university
big research
analytics australia position introduction
19 / 40
Topic Modelling
dtm <- as.DocumentTermMatrix(tdm)
library(topicmodels)
lda <- LDA(dtm, k = 8) # find 8 topics
term <- terms(lda, 7) # first 7 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
## Topic 1
## "r, data, mining, slide, position, series, application"
## Topic 2
## "r, mining, data, big, position, available, text"
## Topic 3
## "data, science, group, poll, kdnuggets, package, software"
## Topic 4
## "r, data, talk, slide, mining, analysing, dataset"
## Topic 5
## "r, mining, package, book, example, slide, analysing"
## Topic 6
## "big, r, mining, network, analysing, statistical, tutorial"
## Topic 7
## "data, r, slide, analytics, research, analysing, workshop"
## Topic 8
20 / 40
## "data, mining, research, canberra, big, event, text"
Topic Modelling
topics <- topics(lda) # 1st topic identified for every document (tweet)
topics <- data.frame(date=as.IDate(tweets.df$created), topic=topics)
ggplot(topics, aes(date, fill = term[topic])) +
geom_density(position = "stack")
0.006
0.004
term[topic]
big, r, mining, network, analysing, statistical, tutorial
data, mining, research, canberra, big, event, text
data, r, slide, analytics, research, analysing, workshop
density
data, science, group, poll, kdnuggets, package, software

r, data, mining, slide, position, series, application
r, data, talk, slide, mining, analysing, dataset
r, mining, data, big, position, available, text
0.002 r, mining, package, book, example, slide, analysing
0.000
2012 2013 2014 2015 2016

date
Another way to plot steam graph:

http://menugget.blogspot.com.au/2013/12/data-mountains-and-streams-stacked-area.html
21 / 40
Sentiment Analysis
# install package sentiment140
require(devtools)
install_github("sentiment140", "okugami79")
# sentiment analysis
library(sentiment)
sentiments <- sentiment(tweets.df$text)
table(sentiments$polarity)
##
## neutral positive
## 428 20
# sentiment plot
sentiments$score <- 0
sentiments$score[sentiments$polarity == "positive"] <- 1
sentiments$score[sentiments$polarity == "negative"] <- -1
sentiments$date <- as.IDate(tweets.df$created)
result <- aggregate(score ~ date, data = sentiments, sum)
plot(result, type = "l")
22 / 40
Outline
Introduction
Tweets Analysis
Extracting Tweets
Text Cleaning
Word Associations
Topic Modelling
Sentiment Analysis
Follower Analysis
Retweeting Analysis
R Packages
23 / 40
Retrieve User Info and Followers
user <- getUser("RDataMining")
user$toDataFrame()
friends <- user$getFriends() # who this user follows
followers <- user$getFollowers() # this user's followers
followers2 <- followers[[1]]$getFollowers() # a follower's followers
## [,1] ...
## description "R and Data Mining. Group on LinkedIn: ...
## statusesCount "583" ...
## followersCount "2376" ...
## favoritesCount "6" ...
## friendsCount "72" ...
## url "http://t.co/LwL50uRmPd" ...
## name "Yanchang Zhao" ...
## created "2011-04-04 09:15:43" ...
## protected "FALSE" ...
## verified "FALSE" ...
## screenName "RDataMining" ...
## location "Australia" ...
## lang "en" ...
## id "276895537" ... 24 / 40
Follower Map3
@RDataMining Followers (#: 2376)
3
Based on Jeff Leeks twitterMap function at
http://biostat.jhsph.edu/~jleek/code/twitterMap.R
25 / 40
Active Influential Followers
M Kautzar Ichramsyah
Prof. Diego Kuonen
10.0 20.0
Christopher D. Long
.................................
Murari Bhartia
#AI PR Girl
Robert Penner
5.0
Roby Zac S.
#Tweets per day
Prithwis Mukerjee Ryan Rosario
Michal Illich
Sharon Machlis
StatsBlogs David Smith Marcel Molina
Mitch Sanders Data Science London
2.0
Daniel D. Gutierrez DataCamp

pavel jaek Derecho Internet
Rahul Kapil
1.0
Statistics Blog

Yichuan Wang Learn R
0.5
biao
Rob J Hyndman RDataMining
Data Mining
Antonio Piccolboni
0.2
Duccio Schiavon
LearnDataAnalysis
5 10 20 50 100
#followers / #friends
26 / 40
Top Retweeted Tweets
# select top retweeted tweets

table(tweets.df$retweetCount)
selected <- which(tweets.df$retweetCount >= 9)
# plot them
dates <- strptime(tweets.df$created, format="%Y-%m-%d")
plot(x=dates, y=tweets.df$retweetCount, type="l", col="grey",
xlab="Date", ylab="Times retweeted")
colors <- rainbow(10)[1:length(selected)]
points(dates[selected], tweets.df$retweetCount[selected],
pch=19, col=colors)
text(dates[selected], tweets.df$retweetCount[selected],
tweets.df$text[selected], col=colors, cex=.9)
27 / 40
Top Retweeted Tweets
Handling and Processing Strings in R an ebook in PDF format, 105 pages. http://t.co/UXnetU7k87
15
A Twitter dataset for text mining: @RDataMining Tweets extracted on 3 February 2016. Download it at https://t.co/lQp94IvfPf
Times retweeted
Free online course on Computing for Data Analysis (with R), to start on 24 Sept 2012 https://t.co/Y617n30y Slides in 8 PDF files on Getting Data from the Web with R http://t.co/epT4Jv07WD
10

Lecture videos of natural language processing course at Stanford University: 18 videos, with each of over 1 hr length http://t.co/VKKdA9Tykm

The R Reference Card for Data Mining now provides links to packages on CRAN. Packages for MapReduce and Hadoop added. http://t.co/RrFypol8kw
5
0
2012 2013 2014 2015 2016
Date
28 / 40
Tracking Message Propagation
tweets[[1]]
retweeters(tweets[[1]]$id)
retweets(tweets[[1]]$id)
## [1] "RDataMining: A Twitter dataset for text mining: @RDa...
## [1] "197489286" "316875164" "229796464" "3316009302"

## [5] "244077734" "16900353" "2404767650" "222061895"
## [9] "11686382" "190569306" "49413866" "187048879"
## [13] "6146692" "2591996912"
## [[1]]
## [1] "bobaiKato: RT @RDataMining: A Twitter dataset for te...
##
## [[2]]
## [1] "VipulMathur: RT @RDataMining: A Twitter dataset for ...
##
## [[3]]
## [1] "tau_phoenix: RT @RDataMining: A Twitter dataset for ...
The tweet potentially reached around 120,000 users. 29 / 40

30 / 40
Outline
Introduction
Tweets Analysis
Extracting Tweets
Text Cleaning
Word Associations
Topic Modelling
Sentiment Analysis
Follower Analysis
Retweeting Analysis
R Packages
31 / 40
R Packages
I Twitter data extraction: twitteR

I Text cleaning and mining: tm
I Word cloud: wordcloud
I Topic modelling: topicmodels, lda
I Sentiment analysis: sentiment140
I Social network analysis: igraph, sna
I Visualisation: wordcloud, Rgraphviz, ggplot2
32 / 40
4
Twitter Data Extraction Package twitteR
I userTimeline, homeTimeline, mentions,
retweetsOfMe: retrive various timelines
I getUser, lookupUsers: get information of Twitter user(s)
I getFollowers, getFollowerIDs: retrieve followers (or
their IDs)
I getFriends, getFriendIDs: return a list of Twitter users
(or user IDs) that a user follows
I retweets, retweeters: return retweets or users who
retweeted a tweet
I searchTwitter: issue a search of Twitter
I getCurRateLimitInfo: retrieve current rate limit
information
I twListToDF: convert into data.frame
4
https://cran.r-project.org/package=twitteR
33 / 40
5
Text Mining Package tm
I removeNumbers, removePunctuation, removeWords,
removeSparseTerms, stripWhitespace: remove numbers,
punctuations, words or extra whitespaces
I removeSparseTerms: remove sparse terms from a
term-document matrix
I stopwords: various kinds of stopwords
I stemDocument, stemCompletion: stem words and
complete stems
I TermDocumentMatrix, DocumentTermMatrix: build a
term-document matrix or a document-term matrix
I termFreq: generate a term frequency vector
I findFreqTerms, findAssocs: find frequent terms or
associations of terms
I weightBin, weightTf, weightTfIdf, weightSMART,
WeightFunction: various ways to weight a term-document
matrix
5
https://cran.r-project.org/package=tm
34 / 40
Topic Modelling and Sentiment Analysis Packages
topicmodels & sentiment140
Package topicmodels 6
I LDA: build a Latent Dirichlet Allocation (LDA) model

I CTM: build a Correlated Topic Model (CTM) model
I terms: extract the most likely terms for each topic
I topics: extract the most likely topics for each document
Package sentiment140 7
I sentiment: sentiment analysis with the sentiment140 API,

tune to Twitter text analysis
6
https://cran.r-project.org/package=topicmodels
7
https://github.com/okugami79/sentiment140
35 / 40
Social Network Analysis and Visualization Package
igraph 8
I degree, betweenness, closeness, transitivity:
various centrality scores
I neighborhood: neighborhood of graph vertices
I cliques, largest.cliques, maximal.cliques,
clique.number: find cliques, ie. complete subgraphs
I clusters, no.clusters: maximal connected components
of a graph and the number of them
I fastgreedy.community, spinglass.community:
community detection
I cohesive.blocks: calculate cohesive blocks
I induced.subgraph: create a subgraph of a graph (igraph)
I read.graph, write.graph: read and writ graphs from and
to files of various formats
8
https://cran.r-project.org/package=igraph
36 / 40
Outline
Introduction
Tweets Analysis
Extracting Tweets
Text Cleaning
Word Associations
Topic Modelling
Sentiment Analysis
Follower Analysis
Retweeting Analysis
R Packages
37 / 40
References
I Yanchang Zhao. R and Data Mining: Examples and Case

Studies. ISBN 978-0-12-396963-7, December 2012. Academic
Press, Elsevier. 256 pages.
http://www.rdatamining.com/docs/RDataMining-book.pdf
I Yanchang Zhao and Yonghua Cen (Eds.). Data Mining
Applications with R. ISBN 978-0124115118, December 2013.
Academic Press, Elsevier.
I Yanchang Zhao. Analysing Twitter Data with Text Mining
and Social Network Analysis. In Proc. of the 11th
Australasian Data Mining Analytics Conference (AusDM
2013), Canberra, Australia, November 13-15, 2013.
38 / 40
Online Resources
I Chapter 10 Text Mining & Chapter 11 Social Network
Analysis, in book R and Data Mining: Examples and Case
Studies
http://www.rdatamining.com/docs/RDataMining.pdf
I RDataMining Reference Card
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
I Online documents, books and tutorials
http://www.rdatamining.com/resources/onlinedocs
I Free online courses
http://www.rdatamining.com/resources/courses
I RDataMining Group on LinkedIn (22,000+ members)
http://group.rdatamining.com
I Twitter (2,700+ followers)
@RDataMining
39 / 40
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
40 / 40

RDataMining Slides Twitter Analysis

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

RDataMining Slides Twitter Analysis

Diunggah oleh

Hak Cipta:

Format Tersedia

Twitter Data Analysis with R Text Mining and

Social Network Analysis 1

Short Course on R and Data Mining

I An online social networking service that enables users to send

I @RDataMining: focuses on R and Data Mining

1. Extract tweets and followers from the Twitter website with R

## Option 1: retrieve tweets from Twitter

## Option 2: download @RDataMining tweets from RDataMining.com

Twitter Authentication with OAuth:

# convert tweets to a data frame

## id created screenName re...

## The R Reference Card for Data Mining now provides links to

# keep a copy for stem completion later

stemCompletion2 <- function(x, dictionary) {

## r reference card data miner now provided link package cran

# replace oldword with newword

tdm <- TermDocumentMatrix(myCorpus,

## <<TermDocumentMatrix (terms: 1073, documents: 448)>>

idx <- which(dimnames(tdm)$Terms %in% c("r", "data", "mining"))

# inspect frequent words

## [1] "analysing" "analytics" "australia" "big"

term.freq <- rowSums(as.matrix(tdm))

# plot word cloud

build top new application research webinar

linkedin cluster free recent

event th now language

open social feb

database series available follow

looking china oct

massive canada parallel get user close map

prof graphical ranked cloud media

# which words are associated with 'data'?

network text rdatamining r

analysing group mining example package science learn

canberra data course university

analytics australia position introduction

data, science, group, poll, kdnuggets, package, software

0.002 r, mining, package, book, example, slide, analysing

2012 2013 2014 2015 2016

Another way to plot steam graph:

Prithwis Mukerjee Ryan Rosario

Daniel D. Gutierrez DataCamp

# select top retweeted tweets

2012 2013 2014 2015 2016

## [1] "RDataMining: A Twitter dataset for text mining: @RDa...

## [1] "197489286" "316875164" "229796464" "3316009302"

The tweet potentially reached around 120,000 users. 29 / 40

I Twitter data extraction: twitteR

I LDA: build a Latent Dirichlet Allocation (LDA) model

I sentiment: sentiment analysis with the sentiment140 API,

I Yanchang Zhao. R and Data Mining: Examples and Case

Anda mungkin juga menyukai