This document provides a running example of completing the Week 3 assignment:
A shorter version with fewer comments is available as a script:
sparkMLlibClustering.py
To run these commands in the Cloudera VM, first run the setup script:
setupWeek3.sh
You can then copy and paste these commands into pySpark.
To open pySpark, refer to Week 2 and Week 4 of the Machine Learning course.
Note that your dataset may be different from what is used here, so your results may not match those shown here.
Finally, make sure that your working directory contains the data files (.csv) for the fastest support.
We recommend working in your home directory (type cd ~ in your terminal). Please run any scripts using your terminal for proper settings.
In[1]:
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
Step 1: Attribute Selection
Import Data
First, let us read the contents of the file adclicks.csv. The following commands read the CSV file into a table and remove any extra whitespace from the column names. So, if the CSV header contained ' userid ', it becomes 'userid'.
Note that you must change the path to adclicks.csv to its location on your machine if you want to run this command on your machine.
In[2]:
adclicksDF = pd.read_csv('./adclicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())  # remove whitespace from headers
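To see what the header clean-up does, here is a minimal sketch on an in-memory CSV. The sample text and its column names are made up for illustration; only read_csv, rename and strip are taken from the cell above.

```python
import io
import pandas as pd

# Hypothetical CSV text whose header names carry stray spaces.
csv_text = u" userId , adCategory \n876,movies\n1935,games\n"

raw = pd.read_csv(io.StringIO(csv_text))
raw_columns = list(raw.columns)        # names are still padded with spaces

clean = raw.rename(columns=lambda x: x.strip())
clean_columns = list(clean.columns)    # padding removed

print(raw_columns)
print(clean_columns)
```

Without the rename step, a later lookup such as clean['userId'] would fail on the padded name.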
Let us display the first 5 lines of adclicksDF:
In[3]:
adclicksDF.head(n=5)
Out[3]:
             timestamp  txId  userSessionId        userId
0  2016-05-30 14:24:03  6616           6289   24     876   29   movies
1  2016-05-30 14:24:47  6624           6144   29    1935        games
2  2016-05-30 14:26:34  6628           6536   20    1588   25   movies
3  2016-05-30 14:26:50  6618           6518   21    1195   19  fashion
4  2016-05-30 14:27:06  6629           6072  146    1685   20    games
In[4]:
adclicksDF['adCount'] = 1  # assumed value (the right-hand side is missing in the source): one ad click per row
Let us display the first 5 lines of adclicksDF and see if a new column has been added:
In[5]:
adclicksDF.head(n=5)
Out[5]:
             timestamp  txId  userSessionId        userId
0  2016-05-30 14:24:03  6616           6289   24     876   29   movies
1  2016-05-30 14:24:47  6624           6144   29    1935        games
2  2016-05-30 14:26:34  6628           6536   20    1588   25   movies
3  2016-05-30 14:26:50  6618           6518   21    1195   19  fashion
4  2016-05-30 14:27:06  6629           6072  146    1685   20    games
Note that you must change the path to buyclicks.csv to its location on your machine if you want to run this command on your machine.
In[6]:
buyclicksDF = pd.read_csv('./buyclicks.csv')
buyclicksDF = buyclicksDF.rename(columns=lambda x: x.strip())  # remove whitespace from headers
Let us display the first 5 lines of buyclicksDF:
In[7]:
buyclicksDF.head(n=5)
Out[7]:
             timestamp  txId  userSessionId        userId
0  2016-05-30 13:51:50  6587           6368   29     1422
1  2016-05-30 13:51:50  6588           6440  112     2253
2  2016-05-30 13:51:50  6589           6420   48     1393
3  2016-05-30 14:21:50  6631           6495  141     2295
4  2016-05-30 14:21:50  6632           6111  119     1560
Feature Selection
For this exercise, we can choose from buyclicksDF the 'price' of each app that a user purchases as an attribute that captures the user's purchasing behavior. The following command selects 'userId' and 'price' and drops all other columns that we do not want to use at this stage.
In[8]:
userPurchases = buyclicksDF[['userId', 'price']]  # select only userId and price
userPurchases.head(n=5)
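As a small illustration of the selection syntax, the sketch below uses a made-up table (the 'buyId' column and all values are hypothetical). Indexing with a list of names returns a new table holding only those columns, while indexing with a single name returns one column.

```python
import pandas as pd

# Hypothetical purchase records; 'buyId' stands in for any unused column.
purchases = pd.DataFrame({
    'userId': [1422, 2253, 1393],
    'buyId': [4, 5, 2],
    'price': [3.0, 10.0, 3.0],
})

selected = purchases[['userId', 'price']]  # list of names -> DataFrame
single = purchases['price']                # single name -> Series

print(list(selected.columns))
print(type(single).__name__)
```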
Out[8]:
userId price
0 1422
1 2253
2 1393
3 2295
4 1560
Similarly, from adclicksDF, we will use 'adCount' as an attribute that captures the user's inclination to click on ads. The following command selects 'userId' and 'adCount' and drops all other columns that we do not want to use at this stage.
In[9]:
useradClicks = adclicksDF[['userId', 'adCount']]
In[10]:
useradClicks.head(n=5)  # as we saw before, this line displays the first five lines
Out[10]:
userId adCount
0 876
1 1935
2 1588
3 1195
4 1685
Step 2: Training Data Set Creation
Create the first aggregate feature for clustering
if you choose a different feature and your data set already provides the necessary
information.
In the end, we want to get one row per user, if we are performing clustering over users.
In[11]:
adsPerUser = useradClicks.groupby('userId').sum()
adsPerUser = adsPerUser.reset_index()
adsPerUser.columns = ['userId', 'totalAdClicks']  # rename the columns
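The aggregation can be tried on a toy click log. This sketch assumes that every row of the click table stands for one ad click and that adCount holds 1 on every row; the user ids are invented. Under that assumption, summing adCount per user is the same as counting that user's rows.

```python
import pandas as pd

# Hypothetical click log: one row per ad click, adCount assumed to be 1.
clicks = pd.DataFrame({
    'userId': [9, 1, 9, 5, 1, 9],
    'adCount': [1, 1, 1, 1, 1, 1],
})

totals = clicks.groupby('userId').sum().reset_index()
totals.columns = ['userId', 'totalAdClicks']

# Counting rows per user gives the same numbers.
row_counts = clicks.groupby('userId').size()

print(totals)
```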
Let us display the first 5 lines of 'adsPerUser' to see if there is a column named 'totalAdClicks' containing total ad clicks per user.
In[12]:
adsPerUser.head(n=5)
Out[12]:
   userId  totalAdClicks
0       1             42
1       5
2       9             17
3      11
4      14             40
Create the second aggregate feature for clustering
Similar to what we did for adclicks, here we find out how much money in total each user spent on in-app purchases. As an example, let's pick a user with userid = 9. To find the total money spent by this user, we have to find each row that contains userid = 9 and report the sum of the 'price' column over the products they purchased. The following commands sum the total money spent by each user and rename the columns to 'userId' and 'revenue'.
Note: you can also use other aggregates, such as the sum of money spent on a specific ad category by a user or on a set of ad categories by each user, game clicks per hour by each user, etc. You are free to use any mathematical operations on the fields provided in the CSV files when creating features.
In[13]:
revenuePerUser = userPurchases.groupby('userId').sum()
revenuePerUser = revenuePerUser.reset_index()
revenuePerUser.columns = ['userId', 'revenue']  # rename the columns
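As the note above says, the sum is only one possible aggregate. The following sketch computes several per-user aggregates in one pass. The purchase values and the feature names 'avgPrice' and 'numPurchases' are hypothetical and not part of the assignment.

```python
import pandas as pd

# Hypothetical purchases; the values are invented for illustration.
purchases = pd.DataFrame({
    'userId': [1, 1, 5, 9, 9, 9],
    'price': [20.0, 12.0, 3.0, 5.0, 3.0, 2.0],
})

# One row per user with three aggregates of 'price'.
features = (purchases.groupby('userId')['price']
            .agg(['sum', 'mean', 'count'])
            .reset_index())
features.columns = ['userId', 'revenue', 'avgPrice', 'numPurchases']

print(features)
```

Any of these columns could be used as a clustering feature in place of, or next to, 'revenue'.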
In[14]:
revenuePerUser.head(n=5)
Out[14]:
   userId  revenue
0       1       32
1       5
2       9       10
3      14       26
4      17       25
Merge the two tables
Let's see what we have so far. We have a table called revenuePerUser, where each row contains the total money a user (with that 'userid') has spent. We also have another table called adsPerUser, where each row contains the total number of ads a user has clicked. We will use revenuePerUser and adsPerUser as features/attributes to capture our users' behavior.
Let us combine these two attributes (features) so that each row contains both attributes per user. Let's merge these two tables to get one single table we can use for K-Means clustering.
In[15]:
combinedDF = adsPerUser.merge(revenuePerUser, on='userId')  # userId, totalAdClicks, revenue
choose, you may not need to merge tables. You may get all your attributes from a
singletable.
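One detail of merge is worth knowing: by default it performs an inner join, so a user who appears in only one of the two tables is dropped. The sketch below shows this on two small made-up tables and contrasts it with how='outer'.

```python
import pandas as pd

# Hypothetical per-user tables: user 11 has no purchases, user 17 has no ad clicks.
ads = pd.DataFrame({'userId': [1, 9, 11], 'totalAdClicks': [42, 17, 8]})
revenue = pd.DataFrame({'userId': [1, 9, 17], 'revenue': [32.0, 10.0, 25.0]})

inner = ads.merge(revenue, on='userId')               # default how='inner'
outer = ads.merge(revenue, on='userId', how='outer')  # keeps every user

print(inner)
print(outer)
```

The outer join leaves missing values for users absent from one table, and those would have to be filled (for example with 0) before clustering.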
In[16]:
combinedDF.head(n=5)  # display how the merged table looks
Out[16]:
   userId  totalAdClicks  revenue
0       1             42       32
1       5
2       9             17       10
3      14             40       26
4      17             50       25
Create the final training dataset
Our training data set is almost ready. At this stage we can remove the 'userid' from each row,
since 'userid' is a computer generated random number assigned to each user. It does not
In[17]:
trainingDF = combinedDF[['totalAdClicks', 'revenue']]
trainingDF.head(n=5)
Out[17]:
   totalAdClicks  revenue
0             42       32
1              4
2             17       10
3             40       26
4             50       25
Display the dimensions of the training dataset
To display the dimensions of trainingDF, simply add .shape as a suffix and hit enter.
In[18]:
trainingDF.shape
Out[18]:
(832, 2)
The following commands convert the table we created into a format that can be understood by the KMeans.train function.
line[0] refers to the first column. line[1] refers to the second column. If you have more than 2 columns in your training table, modify this command by adding line[2], line[3], line[4], ...
In[19]:
from pyspark.sql import SQLContext  # explicit import, in case SQLContext is not already in scope
sqlContext = SQLContext(sc)
pDF = sqlContext.createDataFrame(trainingDF)
parsedData = pDF.rdd.map(lambda line: array([line[0], line[1]]))  # totalAdClicks, revenue
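If your training table has more than two columns, listing every index by hand gets tedious. The sketch below runs without Spark and uses plain tuples as a stand-in for the rows of pDF.rdd. The third column 'avgPrice' and all values are hypothetical. It converts a whole row at once with array(line). Spark rows behave like tuples, so the same expression should work inside rdd.map, but treat that as an assumption to check on your Spark version.

```python
import pandas as pd
from numpy import array

# Hypothetical training table with three feature columns.
training = pd.DataFrame({
    'totalAdClicks': [42, 17, 40],
    'revenue': [32.0, 10.0, 26.0],
    'avgPrice': [16.0, 3.3, 13.0],
})
training = training[['totalAdClicks', 'revenue', 'avgPrice']]  # fix the column order

# Local stand-in for the rows of pDF.rdd: one plain tuple per table row.
rows = list(training.itertuples(index=False, name=None))

# Convert each row to a feature vector without naming line[0], line[1], ...
vectors = [array(line) for line in rows]

print(vectors[0])
```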
Step 3: Train to Create Cluster Centers
Train KMeans model
Here we are creating two clusters, as denoted in the second argument.
In[20]:
my_kmmodel = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")
/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/mllib/clustering.py:176: UserWarning: Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.
  "Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.")
Display the centers of the two clusters formed
In[21]:
print(my_kmmodel.centers)
[array([42.05442177, 113.02040816]), array([29.43211679, 24.21021898])]
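To give a feel for how such centers come about, here is a minimal NumPy sketch of the textbook k-means loop (Lloyd's algorithm) with random initialization. It is not MLlib's implementation; the function name kmeans_sketch and the six data points are made up for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, max_iterations=10, seed=0):
    """Plain Lloyd's algorithm; an illustration, not MLlib's implementation."""
    rng = np.random.RandomState(seed)
    # Random initialization: pick k distinct points as starting centers.
    centers = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center for every point.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

# Hypothetical (totalAdClicks, revenue) rows forming two obvious groups.
points = np.array([[30., 20.], [28., 25.], [31., 27.],
                   [42., 110.], [40., 115.], [44., 114.]])
centers, labels = kmeans_sketch(points, 2)
print(centers)
print(labels)
```

Each returned center is the mean of the points assigned to it, which is why the centers can be read as a "typical" player of each cluster.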
Step 4: Recommend Actions
Analyze the cluster centers
Each array denotes the center for a cluster:
One cluster is centered at ... array([29.43211679, 24.21021898])
The other cluster is centered at ... array([42.05442177, 113.02040816])
In one cluster, players in general click on ads more often (~1.4 times) and spend more money (~4.7 times) on in-app purchases than players in the other cluster. Assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items, we can use this information to increase the game's revenue by increasing the prices for ads we show to the frequent clickers, and by charging higher fees for hosting the in-app purchase items shown to the higher-revenue-generating buyers.
Note: This analysis requires you to compare the cluster centers and find any significant differences in the corresponding feature values of the centers. The answer to this question will depend on the features you have chosen.
Some features help distinguish the clusters remarkably well, while others may not tell you much. At this point, if you don't find clear distinguishing patterns, rerunning the clustering model with different numbers of clusters and revising the features you picked would be a good idea.
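When trying different numbers of clusters, one common yardstick is the within-cluster sum of squared errors: the total squared distance from each point to its nearest center. The sketch below computes it with NumPy on made-up points and hand-picked centers; the function name wssse is hypothetical. MLlib's KMeansModel also offers a computeCost method for the same quantity, though its availability on your Spark version is an assumption to verify.

```python
import numpy as np

def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((distances.min(axis=1) ** 2).sum())

# Hypothetical (totalAdClicks, revenue) rows and hand-picked candidate centers.
points = np.array([[30., 20.], [28., 25.], [31., 27.],
                   [42., 110.], [40., 115.], [44., 114.]])
one_center = points.mean(axis=0, keepdims=True)
two_centers = np.array([points[:3].mean(axis=0), points[3:].mean(axis=0)])

cost_k1 = wssse(points, one_center)
cost_k2 = wssse(points, two_centers)
print(cost_k1, cost_k2)
```

The cost always shrinks as clusters are added, so look for the point where adding another cluster stops giving a large drop rather than for the smallest value.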