
Welcome to Week 3 of the Big Data Capstone

This document provides a running example of completing the Week 3 assignment.

A shorter version with fewer comments is available as a script: sparkMLlibClustering.py

To run these commands in the Cloudera VM, first run the setup script: setupWeek3.sh

You can then copy and paste these commands into pySpark.

To open pySpark, refer to Week 2 and Week 4 of the Machine Learning course.

Note that your dataset may be different from what is used here, so your results may not match those shown here.

Finally, make sure that your working directory contains the data files (.csv) for the fastest support. We recommend working in your home directory (type cd ~ in your terminal). Please run any scripts using your terminal for proper settings.

In [1]:
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

Step 1: Attribute Selection
Import Data
First, let us read the contents of the file adclicks.csv. The following commands read the CSV file into a table and remove any extra whitespace from the column headers. So, if the CSV contained ' userid ', it becomes 'userid'.
Note that you must change the path to adclicks.csv to its location on your machine if you want to run this command on your machine.
In [2]:
adclicksDF = pd.read_csv('./adclicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())  # remove whitespace from headers
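
To see the header clean-up in isolation, here is a minimal sketch that needs no data file. The in-memory CSV text and its values are made up for illustration; only read_csv and rename mirror the commands above.

```python
import io
import pandas as pd

# Hypothetical CSV text whose headers carry stray spaces (not the real adclicks.csv).
raw = " userId , adId \n876,29\n1935,7\n"

df = pd.read_csv(io.StringIO(raw))
print(list(df.columns))   # the header names still include the spaces here

# Same clean-up as above: strip whitespace from every column name.
df = df.rename(columns=lambda x: x.strip())
print(list(df.columns))
```

Without this step, a later lookup such as df['userId'] would fail, because the stored column name would still contain the surrounding spaces.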

Let us display the first 5 lines of adclicksDF:
In [3]:
adclicksDF.head(n=5)

Out[3]:
             timestamp  txId  userSessionId  teamId  userId  adId adCategory
0  2016-05-30 14:24:03  6616           6289      24     876    29     movies
1  2016-05-30 14:24:47  6624           6144      29    1935            games
2  2016-05-30 14:26:34  6628           6536      20    1588    25     movies
3  2016-05-30 14:26:50  6618           6518      21    1195    19    fashion
4  2016-05-30 14:27:06  6629           6072     146    1685    20      games

Next, we are going to add an extra column to the adclicks table and make it equal to 1. We do so to record the fact that each row is 1 ad click. You will see how this becomes useful when we sum up this column to find how many ads a user clicked.

In [4]:
adclicksDF['adCount'] = 1  # every row is one ad click

Let us display the first 5 lines of adclicksDF and see if a new column has been added:
In [5]:
adclicksDF.head(n=5)

Out[5]:
             timestamp  txId  userSessionId  teamId  userId  adId adCategory  adCount
0  2016-05-30 14:24:03  6616           6289      24     876    29     movies        1
1  2016-05-30 14:24:47  6624           6144      29    1935            games        1
2  2016-05-30 14:26:34  6628           6536      20    1588    25     movies        1
3  2016-05-30 14:26:50  6618           6518      21    1195    19    fashion        1
4  2016-05-30 14:27:06  6629           6072     146    1685    20      games        1

Next, let us read the contents of the file buyclicks.csv. As before, the following commands read the CSV file into a table and remove any extra whitespace from the column headers. So, if the CSV contained ' userid ', it becomes 'userid'.

Note that you must change the path to buyclicks.csv to its location on your machine if you want to run this command on your machine.
In [6]:
buyclicksDF = pd.read_csv('./buyclicks.csv')
buyclicksDF = buyclicksDF.rename(columns=lambda x: x.strip())  # remove whitespace from headers

Let us display the first 5 lines of buyclicksDF:
In [7]:
buyclicksDF.head(n=5)


Out[7]:
             timestamp  txId  userSessionId  team  userId  buyId  price
0  2016-05-30 13:51:50  6587           6368    29    1422
1  2016-05-30 13:51:50  6588           6440   112    2253
2  2016-05-30 13:51:50  6589           6420    48    1393
3  2016-05-30 14:21:50  6631           6495   141    2295
4  2016-05-30 14:21:50  6632           6111   119    1560

Feature Selection

For this exercise, we can choose from buyclicksDF the 'price' of each app that a user purchases as an attribute that captures the user's purchasing behavior. The following command selects 'userId' and 'price' and drops all other columns that we do not want to use at this stage.
In [8]:
userPurchases = buyclicksDF[['userId', 'price']]  # select only userId and price
userPurchases.head(n=5)

Out[8]:
   userId  price
0    1422
1    2253
2    1393
3    2295
4    1560

Similarly, from adclicksDF we will use 'adCount' as an attribute that captures the user's inclination to click on ads. The following command selects 'userId' and 'adCount' and drops all other columns that we do not want to use at this stage.
In [9]:
useradClicks = adclicksDF[['userId', 'adCount']]

In [10]:
useradClicks.head(n=5)  # as we saw before, this line displays the first five lines

Out[10]:
   userId  adCount
0     876        1
1    1935        1
2    1588        1
3    1195        1
4    1685        1

Step 2: Training Data Set Creation
Create the first aggregate feature for clustering

From each of these single ad clicks per row, we can now generate the total ad clicks per user. Let's pick a user with userId = 3. To find out how many ads this user has clicked overall, we have to find each row that contains userId = 3 and report the total number of such rows. The following commands sum the total number of ads per user and rename the columns to 'userId' and 'totalAdClicks'.

Note that you may not need to aggregate (e.g. sum over many rows) if you choose a different feature and your data set already provides the necessary information. In the end, we want to get one row per user if we are performing clustering over users.

In [11]:
adsPerUser = useradClicks.groupby('userId').sum()
adsPerUser = adsPerUser.reset_index()
adsPerUser.columns = ['userId', 'totalAdClicks']  # rename the columns
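
The aggregation can be checked by hand on a tiny table. The following sketch uses a made-up click log (five rows, two users) rather than the real data, and applies the same groupby, sum and rename steps.

```python
import pandas as pd

# Hypothetical click log: one row per ad click, adCount fixed at 1.
clicks = pd.DataFrame({
    'userId':  [3, 7, 3, 3, 7],
    'adCount': [1, 1, 1, 1, 1],
})

# Summing the column of ones per user counts that user's rows.
per_user = clicks.groupby('userId').sum()
per_user = per_user.reset_index()               # turn the userId index back into a column
per_user.columns = ['userId', 'totalAdClicks']  # rename the columns
print(per_user)
```

User 3 appears in three rows and user 7 in two, so the result has one row per user with totals 3 and 2.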

Let us display the first 5 lines of 'adsPerUser' to see if there is a column named 'totalAdClicks' containing the total ad clicks per user.
In [12]:
adsPerUser.head(n=5)

Out[12]:
   userId  totalAdClicks
0       1             42
1       5              4
2       9             17
3      11
4      14             40

Create the second aggregate feature for clustering
Similar to what we did for adclicks, here we find out how much money in total each user spent on in-app purchases. As an example, let's pick a user with userId = 9. To find the total money spent by this user, we have to find each row that contains userId = 9 and report the sum of the 'price' column over the products they purchased. The following commands sum the total money spent by each user and rename the columns to 'userId' and 'revenue'.

Note: you can also use other aggregates, such as the sum of money spent on a specific ad category by a user, or on a set of ad categories by each user, game clicks per hour by each user, etc. You are free to use any mathematical operations on the fields provided in the CSV files when creating features.

In [13]:
revenuePerUser = userPurchases.groupby('userId').sum()
revenuePerUser = revenuePerUser.reset_index()
revenuePerUser.columns = ['userId', 'revenue']  # rename the columns

In [14]:
revenuePerUser.head(n=5)

Out[14]:
   userId  revenue
0       1       32
1       5
2       9       10
3      14       26
4      17       25

Merge the two tables
Let's see what we have so far. We have a table called revenuePerUser, where each row contains the total money a user (with that 'userId') has spent. We also have another table called adsPerUser, where each row contains the total number of ads a user has clicked. We will use revenuePerUser and adsPerUser as features / attributes to capture our users' behavior.

Let us combine these two attributes (features) so that each row contains both attributes per user. Let's merge these two tables to get one single table we can use for K-Means clustering.

In [15]:
combinedDF = adsPerUser.merge(revenuePerUser, on='userId')  # userId, totalAdClicks, revenue

Let us display the first 5 lines of the merged table.

Note: Depending on what attributes you choose, you may not need to merge tables. You may get all your attributes from a single table.

In [16]:
combinedDF.head(n=5)  # display how the merged table looks

Out[16]:
   userId  totalAdClicks  revenue
0       1             42       32
1       5              4
2       9             17       10
3      14             40       26
4      17             50       25
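
Notice that userId 11 appears in adsPerUser but not in the merged table. A likely reason is that merge performs an inner join by default, so only users present in both tables are kept. The sketch below illustrates this on two small made-up tables; the values are hypothetical and the how='outer' variant is shown only for comparison.

```python
import pandas as pd

# Hypothetical per-user tables: user 11 has ad clicks but no purchases,
# user 17 has purchases but no ad clicks.
ads = pd.DataFrame({'userId': [1, 5, 11], 'totalAdClicks': [42, 4, 8]})
revenue = pd.DataFrame({'userId': [1, 5, 17], 'revenue': [32, 6, 25]})

inner = ads.merge(revenue, on='userId')               # default: how='inner'
outer = ads.merge(revenue, on='userId', how='outer')  # keeps every user, fills gaps with NaN

print(inner)
print(len(inner), len(outer))
```

For clustering, the inner join is usually the convenient choice, because every remaining row has a value for both features.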

Create the final training dataset
Our training data set is almost ready. At this stage we can remove the 'userId' from each row, since 'userId' is a computer-generated random number assigned to each user. It does not capture any behavioral aspect of a user. One way to drop the 'userId' is to select the other two columns.

In [17]:
trainingDF = combinedDF[['totalAdClicks', 'revenue']]
trainingDF.head(n=5)

Out[17]:
   totalAdClicks  revenue
0             42       32
1              4
2             17       10
3             40       26
4             50       25
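
Selecting the columns to keep is one way to remove 'userId'. Another is to name the column to remove. The sketch below compares both on a three-row stand-in table (hypothetical variable names; the values echo rows shown above).

```python
import pandas as pd

# Small stand-in for combinedDF.
combined = pd.DataFrame({'userId': [1, 9, 14],
                         'totalAdClicks': [42, 17, 40],
                         'revenue': [32, 10, 26]},
                        columns=['userId', 'totalAdClicks', 'revenue'])

selected = combined[['totalAdClicks', 'revenue']]  # approach used in this document
dropped = combined.drop('userId', axis=1)          # alternative: drop the unwanted column

print(dropped.equals(selected))
```

Both give the same table. Selecting is safer when the source table may gain extra columns later; dropping is shorter when you keep almost everything.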

Display the dimensions of the training dataset
To display the dimensions of trainingDF, simply add .shape as a suffix and hit enter.

In [18]:
trainingDF.shape

Out[18]:
(832, 2)

The following commands convert the table we created into a format that can be understood by the KMeans.train function.

line[0] refers to the first column and line[1] to the second. If you have more than 2 columns in your training table, modify this command by adding line[2], line[3], line[4], ...

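In [19]:
from pyspark.sql import SQLContext  # SQLContext lives in pyspark.sql
sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the pySpark shell
pDF = sqlContext.createDataFrame(trainingDF)
parsedData = pDF.rdd.map(lambda line: array([line[0], line[1]]))  # totalAdClicks, revenue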
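
To see what each element of parsedData looks like without starting Spark, the same per-row function can be applied locally. This is only an illustration: the three-row table is a stand-in for trainingDF, and a plain Python list stands in for the RDD.

```python
import pandas as pd
from numpy import array

# Hypothetical three-row stand-in for trainingDF.
trainingDF = pd.DataFrame({'totalAdClicks': [42, 17, 40],
                           'revenue': [32, 10, 26]},
                          columns=['totalAdClicks', 'revenue'])

# Same per-row function that the Spark map applies to each line.
to_point = lambda line: array([line[0], line[1]])  # totalAdClicks, revenue

# Local substitute for pDF.rdd.map(...): one NumPy array per row.
parsed = [to_point(row) for row in trainingDF.itertuples(index=False)]
print(parsed[0])
```

Each training point is thus a NumPy array with one entry per feature, in column order.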

Step 3: Train to Create Cluster Centers
Train KMeans model
Here we are creating two clusters, as denoted by the second argument.
In [20]:
my_kmmodel = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")

/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/mllib/clustering.py:176: UserWarning: Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.
  "Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.")

Display the centers of two clusters formed

In [21]:
print(my_kmmodel.centers)

[array([42.05442177, 113.02040816]), array([29.43211679, 24.21021898])]
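
For intuition about how such centers come about, here is a minimal NumPy sketch of the basic k-means loop (assign each point to its nearest center, then move each center to the mean of its points). It is not MLlib's implementation; the function names, the seed parameter and the six data points are all made up for illustration.

```python
import numpy as np

def assign(points, centers):
    """Return, for every point, the index of its nearest center."""
    # Pairwise Euclidean distances, shape (n_points, n_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, max_iterations=10, seed=0):
    """Basic k-means with random initialization (illustrative only)."""
    rng = np.random.RandomState(seed)
    # Random initialization: use k distinct data points as starting centers.
    start = rng.choice(len(points), size=k, replace=False)
    centers = points[start].astype(float)
    for _ in range(max_iterations):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, assign(points, centers)

# Hypothetical (totalAdClicks, revenue) points forming two obvious groups.
points = np.array([[30, 20], [28, 25], [31, 27],
                   [40, 110], [44, 115], [42, 114]], dtype=float)

centers, labels = kmeans_sketch(points, k=2)
print(centers)
print(labels)
```

On this toy data the two centers settle on the means of the low-revenue and high-revenue groups. Which of them is listed first depends on the random start, which is also why the order of the centers printed by MLlib carries no meaning.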

Step 4: Recommend Actions
Analyze the cluster centers

Each array denotes the center for a cluster:

One cluster is centered at ... array([29.43211679, 24.21021898])
Other cluster is centered at ... array([42.05442177, 113.02040816])

The first number (field 1) in each array is the number of ad clicks and the second number (field 2) is the revenue per user. Compare the 1st number of each cluster to see how differently users in each cluster behave when it comes to clicking ads. Compare the 2nd number of each cluster to see how differently users in each cluster behave when it comes to buying stuff.

In one cluster, in general, players click on ads more often (~1.4 times) and spend much more money (~4.7 times) on in-app purchases. Assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items, we can use this information to increase the game's revenue by increasing the prices for ads we show to the frequent clickers, and by charging higher fees for hosting the in-app purchase items shown to the higher-revenue-generating buyers.
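
The two multipliers quoted above come from dividing the corresponding fields of the two centers. A short check, using the center values printed earlier (the variable names are ours):

```python
# Cluster centers as printed above: [totalAdClicks, revenue].
low = [29.43211679, 24.21021898]
high = [42.05442177, 113.02040816]

click_ratio = high[0] / low[0]    # how many times more ad clicks
revenue_ratio = high[1] / low[1]  # how many times more money spent

print(round(click_ratio, 2), round(revenue_ratio, 2))
```

Both ratios agree with the rounded ~1.4 and ~4.7 figures. With your own data set and features, the same division gives the numbers to report.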

Note: This analysis requires you to compare the cluster centers and find any significant differences in the corresponding feature values of the centers. The answer to this question will depend on the features you have chosen.
Some features help distinguish the clusters remarkably, while others may not tell you much. At this point, if you don't find clear distinguishing patterns, rerunning the clustering model with a different number of clusters and revising the features you picked would be a good idea.
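
When comparing runs with different numbers of clusters, one common yardstick is the within-set sum of squared errors (WSSSE): the squared distance from each point to its nearest center, summed over all points. The sketch below computes it in plain NumPy on made-up points; the wssse helper is hypothetical. In the MLlib versions we are aware of, KMeansModel also offers a computeCost method that does this on an RDD, but please verify that for your Spark version.

```python
import numpy as np

def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

# Hypothetical (totalAdClicks, revenue) points forming two groups.
points = np.array([[30, 20], [28, 25], [31, 27],
                   [40, 110], [44, 115], [42, 114]], dtype=float)

# One cluster: the best single center is the overall mean.
one_center = points.mean(axis=0, keepdims=True)
# Two clusters: one center per group (first three rows, last three rows).
two_centers = np.array([points[:3].mean(axis=0), points[3:].mean(axis=0)])

cost_one = wssse(points, one_center)
cost_two = wssse(points, two_centers)
print(cost_one, cost_two)
```

The cost always falls as clusters are added, so look for the point where the drop becomes small rather than for the lowest value.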
