Reading Py Spark ML Lib

Welcome to Week 3 of the Big Data Capstone
This document provides a running example of completing the Week 3 assignment.
A shorter version with fewer comments is available as a script:
sparkMLlibClustering.py
To run these commands in the Cloudera VM, first run the setup script:
setupWeek3.sh
You can then copy and paste these commands into pySpark.
To open pySpark, refer to Week 2 and Week 4 of the Machine Learning course.
Note that your dataset may be different from what is used here, so your results
may not match those shown here.
Finally, make sure that your working directory contains the data files (.csv) for the fastest
support.
We recommend working in your home directory (type cd ~ in your terminal). Please run any
scripts using your terminal for proper settings.
In [1]:
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
Step 1: Attribute Selection
Import Data
First let us read the contents of the file adclicks.csv. The following commands read in the CSV
file in a table format and remove any extra whitespace from the headers. So, if the CSV
contained ' userid ' it becomes 'userid'.
Note that you must change the path to adclicks.csv to the location on your machine, if you want
to run this command on your machine.
In [2]:
adclicksDF = pd.read_csv('./adclicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip())  # remove whitespace from headers
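The header clean-up can be tried without the course data. The following is a minimal sketch,
assuming a small made-up CSV held in memory; the sample rows and the names sample_csv and
sampleDF are hypothetical and are not taken from adclicks.csv.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV whose header cells carry stray spaces.
sample_csv = io.StringIO(" userId , adCategory \n1,games\n2,movies\n")

sampleDF = pd.read_csv(sample_csv)
print(list(sampleDF.columns))   # header names still include the spaces

# Same clean-up as above: strip whitespace from every column name.
sampleDF = sampleDF.rename(columns=lambda x: x.strip())
print(list(sampleDF.columns))
```

Only the header names are stripped here; the cell values themselves are left untouched.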
Let us display the first 5 lines of adclicksDF:
In [3]:
adclicksDF.head(n=5)
Out[3]:
   timestamp            txId  userSessionId  teamId  userId  adId  adCategory
0  2016-05-30 14:24:03  6616  6289           24      876     29    movies
1  2016-05-30 14:24:47  6624  6144           29      1935          games
2  2016-05-30 14:26:34  6628  6536           20      1588    25    movies
3  2016-05-30 14:26:50  6618  6518           21      1195    19    fashion
4  2016-05-30 14:27:06  6629  6072           146     1685    20    games
Next, we are going to add an extra column to the adclicks table and make it equal to 1. We do
so to record the fact that each row is 1 ad click. You will see how this becomes useful when
we sum up this column to find how many ads a user clicked.
In [4]:
adclicksDF['adCount'] = 1
Let us display the first 5 lines of adclicksDF and see if a new column has been added:
In [5]:
adclicksDF.head(n=5)
Out[5]:
   timestamp            txId  userSessionId  teamId  userId  adId  adCategory  adCount
0  2016-05-30 14:24:03  6616  6289           24      876     29    movies      1
1  2016-05-30 14:24:47  6624  6144           29      1935          games       1
2  2016-05-30 14:26:34  6628  6536           20      1588    25    movies      1
3  2016-05-30 14:26:50  6618  6518           21      1195    19    fashion     1
4  2016-05-30 14:27:06  6629  6072           146     1685    20    games       1
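To see why a column of ones is useful, here is a minimal sketch on a made-up click log (the
table clicks and its values are hypothetical): summing the ones for a user gives the same
number as counting that user's rows.

```python
import pandas as pd

# Hypothetical click log: one row per ad click.
clicks = pd.DataFrame({'userId': [3, 7, 3, 3, 7]})
clicks['adCount'] = 1  # every row stands for exactly one click

# Summing the column of ones for a user counts that user's rows.
clicks_user3 = clicks.loc[clicks['userId'] == 3, 'adCount'].sum()
rows_user3 = (clicks['userId'] == 3).sum()
print(clicks_user3, rows_user3)
```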
Next, let us read the contents of the file buyclicks.csv. As before, the following commands read
in the CSV file in a table format and remove any extra whitespace from the headers. So, if the
CSV contained ' userid ' it becomes 'userid'.
Note that you must change the path to buyclicks.csv to the location on your machine, if you
want to run this command on your machine.
In [6]:
buyclicksDF = pd.read_csv('./buyclicks.csv')
buyclicksDF = buyclicksDF.rename(columns=lambda x: x.strip())  # remove whitespace from headers
Let us display the first 5 lines of buyclicksDF:
In [7]:
buyclicksDF.head(n=5)

Out[7]:
   timestamp            txId  userSessionId  team  userId  buyId  price
0  2016-05-30 13:51:50  6587  6368           29    1422
1  2016-05-30 13:51:50  6588  6440           112   2253
2  2016-05-30 13:51:50  6589  6420           48    1393
3  2016-05-30 14:21:50  6631  6495           141   2295
4  2016-05-30 14:21:50  6632  6111           119   1560
Feature Selection
For this exercise, we can choose from buyclicksDF the 'price' of each app that a user
purchases as an attribute that captures the user's purchasing behavior. The following command
selects 'userId' and 'price' and drops all other columns that we do not want to use at this stage.
In [8]:
userPurchases = buyclicksDF[['userId', 'price']]  # select only userId and price
userPurchases.head(n=5)
Out[8]:
   userId  price
0  1422
1  2253
2  1393
3  2295
4  1560
Similarly, from adclicksDF, we will use 'adCount' as an attribute that captures the user's
inclination to click on ads. The following command selects 'userId' and 'adCount' and drops all
other columns that we do not want to use at this stage.
In [9]:
useradClicks = adclicksDF[['userId', 'adCount']]
In [10]:
useradClicks.head(n=5)  # as we saw before, this line displays the first five lines
Out[10]:
   userId  adCount
0  876     1
1  1935    1
2  1588    1
3  1195    1
4  1685    1
Step 2: Training Data Set Creation
Create the first aggregate feature for clustering
From each of these single ad clicks per row, we can now generate total ad clicks per user. Let's
pick a user with userId = 3. To find out how many ads this user has clicked overall, we have to
find each row that contains userId = 3, and report the total number of such rows. The following
commands sum the total number of ads per user and rename the columns to be called 'userId'
and 'totalAdClicks'.
Note that you may not need to aggregate (e.g. sum over many rows)
if you choose a different feature and your data set already provides the necessary
information.
In the end, we want to get one row per user, if we are performing clustering over
users.
In [11]:
adsPerUser = useradClicks.groupby('userId').sum()
adsPerUser = adsPerUser.reset_index()
adsPerUser.columns = ['userId', 'totalAdClicks']  # rename the columns
Let us display the first 5 lines of 'adsPerUser' to see if there is a column named 'totalAdClicks'
containing total ad clicks per user.
In [12]:
adsPerUser.head(n=5)
Out[12]:
   userId  totalAdClicks
0  1       42
1  5
2  9       17
3  11
4  14      40
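The aggregation can be checked on a toy table. The following is a minimal sketch, assuming a
made-up per-click table (sample_clicks and its values are hypothetical); it repeats the three
steps used for adsPerUser and cross-checks the totals against a plain row count per user.

```python
import pandas as pd

# Hypothetical per-click table in the same shape as useradClicks.
sample_clicks = pd.DataFrame({'userId': [1, 5, 1, 9, 1, 5],
                              'adCount': [1, 1, 1, 1, 1, 1]})

# Same three steps as above: sum per user, flatten the index, rename.
per_user = sample_clicks.groupby('userId').sum()
per_user = per_user.reset_index()
per_user.columns = ['userId', 'totalAdClicks']

# Cross-check: counting rows per user gives the same totals.
row_counts = sample_clicks.groupby('userId').size()
print(per_user)
```

The result has exactly one row per user, which is the shape needed for clustering over users.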
Create the second aggregate feature for clustering
Similar to what we did for adclicks, here we find out how much money in total each user
spent on in-app purchases. As an example, let's pick a user with userId = 9. To find out
the total money spent by this user, we have to find each row that contains userId = 9, and report
the sum of the column 'price' of each product they purchased. The following commands sum the
total money spent by each user and rename the columns to be called 'userId' and 'revenue'.
Note that you can also use other aggregates, such as the sum of money spent on a specific ad
category by a user or on a set of ad categories by each user, game clicks per hour by each user,
etc. You are free to use any mathematical operations on the fields provided in the CSV files
when creating features.
In [13]:
revenuePerUser = userPurchases.groupby('userId').sum()
revenuePerUser = revenuePerUser.reset_index()
revenuePerUser.columns = ['userId', 'revenue']  # rename the columns
In [14]:
revenuePerUser.head(n=5)
Out[14]:
   userId  revenue
0  1       32
1  5
2  9       10
3  14      26
4  17      25
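As one illustration of the other aggregates mentioned in the note, the following minimal sketch
computes several per-user features in one pass. It assumes a made-up purchases table; the names
sample_purchases, avgPrice and purchases are hypothetical, and named aggregation needs
pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical purchases in the same shape as userPurchases.
sample_purchases = pd.DataFrame({'userId': [1, 1, 5, 9, 9, 9],
                                 'price':  [3.0, 5.0, 10.0, 1.0, 2.0, 3.0]})

# Several aggregates per user in one pass (named aggregation).
features = (sample_purchases.groupby('userId')['price']
            .agg(revenue='sum', avgPrice='mean', purchases='count')
            .reset_index())
print(features)
```

Whether such extra features help depends on the data; they are candidates to try, not a
recommendation.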
Merge the two tables
Let's see what we have so far. We have a table called revenuePerUser, where each row
contains the total money a user (with that 'userId') has spent. We also have another table called
adsPerUser where each row contains the total number of ads a user has clicked. We will use
revenuePerUser and adsPerUser as features/attributes to capture our users' behavior.
Let us combine these two attributes (features) so that each row contains both attributes per
user. Let's merge these two tables to get one single table we can use for KMeans clustering.
In [15]:
combinedDF = adsPerUser.merge(revenuePerUser, on='userId')  # userId, totalAdClicks, revenue
Let us display the first 5 lines of the merged table.
Note: Depending on what attributes you
choose, you may not need to merge tables. You may get all your attributes from a
single table.
In [16]:
combinedDF.head(n=5)  # display how the merged table looks
Out[16]:
   userId  totalAdClicks  revenue
0  1       42             32
1  5
2  9       17             10
3  14      40             26
4  17      50             25
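Note that merge defaults to an inner join, so a user who appears in only one of the two tables
is dropped. This is consistent with userId 11 being listed for adsPerUser but not in the merged
table. The following minimal sketch shows the difference on made-up tables (ads, rev and all
values are hypothetical):

```python
import pandas as pd

# Hypothetical per-user tables; user 11 clicked ads but bought nothing,
# and user 17 bought items but clicked no ads.
ads = pd.DataFrame({'userId': [1, 9, 11], 'totalAdClicks': [4, 2, 7]})
rev = pd.DataFrame({'userId': [1, 9, 17], 'revenue': [30.0, 12.0, 5.0]})

inner = ads.merge(rev, on='userId')               # default how='inner'
outer = ads.merge(rev, on='userId', how='outer')  # keeps every user, NaN where missing

print(inner['userId'].tolist())
print(outer)
```

An outer join keeps all users but leaves missing values that would need handling before
clustering, so the inner join used above is the simpler choice when both features are required.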
Create the final training dataset
Our training data set is almost ready. At this stage we can remove the 'userId' from each row,
since 'userId' is a computer-generated random number assigned to each user. It does not
capture any behavioral aspect of a user. One way to drop the 'userId' is to select the other two
columns.
In [17]:
trainingDF = combinedDF[['totalAdClicks', 'revenue']]
trainingDF.head(n=5)
Out[17]:
   totalAdClicks  revenue
0  42             32
1  4
2  17             10
3  40             26
4  50             25
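Selecting the remaining columns is one way to drop 'userId'. Another is to name only the column
being removed. A minimal sketch on a made-up table (sample_combined and its values are
hypothetical):

```python
import pandas as pd

# Hypothetical merged table in the same shape as combinedDF.
sample_combined = pd.DataFrame({'userId': [1, 9],
                                'totalAdClicks': [4, 2],
                                'revenue': [30.0, 12.0]})

by_selection = sample_combined[['totalAdClicks', 'revenue']]
by_drop = sample_combined.drop(columns=['userId'])  # names only what is removed

print(by_drop.columns.tolist())
```

Both forms give the same table; the drop form may be more convenient when there are many
feature columns.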
Display the dimensions of the training data set
To display the dimensions of trainingDF, simply add .shape as a suffix and hit enter.
In [18]:
trainingDF.shape
Out[18]:
(832, 2)
The following two commands convert the tables we created into a format that can be
understood by the KMeans.train function.
line[0] refers to the first column. line[1] refers to the second column. If you have more than 2
columns in your training table, modify this command by adding line[2], line[3], line[4] ...
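In [19]:
from pyspark.sql import SQLContext  # needed when running as a standalone script
sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the pySpark shell
pDF = sqlContext.createDataFrame(trainingDF)
parsedData = pDF.rdd.map(lambda line: array([line[0], line[1]]))  # totalAdClicks, revenue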
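With more than two columns, listing line[0], line[1], ... by hand gets tedious. The following
is a minimal local sketch (plain pandas and numpy, no Spark) of the same row-to-vector idea,
assuming a made-up three-column table; sample_training and the gameClicks column are
hypothetical. Passing the whole row to array() builds one vector regardless of the column
count. Whether the same shortcut works unchanged inside the pySpark map call is an assumption
that is not tested here.

```python
import pandas as pd
from numpy import array

# Hypothetical training table with three feature columns.
sample_training = pd.DataFrame({'totalAdClicks': [4, 2],
                                'revenue': [30.0, 12.0],
                                'gameClicks': [100, 250]})

# Local stand-in for pDF.rdd.map(...): one numpy vector per row.
vectors = [array(row) for row in sample_training.itertuples(index=False)]
print(vectors[0])
```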
Step 3: Train to Create Cluster Centers
Train KMeans model
Here we are creating two clusters, as denoted by the second argument.
In [20]:
my_kmmodel = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                          initializationMode="random")
/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/mllib/clustering.py:176: UserWarning:
Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.
"Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.")
Display the centers of two clusters formed
In [21]:
print(my_kmmodel.centers)
[array([ 42.05442177, 113.02040816]), array([ 29.43211679, 24.21021898])]
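To get a feel for what training does, the following is a minimal single-machine sketch of the
basic k-means loop (Lloyd's algorithm) in numpy. It is a teaching aid under stated assumptions,
not MLlib's implementation: the function simple_kmeans, its parameters and the sample points are
all hypothetical, and the starting centers are simply k randomly chosen data points.

```python
import numpy as np

def simple_kmeans(points, k, max_iterations=10, seed=0):
    """Plain Lloyd's algorithm: a teaching sketch, not MLlib's implementation."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center for every point.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers

# Hypothetical data: two well-separated groups of (totalAdClicks, revenue) pairs.
points = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5],
                   [40.0, 100.0], [42.0, 110.0], [41.0, 105.0]])
centers = simple_kmeans(points, k=2)
print(centers)
```

On data this cleanly separated, the loop settles on the mean of each group within a few
iterations. Real data is rarely this tidy, which is why the result can depend on the starting
centers.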
Step 4: Recommend Actions
Analyze the cluster centers
Each array denotes the center for a cluster:
One cluster is centered at ... array([29.43211679, 24.21021898])
Other cluster is centered at ... array([42.05442177, 113.02040816])
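Given the two centers, a user belongs to the cluster whose center is closest. The following
minimal numpy sketch makes that rule explicit; the helper nearest_center and the two example
users are hypothetical. In MLlib, the trained model's predict method serves this purpose.

```python
import numpy as np

# Cluster centers as printed above: (totalAdClicks, revenue).
centers = np.array([[42.05442177, 113.02040816],
                    [29.43211679, 24.21021898]])

def nearest_center(point, centers):
    """Return the index of the center with the smallest Euclidean distance."""
    return int(np.argmin(np.linalg.norm(centers - np.asarray(point), axis=1)))

# Hypothetical users, not taken from the data set.
print(nearest_center([30, 20], centers))    # 1
print(nearest_center([45, 120], centers))   # 0
```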
The first number (field 1) in each array refers to the number of ad clicks and the second number
(field 2) is the revenue per user. Compare the 1st number of each cluster to see how differently
users in each cluster behave when it comes to clicking ads. Compare the 2nd number of each
cluster to see how differently users in each cluster behave when it comes to buying stuff.
In one cluster, in general, players click on ads much more often (~1.4 times) and spend more
money (~4.7 times) on in-app purchases. Assuming that Eglence Inc. gets paid for showing ads
and for hosting in-app purchase items, we can use this information to increase the game's revenue
by increasing the prices for ads we show to the frequent clickers, and charging higher fees for
hosting the in-app purchase items shown to the higher-revenue-generating buyers.
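The ~1.4 and ~4.7 figures are the ratios of the corresponding center values. A short check,
using the centers from the output above (the variable names are illustrative):

```python
# Cluster centers copied from the output above: (totalAdClicks, revenue).
low_center = (29.43211679, 24.21021898)
high_center = (42.05442177, 113.02040816)

ad_click_ratio = high_center[0] / low_center[0]
revenue_ratio = high_center[1] / low_center[1]

print(round(ad_click_ratio, 1), round(revenue_ratio, 1))  # 1.4 4.7
```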
Note: This analysis requires you to compare the cluster centers and find any significant
differences in the corresponding feature values of the centers. The answer to this question will
depend on the features you have chosen.
Some features help distinguish the clusters remarkably while others may not tell you much. At
this point, if you don't find clear distinguishing patterns, perhaps rerunning the clustering model
with different numbers of clusters and revising the features you picked would be a good idea.
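When comparing runs with different numbers of clusters, one common yardstick is the within-set
sum of squared errors (WSSSE): the squared distance from each point to its nearest center,
summed over all points. This metric is not part of the walkthrough above; the following is a
minimal numpy sketch on made-up points, and the function name wssse is hypothetical. MLlib's
KMeansModel offers computeCost for the same quantity on an RDD.

```python
import numpy as np

def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    squared = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(squared.min(axis=1).sum())

# Hypothetical points and candidate centers for k=1 and k=2.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
centers_k1 = np.array([[6.0, 0.0]])
centers_k2 = np.array([[1.0, 0.0], [11.0, 0.0]])

print(wssse(points, centers_k1))  # 36 + 16 + 16 + 36 = 104.0
print(wssse(points, centers_k2))  # 1 + 1 + 1 + 1 = 4.0
```

The error always shrinks as clusters are added, so the useful signal is where the improvement
levels off, not the smallest value.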
