
K-Means Cluster Analysis

Cluster analysis is an unsupervised learning method. Its goal is to group, or cluster,
observations into subsets based on the similarity of their responses on multiple variables.
A good solution yields clusters with low variance within clusters and high variance between
clusters. Clustering variables should be primarily quantitative, but binary variables may
also be included. All variables used in my assignment were quantitative.
A k-means cluster analysis was conducted to identify underlying subgroups of countries
(Gapminder data set) based on their similarity of responses on 5 variables that represent
characteristics that could have an impact on life expectancy. Clustering variables included
incomeperperson, alcconsumption, co2emissions, employrate and urbanrate.
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations (N=3201)
and a test set that included 30% of the observations (N=1701). A series of k-means cluster
analyses were conducted on the training data specifying k=1-6 clusters, using Euclidean
distance. The variance in the clustering variables that was accounted for by the clusters
(r-square) was plotted for each of the six cluster solutions in an elbow curve to provide
guidance for choosing the number of clusters to interpret.
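
For reference, the r-square plotted in the elbow curve is usually defined as the share of the total variance in the clustering variables that lies between clusters. The notation below (x_i for an observation, c_j for the centroid of cluster C_j, and x-bar for the overall mean) is introduced here for illustration and is not taken from the SAS output.

```latex
R^2_k \;=\; 1 - \frac{SS_{\text{within}}(k)}{SS_{\text{total}}}
      \;=\; 1 - \frac{\sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - c_j \rVert^2}
                     {\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}
```

With k=1 the single centroid equals the overall mean, so the within-cluster sum of squares equals the total and r-square is 0.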

SAS Code
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;
DATA clust; set mydata.gapminder;
* create a unique identifier to merge cluster assignment variable with
the main data set;
idnum=_n_;
keep idnum alcconsumption employrate co2emissions incomeperperson lifeexpectancy urbanrate;
if cmiss(of _all_) then delete;
run;
* split data randomly into test and training data;
proc surveyselect data= clust out=traintest seed = 123
samprate=0.7 method=srs outall;
run;
data clus_train;
set traintest;
if selected=1;
run;
data clus_test;
set traintest;
if selected=0;
run;
* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clus_train out=clustvar mean=0 std=1;
var alcconsumption employrate co2emissions incomeperperson lifeexpectancy urbanrate;
run;

%macro kmean(K);
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;
var alcconsumption employrate co2emissions incomeperperson lifeexpectancy urbanrate;
run;
%mend;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;
data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;
data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;
data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;
data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;
data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;
data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6;
run;
* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
plot over_all*nclust;
run;
**************************************************************************
examine the cluster solution for the number of clusters suggested by the elbow curve
**************************************************************************;
* plot clusters for 4 cluster solution;
proc candisc data=outdata4 out=clustcan;
class cluster;
var alcconsumption employrate co2emissions incomeperperson lifeexpectancy urbanrate;
run;
proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
* first merge clustering variable and assignment data with life expectancy data;
data lifee_data;
set clus_train;
keep idnum lifeexpectancy;
run;
proc sort data=outdata4;
by idnum;
run;
proc sort data=lifee_data;
by idnum;
run;
data merged;
merge outdata4 lifee_data;
by idnum;
run;
proc sort data=merged;
by cluster;
run;
proc means data=merged;
var lifeexpectancy;
by cluster;
run;
proc anova data=merged;
class cluster;
model lifeexpectancy = cluster;
means cluster/tukey;
run;

Output

Summary of Findings
A k-means cluster analysis was conducted to identify underlying subgroups of countries
based on their similarity of responses on 6 variables that represent characteristics related
to life expectancy. The clustering variables were alcconsumption, employrate,
co2emissions, incomeperperson, lifeexpectancy and urbanrate. All clustering variables are
quantitative.
All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Data were randomly split into a training set that included 70% of the observations. A series of
k-means cluster analyses were conducted on the training data specifying k=1-6 clusters, using
Euclidean distance. The variance in the clustering variables that was accounted for by the
clusters (r-square) was plotted for each of the six cluster solutions in an elbow curve to
provide guidance for choosing the number of clusters to interpret.
The elbow curve plots the proportion of variance in the clustering variables explained by
each of the cluster solutions. The plot starts at k=1, where r-square is zero because there
is no clustering yet. The two-cluster solution accounts for about 30% of the variance and
the three-cluster solution for about 50%. The elbow curve suggests that the 2-, 4-, and
5-cluster solutions might be interpreted to help us figure out which solution is best.
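
One way to read the elbow more concretely is to tabulate how much r-square each additional cluster adds. The sketch below assumes the clusrsquare data set built in the SAS code above, with its rows in order of nclust. The data set name rsq_gain and the variable gain are hypothetical names introduced here.

```sas
* gain in r-square from each additional cluster, a hypothetical helper step;
data rsq_gain;
set clusrsquare;
* dif returns the current value minus the value from the previous row and is missing on the first row;
gain = dif(over_all);
run;
proc print data=rsq_gain noobs;
var nclust over_all gain;
run;
```

A large gain followed by much smaller gains marks a candidate elbow.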
The results below are for an interpretation of the 4-cluster solution. Canonical discriminant
analysis was used to reduce the 6 clustering variables down to a few canonical variables that
account for most of the variance in the clustering variables. The results show that the 6
variables were reduced to 3 canonical variables that can be used to visualize the location of
the clusters.
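
The count of three canonical variables is consistent with a general property of canonical discriminant analysis: the number of canonical variables is at most the smaller of the number of input variables and the number of classes minus one. The symbols p (number of clustering variables) and k (number of clusters) are notation introduced here for illustration.

```latex
\min(p,\; k - 1) \;=\; \min(6,\; 4 - 1) \;=\; 3
```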
A scatterplot of the first two canonical variables by cluster indicated that the observations
in clusters 3 and 4 were densely packed, with relatively low within-cluster variance, and
overlapped considerably with each other. Cluster 2 was generally distinct, but its
observations were spread out more than those of the other clusters, suggesting higher
within-cluster variance.
The results of this plot suggest that the best cluster solution may have fewer than 3 clusters.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted
to test for significant differences between the clusters on lifeexpectancy. A Tukey test was
used for post hoc comparisons between the clusters. The box plot shows the mean life expectancy by
cluster. Clusters 4 and 2 fell at opposite extremes.
The Tukey test shows that mean lifeexpectancy differed significantly between clusters 2 and
3, and between clusters 2 and 4. Clusters 1 and 2, 1 and 3, 1 and 4, and 3 and 4 did not
differ significantly from each other.
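
As a quick check on the post hoc results, the number of pairwise comparisons a Tukey test makes among four clusters can be counted directly. This is a standard combinatorial identity, not a value read from the SAS output.

```latex
\binom{4}{2} \;=\; \frac{4 \cdot 3}{2} \;=\; 6
```

Each of the six pairs (1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4) should therefore appear exactly once across the significant and non-significant comparisons.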
