
Benchmarking of Classification Algorithms using a Challenging Dataset
Chao Ji
December 14, 2009

Introduction
Multivariate statistics is concerned with situations in which each data object is described by more
than one statistical variable, often more than 100 in real applications. One type of problem studied
in multivariate statistics and the machine learning community is supervised learning, in which one
wants to predict the values of outputs from observed input variables by learning the input-output
relationship from training data. The prediction task is known as classification when the outputs are
categorical, discrete or qualitative. Many classification algorithms have been developed and are widely
used. The purpose of this class project is to benchmark the performance of Linear Discriminant Analysis
(LDA), Quadratic Discriminant Analysis (QDA), K-Nearest-Neighbor (KNN) and Support Vector Machine
(SVM) classifiers on a single challenging dataset, and to assess the impact of various factors (e.g. the
number of principal components used, or the number of nearest neighbors in KNN) on their performance.

The dataset was obtained from the UCI Machine Learning Repository. It is an artificial dataset in which
the data points are grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and
randomly labeled +1 or −1. The five dimensions constitute 5 informative features; 15 linear combinations
of these features were added to form a set of 20 (redundant) informative features. Based on these 20
features one must separate the examples into the two classes (corresponding to the ±1 labels). A number
of distractor features with no predictive power were also added, and the order of the features and
patterns was randomized. In total, the dataset contains 1000 positive and 1000 negative instances, each
described by 500 features.

Results
Data Visualization
Due to the high-dimensional nature of this dataset, it is sensible to perform PCA before visualizing the
data. Figure 1 shows scatterplots of the data points projected onto the subspaces spanned by the first
two and the first three PCs; red and blue dots represent positive (+1) and negative (−1) instances,
respectively. No clearly discernible pattern or structure is visible. In addition, as can be seen in the
scree plot (Figure 2), only about 30% of the variance is captured even when the first 10 principal
components are used, and it is not clear which principal components correspond to the 5 "informative"
features. In fact, the first 225 principal components are needed to explain 90% of the total variance.
Figure 3 shows the h-plots superimposed on the 2-D and 3-D representations of the data; most of the
loading vectors have negligibly small lengths, and the few vectors with substantial lengths seem to point
in random directions.
Taken together, these results indicate that this dataset is highly non-linearly separable and that many
more coordinates must be preserved to reveal its structure.
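
As a concrete illustration of this step, here is a minimal MATLAB sketch (not the report's actual code:
X is an assumed name for the 2000-by-500 data matrix, y for the vector of +1/−1 labels, and princomp is
the PCA function of the Statistics Toolbox of that era):

% PCA of the data matrix; 'score' holds the coordinates in PC space.
[coeff, score, latent] = princomp(X);

% Cumulative fraction of variance explained (cf. the scree plot, Figure 2).
explained = cumsum(latent) / sum(latent);
bar(explained(1:10));
xlabel('Principal Components'); ylabel('Cumulative variance explained');

% Scatter of the first two PC scores, colored by class (cf. Figure 1(a)).
figure; hold on;
scatter(score(y == 1, 1),  score(y == 1, 2),  10, 'r', 'filled');
scatter(score(y == -1, 1), score(y == -1, 2), 10, 'b', 'filled');
xlabel('PC#1'); ylabel('PC#2');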

[Scatterplots: (a) 2-D representation of the data (PC#1 vs. PC#2); (b) 3-D representation of the data (PC#1, PC#2, PC#3).]

Figure 1: Data visualization after PCA

[Bar chart of variance explained (%) for principal components 1-10.]

Figure 2: Scree Plot

[(a) Biplot on the first two PCs; (b) triplot on the first three PCs; loading vectors drawn over Components 1-3.]

Figure 3: h-plot

Benchmarking of Classification Algorithms
I used the Statistics Toolbox in MATLAB, which provides implementations of these algorithms, to
evaluate their performance. Cross-validation was performed to estimate the true misclassification
rate (MCR).
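
For reference, the quantity estimated here (the notation below is mine, not the report's) is the average
error over the k held-out folds,
\[
\widehat{\mathrm{MCR}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{|T_i|}\sum_{(x_j,\,y_j)\in T_i}\mathbf{1}\{\hat{f}^{(-i)}(x_j)\neq y_j\},
\]
where $T_i$ is the i-th held-out fold and $\hat{f}^{(-i)}$ is the classifier trained on the remaining
k − 1 folds.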

LDA and QDA


The original 500 variables, the first 225 principal components and the first 2 principal components were
used to train and test both models with k-fold (k = 5, 10, 20, 40, 50) cross-validation. The numbers of
misclassifications and the MCR were computed and are shown in Figure 4. The results indicate that the
value of k had virtually no impact on the performance estimated by k-fold cross-validation, and that
performance improved only slightly (< 5%) when fewer principal components were used (Figure 4(b)
and 4(c)). QDA outperformed LDA in all three cases, which suggests that the dataset is better explained
by a more flexible model.

In general, both LDA and QDA had relatively high MCRs, which is not surprising since both methods
assume that the data within each class follow a multivariate normal distribution, an assumption that is
clearly violated in this dataset.
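
A minimal sketch of this cross-validation loop with the Statistics Toolbox functions cvpartition and
classify (X and y are the assumed data matrix and label vector from above; this is an illustration, not
the exact code used in the project):

k = 10;                                   % number of folds (5, 10, 20, 40, 50 were tried)
cvp = cvpartition(y, 'kfold', k);         % stratified k-fold partition
errLDA = zeros(cvp.NumTestSets, 1);
errQDA = zeros(cvp.NumTestSets, 1);
for i = 1:cvp.NumTestSets
    tr = training(cvp, i);                % logical index of training instances
    te = test(cvp, i);                    % logical index of test instances
    predL = classify(X(te,:), X(tr,:), y(tr), 'linear');     % LDA
    predQ = classify(X(te,:), X(tr,:), y(tr), 'quadratic');  % QDA
    errLDA(i) = mean(predL ~= y(te));
    errQDA(i) = mean(predQ ~= y(te));
end
mcrLDA = mean(errLDA);                    % cross-validated misclassification rates
mcrQDA = mean(errQDA);

To repeat the experiment on principal components, X is simply replaced by the first 225 (or 2) columns
of the PC scores.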

[Misclassification rate of LDA and QDA versus the number of folds k (5-50): (a) using the original coordinates; (b) using the first 225 PCs; (c) using the first 2 PCs.]

Figure 4: Misclassification Rates of LDA and QDA

KNN
As with LDA and QDA, I performed cross-validation with different numbers of folds, which again had
no significant impact on performance; 10-fold cross-validation was therefore used to estimate the MCR.
Euclidean distance was used as the distance measure, and the majority rule with a nearest-point
tie-break was used to assign class labels. Figure 5 shows how the MCR changes with the number of
nearest neighbors. KNN using all variables (red) or the first 225 PCs (green) had similar MCRs across
the range of neighborhood sizes, and both generally performed better than KNN using only the first
2 PCs (blue), which suggests that the KNN algorithm is better informed by more variables. Moreover,
there appears to be an intermediate number of neighbors at which KNN performs best.
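
The report does not say which k-NN implementation was used; one option available in 2009-era MATLAB
is knnclassify from the Bioinformatics Toolbox, which accepts the distance measure and tie-breaking rule
directly. A hedged sketch of the neighborhood-size sweep (X, y and the other assumptions are as above):

neighbors = 1:2:91;                       % candidate numbers of nearest neighbors
cvp = cvpartition(y, 'kfold', 10);
mcr = zeros(numel(neighbors), 1);
for j = 1:numel(neighbors)
    err = zeros(cvp.NumTestSets, 1);
    for i = 1:cvp.NumTestSets
        tr = training(cvp, i);  te = test(cvp, i);
        pred = knnclassify(X(te,:), X(tr,:), y(tr), ...
                           neighbors(j), 'euclidean', 'nearest');
        err(i) = mean(pred ~= y(te));
    end
    mcr(j) = mean(err);                   % curve plotted against 'neighbors' (cf. Figure 5)
end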

SVM
Only the first two principal components were used to train and assess the performance of Support
Vector Machines, with linear, quadratic and polynomial kernel functions; the orders of the polynomial
kernels were 2, 3, 4 and 5. The correct classification rates estimated by 2-fold cross-validation were
0.6000, 0.6190, 0.6030, 0.6050, 0.6250 and 0.6260, respectively. Figure 6 illustrates the decision
boundaries and the support vectors obtained with the polynomial kernels.
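
The legend entries and the "poly_kernel" panel titles in Figure 6 are consistent with the
svmtrain/svmclassify pair from the MATLAB Bioinformatics Toolbox of that era; the sketch below is an
assumption about that workflow (the parameter names 'Kernel_Function', 'Polyorder' and 'ShowPlot'
follow that toolbox), not the report's own code:

X2 = score(:, 1:2);                       % first two principal components only
cvp = cvpartition(y, 'kfold', 2);         % 2-fold CV, as in the text
order = 4;                                % polynomial kernel order (2, 3, 4, 5 were tried)
acc = zeros(cvp.NumTestSets, 1);
for i = 1:cvp.NumTestSets
    tr = training(cvp, i);  te = test(cvp, i);
    svmStruct = svmtrain(X2(tr,:), y(tr), ...
                         'Kernel_Function', 'polynomial', ...
                         'Polyorder', order, ...
                         'ShowPlot', true);           % 2-D boundary plot, as in Figure 6
    pred = svmclassify(svmStruct, X2(te,:));
    acc(i) = mean(pred == y(te));         % correct classification rate
end
correctRate = mean(acc);

The linear and quadratic kernels correspond to 'Kernel_Function' values of 'linear' and 'quadratic',
respectively.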

[Misclassification rate of KNN versus the number of nearest neighbors (0-90), using all variables, the first 225 PCs, and the first 2 PCs.]
Figure 5: Misclassification Rate of KNN

It can be observed that virtually all training data points were selected as support vectors, which again
suggests that the training data are highly non-linearly separable. In addition, performance improved as
the order of the polynomial kernel increased, and the quadratic kernel performed slightly better than
the linear kernel. Since the SVM relies on a nonlinear mapping of non-linearly separable input vectors
from a low-dimensional space to a higher-dimensional space, the higher the order of the polynomial
kernel, the more likely the mapped vectors are to be linearly separable. However, performance was
generally poor for all six SVMs considered here; it is possible that the order of the polynomial kernel
would have to be increased further to achieve better separability.
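
To make the dimensionality argument concrete (the exact kernel parameterization used by the toolbox
is not stated in the report; a common inhomogeneous form is shown), a polynomial kernel of order d on
p-dimensional inputs,
\[
K(\mathbf{x},\mathbf{z}) = \left(\mathbf{x}^{\top}\mathbf{z} + 1\right)^{d},
\]
corresponds to an implicit feature space of dimension $\binom{p+d}{d}$. With p = 2 inputs (the first two
PCs) and d = 5, this is only $\binom{7}{5} = 21$ dimensions, so even the highest-order kernel tried here
maps the data into a rather small feature space.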

Conclusion
As discussed in the Data Visualization section, this dataset is high-dimensional and highly non-linearly
separable. It has a complicated internal structure that is unlikely to be revealed by conventional
dimension-reduction techniques. The discriminant analysis approaches (LDA and QDA) had the worst
performance, which is not surprising since they both assume a multivariate normal distribution of the
data, which is clearly not the case here. The SVM did not perform well either, probably because the
order of the polynomial kernel was too low to achieve linear separability in the higher-dimensional
feature space.

[SVM decision regions, classified points and support vectors in the PC#1-PC#2 plane for polynomial kernels of order (a) 2, (b) 3, (c) 4 and (d) 5.]

Figure 6: SVM

In general, the KNN algorithm performed best, since it does not rely on stringent assumptions about
the data and focuses only on structure in the local region. In conclusion, the K-Nearest-Neighbor
algorithm tends to perform better on a high-dimensional, non-linearly separable dataset such as this one.
