Dataset
Chao Ji
December 14, 2009
Introduction
Multivariate statistics is concerned with situations where each data object is described by more
than one statistical variable, often more than 100 in real applications. One type of problem studied
in the multivariate statistics and machine learning communities is supervised learning, in which one wants to
predict the values of outputs based on observed input variables by learning the input-output relationship
from training data. The prediction task is known as classification if the outputs are categorical,
discrete or qualitative. Multiple classification algorithms have been developed and are widely used. The
purpose of this class project is to benchmark the performance of Linear Discriminant Analysis (LDA),
Quadratic Discriminant Analysis (QDA), K-Nearest-Neighbor (KNN) and Support Vector Machine
(SVM) classifiers on a single challenging dataset and to assess the impact of various factors (e.g. the number of PCs
used, the number of nearest neighbors in KNN) on their performance.
The dataset was obtained from the UCI Machine Learning Repository. It is an artificial dataset containing
data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly
labeled +1 or -1. The five dimensions constitute 5 informative features. 15 linear combinations of those
features were added to form a set of 20 (redundant) informative features. Based on those 20 features
one must separate the examples into the 2 classes (corresponding to the ±1 labels). A number of
distractor features with no predictive power were also added, and the order of the features and patterns
was randomized. Specifically, the dataset contains 1000 positive and 1000 negative instances, each of
which is described by 500 features.
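The generation procedure described above can be sketched in numpy. The noise level, the way clusters are sampled and the feature scaling below are illustrative assumptions, not the repository's actual generator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # total number of instances

# The 32 vertices of a five-dimensional hypercube, with coordinates in {-1, +1}
vertices = np.array([[(v >> i) & 1 for i in range(5)] for v in range(32)],
                    dtype=float) * 2.0 - 1.0

# 5 informative features: points scattered around a randomly chosen vertex
assign = rng.integers(0, 32, size=n)                 # cluster membership
informative = vertices[assign] + 0.1 * rng.normal(size=(n, 5))

# each cluster is randomly labeled +1 or -1
cluster_labels = rng.choice([-1, 1], size=32)
y = cluster_labels[assign]

# 15 redundant features: random linear combinations of the informative ones
redundant = informative @ rng.normal(size=(5, 15))

# 480 distractor features with no predictive power
distractors = rng.normal(size=(n, 480))

X = np.hstack([informative, redundant, distractors])
X = X[:, rng.permutation(X.shape[1])]                # randomize feature order
```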
Results
Data Visualization
Due to the high-dimensional nature of this dataset, it is sensible to perform PCA before data visualization.
Figure 1 shows the scatterplots of the data points projected onto the subspaces spanned by
the first two and three PCs. Red and blue dots represent positive (+1) and negative (-1) instances
respectively. There is no clearly discernible pattern or structure. In addition, as can
be seen in the scree plot (Figure 2), only 30% of the variance is captured even if we use the first 10 principal
components, and it is not clear which principal components correspond to the 5 "informative" features. In fact,
the first 225 principal components are needed to explain 90% of the total variance. Figure 3
shows the h-plots superimposed on the 2-D and 3-D representations of the data, in which most of the vectors
have negligibly small lengths, and those with substantial lengths seem to point in random directions.
Taken together, these results indicate that this dataset is highly non-linearly separable and that more coordinates
must be preserved to reveal its structure.
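The quantities behind these plots (the PC projections, the variance explained per component, and the 225-PC threshold for 90% variance) all come from a standard PCA. A minimal SVD-based sketch, run here on random stand-in data with reduced shapes for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # stand-in for the 2000 x 500 data matrix

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

var = s ** 2 / (len(X) - 1)               # variance along each principal component
cum = np.cumsum(var) / var.sum()          # cumulative fraction of variance explained

scores_2d = Xc @ Vt[:2].T                 # projection used for the 2-D scatterplot
k90 = int(np.searchsorted(cum, 0.90)) + 1 # number of PCs needed to reach 90%
```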
Figure 1: 2-D and 3-D representations of the data (PC#1 vs. PC#2, and PC#1 vs. PC#2 vs. PC#3).
Figure 2: Scree plot showing the variance explained (%) by the first 10 principal components.
Figure 3: h-plot (axes: Component 1, Component 2, Component 3).
Benchmarking of Classification Algorithms
I used the Statistics Toolbox in MATLAB, which provides implementations of all of these algorithms,
to evaluate their performance. Cross-validation was performed to estimate the true misclassification
rate (MCR).
In general, both LDA and QDA have relatively high MCRs, which is not surprising since both methods
assume that the data within each class follow a multivariate normal distribution, an assumption that is
clearly violated in this dataset.
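As a sketch of how the MCR is estimated, a generic k-fold cross-validation loop might look like the following; the `fit`/`predict` callables stand in for any of the classifiers (here a trivial majority-class baseline is used just to make the sketch runnable):

```python
import numpy as np

def cv_mcr(X, y, fit, predict, k=10, seed=0):
    """Estimate the misclassification rate (MCR) by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors += int(np.sum(predict(model, X[test]) != y[test]))
    return errors / len(X)

# usage with a trivial majority-class baseline in place of a real classifier
fit = lambda X, y: 1 if np.mean(y == 1) >= 0.5 else -1
predict = lambda model, X: np.full(len(X), model)
```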
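The Gaussian assumption can be made concrete with a minimal LDA implementation, in which a single pooled covariance matrix encodes the equal-covariance normal model. This is a generic textbook sketch, not MATLAB's implementation:

```python
import numpy as np

def lda_fit(X, y):
    """Fit LDA: per-class means plus one pooled covariance matrix
    (the equal-covariance Gaussian assumption)."""
    classes = np.unique(y)
    n, d = X.shape
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # pooled within-class covariance
    S = sum(np.cov(X[y == c].T) * (np.sum(y == c) - 1) for c in classes)
    S /= n - len(classes)
    priors = {c: np.mean(y == c) for c in classes}
    return classes, means, np.linalg.pinv(S), priors

def lda_predict(model, X):
    classes, means, Sinv, priors = model
    # linear discriminant score: x'S^{-1}mu_c - mu_c'S^{-1}mu_c/2 + log(pi_c)
    scores = np.column_stack([
        X @ Sinv @ means[c] - 0.5 * means[c] @ Sinv @ means[c] + np.log(priors[c])
        for c in classes])
    return classes[np.argmax(scores, axis=1)]
```

QDA differs only in keeping a separate covariance matrix per class, which yields quadratic rather than linear decision boundaries.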
Figure 4: Misclassification rates of LDA and QDA estimated by k-fold cross-validation for k from 5 to 50. (b) Using the first 225 PCs. (c) Using the first 2 PCs.
KNN
As with LDA and QDA, I performed cross-validation with different numbers of folds, which again did
not have a significant impact on the estimated performance; 10-fold cross-validation was therefore used to estimate the MCR.
Euclidean distance was used as the distance measure, and the majority rule with a nearest-point tie-break
was used to assign class labels. Figure 5 depicts how the MCR changes with the number of
nearest neighbors. KNN using all variables (red) or the first 225 PCs (green)
has similar MCRs across different values of k, and both generally performed better than KNN using
only the first 2 PCs (blue), which suggests that the KNN algorithm is better informed by more
variables. Moreover, there appears to be an intermediate value of k at which the algorithm performs best.
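The classification rule just described (Euclidean distance, majority vote, nearest-point tie-break) can be sketched directly:

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k):
    """KNN with Euclidean distance, majority vote, and nearest-point tie-break."""
    preds = np.empty(len(Xte), dtype=ytr.dtype)
    for i, x in enumerate(Xte):
        d = np.linalg.norm(Xtr - x, axis=1)        # Euclidean distances
        nn = np.argsort(d)[:k]                     # indices of the k nearest points
        labels, counts = np.unique(ytr[nn], return_counts=True)
        winners = labels[counts == counts.max()]
        if len(winners) == 1:
            preds[i] = winners[0]                  # clear majority
        else:
            preds[i] = ytr[nn[0]]                  # tie: use the single nearest point
    return preds
```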
SVM
Only the first two principal components were used to train and assess the performance of Support
Vector Machines; linear, quadratic and polynomial kernel functions were considered, with
polynomial orders of 2, 3, 4 and 5. The correct classification rates estimated by 2-fold cross-validation
were 0.6000, 0.6190, 0.6030, 0.6050, 0.6250 and 0.6260, respectively. Figure 6 illustrates
the optimal separating hyperplanes as well as the support vectors constructed by the SVMs with
polynomial kernels.
Figure 5: Misclassification rate of KNN vs. the number of nearest neighbors, using all variables, the first 225 PCs, and the first 2 PCs.
It can be observed that virtually all training data points were used as support vectors to determine
the optimal hyperplane, which suggests that the training data are highly non-linearly separable. In
addition, the performance improved as the order of the polynomial kernel increased, and the
quadratic kernel also performed slightly better than the linear kernel. Since the SVM relies
on a nonlinear mapping of non-linearly separable input vectors in a low-dimensional space to a higher-dimensional
space, the higher the order of the polynomial kernel, the more likely it is that the mapped vectors
will be linearly separable. However, the performance was generally poor in
all 6 SVMs considered; it is possible that the order of the polynomial kernel must be further
increased to achieve better linear separability.
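The mapping argument can be verified directly for the order-2 polynomial kernel: (x·z + 1)² equals an ordinary inner product after an explicit 6-dimensional quadratic feature map, which is why higher orders correspond to richer feature spaces in which linear separation becomes more likely.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the order-2 polynomial kernel (x.z + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
kernel = (x @ z + 1.0) ** 2          # kernel evaluated in the 2-D input space
explicit = phi(x) @ phi(z)           # inner product in the 6-D feature space
assert np.isclose(kernel, explicit)  # the two agree exactly
```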
Conclusion
As discussed in the Data Visualization section, this dataset is high-dimensional and highly non-linearly
separable. It has a complicated internal structure that is unlikely to be revealed by conventional
dimension reduction techniques. The discriminant analysis approaches (LDA and QDA) had the worst
performance, which is not surprising since they both assume a multivariate normal distribution of the
data, which is clearly not the case here. The SVM did not perform well either, probably because
the order of the polynomial kernel was too low to achieve linear separability in higher dimensions.

Figure 6: SVM classification results in the first two PCs (PC#1 vs. PC#2) with polynomial kernels of (a) order 2, (b) order 3, (c) order 4 and (d) order 5; training points, classified points and support vectors are marked.

In general, the KNN algorithm performed best, since it does not rely on any stringent assumptions
about the data and focuses only on structure in the local region. In conclusion, the K-Nearest-Neighbor
algorithm generally performs better on a high-dimensional, non-linearly separable dataset.