Investigating the Statistical Linear Relation between the Model
Selection Criterion and the Complexities of Data Mining Algorithms
DOST MUHAMMAD KHAN 1 , NAWAZ MOHAMUDALLY 2 , D K R BABAJEE 3
1 Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN & PhD Student, School of Innovative Technologies & Engineering, University of Technology Mauritius (UTM), MAURITIUS
2 Associate Professor & Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM), MAURITIUS
3 Lecturer, Department of Applied Mathematical Sciences, SITE, University of Technology, Mauritius
Abstract: - The model selection criterion plays a vital role in preparing a dataset for further data mining actions. It is a gauge to determine whether the dataset is under-fitted or over-fitted; in either case the dataset is not suitable for knowledge extraction, since the knowledge obtained from it would be vague, ambiguous and possibly misleading. In this paper we investigate the linear relation between the selection of data mining algorithms and the model selection criterion, using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), the two most commonly used methodologies, in an attempt to set up the dataset for the data mining process cycle. Moreover, the complexities of data mining algorithms, together with the AIC or BIC at different steps within the data mining process cycle, are evaluated in order to apply the best algorithm and generate the optimum accuracy of the knowledge.

Key-Words: - AIC, BIC, Over-fitted, Under-fitted, Model Selection Criterion, Linear Model, Correlation

1 Introduction
The purpose of the model selection criterion is to identify a methodology, such as AIC, BIC, the VC dimension and so on, that best characterizes the model, that is, the candidate dataset being readied for data mining. The concept of over-fitting and under-fitting is important in data mining, as mentioned above. Over- and under-fitting are caused by missing, noisy, inconsistent and redundant values, and by the number of attributes in a dataset. We can avoid these problems by using one of the following techniques: apply upper or lower threshold values, remove attributes below a threshold value, and remove noisy and redundant attributes. The most effective solution to these problems is to use many training datasets and to avoid making either excessive or too few assumptions. Current models for the selection of fitting datasets are the following: the VC (Vapnik-Chervonenkis) dimension, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), SRMVC (Structural Risk Minimization with VC dimension), CV (Cross-validation), the Deviance Information Criterion, the Hannan-Quinn Information Criterion, the Jensen-Shannon Divergence and the Kullback-Leibler Divergence. The focus here is on AIC and BIC. A dataset is more appropriate for data mining if it has a minimum value of AIC or BIC. Both AIC and BIC have solid theoretical foundations: the AIC uses the Kullback-Leibler distance of information theory, and the BIC is based on the integrated likelihood of Bayesian theory. If the complexity of the good-fitting model, also called the true model, does not increase with the size of the dataset, BIC is the preferred criterion; otherwise AIC is the better choice. Since selecting the number of parameters and the number of attributes is the main issue in model selection, one has to take care of these important aspects of a dataset. Using too many parameters can fit the data perfectly, but this can be over-fitting; using too few parameters may not fit the dataset at all, which is under-fitting. This shows the importance of the parameters and the observed data in a given dataset. Variable selection by AIC or BIC provides an answer to this problem. We illustrate the importance of comparing different models/datasets with different numbers of parameters by using AIC and BIC. The idea of model selection criterion using
AIC or BIC has also been applied recently to epidemiology, microarray data analysis, and DNA sequence analysis [21] [22] [23] [24] [25]. The rest of the paper is organized as follows: section 2 discusses the model selection criteria AIC and BIC, and section 3 is about the methodology. In section 4 we present the statistical relation between the model selection criterion AIC and the complexities of data mining algorithms. The results are discussed in section 5 and finally the conclusion is drawn in section 6.

2 Model Selection Criteria
A brief introduction to the model selection criteria AIC and BIC is given below.

2.1 Akaike Information Criterion
The AIC is a criterion for model selection, introduced by Hirotugu Akaike in 1974, and is based on information theory. Suppose that the data are generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost by using g1 to represent f by calculating the Kullback-Leibler divergence, D_KL(f, g1); similarly, the information lost by using g2 to represent f would be found by calculating D_KL(f, g2). We can then choose the candidate model that minimizes the information loss. The AIC tells nothing about how well a model fits the data in an absolute sense: if all the candidate models fit poorly, the AIC gives no warning. The AIC is thus a difference of the accuracy and the complexity of the model [7] [9] [10] [15] [17] [18] [20]. The mathematical formula of AIC is given in equation (1):

AIC = -2 log(likelihood) + 2k    (1)

where k is the number of parameters and log(likelihood) is the log of the likelihood. As the number of parameters is increased, the fit becomes closer to perfect and the log-likelihood gradually approaches 0; this term is also called the Model Accuracy. Therefore, AIC can be written as in equation (2):

AIC = No. of Parameters - ModelAccuracy    (2)

2.2 Bayesian Information Criterion
The BIC is a criterion for the selection of a model among a class of models with different numbers of parameters. When estimating the parameters of a model by maximum likelihood estimation, it is possible to increase the likelihood by adding parameters, which may result in over-fitting. BIC resolves this problem by introducing a penalty term for the number of parameters in the model. This penalty is larger in the BIC than in the related AIC. BIC is widely used for model identification in time series and linear regression. The main characteristics of BIC are: it measures the efficiency of the parameterized model in terms of predicting the data; it penalizes the complexity of the model, where complexity refers to the number of parameters in the model; it is exactly equal to the minimum description length criterion but with negative sign; and it is closely related to other likelihood criteria such as the AIC [4] [8] [12] [13] [16] [19]. The mathematical formula of BIC is given in equation (3):

BIC = -2 log(likelihood) + k log(n)    (3)

where k is the number of parameters, n is the sample size or the number of datapoints of the given dataset, the log-likelihood term (which gradually approaches 0 with the increase in the number of parameters) is also known as the Model Accuracy, and k log(n) is the model size. Therefore, BIC is given in equation (4):

BIC = ModelSize - ModelAccuracy    (4)
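As a concrete illustration of equations (1)-(4), the following minimal R sketch computes the Model Accuracy, the Model Size and both criteria for a fitted model; the lm fit on the built-in mtcars data is only a placeholder, not part of this study:

    fit    <- lm(mpg ~ wt + hp, data = mtcars)  # any candidate model
    loglik <- as.numeric(logLik(fit))           # log(likelihood): the Model Accuracy
    k      <- attr(logLik(fit), "df")           # number of estimated parameters
    n      <- nrow(mtcars)                      # sample size (datapoints)
    model_size <- k * log(n)                    # Model Size: k*log(n)
    aic_simple <- k - loglik                    # equations (2)/(10): No. of Parameters - ModelAccuracy
    bic_simple <- model_size - loglik           # equations (4)/(11): ModelSize - ModelAccuracy
    c(AIC = -2 * loglik + 2 * k,                # classical forms (1) and (3),
      BIC = -2 * loglik + k * log(n))           # identical to R's built-in AIC(fit) and BIC(fit)

The dataset or model with the smallest value of the chosen criterion is preferred, exactly as stated above.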
3 Methodology
Suppose there is a sample X = {x1, x2, ..., xn} of n observations, coming from a distribution with an unknown probability density function p(X | θ), where p(X | θ) is called a parametric model, in which all the parameters lie in finite-dimensional parameter spaces. These parameters are collected together to form a single m-dimensional parameter vector, θ = (θ1, θ2, ..., θm). To use the method of maximum likelihood, one first specifies the joint density function for all observations. The joint density function of the given observations is given below in equation (5):

p(X | θ) = p(x1, x2, ..., xn | θ) = p(x1 | θ) p(x2 | θ) ... p(xn | θ)    (5)

where the observed values x1, x2, ..., xn are treated as fixed "parameters" of this function, while θ is the function's variable and is allowed to vary freely. From this point of view, this distribution function is called the likelihood, as given in equation (6).
likelihood(θ | x1, x2, ..., xn) = p(x1, x2, ..., xn | θ) = ∏_{i=1}^{n} p(xi | θ)    (6)

It is more convenient to work with the logarithm of the likelihood function, called the log-likelihood, as shown in equation (7):

log(likelihood(θ | x1, x2, ..., xn)) = log( ∏_{i=1}^{n} p(xi | θ) ) = Σ_{i=1}^{n} log p(xi | θ),  with average log-likelihood (1/n) log(likelihood)    (7)

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. The maximum likelihood method finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data. The idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. From a statistical point of view, the method of maximum likelihood is considered robust and yields estimators with good statistical properties. It is a flexible method and can be applied to most models and to different types of data. Although the methodology of maximum likelihood estimation is simple, the implementation is mathematically intense [1] [2] [3] [5] [6] [11] [14]. We use the stepwise variable selection method, starting with one variable and then adding or removing a variable if the value of AIC or BIC is reduced. Stepwise selection is a locally optimal procedure and is tested with different starting sets of parameters so that the optimization is not carried to the extreme. The following steps explain the computation of the values of AIC and BIC:

Step 1: Calculate the maximum likelihood of the dataset. The likelihood function is simply the joint probability of observing the data, L(θ | x1, x2, ..., xn) = p(x1, x2, ..., xn | θ) = ∏_{i=1}^{n} p(xi | θ). Taking the log of this value gives the model accuracy, shown below in equation (8):

ModelAccuracy = log(likelihood)    (8)

Step 2: Compute the model size. The formula to calculate the model size is given in equation (9):

ModelSize = k log(n)    (9)

where k is the number of parameters and n is the number of datapoints.

Step 3: Compute the Minimum Description Length (MDL):

MDLScore = ModelSize - ModelAccuracy

The Minimum Description Length (MDL) is also referred to as the value of BIC; hence, from the value of MDL we can compute the values of AIC and BIC. The mathematical formulas of AIC and BIC are given in equations (10) and (11) respectively:

AIC = No. of Parameters - ModelAccuracy    (10)

BIC = ModelSize - ModelAccuracy    (11)

Equations (2) and (10) are identical; similarly, equations (4) and (11) are identical. A smaller value of the criterion for a given model indicates the better choice; therefore, the selection criterion which produces the smallest value for the given model is the best choice [1].

4 The Statistical Linear Relation between the Model Selection Criterion AIC and the Complexities of Data Mining Algorithms
In order to find the statistical linear relation between the value of AIC and the logarithm of the complexities O of data mining algorithms, we apply the linear model (lm) shown in equation (12):

lm(y ~ x):  y = mx + b = β0 + β1 x    (12)

where m is the slope of the line and b is the intercept of the line with the y-axis (the slope is positive if the line goes up and negative if it goes down), β0 is a constant, β1 is the regression coefficient, x is the value of the independent variable and y is the value of the dependent variable. In other words, β0 is the theoretical y-intercept and β1 is the theoretical slope [28][29][30].
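The linear model of equation (12) can be fitted directly in R with lm(). The sketch below uses synthetic placeholder values for x and y; they are not the AIC and complexity values analysed in the cases that follow:

    set.seed(42)
    x <- seq(2, 22, by = 2)                 # e.g. AIC values (illustrative only)
    y <- 1.5 + 0.8 * x + rnorm(length(x))   # e.g. log of the complexity O, roughly linear in x
    fit <- lm(y ~ x)                        # least-squares estimates of beta0 and beta1
    summary(fit)                            # residuals, coefficients, Multiple R-squared
    cor(x, y)                               # the correlation coefficient r quoted in each case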
In our case X is the value of AIC and Y is the logarithm of the complexities O of data mining algorithms. Table 1 shows the complexities of commonly used data mining algorithms.

Table 1 The Complexities O of Data Mining Algorithms
Table 1 gives the complexity of the K-means, C4.5, Data Visualization, K-NN, SOM and NNs data mining algorithms, where n is the sample size, m is the number of attributes, k is the number of clusters, l is the number of iterations and d is the dimension (in our case 2). We take the log of the complexities of these algorithms because the log makes it efficient to deal with extremely large values. There are other reasons for taking the log of a value: the log is taken when the transformed data come closer to satisfying the assumptions of the statistical model; to analyze exponential processes, because the log function is the inverse of the exponential function; to measure the pH or acidity of a chemical solution; to measure the intensity of an earthquake on the Richter scale; and to model many natural processes statistically. In our case it is used to model the value of the computational complexity O of a data mining algorithm against the value of the model selection criterion AIC. This will help to select the right algorithm for the given dataset.
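The log of a complexity is straightforward to compute once the complexity expression is known. Table 1 itself is not reproduced in the extracted text, so the expressions and the iteration count below are common textbook forms assumed purely for illustration:

    n <- 261; m <- 9; k <- 30; l <- 10; d <- 2   # sample size, attributes, clusters, iterations (assumed), dimension
    O <- c(Kmeans = n * k * l * d,               # assumed K-means complexity
           KNN    = n * m,                       # assumed K-NN complexity
           C45    = m * n * log(n))              # assumed C4.5 complexity
    log(O)                                       # the log values that are compared with AIC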
Table 2 below shows the value of AIC of 5 different datasets.

Table 2 The Value of Model Selection Criterion AIC

The essential parameters for the calculation of the values of AIC and of the logarithm of the complexities O of data mining algorithms are: the number of parameters k, the number of attributes m and the sample size n. The following four cases further illustrate the statistical linear relation.

Case 1: In case 1 the number of parameters k varies while the number of attributes m and the sample size n are fixed. The linear model is applied; the values are shown in Table 3.
Table 3 The number of parameters k varies
The value of the correlation coefficient r is 0.92, which is very close to 1 and shows a strong and positive linear relation between the values of AIC and O when the number of parameters k varies. The goal of a linear regression is to find the best estimates for β0 and β1 by minimizing the residual error. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 1 is given below:

Residuals:
Min 1Q Median 3Q Max
-116.70 -73.07 -26.85 65.30 192.09

The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.

Residual standard error: 92.9 on 20 degrees of freedom
Multiple R-squared: 0.8472, Adjusted R-squared: 0.8396

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The most important number is Multiple R-squared, which for a simple linear regression is the square of the correlation coefficient; its value of 0.8472 is consistent with the correlation coefficient r = 0.92 reported above. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation. Hence, in this case, being close to 1 implies that we can
conclude that the two variables are indeed related. In other words, if the value of Multiple R-squared in the regression statistics is close to 1, then the least-squares regression line fits the data points well, and there is a linear (positive or negative) relation between the two variables. In our case a strong positive linear relation exists between the values of AIC and O when the number of parameters k varies.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4724.68 480.68 -9.829 4.22e-09
x 210.45 19.98 10.530 1.32e-09

The coefficients for the Intercept and for X, the independent variable, are the two required values: they are the intercept and slope of the least-squares regression line. The coefficient of X is positive in case 1. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates.
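The quantities quoted in this summary, and the four diagnostic plots discussed next, can be reproduced for any fitted model in R. A hedged sketch, reusing the illustrative object fit from the earlier example:

    s <- summary(fit)
    quantile(residuals(fit))   # Min, 1Q, Median, 3Q, Max of the residuals
    s$sigma                    # residual standard error
    s$r.squared                # Multiple R-squared
    s$adj.r.squared            # Adjusted R-squared
    coef(s)                    # estimates, standard errors, t values and Pr(>|t|)
    par(mfrow = c(2, 2))
    plot(fit)                  # residuals vs fitted, normal Q-Q, scale-location, leverage with Cook's distance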
The graphs in fig. 1 further explain the linear model.

Fig. 1 The Linear Model for Case 1

The upper-left plot in fig. 1 shows the residual errors plotted against their fitted values; the residuals are randomly distributed in the plot. The upper-right plot is a standard Q-Q plot, which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis; both plots show the positive linear relation between AIC and O. The lower-left plot shows the square root of the standardized residuals as a function of the fitted values, and again there is no obvious trend in this plot. Finally, the lower-right plot shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook's distance, which is another measure of the importance of each observation to the regression. A smaller distance means that removing the observation has little effect on the regression results, while distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model. For case 1, Cook's distance implies that the linear model is a good fit because the values of the distance lie between 0.5 and 1. The lower plots are also used for diagnostic purposes [31][32][33][34][35].

Remark: The value of AIC is computable if the number of parameters k is within the range from 2 to 22, i.e. 2 ≤ k ≤ 22. If the value of k is greater than 22, then the value of AIC is incomputable and thus there is no linear relation between AIC and O.

Case 2: In case 2 the number of attributes m varies while the number of parameters k and the sample size n are fixed. The linear model is applied; the values are shown in Table 4.

Table 4 The number of attributes m varies
The value of the correlation coefficient r is 0.95, which is very close to 1 and shows a strong and positive linear relation between the values of AIC and O when the number of attributes m varies. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 2 is given below:

Residuals:
Min 1Q Median 3Q Max
-100.81 -60.27 -29.19 64.05 131.42
The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.
Residual standard error: 90.49 on 8 degrees of freedom
Multiple R-squared: 0.9075, Adjusted R-squared: 0.8959

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The most important number is Multiple R-squared, which for a simple linear regression is the square of the correlation coefficient; its value of 0.9075 is consistent with the correlation coefficient r = 0.95 reported above. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation. Hence, being close to 1 in this case implies that the two variables are indeed related. In other words, if the value of Multiple R-squared in the regression statistics is close to 1, then the least-squares regression line fits the data points well, and there is a linear (positive or negative) relation between the two variables. In our case a strong positive linear relation exists between the values of AIC and O when the number of attributes m varies.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5511.41 680.60 -8.098 4.00e-05
x 254.66 28.75 8.857 2.08e-05

The coefficients for the Intercept and for X, the independent variable, are the intercept and slope of the least-squares regression line. The coefficient of X is positive in case 2. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates. The graphs in fig. 2 further explain the linear model.
Fig. 2 The Linear Model for Case 2

The upper-left plot in fig. 2 shows the residual errors plotted against their fitted values; the residuals are randomly distributed in the plot. The upper-right plot is a standard Q-Q plot, which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis; both plots show the positive linear relation between AIC and O. The lower-left plot shows the square root of the standardized residuals as a function of the fitted values, and again there is no obvious trend in this plot. Finally, the lower-right plot shows each point's leverage, with superimposed contour lines for Cook's distance, another measure of the importance of each observation to the regression. A smaller distance means that removing the observation has little effect on the regression results, while distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model. For case 2, Cook's distance implies that the linear model is a good fit because the values of the distance lie between 0.5 and 1. The lower plots are also used for diagnostic purposes.

Remark: The value of AIC is computable if the number of attributes m is less than or equal to 211, i.e. m ≤ 211. If the value of m is greater than 211, then the value of AIC is incomputable and thus there is no linear relation between AIC and O.

Case 3: In case 3 the sample size n varies while the number of parameters k and the number of
attributes m are fixed. The linear model is applied; the values are shown in table 5. Table 5 The sample size n varies
The value of the correlation coefficient r is -0.3, which shows a weak and negative linear relation between the values of AIC and O when the sample size n varies. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 3 is given below:

Residuals:
Min 1Q Median 3Q Max
-3.4479 -0.8900 0.6814 1.1637 2.1814

The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.

Residual standard error: 2.073 on 8 degrees of freedom
Multiple R-squared: 0.1141, Adjusted R-squared: 0.003309

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The most important number is Multiple R-squared; its value of 0.1141 is small, consistent with the weak correlation coefficient r reported above. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation. Hence, being close to 0 in this case implies that the two variables are only very weakly related. In other words, with a Multiple R-squared of 0.1141, the least-squares regression line does not fit the data points well, and there is only a weak negative linear relation between the two variables. In our case a weak negative linear relation exists between the values of AIC and O when the sample size n varies.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.128 35.064 1.487 0.175
x -1.715 1.690 -1.015 0.340

The coefficients for the Intercept and for X, the independent variable, are the intercept and slope of the least-squares regression line. The t-values for the Intercept and X are low in case 3. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates. The graphs in fig. 3 further explain the linear model.
Fig. 3 The Linear Model for Case 3

The upper-left plot in fig. 3 shows the residual errors plotted against their fitted values. The residuals are distributed across the plot, but compared to the plots in fig. 1 and fig. 2 they are not as evenly or randomly distributed. The upper-right plot is a standard Q-Q plot, which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis; both plots show the linear relation between AIC and O. The lower-left plot shows the square root of the standardized residuals as a function of the fitted values, and again there is no obvious trend in this plot. Finally, the lower-right plot shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook's distance, which is another measure of the importance of each observation to the regression.
A smaller distance means that removing the observation has little effect on the regression results, while distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model. For case 3, Cook's distance implies that the linear model is adequate because the values of the distance lie between 0.5 and 1. The lower plots are also used for diagnostic purposes.

Remark: We compute the value of AIC by setting the required values, with the sample size n equal to 12000, the number of parameters k equal to 3 and the number of attributes m equal to 211, and we observe that AIC is computable. A linear relation exists between AIC and O, but it is very weak and negative. Therefore, we can say that AIC is computable even when the sample size n is 12000.

Case 4: In case 4 the sample size n varies, the number of parameters k is fixed but the number of records for each parameter is equal, and the number of attributes m is also fixed. The linear model is applied; the values are shown in Table 6.

Table 6 The sample size n varies
The value of the correlation coefficient r is 0.0, which shows that there is no linear relation between the values of AIC and O when the sample size n varies and the value of each parameter is the same in the dataset. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 4 is given below:

Residuals:
Min 1Q Median 3Q Max
-4.17 -0.97 0.48 1.53 2.23

The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.

Residual standard error: 2.041 on 9 degrees of freedom

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The Multiple R-squared value is undetermined in this case. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation; here it is effectively 0, which implies that the relation between the two variables is feeble. In other words, if the value of Multiple R-squared in the regression statistics is 0, then the least-squares regression line does not fit the data points at all, and there is at most a very weak linear relation between the two variables.

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.1700 0.6455 25.05 1.24e-09 ***
x NA NA NA NA

The coefficients for the Intercept and for X, the independent variable, are the intercept and slope of the least-squares regression line. The coefficient of X is not applicable (NA) in case 4. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates. The graphs in fig. 4 further explain the linear model.
Fig. 4 The Linear Model for Case 4

The upper-left plot in fig. 4 shows the residual errors plotted against their fitted values. The residuals are not randomly distributed in the plot, which shows that the residual error is zero. The upper-right plot is a standard Q-Q plot,
which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis. The lower-left plot in fig. 4 shows the square root of the standardized residuals as a function of the fitted values; the plot shows a vertical line, which indicates that the residual error is zero. Because of the zero residual error, the fourth plot for Cook's distance is not produced, i.e. for case 4 Cook's distance is not calculated, which implies a poor linear model.

Remark: If the sample size n varies while the number of parameters k, the number of iterations l and the number of attributes m are fixed, and the value of each parameter is the same in the dataset, then there is no linear relation between AIC and O.

The result of these four cases is that the values of AIC and O depend only on the number of parameters k and the number of attributes m; the sample size n has no impact on the value of AIC or O. Another issue is that if the number of parameters exceeds 22 or the number of attributes exceeds 211, then the dataset is over-fitted; hence the value of AIC is incomputable and there is no linear relation between AIC and O.

5 Results and Discussion
The model selection criterion is used to map the appropriate algorithms to a particular dataset in order to extract the knowledge. A vehicle dataset, cars, which describes different models of brands from different countries, is selected. The number of attributes is 9, the number of datapoints (records, i.e. the sample size) is 261 and the number of parameters, or brands, is 30. These parameters are brands from 3 countries: 14 from the US, 10 from Europe and 6 from Japan. The distribution of the records in the chosen dataset is: 62.45% are from the US, 18.00% are from Europe and 19.45% are from Japan. The percentage of parameters/brands from the US is 46.67%, from Europe 33.33% and from Japan 16.67%.

Case 1: Fitness (Over- and Under-fitted) of the Dataset
This case is about the fitness of a dataset, i.e. whether the dataset is over-fitted or under-fitted. The number of attributes, the sample size and the number of parameters are required to compute the value of AIC. In this case the values of these parameters are made either small or large to test the fitness of the dataset.

Table 7 Models Selection with Variable Parameters
In the first two cases of Table 7 the value of AIC and BIC is computable, but it also indicates that the dataset is very close to being over-fitted. In the third case the value of AIC and BIC is infinity even though the sample size is small, and the dataset is over-fitted. In the fourth case the dataset is under-fitted because the number of parameters, the number of attributes and the sample size are very small. Hence we can say that if the number of attributes, the observed data or the number of parameters is very small then the dataset is under-fitted, and if the values of both parameters are large then the dataset is over-fitted. The number of parameters and the number of attributes play a vital role in a dataset becoming over- or under-fitted; therefore, in order to avoid these problems, the datasets must be organized with care. Furthermore, the value of AIC is less than BIC in these cases; therefore, AIC is the right choice because the smallest value is the best.
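The fitness idea in Table 7 can be mimicked with a small R experiment: with too few parameters the model under-fits, and as parameters are added beyond what the data support the AIC penalty takes over. The data below are synthetic placeholders, not the cars dataset:

    set.seed(1)
    x <- seq(0, 10, length.out = 40)
    y <- 3 + 2 * x - 0.2 * x^2 + rnorm(40)                # the true model is quadratic
    fits <- lapply(1:8, function(p) lm(y ~ poly(x, p)))   # candidate models of increasing size
    sapply(fits, AIC)                                     # AIC rises again once the model over-fits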
Case 2: Correlation between the value of AIC and the Log of O
We compute the correlation to evaluate the strength of the relation between the value of AIC of different datasets and the logarithm of the complexities O of different data mining algorithms. The formula of correlation is given in equation (13) [28][29][30]:

r = [N ΣXY - (ΣX)(ΣY)] / sqrt{ [N ΣX² - (ΣX)²] [N ΣY² - (ΣY)²] }    (13)

where r is the correlation, N is the number of values, and X and Y are the variables. In our case X is the value of AIC and Y represents the logarithm of the complexities O. A correlation (r) of +1.0 indicates a perfect positive relation, -1.0 a perfect negative relation, and 0 no relation at all. A value of the correlation (r) greater than 0.8 is considered strong, whereas a value of the correlation (r) less than 0.5 is judged weak [29][30][31][32][33].
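Equation (13) is the ordinary Pearson correlation and can be checked against R's built-in cor(). The vectors below are illustrative placeholders for a column of AIC values and a column of log-complexity values, not the paper's data:

    X <- c(26.8, 21.6, 24.6, 30.2, 28.1)   # hypothetical AIC values
    Y <- c(19.4, 23.6, 20.1, 25.0, 22.8)   # hypothetical log 'O' values
    N <- length(X)
    r <- (N * sum(X * Y) - sum(X) * sum(Y)) /
         sqrt((N * sum(X^2) - sum(X)^2) * (N * sum(Y^2) - sum(Y)^2))   # equation (13)
    all.equal(r, cor(X, Y))                # TRUE: identical to the built-in Pearson correlation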
Table 8 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities O of the K-means clustering data mining algorithm.

Table 8 The Complexities O of K-means clustering Algorithm & AIC of datasets
The value of the correlation (r) is 0.8, which shows a strong and positive relationship between AIC and the K-means clustering algorithm. The graph in fig. 5 depicts this relationship.
Fig. 5 The Correlation between AIC and the Complexities O of the K-means Algorithm

The graph in fig. 5 shows the correlation between the model selection criterion AIC and the O of the K-means clustering data mining algorithm for the different datasets. The graph shows a strong and positive relationship between these values. Table 9 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the C4.5 (Decision Tree) data mining algorithm.

Table 9 The Complexities of C4.5 Algorithm & AIC of datasets
The value of the correlation (r) is 0.6, which shows a positive relationship between AIC and the C4.5 algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm, but it still indicates a moderate relationship between AIC and O. The graph in fig. 6 depicts this relationship.
Fig. 6 The Correlation between AIC and the O of the C4.5 Algorithm

The graph in fig. 6 shows the correlation between the model selection criterion AIC and the O of the C4.5 (Decision Tree) data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. Table 10 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the Data Visualization data mining algorithm.

Table 10 The Complexities of Data Visualization Algorithm & AIC of datasets
The value of the correlation (r) is 0.5, which shows a positive relationship between AIC and the Data Visualization algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm and also less than that for C4.5, but it still indicates a moderate relationship between AIC and O. The graph in fig. 7 depicts this relationship.
Fig. 7 The Correlation between AIC and the Complexities O of the Data Visualization Algorithm

The graph in fig. 7 shows the correlation between the model selection criterion AIC and the O of the Data Visualization data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. Table 11 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the K-NN data mining algorithm.

Table 11 The Complexities of K-NN Algorithm & AIC of datasets
The value of the correlation (r) is 0.6, which shows a positive relationship between AIC and the K-NN algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm and equal to that of C4.5, but it still indicates a moderate relationship between AIC and O. The graph in fig. 8 depicts this relationship.
Fig. 8 The Correlation between AIC and the O of the K-NN Data Mining Algorithm

The graph in fig. 8 shows the correlation between the model selection criterion AIC and the O of the K-Nearest Neighbour (K-NN) data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. Table 12 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the SOM data mining algorithm.

Table 12 The Complexities of SOM Algorithm & AIC of datasets
The value of the correlation (r) is 0.7, which shows a positive relationship between AIC and the SOM algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm and greater than those for the C4.5 and K-NN data mining algorithms, and it is still a good and strong relationship between AIC and O. The graph in fig. 9 depicts this relationship.
Fig. 9 The Correlation between AIC and the O of the SOM Algorithm

The graph in fig. 9 shows the correlation between the model selection criterion AIC and the O of the Self-organizing Map (SOM) data mining algorithm for the different datasets. The graph shows a strong and positive relationship between these values. Table 13 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the NNs data mining algorithm.

Table 13 The Complexities of NNs Algorithm & AIC of datasets
The value of the correlation (r) is 0.6, which shows a positive relationship between AIC and the NNs algorithm. This correlation is less than the correlations between AIC and the K-means clustering and SOM algorithms, and equal to those for the C4.5 and K-NN data mining algorithms, but it still indicates a moderate relationship between AIC and O. The graph in fig. 10 depicts this relationship.
Fig. 10 The Correlation between AIC and the O of the NNs Data Mining Algorithm

The graph in fig. 10 shows the correlation between the model selection criterion AIC and the O of the Neural Networks (NNs) data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. From the above results it is clear that a linear relation exists between the value of AIC and the complexities of the data mining algorithms, and that both depend on the dataset: if the values of the parameters of the dataset are changed, then the value of AIC and the complexities O will also change, but there will be no effect on their relationship. The correlation between the complexity of the K-means clustering algorithm and the value of AIC is stronger than for any of the other selected data mining algorithms, but for all of them the relationship between the complexities O and the value of AIC is positive and at least moderate. The coefficient of determination, r², a measure of the linear association between the variables AIC and O, is shown in Table 14. It is computed by taking the square of the correlation (r). It is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable; it allows us to determine how confident one can be in making predictions from a certain graph or model. It is the ratio of the explained variation to the total variation, with 0 ≤ r² ≤ 1.
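For a simple linear regression the coefficient of determination is just the square of r, which is also the Multiple R-squared reported by summary(). A small illustrative check in R (the vectors are the same placeholders used earlier, not the paper's data):

    x <- c(26.8, 21.6, 24.6, 30.2, 28.1)   # hypothetical AIC values
    y <- c(19.4, 23.6, 20.1, 25.0, 22.8)   # hypothetical log 'O' values
    cor(x, y)^2                            # coefficient of determination r^2
    summary(lm(y ~ x))$r.squared           # the same value from the regression output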
Table 14 Coefficient of Determination, r²

Table 14 shows the coefficient of determination of the correlation (r) between the value of the model
selection criterion AIC and the complexities O of the K-means, C4.5, Data Visualization, K-NN, SOM and NNs data mining algorithms. The result is that 64% of the total variation in O can be explained by the linear relationship between AIC and O, and the other 36% remains unexplained, in the case of the K-means clustering data mining algorithm. Similarly, 36% of the total variation in O can be explained and the other 64% remains unexplained in the case of the C4.5 data mining algorithm; 30% can be explained and 70% remains unexplained in the case of the Data Visualization data mining algorithm; 36% can be explained and 64% remains unexplained in the case of the K-Nearest Neighbour (K-NN) data mining algorithm; 49% can be explained and 51% remains unexplained in the case of the Self-organizing Map (SOM) data mining algorithm; and, finally, 36% can be explained and 64% remains unexplained in the case of the Neural Networks (NNs) data mining algorithm. It is obvious from the table that AIC and O both depend on the given dataset: if the parameters of the dataset are changed, the values of AIC and O will also change. It is also obvious that a strong relationship exists between AIC and O.

Case 3: Mapping of the value of Model Selection Criterion AIC with the Complexities O of Data Mining Algorithms
In this case we present different data mining algorithms for clustering, classification and visualization and then select the appropriate algorithm for different datasets using the value of AIC.

Table 15 The Clustering Algorithms
Dataset        AIC     K-means  K-NN   SOM
Iris           26.8    19.4     19.1   27.2
Diabetes       21.6    23.6     25.6   39.5
Breastcancer   24.6    20.1     22.4   32.5
DNA            922.5   26.8     36.9   45.4
Cars           1054.8  25.2     23.6   39.1

Table 15 shows the value of AIC of the different datasets and the value of the log of O of clustering data mining algorithms such as K-means, K-NN and SOM.
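The mapping can be expressed as a small R rule using the Table 15 values: for each dataset, pick the clustering algorithm whose log complexity lies closest to the dataset's AIC. This "smallest absolute gap" rule is a hedged reading of the comparison made in the text:

    aic  <- c(Iris = 26.8, Diabetes = 21.6, Breastcancer = 24.6, DNA = 922.5, Cars = 1054.8)
    logO <- rbind(Kmeans = c(19.4, 23.6, 20.1, 26.8, 25.2),
                  KNN    = c(19.1, 25.6, 22.4, 36.9, 23.6),
                  SOM    = c(27.2, 39.5, 32.5, 45.4, 39.1))
    colnames(logO) <- names(aic)
    gap <- abs(sweep(logO, 2, aic, "-"))                      # |AIC - log O| for every pair
    apply(gap, 2, function(g) rownames(logO)[which.min(g)])   # best-matching algorithm per dataset

This picks SOM for Iris, K-means for Diabetes and K-NN for Breastcancer, in line with the discussion of fig. 11 below; for DNA and Cars every gap is very large, signalling over-fitted datasets.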
Fig. 11 further illustrates which algorithm is appropriate for the given datasets using the value of AIC of these datasets.

Fig. 11 The Clustering Algorithms for Datasets

The explanation of fig. 11 is as follows: the difference between the value of AIC of the Iris dataset and the value of the log of O of SOM is less than for the K-means and K-NN clustering algorithms; therefore, SOM is the appropriate clustering algorithm for the Iris dataset. The difference between the value of AIC of the Diabetes dataset and the value of the log of O of K-means is less than for the K-NN and SOM clustering algorithms; therefore, K-means is the appropriate clustering algorithm for the Diabetes dataset. Similarly, the difference between the value of AIC of the Breastcancer dataset and the value of the log of O of K-NN is less than for the K-means and SOM clustering algorithms; therefore, K-NN is the appropriate clustering algorithm for the Breastcancer dataset. The difference between the value of AIC of the datasets DNA and Cars and the value of the log of O of the clustering algorithms is very large; therefore, none of these clustering algorithms is suitable for these two datasets. We can say that the datasets DNA and Cars are over-fitted and need cleansing.

Table 16 The Classification Algorithms
Dataset        AIC     C4.5   NNs
Iris           26.8    16.8   38.5
Diabetes       21.6    22.4   51.3
Breastcancer   24.6    18.4   42.6
DNA            922.5   29.4   62.3
Cars           1054.8  20.5   46.4

Table 16 shows the value of AIC of the different datasets and the value of the log of O of classification data mining algorithms such as C4.5 and NNs. Fig. 12 further illustrates which algorithm is appropriate for the given datasets using the value of AIC of these datasets.
Fig. 12 The Classification Algorithms for Datasets

The explanation of fig. 12 is as follows: the difference between the values of AIC of the datasets Iris, Diabetes and Breastcancer and the value of the log of O of C4.5 is less than for the NNs classification algorithm; therefore, C4.5 is the appropriate classification algorithm for these datasets. The difference between the values of AIC of the datasets DNA and Cars and the values of the log of O of the classification algorithms is very large; therefore, none of these classification algorithms is suitable for these two datasets. We can say that the datasets DNA and Cars are over-fitted and need cleansing.

Table 17 The Data Visualization Algorithms
Dataset        AIC     Data Visualization  SOM
Iris           26.8    15.5                27.2
Diabetes       21.6    20.2                39.5
Breastcancer   24.6    16.7                32.5
DNA            922.5   22.9                45.4
Cars           1054.8  18.3                39.1

Table 17 shows the value of AIC of the different datasets and the value of the log of O of visualization data mining algorithms such as Data Visualization (2D graphs) and SOM. Fig. 13 further illustrates which algorithm is appropriate for the given datasets using the value of AIC of these datasets.
Fig. 13 The Data Visualization Algorithms for Datasets

The explanation of fig. 13 is as follows: the difference between the value of AIC of the Iris dataset and the value of the log of O of SOM is less than for the 2D-graphs data visualization algorithm; therefore, SOM is the appropriate visualization algorithm for the Iris dataset. The difference between the value of AIC of the Diabetes dataset and the value of the log of O of the 2D graphs is less than for the SOM visualization algorithm; therefore, 2D graphs is the appropriate visualization algorithm for the Diabetes dataset. For the Breastcancer dataset the difference between the value of AIC and the value of the log of O is the same for both visualization algorithms; therefore, both visualization algorithms are appropriate for the Breastcancer dataset. The difference between the value of AIC of the datasets DNA and Cars and the value of the log of O of the visualization algorithms is very large; therefore, none of these visualization algorithms is suitable for these two datasets. We can say that the datasets DNA and Cars are over-fitted and need cleansing.

6 Conclusion
In this paper we presented the model selection criterion, the parameters governing the fitness of a dataset and the linear relation between the value of AIC and the log of O. We conclude that the number of parameters of a given dataset should be moderate to enable knowledge extraction. The AIC performed better than BIC for all of the sample sizes considered; therefore, we opted for AIC as the selection criterion for a given dataset. We mapped the values of the model selection criterion AIC of a dataset against the complexities of the data mining algorithms K-means, K-NN, SOM, C4.5, NNs and Data Visualization; this approach leads to the selection of the appropriate data mining algorithm(s) for a particular dataset. We tested the values of AIC for the selection of algorithms over five different datasets. We conclude that if the difference between the value of AIC and the complexity O of a data mining algorithm is small, then the algorithm(s) is (are) suitable for the dataset; otherwise the dataset requires cleansing, i.e. we need to reduce the number of parameters or the number of attributes. How far AIC and O can be correlated to determine the choice of the data mining algorithm is certainly an avenue for further research.

Acknowledgement
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.

References:
[1] URL: http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture08.pdf, 2011.
[2] Aldrich, John, R. A. Fisher and the making of maximum likelihood 1912-1922, Statistical
Science 12 (3): 162-176, doi:10.1214/ss/1030037906, MR1617519, 1997.
[3] Andersen, Erling B., Asymptotic Properties of Conditional Maximum Likelihood Estimators, Journal of the Royal Statistical Society B 32, 283-301, 1970.
[4] Andersen, Erling B., Discrete Statistical Models with Social Science Applications, North Holland, 1980.
[5] Basu, Debabrata, Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu; J. K. Ghosh, editor, Lecture Notes in Statistics Volume 45, Springer-Verlag, 1988.
[6] Le Cam, Lucien, Maximum likelihood: an introduction, ISI Review 58 (2): 153-171, 1990.
[7] Burnham, Kenneth P., Anderson, David R., Model Selection and Multi-model Inference: a Practical Information-theoretic Approach, 2nd edition, Springer, ISBN: 0-387-95364-7, 2002.
[8] Brockwell, P.J., and Davis, R.A., Time Series: Theory and Methods, 2nd ed., Springer, 2009.
[9] Akaike, Hirotugu, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (6): 716-723, doi:10.1109/TAC.1974.1100705, MR0423716, 1974.
[10] Weakliem, David L., A Critique of the Bayesian Information Criterion for Model Selection, University of Connecticut, Sociological Methods & Research, vol. 27 no. 3, 359-397, 1999.
[11] Cavanaugh, Joseph E., Statistics and Actuarial Science, The University of Iowa, URL: http://myweb.uiowa.edu/cavaaugh/ms_lec_6_ho.pdf, 2009.
[12] Liddle, A.R., Information criteria for astrophysical model selection, http://xxx.adelaide.edu.au/PS_cache/astro-ph/pdf/0701/0701113v2.pdf
[13] Ernest S. et al., How to be a Bayesian in SAS: Model Selection Uncertainty in PROC LOGISTIC and PROC GENMOD, Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA, 2010.
[14] In Jae Myung, Tutorial on maximum likelihood estimation, Department of Psychology, Ohio State University, USA, Journal of Mathematical Psychology 47 (2003), pp. 90-100.
[15] Isabelle Guyon, A practical guide to model selection, ClopiNet, Berkeley, CA 94708, USA, 2010.
[16] Vladimir Cherkassky, Comparison of Model Selection Methods for Regression, Dept. of Electrical & Computer Eng., University of Minnesota, 2010.
[17] Schwarz, G., Estimating the dimension of a model, Ann Stat 6: 461-464, 1978.
[18] Burnham, K.P., Anderson, D.R., Model Selection and Inference, Springer, 1998.
[19] Parzen, E., Tanabe, K., Kitagawa, G., Selected Papers of Hirotugu Akaike, Springer, 1998.
[20] Li, W., DNA segmentation as a model selection process, Proc. RECOMB'01, in press, 2001.
[21] Li, W., New criteria for segmenting DNA sequences, submitted, 2001.
[22] Li, W., Sherriff, A., Liu, X., Assessing risk factors of human complex diseases by Akaike and Bayesian information criteria (abstract), Am J Hum Genet 67(Suppl): S222, 2000.
[23] Li, W., Yang, Y., How many genes are needed for a discriminant microarray data analysis, Proc. CAMDA'00, in press, 2001.
[24] Li, W., Yang, Y., Edington, J., Haghighi, F., Determining the number of genes needed for cancer classification using microarray data, submitted, 2001.
[25] Wentian Li, Dale R. Nyholt, Marker Selection by AIC and BIC, Laboratory of Statistical Genetics, The Rockefeller University, New York, NY, 2010.
[26] Venables, W.N., Smith, D.M. et al., An Introduction to R, version 2.14.2, 2012.
[27] R Programming Language, http://CRAN.R-project.org, 2012.
[28] Stapleton, James H., Linear Statistical Models, John Wiley & Sons Inc., ISBN: 9780470231463, 2009.
[29] Faraway, Julian James, Linear Models in R, Chapman & Hall/CRC, ISBN: 9781584884255, 2007.
[30] Neter, John, Wasserman, William, Kutner, Michael H., Applied Linear Regression Models, R.D. Irwin, ISBN: 9780256070682, 2010.
[31] Draper, Norman Richard, Smith, Harry, Applied Regression Analysis, Wiley, ISBN: 97804771029953, 2007.
[32] Neter, John, Wasserman, William, Kutner, Michael H., Applied Linear Statistical Models: Regression, Analysis of Variance and Experimental Designs, R.D. Irwin, ISBN: 9780256024470, 2010.
[33] Monahan, John F., A Primer on Linear Models, Chapman & Hall/CRC, 1st Edition, ISBN: 9781420062014, 2008.
[34] Bingham, N.H., Fry, John M., Regression: Linear Models in Statistics, Springer, 1st Edition, ISBN: 9781848829688, 2010.
[35] Rencher, Alvin C., Schaalje, G. Bruce, Linear Models in Statistics, Wiley-Interscience, 2nd
Edition, ISBN: 9780471754985, 2008.