Investigating the Statistical Linear Relation between the Model
Selection Criterion and the Complexities of Data Mining Algorithms
DOST MUHAMMAD KHAN 1 , NAWAZ MOHAMUDALLY 2 , D K R BABAJEE 3
1 Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN & PhD Student, School of Innovative Technologies & Engineering, University of Technology Mauritius (UTM), MAURITIUS
2 Associate Professor & Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM), MAURITIUS
3 Lecturer, Department of Applied Mathematical Sciences, SITE, University of Technology, Mauritius
Abstract: - The model selection criterion plays a vital role in preparing a dataset for further data mining actions. It is a gauge to determine whether the dataset is under-fitted or over-fitted; in either case the dataset is not suitable for knowledge extraction, since the knowledge obtained from it would be vague, ambiguous and possibly misleading. In this paper we investigate the linear relation between the selection of data mining algorithms and the model selection criterion, using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), the two most commonly used methodologies, in an attempt to set up the dataset for the data mining process cycle. Moreover, the complexities of data mining algorithms, together with the AIC or BIC at different steps within the data mining process cycle, are evaluated in order to apply the best algorithm and generate the optimum accuracy of the knowledge.

Key-Words: - AIC, BIC, Over-fitted, Under-fitted, Model Selection Criterion, Linear Model, Correlation

1 Introduction
The purpose of the model selection criterion is to identify a methodology, such as AIC, BIC, the VC dimension and so on, that best characterizes the model, that is, the candidate dataset being readied for data mining. The concept of over-fitting and under-fitting is important in data mining, as mentioned above. Over- and under-fitting are caused by missing, noisy, inconsistent and redundant values, and by the number of attributes in a dataset. We can avoid these problems by using one of the following techniques: apply upper or lower threshold values, remove attributes below a threshold value, and remove noisy and redundant attributes. The most effective solution to these problems is to use many training datasets and to avoid making either excessive or too few assumptions. Current models for the selection of fitting datasets are the following: the VC (Vapnik-Chervonenkis) dimension, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), SRMVC (Structural Risk Minimization with VC dimension), CV (Cross-validation), the Deviance Information Criterion, the Hannan-Quinn Information Criterion, the Jensen-Shannon Divergence and the Kullback-Leibler Divergence. The focus here is on AIC and BIC. A dataset is more appropriate for data mining if it has a minimum value of AIC or BIC. Both AIC and BIC have solid theoretical foundations: the AIC uses the Kullback-Leibler distance of information theory, and the BIC is based on the integrated likelihood of Bayesian theory. If the complexity of the good-fitting model, also called the true model, does not increase with the size of the dataset, BIC is the preferred criterion; otherwise AIC is the better choice. Since selecting the number of parameters and the number of attributes is the main issue in model selection, one has to take care of these important aspects of a dataset. Using too many parameters can fit the data perfectly, but this can be over-fitting; using too few parameters may not fit the dataset at all, which is under-fitting. This shows the importance of the parameters and the observed data in a given dataset. Variable selection by AIC or BIC provides an answer to this problem. We illustrate the importance of comparing different models/datasets with different numbers of parameters by using AIC and BIC. The idea of model selection criterion using
AIC or BIC has also been applied recently to epidemiology, microarray data analysis, and DNA sequence analysis [21] [22] [23] [24] [25]. The rest of the paper is organized as follows: section 2 discusses the model selection criteria AIC and BIC, and section 3 is about the methodology. In section 4 we present the statistical relation between the model selection criterion AIC and the complexities of data mining algorithms. The results are discussed in section 5 and finally the conclusion is drawn in section 6.

2 Model Selection Criteria
A brief introduction to the model selection criteria AIC and BIC is given below.

2.1 Akaike Information Criterion
The AIC is a criterion for model selection, introduced by Hirotugu Akaike in 1974, and is based on information theory. Suppose that the data are generated by some unknown process f. We consider two candidate models to represent f: g1 and g2. If we knew f, then we could find the information lost by using g1 to represent f by calculating the Kullback-Leibler divergence, D_KL(f, g1); similarly, the information lost by using g2 to represent f would be found by calculating D_KL(f, g2). We can then choose the candidate model that minimizes the information loss. The AIC tells nothing about how well a model fits the data in an absolute sense: if all the candidate models fit poorly, the AIC gives no warning. The AIC is thus a difference of the accuracy and the complexity of the model [7] [9] [10] [15] [17] [18] [20]. The mathematical formula of AIC is given in equation (1):

AIC = -2 log(likelihood) + 2k    (1)

where k is the number of parameters and log(likelihood) is the log of the likelihood. As the number of parameters is increased, the fit becomes closer to perfect and the log-likelihood gradually approaches 0; this term is also called the Model Accuracy. Therefore, AIC can be written as in equation (2):

AIC = No. of Parameters - ModelAccuracy    (2)

2.2 Bayesian Information Criterion
The BIC is a criterion for the selection of a model among a class of models with different numbers of parameters. When estimating the parameters of a model by maximum likelihood estimation, it is possible to increase the likelihood by adding parameters, which may result in over-fitting. BIC resolves this problem by introducing a penalty term for the number of parameters in the model. This penalty is larger in the BIC than in the related AIC. BIC is widely used for model identification in time series and linear regression. The main characteristics of BIC are: it measures the efficiency of the parameterized model in terms of predicting the data; it penalizes the complexity of the model, where complexity refers to the number of parameters in the model; it is exactly equal to the minimum description length criterion but with negative sign; and it is closely related to other likelihood criteria such as the AIC [4] [8] [12] [13] [16] [19]. The mathematical formula of BIC is given in equation (3):

BIC = -2 log(likelihood) + k log(n)    (3)

where k is the number of parameters, n is the sample size or the number of datapoints of the given dataset, the log-likelihood term (which gradually approaches 0 with the increase in the number of parameters) is also known as the Model Accuracy, and k log(n) is the model size. Therefore, BIC is given in equation (4):

BIC = ModelSize - ModelAccuracy    (4)
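As a concrete illustration of equations (1)-(4), the following minimal R sketch computes the Model Accuracy, the Model Size and both criteria for a fitted model; the lm fit on the built-in mtcars data is only a placeholder, not part of this study:

    fit    <- lm(mpg ~ wt + hp, data = mtcars)  # any candidate model
    loglik <- as.numeric(logLik(fit))           # log(likelihood): the Model Accuracy
    k      <- attr(logLik(fit), "df")           # number of estimated parameters
    n      <- nrow(mtcars)                      # sample size (datapoints)
    model_size <- k * log(n)                    # Model Size: k*log(n)
    aic_simple <- k - loglik                    # equations (2)/(10): No. of Parameters - ModelAccuracy
    bic_simple <- model_size - loglik           # equations (4)/(11): ModelSize - ModelAccuracy
    c(AIC = -2 * loglik + 2 * k,                # classical forms (1) and (3),
      BIC = -2 * loglik + k * log(n))           # identical to R's built-in AIC(fit) and BIC(fit)

The dataset or model with the smallest value of the chosen criterion is preferred, exactly as stated above.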
3 Methodology
Suppose there is a sample X = {x1, x2, ..., xn} of n observations, coming from a distribution with an unknown probability density function p(X | θ), where p(X | θ) is called a parametric model, in which all the parameters lie in finite-dimensional parameter spaces. These parameters are collected together to form a single m-dimensional parameter vector, θ = (θ1, θ2, ..., θm). To use the method of maximum likelihood, one first specifies the joint density function for all observations. The joint density function of the given observations is given below in equation (5):

p(X | θ) = p(x1, x2, ..., xn | θ) = p(x1 | θ) p(x2 | θ) ... p(xn | θ)    (5)

where the observed values x1, x2, ..., xn are treated as fixed "parameters" of this function, while θ is the function's variable and is allowed to vary freely. From this point of view, this distribution function is called the likelihood, as given in equation (6).
likelihood(θ | x1, x2, ..., xn) = p(x1, x2, ..., xn | θ) = ∏_{i=1}^{n} p(xi | θ)    (6)

It is more convenient to work with the logarithm of the likelihood function, called the log-likelihood, as shown in equation (7):

log(likelihood(θ | x1, x2, ..., xn)) = log( ∏_{i=1}^{n} p(xi | θ) ) = Σ_{i=1}^{n} log p(xi | θ),  with average log-likelihood (1/n) log(likelihood)    (7)

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. The maximum likelihood method finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data. The idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. From a statistical point of view, the method of maximum likelihood is considered robust and yields estimators with good statistical properties. It is a flexible method and can be applied to most models and to different types of data. Although the methodology of maximum likelihood estimation is simple, the implementation is mathematically intense [1] [2] [3] [5] [6] [11] [14]. We use the stepwise variable selection method, starting with one variable and then adding or removing a variable if the value of AIC or BIC is reduced. Stepwise selection is a locally optimal procedure and is tested with different starting sets of parameters so that the optimization is not carried to the extreme. The following steps explain the computation of the values of AIC and BIC:

Step 1: Calculate the maximum likelihood of the dataset. The likelihood function is simply the joint probability of observing the data, L(θ | x1, x2, ..., xn) = p(x1, x2, ..., xn | θ) = ∏_{i=1}^{n} p(xi | θ). Taking the log of this value gives the model accuracy, shown below in equation (8):

ModelAccuracy = log(likelihood)    (8)

Step 2: Compute the model size. The formula to calculate the model size is given in equation (9):

ModelSize = k log(n)    (9)

where k is the number of parameters and n is the number of datapoints.

Step 3: Compute the Minimum Description Length (MDL):

MDLScore = ModelSize - ModelAccuracy

The Minimum Description Length (MDL) is also referred to as the value of BIC; hence, from the value of MDL we can compute the values of AIC and BIC. The mathematical formulas of AIC and BIC are given in equations (10) and (11) respectively:

AIC = No. of Parameters - ModelAccuracy    (10)

BIC = ModelSize - ModelAccuracy    (11)

Equations (2) and (10) are identical; similarly, equations (4) and (11) are identical. A smaller value of the criterion for a given model indicates the better choice; therefore, the selection criterion which produces the smallest value for the given model is the best choice [1].

4 The Statistical Linear Relation between the Model Selection Criterion AIC and the Complexities of Data Mining Algorithms
In order to find the statistical linear relation between the value of AIC and the logarithm of the complexities O of data mining algorithms, we apply the linear model (lm) shown in equation (12):

lm(y ~ x):  y = mx + b = β0 + β1 x    (12)

where m is the slope of the line and b is the intercept of the line with the y-axis (the slope is positive if the line goes up and negative if it goes down), β0 is a constant, β1 is the regression coefficient, x is the value of the independent variable and y is the value of the dependent variable. In other words, β0 is the theoretical y-intercept and β1 is the theoretical slope [28][29][30].
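The linear model of equation (12) can be fitted directly in R with lm(). The sketch below uses synthetic placeholder values for x and y; they are not the AIC and complexity values analysed in the cases that follow:

    set.seed(42)
    x <- seq(2, 22, by = 2)                 # e.g. AIC values (illustrative only)
    y <- 1.5 + 0.8 * x + rnorm(length(x))   # e.g. log of the complexity O, roughly linear in x
    fit <- lm(y ~ x)                        # least-squares estimates of beta0 and beta1
    summary(fit)                            # residuals, coefficients, Multiple R-squared
    cor(x, y)                               # the correlation coefficient r quoted in each case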
In our case X is the value of AIC and Y is the logarithm of the complexities O of data mining algorithms. Table 1 shows the complexities of commonly used data mining algorithms.

Table 1 The Complexities O of Data Mining Algorithms
Table 1 gives the complexity of the K-means, C4.5, Data Visualization, K-NN, SOM and NNs data mining algorithms, where n is the sample size, m is the number of attributes, k is the number of clusters, l is the number of iterations and d is the dimension (in our case 2). We take the log of the complexities of these algorithms because the log makes it efficient to deal with extremely large values. There are other reasons for taking the log of a value: the log is taken when the transformed data come closer to satisfying the assumptions of the statistical model; to analyze exponential processes, because the log function is the inverse of the exponential function; to measure the pH or acidity of a chemical solution; to measure the intensity of an earthquake on the Richter scale; and to model many natural processes statistically. In our case it is used to model the value of the computational complexity O of a data mining algorithm against the value of the model selection criterion AIC. This will help to select the right algorithm for the given dataset.
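The log of a complexity is straightforward to compute once the complexity expression is known. Table 1 itself is not reproduced in the extracted text, so the expressions and the iteration count below are common textbook forms assumed purely for illustration:

    n <- 261; m <- 9; k <- 30; l <- 10; d <- 2   # sample size, attributes, clusters, iterations (assumed), dimension
    O <- c(Kmeans = n * k * l * d,               # assumed K-means complexity
           KNN    = n * m,                       # assumed K-NN complexity
           C45    = m * n * log(n))              # assumed C4.5 complexity
    log(O)                                       # the log values that are compared with AIC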
Table 2 below shows the value of AIC of 5 different datasets.

Table 2 The Value of Model Selection Criterion AIC

The essential parameters for the calculation of the values of AIC and of the logarithm of the complexities O of data mining algorithms are: the number of parameters k, the number of attributes m and the sample size n. The following four cases further illustrate the statistical linear relation.

Case 1: In case 1 the number of parameters k varies while the number of attributes m and the sample size n are fixed. The linear model is applied; the values are shown in Table 3.
Table 3 The number of parameters k varies
The value of the correlation coefficient r is 0.92, which is very close to 1 and shows a strong and positive linear relation between the values of AIC and O when the number of parameters k varies. The goal of a linear regression is to find the best estimates for β0 and β1 by minimizing the residual error. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 1 is given below:

Residuals:
Min 1Q Median 3Q Max
-116.70 -73.07 -26.85 65.30 192.09

The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.

Residual standard error: 92.9 on 20 degrees of freedom
Multiple R-squared: 0.8472, Adjusted R-squared: 0.8396

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The most important number is Multiple R-squared, which for a simple linear regression is the square of the correlation coefficient; its value of 0.8472 is consistent with the correlation coefficient r = 0.92 reported above. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation. Hence, in this case, being close to 1 implies that we can
conclude that the two variables are indeed related. In other words, if the value of Multiple R-squared in the regression statistics is close to 1, then the least-squares regression line fits the data points well, and there is a linear (positive or negative) relation between the two variables. In our case a strong positive linear relation exists between the values of AIC and O when the number of parameters k varies.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4724.68 480.68 -9.829 4.22e-09
x 210.45 19.98 10.530 1.32e-09

The coefficients for the Intercept and for X, the independent variable, are the two required values: they are the intercept and slope of the least-squares regression line. The coefficient of X is positive in case 1. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates.
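The quantities quoted in this summary, and the four diagnostic plots discussed next, can be reproduced for any fitted model in R. A hedged sketch, reusing the illustrative object fit from the earlier example:

    s <- summary(fit)
    quantile(residuals(fit))   # Min, 1Q, Median, 3Q, Max of the residuals
    s$sigma                    # residual standard error
    s$r.squared                # Multiple R-squared
    s$adj.r.squared            # Adjusted R-squared
    coef(s)                    # estimates, standard errors, t values and Pr(>|t|)
    par(mfrow = c(2, 2))
    plot(fit)                  # residuals vs fitted, normal Q-Q, scale-location, leverage with Cook's distance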
The graphs in fig. 1 further explain the linear model.

Fig. 1 The Linear Model for Case 1

The upper-left plot in fig. 1 shows the residual errors plotted against their fitted values; the residuals are randomly distributed in the plot. The upper-right plot is a standard Q-Q plot, which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis; both plots show the positive linear relation between AIC and O. The lower-left plot shows the square root of the standardized residuals as a function of the fitted values, and again there is no obvious trend in this plot. Finally, the lower-right plot shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook's distance, which is another measure of the importance of each observation to the regression. A smaller distance means that removing the observation has little effect on the regression results, while distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model. For case 1, Cook's distance implies that the linear model is a good fit because the values of the distance lie between 0.5 and 1. The lower plots are also used for diagnostic purposes [31][32][33][34][35].

Remark: The value of AIC is computable if the number of parameters k is within the range from 2 to 22, i.e. 2 ≤ k ≤ 22. If the value of k is greater than 22, then the value of AIC is incomputable and thus there is no linear relation between AIC and O.

Case 2: In case 2 the number of attributes m varies while the number of parameters k and the sample size n are fixed. The linear model is applied; the values are shown in Table 4.

Table 4 The number of attributes m varies
The value of the correlation coefficient r is 0.95, which is very close to 1 and shows a strong and positive linear relation between the values of AIC and O when the number of attributes m varies. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 2 is given below:

Residuals:
Min 1Q Median 3Q Max
-100.81 -60.27 -29.19 64.05 131.42
The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.
Residual standard error: 90.49 on 8 degrees of freedom
Multiple R-squared: 0.9075, Adjusted R-squared: 0.8959

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The most important number is Multiple R-squared, which for a simple linear regression is the square of the correlation coefficient; its value of 0.9075 is consistent with the correlation coefficient r = 0.95 reported above. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation. Hence, being close to 1 in this case implies that the two variables are indeed related. In other words, if the value of Multiple R-squared in the regression statistics is close to 1, then the least-squares regression line fits the data points well, and there is a linear (positive or negative) relation between the two variables. In our case a strong positive linear relation exists between the values of AIC and O when the number of attributes m varies.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5511.41 680.60 -8.098 4.00e-05
x 254.66 28.75 8.857 2.08e-05

The coefficients for the Intercept and for X, the independent variable, are the intercept and slope of the least-squares regression line. The coefficient of X is positive in case 2. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates. The graphs in fig. 2 further explain the linear model.
Fig. 2 The Linear Model for Case 2

The upper-left plot in fig. 2 shows the residual errors plotted against their fitted values; the residuals are randomly distributed in the plot. The upper-right plot is a standard Q-Q plot, which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis; both plots show the positive linear relation between AIC and O. The lower-left plot shows the square root of the standardized residuals as a function of the fitted values, and again there is no obvious trend in this plot. Finally, the lower-right plot shows each point's leverage, with superimposed contour lines for Cook's distance, another measure of the importance of each observation to the regression. A smaller distance means that removing the observation has little effect on the regression results, while distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model. For case 2, Cook's distance implies that the linear model is a good fit because the values of the distance lie between 0.5 and 1. The lower plots are also used for diagnostic purposes.

Remark: The value of AIC is computable if the number of attributes m is less than or equal to 211, i.e. m ≤ 211. If the value of m is greater than 211, then the value of AIC is incomputable and thus there is no linear relation between AIC and O.

Case 3: In case 3 the sample size n varies while the number of parameters k and the number of
attributes m are fixed. The linear model is applied; the values are shown in table 5. Table 5 The sample size n varies
The value of the correlation coefficient r is -0.3, which shows a weak and negative linear relation between the values of AIC and O when the sample size n varies. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 3 is given below:

Residuals:
Min 1Q Median 3Q Max
-3.4479 -0.8900 0.6814 1.1637 2.1814

The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.

Residual standard error: 2.073 on 8 degrees of freedom
Multiple R-squared: 0.1141, Adjusted R-squared: 0.003309

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The most important number is Multiple R-squared; its value of 0.1141 is small, consistent with the weak correlation coefficient r reported above. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation. Hence, being close to 0 in this case implies that the two variables are only very weakly related. In other words, with a Multiple R-squared of 0.1141, the least-squares regression line does not fit the data points well, and there is only a weak negative linear relation between the two variables. In our case a weak negative linear relation exists between the values of AIC and O when the sample size n varies.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.128 35.064 1.487 0.175
x -1.715 1.690 -1.015 0.340

The coefficients for the Intercept and for X, the independent variable, are the intercept and slope of the least-squares regression line. The t-values for the Intercept and X are low in case 3. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates. The graphs in fig. 3 further explain the linear model.
Fig. 3 The Linear Model for Case 3

The upper-left plot in fig. 3 shows the residual errors plotted against their fitted values. The residuals are distributed across the plot, but compared to the plots in fig. 1 and fig. 2 they are not as evenly or randomly distributed. The upper-right plot is a standard Q-Q plot, which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis; both plots show the linear relation between AIC and O. The lower-left plot shows the square root of the standardized residuals as a function of the fitted values, and again there is no obvious trend in this plot. Finally, the lower-right plot shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook's distance, which is another measure of the importance of each observation to the regression.
A smaller distance means that removing the observation has little effect on the regression results, while distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model. For case 3, Cook's distance implies that the linear model is adequate because the values of the distance lie between 0.5 and 1. The lower plots are also used for diagnostic purposes.

Remark: We compute the value of AIC by setting the required values, with the sample size n equal to 12000, the number of parameters k equal to 3 and the number of attributes m equal to 211, and we observe that AIC is computable. A linear relation exists between AIC and O, but it is very weak and negative. Therefore, we can say that AIC is computable even when the sample size n is 12000.

Case 4: In case 4 the sample size n varies, the number of parameters k is fixed but the number of records for each parameter is equal, and the number of attributes m is also fixed. The linear model is applied; the values are shown in Table 6.

Table 6 The sample size n varies
The value of the correlation coefficient r is 0.0, which shows that there is no linear relation between the values of AIC and O when the sample size n varies and the value of each parameter is the same in the dataset. The summary of the linear model and the regression statistics using the R language [26][27][29] for case 4 is given below:

Residuals:
Min 1Q Median 3Q Max
-4.17 -0.97 0.48 1.53 2.23

The Residuals give the difference between the experimental and predicted Y, i.e. the complexities O of the data mining algorithms.

Residual standard error: 2.041 on 9 degrees of freedom

The residual standard error is the residuals adjusted to ensure that they have a standard deviation of 1; they already have a mean of zero. The Multiple R-squared value is undetermined in this case. Multiple R-squared is always between 0 and 1, and a value closer to 1 indicates a stronger relation; here it is effectively 0, which implies that the relation between the two variables is feeble. In other words, if the value of Multiple R-squared in the regression statistics is 0, then the least-squares regression line does not fit the data points at all, and there is at most a very weak linear relation between the two variables.

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.1700 0.6455 25.05 1.24e-09 ***
x NA NA NA NA

The coefficients for the Intercept and for X, the independent variable, are the intercept and slope of the least-squares regression line. The coefficient of X is not applicable (NA) in case 4. The output gives the coefficient for each parameter, including the intercept (the constant), the standard errors, the t-values (the t-value is the coefficient divided by the standard error), the p-value associated with the variable, and the confidence intervals of the parameter estimates. The graphs in fig. 4 further explain the linear model.
Fig. 4 The Linear Model for Case 4

The upper-left plot in fig. 4 shows the residual errors plotted against their fitted values. The residuals are not randomly distributed in the plot, which shows that the residual error is zero. The upper-right plot is a standard Q-Q plot,
which shows that the residual errors are normally distributed. A residual plot draws a scatter plot of each independent variable on the x-axis against the residual on the y-axis, and a line fit plot draws scatter plots of each independent variable on the x-axis against the predicted and actual values of the dependent variable on the y-axis. The lower-left plot in fig. 4 shows the square root of the standardized residuals as a function of the fitted values; the plot shows a vertical line, which indicates that the residual error is zero. Because of the zero residual error, the fourth plot for Cook's distance is not produced, i.e. for case 4 Cook's distance is not calculated, which implies a poor linear model.

Remark: If the sample size n varies while the number of parameters k, the number of iterations l and the number of attributes m are fixed, and the value of each parameter is the same in the dataset, then there is no linear relation between AIC and O.

The result of these four cases is that the values of AIC and O depend only on the number of parameters k and the number of attributes m; the sample size n has no impact on the value of AIC or O. Another issue is that if the number of parameters exceeds 22 or the number of attributes exceeds 211, then the dataset is over-fitted; hence the value of AIC is incomputable and there is no linear relation between AIC and O.

5 Results and Discussion
The model selection criterion is used to map the appropriate algorithms to a particular dataset in order to extract the knowledge. A vehicle dataset, cars, which describes different models of brands from different countries, is selected. The number of attributes is 9, the number of datapoints (records, i.e. the sample size) is 261 and the number of parameters, or brands, is 30. These parameters are brands from 3 countries: 14 from the US, 10 from Europe and 6 from Japan. The distribution of the records in the chosen dataset is: 62.45% are from the US, 18.00% are from Europe and 19.45% are from Japan. The percentage of parameters/brands from the US is 46.67%, from Europe 33.33% and from Japan 16.67%.

Case 1: Fitness (Over- and Under-fitted) of the Dataset
This case is about the fitness of a dataset, i.e. whether the dataset is over-fitted or under-fitted. The number of attributes, the sample size and the number of parameters are required to compute the value of AIC. In this case the values of these parameters are made either small or large to test the fitness of the dataset.

Table 7 Models Selection with Variable Parameters
In the first two cases of Table 7 the value of AIC and BIC is computable, but it also indicates that the dataset is very close to being over-fitted. In the third case the value of AIC and BIC is infinity even though the sample size is small, and the dataset is over-fitted. In the fourth case the dataset is under-fitted because the number of parameters, the number of attributes and the sample size are very small. Hence we can say that if the number of attributes, the observed data or the number of parameters is very small then the dataset is under-fitted, and if the values of both parameters are large then the dataset is over-fitted. The number of parameters and the number of attributes play a vital role in a dataset becoming over- or under-fitted; therefore, in order to avoid these problems, the datasets must be organized with care. Furthermore, the value of AIC is less than BIC in these cases; therefore, AIC is the right choice because the smallest value is the best.
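The fitness idea in Table 7 can be mimicked with a small R experiment: with too few parameters the model under-fits, and as parameters are added beyond what the data support the AIC penalty takes over. The data below are synthetic placeholders, not the cars dataset:

    set.seed(1)
    x <- seq(0, 10, length.out = 40)
    y <- 3 + 2 * x - 0.2 * x^2 + rnorm(40)                # the true model is quadratic
    fits <- lapply(1:8, function(p) lm(y ~ poly(x, p)))   # candidate models of increasing size
    sapply(fits, AIC)                                     # AIC rises again once the model over-fits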
Case 2: Correlation between the value of AIC and the Log of O
We compute the correlation to evaluate the strength of the relation between the value of AIC of different datasets and the logarithm of the complexities O of different data mining algorithms. The formula of correlation is given in equation (13) [28][29][30]:

r = [N ΣXY - (ΣX)(ΣY)] / sqrt{ [N ΣX² - (ΣX)²] [N ΣY² - (ΣY)²] }    (13)

where r is the correlation, N is the number of values, and X and Y are the variables. In our case X is the value of AIC and Y represents the logarithm of the complexities O. A correlation (r) of +1.0 indicates a perfect positive relation, -1.0 a perfect negative relation, and 0 no relation at all. A value of the correlation (r) greater than 0.8 is considered strong, whereas a value of the correlation (r) less than 0.5 is judged weak [29][30][31][32][33].
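Equation (13) is the ordinary Pearson correlation and can be checked against R's built-in cor(). The vectors below are illustrative placeholders for a column of AIC values and a column of log-complexity values, not the paper's data:

    X <- c(26.8, 21.6, 24.6, 30.2, 28.1)   # hypothetical AIC values
    Y <- c(19.4, 23.6, 20.1, 25.0, 22.8)   # hypothetical log 'O' values
    N <- length(X)
    r <- (N * sum(X * Y) - sum(X) * sum(Y)) /
         sqrt((N * sum(X^2) - sum(X)^2) * (N * sum(Y^2) - sum(Y)^2))   # equation (13)
    all.equal(r, cor(X, Y))                # TRUE: identical to the built-in Pearson correlation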
Table 8 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities O of the K-means clustering data mining algorithm.

Table 8 The Complexities O of K-means clustering Algorithm & AIC of datasets
The value of the correlation (r) is 0.8, which shows a strong and positive relationship between AIC and the K-means clustering algorithm. The graph in fig. 5 depicts this relationship.
Fig. 5 The Correlation between AIC and the Complexities O of the K-means Algorithm

The graph in fig. 5 shows the correlation between the model selection criterion AIC and the O of the K-means clustering data mining algorithm for the different datasets. The graph shows a strong and positive relationship between these values. Table 9 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the C4.5 (Decision Tree) data mining algorithm.

Table 9 The Complexities of C4.5 Algorithm & AIC of datasets
The value of the correlation (r) is 0.6, which shows a positive relationship between AIC and the C4.5 algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm, but it still indicates a moderate relationship between AIC and O. The graph in fig. 6 depicts this relationship.
Fig. 6 The Correlation between AIC and the O of the C4.5 Algorithm

The graph in fig. 6 shows the correlation between the model selection criterion AIC and the O of the C4.5 (Decision Tree) data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. Table 10 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the Data Visualization data mining algorithm.

Table 10 The Complexities of Data Visualization Algorithm & AIC of datasets
The value of the correlation (r) is 0.5, which shows a positive relationship between AIC and the Data Visualization algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm and also less than that for C4.5, but it still indicates a moderate relationship between AIC and O. The graph in fig. 7 depicts this relationship.
Fig. 7 The Correlation between AIC and the Complexities O of the Data Visualization Algorithm

The graph in fig. 7 shows the correlation between the model selection criterion AIC and the O of the Data Visualization data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. Table 11 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the K-NN data mining algorithm.

Table 11 The Complexities of K-NN Algorithm & AIC of datasets
The value of the correlation (r) is 0.6, which shows a positive relationship between AIC and the K-NN algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm and equal to that of C4.5, but it still indicates a moderate relationship between AIC and O. The graph in fig. 8 depicts this relationship.
Fig. 8 The Correlation between AIC and the O of the K-NN Data Mining Algorithm

The graph in fig. 8 shows the correlation between the model selection criterion AIC and the O of the K-Nearest Neighbour (K-NN) data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. Table 12 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the SOM data mining algorithm.

Table 12 The Complexities of SOM Algorithm & AIC of datasets
The value of the correlation (r) is 0.7, which shows a positive relationship between AIC and the SOM algorithm. This correlation is less than the correlation between AIC and the K-means clustering algorithm and greater than those for the C4.5 and K-NN data mining algorithms, and it is still a good and strong relationship between AIC and O. The graph in fig. 9 depicts this relationship.
Fig. 9 The Correlation between AIC and the O of the SOM Algorithm

The graph in fig. 9 shows the correlation between the model selection criterion AIC and the O of the Self-organizing Map (SOM) data mining algorithm for the different datasets. The graph shows a strong and positive relationship between these values. Table 13 shows the correlation (r) between the value of AIC of the datasets Iris, Breastcancer, Diabetes, DNA and Cars and the complexities of the NNs data mining algorithm.

Table 13 The Complexities of NNs Algorithm & AIC of datasets
The value of the correlation (r) is 0.6, which shows a positive relationship between AIC and the NNs algorithm. This correlation is less than the correlations between AIC and the K-means clustering and SOM algorithms, and equal to those for the C4.5 and K-NN data mining algorithms, but it still indicates a moderate relationship between AIC and O. The graph in fig. 10 depicts this relationship.
Fig. 10 The Correlation between AIC and the O of the NNs Data Mining Algorithm

The graph in fig. 10 shows the correlation between the model selection criterion AIC and the O of the Neural Networks (NNs) data mining algorithm for the different datasets. The graph shows a moderate and positive relationship between these values. From the above results it is clear that a linear relation exists between the value of AIC and the complexities of the data mining algorithms, and that both depend on the dataset: if the values of the parameters of the dataset are changed, then the value of AIC and the complexities O will also change, but there will be no effect on their relationship. The correlation between the complexity of the K-means clustering algorithm and the value of AIC is stronger than for any of the other selected data mining algorithms, but for all of them the relationship between the complexities O and the value of AIC is positive and at least moderate. The coefficient of determination, r², a measure of the linear association between the variables AIC and O, is shown in Table 14. It is computed by taking the square of the correlation (r). It is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable; it allows us to determine how confident one can be in making predictions from a certain graph or model. It is the ratio of the explained variation to the total variation, with 0 ≤ r² ≤ 1.
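For a simple linear regression the coefficient of determination is just the square of r, which is also the Multiple R-squared reported by summary(). A small illustrative check in R (the vectors are the same placeholders used earlier, not the paper's data):

    x <- c(26.8, 21.6, 24.6, 30.2, 28.1)   # hypothetical AIC values
    y <- c(19.4, 23.6, 20.1, 25.0, 22.8)   # hypothetical log 'O' values
    cor(x, y)^2                            # coefficient of determination r^2
    summary(lm(y ~ x))$r.squared           # the same value from the regression output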
Table 14 Coefficient of Determination, r²

Table 14 shows the coefficient of determination of the correlation (r) between the value of the model
selection criterion AIC and the complexities O of the K-means, C4.5, Data Visualization, K-NN, SOM and NNs data mining algorithms. The result is that 64% of the total variation in O can be explained by the linear relationship between AIC and O, and the other 36% remains unexplained, in the case of the K-means clustering data mining algorithm. Similarly, 36% of the total variation in O can be explained and the other 64% remains unexplained in the case of the C4.5 data mining algorithm; 30% can be explained and 70% remains unexplained in the case of the Data Visualization data mining algorithm; 36% can be explained and 64% remains unexplained in the case of the K-Nearest Neighbour (K-NN) data mining algorithm; 49% can be explained and 51% remains unexplained in the case of the Self-organizing Map (SOM) data mining algorithm; and, finally, 36% can be explained and 64% remains unexplained in the case of the Neural Networks (NNs) data mining algorithm. It is obvious from the table that AIC and O both depend on the given dataset: if the parameters of the dataset are changed, the values of AIC and O will also change. It is also obvious that a strong relationship exists between AIC and O.

Case 3: Mapping of the value of Model Selection Criterion AIC with the Complexities O of Data Mining Algorithms
In this case we present different data mining algorithms for clustering, classification and visualization and then select the appropriate algorithm for different datasets using the value of AIC.

Table 15 The Clustering Algorithms
Dataset        AIC     K-means  K-NN   SOM
Iris           26.8    19.4     19.1   27.2
Diabetes       21.6    23.6     25.6   39.5
Breastcancer   24.6    20.1     22.4   32.5
DNA            922.5   26.8     36.9   45.4
Cars           1054.8  25.2     23.6   39.1

Table 15 shows the value of AIC of the different datasets and the value of the log of O of clustering data mining algorithms such as K-means, K-NN and SOM.
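The mapping can be expressed as a small R rule using the Table 15 values: for each dataset, pick the clustering algorithm whose log complexity lies closest to the dataset's AIC. This "smallest absolute gap" rule is a hedged reading of the comparison made in the text:

    aic  <- c(Iris = 26.8, Diabetes = 21.6, Breastcancer = 24.6, DNA = 922.5, Cars = 1054.8)
    logO <- rbind(Kmeans = c(19.4, 23.6, 20.1, 26.8, 25.2),
                  KNN    = c(19.1, 25.6, 22.4, 36.9, 23.6),
                  SOM    = c(27.2, 39.5, 32.5, 45.4, 39.1))
    colnames(logO) <- names(aic)
    gap <- abs(sweep(logO, 2, aic, "-"))                      # |AIC - log O| for every pair
    apply(gap, 2, function(g) rownames(logO)[which.min(g)])   # best-matching algorithm per dataset

This picks SOM for Iris, K-means for Diabetes and K-NN for Breastcancer, in line with the discussion of fig. 11 below; for DNA and Cars every gap is very large, signalling over-fitted datasets.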
Fig. 11 further illustrates which algorithm is appropriate for the given datasets using the value of AIC of these datasets.

Fig. 11 The Clustering Algorithms for Datasets

The explanation of fig. 11 is as follows: the difference between the value of AIC of the Iris dataset and the value of the log of O of SOM is less than for the K-means and K-NN clustering algorithms; therefore, SOM is the appropriate clustering algorithm for the Iris dataset. The difference between the value of AIC of the Diabetes dataset and the value of the log of O of K-means is less than for the K-NN and SOM clustering algorithms; therefore, K-means is the appropriate clustering algorithm for the Diabetes dataset. Similarly, the difference between the value of AIC of the Breastcancer dataset and the value of the log of O of K-NN is less than for the K-means and SOM clustering algorithms; therefore, K-NN is the appropriate clustering algorithm for the Breastcancer dataset. The difference between the value of AIC of the datasets DNA and Cars and the value of the log of O of the clustering algorithms is very large; therefore, none of these clustering algorithms is suitable for these two datasets. We can say that the datasets DNA and Cars are over-fitted and need cleansing.

Table 16 The Classification Algorithms
Dataset        AIC     C4.5   NNs
Iris           26.8    16.8   38.5
Diabetes       21.6    22.4   51.3
Breastcancer   24.6    18.4   42.6
DNA            922.5   29.4   62.3
Cars           1054.8  20.5   46.4

Table 16 shows the value of AIC of the different datasets and the value of the log of O of classification data mining algorithms such as C4.5 and NNs. Fig. 12 further illustrates which algorithm is appropriate for the given datasets using the value of AIC of these datasets.
Fig. 12 The Classification Algorithms for Datasets

The explanation of fig. 12 is as follows: the difference between the values of AIC of the datasets Iris, Diabetes and Breastcancer and the value of the log of O of C4.5 is less than for the NNs classification algorithm; therefore, C4.5 is the appropriate classification algorithm for these datasets. The difference between the values of AIC of the datasets DNA and Cars and the values of the log of O of the classification algorithms is very large; therefore, none of these classification algorithms is suitable for these two datasets. We can say that the datasets DNA and Cars are over-fitted and need cleansing.

Table 17 The Data Visualization Algorithms
Dataset        AIC     Data Visualization  SOM
Iris           26.8    15.5                27.2
Diabetes       21.6    20.2                39.5
Breastcancer   24.6    16.7                32.5
DNA            922.5   22.9                45.4
Cars           1054.8  18.3                39.1

Table 17 shows the value of AIC of the different datasets and the value of the log of O of visualization data mining algorithms such as Data Visualization (2D graphs) and SOM. Fig. 13 further illustrates which algorithm is appropriate for the given datasets using the value of AIC of these datasets.
Fig. 13 The Data Visualization Algorithms for Datasets

The explanation of fig. 13 is as follows: the difference between the value of AIC of the Iris dataset and the value of the log of O of SOM is less than for the 2D-graphs data visualization algorithm; therefore, SOM is the appropriate visualization algorithm for the Iris dataset. The difference between the value of AIC of the Diabetes dataset and the value of the log of O of the 2D graphs is less than for the SOM visualization algorithm; therefore, 2D graphs is the appropriate visualization algorithm for the Diabetes dataset. For the Breastcancer dataset the difference between the value of AIC and the value of the log of O is the same for both visualization algorithms; therefore, both visualization algorithms are appropriate for the Breastcancer dataset. The difference between the value of AIC of the datasets DNA and Cars and the value of the log of O of the visualization algorithms is very large; therefore, none of these visualization algorithms is suitable for these two datasets. We can say that the datasets DNA and Cars are over-fitted and need cleansing.

6 Conclusion
In this paper we presented the model selection criterion, the parameters governing the fitness of a dataset and the linear relation between the value of AIC and the log of O. We conclude that the number of parameters of a given dataset should be moderate to enable knowledge extraction. The AIC performed better than BIC for all of the sample sizes considered; therefore, we opted for AIC as the selection criterion for a given dataset. We mapped the values of the model selection criterion AIC of a dataset against the complexities of the data mining algorithms K-means, K-NN, SOM, C4.5, NNs and Data Visualization; this approach leads to the selection of the appropriate data mining algorithm(s) for a particular dataset. We tested the values of AIC for the selection of algorithms over five different datasets. We conclude that if the difference between the value of AIC and the complexity O of a data mining algorithm is small, then the algorithm(s) is (are) suitable for the dataset; otherwise the dataset requires cleansing, i.e. we need to reduce the number of parameters or the number of attributes. How far AIC and O can be correlated to determine the choice of the data mining algorithm is certainly an avenue for further research.

Acknowledgement
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.

References:
[1] URL: http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture08.pdf, 2011.
[2] Aldrich, John, R. A. Fisher and the making of maximum likelihood 1912-1922, Statistical
Science 12 (3): 162-176, doi:10.1214/ss/1030037906, MR1617519, 1997.
[3] Andersen, Erling B., Asymptotic Properties of Conditional Maximum Likelihood Estimators, Journal of the Royal Statistical Society B 32, 283-301, 1970.
[4] Andersen, Erling B., Discrete Statistical Models with Social Science Applications, North Holland, 1980.
[5] Basu, Debabrata, Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu; J. K. Ghosh, editor, Lecture Notes in Statistics Volume 45, Springer-Verlag, 1988.
[6] Le Cam, Lucien, Maximum likelihood: an introduction, ISI Review 58 (2): 153-171, 1990.
[7] Burnham, Kenneth P., Anderson, David R., Model Selection and Multi-model Inference: a Practical Information-theoretic Approach, 2nd edition, Springer, ISBN: 0-387-95364-7, 2002.
[8] Brockwell, P.J., and Davis, R.A., Time Series: Theory and Methods, 2nd ed., Springer, 2009.
[9] Akaike, Hirotugu, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (6): 716-723, doi:10.1109/TAC.1974.1100705, MR0423716, 1974.
[10] Weakliem, David L., A Critique of the Bayesian Information Criterion for Model Selection, University of Connecticut, Sociological Methods & Research, vol. 27 no. 3, 359-397, 1999.
[11] Cavanaugh, Joseph E., Statistics and Actuarial Science, The University of Iowa, URL: http://myweb.uiowa.edu/cavaaugh/ms_lec_6_ho.pdf, 2009.
[12] Liddle, A.R., Information criteria for astrophysical model selection, http://xxx.adelaide.edu.au/PS_cache/astro-ph/pdf/0701/0701113v2.pdf
[13] Ernest S. et al., How to be a Bayesian in SAS: Model Selection Uncertainty in PROC LOGISTIC and PROC GENMOD, Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA, 2010.
[14] In Jae Myung, Tutorial on maximum likelihood estimation, Department of Psychology, Ohio State University, USA, Journal of Mathematical Psychology 47 (2003), pp. 90-100.
[15] Isabelle Guyon, A practical guide to model selection, ClopiNet, Berkeley, CA 94708, USA, 2010.
[16] Vladimir Cherkassky, Comparison of Model Selection Methods for Regression, Dept. of Electrical & Computer Eng., University of Minnesota, 2010.
[17] Schwarz, G., Estimating the dimension of a model, Ann Stat 6: 461-464, 1978.
[18] Burnham, K.P., Anderson, D.R., Model Selection and Inference, Springer, 1998.
[19] Parzen, E., Tanabe, K., Kitagawa, G., Selected Papers of Hirotugu Akaike, Springer, 1998.
[20] Li, W., DNA segmentation as a model selection process, Proc. RECOMB'01, in press, 2001.
[21] Li, W., New criteria for segmenting DNA sequences, submitted, 2001.
[22] Li, W., Sherriff, A., Liu, X., Assessing risk factors of human complex diseases by Akaike and Bayesian information criteria (abstract), Am J Hum Genet 67(Suppl): S222, 2000.
[23] Li, W., Yang, Y., How many genes are needed for a discriminant microarray data analysis, Proc. CAMDA'00, in press, 2001.
[24] Li, W., Yang, Y., Edington, J., Haghighi, F., Determining the number of genes needed for cancer classification using microarray data, submitted, 2001.
[25] Wentian Li, Dale R. Nyholt, Marker Selection by AIC and BIC, Laboratory of Statistical Genetics, The Rockefeller University, New York, NY, 2010.
[26] Venables, W.N., Smith, D.M. et al., An Introduction to R, version 2.14.2, 2012.
[27] R Programming Language, http://CRAN.R-project.org, 2012.
[28] Stapleton, James H., Linear Statistical Models, John Wiley & Sons Inc., ISBN: 9780470231463, 2009.
[29] Faraway, Julian James, Linear Models in R, Chapman & Hall/CRC, ISBN: 9781584884255, 2007.
[30] Neter, John, Wasserman, William, Kutner, Michael H., Applied Linear Regression Models, R.D. Irwin, ISBN: 9780256070682, 2010.
[31] Draper, Norman Richard, Smith, Harry, Applied Regression Analysis, Wiley, ISBN: 97804771029953, 2007.
[32] Neter, John, Wasserman, William, Kutner, Michael H., Applied Linear Statistical Models: Regression, Analysis of Variance and Experimental Designs, R.D. Irwin, ISBN: 9780256024470, 2010.
[33] Monahan, John F., A Primer on Linear Models, Chapman & Hall/CRC, 1st Edition, ISBN: 9781420062014, 2008.
[34] Bingham, N.H., Fry, John M., Regression: Linear Models in Statistics, Springer, 1st Edition, ISBN: 9781848829688, 2010.
[35] Rencher, Alvin C., Schaalje, G. Bruce, Linear Models in Statistics, Wiley-Interscience, 2nd
Edition, ISBN: 9780471754985, 2008.