
MACHINE LEARNING

• Machine learning is an application of artificial intelligence (AI) that
provides systems the ability to learn and improve automatically from
experience without being explicitly programmed. Machine learning
focuses on the development of computer programs that can access data and
use it to learn for themselves.

• The process of learning begins with observations or data, such as examples,
direct experience, or instruction, in order to look for patterns in the data
and make better decisions in the future based on the examples that humans
provide. The primary aim is to allow computers to learn automatically,
without human intervention or assistance, and adjust their actions accordingly.

Steps in Machine Learning are as follows:

1. Data Cleaning: treating the data set for missing values and normalizing the
data set
2. Data Partition: dividing the data set into training and testing subsets
(generally 80:20)
3. Training/Model Building: using various supervised and unsupervised models
for training
4. Testing/Validation: validating the model using the confusion matrix
Different Models:

1. Supervised Learning: Classification
i. KNN (k-Nearest Neighbors)
ii. NB (Naïve Bayes)
iii. GLM (Generalized Linear Model / Logistic Regression)
2. Unsupervised Learning: Clustering (k-means)

Significance of the techniques used:

KNN Model:

The k-Nearest Neighbors (kNN) algorithm is one of the simplest, non-parametric
(meaning it makes no assumptions about the underlying data distribution), lazy
classification algorithms. The main use of this model is to divide the data
points of a data set into several classes in order to predict the class of new
sample points. kNN could, and probably should, be one of the first choices for
a classification study when there is little or no prior knowledge about the
data distribution.
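As an illustration of the idea only, here is a minimal sketch of kNN
classification using the class package (an assumption on my part; the actual
model in this report is fitted later through caret). New points are assigned
the majority class among their k = 3 nearest neighbours:

# Toy example: classify new points by majority vote among k = 3 neighbours
library(class)
train_x = data.frame(x1 = c(1, 2, 3, 6, 7, 8), x2 = c(1, 1, 2, 6, 7, 8))
train_y = factor(c("front", "front", "front", "full", "full", "full"))
test_x  = data.frame(x1 = c(2.5, 7.5), x2 = c(1.5, 7.5))
knn(train = train_x, test = test_x, cl = train_y, k = 3)   # -> front, full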

NB Model:

The Naïve Bayes classifier applies Bayes' theorem with the "naïve" assumption
that the features are conditionally independent given the class. It produces
results that are explainable, comparable, defensible and usable, and it is
backed by statistical theory that is strong and very well understood. Because
it is simple and fast to train, it is a good early candidate in a
classification study.
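A minimal sketch of fitting a Naïve Bayes classifier directly with the klaR
package (the same implementation that caret's method = "nb" wraps), shown on
R's built-in iris data purely for illustration:

# Naive Bayes with klaR on the built-in iris data
library(klaR)
fit  = NaiveBayes(Species ~ ., data = iris)   # class-conditional densities per feature
pred = predict(fit, newdata = iris)           # returns $class and $posterior
table(pred$class, iris$Species)               # confusion table on the training data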

GLM Model:

The generalized linear model (GLM) is a flexible generalization of ordinary
linear regression that allows response variables to have error distributions
other than the normal distribution. The choice of link function is separate
from the choice of the random component, which gives more flexibility in
modeling. With a binomial family and logit link, the GLM becomes logistic
regression, which is what applies to the two-class target used in this report.
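A minimal sketch of logistic regression with base R's glm() (caret's
method = "glm" fits this same kind of model for a two-level factor target),
using the built-in mtcars data purely for illustration:

# Logistic regression: binomial family with the default logit link
mtcars$am = factor(mtcars$am, labels = c("auto", "manual"))
fit = glm(am ~ mpg + wt, data = mtcars, family = binomial)
predict(fit, newdata = mtcars, type = "response")[1:5]   # predicted P(manual)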
DATA SOURCE & INFORMATION:

Car sales in Ukraine

This data set contains records of cars offered for sale in Ukraine. Most of
them are used cars, which opens the possibility of analysing features related
to car operation.

Content: the data set contains 299 rows and 9 variables with the following
meanings:

• car: manufacturer brand


• price: seller’s price in advertisement (in USD)
• body: car body type
• mileage: as mentioned in advertisement
• engV: rounded engine volume
• engType: type of fuel
• registration: whether the car is registered in Ukraine or not
• year: year of production
• drive: drive type

The data was taken from the following link:

https://www.kaggle.com/antfarol/car-sale-advertisements/version/1

Attribute Information:

A data frame with 299 observations of 9 variables: 2 integer variables, 2
numeric variables and 5 factor variables, one of which (drive) is the target
class.
str(rakesh)
'data.frame': 299 obs. of 9 variables:
$ car : Factor w/ 28 levels "Alfa Romeo","Audi",..: 7 15 17 8 21
17 3 15 17 12 ...
$ price : num 15500 17800 16600 6500 10500 ...
$ body : Factor w/ 6 levels "crossover","hatch",..: 1 6 1 4 5 1 4
1 1 1 ...
$ mileage : int 68 162 83 199 185 2 2 0 83 0 ...
$ engV : num 2.5 1.8 2 2 1.5 1.2 5 3 2 4.4 ...
$ engType : Factor w/ 4 levels "Diesel","Gas",..: 2 1 4 4 1 4 4 4 4 1
...
$ registration: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ year : int 2010 2012 2013 2003 2011 2016 2016 2016 2013 2016
...
$ drive : Factor w/ 2 levels "front","full": 2 1 2 1 1 1 2 2 2 2
...
On the basis of these features the target column is determined. In this data
set, drive is the target column to be predicted. There are 299 instances and 9
attributes.

HEAD AND TAIL OF THE DATA SET

> head(rakesh)
car price body mileage engV engType
1 Ford 15500.00 crossover 68 2.5 Gas
2 Mercedes-Benz 17800.00 van 162 1.8 Diesel
3 Nissan 16600.00 crossover 83 2.0 Petrol
4 Honda 6500.00 sedan 199 2.0 Petrol
5 Renault 10500.00 vagon 185 1.5 Diesel
6 Nissan 20447.15 crossover 2 1.2 Petrol
registration year drive
1 yes 2010 full
2 yes 2012 front
3 yes 2013 full
4 yes 2003 front
5 yes 2011 front
6 yes 2016 front

> tail(rakesh)
car price body mileage engV engType registration
294 Renault 12499 van 78 2.0 Diesel yes
295 Nissan 15000 crossover 46 1.6 Petrol yes
296 Toyota 37700 crossover 47 5.7 Gas yes
297 Suzuki 9650 crossover 81 2.0 Petrol yes
298 Volkswagen 7000 sedan 185 2.8 Gas yes
299 Honda 13200 sedan 100 1.8 Petrol yes
year drive
294 2012 front
295 2013 front
296 2009 full
297 2006 full
298 2004 full
299 2012 front


SYNTAX/CODE:
DATA:
1. Check the working directory:
> getwd()
2. Load the packages:
library(caret)     # partitioning, train() and confusionMatrix()
library(lattice)   # plotting backend used by caret
library(ggplot2)   # plotting backend used by caret
library(klaR)      # Naive Bayes implementation behind method = "nb"

3. Import the file:

rakesh = read.csv(file.choose(), header = T)   # pick the CSV file interactively
str(rakesh)
DATA PRE-PROCESSING:
This step is used to find missing values and either impute or omit them,
making the data more concise.
1. Find missing values:
> sum(is.na(rakesh))
[1] 0

Since there are no missing values, no omission or imputation is needed (a
sketch of what that would look like follows below), and the data is ready for
the next step.
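For completeness, a hedged sketch of two common treatments that would apply
had any missing values been found (neither is needed here, since
sum(is.na(rakesh)) is 0):

# Option 1: drop every row that contains an NA
rakesh = na.omit(rakesh)
# Option 2: impute a numeric column with its median (mileage as an example)
rakesh$mileage[is.na(rakesh$mileage)] = median(rakesh$mileage, na.rm = TRUE)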

DATA PARTITIONING:
In this step the data is divided into two parts, i.e., training and testing.
The ratio can vary; generally training takes 70%-80% and testing 20%-30%.

I used 80% for training and the remaining 20% for testing.

intrain = createDataPartition(y = rakesh$drive, p = 0.80, list = F)  # stratified 80/20 split on the target

training = rakesh[intrain, ]
testing = rakesh[-intrain, ]
dim(training)
dim(testing)

> dim(training)
[1] 240 9
> dim(testing)
[1] 59 9
Model Building:

• For the KNN model:

modelfit1 = train(drive ~ ., data = training, method = "knn")

• For the NB model:

modelfit2 = train(drive ~ ., data = training, method = "nb")

• For the GLM model:

modelfit3 = train(drive ~ ., data = training, method = "glm")
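By default, caret's train() tunes and evaluates each model with bootstrap
resampling. As an optional refinement (not something the results below depend
on), the resampling scheme could be switched to repeated 10-fold
cross-validation; a hedged sketch:

# Optional: repeated 10-fold cross-validation instead of the default bootstrap
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
modelfit1cv = train(drive ~ ., data = training, method = "knn", trControl = ctrl)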

MODEL VALIDATION:

Predictions are made on the testing set and validated against the actual
classes using the confusion matrix.

For the KNN model:

> predictions1 = predict(modelfit1, newdata = testing)

> confusionMatrix(predictions1, testing$drive)

Reference
Prediction front full
front 19 5
full 7 28

Accuracy : 0.7966
95% CI : (0.6717, 0.8902)
No Information Rate : 0.5593
P-Value [Acc > NIR] : 0.0001191

Kappa : 0.584

Mcnemar's Test P-Value : 0.7728300

Sensitivity : 0.7308
Specificity : 0.8485
Pos Pred Value : 0.7917
Neg Pred Value : 0.8000
Prevalence : 0.4407
Detection Rate : 0.3220
Detection Prevalence : 0.4068
Balanced Accuracy : 0.7896
'Positive' Class : front

For the NB model:
> predictions2 = predict(modelfit2, newdata = testing)
> confusionMatrix(predictions2, testing$drive)
Reference
Prediction front full
front 23 5
full 3 28

Accuracy : 0.8644
95% CI : (0.7502, 0.9396)
No Information Rate : 0.5593
P-Value [Acc > NIR] : 5.256e-07
Kappa : 0.7272

Mcnemar's Test P-Value : 0.7237

Sensitivity : 0.8846
Specificity : 0.8485
Pos Pred Value : 0.8214
Neg Pred Value : 0.9032
Prevalence : 0.4407
Detection Rate : 0.3898
Detection Prevalence : 0.4746
Balanced Accuracy : 0.8666

'Positive' Class : front

For the GLM model:

> predictions3 = predict(modelfit3, newdata = testing)
> confusionMatrix(predictions3, testing$drive)
Reference
Prediction front full
front 23 4
full 3 29

Accuracy : 0.8814
95% CI : (0.7707, 0.9509)
No Information Rate : 0.5593
P-Value [Acc > NIR] : 9.938e-08
Kappa : 0.7603

Mcnemar's Test P-Value : 1

Sensitivity : 0.8846
Specificity : 0.8788
Pos Pred Value : 0.8519
Neg Pred Value : 0.9062
Prevalence : 0.4407
Detection Rate : 0.3898
Detection Prevalence : 0.4576
Balanced Accuracy : 0.8817

'Positive' Class : front
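Besides inspecting the three confusion matrices one by one, caret can also
compare fitted models side by side on their resampling results. A hedged
sketch, assuming all three models share the same resampling profile (caret's
default of 25 bootstrap resamples here):

# Compare the three fitted models on their resampling results
results = resamples(list(KNN = modelfit1, NB = modelfit2, GLM = modelfit3))
summary(results)   # accuracy and kappa distributions per model
bwplot(results)    # lattice box-and-whisker comparison plot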


ANALYSIS TO FIND WHICH MODEL IS BEST?

1. Kappa value: The kappa statistic is frequently used to test interrater
reliability. The importance of rater reliability lies in the fact that it
represents the extent to which the data collected in the study are correct
representations of the variables measured. In classification terms, kappa
compares the observed accuracy with the accuracy expected by chance alone.
• The kappa values for the three models are listed below (a worked check of
the GLM value follows):
1. KNN model: kappa = 0.584
2. NB model: kappa = 0.7272
3. GLM model: kappa = 0.7603
• The kappa value is highest for the GLM model, which suggests that its
predictions agree with the true classes well beyond chance level.
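As a sanity check, the GLM kappa can be recomputed by hand from its confusion
matrix above, as the gap between observed accuracy and chance agreement:

# Recompute kappa from the GLM confusion matrix (rows = predicted, cols = reference)
cm = matrix(c(23, 3, 4, 29), nrow = 2)
n  = sum(cm)                                # 59 test cases
po = sum(diag(cm)) / n                      # observed accuracy = 52/59 = 0.8814
pe = sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement = 0.5050
(po - pe) / (1 - pe)                        # kappa = 0.7603, matching the output above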

2. Accuracy: it refers to how close the predictions are to the correct values,
i.e. the proportion of all test cases classified correctly.
• The accuracy of the three models is listed below:
1. KNN model: accuracy = 79.66%
2. NB model: accuracy = 86.44%
3. GLM model: accuracy = 88.14%
• Of all the models used in this project, "glm" gave the maximum accuracy of
88.14% and produced the fewest false predictions compared to the other
methods.

3. Sensitivity: the sensitivity signifies the proportion of positive results
out of the number of samples which were actually positive. When there are no
positive samples, sensitivity is not defined and a value of NA is returned.
• The sensitivity of the three models is shown below (a check against the GLM
confusion matrix follows):
1. KNN model: sensitivity = 0.7308
2. NB model: sensitivity = 0.8846
3. GLM model: sensitivity = 0.8846
• The GLM and NB models give the maximum sensitivity, meaning they recover
the largest share of the samples that were actually positive.
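Again as a check, the GLM sensitivity (and its counterpart, specificity) can
be read directly off its confusion matrix, with "front" as the positive class:

# Sensitivity and specificity from the GLM confusion matrix ("front" = positive)
TP = 23; FN = 3   # actual front cases: correctly / incorrectly classified
TN = 29; FP = 4   # actual full cases:  correctly / incorrectly classified
TP / (TP + FN)    # sensitivity = 23/26 = 0.8846
TN / (TN + FP)    # specificity = 29/33 = 0.8788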

Hence, from the three parameters above, the GLM model is the best one, as it
has a better accuracy, sensitivity and kappa value than the other two models.
PRACTICAL UTILITY OF THE MODEL:

GLM (Generalized Linear Model, here logistic regression) is used to predict a
discrete outcome based on variables which may be discrete, continuous or
mixed. The outcome could be in the form of Yes/No, 1/0, True/False or
High/Low, given a set of independent variables.

Applications of the model include:

1. Marketing: a marketing consultant wants to predict whether the subsidiary
of his company will make a profit, a loss or just break even, depending on the
characteristics of the subsidiary's operations.
2. Human Resources: the HR manager of a company wants to predict the
absenteeism pattern of his employees based on their individual characteristics.
3. Finance: a bank wants to predict whether its customers will default, based
on their previous transactions and history.
4. Education: we can predict the chance of admission of a student based on
various parameters like CGPA, GRE score, etc.
5. Health Care: a hospital wants to predict, based on a survey, whether
patients have cancer or not, given their symptoms and conditions.
6. Manufacturing: a manufacturer wants to predict the acceptance or rejection
of a machined part based on its tolerance limit, by a 'Go/No-Go' check defined
by the dimensions of the part.
