
Assignment 1

Introduction
In this assignment we were given a data set that includes both a training set and a test set. The main purpose of the
assignment is to predict the Class value for the test set using the k-nearest neighbors (KNN) method, with the model trained on the
provided training set.

Data Analysis:
Analyzing the data before building a model on it helps improve the accuracy of the model that is built. The provided data
consists of 10 variables. A statistical analysis of the data was carried out as the first stage, and the results are given
below.

Statistical Analysis of Data

Data Set Analysis


The provided data set was analyzed at an early stage to get basic insight into the data before building the statistical model,
and the conclusions below were drawn.
Dataset info
Number of variables      10
Number of observations   204
Total Missing (%)        0.00%

Variables types
Numeric                  10
Categorical              0
Boolean                  0
Date                     0

Descriptive Statistics
A descriptive analysis of the data was carried out. The resulting quantitative descriptions of the variables are provided
below.

Correlation Analysis
Predictions can be made more accurate by removing redundant variables. A good way to find them is to inspect the
correlation matrix, identify the variables that are highly correlated with each other, and remove one variable from each
such pair, which can improve accuracy.

Here, instead of checking a single correlation measure, we checked both the Pearson and the Spearman correlation
matrices.

Clearly, there is a high chance that Ca and RI are redundant, with a correlation value of 0.8108. So, we can drop one of
the two variables.
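
The correlation check described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the assignment: the helper names (pearson, ranks, spearman, highly_correlated), the 0.8 threshold, and the toy column values are all assumptions made for the example. Only the variable names RI and Ca come from the report; Na is a hypothetical third column.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def highly_correlated(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |Pearson r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Toy values invented for illustration; they are NOT the assignment data.
columns = {
    "RI": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Ca": [2.0, 4.0, 6.0, 8.0, 10.5],
    "Na": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = highly_correlated(columns)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.4f} -> candidate to drop one of them")
```

Under these assumptions, any pair reported by highly_correlated is a candidate for dropping one of its two variables. Which one to keep is a separate judgment call.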

Methodology used:
1. We used the Euclidean distance formula to find the nearest neighbors of each data point.
2. Cross validation is used to find the optimum value of K, i.e. the value for which the result is most accurate.
Cross validation result file: result.csv
Correlation matrix for cross validation: corrfile.csv
3. To do this, we split the training data set into two sets, train and validate.
4. We predicted the value of the Class column for the validate set, where the true value is already known.
5. We tried values of K from 1 to 20.
6. For each K from 1 to 20, the correlation between the predicted values and the known expected values is computed.
7. The K with the maximum correlation value is selected as the optimum value of K.
8. Below are the files that we created while working on the provided data set.
a. Result set from different K values.
b. Correlation matrix

9. Once the value of K is decided, the model is built on the provided training data and the result is predicted for the test data.
10. The result, with probability, for the given data set is given below for K=2.
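
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the assignment: the function names (euclidean, knn_predict, pearson, best_k), the tie-breaking rules, and the toy points are all hypothetical. It assumes numeric class labels, so that the correlation between predicted and expected values is defined.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Majority class among the k training points closest to query."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: euclidean(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    # Assumed tie-break: among tied classes, take the closest neighbor's class
    for i in nearest:
        if votes[train_y[i]] == top:
            return train_y[i]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def best_k(train_X, train_y, val_X, val_y, k_values=range(1, 21)):
    """Score each K by correlation of predictions with known classes."""
    scores = {}
    for k in k_values:
        if k > len(train_X):
            break  # cannot ask for more neighbors than training points
        preds = [knn_predict(train_X, train_y, q, k) for q in val_X]
        scores[k] = pearson(preds, val_y)
    # max() keeps the first maximum, so ties go to the smallest K
    return max(scores, key=scores.get), scores

# Toy train/validate split invented for illustration; NOT the assignment data.
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = [1, 1, 1, 1, 2, 2, 2, 2]
val_X = [(0.5, 0.5), (5.5, 5.5), (1, 0.5), (6, 5.5)]
val_y = [1, 2, 1, 2]

k_opt, scores = best_k(train_X, train_y, val_X, val_y)
print("selected K:", k_opt)
```

Once a K has been selected this way, knn_predict can be called with the full training set and that K to label each test point. On real data several K values may score the same, in which case this sketch simply keeps the smallest one.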
