
IDS 572 Data Mining

Assignment 5

Question 1:

Question 1a:
Since each point is considered to be its own nearest neighbor, when k = 1 every point is classified by its own label, so the training error will be zero.
The training error is therefore at its minimum when k = 1.

Question 1b:
When k is too small, the model may overfit.
This produces good results on the training data (the training error will be zero, or close to
zero, especially since each point counts itself among its nearest neighbors), but due to
overfitting it will not perform well on the test data.

When k is too large, the model might be naïve.


This method ends up misclassifying a lot of data points. For large values of k (in this case
k = 13 or 14), the model simply predicts whichever class (positive or negative) occurs more
frequently overall, so every point ends up being predicted positive, or every point ends up
being predicted negative. Hence, this method isn't good either.
We therefore use cross-validation to figure out the value of k that produces a model
which is neither naïve nor overfitted. We base the choice on the average error
produced.
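Both extremes can be illustrated with a small Python sketch. The points and labels below are a made-up toy set (not the assignment's data): with each point counted as its own neighbor, k = 1 gives zero training error, while k = n predicts the overall majority class for every point.

```python
from collections import Counter
import math

def knn_predict(points, labels, query, k):
    """Majority vote among the k nearest points (a training point
    counts as its own neighbor when it is also the query)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# Hypothetical toy data: 2 positives and 3 negatives.
pts = [(1, 1), (2, 1), (6, 5), (7, 6), (8, 5)]
labs = ["P", "P", "N", "N", "N"]

# k = 1: each point is its own nearest neighbor, so training error is 0.
train_err_k1 = sum(knn_predict(pts, labs, p, 1) != y
                   for p, y in zip(pts, labs)) / len(pts)

# k = n: every prediction is just the overall majority class ("N" here).
preds_kn = [knn_predict(pts, labs, p, len(pts)) for p in pts]
```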

Question 1c:
Consider the case where k is an even number. Given that a point can be its own nearest
neighbor, the distribution of positive and negative values among the k nearest neighbors
will often be even (the same number of positive and negative points), so the predicted
value of the target variable becomes a coin toss. In the table below we recorded the value
that happens to match the actual one, but the prediction could just as easily have gone the
other way. Hence, when k is even, the KNN predictions become ambiguous, and we do not
consider even k-values as candidates for the optimal k.
Leaving (5,1) out, we take the rest of the data set into consideration and get the following
results. The results would be the same whichever point we leave out, since the total
numbers of positive and negative cases stay the same; the average over all leave-one-out
runs is therefore the same as each individual result.
The given points:
Point X Y
1 1 5
2 2 6
3 2 7
4 5 1
5 6 2
6 7 2
7 7 3
8 8 4
9 8 3
10 9 5
11 3 7
12 3 8
13 4 8
14 5 9

Using the naming convention above (point number 1 for (1,5), and so on) and calculating
the Euclidean distances between the points in Excel, we get the following Euclidean
distance matrix:
Point 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0.00 1.41 2.24 5.66 5.83 6.71 6.32 7.07 7.28 8.00 2.83 3.61 4.24 5.66
2 1.41 0.00 1.00 5.83 5.66 6.40 5.83 6.32 6.71 7.07 1.41 2.24 2.83 4.24
3 2.24 1.00 0.00 6.71 6.40 7.07 6.40 6.71 7.21 7.28 1.00 1.41 2.24 3.61
4 5.66 5.83 6.71 0.00 1.41 2.24 2.83 4.24 3.61 5.66 6.32 7.28 7.07 8.00
5 5.83 5.66 6.40 1.41 0.00 1.00 1.41 2.83 2.24 4.24 5.83 6.71 6.32 7.07
6 6.71 6.40 7.07 2.24 1.00 0.00 1.00 2.24 1.41 3.61 6.40 7.21 6.71 7.28
7 6.32 5.83 6.40 2.83 1.41 1.00 0.00 1.41 1.00 2.83 5.66 6.40 5.83 6.32
8 7.07 6.32 6.71 4.24 2.83 2.24 1.41 0.00 1.00 1.41 5.83 6.40 5.66 5.83
9 7.28 6.71 7.21 3.61 2.24 1.41 1.00 1.00 0.00 2.24 6.40 7.07 6.40 6.71
10 8.00 7.07 7.28 5.66 4.24 3.61 2.83 1.41 2.24 0.00 6.32 6.71 5.83 5.66
11 2.83 1.41 1.00 6.32 5.83 6.40 5.66 5.83 6.40 6.32 0.00 1.00 1.41 2.83
12 3.61 2.24 1.41 7.28 6.71 7.21 6.40 6.40 7.07 6.71 1.00 0.00 1.00 2.24
13 4.24 2.83 2.24 7.07 6.32 6.71 5.83 5.66 6.40 5.83 1.41 1.00 0.00 1.41
14 5.66 4.24 3.61 8.00 7.07 7.28 6.32 5.83 6.71 5.66 2.83 2.24 1.41 0.00
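The same matrix can be reproduced programmatically. A minimal Python sketch, with the points listed in the same order as the table:

```python
import math

# The 14 assignment points, in the same order as the table above.
points = [(1, 5), (2, 6), (2, 7), (5, 1), (6, 2), (7, 2), (7, 3),
          (8, 4), (8, 3), (9, 5), (3, 7), (3, 8), (4, 8), (5, 9)]

# Pairwise Euclidean distance matrix, rounded to two decimals as in the table.
dist = [[round(math.dist(p, q), 2) for q in points] for p in points]
```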

Using the distances from the table above, we get the following prediction results for
different k-values, leaving one point out:
LOOP k value (6,2) (7,2) (7,3) (8,3) (8,4) (9,5) (1,5) (2,6) (2,7) (3,7) (3,8) (4,8) (5,9) Error Rate
Actual Value P N P N P P N N P N P N N
(5,1) k=2 P N P P P P N N P N P N N ambiguous
(5,1) k=3 P P N P P P N N N P N N N 6/13
(5,1) k=4 P P P P P P N N N N N N N 4/13 - ambiguous
(5,1) k=5 P P P P P P N N N N N N N 4/13
(5,1) k=6 P P P P P P N N N N N N N 4/13 - ambiguous
(5,1) k=7 P P P P P P N N N N N N N 4/13
(5,1) k=8 P N P N P P N N N N N N N 2/13 - ambiguous
(5,1) k=9 N N N N N N N N N N N N N 6/13
(5,1) k=10 N N N N N N N N P N P N N 4/13 - ambiguous
(5,1) k=11 P P P P P N N N N P P N N 5/13
(5,1) k=12 P N P N P P N N N N N N N 2/13 - ambiguous
(5,1) k=13 N N N N N N N N N N N N N 6/13

LOOP = leave-one-out point, which is (5,1) in this case.


We already concluded that we don't consider the ambiguous results produced by even
values of k. Hence, from the table above we surmise that only the values k = 5 and k = 7
give us concrete lowest error rates, of 4/13 each.
Hence, the optimal k-value will be k = 5 or k = 7.
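As a check, the leave-one-out predictions for the odd k-values can be reproduced with a short Python sketch. The labels are taken from the Actual Value row of the table above; ties among equidistant neighbors are broken arbitrarily by the sort, so individual entries could in principle differ from the hand calculation.

```python
from collections import Counter
import math

points = [(1, 5), (2, 6), (2, 7), (5, 1), (6, 2), (7, 2), (7, 3),
          (8, 4), (8, 3), (9, 5), (3, 7), (3, 8), (4, 8), (5, 9)]
# Labels from the Actual Value row of the table above.
labels = {(6, 2): "P", (7, 2): "N", (7, 3): "P", (8, 3): "N", (8, 4): "P",
          (9, 5): "P", (1, 5): "N", (2, 6): "N", (2, 7): "P", (3, 7): "N",
          (3, 8): "P", (4, 8): "N", (5, 9): "N"}

rest = [p for p in points if p != (5, 1)]  # leave (5, 1) out

def predict(query, k):
    # a point counts as its own neighbor, as in the hand calculation
    nearest = sorted(rest, key=lambda p: math.dist(query, p))[:k]
    return Counter(labels[p] for p in nearest).most_common(1)[0][0]

# Error rate over the 13 remaining points, odd k only (even k can tie).
errors = {k: sum(predict(p, k) != labels[p] for p in rest) / len(rest)
          for k in (3, 5, 7, 9, 11, 13)}
```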

Question 2:
The given points:

Point X Y
1 8 4
2 3 3
3 4 5
4 0 1
5 10 2
6 3 7
7 0 9
8 8 1
9 4 3
10 9 4

Question 2a:
Initial centroids: (1,1) and (8,8). Calculating the Manhattan (city-block) distance from each
point to the two centroids and assigning each point to the nearer centroid, we have:
Point X Y Distance Distance Cluster
from C1 from C2 Assigned
1 8 4 10 4 2
2 3 3 4 10 1
3 4 5 7 7 1
4 0 1 1 15 1
5 10 2 10 8 2
6 3 7 8 6 2
7 0 9 9 9 1
8 8 1 7 7 1
9 4 3 5 9 1
10 9 4 11 5 2

Here we see that cluster 1 consists of {(3,3), (4,5), (0,1), (0,9), (8,1), (4,3)}.
Cluster 2 contains: {(8,4), (10,2), (3,7), (9,4)}
Computing the means of the points for each cluster, we get the new centroids:
Iteration 1 means X Y
C1 3.17 3.67
C2 7.5 4.25

So, the new centroids are (3.17, 3.67) and (7.5, 4.25)


Now calculating the new distances and assigning clusters, we have:
Point X Y Distance Distance Cluster
from C1 from C2 Assigned
1 8 4 5.17 0.75 2
2 3 3 0.83 5.75 1
3 4 5 2.17 4.25 1
4 0 1 5.83 10.75 1
5 10 2 8.50 4.75 2
6 3 7 3.50 7.25 1
7 0 9 8.50 12.25 1
8 8 1 7.50 3.75 2
9 4 3 1.50 4.75 1
10 9 4 6.17 1.75 2

Here we see that cluster 1 consists of {(3,3), (4,5), (0,1), (3,7), (0,9), (4,3)}.
Cluster 2 contains: {(8,4), (10,2), (8,1), (9,4)}
Computing the means of the points for each cluster, we get the new centroids:
Iteration 2 means X Y
C1 2.33 4.67
C2 8.75 2.75

So, after two iterations the centroids are (2.33, 4.67) and (8.75, 2.75).
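The two iterations can be checked with a minimal Python sketch. The distance columns in the tables above are consistent with the Manhattan (city-block) distance, which is what the sketch assumes; ties are assigned to cluster 1, as in the tables.

```python
# The ten points from Question 2 and the given initial centroids.
points = [(8, 4), (3, 3), (4, 5), (0, 1), (10, 2), (3, 7),
          (0, 9), (8, 1), (4, 3), (9, 4)]
centroids = [(1, 1), (8, 8)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

for _ in range(2):  # two assignment/update passes
    clusters = [[], []]
    for p in points:
        # min() keeps the first index on a tie, so ties go to cluster 1
        j = min(range(2), key=lambda i: manhattan(p, centroids[i]))
        clusters[j].append(p)
    # new centroid = mean of the points assigned to the cluster
    centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                 for c in clusters]
```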

Question 2b:
Calculating the Manhattan distance from each point to the query point (5,3):
Point X Y Distance from (5,3) Output
1 8 4 4 Yes
2 3 3 2 Yes
3 4 5 3 No
4 0 1 7 No
5 10 2 6 Yes
6 3 7 6 Yes
7 0 9 11 No
8 8 1 5 No
9 4 3 1 Yes
10 9 4 5 Yes

Based on the distances from the points to (5,3), we see that the points (4,3), (3,3), and (4,5)
are its three nearest neighbors:
Point X Y Distance from (5,3) Output
9 4 3 1 Yes
2 3 3 2 Yes
3 4 5 3 No

Using the k-nearest-neighbor approach, with all neighbors weighted equally and coding
Yes as 1 and No as 0, we get:
(1 + 1 + 0)/3 = 2/3, which is more than 1/2 and therefore means Yes.
Hence, the predicted output for the point (5,3) is Yes.
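The same prediction can be reproduced in a few lines of Python, again assuming the Manhattan distance, which matches the distance column above:

```python
from collections import Counter

# The ten labeled points from the table above (label = Output column).
data = [((8, 4), "Yes"), ((3, 3), "Yes"), ((4, 5), "No"), ((0, 1), "No"),
        ((10, 2), "Yes"), ((3, 7), "Yes"), ((0, 9), "No"), ((8, 1), "No"),
        ((4, 3), "Yes"), ((9, 4), "Yes")]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

query = (5, 3)
# three nearest neighbors by Manhattan distance, then an unweighted vote
nearest = sorted(data, key=lambda d: manhattan(d[0], query))[:3]
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
```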
