Assignment 5
Question 1:
Question 1a:
Since each point is considered to be its own nearest neighbor, when k = 1 each training point's single nearest neighbor is the point itself, so every training point is classified by its own label and the training error is zero.
The training error is therefore at its minimum when k = 1.
Question 1b:
When k is too small, the model tends to overfit.
It produces good results on the training data (the training error will be zero or close to zero, especially since each point counts itself among its nearest neighbors), but because of the overfitting it will not perform well on the test data.
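A quick check of this claim: the following minimal Python sketch (on a small hypothetical data set, not the one from this problem) scores a 1-NN classifier on its own training data. Every point's nearest neighbor, at distance zero, is itself, so the training error comes out as zero.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k):
    """Majority vote over the k nearest training points to x (binary 0/1 labels)."""
    dists = np.linalg.norm(train_X - x, axis=1)      # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = train_y[nearest]
    return 1 if votes.sum() * 2 > len(votes) else 0  # ties (possible for even k) fall back to 0

# Hypothetical toy data, for illustration only
X = np.array([[1, 5], [2, 6], [5, 1], [6, 2]])
y = np.array([1, 1, 0, 0])

# Scoring the training set against itself: each point is its own nearest neighbor
train_errors = sum(knn_predict(X, y, X[i], k=1) != y[i] for i in range(len(X)))
print(train_errors)  # 0, i.e. zero training error at k = 1
```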
Question 1c:
The given data set is linearly separable. When k is even, and given that a point can be its own nearest neighbor, the k nearest neighbors can split evenly between the two classes (the same number of positive and negative points); with k = 2, for example, the vote can be one positive against one negative. The predicted value of the target variable is then a coin toss. In the table below we broke such ties in favor of the actual value, but the tie could just as easily go against it. Because the predictions are ambiguous when k is even, we do not consider even values of k as candidates for the optimal k.
Leaving (5,1) out, we take the rest of the data set into consideration and get the results below. The results would be the same whichever point we leave out, since the total numbers of positive and negative cases are equal; the average over all leave-one-out runs therefore equals each individual result.
The given points:
Point X Y
1 1 5
2 2 6
3 2 7
4 5 1
5 6 2
6 7 2
7 7 3
8 8 4
9 8 3
10 9 5
11 3 7
12 3 8
13 4 8
14 5 9
Using the naming convention above (point number 1 for (1,5), and so on) and calculating the Euclidean distance between each pair of points in Excel, we get the following distance matrix:
Point 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0.00 1.41 2.24 5.66 5.83 6.71 6.32 7.07 7.28 8.00 2.83 3.61 4.24 5.66
2 1.41 0.00 1.00 5.83 5.66 6.40 5.83 6.32 6.71 7.07 1.41 2.24 2.83 4.24
3 2.24 1.00 0.00 6.71 6.40 7.07 6.40 6.71 7.21 7.28 1.00 1.41 2.24 3.61
4 5.66 5.83 6.71 0.00 1.41 2.24 2.83 4.24 3.61 5.66 6.32 7.28 7.07 8.00
5 5.83 5.66 6.40 1.41 0.00 1.00 1.41 2.83 2.24 4.24 5.83 6.71 6.32 7.07
6 6.71 6.40 7.07 2.24 1.00 0.00 1.00 2.24 1.41 3.61 6.40 7.21 6.71 7.28
7 6.32 5.83 6.40 2.83 1.41 1.00 0.00 1.41 1.00 2.83 5.66 6.40 5.83 6.32
8 7.07 6.32 6.71 4.24 2.83 2.24 1.41 0.00 1.00 1.41 5.83 6.40 5.66 5.83
9 7.28 6.71 7.21 3.61 2.24 1.41 1.00 1.00 0.00 2.24 6.40 7.07 6.40 6.71
10 8.00 7.07 7.28 5.66 4.24 3.61 2.83 1.41 2.24 0.00 6.32 6.71 5.83 5.66
11 2.83 1.41 1.00 6.32 5.83 6.40 5.66 5.83 6.40 6.32 0.00 1.00 1.41 2.83
12 3.61 2.24 1.41 7.28 6.71 7.21 6.40 6.40 7.07 6.71 1.00 0.00 1.00 2.24
13 4.24 2.83 2.24 7.07 6.32 6.71 5.83 5.66 6.40 5.83 1.41 1.00 0.00 1.41
14 5.66 4.24 3.61 8.00 7.07 7.28 6.32 5.83 6.71 5.66 2.83 2.24 1.41 0.00
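The same matrix can be reproduced outside Excel; the following short Python sketch computes all pairwise Euclidean distances for the 14 points and rounds them to two decimals as in the table.

```python
import numpy as np

# The 14 points, in the order of the table above
points = np.array([(1, 5), (2, 6), (2, 7), (5, 1), (6, 2), (7, 2), (7, 3),
                   (8, 4), (8, 3), (9, 5), (3, 7), (3, 8), (4, 8), (5, 9)])

# Pairwise differences via broadcasting, then the Euclidean norm:
# dist[i, j] = ||points[i] - points[j]||
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 2))  # e.g. dist[0, 1] == 1.41, the entry for points 1 and 2
```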
Using the distances from the table above, we see the following prediction results for different values of k, leaving one point out (ties for even k are broken in favor of the actual value, as noted above):
LOO point k value (6,2) (7,2) (7,3) (8,3) (8,4) (9,5) (1,5) (2,6) (2,7) (3,7) (3,8) (4,8) (5,9) Error rate
Actual value P N P N P P N N P N P N N
(5,1) k=2 P N P P P P N N P N P N N ambiguous
(5,1) k=3 P P N P P P N N N P N N N 6/14
(5,1) k=4 P P P P P P N N N N N N N 4/14 - ambiguous
(5,1) k=5 P P P P P P N N N N N N N 4/14
(5,1) k=6 P P P P P P N N N N N N N 4/14 - ambiguous
(5,1) k=7 P P P P P P N N N N N N N 4/14
(5,1) k=8 P N P N P P N N N N N N N 2/14 - ambiguous
(5,1) k=9 N N N N N N N N N N N N N 6/14
(5,1) k=10 N N N N N N N N P N P N N 4/14 - ambiguous
(5,1) k=11 P P P P P N N N N P P N N 5/14
(5,1) k=12 P N P N P P N N N N N N N 2/14 - ambiguous
(5,1) k=13 N N N N N N N N N N N N N 6/14
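The table above was tallied by hand; below is a hedged Python sketch of the same leave-one-out procedure. As in the discussion above, a point counts as its own nearest neighbor. The class label of the held-out point (5,1) is not listed in the table, so setting it to P is an assumption based only on the statement that the positive and negative cases are balanced (7 of each).

```python
import numpy as np

# Points in the order of the coordinate table; labels from the "Actual value" row
pts = np.array([(1, 5), (2, 6), (2, 7), (5, 1), (6, 2), (7, 2), (7, 3),
                (8, 4), (8, 3), (9, 5), (3, 7), (3, 8), (4, 8), (5, 9)], dtype=float)
labels = np.array(['N', 'N', 'P', 'P', 'P', 'N', 'P',
                   'P', 'N', 'P', 'N', 'P', 'N', 'N'])  # label of (5,1) assumed P

def loo_errors(held_out, k):
    """Drop one point, then classify each remaining point by majority vote over
    its k nearest remaining points (itself included). Odd k avoids ties."""
    keep = [i for i in range(len(pts)) if i != held_out]
    errors = 0
    for i in keep:
        d = np.linalg.norm(pts[keep] - pts[i], axis=1)  # Euclidean distances
        nearest = np.argsort(d, kind="stable")[:k]      # k closest (self has d = 0)
        votes = labels[np.array(keep)[nearest]]
        pred = 'P' if (votes == 'P').sum() * 2 > k else 'N'
        errors += int(pred != labels[i])
    return errors

for k in (3, 5, 7):                        # odd k only, as argued above
    print(k, loo_errors(held_out=3, k=k))  # index 3 is the point (5,1)
```

Because several pairs of points are equidistant, tie-breaking at the k-th neighbor can differ from the hand computation, so individual rows may not match the table exactly.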
Question 2:
The given points:
Point X Y
1 8 4
2 3 3
3 4 5
4 0 1
5 10 2
6 3 7
7 0 9
8 8 1
9 4 3
10 9 4
Question 2a:
Initial centroids: (1,1) and (8,8). Calculating the Manhattan (L1) distance from each point to each centroid and assigning every point to the nearer centroid, we have:
Point X Y Distance from C1 Distance from C2 Cluster assigned
1 8 4 10 4 2
2 3 3 4 10 1
3 4 5 7 7 1
4 0 1 1 15 1
5 10 2 10 8 2
6 3 7 8 6 2
7 0 9 9 9 1
8 8 1 7 7 1
9 4 3 5 9 1
10 9 4 11 5 2
Here we see that cluster 1 consists of {(3,3), (4,5), (0,1), (0,9), (8,1), (4,3)} (points equidistant from both centroids are assigned to cluster 1).
Cluster 2 contains: {(8,4), (10,2), (3,7), (9,4)}
Computing the means of the points for each cluster, we get the new centroids:
Iteration 1 means X Y
C1 3.17 3.67
C2 7.5 4.25
Reassigning each point to its nearest new centroid, we see that cluster 1 now consists of {(3,3), (4,5), (0,1), (3,7), (0,9), (4,3)}
Cluster 2 contains: {(8,4), (10,2), (8,1), (9,4)}
Computing the means of the points for each cluster, we get the new centroids:
Iteration 2 means X Y
C1 2.33 4.67
C2 8.75 2.75
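The two assignment/update rounds above can be verified with a short Python sketch. Matching the worked numbers, it uses Manhattan (L1) distance for the assignment step, updates each centroid as the coordinate-wise mean of its cluster, and (like the table) assigns points equidistant from both centroids to C1.

```python
import numpy as np

pts = np.array([(8, 4), (3, 3), (4, 5), (0, 1), (10, 2),
                (3, 7), (0, 9), (8, 1), (4, 3), (9, 4)], dtype=float)
centroids = np.array([(1, 1), (8, 8)], dtype=float)

for step in range(2):
    # Assignment: Manhattan distance to each centroid; argmin breaks ties toward C1
    d = np.abs(pts[:, None, :] - centroids[None, :, :]).sum(axis=-1)
    assign = d.argmin(axis=1)
    # Update: each centroid becomes the mean of its assigned points
    centroids = np.array([pts[assign == c].mean(axis=0) for c in range(2)])
    print("iteration", step + 1, np.round(centroids, 2))
# iteration 1: C1 = (3.17, 3.67), C2 = (7.5, 4.25)
# iteration 2: C1 = (2.33, 4.67), C2 = (8.75, 2.75)
```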
Question 2b:
Calculating the Manhattan distance from each point to the query point (5,3), we have:
Point X Y Distance from (5,3) Output
1 8 4 4 Yes
2 3 3 2 Yes
3 4 5 3 No
4 0 1 7 No
5 10 2 6 Yes
6 3 7 6 Yes
7 0 9 11 No
8 8 1 5 No
9 4 3 1 Yes
10 9 4 5 Yes
Based on the distances from the points to (5,3), we see that the points (4,3), (3,3), and (4,5)
are the three nearest neighbors of (5,3).
Point X Y Distance from (5,3) Output
9 4 3 1 Yes
2 3 3 2 Yes
3 4 5 3 No
Using the k-nearest-neighbor approach with k = 3 and equal weights for all neighbors, and coding Yes as 1 and No as 0, we get:
1/3 + 1/3 + 0/3 = 2/3 > 1/2, which means Yes.
Hence, the predicted output for the point (5,3) is Yes.
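For completeness, a small Python sketch of this 3-NN vote for (5,3), using the Manhattan distances from the table and an equal weight of 1/3 per neighbor:

```python
import numpy as np

pts = np.array([(8, 4), (3, 3), (4, 5), (0, 1), (10, 2),
                (3, 7), (0, 9), (8, 1), (4, 3), (9, 4)])
labels = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1])  # 1 = Yes, 0 = No, in table order
query = np.array([5, 3])

d = np.abs(pts - query).sum(axis=1)   # Manhattan distances to (5,3)
nearest = np.argsort(d)[:3]           # the three closest: (4,3), (3,3), (4,5)
vote = labels[nearest].mean()         # (1 + 1 + 0) / 3 = 2/3
print("Yes" if vote > 0.5 else "No")  # Yes
```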