
COMPUTER SCIENCE & ENGINEERING DEPARTMENT
IIT KHARAGPUR

MACHINE LEARNING CS60050
MIDTERM EXAM, SPRING 2011
DATE: 22-FEB-2011    TIME: 2 HOURS    FULL MARKS: 40

1. Answer briefly. [2*4 = 8]
a) Consider a standard six-sided die. Let X denote the number of times that 6 shows up over n throws of the die. Compute an upper bound on P(X ≥ nh). You may use Markov's inequality or any other result.
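
For reference, a minimal worked bound via Markov's inequality (the threshold nh is read from the question as printed): for a non-negative random variable X and any a > 0, P(X ≥ a) ≤ E[X]/a. Here X is Binomial(n, 1/6), so E[X] = n/6 and Markov gives P(X ≥ nh) ≤ (n/6)/(nh) = 1/(6h).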
b) Consider a training set whose labels were randomly corrupted. For the k-nearest neighbor classifier, which choice of k is more robust to the labeling noise: k = 1 or k = 4? Explain in one sentence.
c) Suppose that you train a classifier with a training set of size m. As m → ∞, what do you expect will be the behavior of the training error? What would you expect to be the behavior of the test error? Draw a picture to illustrate.
d) Consider an instance space where the instances are described by n = 4 attributes, and the target attribute takes on the value 1 for true and -1 for false. Imagine we are using a single perceptron with n + 1 = 5 weights (w0, w1, ..., w4) for prediction, where initially all weights are set to 0.1. Show two iterations of the perceptron algorithm on the following two instances:

i. < 1, 1, 1, 1; y = 1 >
ii. < 1, 0.1, 1, 0.1; y = -1 >

The value after the semicolon is the value of the target attribute for that instance.
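
For reference, a minimal sketch of one pass of the perceptron algorithm in Python. The learning rate of 1, the sign activation, and the convention that w0 multiplies a fixed bias input of 1 are assumptions, since the question does not fix them:

# Perceptron sketch (assumptions: learning rate eta = 1, sign activation,
# bias weight w0 paired with a constant input of 1).
def predict(w, x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1

def update(w, x, y, eta=1.0):
    # Classic perceptron rule: change the weights only on a mistake.
    if predict(w, x) != y:
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# The two instances from the question, each with a leading 1 for the bias.
w = [0.1] * 5
for x, y in [([1, 1, 1, 1, 1], 1), ([1, 1, 0.1, 1, 0.1], -1)]:
    w = update(w, x, y)
    print(w)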

2. [2+2+2+2=8]
We wish to model the relationship between x and y, where x and y are reals and x is a single value. We use a simple regression model. According to our model,

y = w1 x + w0 + ε, where ε ~ N(0, σ²)

However, suppose the inputs and outputs are actually related quadratically:

y = θ2 x² + θ1 x + θ0 + ε, where ε ~ N(0, σ²)

That is, we are trying to model the underlying and unknown quadratic relation with a linear model.

Suppose the training set contains the following instances:

{(x1, y1), (x2, y2), ..., (xm, ym)}


a) What are the least squares estimates of the parameters of the linear model?

b) What is the predicted response y(x) from the model at a new point x?

c) Give a general expression for the expected prediction error. Show how it decomposes into the components of bias, variance, and noise.

d) Write down an expression for the bias of y(x) at a fixed input x for the above case.
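
For reference, a minimal sketch of the closed-form least squares estimates for the simple linear model, with purely illustrative data (not from the question):

import numpy as np

# Illustrative data with a roughly quadratic trend, as in the question's setup.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.2, 4.1, 9.2])

# Least squares estimates: w1 = cov(x, y) / var(x), w0 = ybar - w1 * xbar.
xbar, ybar = x.mean(), y.mean()
w1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
w0 = ybar - w1 * xbar

def y_hat(x_new):
    # Predicted response of the (misspecified) linear model at a new point.
    return w1 * x_new + w0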

3. [1+2+2=5]
You are asked to use regularized linear regression to predict the target Y ∈ ℝ from the ten-dimensional feature vector X ∈ ℝ¹⁰. You define the model Y = wᵀX and consider the following two objective functions:

I. min_w Σ_{i=1}^{m} (y_i − wᵀx_i)²

II. min_w Σ_{i=1}^{m} (y_i − wᵀx_i)² + λ Σ_{j=1}^{10} w_j²

a) What are the regularization terms in the above functions?

b) For II, state whether the bias and the variance will increase, decrease, or remain unaffected as you increase the value of λ.

c) State qualitatively what is likely to be the difference in the w values in these two cases.
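
For reference, a minimal sketch contrasting objectives I and II in closed form (ordinary least squares vs. L2-regularized, i.e., ridge, regression); the data and the value of λ are illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))        # m = 50 instances, 10 features
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0                            # regularization strength lambda
w_ols = np.linalg.solve(X.T @ X, X.T @ y)                       # objective I
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)  # objective II

# The ridge solution shrinks the weight vector toward zero.
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))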

4. [2+2+2=6]
a) Suppose you are given data x1, x2, ..., xm, where each xi is a single real value. That is, you have m instances of a single real-valued feature. Suppose that the data is distributed uniformly at random between 0 and w. You are trying to find a maximum likelihood estimate of w based on the data.

i. Write the likelihood function L(w).

ii. What is the maximum likelihood estimate of w? Justify your answer based on the likelihood function.
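
For reference, a minimal sketch of the shape of the likelihood for a Uniform(0, w) model: L(w) = w^(-m) when w ≥ max_i xi, and 0 otherwise. The data below is illustrative:

import numpy as np

x = np.array([0.3, 1.7, 0.9, 2.4])   # illustrative sample

def likelihood(w):
    # Each point contributes density 1/w if it lies in [0, w], else 0.
    m = len(x)
    return w ** (-m) if w >= x.max() else 0.0

for w in [2.0, 2.4, 3.0, 4.0]:
    print(w, likelihood(w))           # zero below max(x), then decreasing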
b) You have received a new coin and want to estimate the probability θ that it will come up heads if you toss it. A priori you assume that the most probable value of θ is 0.5. You then toss the coin 3 times, and it comes up heads twice. What is the maximum likelihood estimate (MLE) of θ, and what is the maximum a posteriori (MAP) estimate of θ?
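
For reference, a minimal sketch of the MLE and MAP computations under an assumed Beta(2, 2) prior; the question only fixes the prior's mode at 0.5, so the choice a = b = 2 is an assumption:

# MLE vs. MAP for a coin with a Beta(a, b) prior whose mode is 0.5.
heads, tosses = 2, 3
a, b = 2, 2                                         # assumed prior strength

theta_mle = heads / tosses                          # maximizes the likelihood
theta_map = (heads + a - 1) / (tosses + a + b - 2)  # mode of the Beta posterior

print(theta_mle, theta_map)   # 0.666..., 0.6 under this assumed prior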

5. [2+2+2=6]
a) Suppose that you have a logistic regression classifier. Consider a point that is currently classified correctly and is far away from the decision boundary. If you remove the point from the training set and re-train the classifier, will the decision boundary change or stay the same?

b) Suppose you are given the following classification task: predict the target Y ∈ {0, 1} given two real-valued features X1 ∈ ℝ and X2 ∈ ℝ. After some training, you learn the following decision rule: predict Y = 1 iff w0 + w1 X1 + w2 X2 ≥ 0 and Y = 0 otherwise, where w0 = 1, w1 = -3, w2 = 5.

i. Draw the feature space. Plot the decision boundary and label the regions where we would predict Y = 0 and Y = 1.

ii. Suppose that we learned the above weights using logistic regression. Using this model, what would be our prediction for P(Y = 1 | X1, X2)?
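
For reference, a minimal sketch of how logistic regression turns this decision rule's score into a probability via the sigmoid, using the weights given above; the correspondence of "score ≥ 0" with a 0.5 probability threshold is standard:

import math

w0, w1, w2 = 1.0, -3.0, 5.0

def p_y1(x1, x2):
    # P(Y = 1 | X1, X2) = sigmoid(w0 + w1*X1 + w2*X2); predicting Y = 1
    # iff the score is >= 0 matches a probability threshold of 0.5.
    score = w0 + w1 * x1 + w2 * x2
    return 1.0 / (1.0 + math.exp(-score))

print(p_y1(0.0, 0.0))   # score = 1 -> probability above 0.5, predict Y = 1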

6. [2+5=7]
a) Suppose you want to use a Boolean function to detect spam SMS messages. Each SMS has n = 10 binary features (e.g., contains / does not contain the word "lottery"). Suppose the SMS messages are generated by some unknown Boolean function of the n binary features. How many sample SMS messages are sufficient to obtain, with probability at least 90%, a Boolean function whose error is less than 5%?
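
For reference, a minimal sketch of the standard sample-complexity bound for a finite hypothesis space, m ≥ (1/ε)(ln|H| + ln(1/δ)); that this is the bound the course intends is an assumption:

import math

# Boolean functions over n = 10 binary features: |H| = 2**(2**n).
n = 10
eps, delta = 0.05, 0.10        # error below 5%, confidence at least 90%

ln_H = (2 ** n) * math.log(2)  # ln|H| computed without forming 2**1024
m = (ln_H + math.log(1 / delta)) / eps
print(math.ceil(m))            # a sufficient number of sample SMS messages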
b) Consider the space of points in the plane. Consider the class of hypotheses defined by a straight line passing through the origin.

i. Can you find a set of 2 points that can be shattered by this hypothesis space? If yes, give an example. If no, give a short proof.

ii. Can you find a set of 3 points that can be shattered by this hypothesis space? If yes, give an example. If no, give a short proof.

iii. What is the VC dimension of this hypothesis class?
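
For reference, a minimal sketch for checking shattering empirically, assuming each origin line is used as an oriented separator h(x) = sign(w·x) (whether both orientations are allowed is an assumption the answer may need to discuss):

import numpy as np

def achievable_labelings(points, n_dirs=3600):
    # Sweep the weight direction around the circle; each w yields one labeling
    # of the points by sign(points @ w). This approximates the hypothesis class.
    # The 0.01 offset avoids directions exactly perpendicular to a point.
    points = np.asarray(points, dtype=float)
    labelings = set()
    for t in np.linspace(0.01, 0.01 + 2 * np.pi, n_dirs, endpoint=False):
        w = np.array([np.cos(t), np.sin(t)])
        labelings.add(tuple(np.sign(points @ w).astype(int)))
    return labelings

pts = [(1.0, 0.0), (0.0, 1.0)]
print(len(achievable_labelings(pts)))  # 4 = 2**2 labelings => shattered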
