
CS 189 Final Note Sheet (Rishi Sharma, Peter Gao, et al.)

Probability & Matrix Review

Bayes Rule: P(ω|x) = P(x|ω)P(ω) / P(x), where P(x) = Σ_i P(x|ω_i)P(ω_i)
P(x, ω) = P(x|ω)P(ω) = P(ω|x)P(x)

Bayesian Decision Theory

P(error) = ∫ P(error|x)P(x) dx
P(error|x) = P(ω_1|x) if we decide ω_2, and P(ω_2|x) if we decide ω_1
0-1 Loss: ℓ(α_i|ω_j) = 0 if i = j (correct), 1 if i ≠ j (mismatch)
Expected Loss (Risk): R(α_i|x) = Σ_{j=1}^c ℓ(α_i|ω_j) P(ω_j|x)
0-1 Risk: R(α_i|x) = Σ_{j≠i} P(ω_j|x) = 1 - P(ω_i|x)

Support Vector Machines

In the strictly separable case, the goal is to find a separating hyperplane (as in logistic regression), except now we don't want just any hyperplane, but the one with the largest margin.
H = {w^T x + b = 0}. Since scaling w and b in opposite directions doesn't change the hyperplane, our optimization should have this scaling invariance built into it, so we define the points closest to the hyperplane, the support vectors x_sv, to satisfy |w^T x_sv + b| = 1. The distance from any support vector to the hyperplane is then 1/||w||_2, so maximizing the distance to the hyperplane is the same as minimizing ||w||_2. The final optimization problem is:
min_{w,b} (1/2)||w||^2   s.t.   y^(i)(w^T x^(i) + b) ≥ 1,  i = 1, ..., m
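Below is a minimal NumPy sketch of the Bayes decision rule under 0-1 loss described above. The priors and class-conditional values are made-up numbers purely for illustration.

import numpy as np

# Hypothetical two-class setup: priors P(w1), P(w2) and the class-conditional
# densities p(x | w) evaluated at one observed x (made-up numbers).
priors = np.array([0.6, 0.4])
likelihoods = np.array([0.2, 0.5])           # p(x | w1), p(x | w2)

evidence = np.sum(likelihoods * priors)      # P(x) = sum_i p(x | w_i) P(w_i)
posterior = likelihoods * priors / evidence  # Bayes rule: P(w_i | x)

# Under 0-1 loss, R(a_i | x) = 1 - P(w_i | x), so we pick the class with the
# largest posterior; the conditional risk is the total posterior of the rest.
decision = np.argmax(posterior)
risk = 1.0 - posterior[decision]
print(decision, posterior, risk)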

Generative vs. Discriminative Model

Generative: model the class conditional density p(x|y) and find p(y|x) ∝ p(x|y)p(y), or model the joint density p(x, y) and compute the posterior p(y = k|x) = p(x, y = k) / Σ_j p(x, y = j).
Discriminative: model the conditional p(y|x) directly.
Terminology (Bayes rule): class conditional P(X|Y), prior P(Y), posterior P(Y|X), evidence P(X).

Multivariate Gaussian

X ~ N(μ, Σ), with density f(x; μ, Σ) = 1/((2π)^(n/2)|Σ|^(1/2)) exp(-(1/2)(x - μ)^T Σ^(-1) (x - μ))
Σ is PSD => x^T Σ x ≥ 0; if the inverse Σ^(-1) exists, Σ must be PD.
Σ = E[(X - μ)(X - μ)^T] = E[XX^T] - μμ^T, and Σ_MLE = (1/m) Σ_{i=1}^m (x^(i) - μ)(x^(i) - μ)^T.
The distribution is the result of a linear transformation of a vector of univariate Gaussians Z ~ N(0, I) such that X = AZ + μ, where Σ = AA^T.
If X ~ N(μ, Σ), then AX + b ~ N(Aμ + b, AΣA^T) => Σ^(-1/2)(X - μ) ~ N(0, I), where Σ^(-1/2) = UΛ^(-1/2)U^T.
From the pdf, the level curves of the distribution decrease with x^T Σ^(-1) x (assume μ = 0): the c-level set of f is {x : x^T Σ^(-1) x = c}, and with Σ = UΛU^T this becomes λ_1^(-1)(u_1^T x)^2 + ... + λ_n^(-1)(u_n^T x)^2 = c.
Thus the level curves form an ellipsoid whose axis lengths are proportional to the square roots of the eigenvalues of the covariance matrix (axis i has length ∝ sqrt(λ_i)).
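The linear-transformation view above (X = AZ + μ with Σ = AA^T) can be checked empirically. A minimal sketch, assuming the Cholesky factor as one valid choice of A; the mean and covariance values are made up.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.linalg.cholesky(Sigma)          # one valid A with Sigma = A A^T
Z = rng.standard_normal((100000, 2))   # rows are draws of Z ~ N(0, I)
X = Z @ A.T + mu                       # X = A Z + mu, so X ~ N(mu, Sigma)

print(X.mean(axis=0))                  # approximately mu
print(np.cov(X, rowvar=False))         # approximately Sigma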

Loss Functions

(Here y ∈ {-1, +1} and the classifier is sign(f(x)).)
Squared error: [y - f(x)]^2 = [1 - y f(x)]^2; minimizing function f(x) = 2P[Y = +1 | x] - 1.
Huberized square hinge loss: -4 y f(x) if y f(x) < -1, and [1 - y f(x)]_+^2 otherwise; minimizing function f(x) = 2P[Y = +1 | x] - 1.
Binomial deviance: log[1 + e^(-y f(x))]; minimizing function f(x) = log(P[Y = +1 | x] / P[Y = -1 | x]).
SVM hinge loss: [1 - y f(x)]_+; minimizing function f(x) = sign(P[Y = +1 | x] - 1/2).

Newton's Method: θ_{t+1} = θ_t - [∇²_θ f(θ_t)]^(-1) ∇_θ f(θ_t)
Gradient Descent: θ_{t+1} = θ_t - α ∇_θ f(θ_t), for minimizing f.
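A small NumPy sketch contrasting the two update rules above on a convex quadratic, where Newton's method converges in a single step. The matrix Q and vector b are arbitrary made-up values.

import numpy as np

# Minimize f(theta) = 0.5 theta^T Q theta - b^T theta, so grad f = Q theta - b
# and the Hessian is Q.
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda th: Q @ th - b

theta = np.zeros(2)
alpha = 0.1
for _ in range(200):                          # gradient descent
    theta = theta - alpha * grad(theta)

theta_newton = np.zeros(2)
theta_newton = theta_newton - np.linalg.solve(Q, grad(theta_newton))  # one Newton step

print(theta, theta_newton, np.linalg.solve(Q, b))   # all approximately Q^{-1} b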

Gradients

∂y/∂x is the matrix of partial derivatives with (i, j) entry ∂y_i/∂x_j, i.e. rows run over y_1, ..., y_m and columns over x_1, ..., x_n.
∂(x^T x)/∂x = 2x,   ∂(x^T A x)/∂x = (A + A^T)x,   ∂(Ax)/∂x = A^T,   ∂(x^T A)/∂x = A,   ∂tr(BA)/∂A = B^T
Logistic function: h_θ(x) = 1/(1 + e^(-θ^T x)), and ∂h_θ/∂θ = (e^(-θ^T x)/(1 + e^(-θ^T x))^2) x = h_θ(x)(1 - h_θ(x)) x.

Lagrangian

Using notation from Peter's notes: given min_x f(x) s.t. g_i(x) = 0, h_i(x) ≤ 0, the corresponding Lagrangian is
L(x, α, β) = f(x) + Σ_{i=1}^k α_i g_i(x) + Σ_{i=1}^l β_i h_i(x)
We min over x and max over the Lagrange multipliers α and β.

L1 regularization results in lasso regression. Used when x has a Laplace prior; gives sparse results.
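The matrix-calculus identities above are easy to spot-check numerically. A minimal sketch comparing the claimed gradient of x^T A x against central finite differences, with a randomly generated A.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))      # not symmetric in general
x = rng.standard_normal(4)
f = lambda v: v @ A @ v              # f(x) = x^T A x

analytic = (A + A.T) @ x             # claimed gradient

eps = 1e-6
numeric = np.zeros(4)
for i in range(4):                   # central finite differences
    e = np.zeros(4); e[i] = eps
    numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be on the order of 1e-9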

Other Classifiers

k-NN in the limit: as N -> ∞ and K -> ∞ with K/N -> 0, the k-NN error ε_knn -> ε* (the Bayes error).
Curse of dimensionality: as the number of dimensions increases, everything becomes farther apart, and our low-dimensional intuition falls apart. Consider the hypersphere-to-hypercube volume ratio: it is already close to zero at d = 10. How to deal with this curse:
1. Get more data to fill all of that empty space.
2. Get better features, reducing the dimensionality and packing the data closer together. Ex: bag-of-words, histograms, ...
3. Use a better distance metric:
Minkowski: Dis_p(x, y) = (Σ_{i=1}^d |x_i - y_i|^p)^(1/p) = ||x - y||_p
0-"norm": Dis_0(x, y) = Σ_{i=1}^d I[x_i ≠ y_i] (the number of mismatched coordinates)
Mahalanobis: Dis_M(x, y | Σ) = sqrt((x - y)^T Σ^(-1) (x - y))
In high dimensions we get hubs, points that most other points identify as their nearest neighbor. These hubs are usually near the mean (e.g. dull gray images, sky and clouds). To avoid having everything classified as these hubs, we can use cosine similarity.
k-d trees increase the efficiency of nearest neighbor lookup.
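A minimal brute-force k-NN sketch with Euclidean distance and majority voting; the two-blob dataset is made up for illustration.

import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority-vote k-NN with Euclidean distance (brute force)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)        # distances to all training points
        nearest = np.argsort(d)[:k]                     # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])         # majority vote
    return np.array(preds)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
                     rng.normal([3, 3], 0.5, size=(20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([[0.2, 0.1], [2.8, 3.1]]), k=3))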

Decision Trees
Given a set of points and classes {x_i, y_i}_{i=1}^n, test features X_j and branch on the feature that best separates the data, then recursively split on each new subset of the data. Growing the tree to max depth tends to overfit (the training data gets cut up quickly => subtrees train on small sets), and mistakes high up in the tree propagate to the corresponding subtrees. To reduce overfitting, we can prune using a validation set, and we can limit the depth. DTs are prone to label noise, and building the optimal tree is hard.
Heuristic: for CLASSIFICATION, maximize the information gain
max_j  H(D) - Σ_{x_j ∈ X_j} P(X_j = x_j) H(D | X_j = x_j)
where H(D) = -Σ_{c ∈ C} P(y = c) log[P(y = c)] is the entropy of the data set, C is the set of classes each data point can take, and P(y = c) is the fraction of data points with class c.
For REGRESSION, minimize the variance: the same optimization problem as above, except H is replaced with var. Pure leaves correspond to low variance, and the prediction is the mean of the current leaf.

Random Forests
Problem: DTs are unstable; small changes in the input data have a large effect on the tree structure => DTs are high-variance estimators.
Solution: Random Forests train M different trees on randomly sampled subsets of the data (bagging), and sometimes on randomly sampled subsets of the features, to de-correlate the trees. A new point is run through all M trees and we take the majority vote as the output class (for regression, we take the average of the outputs).
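A small sketch of the entropy and information-gain computation used by the classification heuristic above, for one categorical feature; the labels and feature values are made up.

import numpy as np

def entropy(y):
    """H(D) = -sum_c P(y = c) log2 P(y = c)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """H(D) - sum_v P(X = v) H(D | X = v) for one categorical feature x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * entropy(y[mask])
    return gain

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
x = np.array([0, 0, 0, 1, 1, 1, 0, 1])   # this feature separates the labels perfectly
print(entropy(y), information_gain(y, x))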

LDA and QDA

Classify y ∈ {0, 1}. Model the prior p(y) = φ^y (1 - φ)^(1-y) and the class conditionals p(x | y = k) as Gaussians.
Maximizing ℓ(φ, μ_0, μ_1, Σ) = log Π_{i=1}^m p(x^(i) | y^(i); μ_0, μ_1, Σ) p(y^(i); φ) gives
φ_MLE = (1/m) Σ_i 1{y^(i) = 1},   μ_k,MLE = the average of the x^(i) with label k,   Σ_MLE = (1/m) Σ_i (x^(i) - μ_{y^(i)})(x^(i) - μ_{y^(i)})^T.
Notice the covariance matrix is the same for all classes in LDA. If p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) is a logistic function; the converse is NOT true, so LDA makes stronger assumptions about the data than logistic regression does.
LDA: h(x) = argmax_k -(1/2)(x - μ_k)^T Σ^(-1) (x - μ_k) + log(π_k), where π_k = p(y = k).
For QDA, the model is the same as LDA except that each class has its own covariance matrix:
h(x) = argmax_k -(1/2) log|Σ_k| - (1/2)(x - μ_k)^T Σ_k^(-1) (x - μ_k) + log(π_k)

L2 regularization results in ridge regression. Used when A contains a null space; it falls out of the MAP estimate when we add a Gaussian prior on x with Σ = cI.
min_x ||Ax - y||_2^2 + λ||x||_2^2   =>   x = (A^T A + λI)^(-1) A^T y
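A minimal sketch of the ridge closed form above next to ordinary least squares, on made-up data; np.linalg.solve is used instead of forming the inverse explicitly.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = rng.standard_normal(10)
y = A @ x_true + 0.1 * rng.standard_normal(50)

lam = 0.5
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)   # (A^T A + lambda I)^(-1) A^T y
x_ols = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.linalg.norm(x_ridge), np.linalg.norm(x_ols))            # ridge shrinks the solution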

Optimization

For min_x f_0(x) s.t. f_i(x) ≤ 0, the Lagrangian is L(x, λ) = f_0(x) + Σ_{i=1}^m λ_i f_i(x). Think of the λ_i as the cost of violating the constraint f_i(x) ≤ 0.
L defines a saddle point game: one player (MIN) chooses x to minimize L; the other player (MAX) chooses λ to maximize L. If MIN violates a constraint, f_i(x) > 0, then MAX can drive L to infinity.
We call the original optimization problem the primal problem. It has value p* = min_x max_{λ ≥ 0} L(x, λ). (For an infeasible x, L(x, λ) can be made infinite, and for a feasible x, the λ_i f_i(x) terms will be driven to zero.)
Define g(λ) := min_x L(x, λ), and define the dual problem as d* = max_{λ ≥ 0} g(λ) = max_{λ ≥ 0} min_x L(x, λ).
In a zero-sum game it is always better to play second: p* = min_x max_{λ ≥ 0} L(x, λ) ≥ max_{λ ≥ 0} min_x L(x, λ) = d*. This is called weak duality.
If there is a saddle point (x*, λ*), so that for all x and λ ≥ 0 we have L(x*, λ) ≤ L(x*, λ*) ≤ L(x, λ*), then we have strong duality: the primal and dual have the same value, p* = min_x max_{λ ≥ 0} L(x, λ) = max_{λ ≥ 0} min_x L(x, λ) = d*.

Logistic Regression

Classify y ∈ {0, 1} => model p(y = 1 | x) = h_θ(x) = 1/(1 + e^(-θ^T x)), so p(y | x; θ) = (h_θ(x))^y (1 - h_θ(x))^(1 - y).
L(θ) = Π_{i=1}^m (h_θ(x^(i)))^(y^(i)) (1 - h_θ(x^(i)))^(1 - y^(i))  =>
ℓ(θ) = Σ_{i=1}^m y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i)))  =>
∇_θ ℓ(θ) = Σ_i (y^(i) - h_θ(x^(i))) x^(i) = X^T (y - h_θ(X))
Batch: θ_{t+1} = θ_t + α X^T (y - h_θ(X));  Stochastic: θ_{t+1} = θ_t + α (y_t^(j) - h_θ(x_t^(j))) x_t^(j). (We want to maximize ℓ(θ), so these are ascent steps.)
Gaussian class conditionals lead to a logistic posterior.

In general the objective consists of two parts, the loss term and the regularization term: J(w) = Σ_i Loss_i + λ R(w).
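A minimal batch gradient-ascent sketch for the logistic-regression log-likelihood above; the data are simulated from a made-up "true" parameter vector, and the 1/m scaling of the step is only for numerical stability.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, d = 200, 3
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, d))])   # x_0 = 1 intercept column
theta_true = np.array([0.5, 2.0, -1.0, 0.0])
y = (rng.random(m) < sigmoid(X @ theta_true)).astype(float)

theta = np.zeros(d + 1)
alpha = 0.1
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ theta))    # gradient of the log-likelihood
    theta = theta + alpha * grad / m         # ascent step
print(theta)                                 # roughly recovers theta_true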

Support Vector Machines: Dual and the Non-Separable Case

Primal: L_p(w, b, α) = (1/2)||w||^2 - Σ_{i=1}^m α_i (y^(i)(w^T x^(i) + b) - 1)
∂L_p/∂w = w - Σ_i α_i y^(i) x^(i) = 0  =>  w = Σ_i α_i y^(i) x^(i)
∂L_p/∂b = -Σ_i α_i y^(i) = 0  =>  Σ_i α_i y^(i) = 0.   Note: α_i ≠ 0 only for the support vectors.
Substitute the derivatives into the primal to get the dual:
L_d(α) = Σ_{i=1}^m α_i - (1/2) Σ_{i=1}^m Σ_{j=1}^m y^(i) y^(j) α_i α_j (x^(i))^T x^(j)
KKT says α_n (y_n(w^T x_n + b) - 1) = 0, so y_n(w^T x_n + b) = 1 wherever α_n > 0.
In the non-separable case we allow points to cross the margin boundary by some amount ξ_i ≥ 0 and penalize it:
min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^m ξ_i   s.t.   y^(i)(w^T x^(i) + b) ≥ 1 - ξ_i,  i = 1, ..., m
The dual for the non-separable case doesn't change much, except that each α_i now has an upper bound of C => 0 ≤ α_i ≤ C.
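The soft-margin problem above is usually solved through the dual; as an illustration only, here is a sketch that instead runs subgradient descent on the equivalent unconstrained primal, 0.5||w||^2 + C Σ_i max(0, 1 - y_i(w^T x_i + b)). The toy data and step size are made up.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 1.0, size=(50, 2)),
               rng.normal([+2, 0], 1.0, size=(50, 2))])
y = np.array([-1.0] * 50 + [+1.0] * 50)

C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(500):
    margins = y * (X @ w + b)
    viol = margins < 1                                  # points inside the margin or misclassified
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w, b = w - lr * grad_w, b - lr * grad_b

print(w, b, np.mean(np.sign(X @ w + b) == y))           # training accuracy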

Nearest Neighbor

Key idea: store all training examples (x_i, f(x_i)).
NN: find the closest training point under some distance metric and take its label.
k-NN: find the closest k training points and take the most likely label based on some voting scheme (majority vote; mean or median for regression).
Behavior in the limit: for 1-NN, ε* ≤ lim_{N -> ∞} ε_NN ≤ 2ε*, where ε* is the error of the optimal (Bayes) predictor and ε_NN is the error of the 1-NN classifier.


!
(y(i) q | x(i) )2
=) p(y(i) |x(i) ; q ) = p 1
exp
2
2
2s
2ps
!
(y(i) q | x(i) )2
p 1
=) L(q ) = m
i=1 2ps 2 exp
2
2s
1 m (y(i) q | x(i) )2
=) l(q ) = m log p 1
2s 2 i=1
2ps 2
(i) h (x))2
=) maxq l(q ) minq m
q
i=1 (y

S = E[(X

1)

Dual:

y(i) = q | x(i) + e (i) with noise e(i) N (0, s 2 )

Gradient Descent:
(i)
qt+1 = qt + a(yt

(i) T (i)
m
i=1 ai (y (w x + b)

Substitute the derivatives into the primal to get the dual.

posterior P(Y |X)


evidence P(X)

Probabilistic Motivation for Least Squares

q l(q ) = X | Xq

Lp
=
b

1, i = 1, . . . , m
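A minimal sketch of the normal equations on simulated data with the assumed Gaussian noise; np.linalg.lstsq is shown alongside as the numerically preferred route in practice.

import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 4
X = np.hstack([np.ones((m, 1)), rng.standard_normal((m, d - 1))])  # intercept handled in theta
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true + 0.1 * rng.standard_normal(m)                  # y = theta^T x + N(0, sigma^2)

theta_normal = np.linalg.solve(X.T @ X, X.T @ y)                   # normal equations
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_normal, theta_lstsq)                                   # both roughly theta_true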

Boosting

Weak learner: can classify with at least 50% accuracy.
Train a weak learner to get a weak classifier. Test it on the training data, up-weigh the misclassified data, down-weigh the correctly classified data, and train a new weak learner on the re-weighted data; repeat. A new point is classified by every weak learner, and the output class is the sign of a weighted average of the weak learner outputs. Boosting generally overfits: if there is label noise, boosting keeps up-weighing the mislabeled data.
AdaBoost is a boosting algorithm. The weak learner weights are α_t = (1/2) ln((1 - ε_t)/ε_t), where ε_t = Pr_{D_t}(h_t(x_i) ≠ y_i) is the probability of misclassification under the current weights D_t. The data weights are updated as D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor.
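A short AdaBoost sketch following the update rules above. Using scikit-learn depth-1 decision trees as the weak learners is an assumption for illustration (any weak learner that accepts sample weights would do), and the XOR-style dataset is made up.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)       # labels in {-1, +1}

m = len(y)
D = np.ones(m) / m                               # initial (uniform) weights
stumps, alphas = [], []
for t in range(20):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)
    eps = D[pred != y].sum()                     # weighted error eps_t
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))
    D = D * np.exp(-alpha * y * pred)            # up-weigh mistakes, down-weigh correct points
    D = D / D.sum()                              # the Z_t normalization
    stumps.append(h); alphas.append(alpha)

F = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
print(np.mean(np.sign(F) == y))                  # training accuracy of the ensemble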

Neural Networks
Neural nets explore what you can do by combining perceptrons, each of which is a simple linear classifier. We use a soft threshold for each activation function θ because it is twice differentiable.

Error functions:
Mean squared error: Σ_{i=1}^{n_out} (y_i - h_θ(x)_i)^2
Cross-entropy loss: -Σ_{i=1}^{n_out} [y_i log(h_θ(x)_i) + (1 - y_i) log(1 - h_θ(x)_i)]

Activation functions:
θ(s) = tanh(s) = (e^s - e^(-s)) / (e^s + e^(-s))  =>  θ'(s) = 1 - θ^2(s)
θ(s) = σ(s) = 1/(1 + e^(-s))  =>  θ'(s) = σ(s)(1 - σ(s))

Notation:
1. w_ij^(l) is the weight from neuron i in layer l-1 to neuron j in layer l. There are d^(l) nodes in the l-th layer.
2. There are L layers, where L is the output layer and the data is the 0th layer.
3. x_j^(l) = θ(s_j^(l)) is the output of a neuron: the activation function applied to the input signal s_j^(l) = Σ_i w_ij^(l) x_i^(l-1).
4. e(w) is the error as a function of the weights.

The goal is to learn the weights w_ij^(l). We use gradient descent, but the error function is non-convex, so we tend to end up in local minima. The naive computation of the gradient takes O(w^2); back propagation, an algorithm for efficient computation of the gradient, takes O(w).
∂e(w)/∂w_ij^(l) = (∂e(w)/∂s_j^(l)) (∂s_j^(l)/∂w_ij^(l)) = δ_j^(l) x_i^(l-1)
Final layer: δ_j^(L) = ∂e(w)/∂s_j^(L) = (∂e(w)/∂x_j^(L)) (∂x_j^(L)/∂s_j^(L)) = e'(x_j^(L)) θ'_out(s_j^(L))
General: δ_i^(l-1) = ∂e(w)/∂s_i^(l-1) = Σ_{j=1}^{d^(l)} (∂e(w)/∂s_j^(l)) (∂s_j^(l)/∂x_i^(l-1)) (∂x_i^(l-1)/∂s_i^(l-1)) = Σ_{j=1}^{d^(l)} δ_j^(l) w_ij^(l) θ'(s_i^(l-1))
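A tiny NumPy sketch of one forward/backward pass using the delta recursions above, for a single made-up training example, one tanh hidden layer, a sigmoid output, and squared error; the last lines spot-check one weight gradient with a finite difference.

import numpy as np

rng = np.random.default_rng(0)
sigma = lambda s: 1.0 / (1.0 + np.exp(-s))

x = rng.standard_normal(3)                  # layer-0 activations
y = 1.0
W1 = rng.standard_normal((3, 4)) * 0.5      # weights into layer 1 (tanh)
W2 = rng.standard_normal((4, 1)) * 0.5      # weights into layer 2 (sigmoid output)

# Forward pass: s^(l) = W^(l)^T x^(l-1), x^(l) = theta(s^(l)).
s1 = W1.T @ x;  x1 = np.tanh(s1)
s2 = W2.T @ x1; x2 = sigma(s2)
e = (x2 - y) ** 2

# Backward pass.
delta2 = 2 * (x2 - y) * x2 * (1 - x2)       # delta^(L) = e'(x^(L)) * sigma'(s^(L))
delta1 = (W2 @ delta2) * (1 - x1 ** 2)      # delta^(1)_i = sum_j delta^(2)_j w_ij tanh'(s^(1)_i)
grad_W2 = np.outer(x1, delta2)              # de/dw^(2)_ij = delta^(2)_j x^(1)_i
grad_W1 = np.outer(x, delta1)               # de/dw^(1)_ij = delta^(1)_j x_i

eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
ep = (sigma(W2.T @ np.tanh(W1p.T @ x)) - y) ** 2
print(grad_W1[0, 0], ((ep - e) / eps).item())   # the two numbers should agree closely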

Unsupervised Learning

Clustering
Unsupervised learning (no labels).
Distance functions between two sets of points:
Single linkage is the minimum distance between members.
Complete linkage is the maximum distance between members.
Centroid linkage is the distance between centroids.
Average linkage is the average distance between all pairs.
Hierarchical:
Agglomerative: start with n points and repeatedly merge the 2 closest clusters using some measure, such as single-link (closest pair), complete-link (furthest pair), average-link (average over all pairs), or centroid (centroid distance). Note: single-link and complete-link are sensitive to outliers.
Divisive: start with a single cluster and recursively divide clusters into 2 subclusters.
Partitioning: partition the data into K mutually exclusive, exhaustive groups (i.e. encode k = C(i)), then iteratively reallocate points to minimize some loss function. Finding the optimal partition is hard, so we use a greedy algorithm, K-means (coordinate descent); the loss function is non-convex, so we find local minima.
K-means: choose clusters at random, calculate the centroid of each cluster, reallocate objects to the nearest centroid, repeat. Works well for spherical, well-separated clusters of similar volume and count.
K-means++: initialize cluster centers one by one. With D(x) = distance of point x to the nearest existing center, Pr(x is the next cluster center) ∝ D(x)^2.
K-medians: works with an arbitrary distance/dissimilarity metric; the centers μ_k are represented by data points. It is more restrictive and thus has higher loss.
General loss: Σ_{n=1}^N Σ_{k=1}^K d(x_n, μ_k) r_nk, where r_nk = 1 if x_n is in cluster k, and 0 otherwise.
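A compact K-means sketch that follows the loop described above (random centers, assign to the nearest center, recompute centroids, repeat); the three-blob data are made up.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # random initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)                            # reallocate to nearest center
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])      # recompute centroids
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
centers, assign = kmeans(X, k=3)
print(np.round(centers, 2))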

Vector Quantization
Use clustering to find representative prototype vectors, which are used to simplify representations of signals.

Parametric Density Estimation
Mixture models: assume the PDF is made up of multiple Gaussians with different centers, P(x) = Σ_{i=1}^{n_c} P(c_i) P(x | c_i), with the log likelihood of the data as the objective function. Use EM to estimate this model:
E step: P(i | x_k) = P(μ_i) P(x_k | μ_i) / Σ_j P(μ_j) P(x_k | μ_j)
M step: P(c_i) = (1/n_e) Σ_{k=1}^{n_e} P(i | x_k)
        μ_i = Σ_k x_k P(i | x_k) / Σ_k P(i | x_k)
        σ_i^2 = Σ_k (x_k - μ_i)^2 P(i | x_k) / Σ_k P(i | x_k)
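A minimal 1-D, two-component EM sketch implementing the E and M steps above; the data and initialization are made up.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.8, 300), rng.normal(3, 1.2, 200)])

pi = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    r = pi * gauss(x[:, None], mu, var)          # E step: responsibilities P(i | x_k)
    r = r / r.sum(axis=1, keepdims=True)
    Nk = r.sum(axis=0)                           # M step: mixing weights, means, variances
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(np.round(pi, 2), np.round(mu, 2), np.round(var, 2))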

Non-parametric Density Estimation
Can use a histogram or kernel density estimation (KDE).
KDE: P(x) = (1/n) Σ_i K(x - x_i) is a function of the data. The kernel K has the following properties: symmetric, normalized (∫_{R^d} K(x) dx = 1), and lim_{||x|| -> ∞} ||x||^d K(x) = 0.
The bandwidth is the width of the kernel function. Too small => jagged results; too large => smoothed-out results.
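A short 1-D KDE sketch with a Gaussian kernel; the bimodal sample and the bandwidth value are made up, and varying h illustrates the jagged-versus-oversmoothed trade-off noted above.

import numpy as np

def kde(query, data, h=0.5):
    """Gaussian-kernel density estimate P(x) = (1/n) sum_i K_h(x - x_i), 1-D."""
    u = (query[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # symmetric kernel, integrates to 1
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-1, 0.5, 200), rng.normal(2, 0.7, 200)])
grid = np.linspace(-3, 4, 8)
print(np.round(kde(grid, data, h=0.3), 3))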

Principal Component Analysis
First run singular value decomposition on the pattern matrix X:
1. Subtract the mean from each point.
2. (Sometimes) scale each dimension by its variance.
3. Compute the covariance Σ = X^T X (must be symmetric).
4. Compute the eigenvectors/eigenvalues Σ = V S V^T (spectral theorem).
5. Get back X = (XV)V^T = U S V^T, the SVD of X (the singular values here are the square roots of the eigenvalues from step 4).
S contains the eigenvalues of the transformed features: the larger S_ii, the larger the variance of that feature. We want the k largest features, so we find the indices of the k largest entries in S and keep only these entries in U and V.
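A compact sketch of PCA via the SVD, following the steps above on made-up correlated 2-D data and keeping the single direction of largest variance.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

Xc = X - X.mean(axis=0)                  # 1. subtract the mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# Rows of Vt are the principal directions; S**2 / n are the variances of the
# transformed features (the eigenvalues of the covariance).
k = 1
Z = Xc @ Vt.T[:, :k]                     # keep the k directions of largest variance
X_hat = Z @ Vt[:k, :] + X.mean(axis=0)   # rank-k reconstruction
print(S ** 2 / len(X), np.mean((X - X_hat) ** 2))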


CS 189 ALL OF IT Che Yeon, Chloe, Dhruv, Li, Sean



Past Exam Questions


Spring 2013 Midterm
(a) False: In SVMs, we maximize ||w||_2 / 2 subject to the margin constraints.
(b) False: In kernelized SVMs, the kernel matrix K has to be positive definite.
(c) True: If two random variables are independent, then they have to be uncorrelated.
(d) False: Isocontours of Gaussian distributions have axes whose lengths are proportional to the eigenvalues of the covariance matrix.
(e) True: The RBF kernel K(x_i, x_j) = exp(-γ ||x_i - x_j||^2) corresponds to an infinite dimensional mapping of the feature vectors.
(f) True: If (X,Y ) are jointly Gaussian, then X and Y are also
Gaussian distributed.
(g) True: A function f(x,y,z) is convex if the Hessian of f is
positive semi-definite.
(h) True: In a least-squares linear regression problem, adding an
L2 regularization penalty cannot decrease the L2 error of the
solution w on the training data.
(i) True: In linear SVMs, the optimal weight vector w is a linear
combination of training data points.
(j) False: In stochastic gradient descent, we take steps in the
exact direction of the gradient vector.
(k) False: In a two class problem when the class conditionals
P [x | y = 0] and P [x | y = 1] are modeled as Gaussians with
different covariance matrices, the posterior probabilities turn
out to be logistic functions.
(l) True: The perceptron training procedure is guaranteed to
converge if the two classes are linearly separable.
(m) False: The maximum likelihood estimate for the variance of a
univariate Gaussian is unbiased.
(n) True: In linear regression, using an L1 regularization penalty
term results in sparser solutions than using an L2
regularization penalty term.

Spring 2013 Final


(a) True: Solving a non linear separation problem with a hard
margin Kernelized SVM (Gaussian RBF Kernel) might lead to
overfitting.
(b) True: In SVMs, the sum of the Lagrange multipliers
corresponding to the positive examples is equal to the sum of
the Lagrange multipliers corresponding to the negative
examples.
(c) False: SVMs directly give us the posterior probabilities
P(y = 1 | x) and P(y = -1 | x).
(d) False: V(X) = E[X]^2 - E[X^2]
(e) True: In the discriminative approach to solving classification
problems, we model the conditional probability of the labels
given the observations.
(f) False: In a two class classification problem, a point on the
Bayes optimal decision boundary x* always satisfies
P [y = 1 | x] = P [y = 0 | x].
(g) True: Any linear combination of the components of a
multivariate Gaussian is a univariate Gaussian.

(h) False: For any two random variables X ~ N(μ_1, σ_1^2) and
Y ~ N(μ_2, σ_2^2), X + Y ~ N(μ_1 + μ_2, σ_1^2 + σ_2^2).


(i) False: For a logistic regression problem, differing initialization points can lead to a much better optimum.
(j) False: In logistic regression, we model the odds ratio p/(1 - p) as a linear function.
(k) True: Random forests can be used to classify infinite dimensional data.
(l) False: In boosting we start with a Gaussian weight distribution over the training samples.
(m) False: In AdaBoost, the error of each hypothesis is calculated by the ratio of misclassified examples to the total number of examples.
(n) True: When k = 1 and N -> ∞, the kNN classification error rate is bounded above by twice the Bayes error rate.
(o) True: A single layer neural network with a sigmoid activation for binary classification with the cross entropy loss is exactly equivalent to logistic regression.
(p) True: Convolution is a linear operation, i.e. (a f_1 + b f_2) * g = a f_1 * g + b f_2 * g.
(q) True: The k-means algorithm does coordinate descent on a non-convex objective function.
(r) True: A 1-NN classifier has higher variance than a 3-NN classifier.
(s) False: The single link agglomerative clustering algorithm groups two clusters on the basis of the maximum distance between points in the two clusters.
(t) False: The largest eigenvector of the covariance matrix is the direction of minimum variance in the data.
(u) False: The eigenvectors of AA^T and A^T A are the same.
(v) True: The non-zero eigenvalues of AA^T and A^T A are the same.

(a) In linear regression, the irreducible error is σ^2, and it equals E[(y - E(y | x))^2].

(b) Let S1 and S2 be the support vectors for w1 (hard margin) and w2 (soft margin). Then S1 may not be a subset of S2, and w1 may not be equal to w2.
(c) Ordinary least squares regression assumes each data point is generated according to a linear function of the input plus N(0, σ) noise. In many systems, the noise variance is a positive linear function of the input. In this case, the probability model that describes this situation is
P(y|x) = (1/(σ sqrt(2πx))) exp(-(y - (w_0 + w_1 x))^2 / (2xσ^2)).
(d) Averaging the outputs of multiple decision trees helps reduce variance.
(e) The following loss functions are convex: logistic, hinge, exponential. Misclassification loss is not.
(f) Bias will be larger and variance will be smaller for trees of smaller depth.
(g) If making a tree with k-ary splits, the algorithm will prefer high values of k, and there will be k - 1 thresholds for a k-ary split.

Spring 2014 Final
(a) False: The singular value decomposition of a real matrix is unique.
(b) True: A multiple-layer neural network with linear activation functions is equivalent to one single-layer perceptron that uses the same error function on the output layer and has the same number of inputs.
(c) False: The maximum likelihood estimator for the parameter θ of a uniform distribution over [0, θ] is unbiased.
(d) True: The k-means algorithm for clustering is guaranteed to converge to a local optimum.
(e) True: Increasing the depth of a decision tree cannot increase its training error.
(f) False: There exists a one-to-one feature mapping φ for every valid kernel k.
(g) True: For high-dimensional data, k-d trees can be slower than brute force nearest neighbor search.
(h) True: If we had infinite data and infinitely fast computers, kNN would be the only algorithm we would study in CS 189.
(i) True: For datasets with high label noise (many data points with incorrect labels), random forests would generally perform better than boosted decision trees.

(a) In Homework 4, you fit a logistic regression model on spam and ham data for a Kaggle competition. Assume you had a very good score on the public test set, but when the GSIs ran your model on a private test set, your score dropped a lot. This is likely because you overfitted by submitting multiple times and changing the following between submissions: λ, your penalty term; ε, your convergence criterion; your step size; fixing a random bug.
(b) Given d-dimensional data {x_i}_{i=1}^N, you run principal component analysis and pick P principal components. Can you always reconstruct any data point x_i for i from 1 to N from the P principal components with zero reconstruction error? Yes, if P = d.
(c) Putting a standard Gaussian prior on the weights for linear regression (w ~ N(0, I)) will result in what type of posterior distribution on the weights? Gaussian.
(d) Suppose we have N instances of d-dimensional data. Let h be the amount of data storage necessary for a histogram with a fixed number of ticks per axis, and let k be the amount of data storage necessary for kernel density estimation. Which of the following is true about h and k? h grows exponentially with d, and k grows linearly with N.
(e) John just trained a decision tree for digit recognition. He notices an extremely low training error, but an abnormally large test error. He also notices that an SVM with a linear kernel performs much better than his tree. What could be the cause of his problem? The decision tree is too deep; the decision tree is overfitting.
(f) John has now switched to multilayer neural networks and notices that the training error goes down and converges to a local minimum. Then when he tests on new data, the test error is abnormally high. What is probably going wrong, and what do you recommend he do? The training data size is not large enough, so collect more training data and retrain; play with the learning rate and add a regularization term to the objective function; use a different initialization, train the network several times, and use the average of the predictions from all nets to predict the test data; use the same training data but use fewer hidden layers.

Discussion 9 Entropy

Maximum Entropy Distribution. Suppose we have a discrete random variable with a Categorical distribution described by the parameters p_1, p_2, ..., p_d. Recall that the entropy of a discrete random variable is H(X) = E[-log p(X)] = -Σ_{i=1}^d p_i log p_i. Find the distribution (the values of the p_i) that maximizes entropy. (Hint: remember that Σ_{i=1}^d p_i = 1; don't forget to include that in the optimization as a constraint!)

Solution: For simplicity, assume that log has base e, i.e. log = ln (the solution is the same no matter what base we assume). The optimization problem we are trying to solve is
argmin_p Σ_{i=1}^d p_i log p_i   s.t.   Σ_{i=1}^d p_i = 1
(minimizing the negative entropy). Formulating the Lagrangian, we get
L(p, λ) = Σ_{i=1}^d p_i log p_i + λ (Σ_{i=1}^d p_i - 1)
Taking the derivative with respect to p_i and setting it to zero:
∂L(p, λ)/∂p_i = log p_i + 1 + λ = 0
This says that log p_i = log p_j for all i, j, which implies that p_i = p_j for all i, j. Combining this with the constraint, we get p_i = 1/d, which is the uniform distribution.

Modifying neural networks for fun and profit

(a) How could we modify a neural network to perform regression instead of classification?
Solution: Change the output function of the final layer to be a linear function rather than the normal non-linear function.
Now consider a neural network with the addition that the input layer is also fully connected to the output layer. This type of neural network is also called a skip-layer network.
(b) How many weights would this model require? (Let d_0 be the dimensionality of the input vector, and d_1, ..., d_L the number of nodes in the L following layers. Don't worry about the bias term. You may want to try drawing out the NN.)
Solution: Σ_{i=0}^{L-1} d_i d_{i+1} + d_0 d_L
(c) What sort of problems could this sort of neural network introduce? How do we compensate for these problems?
Solution: We've increased the number of weights (parameters), so now we have a much higher chance of overfitting the data. We could fix this by reducing the number of nodes at each layer and/or reducing the number of layers to accommodate the increased connectivity.


Discussion 11 Skip-Layer NN

(d) Consider the simplest skip-layer neural network (a single input x, a single hidden unit h, and a single output y; the original figure is omitted here). The weights are w = [w_xh, w_hy, w_xy]^T. Given some non-linear function g, calculate ∇_w y. Don't forget to use the s's.

Solution: The output y is given by the function
y = g(s_y) = g(w_hy h + w_xy x) = g(w_hy g(s_h) + w_xy x) = g(w_hy g(w_xh x) + w_xy x)
To calculate ∇_w y we need all the partial derivatives ∂y/∂w_hy, ∂y/∂w_xy, ∂y/∂w_xh. We'll start with the ones closest to the output:
∂y/∂w_hy = (∂y/∂s_y)(∂s_y/∂w_hy) = g'(s_y) h
∂y/∂w_xy = (∂y/∂s_y)(∂s_y/∂w_xy) = g'(s_y) x
∂y/∂w_xh = (∂y/∂s_y)(∂s_y/∂w_xh) = g'(s_y) ∂(w_hy h + w_xy x)/∂w_xh = g'(s_y) w_hy g'(s_h) ∂s_h/∂w_xh = g'(s_y) w_hy g'(s_h) x
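A quick numerical check of the three partial derivatives derived above, assuming g = tanh as the non-linearity (the discussion leaves g generic) and made-up weight values.

import numpy as np

g = np.tanh                                   # one choice of non-linearity
gp = lambda s: 1.0 - np.tanh(s) ** 2          # g'(s) for that choice

def forward(w, x):
    wxh, why, wxy = w
    sh = wxh * x
    sy = why * g(sh) + wxy * x
    return g(sy), sh, sy

w = np.array([0.7, -1.3, 0.4])                # [w_xh, w_hy, w_xy], made-up values
x = 1.5
y, sh, sy = forward(w, x)

# Analytic partials from the discussion: dy/dw_xh, dy/dw_hy, dy/dw_xy.
dy_dwhy = gp(sy) * g(sh)
dy_dwxy = gp(sy) * x
dy_dwxh = gp(sy) * w[1] * gp(sh) * x
analytic = np.array([dy_dwxh, dy_dwhy, dy_dwxy])

eps = 1e-6
numeric = np.array([(forward(w + eps * np.eye(3)[i], x)[0] -
                     forward(w - eps * np.eye(3)[i], x)[0]) / (2 * eps) for i in range(3)])
print(np.max(np.abs(analytic - numeric)))     # should be close to 0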

Discussion 12 Derivation of PCA

In this question we will derive PCA. PCA aims to find the direction of maximum variance among a dataset: you want the line such that projecting your data onto this line retains the maximum amount of information. Thus, the optimization problem is
max_{u: ||u||_2 = 1} (1/n) Σ_{i=1}^n (u^T x_i - u^T x̄)^2
where n is the number of data points and x̄ is the sample average of the data points.
(a) Show that this optimization problem can be massaged into the format
max_{u: ||u||_2 = 1} u^T Σ u
where Σ = (1/n) Σ_{i=1}^n (x_i - x̄)(x_i - x̄)^T.

Solution: We can massage the objective function (let's call it f_0(u)) in this way:
f_0(u) = (1/n) Σ_{i=1}^n (u^T x_i - u^T x̄)^2
       = (1/n) Σ_{i=1}^n (u^T (x_i - x̄))^2
       = (1/n) Σ_{i=1}^n u^T (x_i - x̄)(x_i - x̄)^T u
       = u^T [ (1/n) Σ_{i=1}^n (x_i - x̄)(x_i - x̄)^T ] u
       = u^T Σ u
Discussion Problems
(a) You can use kernels with both the SVM and the perceptron.
(b) Cross validation is used to select hyperparameters. It helps prevent overfitting, but is not guaranteed to prevent it.
(c) L2 regularization is equivalent to imposing a Gaussian prior in linear regression.
(d) If we have 2 two-dimensional Gaussians, the same covariance matrix for both will result in a linear decision boundary.
(e) The normal equations can be derived from minimizing empirical risk, assuming normally distributed noise, and assuming P(Y | X) is distributed normally with mean β^T x and variance σ^2.
(f) Logistic regression can be motivated from the log odds equated to an affine function of x, and from generative models with Gaussian class conditionals.
(g) The perceptron algorithm will converge only if the data is linearly separable.
(h) True: Newton's method is typically more expensive to calculate than gradient descent per iteration.
True: For quadratic objectives, Newton's method typically requires fewer iterations than gradient descent.
False: Gradient descent can be viewed as iteratively reweighted least squares.
(i) True: Complementary slackness implies that every training point that is misclassified by a soft margin SVM is a support vector.
True: When we solve the SVM with the dual problem, we need only the dot product of x_i and x_j for all i, j.
True: We use Lagrange multipliers in an optimization problem with inequality constraints.
(j) ||Φ(x) - Φ(y)||_2^2 can be computed exclusively with inner products, but not ||Φ(x) - Φ(y)||_1 or Φ(x) - Φ(y) itself.
(k) Strong duality holds for hard and soft margin SVMs, but not for constrained optimization problems in general.


Spring 2015 Midterm


(a) True: If the data is not linearly separable, there is no solution to the hard margin SVM.
(b) True: logistic regression can be used for classification.
(c) False: Two ways to prevent beta vectors from getting too large
are to use a small step size and use a small regularization value
(d) False: The L2 norm is often used because it produces sparse
results, as opposed to the L1 norm which does not
(e) False: For multivariate gaussian, the eigenvalues of the
covariance matrix are inversely proportional to the lengths of
the ellipsoid axes that determine the isocontours of the density.
(f) True: In a generative binary classification model where we
assume the class conditionals are distributed as poisson and
the class priors are bernoulli, the posterior assumes a logistic
form.
(g) False: MLE gives us not only a point estimate, but a
distribution over the parameters we are estimating.
(h) False: Penalized MLE and bayesian estimators for parameters
are better used in the setting of low-dimensional data with
many training examples
(i) True: It is not good machine learning practice to use the test
set to help adjust the hyperparameters
(j) False: a symmetric positive semidefinite matrix always has
nonnegative elements.
(k) True: for a valid kernel function k, the corresponding feature
mapping can map a finite dimensional vector to an infinite
dimensional vector
(l) False: the more features we use, the better our learning
algorithm will generalize to new data points.
(m) True: a discriminative classifier explicitly models P (Y | X).

(b) Show that the maximizer for this problem is equal to v_1, where v_1 is the eigenvector corresponding to the largest eigenvalue λ_1. Also show that the optimal value of this problem is equal to λ_1.

Solution: We start by invoking the spectral decomposition Σ = V Λ V^T, which exists because Σ is a symmetric positive semi-definite matrix:
max_{u: ||u||_2 = 1} u^T Σ u = max_{u: ||u||_2 = 1} u^T V Λ V^T u = max_{u: ||u||_2 = 1} (V^T u)^T Λ (V^T u)
Here is an aside: note through this one-line proof that left-multiplying a vector by an orthogonal (or rotation) matrix preserves the length of the vector:
||V^T u||_2 = sqrt((V^T u)^T (V^T u)) = sqrt(u^T V V^T u) = sqrt(u^T u) = ||u||_2
Define a new variable z = V^T u and maximize over it. Because V is invertible there is a one-to-one mapping between u and z, and the constraint is unchanged because the length of u does not change when it is multiplied by an orthogonal matrix:
max_{z: ||z||_2 = 1} z^T Λ z = max_{z: Σ_i z_i^2 = 1} Σ_{i=1}^d λ_i z_i^2
From this formulation, it is obvious that we can maximize the objective by throwing all of our eggs into one basket: set z_i = 1 if i is the index of the largest eigenvalue, and z_i = 0 otherwise. Thus
z = V^T u  =>  u = V z = v_1
where v_1 is the eigenvector corresponding to λ_1. Plugging this into the objective function, we see that the optimal value is λ_1.

3. Deriving the second principal component
(a) Let J(v_2, z_2) = (1/n) Σ_{i=1}^n (x_i - z_i1 v_1 - z_i2 v_2)^T (x_i - z_i1 v_1 - z_i2 v_2), given the constraints v_1^T v_2 = 0 and v_2^T v_2 = 1. Show that setting ∂J/∂z_2 = 0 yields z_i2 = v_2^T x_i.
(b) We have shown that z_i2 = v_2^T x_i, so the second principal encoding is obtained by projecting onto the second principal direction. Show that the value of v_2 that minimizes J is given by the eigenvector of C = (1/n) Σ_{i=1}^n x_i x_i^T with the second largest eigenvalue. Assume we have already proved that the optimal v_1 is the eigenvector of C with the largest eigenvalue.
Solution: (a) We have
J(v_2, z_2) = (1/n) Σ_{i=1}^n (x_i^T x_i - z_i1 x_i^T v_1 - z_i2 x_i^T v_2 - z_i1 v_1^T x_i + z_i1^2 v_1^T v_1 + z_i1 z_i2 v_1^T v_2 - z_i2 v_2^T x_i + z_i1 z_i2 v_2^T v_1 + z_i2^2 v_2^T v_2)
Taking the derivative with respect to z_i2:
∂J/∂z_i2 = (1/n)(-x_i^T v_2 + z_i1 v_1^T v_2 + z_i1 v_2^T v_1 - v_2^T x_i + 2 z_i2 v_2^T v_2) = (1/n)(-2 x_i^T v_2 + 2 z_i2 v_2^T v_2)
using the constraint v_1^T v_2 = 0. Setting the derivative to 0 gives z_i2 v_2^T v_2 = x_i^T v_2, and since v_2^T v_2 = 1 we have z_i2 = x_i^T v_2.
(b) Plugging z_i2 into J(v_2, z_2), we have
J(v_2) = (1/n) Σ_{i=1}^n (x_i^T x_i - z_i1 x_i^T v_1 - z_i2 x_i^T v_2 - z_i1 v_1^T x_i + z_i1^2 v_1^T v_1 - z_i2 v_2^T x_i + z_i2^2 v_2^T v_2)
       = (1/n) Σ_{i=1}^n (const - 2 z_i2 x_i^T v_2 + z_i2^2)
       = (1/n) Σ_{i=1}^n (-2 v_2^T x_i x_i^T v_2 + v_2^T x_i x_i^T v_2 + const)
       = -v_2^T C v_2 + const
In order to minimize J subject to the constraint v_2^T v_2 = 1, we form the Lagrangian L = -v_2^T C v_2 + λ(v_2^T v_2 - 1) and take the derivative with respect to v_2:
∂L/∂v_2 = -2 C v_2 + 2 λ v_2 = 0
Then we have C v_2 = λ v_2, so v_2 must be an eigenvector of C. Since v_2 must also be orthogonal to v_1 and minimizing J means maximizing v_2^T C v_2, the best remaining choice is the eigenvector of C with the second largest eigenvalue.