Bayes Rule: P(w|x) = P(x|w)P(w) / P(x),  where P(x) = Σ_i P(x|w_i)P(w_i)
P(x, w) = P(x|w)P(w) = P(w|x)P(x)
P(error) = ∫ P(error|x) P(x) dx
P(error|x) = P(w_1|x) if we decide w_2;  P(w_2|x) if we decide w_1
0-1 Loss: ℓ(α_i|w_j) = 0 if i = j (correct), 1 if i ≠ j (mismatch)
⇒ under 0-1 loss the conditional risk of deciding α_i is 1 − P(w_i|x), so choose the class with the largest posterior P(w_i|x).
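A minimal numeric sketch of the Bayes decision rule under 0-1 loss (Python with numpy; the two 1-D Gaussian class-conditionals, priors, and test point below are made up for illustration):

import numpy as np

def gaussian_pdf(x, mu, var):
    # 1-D Gaussian density used as the class-conditional P(x|w)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

priors = {1: 0.6, 2: 0.4}                    # P(w_i), made up
params = {1: (0.0, 1.0), 2: (2.0, 1.5)}      # (mu, var) for P(x|w_i), made up

x = 1.2
evidence = sum(gaussian_pdf(x, *params[w]) * priors[w] for w in priors)   # P(x)
posteriors = {w: gaussian_pdf(x, *params[w]) * priors[w] / evidence for w in priors}
decision = max(posteriors, key=posteriors.get)   # argmax_w P(w|x) minimizes P(error|x)
print(posteriors, decision)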
Lagrangian (SVM primal): ∂L_p/∂w = 0 ⇒ w = Σ_i α_i y^(i) x^(i)
Multivariate Gaussian: X ~ N(μ, Σ). Σ is PSD ⇒ x^T Σ x ≥ 0 for all x.
Thus the level curves form ellipsoids with axis lengths proportional to the square roots of the eigenvalues of the covariance matrix.
Loss Functions
Modified Huber (squared hinge) loss = −4 y f(x) if y f(x) < −1, [1 − y f(x)]² otherwise;
minimizing function f(x) = 2 P[Y = +1 | x] − 1
Newton's method: θ_{t+1} = θ_t − [∇²_θ f(θ_t)]^{-1} ∇_θ f(θ_t)
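Reading the update above as Newton's method, here is a small sketch of the iteration θ_{t+1} = θ_t − [∇²f(θ_t)]^{-1} ∇f(θ_t) on a made-up smooth objective (numpy assumed):

import numpy as np

# Objective f(t) = (t0 - 1)^2 + 2*(t1 + 0.5)^4, chosen only for illustration.
def grad(t):
    return np.array([2 * (t[0] - 1), 8 * (t[1] + 0.5) ** 3])

def hess(t):
    return np.array([[2.0, 0.0], [0.0, 24 * (t[1] + 0.5) ** 2]])

theta = np.array([4.0, 2.0])
for _ in range(20):
    theta = theta - np.linalg.solve(hess(theta), grad(theta))   # Newton step
print(theta)   # approaches the minimizer (1, -0.5)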
Gradients
Jacobian: for y ∈ R^m and x ∈ R^n,
∂y/∂x = [ ∂y_1/∂x_1 … ∂y_m/∂x_1 ; … ; ∂y_1/∂x_n … ∂y_m/∂x_n ]  (an n × m matrix)
∂(x^T x)/∂x = 2x,   ∂(x^T A x)/∂x = (A + A^T) x
∂(x^T A)/∂x = A,    ∂(Ax)/∂x = A^T
∂ tr(BA)/∂A = B^T
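A quick finite-difference check of the identity ∂(x^T A x)/∂x = (A + A^T)x on an arbitrary random matrix and point (numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

f = lambda v: v @ A @ v
eps = 1e-6
# central differences along each coordinate direction
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(4)])
analytic = (A + A.T) @ x
print(np.allclose(numeric, analytic, atol=1e-4))   # True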
Sigmoid algebra: h_θ(x) = 1 / (1 + e^{−θ^T x}) = e^{θ^T x} / (1 + e^{θ^T x}),  and  1 − h_θ(x) = 1 / (1 + e^{θ^T x})
Other Classifiers
KNN: as N → ∞ and K → ∞ with K/N → 0, the KNN error rate ε_knn converges to the Bayes error ε*.
Curse of dimensionality: As the number of dimensions increases, everything becomes farther apart, and our low-dimensional intuition falls apart. Consider the hypersphere/hypercube volume ratio: it's close to zero at d = 10. How to deal with this curse:
1. Get more data to fill all of that empty space.
2. Get better features, reducing the dimensionality and packing the data closer together. Ex: bag-of-words, histograms, ...
3. Use a better distance metric.
Minkowski: Dis_p(x, y) = (Σ_{i=1}^d |x_i − y_i|^p)^{1/p} = ||x − y||_p
0-norm: Dis_0(x, y) = Σ_{i=1}^d I[x_i ≠ y_i]
Mahalanobis: Dis_M(x, y | Σ) = sqrt((x − y)^T Σ^{-1} (x − y))
In high dimensions we get hubs, s.t. most points identify the hubs as their NN. These hubs are usually near the means (ex: dull gray images, sky and clouds). To avoid having everything classified as these hubs, we can use cosine similarity.
K-d trees increase the efficiency of nearest neighbor lookup.
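A minimal 1-NN sketch contrasting Euclidean distance with cosine similarity, as suggested above for mitigating hubs (toy data, numpy assumed):

import numpy as np

def nn_predict(x, X, y, metric="euclidean"):
    if metric == "euclidean":
        scores = -np.linalg.norm(X - x, axis=1)     # higher score = closer point
    else:  # cosine similarity
        scores = (X @ x) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x))
    return y[np.argmax(scores)]                     # label of nearest neighbor

X_train = np.array([[1.0, 0.1], [0.2, 1.0], [2.0, 0.3]])
y_train = np.array([0, 1, 0])
x_test = np.array([1.5, 0.2])
print(nn_predict(x_test, X_train, y_train),
      nn_predict(x_test, X_train, y_train, metric="cosine"))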
Decision Trees
Given a set of points and classes {x_i, y_i}_{i=1}^n, test features X_j and branch on the feature which best separates the data. Recursively split on the new subsets of data. Growing the tree to max depth tends to overfit (training data gets cut quickly ⇒ subtrees train on small sets). Mistakes high up in the tree propagate to the corresponding subtrees. To reduce overfitting, we can prune using a validation set, and we can limit the depth.
DTs are prone to label noise. Building the optimal tree is hard.
Information gain: IG(D, X_j) = H(D) − Σ_{x_j ∈ X_j} P(X_j = x_j) H(D | X_j = x_j); split on the feature with the largest gain.
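A small sketch of the entropy / information-gain computation used to choose a split (binary feature, made-up labels, log base 2):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    ig = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        ig -= mask.mean() * entropy(labels[mask])   # weighted conditional entropy
    return ig

y = np.array([1, 1, 0, 0, 1, 0])
x_j = np.array([1, 1, 0, 0, 1, 1])
print(information_gain(x_j, y))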
Logistic regression: l(θ) = Σ_{i=1}^m y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i)))
∇_θ l(θ) = Σ_{i=1}^m (y^(i) − h_θ(x^(i))) x^(i) = X^T (y − h_θ(X))
Stochastic: θ_{t+1} = θ_t + α (y^(j)_t − h_θ(x^(j)_t)) x^(j)_t
Batch: θ_{t+1} = θ_t + α X^T (y − h_θ(X))
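A batch gradient-ascent sketch for the logistic-regression updates above (synthetic data; the learning rate and iteration count are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # made-up linearly separable labels

theta, alpha = np.zeros(2), 0.1
for _ in range(200):
    theta = theta + alpha * X.T @ (y - sigmoid(X @ theta))   # theta += alpha * X^T (y - h_theta(X))
print(theta)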
Problem: DTs are unstable: small changes in the input data have a large effect on the tree structure ⇒ DTs are high-variance estimators.
Solution: Random Forests train M different trees with randomly
sampled subsets of the data (called bagging), and sometimes with
randomly sampled subsets of the features to de-correlate the trees. A
new point is tested on all M trees and we take the majority as our
output class (for regression we take the average of the output).
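A bagging sketch along these lines, training M trees on bootstrap samples and taking a majority vote; it assumes scikit-learn's DecisionTreeClassifier is available, which is not part of these notes:

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # assumption: scikit-learn installed

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # made-up labels

M, trees = 25, []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample of the data
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])   # M x n predictions
majority = (votes.mean(axis=0) > 0.5).astype(int) # majority class per point
print((majority == y).mean())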
Boosting
AdaBoost (sketch): start with uniform weights over the training samples; each round fit a weak learner, compute its weighted error, increase the weights of misclassified points, and output a weighted vote of the weak learners.
Logistic regression likelihood: L(θ) = Π_{i=1}^m (h_θ(x^(i)))^{y^(i)} (1 − h_θ(x^(i)))^{1−y^(i)}; taking logs gives l(θ) above, with ∇_θ l = Σ_i (y^(i) − h_θ(x^(i))) x^(i).
For QDA, the model is the same as LDA except that each class has a unique covariance matrix Σ_k:
h(x) = arg max_k −½ log|Σ_k| − ½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) + log(π_k)
where π_k = p(y = k).
Sigmoid derivative: ∂h_θ(x)/∂θ = h_θ(x)(1 − h_θ(x)) x
Notice the covariance matrix is the same for all classes in LDA.
If p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) is a logistic function. The converse is NOT true: LDA makes stronger assumptions about the data than logistic regression does.
h(x) = arg max_k −½ (x − μ_k)^T Σ^{-1} (x − μ_k) + log(π_k)
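A small numeric sketch of the LDA rule above, with made-up class means, shared covariance, and priors:

import numpy as np

mus = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}   # class means (made up)
pis = {0: 0.5, 1: 0.5}                                      # class priors (made up)
S_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 1.5]]))   # shared covariance inverse

def lda_predict(x):
    # h(x) = argmax_k -1/2 (x - mu_k)^T S^{-1} (x - mu_k) + log(pi_k)
    scores = {k: -0.5 * (x - mus[k]) @ S_inv @ (x - mus[k]) + np.log(pis[k]) for k in mus}
    return max(scores, key=scores.get)

print(lda_predict(np.array([1.8, 0.5])))   # -> 1 (closer to class-1 mean)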
Regression
Optimization
Logistic Regression
Binomial deviance = log(1 + e^{−y f(x)}); minimizing function f(x) = log( P[Y = +1 | x] / P[Y = −1 | x] )
SVM hinge loss = [1 − y f(x)]_+
In general the loss function consists of two parts, the loss term and the regularization term: J(w) = Σ_i Loss_i + λ R(w)
Level curves of the Gaussian: λ_1^{-1}(u_1^T x)² + ⋯ + λ_n^{-1}(u_n^T x)² = c, where (u_i, λ_i) are the eigenvector/eigenvalue pairs of Σ; the axis along u_1 has length ∝ √λ_1, ..., the axis along u_n has length ∝ √λ_n.
Lagrangian: L(x, λ) = f_0(x) + Σ_{i=1}^m λ_i f_i(x)
The dual for the non-separable case doesn't change much, except that each α_i now has an upper bound of C ⇒ 0 ≤ α_i ≤ C.
Multivariate Gaussian density: f(x; μ, Σ) = exp( −½ (x − μ)^T Σ^{-1} (x − μ) ) / ( (2π)^{n/2} |Σ|^{1/2} )
Σ_MLE = (1/m) Σ_{i=1}^m (x^(i) − μ)(x^(i) − μ)^T
Complementary slackness: y^(n)(w^T x^(n) + b) − 1 = 0 where α_n > 0 (such points are the support vectors).
Soft-margin SVM: min_{w,b,ξ} ½||w||² + C Σ_{i=1}^m ξ_i  s.t.  y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0
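A subgradient-descent sketch of this soft-margin objective on toy data (the step size, iteration count, and C are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)   # made-up +/-1 labels

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w + b)
    viol = margins < 1                                 # points inside the margin or misclassified
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)   # subgradient of 1/2||w||^2 + C*hinge
    grad_b = -C * y[viol].sum()
    w, b = w - lr * grad_w, b - lr * grad_b
print(w, b, (np.sign(X @ w + b) == y).mean())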
Cross Entropy Loss: −Σ_{i=1}^{n_out} [ y_i log(h_θ(x)_i) + (1 − y_i) log(1 − h_θ(x)_i) ]
GDA: maximizing l(φ, μ_0, μ_1, Σ) = log Π_{i=1}^m p(x^(i)|y^(i); μ_0, μ_1, Σ) p(y^(i); φ) gives us
φ_MLE = (1/m) Σ_{i=1}^m 1{y^(i) = 1},   μ_k,MLE = avg of the x^(i) classified as k.
SVM dual: L_d(α) = Σ_{i=1}^m α_i − ½ Σ_{i=1}^m Σ_{j=1}^m y^(i) y^(j) α_i α_j (x^(i))^T x^(j)
Normal equations: X^T Xθ − X^T y = 0 ⇒ θ = (X^T X)^{-1} X^T y
Covariance: Σ = E[(X − μ)(X − μ)^T] = E[XX^T] − μμ^T
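A quick normal-equations sketch, solving X^T Xθ = X^T y on synthetic data (np.linalg.solve avoids forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)   # made-up linear data + noise

theta = np.linalg.solve(X.T @ X, X.T @ y)   # theta = (X^T X)^{-1} X^T y
print(theta)                                # close to [1, -2, 0.5]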
Dual: maximize L_d(α) s.t. α_i ≥ 0 and Σ_i α_i y^(i) = 0.
Gradient Descent: θ_{t+1} = θ_t + α (y^(i)_t − h_θ(x^(i)_t)) x^(i)_t, with h_θ(x) = θ^T x.
Primal Lagrangian: L_p = ½||w||² − Σ_{i=1}^m α_i (y^(i)(w^T x^(i) + b) − 1), with constraints y^(i)(w^T x^(i) + b) ≥ 1, i = 1, ..., m.
∂L_p/∂b = 0 ⇒ Σ_i α_i y^(i) = 0
Error Functions:
Neural Networks
Neural Nets explore what you can do by combining perceptrons, each of which is a simple linear classifier. We use a soft threshold for each activation function θ because it is twice differentiable.
Mean Squared Error: Σ_{i=1}^{n_out} (y_i − ŷ_i)²
Notation:
1. w_ij^(l) is the weight from neuron i in layer l−1 to neuron j in layer l. There are d^(l) nodes in the l-th layer.
2. There are L layers, where layer L is the output layer and the data is the 0th layer.
3. x_j^(l) = θ(s_j^(l)) is the output of a neuron: the activation function applied to the input signal s_j^(l) = Σ_i w_ij^(l) x_i^(l−1).
4. e(w) is the error as a function of the weights.
The goal is to learn the weights w_ij^(l). We use gradient descent; the chain rule gives
∂e(w)/∂w_ij^(l) = (∂e(w)/∂s_j^(l)) (∂s_j^(l)/∂w_ij^(l)) = δ_j^(l) x_i^(l−1)
Final Layer: δ_j^(L) = ∂e(w)/∂s_j^(L) = (∂e(w)/∂x_j^(L)) (∂x_j^(L)/∂s_j^(L)) = e'(x_j^(L)) θ'_out(s_j^(L))
General:
δ_i^(l−1) = ∂e(w)/∂s_i^(l−1) = Σ_{j=1}^{d^(l)} (∂e(w)/∂s_j^(l)) (∂s_j^(l)/∂x_i^(l−1)) (∂x_i^(l−1)/∂s_i^(l−1))
          = Σ_{j=1}^{d^(l)} δ_j^(l) w_ij^(l) θ'(s_i^(l−1))
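A compact numpy sketch of these backprop equations for one tanh hidden layer, a linear output, and squared error (toy data; layer sizes and learning rate are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3)); Y = rng.normal(size=(32, 1))      # made-up regression data
W1 = 0.1 * rng.normal(size=(3, 5)); W2 = 0.1 * rng.normal(size=(5, 1))

for _ in range(100):
    S1 = X @ W1; X1 = np.tanh(S1)            # s^(1), x^(1)
    out = X1 @ W2                            # linear output layer, theta'_out = 1
    delta2 = 2 * (out - Y)                   # final-layer delta: e'(x^(L)) * theta'_out
    delta1 = (delta2 @ W2.T) * (1 - X1 ** 2) # general rule: sum_j delta_j w_ij * theta'(s^(l-1))
    W2 -= 0.01 * X1.T @ delta2 / len(X)      # dE/dw = delta_j * x_i^(l-1), averaged over the batch
    W1 -= 0.01 * X.T @ delta1 / len(X)
print(np.mean((np.tanh(X @ W1) @ W2 - Y) ** 2))   # final training MSE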
Unsupervised Learning
Clustering
Unsupervised learning (no labels).
Distance functions: to merge or split clusters we need a distance between two sets of points.
Hierarchical:
Agglomerative: Start with n points, merge 2 closest clusters
using some measure, such as: Single-link (closest pair),
Complete-link (furthest pair), Average-link (average of all
pairs), Centroid (centroid distance).
Note: SL and CL are sensitive to outliers.
Divisive: Start with single cluster, recursively divide clusters
into 2 subclusters.
Partitioning: Partition the data into K mutually exclusive, exhaustive groups (i.e. encode k = C(i)). Iteratively reallocate points to minimize some loss function. Finding the best partition is hard, so use a greedy algorithm called K-means (coordinate descent). The loss function is non-convex, thus we find local minima.
K-means: Choose clusters at random, calculate the centroid of each cluster, reallocate objects to the nearest centroid, repeat. Works with: spherical, well-separated clusters of similar volumes and counts.
K-means++: Initialize cluster centers one by one. D(x) = distance of point x to the nearest existing center; Pr(x is chosen as the new cluster center) ∝ D(x)².
General Loss: Σ_{n=1}^N Σ_{k=1}^K r_nk d(x_n, μ_k), where r_nk = 1 if x_n is in cluster k, and 0 o.w.
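A short numpy sketch of the K-means loop described above (made-up 2-D data, K = 2, random initialization from the data):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # two made-up blobs
K = 2
centers = X[rng.choice(len(X), K, replace=False)]

for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)     # n x K distances
    assign = d.argmin(axis=1)                                           # reallocate to nearest centroid
    centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
                        for k in range(K)])                             # recompute centroids
print(centers)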
Vector Quantization
Use clustering to find representative prototype vectors, which are
used to simplify representations of signals.
Activation Functions:
θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s}) ⇒ θ'(s) = 1 − θ²(s)
θ(s) = σ(s) = 1 / (1 + e^{−s}) ⇒ θ'(s) = σ(s)(1 − σ(s))
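A finite-difference check of the two derivative formulas above (numpy assumed):

import numpy as np

sigma = lambda s: 1.0 / (1.0 + np.exp(-s))
s, eps = 0.7, 1e-6
print(np.isclose((sigma(s + eps) - sigma(s - eps)) / (2 * eps), sigma(s) * (1 - sigma(s))))
print(np.isclose((np.tanh(s + eps) - np.tanh(s - eps)) / (2 * eps), 1 - np.tanh(s) ** 2))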
2. Maximum Entropy Distribution
Suppose we have a discrete random variable that has a Categorical distribution described by the parameters p_1, p_2, ..., p_d. Recall that the entropy of a discrete random variable is H(p) = −Σ_{i=1}^d p_i log p_i.
False: In SVMs, we maximize ||w||²/2 subject to the margin constraints (it is minimized).
False: In kernelized SVMs, the kernel matrix K has to be positive definite.
True: If two random variables are independent, then they have
to be uncorrelated.
False: Isocontours of Gaussian distributions have axes whose
lengths are proportional to the eigenvalues of the covariance
matrix.
True: The RBF kernel K(x_i, x_j) = exp(−γ ||x_i − x_j||²) ... a linear function.
True: Random forests can be used to classify infinite
dimensional data.
False: In boosting we start with a Gaussian weight
distribution over the training samples.
False: In Adaboost, the error of each hypothesis is calculated
by the ratio of misclassified examples to the total number of
examples.
True: When k = 1 and N → ∞, the kNN classification error rate is bounded above by twice the Bayes error rate.
True: A single layer neural network with a sigmoid activation
for binary classification with the cross entropy loss is exactly
equivalent to logistic regression.
True: Convolution is a linear operation, i.e. (a f_1 + b f_2) * g = a (f_1 * g) + b (f_2 * g).
True: The k-means algorithm does coordinate descent on a
non-convex objective function.
True: A 1-NN classifier has higher variance than a 3-NN
classifier.
False: The single link agglomerative clustering algorithm
groups two clusters on the basis of the maximum distance
between points in the two clusters.
False: The largest eigenvector of the covariance matrix is the
direction of minimum variance in the data.
False: The eigenvectors of AA^T and A^T A are the same.
True: The non-zero eigenvalues of AA^T and A^T A are the same.
E[(y − E(y | x))²]
Discussion 9: Entropy
(a) How could we modify a neural network to perform regression instead of classification?
Solution: Change the output function of the final layer to be a linear function rather than the normal non-linear function.
For the maximum-entropy problem, assume for simplicity that log has base e, i.e. log = ln (the solution is the same no matter what base we assume). The optimization problem we are trying to solve is:
argmin_p Σ_{i=1}^d p_i log p_i   s.t.   Σ_{i=1}^d p_i = 1
Consider a neural network with the addition that the input layer is also fully connected to the output layer. This type of neural network is also called a skip-layer network.
Lagrangian: L(p, λ) = Σ_{i=1}^d p_i log p_i + λ(Σ_{i=1}^d p_i − 1)
∂L/∂p_i = log p_i + 1 + λ = 0 ⇒ log p_i = −1 − λ, the same for every i, so with Σ_i p_i = 1 we get p_i = 1/d: the uniform distribution maximizes entropy.
Solution (skip-layer network): the number of weights is Σ_{l=0}^{L−1} d^(l) d^(l+1) + d^(0) d^(L). The extra input-to-output connections give the network a much higher chance of overfitting the data. We could fix this by reducing the number of nodes at each layer and/or reducing the number of layers.
Given some non-linear function g, calculate ∇_w y for the skip-layer network with hidden unit h = g(s_h), s_h = w_xh x, and output y = g(s_y), s_y = w_hy h + w_xy x:
∂y/∂w_hy = (∂y/∂s_y)(∂s_y/∂w_hy) = g'(s_y) h
∂y/∂w_xy = (∂y/∂s_y)(∂s_y/∂w_xy) = g'(s_y) x
∂y/∂w_xh = (∂y/∂s_y)(∂s_y/∂w_xh) = g'(s_y) (w_hy ∂h/∂w_xh + ∂(w_xy x)/∂w_xh)
         = g'(s_y) w_hy g'(s_h) (∂s_h/∂w_xh) = w_hy g'(s_y) g'(s_h) x

Derivation of PCA (Discussion 12)
In this question we will derive PCA. PCA aims to find the direction of maximum variance among a dataset. You want the line such that projecting your data onto this line will retain the maximum amount of information. Thus, the optimization problem is
max_{u: ||u||_2 = 1} (1/n) Σ_{i=1}^n (u^T x_i − u^T x̄)²
where n is the number of data points and x̄ is the sample average of the data points.
(a) Show that this optimization problem can be massaged into this format:
max_{u: ||u||_2 = 1} u^T Σ u
where Σ = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T.
Solution: We can massage the objective function (let's call it f_0(u)) in this way:
f_0(u) = (1/n) Σ_{i=1}^n (u^T x_i − u^T x̄)²
       = (1/n) Σ_{i=1}^n (u^T (x_i − x̄))²
       = (1/n) Σ_{i=1}^n u^T (x_i − x̄)(x_i − x̄)^T u
       = u^T ( (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T ) u = u^T Σ u
Discussion Problems
(Entropy, part (a)) Find the distribution (values of the p_i) that maximizes entropy −Σ_i p_i log p_i. (Hint: remember that Σ_{i=1}^d p_i = 1. Don't forget to include that in the optimization as a constraint!)
(b) Show that the maximizer for this problem is equal to v_1, where v_1 is the eigenvector corresponding to the largest eigenvalue λ_1. Also show that the optimal value of this problem is equal to λ_1.
Solution: We start by invoking the spectral decomposition of Σ = V Λ V^T, which is a symmetric positive semi-definite matrix.
max_{u: ||u||_2 = 1} u^T Σ u = max_{u: ||u||_2 = 1} u^T V Λ V^T u = max_{u: ||u||_2 = 1} (V^T u)^T Λ (V^T u)
Here is an aside: note through this one-line proof that left-multiplying a vector by an orthogonal (or rotation) matrix preserves the length of the vector:
||V^T u||_2 = sqrt((V^T u)^T (V^T u)) = sqrt(u^T V V^T u) = sqrt(u^T u) = ||u||_2
I define a new variable z = V^T u, and maximize over this variable. Note that because V is invertible, there is a one-to-one mapping between u and z. Also note that the constraint is the same because the length of the vector u does not change when multiplied by an orthogonal matrix.
max_{z: ||z||_2 = 1} z^T Λ z = max Σ_{i=1}^d λ_i z_i²   s.t.   Σ_{i=1}^d z_i² = 1
From this formulation, it is obvious to see that we can maximize this by throwing all of our eggs into one basket and setting z_i = 1 if i is the index of the largest eigenvalue, and z_i = 0 otherwise. Thus,
z = V^T u ⇒ u = V z = v_1
where v_1 is the principal eigenvector and corresponds to λ_1. Plugging this into the objective function, we see that the optimal value is λ_1.
3. Deriving the second principal component
(a) Let J(v_2, z_2) = (1/n) Σ_{i=1}^n (x_i − z_{i1} v_1 − z_{i2} v_2)^T (x_i − z_{i1} v_1 − z_{i2} v_2) given the constraints v_1^T v_2 = 0 and v_2^T v_2 = 1. Show that ∂J/∂z_2 = 0 yields z_{i2} = v_2^T x_i.
(b) We have shown that z_{i2} = v_2^T x_i, so the second principal encoding is obtained by projecting onto the second principal direction. Show that the value of v_2 that minimizes J is given by the eigenvector of C = (1/n) Σ_{i=1}^n x_i x_i^T with the second largest eigenvalue. Assume we have already proved that the optimal v_1 is the eigenvector of C with the largest eigenvalue.
Solution: (a) We have
J(v_2, z_2) = (1/n) Σ_{i=1}^n (x_i^T x_i − z_{i1} x_i^T v_1 − z_{i2} x_i^T v_2 − z_{i1} v_1^T x_i + z_{i1}² v_1^T v_1 − z_{i2} v_2^T x_i + z_{i1} z_{i2} v_2^T v_1 + z_{i2}² v_2^T v_2)
Using v_1^T v_2 = 0 and v_2^T v_2 = 1, each term reduces to (const − 2 z_{i2} x_i^T v_2 + z_{i2}²), so ∂J/∂z_{i2} = 0 yields z_{i2} = v_2^T x_i.
(b) Substituting z_{i2} = v_2^T x_i,
J = (1/n) Σ_{i=1}^n (−2 v_2^T x_i x_i^T v_2 + v_2^T x_i x_i^T v_2 + const) = −v_2^T C v_2 + const.
Minimizing subject to v_2^T v_2 = 1 with a Lagrange multiplier λ gives −2 C v_2 + 2 λ v_2 = 0.
Then, we have C v_2 = λ v_2, so v_2 is an eigenvector of C; minimizing J means taking the largest remaining, i.e. second largest, eigenvalue.
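A numpy sketch tying the derivation together: the top eigenvector of the sample covariance maximizes u^T Σ u, and the variance of the data projected onto it equals λ_1 (made-up correlated data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # made-up correlated data

Xc = X - X.mean(axis=0)                    # center the data
Sigma = Xc.T @ Xc / len(X)                 # sample covariance (1/n convention)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
v1, lam1 = eigvecs[:, -1], eigvals[-1]     # principal direction and lambda_1
print(lam1, v1, np.var(Xc @ v1))           # projected variance equals lambda_1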