
Maximum likelihood estimation

Nuno Vasconcelos
UCSD
Bayesian decision theory
• recall that we have
– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x),y] – loss of predicting y with g(x)
• Bayes decision rule is the rule that minimizes the risk
$$\text{Risk} = E_{X,Y}\big[L(g(X), Y)\big]$$

• for the “0-1” loss

$$L[g(x), y] = \begin{cases} 1, & g(x) \neq y \\ 0, & g(x) = y \end{cases}$$

• the optimal decision rule is the maximum a-posteriori probability (MAP) rule
MAP rule
• we have shown that it can be implemented in any of the three following ways

– 1) $i^*(x) = \arg\max_i P_{Y|X}(i|x)$

– 2) $i^*(x) = \arg\max_i \left[ P_{X|Y}(x|i)\, P_Y(i) \right]$

– 3) $i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$

• by introducing a “model” for the class-conditional distributions we can express this as a simple equation
– e.g. for the multivariate Gaussian

$$P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$
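To make the equivalence of forms 2) and 3) concrete, here is a minimal sketch (not from the slides; every parameter value is invented) checking that both forms pick the same class on a toy 1-D Gaussian problem:

```python
import numpy as np

# Minimal sketch (not from the slides): forms 2) and 3) of the MAP rule
# yield the same decision on a toy 1-D problem.  All values are invented.

def gaussian_pdf(x, mu, var):
    # scalar Gaussian density, playing the role of P_X|Y(x|i)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mus, variances = [0.0, 3.0], [1.0, 2.0]   # class-conditional parameters
priors = [0.7, 0.3]                       # P_Y(i)
x = 1.8                                   # observation to classify

# form 2): arg max_i  P_X|Y(x|i) P_Y(i)
posteriors = [gaussian_pdf(x, m, v) * p
              for m, v, p in zip(mus, variances, priors)]
# form 3): arg max_i  log P_X|Y(x|i) + log P_Y(i)
log_posteriors = [np.log(gaussian_pdf(x, m, v)) + np.log(p)
                  for m, v, p in zip(mus, variances, priors)]

assert np.argmax(posteriors) == np.argmax(log_posteriors)
print("MAP class:", int(np.argmax(posteriors)))
```

In practice form 3) is usually preferred, since sums of logs are numerically better behaved than products of small densities.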
The Gaussian classifier
[figure: two Gaussian class densities and the discriminant surface where $P_{Y|X}(1|x) = 0.5$]
• the solution is

$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

with

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$$

$$\alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$

• the optimal rule is to assign x to the closest class


• closest is measured with the Mahalanobis distance $d_i(x, y)$
• can be further simplified in special cases
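As an illustration, here is a minimal sketch of the rule $i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$ for two hypothetical 2-D Gaussian classes; the means, covariances, and priors below are made up:

```python
import numpy as np

# Sketch of the classifier i*(x) = arg min_i [ d_i(x, mu_i) + alpha_i ]
# for two hypothetical classes in d = 2 dimensions.
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]
d      = 2

def discriminant(x, mu, Sigma, prior):
    maha  = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # Mahalanobis distance
    alpha = np.log((2 * np.pi) ** d * np.linalg.det(Sigma)) - 2 * np.log(prior)
    return maha + alpha

x = np.array([1.0, 2.0])
scores = [discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("assigned class:", int(np.argmin(scores)))         # the "closest" class
```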
Geometric interpretation
• for Gaussian classes with equal covariance $\Sigma_i = \sigma^2 I$, the boundary is the hyperplane $w^T(x - x_0) = 0$ with

$$w = \frac{\mu_i - \mu_j}{\sigma^2}$$

$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \,(\mu_i - \mu_j)$$

[figure: the hyperplane through $x_0$ with normal $w$, between the means $\mu_i$ and $\mu_j$]
Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance $\Sigma$

$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)} \,(\mu_i - \mu_j)$$

[figure: the hyperplane through $x_0$ with normal $w$, between the means $\mu_i$ and $\mu_j$]
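A small sketch of these formulas with invented parameter values, checking that each mean lands on its own side of the hyperplane $w^T(x - x_0) = 0$:

```python
import numpy as np

# Hypothetical two-class problem with a shared covariance.
mu_i, mu_j = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
Sigma      = np.array([[1.0, 0.3], [0.3, 2.0]])
P_i, P_j   = 0.6, 0.4

Sinv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w    = Sinv @ diff
x0   = (mu_i + mu_j) / 2 - np.log(P_i / P_j) / (diff @ Sinv @ diff) * diff

# sign of w.(x - x0) gives the side of the boundary
print("mu_i side:", np.sign(w @ (mu_i - x0)))   # +1
print("mu_j side:", np.sign(w @ (mu_j - x0)))   # -1
```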
Bayesian decision theory
• advantages:
– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect a causal interpretation of the problem; this is how we think
– natural decomposition into “what we knew already” (prior) and
“what data tells us” (CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity,
and scalability to very complicated models and problems
• problems:
– BDR is optimal only insofar as the models are correct
Implementation
• we do have an optimal solution
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)} \,(\mu_i - \mu_j)$$
• but in practice we do not know the values of the parameters $\mu$, $\Sigma$, $P_Y(1)$
– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean

$$w = \hat{\Sigma}^{-1}(\hat{\mu}_i - \hat{\mu}_j)$$

$$x_0 = \frac{\hat{\mu}_i + \hat{\mu}_j}{2} - \frac{1}{(\hat{\mu}_i - \hat{\mu}_j)^T \hat{\Sigma}^{-1} (\hat{\mu}_i - \hat{\mu}_j)} \log\frac{\hat{P}_Y(i)}{\hat{P}_Y(j)} \,(\hat{\mu}_i - \hat{\mu}_j)$$
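A sketch of this plug-in strategy on synthetic data; the class means, sample sizes, and the pooled-covariance choice are all assumptions made for illustration:

```python
import numpy as np

# Draw hypothetical training sets for the two classes.
rng = np.random.default_rng(0)
X_i = rng.normal([ 2.0, 0.0], 1.0, size=(200, 2))   # class i examples
X_j = rng.normal([-2.0, 0.0], 1.0, size=(100, 2))   # class j examples

mu_i_hat, mu_j_hat = X_i.mean(axis=0), X_j.mean(axis=0)
centered  = np.vstack([X_i - mu_i_hat, X_j - mu_j_hat])
Sigma_hat = centered.T @ centered / len(centered)   # pooled ML covariance
P_i_hat   = len(X_i) / (len(X_i) + len(X_j))        # class frequency as prior
P_j_hat   = 1.0 - P_i_hat

# plug the estimates into the boundary formulas
diff = mu_i_hat - mu_j_hat
Sinv = np.linalg.inv(Sigma_hat)
w    = Sinv @ diff
x0   = (mu_i_hat + mu_j_hat) / 2 \
       - np.log(P_i_hat / P_j_hat) / (diff @ Sinv @ diff) * diff
print("w  =", w)
print("x0 =", x0)
```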
Important
• warning: at this point all optimality claims for the BDR cease to be valid!!
• the BDR is guaranteed to achieve the minimum loss when we use the true probabilities
• when we “plug in” the probability estimates, we could be implementing a classifier that is quite distant from the optimal
– e.g. if the $P_{X|Y}(x|i)$ look like the example above
– I could never approximate them well by parametric models (e.g. Gaussian)
Maximum likelihood
• this seems pretty serious
– how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by Θ and
the class-conditional distributions by

$$P_{X|Y}(x|i; \Theta)$$

– note that this means that Θ is NOT a random variable (otherwise it would have to show up as a subscript)
– it is simply a parameter, and the probabilities are a function of this parameter
Maximum likelihood
• three steps:
– 2) we assemble a collection of datasets $D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$, a set of examples drawn independently from class i

– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class

$$\Theta_i = \arg\max_\Theta P_{X|Y}\big(D^{(i)} | i; \Theta\big) = \arg\max_\Theta \log P_{X|Y}\big(D^{(i)} | i; \Theta\big)$$

– like before, it does not really make any difference to maximize probabilities or their logs
Maximum likelihood
• since
– each sample $D^{(i)}$ is considered independently
– parameter $\Theta_i$ is estimated only from sample $D^{(i)}$
• we simply have to repeat the procedure for all classes
• so, from now on we omit the class variable

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta) = \arg\max_\Theta \log P_X(D; \Theta)$$

• the function $P_X(D; \Theta)$ is called the likelihood of the parameter Θ with respect to the data
• or simply the likelihood function
Maximum likelihood
• note that the likelihood function is a function of the parameters Θ
• it does not have the same shape as the density itself
• e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample

$$P_X(d; \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(d - \mu)^2}{2\sigma^2} \right\}$$
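A quick numerical sketch of this point: for a single hypothetical observation d, the likelihood as a function of the parameter σ (with µ held at 0) is skewed, not bell-shaped:

```python
import numpy as np

d, mu = 1.5, 0.0                       # one made-up observation, fixed mean
sigmas = np.linspace(0.1, 5.0, 50)     # grid over the parameter sigma
likelihood = (np.exp(-(d - mu) ** 2 / (2 * sigmas ** 2))
              / np.sqrt(2 * np.pi * sigmas ** 2))

# the curve rises sharply, peaks near sigma = |d - mu|, then decays slowly
print("sigma maximizing the likelihood:", sigmas[np.argmax(likelihood)])
```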
Maximum likelihood
• given a sample, to obtain the ML estimate we need to solve

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta)$$

• when Θ is a scalar this is high-school calculus
• we have a maximum when
– the first derivative is zero
– the second derivative is negative
The gradient
• in higher dimensions, the generalization of the derivative
is the gradient
• the gradient of a function f(w) at z is

$$\nabla f(z) = \left( \frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z) \right)^T$$

• the gradient has a nice geometric interpretation
– it points in the direction of maximum growth of the function
– which makes it perpendicular to the contours where the function is constant

[figure: contours of f(x, y) with gradient vectors $\nabla f(x_0, y_0)$ and $\nabla f(x_1, y_1)$ normal to them]
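A small numerical sketch of the “direction of maximum growth” property, using an arbitrary test function and point (nothing here is from the slides):

```python
import numpy as np

def f(p):
    # arbitrary smooth test function
    return p[0] ** 2 + 3 * p[1] ** 2

def numeric_grad(f, p, h=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(p)
    for k in range(len(p)):
        e = np.zeros_like(p)
        e[k] = h
        g[k] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.0, 2.0])
g = numeric_grad(f, p)
print("gradient ~", g)                          # analytic value is (2, 12)

# no random unit direction grows f faster than the gradient direction
rng = np.random.default_rng(0)
dirs = rng.normal(size=(100, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(bool((dirs @ g).max() <= np.linalg.norm(g) + 1e-9))   # True
```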
The gradient
• note that if ∇f = 0
– there is no direction of growth
– also −∇f = 0, so there is no direction of decrease
– we are either at a local minimum or maximum or “saddle” point
• conversely, at a local min, max, or saddle point
– there is no direction of growth or decrease
– ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type we need second-order conditions

[figure: surfaces illustrating a maximum, a minimum, and a saddle point]
The Hessian
• the extension of the second-order derivative is the
Hessian matrix
$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$

– at each point x, it gives us the quadratic function

$$x^T \nabla^2 f(x)\, x$$

that best approximates f(x)
The Hessian
• this means that, when the gradient is zero at x, we have
– a maximum when the function can be approximated by an “upwards-facing” (concave) quadratic
– a minimum when the function can be approximated by a “downwards-facing” (convex) quadratic
– a saddle point otherwise

[figure: surfaces illustrating a maximum, a minimum, and a saddle point]
The Hessian
• for any matrix M, the function

$$x^T M x$$

• is
– an upwards-facing quadratic when M is negative definite
– a downwards-facing quadratic when M is positive definite
– a saddle otherwise
• hence, all that matters is the definiteness of the Hessian
• we have a maximum when the Hessian is negative definite
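In practice, the definiteness of a (symmetric) Hessian can be read off its eigenvalues. A minimal sketch, with made-up matrices standing in for Hessians at a critical point:

```python
import numpy as np

def classify_critical_point(H):
    # eigenvalues of a symmetric matrix decide its definiteness
    # (semidefinite borderline cases are ignored in this sketch)
    eig = np.linalg.eigvalsh(H)
    if np.all(eig < 0):
        return "maximum (negative definite)"
    if np.all(eig > 0):
        return "minimum (positive definite)"
    return "saddle (indefinite)"

print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))
print(classify_critical_point(np.array([[ 2.0, 0.0], [0.0,  1.0]])))
print(classify_critical_point(np.array([[ 2.0, 0.0], [0.0, -1.0]])))
```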
Maximum likelihood
• in summary, given a sample, we need to solve

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta)$$

• the solutions are the parameters Θ* such that

$$\nabla_\Theta P_X(D; \Theta^*) = 0$$

$$\theta^T \nabla^2_\Theta P_X(D; \Theta^*)\, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n$$

• note that you always have to check the second-order condition!
Maximum likelihood
• let’s consider the Gaussian example

• given a sample $D = \{T_1, \ldots, T_N\}$ of independent points
• the likelihood is

$$P_X(D; \Theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(T_n - \mu)^2}{2\sigma^2} \right\}$$
Maximum likelihood
• and the log-likelihood is

$$\log P_X(D; \Theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (T_n - \mu)^2 - \frac{N}{2} \log(2\pi\sigma^2)$$

• the derivative with respect to the mean is zero when

$$\frac{\partial}{\partial \mu} \log P_X(D; \Theta) = \frac{1}{\sigma^2} \sum_{n=1}^{N} (T_n - \mu) = 0$$

• or

$$\mu^* = \frac{1}{N} \sum_{n=1}^{N} T_n$$

• note that this is just the sample mean
Maximum likelihood
• and the log-likelihood is

$$\log P_X(D; \Theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (T_n - \mu)^2 - \frac{N}{2} \log(2\pi\sigma^2)$$

• the derivative with respect to the variance is zero when

$$\frac{\partial}{\partial \sigma^2} \log P_X(D; \Theta) = \frac{1}{2\sigma^4} \sum_{n=1}^{N} (T_n - \mu)^2 - \frac{N}{2\sigma^2} = 0$$

• or

$$\sigma^{*2} = \frac{1}{N} \sum_{n=1}^{N} (T_n - \mu^*)^2$$

• note that this is just the sample variance
Maximum likelihood
• example:
– if the sample is {10, 20, 30, 40, 50}, the ML estimates are

$$\mu^* = \frac{10 + 20 + 30 + 40 + 50}{5} = 30$$

$$\sigma^{*2} = \frac{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2}{5} = 200$$
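The example is easy to verify numerically; note the divide-by-N variance (ddof=0), which matches the ML estimate rather than the unbiased N−1 version:

```python
import numpy as np

sample = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(sample.mean())         # 30.0, the ML estimate of the mean
print(sample.var(ddof=0))    # 200.0, the ML estimate of the variance
```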
Homework
• show that the Hessian is negative definite

$$\theta^T \nabla^2_\Theta P_X(D; \Theta^*)\, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n$$

• show that these formulas can be generalized to the vector case
– $D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$ set of examples from class i
– the ML estimates are

$$\mu_i = \frac{1}{n} \sum_j x_j^{(i)} \qquad\qquad \Sigma_i = \frac{1}{n} \sum_j \left( x_j^{(i)} - \mu_i \right)\left( x_j^{(i)} - \mu_i \right)^T$$
• note that the ML solution is usually intuitive


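A sketch of the vector-case estimates on arbitrary synthetic data; again the covariance divides by n, not n − 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                # n = 500 hypothetical examples, d = 3

mu_hat = X.mean(axis=0)                      # (1/n) sum_j x_j
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)   # (1/n) sum_j (x_j - mu)(x_j - mu)^T
print(mu_hat)
print(Sigma_hat)
```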
Estimators
• when we talk about estimators, it is important to keep in
mind that
– an estimate is a number
– an estimator is a random variable

$$\hat{\theta} = f(X_1, \ldots, X_n)$$

• an estimate is the value of the estimator for a given sample
• if $D = \{x_1, \ldots, x_n\}$, when we say $\hat{\mu} = \frac{1}{n}\sum_j x_j$, what we mean is

$$\hat{\mu} = f(X_1, \ldots, X_n)\Big|_{X_1 = x_1, \ldots, X_n = x_n} \quad \text{with} \quad f(X_1, \ldots, X_n) = \frac{1}{n} \sum_j X_j$$

where the $X_i$ are random variables
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the expected value equal to the true value?
• this is measured by the bias
– if

$$\hat{\theta} = f(X_1, \ldots, X_n)$$

then

$$\text{Bias}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\big[ f(X_1, \ldots, X_n) - \theta \big]$$

– an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
– e.g. if θ is negative and the estimator is $f(X_1, \ldots, X_n) = \frac{1}{n} \sum_j X_j^2$, the bias is clearly non-zero
Bias and variance
• such an estimator is said to be biased
– this means that it is not expressive enough to approximate the true value arbitrarily well
– this will be clearer when we talk about density estimation
• Q2: assuming that the estimator converges to the true value, how many sample points do we need?
– this can be measured by the variance

$$\text{Var}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\Big[ \big( f(X_1, \ldots, X_n) - E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)] \big)^2 \Big]$$

– the variance usually decreases as one collects more training examples
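A Monte Carlo sketch of both quantities for the sample-mean estimator (the true parameters and trial count are arbitrary): redrawing many samples, the estimator's average stays near µ while its variance shrinks like σ²/n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = -1.0, 2.0, 10_000

for n in (10, 100, 1000):
    # each row is one sample of size n; each row's mean is one estimate
    estimates = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(f"n={n:5d}  bias~{estimates.mean() - mu:+.4f}  "
          f"var~{estimates.var():.4f}  (sigma^2/n = {sigma**2 / n:.4f})")
```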
Example
• ML estimator for the mean of a Gaussian $N(\mu, \sigma^2)$

$$\begin{aligned}
\text{Bias}(\hat{\mu}) &= E_{X_1, \ldots, X_n}[\hat{\mu} - \mu] = E_{X_1, \ldots, X_n}[\hat{\mu}] - \mu \\
&= E_{X_1, \ldots, X_n}\Big[ \frac{1}{n} \sum_i X_i \Big] - \mu \\
&= \frac{1}{n} \sum_i E_{X_1, \ldots, X_n}[X_i] - \mu \\
&= \frac{1}{n} \sum_i E_{X_i}[X_i] - \mu \\
&= \mu - \mu = 0
\end{aligned}$$

• the estimator is unbiased