
Maximum likelihood estimation

Nuno Vasconcelos
UCSD
Bayesian decision theory
• recall that we have
– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x),y] – loss of predicting y with g(x)
• Bayes decision rule is the rule that minimizes the risk
$$\text{Risk} = E_{X,Y}\big[L(g(X), Y)\big]$$

• for the “0-1” loss

$$L[g(x), y] = \begin{cases} 1, & g(x) \neq y \\ 0, & g(x) = y \end{cases}$$

• the optimal decision rule is the maximum a-posteriori probability (MAP) rule
MAP rule
• we have shown that it can be implemented in any of the three following ways

– 1) $i^*(x) = \arg\max_i P_{Y|X}(i|x)$

– 2) $i^*(x) = \arg\max_i \left[ P_{X|Y}(x|i)\, P_Y(i) \right]$

– 3) $i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x|i) + \log P_Y(i) \right]$

• by introducing a “model” for the class-conditional distributions we can express this as a simple equation
– e.g. for the multivariate Gaussian

$$P_{X|Y}(x|i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$
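To make the equivalence of forms 2) and 3) concrete, here is a minimal sketch (not from the slides; every parameter value is invented) checking that both forms pick the same class on a toy 1-D Gaussian problem:

```python
import numpy as np

# Minimal sketch (not from the slides): forms 2) and 3) of the MAP rule
# yield the same decision on a toy 1-D problem.  All values are invented.

def gaussian_pdf(x, mu, var):
    # scalar Gaussian density, playing the role of P_X|Y(x|i)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mus, variances = [0.0, 3.0], [1.0, 2.0]   # class-conditional parameters
priors = [0.7, 0.3]                       # P_Y(i)
x = 1.8                                   # observation to classify

# form 2): arg max_i  P_X|Y(x|i) P_Y(i)
posteriors = [gaussian_pdf(x, m, v) * p
              for m, v, p in zip(mus, variances, priors)]
# form 3): arg max_i  log P_X|Y(x|i) + log P_Y(i)
log_posteriors = [np.log(gaussian_pdf(x, m, v)) + np.log(p)
                  for m, v, p in zip(mus, variances, priors)]

assert np.argmax(posteriors) == np.argmax(log_posteriors)
print("MAP class:", int(np.argmax(posteriors)))
```

In practice form 3) is usually preferred, since sums of logs are numerically better behaved than products of small densities.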
The Gaussian classifier
[figure: two Gaussian class densities and the discriminant surface where $P_{Y|X}(1|x) = 0.5$]
• the solution is

$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$

with

$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)$$

$$\alpha_i = \log\left[ (2\pi)^d |\Sigma_i| \right] - 2 \log P_Y(i)$$

• the optimal rule is to assign x to the closest class


• closest is measured with the Mahalanobis distance $d_i(x, y)$
• can be further simplified in special cases
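As an illustration, here is a minimal sketch of the rule $i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$ for two hypothetical 2-D Gaussian classes; the means, covariances, and priors below are made up:

```python
import numpy as np

# Sketch of the classifier i*(x) = arg min_i [ d_i(x, mu_i) + alpha_i ]
# for two hypothetical classes in d = 2 dimensions.
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]
d      = 2

def discriminant(x, mu, Sigma, prior):
    maha  = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # Mahalanobis distance
    alpha = np.log((2 * np.pi) ** d * np.linalg.det(Sigma)) - 2 * np.log(prior)
    return maha + alpha

x = np.array([1.0, 2.0])
scores = [discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("assigned class:", int(np.argmin(scores)))         # the "closest" class
```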
Geometric interpretation
• for Gaussian classes with equal covariance $\Sigma_i = \sigma^2 I$, the boundary is the hyperplane $w^T(x - x_0) = 0$ with

$$w = \frac{\mu_i - \mu_j}{\sigma^2}$$

$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \,(\mu_i - \mu_j)$$

[figure: the hyperplane through $x_0$ with normal $w$, between the means $\mu_i$ and $\mu_j$]
Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance $\Sigma$

$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)} \,(\mu_i - \mu_j)$$

[figure: the hyperplane through $x_0$ with normal $w$, between the means $\mu_i$ and $\mu_j$]
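A small sketch of these formulas with invented parameter values, checking that each mean lands on its own side of the hyperplane $w^T(x - x_0) = 0$:

```python
import numpy as np

# Hypothetical two-class problem with a shared covariance.
mu_i, mu_j = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
Sigma      = np.array([[1.0, 0.3], [0.3, 2.0]])
P_i, P_j   = 0.6, 0.4

Sinv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w    = Sinv @ diff
x0   = (mu_i + mu_j) / 2 - np.log(P_i / P_j) / (diff @ Sinv @ diff) * diff

# sign of w.(x - x0) gives the side of the boundary
print("mu_i side:", np.sign(w @ (mu_i - x0)))   # +1
print("mu_j side:", np.sign(w @ (mu_j - x0)))   # -1
```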
Bayesian decision theory
• advantages:
– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect a causal interpretation of the problem; this is how we think
– natural decomposition into “what we knew already” (prior) and
“what data tells us” (CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity,
and scalability to very complicated models and problems
• problems:
– BDR is optimal only insofar as the models are correct
Implementation
• we do have an optimal solution
$$w = \Sigma^{-1}(\mu_i - \mu_j)$$

$$x_0 = \frac{\mu_i + \mu_j}{2} - \frac{1}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \log\frac{P_Y(i)}{P_Y(j)} \,(\mu_i - \mu_j)$$
• but in practice we do not know the values of the parameters $\mu$, $\Sigma$, $P_Y(1)$
– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean

$$w = \hat{\Sigma}^{-1}(\hat{\mu}_i - \hat{\mu}_j)$$

$$x_0 = \frac{\hat{\mu}_i + \hat{\mu}_j}{2} - \frac{1}{(\hat{\mu}_i - \hat{\mu}_j)^T \hat{\Sigma}^{-1} (\hat{\mu}_i - \hat{\mu}_j)} \log\frac{\hat{P}_Y(i)}{\hat{P}_Y(j)} \,(\hat{\mu}_i - \hat{\mu}_j)$$
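A sketch of this plug-in strategy on synthetic data; the class means, sample sizes, and the pooled-covariance choice are all assumptions made for illustration:

```python
import numpy as np

# Draw hypothetical training sets for the two classes.
rng = np.random.default_rng(0)
X_i = rng.normal([ 2.0, 0.0], 1.0, size=(200, 2))   # class i examples
X_j = rng.normal([-2.0, 0.0], 1.0, size=(100, 2))   # class j examples

mu_i_hat, mu_j_hat = X_i.mean(axis=0), X_j.mean(axis=0)
centered  = np.vstack([X_i - mu_i_hat, X_j - mu_j_hat])
Sigma_hat = centered.T @ centered / len(centered)   # pooled ML covariance
P_i_hat   = len(X_i) / (len(X_i) + len(X_j))        # class frequency as prior
P_j_hat   = 1.0 - P_i_hat

# plug the estimates into the boundary formulas
diff = mu_i_hat - mu_j_hat
Sinv = np.linalg.inv(Sigma_hat)
w    = Sinv @ diff
x0   = (mu_i_hat + mu_j_hat) / 2 \
       - np.log(P_i_hat / P_j_hat) / (diff @ Sinv @ diff) * diff
print("w  =", w)
print("x0 =", x0)
```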
Important
• warning: at this point all optimality claims for the BDR cease to be valid!!
• the BDR is guaranteed to achieve the minimum loss when we use the true probabilities
• when we “plug in” the probability estimates, we could be implementing a classifier that is quite distant from the optimal
– e.g. if the $P_{X|Y}(x|i)$ look like the example above
– I could never approximate them well by parametric models (e.g. Gaussian)
Maximum likelihood
• this seems pretty serious
– how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by Θ and
the class-conditional distributions by

$$P_{X|Y}(x|i; \Theta)$$

– note that this means that Θ is NOT a random variable (otherwise it would have to show up as a subscript)
– it is simply a parameter, and the probabilities are a function of this parameter
Maximum likelihood
• three steps:
– 2) we assemble a collection of datasets $D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$, a set of examples drawn independently from class i

– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class

$$\Theta_i = \arg\max_\Theta P_{X|Y}\big(D^{(i)} | i; \Theta\big) = \arg\max_\Theta \log P_{X|Y}\big(D^{(i)} | i; \Theta\big)$$

– like before, it does not really make any difference to maximize probabilities or their logs
Maximum likelihood
• since
– each sample $D^{(i)}$ is considered independently
– parameter $\Theta_i$ is estimated only from sample $D^{(i)}$
• we simply have to repeat the procedure for all classes
• so, from now on we omit the class variable

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta) = \arg\max_\Theta \log P_X(D; \Theta)$$

• the function $P_X(D; \Theta)$ is called the likelihood of the parameter Θ with respect to the data
• or simply the likelihood function
Maximum likelihood
• note that the likelihood function is a function of the parameters Θ
• it does not have the same shape as the density itself
• e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample

$$P_X(d; \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(d - \mu)^2}{2\sigma^2} \right\}$$
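A quick numerical sketch of this point: for a single hypothetical observation d, the likelihood as a function of the parameter σ (with µ held at 0) is skewed, not bell-shaped:

```python
import numpy as np

d, mu = 1.5, 0.0                       # one made-up observation, fixed mean
sigmas = np.linspace(0.1, 5.0, 50)     # grid over the parameter sigma
likelihood = (np.exp(-(d - mu) ** 2 / (2 * sigmas ** 2))
              / np.sqrt(2 * np.pi * sigmas ** 2))

# the curve rises sharply, peaks near sigma = |d - mu|, then decays slowly
print("sigma maximizing the likelihood:", sigmas[np.argmax(likelihood)])
```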
Maximum likelihood
• given a sample, to obtain the ML estimate we need to solve

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta)$$

• when Θ is a scalar this is high-school calculus
• we have a maximum when
– the first derivative is zero
– the second derivative is negative
The gradient
• in higher dimensions, the generalization of the derivative
is the gradient
• the gradient of a function f(w) at z is

$$\nabla f(z) = \left( \frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z) \right)^T$$

• the gradient has a nice geometric interpretation
– it points in the direction of maximum growth of the function
– which makes it perpendicular to the contours where the function is constant

[figure: contours of f(x, y) with gradient vectors $\nabla f(x_0, y_0)$ and $\nabla f(x_1, y_1)$ normal to them]
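A small numerical sketch of the “direction of maximum growth” property, using an arbitrary test function and point (nothing here is from the slides):

```python
import numpy as np

def f(p):
    # arbitrary smooth test function
    return p[0] ** 2 + 3 * p[1] ** 2

def numeric_grad(f, p, h=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(p)
    for k in range(len(p)):
        e = np.zeros_like(p)
        e[k] = h
        g[k] = (f(p + e) - f(p - e)) / (2 * h)
    return g

p = np.array([1.0, 2.0])
g = numeric_grad(f, p)
print("gradient ~", g)                          # analytic value is (2, 12)

# no random unit direction grows f faster than the gradient direction
rng = np.random.default_rng(0)
dirs = rng.normal(size=(100, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(bool((dirs @ g).max() <= np.linalg.norm(g) + 1e-9))   # True
```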
The gradient
• note that if ∇f = 0
– there is no direction of growth
– also −∇f = 0, so there is no direction of decrease
– we are either at a local minimum or maximum or “saddle” point
• conversely, at a local min, max, or saddle point
– there is no direction of growth or decrease
– ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type we need second-order conditions

[figure: surfaces illustrating a maximum, a minimum, and a saddle point]
The Hessian
• the extension of the second-order derivative is the
Hessian matrix
$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$

– at each point x, it gives us the quadratic function

$$x^T \nabla^2 f(x)\, x$$

that best approximates f(x)
The Hessian
• this means that, when the gradient is zero at x, we have
– a maximum when the function can be approximated by an “upwards-facing” (concave) quadratic
– a minimum when the function can be approximated by a “downwards-facing” (convex) quadratic
– a saddle point otherwise

[figure: surfaces illustrating a maximum, a minimum, and a saddle point]
The Hessian
• for any matrix M, the function

$$x^T M x$$

• is
– an upwards-facing quadratic when M is negative definite
– a downwards-facing quadratic when M is positive definite
– a saddle otherwise
• hence, all that matters is the definiteness of the Hessian
• we have a maximum when the Hessian is negative definite
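In practice, the definiteness of a (symmetric) Hessian can be read off its eigenvalues. A minimal sketch, with made-up matrices standing in for Hessians at a critical point:

```python
import numpy as np

def classify_critical_point(H):
    # eigenvalues of a symmetric matrix decide its definiteness
    # (semidefinite borderline cases are ignored in this sketch)
    eig = np.linalg.eigvalsh(H)
    if np.all(eig < 0):
        return "maximum (negative definite)"
    if np.all(eig > 0):
        return "minimum (positive definite)"
    return "saddle (indefinite)"

print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))
print(classify_critical_point(np.array([[ 2.0, 0.0], [0.0,  1.0]])))
print(classify_critical_point(np.array([[ 2.0, 0.0], [0.0, -1.0]])))
```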
Maximum likelihood
• in summary, given a sample, we need to solve

$$\Theta^* = \arg\max_\Theta P_X(D; \Theta)$$

• the solutions are the parameters Θ* such that

$$\nabla_\Theta P_X(D; \Theta^*) = 0$$

$$\theta^T \nabla^2_\Theta P_X(D; \Theta^*)\, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n$$

• note that you always have to check the second-order condition!
Maximum likelihood
• let’s consider the Gaussian example

• given a sample $D = \{T_1, \ldots, T_N\}$ of independent points
• the likelihood is

$$P_X(D; \Theta) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(T_n - \mu)^2}{2\sigma^2} \right\}$$
Maximum likelihood
• and the log-likelihood is

$$\log P_X(D; \Theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (T_n - \mu)^2 - \frac{N}{2} \log(2\pi\sigma^2)$$

• the derivative with respect to the mean is zero when

$$\frac{\partial}{\partial \mu} \log P_X(D; \Theta) = \frac{1}{\sigma^2} \sum_{n=1}^{N} (T_n - \mu) = 0$$

• or

$$\mu^* = \frac{1}{N} \sum_{n=1}^{N} T_n$$

• note that this is just the sample mean
Maximum likelihood
• and the log-likelihood is

$$\log P_X(D; \Theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (T_n - \mu)^2 - \frac{N}{2} \log(2\pi\sigma^2)$$

• the derivative with respect to the variance is zero when

$$\frac{\partial}{\partial \sigma^2} \log P_X(D; \Theta) = \frac{1}{2\sigma^4} \sum_{n=1}^{N} (T_n - \mu)^2 - \frac{N}{2\sigma^2} = 0$$

• or

$$\sigma^{*2} = \frac{1}{N} \sum_{n=1}^{N} (T_n - \mu^*)^2$$

• note that this is just the sample variance
Maximum likelihood
• example:
– if the sample is {10, 20, 30, 40, 50}, the ML estimates are

$$\mu^* = \frac{10 + 20 + 30 + 40 + 50}{5} = 30$$

$$\sigma^{*2} = \frac{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2}{5} = 200$$
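The example is easy to verify numerically; note the divide-by-N variance (ddof=0), which matches the ML estimate rather than the unbiased N−1 version:

```python
import numpy as np

sample = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(sample.mean())         # 30.0, the ML estimate of the mean
print(sample.var(ddof=0))    # 200.0, the ML estimate of the variance
```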
Homework
• show that the Hessian is negative definite

$$\theta^T \nabla^2_\Theta P_X(D; \Theta^*)\, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n$$

• show that these formulas can be generalized to the vector case
– $D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}$ set of examples from class i
– the ML estimates are

$$\mu_i = \frac{1}{n} \sum_j x_j^{(i)} \qquad\qquad \Sigma_i = \frac{1}{n} \sum_j \left( x_j^{(i)} - \mu_i \right)\left( x_j^{(i)} - \mu_i \right)^T$$
• note that the ML solution is usually intuitive


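A sketch of the vector-case estimates on arbitrary synthetic data; again the covariance divides by n, not n − 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                # n = 500 hypothetical examples, d = 3

mu_hat = X.mean(axis=0)                      # (1/n) sum_j x_j
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)   # (1/n) sum_j (x_j - mu)(x_j - mu)^T
print(mu_hat)
print(Sigma_hat)
```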
Estimators
• when we talk about estimators, it is important to keep in
mind that
– an estimate is a number
– an estimator is a random variable

$$\hat{\theta} = f(X_1, \ldots, X_n)$$

• an estimate is the value of the estimator for a given sample
• if $D = \{x_1, \ldots, x_n\}$, when we say $\hat{\mu} = \frac{1}{n}\sum_j x_j$, what we mean is

$$\hat{\mu} = f(X_1, \ldots, X_n)\Big|_{X_1 = x_1, \ldots, X_n = x_n} \quad \text{with} \quad f(X_1, \ldots, X_n) = \frac{1}{n} \sum_j X_j$$

where the $X_i$ are random variables
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the expected value equal to the true value?
• this is measured by the bias
– if

$$\hat{\theta} = f(X_1, \ldots, X_n)$$

then

$$\text{Bias}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\big[ f(X_1, \ldots, X_n) - \theta \big]$$

– an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
– e.g. if θ is negative and the estimator is $f(X_1, \ldots, X_n) = \frac{1}{n} \sum_j X_j^2$, the bias is clearly non-zero
Bias and variance
• such an estimator is said to be biased
– this means that it is not expressive enough to approximate the true value arbitrarily well
– this will be clearer when we talk about density estimation
• Q2: assuming that the estimator converges to the true value, how many sample points do we need?
– this can be measured by the variance

$$\text{Var}\big(\hat{\theta}\big) = E_{X_1, \ldots, X_n}\Big[ \big( f(X_1, \ldots, X_n) - E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)] \big)^2 \Big]$$

– the variance usually decreases as one collects more training examples
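A Monte Carlo sketch of both quantities for the sample-mean estimator (the true parameters and trial count are arbitrary): redrawing many samples, the estimator's average stays near µ while its variance shrinks like σ²/n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = -1.0, 2.0, 10_000

for n in (10, 100, 1000):
    # each row is one sample of size n; each row's mean is one estimate
    estimates = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(f"n={n:5d}  bias~{estimates.mean() - mu:+.4f}  "
          f"var~{estimates.var():.4f}  (sigma^2/n = {sigma**2 / n:.4f})")
```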
Example
• ML estimator for the mean of a Gaussian $N(\mu, \sigma^2)$

$$\begin{aligned}
\text{Bias}(\hat{\mu}) &= E_{X_1, \ldots, X_n}[\hat{\mu} - \mu] = E_{X_1, \ldots, X_n}[\hat{\mu}] - \mu \\
&= E_{X_1, \ldots, X_n}\Big[ \frac{1}{n} \sum_i X_i \Big] - \mu \\
&= \frac{1}{n} \sum_i E_{X_1, \ldots, X_n}[X_i] - \mu \\
&= \frac{1}{n} \sum_i E_{X_i}[X_i] - \mu \\
&= \mu - \mu = 0
\end{aligned}$$

• the estimator is unbiased