Nuno Vasconcelos
UCSD
Bayesian decision theory
• recall that we have
– Y – state of the world
– X – observations
– g(x) – decision function
– L[g(x),y] – loss of the prediction g(x) when the true state is y
• Bayes decision rule is the rule that minimizes the risk
  Risk = E_{X,Y}[ L(g(X), Y) ]

• for Gaussian classes, the class-conditional densities are

  P_{X|Y}(x|i) = 1/sqrt( (2π)^d |Σ_i| ) · exp{ −(1/2) (x − µ_i)^T Σ_i^{-1} (x − µ_i) }
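The density formula can be checked numerically; the sketch below is my own illustration (names are hypothetical, not from the lecture), restricted to a diagonal Σ_i so the determinant and inverse stay trivial:

```python
import math

def gaussian_density(x, mu, var):
    # Illustration only: evaluates the Gaussian class-conditional for a
    # DIAGONAL covariance whose diagonal entries are given in `var`.
    d = len(x)
    det = math.prod(var)                        # |Sigma_i| of a diagonal matrix
    quad = sum((xi - mi) ** 2 / vi              # (x-mu)^T Sigma^{-1} (x-mu)
               for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** d * det)
```

At x = µ the quadratic term vanishes and the density reduces to the normalizing constant, which is an easy sanity check.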
The Gaussian classifier
(figure: two Gaussian classes; the discriminant is the set of points where P_{Y|X}(1|x) = 0.5)
• the solution is

  i*(x) = argmin_i [ d_i(x, µ_i) + α_i ]

with

  d_i(x, y) = (x − y)^T Σ_i^{-1} (x − y)

  α_i = log( (2π)^d |Σ_i| ) − 2 log P_Y(i)
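The decision rule above can be sketched directly in code; this is my own hypothetical helper (again with diagonal covariances, so d_i and α_i need no matrix algebra):

```python
import math

def discriminant(x, mu, var, prior):
    # Illustration: d_i(x, mu_i) + alpha_i for one class, diagonal covariance `var`.
    d = len(x)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    alpha = math.log((2 * math.pi) ** d * math.prod(var)) - 2 * math.log(prior)
    return maha + alpha

def classify(x, classes):
    # classes: list of (mu, var, prior); picks the class of minimum discriminant.
    return min(range(len(classes)),
               key=lambda i: discriminant(x, *classes[i]))
```

Raising a class prior lowers its α_i and so enlarges its decision region, which matches the −2 log P_Y(i) term.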
• for Gaussian classes with covariance Σ = σ²I, the boundary is the hyperplane w^T(x − x_0) = 0 with

  w = (µ_i − µ_j) / σ²

  x_0 = (µ_i + µ_j)/2 − [ σ² / ||µ_i − µ_j||² ] · log[ P_Y(i)/P_Y(j) ] · (µ_i − µ_j)

(figure: linear boundary through x_0 with normal w, between µ_i and µ_j)
Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance Σ

  w = Σ^{-1} (µ_i − µ_j)

  x_0 = (µ_i + µ_j)/2 − [ 1 / ( (µ_i − µ_j)^T Σ^{-1} (µ_i − µ_j) ) ] · log[ P_Y(i)/P_Y(j) ] · (µ_i − µ_j)

(figure: linear boundary through x_0 with normal w, between µ_i and µ_j)
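The (w, x_0) formulas can be turned into a short sketch; this is my own illustration (hypothetical names, shared diagonal covariance so Σ^{-1} is elementwise division):

```python
import math

def boundary(mu_i, mu_j, var, prior_i, prior_j):
    # Illustration: (w, x0) of the linear boundary w^T (x - x0) = 0
    # between two Gaussians sharing a diagonal covariance `var`.
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    w = [d / v for d, v in zip(diff, var)]             # Sigma^{-1} (mu_i - mu_j)
    quad = sum(d * d / v for d, v in zip(diff, var))   # (mu_i-mu_j)^T Sigma^{-1} (mu_i-mu_j)
    shift = math.log(prior_i / prior_j) / quad
    x0 = [(a + b) / 2 - shift * d for a, b, d in zip(mu_i, mu_j, diff)]
    return w, x0

def side(x, w, x0):
    # Positive -> class i, negative -> class j.
    return sum(wk * (xk - x0k) for wk, xk, x0k in zip(w, x, x0))
```

With equal priors the log term vanishes and x_0 sits exactly at the midpoint of the two means; increasing P_Y(i) slides x_0 toward µ_j.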
Bayesian decision theory
• advantages:
– BDR is optimal and cannot be beaten
– Bayes keeps you honest
– models reflect a causal interpretation of the problem; this is how
we think
– natural decomposition into “what we knew already” (the prior) and
“what the data tells us” (the class-conditional density, CCD)
– no need for heuristics to combine these two sources of info
– BDR is, almost invariably, intuitive
– Bayes rule, chain rule, and marginalization enable modularity,
and scalability to very complicated models and problems
• problems:
– the BDR is optimal only insofar as the models are correct
Implementation
• we do have an optimal solution
  w = Σ^{-1} (µ_i − µ_j)

  x_0 = (µ_i + µ_j)/2 − [ 1 / ( (µ_i − µ_j)^T Σ^{-1} (µ_i − µ_j) ) ] · log[ P_Y(i)/P_Y(j) ] · (µ_i − µ_j)
• but in practice we do not know the values of the
parameters µ, Σ, P_Y(i)
– we have to somehow estimate these values
– this is OK, we can come up with an estimate from a training set
– e.g. use the average value as an estimate for the mean
  w = Σ̂^{-1} (µ̂_i − µ̂_j)

  x_0 = (µ̂_i + µ̂_j)/2 − [ 1 / ( (µ̂_i − µ̂_j)^T Σ̂^{-1} (µ̂_i − µ̂_j) ) ] · log[ P̂_Y(i)/P̂_Y(j) ] · (µ̂_i − µ̂_j)
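The plug-in step itself is just averaging; the sketch below (my own hypothetical helper, 1-D for brevity) estimates each class mean, variance, and prior from a labeled training set:

```python
def fit_gaussian_classifier(samples):
    # Illustration: samples maps class label -> list of 1-D training points.
    # Returns {label: (mean, variance, prior)} using sample averages
    # as the plug-in estimates.
    total = sum(len(xs) for xs in samples.values())
    params = {}
    for label, xs in samples.items():
        n = len(xs)
        mu = sum(xs) / n
        var = sum((x - mu) ** 2 for x in xs) / n   # the (1/n) ML variance estimate
        params[label] = (mu, var, n / total)
    return params
```

These hatted values would then replace µ, Σ, P_Y(i) in the discriminant above.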
Important
• warning: at this point all optimality claims for the BDR
cease to be valid!!
• the BDR is guaranteed to achieve the minimum loss
only when we use the true probabilities
• when we “plug in” the probability estimates, we could be
implementing a classifier that is quite distant from the optimal
– e.g. if the P_{X|Y}(x|i) look like the example above, I could never
approximate them well by parametric models (e.g. Gaussian)
Maximum likelihood
• this seems pretty serious
– how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
– 1) we choose a parametric model for all probabilities
– to make this clear we denote the vector of parameters by Θ and
the class-conditional distributions by
PX |Y ( x | i; Θ)
– note that this means that Θ is NOT a random variable (otherwise
it would have to show up as subscript)
– it is simply a parameter, and the probabilities are a function of this
parameter
Maximum likelihood
• three steps:
– 2) we assemble a collection of datasets
  D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}}, a set of examples drawn independently from
class i
– 3) we select the parameters of class i that maximize the
probability of its data

  Θ_i* = argmax_Θ P_{X|Y}( D^{(i)} | i; Θ )
       = argmax_Θ log P_{X|Y}( D^{(i)} | i; Θ )
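Step 3 can be checked numerically; the sketch below is my own illustration (a 1-D Gaussian with known unit variance) that grid-searches the log-likelihood and confirms the maximizer sits at the sample mean:

```python
import math

def log_likelihood(data, mu):
    # log P(D; mu) for iid 1-D Gaussian samples with sigma = 1.
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in data)

def ml_by_grid(data, lo=-10.0, hi=10.0, steps=2001):
    # Brute-force argmax over a grid of candidate means (illustration only;
    # the closed-form answer is the sample mean).
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda mu: log_likelihood(data, mu))
```

The grid search is deliberately naive; its only purpose is to show that the argmax of the log-likelihood lands on the sample average, which is what the analytical solution predicts.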
• note that if ∇f = 0
– there is no direction of growth
– also −∇f = 0, so there is no direction of decrease
– we are either at a local minimum, a local maximum, or a “saddle” point
• conversely, at a local min, max, or saddle point
– there is no direction of growth or decrease
– ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type, we need second-order conditions
(figure: a local maximum, a local minimum, and a saddle point)
The Hessian
• the extension of the second-order derivative is the
Hessian matrix
  ∇²f(x) = [ ∂²f/∂x_0²(x)             ⋯  ∂²f/(∂x_0 ∂x_{n−1})(x) ]
           [        ⋮                              ⋮            ]
           [ ∂²f/(∂x_{n−1} ∂x_0)(x)   ⋯  ∂²f/∂x_{n−1}²(x)      ]

– at each point x, it gives us the quadratic function x^T ∇²f(x) x
that best approximates f(x)
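When deriving the Hessian by hand is tedious, it can be approximated with central finite differences; a minimal sketch (my own helper names, not from the lecture):

```python
def numerical_hessian(f, x, h=1e-4):
    # Illustration: approximate the Hessian of f (list[float] -> float)
    # at point x by central differences; exact for quadratics up to rounding.
    n = len(x)

    def at(i, si, j, sj):
        y = list(x)
        y[i] += si * h
        y[j] += sj * h
        return f(y)

    return [[(at(i, 1, j, 1) - at(i, 1, j, -1)
              - at(i, -1, j, 1) + at(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]
```

For i = j the formula collapses to the usual (f(x+2h) − 2f(x) + f(x−2h)) / 4h² second-derivative stencil.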
The Hessian
• this means that, when the gradient is zero at x, we have
– a maximum when the function can be locally approximated by a
∩-shaped (“upwards-facing”) quadratic
– a minimum when the function can be locally approximated by a
∪-shaped (“downwards-facing”) quadratic
– a saddle point otherwise
(figure: surfaces with a max, a min, and a saddle)
The Hessian
• the quadratic form x^T M x is
– a ∩-shaped (“upwards-facing”) quadratic when M
is negative definite
– a ∪-shaped (“downwards-facing”) quadratic when M
is positive definite
– a saddle otherwise
• hence, all that matters is the definiteness of the Hessian
• we have a maximum when the Hessian
is negative definite
(figure: quadratic surfaces for the max, min, and saddle cases)
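In the 2×2 case, definiteness reduces to the signs of the two eigenvalues, which have a closed form for a symmetric matrix; a sketch (my own naming) classifying a critical point from its Hessian:

```python
import math

def classify_critical_point(H):
    # Illustration: H is a symmetric 2x2 Hessian [[a, b], [b, c]].
    # Eigenvalues of a symmetric 2x2 matrix: (a+c)/2 +/- sqrt(((a-c)/2)^2 + b^2).
    a, b, c = H[0][0], H[0][1], H[1][1]
    mid = (a + c) / 2
    r = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lo_ev, hi_ev = mid - r, mid + r
    if hi_ev < 0:
        return "maximum"      # negative definite
    if lo_ev > 0:
        return "minimum"      # positive definite
    if lo_ev < 0 < hi_ev:
        return "saddle"       # indefinite
    return "degenerate"       # a zero eigenvalue: the second-order test is inconclusive
```

The "degenerate" branch matters: a zero eigenvalue means the quadratic approximation alone cannot decide the type of the critical point.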
Maximum likelihood
• in summary, given a sample, we need to solve
  ∇_Θ P_X(D; Θ) = 0

  θ^T ∇²_Θ P_X(D; θ) θ ≤ 0, ∀θ ∈ ℝ^n
Maximum likelihood
• for an iid sample D = {x_1, ..., x_n}, the log-likelihood is

  log P_X(D; Θ) = Σ_j log P_X(x_j; Θ)

• or, for the Gaussian model,

  log P_X(D; µ, Σ) = −(n/2) log[ (2π)^d |Σ| ] − (1/2) Σ_j (x_j − µ)^T Σ^{-1} (x_j − µ)
Homework
• show that the Hessian is negative definite

  θ^T ∇²_Θ P_X(D; θ) θ ≤ 0, ∀θ ∈ ℝ^n

at the ML solutions

  µ_i = (1/n) Σ_j x_j^{(i)}

  Σ_i = (1/n) Σ_j ( x_j^{(i)} − µ_i )( x_j^{(i)} − µ_i )^T
• an estimator is a function of the sample

  θ̂ = f(X_1, ..., X_n)

• an estimate is the value of the estimator for a given
sample
• if D = {x_1, ..., x_n}, when we say µ̂ = (1/n) Σ_j x_j,
what we mean is

  µ̂ = f(X_1, ..., X_n) evaluated at X_1 = x_1, ..., X_n = x_n, with f(X_1, ..., X_n) = (1/n) Σ_j X_j

• the X_i are random variables
Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the expected value equal to the true value?
• this is measured by the bias
– if

  θ̂ = f(X_1, ..., X_n)

then

  Bias(θ̂) = E_{X_1,...,X_n}[ f(X_1, ..., X_n) − θ ]

– an estimator that has bias will usually not converge to the perfect
estimate θ, no matter how large the sample is
– e.g. if θ is negative and the estimator is f(X_1, ..., X_n) = (1/n) Σ_j X_j²,
which is always non-negative
Bias and variance
• the estimator is then said to be biased
– this means that it is not expressive enough to approximate the
true value arbitrarily well
– this will be clearer when we talk about density estimation
• Q2: assuming that the estimator converges to the true
value, how many sample points do we need?
– this can be measured by the variance
  Var(θ̂) = E_{X_1,...,X_n}{ ( f(X_1, ..., X_n) − E_{X_1,...,X_n}[ f(X_1, ..., X_n) ] )² }

• e.g. for the sample mean, the bias is

  Bias(µ̂) = E_{X_1,...,X_n}[ (1/n) Σ_i X_i ] − µ
           = (1/n) Σ_i E_{X_1,...,X_n}[ X_i ] − µ
           = (1/n) Σ_i E_X[ X_i ] − µ
           = µ − µ = 0
• the estimator is unbiased
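Both quantities can be probed by simulation; a sketch (my own, assuming Gaussian data) that repeatedly draws samples, applies the sample-mean estimator, and checks that its average tracks µ (no bias) while its spread shrinks like σ²/n:

```python
import random

def sample_mean_stats(n, trials, mu=0.0, sigma=1.0, seed=0):
    # Illustration: empirical mean and variance of the sample-mean estimator
    # over many independent samples of size n.
    rng = random.Random(seed)
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n
             for _ in range(trials)]
    avg = sum(means) / trials
    var = sum((m - avg) ** 2 for m in means) / trials
    return avg, var
```

With n = 10 and σ = 1 the empirical variance should hover near σ²/n = 0.1, and quadrupling n should roughly quarter it.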