Moreover, if f is strictly convex, then E[f(X)] = f(E[X]) holds if and only if
X = E[X] with probability 1 (i.e., X is almost surely a constant).
Jensen’s inequality also holds for concave functions f, but with the direction of all the
inequalities reversed (E[f(X)] ≤ f(E[X]), etc.).
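As a quick sanity check (not part of the original notes), the following sketch verifies both directions numerically; the discrete distribution and the choices f(x) = x² (strictly convex) and log x (concave) are arbitrary illustrations:

```python
# Numerical sanity check of Jensen's inequality (a minimal sketch; the
# distribution and the functions below are arbitrary illustrative choices).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # support of X
p = np.array([0.1, 0.2, 0.3, 0.4])      # probabilities, summing to 1

f = lambda t: t ** 2                     # strictly convex
g = np.log                               # concave

E_X = np.dot(p, x)
print(np.dot(p, f(x)) >= f(E_X))         # True: E[f(X)] >= f(E[X])
print(np.dot(p, g(x)) <= g(E_X))         # True: E[g(X)] <= g(E[X]) (reversed)

# For a constant random variable the strict-convexity equality case holds.
x_const = np.full(4, 2.5)
print(np.isclose(np.dot(p, f(x_const)), f(np.dot(p, x_const))))  # True
```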
If the values x_i and f(x_i) both lie in the domain of Φ, we can replace x_i with f(x_i) and
still get

Φ(∑_{i=1}^{n} p_i f(x_i)) ≤ ∑_{i=1}^{n} p_i Φ(f(x_i))
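A minimal numerical check of this composed form, assuming the illustrative choices Φ(y) = eʸ and f(x) = sin x (neither is prescribed by the notes):

```python
# Check of the composed finite form (a sketch; Phi and f are assumptions).
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0])
p = np.array([0.25, 0.25, 0.25, 0.25])   # weights p_i, summing to 1

f = np.sin                                # maps x_i into the domain of Phi
Phi = np.exp                              # convex on the whole real line

lhs = Phi(np.dot(p, f(x)))                # Phi(sum_i p_i f(x_i))
rhs = np.dot(p, Phi(f(x)))                # sum_i p_i Phi(f(x_i))
print(lhs <= rhs)                         # True
```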
For the continuous case with ∫_{x∈S} p(x) dx = 1, if the values x and f(x) both lie in the
domain of Φ, we similarly get

Φ(∫_{x∈S} f(x) p(x) dx) ≤ ∫_{x∈S} Φ(f(x)) p(x) dx
Suppose Ω is a measurable subset of the real line and f(x) is a non-negative function
such that

∫_{−∞}^{∞} f(x) dx = 1
Then Jensen’s inequality becomes the following statement about convex integrals:
If g is any real-valued measurable function and φ is convex over the range of g, then

φ(∫_{−∞}^{∞} g(x) f(x) dx) ≤ ∫_{−∞}^{∞} φ(g(x)) f(x) dx.
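The integral form can be spot-checked with numerical quadrature; in this sketch the density f, the function g, and φ(y) = y² are all assumed for illustration (taking g(x) = x here also matches the special case noted just below):

```python
# Checking the integral form numerically (a sketch under assumed choices of
# f, g, and phi; f is the standard normal density, which integrates to 1).
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)  # density
g = lambda x: x
phi = lambda y: y ** 2                                    # convex

lhs = phi(quad(lambda x: g(x) * f(x), -np.inf, np.inf)[0])  # phi(E[g(X)]) = 0
rhs = quad(lambda x: phi(g(x)) * f(x), -np.inf, np.inf)[0]  # E[phi(g(X))] = 1
print(lhs <= rhs)                                           # True: 0 <= 1
```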
If g(x) = x, then this form of the inequality reduces to a commonly used special case:
φ(∫_{−∞}^{∞} x f(x) dx) ≤ ∫_{−∞}^{∞} φ(x) f(x) dx.
Let Ω = {x_1, …, x_n} and take µ to be the counting measure on Ω; then the general
form reduces to a statement about sums:

φ(∑_{i=1}^{n} g(x_i) f(x_i)) ≤ ∑_{i=1}^{n} φ(g(x_i)) f(x_i)

provided that the weights λ_i = f(x_i) satisfy λ_i ≥ 0 and λ_1 + ⋯ + λ_n = 1.
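The same kind of check for the counting-measure (sum) form, with made-up weights λ_i and illustrative choices of g and φ:

```python
# Counting-measure / sum form of Jensen's inequality (a sketch; the weights
# lambda_i, the points x_i, g, and phi are all illustrative assumptions).
import numpy as np

lam = np.array([0.2, 0.3, 0.5])          # lambda_i >= 0, summing to 1
x = np.array([-1.0, 0.0, 2.0])
g = lambda t: 2 * t + 1
phi = np.abs                              # convex

print(phi(np.dot(lam, g(x))) <= np.dot(lam, phi(g(x))))  # True
```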
Gibbs’ inequality
If p(x) is the true probability distribution for x, and q(x) is another distribution, then
applying Jensen’s inequality for the random variable Y(x) = q(x)/p(x) and the
function φ(y) = −log(y) gives
E[φ(Y)] ≥ φ(E[Y])

Therefore:

D(p‖q) = ∫ p(x) log(p(x)/q(x)) dx = E[−log(Y)]
       ≥ −log(E[Y]) = −log(∫ p(x) · q(x)/p(x) dx) = −log(∫ q(x) dx) = 0
This shows that the average message length is minimized when codes are assigned on the
basis of the true probabilities p rather than any other distribution q. The non-negative
quantity D(p‖q) is called the Kullback–Leibler divergence of q from p.
Since −log(x) is a strictly convex function for x > 0, equality holds if and only if
p(x) equals q(x) almost everywhere.
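A small numerical illustration of Gibbs’ inequality; the two discrete distributions p and q below are made-up examples, not from the notes:

```python
# Quick check of Gibbs' inequality / non-negativity of KL divergence
# (a sketch; p and q are arbitrary example distributions).
import numpy as np

p = np.array([0.5, 0.3, 0.2])             # "true" distribution
q = np.array([0.4, 0.4, 0.2])             # any other distribution

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) = sum_x p(x) log(p(x)/q(x))."""
    return np.sum(p * np.log(p / q))

print(kl(p, q) >= 0)                      # True: D(p || q) is non-negative
print(np.isclose(kl(p, p), 0.0))          # True: equality when q equals p

# Average code length (cross entropy) is minimized by coding with p itself:
print(-np.dot(p, np.log(q)) >= -np.dot(p, np.log(p)))  # True
```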
Notes:
Compared with other notes, the notes from Richard Yida Xu and Wikipedia seem to
better match the derivation of EM and the KL-divergence.
Reference