9781107035904ar 2014/1/6 20:35 page 470 #484
27  Model selection
\[
\sum_{k=1}^{E} a_k\, \phi_k(x_i) := \sum_{k=1}^{E} \Phi_{i,k}\, a_k .
\]
We make the fairly weak assumption that the column vectors in the matrix are still
linearly independent. Otherwise we eliminate linearly dependent vectors from the set.
The unknowns to be determined are the expansion coefficients $a_1, \ldots, a_E$. Evidently
the expansion becomes complete as $E$ approaches $N$. In the limit $E = N$ the expansion
reproduces every measured data point exactly. What we have found is then a model
that describes signal plus noise. This is not of interest at all; what we really want is
a representation which is optimal in the sense that it solely describes the physical model,
avoiding the fitting of uninteresting noise. An expansion meeting this requirement will be
a truncated and hence more parsimonious one.
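This behaviour of the complete expansion can be made concrete in a small sketch. A polynomial basis and the specific signal are illustrative assumptions; the chapter's basis functions $\phi_k$ are generic:

```python
import numpy as np

# Sketch: with as many basis functions as data points (E = N), the expansion
# reproduces every datum exactly -- noise included. A polynomial basis is an
# assumption for illustration only.
rng = np.random.default_rng(0)
N = 8
x = np.linspace(0.0, 1.0, N)
d = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)  # signal + noise

Phi = np.vander(x, N, increasing=True)  # design matrix Phi[i, k] = x_i**k
a = np.linalg.solve(Phi, d)             # E = N: unique interpolating coefficients
assert np.allclose(Phi @ a, d)          # zero residual: the noise is fitted too
```

The interpolation is exact because the square design matrix is invertible for distinct $x_i$; a truncated expansion ($E < N$) would leave a residual of the order of the noise.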
The philosophy just described is very old and is known as Ockham's principle: employ
a model as complicated as necessary to describe the essentials of the data and as simple
as possible to avoid the fitting of noise [22, 80]. General considerations concerning
Ockham's razor have already been given in Section 3.3 [p. 43]. Ockham's principle can
be expressed in quantitative terms by use of Bayesian probability theory. We have already
encountered this topic in various contexts and differing guises, in Chapter 17 [p. 255] and
Section 3.2.4 [p. 38]. In this section we want to apply these ideas to real-world problems.
The quantity we want to know is $P(E|d, N, I)$, the probability of expansion order $E$ of a
system of functions specified by $I$ when applied to characterize a given set of data $d$. From
Bayes' theorem,

\[
P(E|d,N,I) = \frac{P(E|N,I)\, p(d|E,N,I)}{p(d|N,I)} . \qquad (27.1)
\]
This probability is given by a prior probability $P(E|N,I)$ and the marginal likelihood

\[
p(d|E,N,I) = \int d^{E}a \; p(a|E,N,I)\, p(d|a,E,N,I), \qquad (27.2)
\]
which we obtain from the full likelihood by marginalizing over the possible values of the
expansion coefficients a. Noting further that
\[
p(d|N,I) = \sum_{E} P(E|N,I)\, p(d|E,N,I), \qquad (27.3)
\]
we see when returning to equation (27.1) that $P(E|d,N,I)$ equals one, regardless of what
the prior and the likelihood are, if we consider a single, isolated model, which means
$P(E|N,I) = \delta_{E,E_0}$. The question of what the probability of a given model is does not
make sense at all. What does make sense, however, is the question of whether $P(E_1|d,N,I)$
is greater or smaller than $P(E_2|d,N,I)$, or more generally which expansion order $E_k \in
\{1, \ldots, N\}$ is the most probable one. This meaningful question may be answered by
considering the odds ratio
\[
\frac{P(E_j|d,N,I)}{P(E_k|d,N,I)} = \frac{P(E_j|N,I)\, p(d|E_j,N,I)}{P(E_k|N,I)\, p(d|E_k,N,I)} \qquad (27.4)
\]
of two alternatives of interest. The odds ratio factors into two terms, the prior odds P (Ej |
N, I)/P (Ek |N, I) and the Bayes factor p(d|Ej , N, I)/p(d|Ek , N, I). The prior odds
specifies our preference between Ej and Ek before we have made use of the new data d.
The prior odds represents expert knowledge which may stem from earlier experiments,
related facts from other sources, etc. The prior odds may favour model Ej over Ek even
if the marginal likelihood p(d|Ej , N, I) is smaller than p(d|Ek , N, I). Most often, however, the prior odds is chosen to be one, which expresses complete indifference instead of
preference. In fact, this status of indifference is the main driving force for the design and
performance of new experiments in the natural sciences.
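The interplay of equations (27.1)-(27.4) can be exercised with a small numerical sketch. All numbers below are made up for illustration; note that the odds ratio never requires the evidence $p(d|N,I)$:

```python
import numpy as np

# Sketch of equations (27.1)-(27.4) with hypothetical numbers: an indifferent
# prior P(E|N,I) and invented marginal likelihoods p(d|E,N,I) for E = 1..5.
E = np.arange(1, 6)
prior = np.full(5, 0.2)                                 # complete indifference
marg_like = np.array([1e-12, 4e-9, 3e-8, 2e-8, 9e-9])   # made-up p(d|E,N,I)

evidence = np.sum(prior * marg_like)          # p(d|N,I), equation (27.3)
posterior = prior * marg_like / evidence      # equation (27.1)

# The odds ratio (27.4) of two orders does not involve the evidence at all:
odds_31 = posterior[2] / posterior[0]
assert np.isclose(odds_31, marg_like[2] / marg_like[0])
print("most probable order:", E[np.argmax(posterior)])
```

With indifferent prior odds the posterior ranking is the ranking of the marginal likelihoods, exactly the Bayes-factor comparison discussed above.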
We shall now investigate the Bayes factor a little further. In the case of well-conditioned
(informative) data, the likelihood $p(d|a,E,N,I)$ will exhibit a well-localized maximum
$\hat a$ as a function of $a$. Let us assume in particular that the prior in equation (27.2) is
comparatively diffuse. A sensible approximation to the marginal likelihood is then obtained by
pulling the prior in equation (27.2) with $a = \hat a$ out of the integral:

\[
p(d|E,N,I) \approx p(\hat a|E,N,I) \int da\; p(d|a,E,N,I). \qquad (27.5)
\]
The remaining integral is in turn approximated by the height of the likelihood peak times its
support:

\[
p(d|E,N,I) \approx p(\hat a|E,N,I)\, p(d|\hat a,E,N,I)\, (\delta a)^E , \qquad (27.6)
\]

where $(\delta a)^E$ is an effective volume occupied by the likelihood in parameter space. In order
to highlight the key idea, we assume a uniform and normalized prior. Then

\[
p(\hat a|E,N,I) = 1/(\Delta a)^E , \qquad (27.7)
\]

where $(\Delta a)^E$ is the prior volume in parameter space. Merging equations (27.6) and (27.7)
we obtain, for the logarithm of the marginal likelihood,
\[
\ln p(d|E,N,I) \approx \ln p(d|\hat a, E,N,I) + E \ln \frac{\delta a}{\Delta a} . \qquad (27.8)
\]
Since, according to our assumption, the prior is more diffuse than the likelihood, $\delta a < \Delta a$,
we arrive at the important result that the marginal likelihood is approximately equal to the
maximum likelihood penalized by $E \ln(\Delta a/\delta a)$. This penalty may be detailed further,
since $\delta a$ is certainly a function of the number of data. In fact, in the limiting case of the
number of data approaching infinity, $\delta a$ will vanish and the likelihood collapses to a
multidimensional unnormalized delta function in $a$-space. The approximate dependence of the
width of the likelihood on the number of data can be guessed from previously
discussed examples (see, for example, equation (21.7) [p. 335]) to be inversely proportional
to the square root of the number of data; hence, if $\sigma$ is the standard deviation of a
single datum, then $\sigma/\sqrt{N}$ is approximately the width of the likelihood along one direction
in parameter space. Then our final estimate for the marginal likelihood becomes
\[
\ln p(d|E,N,I) \approx \ln p(d|\hat a, E,N,I) - \frac{E}{2} \ln\!\left[N\left(\frac{\Delta a}{\sigma}\right)^{2}\right]. \qquad (27.9)
\]
Equation (27.9) is a form of the Bayesian information criterion [125], which measures the
information content of the data with respect to a model specified by the set of parameters a regardless of the numerical values of the parameters. This information is obtained
from the maximum likelihood of the problem penalized by a term which accounts for the
complexity of the model represented by E and the precision of the likelihood given by the
number of data N . We see that Ockhams razor is an integral part of Bayesian probability
theory. Note that our hand-waving derivation is far from being rigorous. In the ensuing
examples, the Bayesian expressions will be evaluated rigorously and we will see that the
key features are still the same.
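The penalized criterion (27.9) can be tried out numerically. A minimal sketch, assuming Gaussian noise of known level $\sigma$ and a polynomial basis (both assumptions are illustrative, not the book's worked examples):

```python
import numpy as np

# Sketch of equation (27.9): choose the number of expansion coefficients E by
# the maximum log-likelihood penalized with (E/2) ln N, a BIC-type criterion.
rng = np.random.default_rng(1)
N, sigma = 50, 0.2
x = np.linspace(-1.0, 1.0, N)
d = 1.0 - x + 1.5 * x**3 + sigma * rng.standard_normal(N)  # cubic "truth"

scores = {}
for E in range(1, 9):                           # E expansion coefficients
    Phi = np.vander(x, E, increasing=True)
    a_hat, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    resid = d - Phi @ a_hat
    log_like = -0.5 * np.sum(resid**2) / sigma**2   # up to an E-independent constant
    scores[E] = log_like - 0.5 * E * np.log(N)      # penalty term of (27.9)

best_E = max(scores, key=scores.get)
print("selected number of coefficients:", best_E)
```

The unpenalized log-likelihood grows monotonically with $E$ (nested models), while the penalized score peaks near the true complexity: Ockham's razor in action.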
For the initially considered example of an orthogonal expansion we can even push the
formal evaluation of the Bayes factor using equation (27.9) one step further. An orthogonal
expansion constitutes an example of the important type of nested models. A family of
models is called nested if the model of (expansion) order $E_j$ is completely contained in
the model of (expansion) order $E_k$ for $E_k > E_j$. The prominent property of such a model
family is that the likelihood is a monotonic function of the expansion order: a more
complicated model provides a fit at least as good as a simpler one, and most often even better.
For the log of the Bayes factor $B_E := p(d|E+1,N,I)/p(d|E,N,I)$ for two consecutive
models of a family of nested models we find

\[
\ln B_E \approx \ln \frac{p(d|\hat a_{E+1}, E+1, N, I)}{p(d|\hat a_{E}, E, N, I)} - \frac{1}{2} \ln\!\left[N\left(\frac{\Delta a}{\sigma}\right)^{2}\right]. \qquad (27.10)
\]
The conclusion is that expansion order $E+1$ is accepted only if the corresponding
log-likelihood is sufficiently larger than that of model $E$, such that the penalty term
$\frac{1}{2}\ln[N(\Delta a/\sigma)^2]$ is overcompensated. Equation (27.10) allows us to draw a final
conclusion with validity beyond the special case used to set it up. The penalty term in
equation (27.10) depends also on the prior width, and the penalty is a monotonic function of
$\Delta a$. Hence, if we choose the prior to be sufficiently uninformative, $\Delta a \to \infty$,
then the simpler model (and, by iteration of the argument, the simplest member of the model
family) will win the competition. This means that improper priors, though in many cases
acceptable in parameter estimation, are entirely useless in model selection problems. Model
selection depends critically on the prior employed in the analysis and due care should be
exercised in order to choose it to be as informative as possible. The deeper meaning of
Ockham's razor has been discussed in Section 3.3 [p. 45].
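The effect of the prior width on the verdict can be sketched in a few lines. The log-likelihood gain of the richer model is a hypothetical fixed number chosen for illustration:

```python
import numpy as np

# Sketch of the penalty in equation (27.10): (1/2) ln[N (Δa/σ)²] per extra
# coefficient grows without bound with the prior width Δa, so an arbitrarily
# uninformative prior always makes the simpler model win.
N, sigma = 100, 0.2
log_like_gain = 5.0   # hypothetical ln-likelihood gain of model E+1 over model E

verdicts = []
for delta_a in [1.0, 10.0, 100.0, 1e4]:
    penalty = 0.5 * np.log(N * (delta_a / sigma) ** 2)
    ln_B = log_like_gain - penalty            # cf. equation (27.10)
    verdicts.append("E+1" if ln_B > 0 else "E")
    print(f"prior width {delta_a:8.1f}: ln B_E = {ln_B:6.2f} -> prefer {verdicts[-1]}")
```

For a narrow prior the data-driven gain wins; as the prior is made ever less informative the penalty dominates and the simpler model is always preferred, which is the point made above about improper priors.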
For optically allowed transitions the ionization cross-section behaves asymptotically as

\[
\sigma(E) \propto \frac{1}{E} \ln \frac{E}{c} , \qquad (27.11)
\]
with an adjustable parameter c. If optical transitions between the initial and final state are
forbidden, then
\[
\sigma(E) \propto \frac{1}{E} . \qquad (27.12)
\]
Equation (27.11) must of course still be modified to provide a sensible representation of the
cross-section for all energies. One obvious modification is the choice of $c$. If we choose $c$
to be equal to the threshold energy $E_0$ for ionization, then $\sigma(E)$ tends to zero as the
threshold energy is approached from $E > E_0$. We introduce as a new variable the reduced energy
$E/E_0$, which we denote again by $E$, describe the threshold behaviour by an additional factor
$(E-1)^{\lambda}$ and introduce a constant $\gamma$ which tunes the transition to the asymptotic
region:

\[
\sigma(E) = \sigma_0\, \frac{(E-1)^{\lambda}}{\gamma + E\,(E-1)^{\lambda}}\, \ln E . \qquad (27.13)
\]
A similar modification is required for equation (27.12). For the region close to the threshold,
$E = 1 + \Delta E$, we notice that equation (27.13) becomes proportional to $(\Delta E)^{\lambda+1}$. The
appropriate modification of equation (27.12) is then

\[
\sigma(E) = \sigma_0\, \frac{(E-1)^{\lambda+1}}{\gamma + E\,(E-1)^{\lambda+1}} . \qquad (27.14)
\]
The parameters $\sigma_0$, $\gamma$ and $\lambda$ now have exactly the same meaning in both cases.
The prior $p(\sigma_0, \gamma, \lambda|I)$ for a Bayesian estimate of the parameters is therefore
also the same in both cases. Assuming then that we adopt a prior which is uniform over some
sensible, positive, sufficiently large range implies that the Bayes factor for comparison of the
two models in equations (27.13) and (27.14) is reduced to the ratio of maximum likelihoods. This
in turn means that in the present case a better fit means a more probable model, since the
expense of volume in parameter space is model independent.
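Since the uniform priors cancel, the comparison reduces to a ratio of maximum likelihoods. A minimal numerical sketch with synthetic data (the data, noise level and parameter grids are all invented for illustration; the amplitude $\sigma_0$ is profiled analytically for each grid point):

```python
import numpy as np

# Sketch: compare the allowed (27.13) and forbidden (27.14) cross-section
# shapes by their maximum likelihoods over a coarse (gamma, lambda) grid.

def shape_allowed(E, gamma, lam):      # equation (27.13) without sigma0
    return (E - 1) ** lam / (gamma + E * (E - 1) ** lam) * np.log(E)

def shape_forbidden(E, gamma, lam):    # equation (27.14) without sigma0
    return (E - 1) ** (lam + 1) / (gamma + E * (E - 1) ** (lam + 1))

rng = np.random.default_rng(2)
E = np.linspace(1.05, 10.0, 40)        # reduced energy above threshold
noise = 0.02
data = 1.3 * shape_allowed(E, 0.5, 1.0) + noise * rng.standard_normal(E.size)

def best_log_like(shape):
    best = -np.inf
    for gamma in np.linspace(0.1, 2.0, 20):
        for lam in np.linspace(0.5, 2.0, 16):
            f = shape(E, gamma, lam)
            s0 = np.dot(f, data) / np.dot(f, f)          # optimal amplitude
            chi2 = np.sum((data - s0 * f) ** 2) / noise**2
            best = max(best, -0.5 * chi2)
    return best

ln_bf = best_log_like(shape_allowed) - best_log_like(shape_forbidden)
print("log Bayes factor (allowed vs forbidden):", round(ln_bf, 1))
```

Because the data were generated from the allowed shape, the maximum-likelihood ratio, and hence the Bayes factor under the cancelling uniform priors, favours model (27.13), mirroring the clear-cut verdicts reported for the measured channels below.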
Figure 27.1 shows data and fits for electron impact ionization of H2. The ionization
channel leading to H2+ is classified as an optically allowed transition with log(odds)
= +7.6, while the dissociative ionization channel leading to an atomic ion H+ turns out
to be an optically forbidden transition with log(odds) = +64.5. The analysis has therefore
made a model choice with very high selectivity and provides good reason to employ equations (27.13) and (27.14) for the calculation of rate coefficients for arbitrary temperatures.
This example constitutes the simplest possible case of Bayesian model selection, since we
have also taken the prior odds equal to one.