
9781107035904ar 2014/1/6 20:35 page 470 #484

27
Model selection

Interpretation of a set of experimental measurements requires us to choose a model which will fit the data to some extent. The model choice may not be unique, but model Mk may
be just one possible choice out of a family of models {Mi }. In this case we are finally
faced with the problem of deciding which member Mk of the family represents the most
probable physics hidden in the measured data. It is immediately clear that the choice of the
best model is a topic of paramount importance since it will most probably influence the
direction of further research. The traditional tendency is to equate best fit to best model.
This is certainly incorrect, as will be explained by reference to the fit of an orthogonal
expansion to the experimental and hence noisy representation of a function. Consider N
data points di corresponding to N different values xi of an input variable. The theoretical model function f(x) shall be a linear combination of E orthonormal basis functions φk(x); the theoretical prediction based on this model then reads

f(x_i) = \sum_{k=1}^{E} a_k\, \varphi_k(x_i) := \sum_{k=1}^{E} \Phi_{ik}\, a_k .

We make the fairly weak assumption that the column vectors of the matrix Φ are still linearly independent. Otherwise we eliminate linearly dependent vectors from the set.
The unknowns to be determined are the expansion coefficients a1, ..., aE. Obviously the expansion becomes complete as E approaches N. In the limit E = N the expansion reproduces every measured data point exactly. What we have found is then a model that describes signal plus noise. This is not of interest at all; what we really want is a representation which is optimal in the sense that it describes solely the physical model and avoids the fitting of uninteresting noise. An expansion meeting this requirement will be a truncated and hence more parsimonious one.
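The effect of pushing E towards N can be made concrete with a small numerical sketch. Everything in it (the cosine basis, the sample size, the noise level) is an illustrative assumption, not taken from this chapter:

```python
import numpy as np

# Sketch: fit a cosine basis of increasing order E to noisy samples of a
# smooth function. All settings here are illustrative assumptions.
rng = np.random.default_rng(0)

N = 20
x = np.linspace(0.0, 1.0, N)
true_f = np.sin(2 * np.pi * x)                 # the underlying "physics"
d = true_f + 0.1 * rng.standard_normal(N)      # noisy measurements d_i

def design_matrix(x, E):
    """Columns Phi_{ik} = phi_k(x_i) for a cosine basis, k = 0..E-1."""
    return np.cos(np.pi * np.outer(x, np.arange(E)))

for E in (3, 10, N):
    Phi = design_matrix(x, E)
    a, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    misfit_data = np.linalg.norm(Phi @ a - d)       # -> 0 as E -> N
    misfit_true = np.linalg.norm(Phi @ a - true_f)  # typically grows again near E = N
    print(f"E={E:2d}  |fit - data| = {misfit_data:.3f}  "
          f"|fit - truth| = {misfit_true:.3f}")
```

At E = N the residual with respect to the data vanishes, i.e. the "model" has memorized the noise, while the residual with respect to the noise-free function typically passes through a minimum at some intermediate, parsimonious order.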
The philosophy just described is very old and is known as Ockham's principle: employ a model as complicated as necessary to describe the essentials of the data, and as simple as possible to avoid the fitting of noise [22, 80]. General considerations concerning Ockham's razor have already been given in Section 3.3 [p. 43]. Ockham's principle can be cast in quantitative terms by use of Bayesian probability theory. We have encountered this topic in various contexts and differing guises already, in Chapter 17 [p. 255] and
Section 3.2.4 [p. 38]. In this section we want to apply these ideas to real-world problems.
The quantity we want to know is P(E|d, N, I), the probability of expansion order E of a
system of functions specified by I when applied to characterize a given set of data d. From
Bayes' theorem,

P(E|d, N, I) = \frac{P(E|N, I)\, p(d|E, N, I)}{p(d|N, I)} .   (27.1)

This probability is given by the prior probability P(E|N, I) and the marginal likelihood

p(d|E, N, I) = \int d^{E}a\; p(a|E, N, I)\, p(d|a, E, N, I) ,   (27.2)

which we obtain from the full likelihood by marginalizing over the possible values of the expansion coefficients a. Noting further that

p(d|N, I) = \sum_{E} P(E|N, I)\, p(d|E, N, I) ,   (27.3)
we see when returning to equation (27.1) that P(E|d, N, I) equals one, regardless of what the prior and the likelihood are, if we consider a single, isolated model, i.e. P(E|N, I) = δ_{E,E0}. The question of what the probability of a given model is does not make sense at all. What makes sense, however, is the question of whether P(E1|d, N, I) is greater or smaller than P(E2|d, N, I), or more generally which expansion order Ek ∈ {1, ..., N} is the most probable one. This meaningful question may be answered by considering the odds ratio
\frac{P(E_j|d, N, I)}{P(E_k|d, N, I)} = \frac{P(E_j|N, I)\, p(d|E_j, N, I)}{P(E_k|N, I)\, p(d|E_k, N, I)}   (27.4)
of two alternatives of interest. The odds ratio factors into two terms, the prior odds P(Ej|N, I)/P(Ek|N, I) and the Bayes factor p(d|Ej, N, I)/p(d|Ek, N, I). The prior odds
specifies our preference between Ej and Ek before we have made use of the new data d.
The prior odds represents expert knowledge which may stem from earlier experiments,
related facts from other sources, etc. The prior odds may favour model Ej over Ek even
if the marginal likelihood p(d|Ej , N, I) is smaller than p(d|Ek , N, I). Most often, however, the prior odds is chosen to be one, which expresses complete indifference instead of
preference. In fact, this status of indifference is the main driving force for the design and
performance of new experiments in the natural sciences.
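The factorization into prior odds and Bayes factor can be exercised on a minimal toy problem in which both marginal likelihoods are analytic. The setup below (Gaussian data of known standard deviation; model M0 fixes the mean at zero, model M1 puts a Gaussian prior of width w on it) is an illustrative assumption, not an example from this chapter:

```python
import numpy as np

# Sketch with assumed numbers: Gaussian data of known standard deviation s.
# M0: d_i ~ N(0, s^2).  M1: d_i ~ N(mu, s^2) with prior mu ~ N(0, w^2).
# Both marginal likelihoods are then analytic.
rng = np.random.default_rng(1)
s, w, N = 1.0, 3.0, 50
d = rng.normal(0.4, s, N)              # data generated with a small offset

def log_marglik_M0(d, s):
    # ln p(d|M0): plain Gaussian likelihood at mu = 0
    return -0.5 * np.sum(d**2) / s**2 - len(d) * np.log(s * np.sqrt(2 * np.pi))

def log_marglik_M1(d, s, w):
    # ln p(d|M1): Gaussian likelihood integrated over the Gaussian prior on mu
    n = len(d)
    var_post = 1.0 / (n / s**2 + 1.0 / w**2)
    mean_term = 0.5 * var_post * (np.sum(d) / s**2) ** 2
    occam_term = 0.5 * np.log(var_post / w**2)   # volume ratio, always <= 0
    return log_marglik_M0(d, s) + mean_term + occam_term

prior_odds = 1.0                       # complete indifference between M1 and M0
log_bayes = log_marglik_M1(d, s, w) - log_marglik_M0(d, s)
posterior_odds = prior_odds * np.exp(log_bayes)
print(f"ln(Bayes factor) = {log_bayes:.2f}, "
      f"posterior odds M1:M0 = {posterior_odds:.2f}")
```

With prior odds equal to one, the posterior odds coincide with the Bayes factor; an expert preference for one of the models would simply rescale the result.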
We shall now investigate the Bayes factor a little further. In the case of well-conditioned (informative) data, the likelihood p(d|a, E, N, I) will exhibit a well-localized maximum â as a function of a. Let us assume in particular that the prior in equation (27.2) is comparatively diffuse. A sensible approximation to the marginal likelihood is then obtained by pulling the prior in equation (27.2), evaluated at a = â, out of the integral:

p(d|E, N, I) \approx p(\hat{a}|E, N, I) \int d^{E}a\; p(d|a, E, N, I) .   (27.5)

The integral may further be characterized by

p(d|E, N, I) \approx p(\hat{a}|E, N, I)\, p(d|\hat{a}, E, N, I)\, (\delta a)^{E} ,   (27.6)

where (δa)^E is an effective volume occupied by the likelihood in parameter space. In order to highlight the key idea, we assume a uniform and normalized prior. Then

p(a|E, N, I) = 1/(\Delta a)^{E} ,   (27.7)

where (Δa)^E is the prior volume in parameter space. Merging equations (27.6) and (27.7) we obtain, for the logarithm of the marginal likelihood,
\ln p(d|E, N, I) \approx \ln p(d|\hat{a}, E, N, I) + E \ln \frac{\delta a}{\Delta a} .   (27.8)

Since, according to our assumption, the prior is more diffuse than the likelihood, δa < Δa, we arrive at the important result that the marginal likelihood is approximately equal to the maximum likelihood penalized by E ln(Δa/δa). This penalty may be detailed further, since δa is certainly a function of the number of data. In fact, in the limiting case of the number of data approaching infinity, δa will vanish and the likelihood collapses to a multidimensional unnormalized delta function in a-space. The approximate dependence of the width of the likelihood on the number of data can be guessed from previously discussed examples (see, for example, equation (21.7) [p. 335]) to be inversely proportional to the square root of the number of data; hence, if σ is the standard deviation of a single datum, then σ/√N is approximately the width of the likelihood along one direction in parameter space. Then our final estimate for the marginal likelihood becomes


\ln p(d|E, N, I) \approx \ln p(d|\hat{a}, E, N, I) - \frac{E}{2} \ln N \left( \frac{\Delta a}{\sigma} \right)^{2} .   (27.9)

Equation (27.9) is a form of the Bayesian information criterion [125], which measures the
information content of the data with respect to a model specified by the set of parameters a regardless of the numerical values of the parameters. This information is obtained
from the maximum likelihood of the problem penalized by a term which accounts for the
complexity of the model represented by E and the precision of the likelihood given by the
number of data N . We see that Ockhams razor is an integral part of Bayesian probability
theory. Note that our hand-waving derivation is far from being rigorous. In the ensuing
examples, the Bayesian expressions will be evaluated rigorously and we will see that the
key features are still the same.
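As a hedged numerical companion to equation (27.9) — polynomial basis, noise level and prior width Δa are all assumptions made for the sketch, not taken from this chapter — one can check that the penalized maximum likelihood selects a parsimonious order, whereas the raw maximum likelihood never decreases with E:

```python
import numpy as np

# Sketch with assumed settings: data from a cubic, fitted by polynomials of
# increasing order; the score implements the penalty of equation (27.9).
rng = np.random.default_rng(2)

N, sigma = 60, 0.2
x = np.linspace(-1.0, 1.0, N)
d = 1.0 - x + 2.0 * x**3 + sigma * rng.standard_normal(N)

delta_a = 10.0        # assumed common prior width per expansion coefficient
scores = {}
for E in range(1, 9):                       # E = number of coefficients
    Phi = np.vander(x, E, increasing=True)  # basis 1, x, ..., x^(E-1)
    a_hat, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    chi2 = np.sum((Phi @ a_hat - d) ** 2) / sigma**2
    max_loglik = -0.5 * chi2                # up to an E-independent constant
    penalty = 0.5 * E * np.log(N * (delta_a / sigma) ** 2)
    scores[E] = max_loglik - penalty
best_E = max(scores, key=scores.get)
print("selected number of coefficients:", best_E)  # the cubic needs E = 4
```

The chi-squared of the best fit decreases monotonically with E, yet the penalized score peaks at the order of the generating model and falls off for every additional, noise-fitting coefficient.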
For the initially considered example of an orthogonal expansion we can even push the
formal evaluation of the Bayes factor using equation (27.9) one step further. An orthogonal expansion constitutes an example of the important class of nested models. A family of
models is called nested if the model of (expansion) order Ej is completely contained in
the model of (expansion) order Ek for Ek > Ej. The prominent property of such a model family is that the maximum likelihood is a monotonically non-decreasing function of the expansion order: a more complicated model provides a fit at least as good as a simpler one, and most often a better one.

For the log of the Bayes factor B_E := p(d|E+1, N, I)/p(d|E, N, I) for two consecutive models of a family of nested models we find

\ln B_E \approx \ln \frac{p(d|\hat{a}_{E+1}, E+1, N, I)}{p(d|\hat{a}_{E}, E, N, I)} - \frac{1}{2} \ln N \left( \frac{\Delta a}{\sigma} \right)^{2} .   (27.10)

The conclusion is that expansion order E + 1 is accepted only if the corresponding log-likelihood is sufficiently larger than that of model E, such that the penalty term (1/2) ln N(Δa/σ)² is overcompensated. Equation (27.10) allows us to draw a final conclusion with validity beyond the special case used to set it up. The penalty term in equation (27.10) depends also on the prior width, and the penalty is a monotonic function of Δa. Hence, if we choose the prior to be sufficiently uninformative, Δa → ∞, then the simpler model (and, by iteration of the argument, the simplest member of the model family) will win the competition. This means that improper priors, though in many cases acceptable in parameter estimation, are entirely useless in model selection problems. Model selection depends critically on the prior employed in the analysis, and due care should be exercised in order to choose it to be as informative as possible. The deeper meaning of Ockham's razor has been discussed in Section 3.3 [p. 45].
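The prior-width dependence just described is easy to see numerically. In the sketch below the likelihood gain of the extended model is frozen at an assumed value while the prior width Δa grows; all numbers are illustrative:

```python
import numpy as np

# Sketch with assumed numbers: the likelihood gain of the extended model is
# held fixed while the prior width grows; only the penalty of eq. (27.10)
# changes.
N = 100                 # number of data
sigma = 1.0             # standard deviation of a single datum
delta_chi2 = 9.0        # assumed chi^2 improvement from one extra parameter

ln_B = {}
for delta_a in (1.0, 10.0, 1e3, 1e6):
    penalty = 0.5 * np.log(N * (delta_a / sigma) ** 2)
    ln_B[delta_a] = 0.5 * delta_chi2 - penalty   # ln B_E of eq. (27.10)
    verdict = "extend model" if ln_B[delta_a] > 0 else "keep simpler model"
    print(f"prior width {delta_a:9.1e}: ln B_E = {ln_B[delta_a]:7.2f} -> {verdict}")
```

Beyond some prior width the penalty outweighs any fixed likelihood gain, which is the quantitative content of the statement that improper priors are useless for model selection.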

27.1 Inelastic electron scattering


Cross-sections and rate coefficients for electron impact ionization of process gases such
as methane and hydrogen constitute the necessary database for modelling low-temperature
glow discharges. The interest in the latter stems from their application in the formation
of diamond and diamond-like carbon films [102]. Experimental data on electron impact ionization are available for many gaseous species; however, they always span only a limited interval on the energy scale. The calculation of rate coefficients requires, in contrast, the availability of cross-section data for all energies. While rate coefficients for low temperatures are mainly determined by the threshold behaviour of the cross-section, for high temperatures, such as in the boundary of magnetic confinement fusion plasmas, the high-energy functional behaviour of the cross-section becomes important. The asymptotic form of
cross-sections for inelastic electron scattering is fortunately known from theory and this
knowledge can be used to set up a parametric family of functions for which the optimum
parameters are then estimated on the basis of the available experimental data. Atomic collision theory distinguishes between two cases for the asymptotic behaviour of inelastic
scattering cross-sections. In the case that the final state considered in the collision can also
be reached by photon absorption from the initial (ground) state (optically allowed), the
cross-section σ(E) as a function of energy E behaves as

\sigma(E) \propto \frac{1}{E} \ln \frac{E}{c} ,   (27.11)

with an adjustable parameter c. If optical transitions between the initial and final state are
forbidden, then
\sigma(E) \propto \frac{1}{E} .   (27.12)

Equation (27.11) must of course still be modified to provide a sensible representation of the cross-section for all energies. One obvious modification is the choice of c. If we choose c to be equal to the threshold energy E0 for ionization, then σ(E) tends to zero as the threshold energy is approached from E > E0. We introduce as a new variable the reduced energy Ē = E/E0, describe the threshold behaviour by an additional factor (Ē − 1)^α and introduce a constant λ which tunes the transition to the asymptotic region:

\sigma(\bar{E}) = \sigma_0\, \frac{(\bar{E}-1)^{\alpha}}{\lambda + \bar{E}\,(\bar{E}-1)^{\alpha}}\, \ln \bar{E} .   (27.13)

A similar modification is required for equation (27.12). For the region close to the threshold, Ē = 1 + ΔE, we notice that equation (27.13) becomes proportional to (ΔE)^{α+1}. The appropriate modification of equation (27.12) is then

\sigma(\bar{E}) = \sigma_0\, \frac{(\bar{E}-1)^{\alpha+1}}{\lambda + \bar{E}\,(\bar{E}-1)^{\alpha+1}} .   (27.14)

The parameters σ0, λ and α now have exactly the same meaning in both cases. The prior p(σ0, λ, α|I) for a Bayesian estimate of the parameters is therefore also the same in both cases. Assuming then that we adopt a prior which is uniform over some sensible, positive, sufficiently large range implies that the Bayes factor for comparison of the two models in equations (27.13) and (27.14) is reduced to the ratio of maximum likelihoods. This in turn means that in the present case a better fit means a more probable model, since the expense of volume in parameter space is model independent.
Figure 27.1 shows data and fits for electron impact ionization of H2. The ionization channel leading to H2+ is classified as an optically allowed transition with log(odds) = +7.6, while the dissociative ionization channel leading to the atomic ion H+ turns out to be an optically forbidden transition with log(odds) = +64.5. The analysis has therefore made a model choice with very high selectivity and provides good reason to employ equations (27.13) and (27.14) for the calculation of rate coefficients for arbitrary temperatures.
This example constitutes the simplest possible case of Bayesian model selection, since we
have also taken the prior odds equal to one.
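The logic of this comparison can be imitated on synthetic data. The sketch below is not the H2 measurements of Figure 27.1; the parameter values, noise level and coarse grid search are all illustrative assumptions:

```python
import numpy as np

# Sketch on synthetic data (not the measured H2 cross-sections): draw data
# from the optically allowed form (27.13) and fit both parametrizations by
# maximum likelihood; sigma0 is profiled out analytically, lambda and alpha
# are scanned on a coarse grid. All numbers are illustrative assumptions.
rng = np.random.default_rng(3)

def shape_allowed(x, lam, alpha):      # eq. (27.13) up to sigma0
    t = (x - 1.0) ** alpha
    return t * np.log(x) / (lam + x * t)

def shape_forbidden(x, lam, alpha):    # eq. (27.14) up to sigma0
    t = (x - 1.0) ** (alpha + 1.0)
    return t / (lam + x * t)

x = np.linspace(1.2, 40.0, 40)         # reduced energies E/E0 > 1
noise = 0.01
d = shape_allowed(x, 2.0, 0.5) + noise * rng.standard_normal(len(x))

def best_chi2(shape):
    best = np.inf
    for lam in np.linspace(0.1, 10.0, 100):
        for alpha in np.linspace(0.0, 2.0, 81):
            g = shape(x, lam, alpha)
            s0 = np.dot(d, g) / np.dot(g, g)   # maximum-likelihood amplitude
            best = min(best, np.sum((d - s0 * g) ** 2) / noise**2)
    return best

ln_bayes = 0.5 * (best_chi2(shape_forbidden) - best_chi2(shape_allowed))
print(f"ln(Bayes factor), allowed vs forbidden: {ln_bayes:.1f}")
```

Because both parametrizations share the parameter set (σ0, λ, α) and a common uniform prior range, the Ockham volume factors cancel and the sign of the log Bayes factor is decided by the best achievable fit alone.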

27.2 Signal-background separation


An evergreen in physical data analysis is the separation of a signal from the background on
which it resides. A probabilistic answer to this problem has been given in Chapter 23 [p. 396].
The present alternative approach will be applied to the data shown in Figure 23.3 [p. 404].
The traditional procedure chooses a background model and defines the signal as the difference between measured data and simulated background. The result of such an operation
