
Information Theory and Its Application to Analytical Chemistry
D. E. Clegg
Griffith University, Nathan, 4111, Queensland, Australia
D. L. Massart
Free University Brussels, Brussels 1090, Belgium
The Changing Demands on the Analyst

The professional mission of analytical chemists is to supply clients with information about the composition of samples. The means for gaining this information has changed profoundly over the years, from the common use of "wet" chemical methods to the almost complete reliance on physical instrumentation that is found in contemporary laboratories. Such changes have been made largely in response to the more stringent demands made by clients, for example, requests for the determination of very low levels of analyte or for fast multielement analysis.

Inevitably, the shift to instrumental methods has also required analysts to augment their training in chemistry with an understanding of other fields such as electronics, optics, electromagnetism, and computing. However, the need to develop knowledge and skills in these diverse disciplines should not be allowed to obscure the distinctive task of the analyst: to supply chemical information about samples. The emphasis should be kept on gathering reliable information, rather than being shifted to any particular technique or methodology.

The Evolving Concept of Information

It would be useful and timely for analysts to reconsider in greater depth what is meant by "information" in general and "chemical information" in particular. Then their complex and rapidly evolving methodologies could be put in broader teleological perspective.

Such matters have been the subject of much study since a theory of information was developed by Claude Shannon in the late 1940's to solve some problems in message communication over the primitive Morse Code transmitters then in use (1). The theory has been applied to a varied range of situations that involve the transmission of "messages", where the term message is used in its widest sense. This range includes the faint molecular messages that emanate from the analyte in a sample, then pass through the electronic encoding and decoding processes of a spectrometer.

Evaluating and Comparing Procedures

The reliability of information passing through such a system depends on both the ability of the sender (the analyst) and the ability of the transmitter. The sender must correctly prepare the sample for encoding. The transmitter must convey the desired signal to the output stage. Also, this signal must be kept separate from the myriad of other signals emanating from the sample extract and from within the instrumentation.

Shannon's singular contribution was to recognize that it was possible to describe and compare the performance of various message-transmission systems in a much more meaningful way. He defined the term "information" to allow the amount of information passing through a system to be quantified.

Workers in the relatively new field of "chemometrics" recognized the application of his theory to analysis. It could provide a measure of the performance of an analytical procedure, expressed in terms of a common currency, namely the yield of "information". Widely different methodologies could thus be evaluated and compared by means of this single yardstick (2).

Below we discuss how information is defined and how the basic ideas of information theory can be applied to analytical chemistry, from spot tests and chromatography to the interpretation of large chemical data bases. For pedagogical reasons the examples are taken from qualitative analysis, but the concepts are equally applicable to quantitative analysis.



Quantifying Information

Information is required to reduce the uncertainty that arises when several options exist and no one knows which is correct. As the number of options increases, so does the uncertainty and the amount of information needed to reduce it, or completely resolve it.

Thus, in a sense, information and uncertainty are inversely related: Each is related quantitatively to the number of options that exist. Below we use an example from a simple lab test to show how this leads to a useful quantitative definition of information (and uncertainty).

The Simplest Qualitative Spot Test

Consider the use of a qualitative spot test to determine the presence of iron in a water sample. In terms of information, there are only two possible options: iron either is or is not present above the detection limits for the test. (Complications that might arise near the detection limit will be ignored.)

Without any sample history, the testing analyst must begin by assuming that these two outcomes are equiprobable. This situation is summarized as follows.

Outcome       0 (Fe is absent)    1 (Fe is present)
Probability   1/2                 1/2

This describes the most basic level of uncertainty that is possible: the choice between two equally likely cases. Such a situation is assigned a unit value of uncertainty.

The value of the information contained in the outcome of this analysis is also equal to 1 if the uncertainty is completely removed. This unit of information is called the "bit", as used in binary-coded systems. (Clearly, the situation with only one possibility has no uncertainty, and thus can yield no information.)

Then only one test, which must have two distinct observable states, is needed to complete this analysis. For example, after oxidation, the addition of ferrocyanide reagent either will or will not produce the characteristic color of Prussian blue. Thus, an analytical task with 1 unit of uncertainty can be completely resolved by a test that can provide 1 unit of information.

Increasing the Possibilities

When up to two metals may be present in the sample solution (e.g., Fe or Ni or both) there are four possible outcomes, ranging from neither being present to both being present.

Outcome       neither    Fe only    Ni only    both
Probability   1/4        1/4        1/4        1/4

Which of these four possibilities turns up can be determined using two tests, each having two observable states. Similarly, with three elements there are eight possibilities, each with a probability of 1/8 (i.e., 1/2^3). Three tests are then needed to resolve the question.

Computing Information and Uncertainty

The following pattern clearly relates the uncertainty and the information needed to resolve it. The number of possibilities is expressed in powers of 2; attention is drawn, more particularly, to the power to which 2 is raised.

No. of Possible Outcomes (n)    No. of Tests Needed
1 = 2^0                         0
2 = 2^1                         1
4 = 2^2                         2
8 = 2^3                         3

The power to which 2 must be raised to give the number of possibilities n is defined as the log to base 2 of that number. Thus, information and uncertainty can be defined quantitatively in terms of the log to base 2 of the number of possible analytical outcomes.

I = H = log2 n    (1)

where I indicates the amount of information; and H indicates the amount of uncertainty.

The initial uncertainty can also be defined in terms of the probability of the occurrence of each outcome. For example, by referring to the probabilities in the table, the following definition can be written.

I = H = log2 n = -log2 p    (2)

where I is the information contained in the answer, given that there were n possibilities; H is the initial uncertainty resulting from the need to consider the n possibilities; and p is the probability of each outcome if all n possibilities are equally likely.
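These definitions are easy to verify numerically. The following short sketch (written in Python purely for illustration; the function name is ours, not part of the original article) reproduces the pattern tabulated above: the information needed to resolve n equally likely outcomes, and the number of two-state tests required to supply it.

import math

def information_bits(n_outcomes):
    # Eq 1: I = H = log2 n for n equally likely outcomes.
    return math.log2(n_outcomes)

# Reproduce the table: 1, 2, 4, and 8 possible outcomes.
for n in (1, 2, 4, 8):
    # Each two-state (yes/no) test supplies at most 1 bit, so the number of
    # tests needed is the information rounded up to a whole number of bits.
    tests_needed = math.ceil(information_bits(n))
    print(n, information_bits(n), tests_needed)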
Nonequal Probabilities

The expression can be generalized to the situation in which the probability of each outcome is not the same. If we know from past experience that some elements are more likely to be present than others, eq 2 is adjusted so that the logarithms of the individual probabilities, suitably weighted, are summed.

H = -Σ pi log2 pi    (3)

where pi is the probability of the ith outcome.

Thus, we can consider again the original example, except that now past experience has shown that 90% of the samples contained no iron. This situation is summarized as follows.

Outcome       0 (no Fe)    1 (Fe)
Probability   0.9          0.1

The degree of uncertainty is calculated using eq 3 as

H = -(0.9 log2 0.9 + 0.1 log2 0.1) bits
  = 0.469 bits

The amount of uncertainty in this case is substantially less than that in the original equiprobable case (1 bit). In commonsense terms, if we know beforehand that 90% of the samples contain no iron, our uncertainty about the outcome is not as great. Thus, the value of the information in our report will not be as great either.

In general, the greatest uncertainty is associated with the most-even distribution of prior probabilities. Consequently, the information content from reporting the outcome of such a situation is also at a maximum. Conversely, when past analyses show that certain outcomes are much more likely than others, the level of information in the report is correspondingly lower.
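The same result can be checked with a few lines of code. The sketch below (Python, again illustrative only; the function name is ours) implements eq 3 and reproduces both the 1-bit equiprobable case and the 0.469-bit value obtained when 90% of the samples are known to contain no iron.

import math

def uncertainty_bits(probabilities):
    # Eq 3: H = -sum(p_i * log2 p_i); outcomes with zero probability contribute nothing.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(uncertainty_bits([0.5, 0.5]))   # equiprobable spot test: 1.0 bit
print(uncertainty_bits([0.9, 0.1]))   # 90% of samples contain no iron: about 0.469 bit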



Comparison of Analytical Systems

The quantitative nature of the definitions for information and uncertainty given previously provides a useful method to assess and compare different analytical methods. Methods can be compared in terms of their ability to generate information that reduces uncertainty about the composition of samples. A number of papers have described this comparison for specific cases (3-6). Below we use thin-layer chromatography (TLC) as a conceptually simple analytical situation that illustrates how the definition is used to do this.

Calculations for a TLC Test of Many Species

Consider the use of TLC to identify drugs that are used therapeutically or in abuse. Suppose the compound to be identified is known to be one of a library of n0 compounds in which the probability of occurrence is the same for each. Thus, the a priori probability for each is

p0 = 1/n0

Then, by definition, the uncertainty before TLC analysis for any one of the n0 substances will be described by the following equation.

H0 = log2 n0

Suppose that a spot is found at position Rf,i after developing the TLC plate. It is known that any one of ni of the n0 substances can be found here. Then the probability has been increased to

pi = 1/ni

Thus, the uncertainty about identification is reduced to

Hi = log2 ni

The difference in these uncertainties results from the information derived from the analytical procedure. Thus,

I = H0 - Hi = log2 n0 - log2 ni = log2 (n0/ni)

To take a numerical example, consider a screening test for 20 possible substances that are all equally likely to occur, in which two have an Rf between 0.19 and 0.25. (Due to the precision of this TLC method, substances with Rf's outside the range (0.19-0.25) cannot be distinguished.)

The following equation gives the information obtained from the procedure when a spot is observed with an Rf in the range (0.19-0.25).

I+ = log2 (20/2) = 3.32 bit

Then the information obtained when no spot is observed with an Rf in the range (0.19-0.25) is given below.

I- = log2 (20/18) = 0.15 bit

The two observations are not equiprobable. The first result will be found only 2 times out of 20; the second will be found 18 times. Thus, to obtain the average information from the observation, we must calculate a weighted mean.

I = (2/20 x 3.32) + (18/20 x 0.15) = 0.47 bit

This can be written as

I = -(p+ log2 p+) - (p- log2 p-)

where p+ is the probability of finding an Rf between 0.19 and 0.25; and p- is the probability of finding another value.

Consider not just two categories of results but n categories. For example, consider an Rf in the range (0-0.05) or (0.05-0.10), etc. Then

I = -Σ p(xi) log2 p(xi)

where n is the number of categories (the sum runs over i = 1 to n); and xi is the ith category.
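The weighted-mean calculation for this screening example can be written out directly, as in the short sketch below (Python, illustrative only; the variable names are ours).

import math

n_total = 20        # equally likely candidate substances
n_in_window = 2     # substances with an Rf between 0.19 and 0.25
n_outside = n_total - n_in_window

# Information from each possible observation, I = log2(n0/ni).
i_spot = math.log2(n_total / n_in_window)      # spot in the window: 3.32 bit
i_no_spot = math.log2(n_total / n_outside)     # no spot in the window: 0.15 bit

# Weighted mean over the two outcomes, with probabilities 2/20 and 18/20.
i_average = (n_in_window / n_total) * i_spot + (n_outside / n_total) * i_no_spot
print(i_spot, i_no_spot, i_average)            # about 3.32, 0.15, and 0.47 bit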
Evaluating Different TLC Systems

The following example shows how information theory can be used to compare the "quality" of two such chromatographic methods. TLC system 1 permits the separation of 20 substances into five classes of four. TLC system 2 permits their separation into one class of 10, one class of seven, and three classes of one. Which is the best system? Applying the weighted-mean formula to the class sizes of each system gives

I1 = 5 x (4/20) log2 (20/4) = log2 5 = 2.32 bit

I2 = (10/20) log2 (20/10) + (7/20) log2 (20/7) + 3 x (1/20) log2 (20/1) = 1.68 bit

System 1 permits the largest reduction in uncertainty. In other words, system 1 can provide, on average, more analytical information. However, for the three substances that are uniquely classified in system 2, this system would be better.

This example illustrates again that the greatest uncertainty is generally associated with the most-even distribution of prior probabilities. Consequently, the information content from reporting the outcome of such a situation is also at a maximum.
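The comparison is easy to reproduce from the class sizes alone. The sketch below (Python, not from the original article; the function name is ours) computes the mean information delivered by each system.

import math

def mean_information(class_sizes):
    # Mean information when n equally likely substances fall into the given
    # classes: the weighted mean of log2(n / class size) over all classes.
    n = sum(class_sizes)
    return sum((c / n) * math.log2(n / c) for c in class_sizes)

system_1 = [4, 4, 4, 4, 4]     # five classes of four substances
system_2 = [10, 7, 1, 1, 1]    # one class of 10, one of 7, and three of 1

print(mean_information(system_1))   # 2.32 bit
print(mean_information(system_2))   # about 1.68 bit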
System Optimization
To take a numerical example, wnsider a screening test
for 20 possible substances that are all equally likely to Besides using information theory to wmpare the quality
occur, in which two have anRf between 00.9 and 0.25. (Due of chromatographic systems, we can use it for method de-
to the precision of this TLC method, substances with Rjs velopment to optimize such systems. In high performance
outside the range (0.19-0.25) cannot be distinguished.) liquid chromatography (HPLC),the composition of the mo-
The following equation gives the information obtained bile phase must be optimized. This is frequently done with
from the procedure when a spot is observed with anRf in experimental design methods.
the range (0.19-0.25). For any formal optimization method, a criterion is
needed. Usually, it is some function of chromatographic
20
resolution. However, as shown by Siouffi(71,it is also pos-
Ii = logz-= 3.32 bit sible to use the information wntent of a chromatogram,
2
computed as above, as the criterion.
Then the information obtained when no spot is observed Combining Qualitative Analysis Systems
with an Rf in the range (0.19-0.25) is given below. Because chromatography is a poor identifier, analysts
20
I; = logz-= 0.15bit have sought to reduce the risk of misidentificationby using
18 a combination of chromatographic separations carried out



Increasing the Informing Power

Consider the problem of separating eight substances by TLC. If the analytical task is to identify which one is present in a sample, then the system must assign a characteristic Rf value to each of the eight possible substances. If it fails to separate two or more of them, a second or even third TLC system must be added until enough "informing power" is acquired to completely resolve the uncertainty.

The table lists the (hypothetical) Rf values of the eight substances in four different solvents in a TLC system that can distinguish spots measured at 0.1 to 0.8, in intervals of 0.1. Solvent 1 represents the ideal situation in which each substance has a different Rf value.

Rf Values

Substance    Solvent 1    Solvent 2    Solvent 3    Solvent 4
A            0.10         0.20         0.20         0.20
B            0.20         0.20         0.40         0.20
C            0.30         0.40         0.20         0.20
D            0.40         0.40         0.40         0.20
E            0.50         0.60         0.20         0.40
F            0.60         0.60         0.40         0.40
G            0.70         0.80         0.20         0.40
H            0.80         0.80         0.40         0.40
Information
(bits)       3            2            1            1

In terms of information, the sample has an initial uncertainty of log2 8 = 3 bits because there are eight possibilities. The maximum information that the TLC system can deliver is also log2 8 = 3 bits because it can produce eight different "signals", as seen with solvent 1.

Combining Solvents

In reality, this parity between sample uncertainty and system information does not happen very often. We are more likely to encounter the situation shown by solvent 2, which gives us some information but not enough to guarantee identification. It fails to distinguish between the pairs AB, CD, EF, and GH. It delivers log2 4 = 2 bits of information, and we need 3 bits.

Solvent 3 can only separate the substances into two groups (1 bit). However, when solvent 3 and solvent 2 are combined, all eight substances can be identified separately. Solvent 4 also gives only two spots, but combining solvents 4 and 2 will not yield identification of all substances.
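These statements about single solvents and solvent combinations can be checked directly from the table. The sketch below (Python, illustrative only; the data structure and function name are ours) treats each distinct Rf value, or combination of Rf values, as one "signal" and computes the mean information it delivers.

import math
from collections import Counter

# Hypothetical Rf values from the table (substances A-H in solvents 1-4).
rf = {
    "A": (0.10, 0.20, 0.20, 0.20), "B": (0.20, 0.20, 0.40, 0.20),
    "C": (0.30, 0.40, 0.20, 0.20), "D": (0.40, 0.40, 0.40, 0.20),
    "E": (0.50, 0.60, 0.20, 0.40), "F": (0.60, 0.60, 0.40, 0.40),
    "G": (0.70, 0.80, 0.20, 0.40), "H": (0.80, 0.80, 0.40, 0.40),
}

def informing_power(solvent_indices):
    # Group the eight substances by their combined signal (tuple of Rf values)
    # and take the weighted mean of log2(n / group size) over the groups.
    n = len(rf)
    signals = Counter(tuple(values[i] for i in solvent_indices) for values in rf.values())
    return sum((count / n) * math.log2(n / count) for count in signals.values())

print(informing_power([0]))      # solvent 1 alone: 3 bits
print(informing_power([1]))      # solvent 2 alone: 2 bits
print(informing_power([2]))      # solvent 3 alone: 1 bit
print(informing_power([1, 2]))   # solvents 2 and 3 combined: 3 bits, all eight resolved
print(informing_power([1, 3]))   # solvents 2 and 4 combined: still 2 bits (correlated)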
Correlations

When two systems are combined, simple addition of the information is possible only when the systems are uncorrelated. That is, for systems A and B,

I(A,B) = I(A) + I(B)    (if ρAB = 0)

where ρAB is the correlation coefficient between A and B.

Correlation always reduces the signal "space" that is available to the substances present. Thus, in the above case, a combination of two solvents produces a two-dimensional space with a potential array of 8 x 8 = 64 distinct signals, which is equivalent to a maximum of 6 bits of information. Thus, the system can ideally recognize 64 different Rf combinations.

If correlation occurs, the actual combinations become restricted to a limited zone that becomes more narrow as ρAB approaches 1. In other words, the chance that two substances have the same analytical "signal" increases as this zone narrows, thus reducing the information-generating capabilities of the combined system. The example described above is illustrated diagrammatically in the figure.

[Figure: Rf (solvent 3) and Rf (solvent 4) plotted against Rf (solvent 2). Uncorrelated and correlated Rf values and the effect on signal "space".]
Classification Using Expert Systems

The application of information theory for so-called machine learning was proposed by Quinlan (8). Machines can "learn" by so-called inductive reasoning, and inductive expert systems are now available to "teach" them. Before explaining how information theory plays a role in such expert systems, it may be necessary to clarify the difference between deductive and inductive expert systems.

Deductive Systems

The deductive expert systems are more usual and thus better-known. Their knowledge base consists of rules that have been entered by experts. The system uses these rules by chaining them to reach a conclusion.

For example, we can reach the conclusion below by combining the following two rules.

If a substance contains more than 20 carbon atoms, then it should be considered apolar.
If a substance is apolar, then it should be soluble in methanol.
Thus, substance X, with 23 carbon atoms, should be soluble in methanol.
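Such rule chaining is simple to mechanize. The following minimal sketch (Python; the rule representation is ours and only illustrative) applies the two rules above by forward chaining until no new fact can be derived.

facts = {"carbon_atoms": 23}    # substance X

rules = [
    # (condition on the current facts, fact asserted when the condition holds)
    (lambda f: f.get("carbon_atoms", 0) > 20, ("apolar", True)),
    (lambda f: f.get("apolar", False), ("soluble_in_methanol", True)),
]

changed = True
while changed:                  # keep chaining until no rule adds anything new
    changed = False
    for condition, (key, value) in rules:
        if condition(facts) and facts.get(key) != value:
            facts[key] = value
            changed = True

print(facts)   # {'carbon_atoms': 23, 'apolar': True, 'soluble_in_methanol': True}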



Inductive Systems

Inductive expert systems work from examples. An inductive expert system that was designed to classify the solubilities of organic substances would comprise the following: a set of substances; a selection of their properties, such as carbon number, functional group, etc.; and a set of solvents known to dissolve the substances.

When a new substance is presented to the system, the system would give advice based on rules or analogies that it has derived from these examples.

In other words, while deductive systems need rules given by experts, inductive systems make rules that are based on the examples supplied. Quinlan's ID3 algorithm, which is based on information theory, is the best-known algorithm for such inductive learning.

Using Information Theory with Inductive Systems

The use of information theory for inductive expert systems can be shown in the use of chromatographic analysis for food authentication. Consider, for example, the problem of determining the origin of a particular unknown olive oil sample. The determination could be based on the fatty acid composition of known olive oil samples that originated in two growing regions, East and West Liguria in Italy (9).

Choosing the Rules

In the inductive learning process a set of rules is developed from the patterns of fatty acid levels in each of the oils. Thus, "information" is the criterion for choosing the rules that have the greatest power of discrimination.

Suppose there are eight oil samples from East (E) and West (W) Liguria. Then the task becomes determining the origin of a sample based on the percentage of one or more fatty acids. In deciding whether a sample is a W or an E, the required information (or the initial uncertainty before testing) will be the following.

Hb = -(p(W) log2 p(W)) - (p(E) log2 p(E))

If it is known that there are four unknowns from each category, then the a priori probability is 0.5 for each category. Using eq 1, we conclude that Hb = 1. For other a priori probabilities, other Hb values are obtained. Thus, we get the following table.
Number     Prob.                 Uncertainty
4 W 4 E    (0.5 W, 0.5 E)        Hb = 1 bit
3 W 5 E    (0.375 W, 0.625 E)    Hb = 0.954 bit
2 W 6 E    (0.25 W, 0.75 E)      Hb = 0.81 bit
0 W 8 E    (0 W, 1 E)            Hb = 0 bit

These numbers can be understood, for example, if we know that the situation 0 W 8 E does indeed require no information. We know that all samples are E. Thus, we do not require information to determine the origin of one of these samples. The situation 0.5 W 0.5 E is the most uncertain. We require more bits than in the other situations in which the a priori knowledge is greater.

The Effect of Test Results

Let us now go a step further and see how the required information (or initial uncertainty) is affected by a particular test result. For test 1, let us assume the worst situation, in which there is a 50:50 chance that the oil is from E or W. Thus,

p = 0.5 and H = 1

where H is the initial uncertainty, the maximum possible in this case.

Let us also assume that the test requires determining whether or not some property of the oil exceeds a threshold value. For example, say we want to determine whether or not the oleic acid percentage exceeds 15. If so, the test is positive (+). If not, the test is negative (-).

Finally, in order to carry out the following calculations, we assume that we already know the following about the samples. The test shows that five of the eight samples are above 15%, and thus positive. These five positive samples are 3 E and 2 W. The remaining three samples test negative. These three negative samples are 1 E and 2 W.

The question then arises "How much information would be obtained, on average, when an analyst applies this test to a sample from this batch of eight?" If the result is >15%, then the following would be the information still required after the test to resolve the remaining uncertainty.

H+ = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971 bit

If the result is less than 15%, it would be

H- = -(1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.918 bit

Thus, the weighted average residual uncertainty covering all possible outcomes of the test would be

H = (5/8 x 0.971) + (3/8 x 0.918) = 0.951 bit

because we know the distribution of oils: five samples above the threshold and three below.

Thus, the information generated by test 1 is the difference between the initial and final amounts of required information (or uncertainties).

I = 1 - 0.951 = 0.049 bit

This, of course, is very little. A test with this threshold value is not very useful because it does not yield an appreciable amount of information.

An Improved Test

Then let us consider another test in which there are again five positive and three negative samples. This time all the negative samples are E, and the positive results are from 4 W and 1 E samples. The information still required after the test is shown below.

If the test result is positive, then calculating as above,

H+ = -(4/5) log2 (4/5) - (1/5) log2 (1/5) = 0.72 bit

If it is negative, we get

H- = 0 bit

because a negative test means that the sample can only be an E sample.

Thus, the average residual uncertainty after performing test 2 is

H = (5/8 x 0.72) + (3/8 x 0) = 0.45 bit



Thus, the information obtained from the test is

I = 1 - 0.45 = 0.55 bit

Clearly this test is better than the first one.
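Both tests can be evaluated with the same few lines of code. The sketch below (Python, illustrative only; the function names are ours) computes the information gain of a two-way test as the initial uncertainty minus the weighted residual uncertainty of its positive and negative groups.

import math

def entropy(labels):
    # H = -sum(p * log2 p) over the class proportions in the list of labels.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def information_gain(positive, negative):
    # Initial uncertainty of the whole batch minus the weighted mean of the
    # residual uncertainties of the positive and negative groups.
    n = len(positive) + len(negative)
    residual = (len(positive) / n) * entropy(positive) + (len(negative) / n) * entropy(negative)
    return entropy(positive + negative) - residual

# Test 1: positives are 3 E and 2 W; negatives are 1 E and 2 W.
print(information_gain(["E", "E", "E", "W", "W"], ["E", "W", "W"]))   # about 0.05 bit

# Test 2: positives are 4 W and 1 E; negatives are 3 E.
print(information_gain(["W", "W", "W", "W", "E"], ["E", "E", "E"]))   # about 0.55 bit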
Quinlan's Algorithm

Quinlan's algorithm is sequential. The first variable it selects is the variable that yields the most information. For continuous variables, the algorithm must find the threshold at which that variable is most informative. This is then used to split the sample in two.

If the second test was found to be the best, then this test would be selected, as done in the previous example of testing for E and W. Consequently, the following rule would be created.

IF test 2 = (-) THEN the sample = E

A positive test in this example is not conclusive, and an additional test would be selected to further divide the group in two, until the only groups remaining comprise a single category.

In this simple example the five remaining unseparated samples are 4 W and 1 E. The resulting complete set of rules might appear as below.

IF test 2 = (-) THEN sample = E (3)
IF test 2 = (+) and IF test 6 = blue THEN sample = E (1)
IF test 2 = (+) and IF test 6 = red THEN sample = W (4)

Test 6 is the best of several tests investigated for separating the 4 W and 1 E. Of course, the example given above is only an example to explain the methodology.
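The threshold search for a continuous variable can be sketched as follows (Python, illustrative only; the function names are ours, and the fatty acid percentages are invented purely so that a 15% threshold reproduces the 3 E/2 W versus 1 E/2 W split of test 1). The routine scans candidate thresholds and keeps the one with the largest information gain, which is the essence of how an ID3-style learner handles continuous variables.

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    # Try the midpoint between every pair of adjacent sorted values and keep
    # the threshold that yields the largest information gain.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        below = [c for v, c in pairs if v <= t]
        above = [c for v, c in pairs if v > t]
        residual = (len(below) / len(pairs)) * entropy(below) + (len(above) / len(pairs)) * entropy(above)
        if base - residual > best_gain:
            best_gain, best_t = base - residual, t
    return best_gain, best_t

# Invented oleic acid percentages for the eight oils and their known origins.
oleic = [12, 13, 14, 16, 17, 18, 19, 20]
origin = ["E", "W", "W", "E", "E", "E", "W", "W"]
print(best_threshold(oleic, origin))   # with these made-up values: about 0.31 bit at 18.5%

With these invented percentages the search happens to find a threshold (18.5%) that is more informative than the 15% used in test 1, which is exactly the kind of choice the algorithm is designed to make.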
single category. to he vague and poorly understood. Also, in qualitative
In this simple example the five remaining unseparated analysis the analyst must decide how to quantify the mer-
samples are 4 W and 1 E. The resulting complete set of its ofalternatives such as theT1.C svstemsor color ~~~- tests in

rules might appear as below. our simple examples.


I n dealing with aualitative uncertaintv. ", the lack of
~ - - an
~~~~~~-~~~

IF test 2 = (-) THEN sample = E (3) accessible cTitenonof performance often leads to analyti-
cal "overkill" tn cover any risk oferror in identification. In-
IF test 2 = (+) and IF test 6 = blue THEN sample = E (1) format~ontheory offersanalystsa way ofmoving moreeas-
IF test 2 = (+) and IF test 6 = red THEN sample = W (4) ily within this aualitative dimension oftheir work. It " mves ~~

them a rationai basis for more precisely and economically


matching the analytical needs of the client to the method-
Test 6 is the best of several tests investigated for separat-
ing the 4 W and 1E. Of course, the example given above is ology.
only an example to explain the methodol&y Literature Cited
Computer Programs and Automation
I n practical situations, one needs computer programs.
For example, EX-TRAN was used to fmd decision rules to
separate E and W oils that were characterized by their
fatty acid content. Seven fatty acids were used (9).
The rules developed by the program take the following
form.
If the linolenic acid content is found ta be less than 25 and
the linolie acid is less than 665.0, then sample is E.

