
Analytica Chimica Acta 446 (2001) 115–120

Chemical knowledge discovery from mass spectral database


I. Isotope distribution and Beynon table
Yizeng Liang a,∗, Feng Gan b
a College of Chemistry and Chemical Engineering, Institute of Chemometrics and Intelligent Analytical Instruments, Central South University, Changsha 410083, PR China
b College of Chemistry and Chemical Engineering, Institute of Chemometrics and Chemical Sensing Technology, Hunan University, Changsha 410082, PR China


Received 7 November 2000; received in revised form 4 April 2001; accepted 11 April 2001

Abstract
A preliminary approach to data mining (DM) of a mass spectral library has been developed in this work. The results obtained from DM show for the first time that the ratios of isotope peaks to the molecular ion peak obey a log-normal (logarithm normal) distribution. With the help of statistical inference, a guideline on how to use the Beynon table efficiently is also given, in order to provide information for deciding on the molecular weight or molecular formula in mass spectral analysis from a statistical point of view. The work indicates that a broad study of DM of chemical databases is essential not only for the verification of existing knowledge, but also for the discovery of new laws in chemistry. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Data mining; Beynon table; Statistical inference; Logarithm normal distribution; Chemometrics

∗ Corresponding author. Tel.: +86-731-882-2841; fax: +86-731-882-5637.
E-mail address: yzliang@cs.hn.cn (Y. Liang).

0003-2670/01/$ – see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0003-2670(01)01073-X

1. Introduction

Chemistry has so far been essentially a strongly experience-dependent scientific discipline. Many chemists carry out many chemical experiments and measurements every day. The major body of chemical knowledge obtained so far is mostly based on chemical experiments and measurement data, even though theoretical quantum chemistry can explain some things in chemistry. However, with the development of chemistry and the information sciences, one new challenge for chemists is the spectacular growth of measurement data that contain a large amount of chemical compound information, such as spectral databases, chromatographic databases, or large amounts of data on molecular structures and their properties. Could we directly obtain some useful knowledge from the rapidly growing volumes of chemical data? Modern microcomputers make digitized information easy to capture and fairly inexpensive to store and access, which gives us chemists a great opportunity to extract useful chemical knowledge from databases or from the huge amount of chemical data. This is also exactly the objective of new techniques, named knowledge discovery in databases (KDD) or data mining (DM), which have developed quickly in computer science, since DM aims to discover something new from the facts recorded in databases or in the huge amount of data collected [1–4]. Recently, Buydens and her colleagues gave an example of this [5].

As several large-scale databases of mass spectra are now available, it is possible for us to enlarge and
deepen our knowledge of this field. On the other hand, high-speed computers and statistical methods will make the approach easier and more fruitful. The aim of this paper is to show how the DM technique can extract some useful chemical knowledge from the databases. First, a comprehensive statistical investigation was conducted; then a preliminary approach to DM of a mass spectral library was developed based on this information. The rather astonishing results obtained from DM show for the first time that the ratios of isotope peaks to the molecular ion peak obey a log-normal distribution. With the help of statistical inference, a guideline on how to use the Beynon table efficiently is also given, in order to provide information for deciding on the molecular weight or molecular formula in mass spectral analysis from a statistical point of view. The work indicates that a broad study of DM of chemical databases is essential not only for the verification of existing knowledge, but also for the discovery of new rules in chemistry.

2. Experimental

A mass spectral library was established by transferring the NIST62 mass spectrum library, which is built into the Shimadzu GCMS-QP5000. The data of our library are stored as binary files and MATLAB data files. The number of mass spectra included in the library is 61999.

The spectra of compounds containing (C, H), (C, H, O), (C, H, N) or (C, H, O, N) in the NIST62 mass spectrum library were collected separately. Ratios of the isotopic peaks to the corresponding molecular ion peaks were then calculated, but only the results for (M + 1)/M are presented in this paper. For this purpose, a program was written in MATLAB to collect the molecular ion peaks, denoted by (M), and the corresponding isotopic peaks, denoted by (M + 1) and (M + 2), and then to calculate the ratios (M + 1)/M and/or (M + 2)/M. The true values of the ratios of the isotopic peaks to the corresponding molecular ion peaks were taken from the tables in the book of Beynon [6].

The mass spectrum of hexane was measured. The hexane was of analytical grade. The experimental conditions of the GC–MS were as follows.

Equipment: GC-17A Gas Chromatograph, QP-5000 Mass Spectrometer, Shimadzu.

Detection conditions: An OV-17 capillary column (30 m × 0.25 mm i.d.) is used. Column temperature is maintained at 60 °C for 2 min, then programmed from 60 to 270 °C at a rate of 20 °C/min. Inlet temperature is kept at 250 °C. Helium carrier gas is used at a constant flow-rate of 1 ml/min.

Mass spectrometer: Electron impact (EI+) mass spectra are recorded at 70 eV ionization energy in full scan mode in the 40–426 amu mass range at a scan speed of 0.2 s/scan. The ionization source temperature is set at 230 °C. Detected spectra are identified by matching the EI+ spectra against the National Institute of Standards and Technology (NIST) MS database containing about 62,000 compounds.

All programs are written in MATLAB 5.0 and run on a PC (CPU 200, RAM 128 MB).

3. Results and discussion

The molecular ion peak, denoted by (M) in this paper, is very important in mass spectral analysis, since with its help one can easily obtain information on the molecular weight as well as the molecular formula. The Beynon table is a common tool for checking the molecular ion peak and molecular formula in mass spectral analysis; in it, the ratios (M + 1)/M and/or (M + 2)/M are given as constants according to the natural isotopic abundance of each element. Thus, one can identify whether a peak is really a molecular ion peak by comparing the abundance ratio (M + 1)/M and/or (M + 2)/M calculated from the spectrum with the one listed in the Beynon table. For this reason, most textbooks on mass spectral analysis include a Beynon table for the convenience of users [7,8]. However, when we worked with the table, we found that the rate of consistency between the measured abundance ratio (M + 1)/M and the value listed in the Beynon table is rather low (Table 1). Is there something wrong with the Beynon table, or is something hidden in the facts?

With the help of the NIST62 mass spectral database, in which 62199 mass spectra were collected, a broad survey of the molecular ion peaks and their corresponding isotopic peaks, i.e. the (M + 1) and (M + 2) peaks, was conducted. In total, 18170 mass spectra were surveyed.

Table 1
Consistency between the measured abundance ratios of isotopic peaks to molecular peaks and the values listed in the Beynon table

Range of     Number of   Com. con.   Part. con.   Devia. (+10    Devia. (−10    Devia.     Devia.     Total rate of
molecular    samples     with.a      with.b       to +50%)c      to −50%)       (>+50%)    (<−50%)    consis.d (%)
weight
0–50         35          1           14           6              8              3          3          42.86
51–70        130         5           49           22             32             19         3          41.54
71–90        385         24          103          71             67             91         29         32.99
91–110       756         58          277          109            145            118        49         44.31
111–130      1388        80          456          274            275            225        78         38.62
131–150      1979        110         808          348            398            201        114        46.39
151–170      1772        130         744          344            316            148        90         49.32
171–190      2244        110         900          402            450            228        154        45.01
191–210      1740        136         749          370            350            135        100        50.86
211–230      1361        119         586          213            271            106        66         51.80
231–250      1164        135         509          178            205            80         57         55.33

a Com. con. with.: completely consistent with the values listed in the Beynon table (deviation ≤ 5%).
b Part. con. with.: partially consistent with the values listed in the Beynon table (5% < deviation ≤ 10%).
c Devia. (+10 to +50%): deviations in the range +10 to +50%.
d Total rate of consis.: total rate of consistency, including all deviations within ±10%.
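
The bookkeeping behind Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' MATLAB program: the function names are hypothetical, the measured value is invented, the per-atom coefficients are those of Eq. (1) in this section, and the classification uses the (M + 1) peak alone, whereas the paper's criterion for complete consistency also involves the (M + 2) peak.

```python
import re

# Per-atom contributions (in % of the M peak) taken from Eq. (1) of the paper.
BEYNON_M1 = {"C": 1.08, "H": 0.016, "N": 0.38, "O": 0.04}


def parse_formula(formula):
    """Parse a simple formula such as 'C6H14' into {'C': 6, 'H': 14}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def beynon_m1_ratio(formula):
    """(M+1)/M in percent for a C/H/N/O formula, following Eq. (1)."""
    counts = parse_formula(formula)
    unknown = set(counts) - set(BEYNON_M1)
    if unknown:
        raise ValueError(f"Eq. (1) only covers C, H, N, O; got {sorted(unknown)}")
    return sum(BEYNON_M1[el] * n for el, n in counts.items())


def deviation_percent(measured, expected):
    """Percentage deviation of a measured ratio from the table value, Eq. (2)."""
    return 100.0 * (measured - expected) / expected


def classify(deviation):
    """Assign a deviation (in %) to one of the Table 1 columns ((M+1) only)."""
    size = abs(deviation)
    if size <= 5:
        return "completely consistent"
    if size <= 10:
        return "partially consistent"
    if size <= 50:
        return "+10 to +50%" if deviation > 0 else "-10 to -50%"
    return ">+50%" if deviation > 0 else "<-50%"


def total_consistency_rate(n_complete, n_partial, n_samples):
    """Share (in %) of spectra whose deviation lies within +/-10%."""
    return 100.0 * (n_complete + n_partial) / n_samples


expected = beynon_m1_ratio("C6H14")   # hexane
measured = 7.2                        # hypothetical measured (M+1)/M in %
dev = deviation_percent(measured, expected)
print(round(expected, 3), round(dev, 1), classify(dev))
print(round(total_consistency_rate(1, 14, 35), 2))  # counts from the first row of Table 1
```

The last line recomputes the final column of the first table row from its counts, which is a quick way to check that the rows of Table 1 are internally consistent.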

Among the 18170 mass spectra surveyed, 1681 molecular formulae are included. Excluding 5216 mass spectra that lack either a molecular ion peak or an (M + 1) peak, the abundance ratios (M + 1)/M and/or (M + 2)/M of 12954 mass spectra were calculated. For compounds with the molecular formula $\mathrm{C}_W\mathrm{H}_X\mathrm{N}_Y\mathrm{O}_Z$, the abundance ratio (M + 1)/M listed in the Beynon table can be calculated as follows

$$\mathrm{Ratio}_{(M+1),\,\mathrm{Beynon}} = 1.08W + 0.016X + 0.38Y + 0.04Z \qquad (1)$$

where $\mathrm{Ratio}_{(M+1),\,\mathrm{Beynon}}$ denotes the abundance ratio (M + 1)/M listed in the Beynon table, expressed as a percentage of the M peak intensity. The deviation of the measured ratio (M + 1)/M from the value listed in the Beynon table can then be estimated using the following equation

$$\mathrm{Deviation}\,\% = 100 \times \frac{\mathrm{Ratio}_{(M+1),\,\mathrm{measured}} - \mathrm{Ratio}_{(M+1),\,\mathrm{Beynon}}}{\mathrm{Ratio}_{(M+1),\,\mathrm{Beynon}}} \qquad (2)$$

The results obtained are shown in Table 1. From the table, one can easily see that simply using the Beynon table to confirm the molecular ion peak (or molecular weight) as well as the molecular formula might be misleading, since the rate of complete consistency with the values in the Beynon table is only around 5–10% (column 2 and column 3 in Table 1). Notice that here we say that the calculated ratios are completely consistent with the ones listed in the Beynon table if and only if their deviations are within ±5% for both the (M + 1) peaks and the (M + 2) peaks. The total consistency rate for the (M + 1) peaks alone is around 50% (last column in Table 1). Almost half of the deviations fall into the range ±(10–50%) (column 5 and column 6 in Table 1), and some even lie outside the range from −50 to +50%. The results seem to suggest that the measured ratios (M + 1)/M may obey some statistical distribution.

In order to confirm this assumption, a comprehensive investigation over the whole database of mass spectra was then conducted. The molecules containing (C, H), (C, H, O), (C, H, N) or (C, H, O, N) were all investigated. The total number of compounds in the database containing only the elements carbon and hydrogen is 4463. The results are shown in Fig. 1. From this plot, one can see that the distribution of the abundance ratios of isotopic peaks to molecular peaks, i.e. (M + 1)/M, looks like a normal distribution with some tailing (top part of Fig. 1). However, this impression has to be checked with a normal probability plot.

Fig. 1. Statistical results for compounds containing carbon and hydrogen. Top: distribution of the deviations of the measured (M + 1)/M abundance ratios from the values in the Beynon table; lower left: normal probability plot of the distribution; lower right: normal probability plot of the distribution after logarithmic transformation.

Fig. 2. Statistical results for compounds containing carbon, hydrogen and nitrogen. Top: distribution of the deviations of the measured (M + 1)/M abundance ratios from the values in the Beynon table; lower left: normal probability plot of the distribution; lower right: normal probability plot of the distribution after logarithmic transformation.
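
The comparison made in the lower panels of these figures, a normal probability plot before and after logarithmic transformation, can be imitated numerically. The sketch below is an assumption-laden illustration rather than the authors' procedure: it replaces the visual check by the correlation between the sorted data and standard-normal quantiles, uses one common plotting-position convention, and runs on synthetic log-normal numbers that merely stand in for measured (M + 1)/M ratios. The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def probability_plot_correlation(values):
    """Correlation between sorted data and standard-normal quantiles.

    Values close to 1 mean that the points of a normal probability plot
    lie close to a straight line.
    """
    data = sorted(values)
    n = len(data)
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, my = mean(quantiles), mean(data)
    sxy = sum((q - mx) * (d - my) for q, d in zip(quantiles, data))
    sxx = sum((q - mx) ** 2 for q in quantiles)
    syy = sum((d - my) ** 2 for d in data)
    return sxy / math.sqrt(sxx * syy)


# Synthetic stand-in for measured (M+1)/M ratios: NOT data from the paper.
rng = random.Random(0)
ratios = [rng.lognormvariate(math.log(6.7), 0.5) for _ in range(2000)]

r_raw = probability_plot_correlation(ratios)
r_log = probability_plot_correlation([math.log(r) for r in ratios])
print(f"raw ratios: r = {r_raw:.4f}")
print(f"log ratios: r = {r_log:.4f}")
```

For data of this kind the correlation for the log-transformed values is expected to be closer to 1 than that for the raw values, which is the numerical counterpart of a probability plot that becomes straight only after the transformation.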

When the normal probability plot is used to check whether the distribution is approximately normal, the answer is negative. The result is shown in the lower left part of Fig. 1: the observations cannot be expressed by a straight line in the normal probability plot. What kind of distribution do the abundance ratios of isotopic peaks to molecular peaks obey? The log-normal distribution was then tried. The results are also shown in Fig. 1 (lower right part). They clearly show that the ratios approximately obey a log-normal distribution. In order to further confirm this result, we went on to investigate the compounds containing carbon, hydrogen and nitrogen (C, H, N), with a total of 4310 molecules; the compounds containing carbon, hydrogen and oxygen (C, H, O), with a total of 9833 molecules; and the compounds containing carbon, hydrogen, nitrogen and oxygen (C, H, O, N), with a total of 10519 molecules. The results are shown in Figs. 2–4, respectively. From these figures, one can see that the abundance ratios of isotopic peaks to molecular peaks really do obey the log-normal distribution. In particular, the results shown in Figs. 3 and 4 demonstrate excellent agreement between the assumption and the facts. The reason for this might be that the total numbers of (C, H, O) and (C, H, O, N) compounds, 9833 and 10519, respectively, are much larger than the numbers of (C, H) and (C, H, N) compounds.

Fig. 3. Statistical results for compounds containing carbon, hydrogen and oxygen. Top: distribution of the deviations of the measured (M + 1)/M abundance ratios from the values in the Beynon table; lower left: normal probability plot of the distribution; lower right: normal probability plot of the distribution after logarithmic transformation.

It is worth noting that the performance of any such method depends strongly on the reproducibility of the intensities (abundances) in mass spectra. It was found that mass spectral intensities have strong heteroscedastic noise [9], which influences the intensities of mass spectra.

Fig. 4. Statistical results for compounds containing carbon, hydrogen, nitrogen and oxygen. Top: distribution of the deviations of the measured (M + 1)/M abundance ratios from the values in the Beynon table; lower left: normal probability plot of the distribution; lower right: normal probability plot of the distribution after logarithmic transformation.

Fig. 5. Statistical results for the compound hexane measured on GC–MS. Top: distribution of the deviations of the measured (M + 1)/M abundance ratios from the value in the Beynon table; lower left: normal probability plot of the distribution; lower right: normal probability plot of the distribution after logarithmic transformation.
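
If replicate measurements of a single compound, such as the hexane data summarized in Fig. 5, are treated as draws from a log-normal distribution, a simple form of statistical inference is to work on the logarithmic scale and ask whether the Beynon value falls inside an interval around the geometric mean. The sketch below is one possible reading of that idea, not the guideline derived in the paper: the function names are hypothetical, the replicate values are invented, and a normal quantile is used as a large-sample approximation.

```python
import math
from statistics import NormalDist, mean, stdev


def lognormal_interval(ratios, confidence=0.95):
    """Approximate confidence interval for the geometric mean of the ratios.

    Works on log-transformed values and uses a normal quantile
    (large-sample approximation; a t quantile would be wider for small n).
    """
    logs = [math.log(r) for r in ratios]
    n = len(logs)
    centre = mean(logs)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(logs) / math.sqrt(n)
    return math.exp(centre - half_width), math.exp(centre + half_width)


def consistent_with_table(ratios, table_value, confidence=0.95):
    """True if the table value lies inside the interval."""
    low, high = lognormal_interval(ratios, confidence)
    return low <= table_value <= high


# Hypothetical replicate (M+1)/M values in % for hexane: NOT measured data.
replicates = [6.1, 7.4, 6.9, 5.8, 7.9, 6.5, 7.1, 6.3]
table_value = 1.08 * 6 + 0.016 * 14   # Eq. (1) of the paper for C6H14
low, high = lognormal_interval(replicates)
print(f"geometric-mean interval: {low:.2f} to {high:.2f}")
print("table value inside interval:", consistent_with_table(replicates, table_value))
```

Working with logarithms keeps the interval positive and asymmetric on the original scale, which matches the tailing seen in the top panels of the figures.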

The influence of this heteroscedastic noise on the intensities of mass spectra should therefore be taken into account when one uses the ratio (M + 1)/M to decide, with the help of the Beynon table, whether an m/e peak is a molecular peak. Similarly, Grotch found several decades ago that abundance values of ion fragments measured by mass spectroscopy closely follow a log-normal distribution [10]. This is consistent with what we found in this work. In order to confirm this conclusion, we further conducted experiments in which a known compound, hexane, was measured on GC–MS. The result is shown in Fig. 5. From this plot, one can see that the log-normal distribution appears even for a single compound. This shows clearly that one cannot simply treat the ratio (M + 1)/M as a constant, as is commonly done in chemistry. One has to regard it as a random variable obeying a log-normal distribution. Thus, statistical inference is necessary in order to arrive at a conclusion.

4. Conclusion

The results obtained in this work from DM show for the first time that the abundance ratios of isotope peaks to the molecular ion peak obey a log-normal distribution. This suggests that DM based on efficient computation on large databases is quite necessary, since DM is useful not only for the verification of existing knowledge, but also for the discovery of new knowledge in chemistry. On the other hand, large, high-quality databases of spectra, including infrared, mass, NMR and UV-visible spectra, have become available in recent years. How to use efficiently the chemical information accumulated over a long history of chemical research will be a new challenge for chemists. We believe that comprehensive DM of mass spectral libraries will provide an opportunity to develop a new kind of expert system that is stronger and more effective.

References

[1] U. Fayyad, R. Uthurusamy, Commun. ACM 39 (1996) 27–34.
[2] C. Glymour, D. Madigan, D. Pregibon, P. Smyth, Commun. ACM 39 (1996) 35–41.
[3] W.H. Inmon, Commun. ACM 39 (1996) 49–50.
[4] U. Fayyad, D. Haussler, P. Stolorz, Commun. ACM 39 (1996) 51–57.
[5] L.M.C. Buydens, T.H. Reijmers, M.L.M. Beckers, R. Wehrens, Chemom. Intell. Lab. Syst. 49 (1999) 121–133.
[6] J.H. Beynon, A.E. Williams, Mass and Abundance Tables for Use in Mass Spectrometry, Elsevier, Amsterdam, 1963.
[7] J.R. Chapman, Computerized Mass Spectrometry, Academic Press, London, 1978.
[8] Chen Yaozhu, Organic Analysis, Higher Education Publishing House, Beijing, 1983, pp. 694.
[9] O.M. Kvalheim, F. Brakstad, Y.Z. Liang, Anal. Chem. 66 (1994) 43–51.
[10] S.L. Grotch, Am. Soc., Mass Spectrom. 34 (1969) 459–466.