°C) proteins and the low Tm (Tm < 55 °C) proteins.
2.2. Statistical method
Statistical inferences were made to establish a correlation
between the Tm of a protein and the composition of dipeptides
within its sequence (Guruprasad et al., 1990). The statistical method
we used to distinguish the high Tm and low Tm groups was modified
from the method previously used to calculate the Instability Index (II)
(Guruprasad et al., 1990). First, the chi-square test was used to
evaluate the statistical significance of the relationship between Tm
value and dipeptide occurrence in each group of proteins. The
mathematical expectation of each group is defined by,
$$E(X) = \sum_{x=-\infty}^{x=\infty} x \, f_{x}(X)$$
In the case of two independent random variables, the expected
value can be calculated from,
E(XY) = E(X) E(Y)
Since there are 20 amino acids, the above equation may be
written as,
$$E(xy) = \frac{N_{obs}(x)}{T} \cdot \frac{N_{obs}(y)}{T} \cdot \sum_{y=1}^{y=20} \sum_{x=1}^{x=20} N_{obs}(xy)$$
where T is the total number of amino acids in a particular group;
$N_{obs}(x)$ and $N_{obs}(y)$ are the observed occurrences of amino acids
x and y, respectively, and $N_{obs}(xy)$ is the observed occurrence of
dipeptide xy. From the chi-square definition, the equation is,
$$\chi^2(xy) = \frac{[N_{obs}(xy) - E(xy)]^2}{E(xy)}$$
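The expected-count and chi-square definitions above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `dipeptide_chi_square`, the counting of overlapping dipeptides within each sequence, the skipping of non-standard residues, and the handling of E(xy) = 0 are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dipeptide_chi_square(sequences):
    """Return (expected, chi2) dicts keyed by dipeptide for one protein group."""
    residue_counts = Counter()    # N_obs(x)
    dipeptide_counts = Counter()  # N_obs(xy)
    for seq in sequences:
        seq = seq.upper()
        residue_counts.update(r for r in seq if r in AMINO_ACIDS)
        for a, b in zip(seq, seq[1:]):
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                dipeptide_counts[a + b] += 1

    total_residues = sum(residue_counts.values())      # T
    total_dipeptides = sum(dipeptide_counts.values())  # double sum of N_obs(xy)
    if total_residues == 0:
        raise ValueError("no standard amino acid residues found")

    expected, chi2 = {}, {}
    for x, y in product(AMINO_ACIDS, repeat=2):
        freq_x = residue_counts[x] / total_residues
        freq_y = residue_counts[y] / total_residues
        e = freq_x * freq_y * total_dipeptides
        expected[x + y] = e
        # Chi-square is undefined when E(xy) = 0; it is set to 0 here (assumption).
        chi2[x + y] = (dipeptide_counts[x + y] - e) ** 2 / e if e > 0 else 0.0
    return expected, chi2
```

Calling the function once on the high Tm sequences and once on the low Tm sequences would give the two sets of 400 chi-square values that the following steps compare against their group averages.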
The average chi-square for each group is,
$$\chi^2_{avg} = \frac{1}{400} \sum_{xy=1}^{xy=400} \chi^2(xy)$$
The average chi-square value was then used as the confidence
limit to select the significant dipeptides for the high Tm and low Tm
groups of proteins, respectively:
$$\chi^2_{H} \geq \chi^2_{Havg} \quad \text{and} \quad \chi^2_{L} \geq \chi^2_{Lavg}$$
The potential occurrence P(xy) for each dipeptide is given by,
$$P(xy) = \frac{N_{obs}(xy)}{E(xy)}$$
After comparing each chi-square value with the confidence limit, we
retained the potential occurrence P(xy) of the significant dipeptides
and set the other P(xy) to zero. The relative occurrence $P_{rev}(xy)$
of the significant dipeptides was given by,
$$P_{rev}(xy) = P(xy) - 1$$
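The selection step can be sketched as below, assuming per-dipeptide dictionaries of observed counts, expected counts and chi-square values are already available for one group. Two points are interpretations rather than statements from the text: the comparison with the average uses "greater than or equal to", and non-significant dipeptides are given a relative occurrence of 0 so that they stay neutral in later steps. The function name `relative_occurrence` is hypothetical.

```python
def relative_occurrence(observed, expected, chi2):
    """Compute P_rev(xy) = P(xy) - 1 for the significant dipeptides of one group."""
    # Average chi-square over all dipeptides supplied (400 in the paper).
    chi2_avg = sum(chi2.values()) / len(chi2)
    p_rev = {}
    for dipeptide, e in expected.items():
        significant = e > 0 and chi2[dipeptide] >= chi2_avg
        if significant:
            p = observed.get(dipeptide, 0) / e  # potential occurrence P(xy)
            p_rev[dipeptide] = p - 1.0
        else:
            # Assumption: non-significant dipeptides carry no signal.
            p_rev[dipeptide] = 0.0
    return p_rev
```

With this convention a positive value marks a dipeptide observed more often than expected, and a negative value marks one observed less often than expected.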
A 20 by 20 matrix of relative potential occurrence is obtained
for each group. A value >0 indicates that the dipeptide is significantly
over-represented in that group of proteins, and a value <0 indicates
that it is significantly under-represented. These two
relative potential occurrence matrices are then combined into one 20
by 20 matrix, named the Tm weight value table ($P_{index}$), using the
equation,
$$P_{index}(xy) = [P_{revHighTm}(xy) - P_{revLowTm}(xy) + 1] \times 100$$
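A minimal sketch of how the two relative-occurrence matrices could be combined into Tm weight values is given below. The dictionaries are keyed by two-letter dipeptide strings, the function name `tm_weight_table` is hypothetical, and treating a missing entry as 0 is an assumption of this sketch.

```python
def tm_weight_table(p_rev_high, p_rev_low):
    """Combine P_rev of the high and low Tm groups into P_index values."""
    dipeptides = set(p_rev_high) | set(p_rev_low)
    return {
        d: (p_rev_high.get(d, 0.0) - p_rev_low.get(d, 0.0) + 1.0) * 100.0
        for d in dipeptides
    }


# Illustrative values only, not taken from the paper.
table = tm_weight_table({"AA": 0.5, "AC": 0.0}, {"AA": 0.0, "AC": 0.5})
print(table["AA"], table["AC"])  # 150.0 50.0
```

A dipeptide that is neutral in both groups maps to exactly 100, which is why 100 acts as the dividing line between stabilizing and destabilizing weights.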
In the $P_{index}$ equation, 100 is a scaling factor. Dipeptides with a
Tm weight value >100 may contribute to the thermostability of proteins,
whereas those with a Tm weight value <100 may reduce it. The
Tm weight values ($P_{index}$) for all 400 possible dipeptide
combinations are presented as a matrix in Table 2. Finally, Table 2
was applied to predict the Tm value from the protein sequence; the Tm
Index (TI) for a protein was computed using $P_{index}$ by the equation:
$$TI = \frac{\dfrac{100}{L} \sum_{i=1}^{L-1} P_{index}(x_i y_{i+1}) - 9372}{398}$$
where $x_i y_{i+1}$ designates a specific dipeptide within the sequence,
L is the number of amino acid residues in the sequence and 100 is
a scaling factor. The numbers 9372 and 398 are empirical values.
The TI (Tm Index) values for the proteins in the high and low Tm
groups are given in the right-most column of Table 1.
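The TI calculation can be sketched as follows, reading the formula as the scaled sum minus 9372, all divided by 398. The function name `tm_index` and the `default_weight` parameter (a neutral 100 for any dipeptide missing from the supplied table) are assumptions of this sketch. The toy tripeptide is far too short to be a realistic protein and only illustrates the arithmetic.

```python
def tm_index(sequence, p_index, default_weight=100.0):
    """TI = [(100 / L) * sum of P_index over consecutive dipeptides - 9372] / 398."""
    seq = sequence.upper()
    length = len(seq)  # L, the number of residues
    if length < 2:
        raise ValueError("sequence must contain at least two residues")
    total = sum(
        p_index.get(seq[i:i + 2], default_weight)  # dipeptide x_i y_(i+1)
        for i in range(length - 1)
    )
    return ((100.0 / length) * total - 9372.0) / 398.0


# His-Cys and Cys-Pro weights as listed in Table 2.
weights = {"HC": 402.0, "CP": 255.0}
print(round(tm_index("HCP", weights), 3))  # 31.477
```

For a real protein the full 400-entry table would be supplied, so `default_weight` would only matter for dipeptides containing non-standard residues.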
3. Results and discussion
3.1. Dipeptides involved in melting temperature of proteins
Dipeptides constitute the smallest unit that defines order in an
amino acid sequence. We calculated the propensities for amino
acids to interact with residues occurring before and after in the
amino acid sequence. The calculation of relative potential occur-
rence reveals that each amino acid may have a different tendency
to occur next to a particular amino acid at both its N-terminal and
C-terminal side. As shown in Table 2, each dipeptide is suggested
to contribute differently to the Tm Index (TI) of a given protein.
The dipeptides with relatively high Tm weight values (i.e., >100)
may contribute to a higher Tm, whereas those having lower values
(<100) may reduce the Tm. For example, the occurrence of His-Cys,
Trp-Met and Cys-Pro in a protein may contribute to a higher TI,
whereas the occurrence of Met-Met, Trp-Cys, Asp-Trp and Trp-Pro
may reduce the TI.
T. Ku et al. / Computational Biology and Chemistry 33 (2009) 445450 447
Table 1
Properties of the 35 high and low Tm proteins used in the analysis.
High Tm proteins (>65 °C)

| No.^a | Protein | Experimental pH | Tm range (°C) | TI^b |
|---|---|---|---|---|
| 1 | Odorant binding protein | 6.6 | 68–77 (Burova et al., 1999) | 2.915 |
| 2 | Alpha-chymotrypsin | 7.0–8.0 | 86 (Bae and Sturtevant, 1995) | 2.777 |
| 3 | Gamma-crystallin | 6.8–7.0 | 68–78 (Sen et al., 1992) | 5.807 |
| 4 | Glutamate dehydrogenase | 6.0–8.0 | 89 (Lebbink et al., 1999) | 2.985 |
| 5 | Dsba | 7.0 | 68–77 (Moutiez et al., 1999) | 3.420 |
| 6 | FLT3 | 6.1–7.4 | 78–80 (Remmele et al., 1999) | 3.121 |
| 7 | Procarboxypeptidase A | 7.5 | 88–89 (Sanchez-Ruiz et al., 1988) | 2.276 |
| 8 | Carboxylesterase Est2 | 7.5 | 92 (Del Vecchio et al., 2002) | 3.198 |
| 9 | Pyrophosphatase | 7.0 | 65–99 (Leppanen et al., 1999) | 3.371 |
| 10 | Thioredoxin bacillus | 7.0 | 85 (Pedone et al., 1999) | 1.455 |
| 11 | Superoxide dismutase | 7.8 | 88.9 (Leveque et al., 2000) | 1.890 |
| 12 | Ribonuclease H | 6.0 | 66–86 (Hollien and Marqusee, 1999) | 3.455 |
| 13 | Thrombin | 7.4 | 71–81 (Lentz et al., 1994) | 3.180 |
| 14 | Tumor suppressor P53 | 6.0–7.0 | 81–84 (Johnson et al., 1995) | 2.559 |
| 15 | Bacteriorhodopsin | 7.5 | 95.1 (Azuaga et al., 1996) | 1.001 |
| 16 | Thioredoxin | 6.5–7.5 | 85.3 (Bolon and Mayo, 2001) | 1.350 |

Low Tm proteins (<55 °C)

| No.^a | Protein | Experimental pH | Tm range (°C) | TI^b |
|---|---|---|---|---|
| 1 | Cro protein | 6.0 | 30–55 (Padmanabhan et al., 1999) | 1.585 |
| 2 | Beta lactamase | 7.5 | 41 (Rahil and Pratt, 1994) | 1.697 |
| 3 | Barnase | 6.0–7.0 | 53 (Kellis et al., 1989) | 2.018 |
| 4 | Aldolase | 7.0 | 42.5–44 (Rudolph et al., 1992) | 0.924 |
| 5 | Adrenodoxin | 6.5–7.4 | 47.5–53.7 (Burova et al., 1995) | 0.598 |
| 6 | Fibroblast growth factor | 6.6 | 25–39 (Culajay et al., 2000) | 1.463 |
| 7 | Ribonuclease T1 | 7.0 | 50.8 (Giletto and Pace, 1999) | 1.256 |
| 8 | C-Myb DNA-binding domain | 7.5 | 39 (Morii et al., 1999) | 2.085 |
| 9 | Staphylococcal nuclease | 6.0–8.0 | 52 (Leung et al., 2001) | 1.732 |
| 10 | Tropomyosin | 7.5 | 45–53 (Ishii, 1994) | 0.001 |
| 11 | Tumor suppressor protein P16 | 7.5 | 42 (Boice and Fairman, 1996) | 1.862 |
| 12 | Myoglobin | 7.0 | 52 (Staniforth et al., 2000) | 1.724 |
| 13 | Myosin | 7.5–8.0 | 31–45 (Masino et al., 2000) | 3.333 |
| 14 | Chymotrypsin inhibitor | 6.3 | 46 (Ruiz-Sanz et al., 1995) | 2.724 |
| 15 | Histone | 6.5–7.5 | 46–47 (Karantza et al., 2001) | 0.681 |
| 16 | Tryptophan synthase | 7.8 | 46 (Ahmed et al., 1988) | 0.637 |
| 17 | Glucanohydrolase | 6.0 | 48.7 (Wele et al., 1996) | 2.868 |
| 18 | Cro repressor protein | 7.0 | 39 (Pakula and Sauer, 1990) | 2.473 |
| 19 | Alpha lactalbumin | 7.0 | 36–43 (Harushima and Sugai, 1989) | 4.680 |

^a The serial number of the protein.
^b Tm Index.
Table 2
Tm weight values for 400 possible dipeptides. The rows denote the first residue of the dipeptide, and the columns denote the second residue. Weight values >100 are
highlighted in light gray, and those <100 are highlighted in deep gray.
Dipeptide Tm weight values
A C D E F G H I K L M N P Q R S T V W Y
A 100 100 48.6 41.8 100 100 100 100 126 130 150 58.8 100 36 54.4 100 141 168 100 87
C 28.2 100 78 29.3 100 161 100 32.1 100 100 100 100 255 100 100 100 62.1 100 100 100
D 100 100 100 100 142 12 24.7 55.8 100 134 25.1 52.4 100 165 100 143 48 140 99 100
E 100 26.7 100 110 100 178 100 100 100 64.4 102 143 62.6 54.5 142 83.1 115 100 100 63.5
F 100 168 100 100 3.45 100 248 87.2 153 16.7 100 100 3.45 168 100 144 100 26.3 197 100
G 100 100 100 174 107 124 100 150 66.5 105 51.9 138 100 15 61.3 138 100 2.2 151 67.3
H 100 402 93 100 177 100 100 59 24.5 33.4 100 36 189 100 28.6 184 100 100 100 100
I 100 43 35.3 100 37 136 244 100 100 100 100 100 100 100 100 100 158 100 100 153
K 100 29 100 100 105 100 100 100 141 47.5 100 165 58.3 100 137 57.5 100 132 100 121
L 100 56.3 100 132 100 63.7 100 100 100 133 32.2 100 100 54.2 100 100 100 100 100 173
M 164 100 226 23.4 100 153 100 100 100 100 209 219 100 100 100 46.6 37.2 37.1 100 100
N 100 76 100 102 100 100 100 100 3.49 100 164 2.25 100 116 166 100 100 161 100 88.1
P 154 159 100 100 100 61 100 100 100 116 100 100 139 168 34 100 52.3 88.6 100 100
Q 100 8.5 147 156 42.1 132 188 60.1 100 95.1 100 100 100 253 15.8 63.7 100 43.6 40.6 149
R 62.1 100 140 66.3 100 100 25.9 16.3 137 100 100 100 159 47.5 144 67.6 100 93.6 100 100
S 68.8 100 159 68.6 149 100 88.5 100 100 136 100 100 100 100 100 44.6 100 55.6 100 100
T 129 100 100 106 100 100 100 100 34.1 100 100 100 62.8 100 100 100 100 100 26.8 100
V 100 173 100 100 100 73.4 24.5 100 174 158 100 100 43.6 100 53.3 100 87.3 146 100 41.2
W 100 153 100 25 100 151 100 211 100 69 306 100 84 100 158 25.5 100 100 100 23
Y 100 100 151 140 100 11.7 191 100 100 100 100 88.1 197 212 100 40.1 100 30.9 100 21.6
Table 3
The predicted high Tm protein percentage (HTPP) of 75 genomes, ranked by decreasing
HTPP. Mesophiles (OGT < 55 °C) in deep gray.
Genome Kingdom OGT^a (°C) HTPP^b
Aquifex aeolicus B 90 66.9
Pyrococcus abyssi A 97 62.3
Thermotoga maritima B 80 62.2
Thermoanaerobacter tengcongensis B 75 61.1
Pyrococcus horikoshii A 95 60.5
Pyrococcus furiosus A 98 60.1
Aeropyrum pernix A 90 58.1
Archaeoglobus fulgidus A 82 57.8
Methanococcus jannaschii A 85 57.7
Pyrobaculum aerophilum A 98 57.4
Bacillus halodurans B 30 57.3
Methanopyrus kandleri A 98 56.8
Sulfolobus solfataricus A 78 56.7
Sulfolobus tokodaii A 80 55.7
Helicobacter pylori 26695 B 37 55.6
Bacillus subtilis B 30 55.2
Chlamydia muridarum B 37 54.5
Campylobacter jejuni B 37 54.1
Chlamydia trachomatis B 37 53.4
Pasteurella multocida B 37 52.9
Listeria monocytogenes strain EGD B 37 52.5
Mycoplasma pulmonis B 37 52.4
Lactococcus lactis subsp. lactis B 30 52.3
Streptococcus pneumoniae R6 B 37 52.2
Listeria innocua Clip11262 B 37 52.1
Methanosarcina mazei strain Goe1 A 37 52.0
Thermoplasma volcanium A 60 51.9
Fusobacterium nucleatum subsp. B 37 51.2
Chlamydophila pneumoniae CWL029 B 37 50.7
Methanobacterium thermoautotrophicum A 65 50.6
Thermoplasma acidophilum A 58 50.4
Clostridium perfringens B 37 50.3
Staphylococcus aureus N315 B 37 49.8
Chlamydophila pneumoniae AR39 B 37 49.4
Staphylococcus aureus strain Mu50 B 37 49.3
Methanosarcina acetivorans str. C2A A 39 49.2
Staphylococcus aureus MW2 B 37 49.1
Haemophilus influenzae Rd B 37 49.0
Streptococcus pyogenes B 37 48.7
Neisseria meningitidis Z2491 B 37 48.2
Buchnera aphidicola str. APS B 26 47.5
Synechocystis sp. PCC 6803 B 37 46.7
Neisseria meningitidis MC58 B 37 46.6
Mycoplasma genitalium B 37 46.1
Clostridium acetobutylicum ATCC824 B 37 46.0
Borrelia burgdorferi B 35 45.9
Mycoplasma pneumoniae B 37 45.8
Nostoc sp. PCC 7120 B 30 44.3
Vibrio cholerae B 28 44.1
Salmonella typhimurium LT2 B 37 43.2
Chlorobium tepidum TLS B 48 43.1
Yersinia pestis KIM B 37 42.5
Deinococcus radiodurans B 30 42.2
Escherichia coli K12 B 37 42.0
Ureaplasma urealyticum B 37 41.6
Corynebacterium glutamicum B 30 40.9
Agrobacterium tumefaciens strain C58 B 30 40.6
Sinorhizobium meliloti 1021 B 26 40.5
Escherichia coli O157:H7 B 37 40.4
Escherichia coli O157:H7 EDL933 B 37 40.2
Pseudomonas aeruginosa B 37 40.1
Caulobacter crescentus B 30 39.5
Streptomyces coelicolor A3(2) B 28 38.2
Halobacterium sp. NRC-1 A 37 37.4
Xylella fastidiosa 9a5c B 26 37.3
Brucella melitensis B 37 37.3
Rickettsia conorii Malish 7 B 37 37.3
Mesorhizobium loti B 26 36.5
Rickettsia prowazekii B 37 36.4
Ralstonia solanacearum B 30 35.2
Mycobacterium leprae B 37 35.2
Mycobacterium tuberculosis H37Rv B 37 34.5
Treponema pallidum B 37 34.4
Mycobacterium tuberculosis CDC1551 B 37 34.3
Xanthomonas campestris pv. campestris B 26 33.2
^a Optimal growth temperature.
^b High Tm (Tm > 65 °C) are highlighted in white, thermophiles (55 °C < OGT < 80 °C) in black.
teins that we predicted. For each genome, we calculated a high
Tm protein percentage (HTPP), which designates the percentage of
encoded proteins for which the predicted Tm is greater than 65 °C.
All mesophiles had an HTPP of less than 56%, whereas the HTPP
for the hyperthermophilic bacterial genomes exceeded 56%. The
only exception was Bacillus halodurans, a facultative alkaliphile and
extremophile from deep-sea environments (Takami and Horikoshi,
2000). This unique high-pressure niche may be the major factor behind
the unexpected prediction.
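The HTPP definition amounts to a simple percentage, sketched below. How each encoded protein is judged to have a predicted Tm above 65 °C is not restated in this passage, so the sketch takes one boolean flag per protein as input; the function name `high_tm_protein_percentage` is hypothetical.

```python
def high_tm_protein_percentage(predicted_high_tm):
    """Return the percentage of proteins flagged as predicted Tm > 65 C."""
    flags = list(predicted_high_tm)
    if not flags:
        raise ValueError("no proteins supplied")
    return 100.0 * sum(1 for flag in flags if flag) / len(flags)


# Illustrative input: three of four proteins predicted to be high Tm.
print(high_tm_protein_percentage([True, False, True, True]))  # 75.0
```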
The clear boundary (Table 3) between the hyperthermophiles
and the mesophiles indicates that the rapid TI method can
distinguish between these two types of organisms. This analysis included
both bacteria and archaea, and thus the phylogenetic relationships
did not affect the results.
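The 56% boundary can be checked against a handful of rows copied from Table 3, as in the sketch below. The helper name `mesophiles_above_cutoff` and its parameter names are hypothetical; the 56% cutoff and the 55 °C mesophile limit are the values quoted in the text and the Table 3 caption.

```python
# (genome, OGT in degrees C, HTPP in %) copied from Table 3
rows = [
    ("Aquifex aeolicus", 90, 66.9),
    ("Pyrococcus abyssi", 97, 62.3),
    ("Bacillus halodurans", 30, 57.3),
    ("Helicobacter pylori 26695", 37, 55.6),
    ("Escherichia coli K12", 37, 42.0),
]


def mesophiles_above_cutoff(rows, htpp_cutoff=56.0, mesophile_ogt=55):
    """List mesophiles (OGT below 55 C) whose HTPP exceeds the cutoff."""
    return [
        name
        for name, ogt, htpp in rows
        if ogt < mesophile_ogt and htpp > htpp_cutoff
    ]


print(mesophiles_above_cutoff(rows))  # ['Bacillus halodurans']
```

On this subset the only mesophile above the cutoff is Bacillus halodurans, matching the exception noted in the text.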
The optimal growth temperature (OGT) of the first 14 genomes
listed in Table 3 is greater than 75