$\sum_{s=1} \left( d^{1}_{b_1(m)+s,\, b_1(k)+s} - d^{2}_{b_2(m)+s,\, b_2(k)+s} \right)^{2}$ (1)
where $d^{1}_{i,j}$ and $d^{2}_{i,j}$ are the distances between residues $i$ and $j$ in protein 1 and protein 2, respectively; $b_1(m)$, $b_1(k)$, $b_2(m)$ and $b_2(k)$ are the starting positions of AFPs $m$ and $k$ in proteins 1 and 2, respectively, as defined earlier; and $L$ is the length of each AFP.
Flexible structure alignment by chaining aligned fragment pairs allowing twists
Downloaded from bioinformatics.oxfordjournals.org by guest on February 12, 2011
Flexible structure alignment
Flexible structure alignment can be formulated as an AFP chaining process (Gusfield, 1999) allowing at most $t$ twists; the flexible structure alignment reduces to a rigid structure alignment when $t$ is 0. Dynamic programming is used to find the optimal chaining. If we denote by $S(k)$ the best score of a chain ending at AFP $k$, it can be calculated from the best scores ending at previous AFPs that can be connected with AFP $k$, subject to the following constraints (Fig. 2b),
$$S(k) = a(k) + \max\left\{ \max_{\substack{e_1(m) < b_1(k) \\ e_2(m) < b_2(k)}} \left[ S(m) + c(m \to k) \right],\; 0 \right\} \quad \text{s.t. } T(k) \le t \qquad (2)$$
where $a(k)$ is the score of AFP $k$ itself; $c(m \to k)$ is the score of introducing a connection between AFP $m$ and AFP $k$; and $T(k)$ is the number of twists required for connecting the chain of AFPs leading up to $S(k)$, which is calculated by

$$T(k) = T(m) + t(m \to k) \qquad (3)$$

where $t(m \to k)$ is 1 if a twist is required to connect AFPs $m$ and $k$, and 0 if no twist is required.
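The recurrence in Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the FATCAT implementation: the function name `chain_afps`, the `(b1, b2)` tuple representation of an AFP, and the callables `afp_score`, `connect_score` and `needs_twist` (standing in for $a(k)$, $c(m \to k)$ and $t(m \to k)$) are all hypothetical, and the ending position of an AFP is assumed to be its start plus $L - 1$.

```python
from typing import Callable, List, Sequence, Tuple

AFP = Tuple[int, int]  # hypothetical representation: start positions (b1, b2)


def chain_afps(
    afps: Sequence[AFP],
    length: int,                                  # L, shared by all AFPs
    afp_score: Callable[[int], float],            # a(k)
    connect_score: Callable[[int, int], float],   # c(m -> k)
    needs_twist: Callable[[int, int], bool],      # True when t(m -> k) = 1
    max_twists: int,                              # t
) -> Tuple[float, List[int]]:
    """Return (best score, AFP indices on the best chain) following Eqs 2-3."""
    if not afps:
        return 0.0, []
    n = len(afps)
    # Any valid predecessor m of k starts strictly before k in protein 1,
    # so processing AFPs in sorted order guarantees S[m] is already known.
    order = sorted(range(n), key=lambda i: afps[i])
    S = [0.0] * n      # S(k)
    T = [0] * n        # T(k)
    prev = [-1] * n    # back-pointer; -1 means the chain starts at k

    for pos, k in enumerate(order):
        b1k, b2k = afps[k]
        best_gain, best_m, best_t = 0.0, -1, 0    # the "0" branch of Eq. 2
        for m in order[:pos]:
            e1m = afps[m][0] + length - 1         # assumed end position e1(m)
            e2m = afps[m][1] + length - 1         # assumed end position e2(m)
            if e1m >= b1k or e2m >= b2k:
                continue                          # m cannot precede k
            twists = T[m] + (1 if needs_twist(m, k) else 0)   # Eq. 3
            if twists > max_twists:
                continue                          # would violate T(k) <= t
            gain = S[m] + connect_score(m, k)
            if gain > best_gain:
                best_gain, best_m, best_t = gain, m, twists
        S[k] = afp_score(k) + best_gain
        T[k], prev[k] = best_t, best_m

    end = max(range(n), key=lambda i: S[i])
    chain = []
    k = end
    while k != -1:
        chain.append(k)
        k = prev[k]
    return S[end], chain[::-1]


# Toy input: three AFPs of length 8; connecting AFP 1 to AFP 2 needs a twist.
toy = dict(
    afps=[(0, 0), (10, 10), (20, 25)],
    length=8,
    afp_score=lambda k: 24.0,
    connect_score=lambda m, k: -1.0,
    needs_twist=lambda m, k: (m, k) == (1, 2),
)
print(chain_afps(max_twists=1, **toy))   # (70.0, [0, 1, 2])
print(chain_afps(max_twists=0, **toy))   # score 47.0, a two-AFP chain
```

Like the recurrence as written, the sketch keeps a single best-scoring chain per AFP, so a lower-scoring chain with fewer twists ending at the same AFP is not remembered. The double loop is quadratic in the number of AFPs.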
The score of an AFP $k$ is determined by its RMSD ($d_k$) and length ($L$); long AFPs are rewarded and large RMSDs are penalized,

$$a(k) = R_s L - F(d_k) \qquad (4)$$

where $R_s$ is the reward score associated with a well-aligned position and $F(d_k)$ is a function of $d_k$.
The score for connecting AFPs $m$ and $k$ is a function of the compatibility of the two AFPs and of the mismatched regions ($p$) and/or gaps ($q$) created by the connection of the two AFPs,

$$c(m \to k) = W(D_{mk})\, P_c + F(p, q) \qquad (5)$$

$$W(D_{mk}) = \begin{cases} 1 & \text{if } D_{mk} > D_c \\ \left( \dfrac{D_{mk} - D_0}{D_c - D_0} \right)^{2} & \text{else if } D_0 < D_{mk} \le D_c \\ 0 & \text{else} \end{cases} \qquad (6)$$

$$F(p, q) = M_c\, p + M_s\, q \qquad (7)$$

where $D_{mk}$ is the root mean square of the distance matrix between AFPs $m$ and $k$, as defined above; $D_c$ is the threshold for defining a twist; $D_0$ is the threshold for penalizing a connection; $P_c$ is the maximum penalty for connecting two AFPs; $M_c$ is the penalty for mismatching two positions; and $M_g$ is the penalty for a gap.
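As a worked illustration of Equations 5 to 7, the following Python sketch evaluates the connection score for one pair of AFPs. The function and argument names are hypothetical, and the two coefficients of $F(p, q)$ are named `m_mismatch` and `m_gap` rather than after any particular subscript. Two assumptions are made loudly here: penalties are passed as negative numbers (the text does not spell out the sign convention), and a twist is taken to be required exactly when $D_{mk} > D_c$, which is one reading of "the threshold for defining a twist".

```python
def twist_weight(d_mk: float, d_c: float, d_0: float) -> float:
    """W(D_mk) of Eq. 6: 0 below D_0, 1 above D_c, quadratic ramp in between."""
    if d_mk > d_c:
        return 1.0
    if d_mk > d_0:
        return ((d_mk - d_0) / (d_c - d_0)) ** 2
    return 0.0


def gap_mismatch_term(p: int, q: int, m_mismatch: float, m_gap: float) -> float:
    """F(p, q) of Eq. 7: linear in mismatched positions p and gaps q."""
    return m_mismatch * p + m_gap * q


def connection_score(d_mk: float, p: int, q: int, *, d_c: float, d_0: float,
                     p_c: float, m_mismatch: float, m_gap: float) -> float:
    """c(m -> k) of Eq. 5, evaluated literally as W(D_mk) * P_c + F(p, q)."""
    return (twist_weight(d_mk, d_c, d_0) * p_c
            + gap_mismatch_term(p, q, m_mismatch, m_gap))


def needs_twist(d_mk: float, d_c: float) -> bool:
    """Assumed twist indicator t(m -> k): a twist when D_mk exceeds D_c."""
    return d_mk > d_c


# Magnitudes follow the parameter values quoted in the Results section;
# the negative signs are an assumption of this sketch.
params = dict(d_c=5.0, d_0=1.0, p_c=-25.0, m_mismatch=-0.5, m_gap=-0.5)

# D_mk = 3.0 lies halfway between D_0 and D_c, so W = 0.5 ** 2 = 0.25.
print(twist_weight(3.0, 5.0, 1.0))                 # 0.25
print(connection_score(3.0, 2, 1, **params))       # -7.75
print(needs_twist(3.0, 5.0), needs_twist(6.0, 5.0))  # False True
```

The quadratic ramp means a connection whose $D_{mk}$ is only slightly above $D_0$ costs very little, while one approaching $D_c$ costs nearly the full $P_c$.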
Post-processing of AFP chaining
Several post-processing steps are applied after deriving the best AFP chain defined by the scoring system presented above. Additional twists are introduced into the AFP chain if its overall RMSD is larger than a fixed threshold. Unnecessary twists that do not lower the overall RMSD are removed. Finally, we apply iterative refinement of the structure alignment by dynamic programming performed on the distance matrix calculated from the two superimposed structures, as described in previous studies (Feng and Sippl, 1996; Shindyalov and Bourne, 1998; Lackner et al., 2000).
RESULTS
We implemented the FATCAT approach in C++ on a Linux platform. The running time of FATCAT for comparing a pair of protein structures on a 1.8 GHz Pentium varies from seconds to a few minutes, depending on the number of AFPs the two structures have. For instance, 42 060 AFPs were detected in comparing protein 1fmk (438 residues) with 1tki (321 residues) (the alignment result is shown below), and the whole process took 76 seconds.
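To put the AFP count in context, a short calculation gives the number of fragment pairs that could in principle be formed for this example. It is only a sketch: it assumes an AFP pairs one contiguous fragment of length $L$ from each structure, takes $L = 8$ from the parameter list given in this section, and uses the hypothetical helper name `possible_fragment_pairs`.

```python
def possible_fragment_pairs(n1: int, n2: int, length: int) -> int:
    """Number of (fragment, fragment) pairs of a given length in two chains."""
    return max(n1 - length + 1, 0) * max(n2 - length + 1, 0)


total = possible_fragment_pairs(438, 321, 8)   # 1fmk vs 1tki, L = 8
detected = 42_060                              # AFPs reported in the text
print(total)                                   # 135334
print(f"{detected / total:.1%}")               # 31.1%
```

Under these assumptions, roughly a third of all candidate fragment pairs pass the AFP detection step for this pair of structures.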
We first applied FATCAT to several alignments described as 'difficult' in the literature (Fischer et al., 1996) and compared its performance with three rigid alignment programs, DALI (Holm and Sander, 1993), VAST (Madej et al., 1995) and CE (Shindyalov and Bourne, 1998). We then compared FATCAT's performance with the results of the only other readily available flexible alignment program, FlexProt (Shatsky et al., 2002). Finally, to obtain a broader overview, we applied FATCAT to a large set of similar structures extracted from the non-redundant SCOP database (proteins are clustered at 40% sequence identity) (Murzin et al., 1995). To avoid bias from large families, we retained only one pair per family, leading to 6437 pairs of structurally similar proteins, including 854 family-level protein pairs, 3200 superfamily-level protein pairs (one representative structure per family), and 2383 fold-level protein pairs (one representative structure per superfamily). The same parameters ($t = 5$; $L = 8$; $C_t = 3.0$; $D_c = 5.0$; $D_0 = 1.0$; $R_s = 3.0$; $P_c = 25$; $M_s = 0.5$ and $M_g = 0.5$) were used in all the calculations.
Comparison with rigid structure alignment programs
FATCAT works well in aligning distantly similar protein structures, with performance comparable to that of the rigid structure alignment programs DALI, VAST and CE. In the test of 10 'difficult' examples (Fischer et al., 1996), FATCAT produced good alignments (no twists needed, similar alignment length and similar RMSD) in 8 out of 10 examples, except that 2 and 5 twists are introduced in
Y.Ye and A.Godzik
Table 1. Comparison of structure alignments of 10 'difficult' pairs of structures from Fischer et al. (1996) by different methods

                VAST         DALI         CE           FATCAT
Pro1    Pro2    Size  RMSD   Size  RMSD   Size  RMSD   Size  RMSD   Twist
1fxiA   1ubq    48    2.1    -     -      -     -      63    3.01   0
1ten    3hhrB   78    1.6    86    1.9    87    1.9    87    1.9    0
3hlaB   2rhe    -     -      63    2.5    85    3.5    79    2.81   2
2azaA   1paz    74    2.2    -     -      85    2.9    87    3.01   0
1cewI   1molA   71    1.9    81    2.3    69    1.9    83    2.44   0
1cid    2rhe    85    2.2    95    3.3    94    2.7    100   3.11   0
1crl    1ede    -     -      211   3.4    187   3.2    269   3.55   5
2sim    1nsbA   284   3.8    286   3.8    264   3.0    286   3.07   0
1bgeB   2gmfA   74    2.5    98    3.5    94    4.1    100   3.19   0
1tie    4fgf    82    1.7    108   2.0    116   2.9    117   3.05   0

The data for VAST, DALI and CE are from Shindyalov and Bourne (1998). Size, the number of aligned positions; RMSD, in Å; Twist, the number of twists introduced in FATCAT. The RMSD value in FATCAT is the overall RMSD.
comparing (3hlaB, 2rhe) and (1crl, 1ede), respectively (Table 1). In both cases, however, the FATCAT alignment is arguably better, with either a lower RMSD or a longer alignment. This result shows that FATCAT is not specifically biased to detect hinges.
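The two claims in this paragraph can be checked directly against the numbers in Table 1. The Python sketch below transcribes the table (with `None` where the table shows "-") and tests them; the data layout and the helper name `longer_or_tighter` are illustrative choices, and "better" is interpreted here as "more aligned positions or a lower RMSD than each rigid method that reported an alignment".

```python
# Each row: pair, VAST, DALI, CE, FATCAT as (size, RMSD), then FATCAT twists.
ROWS = [
    ("1fxiA/1ubq",  (48, 2.1),  None,       None,       (63, 3.01),  0),
    ("1ten/3hhrB",  (78, 1.6),  (86, 1.9),  (87, 1.9),  (87, 1.9),   0),
    ("3hlaB/2rhe",  None,       (63, 2.5),  (85, 3.5),  (79, 2.81),  2),
    ("2azaA/1paz",  (74, 2.2),  None,       (85, 2.9),  (87, 3.01),  0),
    ("1cewI/1molA", (71, 1.9),  (81, 2.3),  (69, 1.9),  (83, 2.44),  0),
    ("1cid/2rhe",   (85, 2.2),  (95, 3.3),  (94, 2.7),  (100, 3.11), 0),
    ("1crl/1ede",   None,       (211, 3.4), (187, 3.2), (269, 3.55), 5),
    ("2sim/1nsbA",  (284, 3.8), (286, 3.8), (264, 3.0), (286, 3.07), 0),
    ("1bgeB/2gmfA", (74, 2.5),  (98, 3.5),  (94, 4.1),  (100, 3.19), 0),
    ("1tie/4fgf",   (82, 1.7),  (108, 2.0), (116, 2.9), (117, 3.05), 0),
]


def longer_or_tighter(fatcat, other):
    """True if FATCAT aligns more positions or reaches a lower RMSD."""
    return fatcat[0] > other[0] or fatcat[1] < other[1]


no_twist = [row[0] for row in ROWS if row[5] == 0]
twisted = [row for row in ROWS if row[5] > 0]

# For the pairs that needed twists, compare FATCAT with every rigid method
# that reported an alignment for that pair.
checks = {
    row[0]: all(longer_or_tighter(row[4], other)
                for other in row[1:4] if other is not None)
    for row in twisted
}

print(len(no_twist))   # 8
print(checks)          # {'3hlaB/2rhe': True, '1crl/1ede': True}
```

For 3hlaB/2rhe the FATCAT alignment is longer than DALI's and has a lower RMSD than CE's; for 1crl/1ede it is longer than both.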
FATCAT clearly outperforms rigid structure alignment programs in its capability to detect hinges in protein structures. For example, in the comparison between 2spcA and 1aj3 discussed in the Introduction section, FATCAT identified a structure alignment spanning the entire length of both proteins by introducing two twists (Fig. 1), a result that is consistent with their evolutionary relationship. In contrast, rigid structure alignment programs such as CE and DALI were only able to identify short local alignments, either stopping around the hinge position (DALI) or aligning non-homologous regions (CE).
Comparison with FlexProt
As mentioned earlier, the main features of FATCAT are its ability to optimize the structure alignment and to introduce the fewest twists at the same time. Its advantage over FlexProt (Shatsky et al., 2002) is demonstrated by the examples listed in Table 2. Overall, FATCAT alignments have fewer twists but RMSDs and lengths similar to those of FlexProt alignments, suggesting that the strategy of separating hinge detection from the chaining process introduces unnecessary twists into the alignments. For instance, FATCAT created an alignment of 238 aligned positions with an overall RMSD of 3.08 Å between the human tyrosine-protein kinase C-SRC (PDB code 1fmk) and the titin protein (PDB code 1tki), whereas FlexProt was forced to introduce two hinges to get an even shorter alignment (231 aligned positions) with a higher RMSD (3.28 Å).
In the second example, the tissue factor (PDB code 1a21, chain A) was compared to the growth hormone-binding protein (PDB code 1hwg, chain C). Four hinges are detected by FlexProt, which results in a structure alignment of 163 aligned positions with an RMSD of 2.75